Complete probabilistic analysis of RNA shapes

Background Soon after the first algorithms for RNA folding became available, it was recognised that the prediction of only one energetically optimal structure is insufficient to achieve reliable results. An in-depth analysis of the folding space as a whole appeared necessary to deduce the structural properties of a given RNA molecule reliably. Folding space analysis comprises various methods such as suboptimal folding, computation of base pair probabilities, sampling procedures and abstract shape analysis. Common to many approaches is the idea of partitioning the folding space into classes of structures, for which certain properties can be derived. Results In this paper we extend the approach of abstract shape analysis. We show how to compute the accumulated probabilities of all structures that share the same shape. While this implies a complete (non-heuristic) analysis of the folding space, the computational effort depends only on the size of the shape space, which is much smaller. This approach has been integrated into the tool RNAshapes, and we apply it to various RNAs. Conclusion Analyses of conformational switches show the existence of two shapes with probabilities approximately 23 vs. 13, whereas the analysis of a microRNA precursor reveals one shape with a probability near to 1.0. Furthermore, it is shown that a shape can outperform an energetically more favourable one by achieving a higher probability. From these results, and the fact that we use a complete and exact analysis of the folding space, we conclude that this approach opens up new and promising routes for investigating and understanding RNA secondary structure.


Background
RNA secondary structure analysis is a common task in research on RNA and its manyfold functions. The first algorithm capable of computing the structure with minimum free energy (MFE) based on the nearest neighbour energy model was introduced in [1]. It was capable of calculating the MFE-structure only, and gave valuable results for short sequences. Nevertheless, it was recognised that although predicted RNA secondary structures contain, on average, 73% of known base pairs for RNA sequences divided into domains of less than 700 nucleotides [2], the predicted structures are sometimes quite different from the secondary structures obtained by comparative sequence analysis. After two decades of refined measure-ments of thermodynamic parameters, the problem persists [3], and the limited credibility of the MFE-structure is attributed to intrinsic properties of the folding space, such as its partioning into families of similar structures and the kinetics of folding [4,5].
The state of an RNA molecule must be seen as a Boltzmann ensemble of structures, some very similar, some quite distinct. The challenge of folding space analysis is to determine whether there is some family of structures in this ensemble that is internally similar, distinct from the rest, and collectively dominates the probabilities of all other families. The dominating family, if any, should be the biologically relevant one. For this reason, it is of interest to include suboptimal solutions in the process of structure elucidation. In [6] Zuker introduced an extended version of his algorithm, which was also capable of predicting certain suboptimal structures. This allows a researcher to check different predictions for correspondence to experimental results. The most recent version of the algorithm [2,7] is implemented in the MFOLD package [8].
A drawback of the Zuker algorithm is the use of heuristic filters to circumvent the repeated output of the same structure. These filters remove not only redundant structures but also similar structures. This is desirable from a human observer's point of view, but precludes a probabilistic analysis. In [9], an algorithm was introduced allowing for non-redundant and complete suboptimal folding, which is implemented in the tool RNAsubopt from the Vienna RNA package [10]. It is designed to compute rigorously all structures within a given energy range and is guaranteed not to miss any structure that is feasible with respect to the nearest neighbour energy model. The major advantage of this approach is that it gives access to all suboptimal structures, i.e. the complete folding space of an RNA sequence. However, as the number of structures is exponentially related to sequence length [11], this method produces a large number of structures, which are laborious to analyse.
The free energies of RNA structures can be imagined as a rough landscape over the folding space. The folding space is described by the notion of neighbourhood, which in the case of RNA secondary structure is the difference in exactly one base pair. A structure having only neighbours with higher free energy is a local minimum and forms the bottom of a valley. All structures that can be reached by neighbour moves (opening or closing of a base pair) while increasing the energy form a valley in the landscape and can be seen as a family of structures. A structure having neighbours in more than one valley is referred to as a saddle point in the landscape. Rephrasing our challenge in this alpine terminology, the task is not only to find the lowest point overall, but to relate the depths of valleys to their population sizes, and to determine the family of which members are most likely to be encountered when this landscape is explored. The first method (even prior to RNAsubopt) for analysing the complete folding space in order to assess the relevance of a secondary structure was introduced by McCaskill in [12]. The author makes use of the partition function to address this property. In general, the partition function provides a measure of the total number of states (structures) weighted by their individual energies at a particular temperature. For an RNA sequence and the set S of all possible structures for this sequence, it is defined as follows: where E j is the energy of structure j, R the universal gas constant (0.00198717 kcal/K) and T the temperature in Kelvin. In words, this is the sum of the Boltzmann weighted energies of all structures. The probability P of a particular secondary structure x ∈ S is defined as: where E x is the energy of structure x in kcal/mol. Intrinsic to this approach is that the probability is proportional to the (Boltzmann weighted) energy of a structure. Hence, this approach provides no further information on structural relevance. No individual structure can have a higher probability than a structure with lower free energy, and the MFE-structure is always the most probable one; albeit with an individual probability that is often very close to zero. This problem has already been stated in [12], and the author also provides a means to alleviate it. Instead of computing the probability of a complete structure, the probabilities of atomic structural elements, i.e. base pairs, are computed. Displaying these in a matrix, as squares with area proportional to the probability, results in the so called "dot plot" for base pairing probabilities. This visualisation shows all possible base pairs and allows for the detection of alternative structures with high probability.
The partition function cannot only be used to calculate the probabilities of individual structures or base pairs. In [4], Ding and Lawrence introduced a statistical sampling algorithm that is implemented in the tool SFOLD. In each step of the recursive backtracing procedure, base pairs and the structural elements they belong to are sampled according to their probabilities, obtained from the partition function. Features of the sampling procedure are that each run is likely to produce a different sample and that the same structure can be sampled multiple times, where the MFE-structure is the most frequent structure, as it has the highest probability. Nevertheless, the MFE-structure is not guaranteed to be present in the sample, especially for long sequences. The authors showed that sampling the folding space of the Spliced Leader of Leptomonas collosoma could give structures from two families. These two families, which were defined by manual alignment of the sampled structures, correspond to the alternating structures of this conformational switch. This improves over a non-probabilistic sampling procedure yielding simlar results [13,14].
Another tool analysing the complete folding space of an RNA, or part thereof, is barriers [15] by Flamm et al. It is designed to find local minima and saddle points connecting these, and in addition it generates the so-called "barrier tree" as a visualisation of the landscape. In the barrier tree, the local minima are leaves and saddle points are nodes connecting either two local minima, a local minimum and a saddle point, or two saddle points. The length of an edge corresponds to the energy difference between the connected elements.
Common to all approaches is the attempt to partition the folding space into structural families and to derive features of each such family. For the partitioning, Zuker uses structural similarity based on a distance measure, McCaskill base pairs, Ding and Lawrence similarity of sampled structures and Flamm et al. the affiliation to the same valley. One problem persists: exhaustive enumeration is slow, while sampling cannot clearly designate a dominating family.
In principle, the local minima in the folding space neighbourhoods can be taken as representatives of all the families. Unfortunately, no algorithm has been found that computes these representatives directly, i.e. without explicit enumeration of all individual structures. From a more macroscopic point of view, the notion of neighbourhood based on base pair opening and closing seems too low-level anyway: two alternative structures may both have (say) a cloverleaf shape, but not share a single base pair, and hence belong to different families. Having the same shape is therefore a stronger abstraction. It retains adjacency and nesting of stacks and hairpins. It gives us the option of regarding or disregarding the presence of bulges. And it completely abstracts from individual base pairs and their location in the sequence. This idea has been formalized in the approach of abstract shape analysis of RNA [16]. Each shape is a distinct class of structures, which has a representative structure, shrep for short, of minimal free energy within the shape. These shreps can be computed directly -avoiding the burden of exhaustive enumeration of individual structures. It has been shown that computing the k lowest-energy shreps provides useful information. Abstract shape analysis, as described in [16], can also provide precise accounts of the number of structures within each shape -but perhaps surprisingly, it does not provide the overall probability of a shape. This is the classical challenge formulated above, and a solution based on the concept of shapes will be described in this contribution.

Outline of probabilistic shape analysis
Complete probabilistic shape analysis computes, for each shape, the probability sum of the structures within that shape. While this goal is simply stated, it is more difficult and computationally more costly to achieve than simple shape analysis. Our presentation is organised as follows: We first show how to compute the shapes and Boltzmannweighted energies of individual structures, and, by analogous means, the partition function. We then combine these calculations by a programming technique called classified dynamic programming. It allows to accumulate the Boltzmann-weighted energies of all structures by shape. We then study the algorithmic efficiency of this calculation, where we find that it avoids exponential relationship to the number of structures, but is exponentially related to the number of shapes. In sharp contrast to the number of structures, the number of shapes is typically small enough to make the approach practical. Finally, we report on applications of complete probabilistic shape analysis to several types of RNA, and discuss the results.

Results
In the first three subsections, we explain the mathematical model underlying our new type of analysis. (Algorithmic details and efficiency concerns are deferred to the Methods section.) Subsequently, we report on the findings of various applications of the method.
Modelling the folding space RNA secondary structures can be represented as strings, base pair lists, graphics, and in many other forms. When they are to be analysed under different objective functions, and when pseudoknots are not involved, RNA secondary structures are most conveniently represented as trees. Trees allow the pattern of helix adjacency and nesting that characterises a secondary structure to be represented naturally. The tree-like representations presented here will not subsequently be computed. Instead, various secondary structure features will be computed (such as their free energy, shape, string representation, etc.), and the tree-like structure representations will serve as a common model for the precise and uniform definition of these derived features.
Our structure representations are rooted, ordered, labelled trees. On their leaf nodes, in left to right order, the labels spell out the primary sequence, built from nucleotides A, C, G and U. The inner nodes of the tree are labelled by operators related to the structural features of RNA: single stranded regions (SS), hairpin loops (HL), stacked base pairs that form stacking regions (SR), 5' and 3' bulges (BL and BR), internal loops (IL) and multiloops (ML). Multiloops comprise a closing base pair and a list of adjacent (AD) structure elements inside. For mathematical completeness, we also need operator E denoting an empty list of adjacent structures. In the more familiar dot-bracket notation, its representation would be ". . ( ( ( ( . ( . . . . ) . ) ) ) ) .". While the string representation is easier for us to read, the trees are mathematically more convenient. Each operator can also be seen as a function symbol, taking a fixed number of arguments of fixed types. For example, the BL operator accepts a (closing) pair of bases, say (a, b), a single stranded region l, and a closed substructure x. We can write the formula BL(a, l, x, b), which is equivalent to the tree. This interpretation of operators as function symbols that compose structures is summarized in Table 1. Our example structure is shown as a formula in Figure 1 It is important to note that not all such trees or formulas represent legal structures. For example, ML('C', HL('A', "CCC", 'U'), 'G') is valid as a formula -every operator has the right number and type of arguments -but not valid as a structure, since the notion of a multiloop implies that there are at least two closed substructures inside. Rules for building trees that are valid structures can be given in the form of a tree grammar. The mathematical appeal of a tree grammar is that, besides providing a precise definition of the folding space associated with a given RNA molecule, it can automatically be converted into a parsing algorithm, based on dynamic programming, that evaluates this folding space. For example, it can find the minimal free energy structure, or derive any other type of information that can be described in the systematic fashion introduced below.
A tree grammar for RNA secondary structures is shown in Table 2. We use an ASCII representation of grammar rules, which is both easy to read and suitable for algorithm generation. A clause such as u = f <<< x ~~~ y | | | g <<< z says that a tree of type u can be built either with operator f being applied to subtrees of type x and y, or with operator g being applied to a subtree of type z.
Our grammar has been written to exclude structures with isolated base pairs, as such "lonely pairs" can be considered not to occur in native structures. Where such lonely pairs are energetically favourable, this is probably an artefact of the energy model used (Gerhard Steger, personal communication). Optionally, however, our program RNAshapes offers calculations that include such pairs. The grammar also imposes a minimal length of 3 on the turns inside hairpin loops. Readers are invited to derive some trees with this grammar, to assure themselves that invalid structures cannot be derived.
The folding space F(s) of a sequence s is now formally defined as the set of all trees of type struct that exhibit s as their sequence of leaves. Care has been taken to ensure that the grammar is non-ambiguous: Each structure can be represented uniquely by a tree, derived by the rules of the grammar in exactly one way. Such non-ambiguity is essential for a mathematically correct probabilistic analy- Table 1: Secondary structure operators. Operators build terms by application to (sub-)terms. Operators can be interpreted in different ways with algebras, such as the Boltzmann-weighted energy algebra. In this case, terms evaluate to real numbers. Interpreting operators as mere symbols leads to symbolic terms that represent structures, (cf. also Figure 1) operator description SS(l) single-stranded region l HL(a,l,b) hairpin loop with single stranded region l, closed by basepair (a,b) SR(a,x,b) stacking region, closed by basepair (a,b); x is a closed structure BL(a,l,x,b) bulge left with single stranded region l, closed by basepair (a,b); x is a closed structure BR(a,x,l,b) bulge right with single stranded region l, closed by basepair (a,b); x is a closed structure IL(a,l,x,l',b) internal loop with single stranded regions l and l', closed by basepair (a,b); x is a closed structure ML(a,c,b) multi-loop, closed by basepair (a,b) AD(x,c) list of adjacent structures; x is a structure, c a (possibly empty) list of adjacent structures E empty list of adjacent structures sis, as has recently been analysed in [17][18][19]. Showing the non-ambiguity of a grammar is often difficult. For our grammar, we succeeded with the automatic proof technique presented in [19].
The grammar presented here is actually a simplified version of the full grammar used in our implementation. More details of the full grammar are given in the sections on algorithmics and efficiency analysis.

Deriving features from structures
As our structures are formulas, we can derive various kinds of information in a most uniform way: We simply interpret the operators as functions operating on a particular domain of interest, such as shapes, energies or strings. Such interpretations consist of one function per operator, and will be called evaluation algebras. The value resulting from the interpretation of a structure x in algebra α will be denoted x α . The convenience of this formalization is that whatever feature of interest we cast in terms of an evaluation algebra, we can be sure that the parsing algorithm that evaluates the folding space can compute this feature [20]. We will specify evaluation algebras that compute free energies, Boltzmann-weighted energies, shapes and string representations of structures. When an interpretation is given, by convention, operator names are converted to lower case and superscripted. This means, for example, that operator SR will be given many different Table 2: Basic secondary structure grammar. This grammar is a simplified version, included for illustrative purposes. The grammar that is actually used for calculating shape probabilities is larger, owing to the requirement to be unambiguous; see the discussion in paragraph "A non-ambiguous grammar with correct dangles" and Table 6. Part a) shows the grammar in its algebraic form. | | | signifies alternative right-hand sides of productions, ... h the application of choice function h, ~~~ juxtaposition of terms. <<< denotes application of the operator to its left-hand side to the arguments of its right-hand side. Operators are as in Table 1, plus ul(x) as an abbreviation for ad(x,e), str for structures, and blk for blocks. The axiom of the grammar is struct. Part b) shows the same grammar in EBNF notation, naturally without the operators to be applied.
interpretations -as function sr en , which adds the free energy increment of stack extension to the enclosed structure's energy, as sr bw , which computes the Boltzmannweighted energy of the extended substructure in the same situation, as sr π , which computes the stack extension's effect on the shape x π . of structure x, and as sr (·) , which adds another pair of brackets around the string representation of the enclosed substructure.

String representation of structures
As a simple first example, we define the interpretation x (·) that derives the dot-bracket representation of structure x. "(·)" is meant as an abbreviation for "dot-bracket representation".

Free energy
In the energy algebra, each operator, representing a structural feature, is interpreted as a function adding a certain free energy increment to the energy of its embedded substructure(s). Evaluating structure x under this interpretation yields energy value x en . Base pairs and stack extensions add stabilising (negative) energies, while most kinds of bulges and loops add destabilising (positive) energies. The concrete energy parameters have been determined experimentally [2,21,22]. We abbreviate them by δ SR , δ BL etc. Using these parameters, we can interpret SR by function sr en in the following way: (12) Energy functions for the other operators can be given in a similar way. Comparing SR(a, x, b) to its free energy interpretation sr en (a,x en , b), we see that the tree representing a structure has simply been replaced by an isomorphic formula that computes its free energy. (Technically, a and b have to be base coordinates here, not just bases, since δ SR is defined on stacked base pairs, not just single base pairs.) Under this interpretation, our example structure from Figure 1 becomes

Boltzmann-weighted energies
When computing probabilities of structures according to Eq. 2, it is convenient to defer division by Q until the very end, and compute Boltzmann-weighted energies instead, according to the equation x bw = Hence, the Boltzmann-weighted energy algebra can be derived from the free energy algebra: Again, the other functions can be derived in a similar way. Like energies summate over substructures; Boltzmannweighted energies multiply.
For our example structure, the Boltzmann-weighted energy is 774.6261 and Q = 793.9457, resulting in a probability P = 0.9756663.

Shapes
RNA abstract shapes are a generic concept. They are defined by means of abstraction functions preserving varying amounts of detail. These functions are homomorphisms from structures to another tree-like domain, preserving adjacency and nesting of substructures. For the trees representing shapes, we use 4 operators: OP ("open") represents the shape of all structures without base pairs, CL ("closed") represents a helical region, AD and E are re-used here to represent lists of adjacent (sub)shapes. We present two shape abstraction functions π5 and π3, known as level-5 and level-3 abstraction [16].
π5 maps all helices to the CL operator, abstracting from helix length and interruptions by bulges or internal loops. Except for the completely unpaired structure, no singlestranded regions at all are retained. In contrast to this, π3 introduces a new CL operator in the shape tree, whenever a helix is interrupted by a bulge or internal loop. However, the type of interruption is not recorded. Figure 2 shows a structure and the two shape trees according to π5 and π3.

Complete probabilistic shape analysis
Given the tree grammar defining valid structures, and a particular RNA sequence s, a parsing algorithm can construct the complete folding space F(s). For all x ∈ F(S), it can compute values x en , x bw , x π and x (·) , where π is one of our shape abstraction functions. We shall make use of the notion of an exploded folding space, which is a list of derived values, one for each member of F(s). Exploded folding spaces are convenient for formal definitions, but they must be avoided in actual computations, as their size increases exponentially with the length of s. This is possible with dynamic programming.
x bl a l π π π π π π π π π 5 5 3 3 3 π π π π 3 5 3 3 To explain the model underlying complete probabilistic shape analysis, we proceed in three steps. We first review simple shape analysis (as described in [16]). Then we describe the computation of the partition function as well as the structure of maximal Boltzmann weight. Finally, we combine the objectives of both types of analysis to define shape probabilities.

Simple shape analysis
Consider the list L shreps of all shape-representative structures (shreps) for s, together with their energies and shapes, and sorted on the energy component. Here, π (F(s)) is For given k, simple shape analysis computes the first k elements of this list, in O(kn 3 ) time and O(n 2 ) space.

Computing Boltzmann-weighted energies
Our new goal is to compute Boltzmann-weighted energies of shapes, defined as the accumulated Boltzman-weighted energies of all structures within a shape. We want to compute the value triple where , the sum of all Boltzmannweighted energies, and x opt ∈ F(s) is the structure of maximal Boltzmann-weighted energy.
We define a function h B , which computes B from the exploded folding space (20) by accumulation of Boltzmann-weighted energies in the first component and maximization of Boltzmannweighted energies in the second, while recording the structure of highest Boltzmann weight in its string representation in the third component. For a single structure, both energy sum and individual energy coincide, which explains why x bw occurs twice in Equation 20. Details are given in the Methods section.

Computing shape probabilities
Our ultimate goal is to compute the probabilities of all shapes in π(F(s)). To this end, we need to accumulate Boltzmann-weighted energies per shape. The accumulated weight of shape p is and the shape's probability is .
Along with the accumulated weights, we also want to compute the shapes themselves, together with their shreps and their Boltzmann-weighted energies .
(The latter will eventually be converted back to free energies in the output of our program.) Our desired result is therefore a complete list of all these values, in the form We define a function h P that computes P from the exploded folding space (23) by computing shape abstractions in the first component, and applying h B on the other components in a shape-wise fashion. Details are given in the Methods section.
We now report our findings from applications of complete probabilistic shape analysis to various types of RNA.

Transfer RNA
Applying the shape probability algorithm to the alanine tRNA of Natronobacterium pharaonis (embl:AB003409.1) gives the results shown in Figure 3.
The shape holding the MFE-structure (shape 1) shows a high probability, whereas the other shapes have probabilities below 1%. This means that this unmodified RNA is very unlikely to occur in the cloverleaf shape. This is consistent with biological knowledge and clearly expresses the need for other mechanisms, such as base modifications, to ensure that the cloverleaf structure is actually achieved.

Attenuator
The pheS-pheT-Attenuator of E. coli (embl:V00291.1/ 3682-3746) is known to switch from a translationally inactive to a translationally active conformation under specific conditions. These two conformations correspond to two valleys in the structure landscape that are separated by a saddle point (energy barrier). In terms of shape analysis, this means that two shapes should be present with reasonable probability. The corresponding experiment yields the results summarised in Figure 4. The analysis shows that Shapes 1 and 3 are rather similar with respect to their shreps, so their probabilities can be added. This means that the shape with two hairpins, which may be embedded in a multiloop, has a probability of 0.635765 and the shape with one hairpin has a probability of 0.324386. Shape 1+3 corresponds to the "off" position of Shreps of the N. pharaonis tRNA-ala Figure 3 Shreps of the N. pharaonis tRNA-ala. Shreps of the three most probable shapes of the N. pharaonis tRNA-ala together with the probabilities of the shapes (sorted by increasing energy).
, −31.7kcal/mol, P = 0.001257 P = 0.001257 P = 0.001257 the switch and the higher probability inidcates that this is its native position. The "on" position (Shape 2) is less probable, indicating the need for some external effector to trigger the switch.

Leader of ptsGHI
The ptsGHI operon in B. subtilis (gb:Y11193/1016-1107) includes the genes involved in glucose transport by the phosphotransferase system. In [23], it was shown that expression of this operon is controlled at the level of transcript elongation by a protein-dependent riboswitch. In the absence of glucose, a transcriptional terminator prevents elongation into the structural genes. In the presence of glucose, the GlcT protein is activated and binds and stabilises an alternative structure that overlaps the terminator and prevents termination. Applying RNAshapes to calculate probabilities of shapes gave the results shown in Figure 5. The first striking result of this analysis is that a shape holding an energetically less favourable shrep (shape 3) is the most probable one. Again, we can further merge shapes by looking at their shreps. Only Shreps 2 and 4 carry the antiterminator hairpin (the small hairpin at the 5'-end of shrep 3 does not alter the terminator hairpin), whereas Shreps 1 and 3 carry the terminator hairpin. Summation of the probabilities gives 0.788176 for the terminating conformations and 0.173566 for the read-through conformations. This corresponds to experimental results showing that the switch is natively in the "off" position and is triggered by the GlcT protein to enable transcript elongation.
Precursor of microRNA lin-4 microRNAs (miRNAs) are small (~22 nt) regulatory RNAs that are processed from larger precursors, for which the secondary structure is assumed to play an important role. A common feature of all known precursors is that they form a hairpin with significantly lower energy than for random sequences of the same dinucleotide distribution [24]. This suggests a well-defined secondary structure, which implies that the corresponding shape should have a very high probability. An analysis of the precursor of C. elegans lin-4 (miRBase:MI0000002) [25] reveals the shape [] with probability 0.999996, which means that only 1 in 250,000 molecules has a different shape, or that each molecule spends 99.9996% of its lifetime in the single hairpin shape. A probability cut-off of 10 -6 was used for the output, which might be the reason that no further shapes appear.
Another fact that has to be considered is that the shape abstraction might have been too strong. For this reason, we performed an analysis with abstraction level 3, retaining more structural detail, and giving the results shown in Figure 6.
All shreps are very similar, so it is reasonable to combine them in the single hairpin shape. Their probability sum is 0.999956, which except for rounding inaccuracies is the same as the probability of the single hairpin shape.

mRNA
The previous sections show how the probabilities of shapes can be used to analyse functional RNAs and their specific structural properties. But what about messenger RNA (mRNA)? As the structure of an mRNA is generally assumed to be less important, and because evolution has to ensure the correct coding of amino acids, we would expect the results to be inclonclusive. Interestingly, analyses of numerous mRNA coding sequences revealed a wide variety of results. In Figure 7 we give two examples that can be seen as extremal observations. The "expected" case is observed for ENST00000328857.1 (see Figure  FIG:CDS:PROBS:A), where we find nine shapes with remarkably high probabilities, and four cases of "overtaking", where the shape probability ranking contradicts the ranking of shreps by energy. The other extreme is observed for ENST00000326531.1 (see Figure FIG:CDS:PROBS:B), where we find a hairpin shape with probability 0.999819 inside the coding sequence. Reasons for this diversity of results may be that at least some coding sequences carry structural features necessary for correct function, or that structure, no matter whether well-defined or ill-defined, was never selected for or against.

Approximating shape probabilities via sampling
In the Methods section, we show that complete probabilistic shape analysis has a "slow" exponential term in its runtime requirements. This makes probabilistic shape analysis unfeasible for long sequences. Hence, to allow the analysis of such sequences, we combine the stochastic sampling introduced in [4] with a-posteriori shape abstraction. A sample from the structure space holds M structures together with their shapes, on which classification is performed. The probability of shape p can then be approximated by its frequency f p in the sample. The accuracy of this approach depends on the sample size M, which must be large enough to achieve statistical confidence. Assuming the counts to have a Poisson distribution with parameter Mp, we can approximate them with a normal distribution with mean Mp and variance Mp for large sample sizes. To allow 10% deviation of the estimated frequency f p within a 95% confidence interval, we require that ≤ 0.1 mean or Mp ≥ 400, or, with lspoi being the lowest shape probability of interest, M ≥ 400/lspoi. Thus, M = 1000 is sufficient to achieve reasonable results for the shapes with high probability, and this number is 2 var independent of the sequence length. This result was confirmed by empirical analyses (see Table 3).  ((((....))))))))......  (((((.................)).)))).....)))))))...))))...........  (((((....))))))))))))))

Shrep of shape 4: [[][]], −22.1kcal/mol, P = 0.073648
The remaining question is, when to use the exact algorithm and when the sampling? The running times for the two methods are summarised in Table 4 and show that for sequences up to 100 nt the exact algorithm should be used, whereas for longer sequences the sampling algorithm is favourable.
More recently, the approach of [4] was extended by a clustering step and computation of centroids for each cluster [26]. The clusters can be seen as analogous to our shapes, though without an explicit notion of abstraction.

Mathematical and algorithmic aspects
Abstract shapes are a mathematically precise, intuitively simple and non-heuristic means for partitioning the folding space into classes of structures. They enable us to derive synoptic properties of these classes such as the shreps of each class [16] or the accumulated class probabilities, as introduced here. The granularity of the partitioning can be adapted to the length of the sequence by choosing different abstraction levels. Simple shape analysis is possible with the same algorithmic complexity as standard MFE folding (O(n 3 )), while probabilistic shape analysis requires O(n 3 ·P n ), where P depends on the abstraction level chosen (see Methods). When the exponential term becomes problematic, one can switch to probabilistic sampling with subsequent shape classification. However, there remains the challenge of finding an algorithm for probabilistic analysis for the k best shapes that avoids the O(P n ) factor.
Given such satisfactory mathematical and algorithmic properties, the question of whether shapes constitute an abstraction that is also biologically meaningful must not be overlooked.

Biological adequacy of shapes
Our idea is that a shape class comprises similar structures that can potentially perform the same function -with the consequence that looking at the shreps and their shape probabilities gives us a precise and complete account of a molecule's functional potential. This incurs the risk of overlooking an important feature when shapes exhibit substantial internal variation, while only their shrep is submitted for further scrutiny.
We address this concern about variation within shape by two examples. In Figure 8, we show three extremal members of shape [] for the C. elegans lin-4 miRNA precursor.

Conclusion
Probabilistic shape analysis allows us to approach the question of the significance of predicted structures from a new angle. We take the view that in the evolution of structure, competition comes not so much from sequence variation (akin to randomizing effects) as from alternative structures within the folding space. A different structure can come to dominate the native one because of very few mutational events, with the effect that a functional RNA becomes nonfunctional. Hence, evolution should strive to make the shape of a functional structure clearly dominate over other shapes. This not only stabilizes this structure in the overall Boltzmann ensemble of the native sequence, but also protects against immediate loss of function because of a small number of mutations, so that there is time (on an evolutionary scale) to restabilize the structure by compensatory mutations, which we often observe.
This line of thought is consistent with our observations reported here, but it is by no means proven. An observation like that for ENST00000326531.1 calls for further statistical or experimental work, to disprove or prove the biological relevance of a structural motif with shape probability 0.999819 inside this coding sequence. We propose that dominance of shape should be further investigated as a measure of structural significance.

Algorithmics and efficiency analysis
In this section we are concerned with two algorithmic problems: 1. We define our objective functions h B and h P and show that they can be implemented via dynamic programming.
2. We analyse and discuss the asymptotic efficiency of complete probabilistic shape analysis achieved by this implementation.
We shall define our objective functions h B and h P as evaluators of an exploded folding space.
Subsequently, we will show that they can be computed efficiently via dynamic programming because they satisfy the condition known as Bellman's Principle. Thus, we start with a discussion of this principle.

Notation
We define objective functions on lists. The empty list is denoted as []. [x 1 , ..., x n ] is the list of n elements x 1 to x n where n can also be zero, the list thus being empty. In expressions like [...|x 1 ← z 1 ,...,x n ← z n ] the leftarrow ← denotes list membership. The operator concatenates two lists. The function map applies a function elementwise to a list. The function concat concatenates a list of lists into a single list. As above, features that can be derived from a structure x are its free energy x en , its Boltzmann weight x bw and its notation as a "Vienna string" x (·) . According to their subscripts, Boltzmann weights can be summed-up ( ) or optimised ( ). is the Vienna string of an energetically optimal (sub-)structure x. The shape of structure x is denoted by x π .

Bellman's Principle
Bellman's Principle [32] captures the essence of dynamic programming. It states the conditions under which a search space can be pruned by applying an objective function at each intermediate step, whenever a list of alternative intermediate results has been obtained. Solutions to sub-optimal subproblems can be discarded, yet the overall optimal solution will be found.
Bellman's Principle can be formally captured in the following two equations (cf. [20]):   (25) where the z i denote lists of intermediate results, ← denotes list membership, and denotes list concatenation. In our context, functions f will be sr en , sr bw and so on.
It is not generally true that a function that computes the desired result from the exploded search space will also produce this result when applied at intermediate steps in a dynamic programming algorithm. But it is easy to see that the conditions 24 and 25 together guarantee that this is in fact the case.

Definition and justification of objective function h B
Function h b computes B = ( , , ) from the exploded folding space Equations 19 and 20).

Definition of h B
where s i = ( , , ), is a sum of Boltzmannweighted energies, is the optimal Boltzmannweighted energy seen so far, and x (·) is the string representation of a structure with this Boltzmann-weighted energy; likewise for y. Thus, this choice function sums up Boltzmann-weighted energies of structures and keeps track of the structure that has the optimal energy. The result list has at most one element.
The function h not only extracts the desired information from the exploded folding space, it also satisfies Bellman's Principle of Optimality, as will be shown next.

Justification of objective function h B
Equation 25 is easy to verify: every is added once and the optimal structure is determined by looking at each one on the left-hand side once and on the right-hand side twice, which gives the same result. For Equation 24, we have to differentiate between three kinds of operators: the first is of the form f = c, meaning it is a constant that does not depend on sub-solutions (operators SS, HL and E); the second is of the form f = a·x, where a scalar a is multiplied with an evaluated sub-solution x (operators SR, BL, BR, IL, and ML); the third is of the form f = a·x 1 ·x 2 , where a scalar a and the values of two sub-solutions are multiplied (operator AD). The first case (f = c) is trivial, since we do not have any answer lists z, and h is thus applied only once. In the second case ( f = a·x), we sum up terms using a·x 1 + a·x 2 , which is equal to a·( x 1 + x 2 ), or choose the maximum using max ( a·x 1 , a·x 2 ) which is equal to a·max ( x 1 , x 2 ). In the third case ( f = a·x 1 ·x 2 ), we sum up terms using ), which is equal to a·( )·( ), or choose the maximum using max ( ), which is equal to a·max ( )·max ( ).

Definition of objective function h P
Using the previous objective function h B , we define the classifying objective function h P as: h P (z) = concat (map(h B , split(z))) (27) where map applies function h B elementwise to a list, map (f, [a 1 ,..., a r ]) = [f(a 1 ),..., f(a r )] (28) and concat concatenates a list of lists into a single list.
Function split splits answer list z into sublists, one for each occuring shape: where cs is a list of classes (shape-attributed lists of data), x s and y s are attributes (shapes), and x and y are the data to be classified.   since on average it has to run through half the list elements. This can be remedied by using a more efficient data structure, e.g. a hash of which the keys are shapes.
Given this definition, it is clear that computes the desired information from the exploded folding space L sh .
That this can also be achieved efficiently via dynamic programming needs yet to be established.

Dynamic programming with classification
By dynamic programming with classification we refer to an analysis that splits up the search space into (disjoint) classes, and applies some objective function class-wise. The classes are not predefined, but arise from information that is also derived from the search space. Almost certainly, this type of problem has arisen before in the 50 year history of dynamic programming. To the best of our knowledge, however, it has not been described as a generic method.
In DP with classification, we compute a classification attribute together with each score value, such that scores can be attributed to a particular class of (sub)solutions. The classification attribute is derived by interpreting solutions in a particular algebra, in our context the shape algebra. We then apply the optimization objective h separately for each class, returning, for example, the k best scores for each solution class. To this end, we must provide a classification function for each f i , such that we can compute (attribute/score) pairs: ( ×f i ) ( , a), ( ,b)) = ( ( , ), f (a, b)). The function applies h separately on each class and is defined (consistent with our earlier definition) as (x) = concat (map ( h, split (x))) (31) where split separates a list of attribute/value pairs into sublists, one for each attribute occurring, (map (h, z)) applies h to each sublist of z, and concat rejoins sublists.
We show that satisfies Bellman's Principle when h does. This means that classification can be applied with any DP algorithm.

Theorem
When h satisfies Bellman's Principle, and is defined as above, then also satisfies Bellman's Principle, i.e. we have for each list xs, ys of attribute/score pairs and all evaluation functions f:   Table 6: The full unambiguous grammar in EBNF notation. This is the full unambiguous grammar in EBNF notation. Note that dangling bases are not represented explicitly by a special terminal symbol, but as a 'base'. Their dangling property is accounted for by the derivation path, e.g. the secondary structure . ( (...) ). for sequence 'ACCUAUGGG' will be derived as struct → left_dangle → edanglelr left_dangle → base initstem base left_dangle → base initstem base empty. The two unpaired bases in 'base initstem base' have been derived via ' edanglelr', which accounts for their dangling property. We do not give an explicit representation for dangling bases, as the need to derive them explicitly is due to the energy model and not to handling them as discrete structural elements, i.e. a dangling base is nothing more than an unpaired base next to a stem, but it has a non-positive energy contribution that cannot be neglected.

Why probabilistic shape analysis is expensive
Our algorithm computes accumulated probabilities (or Boltzman-weighted energies) for all shapes. Since the number of shapes grows exponentially with sequence length n, runtime is (n 3 P n ). It was determined empirically for shape level 5 in [16] that P ≈ 1.1. P is somewhat larger for less abstract shapes, see Table 5. By contrast, MFE folding runs in (n 3 ), and computing the k best shape representatives takes (n 3 ·k), without incurring an exponential factor.
Since a small number of top-ranking shapes can be assumed to reveal all information relevant in applications, it seems plausible to compute probabilities for the best k shapes only, achieving runtime O(kn 3 ) and avoiding the (albeit slow) exponential factor. Interestingly, this seems impossible -at least with the techniques used so far in simple and probabilistic shape analysis. The reason is that an objective function that maximizes accumulated shape probabilities would not satisfy Condition 2 (Eq. 25) of Bellman's Principle. An example suffices to demonstrate this. As optimal choice does not distribute over combinations of alternatives, Bellman's Principle is violated. However, accumulating scores for all shapes (without a choice of an optimal shape) is correct -at the extra cost of a slow exponential term in the runtime asymptotics.
This consideration applies to all schemes that accumulate scores over shapes. For example, if we score each structure simply by 1, the accumulated score is just the size of each shape's membership. This means that we have no polynomial time algorithm that determines the k largest shapes.

Further implementation details A non-ambiguous grammar with correct dangles
In the previous sections, we used the grammar attributed to Wuchty [9] to describe the general idea on how to calculate the probabilities of shapes. For expository reasons, the grammar presented has been simplified, while a faithful implementation of the current energy model requires more sophistication, in Wuchty's program as well as in ours.
A problem with Wuchty's full grammar is that it handles dangling bases in a simplified way, meaning that the grammar does not explicitly derive dangling bases. Instead, the free energy increment is added by the algebra irrespective of whether the bases are actually dangling. In general, this leads to lower free energies, which the developers see as an approximation for coaxial stacking. The effect of this inaccuracy would be that the partition function calculation and therefore the derived probabilities  Table 6: The full unambiguous grammar in EBNF notation. This is the full unambiguous grammar in EBNF notation. Note that dangling bases are not represented explicitly by a special terminal symbol, but as a 'base'. Their dangling property is accounted for by the derivation path, e.g. the secondary structure . ( (...) ). for sequence 'ACCUAUGGG' will be derived as struct → left_dangle → edanglelr left_dangle → base initstem base left_dangle → base initstem base empty. The two unpaired bases in 'base initstem base' have been derived via ' edanglelr', which accounts for their dangling property. We do not give an explicit representation for dangling bases, as the need to derive them explicitly is due to the energy model and not to handling them as discrete structural elements, i.e. a dangling base is nothing more than an unpaired base next to a stem, but it has a non-positive energy contribution that cannot be neglected. (Continued) also show inaccuracies. We therefore modified this grammar in such a way that it handles dangling bases explicitly and unambiguously. This is achieved by imposing the following rules during grammar design, which apply to the external loop and to multiloops: (1) An unpaired base (singlestrand or dangling base) has to be followed by an unpaired base, either a singlestrand or a structural element with a dangling base. (2) A structural element without a dangling base on the 3'-end has to be followed by a structural element without a dangling base on the 5'-end.
(3) We explicitly handle two structural elements with one unpaired base in between, to be able to decide which of the two possible dangling contributions is energetically favourable.
With the grammar designed in this way, we are able to derive all feasible secondary structures unambiguously with their correct energies and, therefore, also the correct partition function.

Practical efficiency
The nonambiguous, correct-dangle grammar has been implemented together with the algebras described in the previous section, yielding an algorithm for the exact calculation of probabilities for shapes. Application of this algorithm to various RNA sequences, of which some are shown in the Results Section, showed that the time and space requirements are quite moderate, allowing us to analyse sequences up to length 120 (with about 40 000 shapes) within 5 minutes on an Intel Xeon 2.8 GHz CPU.

Shape probabilities based on sampling
To be able to analyse longer sequences, we took up the idea of stochastic sampling introduced in [4]. This is achieved by changing the objective function h in the bw algebra from summation to picking one element randomly according to its Boltzmann weighted energy. This version can be used to draw samples (structures together with their shapes) of the structure space according to the Boltzmann distribution. Based on the shapes, the sample is partitioned into similarity classes and the frequency of each shape is computed. If the sample size is large enough, these shape frequencies come very close to the exact probabilities and can therefore be used instead.
We use our complete probabilistic shape analysis on moderate length sequences to evaluate how well the sample probabilities approximate the true ones with growing sample size, and relate the computational efforts required. The Results Section summarizes our observations using both algorithms.  [33].