Nucleic acid structure prediction: Difference between revisions

Content deleted Content added

Inline

Revision as of 12:06, 7 April 2010

RNA structure prediction is a computational method to determine RNA secondary and tertiary structure from its sequence. Secondary RNA structure can be predicted from a single or from several RNA sequences. Tertiary RNA structure can be predicted from the RNA sequence, or by comparative modeling (when structure of homologous sequence is known).

The problems of predicting RNA secondary structure is dependent mainly on base pairing and base stacking interactions; many RNA molecules have several possible three-dimensional structures, so predicting these structures remains out of reach unless obvious sequence and functional similarity to a known class of RNA molecules, such as transfer RNA or microRNA, is observed. Many RNA secondary structure prediction methods rely on variations of dynamic programming and therefore are unable to efficiently identify pseudoknots.

Single sequence structure prediction

A common problem for researchers working with RNA is to determine the three-dimensional structure of the molecule given just the nucleic acid sequence. However, in the case of RNA much of the final structure is determined by the secondary structure or intra-molecular base-pairing interactions of the molecule. This is shown by the high conservation of base-pairings across diverse species.

The most stable structure

Secondary structure of small RNA molecules is largely determined by strong, local interactions such as hydrogen bonds and base stacking. Summing the free energy for such interactions should provide an approximation for the stability of given structure. To predict the folding free energy of a given secondary structure, an empirical nearest-neighbor model is used. In the nearest neighbor model the free energy change for each motif depends on the sequence of the motif and of its closest base-pairs.^[1] The model and parameters of minimal energy for Watson–Crick pairs, GU pairs and loop regions were derived from empirical calorimetric experiments, the most up-to-date parameters were published in 2004,^[2] although most software packages use the previous set assembled in 1999.^[3]

The simplest way to find the lowest free energy structure would be to generate all possible structures and calculate the free energy for it, but the number of possible structures for a sequence increases exponentially with the length of RNA (Number of secondary structures = (1,8)^N , N- number of nucleotides).^[4] For longer RNA molecules, the number of possible secondary structures is enormous huge (For example sequence of 100 nucleotide, has more than 1025 possible secondary structures), and complicates finding of the most stable structure a lot.^[1]

Dynamic programming algorithms

The ﬁrst and the most popular method to find the most stable structure is a dynamic programming algorithm.^[5]^[6] One of the first attempts to predict RNA secondary structure was made by Ruth Nussinov and co-workers who used the dynamic programming method for maximising the number of base-pairs.^[5] However, there are several issues with this approach, most importantly the solution is not unique. Nussinov et al. published an adaptation of their approach to use a simple nearest-neighbour energy model in 1980.^[6] Michael Zuker and Patrick Stiegler in 1981 proposed using a slightly refined dynamic programming approach that models nearest neighbour energy interactions that directly incorporates stacking into the prediction.^[7]

Dynamic programming algorithms provide a possibility implicitly check all variants of possible secondary structures of RNA molecule without explicitly generating the structures. Firstly, the lowest conformational free energy is determined for each possible sequence fragment starting with the shortest fragments and then for longer fragments. For longer fragments, recursion on the optimal free energy changes determined for shorter sequences speeds the determination of the lowest folding free energy. When the lowest free energy of complete sequence is calculated, on the next step the exact structure of RNA molecule is determined.^[1]

Dynamic programming algorithms are commonly used to detect base pairing patterns that are "well-nested", that is, form hydrogen bonds only to bases that do not overlap one another in sequence position. Secondary structures that fall into this category include double helices, stem-loops, and variants of the "cloverleaf" pattern found in transfer RNA molecules. These methods rely on precalculated parameters estimating the free energy associated with particular types of base-pairing interactions, including Watson-Crick and Hoogsteen base pairs. Depending on the complexity of the method, single base pairs may be considered, or short two- or three-base segments to incorporate the effects of base stacking. This method cannot identify pseudoknots, which are not well nested, without substantial algorithmic modifications that are extremely computationally expensive.^[8]

Suboptimal structures

The accuracy of RNA secondary structure prediction from single sequence by free energy minimization is limited by several factors:

The free energy value's list in nearest neighbor model is incomplete
Not all known RNA folds in to conformations with the thermodynamic minimum.
Some RNA sequences have more than one biologically active conformation (e. i. Riboswitches)

For this reason, ability to predict structures which have similar low free energy would provide significant information. Such kit of structures are termed suboptimal structures. The best known program generating suboptimal structures – mfold.^[9]

Predicting pseaudoknots

One of the issues when predicting RNA secondary structure is that the standard free energy minimization and statistical sampling methods can not find pseudoknots.^[3] The major problem is that usual dynamic programing algorithms, when predicting secondary structure, consider only to the interactions between the closest nucleotides, while pseudocnoted structures are formed due to interactions between distant nucleotides. Rivas and Eddy published a dynamic programming algorithm that could handle pseudoknots.^[8] Hovever, this dynamic programming algorithm is very slow. If the standard dynamic programming algorithm for free energy minimization scales O(N³) in time (N is the number of nucleotides in the sequence), a dynamic programming algorithm of Rivas and Eddy, capable to predict most of known pseudoknot types, scales as bad as O(N⁶) in time. This has prompted several researchers to implement versions of the algorithm that restrict classes of pseudoknots, resulting in performance gains. For example, pknotsRG tool includes only the class of simple recursive pseudocnots and scales O(N4) in time.^[10]

Other approaches for RNA secondary structure prediction

Another approach for RNA secondary structure determination is to sample structures from the Boltzmann ensemble,^[11]^[12] is exemplified by Sfold program. The program generates a statistical sample of all possible RNA secondary structures. The algorithm samples secondary structures according to the Boltzmann distribution. The sampling method offers an appealing solution to the problem of uncertainties in folding.^[12]

Comparative secondary structure prediction

File:Yeast.png

S. cerevisiae tRNA-PHE structure space: the energies and structures were calculated using RNAsubopt and the structure distances computed using RNAdistance.

Sequence covariation methods rely on the existence of a data set composed of multiple homologous RNA sequences with related but dissimilar sequences. These methods analyze the covariation of individual base sites in evolution; maintenance at two widely separated sites of a pair of base-pairing nucleotides indicates the presence of a structurally required hydrogen bond between those positions. The general problem of pseudoknot prediction has been shown to be NP-complete.^[13]

In general, the problem of alignment and consensus structure prediction are closely related. Three different approaches to the prediction of consensus structure can be distinguished:^[14]

Folding of alignment
Simultaneous sequence alignment and folding
Alignment of predicted structures

Align then fold

A practical heuristic approach is to use multiple sequence alignment tools to produce an alignment of several RNA sequences, to find consensus sequence and then fold it. The quality of the alignment determines the accuracy of the consensus structure model. Consensus sequences are folded using various approaches similarly as in individual structure prediction problem. The thermodynamic folding approach is exemplified by RNAalifold program.^[15] The different approaches are exemplified by Pfold and ILM programs. Pfold program implements a SCFGs.^[16] ILM (iterated loop matching) unlike the other algorithms for folding of alignments, can return pseudocnoted structures. It uses combination of thermodynamics and mutual information content scores.^[17]

Align and fold

Evolution frequently preserves functional RNA structure better than RNA sequence.^[15] Hence, a common biological problem is to infer a common structure for two or more highly diverged but homologous RNA sequences. In practice, sequence alignments become unsuitable and do not help to improve the accuracy of structure prediction, when sequence similarity of two sequences is less then 50%.^[18]

Structure-based alignment programs improves the performance of these alignments and most of them are variants of the Sankoff algorithm ^[19]. Basically, Sankoff algorithm is a merger of sequence alignment and Nussinov ^[5] (maximal-pairing) folding dynamic programming method.^[20] Sankoff algorithm itself is a theoretical exercise because it requires extreme computational resources ( (O^(n3m) in time, and O^(n2m) in space, where n is the sequence length and m is the number of sequences). Some notable attempts at implementing restricted versions of Sankoff's algorithm are Foldalign,^[21]^[22] Dynalign,^[23]^[24] PMmulti/PMcomp,^[20] Stemloc,^[25] and Murlet.^[26] In these implementations the maximal length of alignment or variants of possible consensus structures are restricted. For example, Foldalign focuses on local alignments and restricts the possible length of the sequences alignment.

Fold then align

A less widely used approach is to fold the sequences using single sequence structure prediction methods and align the resulting structures using tree-based metrics.^[27] The fundamental weakness with this approach is that single sequence predictions are often inaccurate, thus all further analyses are affected.

Tertiary structure prediction

Once secondary structure of RNA is known, the next challenge is to predict tertiary structure. The biggest problem is to determine the structure of regions between double stranded helical regions. Also RNA molecules often contain posttranscriptionally modified nucleosides, which because of new possible non-canonical interactions, cause a lot of troubles for tertiary structure prediction.^[28]^[29]^[30]^[31]

References

^ ^a ^b ^c Mathews, D.H. Revolutions in RNA secondary structure prediction. J. Mol. Biol 359, 526-532(2006).
^ Mathews DH, Disney MD, Childs JL, Schroeder SJ, Zuker M, Turner DH. (2004) Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proceedings of the National Academy of Sciences USA. 101:7287-7292.
^ ^a ^b Mathews DH, Sabina J, Zuker M, Turner DH (1999) Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J Mol Biol. 288(5):911-40.
^ Zuker, M., Sankoff, D. (1984) RNA secondary structures and their prediction. Bull. Math. Biol. 46,591–621.
^ ^a ^b ^c Nussinov R, Piecznik G, Grigg JR and Kleitman DJ (1978) Algorithms for loop matchings. SIAM Journal on Applied Mathematics.
^ ^a ^b Nussinov R and Jacobson AB (1980) Fast algorithm for predicting the secondary structure of single-stranded RNA. Proc Natl Acad Sci U S A. 77(11):6309-13.
^ Zuker M, Stiegler P. (1981) Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. Jan 10;9(1):133-48.
^ ^a ^b Rivas E, Eddy SR. (1999) A dynamic programming algorithm for RNA structure prediction including pseudoknots. J Mol Biol. 285(5):2053-68.
^ Zuker, M. (2003) Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Research, 31(13), 3406-3415.
^ Reeder, J. & Giegerich, R. (2004) Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics. BMC Bioinformatics, 5, 104.
^ McCaskill JS. (1990) The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers. 29(6-7):1105-19.
^ ^a ^b Ding Y, Lawrence CE. (2003) A statistical sampling algorithm for RNA secondary structure prediction. Nucleic Acids Res. 31(24):7280-301.
^ Lyngsø RB, Pedersen CN. (2000). RNA pseudoknot prediction in energy-based models. J Comput Biol 7(3-4): 409-427.
^ Gardner, P.P. & Giegerich, R .(2004) A comprehensive comparison of comparative RNA structure prediction approaches. BMC Bioinformatics, 5, 140.
^ ^a ^b Hofacker IL, Fekete M, Stadler PF. (2002) Secondary structure prediction for aligned RNA sequences. J Mol Biol. 319(5):1059-66.
^ Knudsen B, Hein J. (2003) Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Res. 31(13):3423-8.
^ Ruan, J., Stormo, G.D. & Zhang, W. (2004) ILM: a web server for predicting RNA secondary structures with pseudoknots. Nucleic Acids Research, 32(Web Server issue), W146-149.
^ Bernhart SH, Hofacker IL (2009). "From consensus structure prediction to RNA gene finding". Brief Funct Genomic Proteomic. 8 (6): 461–71. PMID 19833701.
^ Sankoff D (1985) Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM Journal on Applied Mathematics. 45:810-825.
^ ^a ^b Hofacker IL, Bernhart SH, Stadler PF. (2004) Alignment of RNA base pairing probability matrices. Bioinformatics. 20(14):2222-7.
^ Havgaard JH, Lyngso RB, Stormo GD, Gorodkin J. (2005) Pairwise local structural alignment of RNA sequences with sequence similarity less than 40%. Bioinformatics. 21(9):1815-24.
^ Torarinsson E, Havgaard JH, Gorodkin J. (2007) Multiple structural alignment and clustering of RNA sequences. Bioinformatics.
^ Mathews DH, Turner DH. (2002) Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. J Mol Biol. 317(2):191-203.
^ Harmanci AO, Sharma G, Mathews DH, (2007), Efficient Pairwise RNA Structure Prediction Using Probabilistic Alignment Constraints in Dynalign, BMC Bioinformatics, 8(130).
^ Holmes I. (2005) Accelerated probabilistic inference of RNA structure evolution. BMC Bioinformatics. 2005 Mar 24;6:73.
^ Kiryu H, Tabei Y, Kin T, Asai K (2007) Murlet: A practical multiple alignment tool for structural RNA sequences. Bioinformatics. 23(13):1588-1598.
^ Shapiro BA and Zhang K (1990) Comparing Multiple RNA Secondary Structures Using Tree Comparisons Computer Applications in the Biosciences, vol. 6, no. 4, pp. 309–318.
^ Shapiro BA, Yingling YG, Kasprzak W, Bindewald E. (2007) Bridging the gap in RNA structure prediction. Curr Opin Struct Biol.
^ Major F, Turcotte M, Gautheret D, Lapalme G, Fillion E, Cedergren R. The combination of symbolic and numerical computation for three-dimensional modeling of RNA. Science. 1991 Sep 13;253(5025):1255-60.
^ Major F, Gautheret D, Cedergren R. Reproducing the three-dimensional structure of a tRNA molecule from structural constraints. Proc Natl Acad Sci U S A. 1993 Oct 15;90(20):9408-12.
^ Frellsen J, Moltke I, Thiim M, Mardia KV, Ferkinghoff-Borg J, Hamelryck T (2009). "A probabilistic model of RNA conformational space". PLoS Comput Biol. 5 (6): e1000406. PMID 19543381.{{cite journal}}: CS1 maint: multiple names: authors list (link)

@@ Line 62: / Line 62: @@
 ==Tertiary structure prediction==
-Once secondary structure of RNA is known, the next challenge is to predict [[RNA tertiary structure|tertiary structure]]. The biggest problem is to determine the structure of regions between double stranded helical regions. Also RNA molecules often contain posttranscriptionally modified nucleosides, which because of new possible non-canonical interactions, cause a lot of troubles for tertiary structure prediction.<ref name="Shapiro07"/><ref name="Major91"/><ref name="Major93"/>
+Once secondary structure of RNA is known, the next challenge is to predict [[RNA tertiary structure|tertiary structure]]. The biggest problem is to determine the structure of regions between double stranded helical regions. Also RNA molecules often contain posttranscriptionally modified nucleosides, which because of new possible non-canonical interactions, cause a lot of troubles for tertiary structure prediction.<ref name="Shapiro07"/><ref name="Major91"/><ref name="Major93"/><ref name="pmid19543381">{{cite journal |author=Frellsen J, Moltke I, Thiim M, Mardia KV, Ferkinghoff-Borg J, Hamelryck T |title=A probabilistic model of RNA conformational space. |journal=PLoS Comput Biol |volume=5 |issue=6 |pages=e1000406 |year=2009 |pmid=19543381}}</ref>
 <!--
 ===Tertiary structure from a sequence===