Direct coupling analysis: Difference between revisions

BG19bot (talk | contribs)
m WP:CHECKWIKI error fix. Syntax fixes. Do general fixes if a problem exists. -
switch refs to citedoi template, other minor ce
Line 1: Line 1:
'''Direct coupling analysis''' or '''DCA''' is an umbrella term comprising several methods for analyzing sequence data in [[computational biology]].<ref name="morcos_2011" /> The common idea of these methods is to use [[statistical modeling]] to quantify the strength of the direct relationship between two positions of a [[Sequence (biology)|biological sequence]], excluding effects from other positions. This contrasts with usual measures of [[correlation]], which can be large [[Correlation and dependence|even if there is no direct relationship between the positions]] (hence the name ''direct'' coupling analysis). Such a direct relationship can, for example, be the [[evolutionary pressure]] for two positions to maintain mutual compatibility in the [[biomolecular structure]] of the sequence, leading to [[molecular coevolution]] between the two positions.
'''Direct coupling analysis''' or '''DCA''' is an umbrella term comprising several methods for analyzing sequence data in [[computational biology]].<ref name="morcos_2011">{{cite journal|last1=Morcos|first1=F.|last2=Pagnani|first2=A.|last3=Lunt|first3=B.|last4=Bertolino|first4=A.|last5=Marks|first5=D. S.|last6=Sander|first6=C.|last7=Zecchina|first7=R.|last8=Onuchic|first8=J. N.|last9=Hwa|first9=T.|last10=Weigt|first10=M.|title=Direct-coupling analysis of residue coevolution captures native contacts across many protein families|journal=Proceedings of the National Academy of Sciences|date=21 November 2011|volume=108|issue=49|pages=E1293–E1301|doi=10.1073/pnas.1111471108|accessdate=5 September 2016}}</ref> The common idea of these methods is to use [[statistical modeling]] to quantify the strength of the direct relationship between two positions of a [[Sequence (biology)|biological sequence]], excluding effects from other positions. This contrasts with usual measures of [[correlation]], which can be large [[Correlation and dependence|even if there is no direct relationship between the positions]] (hence the name ''direct'' coupling analysis). Such a direct relationship can, for example, be the [[evolutionary pressure]] for two positions to maintain mutual compatibility in the [[biomolecular structure]] of the sequence, leading to [[molecular coevolution]] between the two positions.
DCA has been used in the inference of [[Protein contact map|protein residue contacts]],<ref name="morcos_2011" /><ref name="kamisetty_2013">{{cite journal|last1=Kamisetty|first1=H.|last2=Ovchinnikov|first2=S.|last3=Baker|first3=D.|title=Assessing the utility of coevolution-based residue-residue contact predictions in a sequence- and structure-rich era|journal=Proceedings of the National Academy of Sciences|date=5 September 2013|volume=110|issue=39|pages=15674–15679|doi=10.1073/pnas.1314045110|accessdate=5 September 2016}}</ref><ref name="aurell_2013">{{cite journal|last1=Ekeberg|first1=Magnus|last2=Lövkvist|first2=Cecilia|last3=Lan|first3=Yueheng|last4=Weigt|first4=Martin|last5=Aurell|first5=Erik|title=Improved contact prediction in proteins: Using pseudolikelihoods to infer Potts models|journal=Physical Review E|date=11 January 2013|volume=87|issue=1|doi=10.1103/PhysRevE.87.012707|accessdate=5 September 2016}}</ref><ref name="marks_2011">{{cite journal|last1=Marks|first1=Debora S.|last2=Colwell|first2=Lucy J.|last3=Sheridan|first3=Robert|last4=Hopf|first4=Thomas A.|last5=Pagnani|first5=Andrea|last6=Zecchina|first6=Riccardo|last7=Sander|first7=Chris|last8=Sali|first8=Andrej|title=Protein 3D Structure Computed from Evolutionary Sequence Variation|journal=PLoS ONE|date=7 December 2011|volume=6|issue=12|pages=e28766|doi=10.1371/journal.pone.0028766|accessdate=5 September 2016}}</ref> [[RNA#Structure|RNA structure prediction]],<ref name="leonardis_2015">{{cite journal|last1=De Leonardis|first1=Eleonora|last2=Lutz|first2=Benjamin|last3=Ratz|first3=Sebastian|last4=Cocco|first4=Simona|last5=Monasson|first5=Rémi|last6=Schug|first6=Alexander|last7=Weigt|first7=Martin|title=Direct-Coupling Analysis of nucleotide coevolution facilitates RNA secondary and tertiary structure prediction|journal=Nucleic Acids Research|date=29 September 2015|pages=gkv932|doi=10.1093/nar/gkv932|accessdate=5 September 2016}}</ref><ref name="weinreb_2016">{{cite journal|last1=Weinreb|first1=Caleb|last2=Riesselman|first2=Adam J.|last3=Ingraham|first3=John B.|last4=Gross|first4=Torsten|last5=Sander|first5=Chris|last6=Marks|first6=Debora S.|title=3D RNA and Functional Interactions from Evolutionary Couplings|journal=Cell|date=May 2016|volume=165|issue=4|pages=963–975|doi=10.1016/j.cell.2016.03.030|accessdate=5 September 2016}}</ref> the inference of [[Protein–protein interaction|protein-protein interaction networks]]<ref name="ovchinnikov_2014">{{cite journal|last1=Ovchinnikov|first1=Sergey|last2=Kamisetty|first2=Hetunandan|last3=Baker|first3=David|title=Robust and accurate prediction of residue–residue interactions across protein interfaces using evolutionary information|journal=eLife|date=1 May 2014|volume=3|doi=10.7554/eLife.02030|accessdate=5 September 2016}}</ref><ref name="feinauer_2016">{{cite journal|last1=Feinauer|first1=Christoph|last2=Szurmant|first2=Hendrik|last3=Weigt|first3=Martin|last4=Pagnani|first4=Andrea|last5=Keskin|first5=Ozlem|title=Inter-Protein Sequence Co-Evolution Predicts Known Physical Interactions in Bacterial Ribosomes and the Trp Operon|journal=PLOS ONE|date=16 February 2016|volume=11|issue=2|pages=e0149166|doi=10.1371/journal.pone.0149166|accessdate=5 September 2016}}</ref> and the modeling of [[fitness landscape]]s.<ref name="figliuzzi_2015">{{cite journal|last1=Figliuzzi|first1=Matteo|last2=Jacquier|first2=Hervé|last3=Schug|first3=Alexander|last4=Tenaillon|first4=Oliver|last5=Weigt|first5=Martin|title=Coevolutionary Landscape Inference and the Context-Dependence of Mutations in Beta-Lactamase TEM-1|journal=Molecular Biology and Evolution|date=January 2016|volume=33|issue=1|pages=268–280|doi=10.1093/molbev/msv211|accessdate=5 September 2016}}</ref><ref name="asti_2016">{{cite journal|last1=Asti|first1=Lorenzo|last2=Uguzzoni|first2=Guido|last3=Marcatili|first3=Paolo|last4=Pagnani|first4=Andrea|last5=Ofran|first5=Yanay|title=Maximum-Entropy Models of Sequenced Immune Repertoires Predict Antigen-Antibody Affinity|journal=PLOS Computational Biology|date=13 April 2016|volume=12|issue=4|pages=e1004870|doi=10.1371/journal.pcbi.1004870|accessdate=5 September 2016}}</ref>
DCA has been used in the inference of [[Protein contact map|protein residue contacts]],<ref name="morcos_2011" /><ref name="kamisetty_2013" /><ref name="aurell_2013" /><ref name="marks_2011" /> [[RNA#Structure|RNA structure prediction]],<ref name="leonardis_2015" /><ref name="weinreb_2016" /> the inference of [[Protein–protein interaction|protein-protein interaction networks]]<ref name="ovchinnikov_2014" /><ref name="feinauer_2016" /> and the modeling of [[fitness landscape]]s.<ref name="figliuzzi_2015" /><ref name="asti_2016" />


== Mathematical Model and Inference==
== Mathematical Model and Inference==

===Mathematical Model===
===Mathematical Model===

The basis of DCA is a statistical model for the variability within a set of [[Phylogeny|phylogenetically related]] [[Sequence (biology)|biological sequences]]. When fitted to a [[Multiple Sequence Alignment|multiple sequence alignment]] (MSA) of sequences of length <math> N </math>, the model defines a probability for all possible sequences of the same length.<ref name="morcos_2011" /> This probability can be interpreted as the probability that the sequence in question belongs to the same class of sequences as the ones in the MSA, for example the class of all protein sequences belonging to a specific [[protein family]].
The basis of DCA is a statistical model for the variability within a set of [[Phylogeny|phylogenetically related]] [[Sequence (biology)|biological sequences]]. When fitted to a [[Multiple Sequence Alignment|multiple sequence alignment]] (MSA) of sequences of length <math> N </math>, the model defines a probability for all possible sequences of the same length.<ref name="morcos_2011" /> This probability can be interpreted as the probability that the sequence in question belongs to the same class of sequences as the ones in the MSA, for example the class of all protein sequences belonging to a specific [[protein family]].


Line 20: Line 18:
:* <math> Z </math> is a normalization constant (a real number) to ensure <math> \sum\limits_{a} P(a | J,h) = 1 </math>
:* <math> Z </math> is a normalization constant (a real number) to ensure <math> \sum\limits_{a} P(a | J,h) = 1 </math>


The parameters <math> h_i(a_i) </math> depend on one position <math> i </math> and the symbol <math> a_i </math> at this position. They are usually called fields<ref name="morcos_2011" /> and represent the propensity of a symbol to be found at a certain position. The parameters <math> J_{ij}(a_i,a_j) </math> depend on pairs of positions <math> i,j </math> and the symbols <math> a_i,a_j </math> at these positions. They are usually called couplings<ref name="morcos_2011" /> and represent an interaction, i.e. a term quantifying how compatible the symbols at both positions are with each other. The model is [[Network topology#Fully connected network|fully connected]], so there are interactions between all pairs of positions. The model can be seen as a generalization of the [[Ising model]], with spins taking not only two values but any value from a given finite alphabet. In fact, when the size of the alphabet is 2, the model reduces to the Ising model. Since it is also reminiscent of [[Potts Model|the model of the same name]], it is often called the Potts model.<ref name="feinauer_2014" />
The parameters <math> h_i(a_i) </math> depend on one position <math> i </math> and the symbol <math> a_i </math> at this position. They are usually called fields<ref name="morcos_2011" /> and represent the propensity of a symbol to be found at a certain position. The parameters <math> J_{ij}(a_i,a_j) </math> depend on pairs of positions <math> i,j </math> and the symbols <math> a_i,a_j </math> at these positions. They are usually called couplings<ref name="morcos_2011" /> and represent an interaction, i.e. a term quantifying how compatible the symbols at both positions are with each other. The model is [[Network topology#Fully connected network|fully connected]], so there are interactions between all pairs of positions. The model can be seen as a generalization of the [[Ising model]], with spins taking not only two values but any value from a given finite alphabet. In fact, when the size of the alphabet is 2, the model reduces to the Ising model. Since it is also reminiscent of [[Potts Model|the model of the same name]], it is often called the Potts model.<ref name="feinauer_2014">{{cite journal|last1=Feinauer|first1=Christoph|last2=Skwark|first2=Marcin J.|last3=Pagnani|first3=Andrea|last4=Aurell|first4=Erik|last5=Dunbrack|first5=Roland L.|title=Improving Contact Prediction along Three Dimensions|journal=PLoS Computational Biology|date=9 October 2014|volume=10|issue=10|pages=e1003847|doi=10.1371/journal.pcbi.1003847|accessdate=5 September 2016}}</ref>
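
The role of the fields, the couplings and the normalization constant can be illustrated with a minimal Python sketch. It assumes the usual exponential form, in which the unnormalized log-probability of a sequence is the sum of its fields plus the sum of its couplings over all pairs of positions. The array layout (<code>J</code> of shape N×N×q×q with only the entries for i < j used, <code>h</code> of shape N×q) and all function names are illustrative assumptions, not part of any published DCA implementation.

```python
import itertools
import numpy as np

def log_weight(seq, J, h):
    """Unnormalized log-probability: sum_i h_i(a_i) + sum_{i<j} J_ij(a_i, a_j)."""
    N = len(seq)
    total = sum(h[i, seq[i]] for i in range(N))
    for i in range(N):
        for j in range(i + 1, N):
            total += J[i, j, seq[i], seq[j]]
    return float(total)

def exact_distribution(J, h):
    """Enumerate all q**N sequences; only feasible for very small N and q."""
    N, q = h.shape
    seqs = list(itertools.product(range(q), repeat=N))
    weights = np.exp([log_weight(s, J, h) for s in seqs])
    Z = weights.sum()  # normalization constant
    return seqs, weights / Z, Z

# Toy parameters (random, purely illustrative): 3 positions, 4 symbols.
rng = np.random.default_rng(0)
N, q = 3, 4
J = rng.normal(size=(N, N, q, q))
h = rng.normal(size=(N, q))

seqs, probs, Z = exact_distribution(J, h)
print(len(seqs), float(probs.sum()))
```

Dividing by <code>Z</code> is what makes the probabilities of all sequences sum to one; the enumeration used here to obtain it is only practical for toy sizes.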


It should be noted that even knowing the probabilities of all sequences does not determine the parameters <math> J,h </math> uniquely. For example, a simple transformation of the parameters
It should be noted that even knowing the probabilities of all sequences does not determine the parameters <math> J,h </math> uniquely. For example, a simple transformation of the parameters
Line 30: Line 28:
for any set of real numbers <math> R_{ij} </math> leaves the probabilities the same. The [[likelihood function]] is invariant under such transformations as well, so the data cannot be used to fix these degrees of freedom (although a [[Prior distribution|prior]] on the parameters might do so<ref name="aurell_2013"/>).
for any set of real numbers <math> R_{ij} </math> leaves the probabilities the same. The [[likelihood function]] is invariant under such transformations as well, so the data cannot be used to fix these degrees of freedom (although a [[Prior distribution|prior]] on the parameters might do so<ref name="aurell_2013"/>).


A convention often found in literature<ref name="aurell_2013"/><ref name="baldassi_2014"/> is to fix these degrees of freedom such that the [[Matrix norm#Frobenius norm|Frobenius norm]] of the coupling matrix
A convention often found in literature<ref name="aurell_2013"/><ref name="baldassi_2014">{{cite journal|last1=Baldassi|first1=Carlo|last2=Zamparo|first2=Marco|last3=Feinauer|first3=Christoph|last4=Procaccini|first4=Andrea|last5=Zecchina|first5=Riccardo|last6=Weigt|first6=Martin|last7=Pagnani|first7=Andrea|last8=Hamacher|first8=Kay|title=Fast and Accurate Multivariate Gaussian Modeling of Protein Families: Predicting Residue Contacts and Protein-Interaction Partners|journal=PLoS ONE|date=24 March 2014|volume=9|issue=3|pages=e92721|doi=10.1371/journal.pone.0092721|accessdate=5 September 2016}}</ref> is to fix these degrees of freedom such that the [[Matrix norm#Frobenius norm|Frobenius norm]] of the coupling matrix
:<math>
:<math>
F_{ij} = \sqrt{\sum\limits_{a,b} J_{ij}(a,b)^2},
F_{ij} = \sqrt{\sum\limits_{a,b} J_{ij}(a,b)^2},
Line 38: Line 36:


===Maximum Entropy Derivation===
===Maximum Entropy Derivation===
To justify the Potts model, it is often noted that it can be derived following a [[Principle of maximum entropy|maximum entropy principle]]:<ref name="stein_2015">{{cite journal|last1=Stein|first1=Richard R.|last2=Marks|first2=Debora S.|last3=Sander|first3=Chris|last4=Chen|first4=Shi-Jie|title=Inferring Pairwise Interactions from Biological Data Using Maximum-Entropy Probability Models|journal=PLOS Computational Biology|date=30 July 2015|volume=11|issue=7|pages=e1004182|doi=10.1371/journal.pcbi.1004182|accessdate=5 September 2016}}</ref> For a given set of sample [[covariance]]s and frequencies, the Potts model represents the distribution with the maximal [[Shannon entropy]] of all distributions reproducing those covariances and frequencies. For a [[Sequence alignment|multiple sequence alignment]], the sample covariances are defined as

To justify the Potts model, it is often noted that it can be derived following a [[Principle of maximum entropy|maximum entropy principle]]:<ref name="stein_2015" /> For a given set of sample [[covariance]]s and frequencies, the Potts model represents the distribution with the maximal [[Shannon entropy]] of all distributions reproducing those covariances and frequencies. For a [[Sequence alignment|multiple sequence alignment]], the sample covariances are defined as


:<math>
:<math>
Line 65: Line 62:


===Direct Couplings and Indirect Correlation===
===Direct Couplings and Indirect Correlation===
The central point of DCA is to interpret the <math> J_{ij} </math> (which can be represented as a <math> q\times q</math> matrix if there are <math> q </math> possible symbols) as direct couplings. If two positions are under joint [[evolutionary pressure]] (for example to maintain a structural bond), one might expect these couplings to be large because only sequences with fitting pairs of symbols should have a significant probability. On the other hand, a large correlation between two positions does not necessarily mean that the couplings are large, since large couplings between e.g. positions <math> i,j </math> and <math> j,k </math> might lead to large correlations between positions <math> i </math> and <math> k </math>, mediated by position <math> j </math>.<ref name="morcos_2011" /> In fact, such indirect correlations have been implicated in the high false positive rate when inferring protein residue contacts using correlation measures like [[Mutual Information|mutual information]].<ref name="burger_2010">{{cite journal|last1=Burger|first1=Lukas|last2=van Nimwegen|first2=Erik|last3=Bourne|first3=Philip E.|title=Disentangling Direct from Indirect Co-Evolution of Residues in Protein Alignments|journal=PLoS Computational Biology|date=1 January 2010|volume=6|issue=1|pages=e1000633|doi=10.1371/journal.pcbi.1000633|accessdate=5 September 2016}}</ref>

The central point of DCA is to interpret the <math> J_{ij} </math> (which can be represented as a <math> q\times q</math> matrix if there are <math> q </math> possible symbols) as direct couplings. If two positions are under joint [[evolutionary pressure]] (for example to maintain a structural bond), one might expect these couplings to be large because only sequences with fitting pairs of symbols should have a significant probability. On the other hand, a large correlation between two positions does not necessarily mean that the couplings are large, since large couplings between e.g. positions <math> i,j </math> and <math> j,k </math> might lead to large correlations between positions <math> i </math> and <math> k </math>, mediated by position <math> j </math>.<ref name="morcos_2011" /> In fact, such indirect correlations have been implicated in the high false positive rate when inferring protein residue contacts using correlation measures like [[Mutual Information|mutual information]].<ref name="bruger_2010"/>
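
The difference between a direct coupling and an indirect correlation can be checked numerically on a toy model. The following sketch assumes a chain of three positions with two symbols, where positions 0 and 1 and positions 1 and 2 are coupled (equal symbols are favoured with strength <code>w</code>) while the coupling between positions 0 and 2 is exactly zero. All names and values are illustrative assumptions.

```python
import itertools
import numpy as np

N, q, w = 3, 2, 2.0
J = np.zeros((N, N, q, q))
J[0, 1] = w * np.eye(q)  # direct coupling between positions 0 and 1
J[1, 2] = w * np.eye(q)  # direct coupling between positions 1 and 2
# J[0, 2] stays zero: there is no direct coupling between positions 0 and 2.

def log_weight(seq):
    """Sum of couplings over all pairs i < j (fields are zero in this toy model)."""
    return sum(J[i, j, seq[i], seq[j]] for i in range(N) for j in range(i + 1, N))

seqs = list(itertools.product(range(q), repeat=N))
weights = np.exp([log_weight(s) for s in seqs])
probs = weights / weights.sum()

def marginal(constraints):
    """Probability that seq[i] == a for every (i, a) in constraints."""
    return sum(p for s, p in zip(seqs, probs) if all(s[i] == a for i, a in constraints))

def covariance(i, j, a, b):
    """Covariance of the indicators 'symbol a at position i' and 'symbol b at position j'."""
    return marginal([(i, a), (j, b)]) - marginal([(i, a)]) * marginal([(j, b)])

c01 = covariance(0, 1, 1, 1)  # directly coupled pair
c02 = covariance(0, 2, 1, 1)  # pair without a direct coupling
print(c01, c02)
```

In this toy model the covariance between positions 0 and 2 is clearly non-zero although their coupling vanishes; it is mediated entirely by position 1.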


===Inference===
===Inference===

The inference of the Potts model on a [[multiple sequence alignment]] (MSA) using [[maximum likelihood estimation]] is usually computationally intractable, because one needs to calculate the normalization constant <math>Z</math>. For sequence length <math> N </math> and <math> q </math> possible symbols, <math>Z</math> is a sum of <math>q^N</math> terms; even a small protein domain family with 30 positions requires <math>20^{30}</math> terms. Therefore, numerous approximations and alternatives have been developed:
The inference of the Potts model on a [[multiple sequence alignment]] (MSA) using [[maximum likelihood estimation]] is usually computationally intractable, because one needs to calculate the normalization constant <math>Z</math>. For sequence length <math> N </math> and <math> q </math> possible symbols, <math>Z</math> is a sum of <math>q^N</math> terms; even a small protein domain family with 30 positions requires <math>20^{30}</math> terms. Therefore, numerous approximations and alternatives have been developed:


* mpDCA<ref name="weigt_2009"/> (inference based on [[Belief propagation|message passing/belief propagation]])
* mpDCA<ref name="weigt_2009">{{cite journal|last1=Weigt|first1=M.|last2=White|first2=R. A.|last3=Szurmant|first3=H.|last4=Hoch|first4=J. A.|last5=Hwa|first5=T.|title=Identification of direct residue contacts in protein-protein interaction by message passing|journal=Proceedings of the National Academy of Sciences|date=30 December 2008|volume=106|issue=1|pages=67–72|doi=10.1073/pnas.0805923106|accessdate=5 September 2016}}</ref> (inference based on [[Belief propagation|message passing/belief propagation]])
* mfDCA<ref name="morcos_2011"/> (inference based on a [[Mean field theory|mean-field approximation]])
* mfDCA<ref name="morcos_2011"/> (inference based on a [[Mean field theory|mean-field approximation]])
* gaussDCA<ref name="baldassi_2014"/> (inference based on a [[Normal distribution|Gaussian]] approximation)
* gaussDCA<ref name="baldassi_2014"/> (inference based on a [[Normal distribution|Gaussian]] approximation)
* plmDCA<ref name="aurell_2013"/> (inference based on [[Pseudolikelihood|pseudo-likelihoods]])
* plmDCA<ref name="aurell_2013"/> (inference based on [[Pseudolikelihood|pseudo-likelihoods]])
* Adaptive Cluster Expansion<ref name="barton_2016">{{cite journal|last1=Barton|first1=J. P.|last2=De Leonardis|first2=E.|last3=Coucke|first3=A.|last4=Cocco|first4=S.|title=ACE: adaptive cluster expansion for maximum entropy graphical model inference|journal=Bioinformatics|date=21 June 2016|pages=btw328|doi=10.1093/bioinformatics/btw328|accessdate=5 September 2016}}</ref>
* Adaptive Cluster Expansion<ref name="barton_2016"/>


All of these methods lead to some form of estimate for the set of parameters <math>J,{h}</math> maximizing the likelihood of the MSA. Many of them include [[Regularization (mathematics)|regularization]] or [[Prior probability|prior]] terms to ensure a well-posed problem or promote a sparse solution.
All of these methods lead to some form of estimate for the set of parameters <math>J,{h}</math> maximizing the likelihood of the MSA. Many of them include [[Regularization (mathematics)|regularization]] or [[Prior probability|prior]] terms to ensure a well-posed problem or promote a sparse solution.
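
To illustrate why some of these alternatives are tractable, the following sketch shows the general idea behind pseudo-likelihood approaches: the probability of the symbol at one position, conditioned on all other positions, needs a normalization over only <math>q</math> terms instead of <math>q^N</math>. This is a generic sketch under the assumed exponential form of the model, not the plmDCA implementation; the array layout and function names are assumptions made for the example.

```python
import numpy as np

def coupling(J, i, j, a, b):
    """J_ij(a, b), read from the stored upper triangle (entries with i < j)."""
    return J[i, j, a, b] if i < j else J[j, i, b, a]

def site_conditional(seq, i, J, h):
    """P(a_i = a | all other symbols of seq) for every a; needs only q terms."""
    N, q = h.shape
    logits = np.array([
        h[i, a] + sum(coupling(J, i, j, a, seq[j]) for j in range(N) if j != i)
        for a in range(q)
    ])
    weights = np.exp(logits - logits.max())  # shift for numerical stability
    return weights / weights.sum()

def log_pseudolikelihood(msa, J, h):
    """Sum over sequences and positions of log P(a_i | rest of the sequence)."""
    return float(sum(np.log(site_conditional(s, i, J, h)[s[i]])
                     for s in msa for i in range(len(s))))

def joint_log_weight(seq, J, h):
    """Unnormalized log-probability of a full sequence."""
    N = len(seq)
    return (sum(h[i, seq[i]] for i in range(N))
            + sum(J[i, j, seq[i], seq[j]] for i in range(N) for j in range(i + 1, N)))

# Toy parameters (random, purely illustrative).
rng = np.random.default_rng(1)
N, q = 4, 3
J = rng.normal(size=(N, N, q, q))
h = rng.normal(size=(N, q))
seq, i = (2, 0, 1, 1), 1

# Reference value: ratio of joint weights, in which Z cancels.
ref = np.exp([joint_log_weight(seq[:i] + (a,) + seq[i + 1:], J, h) for a in range(q)])
ref /= ref.sum()
cond = site_conditional(seq, i, J, h)

n_terms_full = 20 ** 30  # number of terms in Z for N = 30 and q = 20
```

A pseudo-likelihood method would maximize <code>log_pseudolikelihood</code> over <code>J</code> and <code>h</code>, typically with a regularization term; the optimization itself is omitted here.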


== Applications ==
== Applications ==

=== Protein Residue Contact Prediction ===
=== Protein Residue Contact Prediction ===
A possible interpretation of large values of couplings in a model fitted to an MSA of a protein family is the existence of conserved contacts between positions (residues) in the family. Such a contact can lead to [[molecular coevolution]], since a mutation in one of the two residues, without a compensating mutation in the other residue, is likely to disrupt [[protein structure]] and negatively affect the fitness of the protein. Residue pairs for which there is a strong [[selective pressure]] to maintain mutual compatibility are therefore expected to mutate together or not at all. This idea (which was known in the literature long before the conception of DCA<ref name="goebel_1994">{{cite journal|last1=Göbel|first1=Ulrike|last2=Sander|first2=Chris|last3=Schneider|first3=Reinhard|last4=Valencia|first4=Alfonso|title=Correlated mutations and residue contacts in proteins|journal=Proteins: Structure, Function, and Genetics|date=April 1994|volume=18|issue=4|pages=309–317|doi=10.1002/prot.340180402|accessdate=5 September 2016}}</ref>) has been used to predict [[Protein contact map#Contact map prediction|protein contact maps]], for example by analyzing the mutual information between protein residues.

A possible interpretation of large values of couplings in a model fitted to an MSA of a protein family is the existence of conserved contacts between positions (residues) in the family. Such a contact can lead to [[molecular coevolution]], since a mutation in one of the two residues, without a compensating mutation in the other residue, is likely to disrupt [[protein structure]] and negatively affect the fitness of the protein. Residue pairs for which there is a strong [[selective pressure]] to maintain mutual compatibility are therefore expected to mutate together or not at all. This idea (which was known in the literature long before the conception of DCA<ref name="goebel_1994"/>) has been used to predict [[Protein contact map#Contact map prediction|protein contact maps]], for example by analyzing the mutual information between protein residues.


Within the framework of DCA, a score for the strength of the direct interaction between a pair of residues <math> i,j </math> is often defined<ref name="aurell_2013"/><ref name="baldassi_2014"/> using the Frobenius norm <math> F_{ij} </math> of the corresponding coupling matrix <math> J_{ij} </math> and applying an ''average product correction'' (APC):
Within the framework of DCA, a score for the strength of the direct interaction between a pair of residues <math> i,j </math> is often defined<ref name="aurell_2013"/><ref name="baldassi_2014"/> using the Frobenius norm <math> F_{ij} </math> of the corresponding coupling matrix <math> J_{ij} </math> and applying an ''average product correction'' (APC):
Line 101: Line 94:
</math>.
</math>.
This correction term was first introduced for mutual information<ref name="dunn_2008"/> and is used to remove biases of specific positions to produce large <math> F_{ij} </math>. Scores that are invariant under parameter transformations that do not affect the probabilities have also been used.<ref name="morcos_2011"/>
This correction term was first introduced for mutual information<ref name="dunn_2007">{{cite journal|last1=Dunn|first1=S.D.|last2=Wahl|first2=L.M.|last3=Gloor|first3=G.B.|title=Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction|journal=Bioinformatics|date=5 December 2007|volume=24|issue=3|pages=333–340|doi=10.1093/bioinformatics/btm604|accessdate=5 September 2016}}</ref> and is used to remove biases of specific positions to produce large <math> F_{ij} </math>. Scores that are invariant under parameter transformations that do not affect the probabilities have also been used.<ref name="morcos_2011"/>
Sorting all residue pairs by this score results in a list in which the top of the list is strongly enriched in residue contacts when compared to the protein contact map of a homologous protein.<ref name="marks_2011"/> High-quality predictions of residue contacts are valuable as prior information in [[protein structure prediction]].<ref name="marks_2011"/>
Sorting all residue pairs by this score results in a list in which the top of the list is strongly enriched in residue contacts when compared to the protein contact map of a homologous protein.<ref name="marks_2011"/> High-quality predictions of residue contacts are valuable as prior information in [[protein structure prediction]].<ref name="marks_2011"/>
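
The scoring and ranking step can be sketched in a few lines of Python. The sketch assumes that the couplings have already been inferred and brought into the chosen parameter convention, that they are stored symmetrically in an array of shape N×N×q×q, and that the average product correction takes its commonly used form (the product of the two position averages divided by the overall average, with averages taken over off-diagonal entries). Function names and the toy data are illustrative assumptions.

```python
import numpy as np

def frobenius_scores(J):
    """F_ij = sqrt(sum_{a,b} J_ij(a,b)^2) for an (N, N, q, q) coupling array."""
    F = np.sqrt((J ** 2).sum(axis=(2, 3)))
    np.fill_diagonal(F, 0.0)
    return F

def apc(F):
    """Average product correction: F_ij - mean_i * mean_j / overall mean."""
    N = F.shape[0]
    row_mean = F.sum(axis=1) / (N - 1)       # average over the other positions
    total_mean = F.sum() / (N * (N - 1))     # average over all pairs i != j
    corrected = F - np.outer(row_mean, row_mean) / total_mean
    np.fill_diagonal(corrected, 0.0)
    return corrected

# Toy couplings: weak random background plus one planted strong pair (1, 4).
rng = np.random.default_rng(2)
N, q = 6, 3
J = np.zeros((N, N, q, q))
for i in range(N):
    for j in range(i + 1, N):
        J[i, j] = 0.1 * rng.normal(size=(q, q))
J[1, 4] += 2.0 * np.eye(q)
for i in range(N):
    for j in range(i + 1, N):
        J[j, i] = J[i, j].T  # store J_ji(b, a) = J_ij(a, b)

scores = apc(frobenius_scores(J))
rows, cols = np.triu_indices(N, k=1)
order = np.argsort(-scores[rows, cols])
ranked_pairs = [(int(rows[k]), int(cols[k])) for k in order]
print(ranked_pairs[:3])
```

In practice the top of <code>ranked_pairs</code> would be compared against a known contact map; here the planted pair simply ends up first.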


=== Inference of protein-protein interaction ===
=== Inference of protein-protein interaction ===
DCA can be used for detecting conserved [[Protein-protein interaction|interaction]] between protein families and for predicting which residue pairs form contacts in a [[Multiprotein complex|protein complex]].<ref name="ovchinnikov_2014"/><ref name="feinauer_2016"/> Such predictions can be used when generating structural models for these complexes,<ref name="schug_2009">{{cite journal|last1=Schug|first1=A.|last2=Weigt|first2=M.|last3=Onuchic|first3=J. N.|last4=Hwa|first4=T.|last5=Szurmant|first5=H.|title=High-resolution protein complexes from integrating genomic information with molecular simulation|journal=Proceedings of the National Academy of Sciences|date=17 December 2009|volume=106|issue=52|pages=22124–22129|doi=10.1073/pnas.0912100106|accessdate=5 September 2016}}</ref> or when inferring protein-protein interaction networks made from more than two proteins.<ref name="feinauer_2016"/>

DCA can be used for detecting conserved [[Protein-protein interaction|interaction]] between protein families and for predicting which residue pairs form contacts in a [[Multiprotein complex|protein complex]].<ref name="ovchinnikov_2014"/><ref name="feinauer_2016"/> Such predictions can be used when generating structural models for these complexes,<ref name="schug_2009"/> or when inferring protein-protein interaction networks made from more than two proteins.<ref name="feinauer_2016"/>


=== Modeling of fitness landscapes ===
=== Modeling of fitness landscapes ===

DCA can be used to model fitness landscapes and to predict the effect of a mutation in the amino acid sequence of a protein on its fitness.<ref name="figliuzzi_2015"/>
DCA can be used to model fitness landscapes and to predict the effect of a mutation in the amino acid sequence of a protein on its fitness.<ref name="figliuzzi_2015"/>
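
One way such a prediction could be computed is sketched below. It assumes that the log-probability of a sequence under the fitted model is used as a proxy for fitness, so that a substitution is scored by the difference in log-probability between mutant and wild type; the normalization constant cancels in this difference. This scoring choice, the array layout and all names are assumptions made for illustration and are not taken from the cited work.

```python
import numpy as np

def log_weight(seq, J, h):
    """Unnormalized log-probability: sum_i h_i(a_i) + sum_{i<j} J_ij(a_i, a_j)."""
    N = len(seq)
    return float(sum(h[i, seq[i]] for i in range(N))
                 + sum(J[i, j, seq[i], seq[j]] for i in range(N) for j in range(i + 1, N)))

def mutation_score(wildtype, i, b, J, h):
    """log P(mutant) - log P(wildtype) for substituting symbol b at position i."""
    mutant = list(wildtype)
    mutant[i] = b
    return log_weight(mutant, J, h) - log_weight(wildtype, J, h)

# Toy parameters (random, purely illustrative): 5 positions, 4 symbols.
rng = np.random.default_rng(3)
N, q = 5, 4
J = rng.normal(size=(N, N, q, q))
h = rng.normal(size=(N, q))
wildtype = [0, 3, 1, 2, 2]

# Score of every single substitution: rows are positions, columns are symbols.
scores = np.array([[mutation_score(wildtype, i, b, J, h) for b in range(q)]
                   for i in range(N)])
print(scores.round(2))
```

Because the couplings enter the score, the same substitution can receive a different value in a different sequence background.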


== References ==
== References ==
{{Reflist}}

{{Reflist|refs=

<ref name="morcos_2011">Morcos, Faruck, et al. "Direct-coupling analysis of residue coevolution captures native contacts across many protein families." Proceedings of the National Academy of Sciences 108.49 (2011): E1293-E1301.</ref>

<ref name="kamisetty_2013">Kamisetty, Hetunandan, Sergey Ovchinnikov, and David Baker. "Assessing the utility of coevolution-based residue–residue contact predictions in a sequence-and structure-rich era." Proceedings of the National Academy of Sciences 110.39 (2013): 15674-15679.</ref>

<ref name="aurell_2013">Ekeberg, Magnus, et al. "Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models." Physical Review E 87.1 (2013): 012707.</ref>

<ref name="marks_2011">Marks, Debora S., et al. "Protein 3D structure computed from evolutionary sequence variation." PloS one 6.12 (2011): e28766.</ref>

<ref name="leonardis_2015">De Leonardis, Eleonora, et al. "Direct-Coupling Analysis of nucleotide coevolution facilitates RNA secondary and tertiary structure prediction." Nucleic acids research 43.21 (2015): 10444-10455.</ref>

<ref name="weinreb_2016">Weinreb, Caleb, et al. "3D RNA and Functional Interactions from Evolutionary Couplings." Cell 165.4 (2016): 963-975.</ref>

<ref name="ovchinnikov_2014">Ovchinnikov, Sergey, Hetunandan Kamisetty, and David Baker. "Robust and accurate prediction of residue–residue interactions across protein interfaces using evolutionary information." Elife 3 (2014): e02030.</ref>

<ref name="feinauer_2016">Feinauer, Christoph, et al. "Inter-protein sequence co-evolution predicts known physical interactions in bacterial ribosomes and the trp operon." PloS one 11.2 (2016): e0149166.</ref>

<ref name="figliuzzi_2015">Figliuzzi, Matteo, et al. "Coevolutionary landscape inference and the context-dependence of mutations in beta-lactamase TEM-1." Molecular biology and evolution (2015): msv211.</ref>

<ref name="asti_2016">Asti, Lorenzo, et al. "Maximum-Entropy Models of Sequenced Immune Repertoires Predict Antigen-Antibody Affinity." PLoS Comput Biol 12.4 (2016): e1004870.</ref>

<ref name="bruger_2010">Burger, Lukas, and Erik Van Nimwegen. "Disentangling direct from indirect co-evolution of residues in protein alignments." PLoS Comput Biol 6.1 (2010): e1000633.</ref>

<ref name="feinauer_2014">Feinauer, Christoph, et al. "Improving contact prediction along three dimensions." PLoS Comput Biol 10.10 (2014): e1003847.</ref>

<ref name="stein_2015">Stein, Richard R., Debora S. Marks, and Chris Sander. "Inferring pairwise interactions from biological data using maximum-entropy probability models." PLoS Comput Biol 11.7 (2015): e1004182.</ref>

<ref name="weigt_2009">Weigt, Martin, et al. "Identification of direct residue contacts in protein–protein interaction by message passing." Proceedings of the National Academy of Sciences 106.1 (2009): 67-72.</ref>

<ref name="baldassi_2014">Baldassi, Carlo, et al. "Fast and accurate multivariate Gaussian modeling of protein families: predicting residue contacts and protein-interaction partners." PloS one 9.3 (2014): e92721.</ref>

<ref name="barton_2016">Barton, John P., et al. "ACE: adaptive cluster expansion for maximum entropy graphical model inference." bioRxiv (2016): 044677.</ref>

<ref name="goebel_1994">Göbel, Ulrike, et al. "Correlated mutations and residue contacts in proteins." Proteins: Structure, Function, and Bioinformatics 18.4 (1994): 309-317.</ref>

<ref name="dunn_2008">Dunn, Stanley D., Lindi M. Wahl, and Gregory B. Gloor. "Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction." Bioinformatics 24.3 (2008): 333-340.</ref>

<ref name="schug_2009">Schug, Alexander, et al. "High-resolution protein complexes from integrating genomic information with molecular simulation." Proceedings of the National Academy of Sciences 106.52 (2009): 22124-22129.</ref>
}}


== External links ==
== External links ==

Online services:
Online services:



Revision as of 23:52, 5 September 2016

Direct coupling analysis or DCA is an umbrella term comprising several methods for analyzing sequence data in computational biology.[1] The common idea of these methods is to use statistical modeling to quantify the strength of the direct relationship between two positions of a biological sequence, excluding effects from other positions. This contrasts with usual measures of correlation, which can be large even if there is no direct relationship between the positions (hence the name direct coupling analysis). Such a direct relationship can, for example, be the evolutionary pressure for two positions to maintain mutual compatibility in the biomolecular structure of the sequence, leading to molecular coevolution between the two positions. DCA has been used in the inference of protein residue contacts,[1][2][3][4] RNA structure prediction,[5][6] the inference of protein-protein interaction networks[7][8] and the modeling of fitness landscapes.[9][10]

Mathematical Model and Inference

Mathematical Model

The basis of DCA is a statistical model for the variability within a set of phylogenetically related biological sequences. When fitted to a multiple sequence alignment (MSA) of sequences of length $N$, the model defines a probability for all possible sequences of the same length.[1] This probability can be interpreted as the probability that the sequence in question belongs to the same class of sequences as the ones in the MSA, for example the class of all protein sequences belonging to a specific protein family.

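We denote a sequence of length $N$ by $\mathbf{a} = (a_1, a_2, \ldots, a_N)$, with the $a_i$ being categorical variables representing the monomers of the sequence (if the sequences are for example aligned amino acid sequences of proteins of a protein family, the $a_i$ take as values any of the 20 standard amino acids). The probability of a sequence within a model is then defined as

$$P(\mathbf{a}) = \frac{1}{Z} \exp\left( \sum_{i=1}^{N} h_i(a_i) + \sum_{i=1}^{N} \sum_{j=i+1}^{N} J_{ij}(a_i, a_j) \right),$$

where

  • $h_i(a_i)$ and $J_{ij}(a_i, a_j)$ are sets of real numbers representing the parameters of the model (more below)
  • $Z$ is a normalization constant (a real number) to ensure $\sum_{\mathbf{a}} P(\mathbf{a}) = 1$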

The parameters $h_i(a_i)$ depend on one position $i$ and the symbol $a_i$ at this position. They are usually called fields[1] and represent the propensity of a symbol to be found at a certain position. The parameters $J_{ij}(a_i, a_j)$ depend on pairs of positions $i, j$ and the symbols $a_i, a_j$ at these positions. They are usually called couplings[1] and represent an interaction, i.e. a term quantifying how compatible the symbols at both positions are with each other. The model is fully connected, so there are interactions between all pairs of positions. The model can be seen as a generalization of the Ising model, with spins taking not only two values but any value from a given finite alphabet. In fact, when the size of the alphabet is 2, the model reduces to the Ising model. Since it is also reminiscent of the Potts model of statistical physics, it is often called the Potts model.[11]
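The definition above can be evaluated directly for very small models. The following is a minimal sketch, assuming symbols are encoded as integers `0..q-1`, fields are stored as a list `h[i][a]` and couplings as a dictionary `J[(i, j)][a][b]` with `i < j`; the function names and the toy parameters are illustrative only and do not come from any DCA package.

```python
import itertools
import math

def potts_exponent(seq, h, J):
    """Sum of the fields h[i][a_i] and the couplings J[(i, j)][a_i][a_j] for i < j."""
    n = len(seq)
    total = sum(h[i][seq[i]] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            total += J[(i, j)][seq[i]][seq[j]]
    return total

def potts_probabilities(h, J, q):
    """Enumerate all q**n sequences and normalize their exponential weights."""
    n = len(h)
    seqs = list(itertools.product(range(q), repeat=n))
    weights = [math.exp(potts_exponent(s, h, J)) for s in seqs]
    Z = sum(weights)  # normalization constant
    return {s: w / Z for s, w in zip(seqs, weights)}

# Toy model (assumed values): 3 positions, 2 symbols, no fields,
# and a single coupling that favors equal symbols at positions 0 and 1.
q = 2
h = [[0.0, 0.0] for _ in range(3)]
zero = [[0.0, 0.0], [0.0, 0.0]]
J = {(0, 1): [[1.0, 0.0], [0.0, 1.0]], (0, 2): zero, (1, 2): zero}

P = potts_probabilities(h, J, q)
print(sum(P.values()))              # total probability
print(P[(0, 0, 0)], P[(0, 1, 0)])   # compatible vs. incompatible pair at positions 0 and 1
```

Sequences whose symbols at positions 0 and 1 agree receive a higher probability than those that disagree, which is the sense in which a coupling quantifies compatibility. Full enumeration is only feasible for toy sizes.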

Even knowing the probabilities of all sequences does not determine the parameters uniquely. For example, the transformation of the parameters

$$J_{ij}(a, b) \to J_{ij}(a, b) + R_{ij}(a) + R_{ji}(b), \qquad h_i(a) \to h_i(a) - \sum_{j \neq i} R_{ij}(a)$$

for any set of real numbers $R_{ij}(a)$ leaves the probabilities the same. The likelihood function is invariant under such transformations as well, so the data cannot be used to fix these degrees of freedom (although a prior on the parameters might do so[3]).
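This invariance can be checked numerically on a toy model. The sketch below assumes one particular family of reparametrizations: an offset `R[i][j][a]` is added to the couplings of the pair `(i, j)` and subtracted from the field at position `i`. The data layout (`h[i][a]`, `J[(i, j)][a][b]` with `i < j`) and all names are illustrative choices.

```python
import itertools
import math
import random

def probabilities(h, J, q):
    """Brute-force Potts probabilities for a small model."""
    n = len(h)
    seqs = list(itertools.product(range(q), repeat=n))

    def exponent(s):
        e = sum(h[i][s[i]] for i in range(n))
        e += sum(J[(i, j)][s[i]][s[j]] for i in range(n) for j in range(i + 1, n))
        return e

    w = [math.exp(exponent(s)) for s in seqs]
    Z = sum(w)
    return {s: x / Z for s, x in zip(seqs, w)}

def gauge_transform(h, J, R, q):
    """J_ij(a,b) += R[i][j][a] + R[j][i][b];  h_i(a) -= sum over j != i of R[i][j][a]."""
    n = len(h)
    h2 = [[h[i][a] - sum(R[i][j][a] for j in range(n) if j != i) for a in range(q)]
          for i in range(n)]
    J2 = {(i, j): [[J[(i, j)][a][b] + R[i][j][a] + R[j][i][b] for b in range(q)]
                   for a in range(q)]
          for (i, j) in J}
    return h2, J2

# Random toy parameters and a random offset table (fixed seed for reproducibility).
rng = random.Random(0)
n, q = 3, 2
h = [[rng.uniform(-1, 1) for _ in range(q)] for _ in range(n)]
J = {(i, j): [[rng.uniform(-1, 1) for _ in range(q)] for _ in range(q)]
     for i in range(n) for j in range(i + 1, n)}
R = [[[rng.uniform(-1, 1) for _ in range(q)] for _ in range(n)] for _ in range(n)]

h2, J2 = gauge_transform(h, J, R, q)
P1, P2 = probabilities(h, J, q), probabilities(h2, J2, q)
max_diff = max(abs(P1[s] - P2[s]) for s in P1)
print(max_diff)  # differs from zero only by floating-point rounding
```

The parameters change, but every sequence keeps its probability, so no amount of data can distinguish the two parameter sets.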

A convention often found in literature[3][12] is to fix these degrees of freedom such that the Frobenius norm of the coupling matrix

$$F_{ij} = \sqrt{\sum_{a,b} J_{ij}(a, b)^2}$$

is minimized (independently for every pair of positions $i$ and $j$).
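As an illustration of this convention, the sketch below computes the Frobenius norm of one coupling matrix and applies double centering (subtracting row and column means and adding back the overall mean). Among all matrices that differ from the original only by a row-dependent plus a column-dependent offset, the double-centered one has the smallest Frobenius norm; this choice is commonly called the zero-sum gauge. The function names and the example matrix are assumptions made for this example.

```python
import math

def frobenius_norm(M):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in M for x in row))

def zero_sum_gauge(M):
    """Double-center a q x q coupling matrix so that all rows and columns sum to zero."""
    q = len(M)
    row_mean = [sum(M[a]) / q for a in range(q)]
    col_mean = [sum(M[a][b] for a in range(q)) / q for b in range(q)]
    total_mean = sum(row_mean) / q
    return [[M[a][b] - row_mean[a] - col_mean[b] + total_mean for b in range(q)]
            for a in range(q)]

# Hypothetical coupling matrix for one pair of positions and q = 3 symbols.
J_ij = [[2.0, 1.0, 0.0],
        [1.0, 3.0, 1.0],
        [0.0, 1.0, 2.0]]
J_zs = zero_sum_gauge(J_ij)

print(frobenius_norm(J_ij), frobenius_norm(J_zs))
```

The offsets removed by double centering can be absorbed into the fields, so the probabilities of the model are unaffected.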

Maximum Entropy Derivation

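To justify the Potts model, it is often noted that it can be derived following a maximum entropy principle:[13] For a given set of sample covariances and frequencies, the Potts model represents the distribution with the maximal Shannon entropy of all distributions reproducing those covariances and frequencies. For a multiple sequence alignment, the sample covariances are defined as

$$C_{ij}(a, b) = f_{ij}(a, b) - f_i(a) f_j(b),$$

where $f_{ij}(a, b)$ is the frequency of finding symbols $a$ and $b$ at positions $i$ and $j$ in the same sequence in the MSA, and $f_i(a)$ the frequency of finding symbol $a$ at position $i$. The Potts model is then the unique distribution $P$ that maximizes the functional

$$F[P] = -\sum_{\mathbf{a}} P(\mathbf{a}) \log P(\mathbf{a}) + \sum_{i<j} \sum_{a,b} \lambda_{ij}(a, b) \left( P_{ij}(a, b) - f_{ij}(a, b) \right) + \sum_{i} \sum_{a} \lambda_i(a) \left( P_i(a) - f_i(a) \right) + \Omega \left( 1 - \sum_{\mathbf{a}} P(\mathbf{a}) \right).$$

The first term in the functional is the Shannon entropy of the distribution. The $\lambda_{ij}(a, b)$ and $\lambda_i(a)$ are Lagrange multipliers to ensure $P_{ij}(a, b) = f_{ij}(a, b)$ and $P_i(a) = f_i(a)$, with $P_{ij}(a, b)$ being the marginal probability to find symbols $a, b$ at positions $i, j$ and $P_i(a)$ the marginal probability to find symbol $a$ at position $i$. The Lagrange multiplier $\Omega$ ensures normalization. Maximizing this functional and identifying

$$\lambda_{ij}(a, b) \equiv J_{ij}(a, b), \qquad \lambda_i(a) \equiv h_i(a), \qquad \Omega \equiv \log Z - 1$$

leads to the Potts model above. This procedure only gives the functional form of the Potts model, while the numerical values of the Lagrange multipliers (identified with the parameters) still have to be determined by fitting the model to the data.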
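The sample statistics that enter the maximum entropy derivation can be computed directly from an alignment. The following is a minimal sketch on a made-up toy MSA, using plain unweighted counts; sequence reweighting and pseudocounts, which some published pipelines apply, are deliberately left out, and all function names are illustrative.

```python
def single_frequency(msa, i, a):
    """Fraction of sequences with symbol a at position i."""
    return sum(1 for s in msa if s[i] == a) / len(msa)

def pair_frequency(msa, i, j, a, b):
    """Fraction of sequences with symbol a at position i and symbol b at position j."""
    return sum(1 for s in msa if s[i] == a and s[j] == b) / len(msa)

def covariance(msa, i, j, a, b):
    """Sample covariance: pair frequency minus the product of single frequencies."""
    return (pair_frequency(msa, i, j, a, b)
            - single_frequency(msa, i, a) * single_frequency(msa, j, b))

# Toy alignment (assumed data): 5 sequences of length 2.
msa = ["AC", "AC", "GT", "GT", "AT"]

c = covariance(msa, 0, 1, "A", "C")
print(c)
```

Here "A" at the first position co-occurs with "C" at the second more often than independence would predict, so the covariance is positive.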

Direct Couplings and Indirect Correlation

The central point of DCA is to interpret the $J_{ij}$ (which can be represented as a $q \times q$ matrix if there are $q$ possible symbols) as direct couplings. If two positions are under joint evolutionary pressure (for example to maintain a structural bond), one might expect these couplings to be large because only sequences with fitting pairs of symbols should have a significant probability. On the other hand, a large correlation between two positions does not necessarily mean that the couplings are large, since large couplings between e.g. positions $i, j$ and $j, k$ might lead to large correlations between positions $i$ and $k$, mediated by position $j$.[1] In fact, such indirect correlations have been implicated in the high false positive rate when inferring protein residue contacts using correlation measures like mutual information.[14]
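The difference between a direct coupling and an indirect correlation can be made concrete with a three-position chain. In the sketch below (toy values, illustrative names), positions 0 and 1 and positions 1 and 2 are coupled, the coupling between positions 0 and 2 is exactly zero, and there are no fields. The covariance between positions 0 and 2 is nevertheless non-zero.

```python
import itertools
import math

w = 1.0                                   # assumed coupling strength
same = [[w, 0.0], [0.0, w]]               # rewards equal symbols
zero = [[0.0, 0.0], [0.0, 0.0]]           # no direct coupling
J = {(0, 1): same, (1, 2): same, (0, 2): zero}

def weight(s):
    """Unnormalized Potts weight of sequence s (fields are zero in this toy model)."""
    return math.exp(sum(J[(i, j)][s[i]][s[j]] for (i, j) in J))

seqs = list(itertools.product(range(2), repeat=3))
Z = sum(weight(s) for s in seqs)
P = {s: weight(s) / Z for s in seqs}

def marginal(i, a):
    return sum(p for s, p in P.items() if s[i] == a)

def pair_marginal(i, j, a, b):
    return sum(p for s, p in P.items() if s[i] == a and s[j] == b)

# Covariance between positions 0 and 2 for symbol 0 at both positions.
c_02 = pair_marginal(0, 2, 0, 0) - marginal(0, 0) * marginal(2, 0)
print(c_02)
```

A method that ranks pairs by covariance would report a signal for the pair (0, 2), while the model parameters show that this signal is mediated entirely by position 1.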

Inference

The inference of the Potts model on a multiple sequence alignment (MSA) using maximum likelihood estimation is usually computationally intractable, because one needs to calculate the normalization constant $Z$, which for sequence length $N$ and $q$ possible symbols is a sum of $q^N$ terms (for a small protein domain family with 30 positions and the 20 standard amino acids, this means $20^{30} \approx 10^{39}$ terms). Therefore, numerous approximations and alternatives have been developed.

All of these methods lead to some form of estimate for the set of parameters maximizing the likelihood of the MSA. Many of them include regularization or prior terms to ensure a well-posed problem or promote a sparse solution.
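The size of the sum behind the normalization constant can be made explicit with a few lines of arithmetic. This sketch only counts terms, assuming an alphabet of 20 symbols; the function name is illustrative.

```python
def partition_function_terms(q, n):
    """Number of sequences (and hence terms in the normalization sum) for q symbols and length n."""
    return q ** n

for n in (5, 10, 30):
    print(n, partition_function_terms(20, n))
```

Each additional position multiplies the number of terms by the alphabet size, which is why exact maximum likelihood is limited to very short sequences or very small alphabets.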

Applications

Protein Residue Contact Prediction

A possible interpretation of large values of couplings in a model fitted to an MSA of a protein family is the existence of conserved contacts between positions (residues) in the family. Such a contact can lead to molecular coevolution, since a mutation in one of the two residues, without a compensating mutation in the other residue, is likely to disrupt protein structure and negatively affect the fitness of the protein. Residue pairs for which there is a strong selective pressure to maintain mutual compatibility are therefore expected to mutate together or not at all. This idea (which was known in literature long before the conception of DCA[17]) has been used to predict protein contact maps, for example by analyzing the mutual information between protein residues.

Within the framework of DCA, a score for the strength of the direct interaction between a pair of residues is often defined[3][12] using the Frobenius norm $F_{ij}$ of the corresponding coupling matrix $J_{ij}$ and applying an average product correction (APC):

$$F_{ij}^{\mathrm{APC}} = F_{ij} - \frac{F_i F_j}{F},$$

where $F_{ij}$ is the Frobenius norm of the coupling matrix defined above and

$$F_i = \frac{1}{N} \sum_{k=1}^{N} F_{ik}, \qquad F = \frac{1}{N^2} \sum_{k,l=1}^{N} F_{kl}.$$

This correction term was first introduced for mutual information[18] and is used to remove biases of specific positions to produce large $F_{ij}$. Scores that are invariant under parameter transformations that do not affect the probabilities have also been used.[1] Sorting all residue pairs by this score results in a list in which the top of the list is strongly enriched in residue contacts when compared to the protein contact map of a homologous protein.[4] High-quality predictions of residue contacts are valuable as prior information in protein structure prediction.[4]
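The effect of the average product correction can be seen on a small synthetic example. The sketch below assumes a symmetric matrix `F` of Frobenius norms and uses plain means over all entries, diagonal included; implementations may differ in such normalization details. The background values and the added signal are made up for illustration.

```python
def apc_scores(F):
    """Subtract the product of row means divided by the overall mean from every entry."""
    n = len(F)
    row_mean = [sum(F[i]) / n for i in range(n)]
    total_mean = sum(row_mean) / n
    return [[F[i][j] - row_mean[i] * row_mean[j] / total_mean for j in range(n)]
            for i in range(n)]

# Position-specific background: some positions produce large norms with every partner.
u = [1.0, 2.0, 3.0, 4.0]
background = [[u[i] * u[j] for j in range(4)] for i in range(4)]

# Add a genuine pairwise signal between positions 0 and 3.
F = [row[:] for row in background]
F[0][3] += 5.0
F[3][0] += 5.0

pairs = [(i, j) for i in range(4) for j in range(i + 1, 4)]
raw_top = max(pairs, key=lambda p: F[p[0]][p[1]])
scores = apc_scores(F)
apc_top = max(pairs, key=lambda p: scores[p[0]][p[1]])
print(raw_top, apc_top)
```

In this example the largest raw value belongs to the pair (2, 3), which is large only because both positions have a high background, while the corrected score ranks the pair (0, 3) first. A purely multiplicative background is removed exactly by the correction.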

Inference of protein-protein interaction

DCA can be used for detecting conserved interaction between protein families and for predicting which residue pairs form contacts in a protein complex.[7][8] Such predictions can be used when generating structural models for these complexes,[19] or when inferring protein-protein interaction networks made from more than two proteins.[8]

Modeling of fitness landscapes

DCA can be used to model fitness landscapes and to predict the effect of a mutation in the amino acid sequence of a protein on its fitness.[9]

References

  1. Morcos, F.; Pagnani, A.; Lunt, B.; Bertolino, A.; Marks, D. S.; Sander, C.; Zecchina, R.; Onuchic, J. N.; Hwa, T.; Weigt, M. (21 November 2011). "Direct-coupling analysis of residue coevolution captures native contacts across many protein families". Proceedings of the National Academy of Sciences. 108 (49): E1293–E1301. doi:10.1073/pnas.1111471108.
  2. Kamisetty, H.; Ovchinnikov, S.; Baker, D. (5 September 2013). "Assessing the utility of coevolution-based residue-residue contact predictions in a sequence- and structure-rich era". Proceedings of the National Academy of Sciences. 110 (39): 15674–15679. doi:10.1073/pnas.1314045110.
  3. Ekeberg, Magnus; Lövkvist, Cecilia; Lan, Yueheng; Weigt, Martin; Aurell, Erik (11 January 2013). "Improved contact prediction in proteins: Using pseudolikelihoods to infer Potts models". Physical Review E. 87 (1). doi:10.1103/PhysRevE.87.012707.
  4. Marks, Debora S.; Colwell, Lucy J.; Sheridan, Robert; Hopf, Thomas A.; Pagnani, Andrea; Zecchina, Riccardo; Sander, Chris; Sali, Andrej (7 December 2011). "Protein 3D Structure Computed from Evolutionary Sequence Variation". PLoS ONE. 6 (12): e28766. doi:10.1371/journal.pone.0028766.
  5. De Leonardis, Eleonora; Lutz, Benjamin; Ratz, Sebastian; Cocco, Simona; Monasson, Rémi; Schug, Alexander; Weigt, Martin (29 September 2015). "Direct-Coupling Analysis of nucleotide coevolution facilitates RNA secondary and tertiary structure prediction". Nucleic Acids Research: gkv932. doi:10.1093/nar/gkv932.
  6. Weinreb, Caleb; Riesselman, Adam J.; Ingraham, John B.; Gross, Torsten; Sander, Chris; Marks, Debora S. (May 2016). "3D RNA and Functional Interactions from Evolutionary Couplings". Cell. 165 (4): 963–975. doi:10.1016/j.cell.2016.03.030.
  7. Ovchinnikov, Sergey; Kamisetty, Hetunandan; Baker, David (1 May 2014). "Robust and accurate prediction of residue–residue interactions across protein interfaces using evolutionary information". eLife. 3. doi:10.7554/eLife.02030.
  8. Feinauer, Christoph; Szurmant, Hendrik; Weigt, Martin; Pagnani, Andrea; Keskin, Ozlem (16 February 2016). "Inter-Protein Sequence Co-Evolution Predicts Known Physical Interactions in Bacterial Ribosomes and the Trp Operon". PLOS ONE. 11 (2): e0149166. doi:10.1371/journal.pone.0149166.
  9. Figliuzzi, Matteo; Jacquier, Hervé; Schug, Alexander; Tenaillon, Oliver; Weigt, Martin (January 2016). "Coevolutionary Landscape Inference and the Context-Dependence of Mutations in Beta-Lactamase TEM-1". Molecular Biology and Evolution. 33 (1): 268–280. doi:10.1093/molbev/msv211.
  10. Asti, Lorenzo; Uguzzoni, Guido; Marcatili, Paolo; Pagnani, Andrea; Ofran, Yanay (13 April 2016). "Maximum-Entropy Models of Sequenced Immune Repertoires Predict Antigen-Antibody Affinity". PLOS Computational Biology. 12 (4): e1004870. doi:10.1371/journal.pcbi.1004870.
  11. Feinauer, Christoph; Skwark, Marcin J.; Pagnani, Andrea; Aurell, Erik; Dunbrack, Roland L. (9 October 2014). "Improving Contact Prediction along Three Dimensions". PLoS Computational Biology. 10 (10): e1003847. doi:10.1371/journal.pcbi.1003847.
  12. Baldassi, Carlo; Zamparo, Marco; Feinauer, Christoph; Procaccini, Andrea; Zecchina, Riccardo; Weigt, Martin; Pagnani, Andrea; Hamacher, Kay (24 March 2014). "Fast and Accurate Multivariate Gaussian Modeling of Protein Families: Predicting Residue Contacts and Protein-Interaction Partners". PLoS ONE. 9 (3): e92721. doi:10.1371/journal.pone.0092721.
  13. Stein, Richard R.; Marks, Debora S.; Sander, Chris; Chen, Shi-Jie (30 July 2015). "Inferring Pairwise Interactions from Biological Data Using Maximum-Entropy Probability Models". PLOS Computational Biology. 11 (7): e1004182. doi:10.1371/journal.pcbi.1004182.
  14. Burger, Lukas; van Nimwegen, Erik; Bourne, Philip E. (1 January 2010). "Disentangling Direct from Indirect Co-Evolution of Residues in Protein Alignments". PLoS Computational Biology. 6 (1): e1000633. doi:10.1371/journal.pcbi.1000633.
  15. Weigt, M.; White, R. A.; Szurmant, H.; Hoch, J. A.; Hwa, T. (30 December 2008). "Identification of direct residue contacts in protein-protein interaction by message passing". Proceedings of the National Academy of Sciences. 106 (1): 67–72. doi:10.1073/pnas.0805923106.
  16. Barton, J. P.; De Leonardis, E.; Coucke, A.; Cocco, S. (21 June 2016). "ACE: adaptive cluster expansion for maximum entropy graphical model inference". Bioinformatics: btw328. doi:10.1093/bioinformatics/btw328.
  17. Göbel, Ulrike; Sander, Chris; Schneider, Reinhard; Valencia, Alfonso (April 1994). "Correlated mutations and residue contacts in proteins". Proteins: Structure, Function, and Genetics. 18 (4): 309–317. doi:10.1002/prot.340180402.
  18. Dunn, S.D.; Wahl, L.M.; Gloor, G.B. (5 December 2007). "Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction". Bioinformatics. 24 (3): 333–340. doi:10.1093/bioinformatics/btm604.
  19. Schug, A.; Weigt, M.; Onuchic, J. N.; Hwa, T.; Szurmant, H. (17 December 2009). "High-resolution protein complexes from integrating genomic information with molecular simulation". Proceedings of the National Academy of Sciences. 106 (52): 22124–22129. doi:10.1073/pnas.0912100106.

External links

Online services:

Source code: