Protein–protein interaction prediction
Protein–protein interaction prediction is a field combining bioinformatics and structural biology in an attempt to identify and catalog physical interactions between pairs or groups of proteins. Understanding protein–protein interactions is important for investigating intracellular signaling pathways, modelling the structures of protein complexes, and gaining insight into various biochemical processes. Physical interactions between pairs of proteins can be inferred experimentally using a variety of techniques, including yeast two-hybrid systems, protein-fragment complementation assays (PCA), affinity purification/mass spectrometry, protein microarrays, fluorescence resonance energy transfer (FRET), and microscale thermophoresis (MST). Efforts to experimentally determine the interactome of numerous species are ongoing, and a number of computational methods for interaction prediction have been developed in recent years.
Methods
Proteins that interact are more likely to co-evolve; it is therefore possible to make inferences about interactions between pairs of proteins based on their phylogenetic distances. It has also been observed in some cases that pairs of interacting proteins have fused orthologues in other organisms. In addition, a number of bound protein complexes have been structurally solved and can be used to identify the residues that mediate the interaction, so that similar motifs can be located in other organisms.
Phylogenetic profiling
Phylogenetic profiling finds pairs of protein families with similar patterns of presence or absence across large numbers of species. The method is based on the hypothesis that potentially interacting proteins should co-evolve and should have orthologs in closely related species. That is, proteins that form complexes or are part of a pathway should be present simultaneously in order to function. A phylogenetic profile is constructed for each protein under investigation; the profile records whether the protein is present in each of a set of genomes. If two proteins are present and absent in the same genomes, they are deemed likely to be functionally related. A similar method can be applied to protein domains, where profiles are constructed for domains to determine whether there are domain interactions. Phylogenetic profile methods have several drawbacks: they are computationally expensive, they rely on homology detection between distant organisms, and they only identify whether the proteins being investigated are functionally related (part of a complex or in the same pathway), not whether they interact directly.
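The profile comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: ortholog detection is assumed to have been done already, the genome and protein names are hypothetical toy data, and the Hamming distance with a `max_mismatches` cutoff is one simple choice of similarity measure among many.

```python
from itertools import combinations

def build_profile(genomes_with_ortholog, genomes):
    """Return a tuple of 0/1 flags: 1 if the protein has an ortholog in that genome."""
    return tuple(1 if g in genomes_with_ortholog else 0 for g in genomes)

def hamming(a, b):
    """Number of genomes in which the two profiles disagree."""
    return sum(x != y for x, y in zip(a, b))

def similar_profile_pairs(presence, genomes, max_mismatches=0):
    """List protein pairs whose profiles differ in at most max_mismatches genomes."""
    profiles = {p: build_profile(gs, genomes) for p, gs in presence.items()}
    return [
        (p, q)
        for p, q in combinations(sorted(profiles), 2)
        if hamming(profiles[p], profiles[q]) <= max_mismatches
    ]

# Hypothetical toy data: which genomes contain an ortholog of each protein.
GENOMES = ["genome1", "genome2", "genome3", "genome4"]
PRESENCE = {
    "protA": {"genome1", "genome2"},
    "protB": {"genome1", "genome2"},
    "protC": {"genome3", "genome4"},
    "protD": {"genome1", "genome3", "genome4"},
}

exact_matches = similar_profile_pairs(PRESENCE, GENOMES)
near_matches = similar_profile_pairs(PRESENCE, GENOMES, max_mismatches=1)
```

With this toy data, only protA and protB share an identical profile; allowing one mismatch also links protC and protD. As the text notes, a shared profile suggests a functional relationship, not necessarily a direct physical interaction.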
Prediction of co-evolved protein pairs based on similar phylogenetic trees
It has been observed that the phylogenetic trees of ligands and their receptors are often more similar than would be expected by chance, probably because the two faced similar selection pressures and co-evolved. This method uses the phylogenetic trees of protein pairs to determine whether they interact. Homologs of the proteins of interest are found (using a sequence search tool such as BLAST), and multiple sequence alignments are built (with alignment tools such as Clustal) to produce a distance matrix for each protein of interest. The distance matrices could then be used to build phylogenetic trees, but comparing trees directly is difficult, so current methods compare the distance matrices instead. A correlation coefficient is calculated between the distance matrices of the two proteins, with a larger value indicating stronger evidence of co-evolution. The benefit of comparing distance matrices rather than phylogenetic trees is that the results do not depend on the tree-building method. The downside is that distance matrices are not perfect representations of phylogenetic trees, so the shortcut may introduce inaccuracies. Another factor is that the phylogenetic trees of any two proteins share some background similarity, even when the proteins do not interact, because both reflect the evolutionary history of the species they come from. If left unaccounted for, this could lead to a high false-positive rate. For this reason, some methods construct a background tree from 16S rRNA sequences, which serves as the canonical tree of life. The distance matrix derived from this tree is then subtracted from the distance matrices of the proteins of interest. However, because the RNA distance matrix and the protein distance matrices are on different scales, presumably because RNA and protein sequences accumulate changes at different rates, the RNA matrix must be rescaled before it can be subtracted. The scaling coefficient (protein distance/RNA distance) can be calculated using molecular clock proteins and is then used to rescale the RNA matrix.
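The distance-matrix comparison can be illustrated with a short Python sketch. It assumes the two distance matrices are already computed, are symmetric, and list the same species in the same order; the Pearson correlation is used as the correlation coefficient, and the function names, the `scale` parameter and the toy matrices are all hypothetical.

```python
import math

def upper_triangle(matrix):
    """Flatten the entries above the diagonal of a square distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant distances")
    return cov / math.sqrt(var_x * var_y)

def subtract_background(dist, background, scale):
    """Subtract the rescaled background (e.g. 16S rRNA) distances entry by entry."""
    return [
        [d - scale * b for d, b in zip(row_d, row_b)]
        for row_d, row_b in zip(dist, background)
    ]

def coevolution_score(dist_a, dist_b, background=None, scale=1.0):
    """Correlate two distance matrices, optionally after background correction."""
    if background is not None:
        dist_a = subtract_background(dist_a, background, scale)
        dist_b = subtract_background(dist_b, background, scale)
    return pearson(upper_triangle(dist_a), upper_triangle(dist_b))

# Hypothetical toy matrices over the same three species.
DIST_A = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
DIST_B = [[0, 2, 4], [2, 0, 6], [4, 6, 0]]   # proportional to DIST_A
DIST_C = [[0, 3, 2], [3, 0, 1], [2, 1, 0]]   # reversed ordering of distances

score_ab = coevolution_score(DIST_A, DIST_B)
score_ac = coevolution_score(DIST_A, DIST_C)
```

Here DIST_A and DIST_B are perfectly correlated while DIST_A and DIST_C are anti-correlated. In practice the `scale` value would be the coefficient derived from molecular clock proteins, which this sketch does not estimate.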
Rosetta stone (gene fusion) method
A Rosetta stone protein is a protein chain composed of two fused proteins. It has been observed that proteins or domains that interact with one another tend to have homologs in other genomes that are fused into a single Rosetta stone protein, such as might arise by gene fusion when two previously separate genes form a new composite one. This evolutionary mechanism can be used to predict protein interactions. If two proteins are separate in one organism but fused in another, they are likely to interact in the organism where they are expressed as two separate products. The STRING database makes use of this to predict protein–protein interactions. Gene fusion has been extensively studied and large amounts of data are available. Nonetheless, like phylogenetic profile methods, the Rosetta stone method does not necessarily find interacting proteins, as there can be other reasons for the fusion of two proteins, such as optimizing their co-expression. The most obvious drawback of this method is that it relies on the presence of Rosetta stone proteins, so many protein interactions cannot be discovered this way.
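The gene-fusion inference can be sketched as a simple lookup. This sketch assumes homology detection has already mapped each protein to one or more family identifiers; the protein names, family identifiers and the `rosetta_stone_links` function are hypothetical.

```python
from itertools import combinations

def rosetta_stone_links(query_proteins, reference_proteins):
    """Find pairs of separate query proteins whose families are fused in a reference protein.

    query_proteins: dict mapping protein name -> family id (one family per protein)
    reference_proteins: dict mapping protein name -> tuple of family ids in that chain
    Returns a dict mapping (protein, protein) -> sorted list of fused reference proteins.
    """
    links = {}
    for (p, fam_p), (q, fam_q) in combinations(sorted(query_proteins.items()), 2):
        if fam_p == fam_q:
            continue  # paralogs of the same family are not evidence of fusion
        fused = sorted(
            name
            for name, families in reference_proteins.items()
            if fam_p in families and fam_q in families
        )
        if fused:
            links[(p, q)] = fused
    return links

# Hypothetical toy data: three separate proteins in the query genome and
# two proteins in a reference genome, one of which is a fusion.
QUERY = {"geneX": "famA", "geneY": "famB", "geneZ": "famC"}
REFERENCE = {"fusedXY": ("famB", "famA"), "single": ("famC",)}

predicted = rosetta_stone_links(QUERY, REFERENCE)
```

With this toy data the only predicted link is between geneX and geneY, supported by the fused protein fusedXY. As the text cautions, such a link is suggestive rather than conclusive.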
Classification methods
Classification methods use data to train a program (classifier) to distinguish positive examples of interacting protein/domain pairs from negative examples of non-interacting pairs. Popular classifiers include random decision forests (RDF) and support vector machines. RDF produces results based on the domain composition of interacting and non-interacting protein pairs. When given a protein pair to classify, RDF first represents the pair as a vector. The vector has one entry for each domain type used to train RDF, with a value of 0, 1, or 2: the value is 0 if neither protein in the pair contains the domain, 1 if one of the proteins contains it, and 2 if both do. Using training data, RDF constructs a decision forest consisting of many decision trees. Each decision tree evaluates several domains and, based on the presence or absence of these domains in the pair, decides whether the protein pair interacts. The vector representation of the protein pair is evaluated by each tree, and the forest tallies the votes of all the trees to reach a final decision. The strength of this method is that it does not assume that domains interact independently of each other, so multiple domains in a protein can contribute to the prediction. This is an improvement over earlier methods, which could only predict on the basis of a single domain pair. The limitation of this method is that it relies on the training dataset, so different training datasets may produce different results.
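The 0/1/2 vector encoding and the forest's vote can be sketched as follows. This is only an illustration: the domain vocabulary and protein domain sets are toy inputs, and the three "trees" are hand-written stand-ins, not trees learned from training data as a real random decision forest would produce.

```python
def encode_pair(domains_a, domains_b, domain_vocabulary):
    """Encode a protein pair: 0 = domain in neither protein, 1 = in one, 2 = in both."""
    return [int(d in domains_a) + int(d in domains_b) for d in domain_vocabulary]

def forest_vote(trees, vector):
    """Return True if a strict majority of trees classify the pair as interacting."""
    votes = sum(1 for tree in trees if tree(vector))
    return votes * 2 > len(trees)

# Toy vocabulary and a toy protein pair.
VOCABULARY = ["SH2", "SH3", "kinase", "PDZ"]
PROTEIN_A = {"SH2", "kinase"}
PROTEIN_B = {"SH3", "kinase"}

vector = encode_pair(PROTEIN_A, PROTEIN_B, VOCABULARY)

# Hand-written stand-ins for trained decision trees (hypothetical rules).
TOY_TREES = [
    lambda v: v[2] == 2,                 # both proteins carry the third domain
    lambda v: v[0] >= 1 and v[1] >= 1,   # first and second domains both present
    lambda v: v[3] >= 1,                 # fourth domain present
]

is_interacting = forest_vote(TOY_TREES, vector)
```

Because each tree can test several entries of the vector at once, the forest can capture combinations of domains instead of scoring a single domain pair in isolation.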
Inference of interactions from homologous structures
This group of methods makes use of known protein complex structures to predict and structurally model interactions between query protein sequences. The prediction process generally starts by employing a sequence-based method (e.g. Interolog) to search for protein complex structures that are homologous to the query sequences. These known complex structures are then used as templates to structurally model the interaction between the query sequences. This approach has the advantage of not only inferring protein interactions but also suggesting models of how the proteins interact structurally, which can provide insight into the atomic-level mechanism of the interaction. On the other hand, the ability of these methods to make a prediction is constrained by the limited number of known protein complex structures.
Association methods
Association methods look for characteristic sequences or motifs that can help distinguish between interacting and non-interacting pairs. A classifier is trained by looking for sequence-signature pairs in which one protein contains one sequence-signature and its interacting partner contains another. The methods look specifically for sequence-signatures that are found together more often than expected by chance. This is measured with a log-odds score computed as log2(Pij / (Pi × Pj)), where Pij is the observed frequency of domains i and j occurring in one protein pair, and Pi and Pj are the background frequencies of domains i and j in the data. Predicted domain interactions are those with positive log-odds scores that also occur several times within the database. The downside of this method is that it looks at each pair of interacting domains separately, assuming that they interact independently of each other.
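A small Python sketch shows how the log-odds score could be computed from data. The estimators are assumptions made for illustration: Pi is taken as the fraction of proteins carrying signature i, Pij as the fraction of interacting pairs in which i is on one partner and j on the other, and `min_count` stands in for the "several occurrences" requirement. The protein names, signatures and interactions are hypothetical.

```python
import math
from collections import Counter

def signature_frequencies(proteins):
    """Estimate P_i as the fraction of proteins that carry signature i."""
    counts = Counter(s for signatures in proteins.values() for s in signatures)
    return {s: c / len(proteins) for s, c in counts.items()}

def pair_counts(proteins, interactions):
    """Count interacting protein pairs in which i is on one partner and j on the other."""
    counts = Counter()
    for a, b in interactions:
        seen = {tuple(sorted((i, j))) for i in proteins[a] for j in proteins[b]}
        counts.update(seen)
    return counts

def log_odds_scores(proteins, interactions, min_count=2):
    """Compute log2(P_ij / (P_i * P_j)) for signature pairs seen at least min_count times."""
    p_single = signature_frequencies(proteins)
    counts = pair_counts(proteins, interactions)
    scores = {}
    for (i, j), count in counts.items():
        if count >= min_count:
            p_ij = count / len(interactions)
            scores[(i, j)] = math.log2(p_ij / (p_single[i] * p_single[j]))
    return scores

# Hypothetical toy data: signatures per protein and known interacting pairs.
PROTEINS = {
    "P1": {"A"}, "P2": {"B"}, "P3": {"A"}, "P4": {"B"},
    "P5": {"C"}, "P6": {"C"}, "P7": {"A"}, "P8": {"C"},
}
INTERACTIONS = [("P1", "P2"), ("P3", "P4"), ("P5", "P6"), ("P7", "P8")]

scores = log_odds_scores(PROTEINS, INTERACTIONS)
```

In this toy set only the pair (A, B) occurs at least twice, and its positive score marks it as a predicted interacting pair.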
Identification of structural patterns
This method builds a library of known protein–protein interfaces from the PDB, where an interface is defined as a pair of polypeptide fragments whose atoms lie within a distance threshold slightly larger than the van der Waals radii of the atoms involved. The sequences in the library are then clustered based on structural alignment, and redundant sequences are eliminated. Residues that occur at a given position with high frequency (generally >50%) are considered hotspots. This library is then used to identify potential interactions between pairs of targets, provided that they have a known structure (i.e. are present in the PDB).
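The hotspot criterion can be illustrated with a short sketch. It assumes the interface fragments in one cluster have already been structurally aligned into equal-length strings, with "-" marking gaps; the sequences and the `conserved_positions` function are hypothetical.

```python
from collections import Counter

def conserved_positions(aligned_fragments, threshold=0.5):
    """Return {position: residue} for columns where one residue exceeds the frequency threshold."""
    hotspots = {}
    for position, column in enumerate(zip(*aligned_fragments)):
        residue, count = Counter(column).most_common(1)[0]
        if residue != "-" and count / len(column) > threshold:
            hotspots[position] = residue
    return hotspots

# Hypothetical aligned interface fragments from one structural cluster.
CLUSTER = ["WKLY", "WRAY", "WEIH", "WDLY"]

hotspots = conserved_positions(CLUSTER)
```

In this toy cluster, positions 0 (W) and 3 (Y) exceed the 50% threshold, while position 2 sits exactly at 50% and is therefore not counted.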
Bayesian network modelling
Bayesian methods integrate data from a wide variety of sources, including both experimental results and prior computational predictions, and use these features to assess the likelihood that a particular potential protein interaction is a true positive result. These methods are useful because experimental procedures, particularly the yeast two-hybrid experiments, are extremely noisy and produce many false positives, while the previously mentioned computational methods can only provide circumstantial evidence that a particular pair of proteins might interact.
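One simple way to combine heterogeneous evidence is a naive Bayes calculation, sketched below. This is an assumption made for illustration, not the only Bayesian formulation: it treats the evidence sources as conditionally independent, and the prior, counts and likelihood ratios are hypothetical numbers.

```python
def likelihood_ratio(pos_with, pos_total, neg_with, neg_total):
    """Estimate P(evidence | interacting) / P(evidence | non-interacting) from counts."""
    return (pos_with / pos_total) / (neg_with / neg_total)

def posterior_probability(prior, likelihood_ratios):
    """Combine a prior with independent likelihood ratios: posterior odds = prior odds x product of ratios."""
    odds = prior / (1 - prior)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1 + odds)

# Hypothetical numbers: a 1% prior that a random pair interacts, and two
# evidence sources (for example an experimental hit and a computational prediction).
PRIOR = 0.01
lr_experiment = likelihood_ratio(pos_with=40, pos_total=100, neg_with=2, neg_total=100)
lr_prediction = likelihood_ratio(pos_with=50, pos_total=100, neg_with=10, neg_total=100)

posterior = posterior_probability(PRIOR, [lr_experiment, lr_prediction])
```

With these toy values the two likelihood ratios are about 20 and 5, which raises the probability of interaction from 1% to roughly 50%. Neither source alone is conclusive, which is the motivation for integrating them.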
Domain-pair exclusion analysis
The domain-pair exclusion analysis detects specific domain interactions that are hard to detect using Bayesian methods, which are good at detecting nonspecific, promiscuous interactions but less good at detecting rare, specific ones. The method calculates an E-score, which measures whether two domains interact: E = log(P(the two proteins interact | the domains interact) / P(the two proteins interact | the domains do not interact)). The probabilities required in the formula are calculated using an expectation-maximization procedure, a method for estimating parameters in statistical models. High E-scores indicate that the two domains are likely to interact, while low scores indicate that other domains from the protein pair are more likely to be responsible for the interaction. The drawback of this method is that it does not take into account false positives and false negatives in the experimental data.
Supervised learning problem
The problem of PPI prediction can be framed as a supervised learning problem. In this paradigm the known protein interactions supervise the estimation of a function that can predict whether an interaction exists or not between two proteins given data about the proteins (e.g., expression levels of each gene in different experimental conditions, location information, phylogenetic profile, etc.).
Relationship to docking methods
The field of protein–protein interaction prediction is closely related to the field of protein–protein docking, which attempts to use geometric and steric considerations to fit two proteins of known structure into a bound complex. This is a useful mode of inquiry in cases where both proteins in the pair have known structures and are known (or at least strongly suspected) to interact, but since so many proteins do not have experimentally determined structures, sequence-based interaction prediction methods are especially useful in conjunction with experimental studies of an organism's interactome.
See also
- Protein–protein interaction
- Macromolecular docking
- Protein–DNA interaction site predictor
- Two-hybrid screening
- Protein structure prediction software