Bayesian inference in phylogeny
Bayesian inference of phylogeny combines a likelihood function, computed under a model of evolution, with prior probabilities to produce the posterior probability of trees, from which the most probable phylogenetic tree for the given data can be identified. The Bayesian approach has become popular due to advances in computing speed and the integration of Markov chain Monte Carlo (MCMC) algorithms. Bayesian inference has a number of applications in molecular phylogenetics, systematics and evolutionary biology.
Bayesian Inference of Phylogeny Background and Bases
Bayesian inference of phylogeny has grown in popularity among phylogeneticists since the 1990s. Based on Bayes' theorem, it combines the prior probability of a tree (Pr[Tree]) with the likelihood of the data given that tree (Pr[Data|Tree]) to produce a posterior probability distribution on trees (Pr[Tree|Data]). The posterior probability of a tree is the probability that the tree is correct given the data, the model and the priors, and the tree with the highest posterior probability is the one chosen to best represent the phylogeny. Advantages over traditional parsimony and maximum likelihood methods include the ability to account for phylogenetic uncertainty, the use of prior information, and the incorporation of complex models of evolution that are computationally prohibitive for traditional methods.
Calculating the posterior probability nevertheless involves a summation over all trees and, for each tree, integration over all possible combinations of substitution model parameter values and branch lengths. The development of numerical methods to overcome this difficulty revolutionized Bayesian inference, with Markov chain Monte Carlo (MCMC) being the most widely used. The MCMC algorithm can be described in three steps. First, a new state for the Markov chain is proposed using a stochastic mechanism. Second, the acceptance probability of this new state is calculated. Third, a uniform random number on (0,1) is drawn; if it is less than the acceptance probability, the new state is accepted and the state of the chain is updated, otherwise the chain remains in its current state. This process is repeated thousands or millions of times. The proportion of time that a single tree is visited during the course of the chain is a valid approximation of its posterior probability.
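The three MCMC steps can be sketched on a toy problem. This is a minimal illustration, not a phylogenetic implementation: the three tree labels and their unnormalized posterior weights are made-up values, the function name `metropolis_over_trees` is hypothetical, and the proposal simply picks one of the other trees uniformly (a symmetric proposal, so no correction term is needed).

```python
import random
from collections import Counter

def metropolis_over_trees(weights, n_steps, seed=0):
    """Toy Metropolis sampler over a finite set of labelled trees.

    `weights` maps a tree label to an unnormalized posterior weight
    (hypothetical numbers; a real analysis would use prior x likelihood).
    Returns the fraction of steps spent on each tree.
    """
    rng = random.Random(seed)
    trees = list(weights)
    current = trees[0]
    visits = Counter()
    for _ in range(n_steps):
        # Step 1: propose a new state with a symmetric stochastic mechanism.
        proposal = rng.choice([t for t in trees if t != current])
        # Step 2: compute the acceptance probability of the proposal.
        accept_prob = min(1.0, weights[proposal] / weights[current])
        # Step 3: draw a uniform number on (0, 1); accept if it is smaller.
        if rng.random() < accept_prob:
            current = proposal
        visits[current] += 1
    return {t: visits[t] / n_steps for t in trees}

# Assumed toy weights; their normalized values are 0.6, 0.3 and 0.1.
toy_weights = {"((A,B),C)": 6.0, "((A,C),B)": 3.0, "((B,C),A)": 1.0}
freqs = metropolis_over_trees(toy_weights, n_steps=200_000)
```

With enough steps, the visit frequencies in `freqs` should approach the normalized weights, which is the sense in which the time spent on a tree approximates its posterior probability.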
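Recall that for Bayesian inference:
$$p(\theta \mid D) = \frac{p(D \mid \theta)\,p(\theta)}{p(D)}$$
The denominator is the marginal probability of the data, averaged over all possible parameter values weighted by their prior distribution. Formally,
$$p(D) = \int_{\Theta} p(D \mid \theta)\,p(\theta)\,d\theta$$
where $\Theta$ is the parameter space for $\theta$.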
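When the parameter space is a small finite set of trees, the integral in the denominator reduces to a sum and Bayes' theorem can be applied directly. The sketch below assumes three hypothetical trees with a flat prior and made-up likelihood values; the function name `posterior_over_trees` is illustrative only.

```python
def posterior_over_trees(prior, likelihood):
    """Apply Bayes' theorem on a finite set of candidate trees."""
    # Numerator of Bayes' theorem for each tree: prior x likelihood.
    joint = {tree: prior[tree] * likelihood[tree] for tree in prior}
    # Denominator: marginal probability of the data (a sum, since the
    # parameter space here is finite).
    marginal = sum(joint.values())
    return {tree: joint[tree] / marginal for tree in joint}

# Assumed values for illustration only.
prior = {"T1": 1 / 3, "T2": 1 / 3, "T3": 1 / 3}
likelihood = {"T1": 2e-5, "T2": 1e-5, "T3": 1e-5}
post = posterior_over_trees(prior, likelihood)
```

In real analyses the sum runs over an astronomically large number of trees and also involves continuous parameters, which is why the marginal is not computed directly and MCMC is used instead.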
The LOCAL algorithm of Larget and Simon
The LOCAL algorithm begins by selecting an internal branch of the tree at random. The nodes at the ends of this branch are each connected to two other branches. One of each pair is chosen at random. Imagine taking these three selected edges and stringing them like a clothesline from left to right, where the direction (left/right) is also selected at random. Each of the two endpoints of the first branch selected has a sub-tree hanging from it, like a piece of clothing on the line. The algorithm proceeds by multiplying the lengths of the three selected branches by a common random amount, akin to stretching or shrinking the clothesline. Finally, the leftmost of the two hanging sub-trees is disconnected and reattached to the clothesline at a location selected uniformly at random. This is the candidate tree.
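Suppose we began by selecting the internal branch with length $t_8$ (in Figure (a) (to be added)) that separates taxa $A$ and $B$ from the rest. Suppose also that we have (randomly) selected branches with lengths $t_1$ and $t_9$ from each side, and that we oriented these branches as shown in Figure (b). Let $m = t_1 + t_8 + t_9$ be the current length of the clothesline. We select the new length to be $m^{\star} = m \exp(\lambda (U_1 - 0.5))$, where $U_1$ is a uniform random variable on $(0,1)$. Then for the LOCAL algorithm, the acceptance probability can be computed to be:
$$\min\left(1, \frac{h(y)}{h(x)} \times \frac{(m^{\star})^3}{m^3}\right)$$
where $h$ is the unnormalized posterior density, $x$ is the current tree and $y$ is the candidate tree.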
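The clothesline move can be sketched without a full tree data structure by tracking only the three branch lengths. In this hedged sketch, the function name `local_proposal`, the tuning parameter `lam`, the stretch factor of the form exp(lam × (U − 0.5)) and the cubic proposal factor are assumptions made for illustration, not a definitive implementation of the published algorithm.

```python
import math
import random

def local_proposal(a, b, c, lam, rng):
    """One LOCAL-style move on a three-branch 'clothesline'.

    a, b, c are the left, middle (internal) and right branch lengths.
    lam is an assumed tuning parameter controlling the stretch factor.
    Returns (new_lengths, topology_changed, proposal_ratio).
    """
    m = a + b + c
    # Stretch or shrink all three branches by a common random factor.
    factor = math.exp(lam * (rng.random() - 0.5))
    m_star = m * factor
    # Position of the right-hand sub-tree after rescaling; it stays put.
    y = (a + b) * factor
    # Detach the left-hand sub-tree and reattach it uniformly on the line.
    u = rng.uniform(0.0, m_star)
    if u < y:
        # Same left-to-right order of sub-trees: topology is unchanged.
        new_lengths = (u, y - u, m_star - y)
        topology_changed = False
    else:
        # The moved sub-tree now hangs to the right of the other one.
        new_lengths = (y, u - y, m_star - u)
        topology_changed = True
    # Assumed proposal (Hastings) factor multiplying the posterior ratio.
    proposal_ratio = (m_star / m) ** 3
    return new_lengths, topology_changed, proposal_ratio

rng = random.Random(1)
lengths, changed, ratio = local_proposal(0.1, 0.05, 0.2, lam=0.5, rng=rng)
```

The returned `proposal_ratio` would be multiplied by the ratio of unnormalized posterior densities of the candidate and current trees to obtain the acceptance probability.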
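Suppose we want to estimate a branch length $t$ of a 2-taxon tree under JC, in which $n_1$ sites are unvaried and $n_2$ are variable. Assume an exponential prior distribution with rate $\lambda$. The density is $p(t) = \lambda e^{-\lambda t}$. The probabilities of the possible site patterns are:
$$\frac{1}{4}\left(\frac{1}{4} + \frac{3}{4} e^{-4t/3}\right)$$
for unvaried sites, and
$$\frac{1}{4}\left(\frac{1}{4} - \frac{1}{4} e^{-4t/3}\right)$$
for variable sites. Thus the unnormalized posterior distribution is:
$$h(t) = \left(\frac{1}{4}\right)^{n_1+n_2} \left(\frac{1}{4} + \frac{3}{4} e^{-4t/3}\right)^{n_1} \left(\frac{1}{4} - \frac{1}{4} e^{-4t/3}\right)^{n_2} \lambda e^{-\lambda t}$$
Update the branch length by choosing a new value uniformly at random from a window of half-width $w$ centered at the current value:
$$t^{\star} = |t + U|$$
where $U$ is uniformly distributed between $-w$ and $w$. The acceptance probability is:
$$\min\left(1, \frac{h(t^{\star})}{h(t)}\right)$$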
Example: the results can be compared for two values of the window half-width. In each case, the chain begins from the same initial branch length and the length is updated a fixed number of times. (See Figure 3.2 (to be added) for results.)
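The branch-length sampler described above can be sketched as follows. All numerical settings (site counts of 70 and 30, prior rate 1.0, half-widths 0.1 and 0.5, starting length 2.0, 20000 updates, burn-in of 2000) are illustrative assumptions chosen for this sketch, and the function names are hypothetical. The code works with the logarithm of the unnormalized posterior and drops the constant factor, which cancels in the acceptance ratio.

```python
import math
import random

def log_h(t, n1, n2, lam):
    """Log of the unnormalized posterior for the 2-taxon JC example.

    Assumes n2 > 0; the constant (1/4)**(n1 + n2) is omitted because it
    cancels in the Metropolis ratio.
    """
    e = math.exp(-4.0 * t / 3.0)
    same = 0.25 + 0.75 * e   # factor contributed by each unvaried site
    diff = 0.25 - 0.25 * e   # factor contributed by each variable site
    if diff <= 0.0:
        return -math.inf     # t = 0 cannot explain any variable site
    return n1 * math.log(same) + n2 * math.log(diff) + math.log(lam) - lam * t

def sample_branch_length(n1, n2, lam, w, t0, n_updates, seed=0):
    """Metropolis sampler with a sliding window reflected at zero."""
    rng = random.Random(seed)
    t = t0
    chain = [t]
    for _ in range(n_updates):
        # Proposal: t* = |t + U| with U uniform on (-w, w).
        t_star = abs(t + rng.uniform(-w, w))
        log_ratio = log_h(t_star, n1, n2, lam) - log_h(t, n1, n2, lam)
        # Accept with probability min(1, h(t*) / h(t)).
        if rng.random() < math.exp(min(0.0, log_ratio)):
            t = t_star
        chain.append(t)
    return chain

burn = 2000  # assumed burn-in length
chain_narrow = sample_branch_length(70, 30, lam=1.0, w=0.1, t0=2.0, n_updates=20000)
chain_wide = sample_branch_length(70, 30, lam=1.0, w=0.5, t0=2.0, n_updates=20000)
mean_narrow = sum(chain_narrow[burn:]) / len(chain_narrow[burn:])
mean_wide = sum(chain_wide[burn:]) / len(chain_wide[burn:])
```

Both window sizes target the same posterior, so the two post-burn-in means should agree closely; what differs is how quickly each chain leaves its starting value and how often proposals are accepted.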
Metropolis-coupled MCMC (Geyer)
If the target distribution has multiple peaks, separated by low valleys, the Markov chain may have difficulty in moving from one peak to another. As a result, the chain may get stuck on one peak and the resulting samples will not approximate the posterior density correctly. This is a serious practical concern for phylogeny reconstruction, as multiple local peaks are known to exist in the tree space during heuristic tree search under maximum parsimony (MP), maximum likelihood (ML), and minimum evolution (ME) criteria, and the same can be expected for stochastic tree search using MCMC. Many strategies have been proposed to improve mixing of Markov chains in the presence of multiple local peaks in the posterior density. One of the most successful algorithms is the Metropolis-coupled MCMC (or $\mathrm{MC}^3$).
In this algorithm, $m$ chains are run in parallel, with different stationary distributions $\pi_j(\cdot)$, $j = 1, 2, \ldots, m$, where the first one, $\pi_1 = \pi$, is the target density, while $\pi_j$, $j = 2, 3, \ldots, m$, are chosen to improve mixing. For example, one can choose incremental heating of the form:
$$\pi_j(\theta) = \pi(\theta)^{1/[1 + \lambda (j-1)]}, \quad \lambda > 0,$$
so that the first chain is the cold chain with the correct target density, while chains $2, 3, \ldots, m$ are heated chains. Note that raising the density $\pi(\cdot)$ to the power $1/T$ with $T > 1$ has the effect of flattening out the distribution, similar to heating a metal. In such a distribution, it is easier to traverse between peaks (separated by valleys) than in the original distribution. After each iteration, a swap of states between two randomly chosen chains is proposed through a Metropolis-type step. Let $\theta^{(j)}$ be the current state in chain $j$, $j = 1, 2, \ldots, m$. A swap between the states of chains $i$ and $j$ is accepted with probability:
$$\alpha = \min\left(1, \frac{\pi_i(\theta^{(j)})\,\pi_j(\theta^{(i)})}{\pi_i(\theta^{(i)})\,\pi_j(\theta^{(j)})}\right)$$
At the end of the run, output from only the cold chain is used, while those from the hot chains are discarded. Heuristically, the hot chains will visit the local peaks rather easily, and swapping states between chains will let the cold chain occasionally jump valleys, leading to better mixing. However, if $\pi_i(\theta)/\pi_j(\theta)$ is unstable, proposed swaps will seldom be accepted. This is the reason for using several chains which differ only incrementally. (See Figure 3.3 (to be added)).
An obvious disadvantage of the algorithm is that $m$ chains are run and only one chain is used for inference. For this reason, $\mathrm{MC}^3$ is ideally suited for implementation on parallel machines, since each chain will in general require the same amount of computation per iteration.
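A minimal sketch of Metropolis-coupled MCMC on a one-dimensional toy target may help make the heating and swapping steps concrete. Everything specific here is an assumption for illustration: the two-peaked target (peaks at −3 and 3), the function names, the heating parameter `heat`, the window size `step`, and the number of chains. Chains are updated sequentially rather than in parallel.

```python
import math
import random

def log_target(x):
    """Unnormalized log density with two separated peaks (toy target)."""
    a = -0.5 * ((x + 3.0) / 0.7) ** 2
    b = -0.5 * ((x - 3.0) / 0.7) ** 2
    hi = max(a, b)
    return hi + math.log(math.exp(a - hi) + math.exp(b - hi))

def mc3(n_iter, n_chains=4, heat=1.0, step=1.0, seed=0):
    """Run coupled chains and return the samples of the cold chain only."""
    rng = random.Random(seed)
    # Chain j (0-based) targets pi(x) ** beta[j]; beta[0] = 1 is the cold chain.
    beta = [1.0 / (1.0 + heat * j) for j in range(n_chains)]
    state = [-3.0] * n_chains
    cold_samples = []
    for _ in range(n_iter):
        # Within-chain Metropolis update on each (heated) target.
        for j in range(n_chains):
            prop = state[j] + rng.uniform(-step, step)
            log_r = beta[j] * (log_target(prop) - log_target(state[j]))
            if rng.random() < math.exp(min(0.0, log_r)):
                state[j] = prop
        # Propose swapping the states of two randomly chosen chains.
        i, j = rng.sample(range(n_chains), 2)
        log_r = (beta[i] - beta[j]) * (log_target(state[j]) - log_target(state[i]))
        if rng.random() < math.exp(min(0.0, log_r)):
            state[i], state[j] = state[j], state[i]
        # Only the cold chain contributes to the output.
        cold_samples.append(state[0])
    return cold_samples

samples = mc3(n_iter=30_000)
frac_right = sum(x > 0 for x in samples) / len(samples)
```

Because the toy target is symmetric, a well-mixed cold chain should spend roughly half of its time on each peak, so `frac_right` gives a quick check on whether swaps are carrying the cold chain across the valley.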
Pitfalls and controversies
- Bootstrap values vs posterior probabilities. It has been observed that bootstrap support values, calculated under parsimony or maximum likelihood, tend to be lower than the posterior probabilities obtained by Bayesian inference. This observation raises a number of questions, such as: Do posterior probabilities lead to overconfidence in the results? Are bootstrap values more robust than posterior probabilities?
- Controversy over using prior probabilities. Using prior probabilities in Bayesian analysis has been seen by many as an advantage, as it allows a hypothesis to reflect a more realistic view of the real world. However, some biologists question the subjectivity that these priors introduce into Bayesian posterior probabilities.
- Model choice. The results of a Bayesian phylogenetic analysis depend directly on the model of evolution chosen, so it is important to choose a model that fits the observed data; otherwise, inferences about the phylogeny will be erroneous. Many scientists have raised questions about the interpretation of Bayesian inference when the model is unknown or incorrect. For example, an oversimplified model might give inflated posterior probabilities, and simple evolutionary models are associated with less uncertainty than is indicated by bootstrap values.
MRBAYES software for Bayesian Inference of Phylogeny
MrBayes is a free software tool that performs Bayesian inference of phylogeny. It was originally written by John P. Huelsenbeck and Fredrik Ronquist in 2001. As Bayesian methods increased in popularity, MrBayes became one of the programs of choice for many molecular phylogeneticists. It is offered for Macintosh, Windows, and UNIX operating systems and has a command-line interface. The program uses the standard MCMC algorithm as well as the Metropolis-coupled MCMC variant. MrBayes reads aligned matrices of sequences (DNA or amino acids) in the standard NEXUS format.
MrBayes uses MCMC to approximate the posterior probabilities of trees. The user can change the assumptions of the substitution model, the priors and the details of the Metropolis-coupled MCMC ($\mathrm{MC}^3$) analysis. It also allows the user to remove and add taxa and characters to the analysis. The program implements standard models of DNA substitution and offers different methods for relaxing the assumption of equal rates across sites. MrBayes is also able to infer ancestral states while accommodating uncertainty in the phylogenetic tree and model parameters.
MrBayes 3 was a completely reorganized and restructured version of the original MrBayes. The main novelty was the ability of the software to accommodate heterogeneity of data sets. This framework allows the user to mix models and take advantage of the efficiency of Bayesian MCMC analysis when dealing with composite data sets. It uses Metropolis-coupled MCMC by default.
MrBayes 3.2, a new version of MrBayes, was released in 2012. The new version allows users to run multiple analyses in parallel. It also provides faster likelihood calculations and allows these calculations to be delegated to graphics processing units (GPUs). Version 3.2 provides a wider range of output options compatible with FigTree and other tree viewers.
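As an illustration of the NEXUS input that MrBayes reads, the sketch below builds a small file containing a data block and a MrBayes command block. The helper name `write_mrbayes_nexus`, the toy alignment and the chosen settings are assumptions for this example; the commands used (`lset`, `mcmc`, `sump`, `sumt`) and their options should be checked against the manual for the MrBayes version in use.

```python
def write_mrbayes_nexus(alignment, ngen=10000, samplefreq=100, nchains=4):
    """Return NEXUS text with a DATA block and a MRBAYES command block.

    `alignment` maps taxon names to aligned DNA sequences of equal length.
    """
    ntax = len(alignment)
    nchar = len(next(iter(alignment.values())))
    if any(len(seq) != nchar for seq in alignment.values()):
        raise ValueError("all sequences must have the same aligned length")
    lines = [
        "#NEXUS",
        "begin data;",
        f"  dimensions ntax={ntax} nchar={nchar};",
        "  format datatype=dna missing=? gap=-;",
        "  matrix",
    ]
    lines += [f"  {name}  {seq}" for name, seq in alignment.items()]
    lines += [
        "  ;",
        "end;",
        "begin mrbayes;",
        # Assumed model choice: six substitution types, gamma rate variation.
        "  lset nst=6 rates=gamma;",
        # Chain length, sampling interval and number of coupled chains.
        f"  mcmc ngen={ngen} samplefreq={samplefreq} nchains={nchains};",
        # Summarize sampled parameters and trees after the run.
        "  sump;",
        "  sumt;",
        "end;",
    ]
    return "\n".join(lines) + "\n"

# Made-up four-taxon alignment, for illustration only.
toy_alignment = {
    "taxon_A": "ACGTACGT",
    "taxon_B": "ACGTACGA",
    "taxon_C": "ACGAACGA",
    "taxon_D": "TCGAACGA",
}
nexus_text = write_mrbayes_nexus(toy_alignment)
```

The resulting text could be saved to a file and executed from within MrBayes; with more than one chain requested, the run uses the Metropolis-coupled variant described earlier.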
List of phylogenetics software for Bayesian Inference of Phylogeny
This list includes some of the most common phylogenetic software packages used for inferring phylogenies under a Bayesian framework. Some of them do not use exclusively Bayesian methods.
|Name||Description||Methods||Authors|
|Armadillo Workflow Platform||Workflow platform dedicated to phylogenetic and general bioinformatic analysis||Inference of phylogenetic trees using Distance, Maximum Likelihood, Maximum Parsimony, Bayesian methods and related workflows||E. Lord, M. Leclercq, A. Boc, A. B. Diallo and V. Makarenkov|
|Bali-Phy||Simultaneous Bayesian inference of alignment and phylogeny||Bayesian inference, alignment as well as tree search||M. A. Suchard, B. D. Redelings|
|BATWING||Bayesian Analysis of Trees With Internal Node Generation||Bayesian inference, demographic history, population splits||I. J. Wilson, Weale, D. Balding|
|Bayes Phylogenies||Bayesian inference of trees using Markov Chain Monte Carlo methods||Bayesian inference, multiple models, mixture model (auto-partitioning)||M. Pagel, A. Meade|
|BEAST||Bayesian Evolutionary Analysis Sampling Trees||Bayesian inference, relaxed molecular clock, demographic history||A. J. Drummond, A. Rambaut and M. A. Suchard|
|BUCKy||Bayesian concordance of gene trees||Bayesian concordance using modified greedy consensus of unrooted quartets||C. Ané, B. Larget, D. A. Baum, S. D. Smith, A. Rokas and B. Larget, S. K. Kotha, C. N. Dewey, C. Ané|
|Geneious (MrBayes plugin)||Geneious provides genome and proteome research tools||Neighbor-joining, UPGMA, MrBayes plugin, PHYML plugin, RAxML plugin, FastTree plugin, GARLi plugin, PAUP* plugin||A. J. Drummond, M. Suchard, V. Lefort et al.|
|PAML||Phylogenetic analysis by maximum likelihood||Maximum likelihood and Bayesian inference||Z. Yang|
|TOPALi||Phylogenetic inference||Phylogenetic model selection, Bayesian analysis and Maximum Likelihood phylogenetic tree estimation, detection of sites under positive selection, and recombination breakpoint location analysis||I. Milne, D. Lindner, et al.|
Applications of Bayesian Inference of Phylogeny
Bayesian inference has been used extensively by molecular phylogeneticists for a wide range of applications. Some of these include:
- Inference of phylogenies
- Inference and evaluation of uncertainty of phylogenies
- Inference of ancestral states and character evolution
- Molecular dating analysis
- Modeling the dynamics of species diversification and extinction
External links
- MrBayes Official Website http://mrbayes.sourceforge.net/
- BEAST Official Website http://beast.bio.ed.ac.uk/
References
- Larget, B., and D. L. Simon. 1999. Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Mol. Biol. Evol. 16:750–759.
- Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. 1953. Equation of state calculations by fast computing machines. J. Chem. Phys. 21:1087–1092.
- Garcia-Sandoval, R. 2014. Why some clades have low bootstrap frequencies and high Bayesian posterior probabilities. Israel Journal of Ecology & Evolution 60(1):41–44.
- Suzuki, Y. et al. 2002. Overcredibility of molecular phylogenies obtained by Bayesian phylogenetics. Proc. Natl. Acad. Sci. U.S.A. 99:16138–16143.
- Erixon, P. et al. 2003. Reliability of Bayesian posterior probabilities and bootstrap frequencies in phylogenetics. Syst. Biol. 52:665–673.
- Nylander, J. A. A. 2004. MrModeltest 2.0. Program distributed by the author. Evolutionary Biology Centre, Uppsala University. Norbyvagen 18 D. SE-752 36, Uppsala, Sweden.
- Huelsenbeck, J. P. and F. Ronquist. 2001. MrBayes: Bayesian inference of phylogeny. Bioinformatics 17:754–755.
- Maddison, D. R., D. L. Swofford, and W. P. Maddison. 1997. NEXUS: an extensible file format for systematic information. Syst. Biol. 46:590–621.
- Yang, Z. 1994. Estimating the pattern of nucleotide substitution. J. Mol. Evol. 39: 105-111.
- Yang, Z. 1993. Maximum likelihood estimation of phylogeny from DNA sequences when substitutions rates differ over sites. Mol. Biol. Evol. 10: 1396-1401.
- Ronquist, F., and J. P. Huelsenbeck. 2003. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19:1572–1
- Ronquist, F., M. Teslenko, P. van der Mark, D. L. Ayres, A. Darling, S. Höhna, B. Larget, L. Liu, M. A. Suchard, and J. P. Huelsenbeck. 2012. MrBayes 3.2: Efficient Bayesian phylogenetic inference and model choice across a large model space. Syst. Biol. 61:539–542.