PHYLIP

PHYLogeny Inference Package
Original author(s): Joseph Felsenstein
Developer(s): University of Washington
Initial release: October 1980
Stable release: 3.697 / 2 November 2014
Written in: C
Operating system: Windows, Mac OS X, Linux
Platform: x86, x86-64
Available in: English
Type: Phylogenetics
License: v3.696 and later: open-source
v3.695 and earlier: proprietary freeware
Website: evolution.genetics.washington.edu/phylip.html

PHYLogeny Inference Package (PHYLIP) is a free computational phylogenetics package of programs for inferring evolutionary trees (phylogenies).[1] It consists of 35 programs whose source code is written in the programming language C, which makes them portable across platforms. As of version 3.696, it is licensed as open-source software; versions 3.695 and older were proprietary freeware. Releases are distributed as source code and as precompiled executables for many operating systems, including Windows (95, 98, ME, NT, 2000, XP, Vista), Mac OS 8, Mac OS 9, OS X, and Linux (Debian, Red Hat); a FreeBSD version is available from FreeBSD.org.[2] Full documentation for all the programs is included in the package. The programs in the PHYLIP package were written by Professor Joseph Felsenstein of the Department of Genome Sciences and the Department of Biology, University of Washington, Seattle.[3]

The methods available in the package, each implemented by one or more programs, include parsimony, distance matrix, and likelihood methods, as well as bootstrapping and consensus trees. Data types that can be handled include molecular sequences, gene frequencies, restriction sites and fragments, distance matrices, and discrete characters.[2]

Each program is controlled through a menu, which asks users which options they want to set and allows them to start the computation. The data are read into the program from a text file, which the user can prepare using any word processor or text editor; the file must be saved as flat ASCII (text-only) rather than in the word processor's own format. Some sequence analysis programs, such as the ClustalW alignment program, can write data files in the PHYLIP format. Most of the programs look for the data in a file called infile. If the PHYLIP programs do not find this file, they ask the user to type in the file name of the data file.[2]

File format

The component programs of phylip use several different formats, all of which are relatively simple. Programs for the analysis of DNA sequence alignments, protein sequence alignments, or discrete characters (e.g., morphological data) can accept those data in sequential or interleaved format, as shown below.

Sequential format:

5 42 
Turkey    AAGCTNGGGC ATTTCAGGGT GAGCCCGGGC AATACAGGGT AT 
Salmo schiAAGCCTTGGC AGTGCAGGGT GAGCCGTGGC CGGGCACGGT AT 
H. sapiensACCGGTTGGC CGTTCAGGGT ACAGGTTGGC CGTTCAGGGT AA 
Chimp     AAACCCTTGC CGTTACGCTT AAACCGAGGC CGGGACACTC AT 
Gorilla   AAACCCTTGC CGGTACGCTT AAACCATTGC CGGTACGCTT AA

Interleaved format:

5 42 
Turkey    AAGCTNGGGC ATTTCAGGGT 
Salmo schiAAGCCTTGGC AGTGCAGGGT 
H. sapiensACCGGTTGGC CGTTCAGGGT 
Chimp     AAACCCTTGC CGTTACGCTT 
Gorilla   AAACCCTTGC CGGTACGCTT

GAGCCCGGGC AATACAGGGT AT 
GAGCCGTGGC CGGGCACGGT AT 
ACAGGTTGGC CGTTCAGGGT AA 
AAACCGAGGC CGGGACACTC AT 
AAACCATTGC CGGTACGCTT AA

The numbers are the number of taxa (different species in the example shown above) followed by the number of characters (aligned nucleotides or amino acids in the case of molecular sequences). Restriction site data must include the number of enzymes as well.

Names are limited to 10 characters by default; each name must be padded with blanks to that length and is followed immediately by the character data in one-letter codes. The 10-character limit can be changed by a minor modification of the code (changing nmlngth in phylip.h and recompiling). All printable ASCII/ISO characters are allowed in names, except for parentheses ("(" and ")"), square brackets ("[" and "]"), colon (":"), semicolon (";") and comma (","). The spaces embedded in the alignment are ignored.

Many programs for phylogenetic analyses, including the commonly used RAxML[4][5] and IQ-TREE[6] programs, use the phylip format or a minor modification of that format called the relaxed phylip format.

Relaxed phylip format (sequential):

5 42 
Turkey                  AAGCTNGGGCATTTCAGGGTGAGCCCGGGCAATACAGGGTAT 
Salmo_schiefermuelleri  AAGCCTTGGCAGTGCAGGGTGAGCCGTGGCCGGGCACGGTAT 
H_sapiens               ACCGGTTGGCCGTTCAGGGTACAGGTTGGCCGTTCAGGGTAA 
Chimp                   AAACCCTTGCCGTTACGCTTAAACCGAGGCCGGGACACTCAT 
Gorilla                 AAACCCTTGCCGGTACGCTTAAACCATTGCCGGTACGCTTAA

The primary difference in relaxed phylip format is the absence of the 10-character limit and the removal of the need to blank-fill names to reach that length (although padding names so that the character matrix starts at the same position can improve readability for users). This example of relaxed phylip format uses underscores rather than spaces in the names and uses spaces between the names and the aligned character data; it is often good practice to avoid white space within taxon names and to separate the character data from the name when generating files. Like strict phylip format files, relaxed phylip format files can be in interleaved format and include spaces and endlines within the sequence data.
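A minimal sketch of a reader for the relaxed format follows, written in Python. The name read_relaxed_phylip is hypothetical. It assumes that names contain no white space, so the first run of white space on a line ends the name, and that any lines after the first block are interleaved continuation blocks. Sequential files that wrap one sequence over several lines are not handled.

```python
def read_relaxed_phylip(text):
    """Return a dict mapping taxon name -> sequence (hypothetical helper)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ntaxa, nchars = (int(tok) for tok in lines[0].split()[:2])
    if len(lines) < ntaxa + 1:
        raise ValueError("fewer taxon lines than declared in the header")
    names, seqs = [], []
    for ln in lines[1:ntaxa + 1]:
        # The name ends at the first whitespace; the rest is sequence data.
        parts = ln.split(None, 1)
        names.append(parts[0])
        seqs.append("".join(parts[1].split()) if len(parts) > 1 else "")
    # Any remaining lines are treated as interleaved blocks without names.
    for i, ln in enumerate(lines[ntaxa + 1:]):
        seqs[i % ntaxa] += "".join(ln.split())
    for name, seq in zip(names, seqs):
        if len(seq) != nchars:
            raise ValueError(f"{name}: expected {nchars} characters, got {len(seq)}")
    return dict(zip(names, seqs))
```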

The programs that use distance data, like the neighbor program that implements the neighbor-joining method, also use a simple distance matrix format that includes only the number of taxa, their names, and numerical values for the distances:

Phylip distance matrix:

7 
Bovine    0.0000 1.6866 1.7198 1.6606 1.5243 1.6043 1.5905 
Mouse     1.6866 0.0000 1.5232 1.4841 1.4465 1.4389 1.4629 
Gibbon    1.7198 1.5232 0.0000 0.7115 0.5958 0.6179 0.5583 
Orang     1.6606 1.4841 0.7115 0.0000 0.4631 0.5061 0.4710 
Gorilla   1.5243 1.4465 0.5958 0.4631 0.0000 0.3484 0.3083 
Chimp     1.6043 1.4389 0.6179 0.5061 0.3484 0.0000 0.2692 
Human     1.5905 1.4629 0.5583 0.4710 0.3083 0.2692 0.0000

The number on the first line is the number of taxa, and the same limitations on taxon names apply. Note that this matrix is symmetric and the diagonal has values of 0 (since the distance between a taxon and itself is zero by definition).

Programs that use trees as input accept the trees in Newick format, an informal standard agreed to in 1986 by authors of seven major phylogeny packages. Output is written to files with names like outfile and outtree. Trees written to outtree are in the Newick format.
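For readers unfamiliar with Newick, a tree such as ((Human:0.1,Chimp:0.2):0.05,Gorilla:0.3); nests groups in parentheses, separates siblings with commas, optionally gives a branch length after a colon, and ends with a semicolon. The Python sketch below parses that basic subset into nested tuples. The names parse_newick and leaf_names are hypothetical, and the sketch assumes unquoted labels and no bracketed comments.

```python
def parse_newick(text):
    """Parse basic Newick into nested (name, length, children) tuples."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick tree must end with a semicolon")
    pos = 0
    stops = "(),:;"

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # consume "("
            children.append(node())
            while text[pos] == ",":
                pos += 1  # consume ","
                children.append(node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ")"
        # Optional label (a leaf name or an internal node label).
        start = pos
        while text[pos] not in stops:
            pos += 1
        name = text[start:pos].strip()
        # Optional branch length after a colon.
        length = None
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in stops:
                pos += 1
            length = float(text[start:pos])
        return (name, length, children)

    root = node()
    if text[pos] != ";":
        raise ValueError(f"unexpected character at position {pos}")
    return root


def leaf_names(tree):
    """List the leaf labels of a parsed tree from left to right."""
    name, _, children = tree
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaf_names(child)]
```

The scanning loops always terminate because the final semicolon is itself a stop character.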

Component programs

Programs listed in PHYLIP[7]
Program name Description
protpars Estimates phylogenies of peptide sequences using the parsimony method
dnapars Estimates phylogenies of DNA sequences using the parsimony method
dnapenny DNA parsimony branch-and-bound method; finds all of the most parsimonious phylogenies for nucleic acid sequences by branch-and-bound search
dnamove Interactive construction of phylogenies from nucleic acid sequences, with their evaluation by DNA parsimony method, with compatibility and display of reconstructed ancestral bases
dnacomp Estimates phylogenies from nucleic acid sequence data using the compatibility criterion
dnaml Estimates phylogenies from nucleotide sequences using the maximum likelihood method
dnamlk DNA maximum likelihood method with molecular clock; using both dnaml and dnamlk together permits a likelihood-ratio test for the molecular clock hypothesis
proml Estimates phylogenies from protein amino acid sequences by using the maximum likelihood method
promlk Protein sequence maximum likelihood method with molecular clock
restml Estimation of phylogenies by maximum likelihood using restriction sites data; not from restriction fragments but from the presence or absence of individual sites
dnainvar For nucleic acid sequence data on four species, computes Lake's and Cavender's phylogenetic invariants, which test alternative tree topologies
dnadist DNA distance method which computes four different distances between species from nucleic acid sequences; distances can then be used in the distance matrix programs
protdist Protein sequence distance method which computes a distance measure for sequences, using maximum likelihood estimates based on the Dayhoff PAM matrix, Kimura's 1983 approximation to it, or a model based on genetic code plus a constraint on changing to a different category of amino acid
restdist Distances calculated from restriction sites data or restriction fragments data
seqboot Bootstrapping-jackknifing program; reads in a data set, and emits multiple data sets from it by bootstrap resampling
fitch Fitch-Margoliash distance matrix method; estimates phylogenies from distance matrix data under the additive tree model according to which the distances are expected to equal the sums of branch lengths between species
kitsch Fitch-Margoliash distance matrix method with molecular clock; estimates phylogenies from distance matrix data under the ultrametric model which is the same as the additive tree model except an evolutionary clock is assumed
neighbor Implements the neighbor-joining and UPGMA methods
contml Maximum likelihood continuous characters and gene frequencies; estimates phylogenies from gene frequency data by maximum likelihood under a model in which all divergence is due to genetic drift in the absence of new mutations; also does maximum likelihood analysis of continuous characters that evolve by a Brownian Motion model, assuming that the characters evolve at equal rates and in an uncorrelated fashion; does not account for character correlations
contrast Reads a tree from a tree file, and a data set with continuous characters data, and emits the independent contrasts for those characters, for use in any multivariate statistics package
gendist Genetic distance program which computes one of three different genetic distance formulas from gene frequency data
pars Unordered multistate discrete-characters parsimony method
mix Estimates phylogenies by some parsimony methods for discrete character data with two states (0, 1); supports the Wagner and Camin-Sokal methods, or arbitrary mixtures of them
penny Branch and bound mixed method which finds all of the most parsimonious phylogenies for discrete-character data with two states, for the Wagner, Camin-Sokal, and mixed parsimony criteria using the branch-and-bound method of exact search
move Interactive construction of phylogenies from discrete character data with two states (0, 1); evaluates parsimony and compatibility criteria for those phylogenies and displays reconstructed states throughout the tree
dollop Estimates phylogenies by the Dollo or polymorphism parsimony criteria for discrete character data with two states (0, 1)
dolpenny Finds all of the most parsimonious phylogenies for discrete-character data with two states, for the Dollo or polymorphism parsimony criteria using the branch-and-bound method of exact search
dolmove Interactive construction of phylogenies from discrete character data with two states (0, 1) using the Dollo or polymorphism parsimony criteria; evaluates parsimony and compatibility criteria for those phylogenies; displays reconstructed states throughout the tree
clique Finds the largest clique of mutually compatible characters, and the phylogeny which they recommend, for discrete character data with two states (0, 1); the largest clique (or all cliques within a given size range of the largest one) are found by a fast branch and bound search method
factor Character recoding program which takes discrete multistate data with character state trees and emits the corresponding data set with two states (0, 1)
drawgram Rooted tree drawing program which plots rooted phylogenies, cladograms, and phenograms in a wide variety of user-controllable formats. The program is interactive and allows previewing of the tree on PC or Macintosh graphics screens, and Tektronix or Digital graphics terminals.
drawtree Unrooted tree drawing program similar to drawgram, but plots unrooted phylogenies
consense Consensus tree program which computes trees by the majority-rule tree method, which also allows easily finding the strict consensus tree; unable to compute Adams consensus tree
treedist Computes the Robinson–Foulds symmetric difference distance between trees, which allows for differences in tree topology
retree Interactive tree rearrangement program which reads in a tree (with branch lengths if needed) and allows rerooting the tree, flipping branches, changing species names and branch lengths, and then writing the result out; can be used to convert between rooted and unrooted trees
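The bootstrap resampling mentioned in the seqboot entry can be illustrated with a short Python sketch: alignment columns are drawn with replacement until a replicate of the original length is built. This is only an illustration of the general idea and is not seqboot's code; the function name bootstrap_columns is hypothetical.

```python
import random


def bootstrap_columns(seqs, rng):
    """Return one bootstrap replicate of an alignment (hypothetical helper).

    seqs maps taxon name -> aligned sequence; rng is a random.Random instance.
    """
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    nchars = lengths.pop()
    # Draw column indices with replacement; the same columns are used for
    # every taxon so that each sampled site stays aligned.
    cols = [rng.randrange(nchars) for _ in range(nchars)]
    return {name: "".join(s[c] for c in cols) for name, s in seqs.items()}
```

Passing a seeded generator such as random.Random(1) makes a replicate reproducible.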

File format conversion

Many programs that convert among alignment formats will output data in phylip or relaxed phylip format. For example, conversion between the PHYLIP multiple sequence alignment format and Multi-FASTA format can be done with Genozip[8] using genocat --fasta or genocat --phylip. The PAUP* software package is especially useful for converting between the Nexus format and phylip format.
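For small files, the same kind of conversion can be sketched directly in Python without any external tool. The function name fasta_to_relaxed_phylip is hypothetical. The sketch assumes the FASTA records are already aligned to equal length and uses the first word of each header line as the taxon name.

```python
def fasta_to_relaxed_phylip(fasta_text):
    """Convert aligned Multi-FASTA text to relaxed sequential phylip text."""
    records = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            if not words:
                raise ValueError("FASTA header without a name")
            name = words[0]  # first word of the header becomes the name
            if name in records:
                raise ValueError(f"duplicate name: {name}")
            records[name] = []
        elif name is None:
            raise ValueError("sequence data before the first header")
        else:
            records[name].append(line)
    if not records:
        raise ValueError("no FASTA records found")
    seqs = {n: "".join(parts) for n, parts in records.items()}
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) != 1:
        raise ValueError("sequences are not aligned to the same length")
    # Pad names so the character matrix starts in the same column.
    width = max(len(n) for n in seqs) + 2
    out = [f"{len(seqs)} {lengths.pop()}"]
    out += [f"{n:<{width}}{s}" for n, s in seqs.items()]
    return "\n".join(out) + "\n"
```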

References

  1. Felsenstein, J. (1981). "Evolutionary trees from DNA sequences: A maximum likelihood approach". Journal of Molecular Evolution. 17 (6): 368–376. Bibcode:1981JMolE..17..368F. doi:10.1007/BF01734359. PMID 7288891. S2CID 8024924.
  2. "PHYLIP general information page". Retrieved 2010-02-14.
  3. Joseph Felsenstein (August 2003). Inferring Phylogenies. Sinauer Associates. ISBN 0-87893-177-5. Archived from the original on 2011-10-22. Retrieved 2006-03-24.
  4. Stamatakis, Alexandros (2014-05-01). "RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies". Bioinformatics. 30 (9): 1312–1313. doi:10.1093/bioinformatics/btu033. ISSN 1460-2059. PMC 3998144. PMID 24451623.
  5. Kozlov, Alexey M; Darriba, Diego; Flouri, Tomáš; Morel, Benoit; Stamatakis, Alexandros (2019-11-01). Wren, Jonathan (ed.). "RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference". Bioinformatics. 35 (21): 4453–4455. doi:10.1093/bioinformatics/btz305. ISSN 1367-4803. PMC 6821337. PMID 31070718.
  6. Minh, Bui Quang; Schmidt, Heiko A; Chernomor, Olga; Schrempf, Dominik; Woodhams, Michael D; von Haeseler, Arndt; Lanfear, Robert (2020-05-01). Teeling, Emma (ed.). "IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era". Molecular Biology and Evolution. 37 (5): 1530–1534. doi:10.1093/molbev/msaa015. ISSN 0737-4038. PMC 7182206. PMID 32011700.
  7. "PHYLIP package documentation mirror site". Archived from the original on 2005-10-19. Retrieved 2006-03-24.
  8. Lan, D.; et al. (2021). "Genozip: a universal extensible genomic data compressor". Bioinformatics.
