Pan-genome

From Wikipedia, the free encyclopedia

In molecular biology, a pan-genome (or supra-genome) is the full complement of genes in a clade. The concept is typically applied to species of bacteria and archaea, which can show large variation in gene content among closely related strains. The pan-genome is the union of the gene sets of all the strains of a clade (e.g. a species).[1] It is significant mainly in an evolutionary context, especially with relevance to metagenomics,[2] but the concept is also used in a broader genomics context.[3]

The pan-genome comprises the "core genome", containing genes present in all strains; a "dispensable genome", containing genes present in two or more strains but not in all of them; and "unique genes", specific to single strains.[1] These distinctions are not completely objective, however, since they depend on which genomes are included in the analysis.
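These definitions can be illustrated with a minimal sketch, assuming each strain is represented simply as a set of gene-family identifiers. The function name, strain names and gene labels below are hypothetical and chosen only for illustration; real analyses must first cluster homologous genes into families.

```python
from collections import Counter


def partition_pan_genome(strains):
    """Split the pan-genome of `strains` into core, dispensable and unique genes.

    `strains` maps a strain name to the collection of gene families found in it.
    """
    if not strains:
        raise ValueError("at least one strain is required")
    # Count in how many strains each gene family occurs (set() guards against duplicates).
    counts = Counter(gene for genes in strains.values() for gene in set(genes))
    n = len(strains)
    pan = set(counts)                                    # union of all gene sets
    core = {g for g, c in counts.items() if c == n}      # present in every strain
    unique = {g for g, c in counts.items() if c == 1 and n > 1}  # single strain only
    dispensable = pan - core - unique                    # two or more strains, not all
    return {"pan": pan, "core": core, "dispensable": dispensable, "unique": unique}


# Hypothetical toy data, not taken from any real organism.
toy_strains = {
    "strain_A": {"g1", "g2", "g3", "g4"},
    "strain_B": {"g1", "g2", "g3", "g5"},
    "strain_C": {"g1", "g2", "g6"},
}
result = partition_pan_genome(toy_strains)
for name in ("pan", "core", "dispensable", "unique"):
    print(name, sorted(result[name]))
```

Removing strain_C from the toy data would move g3 from the dispensable genome into the core genome, which illustrates how the categories depend on the genomes included.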

Examples

Figure: The S. pneumoniae pan-genome. (a) Number of new genes as a function of the number of sequenced genomes. The predicted number of new genes drops sharply to zero when the number of genomes exceeds 50. (b) Number of core genes as a function of the number of sequenced genomes. The number of core genes converges to 1,647 as the number of genomes n→∞. From Donati et al.[4]

The original pan-genome concept was developed by Tettelin et al.[5] when they sequenced six strains of Streptococcus agalactiae. The gene content of these strains could be described as a core genome shared by all isolates, accounting for approximately 80% of any single genome, plus a dispensable genome consisting of partially shared and strain-specific genes. Extrapolation suggested that the gene reservoir in the S. agalactiae pan-genome is vast and that new unique genes would continue to be identified even after sequencing hundreds of genomes.[5]

A similar pattern was found in Streptococcus pneumoniae when 44 strains were sequenced (see figure). With each new genome sequenced, fewer new genes were discovered. In fact, the predicted number of new genes dropped to zero when the number of genomes exceeded 50 (note, however, that this pattern is not found in all species). The main source of new genes in S. pneumoniae was Streptococcus mitis, from which genes were transferred horizontally. The pan-genome size of S. pneumoniae increased logarithmically with the number of strains and linearly with the number of polymorphic sites of the sampled genomes, suggesting that acquired genes accumulate proportionately to the age of clones.[4]
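The "new genes" and "core genes" curves described for S. pneumoniae can be approximated for any collection of genomes by adding strains one at a time and recording how many previously unseen genes each contributes and how many genes remain shared by all. The following is a minimal sketch, assuming each genome is given as a set of gene-family identifiers; the function name and toy data are hypothetical. It uses a single fixed order of addition, whereas analyses of real data often average over many orderings, a step omitted here.

```python
def accumulation_curve(gene_sets):
    """Return (new_genes, core_sizes, pan_sizes), one entry per genome added.

    `gene_sets` is a sequence of gene-family collections, processed in the given order.
    """
    pan, core = set(), None
    new_genes, core_sizes, pan_sizes = [], [], []
    for genes in gene_sets:
        genes = set(genes)
        new_genes.append(len(genes - pan))  # genes not seen in any earlier genome
        pan |= genes                        # running union: pan-genome so far
        core = genes if core is None else core & genes  # running intersection
        core_sizes.append(len(core))
        pan_sizes.append(len(pan))
    return new_genes, core_sizes, pan_sizes


# Hypothetical toy data, not taken from any real organism.
toy_genomes = [
    {"g1", "g2", "g3", "g4"},
    {"g1", "g2", "g3", "g5"},
    {"g1", "g2", "g6"},
    {"g1", "g2", "g3"},
]
new_genes, core_sizes, pan_sizes = accumulation_curve(toy_genomes)
print(new_genes)   # [4, 1, 1, 0]
print(core_sizes)  # [4, 3, 2, 2]
print(pan_sizes)   # [4, 5, 6, 6]
```

In this toy example the fourth genome contributes no new genes, so the pan-genome size stops growing while the core size has already levelled off.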

A further example is the comparison of the sizes of the core genome and the pan-genome of Prochlorococcus. The core genome is necessarily much smaller than the pan-genome, on which the different ecotypes of Prochlorococcus draw.[6] A 2015 study of Prevotella bacteria isolated from humans compared the gene repertoires of species derived from different human body sites. It also reported an open pan-genome with a highly diverse gene pool.[7]

In 2016, the same group developed BPGA, a pan-genome analysis pipeline for prokaryotic genomes, which they describe as a faster alternative to other software.[8]

Software Tools

As interest in pan-genomes increased, a number of software tools were developed to help analyze this kind of data. In 2015, a group reviewed the different kinds of analyses and tools a researcher may want to use.[9] Software has been developed for seven kinds of pan-genome analysis: clustering homologous genes; identifying SNPs; plotting pan-genomic profiles; building phylogenetic relationships of orthologous genes/families of strains/isolates; function-based searching; annotation and/or curation; and visualization.[9]

The two most cited software tools at the end of 2014[9] were Panseq[10] and the pan-genomes analysis pipeline (PGAP).[11]

References

  1. ^ a b Medini, Duccio; Donati, Claudio; Tettelin, Hervé; Masignani, Vega; Rappuoli, Rino (2005). "The microbial pan-genome". Current Opinion in Genetics & Development. 15 (6): 589–594. PMID 16185861. doi:10.1016/j.gde.2005.09.006. 
  2. ^ Reno ML, Held NL, Fields CJ, Burke PV, Whitaker RJ (May 2009). "Biogeography of the Sulfolobus islandicus pan-genome". Proc. Natl. Acad. Sci. U.S.A. 106 (21): 8605–10. PMC 2689034. PMID 19435847. doi:10.1073/pnas.0808945106.
  3. ^ Reinhardt JA, Baltrus DA, Nishimura MT, Jeck WR, Jones CD, Dangl JL (February 2009). "De novo assembly using low-coverage short read sequence data from the rice pathogen Pseudomonas syringae pv. oryzae". Genome Res. 19 (2): 294–305. PMC 2652211. PMID 19015323. doi:10.1101/gr.083311.108.
  4. ^ a b Donati, C; Hiller, N. L.; Tettelin, H; Muzzi, A; Croucher, N. J.; Angiuoli, S. V.; Oggioni, M; Dunning Hotopp, J. C.; Hu, F. Z.; Riley, D. R.; Covacci, A; Mitchell, T. J.; Bentley, S. D.; Kilian, M; Ehrlich, G. D.; Rappuoli, R; Moxon, E. R.; Masignani, V (2010). "Structure and dynamics of the pan-genome of Streptococcus pneumoniae and closely related species". Genome Biology. 11 (10): R107. PMC 3218663. PMID 21034474. doi:10.1186/gb-2010-11-10-r107.
  5. ^ a b Tettelin, H; Masignani, V; Cieslewicz, M. J.; Donati, C; Medini, D; Ward, N. L.; Angiuoli, S. V.; Crabtree, J; Jones, A. L.; Durkin, A. S.; Deboy, R. T.; Davidsen, T. M.; Mora, M; Scarselli, M; Margarit y Ros, I; Peterson, J. D.; Hauser, C. R.; Sundaram, J. P.; Nelson, W. C.; Madupu, R; Brinkac, L. M.; Dodson, R. J.; Rosovitz, M. J.; Sullivan, S. A.; Daugherty, S. C.; Haft, D. H.; Selengut, J; Gwinn, M. L.; Zhou, L; et al. (2005). "Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: Implications for the microbial "pan-genome"". Proceedings of the National Academy of Sciences. 102 (39): 13950–5. PMC 1216834. PMID 16172379. doi:10.1073/pnas.0506758102.
  6. ^ Kettler GC, Martiny AC, Huang K, Zucker J, Coleman ML, Rodrigue S, Chen F, Lapidus A, Ferriera S, Johnson J, Steglich C, Church GM, Richardson P, Chisholm SW (2007). "Patterns and Implications of Gene Gain and Loss in the Evolution of Prochlorococcus". PLoS Genetics. 3 (12): e231. ISSN 1553-7390. PMC 2151091. PMID 18159947. doi:10.1371/journal.pgen.0030231.
  7. ^ Gupta VK, Chaudhari NM, Dutta C (2015). "Divergences in gene repertoire among the reference Prevotella genomes derived from distinct body sites of human". BMC Genomics. 16 (153). PMC 4359502. PMID 25887946. doi:10.1186/s12864-015-1350-6.
  8. ^ Chaudhari NM, Gupta VK, Dutta C (2016). "BPGA- an ultra-fast pan-genome analysis pipeline". Scientific Reports. 6 (24373). PMC 4829868. PMID 27071527. doi:10.1038/srep24373.
  9. ^ a b c Xiao, Jingfa; Zhang, Zhewen; Wu, Jiayan; Yu, Jun (23 February 2015). "A brief review of software tools for pangenomics". Genomics, Proteomics & Bioinformatics. 13 (1): 73–76. doi:10.1016/j.gpb.2015.01.007. Retrieved 2017-01-28. 
  10. ^ Laing, Chad; Buchanan, Cody; Taboada, Eduardo; Zhang, Yongxiang; Kropinski, Andrew; Villegas, Andrea; Thomas, James; Gannon, Victor (15 September 2010). "Pan-genome sequence analysis using Panseq: an online tool for the rapid analysis of core and accessory genomic regions". BMC Bioinformatics. 11 (1): 461. doi:10.1186/1471-2105-11-461. Retrieved 2017-01-28. 
  11. ^ Zhao, Yongbing; Wu, Jiayan; Yang, Junhui; Sun, Shixiang; Xiao, Jingfa; Yu, Jun (29 November 2011). "PGAP: pan-genomes analysis pipeline". Bioinformatics. 28 (3): 416–418. doi:10.1093/bioinformatics/btr655. Retrieved 2017-01-28.