Binning (metagenomics)

In metagenomics, binning is the process of grouping reads or contigs and assigning them to individual genomes. Binning methods can be based on compositional features, on alignment (similarity), or on both.[1]

Introduction

Metagenomic samples can contain reads from a huge number of organisms. For example, a single gram of soil can contain up to 18,000 different types of organisms, each with its own genome.[2] Metagenomic studies sample DNA from the whole community and make it available as nucleotide sequences of a certain length. In most cases, the incomplete nature of the obtained sequences makes it hard to assemble individual genes,[3] much less recover the full genome of each organism. Binning techniques therefore represent a "best effort" to assign reads or contigs to individual genomes; a genome recovered in this way is known as a metagenome-assembled genome (MAG). The taxonomy of MAGs can be inferred by placing them into a reference phylogenetic tree using tools such as GTDB-Tk.[4]

The first studies that sampled DNA from multiple organisms used specific genes to assess the diversity and origin of each sample.[5][6] These marker genes had previously been sequenced from clonal cultures of known organisms, so whenever one of these genes appeared in a read or contig from the metagenomic sample, that read or contig could be assigned to a known species or to the operational taxonomic unit (OTU) of that species. The problem with this method was that only a tiny fraction of the sequences carried a marker gene, leaving most of the data unassigned.

Modern binning techniques use both previously available information that is independent of the sample and intrinsic information present in the sample. Their degree of success varies with the diversity and complexity of the sample: in some cases they can resolve the sequences down to individual species, while in others the sequences can at best be assigned to very broad taxonomic groups.[7]

Binning of metagenomic data from various habitats might significantly extend the tree of life. One such approach, applied to globally available metagenomes, binned 52,515 individual microbial genomes and extended the known diversity of bacteria and archaea by 44%.[8]

Algorithms

Binning algorithms can employ previous information and thus act as supervised classifiers, or they can try to find new groups and thus act as unsupervised classifiers. Many, of course, do both. The classifiers exploit previously known sequences by performing alignments against databases, and try to separate sequences based on organism-specific characteristics of the DNA,[9] such as GC-content.
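As a minimal illustration of a compositional feature, the following Python sketch computes the GC-content of a few sequences and splits them into two groups. The contig names, the sequences and the 0.5 cutoff are hypothetical values chosen for the example; real binning tools combine many features rather than thresholding a single one.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Sequences without any unambiguous base get a GC-content of 0.0.
    return gc / total if total else 0.0


# Hypothetical toy contigs, used only to illustrate the idea.
contigs = {
    "contig_1": "ATATATTTAAATATTA",
    "contig_2": "GCGCGGCCGCGGGCCG",
    "contig_3": "ATTTAAATATATTAGA",
}

# Hypothetical cutoff: separate AT-rich from GC-rich sequences.
CUTOFF = 0.5
low_gc = sorted(name for name, seq in contigs.items() if gc_content(seq) < CUTOFF)
high_gc = sorted(name for name, seq in contigs.items() if gc_content(seq) >= CUTOFF)
```

In this toy input, the two AT-rich contigs fall into one group and the GC-rich contig into the other.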

Mande et al. (2012)[10] review the premise, methodologies, advantages, limitations and challenges of various methods available for binning metagenomic datasets obtained using the shotgun sequencing approach. Some of the prominent binning algorithms are described below.

TETRA

TETRA is a statistical classifier that uses tetranucleotide usage patterns in genomic fragments.[11] There are four possible nucleotides in DNA, so there can be 4^4 = 256 different fragments of four consecutive nucleotides; these fragments are called tetramers. TETRA works by tabulating the frequency of each tetramer in a given sequence. From these frequencies, z-scores are calculated that indicate how over- or under-represented each tetramer is compared with what would be expected from the individual nucleotide composition. The z-scores for all tetramers are assembled into a vector, and the vectors corresponding to different sequences are compared pairwise to yield a measure of how similar the sequences in the sample are. The most similar sequences are expected to belong to organisms in the same OTU.
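The idea can be sketched in Python as follows. This is a simplified illustration, not the TETRA implementation: the expected counts assume independent nucleotides, the variance uses a binomial approximation, only one strand is counted, and Pearson correlation is used here as one plausible way to compare two z-score vectors. The function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt

# All 4^4 = 256 possible tetramers, in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetramer_zscores(sequence: str) -> list[float]:
    """Return one z-score per tetramer for a single sequence (simplified model)."""
    seq = sequence.upper()
    base_counts = Counter(b for b in seq if b in "ACGT")
    total_bases = sum(base_counts.values())
    # Keep only windows made of unambiguous bases.
    windows = [seq[i:i + 4] for i in range(len(seq) - 3)]
    windows = [w for w in windows if set(w) <= set("ACGT")]
    n = len(windows)
    if total_bases == 0 or n == 0:
        return [0.0] * len(TETRAMERS)
    freq = {b: base_counts[b] / total_bases for b in "ACGT"}
    observed = Counter(windows)
    zscores = []
    for t in TETRAMERS:
        # Probability of the tetramer if nucleotides were independent.
        p = freq[t[0]] * freq[t[1]] * freq[t[2]] * freq[t[3]]
        expected = n * p
        variance = n * p * (1 - p)  # binomial approximation (an assumption)
        zscores.append((observed[t] - expected) / sqrt(variance) if variance > 0 else 0.0)
    return zscores


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


# Usage: compare two sequences by the correlation of their z-score vectors.
z_a = tetramer_zscores("ACGT" * 50)
z_b = tetramer_zscores("ACGT" * 30 + "TTTTAAAA" * 10)
similarity = pearson(z_a, z_b)
```

A correlation close to 1 suggests similar tetranucleotide usage. Short sequences give noisy z-scores, so a comparison like this is only meaningful for reasonably long fragments.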

MEGAN

In the DIAMOND[12]+MEGAN[13] approach, all reads are first aligned against a protein reference database, such as NCBI-nr, and the resulting alignments are then analyzed using the naive lowest common ancestor (LCA) algorithm, which places a read on the lowest taxonomic node in the NCBI taxonomy that lies above all taxa to which the read has a significant alignment. Here, an alignment is usually deemed "significant" if its bit score lies above a given threshold (which depends on the length of the reads) and is within, say, 10% of the best score seen for that read. The rationale for using protein reference sequences, rather than DNA reference sequences, is that current DNA reference databases cover only a small fraction of the true diversity of genomes that exist in the environment.
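The naive LCA step described above can be sketched in Python as follows. This is an illustrative sketch rather than MEGAN's code: the toy taxonomy, the function names and the default minimum bit score of 50 are assumptions made for the example, and hits are given directly as (taxon, bit score) pairs instead of being parsed from aligner output.

```python
# Hypothetical toy taxonomy, stored as child -> parent.
PARENT = {
    "Escherichia coli": "Escherichia",
    "Escherichia fergusonii": "Escherichia",
    "Escherichia": "Enterobacteriaceae",
    "Salmonella enterica": "Salmonella",
    "Salmonella": "Enterobacteriaceae",
    "Enterobacteriaceae": "Bacteria",
    "Bacillus subtilis": "Bacillus",
    "Bacillus": "Bacteria",
    "Bacteria": "root",
}


def lineage(taxon: str) -> list[str]:
    """Return the path from the root of the toy taxonomy down to the taxon."""
    path = [taxon]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path[::-1]


def naive_lca(hits, min_score=50.0, top_percent=10.0):
    """Assign a read to the lowest node above all significant hits.

    hits: list of (taxon, bit_score) pairs for one read.
    min_score: assumed minimum bit score for a hit to be considered.
    top_percent: hits must score within this percentage of the best hit.
    Returns the assigned taxon, or None if no hit is significant.
    """
    passing = [(taxon, score) for taxon, score in hits if score >= min_score]
    if not passing:
        return None
    best = max(score for _, score in passing)
    cutoff = best * (1 - top_percent / 100)
    lineages = [lineage(taxon) for taxon, score in passing if score >= cutoff]
    lca = None
    # Walk down from the root while all lineages agree.
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        lca = ranks[0]
    return lca


# Usage: the Bacillus hit falls outside the top 10% and is ignored,
# so the read is placed on the family shared by the two remaining hits.
read_hits = [
    ("Escherichia coli", 200.0),
    ("Salmonella enterica", 185.0),
    ("Bacillus subtilis", 120.0),
]
assignment = naive_lca(read_hits)
```

A read whose significant hits all fall within one species is assigned to that species, while a read with hits spread across distant taxa is pushed up towards the root.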

PhyloPythia

PhyloPythia is a supervised classifier developed by researchers at IBM labs; it is essentially a support vector machine trained with DNA k-mers from known sequences.[6]

SOrt-ITEMS

SOrt-ITEMS (Monzoorul et al., 2009)[14] is an alignment-based binning algorithm developed by the Innovation Labs of Tata Consultancy Services (TCS) Ltd., India. Users first perform a similarity search of the input metagenomic sequences (reads) against the nr protein database using BLASTx. The BLASTx output is then taken as input by the SOrt-ITEMS program. The method uses a range of BLAST alignment parameter thresholds to first identify an appropriate taxonomic level (or rank) at which the read can be assigned. An orthology-based approach is then adopted for the final assignment of the metagenomic read. Other alignment-based binning algorithms developed by the Innovation Labs of Tata Consultancy Services (TCS) include DiScRIBinATE,[15] ProViDE[16] and SPHINX.[17] The methodologies of these algorithms are summarized below.

DiScRIBinATE

DiScRIBinATE (Ghosh et al., 2010)[15] is an alignment-based binning algorithm developed by the Innovation Labs of Tata Consultancy Services (TCS) Ltd., India. DiScRIBinATE replaces the orthology approach of SOrt-ITEMS with a quicker 'alignment-free' approach. Incorporating this alternate strategy was observed to halve the binning time without any significant loss in the accuracy and specificity of assignments. In addition, a novel reclassification strategy incorporated in DiScRIBinATE was seen to reduce the overall misclassification rate.

ProViDE

ProViDE (Ghosh et al., 2011)[16] is an alignment-based binning approach developed by the Innovation Labs of Tata Consultancy Services (TCS) Ltd. for the estimation of viral diversity in metagenomic samples. ProViDE adopts a reverse-orthology-based approach, similar to that of SOrt-ITEMS, for the taxonomic classification of metagenomic sequences obtained from virome datasets. It uses a customized set of BLAST parameter thresholds specifically suited to viral metagenomic sequences. These thresholds capture the pattern of sequence divergence and the non-uniform taxonomic hierarchy observed within and across various taxonomic groups of the viral kingdom.

PCAHIER

PCAHIER (Zheng et al., 2010),[18] a binning algorithm developed at the Georgia Institute of Technology, employs n-mer oligonucleotide frequencies as features and adopts a hierarchical classifier for binning short metagenomic fragments. Principal component analysis is used to reduce the high dimensionality of the feature space. The effectiveness of PCAHIER was demonstrated through comparisons against a non-hierarchical classifier and two existing binning algorithms (TETRA and PhyloPythia).

SPHINX

SPHINX (Mohammed et al., 2011),[17] another binning algorithm developed by the Innovation Labs of Tata Consultancy Services (TCS) Ltd., adopts a hybrid strategy that achieves high binning efficiency by utilizing the principles of both 'composition'- and 'alignment'-based binning algorithms. The approach was designed with the objective of analyzing metagenomic datasets as rapidly as composition-based approaches, but nevertheless with the accuracy and specificity of alignment-based algorithms. SPHINX was observed to classify metagenomic sequences as rapidly as composition-based algorithms. In addition, the binning efficiency (in terms of accuracy and specificity of assignments) of SPHINX was observed to be comparable with results obtained using alignment-based algorithms.

INDUS and TWARIT

INDUS and TWARIT are other composition-based binning algorithms developed by the Innovation Labs of Tata Consultancy Services (TCS) Ltd. These algorithms utilize a range of oligonucleotide compositional (as well as statistical) parameters to improve binning time while maintaining the accuracy and specificity of taxonomic assignments.[19][20]

Other algorithms

This list is not exhaustive:

  • TACOA (Diaz et al., 2009)
  • Parallel-META (Su et al., 2011)
  • PhyloPythiaS (Patil et al., 2011)
  • RITA (MacDonald et al., 2012)[21]
  • BiMeta (Le et al., 2015)[22]
  • MetaPhlAn (Segata et al., 2012)[23]
  • SeMeta (Le et al., 2016)[24]
  • Quikr (Koslicki et al., 2013)[25]
  • Taxoner (Pongor et al., 2014)[26]
  • MaxBin (Wu et al., 2014)[27]
  • MetaBAT 2 (Kang et al., 2019)[28]
  • CONCOCT (Alneberg et al., 2014)[29]
  • Anvi’o (Eren et al., 2015)[30]
  • DAS Tool (Sieber et al., 2018)[31] - wrapper that combines multiple binning algorithms

All these algorithms employ different schemes for binning sequences, such as hierarchical classification, and operate in either a supervised or unsupervised manner. These algorithms provide a global view of how diverse the samples are, and can potentially connect community composition and function in metagenomes.

References

  1. ^ Maguire, Finlay; Jia, Baofeng; Gray, Kristen L.; Lau, Wing Yin Venus; Beiko, Robert G.; Brinkman, Fiona S. L. (2020-10-01). "Metagenome-assembled genome binning methods with short reads disproportionately fail for plasmids and genomic Islands". Microbial Genomics. 6 (10): mgen000436. doi:10.1099/mgen.0.000436. ISSN 2057-5858. PMC 7660262. PMID 33001022.
  2. ^ Daniel, Rolf (2005-06-01). "The metagenomics of soil". Nature Reviews Microbiology. 3 (6): 470–478. doi:10.1038/nrmicro1160. ISSN 1740-1526. PMID 15931165. S2CID 32604394.
  3. ^ Wooley, John C.; Adam Godzik; Iddo Friedberg (2010-02-26). "A Primer on Metagenomics". PLOS Comput Biol. 6 (2): e1000667. Bibcode:2010PLSCB...6E0667W. doi:10.1371/journal.pcbi.1000667. PMC 2829047. PMID 20195499.
  4. ^ Chaumeil, Pierre-Alain; Mussig, Aaron J; Hugenholtz, Philip; Parks, Donovan H (2019-11-15). Hancock, John (ed.). "GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database". Bioinformatics. 36 (6): 1925–1927. doi:10.1093/bioinformatics/btz848. ISSN 1367-4803. PMC 7703759. PMID 31730192.
  5. ^ Giovannoni, Stephen J.; Theresa B. Britschgi; Craig L. Moyer; Katharine G. Field (1990-05-03). "Genetic diversity in Sargasso Sea bacterioplankton". Nature. 345 (6270): 60–63. Bibcode:1990Natur.345...60G. doi:10.1038/345060a0. PMID 2330053. S2CID 4370502.
  6. ^ a b McHardy, Alice Carolyn; Hector Garcia Martin; Aristotelis Tsirigos; Philip Hugenholtz; Isidore Rigoutsos (January 2007). "Accurate phylogenetic classification of variable-length DNA fragments". Nature Methods. 4 (1): 63–72. doi:10.1038/nmeth976. ISSN 1548-7091. PMID 17179938. S2CID 28797816.
  7. ^ academic.oup.com https://academic.oup.com/bib/article/23/6/bbac431/6760137. Retrieved 2024-01-17.
  8. ^ IMG/M Data Consortium; Nayfach, Stephen; Roux, Simon; Seshadri, Rekha; Udwary, Daniel; Varghese, Neha; Schulz, Frederik; Wu, Dongying; Paez-Espino, David; Chen, I-Min; Huntemann, Marcel (2020-11-09). "A genomic catalog of Earth's microbiomes". Nature Biotechnology. 39 (4): 499–509. doi:10.1038/s41587-020-0718-6. ISSN 1087-0156. PMC 8041624. PMID 33169036.
  9. ^ Karlin, S.; I. Ladunga; B. E. Blaisdell (1994). "Heterogeneity of genomes: measures and values". Proceedings of the National Academy of Sciences. 91 (26): 12837–12841. Bibcode:1994PNAS...9112837K. doi:10.1073/pnas.91.26.12837. PMC 45535. PMID 7809131.
  10. ^ Mande, Sharmila S.; Monzoorul Haque Mohammed; Tarini Shankar Ghosh (2012). "Classification of metagenomic sequences: methods and challenges". Briefings in Bioinformatics. 13 (6): 669–81. doi:10.1093/bib/bbs054. PMID 22962338.
  11. ^ Teeling, Hanno; Jost Waldmann; Thierry Lombardot; Margarete Bauer; Frank Glockner (2004). "TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences". BMC Bioinformatics. 5 (1): 163. doi:10.1186/1471-2105-5-163. PMC 529438. PMID 15507136.
  12. ^ Buchfink, Xie and Huson (2015). "Fast and sensitive protein alignment using DIAMOND". Nature Methods. 12 (1): 59–60. doi:10.1038/nmeth.3176. PMID 25402007. S2CID 5346781.
  13. ^ Huson, Daniel H; S. Beier; I. Flade; A. Gorska; M. El-Hadidi; H. Ruscheweyh; R. Tappu (2016). "MEGAN Community Edition - Interactive exploration and analysis of large-scale microbiome sequencing data". PLOS Computational Biology. 12 (6): e1004957. Bibcode:2016PLSCB..12E4957H. doi:10.1371/journal.pcbi.1004957. PMC 4915700. PMID 27327495.
  14. ^ Haque M, Monzoorul; Tarini Shankar Ghosh; Dinakar Komanduri; Sharmila S Mande (2009). "SOrt-ITEMS: Sequence orthology based approach for improved taxonomic estimation of metagenomic sequences". Bioinformatics. 25 (14): 1722–30. doi:10.1093/bioinformatics/btp317. PMID 19439565.
  15. ^ a b Ghosh, Tarini Shankar; Monzoorul Haque M; Sharmila S Mande (2010). "DiScRIBinATE: a rapid method for accurate taxonomic classification of metagenomic sequences". BMC Bioinformatics. 11 (S7): S14. doi:10.1186/1471-2105-11-s7-s14. PMC 2957682. PMID 21106121.
  16. ^ a b Ghosh, Tarini Shankar; Monzoorul Haque Mohammed; Dinakar Komanduri; Sharmila S Mande (2011). "ProViDE: A software tool for accurate estimation of viral diversity in metagenomic samples". Bioinformation. 6 (2): 91–94. doi:10.6026/97320630006091. PMC 3082859. PMID 21544173.
  17. ^ a b Mohammed, Monzoorul Haque; Tarini Shankar Ghosh; Nitin Kumar Singh; Sharmila S Mande (2011). "SPHINX--an algorithm for taxonomic binning of metagenomic sequences". Bioinformatics. 27 (1): 22–30. doi:10.1093/bioinformatics/btq608. PMID 21030462.
  18. ^ Zheng, Hao; Hongwei Wu (2010). "Short prokaryotic DNA fragment binning using a hierarchical classifier based on linear discriminant analysis and principal component analysis". J Bioinform Comput Biol. 8 (6): 995–1011. doi:10.1142/s0219720010005051. PMID 21121023.
  19. ^ Mohammed, Monzoorul Haque; Tarini Shankar Ghosh; Rachamalla Maheedhar Reddy; CV Reddy; Nitin Kumar Singh; Sharmila S Mande (2011). "INDUS - a composition-based approach for rapid and accurate taxonomic classification of metagenomic sequences". BMC Genomics. 12 (S3): S4. doi:10.1186/1471-2164-12-s3-s4. PMC 3333187. PMID 22369237.
  20. ^ Reddy, Rachamalla Maheedhar; Monzoorul Haque Mohammed; Sharmila S Mande (2013). "TWARIT: an extremely rapid and efficient approach for phylogenetic classification of metagenomic sequences". Gene. 505 (2): 259–65. doi:10.1016/j.gene.2012.06.014. PMID 22710135.
  21. ^ MacDonald, Norman J.; Donovan H. Parks; Robert G. Beiko (2012). "Rapid identification of high-confidence taxonomic assignments for metagenomic data". Nucleic Acids Research. 40 (14): e111. doi:10.1093/nar/gks335. PMC 3413139. PMID 22532608.
  22. ^ Van Vinh, Le, Van Lang, Tran, and Tran Van Hoai. "A two-phase binning algorithm using l-mer frequency on groups of non-overlapping reads." Algorithms for Molecular Biology 10.1 (2015): 1.
  23. ^ Nicola, Segata; Levi Waldron; Annalisa Ballarini; Vagheesh Narasimhan; Olivier Jousson; Curtis Huttenhower (2012). "Metagenomic microbial community profiling using unique clade-specific marker genes". Nature Methods. 9 (8): 811–814. doi:10.1038/nmeth.2066. PMC 3443552. PMID 22688413.
  24. ^ Van Vinh, Le, Van Lang, Tran, and Tran Van Hoai. "A novel semi-supervised algorithm for the taxonomic assignment of metagenomic reads". BMC bioinformatics, 17(1), 2016.
  25. ^ Koslicki, David; Simon Foucart; Gail Rosen (2013). "Quikr: a method for rapid reconstruction of bacterial communities via compressive sensing". Bioinformatics. 29 (17): 2096–2102. doi:10.1093/bioinformatics/btt336. PMID 23786768.
  26. ^ Pongor, Lőrinc; Roberto Vera; Balázs Ligeti (2014). "Fast and sensitive alignment of microbial whole genome sequencing reads to large sequence datasets on a desktop PC: application to metagenomic datasets and pathogen identification". PLOS ONE. 9 (7): e103441. Bibcode:2014PLoSO...9j3441P. doi:10.1371/journal.pone.0103441. PMC 4117525. PMID 25077800.
  27. ^ Wu, Yu-Wei; Tang, Yung-Hsu; Tringe, Susannah G; Simmons, Blake A; Singer, Steven W (December 2014). "MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm". Microbiome. 2 (1): 26. doi:10.1186/2049-2618-2-26. ISSN 2049-2618. PMC 4129434. PMID 25136443.
  28. ^ Kang, Dongwan D.; Li, Feng; Kirton, Edward; Thomas, Ashleigh; Egan, Rob; An, Hong; Wang, Zhong (2019-07-26). "MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies". PeerJ. 7: e7359. doi:10.7717/peerj.7359. ISSN 2167-8359. PMC 6662567. PMID 31388474.
  29. ^ Alneberg, Johannes; Bjarnason, Brynjar Smári; de Bruijn, Ino; Schirmer, Melanie; Quick, Joshua; Ijaz, Umer Z; Lahti, Leo; Loman, Nicholas J; Andersson, Anders F; Quince, Christopher (November 2014). "Binning metagenomic contigs by coverage and composition". Nature Methods. 11 (11): 1144–1146. doi:10.1038/nmeth.3103. ISSN 1548-7091. PMID 25218180. S2CID 24696869.
  30. ^ Eren, A. Murat; Esen, Özcan C.; Quince, Christopher; Vineis, Joseph H.; Morrison, Hilary G.; Sogin, Mitchell L.; Delmont, Tom O. (2015-10-08). "Anvi'o: an advanced analysis and visualization platform for 'omics data". PeerJ. 3: e1319. doi:10.7717/peerj.1319. ISSN 2167-8359. PMC 4614810. PMID 26500826.
  31. ^ Sieber, Christian M. K.; Probst, Alexander J.; Sharrar, Allison; Thomas, Brian C.; Hess, Matthias; Tringe, Susannah G.; Banfield, Jillian F. (July 2018). "Recovery of genomes from metagenomes via a dereplication, aggregation and scoring strategy". Nature Microbiology. 3 (7): 836–843. doi:10.1038/s41564-018-0171-1. ISSN 2058-5276. PMC 6786971. PMID 29807988.