Computational genomics: Difference between revisions

Revision as of 13:14, 24 May 2021

Computational genomics (often incorrectly referred to as Computational Genetics^[1]) refers to the use of computational and statistical analysis to decipher biology from genome sequences and related data,^[2] including both DNA and RNA sequence as well as other "post-genomic" data (i.e., experimental data obtained with technologies that require the genome sequence, such as genomic DNA microarrays). Because these data are combined with computational and statistical approaches to understanding gene function and with statistical association analysis, the field is also often referred to as Computational and Statistical Genetics/Genomics. As such, computational genomics may be regarded as a subset of bioinformatics and computational biology, but with a focus on using whole genomes (rather than individual genes) to understand the principles of how the DNA of a species controls its biology at the molecular level and beyond. With the current abundance of massive biological datasets, computational studies have become one of the most important means of biological discovery.^[3]

History

The roots of computational genomics are shared with those of bioinformatics. During the 1960s, Margaret Dayhoff and others at the National Biomedical Research Foundation assembled databases of homologous protein sequences for evolutionary study.^[4] Their research developed a phylogenetic tree that determined the evolutionary changes that were required for a particular protein to change into another protein based on the underlying amino acid sequences. This led them to create a scoring matrix that assessed the likelihood of one protein being related to another.

Beginning in the 1980s, databases of genome sequences began to be recorded, but this presented new challenges in the form of searching and comparing the databases of gene information. Unlike text-searching algorithms that are used on websites such as Google or Wikipedia, searching for sections of genetic similarity requires one to find strings that are not simply identical, but similar. Such comparisons build on the Needleman-Wunsch algorithm (published in 1970), a dynamic programming algorithm for comparing amino acid sequences with each other, which can be used with scoring matrices derived from the earlier research by Dayhoff. Later, the BLAST algorithm was developed for performing fast, optimized searches of gene sequence databases. BLAST and its derivatives are probably the most widely used algorithms for this purpose.^[5]
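As an illustration of the dynamic programming idea behind Needleman-Wunsch, the following minimal Python sketch computes only the optimal global alignment score of two sequences. The function name and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for the example; practical tools use substitution matrices such as those derived from Dayhoff's work.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b.

    The scoring values are illustrative assumptions, not a standard matrix.
    """
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Aligning a[:i] with an empty prefix of b costs i gaps.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("ACGT", "AGT"))
```

Only two rows of the dynamic programming table are kept here because the sketch returns the score alone; recovering the alignment itself would require storing the full table and tracing back through it.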

The emergence of the phrase "computational genomics" coincides with the availability of complete sequenced genomes in the mid-to-late 1990s. The first meeting of the Annual Conference on Computational Genomics was organized by scientists from The Institute for Genomic Research (TIGR) in 1998, providing a forum for this speciality and effectively distinguishing this area of science from the more general fields of Genomics or Computational Biology.^{[citation needed]} The first use of this term in scientific literature, according to MEDLINE abstracts, was just one year earlier in Nucleic Acids Research.^[6] The final Computational Genomics conference was held in 2006, featuring a keynote talk by Nobel Laureate Barry Marshall, co-discoverer of the link between Helicobacter pylori and stomach ulcers. As of 2014, the leading conferences in the field include Intelligent Systems for Molecular Biology (ISMB) and Research in Computational Molecular Biology (RECOMB).

The development of computer-assisted mathematics (using products such as Mathematica or MATLAB) has helped engineers, mathematicians and computer scientists to start working in this domain, and a public collection of case studies and demonstrations is growing, ranging from whole-genome comparisons to gene expression analysis.^[7] This has brought in ideas from other fields, including concepts from systems and control, information theory, string analysis and data mining. It is anticipated that computational approaches will become and remain a standard topic for research and teaching, while students fluent in both computation and genomics are being trained in the many courses created in recent years.

Contributions of computational genomics research to biology

Contributions of computational genomics research to biology include:^[3]

proposing cellular signalling networks
proposing mechanisms of genome evolution
predicting precise locations of all human genes using comparative genomics techniques with several mammalian and vertebrate species
predicting conserved genomic regions that are related to early embryonic development
discovering potential links between repeated sequence motifs and tissue-specific gene expression
measuring regions of genomes that have undergone unusually rapid evolution

Genome comparison

Computational tools have been developed to assess the similarity of genomic sequences. Some of them are alignment-based distances such as Average Nucleotide Identity.^[8] These methods are highly specific but computationally slow. Other, alignment-free methods include statistical and probabilistic approaches. One example is Mash,^[9] a probabilistic approach using MinHash. In this method, given a number k, a genomic sequence is transformed into a shorter sketch by applying a random hash function to its k-mers. For example, with $k=2$, a sketch size of 4, and the hash function ${\begin{array}{cccc}(AA,0)&(AC,8)&(AT,2)&(AG,14)\\(CA,6)&(CC,13)&(CT,5)&(CG,4)\\(GA,15)&(GC,12)&(GT,10)&(GG,1)\\(TA,3)&(TC,11)&(TT,9)&(TG,7)\end{array}}$ ,

the sketch of the sequence

$CTGACCTTAACGGGAGACTATGATGACGACCGCAT$

is $\lbrace 0,1,2,3\rbrace$, the four smallest distinct hash values of its k-mers of size 2 (those of AA, GG, AT and TA). These sketches are then compared to estimate the fraction of shared k-mers (Jaccard index) of the corresponding sequences. In practice, hash values are fixed-width binary integers rather than the small numbers used in this example. In a real genomic setting, a useful k-mer size ranges from 14 to 21, and the sketch size would be around 1000.^[10]

By reducing the size of the sequences, even by a factor of hundreds, and comparing them in an alignment-free way, this method significantly reduces the time needed to estimate the similarity of sequences.
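The toy example above can be reproduced with a minimal Python sketch. The names `TOY_HASH`, `sketch` and `jaccard_estimate` are assumptions made for this illustration and are not part of Mash; a real implementation would use a general-purpose hash function rather than a lookup table. The sketch keeps only distinct hash values, as a set-based sketch does, so for the example sequence it returns 0, 1, 2, 3 (the hashes of AA, GG, AT and TA); counting the repeated GG separately would instead give 0, 1, 1, 2.

```python
# Toy hash table for 2-mers, copied from the example in the text.
TOY_HASH = {
    "AA": 0, "AC": 8, "AT": 2, "AG": 14,
    "CA": 6, "CC": 13, "CT": 5, "CG": 4,
    "GA": 15, "GC": 12, "GT": 10, "GG": 1,
    "TA": 3, "TC": 11, "TT": 9, "TG": 7,
}


def sketch(seq, k=2, size=4, hash_fn=TOY_HASH.__getitem__):
    """Return the `size` smallest distinct hash values of the k-mers of seq."""
    hashes = {hash_fn(seq[i:i + k]) for i in range(len(seq) - k + 1)}
    return sorted(hashes)[:size]


def jaccard_estimate(sketch_a, sketch_b, size=4):
    """Estimate the Jaccard index of two sequences from their sketches.

    Takes the `size` smallest hashes of the merged sketches and counts
    how many of them occur in both sketches.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


if __name__ == "__main__":
    s1 = sketch("CTGACCTTAACGGGAGACTATGATGACGACCGCAT")
    s2 = sketch("AAGG")
    print(s1, s2, jaccard_estimate(s1, s2))
```

Comparing two sketches touches at most a few thousand integers in a realistic setting, regardless of genome length, which is where the speed-up over alignment comes from.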

Clustering of genomic data

Clustering data is a tool used to simplify statistical analysis of a genomic sample. For example, BiG-SCAPE^[11] is a tool developed to analyze sequence similarity networks of biosynthetic gene clusters (BGCs). The automated tool BiG-MAP^[12] uses successive layers of clustering of biosynthetic gene clusters, both to filter redundant data and to identify gene cluster families. This tool profiles the abundance and expression levels of BGCs in microbiome samples.

Biosynthetic gene clusters

Bioinformatic tools have been developed to predict biosynthetic gene clusters, and to determine their abundance and expression in microbiome samples, from metagenomic data.^[13] Since metagenomic datasets are large, filtering and clustering them are important parts of these tools. These steps can involve dimensionality-reduction techniques, such as MinHash,^[14] and clustering algorithms such as k-medoids and affinity propagation. Several metrics and similarity measures have also been developed to compare gene clusters.
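To make the k-medoids idea concrete, here is a minimal Python sketch that clusters items from a precomputed distance matrix (for instance, distances estimated from sketches). It is a simple alternating variant written for illustration, not the implementation used by any of the tools named above; the function name, the choice of initial medoids, and the assumption that distinct items have strictly positive distances are all hypothetical.

```python
def k_medoids(dist, medoids, max_iter=100):
    """Cluster items around medoids using a precomputed distance matrix.

    dist: square list of lists; dist[i][j] is the distance between items i and j
          (assumed symmetric, zero on the diagonal, positive elsewhere).
    medoids: initial medoid indices (an arbitrary starting guess).
    Returns the final medoid indices and a dict mapping medoid -> member indices.
    """
    medoids = list(medoids)
    n = len(dist)
    clusters = {}
    for _ in range(max_iter):
        # Assignment step: attach every item to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i in range(n):
            nearest = min(medoids, key=lambda m: dist[i][m])
            clusters[nearest].append(i)
        # Update step: in each cluster, the new medoid is the member with
        # the smallest total distance to the other members.
        new_medoids = [
            min(members, key=lambda c: sum(dist[c][j] for j in members))
            for members in clusters.values()
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, clusters


if __name__ == "__main__":
    points = [0, 1, 2, 10, 11, 12]
    dist = [[abs(p - q) for q in points] for p in points]
    print(k_medoids(dist, medoids=[0, 3]))
```

Unlike k-means, the cluster centres here are always actual data items, so the method only needs pairwise distances and works even when the items (such as gene clusters) have no natural average.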

Genome mining for biosynthetic gene clusters (BGCs) has become an integral part of natural product discovery. The more than 200,000 microbial genomes now publicly available hold information on abundant novel chemistry. One way to navigate this vast genomic diversity is through comparative analysis of homologous BGCs, which allows identification of cross-species patterns that can be matched to the presence of metabolites or biological activities. However, current tools are hindered by a bottleneck caused by the expensive network-based approach used to group these BGCs into gene cluster families (GCFs). BiG-SLiCE (Biosynthetic Genes Super-Linear Clustering Engine) is a tool designed to cluster massive numbers of BGCs. By representing them in Euclidean space, BiG-SLiCE can group BGCs into GCFs in a non-pairwise, near-linear fashion.

Kautsar et al. (2021)^[15] used BiG-SLiCE to demonstrate the utility of such analyses by reconstructing a global map of secondary metabolic diversity across taxonomy to identify uncharted biosynthetic potential. This opens up new possibilities to accelerate natural product discovery and offers a first step towards constructing a global and searchable interconnected network of BGCs. As more genomes are sequenced from understudied taxa, more information can be mined to highlight their potentially novel chemistry.^[16]

References

^ WHO definitions of genetics and genomics
^ Koonin EV (March 2001). "Computational genomics". Current Biology. 11 (5): R155–8. doi:10.1016/S0960-9822(01)00081-1. PMID 11267880. S2CID 17202180.
^ ^a ^b Computational Genomics and Proteomics at MIT
^ Mount D (2000). Bioinformatics, Sequence and Genome Analysis. Cold Spring Harbor Laboratory Press. pp. 2–3. ISBN 978-0-87969-597-2.
^ Brown TA (1999). Genomes. Wiley. ISBN 978-0-471-31618-3.
^ Wagner A (September 1997). "A computational genomics approach to the identification of gene networks". Nucleic Acids Research. 25 (18): 3594–604. doi:10.1093/nar/25.18.3594. PMC 146952. PMID 9278479.
^ Cristianini N, Hahn M (2006). Introduction to Computational Genomics. Cambridge University Press. ISBN 978-0-521-67191-0.
^ Konstantinidis KT, Tiedje JM (2005). "Genomic insights that advance the species definition for prokaryotes". Proc Natl Acad Sci U S A. 102: 2567–72.
^ Ondov B, Treangen T, Melsted P, Mallonee A, Bergman N, Koren S, Phillippy A (2016). "Mash: fast genome and metagenome distance estimation using MinHash". Genome Biology. 17 (32): 14. doi:10.1186/s13059-016-0997-x. PMC 4915045. PMID 27323842.
^ Ondov B, Treangen T, Melsted P, Mallonee A, Bergman N, Koren S, Phillippy A (2016). "Mash: fast genome and metagenome distance estimation using MinHash". Genome Biology. 17 (32): 14. doi:10.1186/s13059-016-0997-x. PMC 4915045. PMID 27323842.
^ Navarro-Muñoz J, Selem-Mojica N, Mullowney M, Kautsar S, Tryon J, Parkinson E, De Los Santos E, Yeong M, Cruz-Morales P, Abubucker S, Roeters A, Lokhorst W, Fernandez-Guerra A, Dias-Cappelini L, Goering A, Thomson R, Metcalf W, Kelleher N, Barona-Gomez F, Medema M (2020). "A computational framework to explore large-scale biosynthetic diversity". Nat Chem Biol. 16 (1): 60–68. doi:10.1038/s41589-019-0400-9. PMC 6917865. PMID 31768033.
^ Pascal-Andreu V, Augustijn H, van den Berg K, van der Hooft J, Fischbach M, Medema M (2020). "BiG-MAP: an automated pipeline to profile metabolic gene cluster abundance and expression in microbiomes". bioRxiv: 32. doi:10.1101/2020.12.14.422671.
^ Pascal-Andreu V, Augustijn H, van den Berg K, van der Hooft J, Fischbach M, Medema M (2020). "BiG-MAP: an automated pipeline to profile metabolic gene cluster abundance and expression in microbiomes". bioRxiv: 32. doi:10.1101/2020.12.14.422671.
^ Ondov B, Treangen T, Melsted P, Mallonee A, Bergman N, Koren S, Phillippy A (2016). "Mash: fast genome and metagenome distance estimation using MinHash". Genome Biology. 17 (32): 14. doi:10.1186/s13059-016-0997-x. PMC 4915045. PMID 27323842.
^ Kautsar, Satria A; van der Hooft, Justin J J; de Ridder, Dick; Medema, Marnix H (13 January 2021). "BiG-SLiCE: A highly scalable tool maps the diversity of 1.2 million biosynthetic gene clusters". GigaScience. 10 (1): giaa154. doi:10.1093/gigascience/giaa154. PMC 7804863. PMID 33438731.
^ Kautsar, Satria A; van der Hooft, Justin J J; de Ridder, Dick; Medema, Marnix H (13 January 2021). "BiG-SLiCE: A highly scalable tool maps the diversity of 1.2 million biosynthetic gene clusters". GigaScience. 10 (1): giaa154. doi:10.1093/gigascience/giaa154. PMC 7804863. PMID 33438731.

External links

Harvard Extension School Biophysics 101, Genomics and Computational Biology, http://www.courses.fas.harvard.edu/~bphys101/info/syllabus.html
University of Bristol course in Computational Genomics, http://www.computational-genomics.net/

@@ Line 23: / Line 23: @@
 Computational tools have been developed to asses the similarity of genomic sequences. Some of them are [[Sequence alignment|alignment]]-based distances such as [[Average Nucleotid Identity]]<ref>{{cite journal |vauthors= Konstantinidis KT, Tiedje JM |title= Genomic insights that advance the species
 definition for prokaryotes |journal= Proc Natl Acad Sci U S A.|date= 2005;|volume= 102 |pages= 2567–72.</ref>. These methods are highly specific, while being computationally slow.
-Other, alignment-free methods, include statistical and probabilistic approaches. One example is Mash<ref>{{cite journal |vauthors= Ondov B, Treangen T, Melsted P, Mallonee A, Bergman N, Koren S, Phillippy A|date=2016|title= Mash: fast genome and metagenome distance estimation using MinHash |url= |journal=Genome Biology|volume=17 |issue=32|pages= 14|doi=  10.1186/s13059-016-0997-x |pmc=|pmid=|doi-access=free}}</ref>, a probabilistic approach using [[minhash]]. In this method, given a number k, a genomic sequence is transformed into a shorter sketch through a random [[hash function]] on the possible [[k-mers]]. For example, if <math>k=2</math>, sketches of size 4 are being constructed and given the following hash function <math>\begin{array}{cccc}
+Other, alignment-free methods, include statistical and probabilistic approaches. One example is Mash<ref>{{cite journal |vauthors= Ondov B, Treangen T, Melsted P, Mallonee A, Bergman N, Koren S, Phillippy A|date=2016|title= Mash: fast genome and metagenome distance estimation using MinHash |url= |journal=Genome Biology|volume=17 |issue=32|pages= 14|doi=  10.1186/s13059-016-0997-x |pmc=4915045|pmid=27323842|doi-access=free}}</ref>, a probabilistic approach using [[minhash]]. In this method, given a number k, a genomic sequence is transformed into a shorter sketch through a random [[hash function]] on the possible [[k-mers]]. For example, if <math>k=2</math>, sketches of size 4 are being constructed and given the following hash function <math>\begin{array}{cccc}
          (AA,0)    & (AC,8) & (AT,2) & (AG,14) \\
 (CA,6) & (CC,13) & (CT,5) & (CG,4) \\
@@ Line 35: / Line 35: @@
 is <math>\lbrace 0,1,1,2\rbrace</math> which are the smallest hash values of its k-mers of size 2. These sketches are then compared to estimate the fraction of shared k-mers ([[Jaccard index]]) of the corresponding sequences.
-It is worth noticing that a hash value is a binary number. In a real genomic setting a useful size of k-mers ranges from 14 to 21, and the size of the sketches would be around 1000<ref>{{cite journal |vauthors= Ondov B, Treangen T, Melsted P, Mallonee A, Bergman N, Koren S, Phillippy A|date=2016|title= Mash: fast genome and metagenome distance estimation using MinHash |url= |journal=Genome Biology|volume=17 |issue=32|pages= 14|doi=  10.1186/s13059-016-0997-x |pmc=|pmid=|doi-access=free}}</ref>.
+It is worth noticing that a hash value is a binary number. In a real genomic setting a useful size of k-mers ranges from 14 to 21, and the size of the sketches would be around 1000<ref>{{cite journal |vauthors= Ondov B, Treangen T, Melsted P, Mallonee A, Bergman N, Koren S, Phillippy A|date=2016|title= Mash: fast genome and metagenome distance estimation using MinHash |url= |journal=Genome Biology|volume=17 |issue=32|pages= 14|doi=  10.1186/s13059-016-0997-x |pmc=4915045|pmid=27323842|doi-access=free}}</ref>.
 By reducing the size of the sequences, even hundreds of times, and comparing them in an alignment-free way, this method reduces significantly the time of estimation of the similarity of sequences.
 == Clusterization of genomic data ==
-[[Cluster analysis |Clustering]] data is a tool used to simplify statistical analysis of a genomic sample. For example in<ref>{{cite journal |vauthors= Navarro-Muñoz J, Selem-Mojica N, Mullowney M, Kautsar S, Tryon J, Parkinson E, De Los Santos E, Yeong M, Cruz-Morales P, Abubucker S, Roeters A, Lokhorst W, Fernandez-Guerra A, Dias-Cappelini L, Goering A, Thomson R, Metcalf W, Kelleher N, Barona-Gomez F, Medema M |date=2020|title= A computational framework to explore large-scale biosynthetic
+[[Cluster analysis |Clustering]] data is a tool used to simplify statistical analysis of a genomic sample. For example in<ref>{{cite journal |vauthors= Navarro-Muñoz J, Selem-Mojica N, Mullowney M, Kautsar S, Tryon J, Parkinson E, De Los Santos E, Yeong M, Cruz-Morales P, Abubucker S, Roeters A, Lokhorst W, Fernandez-Guerra A, Dias-Cappelini L, Goering A, Thomson R, Metcalf W, Kelleher N, Barona-Gomez F, Medema M |date=2020|title= A computational framework to explore large-scale biosynthetic diversity |url= |journal=Nat Chem Biol|volume= 16|issue=1|pages=60–68 |doi= 10.1038/s41589-019-0400-9|pmc=6917865|pmid=31768033}}</ref> the authors developed a tool (BiG-SCAPE) to analize sequence similarity networks of [[Metabolic gene cluster |biosynthetic gene clusters]] (BGC). In <ref>{{cite journal |vauthors= Pascal-Andreu V, Augustijn H, van den Berg K, van der Hooft J, Fischbach M, Medema M|date=2020|title= BiG-MAP: an automated pipeline to profile metabolic gene cluster abundance and expression in microbiomes |url= |journal= |volume= |issue=|pages= 32|doi= 10.1101/2020.12.14.422671 |pmc=|pmid=}}</ref> succesive layers of clusterization of biosynthetic gene clusters are used in the automated tool BiG-MAP, both to filter redundant data and identify gene clusters families. This tool profiles the abundance and expressions levels of BGC's in microbiome samples.
-diversity |url= |journal=Nat Chem Biol|volume= 16|issue=1|pages=60-68 |doi= 10.1038/s41589-019-0400-9|pmc=|pmid=}}</ref> the authors developed a tool (BiG-SCAPE) to analize sequence similarity networks of [[Metabolic gene cluster |biosynthetic gene clusters]] (BGC). In <ref>{{cite journal |vauthors= Pascal-Andreu V, Augustijn H, van den Berg K, van der Hooft J, Fischbach M, Medema M|date=2020|title= BiG-MAP: an automated pipeline to profile metabolic gene cluster abundance and expression in microbiomes |url= |journal= |volume= |issue=|pages= 32|doi= 10.1101/2020.12.14.422671 |pmc=|pmid=}}</ref> succesive layers of clusterization of biosynthetic gene clusters are used in the automated tool BiG-MAP, both to filter redundant data and identify gene clusters families. This tool profiles the abundance and expressions levels of BGC's in microbiome samples.
 == Biosynthetic gene clusters ==
-Bioinformatic tools have been developed to predict, and determine the abundance and expression of, this kind of gene cluster in microbiome samples, from metagenomic data.<ref>{{cite journal |vauthors= Pascal-Andreu V, Augustijn H, van den Berg K, van der Hooft J, Fischbach M, Medema M|date=2020|title= BiG-MAP: an automated pipeline to profile metabolic gene cluster abundance and expression in microbiomes |url= |journal= Biorxiv|volume= |issue=|pages= 32|doi= 10.1101/2020.12.14.422671 |pmc=|pmid=}}</ref> Since the size of metagenomic data is considerable, filtering and clusterization thereof are important parts of these tools. These processes can consist of dimensionality -reduction techniques, such as [[Minhash]]<ref>{{cite journal |vauthors= Ondov B, Treangen T, Melsted P, Mallonee A, Bergman N, Koren S, Phillippy A|date=2016|title= Mash: fast genome and metagenome distance estimation using MinHash |url= |journal=Genome Biology|volume=17 |issue=32|pages= 14|doi=  10.1186/s13059-016-0997-x |pmc=|pmid=|doi-access=free}}</ref>, and clusterization algorithms such as [[k-medoids]] and [[affinity propagation]]. Also several metrics and similarities have been developed to compare them.
+Bioinformatic tools have been developed to predict, and determine the abundance and expression of, this kind of gene cluster in microbiome samples, from metagenomic data.<ref>{{cite journal |vauthors= Pascal-Andreu V, Augustijn H, van den Berg K, van der Hooft J, Fischbach M, Medema M|date=2020|title= BiG-MAP: an automated pipeline to profile metabolic gene cluster abundance and expression in microbiomes |url= |journal= bioRxiv|volume= |issue=|pages= 32|doi= 10.1101/2020.12.14.422671 |pmc=|pmid=}}</ref> Since the size of metagenomic data is considerable, filtering and clusterization thereof are important parts of these tools. These processes can consist of dimensionality -reduction techniques, such as [[Minhash]]<ref>{{cite journal |vauthors= Ondov B, Treangen T, Melsted P, Mallonee A, Bergman N, Koren S, Phillippy A|date=2016|title= Mash: fast genome and metagenome distance estimation using MinHash |url= |journal=Genome Biology|volume=17 |issue=32|pages= 14|doi=  10.1186/s13059-016-0997-x |pmc=4915045|pmid=27323842|doi-access=free}}</ref>, and clusterization algorithms such as [[k-medoids]] and [[affinity propagation]]. Also several metrics and similarities have been developed to compare them.
 Genome mining for biosynthetic gene clusters (BGCs) has become an integral part of natural product discovery. The >200,000 microbial genomes now publicly available hold information on abundant novel chemistry. One way to navigate this vast genomic diversity is through comparative analysis of homologous BGCs, which allows identification of cross-species patterns that can be matched to the presence of metabolites or biological activities. However, current tools are hindered by a bottleneck caused by the expensive network-based approach used to group these BGCs into gene cluster families (GCFs).
 BiG-SLiCE (Biosynthetic Genes Super-Linear Clustering Engine), a tool designed to cluster massive numbers of BGCs. By representing them in Euclidean space, BiG-SLiCE can group BGCs into GCFs in a non-pairwise, near-linear fashion.
-[[Satria et. al, 2021]]<ref>{{cite journal |last1=Kautsar |first1=Satria A |last2=van der Hooft |first2=Justin J J |last3=de Ridder |first3=Dick |last4=Medema |first4=Marnix H |title=BiG-SLiCE: A highly scalable tool maps the diversity of 1.2 million biosynthetic gene clusters |journal=GigaScience |date=13 January 2021 |volume=10 |issue=1 |pages=giaa154 |doi=10.1093/gigascience/giaa154|doi-access=free }}</ref> across BiG-SLiCE demonstrate the utility of such analyses by reconstructing a global map of secondary metabolic diversity across taxonomy to identify uncharted biosynthetic potential, opens up new possibilities to accelerate natural product discovery and offers a first step towards constructing a global and searchable interconnected network of BGCs. As more genomes are sequenced from understudied taxa, more information can be mined to highlight their potentially novel chemistry.<ref>{{cite journal |last1=Kautsar |first1=Satria A |last2=van der Hooft |first2=Justin J J |last3=de Ridder |first3=Dick |last4=Medema |first4=Marnix H |title=BiG-SLiCE: A highly scalable tool maps the diversity of 1.2 million biosynthetic gene clusters |journal=GigaScience |date=13 January 2021 |volume=10 |issue=1 |pages=giaa154 |doi=10.1093/gigascience/giaa154|doi-access=free }}</ref>
+[[Satria et. al, 2021]]<ref>{{cite journal |last1=Kautsar |first1=Satria A |last2=van der Hooft |first2=Justin J J |last3=de Ridder |first3=Dick |last4=Medema |first4=Marnix H |title=BiG-SLiCE: A highly scalable tool maps the diversity of 1.2 million biosynthetic gene clusters |journal=GigaScience |date=13 January 2021 |volume=10 |issue=1 |pages=giaa154 |doi=10.1093/gigascience/giaa154|pmid=33438731 |pmc=7804863 |doi-access=free }}</ref> across BiG-SLiCE demonstrate the utility of such analyses by reconstructing a global map of secondary metabolic diversity across taxonomy to identify uncharted biosynthetic potential, opens up new possibilities to accelerate natural product discovery and offers a first step towards constructing a global and searchable interconnected network of BGCs. As more genomes are sequenced from understudied taxa, more information can be mined to highlight their potentially novel chemistry.<ref>{{cite journal |last1=Kautsar |first1=Satria A |last2=van der Hooft |first2=Justin J J |last3=de Ridder |first3=Dick |last4=Medema |first4=Marnix H |title=BiG-SLiCE: A highly scalable tool maps the diversity of 1.2 million biosynthetic gene clusters |journal=GigaScience |date=13 January 2021 |volume=10 |issue=1 |pages=giaa154 |doi=10.1093/gigascience/giaa154|pmid=33438731 |pmc=7804863 |doi-access=free }}</ref>
 == See also ==
