{{DISPLAYTITLE:''k''-mer}}
[[File:K-mer diagram.svg|thumb|The sequence ATGG has two 3-mers: ATG and TGG.]]
In [[bioinformatics]], '''''k''-mers''' are [[substring]]s of length <math>k</math> contained within a biological sequence. They are used primarily in [[computational genomics]] and [[sequence analysis]], where ''k''-mers are composed of [[Nucleotide|nucleotides]] (''i.e.'', A, T, G, and C), to [[Sequence assembly|assemble DNA sequences]],<ref>{{Cite journal|last=Compeau|first=Phillip E C|last2=Pevzner|first2=Pavel A|last3=Tesler|first3=Glenn|date=November 2011|title=How to apply de Bruijn graphs to genome assembly|journal=Nature Biotechnology|language=en|volume=29|issue=11|pages=987–991|doi=10.1038/nbt.2023|pmid=22068540|pmc=5531759|issn=1087-0156}}</ref> improve [[Protein production|heterologous gene expression]],<ref name=":4">{{Cite journal|last=Welch|first=Mark|last2=Govindarajan|first2=Sridhar|last3=Ness|first3=Jon E.|last4=Villalobos|first4=Alan|last5=Gurney|first5=Austin|last6=Minshull|first6=Jeremy|last7=Gustafsson|first7=Claes|date=2009-09-14|editor-last=Kudla|editor-first=Grzegorz|title=Design Parameters to Control Synthetic Gene Expression in Escherichia coli|journal=PLoS ONE|language=en|volume=4|issue=9|pages=e7002|doi=10.1371/journal.pone.0007002|pmid=19759823|issn=1932-6203}}</ref><ref name=":6">{{Cite journal|last=Gustafsson|first=Claes|last2=Govindarajan|first2=Sridhar|last3=Minshull|first3=Jeremy|date=July 2004|title=Codon bias and heterologous protein expression|journal=Trends in Biotechnology|language=en|volume=22|issue=7|pages=346–353|doi=10.1016/j.tibtech.2004.04.006|pmid=15245907}}</ref> [[Binning (metagenomics)|identify species in metagenomic samples]],<ref name=":0">{{Cite journal|last=Perry|first=Scott C.|last2=Beiko|first2=Robert G.|date=2010-01-01|title=Distinguishing Microbial Genome Fragments Based on Their Composition: Evolutionary and Comparative Genomic Perspectives|journal=Genome Biology and Evolution|language=en|volume=2|pages=117–131|doi=10.1093/gbe/evq004|issn=1759-6653}}</ref> and create [[Attenuated vaccine|attenuated vaccines]].<ref>{{Cite journal|last=Eschke|first=Kathrin|last2=Trimpert|first2=Jakob|last3=Osterrieder|first3=Nikolaus|last4=Kunec|first4=Dusan|date=2018-01-29|editor-last=Mocarski|editor-first=Edward|title=Attenuation of a very virulent Marek's disease herpesvirus (MDV) by codon pair bias deoptimization|journal=PLOS Pathogens|language=en|volume=14|issue=1|pages=e1006857|doi=10.1371/journal.ppat.1006857|issn=1553-7374}}</ref> Usually, the term ''k''-mer refers to all of a sequence's substrings of length <math>k</math>, such that the sequence AGAT would have four [[Monomer|monomers]] (A, G, A, and T), three 2-mers (AG, GA, AT), two 3-mers (AGA and GAT) and one 4-mer (AGAT). More generally, a sequence of length <math>L</math> contains <math>L - k + 1</math> ''k''-mers (counting repeats), and there are <math>n^{k}</math> possible ''k''-mers in total, where <math>n</math> is the number of possible monomers (e.g. four in the case of [[DNA]]).
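The counting rule can be checked with a short sketch. The Python below is illustrative only: the function name <code>kmers</code> is a hypothetical helper, not part of any library, and the sequence is treated as a plain string.

```python
def kmers(sequence, k):
    """Return every length-k substring of `sequence`, in order of position."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # A sequence of length L yields L - k + 1 windows (none when k > L).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Enumerate the 1-mers through 4-mers of the example sequence AGAT.
for k in range(1, 5):
    print(k, kmers("AGAT", k))

# The number of possible DNA k-mers is n**k with n = 4 monomers.
print(4 ** 3)
```

Repeated ''k''-mers are kept in the returned list, so its length is always <math>L - k + 1</math> when <math>k \le L</math>.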


== Introduction ==
|}


A method of visualizing ''k''-mers, the '''''k''-mer spectrum''', shows the multiplicity of each ''k''-mer in a sequence versus the number of ''k''-mers with that multiplicity.<ref name=":7">{{Cite journal|last=Mapleson|first=Daniel|last2=Garcia Accinelli|first2=Gonzalo|last3=Kettleborough|first3=George|last4=Wright|first4=Jonathan|last5=Clavijo|first5=Bernardo J.|date=2016-10-22|title=KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies|url=https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btw663|journal=Bioinformatics|language=en|pages=btw663|doi=10.1093/bioinformatics/btw663|issn=1367-4803}}</ref> The number of modes in a ''k''-mer spectrum for a species's genome varies, with most species having a unimodal distribution.<ref name=":5">{{Cite journal|last=Chor|first=Benny|last2=Horn|first2=David|last3=Goldman|first3=Nick|last4=Levy|first4=Yaron|last5=Massingham|first5=Tim|date=2009|title=Genomic DNA k-mer spectra: models and modalities|journal=Genome Biology|language=en|volume=10|issue=10|pages=R108|doi=10.1186/gb-2009-10-10-r108|pmid=19814784|pmc=2784323|issn=1465-6906}}</ref> However, all [[Mammal|mammals]] have a multimodal distribution. The number of modes within a ''k''-mer spectrum can vary between regions of genomes as well: humans have unimodal ''k''-mer spectra in [[Five prime untranslated region|5' UTRs]] and [[Exon|exons]] but multimodal spectra in [[Three prime untranslated region|3' UTRs]] and [[Intron|introns]].
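As a rough illustration of how a ''k''-mer spectrum can be tabulated, the sketch below first counts each ''k''-mer and then counts how many distinct ''k''-mers share each multiplicity. The function name <code>kmer_spectrum</code> is hypothetical, and real tools such as the one cited above work on sequencing reads rather than a single short string.

```python
from collections import Counter


def kmer_spectrum(sequence, k):
    """Map each multiplicity to the number of distinct k-mers that have it."""
    # Step 1: multiplicity of every k-mer in the sequence.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Step 2: histogram of those multiplicities.
    spectrum = Counter(counts.values())
    return dict(sorted(spectrum.items()))


# In ATATAT the 2-mer AT occurs three times and TA occurs twice.
print(kmer_spectrum("ATATAT", 2))
```

Plotting the keys of the returned dictionary against its values gives the spectrum; the number of peaks in that plot corresponds to the number of modes discussed above.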


== Forces Affecting DNA ''k''-mer Frequency ==


=== ''k'' = 1 ===
When ''k'' = 1, there are four DNA ''k''-mers, ''i.e.'', A, T, G, and C. At the molecular level, there are three [[Hydrogen bond|hydrogen bonds]] between G and C, whereas there are only two between A and T. GC bonds, as a result of the extra hydrogen bond (and stronger stacking interactions), are more thermally stable than AT bonds.<ref>{{Cite journal|last=Yakovchuk|first=P.|date=2006-01-30|title=Base-stacking and base-pairing contributions into thermal stability of the DNA double helix|journal=Nucleic Acids Research|language=en|volume=34|issue=2|pages=564–574|doi=10.1093/nar/gkj454|pmid=16449200|issn=0305-1048}}</ref> Mammals and birds have a higher ratio of Gs and Cs to As and Ts ([[GC-content]]), which led to the hypothesis that thermal stability was a driving factor of GC-content variation.<ref>{{Cite journal|last=Bernardi|first=Giorgio|date=January 2000|title=Isochores and the evolutionary genomics of vertebrates|journal=Gene|language=en|volume=241|issue=1|pages=3–17|doi=10.1016/S0378-1119(99)00485-0|pmid=10607893}}</ref> However, while promising, this hypothesis did not hold up under scrutiny: analysis among a variety of prokaryotes showed no evidence of GC-content correlating with temperature as the thermal adaptation hypothesis would predict.<ref>{{Cite journal|last=Hurst|first=Laurence D.|last2=Merchant|first2=Alexa R.|date=2001-03-07|title=High guanine–cytosine content is not an adaptation to high temperature: a comparative analysis amongst prokaryotes|journal=Proceedings of the Royal Society of London. 
Series B: Biological Sciences|language=en|volume=268|issue=1466|pages=493–497|doi=10.1098/rspb.2000.1397|pmc=1088632|issn=1471-2954}}</ref> Indeed, if natural selection were the driving force behind GC-content variation, [[Single nucleotide polymorphism|single nucleotide changes]], which are often [[Synonymous substitution|silent]], would have to alter the fitness of an organism.<ref name=":1">{{Cite journal|last=Mugal|first=Carina F.|last2=Weber|first2=Claudia C.|last3=Ellegren|first3=Hans|date=December 2015|title=GC-biased gene conversion links the recombination landscape and demography to genomic base composition: GC-biased gene conversion drives genomic base composition across a wide range of species|journal=BioEssays|language=en|volume=37|issue=12|pages=1317–1326|doi=10.1002/bies.201500058}}</ref>
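For concreteness, GC-content can be computed from 1-mer counts. The sketch below assumes the common definition of GC-content as the fraction of G and C among all unambiguous bases, (G + C) / (A + C + G + T); the function name <code>gc_content</code> is hypothetical.

```python
def gc_content(sequence):
    """Fraction of G and C among the A, C, G and T bases of `sequence`."""
    sequence = sequence.upper()
    # Only unambiguous bases are counted; symbols such as N are ignored.
    total = sum(sequence.count(base) for base in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return (sequence.count("G") + sequence.count("C")) / total


print(gc_content("ATGG"))  # two of the four bases are G or C
```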


Rather, current evidence suggests that [[Gene conversion#GC-biased gene conversion|GC‐biased gene conversion]] (gBGC) is a driving factor behind variation in GC content.<ref name=":1" /> gBGC is a process that occurs during [[Genetic recombination|recombination]] and replaces As and Ts with Gs and Cs.<ref>{{Cite journal|last=Romiguier|first=Jonathan|last2=Roux|first2=Camille|date=2017-02-15|title=Analytical Biases Associated with GC-Content in Molecular Evolution|journal=Frontiers in Genetics|volume=8|doi=10.3389/fgene.2017.00016|issn=1664-8021}}</ref> This process, though distinct from natural selection, can nevertheless mimic selective pressure, because it biases DNA towards GC replacements being fixed in the genome. gBGC can therefore be seen as an "impostor" of natural selection. As would be expected, GC content is greater at sites experiencing greater recombination.<ref>{{Cite journal|last=Spencer|first=C.C.A.|date=2006-08-01|title=Human polymorphism around recombination hotspots: Figure 1|journal=Biochemical Society Transactions|language=en|volume=34|issue=4|pages=535–536|doi=10.1042/BST0340535|issn=0300-5127}}</ref> Furthermore, organisms with higher rates of recombination exhibit higher GC content, in keeping with the gBGC hypothesis's predicted effects.<ref>{{Cite journal|last=Weber|first=Claudia C|last2=Boussau|first2=Bastien|last3=Romiguier|first3=Jonathan|last4=Jarvis|first4=Erich D|last5=Ellegren|first5=Hans|date=December 2014|title=Evidence for GC-biased gene conversion as a driver of between-lineage differences in avian base composition|journal=Genome Biology|language=en|volume=15|issue=12|doi=10.1186/s13059-014-0549-1|issn=1474-760X}}</ref> Interestingly, gBGC does not appear to be limited to [[eukaryotes]].<ref>{{Cite journal|last=Lassalle|first=Florent|last2=Périan|first2=Séverine|last3=Bataillon|first3=Thomas|last4=Nesme|first4=Xavier|last5=Duret|first5=Laurent|last6=Daubin|first6=Vincent|date=2015-02-06|editor-last=Petrov|editor-first=Dmitri
A.|title=GC-Content Evolution in Bacterial Genomes: The Biased Gene Conversion Hypothesis Expands|journal=PLOS Genetics|language=en|volume=11|issue=2|pages=e1004941|doi=10.1371/journal.pgen.1004941|issn=1553-7404}}</ref> Asexual organisms such as bacteria and archaea also experience recombination by means of gene conversion, a process of homologous sequence replacement resulting in multiple identical sequences throughout the genome.<ref>{{Cite journal|last=Santoyo|first=G|last2=Romero|first2=D|date=April 2005|title=Gene conversion and concerted evolution in bacterial genomes|journal=FEMS Microbiology Reviews|language=en|volume=29|issue=2|pages=169–183|doi=10.1016/j.femsre.2004.10.004}}</ref> That recombination is able to drive up GC content in all domains of life suggests that gBGC is universally conserved. Whether gBGC is a (mostly) neutral byproduct of the molecular machinery of life or is itself under selection remains to be determined. The exact mechanism and evolutionary advantage or disadvantage of gBGC is currently unknown.<ref>{{Citation|last=Bhérer|first=Claude|title=Biased Gene Conversion and Its Impact on Genome Evolution|date=2014-06-16|work=eLS|editor-last=John Wiley & Sons Ltd|publisher=John Wiley & Sons, Ltd|language=en|doi=10.1002/9780470015902.a0020834.pub2|isbn=9780470015902|last2=Auton|first2=Adam}}</ref>


=== ''k'' = 2 ===
Despite the comparatively large body of literature discussing GC-content biases, relatively little has been written about dinucleotide biases. What is known is that these dinucleotide biases are relatively constant throughout the genome, unlike GC-content, which, as seen above, can vary considerably.<ref name=":3">{{Cite journal|last=Karlin|first=Samuel|date=October 1998|title=Global dinucleotide signatures and analysis of genomic heterogeneity|journal=Current Opinion in Microbiology|language=en|volume=1|issue=5|pages=598–610|doi=10.1016/S1369-5274(98)80095-7|pmid=10066522}}</ref> If dinucleotide bias were subject to pressures resulting from [[Translation (biology)|translation]], then there would be differing patterns of dinucleotide bias in [[Coding region|coding]] and [[Non-coding DNA|noncoding]] regions driven by some dinucleotides' reduced translational efficiency.<ref>{{Cite journal|last=Beutler|first=E.|last2=Gelbart|first2=T.|last3=Han|first3=J. H.|last4=Koziol|first4=J. A.|last5=Beutler|first5=B.|date=1989-01-01|title=Evolution of the genome and the genetic code: selection at the dinucleotide level by methylation and polyribonucleotide cleavage.|journal=Proceedings of the National Academy of Sciences|language=en|volume=86|issue=1|pages=192–196|doi=10.1073/pnas.86.1.192|issn=0027-8424}}</ref> Because no such difference is observed, it can be inferred that the forces modulating dinucleotide bias are independent of translation. 
Further evidence against translational pressures affecting dinucleotide bias is the fact that the dinucleotide biases of viruses, which rely heavily on translational efficiency, are shaped by their viral family more than by their hosts, whose translational machinery the viruses hijack.<ref>{{Cite journal|last=Di Giallonardo|first=Francesca|last2=Schlub|first2=Timothy E.|last3=Shi|first3=Mang|last4=Holmes|first4=Edward C.|date=2017-04-15|editor-last=Dermody|editor-first=Terence S.|title=Dinucleotide Composition in Animal RNA Viruses Is Shaped More by Virus Family than by Host Species|journal=Journal of Virology|language=en|volume=91|issue=8|doi=10.1128/JVI.02381-16|issn=0022-538X}}</ref>


Counteracting the GC-increasing effect of gBGC is [[CG suppression]], which reduces the frequency of [[CpG site|CG]] 2-mers due to [[deamination]] of [[Methylation#DNA/RNA methylation|methylated]] CG dinucleotides, resulting in substitutions of CGs with TGs, thereby reducing the GC-content.<ref>{{Cite journal|last=Żemojtel|first=Tomasz|last2=Kiełbasa|first2=Szymon M.|last3=Arndt|first3=Peter F.|last4=Behrens|first4=Sarah|last5=Bourque|first5=Guillaume|last6=Vingron|first6=Martin|date=2011-01-01|title=CpG Deamination Creates Transcription Factor–Binding Sites with High Efficiency|journal=Genome Biology and Evolution|language=en|volume=3|pages=1304–1311|doi=10.1093/gbe/evr107|issn=1759-6653}}</ref> This interaction highlights the interrelationship between the forces affecting ''k''-mers for varying values of ''k''.
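One way to make a dinucleotide bias such as CG suppression measurable is an observed/expected odds ratio: the frequency of the 2-mer divided by the product of the frequencies of its two bases. The sketch below is an assumption for illustration (single strand only, no reverse-complement averaging, hypothetical function name <code>dinucleotide_odds_ratio</code>), not a method taken from the cited sources.

```python
from collections import Counter


def dinucleotide_odds_ratio(sequence, dinucleotide):
    """Observed 2-mer frequency divided by the product of its base frequencies."""
    sequence = sequence.upper()
    dinucleotide = dinucleotide.upper()
    if len(dinucleotide) != 2:
        raise ValueError("dinucleotide must have exactly two bases")
    if len(sequence) < 2:
        raise ValueError("sequence must contain at least two bases")
    x, y = dinucleotide
    mono = Counter(sequence)
    di = Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))
    f_x = mono[x] / len(sequence)
    f_y = mono[y] / len(sequence)
    if f_x == 0 or f_y == 0:
        raise ValueError("both bases must occur in the sequence")
    # There are len(sequence) - 1 overlapping 2-mers in the sequence.
    f_xy = di[x + y] / (len(sequence) - 1)
    return f_xy / (f_x * f_y)


# Values well below 1 indicate that CG is rarer than base composition predicts.
print(dinucleotide_odds_ratio("CACATGTG", "CG"))
```

Under this definition a ratio near 1 means the 2-mer occurs about as often as expected from the 1-mer frequencies alone, which separates dinucleotide bias from GC-content.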


One interesting fact about dinucleotide bias is that it can serve as a "distance" measurement between phylogenetically similar genomes. The genomes of closely related organisms have more similar dinucleotide biases than the genomes of more distantly related organisms.<ref name=":3" />
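A minimal sketch of such a distance is given below. It assumes that each genome is summarized by the 16 observed/expected odds ratios of its 2-mers and that the distance is the mean absolute difference between the two profiles; both choices, and the names <code>dinucleotide_profile</code> and <code>profile_distance</code>, are illustrative assumptions rather than the exact procedure of the cited work.

```python
from collections import Counter
from itertools import product


def dinucleotide_profile(sequence):
    """Observed/expected odds ratio for each of the 16 DNA 2-mers."""
    sequence = sequence.upper()
    n = len(sequence)
    mono = Counter(sequence)
    if n < 2 or any(mono[base] == 0 for base in "ACGT"):
        raise ValueError("sequence must contain all of A, C, G and T")
    di = Counter(sequence[i:i + 2] for i in range(n - 1))
    return {
        x + y: (di[x + y] / (n - 1)) / ((mono[x] / n) * (mono[y] / n))
        for x, y in product("ACGT", repeat=2)
    }


def profile_distance(seq_a, seq_b):
    """Mean absolute difference between two dinucleotide profiles."""
    profile_a = dinucleotide_profile(seq_a)
    profile_b = dinucleotide_profile(seq_b)
    return sum(abs(profile_a[d] - profile_b[d]) for d in profile_a) / len(profile_a)


print(profile_distance("ACGT", "TGCA"))
```

The toy sequences here are far too short for a meaningful comparison; in practice the profiles would be computed from whole genomes or long contigs.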
There are twenty natural [[Amino acid|amino acids]] that are used to build the proteins that DNA encodes. However, there are only four nucleotides. Therefore, there cannot be a one-to-one correspondence between nucleotides and amino acids. Similarly, there are 16 2-mers, which is also not enough to unambiguously represent every amino acid. However, there are 64 distinct 3-mers in DNA, which is enough to uniquely represent each amino acid. When a coding sequence is read as consecutive, non-overlapping 3-mers, these 3-mers are called [[Genetic code|codons]]. While each codon maps to only one amino acid, each amino acid can be [[Codon degeneracy|represented by multiple codons]]. Thus, the same amino acid sequence can have multiple DNA representations. Interestingly, the codons for a given amino acid are not used in equal proportions.<ref name=":2">Hershberg R, Petrov DA. Selection on Codon Bias. Annual Review of Genetics. Annual Reviews; 2008;42: 287–299. doi:10.1146/annurev.genet.42.110807.091442</ref> This is called [[Codon usage bias|codon-usage bias]] (CUB). When ''k'' = 3, a distinction must be made between true 3-mer frequency and CUB. For example, the sequence ATGGCA has four 3-mer words within it (ATG, TGG, GGC, and GCA) while only containing two codons (ATG and GCA). However, CUB is a major driving factor of 3-mer usage bias (accounting for up to ⅓ of it, since ⅓ of the 3-mers in a coding region are codons) and will be the main focus of this section.
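The difference between overlapping 3-mers and codons can be seen in a short sketch. The function names <code>three_mers</code> and <code>codons</code> and the <code>frame</code> parameter are hypothetical; the code assumes the reading frame starts at the given offset and drops any incomplete trailing codon.

```python
def three_mers(sequence):
    """All overlapping 3-mers: the window advances one base at a time."""
    return [sequence[i:i + 3] for i in range(len(sequence) - 2)]


def codons(sequence, frame=0):
    """Non-overlapping 3-mers: the window advances three bases at a time."""
    return [sequence[i:i + 3] for i in range(frame, len(sequence) - 2, 3)]


print(three_mers("ATGGCA"))  # four overlapping 3-mers
print(codons("ATGGCA"))      # two codons in reading frame 0
```

A 3-mer count taken over a coding region therefore mixes the in-frame codons with the two out-of-frame readings, which is why 3-mer bias and codon-usage bias are related but not identical.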


The exact cause of variation between the frequencies of various codons is not fully understood. It is known that codon preference is correlated with tRNA abundances, with codons matching more abundant tRNAs being correspondingly more frequent<ref name=":2" /> and that more highly expressed proteins exhibit greater CUB.<ref>{{Cite journal|last=Sharp|first=Paul M.|last2=Li|first2=Wen-Hsiung|date=1987|title=The codon adaptation index-a measure of directional synonymous codon usage bias, and its potential applications|journal=Nucleic Acids Research|language=en|volume=15|issue=3|pages=1281–1295|doi=10.1093/nar/15.3.1281|pmc=340524|issn=0305-1048}}</ref> This suggests that selection for translational efficiency or accuracy is the driving force behind CUB variation.


=== ''k'' = 4 ===
Similar to the effect seen in dinucleotide bias, the tetranucleotide biases of phylogenetically similar organisms are more similar to one another than those of less closely related organisms.<ref name=":0" /> The exact cause of variation in tetranucleotide bias is not well understood, but it has been hypothesized to be the result of the maintenance of genetic stability at the molecular level.<ref>{{Cite journal|last=Noble|first=Peter A.|last2=Citek|first2=Robert W.|last3=Ogunseitan|first3=Oladele A.|date=April 1998|title=Tetranucleotide frequencies in microbial genomes|journal=Electrophoresis|volume=19|issue=4|pages=528–535|doi=10.1002/elps.1150190412|issn=0173-0835}}</ref>


==Applications==
The frequency of a set of ''k''-mers in a species' genome, in a genomic region, or in a class of sequences can be used as a "signature" of the underlying sequence. Comparing these frequencies is computationally easier than [[sequence alignment]], and is an important method in [[alignment-free sequence analysis]]. It can also be used as a first-stage analysis before an alignment.
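
As a rough illustration of such a signature, the following Python sketch builds a normalized ''k''-mer frequency vector for each sequence and compares two vectors with the Euclidean distance. The function names, the choice of distance, and the toy sequences are assumptions made here; published tools use a variety of distance measures and normalizations. The sketch also assumes that sequences contain only the characters A, C, G, and T.

```python
import math
from collections import Counter
from itertools import product


def kmer_profile(seq, k):
    """Return the relative frequency of every possible k-mer over ACGT.

    The vector has 4**k entries in lexicographic k-mer order.
    Assumes seq is at least k long and contains only A, C, G, T.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence is shorter than k")
    return [counts["".join(p)] / total for p in product("ACGT", repeat=k)]


def euclidean(p, q):
    """Euclidean distance between two equal-length frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Hypothetical toy sequences; no alignment is computed at any point.
a = kmer_profile("ATATATATAT", 2)
b = kmer_profile("GCGCGCGCGC", 2)
print(euclidean(a, a), euclidean(a, b))
```

Each profile is computed in a single pass over the sequence, which is why this kind of comparison is cheaper than alignment.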
=== Sequence assembly===
[[File:k-mer-example.png|thumb|This figure shows the process of splitting reads into smaller ''k''-mers (4-mers in this case) so that they can be used in a De Bruijn graph. (A) Shows the initial segment of DNA being sequenced. (B) Shows the reads output from sequencing and how they align. The problem with this alignment is that the reads overlap by k-2 rather than k-1 (which is needed in De Bruijn graphs). (C) Shows the reads being split into smaller 4-mers. (D) Discards the repeated 4-mers and then shows their alignment. Note that these ''k''-mers overlap by k-1 and can then be used in a De Bruijn graph.|alt=|700x700px]]In sequence assembly, ''k''-mers are used during the construction of [[De Bruijn graph]]s.<ref>{{Cite journal|last=Nagarajan|first=Niranjan|last2=Pop|first2=Mihai|date=2013|title=Sequence assembly demystified|journal=Nature Reviews Genetics|language=en|volume=14|issue=3|pages=157–167|doi=10.1038/nrg3367|pmid=23358380|issn=1471-0056}}</ref><ref>{{cite journal|author=Li|display-authors=etal|year=2010|title=De novo assembly of human genomes with massively parallel short read sequencing|journal=Genome Research|volume=20|issue=2|pages=265–272|doi=10.1101/gr.097261.109|pmc=2813482|pmid=20019144}}</ref> To create a De Bruijn graph, the string of length <math> L</math> stored in each edge must overlap the string in an adjoining edge by <math>L-1</math> characters in order to create a [[vertex (graph theory)|vertex]]. Reads generated from [[next-generation sequencing]] typically vary in length. For example, [[Illumina dye sequencing|Illumina]]’s sequencing technology produces 100-mer reads. However, only a small fraction of all the possible 100-mers present in the genome is actually generated. This is due to read errors but, more importantly, to simple coverage holes that occur during sequencing. This incomplete set of ''k''-mers violates the key assumption of De Bruijn graphs that every ''k''-mer must overlap its adjoining ''k''-mer in the genome by <math>k-1</math> (which cannot occur when not all of the ''k''-mers are present).
The solution to this problem is to break these reads into smaller ''k''-mers, such that the resulting ''k''-mers represent all the possible ''k''-mers of that smaller size that are present in the genome.<ref name="debruijn">{{cite journal|last1=Compeau|first1=P.|last2=Pevzner|first2=P.|last3=Tesler|first3=G.|year=2011|title=How to apply de Bruijn graphs to genome assembly|journal=Nature Biotechnology|volume=29|issue=11|pages=987–991|doi=10.1038/nbt.2023|pmc=5531759|pmid=22068540}}</ref> Furthermore, splitting the reads into smaller ''k''-mers also helps alleviate the problem of different initial read lengths. In the example shown in the figure, the five reads do not account for all the possible 7-mers of the genome, and as such, a De Bruijn graph cannot be created. But when they are split into 4-mers, the resulting subsequences are enough to reconstruct the genome using a De Bruijn graph.
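
The splitting step and the graph construction can be sketched in Python as follows. This is a simplified illustration rather than the method of any particular assembler. It assumes error-free reads from a single strand, a genome in which every ''k''-mer occurs once, and the existence of an Eulerian path. The function names and the three toy reads are hypothetical and are not the reads shown in the figure.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a De Bruijn graph with one edge per distinct k-mer.

    Nodes are (k-1)-mers; each k-mer joins its prefix to its suffix.
    """
    kmers = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Walk every edge exactly once (Hierholzer's algorithm).

    Assumes the graph is non-empty and an Eulerian path exists.
    """
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (u for u in graph if len(graph[u]) - in_deg[u] == 1),
        next(iter(graph)),
    )
    edges = {u: list(vs) for u, vs in graph.items()}  # consumable copy
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if edges.get(u):
            stack.append(edges[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(reads, k):
    """Spell the sequence along the Eulerian path of the k-mer graph."""
    path = eulerian_path(de_bruijn(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])


# Three toy 7-mer reads that miss one 7-mer of the genome (TGGCGTG).
reads = ["ATGGCGT", "GGCGTGC", "GCGTGCA"]
print(assemble(reads, 4))  # ATGGCGTGCA
```

The three reads do not contain every 7-mer of the toy genome, but their 4-mers cover it completely, so the path through the 4-mer graph spells the full sequence.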


Beyond being used directly for sequence assembly, ''k''-mers can also be used to detect genome mis-assembly by identifying overrepresented ''k''-mers, which suggest the presence of [[Repeated sequence (DNA)|repeated DNA sequences]] that have been combined.<ref>{{cite journal|author=Phillippy, Schatz, Pop|year=2008|title=Genome assembly forensics: finding the elusive mis-assembly|journal=Genome Biology|volume=9|issue=3|page=R55|doi=10.1186/gb-2008-9-3-r55|pmc=2397507|pmid=18341692}}</ref> In addition, ''k''-mers are used to detect bacterial contamination during eukaryotic genome assembly, an approach borrowed from the field of metagenomics.<ref>{{cite journal|author=Delmont, Eren|year=2016|title=Identifying contamination with advanced visualization and analysis practices: metagenomic approaches for eukaryotic genome assemblies|journal=PeerJ|volume=4|page=e1839|doi=10.7717/Fpeerj.1839|doi-broken-date=2019-07-13}}</ref><ref>{{cite journal|author=Bemm|display-authors=etal|year=2016|title=Genome of a tardigrade: Horizontal gene transfer or bacterial contamination?|journal=Proceedings of the National Academy of Sciences|volume=113|issue=22|pages=E3054–E3056|doi=10.1073/pnas.1525116113|pmc=4896698|pmid=27173902}}</ref>


====Choice of ''k''-mer====


=== Genetics and Genomics ===
With respect to disease, dinucleotide bias has been applied to the detection of genetic islands associated with pathogenicity.<ref name=":1" /> Prior work has also shown that tetranucleotide biases are able to effectively detect [[horizontal gene transfer]] in both prokaryotes<ref>{{Cite journal|last=Goodur|first=Haswanee D.|last2=Ramtohul|first2=Vyasanand|last3=Baichoo|first3=Shakuntala|date=2012-11-11|title=GIDT — A tool for the identification and visualization of genomic islands in prokaryotic organisms|url=http://ieeexplore.ieee.org/document/6399707/|journal=2012 IEEE 12th International Conference on Bioinformatics & Bioengineering (BIBE)|doi=10.1109/bibe.2012.6399707}}</ref> and eukaryotes.<ref>{{Cite journal|last=Jaron|first=K. S.|last2=Moravec|first2=J. C.|last3=Martinkova|first3=N.|date=2014-04-15|title=SigHunt: horizontal gene transfer finder optimized for eukaryotic genomes|journal=Bioinformatics|language=en|volume=30|issue=8|pages=1081–1086|doi=10.1093/bioinformatics/btt727|pmid=24371153|issn=1367-4803}}</ref>


Another application of ''k''-mers is in genomics-based taxonomy. For example, GC-content has been used to distinguish between species of ''[[Erwinia]]'' with moderate success.<ref>{{Cite journal|last=Starr|first=M. P.|last2=Mandel|first2=M.|date=1969-04-01|title=DNA Base Composition and Taxonomy of Phytopathogenic and Other Enterobacteria|journal=Journal of General Microbiology|language=en|volume=56|issue=1|pages=113–123|doi=10.1099/00221287-56-1-113|issn=0022-1287}}</ref> Similar to the direct use of GC-content for taxonomic purposes is the use of T<small>m</small>, the melting temperature of DNA. Because GC bonds are more thermally stable, sequences with higher GC content exhibit a higher T<small>m</small>. In 1987, the Ad Hoc Committee on Reconciliation of Approaches to Bacterial Systematics proposed the use of ΔT<small>m</small> as a factor in determining species boundaries as part of the [[Species#Phylogenetic, cladistic, or evolutionary species|phylogenetic species concept]], though this proposal does not appear to have gained traction within the scientific community.<ref>{{Cite journal|last=Moore|first=W. E. C.|last2=Stackebrandt|first2=E.|last3=Kandler|first3=O.|last4=Colwell|first4=R. R.|last5=Krichevsky|first5=M. I.|last6=Truper|first6=H. G.|last7=Murray|first7=R. G. E.|last8=Wayne|first8=L. G.|last9=Grimont|first9=P. A. D.|date=1987-10-01|title=Report of the Ad Hoc Committee on Reconciliation of Approaches to Bacterial Systematics|journal=International Journal of Systematic and Evolutionary Microbiology|language=en|volume=37|issue=4|pages=463–464|doi=10.1099/00207713-37-4-463|issn=1466-5026}}</ref>
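
The relationship between GC-content and melting temperature can be illustrated with a short Python sketch. GC-content is the summed frequency of the 1-mers G and C. The melting temperature estimate uses the Wallace rule, a common rule of thumb for short oligonucleotides (2 °C per A or T and 4 °C per G or C); it is included here only as an assumption for illustration and is not the measurement used in the taxonomic studies cited above. The function names are hypothetical.

```python
def gc_content(seq):
    """Fraction of bases in seq that are G or C (1-mer frequencies)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq):
    """Rough melting temperature in degrees Celsius for a short oligo.

    Wallace rule: 2 degrees per A/T plus 4 degrees per G/C.
    Only a rule of thumb; not suitable for long genomic DNA.
    """
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


for oligo in ("ATATATAT", "ATGCATGC", "GCGCGCGC"):
    print(oligo, gc_content(oligo), wallace_tm(oligo))
```

Under this approximation, the estimated melting temperature rises with GC-content for oligonucleotides of equal length.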


Other applications within genetics and genomics include:
* [[Gene isoform|RNA isoform]] quantification from [[RNA-Seq|RNA-seq]] data<ref>{{cite journal|author=Patro, Mount, Kingsford|year=2014|title=Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms|journal=Nature Biotechnology|volume=32|issue=5|pages=462–464|arxiv=1308.3700|doi=10.1038/nbt.2862|pmc=4077321|pmid=24752080}}</ref>
* Classification of human mitochondrial [[haplogroup]]<ref>{{cite journal|author=Navarro-Gomez|display-authors=etal|year=2015|title=Phy-Mer: a novel alignment-free and reference-independent mitochondrial haplogroup classifier|journal=Bioinformatics|volume=31|issue=8|pages=1310–1312|doi=10.1093/bioinformatics/btu825|pmc=4393525|pmid=25505086}}</ref>
* Detection of recombination sites in genomes<ref>{{Cite journal|last=Wang|first=Rong|last2=Xu|first2=Yong|last3=Liu|first3=Bin|date=2016|title=Recombination spot identification Based on gapped k-mers|journal=Scientific Reports|language=en|volume=6|issue=1|pages=23934|doi=10.1038/srep23934|pmid=27030570|pmc=4814916|issn=2045-2322}}</ref>
* Estimation of [[genome size]] using ''k''-mer frequency vs ''k''-mer depth<ref>{{Citation|last=Hozza|first=Michal|title=How Big is that Genome? Estimating Genome Size and Coverage from k-mer Abundance Spectra|date=2015|work=String Processing and Information Retrieval|volume=9309|pages=199–209|editor-last=Iliopoulos|editor-first=Costas|publisher=Springer International Publishing|doi=10.1007/978-3-319-23826-5_20|isbn=9783319238258|last2=Vinař|first2=Tomáš|last3=Brejová|first3=Broňa|editor2-last=Puglisi|editor2-first=Simon|editor3-last=Yilmaz|editor3-first=Emine}}</ref><ref>{{Cite journal|last=Lamichhaney|first=Sangeet|last2=Fan|first2=Guangyi|last3=Widemo|first3=Fredrik|last4=Gunnarsson|first4=Ulrika|last5=Thalmann|first5=Doreen Schwochow|last6=Hoeppner|first6=Marc P|last7=Kerje|first7=Susanne|last8=Gustafson|first8=Ulla|last9=Shi|first9=Chengcheng|date=2016|title=Structural genomic changes underlie alternative reproductive strategies in the ruff (Philomachus pugnax)|journal=Nature Genetics|language=en|volume=48|issue=1|pages=84–88|doi=10.1038/ng.3430|pmid=26569123|issn=1061-4036}}</ref>
* Characterization of [[CpG site|CpG islands]] by flanking regions<ref>{{cite journal|author=Chae|display-authors=etal|year=2013|title=Comparative analysis using K-mer and K-flank patterns provides evidence for CpG island sequence evolution in mammalian genomes|journal=Nucleic Acids Research|volume=41|issue=9|pages=4783–4791|doi=10.1093/nar/gkt144|pmc=3643570|pmid=23519616}}</ref><ref>{{cite journal|author=Mohamed Hashim, Abdullah|year=2015|title=Rare k-mer DNA: Identification of sequence motifs and prediction of CpG island and promoter|journal=Journal of Theoretical Biology|volume=387|pages=88–100|doi=10.1016/j.jtbi.2015.09.014|pmid=26427337|url=https://zenodo.org/record/895970}}</ref>


*''De novo'' detection of [[repeated sequence]]s such as [[transposable element]]s<ref>{{cite journal|author=Price, Jones, Pevzner|year=2005|title=De novo identification of repeat families in large genomes|journal=Bioinformatics|volume=21(supp 1)|pages=i351–8|doi=10.1093/bioinformatics/bti1018|pmid=15961478}}</ref>
*[[DNA barcoding]] of species.<ref name=":5" /><ref>{{Cite journal|last=Meher|first=Prabina Kumar|last2=Sahu|first2=Tanmaya Kumar|last3=Rao|first3=A.R.|date=2016|title=Identification of species based on DNA barcode using k-mer feature vector and Random forest classifier|journal=Gene|language=en|volume=592|issue=2|pages=316–324|doi=10.1016/j.gene.2016.07.010|pmid=27393648}}</ref>
*Characterization of protein-binding [[sequence motif|sequence motifs]]<ref>{{cite journal|author=Newburger, Bulyk|year=2009|title=UniPROBE: an online database of protein binding microarray data on protein–DNA interactions|journal=Nucleic Acids Research|volume=37(supp 1)|issue=Database issue|pages=D77–82|doi=10.1093/nar/gkn660|pmc=2686578|pmid=18842628}}</ref>
*Identification of [[mutation]] or [[polymorphism (biology)|polymorphism]] using next generation [[DNA sequencing|sequencing]] data<ref>{{cite journal|author=Nordstrom|display-authors=etal|year=2013|title=Mutation identification by direct comparison of whole-genome sequencing data from mutant and wild-type individuals using k-mers|journal=Nature Biotechnology|volume=31|issue=4|pages=325–330|doi=10.1038/nbt.2515|pmid=23475072}}</ref>
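
As an illustration of the genome size application listed above, the following Python sketch estimates genome size as the total number of ''k''-mer occurrences divided by the depth at the peak of the ''k''-mer spectrum. This is a simplified version of the general idea and not the method of the cited works. The function names and the <code>min_depth</code> cutoff (used to ignore low-depth ''k''-mers that are likely sequencing errors) are assumptions made here.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Map each depth to the number of distinct k-mers seen that often."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    return Counter(counts.values())


def estimate_genome_size(reads, k, min_depth=2):
    """Estimate genome size as total k-mer occurrences / peak depth.

    k-mers seen fewer than min_depth times are treated as likely errors
    and ignored. Raises ValueError if no k-mer reaches min_depth.
    """
    spectrum = kmer_spectrum(reads, k)
    usable = {d: n for d, n in spectrum.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(usable, key=usable.get)
    total = sum(depth * n for depth, n in usable.items())
    return total / peak_depth


# Toy data: a hypothetical 10-base genome sequenced ten times in full.
reads = ["ATGGCGTGCA"] * 10
print(estimate_genome_size(reads, 4))  # 7.0
```

The toy result of 7.0 equals the number of 4-mer positions in the 10-base sequence (10 − 4 + 1); for a real genome that is much longer than ''k'', this count is approximately the genome length.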


=== Metagenomics ===
''k''-mer frequency and spectrum variation are heavily used in metagenomics for both analysis<ref>{{Cite journal|last=Zhu|first=Jianfeng|last2=Zheng|first2=Wei-Mou|date=2014|title=Self-organizing approach for meta-genomes|journal=Computational Biology and Chemistry|language=en|volume=53|pages=118–124|doi=10.1016/j.compbiolchem.2014.08.016|pmid=25213854}}</ref><ref>{{cite journal|author=Dubinkina|author2=Ischenko|author3=Ulyantsev|author4=Tyakht|author5=Alexeev|year=2016|title=Assessment of k-mer spectrum applicability for metagenomic dissimilarity analysis|journal=BMC Bioinformatics|volume=17|page=38|doi=10.1186/s12859-015-0875-7|pmc=4715287|pmid=26774270}}</ref> and binning. In binning, the challenge is to separate sequencing reads into "bins" of reads for each organism (or [[operational taxonomic unit]]), which will then be assembled. TETRA is a notable tool that takes metagenomic samples and bins them into organisms based on their tetranucleotide (''k'' = 4) frequencies.<ref>Teeling H, Waldmann J, Lombardot T, Bauer M, Glöckner F. TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics. Springer Nature; 2004;5: 163. 
doi:10.1186/1471-2105-5-163</ref>  Other tools that similarly rely on ''k''-mer frequency for metagenomic binning are CompostBin (''k'' = 6),<ref>{{Citation|last=Chatterji|first=Sourav|title=CompostBin: A DNA Composition-Based Algorithm for Binning Environmental Shotgun Reads|date=2008|work=Research in Computational Molecular Biology|volume=4955|pages=17–28|editor-last=Vingron|editor-first=Martin|publisher=Springer Berlin Heidelberg|language=en|doi=10.1007/978-3-540-78839-3_3|isbn=9783540788386|last2=Yamazaki|first2=Ichitaro|last3=Bai|first3=Zhaojun|last4=Eisen|first4=Jonathan A.|editor2-last=Wong|editor2-first=Limsoon|arxiv=0708.3098}}</ref> PCAHIER,<ref>{{Cite journal|last=Zheng|first=Hao|last2=Wu|first2=Hongwei|date=2010|title=SHORT PROKARYOTIC DNA FRAGMENT BINNING USING A HIERARCHICAL CLASSIFIER BASED ON LINEAR DISCRIMINANT ANALYSIS AND PRINCIPAL COMPONENT ANALYSIS|journal=Journal of Bioinformatics and Computational Biology|language=en|volume=08|issue=6|pages=995–1011|doi=10.1142/S0219720010005051|issn=0219-7200}}</ref> PhyloPythia (5 ≤ ''k'' ≤ 6),<ref>{{Cite journal|last=McHardy|first=Alice Carolyn|last2=Martín|first2=Héctor García|last3=Tsirigos|first3=Aristotelis|last4=Hugenholtz|first4=Philip|last5=Rigoutsos|first5=Isidore|date=2007|title=Accurate phylogenetic classification of variable-length DNA fragments|journal=Nature Methods|language=en|volume=4|issue=1|pages=63–72|doi=10.1038/nmeth976|pmid=17179938|issn=1548-7091}}</ref> CLARK (''k'' ≥ 20),<ref>{{Cite journal|last=Ounit|first=Rachid|last2=Wanamaker|first2=Steve|last3=Close|first3=Timothy J|last4=Lonardi|first4=Stefano|date=2015|title=CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers|journal=BMC Genomics|language=en|volume=16|issue=1|doi=10.1186/s12864-015-1419-2|issn=1471-2164}}</ref> and TACOA (2 ≤ ''k'' ≤ 6).<ref>{{Cite journal|last=Diaz|first=Naryttza 
N|last2=Krause|first2=Lutz|last3=Goesmann|first3=Alexander|last4=Niehaus|first4=Karsten|last5=Nattkemper|first5=Tim W|date=2009|title=TACOA – Taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach|journal=BMC Bioinformatics|language=en|volume=10|issue=1|pages=56|doi=10.1186/1471-2105-10-56|pmid=19210774|pmc=2653487|issn=1471-2105}}</ref> Recent developments have also applied [[deep learning]] to metagenomic binning using ''k''-mers.<ref>{{Cite journal|last=Fiannaca|first=Antonino|last2=La Paglia|first2=Laura|last3=La Rosa|first3=Massimo|last4=Lo Bosco|first4=Giosue’|last5=Renda|first5=Giovanni|last6=Rizzo|first6=Riccardo|last7=Gaglio|first7=Salvatore|last8=Urso|first8=Alfonso|date=2018|title=Deep learning models for bacteria taxonomic classification of metagenomic data|journal=BMC Bioinformatics|language=en|volume=19|issue=S7|doi=10.1186/s12859-018-2182-6|pmid=30066629|issn=1471-2105}}</ref>


Other applications within metagenomics include:
Other applications within metagenomics include:


* Recovery of reading frames from raw reads<ref>{{cite journal|author=Zhu, Zheng|year=2014|title=Self-organizing approach for meta-genomes|journal=Computational Biology and Chemistry|volume=53|pages=118–124|doi=10.1016/j.compbiolchem.2014.08.016|pmid=25213854}}</ref>
* Recovery of reading frames from raw reads<ref>{{cite journal|author=Zhu, Zheng|year=2014|title=Self-organizing approach for meta-genomes|journal=Computational Biology and Chemistry|volume=53|pages=118–124|doi=10.1016/j.compbiolchem.2014.08.016|pmid=25213854}}</ref>
* Estimation of species abundance in metagenomic samples<ref>{{Cite journal|last=Lu|first=Jennifer|last2=Breitwieser|first2=Florian P.|last3=Thielen|first3=Peter|last4=Salzberg|first4=Steven L.|date=2017-01-02|title=Bracken: estimating species abundance in metagenomics data|url=https://peerj.com/articles/cs-104|journal=PeerJ Computer Science|language=en|volume=3|pages=e104|doi=10.7717/peerj-cs.104|issn=2376-5992}}</ref>
* Estimation of species abundance in metagenomic samples<ref>{{Cite journal|last=Lu|first=Jennifer|last2=Breitwieser|first2=Florian P.|last3=Thielen|first3=Peter|last4=Salzberg|first4=Steven L.|date=2017-01-02|title=Bracken: estimating species abundance in metagenomics data|journal=PeerJ Computer Science|language=en|volume=3|pages=e104|doi=10.7717/peerj-cs.104|issn=2376-5992}}</ref>
* Determination of which species are present in samples<ref>{{Cite journal|last=Wood|first=Derrick E|last2=Salzberg|first2=Steven L|date=2014|title=Kraken: ultrafast metagenomic sequence classification using exact alignments|url=http://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-3-r46|journal=Genome Biology|language=en|volume=15|issue=3|pages=R46|doi=10.1186/gb-2014-15-3-r46|issn=1465-6906}}</ref><ref>{{Cite journal|last=Rosen|first=Gail|last2=Garbarine|first2=Elaine|last3=Caseiro|first3=Diamantino|last4=Polikar|first4=Robi|last5=Sokhansanj|first5=Bahrad|date=2008|title=Metagenome Fragment Classification Using -Mer Frequency Profiles|url=http://www.hindawi.com/journals/abi/2008/205969/|journal=Advances in Bioinformatics|language=en|volume=2008|pages=1–12|doi=10.1155/2008/205969|issn=1687-8027}}</ref>
* Determination of which species are present in samples<ref>{{Cite journal|last=Wood|first=Derrick E|last2=Salzberg|first2=Steven L|date=2014|title=Kraken: ultrafast metagenomic sequence classification using exact alignments|journal=Genome Biology|language=en|volume=15|issue=3|pages=R46|doi=10.1186/gb-2014-15-3-r46|pmid=24580807|issn=1465-6906}}</ref><ref>{{Cite journal|last=Rosen|first=Gail|last2=Garbarine|first2=Elaine|last3=Caseiro|first3=Diamantino|last4=Polikar|first4=Robi|last5=Sokhansanj|first5=Bahrad|date=2008|title=Metagenome Fragment Classification Using -Mer Frequency Profiles|journal=Advances in Bioinformatics|language=en|volume=2008|pages=1–12|doi=10.1155/2008/205969|issn=1687-8027}}</ref>
* Identification of [[Biomarker|biomarkers]] for diseases from samples<ref>{{Cite journal|last=Wang|first=Ying|last2=Fu|first2=Lei|last3=Ren|first3=Jie|last4=Yu|first4=Zhaoxia|last5=Chen|first5=Ting|last6=Sun|first6=Fengzhu|date=2018-05-03|title=Identifying Group-Specific Sequences for Microbial Communities Using Long k-mer Sequence Signatures|url=http://journal.frontiersin.org/article/10.3389/fmicb.2018.00872/full|journal=Frontiers in Microbiology|volume=9|doi=10.3389/fmicb.2018.00872|issn=1664-302X}}</ref>
* Identification of [[Biomarker|biomarkers]] for diseases from samples<ref>{{Cite journal|last=Wang|first=Ying|last2=Fu|first2=Lei|last3=Ren|first3=Jie|last4=Yu|first4=Zhaoxia|last5=Chen|first5=Ting|last6=Sun|first6=Fengzhu|date=2018-05-03|title=Identifying Group-Specific Sequences for Microbial Communities Using Long k-mer Sequence Signatures|journal=Frontiers in Microbiology|volume=9|doi=10.3389/fmicb.2018.00872|issn=1664-302X}}</ref>


=== Biotechnology ===
=== Biotechnology ===
Modifying ''k''-mer frequencies in DNA sequences has been used extensively in biotechnological applications to control translational efficiency. Specifically, it has been used to both up- and down-regulate protein production rates.
Modifying ''k''-mer frequencies in DNA sequences has been used extensively in biotechnological applications to control translational efficiency. Specifically, it has been used to both up- and down-regulate protein production rates.


With respect to increasing protein production, reducing unfavorable dinucleotide frequency has been used to yield higher rates of protein synthesis.<ref>{{Cite journal|last=Al-Saif|first=Maher|last2=Khabar|first2=Khalid SA|date=2012|title=UU/UA Dinucleotide Frequency Reduction in Coding Regions Results in Increased mRNA Stability and Protein Expression|url=https://linkinghub.elsevier.com/retrieve/pii/S1525001616319359|journal=Molecular Therapy|language=en|volume=20|issue=5|pages=954–959|doi=10.1038/mt.2012.29}}</ref> In addition, codon usage bias has been modified to create synonymous sequences with greater protein expression rates.<ref name=":4" /><ref name=":6" /> Similarly, codon pair optimization, a combination of dinucleotide and codon optimization, has also been successfully used to increase expression.<ref>Trinh R, Gurbaxani B, Morrison SL, Seyfzadeh M. Optimization of codon pair use within the (GGGGS)3 linker sequence results in enhanced protein expression. Molecular Immunology. Elsevier BV; 2004;40: 717–722. doi:10.1016/j.molimm.2003.08.006</ref>
With respect to increasing protein production, reducing unfavorable dinucleotide frequency has been used to yield higher rates of protein synthesis.<ref>{{Cite journal|last=Al-Saif|first=Maher|last2=Khabar|first2=Khalid SA|date=2012|title=UU/UA Dinucleotide Frequency Reduction in Coding Regions Results in Increased mRNA Stability and Protein Expression|journal=Molecular Therapy|language=en|volume=20|issue=5|pages=954–959|doi=10.1038/mt.2012.29|pmid=22434136|pmc=3345983}}</ref> In addition, codon usage bias has been modified to create synonymous sequences with greater protein expression rates.<ref name=":4" /><ref name=":6" /> Similarly, codon pair optimization, a combination of dinucleotide and codon optimization, has also been successfully used to increase expression.<ref>Trinh R, Gurbaxani B, Morrison SL, Seyfzadeh M. Optimization of codon pair use within the (GGGGS)3 linker sequence results in enhanced protein expression. Molecular Immunology. Elsevier BV; 2004;40: 717–722. doi:10.1016/j.molimm.2003.08.006</ref>


The most studied application of ''k''-mers for decreasing translational efficiency is codon-pair manipulation for attenuating viruses in order to create vaccines. Researchers were able to recode [[dengue virus]], the virus that causes [[dengue fever]], such that its codon-pair bias was more different to mammalian codon-usage preference than the wild type.<ref>{{Cite journal|last=Shen|first=Sam H.|last2=Stauft|first2=Charles B.|last3=Gorbatsevych|first3=Oleksandr|last4=Song|first4=Yutong|last5=Ward|first5=Charles B.|last6=Yurovsky|first6=Alisa|last7=Mueller|first7=Steffen|last8=Futcher|first8=Bruce|last9=Wimmer|first9=Eckard|date=2015-04-14|title=Large-scale recoding of an arbovirus genome to rebalance its insect versus mammalian preference|url=http://www.pnas.org/lookup/doi/10.1073/pnas.1502864112|journal=Proceedings of the National Academy of Sciences|language=en|volume=112|issue=15|pages=4749–4754|doi=10.1073/pnas.1502864112|issn=0027-8424}}</ref> Though containing an identical amino-acid sequence, the recoded virus demonstrated significantly weakened [[Pathogen|pathogenicity]] while eliciting a strong immune response. 
This approach has also been used effectively to create an influenza vaccine<ref>{{Cite journal|last=Kaplan|first=Bryan S.|last2=Souza|first2=Carine K.|last3=Gauger|first3=Phillip C.|last4=Stauft|first4=Charles B.|last5=Robert Coleman|first5=J.|last6=Mueller|first6=Steffen|last7=Vincent|first7=Amy L.|date=2018|title=Vaccination of pigs with a codon-pair bias de-optimized live attenuated influenza vaccine protects from homologous challenge|url=https://linkinghub.elsevier.com/retrieve/pii/S0264410X18300616|journal=Vaccine|language=en|volume=36|issue=8|pages=1101–1107|doi=10.1016/j.vaccine.2018.01.027}}</ref> as well as a vaccine for [[Marek's disease|Marek's disease herpesvirus]] (MDV).<ref>{{Cite journal|last=Eschke|first=Kathrin|last2=Trimpert|first2=Jakob|last3=Osterrieder|first3=Nikolaus|last4=Kunec|first4=Dusan|date=2018-01-29|editor-last=Mocarski|editor-first=Edward|title=Attenuation of a very virulent Marek's disease herpesvirus (MDV) by codon pair bias deoptimization|url=https://dx.plos.org/10.1371/journal.ppat.1006857|journal=PLOS Pathogens|language=en|volume=14|issue=1|pages=e1006857|doi=10.1371/journal.ppat.1006857|issn=1553-7374}}</ref> Notably, the codon-pair bias manipulation employed to attenuate MDV did not effectively reduce the [[Carcinogenesis|oncogenicity]] of the virus, highlighting a potential weakness in the biotechnology applications of this approach. To date, no codon-pair deoptimized vaccine has been approved for use.
The most studied application of ''k''-mers for decreasing translational efficiency is codon-pair manipulation for attenuating viruses in order to create vaccines. Researchers were able to recode [[dengue virus]], the virus that causes [[dengue fever]], such that its codon-pair bias was more different to mammalian codon-usage preference than the wild type.<ref>{{Cite journal|last=Shen|first=Sam H.|last2=Stauft|first2=Charles B.|last3=Gorbatsevych|first3=Oleksandr|last4=Song|first4=Yutong|last5=Ward|first5=Charles B.|last6=Yurovsky|first6=Alisa|last7=Mueller|first7=Steffen|last8=Futcher|first8=Bruce|last9=Wimmer|first9=Eckard|date=2015-04-14|title=Large-scale recoding of an arbovirus genome to rebalance its insect versus mammalian preference|journal=Proceedings of the National Academy of Sciences|language=en|volume=112|issue=15|pages=4749–4754|doi=10.1073/pnas.1502864112|pmid=25825721|issn=0027-8424}}</ref> Though containing an identical amino-acid sequence, the recoded virus demonstrated significantly weakened [[Pathogen|pathogenicity]] while eliciting a strong immune response. 
This approach has also been used effectively to create an influenza vaccine<ref>{{Cite journal|last=Kaplan|first=Bryan S.|last2=Souza|first2=Carine K.|last3=Gauger|first3=Phillip C.|last4=Stauft|first4=Charles B.|last5=Robert Coleman|first5=J.|last6=Mueller|first6=Steffen|last7=Vincent|first7=Amy L.|date=2018|title=Vaccination of pigs with a codon-pair bias de-optimized live attenuated influenza vaccine protects from homologous challenge|journal=Vaccine|language=en|volume=36|issue=8|pages=1101–1107|doi=10.1016/j.vaccine.2018.01.027|pmid=29366707}}</ref> as well as a vaccine for [[Marek's disease|Marek's disease herpesvirus]] (MDV).<ref>{{Cite journal|last=Eschke|first=Kathrin|last2=Trimpert|first2=Jakob|last3=Osterrieder|first3=Nikolaus|last4=Kunec|first4=Dusan|date=2018-01-29|editor-last=Mocarski|editor-first=Edward|title=Attenuation of a very virulent Marek's disease herpesvirus (MDV) by codon pair bias deoptimization|journal=PLOS Pathogens|language=en|volume=14|issue=1|pages=e1006857|doi=10.1371/journal.ppat.1006857|issn=1553-7374}}</ref> Notably, the codon-pair bias manipulation employed to attenuate MDV did not effectively reduce the [[Carcinogenesis|oncogenicity]] of the virus, highlighting a potential weakness in the biotechnology applications of this approach. To date, no codon-pair deoptimized vaccine has been approved for use.


Two later articles help explain the actual mechanism underlying codon-pair deoptimization: codon-pair bias is the result of dinucleotide bias.<ref>{{Cite journal|last=Kunec|first=Dusan|last2=Osterrieder|first2=Nikolaus|date=2016|title=Codon Pair Bias Is a Direct Consequence of Dinucleotide Bias|url=https://linkinghub.elsevier.com/retrieve/pii/S2211124715014242|journal=Cell Reports|language=en|volume=14|issue=1|pages=55–67|doi=10.1016/j.celrep.2015.12.011}}</ref><ref>{{Cite journal|last=Tulloch|first=Fiona|last2=Atkinson|first2=Nicky J|last3=Evans|first3=David J|last4=Ryan|first4=Martin D|last5=Simmonds|first5=Peter|date=2014-12-09|title=RNA virus attenuation by codon pair deoptimisation is an artefact of increases in CpG/UpA dinucleotide frequencies|url=https://elifesciences.org/articles/04531|journal=eLife|language=en|volume=3|doi=10.7554/eLife.04531|issn=2050-084X}}</ref> By studying viruses and their hosts, both sets of authors were able to conclude that the molecular mechanism that results in the attenuation of viruses is an increase in dinucleotides poorly suited for translation.
Two later articles help explain the actual mechanism underlying codon-pair deoptimization: codon-pair bias is the result of dinucleotide bias.<ref>{{Cite journal|last=Kunec|first=Dusan|last2=Osterrieder|first2=Nikolaus|date=2016|title=Codon Pair Bias Is a Direct Consequence of Dinucleotide Bias|journal=Cell Reports|language=en|volume=14|issue=1|pages=55–67|doi=10.1016/j.celrep.2015.12.011|pmid=26725119}}</ref><ref>{{Cite journal|last=Tulloch|first=Fiona|last2=Atkinson|first2=Nicky J|last3=Evans|first3=David J|last4=Ryan|first4=Martin D|last5=Simmonds|first5=Peter|date=2014-12-09|title=RNA virus attenuation by codon pair deoptimisation is an artefact of increases in CpG/UpA dinucleotide frequencies|journal=eLife|language=en|volume=3|doi=10.7554/eLife.04531|pmid=25490153|issn=2050-084X}}</ref> By studying viruses and their hosts, both sets of authors were able to conclude that the molecular mechanism that results in the attenuation of viruses is an increase in dinucleotides poorly suited for translation.


GC-content, due to its effect on [[Nucleic acid thermodynamics#Denaturation|DNA melting point]], is used to predict annealing temperature in [[Polymerase chain reaction#Optimization|PCR]], another important biotechnology tool.
GC-content, due to its effect on [[Nucleic acid thermodynamics#Denaturation|DNA melting point]], is used to predict annealing temperature in [[Polymerase chain reaction#Optimization|PCR]], another important biotechnology tool.
Line 162: Line 162:
Because the number of possible ''k''-mers grows exponentially with ''k'', counting ''k''-mers for large values of ''k'' (usually >10) is a computationally difficult task. While simple implementations such as the above pseudocode work for small values of ''k'', they need to be adapted for high-throughput applications or when ''k'' is large. To solve this problem, various tools have been developed:
Because the number of possible ''k''-mers grows exponentially with ''k'', counting ''k''-mers for large values of ''k'' (usually >10) is a computationally difficult task. While simple implementations such as the above pseudocode work for small values of ''k'', they need to be adapted for high-throughput applications or when ''k'' is large. To solve this problem, various tools have been developed:


* [https://github.com/gmarcais/Jellyfish/ Jellyfish] uses a multithreaded, lock-free [[hash table]] for ''k''-mer counting and has [[Python (programming language)|Python]], [[Ruby (programming language)|Ruby]], and [[Perl]] bindings<ref>{{Cite journal|last=Marçais|first=Guillaume|last2=Kingsford|first2=Carl|date=2011-03-15|title=A fast, lock-free approach for efficient parallel counting of occurrences of k-mers|url=https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btr011|journal=Bioinformatics|language=en|volume=27|issue=6|pages=764–770|doi=10.1093/bioinformatics/btr011|issn=1460-2059}}</ref>
* [https://github.com/gmarcais/Jellyfish/ Jellyfish] uses a multithreaded, lock-free [[hash table]] for ''k''-mer counting and has [[Python (programming language)|Python]], [[Ruby (programming language)|Ruby]], and [[Perl]] bindings<ref>{{Cite journal|last=Marçais|first=Guillaume|last2=Kingsford|first2=Carl|date=2011-03-15|title=A fast, lock-free approach for efficient parallel counting of occurrences of k-mers|journal=Bioinformatics|language=en|volume=27|issue=6|pages=764–770|doi=10.1093/bioinformatics/btr011|issn=1460-2059}}</ref>
* [https://github.com/refresh-bio/KMC KMC] is a tool for ''k''-mer counting that uses a multidisk architecture for optimized speed<ref>{{Cite journal|last=Deorowicz|first=Sebastian|last2=Kokot|first2=Marek|last3=Grabowski|first3=Szymon|last4=Debudaj-Grabysz|first4=Agnieszka|date=2015-05-15|title=KMC 2: fast and resource-frugal k-mer counting|url=https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btv022|journal=Bioinformatics|language=en|volume=31|issue=10|pages=1569–1576|doi=10.1093/bioinformatics/btv022|issn=1460-2059}}</ref>
* [https://github.com/refresh-bio/KMC KMC] is a tool for ''k''-mer counting that uses a multidisk architecture for optimized speed<ref>{{Cite journal|last=Deorowicz|first=Sebastian|last2=Kokot|first2=Marek|last3=Grabowski|first3=Szymon|last4=Debudaj-Grabysz|first4=Agnieszka|date=2015-05-15|title=KMC 2: fast and resource-frugal k-mer counting|journal=Bioinformatics|language=en|volume=31|issue=10|pages=1569–1576|doi=10.1093/bioinformatics/btv022|pmid=25609798|issn=1460-2059}}</ref>
* [https://github.com/uni-halle/gerbil Gerbil] uses a hash table approach but with added support for GPU acceleration<ref>{{Cite journal|last=Erbert|first=Marius|last2=Rechner|first2=Steffen|last3=Müller-Hannemann|first3=Matthias|date=2017|title=Gerbil: a fast and memory-efficient k-mer counter with GPU-support|url=http://almob.biomedcentral.com/articles/10.1186/s13015-017-0097-9|journal=Algorithms for Molecular Biology|language=en|volume=12|issue=1|pages=|doi=10.1186/s13015-017-0097-9|issn=1748-7188|via=}}</ref>
* [https://github.com/uni-halle/gerbil Gerbil] uses a hash table approach but with added support for GPU acceleration<ref>{{Cite journal|last=Erbert|first=Marius|last2=Rechner|first2=Steffen|last3=Müller-Hannemann|first3=Matthias|date=2017|title=Gerbil: a fast and memory-efficient k-mer counter with GPU-support|journal=Algorithms for Molecular Biology|language=en|volume=12|issue=1|pages=|doi=10.1186/s13015-017-0097-9|pmid=28373894|issn=1748-7188}}</ref>
* [https://github.com/TGAC/KAT K-mer Analysis Toolkit (KAT)] uses a modified version of Jellyfish to analyze ''k''-mer counts<ref name=":7" />
* [https://github.com/TGAC/KAT K-mer Analysis Toolkit (KAT)] uses a modified version of Jellyfish to analyze ''k''-mer counts<ref name=":7" />



Revision as of 12:36, 13 July 2019

The sequence ATGG has two 3-mers: ATG and TGG.

In bioinformatics, k-mers are subsequences of length k contained within a biological sequence. Primarily used within the context of computational genomics and sequence analysis, in which k-mers are composed of nucleotides (i.e. A, T, G, and C), k-mers are used to assemble DNA sequences,[1] improve heterologous gene expression,[2][3] identify species in metagenomic samples,[4] and create attenuated vaccines.[5] Usually, the term k-mer refers to all of a sequence's subsequences of length k, such that the sequence AGAT would have four monomers (A, G, A, and T), three 2-mers (AG, GA, AT), two 3-mers (AGA and GAT) and one 4-mer (AGAT). More generally, a sequence of length L will have L − k + 1 k-mers, and there are n^k total possible k-mers, where n is the number of possible monomers (e.g. four in the case of DNA).

Introduction

k-mers are simply subsequences of length k. For example, all the possible k-mers of a DNA sequence are shown below:

An example 8-mer spectrum for E. coli comparing 8-mers' frequency (i.e. multiplicities) with their number of occurrences.
k-mers for GTAGAGCTGT
k k-mers
1 G, T, A, G, A, G, C, T, G, T
2 GT, TA, AG, GA, AG, GC, CT, TG, GT
3 GTA, TAG, AGA, GAG, AGC, GCT, CTG, TGT
4 GTAG, TAGA, AGAG, GAGC, AGCT, GCTG, CTGT
5 GTAGA, TAGAG, AGAGC, GAGCT, AGCTG, GCTGT
6 GTAGAG, TAGAGC, AGAGCT, GAGCTG, AGCTGT
7 GTAGAGC, TAGAGCT, AGAGCTG, GAGCTGT
8 GTAGAGCT, TAGAGCTG, AGAGCTGT
9 GTAGAGCTG, TAGAGCTGT
10 GTAGAGCTGT
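The rows of the table above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `kmers` is chosen here for illustration and does not come from any particular library.

```python
def kmers(sequence: str, k: int) -> list[str]:
    """Return every length-k substring of `sequence`, in order of position."""
    if k < 1 or k > len(sequence):
        return []
    # A sequence of length L contains L - k + 1 overlapping k-mers.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


if __name__ == "__main__":
    seq = "GTAGAGCTGT"
    # Print one table row per value of k.
    for k in range(1, len(seq) + 1):
        print(k, ", ".join(kmers(seq, k)))
```

Repeated k-mers (such as the two occurrences of AG for k = 2) are kept, which matches the table.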

A method of visualizing k-mers, the k-mer spectrum, shows the multiplicity of each k-mer in a sequence versus the number of k-mers with that multiplicity.[6] The number of modes in a k-mer spectrum for a species' genome varies, with most species having a unimodal distribution.[7] However, all mammals have a multimodal distribution. The number of modes within a k-mer spectrum can vary between regions of genomes as well: humans have unimodal k-mer spectra in 5' UTRs and exons but multimodal spectra in 3' UTRs and introns.
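As a rough illustration of how such a spectrum is tabulated, the sketch below counts each k-mer and then counts how many distinct k-mers share each multiplicity. The function names are assumptions made for this example; real genomes would normally be processed with a dedicated k-mer counter rather than an in-memory dictionary.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count how many times each k-mer occurs in `sequence`."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_spectrum(sequence: str, k: int) -> dict[int, int]:
    """Map each multiplicity to the number of distinct k-mers that have it."""
    counts = kmer_counts(sequence, k)
    # Count the counts: how many k-mers occur once, twice, and so on.
    return dict(sorted(Counter(counts.values()).items()))


if __name__ == "__main__":
    print(kmer_spectrum("GTAGAGCTGT", 2))
```

Plotting the keys of the returned dictionary against its values gives the spectrum described above.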

Forces Affecting DNA k-mer Frequency

The frequency of k-mer usage is affected by numerous forces, working at multiple levels, which are often in conflict. It is important to note that k-mers for higher values of k are affected by the forces affecting lower values of k as well. For example, if the 1-mer A does not occur in a sequence, none of the 2-mers containing A (AA, AT, AG, and AC) will occur either, thereby linking the effects of the different forces.

k = 1

When k = 1, there are four DNA k-mers, i.e., A, T, G, and C. At the molecular level, there are three hydrogen bonds between G and C, whereas there are only two between A and T. GC bonds, as a result of the extra hydrogen bond (and stronger stacking interactions), are more thermally stable than AT bonds.[8] Mammals and birds have a higher ratio of Gs and Cs to As and Ts (GC-content), which led to the hypothesis that thermal stability was a driving factor of GC-content variation.[9] However, while promising, this hypothesis did not hold up under scrutiny: analysis among a variety of prokaryotes showed no evidence of GC-content correlating with temperature as the thermal adaptation hypothesis would predict.[10] Indeed, if natural selection were the driving force behind GC-content variation, that would require single-nucleotide changes, which are often silent, to alter the fitness of an organism.[11]

Rather, current evidence suggests that GC-biased gene conversion (gBGC) is a driving factor behind variation in GC content.[11] gBGC is a process that occurs during recombination which replaces As and Ts with Gs and Cs.[12] This process, though distinct from natural selection, can nevertheless exert a selection-like pressure on DNA, biasing it towards GC replacements being fixed in the genome. gBGC can therefore be seen as an "impostor" of natural selection. As would be expected, GC content is greater at sites experiencing greater recombination.[13] Furthermore, organisms with higher rates of recombination exhibit higher GC content, in keeping with the gBGC hypothesis's predicted effects.[14] Interestingly, gBGC does not appear to be limited to eukaryotes.[15] Asexual organisms such as bacteria and archaea also experience recombination by means of gene conversion, a process of homologous sequence replacement resulting in multiple identical sequences throughout the genome.[16] That recombination is able to drive up GC content in all domains of life suggests that gBGC is universally conserved. Whether gBGC is a (mostly) neutral byproduct of the molecular machinery of life or is itself under selection remains to be determined. The exact mechanism and evolutionary advantage or disadvantage of gBGC is currently unknown.[17]
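For concreteness, GC-content can be computed directly from 1-mer counts. The sketch below assumes the common convention of reporting it as the fraction of bases that are G or C; the function name is hypothetical.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of bases in `sequence` that are G or C."""
    if not sequence:
        return 0.0
    seq = sequence.upper()
    # Count the two 1-mers that contribute to GC-content.
    gc = seq.count("G") + seq.count("C")
    return gc / len(seq)


if __name__ == "__main__":
    print(gc_content("GTAGAGCTGT"))
```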

k = 2

Despite the comparatively large body of literature discussing GC-content biases, relatively little has been written about dinucleotide biases. What is known is that these dinucleotide biases are relatively constant throughout the genome, unlike GC-content, which, as seen above, can vary considerably.[18] This is an important insight that must not be overlooked. If dinucleotide bias were subject to pressures resulting from translation, then there would be differing patterns of dinucleotide bias in coding and noncoding regions driven by some dinucleotides' reduced translational efficiency.[19] Because there are not, it can be inferred that the forces modulating dinucleotide bias are independent of translation. Further evidence against translational pressures affecting dinucleotide bias is the fact that the dinucleotide biases of viruses, which rely heavily on translational efficiency, are shaped by their viral family more than by their hosts, whose translational machinery the viruses hijack.[20]

Counter to gBGC's increasing GC-content is CG suppression, which reduces the frequency of CG 2-mers due to deamination of methylated CG dinucleotides, resulting in substitutions of CGs with TGs, thereby reducing the GC-content.[21] This interaction highlights the interrelationship between the forces affecting k-mers for varying values of k.

One interesting fact about dinucleotide bias is that it can serve as a "distance" measurement between phylogenetically similar genomes. The genomes of closely related organisms share more similar dinucleotide biases than the genomes of more distantly related organisms.[18]
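The text does not say how dinucleotide bias is measured. One widely used measure, assumed here purely for illustration, is the odds ratio of a dinucleotide's observed frequency to the product of its two mononucleotide frequencies; values well below 1 (as is typical for CG in genomes with CG suppression) indicate under-representation.

```python
from collections import Counter


def dinucleotide_odds_ratios(sequence: str) -> dict[str, float]:
    """Return f(xy) / (f(x) * f(y)) for every dinucleotide xy seen in `sequence`."""
    n = len(sequence)
    if n < 2:
        return {}
    mono = Counter(sequence)
    di = Counter(sequence[i:i + 2] for i in range(n - 1))
    ratios = {}
    for pair, count in di.items():
        f_xy = count / (n - 1)        # observed dinucleotide frequency
        f_x = mono[pair[0]] / n       # frequency of the first base
        f_y = mono[pair[1]] / n       # frequency of the second base
        ratios[pair] = f_xy / (f_x * f_y)
    return ratios


if __name__ == "__main__":
    for pair, ratio in sorted(dinucleotide_odds_ratios("GTAGAGCTGT").items()):
        print(pair, round(ratio, 2))
```

This sketch considers a single strand only; published analyses often combine a sequence with its reverse complement before computing the ratios.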

k = 3

There are twenty natural amino acids that are used to build the proteins that DNA encodes. However, there are only four nucleotides. Therefore, there cannot be a one-to-one correspondence between nucleotides and amino acids. Similarly, there are 16 2-mers, which is also not enough to unambiguously represent every amino acid. However, there are 64 distinct 3-mers in DNA, which is enough to uniquely represent each amino acid. These non-overlapping 3-mers are called codons. While each codon only maps to one amino acid, each amino acid can be represented by multiple codons. Thus, the same amino acid sequence can have multiple DNA representations. Interestingly, the codons for a given amino acid are not used in equal proportions.[22] This is called codon-usage bias (CUB). When k = 3, a distinction must be made between true 3-mer frequency and CUB. For example, the sequence ATGGCA has four 3-mer words within it (ATG, TGG, GGC, and GCA) while only containing two codons (ATG and GCA). However, CUB is a major driving factor of 3-mer usage bias (accounting for up to ⅓ of it, since ⅓ of the k-mers in a coding region are codons) and will be the main focus of this section.
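The difference between overlapping 3-mers and codons in the ATGGCA example can be made explicit in code. The following is a small sketch with illustrative function names; it assumes the reading frame starts at the first base unless `frame` says otherwise.

```python
def overlapping_3mers(sequence: str) -> list[str]:
    """Return all 3-mers, sliding the window one base at a time."""
    return [sequence[i:i + 3] for i in range(len(sequence) - 2)]


def codons(sequence: str, frame: int = 0) -> list[str]:
    """Return the non-overlapping 3-mers read in steps of three from `frame`."""
    return [sequence[i:i + 3] for i in range(frame, len(sequence) - 2, 3)]


if __name__ == "__main__":
    seq = "ATGGCA"
    print(overlapping_3mers(seq))  # every 3-mer word in the sequence
    print(codons(seq))             # only the in-frame codons
```

Counting the output of `codons` over a coding region gives codon usage, whereas counting the output of `overlapping_3mers` gives the true 3-mer frequency.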

The exact cause of variation between the frequencies of various codons is not fully understood. It is known that codon preference is correlated with tRNA abundances, with codons matching more abundant tRNAs being correspondingly more frequent[22] and that more highly expressed proteins exhibit greater CUB.[23] This suggests that selection for translational efficiency or accuracy is the driving force behind CUB variation.

k = 4

Similar to the effect seen in dinucleotide bias, the tetranucleotide biases of phylogenetically similar organisms are more similar than those of less closely related organisms.[4] The exact cause of variation in tetranucleotide bias is not well understood, but it has been hypothesized to be the result of the maintenance of genetic stability at the molecular level.[24]

Applications

The frequency of a set of k-mers in a species' genome, in a genomic region, or in a class of sequences can be used as a "signature" of the underlying sequence. Comparing these frequencies is computationally easier than sequence alignment, and is an important method in alignment-free sequence analysis. It can also be used as a first-stage analysis before an alignment.
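A minimal sketch of such a signature comparison is shown below. It builds a normalized k-mer frequency vector for each sequence and compares the vectors with Euclidean distance. Both the function names and the choice of Euclidean distance are assumptions for illustration; other distance measures are also used in practice. The sketch assumes the sequences contain only A, C, G and T.

```python
from collections import Counter
from itertools import product
import math


def kmer_profile(sequence: str, k: int, alphabet: str = "ACGT") -> list[float]:
    """Return relative frequencies of all possible k-mers, in a fixed order."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    # Enumerate all len(alphabet)**k possible words so vectors are comparable.
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    return [counts[w] / total if total else 0.0 for w in words]


def profile_distance(seq_a: str, seq_b: str, k: int) -> float:
    """Euclidean distance between the k-mer profiles of two sequences."""
    pa, pb = kmer_profile(seq_a, k), kmer_profile(seq_b, k)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(pa, pb)))


if __name__ == "__main__":
    print(profile_distance("GTAGAGCTGT", "GTAGAGCTGA", 2))
```

No alignment is computed at any point: each sequence is scanned once, which is why this kind of comparison is cheaper than aligning the sequences.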

Sequence assembly

This figure shows the process of splitting reads into smaller k-mers (4-mers in this case) so that they can be used in a De Bruijn graph. (A) Shows the initial segment of DNA being sequenced. (B) Shows the reads that were output from sequencing and how they align. The problem with this alignment, though, is that the reads overlap by k-2, not k-1 (which is needed in De Bruijn graphs). (C) Shows the reads being split into smaller 4-mers. (D) Discards the repeated 4-mers and then shows their alignment. Note that these k-mers overlap by k-1 and can then be used in a De Bruijn graph.

In sequence assembly, k-mers are used during the construction of De Bruijn graphs.[25][26] In order to create a De Bruijn graph, the k-mer stored in each edge, with length k, must overlap the string in another edge by k − 1 in order to create a vertex. Reads generated from next-generation sequencing will typically have different lengths. For example, Illumina's sequencing technology produces reads of 100-mers. However, the problem with the sequencing is that only a small fraction of all the possible 100-mers that are present in the genome is actually generated. This is due to read errors, but more importantly, to simple coverage holes that occur during sequencing. The problem is that this small fraction of the possible k-mers violates the key assumption of De Bruijn graphs that each k-mer read must overlap its adjoining k-mer in the genome by k − 1 (which cannot occur when not all the possible k-mers are present).

The solution to this problem is to break these k-mer sized reads into smaller k-mers, such that the resulting smaller k-mers will represent all the possible k-mers of that smaller size that are present in the genome.[27] Furthermore, splitting the k-mers into smaller sizes also helps alleviate the problem of different initial read lengths. In this example, the five reads do not account for all the possible 7-mers of the genome, and as such, a De Bruijn graph cannot be created. But, when they are split into 4-mers, the resultant subsequences are enough to reconstruct the genome using a De Bruijn graph.
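The splitting step can be sketched as follows. Each read is broken into k-mers, duplicates are discarded, and every distinct k-mer becomes an edge from its first k − 1 bases to its last k − 1 bases. The reads below are hypothetical and are not the ones shown in the figure; the function name is likewise chosen for illustration, and real assemblers add error correction and graph simplification that are omitted here.

```python
from collections import defaultdict


def de_bruijn_graph(reads: list[str], k: int) -> dict[str, list[str]]:
    """Build a De Bruijn graph whose edges are the distinct k-mers of `reads`."""
    # Split every read into k-mers and discard repeated k-mers.
    kmer_set = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer_set.add(read[i:i + k])

    # Each k-mer is an edge between two (k-1)-mer vertices.
    graph = defaultdict(list)
    for kmer in sorted(kmer_set):
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    example_reads = ["ATGGCG", "GGCGTG", "CGTGCA"]  # hypothetical reads
    for vertex, successors in de_bruijn_graph(example_reads, 4).items():
        print(vertex, "->", ", ".join(successors))
```

With these three reads and k = 4, consecutive k-mers share k − 1 = 3 bases, so the edges chain into a single path from which the original sequence can be read off.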

Beyond being used directly for sequence assembly, k-mers can also be used to detect genome mis-assembly by identifying overrepresented k-mers, which suggest the presence of repeated DNA sequences that have been combined.[28] In addition, k-mers are used to detect bacterial contamination during eukaryotic genome assembly, an approach borrowed from the field of metagenomics.[29][30]
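
As a rough illustration of flagging overrepresented k-mers, the following Python sketch counts k-mers across a set of reads and reports those seen far more often than is typical. The threshold rule (more than `factor` times the median count) and the reads are assumptions made for this example only; they are not the method of the cited tools.

```python
from collections import Counter
from statistics import median


def count_kmers(seqs, k):
    """Count every overlapping k-mer across all sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def overrepresented(counts, factor=3.0):
    """Return the k-mers seen more than `factor` times the median count.

    The threshold is an illustrative assumption, not a published rule.
    """
    typical = median(counts.values())
    return {kmer for kmer, n in counts.items() if n > factor * typical}


# Hypothetical reads: four of them share the 5-mer ACGTG.
reads = ["ACGTGA", "ACGTGC", "ACGTGG", "ACGTGT", "TTCAGG"]
flagged = overrepresented(count_kmers(reads, 5))
```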

Choice of k-mer

The choice of k-mer size has many different effects on sequence assembly, and these effects differ greatly between smaller and larger k-mers. The effects of each must therefore be understood in order to choose a size that balances them. They are outlined below.

Lower k-mer sizes
  • A lower k-mer size will decrease the number of edges stored in the graph and, as such, will help decrease the amount of space required to store the DNA sequence.
  • Having smaller sizes will increase the chance that all the k-mers overlap and, as such, that the subsequences required to construct the De Bruijn graph are present.[31]
  • However, with smaller k-mers there is also a risk of having many vertices in the graph leading into a single k-mer. This makes the reconstruction of the genome more difficult, as there is a higher level of path ambiguity due to the larger number of vertices that need to be traversed.
  • Information is lost as the k-mers become smaller.
    • E.g. the probability of AGTCGTAGATGCTG occurring by chance is lower than that of ACGT, and as such the longer k-mer holds a greater amount of information (refer to entropy (information theory) for more information).
  • Smaller k-mers also have the problem of not being able to resolve areas in the DNA where small microsatellites or repeats occur. This is because smaller k-mers tend to sit entirely within the repeat region, which makes it hard to determine how much repetition has actually taken place.
    • E.g. for the subsequence ATGTGTGTGTGTGTACG, the number of repetitions of TG will be lost if a k-mer size less than 16 is chosen. This is because most of the k-mers will sit in the repeated region and may just be discarded as repeats of the same k-mer instead of indicating the number of repeats.
Higher k-mer sizes
  • Having larger k-mers will increase the number of edges in the graph, which in turn will increase the amount of memory needed to store the DNA sequence.
  • By increasing the size of the k-mers, the number of vertices will also decrease. This helps with the construction of the genome, as there will be fewer paths to traverse in the graph.[31]
  • Larger k-mers also run a higher risk of not having outward vertices from every k-mer. This is because a larger k-mer is less likely to overlap with another k-mer by k-1. This can lead to disjoints in the reads and, as such, to a larger number of smaller contigs.
  • Larger k-mer sizes help alleviate the problem of small repeat regions. This is because the k-mer will contain a balance of the repeat region and the adjoining DNA sequences (given that it is large enough), which can help to resolve the amount of repetition in that particular area.
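
The effect of k on the repeat example above can be checked with a short Python sketch. The function name `kmer_spectrum_summary` is an assumption made for this illustration. For a small k, many windows inside the TG repeat are identical, so the set of distinct k-mers no longer records how often the repeat occurred. For a k large enough to span the repeat together with flanking bases, every window is unique.

```python
from collections import Counter


def kmer_spectrum_summary(seq, k):
    """Return (number of k-mer windows, number of distinct k-mers) for seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return sum(counts.values()), len(counts)


seq = "ATGTGTGTGTGTGTACG"  # the repeat example used in the list above

for k in (4, 8, 15):
    windows, distinct = kmer_spectrum_summary(seq, k)
    # When distinct < windows, some k-mers lie entirely inside the repeat
    # and are indistinguishable from one another.
    print(f"k={k}: {windows} windows, {distinct} distinct k-mers")
```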

Genetics and Genomics

With respect to disease, dinucleotide bias has been applied to the detection of genetic islands associated with pathogenicity.[11] Prior work has also shown that tetranucleotide biases are able to effectively detect horizontal gene transfer in both prokaryotes[32] and eukaryotes.[33]

Another application of k-mers is in genomics-based taxonomy. For example, GC-content has been used to distinguish between species of Erwinia with moderate success.[34] Similar to the direct use of GC-content for taxonomic purposes is the use of Tm, the melting temperature of DNA. Because GC bonds are more thermally stable, sequences with higher GC-content exhibit a higher Tm. In 1987, the Ad Hoc Committee on Reconciliation of Approaches to Bacterial Systematics proposed the use of ΔTm as a factor in determining species boundaries as part of the phylogenetic species concept, though this proposal does not appear to have gained traction within the scientific community.[35]
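
GC-content depends only on single-base counts, so it is straightforward to compute. The Python sketch below is a minimal illustration. The function name `gc_content` and the choice to ignore symbols other than A, C, G and T (such as N) are assumptions of this example.

```python
def gc_content(seq):
    """Fraction of G and C among the A/C/G/T bases of seq (case-insensitive)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return gc / acgt


fraction = gc_content("ATGG")  # two of the four bases are G or C
```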

Other applications within genetics and genomics include:

Metagenomics

k-mer frequency and spectrum variation are heavily used in metagenomics for both analysis[47][48] and binning. In binning, the challenge is to separate sequencing reads into "bins" of reads for each organism (or operational taxonomic unit), which will then be assembled. TETRA is a notable tool that takes metagenomic samples and bins them into organisms based on their tetranucleotide (k = 4) frequencies.[49] Other tools that similarly rely on k-mer frequency for metagenomic binning are CompostBin (k = 6),[50] PCAHIER,[51] PhyloPythia (5 ≤ k ≤ 6),[52] CLARK (k ≥ 20),[53] and TACOA (2 ≤ k ≤ 6).[54] Recent developments have also applied deep learning to metagenomic binning using k-mers.[55]
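
The idea of comparing sequences by their tetranucleotide composition can be sketched in Python as follows. This is a generic illustration only: the tools named above use their own statistics and classifiers, which are not reproduced here, and the function names and the use of Euclidean distance are assumptions of this example.

```python
from itertools import product
from math import sqrt

ALPHABET = "ACGT"


def kmer_frequency_vector(seq, k=4):
    """Relative frequency of every possible k-mer, in lexicographic order.

    The vector has 4**k entries (256 for tetranucleotides). Windows that
    contain symbols outside ACGT, such as N, are skipped.
    """
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}
    counts = [0] * len(index)
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


def euclidean(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical fragments: a smaller distance suggests a more similar
# composition, which a binning method could use to group the fragments.
profile_a = kmer_frequency_vector("ACGTACGT")
profile_b = kmer_frequency_vector("ACGTACGA")
distance = euclidean(profile_a, profile_b)
```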

Other applications within metagenomics include:

  • Recovery of reading frames from raw reads[56]
  • Estimation of species abundance in metagenomic samples[57]
  • Determination of which species are present in samples[58][59]
  • Identification of biomarkers for diseases from samples[60]

Biotechnology 

Modifying k-mer frequencies in DNA sequences has been used extensively in biotechnological applications to control translational efficiency. Specifically, it has been used to both up- and down-regulate protein production rates.

With respect to increasing protein production, reducing unfavorable dinucleotide frequency has been used to yield higher rates of protein synthesis.[61] In addition, codon usage bias has been modified to create synonymous sequences with greater protein expression rates.[2][3] Similarly, codon pair optimization, a combination of dinucleotide and codon optimization, has also been successfully used to increase expression.[62]

The most studied application of k-mers for decreasing translational efficiency is codon-pair manipulation for attenuating viruses in order to create vaccines. Researchers were able to recode dengue virus, the virus that causes dengue fever, such that its codon-pair bias differed more from mammalian codon-usage preference than that of the wild type.[63] Though containing an identical amino-acid sequence, the recoded virus demonstrated significantly weakened pathogenicity while eliciting a strong immune response. This approach has also been used effectively to create an influenza vaccine[64] as well as a vaccine for Marek's disease herpesvirus (MDV).[65] Notably, the codon-pair bias manipulation employed to attenuate MDV did not effectively reduce the oncogenicity of the virus, highlighting a potential weakness in the biotechnology applications of this approach. To date, no codon-pair deoptimized vaccine has been approved for use.

Two later articles help explain the actual mechanism underlying codon-pair deoptimization: codon-pair bias is the result of dinucleotide bias.[66][67] By studying viruses and their hosts, both sets of authors were able to conclude that the molecular mechanism that results in the attenuation of viruses is an increase in dinucleotides poorly suited for translation.
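
The link between codon pairs and dinucleotides can be made concrete with a small Python sketch: each pair of adjacent codons contributes one dinucleotide that spans the codon boundary. The coding sequence below is hypothetical and the function names are assumptions of this example. The sketch only counts occurrences. It does not compute a codon-pair score, which would require reference usage tables for a host organism.

```python
from collections import Counter


def codons(cds):
    """Split an in-frame coding sequence into codons.

    A trailing partial codon, if any, is dropped.
    """
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def codon_pair_counts(cds):
    """Count each pair of adjacent codons (an in-frame 6-mer)."""
    cs = codons(cds)
    return Counter(zip(cs, cs[1:]))


def junction_dinucleotide_counts(cds):
    """Count the dinucleotide spanning each codon boundary:
    the third base of one codon plus the first base of the next."""
    cs = codons(cds)
    return Counter(a[2] + b[0] for a, b in zip(cs, cs[1:]))


# Hypothetical coding sequence: ATG GCC GAA GCC GAA TAA
cds = "ATGGCCGAAGCCGAATAA"
pairs = codon_pair_counts(cds)
junctions = junction_dinucleotide_counts(cds)
```

In this example the codon pair GCC-GAA occurs twice, and each occurrence places a CG dinucleotide across the codon boundary.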

GC-content, due to its effect on DNA melting point, is used to predict annealing temperature in PCR, another important biotechnology tool.

Implementation

Pseudocode

Determining the possible k-mers of a read can be done by simply stepping through the string one position at a time and taking out each substring of length k. The pseudocode to achieve this is as follows:

procedure k-mers(string seq, integer k) is
    L ← length(seq)
    arr ← new array of L - k + 1 empty strings

    // iterate over the number of k-mers in seq, 
    // storing the nth k-mer in the output array
    for n ← 0 to L - k + 1 exclusive do
        arr[n] ← subsequence of seq from letter n inclusive to letter n + k exclusive

    return arr
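
A direct Python rendering of the pseudocode above might look like the following minimal sketch. The check that k is positive is an addition not present in the pseudocode.

```python
def kmers(seq, k):
    """Return the list of all k-mers of seq, in order of position."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # There are len(seq) - k + 1 windows; the range is empty when seq
    # is shorter than k, so an empty list is returned in that case.
    return [seq[n:n + k] for n in range(len(seq) - k + 1)]


three_mers = kmers("ATGG", 3)  # the two 3-mers of ATGG: ATG and TGG
```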

In Bioinformatics Pipelines

Because the number of possible k-mers grows exponentially with k (there are 4^k possible DNA k-mers), counting k-mers for large values of k (usually >10) is a computationally difficult task. While simple implementations such as the above pseudocode work for small values of k, they need to be adapted for high-throughput applications or when k is large. To solve this problem, various tools have been developed.
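
To give a sense of the scale involved, the Python sketch below computes the number of possible DNA k-mers and shows one compact representation, packing each base into two bits so that a k-mer becomes a single integer. This is a generic illustration and is not taken from any specific tool. The function names and the A=0, C=1, G=2, T=3 mapping are assumptions of this example.

```python
ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}
DECODING = "ACGT"


def possible_kmers(k):
    """Number of distinct DNA k-mers over the alphabet A, C, G, T."""
    return 4 ** k


def encode_kmer(kmer):
    """Pack a DNA k-mer into an integer using two bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODING[base]
    return value


def decode_kmer(value, k):
    """Recover the k-mer string from its two-bit packed integer."""
    bases = []
    for _ in range(k):
        bases.append(DECODING[value & 3])
        value >>= 2
    return "".join(reversed(bases))


space_at_k10 = possible_kmers(10)
packed = encode_kmer("ACGT")
```

With this encoding a k-mer of up to 32 bases fits in a 64-bit integer, which is cheaper to store and hash than the corresponding string.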

See also

References

  1. ^ Compeau, Phillip E C; Pevzner, Pavel A; Tesler, Glenn (November 2011). "How to apply de Bruijn graphs to genome assembly". Nature Biotechnology. 29 (11): 987–991. doi:10.1038/nbt.2023. ISSN 1087-0156. PMC 5531759. PMID 22068540.
  2. ^ a b Welch, Mark; Govindarajan, Sridhar; Ness, Jon E.; Villalobos, Alan; Gurney, Austin; Minshull, Jeremy; Gustafsson, Claes (2009-09-14). Kudla, Grzegorz (ed.). "Design Parameters to Control Synthetic Gene Expression in Escherichia coli". PLoS ONE. 4 (9): e7002. doi:10.1371/journal.pone.0007002. ISSN 1932-6203. PMID 19759823.
  3. ^ a b Gustafsson, Claes; Govindarajan, Sridhar; Minshull, Jeremy (July 2004). "Codon bias and heterologous protein expression". Trends in Biotechnology. 22 (7): 346–353. doi:10.1016/j.tibtech.2004.04.006. PMID 15245907.
  4. ^ a b Perry, Scott C.; Beiko, Robert G. (2010-01-01). "Distinguishing Microbial Genome Fragments Based on Their Composition: Evolutionary and Comparative Genomic Perspectives". Genome Biology and Evolution. 2: 117–131. doi:10.1093/gbe/evq004. ISSN 1759-6653.
  5. ^ Eschke, Kathrin; Trimpert, Jakob; Osterrieder, Nikolaus; Kunec, Dusan (2018-01-29). Mocarski, Edward (ed.). "Attenuation of a very virulent Marek's disease herpesvirus (MDV) by codon pair bias deoptimization". PLOS Pathogens. 14 (1): e1006857. doi:10.1371/journal.ppat.1006857. ISSN 1553-7374.
  6. ^ a b Mapleson, Daniel; Garcia Accinelli, Gonzalo; Kettleborough, George; Wright, Jonathan; Clavijo, Bernardo J. (2016-10-22). "KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies". Bioinformatics: btw663. doi:10.1093/bioinformatics/btw663. ISSN 1367-4803.
  7. ^ a b Chor, Benny; Horn, David; Goldman, Nick; Levy, Yaron; Massingham, Tim (2009). "Genomic DNA k-mer spectra: models and modalities". Genome Biology. 10 (10): R108. doi:10.1186/gb-2009-10-10-r108. ISSN 1465-6906. PMC 2784323. PMID 19814784.
  8. ^ Yakovchuk, P. (2006-01-30). "Base-stacking and base-pairing contributions into thermal stability of the DNA double helix". Nucleic Acids Research. 34 (2): 564–574. doi:10.1093/nar/gkj454. ISSN 0305-1048. PMID 16449200.
  9. ^ Bernardi, Giorgio (January 2000). "Isochores and the evolutionary genomics of vertebrates". Gene. 241 (1): 3–17. doi:10.1016/S0378-1119(99)00485-0. PMID 10607893.
  10. ^ Hurst, Laurence D.; Merchant, Alexa R. (2001-03-07). "High guanine–cytosine content is not an adaptation to high temperature: a comparative analysis amongst prokaryotes". Proceedings of the Royal Society of London. Series B: Biological Sciences. 268 (1466): 493–497. doi:10.1098/rspb.2000.1397. ISSN 1471-2954. PMC 1088632.
  11. ^ a b c Mugal, Carina F.; Weber, Claudia C.; Ellegren, Hans (December 2015). "GC-biased gene conversion links the recombination landscape and demography to genomic base composition: GC-biased gene conversion drives genomic base composition across a wide range of species". BioEssays. 37 (12): 1317–1326. doi:10.1002/bies.201500058.
  12. ^ Romiguier, Jonathan; Roux, Camille (2017-02-15). "Analytical Biases Associated with GC-Content in Molecular Evolution". Frontiers in Genetics. 8. doi:10.3389/fgene.2017.00016. ISSN 1664-8021.
  13. ^ Spencer, C.C.A. (2006-08-01). "Human polymorphism around recombination hotspots: Figure 1". Biochemical Society Transactions. 34 (4): 535–536. doi:10.1042/BST0340535. ISSN 0300-5127.
  14. ^ Weber, Claudia C; Boussau, Bastien; Romiguier, Jonathan; Jarvis, Erich D; Ellegren, Hans (December 2014). "Evidence for GC-biased gene conversion as a driver of between-lineage differences in avian base composition". Genome Biology. 15 (12). doi:10.1186/s13059-014-0549-1. ISSN 1474-760X.
  15. ^ Lassalle, Florent; Périan, Séverine; Bataillon, Thomas; Nesme, Xavier; Duret, Laurent; Daubin, Vincent (2015-02-06). Petrov, Dmitri A. (ed.). "GC-Content Evolution in Bacterial Genomes: The Biased Gene Conversion Hypothesis Expands". PLOS Genetics. 11 (2): e1004941. doi:10.1371/journal.pgen.1004941. ISSN 1553-7404.
  16. ^ Santoyo, G; Romero, D (April 2005). "Gene conversion and concerted evolution in bacterial genomes". FEMS Microbiology Reviews. 29 (2): 169–183. doi:10.1016/j.femsre.2004.10.004.
  17. ^ Bhérer, Claude; Auton, Adam (2014-06-16), John Wiley & Sons Ltd (ed.), "Biased Gene Conversion and Its Impact on Genome Evolution", eLS, John Wiley & Sons, Ltd, doi:10.1002/9780470015902.a0020834.pub2, ISBN 9780470015902
  18. ^ a b Karlin, Samuel (October 1998). "Global dinucleotide signatures and analysis of genomic heterogeneity". Current Opinion in Microbiology. 1 (5): 598–610. doi:10.1016/S1369-5274(98)80095-7. PMID 10066522.
  19. ^ Beutler, E.; Gelbart, T.; Han, J. H.; Koziol, J. A.; Beutler, B. (1989-01-01). "Evolution of the genome and the genetic code: selection at the dinucleotide level by methylation and polyribonucleotide cleavage". Proceedings of the National Academy of Sciences. 86 (1): 192–196. doi:10.1073/pnas.86.1.192. ISSN 0027-8424.
  20. ^ Di Giallonardo, Francesca; Schlub, Timothy E.; Shi, Mang; Holmes, Edward C. (2017-04-15). Dermody, Terence S. (ed.). "Dinucleotide Composition in Animal RNA Viruses Is Shaped More by Virus Family than by Host Species". Journal of Virology. 91 (8). doi:10.1128/JVI.02381-16. ISSN 0022-538X.
  21. ^ Żemojtel, Tomasz; Kiełbasa, Szymon M.; Arndt, Peter F.; Behrens, Sarah; Bourque, Guillaume; Vingron, Martin (2011-01-01). "CpG Deamination Creates Transcription Factor–Binding Sites with High Efficiency". Genome Biology and Evolution. 3: 1304–1311. doi:10.1093/gbe/evr107. ISSN 1759-6653.
  22. ^ a b Hershberg R, Petrov DA. Selection on Codon Bias. Annual Review of Genetics. Annual Reviews; 2008;42: 287–299. doi:10.1146/annurev.genet.42.110807.091442
  23. ^ Sharp, Paul M.; Li, Wen-Hsiung (1987). "The codon adaptation index-a measure of directional synonymous codon usage bias, and its potential applications". Nucleic Acids Research. 15 (3): 1281–1295. doi:10.1093/nar/15.3.1281. ISSN 0305-1048. PMC 340524.
  24. ^ Noble, Peter A.; Citek, Robert W.; Ogunseitan, Oladele A. (April 1998). "Tetranucleotide frequencies in microbial genomes". Electrophoresis. 19 (4): 528–535. doi:10.1002/elps.1150190412. ISSN 0173-0835.
  25. ^ Nagarajan, Niranjan; Pop, Mihai (2013). "Sequence assembly demystified". Nature Reviews Genetics. 14 (3): 157–167. doi:10.1038/nrg3367. ISSN 1471-0056. PMID 23358380.
  26. ^ Li; et al. (2010). "De novo assembly of human genomes with massively parallel short read sequencing". Genome Research. 20 (2): 265–272. doi:10.1101/gr.097261.109. PMC 2813482. PMID 20019144.
  27. ^ Compeau, P.; Pevzner, P.; Tesler, G. (2011). "How to apply de Bruijn graphs to genome assembly". Nature Biotechnology. 29 (11): 987–991. doi:10.1038/nbt.2023. PMC 5531759. PMID 22068540.
  28. ^ Phillippy, Schatz, Pop (2008). "Genome assembly forensics: finding the elusive mis-assembly". Genome Biology. 9 (3): R55. doi:10.1186/gb-2008-9-3-r55. PMC 2397507. PMID 18341692.
  29. ^ Delmont, Eren (2016). "Identifying contamination with advanced visualization and analysis practices: metagenomic approaches for eukaryotic genome assemblies". PeerJ. 4: e1839. doi:10.7717/peerj.1839.
  30. ^ Bemm; et al. (2016). "Genome of a tardigrade: Horizontal gene transfer or bacterial contamination?". Proceedings of the National Academy of Sciences. 113 (22): E3054–E3056. doi:10.1073/pnas.1525116113. PMC 4896698. PMID 27173902.
  31. ^ a b Zerbino, Daniel R.; Birney, Ewan (2008). "Velvet: algorithms for de novo short read assembly using de Bruijn graphs". Genome Research. 18 (5): 821–829. doi:10.1101/gr.074492.107. PMC 2336801. PMID 18349386.
  32. ^ Goodur, Haswanee D.; Ramtohul, Vyasanand; Baichoo, Shakuntala (2012-11-11). "GIDT — A tool for the identification and visualization of genomic islands in prokaryotic organisms". 2012 IEEE 12th International Conference on Bioinformatics & Bioengineering (BIBE). doi:10.1109/bibe.2012.6399707.
  33. ^ Jaron, K. S.; Moravec, J. C.; Martinkova, N. (2014-04-15). "SigHunt: horizontal gene transfer finder optimized for eukaryotic genomes". Bioinformatics. 30 (8): 1081–1086. doi:10.1093/bioinformatics/btt727. ISSN 1367-4803. PMID 24371153.
  34. ^ Starr, M. P.; Mandel, M. (1969-04-01). "DNA Base Composition and Taxonomy of Phytopathogenic and Other Enterobacteria". Journal of General Microbiology. 56 (1): 113–123. doi:10.1099/00221287-56-1-113. ISSN 0022-1287.
  35. ^ Moore, W. E. C.; Stackebrandt, E.; Kandler, O.; Colwell, R. R.; Krichevsky, M. I.; Truper, H. G.; Murray, R. G. E.; Wayne, L. G.; Grimont, P. A. D. (1987-10-01). "Report of the Ad Hoc Committee on Reconciliation of Approaches to Bacterial Systematics". International Journal of Systematic and Evolutionary Microbiology. 37 (4): 463–464. doi:10.1099/00207713-37-4-463. ISSN 1466-5026.
  36. ^ Patro, Mount, Kingsford (2014). "Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms". Nature Biotechnology. 32 (5): 462–464. arXiv:1308.3700. doi:10.1038/nbt.2862. PMC 4077321. PMID 24752080.
  37. ^ Navarro-Gomez; et al. (2015). "Phy-Mer: a novel alignment-free and reference-independent mitochondrial haplogroup classifier". Bioinformatics. 31 (8): 1310–1312. doi:10.1093/bioinformatics/btu825. PMC 4393525. PMID 25505086.
  38. ^ Wang, Rong; Xu, Yong; Liu, Bin (2016). "Recombination spot identification Based on gapped k-mers". Scientific Reports. 6 (1): 23934. doi:10.1038/srep23934. ISSN 2045-2322. PMC 4814916. PMID 27030570.
  39. ^ Hozza, Michal; Vinař, Tomáš; Brejová, Broňa (2015), Iliopoulos, Costas; Puglisi, Simon; Yilmaz, Emine (eds.), "How Big is that Genome? Estimating Genome Size and Coverage from k-mer Abundance Spectra", String Processing and Information Retrieval, vol. 9309, Springer International Publishing, pp. 199–209, doi:10.1007/978-3-319-23826-5_20, ISBN 9783319238258
  40. ^ Lamichhaney, Sangeet; Fan, Guangyi; Widemo, Fredrik; Gunnarsson, Ulrika; Thalmann, Doreen Schwochow; Hoeppner, Marc P; Kerje, Susanne; Gustafson, Ulla; Shi, Chengcheng (2016). "Structural genomic changes underlie alternative reproductive strategies in the ruff (Philomachus pugnax)". Nature Genetics. 48 (1): 84–88. doi:10.1038/ng.3430. ISSN 1061-4036. PMID 26569123.
  41. ^ Chae; et al. (2013). "Comparative analysis using K-mer and K-flank patterns provides evidence for CpG island sequence evolution in mammalian genomes". Nucleic Acids Research. 41 (9): 4783–4791. doi:10.1093/nar/gkt144. PMC 3643570. PMID 23519616.
  42. ^ Mohamed Hashim, Abdullah (2015). "Rare k-mer DNA: Identification of sequence motifs and prediction of CpG island and promoter". Journal of Theoretical Biology. 387: 88–100. doi:10.1016/j.jtbi.2015.09.014. PMID 26427337.
  43. ^ Price, Jones, Pevzner (2005). "De novo identification of repeat families in large genomes". Bioinformatics. 21 (supp 1): i351–8. doi:10.1093/bioinformatics/bti1018. PMID 15961478.
  44. ^ Meher, Prabina Kumar; Sahu, Tanmaya Kumar; Rao, A.R. (2016). "Identification of species based on DNA barcode using k-mer feature vector and Random forest classifier". Gene. 592 (2): 316–324. doi:10.1016/j.gene.2016.07.010. PMID 27393648.
  45. ^ Newburger, Bulyk (2009). "UniPROBE: an online database of protein binding microarray data on protein–DNA interactions". Nucleic Acids Research. 37(supp 1) (Database issue): D77–82. doi:10.1093/nar/gkn660. PMC 2686578. PMID 18842628.
  46. ^ Nordstrom; et al. (2013). "Mutation identification by direct comparison of whole-genome sequencing data from mutant and wild-type individuals using k-mers". Nature Biotechnology. 31 (4): 325–330. doi:10.1038/nbt.2515. PMID 23475072.
  47. ^ Zhu, Jianfeng; Zheng, Wei-Mou (2014). "Self-organizing approach for meta-genomes". Computational Biology and Chemistry. 53: 118–124. doi:10.1016/j.compbiolchem.2014.08.016. PMID 25213854.
  48. ^ Dubinkina; Ischenko; Ulyantsev; Tyakht; Alexeev (2016). "Assessment of k-mer spectrum applicability for metagenomic dissimilarity analysis". BMC Bioinformatics. 17: 38. doi:10.1186/s12859-015-0875-7. PMC 4715287. PMID 26774270.
  49. ^ Teeling H, Waldmann J, Lombardot T, Bauer M, Glöckner F. TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics. Springer Nature; 2004;5: 163. doi:10.1186/1471-2105-5-163
  50. ^ Chatterji, Sourav; Yamazaki, Ichitaro; Bai, Zhaojun; Eisen, Jonathan A. (2008), Vingron, Martin; Wong, Limsoon (eds.), "CompostBin: A DNA Composition-Based Algorithm for Binning Environmental Shotgun Reads", Research in Computational Molecular Biology, vol. 4955, Springer Berlin Heidelberg, pp. 17–28, arXiv:0708.3098, doi:10.1007/978-3-540-78839-3_3, ISBN 9783540788386
  51. ^ Zheng, Hao; Wu, Hongwei (2010). "SHORT PROKARYOTIC DNA FRAGMENT BINNING USING A HIERARCHICAL CLASSIFIER BASED ON LINEAR DISCRIMINANT ANALYSIS AND PRINCIPAL COMPONENT ANALYSIS". Journal of Bioinformatics and Computational Biology. 08 (6): 995–1011. doi:10.1142/S0219720010005051. ISSN 0219-7200.
  52. ^ McHardy, Alice Carolyn; Martín, Héctor García; Tsirigos, Aristotelis; Hugenholtz, Philip; Rigoutsos, Isidore (2007). "Accurate phylogenetic classification of variable-length DNA fragments". Nature Methods. 4 (1): 63–72. doi:10.1038/nmeth976. ISSN 1548-7091. PMID 17179938.
  53. ^ Ounit, Rachid; Wanamaker, Steve; Close, Timothy J; Lonardi, Stefano (2015). "CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers". BMC Genomics. 16 (1). doi:10.1186/s12864-015-1419-2. ISSN 1471-2164.
  54. ^ Diaz, Naryttza N; Krause, Lutz; Goesmann, Alexander; Niehaus, Karsten; Nattkemper, Tim W (2009). "TACOA – Taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach". BMC Bioinformatics. 10 (1): 56. doi:10.1186/1471-2105-10-56. ISSN 1471-2105. PMC 2653487. PMID 19210774.
  55. ^ Fiannaca, Antonino; La Paglia, Laura; La Rosa, Massimo; Lo Bosco, Giosue’; Renda, Giovanni; Rizzo, Riccardo; Gaglio, Salvatore; Urso, Alfonso (2018). "Deep learning models for bacteria taxonomic classification of metagenomic data". BMC Bioinformatics. 19 (S7). doi:10.1186/s12859-018-2182-6. ISSN 1471-2105. PMID 30066629.
  56. ^ Zhu, Zheng (2014). "Self-organizing approach for meta-genomes". Computational Biology and Chemistry. 53: 118–124. doi:10.1016/j.compbiolchem.2014.08.016. PMID 25213854.
  57. ^ Lu, Jennifer; Breitwieser, Florian P.; Thielen, Peter; Salzberg, Steven L. (2017-01-02). "Bracken: estimating species abundance in metagenomics data". PeerJ Computer Science. 3: e104. doi:10.7717/peerj-cs.104. ISSN 2376-5992.
  58. ^ Wood, Derrick E; Salzberg, Steven L (2014). "Kraken: ultrafast metagenomic sequence classification using exact alignments". Genome Biology. 15 (3): R46. doi:10.1186/gb-2014-15-3-r46. ISSN 1465-6906. PMID 24580807.
  59. ^ Rosen, Gail; Garbarine, Elaine; Caseiro, Diamantino; Polikar, Robi; Sokhansanj, Bahrad (2008). "Metagenome Fragment Classification Using N-Mer Frequency Profiles". Advances in Bioinformatics. 2008: 1–12. doi:10.1155/2008/205969. ISSN 1687-8027.
  60. ^ Wang, Ying; Fu, Lei; Ren, Jie; Yu, Zhaoxia; Chen, Ting; Sun, Fengzhu (2018-05-03). "Identifying Group-Specific Sequences for Microbial Communities Using Long k-mer Sequence Signatures". Frontiers in Microbiology. 9. doi:10.3389/fmicb.2018.00872. ISSN 1664-302X.
  61. ^ Al-Saif, Maher; Khabar, Khalid SA (2012). "UU/UA Dinucleotide Frequency Reduction in Coding Regions Results in Increased mRNA Stability and Protein Expression". Molecular Therapy. 20 (5): 954–959. doi:10.1038/mt.2012.29. PMC 3345983. PMID 22434136.
  62. ^ Trinh R, Gurbaxani B, Morrison SL, Seyfzadeh M. Optimization of codon pair use within the (GGGGS)3 linker sequence results in enhanced protein expression. Molecular Immunology. Elsevier BV; 2004;40: 717–722. doi:10.1016/j.molimm.2003.08.006
  63. ^ Shen, Sam H.; Stauft, Charles B.; Gorbatsevych, Oleksandr; Song, Yutong; Ward, Charles B.; Yurovsky, Alisa; Mueller, Steffen; Futcher, Bruce; Wimmer, Eckard (2015-04-14). "Large-scale recoding of an arbovirus genome to rebalance its insect versus mammalian preference". Proceedings of the National Academy of Sciences. 112 (15): 4749–4754. doi:10.1073/pnas.1502864112. ISSN 0027-8424. PMID 25825721.
  64. ^ Kaplan, Bryan S.; Souza, Carine K.; Gauger, Phillip C.; Stauft, Charles B.; Robert Coleman, J.; Mueller, Steffen; Vincent, Amy L. (2018). "Vaccination of pigs with a codon-pair bias de-optimized live attenuated influenza vaccine protects from homologous challenge". Vaccine. 36 (8): 1101–1107. doi:10.1016/j.vaccine.2018.01.027. PMID 29366707.
  65. ^ Eschke, Kathrin; Trimpert, Jakob; Osterrieder, Nikolaus; Kunec, Dusan (2018-01-29). Mocarski, Edward (ed.). "Attenuation of a very virulent Marek's disease herpesvirus (MDV) by codon pair bias deoptimization". PLOS Pathogens. 14 (1): e1006857. doi:10.1371/journal.ppat.1006857. ISSN 1553-7374.
  66. ^ Kunec, Dusan; Osterrieder, Nikolaus (2016). "Codon Pair Bias Is a Direct Consequence of Dinucleotide Bias". Cell Reports. 14 (1): 55–67. doi:10.1016/j.celrep.2015.12.011. PMID 26725119.
  67. ^ Tulloch, Fiona; Atkinson, Nicky J; Evans, David J; Ryan, Martin D; Simmonds, Peter (2014-12-09). "RNA virus attenuation by codon pair deoptimisation is an artefact of increases in CpG/UpA dinucleotide frequencies". eLife. 3. doi:10.7554/eLife.04531. ISSN 2050-084X. PMID 25490153.
  68. ^ Marçais, Guillaume; Kingsford, Carl (2011-03-15). "A fast, lock-free approach for efficient parallel counting of occurrences of k-mers". Bioinformatics. 27 (6): 764–770. doi:10.1093/bioinformatics/btr011. ISSN 1460-2059.
  69. ^ Deorowicz, Sebastian; Kokot, Marek; Grabowski, Szymon; Debudaj-Grabysz, Agnieszka (2015-05-15). "KMC 2: fast and resource-frugal k-mer counting". Bioinformatics. 31 (10): 1569–1576. doi:10.1093/bioinformatics/btv022. ISSN 1460-2059. PMID 25609798.
  70. ^ Erbert, Marius; Rechner, Steffen; Müller-Hannemann, Matthias (2017). "Gerbil: a fast and memory-efficient k-mer counter with GPU-support". Algorithms for Molecular Biology. 12 (1). doi:10.1186/s13015-017-0097-9. ISSN 1748-7188. PMID 28373894.

External links