Sequence assembly

From Wikipedia, the free encyclopedia
Jump to: navigation, search

In bioinformatics, sequence assembly refers to aligning and merging fragments from a longer DNA sequence in order to reconstruct the original sequence. This is needed as DNA sequencing technology cannot read whole genomes in one go, but rather reads small pieces of between 20 and 30000 bases, depending on the technology used. Typically the short fragments, called reads, result from shotgun sequencing genomic DNA, or gene transcript (ESTs).

The problem of sequence assembly can be compared to taking many copies of a book, passing each of them through a shredder with a different cutter, and piecing the text of the book back together just by looking at the shredded pieces. Besides the obvious difficulty of this task, there are some extra practical issues: the original may have many repeated paragraphs, and some shreds may be modified during shredding to have typos. Excerpts from another book may also be added in, and some shreds may be completely unrecognizable.

Genome assemblers[edit]

The first sequence assemblers began to appear in the late 1980s and early 1990s as variants of simpler sequence alignment programs to piece together vast quantities of fragments generated by automated sequencing instruments called DNA sequencers. As the sequenced organisms grew in size and complexity (from small viruses over plasmids to bacteria and finally eukaryotes), the assembly programs used in these genome projects needed increasingly sophisticated strategies to handle:

  • terabytes of sequencing data which need processing on computing clusters;
  • identical and nearly identical sequences (known as repeats) which can, in the worst case, increase the time and space complexity of algorithms exponentially;[citation needed]
  • errors in the fragments from the sequencing instruments, which can confound assembly.

Faced with the challenge of assembling the first larger eukaryotic genomes—the fruit fly Drosophila melanogaster in 2000 and the human genome just a year later,—scientists developed assemblers like Celera Assembler[1] and Arachne[2] able to handle genomes of 100-300 million base pairs. Subsequent to these efforts, several other groups, mostly at the major genome sequencing centers, built large-scale assemblers, and an open source effort known as AMOS[3] was launched to bring together all the innovations in genome assembly technology under the open source framework.

Sample sequence showing how a sequence assembler would take fragments and match by overlaps. Image also shows the potential problem of repeats in the sequence.

EST assemblers[edit]

Expressed Sequence Tag or EST assembly differs from genome assembly in several ways. The sequences for EST assembly are the transcribed mRNA of a cell and represent only a subset of the whole genome. At a first glance, underlying algorithmical problems differ between genome and EST assembly. For instance, genomes often have large amounts of repetitive sequences, mainly in the inter-genic parts. Since ESTs represent gene transcripts, they will not contain these repeats. On the other hand, cells tend to have a certain number of genes that are constantly expressed in very high numbers (housekeeping genes), which again leads to the problem of similar sequences present in high numbers in the data set to be assembled.

Furthermore, genes sometimes overlap in the genome (sense-antisense transcription), and should ideally still be assembled separately. EST assembly is also complicated by features like (cis-) alternative splicing, trans-splicing, single-nucleotide polymorphism, recoding, and post-transcriptional modification.

De-novo vs. mapping assembly[edit]

In sequence assembly, two different types can be distinguished:

  1. de-novo: assembling short reads to create full-length (sometimes novel) sequences (see de novo transcriptome assembly)
  2. mapping: assembling reads against an existing backbone sequence, building a sequence that is similar but not necessarily identical to the backbone sequence

In terms of complexity and time requirements, de-novo assemblies are orders of magnitude slower and more memory intensive than mapping assemblies. This is mostly due to the fact that the assembly algorithm needs to compare every read with every other read (an operation that has a naive time complexity of O(n2); using a hash this can be reduced significantly). Referring to the comparison drawn to shredded books in the introduction: while for mapping assemblies one would have a very similar book as template (perhaps with the names of the main characters and a few locations changed), the de-novo assemblies are more hardcore in a sense as one would not know beforehand whether this would become a science book, a novel, a catalogue, or even several books. Also, every shred would be compared with every other shred.

Influence of technological changes[edit]

The complexity of sequence assembly is driven by two major factors: the number of fragments and their lengths. While more and longer fragments allow better identification of sequence overlaps, they also pose problems as the underlying algorithms show quadratic or even exponential complexity behaviour to both number of fragments and their length. And while shorter sequences are faster to align, they also complicate the layout phase of an assembly as shorter reads are more difficult to use with repeats or near identical repeats.

In the earliest days of DNA sequencing, scientists could only gain a few sequences of short length (some dozen bases) after weeks of work in laboratories. Hence, these sequences could be aligned in a few minutes by hand.

In 1975, the Dideoxy termination method (AKA Sanger sequencing) was invented and until shortly after 2000, the technology was improved up to a point where fully automated machines could churn out sequences in a highly parallelised mode 24 hours a day. Large genome centers around the world housed complete farms of these sequencing machines, which in turn led to the necessity of assemblers to be optimised for sequences from whole-genome shotgun sequencing projects where the reads

  • are about 800–900 bases long
  • contain sequencing artifacts like sequencing and cloning vectors
  • have error rates between 0.5 and 10%

With the Sanger technology, bacterial projects with 20,000 to 200,000 reads could easily be assembled on one computer. Larger projects, like the human genome with approximately 35 million reads, needed large computing farms and distributed computing.

By 2004 / 2005, pyrosequencing had been brought to commercial viability by 454 Life Sciences. This new sequencing method generated reads much shorter than those of Sanger sequencing: initially about 100 bases, now 400-500 bases. Its much higher throughput and lower cost (compared to Sanger sequencing) pushed the adoption of this technology by genome centers, which in turn pushed development of sequence assemblers that could efficiently handle the read sets. The sheer amount of data coupled with technology-specific error patterns in the reads delayed development of assemblers; at the beginning in 2004 only the Newbler assembler from 454 was available. Released in mid-2007,[4] the hybrid version of the MIRA assembler by Chevreux et al. was the first freely available assembler that could assemble 454 reads as well as mixtures of 454 reads and Sanger reads. Assembling sequences from different sequencing technologies was subsequently coined hybrid assembly.

From 2006, the Illumina (previously Solexa) technology has been available and can generate about 100 million reads per run on a single sequencing machine. Compare this to the 35 million reads of the human genome project which needed several years to be produced on hundreds of sequencing machines. Illumina was initially limited to a length of only 36 bases, making it less suitable for de novo assembly (such as de novo transcriptome assembly), but newer iterations of the technology achieve read lengths above 100 bases from both ends of a 3-400bp clone. Announced at the end of 2007, the SHARCGS assembler[5] by Dohm et al. was the first published assembler that was used for an assembly with Solexa reads. It was quickly followed by a number of others.

Later, new technologies like SOLiD from Applied Biosystems, Ion Torrent and SMRT were released and new technologies (e.g. Nanopore sequencing) continue to emerge.

Greedy algorithm[edit]

Given a set of sequence fragments the object is to find the shortest common supersequence.

  1. Сalculate pairwise alignments of all fragments.
  2. Choose two fragments with the largest overlap.
  3. Merge chosen fragments.
  4. Repeat step 2 and 3 until only one fragment is left.

The result is a suboptimal solution to the problem.[citation needed]

Available assemblers[edit]

The following table lists assemblers that have a de-novo assembly capability on at least one of the supported technologies.[6]

Name Type Technologies Author Presented /

Last updated

Licence* Homepage
ABySS (large) genomes Solexa, SOLiD Simpson, J. et al. 2008 / 2014 NC-A link
ALLPATHS-LG (large) genomes Solexa, SOLiD Gnerre, S. et al. 2011 OS link
AMOS genomes Sanger, 454 Salzberg, S. et al. 2002? / 2011 OS link
Arapan-M Medium Genomes (e.g. E.coli) All Sahli, M. & Shibuya, T. 2011 / 2012 OS link
Arapan-S Small Genomes (Viruses and Bacteria) All Sahli, M. & Shibuya, T. 2011 / 2012 OS link
Celera WGA Assembler / CABOG (large) genomes Sanger, 454, Solexa Myers, G. et al.; Miller G. et al. 2004 / 2015 OS link
CLC Genomics Workbench & CLC Assembly Cell genomes Sanger, 454, Solexa, SOLiD CLC bio 2008 / 2010 / 2014 C link
Cortex genomes Solexa, SOLiD Iqbal, Z. et al. 2011 OS link
DNA Baser Assembler (small) genomes Sanger, 454 Heracle BioSoft SRL 06.2015 C
DNA Dragon genomes Illumina, SOLiD, Complete Genomics, 454, Sanger SequentiX 2011 C link
DNAnexus genomes Illumina, SOLiD, Complete Genomics DNAnexus 2011 C link
Edena genomes Illumina D. Hernandez, P. François, L. Farinelli, M. Osteras, and J. Schrenzel. 2008/2013 OS link
Euler genomes Sanger, 454 (,Solexa ?) Pevzner, P. et al. 2001 / 2006? (C / NC-A?) link
Euler-sr genomes 454, Solexa Chaisson, MJ. et al. 2008 NC-A link
Fermi (large) genomes Illumina Li, H. 2012 OS link
Forge (large) genomes, EST, metagenomes 454, Solexa, SOLID, Sanger Platt, DM, Evers, D. 2010 OS link
Geneious genomes Sanger, 454, Solexa, Ion Torrent, Complete Genomics, PacBio, Oxford Nanopore, Illumina Biomatters Ltd 2009 / 2013 C link
Graph Constructor (large) genomes Sanger, 454, Solexa, SOLiD Convey Computer Corporation 2011 C link
IDBA (Iterative De Bruijn graph short read Assembler) (large) genomes Sanger,454,Solexa Yu Peng, Henry C. M. Leung, Siu-Ming Yiu, Francis Y. L. Chin 2010 (C / NC-A?) link
LIGR Assembler (derived from TIGR Assembler) genomic Sanger - 2009/ 2012 OS link
MaSuRCA (Maryland Super Read - Celera Assembler) (large) genomes Sanger, Illumina, 454 Aleksey Zimin, Guillaume Marçais, Daniela Puiu, Michael Roberts, Steven L. Salzberg, James A. Yorke 2012 / 2013 OS link
MIRA (Mimicking Intelligent Read Assembly) genomes, ESTs Sanger, 454, Solexa Chevreux, B. 1998 / 2014 OS link
NextGENe (small genomes?) 454, Solexa, SOLiD Softgenetics 2008 C link
Newbler genomes, ESTs 454, Sanger 454/Roche 2009/2012 C link
PADENA genomes 454, Sanger 454/Roche 2010 OS link
PASHA (large) genomes Illumina Liu, Schmidt, Maskell 2011 OS link
Phrap genomes Sanger, 454, Solexa Green, P. 1994 / 2008 C / NC-A link
TIGR Assembler genomic Sanger - 1995 / 2003 OS link
Ray[7] genomes Illumina, mix of Illumina and 454, paired or not Sébastien Boisvert, François Laviolette & Jacques Corbeil. 2010 OS [GNU General Public License] link
Sequencher genomes traditional and next generation sequence data Gene Codes Corporation 1991 / 2009 / 2011 C link
SeqMan NGen (large) genomes, exomes, transcriptomes, metagenomes, ESTs Illumina, ABI SOLiD, Roche 454, Ion Torrent, Solexa, Sanger DNASTAR 2007 / 2014 C link
SGA (large) genomes Illumina, Sanger (Roche 454?, Ion Torrent?) Simpson, J.T. et al. 2011 / 2012 OS link
SHARCGS (small) genomes Solexa Dohm et al. 2007 / 2007 OS link
SOPRA genomes Illumina, SOLiD, Sanger, 454 Dayarian, A. et al. 2010 / 2011 OS link
SparseAssembler (large) genomes Illumina, 454, Ion torrent Ye, C. et al. 2012 / 2012 OS link
SSAKE (small) genomes Solexa (SOLiD? Helicos?) Warren, R. et al. 2007 / 2014 OS link
SOAPdenovo genomes Solexa Li, R. et al. 2009 / 2013 OS link
SPAdes (small) genomes, single-cell Illumina, Solexa, Sanger, 454, Ion Torrent, PacBio, Oxford Nanopore Bankevich, A et al. 2012 / 2015 OS link
Staden gap4 package BACs (, small genomes?) Sanger Staden et al. 1991 / 2008 OS link
Taipan (small) genomes Illumina Schmidt, B. et al. 2009 / 2009 OS link
VCAKE (small) genomes Solexa (SOLiD?, Helicos?) Jeck, W. et al. 2007 / 2009 OS link
Phusion assembler (large) genomes Sanger Mullikin JC, et al. 2003 / 2006 OS link
Quality Value Guided SRA (QSRA) genomes Sanger, Solexa Bryant DW, et al. 2009 / 2009 OS link
Velvet (small) genomes Sanger, 454, Solexa, SOLiD Zerbino, D. et al. 2007 / 2011 OS link
*Licences: OS = Open Source; C = Commercial; C / NC-A = Commercial but free for non-commercial and academics; Brackets = unclear, but most likely C / NC-A

See also[edit]


  1. ^ Myers, E. W.; Sutton, GG; Delcher, AL; Dew, IM; Fasulo, DP; Flanigan, MJ; Kravitz, SA; Mobarry, CM; et al. (March 2000). "A whole-genome assembly of Drosophila". Science 287 (5461): 2196–204. doi:10.1126/science.287.5461.2196. PMID 10731133. 
  2. ^ Batzoglou, S.; Jaffe, DB; Stanley, K; Butler, J; Gnerre, S; Mauceli, E; Berger, B; Mesirov, JP; Lander, ES (January 2002). "ARACHNE: a whole-genome shotgun assembler". Genome Research 12 (1): 177–89. doi:10.1101/gr.208902. PMC 155255. PMID 11779843. 
  3. ^ AMOS page with links to various papers
  4. ^ Copy in Google groups of the post announcing MIRA 2.9.8 hybrid version in the Usenet group
  5. ^ Dohm, J. C.; Lottaz, C.; Borodina, T.; Himmelbauer, H. (November 2007). "SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing". Genome Research 17 (11): 1697–706. doi:10.1101/gr.6435207. PMC 2045152. PMID 17908823. 
  6. ^ list of software including mapping assemblers in the SeqAnswers discussion forum.
  7. ^ Boisvert, Sébastien; Laviolette, François; Corbeil, Jacques (October 2010). "Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies". Journal of Computational Biology 17 (11): 1519–33. doi:10.1089/cmb.2009.0238. PMC 3119603. PMID 20958248.