Jump to content

De novo sequence assemblers

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by Will Sandberg (talk | contribs) at 13:36, 22 July 2020. The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

De novo sequence assemblers are a type of program that assembles short nucleotide sequences into longer ones without the use of a reference genome. These are most commonly used in bioinformatic studies to assemble genomes or transcriptomes. Two common types of de novo assemblers are greedy algorithm assemblers and De Bruijn graph assemblers.

Types of de novo assemblers

There are two types of algorithms that are commonly utilized by these assemblers: greedy, which aim for local optima, and graph method algorithms, which aim for global optima. Different assemblers are tailored for particular needs, such as the assembly of (small) bacterial genomes, (large) eukaryotic genomes, or transcriptomes.

Greedy algorithm assemblers are assemblers that find local optima in alignments of smaller reads. Greedy algorithm assemblers typically feature several steps: 1) pairwise distance calculation of reads, 2) clustering of reads with greatest overlap, 3) assembly of overlapping reads into larger contigs, and 4) repeat. These algorithms typically do not work well for larger read sets, as they do not easily reach a global optimum in the assembly, and do perform well on read sets that contain repeat regions.[1] Early de novo sequence assemblers, such as SEQAID[2] (1984) and CAP[3] (1992), used greedy algorithms, such as overlap-layout-consensus (OLC) algorithms. These algorithms find overlap between all reads, use the overlap to determine a layout (or tiling) of the reads, and then produce a consensus sequence. Some programs that used OLC algorithms featured filtration (to remove read pairs that will not overlap) and heuristic methods to increase speed of the analyses.

Graph method assemblers[4] come in two varieties: string and De Bruijn. String graph and De Bruijn graph method assemblers were introduced at a DIMACS[5] workshop in 1994 by Waterman[6] and Gene Myers.[7] These methods represented an important step forward in sequence assembly, as they both use algorithms to reach a global optimum instead of a local optimum. While both of these methods made progress towards better assemblies, the De Bruijn graph method has become the most popular in the age of next-generation sequencing. During the assembly of the De Bruijn graph, reads are broken into smaller fragments of a specified size, k. The k-mers are then used as nodes in the graph assembly. Nodes that overlap by some amount (generally, k-1) are then connect by an edge. The assembler will then construct sequences based on the De Bruijn graph. De Bruijn graph assemblers typically perform better on larger read sets than greedy algorithm assemblers (especially when they contain repeat regions).

Commonly used programs

Different assemblers are designed for different type of read technologies. Reads from second generation technologies (called short read technologies) like Illumina are typically short (with lengths of the order of 50-200 base pairs) and have error rates of around 0.5-2%, with the errors chiefly being substitution errors. However, reads from third generation technologies like PacBio and fourth generation technologies like Oxford Nanopore (called long read technologies) are longer with read lengths typically in the thousands or tens of thousands and have much higher error rates of around 10-20% with errors being chiefly insertions and deletions. This necessitates different algorithms for assembly from short and long read technologies.

SPAdes

SPAdes[8] is a De Bruijn graph method assembler that is designed to assemble small genomes, such as bacterial genomes. It uses a multi-sized De Bruijn graph to guide assembly.

Ray

Ray[9] is suite of assemblers that includes: Ray (de novo assembly of single genomes), RayMeta (de novo assembly of metagenomes), RayCommunities (microbe abundance and taxonomic profiling), RayOntologies (gene ontology profiling), and RaySurveyor (compares genomic content between samples). Ray also has a web-interface, called Ray Cloud Browser.

ABySS

A de novo, parallel, paired-end sequence assembler designed for large genome assembly of short reads. There are two versions: ABySS[10] (genomic) and Trans-ABySS[11] (transcriptomic).

ALLPATHS-LG

ALLPATHS-LG is a software designed for the de novo assembly of large and small genomes. The software features algorithms to handle large sequence repeats, correct errors, use data from jumping libraries, be more efficient in memory usage, and assemble low coverage regions.

Trinity

Trinity[12] utilizes a three step process to produce high-quality transcriptome assemblies: 1) Inchworm develops a k-mer library and uses a greedy algorithm to assemble transcript contigs, 2) Chrysalis finds overlap in the Inchworm contigs and assembles De Bruijn graphs, and 3) Butterfly reconciles the Chrysalis De Bruijn graphs with the original read set to reconstruct transcripts. This method has been found to reconstruct high quality transcriptomes.

HGAP

HGAP[13] was the first long read assembler. It was developed by Pacific Biosciences and Joint Genomics Institute and was designed mainly for haploid organisms. It is available here.

Falcon

Falcon[14] is a long read assembler designed by Pacific Biosciences to work on diploid organisms. It is available here. FALCON-Phase, developed in collaboration with Phase Genomics, uses Hi-C data to phase haplotypes.[15]

Canu

Canu[16] is a long read assembler which works on both third and fourth generation reads. It is a successor of the old Celera Assembler. It is available here.

Hinge

Hinge[17] is also a long read assembler which works on both third and fourth generation reads but is designed mainly for shorter microbial genomes rather than large scale genomes. It is available here.

Velvet

Velvet[18] is designed to deal with de novo genome assembly and short read sequencing alignments. This is achieved through the manipulation of de Bruijn graphs for genomic sequence assembly via the removal of errors and the simplification of repeated regions.

Assemblathon

There are numerous programs for de novo sequence assembly and many have been compared in the Assemblathon. The Assemblathon is a periodic, collaborative effort to test and improve the numerous assemblers available. Thus far, two assemblathons have been completed (2011 and 2013) and a third is in progress (as of April 2017). Teams of researchers from across the world choose a program and assemble simulated genomes (Assemblathon 1) and the genomes of model organisms whose that have been previously assembled and annotated (Assemblathon 2). The assemblies are then compared and evaluated using numerous metrics.

Assemblathon 1

Assemblathon 1[19] was conducted in 2011 and featured 59 assemblies from 17 different groups and the organizers. The goal of this Assembalthon was to most accurately and completely assemble a genome that consisted of two haplotypes (each with three chromosomes of 76.3, 18.5, and 17.7 Mb, respectively) that was generated using Evolver. Numerous metrics were used to assess the assemblies, including: NG50 (point at which 50% of the total genome size is reached when scaffold lengths are summed from the longest to the shortest), LG50 (number of scaffolds that are greater than, or equal to, the N50 length), genome coverage, and substitution error rate.

  • Software compared: ABySS, Phusion2, phrap, Velvet, SOAPdenovo, PRICE, ALLPATHS-LG
  • N50 analysis: assemblies by the Plant Genome Assembly Group (using the assembler Meraculous) and ALLPATHS, Broad Institute, USA (using ALLPATHS-LG) performed the best in this category, by an order of magnitude over other groups. These assemblies scored an N50 of >8,000,000 bases.
  • Coverage of genome by assembly: for this metric, BGI's assembly via SOAPdenovo performed best, with 98.8% of the total genome being covered. All assemblers performed relatively well in this category, with all but three groups having coverage of 90% and higher, and the lowest total coverage being 78.5% (Dept. of Comp. Sci., University of Chicago, USA via Kiki).
  • Substitution errors: the assembly with the lowest substitution error rate was submitted by the Wellcome Trust Sanger Institute, UK team using the software SGA.
  • Overall: No one assembler performed significantly better in others in all categories. While some assemblers excelled in one category, they did not in others, suggesting that there is still much room for improvement in assembler software quality.

Assemblathon 2

Assemblathon 2[20] improved on Assemblathon 1 by incorporating the genomes of multiples vertebrates (a bird (Melopsittacus undulatus), a fish (Maylandia zebra), and a snake (Boa constrictor constrictor)) with genomes estimated to be 1.2, 1.0, and 1.6Gbp in length) and assessment by over 100 metrics. Each team was given four months to assemble their genome from Next-Generation Sequence (NGS) data, including Illumina and Roche 454 sequence data.

  • Software compared: ABySS, ALLPATHS-LG, PRICE, Ray, and SOAPdenovo
  • N50 analysis: for the assembly of the bird genome, the Baylor College of Medicine Human Genome Sequencing Center and ALLPATHS teams had the highest NG50s, at over 16,000,000 and over 14,000,000 bp, respectively.
  • Presence of core genes: Most assemblies performed well in this category (~80% or higher), with only one dropping to just over 50% in their bird genome assembly (Wayne State University via HyDA).
  • Overall: Overall, the Baylor College of Medicine Human Genome Sequencing Center utilizing a variety of assembly methods (SeqPrep, KmerFreq, Quake, BWA, Newbler, ALLPATHS-LG, Atlas-Link, Atlas-GapFill, Phrap, CrossMatch, Velvet, BLAST, and BLASR) performed the best for the bird and fish assemblies. For the snake genome assembly, the Wellcome Trust Sanger Institute using SGA, performed best. For all assemblies, SGA, BCM, Meraculous, and Ray submitted competitive assemblies and evaluations. The results of the many assemblies and evaluations described here suggest that while one assembler may perform well on one species, it may not perform as well on another. The authors make several suggestions for assembly: 1) use more than one assembler, 2) use more than one metric for evaluation, 3) select an assembler that excels in metrics of more interest (e.g., N50, coverage), 4) low N50s or assembly sizes may not be concerning, depending on user needs, and 5) assess the levels of heterozygosity in the genome of interest.

References

  1. ^ J. Bang-Jensen; G. Gutin; A. Yeo (2004). "When the greedy algorithm fails". Discrete Optimization. 1 (2): 121–127. doi:10.1016/j.disopt.2004.03.007.
  2. ^ Peltola, Hannu; Söderlund, Hans; Ukkonen, Esko (1984-01-11). "SEQAID: a DNA sequence assembling program based on a mathematical model". Nucleic Acids Research. 12 (1Part1): 307–321. doi:10.1093/nar/12.1Part1.307. ISSN 0305-1048. PMC 321006. PMID 6320092.
  3. ^ Huang, Xiaoqiu (1992-09-01). "A contig assembly program based on sensitive detection of fragment overlaps". Genomics. 14 (1): 18–25. doi:10.1016/S0888-7543(05)80277-0. PMID 1427824.
  4. ^ "How to apply de Bruijn graphs to genome assembly". Nature Biotechnology. 29 (11): 987–991. 2011. doi:10.1038/nbt.2023. PMC 5531759. PMID 22068540. {{cite journal}}: Unknown parameter |authors= ignored (help)
  5. ^ "DIMACS Workshop on Combinatorial Methods for DNA Mapping and Sequencing". October 1994.
  6. ^ Idury, R. M.; Waterman, M. S. (1995-01-01). "A new algorithm for DNA sequence assembly". Journal of Computational Biology. 2 (2): 291–306. CiteSeerX 10.1.1.79.6459. doi:10.1089/cmb.1995.2.291. ISSN 1066-5277. PMID 7497130.
  7. ^ Myers, E. W. (1995-01-01). "Toward simplifying and accurately formulating fragment assembly". Journal of Computational Biology. 2 (2): 275–290. doi:10.1089/cmb.1995.2.275. ISSN 1066-5277. PMID 7497129.
  8. ^ Bankevich, Anton (2012). "SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing". Journal of Computational Biology. 19 (5): 455–477. doi:10.1089/cmb.2012.0021. PMC 3342519. PMID 22506599. {{cite journal}}: Unknown parameter |displayauthors= ignored (|display-authors= suggested) (help)
  9. ^ "Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies". Journal of Computational Biology. 17 (11): 1519–1533. 2010. doi:10.1089/cmb.2009.0238. PMC 3119603. PMID 20958248. {{cite journal}}: Unknown parameter |authors= ignored (help)
  10. ^ Simpson, Jared T. (2009). "ABySS: a parallel assembler for short read sequence data". Genome Research. 19 (6): 1117–1123. doi:10.1101/gr.089532.108. PMC 2694472. PMID 19251739. {{cite journal}}: Unknown parameter |displayauthors= ignored (|display-authors= suggested) (help)
  11. ^ Birol, Inanç (2009). "De novo transcriptome assembly with ABySS". Bioinformatics. 25 (21): 2872–2877. doi:10.1093/bioinformatics/btp367. PMID 19528083. {{cite journal}}: Unknown parameter |displayauthors= ignored (|display-authors= suggested) (help)
  12. ^ Grabherr, Manfred G. (2011). "Full-length transcriptome assembly from RNA-Seq data without a reference genome". Nature Biotechnology. 29 (7): 644–652. doi:10.1038/nbt.1883. PMC 3571712. PMID 21572440. {{cite journal}}: Unknown parameter |displayauthors= ignored (|display-authors= suggested) (help)
  13. ^ Chin, Chen-Shan, David H. Alexander, Patrick Marks, Aaron A. Klammer, James Drake, Cheryl Heiner, Alicia Clum et al. "Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data." Nature methods 10, no. 6 (2013): 563-569. Available online
  14. ^ Chin, Chen-Shan, Paul Peluso, Fritz J. Sedlazeck, Maria Nattestad, Gregory T. Concepcion, Alicia Clum, Christopher Dunn et al. "Phased diploid genome assembly with single-molecule real-time sequencing." Nature methods 13, no. 12 (2016): 1050-1054. Available here
  15. ^ Kingan, Sarah B.; Williams, John L.; Sullivan, Shawn T.; Smith, Timothy P. L.; Hiendleder, Stefan; Hall, Richard J.; Kronenberg, Zev N. (2018-05-21). "FALCON-Phase: Integrating PacBio and Hi-C data for phased diploid genomes". bioRxiv: 327064. doi:10.1101/327064.
  16. ^ Koren, Sergey, Brian P. Walenz, Konstantin Berlin, Jason R. Miller, Nicholas H. Bergman, and Adam M. Phillippy. "Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation." Genome research 27, no. 5 (2017): 722-736. Available here
  17. ^ Kamath, Govinda M., Ilan Shomorony, Fei Xia, Thomas A. Courtade, and N. Tse David. "HINGE: long-read assembly achieves optimal repeat resolution." Genome research 27, no. 5 (2017): 747-756. Available here
  18. ^ Zerbino, D. R.; Birney, E. (2008). “Velvet: de novo assembly using very short reads”. Retrieved 2013-10-18.
  19. ^ Earl, Dent (2011). "Assemblathon 1: a competitive assessment of de novo short read assembly methods". Genome Research. 21 (12): 2224–2241. doi:10.1186/2047-217X-2-10. PMC 3844414. PMID 23870653. {{cite journal}}: Unknown parameter |displayauthors= ignored (|display-authors= suggested) (help)CS1 maint: unflagged free DOI (link)
  20. ^ Bradnam, Keith R. (2013). "Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species". GigaScience. 2 (1): 10. doi:10.1186/2047-217X-2-10. PMC 3844414. PMID 23870653. {{cite journal}}: Unknown parameter |displayauthors= ignored (|display-authors= suggested) (help)CS1 maint: unflagged free DOI (link)