User:Jmmy2013

SPAdes
Developer(s): St. Petersburg Academic University (Russia) and University of California, San Diego (USA)
Stable release: 2.5.1 / 2013
Operating system: Linux, Mac OS
Type: Bioinformatics
License: Free use
Website: http://bioinf.spbau.ru/spades

SPAdes (St. Petersburg genome assembler)[1] is a genome assembly algorithm designed for single-cell and multi-cell bacterial data sets. It may not be suitable for large genome projects.[1][2] The latest version, SPAdes 2.5.1, was released in September 2013 and can be downloaded from http://bioinf.spbau.ru/spades. It works only with Illumina paired-end, mate-pair and single reads; support for other technologies (Roche 454, IonTorrent, PacBio) is still under development.[1] SPAdes has also been integrated into Galaxy pipelines by Guy Lionel; the wrapper is available on the Galaxy tool shed at http://toolshed.g2.bx.psu.edu/view/lionelguy/spades.

Background

Studying the genomes of single cells helps to track changes that occur in DNA over time or in response to different conditions. Additionally, many projects, such as the Human Microbiome Project and antibiotic discovery, would greatly benefit from single-cell sequencing (SCS).[3][4] SCS has an advantage over sequencing DNA extracted from a large number of cells: it avoids averaging out significant variations between cells.[5] Experimental and computational technologies are being optimized to allow researchers to sequence single cells. For instance, amplification of DNA extracted from a single cell is one of the experimental challenges. To maximize the accuracy and quality of SCS, uniform DNA amplification is needed. It was demonstrated that using multiple annealing and looping-based amplification cycles (MALBAC) for DNA amplification generates less bias than polymerase chain reaction (PCR) or multiple displacement amplification (MDA).[6] Furthermore, it has been recognized that the challenges facing SCS are computational rather than experimental.[7] Currently available assemblers, such as Velvet,[8] String Graph Assembler (SGA)[9] and EULER-SR,[10] were not designed to handle SCS assembly.[2] Assembly of single-cell data is difficult because of non-uniform read coverage, variation in insert length, high levels of sequencing errors and chimeric reads.[7][11][12] SPAdes was therefore designed as a new algorithmic approach to address these issues.

SPAdes assembly approach

SPAdes uses k-mers to build the initial de Bruijn graph; in the following stages it performs graph-theoretical operations based on graph structure, coverage and sequence lengths. It also corrects errors iteratively.[2] The stages of assembly in SPAdes are:[2]

  • Stage 1: assembly graph construction. SPAdes employs a multisized de Bruijn graph (see below) and detects and removes bulges/bubbles and chimeric reads.
  • Stage 2: k-bimer (pair of k-mers) adjustment. Exact distances between k-mers in the genome (edges in the assembly graph) are estimated.
  • Stage 3: paired assembly graph construction.
  • Stage 4: contig construction. SPAdes outputs contigs and allows reads to be mapped back to their positions in the assembly graph after graph simplification (backtracking).
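
As an illustration of Stage 1, the following is a minimal Python sketch of de Bruijn graph construction from reads. It is not the SPAdes implementation: the function names are hypothetical, the toy read is invented, and it uses the common textbook convention in which nodes are (k-1)-mers and each k-mer is an edge annotated with its coverage.

```python
from collections import Counter

def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to {successor (k-1)-mer: coverage of the joining k-mer}."""
    # Count how many times each k-mer occurs across all reads (its coverage).
    coverage = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    graph = {}
    for kmer, count in coverage.items():
        # Each k-mer is an edge from its prefix to its suffix.
        graph.setdefault(kmer[:-1], {})[kmer[1:]] = count
    return graph

def branching_nodes(graph):
    """Nodes with more than one outgoing edge (e.g. where a repeat is collapsed)."""
    return {node for node, successors in graph.items() if len(successors) > 1}

# Toy read (an assumption for illustration) containing the repeat ACG twice.
reads = ["TTACGGCACGAT"]
for k in (4, 6):
    graph = de_bruijn_graph(reads, k)
    print(k, sorted(branching_nodes(graph)))
```

With this toy read, k = 4 collapses the two copies of ACG into a single branching node, whereas k = 6 keeps them apart. Conversely, two reads that overlap by fewer than k-1 bases share no node in this convention, which is why a large k fragments the graph where coverage is low.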

Details on SPAdes assembly

Logarithmic coverage plot for single-cell data (E. coli)[13]

SPAdes was designed to overcome the problems associated with the assembly of single cell data as follows:[2]

1. Non-uniform coverage. SPAdes utilizes a multisized de Bruijn graph, which allows different values of k to be used. Smaller values of k are suggested for low-coverage regions to minimize fragmentation, and larger values of k for high-coverage regions to reduce repeat collapsing (Stage 1 above).

2. Variable insert sizes of paired-end reads. SPAdes employs the basic concept of paired de Bruijn graphs. However, paired de Bruijn graphs work well when paired-end reads have a fixed insert size. SPAdes therefore estimates "distances" instead of using "insert sizes". For a read length L, the distance d of a paired-end read is defined as d = insert size − L. Using the k-bimer adjustment approach, these distances are estimated exactly. A k-bimer consists of k-mers α and β together with the estimated distance d between them in the genome, written (α|β, d). This approach breaks the paired-end reads into pairs of k-mers, which are transformed to define pairs of edges (biedges) in the de Bruijn graph. These sets of biedges are used to estimate the distances between edges along the paths between k-mers α and β. The estimates are clustered, and the optimal distance estimate is chosen from each cluster (Stage 2 above). To construct the paired de Bruijn graph, SPAdes employs rectangle graphs (Stage 3). The rectangle graph approach was first introduced in 2012[14] to construct paired de Bruijn graphs with uncertain distances.

3. Bulges, tips and chimeras. Bulges and tips arise from errors in the middle and at the ends of reads, respectively. A chimeric connection joins two unrelated substrings of the genome. SPAdes identifies these based on graph topology and on the length and coverage of the non-branching paths involved. SPAdes keeps a data structure that allows all corrections and removals to be backtracked.
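
The distance definition and the k-bimer decomposition described in item 2 above can be sketched in Python as follows. This is a simplified illustration, not the SPAdes algorithm: the function names and the toy fragment are hypothetical, both reads are assumed to have the same length L and to lie on the same strand, and k-mers are paired only at matching offsets within the two reads.

```python
def read_distance(insert_size, read_length):
    """d = insert size - L, as defined in the text."""
    return insert_size - read_length

def k_bimers(read1, read2, insert_size, k):
    """Split one read pair into k-bimers (alpha, beta, d).

    Assumes equal read lengths and that read2 is already on the same
    strand as read1 (a simplification made for this sketch).
    """
    if len(read1) != len(read2):
        raise ValueError("this sketch assumes equal read lengths")
    d = read_distance(insert_size, len(read1))
    # Pair the k-mers found at the same offset in both reads.
    return [
        (read1[i:i + k], read2[i:i + k], d)
        for i in range(len(read1) - k + 1)
    ]

fragment = "ACGTACGGTCAT"                   # hypothetical 12-base insert
read1, read2 = fragment[:5], fragment[-5:]  # two reads of length L = 5
bimers = k_bimers(read1, read2, insert_size=len(fragment), k=3)
print(bimers)
```

Here every α starts d = 12 − 5 = 7 bases before its β in the fragment. In real data the insert size differs from pair to pair, so a nominal value gives only an approximate d; this is the reason distances have to be estimated.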

SPAdes modifies the previously used bulge removal approach[15] and the iterative de Bruijn graph approach of Peng et al. (2010)[16] to create a new approach called "bulge corremoval", which stands for bulge correction and removal. The bulge corremoval algorithm can be summarized as follows. A simple bulge is formed by two short, similar paths (P and Q) connecting the same hubs. If P is a non-branching path (h-path), SPAdes maps every edge in P to its projection in Q and removes P from the graph; as a result, the coverage of Q increases. Unlike other assemblers, which use a fixed coverage cut-off for bulge removal, SPAdes removes or projects h-paths with low coverage step by step. This is achieved by gradually increasing the cut-off thresholds and iterating through all h-paths in increasing order of coverage (for bulge corremoval and chimeric removal) or length (for tip removal). Moreover, to guarantee that no new sources or sinks are introduced into the graph, SPAdes deletes an h-path (in chimeric h-path removal) or projects it (in bulge corremoval) only if its start and end vertices have at least two outgoing and two ingoing edges, respectively. This helps to remove low-coverage h-paths arising from sequencing errors and chimeric reads, but not those arising from repeats.
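
The idea of gradually increasing cut-offs can be sketched in Python as below. This is a heavily simplified illustration under stated assumptions, not the SPAdes procedure: each h-path is reduced to a single `(start, end, coverage)` tuple, only deletion is modelled (projection of a bulge onto its partner path is omitted), and the function name and example values are hypothetical.

```python
def remove_low_coverage_paths(edges, cutoffs):
    """Delete low-coverage h-paths using gradually increasing cut-offs.

    edges   -- list of (start, end, coverage) tuples, one per h-path
    cutoffs -- coverage thresholds, applied from smallest to largest
    """
    kept = list(edges)
    for cutoff in sorted(cutoffs):
        # Visit h-paths in increasing order of coverage.
        for edge in sorted(kept, key=lambda e: e[2]):
            start, end, coverage = edge
            if coverage > cutoff:
                break  # all remaining h-paths exceed this cut-off
            out_degree = sum(1 for e in kept if e[0] == start)
            in_degree = sum(1 for e in kept if e[1] == end)
            # Delete only if no new source or sink would be created.
            if out_degree >= 2 and in_degree >= 2:
                kept.remove(edge)
    return kept

edges = [
    ("A", "B", 50),  # well-covered path
    ("A", "B", 2),   # low-coverage parallel path (bulge-like)
    ("B", "C", 40),
    ("C", "D", 3),   # low coverage, but the only way out of C
]
print(remove_low_coverage_paths(edges, cutoffs=[1, 5]))
```

In this example the low-coverage path parallel to A→B is deleted once the cut-off reaches 5, while C→D is kept despite its low coverage because deleting it would turn C into a sink and D into a source.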

SPAdes pipelines and performance

SPAdes is composed of three tools:[1]

  • Read error correction tool, BayesHammer.[17] In traditional error correction, rare k-mers are considered errors. This cannot be applied to SCS because of its non-uniform coverage. Therefore, BayesHammer employs probabilistic subclustering, which examines multiple central nucleotides of similar k-mers, as these are better covered than the others.[17] It was claimed that for an Escherichia coli (E. coli) single-cell data set, BayesHammer runs in about 75 min, takes up to 10 GB of RAM to carry out read error correction and requires 10 GB of additional disk space for temporary files.
  • Iterative short-read genome assembler, SPAdes. For the same data set, this step runs for about 75 min. About 40% of this time is spent on stage 1 (see SPAdes assembly approach above) when using three iterations (k = 22, 34 and 56), and about 45%, 14% and 1% on stages 2, 3 and 4, respectively. It also takes up to 5 GB of RAM to perform assembly and needs 8 GB of additional disk space.
  • Mismatch corrector, which uses the BWA tool. This module requires the longest time (about 120 min) and the largest additional disk space (about 21 GB) for temporary files. It takes up to 9 GB of RAM to complete mismatch correction of the assembled E. coli single-cell data set.
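
The figures quoted above can be turned into per-stage and whole-pipeline estimates with simple arithmetic. The Python sketch below does so under two assumptions that the text does not state explicitly: the three modules run one after another, and memory is released between modules, so the peak is the maximum rather than the sum. All names are hypothetical.

```python
# Figures quoted above for the E. coli single-cell data set.
MODULE_MINUTES = {"BayesHammer": 75, "SPAdes": 75, "mismatch corrector": 120}
MODULE_RAM = {"BayesHammer": 10, "SPAdes": 5, "mismatch corrector": 9}  # gigabytes
STAGE_PERCENT = {"stage 1": 40, "stage 2": 45, "stage 3": 14, "stage 4": 1}

def stage_minutes(total_minutes, percentages):
    """Split a module's run time across stages given percentage shares."""
    return {stage: total_minutes * pct / 100 for stage, pct in percentages.items()}

stages = stage_minutes(MODULE_MINUTES["SPAdes"], STAGE_PERCENT)
total_minutes = sum(MODULE_MINUTES.values())  # assumes sequential execution
peak_ram = max(MODULE_RAM.values())           # assumes memory is freed between modules

print(stages)
print(total_minutes, peak_ram)
```

Under these assumptions the assembler spends roughly 30, 34, 10.5 and under 1 minute on stages 1 to 4, and the whole pipeline takes about 270 minutes (4.5 hours) with a peak of 10 gigabytes of RAM.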

Comparing assemblers

A recent study[18] compared several genome assemblers on single-cell E. coli samples. The assemblers were EULER-SR,[10] Velvet,[8] SOAPdenovo,[19] Velvet-SC, EULER + Velvet-SC (E+V-SC),[15] IDBA-UD[20] and SPAdes. IDBA-UD and SPAdes performed the best.[18] SPAdes had the largest NG50 (99,913; the NG50 statistic is the same as N50 except that the genome size is used rather than the assembly size).[21] Moreover, measured against the E. coli reference genome,[22] SPAdes assembled the highest percentage of the genome (97%) and the highest number of complete genes (4,071 out of 4,324).[18] The assemblers' performances were as follows:[18]

  • Number of contigs:

IDBA-UD < Velvet < E+V-SC < SPAdes < EULER-SR < Velvet-SC < SOAPdenovo

  • NG50:

SPAdes > IDBA-UD >>> E+V-SC > EULER-SR > Velvet > Velvet-SC > SOAPdenovo

  • Largest contig:

IDBA-UD > SPAdes >> EULER-SR > Velvet = E+V-SC > Velvet-SC > SOAPdenovo

  • Mapped genome (%):

SPAdes > IDBA-UD > E+V-SC > Velvet-SC > EULER-SR > SOAPdenovo > Velvet

  • Number of misassemblies:

E+V-SC = Velvet = Velvet-SC < SOAPdenovo < IDBA-UD < SPAdes < EULER-SR
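
The difference between N50 and NG50 mentioned above can be made concrete with a short Python sketch. The function names and the contig lengths are hypothetical, and ties and edge cases may be handled differently by real assessment tools.

```python
def nx50(contig_lengths, total_length):
    """Length of the contig at which the running sum of contigs, taken
    longest first, reaches half of total_length (None if it never does)."""
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total_length:
            return length
    return None  # the contigs cover less than half of total_length

def n50(contig_lengths):
    """N50: the reference total is the assembly size."""
    return nx50(contig_lengths, sum(contig_lengths))

def ng50(contig_lengths, genome_size):
    """NG50: the reference total is the genome size."""
    return nx50(contig_lengths, genome_size)

contigs = [80, 70, 50, 40, 30, 20]  # toy contig lengths, assembly size 290
print(n50(contigs))                 # uses half of 290
print(ng50(contigs, 400))           # uses half of a 400-base genome
```

For these toy contigs N50 is 70 while NG50 for a 400-base genome is 50. Because NG50 is anchored to the genome size, an incomplete assembly is penalized and assemblies of different total size can be compared on the same scale.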

References

  1. http://spades.bioinf.spbau.ru/release2.5.1/manual.html
  2. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, Pyshkin AV, Sirotkin AV, Vyahhi N, Tesler G, Alekseyev MA, Pevzner PA (2012). "SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing". Journal of Computational Biology. 19: 455–477. doi:10.1089/cmb.2012.0021.
  3. Gill S, Pop M, Deboy R, Eckburg P, Turnbaugh P, Samuel B, Gordon J, Relman D, Fraser-Liggett C, Nelson K (2006). "Metagenomic analysis of the human distal gut microbiome". Science. 312: 1355–1359. doi:10.1126/science.1124234.
  4. Li J, Vederas J (2009). "Drug discovery and natural products: end of an era or an endless frontier?". Science. 325: 161–165. doi:10.1126/science.1168243.
  5. Lu S, Zong C, Fan W, Yang M, Li J, Chapman A, Zhu P, Hu X, Xu L, Yan L, F B, Qiao J, Tang F, Li R, Xie X (2012). "Probing meiotic recombination and aneuploidy of single sperm cells by whole-genome sequencing". Science. 338: 1627–1630. doi:10.1126/science.1229112.
  6. http://news.harvard.edu/gazette/story/2013/01/one-cell-is-all-you-need/
  7. Rodrigue S, Malmstrom RR, Berlin AM, Birren BW, Henn MR, Chisholm SW (2009). "Whole genome amplification and de novo assembly of single bacterial cells". PLoS One. 4: e6864.
  8. Zerbino D, Birney E (2008). "Velvet: algorithms for de novo short read assembly using de Bruijn graphs". Genome Research. 18: 821–829.
  9. Simpson JT, Durbin R (2012). "Efficient de novo assembly of large genomes using compressed data structures". Genome Research. 22: 549–556.
  10. Pevzner PA, Tang H, Waterman MS (2001). "An Eulerian path approach to DNA fragment assembly". Proceedings of the National Academy of Sciences of the United States of America. 98: 9748–9753.
  11. Medvedev P, Scott E, Kakaradov B, Pevzner P (2011). "Error correction of high-throughput sequencing datasets with non-uniform coverage". Bioinformatics. 27: i137–141. doi:10.1093/bioinformatics/btr208.
  12. Ishoey T, Woyke T, Stepanauskas R, Novotny M, Lasken RS (2008). "Genomic sequencing of single microbial cells from environmental samples". Current Opinion in Microbiology. 11: 198–204. doi:10.1016/j.mib.2008.05.006.
  13. http://openi.nlm.nih.gov/detailedresult.php?img=3549815_1471-2164-14-S1-S7-1&req=4
  14. Vyahhi N, Pham SK, Pevzner P (2012). "From de Bruijn graphs to rectangle graphs for genome assembly". Lecture Notes in Bioinformatics. 7534: 249–261.
  15. Chitsaz H, Yee-Greenbaum JL, Tesler G, Lombardo MJ, Dupont CL, Badger JH, Novotny M, Rusch DB, Fraser LJ, Gormley NA, Schulz-Trieglaff O, Smith GP, Evers DJ, Pevzner PA, Lasken RS (2011). "Efficient de novo assembly of single-cell bacterial genomes from short-read data sets". Nat Biotechnol. 29: 915–921.
  16. Peng Y, Leung HCM, Yiu SM, Chin FYL (2010). "IDBA – a practical iterative de Bruijn graph de novo assembler". Lect. Notes Comput. Sci. 6044: 426–440.
  17. Nikolenko SI, Korobeynikov AI, Alekseyev MA (2012). "BayesHammer: Bayesian clustering for error correction in single-cell sequencing". BMC Genomics. 14: 57. doi:10.1186/1471-2164-14-S1-S7.
  18. Gurevich A, Saveliev V, Vyahhi N, Tesler G (2013). "QUAST: quality assessment tool for genome assemblies". Bioinformatics. 29: 1072–1075.
  19. Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K, Li S, Yang H, Wang J, Wang J (2010). "De novo assembly of human genomes with massively parallel short read sequencing". Genome Research. 20: 265–272.
  20. Peng Y, Leung HCM, Yiu SM, Chin FYL (2012). "IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth". Bioinformatics. 28: 1–8.
  21. http://bioinf.spbau.ru/spades/
  22. Blattner FR, Plunkett G, Bloch C, Perna N, Burland V, Riley M, Collado-Vides J, Glasner J, Rode C, Mayhew G, Gregor J, Davis N, Kirkpatrick H, Goeden M, Rose D, Mau B, Shao Y (1997). "The complete genome sequence of Escherichia coli K-12". Science. 277: 1453–1462.

Category:Software