User:Seppi333/sandbox8
- !/usr/bin/perl
- Author: Alex Di Genova
- Laboratory: ERABLE/INRIA
- Copyright (c)
- year: 2019
=head1 NAME
B<Wengan> - An accurate and ultrafast genome assembler
=head1 SYNOPSIS
# Assembling Oxford Nanopore and Illumina reads with WenganM wengan.pl -x ontraw -a M -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l ont.fastq.gz -p asm1 -t 20 -g 3000
# Assembling PacBio reads and Illumina reads with WenganA wengan.pl -x pacraw -a A -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm2 -t 20 -g 3000
# Assembling ultra-long Nanopore reads and BGI reads with WenganM wengan.pl -x ontlon -a M -s lib2.fwd.fastq.gz,lib2.rev.fastq.gz -l ont.fastq.gz -p asm3 -t 20 -g 3000
# Hybrid long-read only assembly of PacBio Circular Consensus Sequence and Nanopore data with WenganM wengan.pl -x ccsont -a M -l ont.fastq.gz -b ccs.fastq.gz -p asm4 -t 20 -g 3000
# Assembling ultra-long Nanopore reads and Illumina reads with WenganD (requires a high memory machine 600Gb) wengan.pl -x ontlon -a D -s lib2.fwd.fastq.gz,lib2.rev.fastq.gz -l ont.fastq.gz -p asm5 -t 20 -g 3000
# Assembling pacraw reads with pre-assembled short-read contigs from Minia3 wengan.pl -x pacraw -a M -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm6 -t 20 -g 3000 -c contigs.minia.fa
# Assembling pacraw reads with pre-assembled short-read contigs from Abyss wengan.pl -x pacraw -a A -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm7 -t 20 -g 3000 -c contigs.abyss.fa
# Assembling pacraw reads with pre-assembled short-read contigs from DiscovarDenovo wengan.pl -x pacraw -a D -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l pac.fastq.gz -p asm8 -t 20 -g 3000 -c contigs.disco.fa
=head1 DESCRIPTION
B<Wengan> is a new genome assembler that, unlike most of the current long-reads assemblers, avoids entirely the all-vs-all read comparison.
The key idea behind B<Wengan> is that long-read alignments can be B<inferred by building paths> on a sequence graph. To achieve this, B<Wengan> builds a new sequence graph called the Synthetic Scaffolding Graph (SSG). The SSG is built from a spectrum of synthetic mate-pair libraries extracted from raw long-reads. Then, longer alignments are built by performing a transitive reduction of the edges.
Another distinct feature of B<Wengan> is that it is the only assembler that performs B<self-validation> by following the read information. B<Wengan> identifies miss-assemblies at different steps of the assembly process. For more information about the algorithmic ideas behind B<Wengan>, please read the preprint available on bioRxiv.
=head2 ABOUT THE NAME
B<Wengan> is a Mapudungun word. The B<Mapudungun> is the language of the B<Mapuche> people, the largest indigenous inhabitants of south-central Chile. B<Wengan> means I<"Making the path">. Thus, when you assemble a genome with B<Wengan>, you are I<making the genome path>.
=head1 AUTHOR - Alex Di Genova
Email digenova@gmail.com
=head1 SHORT-READ ASSEMBLY
B<Wengan> uses a De Brujin graph assembler to build the assembly backbone from short-read data. Currently, B<Wengan> can use B<Minia3>, B<Abyss2> or B<DiscoVarDenovo>. The recommended short-read coverage is B<50-60X> of 2 x 150bp or 2 x 250bp short reads.
=head2 WenganM [M]
This B<Wengan> mode use the B<Minia3> short-read assembler. This is the fastest mode of B<Wengan> and can assemble a complete human genome in less than 210 CPU hours (~50GB of RAM).
=head2 WenganA [A]
This B<Wengan> mode use the B<Abyss2> short-read assembler. This is the lowest memory mode of B<Wengan> and can assemble a complete human genome with less than 40GB of RAM (~900 CPU hours). This assembly mode takes ~2 days when using 20 CPUs on a single machine.
=head2 WenganD [D]
This B<Wengan> mode use the B<DiscovarDenovo> short-read assembler. This is the greedier memory mode of B<Wengan> and, for assembling a complete human genome, needs about 600GB of RAM (~900 CPU hours). This assembly mode takes ~2 days when using 20 CPUs on a single machine.
=cut
- We start coding
use strict;
- use Data::Dumper;
use Getopt::Std; use FindBin;
- we set the Wengan root directory
$ENV{WROOTDIR}="$FindBin::Bin";
- load the wengan library
use lib "$FindBin::Bin/perl";
- local perl clases to control the wengan execution
use Wengan::Reads; # class to handle the read data. use Wengan::Scheduler::Local; # the scheduler is make and control the execution end-to-end
sub usage {
die(qq/ Usage example: # Assembling Oxford Nanopore and Illumina reads with WenganM wengan.pl -x ontraw -a M -s lib1.fwd.fastq.gz,lib1.rev.fastq.gz -l ont.fastq.gz -p asm1 -t 20 -g 3000
Wengan options:
Mandatory options***: -x preset [ontlon,ontraw,pacraw,ccsont,ccspac] -a Mode [M,A,D] -s Short-reads [fwd1.fastq.gz,rev1.fastq.gz..] -l Long-reads.fq.gz -b Hifi-reads [hifi.fq.gz,hifi2.fq.gz..] -g 3000 [genome size in Mb] -p prefix
General Options: -h [detail information] -t cores [1] -c <pre-assembled short-read contigs> -i <insert size lists for raw long-reads> -I <insert size list for CCS reads> -n <show pipeline comands>;
Advanced Options (Change the presets): FastMin-SG options: -k k-mer size [15-28] -w minimizer window [5-15] -q minimum mapping quality [20-60] -m moving window [150] IntervalMiss options: -d Minimum base coverage [def:7] -f Factor to compute std using avg*fst [def:0.1] Liger options: -M Minimum contig length in TR [def:2000] -R Repeat copy number factor [def:1.5] -L Length of long mate-edges [def:100000] -N Number of long-reads needed to keep a potentially erroneous mate-edge [def:5] -P Minimum length of reduced paths to convert them to physical fragments [def:20kb] -Q Minimum contig length in matching [def:2000] -U Repeat copy number factor backbone [def:1.5] -T Number of iterations in TR per edge [def:1M] HiFi options: -D Ploidy for Hifi hybrid assembly [haploid=1,diploid=2]
$0 -h for detailed usage information. \n/); exit 1;
}
- x:s:l:p:t:g:a:c:i:k:w:q:m:d:M:L:N:P:hn
my %opts = ();
getopts( "x:s:l:p:t:g:a:b:c:f:i:I:k:w:q:m:d:M:L:N:P:R:K:D:Q:U:T:hn", \%opts );
- display help usage using the perldoc
if($opts{h}){
system("perldoc $0"); exit 0;
}
- mandatory variables mode, long-reads, short-reads and genome size.
- if (!defined $opts{x} or !defined $opts{l} or !defined $opts{s} or !defined $opts{g}) {
if(!defined $opts{x} or !defined $opts{l} or !defined $opts{g}){
usage;
}
- object that handle the reads in fastq format and compressed with gzip
my $reads= new Wengan::Reads() or die "cannot create read object\n";
- print Dumper($reads);
- we check the preset for long-reads
if($opts{x} eq "ontlon"){ wengan_ontlon($reads,%opts); }elsif($opts{x} eq "ontraw"){ wengan_ontraw($reads,%opts); }elsif($opts{x} eq "pacraw"){ wengan_pacraw($reads,%opts); }elsif($opts{x} eq "pacccs"){ wengan_pacccs($reads,%opts); }elsif($opts{x} eq "ccsont" or $opts{x} eq "ccspac"){ wengan_ccslong($reads,%opts); }else{ print "Unkown preset $opts{x}\n"; usage(); exit 1; }
- check that WenganM is called for hybrid assembly with CCS
if($opts{x}=~m/ccs/ and $opts{a} ne "M"){
print "Currently only the WenganM pipeline is available for Hybrid assembly with Circular Consensus Sequence (CCS) data\n"; print " and have been only tested on haploid genomes\n"; exit 1;
}
if (defined $opts{D} && $opts{D} !=1){
print "WARNING :: Hybrid pipeline with CCS reads have been tested only in haploid genomes\n";
}
- we mark the haploid case has default
if(!defined $opts{D}){
$opts{D}=1;
}
- we define number of threads for the pipeline.
if(!defined $opts{t}){
$opts{t}=1;#default is one
}
- we check some variables Q <= M
if(defined $opts{M} and !defined $opts{Q}){
$opts{Q}=$opts{M};
}
if((defined $opts{M} and defined $opts{Q}) and ($opts{Q} > $opts{M})){
print "Q should be <= than M; given values [Q=".$opts{Q}.",M=".$opts{M}."]\n"; exit 1;
}
my $pipeline=();
- we check the pipeline called
if($opts{a} eq "M"){
$pipeline=Wengan::Scheduler::Local->new($reads,"WenganM",%opts);
}elsif($opts{a} eq "A"){
$pipeline=Wengan::Scheduler::Local->new($reads,"WenganA",%opts);
}elsif($opts{a} eq "D"){
$pipeline=Wengan::Scheduler::Local->new($reads,"WenganD",%opts);
}else{ print "Unkown pipeline $opts{a}\n";
print "Available pipelines: WenganM, WenganA, WenganD\n";
usage(); exit 1; }
- print Dumper($pipeline)
- we check if the user want to see the pipeline commands
if($opts{n}){
$pipeline->show_pipeline();
}else{ $pipeline->run(); }
=head1 LONG-READ PRESETS
The presets define several variables of the Wengan pipeline execution and depends on the long-read technology used to sequence the genome. The recommended long-read coverage is 30X.
=cut
=head2 ontlon
preset for raw ultra-long-reads from Oxford Nanopore, typically having an N50 > 50kb.
=cut
sub wengan_ontlon{ my ($reads,%opts)=@_; $reads->add_short_reads($opts{s}); $reads->add_long_reads($opts{l}); }
=head2 ontraw
preset for raw long-reads Nanopore reads typically having an N50 ~[15kb-40kb].
=cut
sub wengan_ontraw{
my ($reads,%opts)=@_;
$reads->add_short_reads($opts{s}); $reads->add_long_reads($opts{l});
#print Dumper($reads);
}
=head2 pacraw
preset for raw long-reads from Pacific Bioscience (PacBio) typically having an N50 ~[8kb-60kb].
=cut
sub wengan_pacraw{
my ($reads,%opts)=@_;
$reads->add_short_reads($opts{s}); $reads->add_long_reads($opts{l});
#print Dumper($reads);
}
=head2 ccsont/ccspac
preset for Hybrid assembly of Circular Consensus Sequences from Pacific Bioscience (PacBio) typically having an N50 ~[15kb]. The current version has been tested only in haploid human genomes.
=cut
sub wengan_ccslong{
my ($reads,%opts)=@_; #we add the long-reads
$reads->add_long_reads($opts{l},"long");
#we add the ccs reads $reads->add_long_reads($opts{b},"ccs"); #print Dumper($reads);
}
=head2 pacccs
preset for Circular Consensus Sequences from Pacific Bioscience (PacBio) typically having an N50 ~[15kb].
=cut
sub wengan_pacccs{
my ($reads,%opts)=@_;
$reads->add_long_reads($opts{l}); }
=head1 WENGAN ADVANCED OPTIONS
The following options allows to override the presets of B<Wengan> components. Don't change this variables if you are not sure.
=head2 FastMin-SG
An alignment-free algorithm for ultrafast scaffolding graph construction from short or long reads.
=head3 FastMin-SG options (Override the presets)
Indexing: -k INT k-mer size (no larger than 28) [15] -w INT minimizer window size [10]
Mapping synthetic read-pairs:
-i list Insert sizes for synthetic libraries [i.e. 500,1000,2000,3000,4000,5000, ... ,20000] -q INT Minimum quality score (no larger than 60) [40] -m INT Moving window [150]
=head2 IntervalMiss
IntervalMiss detect miss-assembled contigs and correct them when necessary.
=head3 IntervalMiss options
-d INT Minimum base coverage [<10]
=head2 Liger
Liger uses the Synthetic Scaffoding Graph to compute overlap among long reads, order and orient short contigs, validate scaffolds sequences, fill the gaps, and polish the assembly.
=head3 Liger options
Short-contigs options:
-M INT Minimum contig length in scaffolding [--mcs] (default=`2000', min=`1000') -R FLOAT Repeat copy number factor [--rcn] (default=`1.5')
Long-reads overlap options:
-L Length of long mate-edges [--lme] (default=`100000')
Validation of lines options:
-N INT Number of long-reads needed to keep a potentially erroneous mate-edge [--nlm] (default=`5', min=`1') -P INT Minimum length of reduced paths to convert them to physical fragments [--mlp] (default=`20000', min=`5000')
=cut
sub dirname { my $prog = shift; return '.' unless ($prog =~ /\//); $prog =~ s/\/[^\s\/]+$//g; return $prog; }