List of RNA-Seq bioinformatics tools
RNA-Seq is a technique for transcriptome studies based on next-generation sequencing technologies. It depends heavily on bioinformatics tools developed to support the different steps of the process. Listed here are some of the principal tools in common use, along with links to related web resources.
For an integrated guide to the analysis of RNA-Seq data, see Next Generation Sequencing (NGS)/RNA, Hands-On Tutorial or RNA-Seq Workflow. Other useful resources are SEQanswers, RNA-SeqList, RNA-SeqBlog, Biostar and bioscholar.
- 1 Quality control and pre-processing data
- 2 Alignment Tools
- 3 Quantitative analysis and Differential Expression
- 4 Workbench (analysis pipeline / integrated solutions)
- 5 Alternative Splicing Analysis
- 6 Bias Correction
- 7 Fusion genes/chimeras/translocation finders/structural variations
- 8 Copy Number Variation identification
- 9 RNA-Seq simulators
- 10 Transcriptome assemblers
- 11 miRNA prediction
- 12 Visualization tools
- 13 Functional, Network & Pathway Analysis Tools
- 14 Further annotation tools for RNA-Seq data
- 15 RNA-Seq Databases
- 16 Webinars and Presentations
- 17 References
Quality control and pre-processing data
Quality control and filtering data
Quality assessment is the first step of the RNA-Seq bioinformatics pipeline. It is often necessary to filter the data, removing low-quality sequences or bases (trimming), adapters, contaminants or overrepresented sequences, to ensure a coherent final result.
- Biopieces Biopieces.
- BBDuk BBDuk. Ultrafast, multithreaded tool to trim adapters and filter or mask contaminants based on kmer-matching, allowing a hamming- or edit-distance, as well as degenerate bases. Also performs optimal quality-trimming and filtering, format conversion, contaminant concentration reporting, gc-filtering, length-filtering, entropy-filtering, chastity-filtering, and generates text histograms for most operations. Interconverts between fastq, fasta, sam, scarf, interleaved and 2-file paired, gzipped, bzipped, ASCII-33 and ASCII-64. Keeps pairs together. Open-source, written in pure Java; supports all platforms with no recompilation and no other dependencies.
- clean_reads clean_reads.
- condetri condetri.
- cutadapt cutadapt removes adapter sequences from next-generation sequencing data (Illumina, SOLiD and 454). It is especially useful when the read length of the sequencing machine is longer than the sequenced molecule, as is the case for microRNA.
- FastQC FastQC is a quality control tool for high-throughput sequence data (Babraham Institute), developed in Java. Data can be imported from FastQ, BAM or SAM files. The tool gives an overview that flags problematic areas, with summary graphs and tables for rapid assessment of the data. Results are presented as permanent HTML reports. FastQC can be run as a stand-alone application or integrated into a larger pipeline solution. See also seqanswers/FastQC.
- FastqMcf FastqMcf.
- FASTX FASTX Toolkit is a set of command-line tools for manipulating reads in FASTA or FASTQ files. These commands make it possible to preprocess the files before mapping with tools like Bowtie. Supported tasks include conversion from FASTQ to FASTA format, quality statistics, removal of sequencing adapters, filtering and cutting of sequences based on quality, and DNA/RNA conversion.
- Flexbar Flexbar performs removal of adapter sequences, trimming and filtering features.
- FreClu FreClu improves overall alignment accuracy by performing sequencing-error correction, trimming short reads based on a clustering methodology.
- HTSeq HTSeq.
- htSeqTools htSeqTools is a Bioconductor package for quality control, data processing and visualization. htSeqTools makes it possible to visualize sample correlations, remove over-amplification artifacts, assess enrichment efficiency, correct strand bias and visualize hits.
- NxTrim NxTrim.
- PRINSEQ PRINSEQ generates statistics of your sequence data for sequence length, GC content, quality scores, n-plicates, complexity, tag sequences, poly-A/T tails, odds ratios.
- qrqc qrqc is a Bioconductor package for quick read quality control.
- RNA-SeQC RNA-SeQC is a tool with applications in experiment design, process optimization and quality control before computational analysis. It provides three types of quality control: read counts (such as duplicate reads, mapped reads and mapped unique reads, rRNA reads, transcript-annotated reads, strand specificity), coverage (such as mean coverage, mean coefficient of variation, 5’/3’ coverage, gaps in coverage, GC bias) and expression correlation (the tool provides RPKM-based estimation of expression levels). RNA-SeQC is implemented in Java and requires no installation; it can also be run through the GenePattern web interface. The input can be one or more BAM files, and the output is a set of HTML reports.
- RSeQC RSeQC analyzes diverse aspects of RNA-Seq experiments: sequence quality, sequencing depth, strand specificity, GC bias, read distribution over the genome structure and coverage uniformity. The input can be SAM, BAM, FASTA or BED files, or a chromosome size file (two-column, plain text file). Visualization can be performed with genome browsers such as UCSC, IGB and IGV, and R scripts can also be used for visualization.
- Sabre sabre.
- SAMStat SAMStat identifies problems and reports several statistics at different phases of the process. This tool evaluates unmapped, poorly and accurately mapped sequences independently to infer possible causes of poor mapping.
- Scythe scythe.
- SEECER seecer SEECER is a sequencing error correction algorithm for RNA-seq data sets. It takes the raw read sequences produced by a next generation sequencing platform like machines from Illumina or Roche. SEECER removes mismatch and indel errors from the raw reads and significantly improves downstream analysis of the data. Especially if the RNA-Seq data is used to produce a de novo transcriptome assembly, running SEECER can have tremendous impact on the quality of the assembly.
- Sickle Sickle.
- ShortRead ShortRead is a package for the R / Bioconductor environment that allows input, manipulation, quality assessment and output of next-generation sequencing data. It supports data manipulation such as filtering to remove reads based on predefined criteria. ShortRead can be complemented with several other Bioconductor packages for further analysis and visualization (Biostrings, BSgenome, IRanges, and so on). See also seqanswers/ShortRead.
- TagCleaner TagCleaner.
- Trimmomatic Trimmomatic performs trimming for Illumina platforms and works with FASTQ reads (single- or paired-end). Its tasks include cutting adapters, cutting bases at optional positions based on quality thresholds, cropping reads to a specific length and converting quality scores to Phred-33/64.
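As an illustration of the quality-based trimming that several of the tools above perform, the following is a minimal Python sketch of sliding-window trimming on a single read. It assumes Phred+33 encoded qualities. The function names, the window size of 4 and the threshold of 20 are illustrative choices, not the defaults or the implementation of any listed tool.

```python
def phred33_scores(quality_string):
    """Decode a Phred+33 quality string into a list of integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def sliding_window_trim(sequence, quality_string, window=4, min_mean_quality=20):
    """Cut the read at the start of the first window whose mean quality
    drops below min_mean_quality; everything from there on is discarded.

    Reads shorter than the window are returned unchanged (an assumption
    made here for simplicity).
    """
    scores = phred33_scores(quality_string)
    for start in range(len(scores) - window + 1):
        window_scores = scores[start:start + window]
        if sum(window_scores) / window < min_mean_quality:
            return sequence[:start], quality_string[:start]
    return sequence, quality_string


if __name__ == "__main__":
    # 'I' encodes Phred 40, '#' encodes Phred 2
    print(sliding_window_trim("ACGTACGTAC", "IIIIII####"))
```

Cutting at the start of the failing window is a conservative choice: it may discard a few good bases just before the quality drop, in exchange for a very simple rule.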
Further tasks performed before alignment.
- BBMerge BBMerge. Merges paired reads based on overlap to create longer reads, and produces an insert-size histogram. Fast, multithreaded, and yields extremely few false positives. Open-source, written in pure Java; supports all platforms with no recompilation and no other dependencies. Distributed with BBMap.
- DeconRNASeq DeconRNASeq is an R package for deconvolution of heterogeneous tissues based on mRNA-Seq data.
- FastQ Screen FastQ Screen screens FASTQ format sequences against a set of databases to confirm that the sequences contain what is expected (such as species content, adapters, vectors, etc.).
- FLASH FLASH is a read pre-processing tool. FLASH combines paired-end reads which overlap and converts them to single long reads.
- IDCheck IDCheck.
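The read-merging step described for FLASH and BBMerge can be sketched as follows. This is a simplified illustration, assuming that read 2 comes from the opposite strand, that only the longest acceptable overlap is wanted, and that mismatches are judged by a plain fraction; the function names and parameters are hypothetical and do not reflect how either tool scores overlaps.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(read1, read2, min_overlap=10, max_mismatch_fraction=0.1):
    """Merge a read pair into one longer read if the end of read1 overlaps
    the start of the reverse-complemented read2. Returns None otherwise."""
    mate = reverse_complement(read2)
    longest = min(len(read1), len(mate))
    # Try the longest overlap first and accept the first one that fits.
    for overlap in range(longest, min_overlap - 1, -1):
        tail = read1[-overlap:]
        head = mate[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_fraction * overlap:
            return read1 + mate[overlap:]
    return None


if __name__ == "__main__":
    fragment = "ACGTTGCAAGGCTTACCGATAGGCTAACGT"
    r1 = fragment[:20]
    r2 = reverse_complement(fragment[10:])
    print(merge_pair(r1, r2) == fragment)
```

A production merger would also use base qualities to resolve mismatches inside the overlap; this sketch simply keeps the bases of read 1.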
Alignment Tools
After quality control, the first step of RNA-Seq analysis is the alignment of the sequenced reads to a reference genome (if available) or to a transcriptome database. See also Tools for mapping high-throughput sequencing data, List of sequence alignment software and HTS Mappers.
Short (Unspliced) aligners
Short aligners align continuous reads (reads that do not contain gaps resulting from splicing) to a reference genome. There are basically two types: 1) those based on the Burrows-Wheeler transform, such as Bowtie and BWA, and 2) those based on seed-extend methods, using Needleman-Wunsch or Smith-Waterman algorithms. The first group (Bowtie and BWA) is many times faster; however, some tools of the second group, despite the extra time spent, tend to be more sensitive and align more reads correctly. See also a comparative study of short aligners.
- BFAST BFAST aligns short reads to reference sequences and is designed to be sensitive to reads containing errors, SNPs, insertions and deletions. BFAST works with the Smith-Waterman algorithm. See also seqanswers/BFAST.
- Bowtie Bowtie is a fast short aligner using an algorithm based on the Burrows-Wheeler transform and the FM-index. Bowtie tolerates a small number of mismatches. See also seqanswers/Bowtie.
- Burrows-Wheeler Aligner (BWA) BWA implements two algorithms, mainly based on Burrows–Wheeler transform. The first algorithm is used with reads with low error rate (<3%). The second algorithm was designed to handle more errors and implements a combined strategy: Burrows–Wheeler transform and Smith-Waterman method. BWA allows mismatches and small gaps (insertions and deletions). The output is presented in SAM format. See also seqanswers/BWA.
- Short Oligonucleotide Analysis Package (SOAP) SOAP.
- GNUMAP GNUMAP performs alignment using a probabilistic Needleman-Wunsch algorithm. This tool is able to handle alignment in repetitive regions of a genome without losing information. The program's output was designed for easy visualization with available software.
- Maq Maq first aligns reads to reference sequences and then performs a consensus stage. In the first stage it performs only ungapped alignment and tolerates up to 3 mismatches. See also seqanswers/Maq.
- Mosaik Mosaik. Mosaik is able to align reads containing short gaps using the Smith-Waterman algorithm, which makes it well suited to handling SNPs, insertions and deletions. See also seqanswers/Mosaik.
- NovoAlign (commercial) NovoAlign is a short aligner for the Illumina platform based on the Needleman-Wunsch algorithm. NovoAlign tolerates up to 8 mismatches per read and up to 7 bp of indels. It can handle bisulphite data. Output is in SAM format. See also seqanswers/NovoAlign.
- PerM PerM is a software package which was designed to perform highly efficient genome scale alignments for hundreds of millions of short reads produced by the ABI SOLiD and Illumina sequencing platforms. PerM is capable of providing full sensitivity for alignments within 4 mismatches for 50bp SOLID reads and 9 mismatches for 100bp Illumina reads.
- SEAL SEAL uses a MapReduce model to distribute computation across clusters of computers. SEAL uses BWA to perform alignment and Picard MarkDuplicates to detect and remove duplicate reads. See also seqanswers/SEAL.
- segemehl segemehl.
- SHRiMP SHRiMP employs two techniques to align short reads. Firstly, the q-gram filtering technique based on multiple seeds identifies candidate regions. Secondly, these regions are investigated in detail using Smith-Waterman algorithm. See also seqanswers/SHRiMP.
- SMALT Smalt.
- Stampy Stampy combines the sensitivity of hash tables and the speed of BWA. Stampy is designed to align reads containing sequence variation such as insertions and deletions. It can handle reads of up to 4500 bases and presents the output in SAM format. See also seqanswers/Stampy.
- ZOOM (commercial) ZOOM is a short aligner for the Illumina/Solexa 1G platform. ZOOM uses an extended spaced-seed methodology, building hash tables for the reads, and tolerates mismatches, insertions and deletions. See also seqanswers/ZOOM.
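To make the Burrows-Wheeler approach mentioned above more concrete, here is a small Python sketch of exact read matching by backward search over a Burrows-Wheeler transform. It is a teaching example only: it builds the suffix array by naive sorting and stores full occurrence tables, whereas Bowtie and BWA rely on compressed indexes and also tolerate mismatches. All function names are hypothetical.

```python
def bwt_via_suffix_array(text):
    """Return the Burrows-Wheeler transform and suffix array of text + '$'."""
    text += "$"  # sentinel that sorts before A, C, G, T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    return bwt, suffix_array


def find_exact(reference, pattern):
    """Return sorted 0-based positions where pattern occurs in reference."""
    bwt, suffix_array = bwt_via_suffix_array(reference)
    alphabet = sorted(set(bwt))

    # first_row[ch]: number of characters in the text smaller than ch
    first_row, total = {}, 0
    for ch in alphabet:
        first_row[ch] = total
        total += bwt.count(ch)

    # occ[ch][i]: occurrences of ch in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (b == ch)

    # Backward search: extend the match one character at a time,
    # starting from the last character of the pattern.
    lo, hi = 0, len(bwt)
    for ch in reversed(pattern):
        if ch not in first_row:
            return []
        lo = first_row[ch] + occ[ch][lo]
        hi = first_row[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])


if __name__ == "__main__":
    print(find_exact("ACGTACGT", "CGT"))
```

The cost of the search loop depends on the read length rather than on the genome size, which is the main reason this family of aligners is fast.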
Many reads span exon-exon junctions and cannot be aligned directly by short aligners, so dedicated spliced aligners were needed. Some spliced aligners first use a short aligner to align the unspliced/continuous reads (exon-first approach) and then follow a different strategy to align the remaining reads that contain spliced regions: normally these reads are split into smaller segments that are mapped independently. See also Methods to study splicing from high-throughput RNA Sequencing data and Methods to Study RNA-Seq (workflow).
Aligners based on known splice junctions (annotation-guided aligners)
In this case the detection of splice junctions is based on data about known junctions available in databases. This type of tool cannot identify new splice junctions. Some of this data comes from other expression methods such as expressed sequence tags (EST).
- Erange Erange is a tool for alignment and data quantification of mammalian transcriptomes. See also seqanswers/Erange.
- IsoformEx IsoformEx.
- MapAL MapAL.
- OSA OSA.
- PRADA prada is a BWA-based, known-junction-guided RNA-Seq alignment pipeline. It includes a module that identifies gene fusions.
- RNA-MATE RNA-MATE is a computational pipeline for alignment of data from the Applied Biosystems SOLiD system. It supports quality control and trimming of reads. The genome alignments are performed using mapreads, and the splice junctions are identified based on a library of known exon-junction sequences. This tool allows visualization of alignments and tag counting. See also seqanswers/RNA-MATE.
- RUM RUM performs alignment with a pipeline that can handle reads with splice junctions, using Bowtie and BLAT. The workflow starts by aligning reads against a genome and a transcriptome database with Bowtie. Next, unmapped sequences are aligned to the reference genome with BLAT. In the final step all alignments are merged to produce the final alignment. The input files can be in FASTA or FASTQ format. The output is presented in RUM and SAM formats.
- RNASEQR RNASEQR. See also seqanswers/RNASEQR.
- SAMMate SAMMate. See also seqanswers/SAMMate.
- SpliceSeq SpliceSeq.
- X-Mate X-Mate.
De novo Splice Aligners
De novo splice aligners allow the detection of new splice junctions without the need for previously annotated information (some of these tools offer annotation as a supplementary option). See also De novo Splice Aligners.
- ABMapper ABMapper. See also seqanswers/ABMapper.
- BBMap BBMap Uses short kmers to align reads directly to the genome (spanning introns to find novel isoforms) or transcriptome. Highly tolerant of substitution errors and indels, and very fast. Supports output of all SAM tags needed by Cufflinks. No limit to genome size or number of splices per read. Supports Illumina, 454, Sanger, Ion Torrent, PacBio, and Oxford Nanopore reads, paired or single-ended. Does not use any splice-site-finding heuristics optimized for a single taxonomic branch, but rather finds optimally-scoring multi-affine-transform global alignments, and thus is ideal for studying new organisms with no annotation and unknown splice motifs. Open-source, written in pure Java; supports all platforms with no recompilation and no other dependencies. See also SeqAnswers/BBMap.
- ContextMap ContextMap was developed to overcome some limitations of other mapping approaches, such as the resolution of ambiguities. The central idea of this tool is to consider reads in their gene expression context, thereby improving alignment accuracy. ContextMap can be used either as a stand-alone program or on top of mappers that produce a SAM file as output (e.g. TopHat or MapSplice). In stand-alone mode it aligns reads to a genome, to a transcriptome database or to both.
- CRAC CRAC proposes a novel way of analyzing reads that integrates genomic locations and local coverage, and detects candidate mutations, indels, splice or fusion junctions in each single read. Importantly, CRAC improves its predictive performance when supplied with longer reads (e.g. 200 nt) and should fit future needs of read analyses.
- GSNAP GSNAP. See also seqanswers/GSNAP.
- HMMSplicer HMMSplicer can identify canonical and non-canonical splice junctions in short reads. First, unspliced reads are removed with Bowtie. The remaining reads are then divided in half one at a time, each half is seeded against the genome, and the exon borders are determined with a hidden Markov model. A quality score is assigned to each junction, which is useful for detecting false positives. See also seqanswers/HMMSplicer.
- Pass Pass aligns gapped and ungapped reads as well as bisulfite sequencing data. It can filter data before alignment (removal of adapters). Pass uses Needleman-Wunsch and Smith-Waterman algorithms, and performs alignment in 3 stages: scanning positions of seed sequences in the genome, testing the contiguous regions and finally refining the alignment. See also seqanswers/Pass.
- PASSion PASSion.
- PASTA PASTA.
- QPALMA QPALMA predicts splice junctions using machine learning algorithms. The training set is a set of spliced reads with quality information and already known alignments. See also seqanswers/QPALMA.
- SeqSaw SeqSaw.
- SoapSplice SoapSplice.
- SpliceMap SpliceMap. See also seqanswers/SpliceMap.
- SplitSeek SplitSeek. See also seqanswers/SplitSeek.
- SuperSplat SuperSplat was developed to find all types of splice junctions. The algorithm iteratively splits each read into all possible two-chunk combinations and attempts to align each chunk. Output is in "Supersplat" format. See also seqanswers/SuperSplat.
- Subread Subread is a superfast, accurate and scalable read aligner. It uses the seed-and-vote mapping paradigm to determine the mapping location of the read by using its largest mappable region. It automatically decides whether the read should be globally mapped or locally mapped. For RNA-seq data, Subread should be used for the purpose of expression analysis. Subread is very powerful in mapping gDNA-seq reads as well. See also seqanswers/Subread.
- Subjunc Subjunc is a specialized version of Subread. It uses all mappable regions in an RNA-seq read to discover exons and exon-exon junctions. It uses the donor/receptor signals to find the exact splicing locations. Subjunc yields full alignments for every RNA-seq read including exon-spanning reads, in addition to the discovered exon-exon junctions. Subjunc should be used for the purpose of junction detection and genomic variation detection in RNA-seq data. See also seqanswers/Subjunc.
- TrueSight TrueSight.
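The split-read idea used by several of the aligners above (divide a read into two chunks, place each chunk on the genome, and report the gap between them as a candidate intron) can be sketched in Python as follows. This is a simplified illustration that assumes exact matching of both chunks on the forward strand; the function names, the minimum chunk length and the intron size limits are hypothetical, and real tools add seeding, mismatch tolerance and splice-signal scoring.

```python
def find_all(text, pattern):
    """Return every start position of pattern in text (overlaps allowed)."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


def split_read_candidates(genome, read, min_chunk=6, min_intron=5, max_intron=1000):
    """Try every two-chunk split of the read and report placements where the
    chunks match exactly and are separated by a plausible intron.

    Each candidate is (left_start, intron_start, right_start, intron_length).
    """
    candidates = []
    for split in range(min_chunk, len(read) - min_chunk + 1):
        left, right = read[:split], read[split:]
        for left_start in find_all(genome, left):
            intron_start = left_start + len(left)
            for right_start in find_all(genome, right):
                intron_length = right_start - intron_start
                if min_intron <= intron_length <= max_intron:
                    candidates.append(
                        (left_start, intron_start, right_start, intron_length)
                    )
    return candidates


if __name__ == "__main__":
    exon1 = "ACGTTGCAAGGC"
    intron = "GTAAGTCCCCCCTTTCAG"
    exon2 = "TTACCGATAGGC"
    genome = "TTTT" + exon1 + intron + exon2 + "AAAA"
    read = exon1[-8:] + exon2[:8]  # read spanning the exon-exon junction
    print(split_read_candidates(genome, read))
```

When the bases flanking the intron are identical on both sides, several split points give equally good placements; real aligners break such ties with donor and acceptor signals.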
De novo Splice Aligners that also use annotation optionally
- MapNext MapNext. See also seqanswers/MapNext.
- OLego OLego. See also seqanswers/OLego.
- STAR STAR is an ultrafast tool that employs “sequential maximum mappable seed search in uncompressed suffix arrays followed by seed clustering and stitching procedure”, and detects canonical and non-canonical splice junctions as well as chimeric (fusion) sequences. It is already adapted to align long reads (third-generation sequencing technologies) and can reach speeds of 45 million paired reads per hour per processor. See also seqanswers/STAR.
- TopHat TopHat is designed to find de novo junctions. TopHat aligns reads in two steps. First, unspliced reads are aligned with Bowtie, and the aligned reads are then assembled with Maq, producing islands of sequence. Second, the splice junctions are determined from the initially unmapped reads and the possible canonical donor and acceptor sites within the island sequences. See also seqanswers/TopHat.
Other Spliced Aligners
- G.Mo.R-Se G.Mo.R-Se is a method that uses RNA-Seq reads to build de novo gene models.
Quantitative analysis and Differential Expression
These tools calculate the abundance of each gene expressed in an RNA-Seq sample (see also Quantification models). Some are also designed to study the variability of gene expression between samples (differential expression). Quantitative and differential studies depend heavily on the quality of read alignment and the accuracy of isoform reconstruction. See also a comparative study of differential expression methods and Which method should you use for normalization of rna-seq data?.
- ALDEx2 ALDEx2 is a tool for comparative analysis of high-throughput sequencing data. ALDEx2 uses compositional data analysis and can be applied to RNAseq, 16S rRNA gene sequencing, metagenomic sequencing, and selective growth experiments.
- Alexa-Seq Alexa-Seq is a pipeline for gene expression analysis, transcript-specific expression analysis, exon junction expression and quantitative alternative analysis. It offers extensive visualization of alternative expression, statistics and graphs. See also seqanswers/Alexa-Seq.
- ASC ASC. See also seqanswers/ASC.
- Ballgown Ballgown.
- BaySeq BaySeq is a Bioconductor package to identify differential expression using next-generation sequencing data, via empirical Bayesian methods. There is an option of using the "snow" package for parallelisation of computer data processing, recommended when dealing with large data sets. See also seqanswers/BaySeq.
- BBSeq BBSeq. See also seqanswers/BBSeq.
- BitSeq BitSeq.
- CEDER CEDER.
- CPTRA CPTRA.
- casper casper is a Bioconductor package to quantify expression at the isoform level. It combines informative data summaries, flexible estimation of experimental biases and statistical precision considerations, which (reportedly) provide substantial reductions in estimation error.
- Cufflinks/Cuffdiff Cufflinks is appropriate for measuring global de novo transcript isoform expression. It performs assembly of transcripts and estimation of abundances, and determines differential expression (Cuffdiff) and regulation in RNA-Seq samples. See also seqanswers/Cufflinks.
- DESeq DESeq is a Bioconductor package to perform differential gene expression analysis based on negative binomial distribution. See also seqanswers/DESeq.
- DEGSeq DEGSeq. See also seqanswers/DEGSeq.
- DEXSeq DEXSeq is a Bioconductor package that finds differential exon usage between samples based on RNA-Seq exon counts. DEXSeq employs the negative binomial distribution and provides options for visualization and exploration of the results.
- DEXUS dexus is a Bioconductor package that identifies differentially expressed genes in RNA-Seq data under all possible study designs such as studies without replicates, without sample groups, and with unknown conditions. In contrast to other methods, DEXUS does not need replicates to detect differentially expressed transcripts, since the replicates (or conditions) are estimated by the EM method for each transcript.
- DiffSplice DiffSplice is a method for differential expression detection and visualization that does not depend on gene annotations. It is based on the identification of alternative splicing modules (ASMs) that diverge in the different isoforms. A non-parametric test is applied to each ASM to identify significant differential transcription with a measured false discovery rate.
- EBSeq EBSeq is a Bioconductor package for identifying genes and isoforms differentially expressed (DE) across two or more biological conditions in an RNA-seq experiment. It also can be used to identify DE contigs after performing de novo transcriptome assembly. While performing DE analysis on isoforms or contigs, different isoform/contig groups have varying estimation uncertainties. EBSeq models the varying uncertainties using an empirical Bayes model with different priors.
- EdgeR EdgeR is an R package for differential expression analysis of data from sequencing methods such as RNA-Seq, SAGE or ChIP-Seq. edgeR employs statistical methods based on the negative binomial distribution as a model for count variability. See also seqanswers/EdgeR.
- ESAT ESAT The End Sequence Analysis Toolkit (ESAT) is designed specifically for the quantification of annotation in specialized RNA-Seq gene libraries that target the 5' or 3' ends of transcripts.
- eXpress eXpress performs transcript-level RNA-Seq quantification and allele-specific and haplotype analysis, and can estimate the abundances of the multiple isoforms present in a gene. Although it can be coupled directly with aligners (like Bowtie), eXpress can also be used with de novo assemblers, so a reference genome is not needed to perform alignment. It runs on Linux, Mac and Windows.
- ERANGE ERANGE performs alignment, normalization and quantification of expressed genes. See also seqanswers/ERANGE.
- featureCounts featureCounts is an efficient general-purpose read quantifier. It is part of the SourceForge Subread package and the Bioconductor Rsubread package.
- FDM FDM
- GPSeq GPSeq
- Kallisto Kallisto. "Kallisto is a program for quantifying abundances of transcripts from RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads. It is based on the novel idea of pseudoalignment for rapidly determining the compatibility of reads with targets, without the need for alignment. On benchmarks with standard RNA-Seq data, kallisto can quantify 30 million human reads in less than 3 minutes on a Mac desktop computer using only the read sequences and a transcriptome index that itself takes less than 10 minutes to build."
- MATS MATS.
- metaseqR metaseqR is a Bioconductor package that detects differentially expressed genes from RNA-Seq data by combining six statistical algorithms using weights estimated from their performance with simulated data estimated from real data, either public or user-based. In this way, metaseqR optimizes the tradeoff between precision and sensitivity. In addition, metaseqR creates a detailed and interactive report with a variety of diagnostic and exploration plots and auto-generated text.
- MMSEQ MMSEQ is a pipeline for estimating isoform expression and allelic imbalance in diploid organisms based on RNA-Seq. The pipeline employs tools like Bowtie, TopHat, ArrayExpressHTS and SAMtools, and uses edgeR or DESeq for differential expression. See also seqanswers/MMSEQ.
- Myrna Myrna is a pipeline tool for estimating differential gene expression in RNA-Seq datasets that runs in a cloud environment (Elastic MapReduce) or on a single computer. Bowtie is employed for short read alignment and R algorithms for interval calculations, normalization, and statistical processing. See also seqanswers/Myrna.
- NEUMA NEUMA is a tool to estimate RNA abundances using length normalization, based on uniquely aligned reads and mRNA isoform models. NEUMA uses known transcriptome data available in databases like RefSeq.
- NOISeq NOISeq. See also seqanswers/NOISeq.
- NPEBseq NPEBseq is a nonparametric empirical Bayesian-based method for differential expression analysis.
- NSMAP NSMAP allows inference of isoforms as well as estimation of expression levels, without annotated information. The exons are aligned and splice junctions are identified using TopHat. All the possible isoforms are computed by combination of the detected exons.
- Qlucore Qlucore. Easy to use for analysis and visualization. One-button import of BAM files.
- RNAeXpress RNAeXpress can be run with a Java GUI or from the command line on Mac, Windows and Linux. It can be configured to perform read counting, feature detection or GTF comparison on mapped RNA-Seq data.
- Rcount Rcount.
- rDiff rDiff is a tool that can detect differential RNA processing (e.g. alternative splicing, polyadenylation or ribosome occupancy).
- RNA-Skim RNA-Skim
- rSeq rSeq
- RSEM RSEM. See also seqanswers/RSEM.
- rQuant rQuant is a web service (Galaxy installation) that determines abundances of transcripts per gene locus, based on quadratic programming. rQuant is able to evaluate biases introduced by experimental conditions. A combination of tools is employed: PALMapper (read alignment), mTiM and mGene (inference of new transcripts).
- Salmon Salmon is a software tool for computing transcript abundance from RNA-seq data using either an alignment-free (based directly on the raw reads) or an alignment-based (based on pre-computed alignments) approach. It uses an online stochastic optimization approach to maximize the likelihood of the transcript abundances under the observed data. The software itself is capable of making use of many threads to produce accurate quantification estimates quickly. It is part of the Sailfish suite of software, and is the successor to the Sailfish tool.
- SAJR SAJR is a read counter written in Java and an R package for differential splicing analysis. It uses junction reads to estimate exon exclusion and reads mapped within an exon to estimate its inclusion. SAJR models this with a GLM with a quasibinomial distribution and uses a log-likelihood test to assess significance.
- Scotty Scotty Performs power analysis to estimate the number of replicates and depth of sequencing required to call differential expression.
- Seal Seal Ultrafast, alignment-free algorithm to quantify sequence expression by matching kmers between raw reads and a reference transcriptome. Handles paired reads and alternate isoforms, and uses little memory. Accepts all common read formats, and outputs read counts, coverage, and FPKM values per reference sequence. Open-source, written in pure Java; supports all platforms with no recompilation and no other dependencies. Distributed with BBMap. (Seal - Sequence Expression AnaLyzer - is unrelated to the SEAL distributed short-read aligner.)
- SplicingCompass SplicingCompass.
- WemIQ WemIQ.
- DEB DEB is a web interface/pipeline that makes it possible to compare the significantly expressed genes reported by different tools. Three algorithms are currently available: edgeR, DESeq and BaySeq.
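Several entries above report length-normalized abundance values such as RPKM or FPKM. As a rough illustration, the Python sketch below computes RPKM (reads per kilobase of transcript per million mapped reads) from raw counts, together with TPM (transcripts per million), a commonly used related measure that is not named in this list. The gene names, counts and lengths are made-up example values, and the functions are not taken from any listed tool.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total_reads = sum(counts.values())
    if total_reads == 0:
        raise ValueError("no mapped reads")
    return {
        gene: counts[gene] * 1e9 / (total_reads * lengths_bp[gene])
        for gene in counts
    }


def tpm(counts, lengths_bp):
    """Transcripts per million: normalize by length first, then rescale
    so that the values of one sample sum to one million."""
    rates = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    rate_sum = sum(rates.values())
    if rate_sum == 0:
        raise ValueError("no mapped reads")
    return {gene: rates[gene] * 1e6 / rate_sum for gene in counts}


if __name__ == "__main__":
    counts = {"geneA": 100, "geneB": 300}       # hypothetical read counts
    lengths = {"geneA": 1000, "geneB": 3000}    # hypothetical lengths in bp
    print(rpkm(counts, lengths))
    print(tpm(counts, lengths))
```

In this example geneB has three times the reads of geneA but is also three times as long, so both measures assign the two genes the same abundance.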
Workbench (analysis pipeline / integrated solutions)
- ActiveSite by Cofactor Genomics ActiveSite
- Avadis NGS Avadis NGS.
- BaseSpace by Illumina BaseSpace
- CLC Genomics Workbench CLC Genomics Workbench
- DNASTAR DNASTAR
- ERGO ERGO
- Genedata Genedata
- GeneSpring GX GeneSpring GX
- Genevestigator Genevestigator
- geospiza geospiza
- Golden Helix Golden Helix
- Maverix Biomics Maverix Biomics
- NextGENe NextGENe
- OmicsOffice OmicsOffice
- Partek Partek
Open (free) Source Solutions
- ArrayExpressHTS ArrayExpressHTS (and ebi_ArrayExpressHTS) is a BioConductor package that allows preprocessing, quality assessment and estimation of expression of RNA-Seq datasets. It can be run remotely at the European Bioinformatics Institute cloud or locally. The package makes use of several tools: ShortRead (quality control), Bowtie, TopHat or BWA (alignment to a reference genome), SAMtools format, Cufflinks or MMSEQ (expression estimation). See also seqanswers/ArrayExpressHTS.
- Chipster Chipster.
- easyRNASeq easyRNASeq.
- ExpressionPlot ExpressionPlot.
- FX FX.
- Galaxy: Galaxy is a general purpose workbench platform for computational biology. There are several publicly accessible Galaxy servers that support RNA-Seq tools and workflows, including NBIC's Andromeda, the CBIIT-Giga server, the Galaxy Project's public server, the GeneNetwork Galaxy server, the University of Oslo's Genomic Hyperbrowser, URGI's server (which supports S-MART), and many others.
- GENE-Counter GENE-Counter is a Perl pipeline for RNA-Seq differential gene expression analyses. GENE-Counter performs alignments with CASHX, Bowtie, BWA or any other aligner with SAM output. Differential gene expression is run with three optional packages (NBPSeq, edgeR and DESeq) using negative binomial distribution methods. Results are stored in a MySQL database to allow additional analyses.
- GenePattern GenePattern offers integrated solutions to RNA-Seq analysis (Broad Institute).
- GeneProf GeneProf: Freely accessible, easy to use analysis pipelines for RNA-seq and ChIP-seq experiments.
- GT-FAR GT-FAR is an RNA-Seq pipeline that performs QC, alignment, reference-free quantification, splice variant calling and variant calling. It filters, trims, and sequentially aligns reads to gene models, predicts and validates new splice junctions, and then quantifies expression for each gene, exon, and known/novel splice junction.
- MultiExperiment Viewer (MeV) MeV is suitable to perform analysis, data mining and visualization of large-scale genomic data. The MeV modules include a variety of algorithms to execute tasks like Clustering and Classification, Student's t-test, Gene Set Enrichment Analysis or Significance Analysis. MeV runs on Java. See also seqanswers/MeV.
- NGS-Trex NGS-Trex.
- NGSUtils NGSUtils.
- RobiNA RobiNA provides a graphical user interface for working with R/BioConductor packages. RobiNA provides a package that automatically installs all required external tools (R/Bioconductor frameworks and Bowtie). This tool offers a variety of quality control methods and can produce many tables and plots giving detailed results for differential expression. Furthermore, the results can be visualized and manipulated with MapMan and PageMan. RobiNA runs on Java version 6.
- RseqFlow RseqFlow is an RNA-Seq analysis pipeline which offers an express implementation of analysis steps for RNA sequencing datasets. It can perform pre and post mapping quality control (QC) for sequencing data, calculate expression levels for uniquely mapped reads, identify differentially expressed genes, and convert file formats for ease of visualization.
- S-MART S-MART handles mapped RNA-Seq data, and mainly performs data manipulation (selection/exclusion of reads, clustering and differential expression analysis) and visualization (read information, distribution, comparison with epigenomic ChIP-Seq data). It can be run on any laptop by a person without a computing background. A friendly graphical user interface makes the tools easy to operate. See also seqanswers/S-MART.
- Taverna Taverna.
- TCW TCW. TCW is a Transcriptome Computational Workbench.
- wapRNA wapRNA.
Alternative Splicing Analysis
- Alt Event Finder Alt Event Finder.
- Asprofile asprofile.
- AStalavista AStalavista.
- Cufflinks/Cuffdiff Cufflinks.
- DEXseq DEXseq.
- MISO MISO quantifies the expression level of splice variants from RNA-Seq data and is able to recognize differentially regulated exons/isoforms across different samples. MISO uses a probabilistic method (Bayesian inference) to estimate the probability that a read originated from a given isoform. See also seqanswers/MISO.
- RSVP RSVP. RSVP is a software package for prediction of alternative isoforms of protein-coding genes, based on both genomic DNA evidence and aligned RNA-seq reads. The method is based on the use of ORF graphs, which are more general than the splice graphs used in traditional transcript assembly.
- SAJR SAJR calculates the number of reads that confirm the inclusion or exclusion of a segment (the part of a gene between two nearest splice sites) and then models these counts with a GLM with a quasibinomial distribution to account for biological variability.
- SplAdder SplAdder. Identification, quantification and testing of alternative splicing events from RNA-Seq data.
- SplicePlot SplicePlot.
- SpliceR SpliceR.
- SpliceSEQ SpliceSeq.
- SpliceTrap SpliceTrap is a statistical tool for the quantification of exon inclusion ratios from RNA-seq data. See also sourceforge to download and SpliceTrap paper.
- SUPPA Suppa.
- SwitchSeq SwitchSeq identifies extreme changes in splicing (switch events).
- Vast-tools vast-tools. A toolset for profiling alternative splicing events in RNA-Seq data.
Bias Correction
- EDASeq EDASeq is a Bioconductor package to perform GC-Content Normalization for RNA-Seq Data.
- GeneScissors GeneScissors.
- SysCall SysCall is a classifier tool for the identification and correction of systematic errors in high-throughput sequence data.
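To give a feel for what GC-content normalization means, here is a minimal Python sketch. It is not EDASeq's implementation: it assumes a simple median-scaling scheme in which genes are grouped into bins of similar GC content and each bin is rescaled to the sample-wide median. The function names (`gc_fraction`, `gc_normalize`), the bin count and the toy counts are all hypothetical.

```python
from statistics import median


def gc_fraction(seq):
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_normalize(counts, gc, n_bins=2):
    """Rescale counts so every GC bin has the same median as the whole sample.

    counts: {gene: read count}, gc: {gene: GC fraction}.
    Assumption: equal-sized bins of genes ordered by GC content.
    """
    overall = median(counts.values())
    ordered = sorted(counts, key=lambda g: gc[g])
    bin_size = -(-len(ordered) // n_bins)  # ceiling division
    normalized = {}
    for i in range(0, len(ordered), bin_size):
        genes = ordered[i:i + bin_size]
        bin_median = median(counts[g] for g in genes)
        # Leave a bin untouched if its median is zero (nothing to scale by).
        factor = overall / bin_median if bin_median > 0 else 1.0
        for g in genes:
            normalized[g] = counts[g] * factor
    return normalized


# Hypothetical toy data: high-GC genes (g3, g4) look over-represented.
counts = {"g1": 10, "g2": 20, "g3": 40, "g4": 80}
gc = {"g1": 0.30, "g2": 0.35, "g3": 0.60, "g4": 0.65}
normalized = gc_normalize(counts, gc, n_bins=2)
```

After scaling, the low-GC and high-GC bins share the same median, so a GC-dependent shift in counts no longer separates the two groups. Real tools fit smoother corrections than two bins.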
Fusion genes/chimeras/translocation finders/structural variations
Genome rearrangements that result from diseases such as cancer can produce aberrant genetic modifications such as fusions or translocations. Identifying these modifications plays an important role in carcinogenesis studies.
- BreakDancer BreakDancer. See also seqanswers/BreakDancer.
- ChimeraScan ChimeraScan.
- DeFuse DeFuse.
- EBARDenovo EBARDenovo.
- FusionAnalyser FusionAnalyser.
- FusionCatcher FusionCatcher.
- FusionHunter FusionHunter identifies fusion transcripts without depending on already known annotations. It uses Bowtie as a first aligner and paired-end reads. See also seqanswers/FusionHunter.
- FusionMap FusionMap.
- FusionSeq FusionSeq. See also seqanswers/FusionSeq.
- PRADA prada.
- SOAPFuse SOAPFuse.
- SOAPfusion Soapfusion.
- TopHat-Fusion TopHat-Fusion is based on TopHat and was developed to handle reads resulting from fusion genes. It does not require previous data about known genes and uses Bowtie to align continuous reads. See also seqanswers/TopHat-Fusion.
- ViralFusionSeq ViralFusionSeq is a high-throughput sequencing (HTS) tool for discovering viral integration events and reconstructing fusion transcripts at single-base resolution. See also hkbic/VFS and SEQWiki/VFS.
Copy Number Variation identification
- CNVseq CNVseq detects copy number variations using a statistical model derived from array comparative genomic hybridization. Sequence alignments are performed by BLAT, calculations are executed by R modules, and the workflow is fully automated using Perl. See also seqanswers/CNVseq.
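As an illustration of the general idea behind read-depth copy number detection, the following Python sketch compares read counts per genomic window between a test and a reference sample and reports a log2 ratio. It is only a sketch under stated assumptions: fixed, non-overlapping windows and a simple library-size scaling are choices made here, not a description of CNVseq's statistical model, and the function names and toy positions are hypothetical.

```python
import math
from collections import Counter


def window_counts(read_starts, window_size):
    """Count reads per fixed-size window, keyed by window index."""
    return Counter(pos // window_size for pos in read_starts)


def log2_ratios(test_starts, ref_starts, window_size):
    """Return {window index: log2(test/reference)} for windows covered in both."""
    test = window_counts(test_starts, window_size)
    ref = window_counts(ref_starts, window_size)
    # Scale by library size so a more deeply sequenced sample
    # does not look amplified everywhere.
    scale = len(ref_starts) / len(test_starts)
    ratios = {}
    for w in sorted(set(test) & set(ref)):
        ratios[w] = math.log2(test[w] * scale / ref[w])
    return ratios


# Hypothetical toy data: mapped read start positions on one chromosome.
ref_starts = [10, 40, 110, 150, 210, 260]
test_starts = [20, 30, 120, 130, 160, 170, 220, 240]
ratios = log2_ratios(test_starts, ref_starts, window_size=100)
```

In this toy input the second window (positions 100 to 199) has the highest ratio, which is the kind of signal a statistical model would then test for significance.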
RNA-Seq simulators
These simulators generate in silico reads and are useful tools to compare and test the efficiency of algorithms developed to handle RNA-Seq data. Moreover, some of them make it possible to analyse and model RNA-Seq protocols. See also Genetic Simulation Resources and some discussion about simulation at Biostars.
- BEERS Simulator BEERS is designed for mouse or human data and for paired-end reads sequenced on the Illumina platform. BEERS generates reads starting from a pool of gene models drawn from different published annotation sources. Some genes are chosen randomly, after which errors (such as indels, base changes and low-quality tails) are deliberately introduced, followed by the construction of novel splice junctions.
- Flux simulator Flux Simulator implements a computer pipeline simulation to mimic an RNA-Seq experiment. All component steps that influence RNA-Seq are taken into account (reverse transcription, fragmentation, adapter ligation, PCR amplification, gel segregation and sequencing) in the simulation. These steps present experimental attributes that can be measured, and the approximate experimental biases are captured. Flux Simulator allows joining each of these steps as modules to analyse different types of protocols. See also seqanswers/Flux.
- Polyester Polyester.
- RandomReads RandomReads generates synthetic reads from a genome with an Illumina or PacBio error model. The reads may be paired or unpaired, with arbitrary length and insert size, and are output in fasta or fastq. RandomReads has a wide selection of options for mutation rates, with individual settings for substitution, deletion, insertion, and N rates and length distributions, and annotates reads with their original, unmutated genomic start and stop location. RandomReads does not vary expression levels and thus is not designed to simulate RNA-seq experiments, but to test the sensitivity and specificity of RNA-seq aligners with de-novo introns. Includes a tool for grading and generating ROC curves from resultant sam files. Open-source, written in pure Java; supports all platforms with no recompilation and no other dependencies. Distributed with BBMap.
- rlsim rlsim is a software package for simulating RNA-seq library preparation with parameter estimation.
- RSEM Read Simulator rsem-simulate-reads.
- RNASeqReadSimulator RNASeqReadSimulator contains a set of simple, command-line driven Python scripts. It generates random expression levels of transcripts, simulates single or paired-end reads with a specific positional bias pattern, and generates random errors like those from sequencing platforms.
- RNA Seq Simulator RNA Seq Simulator.
- WGsim wgsim.
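The core idea shared by the simulators above, drawing reads from a known sequence and then introducing errors, can be sketched in a few lines of Python. This is a toy illustration only and does not reproduce any listed tool: it assumes uniform start positions, single-end reads and substitution errors only, and the function name, parameters and example transcript are hypothetical.

```python
import random

BASES = "ACGT"


def simulate_reads(transcript, n_reads, read_len, error_rate, seed=0):
    """Sample reads uniformly from a transcript and add substitution errors.

    Returns a list of (start, read) tuples so that the true origin of each
    read is known, which is what makes simulated data useful for testing.
    """
    rng = random.Random(seed)  # fixed seed for reproducible output
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(transcript) - read_len + 1)
        read = []
        for base in transcript[start:start + read_len]:
            if rng.random() < error_rate:
                # Substitute with one of the three other bases.
                base = rng.choice([b for b in BASES if b != base])
            read.append(base)
        reads.append((start, "".join(read)))
    return reads


# Hypothetical toy transcript.
transcript = "ATGGCGTACGTTAGCCTAGGATCCGATTACA"
reads = simulate_reads(transcript, n_reads=5, read_len=10, error_rate=0.05, seed=42)
```

Because each read carries its true start position, the output of an aligner run on these reads can be scored directly against the known answer.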
Transcriptome assemblers
The transcriptome is the total population of RNAs expressed in one cell or group of cells, including non-coding and protein-coding RNAs. There are two types of approaches to assemble transcriptomes. Genome-guided methods use a reference genome (if possible a finished, high-quality genome) as a template to align and assemble reads into transcripts. Genome-independent methods do not require a reference genome and are normally used when a genome is not available. In this case reads are assembled directly into transcripts.
Genome-Guided assemblers
- CLASS CLASS.
- Cufflinks Cufflinks. See also seqanswers/Cufflinks.
- iReckon iReckon.
- IsoInfer IsoInfer.
- IsoLasso IsoLasso.
- Flipflop flipflop.
- GIIRA GIIRA.
- MITIE MITIE.
- RNAeXpress RNAeXpress.
- Scripture Scripture. See also seqanswers/Scripture.
- SLIDE SLIDE.
Genome-Independent (de novo) assemblers
- KISSPLICE KISSPLICE.
- SAT-Assembler SAT-Assembler
- SOAPdenovo-Trans SOAPdenovo-trans. See also seqanswers/SOAPdenovo.
- Scaffolding Translation Mapping STM.
- Trans-ABySS Trans-AByss. See also seqanswers/Trans-ABySS.
- T-IDBA T-IDBA.
miRNA prediction
- miRDeep2 miRDeep2.
- MIReNA MIReNA.
- miRExpress miRExpress.
- miR-PREFeR miR-PREFeR.
- miRDeep-P. For plants.
Visualization tools
- Artemis Artemis.
- Apollo Apollo.
- BamView BamView.
- Degust Degust.
- EagleView EagleView.
- GBrowse GBrowse.
- Integrated Genome Browser IGB.
- Integrative Genomics Viewer (IGV) IGV.
- GenomeView genomeview.
- MapView MapView.
- Samscope Samscope.
- Vespa Vespa.
Functional, Network & Pathway Analysis Tools
- GAGE is applicable independent of sample sizes, experimental design, assay platforms, and other types of heterogeneity (paper). This Bioconductor package also provides functions and data for pathway, GO and gene set analysis in general. Tutorials describe both RNA-Seq pathway analysis workflows and microarray analysis workflows. The RNA-Seq workflows cover everything from preparation, read counting, data preprocessing and gene set testing to pathway visualization in about 40 lines of code.
- Gene Set Association Analysis for RNA-Seq (GSAASeq): GSAASeq are computational methods that assess the differential expression of a pathway/gene set between two biological states based on sequence count data.
- Ingenuity Systems (commercial) iReport & IPA: Ingenuity’s IPA and iReport applications enable you to upload, analyze, and visualize RNA-Seq datasets, eliminating the obstacles between data and biological insight. Both IPA and iReport support identification, analysis and interpretation of differentially expressed isoforms between condition and control samples, and support interpretation and assessment of expression changes in the context of biological processes, disease and cellular phenotypes, and molecular interactions. Ingenuity iReport supports the upload of native Cuffdiff file format as well as gene expression lists. IPA supports the upload of gene expression lists.
Further annotation tools for RNA-Seq data
- HLAminer HLAminer is a computational method for identifying HLA alleles directly from whole genome, exome and transcriptome shotgun sequence datasets. HLA allele predictions are derived by targeted assembly of shotgun sequence data and comparison to a database of reference allele sequences. This tool is developed in Perl and is available as a console tool.
- pasa pasa.
- seq2HLA seq2HLA is an annotation tool for obtaining an individual's HLA class I and II type and expression using standard NGS RNA-Seq data in fastq format. It maps RNA-Seq reads against a reference database of HLA alleles using Bowtie, then determines and reports HLA type, confidence score and locus-specific expression level. This tool is developed in Python and R. It is available as a console tool or Galaxy module. See also seqanswers/seq2HLA.
RNA-Seq Databases
- ENA ENA.
- ENCODE ENCODE.
- queryable-rna-seq-database queryable-rna-seq-database.
- RNA-Seq Atlas RNA-Seq Atlas.
- SRA SRA.
Webinars and Presentations
- RNASeq-Blog Presentations
- RNA-Seq Workshop Documentation (UC Davis University)
- VIDEO: Strategies for Identifying Biologically Compelling Genes from Breast Cancer Subtype RNA-Seq Profiles with Accompanying Analysis
- Princeton Workshop
- NGS Leaders
- COFACTOR genomics