Bioinformatics workflow management systems
A bioinformatics workflow management system is a specialized form of workflow management system designed to compose and execute a series of computational or data manipulation steps (a workflow) in the domain of bioinformatics.
Many different workflow systems exist. Some were developed more generally as scientific workflow systems for use by scientists from many different disciplines, such as astronomy and earth science. All such systems are based on an abstract representation of how a computation proceeds in the form of a directed graph, where each node represents a task to be executed and each edge represents either data flow or an execution dependency between tasks. Each system typically provides a visual front-end that allows the user to build and modify complex applications with little or no programming expertise.
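The directed-graph representation can be illustrated with a minimal sketch. It is not taken from any of the systems listed in this article; the task names, the `run_workflow` helper and the toy sequence are hypothetical, and the only library used is Python's standard `graphlib` module (Python 3.9 or later). Each node is a task, each edge records which upstream results a task consumes, and tasks run in an order that respects those edges.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, depends_on):
    """Run each task once all of its upstream tasks have finished.

    tasks: maps a task name to a function that takes a dict of upstream results.
    depends_on: maps a task name to the set of task names whose output it consumes.
    """
    results = {}
    # static_order() yields every node after all of its predecessors.
    for name in TopologicalSorter(depends_on).static_order():
        inputs = {dep: results[dep] for dep in depends_on.get(name, ())}
        results[name] = tasks[name](inputs)
    return results


# Hypothetical workflow: load a sequence, run two analyses on it, then report.
tasks = {
    "load": lambda _: "ATGCGC",
    "gc_content": lambda inp: sum(b in "GC" for b in inp["load"]) / len(inp["load"]),
    "length": lambda inp: len(inp["load"]),
    "report": lambda inp: f"length={inp['length']} gc={inp['gc_content']:.2f}",
}
depends_on = {
    "load": set(),
    "gc_content": {"load"},
    "length": {"load"},
    "report": {"gc_content", "length"},
}

results = run_workflow(tasks, depends_on)
print(results["report"])
```

In this sketch the edges carry data flow: a task receives exactly the outputs of the tasks it depends on. A visual front-end of the kind described above would, in effect, let the user draw the `depends_on` graph instead of writing it out.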
Examples
- Anduril is an open source component-based workflow framework for scientific data analysis developed at the University of Helsinki.[1] Anduril provides an execution engine written in Java, a large number of components for bioinformatics analysis, and the AndurilScript language to create and manage workflows.
- BioBike[2] is a biocomputing platform based upon the KnowOS (Knowledge Operating System) e-science technology. Written entirely in Lisp, KnowOS's main distinguishing feature is "through-the-browser" programmability.
- BioExtract harnesses the power of online informatics tools for creating and customizing workflows. Users can query online sequence data, analyze it using an array of informatics tools, create and share custom workflows for repeated analysis, and save the resulting data and workflows in standardized reports.
- BioManager is a bioinformatic data management and analysis workflow developed by the University of Sydney.
- CellProfiler[3] is open source modular image analysis software developed at the Broad Institute. Capable of handling hundreds of thousands of images, it contains advanced algorithms for image analysis of cell-based assays and is optimized for high-throughput work. The software allows the user to construct a pipeline of individual modules; each module performs an image processing step, such as image loading, object identification, or feature extraction.
- Discovery Net[4] (circa 2000) is one of the earliest examples of a scientific workflow system. It won the “Most Innovative Data Intensive Application Award” at the ACM SC02 (Supercomputing 2002) conference and exhibition, based on a demonstration of a fully interactive distributed genome annotation pipeline for a malaria genome case study. It originated from a £2m EPSRC-funded project of the same name at Imperial College London, which investigated the development of an e-Science platform for scientific discovery from high-throughput data sources. Many features of Discovery Net (architecture features, visual front-end, simplified access to remote Web and Grid services, and inclusion of a workflow store) were considered novel at the time and have since found their way into other academic and commercial systems. The workflow system developed in the project was later used as the basis for the commercial products of the Imperial College spin-out company InforSense.
- eHive[5] is a fault-tolerant distributed processing system initially designed to support comparative genomic analysis, based on blackboard systems, network-distributed autonomous agents, dataflow graphs and block-branch diagrams.
- Ergatis[6] is a web-based system used to create, run, and monitor reusable bioinformatics analysis pipelines. It contains pre-built components for common bioinformatics analysis tasks, such as BLAST searches or storing data in a Chado database. These components can be arranged graphically to create highly configurable pipelines.
- Galaxy[7] is an open source workflow system developed at Penn State and Emory University. Galaxy is available as a free public web server[8] and as downloadable software.[9] Galaxy emphasizes ease of use and the sharing and persistence of analyses.
- GenePattern is a genomic analysis platform developed at the Broad Institute of MIT & Harvard that provides access to more than 150 tools for gene expression analysis, proteomics, SNP analysis, RNA-seq, flow cytometry, and common data processing tasks. A web-based interface provides access to these tools and allows the creation of multi-step analysis pipelines that enable reproducible in silico research.
- GeneProf is a web-based, graphical software suite developed at the University of Edinburgh that allows users to build pipelines and analyse data produced using high-throughput sequencing platforms (RNA-seq and ChIP-seq).
- Geodise (Grid Enabled Optimisation and Design Search for Engineering) was developed at the University of Southampton.
- HCDC is an open source workflow system developed at ETH Zurich that focuses on large-scale image-based biological experiments. It includes a large collection of components for multiwell plate handling (96, 384, ...).
- InforSense is a commercial workflow system based on the Discovery Net system providing rapid development of analytical applications, integration of data and services from heterogeneous sources, producing repeatable, auditable analytical processes. It provides domain specific extensions for Bioinformatics, Cheminformatics, Health Informatics and Business Analytics. It also provides features such as Embedded Applications, Portals, Dashboards and Business Rules Engines.
- Kepler enables scientists in a variety of disciplines, such as biology, ecology and astronomy, to compose and execute workflows. Kepler is based on the Ptolemy II system for heterogeneous, concurrent modeling and design. Ptolemy II was developed by the members of the Ptolemy project at the University of California, Berkeley. Although not originally intended for scientific workflows, it provides a mature platform for building and executing workflows, and supports multiple models of computation.
- LONI Pipeline is a Java-based distributed graphical data-analysis environment for constructing, validating, executing and disseminating scientific workflows. As the LONI Pipeline references all data, services and tools as external objects, it directly allows resource interoperability without the need for rebuilding the software.
- Medicel Integrator Workflow is a cluster-enabled bioinformatics workflow design and execution application. It can be used stand-alone or integrated with a biology data warehouse.
- Mobyle is a framework and web portal specifically aimed at the integration of bioinformatics software and databanks. Mobyle is the successor of Pise and the RPBS server, previous systems that provided web environments to define and execute bioinformatics analyses.
- Pegasus is a flexible framework, developed at the Information Sciences Institute at the University of Southern California, that enables the mapping of complex scientific workflows onto the grid.
- Pegasys is software for executing and integrating analyses of biological sequences, developed by the University of British Columbia.
- Pipeline Pilot is Accelrys’ scientific informatics platform. It streamlines data integration and analysis by using a visual programming language (similar to LabVIEW) to build a pipeline that transforms any number of inputs (raw data) into any number of outputs.
- Remora is a web server implemented according to the BioMoby web-service specifications, providing life science researchers with an easy-to-use workflow generator and launcher, a repository of predefined workflows and a survey system.
- RetroGuide is a query framework for querying retrospective bioinformatics data.
- Sight is a web-agent-oriented workflow platform that has historically had extensive means to integrate websites that use ordinary web forms and HTML responses (WSDL is also supported). The system has a GUI-based workflow composer that supports modules with multiple ports and allows a module to access data from modules that stand earlier in the workflow. Sight was developed at Ulm University using Java and is currently released under the GPL.
- Taverna workbench is an open source workflow system that enables scientists (typically, though not exclusively, in bioinformatics) to compose and execute scientific workflows. It has been developed as part of a £5.5m EPSRC project called myGrid based at the University of Manchester. Independently, other researchers have created programming-by-example workflow development tools that are interoperable with Taverna.[10]
- Triana is an open source problem solving environment developed at Cardiff University that combines an intuitive visual interface with powerful data analysis tools.
- Wildfire is a distributed, Grid-enabled workflow construction and execution environment. It has a graphical user interface for constructing and running workflows. Wildfire borrows user interface features from Jemboss and adds a drag-and-drop interface allowing the user to compose EMBOSS (and other) programs into workflows. For execution, Wildfire uses GEL, the underlying workflow execution engine, which can exploit available parallelism on multiple CPU machines including Beowulf-class clusters and Grids.
- UGENE Workflow Designer is an open source visual environment designed for building and executing bioinformatics workflows. The main purpose of the system is to provide a user-friendly GUI for creating computational workflows that can be executed on commodity hardware as well as on high-performance clusters and supercomputers.
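Several entries above (for example eHive, Wildfire with its GEL engine, and UGENE Workflow Designer) are described as running workflows on multi-CPU machines or clusters. The following is a minimal sketch of the general idea of exploiting the parallelism available in a workflow graph: steps with no dependency on each other are submitted at the same time. It is not how GEL or any listed system is implemented; the `run_parallel` helper, the task names and the toy data are hypothetical, and a local thread pool stands in for a cluster scheduler.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_parallel(tasks, depends_on, max_workers=4):
    """Run a workflow graph, executing independent tasks concurrently."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    results = {}
    running = {}  # maps each in-flight future to its task name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every task whose dependencies have all completed.
            for name in sorter.get_ready():
                inputs = {dep: results[dep] for dep in depends_on.get(name, ())}
                running[pool.submit(tasks[name], inputs)] = name
            # Block until at least one in-flight task finishes.
            finished, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in finished:
                name = running.pop(future)
                results[name] = future.result()
                sorter.done(name)  # may unlock downstream tasks
    return results


# Hypothetical workflow: "count_gc" and "lengths" are independent of each
# other, so they can run at the same time once "fetch" has finished.
tasks = {
    "fetch": lambda _: ["ATGC", "GGCC", "ATAT"],
    "count_gc": lambda inp: [sum(b in "GC" for b in s) for s in inp["fetch"]],
    "lengths": lambda inp: [len(s) for s in inp["fetch"]],
    "summary": lambda inp: sum(inp["count_gc"]) / sum(inp["lengths"]),
}
depends_on = {
    "fetch": set(),
    "count_gc": {"fetch"},
    "lengths": {"fetch"},
    "summary": {"count_gc", "lengths"},
}

results = run_parallel(tasks, depends_on)
print(results["summary"])
```

Results are collected only in the coordinating thread, so the worker functions need no locking in this sketch. A production engine would also have to handle task failure, retries and data transfer between machines, which are omitted here.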
Comparisons between workflow systems
With a large number of bioinformatics workflow systems to choose from, it becomes difficult to understand and compare their features. Little work has been conducted on evaluating and comparing the systems from a bioinformatician's perspective, especially when it comes to comparing the data types they can handle, the built-in functionality they provide to the user, or their performance and usability. Examples of existing comparisons include:
- The paper "Scientific workflow systems - can one size fit all?",[11] which provides a high-level framework for comparing workflow systems based on their control flow and data flow properties. The systems compared include Discovery Net, Taverna, Triana and Kepler, as well as YAWL and BPEL.
- The paper "Meta-workflows: pattern-based interoperability between Galaxy and Taverna",[12] which provides a more user-oriented comparison between Taverna and Galaxy in the context of enabling interoperability between the two systems.
References
1. Ovaska, K.; Laakso, M.; Haapa-Paananen, S.; Louhimo, R.; Chen, P.; Aittomäki, V.; Valo, E.; Núñez-Fontarnau, J. et al. (2010). "Large-scale data integration framework provides a comprehensive view on glioblastoma multiforme". Genome Medicine 2 (9): 65. doi:10.1186/gm186. PMID 20822536.
2. Elhai, J.; Taton, A.; Massar, J.; Myers, J. K.; Travers, M.; Casey, J.; Slupesky, M.; Shrager, J. (2009). "BioBIKE: A Web-based, programmable, integrated biological knowledge base". Nucleic Acids Research 37 (Web Server issue): W28–W32. doi:10.1093/nar/gkp354. PMC 2703918. PMID 19433511. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=2703918.
3. Kamentsky, L.; Jones, T. R.; Fraser, A.; Bray, M.-A.; Logan, D. J.; Madden, K. L.; Ljosa, V.; Rueden, C. et al. (2011). "Improved structure, function and compatibility for CellProfiler: Modular high-throughput image analysis software". Bioinformatics 27 (8): 1179–1180. doi:10.1093/bioinformatics/btr095. PMC 3072555. PMID 21349861. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=3072555.
4. Ghanem, M.; Curcin, V.; Wendel, P.; Guo, Y. (2008). "Building and using analytical workflows in Discovery Net". In Dubitzky, Werner (ed.). Data Mining Techniques in Grid Environments. Wiley-Blackwell. pp. 119–140.
5. Severin, J.; Beal, K.; Vilella, A. J.; Fitzgerald, S.; Schuster, M.; Gordon, L.; Ureta-Vidal, A.; Flicek, P.; Herrero, J. (2010). "eHive: An artificial intelligence workflow system for genomic analysis". BMC Bioinformatics 11: 240. PMID 20459813.
6. Orvis, J.; Crabtree, J.; Galens, K.; Gussman, A.; Inman, J. M.; Lee, E.; Nampally, S.; Riley, D. et al. (2010). "Ergatis: A web interface and scalable software system for bioinformatics workflows". Bioinformatics 26 (12): 1488–1492. doi:10.1093/bioinformatics/btq167. PMC 2881353. PMID 20413634. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=2881353.
7. Goecks, J.; Nekrutenko, A.; Taylor, J.; Galaxy Team, T. (2010). "Galaxy: A comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences". Genome Biology 11 (8): R86. doi:10.1186/gb-2010-11-8-r86. PMC 2945788. PMID 20738864. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=2945788.
8. http://usegalaxy.org/
9. http://getgalaxy.org/
10. Hull, Duncan; Wolstencroft, Katy; Stevens, Robert; Goble, Carole A.; Pocock, Matthew R.; Li, Peter; Oinn, Tom (2006). "Taverna: A tool for building and running workflows of services". Nucleic Acids Research 34 (Web Server issue): W729–W732. doi:10.1093/nar/gkl320. PMC 1538887. PMID 16845108. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=1538887.
11. Curcin, V.; Ghanem, M. (2008). "Scientific workflow systems - can one size fit all?". Biomedical Engineering Conference, 2008. CIBEC 2008. IEEE. doi:10.1109/CIBEC.2008.4786077. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4786077
12. Abouelhoda, M.; Ghanem, M.; Alaa, S. (2010). "Meta-workflows: pattern-based interoperability between Galaxy and Taverna". Wands '10 Proceedings of the 1st International Workshop on Workflow Approaches to New Data-centric Science. ACM. doi:10.1145/1833398.1833400.
External links
- Oinn, T.; Greenwood, M.; Addis, M.; Alpdemir, M. N.; Ferris, J.; Glover, K.; Goble, C.; Goderis, A. et al. (2006). "Taverna: Lessons in creating a workflow environment for the life sciences". Concurrency and Computation: Practice and Experience 18 (10): 1067–1100. doi:10.1002/cpe.993. This paper reviews some of the above workflow systems.
- Yu, J.; Buyya, R. (2005). "A taxonomy of scientific workflow systems for grid computing". ACM SIGMOD Record 34 (3): 44. doi:10.1145/1084805.1084814.
- Curcin, V.; Ghanem, M. (2008). "Scientific workflow systems - can one size fit all?". pp. 1–9. doi:10.1109/CIBEC.2008.4786077. A paper in CIBEC'08 comparing multiple workflow systems for bioinformatics applications.