Genotype-To-Phenotype Databases: A Holistic Solution (GEN2PHEN) is a European project that aims to develop a web portal, the Knowledge Centre, integrating information from genotype to phenotype in a single place.
Summary and Objectives
The GEN2PHEN project aims to unify human and model organism genetic variation databases towards increasingly holistic views into Genotype-To-Phenotype (G2P) data, and to link this system into other biomedical knowledge sources via genome browser functionality. By its end, the project will establish the technological building-blocks needed for the evolution of today’s diverse G2P databases into a future seamless G2P biomedical knowledge environment. This will consist of a European-centred but globally networked hierarchy of bioinformatics GRID-linked databases, tools and standards, all tied into the Ensembl genome browser. The project has the following specific objectives:
- To analyse the G2P field and thus determine emerging needs and practices
- To develop key standards for the G2P database field
- To create generic database components, services and integration infrastructures for the G2P database domain
- To create search modalities and data presentation solutions for G2P knowledge
- To facilitate the process of populating G2P databases
- To build a major G2P internet portal
- To deploy GEN2PHEN solutions to the community
- To address system durability and long-term financing
- To undertake a whole-system utility and validation pilot study
The GEN2PHEN Consortium members have been selected from a talented pool of European research groups and companies that are interested in the G2P database challenge. Additionally, a few non-EU participants have been included to bring extra capabilities to the initiative. The final constellation is characterised by broad and proven competence, a network of established working relationships, and high-level roles/connections within other significant projects in this domain...
Background and Concept
By providing a complete Homo sapiens ‘parts list’ (the gene sequences) and a powerful ‘toolkit’ (technologies), the Human Genome Project has revolutionised mankind’s ability to explore how genes cause disease and other phenotypes. Studies in this domain are proceeding at a rapid and ever-increasing pace, generating unprecedented amounts of raw and processed data. It is now imperative that the scientific community finds ways to effectively manage and exploit this flood of information for knowledge creation and practical benefit to society. This fundamental goal lies at the heart of the “Genotype-To-Phenotype Databases: A Holistic Solution (GEN2PHEN)” project.
Previous genetics studies have shown that inter-individual genome variation plays a major role in differential normal development and disease processes. However, the details of how these relationships work are far from clear, even in the case of most Mendelian disorders where single genetic alterations are fully penetrant (essentially causative, rather than risk modifying). Background genetic effects (modifier genes), epistasis, somatic variation, and environmental factors all complicate the situation. This is particularly the case in complex, multi-factorial disorders (e.g., cancer, heart disease, diabetes, dementia) that will affect most of us at some stage in our lifetime. Strategies do, however, now exist to study the genetics of these disorders, and such investigations are a major focus of research throughout Europe and beyond. A common thread in these studies is the need to create ever-larger datasets and integrate these more effectively.
Success in deciphering the mechanisms and pathways underpinning genotype-to-phenotype (G2P) relationships will bring about radical new opportunities for predicting, preventing, diagnosing, and treating all forms of illness. It will launch an era of truly effective personalised medicine. Extensive research is therefore being conducted worldwide to characterise genetic variation in normal and disease contexts. Sadly though, the resulting flood of primary information is not yet being managed or utilised as effectively as it should be - due simply to the lack of a sufficiently organised and mature database infrastructure by which the discoveries can be gathered, stored, integrated and queried as a composite whole in the electronic (internet) domain. Furthermore, whilst new positive findings are being handled sub-optimally, ‘negative’ observations are in most cases not even reported in any way, shape, or form – despite the fact that they constitute an essential part of any complete and accurate G2P depiction. This needs to change, and an international ‘Human Variome Project’ (HVP) has emerged to help argue this case.
It is against this backdrop that the GEN2PHEN project aims to become the key European contribution to the challenges listed above, harmonised with similar projects elsewhere, and dovetailed into many related European programmes of work. It will provide an important and timely solution to a current research need that was highlighted by the European Strategy Forum on Research Infrastructures (ESFRI) - Priority area: ‘Upgrade of European Bio-Informatics Infrastructure (Shared platform for data resources in the Life Sciences)’. It will provide European G2P research and biotech industries with the proper support they need in terms of database technologies and data integration systems. Only then can our societies maximally benefit from the current exponentially increasing rate of genetic data generation in disease research and clinical settings.
Future Vision and Current Reality
Looking to the future, one can imagine a world wherein ‘omics’ biomedical sciences are commonplace, even to the point of having one’s genome sequenced in routine medical checkups. In this envisaged world, phenomenally large amounts of G2P data will be produced daily, much of which will flow effortlessly into the internet to be fully absorbed into a sophisticated and powerful ‘biomedical knowledge environment’. Some of this information will be secured for restricted access, whilst much of the raw data and the derived knowledge should be free for everyone to search and exploit.
The system will enable extensive scientific reporting and discussions, provide a core reference platform for medical practice, and open exciting new operational vistas for journals, industry, and funders. It will provide for and underpin activities in biomedical research, biotechnology, drug development, and personalised healthcare. It will probably even affect our basic cultural practices (e.g., insurance, the law, employment policies) as society comes to grips with the immense power and relevance of genetics to the human state. But this envisaged future is nothing like the world we presently live in.
No system yet exists that even begins to approximate a ‘biomedical knowledge environment’ properly able to support G2P data gathering and analysis. There are instead a limited number of unconnected G2P databases that are mostly at rather early stages in their development, with no agreed structured way of effectively modelling phenotype data or G2P relationships, and no convenient mode for passing data from discovery laboratories into the database world. A few recent initiatives are building large databases to host individual-specific genotypes and phenotypes to support some high-throughput disease association studies, but these do not have a global remit, have not engaged with the extensive existing knowledge from Mendelian disorders, and are not focused on all the research and clinical communities around G2P. Most progress has arguably been made with locus-specific databases (LSDBs) that target specific diseases or genes, but the vast majority of the several hundred LSDBs that do exist are rudimentary in design and implementation, and operationally isolated from one another. This all contrasts with the situation for databases concerned with purely genetic data (without phenotype association), of which there are many, including several large data warehouses and genome browsers that act as central repositories and search centres for all the human and model organism genome sequences, variants, and feature annotations yet produced.
There are a number of reasons why the G2P database field is so poorly developed. Problems include the complexity/diversity of the pertinent data elements, the contemporary nature of the challenge, and certain practical/cultural issues. However, perhaps the most critical obstacle is the overwhelming scale of the problem. Whereas the genome is a bounded domain of only ~3,000,000,000 nucleotides and ~25,000 genes (in man), there is essentially no limit to the number of G2P relationships that can be examined, each by multiple different procedures. The former is thus relatively straightforward and can be managed and hosted in one or a few large data depositories (as has been accomplished). In contrast, the latter is too large in scale and scope to handle in this way.
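The contrast in scale can be made concrete with a small piece of arithmetic. The sketch below takes only the approximate figure of 25,000 genes quoted above and counts how many distinct sets of one, two, or three genes could in principle be examined together. The function name `interaction_counts` is hypothetical, and counting gene sets is an illustrative simplification that leaves out phenotypes, variants, and study procedures, all of which would enlarge the space further.

```python
from math import comb

# Approximate human gene count quoted in the text (not an exact figure).
N_GENES = 25_000


def interaction_counts(n_genes: int, max_order: int) -> dict[int, int]:
    """Return {set size: number of distinct gene sets of that size}."""
    return {k: comb(n_genes, k) for k in range(1, max_order + 1)}


if __name__ == "__main__":
    # Print how quickly the number of candidate gene sets grows with set size.
    for order, count in interaction_counts(N_GENES, 3).items():
        print(f"gene sets of size {order}: {count:,}")
```

Under these assumptions there are already more than 300 million gene pairs and more than 2.6 trillion gene triples, before any phenotype or experimental method is taken into account.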
There is virtually no limit to how many G2P data will eventually be created, or to their diversity or purpose. The database solutions for G2P information must therefore be based upon new ways of thinking and organising the field’s development - emphasising standards, integration, federation, and broad community participation from the very outset.
Related Projects and Applications
- GWAS Central
- Leiden Open Variation Database
- Web Analysis of the Variome
- Locus Reference Genomic (LRG)
- Cafe Variome
Partners
- University of Leicester, UK
- European Molecular Biology Laboratory, Germany
- Fundació IMIM, Spain
- Leiden University Medical Center, Netherlands
- Institut National de la Santé et de la Recherche Médicale, France
- Karolinska Institutet, Sweden
- Foundation for Research and Technology – Hellas, Greece
- Commissariat à l’Energie Atomique, France
- Erasmus University Medical Center, Netherlands
- Institute for Molecular Medicine Finland, University of Helsinki, Finland
- University of Aveiro – IEETA, Portugal
- University of Western Cape, South Africa
- Council of Scientific and Industrial Research, India
- Swiss Institute of Bioinformatics, Switzerland
- University of Manchester, UK
- BioBase GmbH, Germany
- deCODE genetics ehf, Iceland
- PhenoSystems SA, Belgium
- Biocomputing Platforms Ltd Oy, Finland
- University of Patras, Greece
- University Medical Center Groningen (UMCG), Netherlands (From March 2012)
- University of Lund (ULUND), Sweden (From March 2012)
- Synapse Research Management Partners, Spain (From March 2012)