Biological network inference
In a topological sense, a network is a set of nodes and a set of directed or undirected edges between the nodes. Many types of biological networks exist, including transcriptional, signalling, and metabolic networks. Few such networks are known in anything approaching their complete structure, even in the simplest bacteria. Still less is known about the parameters governing the behavior of such networks over time, about how the networks at different levels in a cell interact, and about how to predict the complete state description of a eukaryotic cell or bacterial organism at a given point in the future. Systems biology, in this sense, is still in its infancy.
There is great interest in network medicine for the modelling of biological systems. This article focuses on a necessary prerequisite to dynamic modeling of a network: inference of the topology, that is, prediction of the "wiring diagram" of the network. More specifically, it focuses on inference of biological network structure using the growing sets of high-throughput expression data for genes, proteins, and metabolites. Briefly, methods that use high-throughput data to infer regulatory networks search for patterns of partial correlation or conditional probability that indicate causal influence. Such patterns, possibly combined with supplemental data on the genes or proteins in the proposed networks or with other information on the organism, form the basis upon which the algorithms work. These algorithms can be used to infer the topology of any network in which a change in the state of one node can affect the state of other nodes.
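The role of partial correlation can be illustrated with a minimal sketch. The example below assumes a hypothetical three-node chain A → B → C with simulated Gaussian data; the helper names `pearson` and `partial_corr` are chosen for illustration and are not taken from any particular inference package. A and C are strongly correlated, but the correlation largely vanishes once B is controlled for, which suggests that there is no direct A–C edge.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy = pearson(x, y)
    r_xz = pearson(x, z)
    r_yz = pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Simulated data for a hypothetical chain A -> B -> C (not real measurements).
rng = random.Random(0)
a = [rng.gauss(0, 1) for _ in range(2000)]
b = [ai + 0.5 * rng.gauss(0, 1) for ai in a]   # B depends on A
c = [bi + 0.5 * rng.gauss(0, 1) for bi in b]   # C depends only on B

r_ac = pearson(a, c)                 # marginal association between A and C
r_ac_given_b = partial_corr(a, c, b) # association left after removing B

print(f"corr(A, C)     = {r_ac:.2f}")
print(f"corr(A, C | B) = {r_ac_given_b:.2f}")
```

With real expression data the conditioning set is usually larger than one variable, and the linearity and normality assumptions behind this formula would need to be checked.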
Transcriptional regulatory networks
Genes are the nodes and the edges are directed. A gene serves as the source of a direct regulatory edge to a target gene by producing an RNA or protein molecule that functions as a transcriptional activator or inhibitor of the target gene. If the gene is an activator, then it is the source of a positive regulatory connection; if an inhibitor, then it is the source of a negative regulatory connection. Computational algorithms take as primary input measurements of the mRNA expression levels of the genes under consideration for inclusion in the network, and return an estimate of the network topology. Such algorithms are typically based on linearity, independence, or normality assumptions, which must be verified on a case-by-case basis. Clustering or some form of statistical classification is typically employed to perform an initial organization of the high-throughput mRNA expression values derived from microarray experiments, in particular to select sets of genes as candidates for network nodes. The question then arises: how can the clustering or classification results be connected to the underlying biology? Such results can be useful for pattern classification, for example to classify subtypes of cancer or to predict differential responses to a drug (pharmacogenomics). But to understand the relationships between the genes, that is, to define more precisely the influence of each gene on the others, the scientist typically attempts to reconstruct the transcriptional regulatory network. This can be done by data integration in dynamic models supported by background literature or by information in public databases, combined with the clustering results. The modelling can be done with a Boolean network, with ordinary differential equations or linear regression models (e.g. least-angle regression), with a Bayesian network, or with approaches based on information theory.
For instance, the reconstruction can be done by applying a correlation-based inference algorithm, as will be discussed below, an approach that is becoming more successful as the available microarray data sets keep growing in size.
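As a minimal sketch of correlation-based inference, the example below computes the Pearson correlation for every pair of genes and keeps the pairs whose absolute correlation exceeds a threshold. The expression values, the gene names `g1` to `g4`, the threshold of 0.8, and the function `infer_edges` are all hypothetical and chosen only for illustration. The resulting edges are undirected and signed, and the method does not distinguish direct from indirect associations.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression matrix: gene -> mRNA level in six samples.
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "g2": [2.0, 4.0, 6.0, 8.0, 10.0, 12.5],   # rises with g1
    "g3": [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],     # falls as g1 rises
    "g4": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],     # unrelated pattern
}

THRESHOLD = 0.8  # assumed cut-off; in practice this must be tuned


def infer_edges(expr, threshold):
    """Return {(gene_u, gene_v): r} for pairs with |r| >= threshold."""
    edges = {}
    for u, v in combinations(sorted(expr), 2):
        r = pearson(expr[u], expr[v])
        if abs(r) >= threshold:
            edges[(u, v)] = r
    return edges


edges = infer_edges(expression, THRESHOLD)
for (u, v), r in edges.items():
    sign = "+" if r > 0 else "-"
    print(f"{u} -- {v}  ({sign}, r = {r:.2f})")
```

In this toy data set `g2` and `g3` are linked only because both track `g1`; a thresholded correlation network reports that pair as an edge all the same, which is one reason such results are usually treated as candidate edges rather than as a final topology.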
Signal transduction networks (very important in the biology of cancer). Proteins are the nodes, and directed edges represent interactions in which the biochemical conformation of the child is modified by the action of the parent (e.g. through phosphorylation, ubiquitylation, or methylation). Primary input into the inference algorithm would be data from a set of experiments measuring protein activation and inactivation (e.g. phosphorylation and dephosphorylation) across a set of proteins. Inference for such signalling networks is complicated by the fact that total concentrations of signalling proteins fluctuate over time because of transcriptional and translational regulation. Such variation can lead to statistical confounding. Accordingly, more sophisticated statistical techniques must be applied to analyse such datasets.
Metabolite networks. Metabolites are the nodes and the edges are directed. Primary input into an algorithm would be data from a set of experiments measuring metabolite levels.
Protein-protein interaction networks are also under very active study. However, reconstruction of these networks does not use correlation-based inference in the sense discussed for the networks already described (interaction does not necessarily imply a change in protein state), and a description of such interaction network reconstruction is left to other articles.
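The confounding noted above for signal transduction networks can be illustrated with a small simulation. The sketch below assumes two hypothetical proteins, X and Y, that do not signal to each other but whose total abundance is driven by a shared, fluctuating expression level. All quantities are simulated, and dividing by total abundance is only one simple adjustment, assuming total protein is measured alongside the phosphorylated form; it is not a substitute for the more sophisticated techniques mentioned in the text.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(1)
n = 2000

# Shared expression level that scales the total amount of both proteins.
abundance = [rng.uniform(0.2, 3.0) for _ in range(n)]
total_x = list(abundance)
total_y = [1.5 * t for t in abundance]

# Phosphorylated fractions are drawn independently: no X-Y signalling link.
frac_x = [rng.uniform(0.4, 0.6) for _ in range(n)]
frac_y = [rng.uniform(0.4, 0.6) for _ in range(n)]

# Measured phospho-protein levels = total amount * phosphorylated fraction.
phospho_x = [t * f for t, f in zip(total_x, frac_x)]
phospho_y = [t * f for t, f in zip(total_y, frac_y)]

# Raw phospho levels correlate because both follow the shared abundance.
raw_corr = pearson(phospho_x, phospho_y)

# Normalising by total protein recovers the independent fractions.
norm_x = [p / t for p, t in zip(phospho_x, total_x)]
norm_y = [p / t for p, t in zip(phospho_y, total_y)]
norm_corr = pearson(norm_x, norm_y)

print(f"raw phospho correlation:        {raw_corr:.2f}")
print(f"normalised phospho correlation: {norm_corr:.2f}")
```

An inference algorithm run on the raw levels would tend to propose an edge between X and Y, whereas the normalised values give no support for one.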