Machine learning in bioinformatics

Revision as of 18:08, 12 May 2021

Machine learning, a subfield of computer science involving the development of algorithms that learn how to make predictions based on data, has a number of emerging applications in the field of bioinformatics. Bioinformatics deals with computational and mathematical approaches for understanding and processing biological data.^[1]

Prior to the emergence of machine learning, bioinformatics algorithms had to be explicitly programmed by hand, which for problems such as protein structure prediction proved extremely difficult.^[2] Machine learning techniques such as deep learning make use of automatic feature learning: based on the dataset alone, the algorithm learns how to combine multiple features of the input data into a more abstract set of features from which to conduct further learning. This multi-layered approach to learning patterns in the input data allows such systems to make quite complex predictions when trained on large datasets. These methods contrast with other computational biology approaches which, while still dealing effectively with large datasets, do not allow the data to be interpreted and analysed by the algorithm alone. In recent years, the size and number of available biological datasets have grown rapidly, enabling bioinformatics researchers to make use of these machine learning systems.^[3] Machine learning has been applied to six biological domains: genomics, proteomics, microarrays, systems biology, evolution, and text mining.^[3]

History

Algorithms

Clustering

Clustering is central to much data-driven bioinformatics research and serves as a powerful computational method. In particular, clustering helps to analyze unstructured and high-dimensional data in the form of sequences, expression profiles, texts and images. Clustering is also used to gain insight into biological processes at the genomic level: for example, clustering gene expression profiles reveals the natural structure inherent in the data and helps in understanding gene functions, cellular processes, cell subtypes and gene regulation. Clustering approaches, including hierarchical, centroid-based, distribution-based and density-based methods and self-organizing maps, have long been studied and used in classical machine learning settings.
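As an illustration of the centroid-based family mentioned above, the following is a minimal k-means sketch in plain Python. The gene names and expression values are hypothetical, and seeding the centroids with the first k profiles is a simplification chosen to keep the example deterministic; it is not how any particular bioinformatics tool is implemented.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each profile joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical expression profiles (rows: genes, columns: conditions).
expression = {
    "geneA": [0.1, 0.2, 5.1, 5.0],
    "geneB": [4.9, 5.2, 0.3, 0.1],
    "geneC": [0.0, 0.3, 4.8, 5.3],
    "geneD": [5.1, 4.8, 0.2, 0.0],
}
genes = list(expression)
labels, centroids = kmeans([expression[g] for g in genes], k=2)
clusters = dict(zip(genes, labels))
```

Genes with similar profiles across conditions (here geneA with geneC, and geneB with geneD) end up sharing a cluster label, which is the kind of grouping used to suggest co-regulated genes.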

Markov model

Convolutional Neural Network

Convolutional neural networks (CNNs) are a class of deep neural network used mostly, but not exclusively, for image processing. CNNs are built on the shared-weight architecture of convolution kernels, or filters, that slide along input features and produce translation-equivariant responses known as feature maps.^[4]^[5]

CNNs are regularized versions of multilayer perceptrons. Multilayer perceptrons are usually fully connected networks, that is, each neuron in one layer is connected to all neurons in the next layer. The "full connectivity" of these networks makes them prone to overfitting. Typical ways of regularization, or preventing overfitting, include penalizing parameters during training (such as weight decay) or trimming connectivity (skipped connections, dropout, etc.). CNNs take a different approach to regularization: they take advantage of the hierarchical pattern in data and assemble patterns of increasing complexity from the smaller and simpler patterns encoded in their filters. Therefore, on a scale of connectivity and complexity, CNNs are on the lower extreme.
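The sliding-filter idea can be sketched for a one-dimensional biological input. The example below one-hot encodes a DNA string and slides a single filter along it to produce a feature map. The filter is set by hand to the motif "TATA" purely for illustration (in a trained CNN the weights would be learned), and the function names are assumptions of this sketch rather than part of any library.

```python
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element indicator rows."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]

def conv1d(encoded, kernel):
    """Valid cross-correlation of a (length x 4) input with a (width x 4) kernel."""
    width = len(kernel)
    feature_map = []
    for start in range(len(encoded) - width + 1):
        window = encoded[start:start + width]
        feature_map.append(sum(
            x * w
            for row, krow in zip(window, kernel)
            for x, w in zip(row, krow)
        ))
    return feature_map

# Hypothetical hand-set filter that responds to the motif "TATA".
tata_filter = one_hot("TATA")

fmap = conv1d(one_hot("GGTATACC"), tata_filter)
shifted = conv1d(one_hot("GGGGTATA"), tata_filter)

# The peak response follows the motif when the input is shifted.
peak = fmap.index(max(fmap))
shifted_peak = shifted.index(max(shifted))
```

Because the same weights are applied at every position, moving the motif two bases to the right moves the peak of the feature map by the same two positions, which is what translation equivariance means in this setting.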

Convolutional networks were inspired by biological processes^[6]^[7]^[8]^[9] in that the connectivity pattern between neurons resembles the organization of the animal visual cortex. Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual field.

CNNs use relatively little pre-processing compared to other image classification algorithms. This means that the network learns to optimize the filters (or kernels) through automated learning, whereas in traditional algorithms these filters are hand-engineered. This independence from prior knowledge and human intervention in feature extraction is a major advantage.

BiG-SLiCE

BiG-SLiCE (Biosynthetic Genes Super-Linear Clustering Engine) is a tool designed to cluster massive numbers of biosynthetic gene clusters (BGCs). By representing BGCs in Euclidean space, BiG-SLiCE can group them into gene cluster families (GCFs) in a non-pairwise, near-linear fashion. Satria et al. (2021) demonstrate the utility of such analyses by using BiG-SLiCE to reconstruct a global map of secondary metabolic diversity across taxonomy and to identify uncharted biosynthetic potential. This opens up new possibilities to accelerate natural product discovery and offers a first step towards constructing a global, searchable, interconnected network of BGCs. As more genomes are sequenced from understudied taxa, more information can be mined to highlight their potentially novel chemistry.

Random Forest

Random forest (RF) is an ensemble learning method that operates by constructing a multitude of decision trees; the output is the class chosen by most trees (classification) or the average prediction of the individual trees (regression).^[10]

The RF algorithm is a modification of bagging that aggregates a large collection of decision trees. It can be used with either a categorical response variable (classification) or a continuous response variable (regression).^[11]^[12]

RFs give an internal estimate of generalization error, so a separate cross-validation step is often unnecessary. They also produce proximities, which can be used to impute missing values and which enable novel visualizations of the data. Random forests have been used successfully for a wide variety of applications and enjoy considerable popularity in several disciplines.^[13]

From a computational standpoint, RFs are appealing because they (i) naturally handle both regression and (multiclass) classification, (ii) are relatively fast to train and to predict, (iii) depend on only one or two tuning parameters, (iv) have a built-in estimate of generalization error, (v) can be used directly for high-dimensional problems, and (vi) can easily be implemented in parallel. Statistically, RFs are appealing because of the additional features they provide, such as (i) measures of variable importance, (ii) differential class weighting, (iii) missing value imputation, (iv) visualization, (v) outlier detection, and (vi) unsupervised learning.^[14]
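The bagging, voting and internal error estimate described above can be sketched with a deliberately simplified forest. In this sketch each "tree" is a one-split decision stump trained on a bootstrap sample and a single randomly chosen feature, which is a much cruder base learner than the full trees of a real random forest. The samples, labels and function names are hypothetical.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature):
    """Best threshold on one feature; predicts the majority label on each side."""
    best = None
    for threshold in sorted({r[feature] for r in rows}):
        left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
        right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
        if not left or not right:
            continue
        l_lab = Counter(left).most_common(1)[0][0]
        r_lab = Counter(right).most_common(1)[0][0]
        errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, l_lab, r_lab)
    if best is None:  # the feature is constant in this bootstrap sample
        majority = Counter(labels).most_common(1)[0][0]
        return (feature, float("inf"), majority, majority)
    return (feature, best[1], best[2], best[3])

def predict_stump(stump, row):
    feature, threshold, l_lab, r_lab = stump
    return l_lab if row[feature] <= threshold else r_lab

def fit_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample (bagging)
        feature = rng.randrange(len(rows[0]))        # random feature choice
        stump = fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature)
        forest.append((stump, set(idx)))             # remember in-bag samples
    return forest

def predict(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(predict_stump(stump, row) for stump, _ in forest)
    return votes.most_common(1)[0][0]

def oob_error(forest, rows, labels):
    """Error estimated only from stumps that never saw the sample (out-of-bag)."""
    wrong = counted = 0
    for i, (row, y) in enumerate(zip(rows, labels)):
        votes = Counter(predict_stump(s, row) for s, used in forest if i not in used)
        if votes:
            counted += 1
            wrong += votes.most_common(1)[0][0] != y
    return wrong / counted if counted else float("nan")

# Hypothetical samples: abundance of two taxa per sample.
rows = [[1, 2], [2, 1], [1, 1], [2, 2], [8, 9], [9, 8], [8, 8], [9, 9]]
labels = ["healthy"] * 4 + ["disease"] * 4
forest = fit_forest(rows, labels)
low_sample = predict(forest, [1, 1])
high_sample = predict(forest, [9, 9])
estimate = oob_error(forest, rows, labels)
```

The out-of-bag estimate is the "internal estimate of generalization error": every sample is scored only by the stumps whose bootstrap sample did not contain it, so no separate hold-out set is needed.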

Applications

Genomics

Genomics involves the study of the genome, the complete DNA sequence, of organisms. While genomic sequence data has historically been sparse due to the technical difficulty in sequencing a piece of DNA, the number of available sequences is growing exponentially.^[15] However, while raw data is becoming increasingly available and accessible, the biological interpretation of this data is occurring at a much slower pace.^[16] Therefore, there is an increasing need for the development of computational genomics tools, among them machine learning systems, that can automatically determine the location of protein-encoding genes within a given DNA sequence.^[16] This is a problem in computational biology known as gene prediction.

Gene prediction is commonly performed through a combination of what are known as extrinsic and intrinsic searches.^[16] For the extrinsic search, the input DNA sequence is run through a large database of sequences whose genes have been previously discovered and their locations annotated. A number of the sequence's genes can be identified by determining which strings of bases within the sequence are homologous to known gene sequences. However, given the limitation in size of the database of known and annotated gene sequences, not all the genes in a given input sequence can be identified through homology alone. Therefore, an intrinsic search is needed where a gene prediction program attempts to identify the remaining genes from the DNA sequence alone.^[16]
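To make the idea of an intrinsic search concrete, the sketch below scans a DNA string for open reading frames (an ATG start codon followed in-frame by a stop codon) using the sequence alone. This is a hand-written rule rather than a learned model, and it is far simpler than real gene predictors; the function name, the `min_codons` parameter and the test sequence are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end) spans of ATG...stop ORFs on the forward strand."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # close it at the first stop
    return orfs

sequence = "CCATGAAATTTTAGGG"
spans = find_orfs(sequence)
candidates = [sequence[s:e] for s, e in spans]
```

Machine learning approaches to gene prediction replace fixed rules of this kind with statistical models of coding and non-coding sequence that are estimated from annotated genomes.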

Machine learning has also been used for the problem of multiple sequence alignment which involves aligning many DNA or amino acid sequences in order to determine regions of similarity that could indicate a shared evolutionary history.^[3] It can also be used to detect and visualize genome rearrangements.^[17]

Proteomics

Proteins, strings of amino acids, gain much of their function from protein folding, in which they conform into a three-dimensional structure. This structure is composed of a number of layers of folding, including the primary structure (i.e. the flat string of amino acids), the secondary structure (alpha helices and beta sheets), the tertiary structure, and the quaternary structure.

Protein secondary structure prediction is a main focus of this subfield, as the further levels of protein folding (tertiary and quaternary structures) are determined based on the secondary structure.^[2] Solving the true structure of a protein is an incredibly expensive and time-intensive process, furthering the need for systems that can accurately predict the structure of a protein by analyzing the amino acid sequence directly.^[2]^[3] Prior to machine learning, researchers needed to conduct this prediction manually. This trend began in 1951 when Pauling and Corey released their work on predicting the hydrogen bond configurations of a protein from a polypeptide chain.^[18] Today, through the use of automatic feature learning, the best machine learning techniques are able to achieve an accuracy of 82–84%.^[2]^[19] The current state of the art in secondary structure prediction uses a system called DeepCNF (deep convolutional neural fields), which relies on the machine learning model of artificial neural networks to achieve an accuracy of approximately 84% when tasked to classify the amino acids of a protein sequence into one of three structural classes (helix, sheet, or coil).^[19] The theoretical limit for three-state protein secondary structure prediction is 88–90%.^[2]
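The accuracy figures quoted above are per-residue three-state accuracies: the fraction of residues whose predicted class (helix, sheet, or coil) matches the observed class. A minimal sketch of that calculation follows, using the common single-letter codes H, E and C; the two example strings are hypothetical and do not come from any real predictor.

```python
def three_state_accuracy(predicted, observed):
    """Fraction of residues whose predicted class matches the observed class."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)

# Hypothetical per-residue labels: H = helix, E = sheet (strand), C = coil.
observed = "HHHHCCEEEE"
predicted = "HHHCCCEEEC"
accuracy = three_state_accuracy(predicted, observed)  # 8 of 10 residues agree
```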

Machine learning has also been applied to proteomics problems such as protein side-chain prediction, protein loop modeling, and protein contact map prediction.^[3]

Metagenomics

Metagenomics is the study of microbial communities from environmental DNA samples. Currently, the implementation of machine learning tools faces many limitations and challenges because of the large amount of data derived from environmental samples. According to Lin and colleagues,^[20] machine learning requires substantial computing power, but the development of fast supercomputers and web servers has made these tools more accessible. A major challenge in characterizing differences in microbiome composition across groups of samples is the high dimensionality of microbiome datasets, which significantly reduces the power of current approaches to identify true differences and increases the chance of false discoveries.^[21]

Despite the importance of ML tools for processing the large amount of information from environmental samples, their development in metagenomics has focused on the study of gut microbiota and its relationship with digestive diseases such as inflammatory bowel disease (IBD), Clostridioides difficile infection (CDI), colorectal cancer and diabetes, with the aim of improving the diagnosis and treatment of those pathologies.^[22]

Many algorithms have been developed to classify microbial communities according to the health condition of the host, regardless of the type of sequence data (e.g. 16S rRNA or whole-genome sequencing (WGS)), using methods such as the least absolute shrinkage and selection operator (LASSO) classifier, random forest, supervised classification models and gradient boosted tree models. More recently, more advanced models based on neural networks have been developed, for example recurrent neural networks (RNNs), convolutional neural networks (CNNs) and Hopfield neural networks.^[23] Fioravanti and colleagues,^[24] for example, developed an algorithm called Ph-CNN (phylogenetic convolutional neural network).

RF methods and their built-in importance measures can help identify microbial species that distinguish diseased from non-diseased samples. However, the performance of each decision tree and the diversity of the trees in the ensemble significantly influence the performance of RF algorithms. The generalization error of an RF depends on how accurate the individual classifiers are and on their interdependence. The high dimensionality of microbiome datasets therefore poses a number of challenges: effective approaches need to consider many possible combinations of variables, and the computational burden grows exponentially with the number of features involved.^[25]

Microbial Communities

Soil contains a great variety of microorganisms, including fungi, bacteria, protozoa, algae and viruses. These microorganisms form microbial communities in all ecosystems. Such communities are possibly among the most diverse and abundant on the planet, and they contribute to soil quality and functionality.

Microarrays

Microarrays, a type of lab-on-a-chip, are used for automatically collecting data about large amounts of biological material. Machine learning can aid in the analysis of this data, and it has been applied to expression pattern identification, classification, and genetic network induction.^[3]

This technology is especially useful for monitoring the expression of genes within a genome, aiding in diagnosing different types of cancer based on which genes are expressed.^[26] One of the main problems in this field is identifying which genes are expressed based on the collected data.^[3] In addition, due to the huge number of genes on which data is collected by the microarray, there is a large amount of irrelevant data to the task of expressed gene identification, further complicating this problem. Machine learning presents a potential solution to this problem as various classification methods can be used to perform this identification. The most commonly used methods are radial basis function networks, deep learning, Bayesian classification, decision trees, and random forest.^[26]

Systems biology

Systems biology focuses on the study of the emergent behaviors from complex interactions of simple biological components in a system. Such components can include molecules such as DNA, RNA, proteins, and metabolites.^[27]

Machine learning has been used to aid in the modelling of these complex interactions in biological systems in domains such as genetic networks, signal transduction networks, and metabolic pathways.^[3] Probabilistic graphical models, a machine learning technique for determining the structure between different variables, are one of the most commonly used methods for modeling genetic networks.^[3] In addition, machine learning has been applied to systems biology problems such as identifying transcription factor binding sites using a technique known as Markov chain optimization.^[3] Genetic algorithms, machine learning techniques which are based on the natural process of evolution, have been used to model genetic networks and regulatory structures.^[3]

Other systems biology applications of machine learning include enzyme function prediction, high-throughput microarray data analysis, analysis of genome-wide association studies to better understand markers of disease, and protein function prediction.^[28]

Evolution

This domain, particularly phylogenetic tree reconstruction, makes use of machine learning techniques. Phylogenetic trees are schematic representations of the evolution of organisms. Initially, they were constructed from features such as morphological and metabolic traits. Later, as large amounts of genome sequence data became available, tree construction algorithms came to be based on genome comparison. With the help of optimization techniques, the comparison is done by means of multiple sequence alignment.^[29]
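One simple way to turn a multiple sequence alignment into input for tree construction is a pairwise distance matrix. The sketch below computes the p-distance (the proportion of aligned, non-gap positions at which two sequences differ) for every pair of sequences. The taxon names and aligned sequences are hypothetical, and real phylogenetic pipelines typically use more elaborate substitution models.

```python
from itertools import combinations

def p_distance(a, b):
    """Proportion of differing sites among positions where neither row has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        raise ValueError("no ungapped positions to compare")
    return sum(x != y for x, y in compared) / len(compared)

# Hypothetical alignment (one row per taxon).
alignment = {
    "taxonA": "ACGTACGT",
    "taxonB": "ACGTACGA",
    "taxonC": "ACCTTCGA",
}

distances = {
    (s, t): p_distance(alignment[s], alignment[t])
    for s, t in combinations(alignment, 2)
}
closest = min(distances, key=distances.get)
```

Distance-based tree-building methods start from a matrix like `distances` and repeatedly join the closest pair, here taxonA and taxonB.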

Stroke Diagnosis

Machine learning methods for the analysis of neuroimaging data are used to help diagnose stroke. Three-dimensional CNN and SVM methods are often used.^[30]

Text mining

The growing number of biological publications has made it increasingly difficult to search through and compile all the relevant information available on a given topic across all sources. This task is known as knowledge extraction. It is necessary for biological data collection, which can in turn be fed into machine learning algorithms to generate new biological knowledge.^[3]^[31] Machine learning can be used for this knowledge extraction task, using techniques such as natural language processing to extract useful information from human-generated reports in a database. Text Nailing, an alternative approach to machine learning that is capable of extracting features from clinical narrative notes, was introduced in 2017.

This technique has been applied to the search for novel drug targets, as this task requires the examination of information stored in biological databases and journals.^[31] Annotations of proteins in protein databases often do not reflect the complete known set of knowledge of each protein, so additional information must be extracted from biomedical literature. Machine learning has been applied to automatic annotation of the function of genes and proteins, determination of the subcellular localization of a protein, analysis of DNA-expression arrays, large-scale protein interaction analysis, and molecule interaction analysis.^[31]

Another application of text mining is the detection and visualization of distinct DNA regions given sufficient reference data.^[32]

Decodification of RiPPs chemical structures

The accelerated increase of RiPPs (Ribosomally synthesized and post-translationally modified peptides) that have been experimentally characterized, together with the availability of information on the sequence and chemical structure of a large number of them, selected from databases such as BAGEL, BACTIBASE, MIBIG and THIOBASE, provide the opportunity to develop machine learning tools to decode the chemical structure of RiPPs and achieve a classification between them.

In 2017, researchers at the National Institute of Immunology in New Delhi, India, developed RiPPMiner,^[33] a bioinformatics resource for decoding RiPP chemical structures by genome mining. The RiPPMiner web server consists of two main components: a query interface and the RiPPDB database. RiPPMiner classifies RiPPs into 12 subclasses and predicts the cleavage site of the leader peptide and the final cross-links of the RiPP chemical structure.

Identification of RiPPs and prediction of RiPP Class

RiPP analysis tools such as antiSMASH and RiPP-PRISM use hidden Markov models^[34] of the modifying enzymes present in RiPP biosynthetic gene clusters to predict the RiPP subclass. Unlike these tools, RiPPMiner uses a machine learning model, trained on 513 RiPPs, that uses only the amino acid sequence encoded by the RiPP gene to identify RiPPs and subsequently predict their subclass. RiPPMiner differentiates RiPPs from other proteins and peptides using a support-vector machine (SVM) model trained on 293 experimentally characterized RiPPs as the positive data set and 8140 genome-encoded non-RiPP polypeptides as the negative data set. The negative data set included SWISS-Prot entries similar in length to RiPPs, e.g. 30S ribosomal proteins, matrix proteins and cytochrome B proteins. The feature vectors of the SVM model consist of amino acid composition and dipeptide frequencies.
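Amino acid composition and dipeptide frequencies can be computed directly from a peptide sequence. The sketch below builds a 420-element vector (20 composition values followed by 400 dipeptide frequencies). The ordering of the features, the function name and the example peptide are assumptions of this sketch and are not taken from the RiPPMiner implementation.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]

def sequence_features(seq):
    """Return 20 amino acid fractions followed by 400 dipeptide fractions."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("need at least two residues")
    # Fraction of the sequence made up by each amino acid.
    composition = [seq.count(a) / len(seq) for a in AMINO_ACIDS]
    # Fraction of overlapping residue pairs matching each dipeptide.
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    dipeptide = [pairs.count(d) / len(pairs) for d in DIPEPTIDES]
    return composition + dipeptide

# Hypothetical toy peptide, chosen only to make the numbers easy to check.
features = sequence_features("ACAC")
ac_index = len(AMINO_ACIDS) + DIPEPTIDES.index("AC")
```

A vector of this kind has the same length for every sequence regardless of peptide length, which is what allows sequences of different sizes to be passed to a classifier such as an SVM.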

Benchmarking of this RiPP identification method on an independent dataset (not included in training) using a two-fold cross-validation approach gave sensitivity, specificity, precision and Matthews correlation coefficient (MCC) values of 0.93, 0.90, 0.90 and 0.85, respectively. This indicates good predictive power of the SVM model for distinguishing between RiPPs and non-RiPPs. For prediction of the RiPP class or subclass, a multi-class SVM was trained using amino acid composition and dipeptide frequencies as feature vectors. During training of the multi-class SVM, available RiPP precursor sequences belonging to a given class (e.g. lasso peptides) were used as the positive set, while RiPPs belonging to all other classes were used as the negative set.
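The four benchmark measures named above are all derived from the counts of true and false positives and negatives. A minimal sketch of the standard formulas follows; the counts in the example are hypothetical and are not the RiPPMiner benchmark data.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, precision and MCC from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # fraction of true positives recovered
    specificity = tn / (tn + fp)   # fraction of true negatives recovered
    precision = tp / (tp + fp)     # fraction of positive calls that are correct
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "mcc": mcc,
    }

# Hypothetical counts for a classifier evaluated on 200 sequences.
metrics = binary_metrics(tp=90, fp=20, tn=80, fn=10)
```

MCC ranges from -1 to 1 and uses all four counts, which makes it more informative than sensitivity or specificity alone when the positive and negative sets differ greatly in size, as they do in the training data described above.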

Prediction of cleavage site

Prediction of cross-links

Other approaches to Big Data analysis

Databases

An important part of bioinformatics is the management of large datasets.

Dedicated databases exist for each type of data, for example:

Bioinformatics analysis for metagenomics

For metagenomics testing

SILVA

SILVA is a database that ...^[35]

Greengenes

Greengenes is a full-length 16S rRNA gene database that provides chimera screening, standard alignment and a curated taxonomy based on de novo tree inference.^[36]^[37]

Overview

1,012,863 RNA sequences from 92,684 organisms contributed to RNAcentral.
The shortest sequence has 1,253 nucleotides, the longest 2,368.
The average length is 1,402 nucleotides.
Database version: 13.5.

NCBI

NCBI is a database ...^[38]

OTT (Open Tree of Life)

OTT

KEGG

KO

References

^ Chicco D (December 2017). "Ten quick tips for machine learning in computational biology". BioData Mining. 10 (35): 35. doi:10.1186/s13040-017-0155-3. PMC 5721660. PMID 29234465.
^ Yang, Yuedong; Gao, Jianzhao; Wang, Jihua; Heffernan, Rhys; Hanson, Jack; Paliwal, Kuldip; Zhou, Yaoqi (May 2018). "Sixty-five years of the long march in protein secondary structure prediction: the final stretch?". Briefings in Bioinformatics. 19 (3): 482–494. doi:10.1093/bib/bbw129. PMC 5952956. PMID 28040746.
^ Larrañaga, Pedro; Calvo, Borja; Santana, Roberto; Bielza, Concha; Galdiano, Josu; Inza, Iñaki; Lozano, José A.; Armañanzas, Rubén; Santafé, Guzmán (March 2006). "Machine learning in bioinformatics". Briefings in Bioinformatics. 7 (1): 86–112. doi:10.1093/bib/bbk007. PMID 16761367.
^ Zhang, Wei (1988). "Shift-invariant pattern recognition neural network and its optical architecture". Proceedings of Annual Conference of the Japan Society of Applied Physics.
^ Zhang, Wei (1990). "Parallel distributed processing model with local space-invariant interconnections and its optical architecture". Applied Optics. 29 (32): 4790–7. Bibcode:1990ApOpt..29.4790Z. doi:10.1364/AO.29.004790. PMID 20577468.
^ Fukushima, K. (2007). "Neocognitron". Scholarpedia. 2 (1): 1717. Bibcode:2007SchpJ...2.1717F. doi:10.4249/scholarpedia.1717.
^ Hubel, D. H.; Wiesel, T. N. (March 1, 1968). "Receptive fields and functional architecture of monkey striate cortex". The Journal of Physiology. 195 (1): 215–243. doi:10.1113/jphysiol.1968.sp008455. ISSN 0022-3751. PMC 1557912. PMID 4966457.
^ Fukushima, Kunihiko (1980). "Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position" (PDF). Biological Cybernetics. 36 (4): 193–202. doi:10.1007/BF00344251. PMID 7370364. S2CID 206775608. Retrieved November 16, 2013.
^ Matusugu, Masakazu; Katsuhiko Mori; Yusuke Mitari; Yuji Kaneda (2003). "Subject independent facial expression recognition with robust face detection using a convolutional neural network" (PDF). Neural Networks. 16 (5): 555–559. doi:10.1016/S0893-6080(03)00115-1. PMID 12850007. Retrieved November 17, 2013.
^ Ho, Tin Kam (1995). "Random Decision Forests". Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, 14–16 August 1995.: 278–282.
^ Dietterich, Thomas (2000). An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization. Kluwer Academic Publishers. pp. 139–157.
^ Breiman, Leo (2001). "Random Forests". Machine Learning. 45 (1). Kluwer Academic Publishers: 5–32.
^ Zhang, C; Ma, Y (2012). Ensemble machine learning: methods and applications. New York: Springer New York Dordrecht Heidelberg London. pp. 157–175. ISBN 978-1-4419-9325-0.
^ Zhang, C; Ma, Y (2012). Ensemble machine learning: methods and applications. New York: Springer New York Dordrecht Heidelberg London. pp. 157–175. ISBN 978-1-4419-9325-0.
^ "GenBank and WGS Statistics". www.ncbi.nlm.nih.gov. Retrieved May 6, 2017.
^ Mathé, Catherine; Sagot, Marie-France; Schiex, Thomas; Rouzé, Pierre (October 1, 2002). "Current methods of gene prediction, their strengths and weaknesses". Nucleic Acids Research. 30 (19): 4103–4117. doi:10.1093/nar/gkf543. ISSN 1362-4962. PMC 140543. PMID 12364589.
^ Pratas, D; Silva, R; Pinho, A; Ferreira, P (May 18, 2015). "An alignment-free method to find and visualise rearrangements between pairs of DNA sequences". Scientific Reports. 5 (10203): 10203. Bibcode:2015NatSR...510203P. doi:10.1038/srep10203. PMC 4434998. PMID 25984837.
^ Pauling, L.; Corey, R. B.; Branson, H. R. (April 1, 1951). "The structure of proteins; two hydrogen-bonded helical configurations of the polypeptide chain". Proceedings of the National Academy of Sciences of the United States of America. 37 (4): 205–211. Bibcode:1951PNAS...37..205P. doi:10.1073/pnas.37.4.205. ISSN 0027-8424. PMC 1063337. PMID 14816373.
^ Wang, Sheng; Peng, Jian; Ma, Jianzhu; Xu, Jinbo (December 1, 2015). "Protein secondary structure prediction using deep convolutional neural fields". Scientific Reports. 6: 18962. arXiv:1512.00843. Bibcode:2016NatSR...618962W. doi:10.1038/srep18962. PMC 4707437. PMID 26752681.
^ Lin, Yufeng; Wang, Guoping; Yu, Jun; Sung, Joseph J Y (2021). "Artificial intelligence and metagenomics in intestinal diseases". Journal of Gastroenterology and Hepatology (36): 841–847. doi:10.1111/jgh.15501.
^ https://www.biorxiv.org/content/10.1101/2020.10.29.361360v1.full
^ Lin, Yufeng; Wang, Guoping; Yu, Jun; Sung, Joseph J Y (2021). "Artificial intelligence and metagenomics in intestinal diseases". Journal of Gastroenterology and Hepatology (36): 841–847. doi:10.1111/jgh.15501.
^ Lin, Yufeng; Wang, Guoping; Yu, Jun; Sung, Joseph J Y (2021). "Artificial intelligence and metagenomics in intestinal diseases". Journal of Gastroenterology and Hepatology (36): 841–847. doi:10.1111/jgh.15501.
^ Fioravanti, Diego; Giarratano, Ylenia; Maggio, Valerio; Agostinelli, Claudio; Chierici, Marco; Giuseppe, Jurman; Furlanello, Cesare (2018). "Phylogenetic convolutional neural networks in metagenomics". BMC Bioinformatics (19): 49. doi:10.1186/s12859-018-2033-5.
^ Dang, Tung; Kishino, Hirohisa (October 30, 2020). "Forward variable selection improves the power of random forest for high-dimensional microbiome data". bioRxiv. doi:10.1101/2020.10.29.361360.
^ Pirooznia, Mehdi; Yang, Jack Y.; Yang, Mary Qu; Deng, Youping (2008). "A comparative study of different machine learning methods on microarray gene expression data". BMC Genomics. 9 (1): S13. doi:10.1186/1471-2164-9-S1-S13. ISSN 1471-2164. PMC 2386055. PMID 18366602.
^ "Machine Learning in Molecular Systems Biology". Frontiers. Retrieved June 9, 2017.
^ d'Alché-Buc, Florence; Wehenkel, Louis (2008). "Machine Learning in Systems Biology". BMC Proceedings. 2 (4): S1. doi:10.1186/1753-6561-2-S4-S1. ISSN 1753-6561. PMC 2654969. PMID 19091048.
^ Bhattacharya, Mrinmoyee (2020). "Unsupervised Techniques in Genomics". In Srinivasa, KG; Siddesh, GM; Manisekhar, SR (eds.). Statistical Modelling and Machine Learning Principles for Bioinformatics Techniques, Tools, and Applications. Springer Singapore. pp. 164–188. ISBN 978-981-15-2445-5.
^ Jiang, Fei (2017). "Artificial intelligence in healthcare: past, present and future". BMJ Journals Stroke and Vascular Neurology. 2 (4): 230–243. doi:10.1136/svn-2017-000101. PMC 5829945. PMID 29507784.
^ Krallinger, Martin; Erhardt, Ramon Alonso-Allende; Valencia, Alfonso (March 15, 2005). "Text-mining approaches in molecular biology and biomedicine". Drug Discovery Today. 10 (6): 439–445. doi:10.1016/S1359-6446(05)03376-3. PMID 15808823.
^ Pratas, D; Hosseini, M; Silva, R; Pinho, A; Ferreira, P (June 20–23, 2017). Visualization of Distinct DNA Regions of the Modern Human Relatively to a Neanderthal Genome. Lecture Notes in Computer Science. Vol. 10255. pp. 235–242. doi:10.1007/978-3-319-58838-4_26. ISBN 978-3-319-58837-7.
^ Agrawal, P; Khater, S; Gupta, M; Sain, N; Mohanty, D (July 3, 2017). "RiPPMiner: a bioinformatics resource for deciphering chemical structures of RiPPs based on prediction of cleavage and cross-links" (PDF). Nucleic acids research. 45 (W1): W80–W88. doi:10.1093/nar/gkx408. PMID 28499008.
^ Medema, MH; Blin, K; Cimermancic, P; de Jager, V; Zakrzewski, P; Fischbach, MA; Weber, T; Takano, E; Breitling, R (July 2011). "antiSMASH: rapid identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genome sequences" (PDF). Nucleic acids research. 39 (Web Server issue): W339-46. doi:10.1093/nar/gkr466. PMID 21672958.
^ Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, Peplies J, Glöckner FO (2013). "The SILVA ribosomal RNA gene database project: improved data processing and web-based tools". Nucleic Acids Res (41): D590-6. doi:10.1093/nar/gks1219. PMC 3531112. PMID 23193283.
^ DeSantis, T. Z.; Hugenholtz, P.; Larsen, N.; Rojas, M.; Brodie, E. L.; Keller, K.; Huber, T.; Dalevi, D.; Hu, P.; Andersen, G. L. (July 2006). "Greengenes, a Chimera-Checked 16S rRNA Gene Database and Workbench Compatible with ARB". Applied and Environmental Microbiology. 72 (7): 5069–5072. doi:10.1128/AEM.03006-05.
^ McDonald, Daniel; Price, Morgan N.; Goodrich, Julia; Nawrocki, Eric P.; DeSantis, Todd Z.; Probst, Alexander; Andersen, Gary L.; Knight, Rob; Hugenholtz, Philip (March 2012). "An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea". The ISME Journal. 6 (3): 610–618. doi:10.1038/ismej.2011.139.
^ NCBI RC (2018). "Database resources of the National Center for Biotechnology Information". Nucleic Acids Research. 46 (Database): D8–D13. doi:10.1093/nar/gkx1095. PMC 5753372. PMID 29140470.

[1] Chicco D (December 2017). "Ten quick tips for machine learning in computational biology". BioData Mining. 10 (35): 35. doi:10.1186/s13040-017-0155-3. PMC 5721660. PMID 29234465.{{cite journal}}: CS1 maint: unflagged free DOI (link)

[:2-2] Yang, Yuedong; Gao, Jianzhao; Wang, Jihua; Heffernan, Rhys; Hanson, Jack; Paliwal, Kuldip; Zhou, Yaoqi (May 2018). "Sixty-five years of the long march in protein secondary structure prediction: the final stretch?". Briefings in Bioinformatics. 19 (3): 482–494. doi:10.1093/bib/bbw129. PMC 5952956. PMID 28040746.

[:0-3] ^ ^a ^b ^c ^d ^e ^f ^g ^h ⁱ ^j ^k ^l Larrañaga, Pedro; Calvo, Borja; Santana, Roberto; Bielza, Concha; Galdiano, Josu; Inza, Iñaki; Lozano, José A.; Armañanzas, Rubén; Santafé, Guzmán (March 2006). "Machine learning in bioinformatics". Briefings in Bioinformatics. 7 (1): 86–112. doi:10.1093/bib/bbk007. PMID 16761367.

[:CNN-0-4] Zhang, Wei (1988). "Shift-invariant pattern recognition neural network and its optical architecture". Proceedings of Annual Conference of the Japan Society of Applied Physics.

[:CNN-1-5] Zhang, Wei (1990). "Parallel distributed processing model with local space-invariant interconnections and its optical architecture". Applied Optics. 29 (32): 4790–7. Bibcode:1990ApOpt..29.4790Z. doi:10.1364/AO.29.004790. PMID 20577468.

[fukuneoscholar-6] Fukushima, K. (2007). "Neocognitron". Scholarpedia. 2 (1): 1717. Bibcode:2007SchpJ...2.1717F. doi:10.4249/scholarpedia.1717.

[hubelwiesel1968-7] Hubel, D. H.; Wiesel, T. N. (March 1, 1968). "Receptive fields and functional architecture of monkey striate cortex". The Journal of Physiology. 195 (1): 215–243. doi:10.1113/jphysiol.1968.sp008455. ISSN 0022-3751. PMC 1557912. PMID 4966457.

[intro-8] Fukushima, Kunihiko (1980). "Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position" (PDF). Biological Cybernetics. 36 (4): 193–202. doi:10.1007/BF00344251. PMID 7370364. S2CID 206775608. Retrieved November 16, 2013.

[robust_face_detection-9] Matusugu, Masakazu; Katsuhiko Mori; Yusuke Mitari; Yuji Kaneda (2003). "Subject independent facial expression recognition with robust face detection using a convolutional neural network" (PDF). Neural Networks. 16 (5): 555–559. doi:10.1016/S0893-6080(03)00115-1. PMID 12850007. Retrieved November 17, 2013.

[10] Ho, Tin Kam (1995). "Random Decision Forests". Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, 14–16 August 1995.: 278–282.

[11] Dietterich, Thomas (2000). An Experimental Comparison of Three Methodsfor Constructing Ensembles of Decision Trees:Bagging, Boosting, and Randomization. Kluwer Academic Publishers. pp. 139–157.

[12] Breiman, Leo (2001). Random Forest (45 ed.). Machine Learning: Kluwer Academic Publisers. pp. 5–32.

[13] Zhang, C; Ma, Y (2012). Ensemble machine learning: methods and applications. New York: Springer New York Dordrecht Heidelberg London. pp. 157–175. ISBN 978-1-4419-9325-0.

[14] Zhang, C; Ma, Y (2012). Ensemble machine learning: methods and applications. New York: Springer New York Dordrecht Heidelberg London. pp. 157–175. ISBN 978-1-4419-9325-0.

[15] "GenBank and WGS Statistics". www.ncbi.nlm.nih.gov. Retrieved May 6, 2017.

[:1-16] Mathé, Catherine; Sagot, Marie-France; Schiex, Thomas; Rouzé, Pierre (October 1, 2002). "Current methods of gene prediction, their strengths and weaknesses". Nucleic Acids Research. 30 (19): 4103–4117. doi:10.1093/nar/gkf543. ISSN 1362-4962. PMC 140543. PMID 12364589.

[rearrang-17] Pratas, D; Silva, R; Pinho, A; Ferreira, P (May 18, 2015). "An alignment-free method to find and visualise rearrangements between pairs of DNA sequences". Scientific Reports. 5 (10203): 10203. Bibcode:2015NatSR...510203P. doi:10.1038/srep10203. PMC 4434998. PMID 25984837.

[18] Pauling, L.; Corey, R. B.; Branson, H. R. (April 1, 1951). "The structure of proteins; two hydrogen-bonded helical configurations of the polypeptide chain". Proceedings of the National Academy of Sciences of the United States of America. 37 (4): 205–211. Bibcode:1951PNAS...37..205P. doi:10.1073/pnas.37.4.205. ISSN 0027-8424. PMC 1063337. PMID 14816373.

[:3-19] Wang, Sheng; Peng, Jian; Ma, Jianzhu; Xu, Jinbo (December 1, 2015). "Protein secondary structure prediction using deep convolutional neural fields". Scientific Reports. 6: 18962. arXiv:1512.00843. Bibcode:2016NatSR...618962W. doi:10.1038/srep18962. PMC 4707437. PMID 26752681.

[20] Lin, Yufeng; Wang, Guoping; Yu, Jun; Sung, Joseph J Y (2021). "Artificial intelligence and metagenomics in intestinal diseases". Journal of Gastroenterology and Hepatology (36): 841–847. doi:10.1111/jgh.15501.

[21] ttps://www.biorxiv.org/content/10.1101/2020.10.29.361360v1.full

[22] Lin, Yufeng; Wang, Guoping; Yu, Jun; Sung, Joseph J Y (2021). "Artificial intelligence and metagenomics in intestinal diseases". Journal of Gastroenterology and Hepatology (36): 841–847. doi:10.1111/jgh.15501.

[23] Lin, Yufeng; Wang, Guoping; Yu, Jun; Sung, Joseph J Y (2021). "Artificial intelligence and metagenomics in intestinal diseases". Journal of Gastroenterology and Hepatology (36): 841–847. doi:10.1111/jgh.15501.

[24] Fioravanti, Diego; Giarratano, Ylenia; Maggio, Valerio; Agostinelli, Claudio; Chierici, Marco; Giuseppe, Jurman; Furlanello, Cesare (2018). "Phylogenetic convolutional neural networks in metagenomics". BMC Bioinformatics (19): 49. doi:10.1186/s12859-018-2033-5.{{cite journal}}: CS1 maint: unflagged free DOI (link)

[25] Dang, Tung; Kishino, Hirohisa (October 30, 2020). "Forward variable selection improves the power of random forest for high-dimensional microbiome data". doi:10.1101/2020.10.29.361360. {{cite journal}}: Cite journal requires |journal= (help)

[:4-26] Pirooznia, Mehdi; Yang, Jack Y.; Yang, Mary Qu; Deng, Youping (2008). "A comparative study of different machine learning methods on microarray gene expression data". BMC Genomics. 9 (1): S13. doi:10.1186/1471-2164-9-S1-S13. ISSN 1471-2164. PMC 2386055. PMID 18366602.{{cite journal}}: CS1 maint: unflagged free DOI (link)

[27] "Machine Learning in Molecular Systems Biology". Frontiers. Retrieved June 9, 2017.

[28] 'Alché-Buc, Florence; Wehenkel, Louis (2008). "Machine Learning in Systems Biology". BMC Proceedings. 2 (4): S1. doi:10.1186/1753-6561-2-S4-S1. ISSN 1753-6561. PMC 2654969. PMID 19091048.{{cite journal}}: CS1 maint: unflagged free DOI (link)

[29] Bhattacharya, Mrinmoyee (2020). "Unsupervised Techniques in Genomics". In Srinivasa, KG; Siddesh, GM; Manisekhar, SR (eds.). Statistical Modelling and Machine Learning Principles for Bioinformatics Techniques, Tools, and Applications. Springer Singapore. pp. 164–188. ISBN 978-981-15-2445-5.

[stroke1-30] Jiang, Fei (2017). "Artificial intelligence in healthcare: past, present and future". BMJ Journals Stroke and Vascular Neurology. 2 (4): 230–243. doi:10.1136/svn-2017-000101. PMC 5829945. PMID 29507784.

[:5-31] Krallinger, Martin; Erhardt, Ramon Alonso-Allende; Valencia, Alfonso (March 15, 2005). "Text-mining approaches in molecular biology and biomedicine". Drug Discovery Today. 10 (6): 439–445. doi:10.1016/S1359-6446(05)03376-3. PMID 15808823.

[sing-32] Pratas, D; Hosseini, M; Silva, R; Pinho, A; Ferreira, P (June 20–23, 2017). Visualization of Distinct DNA Regions of the Modern Human Relatively to a Neanderthal Genome. Lecture Notes in Computer Science. Vol. 10255. pp. 235–242. doi:10.1007/978-3-319-58838-4_26. ISBN 978-3-319-58837-7. {{cite book}}: |journal= ignored (help)

[33] Agrawal, P; Khater, S; Gupta, M; Sain, N; Mohanty, D (July 3, 2017). "RiPPMiner: a bioinformatics resource for deciphering chemical structures of RiPPs based on prediction of cleavage and cross-links" (PDF). Nucleic acids research. 45 (W1): W80–W88. doi:10.1093/nar/gkx408. PMID 28499008.

[34] Medema, MH; Blin, K; Cimermancic, P; de Jager, V; Zakrzewski, P; Fischbach, MA; Weber, T; Takano, E; Breitling, R (July 2011). "antiSMASH: rapid identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genome sequences" (PDF). Nucleic acids research. 39 (Web Server issue): W339-46. doi:10.1093/nar/gkr466. PMID 21672958.

[35] Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, Peplies J, Glöckner FO (2013). "The SILVA ribosomal RNA gene database project: improved data processing and web-based tools". Nucleic Acids Res (41): D590-6. doi:10.1093/nar/gks1219. PMC 3531112. PMID 23193283.{{cite journal}}: CS1 maint: PMC format (link)

[36] DeSantis, T. Z.; Hugenholtz, P.; Larsen, N.; Rojas, M.; Brodie, E. L.; Keller, K.; Huber, T.; Dalevi, D.; Hu, P.; Andersen, G. L. (July 2006). "Greengenes, a Chimera-Checked 16S rRNA Gene Database and Workbench Compatible with ARB". Applied and Environmental Microbiology. 72 (7): 5069–5072. doi:10.1128/AEM.03006-05.

[37] McDonald, Daniel; Price, Morgan N.; Goodrich, Julia; Nawrocki, Eric P.; DeSantis, Todd Z.; Probst, Alexander; Andersen, Gary L.; Knight, Rob; Hugenholtz, Philip (March 2012). "An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea". The ISME Journal. 6 (3): 610–618. doi:https://doi.org/10.1038/ismej.2011.139. {{cite journal}}: Check |doi= value (help); External link in |doi= (help)

[38] NCBI RC (2018). "Database resources of the National Center for Biotechnology Information". Nucleic Acids Research. 46 (Database): D8–D13. doi:10.1093/nar/gkx1095. PMC 5753372. PMID 29140470. {{cite journal}}: Vancouver style error: initials in name 1 (help)CS1 maint: PMC format (link)

@@ Line 32: / Line 32: @@
 === Random Forest ===
-[[Random Forest]] (RF) is a classification method that operates by constructing a multitud of [[decision trees]]that operate as an ensamble, and the output are the class or average prediction of the individual trees.
+[[Random Forest]] (RF) is a classification method that operates by constructing a multitude of [[decision trees]] that operate as an ensemble; the output is the class chosen by the individual trees (classification) or the average of their predictions (regression).<ref>{{cite journal |last1=Ho |first1=Tin Kam |title=Random Decision Forests |journal=Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, 14–16 August 1995. |date=1995 |pages=278–282}}</ref>
-RF algorithm is a modification of bagging that aggregates a large collection of decision trees <ref>{{cite book |last1=Dietterich |first1=Thomas |title=An Experimental Comparison of Three Methodsfor Constructing Ensembles of Decision Trees:Bagging, Boosting, and Randomization |date=2000 |publisher=Kluwer Academic Publishers |pages=139-157}}</ref>.
+The RF algorithm is a modification of [[bagging]] that aggregates a large collection of decision trees and can be used for either a categorical response variable ([[classification]]) or a continuous response ([[regression]]).<ref>{{cite book |last1=Dietterich |first1=Thomas |title=An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization |date=2000 |publisher=Kluwer Academic Publishers |pages=139–157}}</ref><ref>{{cite journal |last1=Breiman |first1=Leo |title=Random Forests |journal=Machine Learning |date=2001 |volume=45 |issue=1 |pages=5–32 |publisher=Kluwer Academic Publishers}}</ref>
+RF gives an internal estimate of generalization error (the out-of-bag error), so a separate cross-validation is often unnecessary. In addition, it produces proximities, which can be used to impute missing values. Proximities can also provide a wealth of information by enabling novel visualizations of the data. Random forests have been successfully used for a wide variety of applications and enjoy considerable popularity in several disciplines.<ref>{{cite book |last1=Zhang |first1=C |last2=Ma |first2=Y |title=Ensemble machine learning: methods and applications |date=2012 |publisher=Springer New York Dordrecht Heidelberg London |location=New York |isbn=978-1-4419-9325-0 |pages=157–175}}</ref>
+From a computational standpoint, the RF algorithm is appealing because it (i) naturally handles both [[regression]] and (multiclass) [[classification]], (ii) is relatively fast to train and to predict, (iii) depends on only one or two tuning parameters, (iv) has a built-in estimate of generalization error, (v) can be used directly for high-dimensional problems, and (vi) can easily be implemented in parallel.
+Statistically, the RF algorithm is appealing because of the additional features it provides, such as (i) measures of variable importance, (ii) differential class weighting, (iii) missing value imputation, (iv) visualization, (v) outlier detection, and (vi) unsupervised learning.<ref>{{cite book |last1=Zhang |first1=C |last2=Ma |first2=Y |title=Ensemble machine learning: methods and applications |date=2012 |publisher=Springer New York Dordrecht Heidelberg London |location=New York |isbn=978-1-4419-9325-0 |pages=157–175}}</ref>
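The bagging, voting, and internal error estimate described in the Random Forest text can be illustrated with a small sketch. This is not the algorithm from the cited sources but a deliberately simplified stand-in: each "tree" is a one-split decision stump, the random feature subset is drawn once per tree rather than at every split, and all names (`fit_stump`, `fit_forest`, `predict`, `oob_error`) and the toy data are hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, labels, features):
    """Fit a one-split 'tree': the (feature, threshold) with fewest training errors."""
    majority = Counter(labels).most_common(1)[0][0]
    # Fallback when no split exists: always predict the majority class.
    best = (len(labels) + 1, features[0], float("inf"), majority, majority)
    for f in features:
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between neighbouring observed values
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    return best[1:]


def predict_stump(stump, row):
    f, t, l_lab, r_lab = stump
    return l_lab if row[f] <= t else r_lab


def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    """Bagging: each stump sees a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        feats = rng.sample(range(d), n_features)    # simplification: per tree, not per split
        stump = fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats)
        forest.append((stump, set(idx)))            # remember which rows were in the bag
    return forest


def predict(forest, row):
    """Classification output: the majority vote of the individual trees."""
    votes = Counter(predict_stump(s, row) for s, _ in forest)
    return votes.most_common(1)[0][0]


def oob_error(forest, rows, labels):
    """Out-of-bag error: each row is scored only by trees that never saw it."""
    wrong = counted = 0
    for i, (row, y) in enumerate(zip(rows, labels)):
        votes = Counter(predict_stump(s, row) for s, bag in forest if i not in bag)
        if votes:
            counted += 1
            wrong += votes.most_common(1)[0][0] != y
    return wrong / counted if counted else float("nan")


# Toy data (made up for illustration): two well-separated classes.
rows = [(1.0, 5.0), (1.2, 4.8), (0.8, 5.5), (1.1, 5.2),
        (3.0, 1.0), (3.2, 0.8), (2.9, 1.3), (3.1, 1.1)]
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]

forest = fit_forest(rows, labels)
print(predict(forest, (0.5, 6.0)), oob_error(forest, rows, labels))
```

Because each row is left out of roughly a third of the bootstrap samples, the out-of-bag votes behave like predictions on held-out data, which is why the estimate can stand in for a separate cross-validation run. A production implementation would grow full trees and would normally come from an established library rather than a sketch like this one.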
 == Applications ==