
Random forest: Difference between revisions

From Wikipedia, the free encyclopedia

Revision as of 01:43, 21 March 2009

In machine learning, a random forest is a classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees. The algorithm for inducing a random forest was developed by Leo Breiman and Adele Cutler, and "Random Forests" is their trademark. The term comes from the random decision forests first proposed by Tin Kam Ho of Bell Labs in 1995. The method combines Breiman's "bagging" idea and Ho's "random subspace method" to construct a collection of decision trees with controlled variation.
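The "mode of the classes" rule can be illustrated with a toy snippet (the vote labels here are hypothetical, standing in for the outputs of five fitted trees):

```python
from statistics import mode

# Hypothetical class votes from five individual decision trees
tree_outputs = ["spam", "ham", "spam", "spam", "ham"]

# The forest's prediction is the most common vote (the mode)
print(mode(tree_outputs))  # prints "spam"
```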

Learning algorithm

Each tree is constructed using the following algorithm:

  1. Let the number of training cases be N, and the number of variables in the classifier be M.
  2. The number m of input variables used to determine the decision at a node of the tree is given; m should be much less than M.
  3. Choose a training set for this tree by choosing N times with replacement from all N available training cases (i.e. take a bootstrap sample). Use the rest of the cases to estimate the error of the tree, by predicting their classes.
  4. For each node of the tree, randomly choose m variables on which to base the decision at that node. Calculate the best split based on these m variables in the training set.
  5. Each tree is fully grown and not pruned (as may be done in constructing a normal tree classifier).
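The steps above can be sketched in plain Python. This is an illustrative toy implementation only: the Gini impurity as the split criterion, the tuple-based tree representation, and all function names are choices made here, not part of the original description.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (split criterion chosen here)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, candidates):
    """Step 4: among the m candidate variables, find the (variable, threshold)
    pair minimising weighted Gini impurity. Returns None if nothing separates."""
    best = None
    for f in candidates:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def grow_tree(rows, labels, m):
    """Steps 4-5: recurse until the node is pure; fully grown, never pruned."""
    if len(set(labels)) == 1:
        return labels[0]                        # pure leaf
    M = len(rows[0])
    split = best_split(rows, labels, random.sample(range(M), m))
    if split is None:                           # no usable split: majority leaf
        return Counter(labels).most_common(1)[0][0]
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            grow_tree([r for r, _ in left], [y for _, y in left], m),
            grow_tree([r for r, _ in right], [y for _, y in right], m))

def predict_tree(node, row):
    while isinstance(node, tuple):              # internal nodes are 4-tuples
        f, t, lo, hi = node
        node = lo if row[f] <= t else hi
    return node

def grow_forest(rows, labels, n_trees, m):
    """Step 3 plus the outer loop: one bootstrap sample (N draws with
    replacement from the N cases) per tree."""
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [random.randrange(n) for _ in range(n)]
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], m))
    return forest

def predict(forest, row):
    """The forest outputs the mode of the individual trees' votes."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]
```

On a small, cleanly separable data set, `grow_forest` followed by `predict` returns the majority vote of the bootstrapped trees; real implementations differ in the split criterion and many engineering details.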

Advantages

The advantages of random forest are:

  • For many data sets, it produces a highly accurate classifier
  • It handles a very large number of input variables
  • It estimates the importance of variables in determining classification
  • It generates an internal unbiased estimate of the generalization error as the forest building progresses
  • It includes a good method for estimating missing data and maintains accuracy when a large proportion of the data are missing
  • It provides an experimental way to detect variable interactions
  • It can balance errors in data sets where class populations are unbalanced
  • It computes proximities between cases, useful for clustering, detecting outliers, and (by scaling) visualizing the data
  • Using the above, it can be extended to unlabeled data, leading to unsupervised clustering, outlier detection and data views
  • Learning is fast
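The internal generalization-error estimate mentioned above comes from the out-of-bag (OOB) cases: each tree is tested on the training cases left out of its bootstrap sample. The bookkeeping can be sketched as follows; a 1-nearest-neighbour rule is used here purely as a hypothetical stand-in for a fitted tree, and `oob_error` is a name chosen for this sketch.

```python
import random
from collections import Counter

def nearest_label(bag, x):
    # Hypothetical stand-in for a fitted tree: 1-nearest-neighbour on the bag
    return min(bag, key=lambda pair: abs(pair[0] - x))[1]

def oob_error(xs, ys, n_trees=50):
    """Estimate generalization error from out-of-bag votes."""
    n = len(xs)
    votes = [Counter() for _ in range(n)]       # OOB votes per training case
    for _ in range(n_trees):
        chosen = [random.randrange(n) for _ in range(n)]   # bootstrap sample
        bag = [(xs[i], ys[i]) for i in chosen]
        for i in set(range(n)) - set(chosen):              # out-of-bag cases
            votes[i][nearest_label(bag, xs[i])] += 1
    voted = sum(1 for v in votes if v)
    if voted == 0:
        return 0.0
    wrong = sum(1 for i in range(n)
                if votes[i] and votes[i].most_common(1)[0][0] != ys[i])
    return wrong / voted
```

Because every tree leaves out roughly a third of the cases, each case accumulates votes from the trees that never saw it, so no separate held-out test set is needed.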

Disadvantages

  • As noted in Machine Learning Benchmarks and Random Forest Regression, random forests are prone to overfitting on some datasets, and the effect is more pronounced in noisy classification/regression tasks.
  • Random Forest does not handle large numbers of irrelevant features as well as ensembles of entropy-reducing decision trees.[1]
  • It is more efficient to select a random decision boundary than an entropy-reducing decision boundary, thus making larger ensembles more feasible. Although this may seem to be an advantage at first, it has the effect of shifting the computation from training time to evaluation time, which is actually a disadvantage for most applications.

See also

Open source implementations

  • The Original RF by Breiman and Cutler. Written in Fortran 77. Difficult to configure.
  • PARF Written in Fortran 90. Can distribute work over a cluster of computers using MPI.
  • FastRandomForest Efficient implementation in Java that utilizes multiple cores on a machine. Integrates into the Weka environment.
  • randomForest for R
  1. ^ Gashler, M., Giraud-Carrier, C., and Martinez, T., Decision Tree Ensemble: Small Heterogeneous Is Better Than Large Homogeneous, The Seventh International Conference on Machine Learning and Applications, 2008, pp. 900–905, DOI 10.1109/ICMLA.2008.154