Random forest

Random forest (or random forests) is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees. The algorithm for inducing a random forest was developed by Leo Breiman[1] and Adele Cutler, and "Random Forests" is their trademark. The term came from random decision forests, first proposed by Tin Kam Ho of Bell Labs in 1995. The method combines Breiman's "bagging" idea with the random selection of features, introduced independently by Ho[2][3] and by Amit and Geman,[4] in order to construct a collection of decision trees with controlled variation.

The selection of a random subset of features is an example of the random subspace method, which, in Ho's formulation, is a way to implement stochastic discrimination[5] proposed by Eugene Kleinberg.

Learning algorithm

Each tree is constructed using the following algorithm (a code sketch follows the list):

  1. Let the number of training cases be N, and the number of variables in the classifier be M.
  2. A number m of input variables is specified and used to determine the decision at each node of the tree; m should be much less than M.
  3. Choose the training set for this tree by sampling N times with replacement from all N available training cases (i.e., take a bootstrap sample). Use the rest of the cases (the "out-of-bag" cases) to estimate the error of the tree by predicting their classes.
  4. For each node of the tree, randomly choose m variables on which to base the decision at that node. Calculate the best split based on these m variables in the training set.
  5. Each tree is fully grown and not pruned (as may be done in constructing a normal tree classifier).
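
The steps above can be made concrete in a short program. The following is a minimal sketch in Python with numpy, assuming integer class labels 0..K−1; the function names (build_forest, build_tree, best_split) are illustrative rather than from any particular library, and the Gini impurity used in step 4 is one common choice of split criterion. The error estimate from step 3 is deferred to the sketch in the next section.

import numpy as np

def best_split(X, y, feature_ids):
    """Search the m candidate variables for the split that minimizes
    the weighted Gini impurity of the two child nodes (step 4)."""
    best = None  # (impurity, variable index, threshold)
    for f in feature_ids:
        for t in np.unique(X[:, f]):
            left, right = y[X[:, f] <= t], y[X[:, f] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            gini = sum(len(part) / len(y) *
                       (1.0 - sum(np.mean(part == c) ** 2
                                  for c in np.unique(part)))
                       for part in (left, right))
            if best is None or gini < best[0]:
                best = (gini, f, t)
    return best

def build_tree(X, y, m, rng):
    """Grow one unpruned tree (step 5), recursing until nodes are pure."""
    if len(np.unique(y)) == 1:
        return int(y[0])                         # pure leaf: a class label
    feats = rng.choice(X.shape[1], size=m, replace=False)   # step 4
    split = best_split(X, y, feats)
    if split is None:
        return int(np.bincount(y).argmax())      # no usable split: majority leaf
    _, f, t = split
    go_left = X[:, f] <= t
    return (f, t, build_tree(X[go_left], y[go_left], m, rng),
                  build_tree(X[~go_left], y[~go_left], m, rng))

def build_forest(X, y, n_trees=100, m=None, seed=0):
    """Steps 1-3: draw a bootstrap sample of the N cases for each tree."""
    rng = np.random.default_rng(seed)
    N, M = X.shape                               # step 1
    m = m or max(1, int(round(np.sqrt(M))))      # step 2: a common choice, m << M
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, N, size=N)         # step 3: N draws with replacement
        forest.append(build_tree(X[idx], y[idx], m, rng))
    return forest

def predict_tree(node, x):
    """Route one case down a tree to its leaf label."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if x[f] <= t else right
    return node

def predict(forest, x):
    """The forest outputs the mode of the classes voted by its trees."""
    return int(np.bincount([predict_tree(tree, x) for tree in forest]).argmax())

For example, forest = build_forest(X, y, n_trees=50) on an N-by-M feature array X with an integer label vector y yields trees whose majority vote, predict(forest, x), classifies a new case x.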

Features and advantages

The advantages of random forest are:

  • It is among the most accurate of current learning algorithms; for many data sets, it produces a highly accurate classifier.
  • It runs efficiently on large databases.
  • It can handle thousands of input variables without variable deletion.
  • It gives estimates of what variables are important in the classification.
  • It generates an internal unbiased estimate of the generalization error as the forest building progresses (a code sketch of this "out-of-bag" estimate follows this list).
  • It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.
  • It has methods for balancing error in data sets with unbalanced class populations.
  • Generated forests can be saved for future use on other data.
  • Prototypes are computed that give information about the relation between the variables and the classification.
  • It computes proximities between pairs of cases that can be used in clustering, locating outliers, or (by scaling) giving interesting views of the data.
  • The capabilities of the above can be extended to unlabeled data, leading to unsupervised clustering, data views and outlier detection.
  • It offers an experimental method for detecting variable interactions.
  • Learning is fast.
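
A minimal sketch of the out-of-bag error estimate mentioned in the list above, reusing build_tree and predict_tree from the earlier sketch; this illustrates the idea only and is not Breiman and Cutler's exact implementation. Each case is scored only by the trees whose bootstrap sample happened to leave it out, so no separate test set is needed.

import numpy as np

def forest_with_oob(X, y, n_trees=100, m=None, seed=0):
    """Grow a forest and estimate its generalization error from the
    out-of-bag cases as the trees are built."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    m = m or max(1, int(round(np.sqrt(M))))
    votes = np.zeros((N, int(y.max()) + 1))       # per-case class vote counts
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, N, size=N)          # bootstrap sample
        tree = build_tree(X[idx], y[idx], m, rng)
        forest.append(tree)
        for i in np.setdiff1d(np.arange(N), idx): # the out-of-bag cases
            votes[i, predict_tree(tree, X[i])] += 1
    scored = votes.sum(axis=1) > 0                # cases with at least one vote
    oob_error = float(np.mean(votes[scored].argmax(axis=1) != y[scored]))
    return forest, oob_error

The same out-of-bag cases underlie the variable importance estimate in the list above: permuting one variable's values among them and measuring the resulting rise in out-of-bag error indicates how much the forest relies on that variable.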

Disadvantages

  • Random forests are prone to overfitting for some datasets. This is even more pronounced in noisy classification/regression tasks.[6]
  • Random forests do not handle large numbers of irrelevant features as well as ensembles of entropy-reducing decision trees.[7]
  • It is more efficient to select a random decision boundary than an entropy-reducing decision boundary, thus making larger ensembles more feasible. Although this may seem to be an advantage at first, it has the effect of shifting the computation from training time to evaluation time, which is actually a disadvantage for most applications.

See also

References

  1. ^ Breiman, Leo (2001). "Random Forests". Machine Learning. 45 (1): 5–32. doi:10.1023/A:1010933404324.
  2. ^ Ho, Tin (1995). Random Decision Forest (PDF). 3rd Int'l Conf. on Document Analysis and Recognition. pp. 278–282.
  3. ^ Ho, Tin (1998). "The Random Subspace Method for Constructing Decision Forests" (PDF). IEEE Transactions on Pattern Analysis and Machine Intelligence. 20 (8): 832–844. doi:10.1109/34.709601.
  4. ^ Amit, Y.; Geman, D. (1997). "Shape quantization and recognition with randomized trees" (PDF). Neural Computation. 9 (7): 1545–1588. doi:10.1162/neco.1997.9.7.1545.
  5. ^ Kleinberg, Eugene (1996). "An Overtraining-Resistant Stochastic Modeling Method for Pattern Recognition" (PDF). Annals of Statistics. 24 (6): 2319–2349. doi:10.1214/aos/1032181157. MR1425956.
  6. ^ Segal, Mark R. (2004). Machine Learning Benchmarks and Random Forest Regression. Center for Bioinformatics & Molecular Biostatistics.
  7. ^ Gashler, M. (2008). Decision Tree Ensemble: Small Heterogeneous Is Better Than Large Homogeneous (PDF). Seventh International Conference on Machine Learning and Applications (ICMLA 08). pp. 900–905. doi:10.1109/ICMLA.2008.154.

Commercial implementation

  • [1] Random Forests.

Open source implementations

  • The Original RF by Breiman and Cutler. Written in Fortran 77. May be difficult to configure.
  • ALGLIB contains an implementation of a modified random forest algorithm in C#, C++, Pascal, and VBA.
  • FastRandomForest Efficient implementation in Java that utilizes multiple cores. Integrates into the Weka environment.
  • orngEnsemble module within Orange data mining software suite
  • PARF Written in Fortran 90. Can distribute work over a cluster of computers using MPI.
  • party an implementation of Breiman's random forests based on conditional inference trees for R
  • randomForest for R
  • TMVA Toolkit for Multivariate Data Analysis implements random forests.
  • milk for Python implements random forests.
  • Waffles A C++ library of machine learning algorithms, including RF.
  • randomforest-matlab (http://code.google.com/p/randomforest-matlab) Matlab version.
  • Apache Mahout