Jackknife variance estimates for random forest
This article has multiple issues. Please help improve it or discuss these issues on the talk page. (Learn how and when to remove these template messages)(Learn how and when to remove this template message)
Jackknife variance estimates
The sampling variance of bagged learners is:
Jackknife estimates can be considered to eliminate the bootstrap effects. The jackknife variance estimator is defined as:
In some classification problems, when random forest is used to fit models, jackknife estimated variance is defined as:
Here, denotes a decision tree after training, denotes the result based on samples without observation.
E-mail spam problem is a common classification problem, in this problem, 57 features are used to classify spam e-mail and non-spam e-mail. Applying IJ-U variance formula to evaluate the accuracy of models with m=15,19 and 57. The results shows in paper( Confidence Intervals for Random Forests: The jackknife and the Infinitesimal Jackknife ) that m = 57 random forest appears to be quite unstable, while predictions made by m=5 random forest appear to be quite stable, this results is corresponding to the evaluation made by error percentage, in which the accuracy of model with m=5 is high and m=57 is low.
Here, accuracy is measured by error rate, which is defined as:
Here N is also the number of samples, M is the number of classes, is the indicator function which equals 1 when observation is in class j, equals 0 when in other classes. No probability is considered here. There is another method which is similar to error rate to measure accuracy:
Here N is the number of samples, M is the number of classes, is the indicator function which equals 1 when observation is in class j, equals 0 when in other classes. is the predicted probability of observation in class .This method is used in Kaggle These two methods are very similar.
Modification for bias
When using Monte Carlo MSEs for estimating and , a problem about the Monte Carlo bias should be considered, especially when n is large, the bias is getting large:
To eliminate this influence, bias-corrected modifications are suggested:
- Wager, Stefan; Hastie, Trevor; Efron, Bradley (2014-05-14). "Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife". Journal of Machine Learning Research. arXiv:1311.4555. Bibcode:2013arXiv1311.4555W.
- Kaggle https://www.kaggle.com/c/otto-group-product-classification-challenge/details/evaluation. Retrieved 2015. Check date values in:
|accessdate=(help); Missing or empty