One in ten rule

In statistics, the one in ten rule is a rule of thumb for how many predictor parameters can be estimated from data when doing regression analysis (in particular proportional hazards models) without substantial risk of overfitting. The rule states that one predictive variable can be studied for every ten events.[1][2][3][4]

For example, if a sample of 200 patients is studied and 20 patients die during the study, only two pre-specified predictors can reliably be fitted to the data, because the limit depends on the number of events (20 deaths) rather than on the sample size. If more are fitted, overfitting is likely and the results will not predict well outside the training data. The 1:10 rule is often violated in fields with many variables (e.g. gene expression studies in cancer), which decreases confidence in reported findings.[5]
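The arithmetic in the example above can be sketched in a few lines of Python. The function name max_predictors and its parameter names are hypothetical, chosen for illustration only; the sketch assumes the rule is applied by integer division of the event count by the events-per-variable threshold.

```python
def max_predictors(n_events: int, events_per_variable: int = 10) -> int:
    """Largest number of pre-specified predictors the rule of thumb allows.

    Hypothetical helper: only the number of events matters here,
    not the total sample size.
    """
    if n_events < 0:
        raise ValueError("n_events must be non-negative")
    if events_per_variable <= 0:
        raise ValueError("events_per_variable must be positive")
    # Round down: a partial predictor cannot be fitted.
    return n_events // events_per_variable


# The example above: 200 patients, of whom 20 die during the study.
print(max_predictors(20))  # 2
```

Note that the sample size of 200 never enters the calculation; a study with 2,000 patients and the same 20 deaths would give the same limit.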

The one in ten rule is a minimum. A stricter "one in 20 rule" has been suggested, indicating the need for shrinkage of regression coefficients, and a "one in 50 rule" has been suggested for stepwise selection with the default p-value of 5%.[4][6]
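To compare the thresholds mentioned above, the following minimal Python sketch computes how many events each rule would call for with a given number of candidate predictors. The names RULES and events_needed, and the labels used as dictionary keys, are hypothetical; only the ratios 10, 20 and 50 come from the text.

```python
# Events-per-variable thresholds named in the text.
# The labels are illustrative, not standard terminology.
RULES = {
    "one in 10 (minimum)": 10,
    "one in 20": 20,
    "one in 50 (stepwise selection)": 50,
}


def events_needed(n_predictors: int) -> dict[str, int]:
    """Minimum number of events each rule of thumb asks for."""
    if n_predictors < 0:
        raise ValueError("n_predictors must be non-negative")
    return {name: n_predictors * epv for name, epv in RULES.items()}


# Hypothetical study with five candidate predictors.
for name, events in events_needed(5).items():
    print(f"{name}: at least {events} events")
```

For five predictors this gives 50, 100 and 250 events respectively, which illustrates how quickly the required number of events grows under the stricter rules.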

References

  1. Harrell FE Jr, Lee KL, Califf RM, Pryor DB, Rosati RA. Regression modelling strategies for improved prognostic prediction. Stat Med. 1984 Apr–Jun;3(2):143–52.
  2. Harrell FE Jr, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med. 1996 Feb 28;15(4):361–87. http://www.unt.edu/rss/class/Jon/MiscDocs/Harrell_1996.pdf
  3. Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol. 1996 Dec;49(12):1373–9.
  4. http://painconsortium.nih.gov/symptomresearch/chapter_8/sec8/cess8pg2.htm
  5. Shtatland ES, Kleinman K, Cain EM. Model building in Proc PHREG with automatic variable selection and information criteria. Paper 206-30 in SUGI 30 Proceedings, Philadelphia, Pennsylvania, April 10–13, 2005. http://www2.sas.com/proceedings/sugi30/206-30.pdf
  6. Steyerberg EW, Eijkemans MJ, Harrell FE Jr, Habbema JD. Prognostic modelling with logistic regression analysis: a comparison of selection and estimation methods in small data sets. Stat Med. 2000;19:1059–79.