One in ten rule

From Wikipedia, the free encyclopedia

In statistics, the one in ten rule is a rule of thumb for how many predictors can be estimated from data in a regression analysis (in particular proportional hazards models and logistic regression) while keeping the risk of overfitting low. The rule states that one predictive variable can be studied for every ten events.[1][2][3][4] It does not apply to ordinary least squares linear regression, where as few as two subjects per predictor have been suggested to be sufficient.[5]

For example, if a sample of 200 patients is studied and 180 patients die during the study (so that 20 patients survive), only two pre-specified predictors can reliably be fitted to the full data set. Similarly, if 120 patients die during the study (so that 80 patients survive), eight pre-specified predictors can be fitted reliably, because the limit is based on the smaller of the two counts, here 80. If more are fitted, overfitting is likely and the results will not predict well outside the training data. The 1:10 rule is often violated in fields with many variables (e.g. gene expression studies in cancer), which decreases confidence in the reported findings.[6]
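
The arithmetic in this example can be sketched in a few lines of Python. The function name `max_predictors` and its parameters are illustrative assumptions, not part of any statistics library; the sketch simply takes the smaller of the two outcome counts and divides by the required number of events per predictor.

```python
def max_predictors(n_subjects, n_events, events_per_predictor=10):
    """Number of pre-specified predictors allowed by the rule of thumb.

    Hypothetical helper: the limit is based on the smaller of the two
    outcome counts (events and non-events), as in the example above.
    """
    if not 0 <= n_events <= n_subjects:
        raise ValueError("n_events must lie between 0 and n_subjects")
    if events_per_predictor <= 0:
        raise ValueError("events_per_predictor must be positive")
    limiting_count = min(n_events, n_subjects - n_events)
    # Integer division: a fraction of a predictor cannot be fitted.
    return limiting_count // events_per_predictor


print(max_predictors(200, 180))  # 2  (20 survivors / 10)
print(max_predictors(200, 120))  # 8  (80 survivors / 10)
```

The `events_per_predictor` argument is left adjustable so that stricter or looser ratios can be tried with the same counts.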

A "one in 20 rule" has been suggested, indicating the need for shrinkage of regression coefficients, and a "one in 50 rule" for stepwise selection with the default significance threshold of 5%.[4][7]
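
To see how much more data the stricter variants demand, the rules can be read in the opposite direction: given a planned number of predictors, how large must the limiting outcome count be? The following is a minimal sketch; the dictionary `RULES` and the function `required_events` are hypothetical names chosen for illustration, and the ratios are the ones quoted above.

```python
# Events required per predictor under each rule of thumb (ratios from the text).
RULES = {
    "one in ten": 10,
    "one in 20": 20,
    "one in 50": 50,
}


def required_events(n_predictors):
    """Minimum limiting outcome count needed under each rule.

    Hypothetical helper: multiplies the number of planned predictors
    by the events-per-predictor ratio of each rule.
    """
    if n_predictors < 0:
        raise ValueError("n_predictors must be non-negative")
    return {name: n_predictors * ratio for name, ratio in RULES.items()}


for name, events in required_events(8).items():
    print(f"{name}: at least {events} events")
# one in ten: at least 80 events
# one in 20: at least 160 events
# one in 50: at least 400 events
```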

Later work, however, suggests that the one in ten rule may be too conservative as a general recommendation and that five to nine events per predictor can be enough, depending on the research question.[8]

References

  1. Harrell, F. E. Jr.; Lee, K. L.; Califf, R. M.; Pryor, D. B.; Rosati, R. A. (1984). "Regression modelling strategies for improved prognostic prediction". Stat Med. 3 (2): 143–52. doi:10.1002/sim.4780030207.
  2. Harrell, F. E. Jr.; Lee, K. L.; Mark, D. B. (1996). "Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors". Stat Med. 15 (4): 361–87. doi:10.1002/(sici)1097-0258(19960229)15:4<361::aid-sim168>3.0.co;2-4.
  3. Peduzzi, Peter; Concato, John; Kemper, Elizabeth; Holford, Theodore R.; Feinstein, Alvan R. (1996). "A simulation study of the number of events per variable in logistic regression analysis". Journal of Clinical Epidemiology. 49 (12): 1373–1379. doi:10.1016/s0895-4356(96)00236-3. PMID 8970487.
  4. "Chapter 8: Statistical Models for Prognostication: Problems with Regression Models". Archived from the original on October 31, 2004. Retrieved 2013-10-11.
  5. Austin, P. C.; Steyerberg, E. W. (2015). "The number of subjects per variable required in linear regression analyses". Journal of Clinical Epidemiology. 68 (6): 627–636. doi:10.1016/j.jclinepi.2014.12.014.
  6. Shtatland, Ernest S.; Kleinman, Ken; Cain, Emily M. (2005). "Model building in Proc PHREG with automatic variable selection and information criteria". Paper 206–30 in SUGI 30 Proceedings, Philadelphia, Pennsylvania, April 10–13, 2005. http://www2.sas.com/proceedings/sugi30/206-30.pdf
  7. Steyerberg, E. W.; Eijkemans, M. J.; Harrell, F. E. Jr.; Habbema, J. D. (2000). "Prognostic modelling with logistic regression analysis: a comparison of selection and estimation methods in small data sets". Stat Med. 19 (8): 1059–1079. doi:10.1002/(sici)1097-0258(20000430)19:8<1059::aid-sim412>3.0.co;2-0.
  8. Vittinghoff, E.; McCulloch, C. E. (2007). "Relaxing the Rule of Ten Events per Variable in Logistic and Cox Regression". American Journal of Epidemiology. 165 (6): 710–718. doi:10.1093/aje/kwk052.