One in ten rule: Difference between revisions

Revision as of 09:07, 9 January 2019

In statistics, the one in ten rule is a rule of thumb for how many predictor parameters can be estimated from data in regression analysis (in particular proportional hazards models in survival analysis and logistic regression) while keeping the risk of overfitting low. The rule states that one predictive variable can be studied for every ten events.^[1]^[2]^[3]^[4] For logistic regression the number of events is given by the size of the smallest of the outcome categories, and for survival analysis it is given by the number of uncensored events.^[3]

Example

For example, if a sample of 200 patients is studied and 180 patients die during the study (so that 20 patients survive), only two pre-specified predictors can reliably be fitted to the data, because the smaller outcome category contains 20 patients. Similarly, if 120 patients die during the study (so that 80 patients survive), eight pre-specified predictors (based on the smaller of the two counts, 80) can be fitted reliably. If more are fitted, overfitting is likely and the results will not predict well outside the training data. It is not uncommon to see the 1:10 rule violated in fields with many variables (e.g. gene expression studies in cancer), decreasing the confidence in reported findings.^[5]
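The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration of the rule of thumb for a binary outcome; the function names `limiting_events` and `max_predictors` are hypothetical and do not come from any cited source.

```python
def limiting_events(n_events: int, n_non_events: int) -> int:
    """Return the size of the smaller outcome category (binary-outcome case)."""
    return min(n_events, n_non_events)


def max_predictors(n_events: int, n_non_events: int,
                   events_per_predictor: int = 10) -> int:
    """Number of pre-specified predictors allowed by a one-in-N rule of thumb.

    Integer division is used because a fraction of a predictor cannot be fitted.
    """
    return limiting_events(n_events, n_non_events) // events_per_predictor


# 200 patients, 180 deaths and 20 survivors: the limiting count is 20.
print(max_predictors(180, 20))
# 200 patients, 120 deaths and 80 survivors: the limiting count is 80.
print(max_predictors(120, 80))
```

For survival analysis the limiting count would instead be the number of uncensored events, so the first helper would need to be replaced accordingly.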

Improvements

A "one in 20 rule" has been suggested, indicating the need for shrinkage of regression coefficients, and a "one in 50 rule" for stepwise selection with the default p-value of 5%.^[4]^[6] Other studies, however, show that the one in ten rule may be too conservative as a general recommendation and that five to nine events per predictor can be enough, depending on the research question.^[7]
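To compare the thresholds mentioned above, the rule can be inverted to give the number of events needed for a chosen number of candidate predictors. The following is a minimal sketch; the dictionary `RULES` and the function `events_required` are hypothetical names introduced for illustration, and the figure of 5 is simply the lower end of the "five to nine" range quoted above.

```python
# Events-per-predictor thresholds taken from the text above.
RULES = {
    "relaxed (lower end of five to nine)": 5,
    "one in ten": 10,
    "one in 20": 20,
    "one in 50 (stepwise selection)": 50,
}


def events_required(n_predictors: int, events_per_predictor: int) -> int:
    """Events in the limiting outcome category needed under a one-in-N rule."""
    return n_predictors * events_per_predictor


# Compare the thresholds for a model with eight candidate predictors.
for name, ratio in RULES.items():
    print(f"{name}: {events_required(8, ratio)} events")
```

The spread between the smallest and largest result shows how strongly the answer depends on which rule of thumb is adopted.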

More recently, a study has shown that the ratio of events per predictive variable is not a reliable statistic for determining the minimum number of events needed to estimate a logistic prediction model.^[8] Instead, the number of predictor variables, the total sample size (events + non-events) and the events fraction (events / total sample size) can be used to calculate the expected prediction error of the model that is to be developed.^[9] One can then estimate the sample size required to achieve an expected prediction error that is smaller than a predetermined allowable prediction error value.^[9]
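The three inputs named in this paragraph can be derived from the event counts as sketched below. The sketch stops at computing the inputs: the formula for the expected prediction error itself is given in the cited study and is not reproduced here. The names `DesignSummary` and `summarise_design` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DesignSummary:
    n_predictors: int
    total_sample_size: int   # events + non-events
    events_fraction: float   # events / total sample size


def summarise_design(n_events: int, n_non_events: int,
                     n_predictors: int) -> DesignSummary:
    """Collect the quantities that replace the events-per-variable ratio."""
    total = n_events + n_non_events
    return DesignSummary(n_predictors, total, n_events / total)


# 120 events and 80 non-events with eight candidate predictors.
print(summarise_design(120, 80, 8))
```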

Alternatively, three requirements for prediction model estimation have been suggested: the model should have a global shrinkage factor of ≥ 0.9, an absolute difference of ≤ 0.05 between the model's apparent and adjusted Nagelkerke R², and a precise estimation of the overall risk or rate in the target population.^[10] The necessary sample size and number of events for model development are then given by the values that meet these requirements.^[10]
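As a rough illustration, the three requirements can be written as a single check. This sketch assumes the shrinkage factor and both R² values have already been estimated elsewhere. Because the text above gives no numeric threshold for "precise estimation" of the overall risk, that criterion is passed in as a yes/no judgement. The function name `meets_suggested_requirements` is hypothetical.

```python
def meets_suggested_requirements(shrinkage_factor: float,
                                 r2_apparent: float,
                                 r2_adjusted: float,
                                 overall_risk_is_precise: bool) -> bool:
    """Check the three suggested requirements for a candidate design."""
    enough_shrinkage = shrinkage_factor >= 0.9
    small_optimism = abs(r2_apparent - r2_adjusted) <= 0.05
    return enough_shrinkage and small_optimism and overall_risk_is_precise


# Shrinkage 0.92 and an R-squared gap of 0.03: all requirements are met.
print(meets_suggested_requirements(0.92, 0.30, 0.27, True))
# Shrinkage 0.85 is below 0.9, so the design fails the first requirement.
print(meets_suggested_requirements(0.85, 0.30, 0.27, True))
```

In a sample size calculation, one would increase the planned sample size until a check of this kind passes.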

References

[1] Harrell, F. E. Jr.; Lee, K. L.; Califf, R. M.; Pryor, D. B.; Rosati, R. A. (1984). "Regression modelling strategies for improved prognostic prediction". Stat Med. 3 (2): 143–52. doi:10.1002/sim.4780030207.
[2] Harrell, F. E. Jr.; Lee, K. L.; Mark, D. B. (1996). "Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors" (PDF). Stat Med. 15 (4): 361–87. doi:10.1002/(sici)1097-0258(19960229)15:4<361::aid-sim168>3.0.co;2-4.
[3] Peduzzi, Peter; Concato, John; Kemper, Elizabeth; Holford, Theodore R.; Feinstein, Alvan R. (1996). "A simulation study of the number of events per variable in logistic regression analysis". Journal of Clinical Epidemiology. 49 (12): 1373–1379. doi:10.1016/s0895-4356(96)00236-3. PMID 8970487.
[4] "Chapter 8: Statistical Models for Prognostication: Problems with Regression Models". Archived from the original on October 31, 2004. Retrieved 2013-10-11.
[5] Shtatland, Ernest S.; Kleinman, Ken; Cain, Emily M. "Model building in Proc PHREG with automatic variable selection and information criteria". Paper 206–30 in SUGI 30 Proceedings, Philadelphia, Pennsylvania, April 10–13, 2005. http://www2.sas.com/proceedings/sugi30/206-30.pdf
[6] Steyerberg, E. W.; Eijkemans, M. J.; Harrell, F. E. Jr.; Habbema, J. D. (2000). "Prognostic modelling with logistic regression analysis: a comparison of selection and estimation methods in small data sets". Stat Med. 19 (8): 1059–1079. doi:10.1002/(sici)1097-0258(20000430)19:8<1059::aid-sim412>3.0.co;2-0.
[7] Vittinghoff, E.; McCulloch, C. E. (2007). "Relaxing the Rule of Ten Events per Variable in Logistic and Cox Regression". American Journal of Epidemiology. 165 (6): 710–718. doi:10.1093/aje/kwk052. PMID 17182981.
[8] van Smeden, Maarten; de Groot, Joris A. H.; Moons, Karel G. M.; Collins, Gary S.; Altman, Douglas G.; Eijkemans, Marinus J. C.; Reitsma, Johannes B. (2016-11-24). "No rationale for 1 variable per 10 events criterion for binary logistic regression analysis". BMC Medical Research Methodology. 16 (1): 163. doi:10.1186/s12874-016-0267-3. ISSN 1471-2288. PMC 5122171. PMID 27881078.
[9] van Smeden, Maarten; Moons, Karel G. M.; de Groot, Joris A. H.; Collins, Gary S.; Altman, Douglas G.; Eijkemans, Marinus J. C.; Reitsma, Johannes B. (2018-01-01). "Sample size for binary logistic prediction models: Beyond events per variable criteria". Statistical Methods in Medical Research: 962280218784726. doi:10.1177/0962280218784726. ISSN 1477-0334. PMID 29966490.
[10] Riley, Richard D.; Snell, Kym I. E.; Ensor, Joie; Burke, Danielle L.; Harrell, Frank E. Jr.; Moons, Karel G. M.; Collins, Gary S. "Minimum sample size for developing a multivariable prediction model: PART II - binary and time-to-event outcomes". Statistics in Medicine. 0 (0). doi:10.1002/sim.7992. ISSN 1097-0258.


@@ Line 7: / Line 7: @@
 A "one in 20 rule" has been suggested, indicating the need for [[Shrinkage (statistics)|shrinkage]] of regression coefficients, and a "one in 50 rule" for [[Stepwise regression|stepwise selection]] with the default [[p-value]] of&nbsp;5%.<ref name="pain" /><ref>{{cite journal |last=Steyerberg |first=E. W. |last2=Eijkemans |first2=M. J. |last3=Harrell |first3=F. E. Jr. |last4=Habbema |first4=J. D. |title=Prognostic modelling with logistic regression analysis: a comparison of selection and estimation methods in small data sets |journal=[[Statistics in Medicine (journal)|Stat Med]] |year=2000 |volume=19 |issue= 8|pages=1059–1079 |doi= 10.1002/(sici)1097-0258(20000430)19:8<1059::aid-sim412>3.0.co;2-0}}</ref> Other studies, however, show that the one in ten rule may be too conservative as a general recommendation and that five to nine events per predictor can be enough, depending on the research question.<ref name="Vittinghoff et al. (2007)">{{cite journal |first=E. |last=Vittinghoff |first2=C. E. |last2=McCulloch |year=2007 |title=Relaxing the Rule of Ten Events per Variable in Logistic and Cox Regression |journal=American Journal of Epidemiology |volume=165 |issue=6 |pages=710–718 |doi=10.1093/aje/kwk052|pmid=17182981 }}</ref>
-More recently, a study has shown that the ratio of events per predictive variable is not a reliable statistic for predicting the number of events necessary for estimating a logistic prediction model.<ref>{{Cite journal|last=van Smeden|first=Maarten|last2=de Groot|first2=Joris A. H.|last3=Moons|first3=Karel G. M.|last4=Collins|first4=Gary S.|last5=Altman|first5=Douglas G.|last6=Eijkemans|first6=Marinus J. C.|last7=Reitsma|first7=Johannes B.|date=2016-11-24|title=No rationale for 1 variable per 10 events criterion for binary logistic regression analysis|journal=BMC Medical Research Methodology|volume=16|issue=1|pages=163|doi=10.1186/s12874-016-0267-3|issn=1471-2288|pmc=5122171|pmid=27881078}}</ref> Instead, the number of predictor variables, the total sample size (events + non-events) and the events fraction (events / total sample size) can be used to calculate the expected prediction error of the model that is to be developed.<ref>{{Cite journal|last=van Smeden|first=Maarten|last2=Moons|first2=Karel Gm|last3=de Groot|first3=Joris Ah|last4=Collins|first4=Gary S.|last5=Altman|first5=Douglas G.|last6=Eijkemans|first6=Marinus Jc|last7=Reitsma|first7=Johannes B.|date=2018-01-01|title=Sample size for binary logistic prediction models: Beyond events per variable criteria|journal=Statistical Methods in Medical Research|pages=962280218784726|doi=10.1177/0962280218784726|issn=1477-0334|pmid=29966490}}</ref>
+More recently, a study has shown that the ratio of events per predictive variable is not a reliable statistic for estimating the minimum number of events for estimating a logistic prediction model.<ref>{{Cite journal|last=van Smeden|first=Maarten|last2=de Groot|first2=Joris A. H.|last3=Moons|first3=Karel G. M.|last4=Collins|first4=Gary S.|last5=Altman|first5=Douglas G.|last6=Eijkemans|first6=Marinus J. C.|last7=Reitsma|first7=Johannes B.|date=2016-11-24|title=No rationale for 1 variable per 10 events criterion for binary logistic regression analysis|journal=BMC Medical Research Methodology|volume=16|issue=1|pages=163|doi=10.1186/s12874-016-0267-3|issn=1471-2288|pmc=5122171|pmid=27881078}}</ref> Instead, the number of predictor variables, the total sample size (events + non-events) and the events fraction (events / total sample size) can be used to calculate the expected prediction error of the model that is to be developed.<ref name=":1">{{Cite journal|last=van Smeden|first=Maarten|last2=Moons|first2=Karel Gm|last3=de Groot|first3=Joris Ah|last4=Collins|first4=Gary S.|last5=Altman|first5=Douglas G.|last6=Eijkemans|first6=Marinus Jc|last7=Reitsma|first7=Johannes B.|date=2018-01-01|title=Sample size for binary logistic prediction models: Beyond events per variable criteria|journal=Statistical Methods in Medical Research|pages=962280218784726|doi=10.1177/0962280218784726|issn=1477-0334|pmid=29966490}}</ref> One can then estimate the required sample size to achieve a expected prediction error that is smaller than a predetermined allowable prediction error value.<ref name=":1" />
+Alternatively, three requirements for prediction model estimation have been suggested: the model should have a global shrinkage factor of ≥ .9, an absolute difference of ≤ .05 in the model's apparent and adjusted [[Logistic regression#Pseudo-R2s|Nagelkerke R<sup>2</sup>]], and a precise estimation of the overall risk or rate in the target population.<ref name=":2">{{Cite journal|last=Riley|first=Richard D.|last2=Snell|first2=Kym IE|last3=Ensor|first3=Joie|last4=Burke|first4=Danielle L.|last5=Jr|first5=Frank E. Harrell|last6=Moons|first6=Karel GM|last7=Collins|first7=Gary S.|title=Minimum sample size for developing a multivariable prediction model: PART II - binary and time-to-event outcomes|url=https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.7992|journal=Statistics in Medicine|language=en|volume=0|issue=0|doi=10.1002/sim.7992|issn=1097-0258}}</ref> The necessary sample size and number of events for model development are then given by the values that meet these requirements.<ref name=":2" />
-<br />
 ==References==