Generalized estimating equation

In statistics, a generalized estimating equation (GEE) is used to estimate the parameters of a generalized linear model with a possible unmeasured correlation between observations from different timepoints.^[1]^[2] GEEs are sometimes described as robust in every respect to the choice of working correlation matrix, but the robustness is specific: a wrong choice does not cost the consistency of the regression coefficient estimates, although it can cost efficiency.

Regression coefficient estimates from the Liang-Zeger GEE are consistent and asymptotically normal even when the working correlation is misspecified, under mild regularity conditions. GEE is more efficient than the generalized linear iterative model (GLIM) when autocorrelation is high.^[1] When the true correlation structure is known, consistency does not require the assumption that missing data are missing completely at random.^[1] Huber-White standard errors improve the efficiency of Liang-Zeger GEE in the absence of serial autocorrelation but may remove the marginal interpretation. GEE estimates the average response over the population ("population-averaged" effects) with Liang-Zeger standard errors, and the response in individuals with Huber-White standard errors, also known as "robust standard error" or "sandwich variance" estimates.^[3] Based on a limited literature review, Huber-White GEE has been in use since 1997, and Liang-Zeger GEE dates to the 1980s.^[4] Several independent formulations of these standard error estimators contribute to GEE theory. Placing the independent standard error estimators under the umbrella term "GEE" may exemplify abuse of terminology.

GEEs belong to a class of regression techniques that are referred to as semiparametric because they rely on specification of only the first two moments. They are a popular alternative to the likelihood-based generalized linear mixed model, which is more sensitive to misspecification of the variance structure and can lose consistency as a result.^[5] The price of keeping the regression coefficient estimates consistent under a misspecified variance structure is a loss of efficiency: the estimates have higher variance than under the optimal specification, which inflates Wald test p-values.^[6] GEEs are commonly used in large epidemiological studies, especially multi-site cohort studies, because they can handle many types of unmeasured dependence between outcomes.
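
As a sketch of what "the first two moments" means here, the usual marginal specification can be written as follows. The symbols $g$ (link function), $x_{ij}$ (covariate vector), $v$ (variance function), $\phi$ (dispersion), $A_i$ and $R_i(\alpha)$ (working correlation matrix) are notation introduced for this illustration, and $\mu_{ij}$ denotes the mean of the outcome for subject $i$ at time $j$:

$$\mathbb{E}[Y_{ij}]=\mu_{ij},\qquad g(\mu_{ij})=x_{ij}^{\top}\beta,\qquad \operatorname{Var}(Y_{ij})=\phi\,v(\mu_{ij}),$$

$$V_{i}=\phi\,A_{i}^{1/2}R_{i}(\alpha)A_{i}^{1/2},\qquad A_{i}=\operatorname{diag}\{v(\mu_{i1}),\dots ,v(\mu_{in_{i}})\}.$$

For example, an independence working correlation sets $R_{i}=I$, an exchangeable one sets $[R_{i}]_{jk}=\alpha$ for $j\neq k$, and a first-order autoregressive one sets $[R_{i}]_{jk}=\alpha^{|j-k|}$. No further distributional assumption about $Y_{i}$ is made.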

Formulation

Given a mean model $\mu_{ij}$ for subject $i$ and time $j$ that depends upon regression parameters $\beta_{k}$, and variance structure $V_{i}$, the estimating equation is formed via:^[7]

$$U(\beta )=\sum _{i=1}^{N}{\frac {\partial \mu _{i}}{\partial \beta }}V_{i}^{-1}\{Y_{i}-\mu _{i}(\beta )\}$$

The parameters $\beta_{k}$ are estimated by solving $U(\beta )=0$, typically with the Newton–Raphson algorithm. The variance structure is chosen to improve the efficiency of the parameter estimates. The Hessian of the solution to the GEEs in the parameter space can be used to calculate robust standard error estimates. The term "variance structure" refers to the algebraic form of the covariance matrix between outcomes, $Y$, in the sample. Examples of variance structure specifications include independence, exchangeable, autoregressive, stationary m-dependent, and unstructured. The most popular form of inference on GEE regression parameters is the Wald test using naive or robust standard errors, though the score test is also valid and is preferable when estimates of information under the alternative hypothesis are difficult to obtain. The likelihood ratio test is not valid in this setting because the estimating equations are not necessarily likelihood equations. Model selection can be performed with the GEE equivalent of the Akaike Information Criterion (AIC), the quasi-likelihood under the independence model criterion (QIC).^[8]
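
The solving step can be illustrated with a minimal pure-Python sketch. It is not the algorithm of any particular package, and it makes several assumptions that are not in the text above: a logit link with an intercept and one slope, an exchangeable working correlation whose parameter `alpha` is supplied by the user rather than estimated from residuals, simulated data, and Fisher scoring (the expected-derivative variant of Newton–Raphson). All function names are hypothetical.

```python
import math
import random


def solve2(B, u):
    """Solve the 2x2 linear system B x = u by Cramer's rule."""
    det = B[0][0] * B[1][1] - B[0][1] * B[1][0]
    return [(B[1][1] * u[0] - B[0][1] * u[1]) / det,
            (B[0][0] * u[1] - B[1][0] * u[0]) / det]


def cluster_terms(beta, xs, ys, alpha):
    """One cluster's u_i = D_i' V_i^{-1} (Y_i - mu_i) and B_i = D_i' V_i^{-1} D_i."""
    n = len(xs)
    mu = [1.0 / (1.0 + math.exp(-(beta[0] + beta[1] * x))) for x in xs]
    a = [m * (1.0 - m) for m in mu]      # Bernoulli variance; also dmu/deta for the logit link
    c = alpha / (1.0 + (n - 1) * alpha)  # constant in the closed-form exchangeable inverse

    def v_inv(r):
        # V^{-1} r with V = A^{1/2} R A^{1/2} and R = (1 - alpha) I + alpha 11'
        z = [rj / math.sqrt(aj) for rj, aj in zip(r, a)]
        s = sum(z)
        return [(zj - c * s) / ((1.0 - alpha) * math.sqrt(aj)) for zj, aj in zip(z, a)]

    D = [[aj, aj * x] for aj, x in zip(a, xs)]  # row j holds dmu_j/dbeta
    w = v_inv([y - m for y, m in zip(ys, mu)])
    VD = [v_inv([row[k] for row in D]) for k in range(2)]  # V^{-1} times each column of D
    u = [sum(D[j][k] * w[j] for j in range(n)) for k in range(2)]
    B = [[sum(D[j][k] * VD[l][j] for j in range(n)) for l in range(2)] for k in range(2)]
    return u, B


def totals(beta, clusters, alpha):
    """Sum over clusters: U = sum u_i, B = sum B_i, M = sum u_i u_i'."""
    U = [0.0, 0.0]
    B = [[0.0, 0.0], [0.0, 0.0]]
    M = [[0.0, 0.0], [0.0, 0.0]]
    for xs, ys in clusters:
        u, b = cluster_terms(beta, xs, ys, alpha)
        for k in range(2):
            U[k] += u[k]
            for l in range(2):
                B[k][l] += b[k][l]
                M[k][l] += u[k] * u[l]
    return U, B, M


def fit_gee(clusters, alpha=0.0, tol=1e-10, max_iter=100):
    """Solve U(beta) = 0 by Fisher scoring; return estimates, naive and robust SEs, final U."""
    beta = [0.0, 0.0]
    for _ in range(max_iter):
        U, B, _ = totals(beta, clusters, alpha)
        step = solve2(B, U)
        beta = [beta[0] + step[0], beta[1] + step[1]]
        if max(abs(s) for s in step) < tol:
            break
    U, B, M = totals(beta, clusters, alpha)
    Binv = [solve2(B, [1.0, 0.0]), solve2(B, [0.0, 1.0])]  # B is symmetric, so is its inverse
    # Sandwich covariance B^{-1} M B^{-1}
    cov = [[sum(Binv[k][r] * M[r][s] * Binv[l][s] for r in range(2) for s in range(2))
            for l in range(2)] for k in range(2)]
    naive_se = [math.sqrt(Binv[k][k]) for k in range(2)]
    robust_se = [math.sqrt(cov[k][k]) for k in range(2)]
    return beta, naive_se, robust_se, U


# Simulated clustered binary data (illustrative only): a shared cluster effect
# induces within-cluster correlation that the mean model does not describe.
random.seed(0)
clusters = []
for _ in range(200):
    b_i = random.gauss(0.0, 1.0)
    xs = [random.uniform(-1.0, 1.0) for _ in range(4)]
    ys = [1 if random.random() < 1.0 / (1.0 + math.exp(-(-0.5 + x + b_i))) else 0 for x in xs]
    clusters.append((xs, ys))

beta_ind, naive_ind, robust_ind, U_ind = fit_gee(clusters, alpha=0.0)  # independence
beta_exc, naive_exc, robust_exc, U_exc = fit_gee(clusters, alpha=0.3)  # exchangeable
print("independence:", beta_ind, naive_ind, robust_ind)
print("exchangeable:", beta_exc, naive_exc, robust_exc)
```

Both fits solve their own estimating equation, so the two sets of coefficients are generally close but not identical. The naive standard errors come from $B^{-1}$ alone and rely on the working correlation being right, whereas the robust ones use the sandwich form $B^{-1}MB^{-1}$.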

Relationship with Generalized Method of Moments

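The generalized estimating equation is a special case of the generalized method of moments (GMM).^[9] The relationship follows from the requirement that the estimating function have mean zero at the true parameter value, $\mathbb {E} \left[{\frac {\partial \mu _{i}}{\partial \beta }}V_{i}^{-1}\{Y_{i}-\mu _{i}(\beta )\}\right]=0$. This is a moment condition, and its sample analogue, ${1 \over {N}}\sum _{i=1}^{N}{\frac {\partial \mu _{i}}{\partial \beta }}V_{i}^{-1}\{Y_{i}-\mu _{i}(\beta )\}=0$, is the estimating equation $U(\beta )=0$ up to the factor $1/N$.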
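
The correspondence can be spelled out in a short derivation, given here as a sketch that treats $V_{i}$ as fixed (any working covariance parameters plugged in). The symbols $g_{i}$, $\bar{g}_{N}$, $W$ and $p$ are notation introduced for this illustration. Define the per-subject moment function and its sample average

$$g_{i}(\beta )={\frac {\partial \mu _{i}}{\partial \beta }}V_{i}^{-1}\{Y_{i}-\mu _{i}(\beta )\},\qquad \bar{g}_{N}(\beta )={\frac {1}{N}}\sum _{i=1}^{N}g_{i}(\beta ).$$

A GMM estimator minimizes a quadratic form in the sample moments for some positive definite weight matrix $W$:

$$\hat{\beta }=\arg \min _{\beta }\;\bar{g}_{N}(\beta )^{\top }W\,\bar{g}_{N}(\beta ).$$

Because $g_{i}$ has as many components as $\beta$ has parameters ($p$ moment conditions for $p$ unknowns), the system is exactly identified. The quadratic form then reaches its minimum value of zero wherever $\bar{g}_{N}(\hat{\beta })=0$, which is the same as $U(\hat{\beta })=0$, and the choice of $W$ does not affect the solution.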

Computation

Software for solving generalized estimating equations is available in MATLAB,^[10] SAS (proc genmod^[11]), SPSS (the gee procedure^[12]), Stata (the xtgee command^[13]), R (packages glmtoolbox,^[14] gee,^[15] geepack^[16] and multgee^[17]), Julia (package GEE.jl^[18]) and Python (package statsmodels^[19]).

Comparisons among software packages for the analysis of binary correlated data^[20]^[21] and ordinal correlated data^[22] via GEE are available.

References

[1] Kung-Yee Liang; Scott Zeger (1986). "Longitudinal data analysis using generalized linear models". Biometrika. 73 (1): 13–22. doi:10.1093/biomet/73.1.13.
[2] Hardin, James; Hilbe, Joseph (2003). Generalized Estimating Equations. London: Chapman and Hall/CRC. ISBN 978-1-58488-307-4.
[3] Abadie, Alberto; Athey, Susan; Imbens, Guido W; Wooldridge, Jeffrey M (October 2022). "When Should You Adjust Standard Errors for Clustering?". The Quarterly Journal of Economics. 138 (1): 1–35. arXiv:1710.02926. doi:10.1093/qje/qjac038.
[4] Wolfe, Frederick; Anderson, Janice; Harkness, Deborah; Bennett, Robert M.; Caro, Xavier J.; Goldenberg, Don L.; Russell, I. Jon; Yunus, Muhammad B. (1997). "A prospective, longitudinal, multicenter study of service utilization and costs in fibromyalgia". Arthritis & Rheumatism. 40 (9): 1560–1570. doi:10.1002/art.1780400904. PMID 9324009.
[5] Fong, Y; Rue, H; Wakefield, J (2010). "Bayesian inference for generalized linear mixed models". Biostatistics. 11 (3): 397–412. doi:10.1093/biostatistics/kxp053. PMC 2883299. PMID 19966070.
[6] O'Brien, Liam M.; Fitzmaurice, Garrett M.; Horton, Nicholas J. (October 2006). "Maximum Likelihood Estimation of Marginal Pairwise Associations with Multiple Source Predictors". Biometrical Journal. 48 (5): 860–875. doi:10.1002/bimj.200510227. ISSN 0323-3847. PMC 1764610. PMID 17094349.
[7] Diggle, Peter J.; Patrick Heagerty; Kung-Yee Liang; Scott L. Zeger (2002). Analysis of Longitudinal Data. Oxford Statistical Science Series. ISBN 978-0-19-852484-7.
[8] Pan, W. (2001). "Akaike's information criterion in generalized estimating equations". Biometrics. 57 (1): 120–125. doi:10.1111/j.0006-341X.2001.00120.x. PMID 11252586. S2CID 7862441.
[9] Breitung, Jörg; Chaganty, N. Rao; Daniel, Rhian M.; Kenward, Michael G.; Lechner, Michael; Martus, Peter; Sabo, Roy T.; Wang, You-Gan; Zorn, Christopher (2010). "Discussion of 'Generalized Estimating Equations: Notes on the Choice of the Working Correlation Matrix'". Methods of Information in Medicine. 49 (5): 426–432. doi:10.1055/s-0038-1625133. S2CID 3213776.
[10] Sarah J. Ratcliffe; Justine Shults (2008). "GEEQBOX: A MATLAB Toolbox for Generalized Estimating Equations and Quasi-Least Squares". Journal of Statistical Software. 25 (14): 1–14.
[11] "The GENMOD Procedure". The SAS Institute.
[12] "IBM SPSS Advanced Statistics". IBM SPSS website.
[13] "Stata's implementation of GEE" (PDF). Stata website.
[14] "glmtoolbox: Set of Tools to Data Analysis using Generalized Linear Models". CRAN. 10 October 2023.
[15] "gee: Generalized Estimation Equation solver". CRAN. 7 November 2019.
[16] "geepack: Generalized Estimating Equation Package". CRAN. 18 December 2020.
[17] "multgee: GEE solver for correlated nominal or ordinal multinomial responses using a local odds ratios parameterization". CRAN. 13 May 2021.
[18] Shedden, Kerby (23 June 2022). "Generalized Estimating Equations in Julia". GitHub. Retrieved 24 June 2022.
[19] "Generalized Estimating Equations — statsmodels".
[20] Andreas Ziegler; Ulrike Grömping (1998). "The generalised estimating equations: a comparison of procedures available in commercial statistical software packages". Biometrical Journal. 40 (3): 245–260. doi:10.1002/(sici)1521-4036(199807)40:3<245::aid-bimj245>3.0.co;2-n.
[21] Nicholas J. Horton; Stuart R. Lipsitz (1999). "Review of software to fit generalized estimating equation regression models". The American Statistician. 53 (2): 160–169. CiteSeerX 10.1.1.22.9325. doi:10.1080/00031305.1999.10474451.
[22] Nazanin Nooraee; Geert Molenberghs; Edwin R. van den Heuvel (2014). "GEE for longitudinal ordinal data: Comparing R-geepack, R-multgee, R-repolr, SAS-GENMOD, SPSS-GENLIN" (PDF). Computational Statistics & Data Analysis. 77: 70–83. doi:10.1016/j.csda.2014.03.009. S2CID 15063953.

External links

Advanced Topics I - Generalized Estimating Equations (GEE)
