Multilevel regression with poststratification
Multilevel regression with poststratification (MRP) is a statistical technique used to correct model estimates for known differences between a sample population (the population from which the available data were drawn) and a target population (the population for which estimates are wanted).
Poststratification refers to the adjustment step: the final estimate is essentially a weighted average of the estimates for all possible combinations of attributes (for example age and sex), each weighted by its size in the target population. Each combination is sometimes called a "cell". The multilevel regression is the use of a multilevel model to smooth noisy estimates in cells with too little data by drawing on overall or nearby averages.
One application is estimating preferences in sub-regions (e.g., states, individual constituencies) based on individual-level survey data gathered at other levels of aggregation (e.g., national surveys).[1]
Individual seat polls often struggle to reach an adequate sample size, whereas the surveys behind MRP models are typically large enough that even small demographic subgroups (e.g., groups defined by age or cultural background) are adequately sampled; the resulting subgroup estimates can then be used to adjust seat forecasts. Since the mid-2010s, MRP has seen rapid adoption by commercial pollsters and academic election forecasters as a means of producing seat-by-seat or district-by-district estimates from a single large national survey, particularly in the United Kingdom and United States.
Mathematical formulation
Following the MRP model description,[2] assume $Y$ represents a single outcome measurement and the population mean value of $Y$, $\mu_Y$, is the target parameter of interest. In the underlying population, each individual, $i$, belongs to one of $j = 1, 2, \ldots, J$ poststratification cells characterized by a unique set of covariates. The multilevel regression with poststratification model involves the following pair of steps:
MRP step 1 (multilevel regression): The multilevel regression model specifies a linear predictor for the mean $\mu_j$, or the logit transform of the mean in the case of a binary outcome, in poststratification cell $j$,
$$g(\mu_j) = g\left(E[Y_{j[i]}]\right) = \beta_0 + X_j^{T}\beta + \sum_{k=1}^{K} a^{k}_{l[j]},$$
where $g$ is the link function (the identity, or the logit for a binary outcome), $Y_{j[i]}$ is the outcome measurement for respondent $i$ in cell $j$, $\beta_0$ is the fixed intercept, $X_j$ is the unique covariate vector for cell $j$, $\beta$ is a vector of regression coefficients (fixed effects), $a^{k}_{l[j]}$ is the varying coefficient (random effect), and $l[j]$ maps the cell index $j$ to the corresponding category index $l$ of variable $k \in \{1, 2, \ldots, K\}$. All varying coefficients are exchangeable batches with independent normal prior distributions $a^{k}_{l} \sim N(0, \sigma_k^2)$, $l \in \{1, \ldots, L_k\}$.
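As a rough illustration of step 1 for a binary outcome, the sketch below evaluates the linear predictor for every cell and maps it to a probability through the inverse logit. The coefficient values and the choice of two variables (age and sex) are invented for illustration, and the fixed-effect term $X_j^{T}\beta$ is omitted for brevity; in practice the coefficients would come from a fitted multilevel model.

```python
import numpy as np

def inv_logit(x):
    """Inverse of the logit link: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical fitted values (illustrative numbers, not from any real model).
beta0 = -0.2                          # fixed intercept
a_age = np.array([-0.3, 0.0, 0.4])    # varying coefficients for 3 age categories
a_sex = np.array([0.1, -0.1])         # varying coefficients for 2 sex categories

# Enumerate the J = 3 * 2 = 6 poststratification cells. Each row holds the
# category index l[j] of every variable for cell j.
cells = np.array([(age, sex) for age in range(3) for sex in range(2)])

# Linear predictor: intercept plus, for each variable, the varying
# coefficient selected by the cell's category index.
eta = beta0 + a_age[cells[:, 0]] + a_sex[cells[:, 1]]

# Cell-level means on the probability scale.
mu = inv_logit(eta)
```

Each entry of `mu` is the modelled mean for one cell; these are the quantities that the poststratification step later averages.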
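In practice, MRP implementations commonly extend the varying coefficients to include group-level predictors, which substantially improves estimation for geographic units with few respondents.[3] For a state-level random effect $a^{\text{state}}_{s}$, for example, the exchangeable prior $a^{\text{state}}_{s} \sim N(0, \sigma^2_{\text{state}})$ can be replaced with $a^{\text{state}}_{s} \sim N\left(a^{\text{region}}_{r[s]} + \gamma z_s,\ \sigma^2_{\text{state}}\right)$, where $z_s$ is a state-level covariate (such as prior-election vote share or median income), $\gamma$ is its coefficient, and $a^{\text{region}}_{r[s]}$ is a region-level random effect for the region $r[s]$ containing state $s$. This hierarchical structure allows the model to borrow strength not just from the overall mean but from states that resemble state $s$ on observed covariates.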
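A minimal sketch of such a group-level prior follows. It assumes a single state-level covariate with a coefficient named `gamma`; that name and all numeric values are hypothetical and chosen only to show how the prior mean of each state effect is assembled from its region effect and its covariate.

```python
import numpy as np

# Hypothetical inputs for 5 states grouped into 2 regions (illustrative values).
region_of_state = np.array([0, 0, 1, 1, 1])    # region index of each state
a_region = np.array([0.15, -0.10])             # region-level random effects
z = np.array([0.52, 0.47, 0.61, 0.39, 0.50])   # state-level covariate, e.g. prior vote share
gamma = 2.0                                    # coefficient on the covariate (assumed name)
sigma_state = 0.2                              # state-level standard deviation

# Prior mean of each state effect: its region's effect plus the covariate term.
prior_mean = a_region[region_of_state] + gamma * z

# One simulated draw of the state effects from this prior.
rng = np.random.default_rng(1)
a_state = rng.normal(prior_mean, sigma_state)
```

Two states with similar covariate values in the same region share nearly the same prior mean, which is how a sparsely sampled state is pulled toward states that resemble it rather than toward the overall mean.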
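MRP step 2 (poststratification): The poststratification (PS) estimate for the population parameter of interest is
$$\hat{\mu}^{PS} = \frac{\sum_{j=1}^{J} N_j \hat{\mu}_j}{\sum_{j=1}^{J} N_j},$$
where $\hat{\mu}_j$ is the estimated outcome of interest for poststratification cell $j$ and $N_j$ is the size of the $j$-th poststratification cell in the population. Estimates at any subpopulation level $s$ are similarly derived as
$$\hat{\mu}^{PS}_{s} = \frac{\sum_{j \in J_s} N_j \hat{\mu}_j}{\sum_{j \in J_s} N_j},$$
where $J_s$ is the subset of all poststratification cells that comprise $s$.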
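The poststratification step is a population-weighted average, which can be sketched in a few lines. The function name `poststratify`, the cell estimates, and the cell sizes below are illustrative assumptions rather than values from any survey.

```python
import numpy as np

def poststratify(mu_hat, N, mask=None):
    """Population-weighted average of cell estimates.

    mu_hat: estimated outcome for each poststratification cell.
    N:      population size of each cell.
    mask:   optional boolean array selecting the cells of a subpopulation.
    """
    mu_hat = np.asarray(mu_hat, dtype=float)
    N = np.asarray(N, dtype=float)
    if mask is not None:
        mask = np.asarray(mask, dtype=bool)
        mu_hat, N = mu_hat[mask], N[mask]
    return float(np.sum(N * mu_hat) / np.sum(N))

# Illustrative cell estimates and population counts for J = 4 cells.
mu_hat = np.array([0.40, 0.55, 0.60, 0.30])
N = np.array([1000, 3000, 2000, 4000])

overall = poststratify(mu_hat, N)                                  # 0.445
subpop = poststratify(mu_hat, N, mask=[True, True, False, False])  # 0.5125
```

The subpopulation estimate uses only the first two cells, so it is (1000 × 0.40 + 3000 × 0.55) / 4000 = 0.5125.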
Estimation and uncertainty: The MRP model is typically fit by fully Bayesian estimation using Markov chain Monte Carlo, often via Stan or its higher-level interfaces such as rstanarm and brms.[4] Uncertainty in the poststratified estimate is obtained by applying the poststratification step separately to each posterior draw of the cell-level means $\mu_j$, yielding a posterior distribution over $\hat{\mu}^{PS}$ from which credible intervals are constructed.
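The draw-by-draw procedure can be sketched as follows. The posterior draws here are simulated from a normal distribution purely as a stand-in for real MCMC output, and the cell sizes are illustrative; with a fitted model, the `draws` matrix would instead hold the sampled cell-level means.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for MCMC output: S posterior draws of J cell-level means.
# (Simulated here; a fitted model would supply these draws.)
S, J = 4000, 4
center = np.array([0.40, 0.55, 0.60, 0.30])
draws = np.clip(center + rng.normal(0.0, 0.03, size=(S, J)), 0.0, 1.0)

# Illustrative population size of each cell.
N = np.array([1000, 3000, 2000, 4000])

# Apply the poststratification step to every draw: one estimate per draw.
ps_draws = draws @ N / N.sum()

# Posterior summary: point estimate and a 95% credible interval.
point = ps_draws.mean()
lower, upper = np.percentile(ps_draws, [2.5, 97.5])
```

Because the weighted average is computed within each draw, correlations between cell estimates in the posterior carry through to the interval automatically.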
The technique and its advantages
The technique essentially involves two steps. In the first step, individual-level survey data are used to estimate the relationship between types of people, defined by characteristics such as age and race, and individual preferences (i.e., a multilevel regression on the dataset). In the second step, this relationship is combined with data from, for example, censuses on the number of people of each type in each sub-region to estimate the sub-regional preference (a process known as "poststratification").[5] The multilevel component produces estimates for each cell that are a weighted average of the raw cell mean and the overall or group mean, a property known as partial pooling; the weight toward the pooled mean increases as the cell's sample size shrinks, stabilizing estimates in cells with few respondents without discarding the information they do contain.[4] In this way the need to perform surveys at sub-regional level, which can be expensive and impractical in an area (e.g. a country) with many sub-regions (e.g. counties, ridings, or states), is avoided. It also avoids the consistency problems that arise when comparing different surveys performed in different areas.[6][1] Additionally, it allows preferences within a specific locality to be estimated from a survey taken across a wider area that includes relatively few people from the locality in question, or where the sample may be highly unrepresentative.[7]
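Partial pooling can be illustrated with the standard precision-weighted approximation for a multilevel model with no predictors, in which the cell estimate is a compromise between the raw cell mean and the overall mean. The function name, the variance parameters `sigma_y` and `sigma_a`, and the numbers below are assumptions for illustration; a fitted model would estimate the variances from the data.

```python
import numpy as np

def partial_pool(ybar_j, n_j, ybar_all, sigma_y, sigma_a):
    """Precision-weighted average of a cell's raw mean and the overall mean.

    sigma_y: within-cell standard deviation of the outcome.
    sigma_a: between-cell standard deviation of the cell effects.
    """
    n_j = np.asarray(n_j, dtype=float)
    w_data = n_j / sigma_y**2    # precision of the raw cell mean
    w_pool = 1.0 / sigma_a**2    # precision contributed by the group-level prior
    return (w_data * ybar_j + w_pool * ybar_all) / (w_data + w_pool)

# Three cells with the same raw mean (0.8) but very different sample sizes.
est = partial_pool(ybar_j=0.8, n_j=[2, 20, 2000],
                   ybar_all=0.5, sigma_y=0.5, sigma_a=0.1)
# est is approximately [0.522, 0.633, 0.796]
```

The cell with two respondents is pulled almost all the way to the overall mean of 0.5, while the cell with 2000 respondents stays close to its raw mean of 0.8.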
History
The technique was originally developed by Gelman and T. Little in 1997,[8] building upon ideas of Fay and Herriot[9] and R. Little.[10] It was subsequently expanded on by Park, Gelman, and Bafumi in 2004 and 2006. It was proposed for use in estimating US-state-level voter preference by Lax and Phillips in 2009. Warshaw and Rodden subsequently proposed it for use in estimating district-level public opinion in 2012.[1] Later, Wang et al.[11] used survey data from Xbox users to predict the outcome of the 2012 US presidential election. The Xbox gamers were 65% 18- to 29-year-olds and 93% male, while the electorate as a whole was 19% 18- to 29-year-olds and 47% male. Even though the original data were highly biased, after multilevel regression with poststratification the authors obtained estimates that agreed with those from polls using large amounts of random and representative data. Since then it has also been proposed for use in the field of epidemiology.[7]
Use in United Kingdom elections
YouGov used the technique to successfully predict the overall outcome of the 2017 UK general election,[12] correctly predicting the result in 93% of constituencies.[13] In the 2019 election MRP was adopted by other pollsters including Survation.[14] By the 2024 general election, MRP had become a standard method for seat-level forecasting in the UK, with published models from YouGov, Ipsos,[15] Survation, More in Common, Savanta and Focaldata, among others.
Use in United States elections
In the United States, MRP has been used both in academic work on state- and district-level public opinion and in commercial election forecasting. The approach underpins parts of the methodology used by The Economist's presidential forecasting model, developed in collaboration with Andrew Gelman and colleagues, which combines state-level polls with demographic poststratification to produce state-by-state estimates. Democratic data firms such as Catalist have also used MRP-style methods to produce estimates of the composition of the electorate from voter files and survey data.
Limitations and extensions
MRP can be extended to estimate changes in opinion over time.[6] When used to predict elections, it works best relatively close to polling day, after nominations have closed.[16] The accuracy of MRP estimates also depends on the availability and quality of the poststratification frame: if the cell population totals are drawn from an outdated census, or if important explanatory variables are missing from the frame, poststratification cannot fully correct for sample bias.
The method has also drawn critical assessments from some prominent election forecasters. At Harvard's 2018 Political Analytics Conference, FiveThirtyEight founder Nate Silver said that MRP "can be good, but it's overrated too", calling it "the Carmelo Anthony of election polling" and arguing that conventional methods already get forecasters most of the way toward an accurate answer.[17]
Both the "multilevel regression" and "poststratification" ideas of MRP can be generalized. Multilevel regression can be replaced by nonparametric regression[18] or regularized prediction, and poststratification can be generalized to allow for non-census variables, i.e. poststratification totals that are estimated rather than being known.[19] Ghitza and Gelman (2013) extended MRP by allowing structured interactions between demographic and geographic variables — for instance age-by-state or ethnicity-by-state varying coefficients with their own multilevel priors — capturing heterogeneity in behaviour across subgroups that a purely additive specification would miss.[3] For ordinal covariates such as age, education, or income, structured priors such as random walks or Gaussian processes over category indices further improve estimation by encoding the natural ordering of these variables rather than treating categories as exchangeable.[20]
References
- ^ a b c Buttice, Matthew K.; Highton, Benjamin (Autumn 2013). "How Does Multilevel Regression and Poststratification Perform with Conventional National Surveys?" (PDF). Political Analysis. 21 (4): 449–451. doi:10.1093/pan/mpt017. JSTOR 24572674. Archived (PDF) from the original on 12 February 2025. Retrieved 16 February 2025.
- ^ Downes, Marnie; et al. (August 2018). "Multilevel Regression and Poststratification: A Modeling Approach to Estimating Population Quantities From Highly Selected Survey Samples". American Journal of Epidemiology. 187 (8): 1780–1790. doi:10.1093/aje/kwy070. PMID 29635276.
- ^ a b Ghitza, Yair; Gelman, Andrew (2013). "Deep Interactions with MRP: Election Turnout and Voting Patterns Among Small Electoral Subgroups". American Journal of Political Science. 57 (3): 762–776. doi:10.1111/ajps.12004.
- ^ a b Gelman, Andrew; Hill, Jennifer (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. ISBN 978-0-521-68689-1.
- ^ "What is MRP?". Survation. 5 November 2018. Archived from the original on 16 February 2025. Retrieved 31 October 2019.
- ^ a b Gelman, Andrew; Lax, Jeffrey; Phillips, Justin; Gabry, Jonah; Trangucci, Robert (28 August 2018). "Using Multilevel Regression and Poststratification to Estimate Dynamic Public Opinion" (PDF). sites.stat.columbia.edu: 1–3. Archived (PDF) from the original on 16 February 2025. Retrieved 31 October 2019.
- ^ a b Downes, Marnie; Gurrin, Lyle C.; English, Dallas R.; Pirkis, Jane; Currier, Diane; Spittal, Matthew J.; Carlin, John B. (August 2018) [9 April 2018]. "Multilevel Regression and Poststratification: A Modeling Approach to Estimating Population Quantities From Highly Selected Survey Samples". American Journal of Epidemiology. 187 (8). Oxford University Press: 1780–1790. doi:10.1093/aje/kwy070. Retrieved 31 October 2019.
- ^ Gelman, Andrew; Little, Thomas (1997). "Poststratification into many categories using hierarchical logistic regression". Survey Methodology. 23: 127–135.
- ^ Fay, Robert; Herriot, Roger (1979). "Estimates of income for small places: An application of James-Stein procedures to census data". Journal of the American Statistical Association. 74 (366a): 269–277. doi:10.1080/01621459.1979.10482505. JSTOR 2286322.
- ^ Little, Roderick (1993). "Post-stratification: A modeler's perspective". Journal of the American Statistical Association. 88 (423): 1001–1012. doi:10.1080/01621459.1993.10476368. JSTOR 2290792.
- ^ Wang, Wei; Rothschild, David; Goel, Sharad; Gelman, Andrew (2015). "Forecasting elections with non-representative polls" (PDF). International Journal of Forecasting. 31 (3): 980–991. doi:10.1016/j.ijforecast.2014.06.001. Archived (PDF) from the original on 1 November 2020. Retrieved 1 December 2019.
- ^ Revell, Timothy (9 June 2017). "How YouGov's experimental poll correctly called the UK election". New Scientist. Archived from the original on 9 June 2017. Retrieved 31 October 2019.
- ^ Cohen, Daniel (27 September 2019). "'I've never known voters be so promiscuous': the pollsters working to predict the next UK election". The Guardian. Archived from the original on 11 September 2024. Retrieved 31 October 2019.
- ^ Survation 2019 https://www.survation.com/2019-general-election-mrp-predictions-survation-and-dr-chris-hanretty/ Archived 16 February 2025 at the Wayback Machine
- ^ Ipsos 2024 https://www.ipsos.com/en-uk/uk-opinion-polls/ipsos-election-mrp Archived 18 June 2024 at the Wayback Machine
- ^ James, William; MacLellan, Kylie (15 October 2019). "A question of trust: British pollsters battle to call looming election". Reuters. Archived from the original on 31 October 2019. Retrieved 31 October 2019.
- ^ "One election winner, according to Harvard conference: the pollsters". Harvard Gazette. 9 November 2018. Retrieved 23 April 2026.
- ^ Bisbee, James (2019). "BARP: Improving Mister P Using Bayesian Additive Regression Trees". American Political Science Review. 113 (4): 1060–1065. doi:10.1017/S0003055419000480. S2CID 201385400.
- ^ Gelman, Andrew (28 October 2018). "MRP (or RPP) with non-census variables". Statistical Modeling, Causal Inference, and Social Science. Archived from the original on 22 June 2019. Retrieved 1 December 2019.
- ^ Gao, Yuxiang; Kennedy, Lauren; Simpson, Daniel; Gelman, Andrew (2021). "Improving Multilevel Regression and Poststratification with Structured Priors". Bayesian Analysis. 16 (3): 719–744. doi:10.1214/20-BA1223.