Spike-and-slab variable selection

Spike-and-slab regression is a Bayesian variable selection technique that is particularly useful when the number of possible predictors is larger than the number of observations.[1]

The spike-and-slab model was first proposed by Mitchell & Beauchamp (1988).[2] The approach was substantially developed by Madigan & Raftery (1994)[3] and George & McCulloch (1997).[4] Further refinements were contributed by Ishwaran & Rao (2005).[5]

Model description

Suppose we have P possible predictors in some model. The vector γ has length P and consists of zeros and ones; it indicates whether each variable is included in the regression (γi = 1) or not (γi = 0). If no specific prior information on the inclusion probabilities of particular variables is available, a Bernoulli prior distribution on each element of γ is a common default choice.[6] Conditional on a predictor being in the regression, we specify a prior distribution for the model coefficient that corresponds to that variable (β). A common choice at this step is a Normal prior with mean zero and a large variance calculated based on $(X^{T}X)^{-1}$ (where $X$ is the design matrix of explanatory variables of the model).[7]
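
Written out, the prior described above can be sketched as follows. This is one possible notation rather than the exact parameterization of the cited sources: the inclusion probabilities π_i, the scale factor τ², and the submatrix X_γ (the columns of the design matrix with γ_i = 1) are symbols introduced here for illustration, and some formulations additionally scale the covariance by the residual variance.

```latex
\begin{aligned}
\gamma_i &\sim \mathrm{Bernoulli}(\pi_i), \qquad i = 1, \dots, P, \\
\beta_\gamma \mid \gamma &\sim \mathcal{N}\!\left(0,\; \tau^{2}\,\bigl(X_\gamma^{T} X_\gamma\bigr)^{-1}\right), \\
\beta_i &= 0 \quad \text{whenever } \gamma_i = 0 .
\end{aligned}
```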

A draw of γ from its prior distribution is a list of the variables included in the regression. Conditional on this set of selected variables, we take a draw from the prior distribution of the regression coefficients (if γi = 1 then βi ≠ 0 and if γi = 0 then βi = 0). βγ denotes the subset of β for which γi = 1. In the next step, we calculate a posterior probability distribution for both inclusion and coefficients by applying a standard statistical procedure.[8] All steps of the described algorithm are repeated thousands of times using a Markov chain Monte Carlo (MCMC) technique. As a result, we obtain a posterior distribution of γ (variable inclusion in the model), of β (regression coefficient values), and of the corresponding prediction of y.
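
The first two steps (drawing γ, then drawing the coefficients of the included variables) can be illustrated with a minimal Python sketch using NumPy. It covers only a draw from the prior, not the posterior update or the MCMC loop. The function name `draw_from_prior`, the shared inclusion probability `pi`, and the variance scale `tau2` are hypothetical names chosen for this example, and the sketch assumes the selected columns of X are linearly independent so that the matrix inverse exists.

```python
import numpy as np

def draw_from_prior(X, pi=0.5, tau2=100.0, rng=None):
    """Draw (gamma, beta) once from a simple spike-and-slab prior.

    pi and tau2 are illustrative assumptions, not values from the cited papers.
    """
    rng = np.random.default_rng() if rng is None else rng
    p = X.shape[1]

    # Spike part: each predictor is included independently with probability pi.
    gamma = rng.random(p) < pi

    # Excluded predictors keep a coefficient of exactly zero.
    beta = np.zeros(p)

    k = int(gamma.sum())
    if k > 0:
        # Slab part: zero-mean Normal with covariance tau2 * (X_g^T X_g)^{-1}.
        X_g = X[:, gamma]
        cov = tau2 * np.linalg.inv(X_g.T @ X_g)
        cov = (cov + cov.T) / 2.0  # remove tiny asymmetries from rounding
        beta[gamma] = rng.multivariate_normal(np.zeros(k), cov)

    return gamma.astype(int), beta

# Example with simulated data: 50 observations, 5 candidate predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
gamma, beta = draw_from_prior(X, rng=rng)
print(gamma)  # vector of 0/1 inclusion indicators
print(beta)   # zero wherever gamma is 0
```

When the number of selected predictors exceeds the number of observations, the inverse above does not exist, so a practical implementation would need some form of regularization at that step.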

The model takes its name (spike-and-slab) from the shapes of the two prior distributions. The "spike" is the point mass at zero, that is, the prior probability that a particular coefficient in the model is exactly zero. The "slab" is the diffuse prior distribution for the values of the regression coefficients that are included.

An advantage of Bayesian variable selection techniques is that they are able to make use of prior knowledge about the model. In the absence of such knowledge, some reasonable default values can be used; to quote Scott and Varian (2013): "For the analyst who prefers simplicity at the cost of some reasonable assumptions, useful prior information can be reduced to an expected model size, an expected R², and a sample size ν determining the weight given to the guess at R²."[6] Some researchers suggest the following default values: R² = 0.5, ν = 0.01, and π = 0.5 (parameter of a prior Bernoulli distribution).[6]
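
As a rough illustration of how these three quantities might be turned into prior settings, the following Python sketch converts an expected model size into an inclusion probability π and an expected R² into a guess at the residual standard deviation. The function name, the dictionary keys, and the specific mapping (π as expected model size divided by P, and a residual variance guess of (1 − R²) times the sample variance of y, weighted by ν) are assumptions made for this example and should be checked against the cited source before use.

```python
from statistics import variance

def default_prior_settings(y, n_predictors, expected_model_size=None,
                           expected_r2=0.5, prior_df=0.01):
    """Translate simple prior guesses into prior settings (illustrative only)."""
    if expected_model_size is None:
        # Default from the text: every predictor is included with probability 0.5.
        pi = 0.5
    else:
        # Assumed mapping: expected number of included predictors divided by P.
        pi = min(1.0, expected_model_size / n_predictors)

    # Assumed mapping: the model is expected to leave (1 - R^2) of the
    # sample variance of y unexplained.
    sigma_guess = ((1.0 - expected_r2) * variance(y)) ** 0.5

    return {
        "pi": pi,
        "sigma_guess": sigma_guess,
        "prior_df": prior_df,                      # nu: weight given to the guess
        "prior_ss": prior_df * sigma_guess ** 2,   # prior sum of squares
    }

settings = default_prior_settings([1, 2, 3, 4, 5], n_predictors=10,
                                  expected_model_size=2)
print(settings["pi"])  # 0.2
```

A small ν such as 0.01 means the guess at R² carries very little weight relative to the data.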

A possible drawback of the spike-and-slab model is its mathematical complexity (in comparison to linear regression). A deep understanding of the model requires sound knowledge of stochastic processes. On the other hand, some modern statistical software (e.g. R) provides ready-to-use implementations of various Bayesian variable selection models.[9][10][11] In that case, it is enough for a researcher to know the idea of the method, the required model parameters, and the input variables. The analysis of the model outcomes (the distributions of γ and β, and the corresponding predictions of y) can be more challenging than in the linear regression case. The spike-and-slab model produces an inclusion probability for each possible predictor. This can cause difficulties when comparing results with studies that use simple regression, for which usually only regression coefficients with corresponding statistics are available.
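
Given a set of MCMC draws of γ, the inclusion probability of a predictor is usually estimated as the fraction of draws in which its indicator equals one. The short Python sketch below shows this calculation; the function name and the toy draws are made up for illustration and do not come from any of the packages cited above.

```python
def inclusion_probabilities(gamma_draws):
    """Estimate inclusion probabilities from a list of 0/1 indicator vectors."""
    n_draws = len(gamma_draws)
    if n_draws == 0:
        raise ValueError("at least one draw is required")
    n_predictors = len(gamma_draws[0])
    # Fraction of draws in which each predictor was included.
    return [sum(draw[j] for draw in gamma_draws) / n_draws
            for j in range(n_predictors)]

# Four hypothetical draws of gamma for three candidate predictors.
draws = [
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
]
print(inclusion_probabilities(draws))  # [1.0, 0.25, 0.25]
```

In this toy example the first predictor appears in every draw, while the other two appear in a quarter of the draws.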

Spike-and-slab regression is a part of the Bayesian structural time series model, which is used for feature selection, time series forecasting, nowcasting, causal inference, and other applications.

See also