# Backfitting algorithm

The backfitting algorithm is a simple iterative procedure used to fit a generalized additive model. It was introduced in 1985 by Leo Breiman and Jerome Friedman along with generalized additive models. In most cases, the backfitting algorithm is equivalent to the Gauss–Seidel method for solving a certain linear system of equations.

## Algorithm

Generalized additive models are a class of non-parametric regression models of the form:

${\displaystyle Y=\alpha +\sum _{i=1}^{p}f_{i}(X_{i})+\epsilon }$

where each ${\displaystyle X_{1},X_{2},\ldots ,X_{p}}$ is a variable in our p-dimensional predictor X, and Y is our outcome variable. ${\displaystyle \epsilon }$ represents our inherent error, which is assumed to have mean zero. The fi represent unspecified smooth functions of a single Xi. Given the flexibility in the fi, we typically do not have a unique solution: α is left unidentifiable. It is common to rectify this by constraining

${\displaystyle \sum _{i=1}^{N}f_{j}(x_{ij})=0\quad \forall j}$

leaving

${\displaystyle \alpha ={\frac {1}{N}}\sum _{i=1}^{N}y_{i}}$

necessarily.
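The identifiability issue can be seen in a small numerical sketch (a hypothetical one-predictor example in NumPy, not from any particular library): a constant can be shifted freely between ${\displaystyle \alpha }$ and ${\displaystyle f_{1}}$ without changing the model, and mean-centering ${\displaystyle f_{1}}$ pins down a unique pair.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 200)
f1 = np.sin(np.pi * x)               # one smooth additive component
alpha = 2.0
mean_y = alpha + f1                  # E[Y | X] for a one-predictor model

# Shifting a constant c between alpha and f1 leaves the fitted surface
# unchanged, so the pair (alpha, f1) is not identifiable on its own:
c = 0.7
shifted = (alpha + c) + (f1 - c)     # identical to mean_y

# Imposing the centering constraint sum_i f1(x_i) = 0 pins everything down:
f1_centered = f1 - f1.mean()
alpha_identified = mean_y.mean()     # equals alpha + mean(f1)
```

After centering, ${\displaystyle \alpha }$ is forced to be the sample mean of the response, exactly as in the constraint above.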

The backfitting algorithm is then:

    Initialize ${\displaystyle {\hat {\alpha }}={\frac {1}{N}}\sum _{i=1}^{N}y_{i},\quad {\hat {f_{j}}}\equiv 0\quad \forall j}$
    Do until ${\displaystyle {\hat {f_{j}}}}$ converge:
        For each predictor j:
            ${\displaystyle {\hat {f_{j}}}\leftarrow Smooth[\lbrace y_{i}-{\hat {\alpha }}-\sum _{k\neq j}{\hat {f_{k}}}(x_{ik})\rbrace _{i=1}^{N}]}$ (backfitting step)
            ${\displaystyle {\hat {f_{j}}}\leftarrow {\hat {f_{j}}}-{\frac {1}{N}}\sum _{i=1}^{N}{\hat {f_{j}}}(x_{ij})}$ (mean centering of estimate)


where ${\displaystyle Smooth}$ is our smoothing operator. This is typically chosen to be a cubic spline smoother but can be any other appropriate fitting operation, such as:

• local polynomial regression
• kernel smoothing methods
• more complex operators, such as surface smoothers for second and higher-order interactions
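A minimal sketch of the algorithm above in NumPy, assuming a crude nearest-neighbour running-mean smoother as a stand-in for a cubic spline smoother (the function names and toy data are illustrative, not from any library):

```python
import numpy as np

def running_mean_smoother(x, r, window=15):
    """Crude nearest-neighbour running mean: a stand-in for a spline smoother."""
    order = np.argsort(x)
    fit = np.empty_like(r)
    for rank, i in enumerate(order):
        lo, hi = max(0, rank - window), min(len(x), rank + window + 1)
        fit[i] = r[order[lo:hi]].mean()
    return fit

def backfit(X, y, smooth, n_iter=100, tol=1e-8):
    """Backfitting for y = alpha + sum_j f_j(X[:, j]) + eps.

    smooth(x, r) returns the smoothed fit of residuals r against x,
    evaluated at the sample points x."""
    N, p = X.shape
    alpha = y.mean()                   # alpha-hat = (1/N) sum_i y_i
    f = np.zeros((N, p))               # f_j-hat initialized to zero
    for _ in range(n_iter):
        f_old = f.copy()
        for j in range(p):
            # partial residuals, leaving out the j-th component
            r = y - alpha - f.sum(axis=1) + f[:, j]
            f[:, j] = smooth(X[:, j], r)
            f[:, j] -= f[:, j].mean()  # mean-centering step
        if np.max(np.abs(f - f_old)) < tol:
            break
    return alpha, f

# Toy additive data: y = 1 + sin(x1) + x2^2 + noise
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(500, 2))
y = 1.0 + np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(500)
alpha, f = backfit(X, y, running_mean_smoother)
```

Each column of `f` holds the centered estimate ${\displaystyle {\hat {f_{j}}}}$ at the sample points, and `alpha` is the sample mean of the response.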

## Motivation

If we consider the problem of minimizing the expected squared error:

${\displaystyle \min _{\alpha ,f_{1},\ldots ,f_{p}}E\left[Y-{\Big (}\alpha +\sum _{j=1}^{p}f_{j}(X_{j}){\Big )}\right]^{2}}$

then, by the theory of projections, there exists a unique solution, given by:

${\displaystyle f_{i}(X_{i})=E\left[Y-{\Big (}\alpha +\sum _{j\neq i}^{p}f_{j}(X_{j}){\Big )}\,{\Big |}\,X_{i}\right]}$

for all i = 1, 2, ... p.

This gives the matrix interpretation: ${\displaystyle {\begin{pmatrix}I&P_{1}&\cdots &P_{1}\\P_{2}&I&\cdots &P_{2}\\\vdots &&\ddots &\vdots \\P_{p}&\cdots &P_{p}&I\end{pmatrix}}{\begin{pmatrix}f_{1}(X_{1})\\f_{2}(X_{2})\\\vdots \\f_{p}(X_{p})\end{pmatrix}}={\begin{pmatrix}P_{1}Y\\P_{2}Y\\\vdots \\P_{p}Y\end{pmatrix}}}$

where ${\displaystyle P_{i}(\cdot )=E(\cdot |X_{i})}$. In this context we can imagine a smoother matrix, ${\displaystyle S_{i}}$, which approximates our ${\displaystyle P_{i}}$ and gives an estimate, ${\displaystyle S_{i}Y}$, of ${\displaystyle E(Y|X_{i})}$:

${\displaystyle {\begin{pmatrix}I&S_{1}&\cdots &S_{1}\\S_{2}&I&\cdots &S_{2}\\\vdots &&\ddots &\vdots \\S_{p}&\cdots &S_{p}&I\end{pmatrix}}{\begin{pmatrix}f_{1}\\f_{2}\\\vdots \\f_{p}\end{pmatrix}}={\begin{pmatrix}S_{1}Y\\S_{2}Y\\\vdots \\S_{p}Y\end{pmatrix}}}$

or in abbreviated form

${\displaystyle {\hat {S}}f=QY}$

An exact solution of this system is infeasible to calculate for large ${\displaystyle Np}$, so the iterative technique of backfitting is used. We take initial guesses ${\displaystyle f_{i}^{(0)}}$ and update each ${\displaystyle {\hat {f_{j}}}}$ in turn to be the smoothed fit for the residuals of all the others:

${\displaystyle {\hat {f_{j}}}\leftarrow Smooth[\lbrace y_{i}-{\hat {\alpha }}-\sum _{k\neq j}{\hat {f_{k}}}(x_{ik})\rbrace _{i=1}^{N}]}$

Looking at the abbreviated form, it is easy to see that the backfitting algorithm is equivalent to the Gauss–Seidel method for linear smoothing operators ${\displaystyle S}$.
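The Gauss–Seidel connection can be sketched directly: one sweep over the blocks of the system, updating ${\displaystyle f_{j}\leftarrow S_{j}(Y-\sum _{k\neq j}f_{k})}$, is exactly one backfitting cycle. In the sketch below (hypothetical helper names) the smoothers are deliberately simple shrinking operators ${\displaystyle S_{j}=0.5I}$, chosen so the fixed point ${\displaystyle f_{1}=f_{2}=Y/3}$ is known in closed form:

```python
import numpy as np

def backfit_gauss_seidel(S_list, y, n_iter=60):
    """Gauss-Seidel sweeps on the block smoother system: each sweep applies
    f_j <- S_j (y - sum_{k != j} f_k), i.e. one backfitting cycle."""
    f = [np.zeros_like(y) for _ in S_list]
    for _ in range(n_iter):
        for j, S in enumerate(S_list):
            partial = y - sum(f[k] for k in range(len(f)) if k != j)
            f[j] = S @ partial
    return f

# Toy smoothers S_1 = S_2 = 0.5 I: the system f1 = 0.5(y - f2),
# f2 = 0.5(y - f1) has the unique fixed point f1 = f2 = y / 3.
y = np.array([3.0, -1.5, 6.0])
S = 0.5 * np.eye(3)
f = backfit_gauss_seidel([S, S], y)
```

With any linear smoother matrices in place of ${\displaystyle 0.5I}$, the same sweep structure applies; convergence then depends on the norms of the smoothers, as discussed below for the two-dimensional case.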

## Explicit Derivation for Two Dimensions

For the two dimensional case, we can formulate the backfitting algorithm explicitly. We have:

${\displaystyle f_{1}=S_{1}(Y-f_{2}),f_{2}=S_{2}(Y-f_{1})}$

If we denote by ${\displaystyle {\hat {f}}_{1}^{(i)}}$ the estimate of ${\displaystyle f_{1}}$ at the ${\displaystyle i}$-th updating step, the backfitting steps are

${\displaystyle {\hat {f}}_{1}^{(i)}=S_{1}[Y-{\hat {f}}_{2}^{(i-1)}],\quad {\hat {f}}_{2}^{(i)}=S_{2}[Y-{\hat {f}}_{1}^{(i)}]}$

By induction we get

${\displaystyle {\hat {f}}_{1}^{(i)}=Y-\sum _{\alpha =0}^{i-1}(S_{1}S_{2})^{\alpha }(I-S_{1})Y-(S_{1}S_{2})^{i-1}S_{1}{\hat {f}}_{2}^{(0)}}$

and

${\displaystyle {\hat {f}}_{2}^{(i)}=S_{2}\sum _{\alpha =0}^{i-1}(S_{1}S_{2})^{\alpha }(I-S_{1})Y+S_{2}(S_{1}S_{2})^{i-1}S_{1}{\hat {f}}_{2}^{(0)}}$

If we assume our constant ${\displaystyle \alpha }$ is zero and we set ${\displaystyle {\hat {f}}_{2}^{(0)}=0}$ then we get

${\displaystyle {\hat {f}}_{1}^{(i)}=[I-\sum _{\alpha =0}^{i-1}(S_{1}S_{2})^{\alpha }(I-S_{1})]Y}$

${\displaystyle {\hat {f}}_{2}^{(i)}=[S_{2}\sum _{\alpha =0}^{i-1}(S_{1}S_{2})^{\alpha }(I-S_{1})]Y}$

This converges if ${\displaystyle \|S_{1}S_{2}\|<1}$.
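The recursion and its closed form can be verified numerically. The sketch below uses synthetic symmetric "smoothers" with eigenvalues below 0.8 (an assumption made so that ${\displaystyle \|S_{1}S_{2}\|<1}$ is guaranteed) rather than real spline matrices, and checks that iterating the two updates reproduces the series expressions above with ${\displaystyle {\hat {f}}_{2}^{(0)}=0}$:

```python
import numpy as np

rng = np.random.default_rng(42)
N = 20

def random_smoother(rng, N, max_eig=0.8):
    """Synthetic symmetric smoother with eigenvalues in [0, max_eig]."""
    Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
    return Q @ np.diag(rng.uniform(0.0, max_eig, N)) @ Q.T

S1, S2 = random_smoother(rng, N), random_smoother(rng, N)
Y = rng.standard_normal(N)
I = np.eye(N)

# Iterative backfitting with f2^(0) = 0
f1, f2 = np.zeros(N), np.zeros(N)
n_iter = 50
for _ in range(n_iter):
    f1 = S1 @ (Y - f2)
    f2 = S2 @ (Y - f1)

# Closed-form expressions after n_iter steps
M = sum(np.linalg.matrix_power(S1 @ S2, a) for a in range(n_iter)) @ (I - S1)
f1_closed = (I - M) @ Y
f2_closed = S2 @ M @ Y
```

The identity between the iterates and the partial sums holds exactly at every step; the norm condition only governs whether the series converges as the number of steps grows.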

## Issues

The choice of when to stop the algorithm is arbitrary, and it is hard to know a priori how long reaching a specific convergence threshold will take. Also, the final model depends on the order in which the predictor variables ${\displaystyle X_{i}}$ are fit.

As well, the solution found by the backfitting procedure is non-unique. If ${\displaystyle b}$ is a vector such that ${\displaystyle {\hat {S}}b=0}$ from above, then if ${\displaystyle {\hat {f}}}$ is a solution, ${\displaystyle {\hat {f}}+\alpha b}$ is also a solution for any ${\displaystyle \alpha \in \mathbb {R} }$. A modification of the backfitting algorithm involving projections onto the eigenspace of ${\displaystyle S}$ can remedy this problem.

## Modified Algorithm

We can modify the backfitting algorithm so that it provides a unique solution. Let ${\displaystyle {\mathcal {V}}_{1}(S_{i})}$ be the space spanned by all the eigenvectors of ${\displaystyle S_{i}}$ that correspond to eigenvalue 1. Then any ${\displaystyle b}$ satisfying ${\displaystyle {\hat {S}}b=0}$ has ${\displaystyle b_{i}\in {\mathcal {V}}_{1}(S_{i})\forall i=1,\dots ,p}$ and ${\displaystyle \sum _{i=1}^{p}b_{i}=0.}$ Now if we take ${\displaystyle A}$ to be a matrix that projects orthogonally onto ${\displaystyle {\mathcal {V}}_{1}(S_{1})+\dots +{\mathcal {V}}_{1}(S_{p})}$, we get the following modified backfitting algorithm:

    Initialize ${\displaystyle {\hat {\alpha }}={\frac {1}{N}}\sum _{i=1}^{N}y_{i},\quad {\hat {f_{j}}}\equiv 0\quad \forall j,\quad {\hat {f_{+}}}={\hat {\alpha }}+{\hat {f_{1}}}+\dots +{\hat {f_{p}}}}$
    Do until ${\displaystyle {\hat {f_{j}}}}$ converge:
        Regress ${\displaystyle y-{\hat {f_{+}}}}$ onto the space ${\displaystyle {\mathcal {V}}_{1}(S_{1})+\dots +{\mathcal {V}}_{1}(S_{p})}$, setting ${\displaystyle a=A(Y-{\hat {f_{+}}})}$
        For each predictor j:
            Apply the backfitting update to ${\displaystyle (Y-a)}$ using the smoothing operator ${\displaystyle (I-A)S_{j}}$, yielding new estimates for ${\displaystyle {\hat {f_{j}}}}$
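The key construction, building the projector ${\displaystyle A}$, can be sketched under the assumption that the smoothers are symmetric, so their eigenvalue-1 eigenspaces are computable by an eigendecomposition (the helper names are illustrative; projection smoothers, i.e. hat matrices of linear fits, are used here because their eigenvalues are exactly 0 or 1):

```python
import numpy as np

def eigenspace_one(S, tol=1e-8):
    """Orthonormal basis for V1(S), the eigenvalue-1 eigenspace of symmetric S."""
    w, V = np.linalg.eigh(S)
    return V[:, np.abs(w - 1.0) < tol]

def projector_onto_sum(bases, tol=1e-10):
    """Orthogonal projector A onto V1(S_1)+...+V1(S_p), via an SVD of the
    stacked bases (the SVD handles overlap between the subspaces)."""
    B = np.hstack(bases)
    U, s, _ = np.linalg.svd(B, full_matrices=False)
    U = U[:, s > tol]
    return U @ U.T

def hat_matrix(Z):
    """Projection smoother (hat matrix) for least squares on the columns of Z."""
    return Z @ np.linalg.solve(Z.T @ Z, Z.T)

rng = np.random.default_rng(3)
N = 50
x1, x2 = rng.standard_normal(N), rng.standard_normal(N)
ones = np.ones(N)

# Both smoothers reproduce constants, so their V1 spaces share that direction.
S1 = hat_matrix(np.column_stack([ones, x1]))
S2 = hat_matrix(np.column_stack([ones, x2]))

A = projector_onto_sum([eigenspace_one(S1), eigenspace_one(S2)])
```

Here ${\displaystyle A}$ projects onto the three-dimensional space spanned by the constant, ${\displaystyle x_{1}}$, and ${\displaystyle x_{2}}$; the overlap in the constant direction is exactly the source of the non-uniqueness the modified algorithm removes.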


## References

• Breiman, L.; Friedman, J. H. (1985). "Estimating optimal transformations for multiple regression and correlation (with discussion)". Journal of the American Statistical Association. 80 (391): 580–619.
• Hastie, T. J.; Tibshirani, R. J. (1990). Generalized Additive Models. Monographs on Statistics and Applied Probability. 43.
• Härdle, Wolfgang; et al. (June 9, 2004). "Backfitting". Retrieved 2009-11-15.