Kernel smoother: Difference between revisions

Revision as of 06:53, 21 February 2014

A kernel smoother is a statistical technique for estimating a real-valued function $f(X)\,\,\left(X\in \mathbb {R} ^{p}\right)$ from its noisy observations when no parametric model for the function is known. The estimated function is smooth, and the level of smoothness is set by a single parameter.

This technique is most appropriate for visualizing low-dimensional (p < 3) data. In effect, the kernel smoother represents a set of irregular data points as a smooth line or surface.

Definitions

Let $K_{h_{\lambda }}(X_{0},X)$ be a kernel defined by

$K_{h_{\lambda }}(X_{0},X)=D\left({\frac {\left\|X-X_{0}\right\|}{h_{\lambda }(X_{0})}}\right)$

where:

$X,X_{0}\in \mathbb {R} ^{p}$
$\left\|\cdot \right\|$ is the Euclidean norm
$h_{\lambda }(X_{0})$ is a parameter (kernel radius)
D(t) is typically a positive real-valued function whose value decreases (or at least does not increase) as the distance between X and X₀ increases.

Popular kernels used for smoothing include the parabolic (Epanechnikov), tricube, and Gaussian kernels.

Let ${\hat {Y}}(X):\mathbb {R} ^{p}\to \mathbb {R}$ be a continuous function of X. For each $X_{0}\in \mathbb {R} ^{p}$ , the Nadaraya-Watson kernel-weighted average (a smooth estimate of Y(X)) is defined by

${\hat {Y}}(X_{0})={\frac {\sum \limits _{i=1}^{N}{K_{h_{\lambda }}(X_{0},X_{i})Y(X_{i})}}{\sum \limits _{i=1}^{N}{K_{h_{\lambda }}(X_{0},X_{i})}}}$

where:

N is the number of observed points
$Y(X_{i})$ are the observations at the points $X_{i}$.
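
The kernel-weighted average defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the choice of the parabolic (Epanechnikov) profile for D(t), and the toy data are hypothetical and do not come from the article.

```python
import math


def epanechnikov(t):
    """Parabolic profile D(t): positive for |t| < 1 and zero outside (assumed choice)."""
    return 0.75 * (1.0 - t * t) if abs(t) < 1.0 else 0.0


def nadaraya_watson(x0, xs, ys, radius, profile=epanechnikov):
    """Kernel-weighted average at x0 for points in R^p given as tuples.

    `radius` is a callable returning h(x0), mirroring h_lambda(X0) in the text.
    """
    h = radius(x0)
    # K(x0, x_i) = D(||x_i - x0|| / h(x0)) with the Euclidean norm.
    weights = [profile(math.dist(x0, x) / h) for x in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no observations fall inside the kernel radius")
    return sum(w * y for w, y in zip(weights, ys)) / total


# Hypothetical 2-D data: the four corners of the unit square.
xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
ys = [0.0, 1.0, 1.0, 2.0]

# The query point is equidistant from all corners, so every observation
# receives the same weight and the estimate is the plain mean of ys.
estimate = nadaraya_watson((0.5, 0.5), xs, ys, radius=lambda x0: 1.0)
```

Passing the radius as a callable keeps the sketch general: a constant function gives a fixed window, while a function of the query point gives an adaptive one.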

In the following sections, we describe some particular cases of kernel smoothers.

Gaussian kernel smoother

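One popular and simple approach is to use a kernel function to weight the training data. The resulting estimator is a weighted average of the training outputs, and hence linear in them, which turns the nonlinear regression (smoothing) problem into a linear one. The equation below gives the general form of this kernel regression estimator.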

$y^{*}={\frac {\sum _{i=1}^{N}K(x^{*},x_{i})y_{i}}{\sum _{i=1}^{N}K(x^{*},x_{i})}}$

Here $x_{i}$ is the i-th training input, $y_{i}$ is the i-th training output, and K is a kernel function; $x^{*}$ is the query point and $y^{*}$ is the predicted output.

The Gaussian kernel, also known as the radial basis function (RBF) kernel, is one of the most common kernels. It is given by the equation below.

$K(x^{*},x_{i})=\exp \left(-{\frac {(x^{*}-x_{i})^{2}}{2b^{2}}}\right)$

Here, b is the length scale for the input space.
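
As a minimal sketch of the two equations above for one-dimensional inputs, the following Python code evaluates the Gaussian kernel and the resulting weighted average. The function names and the training data (samples of a sine curve) are assumptions made for illustration only.

```python
import math


def gaussian_kernel(x_query, x_i, b):
    """K(x*, x_i) = exp(-(x* - x_i)^2 / (2 b^2)) for scalar inputs."""
    return math.exp(-((x_query - x_i) ** 2) / (2.0 * b ** 2))


def gaussian_kernel_regression(x_query, xs, ys, b):
    """Predicted output y* at the query point x* as a kernel-weighted average."""
    weights = [gaussian_kernel(x_query, x_i, b) for x_i in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)


# Hypothetical training data: sin(x) sampled on a coarse grid over [0, 6].
xs = [0.5 * i for i in range(13)]
ys = [math.sin(x) for x in xs]

# Prediction at a query point that is not in the training set.
y_star = gaussian_kernel_regression(1.6, xs, ys, b=0.3)
```

Because the weights are positive and normalized, the prediction always lies between the smallest and largest training output. A small length scale b makes the estimate follow the nearest training points closely, while a large b averages over a wide range and gives a flatter curve.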

Matlab code for Gaussian kernel regression

Matlab code for Gaussian kernel regression can be downloaded from the following website: http://youngmok.com/gaussian-kernel-regression-with-matlab-code/

Nearest neighbor smoother

The idea of the nearest neighbor smoother is the following. For each point X₀, take the m nearest neighbors and estimate Y(X₀) by averaging the values of these neighbors.

Formally, $h_{m}(X_{0})=\left\|X_{0}-X_{[m]}\right\|$ , where $X_{[m]}$ is the m-th closest neighbor of X₀, and

$D(t)={\begin{cases}1/m&{\text{if }}|t|\leq 1\\0&{\text{otherwise}}\end{cases}}$
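
A short Python sketch of this smoother for one-dimensional inputs follows. The function name and the sample data are hypothetical. One simplifying assumption: the code always averages exactly m points, whereas the formal definition with |t| ≤ 1 would include every point tied at the m-th distance.

```python
def nearest_neighbor_smoother(x0, xs, ys, m):
    """Average the observations at the m points closest to x0 (1-D inputs)."""
    if not 1 <= m <= len(xs):
        raise ValueError("m must be between 1 and the number of observations")
    # Sort indices by distance to x0; ties keep their original order.
    order = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    neighbors = order[:m]
    return sum(ys[i] for i in neighbors) / m


# Hypothetical observations on an integer grid.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]

# The three nearest points to 2.2 are x = 2, 3 and 1, so the estimate
# is the mean of their observations 2, 5 and 3.
estimate = nearest_neighbor_smoother(2.2, xs, ys, m=3)
```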

Example:

In this example, X is one-dimensional. For each X₀, ${\hat {Y}}(X_{0})$ is the average of the 16 points closest to X₀ (shown in red). The resulting curve is not smooth enough, because the estimate jumps whenever the set of 16 nearest neighbors changes.

Kernel average smoother

The idea of the kernel average smoother is the following. For each data point X₀, choose a constant distance λ (the kernel radius, or the window width for p = 1), and compute a weighted average over all data points closer than $\lambda$ to X₀, with points closer to X₀ receiving higher weights.

Formally, $h_{\lambda }(X_{0})=\lambda ={\text{constant}},$ and D(t) is one of the popular kernels.

Example:

For each X₀ the window width is constant, and the weight of each point in the window is shown schematically by the yellow figure in the graph. The estimate is smooth, but it is biased at the boundary points. The reason is that, when X₀ is close enough to the boundary, the window contains unequal numbers of points to the left and right of X₀.
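
The boundary bias can be reproduced numerically with a small Python sketch. The parabolic (Epanechnikov) profile, the function names, and the noiseless test line Y = 2X are assumptions chosen to make the effect easy to see; they are not taken from the article's example.

```python
def epanechnikov(t):
    """Parabolic profile D(t): positive for |t| < 1 and zero outside (assumed choice)."""
    return 0.75 * (1.0 - t * t) if abs(t) < 1.0 else 0.0


def kernel_average(x0, xs, ys, lam):
    """Kernel average smoother with a constant window radius lam (1-D inputs)."""
    weights = [epanechnikov((x - x0) / lam) for x in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no observations fall inside the window")
    return sum(w * y for w, y in zip(weights, ys)) / total


# Noiseless observations of the line Y = 2X on an even grid over [0, 1].
xs = [i / 20 for i in range(21)]
ys = [2.0 * x for x in xs]

# In the interior the window is symmetric, so the estimate is close to the truth (1.0).
interior = kernel_average(0.5, xs, ys, lam=0.2)

# At the left boundary all neighbors lie to the right of X0 = 0, so the
# estimate is pulled above the true value of 0.
boundary = kernel_average(0.0, xs, ys, lam=0.2)
```

Even without noise, the estimate at the boundary is noticeably above the true value, while the interior estimate is essentially exact. This is a consequence of averaging over a one-sided window, not of the particular kernel chosen.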

Local linear regression

In the previous sections we assumed that the underlying function Y(X) is locally constant, which is why a weighted average could be used for the estimate. The idea of local linear regression is to fit a straight line locally (or a hyperplane in higher dimensions) rather than a constant (a horizontal line). After the line is fitted, the estimate ${\hat {Y}}(X_{0})$ is the value of this line at X₀. Repeating this procedure for each X₀ yields the estimated function ${\hat {Y}}(X)$ . As in the previous section, the window width is constant: $h_{\lambda }(X_{0})=\lambda ={\text{constant}}.$ Formally, local linear regression is computed by solving a weighted least squares problem.

For one dimension (p = 1):

${\begin{gathered}\min _{\alpha (X_{0}),\beta (X_{0})}\sum \limits _{i=1}^{N}{K_{h_{\lambda }}(X_{0},X_{i})\left(Y(X_{i})-\alpha (X_{0})-\beta (X_{0})X_{i}\right)^{2}}\\\Downarrow \\{\hat {Y}}(X_{0})=\alpha (X_{0})+\beta (X_{0})X_{0}\end{gathered}}$

The closed form solution is given by:

${\hat {Y}}(X_{0})=\left(1,X_{0}\right)\left(B^{T}W(X_{0})B\right)^{-1}B^{T}W(X_{0})y$

where:

$y=\left(Y(X_{1}),\dots ,Y(X_{N})\right)^{T}$
$W(X_{0})=\operatorname {diag} \left(K_{h_{\lambda }}(X_{0},X_{i})\right)_{N\times N}$
$B^{T}=\left({\begin{matrix}1&1&\dots &1\\X_{1}&X_{2}&\dots &X_{N}\\\end{matrix}}\right)$
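
For p = 1 the matrix $B^{T}W(X_{0})B$ is only 2 × 2, so the closed form can be evaluated by hand. The Python sketch below does this under a few assumptions: the function names, the parabolic (Epanechnikov) profile, and the noiseless test line Y = 2X + 1 are illustrative choices, not part of the article.

```python
def epanechnikov(t):
    """Parabolic profile D(t): positive for |t| < 1 and zero outside (assumed choice)."""
    return 0.75 * (1.0 - t * t) if abs(t) < 1.0 else 0.0


def local_linear(x0, xs, ys, lam):
    """Local linear regression estimate at x0 for 1-D inputs."""
    w = [epanechnikov((x - x0) / lam) for x in xs]
    # Entries of B^T W B = [[s0, s1], [s1, s2]] and of B^T W y = [t0, t1].
    s0 = sum(w)
    s1 = sum(wi * x for wi, x in zip(w, xs))
    s2 = sum(wi * x * x for wi, x in zip(w, xs))
    t0 = sum(wi * y for wi, y in zip(w, ys))
    t1 = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    det = s0 * s2 - s1 * s1
    if det == 0.0:
        raise ValueError("need at least two distinct points inside the window")
    # Solve the 2x2 system with the explicit inverse of B^T W B.
    alpha = (s2 * t0 - s1 * t1) / det
    beta = (s0 * t1 - s1 * t0) / det
    return alpha + beta * x0


# Noiseless observations of the line Y = 2X + 1 on an even grid over [0, 1].
xs = [i / 20 for i in range(21)]
ys = [2.0 * x + 1.0 for x in xs]

# Estimates at the left boundary and in the interior of the data.
boundary_fit = local_linear(0.0, xs, ys, lam=0.2)
interior_fit = local_linear(0.5, xs, ys, lam=0.2)
```

On exactly linear data the weighted least squares fit recovers the line itself, so the estimate matches the true value even at X₀ = 0, where only points on one side of X₀ fall inside the window.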

Example:

The resulting function is smooth, and the problem with the biased boundary points is solved.

Local linear regression can be applied in a space of any dimension, though the question of what constitutes a local neighborhood becomes more complicated. It is common to fit the local linear regression using the k training points nearest to a test point. This can lead to high variance of the fitted function. To bound the variance, the set of training points should contain the test point in its convex hull (see the Gupta et al. reference).

Local polynomial regression

Instead of fitting locally linear functions, one can fit polynomial functions.

For p=1, one should minimize:

${\underset {\alpha (X_{0}),\beta _{j}(X_{0}),j=1,...,d}{\mathop {\min } }}\,\sum \limits _{i=1}^{N}{K_{h_{\lambda }}(X_{0},X_{i})\left(Y(X_{i})-\alpha (X_{0})-\sum \limits _{j=1}^{d}{\beta _{j}(X_{0})X_{i}^{j}}\right)^{2}}$

with ${\hat {Y}}(X_{0})=\alpha (X_{0})+\sum \limits _{j=1}^{d}{\beta _{j}(X_{0})X_{0}^{j}}$
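
The minimization for p = 1 can be sketched in plain Python by forming and solving the weighted normal equations. Several choices here are assumptions made for illustration: the function names, the Gaussian profile, the test data, and the use of the centered basis $(X_{i}-X_{0})^{j}$ instead of $X_{i}^{j}$. The centered basis spans the same polynomials, so the fitted value at X₀ is unchanged, and it equals the intercept of the centered fit.

```python
import math


def gaussian_profile(t):
    """Gaussian profile D(t) = exp(-t^2 / 2) (assumed choice)."""
    return math.exp(-0.5 * t * t)


def solve_linear_system(a, rhs):
    """Solve a z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular normal equations: too few points in the window")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z


def local_polynomial(x0, xs, ys, lam, degree):
    """Local polynomial regression estimate at x0 for 1-D inputs."""
    w = [gaussian_profile((x - x0) / lam) for x in xs]
    # Centered basis 1, (x - x0), ..., (x - x0)^degree for each observation.
    basis = [[(x - x0) ** j for j in range(degree + 1)] for x in xs]
    n = degree + 1
    gram = [[sum(wi * b[r] * b[c] for wi, b in zip(w, basis)) for c in range(n)]
            for r in range(n)]
    moment = [sum(wi * b[r] * y for wi, b, y in zip(w, basis, ys)) for r in range(n)]
    coeffs = solve_linear_system(gram, moment)
    # With a centered basis, the fitted value at x0 is the intercept.
    return coeffs[0]


# Noiseless observations of the quadratic Y = X^2 - X + 1 on a grid over [0, 1].
xs = [i / 20 for i in range(21)]
ys = [x * x - x + 1.0 for x in xs]
fit_at_boundary = local_polynomial(0.0, xs, ys, lam=0.2, degree=2)
fit_in_interior = local_polynomial(0.5, xs, ys, lam=0.2, degree=2)
```

With degree 2 the fit reproduces the quadratic test function up to rounding error, both at the boundary and in the interior. Setting degree to 0 reduces the procedure to the kernel-weighted average, and degree 1 gives local linear regression.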

In the general case (p>1), one should minimize:

${\begin{aligned}&{\hat {\beta }}(X_{0})={\underset {\beta (X_{0})}{\mathop {\arg \min } }}\,\sum \limits _{i=1}^{N}{K_{h_{\lambda }}(X_{0},X_{i})\left(Y(X_{i})-b(X_{i})^{T}\beta (X_{0})\right)^{2}}\\&b(X)=\left({\begin{matrix}1,&X_{1},&X_{2},&\dots ,&X_{1}^{2},&X_{2}^{2},&\dots ,&X_{1}X_{2},&\dots \\\end{matrix}}\right)\\&{\hat {Y}}(X_{0})=b(X_{0})^{T}{\hat {\beta }}(X_{0})\\\end{aligned}}$

References

Li, Q. and Racine, J. S. Nonparametric Econometrics: Theory and Practice. Princeton University Press, 2007. ISBN 0-691-12161-3.
Hastie, T., Tibshirani, R. and Friedman, J. The Elements of Statistical Learning, Chapter 6. Springer, 2001. ISBN 0-387-95284-5.
Gupta, M., Garcia, E. and Chin, E. "Adaptive Local Linear Regression with Application to Printer Color Management." IEEE Trans. Image Processing, 2008.

@@ Line 1: / Line 1: @@
 A '''kernel smoother''' is a [[statistics|statistical]] technique for estimating a real valued [[function (mathematics)|function]] <math>f(X)\,\,\left( X\in \mathbb{R}^{p} \right)</math> by using its noisy observations, when [[non-parametric statistics|no parametric model]] for this function is known. The estimated function is smooth, and the level of smoothness is set by a single parameter.
@@ Line 29: / Line 30: @@
 In the following sections, we describe some particular cases of kernel smoothers.
+==Gaussian Kernel smoother==
+One popular and easy way is to use a kernel function. This kernel function transforms the nonlinear problem to the linear regression (smoothing) problem. The below equation is a general linear kernel regression function.
+:<math> y^* = \frac{\sum^N_{i=1}K(x^*,x_i)y_i}{\sum^N_{i=1}K(x^*,x_i)}</math>
+Here <math>x_i</math> is the i th training data input, <math>y_i</math> is the i th training data output, K is a kernel function. <math>x^*</math> is a query point, <math>y^*</math> is the predicted output.
+The Gaussian Kernel is one of the most common kernel. (Its another name is radial basis function (RBF)) The kernel is expressed with the below equation.
+:<math> K(x^*,x_i)=\exp\left(-\frac{(x^*-x_i)^2}{2b^2}\right) </math>
+Here, b is the length scale for the input space.
+[[File:Gaussian kernel regression.png]]
+==Matlab code for Gaussian Kernel Regression ==
+From the following website, http://youngmok.com/gaussian-kernel-regression-with-matlab-code/ it is possible to download Matlab codes for the Gaussian Kernel Regression.
 ==Nearest neighbor smoother==
