Stein's example is an important result in decision theory which can be stated as follows:

The ordinary decision rule for estimating the mean of a multivariate Gaussian distribution is inadmissible under mean squared error risk in dimension at least 3.
The following is an outline of its proof. The reader is referred to the main article for more information.
Sketched proof
The risk function of the decision rule $d(\mathbf{x})=\mathbf{x}$ is

$$
\begin{aligned}
R(\theta ,d) &= \mathbb{E}_{\theta}\left[|\theta -\mathbf{X}|^{2}\right] \\
&= \int (\theta -\mathbf{x})^{T}(\theta -\mathbf{x})\left(\frac{1}{2\pi}\right)^{n/2}e^{-(1/2)(\theta -\mathbf{x})^{T}(\theta -\mathbf{x})}\,m(dx) \\
&= n.
\end{aligned}
$$
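As a numerical sanity check (a minimal sketch, not part of the original outline; the dimension, seed, and sample size are arbitrary choices), a Monte Carlo estimate of this risk for $\mathbf{X}\sim N(\theta ,I_{n})$ should come out close to $n$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 200_000
theta = rng.normal(size=n)  # an arbitrary true mean

# Draw X ~ N(theta, I_n) and average the squared error of d(x) = x.
X = theta + rng.standard_normal((trials, n))
risk = np.mean(np.sum((X - theta) ** 2, axis=1))
print(risk)  # close to n = 5
```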
Now consider the decision rule

$$d'(\mathbf{x})=\mathbf{x}-\frac{\alpha}{|\mathbf{x}|^{2}}\,\mathbf{x}$$

where $\alpha =n-2$. We will show that $d'$ is a better decision rule than $d$.
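In code, this shrinkage rule (the James–Stein estimator) is short; the sketch below is an illustration assuming a single observation $\mathbf{x}\in \mathbb{R}^{n}$, and the function name is ours:

```python
import numpy as np

def james_stein(x: np.ndarray) -> np.ndarray:
    """d'(x) = x - (n - 2)/|x|^2 * x: shrink x toward the origin."""
    n = x.shape[-1]
    return x - (n - 2) / np.sum(x ** 2, axis=-1, keepdims=True) * x
```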
The risk function is

$$
\begin{aligned}
R(\theta ,d') &= \mathbb{E}_{\theta}\left[\left|\theta -\mathbf{X}+\frac{\alpha}{|\mathbf{X}|^{2}}\mathbf{X}\right|^{2}\right] \\
&= \mathbb{E}_{\theta}\left[|\theta -\mathbf{X}|^{2}+2(\theta -\mathbf{X})^{T}\frac{\alpha}{|\mathbf{X}|^{2}}\mathbf{X}+\frac{\alpha^{2}}{|\mathbf{X}|^{4}}|\mathbf{X}|^{2}\right] \\
&= \mathbb{E}_{\theta}\left[|\theta -\mathbf{X}|^{2}\right]+2\alpha \,\mathbb{E}_{\theta}\left[\frac{(\theta -\mathbf{X})^{T}\mathbf{X}}{|\mathbf{X}|^{2}}\right]+\alpha^{2}\,\mathbb{E}_{\theta}\left[\frac{1}{|\mathbf{X}|^{2}}\right]
\end{aligned}
$$
This is a quadratic in $\alpha$. We may simplify the middle term by considering a general "well-behaved" function $h:\mathbf{x}\mapsto h(\mathbf{x})\in \mathbb{R}$ and using integration by parts. For $1\leq i\leq n$ and any continuously differentiable $h$ growing sufficiently slowly for large $x_{i}$, we have:
$$
\begin{aligned}
\mathbb{E}_{\theta}\left[(\theta_{i}-X_{i})h(\mathbf{X})\mid X_{j}=x_{j}\ (j\neq i)\right]
&= \int (\theta_{i}-x_{i})h(\mathbf{x})\left(\frac{1}{2\pi}\right)^{n/2}e^{-(1/2)(\mathbf{x}-\theta)^{T}(\mathbf{x}-\theta)}\,m(dx_{i}) \\
&= \left[h(\mathbf{x})\left(\frac{1}{2\pi}\right)^{n/2}e^{-(1/2)(\mathbf{x}-\theta)^{T}(\mathbf{x}-\theta)}\right]_{x_{i}=-\infty}^{x_{i}=\infty}-\int \frac{\partial h}{\partial x_{i}}(\mathbf{x})\left(\frac{1}{2\pi}\right)^{n/2}e^{-(1/2)(\mathbf{x}-\theta)^{T}(\mathbf{x}-\theta)}\,m(dx_{i}) \\
&= -\mathbb{E}_{\theta}\left[\frac{\partial h}{\partial x_{i}}(\mathbf{X})\,\middle|\,X_{j}=x_{j}\ (j\neq i)\right].
\end{aligned}
$$
The boundary term vanishes by the growth assumption on $h$, and taking the expectation over the remaining coordinates then gives

$$\mathbb{E}_{\theta}\left[(\theta_{i}-X_{i})h(\mathbf{X})\right]=-\mathbb{E}_{\theta}\left[\frac{\partial h}{\partial x_{i}}(\mathbf{X})\right].$$

(This result is known as Stein's lemma.)
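The lemma is easy to test numerically. The following sketch (not part of the proof; the smooth test function $h(\mathbf{x})=\sin(x_{i})$ is an arbitrary choice of ours) compares Monte Carlo estimates of the two sides of the identity:

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials, i = 4, 500_000, 0
theta = np.array([0.5, -1.0, 2.0, 0.0])

X = theta + rng.standard_normal((trials, n))
lhs = np.mean((theta[i] - X[:, i]) * np.sin(X[:, i]))  # E[(theta_i - X_i) h(X)]
rhs = -np.mean(np.cos(X[:, i]))                        # -E[dh/dx_i(X)]
print(lhs, rhs)  # the two estimates agree up to Monte Carlo error
```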
Now, we choose

$$h(\mathbf{x})=\frac{x_{i}}{|\mathbf{x}|^{2}}.$$
If $h$ met the "well-behaved" condition (it does not, but this can be remedied; see below), we would have

$$\frac{\partial h}{\partial x_{i}}=\frac{1}{|\mathbf{x}|^{2}}-\frac{2x_{i}^{2}}{|\mathbf{x}|^{4}}$$
and so

$$
\begin{aligned}
\mathbb{E}_{\theta}\left[\frac{(\theta -\mathbf{X})^{T}\mathbf{X}}{|\mathbf{X}|^{2}}\right]
&= \sum_{i=1}^{n}\mathbb{E}_{\theta}\left[(\theta_{i}-X_{i})\frac{X_{i}}{|\mathbf{X}|^{2}}\right] \\
&= -\sum_{i=1}^{n}\mathbb{E}_{\theta}\left[\frac{1}{|\mathbf{X}|^{2}}-\frac{2X_{i}^{2}}{|\mathbf{X}|^{4}}\right] \\
&= -(n-2)\,\mathbb{E}_{\theta}\left[\frac{1}{|\mathbf{X}|^{2}}\right].
\end{aligned}
$$
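The partial derivative used above can be double-checked symbolically; this is a sketch with sympy, with an arbitrary dimension and index:

```python
import sympy as sp

n, i = 3, 0
x = sp.symbols(f"x0:{n}", real=True)
norm_sq = sum(xi ** 2 for xi in x)
h = x[i] / norm_sq  # h(x) = x_i / |x|^2

# dh/dx_i should equal 1/|x|^2 - 2 x_i^2 / |x|^4
expected = 1 / norm_sq - 2 * x[i] ** 2 / norm_sq ** 2
assert sp.simplify(sp.diff(h, x[i]) - expected) == 0
```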
Then returning to the risk function of $d'$:

$$R(\theta ,d')=n-2\alpha (n-2)\,\mathbb{E}_{\theta}\left[\frac{1}{|\mathbf{X}|^{2}}\right]+\alpha^{2}\,\mathbb{E}_{\theta}\left[\frac{1}{|\mathbf{X}|^{2}}\right].$$
This quadratic in $\alpha$ is minimized (by setting its derivative in $\alpha$ to zero) at $\alpha =n-2$, giving
$$R(\theta ,d')=R(\theta ,d)-(n-2)^{2}\,\mathbb{E}_{\theta}\left[\frac{1}{|\mathbf{X}|^{2}}\right],$$
which satisfies

$$R(\theta ,d')<R(\theta ,d)$$

for every $\theta$, since $(n-2)^{2}>0$ and $\mathbb{E}_{\theta}\left[1/|\mathbf{X}|^{2}\right]$ is strictly positive and finite for $n\geq 3$. This makes $d$ an inadmissible decision rule.
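Putting the pieces together, a Monte Carlo comparison of the two risks (again a sketch under the assumption $\mathbf{X}\sim N(\theta ,I_{n})$; the improvement is most visible for $\theta$ near the origin) illustrates the strict inequality:

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 5, 200_000
theta = np.full(n, 0.5)

X = theta + rng.standard_normal((trials, n))
shrunk = X - (n - 2) / np.sum(X ** 2, axis=1, keepdims=True) * X  # d'(X)

risk_d = np.mean(np.sum((X - theta) ** 2, axis=1))         # approx. n
risk_d2 = np.mean(np.sum((shrunk - theta) ** 2, axis=1))   # strictly smaller
print(risk_d, risk_d2)
```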
It remains to justify the use of

$$h(\mathbf{X})=\frac{\mathbf{X}}{|\mathbf{X}|^{2}}.$$

This function is not continuously differentiable, since it is singular at $\mathbf{x}=0$. However, the function

$$h(\mathbf{X})=\frac{\mathbf{X}}{\epsilon +|\mathbf{X}|^{2}}$$

is continuously differentiable, and after following the algebra through and letting $\epsilon \to 0$ one obtains the same result.
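This limit can also be checked symbolically; the following sketch differentiates the regularized $h$ and substitutes $\epsilon =0$, recovering the expression used earlier (variable names are ours):

```python
import sympy as sp

n, i = 3, 0
eps = sp.symbols("epsilon", positive=True)
x = sp.symbols(f"x0:{n}", real=True)
norm_sq = sum(xi ** 2 for xi in x)

h_eps = x[i] / (eps + norm_sq)  # regularized h, smooth everywhere
d_eps = sp.diff(h_eps, x[i])    # its partial derivative in x_i

# Letting epsilon -> 0 recovers 1/|x|^2 - 2 x_i^2 / |x|^4
limit = d_eps.subs(eps, 0)
assert sp.simplify(limit - (1 / norm_sq - 2 * x[i] ** 2 / norm_sq ** 2)) == 0
```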