# Bayes classifier

In statistical classification, the Bayes classifier minimizes the probability of misclassification.[1]

## Definition

Suppose a pair ${\displaystyle (X,Y)}$ takes values in ${\displaystyle \mathbb {R} ^{d}\times \{1,2,\dots ,K\}}$, where ${\displaystyle Y}$ is the class label of ${\displaystyle X}$. Assume that the conditional distribution of X, given that the label Y takes the value r is given by

${\displaystyle (X\mid Y=r)\sim P_{r}}$ for ${\displaystyle r=1,2,\dots ,K}$

where "${\displaystyle \sim }$" means "is distributed as", and where ${\displaystyle P_{r}}$ denotes a probability distribution.

A classifier is a rule that assigns to an observation ${\displaystyle X=x}$ a guess or estimate of what the unobserved label ${\displaystyle Y}$ actually was. In theoretical terms, a classifier is a measurable function ${\displaystyle C:\mathbb {R} ^{d}\to \{1,2,\dots ,K\}}$, with the interpretation that C classifies the point x to the class C(x). The probability of misclassification, or risk, of a classifier C is defined as

${\displaystyle {\mathcal {R}}(C)=\operatorname {P} \{C(X)\neq Y\}.}$

The Bayes classifier is

${\displaystyle C^{\text{Bayes}}(x)={\underset {r\in \{1,2,\dots ,K\}}{\operatorname {argmax} }}\operatorname {P} (Y=r\mid X=x).}$

In practice, as in most of statistics, the difficulties and subtleties are associated with modeling the probability distributions effectively—in this case, ${\displaystyle \operatorname {P} (Y=r\mid X=x)}$. The Bayes classifier is a useful benchmark in statistical classification.
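As a minimal numerical sketch of the definition, the posterior ${\displaystyle \operatorname {P} (Y=r\mid X=x)}$ can be computed via Bayes' rule from the priors ${\displaystyle \operatorname {P} (Y=r)}$ and the class-conditional densities ${\displaystyle P_{r}}$; the priors and Gaussian parameters below are invented for illustration, not taken from the text:

```python
import math

# Hypothetical two-class problem: known priors and Gaussian
# class-conditional densities P_r (all numbers invented).
priors = {1: 0.6, 2: 0.4}                    # P(Y = r)
params = {1: (0.0, 1.0), 2: (2.0, 1.0)}      # (mean, std) of P_r

def density(x, mean, std):
    """Gaussian density of P_r evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def bayes_classifier(x):
    """argmax_r P(Y = r | X = x); by Bayes' rule the posterior is
    proportional to P(Y = r) * p_r(x), so the normalizer can be dropped."""
    return max(priors, key=lambda r: priors[r] * density(x, *params[r]))

print(bayes_classifier(0.1))  # 1: x lies near the mean of P_1
print(bayes_classifier(1.9))  # 2: x lies near the mean of P_2
```

Note that the argmax is unaffected by the shared normalizing constant ${\displaystyle p(x)}$, which is why the code compares unnormalized products.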

The excess risk of a general classifier ${\displaystyle C}$ (possibly depending on some training data) is defined as ${\displaystyle {\mathcal {R}}(C)-{\mathcal {R}}(C^{\text{Bayes}}).}$ Thus this non-negative quantity is important for assessing the performance of different classification techniques. A classifier is said to be consistent if the excess risk converges to zero as the size of the training data set tends to infinity.[2]

If the components ${\displaystyle x_{1},\dots ,x_{d}}$ of ${\displaystyle x}$ are assumed to be conditionally independent given the class, we obtain the naive Bayes classifier, where ${\displaystyle C^{\text{Bayes}}(x)={\underset {r\in \{1,2,\dots ,K\}}{\operatorname {argmax} }}\operatorname {P} (Y=r)\prod _{i=1}^{d}P_{r}(x_{i}).}$
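The conditional-independence assumption turns the joint density into a product of per-feature factors. A sketch, with invented per-feature Gaussian parameters standing in for the factors ${\displaystyle P_{r}(x_{i})}$:

```python
import math

# Hypothetical naive Bayes setup: d = 2 features, each Gaussian
# given the class (all parameters invented for illustration).
priors = {1: 0.5, 2: 0.5}
# params[r][i] = (mean, std) of feature i under class r
params = {1: [(0.0, 1.0), (0.0, 1.0)],
          2: [(3.0, 1.0), (3.0, 1.0)]}

def density(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def naive_bayes(x):
    """argmax_r P(Y = r) * prod_i p_r(x_i), using the
    conditional-independence factorization."""
    def score(r):
        s = priors[r]
        for xi, (mean, std) in zip(x, params[r]):
            s *= density(xi, mean, std)
        return s
    return max(priors, key=score)

print(naive_bayes([0.2, -0.1]))  # 1: both features near class 1's means
print(naive_bayes([2.8, 3.1]))   # 2: both features near class 2's means
```

In practice the product is usually computed as a sum of log-densities to avoid underflow; the direct product is kept here to mirror the formula.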

## Proof of optimality

The proof that the Bayes classifier is optimal, and hence that the Bayes error rate is minimal, proceeds as follows.

Let ${\displaystyle R(h)}$ denote the risk of a classifier ${\displaystyle h}$ and ${\displaystyle R^{*}}$ the Bayes risk, and restrict attention to the binary case ${\displaystyle Y\in \{0,1\}}$. Let the posterior probability of a point belonging to class 1 be ${\displaystyle \eta (x)=\operatorname {P} (Y=1\mid X=x)}$. Define the classifier ${\displaystyle h^{*}}$ as

${\displaystyle h^{*}(x)={\begin{cases}1&{\text{if }}\eta (x)\geqslant 0.5\\0&{\text{otherwise}}\end{cases}}}$

Then we have the following results:

(a) ${\displaystyle R(h^{*})=R^{*}}$, i.e. ${\displaystyle h^{*}}$ is a Bayes classifier,

(b) For any classifier ${\displaystyle h}$, the excess risk satisfies ${\displaystyle R(h)-R^{*}=2\mathbb {E} _{X}\left[|\eta (x)-0.5|\cdot \mathbb {I} _{\left\{h(X)\neq h^{*}(X)\right\}}\right]}$

(c) ${\displaystyle R^{*}=\mathbb {E} _{X}\left[\min(\eta (X),1-\eta (X))\right]}$

Proof of (a): For any classifier ${\displaystyle h}$, we have

${\displaystyle R(h)=\mathbb {E} _{XY}\left[\mathbb {I} _{\left\{h(X)\neq Y\right\}}\right]}$

${\displaystyle =\mathbb {E} _{X}\mathbb {E} _{Y|X}[\mathbb {I} _{\left\{h(X)\neq Y\right\}}]}$ (by the law of total expectation)

${\displaystyle =\mathbb {E} _{X}[\eta (X)\mathbb {I} _{\left\{h(X)=0\right\}}+(1-\eta (X))\mathbb {I} _{\left\{h(X)=1\right\}}]}$

Notice that ${\displaystyle R(h)}$ is minimised pointwise by taking, for every ${\displaystyle x}$,

${\displaystyle h(x)={\begin{cases}1&{\text{if }}\eta (x)\geqslant 1-\eta (x)\\0&{\text{otherwise}}\end{cases}}}$

Since this pointwise minimiser is exactly ${\displaystyle h^{*}}$, the minimum possible risk is the Bayes risk, ${\displaystyle R^{*}=R(h^{*})}$.

Proof of (b):

${\displaystyle {\begin{aligned}R(h)-R^{*}&=R(h)-R(h^{*})\\&=\mathbb {E} _{X}[\eta (X)\mathbb {I} _{\left\{h(X)=0\right\}}+(1-\eta (X))\mathbb {I} _{\left\{h(X)=1\right\}}-\eta (X)\mathbb {I} _{\left\{h^{*}(X)=0\right\}}-(1-\eta (X))\mathbb {I} _{\left\{h^{*}(X)=1\right\}}]\\&=\mathbb {E} _{X}[|2\eta (X)-1|\mathbb {I} _{\left\{h(X)\neq h^{*}(X)\right\}}]\\&=2\mathbb {E} _{X}[|\eta (X)-0.5|\mathbb {I} _{\left\{h(X)\neq h^{*}(X)\right\}}]\end{aligned}}}$

Proof of (c):

${\displaystyle {\begin{aligned}R(h^{*})&=\mathbb {E} _{X}[\eta (X)\mathbb {I} _{\left\{h^{*}(X)=0\right\}}+(1-\eta (X))\mathbb {I} _{\left\{h^{*}(X)=1\right\}}]\\&=\mathbb {E} _{X}[\min(\eta (X),1-\eta (X))]\end{aligned}}}$
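Identities (b) and (c) can be checked by Monte Carlo in a toy model; the model below (X uniform on [0, 1] with ${\displaystyle \eta (x)=x}$, and a deliberately bad classifier that always predicts 1) is an invented example, not from the text:

```python
import random

random.seed(0)

def eta(x):
    """Posterior P(Y = 1 | X = x) in this toy model."""
    return x

def h_star(x):
    """The Bayes classifier for this eta: predict 1 iff eta(x) >= 0.5."""
    return 1 if eta(x) >= 0.5 else 0

def h(x):
    """A deliberately suboptimal classifier: always predict 1."""
    return 1

N = 200_000
risk_h = risk_star = excess_rhs = bayes_rhs = 0.0
for _ in range(N):
    x = random.random()                      # X ~ Uniform(0, 1)
    y = 1 if random.random() < eta(x) else 0 # draw Y | X = x
    risk_h += (h(x) != y)
    risk_star += (h_star(x) != y)
    # Right-hand side of identity (b): 2 E[|eta(X) - 0.5| I{h != h*}]
    excess_rhs += 2 * abs(eta(x) - 0.5) * (h(x) != h_star(x))
    # Right-hand side of identity (c): E[min(eta(X), 1 - eta(X))]
    bayes_rhs += min(eta(x), 1 - eta(x))

print(risk_h / N - risk_star / N)  # excess risk, ≈ excess_rhs / N (exactly 1/4 here)
print(risk_star / N)               # Bayes risk,  ≈ bayes_rhs / N  (exactly 1/4 here)
```

In this model the exact values are ${\displaystyle R^{*}=\mathbb {E} [\min(X,1-X)]=1/4}$ and ${\displaystyle R(h)=\mathbb {E} [1-X]=1/2}$, so both sides of (b) and (c) should agree up to Monte Carlo error.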

The general case, in which each observation can belong to any of ${\displaystyle n}$ categories, proceeds by the tower property of expectations as follows.

${\displaystyle {\begin{aligned}\mathbb {E} _{Y}\left[\mathbb {I} _{\{Y\neq {\hat {y}}\}}\right]&=\mathbb {E} _{X}\mathbb {E} _{Y\mid X}\left[\mathbb {I} _{\{Y\neq {\hat {y}}\}}\mid X=x\right]\\&=\mathbb {E} _{X}\left[\operatorname {P} (Y=1\mid X=x)\mathbb {I} _{\{{\hat {y}}\neq 1\}}+\operatorname {P} (Y=2\mid X=x)\mathbb {I} _{\{{\hat {y}}\neq 2\}}+\dots +\operatorname {P} (Y=n\mid X=x)\mathbb {I} _{\{{\hat {y}}\neq n\}}\right]\end{aligned}}}$

This is minimised by choosing, for each observation ${\displaystyle x}$, the class with the largest posterior probability,

${\displaystyle h(x)={\underset {k}{\operatorname {argmax} }}\operatorname {P} (Y=k\mid X=x).}$
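As a minimal illustration of the multiclass rule, given the posterior probabilities at a fixed point x (the numbers below are hypothetical), the classifier simply returns the argmax:

```python
# Hypothetical posteriors P(Y = k | X = x) at some fixed x; they sum to 1.
posterior = {1: 0.2, 2: 0.5, 3: 0.3}

# h(x) = argmax_k P(Y = k | X = x)
h_x = max(posterior, key=posterior.get)
print(h_x)  # 2: the class with the largest posterior probability
```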