= Bayes classifier =

In statistical classification, the Bayes classifier is the classifier with the smallest probability of misclassification among all classifiers using the same set of features.

==Definition==

Suppose a pair $(X,Y)$ takes values in $\mathbb{R}^d \times \{1,2,\dots,K\}$, where $Y$ is the class label of an element whose features are given by $X$. Assume that the conditional distribution of $X$, given that the label $Y$ takes the value $r$, is given by
$(X\mid Y=r) \sim P_r \quad \text{for} \quad r=1,2,\dots,K$
where "$\sim$" means "is distributed as", and where $P_r$ denotes a probability distribution.

A classifier is a rule that assigns to an observation $X=x$ a guess or estimate of what the unobserved label $Y=r$ actually was. In theoretical terms, a classifier is a measurable function $C: \mathbb{R}^d \to \{1,2,\dots,K\}$, with the interpretation that $C$ classifies the point $x$ to the class $C(x)$. The probability of misclassification, or risk, of a classifier $C$ is defined as
$\mathcal{R}(C) = \operatorname{P}\{C(X) \neq Y\}.$

The Bayes classifier is
$C^\text{Bayes}(x) = \underset{r \in \{1,2,\dots, K\}}{\operatorname{argmax}} \operatorname{P}(Y=r \mid X=x).$

In practice, as in most of statistics, the difficulties and subtleties are associated with modeling the probability distributions effectively—in this case, $\operatorname{P}(Y=r \mid X=x)$. The Bayes classifier is a useful benchmark in statistical classification.
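When the class priors and the class-conditional distributions $P_r$ are fully known, the argmax in the definition can be evaluated directly. The following is a minimal sketch under that assumption, with $K=2$, $d=1$, and Gaussian $P_r$; the parameter values and the names `PRIORS`, `PARAMS`, `posterior` and `bayes_classifier` are hypothetical and chosen only for illustration.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical model (an assumption, not from the article): K = 2, d = 1.
PRIORS = {1: 0.5, 2: 0.5}                 # P(Y = r)
PARAMS = {1: (0.0, 1.0), 2: (2.0, 1.0)}   # (mean, std) of each P_r

def posterior(x):
    """Return {r: P(Y = r | X = x)} computed with Bayes' theorem."""
    joint = {r: PRIORS[r] * gaussian_pdf(x, *PARAMS[r]) for r in PRIORS}
    total = sum(joint.values())
    return {r: p / total for r, p in joint.items()}

def bayes_classifier(x):
    """Return the class with the largest posterior probability at x."""
    post = posterior(x)
    return max(post, key=post.get)
```

With equal priors and equal variances, as assumed here, the decision boundary lies midway between the two means, so points below 1 are assigned to class 1 and points above 1 to class 2.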

The excess risk of a general classifier $C$ (possibly depending on some training data) is defined as $\mathcal{R}(C) - \mathcal{R}(C^\text{Bayes}).$
This non-negative quantity is important for assessing the performance of different classification techniques. A classifier is said to be consistent if the excess risk converges to zero as the size of the training data set tends to infinity.
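For a finite feature space, the risk and the excess risk can be computed by direct summation. The sketch below assumes a small, made-up joint distribution of $(X,Y)$; the table `JOINT` and the rule `always_one` are hypothetical and serve only to illustrate the definitions.

```python
# Hypothetical joint distribution P(X = x, Y = r), x in {0, 1, 2}, r in {1, 2}.
JOINT = {
    (0, 1): 0.30, (0, 2): 0.10,
    (1, 1): 0.15, (1, 2): 0.15,
    (2, 1): 0.05, (2, 2): 0.25,
}
LABELS = sorted({r for _, r in JOINT})

def risk(classifier):
    """P(C(X) != Y): total mass of the pairs (x, r) the classifier gets wrong."""
    return sum(p for (x, r), p in JOINT.items() if classifier(x) != r)

def bayes_classifier(x):
    """argmax over r of P(Y = r | X = x).

    The joint mass P(X = x, Y = r) has the same argmax as the posterior,
    because they differ only by the factor P(X = x). Ties (here at x = 1)
    are broken in favour of the smaller label; either choice is optimal.
    """
    return max(LABELS, key=lambda r: JOINT[(x, r)])

def always_one(x):
    """A deliberately poor rule that ignores x."""
    return 1

bayes_risk = risk(bayes_classifier)
excess_risk = risk(always_one) - bayes_risk
```

In this example the excess risk of `always_one` is strictly positive, since it misclassifies the region where class 2 is more probable.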

If the components $x_i$ of $x$ are assumed to be conditionally independent given the class label, we get the naive Bayes classifier, where $C^\text{Bayes}(x) = \underset{r \in \{1,2,\dots, K\}}{\operatorname{argmax}} \operatorname{P}(Y=r)\prod_{i=1}^{d}P_r(x_i).$
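The product form can be illustrated with binary features. The following sketch assumes that the per-feature probabilities are known; the values in `PRIORS` and `THETA` are hypothetical. It works with logarithms, a common choice that avoids numerical underflow when $d$ is large and does not change the argmax.

```python
import math

# Hypothetical model (an assumption for illustration): K = 2, d = 3 binary features.
PRIORS = {1: 0.6, 2: 0.4}   # P(Y = r)
# THETA[r][i] = P(x_i = 1 | Y = r), assumed known here.
THETA = {
    1: [0.8, 0.3, 0.5],
    2: [0.2, 0.7, 0.5],
}

def log_score(x, r):
    """log P(Y = r) + sum_i log P_r(x_i): the log of the naive Bayes objective."""
    total = math.log(PRIORS[r])
    for x_i, theta in zip(x, THETA[r]):
        total += math.log(theta if x_i == 1 else 1.0 - theta)
    return total

def naive_bayes_classifier(x):
    """Return the class maximising the prior times the per-feature likelihoods."""
    return max(PRIORS, key=lambda r: log_score(x, r))
```

For instance, under these assumed parameters the feature vector `(1, 0, 1)` is assigned to class 1 and `(0, 1, 0)` to class 2.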

== Properties ==
The proof that the Bayes classifier is optimal and that the Bayes error rate is minimal proceeds as follows.

Define the following: the risk $R(h)$ of a classifier $h$, the Bayes risk $R^*$, and the set of possible classes $\{0,1\}$ (the binary case). Let the posterior probability of a point belonging to class 1 be $\eta(x)=\Pr(Y=1\mid X=x)$. Define the classifier $h^*$ as
$h^*(x)=\begin{cases}1&\text{if }\eta(x)\geqslant 0.5,\\ 0&\text{otherwise.}\end{cases}$

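Then we have the following results:
- (a) $R(h^*) = R^*$, i.e. $h^*$ is a Bayes classifier,
- (b) for any classifier $h$, the excess risk satisfies $R(h) - R^* = 2\mathbb{E}_X\left[|\eta(X)-0.5|\,\mathbb{I}_{\left\{h(X)\ne h^*(X)\right\}}\right]$,
- (c) $R^* = \mathbb{E}_X\left[\min(\eta(X),1-\eta(X))\right]$,
- (d) $R^* = \frac{1}{2} - \frac{1}{2}\mathbb E [|2\eta(X) - 1|]$.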

Proof of (a): For any classifier $h$, we have
$\begin{aligned}
R(h) &= \mathbb{E}_{XY}\left[ \mathbb{I}_{ \left\{h(X)\ne Y \right\}} \right] \\
&=\mathbb{E}_X\mathbb{E}_{Y|X}[\mathbb{I}_{ \left\{h(X)\ne Y \right\}} ] \\
&= \mathbb{E}_X[\eta(X)\mathbb{I}_{ \left\{h(X)=0\right\}} +(1-\eta(X))\mathbb{I}_{ \left\{h(X)=1 \right\}} ]
\end{aligned}$
where the second line follows from Fubini's theorem and the third from the definition of $\eta$.

Notice that $R(h)$ is minimised by taking, for every $x$,
$h(x) = \begin{cases}
1&\text{if }\eta(x)\geqslant 1-\eta(x),\\
0&\text{otherwise.}
\end{cases}$

Therefore the minimum possible risk is the Bayes risk, $R^*= R(h^*)$.

Proof of (b):
$\begin{aligned}
R(h)-R^* &= R(h)-R(h^*)\\
    &= \mathbb{E}_X[\eta(X)\mathbb{I}_{\left\{h(X)=0\right\}}+(1-\eta(X))\mathbb{I}_{\left\{h(X)=1\right\}}-\eta(X)\mathbb{I}_{\left\{h^*(X)=0\right\}}-(1-\eta(X))\mathbb{I}_{\left\{h^*(X)=1\right\}}]\\
    &=\mathbb{E}_X[|2\eta(X)-1|\mathbb{I}_{\left\{h(X)\ne h^*(X)\right\}}]\\
    &= 2\mathbb{E}_X[|\eta(X)-0.5|\mathbb{I}_{\left\{h(X)\ne h^*(X)\right\}}]
\end{aligned}$

Proof of (c):
$\begin{aligned}
R(h^*) &= \mathbb{E}_X[\eta(X)\mathbb{I}_{\left\{h^*(X)=0\right\}}+(1-\eta(X))\mathbb{I}_{\left\{h^*(X)=1\right\}}]\\
&= \mathbb{E}_X[\min(\eta(X),1-\eta(X))]
\end{aligned}$

Proof of (d):
$\begin{aligned}
R(h^*) &= \mathbb{E}_X[\min(\eta(X),1-\eta(X))] \\
&= \frac{1}{2} - \mathbb{E}_X[\max(\eta(X) - 1/2,1/2-\eta(X))]\\
&=\frac{1}{2} - \frac{1}{2} \mathbb E [|2\eta(X) - 1|]
\end{aligned}$
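The identities proved in (b), (c) and (d) can be checked numerically on a small example. The sketch below assumes a finite feature space with a made-up marginal `P_X` and posterior `ETA`; these values, and the competing rule `h_other`, are hypothetical and serve only as a sanity check, not as part of the proof.

```python
# Hypothetical finite feature space: marginal P(X = x) and eta(x) = P(Y = 1 | X = x).
P_X = {0: 0.4, 1: 0.3, 2: 0.3}
ETA = {0: 0.9, 1: 0.5, 2: 0.2}

def risk(h):
    """R(h) = E_X[eta(X) 1{h(X)=0} + (1 - eta(X)) 1{h(X)=1}]."""
    return sum(p * (ETA[x] if h(x) == 0 else 1.0 - ETA[x]) for x, p in P_X.items())

def h_star(x):
    """The classifier h*: predict 1 exactly when eta(x) >= 0.5."""
    return 1 if ETA[x] >= 0.5 else 0

def h_other(x):
    """An arbitrary competing classifier that always predicts 1."""
    return 1

bayes_risk = risk(h_star)

# Right-hand sides of (c) and (d).
form_c = sum(p * min(ETA[x], 1.0 - ETA[x]) for x, p in P_X.items())
form_d = 0.5 - 0.5 * sum(p * abs(2.0 * ETA[x] - 1.0) for x, p in P_X.items())

# Both sides of (b) for h_other.
excess_direct = risk(h_other) - bayes_risk
excess_formula = 2.0 * sum(
    p * abs(ETA[x] - 0.5) for x, p in P_X.items() if h_other(x) != h_star(x)
)
```

Under these assumed values, `bayes_risk`, `form_c` and `form_d` agree, and `excess_direct` matches `excess_formula` up to floating-point rounding.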

===General case===

In the general case, where each element can belong to any of $n$ categories, the proof that the Bayes classifier minimises the classification error proceeds by the tower property of expectation as follows.
$\begin{aligned}
\mathbb{E}_{XY}(\mathbb{I}_{\{y\ne \hat{y}\}}) &= \mathbb{E}_X\mathbb{E}_{Y|X}\left(\mathbb{I}_{\{y\ne \hat{y}\}}|X=x\right)\\
&= \mathbb{E} \left[\Pr(Y=1|X=x)\mathbb{I}_{\{\hat{y}=2,3,\dots,n\}} + \Pr(Y=2|X=x)\mathbb{I}_{\{\hat{y}=1,3,\dots,n\}} + \dots + \Pr(Y=n|X=x) \mathbb{I}_{\{\hat{y}=1,2,3,\dots,n-1\}}\right]
\end{aligned}$

This is minimised by simultaneously minimising all the terms of the expectation using the classifier $h(x) = \arg\max_{k}\Pr(Y=k\mid X=x)$ for each observation $x$.
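As a sketch of the step behind this claim, written in the notation above with $h(x)$ standing for the prediction $\hat{y}$: for a fixed observation $x$, the conditional probability of error is
$\Pr(Y \ne h(x) \mid X=x) = \sum_{k \ne h(x)} \Pr(Y=k \mid X=x) = 1 - \Pr(Y=h(x) \mid X=x),$
which is smallest when $h(x)$ is a class with the largest posterior probability. Taking the expectation over $X$ then gives the smallest possible overall risk.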

==See also==
- Naive Bayes classifier
