In statistical classification, the Bayes classifier minimizes the probability of misclassification.[1]
Definition
Suppose a pair $(X,Y)$ takes values in $\mathbb{R}^d \times \{1,2,\dots,K\}$, where $Y$ is the class label of $X$. Assume that the conditional distribution of $X$, given that the label $Y$ takes the value $r$, is given by

$$(X\mid Y=r) \sim P_r \qquad \text{for } r=1,2,\dots,K,$$

where "$\sim$" means "is distributed as", and where $P_r$ denotes a probability distribution.
A classifier is a rule that assigns to an observation $X=x$ a guess or estimate of what the unobserved label $Y=r$ actually was. In theoretical terms, a classifier is a measurable function $C\colon \mathbb{R}^d \to \{1,2,\dots,K\}$, with the interpretation that $C$ classifies the point $x$ to the class $C(x)$. The probability of misclassification, or risk, of a classifier $C$ is defined as

$$\mathcal{R}(C) = \operatorname{P}\{C(X)\neq Y\}.$$
The Bayes classifier is

$$C^{\text{Bayes}}(x) = \underset{r\in\{1,2,\dots,K\}}{\operatorname{argmax}}\ \operatorname{P}(Y=r\mid X=x).$$
In practice, as in most of statistics, the difficulties and subtleties are associated with modeling the probability distributions effectively; in this case, the posterior probability $\operatorname{P}(Y=r\mid X=x)$. The Bayes classifier is a useful benchmark in statistical classification.
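As an illustration, the Bayes classifier can be evaluated in closed form when the class-conditional distributions are known. The sketch below is a minimal Python example; the Gaussian class-conditionals $P_1 = N(-1,1)$ and $P_2 = N(+1,1)$ and the equal priors are assumptions chosen purely for illustration. It computes the posterior $\operatorname{P}(Y=r\mid X=x)$ via Bayes' theorem and takes the argmax:

```python
import math

# Hypothetical two-class model (assumed for illustration): the
# class-conditional densities are Gaussian, P_1 = N(-1, 1) and
# P_2 = N(+1, 1), with equal prior probabilities.
PRIORS = {1: 0.5, 2: 0.5}
MEANS = {1: -1.0, 2: 1.0}

def gaussian_pdf(x, mean, sd=1.0):
    """Density of N(mean, sd^2) at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def posterior(r, x):
    """P(Y = r | X = x) computed via Bayes' theorem."""
    joint = {k: PRIORS[k] * gaussian_pdf(x, MEANS[k]) for k in PRIORS}
    return joint[r] / sum(joint.values())

def bayes_classifier(x):
    """argmax over r of P(Y = r | X = x)."""
    return max(PRIORS, key=lambda r: posterior(r, x))
```

With equal priors and equal variances the decision boundary falls at $x=0$, so in this toy model the rule reduces to classifying by the sign of $x$.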
The excess risk of a general classifier $C$ (possibly depending on some training data) is defined as

$$\mathcal{R}(C) - \mathcal{R}(C^{\text{Bayes}}).$$

Thus this non-negative quantity is important for assessing the performance of different classification techniques. A classifier is said to be consistent if the excess risk converges to zero as the size of the training data set tends to infinity.[2]
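A common way to approximate the Bayes classifier is a plug-in rule: estimate the posterior from training data and threshold the estimate. The sketch below (the three-point distribution of $X$ and the values of the posterior are assumptions for illustration) builds such a plug-in classifier from samples and computes its exact excess risk under the true model; as the sample size grows, the empirical posterior concentrates around the truth and the excess risk typically falls to zero, illustrating consistency:

```python
import random

# Toy model (all distributional choices are assumptions for
# illustration): X is uniform on {0, 1, 2}, and the posterior
# eta(x) = P(Y = 1 | X = x) is given by the table below.
ETA = {0: 0.2, 1: 0.55, 2: 0.9}

def sample(n, rng):
    """Draw n i.i.d. pairs (x, y) from the toy model."""
    xs = [rng.randrange(3) for _ in range(n)]
    return [(x, 1 if rng.random() < ETA[x] else 0) for x in xs]

def plug_in(data):
    """Estimate eta(x) by empirical frequencies, then threshold at 1/2."""
    counts = {x: 0 for x in ETA}
    ones = {x: 0 for x in ETA}
    for x, y in data:
        counts[x] += 1
        ones[x] += y
    eta_hat = {x: ones[x] / counts[x] if counts[x] else 0.5 for x in ETA}
    return {x: 1 if eta_hat[x] >= 0.5 else 0 for x in ETA}

def risk(h):
    """Exact misclassification probability of h under the true model."""
    return sum((ETA[x] if h[x] == 0 else 1 - ETA[x]) / 3 for x in ETA)

bayes_rule = {x: 1 if ETA[x] >= 0.5 else 0 for x in ETA}

rng = random.Random(0)
for n in (10, 100, 10000):
    h = plug_in(sample(n, rng))
    print(n, round(risk(h) - risk(bayes_rule), 4))  # excess risk
```

Because the excess risk here is computed exactly from the known model rather than estimated, any gap seen in the output is due solely to the error in estimating the posterior.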
Proof of Optimality
A proof that the Bayes classifier is optimal and that the Bayes error rate is minimal proceeds as follows.

Define the variables: risk $R(h)$, Bayes risk $R^*$, and the set of possible classes $\{0,1\}$. Let the posterior probability of a point belonging to class 1 be $\eta(x) = \Pr(Y=1\mid X=x)$. Define the classifier $h^*$ as

$$h^*(x) = \begin{cases} 1 & \text{if } \eta(x) \geq \tfrac{1}{2}, \\ 0 & \text{otherwise.} \end{cases}$$
Then we have the following results:

(a) $R(h^*) = R^*$, i.e. $h^*$ is a Bayes classifier;

(b) for any classifier $h$, the excess risk satisfies
$$R(h) - R^* = 2\,\mathbb{E}_X\!\left[\left|\eta(X)-\tfrac{1}{2}\right|\,\mathbb{1}_{\{h(X)\neq h^*(X)\}}\right];$$

(c) $R^* = \mathbb{E}_X\!\left[\min(\eta(X),\,1-\eta(X))\right]$.
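On a finite toy distribution the three results can be checked exhaustively. The sketch below (the marginal $p$ of $X$ and the posterior values $\eta$ are assumptions for illustration) computes $R(h)$ for every binary labeling of the support, confirming that $h^*$ attains the minimum risk (a), that the excess-risk identity (b) holds exactly, and that the minimum equals $\mathbb{E}_X[\min(\eta,1-\eta)]$ (c):

```python
from itertools import product

# Toy discrete setting (assumed for illustration): X takes three values
# with marginal p(x), and eta(x) = P(Y = 1 | X = x).
p = {0: 0.2, 1: 0.5, 2: 0.3}      # marginal distribution of X
eta = {0: 0.1, 1: 0.6, 2: 0.9}    # posterior probability of class 1

def risk(h):
    """R(h) = E_X[eta(X) 1{h(X)=0} + (1 - eta(X)) 1{h(X)=1}]."""
    return sum(p[x] * (eta[x] if h[x] == 0 else 1 - eta[x]) for x in p)

h_star = {x: 1 if eta[x] >= 0.5 else 0 for x in p}            # Bayes rule
bayes_risk = sum(p[x] * min(eta[x], 1 - eta[x]) for x in p)   # result (c)

def excess(h):
    """Right-hand side of result (b): 2 E_X[|eta - 1/2| 1{h != h*}]."""
    return 2 * sum(p[x] * abs(eta[x] - 0.5) for x in p if h[x] != h_star[x])

# (a): h_star attains the minimum risk over all 2^3 labelings of the
# support; (b): R(h) - R* equals the excess-risk formula for every h.
for labels in product([0, 1], repeat=3):
    h = dict(zip(p, labels))
    assert risk(h) >= bayes_risk - 1e-12
    assert abs(risk(h) - bayes_risk - excess(h)) < 1e-12
```

The exhaustive loop is feasible here only because the support of $X$ is finite; it is meant as a numerical sanity check of the identities, not as part of the proof.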
Proof of (a): For any classifier $h$, we have

$$R(h) = \mathbb{E}_{XY}\!\left[\mathbb{1}_{\{h(X)\neq Y\}}\right] = \mathbb{E}_X\,\mathbb{E}_{Y\mid X}\!\left[\mathbb{1}_{\{h(X)\neq Y\}}\right] = \mathbb{E}_X\!\left[\eta(X)\mathbb{1}_{\{h(X)=0\}}+(1-\eta(X))\mathbb{1}_{\{h(X)=1\}}\right],$$

where the second equality follows from the tower property of expectation. Notice that the integrand $\eta(x)\mathbb{1}_{\{h(x)=0\}}+(1-\eta(x))\mathbb{1}_{\{h(x)=1\}}$ is minimised pointwise by taking, for every $x$,

$$h(x) = \begin{cases} 1 & \text{if } \eta(x) \geq 1-\eta(x), \\ 0 & \text{otherwise,} \end{cases}$$

which is exactly the rule $h^*$. Therefore the minimum possible risk is the Bayes risk, $R^* = R(h^*)$.
Proof of (b): For any classifier $h$, using the expression for $R(h)$ from the proof of (a),

$$\begin{aligned}
R(h)-R^{*} &= \mathbb{E}_X\!\left[\eta(X)\mathbb{1}_{\{h(X)=0\}}+(1-\eta(X))\mathbb{1}_{\{h(X)=1\}}-\eta(X)\mathbb{1}_{\{h^{*}(X)=0\}}-(1-\eta(X))\mathbb{1}_{\{h^{*}(X)=1\}}\right]\\
&= \mathbb{E}_X\!\left[\left|2\eta(X)-1\right|\,\mathbb{1}_{\{h(X)\neq h^{*}(X)\}}\right]
= 2\,\mathbb{E}_X\!\left[\left|\eta(X)-\tfrac{1}{2}\right|\,\mathbb{1}_{\{h(X)\neq h^{*}(X)\}}\right],
\end{aligned}$$

where the second equality holds because the integrand vanishes when $h(X)=h^{*}(X)$ and equals $\left|\eta(X)-(1-\eta(X))\right|=\left|2\eta(X)-1\right|$ otherwise.
Proof of (c): Evaluating the expression for the risk at $h^{*}$,

$$R(h^{*}) = \mathbb{E}_X\!\left[\eta(X)\mathbb{1}_{\{h^{*}(X)=0\}}+(1-\eta(X))\mathbb{1}_{\{h^{*}(X)=1\}}\right] = \mathbb{E}_X\!\left[\min(\eta(X),\,1-\eta(X))\right],$$

since $h^{*}(X)=1$ precisely when $\eta(X)\geq 1-\eta(X)$.
The general case, that the Bayes classifier minimises classification error when each element can belong to any of $n$ categories, proceeds by towering expectations as follows:

$$\mathbb{E}_Y\!\left[\mathbb{1}_{\{\hat{y}\neq Y\}}\right] = \mathbb{E}_X\,\mathbb{E}_{Y\mid X}\!\left[\mathbb{1}_{\{\hat{y}\neq Y\}}\,\middle|\,X=x\right] = \mathbb{E}_X\!\left[1-\Pr(Y=\hat{y}\mid X=x)\right].$$

This is minimised pointwise by classifying $\hat{y}(x)=\operatorname{argmax}_{r}\Pr(Y=r\mid X=x)$ for each observation $x$.
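The same minimisation can be checked by brute force in a small multi-class example. In the sketch below (the posterior table and the marginal of $X$ are assumptions for illustration), the rule that picks the class with the largest posterior achieves the smallest misclassification probability among all $3^2$ possible classifiers:

```python
from itertools import product

# Assumed toy setting: X in {0, 1}, Y in {1, 2, 3}, with marginal p[x]
# and posterior table post[x][r] = P(Y = r | X = x).
p = {0: 0.4, 1: 0.6}
post = {0: {1: 0.5, 2: 0.3, 3: 0.2},
        1: {1: 0.1, 2: 0.2, 3: 0.7}}

def error(h):
    """E_X[1 - P(Y = h(X) | X = x)], the misclassification probability."""
    return sum(p[x] * (1 - post[x][h[x]]) for x in p)

# Bayes rule: classify each x to the class with the largest posterior.
bayes = {x: max(post[x], key=post[x].get) for x in p}

# Exhaustive check: no classifier over this finite support does better.
assert all(error(dict(zip(p, ys))) >= error(bayes) - 1e-12
           for ys in product([1, 2, 3], repeat=2))
```

Because the minimisation in the towering argument is pointwise in $x$, the argmax can be taken independently at each value of $X$, which is what the dictionary comprehension above does.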
References