Bayes classifier
Classification algorithm in statistics
title: "Bayes classifier" type: doc version: 1 created: 2026-02-28 author: "Wikipedia contributors" status: active scope: public tags: ["bayesian-statistics", "statistical-classification"] description: "Classification algorithm in statistics" topic_path: "science/mathematics" source: "https://en.wikipedia.org/wiki/Bayes_classifier" license: "CC BY-SA 4.0" wikipedia_page_id: 0 wikipedia_revision_id: 0
::summary Classification algorithm in statistics ::
In statistical classification, the Bayes classifier is the classifier having the smallest probability of misclassification of all classes using the same set of features.
Definition
Suppose a pair (X,Y) takes values in \mathbb{R}^d \times {1,2,\dots,K}, where Y is the class label of an element whose features are given by X. Assume that the conditional distribution of X, given that the label Y takes the value r is given by (X\mid Y=r) \sim P_r \quad \text{for} \quad r=1,2,\dots,K where "\sim" means "is distributed as", and where P_r denotes a probability distribution.
A classifier is a rule that assigns to an observation X=x a guess or estimate of what the unobserved label Y=r actually was. In theoretical terms, a classifier is a measurable function C: \mathbb{R}^d \to {1,2,\dots,K}, with the interpretation that C classifies the point x to the class C(x). The probability of misclassification, or risk, of a classifier C is defined as \mathcal{R}(C) = \operatorname{P}{C(X) \neq Y}.
The Bayes classifier is C^\text{Bayes}(x) = \underset{r \in {1,2,\dots, K}}{\operatorname{argmax}} \operatorname{P}(Y=r \mid X=x).
In practice, as in most of statistics, the difficulties and subtleties are associated with modeling the probability distributions effectively—in this case, \operatorname{P}(Y=r \mid X=x). The Bayes classifier is a useful benchmark in statistical classification.
The excess risk of a general classifier C (possibly depending on some training data) is defined as \mathcal{R}(C) - \mathcal{R}(C^\text{Bayes}). Thus this non-negative quantity is important for assessing the performance of different classification techniques. A classifier is said to be consistent if the excess risk converges to zero as the size of the training data set tends to infinity.
Considering the components x_i of x to be mutually independent, we get the naive Bayes classifier, where C^\text{Bayes}(x) = \underset{r \in {1,2,\dots, K}}{\operatorname{argmax}} \operatorname{P}(Y=r)\prod_{i=1}^{d}P_r(x_i).
Properties
Proof that the Bayes classifier is optimal and Bayes error rate is minimal proceeds as follows.
Define the variables: Risk R(h), Bayes risk R^*, all possible classes to which the points can be classified Y = {0,1}. Let the posterior probability of a point belonging to class 1 be \eta(x)=Pr(Y=1|X=x). Define the classifier \mathcal{h}^as \mathcal{h}^(x)=\begin{cases}1&\text{if }\eta(x)\geqslant 0.5,\ 0&\text{otherwise.}\end{cases}
Then we have the following results: | 1 = R(h^)=R^, i.e. h^* is a Bayes classifier,
| 2 = For any classifier h, the excess risk satisfies R(h)-R^=2\mathbb{E}X\left[|\eta(x)-0.5|\cdot \mathbb{I}{\left{h(X)\ne h^(X)\right}}\right]
| 3 = R^* = \mathbb{E}_X\left[\min(\eta(X),1-\eta(X))\right]
| 4 = R^* = \frac{1}{2} - \frac{1}{2}\mathbb E [|2\eta(X) - 1|]
Proof of (a): For any classifier h, we have \begin{align} R(h) &= \mathbb{E}{XY}\left[ \mathbb{I}{ \left{h(X)\ne Y \right}} \right] \ &=\mathbb{E}X\mathbb{E}{Y|X}[\mathbb{I}{ \left{h(X)\ne Y \right}} ] \ &= \mathbb{E}X[\eta(X)\mathbb{I}{ \left{h(X)=0\right}} +(1-\eta(X))\mathbb{I}{ \left{h(X)=1 \right}} ] \end{align} where the second line was derived through Fubini's theorem
Notice that R(h) is minimised by taking \forall x\in X, h(x) = \begin{cases} 1&\text{if }\eta(x)\geqslant 1-\eta(x),\ 0&\text{otherwise.} \end{cases}
Therefore the minimum possible risk is the Bayes risk, R^= R(h^).
Proof of (b): \begin{aligned} R(h)-R^* &= R(h)-R(h^)\ &= \mathbb{E}X[\eta(X)\mathbb{I}{\left{h(X)=0\right}}+(1-\eta(X))\mathbb{I}{\left{h(X)=1\right}}-\eta(X)\mathbb{I}{\left{h^(X)=0\right}}-(1-\eta(X))\mathbb{I}_{\left{h^(X)=1\right}}]\ &=\mathbb{E}X[|2\eta(X)-1|\mathbb{I}{\left{h(X)\ne h^(X)\right}}]\ &= 2\mathbb{E}X[|\eta(X)-0.5|\mathbb{I}{\left{h(X)\ne h^*(X)\right}}] \end{aligned}
Proof of (c): \begin{aligned} R(h^) &= \mathbb{E}X[\eta(X)\mathbb{I}{\left{h^(X)=0\right}}+(1-\eta(X))\mathbb{I}_{\left{h*(X)=1\right}}]\ &= \mathbb{E}_X[\min(\eta(X),1-\eta(X))] \end{aligned}
Proof of (d): \begin{aligned} R(h^*) &= \mathbb{E}_X[\min(\eta(X),1-\eta(X))] \ &= \frac{1}{2} - \mathbb{E}_X[\max(\eta(X) - 1/2,1/2-\eta(X))]\ &=\frac{1}{2} - \frac{1}{2} \mathbb E [|2\eta(X) - 1|] \end{aligned}
General case
The general case that the Bayes classifier minimises classification error when each element can belong to either of n categories proceeds by towering expectations as follows. \begin{align} \mathbb{E}Y(\mathbb{I}{{y\ne \hat{y}}}) &= \mathbb{E}X\mathbb{E}{Y|X}\left(\mathbb{I}{{y\ne \hat{y}}}|X=x\right)\ &= \mathbb{E} \left[\Pr(Y=1|X=x)\mathbb{I}{{\hat{y}=2,3,\dots,n}} + \Pr(Y=2|X=x)\mathbb{I}{{\hat{y}=1,3,\dots,n}} + \dots + \Pr(Y=n|X=x) \mathbb{I}{{\hat{y}=1,2,3,\dots,n-1}}\right] \end{align}
This is minimised by simultaneously minimizing all the terms of the expectation using the classifier h(x) = k,\quad \arg\max_{k}Pr(Y=k|X=x) for each observation x.
References
References
- Devroye, L.. (1996). "A probabilistic theory of pattern recognition". Springer.
- (1993). "Strong universal consistency of neural network classifiers". IEEE Transactions on Information Theory.
::callout[type=info title="Wikipedia Source"] This article was imported from Wikipedia and is available under the Creative Commons Attribution-ShareAlike 4.0 License. Content has been adapted to SurfDoc format. Original contributors can be found on the article history page. ::