# Bayesian programming

Bayesian programming is a formalism and a methodology for specifying probabilistic models and solving problems when less than the necessary information is available.

Edwin T. Jaynes proposed that probability could be considered as an alternative to and an extension of logic for rational reasoning with incomplete and uncertain information. In his foundational book Probability Theory: The Logic of Science[1] he developed this theory and proposed what he called "the robot", which was not a physical device but an inference engine to automate probabilistic reasoning: a kind of Prolog for probability instead of logic. Bayesian programming[2] is a formal and concrete implementation of this "robot".

Bayesian programming may also be seen as an algebraic formalism to specify graphical models such as, for instance, Bayesian networks, dynamic Bayesian networks, Kalman filters or hidden Markov models. Indeed, Bayesian Programming is more general than Bayesian networks and has a power of expression equivalent to probabilistic factor graphs.

## Formalism

A Bayesian program is a means of specifying a family of probability distributions.

The constituent elements of a Bayesian program are presented below:

$\text{Program} \begin{cases} \text{Description} \begin{cases} \text{Specification} (\pi) \begin{cases} \text{Variables}\\ \text{Decomposition}\\ \text{Forms}\\ \end{cases}\\ \text{Identification (based on }\delta) \end{cases}\\ \text{Question} \end{cases}$
1. A program is constructed from a description and a question.
2. A description is constructed using some specification ($\pi$) as given by the programmer and an identification or learning process for the parameters not completely specified by the specification, using a data set ($\delta$).
3. A specification is constructed from a set of pertinent variables, a decomposition and a set of forms.
4. Forms are either parametric forms or questions to other Bayesian programs.
5. A question specifies which probability distribution has to be computed.
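As an illustration only, the nesting above can be mirrored by a few container types. The class and field names below are hypothetical (they are not part of any Bayesian programming library), and the two-variable model is a made-up placeholder.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical container types mirroring Program > Description > Specification.


@dataclass
class Specification:
    variables: List[str]
    # Each factor is (left variables L_k, conditioning variables R_k).
    decomposition: List[Tuple[Tuple[str, ...], Tuple[str, ...]]]
    # One form per factor; here simply Python callables returning a probability.
    forms: Dict[str, Callable[..., float]]


@dataclass
class Description:
    specification: Specification
    data: List[dict] = field(default_factory=list)  # delta, used for identification


@dataclass
class Question:
    searched: List[str]
    known: List[str]


@dataclass
class Program:
    description: Description
    question: Question


# Made-up two-variable model: P(A and B) = P(A) * P(B | A).
spec = Specification(
    variables=["A", "B"],
    decomposition=[(("A",), ()), (("B",), ("A",))],
    forms={
        "A": lambda a: 0.3 if a else 0.7,
        "B|A": lambda b, a: (0.9 if b else 0.1) if a else (0.2 if b else 0.8),
    },
)
program = Program(Description(spec), Question(searched=["A"], known=["B"]))
```

Variables that are neither searched nor known are implicitly the free variables of the question.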

### Description

The purpose of a description is to specify an effective method of computing a joint probability distribution on a set of variables $\left\{ X_{1},X_{2},\cdots,X_{N}\right\}$ given a set of experimental data $\delta$ and some specification $\pi$. This joint distribution is denoted as: $P\left(X_{1}\wedge X_{2}\wedge\cdots\wedge X_{N}\mid\delta\wedge\pi\right)$.

To specify preliminary knowledge $\pi$, the programmer must undertake the following:

1. Define the set of relevant variables $\left\{ X_{1},X_{2},\cdots,X_{N}\right\}$ on which the joint distribution is defined.
2. Decompose the joint distribution (break it into relevant independent or conditional probabilities).
3. Define the forms of each of the distributions (e.g., for each variable, one of the list of probability distributions).

#### Decomposition

Given a partition of $\left\{ X_{1},X_{2},\ldots,X_{N}\right\}$ containing $K$ subsets, $K$ variables $L_{1},\cdots,L_{K}$ are defined, each corresponding to one of these subsets. Each variable $L_{k}$ is obtained as the conjunction of the variables $\left\{ X_{k_{1}},X_{k_{2}},\cdots\right\}$ belonging to the $k^{th}$ subset. Recursive application of Bayes' theorem leads to:

\begin{align} & P\left(X_{1}\wedge X_{2}\wedge\cdots\wedge X_{N}\mid\delta\wedge\pi\right)\\ ={} & P\left(L_{1}\wedge\cdots\wedge L_{K}\mid\delta\wedge\pi\right)\\ ={} & P\left(L_{1}\mid\delta\wedge\pi\right)\times P\left(L_{2}\mid L_{1}\wedge\delta\wedge\pi\right) \times\cdots\times P\left(L_{K}\mid L_{K-1}\wedge\cdots\wedge L_{1}\wedge\delta\wedge\pi\right)\end{align}

Conditional independence hypotheses then allow further simplifications. A conditional independence hypothesis for variable $L_{k}$ is defined by choosing some variables $X_{n}$ among the variables appearing in the conjunction $L_{k-1}\wedge\cdots\wedge L_{2}\wedge L_{1}$, labelling $R_{k}$ as the conjunction of these chosen variables and setting:

$P \left(L_{k}\mid L_{k-1}\wedge\cdots\wedge L_{1}\wedge\delta\wedge\pi\right ) = P\left( L_{k} \mid R_{k}\wedge\delta\wedge\pi \right)$

We then obtain:

\begin{align} & P\left(X_{1}\wedge X_{2}\wedge\cdots\wedge X_{N}\mid\delta\wedge\pi\right)\\ ={} & P\left(L_{1}\mid\delta\wedge\pi\right)\times P\left(L_{2}\mid R_{2}\wedge\delta\wedge\pi\right)\times\cdots\times P\left(L_{K}\mid R_{K}\wedge\delta\wedge\pi\right)\end{align}

Such a simplification of the joint distribution as a product of simpler distributions is called a decomposition, derived using the chain rule.

This ensures that each variable appears at the most once on the left of a conditioning bar, which is the necessary and sufficient condition to write mathematically valid decompositions[citation needed].
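The structural condition can be checked mechanically. The sketch below is one possible reading of it, not a definition taken from the source: it assumes that every variable must appear exactly once on the left of a conditioning bar and that each conditioning set $R_k$ only uses variables introduced by earlier factors. The function name is hypothetical.

```python
def is_valid_decomposition(variables, factors):
    """Check a decomposition given as a list of (left, right) tuples.

    left  : names of the variables forming L_k
    right : names of the conditioning variables R_k
    """
    seen = set()
    for left, right in factors:
        # R_k may only contain variables from earlier factors.
        if not set(right) <= seen:
            return False
        for name in left:
            # A variable may not appear twice on the left of a bar.
            if name in seen:
                return False
            seen.add(name)
    # Every variable of the joint distribution must be covered.
    return seen == set(variables)


naive_bayes = [
    (("Spam",), ()),
    (("W0",), ("Spam",)),
    (("W1",), ("Spam",)),
]
broken = [
    (("Spam",), ()),
    (("W0",), ("W1",)),  # W1 has not been introduced yet
    (("W1",), ("Spam",)),
]
```

Here `is_valid_decomposition(["Spam", "W0", "W1"], naive_bayes)` holds, while the `broken` list is rejected because its second factor conditions on a variable that no earlier factor defines.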

#### Forms

Each distribution $P\left(L_{k}\mid R_{k}\wedge\delta\wedge\pi\right)$ appearing in the product is then associated with either a parametric form (i.e., a function $f_{\mu}\left(L_{k}\right)$) or a question to another Bayesian program $P\left(L_{k}\mid R_{k} \wedge \delta \wedge \pi \right) = P\left(L\mid R\wedge\widehat{\delta}\wedge\widehat{\pi}\right)$.

When it is a form $f_{\mu}\left(L_{k}\right)$, in general, $\mu$ is a vector of parameters that may depend on $R_{k}$ or $\delta$ or both. Learning takes place when some of these parameters are computed using the data set $\delta$.

An important feature of Bayesian Programming is this capacity to use questions to other Bayesian programs as components of the definition of a new Bayesian program. $P\left(L_{k}\mid R_{k}\wedge\delta\wedge\pi\right)$ is obtained by some inferences done by another Bayesian program defined by the specifications $\widehat{\pi}$ and the data $\widehat{\delta}$. This is similar to calling a subroutine in classical programming and provides an easy way to build hierarchical models.

### Question

Given a description (i.e., $P\left(X_{1}\wedge X_{2}\wedge\cdots\wedge X_{N}\mid\delta\wedge\pi\right)$), a question is obtained by partitioning $\left\{ X_{1},X_{2},\cdots,X_{N}\right\}$ into three sets: the searched variables, the known variables and the free variables.

The 3 variables $Searched$, $Known$ and $Free$ are defined as the conjunction of the variables belonging to these sets.

A question is defined as the set of distributions:

$P\left(\text{Searched}\mid \text{Known}\wedge\delta\wedge\pi\right)$

made of as many "instantiated questions" as the cardinal of $Known$, each instantiated question being the distribution obtained for one particular value $\text{known}$ of $Known$:

$P\left(\text{Searched}\mid\text{known}\wedge\delta\wedge\pi\right)$

### Inference

Given the joint distribution $P\left(X_{1}\wedge X_{2}\wedge\cdots\wedge X_{N}\mid\delta\wedge\pi\right)$, it is always possible to compute any possible question using the following general inference:

\begin{align} & P\left(\text{Searched}\mid\text{Known}\wedge\delta\wedge\pi\right)\\ ={} & \sum_\text{Free}\left[P\left( \text{Searched} \wedge \text{Free} \mid \text{Known}\wedge\delta\wedge\pi\right)\right]\\ ={} & \frac{\displaystyle \sum_\text{Free}\left[P\left(\text{Searched}\wedge \text{Free}\wedge \text{Known}\mid\delta\wedge\pi\right)\right]}{\displaystyle P\left(\text{Known}\mid\delta\wedge\pi\right)}\\ ={} & \frac{\displaystyle \sum_\text{Free}\left[P\left(\text{Searched}\wedge \text{Free}\wedge \text{Known}\mid\delta\wedge\pi\right)\right]}{\displaystyle \sum_{\text{Free}\wedge \text{Searched}} \left[P\left(\text{Searched} \wedge \text{Free} \wedge \text{Known}\mid\delta\wedge\pi\right)\right]}\\ ={} & \frac{1}{Z}\times\sum_\text{Free}\left[P\left(\text{Searched}\wedge \text{Free} \wedge \text{Known} \mid \delta\wedge\pi\right)\right]\end{align}

where the first equality results from the marginalization rule, the second from Bayes' theorem and the third from a second application of marginalization. The denominator is a normalization term and can be replaced by a constant $Z$.

Theoretically, this allows any Bayesian inference problem to be solved. In practice, however, the cost of computing $P\left(\text{Searched} \mid \text{Known} \wedge\delta\wedge\pi\right)$ exhaustively and exactly is too great in almost all cases.

Replacing the joint distribution by its decomposition (with $R_{1}$ empty, so that the first factor is $P\left(L_{1}\mid\delta\wedge\pi\right)$), we get:

\begin{align} & P\left(\text{Searched}\mid \text{Known}\wedge\delta\wedge\pi\right)\\ = {}& \frac{1}{Z} \sum_\text{Free} \left[\prod_{k=1}^K \left[ P\left( L_{k}\mid R_{k} \wedge\delta\wedge \pi \right)\right]\right] \end{align}

which is usually a much simpler expression to compute, as the dimensionality of the problem is considerably reduced by the decomposition into a product of lower dimension distributions.
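The general inference rule can be sketched by brute-force enumeration on a tiny model. Everything below is an illustrative assumption: three binary variables with the decomposition $P(A)\,P(B\mid A)\,P(C\mid B)$, made-up probability tables, and the question $P(A\mid C)$ with $B$ as the free variable.

```python
# Made-up probability tables for a chain A -> B -> C of binary variables.
p_a = {0: 0.7, 1: 0.3}
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # p_b_given_a[a][b]
p_c_given_b = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}  # p_c_given_b[b][c]


def joint(a, b, c):
    """Joint distribution written as the product given by the decomposition."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]


def ask_a_given_c(c):
    """P(A | C = c): sum over the free variable B, then normalize by Z."""
    unnormalized = {a: sum(joint(a, b, c) for b in (0, 1)) for a in (0, 1)}
    z = sum(unnormalized.values())
    return {a: value / z for a, value in unnormalized.items()}


posterior = ask_a_given_c(1)
```

Enumeration like this visits every value of the free variables, which is exactly the cost that becomes prohibitive for realistic models; practical inference engines rely on reordering sums and products or on approximation.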

## Example

### Bayesian spam detection

The purpose of Bayesian spam filtering is to eliminate junk e-mails.

The problem is very easy to formulate. E-mails should be classified into one of two categories: non-spam or spam. The only available information to classify the e-mails is their content: a set of words. Using these words without taking the order into account is commonly called a bag of words model.

The classifier should furthermore be able to adapt to its user and to learn from experience. Starting from an initial standard setting, the classifier should modify its internal parameters when the user disagrees with its decision. It will hence adapt to the user's criteria for differentiating between non-spam and spam, and its results will improve as it encounters more and more classified e-mails.

#### Variables

The variables necessary to write this program are as follows:

1. $Spam$: a binary variable, false if the e-mail is not spam and true otherwise.
2. $W_0,W_1, \ldots, W_{N-1}$: $N$ binary variables. $W_n$ is true if the $n^{th}$ word of the dictionary is present in the text.

These $N + 1$ binary variables sum up all the information about an e-mail.

#### Decomposition

Starting from the joint distribution and recursively applying Bayes' theorem, we obtain:

\begin{align} & P(\text{Spam}\wedge W_{0}\wedge\cdots\wedge W_{N-1})\\ ={} & P(\text{Spam})\times P(W_0 \mid \text{Spam})\times P(W_1 \mid \text{Spam} \wedge W_0)\\ & \times\cdots\\ & \times P\left(W_{N-1}\mid\text{Spam}\wedge W_{0}\wedge\cdots\wedge W_{N-2}\right)\end{align}

This is an exact mathematical expression.

It can be drastically simplified by assuming that the probability of appearance of a word knowing the nature of the text (spam or not) is independent of the appearance of the other words. This is the naive Bayes assumption and this makes this spam filter a naive Bayes model.

For instance, the programmer can assume that:

$P(W_1\mid\text{Spam} \land W_0) = P(W_1\mid\text{Spam})$

to finally obtain:

$P(\text{Spam} \land W_0 \land \ldots \land W_{N-1}) = P(\text{Spam})\prod_{n=0}^{N-1}[P(W_n\mid\text{Spam})]$

The assumption is "naive" in the sense that independence between words is clearly not completely true. For instance, it completely neglects that the appearance of pairs of words may be more significant than isolated appearances. However, the programmer may adopt this hypothesis and develop the model and the associated inferences to test how reliable and efficient it is.
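A standard counting argument, added here for illustration, shows how drastic the simplification is. A full joint distribution over $N+1$ binary variables has $2^{N+1}-1$ free parameters, whereas the naive Bayes decomposition needs one for $P(\text{Spam})$ and two for each $P(W_n\mid\text{Spam})$:

$2^{N+1}-1 \quad\text{versus}\quad 1+2N$

For a dictionary of $N=1000$ words, this is $2001$ parameters instead of $2^{1001}-1$.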

#### Parametric forms

To be able to compute the joint distribution, the programmer must now specify the $N + 1$ distributions appearing in the decomposition:

1. $P(\text{Spam})$ is a prior defined, for instance, by $P([\text{Spam}=\text{true}]) = 0.75$
2. Each of the $N$ forms $P(W_n\mid\text{Spam})$ may be specified using Laplace's rule of succession (a pseudocount-based smoothing technique that counters the zero-frequency problem of never-before-seen words):
    1. $P(W_n\mid[\text{Spam}=\text{false}])=\frac{1+a^n_f}{2+a_f}$
    2. $P(W_n\mid[\text{Spam}=\text{true}])=\frac{1+a^n_t}{2+a_t}$

where $a^n_f$ stands for the number of appearances of the $n^{th}$ word in non-spam e-mails and $a_f$ stands for the total number of non-spam e-mails. Similarly, $a_t^n$ stands for the number of appearances of the $n^{th}$ word in spam e-mails and $a_t$ stands for the total number of spam e-mails.
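The two forms can be sketched as a single function. The sketch assumes that a word is counted at most once per e-mail, so that the word count never exceeds the e-mail count and the result is a probability; the function name is hypothetical.

```python
def laplace_word_probability(word_count, email_count):
    """P(W_n = true | class) by Laplace's rule of succession.

    word_count  : number of e-mails of the class containing the word (a^n)
    email_count : total number of e-mails of the class (a)
    """
    return (1 + word_count) / (2 + email_count)


# With no data at all the estimate is 1/2, never 0 or 1.
no_data = laplace_word_probability(0, 0)
# A word seen in 3 of 8 e-mails of the class: (1 + 3) / (2 + 8).
some_data = laplace_word_probability(3, 8)
```

Because $W_n$ is binary, the probability that the word is absent is simply one minus this value.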

#### Identification

The $N$ forms $P(W_n\mid\text{Spam})$ are not yet completely specified because the $2N + 2$ parameters $a_f^{n=0, \ldots, N-1}$, $a_t^{n=0, \ldots, N-1}$, $a_f$ and $a_t$ have no values yet.

The identification of these parameters could be done either by batch processing a series of classified e-mails or by an incremental updating of the parameters using the user's classifications of the e-mails as they arrive.

Both methods could be combined: the system could start with initial standard values of these parameters issued from a generic database, then some incremental learning customizes the classifier to each individual user.
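Both identification modes reduce to maintaining the same counters. The sketch below is one possible arrangement, with hypothetical class and method names: `update` handles a single classified e-mail (incremental learning) and `fit` replays a batch through it.

```python
class SpamCounts:
    """Counters a_f^n, a_t^n, a_f and a_t for a dictionary of n_words words."""

    def __init__(self, n_words):
        # Indexed by the class label: False = non-spam, True = spam.
        self.word_counts = {False: [0] * n_words, True: [0] * n_words}
        self.email_counts = {False: 0, True: 0}

    def update(self, words_present, is_spam):
        """Incremental step: account for one e-mail classified by the user."""
        self.email_counts[is_spam] += 1
        for n, present in enumerate(words_present):
            if present:
                self.word_counts[is_spam][n] += 1

    def fit(self, emails):
        """Batch step: process a series of (words_present, is_spam) pairs."""
        for words_present, is_spam in emails:
            self.update(words_present, is_spam)


counts = SpamCounts(n_words=2)
counts.fit([([True, False], True), ([True, True], False)])  # generic database
counts.update([False, False], True)  # later correction from the user
```

Starting from counters filled by a generic database and then calling `update` for each user decision corresponds to the combined scheme described above.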

#### Question

The question asked to the program is: "what is the probability for a given text to be spam knowing which words appear and don't appear in this text?" It can be formalized by:

$P (\text{Spam}\mid w_0 \wedge\cdots\wedge w_{N-1} )$

which can be computed as follows:

\begin{align} & P(\text{Spam}\mid w_{0}\wedge\cdots\wedge w_{N-1} )\\ ={} & \frac{\displaystyle P(\text{Spam}) \prod_{n=0}^{N-1} [ P(w_{n}\mid\text{Spam})]}{\displaystyle \sum_\text{Spam} [P(\text{Spam}) \prod_{n=0}^{N-1} [P (w_{n}\mid\text{Spam})]]}\end{align}

The denominator is a normalization constant. It is not necessary to compute it to decide whether we are dealing with spam. For instance, an easy trick is to compute the ratio:

\begin{align} & \frac{P([\text{Spam}=\text{true}]\mid w_0\wedge\cdots\wedge w_{N-1})}{P([ \text{Spam} = \text{false} ]\mid w_0 \wedge\cdots\wedge w_{N-1})}\\ ={} & \frac{P([ \text{Spam}=\text{true} ] )}{P([ \text{Spam} =\text{false} ])}\times\prod_{n=0}^{N-1} \left[\frac{P(w_n\mid [\text{Spam}=\text{true}])}{P(w_n\mid [\text{Spam} = \text{false}])}\right] \end{align}

This computation is faster and easier because it requires only $2N$ products.
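The ratio can be sketched as follows. One deliberate change from the text is that the product is accumulated as a sum of logarithms, a common way to avoid numerical underflow when $N$ is large. The function names and the example counts are made up, and each word count is assumed not to exceed the corresponding e-mail count.

```python
import math


def word_likelihood(present, word_count, email_count):
    """P(w_n | class) from Laplace's rule, for a present or absent word."""
    p_true = (1 + word_count) / (2 + email_count)
    return p_true if present else 1.0 - p_true


def spam_log_odds(words_present, counts_spam, counts_ham, n_spam, n_ham,
                  prior_spam=0.75):
    """Logarithm of P(Spam=true | words) / P(Spam=false | words)."""
    log_odds = math.log(prior_spam) - math.log(1.0 - prior_spam)
    for present, c_t, c_f in zip(words_present, counts_spam, counts_ham):
        log_odds += math.log(word_likelihood(present, c_t, n_spam))
        log_odds -= math.log(word_likelihood(present, c_f, n_ham))
    return log_odds


def spam_probability(log_odds):
    """Convert log-odds back to P(Spam=true | words) without overflow."""
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)


# Made-up counts: word 0 is frequent in spam, word 1 is frequent in non-spam.
odds = spam_log_odds([True, False], counts_spam=[8, 1], counts_ham=[1, 5],
                     n_spam=10, n_ham=10)
probability = spam_probability(odds)
```

An e-mail is flagged as spam when the log-odds are positive, which is equivalent to the ratio exceeding 1.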

#### Bayesian program

The Bayesian spam filter program is completely defined by:

$\Pr \begin{cases} Ds \begin{cases} Sp (\pi) \begin{cases} Va: \text{Spam},W_0,W_1 \ldots W_{N-1} \\ Dc: \begin{cases} P(\text{Spam} \land W_0 \land \ldots \land W_n \land \ldots \land W_{N-1})\\ = P(\text{Spam})\prod_{n=0}^{N-1}P(W_n\mid\text{Spam}) \end{cases}\\ Fo: \begin{cases} P(\text{Spam}): \begin{cases} P([\text{Spam}=\text{false}])=0.25 \\ P([\text{Spam}=\text{true}])=0.75 \end{cases}\\ P(W_n\mid\text{Spam}): \begin{cases} P(W_n\mid[\text{Spam}=\text{false}])\\ =\frac{1+a^n_f}{2+a_f} \\ P(W_n\mid[\text{Spam}=\text{true}])\\ =\frac{1+a^n_t}{2+a_t} \end{cases} \\ \end{cases}\\ \end{cases}\\ \text{Identification (based on }\delta) \end{cases}\\ Qu: P(\text{Spam}\mid w_0 \land \ldots \land w_n \land \ldots \land w_{N-1}) \end{cases}$

### Bayesian filter, Kalman filter and hidden Markov model

Bayesian filters (often called recursive Bayesian estimation) are generic probabilistic models for time-evolving processes. Numerous models are particular instances of this generic approach, for instance the Kalman filter or the hidden Markov model.

#### Variables

• Variables $S^{0},\ldots,S^{T}$ are a time series of state variables considered to be on a time horizon ranging from $0$ to $T$.
• Variables $O^{0},\ldots,O^{T}$ are a time series of observation variables on the same horizon.

#### Decomposition

The decomposition is based:

• on $P(S^t \mid S^{t-1})$, called the system model, transition model or dynamic model, which formalizes the transition from the state at time $t-1$ to the state at time $t$;
• on $P(O^t\mid S^t)$, called the observation model, which expresses what can be observed at time $t$ when the system is in state $S^t$;
• on an initial state at time $0$: $P(S^0 \wedge O^0)$.

#### Parametrical forms

The parametrical forms are not constrained and different choices lead to different well-known models: see Kalman filters and Hidden Markov models just below.

#### Question

The question usually asked of these models is $P\left(S^{t+k}\mid O^{0}\wedge\cdots\wedge O^{t}\right)$: what is the probability distribution for the state at time $t + k$ knowing the observations from instant $0$ to $t$?

The most common case is Bayesian filtering where $k=0$, which means that one searches for the present state, knowing the past observations.

However it is also possible to do a prediction $(k>0)$, where one tries to extrapolate a future state from past observations, or to do smoothing $(k<0)$, where one tries to recover a past state from observations made either before or after that instant.

Some more complicated questions may also be asked as shown below in the HMM section.

Bayesian filters $(k=0)$ have a very interesting recursive property, which contributes greatly to their attractiveness. $P\left(S^{t}\mid O^{0}\wedge\cdots\wedge O^{t}\right)$ may be computed simply from $P\left(S^{t-1}\mid O^0 \wedge \cdots \wedge O^{t-1}\right)$ with the following formula:

$\begin{array}{ll} & P\left(S^{t}|O^{0}\wedge\cdots\wedge O^{t}\right)\\ \propto & P\left(O^{t}|S^{t}\right)\times\sum_{S^{t-1}}\left[P\left(S^{t}|S^{t-1}\right)\times P\left(S^{t-1}|O^{0}\wedge\cdots\wedge O^{t-1}\right)\right]\end{array}$

Another interesting point of view for this equation is to consider that there are two phases: a prediction phase and an estimation phase:

• During the prediction phase, the state is predicted using the dynamic model and the estimation of the state at the previous moment:
$\begin{array}{ll} & P\left(S^{t}|O^{0}\wedge\cdots\wedge O^{t-1}\right)\\ = & \sum_{S^{t-1}}\left[P\left(S^{t}|S^{t-1}\right)\times P\left(S^{t-1}|O^{0}\wedge\cdots\wedge O^{t-1}\right)\right]\end{array}$
• During the estimation phase, the prediction is either confirmed or invalidated using the last observation:
\begin{align} & P\left(S^{t}\mid O^{0}\wedge\cdots\wedge O^{t}\right)\\ \propto{} & P\left(O^{t}\mid S^{t}\right)\times P\left(S^{t}|O^{0}\wedge\cdots\wedge O^{t-1}\right) \end{align}
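For discrete states, the two phases can be sketched in a few lines, with the normalization made explicit. The function names, the list-based tables and the numbers are illustrative assumptions.

```python
def predict(belief, transition):
    """Prediction phase.

    belief[i]        = P(S^{t-1} = i | O^0 ... O^{t-1})
    transition[i][j] = P(S^t = j | S^{t-1} = i)
    """
    n = len(belief)
    return [sum(transition[i][j] * belief[i] for i in range(n))
            for j in range(n)]


def estimate(predicted, observation_model, obs):
    """Estimation phase: weight by P(O^t = obs | S^t = j), then normalize."""
    unnormalized = [observation_model[j][obs] * predicted[j]
                    for j in range(len(predicted))]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]


# Made-up two-state example.
transition = [[0.7, 0.3], [0.3, 0.7]]
observation_model = [[0.9, 0.1], [0.2, 0.8]]  # observation_model[state][obs]
belief = [0.5, 0.5]

predicted = predict(belief, transition)
belief = estimate(predicted, observation_model, obs=0)
```

Calling the two functions once per time step, feeding each result back in as the next `belief`, gives the recursive filter.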

#### Bayesian program

$Pr\begin{cases} Ds\begin{cases} Sp(\pi)\begin{cases} Va:\\ S^{0},\cdots,S^{T},O^{0},\cdots,O^{T}\\ Dc:\\ \begin{cases} & P\left(S^{0}\wedge\cdots\wedge S^{T}\wedge O^{0}\wedge\cdots\wedge O^{T}|\pi\right)\\ = & P\left(S^{0}\wedge O^{0}\right)\times\prod_{t=1}^{T}\left[P\left(S^{t}|S^{t-1}\right)\times P\left(O^{t}|S^{t}\right)\right]\end{cases}\\ Fo:\\ \begin{cases} P\left(S^{0}\wedge O^{0}\right)\\ P\left(S^{t}|S^{t-1}\right)\\ P\left(O^{t}|S^{t}\right)\end{cases}\end{cases}\\ Id\end{cases}\\ Qu:\\ \begin{cases} \begin{array}{l} P\left(S^{t+k}|O^{0}\wedge\cdots\wedge O^{t}\right)\\ \left(k=0\right)\equiv \text{Filtering} \\ \left(k>0\right)\equiv \text{Prediction} \\ \left(k<0\right)\equiv \text{Smoothing} \end{array}\end{cases}\end{cases}$

#### Kalman filter

The very well-known Kalman filters[3] are a special case of Bayesian filters.

They are defined by the following Bayesian program:

$Pr\begin{cases} Ds\begin{cases} Sp(\pi)\begin{cases} Va:\\ S^{0},\cdots,S^{T},O^{0},\cdots,O^{T}\\ Dc:\\ \begin{cases} & P\left(S^{0}\wedge\cdots\wedge O^{T}|\pi\right)\\ = & \left[\begin{array}{c} P\left(S^{0}\wedge O^{0}|\pi\right)\\ \prod_{t=1}^{T}\left[P\left(S^{t}|S^{t-1}\wedge\pi\right)\times P\left(O^{t}|S^{t}\wedge\pi\right)\right]\end{array}\right]\end{cases}\\ Fo:\\ \begin{cases} P\left(S^t \mid S^{t-1}\wedge\pi\right)\equiv G\left(S^{t},A\bullet S^{t-1},Q\right)\\ P\left(O^t \mid S^t \wedge\pi\right)\equiv G\left(O^{t},H\bullet S^{t},R\right)\end{cases}\end{cases}\\ Id\end{cases}\\ Qu:\\ P\left(S^T \mid O^0 \wedge\cdots\wedge O^{T}\wedge\pi\right)\end{cases}$
• Variables are continuous.
• The transition model $P(S^t \mid S^{t-1}\wedge\pi)$ and the observation model $P(O^t \mid S^t \wedge\pi)$ are both specified using Gaussian laws with means that are linear functions of the conditioning variables.

With these hypotheses and by using the recursive formula, it is possible to solve the inference problem analytically to answer the usual $P(S^T \mid O^0 \wedge\cdots\wedge O^T \wedge\pi)$ question. This leads to an extremely efficient algorithm, which explains the popularity of Kalman filters and the number of their everyday applications.

When there are no obvious linear transition and observation models, it is still often possible, using a first-order Taylor's expansion, to treat these models as locally linear. This generalization is commonly called the extended Kalman filter.
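In the one-dimensional case the analytic solution is short enough to write out. The sketch below assumes scalar $A$, $Q$, $H$ and $R$ (with $Q$ and $R$ read as variances), a Gaussian belief summarized by its mean and variance, and one prediction step before each observation; it uses the standard scalar Kalman equations and the function names are hypothetical.

```python
def kalman_predict(mean, var, a, q):
    """Prediction phase: propagate the belief through S^t ~ G(a * S^{t-1}, q)."""
    return a * mean, a * var * a + q


def kalman_update(mean_pred, var_pred, obs, h, r):
    """Estimation phase: correct the prediction with O^t ~ G(h * S^t, r)."""
    gain = var_pred * h / (h * var_pred * h + r)
    mean = mean_pred + gain * (obs - h * mean_pred)
    var = (1.0 - gain * h) * var_pred
    return mean, var


def kalman_filter(observations, mean0, var0, a, q, h, r):
    """Run the recursion over a sequence of scalar observations."""
    mean, var = mean0, var0
    for obs in observations:
        mean, var = kalman_predict(mean, var, a, q)
        mean, var = kalman_update(mean, var, obs, h, r)
    return mean, var


# Static state, prior G(0, 1), one observation of 2 with unit noise variance.
mean, var = kalman_filter([2.0], mean0=0.0, var0=1.0, a=1.0, q=0.0, h=1.0, r=1.0)
```

With equal prior and observation variances, the posterior mean lies halfway between the prior mean and the observation, and the variance is halved.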

#### Hidden Markov model

Hidden Markov models (HMMs) are another very popular specialization of Bayesian filters.

They are defined by the following Bayesian program:

$\Pr\begin{cases} Ds\begin{cases} Sp(\pi)\begin{cases} Va:\\ S^{0},\ldots,S^{T},O^{0},\ldots,O^{T}\\ Dc:\\ \begin{cases} & P\left(S^{0}\wedge\cdots\wedge O^{T}\mid\pi\right)\\ = & \left[\begin{array}{c} P\left(S^{0}\wedge O^{0}\mid\pi\right)\\ \prod_{t=1}^{T}\left[P\left(S^{t}\mid S^{t-1}\wedge\pi\right)\times P\left(O^{t}\mid S^{t}\wedge\pi\right)\right]\end{array}\right]\end{cases}\\ Fo:\\ \begin{cases} P\left(S^{0}\wedge O^{0}\mid\pi\right)\equiv \text{Matrix}\\ P\left(S^{t}\mid S^{t-1}\wedge\pi\right)\equiv \text{Matrix}\\ P\left(O^{t}\mid S^{t}\wedge\pi\right)\equiv \text{Matrix}\end{cases}\end{cases}\\ Id\end{cases}\\ Qu:\\ \max_{S^{1}\wedge\cdots\wedge S^{T-1}}\left[P\left(S^{1}\wedge\cdots\wedge S^{T-1}\mid S^{T}\wedge O^{0}\wedge\cdots\wedge O^{T}\wedge\pi\right)\right]\end{cases}$
• Variables are treated as being discrete.
• The transition model $P\left(S^{t}\mid S^{t-1}\wedge\pi\right)$ and the observation model $P\left(O^{t}\mid S^{t}\wedge\pi\right)$ are both specified using probability matrices.

• The question most frequently asked of HMMs is:
$\max_{S^{1}\wedge\cdots\wedge S^{T-1}}\left[P\left(S^{1}\wedge\cdots\wedge S^{T-1}\mid S^{T}\wedge O^{0}\wedge\cdots\wedge O^{T}\wedge\pi\right)\right]$

What is the most probable series of states that leads to the present state, knowing the past observations?

This particular question may be answered with a specific and very efficient algorithm called the Viterbi algorithm.
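A minimal sketch of the Viterbi algorithm follows. It implements the textbook variant, which returns the most probable complete state sequence given the observations rather than conditioning on a known final state, and it assumes the initial term factors as $P(S^0)\,P(O^0\mid S^0)$. The tables are a made-up two-state example.

```python
def viterbi(observations, initial, transition, emission):
    """Most probable state sequence and its joint probability.

    initial[i]       = P(S^0 = i)
    transition[i][j] = P(S^t = j | S^{t-1} = i)
    emission[j][o]   = P(O^t = o | S^t = j)
    """
    n_states = len(initial)
    # best[j]: probability of the best path so far that ends in state j.
    best = [initial[j] * emission[j][observations[0]] for j in range(n_states)]
    backpointers = []
    for obs in observations[1:]:
        step_best, step_back = [], []
        for j in range(n_states):
            prev = max(range(n_states), key=lambda i: best[i] * transition[i][j])
            step_back.append(prev)
            step_best.append(best[prev] * transition[prev][j] * emission[j][obs])
        best = step_best
        backpointers.append(step_back)
    # Follow the back-pointers from the best final state.
    last = max(range(n_states), key=lambda j: best[j])
    path = [last]
    for step_back in reversed(backpointers):
        path.append(step_back[path[-1]])
    path.reverse()
    return path, best[last]


# States: 0 = healthy, 1 = fever. Observations: 0 = normal, 1 = cold, 2 = dizzy.
initial = [0.6, 0.4]
transition = [[0.7, 0.3], [0.4, 0.6]]
emission = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
path, probability = viterbi([0, 1, 2], initial, transition, emission)
```

For long sequences the products underflow, so implementations usually work with sums of log-probabilities instead.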

A specific learning algorithm called the Baum–Welch algorithm has also been developed for HMMs.

## Applications

For the last 15 years, the Bayesian programming approach has been used in various universities to develop both robotics applications and life sciences models.[4]

### Robotics

In robotics, Bayesian programming has been applied to autonomous robotics,[5][6][7][8][9] robotic CAD systems,[10] advanced driver-assistance systems,[11] robotic arm control, mobile robotics,[12][13] human-robot interaction,[14] human-vehicle interaction (Bayesian autonomous driver models),[15][16][17][18][19][20] video game avatar programming and training[21] and real-time strategy games (AI).[22]

### Life sciences

In life sciences, Bayesian Programming has been used in vision to reconstruct shape from motion,[23] to model visuo-vestibular interaction[24] and to study saccadic eye movements;[25] in speech perception and control to study early acquisition of speech[26] and the emergence of articulatory-acoustic systems;[27] and to model handwriting perception and control.[28]

## Bayesian programming versus possibility theories

The comparison between probabilistic approaches (not only Bayesian programming) and possibility theories has been debated for a long time and is, unfortunately, a very controversial matter.

Possibility theories such as fuzzy sets,[29] fuzzy logic[30] and possibility theory[31] propose alternatives to probability for modelling uncertainty. They argue that probability is insufficient or inconvenient to model certain aspects of incomplete and uncertain knowledge.

The defense of probability is mainly based on Cox's theorem which, starting from four postulates concerning rational reasoning in the presence of uncertainty, demonstrates that the only mathematical framework that satisfies these postulates is probability theory. The argument addressed to proponents of other approaches then goes as follows: "if you use a different approach than probability, then you necessarily infringe one of these postulates; let us see which one and discuss its utility."

## Bayesian programming versus probabilistic programming

The purpose of probabilistic programming is to unify the scope of classical programming languages with probabilistic modeling (especially Bayesian networks) in order to be able to deal with uncertainty but still profit from the power of expression of programming languages to describe complex models.

The extended classical programming languages can be logical languages, as proposed in Probabilistic Horn Abduction,[32] Independent Choice Logic,[33] PRISM[34] and ProbLog, which are extensions of Prolog.

They can also be extensions of functional programming languages (essentially Lisp and Scheme), such as IBAL or CHURCH. The underlying programming languages can even be object-oriented, as in BLOG and FACTORIE, or more standard ones, as in CES and FIGARO.

The purpose of Bayesian programming is different. Jaynes' precept of "probability as logic" argues that probability is an extension of and an alternative to logic, on top of which a complete theory of rationality, computation and programming can be rebuilt. Bayesian programming does not seek to extend classical languages but rather to replace them with a new programming approach based on probability that takes incompleteness and uncertainty fully into account.

The precise comparison between the semantics and expressive power of Bayesian and probabilistic programming is still an open question.

## References

1. Jaynes, Edwin T. (2003). Probability Theory: The Logic of Science. Cambridge University Press. ISBN 0-521-59271-2.
2. Bessière, P.; Mazer, E.; Ahuactzin, J-M. & Mekhnacha, K. (2013). Bayesian Programming. Chapman & Hall/CRC. ISBN 9781439880326.
3. Kalman, R. E. (1960). "A New Approach to Linear Filtering and Prediction Problems". Transactions of the ASME–Journal of Basic Engineering 82: 33–45. doi:10.1115/1.3662552.
4. Bessière, P.; Laugier, C. & Siegwart, R. (2008). Probabilistic Reasoning and Decision Making in Sensory-Motor Systems. Springer. ISBN 978-3-540-79007-5.
5. Lebeltel, O.; Bessière, P.; Diard, J. & Mazer, E. (2004). "Bayesian Robot Programming". Advanced Robotics 16 (1): 49–79. doi:10.1023/b:auro.0000008671.38949.43.
6. Diard, J.; Gilet, E.; Simonin, E. & Bessière, P. (2010). "Incremental learning of Bayesian sensorimotor models: from low-level behaviours to large-scale structure of the environment". Connection Science 22 (4): 291–312. doi:10.1080/09540091003682561.
7. Pradalier, C.; Hermosillo, J.; Koike, C.; Braillon, C.; Bessière, P. & Laugier, C. (2005). "The CyCab: a car-like robot navigating autonomously and safely among pedestrians". Robotics and Autonomous Systems 50 (1): 51–68. doi:10.1016/j.robot.2004.10.002.
8. Ferreira, J.; Lobo, J.; Bessière, P.; Castelo-Branco, M. & Dias, J. (2012). "A Bayesian Framework for Active Artificial Perception". IEEE Transactions on Systems, Man, and Cybernetics, Part B 99: 1–13.
9. Ferreira, J. F.; Dias, J. M. (2014). Probabilistic Approaches to Robotic Perception. Springer.
10. Mekhnacha, K.; Mazer, E. & Bessière, P. (2001). "The design and implementation of a Bayesian CAD modeler for robotic applications". Advanced Robotics 15 (1): 45–69. doi:10.1163/156855301750095578.
11. Coué, C.; Pradalier, C.; Laugier, C.; Fraichard, T. & Bessière, P. (2006). "Bayesian Occupancy Filtering for Multitarget Tracking: an Automotive Application". International Journal of Robotics Research 25 (1): 19–30. doi:10.1177/0278364906061158.
12. Vasudevan, S.; Siegwart, R. (2008). "Bayesian space conceptualization and place classification for semantic maps in mobile robotics". Robotics and Autonomous Systems 56 (6): 522–537. doi:10.1016/j.robot.2008.03.005.
13. Perrin, X.; Chavarriaga, R.; Colas, F.; Siegwart, R. & Millan, J. (2010). "Brain-coupled interaction for semi-autonomous navigation of an assistive robot". Robotics and Autonomous Systems 58 (12): 1246–1255. doi:10.1016/j.robot.2010.05.010.
14. Rett, J.; Dias, J. & Ahuactzin, J-M. (2010). "Bayesian reasoning for Laban Movement Analysis used in human-machine interaction". Int. J. of Reasoning-based Intelligent Systems 2 (1): 13–35. doi:10.1504/IJRIS.2010.029812.
15. Möbus, C.; Eilers, M.; Garbe, H.; Zilinski, M. (2009). "Probabilistic and Empirical Grounded Modeling of Agents in (Partial) Cooperative Traffic Scenarios". In Duffy, Vincent G. Lecture Notes in Computer Science, Volume 5620, Second International Conference, ICDHM 2009, San Diego, CA, USA: Springer. pp. 423–432. doi:10.1007/978-3-642-02809-0_45. ISBN 978-3-642-02808-3. http://link.springer.com/chapter/10.1007%2F978-3-642-02809-0_45
16. Möbus, C.; Eilers, M. (2009). "Further Steps Towards Driver Modeling according to the Bayesian Programming Approach". In Duffy, Vincent G. Lecture Notes in Computer Science, Volume 5620, Second International Conference, ICDHM 2009, San Diego, CA, USA: Springer. pp. 413–422. doi:10.1007/978-3-642-02809-0_44. ISBN 978-3-642-02808-3. http://link.springer.com/chapter/10.1007%2F978-3-642-02809-0_44
17. Eilers, M.; Möbus, C. (2010). "Lernen eines modularen Bayesian Autonomous Driver Mixture-of-Behaviors (BAD MoB) Modells". In Kolrep, H.; Jürgensohn, Th. Fahrermodellierung - Zwischen kinematischen Menschmodellen und dynamisch-kognitiven Verhaltensmodellen. Fortschrittsbericht des VDI in der Reihe 22 (Mensch-Maschine-Systeme). Düsseldorf, Germany: VDI-Verlag. pp. 61–74. ISBN 978-3-18-303222-8.
18. Möbus, C.; Eilers, M. (2011). "Prototyping Smart Assistance with Bayesian Autonomous Driver Models". In Mastrogiovanni, F.; Chong, N.-Y. Hershey, Pennsylvania (USA): IGI Global publications. pp. 460–512. doi:10.4018/978-1-61692-857-5.ch023. ISBN 9781616928575. http://www.igi-global.com/chapter/prototyping-smart-assistance-bayesian-autonomous/54671
19. Eilers, M.; Möbus, C. (2011). "Learning the Relevant Percepts of Modular Hierarchical Bayesian Driver Models Using a Bayesian Information Criterion". In Duffy, V.G. Digital Human Modeling. LNCS 6777. Heidelberg, Germany: Springer. pp. 463–472. doi:10.1007/978-3-642-21799-9_52. ISBN 978-3-642-21798-2.
20. Eilers, M.; Möbus, C. (2011). "Learning of a Bayesian Autonomous Driver Mixture-of-Behaviors (BAD-MoB) Model". In Duffy, V.G. Advances in Applied Digital Human Modeling. LNCS 6777. Boca Raton, USA: CRC Press, Taylor & Francis Group. pp. 436–445. ISBN 978-1-4398-3511-1.
21. Le Hy, R.; Arrigoni, A.; Bessière, P. & Lebeltel, O. (2004). "Teaching Bayesian Behaviours to Video Game Characters". Robotics and Autonomous Systems 47 (2–3): 177–185. doi:10.1016/j.robot.2004.03.012.
22. Synnaeve, G. (2012). Bayesian Programming and Learning for Multiplayer Video Games.
23. Colas, F.; Droulez, J.; Wexler, M. & Bessière, P. (2008). "A unified probabilistic model of the perception of three-dimensional structure from optic flow". Biological Cybernetics: 132–154.
24. Laurens, J.; Droulez, J. (2007). "Bayesian processing of vestibular information". Biological Cybernetics 96 (4): 389–404. doi:10.1007/s00422-006-0133-1.
25. Colas, F.; Flacher, F.; Tanner, T.; Bessière, P. & Girard, B. (2009). "Bayesian models of eye movement selection with retinotopic maps". Biological Cybernetics 100 (3): 203–214. doi:10.1007/s00422-009-0292-y.
26. Serkhane, J.; Schwartz, J-L. & Bessière, P. (2005). "Building a talking baby robot: A contribution to the study of speech acquisition and evolution". Interaction Studies 6 (2): 253–286. doi:10.1075/is.6.2.06ser.
27. Moulin-Frier, C.; Laurent, R.; Bessière, P.; Schwartz, J-L. & Diard, J. (2012). "Adverse conditions improve distinguishability of auditory, motor and perceptuo-motor theories of speech perception: an exploratory Bayesian modeling study". Language and Cognitive Processes 27 (7–8): 1240–1263. doi:10.1080/01690965.2011.645313.
28. Gilet, E.; Diard, J. & Bessière, P. (2011). Sporns, Olaf, ed. "Bayesian Action–Perception Computational Model: Interaction of Production and Recognition of Cursive Letters". PLoS ONE 6 (6): e20387. Bibcode:2011PLoSO...620387G. doi:10.1371/journal.pone.0020387.
29. Zadeh, Lotfi A. (1965). "Fuzzy sets". Information and Control 8 (3): 338–353. doi:10.1016/S0019-9958(65)90241-X.
30. Zadeh, Lotfi A. (1975). "Fuzzy logic and approximate reasoning". Synthese 30 (3–4): 407–428. doi:10.1007/BF00485052.
31. Dubois, D.; Prade, H. (2001). Ann. Math. Artif. Intell. 32 (1–4): 35–66. doi:10.1023/A:1016740830286.
32. Poole, D. (1993). "Probabilistic Horn abduction and Bayesian networks". Artificial Intelligence 64: 81–129. doi:10.1016/0004-3702(93)90061-F.
33. Poole, D. (1997). "The Independent Choice Logic for modelling multiple agents under uncertainty". Artificial Intelligence 94: 7–56. doi:10.1016/S0004-3702(97)00027-1.
34. Sato, T.; Kameya, Y. (2001). Journal of Artificial Intelligence Research 15: 391–454.