I intend that this page should eventually summarize the content of the MIT course 6.432, "Stochastic Processes, Detection, and Estimation", which I took some years ago and considered to be among the meatiest courses I ever took at MIT. If this ever reaches a point where it doesn't suck, I want to move it into http://en.wikibooks.org/. I'll use green text as a note to myself that something should be added or extended.

The course begins with a review of probability theory. There are some on-line resources for learning or reviewing probability.

Some review of linear algebra is also called for, since these techniques can be extended to vector and matrix quantities.

A lot of this same material is covered in the Wikipedia article on Bayesian inference. Probably I should look for some way to take that into account, and reduce the stuff here.

It was long enough ago that I'll need to fumble around and organize my thoughts. Many of the fundamental ideas can be gotten by looking at a binary hypothesis problem. One of the common themes in this stuff is that given my understanding of causal relationships, I can use observable variables to improve my estimates of the probable values of unobservable variables.

Overview

Random variables, types of variables (boolean, integer, real, vector). Observable and non-observable variables. Brief definitions for "detection" (statistical hypothesis testing) and "estimation" with a few examples of how they would be useful. A-priori and a-posteriori probabilities. Maximum a posteriori detectors, maximum likelihood estimation. Likelihood-ratio test.

Detection

Another name for detection is hypothesis testing. We wish to make a determination between two competing hypotheses, for example, whether a new drug effectively treats some disease. We designate one hypothesis as the null hypothesis (usually representing what is expected, or "business as usual" in some sense) and the other as the alternative hypothesis.

Examples

A drug company is trying to determine if their new drug will effectively treat some disease. The null hypothesis is that the drug would not treat the disease, the alternative hypothesis then being that it does treat the disease.
A radar screen shows a blip moving into our airspace. The null hypothesis is that the blip is a pigeon. The alternative hypothesis is that it's an enemy missile.
You're listening to a shortwave radio. Amid the static you think you might be hearing a voice, but you're not sure. The null hypothesis is that it's just static, the alternative hypothesis is that somebody is transmitting.

In each of these cases, the question has a "yes" or "no" answer with no middle ground. But the question cannot be answered directly by a simple observation. We need to get a little more clever, and we will not arrive at an answer that is completely certain.

Let X be a binary variable which is 0 if the null hypothesis is true, and 1 if the alternative hypothesis. The prior or a priori probabilities P₀ and P₁ (whose sum is one) represent one's beliefs regarding the probabilities of the two possible values.

X cannot be observed directly, but another variable Y can. Assume Y is also binary with possible values zero and one, and that there is a known causal relationship between X and Y, including a component of uncertainty.

Assuming X is zero, let Q₀₀ be the probability that Y is also zero, and Q₁₀ be the probability that Y is one. Assuming X is one, we have probabilities Q₀₁ and Q₁₁ that Y is then respectively zero and one. These Q values are called "conditional probabilities" because they represent the probability of something being true on the condition that another thing is true. They are subject to Bayes' theorem.

Then:

Q₀₀ + Q₁₀ = 1

Q₀₁ + Q₁₁ = 1

Assuming we trust our prior probabilities we now have joint probabilities.

The probability that X = Y = 0 is P₀Q₀₀ -- this is an application of Bayes' theorem

The probability that X = 0 and Y = 1 is P₀Q₁₀

The probability that X = 1 and Y = 0 is P₁Q₀₁

The probability that X = Y = 1 is P₁Q₁₁

P₀Q₀₀ + P₀Q₁₀ + P₁Q₀₁ + P₁Q₁₁ = 1

When we make an observation of Y, we can update our estimates of the probabilities for X. Since we are doing this after observing Y, we call these a posteriori probabilities. So we will update the prior probabilities P₀ and P₁, given the new information based on the observation of Y's value.

If we observe that Y = 0, then we have:

P₀ (new) = P₀Q₀₀ / (P₀Q₀₀ + P₁Q₀₁)

P₁ (new) = P₁Q₀₁ / (P₀Q₀₀ + P₁Q₀₁)

If we observe that Y = 1, then we have:

P₀ (new) = P₀Q₁₀ / (P₀Q₁₀ + P₁Q₁₁)

P₁ (new) = P₁Q₁₁ / (P₀Q₁₀ + P₁Q₁₁)

This can be applied to a non-binary variable by replacing the two priors (P₀ and P₁) with a probability distribution, which gets updated when new observations of y become available.

False positives, false negatives

Assuming that X represents the presence of something we are trying to detect, then X = 1 means the thing is present and X = 0 means the thing is absent. If the thing is absent and our best estimate is that X = 1, that is a "false positive". If the thing is present and our best estimate is that X = 0, that is a "false negative".

Costs and expected costs

Depending on the problem at hand, it may make sense to assign different costs to false positives and false negatives. If the problem is to detect an incoming missile so that we can shoot it down, we don't mind accidentally firing upon a few pigeons rather than allow a missile to hit us directly. The cost of a false negative is very high since the enemy missile might blow up a friendly city, where the cost of a false positive is low -- we wasted a counter-missile shooting down a pigeon, but we probably have more counter-missiles left.

In all, we have four costs to consider.

C₀₀ is the cost of deciding the radar blip is a pigeon, and getting it right. This cost is zero since we didn't fire a counter-missile and no cities got hit.
C₁₁ is the cost of deciding the radar blip is a missile, and getting it right. This cost is low since we are happy to launch a counter-missile to prevent the destruction of a city.
C₀₁ is the cost of deciding the radar blip is a pigeon, and getting it wrong. It's a false negative and there was an enemy missile that got through our defenses. Ouch, we lost a city. That's a high cost.
C₁₀ is the cost of deciding the radar blip is a missile, and getting it wrong. We wasted a counter-missile shooting down a pigeon. Low cost, but embarrassing when you need to explain it to your boss.

The variable X is still binary, but we can generalize Y. It can be a real-valued number or even a vector, which is appropriate for a radar signal, provided that we have a good model for the probability distribution of Y as a function of X. Radar equipment is subject to various noise sources including enemy jamming, so we do need to think in probabilities here.

Since we're now talking about probabilities, we need a slight tweak to the notation. We'll consider x and y to be variables, and X and Y to be values. We can then write

P₀ = Pr(x=0)

P₁ = Pr(x=1)

and we now say that Y is the observed value for the variable y, and we can talk about the probability that y=Y instead of being some other value.

It's also useful to introduce a variable that differentiates the actual value of x from what we estimate as the value for x. The latter will wear a squiggly hat: ${\tilde {x}}$ . If we can observe x directly then no estimation is required, but many interesting quantities are not directly observable.

These two probability distributions for Y tell us the values $Pr(y=Y|X=0)$ (the probability that we would observe that particular value of Y, given that the value of X is zero) and $Pr(y=Y|X=1)$ . We can apply Bayes' theorem as follows.

Pr(x=0|y=Y)={{Pr(y=Y|x=0)P_{0}} \over {Pr(y=Y)}}

Pr(x=1|y=Y)={{Pr(y=Y|x=1)P_{1}} \over {Pr(y=Y)}}

Now we can express the expected cost of choosing ${\tilde {x}}=0$ , or deciding that the radar blip is a pigeon:

C_{0}=C_{00}Pr(x=0|y=Y)+C_{01}Pr(x=1|y=Y)

and the expected cost of choosing ${\tilde {x}}=1$ :

C_{1}=C_{10}Pr(x=0|y=Y)+C_{11}Pr(x=1|y=Y)

and in order to make a rational decision about launching a counter-missile, we should select the smaller of the two expected costs. We should then choose ${\tilde {x}}=1$ if $C_{1}<C_{0}$ , or equivalently

{{Pr(y=Y|x=0)} \over {Pr(y=Y|x=1)}}<\left({{C_{01}-C_{11}} \over {C_{10}-C_{00}}}\right)\left({{P_{1}} \over {P_{0}}}\right)

and choose ${\tilde {x}}=0$ otherwise. This inequality is called a likelihood-ratio test, a "likelihood" being an estimate of probability that takes into account the observation of relevant variables, in this case y.

Some typical simplifications

In many situations it is safe to assume that correct answers have no cost, and the inequality above simplifies to

{{Pr(y=Y|x=0)} \over {Pr(y=Y|x=1)}}<\left({{C_{01}} \over {C_{10}}}\right)\left({{P_{1}} \over {P_{0}}}\right)

It is often safe to assume that false positives and false negatives have equal cost. Then we have a maximum a-posteriori (MAP) detector.

{{Pr(y=Y|x=0)} \over {Pr(y=Y|x=1)}}<{{P_{1}} \over {P_{0}}}

If we assume that P₀ = P₁ = 1/2, then we have a maximum likelihood detector.

Pr(y=Y|x=0)<Pr(y=Y|x=1)

An example using additive Gaussian white noise

Let y be a random variable with a normal distribution where the mean is $\mu _{0}$ if x=0 and $\mu _{1}$ if x=1 and a constant standard deviation $\sigma$ .

This would come up in a circumstance where, for example, we were sending voltages down a cable that represent zeros and ones, like a USB connection between a computer and a peripheral. By the time the voltages reach the other end, they may have picked up noise components from nearby magnetic fields emanating from vacuum cleaners or coffee machines or whatever.

Then the likelihood ratio is

{{Pr(y=Y|x=0)} \over {Pr(y=Y|x=1)}}=exp\left(\left({{\mu _{0}-\mu _{1}} \over {\sigma ^{2}}}\right)\left(Y-{{\mu _{0}+\mu _{1}} \over {2}}\right)\right)

The phrase one will often see for this is "additive Gaussian white noise", or simply the abbreviation AGWN.

The quantity ${{\mu _{0}-\mu _{1}} \over {\sigma ^{2}}}$ acts as a multiplier for Y in the argument of the exponential function, and can be thought of as a sort of sensitivity. The larger this quantity, the more easily a change in Y might cause us to change our decision. As $\mu _{0}$ and $\mu _{1}$ move closer together the sensitivity shrinks because the behavior of Y in the two cases looks more similar and it's harder to distinguish them. As $\sigma$ shrinks, the sensitivity increases because the noise amplitude is shrinking and the signal is becoming more clearly visible.

Estimation

Let x and y be vector variables, with x unobservable and y observable, and with reason to believe that they are in some way correlated. We wish to come up with an estimate of x as a linear function of y:

{\tilde {x}}=Ay+b

and we seek to minimize the expected square error:

e=E[(x-{\tilde {x}})^{T}(x-{\tilde {x}})]

We do this by choosing A and B such that ${\partial {e} \over \partial {A_{ij}}}={\partial {e} \over \partial {b_{i}}}=0$ for all i and j. Letting u represent the values of A_ij and b_i and cranking algebra:

E\left[{\tilde {x}}^{T}{\partial {\tilde {x}} \over \partial {u}}\right]=E\left[x^{T}{\partial {\tilde {x}} \over \partial {u}}\right]

E\left[{\tilde {x}}_{i}y_{j}\right]=E\left[x_{i}y_{j}\right]

for u = A_ij ;

E\left[{\tilde {x}}_{i}\right]=E\left[x_{i}\right]

for u = b_i

X is a continuous variable. Provide an estimate that minimizes cost.

Consider the cases where X and Y are continuous variables
Consider the cases where X and Y are vectors

Other stuff

Consider the cases where X and Y are vectors, then we end up with lots of matrix arithmetic
Consider X and Y as vector-valued functions of time, where time can be discrete or continuous
Receiver operating characteristic
Coloring and whitening noise
Wiener filtering, Kalman filtering
Statistical hypothesis testing
Correlation and dependence