Conditional expectation

In probability theory, a conditional expectation (also known as conditional expected value or conditional mean) is the expected value of a real random variable with respect to a conditional probability distribution.

Thus if X is a random variable and A is an event whose probability is not 0, then the conditional probability distribution of X given A assigns the probability P(X ≤ x | A) to the interval from −∞ to x. If this distribution has a first moment, that moment is denoted E(X | A) and is called the conditional expectation of X given the event A.

If Y is another random variable, then the conditional expectation E(X | Y = y) of X given that Y takes the value y is a function of y; let us call it g(y). (Advanced arguments are needed because the event Y = y may have probability zero.) The conditional expectation of X given the random variable Y, denoted by E(X | Y), is g(Y), another random variable whose value depends on that of Y. (A reminder for readers less accustomed to the conventional notation of probability theory: case matters here, since capital Y denotes the random variable and lower-case y denotes one of its values.)
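The distinction between the function g and the random variable g(Y) can be made concrete on a finite probability space. The following is a minimal sketch; the two-dice setup and the names omega, g and E_X_given_Y are assumptions chosen for illustration and do not come from the text above.

```python
from fractions import Fraction
from itertools import product

# Hypothetical example: two fair dice; X is the total, Y is the first die.
omega = list(product(range(1, 7), repeat=2))
p = Fraction(1, len(omega))          # uniform probability of each outcome

X = lambda w: w[0] + w[1]
Y = lambda w: w[0]

def g(y):
    """g(y) = E(X | Y = y), an ordinary function of the number y."""
    matching = [w for w in omega if Y(w) == y]
    return sum(X(w) * p for w in matching) / (p * len(matching))

def E_X_given_Y(w):
    """E(X | Y) = g(Y): a random variable, i.e. a function of the outcome w."""
    return g(Y(w))

# In this example g(y) works out to y + 7/2 for y = 1, ..., 6.
g_values = {y: g(y) for y in range(1, 7)}
```

Here g takes a number y as its argument, whereas E_X_given_Y takes an outcome of the experiment and depends on it only through the value of Y.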

It turns out that the conditional expectation E(X | Y) depends only on the sigma-algebra, say G, generated by the events Y ≤ y, rather than on the particular values of Y. For a sigma-algebra G, the conditional expectation E(X | G) of X given G is a random variable that is G-measurable and whose integral over any set in G equals the integral of X over the same set. The existence of this conditional expectation follows from the Radon-Nikodym theorem. If X happens to be G-measurable, then E(X | G) = X.

If X has an expected value, that is, if E(|X|) < ∞, then the conditional expectation E(X | Y) also has an expected value, which is the same as that of X. That fact is the law of total expectation.
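The law of total expectation can be checked by direct computation when X and Y are discrete. The sketch below assumes a small, made-up joint distribution (the dictionary joint is hypothetical) and uses exact rational arithmetic.

```python
from fractions import Fraction as F

# Hypothetical joint distribution of (X, Y): keys are (x, y), values are probabilities.
joint = {
    (0, 0): F(1, 8), (4, 0): F(1, 8),
    (1, 1): F(1, 4), (3, 1): F(1, 4),
    (10, 2): F(1, 4),
}

def marginal_Y(y):
    """P(Y = y)."""
    return sum(p for (_, yy), p in joint.items() if yy == y)

def g(y):
    """E(X | Y = y) for a value y with P(Y = y) > 0."""
    return sum(x * p for (x, yy), p in joint.items() if yy == y) / marginal_Y(y)

# E(X), computed directly from the joint distribution.
expectation_X = sum(x * p for (x, _), p in joint.items())

# E(E(X | Y)) = sum over y of g(y) P(Y = y).
y_values = {y for (_, y) in joint}
expectation_of_conditional = sum(g(y) * marginal_Y(y) for y in y_values)
```

Both quantities come out equal, as the law asserts.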

Special cases

In the simplest case, if A is an event whose probability is not 0, then

\operatorname {P} (S\mid A)={\frac {\operatorname {P} (A\cap S)}{\operatorname {P} (A)}},

as a function of S, is a probability measure concentrated on A, denoted P_A, and E(X | A) is the expectation of X with respect to P_A. If X is a discrete random variable with finite first moment, the expectation is given explicitly by

\operatorname {E} (X|A)=\sum _{r}r\cdot \operatorname {P} _{A}\{X=r\}=\sum _{r}r\cdot {\frac {\operatorname {P} (A\cap \{X=r\})}{\operatorname {P} (A)}}

where {X = r} is the event that X takes on the value r. The sum is countable since {X = r} has probability 0 for all but countably many values of r. Since X has finite first moment, it can be shown that this sum converges absolutely.

If X is the indicator function of an event S, then E(X | A) is just the conditional probability P_A(S).
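For a finite sample space the sum above can be evaluated directly. The following sketch assumes a single roll of a fair die; the events A and S and the helper cond_expectation are hypothetical names introduced only for this illustration.

```python
from fractions import Fraction

# Hypothetical example: one roll of a fair six-sided die.
omega = range(1, 7)
prob = {w: Fraction(1, 6) for w in omega}

def P(event):
    """Probability of an event given as a set of outcomes."""
    return sum(prob[w] for w in event)

def cond_expectation(X, A):
    """E(X | A) = sum over r of r * P(A ∩ {X = r}) / P(A), assuming P(A) > 0."""
    pA = P(A)
    if pA == 0:
        raise ValueError("E(X | A) is undefined when P(A) = 0")
    values = {X(w) for w in omega}
    return sum(r * P({w for w in A if X(w) == r}) for r in values) / pA

A = {2, 4, 6}                                # the roll is even
S = {4, 5, 6}                                # the roll is at least 4
face = lambda w: w                           # X = face value
indicator_S = lambda w: 1 if w in S else 0   # X = indicator function of S

e_face = cond_expectation(face, A)               # (2 + 4 + 6) / 3
p_S_given_A = cond_expectation(indicator_S, A)   # equals P(A ∩ S) / P(A)
```

The second computation illustrates the remark about indicator functions: the conditional expectation of the indicator of S coincides with the conditional probability of S given A.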

If Y is another real random variable, then for each value of y we consider the event {Y = y}. The conditional expectation E(X | Y = y) is shorthand for E(X | {Y = y}). In general, this may not be defined, since {Y = y} may have zero probability.

The way out of this limitation is as follows: If both X and Y are discrete random variables, then for any subset B of the set of values of Y

\operatorname {E} (X\ \mathbf {1} _{Y\in B})=\sum _{r\in B}\operatorname {E} (X|Y=r)\operatorname {P} \{Y=r\},

where 1 is the indicator function. For general random variables Y, P{Y = r} is zero for almost every r. As a first step in dealing with this problem, let us consider the case in which Y has an absolutely continuous distribution function. This means there is a non-negative integrable function φ_Y on R which is the density of Y, that is,

\operatorname {P} \{Y\leq a\}=\int _{-\infty }^{a}\phi _{Y}(s)\,ds

for any a in R. We can then show the following: for any integrable random variable X, there is a function g on R such that

\operatorname {E} (X\,\mathbf {1} _{Y\leq a})=\int _{-\infty }^{a}g(t)\phi _{Y}(t)\,dt.

This function g is a suitable candidate for the conditional expectation.
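When X and Y have a joint density, a candidate g can be written down and the identity above checked numerically. The sketch below makes an assumption that goes beyond the text: (X, Y) is taken to be uniform on the triangle 0 < x < y < 1, so that φ_Y(t) = 2t and g(t) = t/2. The midpoint-rule helpers are likewise illustrative only.

```python
# Hypothetical example: (X, Y) uniform on the triangle 0 < x < y < 1,
# so the joint density is f(x, y) = 2 there and 0 elsewhere.
def f(x, y):
    return 2.0 if 0.0 < x < y < 1.0 else 0.0

def phi_Y(t):
    """Density of Y: the integral of f(x, t) over x, which is 2t on (0, 1)."""
    return 2.0 * t if 0.0 < t < 1.0 else 0.0

def g(t):
    """Candidate for E(X | Y = t): (integral of x f(x, t) dx) / phi_Y(t) = t / 2."""
    return t / 2.0

def midpoints(lo, hi, n):
    """Midpoints of n equal subintervals of [lo, hi], and the subinterval width."""
    h = (hi - lo) / n
    return [lo + (i + 0.5) * h for i in range(n)], h

def lhs(a, n=400):
    """E(X 1_{Y <= a}) as a double integral of x f(x, y), by the midpoint rule."""
    ys, hy = midpoints(0.0, a, n)
    xs, hx = midpoints(0.0, 1.0, n)
    return sum(x * f(x, y) for y in ys for x in xs) * hx * hy

def rhs(a, n=400):
    """Integral of g(t) phi_Y(t) over (0, a], by the midpoint rule."""
    ts, h = midpoints(0.0, a, n)
    return sum(g(t) * phi_Y(t) for t in ts) * h
```

For this example both sides equal a³/3 for 0 < a < 1, and lhs(a) and rhs(a) agree with that value up to discretization error.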

Mathematical formalism

Let X, Y be real random variables on some probability space (Ω, M, P), where M is the σ-algebra of measurable sets on which P is defined. We consider two measures on R:

Q, the law of Y, defined by Q(B) = P(Y⁻¹(B)) for every Borel subset B of R; it is a probability measure on the real line R.
P_X, given by

\operatorname {P} _{X}(B)=\operatorname {E} _{P}(X\,\mathbf {1} _{Y\in B})=\int _{Y^{-1}(B)}X(\omega )\ d\operatorname {P} (\omega ).

If X is an integrable random variable, then P_X is a finite signed measure that is absolutely continuous with respect to Q. In this case, the Radon-Nikodym theorem shows that the derivative of P_X with respect to Q exists; moreover, it is uniquely determined almost everywhere with respect to Q. This derivative, a function on R, is the conditional expectation of X given Y, or more accurately a version of the conditional expectation of X given Y.

It follows that the conditional expectation satisfies

\int _{Y^{-1}(B)}X(\omega )\ d\operatorname {P} (\omega )=\int _{B}\operatorname {E} (X|Y)(\theta )\ d\operatorname {Q} (\theta )

for any Borel subset B of R.
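When Y takes only finitely many values, the Radon-Nikodym derivative of P_X with respect to Q reduces to a ratio of point masses. The following sketch illustrates this on a made-up four-point probability space; the dictionary space and all helper names are assumptions introduced for the example.

```python
from fractions import Fraction as F
from itertools import chain, combinations

# Hypothetical finite probability space: outcome -> (probability, X value, Y value).
space = {
    "a": (F(1, 6), -2, 0),
    "b": (F(1, 3), 4, 0),
    "c": (F(1, 4), 1, 1),
    "d": (F(1, 4), 5, 2),
}

def Q(B):
    """Law of Y: Q(B) = P(Y in B)."""
    return sum(p for p, _, y in space.values() if y in B)

def P_X(B):
    """Signed measure P_X(B) = E(X 1_{Y in B})."""
    return sum(p * x for p, x, y in space.values() if y in B)

y_values = sorted({y for _, _, y in space.values()})

# On a finite space the Radon-Nikodym derivative is a ratio of point masses.
density = {r: P_X({r}) / Q({r}) for r in y_values}

def subsets(items):
    """All subsets of a finite list, as tuples."""
    return chain.from_iterable(combinations(items, k) for k in range(len(items) + 1))

# Defining identity: P_X(B) equals the integral of the density over B with respect to Q.
identity_holds = all(
    P_X(set(B)) == sum(density[r] * Q({r}) for r in B) for B in subsets(y_values)
)
```

Note that P_X takes a negative value on some sets if X does, which is why it is a signed measure rather than a probability measure.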

Conditioning as factorization

In the definition of conditional expectation that we provided above, the fact that Y is a real random variable is irrelevant. Let U be a measurable space, that is, a set equipped with a σ-algebra of subsets. A U-valued random variable is a function Y: Ω → U such that Y⁻¹(B) is an element of M for any measurable subset B of U.

We consider the measure Q on U given as above: Q(B) = P(Y⁻¹(B)) for every measurable subset B of U. Q is a probability measure on the measurable space U defined on its σ-algebra of measurable sets.

Theorem. If X is an integrable real random variable on Ω then there is one and, up to equivalence a.e. relative to Q, only one integrable function g such that for any measurable subset B of U:

\int _{Y^{-1}(B)}X(\omega )\ d\operatorname {P} (\omega )=\int _{B}g(u)\ d\operatorname {Q} (u).

There are a number of ways of proving this; one, as suggested above, is to note that the expression on the left-hand side defines, as a function of the set B, a countably additive signed measure on the measurable subsets of U. Moreover, this measure is absolutely continuous relative to Q. Indeed, Q(B) = 0 means exactly that Y⁻¹(B) has probability 0, and the integral of an integrable function over a set of probability 0 is itself 0. This proves absolute continuity, and the Radon-Nikodym theorem then yields the function g.

The defining condition of conditional expectation then is the equation

\int _{Y^{-1}(B)}X(\omega )\ d\operatorname {P} (\omega )=\int _{B}\operatorname {E} (X|Y)(u)\ d\operatorname {Q} (u).

We can further interpret this equality by considering the abstract change of variables formula to transport the integral on the right hand side to an integral over Ω:

\int _{Y^{-1}(B)}X(\omega )\ d\operatorname {P} (\omega )=\int _{Y^{-1}(B)}[\operatorname {E} (X|Y)\circ Y](\omega )\ d\operatorname {P} (\omega ).

This equation can be interpreted to say that the triangle formed by the maps X: Ω → R, Y: Ω → U and E(X|Y): U → R is commutative in the average.

The equation means that the integrals of X and of the composition E(X|Y)∘Y over sets of the form Y⁻¹(B), for B measurable, are identical.
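The phrase "commutative in the average" can be illustrated on a finite space in which Y takes values in a set of labels rather than in R. The sketch below is an illustration under assumed data: the six-point uniform space, the label set U and the helper names are all hypothetical.

```python
from fractions import Fraction as F

# Hypothetical setup: Omega = {0, ..., 5} with uniform probability,
# and Y maps into the label set U = {"low", "high"}.
omega = range(6)
P = {w: F(1, 6) for w in omega}
X = lambda w: w * w
Y = lambda w: "low" if w < 2 else "high"
U = {"low", "high"}

def integral(h, event):
    """Integral of h over a set of outcomes with respect to P."""
    return sum(h(w) * P[w] for w in event)

def preimage(B):
    """Y^{-1}(B) for a subset B of U."""
    return [w for w in omega if Y(w) in B]

# g = E(X | Y) as a function on U: the average of X over each preimage.
g = {u: integral(X, preimage({u})) / integral(lambda w: 1, preimage({u})) for u in U}
g_of_Y = lambda w: g[Y(w)]        # the composition E(X | Y) ∘ Y on Omega

measurable_sets = [set(), {"low"}, {"high"}, U]
commutes_on_average = all(
    integral(X, preimage(B)) == integral(g_of_Y, preimage(B)) for B in measurable_sets
)
pointwise_equal = all(X(w) == g_of_Y(w) for w in omega)
```

In this example the integrals agree over every set of the form Y⁻¹(B), while X and the composition differ as functions on Ω, which is the sense in which the triangle commutes only in the average.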

Conditioning relative to a subalgebra

There is another viewpoint for conditioning involving σ-subalgebras N of the σ-algebra M. This version is a trivial specialization of the preceding: we simply take U to be the space Ω with the σ-algebra N and Y the identity map. We state the result:

Theorem. If X is an integrable real random variable on Ω then there is one and, up to equivalence a.e. relative to P, only one integrable function g such that for any set B belonging to the subalgebra N

\int _{B}X(\omega )\ d\operatorname {P} (\omega )=\int _{B}g(\omega )\ d\operatorname {P} (\omega )

where g is measurable with respect to N (a stricter condition than the measurability with respect to M required of X). This form of conditional expectation is usually written E(X|N). This version is preferred by probabilists. One reason is that on the space of square-integrable real random variables (in other words, real random variables with finite second moment) the mapping X ↦ E(X|N) is the self-adjoint orthogonal projection

L_{\operatorname {P} }^{2}(\Omega ;M)\rightarrow L_{\operatorname {P} }^{2}(\Omega ;N).
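When Ω is finite and N is generated by a partition of Ω, the conditional expectation E(X|N) is obtained by averaging X over each block of the partition, and the projection properties can be verified directly. The following sketch assumes a made-up four-point space; the probabilities, the partition and the variables X, Z and W are hypothetical.

```python
from fractions import Fraction as F

# Hypothetical finite space with non-uniform probabilities.
P = {0: F(1, 10), 1: F(2, 10), 2: F(3, 10), 3: F(4, 10)}
partition = [{0, 1}, {2, 3}]          # atoms generating the sub-sigma-algebra N

def E(h):
    """Expectation of a random variable given as a dict outcome -> value."""
    return sum(h[w] * P[w] for w in P)

def cond_exp(h):
    """E(h | N): on each atom, the P-weighted average of h over that atom."""
    out = {}
    for atom in partition:
        mass = sum(P[w] for w in atom)
        avg = sum(h[w] * P[w] for w in atom) / mass
        for w in atom:
            out[w] = avg
    return out

X = {0: 1, 1: 4, 2: -2, 3: 5}
Z = {0: 7, 1: 7, 2: -3, 3: -3}        # N-measurable: constant on each atom
W = {0: 2, 1: -1, 2: 0, 3: 3}         # an arbitrary second random variable

proj = cond_exp(X)
residual = {w: X[w] - proj[w] for w in P}

# Projection: applying the map twice changes nothing.
idempotent = cond_exp(proj) == proj
# Orthogonality: X - E(X|N) is orthogonal to every N-measurable variable.
orthogonal = E({w: residual[w] * Z[w] for w in P}) == 0
# Self-adjointness: E(E(X|N) W) = E(X E(W|N)).
proj_W = cond_exp(W)
self_adjoint = E({w: proj[w] * W[w] for w in P}) == E({w: X[w] * proj_W[w] for w in P})
```

The three flags check, for this one example, the properties that characterize a self-adjoint orthogonal projection; they do not constitute a proof.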

Basic properties

Let (Ω,M,P) be a probability space.

Conditioning with respect to a σ-subalgebra N is linear on the space of integrable real random variables.
E(1|N) = 1
Jensen's inequality holds: if f is a convex function and f ∘ X is integrable, then

f(\operatorname {E} (X|N))\leq \operatorname {E} (f\circ X|N).

Conditioning is a contractive projection

L_{P}^{s}(\Omega ;M)\rightarrow L_{P}^{s}(\Omega ;N)

for any s ≥ 1.
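The properties listed above can be spot-checked on a finite probability space where N is generated by a partition. The sketch below is only a numerical illustration under assumed data (the space, the partition, the variables X and V, and the convex function f(x) = x² are all hypothetical choices), not a proof.

```python
from fractions import Fraction as F

# Hypothetical finite space; N is generated by the partition below.
P = {0: F(1, 10), 1: F(2, 10), 2: F(3, 10), 3: F(4, 10)}
partition = [{0, 1}, {2, 3}]

def cond_exp(h):
    """E(h | N): on each atom, the P-weighted average of h over that atom."""
    out = {}
    for atom in partition:
        mass = sum(P[w] for w in atom)
        avg = sum(h[w] * P[w] for w in atom) / mass
        for w in atom:
            out[w] = avg
    return out

def moment(h, s):
    """E(|h|^s), the s-th power of the L^s norm of h."""
    return sum(abs(h[w]) ** s * P[w] for w in P)

X = {0: 1, 1: 4, 2: -2, 3: 5}
V = {0: 0, 1: -3, 2: 6, 3: 2}
cX, cV = cond_exp(X), cond_exp(V)

# Linearity: E(2X + 3V | N) = 2 E(X | N) + 3 E(V | N).
linear = cond_exp({w: 2 * X[w] + 3 * V[w] for w in P}) == {w: 2 * cX[w] + 3 * cV[w] for w in P}
# E(1 | N) = 1.
unit = cond_exp({w: 1 for w in P}) == {w: 1 for w in P}
# Jensen's inequality for the convex function f(x) = x^2.
f = lambda x: x * x
c_fX = cond_exp({w: f(X[w]) for w in P})
jensen = all(f(cX[w]) <= c_fX[w] for w in P)
# Contraction in L^s, checked for s = 1, 2, 3 via E(|.|^s).
contractive = all(moment(cX, s) <= moment(X, s) for s in (1, 2, 3))
```

Comparing E(|·|^s) rather than the norms themselves avoids taking roots and keeps the arithmetic exact.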

References

William Feller, An Introduction to Probability Theory and its Applications, vol 1, 1950
Paul A. Meyer, Probability and Potentials, Blaisdell Publishing Co., 1966
G. R. Grimmett & D. R. Stirzaker, Probability and Random Processes, 1995