Lecture 27: Conditional Expectation Given a Random Variable

Harvard Statistics 110 (Joe Blitzstein)

1. Conditioning on a Random Variable: Warm-Up Examples

This lecture finishes conditional expectation as a topic in its own right. The focus is on conditioning on a random variable $X$ (not on an event), which produces a random variable $E(Y \mid X)$ that is itself a function of $X$. Two quick examples fix the notation and intuition before the general properties.

Core idea

$E(Y \mid X)$ is the best prediction of $Y$ from $X$ in the mean-squared-error sense. Because it depends on the random $X$, it is a random variable, not a number — in contrast to conditioning on an event $A$, which gives the number $E(Y \mid A)$.

Example 1: $X$ standard normal, $Y = X^2$

Let $X \sim \mathcal{N}(0,1)$ and $Y = X^2$. Two conditional expectations point in opposite directions.

Conditioning on $X$ means treating $X$ as known. If we know $X$, we know $X^2$ exactly, so the best prediction is $X^2$ itself:

$$E(X^2 \mid X) = X^2 = Y.$$

This is the only sensible answer: if we observe $X$ and the prediction were anything other than $X^2$, something would be wrong. A function of $X$, conditioned on $X$, is just that function.

Now reverse the roles and ask for $E(X \mid Y) = E(X \mid X^2)$. Here we observe $X^2$ but not $X$. Suppose $X^2 = a$ is known. Then $X = +\sqrt{a}$ or $X = -\sqrt{a}$, and by the symmetry of the normal these are equally likely — the magnitude tells us nothing about the sign. Averaging gives

$$E(X \mid X^2) = 0.$$

This does not say $X$ and $X^2$ are independent (they are uncorrelated but not independent). It says that, as a single-number prediction, $X^2$ is useless for predicting $X$: we know the magnitude but learn nothing about the sign, so the best guess is $0$.
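A quick simulation can make the symmetry argument concrete. The sketch below is only an illustration: it approximates conditioning on $X^2$ by grouping draws into bins of $X^2$ (the bin edges, sample size, and function name are arbitrary choices, not part of the lecture) and averages $X$ within each bin. Every bin average should be near $0$.

```python
import random

def conditional_mean_by_bin(n_samples=200_000,
                            edges=(0.0, 0.25, 1.0, 2.25, 4.0),
                            seed=0):
    """Estimate E(X | X^2 in bin) for X ~ N(0, 1) by averaging X within bins of X^2."""
    rng = random.Random(seed)
    n_bins = len(edges) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for _ in range(n_samples):
        x = rng.gauss(0.0, 1.0)
        y = x * x
        # Find the bin of X^2 that this draw falls in (draws with X^2 >= 4 are skipped).
        for i in range(n_bins):
            if edges[i] <= y < edges[i + 1]:
                sums[i] += x
                counts[i] += 1
                break
    return [s / c for s, c in zip(sums, counts)]

bin_means = conditional_mean_by_bin()
for lo, hi, m in zip((0.0, 0.25, 1.0, 2.25), (0.25, 1.0, 2.25, 4.0), bin_means):
    print(f"X^2 in [{lo}, {hi}): average of X = {m:+.4f}")
```

Within each bin, positive and negative values of $X$ occur about equally often and cancel, which is the "magnitude tells us nothing about the sign" argument in numerical form.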

Example 2: Breaking a stick twice

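Take a stick of length $1$. Break it at a uniformly random point, keep one piece, and discard the rest; then break the kept piece again at a uniformly random point. Let $X$ be the length of the piece kept after the first break (the first break point, measured from the kept end) and $Y$ the length of the piece kept after the second break.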

$X \sim \text{Unif}(0, 1)$.
Given $X$, the second break point is uniform on $(0, X)$, so $Y \mid X \sim \text{Unif}(0, X)$.

The notation $Y \mid X \sim \text{Unif}(0, X)$ is shorthand: if we knew $X = x$, then $Y$ would be $\text{Unif}(0, x)$. A uniform on $(0, x)$ has mean $x/2$, so

$$E(Y \mid X) = \frac{X}{2}.$$

This is a random variable — a function of $X$. We can then take its expectation. Since $E(X) = 1/2$,

$$E\bigl(E(Y \mid X)\bigr) = E\!\left(\frac{X}{2}\right) = \frac{1}{2}\cdot\frac{1}{2} = \frac{1}{4}.$$

By the iterated-expectation property (proved below), $E(E(Y \mid X)) = E(Y)$, so $E(Y) = 1/4$. The answer is intuitive: on average you keep half the stick, then half of that again.
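The two-stage description translates directly into a simulation. The following is a minimal sketch (sample size, seed, and function name are arbitrary choices): draw $X$, then draw $Y$ uniformly on $(0, X)$, and compare the average of $Y$ with the average of $X/2$. Both should be close to $1/4$.

```python
import random

def simulate_stick(n_samples=200_000, seed=0):
    """Return Monte Carlo estimates of E(Y) and E(X / 2) for the stick example."""
    rng = random.Random(seed)
    total_y = 0.0
    total_half_x = 0.0
    for _ in range(n_samples):
        x = rng.random()           # X ~ Unif(0, 1): length kept after the first break
        y = rng.uniform(0.0, x)    # Y | X = x ~ Unif(0, x): length kept after the second
        total_y += y
        total_half_x += x / 2      # E(Y | X) evaluated at this draw
    return total_y / n_samples, total_half_x / n_samples

mean_y, mean_half_x = simulate_stick()
print(f"average of Y   = {mean_y:.4f}")
print(f"average of X/2 = {mean_half_x:.4f}")
```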

· · ·

2. Four Properties of Conditional Expectation

A handful of properties let you derive everything else. Throughout, $h$ is an arbitrary function and $X, Y$ are random variables (assumed to have finite variance where needed).

Property 1 — Taking out what's known

$$E\bigl(h(X)\, Y \mid X\bigr) = h(X)\, E(Y \mid X).$$

Because we condition on $X$, any $h(X)$ is a known constant and pulls out. This is what happened in Example 1: $E(X^2 \cdot 1 \mid X) = X^2\, E(1 \mid X) = X^2$, since $E(1 \mid \text{anything}) = 1$. The reverse move — multiplying a known $h(X)$ back inside — is putting back what's known.

Property 2 — Dropping the condition under independence

$$E(Y \mid X) = E(Y) \quad\text{if } X, Y \text{ independent}.$$

If $X$ carries no information about $Y$, the conditional distribution of $Y$ given $X$ equals its unconditional one, so conditioning does not change the prediction. This is not an "if and only if": Example 1 had $E(X \mid X^2) = E(X) = 0$ even though $X$ and $X^2$ are not independent.

Property 3 — Iterated expectation (Adam's law)

$$E\bigl(E(Y \mid X)\bigr) = E(Y).$$

Take the conditional expectation (a random variable), then average it, and the unconditional mean returns. Also called the law of iterated expectation or the tower property; in this department, Adam's law.

How Adam's law is used

The value is in the direction $E(Y) = E(E(Y \mid X))$: when $E(Y)$ is hard to compute directly but $E(Y \mid X)$ is easy, choose $X$ cleverly, compute $E(Y \mid X)$, then average. This generalizes the law of total probability (the special case where $Y$ is an indicator), and it is exactly the move used to get $E(Y) = 1/4$ in the stick example.

Property 4 — Residual uncorrelated with any function of $X$

$$E\bigl((Y - E(Y \mid X))\, h(X)\bigr) = 0 \quad\text{for every } h.$$

The quantity $Y - E(Y \mid X)$ is the residual: the actual value of $Y$ minus its predicted value. Property 4 says the residual is uncorrelated with any function of $X$. Since $E(Y - E(Y \mid X)) = E(Y) - E(Y) = 0$ by Adam's law, the covariance of $Y - E(Y \mid X)$ with $h(X)$ collapses to the single expectation above, so showing it is $0$ shows the covariance is $0$.
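Property 4 can be checked exactly on a small discrete example. The joint PMF below is hypothetical, invented only for illustration, and the test functions $h$ are arbitrary. Exact fractions are used so the result is $0$ exactly rather than approximately.

```python
from fractions import Fraction as F

# Hypothetical joint PMF P(X = x, Y = y), chosen only for illustration.
joint = {
    (0, 1): F(1, 10), (0, 3): F(2, 10),
    (1, 0): F(3, 10), (1, 2): F(1, 10),
    (2, 5): F(3, 10),
}

def cond_mean(joint_pmf):
    """Return {x: E(Y | X = x)} computed from a joint PMF."""
    p_x, sum_y = {}, {}
    for (x, y), p in joint_pmf.items():
        p_x[x] = p_x.get(x, 0) + p
        sum_y[x] = sum_y.get(x, 0) + y * p
    return {x: sum_y[x] / p_x[x] for x in p_x}

g = cond_mean(joint)  # g[x] = E(Y | X = x)

def residual_moment(h):
    """E((Y - E(Y | X)) * h(X)), computed exactly by summing over the joint PMF."""
    return sum((y - g[x]) * h(x) * p for (x, y), p in joint.items())

# A few arbitrary functions of X; h(x) = 1 recovers E(residual) = 0.
test_functions = [lambda x: 1, lambda x: x, lambda x: x ** 2, lambda x: 2 ** x - 7]
moments = [residual_moment(h) for h in test_functions]
print(moments)
```

The cancellation happens separately for each value of $x$: within the group $X = x$, the residuals average to zero, so multiplying by the constant $h(x)$ changes nothing.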

· · ·

3. Geometric Picture: Conditional Expectation as Projection

For those with linear-algebra background, conditional expectation is an orthogonal projection. (Skip if the picture is unfamiliar; it is intuition, not a requirement.)

Treat each random variable as a vector, with inner product $\langle X, Y \rangle = E(XY)$. The collection of all functions of $X$ forms a "plane" through the origin: it passes through the origin because the constant $0$ is a function of $X$, and it also contains every other constant, plus $X$, $X^2$, $e^X$, and so on.

Y
  •
  |
  |  residual = Y−E(Y|X)
  |
__•________ plane of functions of X
E(Y|X)

$E(Y \mid X)$ is the foot of the perpendicular from $Y$ onto the plane; the residual is orthogonal to that plane.

$E(Y \mid X)$ is the point in the plane closest to $Y$ — the orthogonal projection of $Y$ onto the space of functions of $X$.
If $Y$ is already a function of $X$, it lies in the plane, so the projection is $Y$ itself. This is why $E(Y \mid X) = Y$ when $Y = h(X)$ (Example 1).
The residual $Y - E(Y \mid X)$ is the vector from the projection up to $Y$, perpendicular to the plane. Property 4 is exactly the statement that this residual is orthogonal to every vector $h(X)$ in the plane.
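With inner product $E(XY)$, squared distance is mean squared error, so "closest point in the plane" means "smallest $E\bigl((Y - f(X))^2\bigr)$ over functions $f$". As a rough illustration using the stick example ($X \sim \text{Unif}(0,1)$, $Y \mid X \sim \text{Unif}(0, X)$), the sketch below compares $E(Y \mid X) = X/2$ against a few competing predictors; the competitors, sample size, and names are arbitrary choices made here.

```python
import random

def stick_mse(predictors, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E((Y - f(X))^2) for each named predictor f."""
    rng = random.Random(seed)
    totals = {name: 0.0 for name in predictors}
    for _ in range(n_samples):
        x = rng.random()           # X ~ Unif(0, 1)
        y = rng.uniform(0.0, x)    # Y | X = x ~ Unif(0, x)
        for name, f in predictors.items():
            totals[name] += (y - f(x)) ** 2
    return {name: t / n_samples for name, t in totals.items()}

predictors = {
    "X/2": lambda x: x / 2,    # the conditional expectation E(Y | X)
    "X/3": lambda x: x / 3,    # another function of X
    "1/4": lambda x: 0.25,     # the constant E(Y), ignoring X
    "X^2": lambda x: x * x,    # a nonlinear function of X
}
mse = stick_mse(predictors)
for name, value in sorted(mse.items(), key=lambda kv: kv[1]):
    print(f"f(X) = {name}: estimated MSE = {value:.4f}")
```

The predictor $X/2$ should come out smallest, with estimated MSE near $E(X^2/12) = 1/36$, the average conditional variance of a $\text{Unif}(0, X)$.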

· · ·

4. Proofs

Proof of Property 4 (assuming Property 3)

$$E\bigl((Y - E(Y \mid X))\, h(X)\bigr) = 0$$

Expand by linearity:

$$E(Y\, h(X)) - E\bigl(E(Y \mid X)\, h(X)\bigr).$$

Leave the first term alone. In the second, the form $E(E(\cdot \mid X))$ invites Adam's law, but $h(X)$ sits outside. Since $X$ is known inside the conditional expectation, put $h(X)$ back inside:

$$E\bigl(E(Y \mid X)\, h(X)\bigr) = E\bigl(E(h(X)\, Y \mid X)\bigr) = E\bigl(h(X)\, Y\bigr),$$

where the last step is Adam's law applied to $h(X)\,Y$. So the expression is

$$E(Y\, h(X)) - E(h(X)\, Y) = 0.$$

The ingredients: linearity, taking out / putting back what's known, and iterated expectation.

Proof of Property 3 (Adam's law), discrete case

$$E\bigl(E(Y \mid X)\bigr) = E(Y)$$

Let $g(X) = E(Y \mid X)$; the goal is $E(g(X)) = E(Y)$. (The continuous case is analogous, with integrals in place of sums.) By LOTUS:

$$E(g(X)) = \sum_x g(x)\, P(X = x).$$

By definition, $g(x) = E(Y \mid X = x) = \sum_y y\, P(Y = y \mid X = x)$. Substituting and pulling $P(X = x)$ inside:

$$E(g(X)) = \sum_x \sum_y y\, P(Y = y \mid X = x)\, P(X = x).$$

Swap the order of summation (valid under absolute convergence), and recognize $P(Y = y \mid X = x)\, P(X = x) = P(Y = y, X = x)$, the joint PMF:

$$= \sum_y y \sum_x P(Y = y, X = x).$$

Since $y$ does not depend on $x$, it pulls out of the inner sum. Summing the joint PMF over all $x$ gives the marginal $P(Y = y)$ (adding up a row of the joint table). Hence

$$= \sum_y y\, P(Y = y) = E(Y).$$

The only real trick was writing a double sum and swapping the order; the rest was LOTUS plus the definitions of conditional, joint, and marginal distributions.
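The steps of the proof can be traced numerically. The sketch below uses a small hypothetical joint PMF (invented for illustration) and computes the three quantities in the chain: $E(g(X))$ by LOTUS, the double sum after swapping the order, and $E(Y)$ from the marginal. Exact fractions make the equalities exact.

```python
from fractions import Fraction as F

# Hypothetical joint PMF P(X = x, Y = y), chosen only for illustration.
joint = {
    (0, 1): F(1, 10), (0, 3): F(2, 10),
    (1, 0): F(3, 10), (1, 2): F(1, 10),
    (2, 5): F(3, 10),
}
xs = sorted({x for x, _ in joint})
ys = sorted({y for _, y in joint})

def p_joint(x, y):
    """Joint PMF, with probability 0 for pairs not listed."""
    return joint.get((x, y), F(0))

# Marginals: sum the joint PMF over the other variable.
p_x = {x: sum(p_joint(x, y) for y in ys) for x in xs}
p_y = {y: sum(p_joint(x, y) for x in xs) for y in ys}

def g(x):
    """g(x) = E(Y | X = x) = sum over y of y * P(Y = y | X = x)."""
    return sum(y * p_joint(x, y) / p_x[x] for y in ys)

lhs = sum(g(x) * p_x[x] for x in xs)                            # E(g(X)) by LOTUS
swapped = sum(y * sum(p_joint(x, y) for x in xs) for y in ys)   # sum over y outside
rhs = sum(y * p_y[y] for y in ys)                               # E(Y) from the marginal
print(lhs, swapped, rhs)
```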

· · ·

5. Conditional Variance

Conditional variance is defined by analogy with ordinary variance, with everything taken given $X$. Like $E(Y \mid X)$, it is a function of $X$ — a random variable, not a number.

Definition — Conditional variance

Two equivalent forms (the equality is left as good practice):

$$\operatorname{Var}(Y \mid X) = E(Y^2 \mid X) - \bigl(E(Y \mid X)\bigr)^2$$

$$\operatorname{Var}(Y \mid X) = E\!\left[\bigl(Y - E(Y \mid X)\bigr)^2 \;\middle|\; X\right]$$

Keep the outer condition

In the second form, the outer $\mid X$ is essential. Dropping it would collapse the result to a number, but conditional variance must depend on $X$. Everything in the expression is "given $X$" — do not forget one of the conditions.
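For readers who want to check the equivalence of the two forms, here is one possible route, sketched with the shorthand $g(X) = E(Y \mid X)$ (the same shorthand used in the proof of Adam's law). Expand the square, use linearity of conditional expectation, and take out what's known, noting that $g(X)$ and $g(X)^2$ are functions of $X$:

$$E\bigl[(Y - g(X))^2 \mid X\bigr] = E(Y^2 \mid X) - 2\, g(X)\, E(Y \mid X) + g(X)^2.$$

Since $E(Y \mid X) = g(X)$, the middle term is $-2\, g(X)^2$, so

$$E\bigl[(Y - g(X))^2 \mid X\bigr] = E(Y^2 \mid X) - g(X)^2 = E(Y^2 \mid X) - \bigl(E(Y \mid X)\bigr)^2.$$

This mirrors the usual derivation of $\operatorname{Var}(Y) = E(Y^2) - (E Y)^2$, with every expectation taken given $X$.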

Property 5: Eve's law (law of total variance)

Property 5 — Eve's law

$$\operatorname{Var}(Y) = E\bigl(\operatorname{Var}(Y \mid X)\bigr) + \operatorname{Var}\bigl(E(Y \mid X)\bigr).$$

The name comes from reading the right-hand side in order: $E$, $\operatorname{Var}$, $\operatorname{Var}$, $E$, or EVVE, which also pairs it with Adam's law. The proof reduces to conditional-expectation properties and is left as practice.

Intuition: within-group vs. between-group variability

Imagine a population split into subgroups — say three subpopulations, with $Y = \text{height}$ and $X = $ which subgroup a random person belongs to ($X = 1, 2, 3$). There are two kinds of variability:

Within-group: spread of heights inside each subgroup. $E(Y \mid X = i)$ is the mean of group $i$; $\operatorname{Var}(Y \mid X)$ is the within-group variance. Averaging it, $E(\operatorname{Var}(Y \mid X))$, captures the typical within-group spread.
Between-group: differences among the group means. Replacing each group by its average height $E(Y \mid X)$ and taking $\operatorname{Var}(E(Y \mid X))$ captures the spread between groups.

Eve's law says the total variance is exactly the sum of these two pieces. The two effects do not interact in any complicated way; they simply add.
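The decomposition can be verified exactly on a toy version of the three-subgroup picture. All numbers below (group probabilities and heights in cm) are hypothetical, chosen only so the arithmetic is easy to follow. The sketch computes the within-group and between-group pieces and compares their sum with the variance computed directly from the marginal distribution of $Y$.

```python
from fractions import Fraction as F

# Hypothetical subpopulations: group i -> (P(X = i), PMF of height Y within group i).
groups = {
    1: (F(1, 2), {160: F(1, 2), 170: F(1, 2)}),
    2: (F(1, 3), {165: F(1, 4), 175: F(3, 4)}),
    3: (F(1, 6), {180: F(1, 3), 190: F(2, 3)}),
}

def mean(pmf):
    return sum(v * p for v, p in pmf.items())

def var(pmf):
    m = mean(pmf)
    return sum((v - m) ** 2 * p for v, p in pmf.items())

group_mean = {i: mean(pmf) for i, (_, pmf) in groups.items()}  # E(Y | X = i)
group_var = {i: var(pmf) for i, (_, pmf) in groups.items()}    # Var(Y | X = i)

# Within-group piece: E(Var(Y | X)).
within = sum(w * group_var[i] for i, (w, _) in groups.items())

# Between-group piece: Var(E(Y | X)), using E(E(Y | X)) = E(Y) from Adam's law.
overall_mean = sum(w * group_mean[i] for i, (w, _) in groups.items())
between = sum(w * (group_mean[i] - overall_mean) ** 2 for i, (w, _) in groups.items())

# Total variance computed directly from the marginal PMF of Y.
marginal = {}
for w, pmf in groups.values():
    for v, p in pmf.items():
        marginal[v] = marginal.get(v, 0) + w * p
total = var(marginal)

print(f"within = {within}, between = {between}, total = {total}")
```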

· · ·

6. Worked Example: Disease Prevalence by City

A state contains many cities with differing disease prevalence. Sampling proceeds in two stages:

Pick a random city.
Draw a random sample of $n$ people from that city and test each for the disease.

Define:

$Q$ = the true proportion infected in the randomly chosen city. Because the city is random, $Q$ is a random variable in $(0,1)$ — a "random probability."
$X$ = the number of infected people in the sample of size $n$.

This mirrors the within/between structure above: variation between cities (different $Q$) and variation within a city (binomial sampling at fixed $Q$).

Modeling assumptions

$Q \sim \text{Beta}(a, b)$, with $a, b$ known. The Beta is supported on $(0,1)$, is flexible (tuning $a, b$ reshapes it), and is the conjugate prior for the Binomial — mathematically convenient.
$X \mid Q \sim \text{Bin}(n, Q)$. Given the true prevalence $Q$, the count is binomial (assuming sampling with replacement, or $n$ small relative to the city's population so hypergeometric $\approx$ binomial).

Mean of $X$ via Adam's law

Condition on $Q$. Given $Q$, $X \sim \text{Bin}(n, Q)$ with mean $nQ$, so $E(X \mid Q) = nQ$. Then

$$E(X) = E\bigl(E(X \mid Q)\bigr) = E(nQ) = n\, E(Q) = \frac{n\,a}{a+b},$$

using $E(Q) = a/(a+b)$ for $Q \sim \text{Beta}(a, b)$.

Variance of $X$ via Eve's law

$$\operatorname{Var}(X) = E\bigl(\operatorname{Var}(X \mid Q)\bigr) + \operatorname{Var}\bigl(E(X \mid Q)\bigr).$$

Within term: given $Q$, $X \sim \text{Bin}(n, Q)$, so $\operatorname{Var}(X \mid Q) = nQ(1 - Q)$. The first term is $E(nQ(1-Q)) = n\, E(Q(1-Q))$.
Between term: $E(X \mid Q) = nQ$, so $\operatorname{Var}(E(X \mid Q)) = \operatorname{Var}(nQ) = n^2\, \operatorname{Var}(Q)$.

Two Beta facts finish the job.

Beta computation: $E(Q(1 - Q))$ by LOTUS

By LOTUS, integrate $q(1-q)$ against the $\text{Beta}(a,b)$ density on $(0,1)$:

$$E(Q(1-Q)) = \int_0^1 q(1-q)\,\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, q^{a-1}(1-q)^{b-1}\, dq.$$

Multiplying the extra $q$ and $(1-q)$ into the density raises the exponents to $q^{a}$ and $(1-q)^{b}$ — the kernel of a $\text{Beta}(a+1, b+1)$. Insert the matching normalizing constant so the integral becomes that of a true Beta density (which integrates to $1$):

$$E(Q(1-Q)) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\cdot\frac{\Gamma(a+1)\,\Gamma(b+1)}{\Gamma(a+b+2)}.$$

Simplify with $\Gamma(x+1) = x\,\Gamma(x)$:

$\Gamma(a+1) = a\,\Gamma(a)$ and $\Gamma(b+1) = b\,\Gamma(b)$ — the $\Gamma(a)$ and $\Gamma(b)$ cancel, leaving $ab$ on top.
$\Gamma(a+b+2) = (a+b+1)(a+b)\,\Gamma(a+b)$ — the $\Gamma(a+b)$ cancels.

$$\boxed{\;E(Q(1-Q)) = \frac{ab}{(a+b)(a+b+1)}.\;}$$
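As a sanity check on the algebra, the LOTUS integral can be approximated numerically and compared with both the Gamma-function expression and the boxed closed form. This is a rough sketch: the values $a = 2$, $b = 3$ are hypothetical, and the simple midpoint rule used here assumes $a, b \ge 1$ so that the density stays bounded on $(0, 1)$.

```python
import math

def beta_pdf(q, a, b):
    """Beta(a, b) density at q, using the Gamma-function normalizing constant."""
    const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return const * q ** (a - 1) * (1 - q) ** (b - 1)

def e_q_one_minus_q_numeric(a, b, n_steps=20_000):
    """Midpoint-rule approximation of the integral of q(1-q) * Beta(a, b) density on (0, 1)."""
    h = 1.0 / n_steps
    total = 0.0
    for k in range(n_steps):
        q = (k + 0.5) * h
        total += q * (1 - q) * beta_pdf(q, a, b)
    return total * h

a, b = 2.0, 3.0  # hypothetical parameter values

numeric = e_q_one_minus_q_numeric(a, b)
# The expression obtained by inserting the Beta(a+1, b+1) normalizing constant.
gamma_form = (math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
              * math.gamma(a + 1) * math.gamma(b + 1) / math.gamma(a + b + 2))
# The boxed closed form.
closed_form = a * b / ((a + b) * (a + b + 1))

print(numeric, gamma_form, closed_form)
```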

Beta variance

A clean form for the variance of a $\text{Beta}(a,b)$ is

$$\operatorname{Var}(Q) = \frac{\mu(1 - \mu)}{a + b + 1}, \qquad \mu = \frac{a}{a+b}.$$

(Verified the same way, via the Beta integral.)

Assembling the answer

$$\operatorname{Var}(X) = n\, E(Q(1-Q)) + n^2\, \operatorname{Var}(Q),$$

substituting $E(Q(1-Q)) = \dfrac{ab}{(a+b)(a+b+1)}$ and $\operatorname{Var}(Q) = \dfrac{\mu(1-\mu)}{a+b+1}$ with $\mu = a/(a+b)$. Algebraic simplification is optional; the structure — a within-city binomial term plus a between-city Beta term — is the point.
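The two-stage sampling scheme is easy to simulate, which gives a check on both the Adam's law mean and the Eve's law variance. In this sketch the values $n = 20$, $a = 2$, $b = 3$, the number of simulated cities, and the function names are all hypothetical choices; integer $a, b$ are assumed so the formulas can be evaluated as exact fractions. The code also confirms that the sum of the two terms simplifies algebraically to $\dfrac{n\,ab\,(a+b+n)}{(a+b)^2(a+b+1)}$.

```python
import random
from fractions import Fraction as F

def beta_binomial_moments(n, a, b):
    """Mean and variance of X from Adam's and Eve's laws (integer a, b assumed)."""
    mu = F(a, a + b)                                # E(Q)
    e_q1q = F(a * b, (a + b) * (a + b + 1))         # E(Q(1 - Q))
    var_q = mu * (1 - mu) / (a + b + 1)             # Var(Q)
    return n * mu, n * e_q1q + n ** 2 * var_q       # within term + between term

def simulate(n, a, b, n_cities=50_000, seed=0):
    """Simulate the two-stage scheme; return the sample mean and variance of X."""
    rng = random.Random(seed)
    xs = []
    for _ in range(n_cities):
        q = rng.betavariate(a, b)                            # stage 1: pick a city
        xs.append(sum(rng.random() < q for _ in range(n)))   # stage 2: Bin(n, q) count
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m, v

n, a, b = 20, 2, 3  # hypothetical values
mean_exact, var_exact = beta_binomial_moments(n, a, b)
var_simplified = F(n * a * b * (a + b + n), (a + b) ** 2 * (a + b + 1))
mean_sim, var_sim = simulate(n, a, b)

print(f"mean: formula {mean_exact}, simulated {mean_sim:.3f}")
print(f"var:  formula {var_exact}, simplified {var_simplified}, simulated {var_sim:.3f}")
```

For these values the between-city term $n^2 \operatorname{Var}(Q)$ is several times larger than the within-city term, so the variance is well above that of a single $\text{Bin}(n, \mu)$ with the same mean.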