Lecture 26: Conditional Expectation Continued

Harvard Statistics 110 (Joe Blitzstein)

1. The Two-Envelope Paradox

Two envelopes look identical. One contains $X$ dollars, the other $Y$ dollars; you know nothing about the amounts except that one envelope holds exactly twice as much as the other. You pick one. Should you switch?

Two competing arguments

Argument 1 — Symmetry

$E(Y) = E(X)$. The problem statement contains no asymmetry — nothing distinguishes the left envelope from the right — so neither can have a larger expected value. This is hard to argue against: if the left envelope were somehow better, where would that advantage have come from?

Argument 2 — Conditioning

Condition on whether the other envelope is double or half. Write $X$ for the envelope you hold and $Y$ for the other:

$$E(Y) = E(Y \mid Y = 2X)\,P(Y = 2X) + E\!\left(Y \mid Y = \tfrac{X}{2}\right)P\!\left(Y = \tfrac{X}{2}\right)$$

This is the law of total expectation, the expectation analog of the law of total probability, so it is unimpeachable as written. The trouble comes from how the argument evaluates the conditional expectations. By symmetry each case has probability $\tfrac12$, and the naive (wrong) computation replaces the conditional expectations with unconditional ones:

$$E(Y) = E(2X)\cdot\tfrac12 + E\!\left(\tfrac{X}{2}\right)\cdot\tfrac12 = \frac{2 + \tfrac12}{2}\,E(X) = \frac{5}{4}\,E(X)$$

This suggests the other envelope is always better — and the same argument applies in reverse, so each envelope looks better than the other. Contradiction.

Resolving the contradiction

Can both arguments hold simultaneously? Only in the degenerate cases $E(X) = 0$ (both envelopes empty) or $E(X) = \infty$ (an infinite expected value, as in the St. Petersburg paradox — mathematically consistent but not a realistic scenario). Assuming the expected amount is positive and finite, the two arguments genuinely contradict, so one must be wrong. Symmetry is airtight, so Argument 2 must contain the error.

The error: using information then forgetting it

Key mistake

One of the most common and dangerous mistakes in conditioning is to plug in the conditioning information, then quietly discard the condition. The step $E(Y \mid Y = 2X) = E(2X)$ is wrong. It is legitimate to substitute $Y = 2X$ inside the expectation, but you may not then forget that you are still conditioning on $Y = 2X$:

$$E(Y \mid Y = 2X) = E(2X \mid Y = 2X) \neq E(2X)$$

The only time you may drop the conditioning entirely is when there is genuine independence. Here there is none.

What this reveals: $X$ and the indicator are dependent

Let $I$ be the indicator of which envelope is larger: $I = 1$ if $Y = 2X$ (the other envelope is the bigger one), $I = 0$ otherwise. The collapse of Argument 2 is exactly the statement that $X$ and $I$ are dependent.

This is surprising. It says that observing $X$ gives information about $I$, that is, about whether the other envelope is larger. Suppose you open your envelope and see $\$100$; is the other $\$50$ or $\$200$? Dependence says that, for at least some observed amounts, it is no longer 50-50. Yet you were given no information about the scale of the problem. Even a trillion dollars is minuscule compared to the entire positive real line. It seems impossible that seeing the amount could shift your belief about $I$, but the finiteness of the expected values forces exactly that dependence.
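To make the dependence concrete, here is a minimal sketch under an assumed prior that is not part of the lecture: the smaller amount is 1, 2, or 4 dollars with equal probability, and you are equally likely to hold either envelope. The names `SMALLER_PRIOR`, `joint_outcomes`, and the other helpers are illustrative. Exact fractions show that $E(Y \mid Y = 2X)$ differs from the naive $E(2X)$, that symmetry still gives $E(X) = E(Y)$, and that the observed amount can change the probability that the other envelope is larger.

```python
from fractions import Fraction

# Hypothetical prior (an assumption, not from the lecture): the smaller
# amount S is 1, 2, or 4 dollars, each with probability 1/3.
SMALLER_PRIOR = {1: Fraction(1, 3), 2: Fraction(1, 3), 4: Fraction(1, 3)}


def joint_outcomes(prior):
    """Yield (x, y, prob): your amount, the other amount, and the probability."""
    half = Fraction(1, 2)
    for s, p in prior.items():
        yield s, 2 * s, p * half  # you hold the smaller envelope (I = 1)
        yield 2 * s, s, p * half  # you hold the larger envelope (I = 0)


def expectation(f, outcomes):
    """Unconditional expectation of f(X, Y)."""
    return sum(f(x, y) * p for x, y, p in outcomes)


def conditional_expectation(f, event, outcomes):
    """E(f(X, Y) | event), computed as E(f * 1_event) / P(event)."""
    kept = [(x, y, p) for x, y, p in outcomes if event(x, y)]
    return sum(f(x, y) * p for x, y, p in kept) / sum(p for _, _, p in kept)


def other_is_double(x, y):
    return y == 2 * x


outcomes = list(joint_outcomes(SMALLER_PRIOR))

e_x = expectation(lambda x, y: x, outcomes)                 # E(X)
e_y = expectation(lambda x, y: y, outcomes)                 # E(Y)
naive = expectation(lambda x, y: 2 * x, outcomes)           # E(2X), condition dropped
correct = conditional_expectation(lambda x, y: y, other_is_double, outcomes)


def prob_other_is_double_given(x_seen):
    """P(I = 1 | X = x_seen)."""
    return conditional_expectation(
        lambda x, y: Fraction(int(other_is_double(x, y))),
        lambda x, y: x == x_seen,
        outcomes,
    )


print("E(X) =", e_x, " E(Y) =", e_y)
print("E(2X) =", naive, " E(Y | Y = 2X) =", correct)
for amount in (1, 2, 4, 8):
    print("P(I = 1 | X =", amount, ") =", prob_other_is_double_given(amount))
```

Under this particular prior, seeing the smallest or largest possible amount settles the question completely, which is an extreme form of the dependence between $X$ and $I$.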

Related result: the threshold strategy

A related "two-envelope problem" assumes only two distinct positive amounts (not that one is double the other). Surprisingly, you can devise a strategy guaranteed to give a probability strictly greater than $\tfrac12$ of ending up with the larger envelope.

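Generate your own random threshold $t$ from any continuous distribution with positive density on all of $(0, \infty)$ (an exponential distribution is one convenient choice). Open one envelope; keep it if its value exceeds $t$ and switch otherwise. This manufactured threshold gives you a notion of whether the observed amount is "big" or "small." If $t$ happens to land between the two amounts you are certain to end with the larger one, and otherwise you are right half the time, so the overall success probability is strictly above $\tfrac12$.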

· · ·

2. Patterns in Coin Flips

Flip a fair coin repeatedly and wait for a pattern to appear. Let $W_{HT}$ be the number of flips (including the final two) until you first see heads immediately followed by tails, and $W_{HH}$ the number until you first see two heads in a row. We want $E(W_{HT})$ and $E(W_{HH})$.

Intuition first: are they equal?

Most people guess the two waiting times are equal, with the rest split between the two inequalities. In fact they are not equal:

$$E(W_{HT}) = 4, \qquad E(W_{HH}) = 6 \quad (\text{fifty percent larger})$$
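These two values can be sanity-checked by simulation. The sketch below is not from the lecture; `waiting_time` and `mean_waiting_time` are illustrative names, and the estimates are Monte Carlo approximations that should land near 4 and 6 rather than exactly on them.

```python
import random


def waiting_time(pattern, rng):
    """Number of fair-coin flips until `pattern` first appears as a contiguous run."""
    recent = ""  # the most recent len(pattern) flips
    flips = 0
    while not recent.endswith(pattern):
        recent = (recent + rng.choice("HT"))[-len(pattern):]
        flips += 1
    return flips


def mean_waiting_time(pattern, trials=100_000, seed=0):
    """Average waiting time over many independent repetitions."""
    rng = random.Random(seed)
    return sum(waiting_time(pattern, rng) for _ in range(trials)) / trials


print("HT:", mean_waiting_time("HT"))
print("HH:", mean_waiting_time("HH"))
```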

False symmetry

A tempting but invalid argument claims the two must be equal. Genuine symmetry (swapping the labels "heads" and "tails" everywhere) only gives $E(W_{TT}) = E(W_{HH})$ and $E(W_{HT}) = E(W_{TH})$. Swapping heads and tails turns $HT$ into $TH$, not into $HH$, so symmetry says nothing about $HT$ versus $HH$.

$E(W_{HT})$ by splitting into two waiting times

No conditional expectation is needed. Split the wait into two stages:

$W_1$ = number of flips until the first heads.
$W_2$ = number of additional flips, after that first heads, until the next tails.

Once you have seen a head, you have made permanent partial progress: you are now simply waiting for a tail, and every future tail completes the pattern. So $W_{HT} = W_1 + W_2$ regardless of what happens.

$$E(W_{HT}) = E(W_1) + E(W_2) = 2 + 2 = 4$$

Each stage waits for a probability-$\tfrac12$ success. With the convention that $\text{Geometric}(p)$ counts the failures before the first success, $W_j - 1 \sim \text{Geometric}(\tfrac12)$ has mean $1$, so each stage (failures plus the one success) has mean $1 + 1 = 2$. The stages are independent because the coin is memoryless, but even without independence linearity gives the sum directly.

$E(W_{HH})$ by conditioning on the first toss

The two-stage trick fails for $HH$. After the first head, if the next toss is a head you are done, but if it is a tail you lose all progress and start over from scratch. There is no permanent partial progress to bank. Condition on the first toss:

$$E(W_{HH}) = E(W_{HH} \mid \text{1st} = H)\cdot\tfrac12 + E(W_{HH} \mid \text{1st} = T)\cdot\tfrac12$$

Case	Probability	Reasoning	Expected wait given this case
1st toss $T$	$\tfrac12$	One flip wasted, identical problem restarts	$1 + E(W_{HH})$
1st $H$, 2nd $H$	$\tfrac12 \cdot \tfrac12$	Pattern $HH$ complete in two flips	$2$
1st $H$, 2nd $T$	$\tfrac12 \cdot \tfrac12$	Sequence started $HT$; two flips wasted, restart	$2 + E(W_{HH})$

Solving the self-referential equation

With $E(W_{HH} \mid \text{1st} = H) = \tfrac12(2) + \tfrac12\big(2 + E(W_{HH})\big)$, substituting both branches gives an equation in $E(W_{HH})$ itself. Solving yields $E(W_{HH}) = 6$. Check: $\tfrac12(1 + 6) + \tfrac12\!\left[\tfrac12(2) + \tfrac12(2 + 6)\right] = \tfrac{7}{2} + \tfrac{5}{2} = 6.$
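Written out in full, with $w = E(W_{HH})$ as shorthand (the symbol $w$ is introduced here only for readability), the algebra is:

$$w = \tfrac12(1 + w) + \tfrac12\!\left[\tfrac12(2) + \tfrac12(2 + w)\right] = \tfrac32 + \tfrac34\,w \;\Longrightarrow\; \tfrac14\,w = \tfrac32 \;\Longrightarrow\; w = 6$$

This step assumes $w$ is finite, so that it can be subtracted from both sides.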

Why $HH$ waits longer: clumping

If you stare at two fixed positions, $HH$ and $HT$ are equally likely, each with probability $\tfrac14$ — the source of the (incorrect) intuition that the waits should match. But a waiting time looks at the entire sequence, not two fixed positions.

The overlap argument

$HH$ self-overlaps: a run like $HHHHH$ contains four overlapping $HH$ occurrences clustered together (starting at positions 1, 2, 3, 4). $HT$ cannot overlap itself this way. Both patterns have the same expected total number of occurrences in a long sequence, but because $HH$ appears in tight clumps, those clumps must be spaced farther apart — so the typical wait between fresh $HH$ occurrences is longer.
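The clumping claim can be checked by brute force on short sequences. The sketch below is an illustration, not part of the lecture: it enumerates every sequence of 10 fair flips (the length 10 and the function names are arbitrary choices) and compares the mean number of occurrences of each pattern with the fraction of sequences that contain the pattern at all.

```python
from fractions import Fraction
from itertools import product


def count_occurrences(sequence, pattern):
    """Count occurrences of `pattern` in `sequence`, allowing overlaps."""
    width = len(pattern)
    return sum(
        sequence[i:i + width] == pattern
        for i in range(len(sequence) - width + 1)
    )


def occurrence_summary(pattern, n_flips=10):
    """Return (mean count, P(at least one occurrence)) over all 2**n_flips sequences."""
    total = 0
    with_pattern = 0
    for flips in product("HT", repeat=n_flips):
        count = count_occurrences("".join(flips), pattern)
        total += count
        with_pattern += count > 0
    n_sequences = 2 ** n_flips
    return Fraction(total, n_sequences), Fraction(with_pattern, n_sequences)


for pattern in ("HH", "HT"):
    mean_count, p_any = occurrence_summary(pattern)
    print(pattern, "mean count:", mean_count, " P(at least one):", p_any)
```

The mean counts agree, but fewer sequences contain $HH$ at all: the same total is concentrated in fewer sequences, which is the clumping effect described above.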

Application: motifs in genetics

This is not just a coin-flipping curiosity. In genetics, instead of heads and tails one studies DNA sequences drawn from the alphabet $\{A, C, T, G\}$ and asks where certain patterns — called motifs — appear. The same overlap-and-clumping mathematics governs the spacing of motifs. Blitzstein recommends statistician Peter Donnelly's TED talk, which discusses related courtroom uses of probability and a similar pattern example.

· · ·

3. Conditional Expectation Given an Event

So far conditional expectation has meant conditioning on an event. $E(Y \mid X = x)$ is just the ordinary definition of expectation with the PMF or PDF replaced by its conditional version. Here $X$ and $Y$ are random variables and $x$ is a number, so $X = x$ is an event.

Discrete case

$$E(Y \mid X = x) = \sum_y y \, P(Y = y \mid X = x)$$

Identical to $E(Y) = \sum_y y\,P(Y = y)$, except every probability is made conditional on $X = x$.
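As a small illustration of the discrete formula, the sketch below computes $E(Y \mid X = x)$ from a joint PMF stored as a dictionary. The PMF values are hypothetical, chosen only so the arithmetic is easy to follow, and `conditional_expectation_given_x` is an illustrative name.

```python
from fractions import Fraction

# Hypothetical joint PMF P(X = x, Y = y), chosen only for illustration.
JOINT_PMF = {
    (0, 0): Fraction(1, 4), (0, 1): Fraction(1, 4),
    (1, 0): Fraction(1, 8), (1, 2): Fraction(3, 8),
}


def conditional_expectation_given_x(joint_pmf, x):
    """E(Y | X = x) = sum over y of y * P(Y = y | X = x)."""
    p_x = sum(p for (xi, _), p in joint_pmf.items() if xi == x)  # marginal P(X = x)
    if p_x == 0:
        raise ValueError(f"P(X = {x}) is zero, so the conditional PMF is undefined")
    # P(Y = y | X = x) is the joint probability divided by the marginal.
    return sum(y * p / p_x for (xi, y), p in joint_pmf.items() if xi == x)


for x in (0, 1):
    print("E(Y | X =", x, ") =", conditional_expectation_given_x(JOINT_PMF, x))
```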

Continuous case

$$E(Y \mid X = x) = \int y \, f(y \mid x)\, dy, \qquad f(y \mid x) = \frac{f(x, y)}{f_X(x)}$$

The conditional density is defined analogously to conditional probability — joint over marginal. Equivalently $f(x, y) = f_X(x)\,f(y \mid x)$. Since $f_X(x)$ does not depend on $y$, it can be pulled outside the integral if convenient.

Interpretation

$E(Y \mid X = x)$ is the best prediction of $Y$ given the information $X = x$ — best in the sense of minimizing the expected squared error.
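One way to see the squared-error claim is a standard decomposition, stated here as a supplement and assuming $Y$ has finite conditional variance. For any constant guess $c$:

$$E\big((Y - c)^2 \mid X = x\big) = \operatorname{Var}(Y \mid X = x) + \big(E(Y \mid X = x) - c\big)^2$$

The first term does not involve $c$, and the second is smallest (zero) exactly when $c = E(Y \mid X = x)$.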

· · ·

4. Conditioning on a Random Variable

Define $g(x) = E(Y \mid X = x)$. Writing it as $g$ of lowercase $x$ emphasizes a crucial fact.

$E(Y \mid X = x)$ is a function of $x$, never of $X$ or $Y$

Exam trap

A predicted value of $Y$ cannot depend on $Y$ itself: you are averaging $Y$, so the answer cannot contain capital $Y$. It also cannot depend on capital $X$. The result is a plain function of the number $x$ (possibly a constant). If $X$ and $Y$ are independent, conditioning on $X$ tells you nothing about $Y$, so $g(x)$ reduces to the constant function $E(Y)$. An answer that still contains capital $X$ or capital $Y$ is immediately wrong.

The definition: plug in $X$ after computing $g$

Conditional expectation given a random variable

$$E(Y \mid X) = g(X)$$

Because $X$ is random, $g(X)$ is itself a random variable — a function of $X$, not of $Y$.

The trap is to plug capital $X$ into $g$ before simplifying. If you blindly substitute $X$ for $x$ everywhere first, you would write $E(Y \mid X = X)$, reason "I already know $X = X$, so that is no information," and collapse it to $E(Y)$ — which is wrong. The correct reading: compute the function $g(x)$ first, then replace $x$ by $X$ in the finished expression. If $g(x) = x^2$, then $g(X) = X^2$. That is all it means.

Intuition

$E(Y \mid X)$ means: pretend you got to observe $X$ and may treat it as a known constant; then ask for your best prediction of $Y$. The answer is allowed to be a function of the random variable $X$. You can always, in principle, translate $E(Y \mid X)$ back into conditioning on the event $X = x$; the random-variable notation is just more compact, and you should fall back to event notation whenever it gets confusing.

· · ·

5. Worked Examples with iid Poissons

Let $X$ and $Y$ be iid $\text{Poisson}(\lambda)$.

Forward direction: $E(X + Y \mid X)$

Conditional probabilities are genuine probabilities, so they obey all the usual rules; therefore conditional expectations obey all the usual properties, including linearity:

$$E(X + Y \mid X) = E(X \mid X) + E(Y \mid X) = X + \lambda$$

$E(X \mid X) = X$. We know $X$ and want to predict $X$ — use $X$ itself.
$E(Y \mid X) = E(Y) = \lambda$, because $X$ and $Y$ are independent, so knowing $X$ is no help in predicting $Y$.

A general principle used here, true for any function $h$ (not just the Poisson): if you know $X$, you know $h(X)$, so $E(h(X) \mid X) = h(X)$. Only independence was used for the $E(Y \mid X)$ term — none of the specifics of the Poisson.

Reverse direction: $E(X \mid X + Y)$

Not linearity

There is no rule that splits the conditioning: $E(X \mid X + Y)$ is not $E(X \mid X) + E(X \mid Y)$. Writing it that way is a common panic mistake; linearity applies to the quantity being averaged, not to the condition. We must work out what the conditioning actually means.

Method 1: find the conditional distribution

Let $T = X + Y$. We want $P(X = k \mid T = n)$. By Bayes' rule:

$$P(X = k \mid T = n) = \frac{P(T = n \mid X = k)\,P(X = k)}{P(T = n)}$$

Given $X = k$, the event $T = n$ forces $Y = n - k$, and because $X$ and $Y$ are independent we may then drop the condition: $P(T = n \mid X = k) = P(Y = n - k)$. Substituting the Poisson PMFs, and using for the denominator that the sum of two independent $\text{Poisson}(\lambda)$ variables is $\text{Poisson}(2\lambda)$:

$$P(X = k \mid T = n) = \frac{\dfrac{e^{-\lambda}\lambda^{\,n-k}}{(n-k)!}\cdot\dfrac{e^{-\lambda}\lambda^{k}}{k!}}{\dfrac{e^{-2\lambda}(2\lambda)^{n}}{n!}} = \binom{n}{k}\left(\tfrac12\right)^{n}$$

The $e^{-2\lambda}$ factors cancel and the powers of $\lambda$ cancel, leaving a $\text{Binomial}(n, \tfrac12)$. Its mean is $\tfrac{n}{2}$, so $E(X \mid T = n) = \tfrac{n}{2}$, and replacing the number $n$ by the random variable $T$:

$$E(X \mid X + Y) = \frac{T}{2} = \frac{X + Y}{2}$$

Intuitive: if the total of two iid variables is $100$, your best guess for each is $50$.
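The Bayes' rule computation can be checked numerically. The sketch below (function names and the sample values of $\lambda$ and $n$ are illustrative) evaluates the ratio of Poisson PMFs directly and compares it with the $\text{Binomial}(n, \tfrac12)$ PMF, then confirms that the conditional mean is $\tfrac{n}{2}$ regardless of $\lambda$.

```python
import math


def poisson_pmf(k, lam):
    """P(N = k) for N ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def conditional_pmf_x_given_total(k, n, lam):
    """P(X = k | X + Y = n) for X, Y iid Poisson(lam), via Bayes' rule."""
    numerator = poisson_pmf(n - k, lam) * poisson_pmf(k, lam)
    return numerator / poisson_pmf(n, 2 * lam)  # X + Y ~ Poisson(2 * lam)


def binomial_half_pmf(k, n):
    """P(B = k) for B ~ Binomial(n, 1/2)."""
    return math.comb(n, k) / 2 ** n


n = 6
for lam in (0.5, 3.0):
    mean = sum(k * conditional_pmf_x_given_total(k, n, lam) for k in range(n + 1))
    print("lambda =", lam, " E(X | T = 6) is approximately", mean)
```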

Method 2: symmetry (preferred)

Because $X$ and $Y$ are iid, $E(X \mid X + Y) = E(Y \mid X + Y)$. Add the two equal quantities and use linearity:

$$E(X \mid X+Y) + E(Y \mid X+Y) = E(X + Y \mid X + Y) = X + Y = T$$

The left side is twice $E(X \mid X + Y)$, so $E(X \mid X + Y) = \tfrac{T}{2}$ immediately. This used only that $X$ and $Y$ are iid, nothing about the Poisson, so it is the more general argument.
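Since the symmetry argument does not rely on the Poisson, it can be illustrated with a different iid pair. The sketch below assumes two fair six-sided dice (an example chosen here, not from the lecture) and computes $E(X \mid X + Y = t)$ by direct enumeration; `expected_first_given_total` is an illustrative name.

```python
from fractions import Fraction
from itertools import product


def expected_first_given_total(values, total):
    """E(X | X + Y = total) for X, Y iid uniform on `values`."""
    # All ordered pairs are equally likely, so the conditional expectation is
    # the plain average of x over the pairs that match the total.
    matches = [x for x, y in product(values, repeat=2) if x + y == total]
    if not matches:
        raise ValueError("total is not attainable")
    return Fraction(sum(matches), len(matches))


die = range(1, 7)
for t in range(2, 13):
    print("E(X | X + Y =", t, ") =", expected_first_given_total(die, t))
```

Each printed value equals $t/2$, matching $E(X \mid X + Y) = \tfrac{T}{2}$.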

· · ·

6. Iterated Expectation (Adam's Law)

$E(Y \mid X)$ is a random variable, so we can take its expectation. The single most important property of conditional expectation is:

Iterated expectation (Adam's law)

$$E\big(E(Y \mid X)\big) = E(Y)$$

This is a compact restatement of the law of total expectation (the law of total probability is the special case where $Y$ is an indicator), and it is extremely useful. The name and the proof come in the next lecture.
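Ahead of the proof, the identity can be checked on a small example. The sketch below uses a hypothetical joint PMF (the values and all function names are illustrative assumptions) and computes both sides exactly: $E(Y)$ directly, and $E\big(E(Y \mid X)\big)$ as $\sum_x g(x)\,P(X = x)$ with $g(x) = E(Y \mid X = x)$.

```python
from fractions import Fraction

# Hypothetical joint PMF P(X = x, Y = y), chosen only for illustration.
JOINT_PMF = {
    (0, 0): Fraction(1, 4), (0, 1): Fraction(1, 4),
    (1, 0): Fraction(1, 8), (1, 2): Fraction(3, 8),
}


def marginal_x(joint_pmf):
    """Marginal PMF of X as a dict {x: P(X = x)}."""
    marginal = {}
    for (x, _), p in joint_pmf.items():
        marginal[x] = marginal.get(x, 0) + p
    return marginal


def g(joint_pmf, x):
    """g(x) = E(Y | X = x)."""
    p_x = marginal_x(joint_pmf)[x]
    return sum(y * p for (xi, y), p in joint_pmf.items() if xi == x) / p_x


def expectation_of_conditional(joint_pmf):
    """E(E(Y | X)) = E(g(X)) = sum over x of g(x) * P(X = x)."""
    return sum(g(joint_pmf, x) * p_x for x, p_x in marginal_x(joint_pmf).items())


def expectation_of_y(joint_pmf):
    """E(Y) computed directly from the joint PMF."""
    return sum(y * p for (_, y), p in joint_pmf.items())


print("E(E(Y | X)) =", expectation_of_conditional(JOINT_PMF))
print("E(Y)        =", expectation_of_y(JOINT_PMF))
```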