Lecture 10: Expectation Continued

Harvard Statistics 110 (Joe Blitzstein)

1. Proof of Linearity of Expectation

Linearity says $E(X + Y) = E(X) + E(Y)$, and $E(cX) = c\,E(X)$ for a constant $c$. The additivity property holds even when $X$ and $Y$ are dependent, which is what makes it surprising and useful.

Why it's remarkable

When $X$ and $Y$ are independent, $E(X + Y) = E(X) + E(Y)$ feels obvious. When they are dependent, it is not obvious at all — yet it remains true, as long as the expectations exist.

Why the brute-force approach stalls

Work in the discrete case (the continuous case is analogous, using a PDF instead of a PMF). Let $T = X + Y$. By definition,

$$E(T) = \sum_t t\, P(T = t) \stackrel{?}{=} \sum_x x\, P(X = x) + \sum_y y\, P(Y = y).$$

Trying to combine the two right-hand sums (one over $x$, one over $y$) into a single sum over $t$ is awkward. Trying to split $E(T)$ by conditioning on $X$,

$$P(T = t) = \sum_x P(T = t \mid X = x)\, P(X = x),$$

works cleanly only if $X$ and $Y$ are independent: then it reduces to a Vandermonde-style convolution (the same algebra used earlier to show that a sum of independent Binomials with common $p$ is Binomial). If $X$ and $Y$ are dependent, the conditional term $P(Y = t - x \mid X = x)$ does not simplify, and the calculation stalls. So the definitional route is the wrong way in.

The key reframing: sum over outcomes, not over values

A random variable $X$ is a function from the sample space $S$ to the real line. Picture $S$ as a collection of "pebbles," each pebble $s$ carrying probability mass $P(s)$, with all masses summing to $1$. $X$ assigns each pebble a value $X(s)$.

The expected value can be computed two equivalent ways:

Grouped (by value): $E(X) = \sum_x x\, P(X = x)$. Glue all pebbles mapped to the same value into one "super-pebble" whose mass is the sum of their masses, then take the weighted average of the distinct values.
Ungrouped (by pebble): $E(X) = \sum_{s \in S} X(s)\, P(s)$. Average over individual pebbles directly.

•••• → 0 •• → 1 ••• → 2 • → 3

Pebbles grouped by the value $X$ assigns them; the grouped sum glues each cluster into one super-pebble

Key idea

These are the same number — exactly the two ways of averaging a list (add-and-divide vs. group equal entries and take a weighted average). The ungrouped form is the lever that makes linearity trivial.
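A minimal sketch of the two ways of averaging, assuming a hypothetical sample space of ten equally likely pebbles whose values mirror the pebble picture (four 0s, two 1s, three 2s, one 3). The names `pebble_values` and `pebble_mass` are illustrative, not from the lecture.

```python
from collections import defaultdict
from fractions import Fraction

# Assumed sample space: ten equally likely pebbles with these X values.
pebble_values = [0, 0, 0, 0, 1, 1, 2, 2, 2, 3]
pebble_mass = Fraction(1, len(pebble_values))

# Ungrouped: sum X(s) * P(s) over individual pebbles.
ungrouped = sum(x * pebble_mass for x in pebble_values)

# Grouped: glue pebbles with equal values into super-pebbles (the PMF),
# then sum x * P(X = x) over the distinct values.
pmf = defaultdict(Fraction)
for x in pebble_values:
    pmf[x] += pebble_mass
grouped = sum(x * p for x, p in pmf.items())

print(ungrouped, grouped)
```

Exact fractions are used so the two sums can be compared with `==` rather than up to rounding.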

The proof

Write $E(T)$ in ungrouped form and use that adding two functions means adding them pointwise:

$$E(X + Y) = E(X) + E(Y)$$

$$E(T) = \sum_{s} (X + Y)(s)\, P(s) = \sum_{s} \big[X(s) + Y(s)\big] P(s)$$

$$= \sum_{s} X(s)\, P(s) + \sum_{s} Y(s)\, P(s) = E(X) + E(Y).$$

The crucial step is that every term is now summed over the same index $s$ (pebbles), so the sum splits cleanly. No independence is used anywhere — the same pebble contributes $X(s)$ and $Y(s)$ simultaneously, so dependence is irrelevant.

The same move proves the constant rule:

$$E(cX) = \sum_{s} c\, X(s)\, P(s) = c \sum_{s} X(s)\, P(s) = c\, E(X).$$

Sanity-checking the dependent case via extremes

Extreme-case check

Take the most extreme dependence possible: $X = Y$. Then $E(X + Y) = E(2X) = 2\,E(X)$, which is indeed $E(X) + E(Y)$. At the other extreme (independence) the result is intuitive. The proof handles everything in between.
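The pebble proof can be checked numerically on a small example. The sketch below assumes a hypothetical sample space (one roll of a fair die) and an illustrative dependent pair, $Y = X^2$; neither comes from the lecture.

```python
from fractions import Fraction

# Assumed sample space: one roll of a fair six-sided die.
outcomes = range(1, 7)
P = {s: Fraction(1, 6) for s in outcomes}

def expectation(rv):
    """Ungrouped expectation: sum of rv(s) * P(s) over outcomes s."""
    return sum(rv(s) * P[s] for s in outcomes)

X = lambda s: s            # the roll itself
Y = lambda s: s * s        # determined by X, so X and Y are dependent
T = lambda s: X(s) + Y(s)  # pointwise sum of the two functions

lhs = expectation(T)
rhs = expectation(X) + expectation(Y)
print(lhs, rhs)

# Extreme case X = Y: E(X + X) should equal 2 E(X).
double = expectation(lambda s: X(s) + X(s))
print(double, 2 * expectation(X))
```

Both sides are computed over the same index `s`, which is exactly why the split works regardless of how `X` and `Y` are related.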

· · ·

2. The Negative Binomial Distribution

The negative binomial generalizes the geometric. Despite the name, it is neither negative nor a binomial — the name only reflects its connection to expanding $(a + b)$ raised to a negative power in the binomial theorem.

Story — Negative Binomial$(r, p)$

Run independent Bernoulli$(p)$ trials. The negative binomial counts the number of failures before the $r$-th success. The geometric is the special case $r = 1$ (failures before the first success).

PMF

Write $1$ for success and $0$ for failure, and look at a sequence ending in the $r$-th success. With $r = 5$, a typical valid sequence is:

1 1 0 1 0 0 0 1 0 0 0 0 0 1

Five $1$s (successes) and, here, $n = 9$ failures; the sequence must end in the $r$-th success

Two observations pin down the count:

The sequence must end in a $1$ — the $r$-th success. A trailing $0$ would mean either the $r$-th success has not happened yet, or it happened earlier and there was no reason to keep going.
Everything before that final success consists of $r - 1$ successes and $n$ failures, which may appear in any order. Permuting them changes neither the probability nor the validity.

So, for $X = $ number of failures before the $r$-th success:

$$P(X = n) = \binom{n + r - 1}{r - 1} p^{\,r} (1 - p)^{n}, \qquad n = 0, 1, 2, \ldots$$

The factor $p^{\,r}(1-p)^{n}$ is the probability of one specific qualifying sequence; the binomial coefficient counts the arrangements of the first $n + r - 1$ positions. Choose the locations of the $r - 1$ successes, $\binom{n+r-1}{r-1}$, or equivalently the locations of the $n$ failures, $\binom{n+r-1}{n}$ — the same number.
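As a sanity check on the PMF, here is a short sketch that evaluates it directly. The function name `nbinom_pmf` and the parameter values are illustrative assumptions.

```python
from math import comb

def nbinom_pmf(n, r, p):
    """P(X = n) for X = number of failures before the r-th success."""
    return comb(n + r - 1, r - 1) * p**r * (1 - p)**n

r, p = 5, 0.3  # illustrative parameters, not from the lecture

# The probabilities should sum to 1 (the tail beyond 500 is negligible here).
total = sum(nbinom_pmf(n, r, p) for n in range(500))

# With r = 1 the PMF should reduce to the geometric: q^n * p.
geometric_check = nbinom_pmf(3, 1, p)

# Choosing success locations or failure locations gives the same count.
same_count = comb(9 + r - 1, r - 1) == comb(9 + r - 1, 9)

print(total, geometric_check, same_count)
```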

Mean via linearity

Computing $E(X) = \sum_n n\, P(X = n)$ directly is nasty. Instead, decompose the waiting process: getting $r$ successes means wait for the 1st success, then the 2nd, and so on.

$$X = X_1 + X_2 + \cdots + X_r$$

where $X_j$ is the number of failures strictly between the $(j-1)$-th and $j$-th successes (with $X_1$ the failures before the very first success). Each $X_j \sim \text{Geometric}(p)$, so $E(X_j) = q/p$ with $q = 1 - p$. By linearity,

$$E(X) = E(X_1) + \cdots + E(X_r) = r\,\frac{q}{p}.$$

The $X_j$ happen to be independent (the trials are independent), but linearity does not need that — only that the expectations exist.
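The decomposition into geometric waiting times can be mirrored in a simulation. This is a sketch under assumed parameters ($r = 5$, $p = 0.3$) with a fixed seed; the helper names are hypothetical. It compares the formula $rq/p$ with a truncated version of the direct sum and with a sample mean.

```python
import random
from math import comb

def nbinom_mean_from_pmf(r, p, terms=2000):
    """Truncated version of the direct sum over n of n * P(X = n)."""
    q = 1 - p
    return sum(n * comb(n + r - 1, r - 1) * p**r * q**n for n in range(terms))

def geometric_draw(p, rng):
    """One Geometric(p) draw: failures before the first success."""
    count = 0
    while rng.random() >= p:  # a failure happens with probability 1 - p
        count += 1
    return count

def nbinom_draw(r, p, rng):
    """X = X_1 + ... + X_r, a sum of r geometric waiting times."""
    return sum(geometric_draw(p, rng) for _ in range(r))

r, p = 5, 0.3            # illustrative parameters
rng = random.Random(110)  # fixed seed so the run is reproducible
samples = [nbinom_draw(r, p, rng) for _ in range(20000)]

formula_mean = r * (1 - p) / p
pmf_mean = nbinom_mean_from_pmf(r, p)
simulated_mean = sum(samples) / len(samples)
print(formula_mean, pmf_mean, simulated_mean)
```

The simulated mean only approximates $rq/p$; the truncated sum should agree with it to many decimal places.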

· · ·

3. The First Success Distribution

Convention warning

Textbooks disagree on where the geometric starts, so be careful copying formulas. Ross starts the geometric at $1$ (counting the success); DeGroot starts at $0$ (not counting it). This course follows the DeGroot convention: the geometric counts failures only, starting at $0$.

Define the first success distribution, $X \sim \text{FS}(p)$: the number of trials up to and including the first success (so it counts the success). To relate it to the geometric, set

$$Y = X - 1, \qquad Y \sim \text{Geometric}(p).$$

Subtracting $1$ just removes the counted success. Any FS probability rewrites in terms of $Y$, and

$$E(X) = E(Y + 1) = E(Y) + 1 = \frac{q}{p} + 1 = \frac{q + p}{p} = \frac{1}{p},$$

using $p + q = 1$. This is intuitive: if $p = 1/10$, it takes on average $10$ trials to get the first success when you count that success.
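A quick numerical check of $E(X) = 1/p$, assuming the FS PMF $P(X = k) = q^{k-1} p$ for $k = 1, 2, \ldots$ (which follows from $Y = X - 1 \sim \text{Geometric}(p)$). The function name and the truncation point are illustrative choices.

```python
def fs_mean(p, terms=2000):
    """Truncated sum of k * P(X = k) with P(X = k) = q^(k-1) * p."""
    q = 1 - p
    return sum(k * q ** (k - 1) * p for k in range(1, terms + 1))

p = 0.1
fs = fs_mean(p)          # should be close to 1/p = 10
geometric = fs - 1       # E(Y) = E(X) - 1, should be close to q/p = 9
print(fs, geometric)
```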

A common notational habit in the literature is to drop the parentheses (writing $E\,X$ for $E(X)$) when there is no ambiguity.

· · ·

4. Putnam Problem: Expected Number of Local Maxima

This appeared on the Putnam exam, one of the hardest math competitions in the country — in most years the median score is zero. Many test-takers considered this the hardest problem on the exam, yet with linearity and indicators it is a one-liner.

Setup

Take a uniformly random permutation of $1, 2, \ldots, n$ (all $n!$ orderings equally likely); assume $n \ge 2$. A local maximum is an entry larger than its neighbor(s): an interior entry must exceed both neighbors; an endpoint entry has only one neighbor and must exceed it. Find the expected number of local maxima.

3 2 1 4 7 5 6

Example: in $3\,2\,1\,4\,7\,5\,6$ the local maxima are $3$ (left end), $7$ (interior), $6$ (right end) — three of them

Solution via indicators and linearity

Let $I_j$ be the indicator that position $j$ is a local maximum, $j = 1, \ldots, n$. The number of local maxima is $I_1 + \cdots + I_n$, so

$$E(\text{number of local maxima}) = E(I_1) + \cdots + E(I_n).$$

By the fundamental bridge, $E(I_j) = P(\text{position } j \text{ is a local maximum})$. Split into interior and endpoint positions.

Interior position: probability $= \tfrac{1}{3}$

Consider the three values at that position and its two neighbors. By symmetry, the largest of the three is equally likely to occupy any of the three positions, so the chance it sits in the middle is $\tfrac{1}{3}$. (Equivalently, of the $6$ orderings of three slots, the $2$ with the largest in the middle give $\tfrac{2}{6} = \tfrac{1}{3}$.)

Common mistake

Answering $\tfrac{1}{4}$ — "$\tfrac12$ chance bigger than the left neighbor times $\tfrac12$ chance bigger than the right neighbor" — is wrong. Those two events are not independent (the same trap as $P(A \text{ older than } B \mid A \text{ older than } C)$). Two correct arguments give $\tfrac13$; one flawed argument gives $\tfrac14$.
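The dependence between the two comparisons can be seen by enumerating the six orderings directly. A small sketch, using the values 1, 2, 3 as stand-ins for any three distinct entries:

```python
from fractions import Fraction
from itertools import permutations

# Each ordering is (left, middle, right); all six are equally likely.
orderings = list(permutations([1, 2, 3]))
total = len(orderings)

p_beats_left = Fraction(sum(m > l for l, m, r in orderings), total)
p_beats_right = Fraction(sum(m > r for l, m, r in orderings), total)
p_beats_both = Fraction(sum(m > l and m > r for l, m, r in orderings), total)

# Conditional probability of beating the right neighbor,
# given that the middle already beats the left neighbor.
p_right_given_left = p_beats_both / p_beats_left

print(p_beats_left, p_beats_right, p_beats_both, p_right_given_left)
```

Knowing the middle entry beats one neighbor raises the chance it beats the other from $\tfrac12$ to $\tfrac23$, which is why multiplying $\tfrac12 \cdot \tfrac12$ undercounts.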

Endpoint position: probability $= \tfrac{1}{2}$

An endpoint is a local maximum iff it exceeds its single neighbor, which by symmetry happens with probability $\tfrac{1}{2}$.

There are $n - 2$ interior positions (each $\tfrac13$) and $2$ endpoints (each $\tfrac12$):

$$E(\text{number of local maxima}) = \frac{n - 2}{3} + 2 \cdot \frac{1}{2} = \frac{n - 2}{3} + 1 = \frac{n + 1}{3}.$$

Checks

$n = 2$: the permutations $1\,2$ and $2\,1$ each have exactly one local maximum, and $\tfrac{2 + 1}{3} = 1$. ($n = 1$ was excluded as degenerate — the lone element is trivially a maximum.)
As $n \to \infty$, the count grows linearly to infinity, which makes sense: more entries, more local maxima.
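For small $n$ the answer can also be confirmed by brute force over all $n!$ permutations. The sketch below uses hypothetical helper names and exact fractions; it is only feasible for small $n$.

```python
from fractions import Fraction
from itertools import permutations

def count_local_maxima(perm):
    """Count entries larger than every neighbor they have."""
    n = len(perm)
    count = 0
    for j in range(n):
        beats_left = j == 0 or perm[j] > perm[j - 1]
        beats_right = j == n - 1 or perm[j] > perm[j + 1]
        if beats_left and beats_right:
            count += 1
    return count

def expected_local_maxima(n):
    """Average the count over all n! equally likely permutations."""
    perms = list(permutations(range(1, n + 1)))
    return Fraction(sum(count_local_maxima(p) for p in perms), len(perms))

print(count_local_maxima((3, 2, 1, 4, 7, 5, 6)))  # the example permutation
for n in range(2, 8):
    print(n, expected_local_maxima(n), Fraction(n + 1, 3))
```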

· · ·

5. The St. Petersburg Paradox

This example uses no indicators; it shows that an expected value can be infinite, and that you cannot move $E$ through a nonlinear function.

The game

Flip a fair coin until the first heads. The payout is $Y = 2^X$ dollars, where $X$ is the number of flips up to and including the first heads — so $X \sim \text{FS}(1/2)$. Heads on flip 1 pays \$2, on flip 2 pays \$4, on flip 3 pays \$8, doubling each time. What is the fair price: the price making the expected net value zero?

In an informal classroom auction, even imagined billionaires would not pay much — most balked above roughly \$20 to \$100, despite the unbounded payout structure.

The calculation

The payout $2^X$ equals $2^k$ exactly when the coin shows $k - 1$ tails then heads, an event of probability $(1/2)^k$. So

$$E(Y) = \sum_{k=1}^{\infty} 2^k \left(\tfrac{1}{2}\right)^k = \sum_{k=1}^{\infty} 1 = 1 + 1 + 1 + \cdots = \infty.$$

Each term contributes exactly $1$, so the sum diverges. By this calculation the fair price is infinite — yet almost no one would pay even \$100. The arithmetic is indisputable ($1 + 1 + 1 + \cdots$); the resolution is that no one actually has infinite money. Real payouts are bounded.

Imposing a realistic bound

Cap the payout near a trillion dollars. Since $2^{40} > 10^{12}$, cap at $2^{40}$. Two scenarios:

| Scenario | Sum | Expected value |
|---|---|---|
| Dealer flees past flip 40 (you get nothing beyond the cap) | $\sum_{k=1}^{40} 1$ | \$40 |
| Dealer pays the capped $2^{40}$ for all later outcomes | $40 + (\tfrac12 + \tfrac14 + \cdots)$ | \$41 |

Resolution

A trillion-dollar ceiling — an astronomically unlikely outcome to ever reach — collapses an "infinite" expectation to about \$40. That no longer feels paradoxical and matches people's intuitions.
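The two capped scenarios can be computed exactly. A minimal sketch, with variable names chosen for illustration:

```python
from fractions import Fraction

CAP_FLIPS = 40
half = Fraction(1, 2)

# Flips 1..40: each term 2^k * (1/2)^k contributes exactly 1.
within_cap = sum(2**k * half**k for k in range(1, CAP_FLIPS + 1))

# Scenario 1: nothing is paid if the first heads comes after flip 40.
dealer_flees = within_cap

# Scenario 2: every later outcome pays the capped 2^40.
# P(first heads after flip 40) = (1/2)^40, so the tail adds 2^40 * (1/2)^40.
dealer_pays_cap = within_cap + 2**CAP_FLIPS * half**CAP_FLIPS

print(dealer_flees, dealer_pays_cap)
```

The tail term is the same quantity as the geometric series $\tfrac12 + \tfrac14 + \cdots$, summed in one step.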

Cautionary point: don't move $E$ inside a nonlinear function

$E(2^X) = \infty$, but a careless "move the $E$ into the exponent" gives $2^{E(X)} = 2^{1/p} = 2^2 = \$4$ (using $E(X) = 1/p = 2$ for $\text{FS}(1/2)$). These are not equal:

$$E\!\left(2^X\right) \neq 2^{E(X)}.$$

Linearity lets you pull $E$ through sums and constant multiples — and only those. It does not pass through powers, products, or other nonlinear functions.
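To make the gap concrete, the sketch below compares partial sums of $E(2^X)$ with the careless value $2^{E(X)}$ for $X \sim \text{FS}(1/2)$. The function name is hypothetical.

```python
from fractions import Fraction

half = Fraction(1, 2)

def partial_expected_payout(m):
    """Sum of the first m terms of E(2^X) for X ~ FS(1/2)."""
    return sum(2**k * half**k for k in range(1, m + 1))

# E(X) = 1/p = 2 for FS(1/2), so moving E into the exponent gives 2^2.
careless = 2**2

partials = [partial_expected_payout(m) for m in (10, 100, 1000)]
print(careless, partials)
```

The partial sums grow without bound, one dollar per term, while the careless value stays fixed at 4.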