Lecture 9: Expectation, Indicator Random Variables, Linearity

Harvard Statistics 110 (Joe Blitzstein)

1. More on CDFs

The CDF (cumulative distribution function) of a random variable $X$ is

$$F(x) = P(X \le x), \quad \text{as a function of real } x.$$

Even if $X$ only takes integer values, $F$ is defined for all real $x$, and it makes sense for any random variable — not just discrete ones.

Reading a discrete CDF

For a discrete random variable taking values $0, 1, 2, 3$, the CDF is a step function:

It is $0$ to the left of $0$ (the variable can't be negative here).
It jumps up at each possible value, then stays flat until the next one.
At each jump it takes the upper value (closed dot above, open dot below).
After the last value it reaches $1$ and stays there.

If the variable is unbounded (can go off to infinity), the jumps keep coming but get smaller and smaller, so $F$ approaches $1$ without ever reaching it.

CDF and PMF are recoverable from each other

The jump sizes of the CDF are exactly the PMF. The height of the jump at value $j$ is $P(X = j)$. Since the jumps add up to the total climb from $0$ to $1$, this matches the PMF requirements: values are non-negative and sum to $1$. So you can recover the PMF from the CDF (read off jump sizes) and the CDF from the PMF (sum things up).
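
The two-way recovery described above can be sketched in a few lines of Python. The PMF on the values 0, 1, 2, 3 is a made-up example (the specific probabilities are an assumption for illustration, not taken from the lecture). The sketch builds the CDF at the support points by running sums, then reads the PMF back off as jump sizes.

```python
from fractions import Fraction
from itertools import accumulate

# Hypothetical PMF on {0, 1, 2, 3}; the probabilities are illustrative only.
values = [0, 1, 2, 3]
pmf = [Fraction(1, 8), Fraction(3, 8), Fraction(3, 8), Fraction(1, 8)]

# PMF -> CDF: F(j) is the running sum of P(X = i) for i <= j.
cdf = list(accumulate(pmf))

# CDF -> PMF: each jump is F(j) minus the CDF value just before j
# (which is 0 before the first possible value).
jumps = [hi - lo for lo, hi in zip([Fraction(0)] + cdf[:-1], cdf)]

print(cdf)
print(jumps == pmf)
```

Exact fractions are used here so that the round trip can be compared with `==` rather than with a floating-point tolerance.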

Computing the probability of an interval

If you know the CDF, you know the entire distribution, so you can compute any probability for $X$. For an interval $(a, b]$:

$$P(a < X \le b) = F(b) - F(a)$$

The derivation just splits a disjoint union. To get $P(X \le b)$, note that either $X \le a$ or $a < X \le b$, and these two cases are disjoint, so $P(X \le b) = P(X \le a) + P(a < X \le b)$. Rearranging gives the result. This calculation is completely general — discrete, continuous, or anything.

For discrete random variables you must be careful whether each inequality is strict or non-strict, because it changes the answer. For continuous random variables (covered later) it doesn't matter.
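
A small Python sketch of the interval formula, again using an assumed PMF on 0, 1, 2, 3 (the numbers are illustrative, and the helper name `cdf` is hypothetical). It also shows how switching the left inequality from strict to non-strict changes the answer by exactly the mass at $a$.

```python
from fractions import Fraction

# Hypothetical PMF on {0, 1, 2, 3}; the probabilities are illustrative only.
pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

def cdf(x):
    """F(x) = P(X <= x), defined for every real x."""
    return sum(p for value, p in pmf.items() if value <= x)

a, b = 1, 3
half_open = cdf(b) - cdf(a)          # P(a < X <= b)
closed = half_open + pmf.get(a, 0)   # P(a <= X <= b) adds back the mass at a
print(half_open, closed)
```

For a continuous random variable the correction term would be zero, which is why the strictness of the inequality stops mattering there.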

Three defining properties of a CDF

A function $F$ is a valid CDF if and only if it has all three of these properties. The "only if" direction says every CDF has them; the "if" direction says any $F$ with these properties is the CDF of some random variable.

Property	Meaning
Increasing	$F$ is non-decreasing (allowed to be flat, never goes down): if $x_1 \le x_2$, then $F(x_1) \le F(x_2)$. Raising $x$ can only enlarge the event $\{X \le x\}$, so its probability cannot drop.
Right-continuous	Approaching any point from the right, $F$ converges to its value at that point. At a jump: closed above, open below.
Limits	$F(x) \to 0$ as $x \to -\infty$, and $F(x) \to 1$ as $x \to +\infty$ (approaching, not necessarily reaching).

· · ·

2. Independence of Random Variables

We already know what independence of events means; now we define it for random variables by reducing it back to events.

Definition — Independent Random Variables

$X$ and $Y$ are independent if, for all $x$ and $y$,

$$P(X \le x,\; Y \le y) = P(X \le x)\, P(Y \le y).$$

The left side is the joint CDF. The slogan is the same as for events: independence means multiply. This says the event $\{X \le x\}$ is independent of the event $\{Y \le y\}$ for every choice of $x$ and $y$.

In the discrete case this unwieldy joint-CDF condition is equivalent to the cleaner statement in terms of the joint PMF:

$$P(X = x,\; Y = y) = P(X = x)\, P(Y = y) \quad \text{for all } x, y.$$

Intuition

Knowing the value of $X$ tells you nothing about the value of $Y$. The PMF form is uninformative in the continuous case, where $P(X = x)$ and $P(Y = y)$ are $0$ for every $x$ and $y$, so both sides reduce to $0 = 0$; the CDF definition is the general one.
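
To make the joint-PMF condition concrete, here is a minimal Python sketch that checks the factorization for every pair $(x, y)$. The function name `is_independent` and the two dice examples are assumptions chosen for illustration; they are not from the lecture.

```python
from fractions import Fraction
from itertools import product

def is_independent(joint):
    """Check P(X=x, Y=y) == P(X=x) * P(Y=y) for every pair (x, y)."""
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    # Marginal PMFs, obtained by summing the joint PMF over the other variable.
    px = {x: sum(p for (a, _), p in joint.items() if a == x) for x in xs}
    py = {y: sum(p for (_, b), p in joint.items() if b == y) for y in ys}
    return all(joint.get((x, y), 0) == px[x] * py[y] for x, y in product(xs, ys))

# Two fair dice rolled separately: every pair has probability 1/36.
two_dice = {(x, y): Fraction(1, 36) for x, y in product(range(1, 7), repeat=2)}

# X is one fair die and Y = 7 - X (the opposite face): fully dependent.
opposite = {(x, 7 - x): Fraction(1, 6) for x in range(1, 7)}

print(is_independent(two_dice), is_independent(opposite))
```

Note that the check must cover pairs with joint probability zero as well: in the dependent example it is exactly those pairs where the product of the marginals is positive but the joint PMF is not.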

· · ·

3. Averages and Expected Value

The main topic: how to average a random variable, and what an average even means.

The word "average," with no further qualification, means the mean — add up the numbers and divide by how many there are. The same quantity is called the expected value. Mean, average, and expected value are used interchangeably.

Expectation matters for two reasons. First, a random variable's value is unknown before the experiment, and beforehand we may want to predict what will happen on average. Second, although the expected value by itself is only a one-number summary of the distribution's center, the same expectation machinery is reused to define variance, standard deviation, and other measures of spread. We will keep taking expectations not just of $X$ but of functions of $X$.

Warm-up: averaging plain numbers

Average of $1, 2, 3, 4, 5, 6$:

$$\frac{1 + 2 + 3 + 4 + 5 + 6}{6} = 3.5$$

Shortcut: for an arithmetic sequence (each term adds a constant), the average equals the average of the first and last terms — here $\frac{1 + 6}{2} = 3.5$. More generally the average of $1$ through $n$ is $\frac{n+1}{2}$.

This is the trick attributed to a 10-year-old Gauss, told to add the numbers from $1$ to $100$ to keep him busy. He paired $1 + 100 = 101$, $2 + 99 = 101$, and so on — $50$ pairs of $101$ — and immediately wrote $50 \cdot 101 = 5050$.

Unweighted vs. weighted averages

Now suppose values repeat: five $1$s, two $3$s, one $5$ (eight numbers total). There are two equivalent ways to average:

Ungrouped: add all eight numbers and divide by $8$.
Grouped by common value, with weights equal to how often each value appears:

$$\frac{5}{8}\cdot 1 + \frac{2}{8}\cdot 3 + \frac{1}{8}\cdot 5$$

The weights are non-negative and sum to $1$. The ungrouped form is an unweighted average (every number has weight $1/n$); the grouped form is a weighted average. They give the same answer. This simple idea — group by value, weight by frequency — is exactly what carries over to random variables.
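
The eight-number example can be checked both ways in Python. This is only a sketch of the grouping idea; the variable names are arbitrary.

```python
from collections import Counter
from fractions import Fraction

numbers = [1, 1, 1, 1, 1, 3, 3, 5]

# Ungrouped: every number gets weight 1/n.
unweighted = Fraction(sum(numbers), len(numbers))

# Grouped: each distinct value is weighted by its relative frequency.
counts = Counter(numbers)
weighted = sum(Fraction(count, len(numbers)) * value for value, count in counts.items())

print(unweighted, weighted)
```

Both computations give the same number because grouping only reorders the terms of the same sum.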

Definition: expectation of a discrete random variable

Definition — Expected Value (discrete)

For a discrete random variable $X$, weight each value by its probability:

$$E(X) = \sum_{x} x \, P(X = x)$$

Higher weight goes to more likely values. The weights $P(X = x)$ are the PMF. The sum is taken only over values $x$ with $P(X = x) > 0$, so it is a finite or countably infinite sum, never uncountable. For example, if $X$ takes positive integer values, $E(X) = \sum_{k} k \, P(X = k)$ over the positive integers.
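
As a minimal sketch, the definition translates directly into code for a PMF with finitely many values. The function name `expectation` and the fair-die example are assumptions for illustration.

```python
from fractions import Fraction

def expectation(pmf):
    """E(X) = sum of x * P(X = x) over the values with positive probability."""
    return sum(x * p for x, p in pmf.items() if p > 0)

# A fair six-sided die: each face has probability 1/6.
fair_die = {face: Fraction(1, 6) for face in range(1, 7)}
print(expectation(fair_die))
```

For the fair die this reproduces the warm-up average of $1$ through $6$, since equal probabilities make the weighted average an ordinary one.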

· · ·

4. Bernoulli, Indicators, and the Fundamental Bridge

Expectation of a Bernoulli

Let $X \sim \text{Bernoulli}(p)$, so $X$ is $0$ or $1$.

$$E(X) = 1 \cdot P(X = 1) + 0 \cdot P(X = 0) = p$$

The "$0$ times" term vanishes, leaving $E(X) = p$.

Indicator random variables

Given an event $A$, define

$$X = \begin{cases} 1 & \text{if } A \text{ occurs} \\ 0 & \text{otherwise} \end{cases}$$

This is the indicator random variable of $A$. It is automatically Bernoulli (only ever $1$ or $0$), and $P(X = 1) = P(A)$. These come up constantly.

The fundamental bridge

Since $E(X) = p$ and $p = P(A)$ for an indicator:

Fundamental bridge

$$E(\mathbb{1}_A) = P(A)$$

Any probability $P(A)$ can be reinterpreted as the expectation of an indicator. This links expectation and probability, and you can travel between $P$ and $E$ in either direction. In principle, the whole course could have started from expectation and derived probabilities from it.
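
A small exact check of the bridge on one fair die roll. The event $A$ ("the roll is at least 5") and the helper name `in_A` are hypothetical choices made for this sketch.

```python
from fractions import Fraction

# Sample space: one roll of a fair die, each outcome with probability 1/6.
outcomes = {face: Fraction(1, 6) for face in range(1, 7)}

def in_A(face):
    """Hypothetical event A: the roll is at least 5."""
    return face >= 5

# P(A): add up the probabilities of the outcomes in A.
prob_A = sum(p for face, p in outcomes.items() if in_A(face))

# E(indicator of A): the indicator is 1 on A and 0 elsewhere.
expected_indicator = sum((1 if in_A(face) else 0) * p for face, p in outcomes.items())

print(prob_A, expected_indicator)
```

The two sums contain the same nonzero terms, which is the whole content of the bridge.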

· · ·

5. Expectation of the Binomial — Two Ways

Let $X \sim \text{Binomial}(n, p)$, with $q = 1 - p$.

The hard way (direct summation)

$$E(X) = \sum_{k=0}^{n} k \, \binom{n}{k} p^k q^{n-k}$$

The $k$ "gets in the way." Use the absorption identity (a story-proof from the counting lectures: choose a committee of size $k$ from $n$ with a designated president):

$$k \binom{n}{k} = n \binom{n-1}{k-1}$$

Substitute the identity, factor out $n$ (it doesn't depend on $k$), and pull out one factor of $p$. The $k = 0$ term is $0$, so the sum can start at $k = 1$:

$$E(X) = np \sum_{k=1}^{n} \binom{n-1}{k-1} p^{k-1} q^{n-k}$$

Change variables $j = k - 1$, so $j$ runs $0$ to $n-1$ and $n - k = (n-1) - j$:

$$E(X) = np \sum_{j=0}^{n-1} \binom{n-1}{j} p^{j} q^{(n-1)-j} = np$$

By the binomial theorem the sum equals $1$ (a $\text{Binomial}(n-1, p)$ PMF summing to $1$). This is exactly why one factor of $p$ was pulled out instead of $q$ — it makes the remaining sum collapse to $1$.
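
The direct summation can be verified numerically. The sketch below picks $n = 10$ and $p = 3/10$ purely as illustrative parameters, checks the absorption identity for each $k$, and compares the sum with $np$ using exact fractions.

```python
from fractions import Fraction
from math import comb

n, p = 10, Fraction(3, 10)   # illustrative parameters, not from the lecture
q = 1 - p

# Absorption identity: k * C(n, k) == n * C(n-1, k-1) for k = 1..n.
identity_holds = all(k * comb(n, k) == n * comb(n - 1, k - 1) for k in range(1, n + 1))

# Direct summation of k * P(X = k) over k = 0..n.
direct = sum(k * comb(n, k) * p**k * q**(n - k) for k in range(n + 1))

print(identity_holds, direct, n * p)
```

A check at one parameter setting is not a proof, but it is a quick way to catch an algebra slip in the derivation.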

The easy way (linearity)

A $\text{Binomial}(n, p)$ is a sum of $n$ i.i.d. $\text{Bernoulli}(p)$ random variables:

$$X = X_1 + \cdots + X_n, \qquad X_j \sim \text{Bernoulli}(p)$$

Each $X_j$ has expectation $p$, and there are $n$ of them, so $E(X) = np$ by linearity — a calculation you can do in your head. (For the binomial these Bernoullis are independent, but linearity does not require it.)

· · ·

6. Linearity of Expectation

The single most important property of expectation. It has two parts:

Linearity

$$E(X + Y) = E(X) + E(Y) \qquad \text{(for any } X, Y\text{)}$$

$$E(cX) = c\,E(X) \qquad \text{(for any constant } c\text{)}$$

Key point

The additivity part holds even when $X$ and $Y$ are dependent — surprising the first time you see it; it seems obvious for independent variables but is true regardless. (Proof deferred to the next lecture.) Pulling out constants is the more obvious part. The headline: the expectation of a sum is the sum of expectations, no independence needed.
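
Since the proof is deferred, here is only a sanity check of both parts of linearity on a deliberately dependent pair. The choice of $X$ as a fair die roll and $Y = X^2$ is an assumption for illustration; $Y$ is completely determined by $X$, so the two are as far from independent as possible.

```python
from fractions import Fraction

# One fair die roll; X is the face and Y = X**2 is completely determined by X.
faces = range(1, 7)
prob = Fraction(1, 6)

E_X = sum(x * prob for x in faces)
E_Y = sum(x**2 * prob for x in faces)
E_sum = sum((x + x**2) * prob for x in faces)   # E(X + Y), computed directly
E_3X = sum(3 * x * prob for x in faces)         # E(cX) with c = 3

print(E_sum == E_X + E_Y, E_3X == 3 * E_X)
```
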

· · ·

7. Expectation of the Hypergeometric via Indicators

Example: a 5-card hand from a standard 52-card deck; let $X = $ number of aces. (The same hypergeometric problem can be phrased with elk, marbles, etc.)

The hypergeometric PMF involves a product of binomial coefficients and is painful to sum directly. Instead use indicators. Even though the cards have no inherent order, imagine them dealt one at a time so we can index them:

$$X = X_1 + X_2 + X_3 + X_4 + X_5, \qquad X_j = \mathbb{1}(j\text{-th card is an ace})$$

Counting just means adding one for each ace. Now chain three tools:

Linearity: $E(X) = \sum_{j=1}^{5} E(X_j)$.
Symmetry: each position in the deal is equally likely to hold any given card, so all five indicators $X_j$ have the same distribution (no reason the 2nd differs from the 5th), and $E(X) = 5\, E(X_1)$.
Fundamental bridge: $E(X_1) = P(\text{first card is an ace}) = \tfrac{4}{52} = \tfrac{1}{13}$.

$$E(X) = 5 \cdot \frac{1}{13} = \frac{5}{13}$$

Subtlety

The $X_j$ are dependent (if the first four cards are aces, the fifth can't be), yet linearity still applies. So for the expectation, the hypergeometric behaves exactly as if it were binomial: $n$ times the probability that an individual trial has the property. Other quantities, like the variance, will differ from the binomial; the expectation is the special case where they agree.
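
For comparison, the "painful" direct sum over the hypergeometric PMF can be left to the computer. This sketch uses the standard counting form of the PMF for the aces example (choose $k$ of the 4 aces and $5 - k$ of the 48 other cards) and confirms that it agrees with the indicator answer.

```python
from fractions import Fraction
from math import comb

# P(X = k): choose k of the 4 aces and 5 - k of the 48 non-aces,
# out of all C(52, 5) possible 5-card hands.
pmf = {k: Fraction(comb(4, k) * comb(48, 5 - k), comb(52, 5)) for k in range(5)}

# E(X) by the definition: sum of k * P(X = k).
direct = sum(k * p for k, p in pmf.items())

print(sum(pmf.values()), direct)
```

The indicator argument reaches the same value without ever writing down the PMF.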

· · ·

8. The Geometric Distribution

Our next named distribution — not to be confused with the hypergeometric, with which it has little in common.

Story and PMF

Run independent $\text{Bernoulli}(p)$ trials (e.g., flipping a coin), each with the same success probability $p$. The geometric counts the number of failures before the first success.

Convention warning

Some books include the success in the count, some don't. Blitzstein counts failures before the first success and excludes that success — just be consistent.

Let $X \sim \text{Geometric}(p)$, with $q = 1 - p$. A run of $k$ failures then a success (e.g., $FFFFFS$ gives $X = 5$) is the only sequence producing that value, and its probability is $q^k p$. So:

$$P(X = k) = q^k p, \qquad k = 0, 1, 2, 3, \ldots$$

Unlike the binomial, there is no fixed number of trials: keep trying until you succeed, then count the failures.

Valid PMF check

Pull out the constant $p$ and recognize a geometric series:

$$\sum_{k=0}^{\infty} p\, q^k = p \cdot \frac{1}{1 - q} = p \cdot \frac{1}{p} = 1$$

Since $1 - q = p$, the total is $1$. This geometric series is what gives the distribution its name, in the same way that the binomial theorem is what shows the binomial PMF sums to $1$.
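
A quick look at the partial sums shows the total climbing toward $1$ without reaching it at any finite cutoff. The value $p = 1/4$ and the helper name `partial_sum` are assumptions made for this sketch.

```python
from fractions import Fraction

p = Fraction(1, 4)   # illustrative success probability
q = 1 - p

def partial_sum(n):
    """Sum of P(X = k) = q**k * p for k = 0..n."""
    return sum(q**k * p for k in range(n + 1))

for n in (0, 5, 50):
    print(n, float(partial_sum(n)))
```

Each partial sum equals $1 - q^{n+1}$, the finite geometric-series formula, so the shortfall from $1$ is exactly the probability that the first $n + 1$ trials all fail.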

Expectation: two ways

The expectation is $E(X) = \dfrac{q}{p}$, derived two ways. By definition,

$$E(X) = \sum_{k=0}^{\infty} k\, p\, q^k = p \sum_{k=1}^{\infty} k\, q^k$$

(start at $k = 1$ since the $k = 0$ term is $0$). The bare $k$ in front blocks a direct geometric-series evaluation.

Method A — differentiate a known series

Start from what we know:

$$\sum_{k=0}^{\infty} q^k = \frac{1}{1 - q}$$

Differentiate both sides with respect to $q$. The left side gives $\sum_{k=1}^{\infty} k\, q^{k-1}$; the right side gives $\frac{1}{(1-q)^2}$ (the minus from the derivative and the minus from the chain rule cancel). Multiply both sides by $q$ to fix the exponent:

$$\sum_{k=1}^{\infty} k\, q^k = \frac{q}{(1-q)^2} = \frac{q}{p^2}$$

So $E(X) = p \cdot \dfrac{q}{p^2} = \dfrac{q}{p}$.
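
The differentiated-series result can be sanity-checked by truncating the infinite sum. This is a numerical sketch with an assumed $p = 0.25$; the cutoff of 500 terms is an arbitrary choice that is more than enough for $q = 0.75$.

```python
p = 0.25            # illustrative success probability
q = 1 - p

# Truncate the infinite series; terms beyond k = 500 are negligible for q = 0.75.
series = sum(k * q**k for k in range(1, 501))
closed_form = q / p**2

# E(X) = p * (sum of k * q**k), which should match q / p.
expectation = p * series

print(series, closed_form)
print(expectation, q / p)
```
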

Method B — story proof / first-step analysis (no calculus)

Let $c = E(X)$ and solve for $c$ by conditioning on the first flip, like Gambler's Ruin:

With probability $p$, the first flip is a success: $0$ failures, contributing $0$.
With probability $q$, the first flip is a failure: that's $1$ failure, and then — because the coin is memoryless — the problem restarts identically, contributing $1 + c$.

$$c = p \cdot 0 + q\,(1 + c) = q + cq.$$

Solving, $c - cq = q$, so $c(1 - q) = q$, hence $c = \dfrac{q}{p}$ (using $1 - q = p$).
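
A simulation offers an independent check on $q/p$. The sketch below follows the story literally, running trials until the first success and counting the failures; the function name, the choice $p = 0.25$, the seed, and the sample size are all assumptions for illustration.

```python
import random

def failures_before_success(p, rng):
    """Run Bernoulli(p) trials and count failures before the first success."""
    failures = 0
    while rng.random() >= p:   # rng.random() < p counts as a success
        failures += 1
    return failures

p = 0.25                       # illustrative success probability
rng = random.Random(0)         # fixed seed so the run is repeatable
samples = [failures_before_success(p, rng) for _ in range(100_000)]
estimate = sum(samples) / len(samples)

print(estimate, (1 - p) / p)
```

The sample mean should land close to $q/p = 3$ for this choice of $p$, with the gap shrinking as the number of samples grows.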

Conclusion

Both routes give $E(X) = \dfrac{q}{p}$. The story proof avoids calculus by exploiting that the coin is memoryless — after one failure, the future is a fresh copy of the same problem.