Lecture 29: Law of Large Numbers and Central Limit Theorem

Harvard Statistics 110 (Joe Blitzstein)

1. Setup: IID Random Variables and the Sample Mean

The law of large numbers and the central limit theorem are arguably the two most famous theorems in the entire history of probability. They are closely related, so it makes sense to study them together and compare and contrast them.

Both theorems share the same setup. We have IID (independent and identically distributed) random variables $X_1, X_2, X_3, \ldots$. Because they are identically distributed, they all share the same mean and the same variance:

mean: $E(X_j) = \mu$
variance: $\operatorname{Var}(X_j) = \sigma^2$

We assume these are finite (the mean and variance exist).

The central object is the sample mean, written $\bar{X}_n$. The bar is standard statistics notation for an average, and the subscript $n$ is the number of terms:

$$\bar{X}_n = \frac{X_1 + X_2 + \cdots + X_n}{n}$$

Both theorems answer the same question: what happens to $\bar{X}_n$ as $n$ gets large?

Interpretation

The $X$'s are random variables, but once observed they become data. We never have infinitely much data — at some point we stop at $n$, which we can think of as the sample size. Hopefully $n$ is large. We take the average; the question is what we can say about it.

· · ·

2. The Law of Large Numbers

Statement (strong law)

The law of large numbers (LLN) is a simple, intuitive statement: the sample mean converges to the true mean.

Strong Law of Large Numbers

$$\bar{X}_n \to \mu \quad \text{as } n \to \infty, \text{ with probability } 1.$$

Here "true mean" means the theoretical mean, $\mu = E(X_j)$.

Note the asymmetry: $\mu$ is a constant, while $\bar{X}_n$ is a random variable (an average of random variables). The theorem says the random variable converges to the constant. The "with probability 1" is the fine print: something crazy could happen on an event of probability $0$, but we don't worry about it.

What the convergence means

When we take limits of random variables (rather than ordinary sequences of numbers), we must be careful about the definition. A random variable is, mathematically, a function of the outcome of the experiment. The convergence here is pointwise: evaluate $\bar{X}_n$ at a specific outcome and you get an honest sequence of numbers, and those numbers converge to $\mu$. So either the sequence converges or it doesn't — that is an event — and the theorem says that event has probability $1$.

Bernoulli example: coin flipping

Let $X_j \sim \text{Bern}(p)$, so we imagine an infinite sequence of coin tosses with probability $p$ of heads. Then

$$\frac{X_1 + \cdots + X_n}{n} = \frac{\text{number of heads in the first } n \text{ flips}}{\text{number of flips}}$$

and the LLN says this proportion converges to $p$ with probability 1. Flip a fair coin a million times: you don't expect exactly 500,000 heads, but you do expect the proportion to get closer and closer to $1/2$ in the long run.
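A quick simulation makes the convergence concrete. This is an illustrative sketch rather than part of the lecture: the helper name `running_proportion`, the seed, and the checkpoints are arbitrary choices, and Python's pseudorandom generator stands in for real coin flips.

```python
import random

def running_proportion(p, checkpoints, seed=0):
    """Simulate Bern(p) flips; return {n: proportion of heads in the first n flips}."""
    rng = random.Random(seed)
    targets = set(checkpoints)
    heads = 0
    results = {}
    for n in range(1, max(checkpoints) + 1):
        heads += rng.random() < p   # True counts as 1, False as 0
        if n in targets:
            results[n] = heads / n
    return results

props = running_proportion(p=0.5, checkpoints=[10, 100, 10_000, 100_000])
for n, prop in props.items():
    print(f"n = {n:>7}: proportion of heads = {prop:.4f}")
```

The early proportions can wander noticeably; the later ones should sit close to $1/2$, though a single simulated run only illustrates the theorem and does not prove it.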

The "with probability 1" qualification is needed because, mathematically, nothing forbids the coin from landing heads forever. That will never happen in reality, but the math does not call such a sequence invalid. These pathological cases have probability $0$, and on everything else we get what we expect.

The result is also essential in practice: if you didn't know $p$, the obvious estimate is to flip many times and take the proportion of heads. The LLN is exactly the justification for that procedure; without it, you'd have no grounds for the approximation.

Gambler's fallacy and "swamping"

The gambler's fallacy is the feeling that after losing many times in a row, you are "due" to win — and people sometimes try to justify it with the LLN: "the long-run average must return to $1/2$, so I should start winning to compensate." That is not how it works. The coin is memoryless; it does not care how many losses came before. After 100 tails in a row, the probability for flip 101 is unchanged.

Swamping, not compensation

The correct mechanism is swamping. The LLN sends $n \to \infty$. No matter how unlucky you were in the first hundred or first million trials, that finite stretch is nothing compared to infinity — it gets swamped by the entire infinite future. The early losses are not offset by extra future heads; they are simply diluted away.
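The dilution can be checked with plain arithmetic. The sketch below assumes an idealized scenario chosen only for illustration: the first 100 flips are all tails, and afterwards exactly half of the flips are heads, so there is no compensation at all. The function name `proportion_after_bad_start` is hypothetical.

```python
def proportion_after_bad_start(n, bad_start=100):
    """Overall proportion of heads after n flips when the first `bad_start`
    flips are all tails and exactly half of the remaining flips are heads."""
    future_heads = (n - bad_start) / 2
    return future_heads / n

for n in [200, 1_000, 10_000, 1_000_000]:
    print(f"n = {n:>9}: proportion of heads = {proportion_after_bad_start(n):.5f}")
```

The deficit of 50 heads never shrinks, yet the proportion still approaches $1/2$, because $50/n \to 0$.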

Aside: "I can only ever be average"

A story: a colleague's student claimed to hate statistics. Asked why, the student — an athlete training daily — said he had just learned the LLN and found it depressing: it seemed to say that in the long run he could only ever be average, with no room to improve.

The fallacy is in the IID assumption. IID means the distribution does not change with time. Improving your own performance changes your distribution, so the sequence would no longer be IID. The LLN does not say you cannot improve; it only describes averages of a fixed distribution.

Far from depressing, the LLN is arguably what makes science possible. Imagine a counterfactual world where it were false: you collect more and more data, let your sample size grow, and yet never converge to the truth. The LLN guarantees that more data brings you to the truth.

The weak law and its proof

We will prove a slightly different version, the weak law of large numbers. (Blitzstein notes he dislikes the "strong"/"weak" terminology, but it is standard.) The weak law states: for any $c > 0$,

Weak Law of Large Numbers

$$P\big(|\bar{X}_n - \mu| > c\big) \to 0 \quad \text{as } n \to \infty.$$

This is convergence in probability.

It is not literally equivalent to the strong law, but the strong law implies it (showing this takes some real analysis we don't need here). The intuition is the same: interpret $c$ as small (say $0.001$); the statement says that for large $n$ it is extremely unlikely that the sample mean and the true mean differ by more than $0.001$.

The strong law is hard to prove, but the weak law follows in essentially one line from Chebyshev's inequality (from the previous lecture):

$$P\big(|\bar{X}_n - \mu| > c\big) \le \frac{\operatorname{Var}(\bar{X}_n)}{c^2}$$

Now compute $\operatorname{Var}(\bar{X}_n)$. Staring at the definition, the $1/n$ in front comes out as $1/n^2$, and by independence the variance of the sum is $n$ times the variance of one term:

$$\operatorname{Var}(\bar{X}_n) = \frac{1}{n^2}\,\big(n\,\sigma^2\big) = \frac{\sigma^2}{n}.$$

Substituting this into Chebyshev's bound gives

$$P\big(|\bar{X}_n - \mu| > c\big) \le \frac{\sigma^2}{n\,c^2}.$$

Here $\sigma$ and $c$ are constants and $n \to \infty$, so the bound $\to 0$. That proves the weak law of large numbers in one line.
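To see the bound in action, one can compare $\sigma^2/(n c^2)$ with a simulated tail frequency. This is a rough sketch under assumptions not in the lecture: Bernoulli($1/2$) data, $c = 0.05$, and hypothetical helper names `chebyshev_bound` and `empirical_tail`.

```python
import random

def chebyshev_bound(sigma2, n, c):
    """Upper bound sigma^2 / (n c^2) on P(|Xbar_n - mu| > c)."""
    return sigma2 / (n * c * c)

def empirical_tail(p, n, c, reps=1000, seed=1):
    """Fraction of simulated Bern(p) sample means that land farther than c from p."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(reps):
        xbar = sum(rng.random() < p for _ in range(n)) / n
        misses += abs(xbar - p) > c
    return misses / reps

p, c = 0.5, 0.05
for n in [100, 400, 1600]:
    bound = chebyshev_bound(p * (1 - p), n, c)
    print(f"n = {n:>4}: Chebyshev bound = {bound:.4f}, simulated = {empirical_tail(p, n, c):.4f}")
```

Both columns shrink as $n$ grows. The bound is typically far from tight, which is fine: the proof only needs it to go to $0$.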

· · ·

3. From the LLN to the Central Limit Theorem

Rewrite what we proved as $\bar{X}_n - \mu \to 0$ as $n \to \infty$. This is good to know, but it does not tell us two things: the shape of the distribution of $\bar{X}_n$, and the rate at which it converges to $\mu$.

The right scaling: multiply by a power of $n$

A general strategy for studying how fast something goes to zero is to multiply by something that goes to infinity. Multiply $\bar{X}_n - \mu$ by $n$ raised to some power:

If the power is too large, the $n$-factor blows up and dominates — the product goes to infinity.
If the power is too small (but still positive), the convergence-to-zero wins — the product still goes to $0$.

So the two factors compete: one goes to infinity, the other to zero. The threshold power that balances them is $\tfrac{1}{2}$. With the square root we get a non-degenerate limit — the product neither collapses to $0$ nor blows up, but converges to an actual distribution. Dividing by $\sigma$ cleans things up:

$$\sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma}$$
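One way to see why $\tfrac{1}{2}$ is the balancing power is to track the standard deviation of $n^{a}(\bar{X}_n - \mu)$. Since $\operatorname{Var}(\bar{X}_n) = \sigma^2/n$, that standard deviation is $\sigma\, n^{a - 1/2}$. The sketch below just tabulates this formula; the function name `scaled_sd` and the choice $\sigma = 1$ are illustrative assumptions.

```python
def scaled_sd(n, power, sigma=1.0):
    """Standard deviation of n**power * (Xbar_n - mu), using Var(Xbar_n) = sigma^2 / n."""
    return sigma * n ** (power - 0.5)

for power in [0.25, 0.5, 0.75]:
    sds = [round(scaled_sd(n, power), 4) for n in [10**2, 10**4, 10**6]]
    print(f"power = {power}: SD at n = 1e2, 1e4, 1e6 -> {sds}")
```

For a power below $\tfrac{1}{2}$ the spread shrinks toward $0$, for a power above $\tfrac{1}{2}$ it grows without bound, and at exactly $\tfrac{1}{2}$ it stays fixed at $\sigma$.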

Statement of the CLT

Central Limit Theorem

As $n \to \infty$,

$$\sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \;\xrightarrow{\ d\ }\; \mathcal{N}(0, 1).$$

Convergence is in distribution: the CDF of the left-hand quantity converges to the standard normal CDF $\Phi$.

The $X$'s may be discrete, continuous, or a mixture, so they need not have a PDF — but every random variable has a CDF, so the statement is about CDFs.

Why this is amazing

The standard normal is one specific bell curve, yet any distribution with finite variance (discrete, continuous, or extremely nasty-looking) gives a standardized sample mean that converges to it. Beyond the IID setup, the only assumption is finite variance. This is a major reason the standard normal is so important: although the CLT is an $n \to \infty$ statement, in practice it justifies normal approximation. For large $n$ the sample mean is approximately normal, even when the original data were nowhere near normal.
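A small simulation can illustrate this with clearly non-normal data. The sketch below assumes Exponential(1) observations (strongly skewed, with $\mu = \sigma = 1$), a sample size of $n = 100$, and a hypothetical helper `standardized_means`; none of these choices come from the lecture.

```python
import math
import random
from statistics import NormalDist

def standardized_means(n, reps, seed=2):
    """Draw `reps` values of sqrt(n) * (Xbar_n - mu) / sigma for Expo(1) data,
    for which mu = sigma = 1."""
    rng = random.Random(seed)
    out = []
    for _ in range(reps):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        out.append(math.sqrt(n) * (xbar - 1.0) / 1.0)
    return out

z = standardized_means(n=100, reps=4000)
phi = NormalDist().cdf   # standard normal CDF
for x in [-1.0, 0.0, 1.0]:
    empirical = sum(v <= x for v in z) / len(z)
    print(f"x = {x:+.1f}: empirical CDF = {empirical:.3f}, Phi(x) = {phi(x):.3f}")
```

The empirical CDF values should land near $\Phi(x)$, with small gaps from both simulation noise and the finite sample size.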

Comparison: LLN vs. CLT

	Law of Large Numbers (strong)	Central Limit Theorem
Object	$\bar{X}_n$	$\sqrt{n}\,(\bar{X}_n - \mu)/\sigma$
Conclusion	converges to the constant $\mu$	converges to $\mathcal{N}(0,1)$
Convergence	pointwise, w.p. 1 (the random variables converge)	in distribution (only the CDF converges)
Information	the limit value	the shape and the rate (scale $\sqrt{n}$)
Assumptions	finite mean (and variance, this version)	finite variance

The CLT is in a sense more informative — it gives the distribution and the rate, not just the limiting value. But it is a genuinely different sense of convergence: the LLN says the random variables themselves converge, the CLT says only their distribution converges to $\mathcal{N}(0,1)$.

Equivalent form in terms of the sum

It is worth being fluent in both forms; the algebra connecting them is just a factor of $n$. Let $S_n = X_1 + \cdots + X_n$. To turn the sum into a standard normal we must standardize it (otherwise it just blows up as we add more terms):

By linearity, $E(S_n) = n\mu$, so subtract $n\mu$ to center it.
$\operatorname{Var}(S_n) = n\sigma^2$, so divide by the standard deviation $\sqrt{n}\,\sigma$.

$$\frac{S_n - n\mu}{\sqrt{n}\,\sigma} \;\xrightarrow{\ d\ }\; \mathcal{N}(0, 1)$$

This is the same theorem, viewed as a statement about the sum (or convolution) rather than the sample mean.
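The factor-of-$n$ algebra, written out using only $S_n = n\,\bar{X}_n$:

$$\frac{S_n - n\mu}{\sqrt{n}\,\sigma} = \frac{n\,\bar{X}_n - n\mu}{\sqrt{n}\,\sigma} = \frac{n}{\sqrt{n}} \cdot \frac{\bar{X}_n - \mu}{\sigma} = \sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma}.$$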

· · ·

4. Proof of the Central Limit Theorem (via MGFs)

The CLT holds whenever the variance exists — no need to assume higher moments. But that generality is hard to prove. We assume the moment generating function (MGF) exists, which lets us use MGF machinery; the proof can be extended to cases where the MGF does not exist, but we won't need that.

Strategy

Let $M(t)$ be the common MGF of the $X_j$ (they are IID, so one having an MGF means they all share the same one). The plan:

There is a theorem (used in a homework problem): if the MGFs converge to some MGF, then the random variables converge in distribution.
So we just write down the MGF of the standardized sum and take its limit, showing it converges to the standard normal MGF.

Reductions

Assume $\mu = 0$ and $\sigma = 1$ without loss of generality. This is legitimate because we could standardize each $X_j$ separately first — replace $X_j$ by $(X_j - \mu)/\sigma$, call it $Y_j$ — and the CLT for the $Y_j$ is the same statement. So we may assume the $X$'s are already standardized.

Let $S_n = X_1 + \cdots + X_n$. With $\mu = 0$ and $\sigma = 1$ we are looking at $S_n / \sqrt{n}$, and we want to show that its MGF converges to the standard normal MGF as $n \to \infty$.

Computing the MGF

$$E\!\left( e^{\, t\, S_n / \sqrt{n}} \right) = \left[\, M\!\left(\tfrac{t}{\sqrt{n}}\right) \right]^{n}$$

By definition the MGF of $S_n/\sqrt{n}$ is $E\big(e^{t S_n/\sqrt{n}}\big)$. Since $S_n$ is a sum, the exponential factors into a product, and since the $X$'s are independent, the expectation of the product is the product of the expectations:

$$E\!\left( e^{t X_1/\sqrt{n}}\, e^{t X_2/\sqrt{n}} \cdots \right) = \prod_{j=1}^{n} E\!\left( e^{\, t X_j / \sqrt{n}} \right).$$

The $X$'s are IID, so all $n$ factors are identical. Each factor is the common MGF evaluated at $t/\sqrt{n}$, giving $\big[ M(t/\sqrt{n}) \big]^n$.

Taking the limit: an indeterminate form

As $n \to \infty$, the inside $M(t/\sqrt{n}) \to M(0) = 1$ (for any MGF, $M(0) = E(e^0) = 1$). So we have the form $1^{\infty}$, which is indeterminate. The standard move is to take the log to reach a form where L'Hôpital's rule applies, then exponentiate at the end:

$$\log \big[ M(t/\sqrt{n}) \big]^{n} = n \, \log M\!\left(\tfrac{t}{\sqrt{n}}\right)$$

This is of the form $\infty \cdot 0$. Move the $n$ into the denominator as $1/n$, writing $n \log M(t/\sqrt{n}) = \dfrac{\log M(t/\sqrt{n})}{1/n}$, to get the form $0/0$.

Change of variables

Before applying L'Hôpital we change variables, because (a) $n$ is an integer and you can't differentiate over integers, and (b) the square root and the $-1/n^2$ derivative of $1/n$ are annoying. Let

$$y = \frac{1}{\sqrt{n}}, \quad y \text{ real (not just } 1/\sqrt{\text{integer}}\text{)}.$$

As $n \to \infty$, $y \to 0$. Note $1/n = y^2$, so the limit becomes

$$\lim_{y \to 0} \frac{\log M(yt)}{y^2}.$$

The square roots are gone; this is much cleaner. It is still of the form $0/0$.

MGF facts

$M(0) = 1$.
$M'(0) = E(X_1) = \mu = 0$ (the first derivative at $0$ is the mean — this is why it's called the moment generating function).
$M''(0) = E(X_1^2) =$ second moment. Since variance is $1$ and mean is $0$, the second moment is $1$, so $M''(0) = 1$.

First application of L'Hôpital

Differentiate numerator and denominator with respect to $y$ (treating $t$ as constant). The denominator's derivative is $2y$; the numerator's derivative is, by the chain rule, $\dfrac{M'(yt)}{M(yt)}\, t$. As $y \to 0$ the numerator still $\to 0$ (since $M'(0) = 0$) and the denominator $\to 0$ — still $0/0$. Simplify first: the constant $t$ comes out, the $2$ comes out, and the factor $M(yt) \to 1$ splits off. What remains is

$$\frac{t}{2} \, \lim_{y \to 0} \frac{M'(yt)}{y}.$$

Second application of L'Hôpital

Apply L'Hôpital to $M'(yt)/y$. The denominator's derivative is $1$; by the chain rule the numerator's derivative is $M''(yt)\, t$. Combined with the $t$ already outside, this produces a $t^2$ (it was a single $t$ before):

$$\frac{t^2}{2} \, \lim_{y \to 0} M''(yt) = \frac{t^2}{2} \, M''(0) = \frac{t^2}{2}.$$

Conclusion

We took the log, so exponentiate to undo it. The limiting log-MGF is $t^2/2$, hence the limiting MGF is $e^{t^2/2}$ — which is exactly the standard normal $\mathcal{N}(0,1)$ MGF. By the MGF-convergence theorem, the standardized sum converges in distribution to $\mathcal{N}(0,1)$. That completes the proof: basic MGF facts plus L'Hôpital's rule applied twice.
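The limit can be sanity-checked numerically for one concrete distribution. The sketch below assumes each $X_j$ is $\pm 1$ with probability $1/2$ (an example chosen here, not in the lecture), so that $\mu = 0$, $\sigma = 1$, and $M(t) = \cosh t$; the function name `mgf_standardized_sum` is hypothetical.

```python
import math

def mgf_standardized_sum(t, n):
    """MGF of S_n / sqrt(n) when each X_j is +1 or -1 with probability 1/2,
    so that M(t) = cosh(t): this is [M(t / sqrt(n))]^n."""
    return math.cosh(t / math.sqrt(n)) ** n

t = 1.0
target = math.exp(t * t / 2)   # standard normal MGF evaluated at t
for n in [1, 10, 100, 10_000, 1_000_000]:
    print(f"n = {n:>9}: [M(t/sqrt(n))]^n = {mgf_standardized_sum(t, n):.6f}  (target {target:.6f})")
```

The values climb toward $e^{1/2}$ as $n$ grows, matching the conclusion of the proof for this particular $M$.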

There are more general versions (relaxing IID under extra assumptions), but this is the basic CLT.

· · ·

5. Normal Approximation to the Binomial

A key application is using the CLT for approximations (as opposed to the inequalities of the previous lecture). Historically, the first version of the CLT ever proven was for the binomial: a $\text{Bin}(n, p)$ is, under suitable conditions, approximately normal.

In the old days this was crucial: without computers, the binomial is hard to evaluate by hand (factorials and $\binom{n}{k}$ for large $n$ and $k$). Even today, factorials grow so fast that a naive evaluation of a large binomial probability can overflow floating-point arithmetic, and normals have many convenient properties. So we ask: when can we approximate a binomial by a normal, and how?

Setup

Let $X \sim \text{Bin}(n, p)$. Represent it as a sum of IID Bernoulli indicators:

$$X = X_1 + X_2 + \cdots + X_n, \qquad X_j = \begin{cases} 1 & \text{success on trial } j \\ 0 & \text{otherwise} \end{cases}$$

This fits the CLT framework (a sum of IID random variables), so for large $n$ the standardized $X$ is approximately normal. Its mean is $np$ and its variance is $npq$, where $q = 1 - p$.

The approximation

Suppose we want $P(a \le X \le b)$. Standardize $X$ by subtracting the mean and dividing by the standard deviation (this step is exact — no approximation yet):

$$P(a \le X \le b) = P\!\left( \frac{a - np}{\sqrt{npq}} \le \frac{X - np}{\sqrt{npq}} \le \frac{b - np}{\sqrt{npq}} \right)$$

Now apply the CLT: for large enough $n$ the standardized $X$ is approximately $\mathcal{N}(0,1)$. For a normal, the probability of landing in an interval is the difference of CDF values. Writing $\Phi$ for the standard normal CDF:

$$P(a \le X \le b) \approx \Phi\!\left( \frac{b - np}{\sqrt{npq}} \right) - \Phi\!\left( \frac{a - np}{\sqrt{npq}} \right)$$

(We approximate the discrete distribution by a continuous one and use the fundamental theorem of calculus: integrating the normal PDF over an interval gives the difference of CDFs.)
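As a concrete check, the formula can be compared with the exact binomial sum. The sketch below assumes $n = 100$, $p = 1/2$, $a = 45$, $b = 55$ (values picked only for illustration) and uses hypothetical helper names; `statistics.NormalDist` supplies $\Phi$.

```python
from math import comb, sqrt
from statistics import NormalDist

def binom_interval_exact(n, p, a, b):
    """Exact P(a <= X <= b) for X ~ Bin(n, p), summing the PMF."""
    q = 1 - p
    return sum(comb(n, k) * p**k * q**(n - k) for k in range(a, b + 1))

def binom_interval_normal(n, p, a, b):
    """CLT approximation Phi((b - np)/sd) - Phi((a - np)/sd), with sd = sqrt(npq)."""
    mean, sd = n * p, sqrt(n * p * (1 - p))
    phi = NormalDist().cdf
    return phi((b - mean) / sd) - phi((a - mean) / sd)

n, p, a, b = 100, 0.5, 45, 55
print(f"exact  = {binom_interval_exact(n, p, a, b):.4f}")
print(f"normal = {binom_interval_normal(n, p, a, b):.4f}")
```

The two numbers agree only roughly. Much of the gap comes from treating the discrete endpoints as points on a continuum, the issue taken up in Section 6.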

How large must $n$ be?

The CLT is an $n \to \infty$ statement and does not by itself say how large $n$ must be. There are rules of thumb:

A common blanket rule is $n \ge 30$, but this is only a rough guideline and does not always work.
For the binomial specifically, the relevant conditions are that both $np$ and $n(1 - p)$ be reasonably large.

When the approximation works well: $p$ near $1/2$

Symmetry heuristic

Even though $p$ never appeared in the statement of the CLT, as a practical matter the normal approximation to the binomial works best when $p$ is close to $1/2$. A $\text{Bin}(n, 1/2)$ is symmetric, and every normal is symmetric. If $p$ is far from $1/2$ the binomial is heavily skewed, and approximating a skewed distribution by a symmetric one makes little sense. With $p$ near $1/2$ the approximation is good already at $n \approx 30, 50, 100$; for small $p$ the CLT is still true but converges much more slowly.
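The effect of skewness can be measured directly by comparing CDFs. The sketch below is one possible way to quantify it, not the lecture's: it fixes $n = 100$, evaluates both CDFs at the integers without any continuity correction, and reports the largest gap. The function name `max_cdf_error` is hypothetical.

```python
from math import comb, sqrt
from statistics import NormalDist

def max_cdf_error(n, p):
    """Largest gap, over k = 0..n, between the Bin(n, p) CDF at k and the
    N(np, npq) CDF at k (no continuity correction)."""
    normal = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
    cdf, worst = 0.0, 0.0
    for k in range(n + 1):
        cdf += comb(n, k) * p**k * (1 - p)**(n - k)   # running binomial CDF
        worst = max(worst, abs(cdf - normal.cdf(k)))
    return worst

for p in [0.5, 0.1, 0.02]:
    print(f"p = {p}: max CDF error at n = 100 is {max_cdf_error(100, p):.4f}")
```

With the same $n$, the error grows as $p$ moves away from $1/2$, consistent with the symmetry heuristic.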

Contrast with the Poisson approximation

	Normal approximation	Poisson approximation
$n$	large	large $(n \to \infty)$
$p$	want $p$ near $1/2$	want $p$ small $(p \to 0)$
Regime	balanced / symmetric counts	many rare, unlikely events
Result	$\text{Bin}(n,p) \approx \mathcal{N}(np,\, npq)$	$\text{Bin}(n,p) \approx \text{Pois}(\lambda),\ \lambda = np$

The Poisson approximation (proved earlier) applies when $n \to \infty$, $p \to 0$, with $np = \lambda$ fixed — a large number of very rare events. The two regimes are opposite: small $p$ favors Poisson, $p$ near $1/2$ favors normal.
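The Poisson regime can be checked numerically by holding $\lambda = np$ fixed and letting $n$ grow. The sketch below assumes $\lambda = 2$ and measures closeness by (a truncated sum for) total variation distance; both choices, and the name `tv_binom_poisson`, are assumptions made for illustration.

```python
from math import comb, exp, factorial

def tv_binom_poisson(n, p, kmax=60):
    """Approximate total variation distance between Bin(n, p) and Pois(np):
    half the sum of |PMF difference| over k = 0..min(n, kmax). The mass beyond
    that range is negligible when np is small."""
    lam = n * p
    total = 0.0
    for k in range(min(n, kmax) + 1):
        binom = comb(n, k) * p**k * (1 - p)**(n - k)
        pois = exp(-lam) * lam**k / factorial(k)
        total += abs(binom - pois)
    return total / 2

for n in [10, 100, 1000]:
    print(f"n = {n:>4}, p = {2 / n}: TV distance = {tv_binom_poisson(n, 2 / n):.5f}")
```

The distance shrinks as $n$ grows and $p$ shrinks with $np$ fixed, which is the opposite regime from the one where the normal approximation does best.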

Reconciling the two: large-$\lambda$ Poisson is normal

How can the binomial look both Poisson and normal? If $n \to \infty$ with $p$ small but fixed, the binomial still converges to normal (just more slowly), and then $np$ is large. The resolution: the Poisson itself looks normal when its parameter is large. A $\text{Pois}(\lambda)$ with $\lambda$ very large is approximately normal. So in that regime all three meet.

· · ·

6. Continuity Correction

There is something subtle about approximating a discrete distribution by a continuous one. Look at the extreme case $a = b$, i.e., approximating a single binomial PMF value $P(X = a)$.

The exact binomial probability changes if you switch between $\le$ and $<$ (strict vs. non-strict inequalities at the endpoints), but the naive normal approximation does not — once we say "approximately normal," $P(X = a)$ collapses to the area of a single point, which for a continuous distribution is $0$. That is useless.

The fix

The continuity correction. Assume $a$ is an integer. Since $X$ is integer-valued,

$$\{X = a\} \;=\; \left\{\, a - \tfrac{1}{2} < X < a + \tfrac{1}{2} \,\right\}.$$

Replace each integer value by an interval of length $1$ centered at it. This is exactly equivalent for the discrete $X$ (no integers other than $a$ lie in the interval), but now we feed the normal approximation an interval of positive width instead of a degenerate point of probability $0$. This improves the approximation. More generally, for integers $a \le b$ the continuity correction replaces $P(a \le X \le b)$ by $P\big(a - \tfrac{1}{2} < X < b + \tfrac{1}{2}\big)$ before applying the normal approximation, expanding the endpoints by $1/2$ in each direction.
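A short numerical check shows how much the correction helps for a single PMF value. The sketch below assumes $n = 100$, $p = 1/2$, $a = 50$ (illustrative values) and hypothetical helper names; it widens each endpoint by $1/2$ before using the $\mathcal{N}(np, npq)$ CDF.

```python
from math import comb, sqrt
from statistics import NormalDist

def binom_pmf(n, p, a):
    """Exact P(X = a) for X ~ Bin(n, p)."""
    return comb(n, a) * p**a * (1 - p)**(n - a)

def normal_with_correction(n, p, a, b):
    """Normal approximation to P(a <= X <= b) with each endpoint widened by 1/2."""
    normal = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
    return normal.cdf(b + 0.5) - normal.cdf(a - 0.5)

n, p, a = 100, 0.5, 50
print(f"exact P(X = {a})        = {binom_pmf(n, p, a):.5f}")
print(f"continuity-corrected   = {normal_with_correction(n, p, a, a):.5f}")
```

Without the correction the approximation to $P(X = 50)$ would be $0$; with it, the two values should agree closely in this symmetric case.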