These are arguably the two most famous theorems in the entire history of probability. They are closely related, so it makes sense to study them together and compare and contrast them.
Both theorems share the same setup. We have IID (independent and identically distributed) random variables $X_1, X_2, X_3, \ldots$. Because they are identically distributed, they all share the same mean and the same variance:
$$\mu = E(X_j), \qquad \sigma^2 = \operatorname{Var}(X_j).$$
We assume these are finite (the mean and variance exist).
The central object is the sample mean, written $\bar{X}_n$. The bar is standard statistics notation for an average, and the subscript $n$ is the number of terms:
$$\bar{X}_n = \frac{1}{n} \sum_{j=1}^{n} X_j.$$
Both theorems answer the same question: what happens to $\bar{X}_n$ as $n$ gets large?
The $X$'s are random variables, but once observed they become data. We never have infinitely much data — at some point we stop at $n$, which we can think of as the sample size. Hopefully $n$ is large. We take the average; the question is what we can say about it.
The law of large numbers (LLN) is a simple, intuitive statement: the sample mean converges to the true mean.
$$\bar{X}_n \to \mu \quad \text{as } n \to \infty, \text{ with probability } 1.$$
Here "true mean" means the theoretical mean, $\mu = E(X_j)$.
Note the asymmetry: $\mu$ is a constant, while $\bar{X}_n$ is a random variable (an average of random variables). The theorem says the random variable converges to the constant. The "with probability 1" is the fine print: something crazy could happen on an event of probability $0$, but we don't worry about it.
When we take limits of random variables (rather than ordinary sequences of numbers), we must be careful about the definition. A random variable is, mathematically, a function of the outcome of the experiment. The convergence here is pointwise: evaluate $\bar{X}_n$ at a specific outcome and you get an honest sequence of numbers, and those numbers converge to $\mu$. So either the sequence converges or it doesn't — that is an event — and the theorem says that event has probability $1$.
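Let $X_j \sim \text{Bern}(p)$, so we imagine an infinite sequence of coin tosses with probability $p$ of heads. Then
$$\bar{X}_n = \frac{X_1 + \cdots + X_n}{n} = \text{the proportion of heads in the first } n \text{ tosses},$$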
and the LLN says this proportion converges to $p$ with probability 1. Flip a fair coin a million times: you don't expect exactly 500,000 heads, but you do expect the proportion to get closer and closer to $1/2$ in the long run.
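The coin-tossing example can be sketched as a small simulation. This is only an illustration under stated assumptions: the function name `running_proportions`, the seed, and the checkpoint values are chosen here and are not part of the lecture.

```python
import random

def running_proportions(p, n_flips, checkpoints, seed=0):
    """Simulate n_flips Bern(p) tosses; return the proportion of heads at each checkpoint."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) so the run is reproducible
    wanted = set(checkpoints)
    heads = 0
    results = {}
    for n in range(1, n_flips + 1):
        heads += rng.random() < p  # True counts as 1, False as 0
        if n in wanted:
            results[n] = heads / n
    return results

props = running_proportions(p=0.5, n_flips=100_000,
                            checkpoints=[10, 100, 1_000, 10_000, 100_000])
for n, prop in props.items():
    print(f"n = {n:>7}: proportion of heads = {prop:.4f}")
```

A single run shows one outcome of the experiment, so it illustrates the theorem rather than proving anything: the proportion wanders for small $n$ and should settle near $p$ as $n$ grows.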
The "with probability 1" qualification is needed because, mathematically, nothing forbids the coin from landing heads forever. That will never happen in reality, but the math does not call such a sequence invalid. These pathological cases have probability $0$, and on everything else we get what we expect.
The result is also necessary: if you didn't know $p$, the obvious estimate is to flip many times and take the proportion of heads. The LLN is exactly the justification for that procedure — without it, you'd have no grounds for the approximation.
The gambler's fallacy is the feeling that after losing many times in a row, you are "due" to win — and people sometimes try to justify it with the LLN: "the long-run average must return to $1/2$, so I should start winning to compensate." That is not how it works. The coin is memoryless; it does not care how many losses came before. After 100 tails in a row, the probability for flip 101 is unchanged.
The correct mechanism is swamping. The LLN sends $n \to \infty$. No matter how unlucky you were in the first hundred or first million trials, that finite stretch is nothing compared to infinity — it gets swamped by the entire infinite future. The early losses are not offset by extra future heads; they are simply diluted away.
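The swamping argument can be checked with plain arithmetic. The sketch below assumes a run of tails followed by fresh fair flips; the helper names `expected_proportion` and `expected_deficit` are made up for this illustration.

```python
def expected_proportion(initial_tails, future_flips, p=0.5):
    """Expected proportion of heads after a run of tails followed by fresh Bern(p) flips."""
    expected_heads = p * future_flips  # future flips are not tilted toward heads
    return expected_heads / (initial_tails + future_flips)

def expected_deficit(initial_tails, future_flips, p=0.5):
    """Expected number of heads minus p times the total number of flips."""
    total = initial_tails + future_flips
    return p * future_flips - p * total

for m in [100, 10_000, 1_000_000]:
    print(m, expected_proportion(100, m), expected_deficit(100, m))
```

The expected deficit stays at $-50$ heads however many further flips are made, so nothing is ever "paid back". The proportion still approaches $1/2$ because that fixed deficit is divided by an ever larger number of flips.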
A story: a colleague's student claimed to hate statistics. Asked why, the student — an athlete training daily — said he had just learned the LLN and found it depressing: it seemed to say that in the long run he could only ever be average, with no room to improve.
The fallacy is in the IID assumption. IID means the distribution does not change with time. Improving your own performance changes your distribution, so the sequence would no longer be IID. The LLN does not say you cannot improve; it only describes averages of a fixed distribution.
Far from depressing, the LLN is arguably what makes science possible. Imagine a counterfactual world where it were false: you collect more and more data, let your sample size grow, and yet never converge to the truth. The LLN guarantees that more data brings you to the truth.
We will prove a slightly different version, the weak law of large numbers. (Blitzstein notes he dislikes the "strong"/"weak" terminology, but it is standard.) The weak law states: for any $c > 0$,
$$P\big(|\bar{X}_n - \mu| > c\big) \to 0 \quad \text{as } n \to \infty.$$
This is convergence in probability.
It is not literally equivalent to the strong law, but the strong law implies it (showing this takes some real analysis we don't need here). The intuition is the same: interpret $c$ as small (say $0.001$); the statement says that for large $n$ it is extremely unlikely that the sample mean and the true mean differ by more than $0.001$.
The strong law is hard to prove, but the weak law follows in essentially one line from Chebyshev's inequality (from the previous lecture). Since $E(\bar{X}_n) = \mu$,
$$P\big(|\bar{X}_n - \mu| > c\big) \le \frac{\operatorname{Var}(\bar{X}_n)}{c^2}.$$
Now compute $\operatorname{Var}(\bar{X}_n)$. From the definition of $\bar{X}_n$, the factor $1/n$ comes out of the variance as $1/n^2$, and by independence the variance of the sum is $n$ times the variance of one term:
$$\operatorname{Var}(\bar{X}_n) = \frac{1}{n^2}\,\big(n\,\sigma^2\big) = \frac{\sigma^2}{n}.$$
So
$$P\big(|\bar{X}_n - \mu| > c\big) \le \frac{\sigma^2}{n\,c^2}.$$
Here $\sigma$ and $c$ are constants and $n \to \infty$, so the bound $\to 0$. That proves the weak law of large numbers in one line.
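To see how conservative the Chebyshev bound is, here is a sketch that compares it with the exact probability for Bernoulli data, where $n\bar{X}_n$ is binomial. The function names and the choices $p = 0.5$, $c = 0.05$ are assumptions made for illustration.

```python
import math

def binom_pmf(k, n, p):
    """Bin(n, p) PMF, computed in log space to avoid overflow for large n."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

def exact_deviation_prob(n, p, c):
    """P(|Xbar_n - p| > c) for the mean of n IID Bern(p) variables."""
    return sum(binom_pmf(k, n, p) for k in range(n + 1) if abs(k / n - p) > c)

def chebyshev_bound(n, p, c):
    """sigma^2 / (n c^2), with sigma^2 = p(1 - p) for a Bernoulli."""
    return p * (1 - p) / (n * c ** 2)

p, c = 0.5, 0.05
for n in [50, 250, 1250]:
    print(n, exact_deviation_prob(n, p, c), chebyshev_bound(n, p, c))
```

Both columns shrink as $n$ grows, with the exact probability below the bound. The bound can even exceed $1$ for small $n$; it is crude, but it is all the proof of the weak law needs.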
Rewrite what we proved as $\bar{X}_n - \mu \to 0$ as $n \to \infty$. This is good to know, but it does not tell us two things: the shape of the distribution of $\bar{X}_n$, and the rate at which it converges to $\mu$.
A general strategy for studying how fast something goes to zero is to multiply by something that goes to infinity. Multiply $\bar{X}_n - \mu$ by $n$ raised to some power $a > 0$:
$$n^{a}\,\big(\bar{X}_n - \mu\big).$$
So the two factors compete: one goes to infinity, the other to zero. The threshold power that balances them is $\tfrac{1}{2}$. With the square root we get a non-degenerate limit — the product neither collapses to $0$ nor blows up, but converges to an actual distribution. Dividing by $\sigma$ cleans things up:
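One way to see why $1/2$ is the threshold is to track the variance, using $\operatorname{Var}(\bar{X}_n) = \sigma^2/n$ from the weak-law proof. This is a heuristic sketch that only looks at the spread, and the exponent label $a$ is notation introduced here for the power of $n$:
$$\operatorname{Var}\!\big(n^{a}(\bar{X}_n - \mu)\big) = n^{2a} \cdot \frac{\sigma^2}{n} = \sigma^2\, n^{2a - 1}.$$
For $a < \tfrac{1}{2}$ the variance tends to $0$ (the product collapses to a constant), for $a > \tfrac{1}{2}$ it tends to $\infty$, and only for $a = \tfrac{1}{2}$ does it stay equal to $\sigma^2$ for every $n$.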
As $n \to \infty$,
$$\sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \;\xrightarrow{\ d\ }\; \mathcal{N}(0, 1).$$
Convergence is in distribution: the CDF of the left-hand quantity converges to the standard normal CDF $\Phi$.
The $X$'s may be discrete, continuous, or a mixture, so they need not have a PDF — but every random variable has a CDF, so the statement is about CDFs.
The standard normal is one specific bell curve, yet any distribution with finite variance — discrete, continuous, or extremely nasty-looking — gives a standardized sample mean that converges to it. The only assumption is finite variance. This is a major reason the standard normal is so important: although the CLT is an $n \to \infty$ statement, in practice it justifies normal approximation — for large $n$ the sample mean is approximately normal, even when the original data were nowhere near normal.
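A quick simulation can illustrate the claim for data that are far from normal. The sketch below assumes Expo(1) data (for which $\mu = \sigma = 1$), with $n = 50$ and 5,000 replications; these choices and the function names are illustrative only.

```python
import math
import random
from statistics import NormalDist

def standardized_means(n, reps, seed=0):
    """Draw reps values of sqrt(n) * (Xbar_n - mu) / sigma for Expo(1) data (mu = sigma = 1)."""
    rng = random.Random(seed)  # fixed seed (arbitrary) for reproducibility
    out = []
    for _ in range(reps):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        out.append(math.sqrt(n) * (xbar - 1.0) / 1.0)
    return out

def empirical_cdf(samples, z):
    """Fraction of samples that are <= z."""
    return sum(s <= z for s in samples) / len(samples)

samples = standardized_means(n=50, reps=5_000)
phi = NormalDist().cdf  # standard normal CDF
for z in [-1.0, 0.0, 1.0]:
    print(f"z = {z:+.1f}: empirical {empirical_cdf(samples, z):.3f}, Phi {phi(z):.3f}")
```

The empirical CDF of the standardized sample mean should sit close to $\Phi$ at each point, even though a single exponential draw is heavily skewed. Some residual skewness at $n = 50$ is expected.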
| | Law of Large Numbers (strong) | Central Limit Theorem |
|---|---|---|
| Object | $\bar{X}_n$ | $\sqrt{n}\,(\bar{X}_n - \mu)/\sigma$ |
| Conclusion | converges to the constant $\mu$ | converges to $\mathcal{N}(0,1)$ |
| Convergence | pointwise, w.p. 1 (the random variables converge) | in distribution (only the CDF converges) |
| Information | the limit value | the shape and the rate (scale $\sqrt{n}$) |
| Assumptions | finite mean (and variance, this version) | finite variance |
The CLT is in a sense more informative — it gives the distribution and the rate, not just the limiting value. But it is a genuinely different sense of convergence: the LLN says the random variables themselves converge, the CLT says only their distribution converges to $\mathcal{N}(0,1)$.
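It is worth being fluent in both forms; the algebra connecting them is just a factor of $n$. Let $S_n = X_1 + \cdots + X_n$, which has mean $n\mu$ and variance $n\sigma^2$. To turn the sum into a standard normal we must standardize it (otherwise it just blows up as we add more terms):
$$\frac{S_n - n\mu}{\sigma \sqrt{n}} \;\xrightarrow{\ d\ }\; \mathcal{N}(0, 1).$$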
This is the same theorem, viewed as a statement about the sum (or convolution) rather than the sample mean.
The CLT holds whenever the variance exists — no need to assume higher moments. But that generality is hard to prove. We assume the moment generating function (MGF) exists, which lets us use MGF machinery; the proof can be extended to cases where the MGF does not exist, but we won't need that.
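Let $M(t)$ be the common MGF of the $X_j$ (they are IID, so they all share the same MGF). The plan is to show that the MGF of the standardized sum converges to the standard normal MGF, which implies convergence in distribution.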
Assume $\mu = 0$ and $\sigma = 1$ without loss of generality. This is legitimate because we could standardize each $X_j$ separately first — replace $X_j$ by $(X_j - \mu)/\sigma$, call it $Y_j$ — and the CLT for the $Y_j$ is the same statement. So we may assume the $X$'s are already standardized.
Let $S_n = X_1 + \cdots + X_n$. With $\mu = 0$ and $\sigma = 1$ we are looking at $S_n / \sqrt{n}$, and we want to show that its MGF converges to the standard normal MGF as $n \to \infty$.
By definition the MGF of $S_n/\sqrt{n}$ is $E\big(e^{t S_n/\sqrt{n}}\big)$. Since $S_n$ is a sum, the exponential factors into a product, and since the $X$'s are independent, the expectation of the product is the product of the expectations:
$$E\!\left( e^{t X_1/\sqrt{n}}\, e^{t X_2/\sqrt{n}} \cdots \right) = \prod_{j=1}^{n} E\!\left( e^{\, t X_j / \sqrt{n}} \right).$$
The $X$'s are IID, so all $n$ factors are identical. Each factor is the common MGF evaluated at $t/\sqrt{n}$, giving $\big[ M(t/\sqrt{n}) \big]^n$.
As $n \to \infty$, the inside $M(t/\sqrt{n}) \to M(0) = 1$ (for any MGF, $M(0) = E(e^0) = 1$). So we have the form $1^{\infty}$, which is indeterminate. The standard move is to take the log to reach a form where L'Hôpital's rule applies, then exponentiate at the end:
$$\lim_{n \to \infty} n \log M\!\left(\frac{t}{\sqrt{n}}\right).$$
This is of the form $\infty \cdot 0$. Write the $n$ as $1/n$ in the denominator to get $0/0$:
$$\lim_{n \to \infty} \frac{\log M\!\left(t/\sqrt{n}\right)}{1/n}.$$
Before applying L'Hôpital we change variables, because (a) $n$ is an integer and you can't differentiate over integers, and (b) the square root and the $-1/n^2$ derivative of $1/n$ are annoying. Let
$$y = \frac{1}{\sqrt{n}}.$$
As $n \to \infty$, $y \to 0$. Note $1/n = y^2$, so the limit becomes
$$\lim_{y \to 0} \frac{\log M(yt)}{y^2}.$$
The square roots are gone; this is much cleaner. It is still of the form $0/0$.
Differentiate numerator and denominator with respect to $y$ (treating $t$ as constant). The denominator's derivative is $2y$; the numerator's derivative is, by the chain rule, $\dfrac{M'(yt)}{M(yt)}\, t$. As $y \to 0$ the numerator still $\to 0$ (since $M'(0) = E(X_1) = \mu = 0$) and the denominator $\to 0$, so the form is still $0/0$. Simplify first: the constant $t$ comes out, the $2$ comes out, and the factor $M(yt) \to 1$ splits off. What remains is
$$\frac{t}{2} \lim_{y \to 0} \frac{M'(yt)}{y}.$$
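Apply L'Hôpital to $M'(yt)/y$. The denominator's derivative is $1$; by the chain rule the numerator's derivative is $M''(yt)\, t$. Combined with the $t$ already outside, this produces a $t^2$ (it was a single $t$ before):
$$\frac{t}{2} \lim_{y \to 0} \frac{t\, M''(yt)}{1} = \frac{t^2}{2}\, M''(0) = \frac{t^2}{2},$$
since $M''(0) = E(X_1^2) = \operatorname{Var}(X_1) + \mu^2 = 1$ under the standardization $\mu = 0$, $\sigma = 1$.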
We took the log, so exponentiate to undo it. The limiting log-MGF is $t^2/2$, hence the limiting MGF is $e^{t^2/2}$ — which is exactly the standard normal $\mathcal{N}(0,1)$ MGF. By the MGF-convergence theorem, the standardized sum converges in distribution to $\mathcal{N}(0,1)$. That completes the proof: basic MGF facts plus L'Hôpital's rule applied twice.
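The limit can be sanity-checked numerically for one concrete distribution. The sketch below assumes each $X_j$ is $+1$ or $-1$ with probability $1/2$ (mean $0$, variance $1$), whose MGF is $M(t) = \cosh t$; this example distribution is chosen here and is not from the lecture.

```python
import math

def mgf_standardized_sum(t, n):
    """MGF of S_n / sqrt(n) when each X_j is +1 or -1 with probability 1/2, so M(t) = cosh(t)."""
    return math.cosh(t / math.sqrt(n)) ** n  # [M(t / sqrt(n))]^n

t = 1.0
target = math.exp(t ** 2 / 2)  # standard normal MGF evaluated at t
for n in [1, 10, 100, 10_000]:
    print(n, mgf_standardized_sum(t, n), target)
```

The printed values should climb toward $e^{t^2/2}$ as $n$ grows, matching the limit obtained with L'Hôpital's rule.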
There are more general versions (relaxing IID under extra assumptions), but this is the basic CLT.
A key application is using the CLT for approximations (as opposed to the inequalities of the previous lecture). Historically, the first version of the CLT ever proven was for the binomial: a $\text{Bin}(n, p)$ is, under suitable conditions, approximately normal.
In the old days this was crucial: without computers, the binomial is hard to evaluate by hand (factorials and $\binom{n}{k}$ for large $n$ and $k$). Even today, factorials grow so fast that a naive large binomial computation can overflow, and normals have many convenient properties. So we ask: when can we approximate a binomial by a normal, and how?
Let $X \sim \text{Bin}(n, p)$. Represent it as a sum of IID Bernoulli indicators:
$$X = X_1 + X_2 + \cdots + X_n, \qquad X_j \sim \text{Bern}(p) \text{ IID}.$$
This fits the CLT framework (a sum of IID random variables), so for large $n$ the standardized $X$ is approximately normal. Its mean is $np$ and its variance is $npq$, where $q = 1 - p$.
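Suppose we want $P(a \le X \le b)$. Standardize $X$ by subtracting the mean and dividing by the standard deviation (this step is exact, with no approximation yet):
$$P(a \le X \le b) = P\!\left( \frac{a - np}{\sqrt{npq}} \le \frac{X - np}{\sqrt{npq}} \le \frac{b - np}{\sqrt{npq}} \right).$$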
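Now apply the CLT: for large enough $n$ the standardized $X$ is approximately $\mathcal{N}(0,1)$. For a normal, the probability of landing in an interval is the difference of CDF values. Writing $\Phi$ for the standard normal CDF:
$$P(a \le X \le b) \approx \Phi\!\left( \frac{b - np}{\sqrt{npq}} \right) - \Phi\!\left( \frac{a - np}{\sqrt{npq}} \right).$$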
(We approximate the discrete distribution by a continuous one and use the fundamental theorem of calculus: integrating the normal PDF over an interval gives the difference of CDFs.)
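As a sketch of how the approximation behaves in practice, the following compares the exact binomial interval probability with the difference of $\Phi$ values. The parameter values ($n = 100$, $p = 1/2$, $a = 45$, $b = 55$) and the function names are illustrative assumptions.

```python
import math
from statistics import NormalDist

def binom_interval_exact(n, p, a, b):
    """Exact P(a <= X <= b) for X ~ Bin(n, p), summing the PMF."""
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(a, b + 1))

def binom_interval_normal(n, p, a, b):
    """CLT approximation: Phi((b - np)/sqrt(npq)) - Phi((a - np)/sqrt(npq))."""
    mean, sd = n * p, math.sqrt(n * p * (1 - p))
    phi = NormalDist().cdf
    return phi((b - mean) / sd) - phi((a - mean) / sd)

n, p, a, b = 100, 0.5, 45, 55
exact = binom_interval_exact(n, p, a, b)
approx = binom_interval_normal(n, p, a, b)
print(f"exact = {exact:.4f}, normal approximation = {approx:.4f}")
```

Here the interval is one standard deviation either side of the mean, so the approximation is $\Phi(1) - \Phi(-1)$. It lands in the right neighborhood but visibly under-counts, because the endpoints $a$ and $b$ each carry positive binomial probability that the continuous curve does not fully capture.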
The CLT is an $n \to \infty$ statement and does not by itself say how large $n$ must be; in practice one relies on rules of thumb.
Even though $p$ never appeared in the statement of the CLT, as a practical matter the normal approximation to the binomial works best when $p$ is close to $1/2$. A $\text{Bin}(n, 1/2)$ is symmetric, and every normal is symmetric. If $p$ is far from $1/2$ the binomial is heavily skewed, and approximating a skewed distribution by a symmetric one makes little sense. With $p$ near $1/2$ the approximation is good already at $n \approx 30, 50, 100$; for small $p$ the CLT is still true but converges much more slowly.
| | Normal approximation | Poisson approximation |
|---|---|---|
| $n$ | large | large $(n \to \infty)$ |
| $p$ | want $p$ near $1/2$ | want $p$ small $(p \to 0)$ |
| Regime | balanced / symmetric counts | many rare, unlikely events |
| Result | $\text{Bin}(n,p) \approx \mathcal{N}(np,\, npq)$ | $\text{Bin}(n,p) \approx \text{Pois}(\lambda),\ \lambda = np$ |
The Poisson approximation (proved earlier) applies when $n \to \infty$, $p \to 0$, with $np = \lambda$ fixed — a large number of very rare events. The two regimes are opposite: small $p$ favors Poisson, $p$ near $1/2$ favors normal.
How can the binomial look both Poisson and normal? If $n \to \infty$ with $p$ fixed but very small, the standardized binomial still converges to normal (just more slowly). The resolution: the Poisson itself looks normal when its parameter is large. A $\text{Pois}(\lambda)$ with $\lambda$ very large is approximately normal. So in that regime (small $p$ but large $np$) all three agree.
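The two regimes can be compared numerically. The sketch below measures the largest absolute PMF error of each approximation; the function names, the parameter choices, and the use of a unit-width interval around each integer for the normal probability are all assumptions made for this illustration.

```python
import math
from statistics import NormalDist

def binom_pmf(k, n, p):
    """Bin(n, p) PMF in log space (avoids overflow for large n)."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

def poisson_pmf(k, lam):
    """Pois(lam) PMF in log space."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def normal_bin_prob(k, mean, sd):
    """Normal probability of the unit-width interval centered at the integer k."""
    nd = NormalDist(mean, sd)
    return nd.cdf(k + 0.5) - nd.cdf(k - 0.5)

def max_pmf_errors(n, p, k_max):
    """Largest absolute PMF error over k = 0..k_max: (Poisson error, normal error)."""
    mean, sd = n * p, math.sqrt(n * p * (1 - p))
    ks = range(k_max + 1)
    err_pois = max(abs(binom_pmf(k, n, p) - poisson_pmf(k, mean)) for k in ks)
    err_norm = max(abs(binom_pmf(k, n, p) - normal_bin_prob(k, mean, sd)) for k in ks)
    return err_pois, err_norm

rare = max_pmf_errors(n=1000, p=0.002, k_max=20)      # many rare events, np = 2
balanced = max_pmf_errors(n=1000, p=0.5, k_max=1000)  # symmetric case
print("rare:     Poisson error %.5f, normal error %.5f" % rare)
print("balanced: Poisson error %.5f, normal error %.5f" % balanced)
```

With $p = 0.002$ the Poisson error should be much smaller than the normal error, and with $p = 1/2$ the ordering should reverse, in line with the table.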
There is something subtle about approximating a discrete distribution by a continuous one. Look at the extreme case $a = b$, i.e., approximating a single binomial PMF value $P(X = a)$.
The exact binomial probability changes if you switch between $\le$ and $<$ (strict vs. non-strict inequalities at the endpoints), but the naive normal approximation does not — once we say "approximately normal," $P(X = a)$ collapses to the area of a single point, which for a continuous distribution is $0$. That is useless.
The continuity correction. Assume $a$ is an integer. Since $X$ is integer-valued,
$$\{X = a\} \;=\; \left\{\, a - \tfrac{1}{2} < X < a + \tfrac{1}{2} \,\right\}.$$
Replace each integer value by an interval of length $1$ centered at it. This is exactly equivalent for the discrete $X$ (no integers other than $a$ lie in the interval), but now we feed the normal approximation an interval of positive width instead of a degenerate point of probability $0$. This improves the approximation; more generally, expanding the endpoints by $1/2$ in each direction is the continuity correction for interval probabilities.
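A small numeric sketch of the continuity correction, assuming $n = 100$ and $p = 1/2$ (illustrative values) and using hypothetical helper names:

```python
import math
from statistics import NormalDist

def binom_interval_exact(n, p, a, b):
    """Exact P(a <= X <= b) for X ~ Bin(n, p)."""
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(a, b + 1))

def normal_approx(n, p, a, b, correction):
    """Normal approximation to P(a <= X <= b); optionally widen each endpoint by 1/2."""
    mean, sd = n * p, math.sqrt(n * p * (1 - p))
    shift = 0.5 if correction else 0.0
    phi = NormalDist().cdf
    return phi((b + shift - mean) / sd) - phi((a - shift - mean) / sd)

n, p = 100, 0.5
for a, b in [(50, 50), (45, 55)]:
    exact = binom_interval_exact(n, p, a, b)
    naive = normal_approx(n, p, a, b, correction=False)
    corrected = normal_approx(n, p, a, b, correction=True)
    print(f"[{a}, {b}]: exact {exact:.4f}, naive {naive:.4f}, corrected {corrected:.4f}")
```

For the single point $a = b = 50$ the naive approximation is exactly $0$, while the corrected version integrates the normal density over $(49.5, 50.5)$ and should land close to the true PMF value. The same half-unit widening also tightens the interval case.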