The final lecture surveys where the material leads. It opens with the ten ideas Blitzstein singles out as most important — deliberately listed in no order of importance.
| # | Idea | Why it matters |
|---|---|---|
| 1 | Conditioning | "The soul of statistics." Includes conditional probability, conditional expectation, Bayes' rule, and the conditional structure of Markov chains. Conditional independence vs. independence has been everywhere. |
| 2 | Symmetry | "Powerful but dangerous." Can collapse 100 pages of algebra into one line — but only if the symmetry is really there. Do not hallucinate symmetries that don't exist. |
| 3 | Random variables and their distributions | A five-word title for the whole course. |
| 4 | Stories | Not just story proofs, but the story behind each named distribution (Normal, Gamma, Poisson, Exponential). The stories are why these distributions matter at all. |
| 5 | Linearity | Of expectation. |
| 6 | Indicator random variables | A favorite trick; extremely useful on interview problems and elsewhere, especially for computing expectations. |
| 7 | LOTUS | The Law of the Unconscious Statistician — computing expectations of functions of random variables without finding their distributions. |
| 8 | Law of Large Numbers | One of the two most famous (and arguably most important) theorems in probability. |
| 9 | Central Limit Theorem | The other one. |
| 10 | Markov chains | The subject of the final lectures; a step beyond i.i.d. |
The ten ideas group into four themes:
"Computing expected values" is not just finding the average. To get the standard deviation you need the variance, and to get the variance you compute expectations. Many quantities reduce back down to computing averages.
LLN and CLT take the average of a large number of i.i.d. random variables and ask how it behaves. Markov chains go one step beyond i.i.d.: Markov's own motivation was a process that wanders for a long time without being i.i.d., yet retains a clean conditional structure that makes it tractable.
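The behavior of such averages can be checked with a small simulation. This is only a sketch under assumed choices: the Uniform(0, 1) distribution, the seed, and the sample sizes are all picked for illustration and do not come from the lecture.

```python
import random
import statistics


def sample_mean(n, rng):
    """Mean of n i.i.d. Uniform(0, 1) draws."""
    return sum(rng.random() for _ in range(n)) / n


rng = random.Random(110)  # fixed seed so the run is reproducible

# LLN: the sample mean settles near the true mean 1/2 as n grows.
lln_means = {n: sample_mean(n, rng) for n in (10, 1_000, 100_000)}

# CLT: sqrt(n) * (sample mean - 1/2) / sigma is approximately N(0, 1),
# where sigma^2 = 1/12 is the variance of Uniform(0, 1).
n, reps = 50, 4_000
sigma = (1 / 12) ** 0.5
z = [(n ** 0.5) * (sample_mean(n, rng) - 0.5) / sigma for _ in range(reps)]

z_mean = statistics.fmean(z)   # should be near 0
z_sd = statistics.pstdev(z)    # should be near 1
frac_within_1sd = sum(abs(v) <= 1 for v in z) / reps  # about 0.68 for N(0, 1)

print(lln_means)
print(z_mean, z_sd, frac_within_1sd)
```

The LLN part looks at one long average; the CLT part looks at the spread of many standardized averages, which is why it needs repeated samples.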
To frame "where do you go from here," picture the world of statistics at Harvard as a Markov chain. The current state is Stat 110.
It doesn't matter how you got here — all that matters is that you are here now. The question is which state you transition to next. Most statistics courses numbered above 110 require it, because probability supplies both the language and the theorems that later courses build on.
| Course | Topic | Connection to 110 |
|---|---|---|
| Stat 111 | Statistical inference | The natural sequel. 110 + 111 together form a full year on probability and inference — two sides of one coin. Every spring. |
| Stat 139 | Linear models / regression | Overlaps with econometrics (EC1123, EC1126). Not formally requiring 110, but far clearer with it — e.g., EC1126 is "everything is conditional expectation." |
| Stat 123 | Finance | Strongly recommended for anyone interested in finance. A guest preview noted the 2D LOTUS was a key to a Nobel Prize. |
| Stat 115 | Computational biology | A mix of biology, CS, and statistics. Leans heavily on Markov chains and Bayesian thinking. Every spring. |
| Stat 160 | Survey sampling | Relevant to government, policy, polling. Hypergeometric distributions arise naturally (sampling without replacement from a finite population). Every other year. |
| Stat 171 | Stochastic processes | Continues the probability side. About a month on Markov chains, then many other processes — random variables evolving over time. |
The 110/111 split returns to the very first day of class. The two are complementary:
You cannot do inference without probability — it provides the language and the theorems. Results like universality of the uniform were not introduced "to torture you": they are genuinely useful in inference.
A strong recommendation alongside the coursework: learn R. (Also learn C, e.g. via CS50 — but for statistics specifically, learn R.)
Regression (Stat 139, or as seen in econometrics) is extremely widely used for analyzing data. The textbook formulas can look ugly — long summations to memorize or accept on faith. But the core fact follows in a few lines from properties of covariance and Adam's law.
The simplest model relates a predictor $X$ to a response $Y$:
$$Y = \beta_0 + \beta_1 X + \varepsilon$$
A common assumption is that the $\varepsilon$ are Normal with mean $0$, but that isn't required. The key assumption used here is the weaker, conditional one:
$$E(\varepsilon \mid X) = 0$$
That is, there is no value of $X$ for which the errors systematically tend positive or negative.
Because the two sides of the model are the same random variable, we may apply any operation to both. Take the covariance of both sides with $X$:
$$\operatorname{Cov}(Y, X) = \operatorname{Cov}(\beta_0, X) + \operatorname{Cov}(\beta_1 X, X) + \operatorname{Cov}(\varepsilon, X)$$
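Term by term: $\operatorname{Cov}(\beta_0, X) = 0$ because $\beta_0$ is a constant, and $\operatorname{Cov}(\beta_1 X, X) = \beta_1 \operatorname{Var}(X)$ because constants factor out of covariance.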
For the last term, since $E(\varepsilon \mid X) = 0$, Adam's law (the tower property) gives the unconditional mean:
$$E(\varepsilon) = E\big(E(\varepsilon \mid X)\big) = E(0) = 0$$
With $\varepsilon$ centered at $0$, $\operatorname{Cov}(\varepsilon, X) = E(\varepsilon X)$. Compute that by conditioning on $X$ again:
$$E(\varepsilon X) = E\big(E(\varepsilon X \mid X)\big) = E\big(X \cdot E(\varepsilon \mid X)\big) = E(X \cdot 0) = 0$$
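(Once we condition on $X$, the factor $X$ is known and pulls out; what remains is $E(\varepsilon \mid X) = 0$.) So $\operatorname{Cov}(\varepsilon, X) = 0$, and the covariance equation collapses to $\operatorname{Cov}(Y, X) = \beta_1 \operatorname{Var}(X)$. Provided $\operatorname{Var}(X) > 0$, this gives the boxed result:
$$\boxed{\beta_1 = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}}$$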
This is the population version of the regression slope — clean and interpretable. Textbooks that avoid these assumptions give the sample version instead: an ugly summation over $(x_i - \bar{x})(y_i - \bar{y})$, either proved by tedious algebra or stated without proof. Understanding the population derivation tells you where that formula comes from.
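The link between the population slope and the sample summation can be sketched numerically. The coefficients, the error distribution, and the helper name `sample_slope` below are all hypothetical choices made for illustration; the errors are deliberately non-Normal to reflect that only $E(\varepsilon \mid X) = 0$ is needed.

```python
import random


def sample_slope(xs, ys):
    """Sample analogue of Cov(X, Y) / Var(X)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of (x_i - x_bar)(y_i - y_bar): the sample covariance, up to a factor.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Sum of (x_i - x_bar)^2: the sample variance, up to the same factor.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(139)  # fixed seed for reproducibility
beta0, beta1 = 2.0, 3.0   # hypothetical true coefficients

xs = [rng.uniform(0, 10) for _ in range(20_000)]
# Errors are Uniform(-1, 1): not Normal, but E(eps | X) = 0 still holds.
ys = [beta0 + beta1 * x + rng.uniform(-1, 1) for x in xs]

slope_hat = sample_slope(xs, ys)
print(slope_hat)  # should land close to beta1
```

The normalizing factors in the sample covariance and sample variance cancel in the ratio, which is why the function never divides by `n` or `n - 1`.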
More broadly: many formulas that look ugly at first are, once understood, just a conditional expectation — and often, geometrically, just a projection. (Projections are nice.)
A recommended, beautifully written, inexpensive book for econometrics: Mostly Harmless Econometrics. Conditional expectation, Adam's law, and Eve's law are everywhere in it — all building on Stat 110.
Survey sampling (Stat 160) has a different flavor from the i.i.d. world: we sample from a finite population. Hypergeometric ideas arise naturally because sampling is typically without replacement. The example also doubles as a review of indicators, linearity, and the fundamental bridge.
There is a finite population — say, of people — and for each person a fixed quantity of interest (height, income, an opinion). Every population is, of course, finite, though this is often ignored.
The goal: estimate the population average — or equivalently, if $N$ is known, the total $\sum_{j=1}^{N} Y_j$.
Let $p_j$ be the (assumed known) probability that person $j$ is included. Simple random sampling is the case where all $p_j$ are equal; in general some people are easier to reach than others, so the $p_j$ may differ.
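The observed data are pairs $(X_1, Z_1), \ldots, (X_n, Z_n)$, where $Z_j$ is the ID of the $j$-th sampled person and $X_j = Y_{Z_j}$ is that person's value.
To get an unbiased estimator of the total, the standard trick is to divide each observation by the probability of having sampled that person:
$$\hat{T} = \sum_{j=1}^{n} \frac{X_j}{p_{Z_j}}$$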
Concretely, if the observed values were $5, 10, 15$ with sampling probabilities $a, b, c$, the estimate is $\tfrac{5}{a} + \tfrac{10}{b} + \tfrac{15}{c}$.
The denominator $p_{Z_j}$ is awkward: $Z_j$ is a random ID, so this is a random probability. The fix is to rewrite the sum with indicator random variables, summing over the entire population rather than just the sampled units:
$$\hat{T} = \sum_{j=1}^{N} \frac{Y_j}{p_j}\, I_j$$
where $I_j$ indicates that person $j$ is included. This is the same quantity — anyone not sampled is zeroed out, anyone sampled contributes $Y_j / p_j$ — but now the denominators are the fixed, known $p_j$.
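The equivalence of the two forms can be checked on one realized sample. The population values, inclusion probabilities, and sampled IDs here are hypothetical, chosen so that the observed values are $5, 10, 15$ as in the concrete example.

```python
from fractions import Fraction

# Hypothetical population of N = 5 people: fixed values Y_j and known
# inclusion probabilities p_j, keyed by person ID.
Y = {1: 5, 2: 8, 3: 10, 4: 12, 5: 15}
p = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 5),
     4: Fraction(1, 4), 5: Fraction(1, 10)}

# One realized sample: the IDs Z_j that happened to be drawn.
sampled_ids = [1, 3, 5]
observed = [(Y[z], z) for z in sampled_ids]  # pairs (X_j, Z_j)

# Form 1: sum over the sample, dividing by the random probability p_{Z_j}.
by_sample = sum(Fraction(x) / p[z] for x, z in observed)

# Form 2: sum over the whole population, with indicators I_j zeroing out
# everyone who was not sampled.
I = {j: int(j in sampled_ids) for j in Y}
by_population = sum(Fraction(Y[j]) / p[j] * I[j] for j in Y)

print(by_sample, by_population)
```

Exact fractions are used so that the two sums can be compared for equality without floating-point noise.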
Take the expectation, using linearity and then the fundamental bridge $E(I_j) = P(\text{person } j \text{ included}) = p_j$:
$$E(\hat{T}) = \sum_{j=1}^{N} \frac{Y_j}{p_j}\, E(I_j) = \sum_{j=1}^{N} \frac{Y_j}{p_j}\, p_j = \sum_{j=1}^{N} Y_j$$
The estimator is unbiased for the total. This is the Horvitz-Thompson estimator, also known as inverse-probability weighting — variations on "divide by the probability" are used very widely. The proof requires $p_j > 0$, so we never divide by zero.
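Unbiasedness can also be verified exactly on a toy population by enumerating every possible sample. The sketch below assumes each person is included independently with probability $p_j$; that design is an assumption made here so that sample probabilities are easy to compute, and the lecture does not specify it. The values and probabilities are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Hypothetical fixed values Y_j and known inclusion probabilities p_j > 0.
Y = [5, 8, 10, 12, 15]
p = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 5),
     Fraction(1, 4), Fraction(1, 10)]


def ht_estimate(indicators):
    """Sum of Y_j / p_j over the people whose indicator is 1."""
    return sum(Fraction(y) / pj * i for y, pj, i in zip(Y, p, indicators))


def sample_probability(indicators):
    """Probability of this inclusion pattern under independent inclusion."""
    prob = Fraction(1)
    for pj, i in zip(p, indicators):
        prob *= pj if i else 1 - pj
    return prob


# Exact expectation: weight each of the 2^N possible samples by its probability.
patterns = list(product((0, 1), repeat=len(Y)))
expected = sum(sample_probability(ind) * ht_estimate(ind) for ind in patterns)
total = sum(Y)

print(expected, total)
```

The linearity argument needs only $E(I_j) = p_j$, so the same conclusion holds for designs where inclusions are dependent; independence is used here purely to make the enumeration simple.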
Unbiasedness is reassuring, but is it enough to call an estimator good? This question goes much deeper, and Basu's elephant is the classic cautionary tale.
A circus owner has 50 elephants and wants their total weight. Weighing all 50 is impractical, so he proposes: weigh one average-looking elephant, "Stampy," and multiply by 50. Reasonable enough.
A statistician objects that this is biased and recommends Horvitz-Thompson. The owner insists on weighing Stampy (the friendly one who won't kick him), and the statistician agrees — any probabilities $p_j > 0$ keep the estimator unbiased — assigning $p = 0.99$ to Stampy and splitting the remaining $0.01$ equally among the other 49.
With 99% probability they weigh Stampy. Horvitz-Thompson then says: divide Stampy's weight by $0.99$, i.e. multiply by $\tfrac{100}{99}$, giving barely more than one elephant's weight as an estimate of the total weight of 50. Had they drawn one of the others, each of which has probability $0.01/49 = 1/4{,}900$, they would multiply that elephant's weight by $4{,}900$. On average the estimator is exactly unbiased; in any realized outcome it is absurd.
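The arithmetic of the elephant example can be made concrete. The weights below are invented for illustration (the story gives none); only the probabilities $0.99$ and $0.01/49$ come from the text.

```python
from fractions import Fraction

# Hypothetical weights in kg; index 0 is Stampy, the other 49 are made up.
weights = [4000] + [3500 + 20 * k for k in range(49)]

# Sampling probabilities from the story: 0.99 for Stampy, and the remaining
# 0.01 split equally among the other 49 elephants.
p = [Fraction(99, 100)] + [Fraction(1, 100) / 49] * 49

# Exactly one elephant is weighed, so each possible estimate is w_j / p_j.
estimates = [w / pj for w, pj in zip(weights, p)]

# Expected value of the estimator over which elephant gets drawn.
expected = sum(pj * est for pj, est in zip(p, estimates))
true_total = sum(weights)

print(float(estimates[0]))  # Stampy drawn: about one elephant's weight
print(float(estimates[1]))  # another elephant drawn: 4,900 times its weight
print(expected == true_total)
```

With these made-up weights, the estimate is far below the true total whenever Stampy is drawn and far above it otherwise, yet the two errors balance exactly in expectation.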
Unbiasedness alone is not a sufficient criterion for a good estimator. Choosing good criteria — and even deciding whether competing methods are well-defined, when they agree, when they disagree — is subtle, and full of surprising paradoxes where a perfectly natural-looking estimator can always be beaten.
Much of this lives in Stat 111, which also takes up the Bayesian-vs.-frequentist distinction. Unbiasedness is an inherently frequentist concept; the conjugate priors and the Beta-Binomial seen in 110 point the Bayesian way.
A few final edges to add to the chain. One: 110 transitions to jobs — probability questions show up often in interviews. Another: 110 is a recurrent state. Some students, having taken it, wish they could simply repeat it.
The intent is not literal re-enrollment but revisiting the material over and over — which, for this subject, is a genuinely good thing to do.