Lecture 34: A Look Ahead

Harvard Statistics 110 (Joe Blitzstein)

1. The Top Ten List

The final lecture surveys where the material leads. It opens with the ten ideas Blitzstein singles out as most important — deliberately listed in no order of importance.

| # | Idea | Why it matters |
|---|------|----------------|
| 1 | Conditioning | "The soul of statistics." Includes conditional probability, conditional expectation, Bayes' rule, and the conditional structure of Markov chains. Conditional independence vs. independence has been everywhere. |
| 2 | Symmetry | "Powerful but dangerous." Can collapse 100 pages of algebra into one line, but only if the symmetry is really there. Do not hallucinate symmetries that don't exist. |
| 3 | Random variables and their distributions | A five-word title for the whole course. |
| 4 | Stories | Not just story proofs, but the story behind each named distribution (Normal, Gamma, Poisson, Exponential). The stories are why these distributions matter at all. |
| 5 | Linearity | Of expectation. |
| 6 | Indicator random variables | A favorite trick; extremely useful on interview problems and elsewhere, especially for computing expectations. |
| 7 | LOTUS | The Law of the Unconscious Statistician: computing expectations of functions of random variables without finding their distributions. |
| 8 | Law of Large Numbers | One of the two most famous (and arguably most important) theorems in probability. |
| 9 | Central Limit Theorem | The other one. |
| 10 | Markov chains | The subject of the final lectures; a step beyond i.i.d. |

The list partitions neatly

The ten ideas group into four themes:

Expectation is broader than the mean

"Computing expected values" is not just finding the average. To get the standard deviation you need the variance, and to get the variance you compute expectations. Many quantities reduce back down to computing averages.

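As a small illustration of how a standard deviation reduces to expectations, the sketch below computes the variance of a fair six-sided die as $E(X^2) - (EX)^2$, with $E(X^2)$ obtained by LOTUS. The die and the helper name `expectation` are assumptions chosen for illustration; they are not from the lecture.

```python
from fractions import Fraction

def expectation(pmf, g=lambda x: x):
    """E[g(X)] for a discrete X given as {value: probability} (LOTUS)."""
    return sum(g(x) * p for x, p in pmf.items())

# Hypothetical example: a fair six-sided die.
die = {k: Fraction(1, 6) for k in range(1, 7)}

mean = expectation(die)                              # E(X)
second_moment = expectation(die, lambda x: x ** 2)   # E(X^2) via LOTUS
variance = second_moment - mean ** 2                 # Var(X) = E(X^2) - (E X)^2
std_dev = float(variance) ** 0.5                     # standard deviation
```

Nothing beyond averaging is used: the variance is a difference of two expectations, and the standard deviation is its square root.
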
Long-run behavior

LLN and CLT take the average of a large number of i.i.d. random variables and ask how it behaves. Markov chains go one step beyond i.i.d.: Markov's own motivation was a process that wanders for a long time without being i.i.d., yet retains a clean conditional structure that makes it tractable.
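The long-run behavior described above can be seen in a short simulation. This is only a sketch: the choice of Uniform(0, 1) draws, the seed, and the sample sizes are arbitrary assumptions, not part of the lecture.

```python
import random
import statistics

def sample_mean(n, rng):
    """Mean of n i.i.d. Uniform(0, 1) draws."""
    return sum(rng.random() for _ in range(n)) / n

rng = random.Random(110)  # arbitrary seed, for reproducibility

# LLN: the sample mean settles near the true mean 0.5 as n grows.
means_by_n = {n: sample_mean(n, rng) for n in (10, 1_000, 100_000)}

# CLT: standardized sample means look approximately standard Normal.
n, reps = 200, 2_000
sigma = (1 / 12) ** 0.5  # standard deviation of Uniform(0, 1)
z = [(sample_mean(n, rng) - 0.5) / (sigma / n ** 0.5) for _ in range(reps)]
z_mean, z_sd = statistics.fmean(z), statistics.stdev(z)
frac_within_2 = sum(abs(v) < 2 for v in z) / reps  # roughly 0.95 for a Normal
```

If the CLT approximation is good, `z_mean` should be near 0, `z_sd` near 1, and `frac_within_2` near 0.95.
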

· · ·

2. Stat 110 as a Markov Chain

To frame "where do you go from here," picture the world of statistics at Harvard as a Markov chain. The current state is Stat 110.

The Markov property, applied to you

It doesn't matter how you got here — all that matters is that you are here now. The question is which state you transition to next. Most statistics courses numbered above 110 require it, because probability supplies both the language and the theorems that later courses build on.

Follow-on courses

| Course | Topic | Connection to 110 |
|--------|-------|-------------------|
| Stat 111 | Statistical inference | The natural sequel. 110 + 111 together form a full year on probability and inference, two sides of one coin. Every spring. |
| Stat 139 | Linear models / regression | Overlaps with econometrics (EC1123, EC1126). Not formally requiring 110, but far clearer with it; e.g., EC1126 is "everything is conditional expectation." |
| Stat 123 | Finance | Strongly recommended for anyone interested in finance. A guest preview noted the 2D LOTUS was a key to a Nobel Prize. |
| Stat 115 | Computational biology | A mix of biology, CS, and statistics. Leans heavily on Markov chains and Bayesian thinking. Every spring. |
| Stat 160 | Survey sampling | Relevant to government, policy, polling. Hypergeometric distributions arise naturally (sampling without replacement from a finite population). Every other year. |
| Stat 171 | Stochastic processes | Continues the probability side. About a month on Markov chains, then many other processes: random variables evolving over time. |

Probability vs. inference

The 110/111 split returns to the very first day of class. The two are complementary:

You cannot do inference without probability — it provides the language and the theorems. Results like universality of the uniform were not introduced "to torture you": they are genuinely useful in inference.

Practical advice: learn R

A strong recommendation alongside the coursework: learn R. (Also learn C, e.g. via CS50 — but for statistics specifically, learn R.)

  • R was designed by statisticians for statistical work and is very different in flavor from C: easier in some ways, but a genuinely different mindset. Programmers coming from C often misjudge it as something they can pick up in an afternoon.
  • A free, open-source culture means well-written tutorials and a growing ecosystem of packages, all at no cost.
  • An argument worth taking seriously: studying R doesn't just give you a tool, it can make you a better statistical thinker.
· · ·

3. Worked Example: A Regression Coefficient

Regression (Stat 139, or as seen in econometrics) is extremely widely used for analyzing data. The textbook formulas can look ugly — long summations to memorize or accept on faith. But the core fact follows in a few lines from properties of covariance and Adam's law.

Setup: simple linear regression

The simplest model relates a predictor $X$ to a response $Y$:

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

A common assumption is that the $\varepsilon$ are Normal with mean $0$, but that isn't required. The key assumption used here is the weaker, conditional one:

$$E(\varepsilon \mid X) = 0$$

That is, there is no value of $X$ for which the errors systematically tend positive or negative.

Derivation

The result to be shown is that the slope is a ratio of a covariance to a variance:

$$\beta_1 = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}$$

Because the two sides of the model are the same random variable, we may apply any operation to both. Take the covariance of both sides with $X$:

$$\operatorname{Cov}(Y, X) = \operatorname{Cov}(\beta_0, X) + \operatorname{Cov}(\beta_1 X, X) + \operatorname{Cov}(\varepsilon, X)$$

Term by term:

  • $\operatorname{Cov}(\beta_0, X) = 0$ — the covariance of a constant with anything is zero.
  • $\operatorname{Cov}(\beta_1 X, X) = \beta_1 \operatorname{Var}(X)$ — pull out the constant; the covariance of $X$ with itself is $\operatorname{Var}(X)$.
  • $\operatorname{Cov}(\varepsilon, X) = 0$ — shown next.

For the last term, since $E(\varepsilon \mid X) = 0$, Adam's law (the tower property) gives the unconditional mean:

$$E(\varepsilon) = E\big(E(\varepsilon \mid X)\big) = E(0) = 0$$

With $\varepsilon$ centered at $0$, $\operatorname{Cov}(\varepsilon, X) = E(\varepsilon X)$. Compute that by conditioning on $X$ again:

$$E(\varepsilon X) = E\big(E(\varepsilon X \mid X)\big) = E\big(X \cdot E(\varepsilon \mid X)\big) = E(X \cdot 0) = 0$$

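(Once we condition on $X$, the factor $X$ is known and pulls out; what remains is $E(\varepsilon \mid X) = 0$.) So $\operatorname{Cov}(\varepsilon, X) = 0$, and the covariance equation collapses to $\operatorname{Cov}(Y, X) = \beta_1 \operatorname{Var}(X)$. Dividing by $\operatorname{Var}(X)$, assumed nonzero, gives the stated result.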
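A quick numerical check of the derivation, as a sketch under made-up assumptions: the coefficients, the Exponential predictor, and the error distribution below are hypothetical. The errors are deliberately non-Normal and their spread depends on $X$, but $E(\varepsilon \mid X) = 0$ still holds, which is all the derivation used.

```python
import random

rng = random.Random(139)   # arbitrary seed
beta0, beta1 = 2.0, -1.5   # hypothetical true coefficients
n = 50_000

# X ~ Expo(1); the error is Uniform(-1, 1) scaled by (1 + X), so its spread
# depends on X but its conditional mean given X is 0.
xs = [rng.expovariate(1.0) for _ in range(n)]
eps = [rng.uniform(-1, 1) * (1 + x) for x in xs]
ys = [beta0 + beta1 * x + e for x, e in zip(xs, eps)]

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Sample analogue of Cov(U, V) = E[(U - EU)(V - EV)]."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

slope = cov(xs, ys) / cov(xs, xs)   # Cov(X, Y) / Var(X)
cov_eps_x = cov(eps, xs)            # should be near 0
```

With this many draws, `slope` should land close to `beta1` and `cov_eps_x` close to 0, up to simulation noise.
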

Why this matters

This is the population version of the regression slope — clean and interpretable. Textbooks that avoid these assumptions give the sample version instead: an ugly summation over $(x_i - \bar{x})(y_i - \bar{y})$, either proved by tedious algebra or stated without proof. Understanding the population derivation tells you where that formula comes from.
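For reference, the sample version replaces the population covariance and variance with their sample analogues (the common $1/n$ or $1/(n-1)$ factors cancel), giving the standard least-squares slope for data $(x_1, y_1), \ldots, (x_n, y_n)$:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$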

More broadly: many formulas that look ugly at first are, once understood, just a conditional expectation — and often, geometrically, just a projection. (Projections are nice.)

A recommended, beautifully written, inexpensive book for econometrics: Mostly Harmless Econometrics. Conditional expectation, Adam's law, and Eve's law are everywhere in it — all building on Stat 110.

· · ·

4. Worked Example: Survey Sampling and Horvitz-Thompson

Survey sampling (Stat 160) has a different flavor from the i.i.d. world: we sample from a finite population. Hypergeometric ideas arise naturally because sampling is typically without replacement. The example also doubles as a review of indicators, linearity, and the fundamental bridge.

Setup

There is a finite population of $N$ units (say, people), and for each person $j$ a fixed quantity of interest $Y_j$ (height, income, an opinion). Every population is, of course, finite, though this is often ignored.

The goal: estimate the population average — or equivalently, if $N$ is known, the total $\sum_{j=1}^{N} Y_j$.

Let $p_j$ be the (assumed known) probability that person $j$ is included. Simple random sampling is the case where all $p_j$ are equal; in general some people are easier to reach than others, so the $p_j$ may differ.

The observed data are pairs $(X_1, Z_1), \ldots, (X_n, Z_n)$, where $Z_j$ is the ID of the $j$-th person sampled and $X_j = Y_{Z_j}$ is that person's value.

The estimator

To get an unbiased estimator of the total, the standard trick is to divide each observation by the probability of having sampled that person:

$$\hat{T} = \sum_{j=1}^{n} \frac{X_j}{p_{Z_j}}$$

Concretely, if the observed values were $5, 10, 15$ with sampling probabilities $a, b, c$, the estimate is $\tfrac{5}{a} + \tfrac{10}{b} + \tfrac{15}{c}$.
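The estimator is a one-line computation. In the sketch below, the function name `horvitz_thompson` and the numeric probabilities standing in for $a, b, c$ are assumptions made for illustration.

```python
def horvitz_thompson(values, inclusion_probs):
    """Estimate a population total: sum of observed value / inclusion probability."""
    if len(values) != len(inclusion_probs):
        raise ValueError("need one inclusion probability per observed value")
    if any(p <= 0 or p > 1 for p in inclusion_probs):
        raise ValueError("inclusion probabilities must lie in (0, 1]")
    return sum(x / p for x, p in zip(values, inclusion_probs))

# Observed values from the text; the probabilities standing in for
# a, b, c are made up for illustration.
estimate = horvitz_thompson([5, 10, 15], [0.5, 0.25, 0.1])
```

Units that were hard to reach (small probability) are weighted up, since each one stands in for many similar units that were not sampled.
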

Proof of unbiasedness

$$E(\hat{T}) = \sum_{j=1}^{N} Y_j$$

The denominator $p_{Z_j}$ is awkward: $Z_j$ is a random ID, so this is a random probability. The fix is to rewrite the sum with indicator random variables, summing over the entire population rather than just the sampled units:

$$\hat{T} = \sum_{j=1}^{N} \frac{Y_j}{p_j}\, I_j$$

where $I_j$ indicates that person $j$ is included. This is the same quantity — anyone not sampled is zeroed out, anyone sampled contributes $Y_j / p_j$ — but now the denominators are the fixed, known $p_j$.

Take the expectation, using linearity and then the fundamental bridge $E(I_j) = P(\text{person } j \text{ included}) = p_j$:

$$E(\hat{T}) = \sum_{j=1}^{N} \frac{Y_j}{p_j}\, E(I_j) = \sum_{j=1}^{N} \frac{Y_j}{p_j}\, p_j = \sum_{j=1}^{N} Y_j$$
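The identity can be checked exactly on a tiny population by enumerating every possible sample. This sketch assumes each person is included independently with probability $p_j$; that is a convenience for the enumeration only, since the proof above uses just the marginal $p_j$. The population values and probabilities are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Hypothetical population values Y_j and inclusion probabilities p_j.
Y = [3, 8, 1, 12]
p = [Fraction(1, 2), Fraction(1, 4), Fraction(9, 10), Fraction(1, 3)]

expected = Fraction(0)
for included in product([0, 1], repeat=len(Y)):   # every pattern (I_1, ..., I_N)
    # Probability of this inclusion pattern under independent inclusion.
    prob = Fraction(1)
    for i_j, p_j in zip(included, p):
        prob *= p_j if i_j else 1 - p_j
    # Horvitz-Thompson estimate for this pattern.
    t_hat = sum(Fraction(y) / p_j for y, p_j, i_j in zip(Y, p, included) if i_j)
    expected += prob * t_hat

total = sum(Y)   # the true population total
```

Because the arithmetic uses exact fractions, `expected` equals `total` exactly rather than approximately.
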

Conclusion

The estimator is unbiased for the total. This is the Horvitz-Thompson estimator, also known as inverse-probability weighting — variations on "divide by the probability" are used very widely. The proof requires $p_j > 0$, so we never divide by zero.

· · ·

5. Is Unbiased Good? Basu's Elephant

Unbiasedness is reassuring, but is it enough to call an estimator good? This question goes much deeper, and Basu's elephant is the classic cautionary tale.

Basu's Elephant

A circus owner has 50 elephants and wants their total weight. Weighing all 50 is impractical, so he proposes: weigh one average-looking elephant, "Stampy," and multiply by 50. Reasonable enough.

A statistician objects that this is biased and recommends Horvitz-Thompson. The owner insists on weighing Stampy (the friendly one who won't kick him), and the statistician agrees — any probabilities $p_j > 0$ keep the estimator unbiased — assigning $p = 0.99$ to Stampy and splitting the remaining $0.01$ equally among the other 49.

With 99% probability they weigh Stampy. Horvitz-Thompson then says: divide Stampy's weight by $0.99$, i.e. multiply by $\tfrac{100}{99}$ — barely more than one elephant's weight, as an estimate of the total weight of 50. Had they drawn one of the others, they would multiply by $4{,}900$. On average it is exactly unbiased; in any realized outcome it is absurd.
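The arithmetic of the story can be laid out exactly. The elephant weights below are invented for illustration; only the sampling probabilities come from the example.

```python
from fractions import Fraction

# Hypothetical weights in kg; index 0 is Stampy.
weights = [4000] + [3000 + 40 * k for k in range(49)]
total = sum(weights)

p_stampy = Fraction(99, 100)
p_other = Fraction(1, 100) / 49   # 1/4900 for each of the other 49

probs = [p_stampy] + [p_other] * 49
# Horvitz-Thompson estimate for each possible single draw.
estimates = [Fraction(w) / q for w, q in zip(weights, probs)]

expected = sum(q * e for q, e in zip(probs, estimates))   # exactly the true total
stampy_estimate = float(estimates[0])        # Stampy's weight times 100/99
smallest_other = float(min(estimates[1:]))   # lightest other elephant times 4900
```

The expectation matches the true total, yet every individual estimate is far from it: one is roughly a single elephant's weight, and the other 49 are in the millions of kilograms.
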

The moral

Unbiasedness alone is not a sufficient criterion for a good estimator. Choosing good criteria — and even deciding whether competing methods are well-defined, when they agree, when they disagree — is subtle, and full of surprising paradoxes where a perfectly natural-looking estimator can always be beaten.

Much of this lives in Stat 111, which also takes up the Bayesian-vs.-frequentist distinction. Unbiasedness is an inherently frequentist concept; the conjugate priors and the Beta-Binomial seen in 110 point the Bayesian way.

· · ·

6. Closing: A Recurrent State

A few final edges to add to the chain. One: 110 transitions to jobs — probability questions show up often in interviews. Another: 110 is a recurrent state. Some students, having taken it, wish they could simply repeat it.

The intent is not literal re-enrollment but revisiting the material over and over — which, for this subject, is a genuinely good thing to do.