Lecture 15: Midterm Review

Harvard Statistics 110 (Joe Blitzstein)

This review lecture revisits a handful of representative problems rather than introducing new theory. Each one drills a core technique — linearity, symmetry, universality of the uniform, LOTUS, going from a CDF to a PDF, and story proofs — applied to a self-contained example. The recurring meta-lesson: break a hard random variable into pieces, recognize the pattern, and let structure (not brute calculation) do the work.

1. Coupon Collector: Linearity and Geometrics

The problem

You collect toys (or coupons) that come in $n$ equally likely types. Each purchase gives one uniformly random type. On average, how many toys must you buy to collect a complete set of all $n$ types? "Time" is measured discretely as the number of toys purchased.

The equally-likely assumption matters: with unequal probabilities the calculation becomes extremely tedious (pages of work), so exam problems keep it equally likely.

Break the total into pieces

Let $T$ be the total number of toys needed. Decompose it by milestones:

$$T = T_1 + T_2 + \cdots + T_n,$$

where $T_j$ is the additional time spent waiting for the $j$-th new type, after you already hold $j - 1$ distinct types ("new" = a type you don't already have).

Each piece is geometric

Suppose you currently hold $j - 1$ distinct types. On each new purchase, the chance of getting something new is

$$p_j = \frac{n - (j - 1)}{n},$$

since $n - (j - 1)$ of the $n$ types are still missing; with the complementary probability you draw a duplicate and try again. So the number of duplicates drawn before the next new type, $T_j - 1$, is $\mathrm{Geom}(p_j)$. (Subtracting $1$ matches the convention that a Geometric counts failures before the first success and so starts at $0$; you always need at least one more purchase, hence the $+1$ when converting back.)

Add up with linearity

Why linearity wins

Linearity of expectation gives the answer directly — and would hold even if the $T_j$ were dependent. (Here they happen to be independent, but independence is not required.)

$$E(T) = E(T_1) + E(T_2) + \cdots + E(T_n).$$

For a Geometric starting at $1$ the mean is $1/p$ (that is, $q/p + 1$, adding back the $+1$), so $E(T_j) = 1/p_j = \dfrac{n}{n - j + 1}$. Therefore:

$$E(T) = \frac{n}{n} + \frac{n}{n-1} + \frac{n}{n-2} + \cdots + \frac{n}{1} = n\left(1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n}\right).$$

The bracketed sum is the $n$-th harmonic number $H_n$.

Result and approximation

Answer

$$E(T) = n \, H_n = n\left(1 + \tfrac{1}{2} + \cdots + \tfrac{1}{n}\right) \approx n \log n \quad (\text{large } n).$$

Give the exact answer ($n\,H_n$) on an exam unless an approximation is explicitly requested; $n \log n$ is the handy large-$n$ estimate. What looked like a hard problem becomes easy once it is split into geometric pieces and reassembled with linearity.
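
The result can be checked numerically. The sketch below is not from the lecture: the function names, the seed, and the trial count are illustrative choices. It computes $n\,H_n$ exactly and compares it with a simulation that buys uniformly random toys until every type has appeared.

```python
import random


def expected_toys(n):
    """Exact mean n * H_n, written as the sum of n / k for k = 1, ..., n."""
    return sum(n / k for k in range(1, n + 1))


def simulate_toys(n, rng):
    """Buy uniformly random toys until all n types have been seen."""
    seen = set()
    purchases = 0
    while len(seen) < n:
        seen.add(rng.randrange(n))  # each type equally likely
        purchases += 1
    return purchases


def average_toys(n, trials, seed=0):
    """Monte Carlo estimate of E(T); seed and trial count are arbitrary."""
    rng = random.Random(seed)
    return sum(simulate_toys(n, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    n = 10
    print("exact n * H_n:", expected_toys(n))
    print("simulated    :", average_toys(n, trials=20_000))
```

With enough trials the simulated average should settle near the exact value; the simulation never uses the geometric decomposition, so agreement is a genuine check on the argument.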

· · ·

2. Universality of the Uniform

This was the number-one review request. The key statement: if $X$ is a continuous random variable with a strictly increasing, continuous CDF $F$, then plugging $X$ into its own CDF yields a $\mathrm{Unif}(0,1)$:

$$F(X) \sim \mathrm{Unif}(0, 1).$$

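(This is one of the two halves of universality. The other half, that $F^{-1}(U)$ has CDF $F$ when $U \sim \mathrm{Unif}(0,1)$, is used without proof in the simulation recipe later in this section.)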

Why it's true — a geometric argument

Draw a generic CDF $F$: increasing, continuous, leveling off at $1$ (either hitting $1$ and staying, or approaching $1$ asymptotically). The horizontal axis is the value $x$; the vertical axis runs $0$ to $1$.

Pick any target height on the vertical axis — say $\tfrac{1}{3}$ — and let $x_0$ be the point with $F(x_0) = \tfrac{1}{3}$ (the inverse-CDF value). Now ask for the probability that $F(X) \le \tfrac{1}{3}$, where $X$ is random with CDF $F$. To compute $F(X)$ you first draw $X$ from the distribution, then read off its height $F(X)$.

$$P\!\left(F(X) \le \tfrac{1}{3}\right) = P(X \le x_0) = F(x_0) = \tfrac{1}{3}.$$

The event $F(X) \le \tfrac{1}{3}$ happens exactly when $X$ lands to the left of $x_0$ (if $X$ were to the right, its height would exceed $\tfrac{1}{3}$). Nothing was special about $\tfrac{1}{3}$: for any $y \in [0, 1]$, $P(F(X) \le y) = y$. That is precisely the $\mathrm{Unif}(0,1)$ CDF — probability proportional to length on $[0, 1]$.

The mechanism just translates a random point on the $x$-axis into a uniformly random height between $0$ and $1$.
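
A quick simulation can make the claim $P(F(X) \le y) = y$ concrete. The sketch below assumes, purely for illustration, that $X$ is standard normal (the lecture's argument is for a generic $F$); the function names, seed, and trial count are likewise hypothetical.

```python
import math
import random


def normal_cdf(x):
    """CDF of the standard normal, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def fraction_below(y, trials=100_000, seed=0):
    """Estimate P(F(X) <= y): draw X, then read off its height F(X)."""
    rng = random.Random(seed)
    hits = sum(normal_cdf(rng.gauss(0.0, 1.0)) <= y for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    for y in (0.1, 1 / 3, 0.5, 0.9):
        print(y, fraction_below(y))
```

Each estimated fraction should land close to the target height $y$ itself, which is what the $\mathrm{Unif}(0,1)$ CDF predicts.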

Using it to simulate: the logistic distribution

The flip side of universality lets you generate draws from any continuous distribution given a uniform random number, via the inverse CDF. The logistic distribution (the basis of logistic regression, widely used in economics and statistics) has CDF

$$F(x) = \frac{e^x}{1 + e^x}, \qquad x \in \mathbb{R}.$$

(Good practice: verify this is a valid CDF — continuous, increasing, with the right limits.) To simulate it, let $U \sim \mathrm{Unif}(0,1)$ and apply the inverse CDF. Setting $F(x) = u$ and solving for $x$ gives

$$F^{-1}(u) = \log\!\left(\frac{u}{1 - u}\right).$$

The simulation recipe

$F^{-1}(U) = \log\!\big(U / (1 - U)\big)$ is a draw from the logistic distribution. The expression is the log-odds: if $u$ were a probability, $u/(1-u)$ is the odds and its log is the log-odds. You could confirm by computing the CDF of $F^{-1}(U)$ directly, but it is easier to recognize why it works through universality.
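
The recipe can be sketched in a few lines. This is a minimal illustration rather than a production sampler: the function names, seed, and trial count are assumptions, and the guard against $U = 0$ is there only because the standard-library uniform generator can return exactly $0$.

```python
import math
import random


def logistic_cdf(x):
    """F(x) = e^x / (1 + e^x), written in an overflow-safe form."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def draw_logistic(rng):
    """One logistic draw via the inverse CDF: log(U / (1 - U))."""
    u = rng.random()       # uniform on [0, 1)
    while u == 0.0:        # avoid log(0); essentially never triggers
        u = rng.random()
    return math.log(u / (1.0 - u))


def empirical_cdf_at(x, trials=100_000, seed=0):
    """Fraction of simulated draws that are <= x."""
    rng = random.Random(seed)
    return sum(draw_logistic(rng) <= x for _ in range(trials)) / trials


if __name__ == "__main__":
    for x in (-2.0, 0.0, 1.5):
        print(x, empirical_cdf_at(x), logistic_cdf(x))
```

The empirical fraction at each $x$ should be close to $F(x)$, which is the simulation-side statement of universality.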

· · ·

3. Symmetry and Linearity

The problem

Let $X, Y, Z$ be i.i.d. positive random variables. Find $E\!\left(\dfrac{X}{X + Y + Z}\right)$. No PDF, PMF, discrete/continuous assumption, or formula is given — only "i.i.d. positive." Positivity is solely to avoid dividing by zero. So the result must be completely general.

Pitfall

Linearity is for sums only. $E$ of a ratio is not the ratio of expectations; there is no rule turning $E\!\left(\frac{X}{X+Y+Z}\right)$ into $\frac{E(X)}{E(X+Y+Z)}$. The trick is symmetry, not a quotient formula.

Solve by symmetry, then linearity

Because $X, Y, Z$ are i.i.d., relabeling them cannot change the answer. The denominator $X + Y + Z$ is order-independent, so the three expectations are identical:

$$E\!\left(\frac{X}{X+Y+Z}\right) = E\!\left(\frac{Y}{X+Y+Z}\right) = E\!\left(\frac{Z}{X+Y+Z}\right).$$

(This does not assert $X = Y = Z$; it says the three problems have the same structure, hence the same value.) Add all three and use linearity:

$$E\!\left(\frac{X}{X+Y+Z}\right) + E\!\left(\frac{Y}{X+Y+Z}\right) + E\!\left(\frac{Z}{X+Y+Z}\right) = E\!\left(\frac{X+Y+Z}{X+Y+Z}\right) = E(1) = 1.$$

The left side is one quantity counted three times, so $3 \, E\!\left(\frac{X}{X+Y+Z}\right) = 1$, giving

$$E\!\left(\frac{X}{X+Y+Z}\right) = \frac{1}{3}.$$

Sanity check

The answer must lie strictly between $0$ and $1$, since the numerator $X$ is positive and strictly smaller than the denominator $X + Y + Z$; an answer like $4$ would be obviously wrong. And $\tfrac{1}{3}$ matches intuition: among three symmetric contributors, each accounts on average for one third of the total. Intuition is a good guess, but the symmetry-plus-linearity argument is the actual proof.
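
Because the result holds for any i.i.d. positive distribution, a simulation with two quite different choices is a reasonable spot check. The distributions below (uniform on $(0, 1]$ and lognormal), along with all names, the seed, and the trial count, are illustrative assumptions and not part of the problem.

```python
import random


def mean_share(draw, trials=100_000, seed=0):
    """Estimate E(X / (X + Y + Z)) for i.i.d. positive draws from `draw`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x, y, z = draw(rng), draw(rng), draw(rng)
        total += x / (x + y + z)
    return total / trials


def draw_uniform(rng):
    """Uniform on (0, 1]; shifting random() keeps the value strictly positive."""
    return 1.0 - rng.random()


def draw_lognormal(rng):
    """A skewed, strictly positive alternative."""
    return rng.lognormvariate(0.0, 1.0)


if __name__ == "__main__":
    print("uniform  :", mean_share(draw_uniform))
    print("lognormal:", mean_share(draw_lognormal))
```

Both estimates should sit near $\tfrac{1}{3}$ even though the two distributions look nothing alike.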

· · ·

4. LOTUS: Pattern Over Variable Names

LOTUS (the Law of the Unconscious Statistician) looks simple but causes frequent mistakes. The fix is to focus on the pattern — integrate the function against the density of whatever variable you are expanding in — rather than on what the variables happen to be named.

Example

Let $U \sim \mathrm{Unif}(0,1)$, $X = U^2$, and $Y = e^X$. Find $E(Y)$, written as an integral.

Note on exams: if an integral is hard, the prompt will say to leave it as an integral; otherwise compute and fully simplify to a number.

Two correct LOTUS approaches

Approach 1 — expand in $X$

Treat $Y = e^X$ as a function of $X$ and apply LOTUS over $X$'s density:

$$E(Y) = \int_0^1 e^x \, f_X(x) \, dx,$$

where $f_X$ is the PDF of $X$ (the limits are $0$ to $1$ because squaring a number in $[0,1]$ stays in $[0,1]$). This is correct but incomplete until you actually supply $f_X$ — "the PDF of $X$" is not an answer by itself. Getting $f_X$ requires finding the CDF of $X$ and differentiating (see next section).

Approach 2 — expand in $U$

Since $Y = e^X = e^{U^2}$ is also a function of $U$, apply LOTUS over $U$'s density, which is just $1$ on $[0,1]$:

$$E(Y) = \int_0^1 e^{u^2} \, du.$$

This can be written down immediately, with no extra PDF computation. (Solving this integral is a different matter — it resembles the Gaussian integral without the minus sign — but the problem only asks for the integral form.)

Both correct

Both answers earn full credit, provided you actually write out the density you integrate against.
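
Although the problem asks only for the integral form, Approach 2 is easy to evaluate numerically. The sketch below is an illustration with hypothetical names and arbitrary grid size, seed, and trial count. It approximates $\int_0^1 e^{u^2}\,du$ with a midpoint rule and compares it with a direct average of $e^{U^2}$ over simulated uniforms.

```python
import math
import random


def midpoint_integral(g, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))


def lotus_over_u():
    """E(Y) as the integral of e^(u^2) against the Unif(0,1) density, 1."""
    return midpoint_integral(lambda u: math.exp(u * u), 0.0, 1.0)


def monte_carlo_mean(trials=200_000, seed=0):
    """Average of Y = e^(U^2) over simulated uniforms."""
    rng = random.Random(seed)
    return sum(math.exp(rng.random() ** 2) for _ in range(trials)) / trials


if __name__ == "__main__":
    print("LOTUS integral  :", lotus_over_u())
    print("simulation mean :", monte_carlo_mean())
```

The two numbers should agree to a couple of decimal places (the integral is roughly $1.4627$), which is LOTUS in action: the expectation of $Y$ is computed without ever finding the distribution of $Y$.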

The common pitfall

Students mix the two approaches — for instance, letting stray $X$'s appear in a problem where $X$ was never defined. The mindset that "LOTUS is about $X$" because the rule was first stated with a variable named $X$ is the trap. LOTUS is a pattern (function times the appropriate density, integrated or summed), independent of variable names.

· · ·

5. From CDF to PDF

A recurring subskill: to get a PDF of a transformed variable, first find its CDF by reducing the event back to something understood (here, the uniform), then differentiate.

Take $X = U^2$ with $U \sim \mathrm{Unif}(0,1)$, and find $f_X$.

Find the CDF

For $0 \le x \le 1$:

$$F_X(x) = P(U^2 \le x) = P(U \le \sqrt{x}) = \sqrt{x}.$$

(Taking the square root reduces the event back to $U$; no negative branch to worry about, since $U > 0$.)

Differentiate to get the PDF

$$f_X(x) = \frac{d}{dx}\sqrt{x} = \frac{1}{2}\,x^{-1/2}, \qquad 0 < x < 1.$$

Method in one line

Understand what a CDF is, reduce the event back to the uniform we already understand, then differentiate. Substituting this $f_X$ completes Approach 1 above.
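
As a numerical cross-check, the two LOTUS approaches should give the same value once $f_X$ is plugged in. The sketch below uses a simple midpoint rule with hypothetical names and an arbitrary grid size; because $f_X$ blows up at $0$, the rule converges slowly there, so only agreement to about three decimal places should be expected.

```python
import math


def pdf_x(x):
    """Density of X = U^2 on (0, 1): (1/2) * x^(-1/2)."""
    return 0.5 / math.sqrt(x)


def midpoint_integral(g, a, b, n=200_000):
    """Midpoint rule; it never evaluates g at the endpoint x = 0."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))


# Approach 1: integrate e^x against the density of X.
approach_1 = midpoint_integral(lambda x: math.exp(x) * pdf_x(x), 0.0, 1.0)

# Approach 2: integrate e^(u^2) against the Unif(0,1) density, which is 1.
approach_2 = midpoint_integral(lambda u: math.exp(u * u), 0.0, 1.0)

# Sanity check: a valid density integrates to 1 (up to numerical error).
total_mass = midpoint_integral(pdf_x, 0.0, 1.0)

if __name__ == "__main__":
    print(approach_1, approach_2, total_mass)
```

The substitution $x = u^2$, $dx = 2u\,du$ turns the Approach 1 integral into the Approach 2 integral exactly, so any gap between the printed values is purely numerical.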

· · ·

6. Story Proof: Distribution of $n - X$

A quick review of a story (interpretation) proof. Let $X \sim \mathrm{Bin}(n, p)$. Find the distribution of $n - X$.

The PMF method (more work)

$n - X$ is discrete; compute its PMF. For each $k \in \{0, 1, \dots, n\}$:

$$P(n - X = k) = P(X = n - k) = \binom{n}{n-k} p^{n-k} q^{k}, \qquad q = 1 - p.$$

Using $\binom{n}{n-k} = \binom{n}{k}$, rewrite this as

$$P(n - X = k) = \binom{n}{k} q^{k} p^{n-k},$$

which is exactly the $\mathrm{Bin}(n, q)$ PMF, so $n - X \sim \mathrm{Bin}(n, q)$.
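
The identity can also be confirmed term by term on a computer. The short sketch below (function names and the test values of $n$ and $p$ are illustrative) computes $P(n - X = k)$ from the $\mathrm{Bin}(n, p)$ PMF and compares it with the $\mathrm{Bin}(n, q)$ PMF for every $k$.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def pmf_of_n_minus_x(k, n, p):
    """P(n - X = k) = P(X = n - k), computed from the Bin(n, p) PMF."""
    return binom_pmf(n - k, n, p)


def max_gap(n, p):
    """Largest difference between the two PMFs over k = 0, ..., n."""
    q = 1 - p
    return max(
        abs(pmf_of_n_minus_x(k, n, p) - binom_pmf(k, n, q))
        for k in range(n + 1)
    )


if __name__ == "__main__":
    print(max_gap(10, 0.3))
```

Any gap reported is floating-point rounding only, since the two expressions are algebraically identical.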

The story proof (one line)

$$X \sim \mathrm{Bin}(n, p) \;\Longrightarrow\; n - X \sim \mathrm{Bin}(n, q), \quad q = 1 - p.$$

$X$ counts the number of successes in $n$ i.i.d. $\mathrm{Bern}(p)$ trials, so $n - X$ counts the number of failures. Since each trial is either a success or a failure (never both) and you may define which outcome is "success," simply swap the roles: relabel failure as success and success as failure. Immediately $n - X \sim \mathrm{Bin}(n, q)$ — no calculation, just swap success and failure.

Exam tip: don't waste time writing "it is immediately obvious"; one short sentence naming the swap is enough, and conserving time matters.

· · ·

7. Poisson to Exponential: First-Arrival Time

A Poisson example that links a discrete count to a continuous waiting time.

Setup

Suppose the number of emails received in a time interval of length $t$ is

$$N_t \sim \mathrm{Pois}(\lambda t),$$

where $\lambda$ is a rate (e.g., $20$ emails/hour) and $\lambda t$ is the expected count over the interval. Recall that $\lambda t$ is both the mean and the variance of $\mathrm{Pois}(\lambda t)$. Different intervals of the same length yield different counts, so $N_t$ is genuinely random.

Define $T = $ the time of the first email, with the clock starting at $0$. $T$ is continuous (an email can arrive at any real time), connecting the discrete count $N_t$ to a continuous waiting time.

Find the CDF via the complement

Strategy

When finding a probability, ask whether the event or its complement is easier. Here the complement wins: "first email after time $t$" is exactly "zero emails in $[0, t]$."

For $t > 0$:

$$P(T > t) = P(\text{no email in } [0, t]) = P(N_t = 0) = e^{-\lambda t}\frac{(\lambda t)^0}{0!} = e^{-\lambda t}.$$

Therefore the CDF is

$$F_T(t) = P(T \le t) = 1 - e^{-\lambda t}, \qquad t > 0.$$

Differentiate to get the PDF

Result

$$f_T(t) = \lambda e^{-\lambda t}, \qquad t > 0.$$

This is the exponential distribution (to be studied formally later).

The example shows how the discrete Poisson count drives the continuous first-arrival PDF. It is the same find-the-CDF-then-differentiate move as the $U^2$ example, with one extra step (passing to the complement) to bridge discrete and continuous.
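
The formula $P(T > t) = e^{-\lambda t}$ can be checked by simulation without assuming the exponential distribution in advance. The sketch below rests on a stated assumption that is not in the lecture: time is chopped into small slots of width `dt`, and each slot independently contains an email with probability $\lambda \cdot dt$, so the count in $[0, t]$ is Binomial and approximately $\mathrm{Pois}(\lambda t)$ when `dt` is small. All names and parameter values are hypothetical.

```python
import math
import random


def first_arrival(rate, dt, rng):
    """Time of the first email in the slotted model: each slot of width dt
    independently contains an email with probability rate * dt."""
    steps = 1
    while rng.random() >= rate * dt:  # no email in this slot, move on
        steps += 1
    return steps * dt


def survival_estimate(t, rate=2.0, dt=0.002, trials=5_000, seed=0):
    """Estimate P(T > t) from simulated first-arrival times."""
    rng = random.Random(seed)
    late = sum(first_arrival(rate, dt, rng) > t for _ in range(trials))
    return late / trials


if __name__ == "__main__":
    for t in (0.5, 1.0):
        print(t, survival_estimate(t), math.exp(-2.0 * t))
```

The estimates should be close to $e^{-\lambda t}$, with small discrepancies coming from both sampling noise and the slot approximation.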

· · ·

8. Closing Reminder: Three Classes of Objects

Keep three things distinct; conflating them causes much of the trouble in this course.

Distribution

The blueprint for creating a random variable (e.g., a CDF). The "random house" blueprint.

Random variable

The random quantity itself. The "random house."

Constant

A single fixed value. A specific, fixed house.

A distribution is the plan, a random variable is the random quantity built from that plan, and a constant is one concrete number. Mixing them up is a frequent source of error.