Lecture 20: Multinomial and Cauchy

Harvard Statistics 110 (Joe Blitzstein)
Watch on YouTube

1. Expected Absolute Difference of Two Normals

Last lecture computed $E|U_1 - U_2|$ for two iid $\text{Unif}(0,1)$ random variables using a 2D LOTUS — a double integral, the two-dimensional analog of the single-integral LOTUS. The natural follow-up: compute $E|Z_1 - Z_2|$ where $Z_1, Z_2$ are iid standard normal.

Why not just grind the 2D integral

Because $Z_1, Z_2$ are independent, their joint PDF factors into the product of the two marginal PDFs, so the 2D LOTUS integral is writable and (with effort) doable. But that is the wrong instinct. Before launching into a two-dimensional integral, stop and use the structure of the problem.

Key idea

The difference from the uniform case: we never studied the distribution of the difference of two uniforms, but the difference of two normals is something we understand well. So simplify first instead of integrating.

Simplify via the distribution of the difference

A sum or difference of independent normals is normal (proven in the next section), with means combining linearly and variances always adding. Therefore $Z_1 - Z_2$ has mean $0 - 0 = 0$ and variance $1 + 1 = 2$:

$$Z_1 - Z_2 \sim \mathcal{N}(0, 2)$$

Now read $\mathcal{N}(0,2)$ in location-scale terms: it is $\sqrt{2}\,Z$ for a single standard normal $Z$. So the problem collapses from a double integral to a one-dimensional LOTUS:

$$E|Z_1 - Z_2| = \sqrt{2}\,E|Z|$$

The one-dimensional LOTUS

$$E|Z| = \int_{-\infty}^{\infty} |z|\,\frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}\,dz$$

The integrand is even (replacing $z$ by $-z$ leaves it unchanged), so double the integral over $[0, \infty)$ and drop the absolute value:

$$E|Z| = 2\int_{0}^{\infty} z\,\frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}\,dz$$

This is an easy $u$-substitution ($u = z^2/2$), giving $E|Z| = \sqrt{2/\pi}$. Hence:

$$E|Z_1 - Z_2| = \sqrt{2}\cdot\sqrt{\tfrac{2}{\pi}} = \frac{2}{\sqrt{\pi}}$$

The lesson: a function of two variables does not force a 2D LOTUS. Recognizing the difference as a one-dimensional normal turns a double integral into a trivial one.

· · ·

2. Sum of Independent Normals Is Normal (MGF Proof)

We previously asserted that sums of independent normals are normal but had not proven it. With MGFs the proof is short.

Theorem — Sum of independent normals

If $X \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $Y \sim \mathcal{N}(\mu_2, \sigma_2^2)$ are independent, then:

$$X + Y \sim \mathcal{N}(\mu_1 + \mu_2,\; \sigma_1^2 + \sigma_2^2)$$

Means add by linearity; variances add. For a difference $X - Y$, treat it as $X + (-Y)$: means subtract, but variances still add (never subtract).

Proof

The MGF of any $\mathcal{N}(\mu, \sigma^2)$ follows from the standard-normal MGF via the transformation $\mu + \sigma Z$:

$$M(t) = \exp\!\left(\mu t + \tfrac{1}{2}\sigma^2 t^2\right)$$

Because $X$ and $Y$ are independent, the MGF of the sum is the product of the MGFs:

$$M_{X+Y}(t) = \exp\!\left(\mu_1 t + \tfrac{1}{2}\sigma_1^2 t^2\right)\exp\!\left(\mu_2 t + \tfrac{1}{2}\sigma_2^2 t^2\right)$$

Combine into a single exponential and factor:

$$M_{X+Y}(t) = \exp\!\left((\mu_1 + \mu_2)\,t + \tfrac{1}{2}(\sigma_1^2 + \sigma_2^2)\,t^2\right)$$

Conclusion

This is exactly the MGF of $\mathcal{N}(\mu_1 + \mu_2,\, \sigma_1^2 + \sigma_2^2)$. Since the MGF determines the distribution, the proof is complete. Independence is essential — the multiply-the-MGFs step fails without it.

· · ·

3. The Multinomial Distribution

The multinomial is by far the most important discrete multivariate distribution. A multivariate distribution is simply a joint distribution for more than one random variable at once. The normal, Poisson, and geometric are univariate (one random variable each). For this course there are only two named multivariate distributions to know: the multinomial (here) and the multivariate normal (covered later).

Story and parameters

The multinomial, abbreviated $\text{Mult}(n, \mathbf{p})$, generalizes the binomial: where the binomial sorts each trial into two categories (success/failure), the multinomial sorts into $k$ categories. It has parameters $n$ and $\mathbf{p}$, except $\mathbf{p}$ is now a vector:

$$\mathbf{p} = (p_1, \ldots, p_k), \qquad p_j \ge 0, \quad \sum_{j=1}^{k} p_j = 1$$
Story of the multinomial

$\mathbf{X} = (X_1, \ldots, X_k) \sim \text{Mult}(n, \mathbf{p})$ means:

  • We have $n$ objects (trials, people, anything) sorted independently into $k$ categories.
  • Each object lands in category $j$ with probability $p_j$.
  • $X_j$ is the count of objects in category $j$.

When $k = 2$ this reduces to the binomial.

Joint PMF

We want $P(X_1 = n_1, \ldots, X_k = n_k)$, derived just like the binomial PMF. Any one specific sequence of category labels with the desired counts has probability $p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}$. (Example with $k = 3$: a sequence like $2,3,1,1,2,\ldots$ contributes $p_1$ raised to the number of $1$'s, $p_2$ to the number of $2$'s, $p_3$ to the number of $3$'s.)

Many sequences give the same counts, so we multiply by the number of distinct arrangements — exactly the "permutations of a multiset" count from the MISSISSIPPI / PEPPER problems:

$$P(X_1 = n_1, \ldots, X_k = n_k) = \frac{n!}{n_1!\,n_2!\cdots n_k!}\; p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}$$

valid when $n_1 + n_2 + \cdots + n_k = n$, and $0$ otherwise (the counts must total $n$, since every object is in exactly one category).

Marginal distribution

What is the marginal distribution of a single component $X_j$? The brute-force route sums the joint PMF over everything we do not want — a lot of algebra. The right route is the story: each object either is in category $j$ or is not. Define "success" as landing in category $j$; then there are $n$ independent trials with success probability $p_j$.

$$X_j \sim \text{Bin}(n, p_j)$$

This proves itself directly from the story, and immediately gives the mean and variance by reusing the binomial results — no new calculation:

$$E(X_j) = n p_j \qquad \operatorname{Var}(X_j) = n p_j (1 - p_j)$$

Lumping property

Take $k = 10$ (imagine a country with 10 political parties, $n$ independent people, $X_j$ = number in party $j$). If two parties dominate and the rest are minor, we may merge the minor categories:

$$Y = (X_1,\; X_2,\; X_3 + X_4 + \cdots + X_{10})$$

Then $Y$ is multinomial with the same $n$ and probability vector $(p_1,\, p_2,\, p_3 + p_4 + \cdots + p_{10})$ — just add the lumped probabilities. No algebra required: lumping categories is the same multinomial story with coarser categories. The only requirement is that each object still lands in exactly one category.

Conditional distribution

Suppose $\mathbf{X} \sim \text{Mult}(n, \mathbf{p})$ and we learn $X_1 = n_1$. The conditional joint distribution of $(X_2, \ldots, X_k)$ is still multinomial, but with carefully adjusted parameters:

For each remaining category, the conditional probability is that of being in category $j$ given not in category 1. Since "in category $j$" already implies "not in category 1," the intersection is redundant:

$$p_j' = \frac{p_j}{1 - p_1} = \frac{p_j}{p_2 + p_3 + \cdots + p_k}$$
Conclusion

Conditionally, $(X_2, \ldots, X_k) \sim \text{Mult}\!\big(n - n_1,\, (p_2', \ldots, p_k')\big)$. The relative proportions among the remaining categories are unchanged — we only rescale so they sum to 1.

· · ·

4. The Cauchy Distribution

The "Cauchy interview problem": find the PDF of a Cauchy random variable. Blitzstein calls it the interview problem because, oddly, it really does get asked in interviews — presumably to test comfort with joint PDFs.

Definition — Cauchy distribution

$$T = \frac{X}{Y}, \qquad X, Y \text{ iid standard normal}$$

A ratio of two iid standard normals. Ratios arise naturally, so this is a useful — but notorious — distribution.

Why it is "evil"

The Cauchy has several pathological properties (not proven here):

  • It has no expected value — try to compute $E(T)$ and it blows up. Consequently no variance either.
  • The truly strange property: if $T_1, \ldots, T_n$ are iid Cauchy, their average is again Cauchy — the same distribution. Averaging a million iid Cauchys gives back a Cauchy.

This subverts the usual Law of Large Numbers intuition, where averaging many iid observations converges to the mean. Here there is no mean, and naive averaging gets nowhere — you need other techniques to work with Cauchy data.

Finding the PDF via the CDF

Goal: find the PDF of $T = X/Y$. Strategy: find the CDF first, then differentiate. We need $P(X/Y \le t)$. Multiplying by $Y$ is dangerous because $Y$ can be negative and would flip the inequality. Symmetry rescues us: a standard normal times $-1$ is still standard normal, in both numerator and denominator, so we can absorb the signs:

$$P\!\left(\tfrac{X}{Y} \le t\right) = P\big(X \le t\,|Y|\big)$$

Now no inequality-flipping concerns arise. Write this as a double integral of the joint PDF over the region $X \le t|Y|$. Since $X, Y$ are iid standard normal, the joint PDF is the product of the two standard-normal PDFs (the $e^{-y^2/2}$ factor does not depend on $x$, so it is pulled outside the inner integral):

$$F(t) = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\,e^{-y^2/2}\left[\int_{-\infty}^{t|y|} \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\,dx\right] dy$$

Recognizing the inner integral

In one sense the inner $x$-integral is the integral you "cannot do" (the normal PDF has no elementary antiderivative). In another sense you already know it: it is the standard normal CDF $\Phi$ evaluated at the upper limit. The integrand is even in $y$ (both $y^2$ and $|y|$ are even), so double it and integrate over $[0, \infty)$:

$$F(t) = \frac{2}{\sqrt{2\pi}}\int_{0}^{\infty} e^{-y^2/2}\,\Phi(t y)\,dy$$

Here panic might set in: we are being asked to integrate something involving the intractable $\Phi$.

The rescue: differentiate

The interview asks for the PDF, not the CDF — and the PDF is the derivative of the CDF. Under mild regularity conditions (easily satisfied: the integrand is smooth and decays fast), we may swap the derivative and the integral. Differentiating with respect to $t$ ($y$ is a dummy variable; $e^{-y^2/2}$ is constant in $t$), the chain rule on $\Phi(ty)$ brings down a factor of $y$ times the standard normal PDF at $ty$:

$$\frac{d}{dt}\,\Phi(t y) = y\cdot\frac{1}{\sqrt{2\pi}}\,e^{-t^2 y^2/2}$$

So:

$$f(t) = \frac{2}{\sqrt{2\pi}}\int_{0}^{\infty} e^{-y^2/2}\; y\,\frac{1}{\sqrt{2\pi}}\,e^{-t^2 y^2/2}\,dy$$

Doing the final integral

The factors $\frac{2}{\sqrt{2\pi}}\cdot\frac{1}{\sqrt{2\pi}} = \frac{1}{\pi}$, and the two exponentials combine:

$$f(t) = \frac{1}{\pi}\int_{0}^{\infty} y\,e^{-(1 + t^2)y^2/2}\,dy$$

Substitute $u = (1 + t^2)y^2/2$, so $du = (1 + t^2)\,y\,dy$. Multiplying and dividing by $(1 + t^2)$, the integral becomes $\frac{1}{1 + t^2}\int_0^\infty e^{-u}\,du = \frac{1}{1 + t^2}$. Therefore:

$$\boxed{\,f(t) = \dfrac{1}{\pi\,(1 + t^2)}, \quad \text{for all real } t\,}$$

That is the Cauchy PDF. Integrating it back gives an arctangent, and evaluating confirms it integrates to $1$ — a valid PDF.

Alternative: Law of Total Probability

A second method conditions on $Y$ instead of writing the full double integral directly:

$$P\big(X \le t|Y|\big) = \int_{-\infty}^{\infty} P\big(X \le t|Y| \mid Y = y\big)\, f(y)\,dy$$

This is the continuous Law of Total Probability — integrating over the conditioning variable, with $f(y)$ the standard normal PDF, in place of the discrete sum over a partition. Because $X$ is independent of $Y$, conditioning on $Y = y$ lets us drop the condition and plug in:

$$P\big(X \le t|y| \mid Y = y\big) = \Phi(t|y|)$$

giving the same integrand $\int \Phi(t|y|)\,f(y)\,dy$, so this route reduces to the same calculation. Independence is what allows dropping the condition after substituting $Y = y$. (A third method exists as well; with a common interview question it pays to know several.)