Lecture 30: Chi-Square, Student-t, Multivariate Normal

Harvard Statistics 110 (Joe Blitzstein)

1. Offshoots of the Normal

Today's distributions are all offshoots of the normal: their importance is inherited from the fact that the normal is important. Each is defined in terms of normal random variables, and each story is tied to the normal's story — we just do different things with normals. The Central Limit Theorem (last lecture) is one reason the normal matters, but there are many others. If you do something natural with normals, there's a good chance the result is also important.

The agenda:

Chi-square distribution (univariate, one parameter)
Student-t distribution (univariate, one parameter)
Multivariate normal — the most important multivariate continuous distribution (the multinomial being the most important multivariate discrete one)

After this we start Markov chains.

· · ·

2. The Chi-Square Distribution

The chi-square (written with the Greek letter $\chi$, which looks like a fancy $x$) has one parameter $n$, called the degrees of freedom. It is extremely famous in statistics — chi-square tests appear everywhere in psychology, sociology, and far beyond. We won't derive chi-square tests in this course, but to really understand what such a test does (rather than just letting a computer run it), you have to understand the chi-square distribution.

Definition

Rather than define it by writing down a PDF — which would tell you nothing about where it comes from — we define it by its relationship to the normal.

Chi-Square($n$)

Let $Z_1, Z_2, \ldots, Z_n$ be IID standard normal. Then

$$V = Z_1^2 + Z_2^2 + \cdots + Z_n^2 \;\sim\; \chi^2_n$$

It is simply the sum of squares of $n$ IID standard normals.

This is exactly why it appears all over statistics: many methods involve adding up squares of things, and when those things are IID standard normal, you get a chi-square.

Chi-square is a special case of the gamma

The key fact connecting chi-square to what we already know:

$$\chi^2_1 = \text{Gamma}\!\left(\tfrac12, \tfrac12\right)$$

That is, the square of a single standard normal is a $\text{Gamma}(1/2, 1/2)$. This is not obvious by inspection, but it is a routine change-of-variables calculation — an easier version of the homework problem where you found the PDF of a standard normal raised to the fourth power. The only subtlety is that $y = x^2$ is decreasing then increasing, so you cannot blindly apply the one-to-one change-of-variables formula; handle the two branches carefully and the $\text{Gamma}(1/2, 1/2)$ density falls out.
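For reference, the two-branch calculation can be sketched as follows, writing $Y = Z^2$ and $\varphi$ for the standard normal PDF (both symbols are introduced here only for illustration). For $y > 0$, each of the branches $x = \sqrt{y}$ and $x = -\sqrt{y}$ contributes a term with Jacobian factor $1/(2\sqrt{y})$:

$$f_Y(y) = \varphi\!\left(\sqrt{y}\right)\frac{1}{2\sqrt{y}} + \varphi\!\left(-\sqrt{y}\right)\frac{1}{2\sqrt{y}} = \frac{1}{\sqrt{2\pi y}}\,e^{-y/2}, \qquad y > 0.$$

Since $\Gamma(1/2) = \sqrt{\pi}$, this agrees with the $\text{Gamma}(1/2, 1/2)$ density $\dfrac{(1/2)^{1/2}}{\Gamma(1/2)}\,y^{-1/2}\,e^{-y/2}$.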

To extend to general $n$, use the additive property of the gamma. Recall: independent $\text{Gamma}(a, \lambda)$ and $\text{Gamma}(b, \lambda)$ with the same rate $\lambda$ sum to $\text{Gamma}(a + b, \lambda)$. A $\chi^2_n$ is a sum of $n$ IID $\text{Gamma}(1/2, 1/2)$ terms — all with rate $1/2$ — so:

$$\chi^2_n = \text{Gamma}\!\left(\tfrac{n}{2}, \tfrac12\right)$$
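A quick simulation can serve as a sanity check on this identification. The sketch below is not part of the lecture; it uses only Python's standard library, and the function names are made up for illustration. It builds $\chi^2_n$ draws directly from the definition and compares the sample mean and variance with the $\text{Gamma}(n/2, 1/2)$ values, which are $n$ and $2n$.

```python
import random
import statistics

def chi_square_draw(n, rng):
    """One draw of V = Z_1^2 + ... + Z_n^2 with Z_i IID standard normal."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))

def chi_square_summary(n, num_draws=50_000, seed=0):
    """Sample mean and sample variance of simulated chi-square(n) draws."""
    rng = random.Random(seed)
    draws = [chi_square_draw(n, rng) for _ in range(num_draws)]
    return statistics.fmean(draws), statistics.variance(draws)

# Gamma(n/2, 1/2) has mean (n/2)/(1/2) = n and variance (n/2)/(1/2)^2 = 2n.
mean_5, var_5 = chi_square_summary(5)
print(f"n=5: sample mean {mean_5:.2f} (theory 5), sample variance {var_5:.2f} (theory 10)")
```

Matching two moments does not prove the distributions agree; it only confirms that the simulation is consistent with the gamma result.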

Key takeaway

The chi-square is not really a new distribution — it is a new name for a gamma we already knew. It earns its own name because sums of squares are so ubiquitous in statistics. The payoff: everything we know about the gamma (its link to the beta, how to add gammas, how to get its moments) transfers to the chi-square for free, with no rederivation.

This is the penultimate famous univariate distribution in the course.

· · ·

3. The Student-t Distribution

Why "Student"? Gosset and Guinness

In 1908 a statistician named William Sealy Gosset introduced this distribution in a very influential paper, under the pseudonym Student. Gosset worked as a master brewer for Guinness, the beer company.

A common belief is that he hid behind a pseudonym so Guinness wouldn't know. In fact Guinness was supportive; the real reason was that Guinness did not want rival breweries to learn they had a statistician as a secret competitive weapon. The letter $t$ came later — it simply became standard to use $t$ for a certain statistic (the $t$ statistic, $t$ test), and the name stuck.

The mathematical basis behind $t$ tests, one-sample and two-sample, is the $t$ distribution. We focus on the distribution itself, not the testing procedures.

Definition

As with chi-square, we define the $t$ by relating it back to the normal rather than writing its (complicated, uninformative) PDF.

$t_n$ — Student-t with $n$ degrees of freedom

$$T = \frac{Z}{\sqrt{V / n}}, \qquad Z \sim \mathcal{N}(0,1), \quad V \sim \chi^2_n, \quad Z \perp V$$

With $Z$ and $V$ independent, $T \sim t_n$.

The parameter $n$ is again called the degrees of freedom. The name is mysterious at first and genuinely deep to interpret, but for now it is just the parameter inherited from the chi-square: the number of squared normals we added up.

Properties from the representation

We reason directly from $T = Z / \sqrt{V/n}$ rather than the PDF.

Symmetric about zero

Multiplying $T$ by $-1$ leaves its distribution unchanged: the numerator $Z$ is symmetric about $0$ (so $-Z$ has the same distribution as $Z$), and the denominator is an independent nonnegative quantity that is untouched. So $t_n$ is symmetric about $0$.

The case $n = 1$ is the Cauchy

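With $n = 1$, the denominator is $\sqrt{Z_1^2} = |Z_1|$, the absolute value of a standard normal. The absolute value does not affect the distribution of the ratio: $Z/Z_1$ equals $Z/|Z_1|$ times the sign of $Z_1$, and that sign is a random $\pm 1$ independent of $Z/|Z_1|$, which is itself symmetric about $0$. So $T_1$ has the same distribution as the ratio of two independent standard normals, which is exactly the Cauchy distribution (the "evil Cauchy" from the interview problem, whose PDF we derived).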

$$t_1 = \text{Cauchy}$$

In particular, the Cauchy has no finite mean, so $t_1$ has no mean — the expectation does not exist.
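One way to probe the $t_1$ = Cauchy claim numerically without touching the nonexistent mean is through a quantile. A standard Cauchy has CDF $\tfrac12 + \arctan(x)/\pi$, so it puts probability exactly $1/2$ on $[-1, 1]$. The sketch below uses only the standard library, and its function names are hypothetical. It draws $T = Z/|Z_1|$ from the representation and estimates that probability.

```python
import random

def t1_draw(rng):
    """One draw of T = Z / sqrt(V / 1) with V = Z_1^2, i.e. Z / |Z_1|."""
    z = rng.gauss(0.0, 1.0)
    z1 = rng.gauss(0.0, 1.0)
    return z / abs(z1)

def fraction_within_one(num_draws=100_000, seed=0):
    """Estimate P(|T| <= 1) for T ~ t_1 by simulation."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(num_draws) if abs(t1_draw(rng)) <= 1.0)
    return hits / num_draws

# For a standard Cauchy, P(|T| <= 1) = 2 * arctan(1) / pi = 1/2.
print(f"estimated P(|T| <= 1) = {fraction_within_one():.3f} (Cauchy value: 0.5)")
```

A sample average of the draws themselves would not settle down as the sample grows, which is the practical face of the mean not existing.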

Mean is zero for $n \ge 2$

For $n \ge 2$, the mean exists and equals $0$. Symmetry is one quick way to see it. A direct calculation also works, using independence:

$$E(T) = E(Z) \cdot E\!\left(\frac{1}{\sqrt{V/n}}\right) = 0$$

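The factorization is valid because $Z \perp V$: $Z$ is then independent of every function of $V$, including $1/\sqrt{V/n}$, and the expectation of a product of independent random variables is the product of their expectations. (Being merely uncorrelated would not be enough for this step.) Since $E(Z) = 0$, the product is $0$, provided the second factor is finite, which it is for $n \ge 2$. The argument breaks at $n = 1$ because there the second expectation is infinite; you cannot claim "something that doesn't exist times zero equals zero."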

Limited moments; no MGF

The $t$ has no MGF, and not all moments exist: $t_1$ has no first moment, $t_2$ no second, $t_3$ no third, and so on. For the moments that do exist, the representation makes them computable — raise $T$ to a power $p$, and the numerator is a power of $Z$ (moments known) while the denominator is a constant times a power of a gamma random variable, handled with LOTUS. Odd moments are $0$ by symmetry (when they exist).
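As one worked instance of this recipe, take $p = 2$. The sketch relies on the gamma-integral result $E(1/V) = 1/(n-2)$ for $V \sim \text{Gamma}(n/2, 1/2)$ with $n > 2$, which is a standard fact but is not derived in these notes:

$$E(T^2) = E(Z^2)\cdot n\,E\!\left(\frac{1}{V}\right) = 1 \cdot n \cdot \frac{1}{n-2} = \frac{n}{n-2}, \qquad n > 2.$$

Since the mean is $0$, this is also the variance of $t_n$: slightly above the standard normal's variance of $1$, and approaching $1$ as $n$ grows. The integral for $E(1/V)$ diverges when $n \le 2$, consistent with $t_2$ having no second moment.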

Aside: even moments of the normal via chi-square

$$E(Z^2)=1,\quad E(Z^4)=3,\quad E(Z^6)=3\cdot 5,\quad E(Z^8)=3\cdot 5\cdot 7,\;\ldots$$

These are skip-factorials (products of consecutive odd numbers), proved before with MGFs. A second route uses the chi-square link. For the $2N$-th moment of a standard normal ($N$ a positive integer):

$$E\!\left(Z^{2N}\right) = E\!\left((Z^2)^N\right),$$

and recognizing $Z^2 = \chi^2_1 = \text{Gamma}(1/2, 1/2)$, this is just the $N$-th moment of a $\text{Gamma}(1/2, 1/2)$, obtained directly with LOTUS and the gamma integral. The answer comes out in terms of the gamma function, and gamma-function identities show it equals the product-of-odd-numbers form above.
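The gamma-function route is easy to check numerically. The sketch below assumes the standard formula $E(Y^N) = \Gamma(a + N) / \big(\Gamma(a)\,\lambda^N\big)$ for $Y \sim \text{Gamma}(a, \lambda)$, which is what LOTUS and the gamma integral give; the helper names are made up for illustration. It compares that formula at $a = \lambda = 1/2$ with the product of odd numbers.

```python
import math

def gamma_moment(shape, rate, power):
    """E(Y^power) for Y ~ Gamma(shape, rate), from the gamma integral."""
    return math.gamma(shape + power) / (math.gamma(shape) * rate ** power)

def odd_product(big_n):
    """The product 1 * 3 * 5 * ... * (2*big_n - 1)."""
    return math.prod(range(1, 2 * big_n, 2))

# E(Z^(2N)) should equal the N-th moment of Gamma(1/2, 1/2).
for big_n in range(1, 6):
    via_gamma = gamma_moment(0.5, 0.5, big_n)
    print(f"N={big_n}: gamma formula {via_gamma:.6f}, odd product {odd_product(big_n)}")
```

The two columns should agree up to floating-point rounding, for example $3$ at $N = 2$ and $3 \cdot 5 \cdot 7 = 105$ at $N = 4$.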

Heavier tails than the normal

A major reason the $t$ is famous: it looks approximately normal but has heavier tails — extreme values are relatively more likely. See it by plotting the density or generating values, or measure it via kurtosis (on the homework just turned in). The effect is strongest for small $n$. The Cauchy ($n=1$) has very heavy tails: its density $\propto 1/(1 + x^2)$ decays like $1/x^2$. Compare the normal density $\propto e^{-x^2/2}$, which decays far faster.
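To put numbers on the tail comparison, the sketch below evaluates two-sided tail probabilities in closed form for the two extremes, the standard normal and the Cauchy. It uses the standard formulas $P(|Z| > x) = \text{erfc}(x/\sqrt{2})$ and the Cauchy CDF $\tfrac12 + \arctan(x)/\pi$; neither formula is derived in these notes, and the function names are illustrative.

```python
import math

def normal_two_sided_tail(x):
    """P(|Z| > x) for Z standard normal, via the complementary error function."""
    return math.erfc(x / math.sqrt(2.0))

def cauchy_two_sided_tail(x):
    """P(|T| > x) for T standard Cauchy, from the CDF 1/2 + arctan(x)/pi."""
    return 1.0 - 2.0 * math.atan(x) / math.pi

for x in (2.0, 3.0, 5.0):
    print(f"x={x}: normal {normal_two_sided_tail(x):.2e}, Cauchy {cauchy_two_sided_tail(x):.2e}")
```

At $x = 3$ the normal tail is about $0.0027$ while the Cauchy tail is about $0.205$, so a "three-sigma" sized value is rare for the normal and routine for $t_1$.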

Convergence to the normal as $n$ grows

For large $n$ (say $n = 30, 40, 50$ or more), $t_n$ looks very much like a standard normal. Precisely: as $n \to \infty$, the distribution of $t_n$ (CDF or PDF) converges to the standard normal. The cleanest proof uses the Law of Large Numbers, not a big calculation.

$$t_n \;\xrightarrow{\;d\;}\; \mathcal{N}(0,1) \quad \text{as } n \to \infty$$

Construct the sequence cleverly. Let $Z, Z_1, Z_2, \ldots$ be IID standard normal, and set

$$V_n = Z_1^2 + \cdots + Z_n^2 \;(\sim \chi^2_n), \qquad T_n = \frac{Z}{\sqrt{V_n / n}}.$$

Reusing the same $Z$ for every $n$ is legitimate — we only care about the distribution of $T_n$, and any construction with the correct distribution will do.

Now $V_n / n$ is the sample mean of the IID squared normals $Z_1^2, \ldots, Z_n^2$. By the Law of Large Numbers it converges with probability $1$ to $E(Z_1^2) = 1$ (the variance of a standard normal). Taking square roots, $\sqrt{V_n / n} \to 1$ with probability $1$. Therefore

$$T_n = \frac{Z}{\sqrt{V_n / n}} \;\longrightarrow\; Z \quad \text{(with probability 1).}$$

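Convergence with probability $1$ implies convergence in distribution, and the distribution of $T_n$ is $t_n$ no matter how $T_n$ is constructed. So although a different construction would not give the same pointwise statement, the convergence in distribution above holds regardless.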
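The LLN step can be watched directly. The following sketch (standard library only, names hypothetical) follows a single sequence $Z_1, Z_2, \ldots$ and records the denominator $\sqrt{V_n/n}$ at a few values of $n$, mirroring the construction in the proof.

```python
import math
import random

def denominator_path(checkpoints, seed=0):
    """Track sqrt(V_n / n) along one sequence Z_1, Z_2, ... at the given n values."""
    rng = random.Random(seed)
    wanted = set(checkpoints)
    running_sum = 0.0  # V_n = Z_1^2 + ... + Z_n^2, updated one term at a time
    path = {}
    for n in range(1, max(wanted) + 1):
        running_sum += rng.gauss(0.0, 1.0) ** 2
        if n in wanted:
            path[n] = math.sqrt(running_sum / n)
    return path

path = denominator_path([10, 100, 1_000, 100_000])
for n, value in path.items():
    print(f"n={n}: sqrt(V_n / n) = {value:.4f}")
```

The printed values should drift toward $1$ as $n$ grows, so dividing a fixed $Z$ by them changes it less and less.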

Intuitively, for large degrees of freedom the denominator is essentially $1$ (by the LLN), so all that matters is the normal $Z$ on top. For small $n$ the $t$ still resembles the normal in shape but carries much heavier tails. That completes the famous univariate named distributions for the course.

· · ·

4. The Multivariate Normal

The last famous distribution is the multivariate normal (MVN). We want to extend the many nice properties of the univariate normal to a random vector rather than a single random variable.

One trivial way to build a multivariate distribution is to stack IID random variables into a vector — then the joint PDF is the product of the marginals. But the interesting case is when there is correlation between components. For the normal there is a very nice standard way to do this, with several equivalent definitions.

Definition via linear combinations

Multivariate Normal

A random vector $\mathbf{X} = (X_1, X_2, \ldots, X_k)$ is multivariate normal if

$$t_1 X_1 + t_2 X_2 + \cdots + t_k X_k \;\text{ is (univariate) normal}$$

for every choice of constants $t_1, \ldots, t_k$.

A linear combination just means adding up the variables with arbitrary constants in front. If even a single choice of the $t$'s produces a non-normal combination, the vector is not MVN; if every linear combination is normal, it is MVN. (A constant counts as a degenerate normal with variance $0$, which covers the case where all the $t$'s are $0$.) The definition collapses the $k$-dimensional object back to a familiar one-dimensional random variable.

Example: linear combinations of IID normals

$$\big(\,Z + 2W,\;\; 3Z + 5W\,\big) \;\text{ is MVN}$$

Let $Z, W$ be IID standard normal. (The constants are arbitrary; any $a, b, c, d$ would work.) Take arbitrary $s, t$ and form the combination:

$$s(Z + 2W) + t(3Z + 5W) = (s + 3t)\,Z + (2s + 5t)\,W.$$

This is a linear combination of the independent normals $Z$ and $W$, and a sum of independent normals is normal (shown earlier with MGFs). Since this holds for all $s, t$, the vector is MVN.
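The regrouping above also predicts the variance of every such combination: with $Z, W$ independent and each of variance $1$, the combination has variance $(s + 3t)^2 + (2s + 5t)^2$. The sketch below (standard library only, illustrative names) checks that prediction by simulation. It tests the algebra only; it does not by itself demonstrate normality.

```python
import random
import statistics

def combo_variance(s, t, num_draws=100_000, seed=0):
    """Sample variance of s*(Z + 2W) + t*(3Z + 5W) for Z, W IID standard normal."""
    rng = random.Random(seed)
    values = []
    for _ in range(num_draws):
        z, w = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        values.append(s * (z + 2 * w) + t * (3 * z + 5 * w))
    return statistics.variance(values)

def predicted_variance(s, t):
    """Variance of (s + 3t) Z + (2s + 5t) W for independent Z, W with variance 1."""
    return (s + 3 * t) ** 2 + (2 * s + 5 * t) ** 2

s, t = 1.0, -1.0
print(f"simulated {combo_variance(s, t):.2f}, predicted {predicted_variance(s, t):.2f}")
```

For $s = 1$, $t = -1$ the combination is $-2Z - 3W$, so the predicted variance is $4 + 9 = 13$.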

General class

Start with IID normals, then form a vector of linear combinations of them. The result is always multivariate normal.

Non-example: a normal and a sign-flipped normal

Let $Z$ be standard normal and $S$ a random sign ($S = +1$ or $-1$ with equal probability), independent of $Z$. Consider the pair $(Z, SZ)$.

Marginally, both are standard normal: multiplying a normal by an independent random sign leaves its distribution unchanged, by symmetry (related to an earlier strategic-practice problem). But the pair is not MVN. Take the combination $Z + SZ$:

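$$Z + SZ = \begin{cases} 0 & \text{if } S = -1, \\ 2Z & \text{if } S = +1. \end{cases}$$

So half the time the combination is exactly $0$, and half the time it is the continuous random variable $2Z$: a discrete/continuous mixture.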

No normal distribution puts probability $1/2$ on a single point, so $Z + SZ$ is not normal, hence $(Z, SZ)$ is not MVN. This is exactly the pathology the linear-combination definition rules out: defining MVN as merely "string together some normals" would wrongly admit this nasty example.
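The point mass at $0$ shows up immediately in simulation. This is a minimal sketch using only the standard library, with a made-up function name; it counts how often $Z + SZ$ is exactly zero.

```python
import random

def fraction_exactly_zero(num_draws=100_000, seed=0):
    """Fraction of draws of Z + S*Z equal to exactly 0, with S an independent random sign."""
    rng = random.Random(seed)
    zeros = 0
    for _ in range(num_draws):
        z = rng.gauss(0.0, 1.0)
        s = rng.choice((-1, 1))
        # When s == -1 the sum is z - z, which is exactly 0.0 in floating point.
        if z + s * z == 0.0:
            zeros += 1
    return zeros / num_draws

print(f"fraction of exact zeros: {fraction_exactly_zero():.3f} (theory: 0.5)")
```

For a genuinely normal random variable with positive variance, the same count would be $0$ in essentially every run.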

The MGF of the multivariate normal

We don't need the MVN's PDF for this course; the MGF is simpler and more useful. For a random vector $\mathbf{X}$, the joint MGF uses a constant for each component:

$$M(\mathbf{t}) = E\!\left(e^{\mathbf{t}' \mathbf{X}}\right) = E\!\left(e^{t_1 X_1 + \cdots + t_k X_k}\right)$$

where $\mathbf{t}' \mathbf{X}$ is the dot product of $\mathbf{t}$ and $\mathbf{X}$. This looks complicated, but the MVN definition saves us: the exponent is a linear combination of the components, hence a univariate normal. So $M(\mathbf{t})$ is just $E(e^{\text{one normal}})$, which we know from the univariate MGF.

Recall: if $X \sim \mathcal{N}(\mu, \sigma^2)$, then

$$M_X(t) = \exp\!\left(t\mu + \tfrac12 t^2 \sigma^2\right)$$

— "$e$ to the (mean of the exponent) plus $\tfrac12$(variance of the exponent)." Applying this to our normal linear combination, with $\mu_j = E(X_j)$:

MVN MGF

$$M(\mathbf{t}) = \exp\!\left(\sum_{j=1}^{k} t_j \mu_j \;+\; \tfrac12 \,\text{Var}\!\left(t_1 X_1 + \cdots + t_k X_k\right)\right)$$

The mean part is $\sum t_j \mu_j$. The variance part is expanded the usual way: if the components are independent the variances simply add; otherwise covariance terms appear. As in the univariate case the MGF determines the distribution, so an MVN is completely characterized by its component means and covariances.
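Expanding the variance term makes the dependence on the covariances explicit. In the sketch below, $\boldsymbol{\mu}$ denotes the vector of means and $\Sigma$ the covariance matrix with entries $\Sigma_{ij} = \text{Cov}(X_i, X_j)$; this matrix notation is introduced here for convenience and is not used elsewhere in these notes.

$$\text{Var}\!\left(\sum_{j=1}^{k} t_j X_j\right) = \sum_{i=1}^{k}\sum_{j=1}^{k} t_i t_j\,\text{Cov}(X_i, X_j) = \mathbf{t}'\Sigma\,\mathbf{t}, \qquad\text{so}\qquad M(\mathbf{t}) = \exp\!\left(\mathbf{t}'\boldsymbol{\mu} + \tfrac12\,\mathbf{t}'\Sigma\,\mathbf{t}\right).$$

The only ingredients are the means $\mu_j$ and the covariances $\text{Cov}(X_i, X_j)$ (with $\text{Cov}(X_j, X_j) = \text{Var}(X_j)$ on the diagonal).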

Within an MVN, uncorrelated implies independent

In general, independent implies uncorrelated, but uncorrelated does not imply independent. A crucial exception holds inside a multivariate normal.

Uncorrelated ⇒ independent (within an MVN)

If $\mathbf{X}$ is MVN and split as $\mathbf{X} = (\mathbf{X}_1, \mathbf{X}_2)$, and every component of $\mathbf{X}_1$ is uncorrelated with every component of $\mathbf{X}_2$ (all cross-covariances $= 0$), then $\mathbf{X}_1$ and $\mathbf{X}_2$ are independent.

The non-example shows why the MVN assumption is essential: $Z$ and $SZ$ are uncorrelated and each normal, yet clearly not independent — but $(Z, SZ)$ is not MVN, so the theorem does not apply.
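Both halves of that statement can be checked numerically. The sketch below (standard library only, hypothetical names) estimates $\text{Cov}(Z, SZ)$, which should be near $0$, and also verifies that $|SZ| = |Z|$ on every draw, which is about as dependent as two random variables can be.

```python
import random
import statistics

def sign_flip_check(num_draws=100_000, seed=0):
    """Return (sample covariance of Z and S*Z, whether |S*Z| == |Z| on every draw)."""
    rng = random.Random(seed)
    zs, flipped = [], []
    for _ in range(num_draws):
        z = rng.gauss(0.0, 1.0)
        zs.append(z)
        flipped.append(rng.choice((-1, 1)) * z)
    mean_z, mean_f = statistics.fmean(zs), statistics.fmean(flipped)
    cov = sum((a - mean_z) * (b - mean_f) for a, b in zip(zs, flipped)) / (num_draws - 1)
    same_magnitude = all(abs(a) == abs(b) for a, b in zip(zs, flipped))
    return cov, same_magnitude

cov, same_magnitude = sign_flip_check()
print(f"sample Cov(Z, SZ) = {cov:.4f}; |SZ| == |Z| on every draw: {same_magnitude}")
```

Knowing $Z$ pins down $SZ$ up to sign, yet the covariance cannot see that, because it only measures linear association.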

Worked example: sum and difference of IID normals

$$X + Y \;\text{ and }\; X - Y \;\text{ are independent}$$

Let $X, Y$ be IID standard normal. The pair $(X+Y,\; X-Y)$ is MVN (bivariate normal): any linear combination of the two is a linear combination of the independent normals $X, Y$, hence normal. Now the covariance:

$$\text{Cov}(X+Y,\; X-Y) = \text{Var}(X) - \text{Var}(Y) = 1 - 1 = 0,$$

where the two $\text{Cov}(X, Y)$ terms cancel and the equal variances subtract to $0$. So they are uncorrelated. Because the pair is multivariate normal, uncorrelated upgrades to independent.
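A simulation can illustrate the conclusion, though it cannot prove independence. The sketch below (standard library only, illustrative names) estimates $\text{Cov}(X+Y, X-Y)$ and, as a rough probe for dependence beyond the linear kind, the covariance between the squares $(X+Y)^2$ and $(X-Y)^2$. Independence would make both $0$.

```python
import random
import statistics

def sample_cov(xs, ys):
    """Unbiased sample covariance of two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (len(xs) - 1)

def sum_diff_check(num_draws=100_000, seed=0):
    """Return (Cov(X+Y, X-Y), Cov((X+Y)^2, (X-Y)^2)) estimated from IID N(0,1) pairs."""
    rng = random.Random(seed)
    sums, diffs = [], []
    for _ in range(num_draws):
        x, y = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        sums.append(x + y)
        diffs.append(x - y)
    cov_linear = sample_cov(sums, diffs)
    cov_squares = sample_cov([a * a for a in sums], [b * b for b in diffs])
    return cov_linear, cov_squares

cov_linear, cov_squares = sum_diff_check()
print(f"Cov(X+Y, X-Y) ~ {cov_linear:.3f}; Cov of squares ~ {cov_squares:.3f}")
```

Contrast this with the pair $(Z, SZ)$: there the covariance is also $0$, but the squares are identical, so their covariance is $\text{Var}(Z^2) = 2$ rather than $0$.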

This is a famously special property of the normal: if $X, Y$ are IID from any distribution and $X + Y$ is independent of $X - Y$, then $X$ and $Y$ must be normal. The general theorem that uncorrelated implies independent within an MVN is proved with a short MGF calculation (omitted here; it is good practice). The theorem is useful because computing a covariance is often easier than demonstrating independence directly.