Lecture 22: Transformations and Convolutions

Harvard Statistics 110 (Joe Blitzstein)

1. Hypergeometric Variance Revisited

The lecture opens by finishing the variance of the hypergeometric distribution begun last time. The setup: a population of $W$ white balls and $B$ black balls, from which a sample of size $n$ is drawn without replacement. $X$ counts the white balls in the sample.

Two pieces of convenient notation:

$p = \dfrac{W}{W + B}$: the fraction of white balls in the population (between $0$ and $1$).
$N = W + B$: the population size (capital $N$), as opposed to $n$, the sample size (lowercase $n$). $N$ is not random.

Decomposition via indicators

Write $X = X_1 + \cdots + X_n$, where $X_j$ is the indicator that the $j$-th ball drawn is white. Then:

$$\operatorname{Var}(X) = \sum_{j=1}^{n} \operatorname{Var}(X_j) + 2 \sum_{i < j} \operatorname{Cov}(X_i, X_j)$$

The indicators are not independent (sampling is without replacement), so the covariance terms do not vanish.

By symmetry, before any balls are drawn the $j$-th ball is equally likely to be any ball, so every $X_j$ is marginally $\text{Bernoulli}(p)$ and every covariance $\operatorname{Cov}(X_i, X_j)$ is the same. This collapses the sums:

There are $n$ variance terms, each equal to $\operatorname{Var}(X_1) = p(1 - p)$. Total: $n\, p(1 - p)$.
There are $\binom{n}{2}$ covariance terms, each equal to $\operatorname{Cov}(X_1, X_2)$.

The covariance term

Using $\operatorname{Cov}(X_1, X_2) = E(X_1 X_2) - E(X_1)E(X_2)$:

$E(X_1)E(X_2) = p^2$ (each is marginally $\text{Bernoulli}(p)$).
$E(X_1 X_2)$: the product of two indicators is the indicator of the intersection — the event that ball 1 and ball 2 are both white. By the multiplication rule that probability is $\dfrac{W}{W + B} \cdot \dfrac{W - 1}{W + B - 1} = p \cdot \dfrac{W - 1}{W + B - 1}$.

$$\operatorname{Cov}(X_1, X_2) = p\,\frac{W - 1}{W + B - 1} - p^2$$

Simplified result and the finite population correction

Plugging in and grinding through the algebra produces a strikingly clean answer:

$$\operatorname{Var}(X) = \frac{N - n}{N - 1}\, n\, p\,(1 - p)$$
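
For readers who want the skipped algebra, here is one way to fill it in, a sketch using only the quantities defined above (with $N = W + B$ and $p = W/N$). First substitute the common variance and covariance into the decomposition, using $2\binom{n}{2} = n(n-1)$:

$$\operatorname{Var}(X) = n\,p(1 - p) + 2\binom{n}{2}\left(p\,\frac{W - 1}{N - 1} - p^2\right) = n\,p(1 - p) + n(n - 1)\,p\left(\frac{W - 1}{N - 1} - \frac{W}{N}\right)$$

The difference in parentheses simplifies over a common denominator:

$$\frac{W - 1}{N - 1} - \frac{W}{N} = \frac{N(W - 1) - W(N - 1)}{N(N - 1)} = \frac{W - N}{N(N - 1)} = -\frac{1 - p}{N - 1}$$

Substituting back and factoring out $n\,p(1 - p)$:

$$\operatorname{Var}(X) = n\,p(1 - p)\left(1 - \frac{n - 1}{N - 1}\right) = \frac{N - n}{N - 1}\, n\, p\,(1 - p)$$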

Key idea

The factor $n\, p(1 - p)$ is exactly the binomial variance. The leading factor $\dfrac{N - n}{N - 1}$ is the finite population correction. The hypergeometric variance is the binomial variance times this correction.

Sanity checks (extreme cases)

Case	Correction factor	Interpretation
$n = 1$	$\dfrac{N - 1}{N - 1} = 1$	Drawing one ball, with vs. without replacement makes no difference; reduces to the $\text{Bernoulli}(p)$ variance
$N \gg n$ (e.g. $n = 20$, $N = 100{,}000$)	very close to $1$	Sample tiny relative to population; almost never draw the same individual twice, so behaves like a binomial

Both checks confirm the formula reduces to the binomial in the regimes where with/without replacement should not matter.
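
The formula can also be checked numerically. The following is a minimal sketch, not part of the lecture: it computes the variance directly from the hypergeometric PMF with exact fractions and compares it with the closed form. The function names and the values $W = 6$, $B = 4$, $n = 3$ are illustrative choices.

```python
from fractions import Fraction
from math import comb


def hypergeom_variance_exact(W, B, n):
    """Variance of the number of white balls, computed directly from the PMF."""
    N = W + B
    total = comb(N, n)
    # P(X = k) = C(W, k) * C(B, n - k) / C(N, n); comb returns 0 when k > W or n - k > B.
    pmf = {k: Fraction(comb(W, k) * comb(B, n - k), total) for k in range(n + 1)}
    mean = sum(k * prob for k, prob in pmf.items())
    second_moment = sum(k * k * prob for k, prob in pmf.items())
    return second_moment - mean ** 2


def hypergeom_variance_formula(W, B, n):
    """Binomial variance n p (1 - p) times the finite population correction."""
    N = W + B
    p = Fraction(W, N)
    return Fraction(N - n, N - 1) * n * p * (1 - p)


# Illustrative values (an assumption, not from the lecture).
exact = hypergeom_variance_exact(6, 4, 3)
closed_form = hypergeom_variance_formula(6, 4, 3)
print(exact, closed_form)  # both print 14/25
```

With $n = 1$ the same functions return $p(1 - p)$, matching the first sanity check in the table.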

· · ·

2. Change of Variables (Transformations)

Change of variables is synonymous with transformations. A function of a random variable is itself a random variable, and LOTUS gives its expected value (and moments) without needing its distribution. But LOTUS only delivers expectations; sometimes the entire distribution of the transformed variable is needed. The change-of-variables theorem supplies the full PDF.

The theorem (one dimension)

Change-of-variables formula

Let $X$ be a continuous random variable with PDF $f_X$, and let $Y = g(X)$. Assume $g$ is differentiable (its derivative exists everywhere of interest) and strictly increasing. Then:

$$f_Y(y) = f_X(x)\,\left|\frac{dx}{dy}\right|, \qquad y = g(x), \quad x = g^{-1}(y)$$

Everything on the right must ultimately be a function of $y$: substitute $x = g^{-1}(y)$ into $f_X(x)$, and $\frac{dx}{dy}$ is the derivative of $x$ with respect to $y$ viewed as a function of $y$.

Two ways to compute the derivative

By the chain rule, $\dfrac{dx}{dy}$ and $\dfrac{dy}{dx}$ are reciprocals (they behave like fractions even though they are not literally fractions). So there is a choice:

compute $\dfrac{dx}{dy}$ directly, or
compute $\dfrac{dy}{dx}$ and take its reciprocal.

Pick whichever is easier before diving in.

Check the assumptions

Common mistake

Do not plug into the formula without verifying strict monotonicity. For example $g(x) = x^2$ is a perfectly nice (infinitely differentiable) function, but it is U-shaped, not strictly increasing on the whole line, so the formula does not apply directly — one must return to first principles. However, $x^2$ is strictly increasing if $X$ is restricted to positive values, in which case the formula applies.

Proof: CDF then differentiate

No deep insight is required, just the CDF-then-derivative recipe:

$$F_Y(y) = P(Y \le y) = P(g(X) \le y) = P\!\left(X \le g^{-1}(y)\right) = F_X\!\left(g^{-1}(y)\right)$$

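The third step uses that $g$ is strictly increasing: it has an inverse, and applying $g^{-1}$ to both sides preserves the direction of the inequality, so the two events are the same event written differently. The result is just the CDF of $X$ evaluated at $g^{-1}(y) = x$. Differentiating both sides with respect to $y$, the chain rule produces the $\frac{dx}{dy}$ correction factor and yields $f_Y(y) = f_X(x)\,\frac{dx}{dy}$. Since $g$ is increasing, $\frac{dx}{dy}$ is not negative, so the absolute value in the theorem changes nothing here.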

Worked example: the log-normal distribution

The log-normal is one of the most widely used distributions in practice. The name is a frequent source of confusion.

Watch the name

Log-normal means the logarithm is normal, NOT the log of a normal. You cannot take the log of a normal random variable, since a normal can be negative and you cannot take the log of a negative number.

Let $Z$ be standard normal and define $Y = e^Z$. Then $\log Y = Z$ is normal, which is why $Y$ is called log-normal. (More generally $Z$ could be $\mathcal{N}(\mu, \sigma^2)$; here we take the standard case.) An earlier homework used the normal MGF to find moments of the log-normal; now we want the full PDF.

The map $z \mapsto e^z$ is strictly increasing and infinitely differentiable, so the theorem applies. The standard normal density is $f_Z(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$. Substituting $z = \log y$:

$$f_Y(y) = \frac{1}{\sqrt{2\pi}}\, e^{-(\log y)^2 / 2} \cdot \frac{1}{y}, \qquad y > 0$$

The factor $\frac{1}{y}$ is the derivative: $\frac{dy}{dz} = e^z = y$, so $\frac{dz}{dy}$ is its reciprocal, $\frac{1}{y}$.
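
As a rough numerical check of this density (a sketch, not from the lecture), the code below compares the formula with a finite-difference derivative of the CDF $F_Y(y) = P(Z \le \log y) = \Phi(\log y)$, which mirrors the CDF-then-differentiate proof. The function names, the step size `h`, and the test points are assumptions chosen for illustration.

```python
import math


def lognormal_pdf(y):
    """Density of Y = exp(Z), Z standard normal, from the change-of-variables formula."""
    if y <= 0:
        return 0.0
    z = math.log(y)
    return math.exp(-z * z / 2) / (math.sqrt(2 * math.pi) * y)


def lognormal_cdf(y):
    """F_Y(y) = Phi(log y), with the standard normal CDF written via math.erf."""
    if y <= 0:
        return 0.0
    return 0.5 * (1 + math.erf(math.log(y) / math.sqrt(2)))


def numeric_derivative(f, y, h=1e-5):
    """Central finite difference; h is an illustrative step size."""
    return (f(y + h) - f(y - h)) / (2 * h)


for y in (0.5, 1.0, 2.0, 5.0):
    # The two columns should agree to several decimal places.
    print(y, lognormal_pdf(y), numeric_derivative(lognormal_cdf, y))
```

Dropping the $\frac{1}{y}$ factor from `lognormal_pdf` makes the two columns disagree, which is a quick way to see that the derivative term is not optional.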

Mnemonic

To recall which derivative to use, pretend the differentials can be moved around so the relation reads symmetrically: $f_Y(y)\,dy = f_X(x)\,dx$. (Calculus courses say you cannot separate $dx$ from $dy$; in more advanced math they are separated again, and this works if interpreted carefully. Treat it as a memory aid.)

· · ·

3. Multidimensional Transformations and the Jacobian

The one-dimensional theorem generalizes to vectors. Now $Y = g(X)$ where $g$ maps $\mathbb{R}^n \to \mathbb{R}^n$, and $X = (X_1, \ldots, X_n)$ is a continuous random vector with some joint PDF (a random vector is just a list of random variables packaged together). The goal is the joint PDF of $Y$ in terms of the joint PDF of $X$.

The result is completely analogous (the proof is an exercise in multivariable calculus and is omitted here):

$$f_Y(\mathbf{y}) = f_X(\mathbf{x})\,\left|\frac{d\mathbf{x}}{d\mathbf{y}}\right|$$

The only new ingredient is interpreting the derivative of one vector with respect to another, and the absolute value.

The Jacobian

Jacobian matrix and determinant

The object $\frac{d\mathbf{x}}{d\mathbf{y}}$ is the Jacobian: the matrix of all partial derivatives of the components of $\mathbf{x}$ with respect to the components of $\mathbf{y}$, with entry $(i, j)$ equal to $\frac{\partial x_i}{\partial y_j}$.

$$\frac{d\mathbf{x}}{d\mathbf{y}} = \begin{pmatrix} \frac{\partial x_1}{\partial y_1} & \cdots & \frac{\partial x_1}{\partial y_n} \\ \vdots & & \vdots \\ \frac{\partial x_n}{\partial y_1} & \cdots & \frac{\partial x_n}{\partial y_n} \end{pmatrix}$$

To compress this matrix into one number, take its determinant, then its absolute value: $\left|\det \frac{d\mathbf{x}}{d\mathbf{y}}\right|$.

(Computing a partial derivative is just an ordinary derivative holding everything else constant.) The absolute value is essential: without it, a strictly decreasing transformation would produce a negative PDF, which is nonsense. Putting the absolute value in also lets the same formula handle decreasing maps.

Choice of direction

As in one dimension, you may instead compute the Jacobian of $\mathbf{y}$ with respect to $\mathbf{x}$, then take the reciprocal of its determinant. Sometimes one direction is far easier than the other, so decide which way to do the transformation first.

Blitzstein deliberately writes $\frac{d\mathbf{x}}{d\mathbf{y}}$ rather than the common shorthand $J$, because $J$ does not record which direction the transformation goes ($X \to Y$ or $Y \to X$), whereas $\frac{d\mathbf{x}}{d\mathbf{y}}$ makes the direction explicit.
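
To make the Jacobian recipe concrete, here is a small sketch on a hypothetical example that is not from the lecture: $X_1, X_2$ are i.i.d. standard normal and $Y_1 = X_1 + X_2$, $Y_2 = X_1 - X_2$. The map is linear, so both Jacobians are constant matrices and the two directions can be compared directly. The result is checked against the known fact that $Y_1$ and $Y_2$ are then independent $\mathcal{N}(0, 2)$.

```python
import math


def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


# Hypothetical map: y1 = x1 + x2, y2 = x1 - x2.
dy_dx = [[1.0, 1.0], [1.0, -1.0]]    # entries are the partials of y_i with respect to x_j
# Inverse map: x1 = (y1 + y2) / 2, x2 = (y1 - y2) / 2.
dx_dy = [[0.5, 0.5], [0.5, -0.5]]    # entries are the partials of x_i with respect to y_j


def normal_pdf(v, variance=1.0):
    """Density of a mean-zero normal with the given variance."""
    return math.exp(-v * v / (2 * variance)) / math.sqrt(2 * math.pi * variance)


def f_X(x1, x2):
    """Joint PDF of two independent standard normals."""
    return normal_pdf(x1) * normal_pdf(x2)


def f_Y(y1, y2):
    """Joint PDF of Y: f_X at the inverse point times |det(dx/dy)|."""
    x1, x2 = (y1 + y2) / 2, (y1 - y2) / 2
    return f_X(x1, x2) * abs(det2(dx_dy))


print(abs(det2(dx_dy)), 1 / abs(det2(dy_dx)))  # 0.5 0.5
print(f_Y(0.3, -1.2), normal_pdf(0.3, 2.0) * normal_pdf(-1.2, 2.0))
```

Both determinants are negative here ($-\tfrac{1}{2}$ and $-2$), so the example also shows why the absolute value is needed.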

· · ·

4. Convolutions

Convolution is the technical term for the distribution of a sum of random variables. Several sums have already appeared: binomial-plus-binomial (with the same $p$) via a story proof, and sums of Poissons or normals via MGFs. Those are easy when a story works or the MGF exists and is tractable. When no story is available and the MGF is unavailable or unwieldy, a more direct method is needed.

Let $T = X + Y$, with $X$ and $Y$ assumed independent (the dependent case is much nastier). We want the distribution of $T$ from the distributions of $X$ and $Y$.

Discrete case

Break the event $\{T = t\}$ into disjoint cases by what value $X$ takes; whatever is left over must be the value of $Y$:

$$P(T = t) = \sum_{x} P(X = x)\, P(Y = t - x)$$

The product of probabilities uses independence. This needs no separate proof: to make the total equal $t$, $X$ is some value $x$ and $Y$ must be $t - x$. The sum runs over all $x$ for which the term is nonzero.
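
The discrete formula translates almost line for line into code. The sketch below is illustrative and not from the lecture: it represents each PMF as a dictionary and uses two fair dice as the example, with `convolve_pmfs` a hypothetical helper name.

```python
from fractions import Fraction


def convolve_pmfs(pmf_x, pmf_y):
    """PMF of T = X + Y for independent X, Y given as {value: probability} dicts."""
    support_t = {x + y for x in pmf_x for y in pmf_y}
    pmf_t = {}
    for t in sorted(support_t):
        # P(T = t) = sum over x of P(X = x) * P(Y = t - x); missing values have probability 0.
        pmf_t[t] = sum(px * pmf_y.get(t - x, 0) for x, px in pmf_x.items())
    return pmf_t


# Illustrative example: the sum of two independent fair dice.
die = {face: Fraction(1, 6) for face in range(1, 7)}
two_dice = convolve_pmfs(die, die)
print(two_dice[7])  # 1/6
```

The `pmf_y.get(t - x, 0)` call is what implements "the sum runs over all $x$ for which the term is nonzero".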

Continuous case

The analogous statement replaces PMFs with PDFs and the sum with an integral:

$$f_T(t) = \int_{-\infty}^{\infty} f_X(x)\, f_Y(t - x)\, dx$$

This is the convolution integral. The analogy with the discrete formula is the easiest way to remember it, but analogy is not proof: reasoning the same way for a product instead of a sum (a homework problem) does not work, so this needs real justification.
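
For a concrete instance of the convolution integral, consider a hypothetical example not worked in the lecture: $X$ and $Y$ i.i.d. $\text{Expo}(1)$. The integrand $f_X(x) f_Y(t - x)$ is nonzero only for $0 \le x \le t$, where it equals $e^{-x} e^{-(t - x)} = e^{-t}$, so $f_T(t) = t e^{-t}$. The sketch below approximates the integral with a midpoint rule and compares it with that closed form. The helper name `convolve_pdfs` and the number of steps are assumptions.

```python
import math


def expo_pdf(x):
    """Expo(1) density: exp(-x) for x >= 0, zero otherwise."""
    return math.exp(-x) if x >= 0 else 0.0


def convolve_pdfs(f_x, f_y, t, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of f_x(x) * f_y(t - x) over [lo, hi].

    The caller supplies [lo, hi] as the range where the integrand is nonzero.
    """
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += f_x(x) * f_y(t - x)
    return total * dx


t = 2.0
approx = convolve_pdfs(expo_pdf, expo_pdf, t, 0.0, t)
print(approx, t * math.exp(-t))  # the two values should agree closely
```

Restricting the limits to $[0, t]$ is the numerical counterpart of working out where both densities are positive, which is usually the main bookkeeping step when doing such an integral by hand.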

Justification via CDF

Use the continuous law of total probability, then differentiate:

$$F_T(t) = P(X + Y \le t) = \int_{-\infty}^{\infty} P(X + Y \le t \mid X = x)\, f_X(x)\, dx$$

After conditioning on $X = x$, independence lets the condition be dropped, leaving $P(Y \le t - x) = F_Y(t - x)$:

$$F_T(t) = \int_{-\infty}^{\infty} F_Y(t - x)\, f_X(x)\, dx$$

Differentiating with respect to $t$ (swapping the derivative inside the integral, which a theorem permits) turns $F_Y$ into $f_Y$ and gives the convolution formula for the PDF. Here $F_T$ denotes the CDF of $T$: the lines above give the CDF, while the convolution integral stated under "Continuous case" is the PDF.

Practical advice

Convolution integrals are worth avoiding when a story proof or MGF will do, but sometimes they are unavoidable.

· · ·

5. The Probabilistic Method: Proving Existence with Probability

To rebalance the "beauty quotient" after the technical (useful but rarely called beautiful) Jacobian, Blitzstein closes with something he considers almost existential and which uses no calculus: using probability to prove that an object with a desired property exists.

Method 1: positive probability implies existence

Goal: show that some object has a desired property $A$. This sounds unrelated to probability or uncertainty. The strategy:

Impose a probability structure on the universe of objects: choose an object at random, by whatever rule we like (for a finite set, the obvious choice is uniformly at random).
Let $A$ be the event that the randomly chosen object has the property.
Show $P(A) > 0$.

Key idea

If $P(A) > 0$, at least one such object must exist (if none existed, the probability would be zero). We need not compute $P(A)$ exactly — only bound it away from zero — and we can prove existence without ever exhibiting a single example.

Method 2: an object beats the average

Suppose each object carries a numerical score. Goal: show there exists an object with a "good" score.

Pick a random object and let $X$ be its score, so $E(X)$ is the average score. Then there must exist an object whose score is at least $E(X)$: they cannot all be below average.

If $E(X)$ happens to be good, a good object exists, again without exhibiting it. (This is crude: just as in any room at least one person earns at least the average salary.)

Aside: Shannon's theorem

This crude-sounding idea underlies one of the most beautiful and useful results of the 20th century. Claude Shannon, the father of information theory and modern communication theory (cell phones rely on the coding theory that traces back to him), proved in 1948 that a noisy communication channel has a capacity, and one can transmit at rates arbitrarily close to that capacity with arbitrarily small probability of error.

His method for showing a "good code" exists was to pick a random code. Choosing randomly and trusting that the random choice performs well is audacious, yet it works. It took another 30 to 40 years before anyone could explicitly write down a good code; until then existence rested on Shannon's argument that a random one has the right properties.

Worked example: committee overlap

A self-contained illustration with made-up numbers chosen so the arithmetic is clean:

$100$ people form committees; a person may sit on more than one committee.
There are $15$ committees, each with $20$ people.
Since $15 \times 20 = 300$ committee seats and $300 / 100 = 3$, assume each person is on exactly $3$ committees. (The setup generalizes to people on differing numbers of committees.)

Goal: show there exist two committees whose overlap is at least $3$ (three people on both). Searching all assignments and computing every intersection would be a nightmare; instead, prove existence by computing an average.

There is no randomness yet; the assignment of people to committees is fixed. Introduce randomness by choosing two committees at random and computing the expected overlap $E(O)$. Use an indicator for each of the $100$ people (the indicator that the person is on both chosen committees) and linearity:

$$E(O) = 100 \cdot P(\text{person 1 is on both chosen committees})$$

By the naive definition, choosing $2$ of the $15$ committees gives $\binom{15}{2}$ equally likely pairs. Person 1 sits on $3$ committees, so the favorable pairs are $\binom{3}{2} = 3$:

$$E(O) = 100 \cdot \frac{\binom{3}{2}}{\binom{15}{2}} = 100 \cdot \frac{3}{\frac{15 \cdot 14}{2}} = \frac{300}{105} = \frac{20}{7}$$

Conclusion

The average overlap is $\frac{20}{7} \approx 2.86$, frustratingly just short of $3$ (if only it were $\frac{21}{7} = 3$). But overlaps are integers. Since the average is $\frac{20}{7}$, some pair of committees has overlap at least $\frac{20}{7}$, and an integer that is at least $\frac{20}{7}$ must be at least $3$. Therefore two committees with overlap at least $3$ must exist, proven without exhibiting them.
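
The argument can be checked by brute force on any one concrete assignment. The sketch below is not from the lecture: it builds a hypothetical assignment in three rounds, each of which shuffles the $100$ people and deals them into $5$ committees of $20$. This is only one convenient way to satisfy the constraints (committees from the same round never overlap), and the seed is arbitrary. It then computes every pairwise overlap.

```python
import random
from fractions import Fraction
from itertools import combinations


def random_assignment(seed=0):
    """One hypothetical assignment: 100 people, 15 committees of 20, 3 committees per person."""
    rng = random.Random(seed)
    committees = [set() for _ in range(15)]
    for round_index in range(3):
        people = list(range(100))
        rng.shuffle(people)
        # Each round fills 5 fresh committees, so every person gets one seat per round.
        for position, person in enumerate(people):
            committees[5 * round_index + position // 20].add(person)
    return committees


committees = random_assignment()
overlaps = [len(a & b) for a, b in combinations(committees, 2)]
average = Fraction(sum(overlaps), len(overlaps))
print(average)        # 20/7, whatever the assignment
print(max(overlaps))  # at least 3, as the argument guarantees
```

The average is $\frac{20}{7}$ for every valid assignment, because each person contributes to exactly $\binom{3}{2} = 3$ of the $\binom{15}{2} = 105$ pairs. Only the maximum depends on the particular assignment.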