Lecture 18: MGFs Continued

Harvard Statistics 110 (Joe Blitzstein)

1. Why MGFs Matter

The theory of moment generating functions was developed last lecture; this one builds intuition through the most important examples. Recall the definition: the MGF of a random variable $X$ is

$$M(t) = E\!\left(e^{tX}\right)$$

viewed as a function of $t$. The letter $t$ is a dummy variable — $M(s) = E(e^{sX})$ is the same object. The MGF is a bookkeeping device that packages all the moments of $X$, and it is a third way to describe a distribution, alongside the CDF and the PDF/PMF.

Three reasons the MGF matters
  • It generates moments: $E(X^n)$ is read off the Taylor coefficients of $M(t)$, avoiding the often-hard integral $E(X^n) = \int x^n f(x)\,dx$.
  • The MGF of a sum of independent variables is the product of their MGFs, turning convolutions into multiplication.
  • The MGF determines the distribution: if two variables share an MGF on an interval around $0$, they have the same distribution.

The word "moment" entered statistics from physics: there is a genuine analogy between variance and moment of inertia. (Where physics got the word "moment" is a separate question.)

How moments are read off

If $M(t) = \sum_n a_n t^n$, matching against $M(t) = \sum_n E(X^n)\, t^n / n!$ shows that the $n$th moment is whatever sits in front of $t^n / n!$:

$$E(X^n) = (\text{coefficient of } t^n) \times n!$$

If the series you wrote has no explicit $n!$ in the denominator, just multiply and divide by $n!$ to force the pattern.
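The coefficient-matching rule can be sketched in a few lines of Python. The helper name `moments_from_taylor` and the Bernoulli example are illustrative assumptions, not part of the lecture. The example uses the MGF of a $\text{Bern}(p)$ variable, $M(t) = 1 - p + p e^t$, whose Taylor coefficients are $a_0 = 1$ and $a_n = p/n!$ for $n \ge 1$, so every moment from the first onward should come out as $p$.

```python
from fractions import Fraction
from math import factorial


def moments_from_taylor(coeffs):
    """Given Taylor coefficients a_0, a_1, ... of M(t), return E(X^n) = a_n * n!."""
    return [a * factorial(n) for n, a in enumerate(coeffs)]


# Illustrative example (an assumption, not from the lecture): Bern(p) with p = 1/3.
p = Fraction(1, 3)
bern_coeffs = [Fraction(1)] + [p / factorial(n) for n in range(1, 6)]
bern_moments = moments_from_taylor(bern_coeffs)
print(bern_moments)
```

Exact fractions are used so that the multiplication by $n!$ introduces no rounding.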

· · ·

2. Exponential MGF and Its Moments

Let $X \sim \text{Expo}(1)$ (rate-1 exponential, PDF $e^{-x}$ for $x > 0$). Find the MGF and all moments. By LOTUS,

$$M(t) = E(e^{tX}) = \int_0^{\infty} e^{tx} e^{-x}\,dx = \int_0^{\infty} e^{-x(1 - t)}\,dx = \frac{1}{1 - t}$$

This follows either by direct integration or by recognizing the integrand (up to a constant) as another exponential PDF of rate $1 - t$ and supplying its normalizing constant.

Convergence restriction

The result holds only for $t < 1$. If $t \ge 1$ the integral diverges: at $t = 1$ the integrand is constantly $1$, and for $t > 1$ (say $t = 2$) the exponent turns positive and the integrand $e^{x}$ blows up. For $t < 1$ we have exponential decay and a finite integral. The MGF is finite on $(-\infty, 1)$, which contains an open interval around $0$ (e.g. $(-1, 1)$), so this is a valid MGF.

Moments via the geometric series

Rather than differentiating $1/(1 - t)$ repeatedly, recognize the pattern: anything of the form $1/(1 - \text{something})$ should suggest a geometric series. For $|t| < 1$,

$$\frac{1}{1 - t} = \sum_{n=0}^{\infty} t^n = \sum_{n=0}^{\infty} n! \cdot \frac{t^n}{n!}$$

Matching the coefficient of $t^n / n!$ gives, for all $n$:

$$E(X^n) = n! \qquad (X \sim \text{Expo}(1))$$

All moments fall out at once — no derivatives, no integrals. This is the appeal of the MGF: by LOTUS, $E(X^n)$ would need a possibly difficult integral of $x^n$ times the PDF, but the MGF replaces integration with differentiation (or, here, reading off a known series), which is almost always easier.
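As a sanity check on $E(X^n) = n!$, the LOTUS integral can be approximated numerically. This is only a sketch: the trapezoid rule, the truncation point `upper`, and the step count are assumptions chosen for illustration, and the function name is hypothetical.

```python
from math import exp, factorial


def expo1_moment_numeric(n, upper=60.0, steps=60_000):
    """Approximate the integral of x^n e^{-x} over [0, upper] by the trapezoid rule."""
    h = upper / steps
    total = 0.0
    for i in range(steps + 1):
        x = i * h
        weight = 0.5 if i in (0, steps) else 1.0  # endpoints get half weight
        total += weight * (x ** n) * exp(-x)
    return total * h


# Compare the numerical integral with n! for the first few moments.
for n in range(6):
    print(n, round(expo1_moment_numeric(n), 4), factorial(n))
```

The two columns should agree closely, which is the brute-force version of what the geometric series delivers for every $n$ at once.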

General rate: $\text{Expo}(\lambda)$

Let $Y \sim \text{Expo}(\lambda)$. Convert to the rate-1 case by scaling. Set $X = \lambda Y$, which is $\text{Expo}(1)$. (Multiply by $\lambda$ rather than divide: $\text{Expo}(\lambda)$ has mean $1/\lambda$, so multiplying by $\lambda$ gives mean $1$.) Then $Y = X / \lambda$, and

$$E(Y^n) = \frac{E(X^n)}{\lambda^n} = \frac{n!}{\lambda^n}$$

No calculus beyond the rate-1 result.
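The scaling argument can also be checked by simulation. The sketch below draws from $\text{Expo}(\lambda)$ with the standard library's `random.expovariate` (whose argument is the rate); the rate, seed, and sample size are arbitrary assumptions, and simulated values only approximate the exact moments.

```python
import random
from math import factorial

random.seed(110)  # arbitrary seed so the run is reproducible

lam = 2.0            # illustrative rate (an assumption)
num_draws = 200_000
ys = [random.expovariate(lam) for _ in range(num_draws)]

# Scaling step: X = lam * Y should behave like Expo(1), so its mean is near 1.
mean_x = sum(lam * y for y in ys) / num_draws
print("mean of lam * Y:", round(mean_x, 3))

# Compare simulated E(Y^n) with n! / lam^n for small n.
for n in (1, 2, 3):
    simulated = sum(y ** n for y in ys) / num_draws
    exact = factorial(n) / lam ** n
    print(n, round(simulated, 3), exact)
```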

· · ·

3. Moments of the Normal

Let $Z \sim \mathcal{N}(0, 1)$ be standard normal. Find all its moments.

Odd moments by symmetry

The odd moments are all $0$. By LOTUS, $E(Z^{2k+1})$ integrates an odd function symmetrically about $0$, so the negative area cancels the positive area — no work needed:

$$E(Z^{\text{odd}}) = 0$$

Even moments via the Taylor series

We know $E(Z) = 0$ and $E(Z^2) = 1$ (mean $0$, variance $1$), but higher even moments look forbidding by LOTUS: $E(Z^4)$ means integrating $z^4$ times the normal PDF, an integral that could take hours by substitution or parts — and that is only the fourth moment.

Use the MGF instead. From last lecture, $M(t) = e^{t^2 / 2}$. Differentiating it repeatedly (chain rule, then product rule, with the term count growing) is mechanical but tedious. Better: the Taylor series for $e^x$ converges everywhere, so substitute $x = t^2/2$ directly, with no derivatives:

$$M(t) = e^{t^2/2} = \sum_{n=0}^{\infty} \frac{(t^2/2)^n}{n!} = \sum_{n=0}^{\infty} \frac{t^{2n}}{2^n\, n!}$$

Only even powers of $t$ appear, as expected for an even function. To read off the $2n$-th moment, force the matching factorial: the coefficient of $t^{2n}/(2n)!$ is the moment, so multiply and divide by $(2n)!$:

$$E(Z^{2n}) = \frac{(2n)!}{2^n\, n!}$$

This gives every even moment of the normal with no calculus.

Checks and the partnership connection

| $n$ | Moment | Value |
|---|---|---|
| $1$ | $E(Z^2)$ | $\dfrac{2!}{2^1 \cdot 1!} = \dfrac{2}{2} = 1$ |
| $2$ | $E(Z^4)$ | $\dfrac{4!}{2^2 \cdot 2!} = \dfrac{24}{8} = 3$ |
| $3$ | $E(Z^6)$ | $\dfrac{6!}{2^3 \cdot 3!} = \dfrac{720}{48} = 15$ |

$E(Z^2) = 1$ matches the variance. The values $1, 3, 15, \ldots$ are products of consecutive odd numbers: $1,\; 1\cdot 3,\; 1\cdot 3\cdot 5,\; 1\cdot 3\cdot 5\cdot 7,\; \ldots$

These are exactly the counts of ways to break $2n$ people into $n$ partnerships — a number that appeared in earlier combinatorics practice. The reappearance of the partnership count as the even moments of the normal is not a coincidence: there is a deep combinatorial reason for it (beyond this course).
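The three descriptions of these numbers (the moment formula, the product of odd numbers, and the partnership count) can be compared directly. The sketch below is illustrative: the function names are hypothetical, and the partnership count is done by brute-force recursion rather than by any formula from the lecture.

```python
from math import factorial


def normal_even_moment(n):
    """E(Z^(2n)) = (2n)! / (2^n n!) for standard normal Z."""
    return factorial(2 * n) // (2 ** n * factorial(n))


def odd_product(n):
    """1 * 3 * 5 * ... * (2n - 1)."""
    result = 1
    for k in range(1, 2 * n, 2):
        result *= k
    return result


def count_partnerships(people):
    """Count ways to split a tuple of people into unordered pairs by brute force."""
    if not people:
        return 1
    rest = people[1:]
    # Pair the first person with each possible partner, then recurse on the others.
    return sum(
        count_partnerships(rest[:i] + rest[i + 1:])
        for i in range(len(rest))
    )


for n in range(1, 6):
    print(n, normal_even_moment(n), odd_product(n),
          count_partnerships(tuple(range(2 * n))))
```

The recursion mirrors the usual counting argument: the first person has $2n - 1$ choices of partner, and the remaining $2n - 2$ people are then paired off in the same way.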

· · ·

4. Poisson MGF

Let $X \sim \text{Pois}(\lambda)$. We know its mean and variance are both $\lambda$, but here the goal is to illustrate the other uses of the MGF, not to grind out moments. By LOTUS, summing over the non-negative integers:

$$M(t) = E(e^{tX}) = \sum_{k=0}^{\infty} e^{tk}\, \frac{e^{-\lambda}\lambda^k}{k!}$$

Pull out the constant $e^{-\lambda}$ and combine $e^{tk}\lambda^k = (\lambda e^t)^k$:

$$M(t) = e^{-\lambda} \sum_{k=0}^{\infty} \frac{(\lambda e^t)^k}{k!} = e^{-\lambda}\, e^{\lambda e^t}$$

The remaining sum is the Taylor series for $e^x$ evaluated at $x = \lambda e^t$. So

Poisson MGF

$$M(t) = e^{\lambda(e^t - 1)}, \qquad \text{valid for all real } t$$

The series converges everywhere. (This sum also appears on the course math-review handout, but it is just pattern recognition once the $e^x$ series is familiar — nothing to memorize.)
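A quick numerical comparison of the LOTUS sum with the closed form, as a sketch: the truncation at 200 terms, the rate $\lambda = 3$, and the function names are assumptions made for illustration.

```python
from math import exp


def poisson_mgf_series(t, lam, terms=200):
    """Truncated LOTUS sum: sum over k of e^{tk} * e^{-lam} * lam^k / k!."""
    ratio = lam * exp(t)
    term = exp(-lam)  # the k = 0 term
    total = term
    for k in range(1, terms):
        term *= ratio / k  # builds e^{-lam} (lam e^t)^k / k! without large factorials
        total += term
    return total


def poisson_mgf_closed(t, lam):
    """Closed form e^{lam (e^t - 1)}."""
    return exp(lam * (exp(t) - 1.0))


for t in (-1.0, 0.0, 0.5, 1.0):
    print(t, poisson_mgf_series(t, 3.0), poisson_mgf_closed(t, 3.0))
```

Both positive and negative values of $t$ are included because the MGF is valid for all real $t$.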

· · ·

5. Sum of Independent Poissons

Let $X \sim \text{Pois}(\lambda)$ and $Y \sim \text{Pois}(\mu)$ be independent. Find the distribution of $X + Y$. The distribution of a sum is a convolution, generally messy to compute directly — but for independent variables, the MGF of the sum is the product of the MGFs:

$$M_{X+Y}(t) = M_X(t)\, M_Y(t) = e^{\lambda(e^t - 1)} e^{\mu(e^t - 1)} = e^{(\lambda + \mu)(e^t - 1)}$$

This is exactly the $\text{Pois}(\lambda + \mu)$ MGF. Since the MGF determines the distribution, $X + Y \sim \text{Pois}(\lambda + \mu)$.

The mean $\lambda + \mu$ was already obvious by linearity; the interesting part is that the sum stays in the Poisson family. Most families lack this closure — adding independent members usually produces a different family. The Poisson is special.
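For comparison, the convolution that the MGF argument sidesteps can be computed directly and checked against the $\text{Pois}(\lambda + \mu)$ PMF. This is a sketch; the rates and helper names are illustrative assumptions.

```python
from math import exp, factorial


def pois_pmf(k, rate):
    """P(N = k) for N ~ Pois(rate)."""
    return exp(-rate) * rate ** k / factorial(k)


def sum_pmf_by_convolution(k, lam, mu):
    """P(X + Y = k) for independent X ~ Pois(lam), Y ~ Pois(mu)."""
    return sum(pois_pmf(j, lam) * pois_pmf(k - j, mu) for j in range(k + 1))


lam, mu = 1.5, 2.5  # illustrative rates (assumptions)
for k in range(6):
    print(k, sum_pmf_by_convolution(k, lam, mu), pois_pmf(k, lam + mu))
```

The product inside the convolution is where independence is used: $P(X = j, Y = k - j)$ is replaced by $P(X = j)\,P(Y = k - j)$.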

Independence is essential

A common mistake is to drop the independence assumption. Multiplying MGFs is justified only when $X$ and $Y$ are independent. Take the most extreme dependence, $X = Y$. Then $X + Y = 2X$, which is not Poisson, for three independent reasons:

Why $2X$ is not Poisson
  • Possible values. $2X$ is always even, but a Poisson must be able to take every non-negative integer value. (Simplest argument.)
  • MGF. $M_{2X}(t) = E(e^{t \cdot 2X}) = E(e^{2t \cdot X}) = M_X(2t) = e^{\lambda(e^{2t} - 1)}$, which is not of the Poisson form $e^{c(e^t - 1)}$ for any constant $c$, and no algebraic simplification rescues it.
  • Mean vs. variance. $E(2X) = 2\lambda$ but $\operatorname{Var}(2X) = 4\lambda$ (the constant comes out squared). A Poisson always has mean equal to variance; here the variance is double the mean.

Intuitively, adding a variable to itself inflates variance: if the value is large, you add the same large value twice, with no chance of one term offsetting the other (which independent terms can do). The same error appears whenever a student writes an i.i.d. sum $X_1 + X_2 + X_3$ as $X + X + X = 3X$ — but $X$ is not independent of itself.
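The mean, variance, and parity of $2X$ can be computed from the PMF of $X$. A minimal sketch, assuming $\lambda = 3$ and truncating the support at 100 (both are illustrative choices):

```python
from math import exp, factorial


def pois_pmf(k, rate):
    """P(N = k) for N ~ Pois(rate)."""
    return exp(-rate) * rate ** k / factorial(k)


lam = 3.0  # illustrative rate (an assumption)

# Distribution of 2X: value 2k with probability P(X = k), truncated at k = 99.
pmf_2x = {2 * k: pois_pmf(k, lam) for k in range(100)}

mean_2x = sum(v * p for v, p in pmf_2x.items())
var_2x = sum(v ** 2 * p for v, p in pmf_2x.items()) - mean_2x ** 2

# 2X never lands on an odd value, while a Poisson with the same mean often does.
prob_2x_is_odd = sum(p for v, p in pmf_2x.items() if v % 2 == 1)
prob_poisson_is_odd = sum(pois_pmf(k, 2 * lam) for k in range(1, 100, 2))

print(mean_2x, var_2x)
print(prob_2x_is_odd, prob_poisson_is_odd)
```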

· · ·

6. Joint Distributions

The next major topic is joint distributions: working with the distribution of more than one random variable at a time. Everything in the course is cumulative — understanding the CDF of one variable is the prerequisite for the joint CDF of several.

Definitions (discrete and continuous, in parallel)

Joint CDF, PMF, PDF

Joint CDF (any $X, Y$): $\;F(x, y) = P(X \le x,\, Y \le y)$.

Joint PMF (discrete): $\;P(X = x,\, Y = y)$.

Joint PDF (continuous): a function $f(x, y)$ such that, for a region $B$ of the plane,

$$P\big((X, Y) \in B\big) = \iint_B f(x, y)\,dx\,dy.$$

This is the two-dimensional analog of "the PDF is what you integrate to get a probability." In practice the double integrals here are usually handled as one single integral followed by another.

Independence

Independence means multiply

$X$ and $Y$ are independent if and only if the joint CDF factors, for all $x, y$:

$$F(x, y) = F_X(x)\, F_Y(y).$$

Equivalently, and usually more convenient, the joint PMF/PDF factors into the marginals:

  • Discrete: $P(X = x, Y = y) = P(X = x)\,P(Y = y)$ for all $x, y$.
  • Continuous: $f(x, y) = f_X(x)\, f_Y(y)$ for all $x, y$.

The factorization must hold for all real $x, y$, not only where the density is positive — the zeros matter too.

Marginals

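The marginal distribution of $X$ is its distribution viewed on its own. You can always recover marginals from the joint distribution by summing or integrating out the other variable:

$$P(X = x) = \sum_{y} P(X = x,\, Y = y) \ \text{ (discrete)}, \qquad f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy \ \text{ (continuous)}.$$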

This operation is called marginalizing. It runs one direction only: joint $\to$ marginal. The marginals alone do not determine the joint distribution — they carry no information about how $X$ and $Y$ relate.

The name "marginal" comes from writing row and column sums in the margins of a table. (In economics, "marginal" means a derivative — marginal cost, marginal revenue — the opposite operation. In statistics, marginal means integrate or sum.)

Discrete example: a $2\times 2$ table

Let $X, Y$ be Bernoulli (possibly dependent, possibly with different parameters). Their joint distribution is specified by four non-negative numbers summing to $1$ — just a valid PMF in two dimensions.

|  | $Y = 0$ | $Y = 1$ | Row sum ($X$) |
|---|---|---|---|
| $X = 0$ | $2/6$ | $1/6$ | $3/6 = 1/2$ |
| $X = 1$ | $2/6$ | $1/6$ | $3/6 = 1/2$ |
| Col sum ($Y$) | $4/6$ | $2/6$ | $1$ |

Each cell is a joint probability, e.g. $P(X = 0, Y = 0) = 2/6$. The margins give the marginal distributions: $P(Y = 0) = 2/6 + 2/6 = 4/6$, $P(X = 0) = 2/6 + 1/6 = 1/2$, and so on.

Check independence by the definition: does each cell equal the product of its margins? For the $(0,0)$ cell, $P(X = 0)\,P(Y = 0) = \tfrac{1}{2}\cdot\tfrac{2}{3} = \tfrac{1}{3} = 2/6$, which matches. All four cells factor this way, so $X$ and $Y$ are independent here. By contrast, a table with a $0$ entry whose row and column both have positive probability must be dependent: the cell is $0$, but the product of two positive marginals such as $\tfrac{1}{4}\cdot\tfrac{1}{2}$ is not. Such a zero forces dependence, but dependence does not require a zero.
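The margin sums and the cell-by-cell independence check can be written out as a short sketch. The function names are hypothetical, and the second table (with a zero cell) is an invented example for contrast, not one from the lecture.

```python
from fractions import Fraction


def marginals(joint):
    """Sum a joint PMF {(x, y): prob} over the other variable."""
    px, py = {}, {}
    for (x, y), prob in joint.items():
        px[x] = px.get(x, 0) + prob
        py[y] = py.get(y, 0) + prob
    return px, py


def is_independent(joint):
    """True if every cell equals the product of its row and column margins."""
    px, py = marginals(joint)
    return all(joint.get((x, y), 0) == px[x] * py[y] for x in px for y in py)


sixth = Fraction(1, 6)
table = {(0, 0): 2 * sixth, (0, 1): sixth, (1, 0): 2 * sixth, (1, 1): sixth}

# Hypothetical table with a zero cell (not from the lecture), for contrast.
table_with_zero = {
    (0, 0): Fraction(1, 2), (0, 1): Fraction(0),
    (1, 0): Fraction(1, 4), (1, 1): Fraction(1, 4),
}

print(marginals(table))
print(is_independent(table), is_independent(table_with_zero))
```

Exact fractions avoid false mismatches from floating-point rounding when comparing a cell with the product of its margins.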

Continuous example: uniform on a square (independent)

Uniform on the unit square is a completely random point in $S = \{(x, y) : 0 \le x \le 1,\ 0 \le y \le 1\}$. By analogy with the one-dimensional uniform (constant PDF on an interval), the joint PDF is constant on the square and $0$ outside:

$$f(x, y) = c \ \text{ on the square}, \qquad 0 \ \text{ otherwise}.$$

Integrating $1$ over a region gives its area (the 2-D analog of length), so $c = 1/\text{area} = 1$ (the square has area $1$). Integrating out one variable gives a $\text{Unif}(0,1)$ marginal for the other, and the density factors as $1 = 1 \cdot 1$. So the coordinates are independent $\text{Unif}(0,1)$ — a random point has a uniform $x$-coordinate and an independent uniform $y$-coordinate.

Continuous example: uniform on a disk (dependent)

Now take uniform on the unit disk $D = \{(x, y) : x^2 + y^2 \le 1\}$. Uniform means probability proportional to area, so the normalizing constant is $1/(\text{area}) = 1/\pi$:

$$f(x, y) = \tfrac{1}{\pi} \ \text{ inside the disk}, \qquad 0 \ \text{ outside}.$$

Common mistake

The density is a constant, so it "looks like it factors" into constant times constant — suggesting independence. It does not. The constraint $x^2 + y^2 \le 1$ couples the variables: if $x = 0$, then $y \in [-1, 1]$; but if $x$ is near $1$, then $y$ is confined to a tiny interval. Knowing $x$ constrains $y$, so $X$ and $Y$ are dependent.

Concretely, given $X = x$, the constraint forces

$$-\sqrt{1 - x^2} \le y \le \sqrt{1 - x^2},$$

whose endpoints depend on $x$. A reasonable guess is that, conditioned on $X = x$, $Y$ is uniform on this interval — to be confirmed next time with an integral. Either way, the dependence is already visible from the constraint.
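The dependence can also be seen in a simulation. The sketch below draws uniform points on the disk by rejection sampling (a technique chosen here for illustration, not taken from the lecture) and compares $P(X > 0.8,\, Y > 0.8)$ with $P(X > 0.8)\,P(Y > 0.8)$. The threshold $0.8$, the seed, and the sample size are assumptions. Since $0.8^2 + 0.8^2 = 1.28 > 1$, no point of the disk has both coordinates above $0.8$, so the joint probability is $0$ while the product of the marginals is positive.

```python
import random

random.seed(110)  # arbitrary seed so the run is reproducible


def sample_unit_disk():
    """Rejection sampling: draw from the enclosing square until the point is in the disk."""
    while True:
        x = random.uniform(-1.0, 1.0)
        y = random.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            return x, y


points = [sample_unit_disk() for _ in range(20_000)]
n = len(points)

cut = 0.8  # illustrative threshold (an assumption); (0.8, 0.8) lies outside the disk
p_x = sum(x > cut for x, _ in points) / n
p_y = sum(y > cut for _, y in points) / n
p_both = sum(x > cut and y > cut for x, y in points) / n

print("product of marginals:", p_x * p_y)
print("joint probability:   ", p_both)
```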