Lecture 17: Moment Generating Functions

Harvard Statistics 110 (Joe Blitzstein)

1. Memorylessness: Expectation vs. Conditional Expectation

The lecture returns to the exponential distribution and its memoryless property, motivated by a common error in popular reporting about life expectancy.

Life expectancy and a common fallacy

Suppose life expectancy is 80 years (real figures are roughly 76 for men and 81 for women in the US; computing them well is itself a hard statistical problem, because of censored data — people still alive haven't yet contributed a final lifetime).

The fallacy: assuming that an 80-year life expectancy applies unchanged to everyone, including people already in their 50s and 60s, so that a 60-year-old would expect only 20 more years. In reality, the longer you have already lived, the longer your expected total lifespan becomes. Writing $T$ for lifetime:

$$E(T \mid T > 20) > E(T)$$

Conditional expectation

Conditional expectation is ordinary expectation computed with conditional rather than unconditional probabilities; since conditional probabilities are probabilities, everything carries over. The strict inequality fails only when nobody dies before age 20, so that the information $T > 20$ is irrelevant; the extreme instance is the degenerate case where everyone lives exactly the same span, with no variability at all.

What memorylessness would say

Human lifetimes are not memoryless — people decay with age. If they were, surviving to 20 would reset the clock: you'd be "as good as new" and gain a fresh 80 years on average.

$$E(T \mid T > 20) = 20 + E(T) \quad \text{(memoryless case only)}$$

So the truth lies between two bounds — the upper one is the (unrealistic) memoryless case, the lower one the no-variability case:

$$E(T) \;<\; E(T \mid T > 20) \;<\; E(T) + 20$$
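As a rough numerical illustration of these bounds, the sketch below computes $E(T \mid T > 20)$ from a survival function, using $E(T \mid T > s) = s + \int_s^\infty P(T > x)\,dx \,/\, P(T > s)$. The helper name `cond_mean_given_survival`, the mean of 80, and the Weibull shape 5 used as a stand-in "aging" model are all assumptions chosen for illustration, not part of the lecture.

```python
import math

def cond_mean_given_survival(survival, s, upper=4000.0, steps=200_000):
    """Approximate E(T | T > s) = s + (integral of survival over [s, upper]) / survival(s).

    Uses the trapezoid rule; `upper` must be large enough that the tail beyond it is negligible.
    """
    h = (upper - s) / steps
    total = 0.5 * (survival(s) + survival(upper))
    for i in range(1, steps):
        total += survival(s + i * h)
    return s + h * total / survival(s)

MEAN = 80.0

# Memoryless case: Expo with mean 80, survival function exp(-x / 80).
def expo_survival(x):
    return math.exp(-x / MEAN)

# Hypothetical aging model: Weibull with shape 5, scaled so the mean is also 80.
SHAPE = 5.0
SCALE = MEAN / math.gamma(1.0 + 1.0 / SHAPE)

def weibull_survival(x):
    return math.exp(-((x / SCALE) ** SHAPE))

memoryless = cond_mean_given_survival(expo_survival, 20.0)
aging = cond_mean_given_survival(weibull_survival, 20.0)

print(f"memoryless: {memoryless:.3f}")  # close to 20 + 80
print(f"aging:      {aging:.3f}")       # strictly between 80 and 100
```

Under this particular aging model very few people die before 20, so the conditional mean sits only slightly above 80; a model with more early mortality would push it further up, but never past the memoryless value.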

Why memorylessness still matters

It is realistic in many science applications (chemistry, physics) and in some economics settings, where things don't decay with age.
Analogy with two kinds of homework: a "fixed effort" problem where you make gradual partial progress (not memoryless), versus a "breakthrough" problem where you either get the aha moment eventually or don't — past effort doesn't change the future (memoryless).
It is a building block. The Weibull distribution — the most popular model for survival times in practice — is built from an exponential raised to a power. Cubing an exponential random variable, for instance, destroys memorylessness and yields a Weibull.
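The claim that cubing destroys memorylessness can be checked directly on survival functions. This is a minimal sketch assuming $Y \sim \text{Expo}(1)$, so that $P(Y^3 > x) = P(Y > x^{1/3}) = e^{-x^{1/3}}$; the function names and the test point $s = t = 1$ are illustrative choices.

```python
import math

def survival_expo(x, lam=1.0):
    """P(Y > x) for Y ~ Expo(lam)."""
    return math.exp(-lam * x)

def survival_cubed_expo(x):
    """P(Y**3 > x) for Y ~ Expo(1): equals P(Y > x**(1/3)) = exp(-x**(1/3))."""
    return math.exp(-(x ** (1.0 / 3.0)))

s, t = 1.0, 1.0

# Memorylessness in survival-function form: P(X > s + t) = P(X > s) * P(X > t).
expo_gap = abs(survival_expo(s + t) - survival_expo(s) * survival_expo(t))
cubed_gap = abs(
    survival_cubed_expo(s + t) - survival_cubed_expo(s) * survival_cubed_expo(t)
)

print(expo_gap)   # essentially zero: the exponential is memoryless
print(cubed_gap)  # clearly nonzero: the cubed variable is not
```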

· · ·

2. The Exponential Is the Only Memoryless Continuous Distribution

In discrete time the geometric distribution is memoryless; in continuous time the exponential is. They are deep analogs of each other — the exponential is the continuous analog of the geometric, and vice versa. In fact the exponential is the unique continuous memoryless distribution.

Theorem — Characterization of the Exponential

If $X$ is a positive continuous random variable (taking values in $(0, \infty)$) whose distribution has the memoryless property, then $X \sim \text{Expo}(\lambda)$ for some $\lambda > 0$.

The memoryless property belongs to the distribution; we say the random variable has it if its distribution does.

Setup: a functional equation

This proof is unusual: we solve for a function, not for a number — a functional equation. Let $F$ be the CDF of $X$. As with the exponential, it is cleaner to work with the survival function (one minus the CDF):

$$g(x) = P(X > x) = 1 - F(x)$$

In terms of $g$, the memoryless property is a clean multiplicative identity (the derivation is the same one-line conditional-probability argument from last time):

$$g(s + t) = g(s)\,g(t), \qquad \text{for all } s, t > 0$$
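For readers who want the one-line argument spelled out, here is a sketch assuming the usual statement of memorylessness, $P(X > s + t \mid X > s) = P(X > t)$. Since the event $\{X > s + t\}$ is contained in $\{X > s\}$, the definition of conditional probability gives

$$P(X > s + t \mid X > s) = \frac{P(X > s + t)}{P(X > s)} = \frac{g(s + t)}{g(s)},$$

and requiring this to equal $P(X > t) = g(t)$ for all $s, t > 0$ is exactly

$$g(s + t) = g(s)\,g(t).$$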

This holds for the exponential because $e^{-\lambda(s+t)} = e^{-\lambda s}\,e^{-\lambda t}$. We want to show only exponential functions can satisfy it. The strategy: plug in clever values and gradually learn more about $g$.

Bootstrapping from integers to all reals

| Step | Substitution | Conclusion |
| --- | --- | --- |
| Integer multiples | $s = t$, then $s = 2t, \ldots$ | $g(kt) = g(t)^k$ for positive integers $k$ |
| Reciprocals | replace $t$ by $t/k$, take $k$-th root | $g(t/k) = g(t)^{1/k}$ |
| Rationals | combine the two | $g\!\left(\tfrac{m}{n}\,t\right) = g(t)^{m/n}$ |
| Reals | limit of rationals; continuity of $g$ | $g(xt) = g(t)^x$ for all real $x > 0$ |

For the integer step: $s = t$ gives $g(2t) = g(t)^2$; then $g(3t) = g(2t)\,g(t) = g(t)^3$; by induction $g(kt) = g(t)^k$. The reals step relies on $g$ being continuous, which lets us swap the limit with $g$ (e.g., approximate $\pi$ by $3,\, 3.1,\, 3.14, \ldots$).
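The steps above can be sanity-checked numerically on a function already known to satisfy the functional equation. The sketch below takes $g(x) = e^{-\lambda x}$ with an arbitrary rate (the value 0.7, the point $t = 1.3$, and the variable names are assumptions for illustration) and checks the integer, rational, and real steps, the last by approximating $\pi$ with truncated decimals.

```python
import math

LAM = 0.7  # arbitrary rate, chosen only for illustration

def g(x):
    """Survival function of Expo(LAM)."""
    return math.exp(-LAM * x)

t = 1.3

# Integer multiples: g(k t) = g(t)^k
integer_ok = all(math.isclose(g(k * t), g(t) ** k) for k in range(1, 8))

# Rationals: g((m/n) t) = g(t)^(m/n)
rational_ok = all(
    math.isclose(g(m / n * t), g(t) ** (m / n))
    for m in range(1, 6)
    for n in range(1, 6)
)

# Reals: truncated decimals of pi give powers that approach g(pi * t)
approximations = [3, 3.1, 3.14, 3.141, 3.1415, 3.14159]
errors = [abs(g(t) ** x - g(math.pi * t)) for x in approximations]

print(integer_ok, rational_ok)
print(errors)  # shrinks as the rational approximations approach pi
```

This only confirms that the exponential is consistent with each step; the proof itself runs in the other direction, deducing the form of $g$ from the functional equation.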

Finishing the proof

Since $g(xt) = g(t)^x$ holds for all $x > 0$ and $t > 0$, set $t = 1$:

$$g(x) = g(1)^x = e^{x \ln g(1)}$$

Conclusion

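Now $g(1) = P(X > 1)$ is a probability, and it lies strictly inside $(0, 1)$: if $g(1)$ were $0$ or $1$, then $g(x) = g(1)^x$ would be identically $0$ or identically $1$ for $x > 0$, which is impossible for the survival function of a positive random variable. So $\ln g(1)$ is negative. Call it $-\lambda$ with $\lambda > 0$: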

$$g(x) = e^{-\lambda x}$$

This is exactly the survival function of $\text{Expo}(\lambda)$. Hence the exponential is the only continuous memoryless distribution. $\blacksquare$

· · ·

3. Moment Generating Functions: Definition

The moment generating function (MGF) is another way to describe a distribution — an alternative to the CDF or PDF.

Definition — Moment Generating Function

A random variable $X$ has MGF

$$M(t) = E\!\left(e^{tX}\right)$$

viewed as a function of $t$. The MGF exists only if $M(t)$ is finite on some interval $(-a, a)$ around $0$ with $a > 0$. (It may be finite for all $t$, which is even better, but we require at least a small interval around zero.)
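To make the definition concrete, here is a minimal sketch that evaluates $M(t) = E(e^{tX})$ directly for a discrete random variable. The helper `mgf` and the fair-die example are assumptions for illustration. A random variable with finitely many values has a finite $M(t)$ for every $t$, and $M(0) = E(e^0) = 1$ for any distribution.

```python
import math

def mgf(pmf, t):
    """M(t) = E(e^{tX}) for a discrete X given as {value: probability}."""
    return sum(prob * math.exp(t * value) for value, prob in pmf.items())

# Hypothetical example distribution: a fair six-sided die.
die = {k: 1 / 6 for k in range(1, 7)}

for t in (-1.0, 0.0, 0.5):
    print(t, mgf(die, t))
```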

What is $t$?

$t$ is a dummy variable — a placeholder. We could call it $s$, $q$, or $w$; convention is $t$. Avoid letters that clash ($e$, $M$, $X$).
For any fixed $t$, $e^{tX}$ is a function of a random variable, hence a random variable, so its expectation $E(e^{tX})$ is well-defined (possibly infinite). $M$ is therefore a genuine function of $t$.
Think of $t$ as a bookkeeping device: the MGF is a clever way to package all the moments of a distribution.

Why "moment generating"

Expand $e^{tX}$ with the Taylor series for the exponential (valid everywhere). If we may swap the expectation and the sum:

$$M(t) = E\!\left(\sum_{n=0}^{\infty} \frac{X^n t^n}{n!}\right) = \sum_{n=0}^{\infty} E(X^n)\,\frac{t^n}{n!}$$

The nth moment

$E(X^n)$ is the $n$th moment. The first moment is the mean; the second moment (with the first) gives the variance — the variance equals the second moment only when the mean is zero. Higher moments have more complex interpretations. All moments sit right there as coefficients in the Taylor series — hence "moment generating."

The swap of $E$ with the infinite sum is justified (under the mild assumption that the MGF exists on an interval around $0$) by results from real analysis (Stat 210 / a real analysis course). For a finite sum it would be immediate by linearity; the infinite case is a kind of "infinite linearity" requiring more care.
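As a small worked check of the expansion, take $X \sim \text{Bern}(p)$ with $q = 1 - p$ (an illustrative choice). Since $X$ only takes the values $0$ and $1$, $X^n = X$ for every $n \geq 1$, so $E(X^n) = p$ for $n \geq 1$ and $E(X^0) = 1$. The series then sums in closed form:

$$\sum_{n=0}^{\infty} E(X^n)\,\frac{t^n}{n!} = 1 + p \sum_{n=1}^{\infty} \frac{t^n}{n!} = 1 + p\,(e^t - 1) = p\,e^t + q,$$

which agrees with computing $E(e^{tX}) = p\,e^{t} + q\,e^{0}$ directly from the definition.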

· · ·

4. Why the MGF Matters: Three Reasons

Reason 1: It generates moments

The $n$th moment $E(X^n)$ can be read off two equivalent ways:

As the coefficient of $t^n / n!$ in the Taylor (Maclaurin) expansion of $M$ about $0$ — often easiest, since we frequently already know a relevant Taylor series (e.g., for $e^x$).
As the $n$th derivative of $M$ evaluated at $0$: $\;E(X^n) = M^{(n)}(0)$. The first derivative at $0$ gives the mean, the second the second moment, and so on.
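The derivative route can be illustrated numerically. The sketch below approximates $M'(0)$ and $M''(0)$ by central finite differences for a fair die and compares them with the first two moments computed straight from the PMF. The die, the helper `mgf`, and the step size `h` are illustrative assumptions; finite differences are only an approximation to the exact derivatives.

```python
import math

def mgf(pmf, t):
    """M(t) = E(e^{tX}) for a discrete X given as {value: probability}."""
    return sum(prob * math.exp(t * value) for value, prob in pmf.items())

die = {k: 1 / 6 for k in range(1, 7)}  # hypothetical example: fair die
h = 1e-4                               # finite-difference step size

# Central differences approximate M'(0) and M''(0).
first = (mgf(die, h) - mgf(die, -h)) / (2 * h)
second = (mgf(die, h) - 2 * mgf(die, 0.0) + mgf(die, -h)) / h**2

# Moments computed directly from the PMF, for comparison.
mean = sum(p * k for k, p in die.items())
second_moment = sum(p * k**2 for k, p in die.items())

print(first, mean)            # both near 3.5
print(second, second_moment)  # both near 91/6
```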

Reason 2: The MGF determines the distribution

Uniqueness

If $X$ and $Y$ have the same MGF, they have the same distribution (same CDF; same PDF if continuous, and so on). The proof is difficult and is omitted.

The consequence is powerful: if you compute an MGF and recognize it as, say, a $\text{Pois}(3)$ MGF, you may conclude the random variable is $\text{Pois}(3)$. No other distribution can impersonate it. Once you know the MGF, you know the distribution — at least in principle.

Reason 3: It tames sums of independent random variables

Finding the distribution of a sum of independent random variables directly is hard — it requires a convolution (a convolution sum or integral). MGFs make it easy. If $X$ has MGF $M_X$ and $Y$ has MGF $M_Y$, and $X, Y$ are independent:

$$M_{X+Y}(t) = M_X(t)\,M_Y(t)$$

By definition and the laws of exponents:

$$M_{X+Y}(t) = E\!\left(e^{t(X+Y)}\right) = E\!\left(e^{tX} e^{tY}\right) = E\!\left(e^{tX}\right) E\!\left(e^{tY}\right) = M_X(t)\,M_Y(t).$$

The third equality uses the fact (to be proved later) that the expectation of a product of independent random variables is the product of their expectations: since $X$ and $Y$ are independent, so are $e^{tX}$ and $e^{tY}$, and the expectation splits. Independence is essential; the factorization is false in general without it. So the MGF of a sum of independent random variables is simply the product of the individual MGFs, with no convolution required.
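The contrast with convolution can be seen in a small discrete example. The sketch below computes the MGF of $X + Y$ two ways: by first building the PMF of the sum with a convolution sum, and by multiplying the two MGFs. The distributions (a fair die and an independent $\text{Bern}(0.3)$), the value of $t$, and the helper names are assumptions for illustration.

```python
import math
from collections import defaultdict

def mgf(pmf, t):
    """M(t) = E(e^{tX}) for a discrete X given as {value: probability}."""
    return sum(prob * math.exp(t * value) for value, prob in pmf.items())

def convolve(pmf_x, pmf_y):
    """PMF of X + Y for independent discrete X and Y (a convolution sum)."""
    out = defaultdict(float)
    for x, px in pmf_x.items():
        for y, py in pmf_y.items():
            out[x + y] += px * py
    return dict(out)

# Hypothetical examples: a fair die and an independent Bern(0.3).
die = {k: 1 / 6 for k in range(1, 7)}
bern = {0: 0.7, 1: 0.3}

t = 0.4
via_convolution = mgf(convolve(die, bern), t)
via_product = mgf(die, t) * mgf(bern, t)

print(via_convolution, via_product)  # agree up to floating-point rounding
```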

· · ·

5. Worked Examples of MGFs

Bernoulli$(p)$

$$M(t) = p\,e^t + q, \qquad q = 1 - p$$

If $X \sim \text{Bern}(p)$, then $e^{tX}$ is $e^t$ (when $X = 1$) or $e^0 = 1$ (when $X = 0$). No LOTUS is needed — just take the weighted average over the two outcomes.

Binomial$(n, p)$

$$M(t) = \left(p\,e^t + q\right)^n$$

Write a $\text{Bin}(n, p)$ as a sum of $n$ IID $\text{Bern}(p)$ random variables and apply Reason 3 (the MGF of a sum of independents is the product of MGFs). This avoids a messy LOTUS computation.

As a check, the Binomial has mean $np$ and variance $npq$; differentiating $M$ once and evaluating at $0$ recovers the mean, and the second derivative at $0$ gives the second moment, from which the variance follows.
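Spelling out that check for $M(t) = (p\,e^t + q)^n$, and using $p + q = 1$ when evaluating at $t = 0$:

$$M'(t) = n\,(p\,e^t + q)^{n-1}\,p\,e^t, \qquad M'(0) = np,$$

$$M''(t) = n(n-1)\,(p\,e^t + q)^{n-2}\,p^2 e^{2t} + n\,(p\,e^t + q)^{n-1}\,p\,e^t, \qquad M''(0) = n(n-1)p^2 + np,$$

$$\text{Var}(X) = M''(0) - M'(0)^2 = n(n-1)p^2 + np - n^2 p^2 = np - np^2 = npq.$$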

Standard Normal

Let $Z \sim \mathcal{N}(0, 1)$. Once we have the standard normal MGF, every normal follows via location and scale (any normal is $\mu + \sigma Z$), so derive the general case as practice. By LOTUS:

$$M(t) = E\!\left(e^{tZ}\right) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{tz}\, e^{-z^2/2}\, dz$$

If $t = 0$ this is just the integral of the standard normal density, which equals $1$. The linear term $tz$ is the only obstacle, so eliminate it by completing the square in the exponent:

$$tz - \tfrac{z^2}{2} = -\tfrac{1}{2}(z - t)^2 + \tfrac{t^2}{2}$$

Pulling the constant $e^{t^2/2}$ outside the integral:

$$M(t) = e^{t^2/2} \cdot \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-(z - t)^2/2}\, dz$$

Conclusion

The remaining integral is a normal density centered at $t$ (variance unchanged), so it integrates to $1$. Therefore:

$$M(t) = e^{t^2/2}.$$
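As a numerical sanity check on this result, the sketch below approximates the LOTUS integral with a trapezoid rule and compares it with $e^{t^2/2}$. The function name, the integration range $[-15, 15]$, and the step count are assumptions for illustration; the range is adequate only for moderate values of $t$.

```python
import math

def normal_mgf_numeric(t, lo=-15.0, hi=15.0, steps=30_000):
    """Trapezoid-rule approximation of E(e^{tZ}) for Z ~ N(0, 1)."""
    h = (hi - lo) / steps

    def integrand(z):
        # e^{tz} times the standard normal density
        return math.exp(t * z - z * z / 2) / math.sqrt(2 * math.pi)

    total = 0.5 * (integrand(lo) + integrand(hi))
    for i in range(1, steps):
        total += integrand(lo + i * h)
    return h * total

for t in (0.0, 0.5, 1.5):
    print(t, normal_mgf_numeric(t), math.exp(t * t / 2))
```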

· · ·

6. Laplace's Rule of Succession

A famous old problem (not strictly about MGFs, but useful for the homework and upcoming material). Laplace — a great mathematician and physicist — asked: given that the sun has risen on every one of the last $n$ days, what is the probability it rises tomorrow? He was ridiculed for it, but the structure is broadly useful regardless of the literal astronomy.

Setup

Model each day by an indicator: $X_1, X_2, \ldots$ are IID $\text{Bern}(p)$ given $p$, where $X_i = 1$ means the sun rose on day $i$. Let $S_n = X_1 + \cdots + X_n$. Given $p$, $\;S_n \sim \text{Bin}(n, p)$.

Bayesian framing

The twist: $p$ is unknown. Statistics has long debated how to handle unknowns (the Bayesian vs. frequentist controversy, a Stat 111 topic). The Bayesian approach quantifies uncertainty about $p$ by treating $p$ itself as a random variable: start with a prior (beliefs before data), collect data, and update via Bayes' rule to get the posterior (beliefs after data).

Laplace took the prior $p \sim \text{Unif}(0, 1)$, arguing that a uniform reflects complete ignorance. Both the use of any prior and this specific uniform choice are philosophically contested, but we proceed with it.

The conditional structure

Everything is conditional on $p$. Given $p$ (treated as a known constant), the $X_i$ are IID $\text{Bern}(p)$, so $S_n \sim \text{Bin}(n, p)$. This is conditional independence — the same idea as the random-coins examples: the $X_i$ are not unconditionally independent, but they are independent given $p$; the randomness in $p$ couples them.

Observing $S_n = k$ (the number of the $n$ days on which the sun rose) is enough; conditioning on the full sequence $X_1, \ldots, X_n$ gives the same answer. $S_n$ is a sufficient statistic (another Stat 111 idea).

Finding the posterior via a continuous Bayes' rule

We want the posterior density $f(p \mid S_n = k)$. Bayes' rule looks just like the discrete version, treating the density as if it were a probability (a PDF times a small increment is approximately the probability of landing in that interval):

$$f(p \mid S_n = k) = \frac{P(S_n = k \mid p)\, f(p)}{P(S_n = k)}$$

$f(p) = 1$ (the uniform prior).
The denominator $P(S_n = k)$ does not depend on $p$; computing it directly uses the continuous law of total probability, $\;P(S_n = k) = \int_0^1 P(S_n = k \mid p)\, f(p)\, dp$ — the continuous analog of the discrete law of total probability.

We don't actually need the denominator. Working up to proportionality (dropping the $p$-independent constant and the binomial coefficient $\binom{n}{k}$):

$$f(p \mid S_n = k) \;\propto\; p^k (1 - p)^{n - k}$$

The all-sunny case $k = n$

$$f(p \mid S_n = n) = (n + 1)\, p^n, \qquad 0 < p < 1$$

When the sun rose all $n$ days, $\;f(p \mid S_n = n) \propto p^n$. The integral of $p^n$ from $0$ to $1$ is $\frac{1}{n+1}$, so the normalizing constant is $n + 1$.

Probability the sun rises tomorrow

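We want $P(X_{n+1} = 1 \mid S_n = n)$. By the fundamental bridge, this probability is the conditional expectation of the indicator $X_{n+1}$; since $X_{n+1}$ is $\text{Bern}(p)$ given $p$, it equals the expected value of $p$ under the posterior: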

$$P(X_{n+1} = 1 \mid S_n = n) = \int_0^1 (n + 1)\, p^n \cdot p \, dp = (n + 1) \cdot \frac{1}{n + 2}$$

Rule of succession

$$P(X_{n+1} = 1 \mid S_n = n) = \frac{n + 1}{n + 2}$$

For example, if the sun has risen $100$ days in a row, the probability it rises tomorrow is $\frac{101}{102}$.
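The calculation can be reproduced exactly with rational arithmetic. The sketch below relies on the standard identity $\int_0^1 p^a (1 - p)^b\, dp = \frac{a!\,b!}{(a + b + 1)!}$ for nonnegative integers $a, b$. The function names are illustrative, and allowing a general $k$ (not just the all-sunny case $k = n$ treated above) is an extension made here for the example, under the same uniform prior.

```python
from fractions import Fraction
from math import comb, factorial

def beta_integral(a, b):
    """Exact integral of p^a (1 - p)^b over [0, 1]: a! b! / (a + b + 1)!."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))

def prob_next_success(n, k):
    """P(X_{n+1} = 1 | S_n = k) under a Unif(0, 1) prior: the posterior mean of p."""
    # The posterior is proportional to p^k (1 - p)^(n - k),
    # so its mean is a ratio of two integrals of that form.
    return beta_integral(k + 1, n - k) / beta_integral(k, n - k)

def marginal(n, k):
    """P(S_n = k), from the continuous law of total probability."""
    return comb(n, k) * beta_integral(k, n - k)

print(prob_next_success(100, 100))  # 101/102
print(marginal(100, 100))           # 1/101
```

The second printed value is the denominator $P(S_n = n) = \frac{1}{n+1}$, the reciprocal of the normalizing constant $n + 1$ found above.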