The lecture opens with a puzzle: what comes next in the sequence $0, 1, 2, \ldots$? Different people guess different things — $3$, $4$, $720$, $42/e$ — and that is the point. A "next number" puzzle has no well-defined answer without a complexity measure that picks the simplest continuation, and two people rarely agree on what is simplest.
The continuation Blitzstein had in mind: $0, 1, 2, 6, 24, 120, 720, \ldots$ — that is, $0, 1!, 2!, 3!, 4!, 5!, 6!, \ldots$ (zero with no factorials, one with one factorial, and so on). The broader goal is to interpolate the factorial function: it is defined only on non-negative integers, so how do we "connect the dots" and ask, say, what $\pi!$ should mean?
Before interpolating, it helps to know how fast $n!$ grows. Stirling's formula gives an approximation everyone should know:
This is an excellent approximation even for modest $n$ (around $12$ or $20$). More than that, the ratio of the two sides converges to $1$ as $n \to \infty$. The dominant growth comes from the $(n/e)^n$ term: roughly $n^n$ behavior, discounted by $e^n$. (The formula can also be proved using probability, via a Poisson interpretation — not done in this lecture.)
There are infinitely many smooth curves through the points $(n, n!)$. The question is whether one is canonical. There is:
The gamma function is the standard, most widely applicable extension of the factorial to the positive reals. It is the function we need to define the gamma distribution.
For any real $a > 0$,
$$\Gamma(a) = \int_0^{\infty} x^{a}\, e^{-x}\, \frac{dx}{x}.$$
The factor $dx/x$ means the integrand is $x^{a-1} e^{-x}$; the two forms are identical. Keeping one $x$ "on hold" in the $dx/x$ is a notational convenience that pays off repeatedly below.
The integral exists only for real $a > 0$. Two limits matter:
| Property | Statement |
|---|---|
| Factorial link | $\Gamma(n) = (n-1)!$ for positive integer $n$ |
| Recursion | $\Gamma(x+1) = x\,\Gamma(x)$ |
| Base case | $\Gamma(1) = 1$ |
| Half-integer | $\Gamma(1/2) = \sqrt{\pi}$ |
The offset in $\Gamma(n) = (n-1)!$ is a historical convention, nothing deeper.
The recursion is the heart of the function. Check it against the factorials:
This is exactly the recursion the factorials obey, with the same starting point, so the two sequences agree forever. The recursion itself follows from a single integration by parts on the defining integral — the calculation reproduces the gamma function again and hands back the identity for free.
$\Gamma(1/2) = \sqrt{\pi}$ connects to the normalizing constant of the normal distribution, where $\sqrt{2\pi}$ appears. The link is a substitution. Take the (un-normalized) normal integrand $e^{-x^2/2}$ and substitute $u = x^2/2$, so $du = x\,dx$. There is no spare $x$ in the integrand, so $x$ must be re-expressed as roughly $\sqrt{u}$, and the integral reduces to gamma-function form with $a = 1/2$. So either:
From $\Gamma(1/2)$, the recursion generates all half-integers: $\Gamma(3/2) = \tfrac{1}{2}\Gamma(1/2) = \tfrac{\sqrt{\pi}}{2}$, then $\Gamma(5/2)$, $\Gamma(7/2)$, and so on.
The gamma function immediately suggests a probability density. The defining integral equals $\Gamma(a)$; divide both sides by $\Gamma(a)$ and the integrand integrates to $1$. Since it is also non-negative, it is a valid PDF:
$$f(x) = \frac{1}{\Gamma(a)}\, x^{a}\, e^{-x}\, \frac{1}{x}, \qquad x > 0.$$
Writing the trailing $1/x$ (rather than folding it into $x^{a-1}$) avoids the common slip of confusing whether the exponent should be $a$ or $a-1$.
This construction is "a cute math trick," not yet a probabilistic story. By itself, normalizing a famous function does not explain why this distribution matters. The real motivation comes from its connection to the exponential distribution.
The two-parameter family Gamma$(a, \lambda)$ comes from rescaling, exactly as a general Exponential$(\lambda)$ comes from an Exponential$(1)$. Here $\lambda$ is a rate; $1/\lambda$ is a scale.
Let $X \sim$ Gamma$(a, 1)$ and define $Y = X/\lambda$ (equivalently $X = \lambda Y$). By the change-of-variables formula, $f_Y(y) = f_X(x)\,\bigl|\tfrac{dx}{dy}\bigr|$ with $\tfrac{dx}{dy} = \lambda$. Substituting $x = \lambda y$ into the Gamma$(a, 1)$ density:
The two factors of $\lambda$ from $1/(\lambda y)$ and $dx/dy$ cancel one each, giving the Gamma$(a, \lambda)$ density on $y > 0$. Like the exponential, the gamma is a continuous distribution on the positive reals.
The substantive story behind the gamma is the Poisson process, the standard model for arrivals over time — emails landing in an inbox, phone calls, radioactive decays. Picture a timeline with an $\times$ marked at each arrival. Let $N_t$ = number of arrivals up to time $t$.
Two assumptions, and only two:
A few weeks earlier the class established the link to the exponential — before the exponential had even been defined. Let $T_1$ be the time of the first arrival. Then $T_1 \sim$ Exponential$(\lambda)$, shown by a one-line calculation:
No arrival by time $t$ is the same event as $N_t = 0$, and the Poisson$(\lambda t)$ PMF at $0$ is $e^{-\lambda t}$. That tail probability is exactly $1$ minus the Exponential$(\lambda)$ CDF, so $T_1 \sim$ Exponential$(\lambda)$.
The same argument repeats for every gap. By the memoryless property of the exponential, once the first arrival happens, waiting for the second is a fresh copy of the same problem, independent of the first. So every gap between consecutive arrivals is Exponential$(\lambda)$, and the gaps are independent. These gaps are the inter-arrival times.
A Poisson process of rate $\lambda$ is equivalently a process whose inter-arrival times are i.i.d. Exponential$(\lambda)$.
Let $T_n$ be the actual time of the $n$-th arrival (not a gap, but the cumulative time on the line). Summing the gaps:
For integer $n$, this sum is Gamma$(n, \lambda)$. This is the central story of the gamma distribution. (The parameter $a$ in the density need not be an integer in general, but the arrival-time interpretation requires an integer $n$.)
The discrete-time analogy makes it memorable:
| Discrete time | Continuous time |
|---|---|
| Geometric — wait for $1$ success | Exponential — wait for $1$ arrival |
| Negative Binomial — wait for $n$ successes | Gamma — wait for $n$ arrivals |
A Negative Binomial is a sum of i.i.d. Geometrics; a Gamma$(n, \lambda)$ is a sum of $n$ i.i.d. Exponentials. "Success" here just means an arrival.
The claim "sum of $n$ i.i.d. exponentials is Gamma$(n, \lambda)$" is asserted, not yet proved. Two routes are available:
It suffices to prove the $\lambda = 1$ case; the general $\lambda$ follows by rescaling. The exponential MGF (derived earlier in the course, where it gave all the moments at once via the geometric series) is:
Since $T_n$ is a sum of $n$ independent $X_j$ with identical MGFs, the MGF of the sum is the product:
That was almost no work. Now confirm this is the MGF of a Gamma$(n, 1)$. Let $Y \sim$ Gamma$(n, 1)$ (fresh notation, to avoid circular reasoning) and compute its MGF by LOTUS:
Collect the two exponentials, $e^{ty} e^{-y} = e^{-(1-t)y}$:
$$M_Y(t) = \frac{1}{\Gamma(n)}\int_0^{\infty} y^{n}\, e^{-(1-t)y}\, \frac{dy}{y}.$$
This is almost the gamma integral, just with an extra constant in the exponent. The skill the course actually rewards is pattern recognition, not heavy calculus: substitute to turn the exponent back into $e^{-x}$. Let $x = (1-t)y$, so $dx = (1-t)\,dy$. Since $t < 1$, the factor $1 - t$ is positive and the limits stay $0$ to $\infty$. Then $y = x/(1-t)$, so $y^{n}$ contributes $(1-t)^{-n}$, and the convenient $dy/y$ equals $dx/x$ (the substitution factor cancels):
$$M_Y(t) = \frac{(1-t)^{-n}}{\Gamma(n)}\int_0^{\infty} x^{n}\, e^{-x}\, \frac{dx}{x} = (1-t)^{-n}\,\frac{\Gamma(n)}{\Gamma(n)} = \frac{1}{(1 - t)^{n}}.$$
The MGFs match. Since an MGF determines the distribution (no "masquerading" distributions share an MGF without being equal), $T_n \sim$ Gamma$(n, 1)$, and by rescaling the sum of $n$ i.i.d. Exponential$(\lambda)$ is Gamma$(n, \lambda)$.
A bonus: nothing in this computation actually used that $n$ is an integer. So $1/(1-t)^{a}$ is the correct MGF of Gamma$(a, 1)$ for any positive real $a$ — the integer assumption is only needed for the sum-of-exponentials interpretation.
The MGF could deliver the moments, but LOTUS is even quicker. Let $X \sim$ Gamma$(a, 1)$ and find $E(X^c)$ for an arbitrary constant $c$ (not necessarily an integer; existence must be checked):
The remaining integral is again a gamma integral — it equals $\Gamma(a+c)$. (Equivalently, multiply and divide by $\Gamma(a+c)$ and recognize the Gamma$(a+c, 1)$ density, which integrates to $1$.) No real calculus needed; it is algebra plus recognition:
$$E(X^c) = \frac{\Gamma(a+c)}{\Gamma(a)}, \qquad \text{valid when } a + c > 0.$$
The constraint $a + c > 0$ is required because the gamma function is defined only on positive arguments; otherwise the moment does not exist.
Set $c = 1$ and $c = 2$ and apply the recursion $\Gamma(a+1) = a\,\Gamma(a)$:
A Gamma$(a, 1)$ has mean $a$ and variance $a$. (Independent check in the integer case: an Exponential$(1)$ has mean $1$, and summing $n$ of them gives mean $n$ by linearity.)
The coincidence that mean equals variance here is not a deep gamma property — it is an artifact of setting $\lambda = 1$. (The Poisson is the distribution where mean always equals variance.) Restore the rate by $Y = X/\lambda$:
The variance picks up $\lambda^2$ because multiplying by a constant pulls the constant out squared in the variance. So Gamma$(a, \lambda)$ has mean $a/\lambda$ and variance $a/\lambda^2$. Doing the algebra at $\lambda = 1$ first and restoring $\lambda$ at the end keeps the work uncluttered.
The gamma is a hub among the named distributions. The course has now covered all the discrete distributions it needs and nearly all the continuous ones; the remaining continuous distributions are essentially variations built from the gamma and the normal. The gamma relates to:
Because the gamma and normal together generate the rest, once both are in hand the remaining distributions follow.