Lecture 23: Beta Distribution

Harvard Statistics 110 (Joe Blitzstein)

1. What the Beta Distribution Is

The Beta distribution is a generalization of the Uniform distribution. So far the only named distribution that is both continuous and bounded is the Uniform: $\text{Unif}(0,1)$ lives on $[0,1]$ by definition and has a completely flat PDF.

Other continuous distributions reach to infinity — the Normal goes from $-\infty$ to $+\infty$, the Exponential from $0$ to $+\infty$. The Beta is the most widely used distribution that stays bounded between $0$ and $1$ but is not forced to be flat.

It is a two-parameter family, $\text{Beta}(a, b)$, where $a$ and $b$ are any positive real numbers (not necessarily integers). A Beta random variable takes values in $[0,1]$. Its PDF is:

$$f(x) = c\, x^{a-1} (1-x)^{b-1}, \qquad 0 < x < 1,$$

and $f(x) = 0$ otherwise. Here $c$ is the normalizing constant.

The beta function

The integral $\int_0^1 x^{a-1}(1-x)^{b-1}\,dx$ is the famous beta function, an object with a long history in mathematics independent of its use in statistics. Finding $c$ amounts to evaluating that integral — but a great deal can be done without yet knowing $c$.
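As a rough numerical illustration (not part of the lecture), the constant $c$ can be approximated by integrating the kernel with a midpoint rule. The function names and the step count are illustrative choices, and the sketch assumes $a, b \ge 1$ so that the integrand stays bounded.

```python
def beta_integral(a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of x^(a-1) (1-x)^(b-1) over [0, 1]."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h  # midpoint of the i-th subinterval
        total += x ** (a - 1) * (1 - x) ** (b - 1)
    return total * h


def normalizing_constant(a, b, steps=100_000):
    """c is the reciprocal of the beta integral, so that the PDF integrates to 1."""
    return 1.0 / beta_integral(a, b, steps)


if __name__ == "__main__":
    print(normalizing_constant(1, 1))  # Unif(0,1): c should be close to 1
    print(normalizing_constant(2, 2))  # integral of x(1-x) is 1/6, so c should be close to 6
```

For parameters below $1$ the integrand is unbounded at an endpoint, and a plain midpoint rule converges slowly there, so this sketch is best treated as a sanity check rather than a general tool.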

· · ·

2. Flexibility: Shapes of the PDF

The reason Betas are interesting is that, by varying $a$ and $b$, the PDF takes many different shapes. There is no single characteristic shape — that flexibility is what makes it a useful modeling tool.

| Parameters | Term that survives | Shape on $[0,1]$ |
| --- | --- | --- |
| $a = 1,\ b = 1$ | constant | Flat: this is exactly $\text{Unif}(0,1)$ |
| $a = 2,\ b = 1$ | $x$ | Straight line rising as $x$ goes from $0$ to $1$ (a triangle of area $1$) |
| $a = \tfrac12,\ b = \tfrac12$ | $x^{-1/2}(1-x)^{-1/2}$ | U-shaped, with asymptotes blowing up to $\infty$ at both ends |
| $a = 2,\ b = 2$ | $x(1-x)$ | A smooth upside-down U (a hump centered at $\tfrac12$) |

The $a = b = 1$ case shows directly that the Beta generalizes the Uniform: both exponents become $0$, the $x$-dependence disappears, and the PDF is constant. The $a = b = \tfrac12$ case is worth noting: since $x^{-1/2} \to \infty$ as $x \to 0$ and $(1-x)^{-1/2} \to \infty$ as $x \to 1$, the density is unbounded at both ends yet still integrates to $1$.
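The shapes in the table can be checked by evaluating the unnormalized kernel at a few points. This is only a sketch: `beta_kernel`, the grid, and the rounding are illustrative choices, and the normalizing constant is deliberately left out because it rescales a curve without changing its shape.

```python
def beta_kernel(x, a, b):
    """Unnormalized Beta(a, b) density x^(a-1) (1-x)^(b-1) for 0 < x < 1."""
    return x ** (a - 1) * (1 - x) ** (b - 1)


# The four parameter pairs from the table of shapes.
cases = [(1, 1), (2, 1), (0.5, 0.5), (2, 2)]
grid = [0.01, 0.25, 0.5, 0.75, 0.99]

for a, b in cases:
    values = [round(beta_kernel(x, a, b), 3) for x in grid]
    print(f"a={a}, b={b}: {values}")
```

Reading the printed rows left to right should show a flat row, a rising row, a row that is largest at both ends, and a row that peaks in the middle.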

· · ·

3. Why the Beta Matters: A Prior for Probabilities

The main use of the Beta is as a distribution for a probability. When a parameter is itself a probability — and therefore lives between $0$ and $1$ — the Beta is by far the most widely used prior distribution for it.

This is the Bayesian viewpoint: we treat an unknown parameter as a random variable, encoding our uncertainty about it as a distribution. The Beta is the standard choice for a parameter on $[0,1]$ because it has many convenient properties.

Key property

The central reason is that the Beta is the conjugate prior to the Binomial — a term from Bayesian statistics (developed fully in Stat 111) explained in the next section. The Beta is also well connected to other distributions, with relationships explored in later lectures.

· · ·

4. Conjugate Prior to the Binomial

This section generalizes Laplace's rule of succession. In the original (the "will the Sun rise tomorrow?" problem), complete ignorance about an unknown probability was modeled with a Uniform prior. Here we replace the Uniform with a general $\text{Beta}(a,b)$ prior — and we will not even need the normalizing constant.

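Setup

Let $p$ be an unknown success probability with prior $p \sim \text{Beta}(a, b)$, and suppose that, conditional on $p$, the data are $X \mid p \sim \text{Bin}(n, p)$. We observe $X = k$ and want the posterior distribution of $p$.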

Deriving the posterior

Write the posterior PDF of $p$ given $X = k$ as $f(p \mid X = k)$. By Bayes' rule (the hybrid discrete/continuous form):

$$f(p \mid X = k) = \frac{P(X = k \mid p)\, f(p)}{P(X = k)}$$

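The denominator $P(X = k)$ is a constant with respect to $p$: it depends on $k$ but not on $p$, because $p$ has been integrated out. So we can work up to proportionality in $p$, discarding any factor that does not depend on $p$. Substituting the Binomial likelihood $\binom{n}{k} p^{k}(1-p)^{n-k}$ and the Beta prior $c\,p^{a-1}(1-p)^{b-1}$, then dropping $\binom{n}{k}$, $c$, and $P(X = k)$, leaves: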

$$f(p \mid X = k) \;\propto\; p^{a+k-1} (1-p)^{b+n-k-1}$$

This is exactly the kernel of a Beta density. The constant out front is just whatever is needed to make it integrate to $1$. Therefore:

$$p \mid X \;\sim\; \text{Beta}(a + X,\; b + n - X)$$

No hard integration was required — grouping the $p$'s, grouping the $(1-p)$'s, and ignoring constants immediately gave the answer.
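The update rule can be written in one line, and a crude grid comparison confirms that likelihood times prior has the same shape as the claimed Beta kernel. This is a sketch under stated assumptions: the function names and grid size are hypothetical, and the check compares normalized values on a grid rather than proving anything.

```python
from math import comb


def posterior_params(a, b, n, k):
    """Beta(a, b) prior plus k successes in n Binomial trials gives Beta(a + k, b + n - k)."""
    return a + k, b + n - k


def normalize(values):
    """Rescale a list of non-negative numbers so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def grid_check(a, b, n, k, points=2000):
    """Largest gap between likelihood*prior and the claimed Beta kernel, both normalized on a grid."""
    grid = [(i + 0.5) / points for i in range(points)]
    # Bayes' rule numerator: Binomial likelihood times the unnormalized Beta prior.
    numerator = [
        comb(n, k) * p**k * (1 - p) ** (n - k) * p ** (a - 1) * (1 - p) ** (b - 1)
        for p in grid
    ]
    a_post, b_post = posterior_params(a, b, n, k)
    kernel = [p ** (a_post - 1) * (1 - p) ** (b_post - 1) for p in grid]
    return max(abs(u - v) for u, v in zip(normalize(numerator), normalize(kernel)))


if __name__ == "__main__":
    print(posterior_params(2, 3, 10, 4))  # (6, 9)
    print(grid_check(2, 3, 10, 4))        # should be at floating-point noise level
```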

What "conjugate" means

Conjugate prior

A family of priors is conjugate to a likelihood if, starting from a member of that family, the posterior is again a member of the same family (with updated parameters). Here: start with a Beta prior, observe Binomial data, and the posterior is again Beta. So the Beta family is the conjugate prior for the Binomial.

Intuition

Think of $X$ as the number of successes and $n - X$ as the number of failures. The prior $\text{Beta}(a, b)$ acts as if $a$ were prior "pseudo-successes" and $b$ prior "pseudo-failures" from an earlier experiment. The data adds $X$ new successes and $n - X$ new failures, giving $\text{Beta}(a + X,\, b + n - X)$.

Conjugacy is computationally very convenient — you stay inside one well-understood family rather than inventing a new distribution at every update. It does not, by itself, make the prior correct; whether conjugate priors are the right way to encode uncertainty is a philosophical debate. But there is no denying the convenience. Setting $a = b = 1$ (a Uniform prior) recovers Laplace's rule of succession as a special case.
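One consequence of the pseudo-count reading is that updating one observation at a time lands on the same posterior as a single update with the totals. The sketch below illustrates this with hypothetical data; the helper names are not from the lecture.

```python
def update(a, b, successes, failures):
    """Add observed successes and failures to the Beta pseudo-counts."""
    return a + successes, b + failures


def sequential_posterior(a, b, data):
    """Update once per observation (1 = success, 0 = failure)."""
    for outcome in data:
        a, b = update(a, b, outcome, 1 - outcome)
    return a, b


def batch_posterior(a, b, data):
    """Update once, using the total numbers of successes and failures."""
    return update(a, b, sum(data), len(data) - sum(data))


if __name__ == "__main__":
    data = [1, 0, 1, 1, 0, 1]  # hypothetical outcomes
    # Starting from the Uniform prior Beta(1, 1), both routes give (5, 3).
    print(sequential_posterior(1, 1, data), batch_posterior(1, 1, data))
```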

· · ·

5. Anticipating Mean and Variance

The mean and variance of the Beta are derived next lecture, but the density's structure already hints at why they will be clean.
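One way to see the hint, sketched here on the assumption that the intended argument is the usual one: multiplying the density by $x$ only raises one exponent.

$$E(X) = \int_0^1 x \cdot c\, x^{a-1}(1-x)^{b-1}\,dx = c \int_0^1 x^{(a+1)-1}(1-x)^{b-1}\,dx.$$

The remaining integrand is the kernel of a $\text{Beta}(a+1,\, b)$ density, so the integral equals the reciprocal of that distribution's normalizing constant. Likewise, the integrand for $E(X^2)$ is the kernel of $\text{Beta}(a+2,\, b)$.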

This "it still looks like a Beta" pattern is what makes the moments easy to obtain.

· · ·

6. Bayes' Billiards: Integer Normalizing Constant

We will not derive the general normalizing constant today, but one important special case — integer parameters — has a beautiful argument due to Bayes (around 1760). For non-negative integers $k$ between $0$ and $n$, the goal is to evaluate:

$$\int_0^1 x^{k}(1-x)^{n-k}\,dx$$

A direct attack would expand $(1-x)^{n-k}$ with the Binomial theorem and integrate term by term — tedious. Instead we find the integral without using calculus, by telling a story. Multiply through by $\binom{n}{k}$ to make the picture nicer; we can adjust for that constant later.

The story

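Take $n + 1$ billiard balls, all white and indistinguishable. There are two equivalent procedures for producing the same final picture:

Story 1 (paint first): paint one ball pink, then throw all $n + 1$ balls onto $[0,1]$, independently and uniformly at random.

Story 2 (paint last): throw all $n + 1$ balls onto $[0,1]$, independently and uniformly at random, then choose one of them uniformly at random and paint it pink.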

Figure: a configuration on the number line, with white balls drawn as dots and the pink ball as a triangle; here $X = 3$ white balls land to its left.

Given only the final configuration, you cannot tell which story generated it — painting at the beginning or the end makes no difference. The two procedures yield the same distribution. That equivalence is the key. Let $X$ be the number of white balls landing to the left of the pink ball, an integer between $0$ and $n$.

Counting it two ways

Story 1 — via the law of total probability

Since the balls are i.i.d. and order of throwing does not matter, throw the pink ball first, fix its position $p$, then ask how many of the $n$ white balls land to its left. Conditional on $p$, "white ball is to the left of pink" is a success with probability $p$, so $X \mid p \sim \text{Bin}(n, p)$. The pink ball's position is $\text{Unif}(0,1)$, so $f(p) = 1$. Thus:

$$P(X = k) = \int_0^1 \binom{n}{k} p^{k}(1-p)^{n-k} \cdot 1 \, dp$$

This is exactly the integral we wanted (dummy variable renamed from $x$ to $p$).

Story 2 — by symmetry

Throw all $n + 1$ balls first, then paint a uniformly random one pink. The pink ball is equally likely to be in any of the $n + 1$ order positions, so the number of white balls to its left is equally likely to be any of $0, 1, \ldots, n$. Hence $X$ is uniform on $\{0, 1, \ldots, n\}$:

$$P(X = k) = \frac{1}{n + 1}, \quad \text{for every } k$$
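The symmetry claim is easy to test by simulation. The sketch below follows the paint-first story (throw a pink ball and $n$ white balls uniformly, then count whites to the left); the function name, trial count, and seed are arbitrary illustrative choices.

```python
import random
from collections import Counter


def simulate_billiards(n, trials=100_000, seed=0):
    """Estimate P(X = k) for k = 0..n, where X counts white balls left of the pink ball."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        pink = rng.random()  # position of the pink ball, Unif(0, 1)
        whites = [rng.random() for _ in range(n)]  # positions of the n white balls
        counts[sum(w < pink for w in whites)] += 1
    return [counts[k] / trials for k in range(n + 1)]


if __name__ == "__main__":
    print(simulate_billiards(4))  # each of the 5 entries should be near 1/5
```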

Conclusion

The two expressions for $P(X = k)$ must agree, so

$$\int_0^1 \binom{n}{k} x^{k}(1-x)^{n-k}\,dx = \frac{1}{n+1}, \qquad\text{i.e.}\qquad \int_0^1 x^{k}(1-x)^{n-k}\,dx = \frac{1}{(n+1)\binom{n}{k}}.$$

This delivers the Beta normalizing constant for integer parameters: matching exponents with $a - 1 = k$ and $b - 1 = n - k$, the constant for $\text{Beta}(k+1,\, n-k+1)$ is $c = (n+1)\binom{n}{k}$. It is read straight off the picture, with no integration by parts and no term-by-term expansion. (Next lecture: the general normalizing constant via the beta function, plus connections to other distributions.)
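For comparison, the "tedious" route can be carried out exactly with rational arithmetic: expand $(1-x)^{n-k}$ with the Binomial theorem, integrate term by term, and check the result against the billiards answer. The function names below are illustrative.

```python
from fractions import Fraction
from math import comb


def integral_by_expansion(n, k):
    """Exact integral of x^k (1-x)^(n-k) over [0, 1], by expanding (1-x)^(n-k)."""
    m = n - k
    # (1-x)^m = sum_j C(m, j) (-x)^j, and the integral of x^(k+j) over [0, 1] is 1/(k+j+1).
    return sum(Fraction((-1) ** j * comb(m, j), k + j + 1) for j in range(m + 1))


def integral_by_story(n, k):
    """The same integral as given by the billiards argument: 1 / ((n+1) C(n, k))."""
    return Fraction(1, (n + 1) * comb(n, k))


if __name__ == "__main__":
    print(integral_by_expansion(2, 1), integral_by_story(2, 1))  # both are 1/6
```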

· · ·

7. Guest Segment: Probability in Quantitative Finance

The second half of the lecture was a guest talk by Stephen Blyth, who teaches Stat 123 (Applied Quantitative Finance). Its only prerequisite is Stat 110; quantitative finance was largely built by mathematicians, statisticians, and physicists who knew no finance, so no finance background is needed. The pedagogical point: probability — and especially LOTUS — is the engine of derivative pricing.

What a derivative is, and the fundamental theorem of finance

A financial derivative is a contract between two parties whose payout at a maturity date is a function of some other random variable — typically the price of a financial asset. The term means "derives from"; it has nothing to do with calculus. Example: a weather derivative that pays \$1 if Boston snowfall exceeds 80 inches this winter — the payout is a function of the random snowfall.

In finance the underlying asset price is written $S$ (for stock) rather than $X$. So $S_T$ is the random price of an asset at time $T$, and a derivative pays $G(S_T)$, itself a random variable.

Fundamental theorem of finance

The price you should pay for the contract is closely tied to the expected value of its payoff; chosen correctly, the price is exactly an expected value:

$$\text{price} = E\big(G(S_T)\big) = \int G(s)\, f(s)\, ds,$$

where $f$ is the PDF of $S_T$. This is just LOTUS. The Black-Scholes formula is the solution of exactly such an integral under a particular PDF for $S_T$: "you can win a Nobel Prize in Economics just for knowing LOTUS."

Brain teaser 1: The foreign-exchange paradox

A simple two-state ("binomial") model for the euro, with one dollar per euro now and equal probabilities for each state:

| State (prob $\tfrac12$ each) | Value of 1 euro | Value of \$1 (reciprocal) |
| --- | --- | --- |
| Up | \$1.25 | $0.80$ euro |
| Down | \$0.80 | $1.25$ euro |

Expected value of a euro in one year: $\tfrac12(1.25) + \tfrac12(0.80) = \$1.025$ — the euro is expected to appreciate. But the same model, restated from the dollar's point of view, gives $\tfrac12(0.80) + \tfrac12(1.25) = 1.025$ euro — the dollar is also expected to appreciate.

Both currencies cannot appreciate against each other. The model is the simplest possible, yet resolving exactly what goes wrong is genuinely non-trivial: Jensen's inequality lurks, since the reciprocal is a convex function and $E(1/S) \neq 1/E(S)$. Stat 123 attacks it several ways.
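The arithmetic behind the paradox fits in a few lines. This sketch only reproduces the two expectations from the model and sets them beside $1/E(S)$; it does not attempt the resolution, and the function name is illustrative.

```python
def fx_expectations(euro_in_dollars=(1.25, 0.80), prob=0.5):
    """Return E(S), E(1/S) and 1/E(S) for the two-state model, each state having probability prob."""
    expected_euro = sum(prob * s for s in euro_in_dollars)          # E(S), in dollars
    expected_dollar = sum(prob * (1 / s) for s in euro_in_dollars)  # E(1/S), in euros
    return expected_euro, expected_dollar, 1 / expected_euro


if __name__ == "__main__":
    e_s, e_inv, inv_e = fx_expectations()
    print(e_s, e_inv, inv_e)  # E(S) and E(1/S) are both about 1.025; 1/E(S) is smaller
```

The gap between $E(1/S)$ and $1/E(S)$ is the Jensen effect mentioned above.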

Brain teaser 2: TARP warrants as call options

In the October 2008 financial crisis, the US government (via TARP, the Troubled Asset Relief Program) received warrants from financial institutions. A warrant is a type of derivative called a call option — the right, not the obligation, to buy. The rounded TARP/Goldman Sachs deal: the government paid about \$450 million for the right to buy roughly 10 million Goldman shares at \$125 each in 10 years, when shares were trading around \$95.

The payoff of a call option with strike price $K$ at maturity $T$ is:

$$G(S_T) = \max(S_T - K,\; 0), \qquad K = 125,\ T = 10\text{ years}$$

Knowing the PDF $f$ of $S_T$, the warrant's price is again just LOTUS: $\int \max(s - K, 0)\, f(s)\, ds$. The payoff is zero whenever $s \le K$, so the $\max$ collapses by raising the lower limit of integration to $K$, giving the tractable integral $\int_K^\infty (s - K)\, f(s)\, ds$, whose solution under the right PDF is the Black-Scholes formula. Two questions Stat 123 pursues: how the government arrived at a price of about \$45 per option, and whether the government is really in the business of estimating probability density functions.
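To make the LOTUS integral concrete, the sketch below evaluates it numerically for one assumed PDF and compares the result with the closed form of that same expectation. Everything specific here is an assumption for illustration: the Log-Normal choice for $S_T$, the parameters `mu` and `sigma`, and the function names. It also ignores discounting and the question of which PDF is the "right" one, so it is not a pricing model and should not be read as the Black-Scholes formula itself.

```python
from math import erf, exp, log, pi, sqrt


def normal_cdf(z):
    """Standard Normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def call_value_numeric(mu, sigma, strike, steps=200_000, z_max=12.0):
    """LOTUS integral of max(s - K, 0) f(s) ds, assuming log(S_T) ~ N(mu, sigma^2).

    Substituting s = exp(mu + sigma * z) turns it into an integral against the
    standard Normal density, starting where s = K (the payoff is 0 below that).
    """
    z_low = (log(strike) - mu) / sigma
    h = (z_max - z_low) / steps
    total = 0.0
    for i in range(steps):
        z = z_low + (i + 0.5) * h  # midpoint rule
        payoff = exp(mu + sigma * z) - strike
        total += payoff * exp(-0.5 * z * z) / sqrt(2.0 * pi)
    return total * h


def call_value_closed_form(mu, sigma, strike):
    """Closed form of the same expectation for a Log-Normal S_T (no discounting)."""
    d2 = (mu - log(strike)) / sigma
    d1 = d2 + sigma
    return exp(mu + 0.5 * sigma**2) * normal_cdf(d1) - strike * normal_cdf(d2)


if __name__ == "__main__":
    # Hypothetical parameters, chosen only for illustration.
    mu, sigma, strike = log(95.0), 0.9, 125.0
    print(call_value_numeric(mu, sigma, strike))
    print(call_value_closed_form(mu, sigma, strike))
```

The two printed numbers should agree closely, which illustrates the point of the passage: once a PDF is fixed, the option value is an ordinary expectation.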