Lecture 25: Order Statistics and Conditional Expectation

Harvard Statistics 110 (Joe Blitzstein)

1. The Bank-Post Office Story: Connecting Gamma and Beta

The beta and gamma distributions are not just consecutive Greek letters — they are deeply connected. This example makes the link concrete and, as a bonus, hands us the beta normalizing constant for free.

Setup

You must visit the bank and then the post office, waiting in line at each.

$X$ = waiting time at the bank, with $X \sim \text{Gamma}(a, \lambda)$.
$Y$ = waiting time at the post office, with $Y \sim \text{Gamma}(b, \lambda)$.
$X$ and $Y$ are independent.

When $a$ is an integer, $\text{Gamma}(a, \lambda)$ has a natural story: it is the sum of $a$ i.i.d. $\text{Expo}(\lambda)$ waiting times — as if $a$ people are ahead of you in a single line, each taking an $\text{Expo}(\lambda)$ amount of time.

We study two quantities:

$T = X + Y$, the total waiting time.
$W = \dfrac{X}{X + Y}$, the fraction of the total spent at the bank.

The total $T$ is easy

For integer $a$ and $b$, $T$ is a sum of $a + b$ i.i.d. exponentials, so $T \sim \text{Gamma}(a + b, \lambda)$ immediately. For non-integer parameters, multiply the two MGFs to confirm the same result. So the marginal of $T$ requires no work.

The real question: are $T$ and $W$ independent?

Intuition is genuinely split. They share the $X + Y$ term (suggesting dependence), yet knowing how long you waited in total tells you little about what fraction was at the bank (suggesting independence). The class vote was close. The matter is not obvious either way, so we compute.

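To reduce notational clutter, set $\lambda = 1$. This loses no generality: $\lambda X \sim \text{Gamma}(a, 1)$ and $\lambda Y \sim \text{Gamma}(b, 1)$, the factor $\lambda$ cancels out of $W$ entirely, and undoing the rescaling merely turns $\text{Gamma}(a + b, 1)$ into $\text{Gamma}(a + b, \lambda)$ for $T$.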

The joint PDF via a change of variables (Jacobian)

We want the joint PDF $f_{T,W}(t, w)$. Start from the joint PDF of $(X, Y)$ and multiply by the Jacobian of the transformation.

Because $X$ and $Y$ are independent, their joint PDF is the product of two $\text{Gamma}(\cdot, 1)$ PDFs:

$$f_{X,Y}(x, y) = \frac{1}{\Gamma(a)}\, x^{a} e^{-x}\, \frac{1}{x} \cdot \frac{1}{\Gamma(b)}\, y^{b} e^{-y}\, \frac{1}{y}$$

The transformation is $t = x + y$ and $w = \dfrac{x}{x + y}$. Invert it (easy algebra, since the denominator of $w$ is $t$):

$x = t\,w$
$y = t\,(1 - w)$

Compute the Jacobian determinant of $(x, y)$ with respect to $(t, w)$:

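$$\frac{\partial(x, y)}{\partial(t, w)} = \begin{pmatrix} \dfrac{\partial x}{\partial t} & \dfrac{\partial x}{\partial w} \\[2mm] \dfrac{\partial y}{\partial t} & \dfrac{\partial y}{\partial w} \end{pmatrix} = \begin{pmatrix} w & t \\ 1 - w & -t \end{pmatrix}$$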

$\det = w \cdot (-t) - t \cdot (1 - w) = -tw - t + tw = -t$. The two $tw$ terms cancel. Taking the absolute value (we cannot have a negative PDF, and $t > 0$), the Jacobian factor is just $t$.
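As a sanity check on the determinant, the sketch below approximates the four partial derivatives by central differences and compares the result with $-t$. The function names and the step size `h` are illustrative choices, not part of the lecture.

```python
def to_xy(t, w):
    """Inverse transformation: (t, w) -> (x, y) = (t*w, t*(1 - w))."""
    return t * w, t * (1.0 - w)


def jacobian_det(t, w, h=1e-6):
    """Central-difference estimate of det d(x, y)/d(t, w)."""
    x_tp, y_tp = to_xy(t + h, w)
    x_tm, y_tm = to_xy(t - h, w)
    x_wp, y_wp = to_xy(t, w + h)
    x_wm, y_wm = to_xy(t, w - h)
    dx_dt = (x_tp - x_tm) / (2 * h)
    dy_dt = (y_tp - y_tm) / (2 * h)
    dx_dw = (x_wp - x_wm) / (2 * h)
    dy_dw = (y_wp - y_wm) / (2 * h)
    return dx_dt * dy_dw - dx_dw * dy_dt


if __name__ == "__main__":
    t, w = 2.0, 0.3
    # The estimate should agree closely with -t; its absolute value is the factor t.
    print(jacobian_det(t, w), -t)
```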

Substituting and reading off the answer

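Plug $x = tw$ and $y = t(1 - w)$ into the joint PDF and multiply by the Jacobian $t$. The numerator becomes $(tw)^{a}\,\big(t(1 - w)\big)^{b}\,e^{-t} = w^{a}(1 - w)^{b}\,t^{a+b}e^{-t}$, and dividing by $xy = t^{2}\,w(1 - w)$ lowers the powers of $w$ and $1 - w$ by one and leaves a factor $1/t^{2}$:

$$f_{T,W}(t, w) = \frac{1}{\Gamma(a)\Gamma(b)}\, w^{a-1}(1 - w)^{b-1} \cdot t^{a+b} e^{-t} \cdot \frac{1}{t^{2}} \cdot t$$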

Collecting powers gives a clean product of a function of $w$ and a function of $t$:

$$f_{T,W}(t, w) = \frac{1}{\Gamma(a)\Gamma(b)}\, w^{a-1}(1 - w)^{b-1} \cdot t^{a+b-1} e^{-t}$$
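The algebra can be checked numerically. The sketch below evaluates $f_{X,Y}(tw,\, t(1-w)) \cdot t$ directly and compares it with the factored form at a few points. The function names and the test values of $a$, $b$, $t$, $w$ are arbitrary illustrative choices.

```python
import math


def gamma_pdf(x, a):
    """Gamma(a, 1) density, written as in the notes: x^a e^{-x} / (x Gamma(a))."""
    return x ** a * math.exp(-x) / (x * math.gamma(a))


def joint_tw_from_xy(t, w, a, b):
    """f_{X,Y}(tw, t(1 - w)) multiplied by the Jacobian factor t."""
    x, y = t * w, t * (1.0 - w)
    return gamma_pdf(x, a) * gamma_pdf(y, b) * t


def joint_tw_factored(t, w, a, b):
    """The simplified form: (function of w) times (function of t)."""
    w_part = w ** (a - 1) * (1.0 - w) ** (b - 1) / (math.gamma(a) * math.gamma(b))
    t_part = t ** (a + b - 1) * math.exp(-t)
    return w_part * t_part


if __name__ == "__main__":
    a, b = 2.5, 4.0
    for t, w in [(0.5, 0.2), (3.0, 0.7), (7.5, 0.95)]:
        # The two columns should match up to floating-point rounding.
        print(joint_tw_from_xy(t, w, a, b), joint_tw_factored(t, w, a, b))
```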

Key result

Because the joint PDF factors as (a function of $w$) times (a function of $t$), $T$ and $W$ are independent. This was not at all obvious in advance, but the factorization proves it.

Three conclusions from one calculation

The $t$-part already looks like a $\text{Gamma}(a + b, 1)$ PDF — it is missing only its normalizing constant $1/\Gamma(a + b)$. Insert that constant by multiplying and dividing by $\Gamma(a + b)$:

$$f_{T,W}(t, w) = \underbrace{\frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\, w^{a-1}(1 - w)^{b-1}}_{\text{function of } w} \cdot \underbrace{\frac{1}{\Gamma(a + b)}\, t^{a+b-1} e^{-t}}_{\text{function of } t}$$

Now integrate out $t$ to get the marginal of $W$. The $t$-factor is a genuine $\text{Gamma}(a + b, 1)$ PDF and integrates to $1$, so the entire $w$-factor passes through as a constant:

$$f_W(w) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\, w^{a-1}(1 - w)^{b-1}, \qquad 0 < w < 1.$$

This is exactly the $\text{Beta}(a, b)$ PDF. We set out to study waiting times and stumbled onto the beta normalizing constant — a recurring theme in mathematics: solving an easier problem reveals the answer to a harder one you could not attack directly (finding the beta constant by direct integration is genuinely difficult).

Three results, all proved at once

$T = X + Y \sim \text{Gamma}(a + b, 1)$.
$W = \dfrac{X}{X + Y} \sim \text{Beta}(a, b)$.
$T$ and $W$ are independent.

This independence is special. There is a theorem (hard to prove) that replacing the gamma with any other distribution destroys it — independence of the sum and the ratio is a defining property of the gamma-beta pair.
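A quick simulation is consistent with all three results. The sketch below draws independent gammas with Python's `random.gammavariate` (whose second argument is the scale, set to 1 here), then looks at the sample mean of $T$, the sample mean of $W$, and the sample correlation between them. The helper names, seed, and sample size are illustrative. A correlation near zero is only a consistency check: it does not prove independence, which is what the factorization above establishes.

```python
import random
import statistics


def simulate_tw(a, b, n, seed=0):
    """Draw n pairs (T, W) from independent X ~ Gamma(a, 1), Y ~ Gamma(b, 1)."""
    rng = random.Random(seed)
    ts, ws = [], []
    for _ in range(n):
        x = rng.gammavariate(a, 1.0)  # second argument is the scale, here 1
        y = rng.gammavariate(b, 1.0)
        ts.append(x + y)
        ws.append(x / (x + y))
    return ts, ws


def correlation(us, vs):
    """Sample Pearson correlation, written out with the standard library only."""
    mu, mv = statistics.fmean(us), statistics.fmean(vs)
    cov = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    var_u = sum((u - mu) ** 2 for u in us)
    var_v = sum((v - mv) ** 2 for v in vs)
    return cov / (var_u * var_v) ** 0.5


if __name__ == "__main__":
    a, b = 3.0, 5.0
    ts, ws = simulate_tw(a, b, 50_000, seed=1)
    print("mean of T:", statistics.fmean(ts), "vs a + b =", a + b)
    print("mean of W:", statistics.fmean(ws), "vs a / (a + b) =", a / (a + b))
    print("corr(T, W):", correlation(ts, ws), "(should be near 0)")
```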

· · ·

2. Computing $E(W)$ for the Beta

As an application — and a preview of the next homework — find the mean of $W \sim \text{Beta}(a, b)$.

Method 1: LOTUS / definition

Write the integral of $w$ times the $\text{Beta}(a, b)$ PDF. The integrand is just $w^{a}(1 - w)^{b-1}$ times the constant, which is the shape of a $\text{Beta}(a + 1, b)$ PDF; recognizing the pattern lets you read off the answer from that distribution's normalizing constant. The same approach handles $E(W^{12})$ or $E(W^{100})$ with no extra difficulty, because the integrand still has the form $w^{\text{power}}(1 - w)^{\text{power}}$.
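Carrying out the pattern recognition, $\int_0^1 w^{a+k-1}(1 - w)^{b-1}\,dw$ is the reciprocal of the $\text{Beta}(a + k, b)$ normalizing constant, so $E(W^k)$ is a ratio of two beta constants. The sketch below computes that ratio and compares it with a plain midpoint-rule integral. The function names are illustrative, and `math.gamma` overflows for large arguments, so this version is only meant for moderate $a$, $b$, $k$.

```python
import math


def beta_const(a, b):
    """Normalizing constant Gamma(a + b) / (Gamma(a) Gamma(b)) of the Beta(a, b) PDF."""
    return math.gamma(a + b) / (math.gamma(a) * math.gamma(b))


def beta_moment(a, b, k):
    """E(W^k) for W ~ Beta(a, b), by recognizing a Beta(a + k, b) integrand."""
    return beta_const(a, b) / beta_const(a + k, b)


def beta_moment_numeric(a, b, k, steps=20_000):
    """Midpoint-rule approximation of the same integral, as an independent check."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        w = (i + 0.5) * h
        total += w ** (a + k - 1) * (1.0 - w) ** (b - 1)
    return beta_const(a, b) * total * h


if __name__ == "__main__":
    a, b = 2.0, 3.0
    for k in (1, 2, 12):
        # Closed form and numerical integral should agree closely.
        print(k, beta_moment(a, b, k), beta_moment_numeric(a, b, k))
```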

Method 2: the gamma-beta interpretation (the slick way)

A $\text{Beta}(a, b)$ random variable has one well-defined mean regardless of how it is generated, so we are free to pick the most convenient construction. Choose the bank-post office one: $W = \dfrac{X}{X + Y}$ with $X \sim \text{Gamma}(a, 1)$, $Y \sim \text{Gamma}(b, 1)$, independent.

Common mistake

A tempting but usually wrong move is to write $E\!\left(\dfrac{X}{X+Y}\right) = \dfrac{E(X)}{E(X+Y)}$. This is not linearity. In general it is false; here it is valid for a specific reason.

The justification: $T = X + Y$ and $W = \dfrac{X}{X + Y}$ are independent, hence uncorrelated. By definition, uncorrelated means $E(W \cdot T) = E(W) \cdot E(T)$. But

$$W \cdot T = \frac{X}{X + Y} \cdot (X + Y) = X,$$

therefore

$$E(X) = E(W) \cdot E(X + Y) \quad\Longrightarrow\quad E(W) = \frac{E(X)}{E(X + Y)}.$$

Since $E(X) = a$ (mean of $\text{Gamma}(a, 1)$) and $E(X + Y) = a + b$, we get

$$E(W) = \frac{a}{a + b}.$$

So the mean of a $\text{Beta}(a, b)$ is $\dfrac{a}{a + b}$. The same independence machinery yields the variance and other functionals.
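To see both the result and the warning in action, the sketch below estimates $E\!\left(\frac{X}{X+Y}\right)$ and $\frac{E(X)}{E(X+Y)}$ by simulation. With a common rate the two agree; multiplying $Y$ by a constant (so the two gammas no longer share a rate) breaks the independence of $T$ and $W$, and the two quantities separate. The function name, the `scale_y` parameter, and the seed are illustrative assumptions.

```python
import random
import statistics


def ratio_vs_naive(a, b, scale_y, n, seed=0):
    """Estimate E(X/(X+Y)) and E(X)/E(X+Y) by simulation.

    X ~ Gamma(a, 1); Y is a Gamma(b, 1) draw multiplied by scale_y.
    scale_y = 1 reproduces the bank-post office setup; any other value
    breaks the common-rate assumption.
    """
    rng = random.Random(seed)
    xs = [rng.gammavariate(a, 1.0) for _ in range(n)]
    ys = [scale_y * rng.gammavariate(b, 1.0) for _ in range(n)]
    mean_of_ratio = statistics.fmean([x / (x + y) for x, y in zip(xs, ys)])
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    ratio_of_means = mean_x / (mean_x + mean_y)
    return mean_of_ratio, ratio_of_means


if __name__ == "__main__":
    # Common rate: both estimates should be close to a / (a + b).
    print(ratio_vs_naive(2.0, 3.0, scale_y=1.0, n=100_000, seed=1))
    # Different rates: the two estimates no longer agree.
    print(ratio_vs_naive(2.0, 3.0, scale_y=10.0, n=100_000, seed=1))
```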

· · ·

3. Order Statistics

Definition

Given random variables $X_1, \ldots, X_n$ that are i.i.d., the order statistics are the same values sorted into increasing order:

$$X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$$

The parenthesized subscripts are standard notation:

$X_{(1)}$ is the minimum.
$X_{(2)}$ is the second smallest, $X_{(3)}$ the third smallest, and so on.
$X_{(n)}$ is the maximum.

The max and min are the two extreme order statistics, but everything in between matters too. The median is the middle one: for odd $n$ it is $X_{((n+1)/2)}$; for even $n$ the usual convention averages the two middle values. More generally, order statistics give quantiles and percentiles — the value below which, say, $75\%$ of the observations fall. The interquartile range (75th percentile minus 25th percentile) is one of many statistical quantities defined through order statistics, which is why there is a vast literature on them: data often arrive in an arbitrary order you do not care about, and you instead want ranks, extremes, and middle values.

Why they are hard: dependence

Key insight

Even though $X_1, \ldots, X_n$ start out independent, the order statistics are dependent, and in fact positively correlated. The maximum is always at least the minimum, so learning that the minimum is large forces the maximum to be at least as large. More generally, the 7th order statistic is at least the 3rd, so they carry information about each other.

A key earlier insight: for two variables $X$ and $Y$, the covariance of the max and the min is not the covariance of $X$ and $Y$ (which is zero when $X$ and $Y$ are independent). You cannot relabel one order statistic as "$X$" and the other as "$Y$".

We focus on the continuous case. In the discrete case, ties (two variables exactly equal) are a serious problem. For continuous random variables the probability of an exact tie is zero, so ties can be ignored.

The CDF of the $j$-th order statistic

Let $X_1, \ldots, X_n$ be i.i.d. continuous with PDF $f$ and CDF $F$. We want the CDF and PDF of $X_{(j)}$.

Drawing a number line helps. The event $\{X_{(j)} \le x\}$ means the $j$-th smallest value lies at or to the left of $x$. That happens exactly when at least $j$ of the values lie at or to the left of $x$:

$$\{X_{(j)} \le x\} \;=\; \{\text{at least } j \text{ of the } X_i \text{ are} \le x\}$$

Break "at least $j$" into disjoint cases by the exact count $k$ of values to the left of $x$, where $k$ ranges from $j$ to $n$. For a single $X_i$, landing to the left of $x$ is a "success" with probability $F(x)$; landing to the right is a "failure" with probability $1 - F(x)$. The number of successes among $n$ independent trials is $\text{Bin}(n, F(x))$. Summing the binomial PMF over $k = j$ to $n$:

$$P(X_{(j)} \le x) = \sum_{k=j}^{n} \binom{n}{k} F(x)^{k}\,[1 - F(x)]^{\,n-k}$$
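The sum translates directly into code. The sketch below takes the value $F(x)$ as an input, so it works for any continuous distribution; the function name is an illustrative choice. Two easy special cases serve as checks: $j = n$ gives $F(x)^n$ (all values at or below $x$) and $j = 1$ gives $1 - [1 - F(x)]^n$ (not all values above $x$).

```python
from math import comb


def order_stat_cdf(j, n, F_x):
    """P(X_(j) <= x), given F_x = F(x), as the upper tail of Bin(n, F_x) from k = j."""
    return sum(comb(n, k) * F_x ** k * (1.0 - F_x) ** (n - k) for k in range(j, n + 1))


if __name__ == "__main__":
    n, F_x = 5, 0.3
    print(order_stat_cdf(n, n, F_x), F_x ** n)                # maximum
    print(order_stat_cdf(1, n, F_x), 1 - (1 - F_x) ** n)      # minimum
    print([order_stat_cdf(j, n, F_x) for j in range(1, n + 1)])  # decreasing in j
```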

The PDF of the $j$-th order statistic

Differentiating that sum is ugly, so derive the PDF directly from a picture instead. Recall that $f(x)\,dx$ is approximately the probability of landing in a tiny interval of width $dx$ around $x$. Interpret $f_{X_{(j)}}(x)\,dx$ as the probability that the $j$-th order statistic falls in that infinitesimal interval. For that to happen:

× × [ • ] × × ×

j−1 to the left • one in the dx interval at x • n−j to the right

One of the $n$ observations lands in the tiny interval: $n$ choices for which one, each with probability $f(x)\,dx$.
Of the remaining $n - 1$ observations, exactly $j - 1$ lie to the left of the interval: choose which ones, $\binom{n-1}{j-1}$ ways, each landing left with probability $F(x)$, giving $F(x)^{j-1}$.
The other $n - j$ lie to the right, each with probability $1 - F(x)$, giving $[1 - F(x)]^{\,n-j}$.

Cancelling the $dx$ gives the marginal PDF:

$$f_{X_{(j)}}(x) = n \binom{n-1}{j-1} F(x)^{j-1}\,[1 - F(x)]^{\,n-j}\, f(x)$$

This neat formula uses both the CDF (the two powered factors) and the PDF (the trailing $f(x)$). Differentiating the CDF sum and simplifying yields the same expression.
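The agreement between the picture-based PDF and the derivative of the CDF sum can be checked numerically. The sketch below uses $\text{Expo}(1)$ as the underlying distribution (an arbitrary choice for illustration, with $F(x) = 1 - e^{-x}$ and $f(x) = e^{-x}$) and compares the formula with a central-difference derivative of the CDF. Function names and the step size are assumptions.

```python
import math
from math import comb


def order_stat_pdf(j, n, F_x, f_x):
    """PDF of X_(j) at x, given F_x = F(x) and f_x = f(x)."""
    return n * comb(n - 1, j - 1) * F_x ** (j - 1) * (1.0 - F_x) ** (n - j) * f_x


def order_stat_cdf(j, n, F_x):
    """CDF of X_(j) at x: probability that at least j of the n values are <= x."""
    return sum(comb(n, k) * F_x ** k * (1.0 - F_x) ** (n - k) for k in range(j, n + 1))


def expo_cdf(x):
    """CDF of Expo(1)."""
    return 1.0 - math.exp(-x)


def pdf_from_cdf(j, n, x, h=1e-5):
    """Central-difference derivative of the order-statistic CDF for Expo(1) data."""
    upper = order_stat_cdf(j, n, expo_cdf(x + h))
    lower = order_stat_cdf(j, n, expo_cdf(x - h))
    return (upper - lower) / (2 * h)


if __name__ == "__main__":
    j, n, x = 3, 7, 0.8
    # The formula and the numerical derivative should agree closely.
    print(order_stat_pdf(j, n, expo_cdf(x), math.exp(-x)), pdf_from_cdf(j, n, x))
```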

Stat 110 focuses on the marginal order statistics, but the same infinitesimal-interval picture, drawn with two tiny intervals, gives the joint PDF of two order statistics (for example, the 3rd and the 7th).

· · ·

4. Uniform Order Statistics: the Beta Connection

The cleanest and most useful example: let $U_1, \ldots, U_n$ be i.i.d. $\text{Unif}(0, 1)$, and find the distribution of $U_{(j)}$.

For the standard uniform, the CDF increases linearly, $F(x) = x$ on $[0, 1]$, and the PDF is $f(x) = 1$. Substituting into the order-statistic PDF:

$$f_{U_{(j)}}(x) = n \binom{n-1}{j-1} x^{j-1}\,(1 - x)^{\,n-j} \cdot 1, \qquad 0 < x < 1,$$

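and $0$ otherwise. The shape $x^{\text{power}}(1 - x)^{\text{power}}$ is a beta, with $a - 1 = j - 1$ and $b - 1 = n - j$. The constant is whatever it must be to integrate to $1$, and it does agree with the beta normalizing constant derived earlier: $n\binom{n-1}{j-1} = \dfrac{n!}{(j-1)!\,(n-j)!} = \dfrac{\Gamma(n+1)}{\Gamma(j)\,\Gamma(n-j+1)}$. Therefore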

$$U_{(j)} \sim \text{Beta}(j,\; n - j + 1)$$

The $j$-th order statistic of $n$ i.i.d. standard uniforms is exactly a beta.
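A simulation makes the claim tangible. The sketch below sorts $n$ uniforms many times, keeps the $j$-th smallest, and compares the sample mean and variance with draws from `random.betavariate(j, n - j + 1)` and with the beta mean $\frac{a}{a+b} = \frac{j}{n+1}$ from Section 2. The function name, seeds, and the values of $j$, $n$ are illustrative.

```python
import random
import statistics


def uniform_order_stat_samples(j, n, reps, seed=0):
    """Simulate U_(j), the j-th smallest of n i.i.d. Unif(0, 1) draws, reps times."""
    rng = random.Random(seed)
    return [sorted(rng.random() for _ in range(n))[j - 1] for _ in range(reps)]


if __name__ == "__main__":
    j, n, reps = 3, 10, 20_000
    sim = uniform_order_stat_samples(j, n, reps, seed=1)
    rng = random.Random(2)
    beta = [rng.betavariate(j, n - j + 1) for _ in range(reps)]
    print("U_(j):      mean", statistics.fmean(sim), "variance", statistics.variance(sim))
    print("Beta draws: mean", statistics.fmean(beta), "variance", statistics.variance(beta))
    print("j / (n + 1) =", j / (n + 1))
```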

Recovering an earlier result

This ties back to an old 2-D LOTUS computation: the expected absolute difference of two uniforms, $E|U_1 - U_2| = \frac{1}{3}$. The absolute difference equals max minus min:

$$E|U_1 - U_2| = E\big(\max(U_1, U_2)\big) - E\big(\min(U_1, U_2)\big)$$

For $n = 2$: the max is $U_{(2)} \sim \text{Beta}(2, 1)$ with mean $\frac{2}{3}$, and the min is $U_{(1)} \sim \text{Beta}(1, 2)$ with mean $\frac{1}{3}$. So the difference is $\frac{2}{3} - \frac{1}{3} = \frac{1}{3}$, confirming the earlier answer without any 2-D integral.

The max and min are not independent, but linearity of expectation holds regardless of dependence — that is what makes this shortcut valid.
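The shortcut is easy to confirm by simulation. The sketch below estimates $E(\max)$, $E(\min)$, and $E|U_1 - U_2|$ from the same pairs of uniforms; the function name, seed, and sample size are illustrative.

```python
import random
import statistics


def max_min_gap_means(reps, seed=0):
    """Sample means of max, min, and |U1 - U2| for pairs of i.i.d. Unif(0, 1)."""
    rng = random.Random(seed)
    maxes, mins, gaps = [], [], []
    for _ in range(reps):
        u1, u2 = rng.random(), rng.random()
        maxes.append(max(u1, u2))
        mins.append(min(u1, u2))
        gaps.append(abs(u1 - u2))
    return statistics.fmean(maxes), statistics.fmean(mins), statistics.fmean(gaps)


if __name__ == "__main__":
    mean_max, mean_min, mean_gap = max_min_gap_means(100_000, seed=1)
    print("E(max) estimate:", mean_max, "vs 2/3")
    print("E(min) estimate:", mean_min, "vs 1/3")
    # Linearity: the gap mean equals the difference of the two means.
    print("E|U1 - U2| estimate:", mean_gap, "vs", mean_max - mean_min)
```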

· · ·

5. A First Look at Conditional Expectation

The next big topic is conditional expectation. If you understand conditional probability, you already understand the idea: it just means taking the expected value using the conditional distribution rather than the unconditional one.

Write $E(X \mid A)$, where $A$ is an event, to mean the expectation of $X$ computed under the distribution of $X$ given $A$.

Law of total expectation

For any event $A$:

$$E(X) = E(X \mid A)\,P(A) + E(X \mid A^c)\,P(A^c)$$

This is the expectation version of the law of total probability.

To prove it in the discrete case, write $E(X)$ as the sum of values times probabilities and expand each probability via the law of total probability; splitting into two sums reproduces the two conditional terms.
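A small exact example: let $X$ be a fair die roll and $A$ the event that the roll is even. This example is an illustration added here, not one from the lecture, and for simplicity the event is defined through $X$ itself. The sketch below builds the two conditional PMFs with exact fractions and recombines them.

```python
from fractions import Fraction


def expectation(pmf):
    """E(X) for a PMF given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())


def conditional_pmf(pmf, event):
    """Return the PMF of X given {X in event}, together with P(X in event)."""
    p_event = sum(p for x, p in pmf.items() if x in event)
    cond = {x: p / p_event for x, p in pmf.items() if x in event}
    return cond, p_event


if __name__ == "__main__":
    die = {k: Fraction(1, 6) for k in range(1, 7)}
    evens, odds = {2, 4, 6}, {1, 3, 5}
    pmf_a, p_a = conditional_pmf(die, evens)
    pmf_ac, p_ac = conditional_pmf(die, odds)
    total = expectation(pmf_a) * p_a + expectation(pmf_ac) * p_ac
    # E(X | A) P(A) + E(X | A^c) P(A^c) should equal E(X).
    print(expectation(pmf_a), expectation(pmf_ac), total, expectation(die))
```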

Aside: the two-envelope paradox

A puzzle to mull over: you are handed two identical-looking envelopes, each containing a check. One holds exactly twice as much money as the other, but you cannot tell which is which. You pick one.

Suppose you open it and see $\$100$. The other envelope seems equally likely to hold $\$50$ or $\$200$, averaging $\$125$, so you should switch. But the argument never used the value $\$100$. Call your envelope's amount $X$; the other is either $X/2$ or $2X$, averaging $\frac{1}{2}\!\left(\frac{X}{2} + 2X\right) = 1.25X > X$, so you should always switch — even without opening. By symmetry you should then switch back, and switch forever, which is absurd.

This deserves the name "paradox" more than Monty Hall does. The resolution is left for later; the flaw lies in the unstated assumption that, given your envelope shows any amount, the other is equally likely to be double or half.