Lecture 28: Inequalities

Harvard Statistics 110 (Joe Blitzstein)

1. Warm-Up: A Sum of a Random Number of Random Variables

Before the main topic, one more conditional-expectation problem — a very general and natural setup.

The setup

A store gets a random number of customers, and each customer spends a random amount.

$N$ = number of customers in some time period (a random variable; a Poisson would be a reasonable guess for its distribution, but we don't need to specify it).
$X_j$ = amount the $j$-th customer spends, with mean $E(X_j) = \mu$ and variance $\operatorname{Var}(X_j) = \sigma^2$ for every $j$.
The $X_j$ all share the same mean and variance, and are independent (not necessarily IID — just independent, so no customer reacts to what an earlier customer spent).
$N$ is independent of the sequence of expenditures (the number of customers tells you nothing about how much any individual chooses to spend).

We want the mean and variance of the total revenue:

$$X = \sum_{j=1}^{N} X_j$$

What makes this sum unusual is that the upper index $N$ is itself random — we are adding up a random number of random variables.

Why the naive answer is wrong

A first instinct is linearity: "there are $N$ terms, each with mean $\mu$, so $E(X) = N\mu$." This is immediately, obviously wrong — for a structural reason, not an arithmetic one.

Category error

$E(X)$ must be a number, but the right-hand side $N\mu$ contains the random variable $N$. A number cannot equal a random variable. The blunder is useful, though: we wish $N$ were a constant — so let's condition on it.

Fixing it by conditioning on N

Written out longhand (the analog of the law of total probability):

$$E(X) = \sum_{n=0}^{\infty} E(X \mid N = n)\, P(N = n)$$

Given $N = n$ we have a fixed number of terms, so linearity applies — and because $N$ is independent of the $X_j$, after plugging in $N = n$ we may forget the condition (unlike the two-envelope paradox, where dropping the condition was illegal):

$$E(X \mid N = n) = n\mu$$

Substituting back:

$$E(X) = \sum_{n} (n\mu)\, P(N = n) = \mu \sum_{n} n\, P(N = n) = \mu\, E(N)$$

The correction is that the count $N$ must be replaced by its expectation $E(N)$.

The same calculation via Adam's law (iterated expectation) is more compact:

$$E(X) = E\big(E(X \mid N)\big) = E(\mu N) = \mu\, E(N)$$

Here $E(X \mid N) = \mu N$ is obtained by treating $N$ as a known constant and applying linearity. The longhand and shorthand mean exactly the same thing; both are worth being comfortable with.
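
The longhand and shorthand calculations can be checked exactly on a small example. The sketch below uses made-up toy distributions for $N$ and for the spend per customer (the names `pmf_N` and `pmf_spend` are hypothetical, not from the lecture) and, for simplicity, takes the spends to be IID, which is a special case of the setup above.

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy distributions, chosen only for illustration.
pmf_N = {0: Fraction(1, 4), 1: Fraction(1, 4), 2: Fraction(1, 2)}  # number of customers
pmf_spend = {10: Fraction(1, 2), 30: Fraction(1, 2)}               # spend per customer

def mean(pmf):
    """Expected value of a finite pmf given as {value: probability}."""
    return sum(value * prob for value, prob in pmf.items())

def conditional_mean_total(n):
    """E(X | N = n), by brute-force enumeration of all n-tuples of spends."""
    total = Fraction(0)
    for combo in product(pmf_spend.items(), repeat=n):
        prob = Fraction(1)
        amount = 0
        for value, p in combo:
            prob *= p
            amount += value
        total += amount * prob
    return total

mu = mean(pmf_spend)

# Longhand: sum over n of E(X | N = n) P(N = n).
longhand = sum(conditional_mean_total(n) * p for n, p in pmf_N.items())

# Shorthand (Adam's law): mu * E(N).
shorthand = mu * mean(pmf_N)

print(longhand, shorthand)
```

Both routes give the same number here because the enumeration never uses $N$ except through the fixed value $n$, which is exactly what independence of $N$ and the $X_j$ buys.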

Variance via Eve's law

For the variance, use Eve's law (the law of total variance), again conditioning on $N$:

$$\operatorname{Var}(X) = E\big(\operatorname{Var}(X \mid N)\big) + \operatorname{Var}\big(E(X \mid N)\big)$$

$\operatorname{Var}(X \mid N)$: treating $N$ as known, $X$ is a sum of $N$ independent terms. Variances of independent variables add (no covariance terms), so $\operatorname{Var}(X \mid N) = N\sigma^2$.
$E(X \mid N) = \mu N$, from the mean calculation above.

Therefore:

$$\operatorname{Var}(X) = E(N\sigma^2) + \operatorname{Var}(\mu N) = \sigma^2\, E(N) + \mu^2\, \operatorname{Var}(N)$$

If $N \sim \text{Pois}(\lambda)$, then $E(N) = \operatorname{Var}(N) = \lambda$ and you'd just plug those in, but this form is more general.
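
As a check on the Eve's-law formula, the following sketch builds the exact distribution of the total $X$ for a small made-up example and compares its variance with $\sigma^2 E(N) + \mu^2 \operatorname{Var}(N)$. The toy distributions are hypothetical, and the spends are taken to be IID for simplicity.

```python
from fractions import Fraction
from itertools import product

# Hypothetical toy distributions (not from the lecture).
pmf_N = {0: Fraction(1, 4), 1: Fraction(1, 4), 2: Fraction(1, 2)}
pmf_spend = {10: Fraction(1, 2), 30: Fraction(1, 2)}

def mean(pmf):
    return sum(v * p for v, p in pmf.items())

def variance(pmf):
    m = mean(pmf)
    return sum((v - m) ** 2 * p for v, p in pmf.items())

# Build the exact pmf of the total X by mixing over the values of N.
pmf_total = {}
for n, p_n in pmf_N.items():
    for combo in product(pmf_spend.items(), repeat=n):
        amount = sum(v for v, _ in combo)
        prob = p_n
        for _, p in combo:
            prob *= p
        pmf_total[amount] = pmf_total.get(amount, Fraction(0)) + prob

# Variance computed directly from the distribution of X.
direct = variance(pmf_total)

# Variance from Eve's law: sigma^2 E(N) + mu^2 Var(N).
mu, sigma2 = mean(pmf_spend), variance(pmf_spend)
eve = sigma2 * mean(pmf_N) + mu ** 2 * variance(pmf_N)

print(direct, eve)
```

In this example the within-group term $\sigma^2 E(N)$ contributes 125 and the between-group term $\mu^2 \operatorname{Var}(N)$ contributes 275, so most of the variability in revenue comes from not knowing how many customers show up.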

Sanity checks

Mean: intuition

$E(X) = \mu\, E(N)$ reads as: average revenue $=$ (average number of customers) $\times$ (average spend per customer).

The variance passes a units check. The count $N$ has no units (counting people). If spend is in dollars, then $\mu$ is dollars and $\sigma^2$ is dollars-squared. So $\sigma^2 E(N)$ is dollars-squared and $\mu^2 \operatorname{Var}(N)$ is dollars-squared — consistent, and their sum is dollars-squared. Taking the square root gives a standard deviation in dollars (which is why we prefer SD to variance for interpretation). A formula producing dollars-cubed added to dollars-to-the-fourth would have signalled an error.

The same conditioning trick gives the MGF: given $N = n$, the MGF of a sum of $n$ independent terms is the product of their MGFs, so conditioning on $N$ yields the MGF of the random sum in general.
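
To sketch that remark under one extra assumption (that the $X_j$ are IID with a common MGF $M(t)$, which is stronger than what the setup above requires), Adam's law gives

$$E\big(e^{tX}\big) = E\big(E(e^{tX} \mid N)\big) = E\big(M(t)^N\big)$$

For instance, if $N \sim \text{Pois}(\lambda)$ and $M(t)$ is finite, then

$$E\big(M(t)^N\big) = \sum_{n=0}^{\infty} M(t)^n\, \frac{e^{-\lambda} \lambda^n}{n!} = e^{\lambda (M(t) - 1)}$$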

· · ·

2. Why Inequalities Deserve Attention: Bounds vs. Approximations

The main topic is the four statistical inequalities needed in Stat 110. Inequalities get less attention than they deserve in most courses, and there is a recurring student mistake worth heading off: confusing an approximation with a bound.

An approximation (Poisson approximation, normal approximation) says a distribution is close to some target under certain conditions. Extremely useful when an exact computation is too hard — but "close" is fuzzy.
An inequality is a proven statement: the probability is less than $0.37$, full stop. It may be crude, but it is a theorem.

A statistician serving as an expert witness in court is far better off with an inequality than an approximation. Under cross-examination, "this approximation is good" invites "how good, and by whose standard?" — and if you actually knew how close it was, you'd know the true answer. There is no accepted standard for "good." By contrast, "I proved the probability is below $0.37$" leaves little to attack. The interesting twist: there is still genuine randomness, yet we have proven a definite fact about something random.

The four inequalities below trade tightness for simplicity and generality: the bounds can be crude, but they hold for essentially any random variable.

· · ·

3. Cauchy-Schwarz

Cauchy-Schwarz inequality

For random variables $X$ and $Y$:

$$E(XY) \le \sqrt{E(X^2)\, E(Y^2)}$$

(You may put absolute values around the left side; still true.)

Interpretation

Recall the geometric view of conditional expectation, where $E(XY)$ plays the role of a dot product. Read that way, this is exactly the Cauchy-Schwarz inequality familiar from linear algebra.

Computing $E(XY)$ in general requires the joint distribution (via 2D LOTUS, or finding the distribution of the product $XY$ through a Jacobian) — potentially very messy. Cauchy-Schwarz bounds that joint quantity by two marginal second moments: $E(X^2)$ involves only $X$, and $E(Y^2)$ involves only $Y$. It separates the two variables.

If $X$ and $Y$ are uncorrelated, then by definition $E(XY) = E(X)E(Y)$, an exact equality — so the inequality is useless there (the direction even makes sense, since $E(X^2) \ge E(X)^2$). It earns its keep in the correlated case, but it is comforting that it holds in all cases without splitting into sub-cases.

Connection to correlation (the zero-mean case)

The statistical meaning is clearest when $X$ and $Y$ have mean zero. Then the covariance is $E(XY) - E(X)E(Y) = E(XY)$, and each variance equals its second moment, so the correlation is

$$\operatorname{Corr}(X, Y) = \frac{E(XY)}{\sqrt{E(X^2)\, E(Y^2)}}$$

We already proved that correlation lies in $[-1, 1]$; taking absolute values, that statement is identical to Cauchy-Schwarz. So in statistics, Cauchy-Schwarz says exactly that correlation is between $-1$ and $1$. No separate proof is needed — the general (non-zero-mean) version is a small linear-algebra extension of the same fact.
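
Both statements can be checked numerically on a small joint distribution. The sketch below uses a made-up joint pmf for $(X, Y)$ (the values are hypothetical); it compares $E(XY)$ with the Cauchy-Schwarz bound, then centers the variables and computes the correlation.

```python
from fractions import Fraction
from math import sqrt

# Hypothetical joint pmf of (X, Y); the numbers are made up for illustration.
joint = {
    (-1, -2): Fraction(3, 10),
    (-1, 1): Fraction(1, 10),
    (0, 0): Fraction(2, 10),
    (2, 1): Fraction(1, 10),
    (2, 3): Fraction(3, 10),
}

def expect(f):
    """E[f(X, Y)] under the joint pmf (a 2D LOTUS sum)."""
    return sum(f(x, y) * p for (x, y), p in joint.items())

# Left side needs the joint distribution; the bound needs only the marginals.
e_xy = expect(lambda x, y: x * y)
bound = sqrt(expect(lambda x, y: x * x) * expect(lambda x, y: y * y))

# Center both variables, so the zero-mean formula for correlation applies.
mx, my = expect(lambda x, y: x), expect(lambda x, y: y)
cov = expect(lambda x, y: (x - mx) * (y - my))
var_x = expect(lambda x, y: (x - mx) ** 2)
var_y = expect(lambda x, y: (y - my) ** 2)
corr = float(cov) / sqrt(var_x * var_y)

print(float(e_xy), bound, corr)
```

Applying Cauchy-Schwarz to the centered variables $X - E(X)$ and $Y - E(Y)$ is what keeps `corr` inside $[-1, 1]$.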

· · ·

4. Jensen's Inequality

Jensen's inequality

If $g$ is a convex function, then for any random variable $X$:

$$E\big(g(X)\big) \ge g\big(E(X)\big)$$

This is valuable because it tells you which way the inequality goes. One of the biggest blunders in probability is to "move the $E$ around" freely — you cannot. For convex $g$, Jensen pins down the direction.

Convex and concave

If the second derivative exists, convex means $g''(x) \ge 0$ everywhere (the easiest test). The canonical example is $g(x) = x^2$: $g'(x) = 2x$, $g''(x) = 2 > 0$, so $x^2$ is convex (the familiar upward U-shape — "concave up" in AP-Calculus terminology, which essentially no one uses afterward).

A function $h$ is concave if $h''(x) \le 0$; then the inequality simply flips: $E(h(X)) \le h(E(X))$. No separate theory is needed — replace $h$ by $-h$, which flips the sign of the second derivative, apply Jensen, and the minus sign reverses the inequality.

The geometric definition is more general (it doesn't need a second derivative): $g$ is convex if, for any two points on the curve, the line segment joining them lies on or above the curve. By this definition $|x|$ is convex (a V-shape) even though its derivative doesn't exist at $0$.

Remembering the direction

Use the parabola. We already know $E(X^2) \ge E(X)^2$ because variance is nonnegative — and that is exactly Jensen for $g(x) = x^2$. If you ever forget which way Jensen points, fall back on your friendly parabola.

The old mnemonic "concave up holds water" is both unhelpful and wrong — a mathematician even wrote a paper titled "Does concave-up-holds-water hold water?" and the answer was no. Better to just memorize one convex example.

Worked examples

Function $g(x)$	$g''(x)$	Convex / concave ($x > 0$)	Jensen gives
$x^2$	$2$	convex	$E(X^2) \ge E(X)^2$
$1/x$	$2/x^3$	convex	$E(1/X) \ge 1/E(X)$
$\log x$	$-1/x^2$	concave	$E(\log X) \le \log E(X)$

(For $1/x$ and $\log x$, take $X$ to be a positive random variable so we avoid dividing by zero or taking the log of a non-positive number.)

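Equality in Jensen holds only when $g(X) = a + bX$ with probability $1$ for some constants $a$ and $b$, that is, when $g$ is linear on the values $X$ can take. For a strictly convex $g$ such as $x^2$, that forces $X$ to be a constant (for $x^2$, it says the variance is zero).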
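
The three worked examples can be checked on any positive random variable. The sketch below uses a made-up three-point distribution (hypothetical values) and computes the gap in each inequality, arranged so that Jensen predicts a nonnegative number every time.

```python
from fractions import Fraction
from math import log

# Hypothetical positive random variable: X is 1, 2, or 6 with these probabilities.
pmf = {1: Fraction(1, 2), 2: Fraction(1, 3), 6: Fraction(1, 6)}

def expect(g):
    """E[g(X)] by LOTUS over the finite pmf."""
    return sum(g(x) * p for x, p in pmf.items())

mean = expect(lambda x: x)

# Convex g(x) = x^2: E(X^2) - E(X)^2, which is Var(X).
sq_gap = expect(lambda x: x * x) - mean ** 2

# Convex g(x) = 1/x on x > 0: E(1/X) - 1/E(X).
inv_gap = expect(lambda x: Fraction(1, x)) - 1 / mean

# Concave h(x) = log x: the inequality flips, so look at log E(X) - E(log X).
log_gap = log(mean) - expect(lambda x: log(x))

print(sq_gap, inv_gap, log_gap)
```

All three gaps are strictly positive here because $X$ is not constant and each of the three functions is strictly convex or strictly concave on $x > 0$.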

Proof (tangent line)

A key geometric fact about a convex curve $g$: at any point, the tangent line lies below the curve. (You'd prove this formally in an analysis course; it is clear from the picture.)

Pick the point $(\mu, g(\mu))$ where $\mu = E(X)$, and let the tangent line there be $y = a + bx$. "The curve stays above its tangent line" means, for every $x$ in the domain,

$$g(x) \ge a + bx$$

Since this holds for every number $x$, it holds as an inequality between random variables: $g(X) \ge a + bX$ surely. Take expectations of both sides:

$$E\big(g(X)\big) \ge E(a + bX) = a + b\,E(X) = a + b\mu$$

But the line was chosen tangent at $x = \mu$, so there $a + b\mu = g(\mu) = g(E(X))$. Therefore $E(g(X)) \ge g(E(X))$, which is Jensen's inequality. (A Taylor-expansion argument also works, but the geometric picture is cleaner.)
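
As a concrete instance of the tangent-line argument, take the parabola $g(x) = x^2$. Its tangent line at $x = \mu$ has slope $b = 2\mu$ and intercept $a = -\mu^2$, and the gap between curve and line is a perfect square:

$$g(x) - (a + bx) = x^2 - 2\mu x + \mu^2 = (x - \mu)^2 \ge 0$$

Taking expectations of $X^2 \ge -\mu^2 + 2\mu X$ then gives

$$E(X^2) \ge -\mu^2 + 2\mu\, E(X) = \mu^2 = E(X)^2,$$

which recovers the fact that variance is nonnegative.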

· · ·

5. Markov's Inequality

Markov's inequality

For any random variable $X$ and any constant $a > 0$:

$$P(|X| \ge a) \le \frac{E(|X|)}{a}$$

(This is the same Markov as in "Markov chains," the very last topic of the course — same person, different idea.) Its strength is again simplicity and generality, not accuracy: it holds for any random variable. In some cases $E(|X|)$ is infinite (a useless "probability $\le \infty$") or the right side exceeds $1$ (true, but uninformative). It is a deliberately crude, general bound.

Proof (fundamental bridge)

Convert the probability into an expectation of an indicator (the fundamental bridge):

$$P(|X| \ge a) = E\big(I(|X| \ge a)\big)$$

where $I(\cdot)$ is $1$ if the event occurs and $0$ otherwise. Now establish the pointwise inequality, with $a$ inserted:

$$a \cdot I(|X| \ge a) \le |X|, \quad \text{always.}$$

Two cases:

Indicator $= 0$: the left side is $0$, and $0 \le |X|$ trivially.
Indicator $= 1$: then by definition $|X| \ge a$, so the left side is $a$ and $a \le |X|$ holds.

So the relation holds surely. Take expectations of both sides and pull out the constant $a$:

$$a \cdot E\big(I(|X| \ge a)\big) \le E(|X|) \;\Longrightarrow\; a\, P(|X| \ge a) \le E(|X|),$$

and dividing by $a$ finishes the proof.
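
The two steps of the proof can be traced on a small example. The sketch below uses a made-up distribution (hypothetical values, including negatives so that the absolute value matters); it checks the pointwise inequality outcome by outcome, then compares the tail probability with $E(|X|)/a$.

```python
from fractions import Fraction

# Hypothetical distribution, including negative values so that |X| matters.
pmf = {-4: Fraction(1, 10), -1: Fraction(3, 10), 0: Fraction(2, 10),
       2: Fraction(3, 10), 10: Fraction(1, 10)}

def indicator(condition):
    """I(event): 1 if the event occurs, 0 otherwise."""
    return 1 if condition else 0

e_abs = sum(abs(x) * p for x, p in pmf.items())  # E|X|

results = {}
for a in (1, 2, 5, 10):
    # Pointwise step: a * I(|x| >= a) <= |x| for every possible outcome x.
    assert all(a * indicator(abs(x) >= a) <= abs(x) for x in pmf)
    # Fundamental bridge: P(|X| >= a) = E[I(|X| >= a)].
    tail = sum(indicator(abs(x) >= a) * p for x, p in pmf.items())
    results[a] = (tail, e_abs / a)

for a, (tail, bound) in results.items():
    print(f"a = {a}: P(|X| >= a) = {tail}, Markov bound = {bound}")
```

For small $a$ the bound exceeds $1$ and says nothing; it only becomes informative once $a$ is larger than $E(|X|)$.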

Intuition: 100 people

Take 100 people, each with a nonnegative value (age, say), and let $\mu$ be the mean (average) of those values.

Can at least 95% be below average? Yes. One extremely old person pulls the mean way up, so 95 people can easily sit below the mean. (This is about the mean, not the median.)
Can at least 50% be above twice the average? No. The total over all 100 people is $100\mu$. If 50 people each exceeded $2\mu$, those 50 alone would contribute more than $50 \cdot 2\mu = 100\mu$, which already exceeds the entire total before counting anyone else (and since no value is negative, the others cannot bring the total back down). Impossible.

More generally, no more than $1/3$ of the people can exceed $3\mu$, and so on, which is exactly what Markov's inequality says for nonnegative values.
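
Both answers can be seen in a quick sketch with made-up ages (all numbers below are hypothetical). The first list shows that almost everyone can be below the mean; the second shows a group that sits exactly at the Markov limit of one half at or above twice the mean.

```python
# Hypothetical ages: 99 people aged 20 and one person aged 120.
ages = [20] * 99 + [120]
mean_age = sum(ages) / len(ages)

# Fraction of people strictly below the mean.
below_mean = sum(age < mean_age for age in ages) / len(ages)

def fraction_at_least(values, k):
    """Fraction of entries that are at least k times the mean of the list."""
    mean = sum(values) / len(values)
    return sum(v >= k * mean for v in values) / len(values)

# Hypothetical extreme group: half at 0, half at exactly twice the mean.
extreme = [0] * 50 + [40] * 50

print(mean_age, below_mean)
print(fraction_at_least(ages, 2), fraction_at_least(ages, 3))
print(fraction_at_least(extreme, 2))
```

In the extreme group nobody is strictly above twice the mean, which is consistent with the counting argument: the 50 people at $2\mu$ already account for the whole total.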

· · ·

6. Chebyshev's Inequality

Chebyshev's inequality

Let $\mu = E(X)$. For any $a > 0$:

$$P(|X - \mu| \ge a) \le \frac{\operatorname{Var}(X)}{a^2}$$

Equivalently, measuring distance in standard deviations ($a = c\sigma$, with $c > 0$):

$$P(|X - \mu| \ge c\sigma) \le \frac{1}{c^2}$$

So the probability of being at least $2$ standard deviations from the mean is at most $\tfrac{1}{4}$ (= $1/2^2$). Compare the normal-distribution 68-95-99.7 rule, where the probability of being more than $2$ SDs out is about $0.05$. Chebyshev's $0.25$ is far looser — but it holds for every distribution with finite variance, not just the normal.
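
To put numbers on that comparison, the sketch below computes the exact two-sided normal tail $P(|Z| \ge c)$ using the standard identity $P(|Z| \ge c) = \operatorname{erfc}(c/\sqrt{2})$ and sets it beside the Chebyshev bound $1/c^2$. The function names are chosen here for illustration.

```python
from math import erfc, sqrt

def normal_two_sided_tail(c):
    """P(|Z| >= c) for a standard normal Z, via the complementary error function."""
    return erfc(c / sqrt(2))

def chebyshev_bound(c):
    """Chebyshev's bound on P(|X - mu| >= c * sigma), valid for any finite-variance X."""
    return 1 / c ** 2

comparison = {c: (normal_two_sided_tail(c), chebyshev_bound(c)) for c in (1, 2, 3)}
for c, (exact, bound) in comparison.items():
    print(f"c = {c}: normal tail = {exact:.4f}, Chebyshev bound = {bound:.4f}")
```

At $c = 1$ the Chebyshev bound is $1$, which is true but uninformative; the gap to the normal tail widens quickly as $c$ grows.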

Historical irony: Chebyshev's inequality follows almost immediately from Markov's, yet in real life Chebyshev was Markov's adviser.

Proof (square, then apply Markov)

Square both sides inside the probability. Since both sides are nonnegative, this gives an equivalent event and lets us drop the absolute value:

$$P(|X - \mu| \ge a) = P\big((X - \mu)^2 \ge a^2\big)$$

Apply Markov's inequality to the nonnegative random variable $(X - \mu)^2$:

$$P\big((X - \mu)^2 \ge a^2\big) \le \frac{E\big((X - \mu)^2\big)}{a^2}$$

The numerator is, by definition, $\operatorname{Var}(X)$. Hence

$$P(|X - \mu| \ge a) \le \frac{\operatorname{Var}(X)}{a^2},$$

which is Chebyshev's inequality.
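
As a check, the sketch below evaluates both sides of Chebyshev's inequality exactly for two made-up distributions (neither is from the lecture). The first is skewed, and the bound is loose. The second is a three-point distribution for which the bound holds with equality, so Chebyshev cannot be improved without further assumptions.

```python
from fractions import Fraction

def chebyshev_check(pmf, a):
    """Return (P(|X - mu| >= a), Var(X) / a^2) for a finite pmf {value: prob}."""
    mu = sum(x * p for x, p in pmf.items())
    var = sum((x - mu) ** 2 * p for x, p in pmf.items())
    tail = sum(p for x, p in pmf.items() if abs(x - mu) >= a)
    return tail, var / a ** 2

# Hypothetical skewed distribution (made-up numbers).
skewed = {0: Fraction(7, 10), 1: Fraction(2, 10), 10: Fraction(1, 10)}

# Three-point distribution that attains the bound at a = 2:
# X is -2 or 2 with probability 1/8 each, and 0 otherwise (mean 0, variance 1).
extremal = {-2: Fraction(1, 8), 0: Fraction(3, 4), 2: Fraction(1, 8)}

skewed_tail, skewed_bound = chebyshev_check(skewed, 3)
extremal_tail, extremal_bound = chebyshev_check(extremal, 2)

print(skewed_tail, skewed_bound)
print(extremal_tail, extremal_bound)
```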