Before the main topic, one more conditional-expectation problem — a very general and natural setup.
A store gets a random number of customers, and each customer spends a random amount.
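We want the mean and variance of the total revenue:
$$X = \sum_{j=1}^{N} X_j,$$
where $N$ is the number of customers and $X_j$ is the amount spent by customer $j$. Assume the $X_j$ are i.i.d. with mean $\mu$ and variance $\sigma^2$, and that $N$ is independent of the $X_j$.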
What makes this sum unusual is that the upper index $N$ is itself random — we are adding up a random number of random variables.
A first instinct is linearity: "there are $N$ terms, each with mean $\mu$, so $E(X) = N\mu$." This is immediately, obviously wrong — for a structural reason, not an arithmetic one.
$E(X)$ must be a number, but the right-hand side $N\mu$ contains the random variable $N$. A number cannot equal a random variable. The blunder is useful, though: we wish $N$ were a constant — so let's condition on it.
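Written out longhand (the analog of the law of total probability):
$$E(X) = \sum_{n=0}^{\infty} E(X \mid N = n)\, P(N = n)$$
Given $N = n$ we have a fixed number of terms, so linearity applies. Because $N$ is independent of the $X_j$, after plugging in $N = n$ we may drop the condition (unlike the two-envelope paradox, where dropping the condition was illegal):
$$E(X \mid N = n) = E\Big(\sum_{j=1}^{n} X_j \,\Big|\, N = n\Big) = E\Big(\sum_{j=1}^{n} X_j\Big) = n\mu$$
Substituting back:
$$E(X) = \sum_{n=0}^{\infty} n\mu\, P(N = n) = \mu \sum_{n=0}^{\infty} n\, P(N = n) = \mu\, E(N)$$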
The correction is that the count $N$ must be replaced by its expectation $E(N)$.
The same calculation via Adam's law (iterated expectation) is more compact:
$$E(X) = E\big(E(X \mid N)\big) = E(\mu N) = \mu\, E(N)$$
Here $E(X \mid N) = \mu N$ is obtained by treating $N$ as a known constant and applying linearity. The longhand and shorthand mean exactly the same thing; both are worth being comfortable with.
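For the variance, use Eve's law (the law of total variance), again conditioning on $N$. Given $N$, the total $X$ is a sum of $N$ independent terms, each with variance $\sigma^2$, so $\operatorname{Var}(X \mid N) = N\sigma^2$, while $E(X \mid N) = \mu N$:
$$\operatorname{Var}(X) = E\big(\operatorname{Var}(X \mid N)\big) + \operatorname{Var}\big(E(X \mid N)\big) = E(N\sigma^2) + \operatorname{Var}(\mu N)$$
Therefore:
$$\operatorname{Var}(X) = \sigma^2\, E(N) + \mu^2 \operatorname{Var}(N)$$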
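If $N \sim \text{Pois}(\lambda)$, then $E(N) = \operatorname{Var}(N) = \lambda$, so the formulas reduce to $E(X) = \lambda\mu$ and $\operatorname{Var}(X) = \lambda(\sigma^2 + \mu^2)$. The general form, however, does not require $N$ to be Poisson.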
$E(X) = \mu\, E(N)$ reads as: average revenue $=$ (average number of customers) $\times$ (average spend per customer).
The variance passes a units check. The count $N$ has no units (counting people). If spend is in dollars, then $\mu$ is dollars and $\sigma^2$ is dollars-squared. So $\sigma^2 E(N)$ is dollars-squared and $\mu^2 \operatorname{Var}(N)$ is dollars-squared — consistent, and their sum is dollars-squared. Taking the square root gives a standard deviation in dollars (which is why we prefer SD to variance for interpretation). A formula producing dollars-cubed added to dollars-to-the-fourth would have signalled an error.
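The two random-sum formulas, $E(X) = \mu\, E(N)$ and $\operatorname{Var}(X) = \sigma^2 E(N) + \mu^2 \operatorname{Var}(N)$, can be checked exactly on a small example. The sketch below is only an illustration: the customer-count and spend distributions are made up, the helper names (`mean_var`, `convolve`, `random_sum_pmf`) are hypothetical, and it assumes i.i.d. spends that are independent of $N$. It builds the exact distribution of the total by mixing $n$-fold sums over $N = n$, using exact fractions so no rounding is involved.

```python
from collections import defaultdict
from fractions import Fraction


def mean_var(pmf):
    """Return (mean, variance) of a finite pmf given as {value: probability}."""
    mean = sum(x * p for x, p in pmf.items())
    var = sum((x - mean) ** 2 * p for x, p in pmf.items())
    return mean, var


def convolve(p, q):
    """pmf of the sum of two independent variables with pmfs p and q."""
    out = defaultdict(Fraction)
    for x, px in p.items():
        for y, qy in q.items():
            out[x + y] += px * qy
    return dict(out)


def random_sum_pmf(n_pmf, spend_pmf):
    """Exact pmf of X = X_1 + ... + X_N, mixing the n-fold sums over N = n."""
    total = defaultdict(Fraction)
    n_fold = {0: Fraction(1)}  # the empty sum (N = 0) equals 0 with probability 1
    for n in range(max(n_pmf) + 1):
        weight = n_pmf.get(n, Fraction(0))  # P(N = n)
        for x, p in n_fold.items():
            total[x] += weight * p
        n_fold = convolve(n_fold, spend_pmf)  # now the pmf of a sum of n + 1 terms
    return dict(total)


# Hypothetical example distributions (assumptions, not from the text).
n_pmf = {0: Fraction(1, 4), 1: Fraction(1, 4), 3: Fraction(1, 2)}
spend_pmf = {5: Fraction(1, 2), 10: Fraction(3, 10), 20: Fraction(1, 5)}

mu, sigma2 = mean_var(spend_pmf)
e_n, var_n = mean_var(n_pmf)
e_x, var_x = mean_var(random_sum_pmf(n_pmf, spend_pmf))

print("E(X) matches mu * E(N):", e_x == mu * e_n)
print("Var(X) matches sigma^2 E(N) + mu^2 Var(N):", var_x == sigma2 * e_n + mu**2 * var_n)
```

Working with the exact distribution rather than a simulation avoids sampling noise, so the comparison is an equality rather than an approximate match.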
The same conditioning trick gives the MGF. Given $N = n$, the MGF of a sum of $n$ independent terms is the product of their MGFs; if the $X_j$ are i.i.d. with common MGF $M(t)$, then $E(e^{tX} \mid N = n) = M(t)^n$, and taking the expectation over $N$ gives $E(e^{tX}) = E\big(M(t)^N\big)$.
The main topic is the four statistical inequalities needed in Stat 110. Inequalities get less attention than they deserve in most courses, and there is a recurring student mistake worth heading off: confusing an approximation with a bound.
A statistician serving as an expert witness in court is far better off with an inequality than an approximation. Under cross-examination, "this approximation is good" invites "how good, and by whose standard?" — and if you actually knew how close it was, you'd know the true answer. There is no accepted standard for "good." By contrast, "I proved the probability is below $0.37$" leaves little to attack. The interesting twist: there is still genuine randomness, yet we have proven a definite fact about something random.
The four inequalities below trade approximation quality for simplicity and generality — they hold for essentially any random variable.
For random variables $X$ and $Y$:
$$E(XY) \le \sqrt{E(X^2)\, E(Y^2)}$$
(You may put absolute values around the left side; still true.)
Recall the geometric view of conditional expectation, where $E(XY)$ plays the role of a dot product. Read that way, this is exactly the Cauchy-Schwarz inequality familiar from linear algebra.
Computing $E(XY)$ in general requires the joint distribution (via 2D LOTUS, or finding the distribution of the product $XY$ through a Jacobian) — potentially very messy. Cauchy-Schwarz bounds that joint quantity by two marginal second moments: $E(X^2)$ involves only $X$, and $E(Y^2)$ involves only $Y$. It separates the two variables.
If $X$ and $Y$ are uncorrelated, then by definition $E(XY) = E(X)E(Y)$, an exact equality — so the inequality is useless there (the direction even makes sense, since $E(X^2) \ge E(X)^2$). It earns its keep in the correlated case, but it is comforting that it holds in all cases without splitting into sub-cases.
The statistical meaning is clearest when $X$ and $Y$ have mean zero. Then the covariance is $E(XY) - E(X)E(Y) = E(XY)$, and each variance equals its second moment, so the correlation is
$$\operatorname{Corr}(X, Y) = \frac{E(XY)}{\sqrt{E(X^2)\, E(Y^2)}}.$$
We already proved that correlation lies in $[-1, 1]$; taking absolute values, that statement is identical to Cauchy-Schwarz. So in statistics, Cauchy-Schwarz says exactly that correlation is between $-1$ and $1$. No separate proof is needed — the general (non-zero-mean) version is a small linear-algebra extension of the same fact.
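As a quick numerical illustration, the inequality can be checked on any paired data set by treating the data as an empirical joint distribution (each pair equally likely). The numbers below are invented for the example, and `mean` is a small helper defined here. After centering both lists, the ratio $E(XY) / \sqrt{E(X^2)E(Y^2)}$ is the sample correlation, which lands in $[-1, 1]$.

```python
from math import sqrt

# Hypothetical paired observations (assumption: made-up numbers for illustration).
xs = [1.0, 2.0, -1.5, 0.5, 3.0, -2.0]
ys = [0.8, 1.5, -2.0, 1.0, 2.5, -0.5]


def mean(values):
    """Expectation under the empirical distribution (equal weight per point)."""
    return sum(values) / len(values)


# Cauchy-Schwarz: a joint quantity bounded by two marginal second moments.
e_xy = mean([x * y for x, y in zip(xs, ys)])
bound = sqrt(mean([x * x for x in xs]) * mean([y * y for y in ys]))
print(f"|E(XY)| = {abs(e_xy):.4f} <= {bound:.4f}")

# Mean-zero version: center the data, then the same ratio is the correlation.
cx = [x - mean(xs) for x in xs]
cy = [y - mean(ys) for y in ys]
corr = mean([a * b for a, b in zip(cx, cy)]) / sqrt(
    mean([a * a for a in cx]) * mean([b * b for b in cy])
)
print(f"correlation = {corr:.4f}")
```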
If $g$ is a convex function, then for any random variable $X$:
$$E\big(g(X)\big) \ge g\big(E(X)\big)$$
This is valuable because it tells you which way the inequality goes. One of the biggest blunders in probability is to "move the $E$ around" freely — you cannot. For convex $g$, Jensen pins down the direction.
If the second derivative exists, convex means $g''(x) \ge 0$ everywhere (the easiest test). The canonical example is $g(x) = x^2$: $g'(x) = 2x$, $g''(x) = 2 > 0$, so $x^2$ is convex (the familiar upward U-shape — "concave up" in AP-Calculus terminology, which essentially no one uses afterward).
A function $h$ is concave if $h''(x) \le 0$; then the inequality simply flips: $E(h(X)) \le h(E(X))$. No separate theory is needed — replace $h$ by $-h$, which flips the sign of the second derivative, apply Jensen, and the minus sign reverses the inequality.
The geometric definition is more general (it doesn't need a second derivative): $g$ is convex if, for any two points on the curve, the line segment joining them lies above the curve. By this definition $|x|$ is convex (a V-shape) even though its derivative doesn't exist at $0$.
Use the parabola. We already know $E(X^2) \ge E(X)^2$ because variance is nonnegative — and that is exactly Jensen for $g(x) = x^2$. If you ever forget which way Jensen points, fall back on your friendly parabola.
The old mnemonic "concave up holds water" is both unhelpful and wrong — a mathematician even wrote a paper titled "Does concave-up-holds-water hold water?" and the answer was no. Better to just memorize one convex example.
| Function $g(x)$ | $g''(x)$ | Convex / concave ($x > 0$) | Jensen gives |
|---|---|---|---|
| $x^2$ | $2$ | convex | $E(X^2) \ge E(X)^2$ |
| $1/x$ | $2/x^3$ | convex | $E(1/X) \ge 1/E(X)$ |
| $\log x$ | $-1/x^2$ | concave | $E(\log X) \le \log E(X)$ |
(For $1/x$ and $\log x$, take $X$ to be a positive random variable so we avoid dividing by zero or taking the log of a non-positive number.)
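Equality in Jensen holds exactly when $g(X) = a + bX$ with probability $1$ for some constants $a$ and $b$, that is, when $g$ is linear on the values $X$ actually takes. For a strictly convex function such as $x^2$, this forces $X$ to be a constant (for $x^2$, it says the variance is zero).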
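The three rows of the table can be verified on a concrete positive random variable. The sketch below uses a made-up three-point distribution and a hypothetical helper `expect`; it is meant only to show the direction of each inequality, not to prove anything.

```python
from fractions import Fraction
from math import log

# Hypothetical positive random variable: P(X=1)=1/2, P(X=2)=1/4, P(X=4)=1/4.
pmf = {1: Fraction(1, 2), 2: Fraction(1, 4), 4: Fraction(1, 4)}


def expect(g, pmf):
    """E(g(X)) for a finite pmf given as {value: probability}."""
    return sum(g(x) * p for x, p in pmf.items())


mean = expect(lambda x: x, pmf)

# Convex g: E(g(X)) >= g(E(X)).
e_sq = expect(lambda x: x**2, pmf)
e_inv = expect(lambda x: Fraction(1, x), pmf)
print("x^2  :", e_sq, ">=", mean**2, e_sq >= mean**2)
print("1/x  :", e_inv, ">=", 1 / mean, e_inv >= 1 / mean)

# Concave h: E(h(X)) <= h(E(X)).
e_log = expect(lambda x: log(x), pmf)
log_mean = log(float(mean))
print("log x:", round(e_log, 4), "<=", round(log_mean, 4), e_log <= log_mean)
```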
A key geometric fact about a convex curve $g$: at any point, the tangent line lies below the curve. (You'd prove this formally in an analysis course; it is clear from the picture.)
Pick the point $(\mu, g(\mu))$ where $\mu = E(X)$, and let the tangent line there be $y = a + bx$. "The curve stays above its tangent line" means, for every $x$ in the domain,
$$g(x) \ge a + bx.$$
Since this holds for every number $x$, it holds as an inequality between random variables: $g(X) \ge a + bX$ surely. Take expectations of both sides:
$$E\big(g(X)\big) \ge E(a + bX) = a + b\,E(X) = a + b\mu$$
But the line was chosen tangent at $x = \mu$, so there $a + b\mu = g(\mu) = g(E(X))$. Therefore $E(g(X)) \ge g(E(X))$, which is Jensen's inequality. (A Taylor-expansion argument also works, but the geometric picture is cleaner.)
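To see the tangent-line argument in a concrete case, take $g(x) = x^2$. The tangent at $x = \mu$ has slope $b = g'(\mu) = 2\mu$ and intercept $a = g(\mu) - b\mu = -\mu^2$, so the gap between curve and tangent is a perfect square:

$$g(x) - (a + bx) = x^2 - 2\mu x + \mu^2 = (x - \mu)^2 \ge 0.$$

Taking expectations of $X^2 \ge -\mu^2 + 2\mu X$ gives $E(X^2) \ge -\mu^2 + 2\mu \cdot \mu = \mu^2 = E(X)^2$, and the expected gap $E\big((X - \mu)^2\big)$ is exactly $\operatorname{Var}(X)$.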
For any random variable $X$ and any constant $a > 0$:
$$P(|X| \ge a) \le \frac{E(|X|)}{a}$$
(This is the same Markov as in "Markov chains," the very last topic of the course — same person, different idea.) Its strength is again simplicity and generality, not accuracy: it holds for any random variable. In some cases $E(|X|)$ is infinite (a useless "probability $\le \infty$") or the right side exceeds $1$ (true, but uninformative). It is a deliberately crude, general bound.
Convert the probability into an expectation of an indicator (the fundamental bridge):
$$P(|X| \ge a) = E\big(I(|X| \ge a)\big),$$
where $I(\cdot)$ is $1$ if the event occurs and $0$ otherwise. Now establish the pointwise inequality, with $a$ inserted:
$$a \cdot I(|X| \ge a) \le |X|, \quad \text{always.}$$
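Two cases:
- If $|X| \ge a$, the indicator is $1$ and the left side is $a$, which is at most $|X|$.
- If $|X| < a$, the indicator is $0$ and the left side is $0$, which is at most $|X|$ because $|X| \ge 0$.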
So the relation holds surely. Take expectations of both sides and pull out the constant $a$:
$$a \cdot E\big(I(|X| \ge a)\big) \le E(|X|) \;\Longrightarrow\; a\, P(|X| \ge a) \le E(|X|),$$
and dividing by $a$ finishes the proof.
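Take 100 people, each with a nonnegative value (age or income, say), and let $\mu > 0$ be the mean (average) value. At most half of them can have a value of at least $2\mu$: if more than half did, those people alone would push the average above $\mu$.
More generally, no more than $1/3$ of the people can have a value of at least $3\mu$, and so on, which is exactly what Markov's inequality says.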
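The 100-people reading of Markov's inequality is easy to try out. In the sketch below, both groups are invented for illustration and `markov_check` is a hypothetical helper; it compares the fraction of people with a value of at least $c\mu$ against the bound $1/c$. The second group (half at $0$, half at $2\mu$) is an assumption chosen to show that the bound can be attained.

```python
def markov_check(values, c):
    """Return (fraction of values >= c * mean, Markov bound 1/c).

    Assumes all values are nonnegative and the mean is positive.
    """
    mu = sum(values) / len(values)
    fraction = sum(v >= c * mu for v in values) / len(values)
    return fraction, 1 / c


# Hypothetical group of 100 people with a skewed, nonnegative quantity (mean 40).
skewed = [0] * 60 + [50] * 30 + [250] * 10

# Hypothetical extreme group: half at 0, half at twice the mean (mean 40).
extreme = [0] * 50 + [80] * 50

for c in (2, 3, 5):
    fraction, bound = markov_check(skewed, c)
    print(f"skewed,  c={c}: fraction {fraction:.2f} <= bound {bound:.2f}")

fraction, bound = markov_check(extreme, 2)
print(f"extreme, c=2: fraction {fraction:.2f} <= bound {bound:.2f}")
```

In the skewed group the bound is far from tight, which matches the description of Markov as a crude but general bound; in the extreme group the fraction equals the bound.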
Let $\mu = E(X)$. For any $a > 0$:
$$P(|X - \mu| \ge a) \le \frac{\operatorname{Var}(X)}{a^2}$$
Equivalently, measuring distance in standard deviations ($a = c\sigma$, with $c > 0$):
$$P(|X - \mu| \ge c\sigma) \le \frac{1}{c^2}$$
So the probability of being at least $2$ standard deviations from the mean is at most $\tfrac{1}{4}$ (= $1/2^2$). Compare the normal-distribution 68-95-99.7 rule, where the probability of being more than $2$ SDs out is about $0.05$. Chebyshev's $0.25$ is far looser — but it holds for every distribution with finite variance, not just the normal.
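The comparison with the normal distribution can be tabulated for several values of $c$. This is a small sketch using `statistics.NormalDist` from the Python standard library; the function names are hypothetical, and the normal column is only one distribution among the many that Chebyshev covers.

```python
from statistics import NormalDist


def chebyshev_bound(c):
    """Chebyshev's upper bound on P(|X - mu| >= c * sigma), for c > 0."""
    return 1 / c**2


def normal_two_sided_tail(c):
    """Exact P(|Z| >= c) for a standard normal Z."""
    return 2 * (1 - NormalDist().cdf(c))


for c in (1, 2, 3):
    print(
        f"c={c}: Chebyshev bound {chebyshev_bound(c):.4f}, "
        f"normal tail {normal_two_sided_tail(c):.4f}"
    )
```

For $c = 1$ the bound is $1$ and says nothing; the gap to the normal tail widens as $c$ grows, which is the price of a bound that needs only a finite variance.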
Historical irony: Chebyshev's inequality follows almost immediately from Markov's, yet in real life Chebyshev was Markov's adviser.
Square both sides inside the probability. Since both sides are nonnegative, this gives an equivalent event and lets us drop the absolute value:
$$P(|X - \mu| \ge a) = P\big((X - \mu)^2 \ge a^2\big)$$
Apply Markov's inequality to the nonnegative random variable $(X - \mu)^2$:
$$P\big((X - \mu)^2 \ge a^2\big) \le \frac{E\big((X - \mu)^2\big)}{a^2}$$
The numerator is, by definition, $\operatorname{Var}(X)$. Hence
$$P(|X - \mu| \ge a) \le \frac{\operatorname{Var}(X)}{a^2},$$
which is Chebyshev's inequality.