Linearity says $E(X + Y) = E(X) + E(Y)$, and $E(cX) = c\,E(X)$ for a constant $c$. The first identity holds even when $X$ and $Y$ are dependent, which is the surprising and useful part.
When $X$ and $Y$ are independent, $E(X + Y) = E(X) + E(Y)$ feels obvious. When they are dependent, it is not obvious at all — yet it remains true, as long as the expectations exist.
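Work in the discrete case (the continuous case is analogous, using a PDF instead of a PMF). Let $T = X + Y$. By definition,
$$E(T) = \sum_{t} t\, P(T = t), \qquad E(X) + E(Y) = \sum_{x} x\, P(X = x) + \sum_{y} y\, P(Y = y).$$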
Trying to combine the two right-hand sums (one over $x$, one over $y$) into a single sum over $t$ is awkward. Trying to split $E(T)$ by conditioning on $X$, through the PMF
$$P(T = t) = \sum_{x} P(Y = t - x \mid X = x)\, P(X = x),$$
works cleanly only if $X$ and $Y$ are independent — then it reduces to a Vandermonde-style convolution (the same algebra used earlier to show that a sum of independent Binomials with common $p$ is Binomial). If $X$ and $Y$ are dependent, you are left stuck with a residual conditional $Y$ term. So the definitional route is the wrong way in.
A random variable $X$ is a function from the sample space $S$ to the real line. Picture $S$ as a collection of "pebbles," each pebble $s$ carrying probability mass $P(s)$, with all masses summing to $1$. $X$ assigns each pebble a value $X(s)$.
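The expected value can be computed two equivalent ways:
$$E(X) = \sum_{x} x\, P(X = x) \quad \text{(grouped: sum over values)},$$
$$E(X) = \sum_{s} X(s)\, P(s) \quad \text{(ungrouped: sum over pebbles)}.$$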
These are the same number — exactly the two ways of averaging a list (add-and-divide vs. group equal entries and take a weighted average). The ungrouped form is the lever that makes linearity trivial.
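As a small illustration of the two ways of averaging, the sketch below builds a made-up four-pebble sample space (the pebble names, masses, and values are assumptions chosen only for the example) and computes the expectation both pebble by pebble and value by value, using exact fractions.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical pebble world: each pebble s maps to (P(s), X(s)).
pebbles = {
    "a": (Fraction(1, 2), 1),
    "b": (Fraction(1, 4), 3),
    "c": (Fraction(1, 8), 3),
    "d": (Fraction(1, 8), 5),
}

def expectation_ungrouped(pebbles):
    """Sum X(s) * P(s) over individual pebbles."""
    return sum(mass * value for mass, value in pebbles.values())

def expectation_grouped(pebbles):
    """Build the PMF P(X = x) first, then sum x * P(X = x) over values."""
    pmf = defaultdict(Fraction)
    for mass, value in pebbles.values():
        pmf[value] += mass          # pebbles with equal values are merged
    return sum(x * p for x, p in pmf.items())

ungrouped = expectation_ungrouped(pebbles)
grouped = expectation_grouped(pebbles)
print(ungrouped, grouped)
```

Pebbles "b" and "c" share the value 3, so the grouped version merges their masses before averaging; both routes give the same number.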
Write $E(T)$ in ungrouped form and use that adding two functions means adding them pointwise:
$$E(T) = \sum_{s} (X + Y)(s)\, P(s) = \sum_{s} \big[X(s) + Y(s)\big] P(s)$$
$$= \sum_{s} X(s)\, P(s) + \sum_{s} Y(s)\, P(s) = E(X) + E(Y).$$
The crucial step is that every term is now summed over the same index $s$ (pebbles), so the sum splits cleanly. No independence is used anywhere — the same pebble contributes $X(s)$ and $Y(s)$ simultaneously, so dependence is irrelevant.
The same move proves the constant rule:
$$E(cX) = \sum_{s} c\,X(s)\, P(s) = c \sum_{s} X(s)\, P(s) = c\,E(X).$$
Take the most extreme dependence possible: $X = Y$. Then $E(X + Y) = E(2X) = 2\,E(X)$, which is indeed $E(X) + E(Y)$. At the other extreme (independence) the result is intuitive. The proof handles everything in between.
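The pebble argument can be checked on a concrete sample space. The sketch below assumes a single fair die roll as the sample space and takes $Y = X^2$, so $Y$ is completely determined by $X$; the helper name `E` and the choice of $Y$ are illustrative, not part of the derivation.

```python
from fractions import Fraction

# Hypothetical sample space: one fair six-sided die roll, pebbles s = 1..6.
P = {s: Fraction(1, 6) for s in range(1, 7)}

def E(f):
    """Ungrouped expectation: sum of f(s) * P(s) over pebbles."""
    return sum(f(s) * P[s] for s in P)

def X(s):
    return s            # the roll itself

def Y(s):
    return s * s        # a function of the same roll, so strongly dependent on X

def T(s):
    return X(s) + Y(s)  # adding random variables means adding pointwise

lhs = E(T)
rhs = E(X) + E(Y)

# Extreme dependence, Y = X: E(X + X) should equal 2 E(X).
double = E(lambda s: X(s) + X(s))

print(lhs, rhs, double, 2 * E(X))
```

Because every expectation is a sum over the same six pebbles, the two sides agree exactly, with no independence assumption anywhere in the code.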
The negative binomial generalizes the geometric. Despite the name, it is neither negative nor a binomial — the name only reflects its connection to expanding $(a + b)$ raised to a negative power in the binomial theorem.
Run independent Bernoulli$(p)$ trials. The negative binomial counts the number of failures before the $r$-th success. The geometric is the special case $r = 1$ (failures before the first success).
Write $1$ for success and $0$ for failure, and look at a sequence ending in the $r$-th success. With $r = 5$, one valid sequence (here with $n = 6$ failures) is, for example:
$$0\,1\,0\,0\,1\,1\,0\,1\,0\,0\,1$$
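Two observations pin down the count: the sequence must end with a success (the $r$-th one), and the preceding $n + r - 1$ positions hold the other $r - 1$ successes and all $n$ failures, in any order.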
So, for $X = $ number of failures before the $r$-th success:
$$P(X = n) = \binom{n+r-1}{r-1}\, p^{\,r} (1-p)^{n}, \qquad n = 0, 1, 2, \ldots$$
The factor $p^{\,r}(1-p)^{n}$ is the probability of one specific qualifying sequence; the binomial coefficient counts the arrangements of the first $n + r - 1$ positions. Choose the locations of the $r - 1$ successes, $\binom{n+r-1}{r-1}$, or equivalently the locations of the $n$ failures, $\binom{n+r-1}{n}$ — the same number.
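A quick numerical sanity check of this count-times-probability structure, as a sketch: the function name `neg_binomial_pmf` and the parameter values $r = 5$, $p = 0.3$ are assumptions made for illustration. It confirms that the two binomial coefficients agree, that the probabilities sum to about 1, and that $r = 1$ gives the geometric.

```python
from math import comb

def neg_binomial_pmf(n, r, p):
    """P(X = n): n failures before the r-th success, success probability p."""
    return comb(n + r - 1, r - 1) * p**r * (1 - p)**n

r, p = 5, 0.3  # illustrative parameters, not from the text

# Choosing the success slots or the failure slots gives the same count.
same_count = all(comb(n + r - 1, r - 1) == comb(n + r - 1, n) for n in range(50))

# The PMF should sum to (nearly) 1; the tail beyond n = 400 is negligible here.
total = sum(neg_binomial_pmf(n, r, p) for n in range(400))

# With r = 1 the formula reduces to the geometric PMF p * (1 - p)**n.
geometric_gap = abs(neg_binomial_pmf(3, 1, p) - p * (1 - p)**3)

print(same_count, total, geometric_gap)
```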
Computing $E(X) = \sum_n n\, P(X = n)$ directly is nasty. Instead, decompose the waiting process: getting $r$ successes means wait for the 1st success, then the 2nd, and so on. Write
$$X = X_1 + X_2 + \cdots + X_r,$$
where $X_j$ is the number of failures strictly between the $(j-1)$-th and $j$-th successes (with $X_1$ the failures before the very first success). Each $X_j \sim \text{Geometric}(p)$, so $E(X_j) = q/p$ with $q = 1 - p$. By linearity,
$$E(X) = E(X_1) + \cdots + E(X_r) = r\,\frac{q}{p}.$$
The $X_j$ happen to be independent (the trials are independent), but linearity does not need that — only that the expectations exist.
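The decomposition into waiting times can also be simulated. The sketch below is one possible check, with assumed parameters $r = 5$, $p = 0.3$, a fixed seed, and hypothetical helper names; it builds each sample as a sum of $r$ geometric waiting times and compares the sample mean with a truncated direct sum and with $r\,q/p$.

```python
import random
from math import comb

def geometric_failures(p, rng):
    """Failures before the first success in Bernoulli(p) trials."""
    failures = 0
    while rng.random() >= p:   # rng.random() < p counts as a success
        failures += 1
    return failures

def neg_binomial_sample(r, p, rng):
    """X = X_1 + ... + X_r, each X_j a fresh geometric waiting time."""
    return sum(geometric_failures(p, rng) for _ in range(r))

r, p = 5, 0.3                  # illustrative parameters
q = 1 - p
rng = random.Random(0)         # fixed seed so the run is repeatable
trials = 20_000

simulated_mean = sum(neg_binomial_sample(r, p, rng) for _ in range(trials)) / trials

# Direct (truncated) evaluation of sum_n n * P(X = n) for comparison.
direct_mean = sum(n * comb(n + r - 1, r - 1) * p**r * q**n for n in range(1000))

formula_mean = r * q / p
print(simulated_mean, direct_mean, formula_mean)
```

The direct sum should match $r\,q/p$ to floating-point accuracy, while the simulated mean should land close to it up to sampling noise.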
Textbooks disagree on where the geometric starts, so be careful copying formulas. Ross starts the geometric at $1$ (counting the success); DeGroot starts at $0$ (not counting it). This course follows the DeGroot convention: the geometric counts failures only, starting at $0$.
Define the first success distribution, $X \sim \text{FS}(p)$: the number of trials up to and including the first success (so it counts the success). To relate it to the geometric, set
$$Y = X - 1, \quad \text{so that } Y \sim \text{Geometric}(p).$$
Subtracting $1$ just removes the counted success. Any FS probability rewrites in terms of $Y$, and
$$E(X) = E(Y) + 1 = \frac{q}{p} + 1 = \frac{q + p}{p} = \frac{1}{p},$$
using $p + q = 1$. This is intuitive: if $p = 1/10$, it takes on average $10$ trials to get the first success when you count that success.
A common notational habit in the literature is to drop the parentheses (writing $E\,X$ for $E(X)$) when there is no ambiguity.
This appeared on the Putnam exam, one of the hardest math competitions in the country — in most years the median score is zero. Many test-takers considered this the hardest problem on the exam, yet with linearity and indicators it is a one-liner.
Take a uniformly random permutation of $1, 2, \ldots, n$ (all $n!$ orderings equally likely); assume $n \ge 2$. A local maximum is an entry larger than its neighbor(s): an interior entry must exceed both neighbors; an endpoint entry has only one neighbor and must exceed it. Find the expected number of local maxima.
Let $I_j$ be the indicator that position $j$ is a local maximum, $j = 1, \ldots, n$. The number of local maxima is $I_1 + \cdots + I_n$, so
$$E(I_1 + \cdots + I_n) = E(I_1) + \cdots + E(I_n).$$
By the fundamental bridge, $E(I_j) = P(\text{position } j \text{ is a local maximum})$. Split into interior and endpoint positions.
For an interior position, consider the three values at that position and its two neighbors. By symmetry, the largest of the three is equally likely to occupy any of the three positions, so the chance it sits in the middle is $\tfrac{1}{3}$. (Equivalently, of the $6$ orderings of three slots, the $2$ with the largest in the middle give $\tfrac{2}{6} = \tfrac{1}{3}$.)
Answering $\tfrac{1}{4}$ — "$\tfrac12$ chance bigger than the left neighbor times $\tfrac12$ chance bigger than the right neighbor" — is wrong. Those two events are not independent (the same trap as $P(A \text{ older than } B \mid A \text{ older than } C)$). Two correct arguments give $\tfrac13$; one flawed argument gives $\tfrac14$.
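The dependence between the two comparisons can be seen by listing all six relative orderings of three distinct values. This is a minimal sketch; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

# Relative ranks of (left neighbor, middle entry, right neighbor).
orderings = list(permutations([1, 2, 3]))

beats_left = [mid > left for left, mid, right in orderings]
beats_right = [mid > right for left, mid, right in orderings]
beats_both = [a and b for a, b in zip(beats_left, beats_right)]

total = len(orderings)
p_left = Fraction(sum(beats_left), total)
p_right = Fraction(sum(beats_right), total)
p_both = Fraction(sum(beats_both), total)

print(p_left, p_right, p_both, p_left * p_right)
```

Each single comparison has probability $\tfrac12$, but the joint probability is $\tfrac13$ rather than the product $\tfrac14$, which is exactly the failure of independence described above.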
An endpoint is a local maximum iff it exceeds its single neighbor, which by symmetry happens with probability $\tfrac{1}{2}$.
There are $n - 2$ interior positions (each $\tfrac13$) and $2$ endpoints (each $\tfrac12$):
$$E(\text{number of local maxima}) = \frac{n-2}{3} + \frac{2}{2} = \frac{n+1}{3}.$$
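For small $n$ the answer can be verified by brute force. Adding up the contributions just listed gives $\tfrac{n-2}{3} + 1 = \tfrac{n+1}{3}$; the sketch below (function names are hypothetical) averages the number of local maxima over all $n!$ permutations and compares with that value.

```python
from fractions import Fraction
from itertools import permutations

def count_local_maxima(perm):
    """Count entries larger than every neighbor they have."""
    n = len(perm)
    count = 0
    for j in range(n):
        left_ok = j == 0 or perm[j] > perm[j - 1]
        right_ok = j == n - 1 or perm[j] > perm[j + 1]
        if left_ok and right_ok:
            count += 1
    return count

def expected_local_maxima(n):
    """Exact average over all n! equally likely permutations."""
    perms = list(permutations(range(1, n + 1)))
    return Fraction(sum(count_local_maxima(p) for p in perms), len(perms))

for n in range(2, 8):
    print(n, expected_local_maxima(n), Fraction(n + 1, 3))
```

Enumeration is only feasible for small $n$ (the loop stops at $7! = 5040$ permutations), which is part of why the indicator argument is so much more useful than direct counting.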
This example uses no indicators; it shows that an expected value can be infinite, and that you cannot move $E$ through a nonlinear function.
Flip a fair coin until the first heads. The payout is $Y = 2^X$ dollars, where $X$ is the number of flips up to and including the first heads — so $X \sim \text{FS}(1/2)$. Heads on flip 1 pays \$2, on flip 2 pays \$4, on flip 3 pays \$8, doubling each time. What is the fair price: the price making the expected net value zero?
In an informal classroom auction, even imagined billionaires would not pay much — most balked above roughly \$20 to \$100, despite the unbounded payout structure.
The payout $2^X$ equals $2^k$ exactly when the coin shows $k - 1$ tails then heads, an event of probability $(1/2)^k$. So
$$E(Y) = E(2^X) = \sum_{k=1}^{\infty} 2^k \cdot \frac{1}{2^k} = \sum_{k=1}^{\infty} 1 = \infty.$$
Each term contributes exactly $1$, so the sum diverges. By this calculation the fair price is infinite — yet almost no one would pay even \$100. The arithmetic is indisputable ($1 + 1 + 1 + \cdots$); the resolution is that no one actually has infinite money. Real payouts are bounded.
Cap the payout near a trillion dollars. Since $2^{40} > 10^{12}$, cap at $2^{40}$. Two scenarios:
| Scenario | Sum | Expected value |
|---|---|---|
| Dealer flees past flip 40 (you get nothing beyond the cap) | $\sum_{k=1}^{40} 1$ | \$40 |
| Dealer pays the capped $2^{40}$ for all later outcomes | $40 + (\tfrac12 + \tfrac14 + \cdots)$ | \$41 |
A trillion-dollar ceiling — an astronomically unlikely outcome to ever reach — collapses an "infinite" expectation to about \$40. That no longer feels paradoxical and matches people's intuitions.
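The two capped expectations in the table can be reproduced with exact arithmetic. In this sketch the names `CAP_FLIP` and `term` are illustrative, and the infinite tail in the second scenario is approximated by its first 200 terms, which leaves an error of $2^{-200}$.

```python
from fractions import Fraction

CAP_FLIP = 40   # payouts are capped at 2**40 dollars, as in the table

def term(k):
    """Contribution of 'first heads on flip k': payout 2**k times probability 2**-k."""
    return Fraction(2**k, 2**k)

# Scenario 1: nothing is paid if the first heads comes after flip 40.
dealer_flees = sum(term(k) for k in range(1, CAP_FLIP + 1))

# Scenario 2: every later outcome pays the capped 2**40.
# The tail sum over k > 40 of 2**40 / 2**k is geometric and equals exactly 1;
# summing 200 extra terms gets within 2**-200 of it.
tail = sum(Fraction(2**CAP_FLIP, 2**k) for k in range(CAP_FLIP + 1, CAP_FLIP + 201))
dealer_pays_cap = dealer_flees + tail

print(dealer_flees, float(dealer_pays_cap))
```

Each uncapped term contributes exactly 1, so the first scenario is just a count of the 40 allowed flips, and the capped tail adds at most one more dollar.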
$E(2^X) = \infty$, but a careless "move the $E$ into the exponent" gives $2^{E(X)} = 2^{1/p} = 2^2 = \$4$ (using $E(X) = 1/p = 2$ for $\text{FS}(1/2)$). These are not equal:
$$E(2^X) = \infty \;\neq\; 4 = 2^{E(X)}.$$
Linearity lets you pull $E$ through sums and constant multiples — and only those. It does not pass through powers, products, or other nonlinear functions.