Lecture 8: Random Variables and Their Distributions

Harvard Statistics 110 (Joe Blitzstein)

1. Three Ways to View the Binomial

The Binomial is one of the most famous and useful distributions in statistics. It is written $\text{Bin}(n, p)$ and has two parameters: $n$, any positive integer, and $p$, any real number in $[0, 1]$. Changing the parameters gives a different distribution, but it is still a Binomial — so $\text{Bin}(n, p)$ is really a whole family of distributions, one for each $(n, p)$ pair.

Saying $X \sim \text{Bin}(n, p)$ means $X$ is a random variable whose distribution is Binomial. There are three equally important ways to understand what that distribution is, and all three matter in this course.

View 1: The story

The Binomial story

We run $n$ independent trials. Each trial results in success or failure, with probability $p$ of success. $X$ is the number of successes.

The independence of the trials is crucial. The word "success" is generic: you can define success however you like (you can even swap the labels success and failure), as long as each trial yields exactly one of the two, never both.

The classic example is flipping a coin $n$ times and counting heads. But the setting is far more general than coin flips — any sequence of $n$ independent yes/no trials with a common success probability fits. The story is the most important view, because it tells us why we care about the Binomial; a distribution with no useful story is not worth studying.

View 2: Sum of indicator random variables

The story immediately gives a second view:

$$X = X_1 + X_2 + \cdots + X_n, \qquad X_j = \begin{cases} 1 & \text{if trial } j \text{ is a success} \\ 0 & \text{otherwise} \end{cases}$$

Each $X_j$ is an indicator random variable — it indicates whether trial $j$ succeeded (1) or failed (0). This equation says exactly what the story says: add 1 for every success, add 0 for every failure. That is just how counting works (to count to five, you add 1 five times). Breaking a complicated random variable into a sum of very simple 0/1 pieces is subtle but extremely useful.

IID

The $X_j$ here are IID Bernoulli$(p)$. Independent: the trials are independent, so their indicators are too. Identically distributed: every $X_j$ is Bernoulli$(p)$, equal to 1 with probability $p$ and 0 with probability $1 - p$. The acronym IID (independent and identically distributed) is used constantly in statistics.

View 3: The PMF

The third view writes down the probability mass function directly. For $X \sim \text{Bin}(n, p)$ with $q = 1 - p$:

$$P(X = k) = \binom{n}{k} p^k q^{\,n-k}, \qquad k = 0, 1, \ldots, n.$$

This is the probability of one specific arrangement of $k$ successes and $n - k$ failures, $p^k q^{\,n-k}$, times $\binom{n}{k}$, the number of ways to choose which trials are the successes.
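
As a rough numerical check that the three views agree, the sketch below simulates the story (View 1), builds $X$ as a sum of 0/1 indicators (View 2), and compares the observed frequencies with the PMF formula (View 3). The function names, the seed, and the choice $n = 10$, $p = 0.3$ are illustrative assumptions, not part of the lecture.

```python
import random
from math import comb

def binomial_pmf(n, p, k):
    """View 3: P(X = k) = C(n, k) * p^k * q^(n-k), with q = 1 - p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def simulate_binomial(n, p, rng):
    """Views 1 and 2: run n independent trials and add up the 0/1 indicators."""
    return sum(1 if rng.random() < p else 0 for _ in range(n))

rng = random.Random(110)  # arbitrary fixed seed, for reproducibility
n, p, reps = 10, 0.3, 20_000
draws = [simulate_binomial(n, p, rng) for _ in range(reps)]

# Print k, the simulated frequency of {X = k}, and the PMF value side by side.
for k in range(n + 1):
    empirical = draws.count(k) / reps
    print(k, round(empirical, 3), round(binomial_pmf(n, p, k), 3))
```

The two printed columns should agree to within simulation noise; they will not match exactly.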

· · ·

2. Random Variable vs. Distribution

A very common confusion is to conflate a random variable with its distribution.

Key distinction

Many different random variables can share the same distribution. The indicators $X_1, \ldots, X_n$ are distinct random variables (they depend on different trials, and they are independent of one another), yet they all have the same Bernoulli$(p)$ distribution.

· · ·

3. What a Random Variable Is

Think of the sample space $S$ in the "pebble world" picture: a finite collection of pebbles, each representing a possible outcome. A random variable is a function that assigns a number to each pebble.

[Figure: each pebble (outcome) is mapped to a number; the numbers shown are arbitrary.]

The real sample space could be enormous, high-dimensional, or infinite — impossible to draw — but the idea is the same: start with an abstract space of outcomes, assign a numerical value to each one.

"$X = 7$" is an event

The expression $X = 7$ is not an equation to solve. It is notation for an event — the set of all outcomes (pebbles) that $X$ maps to 7. Since $\{X = 7\}$ is an event (a subset of $S$), it makes sense to talk about its probability. Likewise $\{X \le x\}$ is an event for any number $x$.
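
To make the "function on pebbles" idea concrete, here is a minimal sketch using a hypothetical sample space of two fair dice (an assumption chosen for illustration; the lecture's picture is abstract). The random variable `X` is literally a Python function, and the event $\{X = 7\}$ is a set of outcomes.

```python
from fractions import Fraction
from itertools import product

# Hypothetical sample space: the 36 equally likely outcomes of rolling two dice.
sample_space = list(product(range(1, 7), repeat=2))

def X(outcome):
    """A random variable: a function from outcomes to numbers (here, the total)."""
    return outcome[0] + outcome[1]

# The event {X = 7} is the set of outcomes that X maps to 7.
event_X_equals_7 = {s for s in sample_space if X(s) == 7}

# With equally likely outcomes, P(X = 7) is the fraction of outcomes in the event.
prob = Fraction(len(event_X_equals_7), len(sample_space))
print(sorted(event_X_equals_7))
print(prob)  # 1/6
```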

· · ·

4. The CDF

For any random variable $X$, define the cumulative distribution function (CDF):

CDF

$$F(x) = P(X \le x).$$

The reasoning: before the experiment you don't know $X$; afterward you observe a value, say $X = 7$. If $x = 9$, then since $7 \le 9$ the event $\{X \le x\}$ occurred. As a function of $x$, the probability of this event is the CDF.

The CDF is one way to describe a distribution, and in principle it determines all probabilities about $X$. Questions like "what is the probability $X$ lies between 1 and 3, or between 5 and 9?" can all be answered from $F$. The CDF's strength is generality: it is defined for any random variable, discrete or not.
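
The following sketch shows how an interval probability can be read off from $F$, using a Binomial as the example. The helper `binomial_cdf` and the choice $n = 4$, $p = 1/2$ are assumptions for illustration. For a discrete random variable the endpoints matter, so the interval is taken to be $1 < X \le 3$.

```python
from fractions import Fraction
from math import comb, floor

def binomial_cdf(n, p, x):
    """F(x) = P(X <= x) for X ~ Bin(n, p); x can be any real number."""
    if x < 0:
        return Fraction(0)
    largest = min(n, floor(x))  # largest possible value of X that is <= x
    return sum((comb(n, k) * p**k * (1 - p)**(n - k) for k in range(largest + 1)),
               Fraction(0))

n, p = 4, Fraction(1, 2)
# P(1 < X <= 3) = F(3) - F(1): outcomes up to 3, minus outcomes up to 1.
prob_between = binomial_cdf(n, p, 3) - binomial_cdf(n, p, 1)
print(prob_between)  # 5/8
```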

· · ·

5. Discrete vs. Continuous; the PMF

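The PMF is defined only for discrete random variables, so we first need that distinction. A random variable is discrete if its possible values can be written as a list $a_1, a_2, \ldots$ (finite or countably infinite), as with a count of successes; a continuous random variable can take any real value in an interval.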

There are hybrids that are neither purely discrete nor continuous, but once you understand the two pure cases you can handle the hybrids. The course starts mostly with discrete, adds continuous later, and keeps using discrete throughout.

Definition of the PMF

Probability mass function

For a discrete random variable with possible values $a_1, a_2, \ldots$,

$$p_j = P(X = a_j) \quad \text{for all } j.$$

To specify the PMF you must give every one of these probabilities. It is the "blueprint" for $X$ — it says exactly how the randomness of $X$ is distributed.

Conditions for a valid PMF

A valid PMF must satisfy

$$p_j \ge 0 \text{ for all } j, \qquad \sum_j p_j = 1.$$

If the sum exceeded 1 it would be nonsensical; if it were less than 1, some possible value was left off the list — $X$ has to equal something. Conversely, any numbers $p_j$ satisfying these two conditions define a valid PMF.
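
The two conditions translate directly into a check. This is a minimal sketch with a hypothetical helper `is_valid_pmf`; it uses exact fractions so that "sums to 1" can be tested with equality (with floating-point numbers a small tolerance would be needed instead).

```python
from fractions import Fraction

def is_valid_pmf(probs):
    """Check the two conditions: every p_j >= 0 and the p_j sum to exactly 1."""
    return all(p >= 0 for p in probs) and sum(probs) == 1

fair_die = [Fraction(1, 6)] * 6                    # valid
missing_value = [Fraction(1, 6)] * 5               # sums to 5/6: a value was left off
negative_mass = [Fraction(3, 2), Fraction(-1, 2)]  # sums to 1 but has a negative entry

print(is_valid_pmf(fair_die), is_valid_pmf(missing_value), is_valid_pmf(negative_mass))
```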

For discrete random variables the PMF is usually easier to work with than the CDF; the CDF matters mainly because it also covers the non-discrete case. When a problem says "find the distribution," giving either the PMF (if discrete) or the CDF counts as a complete answer, since both are equally valid descriptions.

· · ·

6. The Binomial PMF Sums to One

Check that the Binomial PMF is valid. Non-negativity is obvious, so the only thing to verify is that it sums to 1:

$$\sum_{k=0}^{n} \binom{n}{k} p^k q^{\,n-k} = (p + q)^n = 1^n = 1.$$

The sum is exactly the Binomial Theorem, which gives $(p + q)^n$. Since $q = 1 - p$, we have $p + q = 1$, so the sum is $1^n = 1$.

This is precisely why it is called the Binomial distribution — it is tied to the Binomial Theorem. If the sum had failed to equal 1, the PMF formula would have to be wrong, so the check is reassuring.
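
As a small worked instance (the case $n = 3$, chosen here for illustration), the four PMF values are exactly the four terms in the expansion of $(p + q)^3$:

$$\underbrace{q^3}_{k=0} + \underbrace{3pq^2}_{k=1} + \underbrace{3p^2q}_{k=2} + \underbrace{p^3}_{k=3} = (p + q)^3 = 1.$$

With $p = q = 1/2$ this reads $\tfrac{1}{8} + \tfrac{3}{8} + \tfrac{3}{8} + \tfrac{1}{8} = 1$.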

· · ·

7. Sum of Two Independent Binomials

Let $X \sim \text{Bin}(n, p)$ and $Y \sim \text{Bin}(m, p)$ be independent (note: the same $p$). Claim: $X + Y \sim \text{Bin}(n + m, p)$. We prove it three ways, one for each view of the Binomial.

What does $X + Y$ mean?

$X$ and $Y$ are functions on the same sample space (pebble world). To add two functions with the same domain, evaluate both and add the values. So as long as random variables live on the same sample space $S$, it makes perfect sense to add them, multiply them, square them, cube them, exponentiate them — each operation yields a new random variable, computed outcome by outcome.
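
A minimal sketch of "adding functions outcome by outcome," on a hypothetical sample space of three fair coin flips (an assumption for illustration). Here `X` counts heads in the first two flips and `Y` counts heads in the third, so they live on the same sample space and `Z = X + Y` counts heads in all three.

```python
from fractions import Fraction
from itertools import product

# Hypothetical sample space: all 8 equally likely outcomes of three fair coin flips.
sample_space = list(product("HT", repeat=3))

def X(s):
    """Number of heads in the first two flips."""
    return s[:2].count("H")

def Y(s):
    """Number of heads in the third flip (0 or 1)."""
    return 1 if s[2] == "H" else 0

def Z(s):
    """The random variable X + Y: evaluate both at the same outcome and add."""
    return X(s) + Y(s)

# PMF of Z by counting outcomes: P(Z = k) = (number of outcomes with Z = k) / 8.
pmf_Z = {k: Fraction(sum(1 for s in sample_space if Z(s) == k), len(sample_space))
         for k in range(4)}
print(pmf_Z)
```

In this small case the resulting values $1/8, 3/8, 3/8, 1/8$ agree with the $\text{Bin}(3, 1/2)$ PMF, consistent with the claim being proved.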

Proof via the story

$X$ counts successes in $n$ trials; $Y$ counts successes in $m$ separate trials (separate because $X$ and $Y$ are independent). Together that is $n + m$ trials, each with success probability $p$. The total number of successes is the sum of the two counts, so $X + Y \sim \text{Bin}(n + m, p)$. No algebra needed.

This requires the same $p$ for both. It fails if, say, one had $p = 1/2$ and the other $p = 1/3$.
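
To see the failure concretely, take the smallest case $n = m = 1$ with the two probabilities mentioned above. The sum takes values $0, 1, 2$, so the only candidate is $\text{Bin}(2, r)$ for some $r$. The sketch below uses a consistency test that is an added observation, not from the lecture: any $\text{Bin}(2, r)$ PMF satisfies $P(1)^2 = 4\,P(0)\,P(2)$, because $(2r(1-r))^2 = 4(1-r)^2 r^2$.

```python
from fractions import Fraction

p1, p2 = Fraction(1, 2), Fraction(1, 3)  # two different success probabilities

# PMF of X + Y for independent X ~ Bernoulli(p1) and Y ~ Bernoulli(p2).
prob_0 = (1 - p1) * (1 - p2)
prob_1 = p1 * (1 - p2) + (1 - p1) * p2
prob_2 = p1 * p2

# Necessary condition for being Bin(2, r): P(1)^2 == 4 * P(0) * P(2).
is_consistent_with_binomial = prob_1**2 == 4 * prob_0 * prob_2
print(prob_0, prob_1, prob_2, is_consistent_with_binomial)
```

The condition fails ($1/4 \ne 2/9$), so no choice of $r$ makes this sum Binomial.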

Proof via indicators

Write $X = X_1 + \cdots + X_n$ and $Y = Y_1 + \cdots + Y_m$, where all the $X_j$ and $Y_j$ are independent Bernoulli$(p)$. Then

$$X + Y = (X_1 + \cdots + X_n) + (Y_1 + \cdots + Y_m),$$

which is a sum of $n + m$ IID Bernoulli$(p)$ random variables. A sum of $N$ IID Bernoulli$(p)$ is $\text{Bin}(N, p)$; here $N = n + m$, so $X + Y \sim \text{Bin}(n + m, p)$. (This easy argument would break if the variables were not independent.)

Proof via the PMF (convolution)

The hard way: compute $P(X + Y = k)$ directly. The distribution of a sum of independent random variables is called a convolution (fittingly, "convolution" and "convoluted" sound alike). We wish we knew the value of $X$, so we condition on it (we could equally condition on $Y$). The Law of Total Probability gives

$$P(X + Y = k) = \sum_{j=0}^{k} P(X + Y = k \mid X = j)\, P(X = j).$$

We sum only up to $k$: since $X$ and $Y$ are non-negative, neither can exceed $k$ when $X + Y = k$. Given $X = j$, the event $\{X + Y = k\}$ becomes $\{Y = k - j\}$. By independence, conditioning on $X = j$ gives no information about $Y$, so $P(Y = k - j \mid X = j) = P(Y = k - j)$. Plugging in both Binomial PMFs:

$$P(X + Y = k) = \sum_{j=0}^{k} \binom{m}{k-j} p^{\,k-j} q^{\,m-k+j} \cdot \binom{n}{j} p^{\,j} q^{\,n-j}.$$

Collect the powers of $p$ and $q$ (both independent of $j$, so they leave the sum):

$$= p^{\,k}\, q^{\,m+n-k} \sum_{j=0}^{k} \binom{m}{k-j}\binom{n}{j}.$$

The remaining sum is Vandermonde's identity:

$$\sum_{j=0}^{k} \binom{m}{k-j}\binom{n}{j} = \binom{m+n}{k}.$$

So $P(X + Y = k) = \binom{m+n}{k} p^{\,k} q^{\,m+n-k}$, which is exactly the $\text{Bin}(n + m, p)$ PMF.
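
The convolution sum can be checked numerically. This is a sketch under assumed values ($n = 3$, $m = 4$, $p = 1/3$) with hypothetical helper names; exact fractions are used so the comparison is an equality rather than an approximation. It also checks Vandermonde's identity for the same $n$ and $m$.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(n, p, k):
    """P(X = k) for X ~ Bin(n, p); zero outside 0..n."""
    if k < 0 or k > n:
        return Fraction(0)
    return comb(n, k) * p**k * (1 - p)**(n - k)

def convolve_pmf(n, m, p, k):
    """P(X + Y = k) via the Law of Total Probability, conditioning on X = j."""
    return sum((binomial_pmf(m, p, k - j) * binomial_pmf(n, p, j)
                for j in range(k + 1)), Fraction(0))

n, m, p = 3, 4, Fraction(1, 3)

# The convolution should reproduce the Bin(n + m, p) PMF for every k.
matches = all(convolve_pmf(n, m, p, k) == binomial_pmf(n + m, p, k)
              for k in range(n + m + 1))

# Vandermonde: sum_j C(m, k-j) * C(n, j) == C(m+n, k); math.comb returns 0
# when the lower index exceeds the upper, so out-of-range terms vanish.
vandermonde = all(
    sum(comb(m, k - j) * comb(n, j) for j in range(k + 1)) == comb(m + n, k)
    for k in range(n + m + 1)
)
print(matches, vandermonde)
```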

Bonus

This was far more work than the other two proofs — and we would have been stuck at the ugly sum without already knowing Vandermonde. As a bonus, the calculation is itself a second proof of Vandermonde's identity: if that sum had not equaled $\binom{m+n}{k}$, we would have a contradiction.

· · ·

8. When It Is NOT Binomial: The Hypergeometric

A common mistake is to assume a distribution is Binomial when it is not. The two key Binomial assumptions are that the trials are independent and have the same success probability. If either fails, it is not Binomial.

The five-card example

Draw a five-card hand from a standard 52-card deck (all $\binom{52}{5}$ subsets equally likely). Find the distribution of $X = $ the number of aces in the hand.

This is discrete ($X$ is $0, 1, 2, 3,$ or $4$), so the PMF is the natural target. Listing the possible values first is good practice — it guards against PMFs that don't sum to 1 or that allow impossible values like 2.5 or 5 aces.

Why it is not Binomial

We could treat each card as a trial (success = ace), but the trials are not independent. If the first card is an ace, the second is less likely to be one; if the first four cards are all aces, the fifth is certainly not. Independence fails, so $X$ is not Binomial.

Using the naive definition (equally likely hands):

$$P(X = k) = \frac{\binom{4}{k}\binom{48}{5-k}}{\binom{52}{5}}, \qquad k = 0, 1, 2, 3, 4.$$

Choose $k$ of the 4 aces and $5 - k$ of the 48 non-aces, out of $\binom{52}{5}$ total hands. There is a pleasing pattern: $4 + 48 = 52$ in the "tops" and $k + (5 - k) = 5$ in the "bottoms."
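
The formula is easy to evaluate exactly. The sketch below tabulates the PMF for $k = 0, \ldots, 4$; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

total_hands = comb(52, 5)  # number of equally likely five-card hands

# P(X = k): choose k of the 4 aces and 5 - k of the 48 non-aces.
aces_pmf = {k: Fraction(comb(4, k) * comb(48, 5 - k), total_hands)
            for k in range(5)}

for k, prob in aces_pmf.items():
    print(k, prob, round(float(prob), 5))

print(sum(aces_pmf.values()))  # 1
```

Most of the probability sits on $k = 0$ and $k = 1$, and the five values sum to exactly 1.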

Connection to the elk problem

This PMF looks like Vandermonde, and it is exactly the structure of the elk problem (capture–recapture) from the homework: a population of elk, some tagged and some untagged; sample some; find the probability of exactly $k$ tagged ones. Here "tag" just means "is an ace" — four cards come pre-tagged as aces, 48 are untagged. The card problem and the elk problem are not merely similar, they are the same problem. Recognizing when a new problem is identical to one already solved is a central skill in this course.

The general Hypergeometric

Generalize: a jar holds $b$ black marbles and $w$ white marbles. Take a simple random sample of size $n$ (all subsets of that size equally likely). Let $X = $ number of white marbles in the sample. (Same problem as the aces and the elk, with "white/black" in place of "ace/non-ace" and "tagged/untagged.")

Hypergeometric distribution

$$P(X = k) = \frac{\binom{w}{k}\binom{b}{\,n-k}}{\binom{w+b}{n}},$$

with constraints $0 \le k \le w$ and $0 \le n - k \le b$; for any other $k$ the probability is 0, which the formula gives automatically under the convention that $\binom{a}{c} = 0$ when $c < 0$ or $c > a$.

As with the Binomial, memorizing the PMF is far less useful than knowing the story — the story tells you when to apply the distribution.
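
A minimal sketch of the general PMF as a function, with impossible values returning 0. The name `hypergeometric_pmf` and its argument order are assumptions made here, not a reference to any library.

```python
from fractions import Fraction
from math import comb

def hypergeometric_pmf(w, b, n, k):
    """P(X = k): k white marbles in a size-n sample from w white and b black."""
    # Values of k that cannot occur get probability 0.
    if k < 0 or k > w or n - k < 0 or n - k > b:
        return Fraction(0)
    return Fraction(comb(w, k) * comb(b, n - k), comb(w + b, n))

# The aces example is the case w = 4 (aces), b = 48 (non-aces), n = 5.
print(hypergeometric_pmf(4, 48, 5, 2))

# A value that cannot occur: 3 white marbles when the jar holds only 2.
print(hypergeometric_pmf(2, 3, 4, 3))  # 0
```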

Hypergeometric vs. Binomial: with vs. without replacement

The defining difference is replacement:

| Feature | Binomial | Hypergeometric |
|---|---|---|
| Sampling | With replacement | Without replacement |
| Trials | Independent | Not independent |
| Story | Count successes in $n$ independent trials | Count white marbles in a sample of size $n$ drawn without replacement |
| PMF | $\binom{n}{k} p^k q^{\,n-k}$ | $\dfrac{\binom{w}{k}\binom{b}{\,n-k}}{\binom{w+b}{n}}$ |

If you draw a marble, replace it, and repeat, the situation resets each time and the draws are independent — that gives a Binomial. Drawing without replacement makes the trials dependent, giving a Hypergeometric.

Bridge between the two

Suppose the jar holds about a billion marbles and we sample only 10. Picking the same marble twice is then extraordinarily unlikely, so sampling with and without replacement behave almost identically. Under such conditions the Hypergeometric is approximately Binomial.
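
The approximation can be illustrated numerically. The sketch below assumes a jar split evenly between white and black marbles and compares the Hypergeometric PMF with $\text{Bin}(n, p)$ where $p = w/(w+b)$ is the fraction of white marbles (a natural choice, stated here as an assumption). The helper names are hypothetical.

```python
from fractions import Fraction
from math import comb

def hypergeometric_pmf(w, b, n, k):
    """P(X = k) when sampling n marbles without replacement."""
    if k < 0 or k > w or n - k < 0 or n - k > b:
        return Fraction(0)
    return Fraction(comb(w, k) * comb(b, n - k), comb(w + b, n))

def binomial_pmf(n, p, k):
    """P(X = k) when sampling n marbles with replacement."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def max_gap(w, b, n):
    """Largest pointwise gap between the two PMFs over k = 0..n."""
    p = Fraction(w, w + b)
    return max(abs(hypergeometric_pmf(w, b, n, k) - binomial_pmf(n, p, k))
               for k in range(n + 1))

big_jar_gap = float(max_gap(500_000_000, 500_000_000, 10))  # a billion marbles
small_jar_gap = float(max_gap(10, 10, 10))                  # only 20 marbles
print(big_jar_gap, small_jar_gap)
```

With a billion marbles the two PMFs are nearly indistinguishable; with only 20 marbles and a sample of 10, the gap is clearly visible.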

The Hypergeometric PMF sums to one

Verify validity. Non-negativity is clear. For the sum, the constant $\binom{w+b}{n}$ factors out and the numerator sum is Vandermonde again:

$$\sum_{k} \frac{\binom{w}{k}\binom{b}{\,n-k}}{\binom{w+b}{n}} = \frac{1}{\binom{w+b}{n}} \sum_{k} \binom{w}{k}\binom{b}{\,n-k} = \frac{\binom{w+b}{n}}{\binom{w+b}{n}} = 1.$$

As before, this doubles as a third proof of Vandermonde's identity: had the sum not matched, we would have a contradiction.
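
The same check can be run by machine for a few jars. This is a sketch with arbitrarily chosen cases; it relies on the fact that Python's `math.comb(a, c)` returns 0 when `c > a`, so impossible values of $k$ contribute nothing to the sum.

```python
from fractions import Fraction
from math import comb

def hypergeometric_total(w, b, n):
    """Sum of the Hypergeometric PMF over k = 0..n (impossible k contribute 0)."""
    numerator = sum(comb(w, k) * comb(b, n - k) for k in range(n + 1))
    return Fraction(numerator, comb(w + b, n))

# Arbitrary illustrative jars (w, b) and sample sizes n, with n <= w + b.
cases = [(4, 48, 5), (2, 3, 4), (7, 1, 6), (10, 10, 10)]
totals = [hypergeometric_total(w, b, n) for w, b, n in cases]
print(all(total == 1 for total in totals))
```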

· · ·

9. Shapes of CDFs

A quick look at what CDFs $F(x) = P(X \le x)$ look like, to keep in mind for next time.

Continuous CDF

For a continuous random variable, $F$ is a smooth increasing curve:

1 ——————___
        ___/
   ___/
0 _/
Continuous CDF: smooth, increasing, $0$ at the left, $1$ at the right

Discrete CDF

For a discrete random variable, $F$ is a step function with jumps at the possible values and flat stretches in between. For $X$ taking values $0, 1, 2$:

1 ————o
      o——
 o——
  0  1  2
Discrete CDF: jumps at each possible value, flat between, leveling off at $1$

The function jumps at each value the random variable can take, then stays flat until the next one, eventually leveling off at 1 forever. The circles mark the jump points: because $F$ is defined with $\le$, at a jump point $F$ takes the higher value. In the discrete case these jumpy step functions are awkward to handle, which is why the PMF is easier to use; in the continuous case the CDF is often the more convenient description.
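
The step behavior can be seen by evaluating $F$ just before and at each jump. The PMF below, on the values $0, 1, 2$, is a hypothetical choice for illustration; any valid PMF on those values would show the same shape.

```python
from fractions import Fraction

# Hypothetical PMF on the values 0, 1, 2.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def cdf(x):
    """F(x) = P(X <= x): add up the PMF over all values a with a <= x."""
    return sum((prob for value, prob in pmf.items() if value <= x), Fraction(0))

# F is 0 to the left of 0, flat between possible values, and 1 from 2 onward.
for x in [-1, 0, 0.5, 0.999, 1, 1.5, 2, 10]:
    print(x, cdf(x))
```

Note that `cdf(0.999)` is still $1/4$ while `cdf(1)` is already $3/4$: at the jump point $F$ takes the higher value, and the size of the jump is $P(X = 1) = 1/2$.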