Lecture 4: Conditional Probability

Harvard Statistics 110 (Joe Blitzstein)

1. The Matching Problem Revisited (de Montmort)

This is the most famous example of inclusion-exclusion, finished more carefully than in Lecture 3.

Setup

A deck of $n$ cards is labeled $1$ through $n$. You flip cards over one at a time, calling out "one, two, three, $\ldots$" as you go. You win if at some point the card you name matches the card that appears (the card in position $j$ is labeled $j$). We want $P(\text{win}) = P(\text{at least one match})$.

Let $A_j$ be the event that the $j$-th card matches. We want $P(A_1 \cup A_2 \cup \cdots \cup A_n)$.

Solving with inclusion-exclusion and symmetry

Inclusion-exclusion adds single-event probabilities, subtracts pairwise intersections, adds triple intersections, and so on. The key ingredient is the probability of a $k$-fold intersection. By symmetry, every intersection of $k$ specific events has the same probability, so we may take the first $k$ for concreteness:

$$P(A_1 \cap A_2 \cap \cdots \cap A_k) = \frac{(n-k)!}{n!}$$

This follows from the naive definition (all $n!$ orderings equally likely). Fixing the first $k$ cards in their matching positions leaves the other $n - k$ cards free to be in any order, giving $(n-k)!$ favorable arrangements out of $n!$.

There are $\binom{n}{k}$ such $k$-fold intersections, and all are equal by symmetry. So the $k$-th term of inclusion-exclusion collapses to a clean fraction:

$$\binom{n}{k} \cdot \frac{(n-k)!}{n!} = \frac{n!}{(n-k)!\,k!} \cdot \frac{(n-k)!}{n!} = \frac{1}{k!}$$

Result

$$P(\text{at least one match}) = 1 - \frac{1}{2!} + \frac{1}{3!} - \frac{1}{4!} + \cdots + (-1)^{n+1}\frac{1}{n!}$$

Sanity check: the final term $1/n!$ corresponds to all $n$ cards being perfectly ordered $1$ through $n$ — there is exactly one such arrangement, so this term makes sense.
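As a further check, the result can be compared against brute-force counting for small $n$. The sketch below is only an illustration (the function names are ours, not from the lecture): it enumerates all $n!$ orderings, counts those with at least one match, and compares the exact fraction with the alternating sum.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial


def p_match_brute_force(n):
    """Exact P(at least one match) by enumerating all n! orderings."""
    wins = sum(
        any(card == position for position, card in enumerate(perm, start=1))
        for perm in permutations(range(1, n + 1))
    )
    return Fraction(wins, factorial(n))


def p_match_formula(n):
    """Inclusion-exclusion result: 1 - 1/2! + 1/3! - ... + (-1)^(n+1)/n!."""
    return sum(Fraction((-1) ** (k + 1), factorial(k)) for k in range(1, n + 1))


for n in range(1, 8):
    # The two computations agree exactly, since both use rational arithmetic.
    assert p_match_brute_force(n) == p_match_formula(n)
    print(n, p_match_formula(n), float(p_match_formula(n)))
```

Enumeration is only feasible for small $n$ because the number of orderings grows as $n!$; the formula has no such limit.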

The complement and the $1/e$ surprise

Often the problem is phrased as $P(\text{no match})$. Taking complements (the complement of a union is the intersection of complements):

$$P(\text{no match}) = 1 - 1 + \frac{1}{2!} - \frac{1}{3!} + \cdots + (-1)^{n}\frac{1}{n!}$$

This is the Taylor series for $e^x$ evaluated at $x = -1$, truncated after the $1/n!$ term, so for large $n$:

$$P(\text{no match}) \approx e^{-1} = \frac{1}{e} \approx 0.37$$

Key principle

Recognizing factorials in the denominator should immediately suggest a Taylor series. The two series used over and over in this course are the geometric series and the series for $e^x$.

Why this is surprising: the answer barely depends on $n$. The convergence is extremely fast: even at $n = 10$ (where $n!$ exceeds three million) the approximation is accurate to roughly $10^{-8}$.
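The speed of convergence is easy to see numerically. The following is a minimal sketch (the helper name `p_no_match` is ours) that evaluates the partial sum $\sum_{k=0}^{n} (-1)^k / k!$ in floating point and prints its distance from $1/e$.

```python
from math import exp, factorial


def p_no_match(n):
    """P(no match) for n cards: the sum of (-1)^k / k! for k = 0..n."""
    return sum((-1) ** k / factorial(k) for k in range(n + 1))


for n in (2, 4, 6, 8, 10):
    error = abs(p_no_match(n) - exp(-1))
    print(f"n={n:2d}  P(no match)={p_no_match(n):.10f}  |error|={error:.2e}")
```

Because the series alternates with shrinking terms, the error at each $n$ is smaller than the first omitted term, $1/(n+1)!$.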

This problem is a recurring favorite because it illustrates inclusion-exclusion, symmetry, and the constant $1/e$ all at once. Blitzstein's heuristic: if you have no idea what an answer is and must guess, $1/e$ is a better guess than anything else.

· · ·

2. Independence

The definition is simple to state but takes effort to fully understand. Informally, two events are independent if knowing one occurred gives no information about whether the other occurred. That intuition is too vague to verify, so we need a precise definition.

Definition (two events)

Definition — Independent Events

Working within a fixed sample space and probability function $P$, events $A$ and $B$ are independent if:

$$P(A \cap B) = P(A)\,P(B)$$

The probability that both occur is the probability of one times the probability of the other, because they have nothing to do with each other.

Independence is not disjointness

Common blunder

Confusing independence with disjointness is a disastrous mistake — they are opposite concepts. If $A$ and $B$ are disjoint, then learning $A$ occurred tells you $B$ is impossible, which is highly informative. Independence is the case where learning $A$ tells you nothing about $B$.

| Concept | Meaning | Knowing $A$ occurred tells you |
| --- | --- | --- |
| Disjoint | $A$ and $B$ cannot both occur | $B$ definitely did NOT occur (strong info) |
| Independent | $P(A \cap B) = P(A)\,P(B)$ | nothing about $B$ |
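To make the contrast concrete, here is a small sketch on one roll of a fair die. The events chosen below are our own illustration, not from the lecture: "even" and $\{1, 2\}$ are independent but not disjoint, while "even" and $\{1, 3\}$ are disjoint but not independent.

```python
from fractions import Fraction

SAMPLE_SPACE = set(range(1, 7))  # one roll of a fair die


def prob(event):
    """Naive-definition probability of an event (a subset of SAMPLE_SPACE)."""
    return Fraction(len(event & SAMPLE_SPACE), len(SAMPLE_SPACE))


def is_independent(a, b):
    """True when P(A and B) = P(A) P(B)."""
    return prob(a & b) == prob(a) * prob(b)


def is_disjoint(a, b):
    """True when A and B share no outcome."""
    return not (a & b)


even = {2, 4, 6}
low = {1, 2}      # P(even and low) = P({2}) = 1/6 = (1/2)(1/3)
odd_low = {1, 3}  # shares no outcome with `even`

print(is_independent(even, low), is_disjoint(even, low))          # True False
print(is_independent(even, odd_low), is_disjoint(even, odd_low))  # False True
```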

Independence of three or more events

For three events $A, B, C$, pairwise independence is not enough. We require all four conditions:

$$P(A \cap B) = P(A)P(B), \quad P(A \cap C) = P(A)P(C), \quad P(B \cap C) = P(B)P(C)$$
$$P(A \cap B \cap C) = P(A)\,P(B)\,P(C)$$

Having only the first three (pairwise independence) does not imply the fourth, and the fourth does not imply the first three. All four are genuinely needed; none can be dropped. Constructing a counterexample where pairwise independence holds but the triple condition fails is a recommended exercise.
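One standard counterexample, offered here as a sketch rather than as the example used in lecture: flip two fair coins, and let $A$ = first flip is heads, $B$ = second flip is heads, $C$ = the two flips agree. The code checks all three pairwise conditions and then the triple condition by direct counting.

```python
from fractions import Fraction
from itertools import combinations, product

OUTCOMES = list(product("HT", repeat=2))  # two fair coin flips, 4 equally likely outcomes


def prob(event):
    """Probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for w in OUTCOMES if event(w)), len(OUTCOMES))


def first_heads(w):
    return w[0] == "H"


def second_heads(w):
    return w[1] == "H"


def flips_agree(w):
    return w[0] == w[1]


events = [first_heads, second_heads, flips_agree]

# Every pair satisfies P(X and Y) = P(X) P(Y).
pairwise = all(
    prob(lambda w: x(w) and y(w)) == prob(x) * prob(y)
    for x, y in combinations(events, 2)
)

# The triple condition fails: the intersection is {HH}, with probability 1/4.
triple_lhs = prob(lambda w: all(e(w) for e in events))
triple_rhs = prob(first_heads) * prob(second_heads) * prob(flips_agree)

print(pairwise, triple_lhs, triple_rhs)  # True 1/4 1/8
```

Any two of these events carry no information about the third on their own, yet together they determine it completely, which is why the triple product fails.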

For $n$ events $A_1, \ldots, A_n$, full independence requires that for every subset of two or more of the events, the probability of the intersection equals the product of the individual probabilities.

Slogan

Independence means multiply. To find the probability of an intersection of independent events, multiply their probabilities. (It is a paraphrase of the definition, not a separate theorem.)

Notation aside: it is convenient to write commas for intersections, e.g. $P(A, B)$ for $P(A \cap B)$. This is harmless until unions enter the picture, so it is used only for intersections.

· · ·

3. The Newton-Pepys Problem (1693)

A historical gambling problem. Samuel Pepys (the famous diarist) wanted the answer to a dice question, couldn't solve it, and wrote to Isaac Newton, who solved it for him. This was in the very early days of probability theory.

Setup

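Fair six-sided dice. Which of these is most likely?

  - $A$: at least one six when rolling $6$ dice
  - $B$: at least two sixes when rolling $12$ dice
  - $C$: at least three sixes when rolling $18$ dice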

Pepys strongly believed $C$ was most likely. The actual answer is $A$.

Computing $P(A)$

Assume the dice rolls are independent. "At least one" suggests a union, but it is easier to take the complement (all non-sixes) and use independence to multiply:

$$P(A) = 1 - \left(\tfrac{5}{6}\right)^{6} \approx 0.665$$

About a $2/3$ chance of at least one six in six dice. This agrees with the naive definition: $6^6$ total outcomes, $5^6$ of them with no six.

Computing $P(B)$

The count of sixes could be anything from $2$ to $12$, which is unwieldy, so subtract the complementary cases (zero sixes or exactly one six):

$$P(B) = 1 - \left(\tfrac{5}{6}\right)^{12} - 12 \cdot \tfrac{1}{6}\left(\tfrac{5}{6}\right)^{11} \approx 0.619$$

The factor $12$ counts which of the $12$ dice shows the single six.

Computing $P(C)$

Subtract the cases of zero, one, or two sixes among $18$ dice. The probability of exactly $k$ sixes among $18$ dice is a binomial probability:

$$P(\text{exactly } k \text{ sixes}) = \binom{18}{k}\left(\tfrac{1}{6}\right)^{k}\left(\tfrac{5}{6}\right)^{18-k}$$

$\binom{18}{k}$ counts which $k$ of the $18$ positions hold the sixes; $(1/6)^k$ for those sixes; $(5/6)^{18-k}$ for the remaining non-sixes. Then:

$$P(C) = 1 - \sum_{k=0}^{2} \binom{18}{k}\left(\tfrac{1}{6}\right)^{k}\left(\tfrac{5}{6}\right)^{18-k} \approx 0.597$$

Conclusion

$P(A) \approx 0.665 > P(B) \approx 0.619 > P(C) \approx 0.597$. So $A$ is most likely and $C$ least likely — the opposite of Pepys's belief. Newton got the calculation right but offered a confusing intuitive argument that was actually wrong.
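All three values come from the same complement-plus-binomial computation, which the sketch below reproduces. The helper `p_at_least` is our own name; it assumes independent rolls, with probability $1/6$ of a six on each die.

```python
from math import comb


def p_at_least(k, n, p=1 / 6):
    """P(at least k sixes among n independent dice), via the complement."""
    p_fewer = sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k))
    return 1 - p_fewer


p_a = p_at_least(1, 6)   # A: at least one six among 6 dice
p_b = p_at_least(2, 12)  # B: at least two sixes among 12 dice
p_c = p_at_least(3, 18)  # C: at least three sixes among 18 dice

print(f"P(A)={p_a:.3f}  P(B)={p_b:.3f}  P(C)={p_c:.3f}")
```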

Stigler's diagnosis

Statistician Stephen Stigler later showed Newton's intuition had to be wrong without decoding it: Newton's argument never used the fact that the dice are fair. With biased dice one can make $C$ most likely, yet Newton's reasoning would still claim $A$ wins regardless of the face probabilities. An argument that is invariant to a change the true answer depends on cannot be correct.
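Stigler's point can be checked numerically. The sketch below reruns the binomial computation with an adjustable probability of rolling a six; the value $1/2$ is a hypothetical bias chosen only for illustration, and the function names are ours.

```python
from math import comb


def p_at_least(k, n, p):
    """P(at least k sixes among n independent dice, each a six with probability p)."""
    p_fewer = sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k))
    return 1 - p_fewer


def ranking(p):
    """Events A, B, C sorted from most to least likely when P(six) = p."""
    probs = {
        "A": p_at_least(1, 6, p),
        "B": p_at_least(2, 12, p),
        "C": p_at_least(3, 18, p),
    }
    return sorted(probs, key=probs.get, reverse=True)


print(ranking(1 / 6))  # fair dice: ['A', 'B', 'C']
print(ranking(1 / 2))  # hypothetical biased dice: ['C', 'B', 'A']
```

Since the ordering flips when the bias changes, any valid argument for the fair-dice ordering has to use the value $1/6$ somewhere.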

The point of the binomial probability, which recurs throughout the course, is to understand where it comes from, not to memorize it.

· · ·

4. Conditional Probability

This is the main topic for the rest of the week and one of the most important ideas in the course. The motivating question: you have beliefs and uncertainties, and you learn new things every day. How should you update your probabilities when you receive new evidence? This is central to science (experiments produce data), to investigation (clues update suspicions), and to reasoning in general. The process is sequential: today's updated probabilities become tomorrow's starting point, to be updated again.

Motto

Conditioning is the soul of statistics. Everything in the course relates to conditioning one way or another.

Definition

Definition — Conditional Probability

The conditional probability of $A$ given $B$ is:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad \text{provided } P(B) > 0$$

We pronounce the vertical bar as "given."

Interpretation: we started with some probability $P(A)$ for $A$. We then observe that $B$ occurred. If $A$ and $B$ are independent, this is irrelevant; otherwise it is valuable information, and $P(A \mid B)$ is our updated probability for $A$. The definition itself is just one probability divided by another — simple to state, but the source of deep theory.

Intuition 1: Pebble World

We have moved beyond the naive definition, so outcomes need not be equally likely. Picture a finite sample space $S$ as a collection of pebbles, each representing one outcome. Each pebble has a mass (its probability), and the total mass is $1$ (we choose units so this holds). An event is a subset — a set of pebbles.

[Figure: 9 pebbles with total mass 1; the 4 enclosed pebbles make up event $B$.]

To compute $P(A \mid B)$:

  1. Condition on $B$: we learned $B$ occurred, so every pebble outside $B$ is now irrelevant. Discard the pebbles in $B^c$.
  2. Renormalize: the surviving pebbles no longer sum to mass $1$. Rescale them by a constant so they do; dividing each mass by $P(B)$ achieves exactly this.

After conditioning, $B$ is the new universe and the usual laws of probability apply. The mass that matters for $A$ is the part of $A$ inside $B$, namely $A \cap B$. So $P(A \mid B) = P(A \cap B)/P(B)$.

Why divide by $P(B)$

Discarding the pebbles in $B^c$ leaves the survivors with total mass $P(B)$, which is generally less than $1$. Renormalizing means dividing by $P(B)$ so the masses sum to $1$ again. Check: if $A = B$, then $P(B \mid B) = P(B)/P(B) = 1$, as required.
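The two steps of Pebble World translate directly into code. This is a minimal sketch with hypothetical pebble labels and masses (they are not from the lecture); `condition` discards the pebbles outside $B$ and rescales the rest.

```python
from fractions import Fraction

# Hypothetical pebble masses (outcome -> probability); they sum to 1.
pebbles = {
    "a": Fraction(1, 10),
    "b": Fraction(2, 10),
    "c": Fraction(3, 10),
    "d": Fraction(1, 10),
    "e": Fraction(3, 10),
}


def prob(masses, event):
    """Total mass of the pebbles in `event`."""
    return sum((m for w, m in masses.items() if w in event), Fraction(0))


def condition(masses, event_b):
    """Discard pebbles outside B, then divide the survivors by P(B)."""
    p_b = prob(masses, event_b)
    if p_b == 0:
        raise ValueError("cannot condition on an event of probability 0")
    return {w: m / p_b for w, m in masses.items() if w in event_b}


A = {"a", "c"}
B = {"b", "c", "d"}
new_world = condition(pebbles, B)

print(sum(new_world.values()))  # 1: the survivors have been renormalized
print(prob(new_world, A))       # 1/2, which equals P(A and B) / P(B) = (3/10) / (6/10)
```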

Intuition 2: Frequentist World

Instead of running the experiment once, imagine repeating it many times. One interpretation of probability is long-run frequency: if you flip a coin $1000$ times and see $612$ heads, you estimate the probability of heads as about $612/1000$.

To find $P(A \mid B)$: list all repetitions, circle those in which $B$ occurred, and among only those circled repetitions, find the fraction in which $A$ also occurred. Restricting to the repetitions where $B$ happened, and asking how often $A$ happens within them, matches the formula $P(A \cap B)/P(B)$.
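In symbols, if $B$ occurs in $n_B$ of $n$ repetitions and both $A$ and $B$ occur in $n_{AB}$ of them, the fraction is $n_{AB}/n_B = (n_{AB}/n)/(n_B/n)$, which approaches $P(A \cap B)/P(B)$. The simulation below is a sketch of this procedure under assumptions of our own choosing: the experiment is a roll of two fair dice, $A$ is "the sum is 8", and $B$ is "the first die is even", for which the exact answer is $1/6$.

```python
import random

rng = random.Random(110)  # fixed seed so the run is reproducible
n_trials = 200_000

count_b = 0        # repetitions in which B occurred (the "circled" ones)
count_a_and_b = 0  # circled repetitions in which A also occurred

for _ in range(n_trials):
    first, second = rng.randint(1, 6), rng.randint(1, 6)
    b_occurred = first % 2 == 0       # B: the first die is even
    a_occurred = first + second == 8  # A: the two dice sum to 8
    if b_occurred:
        count_b += 1
        if a_occurred:
            count_a_and_b += 1

estimate = count_a_and_b / count_b
print(f"estimated P(A | B) = {estimate:.4f}; exact value is 1/6 = {1 / 6:.4f}")
```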

A philosophical caveat: "repeat the same experiment over and over" is itself a deep assumption — can you really step into the same river twice? We set that aside and assume exact repetition is possible.

· · ·

5. Theorems from the Definition

Each of these theorems is one or two lines of algebra, yet together they are among the most useful tools in the course.

Theorem 1: Multiplication rule

Theorem 1 — Multiplication Rule

Multiplying the definition by $P(B)$ (and by symmetry of $A \cap B = B \cap A$):

$$P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$$

If $A$ and $B$ are independent, then $P(A \mid B) = P(A)$, so this reduces to $P(A \cap B) = P(A)P(B)$. "Conditioning on $B$ does nothing" is exactly what independence means.
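Both factorizations can be confirmed by counting. The sketch below uses an example of our own (two fair dice, $A$ = "the sum is 8", $B$ = "the first die is even"); the helper `prob` computes a conditional probability by restricting to the outcomes where the conditioning event holds.

```python
from fractions import Fraction
from itertools import product

OUTCOMES = list(product(range(1, 7), repeat=2))  # two fair dice, 36 outcomes


def prob(event, given=lambda w: True):
    """P(event | given), by counting equally likely outcomes."""
    world = [w for w in OUTCOMES if given(w)]
    return Fraction(sum(1 for w in world if event(w)), len(world))


def a(w):
    return w[0] + w[1] == 8  # A: the dice sum to 8


def b(w):
    return w[0] % 2 == 0     # B: the first die is even


p_ab = prob(lambda w: a(w) and b(w))

# P(A and B) = P(A | B) P(B) = P(B | A) P(A)
assert p_ab == prob(a, given=b) * prob(b) == prob(b, given=a) * prob(a)
print(p_ab)  # 1/12
```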

Theorem 2: Chain rule

Theorem 2 — Chain Rule (general multiplication rule)

Applying Theorem 1 repeatedly to $n$ events:

$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)\,P(A_2 \mid A_1)\,P(A_3 \mid A_1, A_2) \cdots P(A_n \mid A_1, \ldots, A_{n-1})$$

This is really $n!$ theorems in one: the events can be peeled off in any order (start with $A_7$, then $A_4$ given $A_7$, and so on), each new event conditioned on all previously taken. For some problems one ordering is hard and another is easy, so it pays to consider different permutations.
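A short worked instance, chosen by us for illustration: draw three cards from a standard 52-card deck without replacement and let $A_j$ be the event that the $j$-th card is an ace. The chain rule gives one factor per draw, and the result can be checked against a direct count of three-card hands.

```python
from fractions import Fraction
from math import comb

# Chain rule: P(A1) * P(A2 | A1) * P(A3 | A1, A2).
# After each ace is drawn, one fewer ace and one fewer card remain.
chain = Fraction(4, 52) * Fraction(3, 51) * Fraction(2, 50)

# Direct count: all C(52, 3) unordered hands are equally likely,
# and C(4, 3) of them consist of three aces.
direct = Fraction(comb(4, 3), comb(52, 3))

assert chain == direct
print(chain)  # 1/5525
```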

Theorem 3: Bayes' rule

Theorem 3 — Bayes' Rule

Starting from $P(A \cap B) = P(B \mid A)\,P(A)$ and dividing by $P(B)$:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

Named for Thomas Bayes, a Presbyterian minister who did probability on the side; his work on it was published posthumously in 1763. The proof is trivial algebra, but the implications are extremely deep: an entire field, Bayesian statistics, rests on it. The formula is uncontroversial; how to use and interpret it has fueled controversy for centuries.
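A small check of the formula, using an example of our own: draw one card from a standard deck, with $A$ = "the card is a king" and $B$ = "the card is a face card (J, Q, K)". Here $P(B \mid A) = 1$ but $P(A \mid B) = 1/3$, so the two conditional probabilities are very different, and Bayes' rule is what connects them.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]
DECK = list(product(RANKS, SUITS))  # 52 equally likely cards


def prob(event, given=lambda card: True):
    """P(event | given), by counting equally likely cards."""
    world = [card for card in DECK if given(card)]
    return Fraction(sum(1 for card in world if event(card)), len(world))


def is_king(card):
    return card[0] == "K"


def is_face(card):
    return card[0] in {"J", "Q", "K"}


# Bayes' rule: P(A | B) = P(B | A) P(A) / P(B)
via_bayes = prob(is_face, given=is_king) * prob(is_king) / prob(is_face)
direct = prob(is_king, given=is_face)

print(via_bayes, direct)  # 1/3 1/3
```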