The matching problem is the most famous example of inclusion-exclusion, treated here more carefully than in Lecture 3.
A deck of $n$ cards is labeled $1$ through $n$. You flip cards over one at a time, calling out "one, two, three, $\ldots$" as you go. You win if at some point the card you name matches the card that appears (the card in position $j$ is labeled $j$). We want $P(\text{win}) = P(\text{at least one match})$.
Let $A_j$ be the event that the $j$-th card matches. We want $P(A_1 \cup A_2 \cup \cdots \cup A_n)$.
Inclusion-exclusion adds single-event probabilities, subtracts pairwise intersections, adds triple intersections, and so on. The key ingredient is the probability of a $k$-fold intersection. By symmetry, every intersection of $k$ specific events has the same probability, so we may take the first $k$ for concreteness:

$$P(A_1 \cap A_2 \cap \cdots \cap A_k) = \frac{(n-k)!}{n!}$$
This follows from the naive definition (all $n!$ orderings equally likely). Fixing the first $k$ cards in their matching positions leaves the other $n - k$ cards free to be in any order, giving $(n-k)!$ favorable arrangements out of $n!$.
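There are $\binom{n}{k}$ such $k$-fold intersections, and all are equal by symmetry. So the $k$-th term of inclusion-exclusion collapses to a clean fraction:

$$\binom{n}{k}\,\frac{(n-k)!}{n!} = \frac{1}{k!}$$

Summing these terms with alternating signs gives

$$P(A_1 \cup A_2 \cup \cdots \cup A_n) = 1 - \frac{1}{2!} + \frac{1}{3!} - \cdots + (-1)^{n+1}\frac{1}{n!}$$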
Sanity check: the final term $1/n!$ corresponds to all $n$ cards being perfectly ordered $1$ through $n$ — there is exactly one such arrangement, so this term makes sense.
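The inclusion-exclusion answer can also be checked by brute force for small decks. The following is a minimal sketch, not part of the lecture: the function names `p_match_bruteforce` and `p_match_formula` are illustrative, and the enumeration is only practical for small $n$ because it lists all $n!$ orderings.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial


def p_match_bruteforce(n):
    """P(at least one match) by listing all n! orderings of the deck."""
    wins = sum(
        any(card == pos for pos, card in enumerate(perm, start=1))
        for perm in permutations(range(1, n + 1))
    )
    return Fraction(wins, factorial(n))


def p_match_formula(n):
    """Inclusion-exclusion: 1 - 1/2! + 1/3! - ... + (-1)^(n+1)/n!."""
    return sum(Fraction((-1) ** (k + 1), factorial(k)) for k in range(1, n + 1))


# The two computations should agree exactly for every small n.
for n in range(1, 8):
    assert p_match_bruteforce(n) == p_match_formula(n)
    print(n, p_match_formula(n))
```

Exact fractions are used so that the comparison is an equality rather than a floating-point approximation.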
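Often the problem is phrased as $P(\text{no match})$. Taking complements (the complement of a union is the intersection of complements):

$$P(\text{no match}) = P(A_1^c \cap \cdots \cap A_n^c) = 1 - P(A_1 \cup \cdots \cup A_n) = 1 - 1 + \frac{1}{2!} - \frac{1}{3!} + \cdots + (-1)^n\frac{1}{n!}$$

This is the Taylor series for $e^x$ evaluated at $x = -1$, truncated after the $n$-th term:

$$e^{-1} = \sum_{k=0}^{\infty} \frac{(-1)^k}{k!} = 1 - 1 + \frac{1}{2!} - \frac{1}{3!} + \cdots$$

So $P(\text{no match}) \approx 1/e \approx 0.37$ and $P(\text{win}) \approx 1 - 1/e \approx 0.63$.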
Recognizing factorials in the denominator should immediately suggest a Taylor series. The two series used over and over in this course are the geometric series and the series for $e^x$.
Why this is surprising: the answer barely depends on $n$. The convergence is extremely fast: even at $n = 10$ (where $n! = 3{,}628{,}800$) the gap between $P(\text{no match})$ and $1/e$ is below $1/11! \approx 2.5 \times 10^{-8}$.
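The speed of convergence is easy to see numerically. This is a small sketch under the assumption that floating-point arithmetic is accurate enough at these sizes; `p_no_match` is an illustrative name, not something from the lecture.

```python
from math import e, factorial


def p_no_match(n):
    """P(no match) = sum over k = 0..n of (-1)^k / k!."""
    return sum((-1) ** k / factorial(k) for k in range(n + 1))


for n in (2, 5, 10):
    gap = abs(p_no_match(n) - 1 / e)
    print(f"n={n:2d}  P(no match)={p_no_match(n):.10f}  |gap to 1/e|={gap:.2e}")
```

Because the series alternates with shrinking terms, the gap at deck size $n$ is bounded by the first omitted term, $1/(n+1)!$.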
This problem is a recurring favorite because it illustrates inclusion-exclusion, symmetry, and the constant $1/e$ all at once. Blitzstein's heuristic: if you have no idea what an answer is and must guess, $1/e$ is a better guess than anything else.
The definition of independence is simple to state but takes effort to fully understand. Informally, two events are independent if knowing one occurred gives no information about whether the other occurred. That intuition is too vague to verify, so we need a precise definition.
Working within a fixed sample space and probability function $P$, events $A$ and $B$ are independent if:
$$P(A \cap B) = P(A)\,P(B)$$
The probability that both occur is the probability of one times the probability of the other, because they have nothing to do with each other.
Confusing independence with disjointness is a disastrous mistake — they are opposite concepts. If $A$ and $B$ are disjoint, then learning $A$ occurred tells you $B$ is impossible, which is highly informative. Independence is the case where learning $A$ tells you nothing about $B$.
| Concept | Meaning | Knowing $A$ occurred tells you |
|---|---|---|
| Disjoint | $A$ and $B$ cannot both occur | $B$ definitely did NOT occur (strong info) |
| Independent | $P(A \cap B) = P(A)\,P(B)$ | nothing about $B$ |
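For three events $A, B, C$, pairwise independence is not enough. We require all four conditions:

$$P(A \cap B) = P(A)\,P(B)$$

$$P(A \cap C) = P(A)\,P(C)$$

$$P(B \cap C) = P(B)\,P(C)$$

$$P(A \cap B \cap C) = P(A)\,P(B)\,P(C)$$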
Having only the first three (pairwise independence) does not imply the fourth, and the fourth does not imply the first three. All four are genuinely needed; none can be dropped. Constructing a counterexample where pairwise independence holds but the triple condition fails is a recommended exercise.
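For $n$ events $A_1, \ldots, A_n$, full independence requires that for every subset of two or more of the events, the probability of the intersection equals the product of the individual probabilities: $P(A_{i_1} \cap \cdots \cap A_{i_m}) = P(A_{i_1}) \cdots P(A_{i_m})$ for all $i_1 < \cdots < i_m$ and $2 \le m \le n$.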
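Checking the product condition on every subset is mechanical, so it can be sketched in code. The example below is one possible choice of events, not one given in the lecture: two fair coin flips, with $A$ = first flip heads, $B$ = second flip heads, $C$ = the flips agree. The helper names `prob` and `multiplies` are illustrative.

```python
from fractions import Fraction
from itertools import combinations, product

# Sample space: two fair coin flips, all four outcomes equally likely.
outcomes = list(product("HT", repeat=2))


def prob(event):
    """Naive-definition probability of an event given as a predicate."""
    return Fraction(sum(1 for w in outcomes if event(w)), len(outcomes))


events = {
    "A": lambda w: w[0] == "H",   # first flip is heads
    "B": lambda w: w[1] == "H",   # second flip is heads
    "C": lambda w: w[0] == w[1],  # the two flips agree
}


def multiplies(names):
    """Does P(intersection) equal the product of the individual probabilities?"""
    joint = prob(lambda w: all(events[name](w) for name in names))
    product_of_probs = Fraction(1)
    for name in names:
        product_of_probs *= prob(events[name])
    return joint == product_of_probs


# Check every subset of size 2 and 3.
for size in (2, 3):
    for names in combinations("ABC", size):
        print(names, multiplies(names))
```

With these events every pair satisfies the product condition while the triple does not, which shows that the conditions on larger subsets are not implied by the pairwise ones.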
Independence means multiply. To find the probability of an intersection of independent events, multiply their probabilities. (It is a paraphrase of the definition, not a separate theorem.)
Notation aside: it is convenient to write commas for intersections, e.g. $P(A, B)$ for $P(A \cap B)$. This is harmless until unions enter the picture, so it is used only for intersections.
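The Newton-Pepys problem is a historical gambling problem. Samuel Pepys (the famous diarist) wanted the answer to a dice question, couldn't solve it, and wrote to Isaac Newton, who solved it for him. This was in the very early days of probability theory.

Fair six-sided dice. Which of these is most likely?

- $A$: at least one six when rolling $6$ dice
- $B$: at least two sixes when rolling $12$ dice
- $C$: at least three sixes when rolling $18$ dice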
Pepys strongly believed $C$ was most likely. The actual answer is $A$.
Assume the dice rolls are independent. For $A$, "at least one" suggests a union, but it is easier to take the complement (all non-sixes) and use independence to multiply:

$$P(A) = 1 - \left(\frac{5}{6}\right)^{6} \approx 0.665$$
About a $2/3$ chance of at least one six in six dice. This agrees with the naive definition: $6^6$ total outcomes, $5^6$ of them with no six.
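The agreement between the independence calculation and the naive count can be confirmed by listing all $6^6 = 46656$ outcomes. This is a small sketch; the variable names are illustrative.

```python
from itertools import product

rolls = product(range(1, 7), repeat=6)        # all 6^6 equally likely outcomes
no_six = sum(1 for r in rolls if 6 not in r)  # outcomes with no six at all

p_by_counting = 1 - no_six / 6**6      # naive definition
p_by_independence = 1 - (5 / 6) ** 6   # complement plus "independence means multiply"
print(no_six, 5**6, p_by_counting, p_by_independence)
```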
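For $B$, the count of sixes could be anything from $2$ to $12$, which is unwieldy, so subtract the complementary cases (zero sixes or exactly one six):

$$P(B) = 1 - \left(\frac{5}{6}\right)^{12} - 12 \cdot \frac{1}{6}\left(\frac{5}{6}\right)^{11} \approx 0.619$$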
The factor $12$ counts which of the $12$ dice shows the single six.
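For $C$, subtract the cases of zero, one, or two sixes among $18$ dice. The probability of exactly $k$ sixes among $18$ dice is a binomial probability:

$$P(\text{exactly } k \text{ sixes}) = \binom{18}{k}\left(\frac{1}{6}\right)^{k}\left(\frac{5}{6}\right)^{18-k}$$

$\binom{18}{k}$ counts which $k$ of the $18$ positions hold the sixes; $(1/6)^k$ for those sixes; $(5/6)^{18-k}$ for the remaining non-sixes. Then:

$$P(C) = 1 - \sum_{k=0}^{2} \binom{18}{k}\left(\frac{1}{6}\right)^{k}\left(\frac{5}{6}\right)^{18-k} \approx 0.597$$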
$P(A) \approx 0.665 > P(B) \approx 0.619 > P(C) \approx 0.597$. So $A$ is most likely and $C$ least likely — the opposite of Pepys's belief. Newton got the calculation right but offered a confusing intuitive argument that was actually wrong.
Statistician Stephen Stigler later showed Newton's intuition had to be wrong without decoding it: Newton's argument never used the fact that the dice are fair. With biased dice one can make $C$ most likely, yet Newton's reasoning would still claim $A$ wins regardless of the face probabilities. An argument that is invariant to a change the true answer depends on cannot be correct.
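Both the fair-dice numbers and the dependence on the face probabilities can be checked with one binomial helper. This is a sketch under stated assumptions: `p_at_least` is an illustrative name, and the biased value $P(\text{six}) = 1/2$ is a hypothetical choice made here for illustration, not necessarily the example Stigler used.

```python
from math import comb


def p_at_least(m, n, p_six):
    """P(at least m sixes in n independent rolls), where P(six) = p_six."""
    p_fewer = sum(
        comb(n, k) * p_six**k * (1 - p_six) ** (n - k) for k in range(m)
    )
    return 1 - p_fewer


# Events A, B, C ask for at least m sixes in 6*m rolls, for m = 1, 2, 3.
for label, p_six in (("fair", 1 / 6), ("biased (hypothetical)", 1 / 2)):
    a, b, c = (p_at_least(m, 6 * m, p_six) for m in (1, 2, 3))
    print(f"{label}: P(A)={a:.4f}  P(B)={b:.4f}  P(C)={c:.4f}")
```

With the fair value the ordering is $P(A) > P(B) > P(C)$; with the hypothetical biased value the ordering reverses, so any argument that ignores the face probabilities cannot settle the question.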
The point of the binomial probability, which recurs throughout the course, is to understand where it comes from, not to memorize it.
Conditional probability is the main topic for the rest of the week and one of the most important ideas in the course. The motivating question: you have beliefs and uncertainties, and you learn new things every day. How should you update your probabilities when you receive new evidence? This is central to science (experiments produce data), to investigation (clues update suspicions), and to reasoning in general. The process is sequential: today's updated probabilities become tomorrow's starting point, to be updated again.
Conditioning is the soul of statistics. Everything in the course relates to conditioning one way or another.
The conditional probability of $A$ given $B$ is:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad \text{provided } P(B) > 0$$
We pronounce the vertical bar as "given."
Interpretation: we started with some probability $P(A)$ for $A$. We then observe that $B$ occurred. If $A$ and $B$ are independent, this is irrelevant; otherwise it is valuable information, and $P(A \mid B)$ is our updated probability for $A$. The definition itself is just one probability divided by another — simple to state, but the source of deep theory.
We have moved beyond the naive definition, so outcomes need not be equally likely. Picture a finite sample space $S$ as a collection of pebbles, each representing one outcome. Each pebble has a mass (its probability), and the total mass is $1$ (we choose units so this holds). An event is a subset — a set of pebbles.
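To compute $P(A \mid B)$:

1. Discard every pebble in $B^c$, since those outcomes are now known not to have occurred.
2. Renormalize: rescale the masses of the remaining pebbles so that they again sum to $1$.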
After conditioning, $B$ is the new universe and the usual laws of probability apply. The mass that matters for $A$ is the part of $A$ inside $B$, namely $A \cap B$. So $P(A \mid B) = P(A \cap B)/P(B)$.
Discarding the pebbles in $B^c$ leaves the survivors with total mass $P(B)$, which is at most $1$ and usually less. Renormalizing means dividing each surviving mass by $P(B)$ so the masses sum to $1$ again. Check: if $A = B$, then $P(B \mid B) = P(B)/P(B) = 1$, as required.
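The pebble picture translates directly into a few lines of code. The sketch below uses a made-up five-pebble world; the outcome labels, the masses, and the function name `condition` are all hypothetical choices for illustration.

```python
from fractions import Fraction

# Hypothetical pebble world: outcome -> mass. The masses sum to 1.
pebbles = {
    "a": Fraction(1, 10), "b": Fraction(2, 10), "c": Fraction(3, 10),
    "d": Fraction(1, 10), "e": Fraction(3, 10),
}
A = {"a", "b", "c"}
B = {"b", "c", "d"}


def condition(masses, given):
    """Discard pebbles outside `given`, then rescale the rest to total mass 1."""
    total = sum(masses[w] for w in given)  # this is P(given)
    if total == 0:
        raise ValueError("cannot condition on an event of probability 0")
    return {w: masses[w] / total for w in given}


new_world = condition(pebbles, B)
# P(A | B) is the mass of A inside the renormalized world.
p_a_given_b = sum(m for w, m in new_world.items() if w in A)
print(p_a_given_b)
```

Here $P(A \cap B) = 5/10$ and $P(B) = 6/10$, so the renormalized mass of $A$ is $5/6$, matching $P(A \cap B)/P(B)$.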
Instead of running the experiment once, imagine repeating it many times. One interpretation of probability is long-run frequency: if you flip a coin $1000$ times and see $612$ heads, you estimate the probability of heads as about $612/1000$.
To find $P(A \mid B)$: list all repetitions, circle those in which $B$ occurred, and among only those circled repetitions, find the fraction in which $A$ also occurred. Restricting to the repetitions where $B$ happened, and asking how often $A$ happens within them, matches the formula $P(A \cap B)/P(B)$.
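The circle-and-count recipe can be imitated with a simulation. This is a sketch with an example chosen here, not taken from the lecture: roll two fair dice, let $B$ be "the first die is even" and $A$ be "the total is at least $9$". The seed and the number of trials are arbitrary.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable
trials = 200_000

count_b = 0        # repetitions in which B occurred (the "circled" ones)
count_a_and_b = 0  # circled repetitions in which A also occurred

for _ in range(trials):
    first, second = random.randint(1, 6), random.randint(1, 6)
    b = first % 2 == 0       # B: first die is even
    a = first + second >= 9  # A: total is at least 9
    if b:
        count_b += 1
        if a:
            count_a_and_b += 1

# Fraction of circled repetitions in which A occurred.
estimate = count_a_and_b / count_b
print(estimate)
```

For this example the exact value is $P(A \cap B)/P(B) = (6/36)/(18/36) = 1/3$, so the estimate should land near $0.333$.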
A philosophical caveat: "repeat the same experiment over and over" is itself a deep assumption — can you really step into the same river twice? We set that aside and assume exact repetition is possible.
Each of these theorems is one or two lines of algebra, yet together they are among the most useful tools in the course.
Multiplying the definition by $P(B)$ (and by symmetry of $A \cap B = B \cap A$):
$$P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$$
If $A$ and $B$ are independent, then $P(A \mid B) = P(A)$, so this reduces to $P(A \cap B) = P(A)P(B)$. "Conditioning on $B$ does nothing" is exactly what independence means.
Applying the two-event identity above (Theorem 1) repeatedly to $n$ events, and assuming every conditioning event has positive probability:
$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)\,P(A_2 \mid A_1)\,P(A_3 \mid A_1, A_2) \cdots P(A_n \mid A_1, \ldots, A_{n-1})$$
This is really $n!$ theorems in one: the events can be peeled off in any order (start with $A_7$, then $A_4$ given $A_7$, and so on), each new event conditioned on all previously taken. For some problems one ordering is hard and another is easy, so it pays to consider different permutations.
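A short worked instance may make the peeling-off concrete. The example is an illustrative choice, not one from the lecture: draw three cards without replacement from a standard 52-card deck and let $A_i$ be "the $i$-th card drawn is an ace".

```python
from fractions import Fraction
from math import comb

# Chain rule: P(A_1) * P(A_2 | A_1) * P(A_3 | A_1, A_2).
# After each ace is drawn, one fewer ace and one fewer card remain.
p_chain = Fraction(4, 52) * Fraction(3, 51) * Fraction(2, 50)

# Cross-check by counting: the 3 drawn cards form an equally likely 3-subset,
# and C(4, 3) of the C(52, 3) subsets consist of aces only.
p_count = Fraction(comb(4, 3), comb(52, 3))

print(p_chain, p_count)
```

Both routes give $1/5525$; the chain-rule route avoids counting subsets altogether.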
Starting from $P(A \cap B) = P(B \mid A)\,P(A)$ and dividing by $P(B)$:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
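The formula can be checked on a small sample space by computing $P(A \mid B)$ two ways. This is a sketch with events chosen here for illustration: two fair dice, $A$ = "the first die shows $6$", $B$ = "the total is at least $10$". The helper `prob` is a hypothetical name.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # two fair dice, 36 outcomes


def prob(event, given=lambda w: True):
    """P(event | given) by counting equally likely outcomes."""
    kept = [w for w in outcomes if given(w)]
    return Fraction(sum(1 for w in kept if event(w)), len(kept))


A = lambda w: w[0] == 6          # first die shows 6
B = lambda w: w[0] + w[1] >= 10  # total is at least 10

direct = prob(A, given=B)                         # restrict to B, then count A
via_bayes = prob(B, given=A) * prob(A) / prob(B)  # P(B|A) P(A) / P(B)
print(direct, via_bayes)
```

The direct count restricts to the six outcomes in $B$ and finds three with a first-die six; the Bayes route gets the same $1/2$ from $P(B \mid A) = 1/2$, $P(A) = 1/6$, and $P(B) = 1/6$.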
The result is due to Thomas Bayes, a Presbyterian minister who did probability on the side; it was published posthumously in 1763. The proof is trivial algebra, but the implications are extremely deep: an entire field, Bayesian statistics, rests on it. The formula is uncontroversial; how to use and interpret it has fueled controversy for centuries.