The theme for the day is not just probability but thinking. Probability is how to reason about uncertainty and randomness, so this is a "thinking course" as much as a statistics course. The math behind conditional probability is easy — last lecture's theorems came from multiplying both sides of the definition by something — but applying it correctly is subtle.
Conditioning is the central tool of the course. As Blitzstein puts it, conditioning is "a condition for thinking": you cannot reason clearly about uncertainty unless you understand how to condition properly.
There are two distinct reasons it is so important:
This is also a course in problem solving. Two general strategies recur throughout:
At Caltech the hero was physicist Richard Feynman, and people joked about the "Feynman algorithm" for solving any problem: (1) write down the problem, (2) think very hard, (3) write down the solution. That worked for Feynman but not for the rest of us — we need actual strategies.
The "break into pieces" strategy, made precise, is the law of total probability. Draw the sample space $S$ and an event $B$ (a "blob") whose probability is hard to compute directly. Chop $S$ into pieces $A_1, A_2, \ldots, A_n$.
A partition of $S$ is a collection of events $A_1, \ldots, A_n$ that are disjoint (no two of them overlap) and whose union is all of $S$.
The pieces need not be rectangles or any particular shape — chop up the space however is convenient, as long as the pieces are disjoint and cover $S$.
Given a partition $A_1, \ldots, A_n$ of $S$, for any event $B$:
$$P(B) = \sum_{i=1}^{n} P(B \cap A_i) = \sum_{i=1}^{n} P(B \mid A_i)\, P(A_i)$$
The first form is immediate from the second axiom of probability: $B$ is split into the disjoint pieces $B \cap A_i$, so its probability is the sum of theirs. The second form just rewrites each piece using the definition of conditional probability, $P(B \cap A_i) = P(B \mid A_i)\, P(A_i)$. (As noted last time, you can factor the intersection in either order — that is why there were "$n!$ theorems.")
No separate proof is needed; the law is immediate from the axioms. Its usefulness depends entirely on choosing a good partition. A bad partition turns one problem into $n$ problems each as hard as the original. A good partition turns one hard problem into $n$ easy ones. Statistics is part science, part art — picking useful partitions takes practice.
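As a minimal sketch of the law in use, the snippet below sums $P(B \mid A_i)\,P(A_i)$ over a partition using exact fractions. The helper name `total_probability` and the card example (is the second card dealt an ace, partitioned on whether the first card is an ace) are illustrative assumptions, not taken from the lecture.

```python
from fractions import Fraction

def total_probability(priors, likelihoods):
    """Return P(B) = sum_i P(B | A_i) * P(A_i) for a partition A_1..A_n."""
    if sum(priors) != 1:
        raise ValueError("the P(A_i) must sum to 1 for a partition")
    return sum(l * p for l, p in zip(likelihoods, priors))

# Illustrative example (an assumption, not from the lecture):
# B = "the second card dealt is an ace", partitioned on the first card.
priors = [Fraction(4, 52), Fraction(48, 52)]       # P(first is ace), P(first is not ace)
likelihoods = [Fraction(3, 51), Fraction(4, 51)]   # P(B | first is ace), P(B | first is not ace)

p_b = total_probability(priors, likelihoods)
print(p_b)  # 1/13
```

Conditioning on the first card is a "good" partition here: each conditional probability is a one-line count, whereas attacking $P(B)$ head-on is less obvious.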
Draw a random 2-card hand from a standard 52-card deck (all hands equally likely, so the naive definition applies). Compare two conditional probabilities that sound nearly identical but differ substantially.
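By the definition of conditional probability, and noting that "both aces" makes "have an ace" redundant (so the intersection is just "both aces"):
$$P(\text{both aces} \mid \text{have an ace}) = \frac{P(\text{both aces})}{P(\text{have an ace})}$$
Numerator, counting without regard to order:
$$P(\text{both aces}) = \frac{\binom{4}{2}}{\binom{52}{2}} = \frac{6}{1326}$$
Denominator, via the complement (easier than splitting into cases):
$$P(\text{have an ace}) = 1 - \frac{\binom{48}{2}}{\binom{52}{2}} = 1 - \frac{1128}{1326} = \frac{198}{1326}$$
Putting it together and simplifying:
$$P(\text{both aces} \mid \text{have an ace}) = \frac{6}{198} = \frac{1}{33}$$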
By symmetry: condition on holding the ace of spades. The other card is equally likely to be any of the remaining 51 cards. We have both aces exactly when that card is one of the three remaining aces:
$$P(\text{both aces} \mid \text{have the ace of spades}) = \frac{3}{51} = \frac{1}{17}$$
(Plugging into the definition the long way gives the same answer.)
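Both conditional probabilities can also be checked "the long way" by brute-force enumeration of all 2-card hands. The sketch below is one way to do that; the card encoding (rank character followed by suit character) and the helper `cond_prob` are choices made here for illustration.

```python
from fractions import Fraction
from itertools import combinations

ranks = "A23456789TJQK"
suits = "SHDC"
deck = [r + s for r in ranks for s in suits]
hands = list(combinations(deck, 2))  # all 1326 equally likely 2-card hands

def cond_prob(event, given):
    """P(event | given) under the naive definition: a ratio of counts."""
    given_hands = [h for h in hands if given(h)]
    both = [h for h in given_hands if event(h)]
    return Fraction(len(both), len(given_hands))

def both_aces(hand):
    return all(card[0] == "A" for card in hand)

def has_ace(hand):
    return any(card[0] == "A" for card in hand)

def has_ace_of_spades(hand):
    return "AS" in hand

p_given_ace = cond_prob(both_aces, has_ace)
p_given_ace_of_spades = cond_prob(both_aces, has_ace_of_spades)
print(p_given_ace, p_given_ace_of_spades)  # 1/33 1/17
```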
| Condition | Probability |
|---|---|
| Given an ace (unspecified suit) | $\frac{1}{33} \approx 3.0\%$ |
| Given the ace of spades | $\frac{1}{17} \approx 5.9\%$ |
Specifying the suit nearly doubles the probability — yet $\frac{1}{17}$ does not depend on which suit we name. Hearts, clubs, diamonds, or spades all give $\frac{1}{17}$. But "we have an ace" (suit unspecified) yields $\frac{1}{33}$.
"We have an ace" only asserts at least one ace, so you cannot pin a concrete card to your hand. "We have the ace of spades" lets you fix one card and treat the other as a single mystery card with a clean symmetry argument. The distinction is between conditioning on "at least one" versus a specific labeled object — a problem worth pondering at length.
A patient is tested for a disease. This everyday problem shows why it pays to specify carefully what you are conditioning on and what your goal is — a hint useful for homework too: state explicitly "find $P(\text{what} \mid \text{what})$" with clear notation.
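Define events (write them out fully; "disease" alone is not an event, and do not use $P$ for "positive," since it collides with $P$ for probability): let $D$ be the event that the patient has the disease and $T$ the event that the patient tests positive. The test is assumed to satisfy: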
$$P(T \mid D) = 0.95 \qquad P(T^c \mid D^c) = 0.95$$
Given disease, the test correctly reports positive 95% of the time; given no disease, it correctly reports negative 95% of the time. It follows that $P(T \mid D^c) = 0.05$.
A common real-world mistake is confusing $P(T \mid D)$ with $P(D \mid T)$. The test reliability gives $P(T \mid D)$, but the patient cares about $P(D \mid T)$ — the chance of having the disease given a positive test. These are different concepts, related by Bayes' rule.
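We know $P(T \mid D) = 0.95$ and $P(D) = 0.01$. The only unknown is the denominator $P(T)$, which we expand by the law of total probability over the partition $\{D, D^c\}$:
$$P(D \mid T) = \frac{P(T \mid D)\,P(D)}{P(T)} = \frac{P(T \mid D)\,P(D)}{P(T \mid D)\,P(D) + P(T \mid D^c)\,P(D^c)}$$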
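Bayes' rule and the law of total probability are very commonly used in tandem: first the clean one-line Bayes' rule, then the denominator expanded when needed. Plugging in:
$$P(D \mid T) = \frac{(0.95)(0.01)}{(0.95)(0.01) + (0.05)(0.99)} = \frac{0.0095}{0.059} \approx 0.16$$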
Despite the test being "95% accurate," there is only about a 16% chance the patient has the disease. This surprises most patients and most doctors. In a Harvard study, around 60 doctors were asked a similar question, and roughly 80% guessed numbers like 95% — far too high.
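The same calculation can be written as a small function, which makes it easy to see how the answer depends on each input. This is a sketch: the function name `posterior` and the parameter names `sensitivity` (for $P(T \mid D)$) and `specificity` (for $P(T^c \mid D^c)$) are labels chosen here, not notation from the lecture.

```python
def posterior(prior, sensitivity, specificity):
    """P(D | T) via Bayes' rule, with P(T) expanded over the partition {D, D^c}."""
    p_t_given_d = sensitivity
    p_t_given_not_d = 1 - specificity
    # Law of total probability for the denominator P(T).
    p_t = p_t_given_d * prior + p_t_given_not_d * (1 - prior)
    return p_t_given_d * prior / p_t

p = posterior(prior=0.01, sensitivity=0.95, specificity=0.95)
print(round(p, 3))  # 0.161
```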
Two morals:
Of 1,000 patients, about 10 have the disease; suppose all 10 test positive. Of the 990 without it, about 5% (roughly 50) test positive; these are false positives. So there are about 50 false positives versus 10 true positives, a ratio of 5 to 1. The share of positives who truly have the disease is about $\frac{10}{60} = \frac{1}{6} \approx 17\%$, close to the 16% from the exact calculation.
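The frequency argument can be mimicked with a simulation. The sketch below assumes the same 1% prevalence and 95% accuracy, draws a large hypothetical population, and reports the fraction of positive tests that are true positives; the function name `simulate`, the population size, and the seed are arbitrary choices made here.

```python
import random

def simulate(n_patients, prior=0.01, accuracy=0.95, seed=0):
    """Simulate testing; return the fraction of positive tests that are true positives."""
    rng = random.Random(seed)
    true_pos = false_pos = 0
    for _ in range(n_patients):
        diseased = rng.random() < prior
        correct = rng.random() < accuracy  # the test is right with probability `accuracy`
        positive = diseased if correct else not diseased
        if positive:
            if diseased:
                true_pos += 1
            else:
                false_pos += 1
    return true_pos / (true_pos + false_pos)

estimate = simulate(200_000)
print(f"estimated P(D | T) = {estimate:.3f}")
```

The estimate fluctuates with the seed, but for a population this large it should land near the exact value of about 0.16.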
Bayes' rule is coherent: it does not matter whether you incorporate evidence all at once or piece by piece, or in what order. Investigating a crime, you might get two clues at once and update on their intersection, or get one clue, break for lunch, return, and update on the second — the final probability of the thing you care about given all the evidence is the same.
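Coherence can be illustrated with the disease test by supposing the patient takes the test twice and is positive both times. This two-test scenario is a hypothetical extension, and the sketch assumes the two results are conditionally independent given disease status. Updating one result at a time and updating on both at once give the same posterior.

```python
from fractions import Fraction

def update(prior, p_e_given_d, p_e_given_not_d):
    """One Bayes update: return P(D | evidence) from P(D) and the two likelihoods."""
    num = p_e_given_d * prior
    return num / (num + p_e_given_not_d * (1 - prior))

prior = Fraction(1, 100)
hit, false_alarm = Fraction(95, 100), Fraction(5, 100)

# Piece by piece: the posterior after the first positive becomes the new prior.
after_first = update(prior, hit, false_alarm)
sequential = update(after_first, hit, false_alarm)

# All at once: condition on "both tests positive" in a single step.
# Assumes the two tests are conditionally independent given disease status.
batch = update(prior, hit * hit, false_alarm * false_alarm)

print(sequential, batch, sequential == batch)  # 361/460 361/460 True
```

Exact fractions are used so that the equality check is not blurred by floating-point rounding.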
A student asked what happens if the patient came in because of symptoms. The calculation changes, but the principle is the same. With initial evidence you update first, so the relevant prior is no longer the bare 1%: the starting probability is higher, and the numbers shift accordingly.
Blitzstein calls these "biohazards" — common conditional-probability mistakes that are hazardous to your statistical health.
$P(A \mid B)$ and $P(B \mid A)$ are different; Bayes' rule connects them but does not make them equal. Confusing the two is sometimes called the prosecutor's fallacy (a name unfair to prosecutors, since defense attorneys, doctors, and everyone else make the same mistake). In a criminal case you care about $P(\text{guilt} \mid \text{evidence})$, but the fallacy fixates on $P(\text{evidence} \mid \text{innocence})$.
Sally Clark, a British woman, lost two babies to unexplained causes (labelled SIDS) and was convicted of murdering them. The case rested on an "expert" who claimed the chance of one baby dying mysteriously was $\frac{1}{8500}$, then multiplied for two babies: $\frac{1}{8500} \cdot \frac{1}{8500} \approx \frac{1}{73{,}000{,}000}$.
Two errors: (1) the multiplication assumes independence of the two deaths — unjustified, since a shared genetic factor could link them; (2) even granting that, $\frac{1}{73{,}000{,}000}$ is $P(\text{evidence} \mid \text{innocence})$, not the relevant $P(\text{innocence} \mid \text{evidence})$. By Bayes' rule the latter involves the prior $P(\text{innocence})$, which — given billions of non-murdering mothers — is extremely close to 1. That tradeoff was ignored; she was imprisoned, later exonerated, and died shortly after release.
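To make the tradeoff explicit, write $I$ for innocence, $G = I^c$ for guilt, and $E$ for the evidence (shorthand introduced here for illustration). Bayes' rule with the denominator expanded over the partition $\{I, G\}$ reads:

$$P(I \mid E) = \frac{P(E \mid I)\,P(I)}{P(E \mid I)\,P(I) + P(E \mid G)\,P(G)}$$

A tiny $P(E \mid I)$ is multiplied by a prior $P(I)$ close to 1, while $P(E \mid G)$ is multiplied by a tiny prior $P(G)$. The two terms in the denominator can therefore be of comparable size, and a small $P(E \mid I)$ alone does not make $P(I \mid E)$ small.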
| Term | Meaning |
|---|---|
| Prior — $P(A)$ | Probability before observing evidence |
| Posterior — $P(A \mid B)$ | Probability after observing evidence $B$ |
A problem saying "$A$ occurred" tempts students to write $P(A) = 1$. That is wrong. The correct statement is $P(A \mid A) = 1$ — given that $A$ occurred, $A$ has probability 1 — but the unconditional $P(A)$ is not 1. "We observe that $A$ occurred" means we compute quantities conditional on $A$; be careful about what goes left versus right of the conditioning bar.
Confusing independence with conditional independence is the most subtle of the three biohazards, and it is consequential in practice.
Events $A$ and $B$ are conditionally independent given an event $C$ if:
$$P(A \cap B \mid C) = P(A \mid C)\, P(B \mid C)$$
This is the definition of independence with everything conditioned on $C$. You must always say what you condition on — "conditionally independent" alone is incomplete.
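Two natural questions follow, and the answer to both is no: does conditional independence given $C$ imply independence, and does independence imply conditional independence given $C$? The two examples that follow give a counterexample to each.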
You play a series of games against the same opponent whose rating you do not know. Conditional on the opponent's true strength, assume the outcomes are independent. They are not unconditionally independent: winning the first five games makes you confident you are stronger, so early games are evidence about strength and are informative about later games. Unconditional independence would mean early games carry no such information — but they clearly do.
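A toy version of this example can be computed exactly. The model below is hypothetical: the opponent is "weak" or "strong" with equal probability, you win each game with probability 9/10 against a weak opponent and 1/10 against a strong one, and games are independent given the opponent's strength. Unconditionally, winning the first game changes the probability of winning the second.

```python
from fractions import Fraction

# Hypothetical model: the opponent is "weak" or "strong" with equal probability,
# and given their strength you win each game independently.
p_strength = {"weak": Fraction(1, 2), "strong": Fraction(1, 2)}
p_win = {"weak": Fraction(9, 10), "strong": Fraction(1, 10)}

# Law of total probability over the opponent's strength.
p_w1 = sum(p_strength[s] * p_win[s] for s in p_strength)
p_w1_and_w2 = sum(p_strength[s] * p_win[s] ** 2 for s in p_strength)
p_w2_given_w1 = p_w1_and_w2 / p_w1

print(p_w1, p_w2_given_w1)  # 1/2 41/50
```

By symmetry $P(W_2) = P(W_1) = \frac{1}{2}$, yet $P(W_2 \mid W_1) = \frac{41}{50}$, so the two games are not independent once the opponent's strength is averaged out.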
Let $A$ = the alarm sounds, $F$ = there is a fire, $C$ = someone is making popcorn. Suppose $F$ and $C$ are independent and the alarm sounds exactly when at least one of them occurs. Then:
$$P(F \mid A \cap C^c) = 1$$
Ruling out popcorn forces fire (Sherlock Holmes: eliminate the other explanations and what remains must be true). So given the alarm, $F$ and $C$ become dependent — "explaining away" one cause raises the other. They are independent unconditionally but not conditionally independent given $A$.
The "explaining away" effect arises whenever a phenomenon has multiple causes. Independence is not a stronger condition that automatically delivers conditional independence; the two are genuinely separate.
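The alarm example can be checked numerically. The sketch below uses hypothetical values $P(F) = \frac{1}{100}$ and $P(C) = \frac{1}{5}$, builds the joint distribution of $(F, C)$ under independence, and takes the alarm to sound exactly when $F$ or $C$ occurs. The helper `prob` is a name chosen here.

```python
from fractions import Fraction
from itertools import product

# Hypothetical probabilities; F and C are independent by construction.
p_f, p_c = Fraction(1, 100), Fraction(1, 5)

# Joint distribution over (fire, popcorn); the alarm sounds if either occurs.
joint = {
    (f, c): (p_f if f else 1 - p_f) * (p_c if c else 1 - p_c)
    for f, c in product([True, False], repeat=2)
}

def prob(event, given=lambda f, c: True):
    """P(event | given), computed by summing the joint distribution."""
    den = sum(p for (f, c), p in joint.items() if given(f, c))
    num = sum(p for (f, c), p in joint.items() if given(f, c) and event(f, c))
    return num / den

def alarm(f, c):
    return f or c

p_f_given_a = prob(lambda f, c: f, alarm)
p_c_given_a = prob(lambda f, c: c, alarm)
p_fc_given_a = prob(lambda f, c: f and c, alarm)
p_f_given_a_no_c = prob(lambda f, c: f, lambda f, c: alarm(f, c) and not c)

print(p_f_given_a_no_c)                           # 1
print(p_fc_given_a == p_f_given_a * p_c_given_a)  # False
```

With these numbers $P(F \mid A) = \frac{5}{104}$, but ruling out popcorn pushes it to 1, and the product rule fails once we condition on $A$.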