Lecture 5: Conditioning Continued, Law of Total Probability

Harvard Statistics 110 (Joe Blitzstein)

1. Thinking Conditionally

The theme for the day is not just probability but thinking. Probability is how to reason about uncertainty and randomness, so this is a "thinking course" as much as a statistics course. The math behind conditional probability is easy — last lecture's theorems came from multiplying both sides of the definition by something — but applying it correctly is subtle.

Key theme

Conditioning is the central tool of the course. As Blitzstein puts it, conditioning is "a condition for thinking": you cannot reason clearly about uncertainty unless you understand how to condition properly.

Why conditional probability matters

There are two distinct reasons it is so important:

  1. Important in its own right. Conditioning is how we update beliefs after observing evidence — a fundamental, general problem.
  2. A tool for unconditional probabilities. Even when the quantity we ultimately want, $P(B)$, is unconditional, conditioning lets us break it into simpler pieces.

· · ·

2. Problem-Solving Strategies

This is also a course in problem solving. Two general strategies recur throughout:

  1. Try simple and extreme cases.
  2. Break the problem into simpler pieces.

Aside

At Caltech the hero was physicist Richard Feynman, and people joked about the "Feynman algorithm" for solving any problem: (1) write down the problem, (2) think very hard, (3) write down the solution. That worked for Feynman but not for the rest of us — we need actual strategies.

· · ·

3. The Law of Total Probability

The "break into pieces" strategy, made precise, is the law of total probability. Draw the sample space $S$ and an event $B$ (a "blob") whose probability is hard to compute directly. Chop $S$ into pieces $A_1, A_2, \ldots, A_n$.

Definition — Partition

A partition of $S$ is a collection of events $A_1, \ldots, A_n$ that are:

  • disjoint (no two overlap), and
  • exhaustive (their union is all of $S$).

The pieces need not be rectangles or any particular shape — chop up the space however is convenient, as long as the pieces are disjoint and cover $S$.

[Figure: the sample space $S$ chopped into cells $A_1, A_2, A_3, A_4$, with the blob $B$ straddling several of them. The probability of $B$ is the sum over the pieces $B \cap A_i$.]

Statement

Theorem — Law of Total Probability

Given a partition $A_1, \ldots, A_n$ of $S$, for any event $B$:

$$P(B) = \sum_{i=1}^{n} P(B \cap A_i) = \sum_{i=1}^{n} P(B \mid A_i)\, P(A_i)$$

The first form is immediate from the second axiom of probability: $B$ is split into the disjoint pieces $B \cap A_i$, so its probability is the sum of theirs. The second form just rewrites each piece using the definition of conditional probability, $P(B \cap A_i) = P(B \mid A_i)\, P(A_i)$. (As noted last time, you can factor the intersection in either order — that is why there were "$n!$ theorems.")
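As a quick sanity check of both forms, here is a minimal sketch on a toy sample space. The two-dice setup, the event $B$ ("the sum is 8"), and the partition by the value of the first die are illustrative choices, not part of the lecture.

```python
from fractions import Fraction
from itertools import product

# Hypothetical sample space (not from the lecture): two fair six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Naive-definition probability: favorable outcomes over total outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def B(o):
    """Event B: the two dice sum to 8."""
    return o[0] + o[1] == 8

# Partition A_1..A_6, where A_i is "the first die shows i".
total = Fraction(0)
for i in range(1, 7):
    A_i = lambda o, i=i: o[0] == i
    p_A = prob(A_i)
    p_B_and_A = prob(lambda o: B(o) and A_i(o))
    p_B_given_A = p_B_and_A / p_A   # definition of conditional probability
    total += p_B_given_A * p_A      # one term of the law of total probability

direct = prob(B)
print(direct, total)
```

Both numbers agree, as the theorem says they must for any partition.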

The art of partitioning

No separate proof is needed; the law is immediate from the axioms. Its usefulness depends entirely on choosing a good partition. A bad partition turns one problem into $n$ problems each as hard as the original. A good partition turns one hard problem into $n$ easy ones. Statistics is part science, part art — picking useful partitions takes practice.

· · ·

4. Example: Two Aces

Draw a random 2-card hand from a standard 52-card deck (all hands equally likely, so the naive definition applies). Compare two conditional probabilities that sound nearly identical but differ substantially.

Part 1: $P(\text{both aces} \mid \text{have an ace})$

$P(\text{both aces} \mid \text{have an ace})$, where "have an ace" means at least one

By the definition of conditional probability, and noting that "both aces" makes "have an ace" redundant (so the intersection is just "both aces"):

$$P(\text{both aces} \mid \text{have an ace}) = \frac{P(\text{both aces})}{P(\text{have an ace})}$$

Numerator, counting without regard to order:

$$P(\text{both aces}) = \frac{\binom{4}{2}}{\binom{52}{2}}$$

Denominator, via the complement (easier than splitting into cases):

$$P(\text{have an ace}) = 1 - \frac{\binom{48}{2}}{\binom{52}{2}}$$

Putting it together and simplifying:

$$P(\text{both aces} \mid \text{have an ace}) = \frac{\binom{4}{2}}{\binom{52}{2} - \binom{48}{2}} = \frac{1}{33} \approx 3\%$$
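The simplification can be checked with exact arithmetic. This is a small sketch using Python's standard `math.comb` and `fractions.Fraction`; the variable names are my own.

```python
from fractions import Fraction
from math import comb

# Numerator: choose 2 of the 4 aces, out of all 2-card hands.
p_both = Fraction(comb(4, 2), comb(52, 2))

# Denominator via the complement: 1 - P(no aces), with 48 non-ace cards.
p_at_least_one = 1 - Fraction(comb(48, 2), comb(52, 2))

p_cond = p_both / p_at_least_one
print(p_cond, float(p_cond))
```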

Part 2: $P(\text{both aces} \mid \text{have the ace of spades})$

$P(\text{both aces} \mid \text{have the ace of spades})$ — "have at least one ace of spades" equals "have the ace of spades," since there is only one

By symmetry: condition on holding the ace of spades. The other card is equally likely to be any of the remaining 51 cards. We have both aces exactly when that card is one of the three remaining aces:

$$P(\text{both aces} \mid \text{have the ace of spades}) = \frac{3}{51} = \frac{1}{17} \approx 5.9\%$$

(Plugging into the definition the long way gives the same answer.)
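One way to do it "the long way" is to enumerate all 1,326 two-card hands and apply the naive definition directly. The sketch below is an illustration under that approach; the deck encoding and the helper `cond_prob` are hypothetical names introduced here.

```python
from fractions import Fraction
from itertools import combinations

ranks = ["A"] + [str(n) for n in range(2, 11)] + ["J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = [(rank, suit) for rank in ranks for suit in suits]
hands = list(combinations(deck, 2))  # all unordered 2-card hands

def cond_prob(event, given):
    """P(event | given) by counting, valid because all hands are equally likely."""
    given_hands = [h for h in hands if given(h)]
    return Fraction(sum(1 for h in given_hands if event(h)), len(given_hands))

both_aces = lambda h: all(rank == "A" for rank, _ in h)
has_ace = lambda h: any(rank == "A" for rank, _ in h)
has_ace_of_spades = lambda h: ("A", "spades") in h

p_given_ace = cond_prob(both_aces, has_ace)
p_given_ace_of_spades = cond_prob(both_aces, has_ace_of_spades)
print(p_given_ace, p_given_ace_of_spades)
```

Swapping `"spades"` for any other suit in `has_ace_of_spades` leaves the second answer unchanged.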

The surprise

| Condition | Probability |
| --- | --- |
| Given an ace (unspecified suit) | $\frac{1}{33} \approx 3.0\%$ |
| Given the ace of spades | $\frac{1}{17} \approx 5.9\%$ |

Specifying the suit nearly doubles the probability — yet $\frac{1}{17}$ does not depend on which suit we name. Hearts, clubs, diamonds, or spades all give $\frac{1}{17}$. But "we have an ace" (suit unspecified) yields $\frac{1}{33}$.

Intuition

"We have an ace" only asserts at least one ace, so you cannot pin a concrete card to your hand. "We have the ace of spades" lets you fix one card and treat the other as a single mystery card with a clean symmetry argument. The distinction is between conditioning on "at least one" versus a specific labeled object — a problem worth pondering at length.

· · ·

5. Example: Disease Testing

A patient is tested for a disease. This everyday problem shows why it pays to specify carefully what you are conditioning on and what your goal is — a hint useful for homework too: state explicitly "find $P(\text{what} \mid \text{what})$" with clear notation.

Assumptions and notation

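Define the events explicitly (write them out in full: "disease" alone is not an event, and do not use $P$ for "positive," since it collides with $P$ for probability):

  • $D$: the patient has the disease
  • $T$: the patient tests positive

Assume the disease afflicts 1% of the population, so $P(D) = 0.01$, and that the test is advertised as "95% accurate."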

Interpreting "95% accurate"

$$P(T \mid D) = 0.95 \qquad P(T^c \mid D^c) = 0.95$$

Given disease, the test correctly reports positive 95% of the time; given no disease, it correctly reports negative 95% of the time. It follows that $P(T \mid D^c) = 0.05$.

The goal is not what the test gives

A common real-world mistake is confusing $P(T \mid D)$ with $P(D \mid T)$. The test reliability gives $P(T \mid D)$, but the patient cares about $P(D \mid T)$ — the chance of having the disease given a positive test. These are different concepts, related by Bayes' rule.

Bayes' rule with the law of total probability

$$P(D \mid T) = \frac{P(T \mid D)\, P(D)}{P(T)}$$

We know $P(T \mid D) = 0.95$ and $P(D) = 0.01$. The only unknown is the denominator $P(T)$, which we expand by the law of total probability over the partition $\{D, D^c\}$:

$$P(T) = P(T \mid D)\, P(D) + P(T \mid D^c)\, P(D^c)$$

Bayes' rule and the law of total probability are very commonly used in tandem — the clean one-line Bayes' rule, then the denominator expanded when needed. Plugging in:

$$P(D \mid T) = \frac{0.95 \cdot 0.01}{0.95 \cdot 0.01 + 0.05 \cdot 0.99} \approx 0.16$$
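The same calculation can be wrapped in a small function so the base rate is easy to vary. This is a sketch, not a general diagnostic tool; the function name `posterior` and the parameter names `sensitivity` for $P(T \mid D)$ and `specificity` for $P(T^c \mid D^c)$ are standard terms that I am introducing, not notation from the lecture.

```python
def posterior(prior, sensitivity, specificity):
    """P(D | T) from Bayes' rule, with P(T) expanded over the partition {D, D^c}."""
    p_pos_given_d = sensitivity            # P(T | D)
    p_pos_given_not_d = 1 - specificity    # P(T | D^c)
    p_pos = p_pos_given_d * prior + p_pos_given_not_d * (1 - prior)  # P(T)
    return p_pos_given_d * prior / p_pos

p = posterior(prior=0.01, sensitivity=0.95, specificity=0.95)
print(round(p, 3))  # 0.161
```

Calling `posterior` with a larger `prior` shows how strongly the answer depends on how rare the disease is.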

Interpretation

Despite the test being "95% accurate," there is only about a 16% chance the patient has the disease. This surprises most patients and most doctors. In a Harvard study, around 60 doctors were asked a similar question, and roughly 80% guessed numbers like 95% — far too high.

Two morals:

  1. Get a second opinion / another test — ideally a different kind of test, since a repeat of the same test may not be independent (whatever caused the first error could recur).
  2. Mind the base rate. Intuition fails because people focus on the test's reliability (5% error) and ignore that the disease is itself rare (1%). There is a tradeoff between how rare the disease is and how rarely the test errs, and people fixate on the latter.

Frequentist intuition (1,000 patients)

Of 1,000 patients, about 10 have the disease; suppose all 10 test positive. Of the 990 without it, about 5% (roughly 50) test positive: these are false positives. That is about 50 false positives to 10 true positives, a ratio of 5 to 1. The share of positives who truly have the disease is therefore about $\frac{10}{60} = \frac{1}{6} \approx 17\%$, in line with the exact answer of about 16%.
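The frequency argument can also be tried as a simulation. The sketch below draws a large hypothetical population using the lecture's numbers (1% prevalence, 95% accuracy in both directions); the population size and the random seed are arbitrary choices of mine.

```python
import random

rng = random.Random(110)   # fixed, arbitrary seed so the run is repeatable
n_patients = 200_000       # hypothetical population size

true_pos = 0
false_pos = 0
for _ in range(n_patients):
    diseased = rng.random() < 0.01          # P(D) = 0.01
    if diseased:
        positive = rng.random() < 0.95      # P(T | D) = 0.95
    else:
        positive = rng.random() < 0.05      # P(T | D^c) = 0.05
    if positive and diseased:
        true_pos += 1
    elif positive:
        false_pos += 1

# Fraction of positive tests that belong to patients who have the disease.
share = true_pos / (true_pos + false_pos)
print(true_pos, false_pos, share)
```

The printed share should land near the 16% from Bayes' rule, with false positives outnumbering true positives by roughly 5 to 1.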

Coherency of Bayesian updating

Bayes' rule is coherent: it does not matter whether you incorporate evidence all at once or piece by piece, or in what order. Investigating a crime, you might get two clues at once and update on their intersection, or get one clue, break for lunch, return, and update on the second — the final probability of the thing you care about given all the evidence is the same.
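To see coherence numerically, here is a sketch with two positive test results. It assumes the two tests are conditionally independent given disease status, which is an assumption I am adding for illustration (a repeat of the same test may well violate it). The helper `update` is a hypothetical name.

```python
from fractions import Fraction

def update(prior, like_d, like_not_d):
    """One Bayes update: prior P(D) -> posterior P(D | evidence)."""
    numerator = like_d * prior
    return numerator / (numerator + like_not_d * (1 - prior))

prior = Fraction(1, 100)         # P(D)
hit = Fraction(95, 100)          # P(positive | D)
false_alarm = Fraction(5, 100)   # P(positive | D^c)

# Piece by piece: the posterior after the first test is the prior for the second.
after_first = update(prior, hit, false_alarm)
after_second = update(after_first, hit, false_alarm)

# All at once: under the assumed conditional independence, the likelihood
# of two positives is the product of the individual likelihoods.
at_once = update(prior, hit * hit, false_alarm * false_alarm)

print(after_first, after_second, at_once)
```

The sequential and all-at-once posteriors are exactly equal, which is the coherence property in miniature.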

Student question: if the patient came in because of symptoms, the calculation changes, but the principle is the same. With initial evidence you update first, so the relevant prior is no longer the bare 1% — you raise the starting probability and the numbers shift accordingly.

· · ·

6. Biohazards: Common Mistakes

Blitzstein calls these "biohazards" — common conditional-probability mistakes that are hazardous to your statistical health.

Biohazard 1: Confusing $P(A \mid B)$ with $P(B \mid A)$

These are different; Bayes' rule connects them but does not make them equal. This is sometimes called the prosecutor's fallacy — though unfair to prosecutors, since defense attorneys, doctors, and everyone else make it too. In a criminal case you care about $P(\text{guilt} \mid \text{evidence})$, but the fallacy fixates on $P(\text{evidence} \mid \text{innocence})$.

The Sally Clark case

Sally Clark, a British woman, lost two babies to unexplained causes (labelled SIDS) and was convicted of murdering them. The case rested on an "expert" who claimed the chance of one baby dying mysteriously was $\frac{1}{8500}$, then multiplied for two babies: $\frac{1}{8500} \cdot \frac{1}{8500} \approx \frac{1}{73{,}000{,}000}$.

Two errors: (1) the multiplication assumes independence of the two deaths — unjustified, since a shared genetic factor could link them; (2) even granting that, $\frac{1}{73{,}000{,}000}$ is $P(\text{evidence} \mid \text{innocence})$, not the relevant $P(\text{innocence} \mid \text{evidence})$. By Bayes' rule the latter involves the prior $P(\text{innocence})$, which — given billions of non-murdering mothers — is extremely close to 1. That tradeoff was ignored; she was imprisoned, later exonerated, and died shortly after release.

Biohazard 2: Confusing prior with posterior

| Term | Meaning |
| --- | --- |
| Prior, $P(A)$ | Probability before observing evidence |
| Posterior, $P(A \mid B)$ | Probability after observing evidence $B$ |

The $P(A) = 1$ trap

A problem saying "$A$ occurred" tempts students to write $P(A) = 1$. That is wrong. The correct statement is $P(A \mid A) = 1$ — given that $A$ occurred, $A$ has probability 1 — but the unconditional $P(A)$ is not 1. "We observe that $A$ occurred" means we compute quantities conditional on $A$; be careful about what goes left versus right of the conditioning bar.

Biohazard 3: Confusing independence with conditional independence

The most subtle of the three, and consequential in practice.

Definition — Conditional Independence

Events $A$ and $B$ are conditionally independent given an event $C$ if:

$$P(A \cap B \mid C) = P(A \mid C)\, P(B \mid C)$$

This is the definition of independence with everything conditioned on $C$. You must always say what you condition on — "conditionally independent" alone is incomplete.

Two natural questions follow, and the answer to both is no:

Conditional independence does NOT imply independence — chess opponent of unknown strength

You play a series of games against the same opponent whose rating you do not know. Conditional on the opponent's true strength, assume the outcomes are independent. They are not unconditionally independent: winning the first five games makes you confident you are stronger, so early games are evidence about strength and are informative about later games. Unconditional independence would mean early games carry no such information — but they clearly do.
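A small numerical sketch makes the dependence concrete. All numbers here are hypothetical: the opponent is equally likely to be "weaker" (you win each game with probability 4/5) or "stronger" (probability 1/5), and games are independent given the strength.

```python
from fractions import Fraction

# Hypothetical numbers: strength -> (prior probability, P(you win a game | strength))
strengths = {
    "weaker": (Fraction(1, 2), Fraction(4, 5)),
    "stronger": (Fraction(1, 2), Fraction(1, 5)),
}

# Law of total probability over the partition by opponent strength.
p_win_one = sum(prior * p for prior, p in strengths.values())

# Given the strength, games are independent, so P(win both | strength) = p * p.
p_win_both = sum(prior * p * p for prior, p in strengths.values())

# Conditional probability of winning game 2 given a win in game 1.
p_second_given_first = p_win_both / p_win_one

print(p_win_one, p_win_both, p_second_given_first)
```

Unconditional independence would require `p_win_both` to equal `p_win_one ** 2`; here it does not, and winning the first game raises the chance of winning the second.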

Independence does NOT imply conditional independence — fire alarm with two causes

Let $A$ = the alarm sounds, $F$ = there is a fire, $C$ = someone is making popcorn. Suppose $F$ and $C$ are independent and the alarm sounds if and only if at least one of them occurs. Then

$$P(F \mid A,\, C^c) = 1$$

Ruling out popcorn forces fire (Sherlock Holmes: eliminate the other explanations and what remains must be true). So given the alarm, $F$ and $C$ become dependent — "explaining away" one cause raises the other. They are independent unconditionally but not conditionally independent given $A$.
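The effect can be checked by enumerating the four combinations of fire and popcorn. This sketch assumes the alarm sounds exactly when at least one cause is present; the values $P(F) = 1/100$ and $P(C) = 1/5$ are hypothetical, chosen only for illustration.

```python
from fractions import Fraction
from itertools import product

p_fire = Fraction(1, 100)    # hypothetical P(F)
p_popcorn = Fraction(1, 5)   # hypothetical P(C)

# Joint distribution over (fire, popcorn), built from unconditional independence.
joint = {}
for fire, popcorn in product([True, False], repeat=2):
    pf = p_fire if fire else 1 - p_fire
    pc = p_popcorn if popcorn else 1 - p_popcorn
    joint[(fire, popcorn)] = pf * pc

def prob(event):
    """Total probability of the outcomes (fire, popcorn) where the event holds."""
    return sum(p for outcome, p in joint.items() if event(*outcome))

def cond(event, given):
    """P(event | given) from the definition of conditional probability."""
    return prob(lambda f, c: event(f, c) and given(f, c)) / prob(given)

alarm = lambda f, c: f or c   # assumed: alarm sounds iff fire or popcorn
is_fire = lambda f, c: f

p_f_given_a = cond(is_fire, alarm)
p_f_given_a_c = cond(is_fire, lambda f, c: alarm(f, c) and c)
p_f_given_a_not_c = cond(is_fire, lambda f, c: alarm(f, c) and not c)

print(p_f_given_a, p_f_given_a_c, p_f_given_a_not_c)
```

Given the alarm, learning that popcorn is being made lowers the probability of fire back to its prior, while ruling popcorn out pushes it to 1, so $F$ and $C$ are not conditionally independent given $A$.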

Takeaway

The "explaining away" effect arises whenever a phenomenon has multiple causes. Independence is not a stronger condition that automatically delivers conditional independence; the two are genuinely separate.