Lecture 19: Joint, Conditional, and Marginal Distributions

Harvard Statistics 110 (Joe Blitzstein)

1. The Three Concepts and Their Relationships

The big theme of this stretch of the course is joint, conditional, and marginal distributions. These are three different ways of describing more than one random variable at once, and the goal is fluency in how they relate.

So far the course has built up all the tools needed to handle one random variable at a time — its PMF/PDF, CDF, expectation, MGF. But much more remains for two random variables, a sequence, or even a sum of a million of them.

Cumulative

Everything here is cumulative: if a single random variable and its CDF are still shaky, then understanding two at the same time will be much harder.

· · ·

2. Joint, Marginal, and Conditional Distributions (Continuous Case)

Joint CDF

For two random variables, the joint CDF is

$$F(x, y) = P(X \le x,\; Y \le y)$$

This always makes sense — discrete, continuous, mixtures, anything. It generalizes immediately to more variables: for a million of them one would write $P(X_1 \le x_1, \ldots, X_{10^6} \le x_{10^6})$. Two is just easier to write down and reason about.

Joint PDF

In the continuous case there is a joint PDF, obtained from the joint CDF by differentiating — analogous to the 1-D case where the PDF is the derivative of the CDF. Because the CDF is a function of two variables, we take a mixed second partial derivative:

$$f(x, y) = \frac{\partial^2}{\partial x\, \partial y}\, F(x, y)$$

Nothing to fear even without much partial-derivative practice: differentiate with respect to $y$ treating $x$ as a constant, then with respect to $x$ treating $y$ as a constant. A multivariable-calculus theorem says that under mild conditions the order doesn't matter.
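
As a rough numerical illustration of the mixed partial derivative, the sketch below estimates it by finite differences. The joint CDF used here (two independent Exponential(1) variables) is an assumed example, not one from the lecture, and the step size `h` is an arbitrary choice.

```python
import math

def joint_cdf(x, y):
    # Assumed example: X, Y independent Exponential(1),
    # so F(x, y) = (1 - e^-x)(1 - e^-y) for x, y > 0.
    return (1 - math.exp(-x)) * (1 - math.exp(-y))

def mixed_partial(F, x, y, h=1e-4):
    # Central-difference estimate of d^2 F / (dx dy): the probability of a
    # small square centered at (x, y), divided by the area of that square.
    return (F(x + h, y + h) - F(x + h, y - h)
            - F(x - h, y + h) + F(x - h, y - h)) / (4 * h * h)

x, y = 0.7, 1.3
approx_pdf = mixed_partial(joint_cdf, x, y)
exact_pdf = math.exp(-x - y)  # known joint PDF for this assumed example
print(approx_pdf, exact_pdf)
```

The four-corner combination is the probability that $(X, Y)$ lands in a small square, which is why dividing by the square's area approximates the density.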

Density, not probability

The joint PDF is not a probability; it is a density. We integrate it to get probabilities. The probability that the point $(X, Y)$ lands in a set $A$ is

$$P\big((X, Y) \in A\big) = \iint_A f(x, y)\, dx\, dy$$

If double integrals are new, this is no big deal: integrate with respect to $x$ holding $y$ constant, then integrate with respect to $y$. The only genuinely tricky part is the limits of integration.

Why we mostly avoid arbitrary regions

If $A$ is an arbitrary "blob," integrating over it becomes a nasty calculus problem — not an interesting probability problem, and not something this course cares about. The nice case is a rectangle: then $x$ and $y$ each run between fixed numbers, and the double integral is literally one integral inside another.

The one situation where blob-shaped regions do matter is the uniform distribution over a region: there, probability is proportional to area, so we can reason geometrically rather than by brute force.

Marginal PDF

To recover the distribution of one variable on its own, integrate out the other. The marginal PDF of $X$:

$$f_X(x) = \int_{-\infty}^{\infty} f(x, y)\, dy$$

Here $y$ becomes a dummy variable; the result depends on $x$ alone. This is exactly analogous to the discrete operation of summing over the cases: we want $X = x$ and $Y$ to be anything, so we integrate over all $y$. The process is called marginalization. Symmetrically, integrating the joint PDF over all $x$ gives the marginal PDF of $Y$.

A valid marginal must integrate to 1; equivalently, integrating the joint over the whole plane gives 1:

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y)\, dx\, dy = 1$$

It is always safe to write the limits as $-\infty$ to $\infty$ at first, then determine where the density is zero versus non-zero and tighten the limits.
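
Marginalization can be sketched numerically. The density $f(x, y) = x + y$ on the unit square is an assumed example chosen because its marginal, $x + 1/2$, is easy to verify by hand; the helper `integrate` is a hypothetical midpoint-rule routine written for this illustration.

```python
def joint_pdf(x, y):
    # Assumed example density: f(x, y) = x + y on the unit square, 0 elsewhere.
    return x + y if 0 <= x <= 1 and 0 <= y <= 1 else 0.0

def integrate(func, lo, hi, n=2000):
    # Midpoint rule on [lo, hi] with n equal subintervals.
    width = (hi - lo) / n
    return sum(func(lo + (k + 0.5) * width) for k in range(n)) * width

def marginal_x(x):
    # Integrate out y. Limits are tightened from (-inf, inf) to [0, 1],
    # the only range where this density is non-zero.
    return integrate(lambda y: joint_pdf(x, y), 0.0, 1.0)

total = integrate(marginal_x, 0.0, 1.0, n=200)
print(marginal_x(0.3))  # analytic value for this example: 0.3 + 0.5
print(total)            # a valid marginal integrates to 1
```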

Conditional PDF

The conditional distribution mirrors ordinary conditioning. The conditional PDF of $Y$ given $X$ (sometimes written $f_{Y \mid X}$, sometimes left implicit) is the joint density divided by the marginal of $X$:

$$f_{Y \mid X}(y \mid x) = \frac{f(x, y)}{f_X(x)}$$

This looks exactly like the definition of conditional probability, $P(\text{this} \mid \text{that}) = P(\text{this and that}) / P(\text{that})$, except that now $x$ and $y$ stand for numbers, not events. It is derived by conditioning on the event that $X$ lies in a tiny interval around $x$ and taking the limit as that interval shrinks.

There is also a continuous analogue of Bayes' rule, swapping the roles of $X$ and $Y$ (the joint numerator is shared):

$$f_{X \mid Y}(x \mid y) = \frac{f_{Y \mid X}(y \mid x)\, f_X(x)}{f_Y(y)}$$

Rearranged, this builds the joint from a marginal times the matching conditional — exactly as if multiplying $P(y)\, P(x \mid y)$ in the discrete case.
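
A small check of the conditional PDF and the continuous Bayes' rule, again under the assumed example density $f(x, y) = x + y$ on the unit square (both marginals are then $t + 1/2$). The function names are illustrative only.

```python
def joint(x, y):
    # Assumed example: f(x, y) = x + y on the unit square, 0 elsewhere.
    return x + y if 0 <= x <= 1 and 0 <= y <= 1 else 0.0

def marginal(t):
    # For this example both marginals have the same form: t + 1/2 on [0, 1].
    return t + 0.5 if 0 <= t <= 1 else 0.0

def cond_y_given_x(y, x):
    # Joint density divided by the marginal of X.
    return joint(x, y) / marginal(x)

def cond_x_given_y(x, y):
    # Joint density divided by the marginal of Y.
    return joint(x, y) / marginal(y)

x, y = 0.2, 0.9
lhs = cond_x_given_y(x, y)
rhs = cond_y_given_x(y, x) * marginal(x) / marginal(y)  # Bayes' rule
print(lhs, rhs)

# For fixed x, the conditional PDF is a genuine PDF in y: it integrates to 1.
n = 1000
area = sum(cond_y_given_x((k + 0.5) / n, x) for k in range(n)) / n
print(area)
```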

Independence (continuous case)

Independence via factoring PDFs

$X$ and $Y$ are independent if and only if

$$f(x, y) = f_X(x)\, f_Y(y) \quad \text{for all } x, y.$$

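The joint PDF factors into the product of the marginals. This is equivalent to the joint CDF factoring into the product of the marginal CDFs: differentiating the factored CDF gives the factored PDF, and integrating the factored PDF gives back the factored CDF.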

· · ·

3. Worked Example: Uniform on the Unit Disc

Pick a point $(X, Y)$ uniformly at random inside the unit disc $x^2 + y^2 \le 1$. "Uniform" means probability is proportional to area, so the joint PDF is the constant that makes it integrate to 1 — one over the area of the disc:

$$f(x, y) = \frac{1}{\pi} \;\text{ for } x^2 + y^2 \le 1, \quad \text{and } 0 \text{ outside.}$$

A constant joint PDF does not imply independence

It is tempting to see the constant $1/\pi$ and conclude that the joint PDF factors, so that $X$ and $Y$ are independent. They are not. If $x$ is very close to $1$, that severely constrains $y$ (it must be near $0$). The constraint $x^2 + y^2 \le 1$ ties the two variables together. Lesson: do not read independence off $1/\pi$ while ignoring the support where the density is non-zero.

Marginal PDF of $X$

Integrate out $y$, carefully. The density is $1/\pi$ only where $x^2 + y^2 \le 1$, i.e. $y^2 \le 1 - x^2$, i.e. $y \in [-\sqrt{1 - x^2},\, \sqrt{1 - x^2}]$:

$$f_X(x) = \int_{-\sqrt{1 - x^2}}^{\sqrt{1 - x^2}} \frac{1}{\pi}\, dy = \frac{2}{\pi}\sqrt{1 - x^2}, \quad -1 \le x \le 1$$

The integral of a constant is the constant times the interval length. The main mistake on such problems is botching the limits of integration — get them wrong and the whole answer is wrong. (Integrating $f_X$ over $[-1, 1]$ with a trig substitution gives 1, reducing back to the area of the circle.)

Note this marginal is not uniform on $[-1, 1]$, even though the point $(X, Y)$ is uniform on the disc. It is largest at $x = 0$, which makes sense: near the center there is more vertical room, so the point is more likely found there than out near the edges. By symmetry, the marginal of $Y$ is the same formula with the letter changed.
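
The two claims above (the marginal integrates to 1, and it is largest at the center) can be sanity-checked numerically. This is a sketch using a plain midpoint rule; the grid size is an arbitrary choice.

```python
import math

def marginal_x(x):
    # f_X(x) = (2/pi) * sqrt(1 - x^2) on [-1, 1], 0 elsewhere.
    return 2 / math.pi * math.sqrt(1 - x * x) if -1 <= x <= 1 else 0.0

n = 100_000
width = 2 / n
# Midpoint rule over [-1, 1].
total = sum(marginal_x(-1 + (k + 0.5) * width) for k in range(n)) * width
print(total)                             # should be close to 1
print(marginal_x(0.0), marginal_x(0.9))  # density is higher at the center
```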

Conditional PDF of $Y$ given $X$

Divide the joint by the marginal of $X$:

$$f_{Y \mid X}(y \mid x) = \frac{1/\pi}{\tfrac{2}{\pi}\sqrt{1 - x^2}} = \frac{1}{2\sqrt{1 - x^2}}, \quad -\sqrt{1 - x^2} \le y \le \sqrt{1 - x^2}$$

The $\pi$'s cancel. Crucially, the right-hand side has no $y$ in it — it is constant in $y$ for each fixed $x$. A constant density on an interval is a uniform distribution:

$$Y \mid X \;\sim\; \text{Uniform}\!\left(-\sqrt{1 - X^2},\; \sqrt{1 - X^2}\right)$$

This is the right interval: once $x$ is observed, $y$ must lie between those bounds, and within that range every value is equally likely. The notation $Y \mid X$ (capital $X$) is shorthand for "$Y$ given $X = x$" — treat the observed $X$ as a known constant.
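
A simulation sketch of this conditional distribution. Since the event $X = x$ has probability zero, the code approximates it by a thin strip around an assumed value `x0 = 0.6`; the strip width, sample size, and seed are arbitrary choices for illustration.

```python
import math
import random

random.seed(110)  # arbitrary seed, only for reproducibility

def sample_disc():
    # Rejection sampling: draw from the square [-1, 1]^2 until the point
    # falls inside the unit disc.
    while True:
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if x * x + y * y <= 1:
            return x, y

x0, tol = 0.6, 0.01
bound = math.sqrt(1 - x0 * x0)  # 0.8 for x0 = 0.6
ys = []
while len(ys) < 4000:
    x, y = sample_disc()
    if abs(x - x0) < tol:  # approximate the event X = x0 by a thin strip
        ys.append(y)

mean_y = sum(ys) / len(ys)
var_y = sum((y - mean_y) ** 2 for y in ys) / len(ys)
uniform_var = (2 * bound) ** 2 / 12  # variance of Uniform(-bound, bound)
print(mean_y, var_y, uniform_var)
```

If the conditional distribution is uniform on $(-0.8, 0.8)$, the sample mean should be near $0$ and the sample variance near `uniform_var`.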

Not independent

Two ways to see it:

The joint PDF is not the product of the marginals: $f_X(x)\, f_Y(y) \ne 1/\pi$.
The conditional distribution of $Y$ given $X$ is not the unconditional (marginal) distribution of $Y$. Learning $X$ gives information about $Y$.
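
The first point can be made concrete by evaluating both sides at specific points. The two test points below are arbitrary choices: one at the center, and one inside the square $[-1, 1]^2$ but outside the disc.

```python
import math

def joint(x, y):
    # Uniform density on the unit disc.
    return 1 / math.pi if x * x + y * y <= 1 else 0.0

def marginal(t):
    # Same formula for X and Y, by symmetry.
    return 2 / math.pi * math.sqrt(1 - t * t) if -1 <= t <= 1 else 0.0

for x, y in [(0.0, 0.0), (0.9, 0.9)]:
    print((x, y), joint(x, y), marginal(x) * marginal(y))
```

At $(0, 0)$ the product of the marginals is $4/\pi^2$, not $1/\pi$. At $(0.9, 0.9)$ the joint density is $0$ while the product of the marginals is positive.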

· · ·

4. 2D LOTUS

The Law of the Unconscious Statistician extends to functions of more than one variable. Let $(X, Y)$ have joint PDF $f(x, y)$, and let $g(x, y)$ be any real-valued function of two variables — e.g. $x + y$, or $x^2 \sin(x)\, y^3$, or anything. Then:

$$E\big(g(X, Y)\big) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x, y)\, f(x, y)\, dx\, dy$$

No finding of the distribution of $g(X, Y)$ is required; we integrate directly against the joint PDF. This is completely analogous to 1-D LOTUS (a discrete 2D LOTUS works the same way with a double sum). When $g(X, Y)$ first appears it can look like a new random variable demanding its own study; LOTUS bypasses that entirely.
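
A minimal numerical sketch of 2D LOTUS. The helper `lotus_2d` is hypothetical and uses a midpoint rule on a rectangle. The density $f(x, y) = x + y$ on the unit square is an assumed example, for which $E(XY) = 1/3$ can be checked by hand.

```python
def lotus_2d(g, f, x_range, y_range, n=400):
    # Midpoint-rule approximation of the double integral of g(x, y) * f(x, y)
    # over a rectangle that contains the support of f.
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    dx, dy = (x_hi - x_lo) / n, (y_hi - y_lo) / n
    total = 0.0
    for i in range(n):
        x = x_lo + (i + 0.5) * dx
        for j in range(n):
            y = y_lo + (j + 0.5) * dy
            total += g(x, y) * f(x, y)
    return total * dx * dy

def joint_pdf(x, y):
    # Assumed example density: f(x, y) = x + y on the unit square.
    return x + y

e_xy = lotus_2d(lambda x, y: x * y, joint_pdf, (0, 1), (0, 1))
print(e_xy)  # analytic value for this example is 1/3
```

Note that the distribution of the product $XY$ is never computed; only $g$ and the joint PDF are needed.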

· · ·

5. Independence Implies $E(XY) = E(X)\,E(Y)$

This fact was needed earlier — when claiming the MGF of a sum of independent random variables is the product of the MGFs — but was not yet proven. It is the key "$E$ of (something times something) $=$ $E$ of something times $E$ of the other thing" step.

Theorem

If $X$ and $Y$ are independent, then $E(XY) = E(X)\,E(Y)$.

In words: independent implies uncorrelated (correlation is defined in a later lecture; this is foreshadowing). The result holds in both the continuous and the discrete case; we prove the continuous case.


Treating $XY$ as $g(X, Y)$ with $g(x, y) = xy$ and applying 2D LOTUS:

$$E(XY) = \iint x\,y\, f(x, y)\, dx\, dy$$

Independence means $f(x, y) = f_X(x)\, f_Y(y)$, so the integrand separates:

$$E(XY) = \iint \big[x\, f_X(x)\big]\big[y\, f_Y(y)\big]\, dx\, dy$$

Doing the inner integral over $x$ (with $y$ held constant), the factors $y$ and $f_Y(y)$ come out:

$$= \int y\, f_Y(y) \left[\int x\, f_X(x)\, dx\right] dy$$

The bracketed inner integral is just a number — it is $E(X)$ — so it pulls out of the entire integral, leaving $\int y\, f_Y(y)\, dy = E(Y)$:

$$E(XY) = E(X)\,E(Y)$$

The whole proof amounts to pulling out constants and watching the expression factor. Without LOTUS this would be a nightmare to prove.
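
The theorem also holds in the discrete case, where it can be checked exactly by enumeration. The sketch below uses two fair dice as an assumed example, with a dependent pair ($Y = X$) for contrast.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
p = Fraction(1, 6)  # probability of each face of a fair die

# Independent dice: the joint PMF is the product of the marginals, p * p.
e_xy = sum(x * y * p * p for x, y in product(faces, faces))
e_x = sum(x * p for x in faces)
print(e_xy, e_x * e_x)

# Dependent pair for contrast: Y = X (the same die read twice).
e_xx = sum(x * x * p for x in faces)
print(e_xx, e_x * e_x)
```

With independence the two printed values agree exactly; with $Y = X$ they do not, so the factoring of the joint distribution is doing real work in the proof.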

· · ·

6. Expected Distance Between Two Uniform Points

Let $X$ and $Y$ be i.i.d. $\text{Uniform}(0, 1)$. Find the expected distance $E(|X - Y|)$. Such problems arise often in applications involving two random points.

Solution by 2D LOTUS

Since $X, Y$ are i.i.d. $\text{Uniform}(0, 1)$, the joint PDF is $1$ on the unit square:

$$E(|X - Y|) = \int_0^1\!\!\int_0^1 |x - y|\, dx\, dy$$

To integrate an absolute value, split the region so it can be dropped. Where $x > y$, $|x - y| = x - y$; where $x \le y$, it equals $y - x$. By the symmetry of the problem (i.i.d. variables, symmetric function), the two pieces are equal — compute one and double it, two integrals instead of four:

$$E(|X - Y|) = 2 \iint_{\{x > y\}} (x - y)\, dx\, dy$$

Getting the limits right

Writing the order $dx\, dy$, the outer limits refer to $y$ and must be plain numbers: $y$ goes from $0$ to $1$. The inner limits (over $x$) may depend on $y$, and here they must: restricting to $x > y$, $x$ runs from $y$ to $1$, not $0$ to $1$.

$$E(|X - Y|) = 2 \int_{0}^{1}\!\!\int_{y}^{1} (x - y)\, dx\, dy$$

The inner integral (treating $y$ as a constant) is $\left[\frac{x^2}{2} - y x\right]$ evaluated from $x = y$ to $x = 1$. Plugging in the bounds and doing the remaining easy outer integral gives:

$$E(|X - Y|) = \frac{1}{3}$$
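
For reference, the two integration steps can be written out as follows (a worked version of the computation described above, with $y$ held constant in the inner integral):

$$\int_{y}^{1} (x - y)\, dx = \left[\frac{x^2}{2} - y x\right]_{x = y}^{x = 1} = \left(\frac{1}{2} - y\right) - \left(\frac{y^2}{2} - y^2\right) = \frac{(1 - y)^2}{2}$$

$$E(|X - Y|) = 2 \int_{0}^{1} \frac{(1 - y)^2}{2}\, dy = \int_{0}^{1} (1 - y)^2\, dy = \left[-\frac{(1 - y)^3}{3}\right]_{0}^{1} = \frac{1}{3}$$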

Intuition check

[ ——— • ——— • ——— ]

two random points near $1/3$ and $2/3$ — a gap of $1/3$

Picture two random points on $[0, 1]$: a "stereotypical" picture has one near $1/3$ and one near $2/3$, a gap of $1/3$ — matching the answer and making it memorable (not a proof, but reassuring).

A cleaner approach via Max and Min

The picture suggests a left point and a right point. Define:

$M = \max(X, Y)$
$L = \min(X, Y)$

Blitzstein uses $L$ (for "least") for the minimum because $M$ is already taken by the max, though "large" also starts with $L$, an annoying coincidence of English.

$$|X - Y| = M - L, \qquad M + L = X + Y$$

Bigger minus smaller is the absolute difference; bigger plus smaller is just the sum. So our result gives

$$E(M) - E(L) = \tfrac{1}{3}.$$

And by linearity, since $M + L = X + Y$,

$$E(M) + E(L) = E(X + Y) = E(X) + E(Y) = \tfrac{1}{2} + \tfrac{1}{2} = 1.$$

Two equations, two unknowns:

$$E(M) = \tfrac{2}{3}, \qquad E(L) = \tfrac{1}{3}$$

— exactly the $1/3$ and $2/3$ of the intuitive picture.

A third route would be to find the PDFs of $M$ and $L$ directly (related to the fact that the minimum of independent exponentials is exponential with the sum of the rates). In general Blitzstein prefers linearity, CDFs, and these tricks over grinding out double integrals.
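
A quick Monte Carlo check of the max/min results above. This is a simulation sketch, not a proof; the seed and sample size are arbitrary.

```python
import random

random.seed(19)  # arbitrary seed, only for reproducibility
n = 200_000
sum_max = sum_min = sum_gap = 0.0
for _ in range(n):
    x, y = random.random(), random.random()
    m, l = max(x, y), min(x, y)
    sum_max += m
    sum_min += l
    sum_gap += abs(x - y)  # equals m - l for every draw

e_max, e_min, e_gap = sum_max / n, sum_min / n, sum_gap / n
print(e_max, e_min, e_gap)  # theory: 2/3, 1/3, 1/3
```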

· · ·

7. The Chicken-and-Egg Problem

A favorite discrete example. A chicken lays a random number of eggs $N \sim \text{Poisson}(\lambda)$. Each egg independently hatches with probability $p$ (each egg is an independent $\text{Bernoulli}(p)$ trial, hatching $=$ success). Let:

$X = $ number of eggs that hatch
$Y = $ number that do not hatch

Setup

Conditional on the number of eggs, $X$ is binomial:

$$X \mid N \;\sim\; \text{Binomial}(N, p)$$

(The notation means: pretend $N$ is a known constant — even though it is really Poisson — then $X$ is binomial.) And there is the identity $X + Y = N$.

The task: find the joint PMF and determine whether $X, Y$ are independent. Intuition strongly suggests dependence: $X + Y$ must equal $N$, and more hatching seems to mean fewer not hatching. But that intuition is about the conditional situation (given $N$); it is not a proof, because $N$ itself is random.

Finding the joint PMF

By definition the joint PMF is $P(X = i, Y = j)$ ($i, j$ emphasize integers). The recurring strategy: when stuck, condition on what you wish you knew — the number of eggs. By the law of total probability:

$$P(X = i, Y = j) = \sum_{n=0}^{\infty} P(X = i, Y = j \mid N = n)\, P(N = n)$$

This infinite sum looks intimidating, and many students get stuck. The fix: try concrete numbers.

The sum collapses to a single term: $n = i + j$.

Scratch work: $P(X = 3, Y = 5 \mid N = 10) = 0$ — $3$ hatch and $5$ don't accounts for only $8$ of the $10$ eggs; the other two vanished, impossible. And $P(X = 3, Y = 5 \mid N = 2) = 0$ — can't get $8$ outcomes from $2$ eggs. The only surviving term has $n = i + j$:

$$P(X = i, Y = j) = P(X = i, Y = j \mid N = i + j)\, P(N = i + j)$$

Given $N = i + j$ and $X = i$, the value $Y = j$ is redundant (already forced), so the conditional reduces to $P(X = i \mid N = i + j)$ — straight from the binomial PMF — times the Poisson PMF.

Evaluating

With $q = 1 - p$:

$$P(X = i, Y = j) = \binom{i + j}{i} p^i q^j \cdot \frac{e^{-\lambda}\, \lambda^{i+j}}{(i + j)!}$$

The $(i + j)!$ in the binomial coefficient cancels the $(i + j)!$ in the Poisson PMF. Split $\lambda^{i+j} = \lambda^i \lambda^j$ and write $e^{-\lambda} = e^{-\lambda(p + q)} = e^{-\lambda p}\, e^{-\lambda q}$:

$$P(X = i, Y = j) = \underbrace{\frac{e^{-\lambda p}(\lambda p)^i}{i!}}_{\text{function of } i} \cdot \underbrace{\frac{e^{-\lambda q}(\lambda q)^j}{j!}}_{\text{function of } j}$$
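
The factorization can be checked numerically by computing the joint PMF two ways: by the law of total probability over $N$, and as a product of two Poisson PMFs. The parameter values and the truncation point `n_max` are assumptions made for this sketch.

```python
import math

def poisson_pmf(k, rate):
    return math.exp(-rate) * rate ** k / math.factorial(k)

def binomial_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def joint_by_conditioning(i, j, lam, p, n_max=60):
    # Law of total probability over N, truncated at n_max (an assumed cutoff).
    # P(X = i, Y = j | N = n) is the binomial PMF when n == i + j, else 0,
    # so only one term of the sum is non-zero.
    total = 0.0
    for n in range(n_max + 1):
        cond = binomial_pmf(i, n, p) if n == i + j else 0.0
        total += cond * poisson_pmf(n, lam)
    return total

lam, p = 4.0, 0.3  # illustrative parameter values
i, j = 3, 5
by_sum = joint_by_conditioning(i, j, lam, p)
by_product = poisson_pmf(i, lam * p) * poisson_pmf(j, lam * (1 - p))
print(by_sum, by_product)
```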

Conclusion

The joint PMF factors into a function of $i$ times a function of $j$ — each factor a Poisson PMF. Therefore:

$X$ and $Y$ are independent.
$X \sim \text{Poisson}(\lambda p)$
$Y \sim \text{Poisson}(\lambda q)$

This is startling: how can the hatch count and the no-hatch count be independent when they must sum to $N$? The resolution is that $N$ is itself random, and for the Poisson the randomness in $N$ is exactly what makes $X$ and $Y$ independent. This is a special property of the Poisson — change $N$ to any other distribution and $X, Y$ become dependent. (It is fine if your intuition said "dependent"; that intuition is correct for a fixed number of eggs.)
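
The contrast between a random and a fixed number of eggs can be seen in simulation. The sketch below compares an empirical $P(X = 1, Y = 3)$ with the product of the empirical marginals, first with $N \sim \text{Poisson}(4)$ and then with $N = 4$ fixed. The parameter values, the chosen cell $(1, 3)$, and the Poisson sampler (Knuth's multiplication method) are choices made for this illustration.

```python
import math
import random

random.seed(7)  # arbitrary seed, only for reproducibility

def sample_poisson(rate):
    # Knuth's multiplication method; adequate for small rates.
    threshold, k, prod = math.exp(-rate), 0, random.random()
    while prod > threshold:
        k += 1
        prod *= random.random()
    return k

def hatch(n_eggs, p):
    # Each egg hatches independently with probability p.
    hatched = sum(random.random() < p for _ in range(n_eggs))
    return hatched, n_eggs - hatched

def joint_vs_product(pairs, i, j):
    # Empirical P(X = i, Y = j) versus empirical P(X = i) * P(Y = j).
    n = len(pairs)
    p_joint = sum(1 for x, y in pairs if x == i and y == j) / n
    p_x = sum(1 for x, _ in pairs if x == i) / n
    p_y = sum(1 for _, y in pairs if y == j) / n
    return p_joint, p_x * p_y

lam, p, trials = 4.0, 0.3, 100_000  # illustrative values
random_n = [hatch(sample_poisson(lam), p) for _ in range(trials)]
fixed_n = [hatch(4, p) for _ in range(trials)]  # same mean count, not random

poisson_joint, poisson_product = joint_vs_product(random_n, 1, 3)
fixed_joint, fixed_product = joint_vs_product(fixed_n, 1, 3)
print(poisson_joint, poisson_product)  # expected to be close
print(fixed_joint, fixed_product)      # expected to differ clearly
```

Agreement in one cell does not prove independence, but the sharp disagreement in the fixed-$N$ case shows the dependence that intuition expects.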