Lecture 12: Discrete vs. Continuous, the Uniform

Harvard Statistics 110 (Joe Blitzstein)

1. Why Continuous Now

With the binomial, Poisson, hypergeometric, geometric, and the other named discrete distributions covered, the course has all the discrete machinery it needs. It is a good time to switch to continuous distributions.

Discrete came first deliberately: it is conceptually simpler to picture. That does not make continuous harder. The discrete world forces us into sums, some of which stay nasty even after using stories to avoid them. In the continuous world, sums become integrals, and it is often easier to do an integral than the analogous sum. The same difficulty can resurface — we may meet an integral we cannot evaluate — so the goal stays the same: find clever, conceptual ways to avoid grinding through computation.

The good news is that most ideas carry over by direct analogy. Assuming a solid grasp of what a PMF is, what a discrete distribution means, and the expected value of a discrete distribution, the move to continuous is mostly a matter of translation.

· · ·

2. The Discrete-Continuous Dictionary

The cleanest way to learn the continuous case is to lay it next to the discrete case term by term. The random variable is called $X$ in both worlds.

| Concept | Discrete world | Continuous world |
| --- | --- | --- |
| Random variable | $X$ | $X$ |
| Mass / density function | PMF: $P(X = x)$ as a function of $x$ | PDF: $f_X(x)$ |
| $P(X = x)$ at a single point | Can be positive | Always $0$ |
| CDF | $F_X(x) = P(X \le x)$ | $F_X(x) = P(X \le x)$ (identical) |
| Probability of a region | Sum the PMF | Integrate the PDF |
| Expected value | $\sum_x x\, P(X = x)$ | $\int_{-\infty}^{\infty} x\, f(x)\, dx$ |
| Variance | $E(X^2) - (E(X))^2$ | $E(X^2) - (E(X))^2$ (identical) |

Two entries deserve emphasis.

Single points have probability zero

First, in the continuous case $P(X = x) = 0$ for every $x$. We are modeling random variables that can take any real value (or any real value in an interval, say $[0, 1]$). There are uncountably many reals in any interval, and any specific one, $\pi/4$ for instance, has probability $0$. A PMF would just report $0$ everywhere, which is useless, so we need a density instead.

Second, the CDF is completely general: every random variable, discrete or continuous, has one. So the CDF unifies the theory and we never need to separate out the cases for it. In the discrete case the CDF is a step function full of jumps and is awkward to work with, so the PMF is usually easier there.

· · ·

3. The PDF: Density, Not Probability

PDF stands for probability density function (not portable document format). The key word is density.

The most common mistake with PDFs is treating their values as probabilities. They are not. A density is probability per unit of something — per length, per area, per volume — not a probability.

Key analogy

Discrete probability is like pebbles: discrete lumps of mass, total mass $1$. Continuous probability is like mud smeared over the space: total mass still $1$, but no individual point carries any of it. A density measures how thickly the mud is spread, not how much sits at a point.

Definition

PDF

A random variable $X$ has PDF $f(x)$ if, for every interval $[a, b]$ with $a \le b$,

$$P(a \le X \le b) = \int_a^b f(x)\, dx.$$

So $f(x)$ is not a probability; it is the thing you integrate to get one. Integrate a density and you get a probability.

A consistency check is built in: setting $a = b$ gives $\int_a^a f(x)\, dx = 0$ (no width, no area). That matches the fact that any single point has probability $0$ — we need an interval of nonzero length to get nonzero probability.

Validity conditions

By analogy with a PMF (nonnegative, sums to $1$), a valid PDF must be nonnegative and integrate to $1$:

$$f(x) \ge 0 \ \text{ for all } x, \qquad \int_{-\infty}^{\infty} f(x)\, dx = 1.$$

Geometrically, $f$ can be any continuous-looking curve from $-\infty$ to $+\infty$ — symmetric or lopsided, possibly $0$ on one side — as long as it never dips below $0$ and the total area underneath equals $1$. The bell curve is the famous example, coming later.

What a density value actually means

A PDF value can exceed $1$: a function can poke above $1$ somewhere and still enclose total area $1$. So $f(x_0)$ cannot be a probability. To interpret it, convert from the density scale back to the probability scale by multiplying by a small width. For very small $\epsilon$,

$$P\!\left(x_0 - \tfrac{\epsilon}{2} \le X \le x_0 + \tfrac{\epsilon}{2}\right) \approx f(x_0)\, \epsilon.$$

The reason follows straight from the definition. To get that probability, integrate $f$ over the tiny interval. Because $\epsilon$ is very small, $f$ barely changes across it, so it is approximately constant there. The integral of a constant is the constant times the length of the interval — exactly $f(x_0)\, \epsilon$.
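As a rough numerical illustration of the approximation above, the sketch below uses an assumed example density, $f(x) = 3x^2$ on $[0, 1]$ with CDF $F(x) = x^3$ (this density is not from the lecture; it is chosen only because the exact interval probability is easy to compute). The function names are illustrative.

```python
def pdf(x: float) -> float:
    """Assumed example density f(x) = 3x^2 on [0, 1], zero elsewhere."""
    return 3.0 * x * x if 0.0 <= x <= 1.0 else 0.0


def cdf(x: float) -> float:
    """Matching CDF F(x) = x^3 on [0, 1]."""
    if x < 0.0:
        return 0.0
    if x > 1.0:
        return 1.0
    return x ** 3


def interval_probability(x0: float, eps: float) -> float:
    """Exact P(x0 - eps/2 <= X <= x0 + eps/2), computed from the CDF."""
    return cdf(x0 + eps / 2) - cdf(x0 - eps / 2)


x0 = 0.9
# The density at x0 is about 2.43: larger than 1, so it cannot be a probability.
print("density at x0:", pdf(x0))
for eps in (0.1, 0.01, 0.001):
    exact = interval_probability(x0, eps)
    approx = pdf(x0) * eps  # density times a small width
    print(f"eps={eps}: exact={exact:.8f}, f(x0)*eps={approx:.8f}")
```

For this particular density the gap between the two columns is $\epsilon^3/4$, so it shrinks much faster than $\epsilon$ itself.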

· · ·

4. PDF and CDF: Two Sides of the Same Coin

The PDF and CDF carry the same information; calculus moves between them.

CDF from PDF: integrate

The CDF is by definition $F(x) = P(X \le x)$. Since the PDF is the thing you integrate to get probability, integrate it over everything to the left of $x$:

$$F(x) = \int_{-\infty}^{x} f(t)\, dt.$$

The dummy variable is renamed $t$ to avoid clashing with the upper limit $x$. Picture it as the running area under the density curve from far left up to the point $x$.

PDF from CDF: differentiate

Going the other way, the PDF is the derivative of the CDF:

$$f(x) = F'(x).$$

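This is the Fundamental Theorem of Calculus, and both parts get used:

Part 1: differentiating the running integral $F(x) = \int_{-\infty}^{x} f(t)\, dt$ returns the integrand, $F'(x) = f(x)$ (at points where $f$ is continuous).

Part 2: a definite integral of $f$ is the change in its antiderivative $F$, so $\int_a^b f(x)\, dx = F(b) - F(a)$.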

So the probability of an interval can be read two equivalent ways:

$$P(a \le X \le b) = \int_a^b f(x)\, dx = F(b) - F(a),$$

the second equality being FTC Part 2, consistent with the earlier CDF results.
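The two directions can be checked numerically. The sketch below assumes the standard logistic distribution as a test case (it is not discussed in the lecture; it is used because both its CDF and PDF have simple closed forms). The helper names and step sizes are illustrative choices.

```python
import math


def logistic_cdf(x: float) -> float:
    """CDF of the standard logistic distribution: F(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def logistic_pdf(x: float) -> float:
    """Its PDF, the derivative of the CDF: f(x) = e^(-x) / (1 + e^(-x))^2."""
    e = math.exp(-x)
    return e / (1.0 + e) ** 2


def numerical_derivative(func, x, h=1e-5):
    """Central-difference approximation to func'(x)."""
    return (func(x + h) - func(x - h)) / (2.0 * h)


def midpoint_integral(func, a, b, n=10_000):
    """Midpoint-rule approximation to the integral of func over [a, b]."""
    width = (b - a) / n
    return sum(func(a + (i + 0.5) * width) for i in range(n)) * width


# Differentiating the CDF should recover the PDF.
print(numerical_derivative(logistic_cdf, 0.7), logistic_pdf(0.7))

# Integrating the PDF over [a, b] should match F(b) - F(a).
a, b = -1.0, 2.0
print(midpoint_integral(logistic_pdf, a, b), logistic_cdf(b) - logistic_cdf(a))
```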

A useful consequence

In the continuous case, strict versus non-strict inequalities make no difference: $P(X \le b) = P(X < b)$, because the endpoint contributes probability $0$. In the discrete case the difference is crucial.
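To see why the distinction matters in the discrete case, here is a small sketch using a fair six-sided die (an assumed example, not one from this lecture). The function names are illustrative.

```python
from fractions import Fraction

# PMF of a fair six-sided die: each face has probability 1/6.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}


def prob_at_most(pmf, b):
    """P(X <= b): sum the PMF over values up to and including b."""
    return sum((p for value, p in pmf.items() if value <= b), Fraction(0))


def prob_less_than(pmf, b):
    """P(X < b): sum the PMF over values strictly below b."""
    return sum((p for value, p in pmf.items() if value < b), Fraction(0))


print(prob_at_most(die_pmf, 3))    # 1/2
print(prob_less_than(die_pmf, 3))  # 1/3
```

The two answers differ by exactly $P(X = 3) = 1/6$, the mass sitting on the endpoint. In the continuous case that endpoint mass is $0$, so the two probabilities coincide.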

A note on terminology

“Continuous random variable” in this course means the CDF is differentiable (has a PDF), not merely that the CDF is a continuous function. There exist functions that are continuous but not differentiable everywhere; for them things get nastier and there is no PDF. The word “continuous” really refers to $X$ taking a whole continuum of values rather than discrete ones. We assume throughout that $F$ is differentiable, so the PDF exists.

· · ·

5. Expected Value, Variance, and Standard Deviation

Expected value

In the discrete case, $E(X)$ is the sum of (value times probability). In the continuous case that sum would be $0$ (every point has probability $0$), so the analog is an integral:

$$E(X) = \int_{-\infty}^{\infty} x\, f(x)\, dx.$$

If $X$ only lives on an interval, say $[0, 1]$, the integrand is $0$ outside it, so we just integrate over the region where $f$ is nonzero.
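A minimal numerical sketch of this integral, assuming the example density $f(x) = 3x^2$ on $[0, 1]$ (chosen for illustration, not taken from the lecture) and a simple midpoint rule. The exact answer for this density is $\int_0^1 x \cdot 3x^2\, dx = 3/4$.

```python
def midpoint_integral(func, lo, hi, n=10_000):
    """Approximate the integral of func over [lo, hi] with the midpoint rule."""
    width = (hi - lo) / n
    return sum(func(lo + (i + 0.5) * width) for i in range(n)) * width


def pdf(x):
    """Assumed example density f(x) = 3x^2 on [0, 1], zero elsewhere."""
    return 3.0 * x * x if 0.0 <= x <= 1.0 else 0.0


def expected_value(density, lo, hi):
    """E(X) = integral of x * f(x), restricted to the support [lo, hi]."""
    return midpoint_integral(lambda x: x * density(x), lo, hi)


total_area = midpoint_integral(pdf, 0.0, 1.0)  # should be close to 1
mean = expected_value(pdf, 0.0, 1.0)           # exact value is 3/4
print(total_area, mean)
```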

Variance

The expected value is a one-number summary of the center; it says nothing about spread. Variance measures spread: on average, how far is $X$ from its mean?

A first attempt, $E(X - E(X))$, is useless: by linearity it equals $E(X) - E(X) = 0$ always. Putting absolute values around the deviation, $E(|X - E(X)|)$, fixes the sign problem but the absolute-value function has a sharp corner (it is not differentiable) and is awkward to work with. The standard fix is to square the deviation:

$$\operatorname{Var}(X) = E\!\left[(X - E(X))^2\right].$$

Why square instead of absolute value

Beyond the absolute value being non-differentiable and annoying, there is a deeper reason: squares and sums of squares evoke the Pythagorean theorem, right triangles, and Euclidean distance. There is a great deal of beautiful geometry attached to squared quantities, and that geometry is lost if you use absolute values.

Standard deviation

Squaring changes the units. If $X$ is measured in miles, $\operatorname{Var}(X)$ is in miles squared. The standard deviation restores the original units:

$$\operatorname{SD}(X) = \sqrt{\operatorname{Var}(X)}.$$

The recipe looks convoluted — square, average, then square-root back — but variance has very nice mathematical properties, so we do the math with variance and convert to SD only at the end when we want something interpretable on the original scale.

Unified notation

A virtue of the $E(\cdot)$ notation: the definition $\operatorname{Var}(X) = E[(X - E(X))^2]$ assumes nothing about whether $X$ is discrete or continuous. It is a single, unified definition that works in both worlds without writing two versions.

The computational formula

Expanding the square gives a form that is usually easier to compute:

$$\operatorname{Var}(X) = E(X^2) - (E(X))^2$$

$\operatorname{Var}(X) = E\!\left[(X - E(X))^2\right]$

$\phantom{\operatorname{Var}(X)} = E\!\left[X^2 - 2\,X\,E(X) + (E(X))^2\right]$

$\phantom{\operatorname{Var}(X)} = E(X^2) - 2\,E(X)\,E(X) + (E(X))^2 \quad$ (linearity; $E(X)$ is a constant)

$\phantom{\operatorname{Var}(X)} = E(X^2) - (E(X))^2.$

At a glance the right-hand side looks like it should be zero, but the placement of the parentheses matters. $E(X^2)$ squares first, then averages. $(E(X))^2$ averages first, then squares. In general they are not equal.

Square-first vs. average-first

This settles an old question: given some measurements, should you square then average, or average then square? The two give different answers. The identity does not say which is “correct” for a given purpose, but because $\operatorname{Var}(X) = E[(X - E(X))^2]$ is the average of a quantity that is never negative, it shows $E(X^2) \ge (E(X))^2$ always, with equality only when $X$ is a constant (with probability $1$). If $X$ is constant, it always equals its mean, so $\operatorname{Var}(X) = 0$. Otherwise $(X - E(X))^2$ is sometimes positive and never negative, so its average is strictly positive, making $E(X^2)$ strictly greater than $(E(X))^2$.
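A small sketch of the two orderings on made-up measurements, treating each listed value as equally likely (an assumption for illustration). Exact fractions are used so the identity can be checked without rounding.

```python
from fractions import Fraction


def mean(values):
    """Average of the values, each weighted equally."""
    values = [Fraction(v) for v in values]
    return sum(values, Fraction(0)) / len(values)


def square_then_average(values):
    """E(X^2): square each value first, then average."""
    return mean([Fraction(v) ** 2 for v in values])


def average_then_square(values):
    """(E(X))^2: average first, then square the result."""
    return mean(values) ** 2


def variance_by_definition(values):
    """E[(X - E(X))^2], straight from the definition."""
    m = mean(values)
    return mean([(Fraction(v) - m) ** 2 for v in values])


data = [1, 2, 3, 6]  # hypothetical measurements
print(square_then_average(data))   # 25/2
print(average_then_square(data))   # 9
print(square_then_average(data) - average_then_square(data))  # 7/2
print(variance_by_definition(data))                           # 7/2

constant = [4, 4, 4]
print(variance_by_definition(constant))  # 0
```

The gap between the two orderings is exactly the variance, and it closes only for the constant data set.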

Notational convention: $E(X^2)$ means square first, then take the expectation. That is the standard reading whenever you see it. The remaining question, how to actually compute $E(X^2)$, is answered by LOTUS, below.

· · ·

6. The Uniform Distribution

The simplest continuous distribution is the Uniform. Before the midterm, only two named continuous distributions are required: the Uniform and the Normal. The Normal — the most important distribution in all of statistics — comes next week.

Setting it up: “completely random” on an interval

We want to pick a “random” point on an interval $[a, b]$. “Random” alone is too vague — every random variable is random. What does “completely random” mean here?

We cannot say “every two points are equally likely”: each individual point already has probability $0$, so that says nothing useful. Instead, reason about chunks. Split $[a, b]$ at its midpoint. Completely random should mean the left half is as likely as the right half — otherwise the variable would “prefer” one side, which is not uniform.

Defining principle of the uniform

Probability is proportional to length. Two subintervals of $[a, b]$ with equal length have equal probability; a subinterval twice as long is twice as likely.

The PDF

A constant density makes probability proportional to length. So the PDF is constant on $[a, b]$ and $0$ elsewhere:

$$f(x) = \begin{cases} c & a \le x \le b \\ 0 & \text{otherwise.} \end{cases}$$

To find $c$, force the total area to $1$. The density is $0$ outside $[a, b]$, so integrate only over $[a, b]$:

$$\int_a^b c\, dx = c\,(b - a) = 1 \quad\Longrightarrow\quad c = \frac{1}{b - a}.$$

The density is one over the length of the interval. Any other value would not be a valid PDF.

The CDF

Integrate the PDF from the left up to $x$, splitting into cases. Below $a$ there is nothing to accumulate; above $b$ everything has accumulated:

$$F(x) = \begin{cases} 0 & x < a \\[2pt] \dfrac{x - a}{b - a} & a \le x \le b \\[6pt] 1 & x > b. \end{cases}$$

The middle case is the only interesting one: integrating the constant $c$ from $a$ to $x$ gives $c\,(x - a) = \frac{x - a}{b - a}$. This is a continuous, piecewise-linear function. It checks out at the endpoints: plug in $x = a$ and it reduces to $0$; plug in $x = b$ and it reduces to $1$. The linear rise says probability accumulates at a steady rate as $x$ increases — natural, since equal lengths contribute equal probability.
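The PDF and CDF above translate directly into code. This is a minimal sketch with illustrative function names; the endpoints $a = 2$, $b = 6$ in the usage lines are arbitrary.

```python
def uniform_pdf(x: float, a: float, b: float) -> float:
    """Density of Uniform(a, b): 1 / (b - a) on [a, b], zero elsewhere."""
    if not a < b:
        raise ValueError("need a < b")
    return 1.0 / (b - a) if a <= x <= b else 0.0


def uniform_cdf(x: float, a: float, b: float) -> float:
    """CDF of Uniform(a, b): 0 below a, linear on [a, b], 1 above b."""
    if not a < b:
        raise ValueError("need a < b")
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)


def uniform_interval_prob(lo: float, hi: float, a: float, b: float) -> float:
    """P(lo <= X <= hi) as a difference of CDF values."""
    return uniform_cdf(hi, a, b) - uniform_cdf(lo, a, b)


a, b = 2.0, 6.0
# Two subintervals of equal length get equal probability.
print(uniform_interval_prob(2.5, 3.5, a, b), uniform_interval_prob(4.0, 5.0, a, b))
# A subinterval twice as long gets twice the probability.
print(uniform_interval_prob(3.0, 5.0, a, b))
```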

Expected value

$E(X)$ is an easy integral:

$$E(X) = \frac{a + b}{2}$$

$E(X) = \displaystyle\int_a^b x \cdot \frac{1}{b - a}\, dx = \frac{1}{b - a}\left[\frac{x^2}{2}\right]_a^b = \frac{b^2 - a^2}{2(b - a)} = \frac{(b - a)(b + a)}{2(b - a)} = \frac{a + b}{2}.$

The mean is the midpoint of the interval — exactly what intuition demands for something uniform. It would be strange if it were anything else.
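The midpoint result can also be checked by simulation. The sketch below uses Python's standard `random` module; the endpoints, sample size, and seed are arbitrary choices, and the estimate is only approximately equal to the midpoint.

```python
import random


def simulate_uniform_mean(a: float, b: float, n: int, seed: int = 0) -> float:
    """Average of n pseudo-random draws from Uniform(a, b)."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    return sum(rng.uniform(a, b) for _ in range(n)) / n


a, b = 2.0, 5.0
estimate = simulate_uniform_mean(a, b, n=100_000)
print("simulated mean:", estimate)
print("midpoint (a + b) / 2:", (a + b) / 2)
```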

· · ·

7. LOTUS: The Law of the Unconscious Statistician

To get $\operatorname{Var}(X)$ for the Uniform we still need $E(X^2)$. This raises a general problem: how do you compute the expected value of a function of a random variable?

The problem

Let $Y = X^2$. A function of a random variable is itself a random variable, so $E(X^2) = E(Y)$. The principled route is: find the PDF of $Y$, then compute $E(Y) = \int y\, f_Y(y)\, dy$. But we do not yet know the PDF of $Y$, and finding it is a hassle. (The course covers how to do it later.)

The lazy route that works

Instead of finding the PDF of $Y$, look at the formula for $E(X) = \int x\, f(x)\, dx$ and “lazily” replace $x$ with $x^2$ while keeping the PDF of $X$:

$$E(X^2) = \int_{-\infty}^{\infty} x^2\, f_X(x)\, dx.$$

This looks too good to be true: we never converted to the distribution of $Y$. But it is true. It is called the Law of the Unconscious Statistician (LOTUS). The name suggests doing it half-asleep, swapping $x$ for $x^2$ without thinking carefully about whether it is legitimate. It is legitimate.

Theorem — LOTUS

For a function $g$ of a random variable $X$, you can compute $E(g(X))$ using the distribution of $X$ directly — no need to find the distribution of $g(X)$ first.

Continuous: $$E(g(X)) = \int_{-\infty}^{\infty} g(x)\, f_X(x)\, dx.$$

Discrete: $$E(g(X)) = \sum_x g(x)\, P(X = x).$$

In both cases you keep $X$'s own PMF/PDF and simply feed the values through $g$. (Proof deferred; it will be justified next week.)
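A small discrete check makes the claim concrete. The sketch below assumes a toy random variable $X$ taking the values $-1, 0, 1$ with equal probability and $g(x) = x^2$ (an example chosen for illustration). It computes $E(g(X))$ two ways: directly from the PMF of $X$, and the long way, by first building the PMF of $Y = g(X)$.

```python
from collections import defaultdict
from fractions import Fraction

# Toy PMF: X is -1, 0, or 1, each with probability 1/3.
pmf_x = {-1: Fraction(1, 3), 0: Fraction(1, 3), 1: Fraction(1, 3)}


def g(x):
    return x * x


def expectation(pmf):
    """E of a discrete random variable: sum of value times probability."""
    return sum((value * p for value, p in pmf.items()), Fraction(0))


def expectation_of_function(pmf, func):
    """E(func(X)) using X's own PMF: sum of func(x) * P(X = x)."""
    return sum((func(x) * p for x, p in pmf.items()), Fraction(0))


def pmf_of_transformed(pmf, func):
    """PMF of Y = func(X): pool the probability of every x with the same func(x)."""
    pmf_y = defaultdict(Fraction)
    for x, p in pmf.items():
        pmf_y[func(x)] += p
    return dict(pmf_y)


pmf_y = pmf_of_transformed(pmf_x, g)       # {1: 2/3, 0: 1/3}
print(expectation(pmf_y))                  # 2/3, the long way
print(expectation_of_function(pmf_x, g))   # 2/3, without ever building pmf_y
```

Both routes agree because the second sum is just the first with terms grouped differently: the values $x = -1$ and $x = 1$ both contribute to $y = 1$.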

Variance of the Uniform via LOTUS

Take $U \sim \text{Uniform}(0, 1)$ for simplicity. Its PDF is the constant $1$ on $[0, 1]$ (since $\frac{1}{b - a} = \frac{1}{1} = 1$), and $E(U) = \tfrac{1}{2}$ (the midpoint). By LOTUS, with no need for the PDF of $U^2$:

$$E(U^2) = \int_0^1 u^2 \cdot 1\, du = \left[\frac{u^3}{3}\right]_0^1 = \frac{1}{3}.$$

Then:

$$\operatorname{Var}(U) = \frac{1}{12} \quad \text{for } U \sim \text{Uniform}(0, 1)$$

$\operatorname{Var}(U) = E(U^2) - (E(U))^2 = \dfrac{1}{3} - \dfrac{1}{4} = \dfrac{1}{12}.$

A very easy calculation, made easy by LOTUS.
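The same steps appear to go through for a general $X \sim \text{Uniform}(a, b)$. The following is a sketch that extends the computation above rather than a result stated in the lecture; it uses only the density $\frac{1}{b - a}$ and the mean $\frac{a + b}{2}$ found earlier.

$$E(X^2) = \int_a^b x^2 \cdot \frac{1}{b - a}\, dx = \frac{b^3 - a^3}{3(b - a)} = \frac{a^2 + ab + b^2}{3}$$

$$\operatorname{Var}(X) = E(X^2) - (E(X))^2 = \frac{a^2 + ab + b^2}{3} - \frac{(a + b)^2}{4} = \frac{a^2 - 2ab + b^2}{12} = \frac{(b - a)^2}{12}$$

Setting $a = 0$ and $b = 1$ recovers $\frac{1}{12}$, which agrees with the calculation for $U$.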

· · ·

8. Universality of the Uniform

The Uniform PDF could hardly be simpler — a constant on an interval. One structural requirement: the interval must be bounded. There is no Uniform on the entire real line, because no constant integrates to $1$ over an infinite domain (there is no way to normalize it). This is occasionally annoying but unavoidable.

Despite its simplicity, the $\text{Uniform}(0, 1)$ is extraordinarily powerful.

Universality of the uniform

Given a single $\text{Uniform}(0, 1)$ random variable, you can generate a draw from any distribution you want, however complicated — at least in principle. Whether the computation is easy or hard depends on the case, but in principle a uniform can produce anything.

This is theoretically elegant (one humble distribution unifies them all) and practically central: most computer programs can generate (pseudo-)random numbers between $0$ and $1$ but not arbitrary complicated distributions. Universality gives the conversion recipe used to simulate from those distributions.

The construction

Start with $U \sim \text{Uniform}(0, 1)$ and let $F$ be a CDF we want to draw from. Here we go in reverse from the usual workflow: instead of starting with a random variable and finding its CDF, we start with a target CDF $F$ and want to build a random variable having it.

To keep the proof short, assume $F$ is strictly increasing and continuous, so it has a genuine inverse with no flat regions or jumps. (The result generalizes beyond these assumptions.)

Theorem — Universality of the Uniform

Let $U \sim \text{Uniform}(0, 1)$ and let $F$ be a continuous, strictly increasing CDF. Define

$$X = F^{-1}(U).$$

Then $X$ has CDF $F$; that is, $X$ is a draw from the distribution $F$. In words: plug a uniform into the inverse CDF, and out comes a random draw from the target distribution.

The proof

The proof needs nothing but the meaning of a CDF — which is part of why it is worth doing: it is excellent practice at really understanding what a CDF is. Compute the CDF of $X$ directly:

$$P(X \le x) = F(x)$$

$P(X \le x) = P\!\left(F^{-1}(U) \le x\right) \quad$ (definition of $X$)

$\phantom{P(X \le x)} = P\!\left(U \le F(x)\right) \quad$ (apply increasing, invertible $F$ to both sides — same event, inequality preserved)

$\phantom{P(X \le x)} = F(x) \quad$ (for $U \sim \text{Uniform}(0,1)$, the probability of an interval is its length; the interval $[0, F(x)]$ has length $F(x)$)

So $P(X \le x) = F(x)$: $X$ has CDF $F$, and the construction works.

Conclusion

Run a $\text{Uniform}(0, 1)$ draw through the inverse CDF $F^{-1}$ and you obtain a sample from any distribution whose CDF $F$ is continuous and strictly increasing. This is the engine behind simulating arbitrary distributions from uniform random numbers.
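A minimal simulation sketch of the construction $X = F^{-1}(U)$. It assumes the standard logistic distribution as the target (not an example from the lecture), chosen because its CDF $F(x) = 1/(1 + e^{-x})$ is continuous and strictly increasing with the closed-form inverse $F^{-1}(u) = \log\frac{u}{1 - u}$. Function names, sample size, and seed are illustrative.

```python
import math
import random


def logistic_cdf(x: float) -> float:
    """Target CDF F(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def logistic_inverse_cdf(u: float) -> float:
    """Inverse CDF F^{-1}(u) = log(u / (1 - u)), defined for 0 < u < 1."""
    return math.log(u / (1.0 - u))


def sample_logistic(n: int, seed: int = 0) -> list:
    """Draw n values of X = F^{-1}(U) with U ~ Uniform(0, 1)."""
    rng = random.Random(seed)
    draws = []
    while len(draws) < n:
        u = rng.random()  # in [0, 1); skip the rare exact 0.0, where the log is undefined
        if u > 0.0:
            draws.append(logistic_inverse_cdf(u))
    return draws


def empirical_cdf(draws, x: float) -> float:
    """Fraction of draws that are <= x."""
    return sum(d <= x for d in draws) / len(draws)


draws = sample_logistic(50_000, seed=42)
for x in (-1.0, 0.0, 1.5):
    print(x, empirical_cdf(draws, x), logistic_cdf(x))
```

If the construction works as the theorem says, the empirical fraction of draws at or below each $x$ should be close to $F(x)$, up to simulation noise.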