With the binomial, Poisson, hypergeometric, geometric, and the other named discrete distributions covered, the course has all the discrete machinery it needs. It is a good time to switch to continuous distributions.
Discrete came first deliberately: it is conceptually simpler to picture. That does not make continuous harder. The discrete world forces us into sums, some of which stay nasty even after using stories to avoid them. In the continuous world, sums become integrals, and it is often easier to do an integral than the analogous sum. The same difficulty can resurface — we may meet an integral we cannot evaluate — so the goal stays the same: find clever, conceptual ways to avoid grinding through computation.
The good news is that most ideas carry over by direct analogy. Assuming a solid grasp of what a PMF is, what a discrete distribution means, and the expected value of a discrete distribution, the move to continuous is mostly a matter of translation.
The cleanest way to learn the continuous case is to lay it next to the discrete case term by term. The random variable is called $X$ in both worlds.
| Concept | Discrete world | Continuous world |
|---|---|---|
| Random variable | $X$ | $X$ |
| Mass / density function | PMF: $P(X = x)$ as a function of $x$ | PDF: $f_X(x)$ |
| $P(X = x)$ at a single point | Can be positive | Always $0$ |
| CDF | $F_X(x) = P(X \le x)$ | $F_X(x) = P(X \le x)$ (identical) |
| Probability of a region | Sum the PMF | Integrate the PDF |
| Expected value | $\sum_x x\, P(X = x)$ | $\int_{-\infty}^{\infty} x\, f(x)\, dx$ |
| Variance | $E(X^2) - (E(X))^2$ | $E(X^2) - (E(X))^2$ (identical) |
Two entries deserve emphasis.
First, in the continuous case $P(X = x) = 0$ for every $x$. We are modeling random variables that can take any real value (or any real value in an interval, say $[0, 1]$). There are uncountably many reals in any interval, and any specific one ($\pi/4$, for instance) has probability $0$. A PMF would just report $0$ everywhere, which is useless, so we need a density instead.
Second, the CDF is completely general: every random variable, discrete or continuous, has one. So the CDF unifies the theory and we never need to separate out the cases for it. In the discrete case the CDF is a step function full of jumps and is awkward to work with, so the PMF is usually easier there.
PDF stands for probability density function (not portable document format). The key word is density.
The most common mistake with PDFs is treating their values as probabilities. They are not. A density is probability per unit of something — per length, per area, per volume — not a probability.
Discrete probability is like pebbles: discrete lumps of mass, total mass $1$. Continuous probability is like mud smeared over the space: total mass still $1$, but no individual point carries any of it. A density measures how thickly the mud is spread, not how much sits at a point.
A random variable $X$ has PDF $f(x)$ if, for all $a \le b$,
$$P(a \le X \le b) = \int_a^b f(x)\, dx.$$
So $f(x)$ is not a probability; it is the thing you integrate to get one. Integrate a density and you get a probability.
A consistency check is built in: setting $a = b$ gives $\int_a^a f(x)\, dx = 0$ (no width, no area). That matches the fact that any single point has probability $0$ — we need an interval of nonzero length to get nonzero probability.
By analogy with a PMF (nonnegative, sums to $1$), a valid PDF must be nonnegative and integrate to $1$:
$$f(x) \ge 0 \ \text{ for all } x, \qquad \int_{-\infty}^{\infty} f(x)\, dx = 1.$$
Geometrically, $f$ can be any continuous-looking curve from $-\infty$ to $+\infty$ — symmetric or lopsided, possibly $0$ on one side — as long as it never dips below $0$ and the total area underneath equals $1$. The bell curve is the famous example, coming later.
A PDF value can exceed $1$: a function can poke above $1$ somewhere and still enclose total area $1$. So $f(x_0)$ cannot be a probability. To interpret it, convert from the density scale back to the probability scale by multiplying by a small width. For very small $\epsilon$,
$$P\!\left(x_0 - \tfrac{\epsilon}{2} \le X \le x_0 + \tfrac{\epsilon}{2}\right) \approx f(x_0)\, \epsilon.$$
The reason follows straight from the definition. To get that probability, integrate $f$ over the tiny interval. Because $\epsilon$ is very small, $f$ barely changes across it, so it is approximately constant there. The integral of a constant is the constant times the length of the interval — exactly $f(x_0)\, \epsilon$.
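A quick numerical check can make the density-versus-probability distinction concrete. The sketch below assumes a hypothetical example density $f(x) = 3x^2$ on $[0, 1]$ (not from the lecture; chosen because it exceeds $1$ near the right endpoint) and a simple midpoint Riemann sum, `integrate`, as a stand-in for exact integration.

```python
def f(x):
    """Hypothetical example density: f(x) = 3x^2 on [0, 1], 0 elsewhere."""
    return 3.0 * x * x if 0.0 <= x <= 1.0 else 0.0


def integrate(g, lo, hi, n=100_000):
    """Midpoint Riemann sum of g over [lo, hi] with n equal slices."""
    width = (hi - lo) / n
    return sum(g(lo + (i + 0.5) * width) for i in range(n)) * width


# A valid PDF encloses total area 1 (f is 0 outside [0, 1]).
total_area = integrate(f, 0.0, 1.0)

# The density at x0 is larger than 1, so it cannot be a probability.
x0, eps = 0.9, 1e-3

# Probability of the width-eps interval centered at x0, by integration.
exact = integrate(f, x0 - eps / 2, x0 + eps / 2, n=1_000)

# The "density times small width" approximation.
approx = f(x0) * eps

print(f"total area   = {total_area:.6f}")
print(f"f(x0)        = {f(x0):.3f}")
print(f"P(interval)  = {exact:.8f}")
print(f"f(x0) * eps  = {approx:.8f}")
```

Under these assumptions the density value at $x_0$ is above $1$, while the two probability figures agree to many decimal places, which is the sense in which $f(x_0)\,\epsilon$ approximates the probability of a tiny interval.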
The PDF and CDF carry the same information; calculus moves between them.
The CDF is by definition $F(x) = P(X \le x)$. Since the PDF is the thing you integrate to get probability, integrate it over everything to the left of $x$:
$$F(x) = \int_{-\infty}^{x} f(t)\, dt.$$
The dummy variable is renamed $t$ to avoid clashing with the upper limit $x$. Picture it as the running area under the density curve from far left up to the point $x$.
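Going the other way, the PDF is the derivative of the CDF:
$$f(x) = F'(x).$$
This is the Fundamental Theorem of Calculus, and both parts get used. Part 1 says that differentiating the running integral $F(x) = \int_{-\infty}^{x} f(t)\, dt$ recovers the integrand $f(x)$. Part 2 says that integrating $f$ over $[a, b]$ equals the change in its antiderivative $F$ across the interval.
So the probability of an interval can be read two equivalent ways:
$$P(a \le X \le b) = \int_a^b f(x)\, dx = F(b) - F(a),$$
the second equality being FTC Part 2, consistent with the earlier CDF results.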
In the continuous case, strict versus non-strict inequalities make no difference: $P(X \le b) = P(X < b)$, because the endpoint contributes probability $0$. In the discrete case the difference is crucial.
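The two-way relationship between PDF and CDF can be checked numerically. This is a minimal sketch assuming a hypothetical density $f(x) = 3x^2$ on $[0, 1]$, whose CDF is $F(x) = x^3$ there; the names `pdf`, `cdf`, and `integrate` are illustrative, and the derivative is approximated by a centered difference.

```python
def pdf(x):
    """Hypothetical example density: 3x^2 on [0, 1], 0 elsewhere."""
    return 3.0 * x * x if 0.0 <= x <= 1.0 else 0.0


def cdf(x):
    """CDF matching the density above: 0 left of 0, x^3 on [0, 1], 1 right of 1."""
    if x < 0.0:
        return 0.0
    if x > 1.0:
        return 1.0
    return x ** 3


def integrate(g, lo, hi, n=100_000):
    """Midpoint Riemann sum of g over [lo, hi]."""
    width = (hi - lo) / n
    return sum(g(lo + (i + 0.5) * width) for i in range(n)) * width


# Interval probability, read two ways: area under the PDF vs. change in the CDF.
a, b = 0.2, 0.7
by_integral = integrate(pdf, a, b)
by_cdf = cdf(b) - cdf(a)

# Slope of the CDF at a point (centered difference) vs. the PDF at that point.
x, h = 0.5, 1e-6
slope = (cdf(x + h) - cdf(x - h)) / (2 * h)

print(f"integral of pdf over [a, b] = {by_integral:.6f}")
print(f"F(b) - F(a)                 = {by_cdf:.6f}")
print(f"numerical F'(x)             = {slope:.6f}")
print(f"f(x)                        = {pdf(x):.6f}")
```

Because the endpoints carry probability $0$, the same numbers would come out whether the interval is taken open or closed.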
“Continuous random variable” in this course means the CDF is differentiable (has a PDF), not merely that the CDF is a continuous function. There exist functions that are continuous but not differentiable everywhere; for them things get nastier and there is no PDF. The word “continuous” really refers to $X$ taking a whole continuum of values rather than discrete ones. We assume throughout that $F$ is differentiable, so the PDF exists.
In the discrete case, $E(X)$ is the sum of (value times probability). In the continuous case that sum would be $0$ (every point has probability $0$), so the analog is an integral:
$$E(X) = \int_{-\infty}^{\infty} x\, f(x)\, dx.$$
If $X$ only lives on an interval, say $[0, 1]$, the integrand is $0$ outside it, so we just integrate over the region where $f$ is nonzero.
The expected value is a one-number summary of the center; it says nothing about spread. Variance measures spread: on average, how far is $X$ from its mean?
A first attempt, $E(X - E(X))$, is useless: by linearity it equals $E(X) - E(X) = 0$ always. Putting absolute values around the deviation, $E(|X - E(X)|)$, fixes the sign problem, but the absolute-value function has a sharp corner (it is not differentiable at $0$) and is awkward to work with. The standard fix is to square the deviation:
$$\operatorname{Var}(X) = E\!\left[(X - E(X))^2\right].$$
Beyond the absolute value being non-differentiable and annoying, there is a deeper reason: squares and sums of squares evoke the Pythagorean theorem, right triangles, and Euclidean distance. There is a great deal of beautiful geometry attached to squared quantities, and that geometry is lost if you use absolute values.
Squaring changes the units. If $X$ is measured in miles, $\operatorname{Var}(X)$ is in miles squared. The standard deviation restores the original units:
$$\operatorname{SD}(X) = \sqrt{\operatorname{Var}(X)}.$$
The recipe looks convoluted — square, average, then square-root back — but variance has very nice mathematical properties, so we do the math with variance and convert to SD only at the end when we want something interpretable on the original scale.
A virtue of the $E(\cdot)$ notation: the definition $\operatorname{Var}(X) = E[(X - E(X))^2]$ assumes nothing about whether $X$ is discrete or continuous. It is a single, unified definition that works in both worlds without writing two versions.
Expanding the square gives a form that is usually easier to compute:
$\operatorname{Var}(X) = E\!\left[(X - E(X))^2\right]$
$\phantom{\operatorname{Var}(X)} = E\!\left[X^2 - 2\,X\,E(X) + (E(X))^2\right]$
$\phantom{\operatorname{Var}(X)} = E(X^2) - 2\,E(X)\,E(X) + (E(X))^2 \quad$ (linearity; $E(X)$ is a constant)
$\phantom{\operatorname{Var}(X)} = E(X^2) - (E(X))^2.$
This reads almost like “zero,” but the parentheses differ. $E(X^2)$ squares first, then averages. $(E(X))^2$ averages first, then squares. They are not equal.
This settles an old question: given some measurements, should you square then average, or average then square? The two give different answers. The identity does not say which is “correct” for a given purpose, but it shows $E(X^2) \ge (E(X))^2$ always, with equality only when $X$ is a constant. If $X$ is constant, $\operatorname{Var}(X) = 0$ (it always equals its mean). Otherwise you are averaging quantities that are sometimes positive and never negative, so the average is strictly positive, making $E(X^2)$ strictly greater than $(E(X))^2$.
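To see "square then average" and "average then square" come apart on actual numbers, here is a small sketch. It assumes a fair six-sided die as the example distribution (an illustrative choice, not from the lecture) and uses exact fractions so no rounding is involved; the helper `expect` is a hypothetical name that computes an expectation as a probability-weighted sum over the values of $X$.

```python
from fractions import Fraction

# PMF of a fair six-sided die (illustrative example distribution).
pmf = {k: Fraction(1, 6) for k in range(1, 7)}


def expect(g, pmf):
    """Probability-weighted sum of g(x) over the support of a discrete PMF."""
    return sum(g(x) * p for x, p in pmf.items())


mean = expect(lambda x: x, pmf)                # average
second_moment = expect(lambda x: x * x, pmf)   # square first, then average
mean_squared = mean ** 2                       # average first, then square

# Variance from the definition and from the shortcut identity.
var_definition = expect(lambda x: (x - mean) ** 2, pmf)
var_shortcut = second_moment - mean_squared

print(f"E(X)            = {mean}")
print(f"E(X^2)          = {second_moment}")
print(f"(E(X))^2        = {mean_squared}")
print(f"Var, definition = {var_definition}")
print(f"Var, shortcut   = {var_shortcut}")
```

For this die the two orders of operation give $91/6$ and $49/4$, and their gap is exactly the variance, $35/12$, whichever formula is used.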
Notational convention: $E(X^2)$ means square first, then take the expectation. That is the standard reading whenever you see it. The remaining question — how to actually compute $E(X^2)$ — is answered by Lotus, below.
The simplest continuous distribution is the Uniform. Before the midterm, only two named continuous distributions are required: the Uniform and the Normal. The Normal — the most important distribution in all of statistics — comes next week.
We want to pick a “random” point on an interval $[a, b]$. “Random” alone is too vague — every random variable is random. What does “completely random” mean here?
We cannot say “every two points are equally likely”: each individual point already has probability $0$, so that says nothing useful. Instead, reason about chunks. Split $[a, b]$ at its midpoint. Completely random should mean the left half is as likely as the right half — otherwise the variable would “prefer” one side, which is not uniform.
Probability is proportional to length. Two intervals of equal length have equal probability; an interval twice as long is twice as likely.
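A constant density makes probability proportional to length. So the PDF is constant on $[a, b]$ and $0$ elsewhere:
$$f(x) = \begin{cases} c, & a \le x \le b, \\ 0, & \text{otherwise}. \end{cases}$$
To find $c$, force the total area to $1$. The density is $0$ outside $[a, b]$, so integrate only over $[a, b]$:
$$1 = \int_a^b c\, dx = c\,(b - a) \quad\Longrightarrow\quad c = \frac{1}{b - a}.$$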
The density is one over the length of the interval. Any other value would not be a valid PDF.
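Integrate the PDF from the left up to $x$, splitting into cases. Below $a$ there is nothing to accumulate; above $b$ everything has accumulated:
$$F(x) = \begin{cases} 0, & x < a, \\ \dfrac{x - a}{b - a}, & a \le x \le b, \\ 1, & x > b. \end{cases}$$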
The middle case is the only interesting one: integrating the constant $c$ from $a$ to $x$ gives $c\,(x - a) = \frac{x - a}{b - a}$. This is a continuous, piecewise-linear function. It checks out at the endpoints: plug in $x = a$ and it reduces to $0$; plug in $x = b$ and it reduces to $1$. The linear rise says probability accumulates at a steady rate as $x$ increases — natural, since equal lengths contribute equal probability.
$E(X)$ is an easy integral:
$E(X) = \displaystyle\int_a^b x \cdot \frac{1}{b - a}\, dx = \frac{1}{b - a}\left[\frac{x^2}{2}\right]_a^b = \frac{b^2 - a^2}{2(b - a)} = \frac{(b - a)(b + a)}{2(b - a)} = \frac{a + b}{2}.$
The mean is the midpoint of the interval — exactly what intuition demands for something uniform. It would be strange if it were anything else.
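A simulation offers a rough check of the Uniform CDF and mean. The sketch below assumes the endpoints $a = 2$, $b = 5$ and a fixed seed (both arbitrary choices for illustration) and draws from Python's standard `random` module; `uniform_cdf` is a hypothetical helper implementing the piecewise CDF.

```python
import random


def uniform_cdf(x, a, b):
    """CDF of Uniform(a, b): 0 below a, linear on [a, b], 1 above b."""
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)


a, b = 2.0, 5.0            # arbitrary illustrative endpoints
rng = random.Random(0)     # fixed seed so the run is reproducible
draws = [rng.uniform(a, b) for _ in range(200_000)]

# The sample mean should land near the midpoint (a + b) / 2.
sample_mean = sum(draws) / len(draws)

# The fraction of draws at or below x should land near F(x).
x = 3.0
empirical = sum(d <= x for d in draws) / len(draws)

# Probability proportional to length: the two halves should be about equally likely.
mid = (a + b) / 2
left_half = sum(d <= mid for d in draws) / len(draws)

print(f"sample mean   = {sample_mean:.4f}  (midpoint = {mid})")
print(f"empirical CDF = {empirical:.4f}  (F(x) = {uniform_cdf(x, a, b):.4f})")
print(f"left half     = {left_half:.4f}")
```

The simulated values will not match the formulas exactly; they should only be close, with the gap shrinking as the number of draws grows.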
To get $\operatorname{Var}(X)$ for the Uniform we still need $E(X^2)$. This raises a general problem: how do you compute the expected value of a function of a random variable?
Let $Y = X^2$. A function of a random variable is itself a random variable, so $E(X^2) = E(Y)$. The principled route is: find the PDF of $Y$, then compute $E(Y) = \int y\, f_Y(y)\, dy$. But we do not yet know the PDF of $Y$, and finding it is a hassle. (The course covers how to do it later.)
Instead of finding the PDF of $Y$, look at the formula for $E(X) = \int x\, f(x)\, dx$ and “lazily” replace $x$ with $x^2$ while keeping the PDF of $X$:
$$E(X^2) = \int_{-\infty}^{\infty} x^2\, f_X(x)\, dx.$$
This looks too good to be true: we never converted to the distribution of $Y$. But it is true. It is called the Law of the Unconscious Statistician (Lotus) — the name suggests doing it half-asleep, swapping $x$ for $x^2$ without thinking carefully about whether it is legitimate. It is legitimate.
For a function $g$ of a random variable $X$, you can compute $E(g(X))$ using the distribution of $X$ directly — no need to find the distribution of $g(X)$ first.
Continuous: $$E(g(X)) = \int_{-\infty}^{\infty} g(x)\, f_X(x)\, dx.$$
Discrete: $$E(g(X)) = \sum_x g(x)\, P(X = x).$$
In both cases you keep $X$'s own PMF/PDF and simply feed the values through $g$. (Proof deferred; it will be justified next week.)
Take $U \sim \text{Uniform}(0, 1)$ for simplicity. Its PDF is the constant $1$ on $[0, 1]$ (since $\frac{1}{b - a} = \frac{1}{1} = 1$), and $E(U) = \tfrac{1}{2}$ (the midpoint). By Lotus, with no need for the PDF of $U^2$:
$$E(U^2) = \int_0^1 u^2 \cdot 1\, du = \left[\frac{u^3}{3}\right]_0^1 = \frac{1}{3}.$$
Then:
$\operatorname{Var}(U) = E(U^2) - (E(U))^2 = \dfrac{1}{3} - \dfrac{1}{4} = \dfrac{1}{12}.$
A very easy calculation, made easy by Lotus.
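One way to build trust in Lotus is to compare it against brute force. The sketch below computes $E(U^2)$ by integrating $u^2$ against the PDF of $U$ (the Lotus route), then estimates the same quantity by simulating $Y = U^2$ directly and averaging, which never uses the formula. The midpoint-rule integrator, the seed, and the sample size are illustrative assumptions.

```python
import random


def integrate(g, lo, hi, n=100_000):
    """Midpoint Riemann sum of g over [lo, hi]."""
    width = (hi - lo) / n
    return sum(g(lo + (i + 0.5) * width) for i in range(n)) * width


def pdf_u(u):
    """PDF of Uniform(0, 1): the constant 1 on [0, 1]."""
    return 1.0 if 0.0 <= u <= 1.0 else 0.0


# Lotus route: keep the PDF of U and feed the values through g(u) = u^2.
e_u = integrate(lambda u: u * pdf_u(u), 0.0, 1.0)
e_u2 = integrate(lambda u: u * u * pdf_u(u), 0.0, 1.0)
var_u = e_u2 - e_u ** 2

# Brute-force route: simulate Y = U^2 and average the draws of Y.
rng = random.Random(1)  # fixed seed for reproducibility
ys = [rng.random() ** 2 for _ in range(200_000)]
mc_e_u2 = sum(ys) / len(ys)

print(f"E(U)   by integration = {e_u:.6f}")
print(f"E(U^2) by Lotus       = {e_u2:.6f}")
print(f"E(U^2) by simulation  = {mc_e_u2:.6f}")
print(f"Var(U)                = {var_u:.6f}  (1/12 = {1 / 12:.6f})")
```

The simulated average is only an estimate, so it should agree with $1/3$ to a couple of decimal places rather than exactly.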
The Uniform PDF could hardly be simpler — a constant on an interval. One structural requirement: the interval must be bounded. There is no Uniform on the entire real line, because no constant integrates to $1$ over an infinite domain (there is no way to normalize it). This is occasionally annoying but unavoidable.
Despite its simplicity, the $\text{Uniform}(0, 1)$ is extraordinarily powerful.
Given a single $\text{Uniform}(0, 1)$ random variable, you can generate a draw from any distribution you want, however complicated — at least in principle. Whether the computation is easy or hard depends on the case, but in principle a uniform can produce anything.
This is theoretically elegant (one humble distribution unifies them all) and practically central: most computer programs can generate (pseudo-)random numbers between $0$ and $1$ but not arbitrary complicated distributions. Universality gives the conversion recipe used to simulate from those distributions.
Start with $U \sim \text{Uniform}(0, 1)$ and let $F$ be a CDF we want to draw from. Here we go in reverse from the usual workflow: instead of starting with a random variable and finding its CDF, we start with a target CDF $F$ and want to build a random variable having it.
To keep the proof short, assume $F$ is strictly increasing and continuous, so it has a genuine inverse with no flat regions or jumps. (The result generalizes beyond these assumptions.)
Let $U \sim \text{Uniform}(0, 1)$ and let $F$ be a continuous, strictly increasing CDF. Define
$$X = F^{-1}(U).$$
Then $X$ has CDF $F$; that is, $X$ is a draw from the distribution $F$. In words: plug a uniform into the inverse CDF, and out comes a random draw from the target distribution.
The proof needs nothing but the meaning of a CDF — which is part of why it is worth doing: it is excellent practice at really understanding what a CDF is. Compute the CDF of $X$ directly:
$P(X \le x) = P\!\left(F^{-1}(U) \le x\right) \quad$ (definition of $X$)
$\phantom{P(X \le x)} = P\!\left(U \le F(x)\right) \quad$ (apply increasing, invertible $F$ to both sides — same event, inequality preserved)
$\phantom{P(X \le x)} = F(x) \quad$ (for $U \sim \text{Uniform}(0,1)$, the probability of an interval is its length; the interval $[0, F(x)]$ has length $F(x)$)
So $P(X \le x) = F(x)$: $X$ has CDF $F$, and the construction works.
Run a $\text{Uniform}(0, 1)$ draw through the inverse CDF $F^{-1}$ and you obtain a sample from any continuous, strictly increasing distribution $F$ — the engine behind simulating arbitrary distributions from uniform random numbers.
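The recipe $X = F^{-1}(U)$ can be tried out in a few lines. This sketch assumes the logistic CDF $F(x) = 1 / (1 + e^{-x})$ as the target (an illustrative choice, not prescribed by the text; it is continuous and strictly increasing on the whole real line, so it meets the theorem's assumptions), with inverse $F^{-1}(u) = \log\frac{u}{1 - u}$. The helper names and the seed are hypothetical.

```python
import math
import random


def logistic_cdf(x):
    """Target CDF F(x) = 1 / (1 + e^{-x}) (illustrative choice)."""
    return 1.0 / (1.0 + math.exp(-x))


def logistic_inverse_cdf(u):
    """Inverse of the target CDF, defined for 0 < u < 1."""
    return math.log(u / (1.0 - u))


def uniform_open(rng):
    """Uniform draw on the open interval (0, 1); redraws the rare exact 0."""
    u = rng.random()          # random() returns values in [0, 1)
    while u == 0.0:
        u = rng.random()
    return u


def empirical_cdf(draws, x):
    """Fraction of draws that are <= x."""
    return sum(d <= x for d in draws) / len(draws)


rng = random.Random(2)  # fixed seed for reproducibility
draws = [logistic_inverse_cdf(uniform_open(rng)) for _ in range(200_000)]

# If the construction works, the empirical CDF of the draws tracks F.
for x in (-1.0, 0.0, 2.0):
    print(f"x = {x:+.1f}: empirical = {empirical_cdf(draws, x):.4f}, "
          f"F(x) = {logistic_cdf(x):.4f}")
```

The only randomness used is the stream of uniforms; everything else is a deterministic transformation, which is why a generator for $\text{Uniform}(0, 1)$ is enough to simulate from a target whose inverse CDF can be evaluated.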