Lecture 14: Location, Scale, and LOTUS

Harvard Statistics 110 (Joe Blitzstein)

1. Moments of the Standard Normal

Recall the standard normal $Z \sim \mathcal{N}(0, 1)$. The letter $Z$ is traditional for a standard normal, but it is only a convention: other letters may be used, and $Z$ need not be reserved for standard normals. From last time:

PDF: $\dfrac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$, with normalizing constant $\dfrac{1}{\sqrt{2\pi}}$.
CDF: has no closed form, so it gets its own symbol, capital $\Phi$.
Mean: $E(Z) = 0$, immediate by symmetry.
Variance: $\operatorname{Var}(Z) = E(Z^2) - E(Z)^2 = E(Z^2) = 1$, computed by integration by parts.

The $k$-th moment

$E(Z^k)$ is called the $k$-th moment: $E(Z)$ is the first moment, $E(Z^2)$ the second, $E(Z^3)$ the third, and so on. (The origin of the word "moment" comes later in the course.)

By LOTUS, $E(Z^3)$ is the integral

$$E(Z^3) = \int_{-\infty}^{\infty} z^3 \cdot \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz = 0.$$

Without the $z^3$ factor, this integral is just the PDF integrating to $1$; LOTUS lets us insert $z^3$ directly without first finding the distribution of $Z^3$. The integral vanishes because the integrand is an odd function (and the integral converges, thanks to the rapidly decaying exponential). The same argument kills every odd power: $E(Z) = E(Z^3) = E(Z^5) = \cdots = 0$.

Odd vs. even moments

Moment	Value	How
Odd powers $E(Z), E(Z^3), E(Z^5), \ldots$	$0$	Symmetry — odd integrand, no integral needed
$E(Z^2)$	$1$	Integration by parts (done last time)
$E(Z^4)$	not trivial	Same integral with $z^4$; computable but deferred until after the midterm

The point: you should immediately be able to write down the LOTUS integral for $E(Z^4)$, even if you cannot yet evaluate it. The odd moments require no computation at all.
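As a numerical sanity check on these moments, the LOTUS integral can be approximated directly. The sketch below is an illustration rather than part of the lecture: it uses a plain midpoint rule on $[-10, 10]$ (the cutoff and step count are assumptions chosen for convenience), and `normal_moment` is a hypothetical helper name.

```python
import math

def std_normal_pdf(z):
    """Standard normal PDF: exp(-z^2 / 2) / sqrt(2 * pi)."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def normal_moment(k, lo=-10.0, hi=10.0, steps=20000):
    """Approximate E(Z^k) = integral of z^k * pdf(z) dz with a midpoint rule.

    The finite range [lo, hi] is an assumption: the tails beyond |z| = 10
    contribute a negligible amount for small k.
    """
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        z = lo + (i + 0.5) * h  # midpoint of the i-th subinterval
        total += z ** k * std_normal_pdf(z)
    return total * h

for k in range(5):
    print(k, normal_moment(k))
```

The odd moments come out as zero up to rounding, the second moment is close to $1$, and the fourth is close to $3$, a value whose derivation the lecture defers.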

Symmetry restated: $-Z$ is also standard normal

Key fact

If $Z \sim \mathcal{N}(0, 1)$, then $-Z \sim \mathcal{N}(0, 1)$. Flipping the sign turns positives into negatives and vice versa — it changes the random variable but not its distribution. You can verify this by finding the CDF of $-Z$ and differentiating. Always look for symmetries.

· · ·

2. The General Normal: Location and Scale

Define $X = \mu + \sigma Z$, where $\mu$ is any real number and $\sigma$ is any positive number, with $Z \sim \mathcal{N}(0, 1)$. Then $X \sim \mathcal{N}(\mu, \sigma^2)$.

Why this construction

This is more insightful than starting from the PDF (as most books do). There is one fundamental distribution — the standard normal — and every other normal is obtained from it by scaling and shifting. Always reduce a normal back to standard normal rather than wrestling with the formula.

Parameter	Role	Effect on the density
$\mu$ (mean)	location	Adding a constant shifts the curve left/right; the shape is unchanged
$\sigma$ (std. dev.)	scale	Multiplying stretches/squeezes the curve; a normalizing factor keeps the area at $1$

$\mu$ can be any real number, including negative values; $\sigma$ must be positive, since it is the standard deviation $\sqrt{\operatorname{Var}(X)}$.

Checking the mean and variance

Mean (by linearity): $E(X) = \mu + \sigma\, E(Z) = \mu + 0 = \mu$.
Variance: needs the properties of variance under shifting and scaling (next section). The result will be $\operatorname{Var}(X) = \sigma^2 \operatorname{Var}(Z) = \sigma^2$.
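A quick simulation can make the location-scale construction concrete. This is only a sketch: the values $\mu = -1.5$, $\sigma = 2$, the sample size, and the seed are arbitrary assumptions, and `simulate_location_scale` is a hypothetical helper name.

```python
import random

def simulate_location_scale(mu, sigma, n, seed=0):
    """Draw n samples of X = mu + sigma * Z with Z standard normal."""
    rng = random.Random(seed)
    return [mu + sigma * rng.gauss(0.0, 1.0) for _ in range(n)]

samples = simulate_location_scale(mu=-1.5, sigma=2.0, n=200_000)

# Sample mean and (population-style) sample variance, computed by hand.
sample_mean = sum(samples) / len(samples)
sample_var = sum((x - sample_mean) ** 2 for x in samples) / len(samples)

print("sample mean:", sample_mean)      # should be near mu = -1.5
print("sample variance:", sample_var)   # should be near sigma^2 = 4.0
```

The sample mean and variance should land near $\mu$ and $\sigma^2$ respectively, up to simulation noise.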

· · ·

3. Properties of Variance

Two equivalent formulas:

$$\operatorname{Var}(X) = E\!\left[(X - E(X))^2\right] = E(X^2) - E(X)^2$$

The first is the definition (average squared distance of $X$ from its mean); the second is the shortcut.

Shifting by a constant

$\operatorname{Var}(X + c) = \operatorname{Var}(X)$

Adding a constant does not change how variable $X$ is. From the definition: replacing $X$ by $X + c$ shifts the mean by $c$ too (by linearity), so $X + c$ minus its mean equals $X$ minus its mean — identical.

Scaling by a constant

$\operatorname{Var}(cX) = c^2 \operatorname{Var}(X)$

The constant comes out, but squared. A common mistake is forgetting the square. It matters: if $c$ is negative and you drop the square, you would get a negative variance, which is impossible.

Variance is non-negative

Sanity check

$\operatorname{Var}(X) \ge 0$ always (it is an average of squares). The first thing to check after any variance computation is non-negativity. Moreover, $\operatorname{Var}(X) = 0$ if and only if $X$ is constant with probability $1$, i.e. $P(X = a) = 1$ for some $a$. Otherwise the variance is strictly positive.

Variance is NOT linear

Unlike expectation, variance violates both linearity properties:

Constants come out squared, not as themselves.
$\operatorname{Var}(X + Y) \ne \operatorname{Var}(X) + \operatorname{Var}(Y)$ in general.

$\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)$ does hold when $X$ and $Y$ are independent (proof deferred until after the midterm). Crucially, expectation's linearity holds whether or not the variables are independent; variance's additivity does not.

Extreme example: $X + X$

$$\operatorname{Var}(X + X) = \operatorname{Var}(2X) = 4\operatorname{Var}(X) \ne 2\operatorname{Var}(X)$$

If additivity held blindly we would get $2\operatorname{Var}(X)$, but $X$ is maximally dependent on itself.

Related mistake: rewriting $2X$ as $X + X$ is fine, but then replacing it with $X_1 + X_2$ (independent copies with the same distribution as $X$) is wrong. $X$ is not IID with itself — it is perfectly dependent. With independent copies the variabilities just add (giving $2\operatorname{Var}(X)$); with the same variable they magnify (giving $4\operatorname{Var}(X)$).
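The variance properties in this section can be checked exactly on a small example. The sketch below assumes a fair six-sided die for $X$ (an illustrative choice, not from the lecture) and uses exact fractions; `variance`, `transform`, and `sum_of_independent` are hypothetical helper names.

```python
from fractions import Fraction
from itertools import product

def variance(pmf):
    """Var(X) = E[(X - E X)^2] for a PMF given as {value: probability}."""
    mean = sum(x * p for x, p in pmf.items())
    return sum((x - mean) ** 2 * p for x, p in pmf.items())

def transform(pmf, g):
    """PMF of g(X), merging values that g maps to the same output."""
    out = {}
    for x, p in pmf.items():
        out[g(x)] = out.get(g(x), 0) + p
    return out

def sum_of_independent(pmf_a, pmf_b):
    """PMF of A + B when A and B are independent."""
    out = {}
    for (a, pa), (b, pb) in product(pmf_a.items(), pmf_b.items()):
        out[a + b] = out.get(a + b, 0) + pa * pb
    return out

die = {k: Fraction(1, 6) for k in range(1, 7)}
var_x = variance(die)

print("Var(X)       =", var_x)
print("Var(X + 10)  =", variance(transform(die, lambda x: x + 10)))
print("Var(-3X)     =", variance(transform(die, lambda x: -3 * x)))
print("Var(X + X)   =", variance(transform(die, lambda x: x + x)))
print("Var(X1 + X2) =", variance(sum_of_independent(die, die)))
```

Shifting leaves the variance unchanged, scaling by $-3$ multiplies it by $9$, $X + X$ gives $4\operatorname{Var}(X)$, and the sum of two independent copies gives $2\operatorname{Var}(X)$.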

· · ·

4. Standardization

Solving $X = \mu + \sigma Z$ for $Z$ gives:

$$Z = \frac{X - \mu}{\sigma}$$

Standardization

Take any $X \sim \mathcal{N}(\mu, \sigma^2)$, subtract the mean, divide by the standard deviation, and you always get a standard normal. The construction $\mu + \sigma Z$ goes one direction; standardization goes the other. Two frequent errors: dividing by the variance instead of the standard deviation, or never thinking to standardize at all.

Standardization removes units

If $X$ is a physical measurement — say time in seconds — then $\frac{X - \mu}{\sigma}$ is $\frac{\text{seconds} - \text{seconds}}{\text{seconds}}$, so the units cancel. The standardized quantity is dimensionless. Two measurements in different units (one in seconds, one in years) standardize to the same scale, which is part of what makes standardization so interpretable.
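To illustrate that standardization removes units, the sketch below standardizes the same measurement expressed in seconds and in minutes. The numbers ($\mu = 120$ s, $\sigma = 30$ s, observation $165$ s) are made up for illustration, and `standardize` is a hypothetical helper name.

```python
def standardize(x, mu, sigma):
    """Return (x - mu) / sigma; note the division is by the SD, not the variance."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma

# The same observation, mean, and SD expressed in two different units.
z_seconds = standardize(165.0, mu=120.0, sigma=30.0)
z_minutes = standardize(165.0 / 60, mu=120.0 / 60, sigma=30.0 / 60)

print(z_seconds, z_minutes)  # both are 1.5 standard deviations above the mean
```

Changing units rescales $x$, $\mu$, and $\sigma$ by the same factor, so the standardized value is unchanged.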

· · ·

5. Deriving the PDF of the General Normal

Goal: given the standard normal PDF (call it $\varphi$) and CDF ($\Phi$), find the PDF of $X \sim \mathcal{N}(\mu, \sigma^2)$ without memorizing a formula. This is good CDF/PDF practice.

Step 1 — CDF, then standardize

By definition the CDF is $P(X \le x)$. Standardize inside ($\sigma > 0$, so the inequality does not flip):

$$P(X \le x) = P\!\left(\frac{X - \mu}{\sigma} \le \frac{x - \mu}{\sigma}\right) = \Phi\!\left(\frac{x - \mu}{\sigma}\right)$$

The quantity $\frac{X - \mu}{\sigma}$ on the left side of the inequality is standard normal, so by definition this probability is $\Phi$ evaluated at $\frac{x - \mu}{\sigma}$.

Step 2 — differentiate (chain rule)

The PDF is the derivative of the CDF. With $\Phi$ as the outer function and $\frac{x - \mu}{\sigma}$ as the inner function:

derivative of the inner function $= \dfrac{1}{\sigma}$
derivative of $\Phi$ is $\varphi$ (the standard normal PDF), evaluated at $\frac{x - \mu}{\sigma}$

PDF of $\mathcal{N}(\mu, \sigma^2)$

$$f(x) = \frac{1}{\sigma}\, \varphi\!\left(\frac{x - \mu}{\sigma}\right) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x - \mu)^2 / (2\sigma^2)}$$
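The two steps above can be checked numerically. The sketch below builds the $\mathcal{N}(\mu, \sigma^2)$ CDF and PDF from the standard normal, then compares the PDF with a finite-difference derivative of the CDF and with Python's `statistics.NormalDist`. The parameter values and the step size `h` are arbitrary assumptions.

```python
import math
from statistics import NormalDist

def phi(z):
    """Standard normal PDF."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def normal_cdf(x, mu, sigma):
    """Step 1: P(X <= x) = Phi((x - mu) / sigma)."""
    return NormalDist().cdf((x - mu) / sigma)

def normal_pdf(x, mu, sigma):
    """Step 2: chain rule gives (1 / sigma) * phi((x - mu) / sigma)."""
    return phi((x - mu) / sigma) / sigma

mu, sigma, x = 3.0, 2.0, 4.2
h = 1e-5  # step for the central difference (an arbitrary small value)
numeric_derivative = (normal_cdf(x + h, mu, sigma) - normal_cdf(x - h, mu, sigma)) / (2 * h)
library_pdf = NormalDist(mu, sigma).pdf(x)  # second argument is the SD, not the variance

print(normal_pdf(x, mu, sigma), numeric_derivative, library_pdf)
```

All three numbers should agree to several decimal places.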

Bonus: the distribution of $-X$

$$-X = -\mu + \sigma(-Z) \;\Longrightarrow\; -X \sim \mathcal{N}(-\mu, \sigma^2)$$

Rather than redo the calculation, write $X = \mu + \sigma Z$. Since $-Z$ is standard normal, $-X$ has the form (location) $+$ (scale)$\,\times\,$(standard normal). The mean flips sign; the variance does not (variance cannot be negative). Immediate from the standard-normal viewpoint.

· · ·

6. Sums of Independent Normals

A useful fact (proved much later): if $X_j \sim \mathcal{N}(\mu_j, \sigma_j^2)$ are independent for $j = 1, 2$, then:

$$X_1 + X_2 \sim \mathcal{N}\!\left(\mu_1 + \mu_2,\; \sigma_1^2 + \sigma_2^2\right)$$

The sum of independent normals is normal — a closure property that keeps us inside the family of normals. For now, focus on the mean and variance:

Mean: $\mu_1 + \mu_2$, by linearity.
Variance (independent case): $\sigma_1^2 + \sigma_2^2$, variances add.

The subtraction trap

Common mistake

For $X_1 - X_2$ (independent), the mean is $\mu_1 - \mu_2$ (linearity), but the variance is not $\sigma_1^2 - \sigma_2^2$ — that could be negative. Write subtraction as adding the negative: $X_1 - X_2 = X_1 + (-X_2)$, and $-X_2$ still has variance $\sigma_2^2$. So

$$\operatorname{Var}(X_1 - X_2) = \sigma_1^2 + \sigma_2^2.$$

Variances add even when you subtract.
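A simulation can illustrate the subtraction trap. This sketch assumes $X_1 \sim \mathcal{N}(1, 3^2)$ and $X_2 \sim \mathcal{N}(4, 4^2)$, drawn independently; the parameters, sample size, and seed are arbitrary choices for illustration.

```python
import random

rng = random.Random(1)
n = 200_000

# random.gauss takes (mean, standard deviation), so the variances are 9 and 16.
diffs = [rng.gauss(1.0, 3.0) - rng.gauss(4.0, 4.0) for _ in range(n)]

mean_diff = sum(diffs) / n
var_diff = sum((d - mean_diff) ** 2 for d in diffs) / n

print("mean of X1 - X2:", mean_diff)     # near 1 - 4 = -3
print("variance of X1 - X2:", var_diff)  # near 9 + 16 = 25, not 9 - 16
```

The sample variance should sit near $\sigma_1^2 + \sigma_2^2 = 25$, up to simulation noise.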

· · ·

7. The 68-95-99.7 Rule

Because $\Phi$ cannot be evaluated in closed form (only via tables or a computer), a quick rule of thumb is handy. For $X \sim \mathcal{N}(\mu, \sigma^2)$:

Distance from mean	Probability that $X$ falls within that distance
$1$ standard deviation	about $68\%$
$2$ standard deviations	about $95\%$
$3$ standard deviations	about $99.7\%$

These are just a few values of $\Phi$, written intuitively. In practice people often add and subtract two standard deviations: roughly $95\%$ of independent observations from a normal fall within two SDs of the mean, and $99.7\%$ within three. Converting these into statements about $\Phi$ is good practice.
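One way to convert the rule into statements about $\Phi$: standardizing gives $P(|X - \mu| \le k\sigma) = P(|Z| \le k) = \Phi(k) - \Phi(-k) = 2\Phi(k) - 1$. The sketch below evaluates this with Python's `statistics.NormalDist`; `prob_within` is a hypothetical helper name.

```python
from statistics import NormalDist

Phi = NormalDist().cdf  # standard normal CDF

def prob_within(k):
    """P(|X - mu| <= k * sigma) for any normal X, via 2 * Phi(k) - 1."""
    return 2 * Phi(k) - 1

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
```

The three values are approximately $0.6827$, $0.9545$, and $0.9973$, matching the rule of thumb.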

· · ·

8. LOTUS: Why It Works

Law of the Unconscious Statistician (LOTUS)

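For a discrete $X$, $E(g(X))$ can be computed directly from the PMF of $X$, without first finding the distribution of $g(X)$:

$$E(g(X)) = \sum_x g(x)\, P(X = x)$$

For a continuous $X$ with PDF $f$, the sum becomes an integral, $E(g(X)) = \int_{-\infty}^{\infty} g(x)\, f(x)\, dx$; this is the form used for $E(Z^3)$ in Section 1.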

Intuition via a table

Take $X$ with possible values $0, 1, 2, 3, \ldots$ and PMF $p_0, p_1, p_2, p_3, \ldots$ where $p_j = P(X = j)$. To find $\operatorname{Var}(X)$ we need $E(X^2)$, so look at $X^2$ with values $0, 1, 4, 9, \ldots$

For a discrete RV, $E(X) = \sum_x x\, P(X = x)$. Now $P(X^2 = 9)$ is exactly $P(X = 3) = p_3$ — the probabilities did not change, the values did. So

$$E(X^2) = \sum_x x^2\, P(X = x).$$

When the function is one-to-one (as squaring is here, on non-negative integers), this is immediate. The subtle case is a non-one-to-one function (e.g., squaring with negative values present, producing duplicates). LOTUS guarantees the formula still works even with those duplications.
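The non-one-to-one case can be seen on a tiny example. The sketch below assumes $X$ uniform on $\{-2, -1, 0, 1, 2\}$ (an illustrative choice) and computes $E(X^2)$ two ways: by LOTUS, and by first building the PMF of $Y = X^2$, where the duplicates get merged.

```python
from fractions import Fraction

pmf_x = {x: Fraction(1, 5) for x in (-2, -1, 0, 1, 2)}

def g(x):
    return x * x

# LOTUS: weight g(x) by the PMF of X directly.
lotus = sum(g(x) * p for x, p in pmf_x.items())

# Long way: find the PMF of Y = g(X) first, merging x-values with equal g(x).
pmf_y = {}
for x, p in pmf_x.items():
    pmf_y[g(x)] = pmf_y.get(g(x), 0) + p
direct = sum(y * p for y, p in pmf_y.items())

print(pmf_y)          # the value 4 carries mass 2/5, from x = -2 and x = 2
print(lotus, direct)  # both equal 2
```

Both routes give the same answer; LOTUS simply skips the merging step.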

· · ·

9. Worked Example: Variance of the Poisson

Let $X \sim \operatorname{Pois}(\lambda)$. We already proved $E(X) = \lambda$; now derive $\operatorname{Var}(X)$. By LOTUS:

$$E(X^2) = \sum_{k=0}^{\infty} k^2\, \frac{e^{-\lambda}\lambda^k}{k!}.$$

This sum is unfamiliar. Canceling $k^2$ against $k!$ leaves an awkward extra $k$, so use a generating-function trick instead — start from something known and differentiate.

The differentiate-and-replenish method

$$\sum_{k=0}^{\infty} \frac{\lambda^k}{k!} = e^{\lambda} \quad \text{(valid for all complex } \lambda\text{)}$$

Differentiate both sides (start at $k = 1$): $\displaystyle\sum_{k=1}^{\infty} \frac{k\,\lambda^{k-1}}{k!} = e^{\lambda}$. Now there is a factor $k$, but the power is $\lambda^{k-1}$.
Replenish the lambdas — multiply by $\lambda$ to restore $\lambda^k$: $\displaystyle\sum_{k=1}^{\infty} \frac{k\,\lambda^{k}}{k!} = \lambda e^{\lambda}$.
Differentiate again. By the product rule the right side is $\lambda e^{\lambda} + e^{\lambda} = e^{\lambda}(\lambda + 1)$: $\displaystyle\sum_{k=1}^{\infty} \frac{k^2\,\lambda^{k-1}}{k!} = e^{\lambda}(\lambda + 1)$.
Replenish once more (multiply by $\lambda$): $\displaystyle\sum_{k=1}^{\infty} \frac{k^2\,\lambda^{k}}{k!} = \lambda e^{\lambda}(\lambda + 1) = e^{\lambda}(\lambda^2 + \lambda)$.

Multiply through by $e^{-\lambda}$ to recover the LOTUS sum: $E(X^2) = \lambda^2 + \lambda$.

Result

Conclusion

$$\operatorname{Var}(X) = E(X^2) - E(X)^2 = (\lambda^2 + \lambda) - \lambda^2 = \lambda.$$

A $\operatorname{Pois}(\lambda)$ has both mean and variance equal to $\lambda$ — worth remembering.
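The result can be checked numerically by summing the LOTUS series directly. This is a sketch under assumptions: the series is truncated at a fixed number of terms (fine for moderate $\lambda$), the test value $\lambda = 3.5$ is arbitrary, and `poisson_moments` is a hypothetical helper name.

```python
import math

def poisson_moments(lam, terms=200):
    """Return (E X, E X^2) for Pois(lam) by summing k * pmf and k^2 * pmf.

    The PMF is updated iteratively: P(X = k) = P(X = k - 1) * lam / k.
    Truncating at `terms` is an assumption that the tail is negligible.
    """
    pmf = math.exp(-lam)  # P(X = 0); the k = 0 term contributes nothing
    first = 0.0
    second = 0.0
    for k in range(1, terms):
        pmf *= lam / k
        first += k * pmf
        second += k * k * pmf
    return first, second

lam = 3.5
first, second = poisson_moments(lam)
print(first, second, second - first ** 2)
```

The second moment should match $\lambda^2 + \lambda$, and the variance should match $\lambda$.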

This is a slightly strange property: one might expect mean $=$ standard deviation to be more natural (same scale). But the Poisson counts things and is essentially dimensionless, so it lacks that dimensional interpretation.

· · ·

10. Worked Example: Variance of the Binomial

Let $X \sim \operatorname{Bin}(n, p)$. There are three approaches:

Method	Verdict
LOTUS directly on $E(X^2)$ with the binomial PMF	Works, but tedious algebra
$\operatorname{Var}$ of a sum of independent Bernoullis $=$ sum of variances	Easiest, but relies on the not-yet-proven additivity fact
Indicators $+$ linearity (compromise)	Done here — good review of indicators, symmetry, linearity

Indicator approach

$$X = I_1 + \cdots + I_n, \qquad I_j \overset{\text{iid}}{\sim} \operatorname{Bern}(p)$$

Square the sum, just as $(a + b)^2 = a^2 + b^2 + 2ab$, so each pair gives a cross term with a factor of $2$:

$$X^2 = \sum_{i} I_i^2 + 2\sum_{i < j} I_i I_j.$$

Take expectations and use linearity:

$$E(X^2) = n\, E(I_1^2) + 2\binom{n}{2} E(I_1 I_2).$$

By symmetry, all $n$ diagonal terms are equal, giving $n\, E(I_1^2)$.
There are $\binom{n}{2}$ pairs, each with a $2$, so $2\binom{n}{2}$ cross terms; by symmetry take $E(I_1 I_2)$.

Simplify each piece:

$I_1^2 = I_1$ (since $0^2 = 0$, $1^2 = 1$), so $E(I_1^2) = E(I_1) = p$. First piece $= np$.
$2\binom{n}{2} = n(n-1)$. The product $I_1 I_2$ is itself an indicator — success on both trials $1$ and $2$ (a product of indicators is $1$ only when both are $1$). By independence, $E(I_1 I_2) = P(\text{both}) = p^2$.

$$E(X^2) = np + n(n-1)p^2 = np + n^2 p^2 - n p^2.$$

Result

Conclusion

With $E(X) = np$, so $E(X)^2 = n^2 p^2$:

$$\operatorname{Var}(X) = E(X^2) - E(X)^2 = np + n^2 p^2 - n p^2 - n^2 p^2 = np(1 - p) = npq,$$

where $q = 1 - p$.
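The indicator computation can be cross-checked against the tedious direct route, applying LOTUS to the binomial PMF. The sketch below uses exact fractions with $n = 10$ and $p = 3/10$ (arbitrary illustrative values); `binomial_second_moment` is a hypothetical helper name.

```python
from fractions import Fraction
from math import comb

def binomial_second_moment(n, p):
    """E(X^2) for X ~ Bin(n, p), by LOTUS on the binomial PMF."""
    q = 1 - p
    return sum(k * k * comb(n, k) * p ** k * q ** (n - k) for k in range(n + 1))

n, p = 10, Fraction(3, 10)
second = binomial_second_moment(n, p)
formula = n * p + n * (n - 1) * p ** 2  # the indicator result: np + n(n-1)p^2
variance = second - (n * p) ** 2

print(second, formula)          # the two expressions for E(X^2) agree
print(variance, n * p * (1 - p))  # both equal npq
```

With exact arithmetic the two expressions for $E(X^2)$ agree term for term, and the variance equals $npq$.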

Note: expanding the square and taking expectations used only linearity; independence entered only at the step $E(I_1 I_2) = p^2$. So a similar approach works for dependent indicators (e.g., the hypergeometric): you would just need $P(\text{first two draws both tagged})$ in place of $p^2$, which is messier but doable.

Other distributions

Geometric: similar calculation, but starting from the geometric series rather than the Taylor series for $e^x$; not done in class.
Hypergeometric: write as a sum of indicators (drawing one at a time, success $=$ a tagged item). For the mean, linearity works even though the indicators are dependent. The variance is harder because of dependence — deferred until after the midterm.

· · ·

11. Proof of LOTUS (Discrete Case)

Prove that $E(g(X)) = \sum_x g(x)\, P(X = x)$, so that we never need the distribution of $g(X)$. (The continuous case needs fancier integrals but rests on the same idea.) This mirrors the proof of linearity: a grouped-vs-ungrouped sum.

Two ways to write the same sum

For a discrete sample space $S$ with pebbles $s$ of mass $P(\{s\})$:

Grouped (over values $x$): $\displaystyle\sum_x g(x)\, P(X = x)$.
Ungrouped (over pebbles $s$): $\displaystyle\sum_{s} g(X(s))\, P(\{s\})$.

Random variables are functions, so $g(X(s))$ means: apply $X$ to the pebble $s$, then apply $g$. The grouped form first merges all pebbles sharing the same value $X(s) = x$ into a "super-pebble"; the ungrouped form weighs each pebble individually. Both give the same weighted average.

Algebraic justification via a double sum

Rewrite the sum over all pebbles as: first sum over each value $x$, then sum over all pebbles $s$ with $X(s) = x$ (a finite sum can be reordered freely; for a countably infinite $S$ the same regrouping is valid when the sum converges absolutely):

$$\sum_{s} g(X(s)) P(\{s\}) = \sum_x \;\sum_{s:\, X(s) = x} g(X(s))\, P(\{s\}).$$

Within the inner sum, $X(s) = x$, so $g(X(s)) = g(x)$, which does not depend on $s$ and pulls out:

$$= \sum_x g(x) \;\sum_{s:\, X(s) = x} P(\{s\}).$$

The inner sum adds the masses of all pebbles labeled $x$ — exactly the mass of the super-pebble, $P(X = x)$. (The event $\{X = x\}$ is the set of pebbles $s$ with $X(s) = x$.) Therefore:

$$E(g(X)) = \sum_x g(x)\, P(X = x). \qquad \blacksquare$$
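The grouped and ungrouped sums in the proof can be computed side by side on a small sample space. The sketch below assumes two fair dice (36 equally likely pebbles), with $X$ the total and $g(x) = (x - 7)^2$; these choices are illustrative and not from the lecture.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered pairs of die faces, each pebble with mass 1/36.
pebbles = {s: Fraction(1, 36) for s in product(range(1, 7), repeat=2)}

def X(s):
    """The random variable: a function from pebbles to numbers."""
    return s[0] + s[1]

def g(x):
    return (x - 7) ** 2

# Ungrouped: weigh each pebble individually.
ungrouped = sum(g(X(s)) * mass for s, mass in pebbles.items())

# Grouped: merge pebbles with the same value of X into super-pebbles (the PMF).
pmf = {}
for s, mass in pebbles.items():
    pmf[X(s)] = pmf.get(X(s), 0) + mass
grouped = sum(g(x) * p for x, p in pmf.items())

print(ungrouped, grouped)  # both equal 35/6
```

The two sums agree exactly, which is the content of the double-sum argument.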