Recall the standard normal $Z \sim \mathcal{N}(0, 1)$. The letter $Z$ is the traditional name for a standard normal, but it is only a convention: nothing reserves $Z$ for standard normals. From last time:
$E(Z^k)$ is called the $k$-th moment: $E(Z)$ is the first moment, $E(Z^2)$ the second, $E(Z^3)$ the third, and so on. (The origin of the word "moment" comes later in the course.)
By LOTUS, $E(Z^3)$ is the integral
$$E(Z^3) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} z^3 e^{-z^2/2}\, dz = 0.$$
The base integral (no $z$ factor) is just the PDF integrating to $1$; LOTUS lets us insert $z^3$ directly without first finding the distribution of $Z^3$. The integral vanishes because the integrand is an odd function. The same argument kills every odd power: $E(Z) = E(Z^3) = E(Z^5) = \cdots = 0$.
| Moment | Value | How |
|---|---|---|
| Odd powers $E(Z), E(Z^3), E(Z^5), \ldots$ | $0$ | Symmetry — odd integrand, no integral needed |
| $E(Z^2)$ | $1$ | Integration by parts (done last time) |
| $E(Z^4)$ | not trivial | Same integral with $z^4$; computable but deferred until after the midterm |
The point: you should immediately be able to write down the LOTUS integral for $E(Z^4)$, even if you cannot yet evaluate it. The odd moments require no computation at all.
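The moments in the table can be sanity-checked numerically. The sketch below approximates the LOTUS integral $\int z^k \varphi(z)\, dz$ with a midpoint rule; the helper names (`phi`, `normal_moment`) and the truncation to $[-12, 12]$ are assumptions of this illustration, not part of the lecture. It only approximates $E(Z^4)$ and does not replace the deferred derivation.

```python
import math

def phi(z):
    """Standard normal PDF."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def normal_moment(k, half_width=12.0, steps=200_000):
    """Approximate E(Z^k) as the integral of z^k * phi(z) by the midpoint rule.

    Truncating the real line to [-half_width, half_width] is an assumption
    of this sketch; the normal tails beyond that range are negligible.
    """
    h = 2 * half_width / steps
    total = 0.0
    for i in range(steps):
        z = -half_width + (i + 0.5) * h   # midpoint of the i-th subinterval
        total += z ** k * phi(z)
    return total * h

# LOTUS integrals for k = 1, ..., 5
moments = {k: normal_moment(k) for k in range(1, 6)}
for k, value in moments.items():
    print(f"E(Z^{k}) is approximately {value:.6f}")
```

The grid is symmetric about $0$, so the odd moments cancel term by term, which is the symmetry argument in numerical form.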
If $Z \sim \mathcal{N}(0, 1)$, then $-Z \sim \mathcal{N}(0, 1)$. Flipping the sign turns positives into negatives and vice versa — it changes the random variable but not its distribution. You can verify this by finding the CDF of $-Z$ and differentiating. Always look for symmetries.
Define $X = \mu + \sigma Z$, where $\mu$ is any real number and $\sigma$ is any positive number, with $Z \sim \mathcal{N}(0, 1)$. Then $X \sim \mathcal{N}(\mu, \sigma^2)$.
This is more insightful than starting from the PDF (as most books do). There is one fundamental distribution — the standard normal — and every other normal is obtained from it by scaling and shifting. Always reduce a normal back to standard normal rather than wrestling with the formula.
| Parameter | Role | Effect on the density |
|---|---|---|
| $\mu$ (mean) | location | Adding a constant shifts the curve left/right; the shape is unchanged |
| $\sigma$ (std. dev.) | scale | Multiplying stretches/squeezes the curve; a normalizing factor keeps the area at $1$ |
$\mu$ can be negative; $\sigma$ must be positive (it is $\sqrt{\operatorname{Var}}$).
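As a rough illustration of the location-scale construction, the sketch below simulates $X = \mu + \sigma Z$ and checks that the sample mean and standard deviation land near $\mu$ and $\sigma$. The parameter values, seed, and sample size are arbitrary choices made for this example.

```python
import random

rng = random.Random(0)      # fixed seed so the run is repeatable (arbitrary choice)
mu, sigma = -2.0, 3.0       # hypothetical location and scale
n = 200_000

# Draw standard normals, then scale by sigma and shift by mu.
z = [rng.gauss(0.0, 1.0) for _ in range(n)]
x = [mu + sigma * zi for zi in z]

sample_mean = sum(x) / n
sample_sd = (sum((xi - sample_mean) ** 2 for xi in x) / n) ** 0.5

print(f"sample mean {sample_mean:.3f} (target {mu})")
print(f"sample SD   {sample_sd:.3f} (target {sigma})")
```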
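Two equivalent formulas for the variance:
$$\operatorname{Var}(X) = E\big((X - E(X))^2\big), \qquad \operatorname{Var}(X) = E(X^2) - E(X)^2.$$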
The first is the definition (average squared distance of $X$ from its mean); the second is the shortcut.
$$\operatorname{Var}(X + c) = \operatorname{Var}(X).$$
Adding a constant does not change how variable $X$ is. From the definition: replacing $X$ by $X + c$ shifts the mean by $c$ too (by linearity), so $X + c$ minus its mean equals $X$ minus its mean, and the two variances are identical.
$$\operatorname{Var}(cX) = c^2 \operatorname{Var}(X).$$
The constant comes out, but squared. A common mistake is forgetting the square. It matters: if $c$ is negative and you drop the square, you would get a negative variance, which is impossible.
$\operatorname{Var}(X) \ge 0$ always (it is an average of squares). The first thing to check after any variance computation is non-negativity. Moreover, $\operatorname{Var}(X) = 0$ if and only if $X$ is constant with probability $1$, i.e. $P(X = a) = 1$ for some $a$. Otherwise the variance is strictly positive.
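Unlike expectation, variance is not linear. Both linearity properties fail in general:
$$\operatorname{Var}(cX) \ne c \operatorname{Var}(X), \qquad \operatorname{Var}(X + Y) \ne \operatorname{Var}(X) + \operatorname{Var}(Y).$$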
$\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)$ does hold when $X$ and $Y$ are independent (proof deferred until after the midterm). Crucially, expectation's linearity holds whether or not the variables are independent; variance's additivity does not.
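A quick counterexample takes $Y = X$:
$$\operatorname{Var}(X + X) = \operatorname{Var}(2X) = 4\operatorname{Var}(X).$$
If additivity held blindly we would get $2\operatorname{Var}(X)$, but $X$ is maximally dependent on itself.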
Related mistake: rewriting $2X$ as $X + X$ is fine, but then replacing it with $X_1 + X_2$ (independent copies with the same distribution as $X$) is wrong. $X$ is not IID with itself — it is perfectly dependent. With independent copies the variabilities just add (giving $2\operatorname{Var}(X)$); with the same variable they magnify (giving $4\operatorname{Var}(X)$).
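The difference between $X + X$ and $X_1 + X_2$ shows up clearly in a simulation. This is a minimal sketch, assuming standard normal draws (so $\operatorname{Var}(X) = 1$); the `variance` helper, seed, and sample size are choices made for the illustration.

```python
import random

def variance(values):
    """Population variance: the average squared distance from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

rng = random.Random(1)   # arbitrary fixed seed
n = 200_000

x = [rng.gauss(0.0, 1.0) for _ in range(n)]        # X
x_copy = [rng.gauss(0.0, 1.0) for _ in range(n)]   # independent copy of X

var_x = variance(x)
var_2x = variance([2 * v for v in x])                       # same variable twice
var_sum = variance([a + b for a, b in zip(x, x_copy)])      # independent copies

print(f"Var(X)       is about {var_x:.3f}")
print(f"Var(X + X)   is about {var_2x:.3f}")
print(f"Var(X1 + X2) is about {var_sum:.3f}")
```

With the same variable the variability magnifies by a factor of $4$; with independent copies it roughly doubles.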
Solving $X = \mu + \sigma Z$ for $Z$ gives:
$$Z = \frac{X - \mu}{\sigma}.$$
Take any $X \sim \mathcal{N}(\mu, \sigma^2)$, subtract the mean, divide by the standard deviation, and you always get a standard normal. The construction $\mu + \sigma Z$ goes one direction; standardization goes the other. Two frequent errors: dividing by the variance instead of the standard deviation, or never thinking to standardize at all.
If $X$ is a physical measurement — say time in seconds — then $\frac{X - \mu}{\sigma}$ is $\frac{\text{seconds} - \text{seconds}}{\text{seconds}}$, so the units cancel. The standardized quantity is dimensionless. Two measurements in different units (one in seconds, one in years) standardize to the same scale, which is part of what makes standardization so interpretable.
Goal: given the standard normal PDF (call it $\varphi$) and CDF ($\Phi$), find the PDF of $X \sim \mathcal{N}(\mu, \sigma^2)$ without memorizing a formula. This is good CDF/PDF practice.
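By definition the CDF is $P(X \le x)$. Standardize inside ($\sigma > 0$, so the inequality does not flip):
$$P(X \le x) = P\!\left(\frac{X - \mu}{\sigma} \le \frac{x - \mu}{\sigma}\right).$$
The left side of the inequality is now standard normal, so by definition of $\Phi$ this probability is
$$P(X \le x) = \Phi\!\left(\frac{x - \mu}{\sigma}\right).$$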
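The PDF is the derivative of the CDF. By the chain rule, with $\Phi$ as the outer function (whose derivative is $\varphi$) and $\frac{x - \mu}{\sigma}$ as the inner function (whose derivative is $\frac{1}{\sigma}$):
$$f(x) = \frac{1}{\sigma}\, \varphi\!\left(\frac{x - \mu}{\sigma}\right) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x - \mu)^2 / (2\sigma^2)}.$$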
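The chain-rule result can be checked numerically. The sketch below, with hypothetical parameters $\mu = 1.5$ and $\sigma = 2$, compares $\frac{1}{\sigma}\varphi\!\left(\frac{x - \mu}{\sigma}\right)$ against two things: the PDF reported by Python's `statistics.NormalDist`, and a finite-difference slope of the standardized CDF. The function names are made up for this example.

```python
import math
from statistics import NormalDist

def phi(z):
    """Standard normal PDF."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def normal_pdf(x, mu, sigma):
    """PDF of N(mu, sigma^2), built only from the standard normal PDF."""
    return phi((x - mu) / sigma) / sigma

def cdf_slope(x, mu, sigma, h=1e-5):
    """Central-difference derivative of the CDF Phi((x - mu) / sigma)."""
    Phi = NormalDist().cdf   # standard normal CDF
    upper = Phi((x + h - mu) / sigma)
    lower = Phi((x - h - mu) / sigma)
    return (upper - lower) / (2 * h)

mu, sigma = 1.5, 2.0              # hypothetical parameters
points = (-3.0, 0.0, 1.5, 4.0)    # arbitrary test points

gap_to_library = max(
    abs(normal_pdf(x, mu, sigma) - NormalDist(mu, sigma).pdf(x)) for x in points
)
gap_to_slope = max(
    abs(normal_pdf(x, mu, sigma) - cdf_slope(x, mu, sigma)) for x in points
)
print(gap_to_library, gap_to_slope)
```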
To find the distribution of $-X$, rather than redo the calculation, write $X = \mu + \sigma Z$, so that $-X = -\mu + \sigma(-Z)$. Since $-Z$ is standard normal, $-X$ has the form (location) $+$ (scale)$\,\times\,$(standard normal), and therefore $-X \sim \mathcal{N}(-\mu, \sigma^2)$. The mean flips sign; the variance does not (variance cannot be negative). This is immediate from the standard-normal viewpoint.
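A useful fact (proved much later): if $X_j \sim \mathcal{N}(\mu_j, \sigma_j^2)$ are independent for $j = 1, 2$, then:
$$X_1 + X_2 \sim \mathcal{N}(\mu_1 + \mu_2,\; \sigma_1^2 + \sigma_2^2).$$
The sum of independent normals is normal, a closure property that keeps us inside the family of normals. For now, focus on the mean and variance:
$$E(X_1 + X_2) = \mu_1 + \mu_2 \ \text{(linearity)}, \qquad \operatorname{Var}(X_1 + X_2) = \sigma_1^2 + \sigma_2^2 \ \text{(independence)}.$$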
For $X_1 - X_2$ (independent), the mean is $\mu_1 - \mu_2$ (linearity), but the variance is not $\sigma_1^2 - \sigma_2^2$ — that could be negative. Write subtraction as adding the negative: $X_1 - X_2 = X_1 + (-X_2)$, and $-X_2$ still has variance $\sigma_2^2$. So
$$\operatorname{Var}(X_1 - X_2) = \sigma_1^2 + \sigma_2^2.$$
Variances add even when you subtract.
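Python's standard library happens to encode this rule: `statistics.NormalDist` supports `+` and `-` between two instances, treating them as independent normals. The sketch below uses hypothetical parameters chosen so the arithmetic is easy to follow ($3^2 + 4^2 = 5^2$).

```python
from statistics import NormalDist

# Hypothetical independent normals: X1 ~ N(3, 3^2) and X2 ~ N(1, 4^2).
x1 = NormalDist(mu=3.0, sigma=3.0)
x2 = NormalDist(mu=1.0, sigma=4.0)

# NormalDist arithmetic assumes the two variables are independent.
total = x1 + x2
difference = x1 - x2

print(total.mean, total.variance)            # means add, variances add
print(difference.mean, difference.variance)  # means subtract, variances still add
```

Both the sum and the difference have variance $9 + 16 = 25$; only the means differ.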
Because $\Phi$ cannot be evaluated in closed form (only via tables or a computer), a quick rule of thumb is handy. For $X \sim \mathcal{N}(\mu, \sigma^2)$:
| Distance from the mean | Probability that $X$ falls within it |
|---|---|
| $1$ standard deviation | about $68\%$ |
| $2$ standard deviations | about $95\%$ |
| $3$ standard deviations | about $99.7\%$ |
These are just a few values of $\Phi$, written intuitively. In practice people often add and subtract two standard deviations: roughly $95\%$ of independent observations from a normal fall within two SDs of the mean, and $99.7\%$ within three. Converting these into statements about $\Phi$ is good practice.
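Written in terms of $\Phi$, the rule says $P(|Z| \le k) = \Phi(k) - \Phi(-k)$ for $k = 1, 2, 3$. A minimal sketch, assuming we compute $\Phi$ through the error function via the identity $\Phi(x) = \tfrac{1}{2}\left(1 + \operatorname{erf}(x / \sqrt{2})\right)$; the function names are chosen for this example.

```python
import math

def std_normal_cdf(x):
    """Phi(x), computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def prob_within(k):
    """P(|Z| <= k): the chance of landing within k standard deviations."""
    return std_normal_cdf(k) - std_normal_cdf(-k)

rule = {k: prob_within(k) for k in (1, 2, 3)}
for k, p in rule.items():
    print(f"within {k} SD: {100 * p:.2f}%")
```

Standardizing first means the same three numbers apply to every $\mathcal{N}(\mu, \sigma^2)$, not just the standard normal.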
LOTUS (the law of the unconscious statistician): $E(g(X))$ can be computed directly from the PMF of $X$, without first finding the distribution of $g(X)$:
$$E(g(X)) = \sum_x g(x)\, P(X = x)$$
Take $X$ with possible values $0, 1, 2, 3, \ldots$ and PMF $p_0, p_1, p_2, p_3, \ldots$ where $p_j = P(X = j)$. To find $\operatorname{Var}(X)$ we need $E(X^2)$, so look at $X^2$ with values $0, 1, 4, 9, \ldots$
For a discrete RV, $E(X) = \sum_x x\, P(X = x)$. Now $P(X^2 = 9)$ is exactly $P(X = 3) = p_3$: the probabilities did not change, only the values did. So
$$E(X^2) = \sum_{j} j^2 p_j = 0^2 p_0 + 1^2 p_1 + 2^2 p_2 + 3^2 p_3 + \cdots$$
When the function is one-to-one (as squaring is here, on non-negative integers), this is immediate. The subtle case is a non-one-to-one function (e.g., squaring with negative values present, producing duplicates). LOTUS guarantees the formula still works even with those duplications.
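To see the non-one-to-one case concretely, the sketch below uses a made-up PMF on $\{-2, -1, 0, 1, 2\}$ and computes $E(X^2)$ two ways: by LOTUS, and by first building the PMF of $Y = X^2$ (where the values $\pm 1$ and $\pm 2$ collide). The PMF is hypothetical, chosen only for the illustration; exact fractions avoid rounding.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical PMF of X; the probabilities sum to 1.
pmf_x = {
    -2: Fraction(1, 10),
    -1: Fraction(2, 10),
    0: Fraction(3, 10),
    1: Fraction(3, 10),
    2: Fraction(1, 10),
}

def g(x):
    return x * x   # not one-to-one here: g(-1) == g(1), g(-2) == g(2)

# LOTUS: weight g(x) by the PMF of X directly.
lotus = sum(g(x) * p for x, p in pmf_x.items())

# Long way: find the PMF of Y = g(X) by merging duplicates, then take E(Y).
pmf_y = defaultdict(Fraction)
for x, p in pmf_x.items():
    pmf_y[g(x)] += p
direct = sum(y * p for y, p in pmf_y.items())

print(lotus, direct)
```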
Let $X \sim \operatorname{Pois}(\lambda)$. We already proved $E(X) = \lambda$; now derive $\operatorname{Var}(X)$. By LOTUS:
$$E(X^2) = \sum_{k=0}^{\infty} k^2\, \frac{e^{-\lambda} \lambda^k}{k!}.$$
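This sum is unfamiliar. Canceling $k^2$ against $k!$ leaves an awkward extra $k$, so use a generating-function trick instead: start from something known and differentiate. Begin with the Taylor series for $e^\lambda$, differentiate with respect to $\lambda$, multiply by $\lambda$ to restore the power, and repeat:
$$\begin{aligned}
\sum_{k=0}^{\infty} \frac{\lambda^k}{k!} &= e^{\lambda}, \\
\sum_{k=1}^{\infty} \frac{k \lambda^{k-1}}{k!} &= e^{\lambda} \quad \text{(differentiate)}, \\
\sum_{k=1}^{\infty} \frac{k \lambda^{k}}{k!} &= \lambda e^{\lambda} \quad \text{(multiply by } \lambda\text{)}, \\
\sum_{k=1}^{\infty} \frac{k^2 \lambda^{k-1}}{k!} &= \lambda e^{\lambda} + e^{\lambda} \quad \text{(differentiate again)}, \\
\sum_{k=1}^{\infty} \frac{k^2 \lambda^{k}}{k!} &= e^{\lambda}(\lambda^2 + \lambda) \quad \text{(multiply by } \lambda\text{)}.
\end{aligned}$$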
Multiply through by $e^{-\lambda}$ to recover the LOTUS sum: $E(X^2) = \lambda^2 + \lambda$.
$$\operatorname{Var}(X) = E(X^2) - E(X)^2 = (\lambda^2 + \lambda) - \lambda^2 = \lambda.$$
A $\operatorname{Pois}(\lambda)$ has both mean and variance equal to $\lambda$ — worth remembering.
This is a slightly strange property: one might expect mean $=$ standard deviation to be more natural (same scale). But the Poisson counts things and is essentially dimensionless, so it lacks that dimensional interpretation.
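The Poisson mean and variance can be double-checked by summing the LOTUS series numerically. A minimal sketch, assuming an arbitrary rate $\lambda = 3.5$ and truncating the infinite sums at $100$ terms (far more than needed for this $\lambda$); both choices are illustrative.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Pois(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 3.5      # hypothetical rate
terms = 100    # truncation point for the infinite sums (assumption)

mean = sum(k * poisson_pmf(k, lam) for k in range(terms))
second_moment = sum(k * k * poisson_pmf(k, lam) for k in range(terms))   # LOTUS
variance = second_moment - mean ** 2

print(f"E(X)   is about {mean:.6f}")
print(f"E(X^2) is about {second_moment:.6f} (lambda^2 + lambda = {lam**2 + lam})")
print(f"Var(X) is about {variance:.6f}")
```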
Let $X \sim \operatorname{Bin}(n, p)$. There are three approaches:
| Method | Verdict |
|---|---|
| LOTUS directly on $E(X^2)$ with the binomial PMF | Works, but tedious algebra |
| $\operatorname{Var}$ of a sum of independent Bernoullis $=$ sum of variances | Easiest, but relies on the not-yet-proven additivity fact |
| Indicators $+$ linearity (compromise) | Done here — good review of indicators, symmetry, linearity |
Write $X = I_1 + I_2 + \cdots + I_n$, where $I_j$ is the indicator of success on trial $j$. Square the sum, just as $(a + b)^2 = a^2 + b^2 + 2ab$, so each unordered pair gives one cross term with a factor of $2$:
$$X^2 = \sum_{i} I_i^2 + 2\sum_{i < j} I_i I_j.$$
Take expectations and use linearity:
$$E(X^2) = n\, E(I_1^2) + 2\binom{n}{2} E(I_1 I_2).$$
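Simplify each piece. An indicator is $0$ or $1$, so $I_1^2 = I_1$ and $E(I_1^2) = p$. The product $I_1 I_2$ is the indicator that trials $1$ and $2$ both succeed, so $E(I_1 I_2) = p^2$ because the trials are independent. Finally, $2\binom{n}{2} = n(n-1)$: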
$$E(X^2) = np + n(n-1)p^2 = np + n^2 p^2 - n p^2.$$
With $E(X) = np$, so $E(X)^2 = n^2 p^2$:
$$\operatorname{Var}(X) = E(X^2) - E(X)^2 = np + n^2 p^2 - n p^2 - n^2 p^2 = np(1 - p) = npq,$$
where $q = 1 - p$.
Note: this derivation used only linearity, not independence, in expanding the square. So a similar approach works for dependent indicators (e.g., the hypergeometric) — you would just need $P(\text{first two draws both tagged})$, which is messier but doable.
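The first method in the table (LOTUS directly on the binomial PMF) is tedious by hand but easy to run as a check on the result. The sketch below uses hypothetical values $n = 10$, $p = 3/10$ and exact fractions, so the comparison with $np + n(n-1)p^2$ and $npq$ is exact rather than approximate.

```python
from fractions import Fraction
from math import comb

n, p = 10, Fraction(3, 10)   # hypothetical parameters
q = 1 - p

# Binomial PMF: P(X = k) = C(n, k) p^k q^(n - k)
pmf = {k: comb(n, k) * p ** k * q ** (n - k) for k in range(n + 1)}

mean = sum(k * pk for k, pk in pmf.items())
second_moment = sum(k * k * pk for k, pk in pmf.items())   # LOTUS
variance = second_moment - mean ** 2

print(mean, second_moment, variance)
```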
Goal: prove that $E(g(X)) = \sum_x g(x)\, P(X = x)$, so that we never need the distribution of $g(X)$. (The continuous case needs fancier integrals but uses the same idea.) The proof mirrors the proof of linearity: it compares a grouped sum with an ungrouped one.
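For a discrete sample space $S$ with pebbles $s$ of mass $P(\{s\})$, the claim is that the grouped and ungrouped sums agree:
$$\underbrace{\sum_x g(x)\, P(X = x)}_{\text{grouped}} = \underbrace{\sum_{s} g(X(s))\, P(\{s\})}_{\text{ungrouped}}.$$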
Random variables are functions, so $g(X(s))$ means: apply $X$ to the pebble $s$, then apply $g$. The grouped form first merges all pebbles sharing the same value of $x$ into a "super-pebble"; the ungrouped form weighs each pebble individually. Both give the same weighted average.
Rewrite the sum over all pebbles as: first sum over each value $x$, then sum over all pebbles $s$ with $X(s) = x$ (a finite sum can be reordered freely):
$$\sum_{s} g(X(s)) P(\{s\}) = \sum_x \;\sum_{s:\, X(s) = x} g(X(s))\, P(\{s\}).$$
Within the inner sum, $X(s) = x$, so $g(X(s)) = g(x)$, which does not depend on $s$ and pulls out:
$$= \sum_x g(x) \;\sum_{s:\, X(s) = x} P(\{s\}).$$
The inner sum adds the masses of all pebbles labeled $x$, which is exactly the mass of the super-pebble, $P(X = x)$. (The event $\{X = x\}$ is the set of pebbles $s$ with $X(s) = x$.) Therefore:
$$E(g(X)) = \sum_{s} g(X(s))\, P(\{s\}) = \sum_x g(x)\, P(X = x).$$
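The grouped-versus-ungrouped argument can be replayed on a small pebble world. This sketch assumes a hypothetical sample space of two fair dice (36 pebbles of mass $1/36$ each), with $X(s)$ the first roll minus the second and $g(x) = x^2$; none of these choices come from the lecture.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Sample space: 36 equally likely pebbles (d1, d2).
pebbles = list(product(range(1, 7), repeat=2))
mass = Fraction(1, 36)

def X(s):
    """Random variable: a function from pebbles to numbers."""
    d1, d2 = s
    return d1 - d2

def g(x):
    return x * x

# Ungrouped: weigh every pebble individually.
ungrouped = sum(g(X(s)) * mass for s in pebbles)

# Grouped: merge pebbles with the same X value into super-pebbles (the PMF of X).
pmf_x = defaultdict(Fraction)
for s in pebbles:
    pmf_x[X(s)] += mass
grouped = sum(g(x) * p for x, p in pmf_x.items())

print(ungrouped, grouped)
```

Both sums run over the same masses; the grouped version just collects them by the value of $X$ before multiplying by $g(x)$.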