Lecture 21: Covariance and Correlation

Harvard Statistics 110 (Joe Blitzstein)

1. Why Covariance

Expectation is linear: $E(X + Y) = E(X) + E(Y)$ always, even when $X$ and $Y$ are dependent. Variance is not linear. So if we want $\operatorname{Var}(X + Y)$, we cannot just add the variances — we have to think harder rather than falsely applying linearity. Covariance is the tool that finally lets us handle the variance of a sum.

Two motivations

Covariance has two motivations that are two sides of one coin: it is exactly what we need to compute the variance of a sum, and it is what we use to study two random variables together rather than one at a time ("like variance, except for two of them"). The second motivation explains the name: it is the co-variance of $X$ and $Y$.

· · ·

2. Definition and First Properties

Definition — Covariance

For any two random variables $X$ and $Y$ on the same probability space:

$$\operatorname{Cov}(X, Y) = E\big[(X - E(X))(Y - E(Y))\big]$$

Stare at the form for a moment. It is a product, which brings the $X$ information and the $Y$ information together into one quantity, because we are trying to see how they vary together. Recall the sign rules: positive times positive is positive, negative times negative is positive, positive times negative is negative.

Now imagine drawing a random sample of many IID pairs $(X_1, Y_1), (X_2, Y_2), \ldots$ — the pairs are IID, but within a pair $X_i$ and $Y_i$ have some joint distribution and need not be independent.

Sign intuition

If $X$ above its mean tends to go with $Y$ above its mean (positive $\times$ positive), and $X$ below with $Y$ below (negative $\times$ negative), the products are mostly positive: $X$ and $Y$ are positively correlated.
If $X$ above its mean tends to go with $Y$ below its mean, the products are mostly negative: $X$ and $Y$ are negatively correlated.

Property 1: covariance with itself is variance

$$\operatorname{Cov}(X, X) = \operatorname{Var}(X)$$

Proof: set $Y = X$ in the definition; $E[(X - E(X))^2]$ is exactly the definition of $\operatorname{Var}(X)$. So we have proved a theorem just by relabeling.

Property 2: symmetry

$$\operatorname{Cov}(X, Y) = \operatorname{Cov}(Y, X)$$

Immediate — swapping $X$ and $Y$ in the definition gives the same product.

The shortcut formula

Just as we rewrote $\operatorname{Var}(X) = E(X^2) - (E(X))^2$, covariance has an analogous, often more computable form:

$$\operatorname{Cov}(X, Y) = E(XY) - E(X)\,E(Y)$$

In general $E(XY)$ is not equal to $E(X)\,E(Y)$. The two agree when $X$ and $Y$ are independent, but independence is not required: by this formula they agree exactly when the covariance is zero. Setting $Y = X$ recovers the variance shortcut $E(X^2) - (E(X))^2$.

Proof by expanding

Multiply out $(X - E(X))(Y - E(Y))$ into four terms and use linearity, treating $E(X)$ and $E(Y)$ as constants that pull out of expectations:

$$E(XY) \;-\; E(X)E(Y) \;-\; E(X)E(Y) \;+\; E(X)E(Y)$$

That is $E(XY)$ minus two copies plus one copy, i.e. $E(XY) - E(X)E(Y)$. The definition has more intuitive appeal ("$X$ relative to its mean, $Y$ relative to its mean"); the shortcut is usually easier to compute with.
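As a quick sanity check, the two forms can be compared on a small example. This is a minimal sketch: the joint pmf below is hypothetical, chosen only for illustration, and exact fractions are used so the comparison is not blurred by rounding.

```python
from fractions import Fraction as F

# Hypothetical joint pmf of (X, Y), stored as {(x, y): probability}.
# The numbers are illustrative only, not taken from the lecture.
pmf = {(0, 0): F(3, 10), (0, 1): F(1, 10), (1, 0): F(2, 10), (1, 1): F(4, 10)}

def expect(g, pmf):
    """E[g(X, Y)] for a finite joint pmf."""
    return sum(p * g(x, y) for (x, y), p in pmf.items())

ex = expect(lambda x, y: x, pmf)
ey = expect(lambda x, y: y, pmf)

# Definition: E[(X - E(X))(Y - E(Y))]
cov_definition = expect(lambda x, y: (x - ex) * (y - ey), pmf)

# Shortcut: E(XY) - E(X)E(Y)
cov_shortcut = expect(lambda x, y: x * y, pmf) - ex * ey

print(cov_definition, cov_shortcut)
```

Both values agree, as the four-term expansion says they must.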

· · ·

3. Bilinearity

The remaining algebraic properties all follow immediately by plugging the relevant random variable into either form of the definition and using linearity of expectation.

Property 3: covariance with a constant is zero

$\operatorname{Cov}(X, c) = 0$ for any constant $c$. A constant has $E(c) = c$, so $c - E(c) = 0$, killing the product. By symmetry, $\operatorname{Cov}(c, X) = 0$ as well.

Property 4: constants pull out

$$\operatorname{Cov}(cX, Y) = c\,\operatorname{Cov}(X, Y)$$

Replace $X$ by $cX$; the $c$ factors out of the whole expression. Constants on either argument can be pulled out.

Property 5: distributes over sums

$$\operatorname{Cov}(X, Y + Z) = \operatorname{Cov}(X, Y) + \operatorname{Cov}(X, Z)$$

Replace $Y$ by $Y + Z$; since $X(Y + Z) = XY + XZ$, linearity splits the expectation into the two covariances.

Bilinearity

Properties 4 and 5 together are called bilinearity: hold one coordinate fixed and the map behaves like a linear function in the other. It is not full linearity, but coordinate-by-coordinate it looks like it. A useful mnemonic: it resembles the distributive property $a(b + c) = ab + ac$ — except the "multiplication" is covariance, not actual multiplication.

The general statement, applying Property 5 repeatedly and Property 4 to extract constants, is:

$$\operatorname{Cov}\!\left(\sum_{i=1}^{m} a_i X_i,\; \sum_{j=1}^{n} b_j Y_j\right) = \sum_{i=1}^{m}\sum_{j=1}^{n} a_i b_j \,\operatorname{Cov}(X_i, Y_j)$$

You "covary" each term on the left with each term on the right over all pairs $(i, j)$. This looks complicated but is no different from Property 5 used many times. It is often far easier to manipulate covariances directly with bilinearity than to return to the definition and expand everything in terms of expectations.
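The general statement can be checked numerically on a toy example. The sketch below assumes a hypothetical sample space of four equally likely outcomes and arbitrary coefficients $a_i$, $b_j$; none of these numbers come from the lecture.

```python
from fractions import Fraction as F

# Hypothetical finite sample space: four equally likely outcomes,
# each listing the values (x1, x2, y1, y2). Illustrative numbers only.
outcomes = [(1, 0, 2, 1), (0, 3, 1, 1), (2, 1, 0, 4), (1, 2, 3, 0)]
prob = F(1, len(outcomes))

def mean(v):
    """E(V), where v lists the value of V on each outcome."""
    return sum(prob * t for t in v)

def cov(u, v):
    """Cov(U, V) = E(UV) - E(U)E(V)."""
    return mean([s * t for s, t in zip(u, v)]) - mean(u) * mean(v)

xs = [[w[0] for w in outcomes], [w[1] for w in outcomes]]  # X_1, X_2
ys = [[w[2] for w in outcomes], [w[3] for w in outcomes]]  # Y_1, Y_2
a = [2, -1]  # coefficients a_i (arbitrary)
b = [3, 5]   # coefficients b_j (arbitrary)

# Left side: build the two linear combinations, then take their covariance.
combo_x = [sum(a[i] * xs[i][k] for i in range(2)) for k in range(len(outcomes))]
combo_y = [sum(b[j] * ys[j][k] for j in range(2)) for k in range(len(outcomes))]
lhs = cov(combo_x, combo_y)

# Right side: "covary" every term on the left with every term on the right.
rhs = sum(a[i] * b[j] * cov(xs[i], ys[j]) for i in range(2) for j in range(2))

print(lhs, rhs)
```

The two sides match exactly, whatever outcomes and coefficients are plugged in.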

· · ·

4. Variance of a Sum

Two terms

By Property 1, $\operatorname{Var}(X_1 + X_2) = \operatorname{Cov}(X_1 + X_2,\, X_1 + X_2)$. Expanding by bilinearity gives four terms: $\operatorname{Cov}(X_1, X_1) = \operatorname{Var}(X_1)$, $\operatorname{Cov}(X_2, X_2) = \operatorname{Var}(X_2)$, and two equal cross terms by symmetry. Hence:

$$\operatorname{Var}(X_1 + X_2) = \operatorname{Var}(X_1) + \operatorname{Var}(X_2) + 2\,\operatorname{Cov}(X_1, X_2)$$

If and only if

The variance of the sum equals the sum of the variances if and only if the covariance is zero. Independence is one sufficient condition (independence forces $\operatorname{Cov} = 0$), but we will also see dependent examples where the covariance happens to vanish. In general you cannot drop the covariance term.

Many terms

Sum all the variances, then all the covariance terms. Grouping the symmetric pairs:

$$\operatorname{Var}\!\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \operatorname{Var}(X_i) + 2 \sum_{i < j} \operatorname{Cov}(X_i, X_j)$$

The factor of $2$ is easy to forget. The alternative is to sum over $i \neq j$ (listing $\operatorname{Cov}(X_1, X_2)$ and $\operatorname{Cov}(X_2, X_1)$ separately) and omit the $2$. Grouping with $i < j$ and keeping the $2$ is usually cleaner.
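Both bookkeeping conventions can be compared side by side. This is a sketch on a hypothetical sample space of five equally likely outcomes for three dependent random variables; the values are made up for illustration.

```python
from fractions import Fraction as F
from itertools import combinations

# Hypothetical sample space: five equally likely outcomes (x1, x2, x3).
outcomes = [(1, 0, 2), (0, 3, 1), (2, 1, 0), (1, 2, 3), (4, 0, 1)]
prob = F(1, len(outcomes))
n = 3

def mean(v):
    return sum(prob * t for t in v)

def cov(u, v):
    return mean([s * t for s, t in zip(u, v)]) - mean(u) * mean(v)

def var(v):
    return cov(v, v)  # Property 1: Cov(X, X) = Var(X)

cols = [[w[i] for w in outcomes] for i in range(n)]  # X_1, X_2, X_3
total = [sum(w) for w in outcomes]                   # X_1 + X_2 + X_3

lhs = var(total)

# Grouped form: pairs with i < j, keeping the factor of 2.
rhs_grouped = sum(var(c) for c in cols) + 2 * sum(
    cov(cols[i], cols[j]) for i, j in combinations(range(n), 2)
)

# Ordered form: all pairs with i != j, no factor of 2.
rhs_ordered = sum(var(c) for c in cols) + sum(
    cov(cols[i], cols[j]) for i in range(n) for j in range(n) if i != j
)

print(lhs, rhs_grouped, rhs_ordered)
```

Dropping the factor of $2$ from the grouped form, or keeping it in the ordered form, would break the equality.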

· · ·

5. Independence and Correlation

Independence implies uncorrelated

Theorem

If $X$ and $Y$ are independent, then they are uncorrelated, where uncorrelated is defined as $\operatorname{Cov}(X, Y) = 0$.

This was effectively proved in an earlier lecture (in the continuous case via the 2-D LOTUS, with the discrete case analogous): for independent $X$ and $Y$, $E(XY) = E(X)E(Y)$, so the shortcut formula gives $\operatorname{Cov} = 0$.

The converse is false

A common mistake is to show $\operatorname{Cov}(X, Y) = 0$ and then leap to independence. Zero covariance does not imply independence.

Counterexample: $X = Z$, $\;Y = Z^2$, with $Z \sim \mathcal{N}(0, 1)$

$$\operatorname{Cov}(X, Y) = E(XY) - E(X)E(Y) = E(Z^3) - E(Z)\,E(Z^2) = 0 - 0 = 0,$$

because the odd moments of a standard normal are zero ($E(Z) = 0$ and $E(Z^3) = 0$). So $X$ and $Y$ are uncorrelated — yet they are extremely dependent: $Y$ is a function of $X$, so knowing $X$ gives complete information about $Y$. (Going the other way, knowing $Y = Z^2$ determines $X$ up to a sign, since the square root recovers the magnitude.)
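The same phenomenon shows up in a discrete analog that can be checked exactly. The sketch below swaps the standard normal for a $Z$ that is uniform on $\{-1, 0, 1\}$; this substitution is an assumption made only so that everything is a finite sum. The key feature, symmetry about zero, is preserved.

```python
from fractions import Fraction as F

# Assumption: Z is uniform on {-1, 0, 1}, a discrete stand-in for N(0, 1).
z_pmf = {-1: F(1, 3), 0: F(1, 3), 1: F(1, 3)}

def expect(g):
    """E[g(Z)] as a finite sum."""
    return sum(p * g(z) for z, p in z_pmf.items())

# X = Z and Y = Z^2, so Cov(X, Y) = E(Z^3) - E(Z) E(Z^2).
cov_xy = expect(lambda z: z**3) - expect(lambda z: z) * expect(lambda z: z**2)

# Independence would require P(X = 0, Y = 0) = P(X = 0) P(Y = 0).
p_joint = sum(p for z, p in z_pmf.items() if z == 0 and z**2 == 0)
p_x0 = sum(p for z, p in z_pmf.items() if z == 0)
p_y0 = sum(p for z, p in z_pmf.items() if z**2 == 0)

print("Cov(X, Y) =", cov_xy)
print("P(X=0, Y=0) =", p_joint, "but P(X=0) P(Y=0) =", p_x0 * p_y0)
```

The covariance is zero, yet the joint probability does not factor, so $X$ and $Y$ are uncorrelated but dependent.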

Correlation measures linear association

What is going wrong? Correlation captures the kind of upward- or downward-sloping cloud of points you see in a scatterplot. Here $X$ and $Y$ have a quadratic relationship but no linear trend, so the linear measure reads zero. There is a theorem (not proved here): if every function of $X$ is uncorrelated with every function of $Y$, then $X$ and $Y$ are independent. Having only the linear pieces uncorrelated is not enough.

Definition of correlation

Definition — Correlation

Correlation is a standardized covariance, denoted $\operatorname{Corr}(X, Y)$ or $\rho$. The usual form:

$$\operatorname{Corr}(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\operatorname{SD}(X)\,\operatorname{SD}(Y)}$$

Equivalently — standardize first, then take the covariance:

$$\operatorname{Corr}(X, Y) = \operatorname{Cov}\!\left(\frac{X - E(X)}{\operatorname{SD}(X)},\; \frac{Y - E(Y)}{\operatorname{SD}(Y)}\right)$$

Standardization subtracts the mean and divides by the standard deviation ($\operatorname{SD} = \sqrt{\operatorname{Var}}$), turning any random variable into one with mean $0$ and variance $1$.

Why standardize: units

Covariance has an annoying units problem. If $X$ and $Y$ are distances in nanometers, and a collaborator measures the same quantities in light-years, the two covariances come out wildly different. "The covariance is 42" is uninterpretable without the units: is 42 big or small? Correlation is dimensionless: the covariance carries units of nanometers times nanometers, and dividing by nanometers times nanometers (from the two standard deviations) cancels the units entirely.
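The effect of a unit change is easy to see numerically. In this sketch the joint pmf is hypothetical, and meters stand in for light-years as the second unit (an assumption made only to keep the conversion factor simple: 1 nm = 1e-9 m).

```python
import math

# Hypothetical joint pmf of two lengths (X, Y) measured in nanometers.
pmf_nm = {(1.0, 2.0): 0.2, (2.0, 3.0): 0.5, (4.0, 3.0): 0.3}

def cov_and_corr(pmf):
    """Return (Cov(X, Y), Corr(X, Y)) for a finite joint pmf."""
    ex = sum(p * x for (x, y), p in pmf.items())
    ey = sum(p * y for (x, y), p in pmf.items())
    cov = sum(p * (x - ex) * (y - ey) for (x, y), p in pmf.items())
    var_x = sum(p * (x - ex) ** 2 for (x, y), p in pmf.items())
    var_y = sum(p * (y - ey) ** 2 for (x, y), p in pmf.items())
    return cov, cov / math.sqrt(var_x * var_y)

# The same quantities re-expressed in meters.
scale = 1e-9
pmf_m = {(x * scale, y * scale): p for (x, y), p in pmf_nm.items()}

cov_nm, corr_nm = cov_and_corr(pmf_nm)
cov_m, corr_m = cov_and_corr(pmf_m)

print("covariance: ", cov_nm, "nm^2 vs", cov_m, "m^2")
print("correlation:", corr_nm, "vs", corr_m)
```

The covariance shrinks by a factor of `scale ** 2`, while the correlation is unchanged up to floating-point rounding.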

The two definitions agree: subtracting the means is just adding constants, which does not affect covariance, and the standard deviations pull out by the constants-out property, reproducing the divide-by-$\operatorname{SD}$ form.
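Written out, the argument is two steps. First the constants $1/\operatorname{SD}(X)$ and $1/\operatorname{SD}(Y)$ pull out (Property 4 in each coordinate); then the shifts drop out, because $\operatorname{Cov}(X - c, Y) = \operatorname{Cov}(X, Y) - \operatorname{Cov}(c, Y) = \operatorname{Cov}(X, Y)$ by Properties 5, 4, and 3:

$$\operatorname{Cov}\!\left(\frac{X - E(X)}{\operatorname{SD}(X)},\; \frac{Y - E(Y)}{\operatorname{SD}(Y)}\right) = \frac{\operatorname{Cov}\big(X - E(X),\; Y - E(Y)\big)}{\operatorname{SD}(X)\,\operatorname{SD}(Y)} = \frac{\operatorname{Cov}(X, Y)}{\operatorname{SD}(X)\,\operatorname{SD}(Y)}$$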

· · ·

6. The Correlation Bound

Theorem

$$-1 \le \operatorname{Corr}(X, Y) \le 1 \quad \text{always.}$$

(So a correlation can never equal 42.)

This makes correlation interpretable on an absolute scale: a correlation of $0.9$ is high precisely because we know $1$ is the maximum. The inequality is essentially the Cauchy-Schwarz inequality — one of the most important inequalities in mathematics — written in a probabilistic setting. But rather than invoke Cauchy-Schwarz, here is a direct proof.

Direct proof

Without loss of generality, assume $X$ and $Y$ are already standardized (mean $0$, variance $1$). This is allowed because standardizing does not change the correlation. For standardized variables the covariance is the correlation; call it $\rho$. Apply the variance-of-a-sum result twice:

$$\operatorname{Var}(X + Y) = 1 + 1 + 2\rho = 2 + 2\rho$$

$$\operatorname{Var}(X - Y) = 1 + 1 - 2\rho = 2 - 2\rho$$

Think of $X - Y$ as $X + (-Y)$: it still adds the variances, but the covariance cross term flips sign. (A common error is $\operatorname{Var}(X - Y) = \operatorname{Var}(X) - \operatorname{Var}(Y)$, which can go negative — impossible.) Since variance is non-negative:

$2 + 2\rho \ge 0$ gives $\rho \ge -1$.
$2 - 2\rho \ge 0$ gives $\rho \le 1$.

Hence $-1 \le \rho \le 1$.
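The two variance identities in the proof can be checked on a concrete example. This is a sketch with hypothetical values and probabilities; floating-point arithmetic is used because standardizing involves a square root, so the comparisons hold only up to rounding.

```python
import math

# Hypothetical outcomes: the i-th outcome has X = xs[i], Y = ys[i], probability probs[i].
xs = [1.0, 2.0, 4.0, 7.0]
ys = [3.0, 1.0, 5.0, 2.0]
probs = [0.1, 0.4, 0.3, 0.2]

def mean(values):
    return sum(p * v for v, p in zip(values, probs))

def var(values):
    m = mean(values)
    return sum(p * (v - m) ** 2 for v, p in zip(values, probs))

def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    m, sd = mean(values), math.sqrt(var(values))
    return [(v - m) / sd for v in values]

sx, sy = standardize(xs), standardize(ys)

# For standardized variables (mean 0, variance 1), Cov = E(XY) = rho.
rho = mean([a * b for a, b in zip(sx, sy)])

var_sum = var([a + b for a, b in zip(sx, sy)])   # should be 2 + 2*rho
var_diff = var([a - b for a, b in zip(sx, sy)])  # should be 2 - 2*rho

print(rho, var_sum, var_diff)
```

Since `var_sum` and `var_diff` are variances, neither can be negative, which is exactly what pins `rho` between $-1$ and $1$.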

In practice covariances are easier to work with, but correlations are more intuitive and bounded.

· · ·

7. Example: Covariance in the Multinomial

Let $(X_1, \ldots, X_k) \sim \operatorname{Mult}(n, \mathbf{p})$, where $X_j$ is the number of the $n$ objects landing in category $j$ and $\mathbf{p} = (p_1, \ldots, p_k)$ gives the category probabilities. We want $\operatorname{Cov}(X_i, X_j)$ for all $i, j$.

Diagonal case ($i = j$)

$\operatorname{Cov}(X_i, X_i) = \operatorname{Var}(X_i)$. Focusing on a single category — "success" means landing in category $i$ — makes $X_i \sim \operatorname{Bin}(n, p_i)$. So:

$$\operatorname{Var}(X_i) = n\,p_i\,(1 - p_i)$$

Off-diagonal case ($i \neq j$): intuition first

Sign check

Before computing, ask whether $\operatorname{Cov}(X_1, X_2)$ should be positive, negative, or zero. It must be negative: the categories compete for a fixed pool of $n$ objects. If many land in category 1, fewer are left for category 2. (Contrast the chicken-and-egg example with a random number of eggs; here the total $n$ is fixed.) If you compute and get a positive number, stop and recheck.

Derivation via the lumping property

Use $\operatorname{Var}(X_1 + X_2) = \operatorname{Var}(X_1) + \operatorname{Var}(X_2) + 2C$, calling the unknown covariance $C$. The lumping property of the multinomial says merging categories 1 and 2 gives another multinomial, so the merged count $X_1 + X_2$ is binomial, with "success" $=$ landing in category 1 or 2:

$$X_1 + X_2 \sim \operatorname{Bin}(n,\; p_1 + p_2), \qquad \operatorname{Var}(X_1 + X_2) = n(p_1 + p_2)(1 - p_1 - p_2).$$

Substituting the known variances and solving:

$$n(p_1 + p_2)(1 - p_1 - p_2) = n p_1(1 - p_1) + n p_2(1 - p_2) + 2C$$

Expanding and simplifying yields $C = -n\,p_1 p_2$. In general:

$$\operatorname{Cov}(X_i, X_j) = -n\,p_i\,p_j \quad (i \neq j)$$

Negative, as the competition intuition predicted.
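For small $n$ the result can be confirmed by brute force: enumerate every outcome of the multinomial, weight it by its pmf, and compute the covariance from the shortcut formula. The values $n = 4$ and $\mathbf{p} = (1/2, 1/3, 1/6)$ below are hypothetical, picked only to keep the enumeration short.

```python
from fractions import Fraction as F
from math import factorial

n = 4
p = [F(1, 2), F(1, 3), F(1, 6)]  # hypothetical category probabilities, k = 3

def mult_pmf(counts):
    """Multinomial pmf: n! / (c1! c2! c3!) * p1^c1 * p2^c2 * p3^c3."""
    denom = 1
    prob = F(1)
    for c, pj in zip(counts, p):
        denom *= factorial(c)
        prob *= pj ** c
    return (factorial(n) // denom) * prob

# Every way to split n objects among the three categories.
support = [(a, b, n - a - b) for a in range(n + 1) for b in range(n + 1 - a)]
pmf = {c: mult_pmf(c) for c in support}

def expect(g):
    return sum(pr * g(c) for c, pr in pmf.items())

e1 = expect(lambda c: c[0])
e2 = expect(lambda c: c[1])
cov_12 = expect(lambda c: c[0] * c[1]) - e1 * e2
var_1 = expect(lambda c: c[0] ** 2) - e1 ** 2

print("Cov(X1, X2) =", cov_12, " vs  -n p1 p2 =", -n * p[0] * p[1])
print("Var(X1)     =", var_1, " vs  n p1 (1 - p1) =", n * p[0] * (1 - p[0]))
```

The enumerated covariance matches $-n\,p_1 p_2$ and the enumerated variance matches the binomial formula from the diagonal case.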

· · ·

8. Example: Variance of the Binomial

Earlier we found $\operatorname{Var}(\operatorname{Bin}(n, p)) = npq$ (with $q = 1 - p$) using indicators directly. Covariance gives a clean re-derivation. Write $X = X_1 + \cdots + X_n$ with the $X_j$ IID $\operatorname{Bern}(p)$; each $X_j$ indicates success on the $j$-th trial.

Indicator random variable facts

Let $I_A$ be the indicator of an event $A$ ($1$ if $A$ occurs, $0$ otherwise). A few simple but frequently-overlooked facts:

$I_A^2 = I_A$ (and any positive power: $I_A^3 = I_A$, etc.) — because $0$ and $1$ are unchanged by squaring.
$I_A \cdot I_B = I_{A \cap B}$ — the product is $1$ if and only if both indicators are $1$, which is exactly the intersection.

Variance of one Bernoulli

$$\operatorname{Var}(X_j) = E(X_j^2) - (E(X_j))^2 = E(X_j) - p^2 = p - p^2 = p(1 - p) = pq,$$

using $X_j^2 = X_j$ and $E(X_j) = p$.

Summing up

The trials are independent, so $\operatorname{Cov}(X_i, X_j) = 0$ for $i \neq j$ (independent implies uncorrelated) — no covariance terms survive. Thus:

$$\operatorname{Var}(X) = \sum_{j=1}^{n} \operatorname{Var}(X_j) = npq$$

You can now compute the binomial variance in your head: $n$ times the variance of a single Bernoulli.
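The shortcut can be checked against the variance computed directly from the binomial pmf. A minimal sketch, with $n = 10$ and $p = 3/10$ as hypothetical values:

```python
from fractions import Fraction as F
from math import comb

n, p = 10, F(3, 10)  # hypothetical parameters
q = 1 - p

# One Bernoulli: E(X_j^2) - (E(X_j))^2, using X_j^2 = X_j so that E(X_j^2) = p.
bern_var = p - p ** 2

# Independent trials: no covariance terms, so the variances simply add.
var_from_sum = n * bern_var

# Direct computation from the Bin(n, p) pmf, for comparison.
pmf = {k: comb(n, k) * p ** k * q ** (n - k) for k in range(n + 1)}
mean = sum(k * pr for k, pr in pmf.items())
var_from_pmf = sum(k ** 2 * pr for k, pr in pmf.items()) - mean ** 2

print(var_from_sum, var_from_pmf)
```

Both routes give the same value, $npq$.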

· · ·

9. Example: Variance of the Hypergeometric

Let $X \sim \operatorname{HGeom}(w, b, n)$: a jar holds $w$ white and $b$ black balls; we draw a sample of size $n$ without replacement; $X$ counts the white balls drawn. Decompose with indicators: draw the balls one at a time, and let $X_j = 1$ if the $j$-th ball drawn is white. (With replacement we would get a binomial; without replacement makes these indicators dependent.) Then $X = X_1 + \cdots + X_n$.

Exploiting symmetry

Writing out the full variance looks like a nightmare of $n$ variance terms plus $n(n-1)$ covariance terms — but symmetry collapses it:

$$\operatorname{Var}(X) = n\,\operatorname{Var}(X_1) + 2\binom{n}{2}\operatorname{Cov}(X_1, X_2)$$

All the $\operatorname{Var}(X_j)$ are equal, so the variance part is $n\,\operatorname{Var}(X_1)$. The $j$-th ball is equally likely to be any ball in the jar — no ball "prefers" to be drawn $j$-th — so every $X_j$ has the same marginal distribution. You may consider the seventh ball on its own, imagining it before any draw; you do not need to track the earlier draws.
All $\binom{n}{2}$ covariance pairs are equal by the same symmetry, so we use a single representative $\operatorname{Cov}(X_1, X_2)$, with the factor $2$ for the $i < j$ grouping.

Symmetry is powerful but dangerous

Confirm the symmetry genuinely holds before relying on it; never assume symmetry that isn't there.

The pieces

Each $X_j$ is Bernoulli with success probability $p = \dfrac{w}{w + b}$, so $\operatorname{Var}(X_1) = pq$ with $q = 1 - p$. For the covariance, use the shortcut and the indicator-product fact:

$$\operatorname{Cov}(X_1, X_2) = E(X_1 X_2) - E(X_1)\,E(X_2)$$

$E(X_1)\,E(X_2)$: by the fundamental bridge each marginal probability of white is $\dfrac{w}{w + b}$, so this term is $\left(\dfrac{w}{w + b}\right)^2$.
$E(X_1 X_2) = E\big(I_{\text{both white}}\big) = P(\text{first two balls white}) = \dfrac{w}{w + b} \cdot \dfrac{w - 1}{w + b - 1}$, since the first is white with probability $\dfrac{w}{w+b}$ and, given that, the second with probability $\dfrac{w-1}{w+b-1}$.

With both pieces known, assembling $\operatorname{Var}(X)$ is just algebra (final simplification left for next time).
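Before doing that algebra, the pieces can be assembled numerically and compared with the variance computed directly from the hypergeometric pmf. This is a sketch: the jar with $w = 6$, $b = 4$, $n = 5$ is hypothetical, and no closed form is assumed.

```python
from fractions import Fraction as F
from math import comb

w, b, n = 6, 4, 5  # hypothetical jar: 6 white, 4 black, draw 5

# Pieces from the indicator decomposition.
p = F(w, w + b)
var_x1 = p * (1 - p)                            # Bernoulli variance pq
e_x1x2 = F(w, w + b) * F(w - 1, w + b - 1)      # P(first two balls white)
cov_x1x2 = e_x1x2 - p * p                       # shortcut formula

# Symmetry: n equal variances and C(n, 2) equal covariance pairs, each counted twice.
var_assembled = n * var_x1 + 2 * comb(n, 2) * cov_x1x2

# Direct computation from the HGeom(w, b, n) pmf, for comparison.
pmf = {k: F(comb(w, k) * comb(b, n - k), comb(w + b, n)) for k in range(n + 1)}
mean = sum(k * pr for k, pr in pmf.items())
var_from_pmf = sum(k ** 2 * pr for k, pr in pmf.items()) - mean ** 2

print("Cov(X1, X2) =", cov_x1x2)
print(var_assembled, var_from_pmf)
```

The covariance comes out negative, as it should when drawing without replacement, and the assembled variance agrees with the direct computation.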