Expectation is linear: $E(X + Y) = E(X) + E(Y)$ always, even when $X$ and $Y$ are dependent. Variance is not linear. So if we want $\operatorname{Var}(X + Y)$, we cannot just add the variances — we have to think harder rather than falsely applying linearity. Covariance is the tool that finally lets us handle the variance of a sum.
Covariance is two sides of one coin: it is exactly what we need to compute the variance of a sum, and it is what we use to study two random variables together rather than one at a time — "like variance, except for two of them." That second motivation explains the name: it is the co-variance of $X$ and $Y$.
For any two random variables $X$ and $Y$ on the same probability space:
$$\operatorname{Cov}(X, Y) = E\big[(X - E(X))(Y - E(Y))\big]$$
Stare at the form for a moment. It is a product, which brings the $X$ information and the $Y$ information together into one quantity, because we are trying to see how they vary together. Recall the sign rules: positive times positive is positive, negative times negative is positive, positive times negative is negative.
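Now imagine drawing a random sample of many IID pairs $(X_1, Y_1), (X_2, Y_2), \ldots$ The pairs are IID, but within a pair $X_i$ and $Y_i$ have some joint distribution and need not be independent. For each pair, the product $(X_i - E(X))(Y_i - E(Y))$ is positive when both coordinates are above their means or both are below, and negative when they fall on opposite sides. Positive covariance means the first case dominates on average, so $X$ and $Y$ tend to move together; negative covariance means they tend to move in opposite directions.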
Property 1: $\operatorname{Cov}(X, X) = \operatorname{Var}(X)$. Proof: set $Y = X$ in the definition; $E[(X - E(X))^2]$ is exactly the definition of $\operatorname{Var}(X)$. So we have proved a theorem just by relabeling.
Symmetry: $\operatorname{Cov}(X, Y) = \operatorname{Cov}(Y, X)$. This is immediate, since swapping $X$ and $Y$ in the definition gives the same product.
Just as we rewrote $\operatorname{Var}(X) = E(X^2) - (E(X))^2$, covariance has an analogous, often more computable form:
$$\operatorname{Cov}(X, Y) = E(XY) - E(X)\,E(Y)$$
In general $E(XY)$ is not equal to $E(X)\,E(Y)$. They are equal when $X$ and $Y$ are independent, and more generally exactly when $\operatorname{Cov}(X, Y) = 0$. Setting $Y = X$ recovers the variance shortcut $E(X^2) - (E(X))^2$.
Multiply out $(X - E(X))(Y - E(Y))$ into four terms and use linearity, treating $E(X)$ and $E(Y)$ as constants that pull out of expectations:
$$E(XY) \;-\; E(X)E(Y) \;-\; E(X)E(Y) \;+\; E(X)E(Y)$$
That is $E(XY)$ minus two copies plus one copy, i.e. $E(XY) - E(X)E(Y)$. The definition has more intuitive appeal ("$X$ relative to its mean, $Y$ relative to its mean"); the shortcut is usually easier to compute with.
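As a quick check that the definition and the shortcut agree, here is a minimal sketch on a small joint pmf. The pmf values and the helper name `expect` are made up purely for illustration; exact fractions avoid any rounding questions.

```python
from fractions import Fraction as F

# Hypothetical joint pmf of (X, Y), chosen only for illustration.
joint = {
    (0, 0): F(1, 4),
    (0, 1): F(1, 4),
    (1, 0): F(1, 8),
    (1, 1): F(3, 8),
}

def expect(g):
    """E[g(X, Y)]: sum g(x, y) * P(X = x, Y = y) over the joint pmf."""
    return sum(p * g(x, y) for (x, y), p in joint.items())

mean_x = expect(lambda x, y: x)
mean_y = expect(lambda x, y: y)

# Definition: E[(X - E(X))(Y - E(Y))]
cov_def = expect(lambda x, y: (x - mean_x) * (y - mean_y))
# Shortcut: E(XY) - E(X)E(Y)
cov_short = expect(lambda x, y: x * y) - mean_x * mean_y

print(cov_def, cov_short)  # 1/16 1/16
```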
The remaining algebraic properties all follow immediately by plugging the relevant random variable into either form of the definition and using linearity of expectation.
$\operatorname{Cov}(X, c) = 0$ for any constant $c$. A constant has $E(c) = c$, so $c - E(c) = 0$, killing the product. By symmetry, $\operatorname{Cov}(c, X) = 0$ as well.
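Property 4: $\operatorname{Cov}(cX, Y) = c\,\operatorname{Cov}(X, Y)$ for any constant $c$. Replace $X$ by $cX$ in the definition; the $c$ factors out of the whole expression. By symmetry, a constant on either argument can be pulled out.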
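Property 5: $\operatorname{Cov}(X, Y + Z) = \operatorname{Cov}(X, Y) + \operatorname{Cov}(X, Z)$. Replace $Y$ by $Y + Z$ in the shortcut form; since $X(Y + Z) = XY + XZ$ and $E(Y + Z) = E(Y) + E(Z)$, linearity splits the expression into the two covariances.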
Properties 4 and 5 together are called bilinearity: hold one coordinate fixed and the map behaves like a linear function in the other. It is not full linearity, but coordinate-by-coordinate it looks like it. A useful mnemonic: it resembles the distributive property $a(b + c) = ab + ac$ — except the "multiplication" is covariance, not actual multiplication.
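The general statement, applying Property 5 repeatedly and Property 4 to extract constants, is:
$$\operatorname{Cov}\!\left(\sum_{i=1}^{m} a_i X_i,\; \sum_{j=1}^{n} b_j Y_j\right) = \sum_{i=1}^{m} \sum_{j=1}^{n} a_i b_j \operatorname{Cov}(X_i, Y_j)$$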
You "covary" each term on the left with each term on the right over all pairs $(i, j)$. This looks complicated but is no different from Property 5 used many times. It is often far easier to manipulate covariances directly with bilinearity than to return to the definition and expand everything in terms of expectations.
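Bilinearity also holds exactly for the empirical covariance of a sample, so it can be checked numerically. The sketch below uses arbitrary simulated data and a hand-rolled `cov` helper (both are illustrative assumptions) to expand $\operatorname{Cov}(2X + 3Y,\, W - 4Z)$ term by term.

```python
import math
import random

rng = random.Random(0)
m = 1000
# Four deliberately dependent samples (illustrative data only).
x = [rng.gauss(0, 1) for _ in range(m)]
y = [xi + rng.gauss(0, 1) for xi in x]
w = [rng.expovariate(1.0) - yi for yi in y]
z = [xi * yi for xi, yi in zip(x, y)]

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Empirical covariance: average product of deviations from the means."""
    mu, mv = mean(u), mean(v)
    return mean([(a - mu) * (b - mv) for a, b in zip(u, v)])

left = [2 * a + 3 * b for a, b in zip(x, y)]    # 2X + 3Y
right = [c - 4 * d for c, d in zip(w, z)]       # W - 4Z

lhs = cov(left, right)
# "Covary" each term on the left with each term on the right.
rhs = 2 * cov(x, w) - 8 * cov(x, z) + 3 * cov(y, w) - 12 * cov(y, z)
print(math.isclose(lhs, rhs, rel_tol=1e-9, abs_tol=1e-9))  # True
```

The two sides differ only by floating-point rounding, which is why the comparison uses a tolerance rather than `==`.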
By Property 1, $\operatorname{Var}(X_1 + X_2) = \operatorname{Cov}(X_1 + X_2,\, X_1 + X_2)$. Expanding by bilinearity gives four terms: $\operatorname{Cov}(X_1, X_1) = \operatorname{Var}(X_1)$, $\operatorname{Cov}(X_2, X_2) = \operatorname{Var}(X_2)$, and two equal cross terms by symmetry. Hence:
$$\operatorname{Var}(X_1 + X_2) = \operatorname{Var}(X_1) + \operatorname{Var}(X_2) + 2\operatorname{Cov}(X_1, X_2)$$
The variance of the sum equals the sum of the variances if and only if the covariance is zero. Independence is one sufficient condition (independence forces $\operatorname{Cov} = 0$), but we will also see dependent examples where the covariance happens to vanish. In general you cannot drop the covariance term.
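For $n$ variables, expand $\operatorname{Cov}(X_1 + \cdots + X_n,\, X_1 + \cdots + X_n)$ the same way: sum all the variances, then all the covariance terms. Grouping the symmetric pairs:
$$\operatorname{Var}(X_1 + \cdots + X_n) = \sum_{i=1}^{n} \operatorname{Var}(X_i) + 2\sum_{i<j} \operatorname{Cov}(X_i, X_j)$$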
The factor of $2$ is easy to forget. The alternative is to sum over $i \neq j$ (listing $\operatorname{Cov}(X_1, X_2)$ and $\operatorname{Cov}(X_2, X_1)$ separately) and omit the $2$. Grouping with $i < j$ and keeping the $2$ is usually cleaner.
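The variance-of-a-sum formula, with the factor of $2$ on the $i < j$ pairs, can be checked on simulated data. This is a sketch under assumed inputs: four correlated columns built from a shared component, with a hand-rolled `cov` helper.

```python
import math
import random
from itertools import combinations

rng = random.Random(1)
n_vars, n_samples = 4, 5000
# Correlated columns: each variable is a shared component plus its own noise.
shared = [rng.gauss(0, 1) for _ in range(n_samples)]
cols = [[s + rng.gauss(0, 1) for s in shared] for _ in range(n_vars)]

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Empirical covariance; cov(u, u) is the empirical variance."""
    mu, mv = mean(u), mean(v)
    return mean([(a - mu) * (b - mv) for a, b in zip(u, v)])

var_terms = sum(cov(c, c) for c in cols)                      # sum of Var(X_i)
cov_terms = sum(cov(a, b) for a, b in combinations(cols, 2))  # pairs with i < j

totals = [sum(row) for row in zip(*cols)]   # X_1 + ... + X_n for each sample
var_of_sum = cov(totals, totals)
print(math.isclose(var_of_sum, var_terms + 2 * cov_terms, rel_tol=1e-9))  # True
```

Dropping the `2 *` makes the check fail whenever the columns are correlated, which is the mistake the text warns about.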
If $X$ and $Y$ are independent, then they are uncorrelated, where uncorrelated is defined as $\operatorname{Cov}(X, Y) = 0$.
This was effectively proved in an earlier lecture (in the continuous case via the 2-D LOTUS, with the discrete case analogous): for independent $X$ and $Y$, $E(XY) = E(X)E(Y)$, so the shortcut formula gives $\operatorname{Cov} = 0$.
A common mistake is to show $\operatorname{Cov}(X, Y) = 0$ and then leap to independence. Zero covariance does not imply independence. For a counterexample, let $Z \sim \mathcal{N}(0, 1)$ and set $X = Z$, $Y = Z^2$. Then
$$\operatorname{Cov}(X, Y) = E(XY) - E(X)E(Y) = E(Z^3) - E(Z)\,E(Z^2) = 0 - 0 = 0,$$
because the odd moments of a standard normal are zero ($E(Z) = 0$ and $E(Z^3) = 0$). So $X$ and $Y$ are uncorrelated — yet they are extremely dependent: $Y$ is a function of $X$, so knowing $X$ gives complete information about $Y$. (Going the other way, knowing $Y = Z^2$ determines $X$ up to a sign, since the square root recovers the magnitude.)
What is going wrong? Correlation captures the kind of upward- or downward-sloping cloud of points you see in a scatterplot. Here $X$ and $Y$ have a quadratic relationship but no linear trend, so the linear measure reads zero. There is a theorem (not proved here): if every function of $X$ is uncorrelated with every function of $Y$, then $X$ and $Y$ are independent. Having only the linear pieces uncorrelated is not enough.
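A simulation makes both halves of this visible. The sketch below (sample size and seed are arbitrary choices) estimates $\operatorname{Cov}(X, Y)$ for $X = Z$, $Y = Z^2$, and then the covariance between $X^2$, a function of $X$, and $Y$; the latter is $\operatorname{Var}(Z^2) = E(Z^4) - (E(Z^2))^2 = 3 - 1 = 2$.

```python
import random

rng = random.Random(2)
z = [rng.gauss(0, 1) for _ in range(200_000)]
x = z
y = [v * v for v in z]

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Empirical covariance via the shortcut E(UV) - E(U)E(V)."""
    return mean([a * b for a, b in zip(u, v)]) - mean(u) * mean(v)

cov_xy = cov(x, y)                # estimates E(Z^3) - E(Z)E(Z^2) = 0
x_sq = [v * v for v in x]
cov_x2_y = cov(x_sq, y)           # estimates Var(Z^2) = 2

print(round(cov_xy, 2), round(cov_x2_y, 2))
```

The first estimate hovers near $0$ while the second sits near $2$: the linear measure misses the dependence, but a nonlinear function of $X$ exposes it.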
Correlation is a standardized covariance, denoted $\operatorname{Corr}(X, Y)$ or $\rho$. The usual form:
$$\operatorname{Corr}(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\operatorname{SD}(X)\,\operatorname{SD}(Y)}$$
Equivalently — standardize first, then take the covariance:
$$\operatorname{Corr}(X, Y) = \operatorname{Cov}\!\left(\frac{X - E(X)}{\operatorname{SD}(X)},\; \frac{Y - E(Y)}{\operatorname{SD}(Y)}\right)$$
Standardization subtracts the mean and divides by the standard deviation ($\operatorname{SD} = \sqrt{\operatorname{Var}}$), turning any random variable into one with mean $0$ and variance $1$.
Covariance has an annoying units problem. If $X$ and $Y$ are distances in nanometers, and a collaborator measures the same quantities in light-years, the two covariances come out wildly different. "The covariance is 42" is uninterpretable without the units: is 42 big or small? Correlation is dimensionless: nanometers squared (in the covariance) divided by nanometers times nanometers (from the two standard deviations), so the units cancel entirely.
The two definitions agree: subtracting the means is just adding constants, which does not affect covariance, and the standard deviations pull out by the constants-out property, reproducing the divide-by-$\operatorname{SD}$ form.
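The units problem is easy to see numerically. In this sketch the data are simulated and the meters-to-nanometers conversion is just an illustrative choice of units; `cov` and `corr` are hand-rolled helpers.

```python
import math
import random

rng = random.Random(3)
x_m = [rng.gauss(0, 1) for _ in range(10_000)]   # hypothetical measurements in meters
y_m = [0.5 * a + rng.gauss(0, 1) for a in x_m]
x_nm = [a * 1e9 for a in x_m]                    # the same quantities in nanometers
y_nm = [b * 1e9 for b in y_m]

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return mean([(a - mu) * (b - mv) for a, b in zip(u, v)])

def corr(u, v):
    """Covariance divided by the product of the standard deviations."""
    return cov(u, v) / math.sqrt(cov(u, u) * cov(v, v))

ratio = cov(x_nm, y_nm) / cov(x_m, y_m)
print(ratio)  # close to 1e18: a factor of 1e9 from each variable
print(math.isclose(corr(x_m, y_m), corr(x_nm, y_nm), rel_tol=1e-9))  # True
```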
$$-1 \le \operatorname{Corr}(X, Y) \le 1 \quad \text{always.}$$
(So a correlation can never equal 42.)
This makes correlation interpretable on an absolute scale: a correlation of $0.9$ is high precisely because we know $1$ is the maximum. The inequality is essentially the Cauchy-Schwarz inequality — one of the most important inequalities in mathematics — written in a probabilistic setting. But rather than invoke Cauchy-Schwarz, here is a direct proof.
Without loss of generality, assume $X$ and $Y$ are already standardized (mean $0$, variance $1$). This is allowed because standardizing does not change the correlation. For standardized variables the covariance is the correlation; call it $\rho$. Apply the variance-of-a-sum result twice:
$$\operatorname{Var}(X + Y) = 1 + 1 + 2\rho = 2 + 2\rho$$
$$\operatorname{Var}(X - Y) = 1 + 1 - 2\rho = 2 - 2\rho$$
Think of $X - Y$ as $X + (-Y)$: it still adds the variances, but the covariance cross term flips sign. (A common error is $\operatorname{Var}(X - Y) = \operatorname{Var}(X) - \operatorname{Var}(Y)$, which can go negative, an impossibility.) Since variance is non-negative:
$$0 \le \operatorname{Var}(X + Y) = 2 + 2\rho, \qquad 0 \le \operatorname{Var}(X - Y) = 2 - 2\rho$$
Hence $-1 \le \rho \le 1$.
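The two identities behind the proof can be checked on standardized samples. This is only a sketch: the data-generating choices are arbitrary, and `standardize` uses the sample mean and SD in place of the true ones.

```python
import math
import random

rng = random.Random(4)
x = [rng.gauss(0, 1) for _ in range(5000)]
y = [-0.8 * a + rng.gauss(0, 1) for a in x]   # hypothetical negatively related pair

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return mean([(a - m) ** 2 for a in v])

def standardize(v):
    """Subtract the sample mean and divide by the sample SD."""
    m, sd = mean(v), math.sqrt(var(v))
    return [(a - m) / sd for a in v]

xs, ys = standardize(x), standardize(y)
rho = mean([a * b for a, b in zip(xs, ys)])   # covariance of the standardized samples

var_sum = var([a + b for a, b in zip(xs, ys)])
var_diff = var([a - b for a, b in zip(xs, ys)])
print(math.isclose(var_sum, 2 + 2 * rho, rel_tol=1e-9))   # True
print(math.isclose(var_diff, 2 - 2 * rho, rel_tol=1e-9))  # True
print(-1 <= rho <= 1)                                     # True
```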
In practice covariances are easier to work with, but correlations are more intuitive and bounded.
Let $(X_1, \ldots, X_k) \sim \operatorname{Mult}(n, \mathbf{p})$, where $X_j$ is the number of the $n$ objects landing in category $j$ and $\mathbf{p} = (p_1, \ldots, p_k)$ gives the category probabilities. We want $\operatorname{Cov}(X_i, X_j)$ for all $i, j$.
First take $i = j$: $\operatorname{Cov}(X_i, X_i) = \operatorname{Var}(X_i)$. Focusing on a single category, where "success" means landing in category $i$, makes $X_i \sim \operatorname{Bin}(n, p_i)$. So:
$$\operatorname{Var}(X_i) = n\,p_i(1 - p_i)$$
Before computing, ask whether $\operatorname{Cov}(X_1, X_2)$ should be positive, negative, or zero. It must be negative: the categories compete for a fixed pool of $n$ objects. If many land in category 1, fewer are left for category 2. (Contrast the chicken-and-egg example with a random number of eggs; here the total $n$ is fixed.) If you compute and get a positive number, stop and recheck.
Use $\operatorname{Var}(X_1 + X_2) = \operatorname{Var}(X_1) + \operatorname{Var}(X_2) + 2C$, calling the unknown covariance $C$. The lumping property of the multinomial says merging categories 1 and 2 is again binomial, with "success" $=$ landing in category 1 or 2:
$$X_1 + X_2 \sim \operatorname{Bin}(n,\; p_1 + p_2), \qquad \operatorname{Var}(X_1 + X_2) = n(p_1 + p_2)(1 - p_1 - p_2).$$
Substituting the known variances and solving:
$$n(p_1 + p_2)(1 - p_1 - p_2) = n p_1(1 - p_1) + n p_2(1 - p_2) + 2C$$
Expanding and simplifying yields $C = -n\,p_1 p_2$. In general:
$$\operatorname{Cov}(X_i, X_j) = -n\,p_i p_j \quad \text{for } i \neq j.$$
Negative, as the competition intuition predicted.
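For small $n$ the result can be verified exactly by enumerating every outcome of the multinomial. The values of `n` and `p` below are hypothetical, picked so the arithmetic stays in simple fractions.

```python
from fractions import Fraction as F
from itertools import product
from math import factorial

n = 4
p = (F(1, 2), F(1, 3), F(1, 6))   # hypothetical category probabilities

def pmf(counts):
    """Multinomial pmf: n! / (x_1! ... x_k!) * p_1^x_1 ... p_k^x_k."""
    coef = factorial(n)
    prob = F(1)
    for x, pj in zip(counts, p):
        coef //= factorial(x)
        prob *= pj ** x
    return coef * prob

# All count vectors (x_1, x_2, x_3) that sum to n.
outcomes = [c for c in product(range(n + 1), repeat=len(p)) if sum(c) == n]

def expect(g):
    return sum(pmf(c) * g(c) for c in outcomes)

e1 = expect(lambda c: c[0])
e2 = expect(lambda c: c[1])
cov_12 = expect(lambda c: c[0] * c[1]) - e1 * e2
var_1 = expect(lambda c: c[0] ** 2) - e1 ** 2

print(cov_12, -n * p[0] * p[1])       # -2/3 -2/3
print(var_1, n * p[0] * (1 - p[0]))   # 1 1
```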
Earlier we found $\operatorname{Var}(\operatorname{Bin}(n, p)) = npq$ (with $q = 1 - p$) using indicators directly. Covariance gives a clean re-derivation. Write $X = X_1 + \cdots + X_n$ with the $X_j$ IID $\operatorname{Bern}(p)$; each $X_j$ indicates success on the $j$-th trial.
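Let $I_A$ be the indicator of an event $A$ ($1$ if $A$ occurs, $0$ otherwise). A few simple but frequently overlooked facts:
$$I_A^2 = I_A, \qquad I_A I_B = I_{A \cap B}.$$
The first holds because $0^2 = 0$ and $1^2 = 1$; the second because the product is $1$ exactly when both events occur.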
$$\operatorname{Var}(X_j) = E(X_j^2) - (E(X_j))^2 = E(X_j) - p^2 = p - p^2 = p(1 - p) = pq,$$
using $X_j^2 = X_j$ and $E(X_j) = p$.
The trials are independent, so $\operatorname{Cov}(X_i, X_j) = 0$ for $i \neq j$ (independent implies uncorrelated), and no covariance terms survive. Thus:
$$\operatorname{Var}(X) = \operatorname{Var}(X_1) + \cdots + \operatorname{Var}(X_n) = npq.$$
You can now compute the binomial variance in your head: $n$ times the variance of a single Bernoulli.
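As a sanity check on $npq$, the variance can also be computed by brute force from the binomial pmf. The parameter values here are hypothetical.

```python
from fractions import Fraction as F
from math import comb

n, p = 10, F(3, 10)   # hypothetical parameters
q = 1 - p

# P(X = k) for k = 0, ..., n
pmf = [comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]
mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum(k**2 * pk for k, pk in enumerate(pmf)) - mean**2

print(var, n * p * q)   # 21/10 21/10
```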
Let $X \sim \operatorname{HGeom}(w, b, n)$: a jar holds $w$ white and $b$ black balls; we draw a sample of size $n$ without replacement; $X$ counts the white balls drawn. Decompose with indicators: draw the balls one at a time, and let $X_j = 1$ if the $j$-th ball drawn is white. (With replacement we would get a binomial; without replacement makes these indicators dependent.) Then $X = X_1 + \cdots + X_n$.
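Writing out the full variance looks like a nightmare of $n$ variance terms plus $n(n-1)$ covariance terms, but symmetry collapses it. Every $X_j$ has the same marginal distribution, and every pair $(X_i, X_j)$ with $i \neq j$ has the same joint distribution, so:
$$\operatorname{Var}(X) = n\operatorname{Var}(X_1) + 2\binom{n}{2}\operatorname{Cov}(X_1, X_2)$$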
Confirm the symmetry genuinely holds before relying on it; never assume symmetry that isn't there.
Each $X_j$ is Bernoulli with success probability $p = \dfrac{w}{w + b}$, so $\operatorname{Var}(X_1) = pq$ with $q = 1 - p$. For the covariance, use the shortcut and the indicator-product fact:
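$$\operatorname{Cov}(X_1, X_2) = E(X_1 X_2) - E(X_1)\,E(X_2) = \frac{w}{w + b} \cdot \frac{w - 1}{w + b - 1} - \left(\frac{w}{w + b}\right)^2$$
Here $X_1 X_2$ is the indicator of the first two balls both being white, so $E(X_1 X_2)$ is the probability of that event.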
With both pieces known, assembling $\operatorname{Var}(X)$ is just algebra (final simplification left for next time).
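The assembly can be checked against a brute-force computation from the hypergeometric pmf. This sketch uses a hypothetical jar ($w = 6$, $b = 4$, $n = 3$) and takes $E(X_1 X_2) = \frac{w}{w+b} \cdot \frac{w-1}{w+b-1}$, the probability that the first two balls drawn are both white.

```python
from fractions import Fraction as F
from math import comb

w, b, n = 6, 4, 3   # hypothetical jar: 6 white, 4 black, draw 3

p = F(w, w + b)
var_1 = p * (1 - p)                                  # Var(X_1) = pq
cov_12 = F(w, w + b) * F(w - 1, w + b - 1) - p**2    # E(X_1 X_2) - E(X_1)E(X_2)

# Symmetry: n equal variances plus 2 * C(n, 2) equal covariances.
var_symmetry = n * var_1 + 2 * comb(n, 2) * cov_12

# Brute-force check from the hypergeometric pmf.
total = comb(w + b, n)
pmf = {k: F(comb(w, k) * comb(b, n - k), total) for k in range(n + 1)}
mean = sum(k * pk for k, pk in pmf.items())
var_pmf = sum(k**2 * pk for k, pk in pmf.items()) - mean**2

print(cov_12)                  # -2/75
print(var_symmetry, var_pmf)   # 14/25 14/25
```

The covariance comes out negative, as it should: drawing a white ball first leaves fewer white balls for the second draw.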