This lecture finishes conditional expectation as a topic in its own right. The focus is on conditioning on a random variable $X$ (not on an event), which produces a random variable $E(Y \mid X)$ that is itself a function of $X$. Two quick examples fix the notation and intuition before the general properties.
$E(Y \mid X)$ is the best prediction of $Y$ from $X$ in the mean-squared-error sense. Because it depends on the random $X$, it is a random variable, not a number — in contrast to conditioning on an event $A$, which gives the number $E(Y \mid A)$.
Let $X \sim \mathcal{N}(0,1)$ and $Y = X^2$. Two conditional expectations point in opposite directions.
Conditioning on $X$ means treating $X$ as known. If we know $X$, we know $X^2$ exactly, so the best prediction is $X^2$ itself:
$$E(Y \mid X) = E(X^2 \mid X) = X^2.$$
This is the only sensible answer: if we observe $X$ and the prediction were anything other than $X^2$, something would be wrong. A function of $X$, conditioned on $X$, is just that function.
Now reverse the roles and ask for $E(X \mid Y) = E(X \mid X^2)$. Here we observe $X^2$ but not $X$. Suppose $X^2 = a$ is known. Then $X = +\sqrt{a}$ or $X = -\sqrt{a}$, and by the symmetry of the normal these are equally likely: the magnitude tells us nothing about the sign. Averaging gives
$$E(X \mid X^2 = a) = \tfrac{1}{2}\sqrt{a} + \tfrac{1}{2}\bigl(-\sqrt{a}\bigr) = 0, \qquad\text{so}\qquad E(X \mid X^2) = 0.$$
This does not say $X$ and $X^2$ are independent (they are uncorrelated but not independent). It says that, as a single-number prediction, $X^2$ is useless for predicting $X$: we know the magnitude but learn nothing about the sign, so the best guess is $0$.
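A quick simulation can make both halves of this remark concrete. The sketch below is an illustration only, not part of the lecture: it draws standard normals with Python's `random` module, groups the draws by the approximate value of $X^2$, and averages $X$ within each group. The bin width, sample size, and seed are arbitrary assumptions.

```python
import random
import statistics

random.seed(0)  # arbitrary seed, only for reproducibility
xs = [random.gauss(0.0, 1.0) for _ in range(200_000)]

# Group the draws by the approximate value of X^2 (bins of width 0.5).
bins = {}
for x in xs:
    bins.setdefault(int(x * x / 0.5), []).append(x)

# Within a bin, X^2 is roughly fixed, so the bin average estimates E(X | X^2).
bin_means = {k: statistics.fmean(v) for k, v in sorted(bins.items())[:4]}
for k, m in bin_means.items():
    print(f"X^2 in [{0.5 * k:.1f}, {0.5 * (k + 1):.1f}): mean of X = {m:+.4f}")

# Sample covariance of X and X^2: near 0, even though X^2 is a function of X.
x_sq = [x * x for x in xs]
cov = (statistics.fmean([x * s for x, s in zip(xs, x_sq)])
       - statistics.fmean(xs) * statistics.fmean(x_sq))
print(f"sample Cov(X, X^2) = {cov:+.4f}")
```

Every bin average should sit close to $0$, matching $E(X \mid X^2) = 0$, and the sample covariance should also be close to $0$, while the dependence is obvious from the construction: `x_sq` is computed directly from `xs`.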
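Take a stick of length $1$. Break it at a uniformly random point, keep one piece, and discard the rest; then break the kept piece again at a uniformly random point. Let $X$ be the first break point (equivalently, the length of the kept piece) and $Y$ the length of the second piece:
$$X \sim \text{Unif}(0, 1), \qquad Y \mid X \sim \text{Unif}(0, X).$$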
The notation $Y \mid X \sim \text{Unif}(0, X)$ is shorthand: if we knew $X = x$, then $Y$ would be $\text{Unif}(0, x)$. A uniform on $(0, x)$ has mean $x/2$, so
$$E(Y \mid X = x) = \frac{x}{2}, \qquad\text{that is,}\qquad E(Y \mid X) = \frac{X}{2}.$$
This is a random variable, a function of $X$. We can then take its expectation. Since $E(X) = 1/2$,
$$E\bigl(E(Y \mid X)\bigr) = E\!\left(\frac{X}{2}\right) = \frac{1}{2}\,E(X) = \frac{1}{4}.$$
By the iterated-expectation property (proved below), $E(E(Y \mid X)) = E(Y)$, so $E(Y) = 1/4$. The answer is intuitive: on average you keep half the stick, then half of that again.
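The value $1/4$ can be checked by simulating the two breaks directly. This is a minimal sketch, assuming the first break is $\text{Unif}(0,1)$ and the second is uniform on the kept piece; the seed and number of trials are arbitrary.

```python
import random
import statistics

random.seed(1)  # arbitrary seed, only for reproducibility
n_trials = 200_000

ys = []
for _ in range(n_trials):
    x = random.uniform(0.0, 1.0)  # first break: X ~ Unif(0, 1)
    y = random.uniform(0.0, x)    # second break: Y | X = x ~ Unif(0, x)
    ys.append(y)

mean_y = statistics.fmean(ys)
print(f"simulated E(Y) = {mean_y:.4f}  (iterated expectation gives 0.25)")
```

The simulation never uses the formula $E(Y \mid X) = X/2$; it averages $Y$ directly, so agreement with $1/4$ is an independent check on the iterated-expectation calculation.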
A handful of properties let you derive everything else. Throughout, $h$ is an arbitrary function and $X, Y$ are random variables (assumed to have finite variance where needed).
$$E\bigl(h(X)\, Y \mid X\bigr) = h(X)\, E(Y \mid X).$$
Because we condition on $X$, any $h(X)$ is a known constant and pulls out. This is what happened in Example 1: $E(X^2 \cdot 1 \mid X) = X^2\, E(1 \mid X) = X^2$, since $E(1 \mid \text{anything}) = 1$. The reverse move — multiplying a known $h(X)$ back inside — is putting back what's known.
$$E(Y \mid X) = E(Y) \quad\text{if } X, Y \text{ independent}.$$
If $X$ carries no information about $Y$, the conditional distribution of $Y$ given $X$ equals its unconditional one, so conditioning does not change the prediction. This is not an "if and only if": Example 1 had $E(X \mid X^2) = E(X) = 0$ even though $X$ and $X^2$ are not independent.
$$E\bigl(E(Y \mid X)\bigr) = E(Y).$$
Take the conditional expectation (a random variable), then average it, and the unconditional mean returns. Also called the law of iterated expectation or the tower property; in this department, Adam's law.
The value is in the direction $E(Y) = E(E(Y \mid X))$: when $E(Y)$ is hard to compute directly but $E(Y \mid X)$ is easy, choose $X$ cleverly, compute $E(Y \mid X)$, then average. This generalizes the law of total probability — exactly the move used to get $E(Y) = 1/4$ in the stick example.
$$E\bigl((Y - E(Y \mid X))\, h(X)\bigr) = 0 \quad\text{for every } h.$$
The quantity $Y - E(Y \mid X)$ is the residual: the actual value of $Y$ minus its predicted value. Property 4 says the residual is uncorrelated with any function of $X$. Since $E(Y - E(Y \mid X)) = E(Y) - E(Y) = 0$ by Adam's law, the covariance of $Y - E(Y \mid X)$ with $h(X)$ collapses to the single expectation above, so showing it is $0$ shows the covariance is $0$.
For those with linear-algebra background, conditional expectation is an orthogonal projection. (Skip if the picture is unfamiliar; it is intuition, not a requirement.)
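Treat each random variable as a vector, with inner product $\langle X, Y \rangle = E(XY)$. The collection of all functions of $X$ forms a "plane" through the origin: it contains the zero vector (every constant, including $0$, is a function of $X$), along with $X$, $X^2$, $e^X$, and so on. In this picture, $E(Y \mid X)$ is the projection of $Y$ onto that plane, and Property 4 says the residual $Y - E(Y \mid X)$ is orthogonal to every vector $h(X)$ in the plane.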
To prove Property 4, expand $E\bigl((Y - E(Y \mid X))\, h(X)\bigr)$ by linearity:
$$E(Y\, h(X)) - E\bigl(E(Y \mid X)\, h(X)\bigr).$$
Leave the first term alone. In the second, the form $E(E(\cdot \mid X))$ invites Adam's law, but $h(X)$ sits outside. Since $X$ is known inside the conditional expectation, put $h(X)$ back inside:
$$E\bigl(E(Y \mid X)\, h(X)\bigr) = E\bigl(E(h(X)\, Y \mid X)\bigr) = E\bigl(h(X)\, Y\bigr),$$
where the last step is Adam's law applied to $h(X)\,Y$. So the expression is
$$E(Y\, h(X)) - E(h(X)\, Y) = 0.$$
The ingredients: linearity, taking out / putting back what's known, and iterated expectation.
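Property 4 can also be checked numerically. The sketch below reuses the stick-breaking setup, where $E(Y \mid X) = X/2$, and estimates $E\bigl((Y - X/2)\,h(X)\bigr)$ for a few choices of $h$. The particular functions $h$, the seed, and the sample size are hypothetical choices made for illustration.

```python
import math
import random
import statistics

random.seed(2)  # arbitrary seed, only for reproducibility
pairs = []
for _ in range(200_000):
    x = random.uniform(0.0, 1.0)  # X ~ Unif(0, 1)
    y = random.uniform(0.0, x)    # Y | X = x ~ Unif(0, x)
    pairs.append((x, y))

# Hypothetical test functions h; Property 4 claims any function of X works.
test_functions = {
    "h(x) = x": lambda x: x,
    "h(x) = x^2": lambda x: x * x,
    "h(x) = exp(x)": math.exp,
    "h(x) = 1 if x > 0.5 else 0": lambda x: 1.0 if x > 0.5 else 0.0,
}

results = {}
for name, h in test_functions.items():
    # The residual is Y - E(Y | X) = Y - X/2 in the stick example.
    results[name] = statistics.fmean([(y - x / 2) * h(x) for x, y in pairs])
    print(f"{name:28s} E(residual * h(X)) ~ {results[name]:+.5f}")
```

Each estimate should be close to $0$ up to simulation noise, whatever shape $h$ takes.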
To prove Adam's law, take $X$ and $Y$ discrete (the continuous case is analogous, with integrals). Let $g(X) = E(Y \mid X)$; the goal is $E(g(X)) = E(Y)$. By LOTUS:
$$E(g(X)) = \sum_x g(x)\, P(X = x).$$
By definition, $g(x) = E(Y \mid X = x) = \sum_y y\, P(Y = y \mid X = x)$. Substituting and pulling $P(X = x)$ inside:
$$E(g(X)) = \sum_x \sum_y y\, P(Y = y \mid X = x)\, P(X = x).$$
Swap the order of summation (valid under absolute convergence), and recognize $P(Y = y \mid X = x)\, P(X = x) = P(Y = y, X = x)$, the joint PMF:
$$= \sum_y y \sum_x P(Y = y, X = x).$$
Since $y$ does not depend on $x$, it pulls out of the inner sum. Summing the joint PMF over all $x$ gives the marginal $P(Y = y)$ (adding up a row of the joint table). Hence
$$= \sum_y y\, P(Y = y) = E(Y).$$
The only real trick was writing a double sum and swapping the order; the rest was LOTUS plus the definitions of conditional, joint, and marginal distributions.
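The same steps can be traced on a small joint table with exact arithmetic. The joint PMF below is made up for illustration; the code computes the marginal of $X$, then $g(x) = E(Y \mid X = x)$, then both sides of the identity.

```python
from fractions import Fraction as F

# Hypothetical joint PMF P(X = x, Y = y); the numbers are invented for illustration.
joint = {
    (0, 1): F(1, 10), (0, 2): F(3, 10),
    (1, 1): F(2, 10), (1, 2): F(1, 10), (1, 5): F(3, 10),
}

xs = sorted({x for x, _ in joint})

# Marginal P(X = x): sum the joint PMF over y.
p_x = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in xs}

# g(x) = E(Y | X = x) = sum over y of y * P(Y = y | X = x).
g = {x: sum(y * p / p_x[x] for (xx, y), p in joint.items() if xx == x) for x in xs}

lhs = sum(g[x] * p_x[x] for x in xs)             # E(g(X)) via LOTUS
rhs = sum(y * p for (_, y), p in joint.items())  # E(Y) straight from the joint PMF
print(lhs, rhs)
```

Because `Fraction` arithmetic is exact, the two sides agree exactly rather than approximately; here both equal $13/5$.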
Conditional variance is defined by analogy with ordinary variance, with everything taken given $X$. Like $E(Y \mid X)$, it is a function of $X$ — a random variable, not a number.
Two equivalent forms (the equality is left as good practice):
$$\operatorname{Var}(Y \mid X) = E(Y^2 \mid X) - \bigl(E(Y \mid X)\bigr)^2$$
$$\operatorname{Var}(Y \mid X) = E\!\left[\bigl(Y - E(Y \mid X)\bigr)^2 \;\middle|\; X\right]$$
In the second form, the outer $\mid X$ is essential. Dropping it would collapse the result to a number, but conditional variance must depend on $X$. Everything in the expression is "given $X$" — do not forget one of the conditions.
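As a small check that the two forms agree, and that the result varies with $X$, the sketch below evaluates both on a made-up table of conditional PMFs. The table and the helper `cond_expect` are hypothetical, introduced only for this illustration.

```python
from fractions import Fraction as F

# Hypothetical conditional PMFs P(Y = y | X = x); the numbers are invented.
cond_pmf = {
    0: {1: F(1, 4), 2: F(3, 4)},
    1: {1: F(1, 3), 2: F(1, 6), 5: F(1, 2)},
}

def cond_expect(func, x):
    """Return E(func(Y) | X = x) for the table above."""
    return sum(func(y) * p for y, p in cond_pmf[x].items())

form_one = {}
form_two = {}
for x in cond_pmf:
    m = cond_expect(lambda y: y, x)                            # E(Y | X = x)
    form_one[x] = cond_expect(lambda y: y * y, x) - m ** 2     # E(Y^2 | X) - (E(Y | X))^2
    form_two[x] = cond_expect(lambda y, m=m: (y - m) ** 2, x)  # E((Y - E(Y | X))^2 | X)
    print(x, form_one[x], form_two[x])
```

The two forms match for each $x$, and the value differs between $x = 0$ and $x = 1$, which is the sense in which $\operatorname{Var}(Y \mid X)$ is a function of $X$ rather than a single number.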
$$\operatorname{Var}(Y) = E\bigl(\operatorname{Var}(Y \mid X)\bigr) + \operatorname{Var}\bigl(E(Y \mid X)\bigr).$$
This is Eve's law. The name comes from the pattern of letters on the right-hand side, $E$, $\operatorname{Var}$, $\operatorname{Var}$, $E$ (EVVE), and it pairs naturally with Adam's law. The proof reduces to conditional-expectation properties and is left as practice.
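Imagine a population split into subgroups, say three subpopulations, with $Y = \text{height}$ and $X = $ which subgroup a random person belongs to ($X = 1, 2, 3$). There are two kinds of variability:
- Within-group variation: heights differ among people in the same subgroup. This is $\operatorname{Var}(Y \mid X)$, and averaging it over subgroups gives $E\bigl(\operatorname{Var}(Y \mid X)\bigr)$.
- Between-group variation: the subgroup mean heights $E(Y \mid X)$ differ from one subgroup to another, which gives $\operatorname{Var}\bigl(E(Y \mid X)\bigr)$.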
Eve's law says the total variance is exactly the sum of these two pieces. The two effects do not interact in any complicated way; they simply add.
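The additivity can be verified exactly on a toy population. The three subgroups, their weights, and their height distributions below are invented for illustration; the code computes the within-group and between-group pieces and compares their sum with the variance of the pooled distribution.

```python
from fractions import Fraction as F

# Hypothetical population: group k -> (P(X = k), PMF of Y within group k).
groups = {
    1: (F(1, 2), {150: F(1, 2), 160: F(1, 2)}),
    2: (F(1, 4), {160: F(1, 2), 180: F(1, 2)}),
    3: (F(1, 4), {170: F(1, 4), 190: F(3, 4)}),
}

def mean(pmf):
    return sum(y * p for y, p in pmf.items())

def var(pmf):
    m = mean(pmf)
    return sum((y - m) ** 2 * p for y, p in pmf.items())

# Conditional mean and variance of Y in each group.
cond_mean = {k: mean(pmf) for k, (_, pmf) in groups.items()}
cond_var = {k: var(pmf) for k, (_, pmf) in groups.items()}

# Within-group piece: E(Var(Y | X)).
within = sum(w * cond_var[k] for k, (w, _) in groups.items())

# Between-group piece: Var(E(Y | X)), using E(Y) from iterated expectation.
grand_mean = sum(w * cond_mean[k] for k, (w, _) in groups.items())
between = sum(w * (cond_mean[k] - grand_mean) ** 2 for k, (w, _) in groups.items())

# Total variance computed directly from the pooled (marginal) distribution of Y.
marginal = {}
for w, pmf in groups.values():
    for y, p in pmf.items():
        marginal[y] = marginal.get(y, 0) + w * p
total = var(marginal)

print(within, between, total)
```

With exact fractions, `total` equals `within + between` with no rounding error, so the decomposition is an identity for this population and not an approximation.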
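A state contains many cities with differing disease prevalence. Sampling proceeds in two stages:
1. Choose a city at random.
2. Draw a random sample of $n$ people from that city and record who has the disease.

Define:
- $Q$: the proportion of people in the chosen city who have the disease, modeled as $Q \sim \text{Beta}(a, b)$.
- $X$: the number of people in the sample who have the disease, so that $X \mid Q \sim \text{Bin}(n, Q)$.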
This mirrors the within/between structure above: variation between cities (different $Q$) and variation within a city (binomial sampling at fixed $Q$).
Condition on $Q$. Given $Q$, $X \sim \text{Bin}(n, Q)$ with mean $nQ$, so $E(X \mid Q) = nQ$. Then
$$E(X) = E\bigl(E(X \mid Q)\bigr) = n\,E(Q) = \frac{na}{a+b},$$
using $E(Q) = a/(a+b)$ for $Q \sim \text{Beta}(a, b)$.
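For the variance, apply Eve's law with $\operatorname{Var}(X \mid Q) = nQ(1-Q)$ and $E(X \mid Q) = nQ$:
$$\operatorname{Var}(X) = E\bigl(nQ(1-Q)\bigr) + \operatorname{Var}(nQ) = n\,E(Q(1-Q)) + n^2\operatorname{Var}(Q).$$
Two Beta facts finish the job.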
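By LOTUS, integrate $q(1-q)$ against the $\text{Beta}(a,b)$ density on $(0,1)$:
$$E(Q(1-Q)) = \int_0^1 q(1-q)\,\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\,q^{a-1}(1-q)^{b-1}\,dq = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\int_0^1 q^{a}(1-q)^{b}\,dq.$$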
Multiplying the extra $q$ and $(1-q)$ into the density raises the exponents to $q^{a}$ and $(1-q)^{b}$ — the kernel of a $\text{Beta}(a+1, b+1)$. Insert the matching normalizing constant so the integral becomes that of a true Beta density (which integrates to $1$):
$$E(Q(1-Q)) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\cdot\frac{\Gamma(a+1)\,\Gamma(b+1)}{\Gamma(a+b+2)}.$$
Simplify with $\Gamma(x+1) = x\,\Gamma(x)$:
$$\boxed{\;E(Q(1-Q)) = \frac{ab}{(a+b)(a+b+1)}.\;}$$
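The boxed formula can be sanity-checked by integrating numerically. The sketch below uses a plain midpoint rule and `math.gamma` for the normalizing constant; the parameter values $a = 2.5$, $b = 4$ and the helper names are arbitrary choices for illustration.

```python
import math

def beta_pdf(q, a, b):
    """Density of Beta(a, b) at q, for 0 < q < 1."""
    const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return const * q ** (a - 1) * (1 - q) ** (b - 1)

def expected_q_one_minus_q(a, b, steps=100_000):
    """Approximate E(Q(1 - Q)) by the midpoint rule on (0, 1)."""
    width = 1.0 / steps
    total = 0.0
    for i in range(steps):
        q = (i + 0.5) * width
        total += q * (1 - q) * beta_pdf(q, a, b)
    return total * width

a, b = 2.5, 4.0  # arbitrary illustrative parameters
numeric = expected_q_one_minus_q(a, b)
closed_form = a * b / ((a + b) * (a + b + 1))
print(f"numeric = {numeric:.6f}, closed form = {closed_form:.6f}")
```

Trying other values of `a` and `b` (both greater than $0$) is an easy way to gain confidence in the Gamma-function simplification, though for parameters below $1$ the density is unbounded near the endpoints and this crude rule becomes less accurate.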
A clean form for the variance of a $\text{Beta}(a,b)$ is
$$\operatorname{Var}(Q) = \frac{\mu(1-\mu)}{a+b+1}, \qquad \mu = E(Q) = \frac{a}{a+b}.$$
(Verified the same way, via the Beta integral.)
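Combining these in Eve's law, $\operatorname{Var}(X) = n\,E(Q(1-Q)) + n^2\operatorname{Var}(Q)$, gives
$$\operatorname{Var}(X) = \frac{n\,ab}{(a+b)(a+b+1)} + \frac{n^2\,\mu(1-\mu)}{a+b+1},$$
substituting $E(Q(1-Q)) = \dfrac{ab}{(a+b)(a+b+1)}$ and $\operatorname{Var}(Q) = \dfrac{\mu(1-\mu)}{a+b+1}$ with $\mu = a/(a+b)$. Algebraic simplification is optional; the point is the structure: a within-city binomial term plus a between-city Beta term.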
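A two-stage simulation gives an end-to-end check of the mean and variance. This is a sketch under the model $Q \sim \text{Beta}(a, b)$, $X \mid Q \sim \text{Bin}(n, Q)$; the values $a = 2$, $b = 3$, $n = 10$, the seed, and the number of trials are arbitrary assumptions.

```python
import random
import statistics

random.seed(3)  # arbitrary seed, only for reproducibility
a, b, n = 2.0, 3.0, 10  # hypothetical Beta parameters and sample size
trials = 100_000

counts = []
for _ in range(trials):
    q = random.betavariate(a, b)                    # stage 1: city prevalence Q
    x = sum(random.random() < q for _ in range(n))  # stage 2: X | Q ~ Bin(n, Q)
    counts.append(x)

mu = a / (a + b)
mean_formula = n * mu
within_term = n * a * b / ((a + b) * (a + b + 1))  # n * E(Q(1 - Q))
between_term = n ** 2 * mu * (1 - mu) / (a + b + 1)  # n^2 * Var(Q)
var_formula = within_term + between_term

sim_mean = statistics.fmean(counts)
sim_var = statistics.pvariance(counts)
print(f"mean: simulated {sim_mean:.3f}, formula {mean_formula:.3f}")
print(f"var:  simulated {sim_var:.3f}, formula {var_formula:.3f}")
```

For these parameters the within-city term is $2$ and the between-city term is $4$, so a plain $\text{Bin}(n, \mu)$ model, whose variance would be $n\mu(1-\mu) = 2.4$, would badly understate the spread of $X$.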