Lecture 32: Markov Chains Continued

Harvard Statistics 110 (Joe Blitzstein)

1. Recap: What a Markov Chain Is

A Markov chain models a particle bouncing from state to state. It is memoryless: it does not remember or care how it arrived at its current state. The only information needed to predict the future is the current state.

Markov property

Given the present (the current state), the future and the past are conditionally independent. This is memorylessness in a different sense from that of the exponential distribution: here it means "the path taken to get here is irrelevant," not "the time already elapsed is irrelevant."
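In symbols, the property can be sketched as follows, writing $X_n$ for the state at time $n$ and $i_0, \ldots, i_{n-1}$ for an arbitrary past path (this indexing is an assumption made here for illustration):

$$P(X_{n+1} = j \mid X_n = i,\ X_{n-1} = i_{n-1},\ \ldots,\ X_0 = i_0) = P(X_{n+1} = j \mid X_n = i)$$

The right-hand side depends only on the current state $i$, not on the path that led there.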

A chain is specified by drawing states as nodes and transitions as arrows, with probabilities on the arrows. For simplicity in the picture examples below, assume that from any state you follow one of its outgoing arrows uniformly at random; in general the arrows can carry arbitrary probabilities.
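The "follow an outgoing arrow uniformly at random" convention can be turned into a transition matrix mechanically. The sketch below is a minimal illustration; the function name `uniform_transition_matrix` and the 3-state arrow diagram are hypothetical and are not one of the lecture's pictures.

```python
from fractions import Fraction

def uniform_transition_matrix(arrows):
    """Build a transition matrix from a dict: state -> list of arrow targets.

    Assumes every state has at least one outgoing arrow and that the
    walker picks one of them uniformly at random.
    """
    states = sorted(arrows)
    index = {state: k for k, state in enumerate(states)}
    Q = [[Fraction(0)] * len(states) for _ in states]
    for state, targets in arrows.items():
        for target in targets:
            Q[index[state]][index[target]] += Fraction(1, len(targets))
    return states, Q

# Hypothetical diagram: 1 -> 2, 1 -> 3, 2 -> 1, 3 -> 3 (self-loop), 3 -> 1.
arrows = {1: [2, 3], 2: [1], 3: [3, 1]}
states, Q = uniform_transition_matrix(arrows)
for row in Q:
    print([str(p) for p in row])
```

Each row of the result sums to 1, since from every state the walker must go somewhere.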

· · ·

3. Irreducibility

The first concept describes the whole chain.

Definition — Irreducible

A chain is irreducible if you can get from anywhere to anywhere: for every pair of states, it is possible (with positive probability) to get from one to the other in some finite number of steps — not necessarily in a single step.

"Irreducible" is standard but opaque terminology; Blitzstein notes that connected would have been a more descriptive word for the same idea.

Applying it to the gallery:

| Chain | Irreducible? | Why |
| --- | --- | --- |
| 1 (nice) | Yes | Every state reaches every other |
| 2 (trapdoor) | No (reducible) | Can't get from $\{4,5,6\}$ back to $\{1,2,3\}$ |
| 3 (gambler's ruin) | No (reducible) | Once at 0 or 3 you can't leave |
| 4 (cycle) | Yes | The cycle visits every state |

Reducible chains are annoying, but not a deep problem: you can always split a reducible chain into its irreducible components, study each separately, and stitch the results back together. In Chain 2, once you start in the top group you stay there forever, so the bottom group is irrelevant — and vice versa. For this reason, the theory focuses on irreducible chains.
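Irreducibility is a pure reachability question, so it can be checked with a graph search over the nonzero entries of the transition matrix. The sketch below assumes two stand-in matrices: a 3-state deterministic cycle and a gambler's ruin chain on $\{0,1,2,3\}$ with win probability $1/2$ (that probability is an assumption here). States are indexed from 0 in the code.

```python
from collections import deque

def reachable_from(Q, start):
    """Return the set of states reachable from `start` in zero or more steps."""
    seen = {start}
    queue = deque([start])
    while queue:
        i = queue.popleft()
        for j, p in enumerate(Q[i]):
            if p > 0 and j not in seen:
                seen.add(j)
                queue.append(j)
    return seen

def is_irreducible(Q):
    """True if every state can reach every other state."""
    n = len(Q)
    return all(len(reachable_from(Q, i)) == n for i in range(n))

# Stand-in matrices (assumed for illustration).
cycle = [[0, 1, 0],
         [0, 0, 1],
         [1, 0, 0]]
gamblers_ruin = [[1, 0, 0, 0],
                 [0.5, 0, 0.5, 0],
                 [0, 0.5, 0, 0.5],
                 [0, 0, 0, 1]]

print(is_irreducible(cycle))          # True
print(is_irreducible(gamblers_ruin))  # False
```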

· · ·

4. Recurrence and Transience

The next two concepts describe a single state within a chain.

Definition — Recurrent / Transient

A state is recurrent if, starting from that state, the chain is guaranteed (probability 1) to return to it. A state is transient if it is not recurrent — there is positive probability the chain leaves and never comes back.

Blitzstein's analogy: recurrent is like the tourist-board slogan "visit our city and you'll keep coming back" — it recurs over and over. There can't even be a $0.001$ chance of never returning.

Return once implies return infinitely often

If a state is recurrent, the chain doesn't just return once — it returns infinitely many times, with probability 1. Once the chain comes back, the Markov property means it has forgotten its history, so it faces exactly the same problem again: it returns once more with probability 1, and again, and again. (If the probability of the next return somehow decreased each time, the process wouldn't be Markov.) Conversely, a transient state may be revisited for a while, but eventually the chain stops returning to it forever.

Murphy's Law of probability (finite-state chains)

Key principle

In a chain with finitely many states, anything that can happen with positive probability will happen eventually — the probabilistic generalization of Murphy's Law. Even if returning to a state is extremely unlikely on any single attempt, with infinitely many attempts it eventually occurs.

Consequence: in an irreducible, finite-state chain, every state is recurrent.

Recurrence is tied to the chain's components, not just irreducibility. In Chain 2, even though you can't get from state 1 to state 4, all six states are recurrent: if you start in $\{4,5,6\}$ you revisit those forever, and if you start in $\{1,2,3\}$ you revisit those forever.

A small modification flips the answer

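Take Chain 2 and add one extra edge: a one-way arrow from state 3 down to state 6. The chain is still not irreducible (you still can't get from 4 back to 1). But now states 1, 2, and 3 are transient: each visit to state 3 carries a positive chance of dropping to state 6, and once the chain is in $\{4,5,6\}$ it can never come back. States 4, 5, and 6 remain recurrent.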

In the gambler's ruin chain (Chain 3): interior states 1 and 2 are transient (eventually you hit an absorbing end), while the absorbing endpoints 0 and 3 are recurrent (starting at 0 keeps you at 0 forever, trivially returning).
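For a finite chain, this classification can be automated using a standard fact that is not stated in these notes: a state $i$ is recurrent exactly when every state reachable from $i$ can also reach $i$ back. The sketch below applies that criterion to a gambler's ruin matrix with win probability $1/2$ (an assumed value); the function names are hypothetical.

```python
def reachable_sets(Q):
    """reach[i] = set of states reachable from i (including i itself)."""
    n = len(Q)
    reach = []
    for start in range(n):
        seen, stack = {start}, [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if Q[i][j] > 0 and j not in seen:
                    seen.add(j)
                    stack.append(j)
        reach.append(seen)
    return reach

def classify_states(Q):
    """Label each state of a finite chain as 'recurrent' or 'transient'."""
    reach = reachable_sets(Q)
    return ["recurrent" if all(i in reach[j] for j in reach[i]) else "transient"
            for i in range(len(Q))]

# Gambler's ruin on states 0..3, absorbing at 0 and 3 (p = 1/2 assumed).
gamblers_ruin = [[1, 0, 0, 0],
                 [0.5, 0, 0.5, 0],
                 [0, 0.5, 0, 0.5],
                 [0, 0, 0, 1]]

labels = classify_states(gamblers_ruin)
print(labels)  # ['recurrent', 'transient', 'transient', 'recurrent']
```

This criterion relies on the state space being finite; it does not carry over to infinite chains.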

· · ·

5. Periodicity

Even the deterministic cycle (Chain 4) has every state recurrent — it keeps returning. But it has another defect: periodicity. Index time as $1, 2, 3, \ldots$ starting at state 1. Then at every time that is a multiple of 3 the chain is in state 3 — completely predictable. The states cycle with a fixed period, and we will want to exclude this kind of behavior when discussing long-run convergence.
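One way to make "fixed period" concrete is the usual definition, which these notes do not spell out: the period of state $i$ is the greatest common divisor of all step counts $n$ for which a return to $i$ in exactly $n$ steps has positive probability. The sketch below approximates this by scanning $n$ up to a cutoff `max_steps`, which is a heuristic assumption. The `lazy` matrix is a hypothetical variant of the cycle with a self-loop added at the first state. States are indexed from 0.

```python
from math import gcd

def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def period(Q, i, max_steps=50):
    """gcd of all n <= max_steps with a positive chance of returning to i in n steps."""
    g = 0
    power = Q  # holds Q^n at the top of each iteration
    for n in range(1, max_steps + 1):
        if power[i][i] > 0:
            g = gcd(g, n)
        power = mat_mul(power, Q)
    return g

cycle = [[0, 1, 0],
         [0, 0, 1],
         [1, 0, 0]]
lazy = [[0.5, 0.5, 0],   # hypothetical: the cycle plus a self-loop at state 0
        [0, 0, 1],
        [1, 0, 0]]

print([period(cycle, i) for i in range(3)])  # [3, 3, 3]
print(period(lazy, 0))                       # 1
```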

Summary of the gallery

| Chain | Irreducible | Recurrence | Other defect |
| --- | --- | --- | --- |
| 1 (nice) | Yes | All recurrent | None (the target) |
| 2 (trapdoor) | No | All recurrent (per component) | Reducible |
| 2 + edge 3→6 | No | 1,2,3 transient; 4,5,6 recurrent | Reducible |
| 3 (gambler's ruin) | No | 0,3 recurrent; 1,2 transient | Absorbing states |
| 4 (cycle) | Yes | All recurrent | Periodic |

· · ·

6. Stationary Distributions

A stationary distribution is the central object of Markov chain theory. Write it as a probability row vector $s$ — just a PMF written out horizontally, with non-negative entries summing to 1.

Let $Q$ be the chain's transition matrix: $Q_{ij}$ is the probability of going from state $i$ to state $j$ in one step. The defining condition of stationarity is:

$$s\,Q = s$$

The interpretation: if $s$ is the distribution over states right now, then $sQ$ is the distribution one step later. So $sQ = s$ says that a chain started in distribution $s$ stays in distribution $s$ at every subsequent step — forever. Hence "stationary." More generally, $sQ^2$ is the distribution after two steps, $sQ^3$ after three, and the powers $Q^n$ give the $n$-step transition probabilities (probability of getting from $i$ to $j$ in $n$ steps).
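The "one step later" interpretation is just a row vector times a matrix. As a minimal sketch, the code below uses a hypothetical 2-state chain (not from the lecture) whose stationary distribution happens to be $s = (1/3,\, 2/3)$, and checks both $sQ = s$ and the two-step distribution from a deterministic start.

```python
from fractions import Fraction as F

def step(dist, Q):
    """Advance a distribution one time step: return the row vector dist * Q."""
    n = len(Q)
    return [sum(dist[i] * Q[i][j] for i in range(n)) for j in range(n)]

# Hypothetical 2-state chain, chosen only for illustration.
Q = [[F(1, 2), F(1, 2)],
     [F(1, 4), F(3, 4)]]
s = [F(1, 3), F(2, 3)]   # candidate stationary distribution
t = [F(1), F(0)]         # deterministic start in the first state

print(step(s, Q) == s)            # True: s is unchanged by a step
two_steps = step(step(t, Q), Q)   # the distribution t Q^2
print([str(p) for p in two_steps])  # ['3/8', '5/8']
```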

As an eigenvalue problem

For those who know linear algebra, $sQ = s$ is an eigenvalue/eigenvector equation: $s$ is a left eigenvector of $Q$ with eigenvalue 1. In principle, finding $s$ just means solving a linear system (for example, by Gaussian elimination). For very large chains this can be computationally intensive, even for a computer.
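For a small chain the linear system can be solved exactly. The sketch below is one possible approach, assuming the chain is irreducible so that the solution is unique: it writes $sQ = s$ as $(Q^{T} - I)\,s^{T} = 0$, replaces one redundant equation with the normalization $\sum_i s_i = 1$, and runs Gauss-Jordan elimination in exact rational arithmetic. The function name and the 2-state matrix are hypothetical.

```python
from fractions import Fraction

def stationary_distribution(Q):
    """Solve sQ = s with sum(s) = 1 by Gauss-Jordan elimination.

    Assumes an irreducible finite chain, so the system has a unique solution.
    """
    n = len(Q)
    # Equation j reads: sum_i s_i * Q[i][j] - s_j = 0 (augmented column is 0).
    A = [[Fraction(Q[i][j]) - (1 if i == j else 0) for i in range(n)] + [Fraction(0)]
         for j in range(n)]
    # The n equations are linearly dependent; swap the last for sum(s) = 1.
    A[-1] = [Fraction(1)] * n + [Fraction(1)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [x / A[col][col] for x in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return [A[r][n] for r in range(n)]

# Hypothetical 2-state chain.
Q = [[Fraction(1, 2), Fraction(1, 2)],
     [Fraction(1, 4), Fraction(3, 4)]]
s = stationary_distribution(Q)
print([str(p) for p in s])  # ['1/3', '2/3']
```

Elimination costs on the order of $n^3$ operations for $n$ states, which is why this direct route becomes impractical for very large chains.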

Four theorems

These hold for any irreducible, finite-state Markov chain. Rigorous proofs need substantial linear algebra, so they are stated without proof.

Theorem 1 — Existence

A stationary distribution $s$ always exists. A solution of $sQ = s$ with both positive and negative entries could not be rescaled into a PMF, but one can always find a solution whose entries are all non-negative (flipping the sign of an all-negative one if necessary) and then renormalize so the entries sum to 1.

Theorem 2 — Uniqueness

The stationary distribution is unique — even if the chain has a trillion states.

Theorem 3 — Return-Time Formula

$$s_i = \frac{1}{r_i}$$

where $r_i$ is the expected return time: the average number of steps to return to state $i$, starting from state $i$. (An irreducible finite chain is recurrent, so the chain is guaranteed to come back; $r_i$ is the expected value of how long that takes.)

Theorem 4 — Convergence

If the chain is also not periodic, then $P(X_n = i) \to s_i$ as $n \to \infty$, regardless of the starting state.

Intuition for Theorem 3

Think of $s_i$ as the long-run fraction of time spent in state $i$. If the chain is in state $i$ one-tenth of the time, then on average it takes 10 steps to return to $i$. So $s_i = \tfrac{1}{10} \Leftrightarrow r_i = 10$ — the two are reciprocal.
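The reciprocal relationship can be checked empirically. The sketch below simulates one long run of a hypothetical 2-state chain whose stationary distribution is $s = (1/3,\, 2/3)$, and averages the gaps between successive visits to the first state (index 0 in the code). The chain, the run length, and the seed are all assumptions chosen for illustration. The estimate should land near $1/s_0 = 3$.

```python
import random

def mean_return_time(Q, state, steps, seed=0):
    """Estimate the expected return time to `state` from one long simulated run."""
    rng = random.Random(seed)
    current = state
    last_visit = 0
    gaps = []
    for t in range(1, steps + 1):
        # Draw the next state using the current row of Q as weights.
        current = rng.choices(range(len(Q)), weights=Q[current])[0]
        if current == state:
            gaps.append(t - last_visit)
            last_visit = t
    return sum(gaps) / len(gaps)

# Hypothetical 2-state chain with stationary distribution (1/3, 2/3).
Q = [[0.5, 0.5],
     [0.25, 0.75]]
estimate = mean_return_time(Q, state=0, steps=200_000)
print(estimate)  # close to 3, up to simulation noise
```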

Stationary distributions are also called equilibrium or steady-state distributions (terms from physics and economics), all reflecting long-run behavior. Theorems 1–3 hold even for periodic chains; convergence requires ruling out periodicity. One clean sufficient condition:

Ruling out periodicity

If $Q^m$ is strictly positive (every entry $> 0$) for some $m$, then the chain has no periodicity problem (and is automatically irreducible, since $Q^m$ gives the probability of reaching anywhere from anywhere in exactly $m$ steps). For the deterministic cycle, taking powers of the $3 \times 3$ transition matrix produces zeros that oscillate in position — no single power is ever all-positive, the signature of periodicity.
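This condition is easy to test numerically for small chains. The sketch below searches for the first power of $Q$ with all entries positive, up to a cutoff `max_power` (a heuristic assumption). It compares the deterministic 3-cycle with a hypothetical variant that has a self-loop at the first state.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def first_positive_power(Q, max_power=50):
    """Return the smallest m <= max_power with every entry of Q^m > 0, else None."""
    power = Q
    for m in range(1, max_power + 1):
        if all(x > 0 for row in power for x in row):
            return m
        power = mat_mul(power, Q)
    return None

cycle = [[0, 1, 0],
         [0, 0, 1],
         [1, 0, 0]]
lazy = [[0.5, 0.5, 0],   # hypothetical: the cycle plus a self-loop at state 0
        [0, 0, 1],
        [1, 0, 0]]

print(first_positive_power(cycle))  # None: the zeros only move around
print(first_positive_power(lazy))   # 4
```

A `None` result only says that no power up to the cutoff is strictly positive; for the cycle, the zeros visibly rotate forever.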

In matrix terms, convergence says: for any probability vector $t$ (deterministic start = a single 1; or any random start), $t\,Q^n \to s$ as $n \to \infty$. Multiplying by $Q$ steps the distribution forward one unit of time; do this many times and you converge to $s$ no matter where you began.
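Convergence can be watched directly by stepping a starting vector forward repeatedly. The sketch below uses a hypothetical 2-state aperiodic chain with stationary distribution $(1/3,\, 2/3)$ and two different deterministic starts. The number of steps (30) is an arbitrary choice.

```python
def step(dist, Q):
    """Advance a distribution one time step: return the row vector dist * Q."""
    n = len(Q)
    return [sum(dist[i] * Q[i][j] for i in range(n)) for j in range(n)]

def distribution_after(t, Q, n):
    """Return t * Q^n by applying `step` n times."""
    dist = list(t)
    for _ in range(n):
        dist = step(dist, Q)
    return dist

# Hypothetical 2-state chain; its stationary distribution is (1/3, 2/3).
Q = [[0.5, 0.5],
     [0.25, 0.75]]

from_first = distribution_after([1.0, 0.0], Q, 30)
from_second = distribution_after([0.0, 1.0], Q, 30)
print(from_first)   # approximately [0.3333, 0.6667]
print(from_second)  # approximately [0.3333, 0.6667]
```

Both starts end up at the same vector, illustrating that the limit does not depend on where the chain began.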

The four theorems are powerful but leave a gap: computation. They don't say how to find $s$ efficiently — the return-time formula needs $r_i$ (itself hard), and solving $sQ = s$ directly can mean a lifetime of matrix algebra for large chains. We need a shortcut.

· · ·

7. Reversibility (Detailed Balance)

There is a special class of chains where the stationary distribution comes quickly — the "good kind of hard": it may take a clever idea, but it isn't tedious matrix grinding.

Definition — Reversible

A chain with transition matrix $Q$ (entries $Q_{ij}$) is reversible if there exists a probability vector $s$ satisfying the reversibility equation (in physics, detailed balance):

$$s_i\, Q_{ij} = s_j\, Q_{ji} \qquad \text{for all states } i, j$$

To go from the left side to the right, you just swap $i$ and $j$. For a given chain such an $s$ may or may not exist, and the equation doesn't directly say how to find it.

Reversibility implies stationarity

If $s_i Q_{ij} = s_j Q_{ji}$ for all $i, j$, then $s$ is stationary: $sQ = s$.

The proof is short and good practice with the concept:

  • Start from $s_i Q_{ij} = s_j Q_{ji}$, which holds for all $i, j$.
  • Sum both sides over all $i$ (fix $j$): $\displaystyle\sum_i s_i Q_{ij} = \sum_i s_j Q_{ji}$.
  • On the right, $s_j$ doesn't depend on $i$, so factor it out: $s_j \sum_i Q_{ji}$.
  • But $\sum_i Q_{ji}$ is the probability of going from $j$ to somewhere — each row of $Q$ sums to 1. So the right side is just $s_j$.
  • The left side $\sum_i s_i Q_{ij}$ is, by definition of matrix multiplication, the $j$-th entry of $sQ$.

Conclusion

$(sQ)_j = s_j$ for every $j$, i.e. $sQ = s$. So reversibility implies stationarity. $\blacksquare$
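Both conditions are simple to check numerically. The sketch below uses two hypothetical examples: a 2-state chain that satisfies detailed balance, and the deterministic 3-cycle with the uniform distribution. The second example is an observation added here, not part of the lecture: it is stationary but fails detailed balance, so the implication runs in one direction only.

```python
from fractions import Fraction as F

def satisfies_detailed_balance(s, Q):
    """Check s_i * Q_ij == s_j * Q_ji for all pairs of states."""
    n = len(Q)
    return all(s[i] * Q[i][j] == s[j] * Q[j][i]
               for i in range(n) for j in range(n))

def is_stationary(s, Q):
    """Check (sQ)_j == s_j for every state j."""
    n = len(Q)
    return all(sum(s[i] * Q[i][j] for i in range(n)) == s[j] for j in range(n))

# Hypothetical reversible 2-state chain.
Q2 = [[F(1, 2), F(1, 2)],
      [F(1, 4), F(3, 4)]]
s2 = [F(1, 3), F(2, 3)]
print(satisfies_detailed_balance(s2, Q2), is_stationary(s2, Q2))  # True True

# Deterministic 3-cycle with the uniform distribution.
cycle = [[0, 1, 0],
         [0, 0, 1],
         [1, 0, 0]]
uniform = [F(1, 3)] * 3
print(satisfies_detailed_balance(uniform, cycle), is_stationary(uniform, cycle))  # False True
```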

Why "reversible"? Time reversibility

Reversibility is also called time reversibility. Start the chain in distribution $s$ and videotape the wandering particle. Reversibility says that if you play the tape backward, an observer cannot tell whether it is running forward or backward — the process looks statistically identical in both directions. This can be verified with the definition of conditional probability and Bayes' rule. The same idea is central in physics, especially thermodynamics (where it is the detailed balance condition).
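The Bayes' rule verification can be sketched in one line. Assume the chain is started in $s$, so that $P(X_n = i) = s_i$ at every time $n$, and assume $s_i > 0$ (the notation $X_n$ for the state at time $n$ is used here for illustration). The probability that the chain was at $j$ one step earlier, given that it is at $i$ now, is

$$P(X_n = j \mid X_{n+1} = i) = \frac{P(X_{n+1} = i \mid X_n = j)\, P(X_n = j)}{P(X_{n+1} = i)} = \frac{Q_{ji}\, s_j}{s_i} = \frac{s_i\, Q_{ij}}{s_i} = Q_{ij}$$

where the third equality uses detailed balance, $s_j Q_{ji} = s_i Q_{ij}$. So one backward step from $i$ to $j$ has the same probability $Q_{ij}$ as one forward step from $i$ to $j$.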

· · ·

8. Example: Random Walk on an Undirected Network

This is the most important and general example of a reversible chain. Many problems can be encoded in this form.

Take an undirected graph: nodes connected by edges that are two-way streets, not arrows. (Any Markov chain is a random walk on a directed, weighted network; restricting to undirected, equally-weighted edges is what makes things nice.) The random walk: wherever you are, look at all edges available at your current node and pick one uniformly at random.

Setup

Consider a graph on 4 nodes: nodes 1, 2, and 3 are all connected to one another, and node 4 is attached only to node 3. The graph is connected, so the chain is irreducible (an isolated node would break irreducibility). Let $d_i$ be the degree of node $i$, the number of edges attached to it.

| Node | Degree $d_i$ |
| --- | --- |
| 1 | 2 |
| 2 | 2 |
| 3 | 3 |
| 4 | 1 |

(Assume no self-loops; the argument extends to loops with extra care.)

The key claim and its proof

Claim: $d_i\, Q_{ij} = d_j\, Q_{ji}$ for all $i, j$ — the reversibility equation with $s$ replaced by the degree vector $d$.
  • If $i = j$, both sides are identical — nothing to check. So assume $i \neq j$.
  • Because the graph is undirected, $Q_{ij}$ and $Q_{ji}$ are either both zero (no edge) or both nonzero (an edge exists). If both are zero, the equation reads $0 = 0$. (This is exactly where undirectedness is essential — a one-way arrow would break it.)
  • If there is an edge between $i$ and $j$: from state $i$ you pick uniformly among $d_i$ available edges, so $Q_{ij} = \tfrac{1}{d_i}$. Likewise $Q_{ji} = \tfrac{1}{d_j}$.
  • Substitute: $d_i \cdot \dfrac{1}{d_i} = d_j \cdot \dfrac{1}{d_j}$, i.e. $1 = 1$. True.

Conclusion

The degree vector satisfies detailed balance — no matrices, no Gaussian elimination — and the argument is completely general: the network could have 4 billion nodes and the proof is unchanged.

The stationary distribution

The degrees are positive integers, not yet a probability vector — but multiplying both sides of a reversibility equation by a constant keeps it valid. Normalize by the total degree. For a graph on $M$ nodes labeled $1$ through $M$ with degrees $d_i$:

$$s_i = \frac{d_i}{\sum_{j} d_j}$$

Result

The stationary probabilities are proportional to the degrees. This is intuitive: in the long run the walk spends more time at high-degree nodes, since more edges lead into them.

For the example graph, the total degree is $2 + 2 + 3 + 1 = 8$, giving $s = \left(\tfrac{2}{8}, \tfrac{2}{8}, \tfrac{3}{8}, \tfrac{1}{8}\right) = \left(\tfrac{1}{4}, \tfrac{1}{4}, \tfrac{3}{8}, \tfrac{1}{8}\right)$. This generalizes to weighted edges fairly directly, as long as the weight from $i$ to $j$ equals the weight from $j$ to $i$; asymmetric weights would complicate matters substantially.
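The unweighted example can be verified end to end in a few lines. The sketch below encodes the graph as the edge list 1-2, 1-3, 2-3, 3-4 (read off from the setup description), builds the degree-proportional vector, and checks both detailed balance and stationarity in exact arithmetic. The helper name `q` is hypothetical.

```python
from fractions import Fraction

# Edges of the example graph, as undirected pairs.
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
nodes = sorted({v for edge in edges for v in edge})

neighbors = {v: [] for v in nodes}
for a, b in edges:
    neighbors[a].append(b)
    neighbors[b].append(a)

degree = {v: len(neighbors[v]) for v in nodes}
total_degree = sum(degree.values())
s = {v: Fraction(degree[v], total_degree) for v in nodes}

def q(i, j):
    """Transition probability of the random walk: uniform over the edges at i."""
    return Fraction(1, degree[i]) if j in neighbors[i] else Fraction(0)

balanced = all(s[i] * q(i, j) == s[j] * q(j, i) for i in nodes for j in nodes)
stationary = all(sum(s[i] * q(i, j) for i in nodes) == s[j] for j in nodes)

print({v: str(p) for v, p in s.items()})  # {1: '1/4', 2: '1/4', 3: '3/8', 4: '1/8'}
print(balanced, stationary)               # True True
```

Changing `edges` to any other connected undirected graph without self-loops leaves the rest of the code unchanged, mirroring the generality of the proof.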