A Markov chain models a particle bouncing from state to state. It is memoryless: it does not remember or care how it arrived at its current state. The only information needed to predict the future is the current state.
Given the present (the current state), the future and the past are conditionally independent. This is memorylessness in a different sense from that of the exponential distribution: here it means "the path taken to get here is irrelevant," not "the time already elapsed is irrelevant."
A chain is specified by drawing states as nodes and transitions as arrows, with probabilities on the arrows. For simplicity in the picture examples below, assume that from any state you follow one of its outgoing arrows uniformly at random; in general the arrows can carry arbitrary probabilities.
Before introducing definitions, it pays to stare at a few small chains and notice which ones behave nicely and which are annoying.
Chain 1 (nice): from any state you can reach any other state. This is the kind of chain we ultimately want: you can wander freely among all states forever.
Chain 2: states 1, 2, 3 are connected among themselves, and separately states 4, 5, 6 are connected among themselves, with no arrows between the two groups. Start in $\{1,2,3\}$ and you stay there forever; start in $\{4,5,6\}$ and you stay there forever. The chain splits into pieces that don't communicate.
Chain 3 (gambler's ruin): states 0, 1, 2, 3 in a line; from an interior state you step left or right one step. But state 0 only loops to itself, and so does state 3: once you land on either end you stay there forever. These trapping endpoints are absorbing states. This chain is exactly gambler's ruin drawn as a Markov chain: the state is how much money gambler A has, and the walk wanders until A is bankrupt (state 0) or has all the money (state 3), then stays there.
Chain 4 (cycle): states $1 \to 2 \to 3 \to 1$, each forced to the next. There's no math to do, since we know exactly what it does, but its rigid, predictable cycling is also something we'll want to rule out.
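To make the pictures concrete, a chain can be stored as a list of outgoing arrows per state and converted to a transition matrix under the uniform-arrow convention. The following is a minimal sketch for the gambler's ruin chain and the three-state cycle; the helper name `uniform_transition_matrix` and the dictionary layout are illustrative choices, not part of the notes.

```python
from fractions import Fraction

def uniform_transition_matrix(arrows, states):
    """Build Q where each outgoing arrow of a state is equally likely."""
    index = {s: k for k, s in enumerate(states)}
    Q = [[Fraction(0)] * len(states) for _ in states]
    for s in states:
        targets = arrows[s]
        for t in targets:
            Q[index[s]][index[t]] += Fraction(1, len(targets))
    return Q

# Gambler's ruin on states 0..3: the two ends loop to themselves.
ruin_arrows = {0: [0], 1: [0, 2], 2: [1, 3], 3: [3]}
ruin_Q = uniform_transition_matrix(ruin_arrows, [0, 1, 2, 3])

# Deterministic cycle 1 -> 2 -> 3 -> 1.
cycle_arrows = {1: [2], 2: [3], 3: [1]}
cycle_Q = uniform_transition_matrix(cycle_arrows, [1, 2, 3])

for row in ruin_Q:
    print([str(p) for p in row])
```

Each row of the result sums to 1, which is a quick sanity check on any hand-built transition matrix.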
The first concept describes the whole chain.
A chain is irreducible if you can get from anywhere to anywhere: for every pair of states, it is possible (with positive probability) to get from one to the other in some finite number of steps — not necessarily in a single step.
"Irreducible" is standard but opaque terminology; Blitzstein notes that connected would have been a more descriptive word for the same idea.
Applying it to the gallery:
| Chain | Irreducible? | Why |
|---|---|---|
| 1 (nice) | Yes | Every state reaches every other |
| 2 (trapdoor) | No (reducible) | Can't get from $\{4,5,6\}$ back to $\{1,2,3\}$ |
| 3 (gambler's ruin) | No (reducible) | Once at 0 or 3 you can't leave |
| 4 (cycle) | Yes | The cycle visits every state |
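Irreducibility depends only on which arrows exist, not on their probabilities, so it can be checked with a plain graph search. A minimal sketch, assuming the chain is given as a dictionary of outgoing arrows (the names `reachable_from` and `is_irreducible` are illustrative):

```python
from collections import deque

def reachable_from(arrows, start):
    """Return the set of states reachable from start in zero or more steps."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in arrows[state]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def is_irreducible(arrows):
    """True if every state can reach every other state."""
    states = set(arrows)
    return all(reachable_from(arrows, s) == states for s in states)

ruin_arrows = {0: [0], 1: [0, 2], 2: [1, 3], 3: [3]}
cycle_arrows = {1: [2], 2: [3], 3: [1]}

print(is_irreducible(ruin_arrows))   # False
print(is_irreducible(cycle_arrows))  # True
```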
Reducible chains are annoying, but not a deep problem: you can always split a reducible chain into its irreducible components, study each separately, and stitch the results back together. In Chain 2, once you start in the top group you stay there forever, so the bottom group is irrelevant — and vice versa. For this reason, the theory focuses on irreducible chains.
The next two concepts describe a single state within a chain.
A state is recurrent if, starting from that state, the chain is guaranteed (probability 1) to return to it. A state is transient if it is not recurrent — there is positive probability the chain leaves and never comes back.
Blitzstein's analogy: recurrent is like the tourist-board slogan "visit our city and you'll keep coming back" — it recurs over and over. There can't even be a $0.001$ chance of never returning.
If a state is recurrent, the chain doesn't just return once — it returns infinitely many times, with probability 1. Once the chain comes back, the Markov property means it has forgotten its history, so it faces exactly the same problem again: it returns once more with probability 1, and again, and again. (If the probability of the next return somehow decreased each time, the process wouldn't be Markov.) Conversely, a transient state may be revisited for a while, but eventually the chain stops returning to it forever.
In a chain with finitely many states, anything that can happen with positive probability will happen eventually — the probabilistic generalization of Murphy's Law. Even if returning to a state is extremely unlikely on any single attempt, with infinitely many attempts it eventually occurs.
Consequence: in an irreducible, finite-state chain, every state is recurrent.
Recurrence is tied to the chain's components, not just irreducibility. In Chain 2, even though you can't get from state 1 to state 4, all six states are recurrent: if you start in $\{4,5,6\}$ you revisit those forever, and if you start in $\{1,2,3\}$ you revisit those forever.
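Take Chain 2 and add one extra edge: a one-way arrow from state 3 down to state 6. The chain is still not irreducible (you still can't get from 4 back to 1). But now states 1, 2, 3 are transient: every visit to state 3 carries a positive chance of dropping to state 6, and once the chain is in the bottom group it never returns. States 4, 5, 6 remain recurrent.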
In the gambler's ruin chain (Chain 3): interior states 1 and 2 are transient (eventually you hit an absorbing end), while the absorbing endpoints 0 and 3 are recurrent (starting at 0 keeps you at 0 forever, trivially returning).
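For a finite chain, these classifications can be read off from reachability alone: a state $i$ is recurrent exactly when every state reachable from $i$ can get back to $i$. That criterion is a standard fact about finite chains rather than something derived in these notes. The sketch below applies it. The internal arrows of the two three-state groups are an assumption (each group is taken to be a triangle with arrows in both directions), since the notes only say each group is connected within itself.

```python
def reachable_from(arrows, start):
    """Return the set of states reachable from start in zero or more steps."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in arrows[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def recurrent_states(arrows):
    """States i such that every state reachable from i can reach i again."""
    reach = {s: reachable_from(arrows, s) for s in arrows}
    return {i for i in arrows if all(i in reach[j] for j in reach[i])}

ruin_arrows = {0: [0], 1: [0, 2], 2: [1, 3], 3: [3]}
print(sorted(recurrent_states(ruin_arrows)))  # [0, 3]

# ASSUMPTION: each group is a two-way triangle; the extra arrow is 3 -> 6.
two_groups_plus_edge = {
    1: [2, 3], 2: [1, 3], 3: [1, 2, 6],
    4: [5, 6], 5: [4, 6], 6: [4, 5],
}
print(sorted(recurrent_states(two_groups_plus_edge)))  # [4, 5, 6]
```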
Even the deterministic cycle (Chain 4) has every state recurrent — it keeps returning. But it has another defect: periodicity. Index time as $1, 2, 3, \ldots$ starting at state 1. Then at every time that is a multiple of 3 the chain is in state 3 — completely predictable. The states cycle with a fixed period, and we will want to exclude this kind of behavior when discussing long-run convergence.
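The predictability is easy to see by stepping the cycle forward and recording when it sits in state 3. A small sketch, using the same time indexing as above (time 1 is state 1); the variable names are illustrative:

```python
cycle_next = {1: 2, 2: 3, 3: 1}

state = 1
times_in_state_3 = []
for t in range(1, 13):          # time 1 corresponds to state 1
    if state == 3:
        times_in_state_3.append(t)
    state = cycle_next[state]   # forced move to the next state

print(times_in_state_3)  # [3, 6, 9, 12]
```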
| Chain | Irreducible | Recurrence | Other defect |
|---|---|---|---|
| 1 (nice) | Yes | All recurrent | None — the target |
| 2 (trapdoor) | No | All recurrent (per component) | Reducible |
| 2 + edge 3→6 | No | 1,2,3 transient; 4,5,6 recurrent | Reducible |
| 3 (gambler's ruin) | No | 0,3 recurrent; 1,2 transient | Absorbing states |
| 4 (cycle) | Yes | All recurrent | Periodic |
A stationary distribution is the central object of Markov chain theory. Write it as a probability row vector $s$ — just a PMF written out horizontally, with non-negative entries summing to 1.
Let $Q$ be the chain's transition matrix: $Q_{ij}$ is the probability of going from state $i$ to state $j$ in one step. The defining condition of stationarity is:
$$sQ = s$$
The interpretation: if $s$ is the distribution over states right now, then $sQ$ is the distribution one step later. So $sQ = s$ says that a chain started in distribution $s$ stays in distribution $s$ at every subsequent step — forever. Hence "stationary." More generally, $sQ^2$ is the distribution after two steps, $sQ^3$ after three, and the powers $Q^n$ give the $n$-step transition probabilities (probability of getting from $i$ to $j$ in $n$ steps).
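A small numerical check can make this interpretation tangible. The two-state matrix below is a made-up example, not one of the gallery chains, and `step` is an illustrative helper that multiplies a row vector by $Q$. Exact fractions are used so that equality tests are meaningful.

```python
from fractions import Fraction as F

def step(dist, Q):
    """Return the row vector dist * Q (the distribution one step later)."""
    n = len(Q)
    return [sum(dist[i] * Q[i][j] for i in range(n)) for j in range(n)]

# Hypothetical two-state chain, used only for illustration.
Q = [[F(1, 2), F(1, 2)],
     [F(1, 4), F(3, 4)]]

s = [F(1, 3), F(2, 3)]
print(step(s, Q) == s)  # True: s is stationary for this Q

# Starting deterministically in state 0, the distribution after two steps:
start = [F(1), F(0)]
after_two = step(step(start, Q), Q)
print([str(p) for p in after_two])  # ['3/8', '5/8']
```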
For those who know linear algebra, $sQ = s$ is an eigenvalue/eigenvector equation: $s$ is a left eigenvector of $Q$ with eigenvalue 1. In principle, finding $s$ just means solving a linear system (Gaussian elimination). For very large chains this can be computationally intensive, even by computer.
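For small chains the linear system can be solved directly. The sketch below is one possible approach, not a prescribed method: it drops one of the redundant balance equations, replaces it with the normalization $\sum_i s_i = 1$, and runs Gauss-Jordan elimination in exact arithmetic. It assumes the chain is irreducible so that the solution is unique; the function name is illustrative.

```python
from fractions import Fraction as F

def stationary_by_elimination(Q):
    """Solve sQ = s with sum(s) = 1 by Gauss-Jordan elimination.

    Assumes the chain is irreducible, so the solution is unique.
    """
    n = len(Q)
    # Row j (for j < n - 1) encodes sum_i s_i * (Q[i][j] - [i == j]) = 0.
    A = [[F(Q[i][j]) - (1 if i == j else 0) for i in range(n)] + [F(0)]
         for j in range(n - 1)]
    # The last row encodes the normalization s_0 + ... + s_{n-1} = 1.
    A.append([F(1)] * n + [F(1)])
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [x / A[col][col] for x in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return [A[r][n] for r in range(n)]

# Hypothetical two-state chain, used only for illustration.
Q = [[F(1, 2), F(1, 2)],
     [F(1, 4), F(3, 4)]]
print([str(p) for p in stationary_by_elimination(Q)])  # ['1/3', '2/3']
```

Elimination costs on the order of $n^3$ operations for $n$ states, which is why this direct route becomes impractical for very large chains.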
The following four theorems hold for any irreducible, finite-state Markov chain (the fourth needs one extra assumption, stated with it). Rigorous proofs need substantial linear algebra, so they are stated without proof.
Theorem 1 (existence). A stationary distribution $s$ always exists. When solving $sQ = s$, a solution with mixed positive and negative entries would be useless as a PMF; but one can always find an all-non-negative solution (or rescale an all-negative one), then renormalize so the entries sum to 1.
Theorem 2 (uniqueness). The stationary distribution is unique, even if the chain has a trillion states.
Theorem 3 (return times). For every state $i$,
$$s_i = \frac{1}{r_i}$$
where $r_i$ is the expected return time: the average number of steps to return to state $i$, starting from state $i$. (An irreducible finite chain is recurrent, so the chain is guaranteed to come back; $r_i$ is the expected value of how long that takes.)
Theorem 4 (convergence). If the chain is also not periodic, then $P(X_n = i) \to s_i$ as $n \to \infty$, regardless of the starting state.
Think of $s_i$ as the long-run fraction of time spent in state $i$. If the chain is in state $i$ one-tenth of the time, then on average it takes 10 steps to return to $i$. So $s_i = \tfrac{1}{10} \Leftrightarrow r_i = 10$ — the two are reciprocal.
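The reciprocal relationship can be checked by simulation. The sketch below uses a made-up two-state chain whose stationary distribution is $s = (\tfrac13, \tfrac23)$ (easy to verify by hand from $sQ = s$), so the expected return time to state 0 should be 3. The function name, step count, and seed are arbitrary illustrative choices.

```python
import random

def simulate_mean_return_time(Q, state, steps, seed=0):
    """Estimate the mean number of steps between visits to `state`."""
    rng = random.Random(seed)
    current = state
    last_visit = 0
    gaps = []
    for t in range(1, steps + 1):
        # Draw the next state using row `current` of Q as weights.
        current = rng.choices(range(len(Q)), weights=Q[current])[0]
        if current == state:
            gaps.append(t - last_visit)
            last_visit = t
    return sum(gaps) / len(gaps)

# Hypothetical two-state chain with s = (1/3, 2/3).
Q = [[0.5, 0.5],
     [0.25, 0.75]]
estimate = simulate_mean_return_time(Q, state=0, steps=100_000)
print(round(estimate, 2))  # should be close to r_0 = 1 / s_0 = 3
```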
Stationary distributions are also called equilibrium or steady-state distributions (terms from physics and economics), all reflecting long-run behavior. Theorems 1–3 hold even for periodic chains; convergence requires ruling out periodicity. One clean sufficient condition:
If $Q^m$ is strictly positive (every entry $> 0$) for some $m$, then the chain has no periodicity problem (and is automatically irreducible, since $Q^m$ gives the probability of reaching anywhere from anywhere in exactly $m$ steps). For the deterministic cycle, taking powers of the $3 \times 3$ transition matrix produces zeros that oscillate in position — no single power is ever all-positive, the signature of periodicity.
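This condition can be tested mechanically by taking successive powers. In the sketch below, `first_positive_power` and its cutoff `max_power` are illustrative. Finding no positive power below the cutoff is only evidence in general, although for the deterministic cycle the zeros really do persist forever. The "lazy" cycle, which stays put with probability 1/2, is a hypothetical variant added for contrast.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def first_positive_power(Q, max_power=50):
    """Smallest m <= max_power with every entry of Q^m > 0, else None."""
    power = Q
    for m in range(1, max_power + 1):
        if all(entry > 0 for row in power for entry in row):
            return m
        power = mat_mul(power, Q)
    return None

cycle_Q = [[0, 1, 0],
           [0, 0, 1],
           [1, 0, 0]]

# Hypothetical variant: stay put with probability 1/2, else move on.
lazy_cycle_Q = [[0.5, 0.5, 0],
                [0, 0.5, 0.5],
                [0.5, 0, 0.5]]

print(first_positive_power(cycle_Q))       # None
print(first_positive_power(lazy_cycle_Q))  # 2
```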
In matrix terms, convergence says: for any probability vector $t$ (deterministic start = a single 1; or any random start), $t\,Q^n \to s$ as $n \to \infty$. Multiplying by $Q$ steps the distribution forward one unit of time; do this many times and you converge to $s$ no matter where you began.
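Repeated multiplication is simple to try out. The sketch below again uses a made-up two-state chain with $s = (\tfrac13, \tfrac23)$; the helper names and the choice of 50 steps are illustrative.

```python
def step(dist, Q):
    """Return the row vector dist * Q."""
    n = len(Q)
    return [sum(dist[i] * Q[i][j] for i in range(n)) for j in range(n)]

def iterate(start, Q, n_steps):
    """Apply Q to the starting distribution n_steps times."""
    dist = list(start)
    for _ in range(n_steps):
        dist = step(dist, Q)
    return dist

# Hypothetical two-state chain with stationary distribution (1/3, 2/3).
Q = [[0.5, 0.5],
     [0.25, 0.75]]

from_state_0 = iterate([1.0, 0.0], Q, 50)
from_state_1 = iterate([0.0, 1.0], Q, 50)
print([round(p, 6) for p in from_state_0])  # [0.333333, 0.666667]
print([round(p, 6) for p in from_state_1])  # [0.333333, 0.666667]
```

Both deterministic starts land on the same vector, as the convergence statement predicts for a chain with no periodicity problem.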
The four theorems are powerful but leave a gap: computation. They don't say how to find $s$ efficiently — the return-time formula needs $r_i$ (itself hard), and solving $sQ = s$ directly can mean a lifetime of matrix algebra for large chains. We need a shortcut.
There is a special class of chains where the stationary distribution comes quickly — the "good kind of hard": it may take a clever idea, but it isn't tedious matrix grinding.
A chain with transition matrix $Q$ (entries $Q_{ij}$) is reversible if there exists a probability vector $s$ satisfying the reversibility equation (in physics, detailed balance):
$$s_i\, Q_{ij} = s_j\, Q_{ji} \qquad \text{for all states } i, j$$
To go from the left side to the right, you just swap $i$ and $j$. For a given chain such an $s$ may or may not exist, and the equation doesn't directly say how to find it.
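Claim: if $s$ satisfies the reversibility equation, then $s$ is stationary. The proof is short and good practice with the concept. Fix a state $j$ and sum the reversibility equation over all $i$:
$$\sum_i s_i\, Q_{ij} = \sum_i s_j\, Q_{ji} = s_j \sum_i Q_{ji} = s_j,$$
since each row of $Q$ sums to 1. The left side is the $j$-th entry of $sQ$, so $(sQ)_j = s_j$ for every $j$, i.e. $sQ = s$. So reversibility implies stationarity. $\blacksquare$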
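The implication runs one way only, which a quick check makes visible. In the sketch below (helper names illustrative), a made-up two-state chain satisfies detailed balance and is therefore stationary, while the uniform distribution on the deterministic three-state cycle is stationary without satisfying detailed balance.

```python
from fractions import Fraction as F

def satisfies_detailed_balance(s, Q):
    """Check s_i Q_ij == s_j Q_ji for all pairs of states."""
    n = len(Q)
    return all(s[i] * Q[i][j] == s[j] * Q[j][i]
               for i in range(n) for j in range(n))

def is_stationary(s, Q):
    """Check (sQ)_j == s_j for every state j."""
    n = len(Q)
    return all(sum(s[i] * Q[i][j] for i in range(n)) == s[j]
               for j in range(n))

# Hypothetical two-state chain: reversible, hence stationary.
Q = [[F(1, 2), F(1, 2)],
     [F(1, 4), F(3, 4)]]
s = [F(1, 3), F(2, 3)]
print(satisfies_detailed_balance(s, Q), is_stationary(s, Q))  # True True

# Deterministic cycle: uniform is stationary but not reversible.
cycle_Q = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
uniform = [F(1, 3)] * 3
print(satisfies_detailed_balance(uniform, cycle_Q),
      is_stationary(uniform, cycle_Q))  # False True
```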
Reversibility is also called time reversibility. Start the chain in distribution $s$ and videotape the wandering particle. Reversibility says that if you play the tape backward, an observer cannot tell whether it is running forward or backward — the process looks statistically identical in both directions. This can be verified with the definition of conditional probability and Bayes' rule. The same idea is central in physics, especially thermodynamics (where it is the detailed balance condition).
The random walk on an undirected network is the most important and general example of a reversible chain. Many problems can be encoded in this form.
Take an undirected graph: nodes connected by edges that are two-way streets, not arrows. (Any Markov chain is a random walk on a directed, weighted network; restricting to undirected, equally-weighted edges is what makes things nice.) The random walk: wherever you are, look at all edges available at your current node and pick one uniformly at random.
Consider a graph on 4 nodes with edges $\{1,2\}$, $\{1,3\}$, $\{2,3\}$, and $\{3,4\}$: nodes 1 and 2 are connected to each other, node 3 is connected to every other node, and node 4 is attached only to node 3. The graph is connected, so the chain is irreducible (an isolated node would break irreducibility). Let $d_i$ be the degree of node $i$: the number of edges attached to it.
| Node | Degree $d_i$ |
|---|---|
| 1 | 2 |
| 2 | 2 |
| 3 | 3 |
| 4 | 1 |
(Assume no self-loops; the argument extends to loops with extra care.)
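The degree vector satisfies detailed balance, $d_i\, Q_{ij} = d_j\, Q_{ji}$. If nodes $i$ and $j$ share an edge, then $Q_{ij} = 1/d_i$ and $Q_{ji} = 1/d_j$, so both sides equal 1; if they do not, both sides equal 0. No matrices and no Gaussian elimination are needed, and the argument is completely general: the network could have 4 billion nodes and the proof is unchanged.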
The degrees are positive integers, not yet a probability vector, but multiplying both sides of a reversibility equation by a constant keeps it valid. Normalize by the total degree. For a graph on $M$ nodes labeled $1$ through $M$ with degrees $d_i$:
$$s_i = \frac{d_i}{\sum_{j=1}^{M} d_j}$$
The stationary probabilities are proportional to the degrees. Intuitive: in the long run the walk spends more time at high-degree nodes, since they have more edges feeding into them.
For the example graph, the total degree is $2 + 2 + 3 + 1 = 8$, giving $s = \left(\tfrac{2}{8}, \tfrac{2}{8}, \tfrac{3}{8}, \tfrac{1}{8}\right) = \left(\tfrac{1}{4}, \tfrac{1}{4}, \tfrac{3}{8}, \tfrac{1}{8}\right)$. This generalizes to weighted edges fairly directly, as long as the weight from $i$ to $j$ equals the weight from $j$ to $i$; asymmetric weights would complicate matters substantially.
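The example can be verified end to end in a few lines. The sketch below builds the walk from the edge list implied by the description of the 4-node graph, forms $s$ from the degrees, and checks both detailed balance and stationarity in exact arithmetic. The variable names and the function `Q` are illustrative.

```python
from fractions import Fraction as F

# Edge list read off from the description of the example graph.
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
nodes = [1, 2, 3, 4]

neighbors = {v: [] for v in nodes}
for a, b in edges:
    neighbors[a].append(b)
    neighbors[b].append(a)

degree = {v: len(neighbors[v]) for v in nodes}
total = sum(degree.values())
s = {v: F(degree[v], total) for v in nodes}

def Q(i, j):
    """One-step probability: pick one of node i's edges uniformly."""
    return F(1, degree[i]) if j in neighbors[i] else F(0)

balanced = all(s[i] * Q(i, j) == s[j] * Q(j, i) for i in nodes for j in nodes)
stationary = all(sum(s[i] * Q(i, j) for i in nodes) == s[j] for j in nodes)

print([str(s[v]) for v in nodes])  # ['1/4', '1/4', '3/8', '1/8']
print(balanced, stationary)        # True True
```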