Lecture 31: Markov Chains

Harvard Statistics 110 (Joe Blitzstein)

1. Stochastic Processes and the Markov Property

A stochastic process is a collection of random variables evolving over time. The course began with one random variable at a time, then two; later, with the law of large numbers and the central limit theorem, it studied sequences. In those settings the sequence was usually assumed IID — independent and identically distributed — a very strong assumption: each new variable resets to a fresh draw, ignoring everything before it.

Key idea

A Markov chain is a stochastic process that goes exactly one step beyond IID. The variables become dependent, but in a very special, tractable way.

Classification of stochastic processes

Indexing a sequence by time gives $X_0, X_1, X_2, \ldots$ A process can be classified along two axes:

| Axis | Discrete | Continuous |
|---|---|---|
| Time | $n$ a non-negative integer ($X_0, X_1, \ldots$) | $X_t$ for continuous $t$ |
| Space | each $X$ takes finitely / countably many values | each $X$ takes values in a continuous space (e.g., any real) |

Stat 110 studies only discrete-time, discrete-space chains, and further assumes finitely many states to keep things simple. (For continuous-time and continuous-space processes, the suggested follow-up is Stat 171.) Think of $X_n$ as the state of a system at time $n$ — a particle wandering randomly from state to state.

The Markov property

Number the states $1$ to $M$ (a convention, not a requirement). To predict the next state, that is, to find the probability that $X_{n+1} = j$, one would in general need the entire past history:

$$P(X_{n+1} = j \mid X_n = i,\ X_{n-1} = i_{n-1},\ \ldots,\ X_0 = i_0)$$

For a general stochastic process this conditional probability can be hopelessly complicated. The Markov property says that only the most recent state matters — everything further back can be discarded.

The Markov Property

$$P(X_{n+1} = j \mid X_n = i,\ X_{n-1} = i_{n-1},\ \ldots,\ X_0 = i_0) = P(X_{n+1} = j \mid X_n = i)$$

How to remember it

The past and future are conditionally independent given the present. Once you know the current state $X_n$, anything older is obsolete, outdated information.

A subtlety about conditional independence: this does not say the past and future are independent outright. If you were given $X_{n-1}$ but not $X_n$, you could not discard $X_{n-1}$ — it is only obsolete given the present.

Homogeneity and transition probabilities

We further assume the one-step probability does not depend on the time $n$. Write it as

$$q_{ij} = P(X_{n+1} = j \mid X_n = i)$$

This is the transition probability from state $i$ to state $j$. A chain whose transition probabilities don't change with time is called homogeneous (or time-homogeneous). The word "homogeneous" is often dropped, so judge from context. Stat 110 studies only homogeneous chains.

· · ·

2. A Concrete Four-State Example

It helps to keep a small picture in mind. Consider a chain with four states drawn as ovals, with arrows labeled by transition probabilities:

From state 1: stays at 1 with probability $\frac{1}{3}$, goes to 2 with probability $\frac{2}{3}$.
From state 2: goes to 1 or to 3, each with probability $\frac{1}{2}$.
From state 3: always goes to 4 (probability $1$).
From state 4: goes to 1 with probability $\frac{1}{2}$, to 3 with probability $\frac{1}{4}$, stays at 4 with probability $\frac{1}{4}$.

In general, not every state can be reached from every other state in one step (e.g., state 1 cannot reach 3 or 4 in a single step, though it can in more steps). With finitely many states you can always draw such a picture: a particle bouncing between states along the arrows.

The picture in words

To predict the future, all that matters is the current state. It does not matter how the particle got there — the long history of wanderings is irrelevant; only where it is now counts.
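A minimal simulation sketch of the particle picture, assuming Python's standard `random` module. The names `step` and `simulate` and the seed value are illustrative choices, not from the lecture. Note that `step` looks only at the current state, which is the Markov property expressed in code.

```python
import random

# Transition probabilities of the four-state example
# (row = current state, column = next state).
Q = [
    [1/3, 2/3, 0.0, 0.0],
    [1/2, 0.0, 1/2, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [1/2, 0.0, 1/4, 1/4],
]

def step(state, rng):
    """Draw the next state (numbered 1..4) using only the current state."""
    row = Q[state - 1]
    return rng.choices([1, 2, 3, 4], weights=row)[0]

def simulate(start, n_steps, seed=0):
    """Return the path X_0, X_1, ..., X_{n_steps} as a list of states."""
    rng = random.Random(seed)  # arbitrary seed, only for reproducibility
    path = [start]
    for _ in range(n_steps):
        path.append(step(path[-1], rng))
    return path

path = simulate(start=1, n_steps=20)
print(path)
```

Every consecutive pair in `path` follows an arrow of the picture, since transitions with probability 0 are never drawn.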

· · ·

3. The Transition Matrix

The same information can be encoded as a matrix $Q$ (the transition matrix), often more convenient than redrawing the picture. The $(i, j)$ entry is $q_{ij}$, the probability of jumping from $i$ to $j$ in one step. For the example above:

$$Q = \begin{pmatrix} 1/3 & 2/3 & 0 & 0 \\ 1/2 & 0 & 1/2 & 0 \\ 0 & 0 & 0 & 1 \\ 1/2 & 0 & 1/4 & 1/4 \end{pmatrix}$$

Notice that every row sums to 1: starting from a given state, the chain must go somewhere. Columns may or may not sum to 1; in the example, the first column sums to $\frac{1}{3} + \frac{1}{2} + 0 + \frac{1}{2} = \frac{4}{3}$. (Some authors use the transpose convention, in which columns sum to 1; this course uses rows summing to 1.)

Valid transition matrix

Write down any square matrix with non-negative entries whose rows each sum to 1. Every such matrix corresponds to a Markov chain and can be converted back to an arrow picture.
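The validity conditions can be checked mechanically. The sketch below uses exact fractions so that "sums to 1" can be tested with `==`; the function name `is_transition_matrix` is an illustrative choice, and with floating-point entries a small tolerance would be needed instead.

```python
from fractions import Fraction as F

# Transition matrix of the four-state example, as exact fractions.
Q = [
    [F(1, 3), F(2, 3), F(0), F(0)],
    [F(1, 2), F(0), F(1, 2), F(0)],
    [F(0), F(0), F(0), F(1)],
    [F(1, 2), F(0), F(1, 4), F(1, 4)],
]

def is_transition_matrix(Q):
    """True if Q is square, has non-negative entries, and each row sums to 1."""
    n = len(Q)
    return (
        n > 0
        and all(len(row) == n for row in Q)
        and all(q >= 0 for row in Q for q in row)
        and all(sum(row) == 1 for row in Q)
    )

# Column sums are not constrained; compute them for comparison.
column_sums = [sum(row[j] for row in Q) for j in range(len(Q))]

print(is_transition_matrix(Q))
print(column_sums)
```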

The two most important concepts beyond the basic definition are the transition matrix (now defined) and the stationary distribution (introduced below).

· · ·

4. How Markov Chains Are Used; Markov's Original Motivation

Markov chains were introduced by Andrei Markov — already familiar from Markov's inequality — just over 100 years ago (around 1906). They are used today in two broad ways.

As a model

In the social sciences, physical sciences, and biology, one may genuinely believe a system is (approximately) a Markov chain, or use one as a convenient approximation for a system evolving over time. The limitation: the conditional-independence assumption is strong. Whether a stock price over time, or the weather, is Markovian has generated thousands of pages of debate — these are empirical questions.

The first-order assumption is less limiting than it looks. One can generalize so the future depends on $X_n$ and $X_{n-1}$ (or 10 steps back). The right starting point for understanding such higher-order chains is the first-order theory anyway.

Markov chain Monte Carlo (MCMC)

The other major use essentially created a revolution in scientific computing. The challenge "find a major field where MCMC has never been applied" has never been met — one can even find Markov chains applied to French poetry.

The MCMC idea

With MCMC you synthetically construct your own chain, so the question "is the real process Markovian?" never arises — you built the chain. Why build one? Because you can construct a chain that converges to a distribution you care about. To study a distribution too complicated to handle analytically, you cleverly design a chain that converges to it, program the chain, run it for a long time, and use the results to study the target distribution.

This requires fast computers, so it is a recent idea — calculations now routine were impossible 30 to 50 years ago.

Markov's actual reason: the free-will debate

Both modern uses differ from Markov's own motivation. He introduced Markov chains to settle a religious-philosophical debate over free will. The law of large numbers had recently been proven, and some philosophers worried it left no room for free will: if behavior always converges to the mean in the long run, where is the scope for choice?

One of Markov's rivals tried to rescue free will by arguing the law of large numbers assumes IID, while human behavior is not IID — so we are safe. But this is unconvincing: proving "IID implies the law of large numbers" does not prove "not IID implies it fails."

So Markov sought a process one step beyond IID, in which each variable depends on the past only through the one immediately before it, and proved a version of the law of large numbers that holds for such chains, showing IID is not needed. His first chain (possibly the very first Markov chain) was empirical: he took a Russian novel, classified letters as vowel or consonant (two states), and estimated the probabilities of vowel-following-vowel, vowel-following-consonant, and so on. Conceptually this is the same picture as the four-state example, only simpler.
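The vowel/consonant estimate can be sketched by counting consecutive letter pairs. This is only an illustration of the idea, not Markov's data or procedure: the sample string, the English vowel set (with "y" treated as a consonant), and the function name `estimate_transitions` are all assumptions.

```python
from collections import Counter
from fractions import Fraction

VOWELS = set("aeiou")  # assumption: English vowels, "y" counted as a consonant

def estimate_transitions(text):
    """Estimate a two-state (V = vowel, C = consonant) transition matrix
    from the relative frequencies of consecutive letter pairs."""
    letters = [c for c in text.lower() if c.isalpha()]
    states = ["V" if c in VOWELS else "C" for c in letters]
    counts = Counter(zip(states, states[1:]))
    Q_hat = {}
    for i in "VC":
        total = counts[(i, "V")] + counts[(i, "C")]
        if total == 0:
            raise ValueError(f"state {i} is never followed by another letter")
        Q_hat[i] = {j: Fraction(counts[(i, j)], total) for j in "VC"}
    return Q_hat

# Hypothetical sample text, chosen only to keep the counts easy to check.
Q_hat = estimate_transitions("markov chain")
print(Q_hat)
```

For this short sample, vowels are followed by a vowel in 1 of 4 cases and consonants by a vowel in 3 of 6 cases; a real estimate would need a much longer text.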

· · ·

5. Marginal Distributions: Powers of Q

The transition matrix gives one-step probabilities. To get multi-step behavior, take powers of $Q$.

One step forward

Suppose at time $n$ the chain $X_n$ has distribution $s$, written as a row vector (a $1 \times M$ matrix). Since there are $M$ states, the PMF is just the list of probabilities:

$$s = \big(P(X_n = 1),\ P(X_n = 2),\ \ldots,\ P(X_n = M)\big)$$

with non-negative entries summing to 1. To find the distribution at time $n+1$, condition on the current state (law of total probability):

$$P(X_{n+1} = j) = \sum_{i} P(X_{n+1} = j \mid X_n = i)\, P(X_n = i) = \sum_{i} s_i\, q_{ij}$$

That sum is exactly the $j$-th entry of the matrix product $s\,Q$. Dimensions check: $s$ is $1 \times M$ and $Q$ is $M \times M$, so $s\,Q$ is $1 \times M$ — a valid distribution vector.

The distribution one step into the future is $s\,Q$.

Repeating: the distribution $m$ steps ahead

Now $s\,Q$ plays the role of $s$, so the same argument gives:

two steps ahead: $s\,Q^2$
three steps ahead: $s\,Q^3$
$m$ steps ahead: $s\,Q^m$

Takeaway

To go one step into the future, multiply on the right by $Q$. Tedious by hand, but trivial on a computer: just take powers of $Q$, and you know how the distribution evolves.
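As a small sketch of this takeaway for the four-state example, the code below pushes a distribution forward by repeated right-multiplication by $Q$. The starting distribution (all mass on state 1) and the helper names are illustrative assumptions; exact fractions are used so the results can be compared by hand.

```python
from fractions import Fraction as F

# Transition matrix of the four-state example, as exact fractions.
Q = [
    [F(1, 3), F(2, 3), F(0), F(0)],
    [F(1, 2), F(0), F(1, 2), F(0)],
    [F(0), F(0), F(0), F(1)],
    [F(1, 2), F(0), F(1, 4), F(1, 4)],
]

def step_distribution(s, Q):
    """Row vector times matrix: entry j is sum_i s_i * q_ij."""
    M = len(Q)
    return [sum(s[i] * Q[i][j] for i in range(M)) for j in range(M)]

def distribution_after(s, Q, m):
    """Distribution m steps ahead, i.e. s Q^m, by stepping m times."""
    for _ in range(m):
        s = step_distribution(s, Q)
    return s

s0 = [F(1), F(0), F(0), F(0)]      # assumed start: the chain is in state 1
s1 = distribution_after(s0, Q, 1)  # s Q
s2 = distribution_after(s0, Q, 2)  # s Q^2

print(s1)
print(s2)
```

Here `s1` is $(1/3,\ 2/3,\ 0,\ 0)$ and `s2` is $(4/9,\ 2/9,\ 1/3,\ 0)$, each still summing to 1.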

· · ·

6. m-Step Transition Probabilities

A parallel result holds for the transition probabilities themselves. By definition $Q$ gives the one-step probabilities: $P(X_{n+1} = j \mid X_n = i) = q_{ij}$.

Two-step transition: condition on the intermediate state

For two steps, condition on the intermediate state $X_{n+1} = k$ (the missing link), with everything conditioned on $X_n = i$:

$$P(X_{n+2} = j \mid X_n = i) = \sum_{k} P(X_{n+2} = j \mid X_{n+1} = k,\ X_n = i)\, P(X_{n+1} = k \mid X_n = i)$$

The Markov property kills the $X_n$ in the first factor (knowing $X_{n+1}$ makes $X_n$ obsolete), leaving two one-step transitions:

$$P(X_{n+2} = j \mid X_n = i) = \sum_{k} q_{ik}\, q_{kj}$$

This is exactly row $i$, column $j$ of $Q^2$ (multiply $Q$ by $Q$: dot a row of $Q$ with a column of $Q$).
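As a worked instance for the four-state example (the choice $i = 1$, $j = 3$ is ours, for illustration), summing over the intermediate state $k$ gives:

$$(Q^2)_{13} = q_{11}q_{13} + q_{12}q_{23} + q_{13}q_{33} + q_{14}q_{43} = \tfrac{1}{3}\cdot 0 + \tfrac{2}{3}\cdot\tfrac{1}{2} + 0 \cdot 0 + 0 \cdot \tfrac{1}{4} = \tfrac{1}{3}$$

Only the term for $k = 2$ is non-zero, because $1 \to 2 \to 3$ is the only two-step route from state 1 to state 3.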

Repeating the argument:

m-step transition probabilities

The probability of going from $i$ to $j$ in $m$ steps is the $(i, j)$ entry of $Q^m$:

$$P(X_{n+m} = j \mid X_n = i) = (Q^m)_{ij}$$

So powers of the transition matrix carry all the multi-step information — no separate study of each transition or each number of steps is needed.
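A short sketch of computing $Q^m$ for the four-state example with plain Python lists and exact fractions. The helpers `mat_mul` and `mat_pow` are illustrative names, and repeated multiplication is used for clarity rather than speed.

```python
from fractions import Fraction as F

# Transition matrix of the four-state example, as exact fractions.
Q = [
    [F(1, 3), F(2, 3), F(0), F(0)],
    [F(1, 2), F(0), F(1, 2), F(0)],
    [F(0), F(0), F(0), F(1)],
    [F(1, 2), F(0), F(1, 4), F(1, 4)],
]

def mat_mul(A, B):
    """Product of two square matrices of the same size."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(Q, m):
    """Q^m by repeated multiplication, starting from the identity (Q^0)."""
    n = len(Q)
    result = [[F(int(i == j)) for j in range(n)] for i in range(n)]
    for _ in range(m):
        result = mat_mul(result, Q)
    return result

Q2 = mat_pow(Q, 2)
Q3 = mat_pow(Q, 3)

print(Q2[0][2])  # P(X_{n+2} = 3 | X_n = 1); list indices are 0-based
print(Q3[0][3])  # P(X_{n+3} = 4 | X_n = 1)
```

State 4 cannot be reached from state 1 in one or two steps (`Q[0][3]` and `Q2[0][3]` are both 0), but `Q3[0][3]` is $1/3$, via the route $1 \to 2 \to 3 \to 4$.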

· · ·

7. Stationary Distributions (Preview)

The other most important concept, defined here but developed next time, is the stationary distribution — also called the steady-state, long-run, or equilibrium distribution.

Stationary distribution

A probability vector $s$ (a $1 \times M$ row, a PMF) is stationary for the chain if

$$s\,Q = s$$

(Here $s\,Q$ is $1 \times M$, a valid distribution.)

For those who have seen them, this is essentially an eigenvalue–eigenvector equation with eigenvalue $1$. Such equations are usually written as matrix-times-vector; transposing both sides of $s\,Q = s$ gives that form, $Q^T s^T = s^T$, so $s^T$ is an eigenvector of $Q^T$. Eigenvalues are not required for this course.

Intuition for the name

We showed that if the chain follows distribution $s$ at time $n$, then $s\,Q$ is its distribution one step later. If $s\,Q = s$, then starting from $s$ the distribution is unchanged after one step — and after two steps, and forever. The distribution never changes: hence stationary.
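To make the definition concrete, the sketch below checks a candidate vector for the four-state example. The vector $s = (6/17,\ 4/17,\ 3/17,\ 4/17)$ is not given in the lecture; it was obtained here by solving $s\,Q = s$ by hand for this particular $Q$, and the code only verifies it with exact fractions.

```python
from fractions import Fraction as F

# Transition matrix of the four-state example, as exact fractions.
Q = [
    [F(1, 3), F(2, 3), F(0), F(0)],
    [F(1, 2), F(0), F(1, 2), F(0)],
    [F(0), F(0), F(0), F(1)],
    [F(1, 2), F(0), F(1, 4), F(1, 4)],
]

def step_distribution(s, Q):
    """Row vector times matrix: the distribution one step later."""
    M = len(Q)
    return [sum(s[i] * Q[i][j] for i in range(M)) for j in range(M)]

# Candidate stationary distribution (solved by hand for this example).
s = [F(6, 17), F(4, 17), F(3, 17), F(4, 17)]

is_stationary = step_distribution(s, Q) == s

# Starting from s, the distribution stays put after any number of steps.
after_50 = s
for _ in range(50):
    after_50 = step_distribution(after_50, Q)

print(is_stationary, after_50 == s)
```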

Why it matters and the open questions

One reason stationary distributions matter: under mild conditions, running the chain a long time makes it converge to a limiting distribution, and that limit is the stationary distribution. It describes the chain's long-run behavior. This raises several questions, to be addressed next time:

Existence: Does a solution to $s\,Q = s$ exist? It must be a genuine probability vector — non-negative entries summing to 1 (a solution with negative entries is useless).
Uniqueness: If it exists, is it unique?
Convergence: The definition $s\,Q = s$ says nothing about limits on its face. Does the chain actually converge to $s$, and in what sense?
Computation: Even granting existence, how do we compute $s$ efficiently? In principle $s\,Q = s$ is a linear system you could solve by elimination, but for large chains that could take impractically long even with a computer.
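The convergence question can be explored numerically for the four-state example. The sketch below is an illustration for this one chain, not a proof or a general method: it repeatedly multiplies two different starting distributions by $Q$ and compares the results with $(6/17,\ 4/17,\ 3/17,\ 4/17)$, a vector that can be checked by hand to satisfy $s\,Q = s$ for this $Q$. The step count of 200 is an arbitrary choice.

```python
# Transition matrix of the four-state example, in floating point.
Q = [
    [1/3, 2/3, 0.0, 0.0],
    [1/2, 0.0, 1/2, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [1/2, 0.0, 1/4, 1/4],
]

def step_distribution(s, Q):
    """Row vector times matrix: the distribution one step later."""
    M = len(Q)
    return [sum(s[i] * Q[i][j] for i in range(M)) for j in range(M)]

def run(s, m):
    """Distribution after m steps, starting from distribution s."""
    for _ in range(m):
        s = step_distribution(s, Q)
    return s

# Hand-computed solution of s Q = s for this example (an assumption of
# this sketch, not a value stated in the lecture).
target = [6/17, 4/17, 3/17, 4/17]

from_state_1 = run([1.0, 0.0, 0.0, 0.0], 200)  # start surely in state 1
from_state_3 = run([0.0, 0.0, 1.0, 0.0], 200)  # start surely in state 3

gap_1 = max(abs(a - b) for a, b in zip(from_state_1, target))
gap_3 = max(abs(a - b) for a, b in zip(from_state_3, target))
print(gap_1, gap_3)
```

Both starting points end up numerically indistinguishable from `target`, which is consistent with convergence to the stationary distribution for this chain. Whether and why this happens in general is the question taken up next time.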

Looking ahead

Under some mild technical conditions, existence, uniqueness, and convergence all hold. Computation can be very hard in the fully general case — but there is a special, beautiful class of chains where the stationary distribution can be found quickly and easily, without matrices at all. That case is the subject of the next lecture.