A stochastic process is a collection of random variables evolving over time. The course began with one random variable at a time, then two; later, with the law of large numbers and the central limit theorem, it studied sequences. In those settings the sequence was usually assumed IID — independent and identically distributed — a very strong assumption: each new variable resets to a fresh draw, ignoring everything before it.
A Markov chain is a stochastic process that goes exactly one step beyond IID. The variables become dependent, but in a very special, tractable way.
Indexing a sequence by time gives $X_0, X_1, X_2, \ldots$ A process can be classified along two axes:
| Axis | Discrete | Continuous |
|---|---|---|
| Time | $n$ a non-negative integer ($X_0, X_1, \ldots$) | $X_t$ for continuous $t$ |
| Space | each $X$ takes finitely / countably many values | each $X$ takes values in a continuous space (e.g., any real) |
Stat 110 studies only discrete-time, discrete-space chains, and further assumes finitely many states to keep things simple. (For continuous-time and continuous-space processes, the suggested follow-up is Stat 171.) Think of $X_n$ as the state of a system at time $n$ — a particle wandering randomly from state to state.
Number the states $1$ to $M$ (a convention, not a requirement). Predicting tomorrow's state, $X_{n+1} = j$, in general requires conditioning on the entire past history:
$$P(X_{n+1} = j \mid X_n = i,\ X_{n-1} = i_{n-1},\ \ldots,\ X_0 = i_0)$$
For a general stochastic process this conditional probability can be hopelessly complicated. The Markov property says that only the most recent state matters — everything further back can be discarded.
$$P(X_{n+1} = j \mid X_n = i,\ X_{n-1} = i_{n-1},\ \ldots,\ X_0 = i_0) = P(X_{n+1} = j \mid X_n = i)$$
The past and future are conditionally independent given the present. Once you know the current state $X_n$, anything older is obsolete, outdated information.
A subtlety about conditional independence: this does not say the past and future are independent outright. If you were given $X_{n-1}$ but not $X_n$, you could not discard $X_{n-1}$ — it is only obsolete given the present.
We further assume the one-step probability does not depend on the time $n$. Write it as
$$q_{ij} = P(X_{n+1} = j \mid X_n = i)$$
This is the transition probability from state $i$ to state $j$. A chain whose transition probabilities don't change with time is called homogeneous (or time-homogeneous). The word "homogeneous" is often dropped, so judge from context. Stat 110 studies only homogeneous chains.
It helps to keep a small picture in mind. Consider a chain with four states drawn as ovals, with arrows labeled by transition probabilities:
It is usually not possible to go from any state to any state in one step (e.g., 1 cannot reach 3 or 4 in a single step — though it can in more steps). With finitely many states you can always draw such a picture: a particle bouncing between states along the arrows.
To predict the future, all that matters is the current state. It does not matter how the particle got there — the long history of wanderings is irrelevant; only where it is now counts.
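The "particle bouncing along the arrows" picture can be sketched as a short simulation. This is only an illustration: the three-state arrow table `ARROWS` and its probabilities are made up (they are not the lecture's four-state diagram), and the names `next_state` and `simulate` are hypothetical. The point to notice is that `next_state` receives only the current state, never the earlier path.

```python
import random

# Hypothetical arrow picture: ARROWS[i][j] is the probability of moving i -> j.
# These numbers are invented for illustration, not taken from the lecture.
ARROWS = {
    1: {1: 0.5, 2: 0.5},
    2: {1: 0.25, 3: 0.75},
    3: {2: 1.0},
}

def next_state(current, arrows, rng):
    """Draw X_{n+1} using only the current state X_n (the Markov property)."""
    targets = list(arrows[current])
    weights = [arrows[current][t] for t in targets]
    return rng.choices(targets, weights=weights, k=1)[0]

def simulate(start, n_steps, arrows, rng):
    """Return the path [X_0, X_1, ..., X_{n_steps}]."""
    path = [start]
    for _ in range(n_steps):
        # Only path[-1] is consulted; the rest of the history is ignored.
        path.append(next_state(path[-1], arrows, rng))
    return path

rng = random.Random(110)  # fixed seed so the run is reproducible
path = simulate(start=1, n_steps=20, arrows=ARROWS, rng=rng)
print(path)
```

Every consecutive pair in the printed path follows one of the arrows in the table, which is all the picture promises.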
The same information can be encoded as a matrix $Q$ (the transition matrix), often more convenient than redrawing the picture. The $(i, j)$ entry is $q_{ij}$, the probability of jumping from $i$ to $j$ in one step. For the example above:
Notice that every row sums to 1: starting from a given state, the chain must go somewhere. Columns may or may not sum to 1. (Some authors use the transpose convention, in which columns sum to 1 instead; this course writes rows summing to 1.)
Write down any square matrix with non-negative entries whose rows each sum to 1. Every such matrix corresponds to a Markov chain and can be converted back to an arrow picture.
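The two conditions (non-negative entries, rows summing to 1) are easy to check mechanically, and the conversion back to arrows is just a listing of the positive entries. The sketch below assumes a matrix stored as a list of rows; `is_transition_matrix`, `arrows_from_matrix`, and the matrix `Q_EXAMPLE` are hypothetical names and made-up numbers for illustration.

```python
import math

def is_transition_matrix(Q, tol=1e-12):
    """Check that Q is square, non-negative, and every row sums to 1."""
    m = len(Q)
    if m == 0:
        return False
    for row in Q:
        if len(row) != m:
            return False
        if any(q < 0 for q in row):
            return False
        if not math.isclose(sum(row), 1.0, abs_tol=tol):
            return False
    return True

def arrows_from_matrix(Q):
    """List the arrows (from_state, to_state, probability), states numbered from 1."""
    return [(i + 1, j + 1, q)
            for i, row in enumerate(Q)
            for j, q in enumerate(row) if q > 0]

# Made-up 3-state example, not the lecture's matrix.
Q_EXAMPLE = [
    [0.5, 0.5, 0.0],
    [0.25, 0.0, 0.75],
    [0.0, 1.0, 0.0],
]

print(is_transition_matrix(Q_EXAMPLE))
print(arrows_from_matrix(Q_EXAMPLE))
```

Zero entries produce no arrow, which matches the remark that not every state is reachable from every other state in one step.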
The two most important concepts beyond the basic definition are the transition matrix (now defined) and the stationary distribution (introduced below).
Markov chains were introduced by Andrei Markov — already familiar from Markov's inequality — just over 100 years ago (around 1906). They are used today in two broad ways.
In the social sciences, physical sciences, and biology, one may genuinely believe a system is (approximately) a Markov chain, or use one as a convenient approximation for a system evolving over time. The limitation: the conditional-independence assumption is strong. Whether a stock price over time, or the weather, is Markovian has generated thousands of pages of debate — these are empirical questions.
The first-order assumption is less limiting than it looks. One can generalize so the future depends on $X_n$ and $X_{n-1}$ (or 10 steps back). The right starting point for understanding such higher-order chains is the first-order theory anyway.
The other major use, Markov chain Monte Carlo (MCMC), essentially created a revolution in scientific computing. The challenge "find a major field where MCMC has never been applied" has never been met; one can even find Markov chains applied to French poetry.
With MCMC you synthetically construct your own chain, so the question "is the real process Markovian?" never arises — you built the chain. Why build one? Because you can construct a chain that converges to a distribution you care about. To study a distribution too complicated to handle analytically, you cleverly design a chain that converges to it, program the chain, run it for a long time, and use the results to study the target distribution.
This requires fast computers, so it became practical only recently: calculations that are now routine were impossible 30 to 50 years ago.
Both modern uses differ from Markov's own motivation. He introduced Markov chains to settle a religious-philosophical debate over free will. The law of large numbers had recently been proven, and some philosophers worried it left no room for free will: if behavior always converges to the mean in the long run, where is the scope for choice?
One of Markov's rivals tried to rescue free will by arguing the law of large numbers assumes IID, while human behavior is not IID — so we are safe. But this is unconvincing: proving "IID implies the law of large numbers" does not prove "not IID implies it fails."
So Markov sought a process one step beyond IID — going one step backward in conditioning — and proved a version of the law of large numbers that holds for such chains, showing IID is not needed. His first chain (possibly the very first Markov chain) was empirical: he took a Russian novel, classified letters as vowel or consonant (two states), and estimated the probabilities of vowel-following-vowel, vowel-following-consonant, and so on — conceptually the same picture as the four-state example, only simpler.
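The vowel/consonant estimate can be sketched in a few lines: count how often each type of letter follows each type, then normalize each row. This is an illustration under stated assumptions, not a reconstruction of Markov's procedure: it uses English vowels, a placeholder sentence instead of a novel, drops spaces and punctuation so letters across word boundaries count as adjacent, and the function name `estimate_vowel_chain` is hypothetical.

```python
from collections import Counter

VOWELS = set("aeiou")  # assumption: English vowels; Markov worked with Russian text

def estimate_vowel_chain(text):
    """Estimate a 2-state transition matrix (state 0 = vowel, state 1 = consonant)."""
    letters = [c for c in text.lower() if c.isalpha()]  # spaces/punctuation dropped
    states = [0 if c in VOWELS else 1 for c in letters]
    counts = Counter(zip(states, states[1:]))  # counts of (current, next) pairs
    Q = []
    for i in (0, 1):
        row_total = counts[(i, 0)] + counts[(i, 1)]
        if row_total == 0:
            raise ValueError("state %d never has a successor in this text" % i)
        # Row i: relative frequency of each next state given current state i.
        Q.append([counts[(i, j)] / row_total for j in (0, 1)])
    return Q

sample = "the quick brown fox jumps over the lazy dog"  # placeholder text
Q_hat = estimate_vowel_chain(sample)
print(Q_hat)
```

Each row of `Q_hat` sums to 1 by construction, so the estimate is itself a valid transition matrix.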
The transition matrix gives one-step probabilities. To get multi-step behavior, take powers of $Q$.
Suppose at time $n$ the chain $X_n$ has distribution $s$, written as a row vector (a $1 \times M$ matrix). Since there are $M$ states, the PMF is just the list of probabilities:
$$s = (s_1, s_2, \ldots, s_M), \qquad s_i = P(X_n = i),$$
with non-negative entries summing to 1. To find the distribution at time $n+1$, condition on the current state (law of total probability):
$$P(X_{n+1} = j) = \sum_{i} P(X_{n+1} = j \mid X_n = i)\, P(X_n = i) = \sum_{i} s_i\, q_{ij}$$
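That sum is exactly the $j$-th entry of the matrix product $s\,Q$. Dimensions check: $s$ is $1 \times M$ and $Q$ is $M \times M$, so $s\,Q$ is $1 \times M$, the right shape for a distribution vector. Its entries are non-negative and sum to 1 because $s$ sums to 1 and each row of $Q$ sums to 1.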
The distribution one step into the future is $s\,Q$.
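As a concrete check, the row-vector-times-matrix product can be written out directly. The sketch below uses plain Python lists; the function name `step` and the three-state matrix `Q` are hypothetical, with made-up probabilities.

```python
def step(s, Q):
    """Return the row vector s Q: entry j is sum_i s[i] * Q[i][j]."""
    m = len(Q)
    return [sum(s[i] * Q[i][j] for i in range(m)) for j in range(m)]

# Made-up 3-state transition matrix (each row sums to 1).
Q = [
    [0.5, 0.5, 0.0],
    [0.25, 0.0, 0.75],
    [0.0, 1.0, 0.0],
]

s = [1.0, 0.0, 0.0]   # the chain is in state 1 with certainty
s_next = step(s, Q)   # picks out the first row of Q
print(s_next)         # [0.5, 0.5, 0.0]
```

Starting from a point mass on state $i$, the product simply reads off row $i$ of $Q$, which is a useful sanity check on the row convention.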
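Now $s\,Q$ plays the role of $s$, so the same argument gives $(s\,Q)\,Q = s\,Q^2$ as the distribution at time $n+2$, $s\,Q^3$ at time $n+3$, and in general
$$s\,Q^m \text{ is the distribution at time } n+m.$$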
To go one step into the future, multiply on the right by $Q$. Tedious by hand, but trivial on a computer: just take powers of $Q$, and you know how the distribution evolves.
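Evolving the distribution over several steps is then a loop that multiplies on the right by $Q$ each time. The sketch below is self-contained; `step`, `distribution_after`, and the matrix `Q` are hypothetical names and made-up numbers.

```python
def step(s, Q):
    """Return the row vector s Q."""
    m = len(Q)
    return [sum(s[i] * Q[i][j] for i in range(m)) for j in range(m)]

def distribution_after(s, Q, m):
    """Distribution m steps ahead: multiply on the right by Q a total of m times."""
    for _ in range(m):
        s = step(s, Q)
    return s

# Made-up 3-state transition matrix (each row sums to 1).
Q = [
    [0.5, 0.5, 0.0],
    [0.25, 0.0, 0.75],
    [0.0, 1.0, 0.0],
]

start = [1.0, 0.0, 0.0]
two_ahead = distribution_after(start, Q, 2)
print(two_ahead)  # [0.375, 0.25, 0.375]
```

Repeated vector-matrix products give the same answer as forming the matrix power first and multiplying once; the loop just avoids building the full power when only one starting distribution is of interest.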
A parallel result holds for the transition probabilities themselves. By definition $Q$ gives the one-step probabilities: $P(X_{n+1} = j \mid X_n = i) = q_{ij}$.
For two steps, condition on the intermediate state $X_{n+1} = k$ (the missing link), with everything conditioned on $X_n = i$:
$$P(X_{n+2} = j \mid X_n = i) = \sum_{k} P(X_{n+2} = j \mid X_{n+1} = k,\ X_n = i)\, P(X_{n+1} = k \mid X_n = i)$$
The Markov property kills the $X_n$ in the first factor (knowing $X_{n+1}$ makes $X_n$ obsolete), leaving two one-step transitions:
$$P(X_{n+2} = j \mid X_n = i) = \sum_{k} q_{ik}\, q_{kj}$$
This is exactly row $i$, column $j$ of $Q^2$ (multiply $Q$ by $Q$: dot a row of $Q$ with a column of $Q$).
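The "missing link" sum can be checked numerically against an explicit matrix product. In this sketch the states are 0-indexed, and `two_step_prob`, `mat_mul`, and the matrix `Q` are hypothetical names and made-up numbers.

```python
def two_step_prob(Q, i, j):
    """P(X_{n+2} = j | X_n = i): sum over the intermediate state k of q_ik * q_kj."""
    return sum(Q[i][k] * Q[k][j] for k in range(len(Q)))

def mat_mul(A, B):
    """Plain matrix product: entry (i, j) is row i of A dotted with column j of B."""
    inner = len(B)
    return [[sum(A[i][k] * B[k][j] for k in range(inner))
             for j in range(len(B[0]))]
            for i in range(len(A))]

# Made-up 3-state transition matrix (each row sums to 1).
Q = [
    [0.5, 0.5, 0.0],
    [0.25, 0.0, 0.75],
    [0.0, 1.0, 0.0],
]

Q2 = mat_mul(Q, Q)
# Index 0 is state 1 and index 2 is state 3: no one-step arrow, but reachable in two.
print(Q[0][2], two_step_prob(Q, 0, 2), Q2[0][2])  # 0.0 0.375 0.375
```

In this example the only two-step route from the first state to the third passes through the second, so the sum has a single non-zero term, $0.5 \times 0.75$.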
Repeating the argument:
The probability of going from $i$ to $j$ in $m$ steps is the $(i, j)$ entry of $Q^m$:
$$P(X_{n+m} = j \mid X_n = i) = (Q^m)_{ij}$$
So powers of the transition matrix carry all the multi-step information — no separate study of each transition or each number of steps is needed.
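A minimal sketch of computing $Q^m$ by repeated multiplication follows, assuming a matrix stored as a list of rows. The names `mat_mul` and `mat_pow` and the matrix `Q` are hypothetical; a numerical library would normally be used for large matrices.

```python
def mat_mul(A, B):
    """Plain matrix product of two lists-of-rows."""
    inner = len(B)
    return [[sum(A[i][k] * B[k][j] for k in range(inner))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def mat_pow(Q, m):
    """Return Q^m for a non-negative integer m (Q^0 is the identity)."""
    size = len(Q)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(m):
        result = mat_mul(result, Q)
    return result

# Made-up 3-state transition matrix (each row sums to 1).
Q = [
    [0.5, 0.5, 0.0],
    [0.25, 0.0, 0.75],
    [0.0, 1.0, 0.0],
]

Q2 = mat_pow(Q, 2)
print(Q2[1])  # two-step probabilities out of state 2: [0.125, 0.875, 0.0]
```

Each row of $Q^m$ is itself a probability distribution (the distribution $m$ steps ahead, given the starting state), so its rows still sum to 1.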
The other most important concept, defined here but developed next time, is the stationary distribution — also called the steady-state, long-run, or equilibrium distribution.
A probability vector $s$ (a $1 \times M$ row, a PMF) is stationary for the chain if
$$s\,Q = s$$
(Here $s\,Q$ is again a $1 \times M$ row vector, so both sides have matching dimensions.)
For those who have seen them, this is an eigenvalue-eigenvector equation with eigenvalue $1$: $s$ is a left eigenvector of $Q$. Such equations are usually written as matrix-times-column-vector; transposing both sides gives $Q^\top s^\top = s^\top$, which is that form. Eigenvalues are not required for this course.
We showed that if the chain follows distribution $s$ at time $n$, then $s\,Q$ is its distribution one step later. If $s\,Q = s$, then starting from $s$ the distribution is unchanged after one step — and after two steps, and forever. The distribution never changes: hence stationary.
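The definition can be verified exactly on a small example. The sketch below uses a made-up two-state chain with rational entries so that the equality $s\,Q = s$ holds without rounding; the candidate $s = (3/5, 2/5)$ was found by solving the two equations by hand for this particular $Q$, which is an assumption of the example and not a general method. The names `step` and `is_stationary` are hypothetical.

```python
from fractions import Fraction as F

def step(s, Q):
    """Return the row vector s Q."""
    m = len(Q)
    return [sum(s[i] * Q[i][j] for i in range(m)) for j in range(m)]

def is_stationary(s, Q):
    """True if s Q == s exactly (intended for exact Fraction entries)."""
    return step(s, Q) == s

# Made-up 2-state chain: state 1 leaves with prob 1/3, state 2 leaves with prob 1/2.
Q = [
    [F(2, 3), F(1, 3)],
    [F(1, 2), F(1, 2)],
]
s = [F(3, 5), F(2, 5)]  # candidate, solved by hand for this Q

print(is_stationary(s, Q))                   # True
print(is_stationary([F(1, 2), F(1, 2)], Q))  # False: the uniform vector moves

# Starting from s, the distribution is unchanged after many steps.
t = s
for _ in range(10):
    t = step(t, Q)
print(t == s)  # True
```

Exact fractions are used here only to make the equality test clean; with floating-point entries the comparison would need a tolerance.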
One reason stationary distributions matter: under mild conditions, running the chain a long time makes it converge to a limiting distribution, and that limit is the stationary distribution. It describes the chain's long-run behavior. This raises several questions, to be addressed next time:
1. Does a stationary distribution exist?
2. Is it unique?
3. Does the chain converge to it?
4. How can it be computed?
Under some mild technical conditions, existence, uniqueness, and convergence all hold. Computation can be very hard in the fully general case — but there is a special, beautiful class of chains where the stationary distribution can be found quickly and easily, without matrices at all. That case is the subject of the next lecture.