This is the penultimate Stat 110 lecture and finishes the Markov chains unit. It picks up from the reversibility discussion, generalizes random walk on a network to weighted edges, proves that every reversible chain is such a weighted random walk, and closes with a fully worked non-reversible example: Google's PageRank.
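Last time we studied random walk on an undirected graph (network): from the current node, the walker moves along one of the incident edges, chosen uniformly at random. That chain is reversible.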
The stationary probability of a node is proportional to its degree. No matrix calculation is needed: write down the vector of degrees and normalize it (scale so the entries sum to $1$).
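As a quick check of the degree rule, here is a minimal sketch on a small made-up graph (the edge list is a hypothetical example, not one from the lecture). It normalizes the degrees and confirms that one step of the walk leaves the result unchanged.

```python
from fractions import Fraction

# Hypothetical undirected graph on nodes 0..3, given as a list of edges.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4

# Adjacency lists: each undirected edge is recorded at both endpoints.
neighbors = [[] for _ in range(n)]
for a, b in edges:
    neighbors[a].append(b)
    neighbors[b].append(a)

# Stationary distribution: the degree vector, scaled to sum to 1.
degree = [len(nb) for nb in neighbors]
s = [Fraction(d, sum(degree)) for d in degree]

# One step of the walk: (sQ)_j = sum over neighbors i of j of s_i / deg(i).
sQ = [sum(s[i] * Fraction(1, degree[i]) for i in neighbors[j]) for j in range(n)]

print(s)        # degrees 2, 2, 3, 1 divided by 8
print(sQ == s)  # stationarity check
```

Exact fractions are used so the stationarity check is an equality rather than a floating-point comparison.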
Birth-death chains are another important family of reversible chains; they are covered on the handout and worth working through, but were not derived in lecture.
In the unweighted walk, from a node with three incident edges the walker chooses each with probability $\frac{1}{3}$. The natural generalization: what if some edges are more likely than others? Attach an edge weight $w_{ij}$ to each pair of nodes, subject to two rules: the weights are symmetric, $w_{ij} = w_{ji}$, because the edges are undirected; and $w_{ij} = 0$ whenever there is no edge between $i$ and $j$.
On an actual edge we may as well take $w_{ij} > 0$; the only thing to avoid is dividing by zero.
The walk now chooses among the available edges with probability proportional to their weights. From state $i$, go to $j$ with probability $\propto w_{ij}$; if $w_{ij} = 0$ we never go there. Setting all weights to $1$ recovers the original uniform walk (each of $d$ choices gets probability $\frac{1}{d}$).
For an actual edge $(i, j)$, the transition probability is that edge's weight divided by the total weight of all available steps from $i$:

$$Q_{ij} = \frac{w_{ij}}{\sum_k w_{ik}}$$
(and $Q_{ij} = 0$ if there is no edge). The denominator is assumed nonzero — the walker must be able to do something, so not all weights out of $i$ can vanish. When every weight equals $1$, the denominator is just the degree of $i$ and $Q_{ij} = \frac{1}{\deg(i)}$, recovering the unweighted case.
The generalized analog of "degree" is the total edge weight at a node: from state $i$, sum the weights of all incident edges. Claim: this weighted chain is reversible, and the stationary probability of $i$ is proportional to that generalized degree.
Take $s_i \propto \sum_k w_{ik}$ and multiply by $Q_{ij}$. The denominator of $Q_{ij}$ is exactly that sum, so it cancels:

$$s_i \, Q_{ij} \propto \Big(\sum_k w_{ik}\Big) \frac{w_{ij}}{\sum_k w_{ik}} = w_{ij}.$$
By symmetry $w_{ij} = w_{ji}$, this equals $w_{ji}$, which is exactly what $s_j \, Q_{ji}$ gives by the same cancellation. Detailed balance holds.
Therefore the stationary distribution, with $s_i$ proportional to the generalized degree, is

$$s_i = \frac{\sum_k w_{ik}}{\sum_j \sum_k w_{jk}}.$$
The double sum just normalizes. As before, the stationary distribution came out with no matrix work — only multiplying both sides of the balance equation by the denominator and using the symmetry of the weights.
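The weighted construction can be checked numerically. The sketch below uses a hypothetical symmetric weight matrix `w` (the numbers are invented for illustration), builds $Q$ from it, normalizes the generalized degrees, and verifies detailed balance and stationarity with exact fractions.

```python
from fractions import Fraction

# Hypothetical symmetric edge weights on nodes 0..3
# (w[i][j] == w[j][i]; a weight of 0 means there is no edge).
w = [
    [0, 2, 1, 0],
    [2, 0, 3, 0],
    [1, 3, 0, 4],
    [0, 0, 4, 0],
]
n = len(w)

# Generalized degree: total weight of the edges at each node.
strength = [sum(row) for row in w]

# Transition matrix: Q[i][j] = w[i][j] / sum_k w[i][k].
Q = [[Fraction(w[i][j], strength[i]) for j in range(n)] for i in range(n)]

# Stationary distribution: generalized degrees, normalized by the double sum.
total = sum(strength)
s = [Fraction(d, total) for d in strength]

# Detailed balance s_i Q_ij == s_j Q_ji for every pair, which implies sQ == s.
balanced = all(s[i] * Q[i][j] == s[j] * Q[j][i] for i in range(n) for j in range(n))
sQ = [sum(s[i] * Q[i][j] for i in range(n)) for j in range(n)]

print(s)
print(balanced, sQ == s)
```

Setting every nonzero entry of `w` to 1 reduces this to the unweighted walk, where `strength` is the ordinary degree.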
The weighted construction looks like a modest extension, but it is in fact completely general.
Any reversible Markov chain can be represented as a random walk on an undirected network with edge weights. In this sense, weighted random walk on a network is the entire theory of reversible Markov chains.
The caveat is practical, not theoretical: if you already know the stationary distribution $s$, you are in good shape; if you do not, it may not be obvious how to find the $s$ satisfying $s_i Q_{ij} = s_j Q_{ji}$, nor how to read off the weights. But in principle every reversible chain has this form.
Start from a given reversible chain with transition matrix $Q$ and stationary distribution $s$. Build the network: the nodes are the states, and place an edge between $i$ and $j$ whenever $Q_{ij} > 0$. It remains only to define the weights. Set

$$w_{ij} = s_i \, Q_{ij}.$$
Because the chain is reversible, $s_i Q_{ij} = s_j Q_{ji}$, so this same number also equals $w_{ji}$ — the required symmetry holds automatically.
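Now check that the weighted walk with these weights reproduces the original transition probabilities. Substituting $w_{ij} = s_i Q_{ij}$, the walk's transition probability from $i$ to $j$ is

$$\frac{w_{ij}}{\sum_k w_{ik}} = \frac{s_i \, Q_{ij}}{\sum_k s_i \, Q_{ik}}.$$

Since $s_i$ does not depend on the index $k$, factor it out of the denominator. (Assume all $s_i > 0$; a state with $s_i = 0$ should simply have been removed, so we are not dividing by zero.) The $s_i$ cancels top and bottom:

$$\frac{s_i \, Q_{ij}}{s_i \sum_k Q_{ik}} = \frac{Q_{ij}}{\sum_k Q_{ik}} = Q_{ij},$$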
because $\sum_k Q_{ik}$ is a row of a transition matrix and equals $1$ (from $i$ you must go somewhere). The constructed weighted walk is exactly the original chain.
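To see the construction in action, here is a small sketch that starts from a reversible chain, forms the weights $w_{ij} = s_i Q_{ij}$, and recovers $Q$ from the weighted walk. The three-state chain and its stationary distribution are a hypothetical example chosen for illustration; note that states with $Q_{ii} > 0$ simply get a self-loop of weight $s_i Q_{ii}$.

```python
from fractions import Fraction as F

# Hypothetical reversible chain on 3 states (a small birth-death chain)
# together with its stationary distribution s.
Q = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 4), F(1, 2), F(1, 4)],
    [F(0), F(1, 2), F(1, 2)],
]
s = [F(1, 4), F(1, 2), F(1, 4)]
n = len(Q)

# Edge weights from the construction: w_ij = s_i * Q_ij.
w = [[s[i] * Q[i][j] for j in range(n)] for i in range(n)]

# Reversibility of the chain is exactly symmetry of the weights.
symmetric = all(w[i][j] == w[j][i] for i in range(n) for j in range(n))

# Weighted random walk on the resulting network: w_ij / sum_k w_ik.
Q_walk = [[w[i][j] / sum(w[i]) for j in range(n)] for i in range(n)]

print(symmetric)    # the weights form an undirected network
print(Q_walk == Q)  # the weighted walk reproduces the original chain
```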
Random walk on an undirected network, possibly with edge weights, is the quintessential, prototypical reversible Markov chain — and captures all of them.
Non-reversible chains are much harder in general. The reversible case has many nice properties: intuitively you can run time forward and backward, and practically you can often skip matrix computations. A non-reversible chain can still be pictured as a random walk on a network with weights — but now with directed arrows, possibly one-way, possibly two-way with different weights in each direction, and that extra generality makes it much harder.
The worked non-reversible example is the Google PageRank chain, the algorithm Google was originally built on (and still uses in some form). It is based directly on a Markov chain, and that chain is non-reversible.
The states are web pages; the transitions are hyperlinks. The entire web is a giant directed network: some pages link to others. PageRank studies the stationary distribution of the random walk on this network. Because it is not reversible, the stationary distribution cannot be written down as easily as in the network examples above, but we can still ask how to compute it — a real concern given the size of the web.
Understanding four pages well lets you imagine billions of pages with a vast, complicated link structure.
Google was started in 1998 by Sergey Brin and Larry Page, then Stanford grad students who dropped out to work on it full time. ("PageRank" conveniently is also Page's name, though it genuinely ranks pages.)
Earlier search engines used crude methods: human-curated directories (which do not scale to billions of pages) and raw keyword-frequency ranking (easily gamed by repeating a word, and a high count does not imply a reliable page). These ignored the web's network structure. AltaVista was among the first to use the actual link structure; an early improvement called a page important if many pages link to it — but that too is easy to abuse (thousands of dummy pages) and ignores whether the linking pages are themselves any good.
The importance of a page should depend not just on how many pages link to it, but on how important those linking pages are. This sounds circular — importance defined via importance — but that is fine: it suggests an eigenvalue / eigenvector equation, read as a Markov chain.
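Let $s_j$ be the score (rank) of page $j$. Brin and Page wanted scores satisfying

$$s_j = \sum_i s_i \, Q_{ij},$$

so that the score of page $j$ is a weighted sum of the scores of the pages recommending it.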
A naive version would just mark incoming links and add the scores of the recommenders. But a page with one outgoing link and a page with a thousand outgoing links should not cast equally weighted recommendations: each page has a fixed recommendation "budget," so a page with a thousand links dilutes each one. Thus $Q$ is the transition matrix of the random walk that follows links uniformly at random, with each row normalized to sum to $1$. For a four-page web in which page 1 links to pages 2 and 3, page 2 links to pages 1 and 3, page 3 links only to page 4, and page 4 has no outgoing links, $Q$ is:

| From \ To | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| $1$ | $0$ | $\tfrac12$ | $\tfrac12$ | $0$ |
| $2$ | $\tfrac12$ | $0$ | $\tfrac12$ | $0$ |
| $3$ | $0$ | $0$ | $0$ | $1$ |
| $4$ | $\tfrac14$ | $\tfrac14$ | $\tfrac14$ | $\tfrac14$ |
Page 4 has no links, which would leave its row all zeros and break the chain. The fix: from a dangling page, jump to any page with equal probability — for $M$ pages, each entry is $\frac{1}{M}$ (here $\frac14$). The interpretation is a random web surfer who reads a page, clicks a random link, and on reaching a dead end opens a new window and goes to a random page rather than being trapped.
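The rule for building $Q$, including the dangling-page fix, can be sketched as follows. The function name `link_matrix` and the dictionary representation of the links are illustrative choices, and the sketch assumes each page lists a given target at most once.

```python
from fractions import Fraction

# Outgoing links of the four-page example; page 4 is a dangling page.
links = {1: [2, 3], 2: [1, 3], 3: [4], 4: []}

def link_matrix(links):
    """Row-stochastic matrix of the random surfer (hypothetical helper).

    A page with outgoing links follows one of them uniformly at random;
    a dangling page jumps to any of the M pages with probability 1/M.
    """
    pages = sorted(links)
    M = len(pages)
    Q = []
    for i in pages:
        out = links[i]
        if out:
            row = [Fraction(1, len(out)) if j in out else Fraction(0) for j in pages]
        else:
            row = [Fraction(1, M)] * M
        Q.append(row)
    return Q

Q = link_matrix(links)
for row in Q:
    print([str(x) for x in row])
```

Every row sums to $1$, so $Q$ is a valid transition matrix even though the raw link structure has a dead end.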
With $Q$ defined this way, the score equation in matrix form (with $s$ a row vector) becomes

$$s = sQ,$$
which is exactly the equation for a stationary distribution. So $s$ is the stationary distribution of the random-web-surfing chain. By the long-run interpretation, the importance of a page equals the long-run fraction of time the surfer spends there — more important pages are visited more.
Treating $s = sQ$ as a pure eigenvalue problem hides two problems for the real web: the link chain may not be irreducible (can every page reach every page by clicking links? and even so, it might take ages), and the convergence guarantees may not apply. Brin and Page's paper therefore used a modified chain, the Google chain $G$:

$$G = \alpha Q + (1 - \alpha)\,\frac{J}{M},$$

where $J$ is the $M \times M$ matrix of all ones and $0 < \alpha < 1$.
At each step the surfer flips a coin with probability $\alpha$ of heads. On heads (prob. $\alpha$) follow a random link — use $Q$. On tails (prob. $1 - \alpha$) teleport: ignore the links and jump to a uniformly random page. The original paper suggested $\alpha = 0.85$ — Google's "magic number" — so 85% of the time you follow links and 15% of the time you teleport. There is much speculation about why $0.85$, and whether they still use it.
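Because teleportation can reach any page in a single step, every entry of $G$ is positive, so the chain is irreducible and aperiodic. Consequently, even though the chain is not reversible (web links generally go one direction), all the good results from last time apply: a stationary distribution exists, it is unique, and the chain converges to it.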
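For a tiny web the Google chain can be written out densely. The sketch below forms the coin-flip mixture of link-following and uniform teleportation for the four-page example with $\alpha = 0.85$, then checks that every entry is positive and every row sums to $1$. Building the full matrix is only sensible at this toy scale.

```python
from fractions import Fraction as F

# Link-following matrix Q of the four-page example (dangling row already fixed).
Q = [
    [F(0), F(1, 2), F(1, 2), F(0)],
    [F(1, 2), F(0), F(1, 2), F(0)],
    [F(0), F(0), F(0), F(1)],
    [F(1, 4), F(1, 4), F(1, 4), F(1, 4)],
]
M = len(Q)
alpha = F(85, 100)  # probability of following a link rather than teleporting

# With prob. alpha use Q; with prob. 1 - alpha jump to a uniformly random page.
G = [[alpha * Q[i][j] + (1 - alpha) / M for j in range(M)] for i in range(M)]

all_positive = all(g > 0 for row in G for g in row)
rows_sum_to_one = all(sum(row) == 1 for row in G)
print(all_positive, rows_sum_to_one)
```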
The stationary distribution still has to be computed, and the web is enormous.
Solving $s = sG$ as a linear system by Gaussian elimination costs on the order of $M^3$ for $M$ equations in $M$ unknowns. That is polynomial time, but for $M = 10$ billion, $M^3 = (10^{10})^3 = 10^{30}$ operations — hopeless even on a fast computer.
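To put $10^{30}$ operations in perspective, here is a back-of-the-envelope calculation. The machine speed of $10^{18}$ operations per second is an assumed round figure for illustration, not a number from the lecture.

```python
M = 10**10                 # number of pages (10 billion)
ops = M**3                 # rough operation count for Gaussian elimination
rate = 10**18              # ASSUMED operations per second (hypothetical machine)

seconds = ops // rate
years = seconds / (365.25 * 24 * 3600)

print(ops == 10**30)
print(f"about {years:,.0f} years")
```

Even under that generous assumption the running time is on the order of tens of thousands of years.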
The better approach uses the fact that the chain converges to its stationary distribution. We do not need an exact solution, only an approximate one.
Let $t$ be an initial probability (row) vector, the pmf at time $0$. It could be $(1, 0, 0, \ldots)$ (always start at page 1) or $(\tfrac{1}{M}, \ldots, \tfrac{1}{M})$ (start at a uniformly random page); any starting distribution works. Multiplying a distribution on the right by $G$ advances it one step: $tG$ is the pmf at time $1$.
Running for $n$ steps gives $tG^n$, which converges to the stationary distribution — the PageRank vector — regardless of the starting point. In practice no one knows exactly how long to run such a complicated chain (the mixing-time questions are genuinely hard and not well understood here); people run it as long as they can, or until the vector looks stabilized, and take that as the approximate answer.
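The multiplication $tG$ looks daunting but splits into two easy pieces:

$$tG = \alpha \,(tQ) + \frac{1 - \alpha}{M}\,(1, 1, \ldots, 1).$$

The first piece is cheap because $Q$ is sparse (most pages link to only a few others), and the second is a constant vector because the entries of $t$ sum to $1$.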
So one step is a sparse matrix-vector product plus a uniform vector — far cheaper than Gaussian elimination. Use $tG$ as the new $t$, repeat to get $tG^2, tG^3, \ldots$; the limit is the stationary distribution, which is the PageRank vector.
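The iteration can be sketched in a few lines without ever forming $G$. This is an illustrative sketch, not a description of Google's actual implementation: the function name `pagerank`, the stopping tolerance, and the step cap are all assumptions. Each pass pushes probability mass along the stored links only, then adds the uniform share coming from teleportation and from dangling pages.

```python
def pagerank(links, alpha=0.85, tol=1e-12, max_steps=1000):
    """Approximate the stationary distribution of the Google chain.

    links maps each page to the list of pages it links to (hypothetical format).
    """
    pages = sorted(links)
    index = {p: k for k, p in enumerate(pages)}
    M = len(pages)
    t = [1.0 / M] * M  # start at a uniformly random page
    for _ in range(max_steps):
        new = [0.0] * M
        dangling = 0.0
        for p in pages:
            mass = t[index[p]]
            out = links[p]
            if out:
                share = mass / len(out)  # split this page's mass over its links
                for q in out:
                    new[index[q]] += share
            else:
                dangling += mass         # dead end: mass is spread uniformly
        # tG = alpha * (tQ) + (1 - alpha) * uniform; the dangling rows of Q
        # contribute alpha * dangling / M to every entry.
        base = (alpha * dangling + (1 - alpha)) / M
        new = [alpha * x + base for x in new]
        if sum(abs(a - b) for a, b in zip(new, t)) < tol:
            return dict(zip(pages, new))
        t = new
    return dict(zip(pages, t))

ranks = pagerank({1: [2, 3], 2: [1, 3], 3: [4], 4: []})
print(ranks)
```

On the four-page example, pages 1 and 2 receive equal scores by symmetry, and pages 3 and 4 score higher because more of the surfer's traffic funnels through them.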
In summary, PageRank is a non-reversible Markov chain whose stationary distribution scores pages. Teleportation makes the chain well behaved, and power iteration computes the stationary distribution efficiently, without solving a $10$-billion-by-$10$-billion linear system. This is, as far as Blitzstein knows, in the spirit of how Google originally did it, and it concludes the Markov chains unit.