Lecture 33: Markov Chains Continued Further

Harvard Statistics 110 (Joe Blitzstein)

1. Recap: Random Walk on an Undirected Network

This is the penultimate Stat 110 lecture and finishes the Markov chains unit. It picks up from the reversibility discussion, generalizes random walk on a network to weighted edges, proves that every reversible chain is such a weighted random walk, and closes with a fully worked non-reversible example: Google's PageRank.

Last time we studied random walk on an undirected graph (network). The setup:

A set of nodes (states), with undirected edges between some of them. Edges are bidirectional — you can travel either way along an edge — not one-way arrows.
Loops (an edge from a node to itself) are allowed. The only subtlety is a convention question: how does a self-loop contribute to a node's degree? We take a self-loop to add $1$ to the degree, so a node with one self-loop and two edges to other nodes has degree $3$.
The walker at a node picks one of its incident edges uniformly at random and follows it.

Key result from last time

The stationary probability of a node is proportional to its degree. No matrix calculation is needed: write down the vector of degrees and normalize it (scale so the entries sum to $1$).
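To see the recipe in action, here is a minimal sketch on a made-up four-node graph. The node names, the edges, and the helper names `stationary_from_degrees` and `step` are assumptions for illustration, not from the lecture. It normalizes the degree vector and then checks that one step of the walk leaves that distribution unchanged.

```python
from fractions import Fraction

# Hypothetical undirected graph as an adjacency list (no self-loops here).
graph = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "b"],
    "d": ["b"],
}

def stationary_from_degrees(graph):
    """Normalize the degree vector so that its entries sum to 1."""
    degrees = {node: len(nbrs) for node, nbrs in graph.items()}
    total = sum(degrees.values())
    return {node: Fraction(d, total) for node, d in degrees.items()}

def step(dist, graph):
    """Advance a distribution by one step of the uniform random walk."""
    new = {node: Fraction(0) for node in graph}
    for node, nbrs in graph.items():
        for nbr in nbrs:
            # From `node`, each incident edge is chosen with probability 1/deg.
            new[nbr] += dist[node] / len(nbrs)
    return new

s = stationary_from_degrees(graph)
print(s)                    # degrees 2, 3, 2, 1 divided by their total 8
print(step(s, graph) == s)  # stationarity check
```

Exact fractions are used so the stationarity check is an equality rather than a floating-point comparison.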

Birth-death chains are another important family of reversible chains; they are covered on the handout and worth working through, but were not derived in lecture.

· · ·

2. Weighted Random Walk on a Network

Generalizing to unequal edge weights

In the unweighted walk, from a node with three incident edges the walker chooses each with probability $\frac{1}{3}$. The natural generalization: what if some edges are more likely than others? Attach an edge weight $w_{ij}$ to each edge, subject to two rules.

Edge-weight conditions

$w_{ij} \ge 0$, and $w_{ij} = 0$ exactly when there is no edge between $i$ and $j$ (you cannot get from $i$ to $j$ in one step).
Symmetry: $w_{ij} = w_{ji}$. Because the graph is undirected, a single number is written on each edge — we may not assign one value going one way and a different value going back.

For an actual edge we take $w_{ij} > 0$. The only thing to avoid is a node whose weights are all zero, since the transition probabilities below divide by the total weight at the node.

The walk now chooses among the available edges with probability proportional to their weights. From state $i$, go to $j$ with probability $\propto w_{ij}$; if $w_{ij} = 0$ we never go there. Setting all weights to $1$ recovers the original uniform walk (each of $d$ choices gets probability $\frac{1}{d}$).

Transition probabilities

For an actual edge $(i, j)$, the transition probability is that edge's weight divided by the total weight of all available steps from $i$:

$$Q_{ij} = \frac{w_{ij}}{\sum_k w_{ik}}$$

(and $Q_{ij} = 0$ if there is no edge). The denominator is assumed nonzero — the walker must be able to do something, so not all weights out of $i$ can vanish. When every weight equals $1$, the denominator is just the degree of $i$ and $Q_{ij} = \frac{1}{\deg(i)}$, recovering the unweighted case.
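The formula for $Q_{ij}$ can be sketched in a few lines. The weight matrices `W` and `ones` below and the helper name `transition_matrix` are hypothetical, chosen only to illustrate the normalization; a zero entry means "no edge".

```python
from fractions import Fraction

# Hypothetical symmetric edge weights on 3 nodes; 0 means there is no edge.
W = [
    [0, 2, 1],
    [2, 0, 3],
    [1, 3, 0],
]

def transition_matrix(W):
    """Q[i][j] = W[i][j] / (total weight of edges at node i)."""
    Q = []
    for row in W:
        total = sum(row)
        if total == 0:
            # The walker must be able to go somewhere from every node.
            raise ValueError("every node needs at least one positive weight")
        Q.append([Fraction(w, total) for w in row])
    return Q

Q = transition_matrix(W)

# Same edges with every weight set to 1: each row becomes 1/deg(i).
ones = [[1 if w > 0 else 0 for w in row] for row in W]
Q_unweighted = transition_matrix(ones)
print(Q)
print(Q_unweighted)
```

With all weights equal to $1$, every node in this example has degree $2$, so each nonzero entry of `Q_unweighted` is $\frac12$.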

Stationary distribution via reversibility

The generalized analog of "degree" is the total edge weight at a node: from state $i$, sum the weights of all incident edges. Claim: this weighted chain is reversible, and the stationary probability of $i$ is proportional to that generalized degree.

$$s_i \, Q_{ij} = s_j \, Q_{ji} \quad\text{(detailed balance)}$$

Take $s_i \propto \sum_k w_{ik}$ and multiply by $Q_{ij}$. The denominator of $Q_{ij}$ is exactly that sum, so it cancels:

$$s_i \, Q_{ij} \;\propto\; \left(\sum_k w_{ik}\right)\frac{w_{ij}}{\sum_k w_{ik}} = w_{ij}.$$

By symmetry $w_{ij} = w_{ji}$, this equals $w_{ji}$, which is exactly what $s_j \, Q_{ji}$ gives by the same cancellation. Detailed balance holds.

Therefore the stationary distribution, with $s_i$ proportional to the generalized degree, is

$$s_i = \frac{\sum_k w_{ik}}{\sum_j \sum_k w_{jk}}.$$

The double sum just normalizes. As before, the stationary distribution came out with no matrix work — only multiplying both sides of the balance equation by the denominator and using the symmetry of the weights.
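A quick numerical check of this argument, on an assumed three-node weight matrix (the numbers are illustrative only): the sketch computes the generalized degrees, normalizes them, and verifies both detailed balance and $s = sQ$ exactly.

```python
from fractions import Fraction

# Hypothetical symmetric weights; W[i][j] == W[j][i], 0 means no edge.
W = [
    [0, 2, 1],
    [2, 0, 3],
    [1, 3, 0],
]
n = len(W)

row_sums = [sum(row) for row in W]   # generalized degrees
total = sum(row_sums)                # the normalizing double sum
s = [Fraction(r, total) for r in row_sums]
Q = [[Fraction(w, r) for w in row] for row, r in zip(W, row_sums)]

# Detailed balance: s_i Q_ij == s_j Q_ji for every pair (i, j).
detailed_balance = all(
    s[i] * Q[i][j] == s[j] * Q[j][i] for i in range(n) for j in range(n)
)

# Stationarity: the row vector s times Q gives back s.
sQ = [sum(s[i] * Q[i][j] for i in range(n)) for j in range(n)]

print(s, detailed_balance, sQ == s)
```

Note that $s_i Q_{ij}$ here equals $w_{ij}$ divided by the normalizing constant, which is why symmetry of the weights is all that the check needs.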

· · ·

3. Every Reversible Chain Is a Weighted Random Walk

The weighted construction looks like a modest extension, but it is in fact completely general.

Theorem

Any reversible Markov chain can be represented as a random walk on an undirected network with edge weights. In this sense, weighted random walk on a network is the entire theory of reversible Markov chains.

The caveat is practical, not theoretical: if you already know the stationary distribution $s$, you are in good shape; if you do not, it may not be obvious how to find the $s$ satisfying $s_i Q_{ij} = s_j Q_{ji}$, nor how to read off the weights. But in principle every reversible chain has this form.

Why the theorem is true

Start from a given reversible chain with transition matrix $Q$ and stationary distribution $s$. Build the network: the nodes are the states, and place an edge between $i$ and $j$ whenever $Q_{ij} > 0$. It remains only to define the weights. Set

$$w_{ij} = s_i \, Q_{ij}.$$

Because the chain is reversible, $s_i Q_{ij} = s_j Q_{ji}$, so this same number also equals $w_{ji}$ — the required symmetry holds automatically.

Now check that the weighted walk with these weights reproduces the original transition probabilities. The walk's transition probability from $i$ to $j$ is

$$\frac{w_{ij}}{\sum_k w_{ik}} = \frac{s_i \, Q_{ij}}{\sum_k s_i \, Q_{ik}}.$$

Since $s_i$ does not depend on the index $k$, factor it out of the denominator. (Assume all $s_i > 0$; a state with $s_i = 0$ should simply have been removed, so we are not dividing by zero.) The $s_i$ cancels top and bottom:

$$\frac{Q_{ij}}{\sum_k Q_{ik}} = \frac{Q_{ij}}{1} = Q_{ij},$$

because $\sum_k Q_{ik}$ is a row of a transition matrix and equals $1$ (from $i$ you must go somewhere). The constructed weighted walk is exactly the original chain.
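The construction can be tried on a small example. The three-state chain below (with a holding probability at each state) and its stationary distribution are assumptions picked for illustration; the sketch first confirms the chain is reversible with respect to `s`, then builds $w_{ij} = s_i Q_{ij}$ and checks that the weighted walk gives back $Q$.

```python
from fractions import Fraction as F

# Hypothetical reversible chain on 3 states and its stationary distribution.
Q = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 4), F(1, 2), F(1, 4)],
    [F(0),    F(1, 2), F(1, 2)],
]
s = [F(1, 4), F(1, 2), F(1, 4)]
n = len(Q)

# Sanity check of the assumption: detailed balance holds for (s, Q).
is_reversible = all(
    s[i] * Q[i][j] == s[j] * Q[j][i] for i in range(n) for j in range(n)
)

# Edge weights w_ij = s_i Q_ij (the diagonal entries are self-loop weights).
W = [[s[i] * Q[i][j] for j in range(n)] for i in range(n)]
is_symmetric = all(W[i][j] == W[j][i] for i in range(n) for j in range(n))

# Weighted random walk on W: divide each weight by the row total.
Q_rebuilt = [[W[i][j] / sum(W[i]) for j in range(n)] for i in range(n)]

print(is_reversible, is_symmetric, Q_rebuilt == Q)
```

Each row total of `W` is $s_i \sum_k Q_{ik} = s_i$, which is the cancellation used in the proof.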

Conclusion

Random walk on an undirected network, possibly with edge weights, is the quintessential, prototypical reversible Markov chain — and captures all of them.

· · ·

4. Non-Reversible Chains and Google PageRank

Non-reversible chains are much harder in general. The reversible case has many nice properties: intuitively you can run time forward and backward, and practically you can often skip matrix computations. A non-reversible chain can still be pictured as a random walk on a network with weights — but now with directed arrows, possibly one-way, possibly two-way with different weights in each direction, and that extra generality makes it much harder.

The worked non-reversible example is the Google PageRank chain, the algorithm Google was originally built on (and still uses in some form). It is based directly on a Markov chain, and that chain is non-reversible.

The web as a Markov chain

The states are web pages; the transitions are hyperlinks. The entire web is a giant directed network: some pages link to others. PageRank studies the stationary distribution of the random walk on this network. Because it is not reversible, the stationary distribution cannot be written down as easily as in the network examples above, but we can still ask how to compute it — a real concern given the size of the web.

Toy example: four pages

1 → 2, 3
2 → 1, 3
3 → 4
4 → (none)

Page 4 is a dangling page with no outgoing links.

Understanding four pages well lets you imagine billions of pages with a vast, complicated link structure.

History: ranking search results

Google was started in 1998 by Sergey Brin and Larry Page, then Stanford grad students who dropped out to work on it full time. ("PageRank" conveniently is also Page's name, though it genuinely ranks pages.)

Earlier search engines used crude methods: human-curated directories (which do not scale to billions of pages) and raw keyword-frequency ranking (easily gamed by repeating a word, and a high count does not imply a reliable page). These ignored the web's network structure. AltaVista was among the first to use the actual link structure; an early improvement called a page important if many pages link to it — but that too is easy to abuse (thousands of dummy pages) and ignores whether the linking pages are themselves any good.

The PageRank idea

Key insight

The importance of a page should depend not just on how many pages link to it, but on how important those linking pages are. This sounds circular — importance defined via importance — but that is fine: it suggests an eigenvalue / eigenvector equation, read as a Markov chain.

Let $s_j$ be the score (rank) of page $j$. Brin and Page wanted scores satisfying

$$s_j = \sum_i s_i \, Q_{ij}.$$

A naive version would just mark incoming links and add the scores of the recommenders. But a page with one outgoing link and a page with a thousand outgoing links should not cast equally weighted recommendations: each page has a fixed recommendation "budget," so a page with a thousand links dilutes each one. Thus $Q$ is the transition matrix of the random walk that follows links uniformly at random, with each row normalized to sum to $1$.

Transition matrix for the four-page example

From \ To	1	2	3	4
$1$	$0$	$\tfrac12$	$\tfrac12$	$0$
$2$	$\tfrac12$	$0$	$\tfrac12$	$0$
$3$	$0$	$0$	$0$	$1$
$4$	$\tfrac14$	$\tfrac14$	$\tfrac14$	$\tfrac14$

Page 4 has no links, which would leave its row all zeros and break the chain. The fix: from a dangling page, jump to any page with equal probability — for $M$ pages, each entry is $\frac{1}{M}$ (here $\frac14$). The interpretation is a random web surfer who reads a page, clicks a random link, and on reaching a dead end opens a new window and goes to a random page rather than being trapped.

With $Q$ defined this way, the score equation in matrix form (with $s$ a row vector) becomes

$$s = sQ,$$

which is exactly the equation for a stationary distribution. So $s$ is the stationary distribution of the random-web-surfing chain. By the long-run interpretation, the importance of a page equals the long-run fraction of time the surfer spends there — more important pages are visited more.
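For the four-page example this can be checked directly. The sketch below builds $Q$ from the link lists, including the dangling-page fix, and verifies that the candidate vector $(\tfrac{2}{11}, \tfrac{2}{11}, \tfrac{3}{11}, \tfrac{4}{11})$ satisfies $s = sQ$. That vector is not from the lecture; it was obtained by solving the four balance equations by hand for this toy chain, and the helper name `link_matrix` is hypothetical.

```python
from fractions import Fraction

# Link structure of the toy web: page -> pages it links to.
links = {1: [2, 3], 2: [1, 3], 3: [4], 4: []}

def link_matrix(links):
    """Row i is uniform over i's out-links; a dangling page jumps uniformly."""
    pages = sorted(links)
    M = len(pages)
    Q = []
    for i in pages:
        outs = links[i]
        if outs:
            row = [Fraction(1, len(outs)) if j in outs else Fraction(0)
                   for j in pages]
        else:
            row = [Fraction(1, M)] * M  # dead end: go to a random page
        Q.append(row)
    return Q

Q = link_matrix(links)
M = len(Q)

# Candidate scores, found by hand for this example (an assumption to verify).
s = [Fraction(2, 11), Fraction(2, 11), Fraction(3, 11), Fraction(4, 11)]
sQ = [sum(s[i] * Q[i][j] for i in range(M)) for j in range(M)]

# Detailed balance fails, e.g. s_1 Q_13 > 0 but Q_31 = 0.
reversible = all(
    s[i] * Q[i][j] == s[j] * Q[j][i] for i in range(M) for j in range(M)
)

print(sQ == s, reversible)
```

In this toy chain page 4 gets the highest score: every path through the link structure funnels into it, even though only one page links to it directly.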

· · ·

5. The Teleportation Modification

Treating $s = sQ$ as a pure eigenvalue problem hides two problems for the real web. First, the link chain may not be irreducible: it is not clear that every page can reach every other page by clicking links, and even if it can, doing so might take ages. Second, without that structure the convergence guarantees may not apply. Brin and Page's paper therefore used a modified chain, the Google chain $G$:

$$G = \alpha \, Q + (1 - \alpha)\,\frac{J}{M},$$

$Q$ is the link-following transition matrix above ($M \times M$).
$J$ is the $M \times M$ matrix of all ones, so $\frac{J}{M}$ is a valid transition matrix (rows sum to $1$, all entries nonnegative).
$\alpha$ is a constant in $(0, 1)$ — think of it as a probability.

At each step the surfer flips a coin with probability $\alpha$ of heads. On heads (prob. $\alpha$) follow a random link — use $Q$. On tails (prob. $1 - \alpha$) teleport: ignore the links and jump to a uniformly random page. The original paper suggested $\alpha = 0.85$ — Google's "magic number" — so 85% of the time you follow links and 15% of the time you teleport. There is much speculation about why $0.85$, and whether they still use it.

Why add teleportation

Effect of the teleport term

Guarantees irreducibility: there is now a small positive probability of jumping from any page to any page in one step, so every state communicates.
Removes all zeros: the amount added to each entry is tiny (with $\alpha = 0.85$ it is $\frac{0.15}{M}$, and $M$ may be $10$ billion), but it makes every entry of $G$ strictly positive. The condition that some power of the transition matrix has all positive entries therefore holds already for the first power, $G$ itself.

Consequently, even though the chain is not reversible (web links generally go one direction), all the good results from last time apply: a stationary distribution exists, it is unique, and the chain converges to it.
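As a small check of both effects, the sketch below forms $G$ for the four-page example with $\alpha = 0.85$, using exact fractions. The matrix `Q` is the link-following matrix from the toy example with the dangling-page row filled in; the variable names are illustrative.

```python
from fractions import Fraction

half, quarter = Fraction(1, 2), Fraction(1, 4)

# Link-following matrix for the four-page example (row 4 is the dangling fix).
Q = [
    [0,       half,    half,    0],
    [half,    0,       half,    0],
    [0,       0,       0,       1],
    [quarter, quarter, quarter, quarter],
]
M = len(Q)
alpha = Fraction(85, 100)

# G = alpha * Q + (1 - alpha) * J / M, where J is the all-ones matrix.
G = [
    [alpha * Q[i][j] + (1 - alpha) * Fraction(1, M) for j in range(M)]
    for i in range(M)
]

min_entry = min(min(row) for row in G)
rows_sum_to_one = all(sum(row) == 1 for row in G)

print(min_entry, rows_sum_to_one)
```

Every zero of $Q$ becomes $\frac{0.15}{4} = \frac{3}{80}$ in $G$, so all entries are positive while each row still sums to $1$.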

· · ·

6. Computing PageRank by Power Iteration

The stationary distribution still has to be computed, and the web is enormous.

Why not Gaussian elimination

Solving $s = sG$ as a linear system by Gaussian elimination costs on the order of $M^3$ for $M$ equations in $M$ unknowns. That is polynomial time, but for $M = 10$ billion, $M^3 = (10^{10})^3 = 10^{30}$ operations — hopeless even on a fast computer.

The better approach uses the fact that the chain converges to its stationary distribution. We do not need an exact solution, only an approximate one.

The iteration

Let $t$ be an initial probability (row) vector — the pmf at time $0$. It could be $(1, 0, 0, \ldots)$ (always start at page 1) or $(\tfrac{1}{M}, \ldots, \tfrac{1}{M})$ (start at a uniformly random page); any starting distribution works. Multiplying a distribution on the right by $G$ advances it one step:

$$t,\quad tG,\quad tG^2,\quad tG^3,\quad \ldots,\quad tG^n \xrightarrow{\;n\to\infty\;} \text{stationary distribution}.$$

Running for $n$ steps gives $tG^n$, which converges to the stationary distribution — the PageRank vector — regardless of the starting point. In practice no one knows exactly how long to run such a complicated chain (the mixing-time questions are genuinely hard and not well understood here); people run it as long as they can, or until the vector looks stabilized, and take that as the approximate answer.

Why each step is cheap

The multiplication $tG$ looks daunting but splits into two easy pieces:

$$tG = \alpha\,(tQ) + (1 - \alpha)\,\frac{tJ}{M}.$$

$tQ$ is cheap because $Q$ is extremely sparse: a $10$-billion-by-$10$-billion matrix in which a typical page has only a handful (perhaps $3$ to $100$) of links, never tens of thousands. It is dominated by zeros. Google employs strong computer scientists to choose data structures that track exactly where the few nonzero entries are, which makes the multiplication efficient.
$\frac{tJ}{M}$ is trivial: $J$ is all ones, so $tJ$ dots the probability vector $t$ against a column of ones — that is, it sums the entries of $t$. Since $t$ is a probability vector, the sum is $1$, so $tJ$ is the all-ones row vector and $\frac{tJ}{M}$ is the uniform vector $(\tfrac{1}{M}, \ldots, \tfrac{1}{M})$.

So one step is a sparse matrix-vector product plus a uniform vector — far cheaper than Gaussian elimination. Use $tG$ as the new $t$, repeat to get $tG^2, tG^3, \ldots$; the limit is the stationary distribution, which is the PageRank vector.
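The iteration can be sketched for the four-page example as follows. This is a minimal illustration under stated assumptions, not Google's implementation: the function names `google_step` and `pagerank`, the stopping tolerance, and the iteration cap are all hypothetical choices. The link lists play the role of the sparse matrix, so only actual links are ever touched.

```python
def google_step(t, links, alpha):
    """One step t -> tG, using only the link lists (a sparse view of Q)."""
    M = len(links)
    tQ = {page: 0.0 for page in links}
    dangling_mass = 0.0
    for page, outs in links.items():
        if outs:
            share = t[page] / len(outs)   # follow a uniformly random link
            for target in outs:
                tQ[target] += share
        else:
            dangling_mass += t[page]      # dangling row of Q is uniform
    uniform = 1.0 / M
    # alpha * (tQ) + (1 - alpha) * (tJ / M); tJ / M is uniform when sum(t) == 1.
    return {
        page: alpha * (tQ[page] + dangling_mass * uniform)
        + (1 - alpha) * uniform
        for page in links
    }

def pagerank(links, alpha=0.85, tol=1e-12, max_iter=1000):
    """Iterate t, tG, tG^2, ... from the uniform start until it stabilizes."""
    M = len(links)
    t = {page: 1.0 / M for page in links}
    for _ in range(max_iter):
        new_t = google_step(t, links, alpha)
        change = sum(abs(new_t[p] - t[p]) for p in links)
        t = new_t
        if change < tol:
            break
    return t

links = {1: [2, 3], 2: [1, 3], 3: [4], 4: []}
rank = pagerank(links)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 4))
```

The stopping rule here (total change below a tolerance) is one way of making "until the vector looks stabilized" concrete; on a chain as large as the real web the choice is a judgment call.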

The essence of PageRank

A non-reversible Markov chain whose stationary distribution scores pages, made well-behaved by teleportation and computed efficiently by power iteration rather than by solving a $10$-billion-by-$10$-billion linear system. This is, as far as Blitzstein knows, in the spirit of how Google originally did it — and the end of the Markov chains unit.