2. More on Markov chains, Examples and Applications

Section 1. Branching processes.
Section 2. Time reversibility.
Section 3. Application of time reversibility: a tandem queue model.
Section 4. The Metropolis method.
Section 5. Simulated annealing.
Section 6. Ergodicity concepts for time-inhomogeneous Markov chains.
Section 7. Proof of the main theorem of simulated annealing.
Section 8. Card shuffling: speed of convergence to stationarity.

2.1 Branching Processes

The branching process model we will study was formulated in 1873 by Sir Francis Galton,∗

who was interested in the survival and extinction of family names. Suppose children inherit their fathers’ names, so we need only keep track of fathers and sons. Consider a male who is the only member of his generation to have a given family name, so that the responsibility of keeping the family name alive falls upon him—if his line of male descendants terminates, so does the family name. Suppose for simplicity that each male has probability f(0) of producing no sons, f(1) of producing one son, and so on. Here is a question: What is the probability that the family name eventually becomes extinct?

Galton brought the problem to his mathematician friend, Rev. H. W. Watson, who devised the method of analysis using probability generating functions that is still used today. However, a minor mathematical slip caused Galton and Watson to get the answer to the main question wrong. They believed that the extinction probability is 1 — all names are doomed to eventual extinction. We will see below that this is false: if the expected number of sons is greater than 1, the branching process model produces lines of descent that have positive probability of going on forever.

Let us begin with a more formal description of the branching process. Thinking of Gt as the number of males in generation t, start with G0 = 1. If Gt = i then write Gt+1 = Xt1 + Xt2 + · · · + Xti; here Xtj denotes the number of sons fathered by the jth man in generation t. Assume the random variables {Xtj : t ≥ 0, j ≥ 1} are iid with probability mass function f, so that P{Xtj = k} = f(k) for k = 0, 1, . . .. To avoid trivial cases we

∗See Jagers (1975) and Guttorp (1991) for more on the history.


assume that f(0) > 0 and f(0) + f(1) < 1. [[Why are these trivial?]] We are interested in the extinction probability ρ = P1{Gt = 0 for some t}.

It is clear from the verbal description of the process that {Gt : t ≥ 0} is a Markov chain. We can say a few interesting things about the process directly from general results of the previous chapter. Clearly state 0 is absorbing. Therefore, for each i > 0, since Pi{G1 = 0} = (f(0))^i > 0, the state i must be transient—this follows from Theorem (1.24). Consequently, we know that with probability 1, each state i > 0 is visited only a finite number of times. From this, a bit of thought shows that, with probability 1, the chain must either get absorbed at 0 eventually or approach ∞. [[Exercise: Think a bit and show this.]]

We can obtain an equation for ρ by the idea we have used before a number of times—e.g. see exercise ([1.3])—namely, conditioning on what happens at the first step of the chain. This gives

ρ = ∑_{k=0}^∞ P{G1 = k | G0 = 1} P{eventual extinction | G1 = k}.

Evidently, since the males all have sons independently (in the terminology above, the random variables Xtj are independent), we have P{eventual extinction | G1 = k} = ρ^k. This holds because the event of eventual extinction, given G1 = k, requires each of the k male lines starting at time 1 to reach eventual extinction; this is the intersection of k independent events, each of which has probability ρ. Thus, ρ satisfies

(2.1)    ρ = ∑_{k=0}^∞ f(k) ρ^k =: ψ(ρ).

The last sum is a function of ρ; for each distribution f there is a corresponding function of ρ, which we have denoted by ψ. So ρ satisfies ψ(ρ) = ρ: the extinction probability ρ is a fixed point of ψ.

So the function ψ, which is called the probability generating function of the probability mass function f, arises in a natural and interesting way in this problem. Let us pause to collect a few of its properties. The first two derivatives are given by

ψ′(z) = ∑_{k=1}^∞ k f(k) z^{k−1},    ψ′′(z) = ∑_{k=2}^∞ k(k − 1) f(k) z^{k−2}

for z ∈ (0, 1). Since these are positive, the function ψ is strictly increasing and convex on (0,1). Also, clearly ψ(0) = f(0) and ψ(1) = 1. Finally, notice that ψ′(1) = ∑ k f(k) = µ, where µ denotes E(X), the expected number of sons for each male. These properties imply that the graph of ψ over [0,1] must look like one of the three following pictures, depending on the value of µ = ψ′(1).


[[Figure: three graphs of ψ over [0, 1], corresponding to µ < 1, µ = 1, and µ > 1; in the third case the curve crosses the diagonal at a solution ρ < 1, indicated by an arrow.]]

So you can see what happens. Since ψ(1) = 1, the equation ψ(ρ) = ρ always has a trivial solution at ρ = 1. When µ ≤ 1, this trivial solution is the only solution, so that, since the probability ρ of eventual extinction satisfies ψ(ρ) = ρ, it must be the case that ρ = 1. When µ > 1, there is one additional solution, indicated by the arrow in the picture. This solution was missed by Watson and Galton (1875), leading them to believe that the probability of extinction would be 1 in this case as well. We will show that this was incorrect, and that the probability of extinction is the smaller solution of the equation ψ(ρ) = ρ.

Thus, assuming µ > 1 and defining r to be the smaller solution of ψ(r) = r, we want to show that ρ = r. Since ψ(ρ) = ρ, we know that ρ must be either r or 1. Defining pt = P1{Gt = 0}, observe that as t → ∞,

pt ↑ P1[⋃_{n=1}^∞ {Gn = 0}] = ρ.

Therefore, to rule out the possibility that ρ = 1, it is sufficient to prove the following statement:

pt ≤ r for all t.

To prove this by induction, observe that p0 = 0, so that the statement holds for t = 0. Next observe that

pt+1 = P1{Gt+1 = 0} = ∑_{i=0}^∞ P1{G1 = i} P1{Gt+1 = 0 | G1 = i} = ∑_{i=0}^∞ f(i)(pt)^i,

that is, pt+1 = ψ(pt). Thus, using the induction hypothesis pt ≤ r and the fact that the function ψ is increasing, we obtain pt+1 ≤ ψ(r) = r, which completes the proof.

(2.2) Example. Suppose each man has 3 children, with each child having probability 1/2 of being male, and different children being independent. What is the probability that a particular man’s line of male descendants will eventually become extinct? Here the distribution f is the binomial distribution Bin(3,1/2), so that µ = 3/2 > 1. Thus, we know that the probability ρ of extinction is less than 1. Here f(0) = 1/8, f(1) = 3/8, f(2) = 3/8,


and f(3) = 1/8, so that the equation ψ(r) = r becomes

(1/8) + (3/8)r + (3/8)r^2 + (1/8)r^3 = r,

or r^3 + 3r^2 − 5r + 1 = 0. Fortunately, r = 1 is a solution (as it must be!), so we can factor it out, getting the equation (r − 1)(r^2 + 4r − 1) = 0. Solving the quadratic equation gives ρ = √5 − 2 ≈ 0.2361. The man can rest assured that with probability 1 − ρ ≈ 0.7639 his glorious family name will live on forever.
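
As a quick numerical check (a minimal sketch of mine, not part of the text), the iteration pt+1 = ψ(pt) from the proof above can be run for this Bin(3, 1/2) offspring distribution; the iterates increase to the extinction probability √5 − 2.

    from math import comb, sqrt

    # Offspring distribution f = Bin(3, 1/2): f(k) = C(3, k) / 8
    f = [comb(3, k) / 8 for k in range(4)]

    def psi(z):
        """Probability generating function: psi(z) = sum_k f(k) z^k."""
        return sum(fk * z**k for k, fk in enumerate(f))

    p = 0.0                    # p_0 = P1{G_0 = 0} = 0
    for t in range(60):        # iterate p_{t+1} = psi(p_t)
        p = psi(p)

    print(p, sqrt(5) - 2)      # both print as roughly 0.2361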

2.2 Time Reversibility

Let X0, X1, . . . be a Markov chain having probability transition matrix P = P(i, j). Imagine that I recorded a movie of the sequence of states (X0, . . . , Xn), and I am showing you the movie on my fancy machine that can play the tape forward or backward equally well. Can you tell by watching the sequence of transitions on the movie whether I am showing it forward or backward?

Of course, we are assuming that you know the transition matrix P; otherwise, this would be an unreasonable request. There are cases in which distinguishing the direction of time is very easy. For example, if the state space is {1, 2, 3} and P(1, 2) = P(2, 3) = P(3, 1) = 1 [[one of our standard periodic examples]], observing just one transition of the chain is enough to tell you for sure the direction of time; for example, a “movie” in which we observe 3 followed by 2 must be running backward.

That one was easy. Let’s consider another example: do you think a stationary Ehrenfest chain is time-reversible? Here the state space is {0, 1, . . . , d}, say, and X0 ∼ Bin(d, 1/2), the stationary distribution of the chain. It is clear in this case that you will not be able to tell for sure from observing any finite movie (X0, . . . , Xn) which direction the movie is being shown—a sequence has positive probability if and only if its reversal also has positive probability. But we are asking whether or not you can get any sort of probabilistic hint about the direction in which the movie is being shown, and I am willing to show you as long a segment as you would like to request. So you can have plenty of data to look at. One might suspect that it should be possible to make this sort of distinction. For example, we know that the Ehrenfest chain has a “restoring force” that pulls it toward the level d/2, where half the balls are in each of the two urns. So, for instance, if we observe a long sequence that moves from (3/4)d down toward d/2, we might favor the explanation that the movie is being shown forward, since otherwise we are observing a long sequence moving against the restoring force.

Did you buy that? I hope not, because in fact we will see that the Ehrenfest chain is time-reversible: no movie, no matter how long, will give you any probabilistic information that is useful in distinguishing the direction of time. [[And the argument suggested above really didn’t make much sense — what comes down must have gone up.]]


Here is a definition that captures the concept.

(2.3) Definition. We say that a Markov chain {Xn} is time-reversible if, for each n,

(X0, X1, . . . , Xn) =_D (Xn, Xn−1, . . . , X0),

that is, the joint distribution of (X0, X1, . . . , Xn) is the same as the joint distribution of (Xn, Xn−1, . . . , X0).

Suppose {Xn} is time-reversible. As a particular consequence of the definition, we see that (X0, X1) =_D (X1, X0). This, in turn, implies that X1 =_D X0, that is, π1 = π0. Thus, in view of the relation π1 = π0 P, we obtain π0 = π0 P, so that the initial distribution π0 is stationary. Not surprisingly, we have found that a time-reversible chain must be stationary.

We will write π for the distribution π0 to emphasize that it is stationary. So Xn ∼ π for all n. The condition (X0, X1) =_D (X1, X0) says that P{X0 = i, X1 = j} = P{X1 = i, X0 = j}

for all i, j; that is,

(2.4) π(i)P (i, j) = π(j)P (j, i) for all i, j.

We have shown that the condition (2.4) together with X0 ∼ π is necessary for a chain to be reversible. In fact, these two conditions are also sufficient for reversibility.

(2.5) Proposition. The Markov chain {Xn} is time-reversible if and only if the distribution π of X0 satisfies πP = π and the condition (2.4) holds.

To see this, observe that (2.4) gives, for example,

P{X0 = i, X1 = j, X2 = k} = [π(i)P(i, j)]P(j, k)
                           = [P(j, i)π(j)]P(j, k)
                           = P(j, i)[π(j)P(j, k)]
                           = P(j, i)[P(k, j)π(k)]
                           = π(k)P(k, j)P(j, i)
                           = P{X0 = k, X1 = j, X2 = i}
                           = P{X2 = i, X1 = j, X0 = k},

that is, (X0, X1, X2) =_D (X2, X1, X0). Notice how (2.4) allowed the π factor to propagate through the product from the left end to the right, reversing the direction of all of the transitions along the way. The same trick allows us to deduce the general equality required in the definition (2.3).

The equalities in (2.4) have a nice interpretation in terms of probability flux. Recall [[as discussed in one of your homework problems]] that the flux from i to j is defined as π(i)P(i, j). So (2.4) says that the flux from i to j is the same as the flux from j to i—flux balances between each pair of states. These are called the “detailed balance” (or “local balance”) equations; they are more detailed than the “global balance equations” π(j) = ∑_i π(i)P(i, j) that characterize stationarity. Global balance, which can be rewritten as ∑_i π(j)P(j, i) = ∑_i π(i)P(i, j), says that the total flux leading out of state j is the same


as the total flux into state j. If we think of a system of containers of fluid connected by tubes, one container for each state, and we think of probability flux as fluid flowing around the system, global balance says that the flow out of container j is balanced by the flow into j, so that the fluid level in container j stays constant, neither rising nor falling. This is a less stringent requirement than detailed balance, which requires flow to balance between each pair of containers.

A more probabilistic interpretation is this: think of π(i)P(i, j) as the limiting long run fraction of transitions made by the Markov chain that go from state i to state j. Time reversibility requires that the long run fraction of i-to-j transitions is the same as that of the j-to-i transitions, for all i and j. This is a more stringent requirement than stationarity, which equates the long run fraction of transitions that go out of state i to the long run fraction of transitions that go into state i.

The mathematical formulation of this relationship is simple.

(2.6) Proposition. If the local balance equations (2.4) hold, then the distribution π is stationary.

Proof: Summing the local balance equations (2.4) over i gives the global balance equations

∑_i π(i)P(i, j) = ∑_i π(j)P(j, i) = π(j).

So why is the Ehrenfest chain time-reversible? The Ehrenfest chain is an example of a birth and death chain, which is defined to be a Markov chain whose states consist of nonnegative integers and whose transitions increase or decrease the state by at most 1. That is, interpreting the current state of the chain as the population count of living individuals, the population can change by at most 1 in a transition, which might represent a birth, a death, or no change. The time reversibility of the Ehrenfest chain is an example of a more general fact.

(2.7) Claim. All stationary birth and death chains are reversible.

To show this, consider a stationary birth and death chain on the state space S = {0, 1, . . . , d}.

[[Figure: a birth and death chain on {0, 1, . . . , d}, with a cut separating A = {0, . . . , i} from Ac = {i + 1, . . . , d}.]]

We ask: does π(i)P(i, j) = π(j)P(j, i) for all i, j? Since a birth and death chain has P(i, j) = 0 if |i − j| > 1, we need only consider the case where j = i + 1. Recall from


Exercise [1.11] (the exercise on probability flux) that for any subset A ⊂ S, the flux from A to Ac must equal the flux from Ac to A. Taking A = {0, . . . , i} as in the picture gives just what we want: π(i)P(i, i + 1) = π(i + 1)P(i + 1, i). This establishes the claim.
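
As a concrete check of the claim (a sketch of mine, not from the text), the following code builds the Ehrenfest chain on {0, . . . , d}, recovers its stationary distribution from the flux relation π(i)P(i, i + 1) = π(i + 1)P(i + 1, i), and verifies detailed balance numerically.

    import numpy as np
    from math import comb

    d = 6                                    # number of balls in the Ehrenfest model
    P = np.zeros((d + 1, d + 1))
    for i in range(d + 1):
        if i > 0:
            P[i, i - 1] = i / d              # a ball leaves urn 1
        if i < d:
            P[i, i + 1] = (d - i) / d        # a ball enters urn 1

    # Build pi from the flux equations pi(i) P(i, i+1) = pi(i+1) P(i+1, i)
    pi = np.ones(d + 1)
    for i in range(d):
        pi[i + 1] = pi[i] * P[i, i + 1] / P[i + 1, i]
    pi /= pi.sum()

    flux = pi[:, None] * P                   # flux(i, j) = pi(i) P(i, j)
    print(np.allclose(pi, [comb(d, i) / 2**d for i in range(d + 1)]))  # Bin(d, 1/2): True
    print(np.allclose(flux, flux.T))         # detailed balance: True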

(2.8) Example. Another important class of examples of time-reversible Markov chains is the random walk on a graph. Defining d(i) to be the degree of node i, the random walk moves according to the matrix P(i, j) = 1/d(i) for each neighbor j of node i, and P(i, j) = 0 otherwise. Then it is easy to check that the distribution

π(i) = d(i) / ∑_{j∈S} d(j)

satisfies the detailed balance equations. Thus, the random walk is time-reversible, and π is its stationary distribution.

[[Figure: the “house” graph on nodes {1, . . . , 5}; node 1 is the apex of the roof, joined to nodes 2 and 3, which sit on top of the square formed with nodes 4 and 5.]]

For example, consider a random walk on the house graph above. The degrees are (d1, d2, d3, d4, d5) = (2, 3, 3, 2, 2). So the stationary distribution is (π1, π2, π3, π4, π5) = (2/12, 3/12, 3/12, 2/12, 2/12).
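
Here is a small sketch (mine, not from the text; the edge list below is my reading of the house picture, consistent with the stated degrees) that builds the random walk on the house graph and confirms that π(i) ∝ d(i) satisfies detailed balance and stationarity.

    import numpy as np

    nodes = [1, 2, 3, 4, 5]
    edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 5), (4, 5)]   # assumed house graph
    nbrs = {i: set() for i in nodes}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)

    deg = {i: len(nbrs[i]) for i in nodes}
    P = np.array([[1 / deg[i] if j in nbrs[i] else 0.0 for j in nodes] for i in nodes])

    pi = np.array([deg[i] for i in nodes], dtype=float)
    pi /= pi.sum()                      # (2/12, 3/12, 3/12, 2/12, 2/12)

    flux = pi[:, None] * P              # flux(i, j) = pi(i) P(i, j)
    print(np.allclose(flux, flux.T))    # detailed balance: True
    print(np.allclose(pi @ P, pi))      # stationarity: True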

2.3 More on Time Reversibility: A Tandem Queue Model

Consider two office workers Andrew and Bertha who have a lot of paper work to do. When a piece of paper arrives at the office, it goes first to Andrew for processing. When he completes his task, he puts the paper on Bertha’s pile. When she completes her processing of the paper, it is sent out of the office.


[[Figure: the tandem queue—papers arrive to Andrew’s pile, move to Bertha’s pile when Andrew finishes them, and leave the office when Bertha finishes them.]]

Let’s specify a model of this in more detail. Consider a Markov chain whose state at time n is (Xn, Yn), where Xn is the number of papers in Andrew’s work pile and Yn is the number of papers in Bertha’s pile. Suppose that Xn = i ≥ 0. Then with probability p, a new piece of work enters the office, resulting in Xn+1 = i + 1. For definiteness, it is helpful to paint a detailed picture, so let’s suppose that any new arrival is placed onto the top of Andrew’s current pile of papers, preempting any other task he might have been working on at the time. Thus, if a new paper arrives, Andrew cannot complete his processing of any previously received papers that period. If Andrew receives no new arrival in a period and i > 0, then with probability a Andrew completes the processing of the paper on top of his pile, resulting in Xn+1 = i − 1. Thus, in summary, Andrew’s pile evolves as follows. Given Xn = i > 0,

Xn+1 = { i + 1   with probability p
       { i       with probability (1 − p)(1 − a)
       { i − 1   with probability (1 − p)a,

and if Xn = 0, then Xn+1 = 1 with probability p and Xn+1 = 0 with probability 1 − p. Let us assume that Bertha’s pile changes in the same way as Andrew’s, except that she

gets her new work from Andrew’s completed papers rather than from the outside, and her service-completion probability is b rather than a. A sample path of the {(Xn, Yn)} process is shown in the picture. Notice that in each period in which the X process decreases, the Y process increases: work completed by Andrew is sent to Bertha.


[[Figure: a sample path of the {(Xn, Yn)} process; each downward step of X is matched by an upward step of Y.]]
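
To make the dynamics concrete, here is a minimal simulation sketch of one transition of the (Xn, Yn) chain (my own code, with illustrative parameter values; Bertha’s preemption rule mirrors Andrew’s, as the text specifies).

    import random

    def step(x, y, p, a, b):
        """One transition of the tandem-queue chain (Xn, Yn) -> (Xn+1, Yn+1)."""
        if random.random() < p:                  # new paper enters the office
            x += 1                               # it preempts Andrew this period
            andrew_done = False
        else:
            andrew_done = x > 0 and random.random() < a
            if andrew_done:
                x -= 1                           # finished paper moves to Bertha
        if andrew_done:
            y += 1                               # arrival preempts Bertha this period
        elif y > 0 and random.random() < b:
            y -= 1                               # Bertha sends a paper out
        return x, y

    # Example run: p = 0.3, a = 0.5, b = 0.6, so p < (1 - p)a and p < (1 - p)b
    x = y = 0
    for _ in range(10_000):
        x, y = step(x, y, 0.3, 0.5, 0.6)
    print(x, y)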

The {Xn} process is a birth and death chain in its own right. Letting qa denote the downward transition probability qa = (1 − p)a, a stationary distribution πa for Andrew’s process X exists if p < qa, in which case familiar probability flux reasoning gives πa(i)p = πa(i + 1)qa, or πa(i + 1)/πa(i) = p/qa, so that

πa(i) = (p/qa)^i (1 − p/qa)    for i = 0, 1, . . . .

Here is where time reversibility allows us to make an interesting and useful observation. Assume X0 ∼ πa. Then we know that {Xn}, being a stationary birth and death process, is time reversible. Define An to be 1 if Andrew has an “arrival” at time n and 0 otherwise, so that An = 1 occurs when Xn = Xn−1 + 1. Define another indicator random variable Dn to be 1 if Andrew has a “departure” at time n, that is, when Xn = Xn−1 − 1. Considering k to be the present time, clearly the present queue size Xk is independent of the future arrivals Ak+1, Ak+2, . . .. This obvious fact, when applied to the reversed process, gives something interesting. In the reversed process, if k again represents the “present,” then “future arrivals” correspond to the departures Dk, Dk−1, . . .. Therefore, we can say that the departures (D1, D2, . . . , Dk) are independent of the queue size Xk. This is quite surprising, isn’t it? Also, since reversibility implies that arrivals in the reversed process have the same probabilistic behavior as arrivals in the forward process, we see that the departures D1, D2, . . . are iid Bernoulli(p). Thus, the output process of Andrew’s queue is the same probabilistically as the input process of his queue. Isn’t that interesting? For example, we have found that the departure process does not depend on the service completion probability a.

Bertha’s queue size {Yn} is also a birth and death chain, with a structure similar to Andrew’s. We have just shown that Bertha’s input consists of iid Bernoulli(p) random variables, just as Andrew’s input does. Defining the downward transition probability qb =


(1 − p)b for {Yn}, if p < qb the stationary distribution πb is given by

πb(i) = (p/qb)^i (1 − p/qb)    for i = 0, 1, . . . .

Now we are ready to show a surprising property of the stationary distribution of (Xk, Yk): the two queue sizes Xk and Yk are independent! That is, we claim that the stationary distribution π of {(Xn, Yn)} takes the product form

π(i, j) = πa(i)πb(j).

I’ll try to say the idea loosely in words first, then more carefully. It is helpful to imagine that at each time, Andrew flips a coin with probability a of heads, and if he gets heads he completes a piece of work, if that is possible—i.e. if there is work in his pile and if he has not received a new arrival. Bertha does the same, only with her coin having probability b. Back to the question: Supposing that X0 and Y0 are independent with distributions πa and πb, we want to see that Xn and Yn are also independent with distributions πa and πb. We know the marginal distributions of Xn and Yn are πa and πb; independence is the real question. The key is the observation made above that Xn is independent of Andrew’s departures up to time n, which are the same as Bertha’s arrivals up to time n. So since Yn is determined by Y0, Bertha’s arrivals up to time n, and Bertha’s service coin flips, all of which are independent of Xn, we should have Yn independent of Xn.

To establish this more formally, assuming that (X0, Y0) ∼ π, we want to show that (X1, Y1) ∼ π. Since πa and πb are stationary for {Xn} and {Yn}, we know that X1 ∼ πa and Y1 ∼ πb, so our task is to show that X1 and Y1 are independent. Let A^X_k denote the indicator of an arrival to Andrew’s desk at time k. Let S^X_k = 1 if at time k Andrew’s “service completion coin flip” as described in the previous paragraph comes up heads, and S^X_k = 0 otherwise. Define S^Y_k analogously for Bertha. We are assuming that the random variables X0, Y0, A^X_1, S^X_1, and S^Y_1 are independent. But we can write X1 and A^Y_1 as functions of (X0, A^X_1, S^X_1). [[The precise functional forms are not important, but just for fun,

X1 = { X0 + 1   if A^X_1 = 1
     { X0       if A^X_1 = 0 and S^X_1 = 0
     { X0 − 1   if A^X_1 = 0 and S^X_1 = 1 and X0 > 0
     { 0        if A^X_1 = 0 and X0 = 0

and

A^Y_1 = { 1   if X0 > 0 and A^X_1 = 0 and S^X_1 = 1
        { 0   otherwise

is one way to write them.]] So Y0 is independent of (X1, A^Y_1). But we know that X1 is independent of whether there is a departure from Andrew’s queue at time 1, which is just the indicator A^Y_1. Therefore, the 3 random variables Y0, X1, and A^Y_1 are independent. Finally, observe that S^Y_1 is independent of (Y0, X1, A^Y_1), so that the 4 random variables Y0, X1, A^Y_1, and S^Y_1 are all independent. Thus, since Y1 is a function of (Y0, A^Y_1, S^Y_1), it follows that X1 and Y1 are independent.


2.4 The Metropolis method

This is a very useful general method for using Markov chains for simulation. The idea is a famous one due to Metropolis et al. (1953), and is known as the Metropolis method. Suppose we want to simulate a random draw from some distribution π on a finite set S. By the Basic Limit Theorem above, one way to do this (approximately) is to find an irreducible, aperiodic probability transition matrix P satisfying πP = π, and then run a Markov chain according to P for a sufficiently long time.

Suppose we have chosen some connected graph structure G = (S, E) on our set S. That is, we think of each element of S as a node, and we imagine a collection E of edges, where each edge joins some pair of nodes. If nodes i and j are joined by an edge, we say i and j are neighbors. Let N(i) denote the set of neighbors of node i. [[We’ll assume that i ∉ N(i) for all i.]] Just to make sure the situation is clear, I want to emphasize that this graph structure is not imposed on us, and there is not a single, unique, magic choice that will work. The set of edges E is ours to choose; we have a great deal of freedom here, and different choices will lead to different matrices P satisfying πP = π.

As a preliminary observation, recall from Example (2.8) that a random walk on G, which has probability transition matrix

(2.9)    Prw(i, j) = { 1/d(i)   if j ∈ N(i)
                     { 0        otherwise

has stationary distribution

πrw(i) = d(i) / ∑_{j∈S} d(j),

where d(i) is the degree of node i. To reduce typographical and conceptual clutter, let us write this as

πrw(i) ∝ d(i),

by omitting the denominator, which is simply a normalization constant [[constant in that it does not depend on i]] that makes the probabilities add up to 1. The Basic Limit Theorem tells us that (assuming aperiodicity holds) if we run the random walk for sufficiently long, then we get arbitrarily close to achieving the distribution πrw.

Thus, simply running a random walk on G would solve our simulation problem if we happened to want to simulate from πrw. In general, however, we will want to simulate from some different, arbitrary distribution π, which we will write in the form

(2.10)    π(i) ∝ d(i)f(i).

That is, we are interested in modifying the relative probabilities of the natural random walk stationary distribution by some multiplicative function f. Our goal here is a simple way to modify the random walk transition probabilities in such a way that the modified probability transition matrix has stationary distribution π. The Metropolis method solves


this problem by defining the probability transition matrix

(2.11)    P(i, j) = { (1/d(i)) min{1, f(j)/f(i)}   if j ∈ N(i)
                    { 1 − ∑_{k∈N(i)} P(i, k)       if j = i
                    { 0                            otherwise.

The verification that this method works is simple.

(2.12) Claim. For π defined by (2.10) and (P (i, j)) defined by (2.11), we have πP = π.

Proof: For j ∈ N(i),

π(i)P(i, j) ∝ f(i) min{1, f(j)/f(i)} = min{f(i), f(j)}

is symmetric in i and j, so we have

(2.13)    π(i)P(i, j) = π(j)P(j, i).

In fact, (2.13) holds for all i and j, since it is trivial to verify (2.13) in the cases when i = j or j ∉ N(i). Summing (2.13) over i gives

∑_i π(i)P(i, j) = π(j) ∑_i P(j, i) = π(j),

so that πP = π.

Notice that the last proof showed that the “detailed balance” equations π(i)P(i, j) = π(j)P(j, i) that characterize time-reversible Markov chains hold.

⊲ As we know from our discussion of the Basic Limit Theorem, the fact that a Markov chain has stationary distribution π does not in itself guarantee that the Markov chain will converge in distribution to π. Exercise [2.10] gives conditions under which this convergence holds.

Running the Metropolis chain (using the P from (2.11)) is actually a simple modification of performing a random walk (using Prw from (2.9)). To run the random walk, at each time, we choose a random neighbor and go there. We can think of running the Metropolis chain as follows. Suppose we are currently at state i and we are about to generate our next random transition. We start out, in the same way as the random walk, by choosing a random neighbor of i; let’s call our choice j. The difference between the Metropolis chain and the random walk is that in the Metropolis chain, we might move to j, or we might stay at i. So let’s think of j as our “candidate” state, and we next make a probabilistic decision about whether to “accept the candidate” and move to j, or “reject the candidate” and stay at i. The probability that we accept the candidate is the extra factor

(2.14)    min{1, f(j)/f(i)}


that appears in (2.11). Thus, having nominated a candidate j, we look at the values f(j) and f(i). If f(j) ≥ f(i), the minimum in (2.14) is 1, and we definitely move to j. If f(j) < f(i), the minimum in (2.14) is f(j)/f(i), and we move to j with probability f(j)/f(i). This makes qualitative sense: for example, if f(j) is much smaller than f(i), this means that, relative to the random walk stationary distribution πrw, our desired distribution π places much less probability on j than on i, so that we should make a transition from i to j much less frequently than the random walk does. This is accomplished in the Metropolis chain by usually rejecting the candidate j.
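
In code, one step of this candidate-and-accept scheme might look like the following sketch (mine, not from the text); it assumes the graph is given as a dict nbrs mapping each state to a list of its neighbors, and f is the function from (2.10).

    import random

    def metropolis_step(i, nbrs, f):
        """One step of the chain (2.11): propose a uniform neighbor of i,
        accept it with probability min{1, f(j)/f(i)}, otherwise stay at i."""
        j = random.choice(nbrs[i])                  # candidate, chosen w.p. 1/d(i)
        if random.random() < min(1.0, f(j) / f(i)):
            return j                                # accept the candidate
        return i                                    # reject: stay at i

Iterating this step produces a chain whose stationary distribution is π(i) ∝ d(i)f(i).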

(2.15) Example. To illustrate the Metropolis method in a simple way, we’ll discuss an artificial toy example where we don’t really need the method. Suppose the distribution π on S = {1, 2, 3, 4, 5} is (4/12, 2/12, 2/12, 2/12, 2/12), and suppose the graph structure we choose on S is the “house” graph from Example (2.8). Thus, we want to be on the roof of the house (state 1) with probability 4/12, and at each of the other states with equal probability, 2/12.

[[Figure: two copies of the house graph—one labeled with the random walk stationary distribution πrw = (2/12, 3/12, 3/12, 2/12, 2/12), the other with the desired distribution π = (4/12, 2/12, 2/12, 2/12, 2/12).]]

Comparing with the distribution πrw that we found in Example (2.8) [[and is reproduced in the figure above]] we can calculate the f ratios to be f(1) = (4/12)/(2/12) = 2, and, similarly, f(2) = 2/3 = f(3) and f(4) = 1 = f(5).

[[Figure: the house graph with the ratios f(1) = 2, f(2) = f(3) = 2/3, and f(4) = f(5) = 1 marked at the nodes.]]

And now the Metropolis formula for P shows that to modify the random walk to have the desired stationary distribution, we run the process depicted in the figure below.


[[Figure: the house graph with the Metropolis transition probabilities marked; for example, P(1, 1) = 2/3, P(1, 2) = P(1, 3) = 1/6, and P(2, 1) = P(2, 3) = P(2, 4) = 1/3.]]

These transition probabilities were calculated as follows. For example,

P(1, 2) = (1/2) min{1, (2/3)/2} = 1/6 = P(1, 3),

P(1, 1) = 1 − 1/6 − 1/6 = 2/3,

P(2, 1) = (1/3) min{1, 2/(2/3)} = 1/3,

P(2, 3) = (1/3) min{1, (2/3)/(2/3)} = 1/3,

P(2, 4) = (1/3) min{1, 1/(2/3)} = 1/3,

P(2, 2) = 1 − 1/3 − 1/3 − 1/3 = 0,

and so on.
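
These numbers are easy to check by machine. Here is a short sketch (mine; the house edge list is the same assumed one as before) that builds the matrix (2.11) for this example and verifies πP = π.

    import numpy as np

    nodes = [1, 2, 3, 4, 5]
    edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 5), (4, 5)]    # assumed house graph
    nbrs = {i: set() for i in nodes}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)

    f = {1: 2.0, 2: 2 / 3, 3: 2 / 3, 4: 1.0, 5: 1.0}            # ratios pi(i)/pi_rw(i)
    P = np.zeros((5, 5))
    for a, i in enumerate(nodes):
        for b, j in enumerate(nodes):
            if j in nbrs[i]:
                P[a, b] = min(1.0, f[j] / f[i]) / len(nbrs[i])  # formula (2.11)
        P[a, a] = 1.0 - P[a].sum()                              # hold at i otherwise

    pi = np.array([4, 2, 2, 2, 2]) / 12
    print(np.allclose(pi @ P, pi))          # True: pi is stationary
    print(P[0, 1], P[0, 0], P[1, 0])        # 1/6, 2/3, 1/3, as computed above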

The Metropolis method has a nice property: actually we do not even quite have to be able to write down or compute the probabilities π(i) in order to be able to use the method to simulate from π! That is, as is clear from (2.11), to run the Metropolis chain, all we need to know are ratios of the form f(j)/f(i), or, equivalently, ratios of the form π(j)/π(i) [[these are equivalent by (2.10), and because we assume we know the degrees d(j) and d(i)]]. That is, we do not have to know any of the π(i)’s explicitly to use the method; all we have to know are the ratios π(j)/π(i). Now this may not seem like a big deal, but there are cases in which we would like to simulate from a distribution π for which the ratios π(j)/π(i) are easy to compute, while the individual probabilities π(j) and π(i) are extremely difficult to compute. This happens when π is known only up to a complicated multiplicative normalization constant. One simple example of this you have already seen: in our problem of simulating a uniformly distributed 4×4 table with given row and column sums, the desired probability of any given table is the reciprocal of the number of tables


satisfying the given restrictions—a number that we do not know! (Remember, in fact, a bonus of being able to generate a nearly uniformly distributed table is that it leads to a method for approximating the number of such tables.) So in this problem, we do not know the individual probabilities of the form π(i). But the ratio π(j)/π(i) is simply 1 for a uniform distribution! Now, simulating from a uniform distribution is admittedly a special case, and a symmetric probability transition matrix will do the trick. For a more general class of examples, in the Bayesian approach to statistics, suppose the unknown parameter of interest is θ ∈ Θ, where Θ is a finite parameter space. Suppose our prior distribution [[probability mass function]] for θ is µ(θ) and the likelihood is P(x | θ). Both µ(θ) and P(x | θ) are known to us because we specify them as part of our probability model. The posterior distribution for θ given x is

P(θ | x) = µ(θ)P(x | θ) / G,

where

G = ∑_{θ′∈Θ} µ(θ′)P(x | θ′).

The sum G may be very difficult to compute; in statistical mechanics it is the infamous “partition function.” However, for given x, if we want to simulate from the posterior distribution P(· | x), we can do so using the Metropolis method; although the distribution itself may be hard to compute because G is hard to compute, the ratios

P(θ1 | x) / P(θ2 | x) = [µ(θ1)P(x | θ1)] / [µ(θ2)P(x | θ2)]

are easy to compute.

2.5 Simulated annealing

Simulated annealing is a recent and powerful technique for addressing large, complicated optimization problems. Although the idea is so simple it may sound naive, the simulated annealing method has enabled people in some cases to find better answers to bigger problems than any previously known method.

Suppose we would like to minimize some “cost function” c(·) defined on a set S. For example, c(·) might be a function of d variables defined on the simplest interesting domain S, namely, the domain S = {0, 1}^d, in which each of the d variables may take on only the two values 0 and 1. That is, this S is the set of 2^d d-tuples i = (i1, . . . , id); we could think of these as the vertices of the d-dimensional unit cube. So for d = 10 variables, S contains 2^10 ≈ 1000 points. If we want to solve a problem with d = 20, 30, or 40 variables, the number of points in S rises to about one million, one billion, and one trillion, respectively. Have our computers gotten fast enough that we can just about handle a trillion points now? Well, if we then just add 20 more variables to the problem, all of a sudden our computers are a million times too slow again. So even though computers are getting faster all the time, clearly our appetite for solving larger and more complex problems grows much, much faster. “But come now, who really deals with functions of 60 or 100 variables?” you


may be wondering. Well, consider, for example, an image processing problem, in which we want to calculate an optimal guess at an image, given a noisy, corrupted version. If we are dealing with a black and white image, so that each pixel can be encoded as a 0 or a 1, then our state space is exactly of the form {0, 1}^d that we have been discussing. How big are typical values of d? Very big: d is the number of pixels in the image, so if we have a (quite crude!) 200 × 200 image, then d = 40,000. This is much bigger than 60 or 100! I hope this gives you some inkling of the inconceivable vastness of many quite ordinary combinatorial optimization problems, and a feeling for the utter hopelessness of ever solving such problems by exhaustive enumerative search of the state space.

Stuart and Donald Geman used simulated annealing in their approach to image restoration. Another famous example of a difficult optimization problem is the traveling salesman problem, in which a salesman is given a set of cities he must visit, and he wants to decide which city to visit first, second, and so on, in such a way that his total travel distance is minimized. For us, in addition to its practical importance, simulated annealing provides a nice illustration of some of the Markov chain ideas we have discussed so far, as well as an excuse to learn something about time-inhomogeneous Markov chains.

2.5.1 Description of the method

The method is a combination of the familiar idea of using Markov chains for simulation and a new idea (“annealing”) that provides the connection to optimization. We have already discussed the very general Metropolis method that allows us to simulate approximately from any desired distribution on S. But what does simulation have to do with optimization or “annealing” or whatever?

Our goal is to find an i ∈ S minimizing c(i), where c is a given cost function defined on the set of nodes S of a graph. As discussed above, an exact solution of this problem may be an unattainable ideal in practice, but we would like to come as close as possible. For each T > 0, define a probability distribution αT = {αT(i) : i ∈ S} on S by

(2.16)    αT(i) = d(i) e^{−c(i)/T} / G(T),

where again d(i) is the degree of node i and of course

G(T) = ∑_{i∈S} d(i) e^{−c(i)/T}

is just the normalizing constant that makes (2.16) a probability distribution. The letter “T” stands for “temperature.”

We have defined a family of probability distributions on S; corresponding to each positive T there is a distribution αT. These distributions have an important property that explains why we are interested in them. To state this property, let S∗ denote the set of global minimizers of c(·), that is,

S∗ = {i∗ ∈ S : c(i∗) = min_{i∈S} c(i)}.


Our goal is to find an element of S∗. Define a distribution α∗ on S∗ by

(2.17)    α∗(i) = d(i) / ∑_{j∈S∗} d(j)

if i ∈ S∗, and α∗(i) = 0 otherwise. The important thing to keep in mind about the distribution α∗ is that it puts positive probability only on globally optimal solutions of our optimization problem.

(2.18) Fact. As T ↓ 0, we have αT →_D α∗, that is, αT(i) → α∗(i) for all i ∈ S.

The symbol “→_D” stands for convergence in distribution.

⊲ Exercise [2.16] asks you to prove (2.18).

Thus, we have found that, as T ↓ 0, the distributions αT converge to the special distribution α∗. If we could somehow simulate from α∗, our optimization problem would be solved: We would just push the “simulate from α∗” button, and out would pop a random element of S∗, which would make us most happy. Of course, we cannot do that, since we are assuming that we do not already have the answer to the minimization problem that we are trying to solve! However, we can do something that seems as if it should be nearly as good: simulate from αT. If we do this for a value of T that is pretty close to 0, then since αT is pretty close to α∗ for that T, presumably we would be doing something pretty good.

So, fix a T > 0. How do we simulate from αT? We can use the Metropolis idea to create a probability transition matrix AT = (AT(i, j)) such that αT AT = αT, and then run a Markov chain according to AT.

[[A note on notation: I hope you aren’t bothered by the use here of α and A for stationary distributions and probability transition matrices related to simulated annealing. The usual letters π and P have been so overworked that using different notation for the special example of simulated annealing should be clearer ultimately. Although they don’t look much like π and P, the letters α and A might be remembered as mnemonic for “simulαted Annealing” at least.]]

A glance at the definition of αT in (2.16) shows that we are in the situation of the Metropolis method as described in (2.10) with the choice f(i) = e^{−c(i)/T}. So as prescribed by (2.11), for j ∈ N(i) we take

(2.19)    AT(i, j) = (1/d(i)) min{1, e^{−c(j)/T} / e^{−c(i)/T}}
                   = (1/d(i)) · { 1                     if c(j) ≤ c(i)
                                { e^{−(c(j)−c(i))/T}    if c(j) > c(i).

The specification of AT is completed by the obvious requirements that AT(i, i) = 1 − ∑_{j∈N(i)} AT(i, j) and AT(i, j) = 0 if j ∉ N(i).

For any fixed temperature T1 > 0, if we run a Markov chain {Xn} having probability

transition matrix AT1 for “a long time,” then the distribution of {Xn} will be very close to


αT1. If our goal is to get the chain close to the distribution α∗, then continuing to run this chain will not do much good, since the distribution of Xn will get closer and closer to αT1, not α∗. So the only way we can continue to “make progress” toward the distribution α∗ is to decrease the temperature to T2, say, where T2 < T1, and continue to run the chain, but now using the probability transition matrix AT2 rather than AT1. Then the distribution of Xn will approach the distribution αT2. Again, after the distribution gets very close to αT2, continuing to run the chain at temperature T2 will not be an effective way to get closer to the desired distribution α∗, so it makes sense to lower the temperature again.

It should be quite clear intuitively that, if we lower the temperature slowly enough, we can get the chain to converge in distribution to α∗. For example, consider “piecewise constant” schedules as discussed in the last paragraph. Given a decreasing sequence of positive temperatures T1 > T2 > . . . such that Tn ↓ 0, we could start by running a chain for a time n1 long enough so that the distribution of the chain at time n1 is within a distance of 1/2 [[in total variation distance, say]] from αT1. Then we could continue to run at temperature T2 until time n2, at which the distribution of the chain is within 1/3 of αT2. Then we could run at temperature T3 until we are within 1/4 of αT3. And so on. Thus, as long as we run long enough at each temperature, the chain should converge in distribution to α∗. We must lower the temperature slowly enough so that the chain can always “catch up” and remain close to the stationary distribution for the current temperature.

Once we have had this idea, we need not lower the temperature in such a piecewise constant manner, keeping the temperature constant for many iterations and only then changing it. Instead, let us allow the temperature to change at each step of the Markov chain. Thus, each time n may have its own associated temperature Tn, and hence its own probability transition matrix ATn. The main theoretical result that has been obtained for simulated annealing says that for any problem there is a cooling schedule of the form

(2.20)    Tn = a / log(n + b)

[[where a and b are constants]] such that, starting from any state i ∈ S, the [[time-inhomogeneous]] Markov chain {Xn} will converge in distribution to α∗, the distribution (2.17) concentrated on the set of global minima.

Accordingly, a simulated annealing procedure may be specified as follows. Choose a “cooling schedule” T0, T1, . . .; the schedules we will discuss later will have the property that Tn ↓ 0 as n → ∞. Choose the initial state X0 according to a distribution ν0. Let the succession of states X0, X1, X2, . . . form a time-inhomogeneous Markov chain with probability transition matrices AT0, AT1, AT2, . . ., so that

P{Xn+1 = j | Xn = i} = ATn(i, j)

and

(2.21)    Xn ∼ νn = ν0 AT0 AT1 · · · ATn−1.

By the way, what is all this talk about “temperature,” “cooling schedule,” “annealing,” and stuff like that? I recommend you consult the article by Kirkpatrick et al., but I’ll try to say a few words here. Let’s see, I’ll start by looking up the word “anneal” in my dictionary. It gives the definition


To free (glass, metals, etc.) from internal stress by heating and gradually cooling.

The idea, as I understand it, is this. Suppose we want to have a metal that is as “stress-free” as possible. That corresponds to the metal being “relaxed,” in a low energy state. The metal is happy when it has low energy, and it wants to get to such a state, just as a ball will roll downhill to decrease its potential energy. In the liquid state, the atoms of a metal are all sliding around in all sorts of random orientations. If we quickly freeze the metal into a solid by cooling it quickly, its atoms will freeze into a configuration that has all sorts of haphazard orientations that give the metal unnecessary excess potential energy. In contrast, if we start with the metal as a liquid and then cool it extremely slowly, then the atoms have plenty of time to explore around, find, and work themselves into a nice, happy, ordered, low-energy configuration. Thus, nature can successfully address an optimization problem—minimization of the energy of the metal—by slow cooling. I hope that gives the flavor of the idea at least.

For another image that captures certain aspects of simulated annealing, imagine a bumpy frying pan containing a marble made of popcorn. At high temperature, the marble is jumping around quite wildly, ignoring the ups and downs of the pan.

As we lower the temperature, the popcorn begins to settle down into the lower parts of the pan.


I’d like to make sure it is clear to you how to run the method, so that you will be able to implement it yourself and experiment with it if you’d like. So let us pause for a moment to imagine what it is like to move around according to a Markov chain having probability transition matrix AT specified by (2.19). Here is one way to describe it. Suppose that we are now at time n sitting at node i of the graph, and we are trying to decide where to go next. First we choose at random one of the d(i) neighbors of i, each neighbor being chosen with the same probability 1/d(i). Say we choose node j. Then j becomes our “candidate” for where to go at time n + 1; in fact, at time n + 1 we will either move to j (accept the candidate) or stay at i. To decide between these two alternatives, we must look at the values of c(i) and c(j). If c(j) ≤ c(i), we definitely move to j. If c(j) > c(i), we might or might not move to j; in fact we move to j with probability e^{−(c(j)−c(i))/T} and stay at i with the complementary probability 1 − e^{−(c(j)−c(i))/T}. At high temperature, even when c(j) > c(i), the probability e^{−(c(j)−c(i))/T} is close to 1, so that we accept all candidates with high probability. Thus, at high temperature, the process behaves nearly like a random walk on S, choosing candidates as a random walk would, and almost always accepting them. At lower temperatures, the process still always accepts “downhill moves” [[those with c(j) ≤ c(i)]], but has a lower probability of accepting uphill moves. At temperatures very close to zero, the process very seldom moves uphill.
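
Putting the pieces together, one run of the procedure can be sketched as follows (my own illustrative Python, not the author’s program; it assumes a neighbor dict nbrs and a cost function c, and uses a logarithmic cooling schedule of the form (2.20)).

    import math
    import random

    def simulated_annealing(x0, nbrs, c, a=1.0, b=2.0, n_steps=100_000):
        """Time-inhomogeneous chain with transition matrices A_Tn from (2.19),
        using the cooling schedule Tn = a / log(n + b)."""
        x, best = x0, x0
        for n in range(n_steps):
            T = a / math.log(n + b)              # cooling schedule (2.20)
            j = random.choice(nbrs[x])           # candidate neighbor
            if c(j) <= c(x) or random.random() < math.exp(-(c(j) - c(x)) / T):
                x = j                            # downhill moves always accepted
            if c(x) < c(best):
                best = x                         # track the best state visited
        return best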

Note that when c(j) ≤ c(i), moving to j “makes sense”; since we are trying to minimize c(·), decreasing c looks like progress. However, simulated annealing is not just another “descent method,” since we allow ourselves positive probability of taking steps that increase the value of c. This feature of the procedure prevents it from getting stuck in local minima.

You may have asked yourself the question, “If we want to bring the temperature down to 0, then why don’t we just run the chain at temperature 0?” But now you can see that at temperature T = 0, the process will never make an uphill move. Thus, running at temperature 0 is a descent method, which will get stuck in local minima, and therefore will not approach global minima. At temperature 0, the chain all of a sudden loses the nice properties it has at positive temperatures—it is no longer irreducible, and so the


basic limit theorem doesn’t apply.

(2.22) Example [Traveling salesman problem]. The figure below shows the location of 40 cities. A traveling salesman who lives in one of the cities wants to plan a tour in such a way that he visits all of the cities and then returns home while traveling the shortest possible total distance.

[[Figure: the locations of the 40 cities, scattered over the unit square.]]

To solve this problem by simulated annealing, we start with some legal tour. We’ll start in a very dumb way, with a tour that is simply a random permutation of the 40 cities.

[[Figure: the starting tour, a random permutation of the 40 cities.]]

As we know, simulated annealing gradually modifies and, we hope, improves the tour by proposing random modifications of the tour, and accepting or rejecting those modifications randomly based on the current value of the temperature parameter and the amount of improvement or deterioration produced by the proposed modification.

Here our state space S is the set of all possible traveling salesman tours. Each such tour can be specified by a permutation of the cities; for example, if there are 8 cities, the tour (1, 3, 5, 7, 2, 4, 6, 8) means that we start at city 1, then go to 3, 5, 7, 2, 4, 6, and 8 in that order, and finally return from 8 back to 1. [[Note that since the tours are closed circuits because the salesman returns to his starting point, we could consider, for example, the tour (3, 5, 7, 2, 4, 6, 8, 1) to be the same tour as (1, 3, 5, 7, 2, 4, 6, 8). So in this way different


permutations can represent the same tour. We could say these two equivalent permutations are “rotations” of each other.]]

Recall that the Metropolis method, and simulated annealing, require us to choose a neighborhood structure on S. That is, for each tour i in S, we must say which other tours are the neighbors of i, and then the simulated annealing method chooses a random candidate neighbor at each iteration. There are many possible ways to do this. One way that seems to move around S nicely and is also easy to work with is to choose two cities on the tour randomly, and then reverse the portion of the tour that lies between them. For example, if we are currently at the tour (1, 2, 3, 4, 5, 6, 7, 8), we might choose the two cities 4 and 6, and change from the tour (1, 2, 3, [4, 5, 6], 7, 8) to the neighboring tour (1, 2, 3, [6, 5, 4], 7, 8), where the brackets mark the reversed segment.

Another convenient feature of this definition of neighbors is that it is easy to calculate the difference in total lengths between neighboring tours; no matter how many cities we are considering, the difference in tour lengths is calculated using only four intercity distances — the edges that are broken and created as we break out part of the tour and then reattach it in the reversed direction.
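
The reversal move and its four-distance length change can be coded as in this sketch (my own, not the author’s program; tour is a list of city indices and dist[u][v] is the distance between cities u and v).

    def reverse_delta(tour, dist, i, j):
        """Change in tour length if the segment tour[i..j] is reversed
        (assumes 0 < i <= j < len(tour) - 1). Only the two broken edges
        and the two newly created edges matter."""
        a, b = tour[i - 1], tour[i]      # edge (a, b) is broken
        c, d = tour[j], tour[j + 1]      # edge (c, d) is broken
        return (dist[a][c] + dist[b][d]) - (dist[a][b] + dist[c][d])

    def reverse_segment(tour, i, j):
        """The neighboring tour obtained by reversing tour[i..j]."""
        return tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]

At each step one would pick i and j at random, compute the delta, and accept or reject the reversed tour by the annealing rule above.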

Using the type of moves just described, it was quite easy to write a computer program, and fun to watch it run. Here is how it went:

[[Figure: four snapshots of the tour — Time 1000, Temp 0.5, Length 15.5615; Time 2000, Temp 0.212, Length 10.0135; Time 5000, Temp 0.089, Length 7.6668; Time 10000, Temp 0.058, Length 5.07881.]]

If we were watching a movie of this, we would see the path slithering around and gradually untangling itself. The slithering is violent at high temperatures. It starts to get the gross features right, removing those silly long crossings from one corner of the picture to another. And now let’s watch it continue.


[[Figure: four later snapshots — Time 20000, Temp 0.038, Length 4.52239; Time 30000, Temp 0.0245, Length 4.11739; Time 40000, Temp 0.0245, Length 4.00033; Time 66423, Temp 0.0104, Length 3.96092.]]

Our salesman ends up with a very nice tour. I checked to make sure this tour is a local optimum — no neighboring tour is better. Maybe it’s even the globally optimal tour.

Doesn’t this method of random searching seem extraordinarily dumb? No particular knowledge or strategic ideas about the particular problem are built in. The Markov property means that there is no memory; any successes or failures from the past are forgotten. But then again, evolution is “dumb” in the same way, accumulating random changes, gently guided by natural selection, which makes favorable changes more likely to persist. And look at all of the marvelous solutions evolution has produced.

2.5.2 The Main Theorem

First some notation. Define the matrix A^(n) by

A^(n) = ∏_{k=0}^{n−1} ATk.

Recalling that νn denotes the distribution of Xn, (2.21) says that νn = ν0 A^(n). Let A^(n)(i, ·) denote the ith row of the matrix A^(n). This row may be interpreted as the distribution of Xn given that X0 = i. We will continue using the notation ‖µ − ν‖ for the total variation distance

‖µ − ν‖ = sup_{A⊂S} |µ(A) − ν(A)| = (1/2) ∑_{i∈S} |µ(i) − ν(i)|


between the distributions µ and ν.

We also need some definitions related to the structure of the graph G and the cost function c(·). For two nodes i, j ∈ G, define the distance ρ(i, j) to be the length [[i.e., number of edges]] of a path from i to j of minimal length. Define the radius r of G by

r = min_{i∈Sc} max_{j∈S} ρ(i, j),

where

Sc = {i ∈ S : c(j) > c(i) for some j ∈ N(i)},

that is, Sc is the set of all nodes that are not local maxima of the function c. [[Note that r actually depends on the function c in addition to the graph G, but we won’t indicate this in the notation.]] Define the number L to be the largest “local fluctuation” of c(·), that is,

L = max_{i∈S} max_{j∈N(i)} |c(j) − c(i)|.

Finally, recall the definition of α∗ from (2.17).

After only 38 million definitions (counting multiplicity), we are ready to state the main theorem.

(2.23) Theorem [Main Theorem]. For any cooling schedule T0, T1, . . . satisfying

(i) Tn+1 < Tn for all n ≥ 0,

(ii) Tn → 0 as n → ∞,

(iii) ∑_k exp(−rL/T_{kr−1}) = ∞,

we have

‖A^(n)(i, ·) − α∗‖ → 0

as n → ∞, for all i ∈ S.

The proof of this theorem is a rather easy application of some theory of time-inhomogeneous Markov chains. We’ll develop this theory in the next section, and then come back to simulated annealing to prove Theorem (2.23).

Here is a specific family of cooling schedules that satisfy the conditions of the theorem: taking γ ≥ rL, let

Tn = γ / log(n)

for n > 1. [[And let T0 and T1 be arbitrary, subject to the monotonicity condition (i).]] It is easy to check that the conditions of the theorem hold; (iii) boils down to the statement that ∑_k k^{−p} = ∞ if p ≤ 1.
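
To spell out that reduction: with Tn = γ/log(n), we have exp(−rL/T_{kr−1}) = exp(−(rL/γ) log(kr − 1)) = (kr − 1)^{−rL/γ}, which for large k is at least a constant multiple of k^{−rL/γ}; since rL/γ ≤ 1, the series in (iii) diverges, just as ∑_k k^{−p} does for p ≤ 1.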

than almost sure (“a.s.”) convergence [[i.e., convergence with probability 1]]. There is goodreason for this: a.s. convergence indeed does not hold in general for simulated annealing.Intuitively, we should not expect to be able to get a.s. convergence—if our procedure has


the nice property of always escaping eventually from local minima, then we must live with the fact that it will also eventually “escape” from global minima. That is, the process will not get “stuck” in [[that is, converge to]] a global minimum. It is true that, along a typical sample path, the process will spend a larger and larger fraction of its time at or near global minima as n increases. However, it will also make infinitely many (although increasingly rare) excursions away from global minima.

⊲ I would say that the theorem is certainly elegant and instructive. But if you come away thinking it is only a beginning and is not yet answering the “real” question, I would consider that a good sign. Exercise [2.15] asks for your thoughts on the matter.

2.6 Ergodicity Concepts for Time-Inhomogeneous Markov Chains

In this section we will work in the general setting of a time-inhomogeneous Markov chain on a countably infinite state space S; we will revert to finite S when we return to the specific problem of simulated annealing.

(2.24) Notation. For a time-inhomogeneous Markov chain {Xn}, let Pn denote the probability transition matrix governing the transition from Xn to Xn+1, that is, Pn(i, j) = P{Xn+1 = j | Xn = i}. Also, for m < n define P^{(m,n)} = ∏_{k=m}^{n−1} Pk, so that P^{(m,n)}(i, j) = P{Xn = j | Xm = i}.

(2.25) Definition. {Xn} is strongly ergodic if there exists a probability distribution π∗ on S such that

lim_{n→∞} sup_{i∈S} ‖P^{(m,n)}(i, ·) − π∗‖ = 0 for all m.

For example, we will prove Theorem (2.23) by showing that under the stated conditions, the simulated annealing chain {Xn} is strongly ergodic, with limiting distribution α∗ as given in (2.17).

The reason for the modifier “strongly” is to distinguish the last concept from the following weaker one.

(2.26) Definition. {Xn} is weakly ergodic if

lim_{n→∞} sup_{i,j∈S} ‖P^{(m,n)}(i, ·) − P^{(m,n)}(j, ·)‖ = 0 for all m.

To interpret these definitions a bit, notice that weak ergodicity is a sort of “loss of memory” concept. It says that at a large enough time n, the chain has nearly “forgotten” its state at time m, in the sense that the distribution at time n would be nearly the same no matter what the state was at time m. However, there is no requirement that the distribution be converging to anything as n → ∞. The concept that incorporates convergence in addition to loss of memory is strong ergodicity.


What is the role of the “for all m” requirement? Why not just use lim_{n→∞} sup_{i∈S} ‖P^{(0,n)}(i, ·) − π∗‖ = 0 for strong ergodicity and lim_{n→∞} sup_{i,j∈S} ‖P^{(0,n)}(i, ·) − P^{(0,n)}(j, ·)‖ = 0 for weak ergodicity? Here are a couple of examples to show that these would not be desirable definitions. Let S = {1, 2} and

P0 = [ 1/4  3/4 ]
     [ 1/4  3/4 ].

Then with these definitions, {Xn} would be strongly ergodic even if Pk = I for all k ≥ 1, and {Xn} would be weakly ergodic for any sequence of probability transition matrices P1, P2, . . .. This seems silly. We want more “robust” concepts that cannot be determined just by one matrix P0, but rather depend on the whole sequence of transition matrices.

Incidentally, since our goal with respect to simulated annealing is the main theorem (2.23) above, we are really interested in proving strong ergodicity. However, we will find weak ergodicity to be a useful stepping stone on the way toward that goal.

2.6.1 The Ergodic Coefficient

This will be a useful quantity in formulating sufficient conditions for weak and strong ergodicity. For a probability transition matrix P = (P(i, j)), the ergodic coefficient δ(P) of P is defined to be the maximum total variation distance between pairs of rows of P, that is,

δ(P) = sup_{i,j∈S} ‖P(i, ·) − P(j, ·)‖
     = (1/2) sup_{i,j∈S} ∑_{k∈S} |P(i, k) − P(j, k)|
     = sup_{i,j∈S} ∑_{k∈S} (P(i, k) − P(j, k))^+.

The basic idea here is that δ(P) being small is “good” for ergodicity. For example, the extreme case is δ(P) = 0, in which case all of the rows of P are identical, and so P would cause a Markov chain to lose its memory completely in just one step: ν1 = ν0 P does not depend on ν0.
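In code, δ(P) is a short calculation over pairs of rows; the following small Python sketch (my own, not from the text) computes it from the last expression in the display above.

def ergodic_coefficient(P):
    """delta(P): the maximum, over pairs of rows (i, j), of the total variation
    distance between the rows, computed as sum_k (P[i][k] - P[j][k])^+ ."""
    n = len(P)
    return max(
        sum(max(P[i][k] - P[j][k], 0.0) for k in range(n))
        for i in range(n) for j in range(n)
    )

P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
print(ergodic_coefficient(P))   # 0.5 for this example (rows 0 and 2 differ the most)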

Here is a useful lemma.

(2.27) Lemma. δ(PQ) ≤ δ(P )δ(Q) for probability transition matrices P , Q.

Proof: By definition, δ(PQ) = sup_{i,j∈S} ∑_{k∈S} [(PQ)_{ik} − (PQ)_{jk}]^+, where here and throughout this proof, for readability let us use subscripts to denote matrix entries. Fix a pair of states i, j, and let A = {k ∈ S : (PQ)_{ik} > (PQ)_{jk}}. Then

∑_{k∈S} [(PQ)_{ik} − (PQ)_{jk}]^+ = ∑_{k∈A} [(PQ)_{ik} − (PQ)_{jk}]
  = ∑_{k∈A} ∑_{l∈S} [P_{il} Q_{lk} − P_{jl} Q_{lk}]
  = ∑_{l∈S} [P_{il} − P_{jl}] ∑_{k∈A} Q_{lk}
  ≤ ∑_{l∈S} (P_{il} − P_{jl})^+ [sup_m ∑_{k∈A} Q_{mk}] − ∑_{l∈S} (P_{il} − P_{jl})^− [inf_m ∑_{k∈A} Q_{mk}]
  = ∑_{l∈S} (P_{il} − P_{jl})^+ [sup_m ∑_{k∈A} Q_{mk} − inf_m ∑_{k∈A} Q_{mk}]
  ≤ [∑_{l∈S} (P_{il} − P_{jl})^+] δ(Q)
  ≤ δ(P) δ(Q),

where the last equality uses the fact that ∑_l (P_{il} − P_{jl})^+ = ∑_l (P_{il} − P_{jl})^−, which holds because P is a probability transition matrix. Since i and j were arbitrary in S, we are done.

⊲ Exercises [2.17] and [2.18] lead to an alternative proof of Lemma (2.27) using coupling.
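A quick numerical sanity check of Lemma (2.27) may also be reassuring. The sketch below is my own illustration: it draws random stochastic matrices (nothing from the text) and reuses the ergodic_coefficient function from the earlier sketch.

import random

def random_stochastic(n):
    """A random n x n probability transition matrix (rows sum to 1)."""
    rows = []
    for _ in range(n):
        w = [random.random() for _ in range(n)]
        s = sum(w)
        rows.append([x / s for x in w])
    return rows

def matmul(P, Q):
    n = len(P)
    return [[sum(P[i][l] * Q[l][k] for l in range(n)) for k in range(n)]
            for i in range(n)]

random.seed(1)
for _ in range(1000):
    P, Q = random_stochastic(4), random_stochastic(4)
    # delta(PQ) <= delta(P) * delta(Q), allowing for floating-point slack
    assert ergodic_coefficient(matmul(P, Q)) <= ergodic_coefficient(P) * ergodic_coefficient(Q) + 1e-12
print("delta(PQ) <= delta(P) delta(Q) held in all trials")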

2.6.2 Sufficient Conditions for Weak and Strong Ergodicity

Sufficient conditions are given in the next two propositions.

(2.28) Proposition. If there exist n0 < n1 < n2 < · · · such that ∑_k [1 − δ(P^{(n_k, n_{k+1})})] = ∞, then {Xn} is weakly ergodic.

(2.29) Proposition. If {Xn} is weakly ergodic and if there exist π0, π1, . . . such that πn is a stationary distribution for Pn for all n and ∑_n ‖πn − πn+1‖ < ∞, then {Xn} is strongly ergodic. In that case, the distribution π∗ in the definition (2.25) is given by π∗ = lim_{n→∞} πn.

Recall that strong ergodicity is like weak ergodicity [[loss of memory]] together with convergence. The extra condition ∑_n ‖πn − πn+1‖ < ∞ is giving this convergence in (2.29).

Proof of Proposition (2.28). It follows directly from the definitions we have given that weak ergodicity is equivalent to the condition that lim_{n→∞} δ(P^{(m,n)}) = 0 for all m. By assumption, ∑_{k≥K} [1 − δ(P^{(n_k, n_{k+1})})] = ∞ for all K. We will use the following little fact about real numbers: if 0 ≤ a_k ≤ 1 for all k and ∑_k a_k = ∞, then ∏_k (1 − a_k) = 0. [[Proof: under the assumed conditions, 0 ≤ ∏(1 − a_k) ≤ ∏ e^{−a_k} = e^{−∑ a_k} = 0.]] From this we obtain ∏_{k≥K} δ(P^{(n_k, n_{k+1})}) = 0 for all K. That is, lim_{L→∞} ∏_{k=K}^{L−1} δ(P^{(n_k, n_{k+1})}) = 0 for all K. However, from Lemma (2.27),

δ(P^{(n_K, n_L)}) ≤ ∏_{k=K}^{L−1} δ(P^{(n_k, n_{k+1})}).

Therefore, lim_{L→∞} δ(P^{(n_K, n_L)}) = 0 for all K. Clearly this implies that lim_{n→∞} δ(P^{(m,n)}) = 0 for all m.

Proof of Proposition (2.29). Suppose that {Xn} is weakly ergodic and ∑ ‖πn − πn+1‖ < ∞, where each πn is stationary for Pn. Let π∗ = lim πn; clearly the limit exists by the assumption ∑ ‖πn − πn+1‖ < ∞. Let k be fixed. Then for any l > k and m > l we have

(2.30)    ‖P^{(k,m)}(i, ·) − π∗‖ ≤ ‖P^{(k,m)}(i, ·) − πl P^{(l,m)}‖ + ‖πl P^{(l,m)} − πm‖ + ‖πm − π∗‖.

Let ǫ > 0. We will show that the right-hand side can be made less than ǫ if m is large enough; we’ll do this by making a judicious choice of l. The last term is the simplest; clearly there exists M3 such that ‖πm − π∗‖ ≤ ǫ/3 for all m ≥ M3. For the second term, note that

πl P^{(l,m)} = πl Pl P^{(l+1,m)} = πl P^{(l+1,m)} = πl+1 P^{(l+1,m)} + [πl − πl+1] P^{(l+1,m)},

so that

πl P^{(l,m)} − πm = [πl+1 P^{(l+1,m)} − πm] + [πl − πl+1] P^{(l+1,m)}.

Applying this relation recursively gives

πl P^{(l,m)} − πm = ∑_{n=l}^{m−1} [πn − πn+1] P^{(n+1,m)}.

So ‖πl P^{(l,m)} − πm‖ ≤ ∑_{n=l}^{m−1} ‖πn − πn+1‖. [[Why? Exercise.]] Therefore, since ∑ ‖πn − πn+1‖ < ∞, we can make ‖πl P^{(l,m)} − πm‖ ≤ ǫ/3 by taking m and l large enough, say, m ≥ l ≥ L2.

Finally, for the first term on the right-hand side of (2.30), note that

‖P^{(k,m)}(i, ·) − πl P^{(l,m)}‖ = ‖[P^{(k,l)}(i, ·) − πl] P^{(l,m)}‖.

However, for any given l, we can make the last expression less than ǫ/3 by taking m large enough—by weak ergodicity, at a large enough time m, the chain doesn’t “remember” whether its distribution at time l was P^{(k,l)}(i, ·) or πl! So, for all l, there is an M1(l) such that if m ≥ M1(l) then ‖P^{(k,m)}(i, ·) − πl P^{(l,m)}‖ < ǫ/3. So we are done: if m ≥ max{M3, M1(L2)}, then sup_i ‖P^{(k,m)}(i, ·) − π∗‖ ≤ ǫ.

Notice how the hypotheses of the last result were used to get the terms on the right-hand side of (2.30) small: weak ergodicity took care of the first term, and ∑ ‖πn − πn+1‖ < ∞ took care of the other two.


2.7 Proof of the Main Theorem of Simulated Annealing

We will show that if conditions (i)–(iii) of Theorem (2.23) hold, then the time-inhomogeneous Markov chain {Xn} for simulated annealing satisfies the sufficient conditions for strong ergodicity.

We have used two sets of parallel notation in the last two sections; one (involving α’s and A’s) that we introduced specifically to discuss simulated annealing and one (involving π’s and P’s) used in the general theory of time-inhomogeneous Markov chains. Let’s relate them and make sure there is no confusion. The general notation Pn refers to the probability transition matrix used in going from Xn to Xn+1. In the case of simulated annealing, since the chain is operating at temperature Tn during that transition, we have Pn = A_{T_n}. Similarly, πn, the stationary distribution associated with Pn, is α_{T_n} for simulated annealing. We’ll also continue to use the notation P^{(m,n)} = ∏_{k=m}^{n−1} Pk = ∏_{k=m}^{n−1} A_{T_k}.

To establish weak ergodicity, we want to show that the sufficient condition ∑_k [1 − δ(P^{(n_k, n_{k+1})})] = ∞ holds for some sequence {n_k}. In fact, we will show that the condition holds for the sequence n_k = kr, that is,

(2.31)    ∑_k [1 − δ(P^{(kr−r, kr)})] = ∞.

We will do this by finding an upper bound for δ(P^{(kr−r,kr)}) that will guarantee that δ(P^{(kr−r,kr)}) does not get too close to 1.

Recall the definition of the radius r = min_{i∈S_c} max_{j∈S} ρ(i, j). Let i0 denote a center of the graph, that is, a node at which the minimum in the definition of r is attained. Also recall the definition L = max_{i∈S} max_{j∈N(i)} |c(j) − c(i)|.

(2.32) Claim. For all sufficiently large m we have

P^{(m−r,m)}(i, i0) ≥ D^{−r} exp(−rL/T_{m−1}) for all i ∈ S,

where D = max_j d(j).

Proof: It follows immediately from the definitions of P_{ij}(n), D, and L that for all n, i ∈ S, and j ∈ N(i), we have

P_{ij}(n) = (1/d(i)) min{1, e^{−(c(j)−c(i))/T_n}} ≥ D^{−1} e^{−L/T_n}.

Since we did not allow i0 to be a local maximum of c(·), we must have {j ∈ N(i0) : c(j) > c(i0)} nonempty, so that as n → ∞,

P_{i0,i0}(n) = 1 − ∑_{j∈N(i0)} P_{i0,j}(n) → 1 − ∑_{j∈N(i0): c(j)≤c(i0)} 1/d(i0) = ∑_{j∈N(i0): c(j)>c(i0)} 1/d(i0) > 0.

Therefore, P_{i0,i0}(n) ≥ D^{−1} exp(−L/T_n) clearly holds for large enough n.

Let i ∈ S, and consider P^{(m−r,m)}(i, i0) = P{Xm = i0 | X_{m−r} = i}. Clearly this probability is at least the conditional probability that, starting from i at time m − r, the


chain takes a specified shortest path from i to i0 [[which is of length at most r, by the definition of r]], and then holds at i0 for the remaining time until m. However, by the previous paragraph, if m is large enough, this probability is in turn bounded below by

∏_{n=m−r}^{m−1} D^{−1} e^{−L/T_n} ≥ D^{−r} e^{−rL/T_{m−1}},

where we have used the assumption that Tn is decreasing in n here. Thus, P^{(m−r,m)}(i, i0) ≥ D^{−r} e^{−rL/T_{m−1}} for large enough m.

Taking m = kr in the last claim, we get for all sufficiently large k that

P^{(kr−r,kr)}(i, i0) ≥ D^{−r} e^{−rL/T_{kr−1}} for all i.

Thus, P^{(kr−r,kr)} is a probability transition matrix having a column [[namely, column i0]] all of whose entries are at least D^{−r} exp(−rL/T_{kr−1}). Next we use the general observation that if a probability transition matrix Q has a column all of whose entries are at least a, then δ(Q) ≤ 1 − a. [[Exercise [2.19] asks you to prove this.]] Therefore, for large enough k, we have 1 − δ(P^{(kr−r,kr)}) ≥ D^{−r} exp(−rL/T_{kr−1}), which, by assumption (iii) of the main theorem, gives (2.31). This completes the proof of weak ergodicity.

Finally, we turn to the proof of strong ergodicity. Recall that the stationary distribution πn for Pn is given by πn(i) = (G(n))^{−1} d(i) exp[−c(i)/T_n]. By our sufficient conditions for strong ergodicity, we will be done if we can show that ∑ ‖πn+1 − πn‖ < ∞. This will be easy to see from the following monotonicity properties of the stationary distributions πn.

(2.33) Lemma. If i ∈ S∗ then πn+1(i) > πn(i) for all n. If i /∈ S∗ then there exists ni such that πn+1(i) < πn(i) for all n ≥ ni.

Thus, as the temperature decreases, the stationary probabilities of the optimal states increase. Also, for each nonoptimal state, as the temperature decreases to 0, the stationary probability of that state decreases eventually, that is, when the temperature is low enough. The proof is calculus; just differentiate away. I’ll leave this as an exercise.

From this nice, monotonic behavior, the desired result follows easily. Letting n̄ denote max{ni : i /∈ S∗},

∑_{n=n̄}^∞ ‖πn+1 − πn‖ = ∑_{n=n̄}^∞ ∑_{i∈S} (πn+1(i) − πn(i))^+
  = ∑_{n=n̄}^∞ ∑_{i∈S∗} (πn+1(i) − πn(i))
  = ∑_{i∈S∗} ∑_{n=n̄}^∞ (πn+1(i) − πn(i))
  = ∑_{i∈S∗} (π∗(i) − π_{n̄}(i)) ≤ ∑_{i∈S∗} π∗(i) = 1,

so that clearly ∑_{n=n̄}^∞ ‖πn+1 − πn‖ < ∞. This completes the proof of strong ergodicity.

REFERENCES: This proof came from a paper of Mitra, Romeo, and Sangiovanni-Vincentelli called “Convergence and finite-time behavior of simulated annealing” [Advances in Applied Probability, 18, 747–771 (1986)]. The material on time-inhomogeneous Markov chains is covered very nicely in the book Markov Chains: Theory and Applications by Isaacson and Madsen (1976).

2.8 Card Shuffling: Speed of Convergence to Stationarity

We have seen that for an irreducible, aperiodic Markov chain {Xn} having stationary distribution π, the distribution πn of Xn converges to π in the total variation distance. For example, this was used in Example (1.22), which showed how to generate a nearly uniformly distributed 4 × 4 table having given row and column sums by simulating a certain Markov chain for a long enough time. The inevitable question is: How long is “long enough”? We could ask how close (in total variation, say) is the Markov chain to being uniformly distributed after 100 steps? How about 1000? 50 billion?

In certain simple Markov chain examples we have discussed, it is easy to figure out the rate of convergence of πn to π. For instance, for the Markov frog example (1.2), starting in the initial distribution π0 = (1, 0, 0), say, we can compute the distributions π1, π2, . . . by matrix multiplication and compare them to the stationary distribution π = (1/4, 3/8, 3/8), getting the results shown in Figure (2.34).

[[Figure: two panels plotting, against time, the total variation distance from stationarity (left) and the log total variation distance from stationarity (right).]]

(2.34) Figure. Speed of convergence to stationarity in Markov frog example from Chapter 1.

Notice the smooth geometric rate of decrease: the log distance decreases linearly, which means that the distance decreases to 0 geometrically. Coupling is a technique that can sometimes shed some light on questions of this sort. For example, in Exercise ([1.27]) we showed that ‖πn − π‖ ≤ (2/3)(11/16)^n for all n, [[and, in fact, ‖πn − π‖ ≤ (2/3)(1/4)^n]], which gives much more information than just saying that ‖πn − π‖ → 0.

In this section we will concentrate on a simple shuffling example considered by Aldous

and Diaconis in their article “Shuffling cards and stopping times.” Again, the basic question


is: How close is the deck to being “random” (i.e. uniformly distributed over the 52! possible permutations) after n shuffles? Or, put another way, how many shuffles does it take to shuffle well? Of course, the answer depends on the details of the method of shuffling; Aldous and Diaconis say that for the riffle shuffle model, which is rather realistic, the answer is “about 7.” In this sort of problem, in contrast to the smooth and steady sort of decrease depicted in Figure (2.34), we will see that an interesting threshold phenomenon arises.

2.8.1 “Top-in-at-random” Shuffle

This shuffle, which is the model of shuffling we will analyze in detail, is simpler than the riffle shuffle for which Aldous and Diaconis gave the answer “about 7 is enough.” The analysis we will discuss here also comes from their paper.

The top-in-at-random method would be a really silly way to shuffle a deck of cards in practice, but it is an appealingly simple model for a first example to study. One such shuffle consists of taking the top card off the deck and then inserting it back into the deck in a random position. “Random” here means that the top card could end up back on top (in which case the shuffle didn’t change the deck at all), or it could end up second from top, third from top, ..., or at the bottom of the deck: altogether 52 equally likely positions.
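As a quick illustration (my own sketch, not from the paper), one top-in-at-random shuffle can be coded in a few lines, with deck[0] playing the role of the top card.

import random

def top_in_at_random(deck):
    """One shuffle: remove the top card and reinsert it at a uniformly random
    position.  deck[0] is the top of the deck; a new list is returned."""
    deck = deck[:]                            # work on a copy
    top = deck.pop(0)
    pos = random.randrange(len(deck) + 1)     # 0 = back on top, ..., last = bottom
    deck.insert(pos, top)
    return deck

random.seed(0)
deck = list(range(1, 53))     # card 1 on top, ..., card 52 on the bottom
for _ in range(10):
    deck = top_in_at_random(deck)
print(deck[:5])               # the first few positions after 10 shuffles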

Repeated performances of this shuffle on a deck of cards produce a sequence of “states” of the deck. This sequence of states forms a Markov chain {Xn} having state space S52, the group of permutations of the cards. This Markov chain is irreducible, aperiodic, and has stationary distribution π = Uniform on S52 (i.e. probability 1/(52!) for each permutation); therefore, by the Basic Limit Theorem, we may conclude that ‖πn − π‖ → 0 as n → ∞.

⊲ Explaining each assertion in the previous sentence is Exercise [2.20].

2.8.2 Threshold Phenomenon

Suppose we are working with a fresh deck of d cards, which we know are in the original order: card 1 on top, card 2 just below card 1, ..., and card d on the bottom. Then ‖π0 − π‖ = 1 − (1/d!). We also know that ‖πn − π‖ → 0 as n → ∞, by the Basic Limit Theorem. It is natural to presume that the distance from stationarity ‖πn − π‖ decreases to 0 in some smooth, uneventful manner. However, the fact of the matter is that ‖πn − π‖ stays close to 1 for a while, then it undergoes a rather abrupt decrease from nearly 1 to nearly 0, and this abrupt change happens in a relatively small neighborhood of the value n = d log d. That is, for large d the graph of ‖πn − π‖ versus n looks rather like the next picture.


[[Figure: sketch of ‖πn − π‖ versus n, staying near 1 and then dropping sharply to near 0 in a neighborhood of n = d log d.]]

The larger the value of the deck size d, the sharper (relative to d log d) the drop in ‖πn − π‖ near n = d log d.

Well, this is hardly a gradual or uneventful decrease! This interesting behavior of ‖πn − π‖ has been called the threshold phenomenon. The phenomenon is not limited to this particular shuffle or even to shuffling in general, but rather seems to occur in a wide variety of interesting Markov chains. Aldous and Diaconis give a number of examples in which threshold phenomena can be shown to occur, but they point out that the phenomenon is not yet very well understood, in the sense that general conditions under which the phenomenon does or does not occur are not known.

[[Figure: sketch of P{Sd > n} versus n.]]

This seems weird, doesn’t it? As a partial antidote to the uneasy feelings of mystery that tend to accompany a first exposure to the threshold phenomenon idea, let’s think about a simple problem that is familiar to all of us, in which a threshold of sorts also occurs. Suppose X1, X2, . . . are iid random variables with mean 1 and variance 1. In fact, for simplicity, let’s suppose that they have the Normal distribution N(1, 1). Thinking of d as a large number, let Sd = X1 + · · · + Xd, and consider the probability P{Sd > n} as a function of n. Since Sd ∼ N(d, d), it is easy to see that the graph of P{Sd > n} has a “threshold” near n = d if d is large. In fact, the length of a neighborhood about n = d in which P{Sd > n} decreases from nearly 1 to nearly 0 is of order √d. However, if d is large, then √d is vanishingly small compared with d. Thus, for large d, the graph of P{Sd > n} versus n (plotted on a scale in which n = d is at some moderate, fixed distance from n = 0) will indeed appear as a sharp dropoff near n = d. In particular, note that the existence of a threshold for large d does not say that the dropoff from near 1 to near 0 takes place over a shorter and shorter time interval as d increases; it is just that the length (here O(√d)) of that dropoff interval is smaller and smaller in comparison with the location (here around d) of that interval.
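For a numerical feel (an illustrative sketch of mine, with d chosen arbitrarily), P{Sd > n} can be evaluated from the normal distribution function: with d = 10,000 the drop from near 1 to near 0 happens over a window of a few hundred, tiny compared with d.

import math

def tail(d, n):
    """P{S_d > n} for S_d ~ Normal(mean = d, variance = d)."""
    return 0.5 * math.erfc((n - d) / math.sqrt(2 * d))

d = 10000
for n in (d - 300, d - 100, d, d + 100, d + 300):
    print(n, round(tail(d, n), 3))    # roughly 0.999, 0.841, 0.5, 0.159, 0.001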


2.8.3 A random time to exact stationarity

[[Figure: the pristine deck, with card 1 on top, card 2 next, and so on down to card 52 on the bottom.]]

Let’s give a name to each card in the deck: say “card 1” is 2♥, “card 2” is 3♥, ..., “card 51” is K♠, “card 52” is A♠. Suppose we start with the deck in the pristine order shown to the right. Wouldn’t it be nice if we could say, “After 1000 shuffles the deck will be exactly random,” or maybe “After 1,000,000 shuffles the deck will be exactly random”? Well, sorry. We can’t. We know πn gets closer and closer to the uniform distribution as n increases, but unfortunately πn will never become exactly random, even if n is 53 bezillion.

However, it is possible to find a random time T at which the deck becomes exactly uniformly distributed, that is, XT ∼ Unif(S52)! Here is an example of such a random time. To describe it, let’s agree that “card i” always refers to the same card [[e.g. card 52 = A♠]], while terms like “top card,” “card in position 2,” and so on just refer to whatever card happens to be on top, in position 2, and so on at the time under consideration. Also note that we may describe a sequence of shuffles simply by a sequence of iid random variables U1, U2, . . . uniformly distributed on {1, 2, . . . , 52}: just say that the ith shuffle moves the top card to position Ui. Define the following random times:

T1 = inf{n : Un = 52} = 1st time a top card goes below card 52,

T2 = inf{n > T1 : Un ≥ 51} = 2nd time a top card goes below card 52,

T3 = inf{n > T2 : Un ≥ 50} = 3rd time a top card goes below card 52,

...

T51 = inf{n > T50 : Un ≥ 2} = 51st time a top card goes below card 52,

and

(2.35) T = T52 = T51 + 1.

It is not hard to see that T has the desired property and that XT is uniformly distributed. To understand this, start with T1. At time T1, we know that some card is below card 52; we don’t know which card, but that will not matter. After time T1 we continue to shuffle until T2, at which time another card goes below card 52. At time T2, there are 2 cards below card 52. Again, we do not know which cards they are, but conditional on which 2 cards are below card 52, each of the two possible orderings of those 2 cards is equally likely. Similarly, we continue to shuffle until time T3, at which time there are some 3 cards below card 52, and, whatever those 3 cards are, each of their (3!) possible relative positions in the deck is equally likely. And so on. At time T51, card 52 has risen all the way up to become the top card, and the other 51 cards are below card 52 (now we do know which cards they are), and those 51 cards are in random positions (i.e. uniform over 51! possibilities). Now all we have to do is shuffle one more time to get card 52 in random position, so that at time T = T52 = T51 + 1, the whole deck is random.
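A simulation can make this believable. The sketch below is my own illustration: it implements the stopping rule (2.35) with a small deck (d = 4 in place of 52, so that all d! = 24 permutations can be tallied) and checks empirically that the deck at time T looks uniformly distributed.

import random
from collections import Counter

def deck_at_strong_stationary_time(d):
    """Run top-in-at-random shuffles on a d-card deck, stopping at the random
    time T of (2.35) (with d in place of 52); return the deck at that time."""
    deck = list(range(1, d + 1))     # card 1 on top, card d on the bottom
    needed = d                        # next threshold: a shuffle with U_n >= needed
    count = 0                         # how many of T_1, ..., T_{d-1} have occurred
    while True:
        u = random.randint(1, d)      # this shuffle moves the top card to position u
        top = deck.pop(0)
        deck.insert(u - 1, top)
        if count < d - 1 and u >= needed:
            count += 1                # one of the times T_1, ..., T_{d-1} just occurred
            needed -= 1
        elif count == d - 1:          # this shuffle was the extra one: time T
            return tuple(deck)

random.seed(0)
d, trials = 4, 24000
tally = Counter(deck_at_strong_stationary_time(d) for _ in range(trials))
print(len(tally), min(tally.values()), max(tally.values()))   # 24 permutations, counts near 1000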


Let us find ET. Clearly by the definitions above we have T1 ∼ Geom(1/52), (T2 − T1) ∼ Geom(2/52), ..., (T51 − T50) ∼ Geom(51/52), and (T52 − T51) ∼ Geom(52/52) = 1. Therefore,

ET = E(T1) + E(T2 − T1) + · · · + E(T51 − T50) + E(T52 − T51)
   = 52 + (52/2) + (52/3) + · · · + (52/51) + (52/52)
   = 52 (1 + 1/2 + 1/3 + · · · + 1/51 + 1/52) ≈ 52 log 52.

Analogously, if the deck had d cards rather than 52, we would have obtained ET ∼ d log d (for large d), where T is now a random time at which the whole deck of d cards becomes uniformly distributed on Sd.
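The sum in parentheses is just a harmonic number, so ET is easy to evaluate exactly; the following short sketch (mine) does so for a few deck sizes and compares with d log d.

import math

def expected_T(d):
    """E[T] = d * (1 + 1/2 + ... + 1/d) for the strong stationary time above."""
    return d * sum(1.0 / k for k in range(1, d + 1))

for d in (52, 500, 5000):
    print(d, round(expected_T(d)), round(d * math.log(d)))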

2.8.4 Strong Stationary Times

The main new idea in the analysis of the shuffling example is that of a strong stationary time. As we have observed, the random variable T that we just constructed has the property that XT ∼ π. T also has two other important properties. First, XT is independent of T. Second, T is a stopping time, that is, for each n, one can determine whether or not T = n just by looking at the values of X0, . . . , Xn. In particular, to determine whether or not T = n it is not necessary to know any “future” values Xn+1, Xn+2, . . ..

A random variable having the three properties just enumerated is called a strong stationary time.

(2.36) Definition. A random variable T satisfying

(i) T is a stopping time,

(ii) XT is distributed as π, and

(iii) XT is independent of T

is called a strong stationary time.

So what’s so good about strong stationary times? For us, the answer is contained in the following lemma, which says that strong stationary times satisfy the same inequality that we derived for coupling times in (1.37).

the following theorem, which says that strong stationary times satisfy the same inequalitythat we derived for coupling times in (1.37).

(2.37) Lemma. If T is a strong stationary time for the Markov chain {Xn}, then ‖πn−π‖ ≤P{T > n} for all n.

Proof: Letting A ⊆ S, we will begin by showing that

(2.38) P{T ≤ n,Xn ∈ A} = P{T ≤ n}π(A).

To see this, let k ≤ n and write P{T = k, Xn ∈ A} = ∑_i P{T = k, Xk = i, Xn ∈ A} = ∑_i P{T = k, Xk = i} P{Xn ∈ A | T = k, Xk = i}. But P{Xn ∈ A | T = k, Xk = i} = P{Xn ∈ A | Xk = i} =: P^{n−k}(i, A) by the Markov property and the assumption that T is a stopping time. Also, P{T = k, Xk = i} = P{T = k, XT = i} = P{T = k} P{XT = i} = P{T = k} π(i) by properties (ii) and (iii) of the definition of strong stationary time. Substituting gives P{T = k, Xn ∈ A} = P{T = k} ∑_i π(i) P^{n−k}(i, A) = P{T = k} π(A) by the stationarity of π. Summing this over k ≤ n gives (2.38).

Next, similarly to the proof of the coupling inequality, we decompose according to whether T ≤ n or T > n to show that for A ⊆ S

πn(A) − π(A) = P{Xn ∈ A} − π(A) = P{Xn ∈ A, T ≤ n} + P{Xn ∈ A, T > n} − π(A)
  = π(A) P{T ≤ n} + P{Xn ∈ A, T > n} − π(A)   by (2.38)
  = P{Xn ∈ A, T > n} − π(A) P{T > n}.

Since each of the last two quantities lies between 0 and P{T > n}, we conclude that |πn(A) − π(A)| ≤ P{T > n}, so that ‖πn − π‖ ≤ P{T > n}.

2.8.5 Proof of threshold phenomenon in shuffling

Let ∆(n) denote ‖πn − π‖. The proof that the threshold phenomenon occurs in the top-in-at-random shuffle consists of two parts: Roughly speaking, the first part shows that ∆(n) is close to 0 for n slightly larger than d log d, and the second part shows that ∆(n) is close to 1 for n slightly smaller than d log d, where in both cases the meaning of “slightly” is “small relative to d log d.”

The first part is addressed by the next result.

(2.39) Theorem. For T as defined in (2.35), we have

∆(d log d + cd) ≤ P{T > d log d + cd} ≤ e^{−c} for all c ≥ 0.

Note that for each fixed c, cd is small relative to d log d if d is large enough.

Proof: The first inequality in (2.39) is just Lemma (2.37), so our task is to prove the second inequality. Recall that, as discussed above, T1 ∼ Geom(1/d), T2 − T1 ∼ Geom(2/d), T3 − T2 ∼ Geom(3/d), ..., and T − T_{d−1} = T_d − T_{d−1} ∼ Geom(d/d) = 1. It is also clear that T1, T2 − T1, T3 − T2, . . . , and T − T_{d−1} are independent. Thus,

T ∼ Geom(1/d) ⊕ Geom(2/d) ⊕ · · · ⊕ Geom((d − 1)/d) ⊕ 1,

where the symbol “⊕” indicates a sum of independent random variables. However, observe that the distribution 1 ⊕ Geom[(d − 1)/d] ⊕ · · · ⊕ Geom[1/d] is also the distribution that arises in the famous coupon collector’s problem. [[To review the coupon collector’s problem: Suppose that each box of Raisin Bran cereal contains one of d possible coupons numbered {1, . . . , d}, with the coupons in different boxes being independent and uniformly distributed on {1, . . . , d}. The number of cereal boxes a collector must buy in order to obtain a complete set of d coupons has the distribution 1 ⊕ Geom[(d − 1)/d] ⊕ · · · ⊕ Geom[1/d].]]


To find a bound on P{T > n}, let us adopt this coupon collecting interpretation of T. For each i = 1, . . . , d define an event

Bi = {coupon i does not appear among the first n cereal boxes}.

Then the event {T > n} is just the union ⋃_{i=1}^d Bi, so that

P{T > n} ≤ ∑_{i=1}^d P(Bi) = ∑_{i=1}^d ((d − 1)/d)^n = d (1 − 1/d)^n ≤ d e^{−n/d},

where the last inequality uses the fact that 1 − x ≤ e^{−x} for all numbers x. Setting n = d log d + cd gives

P{T > n} ≤ d e^{−(log d + c)} = e^{−c},

as desired.
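To see the bound in action, here is a small Monte Carlo sketch (my own illustration; the values of d, c, and the number of trials are arbitrary): it simulates T as the coupon-collector sum of independent geometric variables and compares the empirical tail at n = d log d + cd with e^{−c}.

import math
import random

def sample_T(d):
    """T = 1 + Geom((d-1)/d) + ... + Geom(1/d): the coupon-collector time."""
    t = 0
    for i in range(1, d + 1):          # the i-th stage succeeds with probability i/d
        p = i / d
        t += 1
        while random.random() > p:     # geometric waiting time on {1, 2, ...}
            t += 1
    return t

random.seed(0)
d, c, trials = 52, 2.0, 20000
threshold = d * math.log(d) + c * d
tail = sum(sample_T(d) > threshold for _ in range(trials)) / trials
print(tail, math.exp(-c))              # empirical tail versus the bound e^{-c}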

For the second part of the proof, let us temporarily be a bit more fastidious about notation: Instead of just writing πn and π, let us write π_n^{(d)} and π^{(d)} to indicate explicitly the dependence of the various distributions on the deck size d.

(2.40) Theorem. Let k(d) = d log d − c_d d, where {c_d} is a sequence of numbers that approaches infinity as d → ∞. Then

‖π_{k(d)}^{(d)} − π^{(d)}‖ → 1 as d → ∞.

Note: The case of interest for establishing the threshold at d log d is when c_d = o(log d), since in that case k(d) ∼ d log d.

Proof: Let’s start with some fumbling around, intended to provide some glimmer of hope that we might have been able to think of this proof ourselves. The proof proceeds by bounding ‖π_{k(d)}^{(d)} − π^{(d)}‖ below by something that is close to 1 for large d. By the definition of the total variation distance ‖ · ‖, this may be done by finding events Ad such that |π_{k(d)}^{(d)}(Ad) − π^{(d)}(Ad)| is close to 1. OK, now let’s drop those pesky d’s from the notation, and say that for large d, we want to find events A such that πk(A) is large (close to 1) while π(A) is small (close to 0). Fumbling time. . .

• How about A = {card d is still on the bottom}?

– Is π(A) small? Yes: π(A) = 1/d.

– Is πk(A) large? No: since k ≫ d, clearly P{T1 > k} = P{Geom(1/d) > k} is not large, so that P{card d is still on the bottom at time k} is not large.

• How about A = {cards d− 1 and d are still in their original positions}?

– Is π(A) small? Yes: π(A) = 1/[d(d− 1)].

– Is πk(A) large? No: in fact, it is smaller than the previous πk(A) we just considered. You should be ashamed of yourself for that suggestion!


• How about just requiring cards d − 1 and d still be in their original order, that is, A = {card d − 1 still above card d in the deck}?

– Is πk(A) large? Maybe; this doesn’t seem very obvious.

– Is π(A) small? No: π(A) = 1/2.

• Well, that may look discouraging. But with a little more thought we can at least extend the previous idea to get π(A) small while keeping a “maybe” for πk(A) being large, as follows. How about

A = Ad,a = {cards d− a+ 1, d− a+ 2, . . . , d still in their original order}?

– Is πk(A) large? Still maybe.

– Is π(A) small? π(A) = 1/(a!), so yes if a increases with d.

Let’s review what we are doing. We are given a sequence of numbers {c_d} such that c_d → ∞ as d → ∞. [[As noted, we are interested in the case where we also have c_d = o(log d), but the proof will not require this.]] For each d, we have also defined a number k = k(d) = d log d − c_d d. For each d and each a ≤ d we have identified an event A = A_{d,a}. What we want to show is that there is a sequence {a_d} of values of a such that, as d → ∞, we have πk(A) → 1 and π(A) → 0. [[Actually we should write these statements as π_{k(d)}^{(d)}(A_{d,a_d}) → 1 and π^{(d)}(A_{d,a_d}) → 0, but I doubt any of us really wants that.]]

As for getting the second statement, since π(A) = 1/(a!), any sequence {a_d} that tends to infinity as d → ∞ will suffice.

To get the first statement to hold we need a little more analysis. Suppose that in k shuffles, card d − a + 1 has not yet “risen to the top of the deck.” In this case, clearly the event A occurs. Letting U denote the number of shuffles required for card d − a + 1 to rise to the top of the deck, we thus have

πk(A) ≥ P{U > k}.

Note that

U ∼ Geom(a/d) ⊕ Geom((a + 1)/d) ⊕ · · · ⊕ Geom((d − 1)/d).

The plan now is to use Chebyshev’s inequality to show that we can cause P{U > k} → 1 to hold by choosing a_d appropriately. This will show that πk(A) → 1, and hence prove the theorem.

The ingredients needed to use Chebyshev are E(U) and Var(U). Since E[Geom(p)] = 1/p, we have

E(U) = d (1/a + 1/(a + 1) + · · · + 1/(d − 1)) = d (log d − log a + o(1)),

where the second equality assumes only that a_d → ∞. Next, since Var[Geom(p)] = (1 − p)/p^2 ≤ 1/p^2, using independence gives

Var(U) ≤ d^2 (1/a^2 + 1/(a + 1)^2 + · · ·) =: ǫ(a) d^2,


where ǫ(a) → 0 as a → ∞ (or d → ∞).

So here it is in a nutshell. Since Var(U) = o(d^2), the standard deviation satisfies SD(U) = o(d), so all we have to do is choose a_d so that the difference

E(U) − k = d(log d − log a_d + o(1)) − d(log d − c_d) ∼ d(c_d − log a_d)

is large compared with SD(U), that is, at least the order of magnitude of d. But that’s easy; for example, if we choose a_d = e^{c_d/2}, then E(U) − k ∼ d(c_d/2), whose order of magnitude is larger than d.

To say this in a bit more detail, choose a_d = e^{c_d/2}, say. [[Many other choices would also do.]] Then of course a_d → ∞. So we have

P{U > k(d)} = P{U > d(log d − c_d)}
  = P{U − E(U) > d(log d − c_d) − d(log d − log a_d + o(1))}
  = P{U − E(U) > −d(c_d − log a_d + o(1))}.

Substituting a_d = e^{c_d/2}, this becomes

P{U > k(d)} = P{U − E(U) > −d(c_d/2 + o(1))}
  ≥ P{|U − E(U)| < d(c_d/2 + o(1))}
  ≥ 1 − Var(U) / [d^2 (c_d/2 + o(1))^2]
  ≥ 1 − ǫ(a_d) / (c_d/2 + o(1))^2.

Since the last expression approaches 1 as d→∞, we are done.

This completes our analysis of the top-in-at-random shuffle. There are lots of other interesting things to look at in the Aldous and Diaconis paper as well as some of the other references. For example, a book of P. Diaconis applies group representation theory to this sort of problem. The paper of Brad Mann is a readable treatment of the riffle shuffle.

2.9 Exercises

[2.1] For a branching process {Gt} with G0 = 1, define the probability generating function of Gt to be ψt, that is,

ψt(z) = E(z^{Gt}) = ∑_{k=0}^∞ z^k P{Gt = k}.

With ψ defined as in (2.1), show that ψ1(z) = ψ(z), ψ2(z) = ψ(ψ(z)), ψ3(z) = ψ(ψ(ψ(z))), and so on.

[2.2] With ψt defined as in Exercise [2.1], show that P{Gt = 1} = ψ′t(0).


[2.3] Consider a branching process with offspring distribution Poisson(2), that is, Poisson with mean 2. Calculate the extinction probability ρ to four decimal places.

[2.4] As in the previous exercise, consider again a branching process with offspring distribution Poisson(2). We know that the process will either go extinct or diverge to infinity, and the probability that it is any fixed finite value should converge to 0 as t → ∞. In this exercise you will investigate how fast such probabilities converge to 0. In particular, consider the probability P{Gt = 1}, and find the limiting ratio

lim_{t→∞} P{G_{t+1} = 1} / P{Gt = 1}.

This may be interpreted as a rate of geometric decrease of P{Gt = 1}.

[[Hint: use the result of Exercise [2.2].]]

[2.5] Consider a branching process {Gt} with G0 = 1 and offspring distribution f(k) = q^k p for k = 0, 1, . . ., where q = 1 − p. So f is the probability mass function of X − 1, where X ∼ Geom(p).

(a) Show that

[ψ(z) − (p/q)] / [ψ(z) − 1] = (p/q) [z − (p/q)] / [z − 1].

(b) Derive the expressions

ψt(z) = p[(q^t − p^t) − qz(q^{t−1} − p^{t−1})] / [q^{t+1} − p^{t+1} − qz(q^t − p^t)]   if p ≠ 1/2,
ψt(z) = [t − (t − 1)z] / [t + 1 − tz]   if p = 1/2.

[[Hint: The first part of the problem makes this part quite easy. If you are finding yourself in a depressing, messy calculation, you are missing the easy way. For p ≠ 1/2, consider the fraction [ψt(z) − (p/q)]/[ψt(z) − 1].]]

(c) What is the probability of ultimate extinction, as a function of p?

[[Hint: Observe that P{Gt = 0} = ψt(0).]]

[2.6] Let {Gt} be a supercritical (i.e. µ = E(X) > 1) branching process with extinction probability ρ ∈ (0, 1). Let B = ⋃_t {Gt = 0} denote the event of eventual extinction.

(a) Show that E(z^{Gt} | B) = (1/ρ) ψt(ρz).

(b) Consider again the example of Exercise [2.5], with p < 1/2. Let {G̃t} be a branching process of the same form as {Gt}, except with the probabilities p and q interchanged. So {G̃t} is subcritical, and goes extinct with probability 1. Show that the G process, conditional on the event B, behaves like the G̃ process, in the sense that E(z^{Gt} | B) = E(z^{G̃t}).

(c) Isn’t that interesting?


[2.7] Consider an irreducible, time-reversible Markov chain {Xt} with Xt ∼ π, where the distribution π is stationary. Let A be a subset of the state space. Let 0 < α < 1, and define on the same state space a Markov chain {Yt} having probability transition matrix Q satisfying, for i ≠ j,

Q(i, j) = αP(i, j) if i ∈ A and j ∉ A,
Q(i, j) = P(i, j) otherwise.

Define the diagonal elements Q(i, i) so that the rows of Q sum to 1.

(a) What is the stationary distribution of {Yt}, in terms of π and α?

(b) Show that the chain {Yt} is also time-reversible.

(c) Show by example that the simple relationship of part (a) need not hold if we drop the assumption that X is reversible.

[2.8] Let {Xt} have probability transition matrix P and initial distribution π0. Imagine observing the process until time n, seeing X0, X1, . . . , Xn−1, Xn. The time reversal of this sequence of random variables is Xn, Xn−1, . . . , X1, X0, which we can think of as another random process {X̃t}. That is, given the Markov chain X, define the reversed process {X̃t} by X̃t = X_{n−t}.

(a) Show that

P{Xt = j | Xt+1 = i, Xt+2 = x_{t+2}, . . . , Xt+n = x_{t+n}} = πt(j) P(j, i) / π_{t+1}(i).

(b) Use part (a) to show that the process {X̃t} is a Markov chain, although it is not time homogeneous in general.

(c) Suppose {Xt} has stationary distribution π, and suppose X0 is distributed according to π. Show that the reversed process {X̃t} is a time-homogeneous Markov chain.

[2.9] Let p = (p1, . . . , pd) be a probability mass function on {1, . . . , d}. Consider the residual lifetime chain, discussed in Exercise [1.14], which has probability transition matrix

             0    1    2   · · ·  d−2   d−1
      0   [  p1   p2   p3  · · ·  pd−1  pd ]
      1   [  1    0    0   · · ·  0     0  ]
P =   2   [  0    1    0   · · ·  0     0  ]
      ⋮
      d−1 [  0    0    0   · · ·  1     0  ]

and stationary distribution π(i) = P{X > i}/E(X), where X denotes a random variable distributed according to p.

(a) Find P̃, the probability transition matrix for the reversed chain.


(b) In renewal theory, the time since the most recent renewal is called the age, and the process {At}, whose state At is the age at time t, is called the age process. Show that the matrix P̃ that you have just found is the probability transition matrix of the age process. That is, the time-reversed residual lifetime chain is the age chain.

[2.10] Let’s think about the irreducibility and aperiodicity conditions of the Basic Limit Theorem as applied to the Metropolis method. Suppose that the graph structure on S is a connected graph. Let π be any distribution other than πrw, the stationary distribution of a random walk on the graph. Show that the Basic Limit Theorem implies that the Metropolis chain converges in distribution to π.

[2.11] Why was the condition π ≠ πrw needed in Exercise [2.10]?

[2.12] [[Metropolis-Hastings method]] For simplicity, let us assume that π is positive, so that we won’t have to worry about dividing by 0. Choose any probability transition matrix Q = (Q(i, j)) [[again, suppose it is positive]], and define P(i, j) for i ≠ j by

P(i, j) = Q(i, j) min(1, π(j)Q(j, i) / (π(i)Q(i, j))),

and of course define P(i, i) = 1 − ∑_{j≠i} P(i, j). Show that the probability transition matrix P has stationary distribution π. Show how the Metropolis method we have discussed is a special case of this Metropolis-Hastings method.

[2.13] [[Computing project: traveling salesman problem]] Make up an example of the traveling salesman problem; it could look similar to the first figure in Example (2.22) if you’d like. Write a program to implement simulated annealing and produce a sequence of figures showing various improving traveling salesman tours. You could even produce a slithering snake movie if you are so inspired.

[2.14] For simulated annealing, temperature schedules of the form (2.20) decrease excruciatingly slowly. It is reasonable to ask whether we could decrease the temperature faster and still retain a guarantee of convergence in distribution to global optima. Let c be a positive number, and consider performing simulated annealing with the cooling schedule Tn = b n^{−c}. Of course, this schedule decreases faster than (2.20), no matter how small c is. Can you give an example that shows that such a schedule decreases too fast, in the sense that the process has positive probability of getting stuck in a local minimum forever? Thus, even Tn = n^{−.0001} cools “too fast”!

[2.15] [[A creative writing, essay-type question]] Do you care about convergence in distribution to a global minimum? Does this property of simulated annealing make you happy?

[2.16] Prove (2.18).


[2.17] Here is yet another interpretation of total variation distance. Let µ and ν be distributions on a finite set S. Show that

‖µ − ν‖ = min P{X ≠ Y},

where the minimum is taken over all P, X, and Y such that X has distribution µ and Y has distribution ν.

[2.18] Prove Lemma (2.27) using coupling.

Hint: Defining R = PQ, we want to show that for all i and j,

‖R_{i·} − R_{j·}‖ ≤ sup_{k,l} ‖P_{k·} − P_{l·}‖ · sup_{k,l} ‖Q_{k·} − Q_{l·}‖.

Construct Markov chains X0 −P→ X1 −Q→ X2 and Y0 −P→ Y1 −Q→ Y2 with X0 = i and Y0 = j. Take (X1, Y1) to be a coupling achieving the total variation distance ‖P_{i·} − P_{j·}‖. Then, conditional on (X1, Y1), take X2 and Y2 to achieve the total variation distance ‖Q_{X1·} − Q_{Y1·}‖.

[2.19] Show that if a probability transition matrix Q has a column all of whose entries are at least a, then δ(Q) ≤ 1 − a.

[2.20] Repeated performances of the top-in-at-random shuffle on a deck of cards produce a Markov chain {Xn} having state space S52, the group of permutations of the cards. Show that this Markov chain is irreducible, aperiodic, and has stationary distribution π = Uniform on S52 (i.e. probability 1/(52!) for each permutation), so that, by the Basic Limit Theorem, we may conclude that ‖πn − π‖ → 0 as n → ∞.

[2.21] Why do we require the “strong” in strong stationary times? That is, in Definition (2.36), although I’m not so inclined to question the requirement XT ∼ π, why do we require XT to be independent of T? It is easy to see where this is used in the proof of the fundamental inequality ‖πn − π‖ ≤ P{T > n}, but that is only a partial answer. The real question is whether the fundamental inequality could fail to hold if we do not require XT to be independent of T. Can you find an example?


Things to do

• Add a section introducing optimal stopping, dynamic programming, and Markov decision problems.
