
Contents

9 Importance sampling                                        3
    9.1   Basic importance sampling                          4
    9.2   Self-normalized importance sampling                8
    9.3   Importance sampling diagnostics                   11
    9.4   Example: PERT                                     13
    9.5   Importance sampling versus acceptance-rejection   17
    9.6   Exponential tilting                               18
    9.7   Modes and Hessians                                19
    9.8   General variables and stochastic processes        21
    9.9   Example: exit probabilities                       23
    9.10  Control variates in importance sampling           25
    9.11  Mixture importance sampling                       27
    9.12  Multiple importance sampling                      31
    9.13  Positivisation                                    33
    9.14  What-if simulations                               35
    End notes                                               37
    Exercises                                               40


© Art Owen 2009–2013, 2018. Do not distribute or post electronically without the author's permission.


9 Importance sampling

In many applications we want to compute µ = E(f(X)) where f(x) is nearly zero outside a region A for which P(X ∈ A) is small. The set A may have small volume, or it may be in the tail of the X distribution. A plain Monte Carlo sample from the distribution of X could fail to have even one point inside the region A. Problems of this type arise in high energy physics, Bayesian inference, rare event simulation for finance and insurance, and rendering in computer graphics, among other areas.

It is clear intuitively that we must get some samples from the interesting or important region. We do this by sampling from a distribution that over-weights the important region, hence the name importance sampling. Having oversampled the important region, we have to adjust our estimate somehow to account for having sampled from this other distribution.

Importance sampling can bring enormous gains, making an otherwise infeasible problem amenable to Monte Carlo. It can also backfire, yielding an estimate with infinite variance when simple Monte Carlo would have had a finite variance. It is the hardest variance reduction method to use well.

Importance sampling is more than just a variance reduction method. It can be used to study one distribution while sampling from another. As a result we can use importance sampling as an alternative to acceptance-rejection sampling, as a method for sensitivity analysis, and as the foundation for some methods of computing normalizing constants of probability densities. Importance sampling is also an important prerequisite for sequential Monte Carlo (Chapter 15). For these reasons we spend a whole chapter on it. For a mini-chapter on the basics of importance sampling, read §9.1 through §9.4.


9.1 Basic importance sampling

Suppose that our problem is to find µ = E(f(X)) = ∫_D f(x)p(x) dx where p is a probability density function on D ⊆ R^d and f is the integrand. We take p(x) = 0 for all x ∉ D. If q is a positive probability density function on R^d, then

\mu = \int_D f(x)p(x)\,dx = \int_D \frac{f(x)p(x)}{q(x)}\,q(x)\,dx = E_q\Big(\frac{f(X)p(X)}{q(X)}\Big),   (9.1)

where E_q(·) denotes expectation for X ∼ q. We also write Var_q(·), Cov_q(·,·), and Corr_q(·,·) for variance, covariance and correlation when X ∼ q. Our original goal is then to find E_p(f(X)). By making a multiplicative adjustment to f we compensate for sampling from q instead of p. The adjustment factor p(x)/q(x) is called the likelihood ratio. The distribution q is the importance distribution and p is the nominal distribution.

The importance distribution q does not have to be positive everywhere. It is enough to have q(x) > 0 whenever f(x)p(x) ≠ 0. That is, for Q = {x | q(x) > 0} we have x ∈ Q whenever f(x)p(x) ≠ 0. So if x ∈ D ∩ Q^c we know that f(x) = 0, while if x ∈ Q ∩ D^c we have p(x) = 0. Now

E_q\Big(\frac{f(X)p(X)}{q(X)}\Big) = \int_Q \frac{f(x)p(x)}{q(x)}\,q(x)\,dx = \int_Q f(x)p(x)\,dx
   = \int_D f(x)p(x)\,dx + \int_{Q\cap D^c} f(x)p(x)\,dx - \int_{D\cap Q^c} f(x)p(x)\,dx
   = \int_D f(x)p(x)\,dx = \mu.   (9.2)

It is natural to wonder what happens for x with q(x) = 0 in the denominator. The answer is that there are no such points x ∈ Q and we will never see one when sampling X ∼ q. Later we will see examples where q(x) close to 0 causes extreme difficulties, but q(x) = 0 is not a problem if f(x)p(x) = 0 too.

When we want q to work for many different functions f_j then we need q(x) > 0 at every x where any f_j(x)p(x) ≠ 0. Then a density q with q(x) > 0 whenever p(x) > 0 will suffice, and will allow us to add new functions f_j to our list after we've drawn the sample.

The importance sampling estimate of µ = E_p(f(X)) is

\hat\mu_q = \frac{1}{n}\sum_{i=1}^n \frac{f(X_i)\,p(X_i)}{q(X_i)}, \qquad X_i \sim q.   (9.3)

To use (9.3) we must be able to compute fp/q. Assuming that we can compute f, this estimate requires that we can compute p(x)/q(x) at any x we might sample. When p or q has an unknown normalization constant, then we will resort to a ratio estimate (see §9.2). For now, we assume that p/q is computable, and study the variance of \hat\mu_q. Exponential tilting (§9.6) is one way to choose q with computable p/q even when p is unnormalized.


Theorem 9.1. Let \hat\mu_q be given by (9.3) where µ = ∫_D f(x)p(x) dx and q(x) > 0 whenever f(x)p(x) ≠ 0. Then E_q(\hat\mu_q) = µ, and Var_q(\hat\mu_q) = σ²_q/n where

\sigma^2_q = \int_D \frac{\big(f(x)p(x)\big)^2}{q(x)}\,dx - \mu^2 = \int_D \frac{\big(f(x)p(x) - \mu q(x)\big)^2}{q(x)}\,dx.   (9.4)

Proof. That E_q(\hat\mu_q) = µ follows directly from (9.2). Using Q = {x | q(x) > 0}, we find that

\mathrm{Var}_q(\hat\mu_q) = \frac{1}{n}\Big[\int_Q \Big(\frac{f(x)p(x)}{q(x)}\Big)^2 q(x)\,dx - \mu^2\Big] = \frac{1}{n}\Big[\int_D \Big(\frac{f(x)p(x)}{q(x)}\Big)^2 q(x)\,dx - \mu^2\Big],

because the contributions to the integral from x in D ∩ Q^c and Q ∩ D^c are zero. Simple rearrangements give the two forms in equation (9.4).

To form a confidence interval for µ, we need to estimate σ²_q. From the second expression in (9.4) we find that

\sigma^2_q = E_q\big((f(X)p(X) - \mu q(X))^2/q(X)^2\big).

Because the x_i are sampled from q, the natural variance estimate is

\hat\sigma^2_q = \frac{1}{n}\sum_{i=1}^n \Big(\frac{f(x_i)p(x_i)}{q(x_i)} - \hat\mu_q\Big)^2 = \frac{1}{n}\sum_{i=1}^n \big(w_i f(x_i) - \hat\mu_q\big)^2,   (9.5)

where w_i = p(x_i)/q(x_i). Then an approximate 99% confidence interval for µ is \hat\mu_q ± 2.58\,\hat\sigma_q/\sqrt{n}.
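As a small illustration of (9.3) and (9.5), the sketch below (not from the text; it assumes NumPy and SciPy, and uses a made-up example with p = N(0, 1), q = N(4, 1) and f(x) = 1{x > 4}) computes the estimate, its variance estimate and the 99% interval, and compares them with the exact Gaussian tail probability.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n = 100_000
    shift = 4.0                                   # importance distribution q = N(4, 1); p = N(0, 1)
    x = rng.normal(shift, 1.0, n)                 # X_i ~ q
    f = (x > 4.0).astype(float)                   # f(x) = 1{x > 4}
    w = stats.norm.pdf(x) / stats.norm.pdf(x, loc=shift)   # likelihood ratio p(x)/q(x)
    mu_hat = np.mean(f * w)                       # equation (9.3)
    sigma2_hat = np.mean((f * w - mu_hat) ** 2)   # equation (9.5)
    half_width = 2.58 * np.sqrt(sigma2_hat / n)   # 99% confidence half-width
    print(mu_hat, half_width, stats.norm.sf(4.0)) # exact tail, about 3.17e-5, for comparison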

Theorem 9.1 guides us in selecting a good importance sampling rule. The first expression in (9.4) is simpler to study. A better q is one that gives a smaller value of ∫_D (fp)²/q dx. When we want upper bounds on σ²_q we can bound ∫_D (fp)²/q dx.

The second integral expression in (9.4) illustrates how importance sampling can succeed or fail. The numerator in the integrand at the right of (9.4) is small when f(x)p(x) − µq(x) is close to zero, that is, when q(x) is nearly proportional to f(x)p(x). From the denominator, we see that regions with small values of q(x) greatly magnify whatever lack of proportionality appears in the numerator.

Suppose that f(x) ≥ 0 for all x, and to rule out a trivial case, assume that µ > 0 too. Then q_opt(x) ≡ f(x)p(x)/µ is a probability density, and it has σ²_{q_opt} = 0. It is optimal, but not really usable because \hat\mu_{q_opt} becomes an average of f(x_i)p(x_i)/q_opt(x_i) = µ, meaning that we could compute µ directly from f, p, and q_opt without any sampling. Likewise, for f(x) ≤ 0 with µ < 0, a zero variance is obtained for the density q = −fp/µ.


Although zero-variance importance sampling densities are not usable, they provide insight into the design of a good importance sampling scheme. Suppose that f(x) ≥ 0. It may be good for q to have spikes in the same places that f does, or where p does, but it is better to have them where fp does. Moreover, the best q is proportional to fp, not √(fp) or f²p or some other combination.

To choose a good importance sampling distribution requires some educated guessing and possibly numerical search. In many applications there is domain knowledge about where the spikes are. In a financial setting we may know which stock fluctuations will cause an option to go to its maximal value. For a queuing system it may be easy to know what combination of arrivals will cause the system to be overloaded.

In general, the density q^* that minimizes σ²_q is proportional to |f(x)|p(x) (Kahn and Marshall, 1953), outside of trivial cases where ∫|f(x)|p(x) dx = 0. This optimal density does not have σ²_{q^*} = 0 on problems where fp can be positive for some x and negative for other x. When fp takes both positive and negative values, there is still a zero variance method, but it requires sampling at two points. See §9.13.

To prove that q^*(x) = |f(x)|p(x)/E_p(|f(X)|) is optimal, let q be any density that is positive when fp ≠ 0. Then

\mu^2 + \sigma^2_{q^*} = \int \frac{f(x)^2 p(x)^2}{q^*(x)}\,dx = \int \frac{f(x)^2 p(x)^2}{|f(x)|p(x)/E_p(|f(X)|)}\,dx
   = E_p\big(|f(X)|\big)^2 = E_q\big(|f(X)|p(X)/q(X)\big)^2
   \le E_q\big(f(X)^2 p(X)^2/q(X)^2\big) = \mu^2 + \sigma^2_q,

and so σ²_{q^*} ≤ σ²_q. This proof is a straightforward consequence of the Cauchy-Schwarz inequality. But it requires that we know the optimal density before applying Cauchy-Schwarz. For a principled way to find candidates for q^* in problems like importance sampling, we can turn to the calculus of variations, as outlined on page 38 of the chapter end notes.

We may use the likelihood ratio w(x) = p(x)/q(x) as a means to understand which importance sampling densities are good or bad choices. The first term in σ²_q is ∫ f(x)²p(x)²/q(x) dx. We may write this term as E_p(f(X)²w(X)) = E_q(f(X)²w(X)²). The appearance of q in the denominator of w means that light-tailed importance densities q are dangerous. If we are clever or lucky, then f might be small just where it needs to be to offset the small denominator. But we often need to use the same sample with multiple integrands f, and so as a rule q should have tails at least as heavy as p does.

When p is a Gaussian distribution, then a common tactic is to take q to be a Student's t distribution. Such a q has heavier tails than p. It even has heavier tails than fp for integrands of the form f(x) = exp(x^Tθ), because then fp is proportional to another Gaussian density.

The reverse practice, of using a Gaussian importance distribution q for a Student's t nominal distribution, can easily lead to σ²_q = ∞. Even when q is nearly proportional to fp we may still have σ²_q = ∞. The irony is that the infinite variance could be largely attributable to the unimportant region of D where |f|p is small (but q is extremely small).

Bounded weights are especially valuable. If w(x) ≤ c then σ²_q ≤ cσ²_p. See Exercise 9.6.

Example 9.1 (Gaussian p and q). We illustrate the effect of light-tailed q in a problem where we would not need importance sampling. Suppose that f(x) = x ∈ R and that p(x) = exp(−x²/2)/√(2π). If q(x) = exp(−x²/(2σ²))/(σ√(2π)) with σ > 0 then

\sigma^2_q = \int_{-\infty}^{\infty} \frac{x^2\big(\exp(-x^2/2)/\sqrt{2\pi}\big)^2}{\exp(-x^2/(2\sigma^2))/(\sigma\sqrt{2\pi})}\,dx
   = \sigma\int_{-\infty}^{\infty} x^2\exp\big(-x^2(2-\sigma^{-2})/2\big)/\sqrt{2\pi}\,dx
   = \begin{cases} \sigma/(2-\sigma^{-2})^{3/2}, & \sigma^2 > 1/2,\\ \infty, & \text{else.}\end{cases}

To estimate E_p(X) with finite variance in the normal family, the importance distribution q must have more than half the variance that p has. Interestingly, the optimal σ is not 1. See Exercise 9.7.
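A quick numerical check of Example 9.1 (a sketch assuming NumPy, not part of the text): the sample variance of the terms f(X)p(X)/q(X) under q should approach σ/(2 − σ^{−2})^{3/2}, and values of σ somewhat larger than 1 give a smaller variance than σ = 1 does.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 1_000_000
    for sigma in [1.0, 1.414, 2.0]:
        x = rng.normal(0.0, sigma, n)                        # X ~ q = N(0, sigma^2)
        w = sigma * np.exp(-0.5 * x**2 * (1.0 - sigma**-2))  # w = p/q with p = N(0, 1)
        terms = x * w                                        # f(X) p(X) / q(X) with f(x) = x
        theory = sigma / (2.0 - sigma**-2) ** 1.5            # finite only when sigma^2 > 1/2
        print(sigma, terms.var(), theory)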

The likelihood ratio also reveals a dimension effect for importance sampling. Suppose that x ∈ R^d and, to keep things simple, that the components of X are independent under both p and q. Then w(X) = ∏_{j=1}^d p_j(X_j)/q_j(X_j). Though f plays a role, the value of σ²_q is also driven largely by E_q(w(X)²) = ∏_{j=1}^d E_q(w_j(X_j)²) where w_j = p_j/q_j. Each E_q(w_j(X_j)) = 1 and E_q(w(X)²) = ∏_{j=1}^d (1 + Var_q(w_j(X_j))). This quantity will grow exponentially with d unless we arrange for the variances of the w_j(X_j) to eventually decrease sharply as j increases.

Example 9.2 (Standard Gaussian mean shift). Let X ∼ N(0, I) under p and X ∼ N(θ, I) under q, where θ ∈ R^d. Then w(x) = exp(−θ^Tx + θ^Tθ/2). The moments of the likelihood ratio depend on the size ‖θ‖ of our shift, but not on the dimension d. In particular, E_q(w(X)²) = exp(θ^Tθ).

Example 9.2 shows that a large nominal dimension d does not of itself make importance sampling perform poorly. Taking θ = (1/√d, . . . , 1/√d) there has the same effect as θ = (1, 0, . . . , 0) does. The latter is clearly a one dimensional perturbation, and so is the former, since it involves importance sampling for just one linear combination of the d variables. What we do see is that small changes to a large number of components of E(X), or moderately large changes to a few of them, do not necessarily produce an extreme likelihood ratio.

Example 9.3 (Standard Gaussian variance change). Let X ∼ N(0, I) under p and suppose that q has X ∼ N(0, σ²I) for σ > 0. Then

w(x) = \sigma^d\exp\big(-x^Tx(1 - \sigma^{-2})/2\big),

and E_p(w(X)) = (\sigma^2/\sqrt{2\sigma^2 - 1})^d for σ² > 1/2.


Example 9.3 shows that making any small change of variance will bring an extreme likelihood ratio in d dimensions when d is large. The allowable change to σ² decreases with d. If we take σ² = 1 + A/d^{1/2} for A > −1/2, then a Taylor series expansion of log(E_p(w(X))) shows that (σ²/√(2σ² − 1))^d remains bounded as d grows.

Although we can get a lot of insight from the likelihood ratio w(x), the efficiency of importance sampling compared to alternatives still depends on f. We will see another example of that phenomenon in §9.3 where we discuss some measures of effective sample size for importance sampling.
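The two examples are easy to check numerically. The sketch below (assuming NumPy; the parameter choices are made up) verifies that E_q(w(X)²) = exp(θ^Tθ) does not depend on d for the mean shift, and evaluates the growth of E_p(w(X)) = (σ²/√(2σ² − 1))^d for a fixed variance change.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 200_000
    for d in [1, 10, 100]:
        theta = np.full(d, 1.0 / np.sqrt(d))            # ||theta|| = 1 in every dimension d
        x = rng.normal(0.0, 1.0, (n, d)) + theta        # X ~ q = N(theta, I)
        w = np.exp(-x @ theta + 0.5 * theta @ theta)    # w(x) = p(x)/q(x), Example 9.2
        print(d, (w**2).mean(), np.exp(theta @ theta))  # both near e = 2.718...

    sigma2 = 1.1                                        # fixed variance change, Example 9.3
    for d in [10, 100, 1000]:
        print(d, (sigma2 / np.sqrt(2.0 * sigma2 - 1.0)) ** d)   # grows without bound in d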

9.2 Self-normalized importance sampling

Sometimes we can only compute an unnormalized version of p, p_u(x) = cp(x), where c > 0 is unknown. The same may be true of q. Suppose that we can compute q_u(x) = bq(x) where b > 0 might be unknown. If we are fortunate or clever enough to have b = c, then p(x)/q(x) = p_u(x)/q_u(x) and we can still use (9.3). Otherwise we may compute the ratio w_u(x) = p_u(x)/q_u(x) = (c/b)p(x)/q(x) and consider the self-normalized importance sampling estimate

\tilde\mu_q = \frac{\sum_{i=1}^n f(X_i)\,w_u(X_i)}{\sum_{i=1}^n w_u(X_i)}   (9.6)

where X_i ∼ q are independent. The factor c/b cancels from numerator and denominator in (9.6), leading to the same estimate as if we had used the desired ratio w(x) = p(x)/q(x) instead of the computable alternative w_u(x). As a result we can write

\tilde\mu_q = \frac{\frac{1}{n}\sum_{i=1}^n f(X_i)\,w(X_i)}{\frac{1}{n}\sum_{i=1}^n w(X_i)}.   (9.7)

Theorem 9.2. Let p be a probability density function on R^d and let f(x) be a function such that µ = ∫ f(x)p(x) dx exists. Suppose that q(x) is a probability density function on R^d with q(x) > 0 whenever p(x) > 0. Let X_1, . . . , X_n ∼ q be independent and let \tilde\mu_q be the self-normalized importance sampling estimate (9.6). Then

P\big(\lim_{n\to\infty}\tilde\mu_q = \mu\big) = 1.

Proof. We may use expression (9.7) for \tilde\mu_q. The numerator in (9.7) is the average \hat\mu_q of f(X_i)p(X_i)/q(X_i) under sampling from q. These are IID with mean µ by (9.2). The strong law of large numbers gives P(\lim_{n\to\infty}\hat\mu_q = \mu) = 1. The denominator of (9.7) converges to 1, using the argument for the numerator on the function f with f(x) ≡ 1. The ratio therefore converges to µ with probability one.

The self-normalized importance sampler \tilde\mu_q requires a stronger condition on q than the unbiased importance sampler \hat\mu_q does. We now need q(x) > 0 whenever p(x) > 0, even if f(x) is zero with high probability.


To place a confidence interval around \tilde\mu_q we use the delta method and equation (2.30) from §2.7 for the variance of a ratio of means. Applying that formula to (9.7) yields an approximate variance for \tilde\mu_q of

\mathrm{Var}(\tilde\mu_q) = \frac{1}{n}\,\frac{E_q\big((f(X)w(X) - \mu w(X))^2\big)}{E_q(w(X))^2} = \frac{\sigma^2_{q,\mathrm{sn}}}{n}, \quad\text{where}\quad
\sigma^2_{q,\mathrm{sn}} = E_q\big(w(X)^2(f(X) - \mu)^2\big).   (9.8)

For a computable variance estimate, we employ w_u instead of w, getting

\widehat{\mathrm{Var}}(\tilde\mu_q) = \frac{\frac{1}{n}\cdot\frac{1}{n}\sum_{i=1}^n w_u(x_i)^2(f(x_i) - \tilde\mu_q)^2}{\big(\frac{1}{n}\sum_{i=1}^n w_u(x_i)\big)^2} = \sum_{i=1}^n w_i^2(f(x_i) - \tilde\mu_q)^2   (9.9)

where w_i = w_u(x_i)/\sum_{j=1}^n w_u(x_j) is the i'th normalized weight. The delta method 99% confidence interval for self-normalized importance sampling is \tilde\mu_q ± 2.58\sqrt{\widehat{\mathrm{Var}}(\tilde\mu_q)}.
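The self-normalized estimate (9.6) and its variance estimate (9.9) only need the unnormalized ratio w_u. The sketch below (assuming NumPy and SciPy; the target p_u is a made-up unnormalized density, not from the text) estimates E_p(X²) with a Gaussian q whose tails are heavier than the target's.

    import numpy as np
    from scipy import stats

    def p_u(x):                                       # unnormalized target density (illustrative)
        return np.exp(-0.5 * x**2) * (2.0 + np.sin(5.0 * x))

    rng = np.random.default_rng(4)
    n = 100_000
    x = rng.normal(0.0, 1.5, n)                       # X ~ q = N(0, 1.5^2)
    w_u = p_u(x) / stats.norm.pdf(x, scale=1.5)       # w_u = p_u / q; normalizer of p unknown
    f = x**2
    mu_sn = np.sum(f * w_u) / np.sum(w_u)             # equation (9.6)
    w_norm = w_u / np.sum(w_u)                        # normalized weights
    var_sn = np.sum(w_norm**2 * (f - mu_sn)**2)       # equation (9.9)
    print(mu_sn, 2.58 * np.sqrt(var_sn))              # estimate with its 99% half-width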

When the weight function w(x) is computable, then we can choose to use either the unbiased estimate (9.3) or the ratio estimate (9.6). They each have their strengths. Their variance formulas are

n\,\mathrm{Var}(\hat\mu_q) = \int \frac{\big(f(x)p(x) - \mu q(x)\big)^2}{q(x)}\,dx, \quad\text{and}\quad
n\,\mathrm{Var}(\tilde\mu_q) = \int \frac{p(x)^2\big(f(x) - \mu\big)^2}{q(x)}\,dx.

An argument for \tilde\mu_q is that it is exact when f(x) = C is constant, while \hat\mu_q is not. We are not usually focused on estimating the mean of a constant, but if we don't get that right, then our estimates of P(X ∈ B) and P(X ∉ B) for a set B won't sum to one, and E(f(X) + C) won't equal E(f(X)) + C, both of which could lead to strange results.

The arguments in favor of \hat\mu_q are that it is unbiased, has a simple variance estimate, and the variance can be made arbitrarily close to 0 by making ever better choices of q.

It is not possible for the self-normalized estimate \tilde\mu_q to approach zero variance with ever better choices of q. The optimal density for self-normalized importance sampling has the form q(x) ∝ |f(x) − µ|p(x) (Hesterberg, 1988, Chapter 2). Using this formula we find (Exercise 9.15) that

\sigma^2_{q,\mathrm{sn}} \ge E_p\big(|f(X) - \mu|\big)^2.   (9.10)

It is zero only for constant f(x). Self-normalized importance sampling cannot reduce the variance by more than E_p((f(X) − µ)²)/E_p(|f(X) − µ|)².


Example 9.4 (Event probability). Let f(x) = 1{x ∈ A}, so µ = E(f(X)) is the probability of the event A. The optimal distribution for self-normalized importance sampling is q(x) ∝ p(x)|1{x ∈ A} − µ|. Under this distribution

\int_A q(x)\,dx = \frac{\int_A p(x)\,|1\{x\in A\} - \mu|\,dx}{\int p(x)\,|1\{x\in A\} - \mu|\,dx} = \frac{\mu(1-\mu)}{\mu(1-\mu) + (1-\mu)\mu} = \frac{1}{2}.

Whether A is rare or not, the asymptotically optimal density for self-normalized importance sampling puts half of its probability inside A. By contrast, in ordinary importance sampling, the optimal distribution places all of its probability inside A.

When w(x) is computable, then a reasonable answer to the question of whether to use \hat\mu_q or \tilde\mu_q is 'neither'. In this case we can use the regression estimator of §8.9 with the control variate w(X). This variate has known mean, E_q(w(X)) = 1. The regression estimator has a bias but it cannot increase the asymptotic variance compared to \hat\mu_q. The regression estimator has approximate variance (1 − ρ²)σ²_q/n compared to σ²_q/n for \hat\mu_q of (9.3), where ρ = Corr_q(f(X)w(X), w(X)). The regression estimator is also at least as efficient as the ratio estimator, which in this case is \tilde\mu_q. The regression estimator is exact if either q(x) ∝ f(x)p(x) (for nonnegative f) or f is constant.

The regression estimator here takes the form

\hat\mu_{q,\mathrm{reg}} = \frac{1}{n}\sum_{i=1}^n \big(f(X_i)w(X_i) - \hat\beta(w(X_i) - 1)\big),   (9.11)

where

\hat\beta = \frac{\sum_{i=1}^n (w(X_i) - \bar w)\,f(X_i)w(X_i)}{\sum_{i=1}^n (w(X_i) - \bar w)^2}

for \bar w = (1/n)\sum_{i=1}^n w(X_i). To get both \hat\mu_{q,\mathrm{reg}} and its standard error, we can use least squares regression as in Algorithm 8.3. We regress f(X_i)w(X_i) on the centered variable w(X_i) − 1 and retain the estimated intercept term as \hat\mu_{q,\mathrm{reg}}. Most regression software will also compute a standard error \widehat{se}(\hat\mu_{q,\mathrm{reg}}) for the intercept. Our 99% confidence interval is \hat\mu_{q,\mathrm{reg}} ± 2.58\,\widehat{se}(\hat\mu_{q,\mathrm{reg}}).
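A minimal sketch of (9.11) follows (assuming NumPy and SciPy; the choice p = N(0, 1), q = N(0, 1.5²) and f(x) = x², with E_p f(X) = 1, is made up for illustration). It uses the explicit slope formula rather than a regression routine, and an approximate standard error from the residuals.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    n = 100_000
    x = rng.normal(0.0, 1.5, n)                              # X ~ q = N(0, 1.5^2)
    w = stats.norm.pdf(x) / stats.norm.pdf(x, scale=1.5)     # w = p/q with p = N(0, 1)
    fw = x**2 * w                                            # f(x) = x^2, so E_p f(X) = 1
    wbar = w.mean()
    beta = np.sum((w - wbar) * fw) / np.sum((w - wbar)**2)   # least squares slope
    resid = fw - beta * (w - 1.0)                            # control variate adjustment
    mu_reg = resid.mean()                                    # equation (9.11)
    se = resid.std(ddof=1) / np.sqrt(n)                      # approximate intercept standard error
    print(mu_reg, mu_reg - 2.58 * se, mu_reg + 2.58 * se)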

The regression estimator version of importance sampling is a very good choice when we are able to compute w(x). It has one small drawback relative to \hat\mu_q. In extreme cases we could get \hat\mu_{q,\mathrm{reg}} < 0 even though f(x) ≥ 0 always holds. In such cases it may be preferable to use a reweighting approach as in §8.10. That will give \hat\mu_{q,\mathrm{reg}} ≥ 0 when min_{1≤i≤n} f(x_i) ≥ 0 and min_{1≤i≤n} p(x_i)/q(x_i) < 1 < max_i p(x_i)/q(x_i). On the other hand, getting \hat\mu_{q,\mathrm{reg}} < 0 when P(f(X) ≥ 0) = 1 suggests that the sampling method has not worked well, and perhaps we should increase n or look for a better q instead of trying to make better use of the given sample values.


9.3 Importance sampling diagnostics

Importance sampling uses unequally weighted observations. The weights are w_i = p(X_i)/q(X_i) ≥ 0 for i = 1, . . . , n. In extreme settings, one of the w_i may be vastly larger than all the others and then we have effectively only got one observation. We would like to have a diagnostic to tell when the weights are problematic. It is even possible that w_1 = w_2 = · · · = w_n = 0. In that case, importance sampling has clearly failed and we don't need a diagnostic to tell us so. Hence, we may assume that \sum_{i=1}^n w_i > 0.

Consider a hypothetical linear combination

S_w = \frac{\sum_{i=1}^n w_i Z_i}{\sum_{i=1}^n w_i}   (9.12)

where the Z_i are independent random variables with a common mean and common variance σ² > 0, and w ∈ [0, ∞)^n is the vector of weights. The unweighted average of n_e independent random variables Z_i has variance σ²/n_e. Setting Var(S_w) = σ²/n_e and solving for n_e yields the effective sample size

n_e = \frac{\big(\sum_{i=1}^n w_i\big)^2}{\sum_{i=1}^n w_i^2} = \frac{n\bar w^2}{\overline{w^2}}   (9.13)

where \bar w = (1/n)\sum_{i=1}^n w_i and \overline{w^2} = (1/n)\sum_{i=1}^n w_i^2. If the weights are too imbalanced then the result is similar to averaging only n_e ≪ n observations and might therefore be unreliable. The point at which n_e becomes alarmingly small is hard to specify, because it is application specific.

The effective sample size (9.13) is computed from the generated values. For simple enough distributions, we can obtain a population version of n_e, as

n^*_e = \frac{n\,E_q(w)^2}{E_q(w^2)} = \frac{n}{E_q(w^2)} = \frac{n}{E_p(w)}.   (9.14)

If n^*_e is too small, then we know q will produce imbalanced weights.

Another effective sample size formula is

\frac{n}{1 + \mathrm{cv}(w)^2}   (9.15)

where \mathrm{cv}(w) = \big((n-1)^{-1}\sum_{i=1}^n (w_i - \bar w)^2\big)^{1/2}/\bar w is the coefficient of variation of the weights. If we replace the n − 1 by n in cv(w) and substitute into n/(1 + cv(w)²), the result is n_e of (9.13), so the two formulas are essentially the same.

Instead of using an effective sample size, we might just estimate the variance of \hat\mu_q (or of \tilde\mu_q) and use the variance as a diagnostic. When that variance is quite large we would conclude that the importance sampling had not worked well. Unfortunately the variance estimate is itself based on the same weights that the estimate has used. Badly skewed weights could give a badly estimated mean along with a bad variance estimate that masks the problem.


The variance estimate (9.9) for \tilde\mu_q uses a weighted average of (f(x_i) − \tilde\mu_q)² with weights equal to w_i². Similarly, for \hat\mu_q we have \hat\sigma^2_q = (1/n)\sum_{i=1}^n w_i^2 f(x_i)^2 - \hat\mu_q^2, which again depends on the weights through w_i². We can compute an effective sample size for variance estimates by

n_{e,\sigma} = \frac{\big(\sum_{i=1}^n w_i^2\big)^2}{\sum_{i=1}^n w_i^4}.   (9.16)

If n_{e,\sigma} is small, then we cannot trust the variance estimate. Estimating a variance well is typically a harder problem than estimating a mean well, and here we have n_{e,\sigma} ≤ n_{e,\mu} (Exercise 9.17).

The accuracy of the central limit theorem for a mean depends on the skewness of the random variables being summed. To gauge the reliability of the CLT for \hat\mu_q and \tilde\mu_q, we find that the skewness of S_w matches that of the average of

n_{e,\gamma} = \frac{\big(\sum_i w_i^2\big)^3}{\big(\sum_i w_i^3\big)^2}

equally weighted observations (Exercise 9.19).

Effective sample sizes are imperfect diagnostics. When they are too small then we have a sign that the importance sampling weights are problematic. When they are large we still cannot conclude that importance sampling has worked. It remains possible that some important region was missed by all of X_1, . . . , X_n.

The diagnostics above used the weights w_i only through their relative values w_i/\sum_j w_j. In cases where w_i is computable, it is the observed value of a random variable with mean E_q(p(X)/q(X)) = 1. If the sample mean of the weights is far from 1 then that is a sign that q was poorly chosen. That is, \bar w = (1/n)\sum_{i=1}^n w_i is another diagnostic.

These weight based diagnostics are the same for every function f. They are convenient summaries, but the effectiveness of importance sampling depends also on f. If every f(X_i)p(X_i)/q(X_i) = 0, it is likely that our importance sampling has failed and no further diagnostic is needed. Otherwise, for a diagnostic that is specific to f, let

w_i(f) = \frac{|f(X_i)|\,p(X_i)/q(X_i)}{\sum_{j=1}^n |f(X_j)|\,p(X_j)/q(X_j)}.

After some algebra, the sample coefficient of variation of these weights is \mathrm{cv}(w, f) = \sqrt{n\sum_{i=1}^n w_i(f)^2 - 1}. An effective sample size customized to f is

n_e(f) = \frac{n}{1 + \mathrm{cv}(w, f)^2} = \frac{1}{\sum_{i=1}^n w_i(f)^2}.   (9.17)

There is a further discussion of effective sample size on page 39 of the chapter end notes.
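The diagnostics of this section are simple functions of the weights, as in the sketch below (assuming NumPy; the helper name is_diagnostics is made up). Call it with w_i = p(x_i)/q(x_i) and, optionally, the values f(x_i).

    import numpy as np

    def is_diagnostics(w, fvals=None):
        """Effective sample sizes (9.13), (9.16), the skewness version, and (9.17)."""
        out = {
            "n_e":       w.sum()**2 / (w**2).sum(),         # equation (9.13)
            "n_e_sigma": (w**2).sum()**2 / (w**4).sum(),    # equation (9.16)
            "n_e_gamma": (w**2).sum()**3 / (w**3).sum()**2,
            "mean_w":    w.mean(),                          # near 1 when w = p/q is exact
        }
        if fvals is not None:
            wf = np.abs(fvals) * w
            wf = wf / wf.sum()                              # the weights w_i(f)
            out["n_e_f"] = 1.0 / (wf**2).sum()              # equation (9.17)
        return out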


 j   Task                           Predecessors   Duration
 1   Planning                       None           4
 2   Database Design                1              4
 3   Module Layout                  1              2
 4   Database Capture               2              5
 5   Database Interface             2              2
 6   Input Module                   3              3
 7   Output Module                  3              2
 8   GUI Structure                  3              3
 9   I/O Interface Implementation   5,6,7          2
10   Final Testing                  4,8,9          2

Table 9.1: A PERT problem from Chinneck (2009, Chapter 11). Used with the author's permission. Here we replace the durations by exponential random variables. This example was downloaded from http://optlab-server.sce.carleton.ca/POAnimations2007/PERT.html.

9.4 Example: PERT

Importance sampling can be very effective but it is also somewhat delicate. Here we look at a numerical example and develop an importance sampler for it. The example is based on the Project Evaluation and Review Technique (PERT), a project planning tool. Consider the software project described in Table 9.1. It has 10 tasks (activities), indexed by j = 1, . . . , 10. The project is completed when all of the tasks are completed. A task can begin only after all of its predecessors have been completed. Figure 9.1 shows the predecessor-successor relations in this example, with a node for each activity.

The project starts at time 0. Task j starts at time S_j, takes time T_j and ends at time E_j = S_j + T_j. Any task j with no predecessors (here only task 1) starts at S_j = 0. The start time for a task with predecessors is the maximum of the ending times of its predecessors. For example, S_4 = E_2 and S_9 = max(E_5, E_6, E_7). The project as a whole ends at time E_10.

The duration θ_j for task j is listed in Table 9.1. If every task takes the time listed there, then the project takes E_10 = 15 days. For this example, we change the PERT problem to make the T_j independent exponentially distributed random variables with means given in the final column of the table. The completion time is then a random variable with a mean just over 18 days, and a long tail to the right, as can easily be seen from a simple Monte Carlo.

Now suppose that there will be a severe penalty should the project miss a deadline in 70 days time. In a direct simulation of the project n = 10,000 times, E_10 > 70 happened only 2 times. Missing that deadline is a moderately rare event and importance sampling can help us get a better estimate of its probability.

Here (T_1, . . . , T_10) plays the role of X.


Figure 9.1: PERT graph (activities on nodes). This figure shows the 10 activities for the PERT example in Table 9.1. The arrows point to tasks from their predecessors.

The distribution p is comprised of independent exponential random variables, T_j ∼ θ_j × Exp(1). The function f is 1 for E_10(T_1, . . . , T_10) > 70 and 0 otherwise. For simplicity, we take the distribution q to be that of independent random variables T_j ∼ λ_j × Exp(1). The importance sampling estimate of µ = P(E_10 > 70) is

\hat\mu_\lambda = \frac{1}{n}\sum_{i=1}^n 1\{E_{i,10} > 70\}\prod_{j=1}^{10}\frac{\exp(-T_{ij}/\theta_j)/\theta_j}{\exp(-T_{ij}/\lambda_j)/\lambda_j}.

We want to generate more E_10 > 70 events, and we can do this by taking λ_j > θ_j. We can make the mean of E_10 close to 70 by taking λ = 4θ. Importance sampling with λ = 4θ and n = 10,000 gave \hat\mu_\lambda = 2.02 × 10^{−5} with a standard error of 4.7 × 10^{−6}. It is a good sign that the standard error is reasonably small compared to the estimate. But checking the effective sample sizes, we find that

(n_{e,\mu},\ n_{e,\sigma},\ n_{e,\gamma}) ≐ (4.90,\ 1.15,\ 1.21).

Simply scaling the means by a factor of 4 has made the weights too uneven. The largest weight was about 43% of the total weight. Inspecting the standard error did not expose the problem with these weights. Similarly, the average of p(x_i)/q(x_i) was 0.997, which is not alarmingly far from 1. This problem is simple enough that we could have computed n^*_e (Exercise 9.20).
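A sketch of this simulation (assuming NumPy; the task data are those of Table 9.1, but the code itself is not from the text) is below. Other choices of λ, such as the critical path scaling discussed next, only change the line that sets lam.

    import numpy as np

    rng = np.random.default_rng(7)
    theta = np.array([4, 4, 2, 5, 2, 3, 2, 3, 2, 2], dtype=float)  # mean durations, Table 9.1
    preds = [[], [0], [0], [1], [1], [2], [2], [2], [4, 5, 6], [3, 7, 8]]  # 0-based predecessors

    def completion_time(T):
        E = np.zeros(10)
        for j in range(10):
            S = max((E[k] for k in preds[j]), default=0.0)   # start after all predecessors end
            E[j] = S + T[j]
        return E[9]

    n = 10_000
    lam = 4.0 * theta                          # importance sampling means; try other lambdas too
    T = rng.exponential(1.0, (n, 10)) * lam    # T_ij = X_ij * lambda_j with X_ij ~ Exp(1)
    E10 = np.array([completion_time(t) for t in T])
    logw = (-T / theta + T / lam).sum(axis=1) + np.log(lam / theta).sum()   # log likelihood ratio
    est = (E10 > 70.0) * np.exp(logw)
    print(est.mean(), est.std(ddof=1) / np.sqrt(n))   # estimate of P(E10 > 70) and standard error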

Multiplying all the durations by 4 is too much of a distortion. We can get better results by using somewhat smaller multiples of θ. When exploring changes in λ it is advantageous to couple the simulations.


Task          1    2    3    4    5    6    7    8    9   10
θ             4    4    2    5    2    3    2    3    2    2
∆(+1) E10     1    1         1                             1
∆(×2) E10     4    4         5    1    1              1    2

Table 9.2: For tasks j = 1, . . . , 10 the mean durations are in the row θ. The row ∆(+1) shows the increase in completion time E_10 when we add one to T_j (getting θ_j + 1), keeping all other durations at their means. Only the nonzero changes are shown. The row ∆(×2) similarly shows the delay from doubling the time on task j to 2θ_j, keeping other times at their means.

One way to do this is to generate an n × 10 matrix X of Exp(1) random variables and take T_ij = X_ij λ_j. We find this way that there is a limit to how well scale multiples λ = κθ can perform. Each component of θ that we increase makes the importance sampling weights more extreme. But the components do not all contribute equally to moving E_10 out towards 70. We may do better by using a larger multiple on the more important durations.

Table 9.2 shows the results of some simple perturbations of T_1, . . . , T_10. If all tasks but one are held fixed at their mean duration and the other is increased by 1 day, then it turns out that E_10 either stays fixed or increases by 1. It increases by 1 for tasks 1, 2, 4, and 10. Those tasks are on the critical path. If they are delayed then the project is delayed. The other tasks can suffer small delays without delaying the project. Larger delays are different. A task that takes twice as long as expected could delay the project, even if it is not on the critical path. Table 9.2 also shows the effects of these doublings.

Suppose that we take λ_j = θ_j for j not in the critical path and λ_j = κθ_j for j in the critical path {1, 2, 4, 10}. The estimates below

κ                     3.0    3.5    4.0    4.5    5.0
10^5 × \hat\mu        3.3    3.6    3.1    3.1    3.1
10^6 × se(\hat\mu)    1.8    2.1    1.6    1.5    1.6
n_{e,\mu}             939    564    359    239    165
n_{e,\sigma}          165     88     52     33     23
n_{e,\gamma}           52     28     17     12      8

were obtained from n = 10,000 simulations using multiples λ_j of common random numbers X_ij.

The effective sample sizes move as we would predict. The estimates and standard errors do not move much. There does not seem to be a clearly superior choice among these values of κ. Using κ = 4 and n = 200,000 and a different random seed, the estimate \hat\mu_q = 3.18 × 10^{−5} was found. At this level of precision \hat\mu_q ≐ \tilde\mu_q ≐ \hat\mu_{q,\mathrm{reg}}, the latter using p(x_i)/q(x_i) as a control variate. The standard error for \hat\mu_q was 3.62 × 10^{−7}. The self-normalized importance sampling estimate \tilde\mu_q had a larger standard error of 5.22 × 10^{−7}. The standard error of \hat\mu_{q,\mathrm{reg}} was barely smaller than the one for \hat\mu_q but they agreed to the precision given here. The effective sample sizes, rounded to integer values, were (n_{e,\mu}, n_{e,\sigma}, n_{e,\gamma}) ≐ (7472, 992, 1472). The mean weight was 0.9992 and the coefficient of variation for f(x_i)p(x_i)/q(x_i) was 5.10, leading to an f-specific effective sample size (equation (9.17)) of n_e(f) ≐ 7405.

We can estimate the variance reduction due to importance sampling as

\frac{\hat\mu_q(1-\hat\mu_q)/n}{se(\hat\mu_q)^2} ≐ \frac{\hat\mu_q/n}{se(\hat\mu_q)^2} ≐ 1209.

That is, n = 200,000 samples with importance sampling are about as effective as roughly 242 million samples from the nominal distribution. Plain importance sampling was roughly twice as efficient as self-normalized importance sampling on this example: se(\tilde\mu_q)²/se(\hat\mu_q)² ≐ 2.08.

It is possible to improve still further. This importance sampler quadrupled all the task durations on the critical path. But quadrupling the 10'th time incurs the same likelihood ratio penalty as quadrupling the others without adding as much to the duration. We might do better using a larger multiple for task 5 and a smaller one for task 10. Of course, any importance sampler yielding positive variance can be improved, and we must at some point stop improving the estimate and move on to the next problem.

It is not necessary to sample the 10'th duration T_10 from the nominal distribution or any other. Once E_4, E_8 and E_9 are available, either S_10 ≥ 70, in which case we are sure that E_10 > 70, or S_10 < 70 and we can compute P(E_10 > 70 | T_1, . . . , T_9) = exp(−(70 − S_10)/θ_10) using the tail probability for an exponential distribution. That is, we can use conditional Monte Carlo. Here two principles conflict. We should use conditional Monte Carlo on principle, because using known integrals where possible reduces variance. A contrary principle is that we do better to stay with simple methods because human time is worth much more than computer time. For this example, we take the second principle and consider (but do not implement) some ways in which the analysis might be sharpened.

It is not just the 10'th time that can be replaced by conditional Monte Carlo. Given all the durations except T_j, there is a threshold m ≥ 0 such that E_10 > 70 when T_j > m. We could in principle condition on any one of the durations. Furthermore, it is clear that antithetic sampling would reduce variance because E_10 is monotone in the durations T_1, . . . , T_10. See Exercise 8.4 for the effects of antithetic sampling on spiky functions.

In this example importance sampling worked well but had to be customized to the problem. Finding the critical path and making it take longer worked well. That cannot be a general finding for all PERT problems. There may sometimes be a second path, just barely faster than the critical path, and having little or no overlap with the critical path. Increasing the mean duration of the critical path alone might leave a very high sampling variance due to contributions from the second path. Importance sampling must be used with some caution. Defensive importance sampling and other mixture based methods in §9.11 provide some protection.


9.5 Importance sampling versus acceptance-rejection

Importance sampling and acceptance-rejection sampling are quite similar ideas. Both of them distort a sample from one distribution in order to sample from another. It is natural then to compare the two methods. There are two main features on which to compare them: implementation difficulty and efficiency. Importance sampling is usually easier to implement. Either one can be more efficient.

For implementation, let p be the target density and q the proposal density. We might only have unnormalized versions p_u and q_u. Ordinary importance sampling requires that we can compute p(x)/q(x). Self-normalized importance sampling has the weaker requirement that we can compute p_u(x)/q_u(x). Acceptance-rejection sampling can be applied with either the ratio p(x)/q(x) or p_u(x)/q_u(x). But either way we must know an upper bound c for this ratio in order to use acceptance-rejection. We don't need to know the bound for importance sampling. Furthermore, when sup_x p(x)/q(x) = ∞ then acceptance-rejection is impossible while importance sampling is still available, though it might have a high variance.

As a result importance sampling is easier to implement than acceptance-rejection sampling, for straightforward estimation of µ as above. On the other hand, if the problem involves many different random variables X, Y, Z, each being generated in different places in our software, then using exact samples from each distribution may be easier to do than managing and combining importance weights for every variable in the system.

To compare the efficiencies we have to take account of costs. At one extreme, suppose that almost all of the cost is in computing f, so that the costs of sampling from q and deciding whether to accept or reject that sample are negligible. Then we would use the same value of n in either importance sampling or acceptance-rejection sampling. As a result the efficiency comparison is the same as that between importance sampling and IID sampling. In this case, the relative efficiency of importance sampling from q to acceptance-rejection is σ²_p/σ²_q. For self-normalized importance sampling it is σ²_p/σ²_{q,sn}.

At the other extreme, suppose that almost all of our costs are in sampling from q and computing the ratio q/p, while f is very inexpensive. Maybe X is a complicated stochastic process and f is simply whether it ends at a positive value. Then in the time it takes us to get n samples by importance sampling we would have accepted a random number of samples by acceptance-rejection. This number of samples will have the Bin(n, 1/c) distribution. For large n we might as well suppose it is n/c points, where the acceptance-rejection rule keeps points when p(X)/q(X) ≤ c. In this extreme setting the efficiency of importance sampling is cσ²_p/σ²_q and that of self-normalized importance sampling is cσ²_p/σ²_{q,sn}.

We can get this same efficiency answer from a longer argument, writing the acceptance-rejection estimate as \hat\mu_{AR} = \sum_{i=1}^n A_i Y_i/\sum_{i=1}^n A_i, where A_i is 1 if X_i is accepted and 0 otherwise, and then using the ratio estimation variance formula (2.29) of §2.7.


Family      p(·; θ)      w(·)                                                              Θ
Normal      N(θ, Σ)      exp(x^TΣ^{−1}(θ_0 − θ) + ½θ^TΣ^{−1}θ − ½θ_0^TΣ^{−1}θ_0)          R^d
Poisson     Poi(θ)       exp(θ − θ_0)(θ_0/θ)^x                                             (0, ∞)
Binomial    Bin(m, θ)    (θ_0/θ)^x((1 − θ_0)/(1 − θ))^{m−x}                                (0, 1)
Gamma       Gam(θ)       x^{θ_0−θ} Γ(θ)/Γ(θ_0)                                             (0, ∞)

Table 9.3: Some example parametric families with their importance sampling ratios. The normal family shown shares a non-singular Σ.

We expect that our problem is somewhere between these extremes, and so the efficiency of importance sampling compared to acceptance-rejection is somewhere between σ²_p/σ²_q and cσ²_p/σ²_q. The relative efficiency of self-normalized importance sampling to acceptance-rejection is between σ²_p/σ²_{q,sn} and cσ²_p/σ²_{q,sn}.

9.6 Exponential tilting

There are a small number of frequently used strategies for picking the importance distribution q. Here we describe exponential tilting. In §9.7 we consider matching the Hessian of q to that of p at the mode.

Often the nominal distribution p is a member of a parametric family, which we may write as p(x) = p(x; θ_0). When we can sample from any member of this family it becomes convenient to define the importance sampling distribution by a change of parameter: q_θ(x) = p(x; θ). Many examples in this chapter are of this type. Table 9.3 gives the weight function w(x) for several parametric families.

The example distributions in Table 9.3 are all exponential families. In an exponential family,

p(x; θ) = \exp\big(\eta(θ)^T T(x) - A(x) - C(θ)\big), \qquad θ ∈ Θ,

for functions η(·), T(·), A(·) and C(·). The normalizing constant (in the denominator of p) is c(θ) = exp(C(θ)). Here p may be a probability density function or a probability mass function depending on the context. It takes a bit of rearrangement to put a distribution into the exponential family framework. In this form we find that

w(x) = \frac{p(x; θ_0)}{p(x; θ)} = e^{(\eta(θ_0)-\eta(θ))^T T(x)} \times \frac{c(θ)}{c(θ_0)}.

The factor exp(−A(x)) cancels, resulting in a simpler formula that may also be much faster than computing the ratio explicitly.

Often η(θ) = θ and T(x) = x, and then w(x) = exp((θ_0 − θ)^T x)\,c(θ)/c(θ_0). In this case, the importance sampling estimate takes the particularly simple form

\hat\mu_\theta = \frac{1}{n}\sum_{i=1}^n f(X_i)\,e^{(θ_0-θ)^T X_i} \times \frac{c(θ)}{c(θ_0)}.

Importance sampling by changing the parameter in an exponential family is variously called exponential tilting or exponential twisting. The normalizing constants for popular exponential families can usually be looked up. The exponential form is particularly convenient when X is a stochastic process. We will see an example in §9.9.
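For instance, for a Poisson nominal distribution the ratio in Table 9.3 is w(x) = exp(θ − θ_0)(θ_0/θ)^x. The sketch below (assuming NumPy and SciPy; the parameter values are made up) tilts Poi(5) to Poi(25) to estimate the small upper tail probability P(X ≥ 25).

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(8)
    theta0, theta, n = 5.0, 25.0, 100_000
    x = rng.poisson(theta, n)                              # sample from the tilted Poi(theta)
    logw = (theta - theta0) + x * np.log(theta0 / theta)   # log of the Table 9.3 ratio
    est = (x >= 25) * np.exp(logw)
    print(est.mean(), est.std(ddof=1) / np.sqrt(n), stats.poisson.sf(24, theta0))   # vs exact tail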

Exponential tilting can be used outside of exponential families. If p(x) is a density function then q_η(x) ∝ p(x) exp(−η^T T(x)) may still be a useful density (if it can be normalized). But q_η will no longer be in the same parametric family as p, perhaps requiring new sampling algorithms.

9.7 Modes and Hessians

In a Bayesian setting the target distribution is the posterior distribution of a parameter. In keeping with that context we will use the letter X to represent some data, and the quantity that we importance sample will be represented by θ. A very general approach is to approximate the log posterior distribution by a quadratic Taylor approximation around its mode. That suggests a Gaussian distribution with covariance matrix equal to the inverse of the Hessian matrix of second derivatives. We use logistic regression as a concrete example but the concepts apply to many parametric statistical models.

In logistic regression, the binary random variable Y_i ∈ {0, 1} is related to a vector x_i ∈ R^d by

P(Y_i = 1) = \frac{\exp(x_i^T θ)}{1 + \exp(x_i^T θ)} = \frac{1}{1 + \exp(-x_i^T θ)}.

Typically the first component of x_i is 1, which puts an intercept term in the model. The Y_i are independent and we let Y = (Y_1, . . . , Y_n).

The Bayesian approach models θ as a random variable with a prior distribution π(θ). We often work with an un-normalized version π_u(θ) ∝ π(θ). Then the posterior distribution of θ given n independent observations is π(θ | Y) = π_u(θ | Y)/c, for an un-normalized posterior distribution

\pi_u(θ\,|\,Y) = \pi_u(θ)\prod_{i=1}^n\Big(\frac{\exp(x_i^Tθ)}{1+\exp(x_i^Tθ)}\Big)^{Y_i}\Big(\frac{1}{1+\exp(x_i^Tθ)}\Big)^{1-Y_i}.

We might not know the normalizing constant c even when we know the constant for π(θ). Sometimes an improper prior π_u(θ) = 1 is used. That prior cannot be normalized, and then whether π_u(θ | Y) can be normalized depends on the data.

The logarithm of the posterior distribution is

-\log(c) + \log(\pi_u(θ)) + \sum_{i=1}^n Y_i x_i^Tθ - \sum_{i=1}^n\log\big(1+\exp(x_i^Tθ)\big).


The sums over i grow with n while π_u(θ) does not, and so for n large enough the data dominate log(π(θ)). The first sum is linear in θ. Minus the second sum is a concave function of θ. We will assume that log(π_u(θ | Y)) is strictly concave and takes its maximum at a finite value. With π_u(θ) = 1, that could fail to happen, trivially if all Y_i = 0, or in a more complicated way if the x_i with Y_i = 0 can be linearly separated from those with Y_i = 1. Such separability failures may be rare when d ≪ n but they are not rare if d and n are of comparable size. We can force strict concavity through a choice of π_u that decays at least as fast as exp(−‖θ‖^{1+ε}) for some ε > 0.

Let θ_* be the maximizer of π_u(θ | Y). If π_u(θ) is not too complicated then we may be able to compute θ_* numerically. The first derivatives of log(π_u(θ | Y)) with respect to θ vanish at this optimum. Then a Taylor approximation around this maximum a posteriori estimate is

\log(\pi_u(θ\,|\,Y)) ≐ \log(\pi_u(θ_*\,|\,Y)) - \frac{1}{2}(θ-θ_*)^T H_*(θ-θ_*)   (9.18)

where

H_* = -\frac{\partial^2}{\partial θ\,\partial θ^T}\log(\pi_u(θ\,|\,Y))\Big|_{θ=θ_*}

is (minus) the matrix of second derivatives of log(π_u(θ | Y)). We can usually approximate H_* by divided differences and, for convenient π_u, we can get H_* in closed form.

We interpret (9.18) as saying that θ | Y is approximately N(θ_*, H_*^{-1}). Then

E(f(θ)\,|\,Y) = \frac{\int_{R^d} f(θ)\,\pi_u(θ\,|\,Y)\,dθ}{\int_{R^d}\pi_u(θ\,|\,Y)\,dθ} ≐ E\big(f(θ)\,|\,θ \sim N(θ_*, H_*^{-1})\big).   (9.19)

This approximation may be good enough for our purposes and, depending on f(·), we may or may not have to resort to Monte Carlo sampling from N(θ_*, H_*^{-1}).

When we want to estimate posterior properties of θ without the Gaussian approximation then we can importance sample from a distribution similar to π_u(θ | Y). Even when we believe that N(θ_*, H_*^{-1}) is fairly close to π_u(θ | Y), it is unwise to importance sample from a Gaussian distribution. The Gaussian distribution has very light tails.

A popular choice is to apply self-normalized importance sampling, drawing IID samples θ_i from the multivariate t distribution t(θ_*, H_*^{-1}, ν) of §5.2. Then we estimate E(f(θ) | Y) by

\frac{\sum_{i=1}^n f(θ_i)\,\pi_u(θ_i\,|\,Y)\big/\big(1 + (θ_i-θ_*)^T H_*(θ_i-θ_*)\big)^{-(\nu+d)/2}}{\sum_{i=1}^n \pi_u(θ_i\,|\,Y)\big/\big(1 + (θ_i-θ_*)^T H_*(θ_i-θ_*)\big)^{-(\nu+d)/2}},

where each term divides the unnormalized posterior by the t density kernel.

The degrees of freedom ν determine how heavy the tails of the t distribution are. They decay as (1 + ‖θ − θ_*‖²)^{−(ν+d)/2} ≈ ‖θ − θ_*‖^{−ν−d} for large ‖θ‖. For the logistic regression example above, log(1 + exp(z)) is asymptotic to z as z → ∞ and to 0 as z → −∞. It follows that the data contribution to the log posterior decays at the rate exp(−‖θ‖). Any ν will lead to an importance sampling distribution with heavier tails than the posterior distribution.
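A minimal sketch of this recipe follows (not from the text; it assumes NumPy and SciPy ≥ 1.6 for scipy.stats.multivariate_t, simulated data, a flat prior π_u(θ) = 1, and ν = 5): find the posterior mode, form the Hessian there, sample from the multivariate t, and self-normalize with weights proportional to π_u(θ | Y)/q(θ).

    import numpy as np
    from scipy import optimize, stats

    rng = np.random.default_rng(9)
    n_obs, d = 200, 3
    X = np.column_stack([np.ones(n_obs), rng.normal(size=(n_obs, d - 1))])
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ np.array([-1.0, 1.0, 0.5]))))

    def log_post(theta):                                 # log pi_u(theta | Y), flat prior
        z = X @ theta
        return y @ z - np.logaddexp(0.0, z).sum()

    theta_star = optimize.minimize(lambda t: -log_post(t), np.zeros(d)).x
    p_hat = 1.0 / (1.0 + np.exp(-X @ theta_star))
    H_star = X.T @ (X * (p_hat * (1.0 - p_hat))[:, None])   # Hessian of -log posterior at the mode

    nu, n = 5, 50_000
    q = stats.multivariate_t(loc=theta_star, shape=np.linalg.inv(H_star), df=nu)
    th = q.rvs(size=n)
    logw = np.array([log_post(t) for t in th]) - q.logpdf(th)   # log( pi_u / q )
    w = np.exp(logw - logw.max())                               # stabilized; constants cancel
    post_mean = (w[:, None] * th).sum(axis=0) / w.sum()         # self-normalized posterior mean
    print(theta_star, post_mean)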

There have been several refinements of this Hessian approach to account for skewness or multimodality of the posterior distribution. See page 38 of the chapter end notes.

The method cannot be applied when H_* is not invertible. For a simple one dimensional example of this problem, let the posterior distribution be proportional to exp(−(θ − θ_*)^4) for θ ∈ R. Then H_* = 0.

If we have a normalized posterior distribution π(θ | Y) then we might consider ordinary importance sampling. There the optimal importance distribution is proportional to f(θ)π(θ | Y). For f ≥ 0 we might then choose θ_* to maximize f(θ)π(θ | Y) and take H_* from the Hessian matrix of log(f(θ)) + log(π(θ | Y)) at θ_*. The result is very much like the Laplace approximation method in quadrature (§7.6).

9.8 General variables and stochastic processes

So far we have assumed that the nominal and sampling distributions of X are continuous and finite dimensional. Here we extend importance sampling to more general random variables and to processes.

For discrete random variables X, the ratio w(x) = p(x)/q(x) is now a ratio of probability mass functions. The sampling distribution q has to satisfy q(x) > 0 whenever p(x)f(x) ≠ 0. The variance of \hat\mu_q for discrete sampling is σ²_q/n where

\sigma^2_q = \sum_j \frac{\big(f(x_j)p(x_j) - \mu q(x_j)\big)^2}{q(x_j)},

taking the sum over all j with q(x_j) > 0. For self-normalized importance sampling we require q(x) > 0 whenever p(x) > 0.

We can also use importance sampling on mixed random variables having both continuous and discrete components. Suppose that with probability λ the vector X is drawn from a discrete mass function p_d, while with probability 1 − λ it is sampled from a probability density function p_c. That is, p = λp_d + (1 − λ)p_c for 0 < λ < 1. Now we sample from a distribution q = ηq_d + (1 − η)q_c where 0 < η < 1, for a discrete mass function q_d and a probability density function q_c. We assume that q_c(x) > 0 whenever f(x)p_c(x) ≠ 0 and that q_d(x) > 0 whenever p_d(x)f(x) ≠ 0. Then the importance sampling estimate of µ = E_p(f(X)) is

\frac{1}{n}\sum_{i=1}^n f(X_i)\,w(X_i)


where X_i ∼ q and

w(x) = \begin{cases}\dfrac{\lambda p_d(x)}{\eta q_d(x)}, & q_d(x) > 0,\\[4pt] \dfrac{(1-\lambda)p_c(x)}{(1-\eta)q_c(x)}, & q_d(x) = 0.\end{cases}

Notice that w(x) is determined only by the discrete parts at points x for which both q_d(x) > 0 and q_c(x) > 0. The discrete component trumps the continuous one.

Now suppose that X = (X_1, X_2, . . . ) is a process of potentially infinite dimension with nominal distribution p(x) = \prod_{j\ge1} p_j(x_j | x_1, . . . , x_{j-1}). For j = 1 we interpret p_j(x_j | x_1, . . . , x_{j-1}) as simply p_1(x_1). If we importance sample with q(x) = \prod_{j\ge1} q_j(x_j | x_1, . . . , x_{j-1}) then the likelihood ratio is

w(x) = \prod_{j\ge1}\frac{p_j(x_j\,|\,x_1, . . . , x_{j-1})}{q_j(x_j\,|\,x_1, . . . , x_{j-1})}.

An infinite product leads to very badly distributed weights unless we arrange for the factors to rapidly approach 1.

In an application, we could never compute an infinite number of components of x. Instead we apply importance sampling to problems where we can compute f(X) from just an initial subsequence X_1, X_2, . . . , X_M of the process. The time M at which the value of f(X) becomes known is generally random. When M = m, then f(X) = f_m(X_1, . . . , X_m) for some function f_m. Assuming that the i'th sampled process generates x_{i1}, . . . , x_{iM_i} where M_i = m < ∞, we will truncate the likelihood ratio at m factors using

w_m(x) = \prod_{j=1}^m\frac{p_j(x_j\,|\,x_1, . . . , x_{j-1})}{q_j(x_j\,|\,x_1, . . . , x_{j-1})}.

We have assumed that P_q(M < ∞) = 1. A simulation without this property might fail to terminate, so we need it. For efficiency, we need the stronger condition that E_q(M) < ∞. We also need M = M(X) to be a stopping time, which means that we can tell whether M_i ≤ m by looking at the values x_{i1}, . . . , x_{im}. For example, the hitting time M_i = min{m ≥ 1 | x_{im} ∈ A} of a set A is a valid stopping time, as is min(M_i, 1000), or more generally the minimum of several stopping times. But M_i − 1 is generally not a valid stopping time because we usually can't tell that the process is just about to hit the set A.

For a process, the importance sampling estimator is

\hat\mu_q = \frac{1}{n}\sum_{i=1}^n f(X_i)\,w_{M_i}(X_i).

Next, we find E_q(\hat\mu_q). We will assume that q_j(x_j | x_1, . . . , x_{j-1}) > 0 whenever p_j(x_j | x_1, . . . , x_{j-1}) > 0.


We assume that E_q(f(X)w_M(X)) exists and P_q(M < ∞) = 1. Then, recalling that \sum_{m=1}^\infty denotes \sum_{1\le m<\infty},

E_q\big(f(X)\,w_{M(X)}(X)\big) = \sum_{m=1}^\infty E_q\big(f_m(X)\,w_{M(X)}(X)\,1_{M(X)=m}\big)
   = \sum_{m=1}^\infty E_q\Big(f_m(X)\prod_{j=1}^m\frac{p_j(X_j\,|\,X_1,\dots,X_{j-1})}{q_j(X_j\,|\,X_1,\dots,X_{j-1})}\,1_{M(X)=m}\Big)
   = \sum_{m=1}^\infty E_p\big(f_m(X)\,1_{M(X)=m}\big)
   = E_p\big(f(X)\,1_{M(X)<\infty}\big).

We needed P(M < ∞) = 1 for the first equality. For the third equality, 1_{M(X)=m} is a function of X_1, . . . , X_m because M is a stopping time. Therefore the whole integrand there is an expectation over X_1, . . . , X_m from q, which can be replaced by expectation from p, after canceling the likelihood ratio.

We have found that

E_q(\hat\mu_q) = E_p\big(f(X)\,1_{M<\infty}\big).

If P_q(M < ∞) = P_p(M < ∞) = 1 then E_q(\hat\mu_q) = E_p(f(X)), as we might naively have expected. But in general we can have E_q(\hat\mu_q) ≠ E_p(f(X)). The difference can be quite useful. For example, suppose that we want to find P_p(M < ∞). Then we can define f(x) = 1 for all x and look among distributions q with P_q(M < ∞) = 1 for one that gives a good estimate. Example 9.5 in §9.9 finds the probability that a Gaussian random walk with a negative drift ever exceeds some positive value.

9.9 Example: exit probabilities

A good example of importance sampling is in Siegmund's method for computing the probability that a random walk reaches the level b before reaching a, where a ≤ 0 < b. Suppose that the X_j are IID random variables with distribution p having mean E_p(X) < 0 and P(X > 0) > 0. Let S_0 = 0, and for m ≥ 1, let S_m = \sum_{j=1}^m X_j. Define M = M_{a,b} = min{m ≥ 1 | S_m ≥ b or S_m ≤ a}. The process S_m is a random walk starting at 0, taking p-distributed jumps, and M represents the first time, after 0, that the process is outside the interval (a, b). The desired quantity, µ = P(S_M ≥ b), gives the probability that at the time of the first exit from (a, b), the walk is positive. The walk has a negative drift. If b is large enough, then S_M ≥ b is a rare event.

In one application, X_j represents a gambler's winnings, losses being negative winnings, at the j'th play of the game. The game continues until either the gambler's winnings reach b, or the losses reach |a|. Then µ is the probability that the game ends with the gambler having positive winnings.


A second application is the educational testing problem of §6.2. The sequential probability ratio test was driven by a random walk that drifted up for students who had mastered the material and down for those who needed remediation.

Suppose now that p is in an exponential family with parameter θ_0. Let q(x) = p(x) exp((θ − θ_0)x) c_{θ_0}/c_θ. Then if we run our random walk with X_i ∼ q we find that
$$
\hat\mu_q = \frac1n\sum_{i=1}^n 1\{S_{M_i}\ge b\}\prod_{j=1}^{M_i}\frac{p(X_{ij})}{q(X_{ij})}
= \frac1n\sum_{i=1}^n 1\{S_{M_i}\ge b\}\Bigl(\frac{c_\theta}{c_{\theta_0}}\Bigr)^{M_i}\exp\bigl((\theta_0-\theta)S_{M_i}\bigr).
$$

The estimator has a very desirable property: the likelihood ratio w(X) = ∏_{j=1}^M p(X_j)/q(X_j) only depends on X through the number M of steps taken and the end point S_M, but not the entire path from 0 to S_M. Often there is a special value θ* ≠ θ_0 with c_{θ*} = c_{θ_0}. If we choose θ = θ* then the estimator simplifies to
$$\hat\mu_q = \frac1n\sum_{i=1}^n 1\{S_{M_i}\ge b\}\exp\bigl((\theta_0-\theta^*)S_{M_i}\bigr)$$
which depends only on the end-point and not even on how many steps it took to get there. A celebrated result of Siegmund (1976) is that this choice θ* is asymptotically optimal in the limit as b → ∞. Let the variance using importance sampling with tilt θ be σ²_θ/n. If θ ≠ θ* then his Theorem 1 has
$$\lim_{b\to\infty}\frac{\mu^2+\sigma^2_{\theta^*}}{\mu^2+\sigma^2_{\theta}} = 0$$
at an exponential rate, for any a. This implies not only that θ* is best but that in this limit no other tilting parameter θ can have σ_θ/µ remain bounded. That result holds for a walk on real values or on integer multiples of some h > 0, that is {0, ±h, ±2h, ±3h, …}, with b an integer multiple of h.

We can illustrate this method with a normal random walk. Suppose that p is N(θ_0, σ²). We can assume that σ = 1, for otherwise we could divide a, b, θ_0, and σ by σ without changing µ. With p = N(θ_0, 1) and q = N(θ, 1) we find that q(x) = p(x) exp((θ − θ_0)x) c(θ_0)/c(θ) where the normalizing constant is c(θ) = e^{θ²/2}. Choosing θ* = −θ_0 > 0 we obtain c(θ*) = c(θ_0). This choice reverses the drift and now
$$\hat\mu_q = \frac1n\sum_{i=1}^n 1\{S_{M_i}\ge b\}\exp(2\theta_0 S_{M_i}). \tag{9.20}$$

The advantage of reversing the drift parameter becomes apparent when b > 0 becomes very large while a remains fixed. For large b the nominal simulation will


seldom cross the upper boundary. The reversed simulation now drifts towards the upper boundary, the important region. The estimate µ̂_q keeps score by averaging exp(2θ_0 S_{M_i}). These are very small numbers, being no larger than exp(−2|θ_0|b).

Example 9.5. A numerical example with a = −5, b = 10, θ_0 = −1 and n = 50,000 yielded the following results: µ̂_{θ*} = 6.6 × 10⁻¹⁰ with √Var(µ̂_{θ*}) = 2.57 × 10⁻¹². A nominal simulation with µ on the order of 10⁻⁹ would be extremely expensive. It is clear from (9.20) that µ ≤ exp(2θ_0 b) = exp(−20) ≐ 2.06 × 10⁻⁹. Because S_M is larger than b we get µ < exp(2θ_0 b).

As a → −∞ we have µ → P(max_{m≥1} S_m ≥ b) = P(M_{−∞,b} < ∞), the probability that the random walk will ever exceed b. Because we reversed the drift, we can be sure that under the importance distribution the walk will always exceed b.
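The calculation in Example 9.5 is easy to sketch in code. The following is a minimal illustration of estimator (9.20), not the program behind the numbers reported above; it assumes p = N(θ_0, 1) steps with θ_0 = −1 and samples the steps from the drift-reversed proposal N(−θ_0, 1).

```python
import numpy as np

def exit_prob_reversed(theta0=-1.0, a=-5.0, b=10.0, n=50_000, seed=1):
    """Estimate P(S_M >= b) for a random walk with N(theta0, 1) steps by
    sampling the steps from N(-theta0, 1) and scoring upper exits with the
    likelihood ratio exp(2 * theta0 * S_M), as in (9.20)."""
    rng = np.random.default_rng(seed)
    vals = np.zeros(n)
    for i in range(n):
        s = 0.0
        while a < s < b:                   # M is the first exit time of (a, b)
            s += rng.normal(-theta0, 1.0)  # steps drawn from the reversed-drift q
        if s >= b:                         # lower exits contribute zero
            vals[i] = np.exp(2.0 * theta0 * s)
    return vals.mean(), vals.std(ddof=1) / np.sqrt(n)

mu_hat, se = exit_prob_reversed()
```

Because the proposal drifts upward, nearly every simulated path exits at the top, and every contribution is at most exp(2θ_0 b), so the averaged values are tiny but well behaved.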

Example 9.6 (Insurance company failure). A similar simulation can be used to estimate the probability that an insurance company is bankrupted. Between claims the amount of money the company receives is a per unit time, from premiums collected minus overhead expenses. Claim i arrives at time T_i and costs Y_i. The company's balance at claim times is a random walk B_i = B_{i−1} + X_i for X_i = a(T_i − T_{i−1}) − Y_i. If they have priced risk well then E(X_i) > 0, so the walk drifts up. Their ruin probability when starting with a balance B_0 > 0 is P(min_{i≥1} B_i < 0). The random variables X_i are not necessarily from an exponential family, so the likelihood ratio w(x) may be more complicated. But the general notion of reversing the drift is still applicable.

9.10 Control variates in importance sampling

Control variates can be usefully combined with importance sampling. A control variate is a vector h(x) = (h_1(x), …, h_J(x))ᵀ for which ∫h(x) dx = θ for a known value θ ∈ R^J. The integral is over the entire set of possible x values, such as all of R^d. To use integrals over a subset we make h(x) = 0 outside that subset. Some of the equations below simplify if θ = 0. We can make θ = 0 by using h(x) − θ as the control variate.

A second form of control variate has a vector g(x) where ∫g(x)p(x) dx = θ is known. We will work mainly with the first form. When the density p is normalized, the first form includes the second via h(x) = g(x)p(x).

Combining control variates with importance sampling from q leads to the estimate
$$\hat\mu_{q,\beta} = \frac1n\sum_{i=1}^n \frac{f(X_i)p(X_i) - \beta^\mathsf{T} h(X_i)}{q(X_i)} + \beta^\mathsf{T}\theta$$
for X_i iid∼ q.

Theorem 9.3. Let q be a probability density function with q(x) > 0 whenever either h(x) ≠ 0 or f(x)p(x) ≠ 0. Then E_q(µ̂_{q,β}) = µ for any β ∈ R^J.


Proof. The result follows from these two equalities: E_q(f(X)p(X)/q(X)) = ∫f(x)p(x) dx = µ, and E_q(h_j(X)/q(X)) = ∫h_j(x) dx = θ_j.

The variance of µ̂_{q,β} is σ²_{q,β}/n where
$$\sigma^2_{q,\beta} = \int \Bigl(\frac{f(x)p(x) - \beta^\mathsf{T} h(x)}{q(x)} - \mu + \beta^\mathsf{T}\theta\Bigr)^2 q(x)\,dx. \tag{9.21}$$
If we consider q fixed then we want the vector β_opt which minimizes (9.21). We can seldom get a closed form expression for β_opt but we can easily estimate it from the sample data.

The variance (9.21) is the mean squared error of a linear model relating fp/q to h/q under sampling X ∼ q. Our sample was drawn from this distribution q, so we may estimate β_opt by least squares regression on the sample. We let Y_i = f(X_i)p(X_i)/q(X_i) and Z_{ij} = h_j(X_i)/q(X_i) − θ_j, and minimize ∑_{i=1}^n (Y_i − µ − βᵀZ_i)² over β ∈ R^J and µ ∈ R, obtaining β̂ and the corresponding µ̂_{q,β̂}. As usual with control variates, µ̂_{q,β̂} has a nonzero but asymptotically small bias.

For a 99% confidence interval we compute
$$\hat\sigma^2_{q,\hat\beta} = \frac1{n-J-1}\sum_{i=1}^n \bigl(Y_i - \hat\mu_{q,\hat\beta} - \hat\beta^\mathsf{T} Z_i\bigr)^2$$
and then take µ̂_{q,β̂} ± 2.58 σ̂_{q,β̂}/√n.
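The regression recipe above is straightforward to code. The sketch below is a minimal illustration, assuming the user supplies vectorized functions f, p and q together with control variates h_j whose integrals θ_j are known; the function and argument names are ours, not from any particular library.

```python
import numpy as np

def is_with_controls(x, f, p, q, h_list, theta):
    """Importance sampling with control variates fitted by least squares.

    x       : array of n draws from q
    f, p, q : vectorized functions of x
    h_list  : list of J control variate functions; integral of h_list[j] is theta[j]
    Returns the regression estimate of mu and a 99% confidence interval.
    """
    y = f(x) * p(x) / q(x)                                # Y_i = f p / q
    z = np.column_stack([h(x) / q(x) - t                  # Z_ij = h_j / q - theta_j
                         for h, t in zip(h_list, theta)])
    n, J = z.shape
    design = np.column_stack([np.ones(n), z])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)     # intercept + coefficients
    resid = y - design @ coef
    sigma2 = resid @ resid / (n - J - 1)
    half = 2.58 * np.sqrt(sigma2 / n)
    return coef[0], (coef[0] - half, coef[0] + half)
```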

Theorem 9.4. Suppose that there is a unique vector β_opt ∈ R^J which minimizes σ²_{q,β} of (9.21). Suppose further that
$$E_q\bigl(h_j(X)^4/p(X)^4\bigr) < \infty \quad\text{and}\quad E_q\bigl(h_j(X)^2 f(X)^2/p(X)^4\bigr) < \infty$$
for j = 1, …, J. Then as n → ∞,
$$\hat\beta = \beta_{\mathrm{opt}} + O_p(n^{-1/2}) \quad\text{and}\quad \hat\mu_{q,\hat\beta} = \hat\mu_{q,\beta_{\mathrm{opt}}} + O_p(n^{-1}).$$

Proof. This is Theorem 1 of Owen and Zhou (2000).

The conclusion of Theorem 9.4 is that we get essentially the same asymptotic error from an estimated coefficient as we would with the optimal one. The uniqueness condition in Theorem 9.4 will fail if two of the predictors h_j/q are collinear. In that case we may drop one or more redundant predictors to attain uniqueness.

Probability density functions provide a useful class of control variates h_j. We know that ∫h_j(x) dx = 1. In §9.11 we will see a strong advantage to using a distribution q that is a mixture of densities with those same densities used as control variates.


9.11 Mixture importance sampling

In mixture importance sampling, we sample X_1, …, X_n from the mixture distribution q_α = ∑_{j=1}^J α_j q_j where α_j > 0, ∑_{j=1}^J α_j = 1, and the q_j are distributions. The closely related multiple importance sampling method of §9.12 takes n_j ≥ 1 observations X_{ij} ∼ q_j, all independently, and then combines them to estimate µ.

Mixtures are convenient to sample from. Mixtures of unimodal densities provide a flexible approximation to multimodal target densities p that may arise in some Bayesian contexts. Similarly, since the ideal q is proportional to fp and not p, we might seek a mixture describing peaks in fp. For instance when a physical structure described by X has several different ways to fail, represented by the event X ∈ A_1 ∪ A_2 ∪ ⋯ ∪ A_J, we may choose a q_j that shifts X towards A_j. In computer graphics, f(x) may be the total contribution of photons reaching a point in an image. If there are multiple light sources and reflective objects in the scene, then mixture sampling is suitable with components q_j representing different light paths reaching the image.

Mixtures also allow us to avoid an importance density q with tails that are too light. If one mixture component is the nominal distribution p then the tails of q cannot be much lighter than those of p. Theorem 9.6 below shows that mixture components used as control variates allow us to make several guesses as to which importance sampling distribution is best and get results comparable to those of the unknown best distribution among our choices.

In mixture importance sampling, we estimate µ = E(f(X)) by the usual importance sampling equation (9.3), which here reduces to
$$\hat\mu_\alpha = \frac1n\sum_{i=1}^n \frac{f(X_i)p(X_i)}{\sum_{j=1}^J \alpha_j q_j(X_i)}. \tag{9.22}$$
Notice that equation (9.22) does not take account of which component q_j actually delivered X_i. If we had sampled X_i from the mixture distribution by inversion, then we would not even know which q_j that was, but we could still use (9.22). We do have to compute q_j(X_i) for all 1 ≤ j ≤ J, regardless of which proposal density generated X_i.
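In code, (9.22) might look like the sketch below; the helper for sampling from the mixture is one reasonable choice among several, and the names are illustrative.

```python
import numpy as np

def sample_mixture(rng, n, alpha, samplers):
    """Draw n points from q_alpha: pick component counts multinomially, then sample."""
    counts = rng.multinomial(n, alpha)
    return np.concatenate([samplers[j](counts[j]) for j in range(len(alpha))])

def mixture_is(x, f, p, q_list, alpha):
    """Mixture importance sampling estimate (9.22), with a standard error.
    Every component density q_j is evaluated at every x, whichever one produced it."""
    q_alpha = sum(a * q(x) for a, q in zip(alpha, q_list))
    y = f(x) * p(x) / q_alpha
    return y.mean(), y.std(ddof=1) / np.sqrt(len(x))
```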

An important special case of mixture IS is defensive importance sampling. In defensive importance sampling, we take a distribution q thought to be a good importance sampler, mix it with the nominal distribution p, and then use
$$q_\alpha(x) = \alpha_1 p(x) + \alpha_2 q(x)$$
as the importance distribution, where α_j > 0 and α_1 + α_2 = 1. For α_1 > 0 we find that
$$\frac{p(x)}{q_\alpha(x)} = \frac{p(x)}{\alpha_1 p(x) + \alpha_2 q(x)} \le \frac1{\alpha_1}$$

so the tails of q_α are not extremely light. A consequence is that
$$
\mathrm{Var}(\hat\mu_{q_\alpha}) = \frac1n\Bigl(\int \frac{(f(x)p(x))^2}{q_\alpha(x)}\,dx - \mu^2\Bigr)
\le \frac1n\Bigl(\int \frac{(f(x)p(x))^2}{\alpha_1 p(x)}\,dx - \mu^2\Bigr)
= \frac1{n\alpha_1}\bigl(\sigma_p^2 + \alpha_2\mu^2\bigr), \tag{9.23}
$$
where σ²_p/n is the variance of µ̂ under sampling from the nominal distribution.

If Var_p(f(X)) < ∞ then we are assured that σ²_{q_α} < ∞ too. For spiky f it is not unusual to find that µ ≪ σ_p, and so the defensive mixture could inflate the variance by only slightly more than 1/α_1. The bound (9.23) is quite conservative. It arises by putting q(x) = 0. Self-normalized importance sampling with the defensive mixture component achieves an even better bound.

Theorem 9.5. Let X_i iid∼ q_α = α_1 p + α_2 q for 0 < α_1 < 1, α_1 + α_2 = 1, and i = 1, …, n. Let µ̂_{q_α} be the self-normalized importance sampling estimate (9.6) and σ²_{q_α,sn} be the asymptotic variance of µ̂_{q_α}. Then σ²_{q_α,sn} ≤ σ²_p/α_1.

Proof. Hesterberg (1995).

Example 9.7. Suppose that f(x) = ϕ((x − 0.7)/0.05)/0.05 (a Gaussian probability density function) and that p(x) is the U(0, 1) distribution. If q(x) is the Beta(70, 30) distribution, then q(x) is a good approximation to f(x)p(x) in the important region. However q has very light tails and pf/q is unbounded on (0, 1). The defensive mixture q_{0.3} = 0.3q + 0.7p has a bounded likelihood ratio function. See Figure 9.2.
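A small sketch of Example 9.7, assuming that the spike f integrates to roughly 1 over (0, 1). It compares plain importance sampling from the Beta(70, 30) proposal with sampling from the defensive mixture.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 10_000

f = lambda x: stats.norm.pdf(x, loc=0.7, scale=0.05)   # spiky integrand
p = lambda x: np.ones_like(x)                          # U(0, 1) nominal density
q = lambda x: stats.beta.pdf(x, 70, 30)                # light-tailed proposal
lam = 0.3                                              # weight on q in the defensive mixture
q_def = lambda x: lam * q(x) + (1.0 - lam) * p(x)

# plain importance sampling from q: p f / q is unbounded on (0, 1)
x_q = rng.beta(70, 30, n)
est_q = np.mean(f(x_q) * p(x_q) / q(x_q))

# defensive mixture 0.3 q + 0.7 p: likelihood ratio p / q_def is at most 1 / 0.7
pick_q = rng.random(n) < lam
x_d = np.where(pick_q, rng.beta(70, 30, n), rng.uniform(0.0, 1.0, n))
est_d = np.mean(f(x_d) * p(x_d) / q_def(x_d))
```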

It is also worth comparing defensive importance sampling to what we might have attained by importance sampling from q. When q but not p is nearly proportional to fp, then q_α will not be nearly proportional to fp, possibly undoing what would have been a great variance reduction. Once again, the self-normalized version does better.

The self-normalized version of a general mixture is never much worse than the best individual mixture component would have been had we known which one that was. For self-normalized sampling we require q_α(x) > 0 whenever p(x) > 0. For µ̂_{q_j} to be consistent we need q_j(x) > 0 whenever p(x) > 0. In that case, because q_α(x) ≥ α_j q_j(x), the ratio of asymptotic variances is
$$
\frac{\sigma^2_{q_\alpha,\mathrm{sn}}}{\sigma^2_{q_j,\mathrm{sn}}}
= \frac{\int (p(x)/q_\alpha(x))^2 (f(x)-\mu)^2 q_\alpha(x)\,dx}{\int (p(x)/q_j(x))^2 (f(x)-\mu)^2 q_j(x)\,dx}
\le \frac{\alpha_j^{-1}\int (p(x)/q_j(x))^2 (f(x)-\mu)^2 q_j(x)\,dx}{\int (p(x)/q_j(x))^2 (f(x)-\mu)^2 q_j(x)\,dx}
= \frac1{\alpha_j}. \tag{9.24}
$$
For q_j = p we recover Theorem 9.5. For q_j = q we find that the mixture would be at least as good as self-normalized importance sampling with nα_j observations from q. That does not mean it is as good as ordinary importance sampling would have been from q_j ≠ p.


[Figure 9.2 appears here: four panels showing f(x), the sampling densities, the likelihood ratios, and the integrands for defensive importance sampling.]

Figure 9.2: This figure shows defensive importance sampling from Example 9.7. The upper left panel shows f. The upper right panel shows the nominal density p(x) = 1 as well as the importance density q(x) ∝ x⁶⁹(1 − x)²⁹ as solid lines. The defensive mixture with α = (0.2, 0.8) is a dashed line. The bottom panels show likelihood ratios p/q, p/p, p/q_α and integrands fp/q, fp/p, fp/q_α, with the defensive case dashed. The defensive integrand is much flatter than the others.

Using the density function as a control variate provides at least as good a variance reduction as we get from self-normalized importance sampling or ordinary importance sampling or nominal sampling. This method also generalizes to mixtures with more than two components. We know that both ∫p(x) dx = 1 and ∫q(x) dx = 1 because they are densities. As long as these densities are normalized, we can use them as control variates as described in §9.10.


Algorithm 9.1  Mixture IS with component control variates by regression

given X_i iid∼ q_α = ∑_{j=1}^J α_j q_j, p(X_i) and f(X_i), i = 1, …, n
  Y_i ← f(X_i) p(X_i)/q_α(X_i),  i = 1, …, n
  Z_{ij} ← q_j(X_i)/q_α(X_i) − 1,  i = 1, …, n, j = 1, …, J − 1   // dropped J'th
  MLR ← multiple linear regression of Y_i on Z_{ij}
  µ̂_reg ← estimated intercept from MLR
  se ← intercept standard error from MLR
deliver µ̂_reg, se

This algorithm shows how to use linear regression software to implement mixture importance sampling with component control variates. One component has been dropped to cope with collinearity. For additional control variates h_m with ∫h_m(x) dx = θ_m, include Z_{i,J−1+m} = h_m(X_i)/q_α(X_i) − θ_m in the multiple regression.

When we combine mixture sampling with control variates based on the component densities, the estimate of µ is
$$\hat\mu_{\alpha,\beta} = \frac1n\sum_{i=1}^n \frac{f(X_i)p(X_i) - \sum_{j=1}^J \beta_j q_j(X_i)}{\sum_{j=1}^J \alpha_j q_j(X_i)} + \sum_{j=1}^J \beta_j \tag{9.25}$$
where p is the nominal distribution. For defensive purposes we may take one of the q_j to be the same as p, or some other distribution known to yield a good estimator.

We may run into numerical difficulties using these control variates because ∑_{j=1}^J α_j q_j(x)/q_α(x) = 1 by definition of q_α. We can simply drop one of these redundant control variates. Algorithm 9.1 describes that process. It also shows how to include additional control variates with ∫h(x) dx = θ. If E_p(g(X)) = θ then use h(x) = g(x)p(x) there, and if E_q(g(X)) = θ use h(x) = g(x)q(x).

We can compare the variance of mixture importance sampling to that of importance sampling with the individual mixture components q_j. Had we used q_j, the variance of µ̂_{q_j} would be σ²_{q_j}/n where σ²_{q_j} = Var_{q_j}(f(X)p(X)/q_j(X)).

Theorem 9.6. Let µ̂_{α,β} be given by (9.25) for α_j > 0, ∑_{j=1}^J α_j = 1 and X_i iid∼ ∑_{j=1}^J α_j q_j. Let β_opt ∈ R^J be any minimizer over β of Var(µ̂_{α,β}). Then
$$\mathrm{Var}(\hat\mu_{\alpha,\beta_{\mathrm{opt}}}) \le \min_{1\le j\le J}\frac{\sigma^2_{q_j}}{n\alpha_j}. \tag{9.26}$$

Proof. This is from Theorem 2 of Owen and Zhou (2000).

We expect to get nα_j sample values from density q_j. The quantity σ²_{q_j}/(nα_j) is the variance we would obtain from nα_j such values alone. We should not expect a better bound. Indeed, when σ²_{q_j} = ∞ for all but one of the mixture components it is reassuring that those bad components don't make the estimate worse than what we would have had from the one good component.


If any one of the q_j is proportional to fp then Theorem 9.6 shows that we will get a zero variance estimator of µ.

It is common to have multiple functions f_m for m = 1, …, M to average over X. We may want to use the same X_i for all of those functions even though a good importance distribution for f_m might be poor for f_{m′}. Theorem 9.6 shows that we can include a mixture component q_j designed specifically for f_j without too much adverse impact on our estimate of E(f_m(X)) for m ≠ j.

Theorem 9.6 does not compare µ̂_{α,β_opt} to the best self-normalized importance sampler we could have used. When there is a single importance density then using it as a control variate has smaller asymptotic variance than using it in self-normalized importance sampling. Here we get a generalization of that result, as long as one of our component densities is the nominal one.

Theorem 9.7. Let µ̂_{α,β} be given by (9.25) for α_j > 0, ∑_{j=1}^J α_j = 1 and X_i iid∼ ∑_{j=1}^J α_j q_j for densities q_j, one of which is p. Let β_opt ∈ R^J be any minimizer over β of Var(µ̂_{α,β}). Then
$$\mathrm{Var}(\hat\mu_{\alpha,\beta_{\mathrm{opt}}}) \le \min_{1\le j\le J}\frac{\sigma^2_{q_j,\mathrm{sn}}}{n\alpha_j}. \tag{9.27}$$

Proof. Without loss of generality, suppose that q_1 = p. Let β = (µ, 0, …, 0)ᵀ. Then the optimal asymptotic variance is
$$
\sigma^2_{\alpha,\beta_{\mathrm{opt}}} \le \sigma^2_{\alpha,\beta}
= \int\Bigl(\frac{fp - \sum_{j=1}^J\beta_j q_j}{q_\alpha} - \mu + \sum_{j=1}^J\beta_j\Bigr)^2 q_\alpha\,dx
= \int\Bigl(\frac{fp - \mu p}{q_\alpha}\Bigr)^2 q_\alpha\,dx
= \int\frac{p^2(f-\mu)^2}{q_\alpha}\,dx
\le \frac{\sigma^2_{q_j,\mathrm{sn}}}{\alpha_j}
$$
for any j = 1, …, J.

By inspecting the proof of Theorem 9.7 we see that it is not strictly necessary for one of the q_j to be p. It is enough for there to be constants η_j ≥ 0 with p(x) = ∑_{j=1}^J η_j q_j(x). See Exercise 9.26.

9.12 Multiple importance sampling

In multiple importance sampling we take n_j observations from q_j for j = 1, …, J, arriving at a total of n = ∑_{j=1}^J n_j observations. The sample values may be used as if they had been sampled from the mixture with fraction α_j = n_j/n coming from component q_j, and we will consider that case below, but there are many other possibilities. Multiple importance sampling also provides a zero variance importance sampler for use when f(x) takes both positive and negative values. See §9.13 on positivisation.


A partition of unity is a collection of J ≥ 1 weight functions ω_j(x) ≥ 0 which satisfy ∑_{j=1}^J ω_j(x) = 1 for all x. Suppose that X_{ij} ∼ q_j for i = 1, …, n_j and j = 1, …, J and that the ω_j are a partition of unity. The multiple importance sampling estimate is
$$\hat\mu_\omega = \sum_{j=1}^J \frac1{n_j}\sum_{i=1}^{n_j} \omega_j(X_{ij})\frac{f(X_{ij})p(X_{ij})}{q_j(X_{ij})}. \tag{9.28}$$
We saw an estimate of this form in §8.4 on stratified sampling. If we let ω_j(x) = 1 for x ∈ D_j and 0 otherwise then (9.28) becomes independent importance sampling within strata. If we further take q_j to be p restricted to D_j then (9.28) becomes stratified sampling. Multiple importance sampling (9.28) may be thought of as a generalization of stratified sampling in which strata are allowed to overlap and sampling within strata need not be proportional to p.

Now assume that q_j(x) > 0 whenever ω_j(x)p(x)f(x) ≠ 0. Then multiple importance sampling is unbiased, because
$$E(\hat\mu_\omega) = \sum_{j=1}^J E_{q_j}\Bigl(\omega_j(X)\frac{f(X)p(X)}{q_j(X)}\Bigr) = \sum_{j=1}^J \int \omega_j(x)f(x)p(x)\,dx = \mu.$$

Among the proposals for functions ω_j(x), the most studied one is the balance heuristic with ω_j(x) ∝ n_j q_j(x), that is
$$\omega_j(x) = \omega^{\mathrm{BH}}_j(x) \equiv \frac{n_j q_j(x)}{\sum_{k=1}^J n_k q_k(x)}.$$

By construction q_j(x) > 0 holds whenever (ω^{BH}_j p f)(x) ≠ 0. Let n = ∑_{j=1}^J n_j and define α_j = n_j/n. Then using the balance heuristic, µ̂_{ω^{BH}} simplifies to
$$\hat\mu_\alpha = \frac1n \sum_{j=1}^J \sum_{i=1}^{n_j} \frac{f(X_{ij})p(X_{ij})}{\sum_{k=1}^J \alpha_k q_k(X_{ij})}. \tag{9.29}$$
In other words, multiple importance sampling with weights from the balance heuristic reduces to the same estimator we would use in mixture importance sampling with mixture weights α_j = n_j/n. Once again, the weight on a given sampled value X_{ij} does not depend on which mixture component it came from. The balance heuristic is nearly optimal in the following sense:

Theorem 9.8. Let n_j ≥ 1 be positive integers for j = 1, …, J. Let ω_1, …, ω_J be a partition of unity and let ω^{BH} be the balance heuristic. Suppose that q_j(x) > 0 whenever ω_j(x)p(x)f(x) ≠ 0. Then
$$\mathrm{Var}(\hat\mu_{\omega^{\mathrm{BH}}}) \le \mathrm{Var}(\hat\mu_\omega) + \Bigl(\frac1{\min_j n_j} - \frac1{\sum_j n_j}\Bigr)\mu^2.$$

Proof. This is Theorem 1 of Veach and Guibas (1995).
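A sketch of the balance heuristic estimator (9.29), taking a deterministic number of draws from each component; the function names are illustrative.

```python
import numpy as np

def multiple_is_balance(samples, f, p, q_list):
    """Multiple importance sampling with the balance heuristic, i.e. (9.29).
    samples[j] holds the n_j draws from q_list[j]."""
    n_list = [len(xj) for xj in samples]
    n = sum(n_list)
    alpha = np.asarray(n_list, dtype=float) / n
    total = 0.0
    for xj in samples:
        qa = sum(a * q(xj) for a, q in zip(alpha, q_list))   # mixture density at xj
        total += np.sum(f(xj) * p(xj) / qa)
    return total / n
```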


Other heuristics have been used in computer graphics. Among these are the power heuristic ω_j(x) ∝ (n_j q_j(x))^β for β ≥ 1 and the cutoff heuristic ω_j(x) ∝ 1{n_j q_j(x) ≥ α max_ℓ n_ℓ q_ℓ(x)} for α > 0. In each case the ω_j are normalized to sum to 1 to get a partition of unity. The intuition is to put extra weight on the components with the largest n_j q_j values, those that are more locally important (or locally proportional to fp). The maximum heuristic is the cutoff heuristic with α = 1 or the power heuristic as β → ∞. It puts all weight on the component with the largest value of n_j q_j(x), or shares that weight equally when multiple components are tied for the maximum.

These heuristics diminish the problem that when fp is nearly proportional to one of the q_j it might not be nearly proportional to ∑_j n_j q_j. There are versions of Theorem 9.8 for these three heuristics in Veach (1997, Theorem 9.3). The upper bounds are less favorable for them than the one for the balance heuristic. The upper bound in Theorem 9.8 contains a term like µ²/n. The same holds for defensive importance sampling (9.23) if we don't use self-normalization. Using the densities as control variates removes the µ²/n term.

We can also incorporate control variates into multiple importance sampling by the balance heuristic. Choosing the control variates to be mixture components is a derandomization of multiple IS with mixture controls. Because this derandomization reduces variance, Theorem 9.6 applies, so that multiple importance sampling with the balance heuristic and β_opt has variance at most min_{1≤j≤J} σ²_{q_j}/n_j. Furthermore, Theorem 9.4 on consistency of β̂ for β_opt also applies to multiple importance sampling with the balance heuristic. As a consequence, a very effective way to use multiple q_j is to sample a deterministic number n_j of observations from q_j and then run Algorithm 9.1 as if the X_i had been sampled IID from q_α.

9.13 Positivisation

It is a nuisance that a zero variance importance sampler is not available when f takes both positive and negative signs. There is a simple remedy based on multiple importance sampling.

We use a standard decomposition of f into positive and negative parts. Define
$$f_+(x) = \max(f(x), 0), \quad\text{and}\quad f_-(x) = \max(-f(x), 0).$$
Then f(x) = f_+(x) − f_−(x).

Now let q_+ be a density function which is positive whenever pf_+ > 0 and let q_− be a density function which is positive whenever pf_− > 0. We sample X_{i+} ∼ q_+ for i = 1, …, n_+ and X_{i−} ∼ q_− for i = 1, …, n_− (all independently) and define
$$\hat\mu_{\pm,\mathrm{mis}} = \frac1{n_+}\sum_{i=1}^{n_+}\frac{f_+(X_{i+})p(X_{i+})}{q_+(X_{i+})} - \frac1{n_-}\sum_{i=1}^{n_-}\frac{f_-(X_{i-})p(X_{i-})}{q_-(X_{i-})}.$$


While it is necessary to have q_+(x) > 0 at any point where f_+(x)p(x) ≠ 0, it is not an error if q_+(x) > 0 at a point x with f(x) < 0. All that happens is that a value f_+(x) = 0 is averaged into the estimate of µ_+. Exercise 9.27 verifies that µ̂_{±,mis} really is multiple importance sampling as given by (9.28).

This two sample estimator is unbiased, because
$$E(\hat\mu_{\pm,\mathrm{mis}}) = \int f_+(x)p(x)\,dx - \int f_-(x)p(x)\,dx = \mu,$$

using f_+ − f_− = f. Next,
$$\mathrm{Var}(\hat\mu_{\pm,\mathrm{mis}}) = \frac1{n_+}\int\frac{(pf_+ - \mu_+ q_+)^2}{q_+}\,dx + \frac1{n_-}\int\frac{(pf_- - \mu_- q_-)^2}{q_-}\,dx$$
where µ_± = ∫f_±(x)p(x) dx. The variance can be estimated in a straightforward way by summing variance estimates for the two terms.

We get Var(µ̂_{±,mis}) = 0 with n = 2 by taking q_± ∝ f_± p with n_± = 1. As with ordinary importance sampling, we don't expect to attain an estimate with variance 0, but we can use the idea to select good sample densities: q_+ should be roughly proportional to p times the positive part of f, erring on the side of having heavier tails instead of lighter ones, and similarly for q_−.
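A minimal sketch of the two sample estimator and its variance estimate; the samplers for q_+ and q_− are assumed to be supplied by the user.

```python
import numpy as np

def positivised_is(f, p, q_pos, q_neg, draw_pos, draw_neg, n_pos, n_neg, rng):
    """Positivisation: importance sample f_+ and f_- separately and subtract.
    q_pos must be positive wherever p f_+ > 0, and q_neg wherever p f_- > 0."""
    x_pos = draw_pos(rng, n_pos)
    x_neg = draw_neg(rng, n_neg)
    t_pos = np.maximum(f(x_pos), 0.0) * p(x_pos) / q_pos(x_pos)
    t_neg = np.maximum(-f(x_neg), 0.0) * p(x_neg) / q_neg(x_neg)
    est = t_pos.mean() - t_neg.mean()
    var = t_pos.var(ddof=1) / n_pos + t_neg.var(ddof=1) / n_neg
    return est, np.sqrt(var)
```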

We can generalize the positivisation approach by writing
$$f(x) = c + (f(x) - c)_+ - (f(x) - c)_-$$
for any c ∈ R that is convenient and estimating µ by
$$c + \frac1{n_+}\sum_{i=1}^{n_+} w_+(X_{i+})(f(X_{i+}) - c)_+ - \frac1{n_-}\sum_{i=1}^{n_-} w_-(X_{i-})(f(X_{i-}) - c)_-$$
for independent X_{i±} ∼ q_± where q_±(x) ≠ 0 whenever p(x)(f(x) − c)_± ≠ 0, with w_±(x) = p(x)/q_±(x). A good choice for c would be one where f(X) = c has positive probability.

Still more generally, suppose that g(x) is a function for which we know ∫g(x)p(x) dx = θ. Then we can estimate µ by
$$\theta + \frac1{n_+}\sum_{i=1}^{n_+} w_+(X_{i+})(f(X_{i+}) - g(X_{i+}))_+ - \frac1{n_-}\sum_{i=1}^{n_-} w_-(X_{i-})(f(X_{i-}) - g(X_{i-}))_-.$$
This approach can be valuable when P(f(X) = g(X)) is close to one. For instance we might have a closed form expression for θ = ∫g(x)p(x) dx but not for
$$f(x) = \begin{cases} L, & g(x) \le L\\ U, & g(x) \ge U\\ g(x), & \text{otherwise.}\end{cases}$$
The problem context may give qualitative information about when f > g or f < g which we can use to choose q_±. It would be wise to incorporate a defensive mixture component from p into each of q_±.


A very simple form of positivisation is available when f(x) ≥ B > −∞ for some B < 0. In that case we can replace f by f + c for c ≥ −B to get a positive function. Then we subtract c from the importance sampled estimate of E_p(f(X) + c). It may be hard to find a density q that is nearly proportional to f + c. This simple shift is used in some particle transport problems described in §10.5, where it helps bound some estimates away from 0.

9.14 What-if simulations

Although the primary motivation for importance sampling is to reduce variance for very skewed sampling problems, we can also use it to estimate E(f(X)) under multiple alternative distributions for X. This is sometimes called what-if simulation.

Suppose for instance that the probability density function p has a parametric form p(x; θ) for θ ∈ Θ ⊂ R^k. Let X_1, …, X_n ∼ p(·; θ_0) where θ_0 ∈ Θ. Then for some other value θ ∈ Θ we can estimate µ(θ) = ∫f(x)p(x; θ) dx by
$$\hat\mu_{\theta_0}(\theta) = \frac1n\sum_{i=1}^n \frac{p(x_i;\theta)}{p(x_i;\theta_0)}\, f(x_i).$$
The estimation spares us from having to sample from lots of different distributions. Instead we simply reweight the simulated values from one of the simulations. The result is a bit like using common random numbers as in §8.6. Changing the likelihood ratio p(x; θ)/p(x; θ_0) can be much faster than using common random numbers, if the latter has to recompute f for each value of θ.

The tails of p(·; θ_0) should be at least as heavy as any of the p(·; θ) that we're interested in. We don't have to sample values from one of the p(·; θ) distributions. We can instead take X_i ∼ q for a heavier tailed distribution q and then use the estimates
$$\hat\mu_q(\theta) = \frac1n\sum_{i=1}^n \frac{p(x_i;\theta)}{q(x_i)}\, f(x_i).$$

Reweighting the sample is a good strategy for a local analysis in which θ is a small perturbation of θ_0. If p(·; θ_0) is too different from p(·; θ) then we can expect a bad result. For a concrete example, suppose that we sample from the N(θ_0, I) distribution and plan to study the N(θ, I) distribution by reweighting the samples. Using Example 9.2 we find that the population effective sample size of (9.14) is n*_e = n exp(−‖θ − θ_0‖²). If we require n*_e ≥ 0.01n, then we need ‖θ − θ_0‖ ≤ (log 100)^{1/2} ≐ 2.15. More generally, if Σ is nonsingular, p is N(θ_0, Σ) and q is N(θ, Σ), then n*_e ≥ n/100 holds for ((θ − θ_0)ᵀΣ⁻¹(θ − θ_0))^{1/2} ≤ (log 100)^{1/2}.

Example 9.8 (Expected maximum of d Poissons). For x ∈ R^d, let max(x) = max_{1≤j≤d} x_j. The X we consider has components X_j iid∼ Poi(ν) and we will estimate µ(ν) = E(max(X)). If we sample X_{ij} iid∼ Poi(λ) for i = 1, …, n and j = 1, …, d then we can estimate µ(ν) by
$$
\hat\mu(\nu) = \frac1n\sum_{i=1}^n \max(X_i)\times\prod_{j=1}^d \frac{e^{-\nu}\nu^{X_{ij}}/X_{ij}!}{e^{-\lambda}\lambda^{X_{ij}}/X_{ij}!}
= \frac1n\sum_{i=1}^n \max(X_i)\, e^{d(\lambda-\nu)}\Bigl(\frac{\nu}{\lambda}\Bigr)^{\sum_{j=1}^d X_{ij}}.
$$

In this case we can find n*_e, the effective sample size (9.14) in the population. For d = 1 it is n/e^{λ−2ν+ν²/λ} (Exercise 9.31) and for d independent components it is n/(e^{λ−2ν+ν²/λ})^d. Figure 9.3 illustrates this computation for λ = 1 and d = 5 using n = 10,000.

For ν near λ, the estimated effective sample sizes are close to n*_e(ν). For ν much greater than λ we get very small values for n*_e(ν) and unreliable sample values of n̂_e. The estimates µ̂(ν) are monotone increasing in ν. The 99% confidence intervals become very wide for ν > λ = 1 even when the effective sample size is not small. The root of the problem is that the sampling density Poi(λ) has lighter tails than Poi(ν). For large enough ν, a sample from Poi(λ) will have a maximum value smaller than E(max(X); ν) and no reweighting will fix that. In this example a sample from Poi(λ) was effective for all of the ν below λ and for ν only modestly larger than λ.
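A short sketch of the reweighting in Example 9.8: one sample from Poi(λ)^d is reused to estimate µ(ν) and the sample effective sample size over a grid of ν values.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, d, n = 1.0, 5, 10_000
x = rng.poisson(lam, size=(n, d))            # one sample from Poi(lambda)^d
row_max = x.max(axis=1)
row_sum = x.sum(axis=1)

for nu in (0.5, 1.0, 1.5, 2.0, 2.5):
    w = np.exp(d * (lam - nu)) * (nu / lam) ** row_sum   # likelihood ratios
    mu_hat = np.mean(row_max * w)
    n_e = w.sum() ** 2 / np.sum(w ** 2)                  # sample effective sample size
    print(f"nu={nu:3.1f}  mu_hat={mu_hat:6.3f}  n_e={n_e:9.1f}")
```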

We can even use reweighting to get an idea whether another importance sampling distribution might have been better than the one we used. Suppose that we want µ = ∫f(x)p(x) dx. Now we have a parametric family of importance sampling candidates, q(x; θ), for θ ∈ Θ, and p is not necessarily a member of this family. The variance of importance sampling for µ via X_i iid∼ q(x; θ) is
$$\sigma^2_\theta = \int\frac{(p(x)f(x))^2}{q(x;\theta)}\,dx - \mu^2.$$
The second term does not depend on θ and so we prefer θ which makes the first term small. Rewriting the first term as
$$\mathrm{MS}_\theta = \int\frac{(p(x)f(x))^2}{q(x;\theta)q(x;\theta_0)}\,q(x;\theta_0)\,dx,$$

we see that an unbiased estimate of it is
$$\widehat{\mathrm{MS}}_\theta(\theta_0) = \frac1n\sum_{i=1}^n \frac{(p(X_i)f(X_i))^2}{q(X_i;\theta)\,q(X_i;\theta_0)},$$
for X_i iid∼ q(·; θ_0). We might then minimize this estimate numerically over θ to find a new value θ_1 to sample from. There are numerous adaptive importance sampling algorithms that make use of sample values from one distribution to choose a better importance sampler through some number of iterations. See §10.5.
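The sketch below illustrates this search for a Gaussian tilting family; the nominal density, integrand and candidate family here are our own illustrative choices, not ones from the text.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(0)
p = lambda x: stats.norm.pdf(x)              # nominal density N(0, 1)
f = lambda x: (x > 3.0).astype(float)        # a tail probability integrand
q = lambda x, t: stats.norm.pdf(x, loc=t)    # candidate family q(.; theta)

theta0 = 2.0
x = rng.normal(theta0, 1.0, 100_000)         # one sample from q(.; theta0)

def ms_hat(theta):
    # unbiased estimate of the first term of sigma_theta^2 using theta0 samples
    return np.mean((p(x) * f(x)) ** 2 / (q(x, theta) * q(x, theta0)))

res = optimize.minimize_scalar(ms_hat, bounds=(0.0, 6.0), method="bounded")
theta1 = res.x                               # a promising tilt to sample from next
```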


[Figure 9.3 appears here: two panels, "Effective sample sizes" and "Means and 99% C.I.s", both plotted against ν with the value λ marked.]

Figure 9.3: These figures describe the what-if simulation of Example 9.8. The left panel shows the population effective sample size n*_e versus ν as a curve and the sample estimates as points. For ν with population n*_e ≥ n/100 the right panel shows the estimates of E(max(X)) for X_1, …, X_5 iid∼ Poi(ν). The estimates were based on reweighting values from Poi(λ) where λ = 1.

Chapter end notes

Two very early descriptions of importance sampling are Kahn (1950a,b). The famous result that the optimal importance density q(x) is proportional to p(x)|f(x)| appears in Kahn and Marshall (1953). They describe a variational approach and also present the optimal importance sampler for the marginal distribution of X_1 when X = (X_1, X_2) and X_2 is sampled from its nominal conditional distribution given X_1.

Trotter and Tukey (1956) consider whether to use ordinary or self-normalized importance sampling. They lean towards the former and then suggest a hybrid of the methods. Kong (1992) connects effective sample size, via a delta method, to an approximate relative efficiency for self-normalized importance sampling. Hesterberg (1995) has a good discussion of the comparative advantages of ordinary versus self-normalized importance sampling, when both can be implemented. He finds that the ordinary method is superior for rare events but inferior in cases where fluctuations in the likelihood ratio dominate.

Trotter and Tukey (1956) pointed out how observations from one distribution can be reweighted to estimate quantities from many other distributions, providing the basis for what-if simulations. Arsham et al. (1989) use that idea for a sensitivity analysis of some discrete event systems.


Using E_q(p(X)/q(X)) = 1 to construct a hybrid of control variates and the regression estimator appears in Hesterberg (1987, 1995) and also in Arsham et al. (1989).

Hessian approach

Here we describe some refinements of the Hessian approach. If π(β | Y) is skewed we can get a better fit by using a distribution that is not symmetric around β*. One approach is to use a different variance to the left and right of β*. In d dimensions there are potentially 2d different variances to select. The split-t distribution of Geweke (1989) is constructed as follows. First set
$$Z \sim \mathcal{N}(0, I_d), \quad Q \sim \chi^2_{(\nu)}, \quad\text{and}\quad
V_j = \begin{cases}\sigma_{Lj} Z_j, & Z_j \le 0\\ \sigma_{Rj} Z_j, & Z_j > 0,\end{cases}\quad j = 1,\dots,d,$$
with Z independent of Q. Then the split-t random variable is µ + C^{1/2}V/√Q. The parameters σ_{Lj} > 0 and σ_{Rj} > 0 allow different variances to the left and right of the mode µ of V. Geweke (1989) gives the form of the density for split-t variables and some advice on how to select parameters. There is also a split-normal distribution.
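The construction is easy to simulate. The sketch below draws split-t vectors following the recipe as stated (dividing by √Q); the parameter values are purely illustrative.

```python
import numpy as np

def split_t(rng, n, mu, c_half, sigma_l, sigma_r, nu):
    """Draw n split-t vectors: V_j = sigma_Lj * Z_j if Z_j <= 0, else sigma_Rj * Z_j,
    then return mu + C^{1/2} V / sqrt(Q) with Q ~ chi^2(nu)."""
    d = len(mu)
    z = rng.normal(size=(n, d))
    v = np.where(z <= 0.0, sigma_l * z, sigma_r * z)
    q = rng.chisquare(nu, size=(n, 1))
    return mu + (v @ c_half.T) / np.sqrt(q)

rng = np.random.default_rng(0)
draws = split_t(rng, 1000, mu=np.zeros(2), c_half=np.eye(2),
                sigma_l=np.array([1.0, 1.0]), sigma_r=np.array([2.0, 3.0]), nu=5.0)
```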

Mixtures

Mixtures have long been used for importance sampling. In a Bayesian context, if the posterior distribution π_u(·) is multimodal then we might attempt to find all of the modes and approximate π_u by a mixture distribution with one component centered on each of those modes. Oh and Berger (1993) use a mixture of t distributions.

Defensive mixtures are from the dissertation of Hesterberg (1988), and are also in the article Hesterberg (1995).

Multiple importance sampling was proposed by Veach and Guibas (1995) for use with bidirectional path sampling in graphical rendering. Bidirectional path sampling was proposed independently by Lafortune and Willems (1993) and Veach and Guibas (1994). The combination of multiple control variates with importance sampling in §9.10 and §9.11 is taken from Owen and Zhou (2000).

The positivisation approach from §9.13 is based on Owen and Zhou (2000). They also considered a 'partition of identity' trick to smooth out the cusps from (f(x) − g(x))_±.

Calculus of variations

In §9.1 we proved that q*(x) ∝ |f(x)|p(x) is the optimal importance sampling density. The proof is unattractive in that we had to know the answer first.


We want to minimize ∫f(x)²p(x)²/q(x) dx over q subject to the constraint ∫q(x) dx = 1. This is a calculus problem, except that the unknown q is a function, not a vector. The solution may be found using the calculus of variations. A thorough description of the calculus of variations is given by Gelfand and Fomin (2000).

To minimize ∫f(x)²p(x)²/q(x) dx over q subject to ∫q(x) dx = 1, we form the Lagrangian
$$G(q) = \int \frac{f(x)^2 p(x)^2}{q(x)}\,dx + \lambda\Bigl(\int q(x)\,dx - 1\Bigr)$$
and, working formally, set ∂G/∂q(x) = 0 along with ∂G/∂λ = 0.

Setting the partial derivative with respect to q(x) to zero yields
$$-\frac{f(x)^2 p(x)^2}{q(x)^2} + \lambda = 0,$$
which is satisfied by the density q*(x) = √(f(x)²p(x)²/λ), that is, q* ∝ |f(x)|p(x).

Just as in multivariable calculus, setting the first derivative to zero can give a minimum, maximum or saddle point, and it may not take proper account of constraints. To show that q* is a local minimum we might turn to second order conditions (Gelfand and Fomin, 2000, Chapters 5 and 6). In many importance sampling problems we can make a direct proof that the candidate function is a global minimum, typically via the Cauchy-Schwarz inequality.

Effective sample size

The linear combination S_w of (9.12) used to motivate the formula for n_e raises some awkward conceptual issues. It is hard to consider the random variables Z_i = f(X_i) in it as independent given w_i = p(X_i)/q(X_i). Both Z_i and w_i are deterministic functions of the same random X_i. The derivation of equation (9.15) was based on a delta method argument (Kong, 1992). After some simplification the variance of the normalized importance sampler µ̂_q can be close to 1 + cv(w)² times the variance of µ̂_p, the estimate we would use sampling directly from the nominal distribution p. Then instead of getting variance σ²/n we get variance σ²/n_e. By this approximation, self-normalized importance sampling cannot be better than direct sampling from the nominal distribution.

Self-normalized importance sampling can in fact be better than direct sampling. The difficulty arises from one step in the approximation. Kong (1992) points out the exact term whose omission gives rise to the difference. The effective sample size is a convenient way to interpret the amount of inequality in the weights but it does not translate cleanly into a comparison of variances.

The f-specific effective sample sizes are based on an idea from Evans and Swartz (1995). The statistic that they use is cv(w, f)/√n, which may be thought of as a coefficient of variation for µ̂_q.

The difficult choice with diagnostics is that we would like one summary effective sample size for all f, but the actual variance reduction in importance sampling depends on f. For an example with good performance even though n_e is not large see Exercise 9.11.

Exercises

9.1. Suppose that p is the U(0, 1) distribution, q is the U(0, 1/2) distribution, and that f(x) = x². Show that E_q(f(X)p(X)/q(X)) is not equal to E_p(f(X)). Note: when X ∼ q, we will never see X > 1/2.

9.2. Suppose that q(x) = exp(−x) for x ≥ 0 and that f(x) = |x| for all x ∈ R, so that q(x) = 0 at some x where f(x) ≠ 0.

a) Give a density p(x) for which the expectation of f(X)p(X)/q(X) for X ∼ q matches the expectation of f(X) for X ∼ p.

b) Give a density p(x) for which the expectation of f(X)p(X)/q(X) for X ∼ q does not match the expectation of f(X) for X ∼ p.

9.3. Suppose that p is the N(0, 1) distribution, and that f(x) = exp(−(x − 10)²/2). Find the optimal importance sampling density q.

9.4. Suppose that p is the N(0, 1) distribution, and that f is exp(kx) for k ≠ 0. Find the optimal importance sampling density q.

9.5. Let p be the N(0, I) distribution in dimension d ≥ 1.

a) Generalize Exercise 9.3 to the case where f is the density of N(θ, I).

b) Generalize Exercise 9.4 to the case where f(x) = exp(kᵀx) for k ∈ R^d.

9.6. In the importance sampling notation of §9.1, suppose that σ²_p < ∞ and that w(x) = p(x)/q(x) ≤ c holds for all x. Prove that σ²_q ≤ cσ²_p + (c − 1)µ².

9.7. Here we revisit Example 9.1. If the nominal distribution p has X ∼ N(0, 1), and one uses an importance sampling distribution q which is N(0, σ²), then what choice of σ² will minimize the variance of (1/n)∑_{i=1}^n X_i p(X_i)/q(X_i)? By how much does the optimum σ² reduce the variance compared to using σ² = 1?

9.8. For self-normalized importance sampling suppose that w(x) = p(x)/q(x) ≤ c < ∞ for all x with p(x) > 0. Prove that the asymptotic variance satisfies σ²_{q,sn} ≤ cσ²_p.

9.9. Let p(x) = λe^{−λx} for x ≥ 0 and q(x) = ηe^{−ηx}, also for x ≥ 0. The constant λ > 0 is given. For which values of η is the likelihood ratio w(x) bounded?

9.10 (Optimal shift in importance sampling). The variance of an ordinary importance sampling estimate changes if we add a constant to f. Here we find the optimal constant to add. For c ∈ R let f_c(x) = f_0(x) + c. Let the nominal density be p(x) and the importance density be some other density q(x). For this problem we assume that q(x) > 0 whenever p(x) > 0. Let
$$\hat\mu_{q,c} = \frac1n\sum_{i=1}^n \frac{f_c(X_i)p(X_i)}{q(X_i)} - c$$


for X_i iid∼ q be our estimate of µ = ∫f_0(x)p(x) dx. Show that Var(µ̂_{q,c}) is minimized by
$$c = -\frac{\int (w(x)f_0(x) - \mu)(w(x) - 1)\,q(x)\,dx}{\int (w(x) - 1)^2\,q(x)\,dx}.$$
You may assume that Var(µ̂_{q,c}) < ∞ for all c ∈ R.

9.11. A bad effective sample size sometimes happens when importance sampling is actually working well. This exercise is based on an example due to Tim Hesterberg. Let p be N(0, 1), f(x) = 1_{x>τ} for τ > 0 and q be N(τ, 1).

a) Show that n*_e = exp(−τ²)n.

b) Show that
$$\frac{\sigma^2_p}{\sigma^2_q} = \frac{\Phi(\tau)\Phi(-\tau)}{e^{\tau^2}\Phi(-2\tau) - \Phi(-\tau)^2}.$$

c) Evaluate these quantities for τ = 3 and τ = 10.

d) τ = 10 is a pretty extreme case. What value do we get for σ_p/µ and σ_q/µ? As usual, µ = E_p(f(X)).

e) Simulate the cases τ = 3 and τ = 10 for n = 10⁶. Report the Evans and Swartz effective sample sizes (9.17) that you get.

9.12. In the content uniformity problem of §8.6 the method first samples 10 items from p = N(µ, σ²). The test might pass based on those 10 items, or it may be necessary to sample 20 more items to make the decision. Suppose that we importance sample with a distribution q and use the ordinary importance sampling estimate µ̂_q taken as an average of weighted acceptance outcomes, (1/n)∑_{i=1}^n w(X_i)A(X_i) for a weight function w and A(x) which is 1 for accepted lots and 0 otherwise.

a) Will our estimate of the acceptance probability be unbiased if we always use w(X_i) = ∏_{j=1}^{30} p(X_{ij})/q(X_{ij}), even in cases where the lot was accepted based on the first 10 items?

b) Will our estimate of the acceptance probability be unbiased if we use w(X_i) = ∏_{j=1}^{10} p(X_{ij})/q(X_{ij}) when the lot is accepted after the first 10 items and use w(X_i) = ∏_{j=1}^{30} p(X_{ij})/q(X_{ij}) otherwise?

9.13. Here we revisit the PERT problem of §9.4. This time we assume that the duration for task j is χ² with degrees of freedom equal to the value in the last column of Table 9.1. For example, task 1 (planning) takes χ²_{(4)} time and task 10 (final testing) takes χ²_{(2)} time, and so on. Use importance sampling to estimate the probability that the completion time is larger than 50. Document the importance sampling distribution you used and how you arrived at it. It is reasonable to explore a few distributions with small samples before running a larger simulation to get the answer. Use hand tuning, and not automated adaptive importance samplers.


9.14. Let A ⊂ R^d and let A^c = {x ∈ R^d | x ∉ A}. Let p be a probability density function on R^d with 0 < ∫_A p(x) dx < 1. Let q_1 be the asymptotically optimal sampling density for estimating µ = E_p(1{x ∈ A}), using self-normalized importance sampling. Let q_2 be the corresponding optimal density for E_p(1{x ∉ A}). Show that q_1 and q_2 are the same distribution.

9.15. Equation (9.8) gives the delta method approximation to the variance of the ratio estimation version of the importance sampler. The optimal sampling distribution q for this estimate is proportional to p(x)|f(x) − µ|.

a) Prove that equation (9.10) is an equality if q is the optimal self-normalized importance sampling density.

b) Now suppose that f(x) = 1 for x ∈ A and 0 otherwise, so µ = P(X ∈ A) for X ∼ p. What form does the variance from the previous part take? Compare it to µ(1 − µ)/n, paying special attention to extreme cases for µ.

9.16 (Rare events and self-normalized importance sampling). For each ε > 0 let A_ε be an event with P_p(X ∈ A_ε) = ε. Sampling from the nominal distribution p leads to the estimate µ̂_{p,ε} = (1/n)∑_{i=1}^n 1_{X_i∈A_ε}, with nµ̂_{p,ε} ∼ Bin(n, ε). As a result,
$$\frac{\sqrt{\mathrm{Var}(\hat\mu_{p,\varepsilon})}}{\varepsilon} = \sqrt{\frac{1-\varepsilon}{n\varepsilon}} \to \infty, \quad\text{as } \varepsilon\to0. \tag{9.30}$$
Now suppose that for each ε > 0, we are able to find and sample from the optimal self-normalized importance sampling density. Call it q_ε and the resulting estimator µ̂_{q,ε}.

a) Find lim_{ε→0+} √Var(µ̂_{q,ε})/ε.

b) If the limit in part a is finite then the method has bounded relative error. Does the optimal self-normalized importance sampling estimator have bounded relative error?

c) Does the optimal ordinary importance sampling estimator have bounded relative error?

Equation (9.30) shows that sampling from the nominal distribution does not have bounded relative error.

9.17 (Effective sample sizes). For i = 1, …, n, let w_i ≥ 0 with ∑_{i=1}^n w_i strictly positive and n ≥ 1.

a) Show that n_{e,σ} ≤ n_e where n_{e,σ} is given by (9.16).

b) Give an example to show that n_{e,σ} = n_e is possible here. [There are at least two quite different ways to make this happen.]

9.18. For i = 1, …, n, let w_i ≥ 0 with ∑_{i=1}^n w_i strictly positive and n ≥ 1. For k ≥ 1 let n_{e,k} = (∑_i w_i^k)²/∑_i w_i^{2k}. What is lim_{k→∞} n_{e,k}?


9.19. The skewness of a random variable X is γ = E((X − µ)³)/σ³ where µ and σ² are the mean and variance of X. Suppose that the Z_i are independent, each with (finite) mean, variance, and skewness µ, σ² and γ respectively. Let w_1, …, w_n ≥ 0 be nonrandom with ∑_{i=1}^n w_i > 0. Show that the skewness of Y = ∑_{i=1}^n w_i Z_i/∑_{i=1}^n w_i is
$$\gamma\Bigl(\sum_{i=1}^n w_i^3\Bigr)\Big/\Bigl(\sum_{i=1}^n w_i^2\Bigr)^{3/2}.$$

9.20. Sometimes we can compute a population version of effective sample size, without doing any sampling. Let p(x) = ∏_{j=1}^d exp(−x_j/θ_j)/θ_j for x ∈ (0, ∞)^d and θ_j > 0. Let q(x) = ∏_{j=1}^d exp(−x_j/λ_j)/λ_j where λ_j = κ_jθ_j for κ_j > 0. Show that n*_e given by (9.14) satisfies
$$n^*_e = n\times\prod_{j=1}^d \frac{2\kappa_j - 1}{\kappa_j^2}$$
if min_j κ_j > 1/2 and n*_e = 0 otherwise.

9.21. Derive population versions of n_{e,σ} and n_{e,γ} for the setting of Exercise 9.20.

9.22. For the PERT example of §9.4, use importance sampling to estimate P(E_{10} < 3). In view of Exercise 9.20 we should be cautious about importance sampling with small multiples of exponential means. Defensive importance sampling may help. Document the steps you took in finding an importance sampling strategy.

9.23. Let p = ∑_{j=0}^1 ∑_{k=0}^1 α_{jk} p_{jk} be a distribution on R² with
$$p_{00} = \delta_0\times\delta_0, \quad p_{01} = \delta_0\times\mathcal{N}(0,1), \quad p_{10} = \mathcal{N}(0,1)\times\delta_0, \quad\text{and}\quad p_{11} = \mathcal{N}(0, I_2),$$
where δ_0 is the degenerate distribution on R taking the value 0 with probability one. The coefficients satisfy α_{jk} ≥ 0 and ∑_j∑_k α_{jk} = 1. The notation X ∼ f × g means that X = (X_1, X_2) with X_1 ∼ f independently of X_2 ∼ g.

Now let q = ∑_{j=0}^1 ∑_{k=0}^1 β_{jk} q_{jk} where q_{jk} = q_j × q_k with q_0 = δ_0 and q_1 a continuous distribution on R with a strictly positive density, β_{jk} > 0 and ∑_j∑_k β_{jk} = 1.

Let X_i ∼ q independently for i = 1, …, n and suppose that f(x) is a function on R² with E_p(|f(X)|) < ∞. Develop an estimate µ̂_q of µ = E_p(f(X)) using the observed values x_i of X_i ∼ q and prove that E_q(µ̂_q) = µ.

9.24. Here we revisit the boundary crossing problem of Example 9.5 but with different parameters. Let X_i iid∼ N(−1.5, 1). Estimate the probability that the random walk given by X_1 + X_2 + ⋯ + X_m exceeds b = 20 before it goes below a = −7. Include a 99% confidence interval.


9.25. Suppose that we use ordinary importance sampling with the defensive mixture q_α(x) = α_1 p(x) + α_2 q(x) where 0 < α_1 < 1 and α_1 + α_2 = 1. Prove that
$$\mathrm{Var}(\hat\mu_{q_\alpha}) \le \frac{\sigma^2_q}{n\alpha_2} + \frac1n\,\frac{\alpha_1}{\alpha_2}\,\mu^2$$
where µ = E_p(f(X)) and σ²_q is the asymptotic variance when importance sampling from q.

9.26. Prove Theorem 9.7 comparing mixture importance sampling to self-normalized importance sampling, under the weaker condition that there are constants η_j ≥ 0 with p(x) = ∑_{j=1}^J η_j q_j(x).

9.27. Let f(x) take both positive and negative values. Show that µ̂_{±,mis} of §9.13 really is multiple importance sampling (9.28) by exhibiting densities q_1 and q_2, weight functions ω_1 and ω_2, and sample sizes n_1 and n_2 for which (9.28) simplifies to µ̂_{±,mis}.

9.28. The Horvitz-Thompson estimator is a mainstay of sampling methods for finite populations as described in Cochran (1977) and other references. Which of the heuristics in §9.12 is the closest match to the Horvitz-Thompson estimator?

9.29. The multiple importance sampling estimator of µ = E_p(f(X)) is
$$\hat\mu_\omega = \sum_{j=1}^J \frac1{n_j}\sum_{i=1}^{n_j} \omega_j(X_{ij})\frac{f(X_{ij})p(X_{ij})}{q_j(X_{ij})},$$
where ω_1, …, ω_J form a partition of unity, n_j ≥ 1, p and q_j are probability density functions on R^d and X_{ij} ∼ q_j independently. We assume that q_j(x) > 0 whenever (ω_j p f)(x) ≠ 0. Let h(x) be such that ∫h(x) dx = η ∈ R^J is known. Develop an unbiased control variate estimator µ̂_{ω,β} that combines multiple importance sampling with a control variate coefficient of β on h.

9.30. Suppose that X ∼ p and for each x ∈ R^d that f(x) is a complex number. Show that there exists a multiple importance sampling scheme which will evaluate E(f(X)) exactly using 4 sample points. You may assume that E(f(X)) exists.

9.31. Here we find the population effective sample size (9.14) for the what-if simulation of Example 9.8. Let p be the Poi(ν) distribution and q be the Poi(λ) distribution. Show that E_p(w) = exp(λ − 2ν + ν²/λ).

9.32. We are about to do a what-if simulation for the distributions Gam(1), Gam(2), Gam(3) and Gam(4). Each of these distributions plays the role of p in its turn. One of them will be our importance sampling density q. Which one should we choose, and why?


Bibliography

Arsham, H., Feuerverger, A., McLeish, D. L., Kreimer, J., and Rubinstein, R. Y. (1989). Sensitivity analysis and the 'what if' problem in simulation analysis. Mathematical and Computer Modelling, 12(2):193–219.

Chinneck, J. W. (2009). Practical optimization: a gentle introduction. http://www.sce.carleton.ca/faculty/chinneck/po.html.

Cochran, W. G. (1977). Sampling Techniques (3rd Ed). John Wiley & Sons, New York.

Evans, M. J. and Swartz, T. (1995). Methods for approximating integrals in statistics with special emphasis on Bayesian integration problems. Statistical Science, 10(3):254–272.

Gelfand, I. M. and Fomin, S. V. (2000). Calculus of variations. Dover, Mineola, NY.

Geweke, J. (1989). Bayesian inference in econometric models using Monte Carlo integration. Econometrica, 57(6):1317–1339.

Hesterberg, T. C. (1987). Importance sampling in multivariate problems. In Proceedings of the Statistical Computing Section, American Statistical Association 1987 Meeting, pages 412–417.

Hesterberg, T. C. (1988). Advances in importance sampling. PhD thesis, Stanford University.

Hesterberg, T. C. (1995). Weighted average importance sampling and defensive mixture distributions. Technometrics, 37(2):185–192.

Kahn, H. (1950a). Random sampling (Monte Carlo) techniques in neutron attenuation problems, I. Nucleonics, 6(5):27–37.


Kahn, H. (1950b). Random sampling (Monte Carlo) techniques in neutron attenuation problems, II. Nucleonics, 6(6):60–65.

Kahn, H. and Marshall, A. (1953). Methods of reducing sample size in Monte Carlo computations. Journal of the Operations Research Society of America, 1(5):263–278.

Kong, A. (1992). A note on importance sampling using standardized weights. Technical Report 348, University of Chicago.

Lafortune, E. P. and Willems, Y. D. (1993). Bidirectional path tracing. In Proceedings of CompuGraphics, pages 95–104.

Oh, M.-S. and Berger, J. O. (1993). Integration of multimodal functions by Monte Carlo importance sampling. Journal of the American Statistical Association, 88(422):450–456.

Owen, A. B. and Zhou, Y. (2000). Safe and effective importance sampling. Journal of the American Statistical Association, 95(449):135–143.

Siegmund, D. (1976). Importance sampling in the Monte Carlo study of sequential tests. The Annals of Statistics, 4(4):673–684.

Trotter, H. F. and Tukey, J. W. (1956). Conditional Monte Carlo for normal samples. In Symposium on Monte Carlo Methods, pages 64–79, New York. Wiley.

Veach, E. (1997). Robust Monte Carlo methods for light transport simulation. PhD thesis, Stanford University.

Veach, E. and Guibas, L. (1994). Bidirectional estimators for light transport. In 5th Annual Eurographics Workshop on Rendering, pages 147–162.

Veach, E. and Guibas, L. (1995). Optimally combining sampling techniques for Monte Carlo rendering. In SIGGRAPH '95 Conference Proceedings, pages 419–428. Addison-Wesley.
