
MEASURING SAMPLE QUALITY WITH KERNELS

By

Jackson Gorham Stanford University

Lester Mackey

Microsoft Research New England

Technical Report No. 2017-05 June 2017

This research was supported in part by National Science Foundation grant DMS RTG 1501767.

Department of Statistics STANFORD UNIVERSITY

Stanford, California 94305-4065

http://statistics.stanford.edu


Measuring Sample Quality with Kernels

Jackson Gorham 1 Lester Mackey 2

Abstract

Approximate Markov chain Monte Carlo (MCMC) offers the promise of more rapid sampling at the cost of more biased inference. Since standard MCMC diagnostics fail to detect these biases, researchers have developed computable Stein discrepancy measures that provably determine the convergence of a sample to its target distribution. This approach was recently combined with the theory of reproducing kernels to define a closed-form kernel Stein discrepancy (KSD) computable by summing kernel evaluations across pairs of sample points. We develop a theory of weak convergence for KSDs based on Stein's method, demonstrate that commonly used KSDs fail to detect non-convergence even for Gaussian targets, and show that kernels with slowly decaying tails provably determine convergence for a large class of target distributions. The resulting convergence-determining KSDs are suitable for comparing biased, exact, and deterministic sample sequences and simpler to compute and parallelize than alternative Stein discrepancies. We use our tools to compare biased samplers, select sampler hyperparameters, and improve upon existing KSD approaches to one-sample hypothesis testing and sample quality improvement.

1. Introduction

When Bayesian inference and maximum likelihood estimation (Geyer, 1991) demand the evaluation of intractable expectations $\mathbb{E}_P[h(Z)] = \int p(x)h(x)\,dx$ under a target distribution $P$, Markov chain Monte Carlo (MCMC) methods (Brooks et al., 2011) are often employed to approximate these integrals with asymptotically correct sample averages $\mathbb{E}_{Q_n}[h(X)] = \frac{1}{n}\sum_{i=1}^n h(x_i)$. However, many exact MCMC methods are computationally expensive, and recent years have seen the introduction of biased MCMC procedures (see, e.g., Welling & Teh, 2011; Ahn et al., 2012; Korattikara et al., 2014) that exchange asymptotic correctness for increased sampling speed.

¹Stanford University, Palo Alto, CA USA. ²Microsoft Research New England, Cambridge, MA USA. Correspondence to: Jackson Gorham <[email protected]>, Lester Mackey <[email protected]>.

Since standard MCMC diagnostics, like mean and trace plots, pooled and within-chain variance measures, effective sample size, and asymptotic variance (Brooks et al., 2011), do not account for asymptotic bias, Gorham & Mackey (2015) defined a new family of sample quality measures – the Stein discrepancies – that measure how well $\mathbb{E}_{Q_n}$ approximates $\mathbb{E}_P$ while avoiding explicit integration under $P$. Gorham & Mackey (2015); Mackey & Gorham (2016); Gorham et al. (2016) further showed that specific members of this family – the graph Stein discrepancies – were (a) efficiently computable by solving a linear program and (b) convergence-determining for large classes of targets $P$. Building on the zero mean reproducing kernel theory of Oates et al. (2016b), Chwialkowski et al. (2016) and Liu et al. (2016) later showed that other members of the Stein discrepancy family had a closed-form solution involving the sum of kernel evaluations over pairs of sample points.

This closed form represents a significant practical advantage, as no linear program solvers are necessary, and the computation of the discrepancy can be easily parallelized. However, as we will see in Section 3.2, not all kernel Stein discrepancies are suitable for our setting. In particular, in dimension $d \ge 3$, the kernel Stein discrepancies previously recommended in the literature fail to detect when a sample is not converging to the target. To address this shortcoming, we develop a theory of weak convergence for the kernel Stein discrepancies analogous to that of (Gorham & Mackey, 2015; Mackey & Gorham, 2016; Gorham et al., 2016) and design a class of kernel Stein discrepancies that provably control weak convergence for a large class of target distributions.

After formally describing our goals for measuring sample quality in Section 2, we outline our strategy, based on Stein's method, for constructing and analyzing practical quality measures at the start of Section 3. In Section 3.1, we define our family of closed-form quality measures – the kernel Stein discrepancies (KSDs) – and establish several appealing practical properties of these measures. We analyze the convergence properties of KSDs in Sections 3.2 and 3.3, showing that previously proposed KSDs fail to detect non-convergence and proposing practical convergence-determining alternatives. Section 4 illustrates the value of convergence-determining kernel Stein discrepancies in a variety of applications, including hyperparameter selection, sampler selection, one-sample hypothesis testing, and sample quality improvement. Finally, in Section 5, we conclude with a discussion of related and future work.

Notation. We will use $\mu$ to denote a generic probability measure and $\Rightarrow$ to denote the weak convergence of a sequence of probability measures. We will use $\|\cdot\|_r$ for $r \in [1, \infty]$ to represent the $\ell_r$ norm on $\mathbb{R}^d$ and occasionally refer to a generic norm $\|\cdot\|$ with associated dual norm $\|a\|_* \triangleq \sup_{b \in \mathbb{R}^d, \|b\| = 1} \langle a, b \rangle$ for vectors $a \in \mathbb{R}^d$. We let $e_j$ be the $j$-th standard basis vector. For any function $g : \mathbb{R}^d \to \mathbb{R}^{d'}$, we define $M_0(g) \triangleq \sup_{x \in \mathbb{R}^d} \|g(x)\|_2$, $M_1(g) \triangleq \sup_{x \ne y} \|g(x) - g(y)\|_2 / \|x - y\|_2$, and $\nabla g$ as the gradient with components $(\nabla g(x))_{jk} \triangleq \nabla_{x_k} g_j(x)$. We further let $g \in C^m$ indicate that $g$ is $m$ times continuously differentiable and $g \in C_0^m$ indicate that $g \in C^m$ and $\nabla^l g$ is vanishing at infinity for all $l \in \{0, \dots, m\}$. We define $C^{(m,m)}$ (respectively, $C_b^{(m,m)}$ and $C_0^{(m,m)}$) to be the set of functions $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ with $(x, y) \mapsto \nabla_x^l \nabla_y^l k(x, y)$ continuous (respectively, continuous and uniformly bounded, continuous and vanishing at infinity) for all $l \in \{0, \dots, m\}$.

2. Quality measures for samples

Consider a target distribution $P$ with continuously differentiable (Lebesgue) density $p$ supported on all of $\mathbb{R}^d$. We assume that the score function $b \triangleq \nabla \log p$ can be evaluated¹ but that, for most functions of interest, direct integration under $P$ is infeasible. We will therefore approximate integration under $P$ using a weighted sample $Q_n = \sum_{i=1}^n q_n(x_i)\delta_{x_i}$ with sample points $x_1, \dots, x_n \in \mathbb{R}^d$ and $q_n$ a probability mass function. We will make no assumptions about the origins of the sample points; they may be the output of a Markov chain or even deterministically generated.

Each $Q_n$ offers an approximation $\mathbb{E}_{Q_n}[h(X)] = \sum_{i=1}^n q_n(x_i)h(x_i)$ for each intractable expectation $\mathbb{E}_P[h(Z)]$, and our aim is to effectively compare the quality of the approximation offered by any two samples targeting $P$. In particular, we wish to produce a quality measure that (i) identifies when a sequence of samples is converging to the target, (ii) determines when a sequence of samples is not converging to the target, and (iii) is efficiently computable. Since our interest is in approximating expectations, we will consider discrepancies quantifying the maximum expectation error over a class of test functions $\mathcal{H}$:

$$d_{\mathcal{H}}(Q_n, P) \triangleq \sup_{h \in \mathcal{H}} |\mathbb{E}_P[h(Z)] - \mathbb{E}_{Q_n}[h(X)]|. \quad (1)$$

¹No knowledge of the normalizing constant is needed.

When $\mathcal{H}$ is large enough, for any sequence of probability measures $(\mu_m)_{m \ge 1}$, $d_{\mathcal{H}}(\mu_m, P) \to 0$ only if $\mu_m \Rightarrow P$. In this case, we call (1) an integral probability metric (IPM) (Müller, 1997). For example, when $\mathcal{H} = \mathrm{BL}_{\|\cdot\|_2} \triangleq \{h : \mathbb{R}^d \to \mathbb{R} \mid M_0(h) + M_1(h) \le 1\}$, the IPM $d_{\mathrm{BL}_{\|\cdot\|_2}}$ is called the bounded Lipschitz or Dudley metric and exactly metrizes convergence in distribution. Alternatively, when $\mathcal{H} = \mathcal{W}_{\|\cdot\|_2} \triangleq \{h : \mathbb{R}^d \to \mathbb{R} \mid M_1(h) \le 1\}$ is the set of 1-Lipschitz functions, the IPM $d_{\mathcal{W}_{\|\cdot\|_2}}$ in (1) is known as the Wasserstein metric.

An apparent practical problem with using the IPM $d_{\mathcal{H}}$ as a sample quality measure is that $\mathbb{E}_P[h(Z)]$ may not be computable for $h \in \mathcal{H}$. However, if $\mathcal{H}$ were chosen such that $\mathbb{E}_P[h(Z)] = 0$ for all $h \in \mathcal{H}$, then no explicit integration under $P$ would be necessary. To generate such a class of test functions and to show that the resulting IPM still satisfies our desiderata, we follow the lead of Gorham & Mackey (2015) and consider Charles Stein's method for characterizing distributional convergence.

3. Stein's method with kernels

Stein's method (Stein, 1972) provides a three-step recipe for assessing convergence in distribution:

1. Identify a Stein operator $\mathcal{T}$ that maps functions $g : \mathbb{R}^d \to \mathbb{R}^d$ from a domain $\mathcal{G}$ to real-valued functions $\mathcal{T}g$ such that

   $$\mathbb{E}_P[(\mathcal{T}g)(Z)] = 0 \quad \text{for all } g \in \mathcal{G}.$$

   For any such Stein operator and Stein set $\mathcal{G}$, Gorham & Mackey (2015) defined the Stein discrepancy as

   $$\mathcal{S}(\mu, \mathcal{T}, \mathcal{G}) \triangleq \sup_{g \in \mathcal{G}} |\mathbb{E}_\mu[(\mathcal{T}g)(X)]| = d_{\mathcal{T}\mathcal{G}}(\mu, P), \quad (2)$$

   which, crucially, avoids explicit integration under $P$.

2. Lower bound the Stein discrepancy by an IPM $d_{\mathcal{H}}$ known to dominate weak convergence. This can be done once for a broad class of target distributions to ensure that $\mu_m \Rightarrow P$ whenever $\mathcal{S}(\mu_m, \mathcal{T}, \mathcal{G}) \to 0$ for a sequence of probability measures $(\mu_m)_{m \ge 1}$ (Desideratum (ii)).

3. Provide an upper bound on the Stein discrepancy ensuring that $\mathcal{S}(\mu_m, \mathcal{T}, \mathcal{G}) \to 0$ under suitable convergence of $\mu_m$ to $P$ (Desideratum (i)).


While Stein's method is principally used as a mathematical tool to prove convergence in distribution, we seek, in the spirit of (Gorham & Mackey, 2015; Gorham et al., 2016), to harness the Stein discrepancy as a practical tool for measuring sample quality. The subsections to follow develop a specific, practical instantiation of the abstract Stein's method recipe based on reproducing kernel Hilbert spaces. An empirical analysis of the Stein discrepancies recommended by our theory follows in Section 4.

3.1. Selecting a Stein operator and a Stein set

A standard, widely applicable univariate Stein operator is the density method operator (see Stein et al., 2004; Chatterjee & Shao, 2011; Chen et al., 2011; Ley et al., 2017),

$$(\mathcal{T}g)(x) \triangleq \frac{1}{p(x)} \frac{d}{dx}\big(p(x)g(x)\big) = g(x)b(x) + g'(x).$$
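For example, when $P = \mathcal{N}(0, 1)$ we have $b(x) = -x$, so the operator reduces to $(\mathcal{T}g)(x) = g'(x) - x g(x)$, and the requirement $\mathbb{E}_P[(\mathcal{T}g)(Z)] = 0$ recovers the classical Stein identity $\mathbb{E}[g'(Z)] = \mathbb{E}[Z g(Z)]$ characterizing the standard normal distribution.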

Inspired by the generator method of Barbour (1988; 1990) and Götze (1991), Gorham & Mackey (2015) generalized this operator to multiple dimensions. The resulting Langevin Stein operator

$$(\mathcal{T}_P g)(x) \triangleq \frac{1}{p(x)} \langle \nabla, p(x)g(x) \rangle = \langle g(x), b(x) \rangle + \langle \nabla, g(x) \rangle$$

for functions $g : \mathbb{R}^d \to \mathbb{R}^d$ was independently developed, without connection to Stein's method, by Oates et al. (2016b) for the design of Monte Carlo control functionals. Notably, the Langevin Stein operator depends on $P$ only through its score function $b = \nabla \log p$ and hence is computable even when the normalizing constant of $p$ is not. While our work is compatible with other practical Stein operators, like the family of diffusion Stein operators defined in (Gorham et al., 2016), we will focus on the Langevin operator for the sake of brevity.

Hereafter, we will let $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be the reproducing kernel of a reproducing kernel Hilbert space (RKHS) $\mathcal{K}_k$ of functions from $\mathbb{R}^d$ to $\mathbb{R}$. That is, $\mathcal{K}_k$ is a Hilbert space of functions such that, for all $x \in \mathbb{R}^d$, $k(x, \cdot) \in \mathcal{K}_k$ and $f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{K}_k}$ whenever $f \in \mathcal{K}_k$. We let $\|\cdot\|_{\mathcal{K}_k}$ be the norm induced from the inner product on $\mathcal{K}_k$.

With this definition, we define our kernel Stein set $\mathcal{G}_{k,\|\cdot\|}$ as the set of vector-valued functions $g = (g_1, \dots, g_d)$ such that each component function $g_j$ belongs to $\mathcal{K}_k$ and the vector of their norms $\|g_j\|_{\mathcal{K}_k}$ belongs to the $\|\cdot\|_*$ unit ball:²

$$\mathcal{G}_{k,\|\cdot\|} \triangleq \{g = (g_1, \dots, g_d) \mid \|v\|_* \le 1 \text{ for } v_j \triangleq \|g_j\|_{\mathcal{K}_k}\}.$$

The following result, proved in Section B, establishes that this is an acceptable domain for $\mathcal{T}_P$.

Proposition 1 (Zero mean test functions). If $k \in C_b^{(1,1)}$ and $\mathbb{E}_P[\|\nabla \log p(Z)\|_2] < \infty$, then $\mathbb{E}_P[(\mathcal{T}_P g)(Z)] = 0$ for all $g \in \mathcal{G}_{k,\|\cdot\|}$.

²Our analyses and algorithms support each $g_j$ belonging to a different RKHS $\mathcal{K}_{k_j}$, but we will not need that flexibility here.
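As a quick numerical illustration of the zero-mean property in Proposition 1 (our sketch, not part of the report), the snippet below applies the Langevin Stein operator to a hypothetical smooth, bounded test function $g(x) = \sin(x)$ (coordinatewise) for a standard Gaussian target and confirms that the sample average of $(\mathcal{T}_P g)(Z)$ is near zero; the identity holds for any sufficiently smooth, integrable $g$, not only for members of $\mathcal{G}_{k,\|\cdot\|}$.

```julia
using LinearAlgebra, Statistics

# b(x) = ∇ log p(x) = -x for the standard Gaussian target P = N(0, I_d).
score(x) = -x
# A hypothetical smooth, bounded test function g: R^d -> R^d and its divergence.
g(x) = sin.(x)
divg(x) = sum(cos.(x))
# Langevin Stein operator: (T_P g)(x) = <g(x), b(x)> + <∇, g(x)>.
stein_op(x) = dot(g(x), score(x)) + divg(x)

zs = [randn(3) for _ in 1:10^6]      # i.i.d. draws from P with d = 3
println(mean(stein_op.(zs)))         # ≈ 0, illustrating E_P[(T_P g)(Z)] = 0
```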

The Langevin Stein operator and kernel Stein set together define our quality measure of interest, the kernel Stein discrepancy (KSD) $\mathcal{S}(\mu, \mathcal{T}_P, \mathcal{G}_{k,\|\cdot\|})$. When $\|\cdot\| = \|\cdot\|_2$, this definition recovers the KSD proposed by Chwialkowski et al. (2016) and Liu et al. (2016). Our next result shows that, for any $\|\cdot\|$, the KSD admits a closed-form solution.

Proposition 2 (KSD closed form). Suppose $k \in C^{(1,1)}$, and, for each $j \in \{1, \dots, d\}$, define the Stein kernel

$$k_0^j(x, y) \triangleq \frac{1}{p(x)p(y)} \nabla_{x_j} \nabla_{y_j} \big(p(x)k(x, y)p(y)\big) \quad (3)$$
$$= b_j(x)b_j(y)k(x, y) + b_j(x)\nabla_{y_j}k(x, y) + b_j(y)\nabla_{x_j}k(x, y) + \nabla_{x_j}\nabla_{y_j}k(x, y).$$

If $\sum_{j=1}^d \mathbb{E}_\mu[k_0^j(X, X)^{1/2}] < \infty$, then $\mathcal{S}(\mu, \mathcal{T}_P, \mathcal{G}_{k,\|\cdot\|}) = \|w\|$, where $w_j \triangleq \sqrt{\mathbb{E}_{\mu \times \mu}[k_0^j(X, \tilde{X})]}$ with $X, \tilde{X} \overset{iid}{\sim} \mu$.

The proof is found in Section C. Notably, when $\mu$ is the discrete measure $Q_n = \sum_{i=1}^n q_n(x_i)\delta_{x_i}$, the KSD reduces to evaluating each $k_0^j$ at pairs of support points as $w_j = \sqrt{\sum_{i,i'=1}^n q_n(x_i)\, k_0^j(x_i, x_{i'})\, q_n(x_{i'})}$, a computation which is easily parallelized over sample pairs and coordinates $j$.
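To make the closed form concrete, the following is a minimal sketch (ours, not the released SteinDiscrepancy.jl code) of this computation for the $\ell_2$ kernel Stein set, assuming the IMQ base kernel $k(x, y) = (c^2 + \|x - y\|_2^2)^\beta$ with the defaults $\beta = -1/2$, $c = 1$ used later in the paper and a user-supplied score function; the helper names `imq_stein_kernel` and `imq_ksd` are ours.

```julia
using LinearAlgebra

# Stein kernel k_0(x, y) = Σ_j k_0^j(x, y) for the IMQ base kernel
# k(x, y) = (c^2 + ||x - y||_2^2)^beta, following equation (3).
function imq_stein_kernel(x, y, bx, by; c=1.0, beta=-0.5)
    d = length(x)
    z = x - y
    r2 = dot(z, z)
    base = c^2 + r2
    k = base^beta
    gradx_k = (2 * beta * base^(beta - 1)) .* z      # ∇_x k(x, y)
    grady_k = -gradx_k                               # ∇_y k(x, y)
    # Σ_j ∇_{x_j} ∇_{y_j} k(x, y) for the IMQ kernel
    trace_k = -2 * beta * base^(beta - 2) * (2 * (beta - 1) * r2 + d * base)
    return dot(bx, by) * k + dot(bx, grady_k) + dot(by, gradx_k) + trace_k
end

# l2 KSD of a weighted sample: S(Q_n, T_P, G_k)^2 = Σ_{i,i'} q_i q_{i'} k_0(x_i, x_{i'}).
function imq_ksd(points, weights, score)
    b = [score(x) for x in points]
    s2 = 0.0
    for i in eachindex(points), j in eachindex(points)
        s2 += weights[i] * weights[j] *
              imq_stein_kernel(points[i], points[j], b[i], b[j])
    end
    return sqrt(s2)
end

# Usage: equally weighted i.i.d. draws from the Gaussian target P = N(0, I_2).
xs = [randn(2) for _ in 1:200]
println(imq_ksd(xs, fill(1/200, 200), x -> -x))
```

The double loop makes the $O(n^2 d)$ cost explicit; in practice the pairwise Stein kernel evaluations can be computed in parallel blocks, as noted above.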

Our Stein set choice was motivated by the work of Oates et al. (2016b), who used the sum of Stein kernels $k_0 = \sum_{j=1}^d k_0^j$ to develop nonparametric control variates. Each term $w_j$ in Proposition 2 can also be viewed as an instance of the maximum mean discrepancy (MMD) (Gretton et al., 2012) between $\mu$ and $P$ measured with respect to the Stein kernel $k_0^j$. In standard uses of MMD, an arbitrary kernel function is selected, and one must be able to compute expectations of the kernel function under $P$. Here, this requirement is satisfied automatically, since our induced kernels are chosen to have mean zero under $P$.

For clarity we will focus on the specific kernel Stein set choice $\mathcal{G}_k \triangleq \mathcal{G}_{k,\|\cdot\|_2}$ for the remainder of the paper, but our results extend directly to KSDs based on any $\|\cdot\|$, since all KSDs are equivalent in a strong sense:

Proposition 3 (Kernel Stein set equivalence). Under the assumptions of Proposition 2, there are constants $c_d, c_d' > 0$ depending only on $d$ and $\|\cdot\|$ such that $c_d\, \mathcal{S}(\mu, \mathcal{T}_P, \mathcal{G}_{k,\|\cdot\|}) \le \mathcal{S}(\mu, \mathcal{T}_P, \mathcal{G}_{k,\|\cdot\|_2}) \le c_d'\, \mathcal{S}(\mu, \mathcal{T}_P, \mathcal{G}_{k,\|\cdot\|})$.

The short proof is found in Section D.

3.2. Lower bounding the kernel Stein discrepancy

We next aim to establish conditions under which the KSD $\mathcal{S}(\mu_m, \mathcal{T}_P, \mathcal{G}_k) \to 0$ only if $\mu_m \Rightarrow P$ (Desideratum (ii)). Recently, Gorham et al. (2016) showed that the Langevin graph Stein discrepancy dominates convergence in distribution whenever $P$ belongs to the class $\mathcal{P}$ of distantly dissipative distributions with Lipschitz score function $b$:


Definition 4 (Distant dissipativity (Eberle, 2015; Gorham et al., 2016)). A distribution $P$ is distantly dissipative if $\kappa_0 \triangleq \liminf_{r \to \infty} \kappa(r) > 0$ for

$$\kappa(r) = \inf\Big\{ -2\, \frac{\langle b(x) - b(y), x - y \rangle}{\|x - y\|_2^2} : \|x - y\|_2 = r \Big\}. \quad (4)$$
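For instance, when $P = \mathcal{N}(0, I_d)$ the score function is $b(x) = -x$, so $\langle b(x) - b(y), x - y \rangle = -\|x - y\|_2^2$ and $\kappa(r) = 2$ for every $r > 0$, giving $\kappa_0 = 2 > 0$.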

Examples of distributions in $\mathcal{P}$ include finite Gaussian mixtures with common covariance and all distributions strongly log-concave outside of a compact set, including Bayesian linear, logistic, and Huber regression posteriors with Gaussian priors (see Gorham et al., 2016, Section 4). Moreover, when $d = 1$, membership in $\mathcal{P}$ is sufficient to provide a lower bound on the KSD for most common kernels, including the Gaussian, Matérn, and inverse multiquadric kernels.

Theorem 5 (Univariate KSD detects non-convergence). Suppose that $P \in \mathcal{P}$ and $k(x, y) = \Phi(x - y)$ for $\Phi \in C^2$ with a non-vanishing generalized Fourier transform. If $d = 1$, then $\mathcal{S}(\mu_m, \mathcal{T}_P, \mathcal{G}_k) \to 0$ only if $\mu_m \Rightarrow P$.

The proof in Section E provides a lower bound on the KSD in terms of an IPM known to dominate weak convergence. However, our next theorem shows that in higher dimensions $\mathcal{S}(Q_n, \mathcal{T}_P, \mathcal{G}_k)$ can converge to 0 without the sequence $(Q_n)_{n \ge 1}$ converging to any probability measure. This deficiency occurs even when the target is Gaussian.

Theorem 6 (KSD fails with light kernel tails). Suppose $k \in C_b^{(1,1)}$ and define the kernel decay rate

$$\gamma(r) \triangleq \sup\{\max(|k(x, y)|,\, \|\nabla_x k(x, y)\|_2,\, |\langle \nabla_x, \nabla_y k(x, y) \rangle|) : \|x - y\|_2 \ge r\}.$$

If $d \ge 3$, $P = \mathcal{N}(0, I_d)$, and $\gamma(r) = o(r^{-\alpha})$ for $\alpha \triangleq (\frac{1}{2} - \frac{1}{d})^{-1}$, then $\mathcal{S}(Q_n, \mathcal{T}_P, \mathcal{G}_k) \to 0$ does not imply $Q_n \Rightarrow P$.

Theorem 6 implies that KSDs based on the commonly used Gaussian kernel, Matérn kernel, and compactly supported kernels of Wendland (2004, Theorem 9.13) all fail to detect non-convergence when $d \ge 3$. In addition, KSDs based on the inverse multiquadric kernel $k(x, y) = (c^2 + \|x - y\|_2^2)^\beta$ for $\beta < -1$ fail to detect non-convergence for any $d > 2\beta/(\beta + 1)$. The proof in Section F shows that the violating sample sequences $(Q_n)_{n \ge 1}$ are simple to construct, and we provide an empirical demonstration of this failure to detect non-convergence in Section 4.
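For instance, the IMQ exponent $\beta = -2$ gives $2\beta/(\beta + 1) = 4$, so the corresponding KSD fails to detect non-convergence in every dimension $d > 4$.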

The failure of the KSDs in Theorem 6 can be traced to their inability to enforce uniform tightness. A sequence of probability measures $(\mu_m)_{m \ge 1}$ is uniformly tight if for every $\epsilon > 0$, there is a finite number $R(\epsilon)$ such that $\limsup_m \mu_m(\|X\|_2 > R(\epsilon)) \le \epsilon$. Uniform tightness implies that no mass in the sequence of probability measures escapes to infinity. When the kernel $k$ decays more rapidly than the score function grows, the KSD ignores excess mass in the tails and hence can be driven to zero by a non-tight sequence of increasingly diffuse probability measures. The following theorem demonstrates uniform tightness is the missing piece to ensure weak convergence.

Theorem 7 (KSD detects tight non-convergence). Suppose that $P \in \mathcal{P}$ and $k(x, y) = \Phi(x - y)$ for $\Phi \in C^2$ with a non-vanishing generalized Fourier transform. If $(\mu_m)_{m \ge 1}$ is uniformly tight, then $\mathcal{S}(\mu_m, \mathcal{T}_P, \mathcal{G}_k) \to 0$ only if $\mu_m \Rightarrow P$.

Our proof in Section G explicitly lower bounds the KSD $\mathcal{S}(\mu, \mathcal{T}_P, \mathcal{G}_k)$ in terms of the bounded Lipschitz metric $d_{\mathrm{BL}_{\|\cdot\|}}(\mu, P)$, which exactly metrizes weak convergence.

Ideally, when a sequence of probability measures is not uniformly tight, the KSD would reflect this divergence in its reported value. To achieve this, we consider the inverse multiquadric (IMQ) kernel $k(x, y) = (c^2 + \|x - y\|_2^2)^\beta$ for some $\beta < 0$ and $c > 0$. While KSDs based on IMQ kernels fail to determine convergence when $\beta < -1$ (by Theorem 6), our next theorem shows that they automatically enforce tightness and detect non-convergence whenever $\beta \in (-1, 0)$.

Theorem 8 (IMQ KSD detects non-convergence). Suppose $P \in \mathcal{P}$ and $k(x, y) = (c^2 + \|x - y\|_2^2)^\beta$ for $c > 0$ and $\beta \in (-1, 0)$. If $\mathcal{S}(\mu_m, \mathcal{T}_P, \mathcal{G}_k) \to 0$, then $\mu_m \Rightarrow P$.

The proof in Section H provides a lower bound on the KSD in terms of the bounded Lipschitz metric $d_{\mathrm{BL}_{\|\cdot\|}}(\mu, P)$. The success of the IMQ kernel over other common characteristic kernels can be attributed to its slow decay rate. When $P \in \mathcal{P}$ and the IMQ exponent $\beta > -1$, the function class $\mathcal{T}_P \mathcal{G}_k$ contains unbounded (coercive) functions. These functions ensure that the IMQ KSD $\mathcal{S}(\mu_m, \mathcal{T}_P, \mathcal{G}_k)$ goes to 0 only if $(\mu_m)_{m \ge 1}$ is uniformly tight.

3.3. Upper bounding the kernel Stein discrepancy

The usual goal in upper bounding the Stein discrepancy is to provide a rate of convergence to $P$ for particular approximating sequences $(\mu_m)_{m=1}^\infty$. Because we aim to directly compute the KSD for arbitrary samples $Q_n$, our chief purpose in this section is to ensure that the KSD $\mathcal{S}(\mu_m, \mathcal{T}_P, \mathcal{G}_k)$ will converge to zero when $\mu_m$ is converging to $P$ (Desideratum (i)).

Proposition 9 (KSD detects convergence). If $k \in C_b^{(2,2)}$ and $\nabla \log p$ is Lipschitz with $\mathbb{E}_P[\|\nabla \log p(Z)\|_2^2] < \infty$, then $\mathcal{S}(\mu_m, \mathcal{T}_P, \mathcal{G}_k) \to 0$ whenever the Wasserstein distance $d_{\mathcal{W}_{\|\cdot\|_2}}(\mu_m, P) \to 0$.

Proposition 9 applies to common kernels like the Gaussian, Matérn, and IMQ kernels, and its proof in Section I provides an explicit upper bound on the KSD in terms of the Wasserstein distance $d_{\mathcal{W}_{\|\cdot\|_2}}$. When $Q_n = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$ for $x_i \overset{iid}{\sim} \mu$, (Liu et al., 2016, Thm. 4.1) further implies that $\mathcal{S}(Q_n, \mathcal{T}_P, \mathcal{G}_k) \Rightarrow \mathcal{S}(\mu, \mathcal{T}_P, \mathcal{G}_k)$ at an $O(n^{-1/2})$ rate under continuity and integrability assumptions on $\mu$.


4. Experiments

We next conduct an empirical evaluation of the KSD quality measures recommended by our theory, recording all timings on an Intel Xeon CPU E5-2650 v2 @ 2.60GHz. Throughout, we will refer to the KSD with IMQ base kernel $k(x, y) = (c^2 + \|x - y\|_2^2)^\beta$, exponent $\beta = -\frac{1}{2}$, and $c = 1$ as the IMQ KSD. Code reproducing all experiments can be found on the Julia (Bezanson et al., 2014) package site https://jgorham.github.io/SteinDiscrepancy.jl/.

4.1. Comparing discrepancies

Our first, simple experiment is designed to illustrate several properties of the IMQ KSD and to compare its behavior with that of two preexisting discrepancy measures, the Wasserstein distance $d_{\mathcal{W}_{\|\cdot\|_2}}$, which can be computed for simple univariate targets (Vallender, 1974), and the spanner graph Stein discrepancy of Gorham & Mackey (2015). We adopt a bimodal Gaussian mixture with $p(x) \propto e^{-\frac{1}{2}\|x + \Delta e_1\|_2^2} + e^{-\frac{1}{2}\|x - \Delta e_1\|_2^2}$ and $\Delta = 1.5$ as our target $P$ and generate a first sample point sequence i.i.d. from the target and a second sequence i.i.d. from one component of the mixture, $\mathcal{N}(-\Delta e_1, I_d)$. As seen in the left panel of Figure 1 where $d = 1$, the IMQ KSD decays at an $n^{-0.51}$ rate when applied to the first $n$ points in the target sample and remains bounded away from zero when applied to the single-component sample. This desirable behavior is closely mirrored by the Wasserstein distance and the graph Stein discrepancy.

The middle panel of Figure 1 records the time consumed by the graph and kernel Stein discrepancies applied to the i.i.d. sample points from $P$. Each method is given access to $d$ cores when working in $d$ dimensions, and we use the released code of Gorham & Mackey (2015) with the default Gurobi 6.0.4 linear program solver for the graph Stein discrepancy. We find that the two methods have nearly identical runtimes when $d = 1$ but that the KSD is 10 to 1000 times faster when $d = 4$. In addition, the KSD is straightforwardly parallelized and does not require access to a linear program solver, making it an appealing practical choice for a quality measure.

Finally, the right panel displays the optimal Stein functions,

$$g_j(y) = \frac{\mathbb{E}_{Q_n}[b_j(X)k(X, y) + \nabla_{x_j}k(X, y)]}{\mathcal{S}(Q_n, \mathcal{T}_P, \mathcal{G}_k)},$$

recovered by the IMQ KSD when $d = 1$ and $n = 10^3$. The associated test functions

$$h(y) = (\mathcal{T}_P g)(y) = \frac{\sum_{j=1}^d \mathbb{E}_{Q_n}[k_0^j(X, y)]}{\mathcal{S}(Q_n, \mathcal{T}_P, \mathcal{G}_k)}$$

are the mean-zero functions under $P$ that best discriminate the target $P$ and the sample $Q_n$. As might be expected, the optimal test function for the single-component sample features large-magnitude values in the oversampled region far from the missing mode.
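As an illustration (ours), the univariate test function can be tabulated directly from the sample: with the IMQ defaults $\beta = -1/2$, $c = 1$, the Stein kernel of equation (3) has the closed form used below, and the unnormalized $h(y) = \sum_i q_n(x_i) k_0(x_i, y)$ differs from the plotted curve only by the constant factor $\mathcal{S}(Q_n, \mathcal{T}_P, \mathcal{G}_k)$; the function names are ours.

```julia
# Univariate IMQ Stein kernel k_0(x, y) (beta = -1/2, c = 1) from equation (3).
function stein_k0_1d(x, y, bx, by)
    u = x - y
    base = 1 + u^2
    k = base^(-1/2)
    dxk = -u * base^(-3/2)                  # ∂_x k(x, y)
    dyk = -dxk                              # ∂_y k(x, y)
    dxdyk = base^(-5/2) * (1 - 2 * u^2)     # ∂_x ∂_y k(x, y)
    return bx * by * k + bx * dyk + by * dxk + dxdyk
end

# Unnormalized test function h(y) ∝ Σ_i q_i k_0(x_i, y); `score` is b = (log p)'.
h_unnormalized(y, xs, q, score) =
    sum(q[i] * stein_k0_1d(xs[i], y, score(xs[i]), score(y)) for i in eachindex(xs))
```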

4.2. The importance of kernel choice

Theorem 6 established that kernels with rapidly decaying tails yield KSDs that can be driven to zero by off-target sample sequences. Our next experiment provides an empirical demonstration of this issue for a multivariate Gaussian target $P = \mathcal{N}(0, I_d)$ and KSDs based on the popular Gaussian ($k(x, y) = e^{-\|x - y\|_2^2/2}$) and Matérn ($k(x, y) = (1 + \sqrt{3}\|x - y\|_2)e^{-\sqrt{3}\|x - y\|_2}$) radial kernels.

Following the proof of Theorem 6 in Section F, we construct an off-target sequence $(Q_n)_{n \ge 1}$ that sends $\mathcal{S}(Q_n, \mathcal{T}_P, \mathcal{G}_k)$ to 0 for these kernel choices whenever $d \ge 3$. Specifically, for each $n$, we let $Q_n = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$ where, for all $i$ and all $j \ne i$, $\|x_i\|_2 \le 2 n^{1/d}\log n$ and $\|x_i - x_j\|_2 \ge 2\log n$. To select these sample points, we independently sample candidate points uniformly from the ball $\{x : \|x\|_2 \le 2 n^{1/d}\log n\}$, accept any points not within $2\log n$ Euclidean distance of any previously accepted point, and terminate when $n$ points have been accepted.
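The rejection construction just described can be sketched as follows (our illustration; the radius and separation match the text, the function name is ours):

```julia
using LinearAlgebra

# Rejection construction of the off-target sample in Section 4.2: candidates are
# drawn uniformly from the ball of radius 2 n^{1/d} log n and kept only if they
# lie at least 2 log n from every previously accepted point.
function off_target_sample(n, d)
    R = 2 * n^(1 / d) * log(n)
    sep = 2 * log(n)
    pts = Vector{Vector{Float64}}()
    while length(pts) < n
        z = randn(d)
        x = (R * rand()^(1 / d) / norm(z)) .* z     # uniform draw from the ball
        if all(norm(x - y) >= sep for y in pts)
            push!(pts, x)
        end
    end
    return pts
end
```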

For various dimensions, Figure 2 displays the result of applying each KSD to the off-target sequence $(Q_n)_{n \ge 1}$ and an "on-target" sequence of points sampled i.i.d. from $P$. For comparison, we also display the behavior of the IMQ KSD, which provably controls tightness and dominates weak convergence for this target by Theorem 8. As predicted, the Gaussian and Matérn KSDs decay to 0 under the off-target sequence and decay more rapidly as the dimension $d$ increases; the IMQ KSD remains bounded away from 0.

4.3. Selecting sampler hyperparameters

The approximate slice sampler of DuBois et al. (2014) is a biased MCMC procedure designed to accelerate inference when the target density takes the form $p(x) \propto \pi(x)\prod_{l=1}^L \pi(y_l \mid x)$ for $\pi(\cdot)$ a prior distribution on $\mathbb{R}^d$ and $\pi(y_l \mid x)$ the likelihood of a datapoint $y_l$. A standard slice sampler must evaluate the likelihood of all $L$ datapoints to draw each new sample point $x_i$. To reduce this cost, the approximate slice sampler introduces a tuning parameter $\epsilon$ which determines the number of datapoints that contribute to an approximation of the slice sampling step; an appropriate setting of this parameter is imperative for accurate inference. When $\epsilon$ is too small, relatively few sample points will be generated in a given amount of sampling time, yielding sample expectations with high Monte Carlo variance. When $\epsilon$ is too large, the large approximation error will produce biased samples that no longer resemble the target.

To assess the suitability of the KSD for tolerance parameter selection, we take as our target $P$ the bimodal Gaussian mixture model posterior of (Welling & Teh, 2011). For an array of $\epsilon$ values, we generated 50 independent approximate slice sampling chains with batch size 5, each with a budget of 148000 likelihood evaluations, and plotted the median IMQ KSD and effective sample size (ESS, a standard sample quality measure based on asymptotic variance (Brooks et al., 2011)) in Figure 3. ESS, which does not detect Markov chain bias, is maximized at the largest hyperparameter evaluated ($\epsilon = 10^{-1}$), while the KSD is minimized at an intermediate value ($\epsilon = 10^{-2}$). The right panel of Figure 3 shows representative samples produced by several settings of $\epsilon$. The sample produced by the ESS-selected chain is significantly overdispersed, while the sample from $\epsilon = 0$ has minimal coverage of the second mode due to its small sample size. The sample produced by the KSD-selected chain best resembles the posterior target. Using 4 cores, the longest KSD computation with $n = 10^3$ sample points took 0.16s.

Figure 1. Left: For $d = 1$, comparison of discrepancy measures for samples drawn i.i.d. from either the bimodal Gaussian mixture target $P$ or a single mixture component (see Section 4.1). Middle: On-target discrepancy computation time using $d$ cores in $d$ dimensions. Right: For $n = 10^3$ and $d = 1$, the Stein functions $g$ and discriminating test functions $h = \mathcal{T}_P g$ which maximize the KSD.

Figure 2. Gaussian and Matérn KSDs are driven to 0 by an off-target sequence that does not converge to the target $P = \mathcal{N}(0, I_d)$ (see Section 4.2). The IMQ KSD does not share this deficiency.

4.4. Selecting samplers

Ahn et al. (2012) developed two biased MCMC samplers for accelerated posterior inference, both called Stochastic Gradient Fisher Scoring (SGFS). In the full version of SGFS (termed SGFS-f), a $d \times d$ matrix must be inverted to draw each new sample point. Since this can be costly for large $d$, the authors developed a second sampler (termed SGFS-d) in which only a diagonal matrix must be inverted to draw each new sample point. Both samplers can be viewed as discrete-time approximations to a continuous-time Markov process that has the target $P$ as its stationary distribution; however, because no Metropolis-Hastings correction is employed, neither sampler has the target as its stationary distribution. Hence we will use the KSD – a quality measure that accounts for asymptotic bias – to evaluate and choose between these samplers.

Specifically, we evaluate the SGFS-f and SGFS-d samples produced in (Ahn et al., 2012, Sec. 5.1). The target $P$ is a Bayesian logistic regression with a flat prior, conditioned on a dataset of $10^4$ MNIST handwritten digit images. From each image, the authors extracted 50 random projections of the raw pixel values as covariates and a label indicating whether the image was a 7 or a 9. After discarding the first half of sample points as burn-in, we obtained regression coefficient samples with $5 \times 10^4$ points and $d = 51$ dimensions (including the intercept term). Figure 4 displays the IMQ KSD applied to the first $n$ points in each sample. As external validation, we follow the protocol of Ahn et al. (2012) to find the bivariate marginal means and 95% confidence ellipses of each sample that align best and worst with those of a surrogate ground truth sample obtained from a Hamiltonian Monte Carlo chain with $10^5$ iterates. Both the KSD and the surrogate ground truth suggest that the moderate speed-up provided by SGFS-d (0.0017s per sample vs. 0.0019s for SGFS-f) is outweighed by the significant loss in inferential accuracy. However, the KSD assessment does not require access to an external trustworthy ground truth sample. The longest KSD computation took 400s using 16 cores.

Figure 3. Left: Median hyperparameter selection criteria across 50 independent approximate slice sampler sample sequences (see Section 4.3); IMQ KSD selects $\epsilon = 10^{-2}$; effective sample size selects $\epsilon = 10^{-1}$. Right: Representative approximate slice sampler samples requiring 148000 likelihood evaluations with posterior equidensity contours overlaid; $n$ is the associated sample size.

4.5. Beyond sample quality comparison

While our investigation of the KSD was motivated by the desire to develop practical, trustworthy tools for sample quality comparison, the kernels recommended by our theory can serve as drop-in replacements in other inferential tasks that make use of kernel Stein discrepancies.

4.5.1. ONE-SAMPLE HYPOTHESIS TESTING

Chwialkowski et al. (2016) recently used the KSD $\mathcal{S}(Q_n, \mathcal{T}_P, \mathcal{G}_k)$ to develop a hypothesis test of whether a given sample from a Markov chain was drawn from a target distribution $P$ (see also Liu et al., 2016). However, the authors noted that the KSD test with their default Gaussian base kernel $k$ experienced a considerable loss of power as the dimension $d$ increased. We recreate their experiment and show that this loss of power can be avoided by using our default IMQ kernel with $\beta = -\frac{1}{2}$ and $c = 1$. Following (Chwialkowski et al., 2016, Section 4), we draw $z_i \overset{iid}{\sim} \mathcal{N}(0, I_d)$ and $u_i \overset{iid}{\sim} \mathrm{Unif}[0, 1]$ to generate a sample $(x_i)_{i=1}^n$ with $x_i = z_i + u_i e_1$ for $n = 500$ and various dimensions $d$. Using the authors' code (modified to include an IMQ kernel), we compare the power of the Gaussian KSD test, the IMQ KSD test, and the standard normality test of Baringhaus & Henze (1988) (B&H) to discern whether the sample $(x_i)_{i=1}^{500}$ came from the null distribution $P = \mathcal{N}(0, I_d)$. The results, averaged over 400 simulations, are shown in Table 1. Notably, the IMQ KSD experiences no power degradation over this range of dimensions, thus improving on both the Gaussian KSD and the standard B&H normality tests.

Table 1. Power of one-sample tests for multivariate normality, averaged over 400 simulations (see Section 4.5.1)

            d=2    d=5    d=10   d=15   d=20   d=25
  B&H       1.0    1.0    1.0    0.91   0.57   0.26
  Gaussian  1.0    1.0    0.88   0.29   0.12   0.02
  IMQ       1.0    1.0    1.0    1.0    1.0    1.0
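For concreteness, the alternative distribution used in this power study can be simulated in a few lines (a sketch of the data-generating process only, not the authors' testing code):

```julia
# x_i = z_i + u_i e_1 with z_i ~ N(0, I_d) and u_i ~ Unif[0, 1], as in Section 4.5.1.
n, d = 500, 10
X = randn(n, d)          # rows are the z_i
X[:, 1] .+= rand(n)      # perturb the first coordinate by u_i
```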

4.5.2. IMPROVING SAMPLE QUALITY

Liu & Lee (2016) recently used the KSD $\mathcal{S}(Q_n, \mathcal{T}_P, \mathcal{G}_k)$ as a means of improving the quality of a sample. Specifically, given an initial sample $Q_n$ supported on $x_1, \dots, x_n$, they minimize $\mathcal{S}(\tilde{Q}_n, \mathcal{T}_P, \mathcal{G}_k)$ over all measures $\tilde{Q}_n$ supported on the same sample points to obtain a new sample that better approximates $P$ over the class of test functions $\mathcal{H} = \mathcal{T}_P \mathcal{G}_k$. In all experiments, Liu & Lee (2016) employ a Gaussian kernel $k(x, y) = e^{-\frac{1}{h}\|x - y\|_2^2}$ with bandwidth $h$ selected to be the median of the squared Euclidean distance between pairs of sample points. Using the authors' code, we recreate the experiment from (Liu & Lee, 2016, Fig. 2b) and introduce a KSD objective with an IMQ kernel $k(x, y) = (1 + \frac{1}{h}\|x - y\|_2^2)^{-1/2}$ with bandwidth selected in the same fashion. The starting sample is given by $Q_n = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$ for $n = 100$, various dimensions $d$, and each sample point drawn i.i.d. from $P = \mathcal{N}(0, I_d)$. For the initial sample and the optimized samples produced by each KSD, Figure 5 displays the mean squared error (MSE) $\frac{1}{d}\|\mathbb{E}_P[Z] - \mathbb{E}_{\tilde{Q}_n}[X]\|_2^2$ averaged across 500 independently generated initial samples. Out of the box, the IMQ kernel produces better mean estimates than the standard Gaussian.
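To illustrate the underlying optimization (not the solver used by Liu & Lee (2016)), note that with weights $q$ on the simplex the squared KSD is the quadratic form $q^\top K_0 q$, where $(K_0)_{ii'} = k_0(x_i, x_{i'})$ is the Stein kernel matrix (computable, e.g., with the sketch after Proposition 2). The following hedged sketch minimizes it by projected gradient descent with illustrative, untuned settings:

```julia
# Euclidean projection onto the probability simplex (sorting-based algorithm).
function project_simplex(v)
    n = length(v)
    u = sort(v, rev=true)
    css = cumsum(u)
    rho = findlast(i -> u[i] + (1 - css[i]) / i > 0, 1:n)
    tau = (1 - css[rho]) / rho
    return max.(v .+ tau, 0.0)
end

# Minimize q' K0 q over the simplex by projected gradient descent; K0 is the
# n x n Stein kernel matrix with entries k_0(x_i, x_j).
function improve_weights(K0; iters=500, step=1e-2)
    n = size(K0, 1)
    q = fill(1/n, n)
    for _ in 1:iters
        q = project_simplex(q - step * (2 * K0 * q))
    end
    return q                      # reweighted sample Q̃_n = Σ_i q_i δ_{x_i}
end
```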


Figure 4. Left: Quality comparison for Bayesian logistic regression with two SGFS samplers (see Section 4.4). Right: Scatter plots of $n = 5 \times 10^4$ SGFS sample points with overlaid bivariate marginal means and 95% confidence ellipses (dashed blue) that align best and worst with the surrogate ground truth sample (solid red).

Figure 5. Average quality of mean estimates (±2 standard errors) under optimized samples $\tilde{Q}_n$ for target $P = \mathcal{N}(0, I_d)$; MSE averaged over 500 independent initial samples (see Section 4.5.2).

5. Related and future work

The score statistic of Fan et al. (2006) and the Gibbs sampler convergence criteria of Zellner & Min (1995) detect certain forms of non-convergence but fail to detect others due to the finite number of test functions tested. For example, when $P = \mathcal{N}(0, 1)$, the score statistic (Fan et al., 2006) only monitors sample means and variances.

For an approximation $\mu$ with continuously differentiable density $r$, Chwialkowski et al. (2016, Thm. 2.1) and Liu et al. (2016, Prop. 3.3) established that if $k$ is $C_0$-universal (Carmeli et al., 2010, Defn. 4.1) or integrally strictly positive definite (ISPD, Stewart, 1976, Sec. 6) and $\mathbb{E}_\mu[k_0(X, X) + \|\nabla \log \frac{p(X)}{r(X)}\|_2^2] < \infty$ for $k_0 \triangleq \sum_{j=1}^d k_0^j$, then $\mathcal{S}(\mu, \mathcal{T}_P, \mathcal{G}_k) = 0$ only if $\mu = P$. However, this property is insufficient to conclude that probability measures with small KSD are close to $P$ in any traditional sense. Indeed, Gaussian and Matérn kernels are $C_0$-universal and ISPD, but, by Theorem 6, their KSDs can be driven to zero by sequences not converging to $P$. On compact domains, where tightness is no longer an issue, the combined results of (Oates et al., 2016a, Lem. 4), (Fukumizu et al., 2007, Lem. 1), and (Simon-Gabriel & Schölkopf, 2016, Thm. 55) give conditions for a KSD to dominate weak convergence.

While assessing sample quality was our chief objective, our results may hold benefits for other applications that make use of Stein discrepancies or Stein operators. In particular, our kernel recommendations could be incorporated into the Monte Carlo control functionals framework of Oates et al. (2016b); Oates & Girolami (2015), the variational inference approaches of Liu & Wang (2016); Liu & Feng (2016); Ranganath et al. (2016), and the Stein generative adversarial network approach of Wang & Liu (2016).

In the future, we aim to leverage stochastic, low-rank, and sparse approximations of the kernel matrix and score function to produce KSDs that scale better with the number of sample and data points while still guaranteeing control over weak convergence. A reader may also wonder for which distributions outside of $\mathcal{P}$ the KSD dominates weak convergence. The following theorem, proved in Section J, shows that no KSD with a $C_0$ kernel dominates weak convergence when the target has a bounded score function.

Theorem 10 (KSD fails for bounded scores). If $\nabla \log p$ is bounded and $k \in C_0^{(1,1)}$, then $\mathcal{S}(Q_n, \mathcal{T}_P, \mathcal{G}_k) \to 0$ does not imply $Q_n \Rightarrow P$.

However, Gorham et al. (2016) developed convergence-determining graph Stein discrepancies for heavy-tailed targets by replacing the Langevin Stein operator $\mathcal{T}_P$ with diffusion Stein operators of the form $(\mathcal{T}g)(x) = \frac{1}{p(x)}\langle \nabla, p(x)(a(x) + c(x))g(x) \rangle$. An analogous construction should yield convergence-determining diffusion KSDs for $P$ outside of $\mathcal{P}$. Our results also extend to targets $P$ supported on a convex subset $\mathcal{X}$ of $\mathbb{R}^d$ by choosing $k$ to satisfy $p(x)k(x, \cdot) \equiv 0$ for all $x$ on the boundary of $\mathcal{X}$.


Acknowledgments

We thank Kacper Chwialkowski, Heiko Strathmann, and Arthur Gretton for sharing their hypothesis testing code, Qiang Liu for sharing his black-box importance sampling code, and Sebastian Vollmer and Andrew Duncan for many helpful conversations regarding this work. This material is based upon work supported by the National Science Foundation DMS RTG Grant No. 1501767, the National Science Foundation Graduate Research Fellowship under Grant No. DGE-114747, and the Frederick E. Terman Fellowship.

References

Ahn, S., Korattikara, A., and Welling, M. Bayesian posterior sampling via stochastic gradient Fisher scoring. In Proc. 29th ICML, ICML'12, 2012.

Bachman, G. and Narici, L. Functional Analysis. Academic Press textbooks in mathematics. Dover Publications, 1966. ISBN 9780486402512.

Baker, J. Integration of radial functions. Mathematics Magazine, 72(5):392–395, 1999.

Barbour, A. D. Stein's method and Poisson process convergence. J. Appl. Probab., (Special Vol. 25A):175–184, 1988. ISSN 0021-9002. A celebration of applied probability.

Barbour, A. D. Stein's method for diffusion approximations. Probab. Theory Related Fields, 84(3):297–322, 1990. ISSN 0178-8051. doi: 10.1007/BF01197887.

Baringhaus, L. and Henze, N. A consistent test for multivariate normality based on the empirical characteristic function. Metrika, 35(1):339–348, 1988.

Bezanson, J., Edelman, A., Karpinski, S., and Shah, V. B. Julia: A fresh approach to numerical computing. arXiv preprint arXiv:1411.1607, 2014.

Brooks, S., Gelman, A., Jones, G., and Meng, X.-L. Handbook of Markov chain Monte Carlo. CRC Press, 2011.

Carmeli, C., De Vito, E., Toigo, A., and Umanità, V. Vector valued reproducing kernel Hilbert spaces and universality. Analysis and Applications, 8(01):19–61, 2010.

Chatterjee, S. and Shao, Q. Nonnormal approximation by Stein's method of exchangeable pairs with application to the Curie-Weiss model. Ann. Appl. Probab., 21(2):464–483, 2011. ISSN 1050-5164. doi: 10.1214/10-AAP712.

Chen, L., Goldstein, L., and Shao, Q. Normal approximation by Stein's method. Probability and its Applications. Springer, Heidelberg, 2011. ISBN 978-3-642-15006-7. doi: 10.1007/978-3-642-15007-4.

Chwialkowski, K., Strathmann, H., and Gretton, A. A kernel test of goodness of fit. In Proc. 33rd ICML, ICML, 2016.

DuBois, C., Korattikara, A., Welling, M., and Smyth, P. Approximate slice sampling for Bayesian posterior inference. In Proc. 17th AISTATS, pp. 185–193, 2014.

Eberle, A. Reflection couplings and contraction rates for diffusions. Probab. Theory Related Fields, pp. 1–36, 2015. doi: 10.1007/s00440-015-0673-1.

Fan, Y., Brooks, S. P., and Gelman, A. Output assessment for Monte Carlo simulations via the score statistic. J. Comp. Graph. Stat., 15(1), 2006.

Fukumizu, K., Gretton, A., Sun, X., and Schölkopf, B. Kernel measures of conditional dependence. In NIPS, volume 20, pp. 489–496, 2007.

Geyer, C. J. Markov chain Monte Carlo maximum likelihood. Computer Science and Statistics: Proc. 23rd Symp. Interface, pp. 156–163, 1991.

Gorham, J. and Mackey, L. Measuring sample quality with Stein's method. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R. (eds.), Adv. NIPS 28, pp. 226–234. Curran Associates, Inc., 2015.

Gorham, J., Duncan, A., Vollmer, S., and Mackey, L. Measuring sample quality with diffusions. arXiv:1611.06972, Nov. 2016.

Götze, F. On the rate of convergence in the multivariate CLT. Ann. Probab., 19(2):724–739, 1991.

Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., and Smola, A. A kernel two-sample test. J. Mach. Learn. Res., 13(1):723–773, 2012.

Herb, R. and Sally Jr., P. J. The Plancherel formula, the Plancherel theorem, and the Fourier transform of orbital integrals. In Representation Theory and Mathematical Physics: Conference in Honor of Gregg Zuckerman's 60th Birthday, October 24–27, 2009, Yale University, volume 557, pp. 1. American Mathematical Soc., 2011.

Korattikara, A., Chen, Y., and Welling, M. Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In Proc. 31st ICML, ICML'14, 2014.

Ley, C., Reinert, G., and Swan, Y. Stein's method for comparison of univariate distributions. Probab. Surveys, 14:1–52, 2017. doi: 10.1214/16-PS278.

Liu, Q. and Feng, Y. Two methods for wild variational inference. arXiv preprint arXiv:1612.00081, 2016.

Liu, Q. and Lee, J. Black-box importance sampling. arXiv:1610.05247, October 2016. To appear in AISTATS 2017.

Liu, Q. and Wang, D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. arXiv:1608.04471, August 2016.

Liu, Q., Lee, J., and Jordan, M. A kernelized Stein discrepancy for goodness-of-fit tests. In Proc. 33rd ICML, volume 48 of ICML, pp. 276–284, 2016.

Mackey, L. and Gorham, J. Multivariate Stein factors for a class of strongly log-concave distributions. Electron. Commun. Probab., 21:14 pp., 2016. doi: 10.1214/16-ECP15.

Müller, A. Integral probability metrics and their generating classes of functions. Ann. Appl. Probab., 29(2):429–443, 1997.

Oates, C. and Girolami, M. Control functionals for Quasi-Monte Carlo integration. arXiv:1501.03379, 2015.

Oates, C., Cockayne, J., Briol, F., and Girolami, M. Convergence rates for a class of estimators based on Stein's method. arXiv preprint arXiv:1603.03220, 2016a.

Oates, C. J., Girolami, M., and Chopin, N. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2016b. ISSN 1467-9868. doi: 10.1111/rssb.12185.

Ranganath, R., Tran, D., Altosaar, J., and Blei, D. Operator variational inference. In Advances in Neural Information Processing Systems, pp. 496–504, 2016.

Simon-Gabriel, C. and Schölkopf, B. Kernel distribution embeddings: Universal kernels, characteristic kernels and kernel metrics on distributions. arXiv preprint arXiv:1604.05251, 2016.

Sriperumbudur, B. On the optimal estimation of probability measures in weak and strong topologies. Bernoulli, 22(3):1839–1893, 2016.

Sriperumbudur, B., Gretton, A., Fukumizu, K., Schölkopf, B., and Lanckriet, G. Hilbert space embeddings and metrics on probability measures. J. Mach. Learn. Res., 11(Apr):1517–1561, 2010.

Stein, C. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proc. 6th Berkeley Symposium on Mathematical Statistics and Probability (Univ. California, Berkeley, Calif., 1970/1971), Vol. II: Probability theory, pp. 583–602. Univ. California Press, Berkeley, Calif., 1972.

Stein, C., Diaconis, P., Holmes, S., and Reinert, G. Use of exchangeable pairs in the analysis of simulations. In Stein's method: expository lectures and applications, volume 46 of IMS Lecture Notes Monogr. Ser., pp. 1–26. Inst. Math. Statist., Beachwood, OH, 2004.

Steinwart, I. and Christmann, A. Support Vector Machines. Springer Science & Business Media, 2008.

Stewart, J. Positive definite functions and generalizations, an historical survey. Rocky Mountain J. Math., 6(3):409–434, 1976. doi: 10.1216/RMJ-1976-6-3-409.

Vallender, S. Calculation of the Wasserstein distance between probability distributions on the line. Theory Probab. Appl., 18(4):784–786, 1974.

Wainwright, M. High-dimensional statistics: A non-asymptotic viewpoint. 2017. URL http://www.stat.berkeley.edu/~wainwrig/nachdiplom/Chap5_Sep10_2015.pdf.

Wang, D. and Liu, Q. Learning to draw samples: With application to amortized MLE for generative adversarial learning. arXiv:1611.01722, November 2016.

Welling, M. and Teh, Y. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, 2011.

Wendland, H. Scattered data approximation, volume 17. Cambridge University Press, 2004.

Zellner, A. and Min, C. Gibbs sampler convergence criteria. JASA, 90(431):921–927, 1995.


A. Additional appendix notation

We use $f * h$ to denote the convolution between $f$ and $h$, and, for absolutely integrable $f : \mathbb{R}^d \to \mathbb{R}$, we say $\hat{f}(\omega) \triangleq (2\pi)^{-d/2}\int f(x)e^{-i\langle x, \omega \rangle}\,dx$ is the Fourier transform of $f$. For $g \in \mathcal{K}_k^d$ we define $\|g\|_{\mathcal{K}_k^d} \triangleq \sqrt{\sum_{j=1}^d \|g_j\|_{\mathcal{K}_k}^2}$. Let $L^2$ denote the Banach space of real-valued functions $f$ with $\|f\|_{L^2} \triangleq (\int f(x)^2\,dx)^{1/2} < \infty$. For $\mathbb{R}^d$-valued $g$, we will overload $g \in L^2$ to mean $\|g\|_{L^2} \triangleq \sqrt{\sum_{j=1}^d \|g_j\|_{L^2}^2} < \infty$. We define the operator norm of a vector $a \in \mathbb{R}^d$ as $\|a\|_{op} \triangleq \|a\|_2$ and of a matrix $A \in \mathbb{R}^{d \times d}$ as $\|A\|_{op} \triangleq \sup_{x \in \mathbb{R}^d, \|x\|_2 = 1} \|Ax\|_2$. We further define the Lipschitz constant $M_2(g) \triangleq \sup_{x \ne y} \|\nabla g(x) - \nabla g(y)\|_{op}/\|x - y\|_2$ and the ball $B(x, r) \triangleq \{y \in \mathbb{R}^d \mid \|x - y\|_2 \le r\}$ for any $x \in \mathbb{R}^d$ and $r \ge 0$.

B. Proof of Proposition 1: Zero mean test functions

Fix any $g \in \mathcal{G}_{k,\|\cdot\|}$. Since $k \in C^{(1,1)}$, $\sup_{x \in \mathbb{R}^d} k(x, x) < \infty$, and $\sup_{x \in \mathbb{R}^d} \|\nabla_x \nabla_y k(x, x)\|_{op} < \infty$, Cor. 4.36 of (Steinwart & Christmann, 2008) implies that $M_0(g_j) < \infty$ and $M_1(g_j) < \infty$ for each $j \in \{1, \dots, d\}$. As $\mathbb{E}_P[\|b(Z)\|_2] < \infty$, the proof of (Gorham & Mackey, 2015, Prop. 1) now implies $\mathbb{E}_P[(\mathcal{T}_P g)(Z)] = 0$.

C. Proof of Proposition 2: KSD closed form

Our proof generalizes that of (Chwialkowski et al., 2016, Thm. 2.1). For each dimension $j \in \{1, \dots, d\}$, we define the operator $\mathcal{T}_P^j$ via $(\mathcal{T}_P^j g_0)(x) \triangleq \frac{1}{p(x)}\nabla_{x_j}(p(x)g_0(x)) = \nabla_{x_j}g_0(x) + b_j(x)g_0(x)$ for $g_0 : \mathbb{R}^d \to \mathbb{R}$. We further let $\Psi_k : \mathbb{R}^d \to \mathcal{K}_k$ denote the canonical feature map of $\mathcal{K}_k$, given by $\Psi_k(x) \triangleq k(x, \cdot)$. Since $k \in C^{(1,1)}$, the argument of (Steinwart & Christmann, 2008, Cor. 4.36) implies that

$$\mathcal{T}_P g(x) = \sum_{j=1}^d (\mathcal{T}_P^j g_j)(x) = \sum_{j=1}^d \mathcal{T}_P^j \langle g_j, \Psi_k(x) \rangle_{\mathcal{K}_k} = \sum_{j=1}^d \langle g_j, \mathcal{T}_P^j \Psi_k(x) \rangle_{\mathcal{K}_k} \quad (5)$$

for all $g = (g_1, \dots, g_d) \in \mathcal{G}_{k,\|\cdot\|}$ and $x \in \mathbb{R}^d$. Moreover, (Steinwart & Christmann, 2008, Lem. 4.34) gives

$$\langle \mathcal{T}_P^j \Psi_k(x), \mathcal{T}_P^j \Psi_k(y) \rangle_{\mathcal{K}_k} = \langle b_j(x)\Psi_k(x) + \nabla_{x_j}\Psi_k(x),\; b_j(y)\Psi_k(y) + \nabla_{y_j}\Psi_k(y) \rangle_{\mathcal{K}_k}$$
$$= b_j(x)b_j(y)k(x, y) + b_j(x)\nabla_{y_j}k(x, y) + b_j(y)\nabla_{x_j}k(x, y) + \nabla_{x_j}\nabla_{y_j}k(x, y) = k_0^j(x, y) \quad (6)$$

for all $x, y \in \mathbb{R}^d$ and $j \in \{1, \dots, d\}$. The representation (6) and our $\mu$-integrability assumption together imply that, for each $j$, $\mathcal{T}_P^j \Psi_k$ is Bochner $\mu$-integrable (Steinwart & Christmann, 2008, Definition A.5.20), since

$$\mathbb{E}_\mu\big[\|\mathcal{T}_P^j \Psi_k(X)\|_{\mathcal{K}_k}\big] = \mathbb{E}_\mu\big[\sqrt{k_0^j(X, X)}\big] < \infty.$$

Hence, we may apply the representation (6) and exchange expectation and RKHS inner product to discover

$$w_j^2 = \mathbb{E}\big[k_0^j(X, \tilde{X})\big] = \mathbb{E}\big[\langle \mathcal{T}_P^j \Psi_k(X), \mathcal{T}_P^j \Psi_k(\tilde{X}) \rangle_{\mathcal{K}_k}\big] = \big\|\mathbb{E}_\mu[\mathcal{T}_P^j \Psi_k(X)]\big\|_{\mathcal{K}_k}^2 \quad (7)$$

for $X, \tilde{X} \overset{iid}{\sim} \mu$. To conclude, we invoke the representation (5), Bochner $\mu$-integrability, the representation (7), and the Fenchel-Young inequality for dual norms twice:

$$\mathcal{S}(\mu, \mathcal{T}_P, \mathcal{G}_{k,\|\cdot\|}) = \sup_{g \in \mathcal{G}_{k,\|\cdot\|}} \mathbb{E}_\mu[(\mathcal{T}_P g)(X)] = \sup_{\|g_j\|_{\mathcal{K}_k} = v_j,\, \|v\|_* \le 1} \sum_{j=1}^d \langle g_j, \mathbb{E}_\mu[\mathcal{T}_P^j \Psi_k(X)] \rangle_{\mathcal{K}_k}$$
$$= \sup_{\|v\|_* \le 1} \sum_{j=1}^d v_j \big\|\mathbb{E}_\mu[\mathcal{T}_P^j \Psi_k(X)]\big\|_{\mathcal{K}_k} = \sup_{\|v\|_* \le 1} \sum_{j=1}^d v_j w_j = \|w\|.$$

D. Proof of Proposition 3: Stein set equivalence

By Proposition 2, $\mathcal{S}(\mu, \mathcal{T}_P, \mathcal{G}_{k,\|\cdot\|}) = \|w\|$ and $\mathcal{S}(\mu, \mathcal{T}_P, \mathcal{G}_{k,\|\cdot\|_2}) = \|w\|_2$ for some vector $w$, and by (Bachman & Narici, 1966, Thm. 8.7), there exist constants $c_d, c_d' > 0$ depending only on $d$ and $\|\cdot\|$ such that $c_d\|w\| \le \|w\|_2 \le c_d'\|w\|$.


E. Proof of Theorem 5: Univariate KSD detects non-convergence

While the statement of Theorem 5 applies only to the univariate case $d = 1$, we will prove all steps for general $d$ when possible. Our strategy is to define a reference IPM $d_{\mathcal{H}}$ for which $\mu_m \Rightarrow P$ whenever $d_{\mathcal{H}}(\mu_m, P) \to 0$ and then upper bound $d_{\mathcal{H}}$ by a function of the KSD $\mathcal{S}(\mu_m, \mathcal{T}_P, \mathcal{G}_k)$. To construct the reference class of test functions $\mathcal{H}$, we choose some integrally strictly positive definite (ISPD) kernel $k_b : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, that is, we select a kernel function $k_b$ such that

$$\int_{\mathbb{R}^d \times \mathbb{R}^d} k_b(x, y)\,d\mu(x)\,d\mu(y) > 0$$

for all finite non-zero signed Borel measures $\mu$ on $\mathbb{R}^d$ (Sriperumbudur et al., 2010, Section 1.2). For this proof, we will choose the Gaussian kernel $k_b(x, y) = \exp(-\|x - y\|_2^2/2)$, which is ISPD by (Sriperumbudur et al., 2010, Section 3.1). Since $r(x) \triangleq \exp(-\|x\|_2^2/2)$ is bounded and continuous and never vanishes, the tilted kernel $\bar{k}_b(x, y) \triangleq k_b(x, y)r(x)r(y)$ is also ISPD. Let $\mathcal{H} \triangleq \{h \in \mathcal{K}_{\bar{k}_b} \mid \|h\|_{\mathcal{K}_{\bar{k}_b}} \le 1\}$. By (Sriperumbudur, 2016, Thm. 3.2), since $\bar{k}_b$ is ISPD with $\bar{k}_b(x, \cdot) \in C_0(\mathbb{R}^d)$ for all $x$, we know that $d_{\mathcal{H}}(\mu_m, P) \to 0$ only if $\mu_m \Rightarrow P$. With $\mathcal{H}$ in hand, Theorem 5 will follow from our next theorem, which upper bounds the IPM $d_{\mathcal{H}}(\mu, P)$ in terms of the KSD $\mathcal{S}(\mu, \mathcal{T}_P, \mathcal{G}_k)$.

Theorem 11 (Univariate KSD lower bound). Let $d = 1$, and consider the set of univariate functions $\mathcal{H} = \{h \in \mathcal{K}_{\bar{k}_b} \mid \|h\|_{\mathcal{K}_{\bar{k}_b}} \le 1\}$. Suppose $P \in \mathcal{P}$ and $k(x, y) = \Phi(x - y)$ for $\Phi \in C^2$ with generalized Fourier transform $\hat{\Phi}$ and $F(t) \triangleq \sup_{\|\omega\|_\infty \le t} \hat{\Phi}(\omega)^{-1}$ finite for all $t > 0$. Then there exists a constant $M_P > 0$ such that, for all probability measures $\mu$ and $\epsilon > 0$,

$$d_{\mathcal{H}}(\mu, P) \le \epsilon + \Big(\frac{\pi}{2}\Big)^{1/4} M_P\, F\Big(\frac{12\log 2}{\pi}(1 + M_1(b)M_P)\,\epsilon^{-1}\Big)^{1/2} \mathcal{S}(\mu, \mathcal{T}_P, \mathcal{G}_k).$$

Remarks. An explicit value for the Stein factor $M_P$ can be derived from the proof in Section E.1 and the results of Gorham et al. (2016). After optimizing the bound on $d_{\mathcal{H}}(\mu, P)$ over $\epsilon > 0$, the Gaussian, inverse multiquadric, and Matérn ($\nu > 1$) kernels achieve rates of $O(1/\sqrt{\log(1/\mathcal{S}(\mu, \mathcal{T}_P, \mathcal{G}_k))})$, $O(1/\log(1/\mathcal{S}(\mu, \mathcal{T}_P, \mathcal{G}_k)))$, and $O(\mathcal{S}(\mu, \mathcal{T}_P, \mathcal{G}_k)^{1/(\nu + 1/2)})$, respectively, as $\mathcal{S}(\mu, \mathcal{T}_P, \mathcal{G}_k) \to 0$.

In particular, since $\hat{\Phi}$ is non-vanishing, $F(t)$ is finite for all $t$. If $\mathcal{S}(\mu_m, \mathcal{T}_P, \mathcal{G}_k) \to 0$, then, for any fixed $\epsilon > 0$, we have $\limsup_{m \to \infty} d_{\mathcal{H}}(\mu_m, P) \le \epsilon$. Taking $\epsilon \to 0$ shows that $d_{\mathcal{H}}(\mu_m, P) \to 0$, which implies that $\mu_m \Rightarrow P$.

E.1. Proof of Theorem 11: Univariate KSD lower bound

Fix any probability measure $\mu$ and $h \in \mathcal{H}$, and define the tilting function $\Xi(x) \triangleq (1 + \|x\|_2^2)^{1/2}$. The proof will proceed in three steps.

Step 1: Uniform bounds on $M_0(h)$, $M_1(h)$, and $\sup_{x \in \mathbb{R}^d}\|\Xi(x)\nabla h(x)\|_2$. We first bound $M_0(h)$, $M_1(h)$, and $\sup_{x \in \mathbb{R}^d}\|\Xi(x)\nabla h(x)\|_2$ uniformly over $\mathcal{H}$. To this end, we define the finite value $c_0 \triangleq \sup_{x \in \mathbb{R}^d}(1 + \|x\|_2^2)r(x) = 2e^{-1/2}$. For all $x \in \mathbb{R}^d$, we have

$$|h(x)| = |\langle h, \bar{k}_b(x, \cdot) \rangle_{\mathcal{K}_{\bar{k}_b}}| \le \|h\|_{\mathcal{K}_{\bar{k}_b}}\, \bar{k}_b(x, x)^{1/2} \le 1.$$

Moreover, we have $\nabla_x k_b(x, y) = (y - x)k_b(x, y)$ and $\nabla r(x) = -x r(x)$. Thus for any $x$, by (Steinwart & Christmann, 2008, Corollary 4.36) we have

$$\|\nabla h(x)\|_2 \le \|h\|_{\mathcal{K}_{\bar{k}_b}} \langle \nabla_x, \nabla_y \bar{k}_b(x, x) \rangle^{1/2} \le [d\,r(x)^2 + \|x\|_2^2\,r(x)^2]^{1/2} k_b(x, x)^{1/2} \le [(d - 1)^{1/2} + (1 + \|x\|_2^2)^{1/2}]\,r(x),$$

where in the last inequality we used the triangle inequality. Hence $\|\nabla h(x)\|_2 \le (d - 1)^{1/2} + 1$ and $\|\Xi(x)\nabla h(x)\|_2 \le (d - 1)^{1/2} + c_0$ for all $x$, completing our bounding of $M_0(h)$, $M_1(h)$, and $\sup_{x \in \mathbb{R}^d}\|\Xi(x)\nabla h(x)\|_2$ uniformly over $\mathcal{H}$.

Step 2: Uniform bound on $\|g_h\|_{L^2}$ for the Stein solution $g_h$. We next show that there is a solution to the $P$ Stein equation

$$(\mathcal{T}_P g_h)(x) = h(x) - \mathbb{E}_P[h(Z)] \quad (8)$$

with $\|g_h(x)\|_2 \le M_P/(1 + \|x\|_2^2)^{1/2}$ for every $h \in \mathcal{H}$. When $d = 1$, this will imply that $\|g_h\|_{L^2}$ is bounded uniformly over $\mathcal{H}$. To proceed, we will define a tilted distribution $\bar{P} \in \mathcal{P}$ and a tilted function $f$, show that a solution $g_f$ to the $\bar{P}$ Stein equation is bounded, and construct a solution $g_h$ to the Stein equation of $P$ based on $g_f$.

Define $\bar{P}$ via the tilted probability density $\bar{p}(x) \propto p(x)/\Xi(x)$ with score function $\bar{b}(x) \triangleq \nabla \log \bar{p}(x) = b(x) - \xi(x)$ for $\xi(x) \triangleq \nabla \log \Xi(x) = x/(1 + \|x\|_2^2)$. Since $b$ is Lipschitz and $\nabla \xi(x) = (1 + \|x\|_2^2)^{-1}\big[I - \frac{2xx^\top}{1 + \|x\|_2^2}\big]$ has its operator norm uniformly bounded by 3, $\bar{b}$ is also Lipschitz. To see that $\bar{P}$ is also distantly dissipative, note first that $|\langle \xi(x) - \xi(y), x - y \rangle| \le \|\xi(x) - \xi(y)\|_2 \cdot \|x - y\|_2 \le \|x - y\|_2$ since $\sup_x \|\xi(x)\|_2 \le 1/2$. Because $P$ is distantly dissipative, we know $\langle b(x) - b(y), x - y \rangle \le -\frac{1}{2}\kappa_0\|x - y\|_2^2$ for some $\kappa_0 > 0$ and all $\|x - y\|_2 \ge R$ for some $R > 0$. Thus for all $\|x - y\|_2 \ge \max(R, 4/\kappa_0)$, we have

$$\langle \bar{b}(x) - \bar{b}(y), x - y \rangle = \langle b(x) - b(y), x - y \rangle - \langle \xi(x) - \xi(y), x - y \rangle \le -\tfrac{1}{2}\kappa_0\|x - y\|_2^2 + \|x - y\|_2 \le -\tfrac{1}{2}\,\tfrac{\kappa_0}{2}\,\|x - y\|_2^2,$$

so $\bar{P}$ is also distantly dissipative and hence in $\mathcal{P}$.

Let $f(x) \triangleq \Xi(x)(h(x) - \mathbb{E}_P[h(Z)])$. Since $\mathbb{E}_{\bar{P}}[f(Z)]$ is proportional to $\mathbb{E}_P[h(Z) - \mathbb{E}_P[h(Z)]] = 0$, Thm. 5 and Sec. 4.2 of (Gorham et al., 2016) imply that the $\bar{P}$ Stein equation $(\mathcal{T}_{\bar{P}} g_f)(x) = f(x)$ has a solution $g_f$ with $M_0(g_f) \le M_P' M_1(f)$ for $M_P'$ a constant independent of $f$ and $h$. Since $\nabla f(x) = \nabla \Xi(x)(h(x) - \mathbb{E}_P[h(Z)]) + \Xi(x)\nabla h(x)$ and $\|\nabla \Xi(x)\|_2 = \frac{\|x\|_2}{(1 + \|x\|_2^2)^{1/2}}$ is bounded by 1, $M_0(g_f) \le M_P'(2 + (d - 1)^{1/2} + c_0) \triangleq M_P$, a constant independent of $h$.

Finally, we note that $g_h(x) \triangleq g_f(x)/\Xi(x)$ is a solution to the $P$ Stein equation (8) satisfying $\|g_h(x)\|_2 \le M_P/\Xi(x) = M_P/(1 + \|x\|_2^2)^{1/2}$. Hence, in the case $d = 1$, we have $\|g_h\|_{L^2} \le M_P\sqrt{\pi}$.

Step 3: Approximate TP gh using TPGk In our final step, we will use the following lemma, proved in Section E.2, toshow that we can approximate TP gh arbitrarily well by a function in a scaled copy of TPGk.

Lemma 12 (Stein approximations with finite RKHS norm). Suppose that g : R^d → R^d is bounded and belongs to L^2 ∩ C^1 and that h = T_P g is Lipschitz. Moreover, suppose k(x, y) = Φ(x − y) for Φ ∈ C^2 with generalized Fourier transform Φ̂. Then, for every ε > 0, there is a function g_ε : R^d → R^d such that sup_{x∈R^d} |(T_P g_ε)(x) − (T_P g)(x)| ≤ ε and

‖g_ε‖_{K_k^d} ≤ (2π)^{−d/4} F((12 d log 2/π)(M_1(h) + M_1(b) M_0(g)) ε^{−1})^{1/2} ‖g‖_{L^2},

where F(t) ≜ sup_{‖ω‖_∞ ≤ t} Φ̂(ω)^{−1}.

When d = 1, Lemma 12 applied to g_h implies that, for every ε > 0, there is a function g_ε : R → R with sup_x |(T_P g_ε)(x) − (T_P g_h)(x)| ≤ ε and ‖g_ε‖_{K_k} ≤ (2π)^{−1/4} M_P √π F((12 log 2/π)(M_1(h) + M_1(b) M_P) ε^{−1})^{1/2}. Hence we have

|E_P[h(Z)] − E_µ[h(X)]| ≤ |E_µ[h(X) − E_P[h(Z)] − (T_P g_ε)(X)]| + |E_µ[(T_P g_ε)(X)]|
  ≤ ε + ‖g_ε‖_{K_k} S(µ, T_P, G_k)
  ≤ ε + (2π)^{−1/4} M_P √π F((12 log 2/π)(M_1(h) + M_1(b) M_P) ε^{−1})^{1/2} S(µ, T_P, G_k).

Taking a supremum over h ∈ H yields the advertised result.

E.2. Proof of Lemma 12: Stein approximations with finite RKHS norm

Let us define the function S : R^d → R via the mapping S(x) ≜ ∏_{j=1}^d sin(x_j)/x_j. Then S ∈ L^2 and ∫_{R^d} ‖x‖_2 S(x)^4 dx < ∞. We then define the density function ρ(x) ≜ Z^{−1} S(x)^4, where Z ≜ ∫_{R^d} S(x)^4 dx = (2π/3)^d is the normalization constant. One can check that ρ̂(ω)^2 ≤ (2π)^{−d} I[‖ω‖_∞ ≤ 4].

Let Y be a random variable with density ρ. For each δ > 0, let us define ρ_δ(x) ≜ δ^{−d} ρ(x/δ), and for any function f let us denote f_δ(x) ≜ E[f(x + δY)]. Since h = T_P g is assumed Lipschitz, we have |h_δ(x) − h(x)| = |E_ρ[h(x + δY) − h(x)]| ≤ δ M_1(h) E_ρ[‖Y‖_2] for all x ∈ R^d.


Next, notice that for any δ > 0 and x ∈ R^d,

(T_P g_δ)(x) = E_ρ[〈b(x), g(x + δY)〉] + E_ρ[〈∇, g〉(x + δY)], and
h_δ(x) = E_ρ[〈b(x + δY), g(x + δY)〉] + E_ρ[〈∇, g〉(x + δY)].

Thus, because we assume b is Lipschitz, we can deduce that for any x ∈ R^d,

|(T_P g_δ)(x) − h_δ(x)| = |E_ρ[〈b(x) − b(x + δY), g(x + δY)〉]| ≤ E_ρ[‖b(x) − b(x + δY)‖_2 ‖g(x + δY)‖_2] ≤ M_0(g) M_1(b) δ E_ρ[‖Y‖_2].

Thus, for any ε > 0, letting ε̃ ≜ ε/((M_1(h) + M_1(b) M_0(g)) E_ρ[‖Y‖_2]) and g_ε ≜ g_{ε̃}, the triangle inequality gives

|(T_P g_ε)(x) − (T_P g)(x)| ≤ |(T_P g_{ε̃})(x) − h_{ε̃}(x)| + |h_{ε̃}(x) − h(x)| ≤ ε.

Thus it remains to bound the RKHS norm of g_δ. By the Convolution Theorem (Wendland, 2004, Thm. 5.16), we have ĝ_δ(ω) = (2π)^{d/2} ĝ(ω) ρ̂_δ(ω), and so the squared norm of g_δ in K_k^d is equal to (Wendland, 2004, Thm. 10.21)

(2π)^{−d/2} ∫_{R^d} |ĝ_δ(ω)|^2 / Φ̂(ω) dω = (2π)^{d/2} ∫_{R^d} |ĝ(ω)|^2 ρ̂_δ(ω)^2 / Φ̂(ω) dω ≤ (2π)^{−d/2} { sup_{‖ω‖_∞ ≤ 4δ^{−1}} Φ̂(ω)^{−1} } ∫_{R^d} |ĝ(ω)|^2 dω,

where in the inequality we used the fact that ρ̂_δ(ω) = ρ̂(δω). By Plancherel's theorem (Herb & Sally Jr., 2011, Thm. 1.1), f ∈ L^2 implies ‖f̂‖_{L^2} = ‖f‖_{L^2}. Thus we have ‖g_δ‖_{K_k^d} ≤ (2π)^{−d/4} F(4δ^{−1})^{1/2} ‖g‖_{L^2}. The final result follows from noticing that ∫_R sin^4(x)/x^4 dx = 2π/3 and also

∫_{R^d} ‖x‖_2 ∏_{j=1}^d sin^4(x_j)/x_j^4 dx ≤ ∫_{R^d} ‖x‖_1 ∏_{j=1}^d sin^4(x_j)/x_j^4 dx = ∑_{j=1}^d ∫_{R^d} (sin x_j)^4/|x_j|^3 ∏_{k≠j} sin^4(x_k)/x_k^4 dx = 2d(log 2)(2π/3)^{d−1},

which implies E_ρ[‖Y‖_2] ≤ 3 d log 2/π.
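Both one-dimensional integrals used in this last step are classical and easy to confirm numerically. The following sketch (our own; it simply applies scipy.integrate.quad on a truncated domain, which is an approximation) checks ∫_R (sin x/x)^4 dx = 2π/3 and ∫_R sin^4(x)/|x|^3 dx = 2 log 2.

import numpy as np
from scipy.integrate import quad

f4 = lambda x: (np.sin(x) / x)**4        # integrand behind the normalization Z = (2*pi/3)^d
g3 = lambda x: np.sin(x)**4 / x**3       # integrand behind the first-moment bound

# Integrate over (0, 200]; the neglected tails are O(200^-3) and do not affect 4 decimals.
I4, _ = quad(f4, 1e-12, 200, limit=1000)
I3, _ = quad(g3, 1e-12, 200, limit=1000)

print(2 * I4, 2 * np.pi / 3)     # both ≈ 2.0944
print(2 * I3, 2 * np.log(2))     # both ≈ 1.3863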

F. Proof of Theorem 6: KSD fails with light kernel tails

First, define the generalized inverse function γ^{−1}(s) ≜ inf{r ≥ 0 | γ(r) ≤ s}. Next, fix an n ≥ 1, let Δ_n ≜ max(1, γ^{−1}(1/n)), and define r_n ≜ Δ_n n^{1/d}. Select n distinct points x_1, . . . , x_n ∈ R^d so that z_{i,i′} ≜ x_i − x_{i′} satisfies ‖z_{i,i′}‖_2 > Δ_n for all i ≠ i′ and ‖x_i‖_2 ≤ r_n for all i. By (Wainwright, 2017, Lems. 5.1 and 5.2), such a point set always exists. Now define Q_n = (1/n) ∑_{i=1}^n δ_{x_i}. We will show that if Δ_n grows at an appropriate rate then S(Q_n, T_P, G_k) → 0 as n → ∞.

Since the target distribution P is N(0, I_d), the associated gradient of the log density is b(x) = −x. Thus

k_0(x, y) ≜ ∑_{j=1}^d k_0^j(x, y) = 〈x, y〉 k(x, y) − 〈y, ∇_x k(x, y)〉 − 〈x, ∇_y k(x, y)〉 + 〈∇_x, ∇_y k(x, y)〉.

From Proposition 2, we have

S(Q_n, T_P, G_k)^2 = (1/n^2) ∑_{i,i′=1}^n k_0(x_i, x_{i′}) = (1/n^2) ∑_{i=1}^n k_0(x_i, x_i) + (1/n^2) ∑_{i≠i′} k_0(x_i, x_{i′}).    (9)

Since k ∈ C_b^{(2,2)}, γ(0) < ∞. Thus, by Cauchy–Schwarz, the first term of (9) is upper bounded by

(1/n^2) ∑_{i=1}^n k_0(x_i, x_i) ≤ (1/n^2) ∑_{i=1}^n [‖x_i‖_2^2 k(x_i, x_i) + ‖x_i‖_2(‖∇_x k(x_i, x_i)‖_2 + ‖∇_y k(x_i, x_i)‖_2) + |〈∇_x, ∇_y k(x_i, x_i)〉|]
  ≤ (γ(0)/n)[r_n^2 + 2 r_n + 1] ≤ (γ(0)/n)(n^{1/d} Δ_n + 1)^2.


To handle the second term of (9), we will use the assumed bound on k and its derivatives from γ. For any fixed i ≠ i′, by the triangle inequality, Cauchy–Schwarz, and the fact that γ is monotonically decreasing, we have

|k_0(x_i, x_{i′})| ≤ ‖x_i‖_2 ‖x_{i′}‖_2 |k(x_i, x_{i′})| + ‖x_i‖_2 ‖∇_y k(x_i, x_{i′})‖_2 + ‖x_{i′}‖_2 ‖∇_x k(x_i, x_{i′})‖_2 + |〈∇_x, ∇_y k(x_i, x_{i′})〉|
  ≤ r_n^2 γ(‖z_{i,i′}‖_2) + r_n γ(‖z_{i,i′}‖_2) + r_n γ(‖z_{i,i′}‖_2) + γ(‖z_{i,i′}‖_2)
  ≤ (n^{1/d} Δ_n + 1)^2 γ(Δ_n).

Our upper bounds on the Stein discrepancy (9) and our choice of Δ_n now imply that

S(Q_n, T_P, G_k) = O(n^{1/d − 1/2} γ^{−1}(1/n) + n^{−1/2}).

Moreover, since γ(r) = o(r^{−α}), we have γ^{−1}(1/n) = o(n^{1/α}) = o(n^{1/2 − 1/d}), and hence S(Q_n, T_P, G_k) → 0 as n → ∞.

However, the sequence (Q_n)_{n≥1} is not uniformly tight and hence converges to no probability measure. This follows as, for each r > 0,

Q_m(‖X‖_2 ≤ r) ≤ (r + 4r/Δ_m)^d / m ≤ 5^d r^d / m ≤ 1/5

for m = ⌈5^{d+1} r^d⌉, since at most (r + 4r/Δ_m)^d points with minimum pairwise Euclidean distance greater than Δ_m can fit into a ball of radius r (Wainwright, 2017, Lems. 5.1 and 5.2).
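This failure mode is easy to reproduce numerically. The sketch below (our own construction, not from the proof) takes the Gaussian target N(0, I_3), the light-tailed Gaussian base kernel k(x, y) = exp(−‖x − y‖_2^2/2), a separation Δ_n = (2 log n)^{1/2} standing in for γ^{−1}(1/n), and a grid of n = m^3 well-separated points; the reported KSD decreases steadily, at the slow rate the bound above suggests, even though Q_n places its mass farther and farther from the target.

import numpy as np

# Stein kernel k0 for P = N(0, I_d) (score b(x) = -x) with the Gaussian base kernel
# k(x, y) = exp(-||x - y||^2 / 2); expanding the definition of k0 above gives
#   k0(x, y) = (<x, y> + d - 2 ||x - y||^2) exp(-||x - y||^2 / 2).
def stein_kernel_matrix(X):
    sq = np.sum(X**2, axis=1)
    inner = X @ X.T
    z2 = sq[:, None] + sq[None, :] - 2.0 * inner      # pairwise ||x_i - x_j||^2
    return (inner + X.shape[1] - 2.0 * z2) * np.exp(-0.5 * z2)

# Theorem 6 style point set in d = 3: an m x m x m grid with spacing Delta_n.
def ksd_of_grid(m, d=3):
    n = m**d
    delta = np.sqrt(2.0 * np.log(n))                  # stand-in for gamma^{-1}(1/n)
    axis = delta * (1.0 + np.arange(m))
    X = np.stack(np.meshgrid(*([axis] * d)), axis=-1).reshape(-1, d)
    return np.sqrt(stein_kernel_matrix(X).sum()) / n  # eq. (9)

for m in [4, 6, 8, 10, 12]:
    print(m**3, ksd_of_grid(m))   # KSD slowly decays although Q_n escapes to infinity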

G. Proof of Theorem 7: KSD detects tight non-convergence

For any probability measure µ on R^d and ε > 0, we define its tightness rate as

R(µ, ε) ≜ inf{r ≥ 0 | µ(‖X‖_2 > r) ≤ ε}.    (10)
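For an empirical measure, the tightness rate (10) is simply an upper quantile of the sample norms; the following short numpy illustration (our own) makes this concrete.

import numpy as np

# Empirical tightness rate R(mu, eps) for mu = (1/n) sum_i delta_{x_i}:
# essentially the (1 - eps) quantile of the sample norms ||x_i||_2.
def tightness_rate(X, eps):
    return np.quantile(np.linalg.norm(X, axis=1), 1.0 - eps)

X = np.random.default_rng(0).standard_normal((10_000, 2))
print(tightness_rate(X, eps=0.05))   # ≈ 2.45, the square root of the chi-square(2) 0.95 quantile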

Theorem 7 will follow from the following result, which upper bounds the bounded Lipschitz metric d_{BL‖·‖_2}(µ, P) in terms of the tightness rate R(µ, ε), the rate of decay of the generalized Fourier transform Φ̂, and the KSD S(µ, T_P, G_k).

Theorem 13 (KSD tightness lower bound). Suppose P ∈ P, and let µ be a probability measure with tightness rate R(µ, ε) defined in (10). Moreover, suppose the kernel k(x, y) = Φ(x − y) with Φ ∈ C^2 and F(t) ≜ sup_{‖ω‖_∞ ≤ t} Φ̂(ω)^{−1} finite for all t > 0. Then there exists a constant M_P such that, for all ε, δ > 0,

d_{BL‖·‖_2}(µ, P) ≤ ε + min(ε, 1)(1 + ε + δ^{−1}(d θ_{d−1}/θ_d) M_P)
  + (2π)^{−d/4} V_d^{1/2} M_P (R(µ, ε) + 2δ)^{d/2} F((12 d log 2/π)(1 + M_1(b) M_P) ε^{−1})^{1/2} S(µ, T_P, G_k),

where θ_d ≜ d ∫_0^1 exp(−1/(1 − r^2)) r^{d−1} dr for d > 0 (and θ_0 ≜ e^{−1}), and V_d is the volume of the unit Euclidean ball in dimension d.

Remarks. An explicit value for the Stein factor M_P can be derived from the proof in Section G.1 and the results of Gorham et al. (2016). When bounds on R and F are known, the final expression can be optimized over ε and δ to produce rates of convergence in d_{BL‖·‖_2}.

Consider now a sequence of probability measures (µ_m)_{m≥1} that is uniformly tight. This implies that lim sup_m R(µ_m, ε) < ∞ for all ε > 0. Moreover, since Φ̂ is non-vanishing, F(t) is finite for all t. Thus, if S(µ_m, T_P, G_k) → 0, then, for any fixed ε < 1 and δ > 0, lim sup_m d_{BL‖·‖_2}(µ_m, P) ≤ ε(2 + ε + δ^{−1}(d θ_{d−1}/θ_d) M_P). Taking ε → 0 yields d_{BL‖·‖_2}(µ_m, P) → 0.

G.1. Proof of Theorem 13: KSD tightness lower bound

Fix any h ∈ BL_{‖·‖_2}. By Theorem 5 and Section 4.2 of (Gorham et al., 2016), there exists a g ∈ C^1 which solves the Stein equation T_P g = h − E_P[h(Z)] and satisfies M_0(g) ≤ M_P for M_P a constant independent of h and g. To show that we can approximate T_P g arbitrarily well by a function in a scaled copy of T_P G_k, we will form a truncated version of g using a smoothed indicator function described in the next lemma.


Lemma 14 (Smoothed indicator function). For any compact set K ⊂ R^d and δ > 0, define the set inflation K^{2δ} ≜ {x ∈ R^d | ‖x − y‖_2 ≤ 2δ for some y ∈ K}. There is a function v_{K,δ} : R^d → [0, 1] such that

v_{K,δ}(x) = 1 for all x ∈ K and v_{K,δ}(x) = 0 for all x ∉ K^{2δ},    (11)

‖∇v_{K,δ}(x)‖_2 ≤ δ^{−1}(d θ_{d−1}/θ_d) I[x ∈ K^{2δ} \ K],    (12)

where θ_d ≜ d ∫_0^1 exp(−1/(1 − r^2)) r^{d−1} dr for d > 0 and θ_0 ≜ e^{−1}.

This lemma is proved in Section G.2.
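Lemma 14's v_{K,δ} is constructed in Section G.2 as the expectation v_{K,δ}(x) = E[I[x + δW ∈ K^δ]] with W drawn from a bump density supported on B(0, 1). For the special case of a ball K = B(0, R), the following Monte Carlo sketch (our own illustration; the rejection sampler and the sample sizes are implementation choices, not part of the lemma) exhibits the three regimes promised by (11) and (12): the function equals 1 on K, vanishes outside K^{2δ}, and varies smoothly in between.

import numpy as np

rng = np.random.default_rng(0)

# Rejection-sample W from the normalized bump density psi(w) proportional to
# exp(-1/(1 - ||w||^2)) on B(0,1), using uniform proposals on the cube [-1, 1]^d
# (acceptance ratio e * exp(-1/(1 - ||w||^2))).
def sample_bump(n, d):
    batches = []
    while sum(len(b) for b in batches) < n:
        w = rng.uniform(-1.0, 1.0, size=(4 * n, d))
        r2 = np.sum(w**2, axis=1)
        dens = np.where(r2 < 1.0, np.exp(-1.0 / np.maximum(1.0 - r2, 1e-12)), 0.0)
        batches.append(w[rng.uniform(size=4 * n) < np.e * dens])
    return np.concatenate(batches)[:n]

# Monte Carlo estimate of v_{K,delta}(x) = E[ I[x + delta*W in K^delta] ] for K = B(0, R),
# whose delta-inflation is the ball K^delta = B(0, R + delta).
def v(x, R=1.0, delta=0.5, n_mc=20_000):
    W = sample_bump(n_mc, x.size)
    return np.mean(np.linalg.norm(x + delta * W, axis=1) <= R + delta)

for r in [0.5, 1.0, 1.6, 2.5]:
    print(r, v(np.array([r, 0.0])))   # 1 on K = B(0,1), 0 outside K^{2 delta} = B(0,2), smooth in between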

Fix any ε, δ > 0, and let K = B(0, R(µ, ε)) with R(µ, ε) defined in (10). This set is compact since R(µ, ε) is finite. Hence, we may define g_{K,δ}(x) ≜ g(x) v_{K,δ}(x) as a smooth, truncated version of g based on Lemma 14. Since

(T_P g)(x) − (T_P g_{K,δ})(x) = (1 − v_{K,δ}(x))[〈b(x), g(x)〉 + 〈∇, g〉(x)] − 〈∇v_{K,δ}(x), g(x)〉 = (1 − v_{K,δ}(x))(T_P g)(x) − 〈∇v_{K,δ}(x), g(x)〉,

properties (11) and (12) imply that (T_P g)(x) = (T_P g_{K,δ})(x) for all x ∈ K, that (T_P g_{K,δ})(x) = 0 when x ∉ K^{2δ}, and that

|(T_P g)(x) − (T_P g_{K,δ})(x)| ≤ |(T_P g)(x)| + ‖∇v_{K,δ}(x)‖_2 ‖g(x)‖_2 ≤ |(T_P g)(x)| + δ^{−1}(d θ_{d−1}/θ_d) ‖g(x)‖_2 ≤ 1 + δ^{−1}(d θ_{d−1}/θ_d) M_P

for x ∈ K^{2δ} \ K by Cauchy–Schwarz.

Moreover, since v_{K,δ} is C^1 and compactly supported by (11), g_{K,δ} ∈ C^1 with ‖g_{K,δ}‖_{L^2} ≤ Vol(K^{2δ})^{1/2} M_0(g) ≤ Vol(K^{2δ})^{1/2} M_P. Therefore, Lemma 12 implies that there is a function g_ε ∈ K_k^d such that |(T_P g_ε)(x) − (T_P g_{K,δ})(x)| ≤ ε for all x, with norm

‖g_ε‖_{K_k^d} ≤ (2π)^{−d/4} F((12 d log 2/π)(1 + M_1(b) M_P) ε^{−1})^{1/2} Vol(K^{2δ})^{1/2} M_P.    (13)

Using the fact that T_P g_{K,δ} and T_P g are identical on K, we have |(T_P g_ε)(x) − (T_P g)(x)| ≤ ε for all x ∈ K. Moreover, when x ∉ K, the triangle inequality gives

|(T_P g_ε)(x) − (T_P g)(x)| ≤ |(T_P g_ε)(x) − (T_P g_{K,δ})(x)| + |(T_P g_{K,δ})(x) − (T_P g)(x)| ≤ 1 + ε + δ^{−1}(d θ_{d−1}/θ_d) M_P.

By the triangle inequality and the definition of the tightness rate, we therefore have

|E_µ[h(X)] − E_P[h(Z)]| = |E_µ[(T_P g)(X)]| ≤ |E_µ[(T_P g)(X) − (T_P g_ε)(X)]| + |E_µ[(T_P g_ε)(X)]|
  ≤ |E_µ[((T_P g)(X) − (T_P g_ε)(X)) I[X ∈ K]]| + |E_µ[((T_P g)(X) − (T_P g_ε)(X)) I[X ∉ K]]| + |E_µ[(T_P g_ε)(X)]|
  ≤ ε + min(ε, 1)(1 + ε + δ^{−1}(d θ_{d−1}/θ_d) M_P) + ‖g_ε‖_{K_k^d} S(µ, T_P, G_k)
  ≤ ε + min(ε, 1)(1 + ε + δ^{−1}(d θ_{d−1}/θ_d) M_P)
    + (2π)^{−d/4} Vol(B(0, R(µ, ε) + 2δ))^{1/2} F((12 d log 2/π)(1 + M_1(b) M_P) ε^{−1})^{1/2} M_P S(µ, T_P, G_k).

The advertised result follows by substituting Vol(B(0, r)) = V_d r^d and taking the supremum over all h ∈ BL_{‖·‖_2}.

G.2. Proof of Lemma 14: Smoothed indicator function

For all x ∈ R^d, define the standard normalized bump function ψ ∈ C^∞ as

ψ(x) ≜ I_d^{−1} exp(−1/(1 − ‖x‖_2^2)) I[‖x‖_2 < 1],

where the normalizing constant is given by

I_d = ∫_{B(0,1)} exp(−1/(1 − ‖x‖_2^2)) dx = θ_d V_d


for V_d the volume of the unit Euclidean ball in d dimensions (Baker, 1999).

Letting W be a random variable with density ψ, define v_{K,δ}(x) ≜ E[I[x + δW ∈ K^δ]] as the smoothed approximation of x ↦ I[x ∈ K], where δ > 0 controls the amount of smoothing. Since supp(W) = B(0, 1), we can immediately conclude (11) and also supp(∇v_{K,δ}) ⊆ K^{2δ} \ K.

Thus, to prove (12), it remains to consider x ∈ K^{2δ} \ K. By the Leibniz rule, ∇v_{K,δ}(x) = δ^{−d−1} ∫_{B(x,δ)} ∇ψ((x − y)/δ) I[y ∈ K^δ] dy. Letting K^δ_x ≜ δ^{−1}(K^δ − x), Jensen's inequality then gives

‖∇v_{K,δ}(x)‖_2 ≤ δ^{−d−1} ∫_{B(x,δ) ∩ K^δ} ‖∇ψ((x − y)/δ)‖_2 dy = δ^{−1} ∫_{B(0,1) ∩ K^δ_x} ‖∇ψ(z)‖_2 dz ≤ δ^{−1} ∫_{B(0,1)} ‖∇ψ(z)‖_2 dz,

where we used the substitution z ≜ (x − y)/δ. By differentiating ψ, using (Baker, 1999) with the substitution r = ‖z‖_2, and employing integration by parts, we have

∫_{B(0,1)} ‖∇ψ(z)‖_2 dz = I_d^{−1} ∫_0^1 [2r/(1 − r^2)^2] exp(−1/(1 − r^2)) (d V_d r^{d−1}) dr
  = (d/θ_d) [ −r^{d−1} exp(−1/(1 − r^2)) |_{r=0}^{r=1} + ∫_0^1 (d − 1) r^{d−2} exp(−1/(1 − r^2)) dr ]
  = (d/θ_d) [ e^{−1} I[d = 1] + I[d ≠ 1] θ_{d−1} ] = d θ_{d−1}/θ_d,

yielding (12).

H. Proof of Theorem 8: IMQ KSD detects non-convergence

We first use the following theorem to upper bound the bounded Lipschitz metric d_{BL‖·‖}(µ, P) in terms of the KSD S(µ, T_P, G_k).

Theorem 15 (IMQ KSD lower bound). Suppose P ∈ P and k(x, y) = (c^2 + ‖x − y‖_2^2)^β for c > 0 and β ∈ (−1, 0). Choose any α ∈ (0, (β + 1)/2) and a > c/2. Then there exist an ε_0 > 0 and a constant M_P such that, for all µ,

d_{BL‖·‖}(µ, P) ≤ inf_{ε∈[0,ε_0), δ>0} { (2 + ε + δ^{−1}(d θ_{d−1}/θ_d) M_P) ε + (2π)^{−d/4} M_P V_d^{1/2}
  × [ (D(a, c, α, β)^{1/2}(S(µ, T_P, G_k) − ζ(a, c, α, β))/(α κ_0 ε))^{1/α} + 2δ ]^{d/2} F_IMQ((12 d log 2/π)(1 + M_1(b) M_P) ε^{−1})^{1/2} S(µ, T_P, G_k) }    (14)

  = O(1/log(1/S(µ, T_P, G_k))) as S(µ, T_P, G_k) → 0,    (15)

where θ_d ≜ d ∫_0^1 exp(−1/(1 − r^2)) r^{d−1} dr for d > 0 and θ_0 ≜ e^{−1}, V_d is the volume of the Euclidean unit ball in d dimensions, the function D is defined in (20), the function ζ is defined in (17), and finally

F_IMQ(t) ≜ (Γ(−β)/2^{1+β}) (√d t/c)^{β+d/2} / K_{β+d/2}(c √d t),    (16)

where K_v is the modified Bessel function of the third kind. Moreover, if lim sup_m S(µ_m, T_P, G_k) < ∞, then (µ_m)_{m≥1} is uniformly tight.
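Since (16) only involves a power and a Bessel factor, it is straightforward to evaluate. The sketch below (our own, with illustrative parameters c = 1, β = −1/2, d = 1) uses scipy's kv, the modified Bessel function K_v, and exhibits the at-most-exponential growth of F_IMQ that ultimately produces the logarithmic rate in (15).

import numpy as np
from scipy.special import gamma, kv   # kv(v, z) = modified Bessel function K_v(z)

# F_IMQ(t) from (16) for the IMQ kernel k(x, y) = (c^2 + ||x - y||^2)^beta, beta in (-1, 0).
def F_imq(t, c=1.0, beta=-0.5, d=1):
    v = beta + d / 2.0
    return gamma(-beta) / 2**(1 + beta) * (np.sqrt(d) * t / c)**v / kv(v, c * np.sqrt(d) * t)

# Because K_v(z) decays like e^{-z}/sqrt(z), F_IMQ(t) grows roughly like e^{c sqrt(d) t}.
for t in [1.0, 5.0, 10.0, 20.0]:
    print(t, F_imq(t))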

Remark. The Stein factor M_P can be determined explicitly based on the proof of Theorem 15 in Section H.1 and the results of Gorham et al. (2016).

Note that F_IMQ(t) is finite for all t > 0, so fix any ε ∈ [0, ε_0) and δ > 0. If S(µ_m, T_P, G_k) → 0, then lim sup_m d_{BL‖·‖}(µ_m, P) ≤ (2 + ε + δ^{−1}(d θ_{d−1}/θ_d) M_P) ε. Thus, taking ε → 0 yields d_{BL‖·‖}(µ_m, P) → 0. Since d_{BL‖·‖}(µ_m, P) → 0 only if µ_m ⇒ P, the statement of Theorem 8 follows.


H.1. Proof of Theorem 15: IMQ KSD lower bound

Fix any α ∈ (0, (β + 1)/2) and a > c/2. Then there is some g ∈ G_k such that T_P g is bounded below by a constant ζ(a, c, α, β) and has a growth rate of ‖x‖_2^{2α} as ‖x‖_2 → ∞. Such a function exists by the following lemma, proved in Section H.2.

Lemma 16 (Generalized multiquadric Stein sets yield coercive functions). Suppose P ∈ P and k(x, y) = Φ_{c,β}(x − y) for Φ_{c,β}(x) ≜ (c^2 + ‖x‖_2^2)^β, c > 0, and β ∈ R \ N_0. Then, for any α ∈ (0, (β + 1)/2) and a > c/2, there exists a function g ∈ G_k such that T_P g is bounded below by

ζ(a, c, α, β) ≜ −D(a, c, α, β)^{−1/2} [M_1(b) R_0^2 + ‖b(0)‖_2 R_0 + d]/a^{2(1−α)},    (17)

where the function D is defined in (20) and R_0 ≜ inf{r > 0 | κ(r′) ≥ 0, ∀r′ ≥ r}. Moreover, lim inf ‖x‖_2^{−2α} (T_P g)(x) ≥ α κ_0/D(a, c, α, β)^{1/2} as ‖x‖_2 → ∞.

Our next lemma connects the growth rate of T_P g to the tightness rate of a probability measure evaluated with the Stein discrepancy. Its proof is found in Section H.3.

Lemma 17 (Coercive functions yield tightness). Suppose there is a g ∈ G such that T_P g is bounded below by ζ ∈ R and lim inf_{‖x‖_2→∞} ‖x‖_2^{−u} (T_P g)(x) > η for some η, u > 0. Then, for all ε sufficiently small and any probability measure µ, the tightness rate (10) satisfies

R(µ, ε) ≤ [ (S(µ, T_P, G) − ζ)/(ε η) ]^{1/u}.

In particular, if lim sup_m S(µ_m, T_P, G_k) is finite, (µ_m)_{m≥1} is uniformly tight.

We can thus plug the tightness rate estimate of Lemma 17, applied to the function g of Lemma 16, into Theorem 13. Since ‖ω‖_∞ ≤ t implies ‖ω‖_2 ≤ √d t, we can use the formula (18) for the generalized Fourier transform of the IMQ kernel, which is monotonically decreasing in ‖ω‖_2, to establish (16). Taking η → α κ_0/D(a, c, α, β)^{1/2} we obtain (14).

To prove (15), notice that F_IMQ(t) = O(e^{(c√d + λ)t}) as t → ∞ for any λ > 0 by (19). Hence, by choosing ε = c√d/log(1/S(µ, T_P, G_k)) and δ = 1/ε^{1/α}, we obtain the advertised decay rate as S(µ, T_P, G_k) → 0. The uniform tightness conclusion follows from Lemma 17.

H.2. Proof of Lemma 16: Generalized multiquadric Stein sets yield coercive functions

By (Wendland, 2004, Thm. 8.15), Φ_{c,β} has a generalized Fourier transform of order max(0, ⌈β⌉) given by

Φ̂_{c,β}(ω) = (2^{1+β}/Γ(−β)) (‖ω‖_2/c)^{−β−d/2} K_{β+d/2}(c‖ω‖_2),    (18)

where K_v(z) is the modified Bessel function of the third kind. Furthermore, by (Wendland, 2004, Cor. 5.12, Lem. 5.13, Lem. 5.14), we have the following bounds on K_v(z) for v, z ∈ R:

K_v(z) ≥ τ_v e^{−z}/√z for z ≥ 1, where τ_v = √π/2 for |v| ≥ 1/2 and τ_v = √π 3^{|v|−1/2}/(2^{|v|+1} Γ(|v| + 1/2)) for |v| < 1/2,    (19)
K_v(z) ≥ e^{−1} τ_v z^{−|v|} for z ≤ 1 (since x ↦ x^v K_{−v}(x) is non-decreasing and K_v = K_{−v}),
K_v(z) ≤ min( √(2π/z) e^{−z + v^2/z}, 2^{|v|−1} Γ(|v|) z^{−|v|} ) for z > 0.
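Formula (18) can be spot-checked against direct numerical integration when β < −1/2, where Φ_{c,β} is integrable and the generalized Fourier transform coincides with the classical (unitary) one. The sketch below (our own; the parameters c = 1, β = −3/4, d = 1 are chosen only so that the integral converges) compares the two.

import numpy as np
from scipy.integrate import quad
from scipy.special import gamma, kv

c, beta, d = 1.0, -0.75, 1      # beta < -1/2 so Phi_{c,beta} is integrable in d = 1

def Phi_hat_formula(w):         # right-hand side of (18)
    return 2**(1 + beta) / gamma(-beta) * (abs(w) / c)**(-beta - d / 2) * kv(beta + d / 2, c * abs(w))

def Phi_hat_numeric(w):         # unitary transform (2*pi)^(-1/2) times the integral of Phi(x) cos(w x)
    val, _ = quad(lambda x: (c**2 + x**2)**beta, 0.0, np.inf, weight='cos', wvar=w)
    return 2.0 * val / np.sqrt(2.0 * np.pi)

for w in [0.5, 1.0, 3.0]:
    print(w, Phi_hat_formula(w), Phi_hat_numeric(w))   # the two columns agree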

Now fix any a > c/2 and α ∈ (0, (β + 1)/2), and consider the functions g_j(x) = ∇_{x_j} Φ_{a,α}(x) = 2α x_j (a^2 + ‖x‖_2^2)^{α−1}. We will show that g = (g_1, . . . , g_d) ∈ K_k^d. Note that ĝ_j(ω) = (i ω_j) Φ̂_{a,α}(ω).


Using (Wendland, 2004, Thm. 10.21), we know ‖g_j‖_{K_k} = ‖ĝ_j/Φ̂_{c,β}^{1/2}‖_{L^2}, and thus ‖g‖_{K_k^d} = ‖ĝ/Φ̂_{c,β}^{1/2}‖_{L^2}. Hence

‖g‖_{K_k^d}^2 = ∑_{j=1}^d ∫_{R^d} |ĝ_j(ω)|^2/Φ̂_{c,β}(ω) dω
  = ∑_{j=1}^d ∫_{R^d} [2^{2(1+α)}/Γ(−α)^2]/[2^{1+β}/Γ(−β)] · (a^{2α+d}/c^{β+d/2}) ω_j^2 ‖ω‖_2^{β−2α−d/2} K_{α+d/2}(a‖ω‖_2)^2/K_{β+d/2}(c‖ω‖_2) dω
  = c_0 ∫_{R^d} ‖ω‖_2^{β−2α−d/2+2} K_{α+d/2}(a‖ω‖_2)^2/K_{β+d/2}(c‖ω‖_2) dω,

where c_0 = [2^{2(1+α)}/Γ(−α)^2]/[2^{1+β}/Γ(−β)] · a^{2α+d}/c^{β+d/2}. We can split the integral above into two, the first integrating over B(0, 1) and the second over B(0, 1)^c = R^d \ B(0, 1). Thus, using the inequalities from (19) with v_0 ≜ β + d/2, we have

∫_{B(0,1)} ‖ω‖_2^{β−2α−d/2+2} K_{α+d/2}(a‖ω‖_2)^2/K_{β+d/2}(c‖ω‖_2) dω ≤ ∫_{B(0,1)} ‖ω‖_2^{β−2α−d/2+2} · 2^{2α+d−2} Γ(α + d/2)^2 (a‖ω‖_2)^{−2α−d}/(e^{−1} τ_{v_0} (c‖ω‖_2)^{−β−d/2}) dω
  ≤ 2^{2α+d−2} Γ(α + d/2)^2 (e/τ_{v_0}) (c^{β+d/2}/a^{2α+d}) ∫_{B(0,1)} ‖ω‖_2^{2β−4α−d+2} e^{c‖ω‖_2} dω
  = d V_d 2^{2α+d−2} Γ(α + d/2)^2 (e/τ_{v_0}) (c^{β+d/2}/a^{2α+d}) ∫_0^1 r^{2β−4α+1} e^{cr} dr,

where V_d is the volume of the unit ball in d dimensions and in the last step we used the substitution r = ‖ω‖_2 (Baker, 1999). Since α < (β + 1)/2 and the function r ↦ r^t is integrable around the origin when t > −1, we can bound the integral above by

∫_0^1 r^{2β−4α+1} e^{cr} dr ≤ e^c ∫_0^1 r^{2β−4α+1} dr = e^c/(2β − 4α + 2).

We can apply the same technique to the other integral, yielding

∫_{B(0,1)^c} ‖ω‖_2^{β−2α−d/2+2} K_{α+d/2}(a‖ω‖_2)^2/K_{β+d/2}(c‖ω‖_2) dω ≤ ∫_{B(0,1)^c} ‖ω‖_2^{β−2α−d/2+2} · (2π/(a‖ω‖_2)) e^{−2a‖ω‖_2 + 2(α+d/2)^2/(a‖ω‖_2)}/(τ_{v_0} e^{−c‖ω‖_2}/√(c‖ω‖_2)) dω
  ≤ (2π√c/(a τ_{v_0})) ∫_{B(0,1)^c} ‖ω‖_2^{β−2α−d/2+3/2} e^{(c−2a)‖ω‖_2 + 2(α+d/2)^2/(a‖ω‖_2)} dω
  = d V_d (2π√c/(a τ_{v_0})) ∫_1^∞ r^{β−2α+d/2+1/2} e^{(c−2a)r + 2(α+d/2)^2/(ar)} dr.

Since c − 2a < 0, we can upper bound the last integral above by the quantity

∫_1^∞ r^{β−2α+d/2+1/2} e^{(c−2a)r + 2(α+d/2)^2/(ar)} dr ≤ e^{(c−2a) + 2(α+d/2)^2/a} ∫_1^∞ r^{β−2α+d/2+1/2} e^{(c−2a)r} dr
  = e^{(c−2a) + 2(α+d/2)^2/a} (2a − c)^{−β+2α−d/2−3/2} Γ(β − 2α + d/2 + 3/2, 2a − c),

where Γ(s, x) ≜ ∫_x^∞ t^{s−1} e^{−t} dt is the upper incomplete gamma function. Hence, the function g belongs to K_k^d with norm upper bounded by D(a, c, α, β)^{1/2}, where

D(a, c, α, β) ≜ d V_d (2^{1+2α−β} a^{2α+d} Γ(−β))/(c^{β+d/2} Γ(−α)^2) × ( 2^{2α+d−2} Γ(α + d/2)^2 e^{c+1} c^{β+d/2}/(τ_{v_0}(2β − 4α + 2) a^{2α+d}) + (2π√c/(a τ_{v_0})) e^{(c−2a) + 2(α+d/2)^2/a} (2a − c)^{−β+2α−d/2−3/2} Γ(β − 2α + d/2 + 3/2, 2a − c) ).    (20)


Now define ḡ ≜ −D(a, c, α, β)^{−1/2} g, so that ḡ ∈ G_k. We will lower bound the growth rate of T_P ḡ and also construct a uniform lower bound. Note that

(D(a, c, α, β)^{1/2}/(2α)) (T_P ḡ)(x) = −〈b(x), x〉/(a^2 + ‖x‖_2^2)^{1−α} − d/(a^2 + ‖x‖_2^2)^{1−α} + 2(1 − α)‖x‖_2^2/(a^2 + ‖x‖_2^2)^{2−α}.    (21)

The latter two terms are both uniformly bounded in x. By the distant dissipativity assumption, there is some κ > 0 such that lim sup_{‖x‖_2→∞} ‖x‖_2^{−2}〈b(x), x〉 ≤ −κ/2. Thus the first term of (21) grows at least at the rate (κ/2)‖x‖_2^{2α}. This assures lim inf ‖x‖_2^{−2α} (T_P ḡ)(x) ≥ α κ/D(a, c, α, β)^{1/2} as ‖x‖_2 → ∞.

Moreover, because b is Lipschitz, we have

|〈b(x), x〉| ≤ |〈b(x) − b(0), x − 0〉| + |〈b(0), x〉| ≤ M_1(b)‖x‖_2^2 + ‖b(0)‖_2‖x‖_2.

Hence, for any x ∈ B(0, R_0), we must have −〈b(x), x〉 ≥ −M_1(b)R_0^2 − ‖b(0)‖_2 R_0. By the choice of R_0, for all x ∉ B(0, R_0), the distant dissipativity assumption implies −〈b(x), x〉 ≥ 0. Applying these bounds to (21) shows that T_P ḡ is uniformly lower bounded by ζ(a, c, α, β).

H.3. Proof of Lemma 17: Coercive functions yield tightness

Pick g ∈ G_k such that lim inf_{‖x‖_2→∞} ‖x‖_2^{−u}(T_P g)(x) > η and inf_{x∈R^d} (T_P g)(x) ≥ ζ. Let us define γ(r) ≜ inf{(T_P g)(x) − ζ | ‖x‖_2 ≥ r} ≥ 0 for all r > 0. Thus, for sufficiently large r, we have γ(r) ≥ η r^u. Then, for any measure µ, by Markov's inequality,

µ(‖X‖_2 ≥ r) ≤ E_µ[γ(‖X‖_2)]/γ(r) ≤ E_µ[(T_P g)(X) − ζ]/γ(r).

Thus we see that µ(‖X‖_2 ≥ r_ε) ≤ ε whenever ε ≥ (S(µ, T_P, G_k) − ζ)/γ(r_ε). This implies that, for sufficiently small ε, if

r_ε ≥ [ (S(µ, T_P, G_k) − ζ)/(η ε) ]^{1/u},

we must have µ(‖X‖_2 ≥ r_ε) ≤ ε. Hence, whenever lim sup_m S(µ_m, T_P, G_k) is bounded, (µ_m)_{m≥1} must be uniformly tight, as lim sup_m R(µ_m, ε) is finite.

I. Proof of Proposition 9: KSD detects convergence

We will first state and prove a useful lemma.

Lemma 18 (Stein output upper bound). Let Z ∼ P and X ∼ µ. If the score function b = ∇ log p is Lipschitz with E_P[‖b(Z)‖_2^2] < ∞, then, for any g : R^d → R^d with max(M_0(g), M_1(g), M_2(g)) < ∞,

|E_µ[(T_P g)(X)]| ≤ (M_0(g) M_1(b) + M_2(g) d) d_{W‖·‖_2}(µ, P) + √(2 M_0(g) M_1(g) E_P[‖b(Z)‖_2^2] d_{W‖·‖_2}(µ, P)),

where the Wasserstein distance d_{W‖·‖_2}(µ, P) = inf_{X∼µ, Z∼P} E[‖X − Z‖_2].

Proof. By Jensen's inequality, we have E_P[‖b(Z)‖_2] ≤ √(E_P[‖b(Z)‖_2^2]) < ∞, which implies that E_P[(T_P g)(Z)] = 0 (Gorham & Mackey, 2015, Prop. 1). Thus, using the triangle inequality, Jensen's inequality, and the Fenchel–Young inequality for dual norms,

|E_µ[(T_P g)(X)]| = |E[(T_P g)(Z) − (T_P g)(X)]|
  = |E[〈b(Z), g(Z) − g(X)〉 + 〈b(Z) − b(X), g(X)〉 + 〈I, ∇g(Z) − ∇g(X)〉]|
  ≤ E[|〈b(Z), g(Z) − g(X)〉|] + (M_0(g) M_1(b) + M_2(g) d) E[‖X − Z‖_2].

To handle the remaining term, notice that, by Cauchy–Schwarz and the fact that min(a, b) ≤ √(ab) for a, b ≥ 0,

E[|〈b(Z), g(Z) − g(X)〉|] ≤ E[min(2 M_0(g), M_1(g)‖X − Z‖_2) ‖b(Z)‖_2]
  ≤ (2 M_0(g) M_1(g))^{1/2} E[‖X − Z‖_2^{1/2} ‖b(Z)‖_2]
  ≤ √(2 M_0(g) M_1(g) E[‖X − Z‖_2] E_P[‖b(Z)‖_2^2]).


The stated inequality now follows by taking the infimum of these bounds over all joint distributions (X, Z) with X ∼ µ and Z ∼ P.

Now we are ready to prove Proposition 9. In the argument below, we use α ∈ N^d as a multi-index for the differentiation operator D^α; that is, for a differentiable function f : R^d → R and all x ∈ R^d,

D^α f(x) ≜ ∂^{|α|} f(x)/(∂x_1^{α_1} · · · ∂x_d^{α_d}),

where |α| = ∑_{j=1}^d α_j. Pick any g ∈ G_k, and choose any multi-index α ∈ N^d such that |α| ≤ 2. Then, by Cauchy–Schwarz and (Steinwart & Christmann, 2008, Lem. 4.34), we have

sup_{x∈R^d} |D^α g_j(x)| = sup_{x∈R^d} |D^α 〈g_j, k(x, ·)〉_{K_k}| ≤ sup_{x∈R^d} ‖g_j‖_{K_k} ‖D_x^α k(x, ·)‖_{K_k} = ‖g_j‖_{K_k} sup_{x∈R^d} (D_x^α D_y^α k(x, x))^{1/2}.

Since ∑_{j=1}^d ‖g_j‖_{K_k}^2 ≤ 1 for all g ∈ G_k and D_x^α D_y^α k(x, x) is uniformly bounded in x for all |α| ≤ 2, the elements of the vector g(x), the matrix ∇g(x), and the tensor ∇^2 g(x) are uniformly bounded in x ∈ R^d and g ∈ G_k. Hence, for some λ_k, sup_{g∈G_k} max(M_0(g), M_1(g), M_2(g)) ≤ λ_k < ∞, so the advertised result follows from Lemma 18 as

S(µ, T_P, G_k) ≤ λ_k ( (M_1(b) + d) d_{W‖·‖_2}(µ, P) + √(2 E_P[‖b(Z)‖_2^2] d_{W‖·‖_2}(µ, P)) ).
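Proposition 9's qualitative message, that the KSD is controlled by the Wasserstein distance, can be observed on a toy example. The sketch below (our own; the one-dimensional Gaussian target, the IMQ kernel with c = 1 and β = −1/2, and the use of a large reference sample as a stand-in for P inside scipy's wasserstein_distance are all illustrative choices) shows both quantities shrinking together as a mean-shifted sample approaches the target.

import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)

# IMQ Stein kernel k0 for the target P = N(0,1) (score b(x) = -x) and k(x,y) = (1 + (x-y)^2)^(-1/2)
def k0_imq(x):
    z = x[:, None] - x[None, :]
    s = 1.0 + z**2
    return x[:, None] * x[None, :] * s**-0.5 + (1.0 - z**2) * s**-1.5 - 3.0 * z**2 * s**-2.5

def ksd(x):
    return np.sqrt(k0_imq(x).sum()) / len(x)

z_ref = rng.standard_normal(50_000)             # large reference sample standing in for P
for shift in [2.0, 1.0, 0.5, 0.1, 0.0]:
    x = shift + rng.standard_normal(2_000)      # biased, mean-shifted sample
    print(shift, ksd(x), wasserstein_distance(x, z_ref))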

J. Proof of Theorem 10: KSD fails for bounded scores

Fix some n ≥ 1, and let Q_n = (1/n) ∑_{i=1}^n δ_{x_i}, where x_i ≜ i n e_1 ∈ R^d for i ∈ {1, . . . , n}. This implies ‖x_i − x_{i′}‖_2 ≥ n for all i ≠ i′. We will show that when M_0(b) is finite, S(Q_n, T_P, G_k) → 0 as n → ∞.

We can express k_0(x, y) ≜ ∑_{j=1}^d k_0^j(x, y) as

k_0(x, y) = 〈b(x), b(y)〉 k(x, y) + 〈b(x), ∇_y k(x, y)〉 + 〈b(y), ∇_x k(x, y)〉 + 〈∇_x, ∇_y k(x, y)〉.

From Proposition 2, we have

S(Q_n, T_P, G_k)^2 = (1/n^2) ∑_{i,i′=1}^n k_0(x_i, x_{i′}) = (1/n^2) ∑_{i=1}^n k_0(x_i, x_i) + (1/n^2) ∑_{i≠i′} k_0(x_i, x_{i′}).    (22)

Let γ be the kernel decay rate defined in the statement of Theorem 6. Then, as k ∈ C_0^{(1,1)}, we must have γ(0) < ∞ and lim_{r→∞} γ(r) = 0. By the triangle inequality,

lim_{n→∞} |(1/n^2) ∑_{i=1}^n k_0(x_i, x_i)| ≤ lim_{n→∞} (1/n^2) ∑_{i=1}^n |k_0(x_i, x_i)| ≤ lim_{n→∞} (γ(0)/n)(M_0(b) + 1)^2 = 0.

We now handle the second term of (22). By repeated use of Cauchy–Schwarz, we have

|k_0(x_i, x_{i′})| ≤ |〈b(x_i), b(x_{i′})〉 k(x_i, x_{i′})| + |〈b(x_i), ∇_y k(x_i, x_{i′})〉| + |〈b(x_{i′}), ∇_x k(x_i, x_{i′})〉| + |〈∇_x, ∇_y k(x_i, x_{i′})〉|
  ≤ ‖b(x_i)‖_2 ‖b(x_{i′})‖_2 |k(x_i, x_{i′})| + ‖b(x_i)‖_2 ‖∇_y k(x_i, x_{i′})‖_2 + ‖b(x_{i′})‖_2 ‖∇_x k(x_i, x_{i′})‖_2 + |〈∇_x, ∇_y k(x_i, x_{i′})〉|
  ≤ γ(n)(M_0(b) + 1)^2.

By assumption, γ(r) → 0 as r → ∞. Furthermore, since the second term of (22) is upper bounded by the average of the terms |k_0(x_i, x_{i′})| for i ≠ i′, we have S(Q_n, T_P, G_k) → 0 as n → ∞. However, (Q_n)_{n≥1} is not uniformly tight and hence does not converge weakly to the target P.
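As with Theorem 6, the computation in (22) is easy to reproduce. The sketch below (our own illustration; the target density p(x) ∝ sech(x), which has the bounded Lipschitz score b(x) = −tanh(x), and the IMQ kernel k(x, y) = (1 + (x − y)^2)^{−1/2} are example choices in d = 1, not taken from the theorem) shows the KSD of the escaping point sequence x_i = i·n decaying toward zero even though Q_n is not tight.

import numpy as np

# Stein kernel k0 for a target with bounded score b(x) = -tanh(x) (density p(x) proportional to sech(x))
# and the IMQ base kernel k(x, y) = (1 + (x - y)^2)^(-1/2), following the expansion of k0 above.
def k0_bounded_score(x):
    z = x[:, None] - x[None, :]
    s = 1.0 + z**2
    bx, by = -np.tanh(x)[:, None], -np.tanh(x)[None, :]
    return bx * by * s**-0.5 + (bx - by) * z * s**-1.5 + s**-1.5 - 3.0 * z**2 * s**-2.5

# Theorem 10 point sequence x_i = i * n, i = 1, ..., n, marching off to infinity
for n in [10, 100, 1000]:
    x = float(n) * np.arange(1, n + 1)
    print(n, np.sqrt(k0_bounded_score(x).sum()) / n)   # KSD -> 0 although Q_n escapes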