integral geometry, hamiltonian dynamics, and markov chain
Post on 06-Nov-2021
3 Views
Preview:
TRANSCRIPT
Integral Geometry, Hamiltonian Dynamics, and
Markov Chain Monte Carlo
by MASS
Oren Mangoubi
B.S., Yale University (2011)
Submitted to the Department of Mathematicsin partial fulfillment of the requirements for the degree of
Doctor of Philosophy
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
ACHUSES INS ITUTEOF TECHNOLOGY
JUN 16 2016
LIBRARIES
MCHVES
June 2016
@ Oren Mangoubi, MMXVI. All rights reserved.
The author hereby grants to MIT permission to reproduce and todistribute publicly paper and electronic copies of this thesis document
in whole or in part in any medium now known or hereafter created.
AuthorSignature redacted ..................C/
Department of Mathematics
Certified by. Signature redactedApril 28, 2016
Alan EdelmanProfessor
Thesis Supervisor
Accepted bySignature redactedJonathan Kelner
Chairman, Applied Mathematics Committee
2
Integral Geometry, Hamiltonian Dynamics, and Markov
Chain Monte Carlo
by
Oren Mangoubi
Submitted to the Department of Mathematicson April 28, 2016, in partial fulfillment of the
requirements for the degree ofDoctor of Philosophy
Abstract
This thesis presents applications of differential geometry and graph theory to thedesign and analysis of Markov chain Monte Carlo (MCMC) algorithms. MCMC al-gorithms are used to generate samples from an arbitrary probability density ir incomputationally demanding situations, since their mixing times need not grow expo-nentially with the dimension of w. However, if w has many modes, MCMC algorithmsmay still have very long mixing times. It is therefore crucial to understand and reduceMCMC mixing times, and there is currently a need for global mixing time bounds aswell as algorithms that mix quickly for multi-modal densities.
In the Gibbs sampling MCMC algorithm, the variance in the size of modes inter-sected by the algorithm's search-subspaces can grow exponentially in the dimension,greatly increasing the mixing time. We use integral geometry, together with the Hes-sian of r and the Chern-Gauss-Bonnet theorem, to correct these distortions and avoidthis exponential increase in the mixing time. Towards this end, we prove a general-ization of the classical Crofton's formula in integral geometry that can allow one togreatly reduce the variance of Crofton's formula without introducing a bias.
Hamiltonian Monte Carlo (HMC) algorithms are some the most widely-used MCMCalgorithms. We use the symplectic properties of Hamiltonians to prove global Cheeger-type lower bounds for the mixing times of HMC algorithms, including RiemannianManifold HMC as well as No-U-Turn HMC, the workhorse of the popular Bayesiansoftware package Stan. One consequence of our work is the impossibility of energy-conserving Hamiltonian Markov chains to search for far-apart sub-Gaussian modes inpolynomial time. We then prove another generalization of Crofton's formula that ap-plies to Hamiltonian trajectories, and use our generalized Crofton formula to improvethe convergence speed of HMC-based integration on manifolds.
We also present a generalization of the Hopf fibration acting on arbitrary- ghost-valued random variables. For # = 4, the geometry of the Hopf fibration is encodedby the quaternions; we investigate the extent to which the elegant properties of thisencoding are preserved when one replaces quaternions with general 0 > 0 ghosts.
3
Thesis Supervisor: Alan EdelmanTitle: Professor
4
Acknowledgments
I am very grateful to my advisor and coauthor Alan Edelman' for his guidance and
collaboration on this thesis. I am also deeply grateful to my coauthor Natesh Pillai2
for his collaboration and advice on the Hamiltonian mixing times chapter of this
thesis. I could not have finished this thesis without their insights. I am deeply
thankful as well for indispensable advice and insights from Aaron Smith3 , Youssef
Marzouk4 , Michael Betancourt5 , Jonathan Kelner1 , Michael La Croixi, Jiahao Chen',
Laurent Demanet', Dennis Amelunxen', Ofer Zeitouni' 8 , Neil Shephard2 , and Nawaf
Bou-Rabee'.
I would also like to thank my mentors and previous coauthors Stephen Morse'o,
Yakar Kannai7 , Edwin Marengo", and Lucio Frydman"2 . I am very grateful to my
other mentors and professors at MIT and Yale, especially Roger Howe1", Gregory
Margulis13 , Victor Chernozhukov1 4 , Kumpati Narendra'0 , Andrew Barron15, Ivan
Marcus16 , Paulo Lozano 4, and Manuel Martinez-Sanchez'. For valuable opportuni-
ties to learn, teach and conduct research, I would like to thank the MIT Mathematics
department and the Theory of Computation group at the MIT Computer Science and
Artificial Intelligence Laboratory (CSAIL), as well as the Yale Mathematics and Elec-
trical Engineering departments, the Weizmann Institute Mathematics and Chemical
Physics departments, and the Northeastern Electrical Engineering department.
I am very thankful to have been blessed with a kind and loving family for who's
'MIT Mathematics Department2 Harvard Statistics Department3University of Ottawa Mathematics and Statistics Department4 MIT Department of Aeronautics and Astronautics5University of Warwick Statistics Department6 City University of Hong Kong Mathematics Department7Weizmann Institute of Science Mathematics Department8 Courant Institute of Mathematical Sciences at NYU9Rutgers Mathematical Sciences Department0Yale Electrical Engineering Department
"Northeastern University Electrical Engineering Department12Weizmann Institute of Science Chemical Physics Department13Yale Mathematics Department1 4MIT Economics Department"5 Yale Statistics Department' 6 Yale History Department
5
encouragement and support I am forever grateful: My mother and father, my brothers
Tomer and Daniel, and, most importantly, my grandparents M6m6, Oma and Opa,
as well as P6p6 (of blessed memory). I am also very thankful to my friends for their
kindness and companionship. My schoolteachers at Schechter and Gann, especially
my Mathematics teacher Mrs. Voolich and my Science teacher Mrs. Schreiber, have
been an inspiration to me as well. I also thank the MITxplore program for giving
me the opportunity to design and teach weekly Mathematics enrichment classes for
children in Cambridge and Boston public schools.
I deeply appreciate the generous support of a National Defense Science and Engi-
neering Graduate (NDSEG) Fellowship, as well as support from the National Science
Foundation (NSF DMS-1312831) and the MIT Mathematics Department.
Thesis Committee:
" Professor Alan Edelman
Thesis Committee Chairman and Thesis Advisor
Professor of Applied Mathematics, MIT
" Professor Natesh Pillai
Associate Professor of Statistics, Harvard
* Professor Youssef Marzouk
Associate Professor of Aeronautics and Astronautics, MIT
" Professor Jonathan Kelner
Associate Professor of Applied Mathematics, MIT
6
Contents
I Introductih 9
1.1 Somic wid liy-u -&( 1\CL C ig1( 1 u I .. .. .. . .. . . . .... . . . . 10
L 1. 1 RandomN Walk Metropoi . . . . . . . . . . . . . . . . . . . . 10
1.1.2 Gibbs sampling algoritin . . . . . . . . . . . . . . . . . . . . . 10
1.1.3 Hamiltonian Monte Carlo . . . . . . . . . . . . . . 12
1.2 Iiitegral & differential geometrY prelimninari . . . . . . . . . . . . . . 14
1.2. Kine a Wi m neas ... .. . .. .. . .. .. . .. .. . ...15
1.2.2 The Crofton for . . . . . . . . . . . . . . . . . . . . . . . 17
1.2.3 Concentration. . . . . . . . . . . . . . . . . . . . . . 17
1.2.4 The Chern-GaussR nnm. . . . . . . . . . . . . . . 18
1.3 Conitribution-, of this the>.......................11.3 Con rib ti ns f t is h . . . . . . . . . . . . . . . . . . . . . . . . . 19
2 Integral Geometry for Gibw Sa-mmers 21
2.1 Illt roductio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ..21
2.2 A first-ordei ewtn (iiin7 ie I eiuion 1>...... . . . . . . . . . . . 28
2.2.1 The Crofton formula Gibbs sample . . . . . . . . . . . . . . 29
2.2.2 Traditional weights vs. integral geometry weigmt . . . . . . . 30
2.3 A generalized Crofton forini . . . . . . . . . . . . . . . . . . . . . . . 31
2.3.1 The generalized Crofton formula Gibbs saimp . . . . . . . . . 44
2.3.2 The pe-- j ~ ( K~'~; KUi
densite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.3.3 An NC Im n aie( aw o
t e e . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 7
7
2.3.5 Higliei-or( hi' Cv~veril-Lill .. -u e .(W.1.j .t.1. 50
2. 3. 6 Colle ctioni-of-sphieres xaiilIe aii( oietatidm-)-miii 51
2.3.7 Vairianiicc due to )1r(l-efltYklma(nisue((
2..> lieoiet a; )t )oul1(s (Ieive( usin- 1C C .i en 31 i t-O)raj(
oiitY....... .. ...... .. ...... .. ...... .. ..... .. .. .... 60
2.4 Ranidoiii iilatrix a-pplicattionl: Sa111pf)lig 0t lestocia'st a 62
2.4.1 Approxiinate sanljplilig alo-ov)Y1+2 1!?~Cr 2...........63
2.5 Conditioinlg oni iiiltiple eigenivall( .. ...... ..... ....... 65
2.(6 Coniditioilling on- ai singp.le-eigeinvalti ue raI' er.. ...... ....... 66
3 Mixing Times of H-amiitouia-i 1\Iloit~e Carlc 71i
3.1 Itroducti . .... ..... ...... ..... ..... ........ 71
3.2 Haiifltoia 1, AUI I,- I .................................. . .. .. .. ....72
3.3 Clieger- bo)01 s111 in X!ItLCJC > ......................... 78
:3.4 Daindoiii \\itlIK iiiip1 rnn u&'..... ... .. .. .. .. .. .. .... 80
4 A Generalization of Crofton's Formula to H ~r'i
with Applictn~ii ~ atho IV~Ionte c>'a,2. 85
4.2 Cr-oftoni foiiuime 101- liiniironin uIvnanmII-. ................ 85
4.3 Mnhuifoli integrationi usinig HNWC anil the fllii 1111 ( irotti Oi bul 88
5 A Hopf Fibration for ' -Ghost Gaiissiau- 91
5.1 hit-odlUCti(o .. ... ..... ...... ..... ..... ........ 91
5.2 Defiingi th .......... .......................... 92
5.3 Hopf Fibrationi oni'.. .... ..... ..... ...... ..... ... 94
8
-1
Chapter 1
Introduction
Applications of sampling on probability distributions, defined on Euclidean space
or on other manifolds, arise in many fields, such as Statistics [18, 6, 24], Machine
Learning [4], Statistical Mechanics [39], General Relativity [16], Molecular Biology
[15], Linguistics [5], and Genetics [31]. In many cases these probability distributions
are difficult to sample from with straightforward methods such as rejection sampling
because the events we are conditioning on are very rare, or the probability density
concentrates in some small regions of space. Typically, the complexity of sampling
from these distributions grows exponentially with the dimension of the space. In
such situations, we require alternative sampling methods whose complexity promises
not to grow exponentially with dimension. In Markov chain Monte Carlo (MCMC)
algorithms, one of the most commonly used such methods, we run a Markov chain
that converges to the desired probability distribution [12].
MCMC algorithms are used to generate samples from an arbitrary probability
density 7r in computationally demanding situations such as high-dimensional Bayesian
statistics [60], machine learning [4], and molecular biology [15], since their mixing
times need not grow exponentially with the dimension of 7r. However, if ir has many
modes, MCMC algorithms may still have very long mixing times [35, 48, 28]. It
is therefore crucial to understand and reduce MCMC mixing times, and there is
currently a need for global mixing time bounds as well as algorithms that mix quickly
for multi-modal densities.
9
1.1 Some widely-used MCMC algorithms
In this section we review some widely-used MCMC algorithms.
1.1.1 Random Walk Metropolis
The Random Walk Metropolis (RWM) algorithm (Algorithm 1) is the most basic
MCMC algorithm. At each step of the Markov chain, the RWM algorithm proposes
to take the next step ii+1 in a random direction and distance from the current position
xi. The step is accepted with a probability of min{', 1}. If the step is rejected,
the algorithm stays at its current position until the next time step.
Algorithm 1 Random Walk Metropolis [45]
input: c > 0, xO, oracle for 7r : R' -+ [0, oc)output: x 1 , X2 , . .. , with stationary distribution 7r
1: for i = 1, 2, ... do2: Sample independent h ~ f(0, e)n3: set x'i+ = xi + h4: set xi+1 = i+1 with probability min{{ ,Gi+i) 1}; Else, set xi+1 = xi5: end for
Although the RWM algorithm is widely-used in practice due to its simplicity [7],
its mixing time slows down quadratically with a decrease in the step size since its
associated "random walk" behavior approximates a diffusion. We will discuss this
slowdown for the RWM algorithm further in Chapter 3.
1.1.2 Gibbs sampling algorithms
Gibbs sampling MCMC algorithms [23] offer one way of taking larger steps to avoid
the quadratic slowdown associated with diffusion-like "random walk" behavior. Gibbs
sampling algorithms work by sampling the next step xi+1 in the Markov chain from
the probability density 7 conditioned on a random search subspace S. xi+1 may be
sampled from S using a subroutine Markov chain or another sampling method. If a
subroutine Markov chain is used, the Gibbs sampler is oftentimes referred to as the
10
"Metropolis-within-Gibbs" algorithm, where the term "Metropolis" is used loosely
here to denote the subroutine Markov chain. The search subspace may be a line or
a multi-dimensional plane passing through xi. In this thesis we will consider search
subspaces with isotropic random orientation.
Algorithm 2 Gibbs Sampler (with isotropic random search subspaces) [23]
input: x0 , oracle for 7r : R --+ [0, oc)output: x 1 , x2 , .. ., with stationary distribution 7r
1: for i = 1, 2, ... do2: Sample a k-dimensional isotropic random search subspace S passing through xi3: Sample xi+1 with probability proportional to the restriction of ir(x) to S (using
a subroutine Markov chain or another sampling method)4: end for
Algorithm 3 Gibbs Sampler (for 7r supported on submanifold)
Algorithm 3 is identical to Algorithm 2 , except for the following steps:
3: Sample xi+1 from r(x)/I aII restricted to Sn M (using a subroutine Markovchain or another sampling method)
(Here I| denotes the product of the singular values of the projection map from
S onto M', the orthogonal complement of the tangent space of M at x.)
If the manifold can be mapped onto a sphere, it is sometimes simpler to bypass
the primary Markov chain in the Gibbs sampler and sample the manifold directly by
intersecting it with random subspaces moving according to the kinematic measure:
Algorithm 4 Great sphere sampler
Algorithm 4 is identical to Algorithms 2 and 3, except for the following steps:
input: oracle for 7r supported on a manifold M C Sn.
2: Sample a search subspace Si C S that is an isotropic random great sphere
independent of xi.
One problem with Gibbs sampling algorithms when sampling from distributions
with multiple modes is that the orientation of the search subspace S can greatly
11
distort the apparent size of a mode, slowing the algorithm. In Chapter 2 we use
concentration of measure to quantify how much these distortions slow down the Gibbs
sampling algorithm. We also show how one can use integral geometry to eliminate
some of these distortions.
1.1.3 Hamiltonian Monte Carlo algorithms
Like Gibbs sampling algorithms, Hamiltonian Monte Carlo (HMC) algorithms [19]
seek to avoid quadratic slowdowns associated with diffusion-like "random walk" be-
havior. They do so by simulating the trajectory of a Hamiltonian particle for some
amount of time T, and then refreshing the momentum according to the momentum's
Boltzman distribution from statistical mechanics. Since the particle has momentum,
it will tend to take large steps in the direction of the momentum, avoiding "random
walk" behavior. Since Hamiltonian trajectories conserve energy, there is no need to
reject any proposed steps. For this reason HMC algorithms work especially well in
high dimensions since concentration of the posterior measure 7r causes most other
MCMC algorithms to either sample steps that are very close (leading to "random
walk" behavior) or to propose steps that have low probability density and are thus
rejected with high probability.
In this section we review three commonly-used HMC algorithms (Figure 1-1). The
first two of these HMC algorithms (Algorithms 4 and 5) form the workhorse of the
popular Bayesian software package Stan [10]. All three algorithms generate a Markov
chain step by integrating a Hamiltonian trajectory for a time T, and refreshing the
momentum. In Chapter 3 we will show global lower bounds on the mixing times
of a large class of HMC algorithms sampling from arbitrary posterior distributions,
including Algorithms 3-5.
Isotropic-Momentum HMC (Algorithm 3) [44] is the most basic HMC algorithm
(Figure 1-1, top, entire solid+dotted trajectory),
The No-U-Turn Sampler is a modification of Algorithm 3 which seeks to take
longer steps by avoiding U-turns in the Hamiltonian trajectories. It does so by stop-
ping the trajectory once any two velocity vectors on the trajectory path form an angle
12
U
got
wool
Figure 1-1: The Isotropic-Momentum HMC (dashed and solid,top), No-U-Turn HMCtrajectory (solid only, top), and Rieimanian manifold HMC trajectory (bottom).
The Isotropic-Momentum and Riemannian Manifold trajectories evolve for a fixed
time T, while the No-U-Turn trajectory stops once any two momentum vectors on the
trajectory are orthogonal. The Isotropic-Momentum HMC and No-U-Turn HMC both
have spherical Gaussian random initial trajectories, while Riemannian Manifold H MC
has a non-spherical Gaussian random initial trajectory determined by the Hessian of
-log(7r) at t = 0. In Chapter 3, we will use boundaries such as OS to establish lower
bounds on the HMC mixing time.
I13
I
#S N&
IIT.
S 0 to0 0 t
Algorithm 5 Isotropic-Momentum HMC (idealized symplectic integrator) [44]
input: qO, oracle for : R' -+ [0, oc)output: q1, q2 ... ., with stationary distribution 7define: H(q, p) -log(7r(q)) + lp'p.
1: for i = 1, 2, ... do2: Sample independent pi ~ N(0, 1)n3: Integrate Hamiltonian trajectory (q(t),p(t)) with Hamiltonian H over the time
interval [0, T] and initial conditions (p(O), q(0)) = (pi, qi)4: set qg4 1 = q(T)5: end for
of more than 900 (Figure 1-1, top, solid trajectory)
Algorithm 6 Idealized No-U-Turn Sampler HMC (perfect symplectic integrator) [32]
Algorithm 6 is identical to Algorithm 5, except for step 3.
3: Integrate Hamiltonian trajectory (q(t),p(t)) over the time interval [0, T], withinitial conditions (p(O), q(0)) = (pi, qi), where T is the minimum time such thatthe velocity vectors at two points on the trajectory path form an angle greaterthan 900.
Riemannian Manifold HMC seeks to take longer steps by choosing initial momenta
from a multivariate Gaussian distribution that agrees with the local geometry of the
posterior density 7r (Figure 1-1, bottom):
Algorithm 7 Riemannian Manifold HMC (idealized symplectic integrator) [27, 25]
Algorithm 7 is identical to Algorithm 5, except for the following steps:
define: H(q,p) := -log(7r(q)) + cadet(G(q)) + jpT G(q)p, where G(q) is thenon-degenerate Fisher information matrix of 7r at q, and c, = jlog(pi)n.
2: Sample pi ~ .(0, G- 1 (qi))
1.2 Integral & differential geometry preliminaries
In this section we review results from differential geometry, integral geometry, and
concentration of measure that we will use extensively in this thesis.
14
1.2.1 Kinematic measure
Up until this point we have talked about random search-subspaces informally. This
notion of randomness is formally referred to as the kinematic measure [53, 54]. The
kinematic measure provides the right setting to state the Crofton Formula. The kine-
matic measure, as the name suggests, is invariant under translations and rotations.
The random subspace is said to be "moving according to the kinematic measure".
The kinematic measure is the formal way of discussing the following simple situ-
ation: we would like to take a random point p uniformly on the unit sphere or, say,
inside a cube in R'. First we consider the sphere. After choosing p we then choose
an isotropically random plane of dimension d + 1 through the point p and the center
of the sphere. In the case of the sphere, this is simply an isotropic random plane
through the center of the sphere. On a cube there are some technical issues, but the
basic idea of choosing a random point and an isotropic random orientation using that
point as the origin persists. On the cube we would allow any orientation not only
those through a "center". The technical issues relate to the boundary effects of a
finite cube or the lack of a concept of a uniform probability measure on an infinite
space. In any case the spherical geometry is the natural computational setting be-
cause it is compact (If we insist on artificially compactifying RI by conditioning on a
compact subset then either the boundary effects cause the different search-subspaces
to vary greatly in volume, slowing the algorithm, or we must restrict ourselves to such
a large subset of R' that most of the search-subspaces don't pass through much of the
region of interest). However, for the sake of completeness we introduce the kinematic
measure for the Euclidean as well as the spherical constant-curvature space because
it is relevant in more theoretical applications.
In the spherical geometry case, we define the kinematic measure with respect
to a fixed non-random subset Sfixed c S, usually a great subsphere, by the action
of the Haar measure on the special orthogonal group SO(n + 1) on Sfxe.. When
generalizing to Euclidean geometry, we must be a bit more careful, because there is
no uniform probability distribution on Rn. In the case where S has finite d-volume,
15
we can circumvent these issues simply by choosing p to be a point in the poisson point
process. To generalize to planes, we may define the kinematic measure as a poisson-
like point process for our search-subspaces with a translationally and rotationally
invariant distribution on all of R' (the "points" here are the search-subspaces):
Definition 1. (Kinematic measure)
Let Kn e {Sn, R'} be a constant-curvature space. Let Sfied be a d-dimensional
manifold that either has a finite d-volume (in R" or Sn), or is a plane (in RI only).
Let H be the Haar measure on G. If S has finite d-volume we take G to be the group
In of isometries of K". If S is a plane, we instead take G to be the quotient group
In/Id of the isometries on K" with the isometries on Sfix. Let N be the counting
process such that
(i) E[N(A)] = x H(A)
(ii) N(A) and N(B) are independent
for any disjoint Haar-measurable subsets A, B C G, where we drop the 1vold (sfixed)
term if Sfi.x is a plane. We define the kinematic measure with respect to Sfixe C Kn
to be the action of the elements of N on Saxe-
If we wish to actually sample from the kinematic measure for the infinite-measure
space R" in real life, we must restrict ourselves to some (almost surely) finite subset
of the infinite kinematic measure "point" process. For instance, we could condition
on those subspaces that intersect the manifold M that we wish to sample from.
Remark 1. There is in fact a third constant curvature-space, the constant negative-
curvature hyperbolic space Hn (S has constant positive-curvature and Rn constant
zero-curvature). Since the proof of Theorem 3 in chapter 2 seems to rely only on
the constant-curvature of the space, we suspect that nearly identical versions of this
proof and theorem probably apply to hyperbolic space as well. However, we do not
investigate this further as it is beyond the scope of this thesis.
16
1.2.2 The Crofton formula
In this section, we state the Crofton formula [17, 53, 54], which says that the volume of
a manifold M is proportional to the average of the volumes of the intersection SnM of
M with a random search-subspace S moving according to the kinematic measure. Our
first-order reweighting of the Gibbs sampler for submanifolds (section 2.2), referred
to as the "angle-independent" reweighting in the introduction of Chapter 2, is based
on this formula. In Section 2.3, we will prove a generalization of this formula that
will allow for higher-order reweightings. In Chapter 4, we will prove a generalization
of the Crofton formula that applies to trajectories in Hamiltonian dynamics including
trajectories in the HMC algorithms. We will then apply our "Hamiltonian Crofton
Formula" to improve the convergence rate of the HMC algorithm when it is used to
compute integrals on manifolds.
Lemma 1. (Crofton Formula)[1 7, 53, 54]
Let M be a codimension-k submanifold of Kn, where KI e { sn, Rn}. Let S be a
random d-dimensional manifold in Kn of finite volume (or a random plane), moving
according to the kinematic measure. Then there exists a constant Cd,k,n,K such that
Volnk (M) = Cd,k,n,K X Es[Vold-k(S nM)], (1.1)Void(S)
where we set Vold(S) to 1 if S is a plane. In the spherical case we have cd,k,n,S =
Vols S-k x s. cdk,n, is given in [53] and depends on whether Vold(S) is finite.
1.2.3 Concentration of measure
The Concentration of Measure phenomenon ([38], [46]), is the idea that volume con-
centrates in certain regions of high-dimensional space. One well-known result says
that all but an exponentially small (in n) volume of an n - 1 sphere concentrates at
a small distance from any single n - 2 dimensional equator [38].
In Section 2.3.6 we will briefly go over some of our generalizations [43] of this
concentration result to the kinematic measure, which says that most of the intersec-
17
tion volume of an n-sphere with kinematic measure-distributed d-dimensional search-
subspaces concentrates in a fraction of these search-subspaces that is exponentially
small in d, causing the variance of these intersection volumes to grow exponentially as
well. We will then use our concentration results for kinematic measure to compare the
convergence rates of the traditional Gibbs sampler to our curvature-reweighted Gibbs
sampler for an example involving the sampling of a manifold M that is a collection
of spheres.
1.2.4 The Chern-Gauss-Bonnet theorem
The Gauss-Bonnet theorem [56], states that the integral of the Gauss curvature C of
a 2-dimensional manifold M is proportional to its Euler characteristic X(M):
IM CdA = 27rX(M). (1.2)
The Chern-Gauss-Bonnet theorem, a generalization of the Gauss-Bonnet theorem
to arbitrary even-m-dimensional manifolds [13, 57], states that
IM Pf(Q)dVolm = (27r)ix(M), (1.3)
where Q is the curvature form of the Levi-Civita connection and Pf is the pfaffian.
The curvature form Q is an intrinsic property of the manifold, i.e., it does not depend
on the embedding. In the special case when M is a hypersurface, the curvature Pf(Q2)
may be computed as the Jacobian determinant of the Gauss map at x [59, 61].
The Chern-Gauss-Bonnet theorem is usually viewed as a way of relating the cur-
vature of the manifold with its Euler characteristic. In Section 2.3 we will instead
interpret the Chern-Gauss Bonnet theorem as a way of relating the volume form
dVolm to the curvature form Q. This will come in useful since the curvature form
does not change very quickly in sufficiently smooth manifolds, allowing us to get a
good estimate for the volume of the manifold from its curvature form at a single
point.
18
1.3 Contributions of this thesis
The contributions of this thesis are as follows:
" In Chapter 2 we show how Crofton formulae from integral geometry can be used
to eliminate inefficiencies in Gibbs sampling MCMC algorithms, and prove that
the transition kernels of the primary Gibbs sampling Markov chains remain
unchanged after applying Crofton formulae. In doing so, we also prove a gen-
eralization of Crofton's formula that allows for the use of generalized Gauss
curvature to reduce the variance in Crofton's formula without introducing a
bias. Some of our integral geometry results from Chapter 2 have since been
used and further generalized by Amelunxen and Lotz in [2, 3].
" In Chapter 3, we use the symplectic volume-preserving properties of Hamil-
tonian dynamics and Cheeger's inequality from graph theory to prove upper
bounds for the spectral gap of Hamiltonian Monte Carlo algorithms for general
posterior densities ir. Our results apply to the classical HMC algorithm, as
well as Riemannian Manifold HMC and No-U-Turn HMC, the workhorse of the
popular Bayesian software package Stan [10]. One consequence of our work is
the impossibility of energy-conserving Hamiltonian Markov chains to search for
far-apart sub-Gaussian modes in polynomial time.
" In Chapter 4, we prove a generalization of the Crofton formula that applies to
Hamiltonian trajectories, and use our generalized Crofton formula to improve
the convergence speed of HMC-based integration on submanifolds.
* In Chapter 5, we present a generalization of the Hopf fibration acting on
arbitrary-f ghost-valued random variables. For # = 4, the geometry of the Hopf
fibration is encoded by the quaternions; we investigate the extent to which the
elegant properties of this encoding are preserved when one replaces quaternions
with general # > 0 ghosts.
19
20
Chapter 2
Integral Geometry for Gibbs
Samplers
2.1 Introduction
In this chapter, we consider Gibbs sampler MCMC algorithms. If the density we wish
to sample from has many modes, or if the density has support on a submanifold M of
Rn, then sever inefficiencies can arise. The purpose of this chapter is to demonstrate
that integral geometry can be used to eliminate many of these inefficiencies. To
illustrate these inefficiencies and our proposed fix, we imagine we would like to sample
uniformly from a manifold M C Rn+1 (as illustrated in dark blue in Figure 2-1.) By
uniformly, we can imagine that M has finite volume, and the probability of being
picked in a region is equal to the volume of that region. More generally, we can put
a probability measure on M and sample from that measure.
We consider algorithms that produce a sequence of points {xi, X 2, .. .} (yellow dots
in Figure 2-1) with the property that xj+1 will be chosen somehow in an (isotropically
generated) random plane S (red plane in Figure 2-1) centered at xi. Further, the step
from xi to xj+1 is independent of all the previous steps (Markov chain property.)
This situation is known as a Gibbs sampling Markov chain with isotropic random
search-subspaces.
For our purposes, we find it helpful to pick a sphere (light blue) of radius r that
21
represents the length of the jump we will take upon stepping from xi to xi+,. Note
that r is usually random. The sphere will be the natural setting to mathematically
exploit the symmetries associated with isotropically distributed planes. Conditioning
on the sphere, the plane S becomes a great circle S (red), and the manifold A
becomes a submanifold (blue) of the sphere. Assuming we take a step length of r,
then necessarily xi+1 must be on the intersection (green dots in Figure 2-1, higher-
dimensional submanifolds in more general situations) of the red great circle and the
blue submanifold.
For definitiveness, suppose our ambient space is R+'1 where n = 2, our blue
manifold M has codimension k = 1, and our search-subspaces have dimension k +
1. Our sphere now has dimension n and the great circle dimension k = 1. The
intersections (green dots) of the great circle with M are 0-dimensional points.
We now turn to the specifics of how xi+ 1 may be chosen from the intersection of
the red curve and the blue curve. Every green point is on the intersection of the blue
manifold and the red circle. It is worth pondering the distinction between shallower
angles of intersection, and steeper angles. If we thicken the circle by a small constant
thickness E, we see that a point with a shallow angle has a larger intersection than a
steep angle. Therefore points with shallow angles should be weighted more. Figure
2-2 illustrates that 1 is the proper weighting for an intersection angle of 62.sin(62 )
We will argue that the distinction between shallower and steeper angles takes
on a false sense of importance and traditional algorithms may become unnecessarily
inefficient accordingly. A traditional algorithm focuses on the specific red circle that
happens to be generated by the algorithm and then gives more weight to intersection
points with shallower angles. We propose that knowledge of the isotropic distribution
of the red circle indicates that all angles may be given the same weight. Therefore,
any algorithmic work that goes into weighting points unequally based on the angle of
intersection is wasted work.
Specifically, as we will see in Section 2.2.2, has infinite variance, due in partsin(Oi)
to the fact that ) can become arbitrarily large for small enough 64. The algorithm
must therefore search through a large fraction of the (green) intersection points before
22
converging because any one point could contain a signifiant portion of the conditional
probability density, provided that its intersection angle is small enough. This causes
the algorithm to sample the intersection points very slowly in situations where the
dimension is large and there are typically exponentially many possible intersection
points to sample from.
This chapter justifies the validity of the angle-independent approach through the
mathematics of integral geometry [53, 54, 22, 30, 1], and the Crofton formula in
particular in Section 2.2. We should note that sampling all the intersection points
with equal probability cannot work for just any choice of random search-subspace
S. For instance, if the search-subspaces are chosen to be random longitudes on the
2-sphere, parts of M that have a nearly east-west orientation would be sampled
frequently but parts of M that have nearly north-south orientation would be almost
never sampled, introducing a statistical bias to the samples in favor of the east-west
oriented samples. However, if S is chosen to be isotropically random, the random
orientation of S does not favor either the north-south nor the east-west parts of
M, suggesting that we can sample the intersection points with equal probability
in this situation without introducing a bias. Effectively, by sampling with equal
probability weights and isotropic search-subspaces we will use integral geometry to
compute an analytical average of the weights, an average that we would otherwise
compute numerically, thereby freeing up computational resources and speeding up
the algorithm.
In Part II of this chapter, we perform a numerical implementation of an approxi-
mate version of the above algorithm in order to sample the eigenvalues of a random
matrix conditioned on certain rare events involving other eigenvalues of this matrix.
We obtain different histograms from these samples weighted according to both the
traditional weights as well as integral geometry weights (Figure 2-3; Figures 2-10 and
2-11 in part II). We find that using integral geometry greatly reduces the variance of
the weights. For instance, the integral geometry weights normalized by the median
weight had a sample variance of 3.6 x 10, 578, and 1879 times smaller than the tra-
ditional weights, respectively, for the top, middle, and bottom simulations of Figure
23
41
li -A
Figure 2-1: In this example we wish to generate random samples on a codimension-kmanifold A C R' (dark blue) with a Gibbs sampling Markov chain {Xi, X 2 , .. .} thatuses isotropic random search-subspaces S (light red) centered at the most recent pointxi (k = 1, n = 3 in figure). We will consider the sphere rS" of an arbitrary radius rcentered at xi (light blue), allowing us to make use of the spherical symmetry in thedistribution of the random search-subspace to improve the algorithm's convergencespeed. S now becomes an isotropically distributed random great k-sphere S = SnrS"(dark red), that intersects a codimension-k submanifold M = M4n rJS of the sphere.
24
. This reduction in variance allows us to get faster-converging (i.e., smoother for
the same number of data points) and more accurate histograms in Figure . In
fact, as we show in Section , the traditional weights have infinite variance due to
their second-order heavy tailed probability density, so the sample variance tends to
increase greatly as more samples are taken. Because of the second-order heavy-tailed
behavior in the weights, the smoother we desire the histogram to be, the greater the
speed up in the convergence time obtained by using the integral geometry weights in
place of the traditional weights.
Figure 2-2: Conditional on the next sample point xj+1 lying a distance r from xi, the
algorithm must randomly choose xj+1 from a probability distribution on the intersec-
tion points (middle, green) of the manifold M with the isotropic random great circle
S (red). If traditional Gibbs sampling is used, intersection points with a very small
angle of intersection Oi must be sampled with a much greater (unnormalized) prob-
ability 1 (right, top) than intersection points with a large angle (right, bottom).
This greatly increases the variance in the sampling probabilities for different points
and slows down the convergence of the method used to generate the next sample xi+1 .
However, since S is isotropically distributed on rS", the symmetry of the isotropic
distribution of S allows us to use the Crofton formula from integral geometry to
analytically average out these unequal probability weights so that every intersection
point now has the same weight. freeing the algorithm from the time-consuming task
of effectively computing this average numerically.
Remark 2. Since we are using an approximate truncated version of the full algorithm
that is not completely asymptotically accurate, the integral geometry weights also
cause an increase in asymptotic accuracy. The full MCMC algorithm should have
perfect asymptotic accuracy, so we expect this increase in accuracy to become an
25
increase in convergence speed if we allow the Markov chain to mix for a longer amount
of time.
For situations where the intersections are higher-dimensional submanifolds rather
than individual points, we show in Section 2.3 that the angle-independent approach
generalizes to a curvature-dependent approach. We stress that traditional algorithms
condition only on the plane that was actually generated while ignoring its isotropic
distribution. By taking the isotropy into account, our algorithm can use the curvature
information of the manifold to compute an analytical average of the local intersection
volumes (local in a second-order sense) with all possible isotropically distributed
search-subspaces, greatly reducing the variance of the volumes.
Higher-dimensional intersections occur in many (perhaps most) situations, such
as applications with events that are rare for reasons other than that their associated
submanifold has high codimension. In these situations, the probability of a low-
dimensional search-subspace intersecting M can be very small, so one may wish to
use a search-subspace S of dimension d that is greater than the codimension k of M
in order to increase the probability of intersecting M.
As we will see in Section 2.3.6, the traditional approach can lead to a huge vari-
ance in the intersection volumes that increases exponentially with the difference in
dimension d - k (Figure 2-4, right). This exponentially large variance leads to the
same type of algorithmic slowdowns of the traditional algorithm as the variance in
the traditional angle weights discussed above. Using the curvature-aware approach
can oftentimes reduce or eliminate this exponential slowdown.
This chapter justifies the validity of the curvature-aware approach by proving a
generalization of the Crofton formula (Section 2.3). We then motivate the use of the
curvature-aware approach over the traditional curvature-oblivious approach using the
mathematics of concentration-of-measure [43, 38, 46] (Section 2.3.6) and differential
geometry [56, 57], specifically the Chern-Gauss-Bonnet Theorem [13] whose curvature
form we use to re-weight the intersection volumes (Section 2.3.4).
26
3
> 2
1. 5
c1-00a-0.5
0
0.5
0.4
0 0.3
0.2
0--
0.4
00.3
0.2C.-g 0.10-
A4 1(A A2'3' A5 ,A6'7)=(2,-3.5,-4.65,-7.9,-9,-10.8)}
-Rejection Sampling of A41 (A 3 ,A5)
-Integral Geometry Weights- Traditional Weights
-8 -7.5 -7 -6.5 6 -5.5 -54
A2 A1=2
4.5
-A 1=2(Integral geometry weights)
A=2(traditional weigh_A 1 2(rejection sampli
6 - - -3'-2-6 -5 -4 -3 A\2 -2 -1 0 1
2 1=5
-5 -4 -3 -2 2
-1 0
ts)ng)
Figure 2-3: Histograms from 3 random matrix simulations (see Sections and )where we seek the distribution of an eigenvalue given conditions on one or more other
eigenvalues. In all three figures, the blue curve uses the integral geometry weights
proposed in this chapter, the red curve uses traditional weights, and the black curve
(only in the top two figures) is obtained by the accurate but very slow rejection
sampling method. Two things worth noticing is that the integral geometry weight
curve is more accurate than the traditional weight curve (at least when we have a
rejection sampling curve to compare), and that the integral geometry weight curve is
smoother than the traditional weight curve. The integral geometry algorithm achieves
these benefits in part because of the much smaller variance in the weights. (In these
three simulations the integral geometry sample variance was smaller by a factor of
105 , 600, and 2000 respectively)
27
-- Integral Geometry WeightsTraditional Weights
F__ _T __-F I __T
--
--
-
1
Variance of Vol(SnM;) normalized by its mean
10
10 2
0 0
10- 0 50 100 150 200 250 300 350 400d
Figure 2-4: In this example a collection A4 UA/ of n - 1-dimensional spheresA (blue, left) is intersected (intersection depicted as green circles) by a randomsearch-subspace S (red). The spheres that S intersects farther form their center willhave a much smaller intersection volume than the spheres that S intersects closer totheir center, with the variance in the intersection volumes increasing exponentially inthe dimension d of S (logarithmic plot, right). This curse of dimensionality for theintersection volume can lead to an exponential slowdown when using a traditionalalgorithm to sample from S n M. In Section we will see that this slowdown canbe avoided if we use the curvature information to reweight the intersection volumes,reducing the variance in the intersection volumes.
Part I
Theoretical results and discussion
2.2 A first-order reweighting via the Crofton for-
mula
As discussed in the introduction to this chapter, we can use Crofton's formula directly
to eliminate the weight 1/1 d'l in step 3 of Algorithm
Algorithm 8 Great sphere sampler
Algorithm is identical to Algorithm except for the following step:
3: Sample xj+ 1 from r(x) restricted to S n A (using a subroutine Markov chain
or another sampling method)
Before we apply Crofton's formula to the Gibbs sampler (Algorithm ), which
uses isotropic random linear search subspaces (as opposed to great spheres), we need
28
the following modification of Crofton's formula:
Theorem 1. (Crofton's formula for isotropic random linear subspaces)
Let S be an isotropic random linear subspace centered at the origin. Let 7r be a
function on a manifold M then
7r(x)dx = c x E[snm - |IxIV- dx]d!M.
where c = cd1,k,n_1,s is a constant and ||I 1| is the sine of the angle between the
line passing through both the origin and x.
Proof.
r(x)dx = (X) dxdrJ0 rsn-1 n.A |dM=
Crofton formula on rSn Scd-1,k,n-1,S - Es[If0 snmnrsn
7r(x) dx]drI df dr
||g r|
Fubini ir(x)Cd-1,k,n-1,s -Es[ f r dx]dr
JO JsnMnlrsn M=I
Cd-l,k,n-1,s -Es[ 7r ( jx-dx]
2.2.1 The Crofton formula Gibbs sampler
As discussed in the introduction, we can apply the first-order reweighting of Theo-
rem 1 to the Gibbs sampler algorithm with d-dimensional isotropic random search-
subspaces to get a more efficient MCMC algorithm (Algorithm 9):
Algorithm 9 Crofton Formula Gibbs Sampler (for 7r supported on submanifold)
Algorithm 9 is identical to Algorithm 3, except for the following steps:
3: Sample xj+1 from the (unnormalized) density 7r(x)/I d4-)| 1 restricted to
S n M (using a subroutine Markov chain or another sampling method)
29
IM
IM
Theorem 2. The primary Markov chains in Algorithm 9 and 3 (denoted by x1 , X2 ,...
in both algorithms) have identical probability transition kernels.
Proof. For convenience, in this proof we will denote the primary Markov chains in
Algorithm 9 and 3 by zi, 22, ... and 1, 32, ... , respectively.
Let k(U, V) := P(i+1 E VJzi E U) and k(U, V) := P(si+1 E V i E U) denote
the probability transition kernel for the primary Markov chain in Algorithm 3 and 9
respectively (here U, V C Rn).
Then
K(U, V) = P(Gi+1 E Vjzi E U) = Es[ r(y)/ dSdy]dxJ snv dMA
=jjr(y) 1 idydxThe/remiy -f d -
The-eml IEs[J r(y)/11 -)| dy]dxU snv dM
= P(i+1 E Vl.i E U)
=k(U, V)]
2.2.2 Traditional weights vs. integral geometry weights
In this section we find the theoretical distribution of the traditional weights and
compare them to the integral geometry weights of Theorem 1. We will see that
while both weights incorporate the factor 1 the traditional weights have an
additional component not present in the integral geometry weights that has infi-
nite variance, greatly slowing the traditional MCMC algorithm. Indeed, d =
d(x-x) Mds), where dS is the projection of dS onto the tangent spaceIMJ d MJ FI d(M--) p fdMJ
at x of the sphere of radius |Ix - xill centered at xi. Since both weights share the
component | I)I, for the remainder of this section we will focus our analysis on
the component I d(Mnsx)I that is unique to the traditional algorithm.
30
In the codimension-k = 1 case, we can find the distribution of the weights by
observing that the symmetry of the Haar measure means that the distribution of
the weights are a local property that does not depend on the choice of manifold
M. Moreover, since the kinematic measure is locally the same for both constant-
curvature spaces S' and R', the distribution is the same regardless of the choice of
constant-curvature space. Hence, without loss of generality, we may choose M to
be a cylinder of unit radius in R'. We observe that projecting the cylinder down to
the unit circle in R2 , together with the dimension k = 1 search-subspace, does not
increase the weight 1 . Because of the rotational symmetry of both the kinematic
measure and the circle, without loss of generality we may condition on only the
vertical lines { (x, t) : t E R}, in which case x is distributed uniformly on [-1, 1]. The
weights are then given by w = w(X) = 1 _+ 1X2 with exactly two intersections at
almost every x. Hence, E[w] = 2 f 1+ 1$ dx = 2r, the circumference of the
circle, as expected. However, E[w 2] = 2 f 1+ x 2 dx = oo. Hence, w has infinite
variance. Since projecting down to R2 did not increase the weights, the original
weights must have infinite variance as well, greatly slowing the convergence of the
sampling algorithm even in the codimension-k = 1 case! On the other hand, the
integral geometry weights, being identically = 1, have variance zero, so the weights
do not slow down the convergence at all. (A related computation, which we do not
give here, shows that the theoretical weights for general k are given by the Wishart
matrix determinant 1 1(,G), where G is a (k + 1) x k matrix of i.i.d. standard
normals, which also has infinite variance.)
2.3 A generalized Crofton formula
Oftentimes, it is necessary to use a random search-subspace of dimension d larger
than the codimension k of the constraint manifold M (the manifold we wish to
sample from). For instance, the manifold might represent a rare event, so we might
use a higher dimension than the codimension to increase the probability of finding an
intersection with the manifold. However, the intersections will no longer be points
31
but submanifolds of dimension d - k. How should one assign weights to the points on
this submanifold? The first-order factor in this weight is simple: it is the same as the
Jacobian weight of Theorem . However, the size of the intersection still depends on
the orientation of the search-subspace with respect to the constraint manifold. For
instance, we will see in Section that if we intersect a spherical manifold with a
plane near the sphere's center, then we will get a much larger intersection than if we
intersect the sphere with a plane far from its center.
This example suggests that we should weight the points on the intersection using
the local curvature form: If we intersect in a direction where the curvature is greater
(with the plane not passing near the center in the example) then we should use a larger
weight than in directions where the curvature is smaller (when the plane passes near
the center) (Figure ).
7
Figure 2-5: Both d-dimensional slices, S and S2, pass through the green point x,but the slice passing through the center of the n-1 sphere M has a much biggerintersection volume than the slice passing far from the center. The smaller slice alsohas larger curvature at any given point x. If we reweight the density of Si nM at x bythe Chern-Gauss-Bonnet curvature of Si n M at x, then both slices will have exactlythe same total reweighted volume (exact in this case since the sphere has constantcurvature form), since the Chern-Gauss-Bonnet theorem relates this curvature to thevolume measure.
Consider the simple case where M is a collection of spheres. If we were just
applying an algorithm based on the Classical Crofton formula, such as Algorithm ,
we would sample uniformly from the volume on the intersection S n M. However,
the intersected volume depends heavily on the orientation of the search-subspace S
with respect to each intersected sphere (Figure ), meaning that the algorithm will
32
in practice have to search through exponentially many spheres before converging to
the uniform distribution on S f M (See section 2.3.6). To avoid this problem, we
would like to sample from a density Wi that is proportional to the absolute value of
the Chern-Gauss-Bonnet curvature of S n M at each point x in the intersection:
w = W^ (x; S) = IPf(Qx(S n M))I (The motivation for using the Chern-Gauss-Bonnet
curvature Pf(Q2(S n M)) will be discussed in Section 2.3.4).
However, sampling from the density w^ (x; S) does not in general produce unbiased
samples uniformly distributed on M even when S is chosen at random according
to the kinematic measure. We will see in Theorem 3 that in order to guarantee
an unbiased uniform sampling of M we can instead sample from the normalized
curvature density
W(X; S) = x~dS .^(;S (2.1)Cd,k,n,K EQ[) (x; SQ) x det(Projm Q)]
The normalization term EQ [W^1(x; SQ) x det(ProjM Q)] is the average curvature at
x over all the random orientations at which S could have passed through x. Here
SQ = Q(S - x) + x is a random isotropically distributed rotation of S about x,
with Q the corresponding isotropic random orthogonal matrix. The determinant
inside the expectation is there because while S is originally isotropically distributed,
the conditioning of S to intersect M (at x) modifies the probability density of its
orientation by a factor of det(Projmj Q). Projag Q is the projection of the orthogonal
complement of the tangent space of M at x. In this collection of spheres example,
the denominator is a constant for each sphere of a radius R. For instance, in the
Euclidean case it can be computed analytically, using the Gauss-Bonnet theorem, as
d-1 F(d + 1) F(--1 + 1)F(n-- + 1) R(27r)- 2 X 2 2 Rn.7r! (n - d) (n - d)F(- 2 1 + n 2 +)
From this fact, together with the fact that the total curvature is always the same
for any intersection by the Chern-Gauss-Bonnet theorem, we see that when sampling
under the probability density w the probability that we will sample from any given
sphere is always the same regardless of the volume of the intersection of S with that
33
sphere. Since each sphere (of the same radius) has an equal probability of being
sampled, when sampling from M the algorithm has to search for far fewer spheres
before converging to a uniformly random point on S n M than when sampling from
the uniform distribution on S n M.
The need to guarantee that w will still allow us to sample uniformly without bias
from M motivates introducing the following generalization of the classical Crofton
formula (Theorem 3), which, as far as we know, is new to the literature. Since the
proof does not rely on the fact that w is derived from a curvature form, we state
the theorem in a more general form that allows for arbitrary w^ (see Section 2.3.5
for a discussion of higher-order choices of Cv beyond just the Chern-Gauss-Bonnet
curvature).
Theorem 3. (Generalized Crofton formula)
Let & be a weight function, M a manifold, and S a random search subspace
moving according to the kinematic measure, satisfying smoothness conditions Al and
A2 (defined below). Then
Vol(M) X E (x; S) dx]. (2.2)Cd,k,n,K snM EQ[w(x;SQ) x det(Projmj Q)]
Q is a matrix formed by the first d columns of a random matrix sampled from the
Haar measure on SO(n). SQ := Q(S - x) + x. Proju is the projection onto the
orthogonal complement of the tangent space of M at x.
(As in Lemma 1, if S is a plane, we set the "Vol(S)" term to 1.)
Remark 3. We note that Amelunxen and Lotz [2, 3] recently managed to provide a
more elegant proof of our Theorem 3, by modifying our proof using an algebraic ap-
proach similar to group-theoretic double-fibration arguments of [22, 1, 30]. Although
our proof of Theorem 3 relies on smoothness conditions Al and A2 (defined below),
their proof does not seem to rely on these two assumptions.
For MCMC applications, M is taken to be a component of a level set of 7F and tC
the magnitude of the Chern-Gauss-Bonnet curvature of SnM, since the Chern-Gauss-
34
Bonnet theorem states that the integral of the curvature form over the intersection
SnM is invariant under rotations of S as long as the topology of Sn M is unchanged.
Definition 2. (Smoothness conditions)
Al: A manifold (such as M or S in Theorem 3) satisfies condition Al if its cur-
vature form is uniformly bounded above.
A2: The pre-normalized weight Cv(x; S) is said to satisfy condition A2 if it is any
function such that a < wi(x; S) < b for some 0 < a < b, and is Lipschitz in the
variable x E M for some Lipschitz constant 0 < c / oo (when using a translation of
S to keep x in S n M when we vary x).
Proof. (Of Theorem 3)
We first observe that it suffices to prove Theorem 3 for the case where K' = R' is
Euclidean, S is a random plane, and w(x; S) = w(x; L) depends only on the orien-
tation d = d I of the tangent spaces of S and M at x. This is because constant-
curvature kinematic measure spaces are locally Euclidean (and converge uniformly to
a Euclidean geometry if we restrict ourselves to increasingly small neighborhoods of
any point in the space because the curvature is the same). We may use any geodesic
d-cube in place of the plane as a search-subspace S, since S can be decomposed as
a collection of cubes, and Equation 2.2 treats each subset of S in an identical way
(since so far we have assumed that w(x; S) depends only on the orientation of the
tangent spaces of S and M at x). We can then approximate any search-subspace S
of bounded curvature, and Lipschitz function w(x; S) that depends on the location
on S where S intersects M (in addition to L), by approximating S with very small
squares, each with a different "w(x; L)" that depends only on d.
The remainder of the proof consists of two parts. In Part I we prove the theorem
for the special case of very small codimension-k balls (in place of M). In Part II we
extend this result to the entire manifold by tiling the manifold with randomly placed
balls.
35
Part I: Special case for small codimension-k balls
Let BE = BE(x) be any k-ball of radius c that is tangent to M C R' at the ball's
center x. Let S and 5 be independent random d-planes distributed according to the
kinematic measure in RI. Let r be the distance in the k-plane containing BE (the
shortest line contained in this plane) from S to the ball's center x. Let 0 be the
orthogonal matrix denoting the orientation of S. Then we may write S = S,,O
Then almost surely (i.e., with probability 1; abbreviated "a.s.") Vol(Sr,o n BE)
does not depend on 0 (this is because BE is a codimension-k ball and S is a d-plane,
so the volume of S n BE, itself a d - k-ball, depends a.s. only on r and not on 9).
We also note that w(x; d) obviously does not depend on r as well. Define events
E := {S,,O n B #0} and := {5n B, # 0}. Then
$$\mathbb{E}_{r,\theta}\left[w\left(x;\tfrac{dS_x}{dM_x}\right)\times\mathrm{Vol}_{d-k}(S_{r,\theta}\cap B_\epsilon)\right] \tag{2.3}$$
$$=\mathbb{E}_{r,\theta}\left[w\left(x;\tfrac{dS_x}{dM_x}\right)\times\mathrm{Vol}_{d-k}(S_{r,\theta}\cap B_\epsilon)\,\Big|\,E\right]\times\mathbb{P}(E) \tag{2.4}$$
$$=\mathbb{E}_{\theta}\left[w\left(x;\tfrac{dS_x}{dM_x}\right)\Big|\,E\right]\times\mathbb{E}_{r}\left[\mathrm{Vol}_{d-k}(S_{r,\theta}\cap B_\epsilon)\,\big|\,E\right]\times\mathbb{P}(E) \tag{2.5}$$
$$=\frac{1}{c_{d,k,n,R}}\times\frac{\mathbb{E}_{\theta}\left[\hat w\left(x;\tfrac{dS_{x,\theta}}{dB_\epsilon}\right)\big|\,E\right]}{\mathbb{E}_{Q}\left[\hat w(x;S_Q)\times\det(\mathrm{Proj}_{B_\epsilon^\perp}Q)\right]}\times\mathbb{E}_{r}\left[\mathrm{Vol}_{d-k}(S_{r,\theta}\cap B_\epsilon)\,\big|\,E\right]\times\mathbb{P}(E) \tag{2.6}$$
$$=\frac{1}{c_{d,k,n,R}}\times\frac{\mathbb{E}_{\theta}\left[\hat w\left(x;\tfrac{dS_{x,\theta}}{dB_\epsilon}\right)\big|\,E\right]}{\mathbb{E}_{\tilde S}\left[\hat w\left(x;\tfrac{d\tilde S_{x}}{dB_\epsilon}\right)\big|\,\tilde E\right]}\times\mathbb{E}_{r}\left[\mathrm{Vol}_{d-k}(S_{r,\theta}\cap B_\epsilon)\,\big|\,E\right]\times\mathbb{P}(E) \tag{2.7}$$
$$=\frac{1}{c_{d,k,n,R}}\times 1\times\mathbb{E}_{r}\left[\mathrm{Vol}_{d-k}(S_{r,\theta}\cap B_\epsilon)\,\big|\,E\right]\times\mathbb{P}(E) \tag{2.8}$$
$$=\frac{1}{c_{d,k,n,R}}\times\mathbb{E}_{r,\theta}\left[\mathrm{Vol}_{d-k}(S_{r,\theta}\cap B_\epsilon)\,\big|\,E\right]\times\mathbb{P}(E) \tag{2.9}$$
$$=\frac{1}{c_{d,k,n,R}}\times\mathbb{E}_{r,\theta}\left[\mathrm{Vol}_{d-k}(S_{r,\theta}\cap B_\epsilon)\right] \tag{2.10}$$
$$=\frac{1}{c_{d,k,n,R}}\times c_{d,k,n,R}\times\mathrm{Vol}_{d-k}(B_\epsilon) \tag{2.11}$$
$$=\mathrm{Vol}_{d-k}(B_\epsilon). \tag{2.12}$$
• Equation 2.5 is due to the fact that r and θ are independent random variables even when conditioning on the event E. This is true because they are independent in the unconditioned kinematic measure on S, and remain independent once we condition on S intersecting B_ε (i.e., the event E) because of the symmetry of the codimension-k ball B_ε.

• Equation 2.6 is due to the fact that, by the change of variables formula,

$$\int_{\mathbb{R}^{n-d}}\mathrm{Vol}(T_Q+R_{Q^\perp}y\cap B_\epsilon)\,d\mathrm{Vol}_{n-d}(y)\times\frac{1}{\det(\mathrm{Proj}_{B_\epsilon^\perp}Q)}=\mathrm{Vol}(B_\epsilon) \tag{2.13}$$

for every orthogonal matrix Q, where the coordinates of the integral are conveniently chosen with the origin at the center of B_ε. R_{Q^⊥} is a rotation matrix rotating the vector y so that it is orthogonal to T_Q, the subspace spanned by the rows of Q.
Multiplying by ŵ(x; Q) and rearranging terms gives

$$\hat w(x;Q)\times\det(\mathrm{Proj}_{B_\epsilon^\perp}Q)=\hat w(x;Q)\times\frac{\int_{\mathbb{R}^{n-d}}\mathrm{Vol}(T_Q+R_{Q^\perp}y\cap B_\epsilon)\,d\mathrm{Vol}_{n-d}(y)}{\mathrm{Vol}(B_\epsilon)}. \tag{2.14}$$

Taking the expectation with respect to Q (where Q is the first d columns of a Haar(SO(n)) random matrix) on both sides of the equation gives

$$\mathbb{E}_Q\left[\hat w(x;Q)\times\det(\mathrm{Proj}_{B_\epsilon^\perp}Q)\right]=\mathbb{E}_Q\left[\hat w(x;Q)\times\frac{\int_{\mathbb{R}^{n-d}}\mathrm{Vol}(T_Q+R_{Q^\perp}y\cap B_\epsilon)\,d\mathrm{Vol}_{n-d}(y)}{\mathrm{Vol}(B_\epsilon)}\right]. \tag{2.15}$$

Recognizing the right hand side as an expectation with respect to the kinematic measure on T_Q + R_{Q^⊥}y conditioned to intersect B_ε (since the fraction on the RHS is exactly the density of the probability of intersection for a given orientation of Q), we have:

$$\mathbb{E}_Q\left[\hat w(x;Q)\times\det(\mathrm{Proj}_{B_\epsilon^\perp}Q)\right]=\mathbb{E}_{\tilde S}\left[\hat w\left(x;\tfrac{d\tilde S_x}{dB_\epsilon}\right)\Big|\,\tilde E\right]. \tag{2.16}$$

• Equation 2.8 is due to the fact that dS_{x,θ}/dB_ε = dS̃_x/dB_ε because B_ε has a constant tangent space, and hence

$$\mathbb{E}_{\theta}\left[\hat w\left(x;\tfrac{dS_{x,\theta}}{dB_\epsilon}\right)\Big|\,E\right]=\mathbb{E}_{\tilde S}\left[\hat w\left(x;\tfrac{d\tilde S_x}{dB_\epsilon}\right)\Big|\,\tilde E\right]. \tag{2.17}$$

• Equation 2.11 is by the Crofton formula.

Writing E_S in place of E_{r,θ} in Equation 2.3 (LHS)/2.12 (RHS) (we may do this since S = S_{r,θ} is determined by r and θ), and observing that dS_x/dB_ε = dS_x/dM_x because B_ε is tangent to M at x, we have shown that

$$\mathbb{E}_S\left[w\left(x;\tfrac{dS_x}{dM_x}\right)\times\mathrm{Vol}_{d-k}(S\cap B_\epsilon)\right]=\mathrm{Vol}_{d-k}(B_\epsilon). \tag{2.18}$$
Part II: Extension to all of M
All that remains to be done is to extend this result over all of M. To do so, we consider the Poisson point process {x_i} on M, with density equal to 1/Vol(B_ε). We wish to approximate the volume-measure on M using the collection of balls {B_ε(x_i)} (think of making a papier-mâché mold of M using the balls B_ε(x_i) as tiny bits of paper).

Let A ⊂ M be any measurable subset of M. Since M and S have uniformly bounded curvature forms, because of the symmetry of the balls and the symmetry of the Poisson distribution, the total volume of the balls intersected by S and A converges a.s. to Vol(S ∩ M̃ ∩ A) on any compact submanifold M̃ ⊂ M:

$$\sum_i \mathrm{Vol}(S\cap B_\epsilon(x_i))\times\frac{\mathrm{Vol}(B_\epsilon(x_i)\cap A)}{\mathrm{Vol}(B_\epsilon(x_i))}\;\xrightarrow[\epsilon\downarrow 0]{a.s.}\;\mathrm{Vol}(S\cap M\cap A), \tag{2.19}$$

and similarly,

$$\sum_{i}\mathrm{Vol}(B_\epsilon(x_i)\cap A)\;\xrightarrow[\epsilon\downarrow 0]{a.s.}\;\mathrm{Vol}(M\cap A). \tag{2.20}$$

But, by assumption, w is Lipschitz in x on M (since ŵ, which appears in both the numerator and denominator of w, is Lipschitz, and the denominator is bounded below by a > 0), so we can cut up M into a countable union of disjoint compact submanifolds ∪_{j=1}^∞ M_j such that |w(t; ·) − w(x; ·)| < δ for all x, t ∈ M_j, and hence, by Equation 2.19,

$$\left|\lim_{\epsilon\downarrow 0}\sum_{i:\,x_i\in M_j} \mathrm{Vol}(S\cap B_\epsilon(x_i))\times\frac{\mathrm{Vol}(B_\epsilon(x_i)\cap A)}{\mathrm{Vol}(B_\epsilon(x_i))}\times w\!\left(x_i;\tfrac{dS}{dM}\right)-\int_{S\cap M_j\cap A}w\!\left(x;\tfrac{dS}{dM}\right)d\mathrm{Vol}(x)\right|\le\delta\times\mathrm{Vol}(S\cap M_j\cap A) \tag{2.21}$$

a.s. for every j.
Summing over all j in Equation 2.21 implies that

$$\left|\lim_{\epsilon\downarrow 0}\sum_i \mathrm{Vol}(S\cap B_\epsilon(x_i))\times\frac{\mathrm{Vol}(B_\epsilon(x_i)\cap A)}{\mathrm{Vol}(B_\epsilon(x_i))}\times w\!\left(x_i;\tfrac{dS}{dM}\right)-\int_{S\cap M\cap A}w\!\left(x;\tfrac{dS}{dM}\right)d\mathrm{Vol}(x)\right|\le\delta\times\mathrm{Vol}(S\cap M\cap A) \tag{2.22}$$

almost surely. Since Equation 2.22 is true for every δ > 0, we must have that

$$\sum_i \mathrm{Vol}(S\cap B_\epsilon(x_i))\times\frac{\mathrm{Vol}(B_\epsilon(x_i)\cap A)}{\mathrm{Vol}(B_\epsilon(x_i))}\times w\!\left(x_i;\tfrac{dS}{dM}\right)\;\xrightarrow[\epsilon\downarrow 0]{a.s.}\;\int_{S\cap M\cap A}w\!\left(x;\tfrac{dS}{dM}\right)d\mathrm{Vol}(x). \tag{2.23}$$

Hence, taking the expectation E_S on both sides of Equation 2.23, we get

$$\mathbb{E}_S\!\left[\sum_i \mathrm{Vol}(S\cap B_\epsilon(x_i))\times\frac{\mathrm{Vol}(B_\epsilon(x_i)\cap A)}{\mathrm{Vol}(B_\epsilon(x_i))}\times w\!\left(x_i;\tfrac{dS}{dM}\right)\right]\;\longrightarrow\;\mathbb{E}_S\!\left[\int_{S\cap M\cap A}w\!\left(x;\tfrac{dS}{dM}\right)d\mathrm{Vol}(x)\right] \tag{2.24}$$

as ε ↓ 0 (we may exchange the limit and the expectation by the dominated convergence theorem, since the sum is dominated by a constant multiple, depending on the bounds a and b of condition A2, of Vol(S ∩ M) for sufficiently small ε).

Since the sum on the LHS of Equation 2.24 is of nonnegative terms we may exchange the sum and expectation, by the monotone convergence theorem:

$$\mathbb{E}_S\!\left[\sum_i \mathrm{Vol}(S\cap B_\epsilon(x_i))\times\frac{\mathrm{Vol}(B_\epsilon(x_i)\cap A)}{\mathrm{Vol}(B_\epsilon(x_i))}\times w\!\left(x_i;\tfrac{dS}{dM}\right)\right]=\sum_i\mathbb{E}_S\!\left[\mathrm{Vol}(S\cap B_\epsilon(x_i))\times w\!\left(x_i;\tfrac{dS}{dM}\right)\right]\times\frac{\mathrm{Vol}(B_\epsilon(x_i)\cap A)}{\mathrm{Vol}(B_\epsilon(x_i))}. \tag{2.25}$$

But by Equation 2.18, E_S[Vol(S ∩ B_ε(x_i)) × w(x_i; dS/dM)] = Vol_{d−k}(B_ε(x_i)), so

$$\sum_i\mathbb{E}_S\!\left[\mathrm{Vol}(S\cap B_\epsilon(x_i))\times w\!\left(x_i;\tfrac{dS}{dM}\right)\right]\times\frac{\mathrm{Vol}(B_\epsilon(x_i)\cap A)}{\mathrm{Vol}(B_\epsilon(x_i))}=\sum_i\mathrm{Vol}(B_\epsilon(x_i)\cap A)\;\longrightarrow\;\mathrm{Vol}(M\cap A) \tag{2.26}$$

almost surely as ε ↓ 0 by Equation 2.20.

Combining Equations 2.24 and 2.26 gives

$$\mathbb{E}_S\!\left[\int_{S\cap M\cap A}w\!\left(x;\tfrac{dS}{dM}\right)d\mathrm{Vol}(x)\right]=\mathrm{Vol}(M\cap A). \tag{2.27}\qquad\square$$
We now prove Theorem 4, a version of Theorem 3 with somewhat more general
analytical assumptions.
Theorem 4.
Suppose that ŵ(x; S) is c(t)-Lipschitz on M ∩ {x : ŵ(x; S) ≤ t}, and that

$$\lim_{t\to\infty}\frac{\mathbb{E}_Q\left[\left(\hat w(x;S_Q)-\tfrac{1}{t}\vee\hat w(x;S_Q)\wedge t\right)\times\det(\mathrm{Proj}_{M_x^\perp}Q)\right]}{\mathbb{E}_Q\left[\tfrac{1}{t}\vee\hat w(x;S_Q)\wedge t\times\det(\mathrm{Proj}_{M_x^\perp}Q)\right]}=0$$

and

$$\lim_{t\to\infty}\mathbb{E}_S\!\left[\int_{S\cap M}\mathbb{1}_A(x)\times\left(w(x;S)-\tfrac{1}{t}\vee w(x;S)\wedge t\right)d\mathrm{Vol}\right]=0,$$

where we define the "∧" and "∨" operators to be r ∧ s := min{r, s} and r ∨ s := max{r, s}, respectively, for all r, s ∈ ℝ.

Then Theorem 3 holds even for a = 0 and b = c = ∞.
Proof. (Of Theorem 4)
Define

$$\theta(t):=\frac{\mathbb{E}_Q\left[\left(\hat w(x;S_Q)-\tfrac{1}{t}\vee\hat w(x;S_Q)\wedge t\right)\times\det(\mathrm{Proj}_{M_x^\perp}Q)\right]}{\mathbb{E}_Q\left[\tfrac{1}{t}\vee\hat w(x;S_Q)\wedge t\times\det(\mathrm{Proj}_{M_x^\perp}Q)\right]}.$$

Let A be any Lebesgue-measurable subset. Then

$$\mathbb{E}_S\!\left[\int_{S\cap M}\mathbb{1}_A(x)\times w(x;S)\,d\mathrm{Vol}\right] \tag{2.28}$$
$$=\lim_{t\to\infty}\mathbb{E}_S\!\left[\int_{S\cap M}\mathbb{1}_A(x)\times w(x;S)\,d\mathrm{Vol}\right] \tag{2.29}$$
$$=\lim_{t\to\infty}\mathbb{E}_S\!\left[\int_{S\cap M}\mathbb{1}_A(x)\times\tfrac{1}{t}\vee w(x;S)\wedge t\,d\mathrm{Vol}\right]+\lim_{t\to\infty}\mathbb{E}_S\!\left[\int_{S\cap M}\mathbb{1}_A(x)\times\left(w(x;S)-\tfrac{1}{t}\vee w(x;S)\wedge t\right)d\mathrm{Vol}\right] \tag{2.30}$$
$$=\lim_{t\to\infty}\mathbb{E}_S\!\left[\int_{S\cap M}\mathbb{1}_A(x)\times\tfrac{1}{t}\vee w(x;S)\wedge t\,d\mathrm{Vol}\right]+0 \tag{2.31}$$
$$=\lim_{t\to\infty}\mathbb{E}_S\!\left[\int_{S\cap M}\mathbb{1}_A(x)\times\frac{\tfrac{1}{t}\vee\hat w(x;S)\wedge t}{\mathbb{E}_Q\left[\hat w(x;S_Q)\times\det(\mathrm{Proj}_{M_x^\perp}Q)\right]}\,d\mathrm{Vol}\right] \tag{2.32}$$
$$=\lim_{t\to\infty}\mathbb{E}_S\!\left[\int_{S\cap M}\mathbb{1}_A(x)\times\frac{\tfrac{1}{t}\vee\hat w(x;S)\wedge t}{\mathbb{E}_Q\left[\tfrac{1}{t}\vee\hat w(x;S_Q)\wedge t\times\det(\mathrm{Proj}_{M_x^\perp}Q)\right]\times(1+\theta(t))}\,d\mathrm{Vol}\right] \tag{2.33, 2.34}$$
$$=\lim_{t\to\infty}\mathrm{Vol}(M\cap A)\times\frac{1}{1+\theta(t)} \tag{2.35}$$
$$=\mathrm{Vol}(M\cap A)\times 1 \tag{2.36}$$
$$=\mathrm{Vol}(M\cap A). \tag{2.37}$$

• Equation 2.31 is true because, by the second limit hypothesis of the theorem,

$$\left|\mathbb{E}_S\!\left[\int_{S\cap M}\mathbb{1}_A(x)\times\left(w(x;S)-\tfrac{1}{t}\vee w(x;S)\wedge t\right)d\mathrm{Vol}\right]\right|\;\xrightarrow[t\to\infty]{}\;0.$$

• Equation 2.35 follows from Theorem 3 using 1/t ∨ ŵ(x; S) ∧ t as our pre-weight. Indeed, 1/t ∨ ŵ(x; S) ∧ t obviously satisfies the boundedness conditions of Theorem 3. Moreover, since ŵ(x; S) is c(t)-Lipschitz everywhere on M where ŵ(x; S) ≤ t, the pre-weight 1/t ∨ ŵ(x; S) ∧ t must be c(t)-Lipschitz on all x ∈ M.  □
2.3.1 The generalized Crofton formula Gibbs sampler
We now state a generalization of Theorem 1, which can be proved in much the same way as Theorem 1 by applying our generalized Crofton formula (Theorem 3) in place of the classical Crofton formula:
Theorem 5. Let S be an isotropic random subspace centered at the origin, and let π be a function on a manifold M. Then

$$\int_M \pi(q)\,dq = c\times\mathbb{E}_S\!\left[\int_{S\cap M}\frac{1}{\|q_{L^\perp}\|^{\,n-d}}\times\frac{\hat w(q;\,S\cap S_q)\;\pi(q)}{\mathbb{E}_Q\!\left[\hat w(q;\,S_Q\cap S_q)\times\det(\mathrm{Proj}_{(M_q\cap S_q)^\perp}Q)\right]}\,dq\right],$$

where c = c_{d−1,k,n−1,S} is a constant and ‖q_{L^⊥}‖ is the sine of the angle between the line passing through both the origin and q, and the subspace spanned by S. (S_q is the sphere of radius ‖q‖ centered at the origin.)
Proof. The proof is identical to our proof of Theorem 1, with Crofton's formula on the sphere replaced by the generalized Crofton formula on the sphere (Theorem 3).  □
Applying Theorem 5 gives the following improvement to Algorithm 9:

Algorithm 10 Generalized Crofton Formula Gibbs Sampler (for π supported on a submanifold)
Algorithm 10 is identical to Algorithms 9 and 3, except for the following step:
3: Sample x_{i+1} from the (unnormalized) density

$$w_i(x)=\pi(x)\times\frac{\hat w_i(x;\,S_i\cap S_x)}{\mathbb{E}_Q\!\left[\hat w_i(x;\,S_Q\cap S_x)\times\det(\mathrm{Proj}_{(M_x\cap S_x)^\perp}Q)\right]}$$

restricted to S_i ∩ M (using a subroutine Markov chain or another sampling method). (S_x is the sphere of radius ‖x − x_i‖ centered at x_i.)

As discussed in Section 2.3, we would usually set ŵ_i to ŵ_i(x, S) = |Pf(Ω_x(S ∩ M))|. However, as discussed in Section 2.3.5, in some cases it may be advantageous to use other functions for ŵ_i.
Theorem 6. The primary Markov chains in Algorithms 10 and 3 (denoted by x₁, x₂, ... in both algorithms) have identical probability transition kernels.

Proof. For convenience, in this proof we will denote the primary Markov chains in Algorithms 3 and 10 by x₁, x₂, ... and x̂₁, x̂₂, ..., respectively.

Let K(U, V) := P(x_{i+1} ∈ V | x_i ∈ U) and K̂(U, V) := P(x̂_{i+1} ∈ V | x̂_i ∈ U) denote the probability transition kernels for the primary Markov chains in Algorithms 3 and 10, respectively (here U, V ⊂ ℝⁿ).

Then

$$K(U,V)=\mathbb{P}(x_{i+1}\in V\mid x_i\in U)=\int_U\mathbb{E}_S\!\left[\int_{S\cap V\cap M}\pi(y)\,\frac{dS_y}{dM_y}\,dy\right]dx=\int_U\int_{V\cap M}\frac{\pi(y)}{\|y-x\|^{\,n-d}}\,dy\,dx$$
$$\overset{\text{Theorem 5}}{=}\int_U\mathbb{E}_S\!\left[\int_{S\cap V\cap M}\frac{\pi(y)}{\|y-x\|^{\,n-d}}\times\frac{\hat w_i(y;\,S\cap S_y)}{\mathbb{E}_Q\!\left[\hat w_i(y;\,S_Q\cap S_y)\times\det(\mathrm{Proj}_{(M_y\cap S_y)^\perp}Q)\right]}\,dy\right]dx$$
$$=\mathbb{P}(\hat x_{i+1}\in V\mid \hat x_i\in U)=\hat K(U,V). \qquad\square$$
Remark 4. The curvature form Ω_x(S_{i+1} ∩ M ∩ S_x) of the intersected manifold can be computed in terms of the curvature form Ω_x(M) of the original manifold by applying the implicit function theorem twice in a row. Also, if M is a hypersurface then Pf(Ω_x(S_{i+1} ∩ M ∩ S_x)) is the determinant of the product of a random Haar-measure orthogonal matrix with known deterministic matrices, and hence E_Q[|Pf(Ω_x(S_Q ∩ M ∩ S_x))| × det(Proj_{M_x^⊥}Q)] is also the expectation of a determinant of a random matrix of this type. If the Hessian is positive-definite, then we can obtain an analytical solution in terms of zonal polynomials. Even in the case when the curvature form is not a positive-definite matrix (it is a matrix with entries in the algebra of differential forms), the fact that the pre-weight is the Pfaffian of a random curvature form (in particular, a determinant of a real-valued random matrix in the codimension-1 case) should make it very easy to compute numerically, perhaps by a Monte Carlo method.

This fact also means that it should be easy to bound the expectation, which allows us to use Theorem 3 to get bounds for the volumes of algebraic manifolds (Section 2.3.8).
Remark 5. While the Chern-Gauss-Bonnet theorem only holds for even-dimensional
manifolds, if M has odd dimension we can always include a dummy variable to
increase both the dimensions n and d by 1.
2.3.2 The generalized Crofton formula Gibbs sampler for
full-dimensional densities
In many cases one might wish to sample from a full-dimensional set of nonzero prob-
ability measure. One could still reweight in this situation to achieve faster conver-
gence by decomposing the probability density into its level sets, and applying the
weights of Theorem 5 separately to each of the (infinitely many) level sets. We ex-
pect this reweighting to speed convergence in cases where the probability density is
concentrated in certain regions, since when d is large, intersecting these regions with
a random search-subspace S typically causes large variations in the integral of the
probability density over the different regions intersected by S, unless we reweight
using Theorem 5.
Algorithm 11 Generalized Crofton Formula Gibbs Sampler (for full-dimensional π)
Algorithm 11 is identical to Algorithms 9 and 10, except for the following step:
3: Sample x_{i+1} (using a subroutine Markov chain or another sampling method) from the (unnormalized) density

$$w_i(x)=\pi(x)\times\frac{\hat w_i(x;\,S_i\cap S_x)}{\mathbb{E}_Q\!\left[\hat w_i(x;\,S_Q\cap S_x)\right]},$$

where S_x is the sphere of radius ‖x − x_i‖ centered at x_i.

As discussed at the beginning of Section 2.3, we would usually set ŵ_i to ŵ_i(x, S) = |Pf(Ω_x(S ∩ L_x))|, where L_x is the level set of π passing through x. If we instead set ŵ_i(x, S) = 1, we get the traditional Gibbs sampler (Algorithm 2).
Theorem 7. The primary Markov chains in Algorithms 11 and 2 (denoted by x₁, x₂, ... in both algorithms) have identical probability transition kernels.

Proof. For convenience, in this proof we will denote the primary Markov chains in Algorithms 2 and 11 by x₁, x₂, ... and x̂₁, x̂₂, ..., respectively.

Let K(U, V) := P(x_{i+1} ∈ V | x_i ∈ U) and K̂(U, V) := P(x̂_{i+1} ∈ V | x̂_i ∈ U) denote the probability transition kernels for the primary Markov chains in Algorithms 2 and 11, respectively (here U, V ⊂ ℝⁿ).

Then

$$K(U,V)=\mathbb{P}(x_{i+1}\in V\mid x_i\in U)=\int_U\mathbb{E}_S\!\left[\int_{S\cap V}\pi(y)\,dy\right]dx=\int_U\int_V\frac{\pi(y)}{\|y-x\|^{\,n-d}}\,dy\,dx$$
$$\overset{\text{Theorem 5}}{=}\int_U\mathbb{E}_S\!\left[\int_{S\cap V}\frac{\pi(y)}{\|y-x\|^{\,n-d}}\times\frac{\hat w_i(y;\,S\cap S_y)}{\mathbb{E}_Q\!\left[\hat w_i(y;\,S_Q\cap S_y)\right]}\,dy\right]dx=\mathbb{P}(\hat x_{i+1}\in V\mid \hat x_i\in U)=\hat K(U,V),$$

where we set M = ℝⁿ when applying Theorem 5.  □
2.3.3 An MCMC volume estimator based on the Chern-Gauss-Bonnet theorem
In this section we briefly go over a new MCMC method (which we plan to discuss in
much greater detail in a future paper) of estimating the volume of a manifold that is
based on the Chern-Gauss-Bonnet curvature. While this method is interesting in its
own right, we choose to introduce it here since it will serve as a good introduction
to our motivation (Section 2.3.4) for using the Chern-Gauss-Bonnet curvature as a
pre-weight for Theorem 3.
Suppose we somehow knew or had an estimate for the Euler characteristic χ(M) ≠ 0 of a closed manifold M of even dimension m. We could then use a Markov chain Monte Carlo algorithm to estimate the average Gauss curvature form E_M[Pf(Ω)] on M.
The Chern-Gauss-Bonnet theorem says that

$$\int_M \mathrm{Pf}(\Omega)\,d\mathrm{Vol}_m=(2\pi)^{\frac{m}{2}}\chi(M). \tag{2.38}$$

We may rewrite this as

$$\frac{\int_M \mathrm{Pf}(\Omega)\,d\mathrm{Vol}_m}{\int_M d\mathrm{Vol}_m}=\frac{(2\pi)^{\frac{m}{2}}\chi(M)}{\int_M d\mathrm{Vol}_m}. \tag{2.39}$$

By definition, the left hand side is E_M[Pf(Ω)], and ∫_M dVol_m = Vol_m(M), so

$$\mathbb{E}_M[\mathrm{Pf}(\Omega)]=\frac{(2\pi)^{\frac{m}{2}}\chi(M)}{\mathrm{Vol}_m(M)}, \tag{2.40}$$

from which we may derive an equation for the volume in terms of the known quantities E_M[Pf(Ω)] and χ(M):

$$\mathrm{Vol}_m(M)=\frac{(2\pi)^{\frac{m}{2}}\chi(M)}{\mathbb{E}_M[\mathrm{Pf}(\Omega)]}. \tag{2.41}$$
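To make Equation 2.41 concrete, here is a minimal numerical sketch (not from the thesis; the ellipsoid and all parameter values are hypothetical choices) that estimates the area of a 2-dimensional ellipsoid in ℝ³ from the surface-measure average of its Gauss curvature, using m = 2 and χ = 2, and compares the result against a direct Monte Carlo area estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, c = 1.0, 1.5, 2.0   # hypothetical ellipsoid semi-axes
N = 200_000

# Sample uniformly on the unit sphere and map to the ellipsoid.
u = rng.normal(size=(N, 3))
u /= np.linalg.norm(u, axis=1, keepdims=True)
x, y, z = a * u[:, 0], b * u[:, 1], c * u[:, 2]

# Area element of the map u -> (a u1, b u2, c u3):
# dS = abc * sqrt(x^2/a^4 + y^2/b^4 + z^2/c^4) dsigma(u),
# so w re-weights uniform-sphere samples to the surface measure on M.
g = np.sqrt(x**2 / a**4 + y**2 / b**4 + z**2 / c**4)
w = a * b * c * g

# Gauss curvature of the ellipsoid at (x, y, z); for m = 2, Pf(Omega) = K.
K = 1.0 / (a**2 * b**2 * c**2 * g**4)

# E_M[Pf(Omega)] is the surface-measure average of K.
EK = np.sum(w * K) / np.sum(w)

# Chern-Gauss-Bonnet estimate: Vol_2(M) = (2 pi)^(m/2) chi(M) / E_M[K].
vol_cgb = (2 * np.pi) * 2 / EK

# Direct Monte Carlo area for comparison: Area = 4 pi E_sigma[w].
vol_mc = 4 * np.pi * np.mean(w)
print(vol_cgb, vol_mc)  # the two estimates should agree closely
```

For a round sphere of radius R the two estimates both reduce to 4πR² exactly; for the ellipsoid they agree up to Monte Carlo error, illustrating that the curvature average alone recovers the volume when χ(M) is known.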
2.3.4 Motivation for reweighting with respect to Chern-Gauss-Bonnet curvature
While Theorem 3 tells us that any pre-weight ŵ generates an unbiased weight w, it does not tell us which pre-weights reduce the variance of the intersection volumes. We argue here that the Chern-Gauss-Bonnet theorem in many cases provides us with an ideal pre-weight if one only has access to local second-order information at a point x.
Equation 2.41 of Section 2.3.3 gives an estimate for the volume

$$\mathrm{Vol}_{d-k}(S\cap M)=\frac{(2\pi)^{\frac{d-k}{2}}\chi(S\cap M)}{\mathbb{E}_{S\cap M}[\mathrm{Pf}(\Omega(S\cap M))]}, \tag{2.42}$$

where Ω(S ∩ M) is the curvature form of the submanifold S ∩ M.

If we had access to all the quantities in Equation 2.42, our pre-weight would then be

$$\frac{1}{\mathrm{Vol}_{d-k}(S\cap M)}=\frac{\mathbb{E}_{S\cap M}[\mathrm{Pf}(\Omega(S\cap M))]}{(2\pi)^{\frac{d-k}{2}}\chi(S\cap M)}.$$

However, as we shall see, we cannot actually implement this pre-weight, since some of these quantities represent higher-order information. To make use of this weight to the best of our ability given only the second-order information, we must separate the higher-order components of the weight from the second-order components by dividing out the higher-order components.
The Euler characteristic is essentially a higher-order property, so it is not reasonable in general to try to estimate the Euler characteristic χ(S ∩ M) using the second derivatives of M at x, because the local second-order information gives us little if any information about χ(S ∩ M) (although it may in theory be possible to say a bit more about the Euler characteristic if one has some prior knowledge of the manifold). The best we can do at this point is to assume the Euler characteristic is a constant with respect to S, or more generally, statistically independent of S.

All that remains to be done is to estimate E_{S∩M}[Pf(Ω(S ∩ M))]. We observe that

$$\mathbb{E}_{S\cap M}[\mathrm{Pf}(\Omega(S\cap M))]=\mathbb{E}_{S\cap M}[|\mathrm{Pf}(\Omega(S\cap M))|]\times\frac{\mathbb{E}_{S\cap M}[\mathrm{Pf}(\Omega(S\cap M))]}{\mathbb{E}_{S\cap M}[|\mathrm{Pf}(\Omega(S\cap M))|]}. \tag{2.43}$$

But the ratio E_{S∩M}[Pf(Ω(S∩M))] / E_{S∩M}[|Pf(Ω(S∩M))|] is also a higher-order property, since all it does is describe how much the second-order Chern-Gauss-Bonnet curvature form changes globally over the manifold, so in general we can say nothing about it using only the local second-order information. The best we can do at this point is to assume that this ratio is statistically independent of S as well.
Hence, we have:

$$\frac{1}{\mathrm{Vol}_{d-k}(S\cap M)}=\mathbb{E}_{S\cap M}[|\mathrm{Pf}(\Omega(S\cap M))|]\times\frac{\mathbb{E}_{S\cap M}[\mathrm{Pf}(\Omega(S\cap M))]}{(2\pi)^{\frac{d-k}{2}}\chi(S\cap M)\,\mathbb{E}_{S\cap M}[|\mathrm{Pf}(\Omega(S\cap M))|]}, \tag{2.44}$$

where we lose nothing by dividing out the unknown quantity (2π)^{(d−k)/2} χ(S∩M) E_{S∩M}[|Pf(Ω(S∩M))|] / E_{S∩M}[Pf(Ω(S∩M))], since we have no information about it and it is independent of S.

We would therefore like to use E_{S∩M}[|Pf(Ω(S∩M))|] as a pre-weight. Since we only know the curvature form Ω(S∩M) locally at x, our best estimate for E_{S∩M}[|Pf(Ω(S∩M))|] is the absolute value |Pf(Ω_x(S∩M))| of the Chern-Gauss-Bonnet curvature at x. Hence, our best local second-order choice for the pre-weight is ŵ = |Pf(Ω_x(S∩M))|.
2.3.5 Higher-order Chern-Gauss-Bonnet reweightings
One may consider higher-order reweightings which attempt to guess not only the second-order local intersection volume, but also make a better guess for both the Euler characteristic of the intersection S_Q ∩ M and how the curvature would vary over S_Q ∩ M. Nevertheless, higher-order approximations are probably harder to implement
for the same reason that most nonlinear solvers, such as Newton's method, do not
use higher-order derivatives. Moreover, it may not even be desirable to implement
higher-order reweightings. Indeed, if the local intersection region whose volume we
are aiming to estimate is so large that the second derivatives vary widely over this
region, then the statistic we wish to compute with our algorithm will most likely
also vary widely over this region, ensuring that different samples over this region will
contain different information about this statistic. Hence, we probably only need to
consider volume approximations that are local in a second-order sense.
2.3.6 Collection-of-spheres example and concentration-of-measure
In this section we argue that the traditional algorithms can suffer from an exponential
slowdown (exponential in the search-subspace dimension) unless we reweight the in-
tersection volumes using Theorem 3 with the Chern-Gauss-Bonnet curvature weights.
We do so by applying two results (Theorems 8 and 9) related to the concentration-
of-measure phenomenon, to an example involving a collection of hyperspheres.
Consider a collection of very many hyperspheres in ℝⁿ. We wish to sample uni-
formly from these hyperspheres. To do so, we imagine running a Markov chain with
isotropically random search-subspaces. We imagine that there are so many hyper-
spheres that a random search-subspace typically intersects exponentially many hy-
perspheres. As a first step we would use Theorem 1 which allows us to sample the
intersected hypersphere from the uniform distribution on their intersection volumes.
While using Theorem 1 should speed convergence somewhat (as discussed in Section
2.2.2), concentration-of-measure causes the intersections with the different hyper-
spheres to have very different volumes (Figure 2-6). In fact we shall see that the
variance of these volumes increases exponentially in d, causing an exponential slow-
down if only Theorem 1 is used, since the subroutine Markov chain would need to
find exponentially many subspheres before converging.
Reweighting intersection volumes using Theorem 3 causes each random intersection S ∩ M_i (where M_i is a subsphere) to have exactly the same reweighted intersection volume, regardless of the location where S intersects M_i, and regardless of d. Hence, in this example, Theorem 3 allows us to avoid the exponential slowdown in convergence speed that would arise from the variance of the intersection volumes.
The first result deals with the variance of the intersection volumes of a sphere
in Euclidean space. It says that the variance of the intersection volume, normalized
by its mean, increases exponentially with the dimension d (as long as d is not too
close to n). Although isotropically random search-subspaces are (conditional on the
radial direction) distributed according to the Haar measure in spherical space, the
Euclidean case is still of interest to us since it represents the limiting case when the
Figure 2-6: The random search-subspace S intersects a collection of spheres M_i. Even though the spheres in this example all have the same (n−1)-volume, the (d−1)-volume of the intersection of S with each individual sphere (green circles) varies greatly depending on where S intersects the sphere if d is large. In fact, the variance of the intersection volume of each intersected sphere increases exponentially with d. This "curse of dimensionality" for the intersection volume variance leads to an exponential slowdown if we wish to sample from S ∩ M with a Markov chain sampler (and S ∩ M consists of exponentially many intersected spheres). However, if we use the Chern-Gauss-Bonnet curvature to reweight the intersection volumes, then all spheres in this example will have exactly the same reweighted intersection volume, greatly increasing the convergence speed of the Markov chain sampler.
hyperspheres are small, since spherical space is locally Euclidean.
Theorem 8. (Variance resulting from concentration of Euclidean kinematic measure)
Let S ⊂ ℝⁿ be a random d-dimensional plane distributed according to the kinematic measure on ℝⁿ. Let M = S^{n−1} ⊂ ℝⁿ be the unit sphere in ℝⁿ. Defining α := d/n, we have

$$k(\alpha,d)\,e^{c(\alpha)\times d}-1\;\le\;\mathrm{Var}\!\left(\frac{\mathrm{Vol}(S\cap M)}{\mathbb{E}[\mathrm{Vol}(S\cap M)]}\right)\;\le\;K(\alpha,d)\,e^{c(\alpha)\times d}-1, \tag{2.45}$$

where

$$c(\alpha)=\log(2)+\frac{1}{\alpha}\log\frac{1}{\alpha}-\left(\frac{1}{2\alpha}+\frac{1}{2}\right)\log\left(\frac{1}{\alpha}+1\right)-\left(\frac{1}{2\alpha}-\frac{1}{2}\right)\log\left(\frac{1}{\alpha}-1\right)$$

and k(α, d), K(α, d) are explicit prefactors, given in the proof, that grow or decay at most polynomially in n and d.
Proof. Consider the unit sphere M = S^{n−1} centered at the origin. By symmetry of the sphere, the intersection M ∩ S of the unit sphere with a d-dimensional plane S is entirely determined (up to a rotation) by the plane's orthogonal complement S^⊥ that passes through the origin, and the intersection point x = S ∩ S^⊥. By symmetry, we may assume S^⊥ = ℝ^{n−d} is aligned with the first n − d coordinate axes. If S is distributed according to the kinematic measure, we must have that x is distributed uniformly on the ball B^{n−d} ⊂ ℝ^{n−d}. Hence,

$$\mathbb{P}(\|x\|\le R)=\frac{\mathrm{Vol}_{n-d}(R\,B^{n-d})}{\mathrm{Vol}_{n-d}(B^{n-d})}=R^{\,n-d},\qquad 0\le R\le 1,$$

where B^{n−d} is the unit (n−d)-ball.

The radius of S ∩ M is just √(1 − ‖x‖²), and hence Vol(S ∩ M) = c_d × (1 − ‖x‖²)^{(d−1)/2}, where c_d = dπ^{d/2}/Γ(d/2 + 1) is d times the constant in the volume formula for the d-dimensional unit ball (i.e., the surface area of the unit (d−1)-sphere). Denoting by S_r a d-plane whose associated x-value has ‖x‖ = r, we have:

$$\mathbb{E}\!\left[\mathrm{Vol}(S\cap S^{n-1})^t\right]=\int_0^1 \mathrm{Vol}(S_r\cap S^{n-1})^t\,d\mathbb{P}(r)=\int_0^1\left(c_d(1-r^2)^{\frac{d-1}{2}}\right)^t(n-d)\,r^{\,n-d-1}\,dr=(c_d)^t\,\frac{\Gamma\!\left(\frac{t(d-1)}{2}+1\right)\Gamma\!\left(\frac{n-d}{2}+1\right)}{\Gamma\!\left(\frac{t(d-1)}{2}+\frac{n-d}{2}+1\right)}, \tag{2.46}$$

where the last equality is by Gauss's theorem for the Gauss hypergeometric function. In particular,

$$\mathbb{E}\!\left[\mathrm{Vol}(S\cap S^{n-1})\right]=c_d\,\frac{\Gamma\!\left(\frac{d-1}{2}+1\right)\Gamma\!\left(\frac{n-d}{2}+1\right)}{\Gamma\!\left(\frac{n-1}{2}+1\right)} \tag{2.47}$$

and

$$\mathrm{Var}\!\left[\mathrm{Vol}(S\cap S^{n-1})\right]=\mathbb{E}\!\left[\mathrm{Vol}(S\cap S^{n-1})^2\right]-\mathbb{E}\!\left[\mathrm{Vol}(S\cap S^{n-1})\right]^2=(c_d)^2\left[\frac{\Gamma(d)\,\Gamma\!\left(\frac{n-d}{2}+1\right)}{\Gamma\!\left(\frac{n+d}{2}\right)}-\left(\frac{\Gamma\!\left(\frac{d-1}{2}+1\right)\Gamma\!\left(\frac{n-d}{2}+1\right)}{\Gamma\!\left(\frac{n-1}{2}+1\right)}\right)^2\right]. \tag{2.48}$$
Combining Equations 2.47 and 2.48 gives

$$\mathrm{Var}\!\left[\frac{\mathrm{Vol}(S\cap S^{n-1})}{\mathbb{E}[\mathrm{Vol}(S\cap S^{n-1})]}\right]=\frac{\mathbb{E}[\mathrm{Vol}(S\cap S^{n-1})^2]}{\mathbb{E}[\mathrm{Vol}(S\cap S^{n-1})]^2}-1=\frac{\Gamma(d)\,\Gamma\!\left(\frac{n-1}{2}+1\right)^2}{\Gamma\!\left(\frac{n+d}{2}\right)\Gamma\!\left(\frac{n-d}{2}+1\right)\Gamma\!\left(\frac{d-1}{2}+1\right)^2}-1. \tag{2.49}$$

Applying Stirling's formula to each Gamma factor, there is a number k = k(d, n), bounded above and below by absolute constants, such that

$$\mathrm{Var}\!\left[\frac{\mathrm{Vol}(S\cap S^{n-1})}{\mathbb{E}[\mathrm{Vol}(S\cap S^{n-1})]}\right]+1=k\times\frac{n-1}{\sqrt{(d-1)(n-d)(n+d-2)}}\times\frac{2^{\,d-1}\left(\frac{n-1}{2}\right)^{n-1}}{\left(\frac{n+d-2}{2}\right)^{\frac{n+d-2}{2}}\left(\frac{n-d}{2}\right)^{\frac{n-d}{2}}}. \tag{2.50}$$

Taking logarithms, and collecting the prefactor into A := log(k) + log(n − 1) − ½ log((d − 1)(n − d)(n + d − 2)), we obtain

$$\log\left(\mathrm{Var}\!\left[\frac{\mathrm{Vol}(S\cap S^{n-1})}{\mathbb{E}[\mathrm{Vol}(S\cap S^{n-1})]}\right]+1\right)=(d-1)\log(2)+(n-1)\log\frac{n-1}{2}-\frac{n+d-2}{2}\log\frac{n+d-2}{2}-\frac{n-d}{2}\log\frac{n-d}{2}+A. \tag{2.51}$$

Using the elementary bounds log(x) − a/(x − a) ≤ log(x − a) ≤ log(x) − a/x for 0 < a < x to replace n − 1, d − 1, and n + d − 2 by n, d, and n + d, the main (order-d) term of the right-hand side of Equation 2.51 becomes

$$d\left[\log(2)+\frac{n}{d}\log\frac{n}{d}-\left(\frac{n}{2d}+\frac{1}{2}\right)\log\left(\frac{n}{d}+1\right)-\left(\frac{n}{2d}-\frac{1}{2}\right)\log\left(\frac{n}{d}-1\right)\right], \tag{2.52}$$

with all remaining terms bounded by quantities growing at most logarithmically in n and d. Substituting α = d/n into Equation 2.52, the bracketed expression is exactly c(α), so that

$$d\times c(\alpha)+\log k(\alpha,d)\;\le\;\log\left(\mathrm{Var}\!\left[\frac{\mathrm{Vol}(S\cap S^{n-1})}{\mathbb{E}[\mathrm{Vol}(S\cap S^{n-1})]}\right]+1\right)\;\le\;d\times c(\alpha)+\log K(\alpha,d), \tag{2.53}$$

where k(α, d) and K(α, d) collect the lower-order factors above. This completes the proof of Theorem 8.  □
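The closed form in Equation 2.49 is easy to check by direct simulation. Below is a minimal sketch (not from the thesis; n, d, and the sample size are arbitrary choices) that samples the intersection radius under the kinematic measure and compares the empirical normalized variance against Equation 2.49.

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(1)
n, d, N = 20, 6, 1_000_000   # hypothetical dimensions and sample size

# Under the kinematic measure conditioned to hit the unit ball,
# the distance r from S to the origin satisfies P(r <= R) = R^(n-d).
r = rng.uniform(size=N) ** (1.0 / (n - d))
vol = (1.0 - r**2) ** ((d - 1) / 2)   # Vol(S ∩ S^{n-1}) up to the constant c_d

ratio_mc = np.var(vol) / np.mean(vol) ** 2

# Closed form from Equation 2.49 (the constant c_d cancels in the ratio):
log_ratio = (gammaln(d) + 2 * gammaln((n - 1) / 2 + 1)
             - gammaln((n + d) / 2) - gammaln((n - d) / 2 + 1)
             - 2 * gammaln((d - 1) / 2 + 1))
ratio_exact = np.exp(log_ratio) - 1.0
print(ratio_mc, ratio_exact)   # agreement up to Monte Carlo error
```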
2.3.7 Variance due to spherical-geometry kinematic measure
concentration
The next result (Theorem 9 and Figure 2-8) deals with the spherical geometry case. As in the Euclidean case, the concentration of spherical-geometry kinematic measure causes the variance of the intersection volume to increase exponentially with the dimension d as well. (While we were able to derive the analytical expression for the variance of the intersection volumes (Theorem 9), which we used to generate the plot in Figure 2-8 showing an exponential increase in variance, we have not yet finished deriving an inequality analogous to Theorem 8 for the spherical geometry case. We hope to make the analogous result available soon in [43].)
In this example M (Figure 2-7, dark blue) is the boundary of a spherical cap of the unit n-sphere Sⁿ (Figure 2-7, light blue), contained in a hyperplane M̄ a distance h from the center of Sⁿ. We want to calculate the volume of the intersection S ∩ M to show the exponentially large variance which results from the concentration of measure of these intersection volumes. S ∩ M is a dimension-(d−1) subsphere (Figure 2-7, top left, green), which we project down to a dimension-0 sphere consisting of two green points in the figures in order to save our precious 3 dimensions to illustrate other features. To compute Vol(S ∩ M), we must find the radius r of the intersection S ∩ M. As a first step we will consider the e₁ component y of the maximum point (Figure 2-7, yellow) of S (Figure 2-7, top left, red) in the e₁ direction, where e₁ is defined to be the direction orthogonal to the hyperplane M̄ containing M. By congruence of the two triangles in the bottom-left diagram, we know that y is just ‖P_S̄ e₁‖, the length of the projection of e₁ onto S̄ (Figure 2-7, red), the smallest Euclidean subspace containing S. By congruence of the two triangles in the bottom-right diagram we have that the length a of the dotted diagonal line is a = h/y. Drawing Sⁿ from a different (3-dimensional) projection (Figure 2-7, top right) that contains a diameter of S ∩ M, we see by the Pythagorean theorem that S ∩ M has radius r = √(1 − a²).
Figure 2-7: Diagrams used to obtain the volume of the intersection of a great d-sphere S ⊂ Sⁿ with the boundary of a spherical cap M ⊂ Sⁿ (diagrams arranged counterclockwise from top-left).
P_S̄ can be generated as the submatrix consisting of the first d + 1 rows of a random matrix Q sampled from the Haar measure on the special orthogonal group SO(n+1). Since the distribution of Q is invariant under action by orthogonal matrices in SO(n+1), each row of Q corresponds to a random vector on Sⁿ, and is therefore distributed according to (N₁, ..., N_{n+1})ᵀ/‖(N₁, ..., N_{n+1})‖, where N₁, ..., N_{n+1} are independent standard normals. Hence, y = ‖P_S̄ e₁‖ is distributed according to ‖(N₁, ..., N_{d+1})ᵀ‖/‖(N₁, ..., N_{n+1})ᵀ‖ = X/√(X² + Y²), where X² ~ χ²_{d+1} and Y² ~ χ²_{n−d} are independent. Hence, 1/y² = 1 + ((n−d)/(d+1))Z, where Z is F-distributed with parameters (n − d, d + 1), conditional on Z ≤ ((d+1)/(n−d))(1/h² − 1) (i.e., conditional on y ≥ h). Now we can use the fact that we know the distribution of y to obtain the variance of the intersection volume (Theorem 9 and Figure 2-8). The plot (Figure 2-8) shows that the variance of the intersection volume grows exponentially with the dimension d of S when d is not too large, and grows exponentially with the codimension n − d of S when the codimension n − d is not too large.
Theorem 9. (Variance for spherical-geometry kinematic measure)
Let M be an (n−1)-dimensional subsphere of Sⁿ ⊂ ℝ^{n+1}, such that the hyperplane containing M lies at a distance h from the center of Sⁿ. Let S be an isotropic random d-dimensional great subsphere of Sⁿ. Writing β := (n−d)/(d+1) and z_h := (1/h² − 1)/β, we have

$$\mathrm{Var}\!\left[\frac{\mathrm{Vol}(S\cap M)}{\mathbb{E}[\mathrm{Vol}(S\cap M)]}\right]=\frac{\int_0^{z_h}\!\left(1-h^2(1+\beta z)\right)^{d-1}z^{\frac{n-d}{2}-1}(1+\beta z)^{-\frac{n+1}{2}}dz\,\times\,\int_0^{z_h}\!z^{\frac{n-d}{2}-1}(1+\beta z)^{-\frac{n+1}{2}}dz}{\left(\int_0^{z_h}\!\left(1-h^2(1+\beta z)\right)^{\frac{d-1}{2}}z^{\frac{n-d}{2}-1}(1+\beta z)^{-\frac{n+1}{2}}dz\right)^2}-1. \tag{2.54}$$

Proof. From the discussion above (illustrated in Figure 2-7), we have

$$f_Z(z):=\frac{d\mathbb{P}(Z\le z)}{dz}=\frac{\Gamma\!\left(\frac{n+1}{2}\right)}{\Gamma\!\left(\frac{n-d}{2}\right)\Gamma\!\left(\frac{d+1}{2}\right)}\,\beta^{\frac{n-d}{2}}\,z^{\frac{n-d}{2}-1}(1+\beta z)^{-\frac{n+1}{2}} \tag{2.55}$$

and

$$\frac{d\,\mathbb{E}\!\left[\mathrm{Vol}(S\cap M)^t\times\mathbb{1}\{Z\le z\}\right]}{dz}=\mathrm{Vol}(S_z\cap M)^t\,f_Z(z)=(c_{d-1})^t\left(1-h^2(1+\beta z)\right)^{\frac{t(d-1)}{2}}f_Z(z), \tag{2.56}$$

where S_z denotes a great subsphere whose associated Z-value equals z, and c_{d−1} is the constant in the volume formula for the (d−1)-sphere. Hence,

$$\mathrm{Var}\!\left[\frac{\mathrm{Vol}(S\cap M)}{\mathbb{E}[\mathrm{Vol}(S\cap M)]}\right]=\frac{\mathbb{E}[\mathrm{Vol}(S\cap M)^2]}{\mathbb{E}[\mathrm{Vol}(S\cap M)]^2}-1,$$

which, after conditioning on Z ≤ z_h (so that S actually intersects M) and cancelling the normalization constants of f_Z and the factors of c_{d−1}, is exactly Equation 2.54.  □
We can now numerically evaluate the integrals in Equation 2.54 to obtain a plot (Figure 2-8) of Var(Vol(S ∩ M)/E[Vol(S ∩ M)]) for different values of d:
Figure 2-8: This log-scale plot shows the variance of Vol(S ∩ M_i) normalized by its mean, when S is an isotropic random d-dimensional great subsphere of Sⁿ, for different values of d where n = 400. M_i is taken to be the boundary of a spherical cap of the unit sphere Sⁿ with geodesic radius r(d) such that S has a 10% probability of intersecting M_i. The variance increases exponentially with the dimension d of the search-subspace (as long as d is not too close to n), leading to an exponential slowdown in the convergence for the traditional Gibbs sampling algorithm applied to the collection-of-spheres example of Section 2.3.6. Reweighting the intersection volumes with the Chern-Gauss-Bonnet curvature using Theorem 3 in this example (where M = ∪_i M_i is a collection of equal-radius subspheres M_i) causes each (nonempty) random intersection S ∩ M_i to have exactly the same reweighted intersection volume regardless of d, allowing us to avoid the exponential slowdown in the convergence speed that would otherwise arise from the variance in the intersection volumes.
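For readers who want to reproduce a point of such a plot, the following is a minimal quadrature sketch (not from the thesis' plotting code; n, d, and h are hypothetical values, and very large n or d may underflow double precision) evaluating the three integrals of Equation 2.54.

```python
import numpy as np
from scipy import integrate

def var_ratio(n, d, h):
    """Var(Vol(S∩M)/E[Vol(S∩M)]) via the integrals of Equation 2.54."""
    beta = (n - d) / (d + 1)
    zh = (1.0 / h**2 - 1.0) / beta        # conditioning threshold: y >= h
    # unnormalized conditional F(n-d, d+1) density of Z
    f = lambda z: z ** ((n - d) / 2 - 1) * (1 + beta * z) ** (-(n + 1) / 2)
    r2 = lambda z: 1 - h**2 * (1 + beta * z)   # squared intersection radius
    quad = lambda g: integrate.quad(g, 0, zh, epsabs=0, epsrel=1e-10)[0]
    m0 = quad(f)
    m1 = quad(lambda z: r2(z) ** ((d - 1) / 2) * f(z))
    m2 = quad(lambda z: r2(z) ** (d - 1) * f(z))
    return m2 * m0 / m1**2 - 1

# Hypothetical parameters; the ratio grows rapidly with d for fixed n and h.
print(var_ratio(40, 10, 0.3))
```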
2.3.8 Theoretical bounds derived using Theorem 3 and algebraic geometry

Generalizing on bounds for lower-dimensional algebraic manifolds based on the Crofton formula (such as the bounds for tubular neighborhoods in [42] and [29]), it is also possible to use Theorem 3 to get a bound for the volume of an algebraic manifold M of given degree s, as long as one can also use analytical arguments to bound the second-order Chern-Gauss-Bonnet curvature reweighting factor on M for some convenient search-subspace dimension d:
Theorem 10. Let M ⊂ ℝⁿ be an algebraic manifold of degree s and codimension 1, such that E_Q[|Pf(Ω_x(S_Q ∩ M))| × det(Proj_{M_x^⊥}Q)] ≥ b for every x ∈ M, and the conditions of Theorem 3 are satisfied if we set ŵ(x; S) = |Pf(Ω_x(S ∩ M))|. Then

$$\mathrm{Vol}(M)\le\frac{1}{c_{d,k,n,R}}\times\frac{1}{b}\times\frac{s\times(s-1)^d}{2}\times\mathrm{Vol}(S^d). \tag{2.58}$$

Proof. If we have an algebraic manifold of degree s in ℝⁿ, by Bezout's theorem the intersection with an arbitrary plane is also of degree s. Hence (at least in the case where M has codimension 1), we can use Risler's bound to bound the integral of the absolute value of the Gauss curvature over S ∩ M by a := s(s−1)^d Vol(S^d)/2 [51, 49]. By Theorem 3,

$$\mathrm{Vol}(M)=\mathbb{E}_S\!\left[\int_{S\cap M}\frac{1}{c_{d,k,n,R}}\times\frac{|\mathrm{Pf}(\Omega_x(S\cap M))|}{\mathbb{E}_Q\!\left[|\mathrm{Pf}(\Omega_x(S_Q\cap M))|\times\det(\mathrm{Proj}_{M_x^\perp}Q)\right]}\,d\mathrm{Vol}_{d-k}\right]\le\frac{1}{c_{d,k,n,R}}\times\mathbb{E}_S\!\left[\int_{S\cap M}|\mathrm{Pf}(\Omega_x(S\cap M))|\,d\mathrm{Vol}\right]\times\frac{1}{b}\le\frac{1}{c_{d,k,n,R}}\times a\times\frac{1}{b}. \qquad\square$$
Unlike a bound derived using only the Crofton formula for point intersections,
the bound in Theorem 10 allows us to incorporate additional information about the
curvature, so we suspect that this bound will be much stronger in situations where
the curvature does not vary too much in most directions over the manifold. We hope
to investigate examples of such manifolds in the future where we suspect Theorem
10 will provide stronger bounds, but do not pursue such examples here because it is
beyond the scope of this chapter.
Part II
Numerical simulations
2.4 Random matrix application: Sampling the stochastic Airy operator
Oftentimes, one would like to know the distribution of the largest eigenvalues of a ran-
dom matrix in the large-n limit, for instance when performing principal component
analysis [34]. For a large class of random matrices that includes the Gaussian or-
thogonal/unitary/symplectic ensembles, and more generally the beta-ensemble point
processes, the joint distribution of the largest eigenvalues converges in the large-
n limit, after rescaling, to the so-called soft-edge limiting distribution (the single-
largest eigenvalue's limiting distribution is the well-known Tracy-Widom distribution)
[34, 58, 20, 21]. One way to learn about these distributions is to generate samples
from certain large matrix models. One such matrix model that converges particularly
fast to the large-n limit is the tridiagonal matrix discretization of the stochastic Airy operator of Edelman and Sutton [58, 20],

$$\frac{d^2}{dx^2}-x+\frac{2}{\sqrt{\beta}}\,dW, \tag{2.59}$$

where dW is the white noise process. We wish to study the distributions of eigenvalues at the soft edge conditioned on other eigenvalue(s).
To obtain samples from these conditional distributions, we can use Algorithm 8,
which is straightforward to apply in this case since dW is already discretized as i.i.d.
Gaussians.
The stochastic Airy operator (2.59) can be discretized as the tridiagonal matrix [20, 58]

$$A_\beta=\frac{1}{h^2}\Delta-h\times\mathrm{diag}(1,2,\ldots,k)+\frac{2}{\sqrt{\beta h}}\,N, \tag{2.60}$$

where Δ = tridiag(1, −2, 1) is the k × k discretized Laplacian, N = diag(N(0,1)^k) is a diagonal matrix of independent standard normals, h = n^{−1/3} is the discretization step size, and the cutoff k is chosen (as in [20, 58]) to be k = 10n^{1/3} (the O(n^{1/3}) cutoff is due to the decay of the eigenvectors corresponding to the largest eigenvalues, which decay like the Airy function, causing only the first O(n^{1/3}) entries to be computationally significant).
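The following is a minimal sketch (not the thesis' simulation code; the function name and sample count are arbitrary) of sampling the top eigenvalues of the discretized operator in Equation 2.60.

```python
import numpy as np

def sample_soft_edge(n, beta=2, rng=None):
    """Top eigenvalues of the discretized stochastic Airy operator A_beta
    of Equation 2.60, with h = n^(-1/3) and cutoff k = 10 n^(1/3)."""
    rng = rng or np.random.default_rng()
    h = n ** (-1 / 3)
    k = int(np.ceil(10 * n ** (1 / 3)))
    # (1/h^2)*Laplacian - h*diag(1..k) + (2/sqrt(beta*h))*N on the diagonal
    main = (-2.0 / h**2 - h * np.arange(1, k + 1)
            + (2 / np.sqrt(beta * h)) * rng.standard_normal(k))
    off = np.ones(k - 1) / h**2
    A = np.diag(main) + np.diag(off, 1) + np.diag(off, -1)
    return np.linalg.eigvalsh(A)   # sorted ascending; [-1] is the largest

# The largest eigenvalue approximates the Tracy-Widom(beta) distribution:
samples = [sample_soft_edge(10**4)[-1] for _ in range(500)]
print(np.mean(samples))   # roughly -1.77 for beta = 2 (the Tracy-Widom mean)
```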
2.4.1 Approximate sampling algorithm implementation
The discretized stochastic operator A_β is a function of spherical (i.i.d.) Gaussians N (Equation 2.60). Since, conditional on their magnitude, these Gaussians are uniformly distributed on a sphere, we can use the following modification of Algorithm 8 to sample A_β conditional on our eigenvalue constraints of interest, after first independently sampling their χ_k-distributed magnitude (Figure 2-10):
Algorithm 12 Great sphere sampler (with weights)
Algorithm 12 is identical to Algorithms 4 and 8, except for the following steps:
output: x₁, x₂, ..., with associated weights w₁, w₂, ..., having (weighted) distribution π
3: Sample x_{i+1} uniformly from S ∩ M (using a subroutine Markov chain or another sampling method). Set the weight w_{i+1} = π(x_{i+1}).
To simplify the algorithm, in our simulations we will use a deterministic nonlin-
ear solver with random starting points in place of the nonlinear solver-based MCMC
"Metropolis" subroutine of Algorithm 12 to get an approximate sampling. This is
somewhat analogous to setting both the hot and cold baths in a simulated annealing-
based (see, for instance, [28]) "Metropolis step" in a Metropolis-within-Gibbs algo-
rithm to zero temperature, since we are setting the randomness of the Metropolis
subroutine to zero while fully retaining the randomness of the search-subspaces.
Figure 2-9: In Algorithm 12 the random great circle (red) intersects the constraint manifold (the blue ribbon, which represents the level set {g : λ(g) = 3} in this example) at different points, generating samples (green dots). The constraint manifold has different (differential) thickness at different points. Instead of weighting the green dots by the (differential) intersection length of the great circle and the constraint manifold at the green dot, Crofton's formula allows Algorithm 12 to instead weight it by the local differential thickness, greatly reducing the variation in the weights (see Sections 2.5 and 2.6).
Remark 6. Using a deterministic solver with random starting point in place of the more random nonlinear solver-based "Metropolis" Markov chain subroutine of Algorithm 12 introduces some bias in the samples, since the nonlinear solver probably will not find each point in the intersection S_{i+1} ∩ M ∩ rSⁿ with equal probability. There is nothing preventing us from using a more random Markov chain in place of the deterministic solver, which one would normally do. However, since we only wanted to compare weighting schemes, we can afford to use a more deterministic solver in order to simplify numerical implementation for the time being, as the implementation of the "Metropolis" step would be beyond the scope of this chapter. It is important to note that this bias is not a failure of the reweighting scheme, but rather just a consequence of using a purely deterministic solver in place of the "Metropolis" step. On the contrary, we will see in Sections 2.5 and 2.6 that this bias is in fact much smaller than the bias present when the traditional weighting scheme is used together with the same deterministic solver. In the future, we plan to also perform numerical simulations with a random "Metropolis" step in place of the deterministic solver, as described in Algorithm 12.
2.5 Conditioning on multiple eigenvalues
In the first simulation (Figure 2-10), we sampled the fourth-largest eigenvalue con-
ditioned on the remaining 1st- through 7th-largest eigenvalues. We begin with this example since in this particular situation, when conditioned only on the 3rd and 5th eigenvalues, the 4th eigenvalue is not too strongly dependent on the other eigenvalues (the intuition for this reasoning comes from the fact that the eigenvalues behave as a system of repelling particles with only weak repulsion, so the majority of the interaction involves the immediate neighbors of λ₄). Hence, in this situation, we are able to test the accuracy of the local solver approximation by comparison to brute-force rejection sampling. Of course, in a more general situation where we do not have these relatively weak conditional dependencies, rejection sampling would be prohibitively slow (e.g., even if we allow a 10% probability interval for each of the six eigenvalues, conditioning on all six eigenvalues gives a box that would be rejection-sampled with probability 10⁻⁶).
Despite the fact that the integral geometry algorithm is solving for 6 different
eigenvalues simultaneously, the conditional probability density histogram obtained
using Algorithm 12 with the integral geometry weights (Figure 2-10, blue) agrees
closely with the conditional probability density histogram obtained using rejection
sampling (Figure 2-10, black). Weighting the exact same data points obtained with
Algorithm 12 with the traditional weights instead yields a probability density his-
togram (Figure 2-10, red) that is much more skewed to the right than either the
black or blue curves. This is probably because, while theoretically unbiased, the
traditional weights greatly amplify a small bias in the nonlinear solver's selection of
intersection points.
Figure 2-10: In this simulation we used Algorithm 12 together with both the traditional weights (red) and the integral geometry weights (blue) to plot the histogram of λ₄ | (λ₁, λ₂, λ₃, λ₅, λ₆, λ₇) = (−2, −3.5, −4.65, −7.9, −9, −10.8). We also provided a histogram obtained using rejection sampling of the approximated conditioning λ₄ | (λ₃, λ₅) ∈ [−4.65 ± 0.001] × [−7.9 ± 0.001] (black) for comparison (conditioning on all six eigenvalues would have caused rejection sampling to be much too slow). Since we used a deterministic solver in place of the Metropolis subroutine in Algorithm 12, some bias is expected for both reweighting schemes. Despite this, we see that the integral geometry histogram agrees closely with the approximated rejection sampling histogram, but the traditional weights lead to an extremely skewed histogram. This is probably because, while theoretically unbiased, the traditional weights greatly amplify a small bias in the nonlinear solver's selection of intersection points. The skewness is especially large (in comparison to Figure 2-11) because we are conditioning on 6 eigenvalues simultaneously.
2.6 Conditioning on a single-eigenvalue rare event
In this set of simulations (Figure 2-11), we sampled the second-largest eigenvalue conditioned on the largest eigenvalue being equal to −2, 0, 2, and 5. Since λ₁ = 5 is a very rare event, we do not have any reasonable chance of finding a point in the intersection of the codimension-1 constraint manifold M = {λ₁ = 5} with the search-subspace unless we use a search-subspace of dimension d ≫ 1. Indeed, the analytical solution for λ₁ tells us that P(λ₁ > 2) = 1 × 10⁻⁴, P(λ₁ > 4) = 5 × 10⁻⁸ and P(λ₁ > 5) < 8 × 10⁻¹⁰ [17]. For this same reason, rejection sampling for λ₁ = 2 is very slow (58 sec./sample vs. 0.25 sec./sample for Algorithm 12) and we cannot hope to perform rejection sampling for λ₁ = 5 (it would have taken about 84 days to get a single sample!). To allow us to make a histogram in a reasonable amount of time, we will use Algorithm 12 with search-subspaces of dimension d = 23 ≫ 1, vastly increasing the probability of the random search-subspace intersecting M.
In the first plot (Figure 2-11, top), we see that while the rejection sampling (black) and integral geometry weight (blue) histograms of the density of λ₂ | λ₁ = 2 are fairly close to each other, the plot obtained with the exact same data as the blue plot but weighted in the traditional way (red) is much more skewed to the right and less smooth than both the black and blue curves, implying that using the integral geometry weights from Theorem 1 greatly reduces bias and increases the convergence speed. (The red curve is not as skewed as in Figure 2-10 of Section 2.5, probably because in this situation the codimension of M is 1, while in Section 2.5 the codimension was 6.)

In the middle plot (Figure 2-11, middle), where we conditioned instead on λ₁ = 5, we see that solving from a random starting point but not restricting oneself to a random search-subspace (purple plot) causes huge errors in the histogram of λ₂ | λ₁. We also see that, as in the case of λ₁ = 2, the plot of λ₂ obtained with the traditional weights is much more skewed to the right and less smooth than the plot obtained using the integral geometry weights.

In the bottom plot (Figure 2-11, bottom), we apply our Algorithm 12 to study the behavior of λ₂ | λ₁ for values of λ₁ at which it would be difficult to obtain accurate curves with traditional weights or rejection sampling. We see that as we move λ₁ to the right, the variance of λ₂ | λ₁ increases and the mean shifts to the right. One explanation for this is that the largest and third-largest eigenvalues normally repel the second-largest eigenvalue, squeezing it between the largest and third-largest eigenvalues, which reduces the variance of λ₂ | λ₁. Hence, moving the largest eigenvalue to the right effectively "decompresses" the probability density of the second-largest eigenvalue, increasing its variance. Moving the largest eigenvalue to the right also allows the second-largest eigenvalue's mean to move to the right by reducing the repulsion from the right caused by the largest eigenvalue.
Remark 7. As discussed in Remark 6 of Section 2.4.1, if we wanted to get a perfectly
accurate plot, we would still need to use a randomized solver, such as a subroutine
Figure 2-11: Histograms of λ₂ | λ₁ = −2, 0, 2, and 5, generated using Algorithm 12. A search-subspace of dimension d = 23 was used, allowing us to sample the rare event λ₁ = 5. In the first plot (top) we see that the rejection sampling histogram of λ₂ | λ₁ = 2 is much closer to the histogram obtained using the integral geometry weights (blue) than to the histogram obtained with the traditional weights (red), because the red plot is much more skewed to the right and less smooth (it takes longer to converge) than either the blue or black plots. If we do not constrain the solver to a random search-subspace, the histogram we get for λ₂ | λ₁ = 5 (purple) is very skewed to the right (middle plot), implying that using a random search-subspace (as opposed to just a random starting point) greatly helps in the mixing of our samples towards the correct distribution. As an application of our algorithm, in the last plot (bottom), the probability densities of λ₂ | λ₁ obtained with the integral geometry weights show that moving the largest eigenvalue to the right has the effect of increasing the variance of the probability density of λ₂ | λ₁ and moving its mean to the right, probably because the second eigenvalue feels less repulsion from the largest eigenvalue as λ₁ → ∞.
Markov chain, to randomize over the intersection points. Since d = 23, the volumes of the exponentially many connected submanifolds in the intersection S_{i+1} ∩ M would be concentrated in just a few of these submanifolds, with the concentration being exponential in d, causing the algorithm to be prohibitively slow for d = 23 unless we use Algorithm 11, which uses the Chern-Gauss-Bonnet curvature reweighting of Theorem 3 (see Section 2.3.6). Hence, if we were to implement the randomized solver of Algorithm 12, the red curve would converge extremely slowly unless we reweighted according to Theorem 3 (in addition to Theorem 1). Hence, the situation for the traditional weights is in fact much worse in comparison to the integral geometry weights of Theorems 1 and 3 than even the middle plot of Figure 2-11 would suggest.
Acknowledgements
We gratefully acknowledge support from NSF DMS-1312831. Oren Mangoubi was
supported by the Department of Defense (DoD) through the National Defense Science
& Engineering Graduate Fellowship (NDSEG) Program.
Chapter 3
Mixing Times of Hamiltonian
Monte Carlo
3.1 Introduction
Hamiltonian Monte Carlo (also called Hybrid Monte Carlo or HMC) algorithms are
some of the most widely-used [26, 14, 44] MCMC algorithms. In this chapter we derive
lower bounds for the mixing times of a large class of Hamiltonian Monte Carlo al-
gorithms sampling from an arbitrary probability density π, including the traditional
Isotropic-Momentum HMC algorithm [19], Riemannian Manifold HMC [27, 25] and
the No-U-Turn Sampler [32], the workhorse of the popular Bayesian software package
Stan [10] (Section 3.2). We do so by applying the continuity equation, a generaliza-
tion of the divergence theorem used extensively in fluid mechanics. For comparison,
we also prove lower bounds for the mixing times of the Random Walk Metropolis
MCMC algorithm (Section 3.4).
Since true mixing times, in the narrow sense of the word, usually do not exist for continuous state-space Markov chains, we use the term "mixing time" here in the broader sense to refer to the relaxation time t_rel := 1/ρ, defined as the inverse of the spectral gap ρ; the term "mixing times" is oftentimes used more loosely to include a variety of measures of convergence times, including relaxation times [40]. Denoting the Markov transition kernel by P(·, ·), the spectral gap ρ is the largest number such that

$$\|P\mu\|_{L^2(\pi)}\le(1-\rho)\,\|\mu\|_{L^2(\pi)}$$

for any signed measure μ of total mass zero [52]. If P has a second-largest eigenvalue λ₂ (for example, if P is a matrix of a finite state space Markov chain), then ρ = 1 − λ₂. Geometric ergodicity of HMC algorithms was proved under very general conditions in [41, 9], implying existence of a non-zero spectral gap under those conditions [52].

Cheeger's inequality [11, 37, 55] provides bounds for the spectral gap in terms of the bottleneck ratio Φ(S) of a subset S of the state space, a quantity proportional to the probability that the Markov chain at stationary distribution transitions between S and S^c:

$$\frac{(\Phi^*)^2}{2}\le\rho\le 2\Phi^*, \tag{3.1}$$

where Φ* := min_{S⊂ℝ^d} Φ(S). In Section 3.2, we derive bounds for the spectral gap by using the symplectic volume-conservation properties of the Hamiltonian in the phase space to obtain an equation for the bottleneck ratio [40] and then applying Cheeger's inequality to bound ρ.
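The quantities above are easy to compute exactly for a finite-state chain. The following is a minimal sketch (not from the thesis; the density, grid, and set S are hypothetical choices) that builds a discretized Metropolis random walk on a two-mode density, computes its spectral gap by diagonalization, and compares it with the Cheeger bounds of Equation 3.1 for the halfway-cut set.

```python
import numpy as np

xs = np.linspace(-6, 6, 241)
a = 2.0   # hypothetical half-distance between the two modes
pi = 0.5 * (np.exp(-(xs - a) ** 2 / 2) + np.exp(-(xs + a) ** 2 / 2))
pi /= pi.sum()

m = len(xs)
P = np.zeros((m, m))
for i in range(m):
    for j in (i - 1, i + 1):            # nearest-neighbor proposal
        if 0 <= j < m:
            P[i, j] = 0.5 * min(1.0, pi[j] / pi[i])   # Metropolis acceptance
    P[i, i] = 1.0 - P[i].sum()          # rejected proposals stay put

# Reversibility makes D P D^{-1} symmetric for D = diag(sqrt(pi)),
# so the eigenvalues of P are real and eigvalsh applies.
D = np.sqrt(pi)
evals = np.linalg.eigvalsh((D[:, None] * P) / D[None, :])
rho = 1.0 - evals[-2]                   # spectral gap = 1 - lambda_2

# Bottleneck ratio of S = {x < 0}, the natural bottleneck set here.
S = xs < 0
phi = (pi[S][:, None] * P[S][:, ~S]).sum() / pi[S].sum()
print(rho, phi**2 / 2, 2 * phi)  # rho should lie between the two Cheeger bounds
```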
3.2 Hamiltonian Monte Carlo mixing times
In this section we derive equations for the bottleneck ratio. First, we define the
following terms:
" Nas and Ng,) are the number of times a random trajectory intersects aS and
an E-ball of (q, p), respectively.
* Pq is the component of p in the direction orthogonal to oS at q.
" Ps+(q) is the half-space of momentum vectors pointing away from S at q E aS.
Definition 3. (of Q and Qqp))
Let F be the probability measure on the random trajectories -y at stationary
distribution conditioned to intersect aS at least once. Define Q to be the probability
measure whose density is proportional to dP(-y) - Nas(7y).
72
Similarly, let lP,,,) be the probability measure on the random trajectories -y at sta-
tionary distribution conditioned to intersect BE(q, p) nS x Rd at least once. Let Q'qp)
to be the probability measure whose density is proportional to dP',)(Y) N6 4,,(-).
Theorem 11. (Isotropic-Momentum and Riemannian Manifold HMC)
Let π(q) be any probability density on ℝ^d. Let S ⊂ ℝ^d be any subset of the position space. Then the bottleneck ratio for an HMC Markov chain with fixed trajectory evolution time T and any smooth phase-space stationary distribution π(q, p) satisfies

$$\Phi(S)=\Phi^+\cdot\mathbb{E}_{\mathbb{Q}}\!\left[\frac{1}{N_{\partial S}}\cdot\mathbb{1}\{N_{\partial S}\ \mathrm{odd}\}\right]\Big/\ \pi(S), \tag{3.2}$$

where the total positive flux Φ⁺ is

$$\Phi^+=T\cdot\int_{\partial S}\int_{P^+_{S^c}(q)}\left|\frac{\partial\log(\pi(q,p))}{\partial p_q}\right|\cdot\pi(q,p)\,dp\,dq.$$

In the case of Isotropic-Momentum HMC, Φ⁺ reduces to

$$\Phi^+=\frac{T}{\sqrt{2\pi}}\int_{\partial S}\pi(q)\,dq.$$

The term E_Q[(1/N_{∂S}) · 1{N_{∂S} odd}] can be interpreted as the average periodicity of the Hamiltonian trajectories comprising each iteration of the algorithm. Observing that E_Q[(1/N_{∂S}) · 1{N_{∂S} odd}] ≤ 1 gives an upper bound for the bottleneck ratio. Numerical simulations for various two-mode densities that approximate Gaussian mixture models (Figure 3-1) suggest that this bound is nearly tight in many cases where T is not too large. To extend Theorem 11 to algorithms with non-fixed trajectory time, such as No-U-Turn HMC, we express T as a function T(q, p):

Theorem 12. (General HMC, including No-U-Turn HMC)
Let S ⊂ ℝ^d be any subset of the position space. Then the bottleneck ratio Φ(S) of an HMC Markov chain with trajectory evolution time T(q, p) is

$$\Phi(S)=\int_{\partial S}\int_{P^+_{S^c}(q)}\pi(q,p)\cdot\left|\frac{\partial\log(\pi(q,p))}{\partial p_q}\right|\cdot T(q,p)\cdot\lim_{\epsilon\downarrow 0}\mathbb{E}_{\mathbb{Q}^\epsilon_{(q,p)}}\!\left[\frac{1}{N_{\partial S}}\cdot\mathbb{1}\{N_{\partial S}\ \mathrm{odd}\}\right]dp\,dq\ \Big/\ \pi(S). \tag{3.3}$$
The proofs of Theorems 11 and 12 make use of the continuity equation, a generalization of the divergence theorem that says that the change in probability measure of a subset S of the position space at any given time is equal to the flux of the probability measure flowing through ∂S [50, 33]. Also essential is the fact that the symplectic phase-space volume preserving properties of Hamiltonian mechanics imply that the stationary distribution of HMC is equal to the invariant measure of Hamilton's equations at every point in time of the trajectory's evolution.
Proof. (of Theorem 11)
Define Φ⁺ to be the total flux flowing into (but not out of) S^c during one step of the algorithm (or equivalently, by reversibility, the flux into S).

First, note that by reversibility, P(N_{∂S} = n, γ(0) ∈ S) = P(N_{∂S} = n, γ(0) ∈ S^c). Hence, P(N_{∂S} = n) = P(N_{∂S} = n, γ(0) ∈ S) + P(N_{∂S} = n, γ(0) ∈ S^c) = 2P(N_{∂S} = n, γ(0) ∈ S), i.e.,

$$\mathbb{P}(N_{\partial S}=n,\ \gamma(0)\in S)=\mathbb{P}(N_{\partial S}=n,\ \gamma(0)\in S^c)=\frac{1}{2}\mathbb{P}(N_{\partial S}=n). \tag{3.4}$$

Define the measure Q̃ by dQ̃(γ) := N_{∂S}(γ) · dP̄(γ) at every trajectory γ. By the continuity equation,

$$\Phi^+=\sum_{n=1,3,\ldots}\frac{n+1}{2}\,\mathbb{P}(N_{\partial S}=n,\ \gamma(0)\in S)+\sum_{n=1,3,\ldots}\frac{n-1}{2}\,\mathbb{P}(N_{\partial S}=n,\ \gamma(0)\in S^c)$$
$$\qquad+\sum_{n=2,4,\ldots}\frac{n}{2}\,\mathbb{P}(N_{\partial S}=n,\ \gamma(0)\in S)+\sum_{n=2,4,\ldots}\frac{n}{2}\,\mathbb{P}(N_{\partial S}=n,\ \gamma(0)\in S^c)=\sum_{n=1,2,3,\ldots}\frac{n}{2}\,\mathbb{P}(N_{\partial S}=n).$$

Hence Σ_n (n/(2Φ⁺)) P(N_{∂S} = n) = 1, so Q̃/(2Φ⁺) is a probability measure, and hence Q = Q̃/(2Φ⁺).

Therefore,

$$\mathbb{P}(\gamma(T)\in S^c,\ \gamma(0)\in S)=\sum_{n=1,3,\ldots}\mathbb{P}(N_{\partial S}=n,\ \gamma(0)\in S)=\frac{1}{2}\sum_{n=1,3,\ldots}\mathbb{P}(N_{\partial S}=n)$$
$$=\Phi^+\sum_{n=1,3,\ldots}\frac{1}{n}\,\mathbb{Q}(N_{\partial S}=n)=\Phi^+\cdot\mathbb{E}_{\mathbb{Q}}\!\left[\frac{1}{N_{\partial S}}\cdot\mathbb{1}\{N_{\partial S}\ \mathrm{odd}\}\right].$$

All that remains to be done is to compute Φ⁺. Towards this end, let v⁺(q, p) be the velocity of the Hamiltonian trajectory if it is flowing away from S, and zero otherwise.

At any time t the time derivative of the total flux into S^c is:

$$\frac{d\Phi^+_t}{dt}=\int_{\partial S}\int_{\mathbb{R}^d}\pi(q,p)\cdot v^+(q,p)^\top\eta(q)\,dp\,dq, \tag{3.5}$$

where η(q) is the unit normal of ∂S at q pointing away from S; by the symplectic volume-conservation of the Hamiltonian flow this derivative is the same at every t, and hence

$$\Phi^+=\int_0^T\!\!\int_{\partial S}\int_{\mathbb{R}^d}\pi(q,p)\cdot v^+(q,p)^\top\eta(q)\,dp\,dq\,dt=T\cdot\int_{\partial S}\int_{\mathbb{R}^d}\pi(q,p)\cdot v^+(q,p)^\top\eta(q)\,dp\,dq. \tag{3.6}$$

Applying Hamilton's equations gives

$$v^+(q,p)^\top\eta(q)=\left|\frac{\partial\log(\pi(q,p))}{\partial p_q}\right|\cdot\mathbb{1}_{P^+_{S^c}(q)}(p),$$

so

$$\Phi^+=T\cdot\int_{\partial S}\int_{P^+_{S^c}(q)}\pi(q,p)\cdot\left|\frac{\partial\log(\pi(q,p))}{\partial p_q}\right|\,dp\,dq. \tag{3.7}$$

In the case of Isotropic-Momentum HMC, where the momentum is a standard Gaussian independent of q, Equation 3.7 simplifies to

$$\Phi^+=T\cdot\int_{\partial S}\pi(q)\left(\int_0^\infty \frac{y\,e^{-y^2/2}}{\sqrt{2\pi}}\,dy\right)dq=\frac{T}{\sqrt{2\pi}}\int_{\partial S}\pi(q)\,dq. \tag{3.8}\qquad\square$$
Proof. (of Theorem 12)
Define Φ⁺_n to be the total flux going into S^c during one step of the algorithm (or equivalently, by reversibility, the flux into S) due to only those trajectories for which N_{∂S} = n. Define Φ⁺_{(q,p),ε} to be the total flux flowing into S^c during one step of the algorithm through B_ε(q, p) ∩ (∂S × ℝ^d), and let Φ⁺_{(q,p)} := lim_{ε↓0} Φ⁺_{(q,p),ε}/Vol(B_ε(q, p)) denote the corresponding flux density. Then

$$\mathbb{P}(\gamma(T)\in S^c,\ \gamma(0)\in S)=\sum_{n=1,3,\ldots}\mathbb{P}(N_{\partial S}=n,\ \gamma(0)\in S)=\sum_{n=1,3,\ldots}\frac{1}{n}\,\Phi^+_n \quad\text{(by the continuity equation)}$$
$$=\sum_{n=1,3,\ldots}\frac{1}{n}\int_{\partial S}\int_{\mathbb{R}^d}\Phi^+_{(q,p)}\,\lim_{\epsilon\downarrow 0}\mathbb{Q}^\epsilon_{(q,p)}(N_{\partial S}=n)\,dp\,dq \quad\text{(by the law of total probability)}$$
$$=\int_{\partial S}\int_{\mathbb{R}^d}\Phi^+_{(q,p)}\,\lim_{\epsilon\downarrow 0}\sum_{n=1,3,\ldots}\frac{1}{n}\,\mathbb{Q}^\epsilon_{(q,p)}(N_{\partial S}=n)\,dp\,dq \quad\text{(by the monotone convergence theorem)}$$
$$=\int_{\partial S}\int_{\mathbb{R}^d}\Phi^+_{(q,p)}\,\lim_{\epsilon\downarrow 0}\mathbb{E}_{\mathbb{Q}^\epsilon_{(q,p)}}\!\left[\frac{1}{N_{\partial S}}\cdot\mathbb{1}\{N_{\partial S}\ \mathrm{odd}\}\right]dp\,dq.$$

All that remains to be done is to compute Φ⁺_{(q,p)}. Towards this end, let v⁺(q, p) be the velocity of the Hamiltonian trajectory if it is flowing away from S, and zero otherwise.

At any time t the time derivative of the flux density into S^c is:

$$\frac{d\Phi^+_{(q,p),t}}{dt}=\pi(q,p)\cdot\mathbb{1}\{t\le T(q,p)\}\cdot v^+(q,p)^\top\eta(q), \tag{3.9}$$

and hence

$$\Phi^+_{(q,p)}=\int_0^\infty \pi(q,p)\cdot\mathbb{1}\{t\le T(q,p)\}\cdot v^+(q,p)^\top\eta(q)\,dt=T(q,p)\cdot\pi(q,p)\cdot v^+(q,p)^\top\eta(q). \tag{3.10}$$

Applying Hamilton's equations gives

$$v^+(q,p)^\top\eta(q)=\left|\frac{\partial\log(\pi(q,p))}{\partial p_q}\right|\cdot\mathbb{1}_{P^+_{S^c}(q)}(p),$$

so

$$\Phi^+_{(q,p)}=T(q,p)\cdot\pi(q,p)\cdot\left|\frac{\partial\log(\pi(q,p))}{\partial p_q}\right|\cdot\mathbb{1}_{P^+_{S^c}(q)}(p). \tag{3.11}\qquad\square$$
One consequence of this bound is that the time it takes energy-conserving Hamil-
tonian Markov chains to search for far-apart sub-Gaussian modes grows exponentially
with both the dimension and the distance between modes, resolving an open question
posed by Prof. Neil Shephard at Harvard.
Equation 3.3 also suggests that an optimal HMC algorithm should minimize a
particular function of the periodicity of each Hamiltonian trajectory. In the future,
we plan to investigate to what extent the No-U-Turn algorithm, the default algorithm
in the widely-used Stan software [10], approaches this optimality, and whether one
can design a better algorithm closer to optimality.
In many applications, the following bound for the Bottleneck ratio of HMC may
also be helpful:
Theorem 13. Let S ⊂ ℝ^d be any subset of the state space. Then the bottleneck ratio Φ(S) of an HMC Markov chain (including Isotropic-Momentum, Riemannian Manifold and No-U-Turn HMC) with any stationary distribution π satisfies

$$\Phi(S)\le\int_S\left[1-F_{\chi^2_d}(E-U(q))\right]\pi(q)\,dq\ \Big/\ \pi(S), \tag{3.12}$$

where U(q) := −log(π(q)), E := min_{q∈∂S} U(q), and F_{χ²_d} is the CDF of the χ²_d random variable.

Proof. No trajectory starting at q₀ ∈ S can exit S if it does not start with energy at least E := min_{q∈∂S} U(q). Hence, the probability of exiting S starting at the stationary distribution π must be at most P_π({H(q₀, p₀) ≥ E} ∩ {q₀ ∈ S}) = ∫_S (1 − F_{χ²_d}(E − U(q))) π(q) dq.  □
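The bound of Equation 3.12 is straightforward to evaluate numerically. The following is a minimal sketch (not from the thesis; the one-dimensional two-mode density, the mode separation a, and the set S are hypothetical choices) that computes the right-hand side of Equation 3.12 as stated.

```python
import numpy as np
from scipy.stats import chi2
from scipy import integrate

d, a = 1, 3.0   # dimension and hypothetical half-distance between modes
U = lambda q: -np.log(0.5 * (np.exp(-(q - a) ** 2 / 2)
                             + np.exp(-(q + a) ** 2 / 2)) / np.sqrt(2 * np.pi))
pi = lambda q: np.exp(-U(q))

# S = {q < 0}; its boundary is {0}, so E = U(0) is the energy barrier.
E = U(0.0)
integrand = lambda q: (1 - chi2(df=d).cdf(max(E - U(q), 0.0))) * pi(q)
num = integrate.quad(integrand, -np.inf, 0)[0]
den = integrate.quad(pi, -np.inf, 0)[0]
print(num / den)   # upper bound on the bottleneck ratio Phi(S)
```

For large a the barrier E grows like a²/2, so the printed bound decays exponentially in a², matching the behavior seen in Figure 3-1.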
Finally, we provide a simulation (Figure 3-1) of isotropic HMC sampling of a two-
mode density. The results of the simulation agree closely with Theorems 11 and 12,
and illustrate the various components of Equations 3.3 and 3.2.
3.3 Cheeger bounds in extreme cases
If the mean step size T = ε is small, the spectral gap is close to the lower Cheeger bound (Φ*)²/2. Why is this true? First of all, as ε ↓ 0 the expectation in the integrand of Theorem 12 approaches 1 (i.e., there are no "u-turns" as ε ↓ 0 since all paths are nearly
Figure 3-1: In this simulation we computed the spectral gap for the Isotropic-Momentum HMC algorithm with the stationary distribution π(q) = (1/Z) max(f_{N(0,1)}(q − a), f_{N(0,1)}(q + a)), for different distances 2a between the two modes and different Hamiltonian trajectory times T. As predicted in Theorem 11, the spectral gap is bounded above by a linear function of T, and in fact increases linearly with T when a = 0 and T is not too large. The approximate periodicity in T is due to the fact that the trajectories here have period ≥ 2π, meaning that the conditional expectation term in Equation 3.2 varies (approximately) periodically with T. The exponential decay in a² is explained by the fact that π(0) ∝ f_{N(0,1)}(a), so the corresponding term in Equation 3.2 (if we choose S = {q < 0}, so that ∂S = {0} is the halfway point between the two modes) decreases exponentially in a². Note that π(q) approximates the Gaussian mixture model φ(q) = ½f_{N(0,1)}(q − a) + ½f_{N(0,1)}(q + a); indeed the distance between π and φ tends to 0 as a → ∞. The plots were generated by numerically diagonalizing an analytical solution for the transition matrix of the HMC Markov chain.
straight and oS is nearly straight on the scale of E). Hence, ignoring constant factors,
by Theorem 12, 4, ~ e for small enough e.
If we multiply the integration time T = ε by n, the algorithm (which approximates a diffusion for small T = ε) travels on average n times the distance in one step. However, if we instead take n independent steps of size ε in succession, the mean distance traveled is only √n · ε, so we need to take n^2 steps of size ε to achieve the same average displacement as one step of size nε. Indeed, if Φ = ε, then the bound on the spectral gap is ε^2, but if Φ = nε then the bound is n^2 ε^2. So, to achieve a spectral gap of n^2 ε^2 using steps of size ε, we would need to apply the transition matrix n^2 times (i.e., take n^2 steps): λ^{n^2} ≈ (1 − ε^2)^{n^2} = 1 − n^2 ε^2 + l.o.t. ≈ 1 − n^2 ε^2.
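This diffusive scaling is easy to see in simulation. The following sketch is an illustrative aside of ours (the dimension, step size, and trial count are arbitrary choices): it compares one ballistic step of length nε against n independent steps of size ε.

import numpy as np

# Ballistic vs. diffusive scaling: one step of size n*eps covers distance
# n*eps, while n independent steps of size eps cover about sqrt(n)*eps.

rng = np.random.default_rng(0)
eps, n, trials, d = 0.1, 100, 10000, 2

steps = eps * rng.standard_normal((trials, n, d))
mean_diffusive = np.linalg.norm(steps.sum(axis=1), axis=1).mean()

print("ballistic, one step of size n*eps :", n * eps)
print("diffusive, n independent eps-steps:", mean_diffusive)  # ~ sqrt(n)*eps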
On the other extreme, if there are two sets S⋆ and S⋆^c for which Φ(S⋆) is much smaller than the inverse relaxation time of the chain restricted to either S⋆ or S⋆^c by itself, then we would expect the spectral gap for the entire space S⋆ ∪ S⋆^c to be close to Φ(S⋆). Indeed, this behavior has been proved in the case of a lazy random walk on two discrete tori glued together at a single vertex [40]. We conjecture that this behavior will occur for HMC as well, for instance, in the case of a two-mode density with a deep valley, with one mode in the region S⋆ and the other mode in S⋆^c (assuming T is not too small).
3.4 Random Walk Metropolis mixing times
In this section we show two bounds for the RWM algorithm. Together, the bounds suggest that the relaxation time of RWM grows exponentially with both the dimension d and the distance between modes, even if one uses a mean step size ε that is optimal. Theorem 14 shows that the relaxation time grows with both the dimension d and with the distance between modes for small ε. Theorem 15 shows that the relaxation time grows with the dimension d for all other (i.e., non-small) ε.
Theorem 14. Consider a RWM Markov chain q_1, q_2, ... with proposal distribution q_{i+1} − q_i ~ N(0, ε)^d for every i ∈ N. Let S be a subset of the state space. Then

Φ(S) ≤ 2 ∫_0^∞ π(∂S + B_r) f_{χ_d}(r/ε) d(r/ε) / π(S),   (3.13)

where ∂S + B_r := {x + y : x ∈ ∂S, y ∈ B_r} is the Minkowski sum, and B_r is an r-ball centered at the origin.

In particular, for every r > 0, we have

Φ(S) ≤ 2 [π(∂S + B_r) · F_{χ_d}(r/ε) + (1 − F_{χ_d}(r/ε))] / π(S).   (3.14)

Proof. Let q_1, q_2, ... be a RWM Markov chain. We wish to bound Φ(S) = P_π(q_{i+1} ∈ S^c, q_i ∈ S)/π(S).

Suppose that the step size satisfies ‖q_{i+1} − q_i‖ ≤ r. Then whenever both q_{i+1} ∈ S^c and q_i ∈ S, it must be true that either q_i ∈ ∂S + B_r or q_{i+1} ∈ ∂S + B_r. Hence,

P(q_{i+1} ∈ S^c, q_i ∈ S | ‖q_{i+1} − q_i‖ ≤ r) ≤ P(q_i ∈ ∂S + B_r) + P(q_{i+1} ∈ ∂S + B_r) = 2π(∂S + B_r).   (3.15)

Equations 3.13 and 3.14 now follow directly from the law of total probability.  □
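As a one-dimensional sanity check (ours, for illustration only; the standard-Gaussian target, the set S = (−∞, 0], and the choice r = 2ε are assumptions made for this example), one can compare an empirical estimate of Φ(S) for RWM against the bound of Equation 3.14:

import numpy as np
from scipy import stats

# Empirical bottleneck ratio of RWM for pi = N(0,1) and S = (-inf, 0],
# compared with the bound (3.14); here dS = {0}, so dS + B_r = [-r, r].

rng = np.random.default_rng(0)
eps, n, d = 0.3, 200_000, 1

q = rng.standard_normal(n)                        # q_i drawn from stationarity
proposal = q + eps * rng.standard_normal(n)       # RWM proposal
accept = rng.random(n) < np.exp((q**2 - proposal**2) / 2.0)
q_next = np.where(accept, proposal, q)

phi_hat = np.mean((q <= 0) & (q_next > 0)) / np.mean(q <= 0)

r = 2.0 * eps                                     # any r > 0 gives a valid bound
pi_band = stats.norm.cdf(r) - stats.norm.cdf(-r)  # pi(dS + B_r)
F = stats.chi.cdf(r / eps, df=d)                  # step length / eps ~ chi_d
bound = 2.0 * (pi_band * F + (1.0 - F)) / stats.norm.cdf(0.0)

print("empirical Phi(S):", phi_hat, "  bound (3.14):", bound)

The bound is loose for this ε, but it is valid for every r > 0 simultaneously, so one is free to optimize over r.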
Theorem 15. Let π(q) = Σ_{k=1}^{k_max} c_k π_k(q), Σ_k c_k = 1, be a Gaussian mixture model, where π_k has covariance matrix Σ_k and mean a_k. Let the RWM proposal have covariance matrix ΔΣ.

Then, for every η, δ > 0,

P_accept ≤ δ/η + min_k Σ_{j=1}^{k_max} c_j φ_{j,k}(η) + Σ_{k=1}^{k_max} Σ_{j=1}^{k_max} c_j ψ_{j,k}(δ),

where

φ_{j,k}(η) := exp( −[min{ −2|Σ_j| d^{3/2} + √(4|Σ_j|^2 d^3 − 8|Σ_j| (|Σ_j| d − (π_k^{−1}(η/c_k))^2)), 0 }]^2 / (4|Σ_j|) ),

ψ_{j,k}(δ) := exp( −[min{ |Σ_j + ΔΣ| d − (π_k^{−1}(δ/(c_k k_max)))^2, 0 }]^2 / (4d |Σ_j + ΔΣ|^2) ),

and

(π_k^{−1}(t))^2 := −2|Σ_k| log((2π)^{d/2} |Σ_k|^{1/2} t).
Proof. Let X be distributed according to the stationary distribution (that is, an index j is sampled at random with probability c_j, and then a Gaussian random variable X = X_j is sampled from X_j ~ π_j), and let Y be the independent random jump proposal (so that the next proposed move is the point X + Y).

Then

P_accept = E[min{1, π(X+Y)/π(X)}] ≤ δ/η + P(π(X) < η) + P(π(X+Y) > δ),

since min{1, π(X+Y)/π(X)} ≤ δ/η on the event {π(X) ≥ η} ∩ {π(X+Y) ≤ δ}.

Now, {Σ_k c_k π_k(X) < η} ⊆ {c_k π_k(X) < η} for every k, so

P(π(X) < η) ≤ min_k P(c_k π_k(X) < η) = min_k Σ_j c_j P(π_k(X_j) < η/c_k).

By Equation 4.3 of [36], if Z ~ χ_d, then

P(|Σ_j| Z^2 > t) ≤ exp( −[min{ −2|Σ_j| d^{3/2} + √(4|Σ_j|^2 d^3 − 8|Σ_j| (|Σ_j| d − t)), 0 }]^2 / (4|Σ_j|) ),

so

P(π_k(X_j) < η/c_k) = P(|Σ_j| Z^2 > (π_k^{−1}(η/c_k))^2) ≤ φ_{j,k}(η).

All that remains to be done is to bound P(π(X+Y) > δ). To do so we observe that {Σ_k c_k π_k(X+Y) > δ} ⊆ ∪_k {c_k π_k(X+Y) > δ/k_max}, so

P(π(X+Y) > δ) ≤ Σ_k P(π_k(X+Y) > δ/(c_k k_max)) = Σ_k Σ_j c_j P(π_k(X_j + Y) > δ/(c_k k_max)),

but (for fixed covariance determinants |Σ_k| and |Σ_j + ΔΣ|), P(π_k(X_j + Y) > δ/(c_k k_max)) is maximized if the means of X_j + Y and π_k are the same and the covariance matrices are both multiples of the identity matrix. Hence, we may replace ‖X_j + Y‖ with |Σ_j + ΔΣ| Z without decreasing P(π_k(X_j + Y) > δ/(c_k k_max)). By Equation 4.4 of [36],

P(|Σ_j + ΔΣ| Z^2 < t) ≤ exp( −(min{|Σ_j + ΔΣ| d − t, 0})^2 / (4d |Σ_j + ΔΣ|^2) ),

so

P(π_k(X_j + Y) > δ/(c_k k_max)) = P(|Σ_j + ΔΣ| Z^2 < (π_k^{−1}(δ/(c_k k_max)))^2) ≤ ψ_{j,k}(δ).  □
Acknowledgements

We gratefully acknowledge support from ONR 14-0001 and NSF DMS-1312831.
Chapter 4
A Generalization of Crofton's
Formula to Hamiltonian Dynamics,
with Applications to Hamiltonian
Monte Carlo
4.1 Introduction
In this chapter we prove generalizations of Crofton's formula (Theorems 16 and 17)
that apply to particle trajectories in Hamiltonian Dynamics. We then use one of
these new formulae (Theorem 16) to increase the efficiency of computing integrals
over codimension-1 manifolds with Hamiltonian Monte Carlo algorithms.
4.2 Crofton formulae for Hamiltonian dynamics
Theorem 16. Let M be a codimension-1 submanifold of the position space R^n. Let γ be a random Hamiltonian trajectory with Hamiltonian energy functional H(q,p) = U(q) + ½ p^T p. Then

∫_M π(q) dq = (c/T) · E[N_M],   (4.1)

where N_M is the number of times γ intersects M (counted with multiplicity) and c := C_{1,n−1,n,R} is the same constant used in the classical Crofton's formula (Lemma 1).
Remark 8. If M = ∂S, then Theorem 16 gives the expected number of times that γ crosses from S to S^c or vice versa. If the stationary density π and integration time T are such that γ never crosses ∂S more than once, then Theorem 16 and Theorem 11 imply each other in this special case.
Remark 9. The classical Crofton formula for lines (Lemma 1 with k = 1) can be viewed as a corollary of Theorem 16: if we choose the Hamiltonian potential to be uniform over a compact set Q, the Hamiltonian trajectories will be composed of lines that move with the kinematic measure, conditioned to intersect Q. Since the potential is uniform, we have ∫_M π(q) dq · Vol(Q) = Vol(M), the value on the LHS of the classical Crofton formula.
Proof. (of Theorem 16)

Let γ be a random Hamiltonian trajectory over time [0, T] with position and momentum at time 0 sampled from the stationary density π(q,p) = π(q) · f_{N(0,1)^n}(p). Let N_M be the number of intersection points of γ with M, and let (q_i, p_i) be the phase space coordinates of γ at its i'th intersection point with M.
Since Hamiltonian flow preserves the stationary distribution, γ is at the stationary distribution at any time t ∈ [0, T], so for any function h we have

∫_M ∫_{R^n} h(q,p) π(q,p) dp dq = (1/T) · E[ Σ_{i=1}^{N_M} h(q_i, p_i) / ‖proj_{M^T} p_i‖ ],   (4.2)

where "proj_{M^T}" denotes the projection onto the normal vector to M at q.

Also, since the marginal stationary density of the momentum f_{N(0,1)^n}(p) is independent of the position q and is rotationally invariant with respect to p, for every q ∈ M we have

∫_{R^n} ‖proj_{M^T} p‖ · f_{N(0,1)^n}(p) dp = ∫_{R^n} ‖proj_{e_1} p‖ · f_{N(0,1)^n}(p) dp = 1/c,   (4.3)

where e_1 := (1, 0, ..., 0)^T is a coordinate vector.
Hence, setting h(q,p) := ‖proj_{M^T} p‖, we get

∫_M π(q) dq = ∫_M π(q) · 1 dq
    = ∫_M π(q) · c · ∫_{R^n} ‖proj_{M^T} p‖ · f_{N(0,1)^n}(p) dp dq   (by Eq. 4.3)
    = c · ∫_M ∫_{R^n} ‖proj_{M^T} p‖ · π(q) · f_{N(0,1)^n}(p) dp dq
    = c · ∫_M ∫_{R^n} ‖proj_{M^T} p‖ · π(q,p) dp dq
    = (c/T) · E[ Σ_{i=1}^{N_M} ‖proj_{M^T} p_i‖ / ‖proj_{M^T} p_i‖ ]   (by Eq. 4.2)
    = (c/T) · E[N_M].  □
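The constant in Equation 4.3 admits a quick numerical sanity check, since ‖proj_{e_1} p‖ for p ~ N(0,1)^n is just the absolute value of a one-dimensional standard Gaussian, whose mean is √(2/π). The following small illustration is ours (the choice n = 5 is arbitrary):

import numpy as np

# Numerical check of Eq. (4.3): the normal component of an isotropic Gaussian
# momentum has mean length E|N(0,1)| = sqrt(2/pi), i.e., 1/c = sqrt(2/pi).

rng = np.random.default_rng(0)
p = rng.standard_normal((1_000_000, 5))     # momenta in R^5; n is arbitrary
e1 = np.zeros(5); e1[0] = 1.0               # any unit normal to M at q
print(np.abs(p @ e1).mean(), np.sqrt(2.0 / np.pi))   # both ~ 0.79788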
More generally, for arbitrary Hamiltonians we have

Theorem 17. Let M be a codimension-1 submanifold of the position space R^n. Let γ be a random Hamiltonian trajectory with arbitrary Hamiltonian energy functional H(q,p). Then

∫_M π(q) dq = (1/T) · E[ Σ_{i=1}^{N_M} π(q_i) / ∫_{R^n} ‖v^⊥(q_i, p)‖ π(q_i, p) dp ],   (4.4)

where q_i is the position where γ intersects M for the i'th time, and N_M is the number of intersections (counted with multiplicity). v(q,p) := dq/dt = ∂H/∂p is the velocity given by Hamilton's equations at position q and momentum p, and ‖v^⊥(q_i, p)‖ is the magnitude of the component of v in the direction orthogonal to the tangent space of M at q_i.
Proof. The proof of Theorem 17 follows the same steps as the proof of Theorem 16.

Let γ be a random Hamiltonian trajectory over time [0, T] with position and momentum at time 0 sampled from the stationary density π(q,p). Let N_M be the number of intersection points of γ with M, and let (q_i, p_i) be the phase space coordinates of γ at its i'th intersection point with M.

Since Hamiltonian flow preserves the stationary distribution, γ is at the stationary distribution at any time t ∈ [0, T], so for any function h we have

∫_M ∫_{R^n} h(q,p) π(q,p) dp dq = (1/T) · E[ Σ_{i=1}^{N_M} h(q_i, p_i) / ‖v^⊥(q_i, p_i)‖ ].   (4.5)

Hence, setting h(q,p) := ‖v^⊥(q,p)‖ · π(q) / ∫_{R^n} ‖v^⊥(q,p′)‖ π(q,p′) dp′, we get

∫_M π(q) dq = ∫_M π(q) · ( ∫_{R^n} ‖v^⊥(q,p)‖ π(q,p) dp / ∫_{R^n} ‖v^⊥(q,p′)‖ π(q,p′) dp′ ) dq
    = ∫_M ∫_{R^n} h(q,p) π(q,p) dp dq
    = (1/T) · E[ Σ_{i=1}^{N_M} h(q_i, p_i) / ‖v^⊥(q_i, p_i)‖ ]   (by Eq. 4.5)
    = (1/T) · E[ Σ_{i=1}^{N_M} π(q_i) / ∫_{R^n} ‖v^⊥(q_i, p)‖ π(q_i, p) dp ].  □
4.3 Manifold integration using HMC and the Hamiltonian Crofton formula

We now state a conventional method of using Hamiltonian Monte Carlo to compute integrals over a submanifold (Algorithm 13).
Algorithm 13 Integration on a submanifold with HMC
input: q_0, oracle for π : R^d → [0, ∞), oracle for intersections with M
output: Estimator A for ∫_M π(q) dq.
define: H(q,p) := −log(π(q)) + ½ p^T p.
1: for i = 1, 2, ..., i_max do
2: Sample independent p_i ~ N(0, I_d)
3: Integrate the Hamiltonian trajectory (q(t), p(t)) with Hamiltonian H over the time interval [0, T] and initial conditions (q(0), p(0)) = (q_i, p_i)
4: set q_{i+1} = q(T)
5: compute the sequence of intersection angles θ_1^i, θ_2^i, ... of the trajectory with M
6: end for
7: compute the estimator A from the intersection angles, weighting the j'th intersection of the i'th trajectory by 1/sin(θ_j^i)
Algorithm 13 requires the use of the weights 1/sin(θ_j^i). Unfortunately, these weights have infinite variance, greatly slowing the algorithm (Var(1/sin(θ_j^i)) = ∞, as shown in Section 2.2.2 when analyzing the classical Crofton formula). To eliminate these weights we can apply our Hamiltonian Crofton formula (Theorem 16) to obtain an algorithm with a much faster-converging estimator for the integral (Algorithm 14):
Algorithm 14 Crofton formula integration on a submanifold with HMC
Algorithm 14 is identical to Algorithm 13, except for the following steps:
output: Estimator (c/(T · i_max)) · N_M for ∫_M π(q) dq.
5: compute the number N_i of intersections (with multiplicity) of the trajectory with M
7: compute N_M := Σ_i N_i
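A minimal runnable sketch of Algorithm 14 follows. It is our illustration, not code from the thesis: the target π = N(0, I_2), the manifold M (a circle of radius R, so the exact answer is ∫_M π(q) dq = R e^{−R²/2}), the leapfrog integrator standing in for the exact Hamiltonian flow, and the constant c = √(π/2) = 1/E|N(0,1)| suggested by Equation 4.3 are all assumptions chosen for this toy setting. Intersections are counted via sign changes of ‖q‖ − R along the discretized path, which can miss tangential crossings.

import numpy as np

# Hedged sketch of Algorithm 14 for pi = N(0, I_2) and M = circle of radius R.
# Exact value of the integral: R * exp(-R^2 / 2).

def leapfrog(q, p, grad_U, T, dt):
    """Approximate the Hamiltonian flow of H = U(q) + p.p/2, returning the path."""
    path = [q.copy()]
    p = p - 0.5 * dt * grad_U(q)             # initial half step in momentum
    for _ in range(int(T / dt)):
        q = q + dt * p                       # full step in position
        p = p - dt * grad_U(q)               # full step in momentum
        path.append(q.copy())
    return np.array(path)

def crofton_estimate(R=1.0, T=2.0, dt=0.01, n_traj=5000, seed=0):
    rng = np.random.default_rng(seed)
    grad_U = lambda q: q                     # U(q) = ||q||^2 / 2
    c = np.sqrt(np.pi / 2.0)                 # 1 / E|N(0,1)|, per Eq. (4.3)
    q = rng.standard_normal(2)               # start at stationarity
    crossings = 0
    for _ in range(n_traj):
        p = rng.standard_normal(2)           # fresh isotropic momentum
        path = leapfrog(q, p, grad_U, T, dt)
        radii = np.linalg.norm(path, axis=1)
        crossings += int(np.sum(np.sign(radii[:-1] - R) != np.sign(radii[1:] - R)))
        q = path[-1]
    return c * crossings / (T * n_traj)      # (c / (T * i_max)) * N_M

print("Crofton estimate:", crofton_estimate())   # ~ 0.607 for R = 1
print("exact value     :", np.exp(-0.5))

Note that, unlike the 1/sin(θ) weights of Algorithm 13, the per-trajectory crossing counts are bounded here, so the estimator's variance is finite.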
Acknowledgements
We gratefully acknowledge support from NSF DMS-1312831.
Chapter 5
A Hopf Fibration for β-Ghost Gaussians
5.1 Introduction
There is a wonderful geometrical construction known as the Hopf fibration. The
Hopf map is a continuous map from the 3-sphere to the 2-sphere where every point
on the 2-sphere is the image of a circle on the 3-sphere. This Hopf map has the
nice property that if a point is uniformly distributed on the 3-sphere, its image is
uniformly distributed on the 2-sphere.
The entire Hopf story can be expressed quickly and elegantly with quaternions. In quaternion language, consider the map that takes z to z i z̄, where z is a unit quaternion (|z| = 1). By taking its conjugate, it is easy to see that z i z̄ is a unit quaternion with zero real part (these are known as versors). Identifying unit quaternions with the 3-sphere, and those with zero real part with the 2-sphere, we have our Hopf map.

Notice that for a fixed unit quaternion q, the map from z with 0 real part to q z q̄ is a linear function of z which preserves |z|. The matrix representation is a 3×3 rotation matrix. This association is widely used in practice in computer graphics and other fields. Given any 3×3 rotation matrix, there is a well-known construction to go the other way, based on computing the eigenvector (the axis of the rotation) and the eigenvalues (which encode the angle of rotation).
We offer a quick proof based on orthogonal invariance that if z is uniformly distributed on the 3-sphere, then z i z̄ is uniformly distributed on the 2-sphere. As just discussed, any orthogonal rotation of the non-zero coordinates of z i z̄ may be written as q z i z̄ q̄ = (qz) i (qz)‾. Now if z is uniform on the 3-sphere, so is qz, and thus q z i z̄ q̄ has the same distribution as z i z̄, implying that it is uniformly distributed on the 2-sphere.

This proof is very elegant and worth reading a few times. The very fact that the geometry of Hopf is so nicely encoded in quaternions inspired us to ask what happens if we replace quaternions with general β ghosts.
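The β = 4 statements above can be checked directly with quaternion arithmetic. The following small script is ours, for illustration; quaternions are represented as length-4 arrays (a, b, c, d). It verifies that z i z̄ has zero real part and unit norm for a uniform z on the 3-sphere.

import numpy as np

# Quaternion (beta = 4) check of the Hopf map z -> z i conj(z).

def qmul(x, y):
    """Hamilton product of quaternions represented as arrays (a, b, c, d)."""
    a1, b1, c1, d1 = x
    a2, b2, c2, d2 = y
    return np.array([a1*a2 - b1*b2 - c1*c2 - d1*d2,
                     a1*b2 + b1*a2 + c1*d2 - d1*c2,
                     a1*c2 - b1*d2 + c1*a2 + d1*b2,
                     a1*d2 + b1*c2 - c1*b2 + d1*a2])

def hopf(z):
    i = np.array([0.0, 1.0, 0.0, 0.0])
    z_conj = z * np.array([1.0, -1.0, -1.0, -1.0])
    return qmul(qmul(z, i), z_conj)

rng = np.random.default_rng(0)
z = rng.standard_normal(4)
z /= np.linalg.norm(z)                  # uniform on the 3-sphere
w = hopf(z)
print("real part:", w[0])               # ~ 0 (up to rounding)
print("norm     :", np.linalg.norm(w))  # ~ 1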
5.2 Defining the β-dimensional algebra

Let β be an even integer. Write A_β = R^β = A × B × C × D,

where A = span{1}, B = span{i}, C = span{j_1, ..., j_{β/2−1}} ≅ R^{β/2−1}, and D = span{k_1, ..., k_{β/2−1}} ≅ R^{β/2−1}.

Definition 4. (multiplication) The only multiplication allowed in this algebra is multiplication of an element in A_β by elements of span{1, i}, as well as any multiplication where the output only has elements of A_β = A × B × C × D (e.g., r^2 and rir are allowed for any r ∈ A_β). Left multiplication is defined for all t ∈ {1, ..., β/2 − 1} as:

i^2 = j_t^2 = k_t^2 = −1   (5.1)
i j_t = k_t   (5.2)
i k_t = −j_t   (5.3)

and right multiplication is defined by

j_t i = −k_t   (5.4)
k_t i = j_t   (5.5)

Finally, we assume that all orthogonal pure imaginary components are anti-commutative, although we only allow such operations if the end results cancel so that the output is in A × B × C × D: for all s ≠ t,

j_s j_t = −j_t j_s,   k_s k_t = −k_t k_s,   j_s k_t = −k_t j_s.   (5.6)

Associativity of multiplication is assumed as well (but NOT commutativity). Addition is done as a vector space.

Finally, we define the conjugation operation on an element Z = a + bi + cr, where a, b, c ∈ R and r ∈ S^{β−3} ⊂ C × D, by Z̄ := a − bi − cr.
The following theorems (Theorems 18–20) give algebraic properties of the β-dimensional algebra that generalize properties of the quaternion algebra. These properties will come in handy when generalizing the Hopf fibration to R^β.

Theorem 18. r^2 = −1 and (ir)^2 = −1.

Proof. 1 = |r|^2 = r r̄ = r · (−r) = −r^2; hence, r^2 = −1.

Since ir is pure imaginary (no real part) (this is because we are assuming that the pure imaginary sphere is closed under orthogonal multiplication), 1 = |ir|^2 = ir · (−ir) = −(ir)^2.

Hence, (ir)^2 = −1.  □
Theorem 19. ir = −ri.

Proof. (i + r)(i + r)‾ = |i + r|^2 ∈ R,

but (i + r)(i + r)‾ = (i + r)(−i − r) = −i^2 − r^2 − ir − ri = 2 − ir − ri.

Hence, since ir and ri are non-real (since Equations 5.2–5.4 imply that the pure imaginary sphere is closed under orthogonal multiplication), and |i + r|^2 is real, it must be true that −ir − ri = 0. Hence, ir = −ri.  □
Theorem 20. |xr + yir| = √(x^2 + y^2) for all x, y ∈ R.

Proof. |xr + yir|^2 = (xr + yir)(−xr − yir) = −x^2 r^2 − xy·r(ir) − xy·(ir)r − y^2 (ir)^2 = x^2 − xy·i + xy·i + y^2 = x^2 + y^2.  □
5.3 Hopf Fibration on R^β

We begin by defining a version of the Hopf fibration H : R^β → R^{β−1} for even-integer β by generalizing the quaternion representation of the β = 4 Hopf fibration.

Let Z ~ N(0,1)_β (where "Z ~ N(0,1)_β" means that Z is a random variable sampled from the β-ghost Gaussian distribution N(0,1)_β). Then

Z = a + bi + cr,

where a, b ~ N(0,1), c ~ χ_{β−2}, and r ~ Uniform(S^{β−3}) ⊂ C × D are independently distributed.

We can now define the Hopf map H : R^β → R^{β−1} by H(Z) := Z i Z̄.

Theorem 21.

H(Z) := Z i Z̄ = (a^2 + b^2 − c^2)i + 2bc r − 2ac ir ~ (a^2 + b^2 − c^2)i + 2c √(a^2 + b^2) r.
Proof.

Z i Z̄ = (a + bi + cr) i (a − bi − cr) = (a + bi + cr)(ai + b − cir)
    = a^2 i + ab − ac ir − ab + b^2 i + bc r − ac ir + bc r − c^2 i
    = (a^2 + b^2 − c^2)i + 2bc r − 2ac ir.

By Theorem 20, |2bc r − 2ac ir| = 2c√(a^2 + b^2), so Z i Z̄ ~ (a^2 + b^2 − c^2)i + 2c√(a^2 + b^2) r.  □
In particular, Theorem 21 shows that the Hopf map eliminates the real component, just like it does in the quaternion (β = 4) case. More generally, Theorem 21 allows us to compare the distribution of Z i Z̄ for different β. Towards this end, let W be the i-component of H(Z)/|H(Z)|. Then

W = imag_i(Z i Z̄)/|Z i Z̄| = (X − Y)/(X + Y),

where X := a^2 + b^2 ~ χ²_2 and Y := c^2 ~ χ²_{β−2}.

A quick multivariable integral computation gives the densities f_W and f_{|W|} of W and |W|, respectively:

f_W(t) = ((β − 2)/4) · ((1 − t)/2)^{β/2 − 2},   t ∈ [−1, 1]

and

f_{|W|}(t) = ((β − 2)/4) · [((1 + t)/2)^{β/2 − 2} + ((1 − t)/2)^{β/2 − 2}],   t ∈ [0, 1].

In particular, f_W(t) is constant for β = 4, has negative second derivative for 4 < β < 6, is linear for β = 6, and has positive second derivative for β > 6. f_{|W|} is uniform for both β = 4 and β = 6, and also has negative second derivative for 4 < β < 6 and positive second derivative for β > 6.
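The density f_W is easy to check by simulation. The following sketch is ours (the even parameter β = 8 and the sample size are arbitrary choices): sample X ~ χ²_2 and Y ~ χ²_{β−2} and compare a histogram of W = (X − Y)/(X + Y) against the formula above.

import numpy as np

# Simulation check of f_W(t) = ((beta - 2) / 4) * ((1 - t) / 2)**(beta/2 - 2).

rng = np.random.default_rng(0)
beta, n = 8, 500_000
X = rng.chisquare(2, size=n)              # X = a^2 + b^2
Y = rng.chisquare(beta - 2, size=n)       # Y = c^2
W = (X - Y) / (X + Y)

hist, edges = np.histogram(W, bins=40, range=(-1.0, 1.0), density=True)
t = 0.5 * (edges[:-1] + edges[1:])
f = (beta - 2) / 4.0 * ((1.0 - t) / 2.0) ** (beta / 2 - 2)
print("max deviation from f_W:", np.abs(hist - f).max())  # small binning noise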
The (β − 2)-dimensional surface area density ρ(t) of H(Z)/|H(Z)| on the (β − 2)-sphere at height t on the i-axis is:

ρ(t) = f_W(t)/Vol(P_t ∩ S^{β−2}) = (β − 2) / (4 S_{β−3} (2(1 + t))^{β/2 − 2}),   t ∈ [−1, 1],

where P_t is a (β − 2)-plane at distance t from the origin and

S_{β−3} := Vol(S^{β−3}) = 2π^{(β−2)/2} / Γ((β − 2)/2).
In particular, 0 < ρ(t) < ∞ everywhere except at t = −1, where ρ(−1) = ∞. This suggests that H has a dimension reduction of 1 everywhere except at the "south pole" of the β − 2 sphere (−i), where the dimension was reduced by β − 2 (r gets mapped to −i for any "phase" r ∈ S^{β−3}. However, for instance, only the circle {a + bi : a^2 + b^2 = 1} gets mapped to +i, so the map is well-behaved at +i because the dimension goes down by 1.)

Moreover, this fact, together with the fact that H(a + bi + cr) = (a^2 + b^2 − c^2)i + 2bc r − 2ac ir, suggests that H is analytic everywhere except at −i ∈ S^{β−2}.
Acknowledgements
We gratefully acknowledge support from NSF DMS-1312831.
Bibliography
[1] J. C. Alvarez Paiva and E. Fernandes. Gelfand transforms and Crofton formulas. Selecta Math. (N.S.), 13(3):369-390, 2007.
[2] Dennis Amelunxen and Martin Lotz. Computational kinematics. Manuscript in preparation.
[3] Dennis Amelunxen and Martin Lotz. A comment on "Integral geometry for MCMC". Private correspondence.
[4] C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50:5-43, 2003.
[5] R. H. Baayen, D. J. Davidson, and D. M. Bates. Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59:390-412, 2008.
[6] Julian Besag. Markov chain Monte Carlo for statistical inference. Technical report, University of Washington, Department of Statistics, April 2001.
[7] Louis J. Billera and Persi Diaconis. A geometric interpretation of the Metropolis-Hastings algorithm. Statistical Science, 16(4):335-339, 2001.
[8] F. Bornemann. On the numerical evaluation of distributions in random matrix theory: a review. Markov Process. Related Fields, 16(4):803-866, 2010.
[9] Nawaf Bou-Rabee and Jesus Maria Sanz-Serna. Randomized Hamiltonian Monte Carlo. arXiv preprint arXiv:1511.09382v1, 2015.
[10] Bob Carpenter, A. Gelman, M. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell. Stan: a probabilistic programming language. Journal of Statistical Software, in press, 2015.
[11] Jeff Cheeger. A lower bound for the smallest eigenvalue of the Laplacian. In Problems in Analysis (Papers dedicated to Salomon Bochner, 1969), pages 195-199. Princeton Univ. Press, Princeton, N.J., 1970.
[12] Ming-Hui Chen, Qi-Man Shao, and Joseph G. Ibrahim. Monte Carlo methods in Bayesian computation. Springer Series in Statistics. Springer-Verlag, New York, 2000.
[13] Shiing-shen Chern. On the curvatura integra in a Riemannian manifold. Ann. of Math. (2), 46:674-684, 1945.
[14] Sai Hung Cheung and James L. Beck. Bayesian model updating using hybrid Monte Carlo simulation with application to structural dynamic models with many uncertain parameters. Journal of Engineering Mechanics, 135(4):243-255, 2009.
[15] Anders S. Christensen, Troels E. Linnet, Mikael Borg, Kresten Lindorff-Larsen, Wouter Boomsma, Thomas Hamelryck, and Jan H. Jensen. Protein structure validation and refinement using amide proton chemical shifts derived from quantum mechanics. PLoS ONE, 8(12):1-10, 2013.
[16] Neil J. Cornish and Edward K. Porter. MCMC exploration of supermassive black hole binary inspirals. Classical Quantum Gravity, 23(19):761-767, 2006.
[17] Morgan W. Crofton. On the theory of local probability, applied to straight lines drawn at random in a plane; the methods used being also extended to the proof of certain new theorems in the integral calculus. Philosophical Transactions of the Royal Society of London, 158:181-199, 1868.
[18] Persi Diaconis, Susan Holmes, and Mehrdad Shahshahani. Sampling from a manifold. In Advances in Modern Statistical Theory and Applications: A Festschrift in Honor of Morris L. Eaton, pages 102-125. Institute of Mathematical Statistics, 2013.
[19] Simon Duane, Anthony D. Kennedy, Brian J. Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216-222, 1987.
[20] Alan Edelman and Brian D. Sutton. From random matrices to stochastic operators. J. Stat. Phys., 127(6):1121-1165, 2007.
[21] Alan Edelman and Brian D. Sutton. From random matrices to stochastic operators. J. Stat. Phys., 127(6):1121-1165, 2007.
[22] Israel M. Gelfand and Mikhail M. Smirnov. Lagrangians satisfying Crofton formulas, Radon transforms, and nonlocal differentials. Adv. Math., 109(2):188-227, 1994.
[23] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. Pattern Analysis and Machine Intelligence, IEEE Transactions on, (6):721-741, 1984.
[24] Charles J. Geyer. Markov chain Monte Carlo maximum likelihood. In Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, pages 156-163, 1991.
[25] Mark Girolami and Ben Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123-214, 2011.
[26] Mark Girolami and Ben Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123-214, 2011.
[27] Mark Girolami, Ben Calderhead, and Siu A. Chin. Riemannian Manifold Hamiltonian Monte Carlo. arXiv preprint, 2009.
[28] Yongtao Guan and Stephen M. Krone. Small world MCMC and convergence to multi-modal distributions: from slow mixing to fast mixing. Ann. Appl. Probab., 17(1):284-304, 2007.
[29] Larry Guth. Degree reduction and graininess for Kakeya-type sets in R^3. Preprint on arXiv:1402.0518, 2014.
[30] Sigurdur Helgason. Integral geometry and Radon transforms. Springer, New York, 2011.
[31] Jody Hey and Rasmus Nielsen. Integration within the Felsenstein equation for improved Markov chain Monte Carlo methods in population genetics. Proceedings of the National Academy of Sciences of the United States of America, 104(8):2785-2790, 2006.
[32] Matthew D. Hoffman and Andrew Gelman. The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. The Journal of Machine Learning Research, 15(1):1593-1623, 2014.
[33] J. H. Irving and John G. Kirkwood. The statistical mechanical theory of transport processes. IV. The equations of hydrodynamics. The Journal of Chemical Physics, 18(6):817-829, 1950.
[34] Iain M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist., 29(2):295-327, 2001.
[35] Daphne Koller and Nir Friedman. Probabilistic graphical models: Principles and techniques. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, 2009.
[36] Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, 28(5):1302-1338, 2000.
[37] Gregory F. Lawler and Alan D. Sokal. Bounds on the L^2 spectrum for Markov chains and Markov processes: a generalization of Cheeger's inequality. Transactions of the American Mathematical Society, 309(2):557-580, 1988.
[38] Michel Ledoux. The concentration of measure phenomenon, volume 89 of Mathematical Surveys and Monographs. American Mathematical Society, Providence, RI, 2001.
[39] Tony Lelievre, Mathias Rousset, and Gabriel Stoltz. Free energy computations: A Mathematical Perspective. Imperial College Press, 2010.
[40] David Asher Levin, Yuval Peres, and Elizabeth Lee Wilmer. Markov chains and mixing times. American Mathematical Society, 2009.
[41] Samuel Livingstone, Michael Betancourt, Simon Byrne, and Mark Girolami. On the geometric ergodicity of Hamiltonian Monte Carlo. arXiv preprint arXiv:1601.08057, 2016.
[42] Martin Lotz. On the volume of tubular neighborhoods of real algebraic varieties. Proc. Amer. Math. Soc., 143(5):1875-1889, 2015.
[43] Oren Mangoubi. Concentration of kinematic measure. Manuscript in preparation.
[44] B. Mehlig, D. W. Heermann, and B. M. Forrest. Hybrid Monte Carlo method for condensed-matter systems. Physical Review B, 45(2):679, 1992.
[45] Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087-1092, 1953.
[46] V. D. Milman. A new proof of A. Dvoretzky's theorem on cross-sections of convex bodies. Funkcional. Anal. i Prilozhen., 5(4):28-37, 1971.
[47] Boaz Nadler. On the distribution of the ratio of the largest eigenvalue to the trace of a Wishart matrix. J. Multivariate Anal., 102(2):363-371, 2011.
[48] Radford M. Neal. MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo, Chapman & Hall/CRC Handb. Mod. Stat. Methods, pages 113-162. CRC Press, Boca Raton, FL, 2011.
[49] S. Yu. Orevkov. Sharpness of Risler's upper bound for the total curvature of an affine real algebraic hypersurface. Uspekhi Mat. Nauk, 62(2):169-170, 2007.
[50] Hannes Risken. Fokker-Planck equation. Springer, 1984.
[51] Jean-Jacques Risler. On the curvature of the real Milnor fiber. Bull. London Math. Soc., 35(4):445-454, 2003.
[52] Gareth O. Roberts and Jeffrey S. Rosenthal. Geometric ergodicity and hybrid Markov chains. Electron. Comm. Probab., 2(2):13-25, 1997.
[53] Luis A. Santalo. Integral geometry and geometric probability. Cambridge Mathematical Library. Cambridge University Press, Cambridge, second edition, 2004. With a foreword by Mark Kac.
[54] Rolf Schneider and Wolfgang Weil. Stochastic and integral geometry. Probability and its Applications (New York). Springer-Verlag, Berlin, 2008.
[55] Alistair Sinclair and Mark Jerrum. Approximate counting, uniform generation and rapidly mixing Markov chains. Information and Computation, 82(1):93-133, 1989.
[56] Michael Spivak. A comprehensive introduction to differential geometry. Vol. III. Publish or Perish, Inc., Wilmington, Del., second edition, 1979.
[57] Michael Spivak. A comprehensive introduction to differential geometry. Vol. V. Publish or Perish, Inc., Wilmington, Del., second edition, 1979.
[58] Brian D. Sutton. The stochastic operator approach to random matrix theory. ProQuest LLC, Ann Arbor, MI, 2005. Thesis (Ph.D.), Massachusetts Institute of Technology.
[59] Mihai Tibar and Dirk Siersma. Curvature and Gauss-Bonnet defect of global affine hypersurfaces. Bulletin des Sciences Mathematiques, 130(2):110-122, 2006.
[60] Max Welling and Yee W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681-688, 2011.
[61] Chenchang Zhu. The Gauss-Bonnet theorem and its applications. http://math.berkeley.edu/~alanw/240papers00/zhu.pdf, 2004.