high dimensional inference with random maximum a...
TRANSCRIPT
1
High Dimensional Inference with RandomMaximum A-Posteriori Perturbations
Tamir Hazan, Francesco Orabona, Anand D. Sarwate Senior Member, IEEE, Subhransu Maji and Tommi Jaakkola
AbstractâThis paper presents a new approach, called perturb-max, for high-dimensional statistical inference that is based onapplying random perturbations followed by optimization. Thisframework injects randomness to maximum a-posteriori (MAP)predictors by randomly perturbing the potential function forthe input. A classic result from extreme value statistics assertsthat perturb-max operations generate unbiased samples fromthe Gibbs distribution using high-dimensional perturbations.Unfortunately, the computational cost of generating so manyhigh-dimensional random variables can be prohibitive. However,when the perturbations are of low dimension, sampling theperturb-max prediction is as efficient as MAP optimization. Thispaper shows that the expected value of perturb-max inferencewith low dimensional perturbations can be used sequentially togenerate unbiased samples from the Gibbs distribution. Fur-thermore the expected value of the maximal perturbations isa natural bound on the entropy of such perturb-max models.A measure concentration result for perturb-max values showsthat the deviation of their sampled average from its expectationdecays exponentially in the number of samples, allowing effectiveapproximation of the expectation.
Index TermsâGraphical models, MAP inference, Measureconcentration, Markov Chain Monte Carlo
I. INTRODUCTION
Modern machine learning tasks in computer vision, natu-ral language processing, and computational biology involvemodeling the data using high-dimensional graphical models.Examples include scene understanding [5], parsing [6], andprotein design [7]. In these settings, inference involves findinga likely assignment (or equivalently, structure) that fits thedata: objects in images, parsers in sentences, or molecularconfigurations in proteins. In the graphical model framework,each structure corresponds to an assignment of values torandom variables and the preference of a structure is based
Manuscript received February 10, 2016; revised November 4, 2016; revisedMay xx, 2018; accepted xxxxxxxxxx
T. Hazan is with the Faculty of Industrial Engineering & Manage-ment, Technion - Israel Institute of Technology, Technion City, Haifa32000, Israel (e-mail: [email protected]). F. Orabonais with the Department of Computer Science, Stony Brook University,Stony Brook, NY 11794-2424. This work was done in part when hewas with Yahoo Labs, 229 W 43rd St., New York, NY 10036, USA (e-mail: [email protected]). A.D. Sarwate is with the Depart-ment of Electrical and Computer Engineering, Rutgers, The State Univer-sity of New Jersey, 94 Brett Road, Piscataway NJ 08854, USA (e-mail:[email protected]). S. Maji is with the College of Infor-mation and Computer Sciences, University of Massachusetts Amherst, 140Governors Drive, MA 01003-9264, USA (e-mail: [email protected]).T. Jaakkola is with the Department of Electrical Engineering and ComputerScience, Massachusetts Institute of Technology, 77 Massachusetts Avenue,Cambridge, MA 02139, USA (e-mail: [email protected]).
Preliminary versions of these results appeared in conference proceed-ings [1]â[4].
on defining potential functions that account for interactionsover these variables. These potential functions encode localand global interactions using domain-specific knowledge.
Given observed data (for example, an image), the goalof inference is to find an assignment. The observed datainduces a posterior probability distribution on assignmentswhich takes the form of a Gibbs distribution. To generate anassignment, we can use the maximum a posteriori probability(MAP) assignment or sample from the Gibbs distribution usingMarkov chain Monte Carlo (MCMC). In many applicationsof interest, the posterior distribution has many local max-ima which are far apart (in Hamming distance), leading toa âraggedâ posterior probability landscape (see Figure 3).Although MCMC approaches such as Gibbs sampling [8],Metropolis-Hastings [9], or Swendsen-Wang [10] are success-ful in many models, sampling from the Gibbs distribution maybecome prohibitively expensive [11]â[13].
An alternative to sampling from the Gibbs distribution is tolook for the maximum a posteriori probability (MAP) assign-ment. Substantial effort has gone into developing optimizationalgorithms for recovering MAP assignments by exploitingdomain-specific structural restrictions [5], [14]â[18] or bylinear programming relaxations [7], [19]â[22]. MAP inferenceis nevertheless limiting when there are a many near-maximalassignments. Such alternatives arise either from inherent am-biguities (e.g., in image segmentation or text analysis) ordue to the use of computationally/representationally limitedpotential functions (e.g., super-modularity) aliasing alternativeassignments to have similar scores. For an example, see Figure1.
Recently, several works have leveraged the current effi-ciency of MAP solvers to build (approximate) samplers for theGibbs distribution, thereby avoiding the computational burdenof MCMC methods [2], [23]â[36]. These works have shownthat one can represent the Gibbs distribution by calculatingthe MAP assignment of a randomly perturbed potential func-tion whenever the perturbations follow the Gumbel distribu-tion [23], [24]. Unfortunately the total number of assignments(or structures), and consequently the total number of randomperturbations, is exponential in the structureâs dimension. Wecall this a perturb-max approach.
In this paper we propose a sampler for high-dimensional in-ference tasks that approximates the expected value of perturb-max programs by using a number of perturbations that islinear in the dimension. To analyze this approach we provenew measure concentration bounds for functions of Gumbelrandom variables that may be of independent interest. As aresult, approximate inference can be as fast as computing the
2
Fig. 1. Comparing MAP inference and perturbation models. Asegmentation is modeled by x = (x1, x2, . . . , xn) where n is thenumber of pixels and xi â {0, 1} is a discrete label relating apixel to foreground (xi = 1) or background (xi = 0). Ξ(x) isthe (super-modular) score of each segmentation. Left: original imagealong with the annotated boundary. Middle: the MAP segmentationargmaxx Ξ(x) recovered by the graph-cuts optimization algorithmusing a region inside the boundary as seed pixels [15]. Note that theâoptimalâ solution is inaccurate because thin and long objects (wings)are labeled incorrectly. Right: The marginal probabilities of theperturb-max model estimated using 20 samples (random perturbationsof Ξ(x) followed by executing graph-cuts). The information aboutthe wings is recovered by these samples. Estimating the marginalprobabilities of the corresponding Gibbs distribution by MCMCsampling is slow in practice and provably hard in theory [2], [12].
MAP assignment.We begin by introducing the setting of high dimensional
inference as well as the necessary background in extremevalue statistics in Section II. Subsequently, we develop highdimensional inference algorithms that rely on the expectedMAP value of randomly perturbed potential functions, whileusing only low dimensional perturbations. In Section III-Awe propose a novel sampling algorithm and in Section III-Cwe derive bounds on the entropy that may be of independentinterest. Finally, we show that the expected value of theperturb-max value can be estimated efficiently despite theunboundedness of the perturbations. To show this we mustprove new measure concentration results for the Gumbeldistribution. In particular, in Section IV we prove new Poincareand modified log-Sobolev inequalities for (non-strictly) log-concave distributions.
II. INFERENCE AND RANDOM PERTURBATIONS
We first describe the high dimensional statistical inferenceproblems that motivate this work. For a more complete treat-
ment of graphical models, we refer the reader to Wainwrightand Jordan [37]. We then describe how to use extreme valuestatistics to perform statistical inference while recovering themaximal assignment of randomly perturbed potential func-tions [38] [39, pp.159â61].Notation. Vectors/tuples will generally be in boldface and setsin calligraphic script. We write [n] for the set {1, 2, . . . , n}.
A. Gibbs distributions and MAP estimation
Let {Xi : i â [n]} be a collection of n finite discrete setsand X = X1ĂX2Ă· · ·ĂXn. We call x â X a configuration.A potential function is a map Ξ : X â RâȘ{ââ}. We defineDom(Ξ) = {x â X : Ξ(x) 6= ââ}. In practical inferencetasks, Ξ is estimated from observed data. A Gibbs distributionis a probability distribution on configurations induced by Ξ:
p(x)â=
1
Z(Ξ)exp(Ξ(x)) where Z(Ξ)
â=âxâX
exp(Ξ(x)).
(1)
The normalization constant Z(Ξ) is called the partition func-tion. Whenever Ξ is clear from the context we use the short-hand Z for Z(Ξ). Sampling from the Gibbs distribution is oftendifficult because the partition function involves exponentiallymany terms (equal to the number of discrete structures in X ).Computing the partition function is #P -hard in general (e.g.,Valiant [40]).
The maximum a-posteriori (MAP) estimate is the mode ofp(x):
xâ = argmaxxâX
Ξ(x). (2)
Due to the combinatorial nature of X , MAP inference is NP-hard in general. However, many heuristics for performing theoptimization in (2) for high dimensional potential functionshave been extensively researched in the last decade [5], [7],[15], [17], [18]. These have been useful in many cases ofpractical interest in computer vision, such as foreground-background image segmentation with supermodular potentialfunctions (e.g., [41]), parsing and tagging (e.g., [6], [42]),branch and bound for scene understanding and pose estima-tion [43], [44], and dynamic programming predictions foroutdoor scene understanding [45]. Although the run-time ofthese solvers can be exponential in the number of variables,they are often surprisingly effective in practice, [7], [19], [22],[46], [47].
We are interested in problems where there are severalvalues of x whose scores Ξ(x) are close to Ξ(xâ). In thiscase we may wish to sample from the Gibbs distribution.Unfortunately, this is also intractable. Sampling methods forposterior distributions often resort to MCMC algorithms thatconverge slowly in many practical settings [11]â[13]. Anotherreason to sample from the p(x) is to estimate the uncertaintyin the model by estimating its entropy:
H(p) = ââxâX
p(x) log p(x). (3)
In this paper we propose an approximate sampler for p(x) thatcan also be used to estimate the entropy.
3
B. Inference and extreme value statistics
An alternative approach to drawing unbiased samples fromthe Gibbs distribution is by randomly perturbing the poten-tial function and solving the perturbed MAP problem. Theâperturb-maxâ approach adds a random function Îł : X â Rto the potential function in (1) and solves the resulting MAPproblem:
xâ = argmaxxâX
{Ξ(x) + γ(x)} , (4)
where Îł(x) is a random function on X . The simplest approachto designing a perturbation function is to associate an indepen-dent and identically distributed (i.i.d.) random variable Îł(x)for each x â X . In this case, the distribution of the perturb-max value Ξ(x) + Îł(x) has an analytic form. To verify thisobservation we denote by F (t) = P(Îł(x) †t) the cumulativedistribution function of Îł(x). The independence of Îł(x) acrossx â X implies that
PÎł(
maxxâX{Ξ(x) + Îł(x)} †t
)= PÎł (âx â X : {Ξ(x) + Îł(x)} †t)
(5)= PÎł (âx â X : {Ξ(x) + Îł(x)} †t)
(6)
=âxâX
F (tâ Ξ(x)). (7)
Unfortunately, the product of cumulative distribution functionsis not usually a simple distribution.
The Gumbel, Frechet, and Weibull distributions, used inextremal statistics, are max-stable distributions: the productâ
xâX F (t â Ξ(x)) can be described in terms of F (·) it-self [48]â[50]. In this work we focus on the Gumbel distribu-tion with zero mean: Îł is a zero-mean Gumbel random variableif its cumulative distribution function is
G(t) = P(Îł(x) †t) = exp(â exp(â(t+ c))), (8)
where c â 0.5772 is the Euler-Mascheroni constant. Through-out our work we use the max-stability of the Gumbel distri-bution as described in the following theorem.
Theorem 1 (Max-stability of Gumbel perturbations [48]â[50]):Let Îł = {Îł(x) : x â X} be a collection of i.i.d. Gumbelrandom variables with cumulative distribution function(8). Then, the random variable maxxâX {Ξ(x) + Îł(x)} isdistributed according to the Gumbel distribution whose meanis the log-partition function logZ(Ξ).
Proof: The proof is known, but we include it for com-pleteness. By the independence assumption,
PÎł(
maxxâX{Ξ(x) + Îł(x)} †t
)=âxâX
Pγ(x) (Ξ(x) + γ(x) †t) .
The random variable Ξ(x) + γ(x) follows the Gumbel distri-bution with mean Ξ(x). Therefore
PÎł(x) (Ξ(x) + Îł(x) †t) = G(tâ Ξ(x)).
Lastly, the double exponential form of the Gumbel distributionyields the result:âxâX
G(tâ Ξ(x)) = exp
(ââxâX
exp (â(tâ Ξ(x) + c))
)= exp (â exp(â(t+ câ logZ(Ξ)))
= G(tâ logZ(Ξ)).
We can use the log-partition function to recover the mo-ments of the Gibbs distribution: the log-partition functioncharacterizes the stability of the randomized MAP predictorxâ in (4).
Corollary 1 (Sampling from perturb-max models [51]â[53]):Let Îł = {Îł(x) : x â X} be a collection of i.i.d. Gumbelrandom variables with cumulative distribution function (8).Then, for all x,
exp(Ξ(x))
Z(Ξ)= Pγ
(x = argmax
xâX{Ξ(x) + Îł(x)}
). (9)
Proof: From Theorem 1, we have logZ(Ξ) =EÎł [maxxâX {Ξ(x) + Îł(x)}], so we can take the derivativewith respect to some Ξ(x). We note that by differentiatingthe left hand side we get the Gibbs distribution:
â logZ(Ξ)
âΞ(x)=
exp(Ξ(x))
Z(Ξ).
Differentiating the right hand side is slightly more involved.First, we can differentiate under the integral sign (cf. [54]), so
â
âΞ(x)
â«R|X|
maxxâX{Ξ(x) + Îł(x)} dÎł =
â«R|X|
â
âΞ(x)maxxâX{Ξ(x) + Îł(x)} dÎł.
The (sub)gradient of the max-function is the indicator function(an application of Danskinâs Theorem [55]):
â
âΞ(x)maxxâX{Ξ(x) + Îł(x)} = 1
(x = argmax
xâX{Ξ(x) + Îł(x)}
).
The corollary then follows by applying the expectation to bothsides of the last equation.
An alternative proof of the preceding corollary can begiven by considering the probability density function g(t) =GâČ(t) of the Gumbel distribution. This proof consists of twosteps. First, the probability that x maximizes Ξ(x) + Îł(x) isâ«g(t â Ξ(x))
âx6=xG(t â Ξ(x))dt. Second, g(t â Ξ(x)) =
exp(Ξ(x)) · exp(â(t + c))G(t â Ξ(x)). Thus, the probabilitythat x maximizes Ξ(x) + Îł(x) is proportional to exp(Ξ(x)),i.e., it is the Gibbs distribution.
We can also use the random MAP perturbation to estimatethe entropy of the Gibbs distribution. See also the work ofTomczak [56].
Corollary 2: Let Îł = {Îł(x) : x â X} be a collection ofi.i.d. Gumbel random variables with cumulative distributionfunction (8). Let p(x) be the Gibbs distribution defined in (3)and let xâ be given by (4). Then the entropy of p(x) is givenby
H(p) = EÎł [Îł(xâ)] .
4
Proof: The proof consists of evaluating the entropy inEquation (3) and using Theorem 1 to replace logZ(Ξ) withEÎł [Ξ(xâ) + Îł(xâ)]. Formally,
H(p) = ââxâX
p(x)Ξ(x) + EÎł [Ξ(xâ) + Îł(xâ)]
= ââxâX
p(x)Ξ(x) +âxâX
Ξ(x)PÎł(xâ = x) + EÎł [Îł(xâ)]
= EÎł [Îł(xâ)],
where in the last line we used Corollary 1, which saysPÎł(xâ = x) = p(x).
A direct proof of the preceding corollary can be given byshowing that EÎł [Îł(xâ) · 1[x = xâ]] = âp(x) log p(x) whilethe entropy is then attained by summing over all x, sinceâ
xâX 1[x = xâ] = 1. To establish this equality we note that
EÎł [Îł(xâ) · 1[x = xâ]] =
â«(tâ Ξ(x))g(tâ Ξ(x))
âx 6=x
G(tâ Ξ(x))dt.
Using the relation between g(t) and G(t) and the fact thatâxâX G(t â Ξ(x)) = G(t â logZ(Ξ)) while changing the
integration variable to t = t â Ξ(x) we can rephrase thisquantity as
â«t exp(â(c+t))G(t+log p(x))dt. Again, by using
the relation between g(t+ log p(x)) and G(t+ log p(x)), wederive that EÎł [Îł(xâ) · 1[x = xâ]] = p(x)
â«tg(t+ log p(x))dt
while the integral is now the mean of a Gumbel randomvariable with expected value of â log p(x).
The preceding derivations show that perturbing the potentialfunction Ξ(x) and then finding the MAP estimate xâ ofthe perturbed Gibbs distribution allows us to perform manycore tasks for high-dimensional statistical inference by usingi.i.d. Gumbel perturbations. The distribution of xâ is p(x),its expected maximum value is the log-partition function,and the expected maximizing perturbation is the entropyof p(x). While theoretically appealing, these derivations arestill computationally intractable: they involve generating |X |random variables, which grows exponentially with n, and thenapplying a heuristic MAP solver. In this paper we proposea method for reducing this complexity using a number ofrandom variables that grows only linearly with n.
III. LOW-DIMENSIONAL PERTURBATIONS
In this section we show that the log-partition function can becomputed using low-dimensional perturbations in a sequenceof expected max-value computations. This will give us someinsight on performing high dimensional inference using lowdimensional perturbations. In what follows we will use thenotation xji to refer to the tuple (xi, xi+1, . . . , xj) for i < j,with x = xn1 .
The log-partition function logZ(Ξ) (c.f. Theorem 2) isthe key quantity to understand: its gradient is the Gibbsdistribution and the entropy is its Fenchel dual. It is well-known that computing Z(Ξ) for high-dimensional models ischallenging because of the exponential size of X . However,the partition function has a self-reducible form. That is, wecan compute it iteratively while computing partial partition
functions of lower dimensions:
Z(Ξ) =âx1
âx2
· · ·âxn
exp(Ξ(x1, x2, . . . , xn)). (10)
For example, the partition function is the sum, over x1,of partial partition functions
âx2,...,xn
exp(Ξ(x)). Fixingx1, x2, . . . , xi, the remaining summations are partial partitionfunctions
âxi+1,...,xn
exp(Ξ(x)). With this in mind, we cancompute each partial partition function using Theorem 1 butwith low-dimensional perturbations for each partial partition.
Theorem 2: Let {Îłi(xi)}xiâXi,iâ[n], be a collection of in-dependent and identically distributed (i.i.d.) random vari-ables following the Gumbel distribution (8). Define Îłi ={Îłi(xi)}xiâXi . Then
logZ = EÎł1 maxx1
· · ·Eγn maxxn
{Ξ(x) +
nâi=1
Îłi(xi)
}. (11)
Proof: The result follows from applying Theorem 1iteratively. Let Ξn(xn1 ) = Ξ(xn1 ) and define
Ξiâ1(xiâ11 ) = EÎłi max
xi{Ξi(xi1) + γi(xi)} i = 2, 3, . . . , n
If we think of xiâ11 as fixed and apply Theorem 1 to
Ξi(xiâ11 , xi), we see that from (10),
Ξiâ1(xiâ11 ) = log
âxi
exp(Ξi(xi1)).
Applying this for i = n to i = 2, we obtain (11).The computational complexity of the alternating procedure
in (11) is still exponential in n. For example, the innermostiteration Ξnâ1(xnâ1
1 ) = EÎłn maxxn{Ξn(xn1 ) + Îłn(xn)} needsto be estimated for every xnâ1
1 = (x1, x2, . . . , xnâ1), whichis growing exponentially with n. Thus from a computationalperspective the alternating formulation in Theorem 2 is justas inefficient as the formulation in Theorem 1. Nevertheless,this is the building block that enables inference in high-dimensional problems using low dimensional perturbationsand max-solvers. Specifically, it provides the means for a newsampling algorithm from the Gibbs distribution and bounds onthe log-partition and entropy functions.
A. Ideal Sampling
Sampling from the Gibbs distribution is inherently tied toestimating the partition function. If we could compute thepartition function exactly, then we could sample from theGibbs distribution sequentially: for dimension i â [n] samplexi with probability which is proportional to
âxni+1
exp(Ξ(x)).Since computing the partition function is #P-hard, we constructa family of self-reducible upper bounds which imitate the par-tition function behavior, namely by bounding the summationover its exponentiations.
Corollary 3: Let {Îłi(xi)}xiâXi,iâ[n] be a collection of i.i.d.random variables, each following the Gumbel distribution withzero mean. Set
Ïj(xj1) = EÎł
maxxnj+1
{Ξ(x) +
nâi=j+1
Îłi(xi)}
. (12)
5
Then, for every j â [nâ 1] and every x = xn1 , the followinginequality holds:â
xj
exp(Ïj(x
j1))†exp
(Ïjâ1(xjâ1
1 )). (13)
In particular, for j = n we haveâxn
exp(Ξ(xn1 )) =
exp(Ïnâ1(xnâ1
1 )).
Proof: The result is an application of the perturb-max in-terpretation of the partition function in Theorem 1. Intuitively,these bounds correspond to moving expectations outside themaximization operations in Theorem 2, each move resultingin a different bound. Formally, the left hand side of (13) canbe expanded as
EÎłj
maxxj
EÎłj+1,...,Îłn
maxxnj+1
Ξ(xn1 ) +
nâi=j
Îłi(xi)
,
(14)
while the right hand side is attained by alternating themaximization with respect to xj with the expectation ofÎłj+1, . . . , Îłn. The proof then follows by exponentiating bothsides.
The above corollary is similar in nature to variationalapproaches that have been extensively developed to efficientlyestimate the partition function in large-scale problems. Theseare often inner-bound methods in which a simpler distributionis optimized as an approximation to the posterior in a KL-divergence sense (e.g., mean-field [57]). Variational upperbounds are convex and usually derived by replacing theentropy term with a simpler surrogate function and relaxingconstraints on sufficient statistics [58].
We use these upper bounds for every dimension i â [n]to sample from a probability distribution that follows a sum-mation over exponential functions, with a discrepancy that isdescribed by the upper bound. This is formalized below inAlgorithm 1. Note that x = (xjâ1
1 , xj ,xnj+1).
This algorithm is forced to restart the entire sample ifit samples the ârejectâ symbol r at any iteration. We saythe algorithm accepts if it terminates with an output x. Theprobability of accepting with particular x is the product ofthe probabilities of sampling xj in round j for j â [n]. Sincethese upper bounds are self-reducible, i.e., for every dimensioni we are using the same quantities that were computed inthe previous dimensions 1, 2, . . . , i â 1, we are samplingan accepted configuration proportionally to exp(Ξ(x)), thefull Gibbs distribution. This is summarized in the followingtheorem.
Theorem 3: Let p(x) be the Gibbs distribution defined in(1) and let {Îłi(xi)} be a collection of i.i.d. random variablesfollowing the Gumbel distribution with zero mean given in(8). Then
P (Algorithm 1 accepts) = Z(Ξ)/
exp
(EÎł
[maxx{Ξ(x) +
nâi=1
Îłi(xi)}
]).
Moreover, if Algorithm 1 accepts then it produces a configu-ration x = (x1, . . . , xn) according to the Gibbs distribution:
P(Algorithm 1 outputs x
âŁâŁ Algorithm 1 accepts)
=exp(Ξ(x))
Z(Ξ).
Algorithm 1 Unbiased sampling from Gibbs distributionRequire: Potential function Ξ(x), MAP solver
Initial step j = 1.while j < n do
For all x â Xj compute
Ïj(xjâ11 , x) = EÎł
maxxnj+1
Ξ(xjâ11 , x,xnj+1) +
nâi=j+1
Îłi(xi)
.
(15)
Define a distribution on Xj âȘ {r}:
pj(x) =exp
(Ïj(x
jâ11 , x)
)exp
(Ïjâ1(xjâ1
1 )) , x â Xj (16)
pj(r) = 1ââxâXj
pj(x) (17)
Sample xj from pj(·).if xj = r then
Set j = 1 to restart sampler.else xj â Xj
Set j â j + 1.end if
end whilereturn x = (x1, x2, . . . , xn)
Proof: Set Ïj(xj1) as in Corollary 3. The probability of
sampling a configuration x = (x1, . . . , xn) without rejectingis
nâj=1
exp(Ïj(x
j1))
exp(Ïjâ1(xjâ1
1 )) =
exp(Ξ(x))
exp (EÎł [maxx {Ξ(x) +âni=1 Îłi(xi)}])
.
The probability of sampling without rejecting is thus the sumof this probability over all configurations, i.e.,
P (Algorithm 1 accepts) = Z(Ξ)/
exp
(EÎł
[maxx
{Ξ(x) +
nâi=1
Îłi(xi)
}]).
Therefore conditioned on acceptance, the output configurationis produced according to the Gibbs distribution.
Since acceptance/rejection follows a geometric distribu-tion, the sampling procedure rejects k times with proba-bility (1â P (Algorithm 1 accepts))k. The running time ofour Gibbs sampler is determined by the average number ofrejections 1/P(Algorithm 1 accepts). The exponent of thiserror event is:
log1
P (Algorithm 1 accepts)= EÎł
[maxx
{Ξ(x) +
nâi=1
Îłi(xi)
}]â logZ(Ξ).
To be able to estimate the number of steps the samplingalgorithm requires, we construct an efficiently computablelower bound to the log-partition function that is based onperturb-max values.
6
B. Approximate Inference and Lower Bounds to the PartitionFunction
To be able to estimate the number of steps the samplingAlgorithm 1 requires, we construct an efficiently computablelower bound to the log-partition function, that is based onperturb-max values. Let {Mi : iı[n]} be a collection of positiveintegers. For each i â [n] let xi = {xi,ki : ki â [Mi]} be atuple of Mi elements of Xi. We define an extended potentialfunction over a configuration space of
âni=1Mi variables x =
(x1, x2, . . . , xn):
Ξ(x) =1ân
i=1Mi
M1âk1=1
M2âk2=1
· · ·Mnâkn=1
Ξ(x1,k1 , x2,k2 , . . . , xn,kn).
(18)
Now consider a collection of i.i.d. zero-mean Gumbel ran-dom variables {Îłi,ki(xi,ki)}iâ[n],kiâ[Mi] with distribution (8).Define the following perturbation for the extended model:
Îłi(xi) =1
Mi
Miâki=1
Îłi,ki(xi,ki). (19)
Corollary 4: Let Ξ(x) be a potential function over x =(x1, . . . , xn) and logZ be the log partition function for thecorresponding Gibbs distribution. Then for any Δ > 0 we have
PÎł
(logZ â„ max
x
{Ξ(x) +
nâi=1
Îłi(xi)
}â Δn
)â„ 1â
nâi=1
Ï2âij=2 |Xjâ1|6MiΔ2
.
(20)
Proof: The proof consists of three steps:1) developing a measure concentration analysis for The-
orem 1, which states that a single max-evaluation isenough to lower bound the expected max-value withhigh probability;
2) using the self-reducibility of the partition function inTheorem 2 to show the partition function can be com-puted by iteratively applying low-dimensional perturba-tions;
3) proving that these lower dimensional partition functionscan be lower bounded uniformly (i.e., all at once) witha single measure concentration statement.
We first provide a measure concentration analysis of Theo-rem 1. Specifically, we estimate the deviation of the randomvariable F = maxxâX {Ξ(x) + Îł(x)} from its expectedvalue using Chebyshevâs inequality. For this purpose we recallTheorem 1 which states that F is Gumbel-distributed andtherefore its variance is Ï2/6. Chebyshevâs inequality thenasserts that
PÎł (|F â EÎł [F ]| ℠Δ) †Ï2/6Δ2. (21)
Since we want this statement to hold with high probability forsmall epsilon we reduce the variance of the random variablewithout changing its expectation by taking the mean of i.i.d.perturb-max values. Let F (γ) = maxx{Ξ(x)+γ(x)} and sam-ple M i.i.d. random variables γ1, γ2, . . . , γM with the samedistribution as γ and generate the i.i.d. Gumbel-distributedvalues Fj
â= F (Îłj). We call Îł1, Îł2, . . . , ÎłM âcopiesâ of Îł.
Since EÎł [F (Îł)] = logZ, we can apply Chebyshevâs inequalityto the 1
M
âMi=1 Fj â logZ to get
P
(âŁâŁâŁâŁâŁ 1
M
Mâi=1
Fj â logZ
âŁâŁâŁâŁâŁ ℠Δ)†Ï2
6MΔ2. (22)
Using the explicit perturb-max notation and considering onlythe lower-side of the measure concentration bound, this showsthat with probability at least 1â Ï2
6MΔ2 we have
logZ â„ 1
M
Mâj=1
maxxâX{Ξ(x) + Îłj(x)} â Δ. (23)
To complete the first step, we wish to compute the summationover M -maximum values using a single maximization. Forthis we form an extended model on XM containing variablesx1, x2, . . . , xM â X and note thatMâj=1
maxxâX{Ξ(x) + Îłj(x)} = max
x1,x2,...,xM
Mâj=1
(Ξ(xj) + γj(xj)).
(24)
For the remainder we use an argument by induction on n,the number of variables. Consider first the case n = 2 sothat Ξ(x) = Ξ1,2(x1, x2). The self-reducibility as described inTheorem 1 states that
logZ = log
(âx1
exp
[log
(âx2
exp(Ξ1,2(x1, x2))
)]).
(25)
As in the proof of Theorem 1, define Ξ1(x1) =log(
âx2
exp(Ξ1,2(x1, x2)). Thus, we have logZ =
log(â
x1exp(Ξ1(x1))
), which is a partition function
for a single-variable model.We wish to uniformly approximate Ξ1(x1) over all x1 â X1.
Fix x1 = a for some a â X1 and consider the single-variable model Ξ1,2(a, x2) over x2 which has Ξ1(a) as its log-partition function. Then from Theorem 1, we have Ξ1(a) =EÎł2 [maxx2{Ξ(a, x2) + Îł2(x2)}]. Applying Chebyshevâs in-equality in (22) to M2 âcopiesâ of Îł2, we get
P
âŁâŁâŁâŁâŁâŁ 1
M2
M2âj=1
maxx2
{Ξ(a, x2) + Îł2,j(x2)]} â Ξ1(a)
âŁâŁâŁâŁâŁâŁ ℠Δ †Ï2
6M2Δ2.
Taking a union bound over a â X1 we have
P
âŁâŁâŁâŁâŁâŁ 1
M2
M2âj=1
maxx2
{Ξ(x1, x2) + Îł2,j(x2)]} â Ξ1(x1)
âŁâŁâŁâŁâŁâŁ †Δ âx1 â X1
†1â |X1|Ï2
6M2Δ2.
This implies the following one-sided inequality with probabil-ity at least 1â |X1| Ï2
6M2Δ2uniformly over x1 â X1:
Ξ1(x1) ℠1
M2
M2âj=1
maxx2
{Ξ(x1, x2) + Îł2,j(x2)]} â Δ. (26)
Now note that the overall log-partition function for themodel Ξ(x) = Ξ1,2(x1, x2) is a log-partition function fora single variable model with potential Ξ1(x1), so logZ =log(
âx1
exp(Ξ1(x1))). Again using Theorem 1, we have
7
logZ = Eγ1 [maxx1{Ξ1(x1) + γ1(x1)}], so we can apply
Chebyshevâs inequality to M1 âcopiesâ of Îł1 to get that withprobability at least 1â Ï2
6M1Δ2:
logZ â„ 1
M1
M1âk=1
maxx1
{Ξ1(x1) + Îł1,k(x1)} â Δ. (27)
Plugging in (26) into (27), we get that with probability at least1â Ï2
6M1Δ2â |X1| Ï2
6M2Δ2:
logZ â„ 1
M1
M1âk=1
maxx1
1
M2
M2âj=1
maxx2
{Ξ(x1, x2) + γ2,j(x2)}
+ Îł1,k(x1)
â 2Δ. (28)
Now we pull the maximization outside the sum by intro-ducing i.i.d. âcopiesâ of the variables again: this time we haveM1 copies x1 and M1M2 copies x2 as in (24). Now,
1
M1
M1âk=1
maxx1
1
M2
M2âj=1
maxx2
{Ξ(x1, x2) + γ2,j(x2)}
+ Îł1,k(x1)
=
1
M1
M1âk=1
maxx1
maxx2,1,...,x2,M2
1
M2
M2âj=1
Ξ(x1, x2,j) + γ2,j(x2,j)
+ Îł1,k(x1)
= maxx1,1,...,x1,M1
maxx2,1,...,x2,M2
1
M1M2
M1âk=1
M2âj=1
Ξ(x1,k, x2,j) + γ2,j(x1,k, x2,j) + γ1,k(x1,k).
(29)
Note that in this bound we have to generate |X1||X2|variables Îł2,j(x1,k, x2,j), which will become inefficient aswe add more variables. We can get an efficiently computablelower bound on this quantity by generating a smaller set ofvariables: we use the same perturbation realization Îł2,j(x2,j)for every value of x1,k. Thus, we have the lower bound
logZ â„ maxx1,x2
1
M1M2
M1âk=1
M2âj=1
(Ξ(x1,k, x2,j) + Îł2,j(x2,j) + Îł1,k(x1,k))â 2Δ
(30)
with probability at least 1 â Ï2
6M1Δ2â |X1| Ï2
6M2Δ2. Here
we have abused notation slightly and used x1 ={x1,1, x1,2, . . . , x1,M1
} and x2 = {x2,1, x2,2 . . . , x2,M2}.
Now suppose the result holds for models on nâ1 variablesand consider the model Ξ(x1, x2, . . . , xn) on n variables.Consider the 2-variable model Ξ(x1,x
n2 ) and define
Ξ1(x1) = log
âxn2
exp(Ξ(x1,xn2 ))
. (31)
From the analysis of the 2-variable case, as in (27), the fol-lowing lower bound holds with probability at least 1â Ï2
6M1Δ2:
logZ â„ 1
M1
M1âk1=1
maxx1
{Ξ1(x1) + Îł1,k1(x1)} â Δ. (32)
Now note that for each value of x1, the function Ξ1(x1) is alog-partition function on the nâ 1 variables xn2 . Applying the
induction hypothesis to Ξ1(x1), we have with probability atleast
1â Ï2
6M2Δ2â |X2|
Ï2
6M3Δ2â |X2||X3|
Ï2
6M4Δ2â · · · â
nâ1âj=2
|Xj |Ï2
6MnΔ2, (33)
the following lower bound holds:
Ξ1(x1) ℠maxxn2
{Ξ(x1, x
n2 ) +
nâi=2
Îłi(xi)
}â Δ(nâ 1). (34)
Taking a union bound over all x1, with probability at least
1ânâi=1
iâj=2
|Xjâ1|
Ï2
6MnΔ2(35)
we have
logZ â„ 1
M1
M1âk1=1
maxx1
{maxxn2
{Ξ(x1, x
n2 ) +
nâi=2
Îłi(xi)
}+ Îł1,k1(x1)
}â Δn
â„ maxx
Ξ(x) +
nâi=1
Îłi(xi)â Δn,
as desired.Importantly, the perturbation structure in our lower bound
is local and therefore the complexity of computing the lowerbound (i.e., maximizing the perturbed potential function) isthe same as optimizing the potential function itself. To seethat, we note that the transition from (29) to (30) controlsthe complexity of our lower bound. This lower bound onlyuses local perturbation (for each variable, e.g., γ2,j(x2,j) in(30)) instead of high-order perturbation (for each group ofvariables, e.g., γ2,j(x1,k, x2,j) in (29)). Since all perturbationsare local, the key to understand the efficiency of this lowerbound is in analyzing the structure of the potential functionsΞ(x) and Ξ(x). Although Ξ(x) seems to consider exponentiallymany configurations, its order is the same as the originalpotential function Ξ(x). Particularly, if Ξ(x) is the sum oflocal and pairwise potential functions (as happens for theIsing model) then Ξ(x) is also the sum of local and pairwisepotential functions. Therefore, whenever the original modelcan be maximized efficiently, e.g., for super-modular functionsor graphs with bounded tree width, the inflated model Ξ(x)can also be optimized efficiently. Moreover, while the theoryrequires Mi to be exponentially large (as a function of n), itturns out that in practice Mi may be very small to generatetight bounds (see Section V). Theoretically tighter bounds canbe derived by our measure concentration results in Section IVbut they do not fully capture the tightness of this lower bound.
C. Entropy bounds
We now show how to use perturb-max values to bound theentropy of high-dimensional models. Estimating the entropyis an important building block in many machine learningapplications. Corollary 2 applies the interpretation of Gibbsdistribution as a perturb-max model (see Corollary 1) inorder to calculate the entropy of Gibbs distributions using theexpected value of the maximal perturbation. As before, this
8
procedure requires exponentially many independent perturba-tions Îł(x), for every x â X .
We can again use our low-dimensional perturbations toupper bound the entropy of perturb-max models by extendingour definition of perturb-max models as follows. Let A bea collection of subsets of [n] such that
âαâA = [n]. For
each α â A generate a Gumbel perturbation γα(xα) wherexα = (xi)iâα. We define the perturb-max models as
p(x; Ξ) = Pγ
(x = argmax
x
{Ξ(x) +
âαâA
γα(xα)
}). (36)
Our upper bound uses the duality between entropy and the log-partition function [37] and then upper bounds the log-partitionfunction with perturb-max operations.
Upper bounds for the log-partition function using randomperturbations can be derived from the refined upper bounds inCorollary 3. However, it is simpler to provide upper boundsthat rely on Theorem 2. These bounds correspond to movingexpectations outside the maximization operations.
Lemma 1: Let Ξ(x) be a potential function over x =(x1, . . . , xn), and {Îłi(xi)}xiâXi,iâ[n] be a collection of in-dependent and identically distributed (i.i.d.) random variablesfollowing the Gumbel distribution. Then
logZ(Ξ) †Eγ
[max
x=(x1,x2,...,xn)
{Ξ(x) +
nâi=1
Îłi(xi)
}].
(37)
Proof: The lemma follows from Theorem 2 that repre-sents (11) as the log-partition as a sequence of alternatingexpectations and maximizations, namely
logZ(Ξ) = Eγ1 maxx1
· · ·Eγn maxxn
{Ξ(x) +
nâi=1
Îłi(xi)
}.
(38)
The upper bound is attained from the right hand side of theabove equation by Jensenâs inequality (or equivalently, bymoving all the expectations in front of the maximizations,yielding the following:
EÎł1 maxx1
· · ·Eγn maxxn
{Ξ(x) +
nâi=1
Îłi(xi)
}†Eγ1 · · ·Eγn max
x1
· · ·maxxn
{Ξ(x) +
nâi=1
Îłi(xi)
}.
(39)
In this case the bound is an average of MAP valuescorresponding to models with only single node perturbationsÎłi(xi), for every i â [n] and xi â Xi. If the maximization overΞ(x) is feasible (e.g., due to supermodularity), it will typicallybe feasible after such perturbations as well. We generalize thisbasic result further below.
Corollary 5: Consider a family of subsets α â A such thatâαâA α = [n], and let xα = {xi : i â α}. Assume that the
random variables γα(xα) are i.i.d. according to the Gumbeldistribution (8) for every α,xα. Then
logZ(Ξ) †Eγ
[maxx
{Ξ(x) +
âαâA
γα(xα)
}].
Proof: If the subsets α are disjoint the upper boundis an application of Lemma 1 as follows: we consider thepotential function Ξ(x) over the disjoint subsets of variablesx = (xα)αâA as well as the i.i.d. Gumbel random variablesγα(xα). Applying Lemma 1 yields the following upper bound:
logZ(Ξ) †Eγ
[max
x=(xα)αâA
{Ξ(x) +
âαâA
γα(xα)
}]. (40)
In the general case, α, ÎČ â A may overlap. To follow thesame argument, we lift the n-dimensional assignment x =(x1, x2, . . . , xn) to an higher-dimensional assignment a(x) =(xα)αâA which creates an independent perturbation for eachα â A. To complete the proof, we also construct a potentialfunction ΞâČ(xâČ) such that
ΞâČ(xâČ) =
{Ξ(x) if âx s.t. a(x) = xâČ
ââ otherwise.(41)
Thus, logZ(Ξ) = logZ(ΞâČ) =â
xâČ exp(ΞâČ(xâČ)) since incon-sistent assignments (i.e., xâČ such that a(x) 6= xâČ for any x)receive zero weight. Moreover,
maxxâČ
{ΞâČ(xâČ) +
âαâA
γα(xâČα)
}= max
x
{Ξ(x) +
âαâA
γα(xα)
}for each realization of the perturbation. This equality holdsafter expectation over Îł as well. Now, given that the pertur-bations are independent for each lifted coordinate, the basicresult in (37) guarantees that
logZ(ΞâČ) †EÎł
[maxxâČ
{ΞâČ(xâČ) +
âαâA
γα(xâČα)
}],
from which the result follows since logZ(Ξ) = logZ(ΞâČ)Establishing bounds on the log-partition function allows us
to derive bounds on the entropy. For this we use the conjugateduality between the (negative) entropy and the log-partitionfunction [37]. The entropy bound then follows from the log-partition bound.
Theorem 4: Let p(x; Ξ) be a perturb-max probability dis-tribution in (36) and A be a collection of subsets of [n] suchthat
âαâA α = [n], and let xα = {xi : i â α}. Assume
that the random variables γα(xα) are i.i.d. according to theGumbel distribution (8) for every α,xα. Let xγ be the optimalperturb-max assignment using low dimensional perturbations:
xÎł = argmaxx
{Ξ(x) +
âαâA
γα(xα)
}. (42)
Then we have the following upper bound on the entropy:
H(p) †Eγ
[âαâA
γα(xγα)
].
Proof: We use the characterization of the log-partitionfunction as the conjugate dual of the (negative) entropyfunction [37]:
H(p) = minΞ
{logZ(Ξ)â
âx
p(x; Ξ)Ξ(x)
}.
9
The minimum is over all potential functions on X . For a fixedscore function Ξ(x), let W (Ξ) be the expected value of thelow-dimensional perturbation:
W (Ξ) = Eγ
[maxx
{Ξ(x) +
âαâA
γα(xα)
}].
Corollary 5 asserts that logZ(Ξ) â€W (Ξ). Thus, we can upperbound H(p) by replacing logZ(Ξ) with W (Ξ) in the dualityrelation:
H(p) †minΞ
{W (Ξ)â
âx
p(x; Ξ)Ξ(x)
}.
The infimum of the right hand side is attained wheneverthe gradient vanishes, i.e., whenever âW (Ξ) = p(x; Ξ). Tocompute âW (Ξ) we differentiate under the integral sign:
âW (Ξ) = EÎł
[âmax
x
{Ξ(x) +
âαâA
γα(xα)
}].
Since the (sub)gradient of the maximum-function is the indi-cator function, we deduce that âW (Ξ) is the expected valueof the events of xÎł . Consequently, âW (Ξ) is the vector ofthe probabilities of all these events, namely, the probabilitydistribution p(x; Ξ). Since the derivatives of W (Ξ) are perturb-max models, and so is p(x; Ξ), then the the infimum is attainedfor Ξ = Ξ. Therefore, recalling that xÎł has distribution p(x; Ξ)in (36):
minΞ
{W (Ξ)â
âx
p(x; Ξ)Ξ(x)
}= W (Ξ)â
âx
p(x; Ξ)Ξ(x).
= EÎł
[maxx
{Ξ(x) +
âαâA
γα(xα)
}]â EÎł [Ξ(xÎł)]
= EÎł
[Ξ(xγ) +
âαâA
γα(xγα)
]â EÎł [Ξ(xÎł)]
= EÎł
[âαâA
γα(xγα)
],
from which the result follows.This entropy bound motivates the use of perturb-max poste-
rior models. These models are appealing as they are uniquelybuilt around prediction and as such they inherently have anefficient unbiased sampler. The computation of this entropybound relies on MAP solvers. Thus, computing these boundsis significantly faster than computing the entropy itself, whosecomputational complexity is generally exponential in n.
Using the linearity of expectation we may alternate sum-mation and expectation. For simplicity, assume only localperturbations, i.e., Îłi(xi) for every dimension i = [n]. Thenthe preceding theorem bounds the entropy by summing the ex-pected change of MAP perturbations H(p) â€
âi EÎł [Îłi(x
Îłi )].
This bound resembles to the independence bound for theentropy H(p) â€
âiH(pi), where pi(xi) =
âx\xi p(x)
are the marginal probabilities [59]. The independence boundis tight whenever the joint probability p(x) is composed ofindependent systems, i.e., p(x) =
âi pi(xi). In the following
we show that the same holds for perturbation bounds.
Corollary 6: Consider the setting of Theorem 4 with xÎł
given by (42) and the independent probability distributionp(x) =
âi pi(xi). Let {Îłi(xi)}xiâXi,iâ[n] be a collection of
i.i.d. random variables, each following the Gumbel distributionwith zero mean. Then the corresponding potential function forp(x) is Ξ(x) =
âni=1 log pi(xi) and
H(p) = EÎł
[nâi=1
Îłi(xÎłi )
],
where
xÎł = argmaxx
{Ξ(x) +
nâi=1
Îłi(xi)
}.
Proof: Since the system is independent, H(p) =âiH(pi). Since Ξi(xi) = log pi(xi) and {Îłi(xi)}xiâXi
are independent, we may apply Corollary 2 to each di-mension i to obtain H(pi) = EÎł [Îłi(x
Îłii )], where xÎłi =
argmaxxi {Ξi(xi) + γi(xi)}. To complete the proof, we setΞ(x) =
âi Ξi(xi). Since the system is independent, there
holds xÎłi = xÎłiiâ= argmaxxi {Ξi(xi) + Îłi(xi)}. Therefore,
H(p) =âiH(pi) =
âi EÎłi [Îłi(x
Îłii )] =
âi EÎł [Îłi(x
Îłi )] =
EÎł [âi Îłi(x
Îłi )].
There are two special cases for independent systems: thezero-one probability model, for which p(x) = 0 except fora single configuration p(x) = 1, and the uniform distributionwith p(x) = 1/|X | for every x â X . The entropy of the formeris 0 and of the latter is log |X |. For the zero-one model, theperturb-max entropy bound assigns xÎł = x for all randomfunctions Îł = (Îłi(xi))i,xi . Since these random variables havezero mean, it follows that EÎł [
âi Îłi(xi)] = 0. For the uniform
distribution we have EÎł [âi Îłi(xi)] = log |X |. This suggests
that we could use the perturb-max bound as an alternativeuncertainty measure for non-independent systems.
Corollary 7: Consider the setting of Theorem 4 with xÎł
given by (42). Define the function U(p) by
U(p) = EÎł
[âαâA
γα(xγα)
]. (43)
Then U(p) is non-negative and attains its minimal value forthe deterministic distributions and its maximal value for theuniform distribution.
Proof: As argued above, U(p) is 0 for deterministic p.Non-negativity follows from the requirement that the pertur-bation are zero-mean random variables: since
âα γα(xγα) â„â
α γα(xα) for x, then U(p) = EÎłâα γα(xγα) â„
EÎłâα γα(xα) = 0. Lastly, we must show that the uniform
distribution puni maximizes U(·), namely U(puni) â„ U(·). Thepotential function for the uniform distribution is constant forall x â X . This means that U(puni) = EÎł maxx
âα γα(xα).
On the other hand U(·) correspond to a potential func-tion Ξ(x) and its corresponding xγ . Furthermore, we havemaxx
âα γα(xα) â„
âα γα(xγα), and taking expectations
on both sides shows U(puni) = EÎł maxxâα γα(xα) â„
EÎłâα γα(xγα) for any other Ξ(·) and its corresponding xÎł .
The preceding result implies we can use U(p) as a surrogateuncertainty measure instead of the entropy. Using efficiently
10
computable uncertainty measures allows us to extend theapplications of perturb-max models to Bayesian active learn-ing [4]. The advantage of using the perturb-max uncertaintymeasure over the entropy function is that it does not requireMCMC sampling procedures. Therefore, our approach fitswell with contemporary techniques for using high-dimensionalmodels that are popular in machine learning applications suchas computer vision. Moreover, our perturb-max uncertaintymeasure is an upper bound on the entropy, so minimizingthe upper bound can be a reasonable heuristic approach toreducing entropy.
IV. MEASURE CONCENTRATION FOR LOG-CONCAVEPERTURBATIONS
High dimensional inference with random perturbations re-lies on expected values of MAP predictions. Practical appli-cation of this theory requires estimating these expectations;the simplest way to do this is by taking a sample average.We therefore turn to bounding the number of samples byproving concentration of measure results for our randomperturbation framework. The key technical challenge is that theGumbel perturbations γα(xα) have support on the entire realline, which means standard approaches for bounded randomvariables, such as McDiarmidâs inequality, do not apply.
Because the Gumbel distribution decays exponentially onewould expect that the distance between the perturbed MAPprediction and its expected value should concentrate. We showthis using new measure concentration results (in Section IV-E)that bound the deviation of a general function F (Îł) of Gumbelvariables via its moment generating function
ÎF (λ)â= E [exp(λF )] . (44)
For notational convenience we omit the subscript when thefunction we consider is clear from its context. The exponentialdecay follows from the Markov inequality: P (F (Îł) â„ r) â€Î(λ)/ exp(λr) for any λ > 0.
We derive bounds on the moment generating function Î(λ)(in Section IV-E) by looking at the expansion (i.e., gradi-ent) of F (Îł). Since the max-value changes at most linearlywith its perturbations, its expansion is bounded and so isÎ(λ). This type of bound appears often in the context ofisoperimetric inequalities; a series of results have establishedmeasure concentration bounds for general families of distri-butions, including log-concave distributions [60]â[65]. A onedimensional density function q(t) is said to be log-concaveif q(t) = exp(âQ(t)) and Q(t) is a convex function: log-concave distributions have log-concave densities. The familyof log-concave distribution includes the Gaussian, Laplace,logistic and Gumbel distributions, among many others. Theseprobability density functions decay exponentially1 with t. Tosee that we recall that for any convex function Q(t) â„Q(0) + tQâČ(0) for any t. By exponentiating and rearrangingwe can see q(t) †q(0) exp(âtQâČ(0)).
1One may note that for the Gaussian distribution QâČ(0) = 0 and that forLaplace distribution QâČ(0) is undefined. Thus to demonstrate the exponentialdecay one may verify that Q(t) â„ Q(c) + tQâČ(c) for any c thus q(t) â€q(c) exp(âtQâČ(c)).
A. A Poincare inequality for log-concave distributions
A Poincare inequality bounds the variance of a randomvariable by its expected expansion, i.e., the norm of itsgradient. These results are general and apply to any (almosteverywhere) smooth real-valued functions f(t) for t â Rm.The variance of a random variable (or a function) is its squaredistance from its expectation, according to the measure ”:
Var”(f)â=
â«Rf2(t)d”(t)â
(â«Rf(t)d”(t)
)2
. (45)
A Poincare inequality is a bound of the form
Var”(f) †Câ«Rââf(t)â2d”(t). (46)
If this inequality holds for any function f(t) we say that themeasure ” satisfies the Poincare inequality with a constantC. The optimal constant C is called the Poincare constant.To establish a Poincare inequality it suffices to derive aninequality for a one-dimensional function and extend it to themultivariate case by tensorization [63, Proposition 5.6].
Restricting to one-dimensional functions, Var”(f) â€â« +âââ f(t)2q(t)dt and the one-dimensional Poincare inequality
takes the form:â«Rf(t)2q(t)dt †C
â«Rf âČ(t)2q(t)dt. (47)
The following theorem, based on the seminal work of Bras-camp and Lieb [60], allows us to prove Poincare inequalities.
Theorem 5: Let ” be a log-concave measure with densityq(t) = exp(âQ(t)), where Q : R â R is a convex functionthat has a unique minimum in the point t = a. Also, assumeQ(t) is twice continuously differentiable excepts possibly att = a, limtâa± Q
âČ(t) 6= 0 or limtâa± QâČâČ(t) 6= 0. Let f :
Râ R a function that1) is continuous and in L2(”),2) is differentiable almost everywhere with derivative f âČ â
L2(”),3) satisfies limtâ±â f(t)2q(t) = 0 and
limtâ±â f(t)q(t) = 0.
Then for any η â[mintâR\{a}â QâČâČ(t)
(QâČ(t))2 , 1], we have
Var”(f) †1
1â η
â« +â
ââ
(f âČ(t))2
QâČâČ(t) + η(QâČ(t))2q(t)dt.
Proof: The variance of f(t) is upper bounded by its sec-ond moment with respect to any constant K, that is Var”(f) â€â« +âââ (f(t) â K)2q(t)dt for any K. Thus, we set K = f(a)
and define f(t) = f(t)â f(a). Since f âČ(t) = f âČ(t), we haveto prove that
â« +âââ f2(t)q(t)dt â€
â« +âââ
(f âČ(t))2
QâČâČ(t)+η(QâČ(t))2 q(t)dt.
We define the function Ï(t) =(f2(t)q(t)/QâČ(t)
)âČ. We have
thatâ« +â
ââÏ(t)dt =
â« +â
ââ
(f2(t)q(t)/QâČ(t)
)âČdt (48)
=
â« a
ââ
(f2(t)q(t)/QâČ(t)
)âČdt+
â« +â
a
(f2(t)q(t)/QâČ(t)
)âČdt.
(49)
11
Now consider the first integralâ« a
ââ
(f2(t)q(t)/QâČ(t)
)âČdt = lim
tâaâf2(t)q(t)/QâČ(t)â lim
tâââf2(t)q(t)/QâČ(t).
(50)
The limtâââ f2(t)q(t)/QâČ(t) = 0, because Q is convex andwith a unique minimum at a finite point so limtâââQâČ(t) 6=0. The treatment of the term limtâa f
2(t)q(t)/QâČ(t) is slightlymore involved. We distiguish two cases. If limtâaâ Q
âČ(t) 6=0, then limtâa f
2(t)q(t)/QâČ(t) = 0 because f(a) = 0. Onthe other hand, if limtâaâ Q
âČ(a) = 0, by our assumption wehave that limtâaâ Q
âČâČ(t) 6= 0 and to evaluate the limit weuse LâHopitalâs rule. Differentiating both the numerator anddenominator, we obtain limtâaâ 2f(t)f âČ(t)q(t)/QâČâČ(t) = 0,because f(a) = 0. The same argument follows for the integralover the interval [a,â], so we have
â« +âââ Ï(t)dt = 0.
Now, we would like to show the following bound on thefunction Ï(t):
Ï(t) =
(f2(t)
q(t)
QâČ(t)
)âȆâq(t)f2(t)+q(t)ηf2(t)+q(t)
f âČ2(t)
QâČâČ(t) + ηQâČ2(t).
(51)Assuming that (51) holds, the proof then follows by taking anintegral over both sides, while noticing that the left hand sideis zero. To prove the inequality in (51) we first note that bydifferentiating the function Ï(t), we get[f2(t)
q(t)
QâČ(t)
]âČ= âq(t)f2(t)â q(t)f2(t)
QâČâČ(t)
QâČ2(t)+ 2f2(t)f âČ(t)
q(t)
QâČ(t).
(52)
Using the inequality 2ab †ca2+b2/c for any c ℠0 we derivethe bound
2f âČ(t)
QâČ(t)f2(t) â€
(c(t)f2(t) +
f âČ2(t)
c(t)QâČ2(t)
). (53)
Finally, we set c(t) = QâČâČ(t)QâČ2(t) + η to satisfy c(t) â„ 0 and get
the inequality in (51).Brascamp and Liebâs theorem [60] can be obtained by our
derivation when η = 0. Their result implies a Poincare in-equality for strongly log-concave measures, where QâČâČ(t) â„ cand was later extended by Bobkov [66]. Our theorem isstrictly more general, in fact it also applies, for example, tothe Laplace distribution. For the Laplace distribution q(t) =exp(â|t|)/2, where QâČ(t) â {â1, 1} for any t 6= 0, satisfyingthe assumption of the Theorem. In particular, the Poincareinequality for the Laplace distribution given by Ledoux [63]follows from our result by setting η = 1/2. Instead, forthe Gaussian distribution we have QâČ(0) = 0 and for theGumbel distribution QâČ(âc) = 0, but the second derivative isdifferent than 0, again satisfying the assumptions of the the-orem. Recently, Nguyen [65] proved a similar bound for log-concave measures and multivariate functions while restrictingη â [1/2, 1]. In the one-dimensional setting, our bound is validfor a wider range of η â
[mintâR\{a}â QâČâČ(t)
(QâČ(t))2 , 1].
B. Bounds for the Gumbel distribution
Specializing to the Gumbel distribution we get the followingbound.
Corollary 8 (Poincare inequality for the Gumbel distribution):Let ” be the measure corresponding to the Gumbeldistribution. Let f : Rm â R be a multivariate functionwhich satisfies the conditions 1)â3) in Theorem 5 for eachdimension. Then
Var”(f) †4
â«Rââf(t)â2d”(t). (54)
Proof: First, we derive the Poincare constant for a one di-mensional Gumbel distribution. Then we derive the multivari-ate bound by tensorization. Following Theorem 5 it suffices toshow that there exists η such that 1
(1âη)(QâČâČ(t)+η(QâČ(t))2) †4for any t.
For the Gumbel distribution,
Q(t) = t+ c+ exp(â(t+ c)) (55)
QâČâČ(t) + η(QâČ(t))2 = eâ(t+c) + η(1â eâ(t+c))2. (56)
Simple calculus shows that tâ minimizing (56) is given by
0 = â(tâ + c)eâ(tâ+c) â 2η(tâ + c)(1â eâ(tâ+c))eâ(tâ+c)
(57)
or eâ(tâ+c) = 1 â 12η . The lower bound is then QâČâČ(t) +
η(QâČ(t))2 â„ 4ηâ14η whenever 1â 1
2η is positive, or equivalentlywhenever η > 1
2 . For η †12 , we note that QâČâČ(t)+η(QâČ(t))2 =
η + (1â 2η)eâ(t+c) + ηeâ2(t+c) ℠η.Combining these two cases, the Poincare constant is at
most min{
4η(4ηâ1)(1âη) ,
1η(1âη)
}= 4 at η = 1
2 . By applyingTheorem 5 we obtain the one-dimensional Poincare inequality.
Finally, for f : Rm â R we denote by Vari(f) thevariance of i-th variable while fixing the rest of the m â 1variables in f(t1, t2, . . . , tm) and Et\ti the expectation overall variables except the i-th variable. The one dimensionalPoincare inequality implies that
Vari(f) †4
â«R|âf(t)/âti|2d”. (58)
The proof then follows by a tensorization argument given byLedoux [63, Proposition 5.6], which shows that
Var”(f) â€mâi=1
Et\ti [Vari(f)] . (59)
Although the Poincare inequality establishes a boundon the variance of a random variable, it may also beused to bound the moment generating function Îf (λ) =â«R exp(λf(t))d”(t) [61], [62], [64]. For completeness we
provide a proof specifically for the Gumbel distribution.Corollary 9 (MGF bound for the Gumbel distribution):
Let ” be the measure corresponding to the Gumbeldistribution. Let f : Rm â R be a multivariate functionwhich satisfies the conditions 1)â3) in Theorem 5 for eachdimension and furthermore satisfies ââf(t)â †a almost
12
everwhere. Then whenever λa †1 the moment generatingfunction is bounded as
Îf (λ) †1 + λa
1â λa· exp (λE [f ]) . (60)
Proof: The proof is due to Bobkov and Ledoux [62].Applying the Poincare inequality with g(t) = exp(λf(t)/2)implies
Îf (λ)â Îf (λ/2)2 †4
â«R
λ2
4exp(λf(t)) · ââf(t)â2d”(t) †a2λ2Îf (λ).
(61)
Whenever λ2a2 †1 one can rearrange the terms to get the
bound Îf (λ) â€(
1â λ2a2)â1
Îf (λ/2)2. Applying this self-reducible bound recursively k times implies
Îf (λ) †Îf
(λ
2k
)2k kâi=0
(1â λ2a2
4i
)â2i
(62)
=
(1 +
λE [f ]
2k+ o(2âk)
)2k kâi=0
(1â λ2a2
4i
)â2i
,
(63)
where the last line follows from the fact that Îf (λ) =1 + λE [f ] + o(λ). Taking k â â and noting that(1 + c/2k)2k â ec we obtain the bound Îf (λ) â€ââi=0
(1â λ2a2
4i
)â2i
exp(λE [f ]). Applying Lemma 3 in the
Appendix shows thatââi=0
(1â λ2a2
4i
)â2i
†1+λa1âλa , which
completes the proof.Bounds on the moment generating function generally imply
(via the Markov inequality) that the deviation of a randomvariable from its mean decays exponentially in the number ofsamples. We apply these inequalities in a setting in which thefunction f is random and we think of it as a random variable.With some abuse of notation then, we will call f a randomvariable in the following corollary.
Corollary 10 (Measure concentration for the Gumbel distribution):Let ” be the measure corresponding to the Gumbeldistribution. Let f : Rm â R be a multivariate functionwhich satisfies the conditions 1)â3) in Theorem 5 for eachdimension and furthermore satisfies ââf(t)â †a almosteverwhere. Let f1, f2, . . . , fM be M i.i.d. random variableswith the same distribution as f . Then with probability at least1â ÎŽ,
1
M
Mâj=1
fj â E[f ] †2a
(1 +
â1
2Mlog
1
ÎŽ
)2
.
Proof: From the independence assumption, using theMarkov inequality, we have that
P
Mâj=1
fj â„ME[f ] +Mr
†exp(âλME[f ]â λMr)
Mâj=1
E[exp(λfj)].
Applying Corollary 9, we have, for any λ †1/a,
P
1
M
Mâj=1
fj â„ E[f ] + r
†exp(M(log(1 + λa)â log(1â λa)â λr)
).
Optimizing over positive λ subject to λ †1/a we obtainλ =
ârâ2aaâr
for r â„ 2a. Hence, for r â„ 2a, the right side
becomes exp
(M
(2 tanhâ1
(â1â 2a
r
)â râ
1â 2ar
a
)).
Now, consider the function g(x) = 2 tanhâ1(â
1â 2x)ââ
1â2xx and h(x) = â2(1â 1â
x)2. We have that gâČ(x) =
â1â2xx2
and hâČ(x) = 1ââ
2xx2 , that implies hâČ(x) †gâČ(x), âx â
(0, 0.5]. So, taking into account that g(0.5) = h(0.5), theabove implies that g(x) †h(x), âx â (0, 0.5]. Using thisinequality, we have, for r â„ 2a
exp
M2 tanhâ1
(â1â 2a
r
)ârâ
1â 2ar
a
(64)
†exp
(â2M
(âr
2aâ 1
)2). (65)
Equating the right side of the last inequality to ÎŽ and solvingfor r, we have the stated bound.
C. A modified log-Sobolev inequality for log-concave distri-butions
In this section we provide complementary measure con-centration results that bound the moment generating functionÎ(λ) =
â«exp(λf(t))d”(t) by its expansion (in terms of
gradients). Such bounds are known as modified log-Sobolevbounds. We follow the same recipe as previous works [62],[63] and use the so-called Herbst argument. Consider the λ-scaled cumulant generating function of a random function withzero mean, i.e., E[f ] = 0:
K(λ)â=
1
λlog Î(λ). (66)
First note that by LâHopitalâs rule K(0) = ÎâČ(0)Î(0) = E[f ], so
whenever E[f ] = 0 we may represent K(λ) by integrating itsderivative: K(λ) =
⫠λ0K âČ(λ)dλ. Thus to bound the moment
generating function it suffices to bound K âČ(λ) †α(λ) forsome function α(λ). A direct computation of K âČ(λ) translatesthis bound to
λÎâČ(λ)â Î(λ) log Î(λ) †λ2Î(λ)α(λ). (67)
The left side of (67) turns out to be the so-called functional en-tropy Ent, which is not the same as the Shannon entropy [63].We calculate the functional entropy of λf(t) with respect toa measure ”:
Ent”(exp(f))â=
â«f(t) · exp(f(t))d”(t)â
(â«exp(f(t))d”(t)
)log(â«
exp(f(t))d”(t)).
In the following we derive a variation of the modifiedlog-Sobolev inequality for log-concave distributions based ona Poincare inequality for these distributions. This in turnprovides a bound on the moment generating function. Thisresult complements the exponential decay that appears inSection IV-A. Figure 2 compares these two approaches.
Lemma 2: Let μ be a measure that satisfies the Poincaré inequality with a constant C, i.e., Var_μ(f) ≤ C ∫ ‖∇f(t)‖² dμ(t) for any continuous and almost everywhere differentiable function f(t). Assume that ∫ f(t) dμ(t) = 0 and that ‖∇f(t)‖ ≤ a < 2/√C. Then

Ent_μ(exp(f)) ≤ (a²C/2) ((2 + a√C)/(2 − a√C))² ∫ exp(f(t)) dμ(t).   (68)
Proof: First, z log z ≥ z − 1. Setting z = ∫ exp(f) dμ and applying this inequality results in the functional entropy bound

Ent_μ(exp(f)) ≤ ∫ f(t) exp(f(t)) dμ(t) − (∫ exp(f(t)) dμ(t)) + 1.

Rearranging the terms, the right hand side is ∫ (f(t) exp(f(t)) − exp(f(t)) + 1) dμ(t). We proceed by using the identity (cf. [67], Equation 17.25.2) for the indefinite integral ∫ s exp(sc) ds = (exp(sc)/c)(s − 1/c). Taking into account the limits [0, 1] and setting c = f(t) we get the desired form:

f(t)² ∫₀¹ s exp(s f(t)) ds = f(t) exp(f(t)) − exp(f(t)) + 1.

In particular,

Ent_μ(exp(f)) ≤ ∫ (∫₀¹ s f(t)² exp(s f(t)) ds) dμ(t)   (69)
  = lim_{ε→0⁺} ∫_ε¹ (1/s) (∫ s² f(t)² exp(s f(t)) dμ(t)) ds.   (70)
The last equality holds by Fubini's theorem.

Next we use Proposition 3.3 from [62], which applies the Poincaré inequality to g(t) exp(g(t)/2), to show that for any function g(t) with mean zero and ‖∇g(t)‖ ≤ a < 2/√C we have the inequality

∫ g²(t) exp(g(t)) dμ(t) ≤ C̃ ∫ ‖∇g(t)‖² exp(g(t)) dμ(t),   (71)

where C̃ = C((2 + a√C)/(2 − a√C))². Setting g(t) = s f(t) in this inequality satisfies ‖∇g(t)‖ = s‖∇f(t)‖ ≤ a < 2/√C. This implies the inequality

∫ s² f(t)² exp(s f(t)) dμ(t) ≤ s² C ((2 + a√C)/(2 − a√C))² ∫ ‖∇f(t)‖² exp(s f(t)) dμ(t).   (72)
Using ‖∇f‖ ≤ a we obtain the bound

Ent_μ(exp(f)) ≤ a²C ((2 + a√C)/(2 − a√C))² ∫₀¹ s (∫ exp(s f(t)) dμ(t)) ds.   (73)

The function φ(s) = ∫ exp(s f(t)) dμ(t) is convex in the interval s ∈ [0, 1], so its maximum value is attained at s = 0 or s = 1. Also, φ(0) = 1 and φ(1) = ∫ exp(f(t)) dμ(t). From Jensen's inequality, and the fact that ∫ f(t) dμ(t) = 0, we have ∫ exp(f(t)) dμ(t) ≥ exp(∫ f(t) dμ(t)) = 1. Hence φ(1) ≥ φ(0). So, we have

∫₀¹ s (∫ exp(s f(t)) dμ(t)) ds ≤ ∫ exp(f(t)) dμ(t) · ∫₀¹ s ds = (1/2) ∫ exp(f(t)) dμ(t)
and the result follows.

The preceding lemma expresses an upper bound on the functional entropy in terms of the moment generating function. Applying this lemma with the function λf, and assuming that ‖∇f‖ ≤ a, we rephrase this upper bound as

Ent_μ(exp(λf)) ≤ Λ(λ) · (λ²a²C/2) ((2 + λa√C)/(2 − λa√C))²,

where Λ(λ) is the moment generating function of f. Fitting it to (67) and (66) we deduce that

K′(λ) ≤ α(λ) = (a²C/2) ((2 + λa√C)/(2 − λa√C))².   (74)
Since the Poincaré constant of the Gumbel distribution is at most 4 we obtain its corresponding bound K′(λ) ≤ 2a²((1 + λa)/(1 − λa))². Applying the Herbst argument (cf. [63]), this translates to a bound on the moment generating function. This result is formalized as follows.
Corollary 11: Let μ denote the Gumbel measure on R and let f : R^m → R be a multivariate function that satisfies the conditions in Theorem 5 for each dimension. Also, assume that ‖∇f(t)‖ ≤ a. Then whenever λa ≤ 1 the moment generating function is bounded as

Λ(λ) ≤ β(λ) exp(λE[f]),

where β(λ) = exp( 2a²λ² (5 − λa)/(1 − λa) + 8aλ log(1 − λa) ).
Proof: We apply Lemma 2 to the function f̄(t) ≜ f(t) − E[f], which has zero mean. Denoting by Λ̄(λ) the moment generating function of f̄, we thus have

K′(λ) = Ent_μ(exp(λf̄)) / (λ² Λ̄(λ)) ≤ 2a² ((1 + λa)/(1 − λa))².   (75)

Recalling that K(0) = 0 we derive K(λ) = ∫₀^λ K′(s) ds. Using the bound on K′(λ) we obtain

K(λ) ≤ 2a² ∫₀^λ ((1 + sa)/(1 − sa))² ds.   (76)

A straightforward computation of the integral implies that

K(λ) ≤ 2a² [ (4/a) log(1 − aλ) + 4/(a(1 − aλ)) + λ − 4/a ] = 2a² [ (4/a) log(1 − aλ) + 4λ/(1 − aλ) + λ ].

Now, from the definition of K(λ) and that of β(λ), this implies log E[exp(λf̄)] ≤ log β(λ). Since E[exp(λf)] = exp(λE[f]) E[exp(λf̄)], the stated bound follows.
D. Evaluating measure concentration
The above bound is tighter than our previous bound in Theorem 3 of [3]. In particular, the bound in Corollary 11 does not involve ‖f(t)‖_∞. It is interesting to compare β(λ) in the above bound to the one that is attained directly from the Poincaré inequality in Corollary 9, namely Λ(λ) ≤ α(λ) · exp(λE[f]) where α(λ) = (1 + λa)/(1 − λa). Both α(λ) and β(λ) are finite in the interval 0 ≤ λ < 1/a, although they behave differently at their limits. In particular, α(0) = 1 and β(0) = 1. On the other hand, α(λ) < β(λ) as λa → 1. This is illustrated in Figure 2.
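As a quick illustration of this comparison (our own sketch; a = 1 matches the setting of Figure 2, and the grid of λ values is an arbitrary choice), the two bounds can be tabulated directly from their closed forms:

# Illustrative sketch: compare the MGF bounds
#   alpha(lambda) = (1 + lambda a)/(1 - lambda a)                      (Poincare, Corollary 9)
#   beta(lambda)  = exp(2 a^2 lambda^2 (5 - lambda a)/(1 - lambda a)
#                       + 8 a lambda log(1 - lambda a))                (log-Sobolev, Corollary 11)
# on 0 <= lambda < 1/a, for a = 1 as in Figure 2.
import numpy as np

a = 1.0
lam = np.linspace(0.0, 0.5, 11)
alpha = (1 + lam * a) / (1 - lam * a)
beta = np.exp(2 * a**2 * lam**2 * (5 - lam * a) / (1 - lam * a)
              + 8 * a * lam * np.log(1 - lam * a))
for l, al, be in zip(lam, alpha, beta):
    print(f"lambda={l:4.2f}  alpha={al:6.3f}  beta={be:6.3f}")

The table reproduces the qualitative behavior of Figure 2(a): both bounds equal 1 at λ = 0, the log-Sobolev bound is slightly tighter for small λ, and it grows above the Poincaré bound as λa approaches 1.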
With this bound we can now upper bound the error in estimating the average E[f] of a function f of m i.i.d. Gumbel random variables by generating M independent samples of f and taking the sample mean. We again abuse notation to think of f as a random variable itself.
Corollary 12 (Measure concentration via log-Sobolev inequalities): Consider a random function f that satisfies the same
Fig. 2. Comparing the measure concentration bounds attained by the Poincaré and modified log-Sobolev inequalities for a = 1. Figure 2(a): the moment generating function bounds attained by the Poincaré inequality in Corollary 9 and the modified log-Sobolev inequality in Corollary 11; the plots show the respective functions α(λ) = (1 + λa)/(1 − λa) and β(λ) that appear in these bounds. Figure 2(b): the deviation bounds of the sampled average from its mean attained by the Poincaré inequality in Corollary 10 and the modified log-Sobolev inequality in Corollary 12, when δ = 0.1 and the number of samples is M = 1, 2, . . . , 100. Figure 2(c): the deviation bounds attained by the Poincaré inequality in Corollary 10 and the modified log-Sobolev inequality in Corollary 12, as a function of log(1/δ)/M, which ranges over [0, 2].
assumptions as Corollary 11 almost surely. Let f_1, f_2, . . . , f_M be M i.i.d. random variables with the same distribution as f. Then with probability at least 1 − δ,

(1/M) ∑_{j=1}^M f_j − E[f] ≤ a · max( (4/M) log(1/δ), √((32/M) log(1/δ)) ).
Proof: From the independence assumption, using the Markov inequality, we have that

P[ ∑_{j=1}^M f_j ≥ M E[f] + M r ] ≤ exp(−Mλ E[f] − Mrλ) ∏_{j=1}^M E[exp(λ f_j)].

We use the elementary inequality log(1 − x) ≤ −2x/(2 − x) and Corollary 11 to obtain

P[ (1/M) ∑_{j=1}^M f_j ≥ E[f] + r ] ≤ exp(M( 2a²λ² (a²λ² + aλ + 2)/((1 − aλ)(2 − aλ)) − λr )).

For² any |λ| ≤ 13/(25a), we have that 2(a²λ² + aλ + 2)/((1 − aλ)(2 − aλ)) ≤ 8. Hence, for any |λ| ≤ 13/(25a), we have that

P[ (1/M) ∑_{j=1}^M f_j ≥ E[f] + r ] ≤ exp(M(8a²λ² − λr)).

Optimizing over λ subject to |λ| ≤ 13/(25a) we obtain

exp(M(8a²λ² − λr)) ≤ exp(−M min( r/(4a), r²/(32a²) )).

Equating the right side of the last inequality to δ and solving for r, we have the stated bound.
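Spelling out this last step: setting the right side equal to δ gives min( r/(4a), r²/(32a²) ) = (1/M) log(1/δ), and inverting each branch of the min yields

r = a · max( (4/M) log(1/δ), √((32/M) log(1/δ)) ),

where the first argument of the max corresponds to the regime r ≥ 8a (linear branch active) and the second to r ≤ 8a.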
E. Application to MAP perturbations
The derived bounds on the moment generating function imply the concentration of measure of our high-dimensional inference algorithms, which we use both for sampling (in Section III-A) and to estimate prediction uncertainties or entropies (in Section III-C). The relevant quantities for our inference algorithms are the expectation of the randomized max-solver value F(γ) = max_x {θ(x) + ∑_α γ_α(x_α)}, as well as the expectation of the maximizing perturbations F̃(γ) = ∑_α γ_α(x^γ_α), where x^γ = argmax_x {θ(x) + ∑_α γ_α(x_α)}, as in Theorem 4.

To apply our measure concentration results to perturb-max inference we calculate the parameters in the bounds given by Corollary 9 and Corollary 11. The random functions F(γ) and F̃(γ) above are functions of m ≜ ∑_{α∈A} |X_α| i.i.d. Gumbel random variables. The (sub)gradient of these functions is structured and "points" toward the entries γ_α(x_α) that correspond to the maximizing assignment x^γ. More precisely,

∂F(γ)/∂γ_α(x_α) = 1 if x_α ∈ x^γ, and 0 otherwise.

Thus, the gradient satisfies ‖∇F‖² = |A| almost everywhere, so a² = |A| (and the same holds for F̃). Suppose we sample M i.i.d. random variables γ¹, γ², . . . , γ^M with the same distribution as γ and denote their respective random values by F_j ≜ F(γ^j). We estimate their deviation from the expectation by (1/M) ∑_{j=1}^M F_j − E[F].

² The constants are chosen so that the junction of the two regimes of the bound approximately lies on the Poincaré curve in Figure 2(c).
Applying Corollary 9 to both F and −F we get the following double-sided bound, which holds with probability at least 1 − δ:

| (1/M) ∑_{j=1}^M F_j − E[F] | ≤ 2√|A| (1 + √((1/(2M)) log(2/δ)))².

Applying Corollary 11 to both F and −F we get the following double-sided bound, which holds with probability at least 1 − δ:

| (1/M) ∑_{j=1}^M F_j − E[F] | ≤ √|A| · max( (4/M) log(2/δ), √((32/M) log(2/δ)) ).
Clearly, the concentration of perturb-max inference is determined by the tighter of these two bounds.
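For concreteness, the following Python sketch (ours; the 2 × 2 spin glass, the exhaustive MAP solver, and all constants are illustrative assumptions rather than the paper's experimental setup) instantiates both deviation bounds for the perturbed max-value F(γ) with single-variable perturbations γ_i(x_i):

# Illustrative sketch of the bounds in Section IV-E on a tiny model: a 2x2 binary spin glass
# (a 4-cycle) with single-variable Gumbel perturbations; MAP by exhaustive enumeration.
import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 4                                   # variables x_1..x_4, each in {-1,+1}
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
theta_i = rng.uniform(-1, 1, size=n)
theta_ij = rng.uniform(0, 1, size=len(edges))

def potential(x):
    return float(theta_i @ x + sum(w * x[i] * x[j] for w, (i, j) in zip(theta_ij, edges)))

configs = np.array(list(itertools.product([-1, 1], repeat=n)))

def F(gamma):                            # gamma has shape (n, 2): gamma_i(x_i) for x_i in {-1,+1}
    vals = [potential(x) + sum(gamma[i, (x[i] + 1) // 2] for i in range(n)) for x in configs]
    return max(vals)

A, delta, M = n, 0.05, 10                # |A| = n for single-variable perturbations, so a = sqrt(n)
E_F = np.mean([F(rng.gumbel(size=(n, 2))) for _ in range(5000)])    # reference expectation
sample_mean = np.mean([F(rng.gumbel(size=(n, 2))) for _ in range(M)])

poincare = 2 * np.sqrt(A) * (1 + np.sqrt(np.log(2 / delta) / (2 * M))) ** 2
sobolev = np.sqrt(A) * max(4 / M * np.log(2 / delta), np.sqrt(32 / M * np.log(2 / delta)))
print("deviation =", abs(sample_mean - E_F), " Poincare bound =", poincare, " Sobolev bound =", sobolev)

On such a toy model the observed deviation is typically far below either bound; the sketch is only meant to show how a = √|A| enters the two expressions.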
V. EMPIRICAL EVALUATION
Statistical inference of high dimensional structures is closely related to estimating the partition function. Our proposed inference algorithms, both for sampling and for inferring the entropy of high-dimensional structures, are derived from an alternative interpretation of the partition function as the expected value of the perturb-max value. We begin our empirical validation by computing the upper and lower bounds for the partition function computed as the expected value of a max-function. We then show empirically that the perturb-max algorithm for sampling from the Gibbs distribution has sub-exponential computational complexity in practice. Subsequently, we evaluate the properties of the perturb-max entropy bounds. Finally, we explore the deviation of the sample mean of the perturb-max value from its expectation, as characterized by the measure concentration inequalities derived above.
We evaluate our approach on spin glass models, where each variable x_i represents a spin, namely x_i ∈ {−1, 1}. Each spin has a local field parameter θ_i which corresponds to the local potential function θ_i(x_i) = θ_i x_i. The parameter θ_i represents the
(e) θ_{i,j} = 0   (f) θ_{i,j} ∈ [0, 1)   (g) θ_{i,j} ∈ [0, 2)   (h) θ_{i,j} ∈ [0, 3)

Fig. 3. The probability (top row) and energy (bottom row) landscapes for all 512 configurations of a 3 × 3 spin glass system with strong local field, θ_i ∈ [−1, 1]. When θ_{i,j} = 0 the system is independent and one can observe the block pattern. As the coupling potentials get stronger the landscape gets more ragged. By zooming in one can see ragged landscapes throughout the space, even for configurations of negligible probability, which affects many local approaches. The random MAP perturbation directly targets the maximal configurations and thus performs well in these settings.
data signal, which in the spin model is the preference of a spin to be positive or negative. Adjacent spins interact with couplings θ_{i,j}(x_i, x_j) = θ_{i,j} x_i x_j. Whenever the coupling parameters are positive the model is called attractive because adjacent variables give higher values to positively correlated configurations. The potential function of a spin glass model is then

θ(x_1, x_2, . . . , x_n) = ∑_{i∈V} θ_i x_i + ∑_{(i,j)∈E} θ_{i,j} x_i x_j.   (77)

In our experiments we consider the adjacencies of a grid-shaped model.
First, we compared our bounds to the partition function on 10 × 10 spin glass models. For this comparison we computed the partition function exactly using dynamic programming (the junction tree algorithm). The local field parameters θ_i were drawn uniformly at random from [−f, f], where f ∈ {0.1, 1}, reflecting weak and strong data signals. The parameters θ_{i,j} were drawn uniformly from [0, c] to obtain attractive coupling potentials. Attractive potentials are computationally favorable as their MAP value can be computed efficiently by the graph-cuts algorithm [15].
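As a minimal illustration of this experiment (our own sketch, not the paper's code: it uses a 3 × 3 grid so that both the exact log-partition function and the MAP can be computed by exhaustive enumeration, which stands in for the junction tree and graph-cuts solvers; the values of f, c, and the number of perturbations are illustrative), one can compare log Z with the perturb-max upper bound of (37):

# Exact log-partition function vs. the low-dimensional perturb-max upper bound
#   E_gamma[ max_x { theta(x) + sum_i gamma_i(x_i) } ]
# on a tiny 3x3 attractive spin glass, everything by exhaustive enumeration.
import itertools
import numpy as np

rng = np.random.default_rng(2)
side, f, c = 3, 1.0, 2.0
n = side * side
edges = [(r * side + col, r * side + col + 1) for r in range(side) for col in range(side - 1)] + \
        [(r * side + col, (r + 1) * side + col) for r in range(side - 1) for col in range(side)]
theta_i = rng.uniform(-f, f, size=n)
theta_ij = rng.uniform(0, c, size=len(edges))

configs = np.array(list(itertools.product([-1, 1], repeat=n)))
pots = configs @ theta_i + sum(w * configs[:, i] * configs[:, j]
                               for w, (i, j) in zip(theta_ij, edges))
logZ = np.log(np.sum(np.exp(pots - pots.max()))) + pots.max()     # exact, by enumeration

def perturbed_max(gamma):                  # gamma: (n, 2) table of gamma_i(x_i)
    idx = (configs + 1) // 2               # map {-1,+1} -> {0,1}
    return np.max(pots + gamma[np.arange(n), idx].sum(axis=1))

upper = np.mean([perturbed_max(rng.gumbel(size=(n, 2))) for _ in range(100)])
print("exact logZ =", logZ, "  perturb-max upper bound estimate =", upper)

The estimated expectation should lie above log Z, mirroring the behavior reported for the 10 × 10 models.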
Specifically, we evaluate the upper bound in (37) that holds in expectation, with perturbations γ_i(x_i). The expectation was computed using 100 random MAP perturbations, although very similar results were attained after only 10 perturbations, e.g., Figure 8 and Figure 9. We compared this upper bound with the sum-product form of tree re-weighted belief propagation with a uniform distribution over the spanning trees [58]. We also evaluate our lower bound that holds in probability and requires only a single MAP prediction on an expanded model, as described in Corollary 4. We estimate our probable bound by expanding the model to 1000 × 1000 grids, setting the discrepancy ε in Corollary 4 to zero. We compared this lower bound to the belief propagation algorithm, which provides the tightest lower bound on attractive models [68]–[70]. We computed the signed error (the difference between the bound and log Z), averaged over 100 spin glass models, as shown in Figure 4.
One can see that the probabilistic lower bound is the tightest in the medium and high coupling domain, which is traditionally hard for all methods. Because the bound holds only with high probability, it might generate a (random) estimate which is not a proper lower bound. We can see that on average this does not happen. Similarly, our perturb-max upper bound is better than the tree re-weighted upper bound in the medium and high coupling domain. In the attractive setting, both our bounds use the graph-cuts algorithm and were therefore considerably faster than the belief propagation variants. Finally, the sum-product belief propagation lower bound performs well on average, but from the plots one can observe that its variance is high. This demonstrates the typical behavior of belief propagation: it finds stationary points of the non-convex Bethe free energy and therefore works well on some instances but does not converge or attains bad local minima on others.
We also compared our bounds in the mixed case, where the coupling potentials may be either attractive or repulsive, namely θ_{i,j} ∈ [−c, c]. Recovering the MAP solution in the mixed coupling domain is harder than in the attractive domain. Therefore we could not test our lower bound in the mixed setting, as it relies on expanding the model. We also omit the comparison to the sum-product belief propagation since it is no longer a lower bound in this setting. We evaluate the MAP perturbation value using MPLP [7]. One can verify that qualitatively the perturb-max upper bound is significantly better than the tree re-weighted upper bound. Nevertheless it is significantly slower, as it relies on finding the MAP solution,
Fig. 4. The attractive case. The (signed) difference between the different bounds and the log-partition function. These experiments illustrate our bounds on 10 × 10 spin glass models with weak and strong local field potentials and attractive coupling potentials. The plots below zero are lower bounds and the plots above zero are upper bounds. We compare our upper bound (37) with the tree re-weighted upper bound. We compare our lower bound (Corollary 4) with the belief propagation result, whose stationary points are known to be lower bounds on the log-partition function for attractive spin glass models.
a harder task in the presence of mixed coupling strengths.

Next, we evaluate the computational complexity of our sampling procedure. Section III-A describes an algorithm that generates unbiased samples from the full Gibbs distribution. For spin glass models with strong local field potentials, it is well known that one cannot produce unbiased samples from the Gibbs distribution in polynomial time [11]–[13]. Theorem 3 connects the computational complexity of our unbiased sampling procedure to the gap between the log-partition function and its upper bound in (37). We use our probable lower bound to estimate this gap on large grids, for which we cannot compute the partition function exactly. Figure 6 suggests that in practice, the running time of this sampling procedure is sub-exponential.
Next we estimate our upper bounds for the entropy of perturb-max probability models, described in Section III-C. We compare them to the marginal entropy bound H(p) ≤ ∑_i H(p_i), where p_i(x_i) = ∑_{x∖x_i} p(x) are the marginal probabilities [59]. Unlike the log-partition case, which relates to the entropy of Gibbs distributions, it is impossible to use dynamic programming to compute the entropy of perturb-max models. Therefore we restrict ourselves to a 4 × 4 spin glass model to compare these upper bounds, as shown in Figure 7. One can see that the MAP perturbation upper bound is tighter than the marginalization upper bound at medium and high coupling strengths. We can also compare the marginal entropy bounds and the perturb-max entropy bounds on arbitrary grid sizes without computing the true entropy. Figure 7 shows that the larger the model, the better the perturb-max bound.
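This comparison can be mimicked on a toy model with the following sketch (ours; a 2 × 2 model and simple Monte Carlo estimates replace the 4 × 4 experiments of Figure 7). It estimates the perturb-max distribution by repeated perturb-and-MAP, and compares its empirical entropy with the marginal entropy bound ∑_i H(p_i) and the perturb-max bound E[∑_i γ_i(x^γ_i)]:

# Marginal entropy bound vs. perturb-max entropy bound on a tiny 2x2 spin glass (illustrative).
import itertools
import numpy as np

rng = np.random.default_rng(3)
n, edges = 4, [(0, 1), (1, 2), (2, 3), (3, 0)]
theta_i = rng.uniform(-1, 1, size=n)
theta_ij = rng.uniform(0, 1, size=len(edges))
configs = np.array(list(itertools.product([-1, 1], repeat=n)))
pots = configs @ theta_i + sum(w * configs[:, i] * configs[:, j]
                               for w, (i, j) in zip(theta_ij, edges))

S, counts, pm_bound = 20000, np.zeros(len(configs)), 0.0
idx = (configs + 1) // 2                             # map {-1,+1} -> {0,1}
for _ in range(S):
    gamma = rng.gumbel(size=(n, 2))
    scores = pots + gamma[np.arange(n), idx].sum(axis=1)
    k = int(np.argmax(scores))                       # perturb-max sample x^gamma
    counts[k] += 1
    pm_bound += gamma[np.arange(n), idx[k]].sum()    # sum_i gamma_i(x^gamma_i)
pm_bound /= S

p = counts / S                                       # empirical perturb-max distribution
joint_entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))
marginals = np.array([[p[configs[:, i] == s].sum() for s in (-1, 1)] for i in range(n)])
marginal_bound = -np.sum(marginals * np.log(np.clip(marginals, 1e-12, 1)))
print("entropy =", joint_entropy, " marginal bound =", marginal_bound, " perturb-max bound =", pm_bound)

Both estimated bounds should lie above the empirical entropy (up to Monte Carlo error), which is the behavior compared in Figure 7.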
Both our log-partition bounds and our entropy bounds hold in expectation. Thus, we evaluate their measure concentration properties, i.e., how many samples are required to converge to their expected value. We evaluate our approach on a 100 × 100 spin glass model with n = 10⁴ variables. The local field parameters θ_i were drawn uniformly at random from [−1, 1] to reflect a high signal. To find the perturb-max assignment for such a large model we restrict ourselves to the attractive coupling setting; the parameters θ_{i,j} were drawn uniformly from [0, c], where c ∈ [0, 4], to reflect weak, medium, and strong coupling potentials. Throughout these experiments we evaluate the expected value of our bounds with 100 different samples. We note that both our log-partition and entropy upper bounds have the same gradient with respect to their random perturbations, so their measure concentration properties are the same. In the following we only report the concentration of our entropy bounds; the same concentration occurs for our log-partition bounds.
Figure 8 shows the error in the sample mean (1/M) ∑_{j=1}^M F̃_j, as described in Section IV-E, for three different sample sizes M = 1, 5, 10, where F̃(γ) = ∑_i γ_i(x^γ_i) is our entropy bound. The error reduces rapidly as M increases; only 10 samples are needed to estimate the expectation of a perturb-max random function that consists of 10⁴ random variables γ_i(x_i). To test our measure concentration result, which ensures exponential decay, we measure the deviation of the sample mean from its expectation using M = 1, 5, 10 samples. Figure 9 shows the histogram of the sample mean, i.e., the number of times that the sample mean has error more than r from the true mean. One can see that the decay is indeed exponential for every M, and that for larger M the decay is much faster. This shows that, by understanding the measure concentration properties of MAP perturbations, we can efficiently estimate the mean with high probability, even in very high dimensional spin glass models.
Fig. 5. The (signed) difference between the different bounds and the log-partition function. These experiments illustrate our bounds on 10 × 10 spin glass models with weak and strong local field potentials and mixed coupling potentials. We compare our upper bound (37) with the tree re-weighted upper bound.
Fig. 6. Estimating the complexity of our unbiased sampling procedure on spin glass models of varying sizes, ranging from 10 × 10 to 100 × 100. The running time is the difference between our upper bound in (37) and the log-partition function; the plot shows this log run-time estimate together with a √(total vertices) reference curve. Since the log-partition function cannot be computed for such large-scale models, we replaced it with its lower bound from Corollary 4.
VI. CONCLUSIONS
High dimensional inference is a key challenge for applying machine learning in real-life applications. While the Gibbs distribution is widely used throughout many areas of research, standard sampling algorithms may be too slow in many cases of interest. In the last few years many optimization algorithms have been devised to avoid the computational burden of sampling; instead, researchers predict the most likely (MAP) solution. In this work we proposed low-dimensional perturbations and an approximate sampler for the Gibbs distribution that rely on MAP optimization as their core element. To control the number of calls to the MAP solver we proved new measure concentration bounds for functions of Gumbel random variables. We also used the MAP perturbation framework to
Fig. 7. Estimating our entropy bounds (Section III-C) and comparing them to the true entropy and the marginal entropy bound. Left: comparison on a small-scale 4 × 4 spin glass model as a function of the coupling strength c. Right: comparison of the marginal and perturb-max entropy bounds on large-scale n × n spin glass models as a function of n, with c = 5.
derive bounds on the entropy of these models.

The results here can be extended in a number of different directions. Recent advances in perturbation methods appear in [71]. In contrast to tree re-weighted or entropy covering bounds, the perturb-max bounds do not have a straightforward tightening scheme. Another direction is to consider the perturb-max model beyond the first moment (expectation). It remains open whether the variance or other related statistics of the perturb-max value can be beneficial for learning, e.g., learning the correlations between data measurements. Understanding the effect of approximate MAP solvers could extend the range of applicability [31], [72]–[75]. A natural extension
Fig. 8. Error of the sample mean versus coupling strength for 100 × 100 spin glass models, using M = 1, 5, 10 samples. The local field parameter θ_i is chosen uniformly at random from [−1, 1] to reflect a high signal. With only 10 samples one can estimate the expectation well.
of these methods is to consider high dimensional learning. Perturb-max models already appear implicitly in risk analysis [32] and online learning [35], [76]. Novel approaches that consider perturb-max models explicitly [24], [31] may lead to new learning paradigms for high-dimensional inference.
APPENDIX
Proof details for Corollary 9
The result in Corollary 9 follows by taking C = 4 in the following lemma.
Lemma 3: For any a > 0 and C > 0, and for λ ∈ [0, 2/(a√C)],

∏_{i=0}^∞ (1 − λ²a²C/4^{i+1})^{−2^i} ≤ (2 + λa√C)/(2 − λa√C).   (78)
Proof: To prove this inequality there are three simple steps. First, factor out the first term:

∏_{i=0}^∞ (1 − λ²a²C/4^{i+1})^{−2^i} = (1 − λ²a²C/4)^{−1} ∏_{i=1}^∞ (1 − λ²a²C/4^{i+1})^{−2^i}   (79)

and define

V(λ) = ∏_{i=1}^∞ (1 − λ²a²C/4^{i+1})^{−2^i}.   (80)

Next, use the identity

(1 − λ²a²C/4)^{−1} = 4/((2 + λa√C)(2 − λa√C)).   (81)

If we can show that √V(λ) < (2 + λa√C)/2 then the result will follow.
We claim that √V(λ) is convex. Note that if V(λ) is log-convex, then √V(λ) is also convex, so it is sufficient to show that log V(λ) is convex. Using the Taylor series expansion log(1 − x) = −∑_{j=1}^∞ x^j/j and switching the order of summation,

log V(λ) = −∑_{i=1}^∞ 2^i log(1 − λ²a²C/4^{i+1})   (82)
  = ∑_{i=1}^∞ 2^i ∑_{j=1}^∞ (λ²a²C)^j / (j · 4^{ji+j})   (83)
  = ∑_{j=1}^∞ (λ²a²C)^j/(j 4^j) ∑_{i=1}^∞ 2^{−(2j−1)i}   (84)
  = ∑_{j=1}^∞ (λ²a²C)^j/(j 4^j) (1/(1 − 2^{−(2j−1)}) − 1)   (85)
  = ∑_{j=1}^∞ (λ²a²C)^j/(j 4^j) · 1/(2^{2j−1} − 1).   (86)

Note that the expansion holds only for λ²a²C/4^{i+1} < 1, and this constraint is tightest for i = 1. The expansion is a sum of convex functions of λ and hence convex. This means that for λ < 4/(a√C) the function √V(λ) is convex.
At λ = 2/(a√C), we have

log V(2/(a√C)) = ∑_{j=1}^∞ (1/j) · 1/(2^{2j−1} − 1)   (87)
  ≤ 1 + ∑_{j=2}^∞ 1/(j · 2^{2j−2})   (88)
  = 1 + 4 ∑_{j=2}^∞ (1/4)^j / j   (89)
  = 1 + 4 (−log(1 − 1/4) − 1/4)   (90)
  = 4 log(4/3)   (91)
  < log 4.   (92)

Therefore V(2/(a√C)) < 4.
Since V(0) = 1 and V(2/(a√C)) < 4, by convexity, for λ ∈ [0, 2/(a√C)],

√V(λ) ≤ (1 − λa√C/2) √V(0) + (λa√C/2) √V(2/(a√C))   (93)
  < 1 + λa√C/2   (94)
  = (2 + λa√C)/2.   (95)
Now, considering (79) and the terms in (80) and (81), we
Fig. 9. Histograms showing the decay of random MAP values, i.e., the number of times that the sample mean has error more than r from the true mean. These histograms are evaluated on a 100 × 100 spin glass model with high signal θ_i ∈ [−1, 1] and coupling strengths 1, 2, and 3, using M = 1, 5, 10 samples. One can see that the decay is indeed exponential for every M, and that for larger M the decay is faster.
have

∏_{i=0}^∞ (1 − λ²a²C/4^{i+1})^{−2^i} = (4/((2 + λa√C)(2 − λa√C))) V(λ)   (96)
  < (4/((2 + λa√C)(2 − λa√C))) · ((2 + λa√C)/2)²   (97)
  = (2 + λa√C)/(2 − λa√C),   (98)

as desired.
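A quick numerical check of Lemma 3 (our own sketch; the truncation level of the infinite product and the grid of λ values are arbitrary choices) confirms the product bound:

# Numerical check of Lemma 3: truncate the infinite product at a large i_max and
# compare it with the closed-form bound (2 + lambda a sqrt(C)) / (2 - lambda a sqrt(C)).
import numpy as np

a, C = 1.0, 4.0
for lam in np.linspace(0.05, 0.95 * 2 / (a * np.sqrt(C)), 5):
    i = np.arange(0, 60)
    prod = np.prod((1 - lam**2 * a**2 * C / 4.0**(i + 1)) ** (-2.0**i))
    bound = (2 + lam * a * np.sqrt(C)) / (2 - lam * a * np.sqrt(C))
    print(f"lambda={lam:5.3f}  product={prod:8.4f}  bound={bound:8.4f}")

The truncated product converges quickly (the factors approach 1 geometrically) and stays below the bound for every λ in [0, 2/(a√C)).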
ACKNOWLEDGMENTS
The authors thank the reviewers for their detailed and helpful comments, which helped considerably in clarifying the manuscript, Francis Bach and Tatiana Shpakova for helpful discussions, and Associate Editor Constantine Caramanis for his patience and understanding.
REFERENCES
[1] T. Hazan and T. Jaakkola, "On the partition function and random maximum a-posteriori perturbations," in The 29th International Conference on Machine Learning (ICML 2012), 2012.
[2] T. Hazan, S. Maji, and T. Jaakkola, "On sampling from the Gibbs distribution with random maximum a-posteriori perturbations," in Advances in Neural Information Processing Systems 26, C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, Eds. Curran Associates, Inc., 2013, pp. 1268–1276. [Online]. Available: http://papers.nips.cc/paper/4962-on-sampling-from-the-gibbs-distribution-with-random-maximum-a-posteriori-perturbations
[3] F. Orabona, T. Hazan, A. Sarwate, and T. Jaakkola, "On measure concentration of random maximum a-posteriori perturbations," in Proceedings of The 31st International Conference on Machine Learning, ser. JMLR: Workshop and Conference Proceedings, E. P. Xing and T. Jebara, Eds., vol. 32, 2014, p. 1. [Online]. Available: http://jmlr.csail.mit.edu/proceedings/papers/v32/orabona14.html
[4] S. Maji, T. Hazan, and T. Jaakkola, "Active boundary annotation using random MAP perturbations," in Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics (AISTATS), ser. JMLR: Workshop and Conference Proceedings, S. Kaski and J. Corander, Eds., vol. 33, 2014, pp. 604–613. [Online]. Available: http://jmlr.org/proceedings/papers/v33/maji14.html
[5] P. F. Felzenszwalb and R. Zabih, "Dynamic programming and graph algorithms in computer vision," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 4, pp. 721–740, April 2011. [Online]. Available: http://dx.doi.org/10.1109/TPAMI.2010.135
[6] T. Koo, A. Rush, M. Collins, T. Jaakkola, and D. Sontag, "Dual decomposition for parsing with non-projective head automata," in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP '10), 2010, pp. 1288–1298. [Online]. Available: http://dl.acm.org/citation.cfm?id=1870783
[7] D. Sontag, T. Meltzer, A. Globerson, T. Jaakkola, and Y. Weiss, "Tightening LP relaxations for MAP using message passing," in Proceedings of the Twenty-Fourth Annual Conference on Uncertainty in Artificial Intelligence (UAI-08). Corvallis, Oregon, USA: AUAI Press, 2008, pp. 503–510. [Online]. Available: https://dslpitt.org/papers/08/p503-sontag.pdf
[8] S. Geman and D. Geman, "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-6, no. 6, pp. 721–741, November 1984. [Online]. Available: http://dx.doi.org/10.1109/TPAMI.1984.4767596
[9] W. K. Hastings, "Monte Carlo sampling methods using Markov chains and their applications," Biometrika, vol. 57, no. 1, pp. 97–109, April 1970. [Online]. Available: http://dx.doi.org/10.1093/biomet/57.1.97
[10] R. H. Swendsen and J.-S. Wang, "Nonuniversal critical dynamics in Monte Carlo simulations," Physical Review Letters, vol. 58, no. 2, pp. 86–88, January 1987. [Online]. Available: http://dx.doi.org/10.1103/PhysRevLett.58.86
[11] M. Jerrum and A. Sinclair, "Polynomial-time approximation algorithms for the Ising model," SIAM Journal on Computing, vol. 22, no. 5, pp. 1087–1116, October 1993. [Online]. Available: http://dx.doi.org/10.1137/0222066
[12] L. A. Goldberg and M. Jerrum, "The complexity of ferromagnetic Ising with local fields," Combinatorics, Probability and Computing, vol. 16, no. 1, p. 43, January 2007. [Online]. Available: http://dx.doi.org/10.1017/S096354830600767X
[13] L. A. Goldberg and M. Jerrum, "Approximating the partition function of the ferromagnetic Potts model," Journal of the ACM (JACM), vol. 59, no. 5, p. 25, 2012.
[14] J. M. Eisner, "Three new probabilistic models for dependency parsing: an exploration," in Proceedings of the 16th Conference on Computational Linguistics (COLING '96), vol. 1. Association for Computational Linguistics, 1996, pp. 340–345. [Online]. Available: http://dx.doi.org/10.3115/992628.992688
[15] Y. Boykov, O. Veksler, and R. Zabih, "Fast approximate energy minimization via graph cuts," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 11, pp. 1222–1239, November 2001. [Online]. Available: http://dx.doi.org/10.1109/34.969114
[16] V. Kolmogorov, "Convergent tree-reweighted message passing for energy minimization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 10, pp. 1568–1583, October 2006. [Online]. Available: http://dx.doi.org/10.1109/TPAMI.2006.200
[17] Gurobi Optimization. (2015) Gurobi optimizer documentation. [Online]. Available: http://www.gurobi.com/documentation/
[18] P. Swoboda, B. Savchynskyy, J. Kappes, and C. Schnörr, "Partial optimality via iterative pruning for the Potts model," in Scale Space and Variational Methods in Computer Vision: 4th International Conference, ser. Lecture Notes in Computer Science. Berlin, Germany: Springer, 2013, vol. 7893, ch. 40, pp. 477–488. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-38267-3
[19] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky, "MAP estimation via agreement on trees: Message-passing and linear programming," IEEE Transactions on Information Theory, vol. 51, no. 11, pp. 3697–3717, November 2005. [Online]. Available: http://dx.doi.org/10.1109/TIT.2005.856938
[20] Y. Weiss, C. Yanover, and T. Meltzer, "MAP estimation, linear programming and belief propagation with convex free energies," in Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence (2007). Corvallis, Oregon, USA: AUAI Press, 2007, pp. 416–425. [Online]. Available: https://dslpitt.org/papers/07/p416-weiss.pdf
[21] T. Werner, "High-arity interactions, polyhedral relaxations, and cutting plane algorithm for soft constraint optimisation (MAP-MRF)," in Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1–8. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2008.4587355
[22] J. Peng, T. Hazan, N. Srebro, and J. Xu, "Approximate inference by intersecting semidefinite bound and local polytope," in Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, ser. JMLR: Workshop and Conference Proceedings, N. Lawrence and M. Girolami, Eds., vol. 22, 2012, pp. 868–876. [Online]. Available: http://jmlr.csail.mit.edu/proceedings/papers/v22/peng12.html
[23] G. Papandreou and A. Yuille, "Perturb-and-MAP random fields: Using discrete optimization to learn and sample from energy models," in Proceedings of the 2011 IEEE International Conference on Computer Vision (ICCV), Barcelona, Spain, November 2011, pp. 193–200. [Online]. Available: http://dx.doi.org/10.1109/ICCV.2011.6126242
[24] D. Tarlow, R. P. Adams, and R. S. Zemel, "Randomized optimum models for structured prediction," in Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, ser. JMLR: Workshop and Conference Proceedings, N. Lawrence and M. Girolami, Eds., vol. 22, 2012, pp. 1221–1229. [Online]. Available: http://jmlr.org/proceedings/papers/v22/tarlow12b.html
[25] S. Ermon, C. Gomes, A. Sabharwal, and B. Selman, "Taming the curse of dimensionality: Discrete integration by hashing and optimization," in Proceedings of The 30th International Conference on Machine Learning, ser. JMLR: Workshop and Conference Proceedings, S. Dasgupta and D. McAllester, Eds., vol. 28, no. 2, 2013, pp. 334–342. [Online]. Available: http://jmlr.org/proceedings/papers/v28/ermon13.html
[26] S. Ermon, C. Gomes, A. Sabharwal, and B. Selman, "Optimization with parity constraints: From binary codes to discrete integration," in Proceedings of the Twenty-Ninth Annual Conference on Uncertainty in Artificial Intelligence (UAI-13). Corvallis, Oregon: AUAI Press, 2013, pp. 202–211.
[27] S. Ermon, C. P. Gomes, A. Sabharwal, and B. Selman, "Embed and project: Discrete sampling with universal hashing," in Advances in Neural Information Processing Systems 26 (NIPS 2013), C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, Eds. Curran Associates, Inc., 2013, pp. 2085–2093. [Online]. Available: http://papers.nips.cc/paper/4965-embed-and-project-discrete-sampling-with-universal-hashing.pdf
[28] S. Ermon, C. P. Gomes, A. Sabharwal, and B. Selman, "Low-density parity constraints for hashing-based discrete integration," in Proceedings of The 31st International Conference on Machine Learning, ser. JMLR: Workshop and Conference Proceedings, E. P. Xing and T. Jebara, Eds., vol. 32, no. 1, 2014, pp. 271–279. [Online]. Available: http://jmlr.org/proceedings/papers/v32/ermon14.html
[29] C. Maddison, D. Tarlow, and T. Minka, "A* sampling," in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 2085–2093. [Online]. Available: http://papers.nips.cc/paper/5449-a-sampling
[30] G. Papandreou and A. Yuille, "Perturb-and-MAP random fields: Reducing random sampling to optimization, with applications in computer vision," in Advanced Structured Prediction, S. Nowozin, P. V. Gehler, J. Jancsary, and C. H. Lampert, Eds. Cambridge, MA, USA: MIT Press, 2014, ch. 7, pp. 159–186.
[31] A. Gane, T. Hazan, and T. S. Jaakkola, "Learning with maximum a-posteriori perturbation models," in Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics (AISTATS), ser. JMLR: Workshop and Conference Proceedings, S. Kaski and J. Corander, Eds., vol. 33, 2014, pp. 247–256. [Online]. Available: http://jmlr.org/proceedings/papers/v33/gane14.html
[32] J. Keshet, D. McAllester, and T. Hazan, "PAC-Bayesian approach for minimization of phoneme error rate," in Proceedings of the 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 2224–2227. [Online]. Available: http://dx.doi.org/10.1109/ICASSP.2011.5946923
[33] C. Kim, A. Sabharwal, and S. Ermon, "Exact sampling with integer linear programs and random perturbations," in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16), Phoenix, Arizona, USA, February 12–17, 2016, pp. 3248–3254.
[34] Q. Liu, J. Peng, A. Ihler, and J. Fisher III, "Estimating the partition function by discriminance sampling," in Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2015, pp. 514–522.
[35] A. Kalai and S. Vempala, "Efficient algorithms for online decision problems," Journal of Computer and System Sciences, vol. 71, no. 3, pp. 291–307, October 2005. [Online]. Available: http://dx.doi.org/10.1016/j.jcss.2004.10.016
[36] M. Balog, N. Tripuraneni, Z. Ghahramani, and A. Weller, "Lost relatives of the Gumbel trick," in Proceedings of the 34th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, D. Precup and Y. W. Teh, Eds., vol. 70. International Convention Centre, Sydney, Australia: PMLR, 06–11 Aug 2017, pp. 371–379. [Online]. Available: http://proceedings.mlr.press/v70/balog17a.html
[37] M. Wainwright and M. Jordan, "Graphical models, exponential families, and variational inference," Foundations and Trends in Machine Learning, vol. 1, no. 1–2, pp. 1–305, 2008. [Online]. Available: http://dx.doi.org/10.1561/2200000001
[38] S. Kotz and S. Nadarajah, Extreme Value Distributions: Theory and Applications. London, UK: Imperial College Press, 2000.
[39] H. A. David and H. N. Nagaraja, Order Statistics, 3rd ed. Hoboken, NJ, USA: John Wiley & Sons, 2003.
[40] L. G. Valiant, "The complexity of computing the permanent," Theoretical Computer Science, vol. 8, no. 2, pp. 189–201, 1979. [Online]. Available: http://dx.doi.org/10.1016/0304-3975(79)90044-6
[41] V. Kolmogorov and R. Zabih, "What energy functions can be minimized via graph cuts?" IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 2, pp. 147–159, February 2004. [Online]. Available: http://dx.doi.org/10.1109/TPAMI.2004.1262177
[42] A. Rush, D. Sontag, M. Collins, and T. Jaakkola, "On dual decomposition and linear programming relaxations for natural language processing," in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP '10), 2010, pp. 1–11. [Online]. Available: http://dl.acm.org/citation.cfm?id=1870659
[43] A. G. Schwing and R. Urtasun, "Efficient exact inference for 3D indoor scene understanding," in Computer Vision - ECCV 2012: 12th European Conference on Computer Vision, ser. Lecture Notes in Computer Science. Berlin, Germany: Springer, 2012, vol. 7577, ch. 22, pp. 299–313. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-33783-3
[44] M. Sun, M. Telaprolu, H. Lee, and S. Savarese, "An efficient branch-and-bound algorithm for optimal human pose estimation," in Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, 2012, pp. 1616–1623. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2012.6247854
[45] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, September 2010. [Online]. Available: http://dx.doi.org/10.1109/TPAMI.2009.167
[46] A. Globerson and T. S. Jaakkola, "Fixing max-product: Convergent message passing algorithms for MAP LP-relaxations," in Advances in Neural Information Processing Systems 20, J. Platt, D. Koller, Y. Singer, and S. Roweis, Eds. Curran Associates, Inc., 2007, vol. 21, pp. 553–560. [Online]. Available: http://papers.nips.cc/paper/3200-fixing-max-product-convergent-message-passing-algorithms-for-map-lp-relaxations.pdf
[47] D. Sontag and T. S. Jaakkola, "New outer bounds on the marginal polytope," in Advances in Neural Information Processing Systems 20, J. Platt, D. Koller, Y. Singer, and S. Roweis, Eds. Curran Associates, Inc., 2008, pp. 1393–1400. [Online]. Available: http://papers.nips.cc/paper/3274-new-outer-bounds-on-the-marginal-polytope.pdf
[48] R. A. Fisher and L. H. C. Tippett, "Limiting forms of the frequency distribution of the largest or smallest member of a sample," Mathematical Proceedings of the Cambridge Philosophical Society, vol. 24, no. 02, pp. 180–190, April 1928. [Online]. Available: http://dx.doi.org/10.1017/S0305004100015681
[49] B. Gnedenko, "Sur la distribution limite du terme maximum d'une série aléatoire," Annals of Mathematics, vol. 44, no. 3, pp. 423–453, July 1943. [Online]. Available: http://dx.doi.org/10.2307/1968974
[50] E. J. Gumbel, Statistical Theory of Extreme Values and Some Practical Applications: A Series of Lectures, ser. National Bureau of Standards Applied Mathematics Series. Washington, DC, USA: US Govt. Print. Office, 1954, no. 33.
[51] R. D. Luce, Individual Choice Behavior: A Theoretical Analysis. New York, NY, USA: John Wiley and Sons, 1959.
[52] M. Ben-Akiva and S. R. Lerman, Discrete Choice Analysis: Theory and Application to Travel Demand. Cambridge, MA, USA: MIT Press, 1985, vol. 9.
[53] D. McFadden, "Conditional logit analysis of qualitative choice behavior," in Frontiers in Econometrics, P. Zarembka, Ed. New York, NY, USA: Academic Press, 1974, ch. 4, pp. 105–142.
[54] G. B. Folland, Real Analysis: Modern Techniques and Their Applications, 2nd ed. New York, NY, USA: John Wiley & Sons, 2013.
[55] D. P. Bertsekas, A. Nedic, and A. E. Ozdaglar, Convex Analysis and Optimization. Nashua, NH, USA: Athena Scientific, 2003.
[56] J. M. Tomczak, "On some properties of the low-dimensional Gumbel perturbations in the perturb-and-map model," Statistics & Probability Letters, vol. 115, pp. 8–15, 2016. [Online]. Available: https://doi.org/10.1016/j.spl.2016.03.019
[57] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul, "An introduction to variational methods for graphical models," Machine Learning, vol. 37, no. 2, pp. 183–233, 1999.
[58] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky, "A new class of upper bounds on the log partition function," IEEE Transactions on Information Theory, vol. 51, no. 7, pp. 2313–2335, July 2005. [Online]. Available: http://dx.doi.org/10.1109/TIT.2005.850091
[59] T. M. Cover and J. A. Thomas, Elements of Information Theory. Hoboken, NJ, USA: John Wiley & Sons, 2012.
[60] H. J. Brascamp and E. H. Lieb, "On extensions of the Brunn-Minkowski and Prekopa-Leindler theorems, including inequalities for log concave functions, and with an application to the diffusion equation," Journal of Functional Analysis, vol. 22, no. 4, pp. 366–389, August 1976. [Online]. Available: http://dx.doi.org/10.1016/0022-1236(76)90004-5
[61] S. Aida, T. Masuda, and I. Shigekawa, "Logarithmic Sobolev inequalities and exponential integrability," Journal of Functional Analysis, vol. 126, no. 1, pp. 83–101, November 1994. [Online]. Available: http://dx.doi.org/10.1006/jfan.1994.1142
[62] S. Bobkov and M. Ledoux, "Poincaré's inequalities and Talagrand's concentration phenomenon for the exponential distribution," Probability Theory and Related Fields, vol. 107, no. 3, pp. 383–400, March 1997. [Online]. Available: http://dx.doi.org/10.1007/s004400050090
[63] M. Ledoux, The Concentration of Measure Phenomenon, ser. Mathematical Surveys and Monographs. American Mathematical Society, 2001, vol. 89.
[64] D. Bakry, I. Gentil, and M. Ledoux, Analysis and Geometry of Markov Diffusion Operators, ser. Grundlehren der mathematischen Wissenschaften. Switzerland: Springer International Publishing, 2014, vol. 348. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-00227-9
[65] V. H. Nguyen, "Dimensional variance inequalities of Brascamp-Lieb type and a local approach to dimensional Prekopa's theorem," Journal of Functional Analysis, vol. 266, no. 2, pp. 931–955, January 2014. [Online]. Available: http://dx.doi.org/10.1016/j.jfa.2013.11.003
[66] S. G. Bobkov and M. Ledoux, "Weighted Poincaré-type inequalities for Cauchy and other convex measures," The Annals of Probability, vol. 37, no. 2, pp. 403–427, 2009. [Online]. Available: http://dx.doi.org/10.1214/08-AOP407
[67] M. R. Spiegel, S. Lipschutz, and J. Liu, Mathematical Handbook of Formulas and Tables, 4th ed., ser. Schaum's Outlines. New York, NY, USA: McGraw-Hill Education, 2013.
[68] A. S. Willsky, E. B. Sudderth, and M. J. Wainwright, "Loop series and Bethe variational bounds in attractive graphical models," in Advances in Neural Information Processing Systems 20, J. Platt, D. Koller, Y. Singer, and S. Roweis, Eds. Curran Associates, Inc., 2007, pp. 1425–1432. [Online]. Available: http://papers.nips.cc/paper/3354-loop-series-and-bethe-variational-bounds-in-attractive-graphical-models.pdf
[69] N. Ruozzi, "The Bethe partition function of log-supermodular graphical models," in Advances in Neural Information Processing Systems 25, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 117–125. [Online]. Available: http://papers.nips.cc/paper/4649-the-bethe-partition-function-of-log-supermodular-graphical-models.pdf
[70] A. Weller and T. Jebara, "Clamping variables and approximate inference," in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 909–917. [Online]. Available: http://papers.nips.cc/paper/5529-clamping-variables-and-approximate-inference.pdf
[71] T. Hazan, G. Papandreou, and D. Tarlow, Eds., Perturbations, Optimization, and Statistics. Cambridge, MA, USA: MIT Press, 2017.
[72] T. Hazan, S. Maji, J. Keshet, and T. Jaakkola, "Learning efficient random maximum a-posteriori predictors with non-decomposable loss functions," in Advances in Neural Information Processing Systems 26, C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, Eds. Curran Associates, Inc., 2013, pp. 1887–1895.
[73] C. J. Maddison, A. Mnih, and Y. W. Teh, "The concrete distribution: A continuous relaxation of discrete random variables," in International Conference on Learning Representations (ICLR 2017), 2017. [Online]. Available: https://openreview.net/forum?id=S1jE5L5gl
[74] E. Jang, S. Gu, and B. Poole, "Categorical reparameterization with Gumbel-softmax," in International Conference on Learning Representations (ICLR 2017), 2017. [Online]. Available: https://openreview.net/forum?id=rkE3y85ee
[75] J. Kuck, A. Sabharwal, and S. Ermon, "Approximate inference via weighted Rademacher complexity," in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference (AAAI-18), New Orleans, LA, USA, February 2–7, 2018.
[76] A. Cohen and T. Hazan, "Following the perturbed leader for online structured learning," in International Conference on Machine Learning, 2015, pp. 1034–1042.