
Thesis Proposal

Parallel Learning and Inference in Probabilistic Graphical Models

Joseph Gonzalez

November 20, 2010

School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

Thesis Committee:
Carlos Guestrin, Chair
Guy Blelloch, CMU
David O'Hallaron, CMU
Alex Smola, Yahoo
Jeff Bilmes, UW

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Copyright © 2010 Joseph Gonzalez


Abstract

Probabilistic graphical models are one of the most influential and widely used techniques in machine learning. Powered by exponential gains in processor technology, graphical models have been successfully applied to a wide range of increasingly large and complex real-world problems. However, recent developments in computer architecture, large-scale computing, and data-storage have shifted the focus away from sequential performance scaling and towards parallelism and large-scale distributed systems. Therefore, in order for graphical models to continue to benefit from developments in computer architecture and remain a viable option in the clouds and beyond, we must discover and exploit the parallelism of learning and inference in probabilistic graphical models.

In this thesis we explore how to design efficient parallel algorithms for probabilistic graphical models by framing learning and inference as iterative adaptive asynchronous computation. We first present our work on efficient parallel algorithms for loopy belief propagation and Gibbs sampling. We then describe GraphLab, a new parallel abstraction for designing and implementing iterative adaptive asynchronous computation. Finally, we conclude with our proposed work on parallel parameter and structure learning in probabilistic graphical models.


Contents

1 Introduction
  1.1 Probabilistic Graphical Models
  1.2 Graph Structure
  1.3 Inference
  1.4 Parameter Learning
  1.5 Structure Learning

2 Parallel Inference [95% Complete]
  2.1 Completed Work in Parallel Loopy BP
    2.1.1 Background: The Sequential Belief Propagation Algorithm
    2.1.2 Types of Parallelism in Belief Propagation
    2.1.3 The Naive Parallel BP Algorithm
    2.1.4 Identifying the Sequential Sub-problem
    2.1.5 Role of Approximation in Parallel Belief Propagation
    2.1.6 The Splash BP Algorithm
    2.1.7 Adaptive Prioritization
    2.1.8 Distributed Belief Propagation on Clusters
    2.1.9 Weighted Partitioning for Factor Graphs
    2.1.10 Experimental Evaluation
  2.2 Proposed Work in Parallel Loopy BP
    2.2.1 Proposed: Single Splash Asynchronous Parallelism
    2.2.2 Proposed: Edge Level Parallelism
    2.2.3 Optional: Lifted Inference with SplashBP
    2.2.4 Optional: Using SplashBP on Higher Order Approximations
  2.3 Completed Work in Parallel Gibbs Sampling
    2.3.1 The Gibbs Sampler
    2.3.2 The Chromatic Gibbs Sampler
    2.3.3 New Insight Into the Parallel Structure of 2-Colorable Models
    2.3.4 The Splash Gibbs Sampler
    2.3.5 Experimental Results
  2.4 Proposed Work in Parallel Gibbs Sampling
    2.4.1 Proposed: Extensive Empirical Analysis of Splash Gibbs
    2.4.2 Proposed: Variable Splitting
    2.4.3 Proposed: Ergodic Parallel Acceleration of Latent Topic Models
    2.4.4 Optional: Nonparametric Splash Gibbs Sampling
  2.5 Related Work in Parallel Inference

3 GraphLab Parallel Abstraction [75% Complete]
  3.1 Completed Work: The GraphLab Abstraction
    3.1.1 Limitations of Previous Alternatives to GraphLab
    3.1.2 The GraphLab Abstraction
    3.1.3 Implementation and Experiments
  3.2 Proposed Work
    3.2.1 Proposed: Address the Challenge of High Degree Vertices
    3.2.2 Proposed: Asynchronous Splash Scheduler
    3.2.3 Optional: GPU Based GraphLab Implementation

4 Parallel Parameter Learning [30% Complete]
  4.1 Initial Work: Simultaneous Learning and Inference
  4.2 Proposed Work
    4.2.1 Proposed: Theoretically and Experimentally Evaluate the Naive Algorithm
    4.2.2 Proposed: Factorized Asynchronous Parallelization Using CAMEL

5 Parallel Structure Learning [0% Complete]
  5.1 Proposed Work
    5.1.1 Proposed: Parallel Regularization for Structure Search
    5.1.2 Proposed: L1 Regularized Parameters for Structure Search

6 Platforms, Applications, and Performance Metrics
  6.1 Parallel Systems
  6.2 Applications
  6.3 Performance Metrics

7 Time-line

Bibliography


Chapter 1

Introduction

Probabilistic graphical models are one of the most influential and widely used techniques in machine learning. From modeling the structure [Yanover and Weiss, 2002, Yanover et al., 2007] and interactions of proteins [Jaimovich et al., 2006], to untangling written language [Blei et al., 2003, Richardson and Domingos, 2006] and understanding our visual world [Li, 1995], probabilistic graphical models form the foundation of probabilistic reasoning and have become the principal tool in complex structured problems. Until recently, exponential gains in sequential processor technology have enabled increasingly sophisticated probabilistic graphical models to be constructed for a wide range of challenging real-world problems, contributing to the explosion of graphical-model-based techniques.

However, recent physical and economic limitations have forced computer architectures away from exponential sequential performance scaling and towards parallel performance scaling [Asanovic et al., 2006]. Meanwhile, improvements in processor virtualization, networking, and data-storage, together with the desire to catalog every fact and human experience, have led to the development of new hardware paradigms like cloud computing and massively parallel data-centers which fundamentally challenge the future of sequential algorithms [Armbrust et al., 2010]. Therefore, in order for machine learning to continue to benefit from the substantial gains in computer architecture and remain a viable option in the clouds and beyond, we must discover and exploit the parallel structure inherent in probabilistic learning and reasoning.

Unfortunately, the algorithms and techniques used to learn and apply probabilistic graphical models were developed and studied in the sequential setting. As a consequence, recent efforts [Chu et al., 2006, Panda et al., 2009, Wolfe et al., 2008, Ye et al., 2009] in parallel machine learning have overlooked these rich models and techniques in favor of simpler unstructured models and approximations which are more amenable to existing large-scale parallel abstractions, tools, and infrastructure. Alternatively, when confronted with computationally challenging graphical models we typically resort to model simplification rather than leveraging ubiquitous and increasingly powerful parallel hardware.

Thesis Statement: By framing learning and inference in graphical models as factorized iterative computation, employing prioritized asynchronous scheduling, and adaptively identifying and solving sequential subproblems, we can construct efficient parallel algorithms for probabilistic reasoning.

Thesis Contributions: We will evaluate the thesis statement by developing the core algorithms, theory, and tools needed to learn and apply probabilistic graphical models to challenging real-world problems in the parallel setting. More precisely, we will:



• design and implement a suite of efficient inference, parameter learning, and structure learning algorithms which span the fundamental operations required to learn and apply graphical models.

• develop the theoretical tools needed to characterize the performance and efficiency of the proposed algorithms as well as the underlying parallel structure of learning and inference in graphical models.

• develop an algorithm design abstraction and software framework which simplifies the design and implementation of future parallel algorithms for graphical models.

• evaluate the proposed techniques on real-world problems using a variety of parallel platforms and characterize the challenges and opportunities presented by the parallel hardware with respect to the structure and parameters of graphical models.

By achieving these goals, we will help build a solid foundation for the future of parallel probabilistic reasoning.

We begin in Sec. 1.1 by briefly reviewing some of the key terminology, operations, and properties of probabilistic graphical models and discuss how they relate to learning and inference in the parallel setting. Then we present our prior work on approximate algorithms for graphical model inference (Sec. 2) and a parallel abstraction for designing and implementing parallel algorithms for graphical models (Sec. 3). In Sec. 4 we discuss the proposed work on parallel parameter learning and briefly present some preliminary results on simultaneous parameter learning and inference. Then in Sec. 5 we present the proposed work on the more ambitious task of structure learning and discuss the key challenges. In Sec. 6 we review the parallel hardware, suite of real-world applications, and some of the key performance metrics we will use to evaluate the thesis statement. Finally, we conclude in Sec. 7 with a proposed completion time-line.

1.1 Probabilistic Graphical Models

Probabilistic graphical models provide a common language for studying, learning, and manipulating large factorized distributions of the form:

\[
P(X_1 = x_1, \ldots, X_n = x_n \mid \theta) = \frac{1}{Z(\theta)} \prod_{A \in \mathcal{F}} f_A(x_A \mid \theta), \tag{1.1}
\]

where each clique A ∈ F is a subset, A ⊆ {1, . . . , n}, of indices and the factors fA are un-normalized positive functions, fA : xA → R+, over subsets of random variables. The factors are parametrized by a set of parameters θ, and the partition function Z(θ) is defined as:

\[
Z(\theta) = \int_{x} \prod_{A \in \mathcal{F}} f_A(x_A \mid \theta) \, dx. \tag{1.2}
\]

We will primarily focus on distributions over discrete random variables Xi ∈ 𝒳i = {1, . . . , |𝒳i|}; however, most of the proposed methods may also be applied to continuous distributions.

For example, we can construct a factorized distribution to model the noisy pixels in a digital image (see Fig. 1.1). For each pixel sensor reading we define an observed random variable yi and a corresponding unobserved random variable Xi which represents the true pixel value. We can then introduce a set of



factors corresponding to the noise process (e.g., Gaussian):

\[
f_i(X_i = x_i \mid Y_i = y_i, \theta) = \exp\left(-\frac{(x_i - y_i)^2}{2\theta_1^2}\right). \tag{1.3}
\]

To encode smoothness in the image we introduce factors which encode similarity between neighboring pixels (e.g., Laplace smoothing):

\[
f_{i,j}(X_i = x_i, X_j = x_j \mid \theta) = \exp\left(-\theta_2 \, |x_i - x_j|\right), \tag{1.4}
\]

for adjacent Xi and Xj (e.g., cardinal neighbors). Then Eq. (1.1) induces a probability distribution over possible assignments to the latent image. We might then want to compute the expected assignment E[Xi] for each pixel as in Fig. 1.1(f).
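To make the factorized model concrete, the following Python sketch (a minimal illustration, not code from this thesis; the grid size, gray-level count, and the values of θ1 and θ2 are arbitrary assumptions) builds the unary noise factors of Eq. (1.3) and the pairwise smoothing factors of Eq. (1.4) for a small grid of discrete pixel values.

import itertools
import numpy as np

n_rows, n_cols, n_levels = 3, 3, 16   # 3x3 image, 16 gray levels (assumed sizes)
theta1, theta2 = 2.0, 1.0             # noise and smoothing parameters (assumed values)
levels = np.arange(n_levels)

rng = np.random.default_rng(0)
y = rng.integers(0, n_levels, size=(n_rows, n_cols))  # observed noisy pixels

# Unary factors f_i(x_i | y_i): Gaussian noise model of Eq. (1.3).
unary = {(r, c): np.exp(-(levels - y[r, c]) ** 2 / (2 * theta1 ** 2))
         for r in range(n_rows) for c in range(n_cols)}

# Pairwise factors f_ij(x_i, x_j): Laplace smoothing of Eq. (1.4),
# defined over cardinal (horizontal and vertical) neighbors.
pairwise = {}
for r, c in itertools.product(range(n_rows), range(n_cols)):
    for dr, dc in [(0, 1), (1, 0)]:
        r2, c2 = r + dr, c + dc
        if r2 < n_rows and c2 < n_cols:
            pairwise[((r, c), (r2, c2))] = np.exp(
                -theta2 * np.abs(levels[:, None] - levels[None, :]))

# The joint distribution of Eq. (1.1) is proportional to the product of all
# unary and pairwise factor entries for a full assignment to the pixels.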

1.2 Graph Structure

A Markov Random Field (MRF) is an undirected graphical representation of a factorized distribution and a core data-structure in graphical models. The MRF corresponding to the factorized probability distribution in Eq. (1.1) is an undirected graph over the variables where Xi is connected to Xj if there is an A ∈ F such that {i, j} ⊆ A. We denote the set of indices corresponding to the variables neighboring Xi as Ni ⊆ {1, . . . , n} and therefore XNi is the set of neighboring variables. For example, the Markov random field for the image denoising problem is drawn in Fig. 1.1(c), and the indices of the variables neighboring X5 in Fig. 1.1(c) are N5 = {2, 4, 6, 8}.

The edges in the MRF encode both the statistical dependencies between the variables and the computational dependencies of many iterative graphical models algorithms. As a consequence the MRF is a crucial structure in designing parallel iterative algorithms for graphical models. In addition, the MRF also provides a means to leverage the substantial existing work in parallel graph algorithms and job scheduling (see Sec. 2.3 for an example).

When building large-scale probabilistic graphical models, sparsity in the MRF is a desirable property. Densely connected models typically require too many parameters and make exact and even approximate inference computationally intractable. Most auspiciously, sparsity is also a desirable property for parallelism. Many algorithms that operate on graphical models factor with respect to the Markov blanket of each variable. The Markov blanket (see Fig. 1.1(d)) of a variable is the set of neighboring variables in the MRF. Therefore sparse models provide strong locality in computation, which is a very desirable property in parallel algorithms. In fact, the new parallel abstraction we develop in Sec. 3 is built around the locality derived from sparsity in the MRF and the idea that computation factorizes over Markov blankets.

In some settings it may be more convenient to reason about the factor graph corresponding to the factorized distribution. A factor graph is an undirected bipartite graph where the vertices V = X ∪ {fA}A∈F correspond to all the variables X on one side and all the factors {fA}A∈F on the other, and the undirected edges E = {{fA, Xi} : i ∈ A, A ∈ F} connect factors with the variables in their domain. To simplify notation, we will use fi, Xj ∈ V to refer to vertices in the factor graph when we wish to distinguish between factors and variables, and i, j ∈ V otherwise. As with the MRF we define Ni ⊆ V as the neighbors of i ∈ V in the factor graph. In Fig. 1.1(e) we illustrate the factor graph for the image denoising problem.
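The bipartite structure and the Markov blankets it induces are easy to represent directly; the sketch below (an illustrative data structure with hypothetical names, not code from the thesis) stores the factor graph as two adjacency maps and recovers the MRF neighborhood Ni of a variable from the factors that contain it.

from collections import defaultdict

# Factor scopes for part of the 3x3 denoising model: each factor is named by the
# variable indices it touches (unary factors touch one pixel, pairwise factors two).
factor_scopes = {
    "f5": [5],            # unary factor on X5
    "f25": [2, 5],        # pairwise factors on X5's cardinal neighbors
    "f45": [4, 5],
    "f56": [5, 6],
    "f58": [5, 8],
}

# Bipartite adjacency: factor -> variables and variable -> factors.
factor_to_vars = {f: set(scope) for f, scope in factor_scopes.items()}
var_to_factors = defaultdict(set)
for f, scope in factor_scopes.items():
    for i in scope:
        var_to_factors[i].add(f)

def markov_blanket(i):
    """MRF neighbors N_i: variables sharing a factor with X_i, excluding i."""
    return {j for f in var_to_factors[i] for j in factor_to_vars[f]} - {i}

print(markov_blanket(5))  # {2, 4, 6, 8}, matching N_5 in Fig. 1.1(c)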




Figure 1.1: Image denoising problem. (a) The synthetic noiseless original image. (b) The noisy pixel values observed by the camera. (c) The MRF corresponding to the potentials in Eq. (1.3) and Eq. (1.4). The large gray vertices correspond to the latent unobserved pixel values Xi while the dark black circles correspond to the observed pixel values yi. (d) The Markov blanket for variable X5. (e) The denoising factor graph. As with the MRF we have removed the Yi random variables and instead associated each Xi with the unary factor fi. (f) The expected assignment to each of the latent pixels.

1.3 Inference

There are three primary operations within graphical models: inference, parameter learning, and structure learning. Given the structure and parameters of the model, inference is used to compute marginals:

\[
P(X_S \mid \theta) = \sum_{x_{\{1,\ldots,n\} \setminus S}} P\left(X_S, \, x_{\{1,\ldots,n\} \setminus S} \mid \theta\right) \tag{1.5}
\]

Through marginalization we can also compute conditionals, construct posteriors, make predictions, and evaluate the partition function Z(θ). Inference is also usually a sub-routine in both parameter and structure learning and is therefore the most elementary routine in probabilistic graphical models.
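For intuition, the brute-force computation behind Eq. (1.5) can be written directly; the sketch below (purely illustrative, with hypothetical helper names, and exponential in the number of variables, which is exactly why it is only feasible for toy models) enumerates all assignments to evaluate a single-variable marginal.

import itertools
import numpy as np

def brute_force_marginal(factor_scopes, factors, n_vars, n_levels, i):
    """Marginal P(X_i) by exhaustive enumeration of Eq. (1.1) and Eq. (1.5).

    factor_scopes: dict name -> tuple of variable indices in the factor.
    factors: dict name -> function mapping the sub-assignment to a positive value.
    """
    marginal = np.zeros(n_levels)
    for x in itertools.product(range(n_levels), repeat=n_vars):
        weight = 1.0
        for name, scope in factor_scopes.items():
            weight *= factors[name](tuple(x[j] for j in scope))
        marginal[x[i]] += weight
    return marginal / marginal.sum()   # normalizing also evaluates Z(theta)

# Toy chain X0 - X1 - X2 with binary variables and an attractive coupling.
scopes = {"f01": (0, 1), "f12": (1, 2)}
facs = {name: (lambda xa: 2.0 if xa[0] == xa[1] else 1.0) for name in scopes}
print(brute_force_marginal(scopes, facs, n_vars=3, n_levels=2, i=1))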

Unfortunately, exact inference is NP-hard in general [Cooper, 1990] and even computing bounded approximations is known to be NP-hard [Roth, 1993]. Since exact inference is not tractable in most large-scale real-world graphical models we typically resort to approximate inference algorithms with limited guarantees. Fortunately, many of the approximate inference algorithms typically perform well in practice. In addition we can often use the approximations made by approximate inference algorithms to introduce additional parallelism. In some cases (like Sec. 2.1) we can introduce new approximations that are tailored to the objective of increasing parallelism with a bounded impact on the resulting error in inference.

Much of our preliminary work (see Sec. 2) in parallel graphical models has focused on efficient approximate parallel inference methods. Hence, our existing work largely already answers the question of how to do efficient parallel inference. To a limited extent, our approximate parallel inference algorithms also provide a method to build parallel parameter and structure learning algorithms. However, we believe that by directly addressing the parallel learning problems we can potentially produce more efficient parallel learning algorithms.

1.4 Parameter Learning

Parameter learning in graphical models is used to choose the value of the parameters θ, usually by fitting the model to data. The parameters may correspond to individual values in probability tables or weights in functions over sets of variables. Often many factors will depend on a single parameter entry (shared parameter). In the image denoising task there are only two parameters, which are shared across all the factors and act as weights.

Parameter learning is typically accomplished by maximizing the log-likelihood of the data:

\[
\theta_{\text{MLE}} = \arg\max_{\theta} \sum_{i \in D} \log P\left(x^{(i)} \mid \theta\right) \tag{1.6}
\]
\[
= \arg\max_{\theta} \sum_{i \in D} \sum_{A \in \mathcal{F}} \log f_A\left(x_A^{(i)} \mid \theta\right) - |D| \log \sum_{x} \prod_{A \in \mathcal{F}} f_A(x_A \mid \theta) \tag{1.7}
\]

Unfortunately, while the left term (unnormalized log-likelihood) in Eq. (1.7) is easily computable, the right term (log-partition function) in Eq. (1.7) requires inference to evaluate.
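To make the dependence on inference explicit, one can differentiate Eq. (1.7); the identity below is a standard derivation added here for exposition rather than taken from the proposal. Since the log-partition function satisfies

\[
\nabla_\theta \log Z(\theta) \;=\; \mathbb{E}_{P(x \mid \theta)}\!\left[\sum_{A \in \mathcal{F}} \nabla_\theta \log f_A(x_A \mid \theta)\right],
\]

the gradient of the objective in Eq. (1.7) is

\[
\sum_{i \in D} \sum_{A \in \mathcal{F}} \nabla_\theta \log f_A\!\left(x^{(i)}_A \mid \theta\right) \;-\; |D| \, \mathbb{E}_{P(x \mid \theta)}\!\left[\sum_{A \in \mathcal{F}} \nabla_\theta \log f_A(x_A \mid \theta)\right],
\]

and the model expectation in the second term must be computed (or approximated) by inference.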

While it may appear that parallel inference is sufficient for parallel parameter learning, we believe that by examining the parameter learning and inference problem simultaneously and applying the thesis statement we can design more efficient parallel parameter learning algorithms. In Sec. 4 we present some preliminary work on joint parallel learning and inference and propose a new parallel method based on the concave-convex programming methodology [Ganapathi et al., 2008, Yuille, 2002].

In many applications parameter learning is not necessary. In a purely Bayesian formulation, priors are placed on parameters and the actual parameter values are treated as variables that are eliminated through marginalization during inference. In some applications domain knowledge, like the physical energy between interacting molecules, is used to select parameter values. Nonetheless, parameter learning is an important fundamental routine used in a wide range of applications.

1.5 Structure Learning

The final and typically most difficult procedure in graphical models is structure learning. Here the structure of the MRF, or the model factorization encoded by F, is learned from data. Structure learning is probably the most difficult routine because it introduces a typically combinatorial search on top of the already intractable problems of learning and inference. That is, in order to apply structure learning we need a method to impose factors over the maximal cliques in the MRF and then the ability to learn optimal values for the factor parameters θ given the structure. Therefore, structure learning requires parameter learning and consequently inference as nested subroutines.



Usually the first step in structure learning is to define a score function used to evaluate structures in the combinatorial search. Because we desire sparsity and the maximum likelihood structure is typically fully connected, the score function assigns a penalty to densely connected models. For example, in Sec. 5 we define a score function which learns L1 regularized parameters for the given structure and then computes the likelihood of the data under the L1 (sparsity inducing) penalty.

Finally, the optimization of the graph structure often requires a very large combinatorial search. Fortunately, unlike in the setting of directed graphical models (e.g., Bayesian Networks), undirected models are more amenable to greedy structure search since they do not impose a global acyclic constraint. Therefore the combinatorial search is often implemented using a simple greedy local search routine. We will exploit both the locality of the combinatorial search as well as the dependence on parameter learning and inference to construct a parallel structure learning procedure in Sec. 5.



Chapter 2

Parallel Inference [95% Complete]

Because inference is typically a time consuming subroutine in almost all other graphical models algorithms, much of our preliminary work in parallel algorithms for graphical models has been focused on parallel approximate inference. There are two popular types of approximate inference algorithms typically used in the sequential setting. The first is loopy belief propagation (BP), which iteratively computes "messages" between vertices in the MRF and does not have any general guarantees of convergence or correctness, but typically performs well in practice. The second is Gibbs sampling, a form of Markov Chain Monte Carlo (MCMC) simulation, which attempts to draw samples from the factorized distribution and then uses empirical averages over the samples to estimate marginals.

In this section we present our progress in developing efficient parallel versions of loopy belief propagation and Gibbs sampling. In both cases we discuss how applying the thesis statement leads to a more efficient parallel algorithm. After presenting each algorithm we discuss the proposed extensions and additional optional directions for further investigation.

2.1 Completed Work in Parallel Loopy BP

Loopy BP has been described [Mendiburu et al., 2007, Sun et al., 2002] as a trivially parallel algorithm. However, in our work [Gonzalez et al., 2009a,b, 2010] we demonstrated that the naive parallelization of loopy BP can be highly inefficient and that by applying the thesis statement and constructing an asynchronous parallelization which adaptively focuses computation on the challenging sequential subproblems we can construct an efficient parallel algorithm. As a result we developed the Splash BP algorithm, which is provably more efficient and performs substantially better in practice. Here we briefly review some of the key ideas from this work.

2.1.1 Background: The Sequential Belief Propagation Algorithm

Belief propagation (BP), or the Sum-Product algorithm, was originally popularized by Pearl [1988] as an exact inference procedure for tree-structured models and estimates variable and clique marginals by iteratively updating parameters along edges in the factor graph. The edge parameters in BP are typically referred to as "messages" which are computed in both directions along each edge. The message sent from variable Xi to factor fj along the edge (Xi, fj) is given in Eq. (2.1) and the message sent from factor fj to vertex Xi along the edge (fj, Xi) is given in Eq. (2.2).

\[
m_{X_i \to f_j}(x_i) \propto \prod_{f_k \in \mathcal{N}_{X_i} \setminus f_j} m_{f_k \to X_i}(x_i) \tag{2.1}
\]
\[
m_{f_j \to X_i}(x_i) \propto \sum_{\mathbf{x}_j \setminus x_i} f_j(\mathbf{x}_j) \prod_{X_k \in \mathcal{N}_{f_j} \setminus X_i} m_{X_k \to f_j}(x_k) \tag{2.2}
\]

The sum, \(\sum_{\mathbf{x}_j \setminus x_i}\), is computed over all assignments to \(\mathbf{x}_j\) excluding the variable \(x_i\), and the product, \(\prod_{X_k \in \mathcal{N}_{f_j} \setminus X_i}\), is computed over all neighbors of the vertex \(f_j\) excluding vertex \(X_i\).

Messages are computed until the change in message values between consecutive computations is bounded by a small constant β ≥ 0:

\[
\max_{(i,j) \in E} \left\| m_{i \to j}^{(\text{new})} - m_{i \to j}^{(\text{old})} \right\|_1 \leq \beta. \tag{2.3}
\]

Belief propagation is said to converge if at some point Eq. (2.3) is achieved. Unfortunately, in cyclic graphical models there are few convergence guarantees [Ihler et al., 2005, Mooij and Kappen, 2007, Tatikonda and Jordan, 2002] and in some practical cases BP may not converge. Using the final messages we can estimate marginals over individual variables.

Unfortunately, choosing the best schedule (order) in which to update messages is a challenging problem which depends heavily on the graph structure and even the model parameters. For simplicity, a common strategy in BP is to adopt the inherently parallel synchronous schedule, in which all messages are simultaneously updated using messages from the previous iteration. The alternative to the synchronous schedule is the inherently sequential asynchronous schedule, in which messages are updated sequentially using the most recent inbound messages. For example, the popular round-robin asynchronous schedule sequentially updates the messages in a fixed random order.

More recent advances by Elidan et al. [2006] and Ranganathan et al. [2007] have focused on dynamic asynchronous schedules, in which the message update order is determined as the algorithm proceeds. Other work by Wainwright et al. [2001] has focused on tree structured schedules, in which messages are updated along collections of spanning trees. By dynamically adjusting the schedule or by updating along spanning trees, these more recent techniques attempt to directly address the schedule's dependence on the model parameters and structure. By applying the thesis statement we have developed an efficient parallel algorithm which combines these advances in BP scheduling.

2.1.2 Types of Parallelism in Belief Propagation

Belief propagation offers several opportunities for parallelism. At the graph level, multiple messages can be computed in parallel. At the factor level, individual message calculations (sums and products) can be expressed as matrix operations which can be parallelized relatively easily [Bertsekas and Tsitsiklis, 1989]. For typical message sizes where the number of assignments is much less than the number of vertices (|𝒳i| << n), graph level parallelism is asymptotically more beneficial than factor level parallelism. Therefore, we focus on graph level parallelism and leave factor level parallelization to parallel linear algebra libraries. Running time will be measured in terms of the number of message computations, treating individual message updates as atomic unit time operations.



2.1.3 The Naive Parallel BP Algorithm

The Synchronous BP algorithm, in which all messages are updated simultaneously, is inherently parallel. Given the messages from the previous iteration, all messages in the current iteration can be computed simultaneously and independently without inter-processor communication (Alg. 2.1). The Synchronous BP algorithm requires two copies of each message to be maintained at all times. When the messages are updated their new values are stored in m(new) using as input the old values stored in m(old). This form of completely independent computation is often deemed embarrassingly parallel.

Algorithm 2.1: Synchronous BP
Input: Graph G = (V, E) and all messages m_{i→j} for all (i, j) ∈ E
while not converged do
    forall vertices v ∈ V and neighbors j ∈ N_v do in parallel
        Compute message m^{(new)}_{v→j} using {m^{(old)}_{i→v}}_{i ∈ N_v}
    Swap(m^{(new)}, m^{(old)})
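A compact way to see the embarrassing parallelism of Alg. 2.1 is that every new message depends only on old messages; the sketch below (a simplified stand-in on a pairwise MRF rather than a full factor graph, with hypothetical argument structures and not the thesis implementation) updates all messages from the previous iteration's values, so each update in the inner loop could run on a different processor.

import numpy as np

def synchronous_bp(edges, unary, pairwise, n_levels, beta=1e-3, max_iters=100):
    """Synchronous loopy BP on a pairwise MRF (sum-product, normalized messages)."""
    directed = [(i, j) for i, j in edges] + [(j, i) for i, j in edges]
    msgs = {e: np.ones(n_levels) / n_levels for e in directed}
    nbrs = {}
    for i, j in directed:
        nbrs.setdefault(j, []).append(i)

    def psi(i, j):
        # pairwise[(i, j)][xi, xj]; transpose if only the reverse orientation is stored
        return pairwise[(i, j)] if (i, j) in pairwise else pairwise[(j, i)].T

    for _ in range(max_iters):
        new = {}
        for i, j in directed:                 # each update reads only old messages,
            pre = unary[i].copy()             # so this loop is embarrassingly parallel
            for k in nbrs.get(i, []):
                if k != j:
                    pre *= msgs[(k, i)]
            m = psi(i, j).T @ pre             # sum over x_i of psi(x_i, x_j) * pre(x_i)
            new[(i, j)] = m / m.sum()
        delta = max(np.abs(new[e] - msgs[e]).sum() for e in directed)
        msgs = new                            # swap the new and old message sets
        if delta <= beta:                     # convergence criterion of Eq. (2.3)
            break
    return msgs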

2.1.4 Identifying the Sequential Sub-problem

While it is not possible to analyze the running time of Synchronous BP on a general cyclic graphical model, we can analyze the running time in the context of tree graphical models. In Theorem 2.1.1 we provide the running time of Synchronous BP when computing exact marginals without the early-stopping which is often used when running loopy BP.

Theorem 2.1.1 (Exact Synchronous BP Running Time). Given an acyclic factor graph with n vertices, longest path length l, and p ≤ 2(n − 1) processors, parallel synchronous belief propagation will compute exact marginals in time (as measured in number of vertex updates):

\[
\Theta\left(\frac{nl}{p} + l\right).
\]

If we consider the running time given by Theorem 2.1.1 we see that the n/p term corresponds to the parallelization of each synchronous update. The length l of the longest path corresponds to the sequential sub-problem which cannot be eliminated by scaling the number of processors.

If the number of vertices is much greater than the number of processors or the longest path, the Synchronous BP algorithm achieves nearly linear parallel scaling and therefore appears to be an optimal parallel algorithm. However, an optimal parallel algorithm should also be efficient. That is, the total work done by all processors should be asymptotically equivalent to the work done by a single processor running the optimal sequential algorithm.

To illustrate the inefficiency of Synchronous BP, we analyze the running time on a chain graphical model with n vertices. Chain graphical models act as a theoretical benchmark by directly capturing the sequential subproblem structure of message passing algorithms and appear as subproblems in both acyclic and cyclic graphical models.

Figure 2.1: (a) The optimal forward-backward message ordering for exact inference on a chain using a single processor. (b) The optimal message ordering for exact inference on a chain using two processors.

It is well known that the forward-backward schedule (Fig. 2.1(a)) for belief propagation on chain graphical models is optimal. The forward-backward schedule sequentially computes messages from m_{1→2} to m_{n−1→n} in the forward direction and then sequentially computes messages from m_{n→n−1} to m_{2→1} in the backward direction. The running time of the forward-backward schedule is therefore Θ(n), or exactly 2(n − 1) message calculations.

If we run the Synchronous BP algorithm using p = 2(n − 1) processors on a chain graphical model of length n, we obtain a running time of exactly n − 1, which corresponds to a paltry factor-of-two speedup when measured relative to the best sequential algorithm. Furthermore, if we use fewer than n − 1 processors, the highly parallel synchronous algorithm will actually be slower than the best sequential algorithm.
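The forward-backward baseline used in this comparison can be written in a few lines; the sketch below (illustrative only, on a pairwise chain with hypothetical data structures rather than the thesis implementation) performs exactly the 2(n − 1) sequential message computations described above.

import numpy as np

def chain_forward_backward(unary, pairwise):
    """Exact BP on a chain X_1 - ... - X_n via the forward-backward schedule.

    unary: list of n nonnegative vectors; pairwise: list of n-1 matrices where
    pairwise[i][a, b] couples X_i = a with X_{i+1} = b.
    """
    n = len(unary)
    fwd = [None] * n   # fwd[i] holds the message m_{i-1 -> i}
    bwd = [None] * n   # bwd[i] holds the message m_{i+1 -> i}

    # Forward sweep: m_{1->2}, ..., m_{n-1->n}  (n - 1 sequential computations)
    for i in range(1, n):
        pre = unary[i - 1] if fwd[i - 1] is None else unary[i - 1] * fwd[i - 1]
        m = pairwise[i - 1].T @ pre
        fwd[i] = m / m.sum()

    # Backward sweep: m_{n->n-1}, ..., m_{2->1}  (another n - 1 computations)
    for i in range(n - 2, -1, -1):
        pre = unary[i + 1] if bwd[i + 1] is None else unary[i + 1] * bwd[i + 1]
        m = pairwise[i] @ pre
        bwd[i] = m / m.sum()

    def marginal(i):
        # The marginal of X_i is proportional to unary[i] * fwd[i] * bwd[i].
        b = unary[i].copy()
        if fwd[i] is not None: b = b * fwd[i]
        if bwd[i] is not None: b = b * bwd[i]
        return b / b.sum()
    return [marginal(i) for i in range(n)]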

Key Idea: While messages may be computed in any order, information is propagated sequentially. The chain graphical model forms the sequential subproblem that must be explicitly addressed. The Synchronous BP algorithm is highly inefficient and by directly addressing the sequential subproblem we can obtain a more efficient parallel algorithm.

Unfortunately, there are no parallel schedules which achieve greater than a factor of two speedup for exact inference on an arbitrary chain graphical model. However, there are parallel schedules which can achieve substantially better scaling by exploiting a common approximation in loopy BP.

2.1.5 Role of Approximation in Parallel Belief Propagation

In almost all applications of loopy BP, the convergence threshold β in Eq. (2.3) is set to a small value greater than zero and the algorithm is terminated prior to reaching the true fixed-point. Even when β = 0, the fixed floating point precision of discrete processors results in early termination. We will now show how we can exploit this idea to "break" long chains and further divide the sequential subproblems.

We can represent a single iteration of Synchronous BP by a function f_BP which maps all the messages m^(t) on the t-th iteration to all the messages m^(t+1) = f_BP(m^(t)) on the (t + 1)-th iteration. The fixed point is then the set of messages m* = f_BP(m*) that are invariant under f_BP. In addition, we define a max-norm for the message space

\[
\left\| m^{(t)} - m^{(t+1)} \right\|_{\infty} = \max_{(i,j) \in E} \left\| m^{(t)}_{i \to j} - m^{(t+1)}_{i \to j} \right\|_1, \tag{2.4}
\]

which matches the norm used in the standard termination condition, Eq. (2.3). We define τε as:

\[
\tau_{\epsilon} = \min_{t} \; t \quad \text{s.t.} \quad \left\| m^{(t)} - m^{*} \right\|_{\infty} \leq \epsilon, \tag{2.5}
\]



the number of synchronous iterations required to be within an ε ball of the BP fixed-point. Therefore a τε-approximation is the approximation obtained from running synchronous belief propagation for τε iterations.

The τε-approximation allows us to reduce the size of the sequential subproblems. Therefore models with weaker variable interactions will have a smaller τε and will permit greater parallelism. We formalize this intuition through the following lower bound on τε-approximations for chain graphical models.

Theorem 2.1.2 (τε-Approximate BP Lower Bound). For an arbitrary chain graph with n vertices and p processors, a lower bound for the running time of a τε-approximation is:

\[
\Omega\left(\frac{n}{p} + \tau_{\epsilon}\right).
\]

This theorem is particularly informative as it separates the fundamentally parallel aspect of inference from the τε-sized sequential component.

2.1.6 The Splash BP Algorithm

We introduced the Splash BP algorithm in [Gonzalez et al., 2009a], which combines the forward-backward sequential subproblem structure in the form of a Splash with the recent advances in dynamic scheduling introduced by Elidan et al. [2006] to achieve a provably optimal parallel BP algorithm. The Splash operation (Alg. 2.2 and Fig. 2.2) generalizes the forward-backward scheduling illustrated in Fig. 2.1(a) by constructing a small tree, and then executing a local forward-backward schedule on the tree.

The inputs to the Splash operation are the root vertex v and the maximum allowed size of the Splash, W, which is a bound on the number of floating point operations associated with executing the Splash. The Splash operation begins by constructing a local spanning tree rooted at v, adding vertices in breadth-first-search order until the maximum Splash size is achieved. Then, using the reverse of the ordering in which vertices were added to the Splash, the messages are computed for each vertex in the spanning tree, generalizing the forward sweep. Finally, messages are computed in the original order starting at the root, generalizing the backward sweep.

By repeatedly executing p parallel Splashes of size W = wn/p (where w is the work of updating a single vertex) placed evenly along the chain we can achieve the optimal parallel BP running time:

Theorem 2.1.3 (Splash Optimality). Given a chain graph with n vertices and p ≤ n processors, executing evenly spaced Splashes in parallel achieves a τε level approximation for all vertices in time

\[
O\left(\frac{n}{p} + \tau_{\epsilon}\right).
\]

We therefore achieve the runtime lower bound for τε-approximation described in Theorem 2.1.2 by using the Splash operation. The remaining challenge is determining how to place Splashes in arbitrary cyclic graphical models.



Algorithm 2.2: Splash(v, W)
Input: vertex v
Input: maximum splash size W
// Construct the breadth first search ordering with W message computations and rooted at v.
fifo ← [] // FiFo Spanning Tree Queue
σ ← (v) // Initial Splash ordering is the root v
AccumW ← wv // Accumulate the root vertex work
visited ← {v} // Set of visited vertices
fifo.Enqueue(Nv)
while fifo is not empty do
    u ← fifo.Dequeue()
    // If adding u does not cause the splash to exceed the limit
    if AccumW + wu ≤ W then
        AccumW ← AccumW + wu
        Add u to the end of the ordering σ
        foreach neighbor v ∈ Nu do
            if v is not in visited then
                fifo.Enqueue(v) // Add to boundary of spanning tree
                visited ← visited ∪ {v} // Mark Visited
// Forward Pass: sends messages from the leaves to the root
foreach u ∈ ReverseOrder(σ) do
    SendMessages(u) // Update priorities if necessary.
// Backward Pass: sends messages from the root to the leaves
foreach u ∈ σ do
    SendMessages(u) // Update priorities if necessary.
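The sketch below (a simplified, single-threaded rendering of Alg. 2.2 with hypothetical send_messages and work callbacks, not the thesis implementation) illustrates how the bounded breadth-first tree is grown and then swept in both directions.

from collections import deque

def splash(root, neighbors, work, send_messages, max_work):
    """One Splash operation: bounded BFS tree plus forward and backward sweeps.

    neighbors[v] -> iterable of v's neighbors; work[v] -> cost of updating v;
    send_messages(v) recomputes all outbound messages of v (assumed callback).
    """
    order = [root]                 # sigma: the Splash ordering, root first
    accum = work[root]             # accumulated work in the Splash
    visited = set([root])
    visited.update(neighbors[root])
    fifo = deque(neighbors[root])

    while fifo:
        u = fifo.popleft()
        if accum + work[u] <= max_work:   # only add u if the work budget allows
            accum += work[u]
            order.append(u)
            for w in neighbors[u]:
                if w not in visited:
                    visited.add(w)
                    fifo.append(w)

    for u in reversed(order):      # forward pass: leaves toward the root
        send_messages(u)
    for u in order:                # backward pass: root toward the leaves
        send_messages(u)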

2.1.7 Adaptive Prioritization

A simple solution to the problem of over-scheduling Splashes on arbitrary cyclic graphical models is to adapt the adaptive scheduling work of Elidan et al. [2006] to schedule Splashes. The key idea behind dynamic scheduling is to focus computation on regions of the graphical model that make the most progress per update and to avoid recomputing messages for vertices that have already converged. For both theoretical and practical reasons discussed in [Gonzalez et al., 2009b] we introduced the belief residual, defined as:

\[
r_v^{(t)} \leftarrow r_v^{(t-1)} + \left\| b_v^{(t)} - b_v^{(t-1)} \right\|_1. \tag{2.6}
\]

The belief residual accumulates the changes in belief between vertex updates. After sending all the messages from a vertex, its residual is set to zero and the residuals of the neighboring vertices are increased by the corresponding change in belief. We also adopt the modified convergence criterion:

\[
\max_{v \in V} r_v \leq \beta. \tag{2.7}
\]

Therefore, if rv ≤ β then we no longer need to compute messages out of v. The belief residual is an approximate measure of the amount of new information that has not yet been sent by vertex v.




Figure 2.2: A splash of size W = 170 is grown starting at vertex F. The Splash spanning tree is represented by the shaded region. (a) The initial factor graph is labeled with the vertex work associated with each vertex. (b) The Splash begins rooted at vertex F. (c) The neighbors of F are added to the Splash and the accumulated work increases to w = 108. (d) The Splash expands further to include vertices B and K but does not include vertex G because doing so would exceed the maximum splash size of W = 170. (e) The Splash expands once more to include vertex C but can no longer expand without exceeding the maximum splash size. The final Splash ordering is σ = [F, E, A, J, B, K, C]. (f) The SendMessages operation is invoked on vertex C causing the messages mC→B, mC→G, and mC→D to be recomputed.

If we select the root of each Splash from the vertices with high residual and then grow each Splash favoring vertices of high residual, we can focus computation where it is needed and avoid wasted computation.

In Fig. 2.3 we illustrate the execution of the Splash algorithm on the image denoising problem. Here we see that the belief residual is able to adapt the shape and location of Splashes to focus on regions of the model that required additional computation. We also see that by excluding "converged" vertices from the Splash we can automatically tune the size of the Splash, simplifying the choice of the initial splash size parameter.

We construct the final parallel Splash BP algorithm (Alg. 2.3) by running multiple Splashes in parallel. All processors draw their root vertices from a shared priority queue and then grow parallel Splashes independently. While the shared priority queue does introduce a point of contention, we have developed approximate priority queues which help reduce contention.

Algorithm 2.3: Parallel Splash Belief Propagation Algorithm
Input: Constants W, β
Set all residuals to ∞
forall processors do in parallel
    while TopResidual(Q) > β do
        v ← Pop(Q) // Atomic
        Splash(Q, v, W) // Updates vertex residuals
        Q.Push((v, Residual(v))) // Atomic
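A thread-level view of Alg. 2.3 might look like the sketch below (illustrative only; the locking granularity, the splash callback, and the residual bookkeeping are simplifying assumptions rather than the actual Splash BP implementation, and priorities of vertices other than the popped root are not refreshed here).

import heapq
import threading

def parallel_splash_bp(vertices, residual, splash, beta, max_work, n_threads=4):
    """Each worker repeatedly pops the highest-residual root and Splashes it.

    residual[v] holds the current belief residual of v; splash(v, max_work)
    performs Alg. 2.2 and is assumed to update residuals as a side effect.
    """
    # heapq is a min-heap, so store negated residuals to pop the largest first.
    queue = [(-residual[v], v) for v in vertices]
    heapq.heapify(queue)
    lock = threading.Lock()

    def worker():
        while True:
            with lock:                       # atomic pop of the top-residual root
                if not queue or -queue[0][0] <= beta:
                    return                   # converged: max residual <= beta
                _, v = heapq.heappop(queue)
            splash(v, max_work)              # grow and sweep one Splash (Alg. 2.2)
            with lock:                       # atomic re-insertion with new residual
                heapq.heappush(queue, (-residual[v], v))

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads: t.start()
    for t in threads: t.join()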




Figure 2.3: (a) The running time of the Splash algorithm using various different Splash sizes with and without pruning. (b) The vertices with high belief residual, shown in black, are included in the Splash while vertices with belief residual below the termination threshold, shown in gray, are excluded. (c) To illustrate the execution of the Splash BP algorithm we ran it on a simple image denoising task and took snapshots of the program state at four representative points (phases) during the execution. The cumulative vertex updates are plotted, with brighter regions being updated more often than darker regions. Initially, large regular Splashes are evenly spread over the entire model and as the algorithm proceeds the Splashes become smaller and more focused.

2.1.8 Distributed Belief Propagation on Clusters

In this section we discuss some of the challenges and opportunities distributed memory architectures present in the context of parallel belief propagation and provide algorithmic solutions to several of the key challenges. While the distributed setting often considers systems with network and processor failure, for this thesis work we assume that all resources remain available throughout execution and that all messages eventually reach their destination.

In Alg. 2.4 we present a generic distributed BP algorithm composed of a partitioning phase (Line 1) after which each processor repeatedly executes a BP update (Line 2) using a local schedule followed by inter-processor communication and a distributed convergence test (Line 3). To balance computation and communication we employ weighted graph partitioning and over-partitioning.

2.1.9 Weighted Partitioning for Factor Graphs

Algorithm 2.4: The Distributed BP Algorithm
1: B ← Partition(G, p) // Partition the graph over processors
   DistributeGraph(G, B)
   forall b ∈ B do in parallel
       repeat
2:         BPUpdate(b) // Perform BP updates according to the local schedule
           RecvExternalMsgs() // Receive and integrate messages
           SendExternalMsgs() // Transmit boundary messages
3:     until Converged() // Distributed convergence test

To factor the state of the program among p processors, we begin by partitioning the factor graph and messages. To maximize throughput and hide network latency, we must minimize inter-processor communication and ensure that the data needed for message computations are locally available. We define a partitioning of the factor graph over p processors as a set B = {B1, . . . , Bp} of disjoint sets of vertices Bk ⊆ V such that ∪_{k=1}^{p} Bk = V. Given a partitioning B we assign all the factor data associated with fi ∈ Bk to the k-th processor. Similarly, for all (both factor and variable) vertices i ∈ Bk we store the associated belief and all inbound messages on processor k. Each vertex update is therefore a local procedure. For instance, if vertex i is updated, the processor owning vertex i can read factors and all incoming messages without communication. To maintain the locality invariant, after new outgoing messages are computed, they are transmitted to the processors owning the destination vertices.

Ultimately, we want to minimize communication and ensure balanced storage and computation; therefore, we can frame the minimum communication load balancing objective in terms of a graph partitioning. We formally define the graph partitioning problem as:

\[
\min_{\mathcal{B}} \; \sum_{B \in \mathcal{B}} \; \sum_{(i \in B, \, j \notin B) \in E} (U_i + U_j) \, w_{ij} \tag{2.8}
\]
\[
\text{subj. to:} \quad \forall B \in \mathcal{B} \quad \sum_{i \in B} U_i w_i \leq \frac{\gamma}{p} \sum_{v \in V} U_v w_v \tag{2.9}
\]

where Ui is the number of times SendMessages is invoked on vertex i, wij is the communication cost of the edge between vertex i and vertex j, wi is the number of FLOPs associated with the vertex update, and γ ≥ 1 is the balance coefficient. We define the communication cost as:

\[
w_{ij} = \min\left(|\mathcal{X}_i|, |\mathcal{X}_j|\right) + C_{\text{comm}}, \tag{2.10}
\]

the size of the message plus some additional constant network packet overhead Ccomm. The size of this overhead may vary among implementations.

The objective in Eq. (2.8) minimizes communication while the constraint in Eq. (2.9) ensures work balance for small γ and is commonly referred to as the k-way balanced cut objective, which is unfortunately NP-hard in general. However, there are several popular graph partitioning libraries, such as METIS [Karypis and Kumar, 1998] and Chaco [Hendrickson and Leland, 1994], which quickly produce reasonable approximations.
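Although solving the cut problem is left to libraries such as METIS, the objective and constraint themselves are easy to evaluate for a candidate partitioning; the sketch below (an illustrative check with hypothetical inputs, not the partitioner used in this thesis) computes the communication cost of Eq. (2.8) and the balance constraint of Eq. (2.9).

def evaluate_partitioning(parts, edges, U, w_vertex, msg_size, c_comm, gamma):
    """Score a candidate partitioning B = parts (a list of disjoint vertex sets).

    U[v]: (estimated) update count; w_vertex[v]: FLOPs per update of v;
    msg_size[v]: |X_v|, the assignment count used in Eq. (2.10).
    """
    owner = {v: k for k, block in enumerate(parts) for v in block}

    # Eq. (2.10): per-edge communication cost.
    def w_edge(i, j):
        return min(msg_size[i], msg_size[j]) + c_comm

    # Eq. (2.8): total cost of edges cut by the partitioning.
    comm = sum((U[i] + U[j]) * w_edge(i, j)
               for i, j in edges if owner[i] != owner[j])

    # Eq. (2.9): every block must hold at most a gamma/p share of the total work.
    total_work = sum(U[v] * w_vertex[v] for v in owner)
    bound = gamma * total_work / len(parts)
    balanced = all(sum(U[v] * w_vertex[v] for v in block) <= bound
                   for block in parts)
    return comm, balanced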

In the case of static schedules, every vertex is updated the same number of times (Ui = Uj for all i, j) and therefore U can be eliminated from both the objective and the constraints. Unfortunately, for dynamic schedules the update counts Ui for each vertex are neither fixed nor known. Furthermore, the update counts are difficult to estimate since they depend on the graph structure, factors, and progress towards convergence [Gonzalez et al., 2009b]. Consequently, for dynamic BP algorithms, we advocate a randomized load balancing technique based on over-partitioning, which does not require knowledge of Ui.




Figure 2.4: Over-partitioning can help improve work balance by more uniformly distributing the graph over the various processors. (a) A two processor uninformed partitioning of the denoising factor graph can lead to one processor (CPU 1) being assigned most of the work. (b) Over-partitioning by a factor of 6 can improve the overall work balance by assigning regions from the top and bottom of the denoising image to both processors.

Work Balance Through Over-Partitioning

If the graph is partitioned assuming constant update counts, there could be work imbalance if BP is executed using a dynamic schedule. For instance, a frequently updated sub-graph could be placed within a single partition as shown in Fig. 2.4(a). To decrease the chance of such an event, we can over-partition the graph, as shown in Fig. 2.4(b), into k × p balanced partitions and then randomly redistribute the partitions to the original p processors.

Choosing the optimal over-partitioning factor k is challenging and depends heavily on hardware, graph structure, and even the factors. In situations where the algorithm may be run repeatedly, standard search techniques may be used. We find that in practice a small factor, e.g., k = 5, is typically sufficient.
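The randomized redistribution step is simple to express; the sketch below (illustrative, assuming the k × p balanced blocks are produced externally by a graph partitioner) shuffles the over-partitioned blocks onto the p processors.

import random

def over_partition_assignment(blocks, p, seed=0):
    """Randomly assign k*p over-partitioned blocks to p processors.

    blocks: list of k*p vertex sets produced by a balanced graph partitioner
    (e.g., a k*p-way cut under Eq. (2.8)-(2.9)); returns processor -> vertex set.
    """
    rng = random.Random(seed)
    order = list(range(len(blocks)))
    rng.shuffle(order)                       # randomize which blocks go where
    assignment = {proc: set() for proc in range(p)}
    for slot, block_id in enumerate(order):  # deal the blocks out round-robin
        assignment[slot % p].update(blocks[block_id])
    return assignment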

2.1.10 Experimental Evaluation

We have built highly tuned implementations of Splash BP and a suite of other parallel loopy BP algorithms which correspond to direct parallelizations of popular loopy BP schedules. We evaluated the Splash BP algorithm along with the other parallel loopy BP schedules on a wide range of applications using both multicore and cluster systems. These applications include: Markov logic networks, protein-protein interaction networks, stereo depth prediction grid models, and protein side-chain interaction networks. In most cases we find that Splash BP outperforms existing techniques with better runtime on a single processor, better parallel scaling, and slightly improved accuracy and convergence.

Multicore SplashBP Results

We present runtime, speedup, work, and efficiency results as a function of the number of cores on protein-protein interaction networks with over 14K binary variables, 20K factors, and 50K edges. The runtime, shown in Fig. 2.5(a), is measured in seconds of elapsed wall clock time before convergence. An ideal runtime curve for p processors is proportional to 1/p.




Figure 2.5: Multicore results for a Protein-Protein Interaction Network: (a) runtime, (b) speedup, (c) work (number of message calculations), and (d) computational efficiency (message calculations per CPU-second).

On all of the models we find that the Splash algorithm achieved a runtime that was strictly less than the standard belief propagation algorithms. We also find that the popular static scheduling algorithms, round-robin and synchronous belief propagation, are consistently slower than the dynamic scheduling algorithms, Residual, Wildfire, and Splash.

The speedup, shown in Fig. 2.5(b), is measured relative to the fastest single processor algorithm. As a consequence of the relative-to-best speedup definition, inefficient algorithms may exhibit a speedup less than one. By measuring the speedup relative to the fastest single processor algorithm we ensure that highly parallel wasted computation will not manifest as optimal scaling. We find that the Splash algorithm scales better and achieves near linear speedup on all of the models. Furthermore, we again see a consistent pattern in which the dynamic scheduling algorithms dramatically outperform the static scheduling algorithms. The inefficiency in the static scheduling algorithms (synchronous and round robin) is so dramatic that the parallel variants seldom achieve more than a factor of 2 speedup using 16 processors.

We measured algorithmic work, shown in Fig. 2.5(c), in terms of the number of message calculations before convergence. The total work, which is a measure of algorithm efficiency, should be as small as possible and should not depend on the number of processors. We find that the Splash algorithm generally does the least work and that the number of processors has minimal impact on the total work done. However, surprisingly, we found that on several of the Cora MLNs the Wildfire algorithm actually does slightly less work than the Splash and Residual algorithms.

Finally, we assessed computational efficiency, shown in Fig. 2.5(d), by computing the number of message calculations per processor-second. The computational efficiency is a measure of the raw throughput of message calculations during inference. Ideally, the efficiency should remain as high as possible. The computational efficiency captures the cost of building spanning trees in the Splash algorithm or frequently updating residuals in the residual algorithm. The computational efficiency also captures concurrency costs incurred at spin-locks and barriers. In all cases we see that the Splash algorithm is considerably more computationally efficient. While it is tempting to conclude that the implementation of the Splash algorithm is more optimized, all algorithms used the same message computation, scheduling, and support code and only differ in the core algorithmic structure. Hence it is surprising that the Splash algorithm is the most efficient, even with the extra overhead associated with generating the spanning tree in each Splash operation. However, by reducing the frequency of queue operations and the resulting lock contention, by increasing cache efficiency through message reuse in the forward and backward passes, and by favoring low-work, high-residual vertices in each spanning tree, the Splash algorithm is able to update more messages per processor-second.


Figure 2.6: Distributed scaling of each algorithm on the Protein-Protein Interaction network. Note that Synchronous BP failed to converge on a single processor. (a) Runtime in seconds as a function of the number of processors. (b) Speedup of each algorithm relative to the fastest single processor algorithm, with the linear line representing ideal linear speedup; the Splash algorithm achieves the maximum speedup. (c) Bytes sent per second and (d) bytes sent per CPU-second; all algorithms share roughly the same communication requirements.

Distributed SplashBP Results

In Fig. 2.6 we compare the different parallel BP algorithms in the distributed setting. In all cases we used the protein-protein interaction network from the multicore setting with a partitioning factor of k = 5. We implemented all the algorithms using MPICH2 on a commodity cluster consisting of 5 nodes, each with 8 cores. We plot the runtime and speedup in Fig. 2.6(a) and Fig. 2.6(b) respectively, with speedup measured relative to the fastest single processor algorithm. In all cases the Splash algorithm demonstrates the best performance.

2.2 Proposed Work in Parallel Loopy BP

While we have a fairly complete story for the parallelization of loopy BP, there are a few additional directions which, given sufficient time, we would like to explore further. Below we briefly describe several possibilities.

2.2.1 Proposed: Single Splash Asynchronous Parallelism

With the completion of the GraphLab framework we now have the ability to easily construct and execute a single Splash asynchronously across multiple processors using a rake-style algorithm. This differs from our current implementation, which runs multiple Splashes in parallel. In addition, our work in Gibbs sampling has revealed a simple algorithmic approach to ensure that parallel Splashes are strictly disjoint, potentially further reducing wasted parallel computation in Splash BP. Asynchronous parallel Splashes could also improve performance in the distributed setting by allowing Splashes to span a larger region of the graph than that owned by an individual machine. Therefore we would like to implement and evaluate a fully asynchronous version of SplashBP.


2.2.2 Proposed: Edge Level Parallelism

The SplashBP algorithm takes a vertex-centric approach, treating vertex updates as the fundamental operations. However, in many models there are often a few variables that have a very high degree. As a consequence, we would like to be able to exploit finer-grained parallelism both when receiving the inbound messages and when recomputing the outbound messages. Fortunately, assembling the new belief given all inbound messages is an associative operation, so a parallel tree reduction can be used to compute the new belief. Similarly, computing the new outbound messages is an embarrassingly parallel operation given the new belief. Therefore we would like to further divide the computation of messages on high-degree vertices. Exploiting this fine-grained parallelism could be extremely beneficial on highly parallel hardware with relatively weak cores (e.g., GPUs).
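
As a rough illustration of this idea (not part of the proposal's implementation), the following sketch combines inbound messages with a pairwise tree reduction and then recomputes each outbound message independently; the log-space message representation and the helper names (multiply_messages, update_high_degree_vertex) are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def multiply_messages(a, b):
    """Combine two log-space messages; associative, so any tree reduction order is valid."""
    return a + b

def tree_reduce(msgs, pool):
    """Pairwise (tree) reduction of inbound messages using a thread pool."""
    while len(msgs) > 1:
        pairs = [(msgs[i], msgs[i + 1]) for i in range(0, len(msgs) - 1, 2)]
        reduced = list(pool.map(lambda p: multiply_messages(*p), pairs))
        if len(msgs) % 2 == 1:          # carry the unpaired message forward
            reduced.append(msgs[-1])
        msgs = reduced
    return msgs[0]

def update_high_degree_vertex(node_potential, inbound, pool):
    """Compute the new belief and all outbound messages for one high-degree vertex."""
    belief = node_potential + tree_reduce(list(inbound.values()), pool)
    # Each outbound message excludes the corresponding inbound message (division in log space).
    outbound = dict(pool.map(lambda kv: (kv[0], belief - kv[1]), inbound.items()))
    return belief, outbound

# Usage: a vertex with 1000 neighbors and 5 states per variable, uniform potentials.
pool = ThreadPoolExecutor(max_workers=8)
inbound = {j: np.zeros(5) for j in range(1000)}
belief, outbound = update_high_degree_vertex(np.zeros(5), inbound, pool)
```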

2.2.3 Optional: Lifted Inference with SplashBP

Very large models often contain variables that have similar connectivity and for which the messages and even beliefs are equivalent. This is especially true in Markov Logic Networks. For these models we would like to perform lifted inference [Sen et al., 2009, Singla and Domingos, 2008], which groups similar variables and eliminates the need to repeatedly compute the same message. While we have a prototype implementation of lifted inference using SplashBP for Markov Logic Networks, we would like to further evaluate the performance of this prototype and generalize it to arbitrary models.

2.2.4 Optional: Using SplashBP on Higher Order Approximations

We would like to evaluate the SplashBP algorithm on generalized cluster graphs and higher order approximations. The current implementation of SplashBP has been evaluated on pairwise Markov random fields and factor graphs, both of which restrict messages to functions over single variables, effectively factorizing the "hyper" messages between factors in the cluster graph. Work by Yedidia et al. [2001, 2003] in the sequential setting has demonstrated improved marginal estimates by running loopy BP on the cluster graph or by moving to the region graph to obtain generalized BP. We would like to evaluate these approaches in the parallel setting using the SplashBP schedule.

2.3 Completed Work in Parallel Gibbs Sampling

Gibbs sampling is a popular MCMC inference method used widely in statistics and machine learning. On many models, however, the Gibbs sampler can be slow mixing [Barbu and Zhu, 2005, Kuss and Rasmussen, 2005]. Consequently, a number of authors [Asuncion et al., 2008, Doshi-Velez et al., 2009, Newman et al., 2007, Yan et al., 2009] have proposed parallel methods to accelerate Gibbs sampling. Unfortunately, most of the recent methods rely on Synchronous Gibbs updates that are not ergodic and therefore generate chains that do not converge to the targeted stationary distribution.

In our work on parallel Gibbs sampling we developed two separate ergodic parallel Gibbs samplers. The first, called the Chromatic sampler, applies a classic technique [Bertsekas and Tsitsiklis, 1989] relating graph coloring to parallel job scheduling to obtain a direct parallelization of the classic sequential scan Gibbs sampler.


We show that the Chromatic sampler is provably ergodic and provide strong guarantees on the parallel reduction in mixing time.

For the relatively common case of models with two-colorable Markov random fields, the Chromatic sampler provides substantial insight into the behavior of the non-ergodic Synchronous Gibbs sampler. We show that in the two-colorable case, the Synchronous Gibbs sampler is equivalent to the simultaneous execution of two independent Chromatic samplers and provide a method to recover the corresponding ergodic chains. As a consequence, we are able to derive the invariant distribution of the Synchronous Gibbs sampler and show that the Synchronous Gibbs updates are ergodic with respect to functions over single variable marginals.

Our second parallel Gibbs sampler, the Splash sampler, was obtained by applying the thesis statement and addresses the challenges of highly correlated variables. The Splash sampler incrementally constructs multiple parallel bounded tree-width blocks, called Splashes, and then jointly samples each Splash using parallel junction-tree inference and backward-sampling. To ensure that multiple simultaneous Splashes are conditionally independent (and hence that the chain is ergodic), we introduce a Markov blanket locking protocol. To accelerate burn-in and ensure high likelihood states are reached quickly, we introduce a vanishing adaptation heuristic for the initial samples of the chain, which explicitly builds blocks of strongly coupled variables.

We now briefly review the classic Gibbs sampling algorithm and then present both of the new algorithms that we have developed. Part of this work is currently under review for AISTATS 2011 and will be presented at the NIPS workshop on Learning in Cores, Clusters, and Clouds as well as the MCMSki III workshop on MCMC methods.

2.3.1 The Gibbs Sampler

The Gibbs sampler was originally introduced by Geman and Geman [1984] to simulate samples from the joint distribution P(X). The Gibbs sampler is constructed by iteratively sampling each variable,

$$X_i \sim P(X_i \mid X_{N_i} = x_{N_i}) \propto \prod_{A \in \mathcal{F}\,:\, i \in A} f_A(X_i, x_{N_i}) \qquad (2.11)$$

given the assignment to the variables in its Markov blanket. Geman and Geman [1984] showed that if each variable is sampled infinitely often and under reasonable assumptions on the conditional distributions (e.g., positive support), the Gibbs sampler is ergodic (i.e., it converges to the true distribution). While we have considerable latitude in the update schedule, we shall see in subsequent sections that certain updates must be treated with care: in particular, Geman and Geman were incorrect in their claim that parallel simultaneous sampling of all variables (the Synchronous update) yields an ergodic chain.
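
To make the update in Eq. (2.11) concrete, the following sketch (purely illustrative, not the thesis implementation) performs one sequential sweep of single-variable Gibbs updates on a discrete pairwise MRF; the table-based factor representation and the variable names are assumptions made for the example.

```python
import numpy as np

def gibbs_sweep(x, node_pot, edge_pot, neighbors, rng):
    """One sequential scan: resample each variable from its conditional
    given the current assignment of its Markov blanket (Eq. 2.11)."""
    for i in range(len(x)):
        # Unnormalized conditional: node potential times incident edge potentials.
        cond = node_pot[i].copy()
        for j in neighbors[i]:
            if i < j:
                cond *= edge_pot[(i, j)][:, x[j]]
            else:
                cond *= edge_pot[(j, i)][x[j], :]
        cond /= cond.sum()
        x[i] = rng.choice(len(cond), p=cond)
    return x

# Usage: a tiny 3-variable chain MRF with 2 states per variable.
rng = np.random.default_rng(0)
node_pot = [np.ones(2), np.ones(2), np.ones(2)]
edge_pot = {(0, 1): np.array([[2.0, 1.0], [1.0, 2.0]]),
            (1, 2): np.array([[2.0, 1.0], [1.0, 2.0]])}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
x = np.array([0, 1, 0])
for _ in range(10):
    x = gibbs_sweep(x, node_pot, edge_pot, neighbors, rng)
```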

The simplest method to construct a parallel sampler is to run a separate chain on each processor. However, running multiple parallel chains requires large amounts of memory and, more importantly, is not guaranteed to accelerate mixing or the production of high-likelihood samples. As a consequence, we focus on single chain parallel acceleration, where we apply parallel methods to increase the speed at which a single Markov chain is advanced. The single chain setting ensures that any parallel speedup directly contributes to an equivalent reduction in the mixing time and the time to obtain a high-likelihood sample.

Unfortunately, recent efforts to build parallel single-chain Gibbs samplers have struggled to retain ergodicity. The resulting methods have relied on approximate sampling algorithms [Asuncion et al., 2008] or proposed generally costly extensions to recover ergodicity [Doshi-Velez et al., 2009, Ferrari et al., 1993, Newman et al., 2007].


Algorithm 2.5: The Synchronous Gibbs Sampler
    forall variables X_j do in parallel
        Execute Gibbs update: X_j^(t+1) ~ P(X_j | x_{N_j}^(t))

Algorithm 2.6: The Chromatic Sampler
    Input: k-colored MRF
    for each of the k colors κ_i : i ∈ {1, ..., k} do
        forall variables X_j ∈ κ_i in the ith color do in parallel
            Execute Gibbs update: X_j^(t+1) ~ P(X_j | x_{N_j ∩ κ_{<i}}^(t+1), x_{N_j ∩ κ_{>i}}^(t))

However, if ergodicity is not an objective, then it makes little sense to apply Gibbs sampling. Therefore we have focused on ergodic Gibbs sampling methods.

2.3.2 The Chromatic Gibbs Sampler

A naive single chain parallel Gibbs sampler is obtained by sampling all variables simultaneously on separate processors. Called the Synchronous Gibbs sampler, this highly parallel algorithm (Alg. 2.5) was originally proposed by Geman and Geman [1984]. Unfortunately, the extreme parallelism of the Synchronous Gibbs sampler comes at a cost. As others [e.g., Newman et al., 2007] have observed, one can easily construct cases where the Synchronous Gibbs sampler is not ergodic and therefore does not converge to the correct stationary distribution.

Fortunately, the parallel computing community has developed methods to directly transform sequential graph algorithms into equivalent parallel graph algorithms using graph colorings. Here we apply these techniques to obtain the ergodic Chromatic parallel Gibbs sampler shown in Alg. 2.6. Let there be a k-coloring of the MRF such that each vertex is assigned one of k colors and adjacent vertices have different colors, and let κ_i denote the variables in color i. The Chromatic sampler simultaneously draws new values for all variables in κ_i before proceeding to κ_{i+1}. The k-coloring of the MRF ensures that all variables within a color are conditionally independent given the variables in the remaining colors and can therefore be sampled independently and in parallel.
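
A minimal sketch of this idea, under the assumption of a discrete MRF with a precomputed greedy coloring, is given below; it is illustrative only, and the helper gibbs_update (a single-variable conditional draw as in Eq. 2.11) is assumed to be supplied by the caller.

```python
from concurrent.futures import ThreadPoolExecutor
from collections import defaultdict

def greedy_coloring(neighbors):
    """Assign each vertex the smallest color not used by an already-colored neighbor."""
    color = {}
    for v in sorted(neighbors):
        used = {color[u] for u in neighbors[v] if u in color}
        color[v] = next(c for c in range(len(neighbors)) if c not in used)
    return color

def chromatic_sweep(x, neighbors, gibbs_update, pool):
    """One sweep of the Chromatic sampler: colors are processed sequentially,
    but all variables of one color are resampled in parallel (Alg. 2.6)."""
    groups = defaultdict(list)
    for v, c in greedy_coloring(neighbors).items():
        groups[c].append(v)
    for c in sorted(groups):
        # Variables in the same color share no edge, so their updates are independent.
        new_vals = pool.map(lambda v: gibbs_update(v, x), groups[c])
        for v, val in zip(groups[c], new_vals):
            x[v] = val
    return x
```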

By combining a classic result [Bertsekas and Tsitsiklis, 1989, Proposition 2.6] from parallel computing with the original Geman and Geman [1984] proof of ergodicity for the sequential Gibbs sampler, one can easily show:

Proposition 2.3.1 (Graph Coloring and Parallel Execution). Given p processors and a k-coloring of an n-variable MRF, the parallel Chromatic sampler is ergodic and generates a new joint sample in running time

$$O\!\left(\frac{n}{p} + k\right).$$

Therefore, given sufficient parallel resources (p ∈ O(n)) and a k-coloring of the MRF, the parallel Chromatic sampler has running time O(k), which for many MRFs is constant in the number of vertices. It is important to note that this parallel gain directly results in a factor of p reduction in the mixing time.


Figure 2.7: (a) Execution of a two-colored model using the Synchronous Gibbs sampler; the dotted lines represent dependencies between samples. (b) Two ergodic chains obtained by executing the Synchronous Gibbs sampler. Note that ergodic sums with respect to marginals are equivalent to those obtained using the Synchronous sampler.


2.3.3 New Insight Into the Parallel Structure of 2-Colorable Models

Many popular models in machine learning have natural two-colorings. For example, Latent Dirichlet Allocation, the Indian Buffet process, the Boltzmann machine, hidden Markov models, and the grid models commonly used in computer vision all have two-colorings. For these models, the Chromatic sampler provides substantial insight into properties of the Synchronous sampler. The following theorem relates the Synchronous Gibbs sampler to the Chromatic sampler in the two-colorable setting and provides a method to recover two ergodic chains from a single Synchronous Gibbs chain:

Theorem 2.3.2 (2-Color Ergodic Synchronous Samples). Let $(X^{(t)})_{t=0}^m$ be the non-ergodic Markov chain constructed by the Synchronous Gibbs sampler (Alg. 2.5). Then using only $(X^{(t)})_{t=0}^m$ we can construct two ergodic chains (see Fig. 2.7):

$$(Y^{(t)})_{t=0}^m = \left[ \left(X^{(0)}_{\kappa_1}, X^{(1)}_{\kappa_2}\right),\; \left(X^{(2)}_{\kappa_1}, X^{(1)}_{\kappa_2}\right),\; \left(X^{(2)}_{\kappa_1}, X^{(3)}_{\kappa_2}\right), \ldots \right]$$
$$(Z^{(t)})_{t=0}^m = \left[ \left(X^{(1)}_{\kappa_1}, X^{(0)}_{\kappa_2}\right),\; \left(X^{(1)}_{\kappa_1}, X^{(2)}_{\kappa_2}\right),\; \left(X^{(3)}_{\kappa_1}, X^{(2)}_{\kappa_2}\right), \ldots \right]$$

which are conditionally independent given $X^{(0)}$ and correspond to the simultaneous execution of two Chromatic samplers (Alg. 2.6).
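
The chain-splitting construction in Theorem 2.3.2 amounts to re-interleaving the two color blocks of consecutive synchronous samples; a small illustrative sketch (with a hypothetical sample layout, not the thesis code) is:

```python
def split_synchronous_chain(samples, kappa1, kappa2):
    """Recover the two ergodic chains of Theorem 2.3.2 from a Synchronous Gibbs chain.

    samples: list of dicts mapping variable -> value, one per synchronous step.
    kappa1, kappa2: the two color classes of the 2-colored MRF.
    """
    Y, Z = [], []
    for t in range(len(samples) - 1):
        if t % 2 == 0:
            # Y pairs an even-step kappa1 block with the following odd-step kappa2 block.
            Y.append({**{v: samples[t][v] for v in kappa1},
                      **{v: samples[t + 1][v] for v in kappa2}})
            Z.append({**{v: samples[t + 1][v] for v in kappa1},
                      **{v: samples[t][v] for v in kappa2}})
        else:
            Y.append({**{v: samples[t + 1][v] for v in kappa1},
                      **{v: samples[t][v] for v in kappa2}})
            Z.append({**{v: samples[t][v] for v in kappa1},
                      **{v: samples[t + 1][v] for v in kappa2}})
    return Y, Z
```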

Using the partitioning induced by the 2-coloring of the MRF we can analytically construct the invariant distribution of the Synchronous Gibbs sampler:

Theorem 2.3.3 (Invariant Distribution of Sync. Gibbs). Let (X_{κ1}, X_{κ2}) = X be the partitioning of the variables over the two colors; then the invariant distribution of the Synchronous Gibbs sampler is the product of the marginals P(X_{κ1}) P(X_{κ2}).

A pleasing consequence of Theorem 2.3.3 is that when computing ergodic averages over sets of variables with the same color, we can directly use the non-ergodic Synchronous samples and still obtain convergent estimators:


Corollary 2.3.4 (Monochromatic Marginal Ergodicity). Given a sequence of samples $(x^{(t)})_{t=0}^m$ drawn from the Synchronous Gibbs sampler on a two-colorable model, empirical expectations computed with respect to single color marginals are ergodic:

$$\forall f,\; i \in \{1, 2\}: \quad \lim_{m \to \infty} \frac{1}{m} \sum_{t=0}^{m} f\big(x^{(t)}_{\kappa_i}\big) \xrightarrow{a.s.} \mathbb{E}_P\big[f(X_{\kappa_i})\big]$$

Corollary 2.3.4 justifies many applications where the Synchronous Gibbs sampler is used to estimate single variable marginals and explains why the Synchronous Gibbs sampler performs well in these settings. However, Corollary 2.3.4 also highlights the danger of computing empirical expectations over variables that span both colors without splitting the chains as shown in Theorem 2.3.2.

2.3.4 The Splash Gibbs Sampler

The Chromatic sampler provides a substantial speedup for single-chain sampling, advancing the Markov chain for a k-colorable model in time O(n/p + k) rather than O(n). Unfortunately, some models have strongly correlated variables and complex dependencies, which can cause the Chromatic sampler to mix prohibitively slowly.

In the single processor setting, a common method to accelerate a slowly mixing Gibbs sampler is to introduce blocking updates [Barbu and Zhu, 2005, Jensen and Kong, 1996]. In a blocked Gibbs sampler, blocks of strongly coupled random variables are sampled jointly conditioned on their combined Markov blanket. The blocked Gibbs sampler improves mixing by enabling strongly coupled variables to update jointly when individual conditional updates would cause the chain to mix too slowly.

To improve mixing in the parallel setting we introduced the Splash sampler (Alg. 2.7), a general purpose blocking sampler. Like the SplashBP algorithm, the Splash Gibbs sampler applies the thesis statement by using asynchronous prioritized blocking which explicitly addresses the sequential subproblem of local exact inference on trees. The Splash sampler exploits parallelism both to construct multiple blocks, called Splashes, in parallel and to accelerate the sampling of each individual Splash. To ensure each Splash can be safely and efficiently sampled in parallel, we developed a novel Splash generation algorithm which incrementally builds multiple conditionally independent bounded treewidth junction trees. In the initial rounds of sampling, the Splash algorithm uses a novel adaptation heuristic which groups strongly dependent variables together based on the state of the chain. Adaptation is disabled after a finite number of rounds to ensure ergodicity.

We present the Splash sampler in three parts. First, we present the parallel algorithm used to construct multiple conditionally independent Splashes. Next, we describe the parallel junction tree sampling procedure used to jointly sample all variables in a Splash. Finally, we present our Splash adaptation heuristic which sets the priorities used during Splash generation.

Parallel Splash Generation

The Splash generation algorithm (Alg. 2.8) uses p processors to incrementally build p disjoint Splashes in parallel. Each processor grows a Splash rooted at a unique vertex in the MRF (Line 1 of Alg. 2.8). To preserve ergodicity we require that no two roots share a common edge in the MRF, and that every variable is a root infinitely often.


Algorithm 2.7: Parallel Splash Sampler
    Input: Maximum treewidth wmax
    Input: Maximum Splash size hmax
    while t ≤ ∞ do
        // Make p bounded-treewidth Splashes
        {J_{S_i}}_{i=1}^p ← ParSplash(wmax, hmax, x^(t))                    (Line 1)
        // Calibrate each junction tree
        {J_{S_i}}_{i=1}^p ← ParCalibrate(x^(t), {J_{S_i}}_{i=1}^p)          (Line 2)
        // Sample each Splash
        {x_{S_i}}_{i=1}^p ← ParSample({J_{S_i}}_{i=1}^p)                    (Line 3)
        // Advance the chain
        x^(t+1) ← (x_{S_1}, ..., x_{S_p}, x^(t)_{¬∪_{i=1}^p S_i})

Figure 2.8: Different Splashes constructed on a 200 × 200 image denoising grid MRF. (a) A noisy sunset image. Eight Splashes of treewidth 5 were constructed using the FIFO (b) and priority (d) orderings. Each Splash is shown in a different shade of gray and the black pixels are not assigned to any Splash. The priorities were obtained using the adaptive heuristic. In (c) and (e) we zoom in on the Splashes to illustrate their structure and the black pixels along the boundary needed to maintain conditional independence.


Each Splash is grown incrementally using a best first search (BeFS) of the MRF. The exact order in which variables are explored is determined by the call to NextVertexToExplore(B) on Line 2 of Alg. 2.8, which selects (and removes) the next vertex from the boundary B. In Fig. 2.8 we plot several simultaneous Splashes constructed using a first-in first-out (FIFO) ordering (Fig. 2.8(b)) and a prioritized ordering (Fig. 2.8(d)).

The Splash boundary is extended until there are no remaining variables that can be safely added or the Splash is sufficiently large. A variable cannot be safely added to a Splash if sampling the resulting Splash is excessively costly (violates a treewidth bound) or if the variable or any of its neighbors are members of other Splashes (violates conditional independence of Splashes). We now explain how both the treewidth bound and conditional independence invariant are maintained in the parallel setting.

To bound the computational complexity of sampling, and later to jointly sample the Splash, we rely on junction trees. A junction tree, or clique graph, is an undirected acyclic graphical representation of the joint distribution over a collection of random variables. For a Splash over the set of variables X_S we construct a junction tree (C, E) = J_S representing the conditional distribution P(X_S | x_{−S}). The vertices C ∈ C are often called cliques and represent a subset of the indices (i.e., C ⊆ S) in the Splash S. The cliques satisfy the constraint that for every factor domain A ∈ F there exists a clique C ∈ C such that A ∩ S ⊆ C. The edges E of the junction tree satisfy the running intersection property (RIP), which ensures that all cliques sharing a common variable form a connected tree.


Algorithm 2.8: ParallelSplash: Parallel Splash Generation
    Input: Maximum treewidth wmax
    Input: Maximum Splash size hmax
    Output: Disjoint Splashes S_1, ..., S_p
    do in parallel on processor i ∈ {1, ..., p}
        r ← NextRoot(i)                          // Unique roots (Line 1)
        S_i ← {r}                                // Add r to Splash
        B ← N_r                                  // Add neighbors to boundary
        V ← {r} ∪ N_r                            // Visited vertices
        J_{S_i} ← JunctionTree(r)
        while (|S_i| < hmax) ∧ (|B| > 0) do
            v ← NextVertexToExplore(B)           // (Line 2)
            MarkovBlanketLock(X_v)
            // Check that v and its neighbors N_v are not in other Splashes
            safe ← |({v} ∪ N_v) ∩ (∪_{j≠i} S_j)| = 0
            J_{S_i + v} ← ExtendJunctionTree(J_{S_i}, v)
            if safe ∧ TreeWidth(J_{S_i + v}) < wmax then
                J_{S_i} ← J_{S_i + v}            // Accept new tree
                S_i ← S_i ∪ {v}
                B ← B ∪ (N_v \ V)                // Extend boundary (Line 3)
                V ← V ∪ N_v                      // Mark visited
            MarkovBlanketFree(X_v)


The computational complexity of inference, and consequently of sampling, in a junction tree is exponential in the treewidth: one less than the number of variables in the largest clique. Therefore, to evaluate the computational cost of adding a new variable X_v to the Splash, we need an efficient method to extend the junction tree J_S over X_S to a junction tree J_{S+v} over X_{S∪v} and evaluate the resulting treewidth. Since the junction tree extension algorithm is used to evaluate each boundary variable during the Splash construction, it must be computationally efficient.

To efficiently build incremental junction trees, we developed a novel junction tree extension algorithm (Alg. 2.9) which emulates standard variable elimination, with variables being eliminated in the reverse of the order they are added to the Splash (e.g., if X_i is added to J_S then X_i is eliminated before all X_S). Because each Splash grows outwards from the root, the resulting elimination ordering is optimal on tree MRFs and typically performs well on cyclic MRFs.

The incremental junction tree extension algorithm (Alg. 2.9) begins by eliminating X_i and forming the new clique C_i = (N_i ∩ S) ∪ {i}, which is added to J_{S+i}. We then attach C_i to the most recently added clique C_{Pa[i]} which contains a variable in C_i (C_{Pa[i]} denotes the parent of C_i). We then restore the RIP by propagating the newly added variables back up the tree. Letting R = C_i \ {i}, we insert R into its parent clique C_{Pa[i]}. The RIP condition is now satisfied for variables in R which were already in C_{Pa[i]}. The parent of C_{Pa[i]} is then recomputed, and any unsatisfied variables are propagated up the tree in the same way. We demonstrate this algorithm with a simple example in Fig. 2.9.


Algorithm 2.9: ExtendJunctionTree Algorithm
    Input: The original junction tree (C, E) = J_S
    Input: The variable X_i to add to J_S
    Output: J_{S+i}
    Define: C_u as the clique created by eliminating u ∈ S
    Define: V[C] ∈ S as the variable eliminated when creating C
    Define: t[v] as the time v ∈ S was added to S
    Define: Pa[v] ∈ N_v ∩ S as the next neighbor of v to be eliminated
    C_i ← (N_i ∩ S) ∪ {i}
    Pa[i] ← arg max_{v ∈ C_i \ {i}} t[v]
    // ----------- Repair RIP -------------
    R ← C_i \ {i}                                // RIP set
    v ← Pa[i]
    while |R| > 0 do
        C_v ← C_v ∪ R                            // Add variables to parent
        w ← arg max_{w ∈ C_v \ {v}} t[w]         // Find new parent
        if w = Pa[v] then
            R ← (R \ C_i) \ {i}
        else
            R ← (R ∪ C_i) \ {i}
            Pa[v] ← w                            // New parent
        v ← Pa[v]                                // Move upwards


To ensure that simultaneously constructed Splashes are conditionally independent, we developed the Markov blanket locking (MBL) protocol, which associates a read/write lock with each variable in the model. The Markov blanket lock for variable X_v is obtained by acquiring the read-locks on all neighboring variables X_{N_v} and the write lock on X_v. To ensure that the MBL protocol is deadlock-free, locks are acquired and released using the canonical ordering of the variables. Once MarkovBlanketLock(X_v) has been acquired, no other processor can assign X_v or any of its neighbors X_{N_v} to a Splash. Therefore, we can safely test whether X_v or any of its neighbors X_{N_v} are currently assigned to other Splashes. Since we only add X_v to the Splash if both X_v and all its neighbors are currently unassigned, there will never be an edge in the MRF that connects two Splashes. Consequently, simultaneously constructed Splashes are conditionally independent given all remaining unassigned variables.
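
A minimal sketch of such a locking protocol, assuming reader/writer locks are approximated with plain mutexes and that variables are identified by integer ids (all names here are hypothetical), might look like:

```python
import threading
from contextlib import contextmanager

class MarkovBlanketLocks:
    """Approximate Markov blanket locking: one mutex per variable, acquired in
    canonical (sorted id) order so that concurrent lockers cannot deadlock.
    A true implementation would use reader/writer locks (read on neighbors,
    write on the center vertex) for more concurrency."""

    def __init__(self, num_vars, neighbors):
        self.locks = [threading.Lock() for _ in range(num_vars)]
        self.neighbors = neighbors

    @contextmanager
    def markov_blanket_lock(self, v):
        scope = sorted({v} | set(self.neighbors[v]))   # canonical ordering
        for u in scope:
            self.locks[u].acquire()
        try:
            yield
        finally:
            for u in reversed(scope):
                self.locks[u].release()

# Usage inside Splash generation (Alg. 2.8), schematically:
# with mbl.markov_blanket_lock(v):
#     safe = not any(u in other_splashes for u in [v, *neighbors[v]])
#     ...
```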

Parallel Splash Sampling

Once we have constructed p conditionally independent Splashes {S_i}_{i=1}^p, we jointly sample each Splash by drawing from P(X_{S_i} | x_{−S_i}) in parallel: we calibrate the junction trees {J_{S_i}}_{i=1}^p and then run backward-sampling starting at the root. We expose additional asynchronous parallelism within the tree calibration (ParCalibrate) and sampling (ParSample) routines by using the classic parallel rake operation from Miller and Reif [1985].


Figure 2.9: Incremental Junction Tree Example. The junction tree on the top comprises the subset of variables {1, 2, 4, 5, 6} of the MRF (center). The tree is formed by the variable elimination ordering 6, 1, 2, 5, 4 (reading the underlined variables of the tree in reverse). To perform an incremental insertion of variable 3, we first create the clique formed by the elimination of 3 ({1, 2, 3, 6}) and insert it into the end of the tree. Its parent is set to the latest occurrence of any of the variables in the new clique. Next the set {1, 2, 6} is inserted into its parent (boldface variables), and its parent is recomputed in the same way.

Insight into SplashBP: The resulting highly parallel ParCalibrate and ParSample routines introduce substantially more parallelism than was originally available in our SplashBP work on parallel belief propagation. By combining ParSplash with the ParCalibrate routine, using BP messages instead of conditionals along the boundary of the tree, we can construct a fully asynchronous SplashBP algorithm which exposes further parallelism within each subproblem.

Adaptive Splash Generation

As discussed earlier, the order in which variables are explored when constructing a Splash is determined on Line 2 of the ParSplash algorithm (Alg. 2.8). To improve the quality of blocking, we proposed a simple adaptive prioritization heuristic, based on the current assignment x^(t), that prioritizes variables at the boundary of the current tree which are strongly coupled with variables already in the Splash. We assign each variable X_v ∈ B a score using the likelihood ratio:

$$s[X_v] = \left\lVert \log \frac{\sum_{x} P\big(X_S,\, X_v = x \,\big|\, X_{-S} = x^{(t)}_{-S}\big)}{P\big(X_S,\, X_v = x^{(t)}_v \,\big|\, X_{-S} = x^{(t)}_{-S}\big)} \right\rVert_1 , \qquad (2.12)$$

and includes the variable with the highest score. Effectively, Eq. (2.12) favors variables with greater average log likelihood than conditional log likelihood. We illustrate the consequence of applying this metric to an image denoising task in which we denoise the synthetic image shown in Fig. 2.8(a). In Fig. 2.10 we show a collection of Splashes constructed using the score function Eq. (2.12). To see how priorities evolve over time, we plot the update frequencies early (Fig. 2.10(a)) and later (Fig. 2.10(b)) in the execution of the Splash scheduler.
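
For a discrete variable, the score in Eq. (2.12) can be approximated by comparing the Splash's conditional distribution with X_v marginalized out against the same distribution with X_v fixed at its current value. The following sketch is illustrative only and assumes a table-based representation of P(X_S, X_v | x_{-S}^(t)); the function name is hypothetical.

```python
import numpy as np

def splash_priority_score(joint_table, x_v_current):
    """Approximate the likelihood-ratio score of Eq. (2.12).

    joint_table: array of shape (#assignments to X_S, #states of X_v) holding the
                 unnormalized conditional P(X_S, X_v | x_{-S}^(t)) as a table.
    x_v_current: the current assignment of X_v.
    """
    marginalized = joint_table.sum(axis=1)          # sum over values of X_v
    conditioned = joint_table[:, x_v_current]       # X_v fixed at its current value
    # 1-norm of the log ratio, taken over assignments to X_S.
    log_ratio = np.log(marginalized) - np.log(conditioned)
    return np.abs(log_ratio).sum()

# Usage: a toy Splash with 4 joint assignments to X_S and a 3-state X_v.
table = np.array([[0.2, 0.5, 0.3],
                  [0.1, 0.1, 0.8],
                  [0.4, 0.4, 0.2],
                  [0.3, 0.3, 0.4]])
print(splash_priority_score(table, x_v_current=1))
```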

To ensure the chain remains ergodic, we disable the prioritized tree growth after a finite number of iterations and replace it with a random choice of variables to add (we call this vanishing adaptation). Indeed, a Gibbs chain cannot be ergodic if the distribution over variables to sample is a function of the current state of the chain. This result may appear surprising, as it contradicts Algorithm 7.1 of Levine and Casella [2006], of which our adaptive algorithm is an example: Levine and Casella claim this to be a valid adaptive Gibbs sampler. We are able to construct a simple counterexample, using two uniform independent binary random variables, thus disproving the claim. By contrast, we have shown that our Splash sampler with vanishing adaptation is ergodic:


Figure 2.10: The update frequencies of each variable in the 200 × 200 image denoising grid MRF for the synthetic noisy image shown in Fig. 2.8(a). The brighter pixels have been prioritized higher and are therefore updated more frequently. (a) The early update counts are relatively uniform, as the adaptive heuristic has not converged on the priorities. (b) The final update counts are focused on the boundaries of the regions in the model, corresponding to pixels that can be most readily changed by Gibbs steps.

Theorem 2.3.5 (Splash Sampler Ergodicity). The adaptive Splash sampler with vanishing adaptation is ergodic and converges to the true distribution P(X).

2.3.5 Experimental Results

We implemented an optimized C++ version of both the Chromatic and Splash samplers using the GraphLab abstraction (see Chapter 3). The GraphLab API provides the graph based locking routines needed to implement the Markov blanket locking protocol. The GraphLab API also substantially simplifies the design and implementation of the Chromatic sampler, which uses the highly-tuned lock-free GraphLab scheduler and built-in graph coloring tools.

Although Alg. 2.7 is presented as a sequence of synchronous parallel steps, our implementation splits these steps over separate processors to maximize performance and eliminate the need for threads to join between phases. We also implemented the parallel junction tree calibration and sampling algorithm using the GraphLab API. However, we found that for the typically small maximum treewidths used in our experiments, the overhead associated with the additional parallelism limited any gains. Nonetheless, when we made the treewidth sufficiently large (e.g., factors with 10^6 entries) we were able to obtain a 13× speedup on 32 cores.

To evaluate the proposed algorithms in both the weakly and strongly correlated settings we selected two representative large-scale models. In the weakly correlated setting we used a 40,000 variable 200 × 200 grid MRF similar to those used in image processing. The latent pixel values were discretized into 5 states. Gibbs sampling was used to compute the expected pixel assignments for the synthetic noisy image shown in Fig. 2.8(a). We used Gaussian node potentials centered around the pixel observations with σ² = 1 and Ising-Potts edge potentials of the form exp(−3δ(x_i ≠ x_j)). To test the algorithms in the strongly correlated setting, we used the CORA-1 Markov Logic Network (MLN) obtained from Domingos [2009]. This large real-world factorized model consists of over 10,000 variables and 28,000 factors, and has much higher connectivity and higher order factors than the pairwise MRF.

In Fig. 2.11 we present the results of running both algorithms on both models using a state-of-the-art 32 core Intel Nehalem (X7560) server with hyper-threading disabled. We plot the un-normalized log-likelihood and the across-chain variance in terms of wall-clock time. For the likelihood and variance analysis we focus on the smaller 8 processor configuration, where performance trends are more visible, and demonstrate extreme parallel scaling separately.

In Fig. 2.11(a) and Fig. 2.11(e) we plot the un-normalized log-likelihood of the last sample as a function of time.


Figure 2.11: Comparison of the Chromatic sampler and the Splash sampler at different settings (i.e., Chromatic(p), Splash(p, wmax, adaptation) for p processors and treewidth wmax) on the synthetic image denoising grid model and the Cora Markov logic network. Adaptation was not used in the CORA-1 MLN. (a,e) The un-normalized log-likelihood plotted as a function of running-time. (b,f) The variance in the estimator of the expected assignment computed across 10 independent chains with random starting points. (c,g) The total number of variables sampled in a 20 second window plotted as a function of the number of cores. (d,h) The speedup in number of samples drawn as a function of the number of processors.

While in both cases the Splash algorithm outperforms the Chromatic sampler, the difference is more visible in the CORA-1 MLN. We found that the adaptation heuristic had little effect on likelihood maximization in the CORA-1 MLN, but did improve performance on the denoising model by focusing the Splashes on the higher variance regions. In Fig. 2.11(b) and Fig. 2.11(f) we plot the variance in the expected variable assignments across 10 independent chains with random starting points. Here we see that for the faster mixing denoising model, the increased sampling rate of the Chromatic sampler leads to a greater reduction in variance, while in the slowly mixing CORA-1 MLN only the Splash procedure is able to reduce the variance.

To illustrate the parallel scaling we plot the number of samples generated in a 20 second window (Fig. 2.11(c) and Fig. 2.11(g)) as well as the speedup in sample generation (Fig. 2.11(d) and Fig. 2.11(h)). The speedup is computed relative to the number of samples generated in 20 seconds using a single processor. The ideal speedup is linear, with a 32× speedup on 32 cores.

We find that the Chromatic sampler typically generates an order of magnitude more samples per second than the more costly Splash sampler. However, if we examine the speedup curves we see that the larger cost associated with the Splash construction and inference contributes to more exploitable coarse grain parallelism. Interestingly, in Fig. 2.11(h) we see that the Splash sampler exceeds the ideal scaling. This is actually a consequence of the high connectivity forcing each of the parallel Splashes to be smaller as the number of processors increases. As a consequence, the cost of computing each Splash is reduced and the sampling rate increases. However, this also reduces some of the benefit from the Splash procedure, as the size of each Splash is slightly smaller.


2.4 Proposed Work in Parallel Gibbs Sampling

As with parallel belief propagation, we have a nearly complete story for parallel Gibbs sampling. While we have fairly extensively explored the algorithmic properties and variations of parallel Gibbs sampling, we have only a limited empirical assessment. The following are a few key areas we need to explore, as well as some potential areas of interest.

2.4.1 Proposed: Extensive Empirical Analysis of Splash Gibbs

We need to assess our parallel Gibbs sampling algorithms on a larger collection of models. Many of the models we have explored have very strong factors and generally are not well suited to Gibbs sampling. While the Gibbs sampler is still able to obtain high likelihood samples, it is unlikely to "fully" mix on these models, which have tightly coupled variables. Therefore we would like to explore some of the larger models used in Bayesian statistics, where the behavior of the sequential Gibbs sampler is better understood.

We also need to better characterize the sensitivity of the Splash Gibbs sampler's performance to the treewidth and Splash size parameters. In particular, we have seen that for treewidth 1 we can construct fully asynchronous Splashes and expose considerable parallelism, but at the expense of potential gains in mixing. As we increase the treewidth, the cost of constructing and evaluating each Splash increases but we potentially gain improved mixing performance. However, this effect has not been well studied on real-world problems.

2.4.2 Proposed: Variable Splitting

High degree variables in the MRF present unique challenges to both the Chromatic and Splash based Gibbs samplers. For example, consider the situation where X_i is connected to all other variables in the model. In the Chromatic sampler, while X_i is sampled no other variable can be sampled simultaneously. If X_i is included in a Splash using the Splash sampler, then the treewidth of the Splash can grow quickly (depending on the remaining connectivity) as new variables are added, substantially reducing the Splash size.

To increase the parallelism in models with high degree variables we would like to develop a model transformation that preserves the original model as a marginal while simultaneously admitting increased parallel inference. One possible option is to split high-degree nodes into pairs of connected variables such that marginalizing out the copies recovers the original model. While this could potentially increase the mixing time, it could also substantially improve the parallel scaling of the proposed Gibbs samplers.

2.4.3 Proposed: Ergodic Parallel Acceleration of Latent Topic Models

There has been substantial recent work in building large hierarchical latent topic models that typically rely on collapsed Gibbs sampling as an inference procedure. However, the parallel Gibbs techniques we have developed do not work well in the collapsed setting, where the resulting models are typically dense. Several authors have proposed parallel methods which are provably incorrect yet seem to perform well in practice.


However, it is also unclear whether Gibbs sampling is appropriate at all, since it is unlikely to mix on these models. Therefore we would like to answer the following two questions. Does collapsed Gibbs sampling significantly improve mixing over un-collapsed Gibbs sampling? Can we provide an ergodic parallel Gibbs sampler for this setting?

2.4.4 Optional: Nonparametric Splash Gibbs Sampling

We would like to explore the use of non-parametric methods for parallel Splash Gibbs sampling. One of the limitations of the Splash Gibbs sampler is that we need to be able to run exact inference (message passing) on the junction tree. Unfortunately, for continuous models this can be difficult. Therefore we would like to use the recent work by Song et al. [2010] to construct a Splash Gibbs sampler for continuous valued variables.

2.5 Related Work in Parallel Inference

While we focus on approximate inference methods, much of the existing work on parallel algorithms for graphical models focuses on exact inference, which is extremely computationally intensive. Most exact inference algorithms operate on the junction tree, which is an acyclic representation of the factorized distribution. One of the earliest parallel exact inference algorithms was introduced by Kozlov and Singh [1994], who implemented the Lauritzen-Spiegelhalter algorithm (see Koller and Friedman [2009]) on a 32 core Stanford DASH processor. Later, Pennock [1998] provided a method to both construct the junction tree and apply Lauritzen-Spiegelhalter with log depth inference using a form of parallel tree contraction [Miller and Reif, 1985]. While the work by Pennock [1998] provides an optimal algorithm, it requires substantial fine grain parallelism. Later, Xia and Prasanna [2008, 2010] and Xia et al. [2009] implemented the algorithm for various platforms and demonstrated limited scaling on models with high treewidth. Unfortunately, the complexity of inference in junction trees is exponential in the treewidth, which grows quickly in loopy models with many variables. Therefore, for most real-world large-scale models the treewidth is typically prohibitively large.


Chapter 3

GraphLab Parallel Abstraction [75% Complete]

Our work on parallel inference provided insight into the general characteristics of efficient parallel algorithms and guided the formulation of the thesis statement. As we began to apply the thesis statement and to develop new algorithms, we discovered a painful void in the space of high-level tools, design abstractions, and software APIs. However, after several years of designing and implementing iterative adaptive asynchronous parallel algorithms for a variety of different parallel platforms, we discovered a set of fundamental design primitives that span the class of computation described in the thesis statement. Guided by this knowledge, we developed the high-level GraphLab parallel abstraction.

3.1 Completed Work: The GraphLab Abstraction

In this section we introduce the GraphLab parallel abstraction [Low et al., 2010] and describe how GraphLab provides the core components needed to apply the thesis statement to parallel learning and inference in graphical models. We begin by briefly reviewing the predominant alternatives to GraphLab and how they fail to support the class of algorithms described by the thesis statement. We then outline the key primitives of the GraphLab abstraction. We conclude with a summary of some experimental analysis of the GraphLab abstraction in the context of learning and inference in graphical models.

3.1.1 Limitations of Previous Alternatives to GraphLab

There are several previous frameworks for designing and implementing parallel machine learning algorithms. Because GraphLab generalizes these ideas and addresses several of their critical limitations, we briefly review these frameworks.

Map-Reduce Abstraction

The MapReduce abstraction was introduced by Dean and Ghemawat [2004] and has been successfully applied to a broad range of ML applications [Chu et al., 2006, Panda et al., 2009, Wolfe et al., 2008, Ye et al., 2009].


A program implemented in the MapReduce framework consists of a Map operation and a Reduce operation. The Map operation is a function which is applied independently and in parallel to each datum (e.g., webpage) in a large data set (e.g., computing the word-count). The Reduce operation is an aggregation function which combines the Map outputs (e.g., computing the total word count). MapReduce performs optimally only when the algorithm is embarrassingly parallel and can be decomposed into a large number of independent computations. The MapReduce framework expresses the class of ML algorithms which fit the Statistical-Query model [Chu et al., 2006] as well as problems where feature extraction dominates the run-time.
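
As a toy illustration of the abstraction (written in plain Python rather than any particular MapReduce implementation), the word-count example mentioned above can be expressed with a map function applied independently to each document and an associative reduce function that aggregates the per-document counts:

```python
from collections import Counter
from functools import reduce
from multiprocessing import Pool

def map_word_count(document):
    """Map: applied independently (and in parallel) to each document."""
    return Counter(document.split())

def reduce_word_count(counts_a, counts_b):
    """Reduce: an associative aggregation of two partial word counts."""
    return counts_a + counts_b

if __name__ == "__main__":
    documents = ["the cat sat", "the dog sat", "the cat ran"]
    with Pool(processes=2) as pool:
        partial_counts = pool.map(map_word_count, documents)
    total = reduce(reduce_word_count, partial_counts, Counter())
    print(total)   # e.g. Counter({'the': 3, 'cat': 2, 'sat': 2, ...})
```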

The MapReduce abstraction fails when there are sparse computational dependencies in the data. For example, MapReduce can be used to extract features from a massive collection of images but cannot represent computation that depends on small overlapping subsets of images. This critical limitation makes it difficult to represent algorithms which operate on structured models like those in the class described by the thesis statement.

Often computational dependencies can lead to scheduling dependencies, as was illustrated in our work on parallel Gibbs sampling (Sec. 2.3). These scheduling dependencies manifest as sequential subproblems that cannot be efficiently (e.g., belief propagation) or correctly (e.g., Gibbs sampling) executed in parallel. Unfortunately, the MapReduce abstraction has no means to encode these sequential data dependencies.

Most algorithms for graphical model learning and inference are iterative. For example, all the algorithms we studied in our work on parallel inference iteratively recompute and refine a set of parameters until some termination condition is achieved. While the MapReduce abstraction can be invoked iteratively, it does not provide a mechanism to directly encode iterative computation. As a consequence, it is not possible to express sophisticated scheduling, automatically assess termination, or even leverage basic data persistence.

The popular implementations of the MapReduce abstraction are targeted at large data-center applications and therefore optimized to address node-failure and disk-centric parallelism. The overhead associated with the fault-tolerant, disk-centric approach is unnecessarily costly when applied to the typical cluster and multi-core settings encountered in machine learning research. Nonetheless, MapReduce is used in small clusters and even multi-core settings [Chu et al., 2006].

DAG Abstraction

In the DAG abstraction, parallel computation is represented as a directed acyclic graph with data flowing along edges between vertices. Vertices correspond to functions which receive information on inbound edges and output results to outbound edges. Implementations of this abstraction include Dryad [Isard et al., 2007] and Pig Latin [Olston et al., 2008]. The DAG abstraction generalizes the MapReduce abstraction to express richer computational dependencies and limited asynchronous scheduling. However, like the MapReduce abstraction, the DAG abstraction does not represent iterative dynamic computation.

Systolic Abstraction

The systolic abstraction [Shapiro, 1988] (and the closely related dataflow abstraction) extends the DAG framework to the iterative setting. Just as in the DAG abstraction, the systolic abstraction forces the computation to be decomposed into small atomic components with limited communication between the components.


The systolic abstraction uses a directed graph G = (V, E) (which is not necessarily acyclic) where each vertex represents a processor and each edge represents a communication link. In a single iteration, each processor reads all incoming messages from the in-edges, performs some computation, and writes messages to the out-edges. A barrier synchronization is performed between each iteration, ensuring all processors compute and communicate in lockstep. Recently, Malewicz et al. [2009] rediscovered the systolic abstraction and developed a high-level parallel API built around systolic computation.

While the systolic framework can efficiently express iterative computation, it is still unable to efficiently express the adaptive asynchronous computation described in the thesis statement. In particular, the barrier between phases delays computation on the slowest node and prohibits asynchrony. Since all nodes are processed in each round, it is difficult to encode adaptive computation in which the majority of the nodes are skipped between phases.

3.1.2 The GraphLab Abstraction

We now present a high-level overview of the GraphLab abstraction; for a more detailed presentation see Low et al. [2010]. The GraphLab abstraction can be divided into 5 principal parts:

1. data graph which is used to represent the sparse graphical model structure, data, and parameters.

2. shared data table which represents the remaining state that cannot be factorized (e.g., the shared factor parameters θ).

3. update functions which represent the factorized computation.

4. scheduler which manages the adaptive computation.

5. consistency model which ensures data consistency and allows reasoning about local sequential sub-problems.

We now describe each of these elements.

Data Graph

The GraphLab data graph G = (V, E) encodes both the problem specific sparse computational dependencies and the directly modifiable program state. The user can associate arbitrary blocks of data (or parameters) with each vertex and directed edge in G. We denote the data associated with vertex v by D_v, and the data associated with edge (u → v) by D_{u→v}. In addition, we use (u → ∗) to represent the set of all outbound edges from u and (∗ → v) for inbound edges at v.

In many graphical model learning and inference applications, the data graph corresponds directly to the underlying graphical representation of the model. For example, in Gibbs sampling the data graph is equivalent to the MRF with the current sample values being stored on the vertices. Alternatively, in belief propagation the data graph corresponds to the factor graph, with vertices storing the local beliefs and factors and edges storing the messages.
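
A rough sketch of such a data graph in plain Python (not the actual GraphLab C++ API; the class and field names here are hypothetical) associates arbitrary data blocks with vertices and directed edges:

```python
class DataGraph:
    """Minimal data-graph container: arbitrary data on vertices and directed edges."""

    def __init__(self):
        self.vertex_data = {}            # v -> data block D_v
        self.edge_data = {}              # (u, v) -> data block D_{u->v}
        self.out_edges = {}              # u -> list of v
        self.in_edges = {}               # v -> list of u

    def add_vertex(self, v, data):
        self.vertex_data[v] = data
        self.out_edges.setdefault(v, [])
        self.in_edges.setdefault(v, [])

    def add_edge(self, u, v, data):
        self.edge_data[(u, v)] = data
        self.out_edges[u].append(v)
        self.in_edges[v].append(u)

# Usage: a two-variable MRF for Gibbs sampling, storing the current sample on vertices.
g = DataGraph()
g.add_vertex("x1", {"sample": 0})
g.add_vertex("x2", {"sample": 1})
g.add_edge("x1", "x2", {"edge_potential": [[2.0, 1.0], [1.0, 2.0]]})
g.add_edge("x2", "x1", {"edge_potential": [[2.0, 1.0], [1.0, 2.0]]})
```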


Figure 3.1: (a) The scope, S_v, of vertex v consists of all the data at the vertex v, its inbound and outbound edges, and its neighboring vertices. The update function f, when applied to the vertex v, can read and modify any data within S_v. (b) We illustrate the 3 data consistency models by drawing their exclusion sets as rings, where no two update functions may be executed simultaneously if their exclusion sets (rings) overlap.

Shared Data Table

To support globally shared state, GraphLab provides a shared data table (SDT), which is an associative map, T[Key] → Value, between keys and arbitrary blocks of data. The SDT is used to store global constants as well as to compute and store aggregate computations. For example, in Gibbs sampling the SDT may be used to store any shared parameters in the MRF as well as the current estimate of the unnormalized likelihood. While the computation of the unnormalized likelihood can be accomplished locally, computing the sum of the local contributions must be accomplished through a global reduction using the SDT.

Global computation is accomplished in the SDT using the sync framework, which behaves nearly identically to the Reduce in MapReduce. Like MapReduce, the user defines the aggregation operation which is applied by the sync framework to all the vertices to produce the final aggregate. However, unlike MapReduce, the user can also define a normalization routine which is applied to the global aggregate before it is placed back in the SDT. Each entry in the SDT may therefore be associated with a pair of aggregation and normalization functions which are used to maintain the table entry. Finally, because GraphLab is designed for iterative algorithms, the entries in the SDT are continuously updated as the computation proceeds, and so the user must define how aggressively these global values are maintained. If the entries are updated too frequently, computation is wasted; if they are not updated frequently enough, they become poor estimates of the global values.
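
A schematic of the sync pattern, written as plain Python rather than the GraphLab API (the entry name and functions below are made up for illustration), pairs an associative aggregator with a final normalization step:

```python
from functools import reduce

def sync_entry(vertex_data, aggregate, normalize, zero):
    """Fold an aggregator over all vertex data, then normalize the result
    before it would be written back into the shared data table."""
    total = reduce(aggregate, vertex_data.values(), zero)
    return normalize(total, len(vertex_data))

# Usage: maintain a (hypothetical) SDT entry holding the average unnormalized
# log-likelihood contribution per vertex.
vertex_data = {"x1": {"loglik": -1.2}, "x2": {"loglik": -0.7}, "x3": {"loglik": -2.1}}
aggregate = lambda acc, d: acc + d["loglik"]
normalize = lambda total, n: total / n
sdt_entry = sync_entry(vertex_data, aggregate, normalize, zero=0.0)
```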

Update Functions

A GraphLab update function is a stateless user-defined function which operates on the data associated with small neighborhoods in the graph and represents the core element of computation. For every vertex v, we define S_v as the neighborhood of v, which consists of v, its adjacent edges (both inbound and outbound), and its neighboring vertices, as shown in Fig. 3.1(a). We define D_{S_v} as the data corresponding to the neighborhood S_v. In addition to D_{S_v}, update functions also have read-only access to the shared data table T. We define the application of the update function f to the vertex v as the state mutating computation:

D_{S_v} ← f(D_{S_v}, T).

We refer to the neighborhood S_v as the scope of v because S_v defines the extent of the graph that can be accessed by f when applied to v. For notational simplicity, we denote f(D_{S_v}, T) as f(v). The update functions allow us to express the types of local factorized computation described in the thesis statement.

35

Page 40: Thesis Proposal Parallel Learning and Inference in ...jegonzal/jegonzal_thesis_proposal.pdf · In this thesis we explore how to design efficient parallel algorithms for probabilistic

Novem

ber 20

, 201

0

DRAFT

For example, the computation in both Gibbs sampling and belief propagation can be described as update functions in the GraphLab framework.
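
To sketch the shape of such an update function (again in illustrative Python, not the GraphLab C++ API; the scope accessors and SDT key are hypothetical), a Gibbs-style update reads the scope of a vertex, resamples it, and writes the new value back into the vertex data:

```python
import random

def gibbs_update(scope, sdt):
    """Update function f(D_Sv, T): resample the center vertex of the scope
    from a toy binary conditional given the neighboring vertex data."""
    neighbor_vals = [scope.neighbor_data(u)["sample"] for u in scope.neighbors()]
    w = sdt["edge_strength"]          # read-only access to the shared data table
    agree = sum(neighbor_vals)        # neighbors currently assigned 1
    # Prefer agreement with the neighbors (Ising-like binary conditional).
    p_one = (w ** agree) / (w ** agree + w ** (len(neighbor_vals) - agree))
    scope.vertex_data()["sample"] = 1 if random.random() < p_one else 0
```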

Scheduler

The GraphLab scheduler describes the order in which update functions are applied to vertices and is what allows GraphLab to express the types of adaptive computation described in the thesis statement. The scheduler abstractly represents a dynamic list of tasks (vertex-function pairs) which are to be executed by the GraphLab engine.

Because constructing a scheduler requires reasoning about the complexities of parallel algorithm design, the GraphLab framework provides a collection of base schedulers. To represent synchronous algorithms (e.g., gradient descent), GraphLab provides a synchronous scheduler which ensures that all vertices are updated simultaneously. To represent basic asynchronous algorithms, GraphLab provides a round-robin scheduler which updates all vertices sequentially using the most recently available data.

To represent the principal class of adaptive asynchronous computation, GraphLab provides a collection of task schedulers which permit update functions to add and reorder tasks. The most basic task schedulers are FIFO schedulers, which permit task creation but do not permit task reordering. The prioritized schedulers permit task reordering at the cost of increased overhead. For both types of task scheduler, GraphLab also provides relaxed versions which increase performance at the expense of reduced control:

                 Strict Order        Relaxed Order
FIFO             Single Queue        Multi Queue / Partitioned
Prioritized      Priority Queue      Approx. Priority Queue
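The following Python sketch illustrates the semantics of the two strict-order task schedulers (the class names and the tiny engine loop are illustrative assumptions rather than GraphLab code); the relaxed variants would replace the single queue or heap with partitioned or approximate versions:

```python
# Sketch of the two basic task-scheduler flavors (FIFO and prioritized);
# an illustration of the scheduling semantics, not GraphLab's engine.
import heapq
from collections import deque

class FIFOScheduler:
    """Permits task creation but not reordering."""
    def __init__(self):
        self.queue = deque()
    def add_task(self, vertex, fn):
        self.queue.append((vertex, fn))
    def next_task(self):
        return self.queue.popleft() if self.queue else None

class PriorityScheduler:
    """Permits reordering: the highest-priority task runs first."""
    def __init__(self):
        self.heap = []
        self._counter = 0   # tie-breaker so tasks never compare functions
    def add_task(self, vertex, fn, priority):
        heapq.heappush(self.heap, (-priority, self._counter, vertex, fn))
        self._counter += 1
    def next_task(self):
        if not self.heap:
            return None
        _, _, vertex, fn = heapq.heappop(self.heap)
        return vertex, fn

# A tiny engine loop: update functions may enqueue new tasks (adaptive computation).
def run(scheduler):
    while (task := scheduler.next_task()) is not None:
        vertex, fn = task
        fn(vertex, scheduler)

def vertex_update(v, sched):
    print(f"updated vertex {v}")
    if v == 7:
        # Update functions may create new tasks for their neighbors.
        sched.add_task(1, vertex_update, priority=0.5)

sched = PriorityScheduler()
sched.add_task(3, vertex_update, priority=0.2)
sched.add_task(7, vertex_update, priority=0.9)   # runs first
run(sched)   # order: 7, then the newly created task for 1, then 3
```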

In addition, GraphLab provides the splash scheduler, which was specifically designed to allow automatic sequential subproblem decomposition and is sufficient to express the Splash BP algorithm. However, in our work on Gibbs sampling we found that it is possible to actually encode the Splash scheduler by using only a prioritized scheduler and cleverly modifying our update functions. In the discussion on proposed improvements to GraphLab (see Sec. 3.2) we address the possibility of building a more flexible version of the Splash scheduler to allow asynchronous Splash generation as well as user-defined tree construction.

Consistency Model

Since scopes may overlap, the simultaneous execution of two update functions can lead to race conditions resulting in data inconsistency and even corruption. For example, two function applications to neighboring vertices could simultaneously try to modify data on a shared edge, resulting in a corrupted value. Alternatively, a function trying to normalize the parameters on a set of edges may compute the sum only to find that the edge values have changed.

Effectively, the graph structure imposes local sequential subproblems. That is, for many graphical model algorithms, adjacent variables in the MRF cannot or should not be operated on simultaneously. In some cases, even variables that share a common neighbor cannot be simultaneously modified.

GraphLab provides a choice of three data consistency models which enable the user to balance performance and data consistency. The choice of data consistency model determines the extent to which


overlapping scopes can be executed simultaneously. We illustrate each of these models in Fig. 3.1(b) by drawing their corresponding exclusion sets. GraphLab guarantees that update functions never simultaneously share overlapping exclusion sets. Therefore, larger exclusion sets lead to reduced parallelism by delaying the execution of update functions on nearby vertices.

The full consistency model ensures that during the execution of f(v) no other function will read or modify data within S_v. Therefore, parallel execution may only occur on vertices that do not share a common neighbor. The slightly weaker edge consistency model ensures that during the execution of f(v) no other function will read or modify any of the data on v or any of the edges adjacent to v. Under the edge consistency model, parallel execution may only occur on non-adjacent vertices. Finally, the weakest vertex consistency model only ensures that during the execution of f(v) no other function will be applied to v. The vertex consistency model is therefore prone to race conditions and should only be used when reads and writes to adjacent data can be done safely (in particular, repeated reads may return different results). However, by permitting update functions to be applied simultaneously to neighboring vertices, the vertex consistency model permits maximum parallelism.
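A small sketch of the three exclusion sets on a chain graph (a simplification of the actual locking protocol, with invented helper names) makes the trade-off explicit: two updates may proceed in parallel only when their exclusion sets are disjoint.

```python
# Sketch of the three consistency models as exclusion sets on a small chain
# graph. Edges are represented as frozensets so that (u, v) == (v, u).

neighbors = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}   # chain 0 - 1 - 2 - 3

def exclusion_set(v, model):
    adjacent_edges = {frozenset((v, u)) for u in neighbors[v]}
    if model == "vertex":
        return {v}                                   # only the vertex itself
    if model == "edge":
        return {v} | adjacent_edges                  # the vertex and its edges
    if model == "full":
        return {v} | adjacent_edges | neighbors[v]   # the entire scope S_v
    raise ValueError(model)

def may_run_in_parallel(u, v, model):
    """Two update functions may execute simultaneously only if their
    exclusion sets (rings) do not overlap."""
    return not (exclusion_set(u, model) & exclusion_set(v, model))

print(may_run_in_parallel(1, 2, "vertex"))  # True:  adjacent vertices allowed
print(may_run_in_parallel(1, 2, "edge"))    # False: they share the edge (1, 2)
print(may_run_in_parallel(0, 2, "edge"))    # True:  non-adjacent vertices allowed
print(may_run_in_parallel(0, 2, "full"))    # False: they share the neighbor 1
print(may_run_in_parallel(0, 3, "full"))    # True:  no common neighbor
```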

Choosing the right consistency model has direct implications for algorithm correctness. One method to prove correctness of a parallel algorithm is to show that it is equivalent to a correct sequential algorithm. To capture the relation between sequential and parallel execution of a program, we introduce the concept of sequential consistency:

Definition 3.1.1 (Sequential Consistency). A GraphLab program is sequentially consistent if for every parallel execution, there exists a sequential execution of update functions that produces an equivalent result.

The sequential consistency property is typically a sufficient condition to extend algorithmic correctness from the sequential setting to the parallel setting. In particular, if the algorithm is correct under any sequential execution of update functions, then the parallel algorithm is also correct if sequential consistency is satisfied.

Proposition 3.1.1. GraphLab guarantees sequential consistency under the following three conditions:

1. The full consistency model is used.

2. The edge consistency model is used and update functions do not modify data in adjacent vertices.

3. The vertex consistency model is used and update functions only access local vertex data.

In our work on loopy BP and Gibbs sampling we typically only require the edge consistency model to obtain sequential consistency. In fact, using only edge consistency we can construct a tree growing protocol that ensures that no two spanning trees share a common edge, which was sufficient to prove algorithm correctness for the first parallel blocking sampler.

3.1.3 Implementation and Experiments

We have implemented the GraphLab abstraction in the multicore (shared memory) and cloud computing settings. The multicore version of GraphLab is fairly mature, with relatively complete documentation, tutorials, and application suites. We have used the multicore version to produce state-of-the-art results on up to 32 cores in parallel sampling as well as loopy BP.


While our cloud implementation of GraphLab is still in pre-release, we have already demonstrated exceptional performance on a large 256 core Amazon EC2 cloud deployment. The cloud version of GraphLab relies on an optimized TCP communication layer to manage communication, basic graph partitioning and over-partitioning to distribute the graph, and a clever latency-hiding locking protocol to implement the various consistency models.

3.2 Proposed Work

3.2.1 Proposed: Address the Challenge of High Degree Vertices

In many purely Bayesian applications it is possible that the MRF contains several very high degree variables. For example, when we impose a prior distribution over a shared parameter, the resulting parameter variable can connect to a large number of variables in the model. While the model may still be sparse, the presence of high degree variables presents a unique challenge for parallelization.

For example, consider the situation where a single vertex connects to all other vertices in the MRF. Under the full consistency model, the resulting algorithm is strictly sequential. Under the more frequently used edge consistency model (used in both loopy BP and Gibbs sampling), the execution of an update on the high degree variable blocks all other computation. While there is still potential for considerable parallelism when the high degree variable is not being updated, the single barrier point imposed by high-degree vertices may be costly.

Finally, high-degree variables expose an opportunity for more fine-grained parallelism than can currently be expressed in GraphLab. In particular, GraphLab takes a vertex-centric view of parallelism. As a consequence, GraphLab does not attempt to exploit parallelism in edge data aggregation or transformation. In many cases it may be possible to accelerate the update of a single vertex by spreading the computation over multiple processors. For example, in loopy BP, variable belief is computed using an associative (product) operation on all the inbound messages and can therefore be done using a parallel tree reduction.
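The sketch below illustrates this point on toy message vectors: because the element-wise product is associative, the belief at a high degree vertex can be accumulated by a tree-shaped reduction whose halves could, in principle, be assigned to different processors (the code itself runs sequentially and is only illustrative).

```python
# Sketch of computing a variable's belief as a tree reduction over the
# element-wise product of its inbound messages. The associativity of the
# product is what allows the reduction over the edges of a high-degree vertex
# to be parallelized; here the "parallel" structure is only simulated.

def multiply(m1, m2):
    return [a * b for a, b in zip(m1, m2)]

def tree_reduce(messages):
    """Pairwise (tree-shaped) reduction; depth is O(log n) given n processors."""
    if len(messages) == 1:
        return messages[0]
    mid = len(messages) // 2
    left = tree_reduce(messages[:mid])    # could be evaluated on one processor
    right = tree_reduce(messages[mid:])   # ...and this half on another
    return multiply(left, right)

def normalize(m):
    z = sum(m)
    return [x / z for x in m]

# Four inbound messages over a binary variable.
inbound = [[0.6, 0.4], [0.7, 0.3], [0.5, 0.5], [0.9, 0.1]]
belief = normalize(tree_reduce(inbound))
print(belief)  # same result as a left-to-right sequential product
```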

Therefore we propose the addition of edge update functions which would allow the GraphLab framework to exploit fine-grained parallelism on very high degree nodes in the graph.

3.2.2 Proposed: Asynchronous Splash Scheduler

The current GraphLab framework provides a Splash scheduler which is tuned for the SplashBP algorithm and does not support asynchronous Splash construction and destruction or disjoint Splash creation. In our work on Splash Gibbs sampling we were able to implement the entire Splash framework using only the basic priority and FIFO scheduling primitives. The resulting Splash structures exposed substantially more parallelism than was available in SplashBP and also enabled disjoint Splash creation. Therefore we plan to bring the asynchronous Splash creation used in Gibbs sampling back into the Splash scheduler and provide users with greater control over the construction and destruction of Splashes. This will allow further exploration of Splash scheduling in other commutative semi-rings (e.g., max-product, mean-field approximations, the TRW work by Wainwright et al. [2003], ...).

Given a more general implementation of the Splash scheduler, we would like to explore both theoretically and empirically the gain of Splash scheduling in other application domains.


To maintain the focus of the thesis, we propose exploring the use of Splash scheduling in mean-field approximation, max-product, and tree-reweighted belief propagation.

3.2.3 Optional: GPU Based GraphLab Implementation

As mixed multicore and GPU computation becomes more pervasive, the ability to leverage both technologies independently and even simultaneously becomes increasingly advantageous. Consequently, we would like the techniques that we develop for parallel learning and inference to run on GPUs. Therefore, we recently started working on a GPU based implementation of the GraphLab abstraction. The GPU setting presents its own unique challenges which limit both the types of computation as well as the potential size and structure of models that can be solved efficiently. By implementing a GPU version of the GraphLab abstraction and then simply running our existing GraphLab based algorithms, we will be able to characterize the challenges presented by the GPU hardware as well as some of the potential new opportunities.


Chapter 4

Parallel Parameter Learning [30% Complete]

In Sec. 1.4 we introduced the problem of parameter learning in probabilistic graphical models. We noted that the computationally expensive component of parameter learning is inference. In Sec. 2 we demonstrated considerable progress towards a suite of general purpose parallel inference methods. Therefore, we have already begun to provide a solution to the problem of parallel parameter learning.

Nonetheless, we believe that by applying the thesis statement directly to the problem of parameter learning we can further improve parallel performance and at the same time build a better understanding of the challenges and opportunities for parallelism in the context of parameter learning. To justify this claim we now present some preliminary results that demonstrate how applying the thesis statement to the problem of parameter learning can lead to a more efficient parallel algorithm.

4.1 Initial Work: Simultaneous Learning and Inference

Parameter learning in graphical models is typically accomplished by iterating between inference and parameter optimization. During inference the clique marginals are estimated. In the parameter optimization phase the estimated clique marginals are used to compute new optimal parameter values, either by gradient descent or iterative proportional fitting (IPF).

In most large-scale real-world problems, statistical efficiency dictates parameter sharing. Often when we build large models, we use templates which generate the model structure by assigning generic factors (e.g., the energy of van der Waals forces) to sets of random variables that satisfy predicates (e.g., nearby in the protein backbone). Furthermore, by sharing parameters across multiple factors we reduce the degrees of freedom in our model and consequently the variance in our estimator. However, because parameters are typically shared and the resulting optimization is generally fast, it seems unlikely that we would be able to obtain substantial parallelism directly from parameter optimization.

However, if we examine the high-level iterative algorithm, we realize that there is potentially substantial wasted computation. In particular, we are repeatedly solving the inference problem only to make a small step in the parameter space. Alternatively, one could imagine partially solving the inference problem while simultaneously exploring the parameter space. Moreover, while the highly parallel inference algorithm proceeds, we can simultaneously make small steps in parameter space based on the current aggregated estimate of the belief.


Figure 4.1: Retinal Scan Denoising. (a) Background sync runtime: the total runtime (in seconds) of parameter learning, and (b) background sync error: the average percent deviation in learned parameters, each plotted against the time between gradient steps (sync frequency, in seconds) using the Splash schedule on 16 processors.

Using the GraphLab framework we implemented simultaneous learning and inference. The parallel SplashBP algorithm was used for inference, and parameter optimization was accomplished with gradient descent. We used the periodic global aggregation mechanism provided by the GraphLab SDT to run the gradient descent routine while simultaneously running inference.
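The following heavily simplified Python sketch conveys the structure of the approach with a single scalar parameter and toy beliefs (all quantities and update rules are invented stand-ins, not the actual SplashBP or retinal model): inference sweeps continue while a periodic sync aggregates the current beliefs and takes one gradient step.

```python
# Simplified sketch of simultaneous learning and inference: inference sweeps
# and gradient steps on a shared parameter are interleaved rather than running
# inference to convergence between parameter updates. Toy quantities only.
import math, random
random.seed(0)

theta = 0.0                      # shared parameter (kept in the SDT in GraphLab terms)
beliefs = [random.random() for _ in range(100)]
empirical_moment = 0.3           # target expected feature value measured from the data
step_size, sync_every = 0.1, 10

def inference_sweep(beliefs, theta):
    # Toy stand-in for an inference sweep: each belief drifts toward the value
    # the current parameter implies.
    target = 1.0 / (1.0 + math.exp(-theta))
    return [b + 0.5 * (target - b) for b in beliefs]

for iteration in range(1, 101):
    beliefs = inference_sweep(beliefs, theta)
    if iteration % sync_every == 0:
        # Background sync: aggregate the partially converged beliefs and take
        # one gradient step, instead of waiting for inference to converge.
        model_moment = sum(beliefs) / len(beliefs)
        theta += step_size * (empirical_moment - model_moment)

print(round(theta, 3))
```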

We evaluated simultaneous learning and inference using a large retinal density estimation problem which consists of a 256 × 64 × 64 grid-structured MRF with pairwise potentials. Laplace smoothing was used with separate parameters for each axis. In Fig. 4.1(a) we plot the total runtime in seconds as a function of the time between gradient steps. As we take gradient steps more frequently, we see a reduction in the total running time of parameter learning and inference.

One might be concerned that the resulting learned model would deviate substantially from the model learned using the more traditional iterative method. In Fig. 4.1(b) we plot the average percent deviation in the parameters relative to their estimated value when computed after 120 seconds. While there is an increase in the variance of the parameter estimates, it is relatively small and leads to a negligible impact on the predictions.

While there is still a lot that needs to be studied, this simple anecdotal example gives some hope that we can improve the performance of parameter learning by simultaneously focusing computation on parameter optimization and inference. In the proposed work in parameter learning we discuss our plans to further evaluate this simple method of simultaneous learning and inference as well as some methods to further combine the parameter optimization into the inference procedure.

4.2 Proposed Work

We would like to continue the theme of simultaneous learning and inference while applying the ideas in the thesis statement. We have therefore divided the proposed work into two parts. The first is to develop a better understanding of the naive simultaneous inference and learning algorithm, and the second is to construct an asynchronous parallelization of a slightly more principled approach.


4.2.1 Proposed: Theoretically and Experimentally Evaluate the Naive Algorithm

The naive algorithm for simultaneous parameter learning and inference relies on the conjecture that we can take small gradient steps in parameter space while simultaneously computing approximate marginals, and thereby reduce wasted inference while obtaining high quality parameter estimates. Currently we have only limited anecdotal evidence to support this conjecture. Therefore we would like to both theoretically and experimentally evaluate the naive simultaneous learning and inference algorithm.

First, we would like to better characterize the conditions under which the naive algorithm is guaranteed to converge to the optimal parameter estimate. Since it is impossible to provide general bounds for the performance of approximate inference, we propose restricting our analysis to tree graphical models and graphical models with a single loop. In this restricted setting it is possible to bound the running time of parallel inference using our τ_ε-framework. We can then characterize the running time of the classic iterated inference and learning algorithm and then adapt that analysis to the approximate gradient steps made in the naive simultaneous learning and inference algorithm.

In our early experimental analysis we scheduled gradient steps as a function of time. However, since the speed of inference can depend heavily on the asynchronous schedule, the degree of change in the parameters, and the problem structure, we would like to trigger gradient steps based on the stability of the current marginal estimates. In particular, we would like to schedule gradient steps when the belief residuals fall below a fixed value. We believe that this may actually improve performance by adaptively adjusting the rate of exploration as a function of the stability of the marginals.
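A minimal sketch of the proposed trigger, under the assumption that the engine maintains a maximum belief residual (the decay rate and gradient value below are toy placeholders):

```python
# Sketch of triggering gradient steps from belief residuals rather than from
# wall-clock time: a step is taken only once the marginal estimates have
# stabilized below a threshold. Residuals and gradients here are toy values.

def take_gradient_step(theta, grad, step_size=0.1):
    return theta - step_size * grad

theta = 0.0
residual_threshold = 1e-2
max_residual = 1.0            # maintained by the inference update functions
gradient_estimate = 0.5       # would be computed from the current marginals

for sweep in range(50):
    # Inference sweeps shrink the residual; a parameter change raises it again.
    max_residual *= 0.7
    if max_residual < residual_threshold:
        theta = take_gradient_step(theta, gradient_estimate)
        max_residual = 1.0    # the new parameters perturb the marginals
print(round(theta, 2))
```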

Finally, we would like to evaluate the naive algorithm on a larger collection of models that vary in structure, degree of parameter sharing, and complexity of the data. We believe relational Markov networks, and in particular Markov logic networks (MLNs), provide a strong candidate domain to evaluate the proposed parameter learning techniques. These models have varying complexity, amounts of training data, and structure. To evaluate more complex parameter learning settings with stronger variable interactions, we would like to look at the protein side-chain prediction tasks.

4.2.2 Proposed: Factorized Asynchronous Parallelization Using CAMEL

A known problem with combining loopy BP with parameter learning is that it can lead to unstable gradients and to intermediate parameters for which BP may fail to converge. There have been recent efforts in developing more stable parameter learning techniques that frame the inference and parameter learning problem as a single optimization procedure. We will now discuss one such effort which we believe may be amenable to parallelization and may further illustrate the advantages of applying the thesis statement.

To simplify the presentation we will restrict our attention to log-linear models of the form:

P(X_1 = x_1, \ldots, X_n = x_n \mid \theta) = \frac{1}{Z(\theta)} \exp\left( \sum_{A \in \mathcal{F}} \sum_i \theta_A^i f_A^i(x_A) \right),    (4.1)

where f_A^i is called the ith feature with domain x_A. In the log-linear representation we may often have many features that share a common domain. Beyond the additional sum (\sum_i), the primary change between the general factorized model (Eq. (1.1)) and the log-linear representation is that the factor f_A no longer depends on θ. While log-linear models are a restricted class of factorized models, they can express a wide range of distributions and are extremely common in practice. Therefore the remainder of our discussion on parameter learning, as well as structure learning, will focus on log-linear models.
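As a concrete (and entirely invented) miniature example of Eq. (4.1), the following sketch evaluates the log-linear model for a two-variable factor with two shared-domain features, computing the partition function by brute force:

```python
# Sketch of evaluating the log-linear model of Eq. (4.1) for a tiny assignment;
# the factors, features, and parameters below are invented purely for
# illustration.
import itertools, math

variables = ["x1", "x2"]                      # two binary variables
factors = {("x1", "x2"): [                    # one factor A = {x1, x2}, two features
    lambda xa: 1.0 if xa["x1"] == xa["x2"] else 0.0,   # agreement feature
    lambda xa: float(xa["x1"]),                        # bias feature on x1
]}
theta = {("x1", "x2"): [0.8, -0.3]}           # one weight per feature of each factor

def unnormalized_log_prob(assignment):
    return sum(theta[A][i] * f(assignment)
               for A, feats in factors.items()
               for i, f in enumerate(feats))

def log_partition():
    # Brute-force Z(theta); feasible only for tiny models, shown for completeness.
    return math.log(sum(math.exp(unnormalized_log_prob(dict(zip(variables, xs))))
                        for xs in itertools.product([0, 1], repeat=len(variables))))

x = {"x1": 1, "x2": 1}
print(unnormalized_log_prob(x) - log_partition())   # log P(x1=1, x2=1 | theta)
```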

To directly apply the thesis statement to the problem of parameter learning we need to construct a factorized asynchronous representation. Fortunately, there has been recent work in combining the parameter learning and inference objectives. In particular, Ganapathi et al. [2008] extend the work of Teh and Welling [2003] to the setting of models with shared parameters to produce the objective they call Constrained Approximate Maximum Entropy Learning, or CAMEL. While the CAMEL objective leads to a double loop algorithm, the inner loop simultaneously solves for optimal parameters and marginal estimates.

The CAMEL objective tries to maximize the entropy of pseudo-marginals with respect to the standard marginal constraints imposed by belief propagation and the added constraint that the expected values of the features f_A^i match the empirical marginals in the data. This is the dual formulation of the maximum likelihood objective used in Sec. 4.2.1. The CAMEL objective is formally defined as:

\max_{\mu} \sum_{A_i \in \mathcal{F}} H\left(\mu_i(X_{A_i})\right) - \sum_{A_i, A_j \in \mathcal{F}} H\left( \sum_{x_{A_i \setminus A_j}} \mu_i\left(X_{A_i \cap A_j}, x_{A_i \setminus A_j}\right) \right)    (4.2)

subject to:

E_{\mu}[f_A^i] = E_{p}[f_A^i] \quad \forall f_A^i    (4.3)

\sum_{x_{A_i \setminus A_j}} \mu_i\left(X_{A_i \cap A_j}, x_{A_i \setminus A_j}\right) = \sum_{x_{A_j \setminus A_i}} \mu_j\left(X_{A_j \cap A_i}, x_{A_j \setminus A_i}\right) \quad \forall A_i, A_j \in \mathcal{F}    (4.4)

\sum_{x_{A_i}} \mu_i(x_{A_i}) = 1 \quad \forall A_i \in \mathcal{F}    (4.5)

\mu > 0    (4.6)

where \mu_i(X_{A_i}) are the approximate clique marginals that we try to estimate given the data.

The objective function in Eq. (4.2) maximizes the approximate entropy of the model (the entropy objective in Eq. (4.2) is exact in trees). The constraint in Eq. (4.3) tries to match the expected values of the features under the model with the observed expected values of the features in the data. Finally, the constraints Eq. (4.4), Eq. (4.5), and Eq. (4.6) are the traditional marginalization constraints used in belief propagation.

Ganapathi et al. [2008] solve the CAMEL objective using the concave-convex procedure (CCCP) originally introduced in Yuille [2002] for convergent belief propagation. The resulting two-phase algorithm begins by computing a concave approximation to the convex term:

-\sum_{A_i, A_j \in \mathcal{F}} H\left( \sum_{x_{A_i \setminus A_j}} \mu_i\left(X_{A_i \cap A_j}, x_{A_i \setminus A_j}\right) \right)    (4.7)

in the objective function (Eq. (4.2)). For simplicity, Ganapathi et al. [2008] use a simple local first-order approximation. Given a concave approximation of Eq. (4.7), we can then run a convergent optimization on the remaining problem. While Ganapathi et al. [2008] move Eq. (4.4) into the objective by adding a new set of factors over pairs of cliques to encode the marginalization constraints and enable the direct application of an off-the-shelf L-BFGS solver, Yuille [2002] derives a factorized iterative procedure but without the constraint term Eq. (4.3). In their experimental evaluation on a large selection of models, Ganapathi et al. [2008] found that the CAMEL objective is substantially more stable and leads to models that perform as well as or better in standard prediction tasks than iterated residual belief propagation and gradient descent parameter updates.
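To convey the structure of the procedure, the following one-dimensional toy sketch (the objective is ours and is unrelated to the actual CAMEL problem) maximizes a concave-plus-convex function by repeatedly replacing the convex term with its linear, and hence concave, first-order approximation and maximizing the resulting surrogate:

```python
# Sketch of the concave-convex procedure (CCCP) on a one-dimensional toy
# objective: maximize concave(x) + convex(x) by repeatedly linearizing the
# convex term and maximizing the resulting concave surrogate in closed form.

def concave(x):            # e.g., an entropy-like term
    return -(x - 2.0) ** 2

def convex(x):             # e.g., the negated pairwise-entropy term in Eq. (4.2)
    return 0.5 * x ** 2

def convex_grad(x):
    return x

x = 0.0
for it in range(20):
    g = convex_grad(x)
    # Surrogate: concave(x') + convex(x) + g * (x' - x).  Setting the surrogate's
    # derivative to zero, -2(x' - 2) + g = 0, gives the closed-form update.
    x = 2.0 + g / 2.0
print(round(x, 4))   # approaches 4.0, the maximizer of the full objective
```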


In this work we propose developing a factorized iterative procedure like that described by Yuille [2002] for the CAMEL objective introduced by Ganapathi et al. [2008]. This will allow us to construct an asynchronous parallelization of the CAMEL learning procedure and directly apply the thesis statement. We can then evaluate the use of our Splash scheduling primitive on the remaining concave constrained optimization problem. We would then like to explore merging the two phases of the CCCP algorithm to permit simultaneous optimization of the concave objective and the construction of the concave approximation of the convex term.


Chapter 5

Parallel Structure Learning [0% Complete]

We have not yet started working directly on the problem of parallel structure learning. Nonetheless, because the computationally expensive component of structure learning is parameter learning, and consequently inference, we have already provided the basic tools needed for parallel structure learning. Furthermore, as we make progress in parallel inference and parameter learning we are also making progress towards the objective of parallel structure search.

5.1 Proposed Work

Nonetheless, directly studying the structure learning problem presents several challenges and opportunities unique to the parallel setting. Therefore we would like to directly evaluate the thesis statement in the context of parallel structure learning. In the following proposed work, we discuss several of the challenges and opportunities unique to the parallel setting as well as some potential parallel structure learning methods to pursue.

5.1.1 Proposed: Parallel Regularization for Structure Search

The choice of structure has an impact on the parallel scaling of learning and inference. For example, very high degree variables can limit what may be sampled simultaneously in the parallel Gibbs sampler. The standard regularization used in structure learning does not directly favor graphical model structures which are well suited for parallel inference and parameter learning. Therefore we would like to explore adding additional penalty terms that would induce structures which are more directly amenable to parallel inference methods. Essentially, we are proposing a parallel analogue of structure search for efficient inference. Therefore we would like to answer the following key questions:

• What penalty terms lead to structures that are most amenable to parallelism?

Hypothesis: A term for sparsity plus a constraint on the maximum degree of any variable in the MRF (a sketch of such a penalized score follows these questions).


• What is the computational cost of adding the additional parallel penalty terms to the score function and combinatorial search?

Hypothesis: By maintaining convexity and comparability of the penalty term we can reduce the dependence on variable ordering in the greedy search.

• What is the statistical cost of adding the additional parallel penalty terms in terms of model bias?
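The following sketch illustrates the hypothesized penalty on toy candidate structures (the scoring function, penalty weight, and degree cap are illustrative assumptions, not a worked-out proposal):

```python
# Sketch of a structure score with an added "parallelism" penalty: a sparsity
# term plus a hard cap on the maximum vertex degree. All numbers are invented.
from collections import Counter

def penalized_score(data_score, edges, sparsity_weight=0.1, max_degree=4):
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    if degree and max(degree.values()) > max_degree:
        return float("-inf")                 # structures with hub vertices are rejected
    return data_score - sparsity_weight * len(edges)

# Two candidate structures with the same fit to the data: a chain and a star.
chain = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
star = [(0, v) for v in range(1, 6)]
print(penalized_score(10.0, chain, max_degree=4))   # 9.5
print(penalized_score(10.0, star, max_degree=4))    # -inf: vertex 0 has degree 5
```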

5.1.2 Proposed: L1 Regularized Parameters for Structure Search

In large-scale real-world structure search there is often considerable domain knowledge about which variables might be related and which features may be informative. For example, in image processing we might believe that nearby variables have similar values, but we may not know the exact local structure. Therefore we conjecture that:

• Hypothesis: Structure search within the context of a restricted family of structures will be substantially more effective in the large-scale setting.

If we restrict our attention to the log-linear representation (Eq. (4.1)) of the factorized distribution, then one method to refine the space of structures is to first introduce a large collection of candidate features derived using domain knowledge. We can then appeal to L1 regularization to restrict the set of features with nonzero weights. Fortunately, using L1 regularized parameters for structure search is well studied and typically provides reasonable results [Koller and Friedman, 2009].
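As a sketch of the mechanism (on a synthetic linear surrogate rather than the pseudo-likelihood or CAMEL objectives one would actually optimize), ISTA-style proximal gradient steps with soft-thresholding drive the weights of uninformative candidate features to zero:

```python
# Sketch of using L1 regularization to select a sparse subset of candidate
# features: proximal gradient (ISTA-style) steps on a least-squares surrogate
# with soft-thresholding. The data below are synthetic.
import random
random.seed(1)

n, d = 200, 10
true_w = [1.5, 0.0, -2.0] + [0.0] * (d - 3)            # only two truly active features
X = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
y = [sum(w * xi for w, xi in zip(true_w, row)) + random.gauss(0, 0.1) for row in X]

def soft_threshold(v, t):
    return max(abs(v) - t, 0.0) * (1.0 if v > 0 else -1.0)

w = [0.0] * d
step, lam = 0.01, 0.5
for _ in range(300):
    # Gradient of the averaged squared loss, followed by the L1 proximal step.
    grad = [0.0] * d
    for row, target in zip(X, y):
        err = sum(wj * xj for wj, xj in zip(w, row)) - target
        for j in range(d):
            grad[j] += err * row[j] / n
    w = [soft_threshold(wj - step * gj, step * lam) for wj, gj in zip(w, grad)]

selected = [j for j, wj in enumerate(w) if abs(wj) > 1e-3]
print(selected)   # should recover (approximately) the truly active features {0, 2}
```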


Chapter 6

Platforms, Applications, and Performance Metrics

In our preliminary work in graphical model inference (Sec. 2) and parameter learning (Sec. 4) we examined various different models and parallel architectures. In addition, we also developed some useful performance metrics to assess the efficiency of our parallel algorithms as well as their scaling and raw performance. In this section we briefly review the hardware, models, and metrics we will use in this thesis work.

6.1 Parallel Systems

We plan to focus predominantly on multicore and cluster parallel systems. Multicore systems have become ubiquitous and represent a future trend in parallel hardware. The multicore setting has the advantages of fast inter-processor communication and approximately symmetric shared memory, and the disadvantage of limited memory bandwidth, which makes it difficult to get enough data to each core. The cluster (distributed memory) setting offers the opposite trade-offs. In the cluster setting, inter-processor communication is costly and the program state must be distributed; however, the amount of memory and the memory bandwidth typically scale linearly with the number of cores.

By applying the thesis statement and mapping our algorithms to the GraphLab abstraction, we should be able to easily move across different platforms and rely on GraphLab to optimize the details of mapping the algorithm to various parallel platforms. Nonetheless, we have access to very large shared and distributed memory architectures as well as some more exotic prototype platforms.

6.2 Applications

To evaluate inference and learning we have assembled a moderate collection of test applications from varying domains:

• Planar and 3D Grid Pair-wise Markov random fields


– Synthetic image denoising (Fig. 1.1, Gonzalez et al. [2009a])

– Depth prediction from monocular video [Gonzalez et al., 2009a]

– Retinal scan density estimation [Low et al., 2010]

– Video segmentation

• Protein-protein interaction networks [Jaimovich et al., 2006]

• Protein side-chain prediction networks [Yanover et al., 2007]

• Markov logic networks [Richardson and Domingos, 2006]

However, we would like to explore several new application domains. First, the protein side-chain prediction networks we have studied are based on protein structure in a vacuum. From our discussions with Christopher Langmead, we believe that improvements in the performance of inference and even learning may aid in the ability to learn and reason about protein side-chain models for structure in solution. By being able to model the structure in solution, we would be able to build better quality models of protein structure in different environments. Unfortunately, we are currently unaware of any existing models for protein structure in solution and would therefore need to learn the energy functions for these structures.

6.3 Performance Metrics

An algorithm that very quickly gets the wrong answer is not typically useful. Therefore we need a method to verify that the new approximate parallel algorithms produce acceptable results. Evaluating performance for the various algorithms can be difficult in the absence of guaranteed convergence or accuracy. When possible, we directly test model performance on test data; however, strong prediction performance does not always imply a strong algorithm. For example, the algorithm and model may estimate modes accurately but fail to estimate the complete distribution. Alternatively, it can be difficult to assess whether poor prediction performance is due to the model or to the learning and inference procedures. To address these issues we rely on models and applications for which the performance of the existing techniques has been well characterized and provides a strong baseline. In addition, we ensure that the prediction quality, and if possible even the output, is consistent as we scale the number of processors.

To ensure that our parallel algorithms are efficient, we analyze the amount of work done as a function of the number of processors. In our experiments, we track the number of message calculations or parameter gradient steps required for the algorithm to converge. To obtain an efficient parallel algorithm, the total amount of work should not change as a function of the number of processors. When constructing distributed algorithms, we similarly measure the number of messages communicated across the network. Ideally, the number of messages sent per machine should remain relatively constant. Finally, when conducting scaling analysis we always compare against the fastest sequential algorithm.
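A small sketch of how these two quantities would be reported, using made-up numbers purely for illustration:

```python
# Sketch of the two efficiency checks described above: (1) total work (message
# calculations) should stay roughly constant as processors are added, and
# (2) speedup is reported relative to the fastest sequential implementation.
# The numbers below are invented for illustration.

runs = {
    1:  {"messages": 1.00e6, "runtime_s": 120.0},
    8:  {"messages": 1.05e6, "runtime_s": 17.0},
    16: {"messages": 1.10e6, "runtime_s": 9.5},
}
fastest_sequential_s = 110.0   # the best single-threaded baseline algorithm

for p, r in runs.items():
    work_ratio = r["messages"] / runs[1]["messages"]     # ~1.0 means no wasted work
    speedup = fastest_sequential_s / r["runtime_s"]      # relative to the best baseline
    print(f"p={p:2d}  work x{work_ratio:.2f}  speedup x{speedup:.2f}")
```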


Chapter 7

Time-line

• January - March, 2011:

Complete extensive empirical analysis of the Splash Gibbs sampler (Sec. 2.4.1)

Explore using variable splitting to address the challenge of high degree variables in Gibbs sampling (Sec. 2.4.2).

Explore ergodic methods for the parallel acceleration of Gibbs sampling in latent topic models (Sec. 2.4.3).

Explore several candidate large-scale applications (e.g., protein side-chain prediction in solution).

• March - June, 2011:

Address challenge of high-degree vertices in GraphLab (Sec. 3.2.1)

Add general support for asynchronous Splash construction to GraphLab (Sec. 3.2.2)

Complete single Splash asynchronous parallelism (Sec. 2.2.1) and edge level parallelism (Sec. 2.2.2) as part of the JMLR paper on SplashBP

• June - September, 2011:

Complete analysis of naive parameter learning algorithm. (Sec. 4.2.1)

Begin work on CAMEL-based parameter learning (Sec. 4.2.2).

• September - November, 2011:

Complete CAMEL-based parameter learning (Sec. 4.2.2).

Begin work on parallel L1 regularized parameter learning for structure search. (Sec. 5.1.2)

Begin work on parallel regularization for structure search. (Sec. 5.1.1)

• November, 2011 - January, 2012:

Complete work on structure learning (Sec. 5.1.2 and Sec. 5.1.1)

Begin preparing Thesis Document


• January, 2012 - March, 2012:

Complete thesis document.

Defend.


Bibliography

M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, Al Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia.A view of cloud computing. Commun. ACM, 53:50–58, April 2010. ISSN 0001-0782. 1

K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf,S. W. Williams, and K. A. Yelick. The landscape of parallel computing research: A view from berkeley. Technical Re-port UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006. URL http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html. 1

A. Asuncion, P. Smyth, and M. Welling. Asynchronous distributed learning of topic models. In NIPS, 2008. 2.3, 2.3.1

A. Barbu and S. Zhu. Generalizing swendsen-wang to sampling arbitrary posterior probabilities. IEEE Trans. Pattern Anal.Mach. Intell., 27(8), 2005. 2.3, 2.3.4

D. Bertsekas and J. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Athena Scientific, 1989. 2.1.2, 2.3,2.3.2

D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, March 2003. ISSN1532-4435. 1

C. T. Chu, S. K. Kim, Y. A. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, and K. Olukotun. Map-reduce for machine learning on multicore.In NIPS, 2006. 1, 3.1.1

G. F. Cooper. The computational complexity of probabilistic inference using bayesian belief networks. Artificial Intelligence,42:393–405, 1990. 1.3

J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1), 2004. 3.1.1

P. Domingos. Uw-cse mlns, 2009. URL alchemy.cs.washington.edu/mlns/cora. 2.3.5

F. Doshi-Velez, D. Knowles, S. Mohamed, and Z. Ghahramani. Large scale nonparametric bayesian inference: Data parallelisa-tion in the indian buffet process. In NIPS 22, 2009. 2.3, 2.3.1

G. Elidan, I. Mcgraw, and D. Koller. Residual belief propagation: Informed scheduling for asynchronous message passing. InUAI’06, 2006. 2.1.1, 2.1.6, 2.1.7

P. A. Ferrari, A. Frigessi, and R. H. Schonmann. Convergence of some partially parallel Gibbs samplers with annealing. The Annals of Applied Probability, 3(1), 1993. 2.3.1

V. Ganapathi, D. Vickrey, J. Duchi, and D. Koller. Constrained approximate maximum entropy learning. In Proceedings of theTwenty-fourth Conference on Uncertainty in AI (UAI), 2008. 1.4, 4.2.2, 4.2.2, 4.2.2

S. Geman and D. Geman. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. In PAMI, 1984.2.3.1, 2.3.1, 2.3.2, 2.3.2

J. Gonzalez, Y. Low, and C. Guestrin. Residual splash for optimally parallelizing belief propagation. In AISTATS’09, 2009a. 2.1,2.1.6, 6.2

J. Gonzalez, Y. Low, C. Guestrin, and D. O’Hallaron. Distributed parallel inference on large factor graphs. In UAI’09, July2009b. 2.1, 2.1.7, 2.1.9

J. Gonzalez, Y. Low, and C. Guestrin. Scaling Up Machine Learning, chapter Parallel Belief Propagation in Factor Graphs.Cambridge University Press, 2010. 2.1

B. Hendrickson and R. Leland. The chaco user’s guide, version 2.0. Tech. Rep. SAND94-2692, Sandia National Labs, Albu-querque, NM, October 1994. 2.1.9

A.T. Ihler, J.W. Fischer III, and A.S. Willsky. Loopy belief propagation: Convergence and effects of message errors. J. Mach. Learn. Res., 6:905–936, 2005. 2.1.1

M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: distributed data-parallel programs from sequential building blocks.SIGOPS Oper. Syst. Rev., 41(3), 2007. 3.1.1

A. Jaimovich, G. Elidan, H. Margalit, and N. Friedman. Towards an integrated protein-protein interaction network: A relational Markov network approach. Journal of Computational Biology, 13(2):145–164, 2006. 1, 6.2

C. S. Jensen and A. Kong. Blocking gibbs sampling for linkage analysis in large pedigrees with many loops. In American Journalof Human Genetics, 1996. 2.3.4

G. Karypis and V. Kumar. Multilevel k-way partitioning scheme for irregular graphs. J. Parallel Distrib. Comput., 48(1), 1998.2.1.9

D. Koller and N. Friedman. Probabilistic Graphical Models. MIT Press, 2009. 2.5, 5.1.2

A.V. Kozlov and J.P. Singh. A parallel lauritzen-spiegelhalter algorithm for probabilistic inference. In Supercomputing ’94.Proceedings, pages 320 –329, November 1994. 2.5

M. Kuss and C. E. Rasmussen. Assessing approximate inference for binary gaussian process classification. J. Mach. Learn. Res.,6, 2005. 2.3

R. A. Levine and G. Casella. Optimizing random scan gibbs samplers. J. Multivar. Anal., 97(10), 2006. 2.3.4

S. Z. Li. Markov random field modeling in computer vision. Springer-Verlag, London, UK, 1995. ISBN 4-431-70145-1. 1

Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. Hellerstein. Graphlab: A new framework for parallel machinelearning. In UAI, 2010. 3.1, 3.1.2, 6.2

Grzegorz Malewicz, Matthew H. Austern, Aart J.C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski.Pregel: a system for large-scale graph processing. In Proceedings of the 28th ACM symposium on Principles of distributedcomputing, PODC ’09, pages 6–6, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-396-9. doi: http://doi.acm.org/10.1145/1582716.1582723. URL http://doi.acm.org/10.1145/1582716.1582723. 3.1.1

A. Mendiburu, R. Santana, J.A. Lozano, and E. Bengoetxea. A parallel framework for loopy belief propagation. In GECCO’07:Proceedings of the 2007 GECCO conference companion on Genetic and evolutionary computation, 2007. 2.1

G. Miller and J. Reif. Parallel tree contraction and its applications. In In Proceedings of the 26th IEEE Annual Symposium onFoundations of Computer Science, pages 478–489, 1985. 2.3.4, 2.5

J.M. Mooij and H.J. Kappen. Sufficient conditions for convergence of the Sum-Product algorithm. ITIT, pages 4422–4437, 2007.2.1.1

D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed inference for latent dirichlet allocation. In NIPS, 2007. 2.3,2.3.1, 2.3.2

C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: A not-so-foreign language for data processing. SIGMOD,2008. 3.1.1

B. Panda, J.S. Herbach, S. Basu, and R.J. Bayardo. Planet: massively parallel learning of tree ensembles with mapreduce. Proc.VLDB Endow., 2(2), 2009. 1, 3.1.1

J. Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, 1988. 2.1.1

D. M. Pennock. Logarithmic time parallel bayesian inference. In Proc. 14th Conf. Uncertainty in Artificial Intelligence, pages431–438. Morgan Kaufmann, 1998. 2.5

A. Ranganathan, M. Kaess, and F. Dellaert. Loopy sam. In IJCAI’07, 2007. 2.1.1

M. Richardson and P. Domingos. Markov logic networks. Mach. Learn., 62(1-2), 2006. 1, 6.2

D. Roth. On the hardness of approximate reasoning. In ijcai93, pages 613–618, 1993. 1.3

Prithviraj Sen, Amol Deshpande, and Lise Getoor. Bisimulation-based approximate lifted inference. In Proceedings of theTwenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI ’09, pages 496–505, Arlington, Virginia, United States,2009. AUAI Press. ISBN 978-0-9749039-5-8. URL http://portal.acm.org/citation.cfm?id=1795114.1795172. 2.2.3

E. Shapiro. Systolic programming: a paradigm of parallel processing. Concurrent Prolog, 1988. 3.1.1

Parag Singla and Pedro Domingos. Lifted first-order belief propagation. In Proceedings of the 23rd national conference onArtificial intelligence - Volume 2, pages 1094–1099. AAAI Press, 2008. ISBN 978-1-57735-368-3. URL http://portal.acm.org/citation.cfm?id=1620163.1620242. 2.2.3


Le Song, Arthur Gretton, and Carlos Guestrin. Nonparametric tree graphical models via kernel embeddings. In In ArtificialIntelligence and Statistics (AISTATS), May 2010. 2.4.4

J. Sun, H. Y. Shum, and N. N. Zheng. Stereo matching using belief propagation. In ECCV’02, 2002. 2.1

S. Tatikonda and M. I. Jordan. Loopy belief propogation and gibbs measures. In UAI’02, 2002. 2.1.1

Yee Whye Teh and Max Welling. On improving the efficiency of the iterative proportional fitting procedure. In AISTATS’03,2003. 4.2.2

M. Wainwright, T. Jaakkola, and A.S. Willsky. Tree-based reparameterization for approximate estimation on graphs with cycles.In NIPS, 2001. 2.1.1

Martin J. Wainwright, Tommi S. Jaakkola, and Alan S. Willsky. Tree-reweighted belief propagation algorithms and approximateml estimation by pseudo-moment matching. In In AISTATS, 2003. 3.2.2

J. Wolfe, A. Haghighi, and D. Klein. Fully distributed EM for very large datasets. In ICML. ACM, 2008. 1, 3.1.1

Yinglong Xia and Viktor K. Prasanna. Parallel exact inference on the cell broadband engine processor. In Proceedings of the2008 ACM/IEEE conference on Supercomputing, SC ’08, pages 58:1–58:12, Piscataway, NJ, USA, 2008. IEEE Press. ISBN978-1-4244-2835-9. URL http://portal.acm.org/citation.cfm?id=1413370.1413429. 2.5

Yinglong Xia and Viktor K. Prasanna. Scalable node-level computation kernels for parallel exact inference. IEEE Trans. Comput.,59:103–115, January 2010. ISSN 0018-9340. doi: http://dx.doi.org/10.1109/TC.2009.106. URL http://dx.doi.org/10.1109/TC.2009.106. 2.5

Yinglong Xia, Xiaojun Feng, and Viktor K. Prasanna. Parallel evidence propagation on multicore processors. In Proceedingsof the 10th International Conference on Parallel Computing Technologies, PaCT ’09, pages 377–391, Berlin, Heidelberg,2009. Springer-Verlag. ISBN 978-3-642-03274-5. doi: http://dx.doi.org/10.1007/978-3-642-03275-2 37. URL http://dx.doi.org/10.1007/978-3-642-03275-2_37. 2.5

F. Yan, N. Xu, and Y. Qi. Parallel inference for latent dirichlet allocation on graphics processing units. In NIPS, 2009. 2.3

C. Yanover and Y. Weiss. Approximate inference and protein folding. In NIPS, pages 84–86, 2002. 1

C. Yanover, O. Schueler-Furman, and Y. Weiss. Minimizing and learning energy functions for side-chain prediction. J ComputBiol, pages 381–395, 2007. 1, 6.2

J. Ye, J. Chow, J. Chen, and Z. Zheng. Stochastic gradient boosted distributed decision trees. In CIKM. ACM, 2009. 1, 3.1.1

J.S. Yedidia, W.T. Freeman, and Y. Weiss. Bethe free energy, kikuchi approximations, and belief propagation algorithms. Tech-nical report, Mitsubishi Electric Research Laboratories, 2001. 2.2.4

J.S. Yedidia, W.T. Freeman, and Y. Weiss. Understanding belief propagation and its generalizations. In Exploring artificialintelligence in the new millennium, pages 239–269, 2003. 2.2.4

A. L. Yuille. Cccp algorithms to minimize the bethe and kikuchi free energies: convergent alternatives to belief propagation.Neural Comput., 14:1691–1722, July 2002. ISSN 0899-7667. doi: 10.1162/08997660260028674. URL http://portal.acm.org/citation.cfm?id=638977.638986. 1.4, 4.2.2, 4.2.2
