
Page 1: MCMC Using Parallel Computation

MCMC Using Parallel Computation

Andrew Beam

ST 790 – Advanced Bayesian Inference

02/21/2013

Page 2: MCMC Using Parallel Computation

Outline

• Serial vs. Parallel Computation
• Chain-level parallelization in MCMC
• R code and example
• Kernel speed-up and parallelization for larger data sets
• Calling C/Java from R
• GPU example
• MCMC in pure Java
• Conclusions

Talk focus: a grab bag of ways to speed up R, plus methods that use alternative platforms.

Page 3: MCMC Using Parallel Computation

MCMC Methods

• MCMC methods are great!
  – They are general: they work for essentially any family of distributions.
  – They are exact (in the limit), unlike approximate methods such as variational inference.
  – They are flexible.

• They do have some drawbacks…
  – They can be painfully slow for complex problems.
  – And complex problems are exactly when we would like to use MCMC.

Page 4: MCMC Using Parallel Computation

[Cartoon: a grad student tells Dr. Advisor the simulation is still running ("Simulating!"); the reply: "Your funding has been revoked."]

Page 5: MCMC Using Parallel Computation

Serial vs. Parallel Computation

• Computers have used roughly the same architecture for nearly 70 years.
  – Proposed by John von Neumann in 1945.

• Programs are stored on disk and loaded into system memory (RAM) by a control unit.

• Instructions are executed one at a time by an arithmetic logic unit (ALU).

• A Central Processing Unit (CPU) consists of one ALU, control units, communication lines, and devices for I/O.
  – The ALU is what executes instructions.

Page 6: MCMC Using Parallel Computation

Moore's law

• “The number of transistors on integrated circuits doubles approximately every two years.”

– Intel co-founder Gordon E. Moore

• Around 2002, chip manufacturers began to shift from increasing the speed of single processors to providing multiple computational units on each processor.

• This is consistent with Moore's law, but it changes the way we, as programmers, have to use these chips.

• The CPU still consists of the same components, but it now has multiple ALUs capable of executing instructions at the same time.

Page 7: MCMC Using Parallel Computation

Multicore architectures

• Each central processing unit (CPU) now consists of multiple ALUs.
• It can execute multiple instructions simultaneously.
• Communication between cores happens through shared memory.
• Programs can create multiple execution "threads" to take advantage of this.
• The speed-up requires explicit parallel programming: running a program designed for a single-core processor on a multicore processor will not make it faster.

Page 8: MCMC Using Parallel Computation

Program execution hierarchy

• The atomic unit of execution is a thread.
  – A serial set of instructions, e.g. a = 2 + 2; b = 3 * 4; c = (b/a)^2
  – Threads communicate with each other using a variety of system-dependent protocols.

• A collection of threads is known as a process.
  – A process usually represents a common computational task.

• A collection of processes is known as an application.

• Example: the Google Chrome browser.
  – The entire browser is an application.
  – Each tab is a process controlling multiple threads.
  – Individual threads handle different elements of each page (music, chat, etc.).

Page 9: MCMC Using Parallel Computation

Decomposing the MCMC problem

• Given that our computers are capable of doing multiple things at once, we should use all the resources available to us.

• If a problem has independent parts, we should try to execute them in parallel. Such problems are often called "embarrassingly" parallel.

• Let's go through the general Metropolis-Hastings algorithm and look for parts we can speed up using parallel computation.

Page 10: MCMC Using Parallel Computation

General MH MCMC Algorithm

1. For each chain in 1 to the number of chains (Nc):
2.   For i in 1 to the number of simulation iterations (Ns):
3.     Draw θnew from the proposal density.
4.     Evaluate the posterior kernel at this point, K(θnew).
5.     Calculate the acceptance ratio ρ(K(θnew), K(θold)).
6.     Accept θnew with probability ρ(K(θnew), K(θold)).
7.   Repeat Ns times.
8. Repeat Nc times.
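For concreteness, here is a compact R skeleton of the loop above. It is a sketch, not code from the talk, and it assumes a symmetric (random-walk) proposal so that the proposal densities cancel in the acceptance ratio; propose() and log_kernel() are placeholder functions supplied by the user.

mh_sample <- function(theta0, n_iter, propose, log_kernel) {
  theta <- theta0
  log_k <- log_kernel(theta)                       # log K(theta_old)
  draws <- matrix(NA_real_, nrow = n_iter, ncol = length(theta0))
  for (i in seq_len(n_iter)) {
    theta_new <- propose(theta)                    # step 3: draw from the proposal
    log_k_new <- log_kernel(theta_new)             # step 4: evaluate the kernel
    if (log(runif(1)) < log_k_new - log_k) {       # steps 5-6: accept with probability rho
      theta <- theta_new
      log_k <- log_k_new
    }
    draws[i, ] <- theta
  }
  draws                                            # Ns x length(theta0) matrix of draws
}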

Page 11: MCMC Using Parallel Computation

How can we make this faster?

The outer loop over chains consists of Nc independent operations – we can run them concurrently on a multicore machine.

1. For each chain in 1 to the number of chains (Nc):
2.   For i in 1 to the number of simulation iterations (Ns):
3.     Draw θnew from the proposal density.
4.     Evaluate the posterior kernel at this point, K(θnew).
5.     Calculate the acceptance ratio ρ(K(θnew), K(θold)).
6.     Accept θnew with probability ρ(K(θnew), K(θold)).
7.   Repeat Ns times.
8. Repeat Nc times.

Page 12: MCMC Using Parallel Computation

Speeding up multi-chain simulations

• Without parallel computing, chains must be completed one at a time.

[Diagram: Chains 1–3 queued to run one after another on a machine with 4 available cores; no chains completed yet.]

Page 13: MCMC Using Parallel Computation

Speeding up multi-chain simulations

• The first chain finishes and the second one starts…

[Diagram: Chain 1 completed; Chain 2 running; Chain 3 still waiting; 4 available cores.]

Page 14: MCMC Using Parallel Computation

Speeding up multi-chain simulations

• The second chain is done and the third one starts…

[Diagram: Chains 1 and 2 completed; Chain 3 running; 4 available cores.]

Page 15: MCMC Using Parallel Computation

Speeding up multi-chain simulations

• Now all three chains are done.

• If one chain takes T seconds, the entire simulation takes 3*T seconds.

[Diagram: Chains 1–3 completed sequentially on a machine with 4 available cores.]

Page 16: MCMC Using Parallel Computation

Speeding up multi-chain simulations

• Now consider a parallel version.

• Each chain is assigned to its own core.

[Diagram: Chains 1–3 each running simultaneously on its own core; 4 available cores.]

Page 17: MCMC Using Parallel Computation

Speeding up multi-chain simulations

• The entire simulation takes T + ε, where ε is the thread communication penalty and is usually << T.

[Diagram: Chains 1–3 all completed, having run in parallel.]

Page 18: MCMC Using Parallel Computation

Multiple Chains vs. One Long Chain

• It has been claimed that one chain of length N*L is preferable to N chains of length L, due to the better mixing properties of the longer chain.
  – Robert and Casella agree (text, pg. 464).

• However, if you have a multicore machine, for moderately sized problems you should never use just one chain, because you can get additional samples for essentially no extra time.
  – Buy one, get two free.

• You can also use multiple chains to assess convergence, e.g. with the Gelman and Rubin diagnostic.

• More samples are always better, even if they are spread over a few chains.

Page 19: MCMC Using Parallel Computation

Chain-level parallelization in R

• For small to moderately sized problems, it's very easy to do this in R with minor changes to existing code.
  – Small to moderate size ≈ 100–1,000 observations and 10–100 predictors.

• Revolution Analytics has developed several packages to parallelize independent for loops:
  – Windows – doSNOW
  – Linux/OS X – doMC

• These use the foreach paradigm to split the work across several cores and then collect the results.
  – Use the special %dopar% operator to indicate parallel tasks.

• This is more of a "hack" than a fully concurrent library.

• It is a good approach if your simulation currently takes a reasonable amount of time and doesn't use too much memory.

Page 20: MCMC Using Parallel Computation

doMC/doSNOW Paradigm

For N parallel tasks (Task 1, Task 2, …, Task N-1, Task N):

1. Create N copies of the entire R workspace.
2. Run N R sessions in parallel, one per task.
3. Collect the results and return control to the original R session.

Page 21: MCMC Using Parallel Computation

Parallel Template (Windows) – the annotated code is sketched below:

1. Load the doSNOW package.
2. Register the parallel backend.
3. Indicate how the results should be collected.
4. Indicate that the loop should be executed in parallel.
5. Assign the return value of foreach to a variable.
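The slide's R code did not survive transcription; the following is a minimal sketch of the template the callouts describe (the task body is a placeholder, not code from the talk).

library(doSNOW)                             # 1. load the doSNOW package

cl <- makeCluster(4)                        # one worker per core you want to use
registerDoSNOW(cl)                          # 2. register the parallel backend

# 3. .combine says how the results should be collected; 4. %dopar% marks the
# loop as parallel; 5. the return value of foreach is assigned to a variable
results <- foreach(i = 1:4, .combine = rbind) %dopar% {
  c(task = i, value = mean(rnorm(1e5)))     # placeholder task
}

stopCluster(cl)
results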

Page 22: MCMC Using Parallel Computation

Example: Penalized Regression via MCMC
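The slide shows only the title; the model itself did not transcribe. Based on the prior and simulation slides that follow, the set-up appears to be a Bayesian lasso-style regression. As an assumption on my part, with a N(Xβ, I) likelihood and independent Laplace (double-exponential) priors with rate τ on the coefficients, the posterior kernel would be

K(β) ∝ [ ∏ i=1..N φ(yi − xiᵀβ) ] · exp( −τ ∑ j=1..p |βj| )

where φ is the standard normal density; larger τ penalizes the coefficients more heavily.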

Page 23: MCMC Using Parallel Computation

Prior for β

[Plot: prior density for β shown for τ = 1, τ = 0.5, and τ = 0.1.]

Page 24: MCMC Using Parallel Computation

Simulation Description

• N = 100, p = 10
• Xij ~ N(0, 1)
• True model: Yi = Xi2 + Xi6 + ei, where ei ~ N(0, 1) (only predictors 2 and 6 enter the model)
• Full penalty -> τ = 1
• Sample using RW-MH with proposal βnew = βold + ε, where ε ~ N(0, (1/(2p))^2)
• Simulate initially for Nc = 1 chain and Ns = 100,000 iterations; keep 10,000 samples for inference
• Hardware: MacBook Pro, quad-core i7 with hyperthreading, 8 GB RAM
• Sampler code is available in chain-parallel-lasso.R; demonstration code is in mcmc-lasso.R (a minimal stand-in sketch of such a sampler appears below)
• N and p are adjustable so you can see how many chains your system can handle for a given problem size
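The actual sampler lives in chain-parallel-lasso.R, which is not reproduced here. As a stand-in, this is a minimal single-chain sketch under the assumptions above (Gaussian likelihood with unit variance, Laplace prior with rate τ, a full-vector random-walk proposal, and a preallocated sample matrix):

run_chain <- function(X, y, n_iter = 1e5, n_keep = 1e4, tau = 1) {
  p <- ncol(X)
  prop_sd <- 1 / (2 * p)                      # proposal sd from the slides: eps ~ N(0, (1/(2p))^2)

  log_kernel <- function(beta) {
    sum(dnorm(y, mean = as.vector(X %*% beta), sd = 1, log = TRUE)) -  # log-likelihood
      tau * sum(abs(beta))                                             # log Laplace prior (up to a constant)
  }

  beta      <- rep(0, p)
  log_k_old <- log_kernel(beta)
  samples   <- matrix(NA_real_, nrow = n_keep, ncol = p)   # preallocate -- avoid rbind!
  keep_from <- n_iter - n_keep

  for (i in seq_len(n_iter)) {
    beta_new  <- beta + rnorm(p, sd = prop_sd)             # random-walk proposal (full-vector update)
    log_k_new <- log_kernel(beta_new)
    if (log(runif(1)) < log_k_new - log_k_old) {           # MH accept/reject on the log scale
      beta      <- beta_new
      log_k_old <- log_k_new
    }
    if (i > keep_from) samples[i - keep_from, ] <- beta    # keep only the last n_keep draws
  }
  samples                                                  # n_keep x p matrix of kept draws
}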

Page 25: MCMC Using Parallel Computation

Simulation Results

[Plots of the posterior samples: β2 ≈ 1 and β6 ≈ 1; β1, β3, β4, β5, β7, β8, β9, β10 ≈ 0.]

Page 26: MCMC Using Parallel Computation

Simulation Results

• The sampler appears to be working – it finds the correct variables.
• Execution time: 3.02 s
• RAM used: 30 MB
• Returns an Nkeep × p matrix of samples.
• Avoid using rbind at all costs – preallocate the sample matrix instead.

[Sampler code screenshot, annotated with its propose, evaluate, and accept/reject steps.]

Page 27: MCMC Using Parallel Computation

Multiple Chain Penalized Regression

• Wrap the single-chain sampler into a function.
• Call it multiple times from within the foreach loop.

[Code screenshot, annotated: register the parallel backend; call foreach with %dopar%; handle a Windows quirk; call the single-chain sampler; return each chain's results to be collected. A stand-in sketch follows below.]
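The annotated code itself did not transcribe. This is a minimal stand-in, using the run_chain() sketch above; the wrapper name, and the guess that the "Windows quirk" is exporting the sampler to the SNOW workers, are mine, not the author's.

library(doSNOW)

run_parallel_chains <- function(X, y, n_chains = 2, n_iter = 1e5, n_keep = 1e4, tau = 1) {
  cl <- makeCluster(n_chains)                 # register the parallel backend
  registerDoSNOW(cl)
  on.exit(stopCluster(cl))

  # .export ships run_chain to the worker sessions, which start out empty --
  # possibly the "Windows quirk" the slide refers to (an assumption)
  chains <- foreach(k = seq_len(n_chains), .export = "run_chain") %dopar% {
    run_chain(X, y, n_iter = n_iter, n_keep = n_keep, tau = tau)   # one chain per worker
  }
  chains                                      # list of n_keep x p sample matrices, one per chain
}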

Page 28: MCMC Using Parallel Computation

Multiple Chain Penalized Regression

Example call using 2 chains:
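The call on the slide is a screenshot; a hypothetical equivalent using the sketch above and the earlier simulation settings (N = 100, p = 10, true predictors 2 and 6) might look like:

set.seed(790)
N <- 100; p <- 10
X <- matrix(rnorm(N * p), N, p)
Y <- X[, 2] + X[, 6] + rnorm(N)

fit <- run_parallel_chains(X, Y, n_chains = 2)
length(fit)          # 2 chains
colMeans(fit[[1]])   # posterior means from chain 1; components 2 and 6 should be near 1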

Page 29: MCMC Using Parallel Computation

Multiple Chain Execution Times

[Chart: total execution time versus number of chains, with the single-chain execution time marked for reference.]

Page 30: MCMC Using Parallel Computation

doSNOW/doMC not just for MCMC

• It is worth noting that this paradigm is useful for many other tasks in R – any task with "embarrassingly" parallel portions:
  – Bootstrap (a sketch follows below)
  – Permutations
  – Cross-validation
  – Bagging
  – etc.

• Be aware that every parallel R session will use the same amount of memory as the parent.
  – Computation happens 5x faster, but uses 5x as much memory.
  – This is application-level parallelism.
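As an illustration (not from the talk), the same foreach/%dopar% pattern applied to a bootstrap of the sample mean:

library(doSNOW)
cl <- makeCluster(4)
registerDoSNOW(cl)

x <- rnorm(200)                                   # toy data
boot_means <- foreach(b = 1:1000, .combine = c) %dopar% {
  mean(sample(x, replace = TRUE))                 # one bootstrap replicate
}

stopCluster(cl)
quantile(boot_means, c(0.025, 0.975))             # bootstrap interval for the mean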

Page 31: MCMC Using Parallel Computation

Back to the MH Algorithm

Unfortunately, the loop over iterations is inherently serial – each draw depends on the previous state – so we can't speed it up through parallelization.

1. For each chain in 1 to the number of chains (Nc):
2.   For i in 1 to the number of simulation iterations (Ns):
3.     Draw θnew from the proposal density.
4.     Evaluate the posterior kernel at this point, K(θnew).
5.     Calculate the acceptance ratio ρ(K(θnew), K(θold)).
6.     Accept θnew with probability ρ(K(θnew), K(θold)).
7.   Repeat Ns times.
8. Repeat Nc times.

Can we speed up the work inside the inner loop?

Page 32: MCMC Using Parallel Computation

What is the most expensive part of each iteration?

1. For each chain in 1 to the number of chains (Nc):
2.   For i in 1 to the number of simulation iterations (Ns):
3.     Draw θnew from the proposal density.
4.     Evaluate the posterior kernel at this point, K(θnew).
5.     Calculate the acceptance ratio ρ(K(θnew), K(θold)).
6.     Accept θnew with probability ρ(K(θnew), K(θold)).
7.   Repeat Ns times.
8. Repeat Nc times.

• At each iteration we have to evaluate the kernel over N observations and p parameters – this is the most expensive step.

• As N gets bigger, we can use parallel devices to speed this up. Communication penalties preclude multicore CPUs from speeding up this step.

Page 33: MCMC Using Parallel Computation

Kernel Evaluations

• For the penalized regression problem we have 3 main kernel computations:
  – Evaluate Xβ – could be expensive for large N or large p.
  – Evaluate the likelihood of Y given Xβ – N evaluations of the normal PDF.
  – Evaluate the prior for β – cheap for small p, potentially expensive for large p.

• For demonstration, let's assume we are in a scenario with large N and reasonable p:
  – Xβ – not expensive
  – Likelihood – expensive
  – Prior – very cheap

Page 34: MCMC Using Parallel Computation

Kernel Evaluations in Java and C

• If evaluating our kernel is too slow in R, we can drop down to another language from within R and evaluate it there.
  – This is actually what dnorm() in R does.

• R has native support for calling C functions.

• The rJava package provides an interface for calling Java from R.
  – For small problems (e.g. N ~ 100) C will be faster.
  – Reason: there is also a communication penalty, since R can't call Java natively. R has to transfer data to Java, start execution, and then retrieve the results.

• Java has a rich set of external libraries for scientific computing.
• Java is portable – write once, run anywhere.
• No pointers!
• Built-in memory management – no malloc.
• A unified concurrency library: you can write multithreaded programs that will run on any platform.
• Thread-level parallelism saves memory.

Page 35: MCMC Using Parallel Computation

Example Java code for a kernel evaluation function called from R

[Java code screenshot, annotated: the section that evaluates the likelihood, the section that evaluates the prior, a helper function that evaluates the normal PDF, and a helper function that evaluates the prior.]

Page 36: MCMC Using Parallel Computation

Calling Java from R

• The rJava package has R–Java bindings that allow R to use Java objects (a minimal R-side sketch follows below).

• The Java-based kernel for penalized regression is available in java_based_sampler.R, and the Java source code is in dnorm.java.
  – The compiled Java code is in dnorm.class.
  – To compile the source, use the javac command: javac dnorm.java
  – This is not necessary, but if you modify the source you will have to recompile.

• Performance is slightly worse than the C code used by dnorm() in R.

• Abstracting out the kernel evaluation is a good idea if you have complicated kernels not available in R, or if you have large values of N or p.
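A minimal R-side sketch of what such a call looks like with rJava. The class name dnorm comes from the slides, but the method name and signature below are hypothetical placeholders, not the interface in the actual dnorm.java:

library(rJava)
.jinit(classpath = ".")      # start the JVM; assumes dnorm.class is in the working directory

# toy inputs for illustration; numeric vectors map to double[] on the Java side
y   <- rnorm(100)
xb  <- rnorm(100)
tau <- 1

kern <- .jnew("dnorm")       # instantiate the compiled class

# hypothetical method: double logKernel(double[] y, double[] xb, double tau)
log_k <- .jcall(kern, "D", "logKernel", y, xb, tau)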

What if you have so much data that even Java or C isn't enough?

Page 37: MCMC Using Parallel Computation

GPU Computing

[Diagram: GPU vs. CPU architecture.]

• GPU – Graphics Processing Unit
  – Designed to do floating-point math very quickly for image processing.
  – Cheap – usually a few hundred dollars.
  – Highly parallel – 100s to 1000s of processors per unit.
  – Very few control units – designed for floating-point operations.
  – Good for doing a lot of simultaneous calculations (e.g. +, -, /, *).

• CUDA – Compute Unified Device Architecture
  – An SDK and driver set that gives access to NVIDIA floating-point units.
  – Provides a C-level interface and compiler.
  – We write the kernel evaluation in CUDA C.

Page 38: MCMC Using Parallel Computation

CUDA Programming Paradigm

• CUDA is based on the notions of a host (i.e. the CPU) and a device (i.e. the GPU).

• Create data objects (arrays, matrices, etc.) on the host, transfer them to the device, and then retrieve the results.

• Computation is accomplished via "kernels" – small computational programs.

• GPU threads are each given a block of data, which they use to execute the kernel.

• Compile C programs using NVIDIA's nvcc compiler.

• We can use this to do the expensive normal PDF evaluation.

Page 39: MCMC Using Parallel Computation

GPU Simulation Setup

• Perform N normal PDF evaluations, for N ranging from 100 to 100,000,000.
• Compare to R's native dnorm() (a sketch of the R-side timing appears below).
• Hardware: GeForce GTX 650 – 2 GB RAM, 384 CUDA processors.
• Requires the CUDA SDK 3.0, a compatible GPU, and CUDA drivers.
• Code is available in kernel.cu.
• Compile on Linux using: nvcc kernel.cu -o kernel

Page 40: MCMC Using Parallel Computation

Execution Times

        N     GPU (s)    dnorm (s)
      100    0.000348      0.0001
    1,000    0.000524      0.0004
   10,000    0.000733      0.001
    1e+05    0.00124       0.007
    1e+06    0.006677      0.097
    5e+06    0.0235        0.385
    1e+07    0.053936      0.772
    5e+07    0.22657       3.75
    1e+08    0.542329      7.567

Page 41: MCMC Using Parallel Computation

Speed up

With a sample size of 5 million, a simulation of 100,000 iterations would take about 10.7 hours using dnorm() (0.385 s × 100,000 ≈ 38,500 s), but only about 40 minutes on the GPU (0.0235 s × 100,000 ≈ 2,350 s).

Page 42: MCMC Using Parallel Computation

GPUs and MCMC – Not just about speed

Pierre Jacob, Christian P. Robert, Murray H. Smith

"Using parallel computation to improve Independent Metropolis-Hastings based estimation"

Abstract: In this paper, we consider the implications of the fact that parallel raw-power can be exploited by a generic Metropolis-Hastings algorithm if the proposed values are independent. In particular, we present improvements to the independent Metropolis-Hastings algorithm that significantly decrease the variance of any estimator derived from the MCMC output, for a null computing cost since those improvements are based on a fixed number of target density evaluations. Furthermore, the techniques developed in this paper do not jeopardize the Markovian convergence properties of the algorithm, since they are based on the Rao-Blackwell principles of Gelfand and Smith (1990), already exploited in Casella and Robert (1996), Atchade and Perron (2005) and Douc and Robert (2010). We illustrate those improvements both on a toy normal example and on a classical probit regression model, but stress the fact that they are applicable in any case where the independent Metropolis-Hastings is applicable.

http://arxiv.org/abs/1010.1595

Page 43: MCMC Using Parallel Computation

Non-R based solutions

• If you want your simulation to just "go faster", you will probably need to abandon R altogether.

• Implement everything in another language like C/C++ or Java and it will (most likely) be orders of magnitude faster.
  – Compiled code is, in general, much faster.
  – You can stop having wacky R memory issues.
  – You control the level of parallelization (thread vs. process vs. application).
  – You can interface more easily with devices like GPUs.
  – Pain now will be rewarded later.

Page 44: MCMC Using Parallel Computation

BAJA – Bayesian Analysis in JAva

• ST 790 project.
• A pure-Java, general MH-MCMC sampler.
  – Adaptive sampling options available.
  – Model specification via a directed acyclic graph (DAG).
  – Parallel chain execution – multithreaded.
  – Stochastic thinning via resolvent.
  – Memory efficient (relatively).
  – Automatic generation of trace plots, histograms, and ACF plots.
  – Exports samples to .csv or R.

Page 45: MCMC Using Parallel Computation

BAJA Parallel Performance

[Performance chart for chain length L = 1,000,000.]

Page 46: MCMC Using Parallel Computation

Speed comparison with several platforms

[Chart: speed comparison across several platforms.]

Page 47: MCMC Using Parallel Computation

Thank you!