Approximate Bayesian computation with differential evolution

Journal of Mathematical Psychology

journal homepage: www.elsevier.com/locate/jmp

Approximate Bayesian computation with differential evolution✩

Brandon M. Turner a, Per B. Sederberg b,∗

a University of California, Irvine, United States
b The Ohio State University, United States

Article info

Article history:
Received 16 May 2012
Received in revised form 12 June 2012
Available online xxxx

Keywords:
Approximate Bayesian computation
Differential evolution
Computational modeling
Likelihood-free inference

Abstract

Approximate Bayesian computation (ABC) is a simulation-based method for estimating the posterior distribution of the parameters of a model. The ABC approach is instrumental when a likelihood function for a model cannot be mathematically specified, or has a complicated form. Although difficulty in calculating a model’s likelihood is extremely common, current ABC methods suffer from two problems that have largely prevented their mainstream adoption: long computation time and an inability to scale beyond a few parameters. We introduce differential evolution as a computationally efficient genetic algorithm for proposal generation in our ABC sampler. We show how using this method allows our new ABC algorithm, called ABCDE, to obtain accurate posterior estimates in fewer iterations than kernel-based ABC algorithms and to scale to high-dimensional parameter spaces that have proven difficult for current ABC methods.

© 2012 Elsevier Inc. All rights reserved.

1. Introduction

Bayesian statistics has become an enormously popular tool for the analysis of random variables. While there are a myriad of practical and theoretical reasons to adopt the Bayesian framework, one clear advantage is the quantification of uncertainty in the form of the posterior distribution. In contrast to classic null hypothesis testing, directed analysis of the posterior distribution allows for statistical inference that does not compromise the process theory behind the experiments generating the data. In the last decade, many have turned to the problem of Bayesian estimation when the likelihood function is difficult or impossible to evaluate. One popular method for solving this problem is called approximate Bayesian computation (ABC). While ABC originated in genetics (Beaumont, Zhang, & Balding, 2002; Marjoram, Molitor, Plagnol, & Tavare, 2003; Pritchard, Seielstad, Perez-Lezaun, & Feldman, 1999; Tavaré, Balding, Griffiths, & Donnelly, 1997), it has received a considerable amount of attention in ecology, epidemiology, and systems biology (Beaumont, 2010), as well as, more generally, with regard to model choice (Robert, Cornuet, Marin, & Pillai, 2011). Recently, ABC has also become an important tool in the analysis

✩ Portions of this work were presented at the 8th Annual Context and Episodic Memory Symposium, Bloomington, Indiana, and the 45th Annual Meeting of the Society for Mathematical Psychology, Columbus, Ohio. The authors would like to thank Trisha Van Zandt, Simon Dennis, Rich Shiffrin, Jay Myung, and Dan Navarro for helpful comments that improved an earlier version of this article.
∗ Corresponding author.

E-mail addresses: [email protected] (B.M. Turner), [email protected] (P.B. Sederberg).

of models of human behavior, whose likelihoods are often rather complex (Turner & Van Zandt, 2012).

The basic ABC algorithm proceeds as follows. Given an observed data set Y, some unknown parameters θ with prior distribution π(θ), and an assumed model such that Y ∼ Model(y|θ), ABC algorithms propose candidate parameter values by randomly sampling θ∗ and generating a data set X ∼ Model(x|θ∗), where the method used to draw θ∗ varies from one algorithm to the next. Because this random data set is always generated and paired with the random parameter value θ∗, we often use the notation (θ∗, X) to refer to a single point in the joint distribution of θ∗ and X. The idea is that if the distance ρ(X, Y) between the two data sets is small enough, specifically, less than a value ϵ, we can assume that the proposed candidate parameter value θ∗ has some nonzero probability of being in the posterior distribution π(θ|Y). In so doing, ABC algorithms attempt to approximate the posterior distribution by sampling from

π(θ|Y) ∝ ∫X π(θ) Model(x|θ) I(ρ(X, Y) ≤ ϵ) dx,  (1)

where X is the support of the simulated data and I(a) is an indicator function returning a one if the condition a is satisfied and a zero otherwise.
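As a concrete illustration of the rejection scheme in Eq. (1), the sketch below recovers the mean of a normal model under a flat prior. This is a minimal sketch, not an implementation from the article; the function names, the sample-mean distance, and the tolerance value are all illustrative assumptions.

```python
import random

def abc_rejection(y_obs, prior_sample, simulate, distance, epsilon, n_samples):
    """Basic ABC rejection: keep theta* whenever rho(X, Y) <= epsilon."""
    accepted = []
    while len(accepted) < n_samples:
        theta = prior_sample()               # draw theta* from pi(theta)
        x = simulate(theta)                  # X ~ Model(x | theta*)
        if distance(x, y_obs) <= epsilon:    # the indicator in Eq. (1)
            accepted.append(theta)
    return accepted

# Toy example: unknown mean of a normal with known sd = 1,
# comparing data sets through their sample means.
random.seed(1)
y = [random.gauss(3.0, 1.0) for _ in range(50)]
mean = lambda d: sum(d) / len(d)
post = abc_rejection(
    y_obs=y,
    prior_sample=lambda: random.uniform(-10.0, 10.0),    # flat prior
    simulate=lambda th: [random.gauss(th, 1.0) for _ in range(50)],
    distance=lambda x, yo: abs(mean(x) - mean(yo)),      # rho(X, Y)
    epsilon=0.2,
    n_samples=100,
)
```

As ϵ shrinks, the accepted θ∗ concentrate around the observed sample mean, at the cost of many more rejected simulations.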

The ABC rejection framework (see Turner & Van Zandt, 2012 for a tutorial) has been embedded in Markov chain Monte Carlo (Marjoram et al., 2003), sequential Monte Carlo (Beaumont, Cornuet, Marin, & Robert, 2009; Del Moral, Doucet, & Jasra, 2008; Sisson, Fan, & Tanaka, 2007; Toni, Welch, Strelkowa, Ipsen, & Stumpf, 2009), expectation propagation (Barthelmé & Chopin, 2011), and Gibbs sampling (Turner & Van Zandt, submitted for publication). While these algorithms have become

0022-2496/$ – see front matter © 2012 Elsevier Inc. All rights reserved. doi:10.1016/j.jmp.2012.06.004


increasingly more complex and efficient, they all still rely on satisfying the condition ρ(X, Y) ≤ ϵ. Hence, the accuracy of the estimated posterior distribution and the computational efficiency hinge on the selection of the tolerance threshold ϵ: large values of ϵ produce inaccurate posteriors, while small values of ϵ produce long computation times. Long computation times result from the decreased likelihood of proposing acceptable parameter values θ∗ that produce data X such that the condition ρ(X, Y) ≤ ϵ is satisfied.

A number of methods have been proposed to balance the trade-off between long computation times and accurate posterior estimates. One common approach is to select a set of ϵ values that decrease monotonically from a larger tolerance value to some small one. This allows the algorithm to converge slowly to a computationally feasible final value of ϵ. Other methods allow for larger values of ϵ, and instead rely on a post-simulation correction to obtain accurate estimates of the posterior (Beaumont et al., 2002; Blum & François, 2010). These algorithms proceed as described above but then use regression techniques to reweight the samples based on their distance from the observed data Y. Although these techniques are very useful in practice, the post hoc nature of the correction suggests that the samplers used prior to the correction could be substantially improved. That is, algorithms that use the reweighting scheme internally would be more efficient than the regression correction schemes.

Although the basic ABC framework has been instrumental in furthering the widespread usage of Bayesian analysis, it currently suffers from the so-called ‘‘curse of dimensionality’’ (Beaumont, 2010). That is, when the number of parameters is increased, the acceptance rate declines dramatically because successful sets of parameter values (i.e., sets of parameter values that satisfy the tolerance threshold condition) become very hard to find. To attenuate this problem, a few new algorithms replace the indicator function in Eq. (1) with a kernel function ψ(ρ(X, Y)|δ) (Wilkinson, submitted for publication). The kernel function provides a continuous measure for the evaluation of a proposal parameter set θ∗ rather than a dichotomous accept or reject (a one or a zero, respectively). While this dramatically improves the computational efficiency, the accuracy of the estimated posterior now relies on the selection of δ. In addition, when sampling from a high-dimensional parameter space, the selection of tuning parameters such as the transition kernel becomes vitally important.

In this article we present a new ABC algorithm, which we call ABCDE, that uses differential evolution (DE) as a means of proposal generation (Hu & Tsui, 2005; ter Braak, 2006; Vrugt et al., 2009). DE is an extremely efficient population-based genetic algorithm for minimizing real-valued cost functions (Storn & Price, 1997). Instead of relying solely on random mutations of each member of the population to drive evolution, DE creates a self-organizing population by evolving each member based on the weighted difference between two other members of the population, similar in spirit to particle swarm optimization (Havangi, Nekoui, & Teshnehlab, 2010; Kennedy & Eberhart, 1995; Tong, Fang, & Xu, 2006). In practice, DE requires far fewer population members than other genetic algorithms, typically just enough to span the possible parameter space, because the entire population evolves towards the minimum cost function values (or, in our case, the regions of the parameter space with the highest densities). Here we merge ABC and DE, including enhancements to the basic DE algorithm to make it suitable for posterior estimation instead of simply locating the global minimum (Hu & Tsui, 2005).

We first describe the algorithm (presented in Fig. 1) and motivate the use of each of its components. We then use the algorithm to fit three models whose posterior distributions are known to assess the accuracy of the estimates obtained using ABCDE. In the first example, we show that a reduced version of the ABCDE algorithm can easily handle the classic mixture of normals example that is prevalent in the ABC literature (Beaumont et al., 2009; Sisson et al., 2007; Toni et al., 2009; Wilkinson, submitted for publication). We then assess the accuracy of the ABCDE algorithm by comparing it to a basic kernel-based ABC algorithm (Wilkinson, submitted for publication). To do so, we use both algorithms to fit a model whose joint posterior distribution is highly correlated. In our final example, we test the scalability of ABCDE by fitting it to a model with 20 parameters, a problem that is extremely difficult for existing ABC algorithms to solve. We close with a brief discussion of the potential advancements and provide guidelines for the selection of the tuning parameters.

Fig. 1. The ABCDE algorithm for estimating the posterior distribution of θ.

2. The algorithm

The ABCDE algorithm combines DE with a distributed genetic algorithm approach that relies on three steps: crossover, mutation, and migration (Hu & Tsui, 2005). After a high-level overview of the algorithm, we will discuss each of these steps in turn. Fig. 1 illustrates the structure of the ABCDE algorithm. We begin by selecting a pool of P particles and dividing this pool evenly into K groups, each of size G = P/K. Although we will refer to the individual members of our sampler as particles, they are actually a pool of Markov chains that interact, forming an entire system very similar to population Monte Carlo samplers. The particles are first initialized; for example, one could initialize the particles by sampling from the prior distribution π(θ) for the parameters θ. In Fig. 1, this step is represented by Lines 1–2. Once all particles are initialized, the first step in the algorithm is to decide whether or not a migration step should be performed, with probability α. As we will soon discuss, the migration step is the distributed genetic algorithm's method of diversifying each of the groups, and it is represented in Fig. 1 by Lines 4–6. Subsequently, the ABCDE algorithm updates the particles by means of a crossover step (the core DE particle update mechanism) performed with probability 1 − β or by means of a mutation step (the standard particle perturbation method) with probability β. The decision to update via the crossover or the mutation step is made for each of the K groups. This cyclical updating procedure is represented in Fig. 1 by Lines 7–13.

In practice, the ‘‘decision’’ tuning parameters will be small (e.g., α = β = 0.10), such that a majority of the particle updates occur via the DE crossover step, and migration only occurs a few times throughout the simulation. The new proposed particles are assigned ‘‘fitness’’ values, which are used to decide individually whether the new proposals will replace the current particles. As with all ABC algorithms, an estimate of the posterior is formed by collapsing the values obtained from the sampler across all iterations, ignoring a (short) burn-in period. In the next sections, we first describe how to evaluate the particles at each iteration and then we discuss each of the components of the ABCDE algorithm in detail.
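The outer loop just described can be sketched as follows, with migrate, crossover, and mutate as placeholders for the steps detailed in the sections below; the function names and interface are our own assumptions, not the article's code.

```python
import random

def abcde_step(groups, alpha, beta, migrate, crossover, mutate,
               rng=random.Random(0)):
    """One iteration of the ABCDE control flow sketched in Fig. 1.

    groups  : list of K lists of particles (P = K * G particles in total)
    alpha   : probability of performing a migration step this iteration
    beta    : probability a group updates by mutation (else by DE crossover)
    migrate, crossover, mutate : user-supplied step functions (assumptions)
    """
    if rng.random() < alpha:          # occasional between-group migration
        migrate(groups)
    for k, group in enumerate(groups):
        if rng.random() < beta:       # rare: perturb each particle
            groups[k] = mutate(group)
        else:                         # common: DE crossover update
            groups[k] = crossover(group)
    return groups
```

With small α = β = 0.10, the vast majority of updates go through the crossover branch, matching the behavior described in the text.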

2.1. Weighting particles

One of the advancements that makes the ABCDE algorithm possible is the ability to assign a continuous measure of a particle’s ‘‘fitness’’, rather than simply accepting or rejecting a proposed parameter value. We make a careful distinction here between a particle’s weight and a particle’s fitness. There are many particle filtering ABC algorithms (e.g., Beaumont et al. (2009), Del Moral et al. (2008), Sisson et al. (2007), and Toni et al. (2009)) that apply continuous weights to their particles; however, these algorithms do not apply continuous measures to a particle’s fitness. Instead, they rely on a simple accept/reject evaluation of a particle’s fitness prior to calculating the weights. That is, when a particle is proposed that does not satisfy the tolerance condition ρ(X, Y) ≤ ϵ, the proposal is given a weight of zero because the indicator function in Eq. (1) will be zero. Thus, the fitness evaluation used by these rejection-based algorithms can result in many samples being drawn that are never used, producing high inefficiency.

To calculate our weights, we draw on a framework for applying continuous measures to a particle’s fitness (Beaumont et al., 2002; Blum & François, 2010; Wilkinson, submitted for publication). Specifically, we assume that the data Y is a realization of a model simulation under the best possible parameter values plus some random error, so that

Y = Model(y|θ) + ζ,  (2)

where ζ follows the distribution ψ(·|δ) governed by the parameter δ. By assuming that the random error is continuous, we can use this error term to evaluate a particle’s fitness. For example, if ψ(·|δ) is centered at zero, we can compute the distance ρ(X, Y) and evaluate ψ(ρ(X, Y)|δ). If ψ(·|δ) then follows, say, a normal distribution, simulated data that are closer to the observed data will have higher ψ(ρ(X, Y)|δ) values than simulated data that are far from the observed data, because ρ(X, Y) will be smaller. As a result, the approximation in Eq. (1) becomes

π(θ|Y) ∝ ∫X π(θ) Model(x|θ) ψ(ρ(X, Y)|δ) dx.

We will refer to the class of algorithms that use this weighting scheme as kernel-based ABC (KABC) algorithms. Because ABCDE is, at its core, a KABC algorithm, we will compare the efficiency of the proposal mechanism used in ABCDE (discussed in detail below) against the proposal mechanism used by a simple Markov chain Monte Carlo KABC algorithm later on in this article.
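For instance, if ψ(·|δ) is taken to be a normal density centered at zero, as in the example above, a particle's fitness reduces to a simple kernel weight; the function name here is ours, not the article's.

```python
import math

def gaussian_psi(rho_xy, delta):
    """Normal kernel psi(rho(X, Y) | delta) centered at zero: simulated
    data closer to the observed data receive a larger fitness value."""
    return (math.exp(-rho_xy ** 2 / (2.0 * delta ** 2))
            / (math.sqrt(2.0 * math.pi) * delta))

# Smaller distances get strictly larger weight.
w_near = gaussian_psi(0.1, delta=1.0)
w_far = gaussian_psi(2.0, delta=1.0)
```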

The ABCDE algorithm uses the Metropolis–Hastings probability to determine whether a proposal should be accepted or rejected, and so we begin by outlining the Metropolis–Hastings step in the context of our algorithm. First, we let the current particle state at Iteration t + 1 be denoted θt, and the corresponding simulated data that was generated on Iteration t be denoted Xt. Then, for a given proposal and corresponding simulated data (θ∗, X), we can evaluate the proposal’s fitness by calculating

π(θ∗) ψ(ρ(X, Y)|δ) q(θt|θ∗),

where q(a|b) is called the ‘‘transition kernel’’ and represents the probability of transitioning to a from b. We will delay further discussion of the transition kernel to the sections on the mutation and crossover steps, because the form of q(a|b) varies between these two steps, which rely on different perturbation methods. We can then compare the proposal’s fitness to the current particle state by calculating

π(θt) ψ(ρ(Xt, Y)|δ) q(θ∗|θt).

Once the fitness for both the current state and the proposal havebeen calculated, we ‘‘jump’’ from the current state to the proposalstate with Metropolis–Hastings probability

min1,π(θ∗)ψ(ρ(X, Y )|δ)q(θt |θ∗)

π(θt)ψ(ρ(Xt , Y )|δ)q(θ∗|θt)

. (3)

Here, a jump is performed by setting (θt+1, Xt+1) = (θ∗, X), other-wise the chain will not move and (θt+1, Xt+1) = (θt , Xt).
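A single Metropolis–Hastings decision of this kind can be sketched as follows; the argument names mirror the symbols of Eq. (3), but the interface itself (and the fallback when the denominator is zero) is our own assumption.

```python
import random

def mh_accept(theta_t, x_t, theta_star, x_star, y,
              prior, psi, rho, q, delta, rng=random.Random(0)):
    """Accept or reject (theta*, X) with the Metropolis-Hastings
    probability of Eq. (3). prior, psi, rho, and q are user-supplied
    functions standing in for pi, psi, rho, and the transition kernel."""
    num = prior(theta_star) * psi(rho(x_star, y), delta) * q(theta_t, theta_star)
    den = prior(theta_t) * psi(rho(x_t, y), delta) * q(theta_star, theta_t)
    p = min(1.0, num / den) if den > 0 else 1.0
    if rng.random() < p:
        return theta_star, x_star    # jump: (theta_{t+1}, X_{t+1}) = (theta*, X)
    return theta_t, x_t              # stay: the chain does not move
```

With a symmetric transition kernel the two q terms cancel, and a proposal whose simulated data lie closer to Y than the current state's is always accepted.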

2.1.1. Selecting δ

Most other ABC algorithms rely on a rejection sampling routine to first decide whether a proposal θ∗ is suitable for approximating the posterior distribution. As previously discussed, this decision is based on a comparison to a tolerance threshold ϵ, so that a particle is accepted only if ρ(X, Y) ≤ ϵ. As such, the accuracy of the estimated posterior distribution π(θ|ρ(X, Y) ≤ ϵ) clearly depends on the precision of ϵ. Using a kernel function to evaluate a particle’s fitness does not solve this estimation problem, because our approximation of the true posterior distribution π(θ|Y) will still depend on the value of δ.

While this might appear concerning, the difference between the two approaches lies in the specification of the model, as shown in Eq. (2). The observed data is assumed to have arisen from a particular model plus some error term ζ. As a result, we can incorporate the parameters of the distribution of ζ directly into our model. Specifically, we can treat δ as a parameter in the model and estimate it rather than (subjectively) choosing a tolerance condition. To do so, we must specify a prior distribution for δ (e.g., an exponential distribution that favors small values) and include it in the calculation of Eq. (3). At first, the values of δ will be large because the chains are unlikely to be in the high-density regions of the posterior. However, as the algorithm proceeds, smaller values of δ will carry larger weights, and, as a consequence, the estimated posterior distribution will become more accurate. In the simulations that follow, we demonstrate how freeing δ allows for a nearly automatic parameter estimation procedure.

It has been shown that under the assumption of model error (see Eq. (2)), ABC provides exact samples from the posterior distribution (Wilkinson, submitted for publication). We extend this finding by applying a continuous-measure weighting scheme to an entire network of Markov chains to facilitate estimation that is both accurate and fast. In particular, using a system of chains allows our ABCDE algorithm to jump across the modes of a target distribution, as well as tackle more difficult high-dimensional problems where ABC MCMC approaches are likely to suffer from high rejection rates.

2.2. Crossover

We propose an efficient method of generating candidate particles based on DE (Storn & Price, 1997; ter Braak, 2006). The central idea behind DE is to allow the particles in the set to inform the proposal process. Particles that are performing well are selected to guide other, poorer performing particles to higher-density regions. We implement DE within a distributed genetic algorithm framework, which divides the population of particles into sub-populations that evolve largely independently. This distributed sub-population approach has been shown to help


Fig. 2. Examples of the crossover step (left panel) and the migration step (right panel).

prevent falling into local minima when minimizing global cost functions (Tanese, 1989).

Fig. 2 (left panel) provides an illustration of the crossover procedure for generating the new proposals for each particle in the kth group. In this example we are updating the individual particle θt. First, we sample a ‘‘base’’ particle θb (shown as the vector from the origin) with the current weights as probabilities, and two other particles θm and θn (bottommost vectors) with uniform probabilities, so that θm ≠ θn ≠ θt. Next, we compute two directional vectors. The first directional vector is obtained by taking the difference between the two randomly selected particles and scaling by a constant, given by γ1(θm − θn). The dark gray lines in the left panel of Fig. 2 represent the difference vector θm − θn. This first vector is the component in ABCDE that allows us to generate efficient proposals that match the shape of the target distribution. The second directional vector is obtained by taking the difference between the particle being updated and the base particle, and scaling by another constant, given by γ2(θb − θt). The second vector is what quickly drives the ABCDE sampler to the high-density regions of the posterior. Finally, we generate a proposal θ∗ for the original particle θt by calculating

θ∗ = θt + γ1(θm − θn) + γ2(θb − θt) + b,  (4)

where γ1 ∼ CU[0.5, 1.0] and b is drawn from some symmetric distribution with very small variance (e.g., b ∼ CU[−ϵ, ϵ], where ϵ = 0.001). The small amount of random noise introduced by sampling γ1 provides enough diversity in the proposals to allow the sampler to traverse the entire range of the target posterior.

Some model-fitting problems may benefit from updating only some of the parameters of each particle, to slow the exploration of the parameter space (Storn & Price, 1997). To implement this, we reset some of the proposal parameters to their previous values with probability (1 − κ). In addition to slowing the sampler, resetting only some of the parameters in the vector θ∗ will increase the diversity in the pool, because the new proposal will be a mixture of parameters accepted in the past and the present, which will likely make this vector unique. In practice, κ is typically set to 1.0 or 0.9, meaning all, or nearly all, parameters are updated to match the proposal vector. The proposal θ∗ is then evaluated by generating a data set X ∼ Model(x|θ∗), comparing it to the observed data Y through the distance metric ρ(X, Y), and then deciding to jump from the current state (θt, Xt) to the proposal state (θ∗, X) by evaluating Eq. (3).
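The crossover proposal of Eq. (4), including the κ-controlled parameter reset, can be sketched as follows. This is only a sketch: the function name and argument layout are our own, and the choice of a fixed γ2 argument reflects the tuning discussed later in the text.

```python
import random

def crossover_proposal(theta_t, pool, weights, gamma2=0.5, kappa=1.0,
                       eps=0.001, rng=random.Random(0)):
    """DE crossover proposal of Eq. (4) for one particle.

    theta_t : parameter vector (list) of the particle being updated
    pool    : the particles in the same group
    weights : fitness weights used to select the base particle theta_b
    """
    # base particle chosen with probability proportional to its weight
    theta_b = rng.choices(pool, weights=weights, k=1)[0]
    # two distinct particles chosen uniformly (theta_m != theta_n != theta_t)
    theta_m, theta_n = rng.sample([p for p in pool if p is not theta_t], 2)
    gamma1 = rng.uniform(0.5, 1.0)                       # gamma1 ~ CU[0.5, 1.0]
    proposal = []
    for i in range(len(theta_t)):
        b = rng.uniform(-eps, eps)                       # small symmetric noise
        val = (theta_t[i] + gamma1 * (theta_m[i] - theta_n[i])
               + gamma2 * (theta_b[i] - theta_t[i]) + b)
        # with probability 1 - kappa, reset this parameter to its old value
        proposal.append(val if rng.random() < kappa else theta_t[i])
    return proposal
```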

2.3. The two modes of sampling

While generating proposals using Eq. (4) leads to fast convergence to the high-density regions of the posterior (i.e., the maximum a posteriori estimate), the term γ2(θb − θt) makes the probability of transitioning from θ∗ to θt lower than the reverse move that transitions θt to θ∗. To see this, suppose the density of b is an independent multivariate normal centered at zero with a variance matrix of ϵ. Then, the probability of transitioning from the current state to the proposal is

q(θ∗|θt) = Σi Σj Σk wk Np(θt + γ1(θi − θj) + γ2(θk − θt), ϵ),

where Np(a, b) is the multivariate normal density of dimension equal to the number of parameters p, with mean a and covariance matrix equal to the identity matrix times b; wk is the weight associated with the kth particle θk, and θi ≠ θj ≠ θt. However, the reverse move is given by

q(θt|θ∗) = Σi Σj Σk wk Np((θ∗ − γ1(θi − θj) − γ2(θk − θt)) / (1 − γ2), ϵ / (1 − γ2)).

Here, we can see that q(θ∗|θt) ≠ q(θt|θ∗), but one could calculate the transition probabilities for each possible state to evaluate Eq. (3). Unfortunately, the calculation of these transition probabilities would need to be done on each iteration, which can be very time consuming when the population of particles G is large.

However, when γ2 = 0, Eq. (4) reduces to

θ∗ = θt + γ1(θm − θn) + b,  (5)

and the probability of transitioning from θt to θ∗ is given by

q(θ∗|θt) = Σi Σj Np(θt + γ1(θi − θj), ϵ),

which will be equal to the probability of the reverse move, given by

q(θt|θ∗) = Σi Σj Np(θ∗ − γ1(θi − θj), ϵ) = Σi Σj Np(θ∗ + γ1(θj − θi), ϵ),

because the elements of the difference vector are selected withequal probability.

As a result of the asymmetry of q(·|·) when γ2 ≠ 0, we implement the algorithm in two modes. The first mode is a ‘‘burn-in mode’’, in which γ2 can be tuned to locate the high-density regions of the posterior. During this phase, we still evaluate Eq. (3), but we ignore the transition kernel probabilities. That is, we accept the proposal θ∗ with probability

min{1, [π(θ∗) ψ(ρ(X, Y)|δ)] / [π(θt) ψ(ρ(Xt, Y)|δ)]}.  (6)


This approach is very similar to methods of generating initial values for chains used when carrying out Markov chain Monte Carlo sampling. One could even adopt stricter, deterministic rules for proposal acceptance, such as accepting a proposal θ∗ if

π(θ∗)ψ(ρ(X, Y )|δ) > π(θt)ψ(ρ(Xt , Y )|δ). (7)

However, we recommend retaining the random acceptance rule in Eq. (3) because it results in more variability in the particle set. Once the high-density regions have been found, we switch to ‘‘sampling mode’’, in which we constrain γ2 = 0 and fix δ to the value obtained during the burn-in mode. The estimate of δ can be any user-defined value, but we recommend using the median or the minimum of the estimates from the pool of particles. As explained, setting γ2 = 0 results in q(θ∗|θt) = q(θt|θ∗), and Eq. (3) can be evaluated without having to calculate the transition probabilities q(θ∗|θt) and q(θt|θ∗) because they cancel out. Thus, for both the burn-in and sampling modes in the crossover step, the transition kernel can be ignored, as proposals are accepted with the probability given by Eq. (6).

It is important to note that accepting new particles based on the Metropolis–Hastings probability ensures that the distribution of particles in a group will match the target distribution. While proposals with higher target densities are always accepted, proposals with lower target densities can also be accepted (see Eq. (6), and, more generally, Eq. (3)), so the sampler will maintain the diversity required to explore the entire target distribution. However, if we only accepted new proposals based on Eq. (7), as in standard genetic algorithms, the group of particles would converge to the maximum a posteriori probability estimate and would not match the variability of the true target distribution. We stress that while using Eq. (7) may satisfy the goals of some researchers who seek ‘‘best-fitting’’ parameter values, it will be unacceptable for those interested in obtaining the Bayesian posterior.

2.4. Mutation

The mutation step is very similar to other perturbation methods, such as those in standard KABC algorithms (Wilkinson, submitted for publication). However, for our algorithm, we do not need our perturbation kernel to be as variable as in other ABC algorithms, because we rely on the crossover step to exhaustively search the parameter space. The mutation step serves only to further diversify the particles within the groups and, as we show in our simulations, often is not required to arrive at highly accurate posterior estimates.

The mutation step is performed for all particles in each of the K groups (see the algorithm in Fig. 1), taken in turn. On each update, a given particle, denoted θ∗∗, is perturbed using a transition kernel such that θ∗ ∼ K(θ∗∗). For example, the transition kernel could be a normal distribution with variance equal to one so that θ∗ ∼ N(θ∗∗, 1). We then generate data X by using the assumed model so that X ∼ Model(x|θ∗). The simulated data X is then compared to the observed data Y through the distance metric ρ(X, Y), and we jump from the current state (θt, Xt) to the proposed state (θ∗, X) with Metropolis–Hastings probability given by Eq. (3).

For the mutation step, the transition kernel densities q(·|·) are determined by the specific transition kernel K(·) that is chosen. In the example above, the transition kernel was normal, and so the transition kernel density can be written as

$$q(a \mid b, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(a-b)^{2}}{2\sigma^{2}}\right),$$

where σ is a tuning parameter (e.g., σ = 1 in the example above). When the transition kernel densities are equal so that q(a|b) = q(b|a), which happens when K(·) is chosen to be symmetric, the densities cancel out and so Eq. (3) reduces to Eq. (6). However, the mutation step can sometimes benefit from asymmetric transition kernels, and then Eq. (3) is required. In particular, we often use asymmetric transition kernels to improve the acceptance rate when a parameter does not have infinite support.
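One mutation update can be sketched as follows. This is an illustrative Python fragment (not the authors' R code): `simulate` and `distance` are hypothetical stand-ins for Model(x|θ) and ρ(X, Y), and the kernel is a symmetric normal, so the q(·|·) terms cancel and the acceptance ratio reduces to the weight ratio of Eq. (6):

```python
import math
import random

rng = random.Random(42)

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def mutate(theta, weight, simulate, distance, delta, sigma=1.0):
    """One mutation update with a symmetric normal transition kernel.

    simulate(theta) stands in for Model(x | theta) and distance for
    rho(X, Y); both are user-supplied.  Because the kernel is symmetric,
    q(theta*|theta) = q(theta|theta*), so the transition densities cancel.
    """
    theta_star = rng.gauss(theta, sigma)                       # theta* ~ N(theta**, sigma)
    w_star = normal_pdf(distance(simulate(theta_star)), 0.0, delta)  # psi(rho(X, Y) | delta)
    if weight <= 0 or rng.random() < min(1.0, w_star / weight):
        return theta_star, w_star                              # jump to the proposed state
    return theta, weight                                       # stay at the current state
```

An asymmetric kernel would simply multiply the acceptance ratio by q(θt|θ∗)/q(θ∗|θt), as in Eq. (3).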

2.5. Migration

The migration step is the algorithm’s way of diversifying particles across groups in a distributed genetic algorithm framework (Hu & Tsui, 2005; Tanese, 1989). Because the particles from each group are randomly traversing the space, one group may land in a high-density region of the posterior very quickly, while others converge much more slowly. To allow for efficient posterior mapping, we simply allow the poorly-performing group to be assisted by the high-performance group by initiating a particle swap between the groups. For ABCDE, the particle swap is (typically) performed on the particles that have small weights through sampling. These particles, while doing poorly in their groups, can still carry valuable information about the location of a higher-density region. Performing the swap in this way allows particles that would have been discarded (because they were the worst performing particles in their groups) a chance to become the best performing particles in another group. In this way, the ‘‘leaders’’ from the high-performance groups can continue directing their group to higher density regions.

To perform the migration step, we first determine the number of groups that will be involved in the swapping by sampling a number η uniformly from the set K = {1, 2, . . . , K}. Then, to determine which of the groups to use, we sample η numbers without replacement from K, forming the group set G = {G1, G2, . . . , Gη}. Next, for each group in G we sample a single particle θ∗ with probabilities based on the inverse of the current weight set from each group. Finally, we swap the particles from each of the sets in a cyclical fashion so that

$$\left(\theta^{*}_{G_1}, \theta^{*}_{G_2}, \ldots, \theta^{*}_{G_{\eta-1}}, \theta^{*}_{G_\eta}\right) \leftarrow \left(\theta^{*}_{G_\eta}, \theta^{*}_{G_1}, \ldots, \theta^{*}_{G_{\eta-2}}, \theta^{*}_{G_{\eta-1}}\right).$$

Unlike the mutation and crossover steps, we recommend that the migration step be deterministic, and so it will not rely on the Metropolis–Hastings probability in Eq. (3). However, probabilistic rules can be adopted so that each swap is determined by first adding a small amount of random noise and evaluating Eq. (3), as proposed by Hu and Tsui (2005). Unfortunately, in the ABC context, generating proposals in this way requires another data simulation step, which might be costly, so we recommend the simpler, deterministic swapping rule.
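A minimal sketch of the deterministic swap, assuming `groups[k]` holds the particles of group k and `weights[k]` their current weights (hypothetical names; not the authors' code):

```python
import random

rng = random.Random(7)

def migrate(groups, weights):
    """One deterministic migration step across K particle groups.

    From each chosen group a particle is drawn with probability
    proportional to the *inverse* of its weight (so poorly performing
    particles tend to move), and the chosen particles are swapped
    cyclically: G1 receives the particle from G_eta, G2 from G1, etc.
    """
    K = len(groups)
    eta = rng.randint(1, K)                       # eta ~ uniform on {1, ..., K}
    G = rng.sample(range(K), eta)                 # groups to involve, without replacement
    picks = [
        rng.choices(range(len(groups[g])),
                    weights=[1.0 / max(w, 1e-300) for w in weights[g]])[0]
        for g in G
    ]
    moved = [groups[g][i] for g, i in zip(G, picks)]
    for pos, (g, i) in enumerate(zip(G, picks)):
        groups[g][i] = moved[pos - 1]             # pos - 1 wraps, so G1 gets G_eta's particle
    return groups
```

The cyclic assignment only relocates particles, so the particle pool is preserved while weak particles get a chance to lead in another group.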

Fig. 2 (right panel) provides an illustration of a migration step. Each of the four different groups is represented by a different symbol (circles, asterisks, squares and triangles). In this example, we randomly sampled from K and obtained η = 3. We then sampled from K again without replacement to form the set G, consisting of the circle, asterisk and square groups. A single particle is sampled from each of the three groups, and these particles are swapped in a cyclical fashion.

3. Simulations

We now present three examples meant to test the validity, speed, and scalability of the ABCDE algorithm. First, we use the ABCDE algorithm to fit the classic mixture of normals example. In this first example, we employ a simplified version of the ABCDE algorithm that uses only the crossover step. In the second example, we compare the speed of the ABCDE algorithm to a recently developed KABC algorithm (Wilkinson, submitted for publication). In this example, the joint posterior distribution is highly correlated, which makes optimal transition kernels difficult to specify for other samplers (see Turner, Sederberg, Brown, and Steyvers (submitted for publication)). Finally, we use ABCDE to fit a 20-parameter model to data and show that despite the model’s complexity, the ABCDE algorithm is able to obtain accurate estimates of the posterior distribution in a very small amount of time.

3.1. Mixture of normals

To demonstrate the advancement of the vectorized perturbation approach, we begin by fitting what has become a classic problem in the ABC literature, a relatively simple mixture of normal distributions (Beaumont et al., 2009; Sisson et al., 2007; Toni et al., 2009; Wilkinson, submitted for publication). We will use the ABCDE algorithm to estimate this posterior distribution using only the crossover step.

Suppose the posterior distribution of interest is given by the sum of two normal distributions with known variance and common mean parameter so that

$$f(\theta) = \tfrac{1}{2}\,\phi(\theta \mid 0, 0.1) + \tfrac{1}{2}\,\phi(\theta \mid 0, 1),$$

where φ(·|a, b) is the normal density with mean a and standard deviation b. To fit the model using ABCDE, we need only sample a candidate parameter θ and generate a single data point X from the above model. To compare the simulated data to the observed data Y = 0, we use a simple Euclidean metric

ρ(X, Y) = |X − Y|.

To estimate the posterior using ABCDE, we assumed that the distribution of errors was Gaussian so that

ζ ∼ N(0, δ).

Thus, the particle’s fitness is determined by evaluating

ψ(ρ(X, Y )|δ) = φ(ρ(X, Y )|0, δ).

We will again assume a noninformative prior for θ as in Beaumont et al. (2009) and Sisson et al. (2007), namely

θ ∼ CU[−10, 10],

where CU[a, b] is the continuous uniform distribution over the interval [a, b]. In addition, we placed an exponential prior on δ so that

δ ∼ Exp(20).
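The crossover-only setup above can be sketched end-to-end in Python (the authors worked in R). For simplicity the sketch fixes δ = 0.05 rather than estimating it alongside θ, and the helper names `phi`, `simulate`, and `fitness` are illustrative, not from the paper:

```python
import math
import random

rng = random.Random(1)

def phi(x, mu, sd):
    """Normal density with mean mu and standard deviation sd."""
    return math.exp(-(x - mu) ** 2 / (2 * sd ** 2)) / (math.sqrt(2 * math.pi) * sd)

def simulate(theta):
    """One draw from the assumed model: N(theta, 0.1) or N(theta, 1), equally likely."""
    return rng.gauss(theta, 0.1 if rng.random() < 0.5 else 1.0)

def fitness(theta, delta=0.05):
    """psi(rho(X, Y) | delta) with Y = 0; zero outside the CU[-10, 10] prior."""
    if not -10.0 <= theta <= 10.0:
        return 0.0
    return phi(abs(simulate(theta)), 0.0, delta)

# 100 particles in 10 groups of 10; crossover only (alpha = beta = 0, gamma2 = 0)
groups = [[rng.uniform(-10, 10) for _ in range(10)] for _ in range(10)]
weights = [[fitness(t) for t in g] for g in groups]

for _ in range(500):
    for thetas, ws in zip(groups, weights):
        for i in range(len(thetas)):
            m, n = rng.sample([j for j in range(len(thetas)) if j != i], 2)
            gamma1 = rng.uniform(0.5, 1.0)        # gamma1 ~ CU[0.5, 1]
            b = rng.uniform(-0.001, 0.001)        # uniform additive sampling error
            theta_star = thetas[i] + gamma1 * (thetas[m] - thetas[n]) + b
            w_star = fitness(theta_star)
            if ws[i] <= 0 or rng.random() < min(1.0, w_star / ws[i]):
                thetas[i], ws[i] = theta_star, w_star

samples = [t for g in groups for t in g]
```

After the loop, `samples` holds the final particle positions approximating the mixture target; with δ estimated as in the text, the same loop would simply carry (θ, δ) pairs instead of scalars.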

We used only the crossover component of the ABCDE algorithm to fit the data so that α = β = 0. We specified uniform additive sampling error so that b ∼ CU[−0.001, 0.001]. We set κ = 1, γ2 = 0 and sampled a new γ1 ∼ CU[0.5, 1] for each iteration. We divided 100 total particles into 10 groups of 10 particles each. We then ran the algorithm for 500 iterations with no burn-in period, fitting both θ and δ simultaneously. The total number of model simulations required to fit the joint distribution of (θ, δ) was 50,000. Given that we were sampling from a two-dimensional space, this is substantially better than the ABC PRC algorithm,¹ which used 75,895 model simulations (Sisson et al., 2007). We emphasize that we were not able to control for the accuracy of the

1 Although the ABC PRC method was subsequently found to have a bias in its parameter estimates, this bias would not give rise to a decrease in performance for this simulation and, consequently, this comparison in total number of function evaluations is valid.

two samplers, so we instead based our estimation accuracy on the bottom right panel of Fig. 2 in Sisson et al. (2007).

The left panel of Fig. 3 shows the estimated target distribution obtained using the crossover component of the ABCDE algorithm along with the true target density (black line). The right panel shows the estimated posterior distribution of the standard deviation of the error term δ. The figure shows that although small values of δ are preferred (i.e., there is more density at lower values), the distribution has a long tail, extending out to around 0.4. Thus, the marginal posterior distribution of δ informs our understanding of the model specified in Eq. (2). Specifically, the data that were observed could be the result of an appropriately selected model (i.e., the mixture of normal distributions), in which case δ would be small (e.g., δ = 0.05), or it could be the result of a poorly specified model and random error, in which case δ would be larger (e.g., δ = 0.35). Comparing the true density with the estimates we obtain (left panel), we see that using only the crossover component of ABCDE, we can recover the target distribution with remarkable accuracy.

Fig. 3. The estimated target distribution (left panel) overlaid with the true target distribution (black line) and the estimated posterior distribution of the error parameter δ.

This simulation was useful in demonstrating that the crossover perturbation method that relies on vector operations is equally capable of recovering the full target density of the classic mixture of normals problem, which has been challenging for previous ABC algorithms (Beaumont et al., 2009; Sisson et al., 2007). We emphasize that all that was required to fit the model was the prior specification of δ, and we did not require explicit specification of a tolerance threshold.

3.2. The Wald distribution

The analysis of response time (RT) has a long and interesting history in psychology. The focus on the RT distribution has highlighted the importance of Bayesian modeling of RTs. However, our most sophisticated models of RTs do not lend themselves well to Bayesian modeling because of their complexity (Lee, Fuss, & Navarro, 2006; Oravecz, Tuerlinckx, & Vandekerckhove, 2009; Vandekerckhove, Tuerlinckx, & Lee, 2011). Instead, Bayesian models of RT have often resorted to simplified but well-fitting descriptive models with likelihoods such as the Weibull, ex-Gaussian, or log Gaussian (Craigmile, Peruggia, & Van Zandt, 2010; Lee & Wagenmakers, 2012; Peruggia, Van Zandt, & Chen, 2002; Rouder, Lu, Speckman, Sun, & Jiang, 2005; Rouder, Sun, Speckman, Lu, & Zhou, 2003).

One statistical model of RTs that is both simple and theoretically useful is the Wald or inverse Gaussian distribution (Chhikara & Folks, 1989; Wald, 1947). The Wald distribution describes the behavior of the first-passage time of a diffusion process with a single boundary. This process can explain effects on RTs from a simple RT experiment (Luce, 1986), where subjects respond as soon as a signal is perceived, regardless of what the signal is, or go/no-go tasks (Heathcote, 2004; Schwarz, 2001). The two-parameter Wald distribution, with density

$$f(y \mid \alpha, \nu) = \alpha\left(2\pi y^{3}\right)^{-1/2}\exp\left(-\frac{(\alpha - \nu y)^{2}}{2y}\right), \qquad (8)$$

provides a simple, psychologically meaningful interpretation of observed RT data. Specifically, the parameters α and ν correspond to different components of the psychological process of stimulus detection. The parameter α is the threshold amount of evidence that a subject requires to respond that a signal has been presented. The parameter ν is the drift rate, or rate of accumulation of evidence (Heathcote, 2004; Matzke & Wagenmakers, 2009). Given these parameters, the Wald distribution can be thought of as a simplified version of the drift diffusion model (DDM; Ratcliff (1978)), which is a popular model for two-choice RT data. However, the likelihood function for the DDM is technically intractable because it requires the evaluation of an oscillating but convergent infinite sum (see Feller (1968), Lee et al. (2006), and Navarro and Fuss (2009)). This is not the case for the two-parameter Wald distribution, whose likelihood function is given by

$$L(\alpha, \nu \mid Y) = \prod_{i=1}^{N} f(y_i \mid \alpha, \nu),$$

and can be factored into

$$\begin{aligned}
L(\alpha, \nu \mid Y) &= \prod_{i=1}^{N} \alpha\left(2\pi Y_i^{3}\right)^{-1/2} \exp\left(-\sum_{i=1}^{N}\frac{(\alpha - \nu Y_i)^{2}}{2Y_i}\right)\\
&= \prod_{i=1}^{N} \alpha\left(2\pi Y_i^{3}\right)^{-1/2} \exp\left(-\sum_{i=1}^{N}\frac{\alpha^{2} - 2\alpha\nu Y_i + \nu^{2}Y_i^{2}}{2Y_i}\right)\\
&= \prod_{i=1}^{N} \alpha\left(2\pi Y_i^{3}\right)^{-1/2} \exp\left(-\frac{\alpha^{2}}{2}\sum_{i=1}^{N}\frac{1}{Y_i} - \frac{\nu^{2}}{2}\sum_{i=1}^{N}Y_i + N\alpha\nu\right).
\end{aligned}$$

Thus, by the Fisher–Neyman Factorization Theorem (Rice, 2007), the statistics S1(y) = ∑i Yi and S2(y) = ∑i 1/Yi are jointly sufficient for the parameters α and ν.²

For the purposes of this demonstration, we simulated 100 RTs in a simple RT task. We denote Yi for the ith RT and let Yi be distributed as a Wald random variable with parameters {α, ν}. We chose mildly informative priors for each of these parameters, namely

α ∼ Γ(1, 1), and ν ∼ Γ(1, 1).

Our goal is to show that even without the Wald density function, ABCDE is able to recover the same posterior distributions for the parameters of this model as a standard Bayesian approach. To make the comparison between the estimates of the posteriors from different algorithms, we used three methods. The first method made use of the likelihood function, and we performed DE-MCMC sampling as described in ter Braak (2006).

The other two methods were likelihood-free methods: the first was a basic KABC algorithm as presented in Wilkinson

2 This result is easy to see because the Wald distribution belongs to the exponential family, so the sufficient statistics are the functions of the data in the exponential.

(submitted for publication), and the second was the ABCDE algorithm. To compare the algorithms as closely as possible, we constrained the ABCDE algorithm to match the KABC algorithm so that the two methods differed only in their method of proposal generation. For both of these likelihood-free methods, we used the two sufficient statistics S1(y) = ∑i Yi and S2(y) = ∑i 1/Yi derived above and assumed separate error terms for each statistic. We employed a Gaussian kernel (see Silverman (1986)) and assumed that our data Y arose from the process

Y = Model(y|θ) + ζ,

where ζ = [ζ1, ζ2] such that ζ1 ∼ N(0, δ1) and ζ2 ∼ N(0, δ2). Here, ζ1 and ζ2 are the error terms corresponding to each sufficient statistic S1(y) and S2(y). For both likelihood-free methods, we fixed δ1 = 0.005 and δ2 = 0.01. To compare the simulated data X to the observed data Y, we used the Euclidean distance between the ith summary statistics, so that

ρi(X, Y) = |Si(X) − Si(Y)|.
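The data simulation and summary statistics for this comparison can be sketched with NumPy. The mapping of (α, ν) onto NumPy's inverse-Gaussian parameterization (mean α/ν, shape α²) follows directly from Eq. (8); the parameter values below are arbitrary illustrations, not the values used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_wald(alpha, nu, n):
    """Draw n Wald RTs with threshold alpha and drift nu.

    The Wald density in Eq. (8) is the inverse Gaussian with mean
    alpha / nu and shape alpha ** 2, which matches NumPy's
    (mean, scale) parameterization of `wald`.
    """
    return rng.wald(alpha / nu, alpha ** 2, size=n)

def summaries(y):
    """Jointly sufficient statistics S1(y) = sum(Y_i) and S2(y) = sum(1 / Y_i)."""
    return np.array([y.sum(), (1.0 / y).sum()])

# observed data (100 RTs) and one simulated data set for a candidate theta*
Y = simulate_wald(alpha=1.0, nu=2.0, n=100)
X = simulate_wald(alpha=1.2, nu=1.8, n=100)
rho = np.abs(summaries(X) - summaries(Y))  # rho_i(X, Y) = |S_i(X) - S_i(Y)|
```

Each component of `rho` would then be weighted by its own Gaussian kernel with δ1 and δ2 as above.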

To implement the restricted ABCDE algorithm, we again set α = β = 0. We used only 24 particles in a single group, to facilitate comparison across sampling methods. For this simulation, we did not use the burn-in mode (γ2 = 0) and ran the algorithm in sampling mode for 10,000 iterations. We specified uniform additive sampling error so that b ∼ CU[−0.001, 0.001], a uniform distribution for γ1 ∼ CU[0.5, 1], and set κ = 1.

To implement the KABC algorithm (Wilkinson, submitted for publication), we applied the same basic settings that were used for ABCDE and ran 24 chains independently for 10,000 iterations. To form our estimates, we discarded the first 100 iterations in both samplers. To generate proposals, we used a Gaussian transition kernel centered at the current state of a given chain with standard deviation equal to 0.5.

Fig. 4 shows the estimated marginal posteriors for each of the methods used. The top panels show the estimates obtained using ABCDE, whereas the bottom panels show the estimates obtained using KABC. In each of these panels, the true values used to generate the data are shown by vertical lines, and the likelihood-informed estimates, which were obtained using DE-MCMC, are represented as the black densities. Comparing the top and bottom panels, we can see that both likelihood-free methods provide reasonable approximations to the true posterior distribution. However, there is evidence to suggest that the ABCDE algorithm has achieved a better fit. In particular, the bottom panel of Fig. 4 shows that more sampling error is present in the estimates when using KABC. The estimates do not conform well to the true posterior distributions and several outliers are present, despite having been initialized at the same locations as in ABCDE and discarding the same number of burn-in samples.


Fig. 4. The estimated marginal posterior distributions of α (left panels) and ν (right panels) using ABCDE (top panel histograms) and KABC (bottom panel histograms). The likelihood-informed estimates using DE-MCMC are shown as the black densities in all panels, and the vertical lines show the parameter values used to generate the data.

Fig. 5. The estimated joint posterior distributions of α and ν using likelihood-based DE-MCMC sampling (left panel) and likelihood-free ABCDE (middle panel) and KABC (right panel). The vertical and horizontal lines represent the parameter values used to generate the data.

Fig. 5 shows the estimated joint posterior distributions using DE-MCMC (left panel), ABCDE (middle panel), and KABC (right panel) along with the parameter values used to generate the data (red lines). The figure shows that when the likelihood is known, the sampler mixes very nicely, creating a smooth estimate. However, the two likelihood-free methods show that the chains stick for several iterations, creating black dots throughout the estimated posterior. The problem of chain sticking is prevalent in the ABC setting, especially when δ (or ϵ) is small. However, the figure shows that the estimates obtained using ABCDE fill a larger region (i.e., the estimates are much smoother) than the estimates obtained using KABC.

The performance of the two likelihood-free methods is favorable, but is to be expected given that we were able to derive sufficient statistics for the model parameters. However, showing that we can match the true posteriors when a likelihood is available is useful in illustrating that it is also possible to obtain the true posterior estimates when the likelihood is intractable, such as in the DDM. As is normal in the ABC setting, both algorithms suffered from substantial rejection rates: the ABCDE algorithm accepted 1.31% of its proposals while the KABC algorithm accepted only 0.24% of its proposals, after the burn-in period. Taken together, Figs. 4 and 5 show that this difference (a 454% improvement over the KABC algorithm) has a drastic impact on the estimates we obtain.

3.3. Bivariate normal example

As a final example, we consider the problem of scalability. Currently, ABC algorithms become extremely inefficient in high-dimensional parameter spaces, a direct result of the accept/reject nature of these algorithms. In high-dimensional spaces, proposing a successful candidate parameter vector using only a mutation kernel becomes a very improbable event.

The ABCDE algorithm, on the other hand, can be used to sample efficiently from the posterior distributions of high-dimensional models. Rather than regularly rejecting parameter vectors in low-density regions, the ABCDE algorithm will simply assign a small weight to the particle as measured by the kernel function. The crossover process will then evolve the population towards higher-density regions, narrowing in on the true posterior. For this next example we illustrate how well this process scales to multiple dimensions by applying the ABCDE algorithm to a set of 10 bivariate normal distributions (a 20-parameter problem) and performing inference on the parameters jointly.

We denote the observed data from the jth source as Yj, each of which is assumed to be independently distributed from a bivariate normal distribution with an unknown mean vector µj and known variance matrix Σ, or

$$Y_j \sim \mathcal{N}_2\left(\mu_j = \begin{bmatrix}\mu_{j,1}\\ \mu_{j,2}\end{bmatrix},\ \Sigma = \begin{bmatrix}0.01^{2} & 0\\ 0 & 0.01^{2}\end{bmatrix}\right). \qquad (9)$$

To generate the observed data for testing the model, we randomly sampled 10 points from the space µj ∼ CU[0, 10] × CU[0, 10], and then treated the resulting µjs as the sole observation of the data. These sampled µjs are shown in Fig. 6 by the ‘‘x’’ symbols. When simulating data X, we generated n = 50 observations from each bivariate normal distribution using Eq. (9) with the µjs specified by the 20-dimensional parameter vector of the particle.

For the discriminant function, we calculated the root mean squared difference (RMSD) between the average over these 50 simulated observations Xj and the ‘‘true’’ observed data Yj, such that

$$\rho(X, Y) = \sqrt{\frac{1}{20}\sum_{j=1}^{20}\left(X_j - Y_j\right)^{2}},$$

where in this case j is indexed over the 20 parameters of the flattened 10 × 2 matrix of bivariate means. We then used the RMSD to compute a final weight of each proposal,

ψ(ρ(X, Y )|δ) = φ(ρ(X, Y )|0, δ),

where φ(·|a, b) is the normal density with mean a and standard deviation b.
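The weight computation for a single 20-dimensional proposal can be sketched as follows (illustrative Python with NumPy; `rmsd_weight` and the candidate values are hypothetical, not from the paper):

```python
import math
import numpy as np

rng = np.random.default_rng(3)

SD = 0.01  # known component standard deviation from Eq. (9)

def rmsd_weight(mu, Y, delta, n=50):
    """Weight a 10 x 2 matrix of candidate means mu against the observed Y.

    For each bivariate component we simulate n = 50 observations, average
    them, take the RMSD over the 20 flattened parameters, and return
    psi(rho | delta) = phi(rho | 0, delta).
    """
    Xbar = rng.normal(loc=mu, scale=SD, size=(n, 10, 2)).mean(axis=0)
    rho = math.sqrt(np.mean((Xbar - Y) ** 2))
    return math.exp(-rho ** 2 / (2 * delta ** 2)) / (math.sqrt(2 * math.pi) * delta)

Y = rng.uniform(0, 10, size=(10, 2))         # the 10 "observed" bivariate means
w_true = rmsd_weight(Y, Y, delta=0.1)        # candidate at the truth
w_far = rmsd_weight(Y + 1.0, Y, delta=0.1)   # candidate shifted one unit away
```

A candidate near the truth receives a weight orders of magnitude larger than a shifted one, which is exactly the gradient the crossover step exploits instead of a hard accept/reject threshold.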

For each µj, we specified a noninformative prior over the entire space [0, 10] × [0, 10] so that

µj ∼ CU[0, 10] × CU[0, 10].

In this high-dimensional space, the specification of δ becomes a difficult problem. At first, the values of ρ(X, Y) will all be large, and so a small value for δ will result in the numerator of Eq. (3) being equal to zero, which will lead to the rejection of the proposal. To work around this problem, we treated δ as a free parameter in the model, and assigned it a strong prior so that

δ ∼ Exp(20).

This exponential prior thus favors particles in high-density regions with small δs, but it allows for particles in the low-density regions as well (i.e., they have nonzero weights).

We set the mutation probability to β = 0.0 (i.e., we used crossover only) and the crossover parameter update probability to κ = 0.9 (allowing for more diversity in the crossover step). We used only one group (G = 1) of size K = 50, for a total of 50 particles. There was no need to run the migration step, so we set α = 0.0. We again implemented burn-in and sampling periods. During the burn-in period, we sampled γ2 ∼ CU[0.5, 1] for each update and ran the algorithm for 200 iterations. During the sampling period, we set γ2 = 0 and ran the algorithm for 300 iterations. For both modes, we sampled γ1 ∼ CU[0.5, 1] and b ∼ CU[−0.001, 0.001]. To transition from the burn-in mode to the sampling mode, we initialized the chains during the sampling mode to the final values obtained during the burn-in mode. Finally, during the burn-in mode, δ was treated as a free parameter, but, during the sampling mode, δ was fixed to the minimum of the distribution obtained by the end of the burn-in mode.

For such a high-dimensional problem, one could certainly benefit from dividing the sampling procedure into blocks (Gelman, Carlin, Stern, & Rubin, 2004). For example, one could determine the conditional distributions of each µj parameter, which would result in 10 blocks each consisting of a two-dimensional space. To obtain samples from the joint posterior in this regime, one would then cycle through each conditional distribution and sample from this distribution using ABCDE. Because the ABCDE algorithm employs a system of Markov chains, developing such a sampler is straightforward, unlike other algorithms that rely on sequential Monte Carlo techniques. However, to create a modeling challenge for ABCDE, we instead used only a single block (a twenty-dimensional space). In this regime, rejection rates will be very high and efficiently sampling from the joint posterior distribution will be difficult.

The middle panel of Fig. 6 shows the estimated joint posterior distribution for each of the 10 bivariate normal components during the burn-in period. To illustrate the convergence of the algorithm, we color-coded the parameter estimates based on the current iteration. Beginning with the uniform distribution, the particles representing the 20 parameters quickly converged to cluster around the 10 bivariate means. This is made possible because δ starts high to provide a non-zero weight for the DE crossover algorithm to converge to the high-density regions. The left panel of Fig. 6 shows the convergence of δ across the iterations and is color coded to match the posterior estimates in the middle panel. The values of δ decrease with each iteration as ABCDE converges to the high-density regions of the posterior. The right panel of Fig. 6 shows the final estimate of the joint posterior distribution of the 10 bivariate normal components obtained during the 300 sampling mode iterations following the burn-in period.

This example has illustrated a number of important features of the ABCDE algorithm. First, not only was the algorithm able to converge to the bivariate means, it also recovered the true posterior distribution and did not simply converge to single points (i.e., the maximum a posteriori estimate). Furthermore, this accuracy was obtained using only 25,000 model simulations with a total computation time of under one minute. We speculate that obtaining such results would be nearly impossible when using a standard ABC algorithm such as ABC SMC due to the extremely high rejection rates. We also implemented a KABC algorithm using the same number of particles and iterations, and a normal transition kernel with standard deviation equal to 0.1. Although we do not report these results here, the algorithm failed to converge within 500 iterations, leaving the final estimated posterior spanning the entire space. This additional simulation suggests that the crossover portion of the algorithm is essential for efficiently scaling ABC to models with large numbers of parameters.

Fig. 6. The marginal convergence of all δ values (left panel) and the estimated joint posterior distribution of (µ1, µ2) (middle panel) as a function of the iteration number during the burn-in period. The grayscale color coding is equivalent for the left and middle panels to show the trade-off between the estimated posterior and the maximum and minimum values of the error term. The right panel plots the final estimated joint posterior distribution using only the samples obtained during the sampling period.

4. General discussion

Typically, Monte Carlo methods such as Markov chain Monte Carlo or sequential Monte Carlo rely on transition kernels to move proposals around the region of interest. These transition kernels generally are Gaussian in form and are centered at the previous state of the chain. Thus, to fully specify the kernel, one only needs to select the standard deviation of the Gaussian as a tuning parameter. However, it can be very difficult to specify the optimal transition kernel, especially in the case of highly correlated parameters. For highly correlated parameter spaces (e.g., correlations greater than 0.4), Turner et al. (submitted for publication) have shown in a simulation study that the DE proposal generation method is much more efficient than conventional Markov chain Monte Carlo algorithms, as measured by both the rejection rate and the Kullback–Leibler distance. Furthermore, in the case of high-dimensional parameter spaces, the cost of selecting sub-optimal transition kernels can be incredibly high: many more samples may be required to adequately sample from the posterior.

On the other hand, genetic algorithms such as DE relieve the demands of selecting transition kernels because the perturbation method is explicitly dependent on the current population of particles. In addition to DE being convenient, it has been shown that the basic DE algorithm can achieve the optimal acceptance rate when

$$\gamma_1 = \frac{2.38}{\sqrt{2d}},$$

where d is the number of dimensions in the parameter space (ter Braak, 2006). The automatic specification of a transition kernel used by ABCDE offers a compelling advantage over most current algorithms in the ABC framework, where rejection rates are exacerbated by the mandatory comparison of the distance between simulated and observed data to some tolerance threshold.
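As a worked instance of this guideline, the two-parameter Wald example (d = 2) gives γ1 = 2.38/√4 = 1.19; a one-line illustrative helper:

```python
import math

def optimal_gamma1(d):
    """ter Braak's (2006) guideline gamma_1 = 2.38 / sqrt(2 * d) for d dimensions."""
    return 2.38 / math.sqrt(2 * d)
```

Note that the guideline shrinks with dimensionality, so the step size automatically becomes more conservative for larger models.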

The ABCDE algorithm has another benefit that was not explored in this article. Because the ABCDE algorithm consists of a system of Markov chains, sets of parameters can be ‘‘blocked’’ and estimated cyclically. For example, when estimating the parameters of a hierarchical model, likelihood-informed methods can be used for the hyperparameters whose conditional distributions do not depend on the likelihood, and ABCDE can be used for the lower-level parameters, as shown in Turner and Van Zandt (submitted for publication). This is an added benefit of ABCDE over sequential Monte Carlo methods, which do not converge to the true joint posterior distribution when used in a Gibbs sampling framework unless proper precautions are taken.

While the ABCDE algorithm is technically a system of Markov chains, it can still be easily parallelized by distributing model simulations in the ‘‘for’’ loop (Line 6 in Fig. 1). We simulated all examples presented in this article in R on a standard desktop computer with an Intel i7 processor (3.07 GHz) and distributed the jobs across eight cores. We found that this was sufficient for our needs, but we emphasize that more powerful computational tools (e.g., compiled languages, graphics processing units, computing clusters) could be used to amplify the algorithm’s efficiency well beyond what we demonstrate in the present examples.

4.1. Guidelines

A difficulty of employing ABCDE is that, in its complete form with crossover, migration, and mutation, it requires a number of tuning parameters. While we have found that the selection of these parameters does not critically alter the performance of the algorithm, we now provide some basic guidelines for their selection. First, the number of particles per group should be about 5–10 times the number of modes in the posterior and/or parameters in the model. Dividing up the pool of particles is really just a matter of selecting the appropriate group size. We have found that a group size of 20–50 particles works well to fit a wide range of models.

Selecting the migration probability α and the mutation probability β is usually a matter of knowing something about the posterior distribution. While these features were not used in this article, we have found that α and β should usually be some small value (e.g., α = β = 0.1). For multi-modal posteriors, it might be important to increase the α parameter to allow for a more thorough exploration of the parameter space. Selecting β and κ is a matter of convergence speed. We have found that small values such as β = 0.1 or β = 0.2 work quite well. Increasing β will result in more mutation steps, and the algorithm will be very similar to the ABC SMC algorithm (Toni et al., 2009) or the Markov chain Monte Carlo algorithm (Wilkinson, submitted for publication). Large values of the crossover parameter update probability (e.g., κ = 0.9) will allow for fast convergence while maintaining some additional diversity in the crossover proposals, which is ideal for most models and especially when there is parameter dependence. Some problems, however, may benefit from slower convergence with κ = 0.2, which fosters search along individual dimensions. Selecting the perturbation kernel in the mutation step is similar to selecting the kernel in the ABC SMC algorithm. However, for ABCDE, we do not need these transition kernels to be highly variable and have found that choosing a normal kernel with small variance works well.

Finally, the choice of the model error function ψ(·|δ) is an important one. While we have found the Gaussian and Epanechnikov functions very useful, we also suggest using exponential functions. As δ → 0, we will obtain estimates closer to the true posterior distribution, and the importance of selecting the model error function ψ(·|δ) will diminish. In the examples presented here, we have shown that δ can be treated as a free parameter in the model, which automates the fitting procedure.

It is important to note that while including crossover, migration, and mutation provides for the most flexible algorithm, in many cases ABCDE works extremely well with only a single group of particles that update via the crossover step. This is because the DE algorithm in combination with the Metropolis–Hastings acceptance probability allows for efficient proposals that span the entire range of high-density regions of the parameter space. The bivariate normal simulation outlined above is an example of estimating a complex model with only the crossover component. In such cases, one only needs to decide on the number of particles and an error function, making the simulation quite straightforward.

5. Conclusions

In this article, we have presented a new algorithm, ABCDE, that attempts to solve the problem of scalability associated with ABC methods. The algorithm relies on DE to drive the sampler to high-density regions of the posterior distribution and to automatically optimize proposal efficiency. This type of proposal generation is very efficient, capturing both multi-modal and high-dimensional posteriors quickly and accurately. Although we have not provided proofs of convergence for the ABCDE algorithm in this article, we have demonstrated with many examples (only three of which are reported here) that the ABCDE algorithm will converge to the true posterior distribution.

When developing a model in any field, one often investigates many variants of a single base model. ABC algorithms, which require only a simulation of the model, afford us the opportunity to test many variants without having to develop full statistical models for each variant. The posteriors that ABC provides can give us clear information about the relationships between the parameters: information that might otherwise be available only through extensive experience with each variant. Equipped with only a prior distribution and the ABCDE algorithm, estimation is now feasible for virtually any posterior distribution.

References

Barthelmé, S., & Chopin, N. (2011). ABC-EP: expectation propagation for likelihood-free Bayesian computation. In Proceedings of the 28th international conference on machine learning. Bellevue, WA.

Beaumont, M. A. (2010). Approximate Bayesian computation in evolution and ecology. Annual Review of Ecology, Evolution, and Systematics, 41, 379–406.

Beaumont, M. A., Cornuet, J.-M., Marin, J.-M., & Robert, C. P. (2009). Adaptive approximate Bayesian computation. Biometrika, asp052, 1–8.

Beaumont, M. A., Zhang, W., & Balding, D. J. (2002). Approximate Bayesian computation in population genetics. Genetics, 162, 2025–2035.

Blum, M. G. B., & François, O. (2010). Non-linear regression models for approximate Bayesian computation. Statistics and Computing, 20, 63–73.

Chhikara, R. S., & Folks, L. (1989). The inverse Gaussian distribution: theory, methodology and applications. New York: Marcel Dekker, Inc.

Craigmile, P., Peruggia, M., & Van Zandt, T. (2010). Hierarchical Bayes models for response time data. Psychometrika, 75, 613–632.

Del Moral, P., Doucet, A., & Jasra, A. (2008). An adaptive sequential Monte Carlo method for approximate Bayesian computation. Technical report.

Feller, W. (1968). An introduction to probability theory and its applications, Vol. 1. New York: John Wiley.

Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian data analysis. New York, NY: Chapman and Hall.

Havangi, R., Nekoui, M., & Teshnehlab, M. (2010). A multi swarm particle filter for mobile robot localization. International Journal of Computer Science, 7, 15–22.

Heathcote, A. (2004). Fitting Wald and ex-Wald distributions to response time data: an example using functions for the S-PLUS package. Behavior Research Methods, Instruments, & Computers, 36, 678–694.

Hu, B., & Tsui, K.-W. (2005). Distributed evolutionary Monte Carlo with applications to Bayesian analysis. Technical report number 1112.

Kennedy, J., & Eberhart, R. (1995). Particle swarm optimization. In Proceedings of the IEEE international conference on neural networks, Vol. 4 (pp. 1942–1948).

Lee, M. D., Fuss, I. G., & Navarro, D. J. (2006). A Bayesian approach to diffusion models of decision-making and response time. In B. Schölkopf, J. Platt, & T. Hoffman (Eds.), Advances in neural information processing systems 19 (pp. 809–815). Cambridge, MA: MIT Press.

Lee, M. D., & Wagenmakers, E.-J. (2012). A course in Bayesian graphical modeling for cognitive science. Last downloaded January 1, 2012. Available from http://www.ejwagenmakers.com/BayesCourse/BayesBookWeb.pdf.

Luce, R. D. (1986). Response times: their role in inferring elementary mental organization. New York: Oxford University Press.

Marjoram, P., Molitor, J., Plagnol, V., & Tavaré, S. (2003). Markov chain Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences of the United States, 100, 324–328.

Matzke, D., & Wagenmakers, E.-J. (2009). Psychological interpretation of the ex-Gaussian and shifted Wald parameters: a diffusion model analysis. Psychonomic Bulletin and Review, 16, 798–817.

Navarro, D. J., & Fuss, I. G. (2009). Fast and accurate calculations for first-passage times in Wiener diffusion models. Journal of Mathematical Psychology, 53, 222–230.

Oravecz, Z., Tuerlinckx, F., & Vandekerckhove, J. (2009). A hierarchical Ornstein–Uhlenbeck model for continuous repeated measurement data. Psychometrika, 74, 395–418.

Peruggia, M., Van Zandt, T., & Chen, M. (2002). Was it a car or a cat I saw? An analysis of response times for word recognition. Case Studies in Bayesian Statistics, VI, 319–334.

Pritchard, J. K., Seielstad, M. T., Perez-Lezaun, A., & Feldman, M. W. (1999). Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Molecular Biology and Evolution, 16, 1791–1798.

Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59–108.

Rice, J. A. (2007). Mathematical statistics and data analysis. Belmont, CA: Duxbury Press.

Robert, C. P., Cornuet, J.-M., Marin, J.-M., & Pillai, N. (2011). Lack of confidence in approximate Bayesian computation model choice. Proceedings of the National Academy of Sciences of the United States, 108, 15112–15117.

Rouder, J. N., Lu, J., Speckman, P., Sun, D., & Jiang, Y. (2005). A hierarchical model for estimating response time distributions. Psychonomic Bulletin and Review, 12, 195–223.

Rouder, J., Sun, D., Speckman, P., Lu, J., & Zhou, D. (2003). A hierarchical Bayesian statistical framework for response time distributions. Psychometrika, 68, 589–606.

Schwarz, W. (2001). The ex-Wald distribution as a descriptive model of response times. Behavior Research Methods, Instruments, & Computers, 33, 457–469.

Silverman, B. W. (1986). Density estimation for statistics and data analysis. London: Chapman & Hall.

Sisson, S., Fan, Y., & Tanaka, M. M. (2007). Sequential Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences of the United States, 104, 1760–1765.

Storn, R., & Price, K. (1997). Differential evolution: a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization, 11, 341–359.

Tanese, R. (1989). Distributed genetic algorithms (pp. 434–439).

Tavaré, S., Balding, D. J., Griffiths, R. C., & Donnelly, P. (1997). Inferring coalescence times from DNA sequence data. Genetics, 145, 505–518.

ter Braak, C. J. F. (2006). A Markov chain Monte Carlo version of the genetic algorithm differential evolution: easy Bayesian computing for real parameter spaces. Statistics and Computing, 16, 239–249.

Tong, G., Fang, Z., & Xu, X. (2006). A particle swarm optimized particle filter for nonlinear system state estimation. Evolutionary Computation, 438–442.

Toni, T., Welch, D., Strelkowa, N., Ipsen, A., & Stumpf, M. P. (2009). Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. Journal of the Royal Society Interface, 6, 187–202.

Turner, B. M., Sederberg, P. B., Brown, S. D., & Steyvers, M. (2012). A note on efficiently sampling from distributions with correlated dimensions. Manuscript submitted for publication.

Turner, B. M., & Van Zandt, T. (2012). Hierarchical approximate Bayesian computation. Manuscript submitted for publication.

Turner, B. M., & Van Zandt, T. (2012). A tutorial on approximate Bayesian computation. Journal of Mathematical Psychology, 56, 69–85.

Vandekerckhove, J., Tuerlinckx, F., & Lee, M. D. (2011). Hierarchical diffusion models for two-choice response time. Psychological Methods, 16, 44–62.

Vrugt, J. A., ter Braak, C. J. F., Diks, C. G. H., Robinson, B. A., Hyman, J. M., & Higdon, D. (2009). Accelerating Markov chain Monte Carlo simulation by differential evolution with self-adaptive randomized subspace sampling. International Journal of Nonlinear Sciences and Numerical Simulation, 10.

Wald, A. (1947). Sequential analysis. New York: Wiley.

Wilkinson, R. D. (2011). Approximate Bayesian computation (ABC) gives exact results under the assumption of model error. Manuscript submitted for publication.