
arXiv:1202.3819v3 [stat.ME] 11 Jun 2013

Statistical Science
2013, Vol. 28, No. 2, 189–208
DOI: 10.1214/12-STS406
© Institute of Mathematical Statistics, 2013

A Comparative Review of Dimension Reduction Methods in Approximate Bayesian Computation

M. G. B. Blum, M. A. Nunes, D. Prangle and S. A. Sisson

Abstract. Approximate Bayesian computation (ABC) methods make use of comparisons between simulated and observed summary statistics to overcome the problem of computationally intractable likelihood functions. As the practical implementation of ABC requires computations based on vectors of summary statistics, rather than full data sets, a central question is how to derive low-dimensional summary statistics from the observed data with minimal loss of information. In this article we provide a comprehensive review and comparison of the performance of the principal methods of dimension reduction proposed in the ABC literature. The methods are split into three nonmutually exclusive classes consisting of best subset selection methods, projection techniques and regularization. In addition, we introduce two new methods of dimension reduction. The first is a best subset selection method based on Akaike and Bayesian information criteria, and the second uses ridge regression as a regularization procedure. We illustrate the performance of these dimension reduction techniques through the analysis of three challenging models and data sets.

Key words and phrases: Approximate Bayesian computation, dimension reduction, likelihood-free inference, regularization, variable selection.

    1. INTRODUCTION

Bayesian inference is typically focused on the posterior distribution p(θ|yobs) ∝ p(yobs|θ)p(θ) of a parameter vector θ ∈ Θ ⊆ Rq, q ≥ 1, representing the

M. G. B. Blum is Research Associate, Université Joseph Fourier, Centre National de la Recherche Scientifique, Laboratoire TIMC-IMAG UMR 5525, Grenoble, F-38041, France (e-mail: [email protected]). M. A. Nunes is Senior Research Associate and D. Prangle is Lecturer, Mathematics and Statistics Department, Fylde College, Lancaster University, Lancaster LA1 4YF, United Kingdom. S. A. Sisson is Associate Professor, School of Mathematics and Statistics, University of New South Wales, Sydney 2052, Australia.

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2013, Vol. 28, No. 2, 189–208. This reprint differs from the original in pagination and typographic detail.

updating of one’s prior beliefs, p(θ), through the likelihood (model) function, p(yobs|θ), having observed data yobs ∈ Y. The term approximate Bayesian computation (ABC) refers to a family of models and algorithms that aim to draw samples from an approximate posterior distribution when the likelihood, p(yobs|θ), is unavailable or computationally intractable, but where it is feasible to quickly generate data from the model, y ∼ p(·|θ). ABC is rapidly becoming a popular tool for the analysis of complex statistical models in an increasing number and breadth of research areas. See, for example, Lopes and Beaumont (2010), Bertorelle, Benazzo and Mona (2010), Beaumont (2010), Csilléry et al. (2010) and Sisson and Fan (2011) for a partial overview of the application of ABC methods.

ABC introduces two principal approximations to the posterior distribution. First, the posterior distribution of the full data set, p(θ|yobs), is approximated by p(θ|sobs) ∝ p(sobs|θ)p(θ), where sobs = S(yobs) is

a vector of summary statistics of lower dimension than the data yobs. In this manner, p(θ|sobs) ≈ p(θ|yobs) is a good approximation if sobs is highly informative for the model parameters, and p(θ|sobs) = p(θ|yobs) if sobs is sufficient. As p(sobs|θ) is also likely to be computationally intractable if p(yobs|θ) is computationally intractable, a second approximation is constructed as pABC(θ|sobs) = ∫ p(θ, s|sobs) ds, with

p(θ, s|sobs) ∝ Kǫ(‖s − sobs‖) p(s|θ) p(θ),   (1)

where Kǫ(‖u‖) = K(‖u‖/ǫ)/ǫ is a standard smoothing kernel with scale parameter ǫ > 0. As a result of (1), pABC(θ|sobs) can be shown to be a good approximation of the target p(θ|sobs) if the kernel scale parameter, ǫ, is small enough, following standard kernel density estimation arguments (e.g., Blum (2010a)).

In combination, both approximations allow for practical methods of sampling from pABC(θ|sobs) that avoid explicit evaluation of the intractable likelihood function, p(yobs|θ). A simple rejection-sampling algorithm to achieve this was proposed by Pritchard et al. (1999) (see also Marjoram et al. (2003)), which produces draws from p(θ, s|sobs). In general terms, an importance-sampling version of this algorithm proceeds as follows:

(1) Draw a candidate parameter vector from the prior, θ′ ∼ p(θ);
(2) Draw summary statistics from the model, s′ ∼ p(s|θ′);
(3) Assign to (θ′, s′) a weight, w′, that is proportional to Kǫ(‖s′ − sobs‖).

Here, the sampling distribution for (θ′, s′) is the prior predictive distribution, p(s|θ)p(θ), and the target distribution is p(θ, s|sobs). Using equation (1), it is then straightforward to compute the importance weight for the pair (θ′, s′). The weight is proportional to p(θ′, s′|sobs)/[p(s′|θ′)p(θ′)] = Kǫ(‖s′ − sobs‖), which is free of intractable likelihood terms, p(s′|θ′). The manner by which the intractable likelihoods cancel between sampling and target distributions forms the basis for the majority of ABC algorithms.
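As a toy illustration of these three steps and the resulting weights, a minimal R sketch is given below. The Gaussian model, the choice of summary statistics, the Epanechnikov kernel (also used in Section 4) and all object names are our own illustrative choices, not part of the original article.

```r
# ABC importance sampling, minimal sketch: a normal-mean toy model with a normal prior.
# In practice simulate_summaries() would wrap the intractable simulator and the chosen
# mapping y -> S(y); here it is purely illustrative.
set.seed(1)
s_obs <- c(mean = 2.1, sd = 0.9)                  # "observed" summary statistics S(y_obs)

simulate_summaries <- function(theta, n_data = 50) {
  y <- rnorm(n_data, mean = theta, sd = 1)        # step (2): simulate data from the model
  c(mean = mean(y), sd = sd(y))                   # and reduce them to summary statistics
}

n_sim <- 1e5
theta <- rnorm(n_sim, mean = 0, sd = 10)          # step (1): draw candidates from the prior
S_sim <- t(sapply(theta, simulate_summaries))     # step (2): summaries for each candidate

d   <- sqrt(rowSums(sweep(S_sim, 2, s_obs)^2))    # ||s' - s_obs||
eps <- quantile(d, 0.01)                          # kernel scale: 1% of draws get nonzero weight
w   <- pmax(1 - (d / eps)^2, 0)                   # step (3): Epanechnikov kernel weights

# The weighted draws (theta, w) approximate p_ABC(theta | s_obs); for example, an
# estimate of the posterior mean is
sum(w * theta) / sum(w)
```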

Clearly, both ABC approximations to the posterior distribution help to avoid the computational intractability of the original problem. The first approximation allows the kernel weighting of the second approximation, Kǫ(‖s − sobs‖), to be performed on a lower dimension than that of the original data, yobs. Kernel smoothing is known to suffer from the curse of dimensionality (e.g., Blum (2010a)), and so keeping dim(s) ≤ dim(y) as small as possible helps to improve algorithmic efficiency. The second approximation (1) allows the sampler weights (or acceptance probabilities, if one considers rejection-based samplers, such as Markov chain Monte Carlo) to be free of intractable likelihood terms.

In practice, however, there is typically a trade-off between the two approximations: if the dimension of s is large so that the first approximation, p(θ|sobs) ≈ p(θ|yobs), is good, the second approximation may then be poor due to the inefficiency of kernel smoothing in large dimensions. Conversely, if the dimension of s is small, the second approximation (1) will be good (with a small kernel scale parameter, ǫ), but any loss of information in the mapping sobs = S(yobs) means that the first approximation may be poor. Naturally, a low-dimensional and near-sufficient statistic, s, would provide a near-optimal and balanced choice.

For a given set of summary statistics, much work has been done on deriving more efficient sampling algorithms to reduce the effect of the second approximation by allowing a smaller value for the kernel scale parameter, ǫ, which in turn improves the approximation pABC(θ|sobs) ≈ p(θ|sobs). The greater the algorithmic efficiency, the smaller the scale parameter that can be achieved for a given computational burden. These algorithms include Markov chain Monte Carlo (Marjoram et al. (2003); Bortot, Coles and Sisson (2007)) and sequential Monte Carlo techniques (Sisson, Fan and Tanaka (2007); Toni et al. (2009); Beaumont et al. (2009); Drovandi and Pettitt (2011); Peters, Fan and Sisson (2012); Del Moral, Doucet and Jasra (2012)). By contrast, the regression-based methods described in Section 2.1 do not aim at reducing the scale parameter ǫ but rather explicitly account for the imperfect match between observed and simulated summary statistics (Beaumont, Zhang and Balding (2002); Blum and François (2010)).

Achieving a good trade-off between the two approximations revolves around the identification of a set of summary statistics, s, which are both low-dimensional and highly informative for θ. A number of methods, primarily based on dimension reduction ideas, have been proposed to achieve this (Joyce and Marjoram (2008); Wegmann, Leuenberger and Excoffier (2009); Nunes and Balding (2010); Blum and François (2010); Blum (2010b); Fearnhead and Prangle (2012)). The choice of summary statistics is one of the most important aspects of a statistical analysis using ABC methods (along with the choice of algorithm). Poor specification of s can have a large and detrimental impact on both ABC model approximations.

In this article we provide the first detailed review and comparison of the performance of the current methods of dimension reduction for summary statistics within the ABC framework. We characterize these methods into three nonmutually exclusive classes: (i) best subset selection, (ii) projection techniques and (iii) regularization approaches. As part of this analysis, we introduce two additional novel techniques for dimension reduction within ABC. The first adapts the ideas of Akaike and Bayesian information criteria to the ABC framework, whereas the second makes use of ridge regression as a regularization procedure for ABC. The dimension reduction methods are compared through the analysis of three challenging models and data sets. These involve the analysis of a coalescent model with recombination (Joyce and Marjoram (2008)), an evaluation of the evolutionary fitness cost of mutation in drug-resistant tuberculosis (Luciani et al. (2009)) and an assessment of the number and size-distribution of particle inclusions in the production of clean steels (Bortot, Coles and Sisson (2007)).

The layout of this article is as follows: in Section 2 we classify and review the existing methods of summary statistic dimension reduction in ABC, and in Section 3 we outline our two additional novel methods. A comparative analysis of the performance of each of these methods is provided in Section 4. We conclude with a discussion.

2. CLASSIFICATION OF ABC DIMENSION REDUCTION METHODS

In a typical ABC analysis, an initial collection of statistics s⊤ = (s1, . . . , sp) is chosen by the modeler, the elements of which have the potential to be informative for the model parameters, θ⊤ = (θ1, . . . , θq). Choice of these initial statistics is highly problem specific, and the number of candidate statistics, p, often considerably outnumbers the number of model parameters, q, that is, p ≫ q (e.g., Bortot, Coles and Sisson (2007); Allingham, King and Mengersen (2009); Luciani et al. (2009)). For example, Bortot, Coles and Sisson (2007) and Allingham, King and Mengersen (2009) use the ordered observations S(y) = (s(1), . . . , s(p)) so that there is no loss of information at this stage. The analysis then proceeds by either using all p statistics in full or by attempting to reduce their dimension while minimizing information loss. Note that the most suitable set of summary statistics for an analysis may be data set dependent, as the information content of summary statistics may vary within the parameter space, Θ (an exception is when sufficient statistics are known). As such, any analysis should also consider establishing potentially different summary statistics when re-implementing any model with a different data set.

Methods of summary statistics dimension reduction for ABC can be broadly classified into three nonmutually exclusive classes. The first class of methods follows a best subset selection approach. Here, candidate subsets are evaluated and ranked according to various information-based criteria, such as measures of sufficiency (Joyce and Marjoram (2008)) or the entropy of the posterior distribution (Nunes and Balding (2010)). In this article we contribute additional criteria for this process derived from Akaike and Bayesian information criteria arguments. From these criteria, the highest ranking subset (or, alternatively, a subset consisting of those summary statistics which demonstrate clear importance) is then chosen for the final analysis.

The second class of methods can be considered as projection techniques. Here, the dimension of (s1, . . . , sp) is reduced by considering linear or nonlinear combinations of the summary statistics. These methods make use of a regression layer within the ABC framework, whereby the response variable, θ, is regressed on the (possibly transformed) predictor variables, s (Beaumont, Zhang and Balding (2002); Blum and François (2010)). These projection methods include partial least squares regression (Wegmann, Leuenberger and Excoffier (2009)), feed-forward neural networks (Blum and François (2010)) and regression guided by minimum expected posterior loss considerations (Fearnhead and Prangle (2012)).

In this article we introduce a third class of methods for dimension reduction in ABC, based on regularization techniques. Using ridge regression, we also make use of the regression layer between the parameter θ and the summary statistics, s. However, rather than explicitly considering a selection of summary statistics, we propose to approach this implicitly, by shrinking the regression coefficients toward zero so that uninformative summary statistics have the weakest contribution in the regression equation.

In the remainder of this section we discuss each of these methods in more detail. We first describe the ideas behind ABC regression adjustment strategies (Beaumont, Zhang and Balding (2002); Blum and François (2010)), as many of the dimension reduction techniques build on this framework.

    2.1 Regression Adjustment in ABC

Standard ABC methods suffer from the curse of dimensionality in that the rate of convergence of posterior expectations with respect to pABC(θ|sobs) (such as the Nadaraya–Watson estimator of the posterior mean) decreases dramatically as the dimension of the summary statistics, p, increases (Blum (2010a)). ABC regression adjustment (Beaumont, Zhang and Balding (2002)) aims to avoid this by explicitly modeling the discrepancy between s and sobs. When describing regression adjustment methods, for notational simplicity and clarity of exposition, we assume that the parameter of interest, θ, is univariate (i.e., q = 1). Regression adjustment methods may be readily applied to multivariate θ, by using a different regression equation for each parameter, θ1, . . . , θq, separately.

The simplest model for this is a homoscedastic regression in the region of sobs, so that

θi = m(si) + ei,

where (θi, si) ∼ p(s|θ)p(θ), i = 1, . . . , n, are draws from the prior predictive distribution, m(si) = E[θ|s = si] is the mean function, and the ei are zero-mean random variates with common variance. To estimate the conditional mean m(·), Beaumont, Zhang and Balding (2002) assumed a linear model

m(si) = α + β⊤si   (2)

in the neighborhood of sobs. An estimate of the mean function, m̂(·), is obtained by minimizing the weighted least squares criterion ∑_{i=1}^n wi‖m(si) − θi‖², where wi = Kǫ(‖si − sobs‖). A weighted sample from the posterior distribution, pABC(θ|sobs), is then obtained by the adjustment

θ∗i = m̂(sobs) + (θi − m̂(si))   (3)

for i = 1, . . . , n. In the above, the kernel scale parameter ǫ controls the bias-variance trade-off: increasing ǫ reduces variance by increasing the effective sample size (the number of accepted simulations when using a uniform kernel K), but increases bias arising from departures from a linear mean function m(·) and homoscedastic error structure (Blum (2010a)).

Blum and François (2010) proposed the more flexible, heteroscedastic model

θi = m(si) + σ(si)ei,   (4)

where σ²(si) = V[θ|s = si] denotes the conditional variance. This variance is estimated using a second regression model for the log of the squared residuals, that is, log(θi − m̂(si))² = log σ²(si) + ηi, where the ηi are independent, zero-mean variates with common variance. The equivalent adjustment to (3) is then given by

θ∗i = m̂(sobs) + [θi − m̂(si)] σ̂(sobs)/σ̂(si),   (5)

where σ̂(s) denotes the estimate of σ(s). The kernel scale parameter, ǫ, plays the same role as for the homoscedastic model, except with more flexibility on deviations from homoscedasticity. Nott et al. (2013) have demonstrated that regression adjustment ABC algorithms produce samples, {θ∗i}, for which first- and second-order moment summaries approximate adjusted expectation and variance for a Bayes linear analysis. We do not describe here an alternative regression adjustment method where the summary statistics are rather considered as the dependent variables and the parameters as the independent variables of the regression (Leuenberger and Wegmann (2010)).
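To make the adjustment concrete, a minimal R sketch of the homoscedastic local-linear adjustment (2)–(3) is given below, using base R only. The matrix S of simulated summary statistics (typically only the draws with nonzero kernel weight), the vector theta of corresponding parameter draws, the kernel weights w and the observed statistics s_obs are assumed to be available; all names are illustrative.

```r
# Local-linear regression adjustment, sketch of equations (2)-(3).
regression_adjust <- function(theta, S, w, s_obs) {
  S <- as.matrix(S)
  colnames(S) <- paste0("s", seq_len(ncol(S)))
  dat <- data.frame(theta = theta, S)
  fit <- lm(theta ~ ., data = dat, weights = w)      # weighted least squares fit of (2)
  new <- as.data.frame(matrix(s_obs, nrow = 1, dimnames = list(NULL, colnames(S))))
  m_si   <- fitted(fit)                              # m_hat(s_i)
  m_sobs <- as.numeric(predict(fit, newdata = new))  # m_hat(s_obs)
  m_sobs + (theta - m_si)                            # adjusted sample, equation (3)
}
```

The heteroscedastic version (5) would additionally fit a second weighted regression to the log squared residuals and rescale the residual term by σ̂(sobs)/σ̂(si).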

    2.2 Best Subset Selection Methods

Best subset selection methods are conceptually simple, but are cumbersome to manage for large numbers of potential summary statistics, s = (s1, . . . , sp). Exhaustive enumeration of the 2^p − 1 possible combinations of summary statistics is practically infeasible beyond a moderate value of p. This is especially true of Markov chain Monte Carlo or sequential Monte Carlo based analyses, which require one sampler implementation per combination. As a result, stochastic or deterministic (greedy) search procedures, such as forward or backward selection, are required to implement them.

2.2.1 A sufficiency criterion The first principled approach to dimension reduction in ABC was the ε-sufficiency concept proposed by Joyce and Marjoram (2008), which was used to determine whether to include an additional summary statistic, sk, in a model already containing statistics s1, . . . , sk−1. Here, noting that the difference between the log likelihoods of p(s1, . . . , sk|θ) and p(s1, . . . , sk−1|θ) is log p(sk|s1, . . . , sk−1, θ), Joyce and Marjoram (2008) defined the set of statistics s1, . . . , sk−1 to be ε-sufficient relative to sk if

δk = sup_θ log p(sk|s1, . . . , sk−1, θ) − inf_θ log p(sk|s1, . . . , sk−1, θ) ≤ ε.   (6)


Accordingly, if an estimate of δk (i.e., the “score” of sk relative to s1, . . . , sk−1) is greater than ε, then there is enough additional information content in sk to justify including it in the model. In practice, Joyce and Marjoram (2008) implement a conceptually equivalent assessment, whereby sk is added to the model if the ratio of posteriors

Rk(θ) = pABC(θ|s1, . . . , sk−1, sk) / pABC(θ|s1, . . . , sk−1)

differs from one by more than some threshold value T(θ) for any value of θ. As such, a statistic sk will be added to the model if the resulting posterior changes sufficiently at any point. The threshold, T(θ), is user-specified, with one particular choice described in Section 5 of Joyce and Marjoram (2008).

This procedure can be implemented within any stepwise search algorithm, each of which has various pros and cons. Following the definition (6), the resulting optimal subset of summary statistics is then ε-sufficient relative to each one of the remaining summary statistics. Here ε intuitively represents an acceptable error in determining whether sk contains further useful information in addition to s1, . . . , sk−1. This quantity is also user-specified, and so the final optimal choice of summary statistics will depend on the chosen value.

Sensitivity to the choice of ε aside, this approach may be criticized in that it assumes that every change to the posterior obtained by adding a statistic, sk, is beneficial. It is conceivable that attempting to include a completely noninformative statistic, where the observed statistic is unlikely to have been generated under the model, will result in a sufficiently modified posterior as measured by ε, but one which is more biased away from the true posterior p(θ|yobs) than without including sk. A toy example illustrating this was given by Sisson and Fan (2011).

A further criticism is that the amount of computation required to evaluate Rk(θ) for all θ, and on multiple occasions, is considerable, especially for large q. In practice, Joyce and Marjoram (2008) considered θ to be univariate, and approximated continuous θ over a discrete grid in order to keep computational overheads to acceptable levels. As such, this method appears largely restricted to dimension reduction for univariate parameters (q = 1).
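One way to carry out the posterior-ratio check in practice is to compare kernel density estimates of the two ABC posteriors on a grid of θ values. The sketch below is our own loose interpretation of that check for univariate θ, with a constant threshold; it is not the exact implementation of Joyce and Marjoram (2008), and all names are illustrative.

```r
# Grid-based check of whether adding statistic s_k changes the ABC posterior "enough".
# theta_with / w_with:       weighted ABC sample using statistics (s_1, ..., s_k)
# theta_without / w_without: weighted ABC sample using (s_1, ..., s_{k-1})
# threshold: user-specified tolerance T, taken constant here for simplicity.
include_statistic <- function(theta_with, w_with, theta_without, w_without,
                              threshold = 0.1, grid_n = 200) {
  lo <- min(theta_with, theta_without); hi <- max(theta_with, theta_without)
  d_with    <- density(theta_with,    weights = w_with / sum(w_with),
                       from = lo, to = hi, n = grid_n)$y
  d_without <- density(theta_without, weights = w_without / sum(w_without),
                       from = lo, to = hi, n = grid_n)$y
  R <- d_with / pmax(d_without, .Machine$double.eps)   # estimate of R_k(theta) on the grid
  any(abs(R - 1) > threshold)                          # TRUE: include s_k in the model
}
```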

2.2.2 An entropy criterion Nunes and Balding (2010) propose the entropy of a distribution as a heuristic to measure the informativeness of candidate combinations of summary statistics. Since entropy measures information and a lack of randomness (Shannon (1948)), the authors propose minimizing the entropy of the approximate posterior, pABC(θ|sobs), over subsets of the summary statistics, s, as a proxy for determining maximal information about a parameter of interest. High entropy results from a diffuse posterior sample, whereas low entropy is obtained from a posterior which is more precise in nature.

Nunes and Balding (2010) estimate entropy using the unbiased kth nearest neighbor estimator of Singh et al. (2003). For a weighted posterior sample, (w1, θ1), . . . , (wn, θn), where ∑_i wi = 1, this estimator can be written as

Ê = log[π^{q/2}/Γ(q/2 + 1)] − ψ(k) + log n + q ∑_{i=1}^n wi log Ĉ_i^{−1}(k/(n − 1)),   (7)

where q = dim(θ), ψ(x) = Γ′(x)/Γ(x) denotes the digamma function, and where Ĉi(·) denotes the empirical distribution function of the Euclidean distance from θi to the remainder of the weighted posterior sample, that is, of the weighted samples {(w̃j, ‖θi − θj‖)}_{j≠i}, where w̃j = wj/∑_{j≠i} wj. Following Singh et al. (2003), the original work of Nunes and Balding (2010) used k = 4 and was based on an equally weighted posterior sample (i.e., with wi = 1/n, i = 1, . . . , n), so that Ĉ_i^{−1}(k/(n − 1)) denotes the Euclidean distance from θi to its kth closest neighbor in the posterior sample {θ1, . . . , θi−1, θi+1, . . . , θn}.

While minimum entropy could in itself be used to evaluate the informativeness of a vector of summary statistics for θ (although see the criticism of entropy below), Nunes and Balding (2010) propose a second stage to their analysis, which aims to assess the performance of a candidate set of summary statistics using a measure of posterior error. For example, when the true parameter vector, θtrue, is known, the authors suggest the root sum of squared errors (RSSE), given by

RSSE = (∑_{i=1}^n wi ‖θi − θtrue‖²)^{1/2},   (8)

where the measure ‖θi − θtrue‖ compares the components of θ on a suitable scale (and so some componentwise standardization may be required).
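For reference, a small R sketch of the equally weighted special case of the estimator (7) (with k = 4) and of the error measure (8) follows; the function names are ours, and the code assumes a plain Euclidean metric on an already suitably standardized sample.

```r
# k-th nearest neighbour entropy estimate of an equally weighted posterior sample,
# following equation (7) with w_i = 1/n.
knn_entropy <- function(theta, k = 4) {
  theta <- as.matrix(theta)                       # n x q posterior sample
  n <- nrow(theta); q <- ncol(theta)
  d <- as.matrix(dist(theta))                     # pairwise Euclidean distances
  diag(d) <- Inf                                  # exclude the point itself
  r_k <- apply(d, 1, function(row) sort(row)[k])  # distance to the k-th closest neighbour
  log(pi^(q / 2) / gamma(q / 2 + 1)) - digamma(k) + log(n) + (q / n) * sum(log(r_k))
}

# Root sum of squared errors of a weighted posterior sample about theta_true, equation (8).
rsse <- function(theta, w, theta_true) {
  theta <- as.matrix(theta)
  err <- sweep(theta, 2, theta_true)              # theta_i - theta_true, row by row
  sqrt(sum(w * rowSums(err^2)))
}
```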


Naturally, the true parameter value, θtrue, is unknown in practice. However, if the simulated summary statistics from the samples (θi, si) are treated as observed data, it is clear that θtrue = θi for the posterior pABC(θ|si). As such, the RSSE can be easily computed with a leave-one-out technique.

As the subset of summary statistics that minimizes (8) will likely vary over observed data sets, si, Nunes and Balding (2010) propose minimizing the average RSSE over some number of simulated data sets which are close to the observed, sobs. To avoid circularity, Nunes and Balding (2010) define these “close” data sets to be the j = 1, . . . , n∗ simulated data sets, {sj}, that minimize ‖sjME − sME‖, where sjME and sME are the vectors of minimum entropy summary statistics computed via (7) from sj and the observed summary statistics, sobs, respectively. That is, the quantity

RSSE = (1/n∗) ∑_{j=1}^{n∗} RSSEj   (9)

is minimized (over subsets of summary statistics), where RSSEj corresponds to (8) using the simulated data set sj.

    cause the second stage directly measures error in theposterior with respect to a known truth, θtrue, whichis not typically considered in other ABC dimensionreduction approaches, albeit at the extra computa-tional expense of a two-stage procedure. A weak-ness of the first stage, however, is the assumptionthat addition of an informative statistic will reducethe entropy of the resulting posterior distribution.An example of when this does not occur is whenthe posterior distribution is diffuse with respect tothe prior—for instance, if an overly precise prior islocated in the distributional tails of the posterior(e.g., Jeremiah et al. (2011)). In this case, attempt-ing to include an informative additional statistic, sk,can result in a distribution that is more diffuse thanwith sk excluded. As such, the entropic approach istherefore mostly suited to models with relatively dif-fuse prior distributions. Another potential criticismof the first stage is that minimizing the entropy doesnot necessarily provide the minimal subset of suffi-cient statistics. This provides an argument for con-sidering the mutual information between θ and s,rather than the entropy (Barnes et al. (2012); seealso Filippi, Barnes and Stumpf (2012)). However,it is clear that the overall approach of Nunes andBalding (2010) could easily be implemented with al-ternative first-stage selection criteria.

2.2.3 AIC and BIC criteria Information criteria based on Akaike and Bayesian information are natural best subset selection techniques for summary statistic dimension reduction in ABC analyses. We introduce and develop these criteria in Section 3.1.

    2.3 Projection Techniques

Selecting a best subset of summary statistics from s = (s1, . . . , sp) suffers from the problem that it may require several statistics to provide the same information content as a single, highly informative statistic that was not specified in the initial set, s. To avoid this, projection techniques aim to combine the elements of s through linear or nonlinear transformations, in order to construct a potentially much lower-dimensional set of highly informative statistics.

One of the main advantages of projection techniques is that, unlike best subset selection methods, they scale well with increasing numbers of summary statistics. They can handle large numbers of possibly uninformative summary statistics, in addition to accounting for high levels of interdependence and multicollinearity. A minor disadvantage of projection techniques is that the final sets of projected summary statistics typically (but not universally) lack interpretability. In addition, most projection methods require the specification of a hyperparameter that governs the number of projections to perform.

2.3.1 Partial least squares regression Partial least squares regression seeks the orthogonal linear combinations of the explanatory variables which have high variance and high correlation with the response variable (e.g., Boulesteix and Strimmer (2007); Vinzi et al. (2010); Abdi and Williams (2010)). Wegmann, Leuenberger and Excoffier (2009) proposed the use of partial least squares regression for dimension reduction in ABC, where the explanatory variables are the suitably (e.g., Box–Cox) transformed summary statistics, s, and the response variable is the parameter vector, θ.

The output of a partial least squares analysis is the set of k orthogonal components of the regression design matrix

X = [ 1  s11  · · ·  s1p
      ⋮   ⋮    ⋱    ⋮
      1  sn1  · · ·  snp ]   (10)

that are optimally correlated (in a specific sense) with θ. Here, sij denotes the jth component of the ith simulated summary statistic, si.


To choose the appropriate number of orthogonal components, Wegmann, Leuenberger and Excoffier (2009) examine the root mean square error of θ for each value of k, as estimated by a leave-one-out cross-validation strategy. For a fixed number of components, k, this corresponds to

RMSEk = ((1/n) ∑_{i=1}^n ‖m̂_k^{−i}(si) − θi‖²)^{1/2},   (11)

where m̂_k^{−i}(s) denotes the mean response of the partial least squares regression, estimated without the ith simulated summary statistic, si (e.g., Mevik and Cederkvist (2004)). The optimal number of components is then chosen by inspection of the RMSEk values, based on minimum gradient change arguments (e.g., Mevik and Wehrens (2007)).

A potential disadvantage of partial least squares regression, as performed by Wegmann, Leuenberger and Excoffier (2009), is that it aims to infer a global linear relationship between θ and s based on draws from the prior predictive distribution, p(s|θ)p(θ). This may differ from the relationship observed in the region around s = sobs, and as such may produce unsuitable orthogonal components as a result. A workaround for this would be to follow Fearnhead and Prangle (2012) (see Section 2.3.3) and elicit the relationship between θ and s based on samples from a truncated prior, (θi, si) ∼ p(s|θ)p(θ)I(θ ∈ ΘR), where ΘR ⊂ Θ restricts the samples, θi, to regions of significant posterior density. One simple way to identify such a region is through a pilot ABC analysis (Fearnhead and Prangle (2012)).
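A sketch of this step using the plsr() interface of the R package pls is shown below; this is one possible implementation on a (sub)sample of simulations, not necessarily the tool used by Wegmann, Leuenberger and Excoffier (2009), and the object names are illustrative.

```r
# Partial least squares projection of the (transformed) summary statistics, with the number
# of components assessed by leave-one-out cross-validated RMSE as in equation (11).
library(pls)

pls_summaries <- function(theta, S, max_comp = 10) {
  dat <- data.frame(theta = theta, S = I(as.matrix(S)))   # keep S as a single matrix predictor
  fit <- plsr(theta ~ S, ncomp = max_comp, data = dat, validation = "LOO")
  print(RMSEP(fit))      # inspect RMSE_k for k = 1, ..., max_comp to pick k
  scores(fit)            # PLS components of the simulations: the new summary statistics
  # New data such as s_obs can be projected onto the same components with
  # predict(fit, newdata = ..., type = "scores").
}
```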

2.3.2 Neural networks In the regression setting, feed-forward neural networks can be considered as a nonlinear generalization of the partial least squares regression technique described above. Blum and François (2010) proposed the neural network as a machine learning approach to dimension reduction by estimating the conditional mean and variance functions, m(·) and σ²(·), in the nonlinear, heteroscedastic regression adjustment model (4) (see Section 2.1).

The neural network reduces the dimension of the summary statistics to H < p, using H hidden units in the network, z1, . . . , zH, defined as

zj = h(∑_{k=1}^p ω^{(1)}_{jk} sk + ω^{(1)}_{j0})   (12)

for j = 1, . . . , H. The ω^{(1)}_{jk} terms are the weights of the first layer of the neural network, and h(·) is a nonlinear function, typically the logistic function. The reduced and nonlinearly transformed summary statistics of the hidden units, zj, are then combined through the regression function of the neural network

m(s) = g(∑_{j=1}^H ω^{(2)}_j zj + ω^{(2)}_0),   (13)

where ω^{(2)}_j denotes the weights of the second layer of the neural network and g(·) is a link function. A similar neural network is used to model log σ²(s) (e.g., Nix and Weigend (1995)), with the possibility of allowing for a different number of hidden units to estimate heteroscedasticity in the regression adjustment compared to that in the mean function m(s).

Rather than dynamically determining the number of hidden units H, Blum and François (2010) propose to specify a fixed value, such as H = q, where q = dim(θ) is the number of parameters to infer. The weights of the neural network are then obtained by minimizing the regularized least-squares criterion

∑_{i=1}^n wi‖m(si) − θi‖² + λ‖ω‖²,

where ω is the vector of all weights in the neural network model for m(s), wi = Kǫ(‖si − sobs‖) is the weight of the sample (θi, si) ∼ p(s|θ)p(θ), and λ > 0 denotes the regularization parameter (termed the weight-decay parameter for neural networks). The idea of regularization is to shrink the weights toward zero so that only informative summary statistics contribute in the models (12) and (13) for m(s). Following the estimation of m(s), a similar regularization criterion is used to estimate log σ²(s). Both mean and variance functions can then be used in the regression adjustment of equation (5).
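A rough sketch of the mean-function step using nnet() from the R package nnet (the optimizer referred to in Section 4) is given below. The handling of the variance model, the fixed weight-decay value and the single random restart are simplifications of the Blum and François (2010) procedure; all object names are illustrative.

```r
# Nonlinear conditional mean m(s) via a single-hidden-layer network with H hidden units
# and weight decay lambda, followed by a mean-only regression adjustment.
library(nnet)

nn_adjust <- function(theta, S, w, s_obs, H = 1, lambda = 0.01) {
  S <- as.matrix(S)                                            # (standardized) summary statistics
  fit <- nnet(x = S, y = theta, weights = w, size = H, decay = lambda,
              linout = TRUE, trace = FALSE, maxit = 500)
  m_si   <- as.numeric(predict(fit, S))                        # m_hat(s_i)
  m_sobs <- as.numeric(predict(fit, matrix(s_obs, nrow = 1)))  # m_hat(s_obs)
  # Mean-only adjustment as in (3); the heteroscedastic rescaling of (5) would need a
  # second regularized fit to the log squared residuals.
  m_sobs + (theta - m_si)
}
```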

2.3.3 Minimum expected posterior loss Fearnhead and Prangle (2012) proposed a decision-theoretic dimension reduction method with a slightly different aim to previous dimension reduction approaches. Here, rather than constructing appropriate summary statistics to ensure that pABC(θ|sobs) ≈ p(θ|yobs) is a good approximation, pABC(θ|sobs) is alternatively required to be a good approximation in terms of the accuracy of specified functions of the model parameters. In particular, assuming that interest is in point estimates of the model parameters, if θtrue denotes the true parameter value and θ̂ an estimate, then Fearnhead and Prangle (2012) propose to choose those summary statistics that minimize the quadratic loss

L(θtrue, θ̂) = (θtrue − θ̂)⊤A(θtrue − θ̂)

for some q × q positive-definite matrix A. This loss is minimized for sobs = Ep(θ|yobs)(θ), the true posterior mean.

To estimate Ep(θ|y)(θ), Fearnhead and Prangle (2012) propose least squares regression models for the k = 1, . . . , q model parameters, (θ1, . . . , θq), given by

θik = Ep(θ|y)(θk) + ηik = αk + β⊤k f(si) + ηik,   (14)

where (θi, si) ∼ p(s|θ)p(θ) are draws from the prior predictive distribution, αk and βk are unknown regression parameters to be estimated, and ηik denotes a zero-mean noise process. Here f(s) is a vector of potentially nonlinear transformations of the data (i.e., of the original summary statistics). For example, in one application, Fearnhead and Prangle (2012) use the polynomial basis functions f(s) = (s, s², s³, s⁴), that is, a vector of length 4p, where p = dim(s) is the number of elements in s, consisting of the first four powers of each element of s. The choice of f(s) can be based on standard diagnostics of regression fit, such as BIC. If the prior π(θ) is diffuse with respect to the posterior, then one may estimate the regression model (14) based on samples from a truncated prior, (θi, si) ∼ p(s|θ)p(θ)I(θ ∈ ΘR), where ΘR ⊂ Θ restricts the samples, θi, to regions of significant posterior density (e.g., via a pilot ABC analysis). Clearly, more sophisticated alternatives to least squares regression may be used.

After fitting equation (14), the new, single summary statistic for the parameter θk is β̂⊤k f(s), where β̂k denotes the least squares estimate of βk. The resulting q-dimensional vector of new summary statistics is then used in a standard ABC analysis. Fearnhead and Prangle (2012) show that these new statistics can lead to posterior inferences that considerably outperform inferences based on the original statistics, s. Nott, Fan and Sisson (2012) demonstrate that these summary statistics can be viewed as Bayes linear estimates of the posterior mean.
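A minimal sketch of this construction for a single parameter, using ordinary least squares on a pilot sample and the polynomial basis f(s) = (s, s², s³, s⁴), is given below; the function name, the pilot-sample objects and the treatment of rank-deficient fits are our own choices.

```r
# Semi-automatic summary statistic for one parameter, sketch of equation (14).
# theta_k: pilot draws of the k-th parameter; S: corresponding n x p matrix of summary statistics.
# Returns a function mapping a vector of raw summaries to the new scalar summary beta_hat' f(s).
make_semiauto_summary <- function(theta_k, S, powers = 1:4) {
  f <- function(s) as.numeric(sapply(powers, function(a) s^a))  # f(s) = (s, s^2, s^3, s^4)
  F_mat <- t(apply(S, 1, f))                                    # n x (4p) matrix of basis terms
  fit <- lm(theta_k ~ F_mat)                                    # least squares fit of (14)
  beta_hat <- coef(fit)[-1]                                     # drop the intercept alpha_k
  function(s) sum(beta_hat * f(s), na.rm = TRUE)                # new summary for theta_k
}

# Illustrative usage: new_stat <- make_semiauto_summary(theta_pilot[, k], S_pilot); new_stat(s_obs)
```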

    2.4 Regularization Approaches

Regularization approaches aim to reduce overfitting in a model by penalizing model complexity. A simple example where overfitting can occur in ABC is the standard regression adjustment (Beaumont, Zhang and Balding (2002); Section 2.1), where there is a risk of over-adjusting the parameters, θi, in the direction of uninformative summary statistics via (3). Regularization is used as part of the estimation of the neural network weights in the projection technique proposed by Blum and François (2010) (see Section 2.3.2). As such, the regression adjustment of Beaumont, Zhang and Balding (2002) is a procedure that could greatly benefit from the inclusion of regularization techniques. We introduce the ridge regression adjustment to ABC in Section 3.2.

    2.5 Other Methods

There are a number of alternative approaches to dimension reduction for ABC, including methods that aim to circumvent the dimensionality issue, that we do not include in our comparative analysis (Section 4). Drovandi, Pettitt and Faddy (2011) proposed to adopt ideas from indirect inference (e.g., Heggland and Frigessi (2004)) as a means to identify summary statistics for an ABC analysis. This involves specification of a model p̃(·|θ̃) which is similar to p(·|θ), but which is computationally tractable. The idea is that estimates of θ̃ under p̃(·|θ̃), such as maximum likelihood estimates or posterior means, are likely to be informative about θ if p(·|θ) and p̃(·|θ̃) are sufficiently similar. This approach can be considered similar in spirit to that of Fearnhead and Prangle (2012), which uses estimated posterior means under a pilot ABC analysis (see Section 2.3.3). Blum (2010b) proposed a Bayesian criterion related to the BIC (see Section 3.1) as a best subset selection procedure. The idea is to implement a Bayesian analysis of the standard regression adjustment model (3). The criterion, called the evidence approximation, seeks the best subset of summary statistics to regress the parameter θ. In comparison to the BIC, the evidence criterion is attractive because it contains no approximation in its derivation. However, the downside is that its computation requires the tuning of the Bayesian linear regression hyperparameters. Additionally, Aeschbacher, Beaumont and Futschik (2012) proposed to use boosting for choosing summary statistics, and Jung and Marjoram (2011) developed a genetic algorithm that weights the summary statistics so that individual statistics do not contribute equally to the comparisons between observations and simulations. The aim is that the uninformative summary statistics should ideally have negligible weights.

Finally, a number of recent ABC modeling approaches have attempted to find ways of accurately handling the full vector of initial statistics, s [or the full data set, s = S(y) = y], thereby avoiding the need to perform dimension reduction. Bonassi, You and West (2011) propose fitting a (p + q)-dimensional mixture of Gaussian distributions to the sample (θi, si) ∼ p(s|θ)p(θ), i = 1, . . . , n, and then finding the distribution of θ|sobs by conditioning on observing s = sobs. This approach potentially requires a large number of mixture components to accurately model the joint density when (p + q) is large. Fan, Nott and Sisson (2012) suggest an approximation to p(s|θ) obtained by approximating each marginal likelihood function, p(si|θ), using a mixture of experts model, where the weights, mean and variance of each mixture component are allowed to depend on θ, and then inducing dependence between these marginals using a mixture of multivariate Gaussian distributions. This approach requires continuous summary statistics for the mixture regression and is practically useful for moderate p (i.e., hundreds of summary statistics). Writing y = (y1, . . . , yp), Barthelmé and Chopin (2011) propose to factorize the likelihood as p(y|θ) = ∏_i p(yi|y1:i−1, θ) and construct an ABC approximation of each component in turn [i.e., pABC(yi|y1:i−1, θ) = ∫ Kǫ(‖yi − yobs,i‖) p(yi|y1:i−1, θ) dyi], with computation performed using an expectation-propagation algorithm (Minka (2001)). This approach, while potentially fast and accurate, assumes that conditional simulation of yi ∼ p(yi|y1:i−1, θ) is available for i = 1, . . . , n, and so is not suitable for all models and analyses. Last, Jasra et al. (2012) exploit the structure of hidden Markov models to perform an iterative sequence of ABC analyses, each using only a single data point in each analysis, and Nakagome, Fukumizu and Mano (2012) propose a novel approach to post-processing ABC importance sampling output whose convergence rate is claimed to avoid the curse of dimensionality.

3. NEW DIMENSION REDUCTION METHODS

In this section we introduce two new dimension reduction criteria for ABC methods. The first is a best subset selection procedure deriving from AIC and BIC criteria, constructed under implementation of the local linear model of equation (2) (Beaumont, Zhang and Balding (2002)). A similar idea was proposed and tested for a Gaussian model by Sedki and Pudlo (2012). The second is a modification to the fitting of (2) by considering ridge regression instead of least squares regression. Both of these methods are now implemented in the freely available R package abc (Csilléry, François and Blum (2012)).

    3.1 AIC and BIC Criteria

Akaike information criterion (AIC) and Bayesian information criterion (BIC) provide a measure of the relative goodness of fit of a statistical model. Each can be expressed as the sum of the maximized log-likelihood that measures the fit of the model to the data, and a penalty for model complexity (Akaike (1974); Schwarz (1978)). While evaluation of log p(yobs|θ̂mle) or log p(sobs|θ̂mle) is unavailable in the ABC framework and determination of the maximum likelihood estimator, θ̂mle, is challenging, a simple and tractable likelihood function is available through the local-linear regression model of equation (2) (Section 2.1).

Specifically, we consider the local-linear regression model of equation (2) of Beaumont, Zhang and Balding (2002) for each parameter θ1, . . . , θq and assume independent Gaussian errors, ej ∼ N(0, σ²j), for j = 1, . . . , q. Then the AIC becomes

AIC = ñ log ∏_{j=1}^q σ̂²j + 2d,   (15)

where d = q(p + 1) is the number of estimated regression parameters and ñ is the effective number of simulations used in the local-linear regression model, which we define as ñ = ∑_{i=1}^n I(wi > 0) when the kernel Kǫ has compact support. Alternative definitions of the effective number of simulations, such as c ∑_{i=1}^n wi for some c > 0, can be on an arbitrary scale, since the least squares regression solution is insensitive to the scale of the weights. For any fixed value of c, the value of c ∑_{i=1}^n wi will decrease as p = dim(s) increases, so that it will artificially favor larger numbers of (even uninformative) summary statistics. Our definition of ñ guarantees that the AIC scores are comparable for different subsets of summary statistics. A downside is that this definition of ñ is only suitable for kernels, Kǫ, with compact support.

In equation (15), σ̂²j is defined as the weighted mean of squared residuals for the regression of θj and is given by

σ̂²j = ∑_{i=1}^n wi[θij − m̂j(si)]² / ∑_{i=1}^n wi,

where θij is the jth component of θi and m̂j(s) denotes the estimate of the mean function mj(s) = E[θj|s]. For small sample sizes, the corrected AIC, the so-called AICc, is given by replacing d in (15) by d + d(d + 1)/(ñ − d − 1) (Hurvich and Tsai (1989)).


In the same manner, the BIC can be defined as

BIC = ñ log ∏_{j=1}^q σ̂²j + d log ñ.   (16)

Alternative penalty terms involving the hat matrix of the regression could also be used in the above (e.g., Hurvich, Simonoff and Tsai (1998); Irizarry (2001); Konishi, Ando and Imoto (2004)).

It is instructive to note that in using the linear regression adjustment (3), the above information criteria may be expressed as

xIC = ñ log ∏_{j=1}^q Var(θ∗j) + penalty term,

where θ∗j is the jth element of the regression adjusted vector θ∗ = (θ∗1, . . . , θ∗q). As such, up to the penalty terms, both AIC and BIC seek the combination of summary statistics that minimizes the product of the marginal variances of the adjusted posterior sample. Similarly to the entropy criterion of Nunes and Balding (2010) (see Section 2.2.2), these information criteria will select those summary statistics that maximize the precision of the posterior distribution, pABC(θ|sobs). However, unlike Nunes and Balding (2010), this precision is traded off by a penalty for model complexity.

A rationale for the construction of AIC and BIC in this manner is that the summary statistics that should be included within an ABC analysis are those which are good predictors of θ. However, an obvious requirement for AIC or BIC to identify an informative statistic is that the statistic varies (with θ) within the local range of the regression model. If a statistic is informative outside of this range, but uninformative within it, it will not be identified as informative under these criteria.
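As an illustration, the criteria (15) and (16) for a candidate subset of summary statistics could be computed along the following lines, reusing the weighted linear fits of Section 2.1; this is a sketch with illustrative names, not the implementation of the abc package.

```r
# AIC/BIC of the local-linear regression model (2) for a candidate subset of summary statistics.
# theta: n x q matrix of parameter draws; S: n x p matrix of the candidate summary statistics;
# w: kernel weights K_eps(||s_i - s_obs||), assumed to come from a compact-support kernel.
abc_ic <- function(theta, S, w) {
  theta <- as.matrix(theta); S <- as.matrix(S)
  keep <- w > 0                              # simulations with nonzero kernel weight
  theta <- theta[keep, , drop = FALSE]; S <- S[keep, , drop = FALSE]; w <- w[keep]
  n_tilde <- nrow(theta)                     # effective number of simulations, n~
  d <- ncol(theta) * (ncol(S) + 1)           # number of regression parameters, q(p + 1)
  sigma2 <- apply(theta, 2, function(th) {   # weighted mean squared residual for each theta_j
    fit <- lm(th ~ S, weights = w)
    sum(w * residuals(fit)^2) / sum(w)
  })
  base <- n_tilde * sum(log(sigma2))         # n~ log prod_j sigma_hat_j^2
  c(AIC = base + 2 * d, BIC = base + d * log(n_tilde))
}
```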

    3.2 Regularization via Ridge Regression

As described in Section 2.1, the local-linear regression adjustment of Beaumont, Zhang and Balding (2002) fits the linear model

θi = α + β⊤si + ei

based on the prior predictive samples (θi, si) ∼ p(s|θ)p(θ) and with regression weights given by wi = Kǫ(‖si − sobs‖). (As before, we describe the case where θ is univariate for notational simplicity and clarity of exposition, but the approach outlined below can be readily implemented for each component of a multivariate θ.) However, in fitting the model by minimizing the weighted least squares criterion, ∑_{i=1}^n wi‖α + β⊤si − θi‖², there is a risk of over-adjustment by adjusting the parameter values via (3) in the direction of uninformative summary statistics.

To avoid this, implicit dimension reduction within the regression framework can be performed by alternatively minimizing the regularized weighted sum of squares (Hoerl and Kennard (1970))

∑_{i=1}^n wi‖θi − (α + β⊤si)‖² + λ‖β‖²   (17)

with regularization parameter λ > 0. As with the regularization component within the neural network model of Blum and François (2010) (Section 2.3.2), with ridge regression the risk of over-adjustment is reduced because the regression coefficients, β, are shrunk toward zero by imposing a penalty on their magnitudes. Note that while we consider ridge regression here, a number of alternative regularization procedures could be implemented, such as the Lasso method.

An additional advantage of ridge regression is that standard least squares estimates, (α̂LS, β̂LS)⊤ = (X⊤WX)^{−1}X⊤WΘ, are not guaranteed to have a unique solution. Here X is the n × (p + 1) design matrix given in equation (10), Θ = (θ1, . . . , θn)⊤ is the n × 1 column vector of sampled θi, and W = diag(w1, . . . , wn) is an n × n diagonal matrix of weights. The lack of a unique solution can arise through multicollinearity of the summary statistics, which can result in singularity of the matrix X⊤WX. In contrast, minimization of the regularized weighted sum of squares (17) always has a unique solution, provided that λ > 0. This solution is given by (α̂ridge, β̂ridge)⊤ = (X⊤WX + λIp)^{−1}X⊤WΘ, where Ip denotes the p × p identity matrix. There are several approaches for dealing with the regularization parameter λ, including cross-validation and generalized cross-validation to identify an optimal value of λ (Golub, Heath and Wahba (1979)), as well as averaging the regularized estimates (α̂ridge, β̂ridge)⊤ obtained for different values of λ (Taniguchi and Tresp (1997)).
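A sketch of the resulting ridge-adjusted sample for a fixed λ is given below; leaving the intercept unpenalized (via the penalty matrix P) is our own choice, as are the object names.

```r
# Ridge regression adjustment with fixed regularization parameter lambda, minimal sketch.
# theta: length-n vector of accepted parameter draws; S: n x p matrix of their summary
# statistics; w: kernel weights; s_obs: observed summary statistics.
ridge_adjust <- function(theta, S, w, s_obs, lambda = 0.01) {
  S <- as.matrix(S)
  X <- cbind(1, S)                                  # n x (p + 1) design matrix, as in (10)
  P <- diag(c(0, rep(1, ncol(S))))                  # penalty matrix; the intercept is not shrunk
  coefs <- solve(t(X) %*% (w * X) + lambda * P,     # (X'WX + lambda P)^{-1} X'W Theta
                 t(X) %*% (w * theta))
  m_si   <- as.numeric(X %*% coefs)                 # fitted values m_hat(s_i)
  m_sobs <- as.numeric(c(1, s_obs) %*% coefs)       # m_hat(s_obs)
  m_sobs + (theta - m_si)                           # ridge-regularized version of adjustment (3)
}
```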

    4. A COMPARATIVE ANALYSIS

We now provide a comparative analysis of the previously described methods of dimension reduction within the context of three previously studied analyses in the ABC literature. Specifically, this includes the analysis of a coalescent model with recombination (Joyce and Marjoram (2008)), an evaluation of the evolutionary fitness cost of mutation in drug-resistant tuberculosis (Luciani et al. (2009)) and an assessment of the number and size-distribution of particle inclusions in the production of clean steels (Bortot, Coles and Sisson (2007)).

Each analysis is based on n = 1,000,000 simulations where the parameter θ is drawn from the prior distribution p(θ). The performance of each method is measured through the RSSE criterion (9) following Nunes and Balding (2010), based on the same randomly selected subset of n∗ = 100 samples (θi, si) = (θtrue, sobs) as “observed” data sets. When evaluating the RSSE error measure of equation (8), we give a weight wi = 1 to the accepted simulations and a weight of 0 otherwise. As the value of the RSSE (8) depends on the scale of each parameter, we standardize the parameters in each example by dividing the parameter values by the standard deviation obtained from the n = 1,000,000 simulations (with the exception of the first example, where the parameters are on similar scales). For comparative ease, and to provide a performance baseline for each example, all RSSE results are presented relative to the RSSE obtained when using the maximal vector of summary statistics and no regression adjustment. In this manner, a relative RSSE of x/−x denotes an x% worsening/improvement over the baseline score.

Within each ABC analysis, we use Euclidean distance within an Epanechnikov kernel Kǫ(‖s − sobs‖). The Euclidean distances are computed after standardizing the summary statistics with a robust estimate of the standard deviation (the mean absolute deviation). The kernel scale parameter, ǫ, is determined as the value at which exactly 1% of the simulations, (θi, si), have nonzero weight. This yields exactly ñ = 10,000 simulations that form the final sample from each posterior. To perform the method of Fearnhead and Prangle (2012), a randomly chosen 10% of the n simulations were used to fit the regression model that determines the choice of summary statistics, with the remaining 90% used for the ABC analysis. The final ABC sample size ñ = 10,000 was kept equal to the other methods by slightly adjusting the scale parameter, ǫ. In addition, for the method of Fearnhead and Prangle (2012), following exploratory analyses, the regression model (14) was fitted using f(s) = (s, s², s³, s⁴) for examples 1 and 2 (as described in Section 2.3.3) and using f(s) = (log s, [log s]², [log s]³, [log s]⁴) for example 3, always resulting in 4 × p independent variables in the regression model of equation (14).

When using neural networks or ridge regression to estimate the conditional mean and variance, m(s) and σ²(s), we take the pointwise median of the estimated functions obtained with the regularization parameters λ = 10⁻³, 10⁻² and 10⁻¹. These values of λ assume that the summary statistics and the parameters have been standardized before fitting the regression function (Ripley (1994)). However, because the optimization procedure for neural networks (the R function nnet) only finds local optima, in this case we take the pointwise median of ten estimated functions, with each optimization initialized from a different random starting point, and randomly choosing the regularization parameter with equal probability from the above values (see Taniguchi and Tresp (1997)).

    4.1 Example 1: A Coalescent Analysis

This model was previously considered by Joyce and Marjoram (2008) and Nunes and Balding (2010), each while proposing their respective ABC dimension reduction strategies (see Sections 2.2.1 and 2.2.2). The analysis focuses on joint estimation of the scaled mutation rate, θ̃, and the scaled recombination rate, ρ, in a coalescent model with recombination (Nordborg (2007)). Under this model, 5001 base pair DNA sequences for 50 individuals are generated from the coalescent model, with recombination, under the infinite-sites mutation model, using the software ms (Hudson (2002)). The initial summary statistics, s = (s1, . . . , s6), are the number of segregating sites (s1), the pairwise mean number of nucleotidic differences (s2), the mean R² across pairs separated by


Table 1
Relative RSSE for examples 1 and 2. The leftmost column shows the minimal RSSE when considering only one summary statistic (with no regression adjustment). Rightmost columns show relative RSSE using all summary statistics under no, homoscedastic and heteroscedastic regression adjustment. All RSSE are relative to the RSSE obtained when using no regression adjustment with all summary statistics. The score of the best method in each analysis (row) is emphasized in boldface.

                           One optimal              All summary statistics
                           statistic (no adj.)    No adj.   Homo adj.   Hetero adj.
Example 1   θ̃                −7 (s1)                 0         −3          −3
            ρ                  9 (s5)                 0         −5          −4
            (θ̃, ρ)             7 (s1)                 0          0          −7
Example 2   α                  6                      0         −3          −3
            c                 −7                      0         −5          −5
            ρ                 −9                      0         −8          −8
            µ                −14                      0         −5          −6
            (α, c, ρ, µ)       5                      0         −4          −4

and (θ̃, ρ) jointly], regression adjustments generally improve the inference when using all six summary statistics, which is consistent with previous results (Nunes and Balding (2010)). The only exception is when jointly estimating (θ̃, ρ), where homoscedastic linear adjustment neither decreases nor increases the error obtained with the pure rejection algorithm.

Next, we investigate the performance of each dimension reduction technique. Table 2 and Figure 1 show the relative RSSE obtained under each dimension reduction method for each parameter combination and under heteroscedastic regression adjustment. For all three examples, more complete tables that contain the results obtained with no regression adjustment and homoscedastic adjustment can be found in the supplementary information to this article (Blum et al. (2013)).

The performance achieved with AIC, AICc or BIC is comparable to (i.e., the same or slightly better than) the result obtained when including all six population genetic statistics. When using the ε-sufficiency criterion, we find that the performance is improved for the inference on θ̃ only. The only best subset selection method for dimension reduction that substantially and uniformly improves the performance of ABC posterior estimates is the entropy-based approach. For the projection techniques, all methods (partial least squares, neural nets and minimum expected posterior loss) outperform the adjustment method based on all six population genetics statistics, with a large performance advantage for partial least squares when estimating (θ̃, ρ) jointly. By contrast, ridge regression provides no improvement over the standard regression adjustment (the “All” column).

Based on these results, a loose performance ranking of the dimension reduction methods can be obtained by computing, for each method, the mean (relative) RSSE over all parameter combinations θ̃, ρ and (θ̃, ρ) using the heteroscedastic adjustment. The worst performers were ridge regression and the ε-sufficiency criterion (with a mean relative RSSE of −3%). These are followed by the standard regression adjustment with all summary statistics (−5%) and the AIC/BIC, neural nets and the posterior loss method (−6%). The best performing methods are partial least squares (−10%) and the two-stage entropy-based procedure (−16%).

4.2 Example 2: The Fitness Cost of Drug Resistant Tuberculosis

We now consider an example of Markov processes for epidemiological modeling. If a pathogen, such as Mycobacterium tuberculosis, mutates to gain an evolutionary advantage, such as antibiotic resistance, it is biologically plausible that this mutation will come at a cost to the pathogen's relative fitness. Based on a stochastic model describing the transmission and evolutionary dynamics of Mycobacterium tuberculosis, and on incidence and genotypic data of the IS6110 marker, Luciani et al. (2009) estimated the posterior distribution of the pathogen's transmission cost and relative fitness.


Table 2
Relative RSSE (in %) for examples 1–3 for different parameter combinations using each method of dimension reduction, and under heteroscedastic regression adjustment. Columns denote no dimension reduction (All), BIC, AIC, AICc, the ε-sufficiency criterion (ε-suff), the two-stage entropy procedure (Ent), partial least squares (PLS), neural networks (NNet), minimum expected posterior loss (Loss) and ridge regression (Ridge). All RSSE are relative to the RSSE obtained when using no regression adjustment with all summary statistics. The score of the best method in each analysis (row) is emphasized in boldface.

                       Best subset selection               Projection techniques       Regularization
              All    BIC    AIC    AICc   ε-suff   Ent     PLS    NNet¹       Loss     Ridge¹
Example 1
θ̃             −3     −5     −5     −5     −6      −11      −6     −4          −7        1
ρ             −4     −6     −6     −6      0      −12      −7     −8          −7       −3
(θ̃, ρ)        −7     −7     −7     −7      –      −24     −16     −7          −6       −6
Example 2
α             −3    −15    −15    −15      0      −17     −13    −15         −17      −15
c             −5    −15    −15    −15     −8      −15      −8    −12          −9       −9
ρ             −8    −16    −16    −16     −8      −16       1    −12          −9      −10
µ             −6    −18    −18    −18     −8      −13     −10    −13         −12      −12
(α, c, ρ, µ)  −4    −19    −19    −19      –      −13     −10     −9         −12      −11
Example 3
τ            −49    −47    −47    −48    −19      −52     −22    −20/−42     −75      −48/−48
σ            −45    −46    −47    −46    −15      −50     −15    −21/−37     −56      −43/−43
ξ            −27    −29    −29    −28    −13      −32     −28     −7/−41     −41      −26/−44
(τ, σ, ξ)    −39    −39    −40    −39      –      −42     −11     −4/−38     −60      −39/−32

¹For the third example, the first value is found by integrating out the regularization parameter, whereas the second one is found by choosing an optimal regularization parameter with cross-validation. In examples 1 and 2, integration over the regularization parameter is performed.

Fig. 1. Relative RSSE for the different methods of dimension reduction in the three examples. All RSSE are relative to the RSSE obtained when using no regression adjustment with all summary statistics. Methods of dimension reduction include no dimension reduction (All), AIC/BIC, the ε-sufficiency criterion (ε-suff), the two-stage entropy procedure (Ent), partial least squares (PLS), neural networks (NNet), minimum expected posterior loss (Loss) and ridge regression (Ridge). The crosses correspond to situations for which there is no result available.


The model contained q = 4 free parameters: the transmission rate, α, the transmission cost of drug resistant strains, c, the rate of evolution of resistance, ρ, and the mutation rate of the IS6110 marker, µ.

Luciani et al. (2009) summarized information generated from the stochastic model through p = 11 summary statistics. These statistics were expertly elicited as quantities that were expected to be informative for one or more model parameters, and included the number of distinct genotypes in the sample, gene diversity for sensitive and resistant cases, the proportion of resistant cases and measures of the degree of clustering of genotypes, etc. It is considered likely that there is dependence and potentially replicate information within these statistics.

As before, we examine the relative performance of the statistics without using dimension reduction techniques. Table 1 shows that for the univariate analysis of c, ρ or µ, performing rejection sampling ABC with a single, well-chosen summary statistic can provide an improved performance over a similar analysis using all 11 summary statistics, under any form of regression adjustment. In particular, the proportion of isolates that are drug resistant is the individual statistic which is most informative to estimate c (with a relative RSSE of −7%) and ρ (−9%). For the marker mutation rate, µ, the most informative statistic is the number of distinct genotypes, with a relative RSSE of −14%. Conversely, an analysis using all summary statistics with a regression adjustment offers the best inferential performance for α alone, or for (α, c, ρ, µ). These results provide support for recent arguments in favor of "marginal regression adjustments" (Nott et al. (2013)), whereby the univariate marginal distributions of a full multivariate ABC analysis are replaced by separately estimated marginal distributions using only statistics relevant for each parameter. Here, more precisely estimated margins can improve the accuracy of the multivariate posterior sample, beyond the initial analysis.

The performance results of each dimension reduction method are shown in Table 2 and Figure 1. In contrast with the previous example, here the use of the AIC/BIC criteria can substantially decrease posterior errors. For example, compared to the linear adjustment of all 11 summary statistics, which produces a mean relative RSSE between −3% and −8% depending on the parameter (Table 2), using the AIC/BIC criteria results in a relative RSSE of between −15% and −19%. The ε-sufficiency criterion produces more equivocal results, however, as the error is sometimes increased with respect to baseline performance (e.g., +6% when estimating α with homoscedastic adjustment) and sometimes reduced (e.g., −8% for c, ρ and µ with heteroscedastic adjustment). As with the previous example, the entropy criterion provides a clear improvement to the ABC posterior, and this improvement is almost comparable to that produced by AIC/BIC. Finally, the projection and regularization methods mostly all provide comparable and substantive improvements compared to the baseline error, with only partial least squares producing more equivocal results (e.g., +1% when estimating ρ).

Based on these results, the loose performance ranking of the dimension reduction methods determines the worst performers to be the standard least squares regression adjustment (with a mean relative RSSE of −5%), the ε-sufficiency approach (−6%) and partial least squares (−8%). These are followed by ridge regression (−11%), neural networks and the posterior loss method (−12%). The best performing methods for this analysis are the two-stage entropy-based procedure (−15%) and the AIC/BIC criteria (−17%).

In this example, it is interesting to compare the performance of the standard linear regression adjustment of all 11 summary statistics (mean relative RSSE of −5%) with that of the ridge regression equivalent (mean relative RSSE of −11%). The increase in performance with ridge regression may be attributed to its more robust handling of multicollinearity of the summary statistics than under the standard regression adjustment. To see this, Figure 2 illustrates the relationship between the relative RSSE (again, relative to using all summary statistics and no regression adjustment) and the condition number of the matrix X⊤WX, for both the standard regression (top panel) and ridge regression (bottom panel) adjustments based on inference for (α, c, ρ, µ). The condition number of X⊤WX is given by κ = λmax/λmin, where λmax and λmin are the largest and smallest eigenvalues of X⊤WX. Extremely large condition numbers are evidence of multicollinearity.
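For illustration, the following minimal R sketch (not the authors' code) computes κ and the two adjustments being compared. It assumes X is the n x (p+1) design matrix of accepted summary statistics with an intercept column, w the smoothing kernel weights, and theta the accepted parameter values; the ridge penalty λ is illustrative.

```r
## Minimal sketch of the condition-number diagnostic and of the two regression
## adjustments being compared (illustrative names: X is the n x (p+1) design
## matrix with an intercept column, w the kernel weights, theta the accepted
## parameter values, lambda an illustrative ridge penalty).
XtWX <- t(X) %*% (w * X)                     # X'WX with W = diag(w)

ev    <- eigen(XtWX, symmetric = TRUE, only.values = TRUE)$values
kappa <- max(ev) / min(ev)                   # condition number lambda_max / lambda_min

## Weighted least squares coefficients: unstable when kappa is extremely large.
beta_ls <- solve(XtWX, t(X) %*% (w * theta))

## Ridge coefficients: the penalty keeps the system well conditioned even when
## the smallest eigenvalue of X'WX is (numerically) zero.
lambda     <- 0.01
beta_ridge <- solve(XtWX + lambda * diag(ncol(X)), t(X) %*% (w * theta))
```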

Figure 2 demonstrates that for large values of the condition number (e.g., for κ > 10^8), the least-squares-based regression adjustment clearly performs very poorly. The region κ > 10^8 corresponds to almost 5% of all simulations, and for these cases the relative error is hugely increased (w.r.t. rejection) to anywhere between 5% and 200%.


Fig. 2. Scatterplots of relative RSSE versus the condition number of the matrix X⊤WX for linear least squares (top) and ridge (bottom) regression adjustments. Points are based on joint inference for (α, c, ρ, µ) in example 2 using 1000 randomly selected vectors of summary statistics, si, as "observed" data. When the minimum eigenvalue, λmin, is zero, the (infinite) condition number is arbitrarily set to 10^25 for visual clarity (open circles on the scatterplot).

In contrast, for ridge regression, the relative errors corresponding to κ > 10^8 are no larger than the errors obtained for nonextreme condition numbers. This analysis clearly illustrates that, unlike ridge regression, the standard least squares regression adjustment can perform particularly poorly when there is multicollinearity between the summary statistics.

In terms of the original analysis of Luciani et al. (2009), which used all eleven summary statistics with no regression adjustment (although with a very low value for ε), the above results indicate that a more efficient analysis may have been achieved by using a suitable dimension reduction technique.

4.3 Example 3: Quality Control in the Production of Clean Steels

Our final example concerns the statistical modeling of extreme values. In the production of clean steels, the occurrence of microscopic imperfections (termed inclusions) is unavoidable. The strength of a clean steel block is largely dependent on the size of the largest inclusion. Bortot, Coles and Sisson (2007) considered an extreme value twist on the standard stereological problem (e.g., Baddeley and Jensen (2004)), whereby inference is required on the size and number of 3-dimensional inclusions, based on data obtained from those inclusions that intersect with a 2-dimensional slice. The model assumes a Poisson point process of inclusion locations with rate parameter τ > 0, and that inclusion size exceedances above a measurement threshold of 5 µm are drawn from a generalized Pareto distribution with scale and shape parameters σ > 0 and ξ, following standard extreme value theory arguments (e.g., Coles (2001)).

The observed data consist of 112 cross-sectional inclusion diameters measured above 5 µm. The summary statistics thereby comprise 112 equally spaced quantiles of the cross-sectional diameters, in addition to the number of inclusions observed, yielding p = 113 summary statistics in total. The ordering of the summary statistics creates strong dependences between them, a fact which can be exploited by dimension reduction techniques. Bortot, Coles and Sisson (2007) considered two models based on spherical or ellipsoidal shaped inclusions. Our analysis here focuses on the ellipsoidal model.
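As a concrete illustration, a minimal R sketch of assembling these p = 113 statistics from a vector of cross-sectional exceedance diameters might look as follows; the function and object names are ours, not the authors'.

```r
## Sketch of the p = 113 summary statistics described above: the number of
## observed inclusions plus 112 equally spaced quantiles of the cross-sectional
## diameters exceeding the 5 micron threshold (names are illustrative).
steel_summaries <- function(d, n_quantiles = 112) {
  probs <- seq(0, 1, length.out = n_quantiles)
  c(n_inclusions = length(d),
    quantile(d, probs = probs, names = FALSE))
}
```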


By construction, the large number (2^113) of possible combinations of summary statistics means that the best subset selection methods are strictly not practicable for this analysis, unless the number of summary statistics is reduced further a priori. In order to facilitate at least some comparison with the other dimension reduction approaches, for the best subset selection methods only, we consider six candidate subsets. Each subset consists of the number of observed inclusions in addition to 5, 10, 20, 50, 75 or 112 empirical quantiles of the inclusion size exceedances (the latter corresponds to the complete set of summary statistics). Due to the extreme value nature of this analysis, the parameter estimates are likely to be more sensitive to the precise values of the larger quantiles. As such, rather than using equally spaced quantiles, we use a scheme which favors quantiles closer to the maximum inclusion, and we always include the maximum inclusion.
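The exact weighting rule is not reproduced here; the following R sketch shows one purely illustrative scheme of this kind, packing quantile levels increasingly densely towards the maximum and always including it.

```r
## A purely illustrative quantile-selection scheme of the kind described above
## (not the rule used in the paper): k quantile levels packed increasingly
## densely towards the maximum, which is always included.
upper_weighted_probs <- function(k) {
  sort(c(1, 1 - 0.5^seq_len(k - 1)))   # e.g. k = 5: 0.50, 0.75, 0.875, 0.9375, 1
}

d <- rexp(112) + 5                      # dummy exceedance diameters for illustration
subset_stats <- c(n_inclusions = length(d),
                  quantile(d, probs = upper_weighted_probs(5), names = FALSE))
```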

The relative RSSE obtained under each dimension reduction method is shown in Table 2 and Figure 1. In comparison to an analysis using all 113 summary statistics and regression adjustment (the "All" column), the best subset selection approaches do not in general offer any improvement. While the entropy-based method provides a slight improvement, the relative RSSE under the ε-sufficiency criterion is substantially worse (along with partial least squares). Of course, these results are limited to the few subsets of statistics considered, and it is possible that alternative subsets could perform substantially better. However, it is computationally untenable to evaluate this possibility based on exhaustive enumeration of all subsets.

When using neural networks to perform the regression adjustment based on computing the pointwise median of the m(s) and σ^2(s) estimates, obtained using varying regularization parameter values (see the introduction to Section 4), the relative performance is quite poor (left-hand side RSSE values in Table 2). The mean relative RSSE is −13% for neural networks, compared to −40% for heteroscedastic least squares regression. As an alternative approach, rather than averaging over the regularization parameter λ, we instead choose the value of λ ∈ {10^-3, 10^-2, 10^-1} that minimizes the leave-one-out error of θ [equation (11)]. This approach considerably improves the performance of the network (right-hand side RSSE values in Table 2), with the mean relative RSSE improving to the same level as for heteroscedastic regression.
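A hedged sketch of this leave-one-out selection of λ is given below, shown for the ridge adjustment for brevity; the exact form of the error in equation (11) is not reproduced, and a simple held-out squared error is used as a stand-in.

```r
## Hedged sketch of choosing lambda on the grid {1e-3, 1e-2, 1e-1} by minimizing
## a leave-one-out prediction error of theta (a squared-error criterion is used
## here as a stand-in for equation (11)); s and theta are as in Section 4.
loo_error <- function(s, theta, lambda) {
  n <- nrow(s)
  P <- diag(c(0, rep(1, ncol(s))))                   # do not penalize the intercept
  err <- numeric(n)
  for (i in seq_len(n)) {
    X <- cbind(1, s[-i, , drop = FALSE])
    b <- solve(crossprod(X) + lambda * P, crossprod(X, theta[-i]))
    err[i] <- (theta[i] - sum(c(1, s[i, ]) * b))^2   # held-out squared error
  }
  mean(err)
}

lambdas     <- c(1e-3, 1e-2, 1e-1)
lambda_best <- lambdas[which.min(sapply(lambdas, loo_error, s = s, theta = theta))]
```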

Adopting the same procedure to determine the regularization parameter within ridge regression also yields a mean gain in performance, from −39% to −42%, although the joint parameter inference on (τ, σ, ξ) actually performs worse under this alternative approach. The variability in these results highlights the importance of making an optimal choice of the regularization parameter for an ABC analysis.

The minimum expected posterior loss approach performs particularly well here. This approach has also been shown to perform well in a similar analysis: that of performing inference using quantiles of a large number of independent draws from the (intractable) g-and-k distribution (Fearnhead and Prangle (2012)).

The loose performance ranking of each of the dimension reduction methods finds that the worst performers are the ε-sufficiency criterion (with a mean relative RSSE of −16%) and partial least squares (−19%). Neural networks and AIC/BIC perform just as well as standard least squares regression (−40%), ridge regression slightly outperforms standard regression (−42%) and the entropy-based approach is a further slight improvement at −44%. The clear winner in this example is the posterior loss approach, with a mean relative RSSE of −58%.

    5. DISCUSSION

The process of dimension reduction is a critical and influential part of any ABC analysis. In this article we have provided a comparative review of the major dimension reduction approaches (and introduced two new ones) in order to provide some guidance to the practitioner in choosing the most appropriate technique for their own analysis. A summary of the qualitative features of each dimension reduction method is shown in Table 3, and a comparison of the relative performances of each method for each example is illustrated in Figure 3. As with each individual example, we may compute an overall performance ranking of the dimension reduction methods by averaging the mean relative RSSE values over the examples. Performing worse, on average, than a standard least squares regression adjustment with no dimension reduction (with an overall mean relative RSSE of −17%) are the ε-sufficiency technique (−8%) and partial least squares (−12%). Performing better, on average, than standard least squares regression are ridge regression and neural networks (−19%) and AIC/BIC (−21%).


Table 3
Summary of the main features of the different methods of dimension reduction for ABC

Class                   Method    Hyper-parameter                 Choice of hyper-parameter          Computational burden
Best subset selection   AIC/BIC   None                            –                                  Substantial/greedy alg.
                        ε-suff    T(θ)                            User choice                        Substantial/greedy alg.
                        Ent       None                            –                                  Substantial/greedy alg.
Projection techniques   PLS       Number of PLS components, k     Cross-validation                   Weak
                        NNet      Regularization parameter, λ     Integration or cross-validation    Moderate (optimization algorithm)
                        Loss      Choice of basis functions       BIC                                Weak (closed-form solution)
Regularization          Ridge     Regularization parameter, λ     Integration or cross-validation    Weak (closed-form solution)

In this study, the top performers, on average, were the entropy-based procedure and the minimum expected posterior loss approach, with an overall mean relative RSSE of −25%. It is worth emphasizing that the potential gains in performing a regression adjustment alone (with all summary statistics and no dimension reduction) can be quite substantial. This suggests that regression adjustment should be an integral part of the majority of ABC analyses. Further gains in performance can then be obtained by combining regression adjustment with dimension reduction procedures, although in some cases (such as with the ε-sufficiency technique and partial least squares) performance can sometimes worsen.

While being ranked in the top three, a clear disadvantage of the entropy-based procedure and the AIC/BIC criteria is the quantity of computation required.

Fig. 3. Mean relative RSSE values using each method of dimension reduction and for each example. Methods of dimension reduction include no dimension reduction (All), AIC/BIC, the ε-sufficiency criterion (ε-suff), the two-stage entropy procedure (Ent), partial least squares (PLS), neural networks (NNet), minimum expected posterior loss (Loss) and ridge regression (Ridge). For examples 1 and 2, the ridge regression and neural network estimates of m(s) and σ^2(s) have been obtained by taking the pointwise median curve over varying values of the regularization parameter, λ = 10^-3, 10^-2 and 10^-1 (see the introduction to Section 4). For example 3, an optimal value of λ was chosen based on a cross-validation procedure (see Section 4.3).


This primarily occurs because the best subset selection procedures require evaluation of all 2^p potential models. For examples 1 and 2, a greedy algorithm was able to find the optimum solution in a reasonable time. This was not possible for example 3. Additionally, in this latter case, for the subsets of summary statistics considered, the performance obtained by implementing computationally expensive methods of dimension reduction was barely an improvement over the computationally cheap least squares regression adjustment. This raises the important point that the benefits of performing potentially expensive forms of dimension reduction over, say, the simple linear regression adjustment should be evaluated prior to their full implementation. We also note that the second stage of the entropy-based method (Section 2.2.2) targets minimization of (9), the same error measure used in our comparative analysis. As such, this approach is likely to be numerically favored in our results.
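The following R sketch shows a generic greedy (forward) search of this kind, as an alternative to enumerating all 2^p subsets; it is not the authors' implementation, and the function criterion is a placeholder for an AIC/BIC-type score of the regression of θ on the selected statistics.

```r
## Hedged sketch of a greedy (forward) search over subsets of summary
## statistics. `criterion(subset)` is a placeholder for an AIC/BIC-type score
## of the regression of theta on the selected statistics.
greedy_select <- function(p, criterion) {
  chosen <- integer(0)
  best   <- Inf
  repeat {
    candidates <- setdiff(seq_len(p), chosen)
    if (length(candidates) == 0) break
    scores <- sapply(candidates, function(j) criterion(sort(c(chosen, j))))
    if (min(scores) >= best) break       # stop when no single addition improves the score
    best   <- min(scores)
    chosen <- c(chosen, candidates[which.min(scores)])
  }
  sort(chosen)
}
```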

The top ranked (ex aequo) minimum expected posterior loss approach particularly outperforms the other dimension reduction methods in the final example (the production of clean steels). In such analyses, with large numbers of summary statistics (here p = 113), nonlinear methods such as neural networks may become overparametrized, and simpler alternatives, such as least squares or ridge regression adjustment, can work more effectively. This is naturally explained through the usual bias-variance trade-off: more complex regression models such as neural networks reduce the bias of the estimate of m(s) [and optionally σ^2(s)], but in doing so the variance of the estimate is increased. This effect can be especially acute for high-dimensional regression (Geman, Bienenstock and Doursat (1992)).

Our analyses indicate that the original least squares, linear regression adjustment (Beaumont, Zhang and Balding (2002)) can sometimes perform quite well, despite its simplicity. However, the presence of multicollinearity between the summary statistics can cause severe performance degradation compared to not performing the regression adjustment (see Figure 2). In such situations, regularization procedures, such as ridge regression (e.g., example 2 and Figure 2), and projection techniques can be beneficial.

An important issue with regularization procedures, such as neural networks and ridge regression, is the handling of the regularization parameter, λ. The "averaging" procedure that was used in the first two examples proved quite suboptimal in the third, where a cross-validation procedure to select a single best parameter value produced much improved results. This problem can be particularly critical for neural networks with large numbers of summary statistics, p, as the number of network weights is much larger than p, and, accordingly, massive shrinkage of the weights (i.e., large values of λ) is required to avoid overfitting.

The posterior loss approach produced the superior performance in the third example. In general, a strong performance of this method can be primarily attributed to two factors. First, in the presence of large numbers of highly dependent summary statistics, the extra analysis stage in determining the most appropriate regression model (14), by choosing f(s) through, for example, BIC diagnostics, affords the opportunity to reduce the complexity of the regression in a simple and relatively low-parameterized manner. This was not a primary contributor in example 3, however, as the regression [equation (14)] was directly performed on the full set of 113 statistics. Given the benefits of using regularization methods in this setting, it is possible that a ridge regression model would allow a more robust estimate of the posterior mean (as a summary statistic) as part of this process. Second, the posterior loss approach determines the number of summary statistics to be equal to the number of posterior quantities of interest: in this case, q = 3 posterior parameter means. This small number of derived summary statistics naturally allows more precise posterior statements to be made, compared to dimension reduction methods that produce a much larger number of equally informative statistics. Of course, the dimension advantage here is strongly related to the number of parameters (q = 3) and summary statistics (p = 113) in this example. However, it is not fully clear how any current methods of dimension reduction for ABC would perform for substantially more challenging analyses with considerably higher numbers of parameters and summary statistics. This is because the curse of dimensionality in ABC (Blum (2010a)) has tended to restrict existing applications of ABC methods to problems of moderate parameter dimension, although this may change in the future.
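A hedged sketch of the regression step just described is given below, with f(s) taken to be s itself as in example 3; the object names (s_pilot, theta_pilot) are illustrative stand-ins for pilot simulations, not the authors' code.

```r
## Hedged sketch of regressing the parameters on the summary statistics of
## pilot simulations and using the fitted posterior means as the q new
## summaries; s_pilot (n x 113) and theta_pilot (n x 3) are illustrative names.
X <- cbind(1, as.matrix(s_pilot))                                # add intercept
B <- solve(crossprod(X), crossprod(X, as.matrix(theta_pilot)))   # least squares coefficients

## The fitted posterior means E(theta | s) become the q = 3 new summary statistics.
new_summaries <- function(s_new) drop(c(1, s_new) %*% B)
```

As noted above, replacing the least squares solve with a ridge-penalized one would be a natural variation when the statistics are strongly collinear.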

What is very apparent from this study is that there is no single "best" method of dimension reduction for ABC. For example, while the posterior loss and entropy-based methods were the best performers for example 3, AIC and BIC were ranked first in the analysis of example 2, and partial least squares outperformed the posterior loss approach in example 1. A number of factors can affect the most suitable choice for any given analysis. As discussed above, these can include the number of initial summary statistics, the amount of dependence and multicollinearity within the statistics, the computational overheads of the dimension reduction method, the requirement to suitably determine hyperparameters and sensitivity to potentially large numbers of uninformative statistics.

One important point to understand is that all of the ABC analyses in this review were performed using the rejection algorithm, optionally followed by some form of regression adjustment. While alternative, potentially more efficient and accurate methods of ABC posterior simulation exist, such as Markov chain Monte Carlo or sequential Monte Carlo based samplers, the computational cost of separately implementing such an algorithm 2^p times (in the case of best subset selection methods) means that such dimension reduction methods can become rapidly untenable, even for small p. The price of the more computationally practical approach of using a fixed, large number of samples is that decisions on the dimension reduction of the summary statistics will be made on potentially worse estimates of the posterior than those available under superior sampling algorithms. As such, the final derived summary statistics may in fact not be those which are most appropriate for subsequent use in, for example, Markov chain Monte Carlo or sequential Monte Carlo based algorithms.

However, this price is arguably a necessity. It is practically important to evaluate the performance of any dimension reduction procedure in a given analysis. Here we used a criterion [the RSSE of equation (9)] that is based on a leave-one-out procedure. When using a fixed, large number of samples, evaluation of such a performance diagnostic is entirely practicable, as no further model simulations are required. This idea is also relevant to methods of dimension reduction for model selection (Barnes et al. (2012); Estoup et al. (2012)), where a misclassification rate based on a leave-one-out procedure can serve as a comparative criterion.
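The general shape of such a leave-one-out evaluation is sketched below in R. The exact form of the RSSE in equation (9) is not reproduced; abc_estimate (rejection plus optional adjustment under a given dimension reduction) and error_measure (standing in for the criterion of equation (9)) are placeholders.

```r
## Hedged sketch of a leave-one-out style performance evaluation using only the
## existing simulations: each simulated (theta_i, s_i) pair in a validation set
## is treated in turn as the "observed" data, and the remaining simulations are
## used to form the ABC estimate. `abc_estimate` and `error_measure` are
## placeholders, not the authors' functions.
evaluate_method <- function(theta, s, validation_idx, abc_estimate, error_measure) {
  errs <- sapply(validation_idx, function(i) {
    est <- abc_estimate(s_obs = s[i, ],
                        theta = theta[-i, , drop = FALSE],
                        s     = s[-i, , drop = FALSE])
    error_measure(est, theta[i, ])
  })
  mean(errs)                 # aggregate error over the validation set
}
```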

    ACKNOWLEDGMENTS

S. A. Sisson is supported by the Australian Research Council through the Discovery Project Scheme (DP1092805). M. G. B. Blum is supported by the French National Research Agency (DATGEN project, ANR-2010-JCJC-1607-01).

    SUPPLEMENTARY MATERIAL

Supplement to "A Comparative Review of Dimension Reduction Methods in Approximate Bayesian Computation" (DOI: 10.1214/12-STS406SUPP; .pdf). The supplement contains, for each of the three examples, a comprehensive comparison of the errors obtained with the different methods of dimension reduction.

    REFERENCES

Abdi, H. and Williams, L. J. (2010). Partial least square regression, projection on latent structure regression. Wiley Interdiscip. Rev. Comput. Stat. 2 433–459.

Aeschbacher, S., Beaumont, M. A. and Futschik, A. (2012). A novel approach for choosing summary statistics in approximate Bayesian computation. Genetics 192 1027–1047.

Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. Automat. Control AC-19 716–723. MR0423716

Allingham, D., King, R. A. R. and Mengersen, K. L. (2009). Bayesian estimation of quantile distributions. Stat. Comput. 19 189–201. MR2486231

Baddeley, A. and Jensen, E. B. V. (2004). Stereology for Statisticians. Chapman & Hall/CRC, Boca Raton, FL.

Barnes, C., Filippi, S., Stumpf, M. P. H. and Thorne, T. (2012). Considerate approaches to constructing summary statistics for ABC model selection. Stat. Comput. 22 1181–1197. MR2992293

Barthelmé, S. and Chopin, N. (2011). Expectation-propagation for summary-less, likelihood-free inference. Available at http://arxiv.org/abs/1107.5959.

Beaumont, M. A. (2010). Approximate Bayesian computation in evolution and ecology. Annual Review of Ecology, Evolution, and Systematics 41 379–406.

Beaumont, M. A., Zhang, W. and Balding, D. J. (2002). Approximate Bayesian computation in population genetics. Genetics 162 2025–2035.

Beaumont, M. A., Marin, J. M., Cornuet, J. M. and Robert, C. P. (2009). Adaptivity for ABC algorithms: The ABC–PMC scheme. Biometrika 96 983–990.

Bertorelle, G., Benazzo, A. and Mona, S. (2010). ABC as a flexible framework to estimate demography over space and time: Some cons, many pros. Mol. Ecol. 19 2609–2625.

Blum, M. G. B. (2010a). Approximate Bayesian computation: A nonparametric perspective. J. Amer. Statist. Assoc. 105 1178–1187. MR2752613

Blum, M. G. B. (2010b). Choosing the summary statistics and the acceptance rate in approximate Bayesian computation. In COMPSTAT 2010: Proceedings in Computational Statistics (G. Saporta and Y. Lechevallier, eds.) 47–56. Springer, New York.

Blum, M. G. B. and François, O. (2010). Non-linear regression models for approximate Bayesian computation. Stat. Comput. 20 63–73. MR2578077

Blum, M. G. B., Nunes, M. A., Prangle, D. and Sisson, S. A. (2013). Supplement to "A Comparative Review of Dimension Reduction Methods in Approximate Bayesian Computation." DOI:10.1214/12-STS406SUPP.

Bonassi, F. V., You, L. and West, M. (2011). Bayesian learning from marginal data in bionetwork models. Stat. Appl. Genet. Mol. Biol. 10 Art. 49, 29. MR2851291

Bortot, P., Coles, S. G. and Sisson, S. A. (2007). Inference for stereological extremes. J. Amer. Statist. Assoc. 102 84–92. MR2345549

Boulesteix, A.-L. and Strimmer, K. (2007). Partial least squares: A versatile tool for the analysis of high-dimensional genomic data. Brief. Bioinformatics 8 32–44.

Coles, S. (2001). An Introduction to Statistical Modeling of Extreme Values. Springer, London. MR1932132

Csilléry, K., François, O. and Blum, M. G. B. (2012). abc: An R package for approximate Bayesian computation. Methods in Ecology and Evolution 3 475–479.

Csilléry, K., Blum, M. G. B., Gaggiotti, O. and François, O. (2010). Approximate Bayesian computation in practice. Trends in Ecology and Evolution 25 410–418.

Del Moral, P., Doucet, A. and Jasra, A. (2012). An adaptive sequential Monte Carlo method for approximate Bayesian computation. Stat. Comput. 22 1009–1020.

Drovandi, C. C. and Pettitt, A. N. (2011). Estimation of parameters for macroparasite population evolution using approximate Bayesian computation. Biometrics 67 225–233. MR2898834

Drovandi, C. C., Pettitt, A. N. and Faddy, M. J. (2011). Approximate Bayesian computation using indirect inference. J. R. Stat. Soc. Ser. C. Appl. Stat. 60 317–337. MR2767849

Estoup, A., Lombaert, E., Marin, J. M., Guillemaud, T., Pudlo, P., Robert, C. and Cornuet, J. M. (2012). Estimation of demo-genetic model probabilities with approximate Bayesian computation using linear discriminant analysis on summary statistics. Molecular Ecology Resources 12 846–855.

Fan, Y., Nott, D. J. and Sisson, S. A. (2012). Regression density estimation ABC. Unpublished manuscript.

Fearnhead, P. and Prangle, D. (2012). Constructing summary statistics for approximate Bayesian computation: Semi-automatic approximate Bayesian computation (with discussion). J. R. Stat. Soc. Ser. B Stat. Methodol. 74 419–474. MR2925370

Filippi, S., Barnes, C. P. and Stumpf, M. P. H. (2012). Contribution to the discussion of Fearnhead and Prangle (2012). Constructing summary statistics for approximate Bayesian computation: Semi-automatic approximate Bayesian computation. J. R. Stat. Soc. Ser. B Stat. Methodol. 74 459–460.

Geman, S., Bienenstock, E. and Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Comput. 4 1–58.

Golub, G. H., Heath, M. and Wahba, G. (1979). Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics 21 215–223. MR0533250

Heggland, K. and Frigessi, A. (2004). Estimating functions in indirect inference. J. R. Stat. Soc. Ser. B Stat. Methodol. 66 447–462. MR2062387

Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12 55–67.

Hudson, R. R. (2002). Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics 18 337–338.

Hurvich, C. M., Simonoff, J. S. and Tsai, C.-L. (1998). Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion. J. R. Stat. Soc. Ser. B Stat. Methodol. 60 271–293. MR1616041

Hurvich, C. M. and Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika 76 297–307. MR1016020

Irizarry, R. A. (2001). Information and posterior probability criteria for model selection in local likelihood estimation. J. Amer. Statist. Assoc. 96 303–315. MR1952740

Jasra, A., Singh, S. S., Martin, J. S. and McCoy, E. (2012). Filtering via approximate Bayesian computation. Statist. Comput. 22 1223–1237.

Jeremiah, E., Sisson, S. A., Marshall, L., Mehrotra, R. and Sharma, A. (2011). Bayesian calibration and uncertainty analysis for hydrological models: A comparison of adaptive-Metropolis and sequential Monte Carlo samplers. Water Resources Research 47 W07547, 13pp.

Joyce, P. and Marjoram, P. (2008). Approximately sufficient statistics and Bayesian computation. Stat. Appl. Genet. Mol. Biol. 7 Art. 26, 18. MR2438407

Jung, H. and Marjoram, P. (2011). Choice of summary statistic weights in approximate Bayesian computation. Stat. Appl. Genet. Mol. Biol. 10 Art. 45, 25. MR2851287

Konishi, S., Ando, T. and Imoto, S. (2004). Bayesian information criteria and smoothing para