
Abstract

The ancient demographic past of populations can be uncovered by a careful examination of the distribution of mutations within populations. ∂a∂i is an open source Python software package that can generate expected allele frequency data from a given demographic model, as well as fit the parameters of demographic models to real data. ∂a∂i has been used in several population genetics papers, but the quality of its optimizers has created a bottleneck for users. The optimizers are slow, tending to spend the great majority of their time evaluating points near the optimum without converging. The optimizers also tend to explore parameter values far from the optimum, which are often expensive to compute.

In this thesis, an in-depth investigation of ∂a∂i's optimization process is performed. A more precise description of the optimizers' problems and a characterization of typical log-likelihood landscapes are given, existing explanations for the optimizers' issues are explored, and newly identified algorithmic problems in both the line search routine and the finite difference routine of the algorithm, along with possible fixes, are noted.


Contents

1 Introduction
2 Simulators in general and ∂a∂i in particular
  2.1 Generation of synthetic mutation distributions
  2.2 Overview of ∂a∂i
    2.2.1 Allele frequency spectra
    2.2.2 The equations behind ∂a∂i
    2.2.3 How does optimization fit into ∂a∂i?
3 Optimizers in general and ∂a∂i's optimization in particular
  3.1 Definition of optimization
  3.2 Practical challenges
  3.3 Identified problems with ∂a∂i's optimization
4 Exploration of ∂a∂i's optimizers and objective function
  4.1 Non-log-transformed optimizers were consistently poorer performers
  4.2 Multiple optima are possible, even with a simple model
5 Line search and its effect on optimizer path
  5.1 The line search process
  5.2 Is poor line search responsible for the exploration of values far from the optimum?
6 Calculation of the gradient in dadi's optimizers
  6.1 How the gradient is calculated now
  6.2 A better way to calculate the gradient
  6.3 Choosing an optimal h
  6.4 An explanation for cycling near the optimum?
  6.5 An example of improvement


1 Introduction

The distribution of mutations within populations holds clues to the demographic past of those populations. A higher or lower incidence of rare alleles can indicate a population size change, for example, and migration can give two populations a tendency toward similar distributions of new mutations. Software exists to analyze or simulate these distributions, offering insight into the corresponding demographic history or a prediction of future genetic diversity. ∂a∂i is an open source Python software package developed by Ryan Gutenkunst that can generate expected allele frequency data from a given demographic model as well as fit the parameters of demographic models to real data [1]. ∂a∂i has been used in several population genetics papers [2, 3, 4] and is currently in use to determine a demographic model for modern human populations in Africa that may have mated with ancient non-Homo sapiens members of the Homo genus.

∂a∂i uses optimization routines to find the set of parameters that are most likely to have generated a given data set under a particular demographic model. The log-likelihood of the parameters for the observed data is the objective function, and ∂a∂i currently offers a choice of six optimizers from the SciPy Python module [5], including a brute force method. However, ∂a∂i's current optimizers have problems that have limited the software's ease of use. The optimizers tend to be slow for users, creating a bottleneck in the analysis. User-observed problems include the qualitative observation that points far from the optimum were explored even when the optimizer had previously visited the region near the optimum. The optimizers also tend to spend significant time exploring a small area without converging, even for models with few parameters.

In this thesis, an in-depth investigation of ∂a∂i's optimizers is performed. A more precise description of the optimizers' problems and a characterization of typical log-likelihood landscapes are given, existing explanations for the optimizers' issues are explored, and newly identified algorithmic problems in both the line search routine and the finite difference routine of the algorithm, along with possible fixes, are noted.

2 Simulators in general and ∂a∂i in particular

2.1 Generation of synthetic mutation distributions

There are many scenarios where simulations of the expected genetic diversity of populations are useful [6]. The most straightforward application is prediction of future demographic statistics based on present or future conditions. For example, such a simulation could determine the expected genetic diversity of a population of endangered animals after several decades of population decline. In this scenario, the most appropriate demographic model is known.

Another application is to evaluate various demographic hypotheses and their parameters, in order to choose the most likely model for a given set of observed data. This may be done by generating population genetics statistics for the different models, which can guide the identification of the best-fitting demographic model. When a probable demographic history is known, it can be used to guide population sampling for future studies, so that individuals comprising an appropriate amount of the total genetic diversity are studied.

A third application is the use of simulators to create an expected null distribution against which real data can be compared. This technique can then identify the effect of evolutionary forces like natural selection at certain locations in the genome.


Simulation software can create synthetic data from a wide variety of demographic scenarios. Software may be as specialized as Aquasplatche, which simulates populations coexisting in a river environment, or as generally applicable as ms, which generates data based on the classic Wright-Fisher model. Simulators may incorporate demographic features such as spatial arrangements of populations, various life cycles and fecundities, and events like migration, population size changes, and population splits and joins. Some software packages allow selection to influence the distribution of mutations, while others focus on generating neutral data. Genetically, recombination is sometimes accounted for, and the final data may take the form of a collection of sequences, frequency counts of SNPs, or data incorporating more complicated mutations like microsatellites.

Synthetic data can be generated by a few different techniques. Simulators are often classified into one of two groups: forward simulators and backward simulators [6]. The difference is the direction in time the simulator's algorithm moves as the simulation is completed. Forward simulators start from a point in the past and move forwards in time, while backward simulations start at the most recent point and travel back in time. "Forward" and "backward" are useful general distinctions, but each category contains distinct techniques.

Backward methods tend to be coalescent-based Monte Carlo methods, such as ms. These techniques focus on lineages, following mutations backwards in time and probabilistically joining them at a common ancestor until the root of the demographic tree is found. A newer addition to the backward methods is the fast coalescent approximation method, used in fastsimcoal and MaCS. This technique is an approximation to the full coalescent simulation that allows for larger genetic data sizes and a generally faster simulation. Such methods claim to give a much faster result with little to no loss in accuracy [7].

In some forward simulators, such as BottleSim and Nemo, each individual is simulated through an entire life cycle, from birth to reproduction to death. This approach is not efficient, but it is clearly more straightforward to incorporate complex events, like unusual mating cycles, into such a simulation. ∂a∂i's diffusion-based method is a forward method, as it integrates its diffusion equation towards the present, but it does not focus on individuals. Instead, a discrete approximation to the partial differential equation model (Equation 1) is used to identify the expected distribution of alleles.

It is worth noting that in all simulation programs, the investigator must determine the general shape of a demographic model before simulating it or fitting its parameters to a particular data set. This requires that the user of the software make some assumptions about the results before generating them. In addition, it is difficult to evaluate all possible shapes of a demographic model, especially for higher population counts. For example, archaic introgression is a recently discovered phenomenon [8], and was not considered in previous models.

2.2 Overview of ∂a∂i

∂a∂i is a forward simulator that utilizes a diffusion-based approach to efficiently generate synthetic data from complex demographic histories. Demographic events can include population size changes following any rate function, population splits and joins, and migration, for up to three populations existing simultaneously. In addition to generating and visualizing data, ∂a∂i can fit models to data by optimizing the model's parameters.

∂a∂i's simulated data is given in the form of an allele frequency spectrum, a compact way to communicate the distribution of genetic variation. The diffusion equation at the heart of ∂a∂i generates a continuous model of the distribution of new mutations, which is then discretized to


match the number of chromosomes in the populations.

In order to use ∂a∂i, the user specifies a model by writing a Python function with method calls for each event in the demographic model. The function should take as arguments the parameters of the model, the number of chromosomes in each population, and the number of points to use in the discrete approximation of the continuous diffusion function. It should return an allele frequency spectrum. This function can then be used not only to generate data, but also to optimize the model's parameters for a given data set. See Figure 1 for an example.

import dadi

def simple_branching_model(params, ns, pts):
    """
    A population splits into two populations of sizes nu1 and nu2 relative
    to the initial population size, and then continues to exist for tb time.

        | |
        | |   size is 1
        | |
        | \
        | \___________
        |  |\         \
       |n|           |n|
       |u|           |u|   tb time passes
       |1|           |2|
       | |           | |________
        # 1           2

    Note that tb is in units of 2*Na generations, where Na is the ancestral
    population size. Times in dadi are twice the time values in ms/macs.

    params = (nu1, nu2, tb)
    ns = (pop1_count, pop2_count)
    pts is the number of grid points in each direction for representing phi.
    """
    # Create the grid that phi will be calculated on.
    xx = dadi.Numerics.default_grid(pts)
    # Unpack parameters.
    nu1, nu2, tb = params
    # Calculate the initial phi on the grid.
    phi = dadi.PhiManip.phi_1D(xx)
    # Split the population into pop 1 (bianca) and pop 2 (yaruba).
    phi = dadi.PhiManip.phi_1D_to_2D(xx, phi)
    # Integrate phi forward in time for tb.
    phi = dadi.Integration.two_pops(phi, xx, tb, nu1=nu1, nu2=nu2)
    # Finally, get the spectrum from phi and return it.
    afs = dadi.Spectrum.from_phi(phi, ns, (xx, xx))
    return afs

Figure 1: An example dadi function with explanation


2.2.1 Allele frequency spectra

An allele frequency spectrum (AFS) is a matrix containing information about the distribution of allele frequencies for one or more populations. Each population is represented by a dimension of the matrix, and the number of elements along each dimension is one more than the number of chromosomes sampled from that population, since a derived allele can appear on 0 through n chromosomes.

%pylab inline

# The first population doubles while the second population is decreased by
# half, after their split ~17,000 years ago. The first population has 12
# chromosomes simulated, while the second population has 10.
afs_2pop = simple_branching_model((2, 0.5, 0.07), (12, 10), 50)

# Marginalize out the second population to get the AFS of the first population.
afs_1pop = afs_2pop.marginalize([1])
print(afs_1pop.round(3))

dadi.Plotting.plot_1d_afs(afs_1pop)
title("One dimensional allele frequency spectrum")

[-- 1.198 0.522 0.339 0.252 0.201 0.168 0.143 0.125 0.111 0.1 0.091 --]


print(afs_2pop.round(3))

dadi.Plotting.plot_single_2d_sfs(afs_2pop)
title("Two-dimensional frequency spectrum")

[[-- 0.449 0.164 0.073 0.034 0.016 0.007 0.003 0.001 0.0 0.0]
 [0.885 0.12 0.078 0.049 0.03 0.017 0.01 0.005 0.002 0.001 0.0]
 [0.253 0.083 0.062 0.045 0.031 0.02 0.013 0.007 0.004 0.002 0.001]
 [0.107 0.057 0.048 0.039 0.03 0.022 0.015 0.01 0.006 0.003 0.001]
 [0.051 0.038 0.036 0.033 0.028 0.023 0.017 0.012 0.008 0.004 0.002]
 [0.025 0.024 0.026 0.027 0.025 0.022 0.018 0.014 0.01 0.006 0.003]
 [0.013 0.015 0.018 0.02 0.021 0.02 0.018 0.016 0.012 0.008 0.005]
 [0.006 0.009 0.012 0.015 0.017 0.018 0.018 0.016 0.014 0.011 0.008]
 [0.003 0.005 0.008 0.01 0.013 0.015 0.016 0.016 0.015 0.013 0.012]
 [0.001 0.003 0.004 0.007 0.009 0.011 0.013 0.015 0.016 0.015 0.017]
 [0.001 0.001 0.002 0.004 0.006 0.008 0.01 0.013 0.015 0.016 0.024]
 [0.0 0.0 0.001 0.002 0.003 0.005 0.007 0.01 0.013 0.016 0.034]
 [0.0 0.0 0.0 0.001 0.001 0.002 0.004 0.006 0.01 0.015 --]]

Each entry in the AFS represents the number of polymorphisms in which the corresponding number of chromosomes carries the derived allele. For example, if the entry in row 6, column 4 is 3, then three polymorphisms have the derived allele in 6 chromosomes from population one and 4 chromosomes from population two. In ∂a∂i, an AFS is visualized as the natural logarithm of this value.

AFS are often created from SNP data. While a minority of human SNPs are poly-allelic, most come in only two varieties \cite{threeSNP}. If all polymorphic sites used to create an AFS are biallelic and independent, then the AFS is a complete summary of the data.

In addition to the fact that the AFS is a convenient and visually accessible version of population variation data, popular summary statistics can also be computed directly from the AFS. These include Tajima's D statistic, an indicator of the number of polymorphic differences relative to expectation.
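For instance, ∂a∂i's Spectrum objects expose such statistics directly as methods; the snippet below continues the session above (the method names follow the ∂a∂i documentation, but should be checked against the installed version):

# Summary statistics computed directly from the spectrum of Section 2.2.1.
print(afs_1pop.Tajima_D())  # Tajima's D
print(afs_1pop.pi())        # estimated expected heterozygosity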

In fact, qualitative inferences about the history of a population can be made directly by visual inspection of the AFS. Consider these one-dimensional spectra (one population), created by ∂a∂i as the results of various models.


def instant_size_change(params, ns, pts):
    """
                          T
                          |
    __________________----------
            1             nu
    ------------------__________
    """
    xx = dadi.Numerics.default_grid(pts)
    phi = dadi.PhiManip.phi_1D(xx)
    nu, T = params
    phi = dadi.Integration.one_pop(phi, xx, T, nu)
    afs = dadi.Spectrum.from_phi(phi, ns, (xx,))
    return afs

def exponential_growth(params, ns, pts):
    """
                     T
                     )
    ______________/
           1            nu
    ______________
                  \
                     )
    """
    nu, T = params
    xx = dadi.Numerics.default_grid(pts)
    phi = dadi.PhiManip.phi_1D(xx)
    nu_func = lambda t: numpy.exp(numpy.log(nu) * t/T)
    phi = dadi.Integration.one_pop(phi, xx, T, nu_func)
    afs = dadi.Spectrum.from_phi(phi, ns, (xx,))
    return afs

def bottleneck(params, ns, pts):
    """
                 TB          TF
                 |           |
    ------_______________----------
       1        nuB          nuF
    ---------------------__________
    """
    nuB, nuF, TB, TF = params
    xx = dadi.Numerics.default_grid(pts)
    phi = dadi.PhiManip.phi_1D(xx)
    phi = dadi.Integration.one_pop(phi, xx, TB, nuB)
    phi = dadi.Integration.one_pop(phi, xx, TF, nuF)
    afs = dadi.Spectrum.from_phi(phi, ns, (xx,))
    return afs


chroms = 30      # 15 individuals with 2 chromosomes each
pts = 50         # number of points on which to simulate grid
time_ago = 0.07  # ~17,000 years ago

afss = [instant_size_change((4, time_ago), (chroms,), pts),
        instant_size_change((0.25, time_ago), (chroms,), pts),
        exponential_growth((4, time_ago), (chroms,), pts),
        bottleneck((0.25, 4, 0.2, time_ago), (chroms,), pts)]

for afs in afss:
    semilogy(afs)

title("Various 1D frequency spectra")
xlabel("Number of chromosomes")
ylabel("Frequency of allele")
legend(["Population size instantly quadrupled",
        "Population size instantly quartered",
        "Population size grows exponentially",
        '''Population size goes through
        a bottleneck of a quarter of its size
        before quadrupling'''], prop={'size': 10})

2.2.2 The equations behind ∂a∂i

If we are simulating data from P populations, the numbers of chromosomes sampled from each population are n_1, n_2, \ldots, n_P, and d_1, d_2, \ldots, d_P index entries in the AFS. The diffusion equation models \phi(x_1, x_2, \ldots, x_P, t), the density of derived mutations at relative frequencies x_1, x_2, \ldots, x_P in populations 1, 2, \ldots, P at time t. (All x_i \in [0, 1].) Assuming an infinitely-many-sites mutational model and Wright-Fisher reproduction in each generation, \phi can be found for any number of populations P from the following linear diffusion equation:

\frac{\partial \phi}{\partial \tau} = \frac{1}{2} \sum_{i=1}^{P} \frac{\partial^2}{\partial x_i^2} \left[ \frac{x_i (1 - x_i)}{\nu_i} \phi \right] - \sum_{i=1}^{P} \frac{\partial}{\partial x_i} \left[ \left( \gamma_i x_i (1 - x_i) + \sum_{j=1}^{P} M_{i \leftarrow j} (x_j - x_i) \right) \phi \right] \qquad (1)

It is important to note that the diffusion approach is only useful when changes in each generation are small. More specifically, the diffusion approximation applies when the effective population size N is large and migration rates and selection coefficients are of order 1/N.

A likelihood function governs our understanding of the fit between a model with a certain set of parameters and data. The likelihood function L quantifies how likely it is that a certain model and set of parameters (denoted \Theta) generated an observed allele frequency spectrum (the data, denoted S[d_1, d_2, \ldots, d_P]). The likelihood of the parameter values given the observed AFS, L(\Theta \mid S), is equivalent to the probability of the AFS being generated by those same parameter values, P(S \mid \Theta), so the range of the likelihood function is [0, 1].

The fact that each entry of an AFS can be modeled as an independent Poisson variable [9] when there is no linkage between mutations guides the construction of the likelihood function. If each entry in the AFS is an independent Poisson variable with mean M[d_1, d_2, \ldots, d_P] (which depends on \Theta), then L is the product of (n_1 + 1)(n_2 + 1) \cdots (n_P + 1) Poisson likelihoods, one for each entry in the AFS:

L(\Theta \mid S) = \prod_{d_1=0}^{n_1} \cdots \prod_{d_P=0}^{n_P} \frac{e^{-M[d_1, \ldots, d_P]} \, M[d_1, \ldots, d_P]^{S[d_1, \ldots, d_P]}}{S[d_1, \ldots, d_P]!} \qquad (2)

However, so many multiplications are expensive to compute, so ∂a∂i considers the log-likelihood instead: the natural logarithm of L. The log-likelihood is a sum rather than a product, making it cheaper to compute, and it widens the range from [0, 1] to (-∞, 0]. Because the optimizers are designed to search for a minimum, they actually work with the negative log-likelihood, whose range is [0, ∞).
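As an illustration of Equation (2), the following sketch evaluates the Poisson log-likelihood for a model spectrum M and a data spectrum S as plain NumPy arrays. This is not ∂a∂i's own implementation (which lives in dadi.Inference and also handles the masked corner entries of a Spectrum); it simply spells out the sum described above.

import numpy as np
from scipy.special import gammaln

def poisson_loglik(model, data):
    """Log of Equation (2): a sum of independent Poisson log-likelihoods.

    log L = sum over entries of [S log M - M - log(S!)],
    with log(S!) computed as gammaln(S + 1).
    """
    model = np.asarray(model, dtype=float)
    data = np.asarray(data, dtype=float)
    return np.sum(data * np.log(model) - model - gammaln(data + 1.0))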

2.2.3 How does optimization fit into ∂a∂i?

∂a∂i uses optimization to determine the parameters that best fit a particular demographic model. Various parameter values are explored according to the optimization algorithm in use, with the goal of finding the optimal log-likelihood value.
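In practice, a fit looks roughly like the sketch below, which reuses simple_branching_model from Figure 1. The call pattern follows the ∂a∂i manual, but the grid sizes, starting point, and bounds here are illustrative assumptions.

# Fit simple_branching_model to an observed spectrum `data` (assumed to be
# a dadi.Spectrum loaded elsewhere).
pts_l = [40, 50, 60]  # grid sizes for extrapolation
func_ex = dadi.Numerics.make_extrap_func(simple_branching_model)
p0 = [1.0, 1.0, 0.1]  # initial guess for (nu1, nu2, tb)
popt = dadi.Inference.optimize_log(p0, data, func_ex, pts_l,
                                   lower_bound=[0.01, 0.01, 0.001],
                                   upper_bound=[100, 100, 10],
                                   maxiter=100)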

3 Optimizers in general and ∂a∂i’s optimization in particular

3.1 Definition of optimization

Generally speaking, the scenario that calls for optimization is the following: there is a function f that depends on some number of parameters, and the goal is to find the values of those parameters that produce the highest or lowest value of f.

More formally, if we have a set X \subseteq \mathbb{R}^n (n variables to optimize), then the goal is to find an n-tuple x^* \in X such that

f(x^*) \leq f(x) \quad \text{for all } x \in X \qquad (3)

If X = \mathbb{R}^n, the problem is called an unconstrained optimization problem. Alternatively, if there are constraints of the form lb \leq x \leq ub for some lower bounds lb and upper bounds ub, the problem is called a box-constrained optimization problem. If there are more complicated constraints on X, such as Ax \leq b for some matrix A and vector b, c(x) \leq 0 for some arbitrary constraint function c, or h(x) = 0 for some arbitrary constraint function h, the problem is deemed a general-constrained problem.

If f is nonlinear, the problem is called a nonlinear optimization problem.


3.2 Practical challenges

Optimization is a challenging endeavour, with no "perfect" technique proven to work best for all optimization problems [10]. The uniqueness and hidden complexity of high-dimensional objective functions seem to require algorithmic fine-tuning more often than not. Testing every possible combination of parameters is clearly infeasible, especially as the number of parameters increases. Information about the gradient and second derivative of the objective function may be used to "hill-climb", but this technique is prone to finding local optima rather than the single global optimum.

Often, the objective function is expensive to evaluate (and therefore so is the gradient, which requires O(n) objective function calls when evaluated by the finite difference method), so minimizing the number of objective function calls often becomes a main objective. Optimization research has produced a wide array of optimizers with these issues in mind, but finding and tuning the best optimizer for a specific objective function remains a difficult practical task.

In addition to the challenge of avoiding areas far from the optimum, optimization in ∂a∂i faces the additional challenge that the cost of evaluating the objective function varies greatly with the parameter values. When a population size parameter is very small (~10^-3), for instance, the evaluation of the log-likelihood function can jump from less than a second to over an hour.

∂a∂i is a perfect example of box-constrained optimization: most parameters must be greater than zero and less than nonsensically large values. Time spans, for example, must be greater than zero and less than, say, 100 (~50 million years, 45 million years before humans diverged from chimpanzees). However, some optimization methods cannot handle constrained problems, limiting the choices for ∂a∂i or necessitating workarounds.

3.3 Identified problems with ∂a∂i’s optimization

Problem behavior that was clear from standard use included two main complaints:

1. Some optimizers tended to cycle near a single point seemingly indefinitely, stopping only when they reached the limit on the number of allowed iterations.

2. Seemingly random evaluations of points far from the optimizer's previous location would occur. Many of these points are near the boundaries given to the optimizer, where evaluation is often expensive.

See Figure 2 for an illustrated example of both these problems.


Figure 2: On this simple convex surface, the points evaluated by dadi's optimize_log optimizer are shown as dots that change color from black to gray to white over time. Lines connect the dots in the order they were evaluated by the optimizer. Using the Simple Branching model, the two population sizes nu1 and nu2 were optimized to fit synthetic data generated by ms, while the time parameter was fixed at its optimal value. The synthetic data was generated from the parameters [2.71138819, 2.17218738, 0.10186171]. The maximum number of iterations (10) was exceeded; 517 function evaluations and 121 gradient evaluations were performed, and 84.3% of all points tried were within 0.05 of the optimum in both dimensions. The optimum found was [2.9285926, 2.27135484, 0.10186171], which gives a log-likelihood value of 300.187929.

4 Exploration of ∂a∂i’s optimizers and objective function

4.1 Non-log-transformed optimizers were consistently poorer performers

The simplex method and the log-transformed L-BFGS-B method consistently outperformed or matched the other optimizers (see Figure 3), at least for the models explored in this analysis. The log transformation of the parameter space was expected to make the objective function easier for the optimizer to explore by leveling steep peaks and valleys, and that expectation has some support here. Interestingly, the optimizer most highly recommended by ∂a∂i's manual for general problems, optimize_log, did not fare very well on even the simplest model, as illustrated in Figure 3.

Whether an optimizer was log-transformed seemed to be a bigger predictor of its success than the type of optimization algorithm itself. This was evident across models, and for further benchmarking, the non-transformed optimizers were dropped.
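The mechanics of such a transformation are simple: the optimizer searches in log-parameter space while the objective is evaluated at the exponentiated values. The wrapper below is a minimal sketch of this idea (the wrapper name is ours; ∂a∂i's optimize_log variants perform the equivalent transformation internally):

import numpy as np

def log_transformed(objective):
    """Wrap an objective so an optimizer can search in log-parameter space."""
    def wrapped(log_params, *args):
        # The optimizer proposes log(params); evaluate at exp(log_params).
        return objective(np.exp(log_params), *args)
    return wrapped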


optimize (fmin_bfgs): The quasi-Newton method of Broyden, Fletcher, Goldfarb, and Shanno. Uses an approximation of second-derivative information to efficiently move towards a local minimum [10]. This method cannot handle constraints without a workaround.

optimize_log (fmin_bfgs): BFGS, but where the natural logarithms of the parameters are optimized, with an appropriate transformation of the objective function.

optimize_lbfgsb (fmin_l_bfgs_b): The limited-memory, bound-constrained version of BFGS [11] (see also the recent improvement [12]).

optimize_log_lbfgsb (fmin_l_bfgs_b): L-BFGS-B, but where the natural logarithms of the parameters are optimized, with an appropriate transformation of the objective function.

optimize_log_fmin (fmin): The Nelder-Mead simplex algorithm, a simple algorithm that moves an "amoeba", an n-dimensional shape, downhill to look for a minimum. It has a long history of successful use in applications, but it is usually slower than an algorithm that uses first- or second-derivative information; in practice it is considered to perform poorly on high-dimensional problems and is not robust when minimizing complicated functions [5]. It cannot handle bounds without a workaround. In ∂a∂i, only the log-transformed version is available.

optimize_grid (brute): Given grid parameters, evaluate all of the parameter combinations on the grid and return the minimum.

Table 1: These optimizers are currently used by ∂a∂i. Each dadi.Inference method is listed with the scipy.optimize routine it wraps.


Simple Branching Model (SB): 3 parameters
Simple Branching Model with Migration (SBM): 4 parameters
Simple Branching Model with Increase and Migration (SBIM): 5 parameters

Table 2: These models of increasing complexity were used in this analysis. ∂a∂i's optimizers showed some of their typical problems even on the simplest model, so this range of models was deemed appropriate.

Figure 3: These images are from benchmarking of the least complicated model, the Simple Branching Model, with the starting distance from the optimum increasing from left to right.


4.2 Multiple optima are possible, even with a simple model

The benchmarking results shown in Figure 4 show a consistent double peak in the objective function values achieved by the optimizers on the Simple Branching Model with Migration across multiple runs. These runs fit real data.

Figure 4: Results clustered around two log-likelihood values for these benchmarking runs of the Simple Branching Model with Migration on real data.

It seemed that multiple local optima could exist in the log-likelihood surfaces, even for such a basic model.

And indeed, the log-likelihood values along a straight line between the two points show clear evidence that the points are two local optima, as seen in Figure 5.


Figure 5: The log-likelihood values along a straight line between two values for the Simple Branching with Migration model clearly showed that there were multiple local optima. This model has only 4 parameters.

This strange pattern in the benchmarking data prompted a closer look at the log-likelihood surface for this model. Upon examining slices of the multi-dimensional data, one plot stood out because it was concave, the only plot without a gentle hill shape. The narrow crest, evident in the bottom left of both graphs in Figure 6, perhaps gives a glimpse of problems even more likely to arise in higher-dimensional models.


Figure 6: A strange pointed region was visible on the log-likelihood surface, a shape that would bedifficult for optimizers to explore.

5 Line search and its effect on optimizer path

5.1 The line search process

Line search is an important component of the BFGS and L-BFGS-B algorithms. In line search, a descent direction is found first. In BFGS and L-BFGS-B, this is done with the help of an approximation to the Hessian as well as the gradient (this is the quasi-Newton approach). Then, a step size α along a line in this direction is chosen so as to loosely locate the minimum: the idea is to find something near the optimum along this line without expending too much effort.
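As a concrete illustration of the step-size stage, the sketch below implements simple Armijo backtracking: start from a trial α and shrink it until the decrease in f is acceptable. This is a minimal sketch of the general idea, not SciPy's actual routine (which enforces the stronger Wolfe conditions):

import numpy as np

def backtracking_line_search(f, x, p, grad, alpha=1.0, rho=0.5, c=1e-4):
    """Shrink alpha until f(x + alpha*p) satisfies the Armijo condition."""
    fx = f(x)
    slope = np.dot(grad, p)  # directional derivative; negative if p descends
    while f(x + alpha * p) > fx + c * alpha * slope:
        alpha *= rho
    return alpha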

5.2 Is poor line search responsible for the exploration of values far fromthe optimum?

The fact that points are often chosen far from previous points and far from the optimum seems to indicate a problem with the line search process. If α is chosen poorly, for example, the result could be exactly what we see in the dadi optimizers.

A look at the line search code in SciPy reveals that the line search process is not very sophisticated. An attempt to make the choice of α more responsive to the objective function's characteristics seems like a promising avenue for improving ∂a∂i's optimizers.

6 Calculation of the gradient in dadi’s optimizers

The calculation of the gradient of the objective function is a critical task in many optimizers. This holds true for the majority of dadi's optimizers as well. Four of the six dadi optimizers use gradient information in their journey towards an optimum: optimize and optimize_log (based on SciPy's optimize.fmin_bfgs), and optimize_lbfgsb and optimize_log_lbfgsb (which rely on SciPy's optimize.fmin_l_bfgs_b).


6.1 How the gradient is calculated now

No closed-form equation or algorithm for calculating the gradient of dadi's log-likelihood function exists, so dadi uses a finite difference method to estimate the gradient near a specific point. Specifically, dadi uses the finite difference method that is built into SciPy's optimizers.

An investigation of the finite difference method implemented in SciPy (as of SciPy 0.11.0) shows that a forward finite difference is used. In other words, each element of the gradient at a point (a_1, \ldots, a_n) is calculated by

\frac{\partial f}{\partial x_i}(a_1, \ldots, a_n) \approx \frac{f(a_1, \ldots, a_i + h, \ldots, a_n) - f(a_1, \ldots, a_i, \ldots, a_n)}{h} \qquad (4)

However, this technique is perhaps one of the least accurate methods, according to [10]. To quote Numerical Recipes in C, Second Edition (Chapter 5.7, Numerical Derivatives):

"Quite a lot [can go wrong], actually. Applied uncritically, the above procedure is almost guaranteed to produce inaccurate results. Applied properly, it can be the right way to compute a derivative only when the function f is fiercely expensive to compute, when you already have invested in computing f(x), and when, therefore, you want to get the derivative in no more than a single additional function evaluation."

Although dadi's objective function can be costly to compute in certain cases, it seems unlikely to be "fierce" enough to meet these criteria in the great majority of cases. In fact, a closer look at SciPy's finite difference function shows that the objective function value at the point being evaluated is not passed in from the optimizer, which would avoid recomputing it; instead it is calculated once within the function (see Figure 7). Each call to SciPy's finite difference function therefore requires 1 + n objective function evaluations, where n is the number of parameters in the model.

6.2 A better way to calculate the gradient

Numerical Recipes suggests a centered finite difference calculation if extra function evaluations can be spared. Such a procedure looks like:

\frac{\partial f}{\partial x_i}(a_1, \ldots, a_n) \approx \frac{f(a_1, \ldots, a_i + h, \ldots, a_n) - f(a_1, \ldots, a_i - h, \ldots, a_n)}{2h} \qquad (5)

Such a procedure requires 2n objective function evaluations. This is more than the forward method, but both procedures are O(n). The real advantage lies in the accuracy gained.

Numerical Recipes estimates that the typical improvement in the fractional error of the derivative calculation is two orders of magnitude (assuming use of double precision values). To achieve the maximum improvement, however, Numerical Recipes stresses the need for a well-chosen value of h.
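The accuracy difference is easy to demonstrate on a smooth one-dimensional function. The snippet below compares both formulas against a known derivative (an illustration on cos(x), not on ∂a∂i's objective):

import numpy as np

f = np.cos
x, h = 1.0, 1e-4
exact = -np.sin(x)                      # true derivative of cos at x
fwd = (f(x + h) - f(x)) / h             # forward difference, error O(h)
ctr = (f(x + h) - f(x - h)) / (2 * h)   # centered difference, error O(h^2)
print(abs(fwd - exact))  # ~3e-5
print(abs(ctr - exact))  # ~1e-9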


"""Finite-difference approximation of the gradient of a scalar function.

Parameters----------xk : array_like

The coordinate vector at which to determine the gradient of ‘f‘.f : callableThe function of which to determine the gradient (partial derivatives).

Should take ‘xk‘ as first argument, other arguments to ‘f‘ can besupplied in ‘‘*args‘‘. Should return a scalar, the value of thegradient at ‘xk‘.

epsilon : array_likeIncrement to ‘xk‘ to use for determining the function gradient.If a scalar, uses the same finite difference delta for all partialderivatives. If an array, should contain one value per element of‘xk‘.

\*args : args, optionalAny other arguments that are to be passed to ‘f‘.

Returns-------grad : ndarray

The partial derivatives of ‘f‘ to ‘xk‘.

See Also--------check_grad : Check correctness of gradient function against approx_fprime.

Notes-----The function gradient is determined by the forward finite differenceformula::

f(xk[i] + epsilon[i]) - f(xk[i])f’[i] = ---------------------------------

epsilon[i]

The main use of ‘approx_fprime‘ is in scalar function optimizers like‘fmin_bfgs‘, to determine numerically the Jacobian of a function.

Examples-------->>> from scipy import optimize>>> def func(x, c0, c1):... "Coordinate vector ‘x‘ should be an array of size two."... return c0 * x[0]**2 + c1*x[1]**2

>>> x = np.ones(2)>>> c0, c1 = (1, 200)>>> eps = np.sqrt(np.finfo(np.float).eps)>>> optimize.approx_fprime(x, func, [eps, np.sqrt(200) * eps], c0, c1)array([ 2. , 400.00004198])

"""

def approx_fprime(xk, f, epsilon, *args):

f0 = f(*((xk,) + args))

grad = numpy.zeros((len(xk),), float)

ei = numpy.zeros((len(xk),), float)

for k in range(len(xk)):

ei[k] = 1.0

d = epsilon * ei

# forward finite difference

grad[k] = (f(*((xk+d,)+args)) - f0) / d[k]

ei[k] = 0.0

return grad

def approx_fprime(xk, f, epsilon, *args):

grad = numpy.zeros((len(xk),), float)

ei = numpy.zeros((len(xk),), float)

for k in range(len(xk)):

ei[k] = 1.0

d = epsilon * ei

# centered finite difference

grad[k] = (f(*((xk+d,)+args)) -

f(*((xk-d,)+args))) / (2*d[k])

ei[k] = 0.0

return grad

Figure 7: The SciPy 0.11.0 documentation for the finite difference function (top), the SciPy 0.11.0 code for the finite difference function (middle), and a version of the same function that uses the centered finite difference method (bottom).


6.3 Choosing an optimal h

Choosing an optimal h for the finite difference gradient calculation is not as simple as picking a small value. Numerical Recipes outlines several considerations for picking h.

First, floating point error in the effective value of h (the difference between x + h and x) will cause error in the derivative calculation. To avoid this, it is recommended to choose h so that this difference is exactly representable in binary. Numerical Recipes's suggestion is to add the following lines of code:

temp = x + h
h = temp - x

Second, improved values of h may be found by incorporating a "curvature scale" factor, denoted x_c and defined as x_c = \sqrt{f / f''}. Numerical Recipes notes that x_c is often approximated as x when there is a lack of information and x is sufficiently far from 0.

Finally, the method of derivative calculation guides the choice of h. Numerical Recipes suggests h = \sqrt{\epsilon_f} \, x_c for the forward finite difference calculation, where \epsilon_f is the fractional error in the calculation of f and x_c is the curvature scale. (SciPy uses the square root of the machine precision, perhaps because machine precision is approximately equal to \epsilon_f for simple functions; for complicated functions, like dadi's log-likelihood function, \epsilon_f is likely larger.) Numerical Recipes suggests h = (\epsilon_f)^{1/3} \, x_c for the centered finite difference calculation.

Dadi's default choice of h is currently 0.001. This is much larger than the square root of the machine precision that is SciPy's default (~10^-8 on an iMac). An h that is too large will clearly lead to errors in the derivative calculations, but 0.001 may be appropriate for dadi if the fractional error in calculating dadi's objective function is sufficiently high and/or if the curvature scale is often appropriately large. Besides a change from forward to centered finite differencing, a more sophisticated choice of h may also yield improvements.
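The sketch below turns these recommendations into code: scale the step by a crude curvature-scale proxy, then make the effective step exactly representable. The function name and the proxy for x_c are ours, assuming only the Numerical Recipes formulas quoted above:

import numpy as np

def suggested_h(x, eps_f=np.finfo(float).eps, centered=False):
    """Step size per the Numerical Recipes heuristics discussed above."""
    xc = abs(x) if abs(x) > 1e-8 else 1e-8  # crude curvature-scale proxy
    h = ((eps_f ** (1.0 / 3.0)) if centered else np.sqrt(eps_f)) * xc
    temp = x + h   # force x + h to be exactly representable...
    return temp - x  # ...so the effective h carries no rounding error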

6.4 An explanation for cycling near the optimum?

Inaccuracy in the gradient calculation could explain the tendency of the optimizers to spend so much time evaluating values near the optimum. Consider the scenario where the point whose gradient is being evaluated is at or very near the optimum. This point should be at the bottom of a smooth basin, since dadi's log-likelihood function is known to be quite smooth. Clearly, its gradient should be zero or very close to zero.

But if a forward finite difference calculation is used, the computed gradient will have a greater magnitude than it should (see Figure 8). With an optimally small h the error will be smaller, but as discussed above, the forward finite difference method has a higher error than other methods of calculating the gradient. If the error makes the magnitude of the gradient large enough to fall outside the tolerance of the stopping criteria, the BFGS or L-BFGS-B algorithm will continue, and a line search will be performed based on the inaccurate gradient. This line search will find a point very near the optimum, and the process will repeat. The expected result of this process closely matches the observed cycling behavior of those optimizers on dadi functions.


Figure 8: Finding the derivative with the forward finite difference method.

Consider a centered finite difference calculation instead. The two points chosen to evaluate the partial derivative are symmetrically located, so as long as h is small, a symmetric neighborhood surrounds the optimum, due to the smoothness of the objective function. This gives a more accurate calculation of the gradient, eliminating pointless cycling near the optimum. See Figure 9 for an illustration.

Figure 9: Finding the derivative with the centered finite difference method.


6.5 An example of improvement

Figure 10: The exact same optimization scenario as in Figure 2 was performed, but with the centered finite difference method; h remained at 0.001. The optimization algorithm reported that it terminated successfully after 8 iterations. 125 function evaluations and 25 gradient evaluations were performed, and 15.2% of all points tried were within 0.05 of the optimum in both dimensions. The optimum found was [2.92943025, 2.27239257, 0.10186171], which gives a log-likelihood value of 300.187910.


References

[1] R. N. Gutenkunst, R. D. Hernandez, S. H. Williamson, and C. D. Bustamante, "Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data," PLoS Genetics, vol. 5, p. e1000695, 2009.

[2] X. Yi, Y. Liang, E. Huerta-Sanchez, X. Jin, Z. X. P. Cuo, J. E. Pool, X. Xu, H. Jiang, N. Vinckenbosch, T. S. Korneliussen, H. Zheng, T. Liu, W. He, K. Li, R. Luo, X. Nie, H. Wu, M. Zhao, H. Cao, J. Zou, Y. Shan, S. Li, Q. Yang, Asan, P. Ni, G. Tian, J. Xu, X. Liu, T. Jiang, R. Wu, G. Zhou, M. Tang, J. Qin, T. Wang, S. Feng, G. Li, Huasang, J. Luosang, W. Wang, F. Chen, Y. Wang, X. Zheng, Z. Li, Z. Bianba, G. Yang, X. Wang, S. Tang, G. Gao, Y. Chen, Z. Luo, L. Gusang, Z. Cao, Q. Zhang, W. Ouyang, X. Ren, H. Liang, H. Zheng, Y. Huang, J. Li, L. Bolund, K. Kristiansen, Y. Li, Y. Zhang, X. Zhang, R. Li, S. Li, H. Yang, R. Nielsen, J. Wang, and J. Wang, "Sequencing of 50 human exomes reveals adaptation to high altitude," Science, vol. 329, no. 5987, pp. 75–78, 2010.

[3] C. E. Ellison, C. Hall, D. Kowbel, J. Welch, R. B. Brem, N. L. Glass, and J. W. Taylor, "Population genomics and local adaptation in wild isolates of a model microbial eukaryote," Proceedings of the National Academy of Sciences, vol. 108, no. 7, pp. 2831–2836, 2011.

[4] D. P. Locke, L. W. Hillier, W. C. Warren, K. C. Worley, L. V. Nazareth, D. M. Muzny, S.-P. Yang, Z. Wang, A. T. Chinwalla, P. Minx, M. Mitreva, L. Cook, K. D. Delehaunty, C. Fronick, H. Schmidt, L. A. Fulton, R. S. Fulton, J. O. Nelson, V. Magrini, C. Pohl, T. A. Graves, C. Markovic, A. Cree, H. H. Dinh, J. Hume, C. L. Kovar, G. R. Fowler, G. Lunter, S. Meader, A. Heger, C. P. Ponting, T. Marques-Bonet, C. Alkan, L. Chen, Z. Cheng, J. M. Kidd, E. E. Eichler, S. White, S. Searle, A. J. Vilella, Y. Chen, P. Flicek, J. Ma, B. Raney, B. Suh, R. Burhans, J. Herrero, D. Haussler, R. Faria, O. Fernando, F. Darre, D. Farre, E. Gazave, M. Oliva, A. Navarro, R. Roberto, O. Capozzi, N. Archidiacono, G. D. Valle, S. Purgato, M. Rocchi, M. K. Konkel, J. A. Walker, B. Ullmer, M. A. Batzer, A. F. A. Smit, R. Hubley, C. Casola, D. R. Schrider, M. W. Hahn, V. Quesada, X. S. Puente, G. R. Ordonez, C. Lopez-Otin, T. Vinar, B. Brejova, A. Ratan, R. S. Harris, W. Miller, C. Kosiol, H. A. Lawson, V. Taliwal, A. L. Martins, A. Siepel, A. RoyChoudhury, X. Ma, J. Degenhardt, C. D. Bustamante, R. N. Gutenkunst, T. Mailund, J. Y. Dutheil, A. Hobolth, M. H. Schierup, O. A. Ryder, Y. Yoshinaga, P. J. de Jong, G. M. Weinstock, J. Rogers, E. R. Mardis, R. A. Gibbs, and R. K. Wilson, "Comparative and demographic analysis of orang-utan genomes," Nature, vol. 469, pp. 529–533, 2011.

[5] E. Jones, T. Oliphant, P. Peterson, et al., "SciPy: Open source scientific tools for Python," 2001–.

[6] S. Hoban, G. Bertorelle, and O. E. Gaggiotti, "Computer simulations: tools for population and evolutionary genetics," Nature Reviews Genetics, vol. 13, pp. 110–122, 2012.

[7] L. Excoffier and M. Foll, "fastsimcoal: a continuous-time coalescent simulator of genomic diversity under arbitrarily complex evolutionary scenarios," Bioinformatics, vol. 27, no. 9, pp. 1332–1334, 2011.

[8] I. Alves, A. Sramkova Hanulova, M. Foll, and L. Excoffier, "Genomic data reveal a complex making of humans," PLoS Genetics, vol. 8, p. e1002837, 2012.

[9] S. A. Sawyer and D. L. Hartl, "Population genetics of polymorphism and divergence," Genetics, vol. 132, no. 4, pp. 1161–1176, 1992.

[10] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in C, ch. 10, pp. 394–455. Cambridge University Press, second ed., 1992.

[11] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal, "Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization," ACM Transactions on Mathematical Software, vol. 23, pp. 550–560, Dec. 1997.

[12] J. Morales and J. Nocedal, "Remark on 'Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound constrained optimization'," ACM Transactions on Mathematical Software, vol. 38, no. 1, p. 7, 2011.
