Licentiate Thesis: Genetic variation in structured populations. Marina Rafajlović, Department of Physics, University of Gothenburg, Göteborg, Sweden 2012. The thesis is available at http://physics.gu.se/~rmarina/Marina_Rafajlovic/Home_files/lic.pdf


Page 1: Genetic variation in structured populationsphysics.gu.se/~rmarina/Marina_Rafajlovic/Home_files/lic.pdf · 2012. 5. 11. · Genetic variation in structured populations Marina Rafajlovic

Licentiate Thesis

Genetic variation in structured populations

Marina Rafajlović

Department of Physics

University of Gothenburg

Göteborg, Sweden 2012

The thesis is available at http://physics.gu.se/~rmarina/Marina_Rafajlovic/Home_files/lic.pdf


ISBN 978-91-633-9728-8

Printed by Ineko AB

Göteborg 2012


Genetic variation in structured populations

Marina Rafajlović

Department of Physics

University of Gothenburg

SE-412 96 Göteborg, Sweden

Abstract

It is widely acknowledged that the process of speciation in the presence of gene flow between different ecotypes is common in nature. However, the mechanisms underlying speciation are not understood. In order to gain understanding of this process, one must identify loci which establish primary reproductive barriers between ecotypes and monitor how these barriers evolve towards completion. Since such loci are presumably under natural selection, they can be detected using neutral loci as markers in genome-wide scans. Therefore, the patterns of neutral genetic variation must be understood first.

This thesis analyses genetic variation in structured populations. First it is shown how to compute the moments of the frequency spectrum of nucleotide polymorphisms under a varying population size. It is discussed how these results together with empirical data can be used to infer the underlying population size history. Moreover, the effect of sustained population size fluctuations on two-locus patterns of neutral genetic variation is analysed. It is shown that, under severe reductions of population size, pairs of loci can exhibit long-range association. Afterwards, neutral genetic variation in geographically structured populations is discussed. An important example is populations arising from a series of founder events. By describing such populations using a mainland-island model, it is shown how genetic variation decays as the distance from the mainland increases. The effect of multiple paternity is also discussed. It is shown that populations with a high degree of multiple paternity have on average higher heterozygosity than populations with a low degree of multiple paternity. Finally, it is demonstrated that migration patterns between partly isolated populations can be represented in terms of a colonisation-migration ancestral tree. Inferring such a tree using gene trees of samples taken across different populations is related to inferring a species tree using gene trees of samples taken across different species. Thereafter, the relatedness of species trees to their resulting gene trees under rapid speciations is discussed. The effects of the time between successive speciations, of the number of derived species, and of the symmetry of the species tree are analysed. The effect of the time scale of speciation remains to be taken into account.


This thesis consists of an introductory text and the following two appended research papers, henceforth referred to as [I] and [II]:

[I] A. Eriksson, B. Mehlig, M. Rafajlovic and S. Sagitov, The total branch length of sample genealogies in populations of variable size, Genetics 186, 601–611 (2010).

[II] E. Schaper, A. Eriksson, M. Rafajlovic, S. Sagitov and B. Mehlig, Linkage disequilibrium under recurrent bottlenecks, Genetics 190(1), 217–229 (2012).


Acknowledgments

This thesis is an outcome of numerous discussions I had with my supervisor, Bernhard Mehlig. Not only has he guided me through this work, but he has also managed to be optimistic all the time, for which I am extremely grateful. Thanks Bernhard for introducing me to this part of the stochastic-processes world, and for constantly reminding me to keep things simple.

I would also like to thank Anders Eriksson and Serik Sagitov for helpful discussions and novel ideas on the topic. My deep gratitude goes to Kerstin Johannesson, Carl Andre and Marina Panova for fruitful discussions about the wonderful snails L. saxatilis. Also, I thank the Master students Elke Schaper and Anna Rimark, with whom I collaborated. Thanks also to Alexander Klassmann and Thomas Wiehe for discussions about genome sequencing, and for providing us with the human SNP data.

Big thanks to all proof-readers of this thesis: thanks Federico Elías Wolff, Jonas Einarsson, Erik Werner, Omid Ghavami and Anton Johansson. Also, for technical support I thank Kristian Gustafsson and Matteo Bazanella.

Finally, I thank my son and my husband for their love, patience and understanding.

I acknowledge the support from Vetenskapsrådet, the support from the Centre for Theoretical Biology at the University of Gothenburg, and the support from Göran-Gustafsson stiftelsen.

Marina Rafajlović

Göteborg

May 11, 2012


Contents

1 Introduction

2 Modelling population genetics
2.1 Wright-Fisher model
2.2 Mutation and recombination
2.3 Standard coalescent approximation
2.4 Observables
2.4.1 Single-locus observables
2.4.2 Two-locus observables

3 Gene genealogies under a varying population size
3.1 Model of a single bottleneck
3.2 Model of recurrent bottlenecks

4 Frequency spectrum of SNPs
4.1 Frequency spectrum of SNPs under a single bottleneck
4.2 Fitting human histories using SNPs

5 Linkage disequilibrium

6 Genetic variation in structured populations
6.1 Mating model
6.2 Spatial model
6.3 Genetic variation
6.3.1 Colonisation phase
6.3.2 Migration phase
6.4 Heterozygosity in L. saxatilis

7 Species trees
7.1 Gene trees and species trees
7.2 Tree-to-tree distance
7.2.1 Fixed branch lengths
7.2.2 Yule trees

8 Conclusions

A Moment-generating function of S_n

B Probabilities appearing in Eqs. (2.16), (2.17)

C Tests of neutrality based on ξ_i

D Frequency spectrum of SNPs: formulae
D.1 Moments of ξ_i
D.2 Tests of neutrality

E Effective population size under the mating model introduced in Chapter 6

F Heterozygosity during migration

Bibliography

Papers I-II

1 Introduction

In nature, genetic variation within and between biological species is substantial. Today, there are more than a million known species, and indications are that they all come from a single ancestral species [1]. Yet, the mechanisms underlying speciation, particularly speciation in the presence of gene flow between ecotypes, are not well understood.

It is widely recognised [2, 3, 4, 5, 6] that different species are reproductively isolated from each other. It is likely that primary reproductive barriers emerge due to natural selection acting upon one or several genes, the so-called genes underlying speciation [4, 5, 6]. If different ecotypes live in mixed habitats, initial reproductive barriers between the ecotypes may develop towards complete reproductive isolation due to, for example, ecotype incompatibilities or due to sexual selection towards the alike ecotype [3, 5]. For speciation to occur in the latter case, it is necessary to establish and sustain a certain degree of association (i.e. linkage) between the genetic sequences that are responsible for producing different ecotypes, and those responsible for determining mating preferences towards the alike ecotype [5]. Since recombination tends to break linkage between pairs of loci, it is to be expected that speciation in the presence of gene flow occurs rarely. However, recent experimental data [2, 3, 4, 5, 6] show that speciation in the presence of gene flow is common in nature.

In order to understand the speciation mechanisms, one needs to detect and trace the genes underlying speciation [5]. One method for achieving this is to compare the gene-expression levels of candidate genes across different ecotypes. Another method is to search for genomic regions that differ from the neutral regions using genome-wide sequencing. For the second method, it is crucial to first quantitatively understand the single-locus and multi-locus patterns of neutral genetic variation [5].


Neutral genetic variation in populations of constant size is well understood [7, 8, 9]. However, natural populations are commonly subject to sustained environmental and demographic changes resulting in population-size fluctuations [10, 11, 12, 13, 14]. Therefore, it is important to analyse the patterns of neutral genetic variation under a given fluctuating population-size history. Although this task can be tackled by computer simulations [9], it is also important to support the results of simulations with theoretical insight.

The analytical results given in [I] make it possible to compute the moments of the total number of neutral mutations along genetic sequences under a given population-size history. Using these results, it is possible to reconstruct the demographic history of a given population using genome-wide observed neutral mutations [15, 16]. This is discussed in Chapters 3-4. The effect of sustained population-size fluctuations on the degree of association between pairs of loci was analysed in [II], and it is discussed here in Chapter 5.

The findings mentioned above [I, II] are restricted to a freely mixing population of fluctuating size. Since migration between populations may be an important source of genetic variation [17], gene flow between populations must also be taken into account. This is discussed in Chapter 6 for a particular model relevant to populations of a marine snail (L. saxatilis) of Sweden’s west coast archipelago.

The results in Chapters 3-6 constitute an advance in the endeavor of identifying the genes underlying speciation. Species existing today are the outcomes of a number of speciation events. The relationship between species is usually represented by a corresponding species tree. In general, species trees are unknown. The task of inferring species trees using gene trees of samples taken across different species is complicated. This holds even under the assumption that all speciations occurred very rapidly in the past. Under this assumption, the time from establishing the first barriers to gene flow, to the time of establishing complete reproductive isolation between species may be disregarded. The relationship of gene trees and their underlying species tree under this assumption is discussed in Chapter 7.

This thesis provides a background to the methods used and findings discussed in [I, II], but it also includes a number of results not contained in [I, II]. This thesis is organised as follows.

The basic models and concepts used in population genetics are introduced in Chapter 2. This chapter contains an overview of the Wright-Fisher model of reproduction [18, 19] and of common models of the processes of mutation and of recombination. It also includes an introduction to the standard coalescent approximation [7].


Chapter 3 summarises how to compute the moments of the branch lengths of gene genealogies for populations of varying sizes [I]. The single-locus patterns of neutral genetic variation are completely determined by these results. For example, it is possible to derive expressions for the moments of the number of neutral mutations that appear a given number of times in the sampled sequences. These results can be used to infer the underlying population-size history. Inferring the histories of human populations is discussed in Chapter 4.

The effect of population-size fluctuations on the degree of association between neutral genetic variation at two loci is discussed in Chapter 5. This discussion is based on the results obtained in [II]. As mentioned above, the degree of association between pairs of loci is important for speciation in the presence of gene flow.

A model of colonisation of spatially distributed islands is introduced in Chapter 6. The predictions of this model are compared to the observed patterns of genetic variation in populations of a marine snail L. saxatilis of Sweden’s west coast archipelago [10]. Motivated by the fact that these populations are characterised by extremely high levels of multiple paternity [20], Chapter 6 also presents the analysis of a particular mating model which allows for different levels of multiple paternity.

Finally, a number of properties of gene trees corresponding to genetic sequences, each taken from a different species, are discussed in Chapter 7.

The main findings presented in this thesis, as well as concluding remarks about what remains to be done in the course of understanding speciation mechanisms, are given in Chapter 8. Appendices A-F summarise the calculations behind the results given in this thesis.


2 Modelling population genetics

This chapter contains an overview of modelling tools commonly used in population genetics. First, in Section 2.1, the well-known Wright-Fisher model of reproduction [18, 19] is introduced. Modelling of neutral mutations and of recombination is covered in Section 2.2 [9, 21]. Then, in Section 2.3, the standard coalescent method is introduced [7]. Finally, in Section 2.4, a number of common measures of genetic variation are listed.

2.1 Wright-Fisher model

The Wright-Fisher model [18, 19] of N haploid individuals (see footnote 1) is based on the following three assumptions:

• generations are discrete and non-overlapping,

• the population size N is constant, independent of time, and

• the number of offspring of each individual is binomially distributed with parameters N and 1/N (in other words, the family sizes are distributed according to the symmetric multinomial distribution).

Footnote 1: In the cells of a haploid organism, one finds a single copy of each chromosome. By contrast, in diploid organisms, the sex cells carry a single copy of each chromosome (they are haploid cells), whereas somatic cells carry paired chromosomes. These are diploid cells. Two copies of a single chromosome typically differ in sequence.

Assume that the distribution of allelic types at a given locus in generation ℓ = 0 is known. Then under the three assumptions listed above, the Wright-Fisher population in generation ℓ = 1 can be obtained by sampling at random (with replacement) N alleles from the parental alleles, where each parental allele is chosen with probability 1/N. Similarly, starting from the population in generation ℓ = 1, one can generate the population in the next generation, and so on.
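The resampling step of the Wright-Fisher model can be sketched in a few lines of Python. This is a minimal illustration rather than code from the thesis; the function names `next_generation` and `simulate`, the seed, and the allele labels are assumptions made for the example.

```python
import random

def next_generation(parents, rng):
    """One Wright-Fisher step: draw len(parents) alleles with replacement,
    each parental allele being chosen with probability 1/N."""
    return [rng.choice(parents) for _ in parents]

def simulate(parents, generations, seed=0):
    """Return the list of populations from generation 0 to `generations`."""
    rng = random.Random(seed)
    history = [list(parents)]
    for _ in range(generations):
        history.append(next_generation(history[-1], rng))
    return history

# Hypothetical example: N = 10 alleles, each initially of a distinct type.
history = simulate(list(range(10)), generations=50)
print(len(set(history[-1])), "distinct allelic types remain after 50 generations")
```

Because no mutation is included, the set of allelic types can only shrink from one generation to the next, which is the loss of variation through genetic drift.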

Note that the first assumption above implies that the members of the parental generation produce progeny simultaneously, and that they all die immediately afterwards. This assumption may be relaxed. For example, in the Moran model [22], a single randomly chosen individual gives rise to a child in each generation. At the same time, a single randomly chosen individual dies. This individual may be the one that gave rise to a child, but it may also be some other individual. It follows that in this model, the generations are overlapping. It can be shown [8] that the variance of the number of children per individual is equal to 2/N in the Moran model, whereas it is equal to 1 − 1/N in the Wright-Fisher model. In both models, the population size is kept constant, and thus the average number of children per individual is equal to unity. For simplicity, the analysis in this thesis is restricted to non-overlapping generations, as in the Wright-Fisher model.

The parameter N in the Wright-Fisher model determines the time scale of genetic drift, a process that tends to reduce genetic variation within a given population. As an example, consider a locus with two allelic types, denoted by A_1 and A_2. Assuming that in generation ℓ there are i copies of A_1, the probability that there are j copies of A_1 in generation ℓ + 1 is:

$$p_{ij} = \binom{N}{j}\left(\frac{i}{N}\right)^{j}\left(1-\frac{i}{N}\right)^{N-j}. \tag{2.1}$$

The process defined by the transition probabilities given in Eq. (2.1) inevitably reaches an absorbing state [23], characterised by complete loss of either A_1 or A_2. This corresponds to complete loss of genetic variation at a given locus. Note that once this happens, there can be no further changes in allelic frequencies under the Wright-Fisher model. If A_1 is lost, one says that fixation of A_2 occurred, and vice versa [23].

If the frequency of A_1 in generation ℓ = 0 is α, then the expected number of generations ℓ_loss(α) before the population experiences the complete loss of genetic variation (also known as the mean fixation time) is [23]:

$$\ell_{\mathrm{loss}}(\alpha) \approx -2N\left[\alpha\ln(\alpha) + (1-\alpha)\ln(1-\alpha)\right]. \tag{2.2}$$

Here, it is assumed that N ≫ 1. The fixation of A_1 is obtained with probability α [23].
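As a numerical check of Eq. (2.2), the mean time to absorption can be computed directly from the transition probabilities of Eq. (2.1). The sketch below is an illustration under stated assumptions, not the method used in the thesis: the function names, the choice N = 30, and the fixed-point iteration are all choices made for this example. The iteration relies on the standard first-step relation for absorbing Markov chains, t_i = 1 + Σ_j p_ij t_j over the transient states j.

```python
from math import comb, log

def transition_matrix(N):
    """Eq. (2.1): p_ij = C(N, j) (i/N)^j (1 - i/N)^(N - j)."""
    return [[comb(N, j) * (i / N) ** j * (1 - i / N) ** (N - j)
             for j in range(N + 1)] for i in range(N + 1)]

def mean_absorption_times(N, tol=1e-12):
    """Iterate t_i = 1 + sum_j p_ij t_j over transient states 1..N-1
    until successive iterates differ by less than `tol`."""
    P = transition_matrix(N)
    t = [0.0] * (N + 1)
    while True:
        new = [0.0] * (N + 1)  # states 0 and N are absorbing
        for i in range(1, N):
            new[i] = 1.0 + sum(P[i][j] * t[j] for j in range(1, N))
        if max(abs(a - b) for a, b in zip(new, t)) < tol:
            return new
        t = new

def diffusion_estimate(N, alpha):
    """Right-hand side of Eq. (2.2)."""
    return -2 * N * (alpha * log(alpha) + (1 - alpha) * log(1 - alpha))

N = 30
times = mean_absorption_times(N)
print("exact mean absorption time from i = N/2:", times[N // 2])
print("Eq. (2.2) with alpha = 1/2:             ", diffusion_estimate(N, 0.5))
```

The two numbers should agree increasingly well as N grows, in line with the assumption N ≫ 1.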


The Wright-Fisher model may be extended to account for sexually reproducing diploid organisms. As an example, consider a well mixed population of N_f females and N_m males. Since the individuals are diploid, the population contains 2(N_f + N_m) alleles. Assume that the males and the females mate randomly and that Mendelian inheritance (see footnote 2) applies. Under these conditions, the probability that two alleles sampled at random from 2(N_f + N_m) alleles stem from a single allele in the previous generation is (2N_e)^{-1}, where [23]

$$N_e = \frac{4N_fN_m}{N_m + N_f}. \tag{2.3}$$

Here, N_e stands for an effective population size (see also [24]).
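A short sketch shows how Eq. (2.3) behaves for balanced and skewed sex ratios. The second function is a heuristic check added for this example and is not taken from the thesis: it assumes that two sampled alleles are both of paternal origin with probability 1/4 and then descend from the same one of the 2N_m paternal alleles with probability 1/(2N_m), and likewise for maternal origin. The function names and the population sizes are assumptions.

```python
def effective_size(n_females, n_males):
    """Eq. (2.3): N_e = 4 N_f N_m / (N_f + N_m)."""
    return 4 * n_females * n_males / (n_females + n_males)

def pair_coalescence_probability(n_females, n_males):
    """Heuristic two-sex argument (an assumption of this sketch):
    1/4 * 1/(2 N_m) for paternal origin plus 1/4 * 1/(2 N_f) for maternal."""
    return 0.25 / (2 * n_males) + 0.25 / (2 * n_females)

print(effective_size(50, 50))  # equal sex ratio: 100.0
print(effective_size(90, 10))  # skewed sex ratio: 36.0
print(pair_coalescence_probability(90, 10), 1 / (2 * effective_size(90, 10)))
```

With the same total of 100 individuals, the skewed sex ratio lowers N_e from 100 to 36, and the heuristic probability coincides with (2N_e)^{-1}.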

Recall that in the models discussed above, sources of genetic variation are not taken into account. These are the processes of mutation and genetic recombination (see footnote 3) [21]. How these two processes are commonly modelled is covered in the next section.

2.2 Mutation and recombination

In natural populations, a number of different alleles appear at many loci [21]. Genetic variation at a single locus may be induced by the process of mutation. Mutations may cause changes in the nucleotide sequence at a given locus in several different ways. One possibility is that mutations induce a change of one or more nucleotides in the sequence. Other possibilities include rearrangements within sequences, such as inversions or translocations [21]. Mutations may also shorten or extend the sequences (deletions and insertions).

In [25] it was suggested that most mutations in natural populations are selectively neutral. But they may also be under selective pressure (for different types of natural selection see, for example, [21]). As mentioned in the Introduction, loci under selection may reveal the mechanisms underlying speciation. One method to detect them is to use neutral loci as markers. In what follows, patterns of neutral genetic variation are analysed (unless otherwise stated).

When modelling neutral mutations, the so-called infinite-alleles model [26] is commonly used. Under this model, each mutation gives rise to an allele that was not previously present in the population.

Footnote 2: According to Mendelian inheritance, a child inherits at random one of the two maternal alleles, and at random one of the two paternal alleles.

Footnote 3: Apart from these two processes, a population may also receive new genetic material through the process of migration [21].


Another commonly used model is the infinite-sites model [27]. In this model, one treats loci as infinitely long sequences of sites (i.e. nucleotides). It is assumed that each mutation occurs at a new site, causing single nucleotide polymorphisms (SNPs). Note that exactly two different nucleotides appear at each polymorphic site under this model.

A third model is the stepwise-mutation model [28, 29] (see also [30]). In this model, an allele is defined by the number of repeated sequences of base pairs it contains, and it is assumed that a mutation occurring at a given locus may either decrease or increase the number of repeated sequences by one [28] (for example, due to deletions or insertions). The stepwise-mutation model is commonly used to describe genetic variation at microsatellite loci (see footnote 4) since alleles at a given microsatellite locus mainly differ by the number of repeated sequences.

In this thesis, mutations are modelled according to either the infinite-alleles or the infinite-sites model. It is assumed that mutations may accumulate along genetic sequences with probability µ per sequence per generation. This probability is assumed to be constant over time.

As mentioned in Section 2.1, the process of mutation establishes genetic variation at a single locus. Conversely, genetic recombination establishes variation at multiple loci. This is because in nature, a child typically inherits neither a complete set of chromosomes from the mother, nor a complete set of chromosomes from the father (see [23, 21]). Instead, due to recombination, parts of a given chromosome may stem from the maternal chromosome, while other parts may stem from the paternal chromosome. An example of chromosomal crossover is illustrated in Fig. 2.1. Aside from crossover, there are other types of recombination such as gene conversion (see [21] and references therein).

Empirical data typically show that the larger the distance between a pair of loci on the same chromosome is, the more frequently the two loci recombine [31]. It is common to express the physical distance between two loci in terms of r, the probability that a chromosome recombines between the two loci per generation per chromosome. In this thesis, it is assumed that r is constant over time.

Note that in well mixed populations, assuming random mating and Mendelian inheritance, the association of neutral genetic variation between a pair of loci situated on different chromosomes is random. Such loci are said to be in linkage equilibrium. Otherwise, loci are said to be in linkage disequilibrium (LD). A list of measures of linkage disequilibrium is given in Subsec. 2.4.2.

Footnote 4: Microsatellite loci contain repeated sequences of two to five base pairs [30].

[Figure 2.1: schematic with panels a, b and c. Labels: Maternal Chromosomes, Paternal Chromosomes, Crossover, Maternal Chromosome, Paternal Chromosome, Child’s Chromosomes.]

Figure 2.1: Recombination due to crossover of chromosomes (schematically). Two maternal and two paternal chromosomes are shown in panel a. Left colored areas of the chromosomes depict allelic types at locus a, and right colored parts depict allelic types at locus b. Both loci have two alleles. Two parental chromosomes (one maternal and one paternal) that are passed on to a child are depicted by an arrow in panel a. In panel b, a crossover between the two chromosomes is illustrated. Both chromosomes split in two parts. Each part stemming from the maternal chromosome attaches to the corresponding part of the paternal chromosome (depicted by arrows). As a result, the child has a different combination of allelic types than its parents (see panel c). Note, for example, that neither of the parents has a green allele at locus a combined with a white allele at locus b.


2.3 Standard coalescent approximation

The standard coalescent approximation provides a fast method for tracing backwards the ancestry of alleles sampled at the present time until the most recent common ancestor (MRCA) of the sample is found. In what follows, the ancestry of a given sample is called the gene genealogy. The basic concepts behind the standard coalescent theory are described in this section.

Consider a Wright-Fisher population of N haploid individuals. A gene genealogy of n sequences sampled from this population at the present time can be inferred using the standard coalescent theory. The derivation of Eqs. (2.4)-(2.9) given below was described in [7] (see also [9]).

The probability P(n, 1) that the n alleles sampled have different ancestors one generation back in time is

$$P(n,1) = \prod_{i=1}^{n-1}\left(1-\frac{i}{N}\right). \tag{2.4}$$

In the case N ≫ n, P(n, 1) becomes

$$P(n,1) \approx 1 - \sum_{i=1}^{n-1}\frac{i}{N} = 1 - \frac{n(n-1)}{2N}. \tag{2.5}$$

It follows that in this case, the probability P(n, ℓ) = P(n, 1)^ℓ that the sample has n distinct ancestors ℓ generations back in time is

$$P(n,\ell) \approx \left(1-\frac{n(n-1)}{2N}\right)^{\ell}, \tag{2.6}$$

which in the case n^2 ≪ N reduces to

$$P(n,\ell) \approx e^{-\ell\,\frac{n(n-1)}{2N}}. \tag{2.7}$$

Therefore, the number of ancestors of the sample first drops below n exactly ℓ + 1 generations back in time with probability

$$P_c(n,\ell+1) \approx \frac{n(n-1)}{2N}\,e^{-\ell\,\frac{n(n-1)}{2N}}. \tag{2.8}$$

In other words, in generation ℓ + 1 at least two sequences find their common ancestor with probability P_c(n, ℓ + 1). Note that, under the assumption N ≫ 1, the probability that more than two sequences find their MRCA in a single generation is negligible and can be ignored. Therefore, P_c(n, ℓ + 1) is the probability that a pair of sequences, among the n sequences sampled, find their MRCA in generation ℓ + 1 back in time. An event in which a pair of sequences find their MRCA is called a coalescent event.
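The quality of the approximations in Eqs. (2.5) and (2.8) can be examined numerically. The following sketch is illustrative only; the function names and the values n = 10 and N = 100000 are assumptions chosen so that n^2 ≪ N holds.

```python
from math import exp, prod

def p_distinct_exact(n, N):
    """Eq. (2.4): all n alleles have different ancestors one generation back."""
    return prod(1 - i / N for i in range(1, n))

def p_distinct_approx(n, N):
    """Eq. (2.5): first-order approximation, valid for N >> n."""
    return 1 - n * (n - 1) / (2 * N)

def p_first_coalescence(n, N, generation):
    """Eq. (2.8) with generation = l + 1: the first coalescent event
    occurs exactly `generation` generations back in time."""
    rate = n * (n - 1) / (2 * N)
    return rate * exp(-(generation - 1) * rate)

n, N = 10, 100_000
print(p_distinct_exact(n, N), p_distinct_approx(n, N))
# The probabilities of Eq. (2.8) should sum to approximately one.
print(sum(p_first_coalescence(n, N, g) for g in range(1, 100_001)))
```

The sum in the last line is close to, but not exactly, one because Eq. (2.8) replaces a geometric distribution by its exponential approximation.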

Using Eq. (2.8) it is possible to compute the distribution of the time to the first coalescent event in a sample of n alleles. Let the number of generations ℓ be measured in units of t, such that ℓ = ⌊Nt⌋, where ⌊x⌋ is the largest integer less than or equal to x. In these units of time, it follows from Eq. (2.8) that the time to the first coalescent event in a sample of n alleles, τ_n, is approximately exponentially distributed with mean:

$$\langle\tau_n\rangle = \binom{n}{2}^{-1}. \tag{2.9}$$

In order to compute the time until the MRCA of the entire sample, note that each coalescent event reduces the number of ancestral lines to be traced back by one. Therefore, the time to the MRCA of the entire sample is given by $\sum_{i=2}^{n}\tau_i$, where τ_i (i = 2, . . . , n) are independent random variables, distributed exponentially with parameter $\binom{i}{2}$ [9].

The standard coalescent approximation provides a method for generating an ensemble of gene genealogies of sample size n much more efficiently than by tracing the ancestry generation by generation. Moreover, although the above approximation does not take into account mutations, neutral mutations may be superimposed on the resulting gene genealogies [9]. The number of mutations along a branch of length τ_i is Poisson distributed with mean θτ_i/2, where θ = 2µN, and µ ≪ 1 is the mutation rate per generation per allele.
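Generating such an ensemble can be sketched as follows. This is a minimal illustration under stated assumptions, not the simulation code used in the thesis: the function names, the seed, and the parameter values are hypothetical, and the Poisson sampler is a simple stand-in that counts unit-rate arrivals. Summing the means in Eq. (2.9) gives an expected time to the MRCA of 2(1 − 1/n) in units of N generations, and an expected total number of mutations of θ(1 + 1/2 + ... + 1/(n − 1)), which the simulated averages should approach.

```python
import random
from math import comb

def sample_genealogy(n, rng):
    """Draw tau_i for i = n, ..., 2 in units of N generations; tau_i is
    exponential with rate C(i, 2), i.e. mean 1 / C(i, 2) as in Eq. (2.9)."""
    return {i: rng.expovariate(comb(i, 2)) for i in range(n, 1, -1)}

def poisson(mean, rng):
    """Poisson sample: count unit-rate arrivals in the interval [0, mean]."""
    count, t = 0, rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count

def simulate(n, theta, replicates, seed=0):
    """Return the average time to the MRCA and the average number of mutations."""
    rng = random.Random(seed)
    tmrca_sum, mutation_sum = 0.0, 0
    for _ in range(replicates):
        taus = sample_genealogy(n, rng)
        tmrca_sum += sum(taus.values())
        # While i lines exist, the genealogy has i branches of length tau_i each.
        total_length = sum(i * tau for i, tau in taus.items())
        mutation_sum += poisson(theta * total_length / 2, rng)
    return tmrca_sum / replicates, mutation_sum / replicates

n, theta = 10, 2.0
mean_tmrca, mean_mutations = simulate(n, theta, replicates=20_000)
print(mean_tmrca, 2 * (1 - 1 / n))
print(mean_mutations, theta * sum(1 / i for i in range(1, n)))
```

Each replicate requires only n − 1 exponential draws, independent of N, which is why this is much faster than following the population generation by generation.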

Apart from being efficient, the standard coalescent approximation is also robust. It can be proven that the standard coalescent is not only valid for the Wright-Fisher model, but also for many other population models, provided that the variance σ^2 of the reproductive success between individuals remains finite for N → ∞ (σ^2/N → 0 in the limit N → ∞) [8]. An example is the Moran model (see Section 2.1). For this model it can be shown [8] that the coalescent method is applicable, but with the generations rescaled by a factor N^2/2 instead of by N as in the Wright-Fisher model. As a consequence, the number of generations for two lines to coalesce under the Wright-Fisher model is N/2 times shorter than the time to the MRCA of two lines in the Moran model. In other words, by increasing the variance in reproductive success among individuals, the time to the MRCA becomes shorter.

Furthermore, although the standard coalescent approximation is built upon the assumption of constant population size, in some cases it can be applied to fluctuating population sizes by defining a corresponding effective population size [24, 32, 33]. In [32] it was shown that the effective population-size approximation is applicable for the cases of both slow and rapid population-size fluctuations. In the former case, the effective population size is approximately equal to the population size at the present time, and in the latter case it is equal to the harmonic mean of the temporal population sizes N_ℓ:

$$N_e = \left(\lim_{L\to\infty}\frac{1}{L}\sum_{\ell=0}^{L-1}\frac{1}{N_\ell}\right)^{-1}. \tag{2.10}$$

However, in the general case of population size fluctuations, the standard coalescent approximation may not be appropriate to describe typical gene genealogies [32, I]. The gene genealogies for populations of fluctuating sizes are discussed in more detail in Chapter 3.
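For a finite record of population sizes, the harmonic mean in Eq. (2.10) can be evaluated directly. The following sketch assumes a finite list of sizes in place of the limit $L \to \infty$; the function name and the example history are hypothetical.

```python
def harmonic_mean_size(sizes):
    """Finite-L version of Eq. (2.10): harmonic mean of the sizes N_l."""
    if not sizes or any(size <= 0 for size in sizes):
        raise ValueError("sizes must be a non-empty list of positive numbers")
    return len(sizes) / sum(1.0 / size for size in sizes)


# Hypothetical history alternating between 1000 and 100 individuals.
history = [1000, 100] * 50
effective_size = harmonic_mean_size(history)
```

The example illustrates that the harmonic mean is dominated by the generations with small population size: the result is much closer to 100 than to the arithmetic mean of 550.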

Finally, note that apart from the standard coalescent, there are also other types of coalescents, such as Xi-coalescents [34, 35, 36, 37]. In the simplest case, more than two ancestral lines are allowed to merge into a single ancestor in a given generation. This type of Xi-coalescent is also known as the Lambda-coalescent [35]. In the more general case, a Xi-coalescent allows for several such multiple mergers to occur simultaneously in a given generation.

A Xi-coalescent can be obtained under models allowing for skewed offspring distribution among individuals in a population [38], as well as in the models accounting for selective sweeps [37]. It was shown in [II] that multiple mergers can also be obtained when a population undergoes recurrent bottlenecks in its history (as noticed also in [39]).

2.4 Observables

Thanks to advances in genetic sequencing, it is possible to observe genetic variation in natural populations. In this section, a number of commonly used observables in population genetics are listed.

2.4.1 Single-locus observables

Consider a population with a given demographic history. Denote by $N_\ell$ the population size in generation $\ell$, and assume that $\ell$ is measured in units of $t$ such that $\ell = \lfloor t N_0 \rfloor$. Using these units, denote by $\tau_i$ ($i = 2, \ldots, n$) the time during which a given gene genealogy of sample size $n$ has exactly $i$ ancestral lines. Assume that neutral mutations accumulate at rate $\mu$ along branches of this gene genealogy according to the infinite-sites model (see Section 2.2) and denote the total number of segregating sites by $S_n$. Assuming that $N_\ell \gg 1$ and $\mu = \theta/(2N_0) \ll 1$, one finds that the number of mutations $S_n$ accumulated along the branches of total length $T_n = \sum_{i=2}^{n} i \tau_i$ is a random number following a Poisson distribution with mean $\theta T_n / 2$ [9]. Averaging $S_n$ over different gene genealogies under the same demographic history yields

$$\langle\langle S_n | T_n \rangle\rangle = \frac{\theta}{2} \langle T_n \rangle , \quad \text{and} \quad \langle\langle S_n^2 | T_n \rangle\rangle = \frac{\theta}{2} \langle T_n \rangle + \frac{\theta^2}{4} \langle T_n^2 \rangle . \tag{2.11}$$

More generally, the distribution of $S_n$ may be expressed in terms of the moments $\langle T_n^k \rangle$. In order to show this, one can consider the function $F_n(q)$, which is defined as the Laplace transform of the probability distribution of $T_n$:

$$F_n(q) = \langle e^{-q T_n} \rangle . \tag{2.12}$$

In Eq. (2.12), averaging is done over independent gene genealogies of sample size $n$, resulting from the same demographic history. Note that $F_n(\theta/2)$ corresponds to the probability of no mutations in a sample of size $n$. In particular, $F_2(\theta/2)$ is the probability that two randomly chosen genetic sequences are identical. This probability is also known as population homozygosity. From Eq. (2.12) it follows that the moments of $T_n$ can be found according to

$$\langle T_n^k \rangle = (-1)^k \left. \frac{d^k}{dq^k} F_n(q) \right|_{q=0} . \tag{2.13}$$

The moments of $S_n$ can be expressed in terms of $\langle e^{-q S_n} \rangle$, such that

$$\langle e^{-q S_n} \rangle = F_n\left[ \frac{\theta}{2} \left( 1 - e^{-q} \right) \right] . \tag{2.14}$$

In Appendix A it is demonstrated how this expression is obtained starting from Eq. (2.12). As Eq. (2.14) shows, the moments of the total number of SNPs are completely determined by the moments of $T_n$.
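As a simple check of Eqs. (2.12) and (2.14), consider a sample of size $n = 2$ in a population of constant size. In this case $T_2 = 2\tau_2$ with $\tau_2$ exponentially distributed with parameter one, so that $F_2(q) = 1/(1 + 2q)$ and the population homozygosity is $F_2(\theta/2) = 1/(1 + \theta)$. Inserting $F_2$ into Eq. (2.14) gives the generating function of a geometric distribution, $P(S_2 = k) = \theta^k / (1 + \theta)^{k+1}$. The sketch below compares both sides of Eq. (2.14) numerically; the function names are hypothetical and the series is truncated at an assumed number of terms.

```python
import math


def laplace_T2(q):
    """F_2(q) = <exp(-q T_2)> for T_2 = 2 tau_2 with tau_2 ~ Exp(1)."""
    return 1.0 / (1.0 + 2.0 * q)


def homozygosity(theta):
    """Probability that two sequences are identical, F_2(theta / 2)."""
    return laplace_T2(0.5 * theta)


def laplace_S2_from_eq_2_14(q, theta):
    """Right-hand side of Eq. (2.14) for n = 2."""
    return laplace_T2(0.5 * theta * (1.0 - math.exp(-q)))


def laplace_S2_direct(q, theta, terms=2000):
    """Left-hand side: sum of exp(-q k) P(S_2 = k) for geometric S_2."""
    p = 1.0 / (1.0 + theta)
    return sum(math.exp(-q * k) * p * (1.0 - p) ** k for k in range(terms))
```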

Moreover, under the stepwise mutation model, one can use the function $F_2$ given above to calculate the distribution $p_j$ of the number of differences $j$ between two randomly sampled microsatellite sequences [29, 40]. As mentioned in Section 2.2, within the stepwise mutation model, a microsatellite locus under a mutation either gains or, equally likely, loses one repeat unit. Assuming that such gains and losses occur according to a Poisson process, $p_j$ is found to be [29, 40]

$$p_j = \frac{1}{2\pi} \int_0^{2\pi} d\omega \, \cos(\omega j) \, F_2\left[ \theta - \theta \cos(\omega) \right] . \tag{2.15}$$

Apart from the total number of SNPs, one can observe the frequency spectrum of SNPs in a given population. This is an important set of observables because, as pointed out in [15, 16], the frequency spectrum of SNPs provides information about the underlying population-size history. Let $\xi_i$ denote the number of mutations of size $i$ in a sample of size $n$ ($i = 1, \ldots, n - 1$), that is, the number of mutations carried by exactly $i$ of the $n$ sampled sequences. The total number of SNPs is $S_n = \sum_{i=1}^{n-1} \xi_i$. It was shown in [41] how $\langle \xi_i \rangle$ and $\langle \xi_i \xi_j \rangle$ can be computed in terms of the moments of branch lengths $\tau_i$. The corresponding expressions are [41]:

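$$\langle \xi_i \rangle = \frac{\theta}{2} \sum_{k=2}^{n} k \, p(k, i) \langle \tau_k \rangle , \tag{2.16}$$

$$
\begin{aligned}
\langle \xi_i \xi_j \rangle = {} & \delta_{i,j} \sum_{k=2}^{n} k \, p(k, i) \left( \frac{\theta}{2} \langle \tau_k \rangle + \frac{\theta^2}{4} \langle \tau_k^2 \rangle \right) \\
& + \frac{\theta^2}{4} \Bigg( \sum_{k=2}^{n} k (k - 1) \, p(k, i; k, j) \langle \tau_k^2 \rangle \\
& + \sum_{k<m}^{n} k m \big( p(k, i; m, j) + p(k, j; m, i) \big) \langle \tau_k \tau_m \rangle \Bigg) ,
\end{aligned} \tag{2.17}
$$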

where $\delta_{i,j} = 1$ for $i = j$, and $\delta_{i,j} = 0$ otherwise. The terms $p(k, i)$, $p(k, i; k, j)$, $p(k, i; m, j)$ appearing in Eqs. (2.16)-(2.17) come from [41], and they are listed in Appendix B. Note that in a population of constant size, one has [41]

$$\langle \xi_i \rangle = \frac{\theta}{i} . \tag{2.18}$$
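Eq. (2.18) is easily tabulated. The short sketch below (hypothetical function names) lists the expected spectrum for a population of constant size and sums it to obtain the expected total number of SNPs, $\langle S_n \rangle = \theta \sum_{i=1}^{n-1} i^{-1}$.

```python
def expected_sfs(n, theta):
    """Expected counts <xi_i> = theta / i for i = 1, ..., n - 1 (Eq. 2.18)."""
    return [theta / i for i in range(1, n)]


def expected_segregating_sites(n, theta):
    """<S_n> obtained as the sum of the expected frequency spectrum."""
    return sum(expected_sfs(n, theta))
```

For example, with $n = 4$ and $\theta = 3$ the expected spectrum is $(3, 1.5, 1)$, so that singletons are expected to be three times as common as mutations carried by three sequences.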

In summary, the expressions (2.16)-(2.17) show how the first two moments of the frequency spectrum of SNPs can be computed using the first two moments of the branch lengths $\tau_i$. These expressions imply that any observable defined as a linear combination of the frequency spectrum of SNPs can be expressed in terms of the moments of $\tau_i$. Such observables include a number of commonly used tests of neutrality. The idea behind the construction of such tests is briefly reviewed in Appendix C.

2.4.2 Two-locus observables

Multi-locus patterns of genetic variation reflect the degree of association between loci. This subsection introduces common measures of the degree of association between loci.

Let two loci be denoted by $a$ and $b$, and let $A_1$ and $A_2$ be possible allelic types at $a$ (likewise, $B_1$ and $B_2$ denote possible allelic types at $b$). Assume that loci are short, so that a child inherits complete loci from its parents. In humans, this is typically fulfilled for loci that are not longer than a few hundred base pairs [42]. The degree of association between these loci is commonly expressed in terms of the frequencies $p_{ij}$ ($i = 1, 2$, $j = 1, 2$) of the so-called gamete $A_i B_j$ according to [21]

$$D = p_{11} p_{22} - p_{12} p_{21} . \tag{2.19}$$

In order to understand the measure $D$, it is convenient to write Eq. (2.19) in terms of the frequencies of $A_i$ and of $B_i$, denoted by $q_i$ and $\tilde{q}_i$, respectively [21]:

$$
\begin{aligned}
D &= p_{ii} - q_i \tilde{q}_i , \quad \text{for } i = 1, 2 , \\
D &= q_i \tilde{q}_j - p_{ij} , \quad \text{for } i = 1, 2, \text{ and } j = 3 - i .
\end{aligned} \tag{2.20}
$$

Here, one has $q_i = p_{ii} + p_{ij}$, and $\tilde{q}_i = p_{ii} + p_{ji}$, where $j = 3 - i$. If $D = 0$, the loci $a$ and $b$ are randomly associated and they are said to be in linkage equilibrium. Otherwise they are in linkage disequilibrium (LD).

Another well known measure of the degree of association between loci (that is, of LD) is [21]

$$r^2 = \frac{D^2}{q_1 q_2 \tilde{q}_1 \tilde{q}_2} . \tag{2.21}$$

Under linkage equilibrium, the numerator of Eq. (2.21) is expected to evaluate to zero, but this may not be true for $r^2$ itself. For this reason, empirically obtained values of $r^2$ may be difficult to interpret.
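The measures $D$ and $r^2$ of Eqs. (2.19) and (2.21) can be computed directly from the four gamete frequencies. The following is a minimal sketch; the function name and the error handling are assumptions made for illustration.

```python
def linkage_disequilibrium(p11, p12, p21, p22):
    """Return (D, r^2) from the gamete frequencies p_ij of A_i B_j."""
    total = p11 + p12 + p21 + p22
    if abs(total - 1.0) > 1e-9:
        raise ValueError("gamete frequencies must sum to one")
    d = p11 * p22 - p12 * p21        # Eq. (2.19)
    a1, a2 = p11 + p12, p21 + p22    # frequencies of A_1 and A_2
    b1, b2 = p11 + p21, p12 + p22    # frequencies of B_1 and B_2
    denominator = a1 * a2 * b1 * b2
    if denominator == 0.0:
        raise ValueError("r^2 is undefined when a locus is monomorphic")
    return d, d * d / denominator    # Eq. (2.21)
```

For randomly associated loci, such as `linkage_disequilibrium(0.25, 0.25, 0.25, 0.25)`, both measures vanish, whereas complete association, `linkage_disequilibrium(0.5, 0.0, 0.0, 0.5)`, gives $D = 0.25$ and $r^2 = 1$.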

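It was suggested in [43] that $\langle r^2 \rangle$ may be approximated by $\sigma_d^2$ defined as

$$\sigma_d^2 = \frac{\langle D^2 \rangle}{\langle q_1 q_2 \tilde{q}_1 \tilde{q}_2 \rangle} . \tag{2.22}$$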

In [44] it was shown that $\sigma_d^2$ is a good approximation to $\langle r^2 \rangle$ upon the exclusion of rare alleles. More importantly, it was shown in [44] that $\sigma_d^2$ can be computed using the coalescent approximation. Denoting the time to the MRCA of chromosomes $i$ and $j$ ($i \neq j$) at the locus $a$ by $t_{a(ij)}$ (and likewise, $t_{b(ij)}$ for the locus $b$), the measure $\sigma_d^2$ can be expressed as [44]

$$\sigma_d^2 = \frac{\mathrm{cov}[t_{a(ij)}, t_{b(ij)}] - 2\,\mathrm{cov}[t_{a(ij)}, t_{b(ik)}] + \mathrm{cov}[t_{a(ij)}, t_{b(kl)}]}{\langle t_{a(ij)} \rangle^2 + \mathrm{cov}[t_{a(ij)}, t_{b(kl)}]} . \tag{2.23}$$

In Eq. (2.23) it is assumed that the sample size is large, so that the probability that the chromosomes $i$, $j$, $k$, and $l$ are not all distinct is negligible. For finite sample sizes, a correction needs to be taken into account (see [44] and references therein).

Moreover, the covariances of the times to the MRCA of pairs of chromosomes at given loci may also be understood as measures of the degree of association between genetic variation at these loci. Indeed, denoting by $S_{a(ij)}$ and $S_{b(ij)}$ the total number of SNPs observed at the locus $a$ and at the locus $b$, respectively, both loci being situated on the chromosomes $i$ and $j$, it follows [42]

$$\mathrm{cov}[S_{a(ij)}, S_{b(ij)}] \propto \mu'^2 \, \mathrm{cov}[t_{a(ij)}, t_{b(ij)}] . \tag{2.24}$$

Here $\mu'$ is the mutation rate per generation per site of a locus, and it is assumed that the mutation rate does not fluctuate significantly along the genome. Note that Eqs. (2.23)-(2.24) can be employed for any given demographic history.

For a constant population size with $N$ chromosomes, using results of the standard coalescent approximation (see Section 2.3), it follows that [45, 46, 47, 44]

$$\mathrm{cov}[t_{a(ij)}, t_{b(ij)}] = \frac{18 + R}{18 + 13R + R^2} , \tag{2.25}$$

$$\mathrm{cov}[t_{a(ij)}, t_{b(ik)}] = \frac{6}{18 + 13R + R^2} , \tag{2.26}$$

$$\mathrm{cov}[t_{a(ij)}, t_{b(kl)}] = \frac{4}{18 + 13R + R^2} . \tag{2.27}$$

Here, the parameter $R = 2Nr$ is the scaled recombination rate. Furthermore, one has [43, 44]

$$\sigma_d^2 = \frac{10 + R}{22 + 13R + R^2} . \tag{2.28}$$
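Eq. (2.28) follows from inserting Eqs. (2.25)-(2.27) into Eq. (2.23). The sketch below checks this numerically. It assumes that time is scaled such that $\langle t_{a(ij)} \rangle = 1$, which is consistent with Eq. (2.25) reducing to unity (the variance of an exponential variable with unit mean) at $R = 0$; the function names are hypothetical.

```python
def coalescent_covariances(R):
    """Covariances of pairwise coalescence times, Eqs. (2.25)-(2.27)."""
    denominator = 18.0 + 13.0 * R + R * R
    return (18.0 + R) / denominator, 6.0 / denominator, 4.0 / denominator


def sigma_d_squared(R, mean_time=1.0):
    """Eq. (2.23) evaluated with the constant-size covariances."""
    c_ij_ij, c_ij_ik, c_ij_kl = coalescent_covariances(R)
    return (c_ij_ij - 2.0 * c_ij_ik + c_ij_kl) / (mean_time ** 2 + c_ij_kl)


def sigma_d_squared_closed_form(R):
    """Eq. (2.28)."""
    return (10.0 + R) / (22.0 + 13.0 * R + R * R)
```

Both functions give $\sigma_d^2 = 5/11$ for completely linked loci ($R = 0$), and $\sigma_d^2$ decays as $1/R$ for large $R$.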

Chapter 5 discusses how these expressions are modified under sustained population-size fluctuations. The results in Chapter 5 make it possible to qualitatively understand the effect of population-size fluctuations on two-locus patterns of neutral genetic variation.


3 Gene genealogies under a varying population size

As explained in the previous chapter, it is known how to compute the moments of branch lengths of single-locus gene genealogies within populations of constant size. Since these moments are directly related to the moments of the number of neutral SNPs appearing a given number of times in a sample [41], it follows that the patterns of neutral genetic variation in populations of constant size are well understood.

However, real populations exhibit population-size fluctuations, due to, for example, environmental changes [14], repeated founder events [10, 11, 12], or range expansions [13] (see also [I, II] and references therein). Since the moments of the total number of SNPs along gene genealogies of sample size $n$ can be computed in terms of the moments of the total branch length $\langle T_n^k \rangle$ [9] (see also Subsection 2.4.1), it is of particular interest to compute $\langle T_n^k \rangle$. A general expression for $\langle T_n^k \rangle$ under a varying population size was derived in [I]. Note also that an expression for $\langle T_n \rangle$ was computed in [48] and the results in [49] allow for evaluating $\langle T_n^2 \rangle$ under a varying population size. The remainder of this chapter is based on the results of [I].

In what follows, a haploid population of a varying size is considered. (Note that in the case of a well mixed diploid population, the following also applies but with population size multiplied by two [9].) The population size at generation $\ell = 0, 1, \ldots$ is denoted by $N_\ell$, and it is assumed that $N_\ell \gg 1$. This assumption allows for using the coalescent approximation that not more than two ancestral lines can merge in a single generation. Furthermore, a time span of $\ell$ generations is measured in units of $t$ such that $\ell = \lfloor N_0 t \rfloor$ (see also Section 2.3). Under the condition $N_\ell \gg 1$, $t$ can be treated as a continuous variable. In order to emphasize this, the population size at time $t$ is denoted by $N(t)$ below, and the relative population size $N(t)/N_0$ is denoted by $X(t)$.

Figure 3.1: Gene genealogy of a sample of size $n = 5$ (illustration). The times during which the gene genealogy has exactly $i = 2, 3, 4, 5$ lines are denoted by $\tau_i$. The most recent common ancestor (MRCA) of the sample is found at time $\sum_{i=2}^{n} \tau_i$ back in the past.

If $X(t)$ can be approximated by a piecewise smooth function, the moments of the total branch length $T_n$ of the resulting gene genealogies of sample size $n$ can be computed using the approach outlined in [I]. Denoting by $\tau_m$ ($m = 2, \ldots, n$) the time during which a given gene genealogy of $n$ sampled genetic sequences has exactly $m$ ancestral lines (see Fig. 3.1), the total branch length can be expressed as:

$$T_n = \sum_{m=2}^{n} m \tau_m . \tag{3.1}$$

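The $k$th moment of $T_n$ is given by [I]

$$\langle T_n^k \rangle = \sum_{\substack{\nu_2, \nu_3, \ldots, \nu_n \\ \nu_2 + \nu_3 + \cdots + \nu_n = k}} \binom{k}{\nu_2, \nu_3, \ldots, \nu_n} n^{\nu_n} \cdots 2^{\nu_2} \langle \tau_n^{\nu_n} \cdots \tau_2^{\nu_2} \rangle . \tag{3.2}$$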

Here, $\nu_m \in \{0, \ldots, k\}$ obey the constraint $\sum_{m=2}^{n} \nu_m = k$. In order to calculate the $k$th moment of $T_n$, one needs to compute the nested expectation value $\langle \tau_n^{\nu_n} \cdots \tau_2^{\nu_2} \rangle$ in Eq. (3.2). In the constant population-size case, the times $\tau_m$, $m = 2, \ldots, n$, are mutually independent (see Section 2.3). Therefore, in this case one has

$$\langle \tau_n^{\nu_n} \cdots \tau_2^{\nu_2} \rangle = \langle \tau_n^{\nu_n} \rangle \cdots \langle \tau_2^{\nu_2} \rangle . \tag{3.3}$$

Moreover, in populations of constant size, $\tau_m$ is exponentially distributed with parameter $\binom{m}{2}$. This allows for the expectation in Eq. (3.3) to be computed explicitly. By contrast, in the case of a varying population size, $\tau_m$ depends on $\tau_n, \ldots, \tau_{m+1}$. Following [I], $\tau_m$ is expressed as

$$\tau_m = \int_0^\infty dt \, 1_{\{\eta(t) = m\}} . \tag{3.4}$$

Here, $\eta(t)$ denotes the number of ancestral lines at time $t$, and $1_{\{\eta(t) = m\}}$ is unity if $\eta(t) = m$, and zero otherwise. Averaging Eq. (3.4) over different gene genealogies yields [I]:

$$\langle \tau_m \rangle = \int_0^\infty dt \, f_{nm}(0, t) , \tag{3.5}$$

where $f_{nm}(t_1, t_2)$ denotes the probability of having $m$ ancestral lines at time $t_2$, conditional on having $n \geq m$ lines at time $t_1 < t_2$. This probability satisfies [50, 51]

$$f_{nm}(t_1, t_2) = g_{nm}[\Lambda(t_2) - \Lambda(t_1)] . \tag{3.6}$$

Here, $g_{nm}(t_2 - t_1)$ is the probability of having $m$ lines at time $t_2$, conditional on having $n \geq m$ lines at time $t_1 < t_2$ for a population of constant size. It is explicitly computed in [52]. The function

$$\Lambda(t) = \int_0^t \frac{ds}{X(s)} , \tag{3.7}$$

has been termed 'the population-size intensity function' [50]. It accounts for temporal changes in the coalescent time scale (the coalescent time scale within a population with size $x$ is equal to $x$). For a population of constant size, one has $\Lambda(t) = t$.
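For a sample of two lines, the constant-size probability that no coalescence has occurred is $g_{22}(s) = e^{-s}$, so that Eqs. (3.5)-(3.7) give $\langle \tau_2 \rangle = \int_0^\infty dt \, e^{-\Lambda(t)}$. The following sketch evaluates $\Lambda(t)$ and this integral for a piecewise constant relative population size. The function names, the representation of $X(t)$ by breakpoints, and the truncation of the integral at `t_max` are assumptions made for illustration.

```python
import math


def intensity(t, breakpoints, sizes):
    """Lambda(t) = int_0^t ds / X(s) for piecewise-constant X.

    X(s) = sizes[k] on [breakpoints[k], breakpoints[k + 1]), with
    breakpoints[0] = 0 and the last size extending to infinity.
    """
    total = 0.0
    for k, size in enumerate(sizes):
        start = breakpoints[k]
        end = breakpoints[k + 1] if k + 1 < len(breakpoints) else math.inf
        if t <= start:
            break
        total += (min(t, end) - start) / size
    return total


def mean_tau2(breakpoints, sizes, t_max=60.0, steps=60000):
    """<tau_2> = int_0^inf exp(-Lambda(t)) dt, midpoint rule on [0, t_max]."""
    dt = t_max / steps
    return sum(
        math.exp(-intensity((k + 0.5) * dt, breakpoints, sizes)) * dt
        for k in range(steps)
    )
```

For a constant population size, `mean_tau2([0.0], [1.0])` is close to unity, as expected. A period of reduced size, for example `mean_tau2([0.0, 1.0, 1.5], [1.0, 0.1, 1.0])`, shortens the expected coalescence time because $\Lambda(t)$ grows faster while the population is small.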

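Similarly, for $k = 2$, one finds [I]:

$$\langle \tau_i \tau_j \rangle = \int_0^\infty dt_1 \, f_{ni}(0, t_1) \int_{t_1}^\infty dt_2 \, f_{ij}(t_1, t_2) , \quad \text{for } j < i , \tag{3.8}$$

$$\langle \tau_i^2 \rangle = 2 \int_0^\infty dt_1 \, f_{ni}(0, t_1) \int_{t_1}^\infty dt_2 \, f_{ii}(t_1, t_2) . \tag{3.9}$$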

Here, $f_{ni}(t_0, t_1) f_{ij}(t_1, t_2)$ stands for the probability that a gene genealogy has $i$ lines at time $t_1$ and $j \leq i$ lines at time $t_2 > t_1$, conditional on having $n$ lines at time $t_0 < t_1$.

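Higher moments of $\tau_m$ can be obtained in a similar manner, that is by using the multiplicative rule for conditional probabilities $f_{ij}(t_1, t_2)$. This makes it possible to compute the nested expectation value in Eq. (3.3), and hence $\langle T_n^k \rangle$. It is found that [I]:

$$
\begin{aligned}
\langle T_n^k \rangle = {} & k! \sum_{j_1=2}^{n} \cdots \sum_{j_k=2}^{j_{k-1}} d_{n; j_1, \ldots, j_k} \int_0^\infty dt_1 \, e^{-b_{j_1} \Lambda(t_1)} \int_{t_1}^\infty dt_2 \, e^{-b_{j_2} [\Lambda(t_2) - \Lambda(t_1)]} \\
& \cdots \int_{t_{k-1}}^\infty dt_k \, e^{-b_{j_k} [\Lambda(t_k) - \Lambda(t_{k-1})]} .
\end{aligned} \tag{3.10}
$$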

Here, $b_j = \binom{j}{2}$, and the coefficients $d_{n; j_1, \ldots, j_k}$ are given in [I]. The expression (3.10) shows how to compute moments of the total branch length corresponding to gene genealogies of a varying population size. In the case $k = 1$, Eq. (3.10) is in agreement with the expression for $\langle T_n \rangle$ derived in [48]. For $k = 2$, Eqs. (3.8)-(3.9) agree with the results given in [49].

As mentioned in the introduction of this chapter, the derivation of Eq. (3.10) given above relies upon the assumption that the relative population size $X(t)$ can be approximated by a piecewise smooth function. A particular model of the population size that fluctuates stochastically around a time-dependent carrying capacity was analysed in [I]. The model is shown in Fig. 3 in [I]. The first four moments of the total branch length of gene genealogies obtained under this model were computed in [I] by means of computer simulations and also analytically, identifying the population size in Eq. (3.10) with the given time-dependent carrying capacity. The results of computer simulations were found to agree well with the analytical results (see Fig. 4 in [I]). In other words, rapid fluctuations around the carrying capacity in the model analysed were found to be irrelevant. This is true independently of how the carrying capacity changes in time, as long as the fluctuations around it remain small and rapid [I].

More importantly, for a given model of population-size history, by using the analytical result Eq. (3.10), it is possible to determine at which time scales of population-size fluctuations the effective population-size approximation is expected to be reached, as well as the time scales at which deviations from this approximation are severe. In other words, the analytical result allows for quantifying the so-called 'fast' and 'slow' time scales of population-size fluctuations at which the effective population-size approximation is expected to describe well the statistical properties of gene genealogies [32, 33, I].

Finally, note that models of piecewise constant population sizes allow for the nested integrals in Eq. (3.10) to be explicitly calculated. Such models include bottlenecks. Since bottlenecks are embedded in demographic histories of many biological populations [10, 11, 12, 13, 14] (see also [15, 16]) it is of interest to quantify neutral genetic variation under bottlenecks.

In the following sections, two examples of piecewise constant population sizes are analysed. In both examples, it is assumed that the population size may acquire two values, denoted by $N_0$ and $N_B = x N_0$ ($x < 1$). As above, it is assumed that the population size at the time of sampling ($t = 0$) is equal to $N_0$.

The remainder of this chapter is organised as follows. In Section 3.1 a model of a single bottleneck is considered, and in Section 3.2 a model of recurrent bottlenecks is considered.

3.1 Model of a single bottleneck

The model analysed in this section is illustrated in Fig. 3.2a. As Fig. 3.2a shows, going back in time, the population size passes through a bottleneck. The bottleneck starts at time $t_0$ after the time of sampling, and the duration of the bottleneck is denoted by $t_B$. Note that, as in the introduction of this chapter, time is measured backwards starting from the present, $t = 0$.

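The duration of the bottleneck and its relative population size $x$ determine the probability that a pair of lines entering the bottleneck coalesces before the end of the bottleneck. This probability is equal to $1 - e^{-t_B/x}$, and it increases as the parameter $s_B = t_B/x$ increases. In what follows, the parameter $s_B$ is called the strength of the bottleneck. Applying Eq. (3.10) to the model illustrated in Fig. 3.2a yields:

$$\langle T_n \rangle = 2 h_n - (1 - x) \sum_{j=2}^{n} \frac{d_{n;j}}{b_j} e^{-b_j t_0} \left( 1 - e^{-b_j s_B} \right) , \tag{3.11}$$

$$
\begin{aligned}
\langle T_n^2 \rangle = {} & 2 \sum_{i=3}^{n} \sum_{j=2}^{i-1} \frac{d_{n;ij}}{b_j} \Bigg( \frac{1 - x}{b_i - b_j} \Big( x e^{-(b_i t_0 + b_j s_B)} \big( 1 - e^{-(b_i - b_j) s_B} \big) \\
& - e^{-b_j t_0} \big( 1 - e^{-b_j s_B} \big) \big( 1 - e^{-(b_i - b_j) t_0} \big) \Big) + \frac{1 + (1 - x^2) \big( e^{-b_i (t_0 + s_B)} - e^{-b_i t_0} \big)}{b_i} \Bigg) \\
& + 2 \sum_{i=2}^{n} \frac{d_{n;ii}}{b_i} \Bigg( (1 - x) \Big( x s_B e^{-b_i (t_0 + s_B)} - t_0 e^{-b_i t_0} \big( 1 - e^{-b_i s_B} \big) \Big) \\
& + \frac{1 + (1 - x^2) \big( e^{-b_i (t_0 + s_B)} - e^{-b_i t_0} \big)}{b_i} \Bigg) .
\end{aligned} \tag{3.12}
$$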

Figure 3.2: Two models of piecewise constant population size. a Single bottleneck. The population size is equal to $N_0$ at the present time. Going backwards in time, after $t_0$ units of time, the population size reduces to $N_B$ (bottleneck). The duration of the bottleneck is equal to $t_B$. After this time, the population size restores to $N_0$. b Recurrent bottlenecks. The population size at present is $N_0$. The population size switches from $N_0$ to $N_B$ after an exponentially distributed time $E(\lambda)$ with parameter $\lambda$. Conversely, the population size jumps from $N_B$ to $N_0$ after an exponentially distributed time $E(\lambda_x)$ with parameter $\lambda_x$.

Here, $h_n = \sum_{i=1}^{n-1} i^{-1}$, and $\sum_{j=2}^{n} d_{n;j} b_j^{-1} = 2 h_n$ was used (as in [I]). In the following, Eq. (3.11) is analysed in three different cases. The first is the case of slow population-size fluctuations (formally, $t_0 \to \infty$ keeping $x$ constant), the second is the case of fast population-size fluctuations ($t_0 \to 0$ keeping $x$ constant), and the third is the case of severe reduction of population size during the bottleneck ($x \to 0$ keeping $t_0$ constant). Note that $t_B$ is set equal to $t_0$ below ($t_B = t_0$) in order to reduce the number of cases to be considered.

First, in the limit $t_0 \to \infty$ keeping $x$ constant, the MRCA of gene genealogies is expected to be found before the bottleneck. Therefore, the resulting gene genealogies are expected to agree with those obtained under a constant population size. This is indeed the case, as in this limit Eq. (3.11) reduces to:

$$\langle T_n \rangle \approx 2 h_n , \quad \text{for } t_0 \to \infty . \tag{3.13}$$

Second, in the limit $t_0 \to 0$ (keeping $x$ constant), the population size quickly changes from $N_0$ to $N_B$, and vice versa. Thus, going back in time, the population shortly after the present enters the stage in which its size is equal to $N_0$ and remains constant thereafter. In this limit, it is thus expected that the statistics of the total branch length agrees with that obtained in the constant population-size case. Indeed, in the limit $t_0 \to 0$, Eq. (3.11) becomes:

$$\langle T_n \rangle \approx 2 h_n - n (1 - x) \frac{t_0}{x} , \quad \text{for } t_0 \to 0 , \tag{3.14}$$

where the equality $\sum_{i=2}^{n} d_{n;i} = n$ was used. It follows from Eq. (3.14) that in this limit, the total branch length is effectively that of the constant population size, but corrected by a small negative value, $-O(t_0)$. This shift occurs because in the resulting gene genealogies, the parts with $k \approx n$ ancestral lines are typically shorter than the corresponding parts under the constant population size (since in this case the bottleneck is encountered).

Third, in the case of a severe reduction of population size, that is in the limit $x \to 0$ keeping $t_0$ constant, the MRCA of the resulting gene genealogies is expected to be found before the bottleneck ends (going backwards in time, the bottleneck ends at time $t_0 + t_B = 2 t_0$). This is because in the limit considered, the strength of the bottleneck is $s_B \to \infty$. In this limit, $\langle T_n \rangle$ satisfies:

$$\langle T_n \rangle \approx \sum_{i=2}^{n} \frac{d_{n;i}}{b_i} \left( 1 - e^{-b_i t_0} \right) + x \sum_{i=2}^{n} \frac{d_{n;i}}{b_i} e^{-b_i t_0} , \quad \text{for } x \to 0 . \tag{3.15}$$

[Figure 3.3, panels a and b: $\langle T_n \rangle$ and $\mathrm{var}(T_n)$ as functions of $t_0^{-1}$.]

Figure 3.3: Mean (a) and variance (b) of the total branch length under the model of a single bottleneck shown in Fig. 3.2a. Solid lines denote the analytical results, Eq. (3.11) and Eq. (3.12), for different values of $n$: $n = 5$ (red lines), $n = 10$ (blue lines), and $n = 50$ (green lines). Dashed lines show the approximation Eq. (3.13), for $t_0 \to \infty$, and dashed-dotted lines show the approximation Eq. (3.14), for $t_0 \to 0$. Remaining parameter used: $x = 0.1$.

Here, the factors $1 - e^{-b_i t_0}$, for $i = 2, \ldots, n$, describe the probability that a pair among $i$ ancestral lines coalesces before the bottleneck starts.
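The three limits can be checked explicitly for a sample of size $n = 2$. In this case $h_2 = 1$ and $b_2 = 1$, and the identities $\sum_j d_{n;j} b_j^{-1} = 2 h_n$ and $\sum_i d_{n;i} = n$ quoted above both give $d_{2;2} = 2$, so that Eq. (3.11) can be evaluated without the general coefficients from [I]. The sketch below is restricted to this special case; the function names are hypothetical.

```python
import math


def mean_T2_single_bottleneck(t0, tB, x):
    """Eq. (3.11) for n = 2, using h_2 = 1, b_2 = 1 and d_{2;2} = 2."""
    sB = tB / x  # strength of the bottleneck
    return 2.0 - (1.0 - x) * 2.0 * math.exp(-t0) * (1.0 - math.exp(-sB))


def small_t0_approximation(t0, x):
    """Eq. (3.14) for n = 2 and tB = t0."""
    return 2.0 - 2.0 * (1.0 - x) * t0 / x


def small_x_approximation(t0, x):
    """Eq. (3.15) for n = 2."""
    return 2.0 * (1.0 - math.exp(-t0)) + x * 2.0 * math.exp(-t0)
```

For large $t_0$ the first function approaches the constant-size value $2 h_2 = 2$, in agreement with Eq. (3.13), while for small $t_0$ and for small $x$ it approaches the two approximations, respectively.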

In Fig. 3.3 it is shown how the mean and the variance of the total branch length depend on $t_0$, for $x = 0.1$. Solid lines in panel a are for Eq. (3.11), and solid lines in panel b are for Eq. (3.12). In panel a, dashed lines indicate the limit $t_0 \to \infty$, and dashed-dotted lines indicate the limit $t_0 \to 0$. These results confirm that in the limit of both fast ($t_0 \to 0$) and slow ($t_0 \to \infty$) population-size fluctuations, gene genealogies under a single bottleneck can be approximated with those obtained under a constant effective population size, as expected. In these two cases the corresponding effective population size is equal to unity. The dashed and the dashed-dotted lines in panel a show how fast or slow this approximation breaks down for different values of $n$. As indicated in Eqs. (3.13)-(3.14), the threshold value of $t_0$ at which the approximation fails depends on $n$ much more severely in the case of fast fluctuations than in the case of slow fluctuations. The largest deviations from the effective population-size approximation are achieved for $t_0$ of the order of unity.


3.2 Model of recurrent bottlenecks

As in the previous section, it is assumed that a population size may acquire two values, $N_0$ and $N_B$ ($N_B < N_0$) (see Fig. 3.2b). In the model illustrated in Fig. 3.2b, the population size switches from $N_0$ to $N_B$ with rate $\lambda$, and from $N_B$ to $N_0$ with rate $\lambda_x$. Thus, the time between two successive bottlenecks is exponentially distributed with mean $\lambda^{-1}$, and the duration of a bottleneck is exponentially distributed with mean $\lambda_x^{-1}$. Note that two-locus gene genealogies under such a model of recurrent bottlenecks were analysed in [II]. In the notations used in [II], one has $\lambda_x = \lambda_B / x$. The results obtained in [II] are discussed in Chapter 5.

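In what follows a particular demographic history generated for given values of $\lambda$, $x$, and $\lambda_x$ is denoted by $D$, and the mean of the total branch length of sample size $n$ for a given $D$ is denoted by $\langle T_n | D \rangle$. Using Eq. (20) in [I], the first and second moment of $T_n$ averaged over independent histories having the same values of $\lambda$, $x$ and $\lambda_x$ turn out to be

$$\langle\langle T_n | D \rangle\rangle = \sum_{j=2}^{n} d_{n;j} \frac{x \lambda + x \lambda_x + b_j}{b_j (\lambda + x \lambda_x + b_j)} , \tag{3.16}$$

$$\langle\langle T_n^2 | D \rangle\rangle = 2 \sum_{i=2}^{n} \sum_{j=2}^{i} d_{n;ij} \frac{b_i b_j + x \lambda (b_i + x b_j) + x \lambda_x (b_i + b_j) + x^2 (\lambda + \lambda_x)^2}{b_i b_j (\lambda + x \lambda_x + b_i)(\lambda + x \lambda_x + b_j)} . \tag{3.17}$$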

Note that in Eqs. (3.16)-(3.17) averaging is performed first over random gene genealogies conditional on $D$, and then over randomly generated demographic histories with the given parameters $\lambda$, $x$ and $\lambda_x$.

In the following, Eq. (3.16) is analysed in the limit of slow ($\lambda \to 0$) and fast population-size fluctuations ($\lambda \to \infty$) and it is assumed that $\lambda = \lambda_x$ in order to reduce the number of cases to be treated.

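First, in the limit of slow fluctuations, $\lambda \to 0$, one finds:

$$\langle\langle T_n | D \rangle\rangle \approx 2 h_n - \lambda (1 - x) \sum_{j=2}^{n} \frac{d_{n;j}}{b_j^2} , \quad \text{for } \lambda \to 0 . \tag{3.18}$$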

As expected, for $\lambda = 0$, $\langle\langle T_n | D \rangle\rangle$ corresponds to the expected total branch length obtained under a constant population size ($2 h_n$). The term $-O(\lambda)$ stands for the correction to the constant population-size case. The minus sign appears because the upper parts of the resulting gene genealogies (close to the MRCA) are affected by a reduced population size.

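Conversely, taking the limit $\lambda \to \infty$ (fast fluctuations) in Eq. (3.16), it can be found

$$\langle\langle T_n | D \rangle\rangle \approx 2 h_n c + n \frac{1 - x}{(1 + x)^2} \lambda^{-1} , \quad \text{for } \lambda \to \infty . \tag{3.19}$$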

Here $c = 2x/(1 + x)$ is equal to the harmonic mean of temporal population sizes given by Eq. (2.10). In other words, the first term $2 h_n c$ corresponds to the expectation of the total branch length under an effective population size $c$. The second term describes the correction to the effective population-size approximation obtained for large but finite frequencies of population-size fluctuations.

More importantly, Eqs. (3.18)-(3.19) show the corrections to the effective population-size approximation. These results make it possible to estimate at which values of $\lambda$ the approximation breaks down.
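As an illustration, the exact result (3.16) and the two approximations (3.18)-(3.19) can be compared for a sample of size $n = 2$, for which $b_2 = 1$ and the identities $\sum_j d_{n;j} b_j^{-1} = 2 h_n$ and $\sum_i d_{n;i} = n$ give $d_{2;2} = 2$. The sketch below is limited to this special case and uses hypothetical function names.

```python
def mean_T2_recurrent(lam, lam_x, x):
    """Eq. (3.16) for n = 2, using b_2 = 1 and d_{2;2} = 2."""
    return 2.0 * (x * lam + x * lam_x + 1.0) / (lam + x * lam_x + 1.0)


def slow_limit(lam, x):
    """Eq. (3.18) for n = 2 (lam_x = lam, lam -> 0)."""
    return 2.0 - 2.0 * lam * (1.0 - x)


def fast_limit(lam, x):
    """Eq. (3.19) for n = 2 (lam_x = lam, lam -> infinity)."""
    c = 2.0 * x / (1.0 + x)  # harmonic-mean effective population size
    return 2.0 * c + 2.0 * (1.0 - x) / ((1.0 + x) ** 2 * lam)
```

Evaluating `mean_T2_recurrent(lam, lam, x)` on a grid of `lam` values and comparing it with the two limits gives an estimate of the range of $\lambda$ in which neither approximation is accurate.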

The dependence on $\lambda$ of the mean and the variance of the total branch length under a model of recurrent bottlenecks is shown in Fig. 3.4. Solid lines denote Eq. (3.16) for different sample sizes (see the caption of the figure for details about the parameters used). Dashed lines show the approximation (3.18) for slow population-size fluctuations, and dashed-dotted lines depict the approximation (3.19) for fast population-size fluctuations. It can be noticed that, as in the case of a single bottleneck discussed in the previous section, the threshold value of $\lambda$ at which the effective population-size approximation breaks down strongly depends on $n$ for fast fluctuations. By contrast, such a dependence on $n$ turns out to be negligible in the case of slow fluctuations.

Finally, it should be emphasised that in the case of a varying population size, moments of an observable $A_n$ of the form $A_n = \sum_{i=2}^{n} a_i \tau_i$ with given coefficients $a_i$ are determined by the results of [I]. As shown in Subsection 2.4.1, observables of this kind include $S_n$, the total number of segregating sites in a sample of size $n$, as well as $\xi_i$ ($i = 1, \ldots, n - 1$), the frequency spectrum of SNPs. These observables are commonly used for testing genetic variation in natural populations against neutral mutations (see, for example, [53, 54, 55, 49, 56]), as well as for inferring an unknown demographic history of a given population [15, 16]. This is discussed in the next chapter.

[Figure 3.4, panels a and b: $\langle T_n \rangle$ and $\mathrm{var}(T_n)$ as functions of $\lambda$.]

Figure 3.4: Mean (a) and variance (b) of the total branch length under the model of recurrent bottlenecks shown in Fig. 3.2b. Solid lines denote the analytical results, Eq. (3.16) and Eq. (3.17), for different values of $n$: $n = 5$ (red lines), $n = 10$ (blue lines), and $n = 50$ (green lines). Dashed lines show the approximation Eq. (3.18), for $\lambda \to 0$, and dashed-dotted lines show the approximation Eq. (3.19), for $\lambda \to \infty$. Remaining parameter used: $x = 0.1$.


4 Frequency spectrum of SNPs

The frequency spectrum of neutral SNPs depends on the parameters of the demographic history of a given population, as explained in Subsection 2.4.1. By computing the moments of SNP counts under a given demographic history, one can compare the results obtained against the empirically observed SNPs. This makes it possible to infer the parameters of the past population-size history of a given population [15].

This chapter is organised as follows. Section 4.1 discusses how a single bottleneck modifies the frequency spectrum of SNPs expected under neutral mutations in the Wright-Fisher population of constant size. In Section 4.2 human genome-wide SNP counts, taken from the 1000 Genomes Project [57], are analysed. Assuming that each population experienced two population-size changes in its past (see Fig. 4.3), the spectra are used to infer the corresponding parameters of the model. The resulting fits of the underlying population-size histories are shown at the end of this chapter. The results of this chapter are complemented by Appendix D, which discusses how the frequency spectrum of SNPs affects Tajima’s test [53], Fu and Li’s test [54], and the singleton-exclusive version of Tajima’s test with no out-group available [55], all three being built upon the standard null-model.

4.1 Frequency spectrum of SNPs under a single bottleneck

Consider a haploid population with demographic history represented by the model shown in Fig. 3.2a (the model given in Fig. 4.3 reduces to the model in Fig. 3.2a in the case $x < 1$, $x' = 1$). Let $n$ genetic sequences at a given locus be sampled from this population. Assume that mutations accumulated along the locus are neutral, and may be modeled by the infinite-sites model. The mutation probability per sequence per generation is denoted by $\mu$, and the scaled mutation rate is defined as $\theta = 2\mu N_0$. As in Chapter 3, $N_0$ is the population size at the present, and it serves as a time-rescaling factor. Note that the infinite-sites mutation model may be applied if the locus sampled is very large. In what follows it is assumed that all sites within the locus share the same ancestral history. In other words, no recombination within the locus is allowed.

[Figure 4.1, panels a to i. Horizontal axes: $i$ (1 to 49); vertical axes: $i\langle\xi_i\rangle$.]

Figure 4.1: $i\langle\xi_i\rangle$ for $s_B = 100$. The black solid lines denote the corresponding analytical results. The dashed lines show the result expected in the constant population-size case. The population size within the bottleneck differs between panels: $x = 5 \cdot 10^{-3}$ in a, d, and g, $x = 5 \cdot 10^{-2}$ in b, e, and h, $x = 0.5$ in c, f, and i. The time of sampling $t_0$ after the bottleneck is set to: $t_0 = 10^{-2}$ in panels a, b, and c, $t_0 = 10^{-1}$ in d, e, and f, and $t_0 = 1$ in g, h, and i. Remaining parameters used: $n = 50$, $\theta = 10$.

Under the given assumptions, the first two moments of the frequency spectrum $\xi_i$ ($i = 1, \ldots, n-1$) for an arbitrary demographic history can be computed using Eqs. (2.16)-(2.17), combined with Eqs. (3.5), (3.8), (3.9) (see also [41, 49, I]). For the model of a single bottleneck shown in Fig. 3.2a, this procedure yields Eqs. (D.1)-(D.4) in Appendix D. The corresponding analytical results for the first moment of $\xi_i$ are shown here in Figs. 4.1-4.2.
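The first moment can be illustrated with a short sketch. It uses a standard coalescent identity, $\langle\xi_i\rangle = (\theta/2) \sum_{k=2}^{n-i+1} k\, p(k,i)\, \langle\tau_k\rangle$ with $p(k,i) = \binom{n-i-1}{k-2} / \binom{n-1}{k-1}$, which is assumed here to be equivalent to Eqs. (2.16)-(2.17) (those equations are not reproduced in this section). The mean times $\langle\tau_k\rangle$ during which the genealogy has $k$ lines are supplied by the caller; in the example they are the constant-size values $2/(k(k-1))$, so the bottleneck expressions of Appendix D are not implemented. The function name `expected_sfs` is illustrative.

```python
from math import comb

def expected_sfs(mean_tau, n, theta):
    """Return [<xi_1>, ..., <xi_{n-1}>] given mean_tau[k] = <tau_k>, k = 2..n.

    p(k, i) is the probability that a line present while the genealogy has
    k lines has exactly i descendants in the sample of size n.
    """
    sfs = []
    for i in range(1, n):
        total = 0.0
        for k in range(2, n - i + 2):
            p = comb(n - i - 1, k - 2) / comb(n - 1, k - 1)
            total += k * p * mean_tau[k]
        sfs.append(0.5 * theta * total)
    return sfs

n, theta = 50, 10.0
# Constant population size, time in units of N0 generations.
tau_const = {k: 2.0 / (k * (k - 1)) for k in range(2, n + 1)}
sfs = expected_sfs(tau_const, n, theta)
# i * <xi_i>: flat and equal to theta for a constant population size.
scaled = [i * value for i, value in enumerate(sfs, start=1)]
```

Replacing `tau_const` by mean times computed for another demographic history would give the corresponding non-flat curve, since the combinatorial factor $p(k,i)$ does not depend on the population-size history.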

[Figure 4.2, panels a to i. Horizontal axes: $i$ (1 to 49); vertical axes: $i\langle\xi_i\rangle$.]

Figure 4.2: Same as in Fig. 4.1 but for $s_B = 2$.

Figs. 4.1-4.2 show $i\langle\xi_i\rangle$ for different values of the parameters $x$, $t_B$, and $t_0$. Recall that in a population of constant size, $i\langle\xi_i\rangle = \theta$, in other words $i\langle\xi_i\rangle$ is flat [41] (see Subsection 2.4.1). The results in Figs. 4.1-4.2 illustrate how the shape of $i\langle\xi_i\rangle$ under a bottleneck differs from that expected under a constant population size. A close inspection of Figs. 4.1-4.2 reveals that the shape of $i\langle\xi_i\rangle$ does not change when $x$ is varied while $s_B = t_B/x$ and $t_0$ are kept fixed (compare the panels belonging to the same row in Fig. 4.1, as well as the panels in the same row in Fig. 4.2). In other words, the shape of $i\langle\xi_i\rangle$ is qualitatively determined only by $s_B$ and $t_0$. In what follows the effect of $s_B$ on the shape of $i\langle\xi_i\rangle$ is discussed first. Thereafter, the effect of $t_0$ is discussed.

The results shown in Figs. 4.1-4.2 differ by the strength of the underlying bottleneck: the bottleneck assumed in Fig. 4.1 is much stronger than the bottleneck assumed in Fig. 4.2 (as explained in Chapter 3, the stronger the bottleneck is, the larger is the probability of coalescence within the bottleneck). Consider the case of a very strong bottleneck, formally described by $s_B \to \infty$. In this case it is to be expected that the resulting frequency spectrum of SNPs at high frequencies resembles the frequency spectrum within the Wright-Fisher population of constant size $N_B$. Conversely, the number of SNPs at low frequencies is in this case expected to be enhanced in comparison to that within the Wright-Fisher population of constant size $N_B$. The latter follows from the fact that the parts of gene genealogies with $k \approx n$ ancestral lines can be elongated due to the time spent in the high population-size regime, $t_0$.

Now, assume that a bottleneck is of intermediate strength. Such a bottleneck may or may not host coalescences. Unlike in the case of a strong bottleneck, some ancestral lines may survive through such a bottleneck, and hence the parts of gene genealogies with $k \approx 2$ ancestral lines typically become longer than the corresponding branches in the Wright-Fisher population of constant size $N_B$. These branches are mainly responsible for accumulating the SNPs of high frequencies. Therefore, when a bottleneck is of intermediate strength, one expects the number of SNPs at high frequencies to be larger than in the case of a strong bottleneck. Indeed, this may be seen by comparing the corresponding panels in Figs. 4.1-4.2, each panel corresponding to fixed values of $x$ and $t_0$.

Finally, a weak bottleneck ($s_B \to 0$) is expected not to alter the shape of $i\langle\xi_i\rangle$ significantly.

The effect of changing the parameter $t_0$ may be seen by comparing the panels within a single column in Fig. 4.1 for $s_B = 100$, as well as in Fig. 4.2 for $s_B = 2$. Note that $t_0$ corresponds to the strength of the initial high population-size regime: the larger $t_0$ is, the more coalescences occur before the bottleneck. It is to be expected that for $t_0 \gg 1$, the effect of the bottleneck becomes negligible.

It is interesting to quantify at which values of $t_0$ one expects to detect signatures of a bottleneck, as well as the values of $s_B$ corresponding to strong and those corresponding to weak bottlenecks. By varying these parameters, it can be noticed (not shown) that the spectrum of SNPs does not change significantly when $s_B$ is increased above $s_B \approx 5$. In what follows, therefore, the bottlenecks with $s_B \geq 5$ are treated as being strong. Note that this also implies that $t_0 \geq 5$ is treated as a sufficiently long time so that the effect of the past bottleneck may be neglected. Conversely, the bottlenecks with $s_B \leq 0.1$ are treated as being weak.
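A rough way to make these thresholds concrete is to look at a single pair of lines. The sketch below assumes the usual coalescent approximation in which, in time scaled by $N_0$, a pair coalesces at rate 1 outside the bottleneck and at rate $1/x$ inside it. The probability that a pair entering the bottleneck coalesces within it is then $1 - e^{-s_B}$, and the probability that a sampled pair reaches the bottleneck without coalescing is $e^{-t_0}$. The function names are illustrative and the calculation is only a heuristic for the thresholds quoted above, not the procedure by which they were obtained.

```python
import math

def p_coalesce_in_bottleneck(t_B, x):
    """Probability that two lines entering the bottleneck coalesce inside it."""
    s_B = t_B / x  # bottleneck strength
    return 1.0 - math.exp(-s_B)

def p_pair_reaches_bottleneck(t_0):
    """Probability that two sampled lines have not coalesced during time t_0."""
    return math.exp(-t_0)

# The same strength s_B = 2 with different x gives the same probability.
p_a = p_coalesce_in_bottleneck(t_B=0.01, x=0.005)
p_b = p_coalesce_in_bottleneck(t_B=1.0, x=0.5)

strong = p_coalesce_in_bottleneck(t_B=0.5, x=0.1)   # s_B = 5
weak = p_coalesce_in_bottleneck(t_B=0.01, x=0.1)    # s_B = 0.1
late = p_pair_reaches_bottleneck(5.0)               # t_0 = 5
```

Under these assumptions a pair almost surely coalesces in a bottleneck with $s_B = 5$ and rarely does so for $s_B = 0.1$, while for $t_0 = 5$ a sampled pair very rarely survives long enough to reach the bottleneck at all.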

Figs. 4.1-4.2 show how the values of $i\langle\xi_i\rangle$, $i = 1, \ldots, n-1$, under a bottleneck differ from those expected under a constant population size. Note, however, that $\xi_i$ can only be observed if an out-group (ancestral sequence) is available. If the ancestral sequence is not available, one makes use of the folded frequency spectrum

$$\eta_i = \xi_i + \xi_{n-i} . \tag{4.1}$$

For a population of constant size, one finds [41]

$$\langle\eta_i\rangle = \theta \left( \frac{1}{i} + \frac{1}{n-i} \right) . \tag{4.2}$$

Therefore, in this case $\langle\eta_i\rangle \left( i^{-1} + (n-i)^{-1} \right)^{-1}$ is flat.
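The folding in Eq. (4.1) and the flatness implied by Eq. (4.2) can be checked with a few lines of code. This is a minimal sketch with illustrative function names; it applies Eq. (4.1) literally for every $i = 1, \ldots, n-1$, so for even $n$ the middle class $i = n/2$ is counted as $2\xi_{n/2}$ (other conventions for the middle class exist and are not used here).

```python
def fold_spectrum(xi):
    """Fold xi = [xi_1, ..., xi_{n-1}] into eta_i = xi_i + xi_{n-i}, Eq. (4.1)."""
    n = len(xi) + 1
    return [xi[i - 1] + xi[n - i - 1] for i in range(1, n)]

def rescale_folded(eta):
    """Return eta_i / (1/i + 1/(n-i)), which is flat for a constant size."""
    n = len(eta) + 1
    return [e / (1.0 / i + 1.0 / (n - i)) for i, e in enumerate(eta, start=1)]

n, theta = 50, 10.0
xi_const = [theta / i for i in range(1, n)]   # constant size: <xi_i> = theta / i
eta_const = fold_spectrum(xi_const)
flat = rescale_folded(eta_const)              # every entry equals theta
```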

[Figure 4.3: schematic of the population size between the present and the past, with the labels $N_0$, $t_0$, $N_B$, $t_B$, and $N'$.]

Figure 4.3: Model of a single bottleneck. The population size at the present time is $N_0$. The time to the bottleneck is $t_0$, and the duration of the bottleneck is $t_B$. The population size during the bottleneck is $N_B$, and that before the bottleneck is $N'$.

In the following, the results of this section are compared to related results found in the literature.

First, an expression for $\langle\xi_i\rangle$ under a piecewise constant population size was derived in [15] (see Eq. (1) in [15]). The result given here is consistent with Eq. (1) in [15].

Second, in [16] it was discussed how $\langle\eta_i\rangle \left( i^{-1} + (n-i)^{-1} \right)^{-1}$ depends on the parameters $t_0$, $x$, and $t_B$. Note, however, that the results shown here imply that the shape of the frequency spectra of SNPs (both of $\xi_i$ and of $\eta_i$) can be explained in terms of only two parameters: the strength of the bottleneck $s_B$, and the time of sampling after the bottleneck $t_0$.

The effect of a single bottleneck on Tajima’s test, Fu and Li’s test, and the singleton-exclusive version of Tajima’s test with no out-group available, assuming that these tests are built upon the standard null-model, is discussed in Appendix D.

As pointed out in [15, 16], the frequency spectrum of SNPs may be used to infer the demographic history of a given population. In the next section the results of the fits of histories of twelve sampled human populations are shown.

4.2 Fitting human histories using SNPs

This section demonstrates how the frequency spectrum of neutral SNPs can be used to infer the demographic history of a given population. The results given below were obtained in collaboration with groups in Cologne and in Cambridge [58]. The frequency spectra of SNPs upon which the calculations were done were extracted by Alexander Klassmann from the 1000 Genomes Project [57].

[Figure 4.4, panels a to c. Horizontal axes: $i$ (1 to 119); vertical axes: $\zeta_{i,1}$ (a), $\zeta_{i,2}$ (b), $\zeta_{i,3}$ (c).]

Figure 4.4: Normalised frequency spectra of the population ASW. The circles show the experimentally obtained spectrum. Solid lines show the result of the best fit for: a the entire spectrum, b the spectrum without the singletons, and c the spectrum without both the singletons and the doubletons.

[Figure 4.5, panels a to c. Horizontal axes: $i$ (1 to 119); vertical axes: $\zeta_{i,1}$ (a), $\zeta_{i,2}$ (b), $\zeta_{i,3}$ (c).]

Figure 4.5: Same as in Fig. 4.4 but for the population TSI.

Recent theoretical studies (see [15, 16] and references therein) suggest that using the frequency spectra of SNPs it is possible to differentiate the human populations with African ancestry from the populations with European, or with Asian ancestry. The 1000 Genomes Project provides high-quality data on genome-wide variation in a number of human populations. This makes it possible to compare genetic variation between populations with the same ancestry, as well as to compare genetic variation between populations with different ancestry. The spectra of the following populations are used below:

• with African ancestry: ASW (African ancestry in Southwest US), YRI (Yoruba in Ibadan, Nigeria), LWK (Luhya in Webuye, Kenya),

• with European ancestry: CEU (Utah with Northern and Eastern European ancestry), TSI (Toscani in Italia), GBR (Great Britain), FIN (Finland),

• with East Asian ancestry: CHB (China), CHS (Han Chinese South), JPT (Japan),

• with Americas ancestry: CLM (Colombia), MXL (Mexican ancestry in Los Angeles).

[Figure 4.6, panels a to c. Horizontal axes: years; vertical axes: $N_{\mathrm{ASW}}$, $N_{\mathrm{YRI}}$, $N_{\mathrm{LWK}}$.]

Figure 4.6: Optimal histories found for: a ASW, b YRI, and c LWK. Results of fitting of the entire spectra (blue), spectra without the singletons (red), and spectra without both the singletons and the doubletons (green). In panel c, the red line coincides with the green line. For the explanation of the fitting procedure, see the main text.

The frequency spectra of SNPs in these human populations are based on direct comparisons of intergenic regions in the human genome to the corresponding regions in the chimpanzee genome, the latter being taken as the ancestral genome [57]. The intergenic sequences were chosen here since they are presumably neutral [59]. The sample size used is $n = 120$.

It is to be expected that empirical frequency spectra are affected by sequencing errors. In order to avoid the effect of sequencing errors in the ancestral genome, the folded spectrum $\eta_i$ must be used instead of the unfolded frequency spectrum $\xi_i$. The fits shown below are based on $\eta_i$.

The collected data are used to fit the unknown human population-size histories. Motivated by the human out-of-Africa scenario [11, 12], in the analysis below it is assumed that all populations experienced two population-size changes in their history. The model is shown in Fig. 4.3. Note that humans are diploid. In order to retain the notation used in the previous chapters, the population size in Fig. 4.3 stands for the effective number of chromosomes in a population. (Recall that in a population with effective size $N_e$ the effective number of chromosomes is $2N_e$.)

As Fig. 4.3 shows, each population-size history has five unknown parameters to be determined: the population size at the present time $N_0$, the number of generations to the first change of population size $\ell_0 = \lfloor N_0 t_0 \rfloor$, the population size during the bottleneck $N_B = xN_0$, the duration of the bottleneck $\ell_B = \lfloor t_B N_0 \rfloor$, and the ancestral population size $N' = x'N_0$. (Note that for $x < 1$, $x' = 1$, this model reduces to the model of a bottleneck shown in Fig. 3.2a.) The five parameters listed can be fitted separately for different populations.
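The five-parameter model can be written down as a small data structure. The sketch below is illustrative only: the class name `TwoChangeHistory`, its method names, and the numerical values in the example are assumptions made for this illustration and are not fitted values from this chapter.

```python
from dataclasses import dataclass
from math import floor

@dataclass
class TwoChangeHistory:
    N0: float       # present number of chromosomes
    t0: float       # scaled time to the first size change
    tB: float       # scaled duration of the bottleneck
    x: float        # N_B / N0
    x_prime: float  # N' / N0

    def generations(self):
        """Return (l0, lB), the two epoch lengths in generations."""
        return floor(self.N0 * self.t0), floor(self.tB * self.N0)

    def size_at(self, generation):
        """Population size `generation` generations before the present."""
        l0, lB = self.generations()
        if generation < l0:
            return self.N0
        if generation < l0 + lB:
            return self.x * self.N0
        return self.x_prime * self.N0

# Made-up example: a tenfold bottleneck with x' = 1 (the model of Fig. 3.2a).
h = TwoChangeHistory(N0=20000, t0=0.25, tB=0.5, x=0.1, x_prime=1.0)
```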

However, another unknown parameter is the mutation rate $\mu'$ per site per generation. It is common (see [15] and references therein) to assume that $\mu'$ is constant along the genome, and across different human populations. In order to disentangle the SNP data observed in different populations, the following reasoning is used.

By expressing $\eta_i$ using Eq. (2.16), one finds that $\eta_i \propto \theta/2$, where $\theta = 2N_0\mu' L$, and $L$ is the number of the sites scanned (the same for all populations). In fact, $\eta_i$ depends on the parameters $\mu'$ and $N_0$ only through $\theta$. It follows that in order to fit separately the demographic histories of the populations sampled, the following quantity is to be considered:

$$\zeta_{i,1} = \frac{\theta_i}{\sum_{k=1}^{\lceil (n-1)/2 \rceil} \theta_k} , \quad \text{for } i = 1, \ldots, \lceil (n-1)/2 \rceil . \tag{4.3}$$

Here

$$\theta_i = \frac{\eta_i}{\frac{1}{i} + \frac{1}{n-i}} \tag{4.4}$$

is the so-called $i$th estimate of $\theta$ [16] (as explained in the previous section, for a population of constant size, one has $\langle\theta_i\rangle = \theta$). In the following, $\zeta_{i,1}$, $i = 1, \ldots, \lceil (n-1)/2 \rceil$, are termed the normalised frequency spectrum. Note that $\zeta_{i,1}$ does not depend on $\theta$, but only on the parameters $t_0$, $t_B$, $x$, and $x'$. In other words, one can fit the normalised spectrum $\zeta_{i,1}$ using the analytically computed $\zeta^{(an)}_{i,1}$ for the mutation rate $\theta^{(an)}/2 = 1$. Under this setting, the value of $\theta$ that corresponds to the fitting parameters $t_0$, $t_B$, $x$, and $x'$ is

$$\frac{\theta}{2} = \frac{\sum_i \theta_i}{\sum_j \theta^{(an)}_j} , \tag{4.5}$$

where

$$\theta^{(an)}_j = \frac{\eta^{(an)}_j}{\frac{1}{j} + \frac{1}{n-j}} . \tag{4.6}$$

Here $\eta^{(an)}_j$ is the analytically computed folded spectrum for the parameters $t_0$, $t_B$, $x$, $x'$, assuming that the scaled mutation rate $\theta^{(an)}$ satisfies $\theta^{(an)}/2 = 1$. Using this procedure, the population size $N_0$ for a given population remains a free parameter.

[Figure 4.7, panels a to d. Horizontal axes: years; vertical axes: $N_{\mathrm{TSI}}$, $N_{\mathrm{CEU}}$, $N_{\mathrm{GBR}}$, $N_{\mathrm{FIN}}$.]

Figure 4.7: Optimal histories found for: a TSI, b CEU, c GBR, and d FIN. Results of fitting of the entire spectra (blue), spectra without the singletons (red), and spectra without both the singletons and the doubletons (green). The red line coincides with the green line in panels a-c.
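The normalisation in Eqs. (4.3)-(4.6) can be sketched in a few lines. The function names are illustrative, both sums in Eq. (4.5) are assumed to run over $i, j = 1, \ldots, \lceil (n-1)/2 \rceil$, and the example uses the constant-size folded spectrum of Eq. (4.2) as a stand-in for both the empirical and the analytical spectrum, simply to check that the procedure returns the value of $\theta/2$ that was put in.

```python
from math import ceil

def theta_i(eta, n):
    """Eqs. (4.4)/(4.6): eta_i / (1/i + 1/(n-i)) for i = 1..ceil((n-1)/2)."""
    m = ceil((n - 1) / 2)
    return [eta[i - 1] / (1.0 / i + 1.0 / (n - i)) for i in range(1, m + 1)]

def normalised_spectrum(eta, n):
    """Eq. (4.3): zeta_{i,1} = theta_i / sum_k theta_k."""
    th = theta_i(eta, n)
    total = sum(th)
    return [t / total for t in th]

def half_theta(eta_emp, eta_an, n):
    """Eq. (4.5): theta/2, with eta_an computed for theta_an / 2 = 1."""
    return sum(theta_i(eta_emp, n)) / sum(theta_i(eta_an, n))

n = 120
weights = [1.0 / i + 1.0 / (n - i) for i in range(1, n)]
eta_emp = [10.0 * w for w in weights]   # stand-in data with theta = 10
eta_an = [2.0 * w for w in weights]     # analytical spectrum, theta_an / 2 = 1
zeta = normalised_spectrum(eta_emp, n)
estimate = half_theta(eta_emp, eta_an, n)   # recovers theta / 2 = 5
```

Because $\zeta_{i,1}$ is a ratio, any common factor in $\eta_i$ cancels, which is why the shape parameters $t_0$, $t_B$, $x$, $x'$ can be fitted before $\theta$ is determined.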

Two examples of how $\zeta_{i,1}$, $i = 1, \ldots, n-1$, is retrieved by the fitting procedure described above are shown in Figs. 4.4a, 4.5a. The trial spectra are computed for $10^7$ random sets of the unknown parameters $t_0$, $t_B$, $x$, and $x'$. For a given trial set, the logarithm of each parameter is sampled randomly from a wide range $[-4, 2]$. Note that it was proposed in [16, 60] that low-frequency mutations should not be considered since they are expected to be most sensitive to sequencing errors. Therefore, the fits are performed in three variants. The first takes into account the entire empirical spectra $\zeta_{i,1}$. The second is based on the spectra without the singletons (mutations of type one [41]), and it is denoted by $\zeta_{i,2}$. Note that $\zeta_{i,2}$ is defined for $2 \leq i \leq \lceil (n-1)/2 \rceil$. The third variant is based on the spectra without both the singletons and the doubletons ($\eta_1$, $\eta_2$), and it is denoted by $\zeta_{i,3}$.
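The random search over trial parameter sets can be sketched as follows. Several details are assumptions made for this illustration and are not stated in the text: the logarithm is taken to be base 10, the sampling is taken to be uniform on $[-4, 2]$, and the goodness of fit is taken to be the sum of squared deviations. The argument `model` is a hypothetical user-supplied function returning the analytical normalised spectrum for given $t_0$, $t_B$, $x$, $x'$; the toy model below only exercises the search and has no genetic meaning.

```python
import random

def random_search(zeta_emp, model, n_trials, seed=0):
    """Return (best_params, best_score) over random trial parameter sets."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        # log10 of each parameter is uniform on [-4, 2] (assumed base and law).
        params = tuple(10.0 ** rng.uniform(-4.0, 2.0) for _ in range(4))
        trial = model(*params)
        # Sum of squared deviations (assumed objective).
        score = sum((a - b) ** 2 for a, b in zip(zeta_emp, trial))
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy stand-in for the analytical spectrum, used only to exercise the search.
def toy_model(t0, tB, x, x_prime):
    total = t0 + tB + x + x_prime
    return [t0 / total, tB / total, x / total, x_prime / total]

target = toy_model(0.1, 1.0, 0.05, 1.0)
best_params, best_score = random_search(target, toy_model, n_trials=200)
```

Dropping the singletons or the doubletons, as in the second and third variants, would amount to passing correspondingly truncated and renormalised spectra to the same search.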

[Figure 4.8, panels a to c. Horizontal axes: years; vertical axes: population sizes $N_{\mathrm{CHB}}$, $N_{\mathrm{CHS}}$, $N_{\mathrm{JPT}}$.]

Figure 4.8: Optimal histories found for: a CHB, b CHS and c JPT. Results of fitting of the entire spectra (blue), spectra without the singletons (red), and spectra without both the singletons and the doubletons (green).

The optimal population-size histories found are given in Figs. 4.6-4.9. By comparing for each population the three optimal histories found, it can be observed that the populations TSI and GBR are the least sensitive to low-frequency mutations. In the results shown, TSI is used as a reference population, and the population size of TSI at the present time is set to $N_0 = 20000$. This corresponds to assuming that the effective population size of this population is equal to 10000 (see [15]). Using Eq. (4.5) one can find the scaled mutation rate $\theta^{(\mathrm{TSI})}$ for this population, and compute the corresponding value of $\mu' L$. As mentioned above, the latter is assumed to be the same for all populations. The estimated value of $\mu' L$ is used to compute the value of $N_0$ for the remaining populations. In all populations, it is assumed that one generation covers a period of 25 years.
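The calibration against the reference population amounts to simple arithmetic on $\theta = 2N_0\mu' L$. The sketch below uses made-up values of $\theta$ purely for illustration (they are not results of this chapter); the function names are likewise illustrative.

```python
from math import floor

GENERATION_YEARS = 25   # years per generation, as assumed in the text
N0_TSI = 20000          # present number of chromosomes in TSI (reference)

def mu_L_from_reference(theta_ref, N0_ref=N0_TSI):
    """theta = 2 * N0 * mu' * L  =>  mu' * L = theta / (2 * N0)."""
    return theta_ref / (2.0 * N0_ref)

def present_size(theta_pop, mu_L):
    """N0 of another population from its theta and the shared mu' * L."""
    return theta_pop / (2.0 * mu_L)

def scaled_time_to_years(t, N0):
    """Convert scaled time t to years via floor(N0 * t) generations."""
    return floor(N0 * t) * GENERATION_YEARS

# Made-up theta values, used only to show the arithmetic.
mu_L = mu_L_from_reference(theta_ref=400.0)
N0_other = present_size(theta_pop=600.0, mu_L=mu_L)
years = scaled_time_to_years(0.25, N0_TSI)
```

In this example a population whose fitted $\theta$ is 1.5 times that of the reference is assigned a present size 1.5 times larger, and a scaled time of 0.25 in the reference population corresponds to 5000 generations.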

Figs. 4.6-4.9 show that the optimal histories obtained are typically sensitive to low-frequency mutations. Recall that the results shown are based on the folded frequency spectrum $\eta_i = \xi_i + \xi_{n-i}$. According to the results given in Section 4.1 it can be deduced that the sensitivity observed is expected in at least the following two cases. First, it is expected under histories involving a recent population expansion, or a recent sufficiently strong bottleneck. In these cases, as Figs. 4.1-4.2 indicate, $\xi_i$ for small values of $i$ are expected to be larger than $\xi_i$ obtained in a population of size equal to the size within a bottleneck. The more recent the expansion is, the lower are the SNP frequencies that become enhanced. Second, $\xi_i$ for high values of $i$ are mainly determined by the branch lengths of the gene genealogy close to the MRCA. In the case that the population size declined far in the past (the past population size is larger than the present population size), the resulting $\xi_i$ for high values of $i$ are expected to be larger than $\xi_i$ obtained in a population of size equal to the size at the present time. If the decline occurred very far in the past, it is possible that only $\xi_{n-1}$ is affected by it (if at all).

In both cases mentioned above, it is to be expected that low-frequency mutations ($\eta_i$ where $i$ is small) carry abundant information about the past demography. This calls for caution regarding sequencing errors. However, it also implies that by omitting low-frequency mutations, significant information about the past demography can be lost. Note that it was suggested in [55, 16] that by omitting low-frequency mutations, one avoids most sequencing errors.

By comparing the optimal histories across different populations, a remarkable distinction between the African and non-African populations can be noticed: the bottleneck signature is evident in all non-African populations, but not in the African ones. The histories shown suggest that a recent population growth occurred in non-African populations 5000 to 20000 years ago. This growth may mimic the population growth due to agricultural advances. A population-size decline that occurred 80000 to 150000 years ago may mimic the founding of the non-African populations according to the out-of-Africa scenario [11, 12]. Note that, unlike in [15] where the ancestral population size was fixed to 20000 in all populations, the ancestral population size was treated here as a free parameter. Still, the results in Figs. 4.6-4.9 indicate that the ancestral population size is around 15000 to 20000 for all populations, except for ASW and LWK. For the latter two, the ancestral population sizes are found to be very small. However, note that these occur in a distant past. It turns out that the MRCA of a sample is typically reached before this time (going backwards in time), and thus the population size in such a far past cannot be reliably determined by the fitting procedure described.

In summary, the results obtained here are in agreement with the out-of-Africa scenario. However, they also stress the importance of low-frequency mutations. This is particularly important as it puts into question the validity of studies in which low-frequency mutations are omitted.

Two studies [15, 16] are closely related to the study presented in this section. In both studies it was argued that populations with African ancestry are likely to have experienced a recent population growth, whereas populations with European and with Asian ancestry are likely to have experienced a bottleneck. The conclusions of [16] are based on a qualitative comparison of empirically observed spectra to spectra obtained using computer simulations. The analysis in [15] is based on the exact analytical results. Note, however, that the analysis given here is based on spectra which are of much higher quality than the spectra used in [15] (see [57]). Also, the SNPs used here are observed along intergenic regions, which are in principle neutral. The data used in [15] are a combination of neutral loci and of loci under selection. Finally, the 1000 Genomes Project allows large sample sizes to be used ($n = 120$). The analysis in [15] was performed for sample size $n = 42$.

[Figure 4.9, panels a and b. Horizontal axes: years; vertical axes: population sizes $N_{\mathrm{CLM}}$, $N_{\mathrm{MXL}}$.]

Figure 4.9: Optimal histories found for: a CLM, b MXL. Results of fitting of the entire spectra (blue), spectra without the singletons (red), and spectra without both the singletons and the doubletons (green). In panel b, the red line coincides with the green line.

It remains to be checked which improvements can be gained when fitting the spectra to more complex histories, especially in the case of non-African populations. Examples are the histories involving several bottlenecks (that account for the most recent ice-age), and histories involving slow population growth in the recent past. Moreover, it would be interesting to check for African populations whether or not the histories found here are improvements over a simple model of population growth. This can be done by comparing the likelihoods for the histories shown to the likelihoods of alternative histories. Note that in order to compute the likelihoods, the first two moments of the SNPs need to be evaluated. The corresponding expressions for the model shown in Fig. 4.3 are given in Appendix D. The remaining expressions can be derived starting from the results presented here.

This concludes the discussion in this thesis on the single-locus properties under a varying population size. The following chapter analyses two-locus gene genealogies under sustained population-size fluctuations.


5 Linkage disequilibrium

Chapter 3 summarised how to compute the moments of the total branch length of gene genealogies under a varying population size. This makes it possible to describe single-locus patterns of genetic variation in real populations, as shown in Chapter 4. In order to interpret empirically observed multi-locus patterns of neutral genetic variation, one needs to understand the properties of multi-locus gene genealogies.

Multi-locus gene genealogies trace the ancestry of chromosomes at many loci simultaneously. Due to recombination, a pair of loci on the same chromosome can have completely different ancestries, in which case the degree of association between the two loci is weak. It is now well understood (see Section 2.2) that the degree of association between a pair of loci on a given chromosome depends on the physical distance, that is, on the number of base pairs between the loci. Typically, the larger the distance between the loci is, the larger is the probability that a chromosome recombines between them. Commonly, the physical distance between a pair of loci is mapped to the genetic distance measured in terms of the probability of recombination between the two loci per generation per chromosome, $r$. In what follows, it is assumed that $r$ is proportional to the physical distance. See [31] for other mappings of the physical distance to $r$.

Note that the physical distance between two loci that are located on different chromosomes is considered to be infinitely large. Such loci are inherited independently of each other, and they are said to be in linkage equilibrium. Loci that are not independent are said to be in linkage disequilibrium (LD).

Two commonly used measures of linkage disequilibrium are $r^2$ and $\sigma_d^2$, as mentioned in Subsection 2.4.2. The latter may be expressed in terms of the covariances of the times to the MRCA of two chromosomes at two loci, as shown in [44]. This is an important result, as it provides a simple method for calculating the degree of association between pairs of loci.

In populations of constant size, analytical expressions for the covariances and for $\sigma_d^2$ are readily available (see Eqs. (2.25)-(2.28) in Subsection 2.4.2, and also [43, 45, 46, 47, 44]). However, it was found (see [42] and references therein) that empirical genome-wide data of the two-locus patterns of genetic variation in humans cannot be explained using the constant population-size model, since the observed degree of association between pairs of loci at intermediate distances was found to be much higher than the degree expected under a constant population size.

Many authors recognised that the effect of demographic history on the two-locus patterns of neutral genetic variation needs to be understood. Two-locus gene genealogies were analysed under different demographic histories, such as single bottlenecks, or population-size expansions (see [44, 42] and references therein). The effect of sustained population-size fluctuations on two-locus gene genealogies was investigated in [II]. It was shown in [II] how to compute the covariances of the times to the MRCA of two chromosomes at two loci under the model of recurrent bottlenecks illustrated in Fig. 3.2b. The remainder of this chapter summarises the results given in [II].

Consider two loci denoted by $a$ and $b$. Assume that they are sampled at chromosomes called $i$, $j$, $k$, and $l$. As in Subsection 2.4.2, the time to the MRCA of the chromosomes $i$ and $j$ at locus $a$ is denoted by $t_{a(ij)}$ (and, similarly, by $t_{b(ij)}$ at locus $b$). Recall that the time $t_{a(ij)}$ stands for $\ell = \lfloor N_0 t_{a(ij)} \rfloor$ generations, where $N_0$ denotes the number of chromosomes in the Wright-Fisher population in generation $\ell = 0$. As in the previous chapters, it is assumed that $N_\ell \gg 1$.

In order to compute $\sigma_d^2$, one must first compute its three constituent covariances, that is $\mathrm{cov}[t_{a(ij)}, t_{b(ij)}]$, $\mathrm{cov}[t_{a(ij)}, t_{b(ik)}]$, and $\mathrm{cov}[t_{a(ij)}, t_{b(kl)}]$. It is well known how these three covariances in populations with constant size depend on the parameter $R = 2rN_0$, commonly referred to as the scaled recombination rate (see Eqs. (2.25)-(2.27)). For example, from Eq. (2.25) it follows that

$$\mathrm{cov}[t_{a(ij)}, t_{b(ij)}] \propto \frac{1}{R} , \quad \text{for } R \gg 1 , \tag{5.1}$$

implying that the degree of association between two loci at large genetic distances decays with increasing $R$. This is expected, because as $R$ increases, the physical distance between the two loci increases.
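The decay in Eq. (5.1) can be illustrated numerically. The sketch below uses the well-known constant-size expression $(R + 18)/(R^2 + 13R + 18)$ for the covariance of the times to the MRCA of the same pair of chromosomes at two loci. It is assumed here, without access to Eq. (2.25) in this section, that this expression coincides with Eq. (2.25) when time is measured in units of $N_0$ generations and $R = 2rN_0$. The function name is illustrative.

```python
def cov_same_pair(R):
    """cov[t_a(ij), t_b(ij)] for constant size (assumed form of Eq. (2.25))."""
    return (R + 18.0) / (R * R + 13.0 * R + 18.0)

# For R = 0 the two loci share one genealogy, so the covariance equals the
# variance of the pairwise coalescence time (1 in these units). For large R
# the product R * cov approaches 1, i.e. cov decays as 1/R.
for R in (0.0, 1.0, 10.0, 100.0, 1000.0):
    print(R, cov_same_pair(R), R * cov_same_pair(R))
```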

But how do sustained population-size fluctuations alter the degree of association between two loci? Can an effective population-size approximation be used for calculating the covariances $\mathrm{cov}[t_{a(ij)}, t_{b(ij)}]$, $\mathrm{cov}[t_{a(ij)}, t_{b(ik)}]$, and $\mathrm{cov}[t_{a(ij)}, t_{b(kl)}]$ at two loci? When is this possible? Is it possible to encounter long-range association between two loci under population-size fluctuations? Under which conditions does this occur?

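In order to answer such questions for a given fluctuating population-size history $D$, one needs to compute the corresponding covariances $\mathrm{cov}_D[t_{a(ij)}, t_{b(ij)}]$, $\mathrm{cov}_D[t_{a(ij)}, t_{b(ik)}]$, and $\mathrm{cov}_D[t_{a(ij)}, t_{b(kl)}]$, where

\begin{equation}
\mathrm{cov}_D[t_{a(ij)}, t_{b(ij)}] = \langle t_{a(ij)} t_{b(ij)} \,|\, D \rangle - \langle t_{a(ij)} \,|\, D \rangle \langle t_{b(ij)} \,|\, D \rangle . \tag{5.2}
\end{equation}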

The remaining covariances $\mathrm{cov}_D[t_{a(ij)}, t_{b(ik)}]$ and $\mathrm{cov}_D[t_{a(ij)}, t_{b(kl)}]$ can be expressed similarly. Note that the second term in Eq. (5.2) depends only on the single-locus statistical properties, and one has [9, 44, II]

\begin{equation}
\langle t_{a(ij)} \,|\, D \rangle \langle t_{b(ij)} \,|\, D \rangle = \langle t_{a(ij)} \,|\, D \rangle^2 . \tag{5.3}
\end{equation}

This expression can be evaluated using the results in Chapter 3. A method for dealing with the first term in Eq. (5.2) is discussed next.

Consider the model of recurrent bottlenecks shown in Fig. 3.2b. As in [II], the results given below are expressed in terms of the parameter $\lambda_B \equiv x\lambda_x$. Note that the probability that two lines entering the bottleneck coalesce before the bottleneck ends is $(1 + \lambda_B)^{-1}$ [II].

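The covariance averaged over random realisations of demographic histories $D$ having the same values of the parameters $x$, $\lambda$, and $\lambda_B$ is

\begin{equation}
\langle \mathrm{cov}_D[t_{a(ij)}, t_{b(ij)}] \rangle = \big\langle \langle t_{a(ij)} t_{b(ij)} \,|\, D \rangle - \langle t_{a(ij)} \,|\, D \rangle^2 \big\rangle . \tag{5.4}
\end{equation}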

The average covariance $\langle \mathrm{cov}_D[t_{a(ij)}, t_{b(ij)}] \rangle$ is expected to resemble qualitatively the covariances obtained under individual realisations of the demographic history. In contrast to $\langle \mathrm{cov}_D[t_{a(ij)}, t_{b(ij)}] \rangle$, the unconditional covariance

\begin{equation}
\langle t_{a(ij)} t_{b(ij)} \rangle - \langle t_{a(ij)} \rangle^2 = \langle \mathrm{cov}_D[t_{a(ij)}, t_{b(ij)}] \rangle + \mathrm{var}[\langle t_{a(ij)} \,|\, D \rangle] \tag{5.5}
\end{equation}

contains an additional term, namely the variance, over demographic histories, of the expected time to the MRCA of two chromosomes at a single locus. This term is irrelevant for describing the degree of association between two loci.
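The decomposition in Eq. (5.5) can be traced in a few steps. The following is a sketch, assuming (as in Eq. (5.3)) that the two loci share the same single-locus statistics; the subscript $(ij)$ is suppressed here purely for brevity, which is a notational shortcut introduced for this sketch.

```latex
\begin{align}
\langle t_a t_b \rangle - \langle t_a \rangle^2
  &= \big\langle \langle t_a t_b \,|\, D \rangle \big\rangle
     - \big\langle \langle t_a \,|\, D \rangle \big\rangle^2 \\
  &= \big\langle \langle t_a t_b \,|\, D \rangle - \langle t_a \,|\, D \rangle^2 \big\rangle
     + \big\langle \langle t_a \,|\, D \rangle^2 \big\rangle
     - \big\langle \langle t_a \,|\, D \rangle \big\rangle^2 \\
  &= \langle \mathrm{cov}_D[t_a, t_b] \rangle
     + \mathrm{var}\big[ \langle t_a \,|\, D \rangle \big] .
\end{align}
```

The first line averages over demographic histories, the second adds and subtracts $\langle \langle t_a | D \rangle^2 \rangle$, and the third identifies the two resulting groups with Eq. (5.4) and with the variance across histories.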

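In what follows it is discussed how $\langle \mathrm{cov}_D[t_{a(ij)}, t_{b(ij)}] \rangle$ can be computed [II] (see also [42]). The computation is based on representing the ancestry of two chromosomes at two loci in terms of a Markov process. The corresponding graph is shown in Fig. 5.1.

The term $\langle \langle t_{a(ij)} t_{b(ij)} \,|\, D \rangle \rangle \equiv \langle t_{a(ij)} t_{b(ij)} \rangle$ can be computed using [42, II]

\begin{equation}
\langle t_{a(ij)} t_{b(ij)} \rangle = \langle t_{a(ij)} t_{b(ij)} \rangle_{\mathrm{linked}} + \langle t_{a(ij)} t_{b(ij)} \rangle_{\mathrm{unlinked}} . \tag{5.6}
\end{equation}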

The first term in Eq. (5.6) corresponds to the case when two chromosomes find the MRCA at both loci simultaneously. In this case the loci are said to be linked. The second term is for the case when the ancestries of two chromosomes at a given pair of loci are independent of each other. In other words, the loci are unlinked. Note that the latter occurs if the system visits either the state 4 or the state 4′ in Fig. 5.1.

Using the method for computing $\langle t_{a(ij)} t_{b(ij)} \rangle$ that is outlined in [II] yields

\begin{equation}
\langle \mathrm{cov}_D[t_{a(ij)}, t_{b(ij)}] \rangle = \frac{R^3 C_3 + R^2 C_2 + R C_1 + C_0}{R^4 D_4 + R^3 D_3 + R^2 D_2 + R D_1 + D_0}, \tag{5.7}
\end{equation}

where $C_0, \ldots, C_3, D_0, \ldots, D_4$ are functions of $x$, $\lambda$, and $\lambda_B$.

In order to understand the role of different time scales in shaping the degree of association between two loci, Eq. (5.7) must be analysed in at least three limits [II]: the limit of fast population-size fluctuations ($\lambda \to \infty$, $\lambda_B \to \infty$), the limit of slow population-size fluctuations ($\lambda \to 0$), and the limit of severe reduction of population size during the bottleneck ($x \to 0$). These limits are briefly discussed in the following.

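It is shown in [II] that in the limits of fast and of slow population-size fluctuations, Eq. (5.7) reduces to the result of the corresponding effective population-size approximation

\begin{equation}
\langle \mathrm{cov}_D[t_{a(ij)}, t_{b(ij)}] \rangle = x_{\mathrm{eff}}^2 \, \frac{R x_{\mathrm{eff}} + 18}{(R x_{\mathrm{eff}})^2 + 13 R x_{\mathrm{eff}} + 18} . \tag{5.8}
\end{equation}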

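In the former case, the effective population size is equal to the harmonic mean of the population size, and in the latter case it is equal to unity. However, in the limit $x \to 0$ (keeping $\lambda$, $\lambda_B$, and $R$ constant) one finds [II]

\begin{equation}
\langle \mathrm{cov}_D[t_{a(ij)}, t_{b(ij)}] \rangle \approx \frac{R^2 A_2 + R A_1 + A_0}{R^2 B_2 + R B_1 + B_0}, \quad \text{for } x \to 0 . \tag{5.9}
\end{equation}

The coefficients appearing in Eq. (5.9) are listed in [II]. Note that, unlike in the case of constant population size, Eq. (5.9) implies long-range association between pairs of loci: its numerator and denominator are both quadratic in $R$, so within the range of validity of the approximation the covariance levels off at large $R$ instead of decaying as $1/R$.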
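The effective population-size approximation, Eq. (5.8), is simple enough to evaluate directly. The following is a minimal sketch; the function name `cov_effective` and its argument names are illustrative choices, not taken from [II]. It shows that the covariance equals $x_{\mathrm{eff}}^2$ at $R = 0$ and decays as $x_{\mathrm{eff}}/R$ for large $R$, in line with Eq. (5.1).

```python
def cov_effective(R: float, x_eff: float = 1.0) -> float:
    """Evaluate Eq. (5.8): covariance of the times to the MRCA at two loci
    under the effective population-size approximation.

    R     -- scaled recombination rate (R = 2 r N0)
    x_eff -- effective population size in units of N0 (hypothetical name)
    """
    y = R * x_eff
    return x_eff**2 * (y + 18.0) / (y**2 + 13.0 * y + 18.0)


if __name__ == "__main__":
    # Tabulate the covariance and the product R * cov, which tends to x_eff
    # as R grows (the 1/R decay of Eq. (5.1)).
    for R in (0.0, 1.0, 10.0, 1e3, 1e5):
        c = cov_effective(R)
        print(f"R = {R:>8g}   cov = {c:.6g}   R*cov = {R * c:.6g}")
```

With `x_eff = 1` the expression reduces to the constant-population-size form $(R + 18)/(R^2 + 13R + 18)$.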

The results of simulations of the model described are shown in Fig. 5.2 (taken from [II]). In Fig. 5.2a the parameters are chosen to mimic fast population-size fluctuations, and in Fig. 5.2b the parameter $x$ is set to a small value, so as to mimic the case of severe reduction of population size during bottlenecks. In each panel, the red line represents the covariance averaged over an ensemble of randomly generated demographic histories $D$ having the same values of the parameters $x$, $\lambda$, and $\lambda_B$.

As can be seen in Fig. 5.2, the general result Eq. (5.7), depicted by black solid lines, agrees well with the outcomes of computer simulations. Furthermore, in Fig. 5.2a the effective population-size approximation (depicted by the dashed line) agrees well with the outcomes of the computer simulations, as long as recombination is not faster than population-size fluctuations. Still, the failure of the approximation for large $R$ is found to be insignificant in this case.

Conversely, in the case shown in Fig. 5.2b, for low values of $R$ one finds that the effective population-size approximation describes well the degree of association between pairs of loci. But, as $R$ increases, this approximation fails: in a range of values of $R$ the covariance remains constant, implying long-range association between pairs of loci. As expected, the covariance in this region is well predicted by Eq. (5.9), shown as a dashed-dotted line.

It was shown in [II] that Eq. (5.9) can be obtained within a Xi-coalescent approximation (see Section 2.3). In order to understand how the Xi-coalescent approximation arises in this case, note that the duration of the bottleneck is on average equal to $x\lambda_B^{-1}$, and the time between two successive bottlenecks is on average equal to $\lambda^{-1}$. In the limit $x \to 0$, keeping $\lambda$ and $\lambda_B$ constant, it follows that the duration of the bottleneck is negligible compared to the time between two successive bottlenecks, that is $x\lambda\lambda_B^{-1} \to 0$. This allows the durations of the bottlenecks to be ignored. Note, however, that a bottleneck may nevertheless host coalescences. The capability of a bottleneck to host coalescences of ancestral lines entering it is determined by $\lambda_B$. Since the duration of a bottleneck is neglected in the case considered, the coalescences occurring during a bottleneck can be treated as a single simultaneous multiple merger. The graph corresponding to the limit described is shown in Fig. 5.3. The transition rates between the states in this case are given in Appendix B (taken from [II]). Since Eq. (5.9) was derived under the Xi-coalescent approximation in [II], it was argued in [II] that the long-range association between pairs of loci arises due to multiple mergers at the arrival times of bottlenecks. As expected, the multiple-merger approximation, and therefore Eq. (5.9), fails for $R > x^{-1}$, that is when the time scale of recombination becomes shorter than the coalescent time scale within the bottleneck.

By computing the covariances $\langle \mathrm{cov}_D[t_{a(ij)}, t_{b(ik)}] \rangle$ and $\langle \mathrm{cov}_D[t_{a(ij)}, t_{b(kl)}] \rangle$ one can determine $\sigma_d^2$. Although the long-range association between pairs of loci is evident in its three constituent covariances, it turns out that $\sigma_d^2$ does not carry signatures of the long-range association between pairs of loci [II]. This is an important result, since it raises the question of how much information about the processes shaping the two-locus patterns of neutral genetic variation a given measure can provide.

It should be noted that the covariances resulting from individual realisations of demographic history typically fluctuate around the averaged covariance. Still, the results in [II] imply that, typically, the average covariance qualitatively reflects the results of individual realisations of demographic histories.

In summary, the results of [II] show how two-locus gene genealogies are shaped under sustained population-size fluctuations. It was found that the covariance of the times to the MRCA of two chromosomes at two loci can be captured by the effective population-size approximation in the limit of either slow or fast population-size fluctuations. By analysing the case of severe reductions of population size during bottlenecks, it was deduced that $\langle \mathrm{cov}_D[t_{a(ij)}, t_{b(ij)}] \rangle$ exhibits long-range association between loci. It was also shown that the long-range association arises because gene genealogies encounter multiple mergers. This implies that long-range association between loci is expected under any model allowing for multiple mergers. Examples include models of selective sweeps [61], as well as models allowing for a skewed offspring distribution among individuals [62]. As shown in [II], in the limit of severe, strong bottlenecks ($x \to 0$, $\lambda_B \to 0$), the model of recurrent bottlenecks reduces to the model introduced in [62], under which occasionally one individual gives rise to an entire population. In this case, one finds [II]

\begin{equation}
\sigma_d^2 \approx \frac{10 + R + 2\lambda}{22 + R^2 + 16\lambda + 2\lambda^2 + 13R + 3\lambda R}, \tag{5.10}
\end{equation}

where the sign $\approx$ indicates that $\sigma_d^2$ is obtained upon averaging the numerator and the denominator separately (averaging being done over different demographic histories with the same values of the parameters $x$, $\lambda$, and $\lambda_B$). This result is in agreement with the corresponding result in [62].
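Eq. (5.10) can be evaluated numerically to see how the bottleneck rate $\lambda$ lowers $\sigma_d^2$ at a given $R$. The following is a minimal sketch; the function name `sigma_d2` and the argument name `lam` are illustrative and do not come from [II] or [62].

```python
def sigma_d2(R: float, lam: float) -> float:
    """Evaluate Eq. (5.10) for the limit of severe, strong bottlenecks.

    R   -- scaled recombination rate
    lam -- rate of bottlenecks (lambda in the text)
    """
    numerator = 10.0 + R + 2.0 * lam
    denominator = (22.0 + R**2 + 16.0 * lam + 2.0 * lam**2
                   + 13.0 * R + 3.0 * lam * R)
    return numerator / denominator


if __name__ == "__main__":
    # Compare a population without bottlenecks (lam = 0) with lam = 1 and 10.
    for R in (0.0, 1.0, 10.0, 100.0):
        row = "  ".join(f"{sigma_d2(R, lam):.5f}" for lam in (0.0, 1.0, 10.0))
        print(f"R = {R:>5g}:  {row}")
```

Setting `lam = 0` gives $(10 + R)/(22 + 13R + R^2)$, which is the familiar constant-population-size expression for $\sigma_d^2$.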

This chapter, together with Chapters 3-4, contributes to the qualitative and quantitative understanding of the patterns of neutral genetic variation in freely mixing populations with fluctuating population size.

The next chapter analyses gene genealogies in geographically structured populations.

[Figure 5.1, left panel: graph of the states 1, 2, 3, 4, 5 and 1′, 2′, 3′, 4′ with the transition rates between them. Right panel: table of states, reproduced below.]

group     state
1, 1′     a_i b_i, a_j b_j
          a_i b_j, a_j b_i
2, 2′     a_i ◦, ◦ b_i, a_j b_j
          a_i b_i, a_j ◦, ◦ b_j
          a_i ◦, ◦ b_j, a_j b_i
          a_i b_j, a_j ◦, ◦ b_i
3, 3′     a_i ◦, ◦ b_i, a_j ◦, ◦ b_j
4, 4′     a_i •, a_j •
          • b_i, • b_j
5         • •

Figure 5.1: Left: graph showing the states and transition rates determining the ancestral history of two loci in a sample of two chromosomes, under the population model illustrated in Fig. 3.2b. States where the population is in a bottleneck are marked with a prime. The final state is denoted by 5 (in this state it does not matter whether the population is in a bottleneck or not). Arrows indicate transitions between states. The corresponding transition rates are displayed next to the lines. Here $\lambda'$ corresponds to $\lambda_x$ in the main text. Right: table of possible states of the system. The two loci are denoted by $a$ and $b$, and the corresponding chromosomes by $i$ and $j$. Empty circles denote genetic material not ancestral to sampled loci, and full circles denote the MRCA. Figure and caption are taken from [II].

[Figure 5.2: two panels, (a) and (b), showing $\langle \mathrm{cov}_D[t_{a(ij)}, t_{b(ij)}] \rangle$ as a function of $R$ on logarithmic axes.]

Figure 5.2: Covariance of the times to the MRCA of two chromosomes at two loci, averaged over random population-size histories. (a) Rapid population-size fluctuations. The solid line shows the analytical result given by Eq. (5.7). The red line shows the covariance averaged over ten randomly generated demographic histories with $\lambda = 10^2$, $\lambda_B = 10$, $x = 0.1$ (the population sizes used are $N_0 = 10^5$ and $N_B = 10^4$). For each population-size history, an ensemble of 1000 gene genealogies was generated using Wright-Fisher simulations. The dashed line shows the result of the corresponding effective population-size approximation given by Eq. (5.8). (b) The same, but for the case of severe reductions of population size during bottlenecks. The red line denotes the result averaged over fifteen randomly generated sequences of bottlenecks, with parameters $\lambda = 10$, $\lambda_B = 10$, $x = 5 \cdot 10^{-4}$ (the population sizes used are $N_0 = 10^6$ and $N_B = 5 \cdot 10^2$). Averages are over 100 gene genealogies for each demographic history. The dashed-dotted line shows the result of the Xi-coalescent approximation, Eq. (5.9). Figure is taken from [II].

[Figure 5.3: graph of the states 1, 2, 3, 4, 5 and the transitions between them.]

Figure 5.3: Graph showing the states and possible transitions corresponding to the limit of severe reductions of population size during bottlenecks. The states 1, . . . , 5 are explained in Fig. 5.1. The three red arrows in this graph correspond to simultaneous multiple mergers. Figure and caption are taken from [II].

6 Genetic variation in structured populations

Chapters 3-5 discuss how population-size fluctuations shape genetic variation of natural populations. In these chapters, many discussions have been focused on population bottlenecks. Indeed, natural populations may be exposed to frequent drastic environmental changes, which in turn may induce population bottlenecks [14].

Furthermore, many natural populations have emerged through a series of founder events [10, 11, 12]. An example is the populations of a marine snail (L. saxatilis) of Sweden’s west coast archipelago. As explained in [10], an empty habitat (island or skerry) can be rapidly colonised by these snails upon the arrival of one or a few founder females. This is possible since females can carry a large number of progeny beneath their shells, and the progeny are found to have a typically high survival rate. It was suggested in [10] that the first colonisers come to an empty island from one of the mainlands, and that the islands give colonisers to the neighboring skerries. This is illustrated in Fig. 6.1. The colonisation path, indicated by arrows, is mainly dictated by the distance between empty and populated habitats [10].

Founder events are often treated as population bottlenecks [10, II]. Genetic variation of an isolated population under recurrent bottlenecks is analysed in Chapters 3-5. However, this approach does not account for effects caused by immigrants coming from populations that are partly isolated from the population analysed. In order to account for the geographical structure of natural populations, and thereby for gene flow between them, it is necessary to analyse gene genealogies in geographically structured populations.


The effect of geographical structure on gene genealogies is discussed in many theoretical studies [63, 64, 65, 48]. In [63] the correlation of gene frequencies between two populations at a given distance apart was analysed. The model used in [63] assumes that populations may exchange individuals according to the stepping-stone model [66]. The coalescent times of, and the average pairwise differences between, two sequences sampled within a single population, or from a pair of partly isolated populations during colonisation, are investigated in [48].

However, an assumption made in the above-mentioned studies, and in many other related studies (see [48] and references therein), is that different habitats have the same carrying capacity. Thus, the effect of a possible large source of genetic variation (a mainland) is not understood.

Another common assumption is that populations are well mixed, and that mating takes place according to the random-mating model [19]. The random-mating model is questionable for at least two reasons. First, it carries the unrealistic assumption that each female mates infinitely many times with all males present in the population [67]. Second, in natural populations both low and high levels of multiple paternity are observed, and this cannot be accommodated under the random-mating model.

Therefore, the effect of multiple paternity on genetic variation within geographically structured populations is not understood. The aim of this chapter is to provide such an understanding. This chapter provides a summary of the results in [68].

In what follows, a model which allows for different levels of multiple paternity is introduced. A measure of the level of multiple paternity is proposed, and it is shown how to compute the effective population size of a single isolated population in terms of this measure. Afterwards, a model of geographically structured populations is illustrated. The model is a caricature of the colonisation illustrated in Fig. 6.1. It is assumed that there is one mainland population, and a given number of island populations arranged in a one-dimensional chain. Migration is assumed to occur between nearest neighbors. In the model, the first two islands closest to the mainland may stand for an island and one of its nearby skerries shown in Fig. 6.1. A summary of conclusions is given next.

First, on short time scales, that is, during the colonisation phase, the model introduced is closely related to the model of recurrent bottlenecks discussed in the previous chapters (see Fig. 3.2b). Otherwise, the two models may produce significantly different patterns of genetic variation. Second, assuming that the mainland acts as the only source of new genetic material, genetic heterozygosity decreases as the distance from the mainland increases. This is true both in the colonisation phase and on long time scales, that is, in the stationary state of the migration phase. Third, in both cases it may be observed that by increasing the level of multiple paternity, genetic variation at a given distance from the mainland increases. Moreover, the gain in genetic variation due to multiple paternity increases as the distance from the mainland increases. Fourth, the gain in genetic variation at a given island due to multiple paternity is higher in the colonisation phase than in the stationary state of the migration phase. Fifth, population heterozygosity fluctuates significantly. At distances far from the mainland, the fluctuations are particularly severe. It turns out that by increasing the level of multiple paternity, the duration of the high-heterozygosity phase increases, whereas the duration of the low-heterozygosity phase appears to be only modestly affected by the level of multiple paternity.

Finally, it is shown that, despite its simplicity, the model analysed here may be suitable to mimic repeated colonisation events observed in promiscuous populations of the marine snail L. saxatilis (see the discussion on multiple paternity and on colonisation in these populations in [10, 20, 69]).

The remainder of this chapter is organised as follows. In Section 6.1 a mating model which allows for different levels of multiple paternity is introduced. In Section 6.2 the spatial model of geographically structured populations is illustrated. In Section 6.3 the main findings are discussed. Finally, in Section 6.4 the theoretical predictions of the model are compared to the empirical data collected in populations of L. saxatilis [10].

6.1 Mating model

Consider a well-mixed diploid population of $N_m$ males and $N_f$ females. Assume that generations are discrete and non-overlapping, and that $N_m$ and $N_f$ are constant in time. Under the random-mating model, each female mates infinitely many times with all males in a population. Hence, the probability that two children having the same mother come from the same male is equal to $N_m^{-1}$.

[Figure 6.1: schematic map with the labels Mainland, Island, Skerry, and Colonisation Path.]

Figure 6.1: Illustration of colonisation in populations of L. saxatilis of Sweden’s west coast archipelago [10]. The mainland is shown in red, islands are shown in green, and skerries are shown in blue (schematically). An empty island may become populated upon the arrival of one or more founder females. The females may carry a large number of progeny with them, which allows them to populate an empty island in a single generation. The size and the carrying capacity of islands are typically much smaller than the size and the carrying capacity of the mainland. The skerries, in turn, are much smaller and may contain much smaller populations than the islands. In [10] it was concluded that island populations were founded by females from the mainland. The skerry populations were probably founded by females from the closest island. A possible colonisation path is depicted by arrows.

The random-mating model was modified in [67], to account for the fact that females may mate only a finite number of times during a reproductive cycle. The number of matings each female experiences is denoted by $l$ below. In [67] it is assumed that $l$ is the same for all females and that it is constant over time. The model proposed here is based on the model in [67], but with the constraint that during a reproductive cycle, each female meets $s \le N_m$ different males sampled at random (without replacement) from the $N_m$ males existing in the population. Furthermore, it is assumed that each of the $s$ males assigned to a female is equally likely to be the mating partner of this female in each of the $l$ matings she experiences. This is illustrated in Fig. 6.2.
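The two sampling steps of the mating model can be sketched in a few lines. The following is an illustrative simulation only, not the code used in the thesis; the function name `simulate_matings` and the representation of males as integer labels are assumptions made for this sketch.

```python
import random
from typing import List


def simulate_matings(n_males: int, s: int, l: int, rng: random.Random) -> List[int]:
    """Return the father of each of the l matings of a single female.

    Step 1: s different males are assigned to the female, sampled without
            replacement from the n_males males in the population.
    Step 2: for each of the l matings, one of these s males is chosen
            uniformly at random, with replacement.
    """
    if not 1 <= s <= n_males:
        raise ValueError("s must satisfy 1 <= s <= n_males")
    if l < 1:
        raise ValueError("l must be at least 1")
    available = rng.sample(range(n_males), s)          # step 1
    return [rng.choice(available) for _ in range(l)]   # step 2


if __name__ == "__main__":
    rng = random.Random(2012)  # fixed seed so that the run is reproducible
    # Same population as sketched in Fig. 6.2: 7 males, s = 3, l = 2.
    for female in range(2):
        print(f"female {female}: fathers of matings = {simulate_matings(7, 3, 2, rng)}")
```

By construction, the number of distinct fathers per female can never exceed either $s$ or $l$.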

The parameters $s$ and $l$ are assumed to be constant in time, and they are assumed not to differ between the females in a population. Additionally, all females are assumed to be equally likely to contribute to the next generation. Under these settings, the probability that two children stem from the same female is $P_f = N_f^{-1}$, whereas the probability $P_m$ that two children stem from the same male is:

\begin{equation}
P_m = P_f \kappa + (1 - P_f) \frac{1}{N_m} . \tag{6.1}
\end{equation}

Here

\begin{equation}
\kappa = \frac{1}{l} + \left( 1 - \frac{1}{l} \right) \frac{1}{s} \tag{6.2}
\end{equation}

is the probability that two children with the same mother share a father. Note that an assumption behind Eq. (6.2) is that each mating gives rise to a large number of children. In what follows, the inverse of $\kappa$ is taken as a measure of the degree of multiple paternity.
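Eqs. (6.1) and (6.2) translate directly into code. The following is a minimal sketch with illustrative function names (`kappa`, `prob_same_father`), which are not taken from [67] or [68].

```python
def kappa(l: int, s: int) -> float:
    """Eq. (6.2): probability that two children of the same mother share a father.

    With probability 1/l both children stem from the same mating; otherwise
    the two matings involve the same male with probability 1/s.
    """
    if l < 1 or s < 1:
        raise ValueError("l and s must be positive integers")
    return 1.0 / l + (1.0 - 1.0 / l) / s


def prob_same_father(n_females: int, n_males: int, l: int, s: int) -> float:
    """Eq. (6.1): probability P_m that two children stem from the same male."""
    if not 1 <= s <= n_males:
        raise ValueError("s must satisfy 1 <= s <= n_males")
    p_f = 1.0 / n_females  # probability of sharing a mother
    return p_f * kappa(l, s) + (1.0 - p_f) / n_males


if __name__ == "__main__":
    print("kappa(l=1, s=5) =", kappa(1, 5))   # single mating: always the same father
    print("kappa(l=2, s=3) =", kappa(2, 3))
    print("P_m for N_f=2, N_m=7, l=2, s=3 =", prob_same_father(2, 7, 2, 3))
```

The measure of multiple paternity used in the text is then simply `1.0 / kappa(l, s)`.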

The probabilities $P_m$ and $P_f$ determine the population heterozygosity at a given locus. Let the population heterozygosity at the locus sampled in generation $\ell$ be denoted by $H_\ell$. Assume that $N_f \gg 1$, $N_m \gg 1$, and that mutations occur at rate $\mu$ per generation per allele according to the infinite-alleles model. Under these assumptions, the expected heterozygosity in the stationary state can be expressed as (see Appendix E)

\begin{equation}
\langle H \rangle \approx 1 - \frac{1}{1 + 4\mu N_e}, \tag{6.3}
\end{equation}

where

\begin{equation}
N_e = 4 \left( \frac{1 + \kappa}{N_f} + \frac{1}{N_m} \right)^{-1} \tag{6.4}
\end{equation}

is the effective population size under the mating model introduced above. In the case $s = N_m$, the model reduces to the model analysed in [67]. As expected, Eq. (6.4) then reduces to Eq. (7) in [67] (Eq. (7) in [67] contains two additional terms, but in the case discussed here, $N_m \gg 1$, $N_f \gg 1$, these terms are negligible). Furthermore, in the case $l \to \infty$, $s = N_m$, Eq. (6.4) reduces to the result obtained under the random-mating model. Note that Eq. (6.3) becomes exact in the limit $N_m \to \infty$, $N_f \to \infty$.

The effect of $l$ and $s$ on $N_e$ is shown in Fig. 6.3. As $s$ increases, keeping $l > 1$ fixed, the effective population size increases towards the maximum value $N_m + N_f$, which is expected under the random-mating model. This effect starts saturating at $s \approx 10$. For $l = 1$, one has

\begin{equation}
N_e = \frac{4 N_m N_f}{2 N_m + N_f},
\end{equation}

independently of $s$. This case corresponds to monogamous mating. Note that from Eqs. (6.2) and (6.4), it follows that $N_e$ remains the same upon the exchange of the parameters $l$ and $s$.
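The properties of $N_e$ stated above can be checked numerically from Eqs. (6.2)-(6.4). The following is a sketch under the assumptions of the text ($N_m \gg 1$, $N_f \gg 1$); the function names are illustrative and not taken from [67] or [68].

```python
def kappa(l: int, s: int) -> float:
    """Eq. (6.2): probability that two children of one mother share a father."""
    return 1.0 / l + (1.0 - 1.0 / l) / s


def effective_size(n_females: int, n_males: int, l: int, s: int) -> float:
    """Eq. (6.4): effective population size under the mating model."""
    return 4.0 / ((1.0 + kappa(l, s)) / n_females + 1.0 / n_males)


def expected_heterozygosity(mu: float, n_e: float) -> float:
    """Eq. (6.3): expected stationary heterozygosity (infinite-alleles model)."""
    return 1.0 - 1.0 / (1.0 + 4.0 * mu * n_e)


if __name__ == "__main__":
    n = 1000  # N_m = N_f, as in Fig. 6.3
    for l in (1, 2, 3, 5, 10):
        ratios = [effective_size(n, n, l, s) / (2 * n) for s in (1, 10, 100)]
        print(f"l = {l:>2}: Ne/(Nm+Nf) at s = 1, 10, 100 ->",
              ", ".join(f"{r:.3f}" for r in ratios))
```

For $N_m = N_f$ the ratio printed for $l = 1$ is $2/3$ for every $s$, and it approaches 1 when both $l$ and $s$ are large.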

In summary, the larger the degree of multiple paternity is, the larger the effective population size becomes. This is qualitatively consistent with the findings in [70] (for a different mating model).

In order to check how reasonable the mating model is, comparisons of the predictions of the model to data collected in natural populations are required. A well-known species exhibiting multiple paternity is the marine snail L. saxatilis. For this species, it has been determined that in natural habitats, around 20 males sire a brood of a single female [20, 69]. The number of fathers $\langle n_f \rangle$ per progeny of a single female estimated in [20] for populations of L. saxatilis is indicated in yellow in Fig. 6.3. Note that $\langle n_f \rangle \le s$ in the model introduced here. Indeed, the typical number of fathers per progeny of a female, assuming that each female may give rise to a large number of children, is

\begin{equation}
\langle n_f \rangle = s \left( 1 - \left( 1 - \frac{1}{s} \right)^{l} \right) . \tag{6.5}
\end{equation}

It follows that in the case $l \gg 1$ one has $\langle n_f \rangle \approx s$.

Recently, experiments were conducted to determine the number of sires per brood of a female living with $s = 2$, $s = 5$, and $s = 10$ males [71]. In the experiments, a total of 19 young females (not inseminated previously) were placed in 19 jars. Of these, 7 females were accompanied by $s = 2$ males, 6 females by $s = 5$ males, and the remaining 6 were accompanied by $s = 10$ males. The paternity of between 10 and 20 children per female was analysed. In what follows, the number of children of a father involved in siring the brood of a female is referred to as the family size. The results obtained are shown in Fig. 6.4.
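The dependence of the number of fathers on $s$ and $l$ in Eq. (6.5) is easy to tabulate. The following is a minimal sketch; the function name `expected_fathers` is an illustrative choice.

```python
def expected_fathers(s: int, l: int) -> float:
    """Eq. (6.5): number of distinct fathers per brood when each of the l
    matings picks one of the s available males uniformly at random."""
    if s < 1 or l < 1:
        raise ValueError("s and l must be positive integers")
    return s * (1.0 - (1.0 - 1.0 / s) ** l)


if __name__ == "__main__":
    # The jar experiments of [71] used s = 2, 5 and 10 available males.
    for s in (2, 5, 10):
        values = ", ".join(f"{expected_fathers(s, l):.2f}" for l in (1, 2, 5, 20))
        print(f"s = {s:>2}: <n_f> for l = 1, 2, 5, 20 -> {values}")
```

A single mating ($l = 1$) always gives one father, and for $l \gg 1$ the value saturates at $s$. For $s = l = 29$ the formula gives a value between 18 and 19.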

These data are compared to the outcomes of computer simulations of the model discussed here. For each data set with a given value of $s$, the parameter $l$ in the computer simulations was varied in order to find the value resulting in family sizes that are closest to those observed (by searching for the least-squares deviation). The corresponding results are shown in Fig. 6.4 (circles connected by solid lines). It can be noticed that the experimental data for $s = 10$ agree well with the mating model introduced. The model does not seem to recover the empirical distributions found for $s = 2$ and $s = 5$. This may be due to sperm competition, or due to cryptic female choice of particular sperm [71].

Recall that in the experiments described, each female was exposed to exactly $s$ males. In natural populations different females may be exposed to different numbers of males. In Fig. 6.4d the histogram of family sizes of four female broods collected in natural habitats (data taken from [20]) is shown. Using computer simulations, upon setting $s = l$ (for simplicity), the fitting parameter $l$ is found to be $l = 29$. This result is shown in Fig. 6.4d by red circles connected by a line. According to the comparisons in Fig. 6.4 it may be concluded that the mating model agrees well with the empirical data when females have a large number of available males to mate with.

In the next section, this model is used to analyse the effect of multiple paternity on genetic variation in geographically structured populations.

6.2 Spatial model

The model is illustrated in Fig. 6.5. It is motivated by empirical observations made in L. saxatilis [10, 20]. A set of $k + 1$ islands, labelled by $i = 0, 1, \ldots, k$, is arranged in a one-dimensional chain. The island labelled by $i = 0$ is the mainland. The population on the mainland consists of $K \gg 1$ males and $K$ females, where $K$ does not depend on time. The islands labelled by $i = 1, \ldots, k$ are either populated or not. In the former case, the population size is equal to the carrying capacity of the island, and in the latter case the population size is equal to zero. All islands $i = 1, \ldots, k$ are assumed to have the same carrying capacity equal to $2N \gg 1$, where $N$ is the number of males (and it is equal to the number of females). The carrying capacity of the mainland is assumed to be much larger than the carrying capacity of the remaining islands, i.e. $K \gg N$.

It is assumed that initially (in generation ℓ = 0) the mainland ispopulated, and that all other islands are empty. The colonisation of anisland occurs upon the arrival of one or more founder mothers from thenearest neighboring island. As indicated in Fig. 6.5, individuals may mi-grate from island i = 1, . . . , k − 1 to any of the two neighboring islands,left and right neighbors being equally likely. Assuming that mating oc-curs only before migration, of interest is only the movement of females.The rate of female migration is equal to 2M per generation per island.Therefore, an island sends to one of its neighbors on average M femalesper generation. The rate of female migration per generation from the

Page 66: Genetic variation in structured populationsphysics.gu.se/~rmarina/Marina_Rafajlovic/Home_files/lic.pdf · 2012. 5. 11. · Genetic variation in structured populations Marina Rafajlovic

58 Chapter 6 Genetic variation in structured populations

a

b

Figure 6.2: Schematic diagram of mating. A population with Nf = 2females and Nm = 7 males is shown. a To each female s = 3 differentmales are assigned, the males being sampled randomly (without replace-ment) from the 7 males present in the population. These assignments aredepicted by dashed lines. b Each female mates l = 2 times. In this step,for each mating of a given female, a male is sampled randomly (with re-placement) from the s males available to this female. These samplings aredepicted by arrows. In order to emphasize that individuals may carrydifferent genetic material, the individuals are colored differently. Thecolor of the dashed lines matches the color of the corresponding female.Note that, under the model described, genetic material does not play anyrole in the choice of mating.

Page 67: Genetic variation in structured populationsphysics.gu.se/~rmarina/Marina_Rafajlovic/Home_files/lic.pdf · 2012. 5. 11. · Genetic variation in structured populations Marina Rafajlovic

6.2 Spatial model 59

1 10 1000.6

0.7

0.8

0.9

1

s

Ne/(

Nm

+N

f)

Figure 6.3: Effective population size for given s and l, relative to theeffective population size, Nm + Nf , expected under the random-matingmodel. The results shown are for l = 1 (blue), l = 2 (red), l = 3(green), l = 5 (magenta), l = 10 (black, solid). The black dashed lineis for l = Nm. Note that Ne remains the same upon the exchange ofthe parameters l and s. In this case, the dashed line shows Ne underthe mating model analysed in [67]. The yellow region depicts 〈nf〉, thenumber of fathers estimated in four broods of L. saxatilis [20]. Note that〈nf〉 in the model discussed is a function of l and s according to Eq. (6.5).Parameters used: Nm = Nf = 103.


Figure 6.4: Histogram of family sizes (denoted by i) within broods of females (panels a-d; horizontal axis: i; vertical axis: P(i), from 0 to 0.5). Bars in panels a-c show the empirical data given in [71], and bars in panel d show the results reported in [20]. The parameter s is fixed in panels a-c: s = 2 in a, s = 5 in b, and s = 10 in c. The circles connected by lines are the result of the fit (see text for details), the fitting parameter being l. The resulting values of l are l = 4 in a, l = 10 in b, and l = 20 in c. In panel d, the circles connected by the red line are obtained by setting l = s, and thereafter fitting the parameter l. The resulting value of l is l = 29. The fits are computed using computer simulations, by averaging family sizes over 2500 independent broods.


mainland, and from the last island is M, as they both have only one neighboring island.

As mentioned in the introduction of this section, this model is motivated by populations of L. saxatilis. Since it was observed that females usually carry a large number of progeny beneath their shells [20], the founder mothers (first colonisers) in the model are assumed to carry a large number of children. Upon arrival on an empty island, these mothers give rise to 2N children in total, and then die. Hence, the empty island becomes populated one generation after the arrival of the first founders. This corresponds to assuming an infinite growth rate of the population size up to the carrying capacity of the given island.

Once an island becomes populated, it is assumed that the size of the established population remains constant over time. After an island becomes populated, an exchange of individuals (migration) between islands continues at a constant migration rate 2M per island per generation (except for islands i = 0 and i = k, for which the migration rate per island per generation is equal to M).

In the model introduced, an island i = 1, . . . , k passes through two phases. The initial phase is the phase of colonisation. After the completion of the process of colonisation, the migration phase starts. In the following, it is shown for both phases how the heterozygosity depends on the distance from the mainland. The mutation rate per generation per allele is denoted by µ, and the infinite alleles model is assumed.

6.3 Genetic variation

In this section, it is shown how to compute the mean heterozygosity of the population at distance i from the mainland at the time when this population is established. The heterozygosity in the migration phase is analysed afterwards.

6.3.1 Colonisation phase

As explained in Section 6.2, the population on the mainland is assumed to be in the stationary state. Therefore, the expected homozygosity on the mainland is

$$\langle F_0^{(0)} \rangle = \frac{1}{1 + \theta_0}, \tag{6.6}$$

where $\theta_0 = 4\mu K_e$, and $K_e$ is the effective population size of the mainland. The effective population size $K_e$ is given by Eq. (6.4), by substituting $N_m$ and $N_f$ in Eq. (6.4) by K. The mean heterozygosity on the mainland is $\langle H_0^{(0)} \rangle = 1 - \langle F_0^{(0)} \rangle$.


Figure 6.5: Migration between islands (schematically; the diagram shows islands i = 0, 1, 2, 3, . . . , k in a row, with arrows labelled M between neighboring islands in both directions). The mainland is denoted by i = 0. In generation ℓ = 0, only the mainland is populated. An island becomes populated upon the arrival of one or more founder females. The females may carry a large number of progeny with them, so that the founder females populate an empty island in a single generation. The size of such a newly established population is equal to the carrying capacity of the island, and it is assumed that the size remains constant in time. The carrying capacity of islands i = 1, . . . , k is equal to 2N, and it is assumed to be much smaller than the carrying capacity of the mainland (2K). The migration of females occurs with rate 2M per generation per island. It is assumed that from each island each generation on average M females migrate to the nearest left neighboring island, if one exists (and likewise M individuals migrate to the nearest right neighboring island, if one exists). Note that an island may send out individuals only if it is populated. Otherwise it may only receive individuals.


In order to find a corresponding expression for the remaining islands, the coalescent approach is used. Assume that N ≫ 1 (as mentioned in Section 6.2) and that the migration rate M is small, M ≪ 1. The latter implies that the time between two successive founder events is long, and that typically one founder female arrives at an empty island. Under this assumption, the ancestral population size of the newly established population at island i may be approximated by a sequence of i bottlenecks. The duration of each bottleneck is one generation since the founder female alone gives rise to 2N children. The time between two successive bottlenecks corresponds to the waiting time between two founder events, which is typically $M^{-1}$ generations. Let the generation index ℓ be expressed as $\ell = \lfloor 2tN_e \rfloor$, where $N_e$ is the effective population size on the islands i = 1, . . . , k. It is given by Eq. (6.4). In these units of time, the waiting time between two successive founder events may be approximated by an exponential distribution with mean

$$(2MN_e)^{-1}. \tag{6.7}$$

Within the coalescent approximation, it is possible to compute the average time to the MRCA of two lines sampled at a given island at the time of the establishment of its population. Note that in the model introduced here, the mainland acts as the only source of genetic variation, since in the limit N/K → 0, keeping $\theta_0$ constant, one has $\theta = 4\mu N_e \to 0$. It follows that if the MRCA of two alleles sampled randomly from the newly established population at island i was not born on the mainland but on an island j ≤ i (j ≠ 0), the two alleles sampled are identical. Otherwise, if the MRCA was born on the mainland, the two alleles are expected to be identical with probability $\langle F_0^{(0)} \rangle$.

Denote by P(0|i) the probability that the MRCA of two lines sampled from the newly established population at island i is found at the mainland (i.e. that the two lines stem from an allele that was born on the mainland). The mean of the initial homozygosity at island i is:

$$\langle F_0^{(i)} \rangle = 1 - P(0|i) + P(0|i)\,\langle F_0^{(0)} \rangle. \tag{6.8}$$

The probability P(0|i) is given by

$$P(0|i) = \left(1 - \frac{1}{8}(1 + \kappa)\right)^{i} \left(\frac{2MN_e}{2MN_e + 1}\right)^{i-1}. \tag{6.9}$$

Here, $1 - \frac{1}{8}(1 + \kappa)$ is the probability that two sampled alleles do not find their MRCA during the bottleneck (they neither stem from the same maternal allele, nor from the same paternal allele), and the term


Figure 6.6: Heterozygosity in the colonisation phase, $\langle H_0^{(i)} \rangle$ (vertical axis, from 0 to 1), as a function of the distance from the mainland, denoted by i (horizontal axis, from 0 to 10). The migration rate is M = 0.5 in panel a, and M = 0.05 in panel b. The results shown are for s = 1 (blue), s = 2 (red), s = 3 (green), s = 5 (magenta), and s = 10 (black). The analytical results are shown by solid lines, and the results of computer simulations are depicted by symbols. It is assumed that in each populated island, the number of females N is equal to the number of males. An empty island becomes populated upon the arrival of the first immigrant females. After it becomes populated, the island may colonise its neighboring empty island. Remaining parameters used: N = 100, l = 10, θ = 0, $\theta_0 = \infty$, k = 10. The results shown are averages over $9 \cdot 10^4$ independent realisations of the process of colonisation of empty islands for l = 1, 2, 3. For l = 5, 10, averaging is done over $10^4$ independent realisations of colonisation.

$2MN_e(2MN_e + 1)^{-1}$ is the probability that the MRCA of two alleles is not found between two successive bottlenecks.
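The second factor in Eq. (6.9) can be recovered from the exponential waiting time with mean given by Eq. (6.7). The following is a short sketch of that step, assuming (as in the standard coalescent, and not spelled out in the text above) that in units of $2N_e$ generations two lines coalesce at rate 1. The symbol $T$ for the waiting time between two successive founder events is introduced here for illustration only.

```latex
\begin{align*}
  T &\sim \mathrm{Exp}(2MN_e), \qquad \langle T \rangle = (2MN_e)^{-1}, \\
  \Pr(\text{no coalescence during } T)
    &= \int_0^\infty 2MN_e \, e^{-2MN_e t} \, e^{-t} \, \mathrm{d}t
     = \frac{2MN_e}{2MN_e + 1}.
\end{align*}
```

Two lines sampled at island i pass through i bottlenecks and i − 1 such waiting intervals on their way back to the mainland, which gives the exponents i and i − 1 in Eq. (6.9).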

The mean of the heterozygosity in the colonisation phase is shown in Fig. 6.6. The results of computer simulations are shown by symbols, and the analytical result following from Eq. (6.8) is shown by lines. The disagreement in Fig. 6.6a is due to the large migration rate used in this case (M = 0.5), for which the assumption M ≪ 1 does not hold.
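As an illustration of how the lines in Fig. 6.6 could be evaluated, the sketch below combines Eqs. (6.6), (6.8) and (6.9). It is not the code used for the thesis: the function names are hypothetical, and $N_e$ and κ are passed in as plain numbers because Eqs. (6.4)-(6.5), which determine them from s and l, are not reproduced in this section. The numerical values in the usage lines are purely illustrative.

```python
import math


def prob_mrca_on_mainland(i, M, Ne, kappa):
    """Eq. (6.9): probability that two lines sampled at the newly founded
    island i trace back to the mainland without coalescing on the way."""
    if i < 1:
        raise ValueError("Eq. (6.9) applies to islands i >= 1")
    # Probability of escaping coalescence in a single one-generation bottleneck.
    no_coalescence_in_bottleneck = 1.0 - (1.0 + kappa) / 8.0
    # Probability of escaping coalescence between two successive bottlenecks.
    no_coalescence_between = 2.0 * M * Ne / (2.0 * M * Ne + 1.0)
    return no_coalescence_in_bottleneck ** i * no_coalescence_between ** (i - 1)


def initial_heterozygosity(i, M, Ne, kappa, theta0=math.inf):
    """Mean initial heterozygosity at island i, i.e. 1 minus Eq. (6.8)."""
    # Mainland heterozygosity from Eq. (6.6): 1 - 1/(1 + theta0).
    h_mainland = 1.0 if math.isinf(theta0) else theta0 / (1.0 + theta0)
    if i == 0:
        return h_mainland
    # 1 - [1 - P + P * F_mainland] simplifies to P * (1 - F_mainland).
    return prob_mrca_on_mainland(i, M, Ne, kappa) * h_mainland


if __name__ == "__main__":
    # Illustrative (assumed) values of M, Ne and kappa.
    for island in range(0, 6):
        print(island, initial_heterozygosity(island, M=0.05, Ne=100.0, kappa=1.0))
```

With $\theta_0 = \infty$, as used in Fig. 6.6, the initial heterozygosity reduces to P(0|i) itself, so the decay with distance is geometric in i.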

The results in Fig. 6.6 show that the average initial heterozygosity decays as the distance from the mainland increases. This is expected given that the mainland is the only source of new genetic material in the model analysed. More importantly, the results shown indicate that by increasing the level of multiple paternity, the initial heterozygosity at a given distance from the mainland increases. However, for a given value of l > 1, this effect saturates at s ≈ 10.


6.3.2 Migration phase

In the stationary state of the migration phase, the mean heterozygosity $\langle H^{(i)} \rangle$ at distance i from the mainland may be obtained using recursion relations, as shown in Appendix F. Note that, due to the assumption K ≫ N, the heterozygosity on the mainland is not affected by migrants coming from the first island. Therefore one has $\langle H^{(0)} \rangle = \langle H_0^{(0)} \rangle$.

The analytical results for the remaining islands are shown in Fig. 6.7 by solid lines, and the corresponding simulation results are shown by symbols. As can be seen in both panels in Fig. 6.7, the analytical results agree with the results of the computer simulations. Note that the analytical result in this case (see Appendix F) is derived under the assumption M/N ≪ 1.

The two conclusions derived in the colonisation phase are valid here as well. First, the mean heterozygosity in the stationary state decays as the distance from the mainland increases. Second, as the degree of multiple paternity increases, the heterozygosity at a given island increases. The latter effect seems to saturate at s ≈ 10, as also found in the colonisation phase.

However, comparison of the results shown in Fig. 6.7a to those in Fig. 6.6a reveals that the effect of multiple paternity is more pronounced in the colonisation phase than in the stationary state of the migration phase. In order to quantify the gain in heterozygosity due to multiple paternity, one may compute $\langle \tilde{H}_0^{(i)}(s) \rangle$ and $\langle \tilde{H}^{(i)}(s) \rangle$, according to:

$$\langle \tilde{H}_0^{(i)}(s) \rangle = \frac{\langle H_0^{(i)}(s) \rangle}{\langle H_0^{(i)}(1) \rangle}, \tag{6.10}$$

$$\langle \tilde{H}^{(i)}(s) \rangle = \frac{\langle H^{(i)}(s) \rangle}{\langle H^{(i)}(1) \rangle}, \tag{6.11}$$

where the parameter l is assumed to be fixed. Fig. 6.8 shows Eqs. (6.10)-(6.11) for s = 2 (depicted in red), and s = 3 (depicted in green). The dashed lines are for the colonisation phase, and the solid lines are for the stationary state of the migration phase. Fig. 6.8 confirms what was mentioned above, namely that the gain in heterozygosity due to multiple paternity is indeed higher in the colonisation phase than in the stationary state.

In order to understand this effect, recall that founder females in the colonisation phase establish the population on an empty island in a single generation. This implies that, typically, the newly established population receives genetic material of most partners by which the founder female was inseminated. By contrast, in the migration phase the immigrant female may or may not be a successful mother since, in this phase, typically N mothers contribute to the population in the next generation. Furthermore, even if the female is a successful mother, she typically gives rise to 2 children. The genetic material of many different fathers is thus typically lost due to sampling in the migration phase. Following this reasoning, it is to be expected that the gain in heterozygosity due to multiple paternity in the colonisation phase decreases as the growth rate of the population size up to the carrying capacity of an island decreases. Note that the results shown correspond to the infinite growth rate.

In conclusion, the heterozygosity on average increases as the degree of multiple paternity increases. Note that in Figs. 6.6-6.8 the heterozygosities resulting from computer simulations are averaged over a large number of generations, and over independent realisations of the model described. The question is how well the averages computed represent temporal heterozygosities in the migration phase.

Fig. 6.9 shows the heterozygosities at islands i = 1, . . . , 10 as functions of time for a particular realisation of the model. These results imply that the heterozygosity fluctuates significantly over time. The fluctuations become more severe as the distance from the mainland increases. In order to emphasize this, the heterozygosity at the island closest to the mainland, and the heterozygosity at the island furthest from the mainland are shown in Fig. 6.10a, b (black solid lines). For comparison, the corresponding analytically computed stationary-state values of the heterozygosity are shown by green dashed lines.

As Fig. 6.10 shows, at the island furthest from the mainland the heterozygosity typically jumps between values around zero and values around 0.5. As indicated by the grey region in Fig. 6.10, the heterozygosity of a population consisting of only two allelic types may attain values between zero and 0.5. In the case shown, it turns out that the number of allelic types at the island furthest from the mainland is typically not larger than 2 (results not shown). Note that heterozygosity is expected to vary between such extreme values in populations with a very small rate of influx of new genetic material, as pointed out in [72] (see Fig. 1 in [72]).
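The bound of 0.5 for two allelic types, and the coloured bands in Fig. 6.10, follow from the fact that n equally frequent types maximise the heterozygosity at 1 − 1/n. The sketch below illustrates this, assuming that heterozygosity is measured as the probability that two alleles drawn with replacement differ; the function names are hypothetical.

```python
def heterozygosity(freqs):
    """Probability that two alleles drawn with replacement are different,
    1 - sum_a p_a^2, for allele frequencies (or counts) `freqs`."""
    total = sum(freqs)
    return 1.0 - sum((p / total) ** 2 for p in freqs)


def max_heterozygosity(n_types):
    """Upper bound 1 - 1/n, reached when all n allelic types are equally frequent."""
    return 1.0 - 1.0 / n_types


def min_allelic_types(h):
    """Smallest number of allelic types that can produce heterozygosity h."""
    if not 0.0 <= h < 1.0:
        raise ValueError("heterozygosity must lie in [0, 1)")
    n = 1
    while max_heterozygosity(n) < h:
        n += 1
    return n


if __name__ == "__main__":
    print(heterozygosity([0.5, 0.5]))   # two equally frequent types
    print(min_allelic_types(0.45))      # attainable with two types
    print(min_allelic_types(0.6))       # needs at least three types
```

A value such as 0.6 therefore cannot occur in a population with only two allelic types, which is why values above 0.5 fall outside the grey region.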

In such an extreme case, it is natural to ask what the durations of the phases of low and of high heterozygosity are. Beyond this, does multiple paternity in a given population influence these durations, and if so, how?

In order to answer these two questions, the times that the population furthest from the mainland needs to jump from the low- to the high-heterozygosity phase (and vice versa) are computed next. A scheme for computing these durations is shown in Fig. 6.11. Within this scheme, a heterozygosity smaller than 0.1 is considered low, and a heterozygosity larger than 0.4 is considered high. The former is motivated by the fact that the heterozygosity at a monomorphic locus is ≤ 0.1 (see [21]). The latter is based on the observation that the typical maximum value that the heterozygosity may have at the island furthest from the mainland is ≈ 0.5 in the results shown.

The times of transitions between the two phases are recorded using the following principle. If the heterozygosity is in the high phase, then the nearest point in time when the heterozygosity becomes less than 0.1 is recorded. Say this occurs in generation $\ell_0$. For the time of transition, the generation $\ell_1 < \ell_0$ is taken: $\ell_1$ is the generation closest to $\ell_0$ in which the heterozygosity has a value larger than or equal to the average of 0.1 and 0.4 (that is, to 0.25). Under this method, the heterozygosity in generation $\ell \in (\ell_1, \ell_0]$ is less than or equal to 0.25. Using a similar method, the transitions from the low to the high phase are recorded.
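The recording rule described above can be sketched as follows. This is an illustrative reading of the method, not the code used for the thesis: the function names are hypothetical, the mirrored rule for low-to-high transitions is an assumption based on the phrase "a similar method", and the short time series in the usage lines is made up for demonstration.

```python
def phase_transitions(h, low=0.1, high=0.4):
    """Return a list of (generation, new_phase) pairs for a heterozygosity
    time series `h`, where new_phase is "low" or "high"."""
    mid = 0.5 * (low + high)  # intermediate value, 0.25 for the defaults
    transitions = []
    phase = None  # unknown until the series first leaves the band [low, high]
    for l0, value in enumerate(h):
        if value < low and phase != "low":
            new_phase = "low"
        elif value > high and phase != "high":
            new_phase = "high"
        else:
            continue
        if phase is not None:
            # Walk back from l0 to the closest earlier generation l1 that is
            # still on the old side of the intermediate value.
            l1 = l0 - 1
            if new_phase == "low":
                while l1 > 0 and h[l1] < mid:
                    l1 -= 1
            else:
                while l1 > 0 and h[l1] > mid:
                    l1 -= 1
            transitions.append((l1, new_phase))
        phase = new_phase
    return transitions


def phase_durations(transitions):
    """Durations (in generations) of completed phases between transitions."""
    durations = {"low": [], "high": []}
    for (start, phase), (end, _) in zip(transitions, transitions[1:]):
        durations[phase].append(end - start)
    return durations


if __name__ == "__main__":
    # Made-up series for demonstration only.
    series = [0.45, 0.42, 0.30, 0.20, 0.05, 0.02,
              0.12, 0.28, 0.41, 0.44, 0.26, 0.08]
    events = phase_transitions(series)
    print(events)
    print(phase_durations(events))
```

The first phase of a series is only used to initialise the state, so its (unknown) starting time does not contribute to the averages.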

The average durations of the high- and low-heterozygosity phases computed in this way are shown by symbols in Fig. 6.12. In the following, the duration of the low-heterozygosity phase is discussed first, and thereafter the duration of the high-heterozygosity phase is discussed.

The condition for the heterozygosity to switch from the low to the high phase is that new genetic material comes to the population, and that this material is not lost due to random genetic drift. In the model analysed, the population at a given island receives new genetic material only through the process of migration. However, it must be emphasized that not every migration brings new material to the population (except for the island closest to the mainland, under the assumption that $\theta_0 = \infty$). Using the analytical results derived for $\langle H^{(i)} \rangle$, one can compute the effective successful migration rate at a given distance from the mainland (here, the term successful means that the migrants bring new genetic material to the population). To this end, the effective successful migration rate per allele per generation at island i, denoted by $m_e^{(i)}$, can be identified with the mutation rate per allele per generation at this island assuming the infinite alleles model. Therefore, one requires the following to hold:

$$\langle H^{(i)} \rangle = \frac{4 m_e^{(i)} N_e}{1 + 4 m_e^{(i)} N_e}. \tag{6.12}$$

As explained in the previous paragraph, with probability $m_e^{(i)}$ per generation per allele, a new allele comes to the population at island i. In the case $m_e^{(i)} N_e \ll 1$ it follows that typically every $(2 m_e^{(i)} N_e)^{-1}$ generations one mother comes to an island carrying a new allele. The descendants of this mother may be lost due to genetic drift with probability $1 - (N_e/2)^{-1}$, where $N_e/2$ is the number of females in the diploid Wright-Fisher population with population size equal to $N_e$. It follows that the typical waiting time for a successful establishment of new genetic material at island i is $(4 m_e^{(i)})^{-1}$. By comparing this estimate to the results of computer simulations, one observes good agreement. A slight disagreement between the dashed-dotted line and the diamonds in Fig. 6.12a is probably due to the small value of N used in this case. The agreement becomes better when N is increased (see squares and dashed line) or when the migration rate is decreased (see circles and solid line).
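The estimate of the duration of the low-heterozygosity phase can be traced numerically by inverting Eq. (6.12) and combining the arrival interval with the establishment probability. The sketch below is illustrative only: the function names are hypothetical, and the values of the mean heterozygosity and of $N_e$ in the usage lines are assumed for demonstration rather than taken from the thesis.

```python
def effective_migration_rate(h_mean, Ne):
    """Invert Eq. (6.12): m_e = H / (4 * Ne * (1 - H))."""
    if not 0.0 <= h_mean < 1.0:
        raise ValueError("mean heterozygosity must lie in [0, 1)")
    return h_mean / (4.0 * Ne * (1.0 - h_mean))


def waiting_time_for_new_material(m_e, Ne):
    """Typical number of generations until new genetic material is
    successfully established, following the argument in the text."""
    # One mother carrying a new allele arrives every (2 m_e Ne)^-1 generations.
    arrival_interval = 1.0 / (2.0 * m_e * Ne)
    # Her descendants escape loss by drift with probability (Ne/2)^-1.
    establishment_probability = 2.0 / Ne
    # The ratio equals (4 m_e)^-1, independent of Ne.
    return arrival_interval / establishment_probability


if __name__ == "__main__":
    # Assumed illustrative values.
    m_e = effective_migration_rate(h_mean=0.2, Ne=100.0)
    print(m_e, waiting_time_for_new_material(m_e, Ne=100.0))
```

Since the waiting time depends on $N_e$ only through $m_e^{(i)}$, any dependence on s in Fig. 6.12a enters via the stationary heterozygosity used to fix $m_e^{(i)}$.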

The duration of the high-heterozygosity phase in the case of a rare influx of new allelic types, $m_e^{(i)} N_e \ll 1$, may be approximated by the time that a locus with two alleles in the Wright-Fisher population of $N_e$ diploid individuals needs to reach fixation. The condition given above assures that the fixation typically occurs long before new genetic material arrives in the population. The corresponding result for a haploid population is given in Chapter 2. Here, the result is:

$$\ell_{\mathrm{loss}}(\alpha) \approx -4N_e \left[\alpha \ln(\alpha) + (1 - \alpha)\ln(1 - \alpha)\right]. \tag{6.13}$$

For a locus with two alleles, α denotes the initial frequency of a given allelic type. In Fig. 6.12b, the resulting $\ell_{\mathrm{loss}}$ is plotted for different α (see colored regions). The upper boundaries of the colored regions are for α = 0.5, and the lower boundaries are for α = 0.1. The solid and the dashed lines in Fig. 6.12b show the results obtained after integrating Eq. (6.13) from α = 0.1 to α = 0.9. The lower boundary of the integral (α = 0.1) corresponds to a heterozygosity ≈ 0.2, which is close to the intermediate value 0.25 in the method used for determining the transition time between the low and the high phase. As Fig. 6.12b shows, the solid line agrees well with the results of computer simulations (denoted by circles). The diamonds are at slightly higher values than the circles. This is expected, however, since in the case shown by the diamonds the migration rate is higher than in the case shown by the circles, leading to longer durations of the high-heterozygosity phase. The same may be deduced for the disagreement between the dashed line and the squares.
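Eq. (6.13) and its integral over α can be evaluated with a few lines of code. The sketch below uses a plain trapezoidal rule; the function names are hypothetical. Whether the integral is divided by the width of the interval (that is, whether a mean over uniformly distributed α is taken) is not stated in the text, so the `normalise` switch is an assumption left to the reader.

```python
import math


def time_to_fixation(alpha, Ne):
    """Eq. (6.13): approximate time for a two-allele locus with initial
    frequency alpha to reach fixation in a population of Ne diploids."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return -4.0 * Ne * (alpha * math.log(alpha)
                        + (1.0 - alpha) * math.log(1.0 - alpha))


def integrated_time(Ne, a=0.1, b=0.9, steps=10000, normalise=False):
    """Trapezoidal integral of Eq. (6.13) over alpha in [a, b].
    With normalise=True the integral is divided by (b - a)."""
    width = (b - a) / steps
    total = 0.5 * (time_to_fixation(a, Ne) + time_to_fixation(b, Ne))
    for j in range(1, steps):
        total += time_to_fixation(a + j * width, Ne)
    integral = total * width
    return integral / (b - a) if normalise else integral


if __name__ == "__main__":
    Ne = 100.0  # assumed illustrative value
    print(time_to_fixation(0.1, Ne), time_to_fixation(0.5, Ne))
    print(integrated_time(Ne), integrated_time(Ne, normalise=True))
```

For α = 0.5 the expression reduces to $4N_e \ln 2$, which is the upper boundary of the colored regions in Fig. 6.12b.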

The results shown in Fig. 6.12 also reveal the effect of multiple paternity. It turns out that the duration of the high-heterozygosity phase becomes longer as the degree of multiple paternity increases. Conversely, the low-heterozygosity phase becomes shorter as the degree of multiple paternity increases. The latter effect is, however, negligible compared with the effect on the duration of the high phase. Indeed, the high-heterozygosity phase for s = 10 is ≈ 60% longer than in the case s = 1. By contrast, the low-heterozygosity phase for s = 10 is only ≈ 10% shorter than in the case s = 1 (results not shown).


Figure 6.7: Heterozygosity in the stationary state of the migration phase, $\langle H^{(i)} \rangle$ (vertical axis, from 0 to 1), as a function of the distance from the mainland, i (horizontal axis, from 0 to 10). The migration rate is M = 0.5 in panel a, and M = 0.05 in panel b. The results shown are for s = 1 (blue), s = 2 (red), s = 3 (green), s = 5 (magenta), and s = 10 (black solid). The black dashed line shows the result for s = N, where N is the number of males as well as of females on each island. The analytical results are shown by solid lines, and the results of computer simulations are depicted by symbols. In panel a, $N = 10^3$, and averaging is done over $1.5 \cdot 10^7$ generations, the initial $10^7$ generations being discarded. Three independent realisations of the model described were simulated and averaged over. In panel b, N = 100, and averages are over $4 \cdot 10^7$ generations, the initial $5.5 \cdot 10^7$ generations being discarded. Four independent realisations of the model described were simulated and averaged over. Remaining parameters used: l = 10, θ = 0, $\theta_0 = \infty$, k = 10.


Figure 6.8: Gain in heterozygosity due to multiple paternity (vertical axis, from 0 to 2) as a function of the distance from the mainland, i (horizontal axis, from 0 to 10). The analytical results are shown by lines, and the results of computer simulations are depicted by symbols. Results for the colonisation phase, $\langle \tilde{H}_0^{(i)} \rangle$ (see Eq. (6.10)), are depicted by dashed lines and squares, and results for the stationary state, $\langle \tilde{H}^{(i)} \rangle$ (see Eq. (6.11)), are shown by solid lines and circles. The results shown are for s = 2 (red), and s = 3 (green). The migration rate is M = 0.5 in panel a, and M = 0.05 in panel b. The simulation data points correspond to the points in Figs. 6.6, 6.7. For details about the parameters used, see Figs. 6.6, 6.7.


Figure 6.9: Heterozygosity as a function of the distance from the mainland, and of time (single realisation of the model described). The heterozygosity increases going from dark towards light colors. Shown are $10^5$ generations after the initial $7 \cdot 10^6$ generations. It is assumed that in each island, the number of females N is equal to the number of males. Parameters used: M = 0.05, s = 10, l = 10, N = 100, θ = 0, $\theta_0 = \infty$, k = 10.


Figure 6.10: Heterozygosity at island i = 1 (panel a, vertical axis $H^{(1)}$), and at island i = 10 (panel b, vertical axis $H^{(10)}$), as a function of the generation ℓ (horizontal axis, from 0 to $10 \cdot 10^4$). The black solid line shows the results of computer simulation. The dashed green lines show the average heterozygosity in the stationary state computed analytically. In both panels, coloured regions show the range of values of heterozygosity for which the minimum number of allelic types present in the population is: 2 (grey), 3 (red), 4 (orange), 5 (yellow), and 6 (white). The data points correspond to those shown in Fig. 6.9.

Figure 6.11: Illustration of the method used to determine the duration of the low- and high-heterozygosity phases (horizontal axis: generation, from $2.2 \cdot 10^4$ to $2.6 \cdot 10^4$; vertical axis: $H^{(10)}$, from 0 to 0.5). The phases are depicted by the red line. The black line shows the results of the computer simulation (single realisation). The data correspond to those shown in Fig. 6.10b within the corresponding time span.


Figure 6.12: Average duration of the low-heterozygosity phase (panel a), and of the high-heterozygosity phase (panel b), as a function of s (horizontal axis: s = 1, 2, 3, 5, 10; vertical axis: duration, from 100 to 100000). The symbols denote the results of computer simulations, and the lines show the theoretical expectations (in panel b the lines are obtained by integrating Eq. (6.13) from α = 0.1 to α = 0.9). In panel b, the boundaries of the colored regions correspond to setting α in Eq. (6.13) to α = 0.1 (the lower boundaries), and to α = 0.5 (the upper boundaries). The yellow color is for N = 100, and the orange is for $N = 10^3$. The results shown are for M = 0.5 (squares and dashed lines, diamonds and dashed-dotted lines), and for M = 0.05 (circles and black solid lines). Note that the dashed-dotted line in panel b coincides with the black solid line. The number of females N is equal to the number of males, and the values used for producing this figure are N = 100 (circles, diamonds), and $N = 10^3$ (squares). Remaining parameters used: l = 10, θ = 0, $\theta_0 = \infty$, k = 10.


6.4 Heterozygosity in L. saxatilis

This chapter is concluded by comparing the results obtained under the model analysed to the empirical data: the heterozygosity found in populations of a marine snail, L. saxatilis. The data are reported in [10]. In [10], the heterozygosity at 13 skerries, 7 islands, and 5 mainlands is measured. According to the illustration in Fig. 6.1, and to the model shown in Fig. 6.5, an island in the terminology used in [10] is treated as an island at distance i = 1 from the mainland in the model analysed here. Similarly, a skerry is treated as an island at distance i = 2 from the mainland.

Using the empirically observed heterozygosity averaged over the mainland populations sampled, one can compute the corresponding value of $\theta_0$ using Eq. (6.6). In order to compare the model analysed above to the empirical data, one needs to find the corresponding parameter $MN_e/N$. The comparison in Fig. 6.13 is for $MN_e/N = 2.5$. As the results show, this parameter gives rise to heterozygosities that are close to those empirically observed. However, it has been estimated that the rate of colonisation of natural skerries is ≈ 0.03 per generation per skerry [5]. The migration rate found here is somewhat higher than that reported in [5]. Namely, the rate estimated here suggests that in the case of an extreme degree of multiple paternity, that is for $N_e = 2N$, a skerry population is typically colonised ≈ 2 generations after the first colonisers are sent from the mainland. For a smaller degree of multiple paternity the colonisation rate turns out to be even higher. Although the value found here is higher than that estimated in natural habitats, the disagreement may be reasonable. Namely, it is to be noticed that, unlike in the model considered here, in natural habitats it is possible that the rate of successful colonisation is smaller than the rate of migration: some females may carry only a small number of progeny, which may be insufficient to colonise the entire island successfully. Having the rate of successful colonisation smaller than the rate of migration can result in a different heterozygosity during the colonisation phase in the model analysed here. However, the stationary state of the migration phase remains unchanged.
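The two parameter conversions used in this comparison are simple enough to sketch. The first function inverts Eq. (6.6) to obtain $\theta_0$ from a mainland heterozygosity; the second converts the combined parameter $MN_e/N$ into M for a given ratio $N_e/N$. The function names are hypothetical, and the inputs in the usage lines are chosen only to reproduce the values $\theta_0 = 0.1784$ and $MN_e/N = 2.5$ quoted for Fig. 6.13; the value N = 100 is an arbitrary assumption, since only the ratio $N_e/N$ matters.

```python
def theta0_from_heterozygosity(h_mainland):
    """Invert Eq. (6.6): H = theta0 / (1 + theta0), so theta0 = H / (1 - H)."""
    if not 0.0 <= h_mainland < 1.0:
        raise ValueError("heterozygosity must lie in [0, 1)")
    return h_mainland / (1.0 - h_mainland)


def migration_rate_from_ratio(ratio, N, Ne):
    """Migration rate M implied by a given value of the parameter M * Ne / N."""
    return ratio * N / Ne


if __name__ == "__main__":
    # Mainland heterozygosity corresponding to theta0 = 0.1784.
    h_mainland = 0.1784 / 1.1784
    print(theta0_from_heterozygosity(h_mainland))
    # Extreme multiple paternity, Ne = 2N, with N = 100 assumed for illustration.
    print(migration_rate_from_ratio(2.5, N=100.0, Ne=200.0))
```

For $N_e = 2N$ the quoted parameter value corresponds to M = 1.25 females per generation sent to each neighbor, which is far above the empirical colonisation rate of ≈ 0.03 discussed above.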

In order to further support the findings of the model, a principal component analysis of the empirically observed allelic frequencies [10] is performed. Within the model analysed in this chapter, it is to be expected that the principal component analysis clusters together the data taken at a given distance from the mainland (since the mainland is the source of genetic variation). The results of the analysis are shown in Fig. 6.14. The mainlands are depicted by red circles, the islands by green circles, and the skerries by blue circles. It can be seen in Fig. 6.14 that the mainlands are distinguished from the islands and skerries along the first principal component. The islands are separated from the skerries along the second principal component. It might be that, apart from the isolation by distance between populations, the classifications observed reflect different population sizes as well.

In summary, the results shown in this chapter suggest that the model introduced can be appropriate to describe the gene flow between geographically structured populations of L. saxatilis. However, the theoretical model analysed here may still be improved. For example, one should take into account different population sizes in different habitats, similar to that expected in nature. Also, the consequences of occasional population extinctions must be analysed. Furthermore, the effect of migration on two-locus genetic variation remains to be described.

Migration patterns in nature are usually unknown. But using the results for the single-locus and the two-locus patterns of genetic variation, one may be able to infer the migration patterns between partly or completely isolated populations. For example, note that the demographic history of the populations described in this chapter may be represented by the colonisation-migration ancestral tree shown in Fig. 6.15. The gene genealogies of samples taken from different populations are conditional on this colonisation-migration ancestral tree. If the populations are completely isolated from each other, two alleles taken from different populations cannot find their MRCA before the divergence time between the two populations.

Furthermore, note that two populations in this case can be treated analogously to two reproductively isolated species. In a different case, when an occasional gene flow between geographically structured populations is allowed, gene flow may mimic an ongoing process of speciation. Indeed, species trees are usually represented similarly to the colonisation-migration ancestral tree shown in Fig. 6.15. As in the case of migration patterns, species trees are also typically unknown. The task of inferring species trees using gene trees sampled across different species is difficult in general. This is discussed in the next chapter.


[Figure 6.13 plot: horizontal axis, $i$; vertical axis, $\langle H(i) \rangle$.]

Figure 6.13: Average heterozygosity over different loci at distance $i$ from the mainland. The experimental data are shown by symbols (taken from [10]), and the theoretical results are shown by lines. The theoretical results are for the heterozygosity in the stationary state of the migration phase under the model shown in Fig. 6.5, with the parameters $MN_e/N = 2.5$, and $\theta = 0$, $\theta_0 = 0.1784$ (see text). The parameter $MN_e/N$ is chosen so that the resulting line is sufficiently close to the empirical data (this parameter is not fitted).


[Figure 6.14 plot: horizontal axis, first component; vertical axis, second component.]

Figure 6.14: Principal component analysis of the experimental data reported in [10]. The mainlands are denoted by red circles, the islands by green circles, and the skerries by blue circles. The analysis is done without taking into account the locus Aat-1 since it is likely to be subject to selection [73]. The results when this locus is taken into account (not shown) do not differ significantly from the results shown here.


[Figure 6.15 diagram labels: Migration; Bottleneck; Time to Colonisation; Mainland, Island, Skerry.]

Figure 6.15: Colonisation-migration ancestral tree. The tree shows demographic histories of the mainland, island, and skerry populations according to the model analysed in this chapter. The resulting gene genealogies corresponding to samples taken across different populations are conditional on the histories shown.


7 Species trees

There are a number of theoretical models which aim at inferring the species tree underlying empirically determined gene trees. The aim of this chapter is to discuss typical problems one may encounter when choosing and implementing a theoretical model for such a purpose.

This chapter is organised as follows. In Section 7.1, typical sources of incongruence between gene trees and their underlying species tree are discussed. Several theoretical methods commonly used to infer the species tree corresponding to a given set of gene trees are also listed. In Section 7.2, a measure of tree-to-tree distance introduced in [74] is discussed.

7.1 Gene trees and species trees

Species trees arise as a result of the processes of speciation and species extinction. Neither of the two processes is well understood. But, by using genetic sequences, one may presumably infer the relatedness between species existing today.

However, the task of inferring species trees using empirically estimated gene trees turns out to be difficult. This is true even when one disregards the durations of speciations, that is, the time during which gene flow between partly isolated ecotypes may occur. The reason is explained in the following.

Suppose that one has a collection of gene trees corresponding to samples taken across different species. Let one individual be sampled from each species, and assume that all speciations occurred rapidly in the past. One of the first attempts to solve the task of inferring the species tree is based on determining the most common gene tree among the trees estimated at different loci, and identifying such a tree with the underlying species tree [75, 76, 77, 78]. Assuming that under the process of speciation one species may split into two subspecies (that is, assuming that species trees are bifurcating trees), this method of assigning the most likely gene tree to the underlying species tree is expected to work well in the case of three external taxa. In this case, the most likely gene tree is concordant with its underlying species tree [75, 77]. In order to show this (see [75] and references therein), notice that a tree with three external taxa has one internal branch, and that it allows for three different topologies. Let the length of this internal branch be denoted by $\tau$, where $\tau > 0$. This constraint assures that the species tree is bifurcating. The given gene tree has the same topology as the species tree if the pair of lineages sampled from the two species corresponding to the younger speciation event coalesces during time $\tau$, or if no coalescence occurs during time $\tau$ and this pair is the first to coalesce afterwards (the latter occurs in one third of such cases, because each of the three possible pairs is equally likely to coalesce first). The corresponding probability $P_0$ is the sum of these two terms:

$$P_0 = \int_0^{\tau} dt\, e^{-t} + \frac{1}{3}\left(1 - \int_0^{\tau} dt\, e^{-t}\right), \tag{7.1}$$

which yields:

$$P_0 = 1 - \frac{2}{3} e^{-\tau} > \frac{1}{3}. \tag{7.2}$$

The inequality in Eq. (7.2) holds because $\tau > 0$. The given gene tree has one of the two remaining topologies only if no coalescence occurs during time $\tau$, and one of the two other pairs is the first to coalesce afterwards. The corresponding probabilities $P_1$ and $P_2$ are:

$$P_1 = P_2 = \frac{1}{3} e^{-\tau} < \frac{1}{3}. \tag{7.3}$$

It follows that the probability for a gene tree to have the topology of its underlying species tree is strictly larger than the probability that it has one of the two remaining topologies. Thus, in the case of three external taxa, the most likely gene tree is always in concordance with its underlying species tree.
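The three-taxon argument can be checked numerically. The following is a minimal sketch, assuming time is measured in coalescent units so that a pair of lineages coalesces at rate one; the function names `topology_probabilities` and `simulate_p0` are illustrative and do not appear in the cited references.

```python
import math
import random


def topology_probabilities(tau):
    """Return (P0, P1, P2) for a three-taxon species tree with internal branch tau."""
    no_coal = math.exp(-tau)              # no coalescence on the internal branch
    p0 = (1.0 - no_coal) + no_coal / 3.0  # concordant topology
    p1 = p2 = no_coal / 3.0               # each of the two discordant topologies
    return p0, p1, p2


def simulate_p0(tau, n_rep=100_000, seed=1):
    """Monte Carlo estimate of P0 under the same assumptions."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_rep):
        if rng.expovariate(1.0) < tau:
            hits += 1  # the two sister lineages coalesce within tau
        elif rng.randrange(3) == 0:
            hits += 1  # afterwards, each of the 3 pairs is equally likely to coalesce first
    return hits / n_rep


if __name__ == "__main__":
    for tau in (0.1, 0.5, 2.0):
        print(tau, topology_probabilities(tau), simulate_p0(tau))
```

The three probabilities sum to unity for every value of the internal branch length, and the concordant topology is the most probable one whenever the branch length is positive.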

However, already in the case of four external taxa, discordance between the most likely gene tree and its underlying species tree may occur [77]. Although this does not occur in the case of symmetric trees with four taxa, it typically occurs in the case of asymmetric four-taxon species trees with internal branch lengths smaller than approximately 0.15 coalescent time units [77]. Here, one coalescent time unit is the expected time needed for two lineages sampled in a population of $2N_e$ chromosomes to coalesce, and $N_e$ is the effective population size of each internal and external taxon.

Furthermore, if the species tree involves $n \geq 5$ external taxa, the resulting gene trees may differ in their topology significantly from each other, as well as from their underlying species tree [77]. Moreover, the most likely gene-tree topology is with large probability different from the topology of the underlying species tree.

It is well understood that such a discordance arises because species trees with a large number of external taxa typically contain deep short internal branches, allowing for incomplete lineage sorting (also known as deep coalescences). This is true regardless of the topology of the species tree, and of the number of its external taxa $n$, provided $n \geq 5$ [77]. Therefore, the method of using the most common gene tree to estimate the species tree may be, and typically is, misleading.

Several other statistical methods have been developed for inferring species trees using gene trees. The existing statistical methods include (see [78] and references therein) the maximum likelihood approach, the Bayesian approach (see the method BEST [79]), as well as the approaches relying on the summary statistics of estimated gene trees (see the methods COAL [76], STAR [80], GLASS [81]).

Since the number of possible tree topologies grows rapidly as the number of external taxa increases [82], for species trees with $n \geq 6$ external taxa such algorithms become computationally expensive. More importantly, in the algorithms listed, it is assumed that incomplete lineage sorting is the only source of differences between the given gene trees and their underlying species tree. Yet, many other factors may significantly contribute to the diversity of the resulting gene trees. One such process is gene flow between partly reproductively isolated ecotypes during the process of speciation. Indeed, none of the above methods takes into account the empirical fact that speciation is a process allowing for gene flow between different ecotypes over a number of generations [5]. Due to such gene flow, it is possible to obtain hybrid ecotypes, which may either establish hybrid species, or go extinct. A model of speciation in the presence of gene flow was discussed in [83], and a particular model of hybridization was discussed in [84]. Moreover, other processes such as gene transfer, gene duplication, and gene extinction may as well contribute to the diversity of gene trees [78]. In fact, when dealing with real data, one is usually interested in detecting the signatures of such processes despite the incomplete lineage sorting [85]. This task, however, remains unresolved.

In order to solve this task, one must first quantify the difference between trees under incomplete lineage sorting alone. One possible measure is the tree-to-tree distance introduced in [74]. This measure is discussed in the following.

In what follows, bifurcating species trees are considered. Using multispecies coalescent simulations [78], gene trees resulting from a given species tree are generated. Similar to other studies [76, 77, 84], it is assumed here that the effective population size is constant along each internal and external branch, and that it is the same among different external taxa. The species are assumed to consist of diploid organisms. But, unlike in other studies, it is assumed here that the effective population size of an internal taxon is equal to the sum of the sizes of its derived taxa. For example, in Fig. 7.1, the external taxa A, B, C, D, and E have the same effective population size $N_e$ (illustrated by the same width of the corresponding external branches), but the internal taxa have larger sizes. The effective population size of AB, and of CD, is equal to $2N_e$, and it is equal to $4N_e$ for the taxon ABCD. The root of the tree, ABCDE, has the effective population size equal to $5N_e$.
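The size assignment described above can be sketched in a few lines. This is an illustration only: the nested-tuple representation of the species tree, the function name `taxon_sizes`, and the labelling of internal taxa by concatenated leaf names are assumptions made for this example.

```python
def taxon_sizes(tree, sizes=None):
    """Return (label, sizes), where sizes maps every taxon to its size in units of N_e.

    `tree` is a nested tuple of taxon names; each external taxon has size 1,
    and each internal taxon has the summed size of its derived taxa.
    """
    if sizes is None:
        sizes = {}
    if isinstance(tree, str):
        sizes[tree] = 1                     # external taxon
        return tree, sizes
    labels = []
    total = 0
    for child in tree:
        label, _ = taxon_sizes(child, sizes)
        labels.append(label)
        total += sizes[label]
    name = "".join(labels)                  # e.g. "A" + "B" -> "AB"
    sizes[name] = total                     # internal taxon
    return name, sizes


if __name__ == "__main__":
    species_tree = ((("A", "B"), ("C", "D")), "E")   # topology of Fig. 7.1
    root, sizes = taxon_sizes(species_tree)
    print(root, sizes)
```

Since the pairwise coalescence rate is inversely proportional to the population size, a pair of lineages within a taxon of size $kN_e$ coalesces at rate $1/k$ per unit of time of $2N_e$ generations.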

As mentioned in the previous chapter, such a model may represent the demographic history of populations that are completely isolated from each other. If the isolated populations evolve to establish different species, this model describes the so-called allopatric speciation.

7.2 Tree-to-tree distance

The tree-to-tree distance discussed below was described in [74].

Denote a pair of trees with $n$ external taxa by $a$ and $b$. The distance $D(a, b)$ between the trees $a$ and $b$ is the number of internal edges on the tree $a$ for which there are no equivalent edges on the tree $b$, plus the number of internal edges on the tree $b$ for which there are no equivalent edges on the tree $a$. Here, an internal edge on a given tree is described by the two subsets of external taxa into which the edge partitions the tree. In order to compare pairs of edges, one assigns to each external taxon $i$ ($i = 1, \ldots, n$) the number $2^{i-1}$. Each of the two subsets corresponding to a given edge is assigned the sum $\sum_{j \in J} 2^{j-1}$, where $J$ denotes the collection of the external taxa within the subset. This procedure yields two numbers for each of the internal edges, and one can describe an edge by the smaller of the two numbers. In an unrooted tree with $n$ external taxa, one finds $n - 3$ independent internal edges. Therefore, the largest distance between two unrooted trees with $n$ external taxa is $D_{\max} = 2(n - 3)$. The minimum distance is $D_{\min} = 0$. In this case, the trees are said to be topologically equivalent. The case of a rooted species tree with $n$ external taxa can be treated equivalently to the case of an unrooted tree with $n + 1$ external taxa.

[Figure 7.1 diagram: external taxa A, B, C, D, E; internal taxa AB, CD, ABCD, ABCDE.]

Figure 7.1: Species tree with five external taxa, denoted by A, B, C, D, and E. The internal taxa are denoted by AB, CD, ABCD, and ABCDE, the latter being the root of the tree. The widths of the branches depict the effective population sizes of the taxa: the external taxa have the same effective population size, but the effective population size of each internal taxon is equal to the sum of the sizes of its derived subspecies. A possible gene tree corresponding to the given species tree is illustrated by the red line. It can be seen that the gene tree is in discordance with its underlying species tree. The distance $D$ between the species tree and the gene tree shown is $D = 2$ (see Section 7.2 for details of the corresponding calculation).

An example of a species tree is shown in Fig. 7.1. This species tree has 5 external taxa, denoted by A, B, C, D, and E. A possible resulting gene tree is also illustrated (see red line). As can be seen, the two trees are topologically different. The difference between them can be measured using the distance explained in the previous paragraph. As an example, assume that the root is unknown, and assign the following codes to the external taxa: 1 to A, 2 to B, 4 to C, 8 to D, and 16 to E. By applying the procedure explained in the previous paragraph, one obtains the numbers {3, 12} for the species tree shown, and the numbers {3, 7} for the gene tree shown. It follows that the distance between the given trees is $D = 2$.
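The calculation above can be reproduced with a short sketch. The nested-tuple representation of trees and the names `edge_codes` and `tree_distance` are assumptions made for illustration; the gene-tree topology used below is one choice that yields the codes {3, 7} quoted in the text, and is not taken from the figure itself.

```python
def edge_codes(tree, code):
    """Return the set of codes describing the internal edges of an unrooted tree.

    `tree` is a nested tuple of taxon names (an arbitrary rooting of the
    unrooted tree), and `code` maps the i-th taxon to 2**(i-1).
    """
    total = sum(code.values())
    codes = set()

    def subset_sum(node):
        if isinstance(node, str):
            return code[node]
        s = sum(subset_sum(child) for child in node)
        other = total - s
        # an internal edge separates at least two taxa from at least two others
        if bin(s).count("1") >= 2 and bin(other).count("1") >= 2:
            codes.add(min(s, other))        # describe the edge by the smaller number
        return s

    subset_sum(tree)
    return codes


def tree_distance(a, b, code):
    """Number of internal edges present in exactly one of the two trees."""
    return len(edge_codes(a, code) ^ edge_codes(b, code))


if __name__ == "__main__":
    code = {"A": 1, "B": 2, "C": 4, "D": 8, "E": 16}
    species_tree = ((("A", "B"), ("C", "D")), "E")
    gene_tree = ((("A", "B"), "C"), ("D", "E"))      # assumed topology
    print(edge_codes(species_tree, code), edge_codes(gene_tree, code))
    print(tree_distance(species_tree, gene_tree, code))
```

Describing each edge by a single integer makes the comparison of two trees a set operation, so the distance is obtained from the symmetric difference of the two sets of codes.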

As noted in [76], the probability of topological equivalence ($D = 0$) between gene trees and their underlying species tree depends on the length of internal branches, and on the symmetry of the species tree. Assuming that all internal and external taxa have the same, time-independent effective population sizes, and that the branches within a given species tree are known, this probability can be computed analytically using the results of [76].

In the remainder of this chapter, the case of an unrooted species tree such that the effective population size of an internal taxon is equal to the sum of the effective population sizes of its derived taxa is considered. It is assumed that all external taxa have the same effective population sizes. It is also assumed that the effective population size is constant along each branch. Two models of speciation are employed. First, in Subsec. 7.2.1, it is assumed that all internal branches are of the same length. Second, in Subsec. 7.2.2, speciation is modelled using the Yule model [86]. Under these two models, it is shown how the probability of topological equivalence between the species tree and the gene tree depends on the length of internal branches in the former case, and on the rate of speciation in the latter. It is also discussed how this probability is affected by the symmetry of the given species tree. The degree of symmetry in a given tree may be measured by the Colless index, denoted by $I_c$ [87, 88]:

$$I_c = \frac{\sum_{i=1}^{n-1} |T_r(i) - T_l(i)|}{\binom{n-1}{2}}. \tag{7.4}$$

Here the summation is performed over all internal taxa, and $T_r(i)$ and $T_l(i)$ denote the number of external taxa belonging to the right, and to the left branch of the internal node $i$, respectively. The index $I_c$ ranges between zero, for a completely balanced tree, and unity, for a completely imbalanced tree.
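The index of Eq. (7.4) can be evaluated as in the following sketch, which assumes a rooted bifurcating tree with at least three external taxa, written as nested 2-tuples. The function name `colless_index` and this representation are illustrative choices.

```python
from math import comb


def colless_index(tree):
    """Normalised imbalance of a rooted bifurcating tree given as nested 2-tuples.

    Requires at least three external taxa, otherwise the normalisation vanishes.
    """
    total = 0

    def n_leaves(node):
        nonlocal total
        if not isinstance(node, tuple):
            return 1                      # external taxon
        left, right = node
        n_left, n_right = n_leaves(left), n_leaves(right)
        total += abs(n_right - n_left)    # contribution of this internal node
        return n_left + n_right

    n = n_leaves(tree)
    return total / comb(n - 1, 2)


if __name__ == "__main__":
    caterpillar = (((("A", "B"), "C"), "D"), "E")   # completely imbalanced
    balanced = (("A", "B"), ("C", "D"))             # completely balanced
    print(colless_index(caterpillar), colless_index(balanced))
```

For a completely imbalanced tree with $n$ external taxa the numerator equals $(n-2) + (n-3) + \ldots + 1 = \binom{n-1}{2}$, which is why this normalisation gives unity in that case.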

7.2.1 Fixed branch lengths

In this section the distance of gene trees from their underlying species tree is analysed. The results are obtained using multispecies coalescent simulations. Here, it is assumed that a single individual is sampled from each species, and that the MRCA of a pair of chromosomes at a given locus cannot occur before the divergence time of the corresponding taxa. The internal branches within the species tree are assumed to be of the same length. As mentioned in the introduction of this chapter, the external taxa are assumed to have the same effective population sizes, $N_e$, and time is rescaled so that one unit of time corresponds to $2N_e$ generations (the number of chromosomes in each species at the present time). The results are plotted in Figs. 7.2-7.3.

The results shown in Fig. 7.2 indicate that the probability that a gene tree and its underlying species tree are topologically equivalent strongly depends not only on branch lengths, but on the symmetry of the underlying species tree as well. According to the results shown, the species trees with smaller $I_c$ (more symmetrical) tend to generate a larger fraction of topologically equivalent gene trees than the species trees with larger $I_c$, as pointed out also in [75, 76]. The effect of symmetry upon $P(D = 0)$ is shown explicitly here.

In Fig. 7.3 it is shown how $P(D = 0)$ depends on $n$. These results are averages over different values of the balance of the underlying species tree. As expected, the results in Fig. 7.3 show that as internal branch lengths increase, $P(D = 0)$ increases towards unity. Note that when the species tree has internal branches much shorter than unity (under the model employed, the coalescent time scale within internal taxa is larger than unity), the probability $P(D = 0)$ is approximately equal to the probability that a gene tree shares the topology of its underlying species tree purely by chance. Since the number of possible tree topologies grows fast with increasing $n$ [82], it is to be expected that $P(D = 0)$ rapidly approaches zero with increasing $n$. This is consistent with the results shown ($P(D = 0) \approx 0.1$ for $n = 5$, but $P(D = 0) \approx 0$ for $n \geq 10$). Also, the probability that a randomly chosen pair of gene trees resulting from the same species tree are topologically equivalent is shown in Fig. 7.3 by dashed lines. This probability is lower than the probability that a gene tree is at distance $D = 0$ from its underlying species tree.
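The growth of the number of topologies can be made concrete with a short sketch. It uses the standard count $(2n-5)!!$ of unrooted bifurcating topologies with $n$ labelled external taxa; the function name is illustrative. Note that gene-tree topologies generated under the coalescent are not uniformly distributed, so the inverse of this count is only a rough reference scale and not the value of $P(D = 0)$ in the model.

```python
def n_unrooted_topologies(n):
    """Number of unrooted bifurcating topologies with n >= 3 labelled taxa, (2n-5)!!."""
    count = 1
    for k in range(3, 2 * n - 4, 2):   # product of the odd numbers 3, 5, ..., 2n-5
        count *= k
    return count


if __name__ == "__main__":
    for n in (5, 10, 15, 20):
        print(n, n_unrooted_topologies(n))
```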

The results in Fig. 7.3 do not quantitatively agree with the results in [76] (compare the results for $n = 5$ and $n = 10$ to the corresponding results in Fig. 7 in [76]). The comparison reveals that for given branch lengths, $P(D = 0)$ is smaller in the model discussed here than in the model discussed in [76]. This is expected, because the coalescent time scale corresponding to internal taxa is longer in the model employed here than in the model employed in [76].


[Figure 7.2 plots, panels a and b: horizontal axes, branch lengths; vertical axes, $P(D = 0)$.]

Figure 7.2: Probability of topological equivalence of gene trees and their underlying species tree (unrooted). The results shown are for $n = 15$ (panel a), and for $n = 20$ (panel b). All branches within species trees are of the same length, and the length indicated in the figure is expressed in units of $2N_e$, where $N_e$ is the effective population size of the external taxa (each external taxon has the same $N_e$, but the effective population size of the internal taxa is equal to the sum of the sizes of its derived external taxa). Each color represents results corresponding to a fixed symmetry of a species tree, measured by the tree balance $I_c$. The corresponding values of the tree balance of the species trees used in panel a are: $I_c = 0.5238$ (blue), $I_c = 0.3524$ (red), and $I_c = 0.0857$ (green), and in b, the values are: $I_c = 0.4316$ (blue), $I_c = 0.3053$ (red), and $I_c = 0.0737$ (green). The number of generated gene trees corresponding to a single species tree is $10^2$.


[Figure 7.3 plot: horizontal axis, branch lengths; vertical axis, $P(D = 0)$.]

Figure 7.3: Probability of topological equivalence of trees. The solid lines show comparisons between gene trees and their underlying species tree (unrooted). The dashed lines show comparisons between pairs of gene trees conditional on the same species tree. All branches within species trees are of the same length, and the length indicated in the figure is expressed in units of $2N_e$, where $N_e$ is the effective population size of the external taxa (each external taxon has the same $N_e$, but the effective population size of the internal taxa is equal to the sum of the sizes of its derived external taxa). Different colors are for different numbers of external taxa: $n = 5$ (blue), $n = 10$ (red), $n = 15$ (green), and $n = 20$ (black). For each internal branch length, a species tree with a particular tree balance is generated, and 100 gene trees are generated for each species tree. This is repeated until results for 5 different values of $I_c$ are collected (except for $n = 5$, in which case there are only 3 possible values of $I_c$). Note that for $n = 15$ and $n = 20$, the results corresponding to three different values of $I_c$ are shown in Fig. 7.2. The results shown are averages over different values of $I_c$.


7.2.2 Yule trees

In this subsection, species trees generated according to the Yule process are considered. Under this process, speciations occur at a constant rate per species. The rate of speciation per unit of time (that is, per $2N_e$ generations) per species is denoted by $\lambda$. It follows that the time until a speciation event among $i$ species is exponentially distributed with parameter $i\lambda$. As in the previous sections, it is assumed that the resulting species trees are bifurcating.
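The waiting times between successive speciations under this process can be sampled as in the following minimal sketch. It is assumed here, for illustration only, that the process is followed from the moment two species exist until the $n$-th species appears; the function name `yule_waiting_times` is hypothetical.

```python
import random


def yule_waiting_times(n, lam, rng):
    """Waiting times between successive speciations while i = 2, ..., n-1 species exist.

    While i species are present, the time until the next speciation is
    exponentially distributed with parameter i * lam.
    """
    return [rng.expovariate(i * lam) for i in range(2, n)]


if __name__ == "__main__":
    rng = random.Random(1)
    n, lam = 15, 1.0
    last = [yule_waiting_times(n, lam, rng)[-1] for _ in range(20_000)]
    # sample mean of the waiting time among n - 1 species, next to 1 / (lam * (n - 1))
    print(sum(last) / len(last), 1.0 / (lam * (n - 1)))
```

The waiting time among $i$ species has mean $(i\lambda)^{-1}$, so the intervals become shorter on average as the number of species grows.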

Fig. 7.4 shows the probability $P(D = 0)$ that a gene tree and its underlying species tree are topologically equivalent as a function of $\lambda^{-1}$. The results shown agree qualitatively with those found in the case of fixed internal branch lengths (see Fig. 7.3). First, with increasing $\lambda^{-1}$, the probability $P(D = 0)$ for a given $n$ increases towards unity. Second, for fixed $\lambda^{-1}$, the probability $P(D = 0)$ decreases with increasing $n$. Note, however, that the internal branches within Yule trees are typically of different lengths. This is important, since how much a species tree differs from the resulting gene trees mainly depends on the length of the shortest internal branch within the species tree. This may be understood by analysing the distributions of tree-to-tree distances: between a species tree and its resulting gene trees, and between gene trees corresponding to the same species tree.

An example is shown in Fig. 7.5. The distributions of tree-to-tree distances may be very broad due to incomplete lineage sorting. Still, it is evident that with decreasing $\lambda^{-1}$, that is, with increasing $\lambda$ (compare different panels in each row), the distributions shift toward larger values of distances. On the one hand, it is to be expected that for $\lambda^{-1} \ll 1$ the distance between a gene tree and its underlying species tree approaches the maximum value (that is, $2(n - 3)$). This is consistent with the results shown. On the other hand, the gene trees are expected to be topologically equivalent to their underlying species tree for $\lambda^{-1} \gg 1$. However, in Fig. 7.5a (where $\lambda^{-1} = 100$), this is not achieved. As explained in the previous paragraph, this is because in the results shown, the number of external taxa is large ($n = 15$), hence young internal branches may be short compared to the coalescent time scale. Typically, the youngest internal branch is the shortest. Its length is on average equal to $(\lambda(n - 1))^{-1}$.

In summary, the results given in this chapter show the effect of branch lengths, number of external taxa, and symmetry of the species tree on tree-to-tree distances under incomplete lineage sorting. Still, one needs to quantify how other processes, such as gene flow, may alter the distribution of the distances between gene trees resulting from the same species tree. This, together with the results presented here, may allow for distinguishing between the cases when differences between trees are caused by incomplete lineage sorting alone, and the cases when incomplete lineage sorting is combined with gene flow.

[Figure 7.4 plot: horizontal axis, $\lambda^{-1}$; vertical axis, $P(D = 0)$.]

Figure 7.4: Same as in Fig. 7.3, but for species trees generated according to the Yule process occurring at rate $\lambda$ per unit time per species. Note that one unit of time stands for $2N_e$ generations, where $N_e$ is the effective population size of the external taxa (each external taxon has the same $N_e$, but the effective population size of the internal taxa is equal to the sum of the sizes of its derived external taxa). For each value of $\lambda$, 100 species trees with a particular tree balance are generated, and 100 gene trees are generated for each species tree. This is repeated until results for 5 different values of $I_c$ are collected (except for $n = 5$, in which case there are only 3 possible values of $I_c$). The results shown are averages over different values of $I_c$.


[Figure 7.5 plots, panels a to f: horizontal axes, $D$; vertical axes, $P(D(\alpha, a_i))$ in panels a to c and $P(D(a_i, a_j))$ in panels d to f.]

Figure 7.5: Probability distribution of distances between trees according to the metric introduced in [74]. Panels a, b, and c show comparisons of gene trees denoted by $a_i$ with their underlying species tree (unrooted) denoted by $\alpha$. The parameter $\lambda$ is the rate of speciations in the Yule model. Panels d, e, and f show the pairwise comparisons between gene trees $a_i$ conditional on the same species tree $\alpha$ for three different values of $\lambda$. The parameters used: $\lambda = 10^{-2}$ in panels a and d, $\lambda = 1$ in b and e, and $\lambda = 10^{2}$ in c and f. The number of external taxa in each panel is 15. The number of generated gene trees per species tree is $10^2$. A single species tree is generated for each $\lambda$.


8 Conclusions

In this thesis, patterns of genetic variation in structured populations were analysed.

The effect of population-size fluctuations on the single-locus patterns of neutral genetic variation was discussed in Chapters 3-4. First, it was shown in Chapter 3 how to compute the moments of the total branch length of gene genealogies under a varying population size (see also [I]). These results make it possible to determine under which conditions population-size fluctuations can be accommodated by using an effective population size. Second, in Chapter 4 the frequency spectrum of neutral SNPs was analysed. By using the frequency spectra of SNPs observed in twelve different human populations, it was shown how to infer the parameters of unknown demographic histories. The comparison of the histories obtained upon fitting the complete spectra to those obtained without taking into account low-frequency mutations demonstrated that the inferred histories are typically sensitive to low-frequency mutations. Yet, it was concluded that the qualitative features of the optimal histories found in individual human populations are in agreement with the out-of-Africa scenario.

Two-locus patterns of genetic variation under a varying population size were discussed in Chapter 5 (see also [II]). By using a model of recurrent bottlenecks, it was shown that the effective-population size approximation can be applied to two-locus gene genealogies when the time scale of population-size fluctuations is either much faster or much slower than both the time scale of coalescence, and the time scale of recombination. Otherwise, this approximation fails. A significant failure of the effective population-size approximation was found to occur in the case of severe reductions of population size, since, in this case, pairs of loci may exhibit long-range associations in a wide region of values of $R$. It was shown that regions of long-range associations between pairs of loci are a direct consequence of multiple mergers. This made it possible to argue that such regions must also be encountered in other models in which multiple mergers are obtained, such as models of selective sweeps [61] and models allowing for skewed offspring distribution among individuals. It was shown that in a particular limit, the model of recurrent bottlenecks and the model allowing for skewed offspring distribution introduced in [62] resemble each other.

The effect of migration between partly isolated populations was discussed in Chapter 6. Both the spatial and the mating model analysed in Chapter 6 are motivated by empirical observations made in populations of a marine snail, L. saxatilis. The results in Chapter 6 quantify the effect of multiple paternity during colonisation of empty islands, as well as during migration between colonised islands. It was found that population heterozygosity decays as the distance from the mainland increases. As expected, it was shown that populations with a high degree of multiple paternity can establish higher heterozygosity than populations with a low degree of multiple paternity. This effect of multiple paternity becomes more pronounced as the distance from the mainland increases. However, the effect is modest in the stationary state of the migration phase. Furthermore, it was observed that the heterozygosity of populations far from the mainland can fluctuate significantly, allowing for long periods of almost complete loss of genetic variation. Such fluctuations are expected in populations with a small rate of income of new genetic material. A method for quantifying the durations of low- and high-heterozygosity phases was proposed in Chapter 6. By using this method, it was found that by increasing the level of multiple paternity, the high-heterozygosity phase becomes longer. In comparison to the high-heterozygosity phase, the duration of the low-heterozygosity phase was found to be much less affected by the level of multiple paternity. Finally, by comparing the predictions of the model used to the empirical data taken from [10], it was found that the model describes well the observed patterns of genetic variation in the populations of L. saxatilis.

The results in Chapters 3-6 contribute to our understanding of the patterns of neutral genetic variation. As mentioned in the Introduction, such results are important when searching for genes under selection, and particularly the genes underlying speciation.

Chapter 7 discusses the relationship between gene trees and their underlying species tree, in the simplest case when the time scale of speciation is much shorter than the time scale between successive speciations. It was shown how the time scale between speciations, the number of derived species, and the symmetry of the species tree affect the tree-to-tree distance between gene trees and their underlying species tree. These results can serve as a reference for distinguishing gene trees taken across different species from gene trees taken across different ecotypes.

As explained in the Introduction, gene trees taken across different ecotypes reflect an ongoing process of speciation. Therefore, the existing methods of inferring species trees from their resulting gene trees must be modified to take into account the time scale of speciation. This, however, requires models of speciation that allow for gene flow between different ecotypes, for establishment of the so-called hybrid ecotypes, and also for occasional extinctions of some ecotypes established during the process of speciation.

In order to propose reasonable speciation models, insight into the mechanisms underlying speciation is required. Three crucial questions need to be answered [5, 6]. First, under which conditions do the primary reproductive barriers evolve? Second, which are the genes underlying speciation? And third, how do genome-wide barriers evolve from primary barriers? These questions will be dealt with in a joint collaboration with Kerstin Johannesson et al. The empirical genome-wide studies will make it possible to detect regions under selection and thus candidate genes underlying speciation. By tracing such genes throughout different stages of speciation, valuable insight into the evolution of reproductive barriers will be gained, since the experimental findings will make it possible to test the validity of existing speciation models, and to modify them accordingly. To this end, we plan to make use of the model proposed in [83]. We will extend it by allowing for recombination and multiple paternity to start with. Upon establishing a reasonable speciation model, theoretical insight into speciation mechanisms will be gained, which will make it possible to, for example, test the role of gene flow between ecotypes in speciation, and to estimate the typical time of speciation. These and related questions will be discussed in detail in my PhD thesis.


A Moment-generating function of $S_n$

In this appendix, it is shown how to compute the moments of the number of segregating sites $S_n$ in a sample of size $n$ by using the moments of the corresponding total branch length $T_n$. As in the main text, it is assumed in the following that demographic history may have an arbitrary dependence on time. Furthermore, mutations are neutral, and they occur with rate $\mu$ according to the infinite-sites model. The scaled mutation rate is $\theta = 2\mu N_0$, where $N_0$ stands for the population size of a haploid population at the present time (the time of sampling). The time is rescaled, so that $t$ units of time stand for $\ell = \lfloor t N_0 \rfloor$ generations.

As explained in Section 2.2, the moments of the total branch length $T_n$ can be computed using $F_n(q)$, the Laplace transform of the probability distribution $\rho(T_n)$ (see Eq. (2.12)). Denote by $G_n(q)$ the Laplace transform of the probability distribution of $S_n$. On a given gene genealogy with the total branch length equal to $T_n$, under the mutation model assumed, it follows that $S_n$ is Poisson distributed with mean $\theta T_n/2$ [9]. This suggests that the function $G_n$ may be expressed in terms of $F_n$. Using the probability distribution of $S_n$ conditional on $T_n$, one finds:

\[
G_n(q) = \int_0^\infty dT_n\, \rho(T_n) \sum_{S_n=0}^{\infty} e^{-q S_n}\, e^{-\frac{\theta}{2} T_n}\, \frac{\left(\frac{\theta}{2} T_n\right)^{S_n}}{S_n!}
= \int_0^\infty dT_n\, \rho(T_n)\, e^{-\frac{\theta}{2} T_n \left(1 - e^{-q}\right)}
= F_n\!\left[\frac{\theta}{2}\left(1 - e^{-q}\right)\right] . \tag{A.1}
\]


This shows how Eq. (2.14) is obtained. If the population size is constant in time, Eq. (A.1) reduces to Eq. (1.3) in [89].
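As a minimal numerical sketch of Eq. (A.1), consider the special case $n = 2$ with constant population size. Assuming that, in the time units used here, two lineages coalesce at unit rate, $T_2$ is twice an exponentially distributed time with unit mean, so that $F_2(q) = 1/(1+2q)$, and mixing the Poisson distribution with mean $\theta T_2/2$ over this density gives a geometric distribution for $S_2$. The function names below are illustrative and are not taken from the thesis.

```python
import math


def laplace_T2(q):
    """F_2(q) for constant population size, assuming T_2 = 2 * Exp(1)."""
    return 1.0 / (1.0 + 2.0 * q)


def pmf_S2(s, theta):
    """P(S_2 = s): Poisson(theta * T_2 / 2) mixed over the density of T_2.

    The integral over T_2 can be done in closed form and gives a
    geometric distribution with mean theta.
    """
    return (1.0 / (1.0 + theta)) * (theta / (1.0 + theta)) ** s


def laplace_S2_direct(q, theta, s_max=2000):
    """G_2(q) evaluated directly as a (truncated) sum over S_2."""
    return sum(math.exp(-q * s) * pmf_S2(s, theta) for s in range(s_max))


def laplace_S2_via_F(q, theta):
    """G_2(q) evaluated through Eq. (A.1): F_2 at (theta/2)(1 - e^{-q})."""
    return laplace_T2(0.5 * theta * (1.0 - math.exp(-q)))


if __name__ == "__main__":
    for theta in (0.5, 3.0):
        for q in (0.0, 0.3, 1.0):
            print(theta, q, laplace_S2_direct(q, theta), laplace_S2_via_F(q, theta))
```

The two evaluations agree to numerical precision, which illustrates how moments of $S_n$ follow from those of $T_n$ by differentiating $G_n$ at $q = 0$.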


B Probabilities appearing in Eqs. (2.16), (2.17)

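The probabilities appearing in Eqs. (2.16), (2.17) are [41]

\[
p(k,i) = \frac{\binom{n-k}{i-1}}{\binom{n-1}{i}}\, \frac{k-1}{i} , \tag{B.1}
\]

\[
p(k,i;k,j) =
\begin{cases}
\dfrac{\binom{n-i-j-1}{k-3}}{\binom{n-1}{k-1}} , & \text{for } k > 2 , \\[2ex]
p(k,i) , & \text{for } k = 2 \text{ and } i + j = n , \\[1ex]
0 , & \text{for } k = 2 \text{ and } i + j \neq n ,
\end{cases} \tag{B.2}
\]

\[
p(k,i;m,j) = \left(\delta_{\lfloor i/j \rfloor, 0} + \delta_{i,j}\right) p_a(k,i;m,j) + \left(\delta_{\lfloor (i+j)/n \rfloor, 0} + \delta_{i+j,n}\right) p_b(k,i;m,j) . \tag{B.3}
\]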

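The probabilities in Eq. (B.3) are [41]:

\[
p_a(k,i;m,j) =
\begin{cases}
\displaystyle\sum_{t=2}^{a_1} \frac{\binom{m-k}{t-1}}{\binom{m-1}{t}}\, \frac{k-1}{m}\, \frac{\binom{i-j-1}{t-2}\binom{n-i-1}{m-t-1}}{\binom{n-1}{m-1}} , & \text{for } j < i , \\[3ex]
\dfrac{k-1}{m(m-1)}\, \dfrac{\binom{n-i-1}{m-2}}{\binom{n-1}{m-1}} , & \text{for } i = j ,
\end{cases} \tag{B.4}
\]

\[
p_b(k,i;m,j) =
\begin{cases}
\displaystyle\sum_{t=1}^{a_2} \frac{\binom{m-k}{t-1}}{\binom{m-1}{t}}\, \frac{(k-1)(m-t)}{tm}\, \frac{\binom{i-1}{t-1}\binom{n-i-j-1}{m-t-2}}{\binom{n-1}{m-1}} , & \text{for } k > 2 , \\[3ex]
\dfrac{1}{jm}\, \dfrac{\binom{n-m}{j-1}}{\binom{n-1}{j}} , & \text{for } k = 2 \text{ and } i + j = n .
\end{cases} \tag{B.5}
\]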


In Eqs. (B.4)-(B.5), $a_1$ and $a_2$ are given by $a_1 = \min(m-k+1,\, i-j+1)$ and $a_2 = \min(m-2,\, m-k+1,\, i)$.
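The following sketch evaluates Eq. (B.1) exactly with rational arithmetic. It assumes, following our reading of [41], that $p(k,i)$ is a probability distribution over $i = 1, \ldots, n-k+1$ for fixed $n$ and $k$, so that it should sum to one. The second function is an algebraically equivalent form, included only as a cross-check; both function names are illustrative.

```python
from fractions import Fraction
from math import comb


def p_single(n, k, i):
    """Eq. (B.1): binom(n-k, i-1) / binom(n-1, i) * (k-1) / i."""
    return Fraction(comb(n - k, i - 1), comb(n - 1, i)) * Fraction(k - 1, i)


def p_single_alt(n, k, i):
    """Equivalent form binom(n-i-1, k-2) / binom(n-1, k-1), used as a cross-check."""
    return Fraction(comb(n - i - 1, k - 2), comb(n - 1, k - 1))


if __name__ == "__main__":
    n = 6
    for k in range(2, n + 1):
        values = [p_single(n, k, i) for i in range(1, n - k + 2)]
        print(k, [str(v) for v in values], "sum =", sum(values))
```

Using `Fraction` avoids rounding, so the normalisation can be checked exactly rather than up to floating-point error.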


C Tests of neutrality based on $\xi_i$

Commonly used tests of neutrality are based on the frequency spectrum of SNPs. One task is to compare the genetic variation of sampled sequences to the variation expected under a given null-model. The standard null-model assumes the Wright-Fisher population of haploid individuals and neutral mutations which accumulate along genetic sequences according to the infinite-sites model with the scaled mutation rate $\theta$ (see Section 2.2). In reality, the parameter $\theta$ is unknown. But, as shown in the introduction of this section, under the standard null-model, $\theta$ can be estimated using $\langle \xi_i \rangle$ or by using linear combinations of $\langle \xi_i \rangle$. Indeed, in the constant population-size case one has [41]

\[
\langle \xi_i \rangle = \frac{\theta}{i} . \tag{C.1}
\]

Any two independent estimates of $\theta$ are expected to agree under the given null-model. Therefore, if sampled sequences show that the two estimates of $\theta$ strongly disagree, then the null-model assumed must be rejected.

Examples of tests of neutrality based on the frequency spectrum of SNPs include Tajima's test [53], Fu and Li's test [54], and singleton-exclusive versions of Tajima's test [55]. These, and many other tests of this kind, differ in how the two estimators of $\theta$ in each of them are chosen. Yet it has been shown in [60] that any test of neutrality $T$ based on the frequency spectrum of SNPs under the standard null-model can be expressed as

\[
T = \frac{\theta_1^{(T)} - \theta_2^{(T)}}{\mathrm{var}\left[\theta_1^{(T)} - \theta_2^{(T)}\right]} , \tag{C.2}
\]


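where $\theta_i^{(T)}$, $i = 1, 2$, are two estimates of $\theta$. They may be written as linear combinations of $\xi_j$ [60]

\[
\theta_i^{(T)} = \sum_{j=1}^{n-1} \omega_{ij}^{(T)}\, j\, \xi_j , \quad i = 1, 2 , \tag{C.3}
\]

where

\[
\sum_{j=1}^{n-1} \omega_{ij}^{(T)} = 1 , \quad \text{and} \quad \sum_{j=1}^{n-1} \left(\omega_{1j}^{(T)} - \omega_{2j}^{(T)}\right) = 0 . \tag{C.4}
\]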

The first equality in Eq. (C.4) serves as a normalisation condition for the different estimates of $\theta$, and the second equality ensures that the averages $\langle \theta_1 \rangle$ and $\langle \theta_2 \rangle$ are equal.

Recently, [56] has pointed out that the findings in [60] may be further generalised to account for the tests of neutrality that may incorporate an arbitrary demographic history or a selection scenario (or both) in their null-model. Furthermore, a possibility of building an optimal test of neutrality for a given null-model is discussed in [56] (see [56] for criteria satisfied by optimal tests of neutrality).


D Frequency spectrum of SNPs: formulae

In this appendix, formulae for the frequency spectrum of SNPs and for three tests of neutrality under a single bottleneck are listed. The model of a bottleneck (see Fig. 4.3) is described by the following parameters: $t_0$ is the time to the bottleneck, $t_B$ is the duration of the bottleneck, $x$ is the population size during the bottleneck relative to the population size at the present time, and $x'$ is the population size before the bottleneck relative to the population size at the present time. Note that, unlike in Fig. 3.2a, in this model the population size before the bottleneck may differ from the population size at the present time. In the case $x' = 1$ the two models are the same.

D.1 Moments of ξi

The first two moments of $\xi_i$, $i = 1, \ldots, n-1$, can be computed using the results in [41] combined with the results in [I] (see also [49]). It follows that

\[
\langle \xi_i \rangle = \frac{\theta}{2} \sum_{m_1=2}^{n} a_{m_1}^{(ni)} f_{m_1} , \quad \text{for } i = 1, \ldots, n-1 , \tag{D.1}
\]

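where $a_{m_1}^{(ni)}$ and $f_{m_1}$ are:

\[
a_{m_1}^{(ni)} = \sum_{k=2}^{m_1} k\, c_{nkm_1}\, p(k,i) , \tag{D.2}
\]

\[
f_{m_1} = b_{m_1}^{-1} \left(1 - (1-x)\, e^{-b_{m_1} t} + (x' - x)\, e^{-b_{m_1} t}\, e^{-b_{m_1} s_B}\right) . \tag{D.3}
\]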


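Here, $b_{m_1} = \binom{m_1}{2}$, and $c_{nkm_1}$ is given by Eq. (11) in [I]. This result is consistent with Eq. (1) in [15], assuming $M = 3$, and identifying $T_1 = t$, $T_2 = t_B$, $T_3 = \infty$, $N_1 = N_0$, $N_2 = N_B$, and $N_3 = x' N_0$. The second moment is given by

\[
\langle \xi_i \xi_j \rangle = \delta_{i,j} \left( \langle \xi_i \rangle + \frac{\theta^2}{4} \sum_{m_1=2}^{n} \sum_{k=2}^{m_1} a_{m_1 k}^{(nij)} f_{m_1 k} \right)
+ \frac{\theta^2}{4} \sum_{m_1=2}^{n} \left( \sum_{k=2}^{m_1} g_{m_1 k}^{(nij)} f_{m_1 k} + \sum_{m_2=2}^{m_1} h_{m_1 m_2}^{(nij)} f_{m_1 m_2} \right) , \tag{D.4}
\]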

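where

\[
a_{m_1 k}^{(nij)} = 2k\, c_{nkm_1}\, c_{kkk}\, p(k,i) , \tag{D.5}
\]

\[
g_{m_1 k}^{(nij)} = 2k(k-1)\, c_{nkm_1}\, c_{kkk}\, p(k,i;k,j) , \tag{D.6}
\]

\[
h_{m_1 m_2}^{(nij)} = \sum_{l=m_2}^{m_1} l\, c_{nlm_1} \sum_{k=2}^{m_2} k\, c_{lkm_2} \left[ p(k,i;l,j) + p(k,j;l,i) \right] . \tag{D.7}
\]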

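For the terms $f_{m_1 m_2}$ in Eq. (D.4) the cases $m_1 \neq m_2$ and $m_1 = m_2$ are considered separately. For the case $m_1 \neq m_2$ one obtains

\[
f_{m_1 m_2} = \frac{1}{b_{m_2}} \left( \frac{1 - e^{-b_{m_1} t}\left(1 - x^2 + \left(x^2 - x'^2\right) e^{-b_{m_1} s_B}\right)}{b_{m_1}}
- \left(1 - x + (x - x')\, e^{-b_{m_2} s_B}\right) \frac{e^{-b_{m_2} t} - e^{-b_{m_1} t}}{b_{m_1} - b_{m_2}}
+ x (x' - x)\, e^{-b_{m_1} t}\, \frac{e^{-b_{m_2} s_B} - e^{-b_{m_1} s_B}}{b_{m_1} - b_{m_2}} \right) . \tag{D.8}
\]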

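For the case $m_1 = m_2$ one has

\[
f_{m_1 m_1} = \frac{1}{b_{m_1}} \left( \frac{1 - e^{-b_{m_1} t}\left(1 - x^2 + \left(x^2 - x'^2\right) e^{-b_{m_1} s_B}\right)}{b_{m_1}}
- t \left(1 - x + (x - x')\, e^{-b_{m_1} s_B}\right) e^{-b_{m_1} t}
+ x (x' - x)\, s_B\, e^{-b_{m_1} t}\, e^{-b_{m_1} s_B} \right) . \tag{D.9}
\]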
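The expressions for $f_{m_1}$ and $f_{m_1 m_2}$ are easy to mistype, so a small numerical sketch may help. The code below implements Eqs. (D.3), (D.8) and (D.9) as we read them, under two assumptions that should be checked against [I]: the last term of Eq. (D.3) carries the factor $e^{-b_{m_1} t}$, and the rates $b_{m_1}$, $b_{m_2}$ may be passed as real numbers so that the limit $b_{m_2} \to b_{m_1}$ can be probed. Function names are illustrative.

```python
import math


def b(m):
    """b_m = binom(m, 2)."""
    return m * (m - 1) / 2.0


def f_single(m, t, s_b, x, x_prime=1.0):
    """Eq. (D.3), assuming the last term carries exp(-b_m t)."""
    bm = b(m)
    return (1.0 - (1.0 - x) * math.exp(-bm * t)
            + (x_prime - x) * math.exp(-bm * t) * math.exp(-bm * s_b)) / bm


def f_pair(b1, b2, t, s_b, x, x_prime=1.0):
    """Eq. (D.8) for b1 != b2, and Eq. (D.9) for b1 == b2."""
    e1t, e1s = math.exp(-b1 * t), math.exp(-b1 * s_b)
    first = (1.0 - e1t * (1.0 - x**2 + (x**2 - x_prime**2) * e1s)) / b1
    if b1 == b2:
        second = t * (1.0 - x + (x - x_prime) * e1s) * e1t
        third = x * (x_prime - x) * s_b * e1t * e1s
    else:
        e2t, e2s = math.exp(-b2 * t), math.exp(-b2 * s_b)
        second = (1.0 - x + (x - x_prime) * e2s) * (e2t - e1t) / (b1 - b2)
        third = x * (x_prime - x) * e1t * (e2s - e1s) / (b1 - b2)
    return (first - second + third) / b2


if __name__ == "__main__":
    # Constant population size (x = x' = 1): f_m = 1/b_m and f_{m1 m2} = 1/(b_m1 b_m2).
    print(f_single(3, 0.3, 0.7, 1.0), 1.0 / b(3))
    print(f_pair(b(3), b(2), 0.3, 0.7, 1.0), 1.0 / (b(3) * b(2)))
    # Eq. (D.8) approaches Eq. (D.9) as the two rates approach each other.
    print(f_pair(3.0, 3.0 + 1e-6, 0.3, 0.7, 0.2, 1.5), f_pair(3.0, 3.0, 0.3, 0.7, 0.2, 1.5))
```

Two sanity checks are built in: for $x = x' = 1$ the expressions reduce to the constant-size values, and Eq. (D.8) tends continuously to Eq. (D.9) when the two rates coincide.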


D.2 Tests of neutrality

Let Tajima's test be denoted by $D$, Fu and Li's test by $F$, and the singleton-exclusive version of Tajima's test assuming that no outgroup is available by $Y$.

The weights $\omega_{j,i}^{(D)}$, $j = 1, 2$, $i = 1, \ldots, n-1$, for the two estimators in Tajima's test $D$ are [60]

\[
\omega_{1,i}^{(D)} = 2\, \frac{n-i}{n(n-1)} , \quad \omega_{2,i}^{(D)} = (i\, a_n)^{-1} . \tag{D.10}
\]

Here, $a_n = \sum_{j=1}^{n-1} j^{-1}$ denotes the $(n-1)$-st harmonic number. The weights in $\theta_1^{(F)}$ and $\theta_2^{(F)}$, appearing in $F$, are [60]

\[
\omega_{1,i}^{(F)} = \omega_{2,i}^{(D)} , \quad \omega_{2,i}^{(F)} = \delta_{i,1} . \tag{D.11}
\]

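For the singleton-exclusive version of Tajima's test, $Y$, one has [60]

\[
\omega_{1,i}^{(Y)} = 2\left(1 - \delta_{i,1} - \delta_{i,n-1}\right) \frac{n-i}{n(n-3)} , \quad
\omega_{2,i}^{(Y)} = \frac{1 - \delta_{i,1} - \delta_{i,n-1}}{i \sum_{j=2}^{n-2} j^{-1}} . \tag{D.12}
\]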
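The normalisation in Eq. (C.4) can be checked directly for the three sets of weights. The sketch below builds the weights of Eqs. (D.10)-(D.12) with exact rational arithmetic and applies the estimator of Eq. (C.3). For $Y$ it assumes that the excluded frequency classes are $i = 1$ and $i = n-1$, which is the choice under which both weight sets sum to one. Function names are illustrative.

```python
from fractions import Fraction


def harmonic(n):
    """a_n = sum_{j=1}^{n-1} 1/j."""
    return sum(Fraction(1, j) for j in range(1, n))


def weights_tajima(n):
    """Eq. (D.10): weights of the two estimators in Tajima's test D."""
    a_n = harmonic(n)
    w1 = {i: Fraction(2 * (n - i), n * (n - 1)) for i in range(1, n)}
    w2 = {i: 1 / (i * a_n) for i in range(1, n)}
    return w1, w2


def weights_fu_li(n):
    """Eq. (D.11): first weights as omega_2^(D), second weights select i = 1."""
    _, w1 = weights_tajima(n)
    w2 = {i: Fraction(int(i == 1)) for i in range(1, n)}
    return w1, w2


def weights_singleton_exclusive(n):
    """Eq. (D.12), assuming classes i = 1 and i = n - 1 are excluded (n >= 4)."""
    keep = {i: int(i not in (1, n - 1)) for i in range(1, n)}
    norm = sum(Fraction(1, j) for j in range(2, n - 1))
    w1 = {i: Fraction(2 * keep[i] * (n - i), n * (n - 3)) for i in range(1, n)}
    w2 = {i: keep[i] / (i * norm) for i in range(1, n)}
    return w1, w2


def estimate(weights, xi):
    """Eq. (C.3): theta estimate as sum_j omega_j * j * xi_j."""
    return sum(weights[j] * j * xi[j] for j in weights)


if __name__ == "__main__":
    n, theta = 10, Fraction(7, 2)
    xi = {i: theta / i for i in range(1, n)}  # expected spectrum, Eq. (C.1)
    for make in (weights_tajima, weights_fu_li, weights_singleton_exclusive):
        w1, w2 = make(n)
        print(make.__name__, sum(w1.values()), sum(w2.values()),
              estimate(w1, xi), estimate(w2, xi))
```

With the expected spectrum of Eq. (C.1) as input, every estimator returns $\theta$ exactly, so the numerators of all three tests vanish on average under the standard null-model.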

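For the model of a bottleneck in the case $x' = 1$, one finds

\[
\left\langle \left[\theta_1^{(D)} - \theta_2^{(D)}\right]_t \right\rangle = \frac{\theta}{2} (1-x) \left( \frac{1}{a_n} \sum_{i=2}^{n} \frac{d_{n;i}}{b_i}\, e^{-b_i t}\left(1 - e^{-b_i s_B}\right) - 2\, e^{-t}\left(1 - e^{-s_B}\right) \right) , \tag{D.13}
\]

\[
\left\langle \left[\theta_1^{(F)} - \theta_2^{(F)}\right]_t \right\rangle = \frac{\theta}{2} (1-x) \sum_{i=2}^{n} \left( \frac{\delta_{n;i}}{b_i}\, \frac{1}{n-1} - \frac{d_{n;i}}{b_i} \left( \frac{1}{a_n} + \frac{1}{n-1} \right) \right) e^{-b_i t}\left(1 - e^{-b_i s_B}\right) , \tag{D.14}
\]

\[
\left\langle \left[\theta_1^{(Y)} - \theta_2^{(Y)}\right]_t \right\rangle = \frac{\theta}{2} (1-x) \left( \frac{1}{n-3} \sum_{i=2}^{n} \left( -\frac{\delta_{n;i}}{b_i} \left(q_n - \frac{2}{n}\right) + \frac{d_{n;i}}{b_i} \left(n q_n - \frac{2}{n}\right) \right) e^{-b_i t}\left(1 - e^{-b_i s_B}\right)
- \left( 2 + \frac{2}{n-1} \left(q_n - \frac{2}{n}\right) \right) e^{-t}\left(1 - e^{-s_B}\right) \right) . \tag{D.15}
\]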

The coefficients $d_{n;i}$ are given by Eq. (21) in [I], and

\[
q_n = \frac{n-3}{(n-1)\, a_n - n} , \quad \delta_{n;i} = \sum_{j=2}^{i} j^2\, c_{nji} . \tag{D.16}
\]


The identities

\[
\sum_{i=2}^{n} \frac{\delta_{n;i}}{b_i} = 2\,(n - 1 + a_n) , \tag{D.17}
\]

\[
\sum_{i=2}^{n} \frac{d_{n;i}}{b_i} = 2\, a_n \tag{D.18}
\]

are used for deriving Eqs. (D.13)-(D.15). The validity of Eq. (D.17) has been checked for $n$ between $n = 3$ and $n = 100$. Eq. (D.18) is given in [I]. Note that, using the results in Chapter 3, one may compute the variance of the numerators of the tests analysed.

In the following, the mean numerators in the three tests listed above are discussed in the limit of weak bottlenecks, $s_B \to 0$, and in the limit of strong bottlenecks, $s_B \to \infty$.

In the limit $s_B \to 0$, the mean difference between the two estimators of $\theta$ in all three tests is $\langle [\theta_1^{(T)} - \theta_2^{(T)}]_t \rangle \approx 0$. As expected, the tests of neutrality in this case do not show significant deviations from the constant population-size neutral model.

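By contrast, in the limit $s_B \to \infty$, one finds:

\[
\left\langle \left[\theta_1^{(D)} - \theta_2^{(D)}\right]_t \right\rangle \approx \frac{\theta}{2} (1-x) \left( \frac{1}{a_n} \sum_{i=2}^{n} \frac{d_{n;i}}{b_i}\, e^{-b_i t} - 2\, e^{-t} \right) , \tag{D.19}
\]

\[
\left\langle \left[\theta_1^{(F)} - \theta_2^{(F)}\right]_t \right\rangle \approx \frac{\theta}{2} (1-x) \sum_{i=2}^{n} \left( \frac{\delta_{n;i}}{b_i}\, \frac{1}{n-1} - \frac{d_{n;i}}{b_i} \left( \frac{1}{a_n} + \frac{1}{n-1} \right) \right) e^{-b_i t} , \tag{D.20}
\]

\[
\left\langle \left[\theta_1^{(Y)} - \theta_2^{(Y)}\right]_t \right\rangle \approx \frac{\theta}{2} (1-x) \left( \left( -2 - \frac{2}{n-1} \left(q_n - \frac{2}{n}\right) \right) e^{-t}
+ \sum_{i=2}^{n} \left( -\frac{\delta_{n;i}}{b_i} \left(q_n - \frac{2}{n}\right) + \frac{d_{n;i}}{b_i} \left(n q_n - \frac{2}{n}\right) \right) \frac{e^{-b_i t}}{n-3} \right) . \tag{D.21}
\]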

There are several remarks concerning Eqs. (D.19)-(D.21).

First, taking the limit $t \to 0$ (keeping $x$ constant) in Eqs. (D.19)-(D.21) results in $\langle [\theta_1^{(T)} - \theta_2^{(T)}]_{t \to 0} \rangle \approx 0$ for all three tests. This is expected, because if sampling is done at $t \to 0$ (immediately after the strong bottleneck) most, if not all, coalescent events of the underlying gene genealogy take place during the bottleneck phase, and the contribution from the population-size expansion is negligible (note that in this limit one has $t/x \to 0$, where $x$ is the coalescent time scale within the bottleneck). Therefore, this case is identical to the case of constant population size $x N_0$.


[Figure D.1: panel "After the Bottleneck"; vertical axis $\langle [\theta_1^{(Y)} - \theta_2^{(Y)}] \rangle$, horizontal axis $t$.]

Figure D.1: Shown is $\langle [\theta_1^{(Y)} - \theta_2^{(Y)}] \rangle$ at time $t$ after the bottleneck obtained analytically. Sample sizes: $n = 5$ (denoted by blue), $n = 10$ (denoted by red), and $n = 20$ (denoted by green). Remaining parameters used: $s_B = 100$, and $x = 0.05$.

Second, one has that $\langle [\theta_1^{(D)} - \theta_2^{(D)}]_t \rangle$ and $\langle [\theta_1^{(F)} - \theta_2^{(F)}]_t \rangle$ are strictly non-positive for $t \in (0, \infty)$ and for arbitrary $n$. By contrast, $\langle [\theta_1^{(Y)} - \theta_2^{(Y)}]_t \rangle$ in the limit of strong bottlenecks may be positive, zero or negative, depending on the sample size (see Fig. D.1). According to Fig. D.1, the sign of the numerator of the singleton-exclusive version of Tajima's statistic matches the sign of Tajima's numerator after the bottleneck only for large sample sizes, $n \geq 10$. Therefore, omitting the nucleotide frequencies $i = 1$ and $i = n - 1$ when constructing a test of neutrality may result not only in a weaker signal of the underlying demographic history, as originally argued in [55], but it may even lead to erroneous demography estimates.

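Third, it is interesting to note that the effect of a strong bottleneck on the numerator of a given test is indistinguishable from the effect of a corresponding rapid population-size expansion (from $N_B$ to $N_0$):

\[
\lim_{s_B \to \infty} \left\langle \left[\theta_1^{(T)} - \theta_2^{(T)}\right]_t \right\rangle = \left\langle \left[\theta_1^{(T)} - \theta_2^{(T)}\right]_t \right\rangle_{\text{expansion}} . \tag{D.22}
\]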

It is to be remarked that, although it is possible to derive analytical results for the mean of the numerator of a given test, as well as for the variance of the test numerator, the mean of the test itself may not be computed analytically. Yet one may use computer simulations for such a calculation.

It is common practice to assume that the sign of the mean numerator of the test considered corresponds to the sign of the mean of the test. However, this may not be true. As has been shown in [53], the mean of Tajima's test under the standard null-model does not evaluate to zero, but to a slightly negative value, which depends on the sample size $n$ and on $\theta$. The question is how the sign of the average test numerator is related to the sign of the average test when the standard null-model is violated due to population-size fluctuations.

The sign of the mean numerators of $D$, $F$, and $Y$ can be explained in terms of the parameters $t_0$ and $s_B$. This is possible because the sign of a given numerator is dictated by the shape of the spectrum $i \langle \xi_i \rangle$. According to the explicit results listed in Appendix D, it follows that the numerators of both Tajima's and Fu and Li's test are strictly non-negative during the bottleneck (after the rapid population-size decline), and that they can be zero, positive, or negative after the bottleneck depending on $s_B$ (see also [90]). If $s_B \geq 5$ (strong bottleneck), these numerators are non-positive. Interestingly, the sign of the numerator of the singleton-exclusive version of Tajima's test agrees with the sign of Tajima's numerator only for large $n$, that is for $n \geq 10$ (results not shown).

Computer simulations show that the sign of the mean of the test may differ from the sign of the mean of the test numerator. An example is shown in Fig. D.2. Here the circles denote $\langle D \rangle$ computed using computer simulations, and the solid line shows $\langle D_0 \rangle$, defined as:

\[
\langle D_0 \rangle = \frac{\left\langle \theta_1^{(D)} - \theta_2^{(D)} \right\rangle}{\left\langle \mathrm{var}\left[\theta_1^{(D)} - \theta_2^{(D)}\right] \right\rangle} . \tag{D.23}
\]

Note that the definition of $\langle D_0 \rangle$ implies that its sign agrees with the sign of the mean numerator of Tajima's test ($\langle \theta_1^{(D)} - \theta_2^{(D)} \rangle$).

As can be seen in Fig. D.2, at time $t_0 \approx 10^{-2}$ after the bottleneck, one has $\langle D_0 \rangle > 0$ and $\langle D \rangle < 0$. Here, the difference between $\langle D_0 \rangle$ and $\langle D \rangle$ is much larger than the difference between the two values in the constant population-size case. Note that the disagreement between the circles and the solid line at $t_0 > 10$ agrees with that reported by Tajima in [53].


[Figure D.2: two panels, "During" (left) and "After the Bottleneck" (right); vertical axis $\langle D \rangle$, horizontal axis $t_0$.]

Figure D.2: Tajima's test for the model of a single bottleneck shown in Fig. 3.2a. Shown are $\langle D \rangle$ (circles) obtained using coalescent simulations, and $\langle D_0 \rangle$ (black solid line) obtained analytically (see Eq. (D.23)). Left: sampling during the bottleneck, at time $t_0$ after the population-size decline. Right: sampling after the bottleneck, at time $t_0$ after the population recovered from the bottleneck. Parameters used: $s_B = 2$, $x = 0.05$, $n = 50$, $\theta = 10$. Averaging is done over $10^5$ independent realizations of gene genealogies.


E Effective population size under the mating model introduced in Chapter 6

Consider an isolated, well-mixed diploid population with $N_f$ females and $N_m$ males ($N_f \gg 1$ and $N_m \gg 1$ are independent of time) which mate according to the model introduced in Chapter 6. In order to compute the corresponding effective population size under this mating model, one can compute the population homozygosity $F_\ell$ in generation $\ell$. In what follows, it is assumed that the mutation rate per generation per allele is $\mu \ll 1$, and the infinite alleles model is employed.

To begin with, note that in a diploid population with effective population size $N_e \gg 1$, $F_\ell$ may be computed according to:

\[
F_\ell \approx \frac{1}{2N_e}\, \epsilon_\ell + \left(1 - \frac{1}{2N_e}\right) \chi_\ell . \tag{E.1}
\]

In the first term, $\epsilon_\ell$ denotes the probability that at time $\ell$, two alleles within a single randomly chosen individual are identical. This probability is called the inbreeding coefficient. In the second term, $\chi_\ell$ is the probability that two alleles sampled in generation $\ell$ from two different randomly chosen individuals are identical. This probability is called the coancestry. The sign $\approx$ appears because for large populations ($N_e \gg 1$) the probability that one samples a pair of alleles belonging to the same individual is $(2N_e - 1)^{-1} \approx (2N_e)^{-1}$. The effective population size for the mating model discussed in Chapter 6 can be determined using explicit recursion relations for $\epsilon_\ell$ and $\chi_\ell$. Under the model described, one obtains:

\[
\epsilon_\ell = (1-\mu)^2\, \chi_{\ell-1} , \tag{E.2}
\]

\[
\chi_\ell = (1-\mu)^2 \left\{ \frac{1}{4N_f} \left[ \frac{1 + \epsilon_{\ell-1}}{2} (1 + \kappa) + (3 - \kappa)\, \chi_{\ell-1} \right]
+ \frac{1}{4} \left(1 - \frac{1}{N_f}\right) \left[ \frac{1 + \epsilon_{\ell-1}}{2N_m} + \left(4 - \frac{1}{N_m}\right) \chi_{\ell-1} \right] \right\} , \tag{E.3}
\]

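where $\kappa$ stands for the probability that two children which share a mother have the same father.

Taking only the leading order terms in Eqs. (E.2)-(E.3) results in:

\[
\epsilon_\ell \approx (1 - 2\mu)\, \chi_{\ell-1} , \tag{E.4}
\]

\[
\chi_\ell \approx \frac{1}{8} \left( \frac{1+\kappa}{N_f} + \frac{1}{N_m} \right) + \frac{1}{8} \left( \frac{1+\kappa}{N_f} + \frac{1}{N_m} \right) \epsilon_{\ell-1}
+ \left( 1 - \frac{2}{8} \left( \frac{1+\kappa}{N_f} + \frac{1}{N_m} \right) \right) \chi_{\ell-1} - 2\mu\, \chi_{\ell-1} . \tag{E.5}
\]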

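Identifying

\[
N_e = 4 \left( \frac{1+\kappa}{N_f} + \frac{1}{N_m} \right)^{-1} , \tag{E.6}
\]

and making use of Eq. (E.1), yields:

\[
F_\ell = \frac{1}{2N_e} + \left( 1 - \frac{1}{2N_e} - 2\mu \right) F_{\ell-1} . \tag{E.7}
\]

It follows that $\langle F \rangle$, the expected homozygosity in the stationary state, is

\[
\langle F \rangle = \frac{1}{1 + \theta} , \tag{E.8}
\]

where $\theta = 4\mu N_e$. Note that Eq. (E.6) in the case $N_m = N_f = N$ becomes:

\[
N_e = \frac{4N}{2 + \kappa} . \tag{E.9}
\]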
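As an illustrative check of this derivation, the sketch below iterates the recursions (E.2)-(E.3) until they settle and compares the resulting homozygosity, combined as in Eq. (E.1), with the leading-order prediction of Eq. (E.8). The parameter values and function names are arbitrary choices made for illustration, not values used in the thesis, and the agreement is only expected up to corrections of order $1/N$ and $\mu$.

```python
def effective_size(n_f, n_m, kappa):
    """Eq. (E.6): N_e = 4 / ((1 + kappa)/N_f + 1/N_m)."""
    return 4.0 / ((1.0 + kappa) / n_f + 1.0 / n_m)


def iterated_homozygosity(n_f, n_m, kappa, mu, generations=100_000):
    """Iterate Eqs. (E.2)-(E.3) from zero homozygosity and combine via Eq. (E.1)."""
    eps, chi = 0.0, 0.0
    u = (1.0 - mu) ** 2
    for _ in range(generations):
        first = (0.5 * (1.0 + eps) * (1.0 + kappa) + (3.0 - kappa) * chi) / (4.0 * n_f)
        second = 0.25 * (1.0 - 1.0 / n_f) * (
            (1.0 + eps) / (2.0 * n_m) + (4.0 - 1.0 / n_m) * chi
        )
        # Both updates use the values from the previous generation.
        eps, chi = u * chi, u * (first + second)
    n_e = effective_size(n_f, n_m, kappa)
    return eps / (2.0 * n_e) + (1.0 - 1.0 / (2.0 * n_e)) * chi


if __name__ == "__main__":
    n, kappa, mu = 500, 0.5, 1e-3
    n_e = effective_size(n, n, kappa)
    theta = 4.0 * mu * n_e
    print("N_e from (E.6):", n_e, " from (E.9):", 4.0 * n / (2.0 + kappa))
    print("iterated F:", iterated_homozygosity(n, n, kappa, mu))
    print("1/(1+theta):", 1.0 / (1.0 + theta))
```

Increasing $\kappa$ towards one (single paternity) lowers $N_e$ and therefore raises the stationary homozygosity, which is the direction of the effect discussed in Chapter 6.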


F Heterozygosity during migration

In this appendix, the migration phase in the model introduced in Chapter 6 is analysed. Assuming that all islands are populated, the expressions for the population heterozygosity at distance $i = 0, 1, \ldots, k$ from the mainland are derived. The inbreeding coefficient in generation $\ell$ at island $i$ is denoted by $\epsilon_\ell^{(i)}$, and the coancestry between two alleles sampled at island $i$ from two distinct individuals by $\chi_\ell^{(i)}$. As discussed in Appendix E, these two determine the homozygosity $F_\ell^{(i)}$ at island $i$ in generation $\ell$. Note that the coancestry between islands $i$ and $j$ may be understood as the inter-island homozygosity, and in this case the notation $\chi_\ell^{(i,j)} \equiv F_\ell^{(i,j)}$ ($i \neq j$) is used.

The following assumptions are made below. The mutation rate per generation per allele is $\mu \ll 1$, and the infinite alleles model is employed. The population size on the islands $i = 1, \ldots, k$ is equal to $2N$, and there is an equal number of males and females in each population. The population size on the mainland, $2K$, is taken to be much larger than $2N$. Moreover, in generation $\ell = 0$, the mainland is in the stationary state. The mean homozygosity on the mainland in $\ell = 0$ is, thus, equal to $(1 + \theta_0)^{-1}$, where $\theta_0 = 4\mu K_e$ is the mutation rate on the mainland scaled by its effective population size (see Appendix E for the derivation of the homozygosity in an isolated population under the mating model introduced in Chapter 6).

Recall that in the spatial model introduced in Section 6.2, the female-migration rate per island per generation is $2M$ for islands $i = 1, \ldots, k-1$, whereas for the mainland and for the island at the distance $k$ from the mainland it is equal to $M$. Due to the assumption $K \gg N$, it follows that the process of migration may not affect genetic variation on the mainland. Formally, the probability that a child on the mainland stems from a mother coming from the neighboring island $i = 1$ is $M K^{-1}$, and this probability approaches zero in the limit $K \to \infty$, $M$ being constant. In summary, one has:

\[
F_\ell^{(0)} = \frac{1}{1 + \theta_0} . \tag{F.1}
\]

For the remaining islands, it is necessary to consider separately the inter-island and the intra-island homozygosity. In what follows, sampling from two distinct islands, $i \neq j$, is considered first. After this, the case of sampling within a single island $i = 1, \ldots, k$ is treated.

The inter-island homozygosity $F_{\ell+1}^{(i,j)}$ for $i = 0$, $0 < j \leq k$ satisfies:

\[
F_{\ell+1}^{(0,j)} = (1-\mu)^2 \left( \left(1 - m + \delta_{j,k}\, \frac{m}{2}\right) F_\ell^{(0,j)} + \frac{m}{2} \left( F_\ell^{(0,j-1)} + (1 - \delta_{j,k})\, F_\ell^{(0,j+1)} \right) \right) , \tag{F.2}
\]

where $m = 2M/N \ll 1$ is the migration rate per island per female per generation, and $\delta_{j,k}$ is equal to unity when $j = k$ and zero otherwise. In the case $0 < i < k$, $0 < j < k$, $i \neq j$, the inter-island homozygosity obeys:

\[
F_{\ell+1}^{(i,j)} = (1-\mu)^2 (1-m) \left( (1-m)\, F_\ell^{(i,j)} + m\, \chi_\ell^{(i)} \left(\delta_{i,j-1} + \delta_{i,j+1}\right)
+ \frac{m}{2} \left[ (1 - \delta_{i,j-1}) \left( F_\ell^{(i,j-1)} + F_\ell^{(i+1,j)} \right)
+ (1 - \delta_{i,j+1}) \left( F_\ell^{(i,j+1)} + F_\ell^{(i-1,j)} \right) \right] \right) + O(m^2) . \tag{F.3}
\]

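Lastly, when $i = k$, $0 < j < k$, one finds

\[
F_{\ell+1}^{(k,j)} = (1-\mu)^2 \left( \left(1 - \frac{m}{2}\right) (1-m)\, F_\ell^{(k,j)}
+ \frac{m}{2} \left(1 - \frac{m}{2}\right) \left( F_\ell^{(k,j-1)} (1 - \delta_{k,j-1}) + \chi_\ell^{(k,j-1)} \delta_{k,j-1}
+ F_\ell^{(k,j+1)} (1 - \delta_{k,j+1}) + \delta_{k,j+1}\, \chi_\ell^{(k,j+1)} \right)
+ \frac{m}{2} (1-m) \left( F_\ell^{(k-1,j)} (1 - \delta_{k,j+1}) + \chi_\ell^{(k-1,j)} \delta_{k,j+1} \right) \right) + O(m^2) . \tag{F.4}
\]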

When two alleles are sampled from the same island $0 < i < k$, the inbreeding coefficient and the coancestry satisfy:

\[
\epsilon_{\ell+1}^{(i)} = (1-\mu)^2 \left( (1-m)\, \chi_\ell^{(i)} + \frac{m}{2}\, \chi_\ell^{(i-1)} + \frac{m}{2}\, \chi_\ell^{(i+1)} \right) , \tag{F.5}
\]

\[
\chi_{\ell+1}^{(i)} = (1-\mu)^2 \left( (1-m) \left( (1-m) \left( \frac{1}{N(1-m)} \left( \frac{1 + \epsilon_\ell^{(i)}}{8} (1+\kappa) + \frac{1-\kappa}{4}\, \chi_\ell^{(i)} + \frac{\chi_\ell^{(i)}}{2} \right)
+ \left(1 - \frac{1}{N(1-m)}\right) \left( \frac{3}{4}\, \chi_\ell^{(i)} + \frac{1 + \epsilon_\ell^{(i)}}{8N} + \left(1 - \frac{1}{N}\right) \frac{\chi_\ell^{(i)}}{4} \right) \right)
+ m\, F_\ell^{(i,i+1)} + m\, F_\ell^{(i,i-1)} \right) \right) + O(m^2) . \tag{F.6}
\]

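For $i = k$, it holds:

\[
\epsilon_{\ell+1}^{(k)} = (1-\mu)^2 \left( \left(1 - \frac{m}{2}\right) \chi_\ell^{(k)} + \frac{m}{2}\, \chi_\ell^{(k-1)} \right) , \tag{F.7}
\]

\[
\chi_{\ell+1}^{(k)} = (1-\mu)^2 \left(1 - \frac{m}{2}\right) \left( \left(1 - \frac{m}{2}\right) \left( \frac{1}{N\left(1 - \frac{m}{2}\right)} \left( \frac{1 + \epsilon_\ell^{(k)}}{8} (1+\kappa) + \frac{1-\kappa}{4}\, \chi_\ell^{(k)} + \frac{\chi_\ell^{(k)}}{2} \right)
+ \left(1 - \frac{1}{N\left(1 - \frac{m}{2}\right)}\right) \left( \frac{3}{4}\, \chi_\ell^{(k)} + \frac{1 + \epsilon_\ell^{(k)}}{8N} + \left(1 - \frac{1}{N}\right) \frac{\chi_\ell^{(k)}}{4} \right) \right)
+ m\, F_\ell^{(k,k-1)} \right) + O(m^2) . \tag{F.8}
\]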

Recall that, in the case considered here, it is assumed that $N \gg 1$, $\mu \ll 1$, $m \ll 1$. Taking only the leading order terms in Eqs. (F.2)-(F.8), the recursions can be simplified significantly. The results in the remainder of this appendix take into account this simplification. Moreover, instead of the generation index $\ell$, the rescaled time $t$ is used, where $\ell = \lfloor 2N_e t \rfloor$. Here, $N_e$ is the effective population size under the mating model given (see Eq. (6.4)). In these units of time, the homozygosity in generation $\ell$ at island $i$ is denoted by $F^{(i)}(t)$.

For the inter-island homozygosity between the mainland and island $i < k$ one has:

\[
0 = (-\partial_t - \theta)\, F^{(0,i)}(t) + M_e \left( F^{(0,i+1)}(t) + F^{(0,i-1)}(t) - 2 F^{(0,i)}(t) \right) , \tag{F.9}
\]

and for $i = k$, it holds:

\[
0 = (-\partial_t - \theta)\, F^{(0,k)}(t) + M_e \left( F^{(0,k-1)}(t) - F^{(0,k)}(t) \right) . \tag{F.10}
\]


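Here $\theta = 4\mu N_e$, and $M_e = 2M N_e / N$.

For the case $0 < i < k$, $0 < j < k$, $j \neq i$, one has:

\[
0 = (-\partial_t - \theta)\, F^{(i,j)}(t)
+ M_e (1 - \delta_{i-1,j}) \left( F^{(i,j+1)}(t) + F^{(i-1,j)}(t) \right)
+ M_e\, \delta_{i-1,j} \left( F^{(i-1)}(t) + F^{(i)}(t) \right)
+ M_e (1 - \delta_{i+1,j}) \left( F^{(i,j-1)}(t) + F^{(i+1,j)}(t) \right)
+ M_e\, \delta_{i+1,j} \left( F^{(i)}(t) + F^{(i+1)}(t) \right)
- 4 M_e\, F^{(i,j)}(t) , \tag{F.11}
\]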

and for $i = k$, $0 < j < k$, the homozygosity satisfies:

\[
0 = (-\partial_t - \theta)\, F^{(k,j)}(t) + M_e \left( F^{(k-1,j)}(t) - 2 F^{(k,j)}(t) \right) . \tag{F.12}
\]

Lastly, the homozygosity at distance $0 < i < k$ from the mainland obeys:

\[
0 = (-\partial_t - \theta - 1)\, F^{(i)}(t) + 2 M_e \left( F^{(i+1,i)}(t) + F^{(i-1,i)}(t) - F^{(i)}(t) \right) + 1 , \tag{F.13}
\]

and for $i = k$, the expression is:

\[
0 = (-\partial_t - \theta - 1)\, F^{(k)}(t) + M_e \left( F^{(k,k-1)}(t) - 2 F^{(k)}(t) \right) + 1 . \tag{F.14}
\]

The homozygosity in the stationary state of the system may be determined by setting the terms involving the time derivative in Eqs. (F.9)-(F.14) to zero.
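To illustrate the stationary-state procedure on the simplest subset of these equations, the sketch below solves Eqs. (F.9)-(F.10) with the time derivative set to zero. It assumes that the boundary value $F^{(0,0)}$ equals the mainland homozygosity of Eq. (F.1), and it uses a plain Gauss-Seidel iteration, which is our choice of solver for illustration and not a method taken from the thesis. Parameter values are arbitrary.

```python
def stationary_mainland_profile(k, theta, theta0, m_e, sweeps=10_000):
    """Stationary F^(0,i), i = 0..k, from Eqs. (F.9)-(F.10) with d/dt = 0.

    Assumes F^(0,0) = 1 / (1 + theta0), the mainland value of Eq. (F.1).
    """
    f = [1.0 / (1.0 + theta0)] + [0.0] * k
    for _ in range(sweeps):
        # Interior islands, Eq. (F.9): (theta + 2 M_e) F_i = M_e (F_{i+1} + F_{i-1}).
        for i in range(1, k):
            f[i] = m_e * (f[i + 1] + f[i - 1]) / (theta + 2.0 * m_e)
        # Last island, Eq. (F.10): (theta + M_e) F_k = M_e F_{k-1}.
        f[k] = m_e * f[k - 1] / (theta + m_e)
    return f


if __name__ == "__main__":
    profile = stationary_mainland_profile(k=6, theta=0.5, theta0=2.0, m_e=1.0)
    for i, value in enumerate(profile):
        print(i, round(value, 6))
```

For $\theta > 0$ the system is strictly diagonally dominant, so the iteration converges, and the resulting profile decreases with the distance $i$ from the mainland.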


Bibliography

[1] D. L. Theobald, A formal test of the theory of universal common ancestry, Nature 465, 219–222 (2010).

[2] K. Johannesson, Parallel speciation: a key to sympatric divergence, Trends in ecology & evolution (Personal edition) 16, 148–153 (2001).

[3] K. Johannesson, Evolution in littorina: ecology matters, Journal of Sea Research 49, 107–117 (2003).

[4] P. Nosil, Speciation with gene flow could be common, Molecular Ecology 17, 2103–2106 (2008).

[5] K. Johannesson, M. Panova, P. Kemppainen, C. Andre, E. Rolan-Alvarez and R. K. Butlin, Repeated evolution of reproductive isolation in a marine snail: unveiling mechanisms of speciation, Philosophical Transactions of the Royal Society B: Biological Sciences 365, 1735–1747 (2010).

[6] R. Butlin, What do we need to know about speciation?, Trends Ecol Evol 27, 27–39 (2012).

[7] J. Kingman, The coalescent, Stoch. Proc. Appl. 13, 235–248 (1982).

[8] J. F. C. Kingman, Essays in statistical science, Journal of Applied Probability 19, 27–43 (1982).

[9] R. R. Hudson, Gene genealogies and the coalescent process, Oxford Surveys in Evolutionary Biology 7, 1–44 (1990).

[10] K. Janson, Genetic drift in small and recently founded populations of the marine snail littorina saxatilis, Heredity 58, 31–37 (1987).

[11] S. Ramachandran, O. Deshpande, C. Roseman, N. Rosenberg, M. Feldman and L. Cavalli-Sforza, Support from the relationship of genetic and geographic distance in human populations for a serial founder effect originating in Africa, Proceedings of the National Academy of Sciences of the United States of America 102, 15942–15947 (2005).

[12] H. Liu, F. Prugnolle, A. Manica and F. Balloux, A geographicallyexplicit genetic model of worldwide human-settlement history, Am.J. Hum. Genet. 79, 230–237 (2006).

[13] K. Tanabe, T. Mita, T. Jombart, A. Eriksson, S. Horibe,N. Palacpac, L. Ranford-Cartwright, H. Sawai, N. Sakihama,H. Ohmae, M. Nakamura, M. U. Ferreira, A. A. Escalante,F. Prugnolle, A. Bjorkman, A. Farnert, A. Kaneko, T. Horii,A. Manica, H. Kishino and F. Balloux, Plasmodium falciparumaccompanied the human expansion out of Africa, Curr Biol 20,1283–9 (2010).

[14] J. Pujolar, S. Vicenzi, L. Zane, D. Jesensek, G. De Leo andA. Crivelli, The effect of recurrent floods on genetic composition ofmarble trout populations, PLoS ONE 6, e23822 (2011).

[15] G. T. Marth, E. Czabarka, J. Murvai and S. T. Sherry, The allelefrequency spectrum in genome-wide human variation data revealssignals of differential demographic history in three large worldpopulations, Genetics 166, 351–372 (2004).

[16] N. Nawa and F. Tajima, Simple method for analyzing the pattern of DNA polymorphism and its application to SNP data of human, Genes & Genetic Systems 83, 353–360 (2008).

[17] A. Griffiths, W. Gelbart, J. Miller et al., Modern Genetic Analysis. W. H. Freeman, New York, 1999. The Sources of Variation.

[18] R. A. Fisher, The genetical theory of natural selection. Clarendon, Oxford, 1930.

[19] S. Wright, Evolution in Mendelian populations, Genetics 16, 97–159 (1931).

[20] M. Panova, J. Bostrom, T. Hofving, T. Areskoug, A. Eriksson, B. Mehlig, T. Makinen, C. Andre and K. Johannesson, Extreme female promiscuity in a non-social invertebrate species, PLoS ONE 5, e9640 (2010).

[21] D. L. Hartl and A. G. Clark, Principles of Population Genetics. Sinauer Associates, 1998.

[22] P. Moran, Random processes in genetics, Proc. Cambridge Philos. Soc. 54, 60–71 (1958).

[23] W. J. Ewens, Mathematical population genetics. Springer, Berlin, 1979.

[24] W. Ewens, On the concept of the effective population size, Theoretical Population Biology 21, 373–378 (1982).

[25] M. Kimura, The Neutral Theory of Molecular Evolution. Cambridge University Press, 1985.

[26] M. Kimura and J. F. Crow, The number of alleles that can be maintained in a finite population, Genetics 49, 725–738 (1964).

[27] M. Kimura, The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations, Genetics 61, 893–903 (1969).

[28] M. Kimura and T. Ohta, Mutation and evolution at the molecular level, Genetics 73 (1973).

[29] T. Ohta and M. Kimura, A model of mutation appropriate to estimate the number of electrophoretically detectable alleles in a finite population, Genet Res 22, 201–204 (1973).

[30] A. M. Valdes, M. Slatkin and N. B. Freimer, Allele frequencies at microsatellite loci: The stepwise mutation model revisited, Genetics 133, 737–749 (1993).

[31] M. S. McPeek and T. P. Speed, Modelling interference in genetic recombination, Genetics 139, 1031–1044 (1995).

[32] P. Sjodin, I. Kaj, S. Krone, M. Lascoux and M. Nordborg, On the meaning and existence of an effective population size, Genetics 169, 1061–1070 (2005).

[33] J. Wakeley and O. Sargsyan, Extensions of the coalescent effective population size, Genetics 181, 341–345 (2009).

[34] J. Pitman, Coalescents with multiple collisions, The Annals of Probability 27, 1870–1902 (1999).

[35] S. Sagitov, The general coalescent with asynchronous mergers of ancestral lines, Journal of Applied Probability 36, 1116–1125 (1999).

[36] M. Mohle and S. Sagitov, A classification of coalescent processes for haploid exchangeable population models, The Annals of Probability 29, 1547–1562 (2001).

[37] J. Schweinsberg, Coalescents with simultaneous multiple collisions, Electron. J. Probab. 5, 1–55 (2000).

[38] O. Sargsyan and J. Wakeley, A coalescent process with simultaneous multiple mergers for approximating the gene genealogies of many marine organisms, Theoretical Population Biology 74, 104–114 (2008).

[39] M. Birkner, J. Blath, M. Mohle, M. Steinrucken and J. Tams, A modified lookdown construction for the Xi-Fleming-Viot process with mutation and populations with recurrent bottlenecks, ALEA 6, 25–61 (2009).

[40] M. Kimmel and R. Chakraborty, Measures of variation at DNA repeat loci under a general stepwise mutation model, Theoretical Population Biology 50, 345–367 (1996).

[41] Y. X. Fu, Statistical properties of segregating sites, Theoretical Population Biology 48, 172–197 (1995).

[42] A. Eriksson and B. Mehlig, Gene-history correlation and population structure, Physical Biology 1, 220–228 (2004).

[43] T. Ohta and M. Kimura, Linkage disequilibrium between two segregating nucleotide sites under the steady flux of mutations in a finite population, Genetics 68, 571–580 (1971).

[44] G. McVean, A genealogical interpretation of linkage disequilibrium, Genetics 162, 987–991 (2002).

[45] R. C. Griffiths, Neutral 2-locus multiple allele models with recombination, Theoretical Population Biology 19, 169–186 (1981).

[46] R. R. Hudson, Properties of a neutral allele model with intragenic recombination, Theoretical Population Biology 23, 183–201 (1983).

[47] R. R. Hudson, Gene genealogies and the coalescent process, Oxford Surveys in Evolutionary Biology 7, 1–44 (1990).

[48] F. Austerlitz, B. JungMuller, B. Godelle and P. Gouyon, Evolution of coalescence times, genetic diversity and structure during colonization, Theoretical Population Biology 51, 148–164 (1997).

[49] D. Zivkovic and T. Wiehe, Second-order moments of segregating sites under variable population size, Genetics 180, 341–357 (2008).

[50] R. Griffiths and S. Tavare, Sampling theory for neutral alleles in a varying environment, Phil. Trans. Roy. Soc. Lon. B 344, 403–410 (1994).

[51] S. Tavare, Ancestral Inference in Population Genetics, pp. 1–188. Springer, Berlin, 2004.

[52] S. Tavare, Line-of-descent and genealogical processes, and their applications in population genetics models, Theoretical Population Biology 26, 119–64 (1984).

[53] F. Tajima, Statistical method for testing the neutral mutation hypothesis by DNA polymorphism, Genetics 123, 585–595 (1989).

[54] Y. X. Fu and W. H. Li, Statistical tests of neutrality of mutations, Genetics 133, 693–709 (1993).

[55] G. Achaz, Testing for neutrality in samples with sequencing errors, Genetics 179, 1409–1424 (2008).

[56] L. Ferretti, M. Perez-Enciso and S. Ramos-Onsins, Optimal neutrality tests based on the frequency spectrum, Genetics 186, 353–365 (2010).

[57] A map of human genome variation from population-scale sequencing, Nature 467, 1061–1073 (2010).

[58] M. Rafajlovic, A. Klassmann, A. Eriksson, T. Wiehe and B. Mehlig, Frequency spectrum of nucleotide polymorphisms under a single bottleneck, (in preparation).

[59] A. Klassmann (personal communication).

[60] G. Achaz, Frequency spectrum neutrality tests: One for all and all for one, Genetics 183, 249–258 (2009).

[61] R. Durrett and J. Schweinsberg, Approximating selective sweeps, Theor. Popul. Biol. 66, 129–138 (2004).

[62] B. Eldon and J. Wakeley, Linkage disequilibrium under skewed offspring distribution among individuals in a population, Genetics 178, 1517–1532 (2008).

[63] M. Kimura and G. H. Weiss, The stepping stone model of population structure and the decrease of genetic correlation with distance, Genetics 49, 561–576 (1964).

[64] M. Notohara, The coalescent and the genealogical process in geographically structured population, Journal of Mathematical Biology 29, 59–75 (1990).

[65] N. Takahata and M. Slatkin, Genealogy of neutral genes in two partially isolated populations, Theoretical Population Biology 38, 331–350 (1990).

[66] M. Kimura, Stepping-stone model of population, Ann. Rept. Nat. Inst. Genetics, Japan 3, 62–63 (1953).

[67] F. Balloux and L. Lehmann, Random mating with a finite number of matings, Genetics 165, 2313–2315 (2003).

[68] M. Rafajlovic, A. Eriksson, A. Rimark, M. Panova, K. Andre, K. Johannesson and B. Mehlig, The effect of multiple paternity on genetic diversity during colonisation, (in preparation).

[69] A. Eriksson, B. Mehlig, M. Panova, C. Andre and K. Johannesson, Multiple paternity: determining the minimum number of sires of a large brood, Molecular Ecology Resources 10, 282–291 (2010).

[70] D. E. Pearse and E. C. Anderson, Multiple paternity increases effective population size, Molecular Ecology 18, 3124–3127 (2009).

[71] S. Sara et al., Mating experiments in Littorina saxatilis, (in preparation).

[72] A. Eriksson, B. Haubold and B. Mehlig, Statistics of selectively neutral genetic variation, Phys. Rev. E 65, 040901 (2002).

[73] K. Johannesson (personal communication).

[74] D. Penny and M. D. Hendy, The use of tree comparison metrics, Systematic Zoology 34, 75–82 (1985).

[75] N. A. Rosenberg, The probability of topological concordance of gene trees and species trees, Theoretical Population Biology 61, 225–247 (2002).

[76] J. H. Degnan and L. A. Salter, Gene tree distributions under the coalescent process, Evolution 59, 24–37 (2005).

[77] J. H. Degnan and N. A. Rosenberg, Discordance of species trees with their most likely gene trees, PLoS Genet 2, e68 (2006).

[78] J. H. Degnan and N. A. Rosenberg, Gene tree discordance, phylogenetic inference and the multispecies coalescent, Trends in Ecology & Evolution (Personal edition) 24, 332–340 (2009).

[79] L. Liu, BEST: Bayesian estimation of species trees under the coalescent model, Bioinformatics 24, 2542–2543 (2008).

[80] L. Liu, L. Yu, D. K. Pearl and S. V. Edwards, Estimating species phylogenies using coalescence times among sequences, Systematic Biology 58, 468–477 (2009).

[81] E. Mossel and S. Roch, Incomplete lineage sorting: Consistent phylogeny estimation from multiple loci, Computational Biology and Bioinformatics, IEEE/ACM Transactions on 7, 166–171 (2010).

[82] A. W. F. Edwards, Estimation of the branch points of a branching diffusion process, Journal of the Royal Statistical Society. Series B (Methodological) 32, 155–174 (1970).

[83] S. Sadedin, J. Hollander, M. Panova, K. Johannesson and S. Gavrilets, Case studies and mathematical models of ecological speciation. 3: Ecotype formation in a Swedish snail, Molecular Ecology 18, 4006–4023 (2009).

[84] C. Meng and L. S. Kubatko, Detecting hybrid speciation in the presence of incomplete lineage sorting using gene tree incongruence: A model, Theoretical Population Biology 75, 35–45 (2009).

[85] I. J. Maureira-Butler, B. E. Pfeil, A. Muangprom, T. C. Osborn and J. J. Doyle, The reticulate history of Medicago (Fabaceae), Systematic Biology 57, 466–482 (2008).

[86] G. U. Yule, A mathematical theory of evolution, based on the conclusions of Dr. J. C. Willis, F.R.S., Philosophical Transactions of the Royal Society of London. Series B, Containing Papers of a Biological Character 213, 21–87 (1925).

[87] D. H. Colless in Review of Phylogenetics: The Theory and Practice of Phylogenetic Systematics, E. O. Wiley, ed., vol. 31, pp. 100–104. 1982.

[88] S. B. Heard, Patterns in Tree Balance among Cladistic, Phenetic, and Randomly Generated Phylogenetic Trees, Evolution 46, 1818–1826 (1992).

[89] G. A. Watterson, On the number of segregating sites in genetical models without recombination, Theor. Pop. Biol. 7, 256–276 (1975).

[90] J. C. Fay and C. I. Wu, A human population bottleneck can account for the discordance between patterns of mitochondrial versus nuclear DNA variation, Molecular Biology and Evolution 16, 1003–1005 (1999).
