evaluating the use of abba-baba statistics to locate ...abba sites localized over a narrow region,...

33
Manuscript type: Article Recommended MBE section: Methods Evaluating the use of ABBA-BABA statistics to locate introgressed loci Simon H. Martin* ‡1 , John W. Davey* 1 , Chris D. Jiggins 1 1 Department of Zoology, University of Cambridge, Downing Street, Cambridge, CB2 3EJ, UK * These authors contributed equally Corresponding Author: [email protected] . https://doi.org/10.1101/001347 doi: bioRxiv preprint

Upload: others

Post on 16-Mar-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Evaluating the use of ABBA-BABA statistics to locate ...ABBA sites localized over a narrow region, suggesting introgression of the forewing red band pattern, with a smaller, narrower

Manuscript type: Article

Recommended MBE section: Methods

Evaluating the use of ABBA-BABA statistics to locate introgressed loci

Simon H. Martin*‡1, John W. Davey*1, Chris D. Jiggins1

1Department of Zoology, University of Cambridge, Downing Street, Cambridge, CB2 3EJ, UK

* These authors contributed equally

‡ Corresponding Author: [email protected]

. https://doi.org/10.1101/001347doi: bioRxiv preprint

Page 2: Evaluating the use of ABBA-BABA statistics to locate ...ABBA sites localized over a narrow region, suggesting introgression of the forewing red band pattern, with a smaller, narrower

ABSTRACT

Several methods have been proposed to test for introgression across genomes. One method identifies an

excess of shared derived alleles between taxa using Patterson’s D statistic, but does not establish which loci

show such an excess or whether the excess is due to introgression or ancestral population structure. Smith

and Kronforst (2013) propose that, at loci identified as outliers for the D statistic, introgression is indicated

by a reduction in absolute genetic divergence (dXY) between taxa with shared ancestry, whereas ancestral

structure produces no reduction in dXY at these loci. Here, we use simulations and Heliconius butterfly data to

investigate the behavior of D when applied to small genomic regions. We find that D imperfectly identifies

loci with shared ancestry in many scenarios due to a bias in regions with few segregating sites. A related

statistic, f, is mostly robust to this bias but becomes less accurate as gene flow becomes more ancient.

Although reduced dXY does indicate introgression when loci with shared ancestry can be accurately detected,

both D and f systematically identify regions of lower dXY in the presence of both gene flow and ancestral

structure, so detecting a reduction in dXY at D or f outliers is not sufficient to infer introgression. However,

models including gene flow produced a larger reduction in dXY than models including ancestral structure in

almost all cases, so this reduction may be suggestive, but not conclusive, evidence for introgression.

. https://doi.org/10.1101/001347doi: bioRxiv preprint

Page 3: Evaluating the use of ABBA-BABA statistics to locate ...ABBA sites localized over a narrow region, suggesting introgression of the forewing red band pattern, with a smaller, narrower

INTRODUCTION

Hybridization and gene flow between taxa play a major role in evolution, acting as a force against

divergence, and as a potential source of adaptive novelty (Abbott et al. 2013). Consequently, identifying,

quantifying and dating gene flow between species are important objectives in the era of population

genomics. Although identifying gene flow between species has been a long-standing problem in population

genetics, the issue has received considerable recent attention with the analysis of shared ancestry between

Humans and Neanderthals (for example, Yang et al. 2012; Wall et al. 2013). Furthermore, with genomic data

sets becoming available in a wide variety of other taxonomic groups, there is a need for reliable methods that

are computationally tractable on large data sets.

A sensitive and widely used approach to test for gene flow is to fit coalescent models using maximum-

likelihood or Bayesian methods (Pinho and Hey 2010). However, simulation and model fitting are

computationally intensive tasks, and are not easily applied on a genomic scale. A simpler and more

computationally efficient approach that is gaining in popularity is to test for an excess of shared derived

variants using a four-taxon test (Kulathinal et al. 2009; Green et al. 2010; Durand et al. 2011). The test

considers ancestral ('A') and derived ('B') alleles, and is based on the prediction that two particular SNP

patterns, termed 'ABBA' and 'BABA' (Figure 1A), should be equally frequent under a scenario of incomplete

lineage sorting without gene flow. An excess of ABBA patterns is indicative of gene flow between two of

the taxa, and can be detected using Patterson's D statistic (Green et al. 2010; Durand et al. 2011; see Material

and Methods for details). However, an excess of shared derived variants can arise from processes other than

recent introgression, in particular non-random mating in the ancestral population due to population structure

(Eriksson and Manica 2012). It is therefore important to make use of additional means to distinguish between

these alternative hypotheses.

The D statistic was originally designed to be applied on a genome-wide or chromosome-wide scale, with

block-jackknifing used to overcome the problem of non-independence between loci (Green et al. 2010).

However, many researchers are interested in identifying particular genomic regions subject to gene flow,

rather than simply estimating a genome-wide parameter. Theory predicts that the rate of gene flow should

. https://doi.org/10.1101/001347doi: bioRxiv preprint

Page 4: Evaluating the use of ABBA-BABA statistics to locate ...ABBA sites localized over a narrow region, suggesting introgression of the forewing red band pattern, with a smaller, narrower

vary across the genome, both in the case of secondary contact after isolation (Barton and Gale 1993) as well

as continuous gene flow during speciation (Wu 2001). Indeed, a maximum likelihood test for speciation with

gene flow devised by Yang (2010) is based on detecting this underlying heterogeneity. The pattern of

introgression across the genome may carry information about the selective forces at play and the genetic

architecture of adaptive traits.

Several recent studies have attempted to characterize heterogeneity of introgression in more detail. Garrigan

et al. (2012) used a likelihood ratio test to identify genomic windows that have been shared between

Drosophila species. Models with and without gene flow were evaluated over 1 and 5 kb genomic windows.

Windows showing a significantly better fit to the gene flow model were widely distributed across the

autosomes, but scarce on the Z chromosome. Roux et al. (2013) used an ABC framework to fit models of

single and variable introgression rates among loci in Ciona species. Their results indicated strong

heterogeneity across the genome, and unidirectional introgression consistent with multiple incompatibilities

between species. Rheindt et al. (2013) used ABBA-BABA statistics to map candidate loci for introgression

in the flycatcher Zimmerius chrysops and attempted to link these loci to gene functions.

The Heliconius Genome Consortium (2012) used a sliding-window approach to investigate whether alleles

that determine convergent wing patterns displayed an excess of shared variation between co-mimetic species

of Heliconius butterflies (Figure 1B). This method compared the numbers of SNPs matching the ABBA and

BABA patterns, for each 50 kb window across two wing patterning loci, the HmB locus, controlling red

pattern elements, and the HmYb locus, controlling yellow pattern elements. In a comparison between H.

melpomene amaryllis and H. timareta thelxinoe from Peru, the HmB locus displayed a strong excess of

ABBA sites localized over a narrow region, suggesting introgression of the forewing red band pattern, with a

smaller, narrower excess present at the HmYb locus (The Heliconius Genome Consortium 2012). A similar

excess of shared derived alleles at the HmB locus was observed between a different pair of co-mimetic races

from Ecuador with “rayed” patterns, H. melpomene aglaope and H. timareta florencia.

A companion paper by Pardo-Diaz et al. (2012) provided additional evidence that alleles at the HmB locus

have been repeatedly shared among populations of H. melpomene and H. timareta. Analyzing a larger

. https://doi.org/10.1101/001347doi: bioRxiv preprint

Page 5: Evaluating the use of ABBA-BABA statistics to locate ...ABBA sites localized over a narrow region, suggesting introgression of the forewing red band pattern, with a smaller, narrower

number of populations, these authors constructed phylogenetic trees using PCR amplicons distributed across

the HmB locus. One amplicon displayed perfect clustering by wing pattern, suggesting that these pattern

alleles each had a single mutational origin and were subsequently shared between multiple populations over

a wide geographic area. The short branch lengths that separated populations with the same pattern were

consistent with recent sharing rather than inheritance of ancient polymorphic variation. In addition, a

signature of introgression was identified using an explicit test for elevated linkage disequilibrium between

shared polymorphisms (Pardo-Diaz et al. 2012).

All of these data, in combination with a recent study that showed pervasive genome-wide evidence for gene

flow between H. melpomene and H. timareta (Martin et al. 2013), strongly support the hypothesis of

repeated introgression between these species, as it is difficult to envisage a scenario in which both species

could have inherited both of these pattern alleles from their common ancestor. One possibility is that shared

ancestral variation in wing patterns has been maintained through multiple speciation events by balancing

selection, but this seems unlikely given the known positive frequency dependent selection on warning color

(Mallet and Barton 1989).

Nevertheless, a formal test to rule out the ancestral structure hypothesis based on the SNP data used to

identify a signal of shared ancestry is desirable, both to complement the existing case for introgression in

Heliconius and as a general method for other taxa where many lines of evidence are not available. Smith and

Kronforst (2013) recently applied a new test to distinguish between the hypotheses of ancestral structure and

introgression between Heliconius species. They proposed that introgression between taxa should result in

lower absolute genetic divergence at loci with an excess of shared alleles. In contrast, a reduction in

divergence is not expected under incomplete sorting of ancestral variation. Loci with an excess of shared

alleles were identified by dividing the region of interest (HmB locus, HmYb locus, or the whole genome) into

5 kb windows, calculating Patterson’s D statistic for each window, and partitioning the windows into outliers

(the 10% of windows with the highest D values) and background (the remaining 90% of windows). The

mean absolute genetic divergence (dXY) was then compared between the partitions and found to be

significantly lower in outlier windows than in Background windows, consistent with recent introgression

(Smith and Kronforst 2013).

. https://doi.org/10.1101/001347doi: bioRxiv preprint

Page 6: Evaluating the use of ABBA-BABA statistics to locate ...ABBA sites localized over a narrow region, suggesting introgression of the forewing red band pattern, with a smaller, narrower

This approach represents a novel application of the D statistic for the detection of introgressed loci, which

may hold promise as a general method for similar genome-scale studies in other taxa. The robustness of the

D statistic for detecting genome-wide excesses of shared derived alleles has been thoroughly explored

(Green et al. 2010; Durand et al. 2011; Yang et al. 2012; Eaton and Ree 2013; Martin et al. 2013). However,

it is not clear how reliably D can be used to find the location of individual loci with a history of shared

ancestry. In the present study, we used simulations to evaluate the effectiveness of the D statistic, calculated

for small genomic windows, as a means to (1) identify loci of shared ancestry and (2) differentiate between

recent introgression and shared ancestral variation at these loci.

We first assessed the reliability of the D statistic in identifying regions of increased shared variation under a

variety of phylogenetic scenarios. We also investigated the utility of an additional parameter, f, originally

designed to estimate the proportion of the genome that has been shared (Green et al. 2010). Second, we re-

analyzed Heliconius data to examine the performance of these statistics on empirical data. Third, we used

our simulated datasets to test the proposal that recent gene flow can be distinguished from shared ancestral

variation based on absolute divergence. Overall our results suggest that caution should be exercised when

interpreting values of the D statistic for small windows. Using the f statistic in addition to D may offer a

better means to identify introgressed loci. However, both methods introduce systematic biases, which may

impede attempts to distinguish between recent introgression and ancestral structure at loci identified in this

way.

. https://doi.org/10.1101/001347doi: bioRxiv preprint

Page 7: Evaluating the use of ABBA-BABA statistics to locate ...ABBA sites localized over a narrow region, suggesting introgression of the forewing red band pattern, with a smaller, narrower

RESULTS

Identifying windows of shared ancestry in simulated datasets

We compared two methods to identify sequence windows with a history of shared ancestry. The first used

the magnitude of Patterson's D statistic directly (Equation 1). The second used the f statistic as proposed here

(Equation 6), but f outliers were only identified among those windows with positive D values (Figure 2).

This combination of D and f was used since f only makes sense as a statistic where there is an excess of

ABBA over BABA sites, rather than the reverse.

To determine the accuracy of these approaches, we simulated datasets of 10 000 5 kb sequence windows

using a combined model, in which 9000 windows were assigned the species topology (((P1,P2),P3),O)

(Background), and the remaining 1000 (Alternate) windows were assigned a topology consistent with gene

flow or ancestral structure ((P1,(P2,P3),O) (Figure 2). Four split times varied between models: Bgt12, the split

time between P1 and P2 in the Background topology; Bgt123, the split time between the merged P1 and P2

populations and the P3 population in the Background topology; Alt23, the split time between P2 and P3 in the

Alternate topology; and Alt123, the split time between the merged P2 and P3 populations and the P1

population in the Alternate topology. Varying these four parameters was sufficient to model gene flow from

P2 to P3, gene flow from P3 to P2, or ancestral structure (Figure 3A, D and G; see Material and Methods for

details). In total, 360 combined models were analyzed, each with a unique combination of the four split times

(File S6).

The accuracy of the two approaches in identifying the 1000 Alternate windows was strongly dependent on

the particular combination of split times in the Background and Alternate trees (Figures S1, S2). Despite the

complex interaction between the split times, several general trends were apparent. The true Alternate

windows were more readily identified under gene flow scenarios compared to ancestral structure scenarios,

and windows with a history of gene flow from P2 to P3 were less accurately identified than those with a

history of gene flow from P3 to P2. In general, quantifying introgression using the f statistic was more reliable

than the D statistic (Figures S1, S2, S3).

. https://doi.org/10.1101/001347doi: bioRxiv preprint

Page 8: Evaluating the use of ABBA-BABA statistics to locate ...ABBA sites localized over a narrow region, suggesting introgression of the forewing red band pattern, with a smaller, narrower

Accuracy was more dependent on the recent split times Alt23 and Bgt12 than the more ancient split times

Bgt123 and Alt123. Among all the models in which the deepest split was 2 MYA (Figure 3), the most

important parameter in determining accuracy was the timing of the P2-P3 split in the Alternate windows

(Alt23). For gene flow models, this represents the date of gene flow. For ancestral structure models this

represents the split time between P3 and the ancestor of P1 and P2 (Figure 3). In all three scenarios, accuracy

improved as Alt23 became more recent, approaching 100% in models of very recent gene flow, where Alt23 =

0.2 MYA.

For gene flow models, accuracy also depended on the timing of the split between P1 and P2 in the

Background trees (Bgt12), with more ancient splits tending to give better accuracy. However, using f to

define outliers largely eliminated this effect. In ancestral structure models, using f improved accuracy more

for models in which Bgt12 was more recent, but offered little improvement when Bgt12 was ancient. Finally,

accuracy was generally lower in models simulating ancestral structure than those with gene flow, plateauing

at about 80% in the best cases (Figures 3, S1, S2).

Overall, across the 360 simulated models, using f to quantify admixture provided equal or greater accuracy in

identifying Alternate windows than using D alone (Figure S3). The small size of the windows analyzed

played a major role in producing this trend. We repeated all the simulations using 10 and 20 kb windows,

and found that the accuracy of the D statistic improved dramatically with increasing window size (mean

improvement of 16% between 5 and 10kb windows and 21% between 5 and 20 kb windows). However, f did

not show a concordant improvement with increasing window size (Figure S4). Hence, at the 20 kb scale,

quantifying introgression using f provided little improvement over the value of D alone, and actually

decreased the accuracy for a handful of ancestral structure models (Figure S3). Nevertheless, it was clear

from these simulations that, in many scenarios, there was a limit to the number of Alternate windows that

could be identified, and this appeared to be due to an inherent property of the model, rather than a lack of

data. This suggests there are a number of realistic evolutionary scenarios under which this genome scan

approach would fail to identify all, or even a majority of, windows with a history of excess shared ancestry.

. https://doi.org/10.1101/001347doi: bioRxiv preprint

Page 9: Evaluating the use of ABBA-BABA statistics to locate ...ABBA sites localized over a narrow region, suggesting introgression of the forewing red band pattern, with a smaller, narrower

Identifying regions of shared ancestry around Heliconius wing pattern loci

The performance of the D and f statistics on real data provided further insight into the patterns in the

simulated data (Figure 4). We plotted the counts of ABBA and BABA sites as well as the D and f statistics in

sliding windows across two Heliconius wing patterning loci HmB and HmYb, parts of which have been found

to show strong evidence for introgression in previous studies (Heliconius Genome Consortium 2012, Pardo-

Diaz et al. 2012). The narrow genomic regions identified by the Heliconius Genome Consortium (2012),

using absolute counts of ABBA and BABA sites, were not accurately recovered by the D statistic alone, but

show better correspondence with values of the f statistic, plotted for all regions where D≥0. We hypothesized

that this is due to the sensitivity of D to small numbers of segregating sites (sites that were variable across

the analyzed taxa), as many regions where the D statistic approached 1 were windows in which the counts of

both ABBAs and BABAs were very low. By contrast, most high peaks of f corresponded to regions

containing large numbers of ABBA sites (Figure 4).

Effect of the number of segregating sites on D and f

Analysis of published Heliconius whole genome data from the same taxa (Martin et al. 2013) confirmed that

the variance in Patterson's D statistic was strongly affected by the number of segregating sites per window

(Figure 5). Of the extreme positive or negative D values (those above the 90th percentile or below the 10th

percentile), the vast majority had a below average proportion of segregating sites (calculated as the number

of variable sites across all taxa, divided by the total number of sites analyzed in each window; Figure 5A). In

addition, the variance in D decreased as the proportion of segregating sites per window increased. Windows

with extreme D values around the Heliconius wing pattering loci (see previous section) also had below-

average proportions of segregating sites. While it is expected that the proportion of segregating sites would

be reduced at introgressed loci with a history of strong selection, these findings suggest that a high D value

alone is not sufficient evidence for introgression. Many of the D outlier loci result from noise in the signal

due to low numbers of segregating sites.

In contrast, when f was calculated for all windows with positive D, the values were not sensitive to the

proportion of segregating sites per window (Figure 5B). The top 10% of f values were roughly evenly

distributed around the mean proportion of segregating sites, and many of the most extreme f values

. https://doi.org/10.1101/001347doi: bioRxiv preprint

Page 10: Evaluating the use of ABBA-BABA statistics to locate ...ABBA sites localized over a narrow region, suggesting introgression of the forewing red band pattern, with a smaller, narrower

corresponded to windows at the wing patterning loci, particularly the HmB locus.

D values were more variable in genomic regions with fewer segregating sites. The variance in D among 5 kb

windows for each chromosome was strongly negatively correlated with the average proportion of segregating

sites (Figure 5C). Chromosomes with a lower density of segregating sites, such as the Z chromosome, had

higher variance among D values than chromosomes with larger proportions of segregating sites. By contrast,

the variance in f, calculated for all windows with positive D, displayed no significant relationship to the

mean proportion of segregating sites per chromosome. In summary, these data show that extreme D values,

both positive and negative, occur disproportionately in genomic regions with fewer segregating sites, while f

values are far less biased by underlying heterogeneity in genetic diversity.

Distinguishing between gene flow and ancestral structure scenarios

Smith and Kronforst proposed a novel method to discriminate between gene flow and ancestral structure as

causes of an excess of shared derived alleles, by comparing absolute divergence (dXY) between outlier

windows and the remaining non-outliers (Smith and Kronforst 2013; see Introduction). It makes intuitive

sense that introgressed regions should show lower between-species divergence as compared to the rest of the

genome, while shared ancestry due to ancestral structure would not lead to lower divergence. We confirmed

this prediction by comparing dXY between Alternate and Background windows in our simulated datasets,

under models including gene flow and ancestral structure. In models simulating gene flow, levels of dXY

between the three populations for the Alternate and Background windows matched the predictions of Smith

and Kronforst accurately, with dXY between P2 and P3 strongly reduced in Alternate windows relative to

Background, a weaker reduction in the same direction in dXY between P1 and P3, and equal dXY between P1

and P2 for Alternate and Background windows (see Figure 6A, Simulation, for example). Similarly, models

simulating ancestral structure matched the corresponding prediction, with Alternate windows showing no

reduction in dXY between P2 and P3 as compare to Background, but an elevated dXY between both P1 and P3

and between P1 and P2.

We then tested whether gene flow and ancestral structure could be distinguished where loci with shared

ancestry are not known (as would be the situation with empirical data), but are instead inferred by selecting

. https://doi.org/10.1101/001347doi: bioRxiv preprint

Page 11: Evaluating the use of ABBA-BABA statistics to locate ...ABBA sites localized over a narrow region, suggesting introgression of the forewing red band pattern, with a smaller, narrower

the top 10% of D or f values (outliers). We focus mainly on dXY between P2 and P3, the most important

parameter in these predictions. As predicted, in all 240 models simulating gene flow, mean dXY between P2

and P3 was reduced in outlier windows as compared to non-outlier windows, whether outliers were defined

by D or f (Figure 7). This reduction was significant for all gene flow models (p<2e-05 in all cases, 99%

significance level with Bonferroni correction over 480 tests; see Figure 6A for example). However, mean P2-

P3 dXY was also significantly reduced in D and f outliers in all 120 models simulating ancestral structure

(p<4e-05 in all cases, 99% significance level with Bonferroni correction over 240 tests; Figure 6B, Figure 7).

Since mean P2-P3 dXY was not reduced in Alternate windows derived from ancestral structure topologies, the

reduction for D and f outliers must result from the fact that D and f do not accurately identify all Alternate

windows in these cases. Thus, there is a systematic bias, with the D and f outliers showing a lower mean dXY

between P2 and P3 than the non-outlier windows, regardless of whether gene flow was simulated. This

finding is perhaps unsurprising, as D and f are both methods for identifying increased levels of shared

ancestry between P2 and P3, and will find such loci whether they are due to real gene flow or stochastic

variation in the coalescent process. Hence, among the Background windows, those that happen by chance to

have lower dXY between P2 and P3, will also tend to have higher D and f.

It is therefore clear that, although a distinction could be made between simulated gene flow and ancestral

structure scenarios by examining the Alternate and Background windows directly (Figure 7, Simulation), the

lack of 100% accuracy in identifying alternate windows using D or f makes a simple test for a statistical

difference in P2-P3 dXY meaningless. In our models, there is always a reduction in mean P2-P3 dXY in outliers

compared to non-outliers, regardless of the underlying evolutionary scenario.

Nevertheless, there were clear differences in the magnitude of the decrease in P2-P3 dXY between outlier and

non-outlier windows that could potentially be used to distinguish different evolutionary scenarios. Gene flow

models almost always produced a larger reduction in P2-P3 dXY than ancestral structure models. For ancestral

structure models, average P2-P3 dXY for f outlier windows was never less than 77% of that for non-outlier

windows. By contrast, for gene flow models, mean P2-P3 dXY for outlier windows was on average 55% of that

for non-outliers for models including gene flow from P2 to P3 and 50% for gene flow from P3 to P2, with a

maximum value of 82% for either type of gene flow (Figure 7). 8 of the 240 gene flow models showed a

. https://doi.org/10.1101/001347doi: bioRxiv preprint

Page 12: Evaluating the use of ABBA-BABA statistics to locate ...ABBA sites localized over a narrow region, suggesting introgression of the forewing red band pattern, with a smaller, narrower

relative reduction in P2-P3 dXY for f outliers that was within the range of that seen in ancestral structure

models. These 8 models all included comparatively ancient gene flow (≥1 MYA) occurring very soon after

the P1-P2 species split, meaning that incomplete lineage sorting probably obscured much of the signal.

Nevertheless, it is unclear how appropriate general thresholds could be set to reliably distinguish the two

scenarios, given that results will vary according to demographic histories.

Differences between predictions, simulated data and real data

There is a discrepancy between the predictions made by Smith and Kronforst (2013) for gene flow scenarios

and the divergences that were observed in the genome-wide RAD sequence data set they analyzed. A

significant reduction in dXY between P1 and P2 was observed for D outliers, even though such a reduction was

not predicted under either gene flow or ancestral structure scenarios (see Figure 2C of Smith and Kronforst

2013). We did not observe such a pattern in any of our simulations based on neutral coalescent models (see

Figure 6 for example) but we did observe a similar pattern for D outliers in our whole genome data for the

same taxa (Figure S5). However, using f outliers eliminated this reduction in dXY between P1 and P2, thus

recreating exactly the predictions of Smith and Kronforst for a case of recent gene flow (Figure S5), as

confirmed by our simulations (Figure 6). This finding is consistent with both our simulation and empirical

data above, showing that D outliers are more likely to occur in region of generally lower divergence between

P1, P2 and P3, but that f is less affected by confounding factors, and more likely to identify truly introgressed

loci.

. https://doi.org/10.1101/001347doi: bioRxiv preprint

Page 13: Evaluating the use of ABBA-BABA statistics to locate ...ABBA sites localized over a narrow region, suggesting introgression of the forewing red band pattern, with a smaller, narrower

DISCUSSION

With the advent of population genomics, studies of species divergence have moved from simply

documenting inter-specific gene flow, towards the identification of specific genomic regions that show

strong signals of either introgression or divergence (Garrigan et al. 2012; Staubach et al. 2012; Roux et al.

2013). This is a useful goal for many reasons. It can permit the identification of large-scale trends, such as

chromosomal differences, and the fine-scale localization of putative targets of adaptive introgression for

further characterization. There is therefore likely to be considerable interest in simple and easily computable

statistics that can be used to identify loci with a history of introgression. We have shown that Patterson's D

does not accurately identify such loci when calculated over small genomic regions (10 kb or less). We

propose that the f statistic, as applied here, provides a better means to identify putatively introgressed

regions.

Previous studies have explored the behavior of D across the whole genome, where there are large numbers of

variable sites (Green et al. 2010; Durand et al. 2011; Yang et al. 2012; Eaton and Ree 2013; Martin et al.

2013; Wall et al. 2013). These studies have shown that D is a robust method to test for admixture on a

genome-wide scale. In particular, the non-independence between linked sites can be overcome by block-

jackknifing. The main problem documented here therefore results from calculating D for small genomic

regions with small numbers of segregating sites. This leads to highly variable estimates, and so outlying D

values do not provide a reliable signal of shared ancestry.

The number of segregating sites can be influenced by many technical and biological factors. Here, a window

size of 5 kb is simply too small for sufficient numbers of informative sites to be detected. However,

biological factors such as directional selection and population bottlenecks can cause reductions in genetic

diversity, thus reducing the number of segregating sites. These problems may be confounded if we are

specifically interested in loci that experience strong selective pressures, which could increase the likelihood

of detecting chance outliers at such loci.

Using the f statistic to quantify introgression for windows in which D is positive provides a more reliable

. https://doi.org/10.1101/001347doi: bioRxiv preprint

Page 14: Evaluating the use of ABBA-BABA statistics to locate ...ABBA sites localized over a narrow region, suggesting introgression of the forewing red band pattern, with a smaller, narrower

method of detecting introgressed loci under a range of parameters. It is not biased by the number of

segregating sites and can identify almost all loci derived from a recent gene flow event, provided the species

split and the gene flow event are separated by a substantial period of time. Where f performs poorly, it is

because the time of gene flow approaches the time of the species split, leading to large amounts of

incomplete lineage sorting across the simulated windows, swamping the signal of gene flow. Therefore, the f

statistic can be considered a reliable detector of meaningful differences in ancestry.

The most accurately identified models were those with recent gene flow from P3 to P2. This is unsurprising,

because gene flow from P3 to P2 is more likely to generate ABBA patterns than gene flow in the opposite

direction. Derived 'B' alleles are more likely to emerge on the longer branch leading to P3, than on the shorter

branch separating P2 from P1. This reasoning is supported by the fact that the accuracy of the D statistic

improves as the split between P1 and P2 gets older, increasing the length of the branch leading to P2. Using

the f statistic largely circumvents this issue, as it accounts for the expected number of ABBA patterns under

complete gene flow, thus controlling to some extent for the differences in relative branch lengths. An issue

with f as it was originally implemented (Green et al. 2010, Durand et al. 2011) is that it assumes

unidirectional gene flow from P3 to P2. If gene flow had in fact occurred in the opposite direction, or

bidirectionally, this would tend to underestimate of the proportion of admixture, except in rare regions where

derived alleles happen to be more common in P2 than P3, which would produce overestimates. Our modified

implementation of the f statistic is also conservative, but it is protected from overestimates by allowing for

bidirectional gene flow, and setting the ‘donor’ population at each site as the population with the highest

frequency of the derived allele (see Methods for details).

The final goal of this study was to assess the method proposed by Smith and Kronforst to distinguish

between recent gene flow and ancestral structure using absolute genetic divergence (dXY). Our simulations

confirm that the intuitive predictions of this method are valid, as windows with simulated gene flow showed

consistently lower dXY between P2 and P3 as compared to windows without gene flow, whereas windows with

simulated ancestral structure did not. However, we also found that this test is not appropriate when using

either the D or f statistic to identify outliers, as this adds a systematic bias. Outliers of D and f are over-

represented in regions of lower than average dXY between P2 and P3, even in models without gene flow. It

. https://doi.org/10.1101/001347doi: bioRxiv preprint

Page 15: Evaluating the use of ABBA-BABA statistics to locate ...ABBA sites localized over a narrow region, suggesting introgression of the forewing red band pattern, with a smaller, narrower

seems likely that in real genomes the bias in D will be even more extreme, as factors such as selective

constraint and variable mutation rates produce significant variation in rates of divergence across the genome,

which cannot be accurately recreated by neutral simulations.

Although the existence of a reduction in P2-P3 dXY is not sufficient evidence for gene flow, it may be that the

size of the reduction does distinguish gene flow and ancestral structure. In our simulations, we found that

models with gene flow consistently produced larger reductions in P2-P3 dXY in outlier windows than models

with ancestral structure, except in a small number of cases where gene flow was relatively ancient and the

differences between the Alternate and Background topologies was small. This observation is relevant to the

case for adaptive introgression of Heliconius wing patterning alleles. The magnitude of the reduction in dXY

between H. m. amaryllis and H. t. thelxinoe for D outlier windows at the HmB and HmYb loci (Smith and

Kronforst 2013, Figures 2A and 2B) was greater than that observed in any of our simulated ancestral

structure models. In contrast, many of our gene flow models produced reductions in dXY of a similar

magnitude to that observed by Smith and Kronforst (2013). As we have simulated a wide range of scenarios,

with divergence times and population sizes similar to those in Heliconius species, we believe that these

patterns of divergence add support to the case for introgression in Heliconius, a case already supported by

multiple lines of evidence.

In conclusion, while Patterson’s D statistic provides a robust signal of shared ancestry across the genome, it

should not be used for naïve scans to ascribe shared ancestry to small genomic regions, due to its bias toward

extreme values in regions containing few segregating sites. The f statistic is a more appropriate, but still

imperfect, tool for the identification of loci with shared ancestry. Smith and Kronforst’s predictions for

distinguishing gene flow and ancestral structure using absolute divergence are legitimate only when regions

of shared ancestry can be accurately identified, but do not consistently hold for regions identified as D or f

outliers, due to the biases of these statistics. Nevertheless, the size of the reduction in absolute divergence

may be informative in distinguishing between gene flow and ancestral structure. We recommend that future

analyses of this sort make use of simulations tailored to the taxa under study to determine appropriate

threshold values, but it remains to be determined how generally applicable this approach may be.

. https://doi.org/10.1101/001347doi: bioRxiv preprint

Page 16: Evaluating the use of ABBA-BABA statistics to locate ...ABBA sites localized over a narrow region, suggesting introgression of the forewing red band pattern, with a smaller, narrower

MATERIAL AND METHODS

Statistics used to detect shared ancestry

In this study, we focused on an approach to identify an excess of shared derived polymorphisms, indicated

by the relative abundance of two SNP patterns termed “ABBAs” and “BABAs” (Figure 1A; Green et al.

2010). Given three populations and an outgroup with the relationship (((P1, P2), P3), O), ABBAs are sites at

which the derived allele “B” is shared between the non-sister taxa P2 and P3, while P1 carries the ancestral

allele, as defined by the outgroup (Figure 1A). Similarly, BABAs are sites at which the derived allele is

shared between P1 and P3, while P2 carries the ancestral allele. Under a neutral coalescent model, both

patterns can only result from incomplete lineage sorting and should be equally abundant in the genome

(Durand et al. 2011). A significant excess of ABBAs over BABAs is indicative either of gene flow between

P2 and P3, or some form of non-random mating or structure in the population ancestral to P1, P2 and P3. This

excess can be tested for, using Patterson's D statistic,

(1) where CABBA(i) and CBABA(i) are counts of either 1 or 0, depending on whether or not the specified pattern

(ABBA or BABA) is observed at site i in the genome. Under the null hypothesis of no gene flow and random

mating in the ancestral population, D will approach zero, regardless of differences in effective population

sizes (Durand et al. 2011). Hence, a D significantly greater than zero is indicative of a significant excess of

shared derived alleles between P2 and P3.

If population samples are used, then rather than binary counts of fixed ABBA and BABA sites, the frequency

of the derived allele at each site in each population can be used (Green et al. 2010, Durand et al. 2011),

effectively weighting each segregating site according to its fit to the ABBA or BABA pattern, with

(2)

(3) where Pij is the frequency of the derived allele at site i in population j. These values are then used in equation

1 to calculate D (Durand et al. 2011).

. https://doi.org/10.1101/001347doi: bioRxiv preprint

Page 17: Evaluating the use of ABBA-BABA statistics to locate ...ABBA sites localized over a narrow region, suggesting introgression of the forewing red band pattern, with a smaller, narrower

Whereas Patterson's D allows a test for an excess of shared derived alleles, a related statistic, f, was

developed to quantify this excess in terms of the introgressed fraction of the genome (Green et al. 2010,

Durand et al. 2011). This method makes use of the numerator of equation 1, the difference between sums of

ABBAs and BABAs, or S. In the example described above, with ((P1,P2),P3),O); the proportion of the

genome that has been shared between P2 and P3 subsequent to the split between P1 and P2, can be estimated

by comparing the observed value of S to a value estimated under a scenario of complete sharing between P2

and P3. This would mean that P2 would now resemble a lineage of the P3 taxon, and so the denominator of

equation 1 can be estimated by replacing P2 in equations 2 and 3 with a second lineage sampled from P3, or

by splitting the P3 sample into two,

(4) where P3a and P3b are the two lineages sampled from P3. Splitting P3 in this way may lead to stochastic errors

at individual sites, particularly with small sample sizes. These should be negligible when whole-genome data

are analyzed but could easily lead to erroneous values of f (including f>1) when small genomic windows are

analyzed, as in the present study. We therefore used a more conservative version, in which P3 a and P3b are

both substituted with the entire P3 sample:

(5) While this may generally underestimate the size of the shared fraction of the genome, it also reduces the rate

of stochastic error. Moreover, in the present study, we are less concerned with the absolute value of f, and

more with the relative values of f between genomic regions.

The f statistic assumes unidirectional gene flow from P3 to P2 (i.e. P3 is the donor and P2 is the recipient).

Since the branch leading to P3 is longer than that leading to P2 (Figure 1), gene flow in the opposite direction

(P2 to P3) is likely to generate fewer ABBAs. Thus, in the presence of gene flow from P2 to P3, or in both

directions, the f equation should lead to an underestimate. However, when small genomic windows are

analyzed, the assumption of unidirectional gene flow could lead to overestimates, because any region in

which derived alleles are present in both P2 and P3, but happen to be at higher frequency in P2, will yield f

estimates that are greater than 1. Thus, we propose a modified version of f in which the denominator is

. https://doi.org/10.1101/001347doi: bioRxiv preprint

Page 18: Evaluating the use of ABBA-BABA statistics to locate ...ABBA sites localized over a narrow region, suggesting introgression of the forewing red band pattern, with a smaller, narrower

calculated by defining a donor population (PD) for each site, which can differ between sites and can be either

P3 or P2. For each site, PD is the population that has the highest frequency of the derived allele, thus

maximizing the denominator and eliminating f estimates greater than 1:

(6) All f estimates shown in the Results were calculated using Equation 6.

Finally, to estimate divergence between two populations we calculated Dxy, the mean number of differences

between each pair of individuals sampled from the two populations. At sites with missing data were excluded

in a pairwise manner, and each pair of individuals contributed equally to the mean.

Measuring the ability of the D and f statistics to identify shared regions using simulations

To determine how reliably the D and f statistics identify windows of shared ancestry, we generated a large

range of sequence datasets using Hudson's ms simulator (Hudson 2002) and Seq-Gen (Rambaut and Grass,

1996). Model parameters were selected to produce sequence alignments comparable to data analyzed in

recent studies of Heliconius butterflies (The Heliconius Genome Consortium 2012, Smith and Kronforst

2013, Martin et al. 2013, Kronforst et al. 2013), but also generic enough to allow for broader interpretation.

In all models, a constant effective population size (Ne) of 1×106 was used, with a mutation rate of 2.5×10-9

and a generation time of 0.25 years. To approximate a scenario in which a subset of the genome has a

distinct phylogenetic history, we used a combined model approach, in which 90% of windows had a standard

species tree or background topology (((P1,P2),P3),O), and the remaining 10% of windows had an alternate

topology ((P1,(P2,P3)),O), consistent with shared ancestry between P2 and P3 (Figure 3A, D, G). In all

datasets, the roots of both the background and alternate trees were fixed to 3 MYA, while the other split

times varied between datasets and ranged between 0.2 and 2 MYA (File S6). Only certain combinations of

split times for the background and alternate trees were tested, restricting simulations to simple and

biologically plausible scenarios. These combinations fall into three distinct categories: Gene flow from P2 to

P3, gene flow from P3 to P2, and ancestral structure (Figure 3A, D, G). In total, 120 datasets in each category

met the criteria laid out in Figure 3. Full parameters of these 360 models are provided in File S6.

. https://doi.org/10.1101/001347doi: bioRxiv preprint

Page 19: Evaluating the use of ABBA-BABA statistics to locate ...ABBA sites localized over a narrow region, suggesting introgression of the forewing red band pattern, with a smaller, narrower

Figure 2 illustrates the design of the simulation study. Ten thousand 5 kb windows were simulated for each

of the 360 models, with 9000 windows generated using the background topology and 1000 windows

generated using the alternate topology in each case. The accuracy of the D and f statistics in identifying

regions of shared ancestry was estimated by determining how reliably each was able to identify the 10% of

alternate windows in each dataset. While this approach of partitioning each dataset into two somewhat

arbitrarily-sized subsets with distinct histories is unrealistic, it provides a simple and powerful framework in

which to measure the accuracy with which we can subsequently identify windows subject to shared ancestry,

with clear expectations. The accuracy of each statistic was therefore defined as the proportion of the 1000

alternate windows that fell within the top 1000 values. Hence 100% accuracy would indicate that the statistic

identified all of the alternate windows from the simulated data, i.e. the top 1000 values corresponded

perfectly to the 1000 alternate windows. A completely uninformative statistic would randomly identify 10%

of the alternate windows. For D, only positive values were included in the set of 1000 outlying windows;

windows with extreme negative D values were included in the background partition. Similarly, for f, only

windows with D≥0 were considered, as f is meaningless when there is no excess of ABBAs; and windows

with negative Ds were included in the background partition (Figure 1C). To examine the effects of window

size, we repeated the entire simulation study for windows of 10 kb (5000 windows) and 20 kb (2500

windows).

Analysis of Heliconius wing patterning regions and whole genome sequences

To investigate the ability of the D and f statistics to identify known loci at which adaptive introgression is

thought to have occurred, we re-analyzed sequences of Heliconius wing patterning regions from The

Heliconius Genome Consortium (2012), also recently analyzed by Smith and Kronforst (2013). The D and f

statistics were calculated as described above for 5 kb sliding windows, moving in increments of 1 kb.

Windows for which fewer than 1000 sites had genotype calls for at least half of the individuals were

discarded. This threshold was selected as a compromise between the number of sites covered, and the

number of individuals represented at each site. Stricter filtering on individual representation resulted in a

considerable reduction in the number of sites analyzed, but more lenient filtering would result in poor allele

frequency estimates at sites where few individuals are represented.

. https://doi.org/10.1101/001347doi: bioRxiv preprint

Page 20: Evaluating the use of ABBA-BABA statistics to locate ...ABBA sites localized over a narrow region, suggesting introgression of the forewing red band pattern, with a smaller, narrower

To interrogate how the D and f statistics are affected by the number of segregating sites in a given window,

we re-analyzed whole genome data from Martin et al. (2013). D and f were calculated, along with proportion

of segregating sites over the complete dataset, for non-overlapping 5 kb windows across the genome.

Windows were restricted to single scaffolds and, as above, windows for which fewer than 1000 sites had

genotype calls for at least half of the individuals were discarded. We also analyzed windows from each of the

21 chromosomes of the H. m. melpomene genome sequence separately. Scaffolds were assigned to

chromosomes according to the Heliconius Genome Consortium (2012), and incorporating the improved

assignment of Z-linked scaffolds by Martin et al. (2013) (details available in Dryad repositories

http://dx.doi.org/10.5061/dryad.m27qq and http://dx.doi.org/10.5061/dryad.dk712).

Assessing a test to distinguish gene flow from ancestral structure based on absolute divergence

Smith and Kronforst (2013) proposed a simple test to distinguish between the hypotheses of pre- and post-

speciation shared ancestry based on absolute divergence. We tested their predictions using the 360 simulated

datasets described above, by comparing dXY (mean differences per site) between P2 and P3 for the background

and alternate windows. The test was repeated, but rather than comparing the actual alternate and background

windows, we identified 10% of the windows as outliers according to D and f values. Average dXY was then

compared between the outlier and non-outlier partitions using a Wilcoxon rank-sum test, as values tended to

be non-normally distributed (confirmed with Bonferroni-corrected Shapiro-Wilk tests).

Software

Code and data for this manuscript will be made available as a Data Dryad repository on acceptance. Most

files, with instructions for running scripts to generate the results, are currently available on GitHub at

https://github.com/johnomics/Martin_Davey_Jiggins_evaluating_introgression_statistics. Large data sets can

be made available on request and will be made publicly available after review. This work was made possible

by the free, open source software packages EggLib (De Mita and Siol 2012), phyclust (Chen 2011), R (R

Core Team 2013), ggplot2 (Wickham 2009), plyr (Wickham 2011), reshape (Wickham 2007) and Inkscape

(http://www.inkscape.org).

. https://doi.org/10.1101/001347doi: bioRxiv preprint

Page 21: Evaluating the use of ABBA-BABA statistics to locate ...ABBA sites localized over a narrow region, suggesting introgression of the forewing red band pattern, with a smaller, narrower

ACKNOWLEDGEMENTS

We thank Krzysztof Kozak, Richard Merrill and Richard Wallbank for initial discussions that inspired this

work and Jim Mallet and Marcus Kronforst for comments on the manuscript. This work was supported by

the Leverhulme Trust (F/09364/E to C.D.J.) and the Herchel Smith Fund (Postdoctoral Fellowship to

J.W.D.).

. https://doi.org/10.1101/001347doi: bioRxiv preprint

Page 22: Evaluating the use of ABBA-BABA statistics to locate ...ABBA sites localized over a narrow region, suggesting introgression of the forewing red band pattern, with a smaller, narrower

FIGURE LEGENDS

Figure 1. Four-taxon design with Heliconius subspecies. A. The D and f statistics operate on four taxa

featuring three related populations P1, P2, P3 rooted by an outgroup O and related by the background

topology shown on the left and right in brown. Two biallelic SNP patterns are used to calculate D and f; the

first with an ancestral allele in P1 and O and a derived allele in P2 and P3 (ABBA) and the second with an

ancestral allele in P2 and O and a derived allele in P1 and P3 (BABA). This general design applies to the four

particular Heliconius subspecies shown here, showing the convergent wing patterns of H. melpomene

amaryllis and H. timareta thelxinoe. B. Number of sites with ABBA and BABA patterns per 100bp across

the Heliconius HmB locus for the subspecies shown in part A, estimated from allele frequencies using

Equations 2 and 3 over 5 kb windows, moving in increments of 1 kb.

Figure 2. Simulation study design. A. For each of 360 models, 10 000 windows were simulated, 9000 from

a Background topology and 1000 from an Alternate topology. Four parameters varied between models;

Bgt12, the split time between P1 and P2 in the Background topology, Bgt123, the split time between the merged

P1 and P2 populations and the P3 population in the Background topology, Alt23, the split time between P2 and

P3 in the Alternate topology and Alt123, the split time between the merged P2 and P3 populations and the P1

population in the Alternate topology. This figure shows one of 120 models simulating gene flow from P2 to

P3 (not drawn to scale). B. Histogram of D values calculated per window. Colors show windows from

Background topology (brown) and Alternate topology (green), as per part A. Filled bars represent windows

where D≥0. f is calculated for these windows only. The dashed black line separates the top 1000 D values

(outliers) from the lower 9000 (non-outliers).. 603 of 1000 Alternate windows are identified as outliers by D,

so D has 60.3% accuracy in this case. C. Histogram of f values for windows with D≥0. The dashed black line

delineates the top 1000 f values (outliers) . f identifies 855 of 1000 Alternate windows as outliers, so has

85.5% accuracy in this case.

Figure 3. Models simulated and the accuracy of D and f in identifying Alternate windows. A,D,G. The

three types of models simulated with 9000 Background topologies (brown) and 1000 Alternate topologies

(colored). To simulate gene flow from P2 to P3, Alt123 was set equal to Bgt12, Alt23 had to be more recent than

this, and Bgt123 more ancient. To simulate gene flow from P3 to P2, Alt123 was set equal to Bgt123, Bgt12 had to

be more recent than this, and Alt23 more recent than Bgt12. To simulate ancestral structure, Alt23 was set

equal to Bgt123, Bgt12 had to be more recent than this, and Alt123 more ancient. B,C,E,F,H,I. The accuracy of

D and f for each combination of Background and Alternate topologies. Accuracy was calculated as the

percentage of the 1000 Alternate windows that were among the top 1000 D and f values. In this plot, only

models with an oldest split time of 2 MYA are plotted, thus leaving only two independent variables: Alt23 (x-

axis) and Bgt12 (shading). See Figures S1 and S2 for accuracy of all 360 models.

Figure 4. Identifying putatively introgressed regions around Heliconius wing patterning loci. Counts of

ABBA and BABA SNP patterns, D, f and dXY across the HmYb and HmB BAC sequences, as analyzed by

. https://doi.org/10.1101/001347doi: bioRxiv preprint

Page 23: Evaluating the use of ABBA-BABA statistics to locate ...ABBA sites localized over a narrow region, suggesting introgression of the forewing red band pattern, with a smaller, narrower

The Heliconius Genome Consortium (2012) and Smith and Kronforst (2013). Counts of ABBA and BABA

SNP patterns for the Heliconius populations shown in Figure 1A, were estimated from allele frequencies

using Equations 2 and 3. All plotted values were calculated for 5 kb sliding windows to replicate the

approach of Smith and Kronforst (2013), but in order to generate smooth plots, overlapping windows were

used, moving in increments of 1 kb.

Figure 5. Effects of the proportion of segregation sites on the D and f statistics in Heliconius whole

genome data. A,B. Values of D and f for non-overlapping 5 kb windows across the genome, plotted against

the proportion of segregating sites in each window. Segregating sites were only counted at positions that had

genotype calls for at least 50% of the individuals. f values are only plotted for windows with D≥0. Data from

Martin et al. 2013, for the same taxa as in Figure 1A. C. The variance among D and f values for each

chromosome, plotted against the mean proportion of segregating sites.

Figure 6. Mean dXY between population pairs for outlier and non-outlier windows, identified by

Alternate or Background topology (Simulation) or by outlying D or f values. A. An example gene flow

model, with Bgt123=2, Bgt12=1.2, Alt123=1.2, Alt23=0.2. B. An example ancestral structure model, with

Bgt123=1.4, Bgt12=0.2, Alt123=2, Alt23=1.4. ***=p<0.001, #=p>0.05 (Wilcoxon rank-sum test)

Figure 7. Mean dXY between P2 and P3 in outlier windows as a percentage of P2- P3 dXY in Background

windows. Outlier windows defined by Alternate or Background topology (Simulation) or by outlying D and

f values, as per Figure 6. Model types shown in color (ancestral structure, red; gene flow from P2 to P3,

green; gene flow from P2 to P3, blue).

. https://doi.org/10.1101/001347doi: bioRxiv preprint

Page 24: Evaluating the use of ABBA-BABA statistics to locate ...ABBA sites localized over a narrow region, suggesting introgression of the forewing red band pattern, with a smaller, narrower

FIGURES

Figure 1. Four-taxon design with Heliconius subspecies.

B

A

coun

ts p

er 1

00 b

p

H. m aglaope: P1

H. m. amaryllis: P2

H. timareta: P3

outgroups: O

A

B

B

A

A

B

B

A

ABBABABA

0.0

0.4

0.8

1.2

0 200 400 600 800

A. The D and f statistics operate on four taxa featuring three related populations P1, P2, P3 rooted by an

outgroup O and related by the background topology shown on the left and right in brown. Two biallelic SNP

patterns are used to calculate D and f; the first with an ancestral allele in P1 and O and a derived allele in P2

and P3 (ABBA) and the second with an ancestral allele in P2 and O and a derived allele in P1 and P3 (BABA).

This general design applies to the four particular Heliconius subspecies shown here, showing the convergent

wing patterns of H. melpomene amaryllis and H. timareta thelxinoe. B. Number of sites with ABBA and

BABA patterns per 100bp across the Heliconius HmB locus for the subspecies shown in part A, estimated

from allele frequencies using Equations 2 and 3 over 5 kb windows, moving in increments of 1 kb.

. https://doi.org/10.1101/001347doi: bioRxiv preprint

Page 25: Evaluating the use of ABBA-BABA statistics to locate ...ABBA sites localized over a narrow region, suggesting introgression of the forewing red band pattern, with a smaller, narrower

Figure 2. Simulation study design.

P1

Background(9000 windows)

Alternate(1000 windows)

P2 P3 O

Bgt123 = 2.0 MYA

A

Bgt12 = 1.2 MYA Alt123 = 1.2 MYA

Alt23 = 0.2 MYA

B

C

0

200

400

�1.0 �0.5 0.0 0.5 1.0D

Count

0

1000

2000

3000

4000

0.00 0.25 0.50 0.75 1.00f

Count

A. For each of 360 models, 10 000 windows were simulated, 9000 from a Background topology and 1000

from an Alternate topology. Four parameters varied between models; Bgt12, the split time between P1 and P2

in the Background topology, Bgt123, the split time between the merged P1 and P2 populations and the P3

population in the Background topology, Alt23, the split time between P2 and P3 in the Alternate topology and

Alt123, the split time between the merged P2 and P3 populations and the P1 population in the Alternate

topology. This figure shows one of 120 models simulating gene flow from P2 to P3 (not drawn to scale). B.

Histogram of D values calculated per window. Colors show windows from Background topology (brown)

and Alternate topology (green), as per part A. Filled bars represent windows where D≥0. f is calculated for

these windows only. The dashed black line separates the top 1000 D values (outliers) from the lower 9000

(non-outliers).. 603 of 1000 Alternate windows are identified as outliers by D, so D has 60.3% accuracy in

this case. C. Histogram of f values for windows with D≥0. The dashed black line delineates the top 1000 f

values (outliers) . f identifies 855 of 1000 Alternate windows as outliers, so has 85.5% accuracy in this case.

. https://doi.org/10.1101/001347doi: bioRxiv preprint

Page 26: Evaluating the use of ABBA-BABA statistics to locate ...ABBA sites localized over a narrow region, suggesting introgression of the forewing red band pattern, with a smaller, narrower

Figure 3. Models simulated and the accuracy of D and f in identifying Alternate windows.

P1

Bg. (9000)Alt. (1000)

Alt123

Alt23

Gene flow P2 -> P3

Gene flow P3 -> P2

Ancestral Structure

% p

redi

cted

by

D

% p

redi

cted

by

f

% p

redi

cted

by

D

% p

redi

cted

by

f

Alt23 (MYA)P2 P3 O

P1 P2 P3 O

P1 P2 P3 O

Bgt12

Bgt123

Alt23 (MYA)

Alt23 (MYA) Alt23 (MYA)

Alt23 (MYA) Alt23 (MYA)

Bgt23 (MYA)

Bgt23 (MYA)

Bgt23 (MYA)

Bgt23 (MYA)

Bgt23 (MYA)

Bgt23 (MYA)

A B C

D E F

G H I

% p

redi

cted

by

D

% p

redi

cted

by

f

Bgt12

Bgt123

Bgt12

Bgt123

Alt123

Alt23

Alt123

Alt23

0.0 0.5 1.0 1.5 2.0

020

4060

8010

0

��

��

��

��

��

��

� �

��

0.40.60.81

1.21.41.61.8

0.0 0.5 1.0 1.5 2.0

020

4060

8010

0

��

��

� �

��

0.40.60.81

1.21.41.61.8

0.0 0.5 1.0 1.5 2.0

020

4060

8010

0

� �� � �

� � � �

�� �

��

�� �

��

� ��

��

�� �

� �

0.40.60.81

1.21.41.61.8

0.0 0.5 1.0 1.5 2.00

2040

6080

100

�� �� � �� �

��

� ��

��

��

��

� ��

��

��

��

��

0.40.60.81

1.21.41.61.8

0.0 0.5 1.0 1.5 2.0

020

4060

8010

0

� ��

��

� �� �

� �

� � � ��

� ��

��

� � ��� �����

0.20.40.60.8

11.21.41.6

0.0 0.5 1.0 1.5 2.0

020

4060

8010

0

� � � � � ��

� � � � ��

�� � �

� � ��

�� �� �

� � ��

0.20.40.60.8

11.21.41.6

A,D,G. The three types of models simulated with 9000 Background topologies (brown) and 1000 Alternate

topologies (colored). To simulate gene flow from P2 to P3, Alt123 was set equal to Bgt12, Alt23 had to be more

recent than this, and Bgt123 more ancient. To simulate gene flow from P3 to P2, Alt123 was set equal to Bgt123,

Bgt12 had to be more recent than this, and Alt23 more recent than Bgt12. To simulate ancestral structure, Alt23

was set equal to Bgt123, Bgt12 had to be more recent than this, and Alt123 more ancient. B,C,E,F,H,I. The

accuracy of D and f for each combination of Background and Alternate topologies. Accuracy was calculated

as the percentage of the 1000 Alternate windows that were among the top 1000 D and f values. In this plot,

only models with an oldest split time of 2 MYA are plotted, thus leaving only two independent variables:

Alt23 (x-axis) and Bgt12 (shading). See Figures S1 and S2 for accuracy of all 360 models.

. https://doi.org/10.1101/001347doi: bioRxiv preprint

Page 27: Evaluating the use of ABBA-BABA statistics to locate ...ABBA sites localized over a narrow region, suggesting introgression of the forewing red band pattern, with a smaller, narrower

Figure 4. Identifying putatively introgressed regions around Heliconius wing patterning loci. co

unts

per

100

bp

Df

dXY

Position(HmB region)

Position(HmYb region)

ABBABABA

0.0

0.4

0.8

1.2

�1.0

�0.5

0.0

0.5

1.0

0.0

0.2

0.4

0.6

0.8

1.0

0 200 400 600 800 1000 1200

0.00

0.02

0.04

0.06

0 200 400 600 800

Counts of ABBA and BABA SNP patterns, D, f and dXY across the HmYb and HmB BAC sequences, as

analyzed by The Heliconius Genome Consortium (2012) and Smith and Kronforst (2013). Counts of ABBA

and BABA SNP patterns for the Heliconius populations shown in Figure 1A, were estimated from allele

frequencies using Equations 2 and 3. All plotted values were calculated for 5 kb sliding windows to replicate

the approach of Smith and Kronforst (2013), but in order to generate smooth plots, overlapping windows

were used, moving in increments of 1 kb.

. https://doi.org/10.1101/001347doi: bioRxiv preprint

Page 28: Evaluating the use of ABBA-BABA statistics to locate ...ABBA sites localized over a narrow region, suggesting introgression of the forewing red band pattern, with a smaller, narrower

Figure 5. Effects of the proportion of segregation sites on the D and f statistics in Heliconius whole

genome data.

A,B. Values of D and f for non-overlapping 5 kb windows across the genome, plotted against the proportion

of segregating sites in each window. Segregating sites were only counted at positions that had genotype calls

for at least 50% of the individuals. f values are only plotted for windows with D≥0. Data from Martin et al.

2013, for the same taxa as in Figure 1A. C. The variance among D and f values for each chromosome,

plotted against the mean proportion of segregating sites.

. https://doi.org/10.1101/001347doi: bioRxiv preprint

Page 29: Evaluating the use of ABBA-BABA statistics to locate ...ABBA sites localized over a narrow region, suggesting introgression of the forewing red band pattern, with a smaller, narrower

Figure 6. Mean dXY between population pairs for outlier and non-outlier windows, identified by

Alternate or Background topology (Simulation) or by outlying D or f values.

P2−P3 P1−P3 P1−P2

******

******

******

******

##

##

0.00

0.02

0.04

0.06

0.00

0.02

0.04

0.06

PredictionSim

ulationAlternateBackground AlternateBackground AlternateBackground

Partition

Mean d X

Y

P2−P3 P1−P3 P1−P2

******

******

******

******

**

******

0.00

0.02

0.04

0.06

0.00

0.02

0.04

0.06

Df

Outlier Non−Outlier Outlier Non−Outlier Outlier Non−Outlier

Partition

Mean d X

Y

P2−P3 P1−P3 P1−P2

##

##

******

******

******

******

0.00

0.02

0.04

0.06

0.00

0.02

0.04

0.06

PredictionSim

ulation

AlternateBackground AlternateBackground AlternateBackground

Partition

Mean d X

Y

P2−P3 P1−P3 P1−P2

******

******

******

******

******

******

0.00

0.02

0.04

0.06

0.00

0.02

0.04

0.06

Df

Outlier Non−Outlier Outlier Non−Outlier Outlier Non−Outlier

Partition

Mean d X

Y

A

B

C

D

A. An example gene flow model, with Bgt123=2, Bgt12=1.2, Alt123=1.2, Alt23=0.2. B. An example ancestral

structure model, with Bgt123=1.4, Bgt12=0.2, Alt123=2, Alt23=1.4. ***=p<0.001, #=p>0.05 (Wilcoxon rank-

sum test)

. https://doi.org/10.1101/001347doi: bioRxiv preprint

Page 30: Evaluating the use of ABBA-BABA statistics to locate ...ABBA sites localized over a narrow region, suggesting introgression of the forewing red band pattern, with a smaller, narrower

Figure 7. Mean dXY between P2 and P3 in outlier windows as a percentage of P2- P3 dXY in Background

windows.

Simulation D f

●●●●●●●●●

●●

●●

●●●●

●●●●●●

●●

● ●

●●●●

●●●●●●

●●●●●

● ●

●●●●●

●●●●●

●●●●●

●●●

● ●

●●●●

●●●

●●●

●●●●

●●●

● ●

●●●

●●

●●●

●●●

●●●●●

● ●

●●

●●

●●

●●

●●

●●●

●●

●●●

● ●●

●●●●●

●●

●●●●●

●●

●●

●●●●●

●●●●●●

●●●●●●

●●●●

●●●●●

●●●●●

●●●●●

●●●●

●●●●

●●●●

●●●●

●●●

●●●

●●●

●●●

●●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●●

●●●●●●●●

●●●●●

●●

●●

●●●

●●●●●●

●●●●●●

●●●

●●

●●●●●

●●●●●

●●●●●

●●●●

●●●●

●●●●

●●●●

●●●●

●●●

●●●

●●

●●●

●●●

●●●

●●

●●

●●

●●

●●

●●

●●

0

20

40

60

80

100

Out

lier t

o N

on−O

utlie

r P2−

P3 d

XY (%

)

Outlier windows defined by Alternate or Background topology (Simulation) or by outlying D and f values, as

per Figure 6. Model types shown in color (ancestral structure, red; gene flow from P2 to P3, green; gene flow

from P2 to P3, blue).

. https://doi.org/10.1101/001347doi: bioRxiv preprint

Page 31: Evaluating the use of ABBA-BABA statistics to locate ...ABBA sites localized over a narrow region, suggesting introgression of the forewing red band pattern, with a smaller, narrower

REFERENCES

Abbott R, Albach D, Ansell S, et al. 2013. Hybridization and speciation. J. Evol. Biol. 26:229–246.

Barton NH, Gale KS. 1993. Genetic analysis of hybrid zones. In: Price J, Harrison RG, editors. Hybrid

Zones and the evolutionary process. USA: Oxford University Press.

Chen W-C. 2011. Overlapping Codon model, Phylogenetic Clustering, and Alternative Partial Expectation

Conditional Maximization Algorithm, Ph.D. Diss., Iowa Stat University.

De Mita S, Siol M. 2012. EggLib: processing, analysis and simulation tools for population genetics and

genomics. BMC Genet. 13:27.

Durand EY, Patterson N, Reich D, Slatkin M. 2011. Testing for ancient admixture between closely related

populations. Mol. Biol. Evol. 28:2239–2252.

Eaton D, Ree R. 2013. Inferring Phylogeny and Introgression using RADseq Data: An Example from

Flowering Plants (Pedicularis: Orobanchaceae). Syst. Biol. 682:689-706.

Eriksson A, Manica A. 2012. Effect of ancient population structure on the degree of polymorphism shared

between modern human populations and ancient hominins. Proc. Natl. Acad. Sci. U. S. A. 109:13956–

13960.

Garrigan D, Kingan SB, Geneva AJ, Andolfatto P, Clark AG, Thornton K, Presgraves DC. 2012. Genome

sequencing reveals complex speciation in the Drosophila simulans clade. Genome Res. 22:1499–1511.

Green RE, Krause J, Briggs AW, et al. 2010. A draft sequence of the Neandertal genome. Science 328:710–

722.

The Heliconius Genome Consortium. 2012. Butterfly genome reveals promiscuous exchange of mimicry

adaptations among species. Nature 487:94–98.

Kulathinal RJ, Stevison LS, Noor MAF. 2009. The Genomics of Speciation in Drosophila: Diversity,

Divergence, and Introgression Estimated Using Low- Coverage Genome Sequencing. PLoS Genet.

5(7):e1000550.

. https://doi.org/10.1101/001347doi: bioRxiv preprint

Page 32: Evaluating the use of ABBA-BABA statistics to locate ...ABBA sites localized over a narrow region, suggesting introgression of the forewing red band pattern, with a smaller, narrower

Mallet J, Barton N. 1989. Strong natural selection in a warning-color hybrid zone. Evolution 43:421–431.

Martin SH, Dasmahapatra KK, Nadeau NJ, et al. 2013. Genome-wide evidence for speciation with gene flow

in Heliconius butterflies. Genome Res. 23:1817–1828.

Pardo-Diaz C, Salazar C, Baxter SW, Merot C, Figueiredo-Ready W, Joron M, McMillan WO, Jiggins CD.

2012. Adaptive introgression across species boundaries in Heliconius butterflies. PLoS Genet.

8:e1002752.

Pinho C, Hey J. 2010. Divergence with Gene Flow: Models and Data. Annu. Rev. Ecol. Evol. Syst. 41:215–

230.

R Core Team. 2013. R: A language and environment for statistical computing. R Foundation for Statistical

Computing, Vienna, Austria. URL http://www.R-project.org/.

Rheindt FE, Fujita MK, Wilton PR, Edwards SV. 2013. Introgression and Phenotypic Assimilation in

Zimmerius Flycatchers (Tyrannidae): Population Genetic and Phylogenetic Inferences from Genome-

Wide SNPs. Syst. Biol., in press. doi: 10.1093/sysbio/syt070

Roux C, Tsagkogeorga G, Bierne N, Galtier N. 2013. Crossing the Species Barrier: Genomic Hotspots of

Introgression between Two Highly Divergent Ciona intestinalis Species. Mol. Biol. Evol. 30:1574–

1587.

Smith J, Kronforst MR. 2013. Do Heliconius butterfly species exchange mimicry alleles? Biol. Lett. 9.

Staubach F, Lorenc A, Messer PW, Tang K, Petrov DA, Tautz D. 2012. Genome patterns of selection and

introgression of haplotypes in natural populations of the house mouse (Mus musculus). PLoS Genet.

8:e1002891.

Wall JD, Yang MA, Jay F, et al. 2013. Higher levels of neanderthal ancestry in East Asians than in

Europeans. Genetics 194:199–209.

Wickham H. 2007. Reshaping data with the reshape package. J. Stat. Softw. 21(12).

Wickham H. 2009. ggplot2: elegant graphics for data analysis. Springer (New York).

Wickham H. 2011. The Split-Apply-Combine Strategy for Data Analysis. J. Stat. Softw. 40(1):1-29.

Wu C. 2001. The genic view of the process of speciation. J. Evol. Biol. 14:851–865.

. https://doi.org/10.1101/001347doi: bioRxiv preprint

Page 33: Evaluating the use of ABBA-BABA statistics to locate ...ABBA sites localized over a narrow region, suggesting introgression of the forewing red band pattern, with a smaller, narrower

Yang MA, Malaspinas A-S, Durand EY, Slatkin M. 2012. Ancient structure in Africa unlikely to explain

Neanderthal and non-African genetic similarity. Mol. Biol. Evol. 29:2987–2995

Yang Z. 2010. A likelihood ratio test of speciation with gene flow using genomic sequence data. Genome

Biol. Evol. 2:200–211.

. https://doi.org/10.1101/001347doi: bioRxiv preprint