a coalescent-based method for population tree inference with haplotypes

A Coalescent-based Method for Population Tree Inference with Haplotypes. Yufeng Wu, Dept. of Computer Science & Engineering, University of Connecticut, USA. Cold Spring Harbor Asia Meeting, Suzhou, China, 2014

Page 1: A Coalescent-based Method for Population Tree Inference with Haplotypes


A Coalescent-based Method for Population Tree Inference with Haplotypes

Yufeng Wu

Dept. of Computer Science & Engineering

University of Connecticut, USA

Cold Spring Harbor Asia Meeting, Suzhou, China, 2014

Page 2: A Coalescent-based Method for Population Tree Inference with Haplotypes

H: haplotypes at SNPs
a (A): AAGCCAATTCCGAACAAGA
b (B): ACGCCAATTCCGGACAAGA
c (C): ACGCCTATTCCGGACAAGA
d (D): AAGCCAATTCCGAACCAGA

P(H|T): probability of H given T under coalescent models

Population Tree: Population split history (including order and time); not known

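[Figure: population tree over populations A, B, C, D, with a coalescent genealogy of gene lineages a, b, c, d drawn inside it; coalescence and mutation events are marked]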

Coalescent genealogical tree: underlying genetic model

Population tree inference: given haplotypes H from multiple loci, infer the population tree T.
MLE of T: find the T maximizing P(H|T)

Locus (gene): genomic region

[Figure: genealogy drawn against a time axis; mutations at sites 1 to 4 give the haplotypes AAAA, CAGA, CTGA, AAAC]

Challenge: P(H|T) is difficult to compute even for a single population

Page 3: A Coalescent-based Method for Population Tree Inference with Haplotypes

This talk: likelihood-based population tree inference from haplotypes. Assumptions: (1) no intra-locus recombination and (2) the infinite sites model of mutation

Common simplification: treat haplotypes as unlinked variants (SNPs).
P(H|T) ≈ P(S_1|T) P(S_2|T) P(S_3|T) …, where S_i is the ith SNP of H.
See, e.g., SNAPP (Bryant et al., MBE, 2012) and TreeMix (Pickrell and Pritchard, PLoS Genet, 2012)
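The product approximation above can be sketched in a few lines. This is only an illustration of the bookkeeping: `snp_prob` is a hypothetical callback standing in for whatever per-SNP likelihood a method such as SNAPP or TreeMix would supply, and the function names are assumptions, not part of either tool.

```python
import math

def columns(haplotypes):
    """Split aligned haplotypes into per-site columns, one string per SNP."""
    return ["".join(col) for col in zip(*haplotypes)]

def composite_log_likelihood(snps, tree, snp_prob):
    """Log of the product approximation: sum of log P(S_i | T) over SNPs.

    snp_prob(snp, tree) is a hypothetical user-supplied function returning
    P(S_i | T); nothing here computes that probability itself.
    """
    return sum(math.log(snp_prob(s, tree)) for s in snps)

# The four-site example haplotypes a, b, c, d used later in the talk.
example_snps = columns(["AAAA", "CAGA", "CTGA", "AAAC"])
```

Working in log space avoids numerical underflow when many SNPs are multiplied together. Note that the columns are scored one at a time, so any information carried by which alleles co-occur on the same haplotype is discarded.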

Single SNPs: potential loss of information in haplotypes.

Fact 1: haplotypes H imply a unique (non-bifurcating) genealogical tree called the perfect phylogeny T_H

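Sites: 1 2 3 4
a: AAAA
b: CAGA
c: CTGA
d: AAAC
[Figure: perfect phylogeny for these haplotypes, rooted at AAAA, with leaves c, b, a, d and edges labeled by mutation sites 1 to 4]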
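As a rough illustration of how the example haplotypes determine a perfect phylogeny, the sketch below groups lineages by the derived allele at each site. It assumes the ancestral haplotype is known (AAAA, the root shown in the figure) and that the infinite sites model holds; the function names are made up for this example.

```python
def derived_sets(haplotypes, ancestral):
    """For each site, the set of lineages carrying a non-ancestral allele."""
    return [
        frozenset(name for name, seq in haplotypes.items() if seq[j] != ancestral[j])
        for j in range(len(ancestral))
    ]

def is_perfect_phylogeny(sets):
    """Rooted compatibility test: every two derived sets are disjoint or nested."""
    for i, a in enumerate(sets):
        for b in sets[i + 1:]:
            if a & b and not (a <= b or b <= a):
                return False
    return True

def clades(sets):
    """Map each distinct non-empty derived set (a clade) to its 1-based sites."""
    out = {}
    for site, members in enumerate(sets, start=1):
        if members:
            out.setdefault(members, []).append(site)
    return out

haps = {"a": "AAAA", "b": "CAGA", "c": "CTGA", "d": "AAAC"}
sets = derived_sets(haps, ancestral="AAAA")
tree_clades = clades(sets)
```

For the example, sites 1 and 3 both define the clade {b, c}, site 2 separates c, and site 4 separates d. Lineage a carries no derived allele, so it attaches directly at the root, which is why the resulting tree is not bifurcating.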

Fact 2: under the infinite sites model, P(H|T) = P(T_H|T)
Unfortunately, computing P(T_H|T) is still non-trivial

SNP vs. Haplotype

Page 4: A Coalescent-based Method for Population Tree Inference with Haplotypes

G’: genealogical topology implied by haplotypes H
Ignore mutations on genealogy G.
Key assumption: P(G|T) ≈ P(G’|T)

Inference of population tree T: maximize P(G’_1|T) P(G’_2|T) P(G’_3|T) …
G’_i: gene genealogical topology of the ith locus
From here on, G refers to a genealogical topology.
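Written out, the objective on this slide is a product over loci, which is usually maximized in log form. In the sketch below, L (the number of loci) and the hat notation for the estimate are symbols introduced here for illustration; the product form treats the loci as independent given T.

```latex
\hat{T}
  = \arg\max_{T} \prod_{i=1}^{L} P(G'_i \mid T)
  = \arg\max_{T} \sum_{i=1}^{L} \log P(G'_i \mid T)
```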

Simplification of Likelihood

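[Figure: genealogy G with mutations at sites 1 to 4 over leaves c, b, a, d; the topology G’ obtained from G by ignoring mutations; and a gene genealogy of lineages a, b, c, d inside the population tree over A, B, C, D]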

Genealogical topology G and population tree T:
Gene lineages b and c coalesce first, so populations B and C are likely to be more closely related.
But not always…

Incomplete lineage sorting: gene tree topology is stochastic
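To see how strong this stochasticity can be, the standard three-population result is a handy reference point. It is not taken from the talk: with one lineage per population and an internal branch of length t coalescent units, the gene tree matches the population tree with probability 1 − (2/3)e^(−t), and each of the two discordant topologies has probability (1/3)e^(−t). The function name below is made up for this sketch.

```python
import math

def three_taxon_gene_tree_probs(t):
    """Gene tree topology probabilities for a 3-population tree.

    t is the internal branch length in coalescent units; one lineage is
    sampled per population. Returns the probability of the topology that
    matches the population tree and of each discordant topology.
    """
    discordant = math.exp(-t) / 3.0
    return {"concordant": 1.0 - 2.0 * discordant, "discordant_each": discordant}

# Illustrative branch lengths only; any non-negative t works.
for t in (0.1, 0.5, 1.0):
    print(t, three_taxon_gene_tree_probs(t))
```

With short internal branches the three topologies become nearly equally likely, which is why a single locus says little about the population tree and why the likelihood combines many loci.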

Page 5: A Coalescent-based Method for Population Tree Inference with Haplotypes


STELLSH: infer population trees from haplotypes

Gene tree probability for non-bifurcating topology: sum over all compatible bifurcating topologies. Can be more efficiently computed: Wu, manuscript, 2014.
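The sum over compatible bifurcating topologies can be illustrated by brute force. The sketch below is not the efficient algorithm the slide refers to; it simply enumerates every binary resolution of a multifurcating tree. Trees are nested tuples of leaf names (a representation chosen for this example), and `gene_tree_prob` is a hypothetical callback standing in for a P(G|T) routine such as STELLS.

```python
from itertools import combinations, product

def binary_trees(items):
    """All rooted binary trees whose leaves are the given atomic items."""
    if len(items) == 1:
        yield items[0]
        return
    first, rest = items[0], items[1:]
    # Keep `first` on the left so each unordered split is produced once.
    for k in range(len(rest)):
        for extra in combinations(range(len(rest)), k):
            left = [first] + [rest[i] for i in extra]
            right = [rest[i] for i in range(len(rest)) if i not in extra]
            for l in binary_trees(left):
                for r in binary_trees(right):
                    yield (l, r)

def resolutions(tree):
    """All bifurcating trees compatible with a possibly multifurcating tree."""
    if isinstance(tree, str):
        yield tree
        return
    child_options = [list(resolutions(child)) for child in tree]
    for resolved in product(*child_options):
        yield from binary_trees(list(resolved))

def polytomy_probability(tree, pop_tree, gene_tree_prob):
    """Sum the (hypothetical) gene_tree_prob over every binary resolution."""
    return sum(gene_tree_prob(g, pop_tree) for g in resolutions(tree))
```

A node with k children has (2k − 3)!! resolutions, so a root with three children, such as `("a", "d", ("b", "c"))`, yields 3 trees and a four-way star yields 15. That growth is why a more efficient computation matters for real data.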

Issue: the perfect phylogeny from haplotypes is usually non-bifurcating

STELLSH: maximizes the probability of all gene topologies by optimizing the topology and branch lengths of the population tree (e.g. with nearest neighbor interchange)
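For readers unfamiliar with nearest neighbor interchange (NNI), here is a minimal sketch of the move on rooted binary trees written as nested tuples. It only generates the neighboring topologies; branch length optimization and the likelihood itself are omitted, and nothing here is taken from the STELLSH code.

```python
def canon(tree):
    """Canonical form with sorted children, so child order does not matter."""
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canon(c) for c in tree), key=repr))

def nni_neighbors(tree):
    """All rooted binary topologies one NNI move away from `tree`."""
    found = set()

    def visit(node, rebuild):
        # `rebuild` puts a replacement for `node` back into the full tree.
        if isinstance(node, str):
            return
        left, right = node
        for inner, sibling in ((left, right), (right, left)):
            if not isinstance(inner, str):
                u, v = inner
                # Swap the sibling with one child of the inner node.
                found.add(canon(rebuild(((sibling, v), u))))
                found.add(canon(rebuild(((u, sibling), v))))
        visit(left, lambda new: rebuild((new, right)))
        visit(right, lambda new: rebuild((left, new)))

    visit(tree, lambda new: new)
    found.discard(canon(tree))
    return found
```

A hill-climbing search would score each neighbor with the likelihood, move to the best one, and repeat until no neighbor improves. Each internal edge contributes two neighbors, so the neighborhood grows only linearly with the number of populations.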

Gene tree probability P(G|T): (relatively) efficiently computed by the STELLS algorithm (Wu, Evolution, 2012) when G is bifurcating, and can be used in inference.

For a population tree T and a gene tree topology G:
Gene tree probability P(G|T): probability of observing gene tree topology G for population tree T under coalescent theory.

Page 6: A Coalescent-based Method for Population Tree Inference with Haplotypes

Population tree: same tree topologies. Haplotypes: simulated with Hudson’s ms (supports the island model)

• Multiple alleles per population per gene
• Various population tree heights (0.1, 0.5 and 1.0 coalescent units)
• Number of loci: 10, 50, 100, 200, 500

Simulation

Inference: STELLSH infers the population tree from haplotypes

Evaluation: topological error of the inferred population trees
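The slides do not say which measure of topological error was used. One common choice, assumed here purely for illustration, is the Robinson-Foulds distance on rooted trees: the number of clades found in one tree but not the other. Trees are nested tuples of leaf names.

```python
def clade_set(tree):
    """Non-trivial clades (sets of leaf names below internal nodes) of a rooted tree."""
    clades = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        clades.add(below)
        return below

    everything = leaves(tree)
    clades.discard(everything)  # the root clade is shared by every tree
    return clades

def rf_distance(tree1, tree2):
    """Rooted Robinson-Foulds distance: size of the clade symmetric difference."""
    return len(clade_set(tree1) ^ clade_set(tree2))

true_tree = ((("a", "b"), "c"), "d")
inferred = (("a", "b"), ("c", "d"))
error = rf_distance(true_tree, inferred)
```

A distance of 0 means the inferred topology is exactly right; averaging the distance over simulation replicates gives an error curve that can be plotted against the number of loci.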

Assume: no migration; no intra-locus recombination.

Moderate migration or recombination: accurate inference
Strong migration or high recombination: less accurate

Accuracy: higher with more loci

[Figure: inference error versus number of loci]

Page 7: A Coalescent-based Method for Population Tree Inference with Haplotypes

Comparison with TreeMix on simulated data
STELLSH (solid lines): up to 4 alleles per population
TreeMix (dashed lines): up to 100 alleles per population
STELLSH is more accurate than TreeMix, even though TreeMix uses 25 times more data.

Research supported by National Science Foundation under grants IIS-0803440 and CCF-1116175

Paper: “A Coalescent-based Method for Population Tree Inference with Haplotypes”, Yufeng Wu, submitted for publication, 2014.
Paper: “Coalescent-based Species Tree Inference from Gene Tree Topologies Under Incomplete Lineage Sorting by Maximum Likelihood”, Yufeng Wu, Evolution, v. 66 (3), p. 763-775, 2012.

Conclusion:
• Haplotypes can be more informative than individual SNPs.
• Simplifying the likelihood function may lead to faster algorithms for inference.

Also analyzed part of the 1000 Genomes Project data to infer population trees for 10 populations: CHB, JPT, CHS, CEU, TSI, FIN, GBR, IBS, YRI, and LWK.
