a coalescent-based method for population tree inference with haplotypes
A Coalescent-based Method for Population Tree Inference with Haplotypes
Yufeng Wu
Dept. of Computer Science & Engineering
University of Connecticut, USA
Cold Spring Harbor Asia Meeting, Suzhou, China, 2014
H: haplotypes at SNPs
a (A): AAGCCAATTCCGAACAAGA
b (B): ACGCCAATTCCGGACAAGA
c (C): ACGCCTATTCCGGACAAGA
d (D): AAGCCAATTCCGAACCAGA
P(H|T): probability of H given T under coalescent models
Population Tree: Population split history (including order and time); not known
[Figure: population tree over populations A, B, C, D with sampled gene lineages a, b, c, d]
Coalescence
Mutation
Coalescent genealogical tree: underlying genetic model
Population tree inference: given haplotypes H from multiple loci, infer the population tree T.
MLE of T: find T maximizing P(H|T)
Locus (gene): genomic region
[Figure: coalescent genealogy drawn along a time axis; SNP sites 1 2 3 4; haplotypes AAAA, CAGA, CTGA, AAAC]
Challenge: P(H|T) is difficult to compute even for single population
This talk: likelihood based population tree inference from haplotypes. Assumptions: (1) No intra-locus recombination and (2) infinite sites model of mutations
Common simplification: treating haplotypes as unlinked variants (SNPs). P(H|T) ≈ P(S1|T)P(S2|T)P(S3|T)…, Si: ith SNP of H. See, e.g. SNAPP (Bryant, et al., MBE, 2012), TreeMix (Pickrell and Pritchard, PLoS Genet, 2012)
Single SNPs: potential loss of information in haplotypes.
Fact 1: haplotypes H imply a unique (non-bifurcating) genealogical tree called the perfect phylogeny TH
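Sites: 1 2 3 4
a: AAAA
b: CAGA
c: CTGA
d: AAAC
[Figure: perfect phylogeny TH rooted at AAAA, with leaves c, b, a, d and branches labeled by the mutated sites 1, 2, 3, 4]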
Fact 2: under the infinite sites model, P(H|T) = P(TH|T).
Unfortunately, computing P(TH|T) is still non-trivial.
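Fact 1 can be illustrated on the four example haplotypes. The sketch below is only an illustration, not the method from the talk: the function names are hypothetical, and it assumes that the ancestral sequence is AAAA (the root label in the figure) and that every site is biallelic. Under those assumptions a rooted perfect phylogeny exists exactly when the sets of samples carrying each derived allele are pairwise nested or disjoint.

```python
def derived_sets(haplotypes, ancestral):
    """Map each site (1-based) to the set of samples carrying a non-ancestral allele."""
    sites = {}
    for pos in range(len(ancestral)):
        carriers = frozenset(
            name for name, seq in haplotypes.items() if seq[pos] != ancestral[pos]
        )
        if carriers:
            sites[pos + 1] = carriers
    return sites


def is_perfect_phylogeny(sites):
    """Rooted compatibility test: every pair of carrier sets is nested or disjoint."""
    sets = list(sites.values())
    return all(s <= t or t <= s or not (s & t) for s in sets for t in sets)


def clades_with_sites(sites):
    """Group sites by carrier set; each distinct set is one clade of the tree."""
    groups = {}
    for site, carriers in sites.items():
        groups.setdefault(carriers, []).append(site)
    return groups


# Example data from the slide; the ancestral sequence AAAA is an assumption.
haplotypes = {"a": "AAAA", "b": "CAGA", "c": "CTGA", "d": "AAAC"}
sites = derived_sets(haplotypes, "AAAA")
compatible = is_perfect_phylogeny(sites)
groups = clades_with_sites(sites)

for carriers, site_list in groups.items():
    print(sorted(carriers), "supported by sites", site_list)
```

On this input, sites 1 and 3 define the clade {b, c}, site 2 separates c, site 4 separates d, and a carries no mutations, so the root has three children and the tree is non-bifurcating.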
SNP vs. Haplotype
G’: genealogical topology implied by haplotypes H (ignore mutations on genealogy G).
Key Assumption: P(G|T) ≈ P(G’|T)
Ignore mutations
Inference of population tree T: maximizing P(G’1|T)P(G’2|T)P(G’3|T)…
G’i: gene genealogical topology of the ith locus
Use G to refer to genealogical topology
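Written out, the simplified objective is a product over loci, which is usually maximized as a sum of logs. The notation below is a sketch: the number of loci n and the estimate \hat{T} are symbols introduced here for illustration, and the product form treats the loci as independent, as the slide's product does.

```latex
\hat{T}
  = \arg\max_{T} \prod_{i=1}^{n} P(G'_i \mid T)
  = \arg\max_{T} \sum_{i=1}^{n} \log P(G'_i \mid T)
```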
Simplification of Likelihood
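[Figure: genealogy G on leaves c, b, a, d with mutations at sites 1, 2, 3, 4, next to the topology G’ on the same leaves with mutations ignored]
[Figure: gene tree on lineages a, b, c, d drawn inside the population tree on populations A, B, C, D]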
Genealogical topology G and population tree T:
Gene lineages b and c coalesce first, so populations B and C are likely to be more closely related.
But not always…
Incomplete lineage sorting: gene tree topology is stochastic
STELLSH: infer population trees from haplotypes
Gene tree probability for non-bifurcating topology: sum over all compatible bifurcating topologies. Can be more efficiently computed: Wu, manuscript, 2014.
Issue: perfect phylogeny from haplotypes usually non-bifurcating
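To see why summing over all compatible bifurcating topologies gets expensive, it helps to count them. The sketch below is not the efficient algorithm mentioned above; it only counts resolutions, using the standard fact that a node with k children can be resolved into (2k-3)!! rooted binary shapes and that different nodes resolve independently. Trees are written as nested tuples, a representation chosen here for illustration.

```python
def binary_shapes(k):
    """(2k-3)!!: number of rooted binary topologies on k labeled subtrees (k >= 2)."""
    result = 1
    for m in range(3, 2 * k - 2, 2):
        result *= m
    return result


def count_resolutions(tree):
    """Count bifurcating trees compatible with a (possibly multifurcating) tree.

    A tree is a leaf name (str) or a tuple of at least two subtrees.
    """
    if isinstance(tree, str):
        return 1
    total = binary_shapes(len(tree))
    for child in tree:
        total *= count_resolutions(child)
    return total


# Leaf-labeled shape of the example perfect phylogeny: the root has children
# a, the clade (b, c), and d.
example = ("a", ("b", "c"), "d")
n_example = count_resolutions(example)

# A fully unresolved (star) tree on 8 lineages, for comparison.
n_star8 = count_resolutions(tuple("abcdefgh"))

print(n_example, n_star8)
```

The example tree has only three resolutions, but a single unresolved node with eight children already has 135135, which is why a direct sum does not scale.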
STELLSH: maximizing probability of all gene topologies, by optimizing topology and branch lengths of population tree (e.g. nearest neighbor interchange)
Gene tree probability P(G|T): (relatively) efficiently computed by the STELLS algorithm (Wu, Evolution, 2012) when G is bifurcating, and can be used in inference.
For population tree T and a gene tree topology G:
Gene tree probability P(G|T): probability of observing a gene tree topology G for population tree T under coalescent theory.
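The smallest case of a gene tree probability has a closed form, which makes the definition concrete. The sketch below is not the STELLS algorithm; it covers only the classic three-population tree ((A,B),C) with one lineage per population, where the internal branch has length t in coalescent units. The function name is hypothetical, and the branch lengths tried are illustrative values only.

```python
import math


def three_taxon_gene_tree_probs(t):
    """Gene tree topology probabilities for population tree ((A,B):t,C).

    Lineages a and b fail to coalesce on the internal branch with probability
    exp(-t); after that, each of the three pairs is equally likely to coalesce first.
    """
    no_coalescence = math.exp(-t)
    discordant = no_coalescence / 3.0
    return {
        "((a,b),c)": 1.0 - 2.0 * discordant,  # matches the population tree
        "((a,c),b)": discordant,
        "((b,c),a)": discordant,
    }


table = {t: three_taxon_gene_tree_probs(t) for t in (0.1, 0.5, 1.0)}
for t, probs in table.items():
    print(t, {topology: round(p, 3) for topology, p in probs.items()})
```

Short internal branches leave the matching topology only slightly more likely than the two discordant ones, which is incomplete lineage sorting in its simplest form.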
Population tree: same tree topologies. Haplotypes: simulated with Hudson’s ms (supports the island model)
• Multiple alleles per population per gene
• Various population tree heights (0.1, 0.5 and 1.0 coalescent units)
• Number of loci: 10, 50, 100, 200, 500
Simulation
Inference: STELLSH (infer population tree from haplotypes)
Evaluation: topological error of inferred population trees
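The slides do not say which measure of topological error was used. One common choice, assumed here purely for illustration, is a Robinson-Foulds style distance on rooted trees: count the clades that appear in one tree but not the other. Trees are nested tuples of population names, and the two trees below are made-up examples, not results from the talk.

```python
def clade_set(tree):
    """Return the non-trivial clades (leaf sets under internal nodes) of a rooted tree."""
    clades = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        clades.add(below)
        return below

    all_leaves = leaves(tree)
    clades.discard(all_leaves)  # the root clade is shared by every tree
    return clades


def rooted_rf(tree1, tree2):
    """Number of clades found in exactly one of the two trees."""
    return len(clade_set(tree1) ^ clade_set(tree2))


# Hypothetical true and inferred population trees on populations A, B, C, D.
true_tree = ((("A", "B"), "C"), "D")
inferred_tree = ((("A", "C"), "B"), "D")
error = rooted_rf(true_tree, inferred_tree)
print(error)
```

Here the two trees share the clade {A, B, C} but disagree on {A, B} versus {A, C}, so the distance is 2; identical trees give 0.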
Assume: no migration; no intra-locus recombination.
Moderate migration or recombination: accurate inference
Strong migration or high recombination: less accurate
Accuracy: higher with more loci
[Figure: inference error (y-axis) against number of loci (x-axis)]
Compare with TreeMix (simulation data)
STELLSH (solid lines): up to 4 alleles per population
TreeMix (dashed lines): up to 100 alleles per population
STELLSH is more accurate than TreeMix, even though TreeMix uses 25 times more data.
Research supported by National Science Foundation under grants IIS-0803440 and CCF-1116175
Paper: “A Coalescent-based Method for Population Tree Inference with Haplotypes”, Yufeng Wu, submitted for publication, 2014.
Paper: “Coalescent-based Species Tree Inference from Gene Tree Topologies Under Incomplete Lineage Sorting by Maximum Likelihood”, Yufeng Wu, Evolution, v. 66 (3), p. 763-775, 2012.
Conclusion:
• Haplotypes can be more informative than individual SNPs
• Simplifying the likelihood function may lead to faster algorithms to use in inference
Also analyzed part of the 1000 Genomes Project data to infer population trees for 10 populations: CHB, JPT, CHS, CEU, TSI, FIN, GBR, IBS, YRI, and LWK.