Quantifying MCMC exploration of phylogenetic tree space Christopher Whidden and Frederick “Erick” A. Matsen IV Fred Hutchinson Cancer Research Center http://matsen.fhcrc.org @ematsen

Upload: erick-matsen

Post on 20-Jun-2015


DESCRIPTION

A talk given at the Algorithms for Threat Detection Program Review, March 10th, 2014.

TRANSCRIPT

Page 1: Quantifying MCMC exploration of phylogenetic tree space

Quantifying MCMC exploration of phylogenetic tree space

Christopher Whidden and Frederick “Erick” A. Matsen IV
Fred Hutchinson Cancer Research Center

http://matsen.fhcrc.org @ematsen

Page 2: Quantifying MCMC exploration of phylogenetic tree space

Phylogenetics: reconstruct evolutionary history from DNA

[Figure: "phylogenetics" connects DNA or RNA sequence data to a tree relating armadillo, giraffe, rat, and human.]

Page 3: Quantifying MCMC exploration of phylogenetic tree space

Phylogenetics helps us learn how HIV-1 came to be

Etienne, Hahn, Sharp, Matsen and Emerman, Cell Host & Microbe, 2013

Page 4: Quantifying MCMC exploration of phylogenetic tree space

We are fond of statistical approaches to phylogenetics

These are important when one would like a clear notion of uncertainty (like medicine, epidemiology, and biodefense!)

Page 5: Quantifying MCMC exploration of phylogenetic tree space

We are fond of statistical approaches to phylogenetics

In particular, Bayesian methods fall into this category and have become quite popular.

ACATGGCTC... ATACGTTCC... TTACGGTTC... ATCCGGTAC... ATACAGTCT...

...

We can’t solve for this posterior distribution, but we can satisfy our needs by getting a big sample from it.

Page 6: Quantifying MCMC exploration of phylogenetic tree space

Markov chain Monte Carlo (MCMC)

Metropolis et al., 1953.

Set up a simulation such that the amount of time spent in a given state is proportional to the posterior probability of that state.
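To make the "time spent is proportional to posterior probability" idea concrete, here is a minimal sketch of a Metropolis chain on a toy state space. Everything in it is an assumption for illustration: four states arranged in a ring, made-up unnormalized weights, and the hypothetical function name `metropolis_visits`. It is not tied to any phylogenetics software.

```python
import random

def metropolis_visits(weights, n_steps, seed=0):
    """Count visits to each state of a ring under a Metropolis chain.

    weights[i] is an unnormalized posterior weight for state i (toy values).
    """
    rng = random.Random(seed)
    n = len(weights)
    state = 0
    visits = [0] * n
    for _ in range(n_steps):
        # Symmetric proposal: step to the left or right neighbor on the ring.
        proposal = (state + rng.choice((-1, 1))) % n
        # Accept with probability min(1, weight ratio); otherwise stay put.
        if rng.random() < weights[proposal] / weights[state]:
            state = proposal
        visits[state] += 1
    return visits

weights = [1.0, 2.0, 3.0, 2.0]  # hypothetical unnormalized posterior
visits = metropolis_visits(weights, n_steps=200_000)
fractions = [v / sum(visits) for v in visits]
target = [w / sum(weights) for w in weights]
```

After enough steps, `fractions` should sit close to `target`, even though the chain only ever looks at ratios of weights and never at the normalizing constant.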

Page 7: Quantifying MCMC exploration of phylogenetic tree space

Here we want a posterior on trees

If we want to use the same strategy to get a posterior on phylogenetic trees...

ACATGGCTC... ATACGTTCC... TTACGGTTC... ATCCGGTAC... ATACAGTCT...

...

we need a way to move from one phylogenetic tree to another.

Page 8: Quantifying MCMC exploration of phylogenetic tree space

Subtree-prune-regraft (SPR) definition

[Figure: an SPR move on a tree with leaves 1 to 6, in which the subtree on leaves 2 and 3 is pruned and regrafted.]
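As a rough illustration of the definition, the sketch below enumerates every tree that is one SPR move away from a small tree. It makes simplifying assumptions that are not from the slides: trees are rooted and binary, written as nested Python tuples with string leaf names, and regrafting above the root is allowed. The function names are hypothetical.

```python
def canon(tree):
    """Return a canonical form so that child order does not matter."""
    if not isinstance(tree, tuple):
        return tree
    a, b = canon(tree[0]), canon(tree[1])
    return (a, b) if repr(a) <= repr(b) else (b, a)

def prune_options(tree):
    """Yield (pruned_subtree, remainder) for every non-root node of tree."""
    if isinstance(tree, tuple):
        left, right = tree
        yield left, right   # pruning a child of the root leaves its sibling
        yield right, left
        for pruned, rest in prune_options(left):
            yield pruned, (rest, right)
        for pruned, rest in prune_options(right):
            yield pruned, (left, rest)

def regraft_options(tree, subtree):
    """Yield every tree made by attaching subtree to one edge of tree."""
    yield (tree, subtree)   # attach on the edge above this node
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in regraft_options(left, subtree):
            yield (new_left, right)
        for new_right in regraft_options(right, subtree):
            yield (left, new_right)

def spr_neighbors(tree):
    """Return the set of distinct trees one rooted SPR move away from tree."""
    start = canon(tree)
    found = set()
    for pruned, rest in prune_options(start):
        for candidate in regraft_options(rest, pruned):
            found.add(canon(candidate))
    found.discard(start)  # regrafting in the original place is not a move
    return found

three_leaf = (("a", "b"), "c")
four_leaf = ((("a", "b"), "c"), "d")
```

With three leaves there are only three rooted trees, so `spr_neighbors(three_leaf)` contains the other two. The neighborhood grows quickly with the number of leaves, which is part of why tree space is hard to explore.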

Page 9: Quantifying MCMC exploration of phylogenetic tree space

The set of trees as a graph connected by SPR moves (Figure from Mossel and Vigoda, Science, 2005).

Page 10: Quantifying MCMC exploration of phylogenetic tree space

This graph is connected, and every tree has nonzero posterior probability, so MCMC works†

We are guaranteed to converge to the posterior distribution on trees by using Metropolis-Hastings moves built on these SPRs.

That is, by bouncing around “tree space” we can get a good idea of a set of good trees.

† That is, it works if we run the MCMC forever

Page 11: Quantifying MCMC exploration of phylogenetic tree space

We can’t run it forever.

News flash:

5 million < ∞

Page 12: Quantifying MCMC exploration of phylogenetic tree space

With pathological data, it can be hard to move between peaks

[Figure: plot with an axis labeled "goodness"]

Page 13: Quantifying MCMC exploration of phylogenetic tree space

We wanted to know: does this happen in real data sets?

Lots of discussion in literature, but few clear conclusions.

In order to understand the reasons differentiating “easy” and “difficult” data sets for phylogenetic MCMC, we wanted to make it possible to visualize tree space with a relevant geometry.

So, what trees are close to each other in terms of SPR moves?

Page 14: Quantifying MCMC exploration of phylogenetic tree space

dSPR: how many SPR moves from one tree to another?

Say $T_1 \to T_2$ if there is an SPR transformation of $T_1$ to $T_2$.

$$d_{\mathrm{SPR}}(T, S) = \min_{T \to T_1 \to \cdots \to T_k = S} k$$

This distance is NP-hard to compute. That’s no fun!
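The definition says that dSPR is just shortest-path distance in the SPR graph. As a minimal sketch of that idea only, the breadth-first search below finds the fewest moves between two nodes of any graph described by a neighbor function. The toy graph and the name `graph_distance` are assumptions for illustration. This brute-force approach is hopeless for real trees, because the number of trees explodes with the number of leaves, and it is not the exact algorithm mentioned on the next slide.

```python
from collections import deque

def graph_distance(start, goal, neighbors):
    """Fewest moves from start to goal; neighbors(x) lists one-move neighbors."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # goal is not reachable from start

# Hypothetical four-node path graph standing in for a tiny piece of tree space.
toy_graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
d = graph_distance("A", "D", toy_graph.__getitem__)
```

Here `d` counts the three moves A to B, B to C, and C to D. Passing a function that lists SPR neighbors in place of `toy_graph.__getitem__` would give dSPR for very small trees.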

Page 15: Quantifying MCMC exploration of phylogenetic tree space

Meet Chris Whidden, algorithms strongman

In a series of four very technical papers, Chris took exact computation of dSPR from O(infeasible) to O(feasible).

Then he joined my group!

Page 16: Quantifying MCMC exploration of phylogenetic tree space

Let’s take some common data sets and see what we see

These are completely standard data sets of the sort that biologists analyze every day: slowly evolving nuclear, mitochondrial, or chloroplast genes.

Also used as examples in:

- Lakner et al., Syst. Biol., 2008
- Höhna and Drummond, Syst. Biol., 2012
- Larget, Syst. Biol., 2013

Page 17: Quantifying MCMC exploration of phylogenetic tree space

Interested in high probability subsets of the SPR graph

Page 18: Quantifying MCMC exploration of phylogenetic tree space

Summarize by subsetting to high probability nodes

Node size is proportional to posterior probability, and color shows distance to the highest posterior probability (PP) tree.
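One simple way to do this kind of subsetting is sketched below. It assumes the MCMC output has already been reduced to a list of topology labels, one per sample, and that sample frequency is used as the posterior probability estimate. The function names and the labels "T1", "T2", "T3" are made up for the example.

```python
from collections import Counter

def top_trees(sampled_topologies, k):
    """Return the k most frequently sampled topologies with their frequencies."""
    counts = Counter(sampled_topologies)
    total = len(sampled_topologies)
    return [(topo, n / total) for topo, n in counts.most_common(k)]

def credible_set(sampled_topologies, mass=0.95):
    """Return the most frequent topologies that together reach the given mass."""
    counts = Counter(sampled_topologies)
    total = len(sampled_topologies)
    chosen, covered = [], 0
    for topo, n in counts.most_common():
        chosen.append(topo)
        covered += n
        if covered >= mass * total:
            break
    return chosen

# Hypothetical sample: ten draws over three topologies.
samples = ["T1"] * 6 + ["T2"] * 3 + ["T3"]
best_two = top_trees(samples, 2)
most_of_the_mass = credible_set(samples, mass=0.8)
```

The first function matches a "top N trees" view and the second a "trees with 95% of posterior probability" view. Either subset can then be drawn as a graph with SPR moves as edges.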

Page 19: Quantifying MCMC exploration of phylogenetic tree space

The top 4096 trees for a data set

Page 20: Quantifying MCMC exploration of phylogenetic tree space

The top 4096 trees for a data set

What's up with this stuff?

Is it important? Is it difficult for the MCMC to see?

Page 21: Quantifying MCMC exploration of phylogenetic tree space

Commute time definition

Commute time for a node y: how long to make the round trip from y to the highest posterior probability tree and back?

Any round trip path counts!
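For intuition, here is a small sketch that computes an expected commute time exactly on a toy graph. It assumes a simple random walk that moves to a uniformly chosen neighbor at each step, which is a simplification: the chain in a real analysis is a Metropolis-Hastings sampler weighted by the posterior. The graph and function names are hypothetical.

```python
from fractions import Fraction

def hitting_times(adj, target):
    """Expected number of steps to first reach target from every node."""
    nodes = [v for v in adj if v != target]
    idx = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    # Augmented system: h(v) - (1/deg v) * sum of h(u) over neighbors u = 1.
    rows = [[Fraction(0)] * (n + 1) for _ in range(n)]
    for v in nodes:
        i = idx[v]
        rows[i][i] += 1
        rows[i][n] = Fraction(1)
        for u in adj[v]:
            if u != target:  # h(target) is zero, so it drops out
                rows[i][idx[u]] -= Fraction(1, len(adj[v]))
    # Gauss-Jordan elimination with exact rational arithmetic.
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [x / rows[col][col] for x in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    times = {v: rows[idx[v]][n] for v in nodes}
    times[target] = Fraction(0)
    return times

def commute_time(adj, x, y):
    """Expected steps for the round trip from x to y and back to x."""
    return hitting_times(adj, y)[x] + hitting_times(adj, x)[y]

# Hypothetical three-node path graph a - b - c.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
round_trip = commute_time(path, "a", "c")
```

On this path graph the round trip between the two end nodes takes 8 steps on average, even though they are only 2 moves apart. The expectation averages over every possible round trip, including the wandering ones, so a bottleneck between two regions shows up as a large commute time.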

Page 23: Quantifying MCMC exploration of phylogenetic tree space

Commute time plot for this data set

Page 24: Quantifying MCMC exploration of phylogenetic tree space

The separation is problematic indeed

Yep, those parts of the posterior are important and MCMC has trouble entering them.

Page 25: Quantifying MCMC exploration of phylogenetic tree space

Trees with 95% of posterior probability for another data set

Page 26: Quantifying MCMC exploration of phylogenetic tree space

We can use our methods to identify source of bottlenecks

[Figure: two trees on the same 27 taxa: Bufo_valliceps, Ambystoma_mexicanum, Heterodon_platyrhinos, Gastrophryne_carolinensis, Rattus_norvegicus, Eleutherodactylus_cuneatus, Scaphiopus_holbrooki, Typhlonectes_natans, Grandisonia_alternans, Xenopus_laevis, Siren_intermedia, Amphiuma_tridactylum, Turdus_migratorius, Sceloporus_undulatus, Discoglossus_pictus, Homo_sapiens, Mus_musculus, Oryctolagus_cuniculus, Nesomantis_thomasseti, Plethodon_yonhalossee, Gallus_gallus, Ichthyophis_bannanicus, Hypogeophis_rostratus, Alligator_mississippiensis, Trachemys_scripta, Hyla_cinerea, Latimeria_chalumnae.]

These are the trees at the two peaks of the connected components. Indeed, it’s very tricky to get between them!

Page 27: Quantifying MCMC exploration of phylogenetic tree space

Multidimensional scaling visualizations via dSPR

Page 28: Quantifying MCMC exploration of phylogenetic tree space

In general, a new way to explore tree space

Page 29: Quantifying MCMC exploration of phylogenetic tree space

Our applications: it’s party time

- Automatic identification of (multiple) peaks in posteriors
- Performance of Metropolis-coupled Markov chain Monte Carlo for getting between peaks
- Accuracy of new “mean-field” posterior probability approximations
- The first topological convergence diagnostic

These empirical investigations set the stage for additional theoretical development, and suggest new ways to move around tree space.

This will translate into better phylogenetic uncertainty estimates, and hence better preparedness and response to biological threats.

Page 30: Quantifying MCMC exploration of phylogenetic tree space

Thank you

- Robert Beiko (Dalhousie University)
- Aaron Darling (University of Technology, Sydney)
- Connor McCoy (Fred Hutchinson Cancer Research Center)
- NSF award 1223057