algorithms for inferring the tree of life

27
Algorithms for Inferring the Tree of Life Tandy Warnow Dept. of Computer Science The University of Texas at Austin

Upload: tashya-hudson

Post on 31-Dec-2015

23 views

Category:

Documents


1 download

DESCRIPTION

Algorithms for Inferring the Tree of Life. Tandy Warnow Dept. of Computer Science The University of Texas at Austin. Phylogeny. From the Tree of the Life Website, University of Arizona. Orangutan. Human. Gorilla. Chimpanzee. Evolution informs about everything in biology. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Algorithms for Inferring the Tree of Life

Algorithms for Inferring the Tree of Life

Tandy Warnow

Dept. of Computer Science

The University of Texas at Austin

Page 2: Algorithms for Inferring the Tree of Life

Phylogeny

Orangutan Gorilla Chimpanzee Human

From the Tree of the Life Website,University of Arizona

Page 3: Algorithms for Inferring the Tree of Life

Evolution informs about everything in biology

• Big genome sequencing projects just produce data – so what?

• Evolutionary history relates all organisms and genes, and helps us understand and predict – interactions between genes (genetic networks)

– drug design

– predicting functions of genes

– influenza vaccine development

– origins and spread of disease

– origins and migrations of humans

Page 4: Algorithms for Inferring the Tree of Life

Reconstructing the “Tree” of Life

Handling large datasets: Handling large datasets: millions of species, millions of species, NP-hard problems, NP-hard problems, Lots of computer science Lots of computer science research to doresearch to do

Page 5: Algorithms for Inferring the Tree of Life

Steps in a phylogenetic analysis

• Gather data• Align sequences• Estimate phylogeny on the multiple

alignment • Estimate the reliable aspects of the

evolutionary history (using bootstrapping, consensus trees, or other methods)

• Perform post-tree analyses.

Page 6: Algorithms for Inferring the Tree of Life

DNA Sequence Evolution

AAGACTT

TGGACTTAAGGCCT

-3 mil yrs

-2 mil yrs

-1 mil yrs

today

AGGGCAT TAGCCCT AGCACTT

AAGGCCT TGGACTT

TAGCCCA TAGACTT AGCGCTTAGCACAAAGGGCAT

AGGGCAT TAGCCCT AGCACTT

AAGACTT

TGGACTTAAGGCCT

AGGGCAT TAGCCCT AGCACTT

AAGGCCT TGGACTT

AGCGCTTAGCACAATAGACTTTAGCCCAAGGGCAT

Page 7: Algorithms for Inferring the Tree of Life

Phylogeny Problem

TAGCCCA TAGACTT TGCACAA TGCGCTTAGGGCAT

U V W X Y

U

V W

X

Y

Page 8: Algorithms for Inferring the Tree of Life

1. Hill-climbing heuristics for hard optimization criteria (Maximum Parsimony and Maximum Likelihood)

Phylogenetic reconstruction methods

Phylogenetic trees

Cost

Global optimum

Local optimum

2. Polynomial time distance-based methods: UPGMA, Neighbor Joining, FastME, Weighbor, etc.

Page 9: Algorithms for Inferring the Tree of Life

Performance criteria

• Running time.• Space.• Statistical performance issues (e.g., statistical

consistency) with respect to a Markov model of evolution.

• “Topological accuracy” with respect to the underlying true tree. Typically studied in simulation.

• Accuracy with respect to a particular criterion (e.g. tree length or likelihood score), on real data.

Page 10: Algorithms for Inferring the Tree of Life

How can we infer evolution?

While there are more than two sequences, DO

• Find the “closest” pair of sequences and make them siblings

• Replace the pair by a single sequence

Page 11: Algorithms for Inferring the Tree of Life

That was called “UPGMA”

• Advantages: UPGMA is polynomial time and works well under the “strong molecular clock” hypothesis.

• Disadvantages: UPGMA does not work well in simulations, perhaps because the molecular clock hypothesis does not generally apply.

• Other polynomial time methods, also distance-based, work better. One of the best of these is Neighbor Joining.

Page 12: Algorithms for Inferring the Tree of Life

Quantifying Error

FN: false negative (missing edge)FP: false positive (incorrect edge)

50% error rate

FN

FP

Page 13: Algorithms for Inferring the Tree of Life

Neighbor joining has poor performance on large diameter trees [Nakhleh et al. ISMB 2001]

Simulation study based upon fixed edge lengths, K2P model of evolution, sequence lengths fixed to 1000 nucleotides.

Error rates reflect proportion of incorrect edges in inferred trees.

NJ

0 400 800 16001200No. Taxa

0

0.2

0.4

0.6

0.8

Err

or R

ate

Page 14: Algorithms for Inferring the Tree of Life

• Other standard polynomial time methods don’t improve substantially on NJ (and have the same problem with large diameter datasets).

• What about trying to “solve” maximum parsimony or maximum likelihood?

Page 15: Algorithms for Inferring the Tree of Life

Maximum Parsimony

• Input: Set S of n aligned sequences of length k• Output:

– A phylogenetic tree T leaf-labeled by sequences in S

– additional sequences of length k labeling the internal nodes of T

such that is minimized, where H(i,j) denotes the Hamming distance between sequences at nodes i and j

∑∈ )(),(

),(TEji

jiH

Page 16: Algorithms for Inferring the Tree of Life

Maximum parsimony (example)

• Input: Four sequences– ACT– ACA– GTT– GTA

• Question: which of the three trees has the best MP scores?

Page 17: Algorithms for Inferring the Tree of Life

Maximum Parsimony

ACT

GTT ACA

GTA ACA ACT

GTAGTT

ACT

ACA

GTT

GTA

Page 18: Algorithms for Inferring the Tree of Life

Maximum Parsimony

ACT

GTT

GTT GTA

ACA

GTA

12

2

MP score = 5

ACA ACT

GTAGTT

ACA ACT

3 1 3

MP score = 7

ACT

ACA

GTT

GTAACA GTA

1 2 1

MP score = 4

Optimal MP tree

Page 19: Algorithms for Inferring the Tree of Life

Maximum Parsimony: computational complexity

ACT

ACA

GTT

GTAACA GTA

1 2 1

MP score = 4

Finding the optimal MP tree is NP-hard

Optimal labeling can becomputed in linear time O(nk)

Page 20: Algorithms for Inferring the Tree of Life

Solving NP-hard problems exactly is … unlikely

• Number of (unrooted) binary trees on n leaves is (2n-5)!!

• If each tree on 1000 taxa could be analyzed in 0.001 seconds, we would find the best tree in

2890 millennia

#leaves #trees

4 3

5 15

6 105

7 945

8 10395

9 135135

10 2027025

20 2.2 x 1020

100 4.5 x 10190

1000 2.7 x 102900

Page 21: Algorithms for Inferring the Tree of Life

1. Hill-climbing heuristics (which can get stuck in local optima)2. Randomized algorithms for getting out of local optima3. Approximation algorithms for MP (based upon Steiner Tree

approximation algorithms).

Approaches for “solving” MP/ML

Phylogenetic trees

Cost

Global optimum

Local optimum

Page 22: Algorithms for Inferring the Tree of Life

Problems with current techniques for MP

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

0.2

0 4 8 12 16 20 24

Hours

Average MP score above

optimal, shown as a percentage of

the optimal

Shown here is the performance of a heuristic maximum parsimony analysis on a real dataset of almost 14,000 sequences. (“Optimal” here means best score to date, using any method for any amount of time.) Acceptable error is below 0.01%.

Performance of TNT with time

Page 23: Algorithms for Inferring the Tree of Life

Observations

• The best MP heuristics cannot get acceptably good solutions within 24 hours on most of these large datasets.

• Datasets of these sizes may need months (or years) of further analysis to reach reasonable solutions.

• Apparent convergence can be misleading.

Page 24: Algorithms for Inferring the Tree of Life

Empirical problems with existing methods

• Heuristics for Maximum Parsimony (MP) and Maximum Likelihood (ML) cannot handle large datasets (take too long!) – we need new heuristics for MP/ML that can analyze large datasets

• Polynomial time methods have poor topological accuracy on large diameter datasets – we need better polynomial time methods

Page 25: Algorithms for Inferring the Tree of Life

My research• Focused on the design and analysis of algorithms for

large-scale phylogeny reconstruction and multiple sequence alignment.

• Objective: the design of new algorithms with better performance than existing algorithms, as evidenced by mathematical theory, experiment, or empirical studies.

• Collaborations with biologists for modelling and data analysis.

• Current group: four PhD students, one postdoc, and two undergrads.

Page 26: Algorithms for Inferring the Tree of Life

What happens after the analysis?

• The result of a phylogenetic analysis is often thousands (or tens of thousands) of equally good trees. – How should we analyze the set of trees?– How can we store the set of trees?

• Current approaches use consensus methods, as well as other techniques, to try to infer what is likely to be the characteristics of the “true tree”. Current techniques use too much space, take too much time, and are not sufficiently informative.

Page 27: Algorithms for Inferring the Tree of Life

General comments

• There is interesting computer science research to be done in computational phylogenetics, with a tremendous potential for impact.

• Algorithm development must be tested on both real and simulated data.

• The interplay between data, stochastic models of evolution, optimization problems, and algorithms, is important and instructive.