How to See a Tree for a Forest? Combining Phylogenetic Trees – Reasons, Methods, and Consequences

Tanya Y. Berger-Wolf, Laboratory for High-Performance Algorithm Engineering and Computational Biology, Dept. of Computer Science, University of New Mexico, www.compbio.unm.edu

Post on 20-Dec-2015

TRANSCRIPT

Page 1: How to See a Tree for a Forest? Combining Phylogenetic Trees – Reasons, Methods, and Consequences Tanya Y. Berger-Wolf Laboratory for High-Performance

How to See a Tree for a Forest?
Combining Phylogenetic Trees – Reasons, Methods, and Consequences

Tanya Y. Berger-Wolf
Laboratory for High-Performance Algorithm Engineering and Computational Biology
Dept. of Computer Science
University of New Mexico

www.compbio.unm.edu

Page 2:

Phylogeny Reconstruction

[Figure: Orangutan, Chimpanzee, Human, Gorilla]

Page 3:

Phylogeny Reconstruction

1. Get an estimate of evolutionary distance between species

2. Treat the species as a set of points with pairwise distance measure

3. Find a tree that optimizes {parsimony, likelihood, function of your choice} on that set of points
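Steps 1 and 2 can be sketched in a few lines. This is a minimal illustration, assuming aligned DNA sequences of equal length and using the proportion of differing sites (the p-distance) as the distance estimate. The function names and the toy sequences are hypothetical and do not come from the talk.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(seqs):
    """Pairwise p-distances for every unordered pair of species names."""
    return {(s, t): p_distance(seqs[s], seqs[t])
            for s, t in combinations(sorted(seqs), 2)}


# Toy, made-up sequences (illustration only, not real data).
toy = {"human": "ACGTACGT", "chimp": "ACGTACGA", "gorilla": "ACGAACGA"}
dists = distance_matrix(toy)
```

Step 3, the search for an optimal tree over these points, is the hard part and is not attempted here.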

Page 4:

Overview of My Research

• Computational Phylogeny
  – Comparison of methods that combine trees (greed is bad)
  – Topological accuracy of maximum parsimony
    • Is optimal necessary?
    • How to know when “good enough”?
  – Online consensus and other statistics
  – Heterogeneous data in phylogeny
• Controlled animal breeding strategies

Page 5:

Computational Pitfalls

• Resulting optimization problems are hard

• Existing heuristics expensive on large datasets

• Same score – many topologies

• True tree is unknown

⇓ When to stop and what to return?

Page 6:

Consensus Methods

[Figure: trees on leaves A to E (leaf orders ABCDE, ACBDE, ABCDE) combined with "+" and "=" into a consensus tree]

Consensus is what many people say in chorus but do not believe as individuals

Abba Eban (1915–2002), Israeli diplomat, in "The New Yorker," 23 Apr 1990

Page 7:

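Consensus Methods: Strict
McMorris et al. (83)

[Figure: three input trees on leaves A, B, C, D, E]

Clades of the input trees:
Tree 1: AB, CD, ABCD, ABCDE
Tree 2: AB, ABC, DE, ABCDE
Tree 3: BCD, ABCD, ABCDE

Strict: contains clades common to all trees

[Figure: strict consensus tree on leaves A, B, C, D, E]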
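The strict rule is a set intersection over the clade sets of the input trees. The sketch below applies it to the three trees from this slide, representing each clade as a frozenset of leaf labels. The helper names are illustrative and not from the talk.

```python
def clades(*names):
    """Turn strings such as "AB" into clades (frozensets of leaf labels)."""
    return {frozenset(name) for name in names}


# Clade sets of the three input trees shown on the slide.
trees = [
    clades("AB", "CD", "ABCD", "ABCDE"),
    clades("AB", "ABC", "DE", "ABCDE"),
    clades("BCD", "ABCD", "ABCDE"),
]


def strict_consensus(trees):
    """Clades that appear in every input tree."""
    return set.intersection(*trees)


strict = strict_consensus(trees)
```

For this input only the full leaf set ABCDE survives, so the strict consensus is an unresolved star tree.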

Page 8:

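Consensus Methods: Majority
Margush & McMorris (81), McMorris et al. (83), Barthelemy & McMorris (86)

[Figure: three input trees on leaves A, B, C, D, E]

Clades of the input trees:
Tree 1: AB, CD, ABCD, ABCDE
Tree 2: AB, ABC, DE, ABCDE
Tree 3: BCD, ABCD, ABCDE

Majority: contains clades common to a majority of the trees (here AB, ABCD, and ABCDE)

[Figure: majority consensus tree on leaves A, B, C, D, E]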
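The majority rule can be sketched by counting how many input trees contain each clade and keeping those found in more than half of them. The clade sets are the three trees from this slide; the function name is illustrative.

```python
from collections import Counter


def clades(*names):
    """Turn strings such as "AB" into clades (frozensets of leaf labels)."""
    return {frozenset(name) for name in names}


trees = [
    clades("AB", "CD", "ABCD", "ABCDE"),
    clades("AB", "ABC", "DE", "ABCDE"),
    clades("BCD", "ABCD", "ABCDE"),
]


def majority_consensus(trees):
    """Clades that appear in strictly more than half of the input trees."""
    counts = Counter(c for tree in trees for c in tree)
    return {c for c, k in counts.items() if 2 * k > len(trees)}


majority = majority_consensus(trees)
```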

Page 9:

Stopping Maximum Parsimony
(joint work with T. Williams, B.M.E. Moret, U. Roshan, T. Warnow)

If we return the majority consensus of the top-scoring trees, how early can we stop without changing the outcome? What stopping criteria should we use?

Biological datasets:

• three567: “three-gene” (rbcL, atpB, and 18S) DNA sequences (Soltis et al., 2000)

• aster328: ITS RNA sequences from the plant family Asteraceae (Gutell Lab, ICMB, UT Austin)

• ocho854: rbcL DNA sequences (Goloboff, 1999)

• lipsc439: rDNA sequences of Eukaryotes (Goloboff, 1999)

• john921: Avian Cytochrome b DNA sequences (Johnson, 2001)

• eern476: Metazoan DNA sequences (Goloboff, 1999)

• will2000: Eukaryotic sRNA sequences (Gutell Lab, ICMB, UT Austin)

• rbcL500: rbcL DNA sequences (Rice et al., 1997)

• mari2594: rbcL DNA sequences (Källersjö et al., 1998)

Page 10:

Experiment Design

ATTCGGAAGCGATAGCTGAATCGATCGATCGTATTACGTTAGCTAGTATGCAGCGGAG

Biological dataset

Run parsimony ratchet (PAUP*): 500 iterations, 5 repetitions. Save the tree at each iteration.

Majority consensus of optimal trees (PAUP*)

Output consensus tree

…

Optimal: best-scoring trees in all repetitions

Majority consensus of the best and second-best trees so far

Page 11:

Results

[Plot: rbcL500. x-axis: Iteration (0 to 500); left y-axis: RF rate (%), 0 to 16; right y-axis: MP Score (%), 0.001 to 1. Series: Optimal-best MRC; Best-second best MRC; Score error (from optimal).]

Page 12:

Results

[Four plots, each with x-axis: Iteration (0 to 500); left y-axis: RF rate (%); right y-axis: MP Score (%). Series in each: Optimal-best MRC; Best-second best MRC; Score error (from optimal).]

• aster328: RF rate 0 to 12; MP Score 0.001 to 1
• rbcL500: RF rate 0 to 16; MP Score 0.001 to 1
• ocho854: RF rate 0 to 20; MP Score 0.0001 to 1
• mari2594: RF rate 0 to 20; MP Score 0.0001 to 1
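The plots above report an RF (Robinson-Foulds) rate between consensus trees. A minimal sketch of such a measure on clade sets follows. It assumes the rate is the number of clades found in exactly one of the two trees divided by the total number of clades in both, expressed as a percentage. Normalization conventions vary, and the slides do not say which one was used, so treat that choice and the function name as assumptions.

```python
def clades(*names):
    """Turn strings such as "AB" into clades (frozensets of leaf labels)."""
    return {frozenset(name) for name in names}


def rf_rate(a, b):
    """Percentage of clades that appear in exactly one of the two trees.

    Assumed normalization: divide by the total clade count of both trees.
    """
    total = len(a) + len(b)
    if total == 0:
        return 0.0
    return 100.0 * len(a ^ b) / total


# Two hypothetical trees on leaves A..E, given by their non-trivial clades.
t1 = clades("AB", "CD", "ABCD")
t2 = clades("AB", "ABC", "ABCD")
rate = rf_rate(t1, t2)  # CD and ABC differ: 2 of 6 clades
```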

Page 13:

Online Consensus

Input: T_1, T_2, …, T_k with n leaves, one at a time

Output: Majority Consensus tree M_i of T_1, …, T_i

Solution: Maintain set of clades C with counters

When T_i arrives, need to consider only the clades in T_i and M_{i-1}, total of 2n

Data structure               Time         Space
Self-balancing binary tree   O(n lg n)    O(|C|)
Hash table, h = O(n^2)       O(n)         O(n^2)
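The counter-based solution above can be sketched with a hash table keyed by clade. This is an illustration under simple assumptions: each arriving tree is given as a set of clades, and a Python `Counter` stands in for the hash table. The class and method names are hypothetical. The update only re-examines clades in the new tree and in the previous majority tree, since no other clade can change status.

```python
from collections import Counter


def clades(*names):
    """Turn strings such as "AB" into clades (frozensets of leaf labels)."""
    return {frozenset(name) for name in names}


class OnlineMajority:
    """Maintains the majority consensus clades as trees arrive one at a time."""

    def __init__(self):
        self.counts = Counter()  # clade -> number of trees seen that contain it
        self.seen = 0            # i, the number of trees seen so far
        self.majority = set()    # clades of M_i

    def add(self, tree):
        self.seen += 1
        self.counts.update(tree)
        # Clades in T_i gained a count; clades in M_{i-1} face a higher threshold.
        # Every other clade was below the threshold and stays below it.
        candidates = set(tree) | self.majority
        self.majority = {c for c in candidates
                         if 2 * self.counts[c] > self.seen}
        return self.majority


online = OnlineMajority()
m1 = set(online.add(clades("AB", "CD", "ABCD", "ABCDE")))
m2 = set(online.add(clades("AB", "ABC", "DE", "ABCDE")))
m3 = set(online.add(clades("BCD", "ABCD", "ABCDE")))
```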

Page 14:

Conclusions and Future

• Evidence that the parsimony search can be stopped early

• Need simulations and more data to verify

• Collect statistics other than consensus

• Other stopping criteria

• Different representation of final sets of trees

• Other methods

Page 15:

Wait! There is more!
Part II: Heterogeneous Data

(joint work with Tandy Warnow)

Page 16:

Heterogeneous Data

Molecular data: DNA and genomes

Pros:
• Have distance measure
• Unambiguous
• Many characters

Cons:
• No data for extinct species
• Difficulties with ancient evolutionary events
• Recombination, repeated evolution

Page 17:

Heterogeneous Data

Paleontological, morphological, geographical, historical data

Pros:
• Easy to sample
• Sometimes is the only available information
• Has been used for a century

Cons:
• Character states hard to determine
• Genetic basis not known
• No distance measure
• Subjective

Page 18:

Data As Constraints

Constraints, not distance!

• Positive: these species are together (phylogenetic trees, presence of a morphological character)

• Negative: these species are not together (above + geography, fossils)

• Temporal: these events happened in this order (fossils, history)

• Frequency: this event happens more often than another (adaptation mechanisms)

Page 19:

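Consensus Methods: Greedy

[Figure: three input trees on leaves A, B, C, D, E]

Clades of the input trees:
Tree 1: AB, CD, ABCD, ABCDE
Tree 2: AB, ABC, DE, ABCDE
Tree 3: BCD, ABCD, ABCDE

Greedy: resolves majority by adding compatible clades

[Figure: trees on leaves A, B, C, D, E, shown with the clade lists "AB, CD, ABCD" and "AB, ABC, DE"]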
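A sketch of the greedy rule: visit clades from most to least frequent and keep each one that is compatible (disjoint or nested) with everything kept so far. The tie-breaking order among equally frequent clades is an arbitrary assumption made here for determinism, and the function names are illustrative. On the slide's example the tie decides whether CD or ABC joins the majority clades.

```python
from collections import Counter


def clades(*names):
    """Turn strings such as "AB" into clades (frozensets of leaf labels)."""
    return {frozenset(name) for name in names}


def compatible(a, b):
    """Two clades fit in one tree if they are disjoint or one contains the other."""
    return a.isdisjoint(b) or a <= b or b <= a


def greedy_consensus(trees):
    counts = Counter(c for tree in trees for c in tree)
    # Most frequent first; ties broken by sorted leaf labels (arbitrary choice).
    order = sorted(counts, key=lambda c: (-counts[c], sorted(c)))
    chosen = []
    for c in order:
        if all(compatible(c, d) for d in chosen):
            chosen.append(c)
    return set(chosen)


trees = [
    clades("AB", "CD", "ABCD", "ABCDE"),
    clades("AB", "ABC", "DE", "ABCDE"),
    clades("BCD", "ABCD", "ABCDE"),
]
greedy = greedy_consensus(trees)
```

With this tie-break the result is AB, ABC, ABCD, ABCDE. Ordering CD ahead of ABC would give AB, CD, ABCD, ABCDE instead.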

Page 20:

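Consensus Methods: AMT
Phillips & Warnow (95)

[Figure: three input trees on leaves A, B, C, D, E]

Clades of the input trees:
Tree 1: AB, CD, ABCD, ABCDE
Tree 2: AB, ABC, DE, ABCDE
Tree 3: BCD, ABCD, ABCDE

Asymmetric Median Tree: maximum (weighted) collection of compatible clades

[Figure: diagram over the clades AB, ABC, ABCD, BCD, DE, CD]

AB, CD, ABCD, ABCDE

AB, ABC, ABCD, ABCDE

AB, CD, ABCD, ABCDE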
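The "maximum weighted collection of compatible clades" can be illustrated by brute force on this small example. The sketch assumes each clade's weight is the number of input trees containing it, which the slide does not state. It enumerates every subset of clades, so it is only usable on toy inputs, and it is not the algorithm of Phillips & Warnow.

```python
from collections import Counter
from itertools import combinations


def clades(*names):
    """Turn strings such as "AB" into clades (frozensets of leaf labels)."""
    return {frozenset(name) for name in names}


def compatible(a, b):
    """Two clades fit in one tree if they are disjoint or one contains the other."""
    return a.isdisjoint(b) or a <= b or b <= a


def max_weight_compatible(trees):
    """Exhaustively find a pairwise-compatible clade set of maximum total weight."""
    weights = Counter(c for tree in trees for c in tree)  # assumed weighting
    candidates = list(weights)
    best, best_weight = set(), 0
    for r in range(1, len(candidates) + 1):
        for subset in combinations(candidates, r):
            if all(compatible(a, b) for a, b in combinations(subset, 2)):
                w = sum(weights[c] for c in subset)
                if w > best_weight:
                    best, best_weight = set(subset), w
    return best, best_weight


trees = [
    clades("AB", "CD", "ABCD", "ABCDE"),
    clades("AB", "ABC", "DE", "ABCDE"),
    clades("BCD", "ABCD", "ABCDE"),
]
best, best_weight = max_weight_compatible(trees)
```

Two collections tie at the maximum weight here (AB, CD, ABCD, ABCDE and AB, ABC, ABCD, ABCDE), so which one is returned depends on enumeration order.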

Page 21:

Consensus of Positive Constraints

Formalize each constraint, go through the existing consensus methods, and see whether each method satisfies it or can be extended to satisfy it

Positive Constraints    Strict + res    Maj + res    Grdy    AMT    Input

All input have isomorphic T ... all output have T
One input has isomorphic T, no contradictions output have T
All input have clade all output have
One input has clade, no contradictions output have

ππ

ππ

Partially from Steel et al. 2000

Page 22:

Consensus of Negative Constraints

1. a and b are separated by C

2. C is closer to a than b – same as positive

Negative Constraints    Strict + res    Maj + res    Grdy    AMT    Input

All input have 1 … all output have 1
One input has 1, no contradictions output have 1

Page 23:

Conclusions and Future (Part II)

• Existing methods are insufficient

• (Consensus with respect to temporal, frequency constraints)

• Developing new methods that preserve 4 types of constraints

• Network phylogeny

• Error measure and evaluation of quality

Page 24:

Even Bigger Future

• Phylogeny
  • Getting good reconstructions fast
  • Heterogeneous data
  • Network phylogeny
• Epidemiology
  • Flu SIR model, combining data
  • Vaccination strategies
• Population biology
  • Discrete methods for small populations (esp. conservation)

Page 25:

Work is supported by the National Science Foundation postdoctoral fellowship grant EIA 02-03584

Thank you

Page 26:

Controlled Breeding
(joint work with Cris Moore and Jared Saia)

Given an initial population of animals, design a mating strategy that achieves a breeding goal (within the shortest time)

Page 27:

Controlled Breeding: Background

• Conservation Biology and Agriculture

• Breeding strategies: designed and evaluated empirically or using stochastic time-step modeling

• Empirical evaluation – too slow!

• Stochastic modeling – mathematically and biologically inappropriate.

• Classic algorithm design problem

Page 28:

Breeding All Possible Animals

Given k binary strings of length n
Design an algorithm that
Produces all possible strings
With the smallest expected # matings

Greedy: mate the two animals with the highest probability of producing a new string

Upper bound: 2.32 · 2^n
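The greedy strategy can be simulated under an assumed mating model, since the slides do not spell one out: the offspring inherits each bit independently from one of the two parents with probability 1/2. Under that assumption every string the parents can produce is equally likely, so the chance of a new string is the fraction of producible strings not yet in the population. All names below are hypothetical.

```python
import random
from itertools import combinations, product


def offspring_set(x, y):
    """All strings a mating of x and y can produce (equally likely in this model)."""
    choices = [(a,) if a == b else (a, b) for a, b in zip(x, y)]
    return {"".join(bits) for bits in product(*choices)}


def p_new(x, y, population):
    """Probability that one mating of x and y yields a string not in the population."""
    kids = offspring_set(x, y)
    return len(kids - population) / len(kids)


def mate(x, y, rng):
    """Assumed model: each bit comes from either parent with probability 1/2."""
    return "".join(rng.choice((a, b)) for a, b in zip(x, y))


def breed_all(initial, rng):
    """Greedily mate the most promising pair until all 2^n strings exist."""
    population = set(initial)
    n = len(next(iter(population)))
    matings = 0
    while len(population) < 2 ** n:
        x, y = max(combinations(sorted(population), 2),
                   key=lambda pair: p_new(pair[0], pair[1], population))
        if p_new(x, y, population) == 0:
            raise ValueError("no pair can produce a new string")
        population.add(mate(x, y, rng))
        matings += 1
    return matings


matings = breed_all(["0000", "1111"], random.Random(0))
```

Starting from the all-zeros and all-ones strings guarantees that every string is reachable. The count returned is one random run, not an estimate of the expected number of matings.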

Page 29:

Breeding a Target Animal

Given k strings of length n
Design an algorithm that
Produces a target string
With the smallest expected # matings

Alg 1: breed for one trait at a time: O(n lg n)

Alg 2: breed the animals closest to the target: O(n^2)

Page 30:

Algorithm: One Trait at a Time

AddOneTrait (11…100…0, 00…010…0)
    x = 11…100…0
    y = 00…010…0
    While (y has < i+1 ones) do
        Mate x and y twice
        y = string with 1 in bit (i+1)
    Return y

The Algorithm (e1, e2, …, en)
    x = e1
    For i = 2..n do
        x = AddOneTrait(x, ei)
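The pseudocode above can be turned into a small simulation. The sketch rests on several assumptions the slides leave open: the target is the all-ones string, the starting animals are the unit strings e1, …, en, an offspring takes each bit from either parent with probability 1/2, and after each pair of matings y is replaced by an offspring that carries the new trait, if there is one. Bit positions are 0-based here, and the function names are hypothetical.

```python
import random


def mate(x, y, rng):
    """Assumed model: each bit comes from either parent with probability 1/2."""
    return [rng.choice((a, b)) for a, b in zip(x, y)]


def add_one_trait(x, y, i, rng):
    """x has ones in bits 0..i-1 and y has a one in bit i.

    Returns a string with ones in bits 0..i and the number of matings used.
    """
    matings = 0
    while sum(y) < i + 1:
        children = [mate(x, y, rng) for _ in range(2)]  # "mate x and y twice"
        matings += 2
        carriers = [c for c in children if c[i] == 1]
        if carriers:
            # A carrier keeps every one that y already has, so y never gets worse.
            y = max(carriers, key=sum)
    return y, matings


def breed_target(n, rng):
    """Accumulate the traits one at a time, starting from the unit strings."""
    units = [[1 if j == k else 0 for j in range(n)] for k in range(n)]
    x, total = units[0], 0
    for i in range(1, n):
        x, used = add_one_trait(x, units[i], i, rng)
        total += used
    return x, total


result, total_matings = breed_target(8, random.Random(1))
```

The mating count is the outcome of one random run and is not a check of the O(n lg n) bound.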

Page 31:

More Realistic Breeding

• Gender

• Variable probability of outcome

• Deaths

• Minimize number of generations

• Goal: maximum diversity

• On-line: maintain the distribution