
Post on 31-Mar-2015


Flowers, Bees, and Algorithms: Adventures in Cophylogenetics

Ran Libeskind-Hadas
Department of Computer Science
Harvey Mudd College

Joint work with:
Mike Charleston (Univ. of Sydney)
Chris Conow (USC)
Ben Cousins (Clemson)
Daniel Fielder (HMC)
John Peebles (HMC)
Tselil Schramm (HMC)
Anak Yodpinyanee (HMC)

Integrated CS/Bio Course

Send e-mail to: ran@cs.hmc.edu

Overview

• A 75-minute “research lecture” to first-year students in our CS/Bio intro course

• Show first-year students that what they’ve learned is relevant to current research

• Showcase research done with senior students

• What have they done so far?

– Biology: Genes, alignment, phylogenetic trees, RNA folding

– CS: Programming, recursion, “memoization”

Specifically…

• Pairwise global alignment and RNA folding
– Why you should care
– Designed and implemented recursive solutions
– Why are they slow?
– How do we make them faster?
– “Memoization” idea
– Wow, that’s fast! (but no actual analysis yet)
– Designed and implemented “memoized” versions
– Used their implementations to investigate questions

Around 10 lines of Python code!
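The recursive-then-memoized pattern listed above can be sketched in a few lines of Python. This is a minimal illustration rather than the course's actual code: the function name `align_score` and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for the example.

```python
from functools import lru_cache

# Assumed scoring scheme, for illustration only.
MATCH, MISMATCH, GAP = 1, -1, -1

def align_score(s1, s2):
    """Best global alignment score of s1 and s2, using memoized recursion."""
    @lru_cache(maxsize=None)               # remember each (i, j) subproblem
    def best(i, j):
        # Score of aligning the suffixes s1[i:] and s2[j:].
        if i == len(s1):
            return GAP * (len(s2) - j)     # only gaps remain
        if j == len(s2):
            return GAP * (len(s1) - i)
        pair = MATCH if s1[i] == s2[j] else MISMATCH
        return max(pair + best(i + 1, j + 1),   # align s1[i] with s2[j]
                   GAP + best(i + 1, j),        # s1[i] against a gap
                   GAP + best(i, j + 1))        # s2[j] against a gap
    return best(0, 0)
```

Removing the `lru_cache` line leaves the plain recursive version, which recomputes the same suffix pairs over and over; with it, each pair of positions is solved once.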

Specifically…

• Phylogenetic trees
– Why you should care
– Implemented simple algorithm (e.g. UPGMA)
– Used their implementation to answer questions…
– Existence and relative merits of other algorithms (mention maximum likelihood… but it’s slow!)
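For readers who have not seen UPGMA, a compact sketch follows. It is not the students' implementation: the function name `upgma`, the input format (a dict of pairwise distances), and the example distances are all assumptions made for illustration. The tree is returned as nested tuples, and branch lengths are omitted to keep the sketch short.

```python
def upgma(names, dist):
    """Build a UPGMA tree as nested tuples.

    names: list of taxon labels.
    dist:  dict mapping a pair (a, b) of labels to their distance.
    """
    size = {name: 1 for name in names}      # cluster -> number of taxa in it
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(size) > 1:
        # Pick the closest pair of clusters (sorted only for a stable order).
        a, b = sorted(min(d, key=d.get), key=str)
        na, nb = size.pop(a), size.pop(b)
        del d[frozenset((a, b))]
        merged = (a, b)
        for c in size:                      # every remaining cluster
            dac = d.pop(frozenset((a, c)))
            dbc = d.pop(frozenset((b, c)))
            # Size-weighted average = mean distance over all taxon pairs.
            d[frozenset((merged, c))] = (na * dac + nb * dbc) / (na + nb)
        size[merged] = na + nb
    (tree,) = size
    return tree

# Made-up distances: A and B are closest, then C, then D.
example_tree = upgma(
    ["A", "B", "C", "D"],
    {("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
     ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 10},
)
```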

A 75-minute lecture in 30 minutes (or less)

Cophylogenetics

“I can understand how a flower and a bee might slowly become, either simultaneously or one after the other, modified and adapted in the most perfect manner to each other, by the continued preservation of individuals presenting mutual and slightly favourable deviations of structure.”

Charles Darwin, The Origin of Species

Actual 75-minute lecture starts here! (Also a chapter in new B4B)

Obligate Mutualism of Figs and Fig Wasps

From Cophylogeny of the Ficus Microcosm, A. Jackson, 2004

ovipositor

The Cophylogeny Problem

From Hafner MS and Nadler SA, Phylogenetic trees support the coevolution of parasites and their hosts. Nature 1988, 332:258-259

Indigobirds and Finches

www.indigobirds.com

• High level of host specificity (e.g. mouth markings)

The Question…

Given a host tree, parasite tree, and tip mapping, what is the most plausible mapping between the trees and is it suggestive of coevolution?

This seems to be a “hard” problem!

Measuring the “Hardness” of Computational Problems

There are three kinds of problems…

1. Easy

2. Hard

3. Impossible!

“Easy” Problems

Sorting a list of n numbers: [42, 3, 17, 26, … , 100]

Multiplying two n x n matrices:

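$$
\begin{pmatrix} 3 & 5 & 2 & 7 \\ 1 & 6 & 8 & 9 \\ 2 & 4 & 6 & 10 \\ 9 & 3 & 2 & 12 \end{pmatrix}
\begin{pmatrix} 1 & 5 & 5 & 4 \\ 5 & 12 & 8 & 6 \\ 7 & 6 & 1 & 5 \\ 9 & 23 & 5 & 8 \end{pmatrix}
= \begin{pmatrix} \cdots \end{pmatrix}
$$

(each matrix is $n \times n$)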

Global Alignment is “easy”!

• Reminder of $2^n$ running time of alignment

• Informally motivate $n^2$ running time of memoized version

“Hard” Problems

Snowplows of Northern Minnesota

[Figure: map of Burrsburg, Frostbite City, Shiversville, Tundratown, and Freezeapolis]

Brute-force? Greed?
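One way to see the difference between brute force and greed is a tiny round-trip problem over the five towns. This is a sketch under stated assumptions: the transcript does not define the snowplow problem, so the code treats it as finding a shortest round trip through every town (the Travelling Salesperson setting named later in the talk), and the highway positions in `POSITION` are invented for the example.

```python
from itertools import permutations

# Hypothetical positions (in miles) of the towns along one straight highway.
POSITION = {"Burrsburg": 0, "Frostbite City": -1, "Shiversville": 2,
            "Tundratown": -5, "Freezeapolis": 8}

def dist(a, b):
    """Distance between two towns on the made-up highway."""
    return abs(POSITION[a] - POSITION[b])

def tour_length(tour):
    """Length of the round trip that visits the towns in order and returns home."""
    return sum(dist(tour[i], tour[(i + 1) % len(tour)]) for i in range(len(tour)))

def brute_force(start):
    """Try every ordering of the other towns: (n - 1)! candidate tours."""
    others = [town for town in POSITION if town != start]
    return min(((start,) + order for order in permutations(others)),
               key=tour_length)

def greedy(start):
    """Always drive to the nearest town not yet visited."""
    tour, unvisited = [start], set(POSITION) - {start}
    while unvisited:
        nearest = min(unvisited, key=lambda town: dist(tour[-1], town))
        tour.append(nearest)
        unvisited.remove(nearest)
    return tuple(tour)

best_length = tour_length(brute_force("Burrsburg"))    # 26 miles
greedy_length = tour_length(greedy("Burrsburg"))       # 28 miles
```

On this instance greed is fast but misses the best tour, while brute force finds it only by checking every ordering, which stops being practical as the number of towns grows.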

$n^2$ versus $2^n$

The Ran-O-Matic performs $10^9$ operations/sec

      | n = 10        | n = 30        | n = 50             | n = 70
$n^2$ | 100, < 1 sec  | 900, < 1 sec  | 2500, < 1 sec      | 4900, < 1 sec
$2^n$ | 1024, < 1 sec | $10^9$, 1 sec | $10^{15}$, 13 days | $10^{21}$, about 37,000 years

Computers double in speed every 2 years. Let’s just wait 10 years! About 37,000 years -> still more than 1,000 years!
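The arithmetic behind the table can be checked directly. The sketch below assumes only the machine speed given on the slide ($10^9$ operations per second); the helper name `run_time_seconds` is made up for the example.

```python
OPS_PER_SEC = 10 ** 9                    # Ran-O-Matic speed, as stated on the slide
SECONDS_PER_DAY = 24 * 3600
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY

def run_time_seconds(operations):
    """Seconds needed to perform the given number of operations."""
    return operations / OPS_PER_SEC

for n in (10, 30, 50, 70):
    quadratic = run_time_seconds(n ** 2)
    exponential = run_time_seconds(2 ** n)
    print(f"n = {n}: n^2 takes {quadratic:.1e} s, 2^n takes {exponential:.1e} s")

days_for_50 = run_time_seconds(2 ** 50) / SECONDS_PER_DAY      # roughly 13 days
years_for_70 = run_time_seconds(2 ** 70) / SECONDS_PER_YEAR    # roughly 37,000 years
```

Doubling in speed every 2 years gives five doublings in 10 years, a 32-fold speedup, which still leaves the n = 70 case at more than a thousand years.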

Snowplows and Travelling Salesperson Revisited!

Travelling Salesperson Problem

Snowplow Problem

Protein Folding

NP-complete problems

Tens of thousands of other known problems go in this cloud!!

Phylogenetic trees by maximum likelihood

Multiple sequence alignment

“I can’t find an efficient algorithm. I guess I’m too dumb.”

Cartoon courtesy of “Computers and Intractability: A Guide to the Theory of NP-completeness” by M. Garey and D. Johnson

“I can’t find an efficient algorithm because no such algorithm is possible!”

“I can’t find an efficient algorithm, but neither can all these famous people.”

$1 million

Vinay Deolalikar

Coping with NP-completeness…

• Brute force
• Ad hoc heuristics
• Metaheuristics
• Approximation algorithms

Obligate Mutualism of Figs and Fig Wasps

From Cophylogeny of the Ficus Microcosm, A. Jackson, 2004

ovipositor

The Cophylogeny Problem…

[Figure: a host tree and a parasite tree joined by tip associations; nodes labeled a, b, c, d, e]

Possible Solutions

[Figure: the input and two possible solutions, drawn on trees with nodes labeled a, b, c, d, e]

Event Cost Model

• cospeciation
• duplication
• host-switch
• loss

[Figures: each event type illustrated in turn on the trees with nodes labeled a, b, c, d, e]

Event Cost Model

[Figure: two possible solutions with their events labeled: one shows a cospeciation, a duplication, and three losses; the other shows a cospeciation, a host-switch, and one loss]

Cost = duplication + cospeciation + 3 * loss

Cost = cospeciation + host-switch + loss

Some typical costs

• cospeciation: + 0
• duplication: + 2
• host-switch: + 3
• loss: + 2

[Figure: the same two solutions with each event's cost marked]

Cost = 8 (duplication + cospeciation + 3 * loss)

Cost = 5 (cospeciation + host-switch + loss)
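The two totals can be reproduced from the event counts. The sketch below takes the event costs from the labels on the slide (cospeciation 0, duplication 2, host-switch 3, loss 2); the names `COST`, `total_cost`, `solution_1`, and `solution_2` are made up for the example.

```python
# Event costs as labeled on the "Some typical costs" slide.
COST = {"cospeciation": 0, "duplication": 2, "host-switch": 3, "loss": 2}

def total_cost(events):
    """Total cost of a solution, given how many times each event occurs."""
    return sum(COST[event] * count for event, count in events.items())

# Event counts of the two example solutions.
solution_1 = {"duplication": 1, "cospeciation": 1, "loss": 3}
solution_2 = {"cospeciation": 1, "host-switch": 1, "loss": 1}

print(total_cost(solution_1))   # 8
print(total_cost(solution_2))   # 5
```

Under these costs the solution with the host-switch is cheaper, even though a host-switch is the single most expensive event.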

This problem is hard!

• How hard? NP-complete! (Joint work with Charleston, Ovadia, Conow, Fielder)

• The host-switches are the culprits


Existing Methods

                | TreeMap     | Tarzan/CoRe-PA
Technique       | Brute force | Ignore timing incompatibilities
Solution        | Optimal     | Can be BETTER than optimal!
Running Time    | Exponential | Polynomial, very fast
Tree Builder    | No          | Yes
Solution Viewer | Yes         | Yes

A Metaheuristic Approach

• Fix a timing
• We can solve the problem optimally for a given timing using Dynamic Programming (Memoization)

[Figure: tree divided into time slices t = 0, t = 1, t = 2, t = 3, t = 4]

Dynamic Programming

[Figure: host tree with nodes r, s, t, u, v, w, x, y drawn across time slices t = 0 through t = 4, alongside a parasite tree with nodes a, b, c]

Compute Cost[a,su,2]

• Subproblems shown: Cost[b,tw,3], Cost[c,y,4]
• Events shown: loss, host-switch, loss

Candidate for Cost[a,su,2]: Cost[b,tw,3] + Cost[c,uy,4] + 2 * loss + host-switch
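The candidate above is a lookup of two already-filled table cells plus the cost of the events that connect them. A minimal sketch of that one step follows. The table layout (a dict keyed by parasite node, host edge, and time), the helper name `candidate`, and the two stored values are all assumptions made for illustration; the loss and host-switch costs are the typical values quoted earlier in the talk.

```python
# Event costs assumed from the "typical costs" quoted earlier in the talk.
LOSS, HOST_SWITCH = 2, 3

# Memo table keyed by (parasite node, host edge, time).
# The two stored values are made-up placeholders, not results from the talk.
cost = {("b", "tw", 3): 4, ("c", "uy", 4): 0}

def candidate(child1, child2, losses, switches):
    """Cost of one way to place a parasite node, given its children's cells."""
    return cost[child1] + cost[child2] + losses * LOSS + switches * HOST_SWITCH

# The candidate from the slide: two losses and one host-switch.
value = candidate(("b", "tw", 3), ("c", "uy", 4), losses=2, switches=1)
print(value)   # 4 + 0 + 2 * 2 + 3 = 11 with the placeholder values
```

Cost[a,su,2] would then be the minimum of such candidates over all placements of the two children.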

Dynamic Programming

Running Time
• $O(n^3)$ cells to fill in
• $O(n^2)$ positions for first child
• $O(n^2)$ positions for second child
• $O(n)$ to count #losses from each child, but this is precomputable

$O(n^3 \times (n^2 \times n^2)) = O(n^7)$ total

Can be improved to $O(n^3)$

Genetic Algorithm

Existing Software

                | TreeMap     | Tarzan/CoRe-PA                      | Jane 2
Technique       | Brute force | DP, ignore timing incompatibilities | Genetic algorithm, DP
Solution        | Optimal     | Can be BETTER than optimal!         | Sometimes suboptimal
Running Time    | Exponential | Polynomial, very fast               | Polynomial, a lot faster! Can control running time
Tree Builder    | No          | Yes                                 | No, but Jane 2 can read CoRe-PA’s trees
Solution Viewer | Yes         | Yes                                 | Yes, also interactive

The Fig/Wasp Challenge

Results

The Fig/Wasp Dataset…

• Original Problem Instance -> Solve for optimal cost
• Randomly Generated Problem Instances -> Solve for optimal cost

Paper recently completed…

30 Coauthors

18 Institutes

10 Countries

Results

Results

Demo

Future Work…

• One parasite, many hosts (“failure to diverge”)
• Reticulate phylogenies
• Multifurcations
• Suggestions?

Questions/Comments
