Flowers, Bees, and Algorithms: Adventures in Cophylogenetics Ran Libeskind-Hadas Department of Computer Science Harvey Mudd College Joint work with: Mike Charleston (Univ. of Sydney) Chris Conow (USC) Ben Cousins (Clemson) Daniel Fielder (HMC) John Peebles (HMC) Tselil Schramm (HMC) Anak Yodpinyanee (HMC)

Post on 31-Mar-2015


Page 1:

Flowers, Bees, and Algorithms: Adventures in Cophylogenetics

Ran Libeskind-Hadas

Department of Computer Science

Harvey Mudd College

Joint work with:

Mike Charleston (Univ. of Sydney)

Chris Conow (USC)

Ben Cousins (Clemson)

Daniel Fielder (HMC)

John Peebles (HMC)

Tselil Schramm (HMC)

Anak Yodpinyanee (HMC)

Page 2:

Integrated CS/Bio Course

Send e-mail to: [email protected]

Page 3:

Overview

• A 75-minute “research lecture” to first-year students in our CS/Bio intro course

• Show first-year students that what they’ve learned is relevant to current research

• Showcase research done with senior students

• What have they done so far?

– Biology: Genes, alignment, phylogenetic trees, RNA folding

– CS: Programming, recursion, “memoization”

Page 4:

Specifically…

• Pairwise global alignment and RNA folding

– Why you should care

– Designed and implemented recursive solutions

– Why are they slow?

– How do we make them faster?

– “Memoization” idea

– Wow, that’s fast! (but no actual analysis yet)

– Designed and implemented “memoized” versions

– Used their implementations to investigate questions

Around 10 lines of Python code!
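
As a rough illustration of what such a memoized solution might look like, here is a minimal sketch of pairwise global alignment scoring. The scoring values (MATCH, MISMATCH, GAP) and the function name align_score are assumptions made for this example; they are not taken from the course materials.

```python
from functools import lru_cache

# Hypothetical scoring scheme, chosen only for illustration.
MATCH, MISMATCH, GAP = 1, -1, -2


def align_score(s: str, t: str) -> int:
    """Return the best global alignment score of s and t."""

    @lru_cache(maxsize=None)  # the cache is the "memo": each (i, j) is solved once
    def best(i: int, j: int) -> int:
        # best(i, j) is the best score for aligning s[i:] with t[j:].
        if i == len(s):
            return GAP * (len(t) - j)  # rest of t is aligned against gaps
        if j == len(t):
            return GAP * (len(s) - i)  # rest of s is aligned against gaps
        pair = MATCH if s[i] == t[j] else MISMATCH
        return max(pair + best(i + 1, j + 1),  # align s[i] with t[j]
                   GAP + best(i + 1, j),       # align s[i] with a gap
                   GAP + best(i, j + 1))       # align t[j] with a gap

    return best(0, 0)
```

Removing the lru_cache line gives the plain recursive version, which returns the same scores but repeats work on the same (i, j) pairs many times.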

Page 5:

Specifically…

• Phylogenetic trees

– Why you should care

– Implemented simple algorithm (e.g. UPGMA)

– Used their implementation to answer questions…

– Existence and relative merits of other algorithms (mention maximum likelihood… but it’s slow!)
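
For readers who have not seen UPGMA, the sketch below shows the core idea: repeatedly merge the two closest clusters and average their distances to everything else. It is a simplified illustration, not the students' implementation; the function name upgma and the input format are assumptions, a distance must be supplied for every pair of leaves, and branch lengths are omitted.

```python
from itertools import combinations


def upgma(dist):
    """Cluster leaves with UPGMA.

    dist maps frozenset({a, b}) to the distance between leaves a and b.
    Returns the tree topology as nested 2-tuples of leaf names.
    """
    d = dict(dist)
    # size[cluster] is the number of leaves in that cluster.
    size = {leaf: 1 for pair in dist for leaf in pair}
    while len(size) > 1:
        # Merge the two closest clusters.
        a, b = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        for c in size:
            if c != a and c != b:
                # New distance is the size-weighted average of the old ones.
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))]
                    + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(size))
```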

Page 6:

A 75-minute lecture in 30 minutes (or less)

Page 7:

Cophylogenetics

“ I can understand how a flower and a bee might slowly become, either simultaneously or one after the other, modified and adapted in the most perfect manner to each other, by the continued preservation of individuals presenting mutual and slightly favourable deviations of structure.”

Charles Darwin, The Origin of Species

Actual 75-minute lecture starts here! (Also a chapter in new B4B)

Page 8:

Obligate Mutualism of Figs and Fig Wasps

From Cophylogeny of the Ficus Microcosm, A. Jackson, 2004

[Figure label: ovipositor]

Page 9:

The Cophylogeny Problem

From Hafner MS and Nadler SA, Phylogenetic trees support the coevolution of parasites and their hosts. Nature 1988, 332:258-259

Page 10:

Indigobirds and Finches

www.indigobirds.com

• High level of host specificity (e.g. mouth markings)

Page 11:

The Question…

Given a host tree, parasite tree, and tip mapping, what is the most plausible mapping between the trees and is it suggestive of coevolution?

This seems to be a “hard” problem!

Page 12:

Measuring the “Hardness” of Computational Problems

There are three kinds of problems…

1. Easy

2. Hard

3. Impossible!

Page 13:

“Easy” Problems

Sorting a list of n numbers: [42, 3, 17, 26, … , 100]

Multiplying two n x n matrices:

$$
\begin{pmatrix} 3 & 5 & 2 & 7 \\ 1 & 6 & 8 & 9 \\ 2 & 4 & 6 & 10 \\ 9 & 3 & 2 & 12 \end{pmatrix}
\begin{pmatrix} 1 & 5 & 5 & 4 \\ 5 & 12 & 8 & 6 \\ 7 & 6 & 1 & 5 \\ 9 & 23 & 5 & 8 \end{pmatrix}
=
\begin{pmatrix} \cdots \end{pmatrix}
$$

(Each matrix has n rows and n columns.)

Page 14:

Global Alignment is “easy”!

• Reminder of the 2^n running time of recursive alignment

• Informally motivate the n^2 running time of the memoized version
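
One informal way to see the gap is to count how many subproblems each version visits. The sketch below does not align anything; it only mimics the three-way branching of an alignment recursion on two strings of length n. The function name count_calls is hypothetical and the branching pattern is an assumption about how the recursive solution is organized.

```python
def count_calls(n: int, memoize: bool) -> int:
    """Count subproblem visits for two strings of length n."""
    calls = 0
    seen = set()

    def visit(i: int, j: int) -> None:
        nonlocal calls
        if memoize and (i, j) in seen:
            return  # already solved: a memoized version just looks it up
        seen.add((i, j))
        calls += 1
        if i == n or j == n:
            return  # base case: one string is used up
        visit(i + 1, j + 1)  # align the two characters
        visit(i + 1, j)      # gap in the second string
        visit(i, j + 1)      # gap in the first string

    visit(0, 0)
    return calls
```

With memoization the count is exactly (n + 1)^2, one visit per pair (i, j). Without it the count grows exponentially in n, which is the informal reason the memoized version feels so much faster.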

Page 15:

Snowplows of Northern Minnesota

Burrsburg

Frostbite City

Shiversville

Tundratown

Freezeapolis

“Hard” Problems

Page 16:

“Hard” Problems

Snowplows of Northern Minnesota

Burrsburg

Frostbite City

Shiversville

Tundratown

Freezeapolis

Brute-force? Greed?
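
To make the two strategies concrete, here is a small sketch that treats the five towns as a travelling-salesperson-style round trip. The exact statement of the snowplow problem is not in the transcript, so this formulation is an assumption, and the distances in DIST are invented purely for illustration.

```python
from itertools import permutations

TOWNS = ["Burrsburg", "Frostbite City", "Shiversville",
         "Tundratown", "Freezeapolis"]

# Hypothetical symmetric distances; DIST[a][b] is the distance
# between TOWNS[a] and TOWNS[b].
DIST = [
    [0, 4, 7, 3, 8],
    [4, 0, 2, 6, 5],
    [7, 2, 0, 4, 3],
    [3, 6, 4, 0, 9],
    [8, 5, 3, 9, 0],
]


def tour_length(order):
    """Length of the round trip that visits towns in the given order."""
    legs = zip(order, order[1:] + order[:1])
    return sum(DIST[a][b] for a, b in legs)


def brute_force_tour():
    """Try every ordering of the other towns: (n - 1)! candidates."""
    rest = range(1, len(TOWNS))
    return min(((0,) + p for p in permutations(rest)), key=tour_length)


def greedy_tour():
    """Always drive to the nearest unvisited town next."""
    order = [0]
    unvisited = set(range(1, len(TOWNS)))
    while unvisited:
        nearest = min(unvisited, key=lambda t: DIST[order[-1]][t])
        order.append(nearest)
        unvisited.remove(nearest)
    return tuple(order)
```

Brute force always finds a shortest tour but the number of orderings explodes as towns are added. The greedy rule is fast, yet on these made-up distances it returns a longer tour than the brute-force search does.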

Page 17:

n^2 versus 2^n

The Ran-O-Matic performs 10^9 operations/sec

| | n = 10 | n = 30 | n = 50 | n = 70 |
|---|---|---|---|---|
| n^2 | 100 (< 1 sec) | 900 (< 1 sec) | 2500 (< 1 sec) | 4900 (< 1 sec) |
| 2^n | 1024 (< 1 sec) | 10^9 (1 sec) | | |

Page 18:

n^2 versus 2^n

The Ran-O-Matic performs 10^9 operations/sec

| | n = 10 | n = 30 | n = 50 | n = 70 |
|---|---|---|---|---|
| n^2 | 100 (< 1 sec) | 900 (< 1 sec) | 2500 (< 1 sec) | 4900 (< 1 sec) |
| 2^n | 1024 (< 1 sec) | 10^9 (1 sec) | 10^15 (13 days) | |

Page 19:

n^2 versus 2^n

The Ran-O-Matic performs 10^9 operations/sec

| | n = 10 | n = 30 | n = 50 | n = 70 |
|---|---|---|---|---|
| n^2 | 100 (< 1 sec) | 900 (< 1 sec) | 2500 (< 1 sec) | 4900 (< 1 sec) |
| 2^n | 1024 (< 1 sec) | 10^9 (1 sec) | 10^15 (13 days) | 10^21 (about 37,000 years) |

Page 20:

n^2 versus 2^n

The Ran-O-Matic performs 10^9 operations/sec

| | n = 10 | n = 30 | n = 50 | n = 70 |
|---|---|---|---|---|
| n^2 | 100 (< 1 sec) | 900 (< 1 sec) | 2500 (< 1 sec) | 4900 (< 1 sec) |
| 2^n | 1024 (< 1 sec) | 10^9 (1 sec) | 10^15 (13 days) | 10^21 (about 37,000 years) |

Computers double in speed every 2 years. Let’s just wait 10 years! About 37,000 years ->

Page 21:

n^2 versus 2^n

The Ran-O-Matic performs 10^9 operations/sec

| | n = 10 | n = 30 | n = 50 | n = 70 |
|---|---|---|---|---|
| n^2 | 100 (< 1 sec) | 900 (< 1 sec) | 2500 (< 1 sec) | 4900 (< 1 sec) |
| 2^n | 1024 (< 1 sec) | 10^9 (1 sec) | 10^15 (13 days) | 10^21 (about 37,000 years) |

Computers double in speed every 2 years. Let’s just wait 10 years! About 37,000 years ->

About 1,200 years! (Five doublings make the machine only 2^5 = 32 times faster.)
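
The arithmetic behind this comparison is easy to check directly. The sketch below assumes the machine speed stated on the slide (10^9 operations per second) and a year of 365.25 days; the helper names run_seconds and describe are made up for this example.

```python
OPS_PER_SEC = 10**9  # speed of the Ran-O-Matic as stated on the slide
SECONDS_PER_DAY = 24 * 3600
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def run_seconds(operations: int, speedup: float = 1.0) -> float:
    """Seconds needed for the given number of operations."""
    return operations / (OPS_PER_SEC * speedup)


def describe(seconds: float) -> str:
    """Render a duration in the largest sensible unit."""
    if seconds < 1:
        return "< 1 sec"
    if seconds < SECONDS_PER_DAY:
        return f"{seconds:.1f} sec"
    if seconds < SECONDS_PER_YEAR:
        return f"{seconds / SECONDS_PER_DAY:.0f} days"
    return f"{seconds / SECONDS_PER_YEAR:,.0f} years"


if __name__ == "__main__":
    for n in (10, 30, 50, 70):
        print(n, describe(run_seconds(n**2)), describe(run_seconds(2**n)))
    # Doubling every 2 years for 10 years is 2**5 = 32 times faster.
    print(describe(run_seconds(2**70, speedup=2 ** (10 / 2))))
```

Whatever constant speedup is applied, the n = 70 entry for 2^n stays far out of reach, while every n^2 entry finishes in well under a second.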

Page 22:

Snowplows and Travelling Salesperson Revisited!

Travelling Salesperson Problem

Snowplow Problem

Protein Folding

NP-complete problems

Tens of thousands of other known problems go in this cloud!!

Phylogenetic trees by maximum likelihood

Multiple sequence alignment

Page 23:

“I can’t find an efficient algorithm. I guess I’m too dumb.”

Cartoon courtesy of “Computers and Intractability: A Guide to the Theory of NP-completeness” by M. Garey and D. Johnson

Page 24:

Cartoon courtesy of “Computers and Intractability: A Guide to the Theory of NP-completeness” by M. Garey and D. Johnson

“I can’t find an efficient algorithm because no such algorithm is possible!”

Page 25:

Cartoon courtesy of “Computers and Intractability: A Guide to the Theory of NP-completeness” by M. Garey and D. Johnson

“I can’t find an efficient algorithm, but neither can all these famous people.”

Page 26:

$1 million

Vinay Deolalikar

Page 27:

Coping with NP-completeness…

• Brute force

• Ad hoc heuristics

• Metaheuristics

• Approximation algorithms

Page 28:

Obligate Mutualism of Figs and Fig Wasps

From Cophylogeny of the Ficus Microcosm, A. Jackson, 2004

[Figure label: ovipositor]

Page 29:

The Cophylogeny Problem…

[Figure: a host tree and a parasite tree; node labels a, b, c, d, e]

Page 30:

The Cophylogeny Problem

[Figure: a host tree and a parasite tree with tip associations drawn between them; node labels a, b, c, d, e]

Page 31:

Possible Solutions

[Figure: the input and possible solutions; node labels a, b, c, d, e]

Page 32:

Event Cost Model: cospeciation

[Figure: cospeciation events marked on the trees; node labels a, b, c, d, e]

Page 33:

Event Cost Model: duplication

[Figure: duplication events marked on the trees; node labels a, b, c, d, e]

Page 34:

Event Cost Model: host-switch

[Figure: host-switch events marked on the trees; node labels a, b, c, d, e]

Page 35:

Event Cost Model: loss

[Figure: loss events marked on the trees; node labels a, b, c, d, e]

Page 36:

Event Cost Model

[Figure: two solutions with their events marked (cospeciation, duplication, host-switch, loss); node labels a, b, c, d, e]

Cost = duplication + cospeciation + 3 * loss

Cost = cospeciation + host-switch + loss

Page 37:

Some typical costs

[Figure: the same two solutions with a cost attached to each event; node labels a, b, c, d, e]

• cospeciation: + 0

• duplication: + 2

• loss: + 2

• host-switch: + 3

First solution (duplication, cospeciation, three losses): Cost = 8

Second solution (cospeciation, host-switch, one loss): Cost = 5
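
The two totals can be reproduced by summing per-event costs. The sketch below uses the event counts from the two cost formulas and the typical costs on this slide; the dictionary layout and the name reconciliation_cost are assumptions made for the example.

```python
# Typical event costs as shown on the slide.
EVENT_COST = {"cospeciation": 0, "duplication": 2, "loss": 2, "host-switch": 3}


def reconciliation_cost(event_counts):
    """Total cost of a solution given how often each event occurs."""
    return sum(EVENT_COST[event] * count
               for event, count in event_counts.items())


# Event counts read off the two cost formulas.
first_solution = {"duplication": 1, "cospeciation": 1, "loss": 3}
second_solution = {"cospeciation": 1, "host-switch": 1, "loss": 1}

if __name__ == "__main__":
    print(reconciliation_cost(first_solution))   # 2 + 0 + 3 * 2
    print(reconciliation_cost(second_solution))  # 0 + 3 + 2
```

Under these costs the solution that uses a host-switch is the cheaper of the two, so the choice of event costs directly shapes which mapping looks most plausible.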

Page 38:

This problem is hard!

• How hard? NP-complete! (Joint work with Charleston, Ovadia, Conow, Fielder)

• The host-switches are the culprits

[Figure: node labels e, f, g, h]

Page 39:

Existing Methods

| | TreeMap | Tarzan/CoRe-PA |
|---|---|---|
| Technique | Brute force | Ignore timing incompatibilities |
| Solution | Optimal | Can be BETTER than optimal! |
| Running Time | Exponential | Polynomial, very fast |
| Tree Builder | No | Yes |
| Solution Viewer | Yes | Yes |

Page 40:

A Metaheuristic Approach

• Fix a timing

• We can solve the problem optimally for a given timing using Dynamic Programming (Memoization)

Page 41:

Dynamic Programming

[Figure: a tree with nodes r, s, t, u, v, w, x, y drawn against time levels t = 0 through t = 4, and a parasite tree with nodes a, b, c]

Compute Cost[a,su,2]

Page 42:

Dynamic Programming

[Figure: a tree with nodes r, s, t, u, v, w, x, y drawn against time levels t = 0 through t = 4, and a parasite tree with nodes a, b, c; b and c are placed below a]

Compute Cost[a,su,2]

Cost[b,tw,3]

Cost[c,y,4]

Page 43:

Dynamic Programming

[Figure: a tree with nodes r, s, t, u, v, w, x, y drawn against time levels t = 0 through t = 4, and a parasite tree with nodes a, b, c; the placement of b and c is annotated with two losses and one host-switch]

Compute Cost[a,su,2]

Cost[b,tw,3]

Cost[c,y,4]

Page 44:

Dynamic Programming

[Figure: a tree with nodes r, s, t, u, v, w, x, y drawn against time levels t = 0 through t = 4; the placement of b and c below a is annotated with two losses and one host-switch]

Cost[b,tw,3]

Cost[c,y,4]

Candidate for Cost[a,su,2]: Cost[b, tw, 3] + Cost[c, uy, 4] + 2 * loss + host-switch

Page 45:

Dynamic Programming

Running Time

• O(n^3) cells to fill in

• O(n^2) positions for first child

• O(n^2) positions for second child

• O(n) to count #losses from each child, but this is precomputable

O(n^3 x (n^2 x n^2)) = O(n^7) total

Page 46:

Dynamic Programming

Running Time

• O(n^3) cells to fill in

• O(n^2) positions for first child

• O(n^2) positions for second child

• O(n) to count #losses from each child, but this is precomputable

O(n^3 x (n^2 x n^2)) = O(n^7) total

Can be improved to O(n^3)
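
The O(n^7) bound is just the product of the counts listed above. Written out as a worked step, and assuming (as the slide notes) that the loss counts are precomputed so they add only constant work per candidate:

```latex
\[
\underbrace{O(n^3)}_{\text{cells}}
\cdot
\underbrace{O(n^2)}_{\text{first child}}
\cdot
\underbrace{O(n^2)}_{\text{second child}}
= O\!\left(n^{3+2+2}\right)
= O(n^7)
\]
```

The transcript does not say how the improvement to O(n^3) is obtained, so no derivation of that bound is attempted here.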

Page 47:

Genetic Algorithm

Page 48:

Existing Software

| | TreeMap | Tarzan/CoRe-PA | Jane 2 |
|---|---|---|---|
| Technique | Brute force | DP, ignore timing incompatibilities | Genetic algorithm, DP |
| Solution | Optimal | Can be BETTER than optimal! | Sometimes suboptimal |
| Running Time | Exponential | Polynomial, very fast | Polynomial, a lot faster! Can control running time |
| Tree Builder | No | Yes | No, but Jane 2 can read CoRe-PA’s trees |
| Solution Viewer | Yes | Yes | Yes, also interactive |

Page 49:

The Fig/Wasp Challenge

Page 50:

Results

Page 51:

The Fig/Wasp Dataset…

[Figure: the original problem instance and randomly generated problem instances are each solved for optimal cost]
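
The slide does not spell out how the two sets of costs are compared. One common way, offered here only as an assumption, is to ask what fraction of the random instances can be solved at least as cheaply as the original one. The function name empirical_p_value and the costs in the usage example are hypothetical.

```python
def empirical_p_value(original_cost, random_costs):
    """Fraction of random instances whose optimal cost is no worse
    (no higher) than the optimal cost of the original instance."""
    at_least_as_good = sum(1 for cost in random_costs
                           if cost <= original_cost)
    return at_least_as_good / len(random_costs)


if __name__ == "__main__":
    # Hypothetical costs, for illustration only.
    print(empirical_p_value(5, [8, 9, 5, 7, 10]))
```

A small fraction would suggest that the original trees fit together better than chance alone would explain, which is the kind of evidence the coevolution question asks for.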

Page 52:

Page 53:

Paper recently completed…

30 Coauthors

18 Institutes

10 Countries

Page 54:

Results

Page 55:

Results

Page 56:

Demo

Page 57:

Future Work…

• One parasite, many hosts (“failure to diverge”)

• Reticulate phylogenies

• Multifurcations

• Suggestions?

Page 58:

Questions/Comments