Flowers, Bees, and Algorithms: Adventures in
Cophylogenetics
Ran Libeskind-Hadas
Department of Computer Science
Harvey Mudd College
Joint work with:
Mike Charleston (Univ. of Sydney)
Chris Conow (USC)
Ben Cousins (Clemson)
Daniel Fielder (HMC)
John Peebles (HMC)
Tselil Schramm (HMC)
Anak Yodpinyanee (HMC)
Integrated CS/Bio Course
Send e-mail to: [email protected]
Overview
• A 75-minute “research lecture” to first-year students in our CS/Bio intro course
• Show first-year students that what they’ve learned is relevant to current research
• Showcase research done with senior students
• What have they done so far?
– Biology: Genes, alignment, phylogenetic trees, RNA folding
– CS: Programming, recursion, “memoization”
Specifically…
• Pairwise global alignment and RNA folding
– Why you should care
– Designed and implemented recursive solutions
– Why are they slow?
– How do we make them faster?
– “Memoization” idea
– Wow, that’s fast! (but no actual analysis yet)
– Designed and implemented “memoized” versions
– Used their implementations to investigate questions
Around 10 lines of Python code!
Specifically…
• Phylogenetic trees
– Why you should care
– Implemented simple algorithm (e.g. UPGMA)
– Used their implementation to answer questions…
– Existence and relative merits of other algorithms
(mention maximum likelihood… but it’s slow!)
A 75-minute lecture in 30 minutes (or less)
Cophylogenetics
“ I can understand how a flower and a bee might slowly become, either simultaneously or one after the other, modified and adapted in the most perfect manner to each other, by the continued preservation of individuals presenting mutual and slightly favourable deviations of structure.”
Charles Darwin, The Origin of Species
Actual 75-minute lecture starts here! (Also a chapter in new B4B)
Obligate Mutualism of Figs and Fig Wasps
From Cophylogeny of the Ficus Microcosm, A. Jackson, 2004
ovipositor
The Cophylogeny Problem
From Hafner MS and Nadler SA, Phylogenetic trees support the coevolution of parasites and their hosts. Nature 1988, 332:258-259
Indigobirds and Finches
www.indigobirds.com
• High level of host specificity (e.g. mouth markings)
The Question…
Given a host tree, parasite tree, and tip mapping, what is the most plausible mapping between the trees and is it suggestive of coevolution?
This seems to be a “hard” problem!
Measuring the “Hardness” of Computational Problems
There are three kinds of problems…
1. Easy
2. Hard
3. Impossible!
“Easy” Problems
Sorting a list of n numbers: [42, 3, 17, 26, … , 100]
Multiplying two n x n matrices:
(  3  5  2  7 )   (  1  5  5  4 )
(  1  6  8  9 ) x (  5 12  8  6 ) = (   )
(  2  4  6 10 )   (  7  6  1  5 )
(  9  3  2 12 )   (  9 23  5  8 )
Global Alignment is “easy”!
• Reminder of 2^n running time of alignment
• Informally motivate n^2 running time of memoized version
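The jump from exponential to roughly n^2 time can be seen in a short memoized recursion. This is a minimal sketch, not the course's actual code: the function name `align_score` and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration. Without the cache, the same suffix pairs are re-solved exponentially often; with it, each of the (len(s)+1) x (len(t)+1) index pairs is solved once.

```python
from functools import lru_cache

# Hypothetical scoring scheme, chosen only for illustration.
MATCH, MISMATCH, GAP = 1, -1, -2


def align_score(s, t):
    """Best global alignment score of s and t, via memoized recursion."""

    @lru_cache(maxsize=None)
    def best(i, j):
        # Score of the best alignment of the suffixes s[i:] and t[j:].
        if i == len(s):
            return GAP * (len(t) - j)   # only gaps remain
        if j == len(t):
            return GAP * (len(s) - i)
        pair = MATCH if s[i] == t[j] else MISMATCH
        return max(
            pair + best(i + 1, j + 1),  # align s[i] with t[j]
            GAP + best(i + 1, j),       # s[i] against a gap
            GAP + best(i, j + 1),       # t[j] against a gap
        )

    return best(0, 0)


print(align_score("ACGT", "AGT"))
```

Removing the `@lru_cache` line leaves a correct but much slower function, which is the contrast the two bullets above describe.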
Snowplows of Northern Minnesota
Burrsburg
Frostbite City
Shiversville
Tundratown
Freezeapolis
“Hard” Problems
Snowplows of Northern Minnesota
Brute-force? Greed?
n^2 versus 2^n
The Ran-O-Matic performs 10^9 operations/sec
        n = 10           n = 30           n = 50            n = 70
n^2     100, < 1 sec     900, < 1 sec     2500, < 1 sec     4900, < 1 sec
2^n     1024, < 1 sec    10^9, 1 sec      10^15, 13 days    10^21, about 37,000 years
Computers double in speed every 2 years. Let’s just wait 10 years! About 37,000 years -> still more than 1,000 years!
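The running times for the Ran-O-Matic can be recomputed directly. This is a rough sketch that assumes one operation per step and the slide's 10^9 operations/sec; the helper names `seconds_needed` and `describe` are made up for this example. Under those assumptions, 2^50 steps take about 13 days and 2^70 steps take on the order of 37 thousand years.

```python
OPS_PER_SEC = 10**9                      # Ran-O-Matic speed quoted on the slide
SECONDS_PER_DAY = 24 * 3600
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def seconds_needed(steps, ops_per_sec=OPS_PER_SEC):
    """Wall-clock seconds to perform `steps` operations."""
    return steps / ops_per_sec


def describe(seconds):
    """Render a duration in the coarse units used in the table."""
    if seconds < 1:
        return "< 1 sec"
    if seconds < SECONDS_PER_DAY:
        return f"{seconds:.0f} sec"
    if seconds < SECONDS_PER_YEAR:
        return f"{seconds / SECONDS_PER_DAY:.0f} days"
    return f"{seconds / SECONDS_PER_YEAR:,.0f} years"


for n in (10, 30, 50, 70):
    quadratic = describe(seconds_needed(n ** 2))
    exponential = describe(seconds_needed(2 ** n))
    print(f"n = {n}: n^2 -> {quadratic}, 2^n -> {exponential}")
```

Doubling in speed every 2 years for 10 years is five doublings, a 32x speedup, so passing `ops_per_sec=32 * OPS_PER_SEC` still leaves the n = 70 case at more than a thousand years.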
Snowplows and Travelling Salesperson Revisited!
Travelling Salesperson Problem
Snowplow Problem
Protein Folding
NP-complete problems
Tens of thousands of other known problems go in this cloud!!
Phylogenetic trees by maximum likelihood
Multiple sequence alignment
“I can’t find an efficient algorithm. I guess I’m too dumb.”
Cartoon courtesy of “Computers and Intractability: A Guide to the Theory of NP-completeness” by M. Garey and D. Johnson
“I can’t find an efficient algorithm because no such algorithm is possible!”
“I can’t find an efficient algorithm, but neither can all these famous people.”
$1 million
Vinay Deolalikar
Coping with NP-completeness…
• Brute force
• Ad hoc heuristics
• Metaheuristics
• Approximation algorithms
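To make the first two options concrete, here is a small sketch that contrasts brute force with a greedy heuristic, using the Travelling Salesperson Problem as the example. The town names come from the snowplow slide, but the coordinates are invented for illustration, and the function names are hypothetical. Brute force tries every ordering, so it is always optimal but takes factorial time; the nearest-neighbour heuristic is fast but may return a longer tour.

```python
from itertools import permutations
from math import dist

# Hypothetical coordinates; the slides give only the town names.
TOWNS = {
    "Burrsburg": (0, 0),
    "Frostbite City": (3, 0),
    "Shiversville": (3, 4),
    "Tundratown": (0, 4),
    "Freezeapolis": (1, 2),
}


def tour_length(order):
    """Length of the closed tour visiting towns in the given order."""
    legs = zip(order, order[1:] + order[:1])
    return sum(dist(TOWNS[a], TOWNS[b]) for a, b in legs)


def brute_force_tour(start):
    """Try every ordering of the other towns: optimal, but (n-1)! tours."""
    others = [t for t in TOWNS if t != start]
    candidates = ([start] + list(p) for p in permutations(others))
    return min(candidates, key=tour_length)


def greedy_tour(start):
    """Nearest-neighbour heuristic: always drive to the closest unvisited town."""
    tour = [start]
    unvisited = [t for t in TOWNS if t != start]
    while unvisited:
        here = TOWNS[tour[-1]]
        nearest = min(unvisited, key=lambda t: dist(here, TOWNS[t]))
        tour.append(nearest)
        unvisited.remove(nearest)
    return tour


best = brute_force_tour("Burrsburg")
quick = greedy_tour("Burrsburg")
print(round(tour_length(best), 2), round(tour_length(quick), 2))
```

With five towns the brute-force search checks only 24 tours; at 30 towns the count is far beyond what any machine can enumerate, which is why the remaining options in the list matter.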
Obligate Mutualism of Figs and Fig Wasps
From Cophylogeny of the Ficus Microcosm, A. Jackson, 2004
ovipositor
The Cophylogeny Problem…
[Figure: host tree (labels a, b, c), parasite tree (labels d, e), and the tip associations between them]
Possible Solutions
[Figure: the input and two possible solutions, each drawn on the trees labelled a, b, c and d, e]
Event Cost Model: cospeciation
[Figure: the example trees with cospeciation events marked]
Event Cost Model: duplication
[Figure: the example trees with a duplication event marked]
Event Cost Model: host-switch
[Figure: the example trees with a host-switch event marked]
Event Cost Model: loss
[Figure: the example trees with loss events marked]
Event Cost Model
[Figure: two solutions for the example trees, with every event labelled]
Cost = duplication + cospeciation + 3 * loss
Cost = cospeciation + host-switch + loss
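Some typical costs
cospeciation + 0, duplication + 2, host-switch + 3, loss + 2
[Figure: the same two solutions with each event annotated by its cost]
Cost = duplication + cospeciation + 3 * loss = 2 + 0 + 3 * 2 = 8
Cost = cospeciation + host-switch + loss = 0 + 3 + 2 = 5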
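The two totals on this slide follow from summing per-event costs. The sketch below assumes the values read off the slide's annotations (cospeciation 0, duplication 2, host-switch 3, loss 2) and represents each solution simply as a list of event names; the function name `reconciliation_cost` is made up for this example.

```python
from collections import Counter

# Event costs as annotated on the slide.
TYPICAL_COSTS = {
    "cospeciation": 0,
    "duplication": 2,
    "host-switch": 3,
    "loss": 2,
}


def reconciliation_cost(events, costs=TYPICAL_COSTS):
    """Total cost of a solution, given the list of events it uses."""
    counts = Counter(events)
    return sum(costs[event] * k for event, k in counts.items())


# The two solutions from the slide, written out as event lists.
solution_1 = ["duplication", "cospeciation", "loss", "loss", "loss"]
solution_2 = ["cospeciation", "host-switch", "loss"]

print(reconciliation_cost(solution_1))  # 8
print(reconciliation_cost(solution_2))  # 5
```

Changing the cost dictionary changes which solution is cheaper, so the "most plausible" mapping depends on the event costs chosen.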
This problem is hard!
• How hard? NP-complete! (Joint work with Charleston, Ovadia, Conow, Fielder)
• The host-switches are the culprits
Existing Methods    TreeMap        Tarzan/CoRe-PA
Technique           Brute force    Ignore timing incompatibilities
Solution            Optimal        Can be BETTER than optimal!
Running Time        Exponential    Polynomial, very fast
Tree Builder        No             Yes
Solution Viewer     Yes            Yes
A Metaheuristic Approach
• Fix a timing
• We can solve the problem optimally for a given timing using Dynamic Programming (Memoization)
[Figure: tree drawn against time levels t = 0 through t = 4]
Dynamic Programming
[Figure: host tree with nodes r, s, t, u, v, w, x, y drawn against time levels t = 0 through t = 4; parasite tree with node a and children b and c]
Compute Cost[a,su,2]
Subproblems shown: Cost[b,tw,3] and Cost[c,y,4]
Events marked: loss, host-switch, loss
Candidate for Cost[a,su,2]: Cost[b, tw, 3] + Cost[c, uy, 4] + 2 * loss + host-switch
Dynamic Programming
Running Time
• O(n^3) cells to fill in
• O(n^2) positions for first child
• O(n^2) positions for second child
• O(n) to count #losses from each child, but this is precomputable
O(n^3 x (n^2 x n^2)) = O(n^7) total
Can be improved to O(n^3)
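The O(n^7) bound is just the product of the counts listed above. The sketch below multiplies them out for a given n. It assumes, as an interpretation of the Cost[parasite node, host edge, time] indexing, that each of the three indices ranges over about n values; the function name `naive_dp_work` is hypothetical, and constant factors are ignored.

```python
def naive_dp_work(n):
    """Rough count of candidate evaluations in the unoptimized DP."""
    cells = n ** 3                   # parasite node x host edge x time
    first_child_positions = n ** 2   # host edge x time for the first child
    second_child_positions = n ** 2  # host edge x time for the second child
    return cells * first_child_positions * second_child_positions


for n in (10, 20, 40):
    print(n, naive_dp_work(n))

# Doubling n multiplies the work by 2**7 = 128.
print(naive_dp_work(20) // naive_dp_work(10))
```

By the same counting, an O(n^3) algorithm only grows by a factor of 8 when n doubles, which is what makes the improved bound worth having in practice.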
Genetic Algorithm
Existing Software   TreeMap        Tarzan/CoRe-PA                         Jane 2
Technique           Brute force    DP, ignore timing incompatibilities    Genetic algorithm, DP
Solution            Optimal        Can be BETTER than optimal!            Sometimes suboptimal
Running Time        Exponential    Polynomial, very fast                  Polynomial, a lot faster! Can control running time
Tree Builder        No             Yes                                    No, but Jane 2 can read CoRe-PA’s trees
Solution Viewer     Yes            Yes                                    Yes, also interactive
The Fig/Wasp Challenge
Results
The Fig/Wasp Dataset…
[Figure: the original problem instance and the randomly generated problem instances are each solved for optimal cost]
Paper recently completed…
30 Coauthors
18 Institutes
10 Countries
Results
Demo
Future Work…
• One parasite, many hosts (“failure to diverge”)
• Reticulate phylogenies
• Multifurcations
• Suggestions?
Questions/Comments