cyberinfrastructure for phylogenetic research simulation, modeling, and benchmarks u penn: junhyong...

54
Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Simulation, Modeling, and Benchmarks Benchmarks U Penn: U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve Fisher, Sheng Guo, Lisan Wang, Shirley Yifeng Zheng, Steve Fisher, Sheng Guo, Lisan Wang, Shirley Cohen Cohen U Texas : U Texas : David Hillis, Lauren Meyers David Hillis, Lauren Meyers Eric Miller, Tracy Heath, Derrick Zwickl Eric Miller, Tracy Heath, Derrick Zwickl NC State: NC State: Spencer Muse Spencer Muse Errol Strain Errol Strain Yale: Yale: Paul Turner Paul Turner and and Bernard Moret Bernard Moret Tandy Warnow Tandy Warnow Robert Jensen Robert Jensen Randy Linder Randy Linder

Upload: sharleen-willis

Post on 17-Jan-2016

217 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Simulation, Modeling, and Simulation, Modeling, and BenchmarksBenchmarks

U Penn:U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Junhyong Kim, Sampath Kannan, Susan Davidson

Yifeng Zheng, Steve Fisher, Sheng Guo, Lisan Wang, Shirley Cohen Yifeng Zheng, Steve Fisher, Sheng Guo, Lisan Wang, Shirley Cohen

U Texas :U Texas : David Hillis, Lauren Meyers David Hillis, Lauren Meyers

Eric Miller, Tracy Heath, Derrick ZwicklEric Miller, Tracy Heath, Derrick Zwickl

NC State:NC State: Spencer Muse Spencer Muse

Errol StrainErrol Strain

Yale:Yale: Paul Turner Paul Turner

andand

Bernard MoretBernard Moret

Tandy WarnowTandy Warnow

Robert JensenRobert Jensen

Randy LinderRandy Linder

Page 2: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Goal:Goal: Develop validated datasets of Develop validated datasets of sufficient complexity and scale to sufficient complexity and scale to realistically benchmark latest tree realistically benchmark latest tree

algorithmsalgorithms

Page 3: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

ProblemsProblems

Large-scale simulations is computationally demanding and difficult to reproduce Large-scale simulations is computationally demanding and difficult to reproduce independently independently

The model parameter space explodes in combinatorial complexity with increase The model parameter space explodes in combinatorial complexity with increase in model complexity in model complexity

Large-scale algorithm test experimental design is extremely difficult to manageLarge-scale algorithm test experimental design is extremely difficult to manage Branching structure specification is critical but the standard options are limited Branching structure specification is critical but the standard options are limited

for very large treesfor very large trees Credible simulation model acceptable to the community is difficult to establishCredible simulation model acceptable to the community is difficult to establish

Page 4: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

ProblemsProblems

Large-scale simulations is Large-scale simulations is computationally demandingcomputationally demanding and and difficult todifficult to reproducereproduce independently independently

The model parameter space explodes inThe model parameter space explodes in combinatorial complexity combinatorial complexity with with increase in model complexity increase in model complexity

Large-scale algorithm test experimental design is extremely Large-scale algorithm test experimental design is extremely difficult to managedifficult to manage Branching structure specification is criticalBranching structure specification is critical but the standard options are but the standard options are

limited for very large treeslimited for very large trees Credible simulation Credible simulation model acceptable to the communitymodel acceptable to the community is difficult to establish is difficult to establish

Page 5: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Simulation Design Simulation Design

Pre-generate a very large dataset (>10Pre-generate a very large dataset (>1066 positions) over a very large positions) over a very large complex tree (>10complex tree (>1066 taxa) using a suite of complex models of taxa) using a suite of complex models of evolutionevolution

Store the data in a databaseStore the data in a database Retrieve subsets of the data by various sampling schemesRetrieve subsets of the data by various sampling schemes

computationally demandingcomputationally demanding

difficult todifficult to reproducereproduce

model acceptable to model acceptable to the communitythe community

difficult to managedifficult to manage

Branching structure Branching structure specification is criticalspecification is critical

combinatorial complexitycombinatorial complexity

Page 6: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Simulation and Data AccessSimulation and Data Access

Character Evolution Simulators•HyPhy•Micro-evolution•Others

Tree Topology Simulators•Pure Birth•Birth-Death•Empirical Fit•Others

Others•Tree/Char Combined•Experimental Evolution•Virtual Cell•etc

Model Model CharacterizationCharacterizationSimulatorsSimulators

Taxon Sampling

Model Sampling

Dat

abas

eD

atab

ase

Data Subset with Associated Subtree

Format TranslatorsFormat Translators

PAUP*, etc

Page 7: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Crimson Simulation DBCrimson Simulation DB

Page 8: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Obligatory Schema Diagram (Don’t Look)Obligatory Schema Diagram (Don’t Look)

Page 9: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Database Performance: Constant or Linear Time QueriesDatabase Performance: Constant or Linear Time Queries

Select n random taxa from 2000-taxon tree

Select 20 fixed taxa from tree of size t (100 to 600)

Select 20 random taxa from tree of size t (100 to 600)

random

stratified

MRC subtree

Implemented tree-based taxon sampling query

Page 10: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Query OptionsQuery Options Species SelectionSpecies Selection

Select AllSelect All Random Selection (num species)Random Selection (num species) Select By Depth (num species, depth threshold)Select By Depth (num species, depth threshold) Manual SelectionManual Selection

Sequence SelectionSequence Selection Select AllSelect All Random Selection (num bp)Random Selection (num bp) Manual Selection (positions)Manual Selection (positions)

Repeat Query (num runs)Repeat Query (num runs) Rerun Query (seed)Rerun Query (seed)

Page 11: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Query ManagementQuery Management Load queries from the databaseLoad queries from the database Save queries to the databaseSave queries to the database

Import queries from a text fileImport queries from a text file Export queries to a text fileExport queries to a text file

Create local queries (ie not stored in the database)Create local queries (ie not stored in the database) Delete queries from local session and databaseDelete queries from local session and database

Access query objects through the command lineAccess query objects through the command line Manipulate query objects within jython scriptsManipulate query objects within jython scripts

Page 12: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Page 13: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

SimulationSimulation

Tree Topology Simulation:Tree Topology Simulation: Generate the temporal Generate the temporal

branching structure of branching structure of populations/speciespopulations/species

Character Simulation:Character Simulation: Generate the evolution of Generate the evolution of

sequences/morphology/etc sequences/morphology/etc over the tree generated over the tree generated above above

Page 14: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Tree Topology Simulation(Tracy Heath and David Hillis, UT Austin)

Standard Approach: Simulate a homogeneous branching

process (e.g., pure-birth model) Sub-sample from a large homogeneous

branching process

Problems: Larger trees are self-similar to smaller

trees Most biologists don’t think trees in

simulations “look” like “real” trees

Page 15: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Tree Topology Simulation Modified code from Phyl-O-Gen - a tree

simulation program (Rambaut)

Birth-death process

After a speciation event, the rates of each daughter lineage are mutated

The new rate is obtained by multiplying the parent rate by a gamma-distributed multiplier centered on 1

The new rate is accepted in proportion to a prior distribution on birth and death rates

Page 16: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Tree Shape

Balanced Imbalanced

Page 17: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Tree Shape

67.2=N 17.3=N83.2=N

47.12 =σ22.02 =σ 81.02 =σ

9.2=EN

33.3=N

22.22 =σ

Expectation under the equal rates Markov (ERM) model

Page 18: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Tree Shape

I = 1

I = 1

I = 0

I = 0.5

I = 1

I = 0

I = 1

I = 1

I = 1

Expectation under the equal rates Markov (ERM) model

I = 0.5

Weighted mean imbalance (I)

Page 19: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Simulated trees were compared with published phylogenies using measures of tree shape. 200 trees of 10000 taxa under constant rates standard model 200 trees of 10000 taxa under variable rates our model

433 trees were collected from various sources and sorted based on the method used to estimate the phylogeny and the proportion of the ingroup sampled.

Weighted mean imbalance (I) was used to compare the simulated trees with published trees

Tree Topology Simulation

Page 20: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Comparing Trees

0.4

0.5

0.6

0.7

0.8

0.9

1

0 1 2 3 4 5 6 7 8 9

ln(node size)

wei

gh

ted

mea

n im

bal

ance

(I)

Page 21: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Comparing Trees

0.4

0.5

0.6

0.7

0.8

0.9

1

0 1 2 3 4 5 6 7 8 9

wei

gh

ted

mea

n im

bal

ance

(I)

ln(node size)

Page 22: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Comparing Trees

0.4

0.5

0.6

0.7

0.8

0.9

1

0 1 2 3 4 5 6 7 8 9

wei

gh

ted

mea

n im

bal

ance

(I)

ln(node size)

Page 23: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Comparing Trees

0.4

0.5

0.6

0.7

0.8

0.9

1

0 1 2 3 4 5 6 7 8 9

wei

gh

ted

mea

n im

bal

ance

(I)

ln(node size)

Page 24: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Comparing Trees

0.4

0.5

0.6

0.7

0.8

0.9

1

0 1 2 3 4 5 6 7 8 9

wei

gh

ted

mea

n im

bal

ance

(I)

ln(node size)

Page 25: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Comparing Trees

0.4

0.5

0.6

0.7

0.8

0.9

1

0 1 2 3 4 5 6 7 8 9

wei

gh

ted

mea

n im

bal

ance

(I)

ln(node size)

Page 26: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Three trees ranging from simple to complex were simulated

Equal rates tree Variable rates tree Variable rates tree with mass extinctions

Million-taxon Trees

Page 27: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Page 28: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Page 29: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Page 30: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Multi-layered simulations for character evolutionMulti-layered simulations for character evolution

Key molecule simulation (Muse, Hillis)Key molecule simulation (Muse, Hillis) RNA macro-evolution simulation (Kim)RNA macro-evolution simulation (Kim) RNA micro-evolution simulation (Kim, Meyers)RNA micro-evolution simulation (Kim, Meyers) Experimental viral evolution (Turner)Experimental viral evolution (Turner)

Page 31: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Key molecule simulation (Muse, Key molecule simulation (Muse, Hillis, Holder)Hillis, Holder)

Estimate statistical parameters for Estimate statistical parameters for real molecules (e.g., rbcL) using real molecules (e.g., rbcL) using HyPhy, extend model family to HyPhy, extend model family to include more discrete rate include more discrete rate distribution and positional distribution and positional dependencies, and finally generate dependencies, and finally generate a very large tree of 10a very large tree of 1066~10~1077 taxa taxa using the key molecule models as using the key molecule models as its basis.its basis.

rbcL model family estimated under rbcL model family estimated under codon-specific model (Muse)codon-specific model (Muse)

rRNA gene model (including 2rRNA gene model (including 2ndnd structure; Hillis and Gutel)structure; Hillis and Gutel)

rbcLrbcL

invariable sites

=2.1/ = 1.3= (0.1,..,0.2)..

=1.1/ = 1.7= (0.3,..,0.2)..

=0.8/ = 0.5= (0.1,..,0.5)..

Page 32: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Simulation of complex evolutionary processesSimulation of complex evolutionary processes

Reflect more complex dynamics • Heterogeneous rates:

– lineage and site specific mutation rates– genomic context dependent rates

• Phenotypic effects– Selection– Population interaction

Page 33: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

RNA and its secondary structure as a model system for genotype-phenotype evolution

Page 34: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Micro-Macro simulation model Micro-Macro simulation model (Meyers, Kim)(Meyers, Kim)

Generate a population of Generate a population of molecules incorporating a fitness molecules incorporating a fitness model and speciation process model and speciation process based on RNA folding. Fitness based on RNA folding. Fitness from (1) similarity to known 16S from (1) similarity to known 16S RNA (~67k seqs); (2) similarity to RNA (~67k seqs); (2) similarity to known 16S structure (~200 crystal known 16S structure (~200 crystal structure); (3) folding stability structure); (3) folding stability

Experimental viral evolution (Turner; Experimental viral evolution (Turner; non-ITR funding for empirical worknon-ITR funding for empirical work))

Use the RNA bacteriophage phi-6 Use the RNA bacteriophage phi-6 system to generate an system to generate an experimental phylogeny (~64-experimental phylogeny (~64-taxon tree with host switching and taxon tree with host switching and horizontal transfer)horizontal transfer)

Page 35: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Individual-based simulation Individual-based simulation (E. Miller and L. Ancel)(E. Miller and L. Ancel)

More fitMore fit

Different adaptive peaksDifferent adaptive peaks

Page 36: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Strategy for macro-evolutionStrategy for macro-evolution

mutationmutation

fixationfixation

Compute probability of fixation of Compute probability of fixation of different mutation types using different mutation types using Kimura’s derivations. Draw Kimura’s derivations. Draw waiting time for each event from waiting time for each event from an exponential processan exponential process

Page 37: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Mutations in RNA

?Advantageous

Neutral

Deleterious

Page 38: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Folding energy based fitness model

Assumption: Thermodynamically more stable structure is more fit.

-491.07J/mol -636.71J/mol

Page 39: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

2.2 A Free Energy Based Schema

M0 (E0)

M1 (E1)

M2 (E2)

M3 (E3)Mi (Ei)

Mn (En)

.

.

.

. . .

Page 40: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

For each ancestral RNA molecule, enumerate all its mutants.

Compute Ei – free energy of a RNA molecule Mi

RNAeval from Vienna RNA package computes Ei for all possible single mutants of a RNA molecule in 5~6 minutes using one CPU (2 ghz).

Draw new descendent molecule according to convolution of mutation probability and fixation probability from free energy calculations.

ComputationComputation

Page 41: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Acceptance-Rejection MethodAcceptance-Rejection Method

Enumerate Energy (=fitness) landscape Enumerate Energy (=fitness) landscape around ancestor, Find minimum (most fit)around ancestor, Find minimum (most fit)

In the descendent, assume that the energy In the descendent, assume that the energy differential to local minimum is the same as differential to local minimum is the same as the ancestor. Sample a new mutation, the ancestor. Sample a new mutation, accept-reject as a conditional event vis-à-vis accept-reject as a conditional event vis-à-vis the local minimumthe local minimum

Page 42: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

New RNA macro simulatorNew RNA macro simulator

Can simulate folding-energy dependent evolution efficiently (estimate Can simulate folding-energy dependent evolution efficiently (estimate 30 days for 1 million taxa on 20 CPU 2ghz cluster)30 days for 1 million taxa on 20 CPU 2ghz cluster)

Produces secondary structure changes and records history of changesProduces secondary structure changes and records history of changes Produces indel events and produces alignment history--will output files Produces indel events and produces alignment history--will output files

with indels and the correct alignmentwith indels and the correct alignment Parameterized with empirical data statistics (Hillis, Gutell)Parameterized with empirical data statistics (Hillis, Gutell)

Page 43: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Alignment

Top is homologous alignment. Bottom is Clustalw alignment.

First sequence is root RNA, others are randomly chosen leaf RNA’s

Page 44: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Statistics from 100 Statistics from 100 Eukaryote ssRNAEukaryote ssRNA

Statistics from RNASimStatistics from RNASim

Page 45: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Heuristic Search Landscape PropertiesHeuristic Search Landscape Properties

rbcL: 467 taxa, 660 sitesrbcL: 467 taxa, 660 sites rna simulator: 512 taxarna simulator: 512 taxa seqgen: 512 taxa, seqgen: 512 taxa,

rate heterogeneity: 0 (no gamma dist.), gamma=1rate heterogeneity: 0 (no gamma dist.), gamma=1 Sample 660 sites from each dataset without replacementSample 660 sites from each dataset without replacement Call PAUP Hsearch with default settings and time limit=6hrsCall PAUP Hsearch with default settings and time limit=6hrs Report best parsimony score at each secondReport best parsimony score at each second

Page 46: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Normalized Parsimony Score Excess (NPSE)Normalized Parsimony Score Excess (NPSE)

Let B(t) be the best parsimony score at time t; let B(0) be the score of Let B(t) be the best parsimony score at time t; let B(0) be the score of the starting treethe starting tree

B is monotonically decreasingB is monotonically decreasing Assume we run the heuristic search for Assume we run the heuristic search for

6 hrs. The NPSE is defined as6 hrs. The NPSE is defined as

NPSE(t) = (B(t)-B(6hrs))/(B0-B(6hrs))NPSE(t) = (B(t)-B(6hrs))/(B0-B(6hrs))

Page 47: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Page 48: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Page 49: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

P phaseo.P pseudo.P phaseo. b’neck

P pseudo. b’neckAlternating b’neckIncreasing P. phaseopop sizeDecreasing P. phaseopop size

350 generations

Experimental Experimental Evolution of Phi Evolution of Phi 6 virus6 virus

Page 50: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

350 generations

1

H51

A31

C31

B31

D51

D31

A21

B21

11

H71

P71

A51

E51

F51

B51

C51

G51

C71

K71

D71

L71

G71

O71

B71

J71

F71

N71

E71

M71

A71

I71

A41

F41

H41

E41

A61

J41

P41

L41 Clones for sequencing

Whole-genome Whole-genome sequencing 50% sequencing 50% complete (Penn complete (Penn Genomics Inst Genomics Inst funds)funds)

Page 51: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

350 generations

Next Steps:Next Steps:

Introduce ReticulationsIntroduce Reticulations

Page 52: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Christina Burch, UNCChristina Burch, UNC

Paul Higgs, McMasterPaul Higgs, McMaster

Laura Landweber, PrincetonLaura Landweber, Princeton

Carlo Maley, Wistar Inst.Carlo Maley, Wistar Inst.

Claus Wilke, UT AustinClaus Wilke, UT Austin

John Yin, U WisconsinJohn Yin, U Wisconsin

Page 53: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Year 3 StatusYear 3 Status

Complex SimulationsComplex Simulations Scaling up to 1 million-taxon treeScaling up to 1 million-taxon tree

Generated Complex Branching ModelsGenerated Complex Branching Models Developed Novel Importance Sampling SchemeDeveloped Novel Importance Sampling Scheme Parallelized HyPhyParallelized HyPhy

Simulation DatabaseSimulation Database Scaling and Code HardeningScaling and Code Hardening

Developed New Extensions to IndexingDeveloped New Extensions to Indexing Developed and tested with a test suiteDeveloped and tested with a test suite

Experimental EvolutionExperimental Evolution Generated 16 new lineages evolved for 350 generations under Generated 16 new lineages evolved for 350 generations under

heterogeneous conditionsheterogeneous conditions 50% of whole genomes sequenced50% of whole genomes sequenced

Page 54: Cyberinfrastructure for Phylogenetic Research Simulation, Modeling, and Benchmarks U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Yifeng Zheng, Steve

Cyberinfrastructure for Phylogenetic Research

Simulation, Modeling, and Simulation, Modeling, and BenchmarksBenchmarks

U Penn:U Penn: Junhyong Kim, Sampath Kannan, Susan Davidson Junhyong Kim, Sampath Kannan, Susan Davidson

Yifeng Zheng, Steve Fisher, Sheng Guo, Lisan Wang, Shirley Cohen Yifeng Zheng, Steve Fisher, Sheng Guo, Lisan Wang, Shirley Cohen

U Texas :U Texas : David Hillis, Lauren Meyers David Hillis, Lauren Meyers

Eric Miller, Tracy Heath, Derrick ZwicklEric Miller, Tracy Heath, Derrick Zwickl

NC State:NC State: Spencer Muse Spencer Muse

Errol StrainErrol Strain

Yale:Yale: Paul Turner Paul Turner

andand

Bernard MoretBernard Moret

Tandy WarnowTandy Warnow

Robert JensenRobert Jensen

Randy LinderRandy Linder