March 26, 2007
Phyloinformatics of Neuraminidase at Micro and Macro Levels using Grid-enabled HPC Technologies
B. Schmidt (UNSW), D.T. Singh (Genvea Biosciences), R. Trehan, T. Bretschneider (NTU, Singapore)
Contents
• H5N1 Genetics
• H5N1 Phyloinformatics
• Design Principles of Quascade
• H5N1 Phyloinformatics with Quascade
• Results
• Conclusion and Future work
H5N1 Genetics
• Belongs to the Influenza A virus type
• Segmented RNA genome
• 8 genes, 11 proteins
• Classification based on:
  – Hemagglutinin (HA): 15 subtypes
  – Neuraminidase (NA): 9 subtypes
• Genetic variations in HA/NA
• Genetic drift
  – Point mutations
  – 1918 Spanish flu
• Genetic shift
  – Reassortment of the segmented genome
  – 1957, 1968, 1997 pandemics
  – 2003 Z strain of H5N1
H5N1 Phyloinformatics
• Essential to monitor newly emerging strains
  – Molecular evolution at gene and genome level
  – Phylogenetic analysis for determining the origin of new strains
• Phylogenetics
  – How fast do proteins evolve?
  – What is the best method to measure the evolution?
  – How to obtain the best phylogenetic tree?
• Phylogenetic algorithms
  – Character based
    • Maximum Parsimony, Maximum Likelihood (ML)
  – Distance based
    • UPGMA, Neighbor Joining (NJ)
  – Bayesian MCMC based
    • MrBayes, BEAST
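To make the distance-based family concrete, here is a minimal sketch of UPGMA on uncorrected p-distances. It is an illustration only, not the implementation used in the talk: the function names and the toy sequences are hypothetical, the input is assumed to be already aligned, and the output is a Newick topology without branch lengths.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(seqs):
    """Cluster a dict {name: aligned sequence} with UPGMA; return a Newick topology."""
    names = list(seqs)
    labels = {i: name for i, name in enumerate(names)}  # cluster id -> Newick text
    sizes = {i: 1 for i in labels}                      # cluster id -> number of leaves
    # Pairwise distances, keyed by (smaller id, larger id).
    dist = {(i, j): p_distance(seqs[names[i]], seqs[names[j]])
            for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(labels) > 1:
        i, j = min(dist, key=dist.get)  # closest pair; ties go to the earliest entry
        merged, next_id = next_id, next_id + 1
        # Distance from every other cluster to the merged one is the
        # size-weighted average of its distances to the two merged clusters.
        for k in labels:
            if k in (i, j):
                continue
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            dist[(k, merged)] = (sizes[i] * d_ik + sizes[j] * d_jk) / (sizes[i] + sizes[j])
        # Drop all entries that still refer to the two merged clusters.
        dist = {pair: d for pair, d in dist.items() if i not in pair and j not in pair}
        labels[merged] = f"({labels.pop(i)},{labels.pop(j)})"
        sizes[merged] = sizes.pop(i) + sizes.pop(j)
    return next(iter(labels.values())) + ";"


# Hypothetical toy alignment (not real NA data).
toy = {"A": "ACDEFGHIKL", "B": "ACDEFGHIKM", "C": "ACDQFGHIRM", "D": "TCDQFGHIRM"}
print(upgma(toy))  # ((A,B),(C,D));
```

UPGMA assumes a constant rate of evolution along all lineages; Neighbor Joining drops that assumption, which is one reason a workflow may run both on the same distance matrix.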
Quascade – User Interface Example
[Screenshot labels: Communication, Processing pipeline]
• A data-flow tool in which each black box represents Java objects running on different computers!
• Assignment of objects to available computers done automatically (manually if required)
• Communication between objects done transparently
• Configuration of objects done before run-time
Object Features
[Diagram: three connected boxes, each labelled "Java Object"]
• Coding in regular Java / C / C++
• Persistent – activated whenever all data-inputs present
• No explicit messaging protocol required
• No distributed computing concepts need to be understood
• Objects automatically or manually assigned to computers / CPU-cores
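The "persistent, activated whenever all data-inputs present" behaviour can be sketched as follows. This is not Quascade's API: the `Node` class, its methods, and the port names are hypothetical, and the sketch runs in a single process, whereas the slides describe objects distributed across computers.

```python
class Node:
    """A persistent data-flow component that fires once every input port holds data."""

    def __init__(self, name, inputs, func):
        self.name = name
        self.func = func
        self.slots = {port: [] for port in inputs}  # one FIFO queue per input port
        self.listeners = []                         # (downstream node, port) pairs

    def connect(self, target, port):
        """Route this node's output to the given input port of another node."""
        self.listeners.append((target, port))

    def receive(self, port, value):
        """Queue a value; fire as long as every input port has something waiting."""
        self.slots[port].append(value)
        while all(self.slots.values()):
            args = {p: queue.pop(0) for p, queue in self.slots.items()}
            result = self.func(**args)
            for target, target_port in self.listeners:
                target.receive(target_port, result)


# Hypothetical three-stage pipeline: align -> distance -> sink.
results = []
align = Node("align", ["sequences"], lambda sequences: sorted(sequences))
distance = Node("distance", ["alignment", "model"],
                lambda alignment, model: (model, len(alignment)))
sink = Node("sink", ["value"], lambda value: results.append(value))
align.connect(distance, "alignment")
distance.connect(sink, "value")

distance.receive("model", "JTT")         # nothing fires: the alignment is still missing
align.receive("sequences", ["b", "a"])   # align fires, which completes distance's inputs
align.receive("sequences", ["c"])        # distance waits again until a model arrives
distance.receive("model", "WAG")
print(results)  # [('JTT', 2), ('WAG', 1)]
```

The component author only writes `func`; queuing, firing, and forwarding live in the framework, which is the point of "no explicit messaging protocol required".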
Data and Algorithms
• Core Group
  – 22 H5N1 NA sequences from SwissProt and TrEMBL
• Medium Set
  – 581 NA H5N1 sequences from UniProt
• Large Set
  – 909 NA Influenza A sequences from UniProt
• ProtDist
  – NJ
  – UPGMA
• ProtPars
• ProtML
• MrBayes
Runtime and Scalability (NA Bird Flu Protein)
[Figure: two bar charts of processing time [h] for the 909-sequence and 581-sequence sets, comparing 1 processor with 25 processors. Panel "Distance-based workflow": bar labels 360, 145, 16, 6. Panel "MP workflow": bar labels 360, 140, 16, 5.]
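A rough way to read the runtime chart is in terms of speedup and parallel efficiency. The sketch below rests on an assumption the transcript does not confirm: that the bar labels 360 and 16 are the 1-processor and 25-processor processing times, in hours, for the 909-sequence set. The function name is hypothetical.

```python
def speedup_and_efficiency(t_one_proc_h, t_parallel_h, n_procs):
    """Return (speedup, efficiency) for a run parallelized over n_procs processors."""
    speedup = t_one_proc_h / t_parallel_h
    return speedup, speedup / n_procs


# ASSUMPTION: 360 h on 1 processor vs. 16 h on 25 processors (909 sequences).
s, e = speedup_and_efficiency(360, 16, 25)
print(f"speedup {s:.1f}x, efficiency {e:.0%}")  # speedup 22.5x, efficiency 90%
```

If that pairing of the bar labels is right, the workflow scales close to linearly on 25 processors.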
Mr Bayes – Tree Core Set
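[Figure: MrBayes phylogram of the core set; the tree topology is not recoverable from this transcript.]
Leaf labels: P18269Sial, Q05JH9H9N2, Q6DTU0swinech03, A1EHP1goBav06, A1EHP3goBa06, Q0A2H3Chsc59, Q710U6chSc59, Q0PEF9chIn06, Q0PEG0chIn06, Q5MD56TiTh04, Q6PUP6HuTh04, Q307V5catth04, Q5SDA6chTh04, Q45ZM8wpfth04, Q307U7PigeonTh04, Q6PUP7HuTh04, Q2L700HuTh05, Q2LDC0QuTh06, Q2LDC8chTh05, Q6B518chTh04, Q4PKD4chTh04
Node support values shown: 0.99, 0.75, 1.00, 0.99, 0.54, 0.91, 0.90, 0.86, 0.70, 0.71, 0.63, 1.00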
Analysis and Observations
• Clustering possibilities
  – Temporal, host-based, geographical
• Algorithms
  – MrBayes and ProtML are most consistent in their performance
  – Too compute-intensive for the larger “macro” sets
• Observed pattern
  – All phylograms yielded geographic-based clustering rather than time-based clustering
  – Host ranges along clustered clades vary
  – Same strain with identical NA sequences can infect different hosts
  – NA may not be the sole factor responsible for determining the diverse host range
  – Glycan site acquisition or loss seems to play a critical role in the molecular evolution of H5N1 NA
  – Identification of “bridging isolates” may help in rapid monitoring and development of a global-scale warning system for H5N1
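The glycan-site observation can be checked mechanically. The sketch below scans protein sequences for the standard N-linked glycosylation sequon, N-X-S/T with X not proline, and reports sites gained or lost between two sequences. It is an illustration, not the analysis used in the talk: the function names and toy sequences are hypothetical, and the comparison assumes the two sequences are aligned without gaps so that positions correspond.

```python
import re

# N followed by any residue except P, then S or T. The lookahead keeps
# overlapping sequons (e.g. "NNTS") from being skipped.
SEQUON = re.compile(r"N(?=[^P][ST])")


def find_sequons(protein):
    """Return 1-based positions of potential N-glycosylation sites."""
    return [m.start() + 1 for m in SEQUON.finditer(protein)]


def glycan_site_changes(reference, variant):
    """Return (gained, lost) site positions, assuming a gap-free alignment."""
    before, after = set(find_sequons(reference)), set(find_sequons(variant))
    return sorted(after - before), sorted(before - after)


# Hypothetical toy sequences (not real NA data).
reference = "ANGSANPTNAT"
variant = "ANGAANATNAT"
print(find_sequons(reference))                  # [2, 9]
print(glycan_site_changes(reference, variant))  # ([6], [2])
```

A sequon marks only a potential site; whether it is actually glycosylated depends on structural context that a sequence scan cannot see.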
Conclusion and Future Work
• Quascade
  – New graphical data-flow tool to design automatically grid-enabled pipelines / workflows
  – Supports implicit high-performance parallelization
  – Supports persistent components
  – Can be used with Java / C / C++ code or application binaries
• H5N1 Phyloinformatics
  – Can take advantage of workflow system and HPC
  – Can be easily used and modified by biologists
  – Use H5N1 NA sequences to better understand evolution of H5N1
  – Analysis of H5N1 NA data with different algorithms indicates spatial clustering based on geographical distribution rather than temporal or host-based clustering
• Future work
  – Studies in conjunction with other proteins such as HA, polymerase, etc., and also at gene and genome level