March 26, 2007

Phyloinformatics of Neuraminidase at Micro and Macro Levels using Grid-enabled HPC Technologies

B. Schmidt (UNSW), D.T. Singh (Genvea Biosciences), R. Trehan, T. Bretschneider (NTU, Singapore)




Contents

• H5N1 Genetics
• H5N1 Phyloinformatics
• Design Principles of Quascade
• H5N1 Phyloinformatics with Quascade
• Results
• Conclusion and Future Work


H5N1 Genetics

• Belongs to the Influenza A virus type
• Segmented RNA genome
• 8 genes, 11 proteins
• Classification based on:
– Hemagglutinin (HA): 15 subtypes
– Neuraminidase (NA): 9 subtypes
• Genetic variations in HA/NA
• Genetic drift
– Point mutations
– 1918 Spanish flu
• Genetic shift
– Reassortment of the segmented genome
– 1957, 1968, 1997 pandemics
– 2003 Z strain of H5N1


H5N1 Phyloinformatics

• Essential to monitor newly emerging strains
– Molecular evolution at gene and genome level
– Phylogenetic analysis for determining the origin of new strains
• Phylogenetics
– How fast do proteins evolve?
– What is the best method to measure the evolution?
– How to obtain the best phylogenetic tree?
• Phylogenetic algorithms
– Character based: Maximum Parsimony (MP), Maximum Likelihood (ML)
– Distance based: UPGMA, Neighbor Joining (NJ)
– Bayesian MCMC based: MrBayes, BEAST
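As a rough illustration of the distance-based family listed above, the sketch below runs UPGMA on a small distance matrix. It is a minimal teaching version, not the implementation used in the talk; the function name `upgma`, the nested-tuple tree format, and the toy distances are assumptions made for this example.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa with UPGMA and return a nested-tuple tree.

    names: list of taxon labels.
    dist:  dict mapping frozenset({a, b}) -> distance between taxa a and b.
    """
    # Each active cluster (a label or a nested tuple) maps to its leaf count.
    sizes = {name: 1 for name in names}
    d = dict(dist)

    while len(sizes) > 1:
        # Pick the closest pair of active clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        na, nb = sizes.pop(a), sizes.pop(b)
        # Distance to every other cluster is the size-weighted average.
        for c in sizes:
            d[frozenset((merged, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        sizes[merged] = na + nb

    return next(iter(sizes))


# Toy distances (made up for illustration, not taken from the NA data).
pairs = {
    ("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
    ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 10,
}
toy_dist = {frozenset(k): v for k, v in pairs.items()}
toy_tree = upgma(["A", "B", "C", "D"], toy_dist)
print(toy_tree)
```

UPGMA assumes a constant rate of evolution across lineages; Neighbor Joining drops that assumption, which is one reason the two methods can return different trees for the same distances.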


Quascade – User Interface Example

[Figure: Quascade user interface, with callouts "Communication" and "Processing pipeline"]

• A data-flow tool in which each black box represents a Java object, with the objects running on different computers
• Assignment of objects to available computers is done automatically (manually if required)
• Communication between objects is done transparently
• Configuration of objects is done before run time


Object Features

[Figure: three boxes, each labeled "Java Object"]

• Coding in regular Java / C / C++
• Persistent: activated whenever all data inputs are present
• No explicit messaging protocol required
• No distributed computing concepts need to be understood
• Objects automatically or manually assigned to computers / CPU cores
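The "persistent, activated whenever all data inputs are present" behaviour can be pictured with a small sketch. This is not Quascade's API, which the slides do not show; the class `Component` and its methods are hypothetical names, Python stands in for the Java objects, and everything runs in a single process, so the assignment of objects to computers is not modelled.

```python
class Component:
    """A data-flow node that fires once every named input has arrived."""

    def __init__(self, name, input_names, func):
        self.name = name
        self.input_names = list(input_names)
        self.func = func
        self.pending = {}      # inputs received since the last firing
        self.listeners = []    # (downstream component, input name) pairs

    def connect(self, downstream, input_name):
        """Route this component's output to an input of another component."""
        self.listeners.append((downstream, input_name))

    def receive(self, input_name, value):
        """Store one input; fire when the full set is present."""
        self.pending[input_name] = value
        if all(n in self.pending for n in self.input_names):
            args = [self.pending[n] for n in self.input_names]
            self.pending = {}  # stay alive and wait for the next data set
            result = self.func(*args)
            for downstream, name in self.listeners:
                downstream.receive(name, result)
            return result
        return None


# Hypothetical three-stage pipeline: align -> tree -> sink.
results = []
align = Component("align", ["sequences"], lambda seqs: sorted(seqs))
tree = Component("tree", ["alignment", "method"], lambda aln, m: (m, aln))
sink = Component("sink", ["tree"], results.append)
align.connect(tree, "alignment")
tree.connect(sink, "tree")

tree.receive("method", "NJ")                  # one input only: nothing fires
align.receive("sequences", ["seq2", "seq1"])  # completes the chain
```

The author of each stage writes only the function body; the wiring and the decision about when to fire live in the framework, which is the point the slide makes about not needing explicit messaging.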


Phyloinformatics Workflow with Quascade


Parallelized Phyloinformatics Workflow


Data and Algorithms

• Core Group
– 22 H5N1 NA sequences from Swiss-Prot and TrEMBL
• Medium Set
– 581 H5N1 NA sequences from UniProt
• Large Set
– 909 Influenza A NA sequences from UniProt

• ProtDist
– NJ
– UPGMA
• ProtPars
• ProtML
• MrBayes
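To make the distance step that feeds NJ and UPGMA concrete, the sketch below computes pairwise p-distances (the fraction of differing aligned positions) between aligned protein sequences. This is a deliberately simplified stand-in: ProtDist estimates distances under amino-acid substitution models rather than using a raw p-distance, and the function names and toy sequences here are assumptions for illustration only.

```python
from itertools import combinations


def p_distance(a, b, gap="-"):
    """Fraction of aligned, gap-free positions at which a and b differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not compared:
        raise ValueError("no gap-free positions to compare")
    return sum(x != y for x, y in compared) / len(compared)


def distance_matrix(alignment):
    """Return {frozenset({name1, name2}): p-distance} for all pairs."""
    return {
        frozenset((n1, n2)): p_distance(alignment[n1], alignment[n2])
        for n1, n2 in combinations(alignment, 2)
    }


# Toy aligned fragments (made up, not real NA sequences).
toy_alignment = {
    "s1": "MNPNQKIITI",
    "s2": "MNPNQKIIAI",
    "s3": "MNPN-KTIAI",
}
toy_matrix = distance_matrix(toy_alignment)
for pair, value in toy_matrix.items():
    print(sorted(pair), round(value, 3))
```

The number of pairs grows quadratically with the number of sequences, so the 909-sequence set needs 909 × 908 / 2 = 412,686 pairwise comparisons, each independent of the others.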


Runtime and Scalability (NA Bird Flu Protein)

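[Figure: two bar charts of processing time [h] (axis ticks up to 400) for the 581-sequence and 909-sequence sets, comparing 1 processor with 25 processors. Panel titles: "Distance-based workflow" and "MP workflow". Bar labels as extracted: 360, 145, "16 60" in one panel and 360, 140, "16 50" in the other. Which labels belong to which panel and bar is not recoverable from the transcript.]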
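Scalability figures of this kind are usually summarised as speedup and parallel efficiency. The sketch below shows the arithmetic. It assumes, purely for illustration, that the chart labels 360 h and 16 h are the 1-processor and 25-processor times for the same data set; the extracted chart does not confirm that pairing, and the function names are hypothetical.

```python
def speedup(t_serial, t_parallel):
    """How many times faster the parallel run is than the serial run."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, processors):
    """Speedup divided by processor count; 1.0 means perfect scaling."""
    return speedup(t_serial, t_parallel) / processors


# ASSUMPTION: 360 h on 1 processor and 16 h on 25 processors belong to the
# same workflow and data set. The transcript does not state this pairing.
assumed_speedup = speedup(360, 16)
assumed_efficiency = efficiency(360, 16, 25)
print(assumed_speedup, assumed_efficiency)
```

Under that assumption the speedup would be 22.5 on 25 processors, an efficiency of 0.9.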


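MrBayes – Tree Core Set

[Figure: MrBayes tree for the core set. The tree topology is not recoverable from the transcript.]

Leaf labels as extracted, in order: P18269Sial, Q05JH9H9N2, Q6DTU0swinech03, A1EHP1goBav06, A1EHP3goBa06, Q0A2H3Chsc59, Q710U6chSc59, Q0PEF9chIn06, Q0PEG0chIn06, Q5MD56TiTh04, Q6PUP6HuTh04, Q307V5catth04, Q5SDA6chTh04, Q45ZM8wpfth04, Q307U7PigeonTh04, Q6PUP7HuTh04, Q2L700HuTh05, Q2LDC0QuTh06, Q2LDC8chTh05, Q6B518chTh04, Q4PKD4chTh04

Node support values as extracted, in order: 0.99, 0.75, 1.00, 0.99, 0.54, 0.91, 0.90, 0.86, 0.70, 0.71, 0.63, 1.00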


Analysis and Observations

• Clustering possibilities
– Temporal, host-based, geographical
• Algorithms
– MrBayes and ProtML are the most consistent in their performance
– Both are too compute-intensive for the larger “macro” sets
• Observed pattern
– All phylograms yielded geography-based clustering rather than time-based clustering
– Host ranges along clustered clades vary
– The same strain with identical NA sequences can infect different hosts
– NA may not be the sole factor responsible for determining the diverse host range
– Glycan site acquisition or loss seems to play a critical role in the molecular evolution of H5N1 NA
– Identification of “bridging isolates” may help in rapid monitoring and in developing a global-scale warning system for H5N1
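One way to look for the glycan site acquisition or loss mentioned above is to scan each sequence for the N-linked glycosylation sequon N-X-S/T, where X is any residue except proline. The sketch below does that with a regular expression and compares two sequences. The slides do not say how the authors identified glycan sites, so this is an assumed approach; the function names and toy sequences are made up, and a sequon match marks only a potential site, not a confirmed glycan.

```python
import re

# N-linked glycosylation sequon: Asn, any residue except Pro, then Ser or Thr.
# The lookahead lets overlapping sequons (e.g. in "NNTS") all be reported.
SEQUON = re.compile(r"N(?=[^P][ST])")


def sequon_positions(seq):
    """Return 1-based positions of Asn residues that start a sequon.

    Expects a gap-free sequence; a gap character would be treated as X.
    """
    return [m.start() + 1 for m in SEQUON.finditer(seq.upper())]


def site_changes(seq_a, seq_b):
    """Compare two equal-length, gap-free sequences position by position.

    Returns (gained, lost): sequon positions present only in seq_b,
    and sequon positions present only in seq_a.
    """
    a, b = set(sequon_positions(seq_a)), set(sequon_positions(seq_b))
    return sorted(b - a), sorted(a - b)


# Toy sequences (made up, not real NA fragments).
older = "MNPSQNGTIIDKS"
newer = "MNPSQNGAIINKS"
gained, lost = site_changes(older, newer)
print("gained:", gained, "lost:", lost)
```

In the toy pair, the Asn at position 2 is never reported because it is followed by proline.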


Conclusion and Future Work

• Quascade
– New graphical data-flow tool for designing automatically grid-enabled pipelines / workflows
– Supports implicit high-performance parallelization
– Supports persistent components
– Can be used with Java / C / C++ code or application binaries
• H5N1 Phyloinformatics
– Can take advantage of the workflow system and HPC
– Can be easily used and modified by biologists
– Uses H5N1 NA sequences to better understand the evolution of H5N1
– Analysis of H5N1 NA data with different algorithms indicates spatial clustering based on geographical distribution rather than temporal or host-based clustering
• Future work
– Studies in conjunction with other proteins such as HA and polymerase, and also at gene and genome level