Virtual Laboratory: Enabling Distributed Molecular Modelling for Drug Discovery on the Grid

Rajkumar Buyya
Grid Computing and Distributed Systems (GRIDS) Lab.
The University of Melbourne
Melbourne, Australia
www.gridbus.org/vlab/

WW Grid

Agenda

Introduction
Molecular Docking Application Needs
Virtual Lab Architecture
Grid Enabling CDB (chemical databases)
Application Composition
Scheduling Experiments
Conclusions


Drug Design: Data Intensive Computing on Grid

It involves screening millions of chemical compounds (molecules) in the Chemical DataBase (CDB) to identify those having potential to serve as drug candidates.

Protein

Molecules

Chemical Databases (legacy, in .MOL2 format)

[Collaboration with WEHI for Medical Science, Melbourne]

Using Basic Job submission commands

Do it all yourself! (manually)

Total Cost: $???

Build Distributed Application & Scheduler

Build the application on a case-by-case basis

Complicated construction

E.g., MPI based

Total Cost: $???

Rapid Parameterisation and Deployment Using the Gridbus and Nimrod-G Tools

Compose, Submit, & Play!

Docking Application Requirements

It is compute intensive: each docking job can take from a few minutes to hours, depending on the structural complexity.

It is data intensive: the databases are huge (MBs to GBs) and each contains thousands of molecules. Screening all molecules in all databases is a real data challenge!

CDBs are distributed. It is a killer application for the Grid.

[Figure labels: Protein; Molecules; Chemical Databases (legacy, in .MOL2 format)]

DataGrid Brokering

[Diagram: a user request reaches the Nimrod/G Computational Grid Broker, which works with the CDB Broker (Algorithm1 ... AlgorithmN), the Data Replica Catalogue, and the Grid Info. Service to choose among Grid Service Providers GSP1 ... GSPn, some of which host a CDB Service. The diagram numbers its messages 1 to 7; the messages shown are:]

“Screen 2K molecules in 30min. for $10”
“advise CDB source?”
“CDB replicas please?”
“Is GSP4 healthy?”
“selection & advise: use GSP4!”
“Screen mol.5 please?”
“mol.5 please?”
“process & send results”
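The replica-selection part of this exchange can be sketched in a few lines of Python. This is only an illustration, not the Data Grid Broker's code: the function name `select_cdb_source`, the three dictionaries, and the "cheapest healthy replica" rule are all assumptions made for the example.

```python
def select_cdb_source(cdb_name, replica_catalogue, is_healthy, price):
    """Pick one Grid Service Provider (GSP) that hosts a replica of cdb_name.

    replica_catalogue: dict, CDB name -> list of GSP names (hypothetical layout)
    is_healthy:        dict, GSP name -> bool, as an information service might report
    price:             dict, GSP name -> access price (hypothetical units)
    """
    # "CDB replicas please?": ask the catalogue where copies of the database live.
    replicas = replica_catalogue.get(cdb_name, [])
    # "Is GSP4 healthy?": drop providers that are reported as unavailable.
    healthy = [gsp for gsp in replicas if is_healthy.get(gsp, False)]
    if not healthy:
        raise LookupError(f"no healthy replica of {cdb_name!r}")
    # "use GSP4!": assumed selection rule, cheapest healthy provider (name breaks ties).
    return min(healthy, key=lambda gsp: (price.get(gsp, float("inf")), gsp))
```

A real broker would weigh more than price (network distance, load, deadline), so the last line is the part most likely to differ in practice.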

Software Tools

Molecular Modelling Application (DOCK)
Parameter Modelling Tools (Nimrod/enFusion)
Grid Resource Broker (Nimrod-G)
Data Grid Broker
Chemical DataBase (CDB) Management and Intelligent Access Tools:
    PDB database Lookup/Index Table Generation.
    PDB and associated index-table Replication.
    PDB Replica Catalogue (that helps in Resource Discovery).
    PDB Servers (that serve PDB client requests).
    PDB Brokering (Replica Selection).
    PDB Clients for fetching Molecule Record (Data Movement).
Grid Middleware (Globus and GrACE)
Grid Fabric Management (Fork/LSF/Condor/Codine/…)
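The lookup/index table idea can be illustrated with a small Python sketch. It assumes only that each molecule record in a MOL2 file begins with a `@<TRIPOS>MOLECULE` line; the function names `build_index` and `fetch_record` are hypothetical and are not the Virtual Lab tools themselves.

```python
import os

RECORD_TAG = b"@<TRIPOS>MOLECULE"  # marks the start of each molecule record in MOL2


def build_index(path):
    """Return the byte offset of every molecule record in a MOL2 file."""
    offsets = []
    position = 0
    with open(path, "rb") as f:
        for line in f:
            if line.startswith(RECORD_TAG):
                offsets.append(position)
            position += len(line)
    return offsets


def fetch_record(path, offsets, n):
    """Return molecule n (1-based) as bytes, reading only that record."""
    start = offsets[n - 1]
    # A record ends where the next one starts, or at end of file for the last one.
    end = offsets[n] if n < len(offsets) else os.path.getsize(path)
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start)
```

With the offsets stored next to the database, a server can answer a request for one molecule with a single seek instead of scanning a file that may be gigabytes long.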

The Virtual Lab. – Software Stack

APPLICATIONS: Molecular Modelling for Drug Design

PROGRAMMING TOOLS: Nimrod and Virtual Lab Tools [parametric programming language, GUI tools, and CDB indexer]

USER LEVEL MIDDLEWARE: Nimrod-G and CDB Data Broker [task farming engine, scheduler, dispatcher, agents, CDB (chemical database) server]

CORE MIDDLEWARE: Globus [security, information, job submission]

FABRIC: Worldwide Grid [Distributed computers and databases (PDB, CDB) with different Arch, OS, and local resource management systems]

V-Lab Components Interaction

[Diagram. Nodes: User Node, Grid Node, Compute Node. Components: Grid Tools and Applications; Nimrod-G Grid Broker (Task Farming Engine, Grid Scheduler, Grid Dispatcher); Grid Info Server; Grid Trade Server; Process Server; Local Resource Manager; Nimrod Agent; User Process; Docking Process; File Server (file access); CDB Client; CDB Service on Grid (CDB Server with Index and CDB1 ... CDBm).]

Messages shown in the diagram:
“Do this in 30 min. for $10?”
Molecule “n” Location?
Get molecule “n” record from “abc” CDB
Get mol. record

DOCK code* (Enhanced by WEHI, U of Melbourne)

A program to evaluate the chemical and geometric complementarities between a small molecule and a macromolecular binding site.

It explores ways in which two molecules, such as a drug and an enzyme or protein receptor, might fit together.

Compounds which dock to each other well, like pieces of a three-dimensional jigsaw puzzle, have the potential to bind.

So, why is it important to be able to identify small molecules which may bind to a target macromolecule?

A compound which binds to a biological macromolecule may inhibit its function, and thus act as a drug.

E.g., disabling the ability of a virus (such as HIV) to attach itself to a molecule/protein!

With system-specific code changed, we have been able to compile it for Sun Solaris, PC Linux, SGI IRIX, and Compaq Alpha/OSF1.

* Original Code: University of California, San Francisco: http://www.cmpharm.ucsf.edu/kuntz/

Dock input file

score_ligand yes
minimize_ligand yes
multiple_ligands no
random_seed 7
anchor_search no
torsion_drive yes
clash_overlap 0.5
conformation_cutoff_factor 3
torsion_minimize yes
match_receptor_sites no
random_search yes
. . . . . . . . . . . .
maximum_cycles 1
ligand_atom_file S_1.mol2
receptor_site_file ece.sph
score_grid_prefix ece
vdw_definition_file parameter/vdw.defn
chemical_definition_file parameter/chem.defn
chemical_score_file parameter/chem_score.tbl
flex_definition_file parameter/flex.defn
flex_drive_file parameter/flex_drive.tbl
ligand_contact_file dock_cnt.mol2
ligand_chemical_file dock_chm.mol2
ligand_energy_file dock_nrg.mol2

[Callout on ligand_atom_file: Molecule to be screened]

1. Parameterize Dock input file (use Nimrod Tools: GUI/language)

score_ligand $score_ligand
minimize_ligand $minimize_ligand
multiple_ligands $multiple_ligands
random_seed $random_seed
anchor_search $anchor_search
torsion_drive $torsion_drive
clash_overlap $clash_overlap
conformation_cutoff_factor $conformation_cutoff_factor
torsion_minimize $torsion_minimize
match_receptor_sites $match_receptor_sites
random_search $random_search
. . . . . . . . . . . .
maximum_cycles $maximum_cycles
ligand_atom_file ${ligand_number}.mol2
receptor_site_file $HOME/dock_inputs/${receptor_site_file}
score_grid_prefix $HOME/dock_inputs/${score_grid_prefix}
vdw_definition_file vdw.defn
chemical_definition_file chem.defn
chemical_score_file chem_score.tbl
flex_definition_file flex.defn
flex_drive_file flex_drive.tbl
ligand_contact_file dock_cnt.mol2
ligand_chemical_file dock_chm.mol2
ligand_energy_file dock_nrg.mol2

[Callout on ligand_atom_file: Molecule to be screened]
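The effect of the substitution step can be mimicked in Python, because the `$name` and `${name}` placeholders above happen to match the syntax of the standard library's `string.Template`. This is a sketch of the idea only, not how Nimrod performs substitution; `DOCK_BASE` is a shortened excerpt of the file and `render_dock_input` is a hypothetical helper.

```python
from string import Template

# Shortened excerpt of the parameterised DOCK input file shown above.
DOCK_BASE = """\
score_ligand $score_ligand
random_seed $random_seed
ligand_atom_file ${ligand_number}.mol2
receptor_site_file $HOME/dock_inputs/${receptor_site_file}
"""


def render_dock_input(template_text, values):
    """Fill in the placeholders for one job.

    safe_substitute leaves unknown names (here $HOME) untouched, so they can
    still be resolved later on the machine that runs the job.
    """
    return Template(template_text).safe_substitute(values)


if __name__ == "__main__":
    job_values = {
        "score_ligand": "yes",
        "random_seed": 7,
        "ligand_number": 42,
        "receptor_site_file": "ece.sph",
    }
    print(render_dock_input(DOCK_BASE, job_values))
```

One template plus one dictionary per job is all that distinguishes the 2000 docking runs from each other.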

2. Create Docking Plan: Define Variables and their values

parameter database_name label "database_name" text select oneof "aldrich" "maybridge" "maybridge_300" "asinex_egc" "asinex_epc" "asinex_pre" "available_chemicals_directory" "inter_bioscreen_s" "inter_bioscreen_n" "inter_bioscreen_n_300" "inter_bioscreen_n_500" "biomolecular_research_institute" "molecular_science" "molecular_diversity_preservation" "national_cancer_institute" "IGF_HITS" "aldrich_300" "molecular_science_500" "APP" "ECE" default "aldrich_300";
parameter CDB_SERVER text default "bezek.dstc.monash.edu.au";
parameter CDB_PORT_NO text default "5001";
parameter score_ligand text default "yes";
parameter minimize_ligand text default "yes";
parameter multiple_ligands text default "no";
parameter random_seed integer default 7;
parameter anchor_search text default "no";
parameter torsion_drive text default "yes";
parameter clash_overlap float default 0.5;
parameter conformation_cutoff_factor integer default 5;
parameter torsion_minimize text default "yes";
parameter match_receptor_sites text default "no";
. . . . . . . . . . . .
parameter maximum_cycles integer default 1;
parameter receptor_site_file text default "ece.sph";
parameter score_grid_prefix text default "ece";
parameter ligand_number integer range from 1 to 2000 step 1;

[Callout on ligand_number: Molecules to be screened]
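How a plan like this turns into individual jobs can be sketched in Python: parameters with a single default stay fixed, and each swept parameter contributes one job per value (with several swept parameters, one job per combination). The helper `expand_jobs` and the `fixed`/`swept` split are illustrative assumptions, not part of Nimrod.

```python
from itertools import product


def expand_jobs(fixed, swept):
    """Yield one dict of parameter values per job.

    fixed: dict, name -> single value (parameters declared with a default)
    swept: dict, name -> iterable of values (parameters declared with a range)
    """
    names = list(swept)
    for combination in product(*(swept[name] for name in names)):
        job = dict(fixed)
        job.update(zip(names, combination))
        yield job


# A few of the fixed parameters from the plan above.
fixed = {
    "database_name": "aldrich_300",
    "random_seed": 7,
    "receptor_site_file": "ece.sph",
}
# "ligand_number integer range from 1 to 2000 step 1"
swept = {"ligand_number": range(1, 2001, 1)}

jobs = list(expand_jobs(fixed, swept))
```

Here the single swept parameter yields 2000 jobs, one per molecule to be screened.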

3. Create Docking Plan File: Define Task that jobs need to do

task nodestart
    copy ./parameter/vdw.defn node:.
    copy ./parameter/chem.defn node:.
    copy ./parameter/chem_score.tbl node:.
    copy ./parameter/flex.defn node:.
    copy ./parameter/flex_drive.tbl node:.
    copy ./dock_inputs/get_molecule node:.
    copy ./dock_inputs/dock_base node:.
endtask
task main
    node:substitute dock_base dock_run
    node:substitute get_molecule get_molecule_fetch
    node:execute sh ./get_molecule_fetch
    node:execute $HOME/bin/dock.$OS -i dock_run -o dock_out
    copy node:dock_out ./results/dock_out.$jobname
    copy node:dock_cnt.mol2 ./results/dock_cnt.mol2.$jobname
    copy node:dock_chm.mol2 ./results/dock_chm.mol2.$jobname
    copy node:dock_nrg.mol2 ./results/dock_nrg.mol2.$jobname
endtask


Gridbus Visual Tool for Parametric Application Creation (e.g., Docking)


Chemical DataBase (CDB)

Databases consist of small molecules from commercially available organic synthesis libraries, and natural product databases.

There is also the ability to screen virtual combinatorial databases, in their entirety.

This methodology allows only the required compounds to be subjected to physical screening and/or synthesis, reducing both time and expense.

Target Test Case

The target for the test case: endothelin converting enzyme (ECE). This is involved in “heart stroke” and other transient ischemia.

Is·che·mi·a: A decrease in the blood supply to a bodily organ, tissue, or part caused by constriction or obstruction of the blood vessels.


Docking Deployment on The World Wide Grid

Scheduling Molecular Docking Application on Grid: Experiment

Workload: docking 200 molecules with ECE.
200 jobs, each needing on the order of 3 minutes, depending on molecule weight.
Deadline: 60 min.; budget: 50,000 G$/tokens.
Strategy: minimise time / cost.
Execution cost under each strategy:
    Optimise Cost: 14,277 G$ (finished in 59.30 min.)
    Optimise Time: 17,702 G$ (finished in 34 min.)
In this experiment, time-optimised scheduling costs an extra 3,425 G$ (about 3.4K G$) compared to cost-optimised scheduling.
Users can now trade off time against cost.
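The time-versus-cost trade-off can be reproduced with a toy model. The Python sketch below is a deliberately simplified greedy scheduler and not the Nimrod-G algorithm: it assumes every job takes the same number of CPU-seconds on every resource, and the `Resource` class, the `cpus` field, and both strategy functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    name: str
    price: float  # G$ per CPU-second
    cpus: int     # jobs it can run at once (assumed)


def ceil_div(a, b):
    return -(-a // b)


def cost_optimised(resources, n_jobs, job_secs, deadline_secs):
    """Fill the cheapest resources first, each up to what it can finish by the deadline."""
    plan = {r.name: 0 for r in resources}
    remaining = n_jobs
    for r in sorted(resources, key=lambda r: r.price):
        capacity = r.cpus * (deadline_secs // job_secs)
        plan[r.name] = min(capacity, remaining)
        remaining -= plan[r.name]
    if remaining:
        raise ValueError("deadline cannot be met with these resources")
    return plan


def time_optimised(resources, n_jobs, job_secs, budget):
    """Give each job to the resource that would finish it earliest, while budget remains."""
    plan = {r.name: 0 for r in resources}
    spent = 0
    for _ in range(n_jobs):
        affordable = [r for r in resources if spent + r.price * job_secs <= budget]
        if not affordable:
            raise ValueError("budget exhausted before all jobs were placed")
        # Earliest finishing round first; price breaks ties.
        best = min(affordable,
                   key=lambda r: (ceil_div(plan[r.name] + 1, r.cpus), r.price))
        plan[best.name] += 1
        spent += best.price * job_secs
    return plan


def cost(plan, resources, job_secs):
    return sum(plan[r.name] * r.price * job_secs for r in resources)


def makespan(plan, resources, job_secs):
    return max(ceil_div(plan[r.name], r.cpus) for r in resources) * job_secs
```

In this model, cost optimisation packs work onto cheap resources and uses as much of the deadline as it needs, while time optimisation spreads work over everything affordable, which mirrors the direction of the results reported on the slide.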

WWG Setup

Australia: Melbourne + Monash U: VPAC, Physics; Solaris WS; Gridbus + Nimrod-G

Europe: ZIB: T3E/Onyx; AEI: Onyx; CNR: Cluster; CUNI/CZ: Onyx; Poznan: SGI/SP2; Vrije U: Cluster; Cardiff: Sun E6500; Portsmouth: Linux PC; Manchester: O3K; Cambridge: SGI; many others

Asia: AIST, Japan: Solaris Cluster; Osaka University: Cluster; Doshisha: Linux cluster; Korea: Linux cluster

North America: ANL: SGI/Sun/SP2; NCSA: Cluster; Wisc: PC/cluster; NRC, Canada; many others

[Diagram: the sites are connected over the Internet as the WW Grid; tools shown: GMonitor, Grid Market Directory, MEG Visualisation]

Resources Selected & Price/CPU-sec.

Resource & Location | Grid services & Fabric | Cost/CPU sec. or unit | Jobs executed (Time_Opt) | Jobs executed (Cost_Opt)
Monash, Melbourne, Australia (Sun Ultra01) | Globus, Nimrod-G, GTS (master node) | -- | -- | --
AIST, Tokyo, Japan, Ultra-4 | Globus, GTS, Fork | 1 | 44 | 102
AIST, Tokyo, Japan, Ultra-4 | Globus, GTS, Fork | 2 | 41 | 41
AIST, Tokyo, Japan, Ultra-4 | Globus, GTS, Fork | 1 | 42 | 39
AIST, Tokyo, Japan, Ultra-2 | Globus, GTS, Fork | 3 | 11 | 4
Sun-ANL, Chicago, US, Ultra-8 | Globus, GTS, Fork | 1 | 62 | 14

Total Experiment Cost (G$): Time_Opt 17,702; Cost_Opt 14,277
Time to Complete Exp. (Min.): Time_Opt 34; Cost_Opt 59.30
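The table can be cross-checked with a few lines of Python. The figures are copied from the table; the `#1` to `#3` suffixes are labels added here only to tell the three identically named AIST Ultra-4 rows apart, and `summarise` is a hypothetical helper.

```python
# (resource, price in G$ per CPU-sec, jobs under time optimisation, jobs under cost optimisation)
ROWS = [
    ("AIST Ultra-4 #1", 1, 44, 102),
    ("AIST Ultra-4 #2", 2, 41, 41),
    ("AIST Ultra-4 #3", 1, 42, 39),
    ("AIST Ultra-2", 3, 11, 4),
    ("Sun-ANL Ultra-8", 1, 62, 14),
]
TOTAL_COST = {"time_opt": 17702, "cost_opt": 14277}  # G$


def summarise(rows, total_cost):
    """Check job totals and compare the two strategies."""
    return {
        "jobs_time_opt": sum(r[2] for r in rows),
        "jobs_cost_opt": sum(r[3] for r in rows),
        # Jobs sent to resources priced above 1 G$ per CPU-sec.
        "pricey_jobs_time_opt": sum(r[2] for r in rows if r[1] > 1),
        "pricey_jobs_cost_opt": sum(r[3] for r in rows if r[1] > 1),
        "extra_cost": total_cost["time_opt"] - total_cost["cost_opt"],
    }


summary = summarise(ROWS, TOTAL_COST)
```

Both columns add up to the 200 jobs of the workload, and the cost-optimised run sends fewer jobs to the resources priced above 1 G$ per CPU-sec.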

DBC Scheduling for Time Optimization – No. of Jobs in Exec.

[Chart: number of jobs in execution against time in minutes, one series per resource: AIST-Sun-hpc420.hpcc.jp, AIST-Sun-hpc420-1.hpcc.jp, AIST-Sun-hpc420-2.hpcc.jp, AIST-Sun-hpc220-2.hpcc.jp, ANL-Sun-pitcairn.mcs.anl.gov]

DBC Scheduling for Cost Optimization – No. of Jobs in Exec.

[Chart: number of jobs in execution against time in minutes, one series per resource: AIST-Sun-hpc420.hpcc.jp, AIST-Sun-hpc420-1.hpcc.jp, AIST-Sun-hpc420-2.hpcc.jp, AIST-Sun-hpc220-2.hpcc.jp, ANL-Sun-pitcairn.mcs.anl.gov]

Summary and Conclusion

Applications can be Grid-enabled and deployed on the Grid with minimal effort, but this needs the right set of Grid tools.

Distributed Docking demonstrates that Nimrod-G and Gridbus tools:
    Enable rapid Grid application software engineering.
    Provide powerful runtime machinery for optimal deployment of applications on the Grid.

Easy-to-use tools for composing applications to run on the Grid are essential to attracting the application community and getting it on board.

Integrate with our Data Grid Broker to support selection of CDB nodes dynamically. (in progress)

Thanks

http://www.gridbus.org/vlab


DBC Time Opt. Scheduling

DBC Scheduling for Time Optimization – No. of Jobs Finished

[Chart: number of jobs finished against time in minutes, one series per resource: AIST-Sun-hpc420.hpcc.jp, AIST-Sun-hpc420-1.hpcc.jp, AIST-Sun-hpc420-2.hpcc.jp, AIST-Sun-hpc220-2.hpcc.jp, ANL-Sun-pitcairn.mcs.anl.gov]

DBC Scheduling for Time Optimization – Budget Spent

[Chart: G$ spent for processing jobs against time in minutes, one series per resource: AIST-Sun-hpc420.hpcc.jp, AIST-Sun-hpc420-1.hpcc.jp, AIST-Sun-hpc420-2.hpcc.jp, AIST-Sun-hpc220-2.hpcc.jp, ANL-Sun-pitcairn.mcs.anl.gov]


DBC Cost Opt. Scheduling

DBC Scheduling for Cost Optimization – No. of Jobs Finished

[Chart: number of jobs executed against time in minutes, one series per resource: AIST-Sun-hpc420.hpcc.jp, AIST-Sun-hpc420-1.hpcc.jp, AIST-Sun-hpc420-2.hpcc.jp, AIST-Sun-hpc220-2.hpcc.jp, ANL-Sun-pitcairn.mcs.anl.gov]

DBC Scheduling for Cost Optimization – Budget Spent

[Chart: G$ spent for job processing against time in minutes, one series per resource: AIST-Sun-hpc420.hpcc.jp, AIST-Sun-hpc420-1.hpcc.jp, AIST-Sun-hpc420-2.hpcc.jp, AIST-Sun-hpc220-2.hpcc.jp, ANL-Sun-pitcairn.mcs.anl.gov]

Parametric Processing

Multiple Runs, Same Program, Multiple Data: Killer Application for the Grid!

Parameters:

Age         Hair
23          Clean
23          Beard
28          Goatee
28          Clean
19          Moustache
10          Clean
-4000000    Too much

Magic Engine for Manufacturing Humans!

Courtesy: Anand Natrajan, University of Virginia