
Building a Massive Virtual Screening using Grid Infrastructure

Chak Sangma
Centre for Cheminformatics, Kasetsart University

Putchong Uthayopas
High Performance Computing and Networking Center, Kasetsart University

Motivation
• Thailand's medicinal plants are important for Thai society
  – Over 1,000 species
  – Over 200,000 compounds
  – Multiple disease targets
• Problem
  – No complete compound database
  – Practice still relies mostly on local knowledge and conventional wisdom
  – Lack of systematic verification by scientific methods

[Figure: Asiatic pennywort; Barleria lupulina Lindl.]

Kasetsart University Thai Medicinal Plants Effort
• Led by the Center for Cheminformatics, Kasetsart University (Dr. Chak Sangma)
• Goal
  – Establish a Thai medicinal plant knowledge base by building a 3D molecular database
  – Employ virtual screening to verify active compounds against conventional knowledge

[Figure: screening workflow diagram. Labels: Reports and Literature; 2D Structures; Approximated 3D Structures; Optimized 3D Structures with GAMESS; Calculated Binding Energy with AutoDock 3.0; Structure in 0.5 Å from Binding Site; SOM Neural Network Map; Results. Annotation: Compute Intensive!]

ThaiGrid Drug Design Portal
• Partners
  – High Performance Computing and Networking Center, KU
  – Center for Cheminformatics, KU
  – IBM Thailand
• Goal
  – Build a virtual screening infrastructure on the ThaiGrid system
  – Start from the KU campus Grid and extend to other ThaiGrid partner universities later
• Link
  – http://tgcc.cpe.ku.ac.th
  – http://www.thaigrid.net

Challenge
• Recent project for the National Center for Genetic Engineering and Biotechnology, Thailand
  – Screen 3,000 compounds in 3 months
• Computation time on a 2.4 GHz Pentium 4 system
  – Over 30 minutes per structure optimization
  – Over 30 minutes per docking
• Estimated computing time on a single processor
  – (3,000 × 30) + (3,000 × 30) = 180,000 minutes
  – 3,000 hours
  – 125 days
  – More than 4 months
• Not fast enough!
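The single-processor estimate above can be checked with a few lines of arithmetic. This is only an illustrative sketch: it treats the "over 30 mins" figures as exactly 30 minutes (so the result is a lower bound), and `ideal_wall_hours` assumes perfect linear speedup with no scheduling or data-transfer overhead, which a real grid will not reach. The function and constant names are hypothetical.

```python
COMPOUNDS = 3_000
MINUTES_PER_OPTIMIZATION = 30  # lower bound from the slide ("over 30 mins")
MINUTES_PER_DOCKING = 30       # lower bound from the slide ("over 30 mins")


def serial_hours(compounds: int = COMPOUNDS) -> float:
    """Single-processor time in hours for optimizing and docking every compound."""
    total_minutes = compounds * (MINUTES_PER_OPTIMIZATION + MINUTES_PER_DOCKING)
    return total_minutes / 60


def ideal_wall_hours(processors: int, compounds: int = COMPOUNDS) -> float:
    """Wall-clock hours under an assumed perfect linear speedup (no overhead)."""
    if processors < 1:
        raise ValueError("need at least one processor")
    return serial_hours(compounds) / processors


if __name__ == "__main__":
    hours = serial_hours()
    print(f"serial estimate: {hours:.0f} hours = {hours / 24:.0f} days")
    print(f"ideal time on 10 processors: {ideal_wall_hours(10):.0f} hours")
```

For example, `ideal_wall_hours(158)` gives the idealized lower bound for the 158 CPUs listed on the Infrastructure slide.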

Key Technologies
• Three key technologies must be combined to provide the solution
  – Cluster computing
  – Grid computing
  – Portal technology

What do we want to do?

Hide the complexity of the Grid and of computational chemistry software from scientists while providing the massive computational power they need

Infrastructure
• ThaiGrid infrastructure is used
• 10 clusters from 6 organizations
  – AMATA – KU
  – GASS – KU
  – MAEKA – KU
  – WARINE – KU
  – CAMETA – SUT
  – OPTIMA – AIT
  – ENQUEUE – KMUTNB
  – PALM – KMUTNB
  – SPIRIT – CU
  – INCA – KMUTT
• 158 CPUs on 110 nodes

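Network

[Figure: network diagram. A ThaiGrid user submits jobs to the ThaiGrid Portal (tgcc.cpe.ku.ac.th), followed by grid job scheduling. Clusters shown: SPIRIT, PALM, OPTIMA, ENQUEUE, GASS, AMATA, MAEKA, WARINE, CAMETA, INCA. Sites shown: KU, CU, KMUTT, KMUTNB, AIT, SUT.]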

Software Architecture
• Each cluster has a local scheduler
  – SGE, OpenPBS, or Condor can be used
  – We use our own SQMS scheduler
• Globus 2.4 is used as middleware
  – Resource control and security (GSI)
• A grid-level scheduler controls multi-cluster job submission
  – Uses KU's own SQMS/G

[Figure: software architecture diagram. Labels: Portal; SCMSWeb; SQMS/G; Globus 2.4; SQMS (one per cluster); clusters AMATA, Warine, GASS, Maeka; KU Gigabit Campus Network.]
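To make the role of a grid-level scheduler concrete, here is a minimal sketch of spreading independent screening tasks over several clusters in proportion to their CPU counts. It is not SQMS/G's actual policy, which the slides do not describe; `assign_tasks` is a hypothetical name, and the per-cluster CPU counts in the example are invented for illustration.

```python
import heapq
from collections import defaultdict


def assign_tasks(task_ids, cluster_cpus):
    """Give each task to the cluster whose tasks-per-CPU load would be lowest."""
    # Heap entries: (load if one more task is added, cluster name, tasks so far).
    heap = [(1 / cpus, name, 0) for name, cpus in cluster_cpus.items()]
    heapq.heapify(heap)
    plan = defaultdict(list)
    for task in task_ids:
        _, name, count = heapq.heappop(heap)
        plan[name].append(task)
        count += 1
        heapq.heappush(heap, ((count + 1) / cluster_cpus[name], name, count))
    return dict(plan)


if __name__ == "__main__":
    # Hypothetical CPU counts; the slides give only the grid-wide total.
    clusters = {"AMATA": 16, "GASS": 8, "MAEKA": 8}
    plan = assign_tasks(range(3000), clusters)
    for name, tasks in sorted(plan.items()):
        print(name, len(tasks))
```

A static split like this only suits tasks of similar length; a production broker would also react to queue state and failures.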

The Portal
• Roles
  – User interface
  – Automate execution flow
  – File access and management
• Features
  – Create project
  – Add ligand, enzyme
  – Submit screening job, monitor job status
  – Download output
• Current portal is built using Plone
  – http://www.plone.org/
  – Python-based web content management
  – Flexible and extensible

How things work!

[Figure: execution flow diagram. Labels: Portal; Resource Broker (SQMS/G); Grid Middleware (Globus 2.4); KU Campus Network; five Compute Resources; Tasks; Monitor.]

Results
• The first version of the compound database (around 3,000 compounds)
• 3,000 compounds screened (30 high-potential compounds found)
  – 4 drug targets (influenza, HIV-RT, HIV-PR, HIV-IN)

XK-263

Experiences
• Some files, such as enzyme structures and output, are very large
  – Requires good bandwidth between sites
  – Some simple optimization techniques can help
    • Caching the enzyme structure file at target hosts substantially reduces the number of transfers needed
• The batch scheduling approach works well if the systems are very homogeneous
  – Allows dynamic staging of execution code to the target host without installation/recompilation
• Many script tools must be developed to
  – Streamline the execution
  – Handle data and code staging
  – Clean up after execution
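The enzyme-file caching mentioned above can be sketched as a small staging helper that transfers a file only on a cache miss. This is an assumed design, not the project's actual scripts: `stage_enzyme`, the cache directory layout, and the injected `download` callable are all hypothetical.

```python
from pathlib import Path
from typing import Callable


def stage_enzyme(
    name: str,
    cache_dir: Path,
    download: Callable[[str, Path], None],
) -> Path:
    """Return a local path to the enzyme file, transferring it only if not cached."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / name
    if not target.exists():
        partial = target.with_name(target.name + ".part")
        download(name, partial)   # transfer into a temporary file first
        partial.replace(target)   # rename, so a half-written file is never reused
    return target
```

Because every docking task for the same target reuses one enzyme structure, the transfer happens once per host rather than once per task.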

Next Generation Massive Screening on Grid
• Move to a service-oriented Grid
  – Use Grid and Web services to encapsulate key applications
  – Build broker and service discovery infrastructure
  – Rely heavily on OGSA and GT3.x, 4.x
• Portlet-based portal
  – JSR 168 (Portlet Specification) compliance
  – More modular, customizable, flexible
  – Plan to adopt GridSphere from GridLab (www.gridlab.org)
• Use a database as the backend instead of files
  – OGSA-DAI might be used for data access

Progress
• We are working on
  – New portal using GridSphere technology (done, testing)
  – Service wrapper for legacy code
    • GAMESS, AutoDock (done, testing)
  – MMJFS interface (in progress)
  – OGSA-DAI integration (in progress)
  – Service registration and discovery (partial)
  – Broker system (design)
  – New monitoring (done)
• Schedule
  – Finish and test: Jan-Feb 2005
  – Deploy: March 2005

[Figure: next-generation architecture diagram. Labels: Portal; Portlet; Broker Server; Registration Server; Backend DB; Molecular DB; OGSA-DAI; File Server; GridFTP; Scheduler; MMJFS; GAMESS Service; GAMESS.]

Design Choices
• Mass data transportation across sites
  – A central FTP server stores the data/database
  – Each compute node pulls the data it needs from this server
    • Ad hoc: ftp, wget/http (firewall friendly)
    • Next: GridFTP
• Cluster / single server
  – Gridify using a service wrapper that exposes the legacy application to the grid as a grid service
  – Does not work for clusters, since compute nodes are hidden behind the head node
    • Fall back to the MMJFS interface, which talks to the local scheduler
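The pull model above, where each compute node fetches its own inputs from a central server, can be sketched with the Python standard library. This is an assumption-laden illustration rather than the project's tooling: `pull_inputs`, the base URL, and the file names are hypothetical. `urllib.request.urlretrieve` accepts `http://`, `ftp://`, and `file://` URLs; GridFTP would need a separate client.

```python
import urllib.request
from pathlib import Path


def pull_inputs(base_url: str, filenames: list[str], workdir: Path) -> list[Path]:
    """Download each named file from the central server into the node's workdir."""
    workdir.mkdir(parents=True, exist_ok=True)
    fetched = []
    for name in filenames:
        url = base_url.rstrip("/") + "/" + name
        dest = workdir / name
        urllib.request.urlretrieve(url, str(dest))  # node initiates the transfer
        fetched.append(dest)
    return fetched
```

Since the node opens the connection itself, only outbound access to the central server is required, which is presumably what makes the ftp and wget/http approach firewall friendly.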

Design Choices
• Service discovery mechanism
  – Publish/subscribe model
    • Service advertising interface/protocol
    • Backend database shared between the registration service component and the broker component
• Adoption of the Grid Notification service and model
  – Available from the myGrid project; seems useful for more dynamic environments
  – Scalability…

[Figure: Broker Service and Registration Service connected through Discovery (SQL).]
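A shared backend database between the registration and broker components, queried with SQL, might look like the sketch below. The schema, function names, and endpoint URLs are assumptions made for illustration; SQLite stands in for whatever database the project actually used.

```python
import sqlite3


def open_registry(path: str = ":memory:") -> sqlite3.Connection:
    """Open the shared registry and create the (assumed) services table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS services ("
        " name TEXT NOT NULL, host TEXT NOT NULL, endpoint TEXT NOT NULL,"
        " PRIMARY KEY (name, host))"
    )
    return conn


def register(conn: sqlite3.Connection, name: str, host: str, endpoint: str) -> None:
    """Registration side: advertise, or refresh, one service instance."""
    with conn:  # commits on success
        conn.execute(
            "INSERT OR REPLACE INTO services (name, host, endpoint) VALUES (?, ?, ?)",
            (name, host, endpoint),
        )


def discover(conn: sqlite3.Connection, name: str) -> list[str]:
    """Broker side: list the endpoints that advertise the named service."""
    rows = conn.execute(
        "SELECT endpoint FROM services WHERE name = ? ORDER BY host", (name,)
    )
    return [endpoint for (endpoint,) in rows]


if __name__ == "__main__":
    registry = open_registry()
    register(registry, "gamess", "amata", "http://amata.example/gamess")
    register(registry, "autodock", "gass", "http://gass.example/autodock")
    print(discover(registry, "gamess"))
```

Keeping discovery as a plain query means the broker needs no protocol of its own, at the cost of polling the database rather than being notified of changes.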

[Portal screenshots: Job Submission; Job Status; Result Visualization; System Status; Performance Record; Job Queue Monitoring; Service Discovery]

Conclusion
• Grid and cluster computing are key technologies that can give us the computing power. Grid works if used wisely!
• Challenges
  – Grid standards are still evolving rapidly
    • Things change before you can finish!
  – Difficult to configure and maintain; some parts are still unstable
  – Firewall and security concerns
  – Lack of manpower with expertise
• Opportunity
  – Secure infrastructure
  – Cost reduction through on-demand integration of networked resources

Acknowledgement
• HPCNC Team
  – Somsak Sriprayoonsakul
  – Nuttaphon Thangkittisuwan
  – Thanakit Petchprasan
  – Isiriya Paireepairit

The End

Backup

Process

[Figure: process diagram. Labels: 2D Structure; 3D Structure; GAMESS (several instances on the Grid); Molecular Structure Database; Optimized 3D Structure; Enzyme; AutoDock (several instances on the Grid); SOM Neural Network; Analysis Results.]

GRID

[Figure: layered grid architecture diagram. Labels: Grid Portal with Portlets; Workflow Engine; Broker Services; Monitoring Services; Docking Services; Optimizing Services; OGSA-DAI; Molecule Database; Grid Middleware (OGSA); Resources (Computer, Network).]
