The Global Bio Grid
Andrew Grimshaw
University of Virginia
January 2006
Virginia Center for Grid Research
• Why Bio Grids?
• Grid Basics
• The Global Bio Grid
In ten years the world will be very different.
Think back ten years.
• No web
• Wide-spread internet was new
• Human Genome Project still far from completion
• Science (biology) done primarily in individual labs
Today
• Billions a year in e-commerce
• Internet everywhere
• Broadband to your home
• Wireless becoming pervasive
• Pervasive devices are proliferating (e.g., motes)
• Sequencing of organisms is a daily event; bioinformatics is hitting the mainstream
Tomorrow
• $1000/sequence for humans – becomes standard clinical practice
• “Biology is becoming an information science” (Large-Scale Biomedical Science: Exploring Strategies for Future Research, Institute of Medicine, National Research Council, 2003)
• Global interconnected networks – grids
• Provide transparent, secure access to data, applications, and on-demand compute.
• Research using not just your data, but all trusted data, not just your applications, but any trusted application.
• Implications for progress are significant.
There are a number of “catches”
• So much data!
• So many organizations with so little trust!
• So much complexity!
An IT guy's view
• Data is all over, in many different forms, with lots of different policies
• Need to get the right data in the right place at the right time
• Ontology problem – how do we compare and integrate the databases?
• Need to understand semantics, automatically transform
• Semantics
• Knowledge discovery – “mining”
This is where grids enter the picture
(we do the plumbing)
Some lessons learned
• 10+ years in academic and commercial grids
• All/most problems are not technical
• Users don’t want change!
• Too many grids are technology centric
• Must keep “activation energy” low
• Need a user-centric approach
• There are at least four classes of users
• Wide variance in computational savvy
A grid enables users to collaborate securely by sharing processing, applications, work flows and processes, and data across heterogeneous systems and administrative domains for collaboration, faster application execution, and easier access to data.
What is a Grid?
A grid is all about gathering together resources and making them accessible to users and applications.
The emphasis is on secure access to a wide variety of resources.
Characteristics of Grid systems
• Numerous resources
• Ownership by mutually distrustful organizations & individuals
• Potentially faulty resources
• Different security requirements & policies required
• Resources are heterogeneous
• Geographically separated
• Different resource management policies
• Connected by heterogeneous, multi-level networks
What grids are not
• The solution to all problems
• Clusters of machines
• SETI@home
• Any one particular technology
User's view
[Figure: Sites 0 to 3, including clusters and an HPSS archive, joined by a grid through which users run programs, access data, collaborate, and provide shared services.]
Grid Computing Scenarios
Desktop Cycle Aggregation
• Limited acceptance in commercial enterprises
Cluster Grids
• Single owner, department, project
• Single domain, file system
• LAN connection
Campus/Enterprise Grids
• Multiple owners, domains
• Multiple file systems
• WAN connection
Partner Grids
• Multiple owners, sites, domains
• Multiple file systems
• Internet connectivity
Legion Grid Software – Compute and Data Grid
Standards
• Global Grid Forum – ggf.org
• OGSA – Open Grid Services Architecture
• Web-Services based IPC
• WSRF and possibly others
• OGSA-BES – Basic Execution Service
• OGSA-ByteIO – file I/O
• WS-Naming – abstract name to EPR (endpoint reference)
• RNS-lite – Resource Name Space
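The naming bullets above can be made concrete with a small sketch. It is a simplified, two-level illustration of the idea (a human-readable path maps to a stable abstract name, and a resolver maps that name to whatever endpoint currently serves the resource), not the data model of the actual WS-Naming or RNS specifications. The class names, the `urn:example:` name, and the `site-a.example` / `site-b.example` addresses are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EndpointReference:
    """Hypothetical stand-in for an EPR: where a resource can be reached right now."""
    address: str


@dataclass
class Resolver:
    """Maps a stable abstract name to the endpoint that currently serves it."""
    bindings: dict = field(default_factory=dict)

    def bind(self, abstract_name: str, epr: EndpointReference) -> None:
        self.bindings[abstract_name] = epr

    def resolve(self, abstract_name: str) -> EndpointReference:
        return self.bindings[abstract_name]


@dataclass
class NameSpace:
    """Maps human-readable paths to abstract names (the name-space layer)."""
    entries: dict = field(default_factory=dict)

    def link(self, path: str, abstract_name: str) -> None:
        self.entries[path] = abstract_name

    def lookup(self, path: str) -> str:
        return self.entries[path]


def open_by_path(path: str, namespace: NameSpace, resolver: Resolver) -> EndpointReference:
    """Two-step lookup: path -> abstract name -> current endpoint."""
    return resolver.resolve(namespace.lookup(path))


# Hypothetical names and addresses, for illustration only.
PATH = "/biogrid/public/uniprot"
NAME = "urn:example:resource:uniprot"

namespace = NameSpace()
resolver = Resolver()
namespace.link(PATH, NAME)
resolver.bind(NAME, EndpointReference("https://site-a.example/uniprot"))
before = open_by_path(PATH, namespace, resolver)

# The resource migrates to another site; only the resolver binding changes.
resolver.bind(NAME, EndpointReference("https://site-b.example/uniprot"))
after = open_by_path(PATH, namespace, resolver)
```

The point of the indirection is that users and applications keep using the same path while the resource moves between sites.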
The Global Bio Grid
• Federated access to multiple
  • Data sources
    • Public databases
    • Commercial databases
    • In-house databases, annotations, etc.
  • Application suites (including processes and workflows)
  • Compute resources
• Shared among collaborative research teams
  • Multiple research locations
  • Virtual organizations
• Built on evolving computing standards (GGF, I3C, WS-*)
GBG concept
Global Bio Grid
• Datagrid using Avaki DG technology
  • Working on making ADG available free for “.edu”
  • UVA, NCBIO, U-Texas, Texas Tech
  • Already operational
  • Flat file and relational
  • Working on an OGSA-compliant implementation
• Compute grid at UVA on-line
  • 64 dual-processor Opterons available
  • Sunfires
  • Hundreds of Windows machines
  • Legion 1.8 based – moving towards OGSA-compliant services
• Applications
  • Biomarker
  • Searching PubMed
  • Hospital info integration
Three resource classes illustrate the Grid-effect
• Data
• Processing
• Applications
Data
• Suppose you have collaborators with critical databases (clinical, protein, other) that you need to use.
• You use a number of databases that change on a regular basis.
• You want to “mine” heterogeneous data sets (relational, flat-file, XML, …) in different locations – say, in a hospital.
• You want to produce, consume, or share derivative data products, e.g., the result of a set of joins and data transformation steps.
• This applies to business data (BI/EII) as well as life science data
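To illustrate what "mining" relational, flat-file, and XML data together might involve, here is a minimal sketch using only the Python standard library. It reads one source of each kind into plain dictionaries and joins them on a shared key to produce a derivative data product. The table, field names, and sample values are invented for the example, and the sketch says nothing about how the data grid itself performs such joins.

```python
import csv
import io
import sqlite3
import xml.etree.ElementTree as ET


def rows_from_relational(conn, query):
    """Yield each row of a SQL query as a plain dict keyed by column name."""
    cursor = conn.execute(query)
    columns = [col[0] for col in cursor.description]
    for row in cursor:
        yield dict(zip(columns, row))


def rows_from_flat_file(text):
    """Yield each line of a comma-separated flat file (with header) as a dict."""
    yield from csv.DictReader(io.StringIO(text))


def rows_from_xml(text, record_tag):
    """Yield each <record_tag> element as a dict of its child tags and text."""
    root = ET.fromstring(text)
    for element in root.iter(record_tag):
        yield {child.tag: child.text for child in element}


def join_on(key, *sources):
    """Merge records from all sources that share the same value of `key`."""
    merged = {}
    for source in sources:
        for record in source:
            merged.setdefault(record[key], {}).update(record)
    return merged


# Invented sample data: one relational table, one flat file, one XML document.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (sample_id TEXT, age INTEGER)")
conn.executemany("INSERT INTO patients VALUES (?, ?)", [("S1", 54), ("S2", 61)])

flat_file = "sample_id,marker\nS1,CA-125\nS2,PSA\n"
xml_doc = (
    "<results><result><sample_id>S1</sample_id>"
    "<level>35.2</level></result></results>"
)

combined = join_on(
    "sample_id",
    rows_from_relational(conn, "SELECT sample_id, age FROM patients"),
    rows_from_flat_file(flat_file),
    rows_from_xml(xml_doc, "result"),
)
```

Each adapter hides the storage format behind the same record shape, which is what lets the join step stay format-agnostic.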
DataGrid: Unifying fabric for data access
• Transparent access to multiple DBs
• Multiple domains
• Highly secure, flexible access control
• Automatic cache management and coherence
[Figure: a research institution (Biology, Biochemistry; APP 1, APP 2) and partner institutions sharing sequence data sets (SEQ_1 to SEQ_3) and public databases (PDB, NCBI, EMBL) through the data grid.]
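"Automatic cache management and coherence" can be pictured with one very simple scheme: keep a local copy of each item together with the version of the source it came from, and re-read the item only when the source version has moved on. The slides do not say which coherence scheme the data grid uses, so the sketch below is an assumption chosen for brevity, and `VersionedSource` and `CoherentCache` are hypothetical names.

```python
class VersionedSource:
    """Hypothetical remote database whose contents carry a version counter."""

    def __init__(self, data):
        self._data = dict(data)
        self.version = 1
        self.reads = 0  # counts expensive full reads, for demonstration

    def update(self, key, value):
        self._data[key] = value
        self.version += 1

    def read(self, key):
        self.reads += 1
        return self._data[key]


class CoherentCache:
    """Serves cached values while the source version is unchanged."""

    def __init__(self, source):
        self.source = source
        self._entries = {}  # key -> (source version when cached, value)

    def get(self, key):
        cached = self._entries.get(key)
        if cached is not None and cached[0] == self.source.version:
            return cached[1]  # still coherent: no remote read needed
        value = self.source.read(key)
        self._entries[key] = (self.source.version, value)
        return value


source = VersionedSource({"P12345": "MKTAYIAK"})
cache = CoherentCache(source)

first = cache.get("P12345")    # miss: one remote read
second = cache.get("P12345")   # hit: served locally
reads_before_update = source.reads

source.update("P12345", "MKTAYIAKQR")
third = cache.get("P12345")    # stale entry detected, re-read from source
```

A version check is much cheaper than transferring the data, which is why researchers would not need to re-download a database that has not changed.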
Three Concrete Examples
• KDS – “data mining” on widely separated data sets such as PubMed.
• “Map” UniProt datasets into the data grid
• Researchers no longer need to spend time downloading the latest versions
• Extended Hospital
Extended Hospital
[Figure: a hospital, made up of department domains (each with its own data) and a data warehouse, linked to insurance companies, emergency vehicles, research, clinics / large practices, non-related hospitals, and authorized family.]
Processing
• Classic high-throughput computing
• Suppose you have thousands of computationally intensive jobs to run
  • SW, CHARMm, Sequest, a.out
• Your usage is bursty – you need a lot over a short period of time, but often have idle resources
• You wish you had more!
Compute Grid: Shared access to processing
• Flexible, location-independent access to virtually unlimited processing, on-demand
• Scheduling, usage, management policies
• System detects, recovers from job failures
• Heterogeneous platform support
• Usage accounting, as required
[Figure: the same research and partner institutions and public databases (PDB, NCBI, EMBL), now also sharing processing on Cluster 1 to Cluster N through the compute grid.]
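The "detects, recovers from job failures" behaviour for a batch of independent jobs can be sketched on a single machine with a thread pool standing in for the grid's clusters. This is a minimal illustration under that assumption, not how Legion schedules work. `run_with_recovery` and `flaky_align` are hypothetical names, and the simulated failure is contrived so the example is deterministic.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_with_recovery(job, inputs, max_attempts=3, workers=4):
    """Run job(item) for every item, resubmitting failed jobs up to max_attempts."""
    results, failures = {}, {}
    attempts = {item: 0 for item in inputs}
    pending = list(inputs)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while pending:
            futures = {pool.submit(job, item): item for item in pending}
            pending = []
            for future in as_completed(futures):
                item = futures[future]
                attempts[item] += 1
                try:
                    results[item] = future.result()
                except Exception as error:  # failure detected
                    if attempts[item] < max_attempts:
                        pending.append(item)  # recover by resubmitting
                    else:
                        failures[item] = error
    return results, failures


_seen = set()
_lock = threading.Lock()


def flaky_align(sequence_id):
    """Hypothetical job: fails the first time it sees an odd id, as a lost node might."""
    with _lock:
        first_time = sequence_id not in _seen
        _seen.add(sequence_id)
    if first_time and sequence_id % 2 == 1:
        raise RuntimeError(f"node lost while running job {sequence_id}")
    return sequence_id * sequence_id


results, failures = run_with_recovery(flaky_align, range(6))
```

Because high-throughput jobs are independent of one another, a failed job can simply be run again elsewhere without disturbing the rest of the batch.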
Concrete Examples
• Biomarkers project wants to run Sequest-2 using public databases
• Charmm/Amber
• Gnomad (Altman et al.)
• BLAST, FASTA, …
• Autodock
Applications
• Suppose you want to use applications or workflows developed, maintained, and supported by others – without the hassle of installing all of them on your gear.
• Suppose you want to couple multiple applications developed at different institutions together.
Grid users share applications, employing multiple data & processing resources
• Flexible binary management
• No need to recompile applications
• Securely share applications
• Restrict who gains access
• Restrict where apps run
[Figure: research and partner institutions sharing three resource classes through the grid: Data (PDB, NCBI, EMBL, SEQ_1 to SEQ_N), Processing (Cluster 1 to Cluster N), and Applications (APP 1 to APP N).]
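The two restrictions on shared applications (who gains access, and where the application may run) amount to a policy check made before a job is placed. The sketch below shows one hypothetical way to express that check; `AppPolicy`, the user names, and the site names are invented, and a real grid would tie this to its own authentication and scheduling machinery.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppPolicy:
    """Hypothetical sharing policy attached to one shared application."""
    allowed_users: frozenset
    allowed_sites: frozenset


def may_run(policy, user, site):
    """Grant a request only if both the user and the target site are allowed."""
    return user in policy.allowed_users and site in policy.allowed_sites


def eligible_sites(policy, user, candidate_sites):
    """Return the candidate sites where this user may run the application."""
    return [site for site in candidate_sites if may_run(policy, user, site)]


# Invented example: an application shared with two users, runnable at two sites.
policy = AppPolicy(
    allowed_users=frozenset({"alice", "bob"}),
    allowed_sites=frozenset({"cluster-1", "cluster-2"}),
)
candidates = ["cluster-1", "cluster-2", "cluster-n"]

alice_sites = eligible_sites(policy, "alice", candidates)
mallory_sites = eligible_sites(policy, "mallory", candidates)
```

Keeping the policy with the application, rather than with each site, lets the owner change who may use it without redistributing binaries.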
Better Research, Faster
• Secure, wide-area access to global breadth of consistent, current data
• Access to vast processing power
• Ability to securely share proprietary data and applications, as needed
[Figure: research and partner institutions sharing Data (PDB, NCBI, EMBL, SEQ_1), Processing (Cluster 1 to Cluster N), and Applications (APP 1 to APP N) through the grid.]
Evolution in action
• 50’s: Bare metal programming
• 60’s to 80’s: Batch OS, multi-user timeshare
• Today: Low-level network programming
• Now & future: Grid & WS
Summary
• Grids will have a huge impact on the life sciences
• Prototype GBG operational
• Applications are underway
• We’re always looking for new applications