Tom Furlani, PhD
Center for Computational Research
University at Buffalo, SUNY
Solving the “last mile of computing problem” –
developing portals to enable simulation-based science and
engineering
The Role of High Performance Computation in Economic Development
Rensselaer Polytechnic Institute, October 22 - 24, 2008
Outline
How Did Computation Become so Important?
Bringing HPC to the Researcher’s Desktop
• Portals
• Grid Computing
• Example Portals
Research
• Center for Computational Research Overview
• Understanding Protein Chemistry: Photoactive Yellow Protein
• Toward Petascale level calculations
How did computation become critical?
Revolution in Computing, Storage, Networking/Communication
1940’s, 1980’s, Today
1 TB - $120
Computing Revolution
Microprocessor Revolution
How long would 1 hr calc today take on a PC from 1984?
Slide courtesy – Dan Reed, RENCI
1890-1945 Mechanical, relay 7 year doubling
1945-1985 Tube, transistor 2.3 year doubling
1985-2005 Microprocessor 1 – 1.5 year doubling
Exponentials
• Transistor density: 2X in ~18 months (Moore’s Law)
• Graphics: 100X in 3 years
• WAN bandwidth: 64X in 2 years
• Storage: 7X in 2 years
24 Years!
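The growth rates quoted on these slides are easier to compare once each is converted to a doubling time. The sketch below is illustrative only: the function names are not from the talk, and the 1984 comparison assumes steady doubling over the 24 years from 1984 to 2008 at the 1 to 1.5 year rate quoted for the microprocessor era.

```python
import math

HOURS_PER_YEAR = 24 * 365.25

def doubling_time_years(factor: float, years: float) -> float:
    """Doubling time implied by growing `factor`-fold in `years` years."""
    return years * math.log(2) / math.log(factor)

def slowdown(years_elapsed: float, doubling_years: float) -> float:
    """How many times slower a machine from `years_elapsed` years ago would be."""
    return 2.0 ** (years_elapsed / doubling_years)

# Growth rates as quoted on the slide: (factor, years).
rates = {
    "Transistor density": (2, 1.5),
    "Graphics": (100, 3),
    "WAN bandwidth": (64, 2),
    "Storage": (7, 2),
}
for name, (factor, years) in rates.items():
    months = 12 * doubling_time_years(factor, years)
    print(f"{name}: doubles roughly every {months:.1f} months")

# A 1-hour calculation today, run on a 1984 machine (assumed steady doubling).
for doubling in (1.0, 1.5):
    hours = 1.0 * slowdown(24, doubling)
    print(f"{doubling}-year doubling: {hours / HOURS_PER_YEAR:,.1f} years")
```

Under these assumptions a 1.5-year doubling gives a factor of 2^16 = 65,536, which turns a one-hour job into roughly 7.5 years; shorter doubling times give much larger figures.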
The Storage Revolution
Megabyte
• 5 MB: complete works of Shakespeare
Terabyte: 1,000,000 MB, ~$120 today
• The text in 1 million books
• Entire U.S. Library of Congress is 10 TB of text
• 50,000 trees made into paper and printed
• Large Hadron Collider Experiment: 15 TB/day
Petabyte: 1000 terabytes
• 20 million four-drawer filing cabinets full of text
The Data Tsunami - Many sources
• Agricultural, Medical, Environmental, Engineering, Financial
Why so much data?
• More sensors, at higher resolution
• Faster/cheaper storage capability
• Faster processors generate more data!
The challenge: extracting insight!
• Without being overwhelmed
Advanced Networking
Networks are the 21st century interstate highway system
• Expertise and information - the real product
• Removes the barriers of time and space
[Figures: Eisenhower Interstate System; National Lambda Rail Network]
Enabling SBES for Non-Experts
Bringing HPC to the desktop
• Analogous to impact of Windows vs DOS for PCs
• Brought computing/internet to the home
Many users need periodic, but infrequent access
• Experiment driven
Ease of use is key
• Shouldn’t need to know about OS, compilers, queuing system, etc.
• GUI Interface, Web-based, Access anywhere
How do we get there?
• Focus on development of portals, custom software and tools, data models, GUIs, etc.
• Provide training on the use of these tools
• Ex: nanoHUB, a one-stop resource for nanotechnology
“Old School” Computing
Input file
1. Use VPN to access network (VPN software)
2. Secure login to front-end machine (Secure Shell software)
3. Create subdirectory (Unix commands)
4. Upload input data file (secure file transfer)
5. Identify keywords for model
6. Add keywords to input file (edit input file)
7. Create PBS script file (edit file; PBS format and syntax)
• Set number of processors
• Set path and variables
• Set runtime and queue
• Application command line
8. Submit job to queue (PBS commands)
9. Monitor job
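To make the PBS step concrete, here is a minimal sketch of the kind of script a user had to write by hand. It is not taken from the talk: the `nodes=...:ppn=...` resource line follows TORQUE-style PBS syntax, and the job name, queue name, path and application command are hypothetical placeholders.

```python
def make_pbs_script(job_name, nodes, procs_per_node, walltime, queue, command):
    """Return the text of a PBS batch script (TORQUE-style resource syntax)."""
    lines = [
        "#!/bin/bash",
        f"#PBS -N {job_name}",                           # job name
        f"#PBS -l nodes={nodes}:ppn={procs_per_node}",   # number of processors
        f"#PBS -l walltime={walltime}",                  # runtime limit
        f"#PBS -q {queue}",                              # target queue
        "cd $PBS_O_WORKDIR",                             # directory the job was submitted from
        "export PATH=$HOME/bin:$PATH",                   # hypothetical path setup
        command,                                         # application command line
    ]
    return "\n".join(lines) + "\n"

script = make_pbs_script(
    job_name="example_job",
    nodes=1,
    procs_per_node=8,
    walltime="03:00:00",
    queue="general",                                     # hypothetical queue name
    command="./my_app input.dat > output.log",           # hypothetical application
)
print(script)
```

On a PBS system a script like this is typically submitted with `qsub` and monitored with `qstat`. Every line is a detail the portal workflow on the next slide hides from the user.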
Portal Driven Computing
Input file
1. Open browser
2. Secure login to web portal
3. Upload input data file
4. Select model and run job
5. Monitor job
6. View output in browser
What is an Application Portal?
No consistent definition
Web-based
• On-line simulation from your browser
• Simulation typically doesn’t run on your PC
• Doesn’t have to be grid enabled
WebMO
• Computational Chemistry Portal
nanoHUB
• Web-based resource for research, education and collaboration in nanotechnology
• Includes application portals (tools)
Portal Basics
Remote access to simulations and compute power
[Diagram: Remote Desktop, Internet, Authentication, Application Server, ccr.buffalo.edu, Run Simulation, Export Display]
Application Portals
Benefits
• Scientists able to focus on research rather than details of computing environment
• Underlying infrastructure complexities are hidden
• Transparently integrate compute and data resources
• Moving application to a web-based interface provides ubiquitous access
• Single sign-on: don’t have to maintain accounts on many machines
Challenges
• Requires close collaboration between domain experts and developers
• Developers must be aware of and hide underlying complexity
• Must be easy to use (web-based, GUI)
• Must provide full application functionality
Grid Enabling Applications
Why Needed
• Scientists require an ever growing amount of compute and storage resources
• Experiments may have requirements beyond the capabilities of a single data center
• Datasets are growing at a tremendous rate
Grid Computing
• Provides infrastructure for data and job management
• Handles authentication of users across administrative and political domains
• Provides monitoring of resources and user jobs
• Allows researchers to harness the power of multiple datacenters for large experiments
• Provides a reusable interface to commonly used functions: job status, job submission, file management
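One way to picture the "reusable interface" idea is a small set of operations that a portal calls without knowing which cluster or grid sits behind them. The sketch below is purely hypothetical: `JobService`, `InMemoryJobService` and `run_from_portal` are illustrative names, not part of any grid middleware or of the CCR portals.

```python
from typing import Dict, Protocol

class JobService(Protocol):
    """Operations a portal needs, independent of the back-end resource."""
    def put_file(self, name: str, data: bytes) -> None: ...
    def submit(self, script: str) -> str: ...
    def status(self, job_id: str) -> str: ...

class InMemoryJobService:
    """Stand-in back end used only to make the sketch runnable."""
    def __init__(self) -> None:
        self.files: Dict[str, bytes] = {}
        self.jobs: Dict[str, str] = {}

    def put_file(self, name: str, data: bytes) -> None:
        self.files[name] = data            # file management

    def submit(self, script: str) -> str:
        job_id = f"job-{len(self.jobs) + 1}"
        self.jobs[job_id] = "QUEUED"       # job submission
        return job_id

    def status(self, job_id: str) -> str:
        return self.jobs[job_id]           # job status

def run_from_portal(service: JobService, input_name: str, data: bytes, script: str) -> str:
    """Upload the input file, then submit the job; return the job id."""
    service.put_file(input_name, data)
    return service.submit(script)

service = InMemoryJobService()
job_id = run_from_portal(service, "input.dat", b"example input", "run model")
print(job_id, service.status(job_id))
```

Because the portal code depends only on the three operations, a real back end (a local batch queue or a grid service) could be swapped in without changing the user-facing workflow.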
Example Portals
• WebMO: Computational Chemistry
• REDfly: Bioinformatics
• iNquiry: Common web interface to many command-line tools
• GenePattern: Scientific workflow and genomic analysis tools
CCR Computational Chemistry Portal
CCR iNquiry Bioinformatics Portal, Glimmer page
Based on WebMO: www.webmo.net
CCR portal: webmo.ccr.buffalo.edu
Extensive QC Support
• Gaussian, GAMESS, NWChem, Q-Chem, Mopac, Molpro, Tinker
Interfaces with batch queues on U2 and several faculty clusters
Computational Chemistry Portal
• Browser based login
• Menu driven
• Choose level of theory
• View output, including vibrational modes
Database/Portal Development
REDfly (Regulatory Element Database for Fly)
Database of transcriptional regulatory elements
Aggregates data from multiple offline & online sources
Over 2100 entries
Most comprehensive resource of curated animal regulatory elements
Fully searchable, includes DNA sequence, gene expression data, link-outs to other databases
Extensive collaboration with other online data sources using web services
CCR Bioinformatics Portal
Based on iNquiry: www.bioteam.net
Web portal: inquiry.ccr.buffalo.edu
Extensive Application Support
• Includes popular open-source bioinformatics packages
• EMBOSS, *PHYLIP, HMMer, BLAST, MPI-BLAST, NCBI Toolkit, Glimmer, Wise2, *ClustalW, *BLAT, *FASTA
Extensible for customized application interfaces
Uses U2 Compute Cluster as Computational Engine
TITAN - Modeling Geohazards
Modeling of Volcanic Flows, Mud flows (flash flooding), and Avalanches
Benefits for Developers
• Developers spend too much time supporting user installations
• Support single web-based portal
• CCR supports back-end infrastructure
• Frees developers to focus on improving the models, science
Integrate information from several sources
• Simulation results
• Remote sensing
• GIS data
Web enable for remote access
Metrics on Demand Portal
UBMoD: Web-based Interface for On-demand Metrics
• CPU cycles delivered, Storage, Queue Statistics, etc.
• Role based interface (User, Faculty, Staff, Admin)
• Available in open source :
Center for Computational Research
Under NYS Center for Excellence in Bioinformatics & Life Sciences
Moved to New Buffalo Life Sciences Complex Building
Leading Academic Supercomputing Site
Mission: “Enabling and facilitating research within the University community”
Enable Research by Providing high-end computing and visualization resources, software engineering, scientific computing/modeling, bioinformatics/computational biology, scientific and urban visualization, advanced computing systems
Industrial Outreach/Technology Transfer to WNY
Education, Outreach and Training in WNY
2007 Highlights
Computational Cycles Delivered in 2007:
• 224 different users submitted jobs (88 research groups)
• 354,447 jobs run (almost 1000 per day)
• 700,000 CPU days delivered
• 200 new user accounts created
CIT/CCR Collaboration to Improve Research Computing
• Condor deployment
Portal/Tool Development
• Make machines easier to use
• WebMO (Chemistry)
• iNquiry (Bioinformatics)
• UBMoD (Metrics on Demand)
Accountability
• On-line real-time metrics
UB 2020 Campus Master Planning
• 3D models of all 3 campuses
NYSGrid
CCR Research & Projects
• Urban Simulation and Visualization
• Accident Reconstruction
• Risk Mitigation (GIS)
• Medical Imaging
• High School Workshops
• Cluster Computing
• Data Fusion
• Groundwater Flow Modeling
• Turbulence and Combustion Modeling
• Molecular Structure Determination
• Protein Folding Prediction
• Data Mining: Digital Gov, Library
• Grid Computing
• Computational Chemistry
• Biomedical Engineering
• Bioinformatics
Photoactive Yellow Protein
Simple prototype of the Rhodopsin family of proteins
Chromophore is located completely inside the protein pocket
Protein environment causes absorption shift from 2.70 eV (gas phase) to 2.78 eV (protein) yielding the yellow color
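Converting the quoted excitation energies to wavelengths shows why the protein looks yellow. This is a small illustrative calculation, not part of the original slides; it uses the standard relation wavelength (nm) = hc / E with hc ≈ 1239.84 eV·nm.

```python
HC_EV_NM = 1239.84193  # Planck constant times speed of light, in eV*nm

def wavelength_nm(energy_ev: float) -> float:
    """Photon wavelength in nanometres for an energy given in eV."""
    return HC_EV_NM / energy_ev

for label, energy in (("gas phase", 2.70), ("protein", 2.78)):
    print(f"{label}: {energy:.2f} eV -> {wavelength_nm(energy):.0f} nm")
```

In the protein the absorption maximum falls near 446 nm, in the blue part of the visible spectrum, so the transmitted light appears yellow.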
Chromophore Spectra Measured
Experimental spectra of the protein active site in vacuum, in a protein and in water solution
Provides insight into environmental effects on electronic spectra, large shift of absorption maximum
Can gauge accuracy of theory
Modeling the System
Combined Quantum Mechanical / Molecular Mechanical Method
System is divided into a QM part and a MM part
QM is used to model the “important” part of the system; MM is used to model the remainder
The QM part includes the active site of the protein
The MM part includes the rest of the protein, as well as surrounding water molecules
QM versus MM based Methods
QM Calculations
Advantages: Very accurate, based on first principles (ab initio, DFT: no empirical parameters are involved), can treat bond breaking and formation
Disadvantages: Time consuming, limited to small molecular systems (~100 atoms)
MM Calculations
Advantages: Very fast, can treat entire proteins or solutions (~100,000 atoms)
Disadvantages: Less accurate, based on empirical parameters, cannot describe chemical reactions (electrons are not treated explicitly)
QM/MM
Why use the QM/MM Method?
Improved accuracy (QM) and speed (MM)
Model active site of proteins
• Drug-receptor binding
• Electrostatic effects
• Steric effects
Interpretation of experimental data
• Vibrational spectra
• Electronic spectra
Mechanism of enzymatic activity
• Reaction profiles
• Thermal motion effects on reactivity
Modeling Protein Dynamics
1. Run MM based Molecular Dynamics simulation
2. From MD simulation, randomly select protein conformations (snapshots)
3. Run QM/MM simulation for each snapshot
4. Generate results based on averages taken from snapshots
[Figure: protein dynamics over time]
Goal: Understand how protein thermal dynamics affects function
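The sampling and averaging steps can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: `excitation_energy` is a hypothetical stand-in that returns synthetic numbers in arbitrary units where a real QM/MM calculation would run, and the frame and snapshot counts are placeholders.

```python
import random
import statistics

def excitation_energy(snapshot: int) -> float:
    """Hypothetical stand-in for a QM/MM calculation on one snapshot.

    Returns a synthetic value in arbitrary units, NOT real data.
    """
    rng = random.Random(snapshot)
    return 1.0 + rng.gauss(0.0, 0.1)

def average_over_snapshots(n_frames: int, n_snapshots: int, seed: int = 0):
    """Randomly pick snapshots from an MD trajectory and average a property."""
    rng = random.Random(seed)
    chosen = rng.sample(range(n_frames), n_snapshots)    # step 2: random selection
    values = [excitation_energy(s) for s in chosen]      # step 3: one calc per snapshot
    return statistics.mean(values), statistics.stdev(values)  # step 4: averages

mean, spread = average_over_snapshots(n_frames=100_000, n_snapshots=1000)
print(f"mean = {mean:.3f}, standard deviation = {spread:.3f} (arbitrary units)")
```

Each snapshot is independent of the others, which is what allows the per-snapshot calculations to run at the same time on separate processors.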
Getting Results Faster
Carry out QM/MM calcs simultaneously for many snapshots (protein conformations)
QM/MM Calc for Each Snapshot
After MD, protein snapshots are randomly selected (1000)
Full geometry optimization of the ligand inside the fixed protein matrix (Q-Chem)
• QM: DFT/B3LYP/6-31+G* (ligand)
• MM: AMBER (protein + water)
Electronic excitations (Q-Chem):
• QM: TDDFT/B3LYP/aug-cc-pVTZ (ligand)
• MM: AMBER (protein + water)
• 4500 water molecules
CPU Demand - Current Calculation
MD Simulation
• 1600 CPU hours
• Select 1000 Snapshots
Each Snapshot (54 CPU Hours)
• Combined QM/MM Geometry Optimization: 24 CPU hours (3 hours on 8 processors)
• Electronic Excitation Calc: 30 CPU Hours
Total for all 1000 snapshots + MD Simulation
• 55,600 CPU Hours (2300 CPU Days)
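The CPU budget on this slide can be checked with a few lines of arithmetic. The numbers are taken from the slide; the variable names are illustrative.

```python
MD_CPU_HOURS = 1600
SNAPSHOTS = 1000
GEOM_OPT_CPU_HOURS = 3 * 8        # 3 hours on 8 processors
EXCITATION_CPU_HOURS = 30

per_snapshot = GEOM_OPT_CPU_HOURS + EXCITATION_CPU_HOURS
total_cpu_hours = MD_CPU_HOURS + SNAPSHOTS * per_snapshot
total_cpu_days = total_cpu_hours / 24

print(f"per snapshot: {per_snapshot} CPU hours")
print(f"total: {total_cpu_hours:,} CPU hours ({total_cpu_days:,.0f} CPU days)")
```

The total comes to 55,600 CPU hours, or about 2,317 CPU days, which the slide rounds to 2300. Nearly all of it is the per-snapshot QM/MM work; the MD run is under 3% of the total.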
Results
Electronic excitations of the chromophore

Electronic Excitation   Gas-Phase (eV)   Protein (eV)         Solution (eV)
Calculated              3.07             3.31 (0.06) Δ=0.24   3.52 (0.04) Δ=0.45
Experiment              2.70             2.78        Δ=0.08   3.10        Δ=0.40

( ) - standard deviation
Δ - change relative to the gas phase
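The shifts relative to the gas phase follow directly from the excitation energies in the table. The short check below recomputes them; the dictionary layout and function name are illustrative, and the values are those reported on the slide.

```python
excitations_ev = {
    "Calculated": {"gas": 3.07, "protein": 3.31, "solution": 3.52},
    "Experiment": {"gas": 2.70, "protein": 2.78, "solution": 3.10},
}

def shifts_vs_gas(row: dict) -> dict:
    """Change in excitation energy (eV) relative to the gas-phase value."""
    return {env: round(row[env] - row["gas"], 2) for env in ("protein", "solution")}

for source, row in excitations_ev.items():
    print(source, shifts_vs_gas(row))
```

Read this way, the calculated absolute energies lie about 0.4 to 0.5 eV above experiment. The calculated solution shift (0.45 eV) is close to the measured one (0.40 eV), while the calculated protein shift (0.24 eV) is larger than the measured one (0.08 eV).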
Toward Petascale Level Calc
More accurate MD simulation
• Larger water sphere (50 Å radius): ~12,000 water molecules
• 500 hours on 32 processors - 16,000 CPU hours
More accurate QM/MM simulations
• Larger basis set
• 350 hours on 16 processors - 5600 CPU hours
Better statistics
• 100,000 MD snapshots (560,000,000 CPU hours)
• 2 MD simulations - 1,120,000,000 CPU hours!
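The scaled-up totals can be reproduced as follows. This sketch assumes, as the slide's figures imply, that the 5600 CPU hours apply to each snapshot and that each of the 2 MD simulations contributes 100,000 snapshots; the variable names are illustrative.

```python
MD_CPU_HOURS = 500 * 32                   # 16,000 CPU hours per MD simulation
QMMM_CPU_HOURS_PER_SNAPSHOT = 350 * 16    # 5,600 CPU hours per snapshot
SNAPSHOTS_PER_MD = 100_000
MD_SIMULATIONS = 2

qmmm_per_md = SNAPSHOTS_PER_MD * QMMM_CPU_HOURS_PER_SNAPSHOT
qmmm_total = MD_SIMULATIONS * qmmm_per_md
md_total = MD_SIMULATIONS * MD_CPU_HOURS

print(f"QM/MM for one MD simulation: {qmmm_per_md:,} CPU hours")
print(f"QM/MM for both simulations:  {qmmm_total:,} CPU hours")
print(f"MD itself:                   {md_total:,} CPU hours")
```

The slide's totals match the QM/MM work alone. The two MD runs add only 32,000 CPU hours, which is negligible at this scale.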
Power of Parallel Processing
Assume a modest 4X increase in processor performance/computational efficiency over the next few years
• Reduces requirement to about 10,000,000 CPU days
• Translates to about 100 days of run time on 100,000 cores
Combined QM/MM simulations of this scale are possible on petascale level hardware
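The estimate above can be worked through explicitly, starting from the 1,120,000,000 CPU hours quoted for the full calculation. The sketch assumes ideal (linear) scaling across cores, which real runs will not fully achieve; the variable names are illustrative.

```python
CPU_HOURS_REQUIRED = 1_120_000_000   # from the petascale estimate
SPEEDUP = 4                          # assumed gain in processor performance/efficiency
CORES = 100_000

cpu_days = CPU_HOURS_REQUIRED / SPEEDUP / 24
wall_clock_days = cpu_days / CORES   # assumes perfect parallel scaling

print(f"{cpu_days:,.0f} CPU days after a {SPEEDUP}X speedup")
print(f"{wall_clock_days:,.0f} days of run time on {CORES:,} cores")
```

The unrounded figures are about 11.7 million CPU days and about 117 days of run time, which the slide rounds to 10,000,000 CPU days and 100 days.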
Acknowledgements
Portal Development
• Steve Gallo, Dr. Matt Jones, Jon Bednasz, Rob Leach
Combined QM/MM Calculations
• Dr. Marek Friendorf
Funding
• NIH