Federation of American Scientists
Biomedical Computing
Requirements for HPCS
Kay Howell, Federation of American Scientists
Higgins, SimQuest, LLC
Biomedical Computing Requirements for HPCS
Examine broad range of application areas
Identify key applications driving computing demand
Identify hardware/software challenges for important classes of applications
Highlight HPCS areas critical to advances in biomedical computing
Identify technology gaps common to biomedical, national security, and other nationally important applications
Demonstrate market potential of HPCS in biomedical computing
Requirements Analysis
System architecture requirements, including:
processors, memory, interconnects, system software, and programming environments
Bandwidth requirements
System robustness
Application development and maintenance
System management, operation, and maintenance
Biomedical Computing Requirements
Areas: Genome Bioinformatics; Protein Biochemistry (Proteomics); Chemoinformatics – Drug Discovery; Computational Biology; Diagnostic Imaging and Image-Guided Interventions
Description: Genomics, DNA sequencing, microarray technologies and bioinformatics; Includes protein structure and function; Molecular Modeling (MD, QM, MC, MM); Tissue Engineering / Organ Modeling / Systems Biology
Market Projections: Moderate, but growing; Small, but growing very rapidly; Moderate, growing moderately; Small, but growing rapidly; Small, but growing rapidly; Very large, growing moderately
People to interview: Adam Arkin, George Church, Shankar Subramaniam, Jack Dixon, Andrea Sinz, Barry Stoddard, Rick Blevins, Donna Huryn, Wah Chiu, Andrew McCulloch, Rick Lathrop, Brian Athey, Klaus Schulten, Chris Johnson, Bill Lorensen, Ron Kikinis, Michael Vannier
People to interview (federal): Gerard Bouffard, NISC (NIH); Stephen Altschul, NCBI (NIH); Francis Collins (NHGRI); Parag Chitnis, NSF; Yawen Bai, NIH; Nigel Page, DoD; Dan Zaharevitz, NCI; Peter Steinbach, CMM, NIH; Donna Hillmann, NSF; Bret Peterson, NCRR; Sri Kumar, DARPA; Carol Lucas, NSF; Ruth Prachter, AF; Terry Yoo, NLM; Richard Swaja, NIBIB; Larry Clarke, NCI
Companies to talk to: Incyte, Celera, Viaken, Geneva Bioinformatics, Myriad Proteomics, Oxford GlycoSciences, ArQule, Albany Molecular Research, Trega, Structural Bioinformatics, HPC vendors, Physiome Sciences, Entelos, SimQuest, GE Medical Systems, Medtronic, BrainLab
Focus Areas
Resources for managing, analyzing, interpreting data
Extending the time scale & complexity of simulations
Combined classical/quantum chemical simulations
Simulations of large systems
Protein structure prediction
Diagnostic imaging and image-guided interventions
Work Plan
Survey existing information and materials
Interview researchers, sponsors and industrial representatives
Produce preliminary report summarizing findings and distribute for review and comment
Deliver initial requirements one year after project award
Update the report one year later
Biomedical Computing: What we’d like to be able to do…
Static Dynamic Functional
Mouse/Human Genome Correlation
Individual Pharmacogenomic analysis using Gene Expression Arrays
Multi-modal Radiology Image Fusion
Millisecond Structural Biology enabled by Synchrotron X-ray Sources and 900 MHz NMR
Physiologically competent Digital Human Simulations
your additions to the list…
Challenges in Biomedical Computing
Non-linear - current models are simplified linear approximations
System Complexity - need to span multiple scales of biological organization
Time Scales
Exponential increases in data
[Figure: Complexity and Timescale of Biomedical Computing Problems (ORNL). Axes: timescale in seconds, from 10^-15 to 10^9 (geologic and evolutionary timescales), and size scale (atoms, biopolymers, cells, organisms). Plotted methods and processes: ab initio quantum chemistry; first-principles molecular dynamics; empirical force field molecular dynamics; electrostatic continuum models; discrete automata models; finite element models; homology-based protein modeling; enzyme mechanisms; protein folding; DNA replication; cell signalling; organ function; evolutionary processes; ecosystems and epidemiology.]
Biomedical Computing Requirements for HPCS
Application Areas
Biological Research Requiring ultra-HPC Resources
Structure of proteasome, ribozyme, ribosome, ATPases, viruses, membrane protein complexes
Whole genome comparison
Combined quantum/classical simulations
Protein folding/threading
Microsecond time-scale simulations
Self-organization and self-assembly
Protein-protein and protein-DNA recognition and assembly
Your additions….
Sequencing and Analysis
Key Attributes:
Integer intensive
Significant research into new kinds of statistical models: hybrids of HMMs and neural nets, dynamic Bayesian nets, factorial HMMs, Boltzmann trees
Clusters typically used
Large-scale database infrastructure common
Cluster can be dedicated to single task/local data control
Cycle requirements can be substantial because of data
Systems often in excess of 1 Tflop (range 1-5)
Protein Structure Prediction: Summary of Computational Characteristics
Pipeline processing (network of interrelated tasks)
Clustering: computationally intensive
Algorithms easier to implement using shared-memory parallelism due to tight coupling, fine-grained, non-uniform workload
Generation of sequence fragments: ANN algorithm may be ideal for this and for clustering purposes
Fragment library written to a database
Compute intensive algorithms are clustering (ANN) and optimization (GA)
Optimization easier to implement using loosely coupled distributed compute cluster
Protein Structure Prediction Wish List
Hardware/software to map the processing pipeline efficiently
Tools to schedule such a pipeline, checkpoint
Well balanced hardware pipeline from archival storage to the compute elements without bottlenecks
Easily programmable FPGA coprocessor boards to handle the integer and other DSP branches of the pipeline
Hardware and software that can handle truly asynchronous computing as it is the key to scalability (overlapped computation, communication and I/O)
Efficient ANN and GA libraries similar to LAPACK
Efficient skeleton/template codes for common computation/communication/IO (OO jargon patterns) across all platforms
Standardized Framework, libraries, database providing the computational characteristics of the underlying hardware/software environment
Source: G. Chukkapalli, UCSD
Protein Structure Prediction: Future Requirements
Combine knowledge based prediction with ab initio methods to improve the prediction accuracy
Execute the whole pipeline on demand in an automated fashion
Generate predicted structures for whole genomes
Protein design: inverse problem
All these are prohibitively expensive at present
Molecular Level Modeling
Biochemical analysis
Protein binding /drug target evaluation
Dynamics of molecules
Very large systems with physics
Computational Biology HPC Challenges
Activity | Current Limit | Problem Size | Complexity | Memory
Ab initio study of enzyme catalysis | 60 heavy atoms | 250 heavy atoms | O(n^3) | O(n)
X-ray refinement of large assemblies | 25,000 atoms | 125,000 atoms | O(n^2) | O(n^2)
Large scale protein motion, membrane transport | 200 residues | 1000 residues | O(n log n) | O(n)
Flexible docking of chemical databases | 3000 compounds | 1,000,000 compounds | O(n) | O(n)
Phylogenetic mapping | 150 sequences, 10,000 bases | 200 sequences, 1,000,000 bases | O(n^3) | O(n x m)
RNA 3-D conformations | 100 residues | 1000s of residues | O(n log n) | O(n)
Source: S. Burke, NIH
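As a rough illustration of what the complexity column implies, the sketch below estimates how much more work each target problem size would take relative to the current limit. It assumes cost follows the stated order exactly, with no constant factors or lower-order terms; that simplification, the function name `growth`, and the `rows` list are illustrative and not from the source.

```python
import math

def growth(order, n_now, n_target):
    """Relative cost of growing a problem from n_now to n_target under a given order."""
    ratio = n_target / n_now
    if order == "n":
        return ratio
    if order == "n^2":
        return ratio ** 2
    if order == "n^3":
        return ratio ** 3
    if order == "n log n":
        return (n_target * math.log(n_target)) / (n_now * math.log(n_now))
    raise ValueError(f"unknown order: {order}")

# (activity, current limit, target size, compute order) taken from the table above
rows = [
    ("Ab initio study of enzyme catalysis", 60, 250, "n^3"),
    ("X-ray refinement of large assemblies", 25_000, 125_000, "n^2"),
    ("Large scale protein motion", 200, 1000, "n log n"),
    ("Flexible docking of chemical databases", 3000, 1_000_000, "n"),
]

for activity, n_now, n_target, order in rows:
    factor = growth(order, n_now, n_target)
    print(f"{activity}: about {factor:,.0f}x more compute")
```

Under these assumptions a roughly fourfold increase in heavy atoms costs about 72 times the compute, which is why the cubic-scaling rows are the hardest to extend.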
Biological Computing Assessment
(Assume 10^5 seconds to finish computation)
Application | BioCatalysis, Mn-Salen (QM) (float) | Enzymes, ras (QM/MM) (float) | μ-Array, 8000 genes (Clustering) (integer) | Multiple Alignment, Phylogenetics (Pattern Matching) (integer) | Whole Genome Analysis (Sequence Comparison) (integer)
Computational requirements (Ops/sec) | 1 x 10^12 | 10 x 10^12 | 200 x 10^9 | 100 x 10^9 | 100 x 10^12
Memory access patterns | Random | Partitioned - Random | Sequential | Random | Sequential or Random
I/O performance | Moderate | Bandwidth Communication | NA | Memory Bandwidth | Memory Bandwidth
Compiler speed | Optimization | Optimization | Optimization | Optimization | Critical for FPGA
O/S speed and stability | MTBF | Processor Scale | Processor Scale | Processor Scale | Support for new architectures
Platform porting strategies/experiences | Runs on Many CPU Platforms | Most Scalar and Many Parallel | Runs on Many CPU Platforms | Runs on Many CPU Platforms | All CPUs (ev7 opt), FPGA
Performance across multiple architectures | Scalar & Parallel | Spatial Decomp. Parallel | Scalar & Parallel | Scalar | Scalar, Parallel, Vector, FPGA
Code size (Lines) | 300,000 | 400,000 | 3,000 | 5,000 | 2,000
Key algorithms and improvements | Direct, Parallel, Vector? | Linear Scaling, Parallel | Parallel, MHz | Needs Parallelization | FPGA
Source: S. Burke, NIH
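The assessment fixes the run time at 10^5 seconds, so each sustained rate corresponds to a total operation count. The sketch below does that multiplication; the dictionary keys are shortened labels chosen here for illustration, and the rates are read from the "Computational requirements" row.

```python
RUN_SECONDS = 10**5  # run time assumed by the assessment

# Sustained rates in operations per second, as listed in the assessment
sustained_ops_per_sec = {
    "BioCatalysis (QM)": 1 * 10**12,
    "Enzymes (QM/MM)": 10 * 10**12,
    "Microarray clustering": 200 * 10**9,
    "Multiple alignment / phylogenetics": 100 * 10**9,
    "Whole genome analysis": 100 * 10**12,
}

# Total work = sustained rate x run time
total_ops = {name: rate * RUN_SECONDS for name, rate in sustained_ops_per_sec.items()}

days = RUN_SECONDS / 86_400  # a little over one day
for name, ops in total_ops.items():
    print(f"{name}: {ops:.1e} operations in {days:.2f} days")
```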
Data Management
Data management issues will be critically important
- Volume of biological data is estimated to be doubling every 6 months
- GenBank grew from 680,338 base pairs in 1982 to 22 billion base pairs in 2002 (compared to 13.5 billion base pairs as of August 2001)
- Rate of data acquisition is 100X higher than originally anticipated due to improved sequencing technology and methods
Redundancies and database asynchrony are increasing - database-to-database comparisons are required for analysis and validation
To look at long-range patterns of expression, syntenic regions on the order of tens of megabases become reasonable lengths for consideration
What other data issues should be highlighted?
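The growth figures above can be put on a common footing by converting them to doubling times. The sketch below assumes smooth exponential growth between the two GenBank endpoints, which is a simplification made here; the function names are illustrative.

```python
import math

def doubling_time_years(start, end, years):
    """Doubling time implied by exponential growth from start to end over `years`."""
    return years * math.log(2) / math.log(end / start)

def yearly_growth_factor(doubling_months):
    """Growth factor per year for a given doubling time in months."""
    return 2 ** (12 / doubling_months)

# GenBank: 680,338 base pairs (1982) to 22 billion base pairs (2002)
genbank_doubling = doubling_time_years(680_338, 22e9, 2002 - 1982)
print(f"GenBank 1982-2002 average doubling time: {genbank_doubling * 12:.0f} months")

# Doubling every 6 months corresponds to a fourfold increase per year
print(f"Growth per year at 6-month doubling: {yearly_growth_factor(6):.0f}x")
```

The 20-year GenBank average and the 6-month estimate cover different periods and kinds of data, so they need not agree.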
Data Management Issues
New types of data support to extend existing RDBMS:
Sequences and Strings
Trees and Clusters
Networks and Pathways
Deep Images
3D Models and Shapes
Molecules and Coordinate Structures
Hierarchical Models and Systems Descriptions
Time Series and Sets
Probabilities and Confidence Factors
Visualizations
Source: Davidson, Bristol-Myers Squibb Pharm. Res. Institute
Systems Biology – Modeling the Cellular System
Combine cell signaling, gene regulatory and metabolic networks to simulate cell behavior
Hybrid information & physics based model
Integrating computational/experimental data at all levels
Modeling of network connectivity (sets of reactions: proteins, small molecules, stochastic, MD)
Difficult to handle computationally
importance of spatial location within the cell
instability associated with reactions between small numbers of molecular species
combinatorial explosion of large numbers of different species
>Petaflop problem
Systems Biology
Need to simulate gene expression, metabolism and signal transduction for single and multiple cells
Algorithms need to be designed specifically for biological research: the parameter optimizer needs to find as many local minima as possible, including the global minimum, because there are multiple possible solutions of which only one is actually used
Must be able to simulate both high concentrations of proteins, which can be described by differential equations, and low concentrations of proteins, which need to be handled by stochastic process simulation
Stochastic methods are being used (STOCHSIM and Gillespie algorithm)
individual molecules represented rather than concentrations of molecular species; Monte Carlo methods are used to predict interactions
rate equations are replaced by individual reaction probabilities
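A minimal sketch of the Gillespie approach described above, applied to a single molecular species that is produced at a constant rate and degraded in proportion to its copy number. The birth-death model, the rate constants, and the function name are assumptions chosen for illustration; they do not come from the source.

```python
import random

def gillespie_birth_death(k_prod, k_deg, n0, t_end, seed=0):
    """Simulate production (rate k_prod) and degradation (rate k_deg * n) of one species.

    Returns a list of (time, copy_number) pairs, one per reaction event.
    """
    rng = random.Random(seed)
    t, n = 0.0, n0
    trajectory = [(t, n)]
    while True:
        a_prod = k_prod        # propensity of the production reaction
        a_deg = k_deg * n      # propensity of the degradation reaction
        a_total = a_prod + a_deg
        if a_total == 0:
            break              # no reaction can fire
        # Waiting time to the next reaction is exponentially distributed
        t += rng.expovariate(a_total)
        if t > t_end:
            break
        # Choose which reaction fires, with probability proportional to its propensity
        if rng.random() * a_total < a_prod:
            n += 1
        else:
            n -= 1
        trajectory.append((t, n))
    return trajectory

traj = gillespie_birth_death(k_prod=10.0, k_deg=1.0, n0=0, t_end=5.0, seed=1)
print(f"{len(traj) - 1} reaction events, final copy number {traj[-1][1]}")
```

Each event changes the copy number of an individual species by one molecule, so the rate equation is replaced by per-reaction probabilities, as the slide notes.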
Digital Imaging
Used for monitoring of disease progression, diagnosis, preoperative planning and intraoperative guidance and monitoring
Algorithms are computationally demanding
Key issues are segmentation and registration
Signal processing techniques are used to enhance features and generate the desired segmentation
Results of the segmentation are aligned to other data acquisitions and to the actual patient during procedures
Results of the segmentation are visualized using different rendering methods
Digital Imaging
Feature Enhancement
Idea: Modulate selected characteristics
Method: Spatial and frequency domain filtering: convolutions
Parallelization: SMP and MPI style for Fourier transforms [Frigo, 1997] and convolutions
Applications: Noise reduction [Gerig, 1992], removal of partial volume artefacts [Westin, 1997]
Classification: k-NN, Parzen window
Idea: Classify an unknown voxel based on prototypes
Method: Nonparametric supervised statistical classification [Duda, 1973], [Cover, 1967], [Cover, 1968], [Clarke, 1993], [Warfield, 1996], [Friedman, 1975]
Parallelization: Each voxel treated separately [Friedman, 1975]. SMP for core, MPI
Applications: Classification in different areas of the body [Kikinis, 1992], [Huppi, 1998], [Warfield, 1995, Warfield, 1996]
Classification: EM
Idea: Increase robustness of statistical approach through adaptive behaviour
Method: Iterates between statistical classification and intensity prediction/correction [Wells, 1996]
Parallelization: Classification step as in k-NN, intensity correction [Wells, 1986]: convolutions. SMP, NUMA
Applications: Classification primarily of brain MRI [Morocz, 1995], [Kikinis, 1997], [Iosifescu, 1997]
Linear Registration: Intra-subject
Idea: Use inherent contrast similarity to align image
Method: Requires entropy and joint entropy computation [Wells, 1996a]
Parallelization: Joint histogram computation, parallelized by computing the histogram of data chunks. Joint entropy calculated by a loop over the histogram. SMP, MPI
Applications: Registration of slices for multichannel analysis [Huppi, 1998, Nakajima, 1997]
Linear Registration: Inter-subject
Idea: Measure mismatch of alignment of two subjects by counting the number of voxel labels that don't match
Method: Multiresolution alignment using XOR function [Warfield, 1998]
Parallelization: First data is resampled, then misalignment is used to calculate registration. MPI
Applications: Initial alignment for template driven segmentation [Warfield, 1996]
Nonlinear Registration
Idea: Use rubbersheet transform to align two data sets from different subjects
Method: Multiresolution approach with fast local similarity measurement, and a simplified regularization model
Parallelization: Low pass filter, upsampling, downsampling, arithmetic operations, solve systems of equations. SMP
Applications: Template driven segmentation [Warfield, 1996]
Visualization: Surface Model Generation
Idea: Generate highly optimized triangle surface models
Method: Pipeline of marching cubes [Lorensen, 1987], triangle reduction [Schroeder, 1992], and triangle smoothing [Taubin, 1995]
Parallelization: Distributed computation of triangle models for each structure of a data set (up to 300). LSF
Applications: Visualization for surgical applications and for presentation purposes [Ozlen, 1998], [Chabrerie, 1998], [Chabrerie, 1998a], [Kikinis, 1996]
Visualization: Volume Rendering
Idea: Direct visualization of volume data without prior processing
Method: Shear warp algorithm [Ylä-Jääski, 1997], [Lacroute, 1994], [Saiviroonporn, 1998]
Parallelization: Render subvolumes separately. SMP, MPI
Applications: Visualize data before segmentation, interactive editing
Source: R. Kikinis, Brigham and Women's Hospital and Harvard Medical School
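The intra-subject registration entry describes building a joint histogram from per-chunk histograms and then computing joint entropy with a loop over the histogram. The sketch below shows that structure in serial form: each chunk's histogram is independent, so the chunks could be handed to separate workers. Combining the entropies into mutual information is an assumption made here for illustration, and all function names are hypothetical.

```python
import math
from collections import Counter

def joint_histogram(a, b, chunk_size):
    """Joint histogram of two equal-length intensity lists, built chunk by chunk."""
    # Each partial histogram depends only on its own chunk of voxels
    partials = [
        Counter(zip(a[i:i + chunk_size], b[i:i + chunk_size]))
        for i in range(0, len(a), chunk_size)
    ]
    total = Counter()
    for partial in partials:
        total.update(partial)  # merging the partial histograms is a simple sum
    return total

def entropy(counts):
    """Shannon entropy in bits, computed by a loop over histogram bins."""
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mutual_information(a, b, chunk_size=4):
    """H(a) + H(b) - H(a, b); higher values indicate better intensity agreement."""
    joint = joint_histogram(a, b, chunk_size)
    return entropy(Counter(a)) + entropy(Counter(b)) - entropy(joint)

image_a = [0, 0, 1, 1, 2, 2, 3, 3]
image_b = [0, 0, 1, 1, 2, 2, 3, 3]
print(mutual_information(image_a, image_b))
```

The merged histogram does not depend on how the data was chunked, which is what makes the chunked computation safe to distribute.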
Biomedical Application and Kernels
Kernels | Application | Source | Today
BioCatalysis
Ab Initio Quantum Chemistry | GAMESS | DoD HPCMP TI-03 | TeraOp/s sustained
Quantum Chemistry | GAUSSIAN | www.gaussian.com/ | TeraOp/s sustained
Quantum Mechanics | NWChem | PNNL | TeraOp/s sustained
Quantum and MM
Macromolecular Dynamics | CHARMM | http://yuri.harvard.edu/ | 10 TeraOp/s sustained
Energy Minimization
MonteCarlo Simulation
Molecular Mechanical Force Field | AMBER | http://www.amber.ucsf.edu/ | 10 TeraOp/s sustained
μ-Array 8000 Genes
Clustering | CLUSTALW | http://bimas.dcrt.nih.gov/sw.html | 200 GigaOps/s sustained
Multiple Alignment Phylogenetics
Pattern Matching | NONMEM | http://www.globomaxservice.com/products/nonmem.html | 100 GigaOps/s sustained
Pattern Matching | PHYLIP | http://evolution.genetics.washington.edu/phylip.html | 100 GigaOps/s sustained
Pattern Matching | FASTme | http://www.ncbi.nlm.nih.gov/CBBresearch/Desper/FastME.html | 100 GigaOps/s sustained
Whole Genome Analysis
Sequence Comparison | Needleman-Wunsch | http://www.med.nyu.edu/rcr/rcr/course/sim-sw.html | 100 TeraOps/s sustained
Sequence Comparison | FASTA | http://www.ebi.ac.uk/fasta33/ | 100 TeraOps/s sustained
Sequence Comparison | HMMER | http://hmmer.wustl.edu/ | 100 TeraOps/s sustained
Sequence Comparison | GENSCAN | http://genes.mit.edu/GENSCANinfo.html | 100 TeraOps/s sustained
Systems Biology
Functional Genomics | http://genomics.lbl.gov/~aparkin/Group/Codebase.html
Biological Pathway Analysis
Complex Systems Simulation and Analysis | http://ecell.sourceforge.net/
Partial Differential Equation Solver | http://www.nrcam.uchc.edu/
Ordinary Differential Equation Solver
Digital Imaging
Marching Cubes | Paper & Pencil for Kernels
Triangle Reduction | Paper & Pencil for Kernels
Triangle Smoothing | Paper & Pencil for Kernels
Noise Reduction | Paper & Pencil for Kernels
Artifact Removal | Paper & Pencil for Kernels
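To make the sequence comparison kernel concrete, here is a minimal sketch of Needleman-Wunsch global alignment scoring, one of the kernels listed above. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration; production codes use substitution matrices and more elaborate gap penalties.

```python
def needleman_wunsch_score(s, t, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of s and t using two rows of the DP table."""
    # Row 0: aligning the empty prefix of s against prefixes of t costs only gaps
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap]  # column 0: prefix of s aligned against nothing
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            up = prev[j] + gap        # gap in t
            left = curr[j - 1] + gap  # gap in s
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]

print(needleman_wunsch_score("ACGT", "AGT"))
```

The inner loop uses only integer comparisons and additions over a table of size len(s) x len(t), which matches the integer-intensive character the assessment assigns to whole genome analysis.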