
Federation of American Scientists

Biomedical Computing

Requirements for HPCS

Kay Howell, Federation of American Scientists

Higgins, SimQuest, LLC



Biomedical Computing Requirements for HPCS

Examine broad range of application areas

Identify key applications driving computing demand

Identify hardware/software challenges for important classes of applications

Highlight HPCS areas critical to advances in biomedical computing

Identify technology gaps common to biomedical, national security, and other nationally important applications

Demonstrate market potential of HPCS in biomedical computing


Requirements Analysis

System architecture requirements, including:

processors, memory, interconnects, system software, and programming environments

Bandwidth requirements

System robustness

Application development and maintenance

System management, operation, and maintenance


Biomedical Computing Requirements

Application areas: Genome Bioinformatics; Protein Biochemistry (Proteomics); Chemoinformatics – Drug Discovery; Computational Biology; Diagnostic Imaging and Image-Guided Interventions

Description: Genomics, DNA sequencing, microarray technologies and bioinformatics; Includes protein structure and function; Molecular Modeling (MD, QM, MC, MM); Tissue Engineering / Organ Modeling / Systems Biology

Market Projections: Moderate, but growing; Small, but growing very rapidly; Moderate, growing moderately; Small, but growing rapidly; Small, but growing rapidly; Very large, growing moderately

People to interview: Adam Arkin; George Church; Shankar Subramaniam; Jack Dixon; Andrea Sinz; Barry Stoddard; Rick Blevins; Donna Huryn; Wah Chiu; Rick Lathrop; Klaus Schulten; Andrew McCulloch; Brian Athey; Chris Johnson; Bill Lorensen; Ron Kikinis; Michael Vannier

People to interview (federal): Gerard Bouffard, NISC (NIH); Stephen Altschul, NCBI (NIH); Francis Collins (NHGRI); Parag Chitnis, NSF; Yawen Bai, NIH; Nigel Page, DoD; Dan Zaharevitz, NCI; Peter Steinbach, CMM, NIH; Donna Hillmann, NSF; Bret Peterson, NCRR; Sri Kumar, DARPA; Carol Lucas, NSF; Ruth Prachter, AF; Terry Yoo, NLM; Richard Swaja, NIBIB; Larry Clarke, NCI

Companies to talk to: Incyte; Celera; Viaken; Geneva Bioinformatics; Myriad Proteomics; Oxford GlycoSciences; ArQule; Albany Molecular Research; Trega; Structural Bioinformatics; HPC vendors; Physiome Sciences; Entelos; SimQuest; GE Medical Systems; Medtronic; BrainLab


Focus Areas

Resources for managing, analyzing, interpreting data

Extending the time scale & complexity of simulations

Combined classical/quantum chemical simulations

Simulations of large systems

Protein structure prediction

Diagnostic imaging and image-guided interventions


Work Plan

Survey existing information and materials

Interview researchers, sponsors and industrial representatives

Produce preliminary report summarizing findings and distribute for review and comment

Deliver initial requirements one year after project award

Update the report one year later


Biomedical Computing

What we’d like to be able to do…

Static Dynamic Functional

Mouse/Human Genome Correlation

Individual Pharmacogenomic analysis using Gene Expression Arrays

Multi-modal Radiology Image Fusion

Millisecond Structural Biology enabled by Synchrotron X-ray Sources and 900 MHz NMR

Physiologically competent Digital Human Simulations

your additions to the list…


Challenges in Biomedical Computing

Non-linear - current models are simplified linear approximations

System Complexity - need to span multiple scales of biological organization

Time Scales

Exponential increases in data


Biomedical Computing Problems

Complexity and Timescale

Figure (source: ORNL): modeling methods and biological processes plotted by timescale (seconds, 10^-15 to 10^9, extending to geologic and evolutionary timescales) and by size scale (atoms, biopolymers, cells, organisms; each band marked 10^0, 10^3, 10^6).

Items shown: ab initio quantum chemistry; first-principles molecular dynamics; empirical force field molecular dynamics; enzyme mechanisms; protein folding; homology-based protein modeling; DNA replication; cell signalling; organ function; evolutionary processes; ecosystems and epidemiology; electrostatic continuum models; finite element models; discrete automata models


Biomedical Computing Requirements for HPCS

Application Areas


Biological Research Requiring ultra-HPC Resources

Structure of proteasome, ribozyme, ribosome, ATPases, viruses, membrane protein complexes

Whole genome comparison

Combined quantum/classical simulations

Protein folding/threading

Microsecond time-scale simulations

Self-organization and self-assembly

Protein-protein and protein-DNA recognition and assembly

Your additions….


Sequencing and Analysis

Key Attributes:

Integer intensive

Significant research into new kinds of statistical models: hybrids of HMMs and neural nets, dynamic Bayesian nets, factorial HMMs, Boltzmann trees

Clusters typically used

Large scale database infrastructure common

Cluster can be dedicated to single task/local data control

Cycle requirements can be substantial because of data

Systems often in excess of 1 Tflop (range 1-5)


Protein Structure Prediction

Summary of Computational Characteristics

Pipeline processing (network of interrelated tasks)

Clustering: computationally intensive

Clustering algorithms easier to implement using shared memory parallelism due to tight coupling, fine grained, non-uniform work load

Generation of sequence fragments: ANN algorithm may be ideal for this and for clustering purposes

Fragment library written to a database

Compute intensive algorithms are clustering (ANN) and optimization (GA)

Optimization easier to implement using loosely coupled distributed compute cluster


Protein Structure Prediction

Wish List

Hardware/software to map the processing pipeline efficiently

Tools to schedule such a pipeline, checkpoint

Well balanced hardware pipeline from archival storage to the compute elements without bottlenecks

Easily programmable FPGA coprocessor boards to handle the integer and other DSP branches of the pipeline

Hardware and software that can handle truly asynchronous computing as it is the key to scalability (overlapped computation, communication and I/O)

Efficient ANN and GA libraries similar to LAPACK

Efficient skeleton/template codes for common computation/communication/IO (OO jargon patterns) across all platforms

Standardized Framework, libraries, database providing the computational characteristics of the underlying hardware/software environment

Source: G. Chukkapalli, UCSD


Protein Structure Prediction

Future requirements

Combine knowledge based prediction with ab initio methods to improve the prediction accuracy

Execute the whole pipeline on demand in an automated fashion

Generate predicted structures for whole genomes

Protein design: inverse problem

All these are prohibitively expensive at present


Molecular Level Modeling

Biochemical analysis

Protein binding /drug target evaluation

Dynamics of molecules

Very large systems with physics


Computational Biology HPC Challenges

Activity | Current Limit | Problem Size | Complexity | Memory
Ab initio study of enzyme catalysis | 60 heavy atoms | 250 heavy atoms | O(n^3) | O(n)
X-ray refinement of large assemblies | 25,000 atoms | 125,000 atoms | O(n^2) | O(n^2)
Large scale protein motion, membrane transport | 200 residues | 1000 residues | O(n log n) | O(n)
Flexible docking of chemical databases | 3000 compounds | 1,000,000 compounds | O(n) | O(n)
Phylogenetic mapping | 150 sequences, 10,000 bases | 200 sequences, 1,000,000 bases | O(n^3) | O(n x m)
RNA 3-D conformations | 100 residues | 1000's of residues | O(n log n) | O(n)

Source: S. Burke, NIH
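The complexity column implies how much more computation each target problem size needs compared with the current limit. The following is a minimal sketch of that arithmetic, assuming cost grows exactly as the stated order with no lower-order terms (an idealization). The function name `cost_ratio` and the row labels are illustrative, and the phylogenetic mapping row is left out because it depends on two size parameters.

```python
import math

def cost_ratio(current_n, target_n, order):
    """Ratio of idealized cost at target_n to cost at current_n.

    order is one of "n", "n^2", "n^3", "nlogn". Constant factors cancel
    in the ratio, so only the growth order matters here.
    """
    scale = {
        "n": lambda n: n,
        "n^2": lambda n: n ** 2,
        "n^3": lambda n: n ** 3,
        "nlogn": lambda n: n * math.log(n),
    }[order]
    return scale(target_n) / scale(current_n)

# Rows taken from the table above (single-size-parameter rows only).
rows = [
    ("Ab initio enzyme catalysis", 60, 250, "n^3"),
    ("X-ray refinement", 25_000, 125_000, "n^2"),
    ("Protein motion, membrane transport", 200, 1000, "nlogn"),
    ("Flexible docking", 3000, 1_000_000, "n"),
]
for name, current, target, order in rows:
    print(f"{name}: about {cost_ratio(current, target, order):.1f}x more work")
```

Under these assumptions the X-ray refinement target needs 25 times the work of the current limit, and the ab initio target roughly 72 times, even though the atom count grows only about fourfold.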

Biological Computing Assessment

(Assume 10^5 seconds to finish computation)

Attribute | BioCatalysis: Mn-Salen (QM, float) | Enzymes: ras (QM/MM, float) | µ-Array: 8000 genes (Clustering, integer) | Multiple Alignment, Phylogenetics (Pattern Matching, integer) | Whole Genome Analysis (Sequence Comparison, integer)
Computational requirements (Ops/sec) | 1 x 10^12 | 10 x 10^12 | 200 x 10^9 | 100 x 10^9 | 100 x 10^12
Memory access patterns | Random | Partitioned - Random | Sequential | Random | Sequential or Random
I/O performance | Moderate Bandwidth | Communication | NA | Memory Bandwidth | Memory Bandwidth
Compiler speed | Optimization | Optimization | Optimization | Optimization | Critical for FPGA
O/S speed and stability | MTBF | Processor Scale | Processor Scale | Processor Scale | Support for new architectures
Platform porting strategies/experiences | Runs on Many CPU Platforms | Most Scalar and Many Parallel | | All CPUs (ev7 opt) | FPGA
Performance across multiple architectures | Scalar & Parallel | Spatial Decomp. Parallel | Scalar & Parallel | Scalar | Scalar, Parallel, Vector, FPGA
Code size (Lines) | 300,000 | 400,000 | 3,000 | 5,000 | 2,000
Key algorithms and improvements | Direct, Parallel, Vector? | Linear Scaling, Parallel | Parallel, MHz | Needs Parallelization | FPGA

Source: S. Burke, NIH


Data Management

Data management issues will be critically important

- Volume of biological data is estimated to be doubling every 6 months

- GenBank grew from 680,338 base pairs in 1982 to 22 billion base pairs in 2002 (compared to 13.5 billion base pairs as of August 2001)

- Rate of data acquisition 100X higher than originally anticipated due to improved sequencing technology and methods

Redundancies and database asynchrony are increasing - database-to-database comparisons are required for analysis and validation

To look at long-range patterns of expression, syntenic regions on the order of tens of megabases become reasonable lengths for consideration

What other data issues should be highlighted?
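The growth figures above can be put on one scale by converting them to doubling times. This is a minimal sketch that assumes smooth exponential growth between the two quoted GenBank endpoints. The helper names `doubling_time_years` and `growth_factor` are illustrative and do not come from the slides.

```python
import math

def doubling_time_years(start_value, end_value, years):
    """Average doubling time implied by exponential growth between two points."""
    doublings = math.log2(end_value / start_value)
    return years / doublings

def growth_factor(years, doubling_months=6):
    """Growth multiple after `years` if data doubles every `doubling_months`."""
    return 2 ** (years * 12 / doubling_months)

# GenBank figures quoted above: 680,338 bp (1982) to 22 billion bp (2002).
genbank = doubling_time_years(680_338, 22e9, 2002 - 1982)
print(f"GenBank 1982-2002 average doubling time: {genbank * 12:.1f} months")

# What a sustained 6-month doubling would mean for capacity planning.
print(f"Growth over 5 years at 6-month doubling: {growth_factor(5):.0f}x")
```

With these inputs the 20-year GenBank average comes out near 16 months per doubling, slower than the 6-month estimate quoted for current growth. A sustained 6-month doubling would multiply data volume by 1024 in five years.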


Data Management Issues

New Types of Data Support to extend existing RDBMS:

Sequences and Strings

Trees and Clusters

Networks and Pathways

Deep Images

3D Models and Shapes

Molecules and Coordinate Structures

Hierarchical Models and Systems Descriptions

Time Series and Sets

Probabilities and Confidence Factors

Visualizations

Source: Davidson, Bristol-Myers Squibb Pharm. Res. Institute


Systems Biology – Modeling the Cellular System

Combine cell signaling, gene regulatory and metabolic networks to simulate cell behavior

Hybrid information & physics based model

Integrating Computational/Experimental Data at all levels

Modeling of network connectivity (sets of reactions: proteins, small molecules, stochastic, MD)

Difficult to handle computationally

importance of spatial location within the cell

instability associated with reactions between small numbers of molecular species

combinatorial explosion of large numbers of different species

>Petaflop problem


Systems Biology

Need to simulate gene expression, metabolism and signal transduction for single and multiple cells

Algorithms need to be designed specifically for biological research: a parameter optimizer needs to find as many local minima as possible, including the global minimum, because there are multiple possible solutions of which only one is actually used

Must be able to simulate both proteins at high concentration, which can be described by differential equations, and proteins at low concentration, which need to be handled by stochastic process simulation

Stochastic methods are being used (STOCHSIM and Gillespie algorithm)

individual molecules represented rather than concentrations of molecular species; Monte Carlo methods are used to predict interactions

rate equations are replaced by individual reaction probabilities
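The stochastic approach described above can be sketched with the Gillespie direct method, in which molecule counts are tracked individually and each reaction fires with a probability proportional to its propensity. This is a minimal sketch, not the implementation used by any tool named on the slide. The birth-death protein model and the rate constants `K_PROD` and `K_DEG` are assumptions chosen only for illustration.

```python
import math
import random

def gillespie(x0, propensities, updates, t_end, rng):
    """Gillespie direct method.

    x0: initial molecule counts (list of ints)
    propensities: function(state) -> list of reaction propensities
    updates: one state-change vector per reaction
    Returns a list of (time, state) pairs, starting with the initial state.
    """
    t, x = 0.0, list(x0)
    trajectory = [(t, tuple(x))]
    while True:
        a = propensities(x)
        a_total = sum(a)
        if a_total <= 0.0:
            break  # no reaction can fire
        # Waiting time to the next event is exponential with rate a_total.
        t += -math.log(1.0 - rng.random()) / a_total
        if t > t_end:
            break
        # Reaction j fires with probability a[j] / a_total.
        threshold = rng.random() * a_total
        cumulative = 0.0
        for j, a_j in enumerate(a):
            cumulative += a_j
            if threshold < cumulative:
                break
        x = [xi + d for xi, d in zip(x, updates[j])]
        trajectory.append((t, tuple(x)))
    return trajectory

# Hypothetical model: protein made at constant rate, degraded per molecule.
K_PROD = 10.0   # molecules per unit time (assumed)
K_DEG = 0.1     # per molecule per unit time (assumed)

rng = random.Random(0)
traj = gillespie(
    x0=[0],
    propensities=lambda x: [K_PROD, K_DEG * x[0]],
    updates=[[+1], [-1]],
    t_end=50.0,
    rng=rng,
)
print(f"{len(traj) - 1} reaction events, final count {traj[-1][1][0]}")
```

For this model the deterministic rate equation predicts a steady-state count of K_PROD / K_DEG = 100. The stochastic trajectory fluctuates around that value, which is the behavior that matters when molecule numbers are small.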


Digital Imaging

Used for monitoring of disease progression, diagnosis, preoperative planning and intraoperative guidance and monitoring

Algorithms are computationally demanding

Key issues are segmentation and registration

Signal processing techniques are used to enhance features and generate the desired segmentation

Results of the segmentation are aligned to other data acquisitions and to the actual patient during procedures

Results of the segmentation are visualized using different rendering methods


Digital Imaging

Each technique below is described by four fields: Idea, Method, Parallelization, Applications.

Feature Enhancement
Idea: Modulate selected characteristics
Method: Spatial and frequency domain filtering: convolutions
Parallelization: SMP and MPI style for Fourier transforms [Frigo, 1997] and convolutions
Applications: Noise reduction [Gerig, 1992], removal of partial volume artefacts [Westin, 1997]

Classification: k-NN, Parzen window
Idea: Classify an unknown voxel based on prototypes
Method: Nonparametric supervised statistical classification [Duda, 1973], [Cover, 1967], [Cover, 1968], [Clarke, 1993], [Warfield, 1996], [Friedman, 1975]
Parallelization: Each voxel treated separately [Friedman, 1975]. SMP for core, MPI
Applications: Classification in different areas of the body [Kikinis, 1992], [Huppi, 1998], [Warfield, 1995], [Warfield, 1996]

Classification: EM
Idea: Increase robustness of statistical approach through adaptive behaviour
Method: Iterates between statistical classification and intensity prediction/correction [Wells, 1996]
Parallelization: Classification step as in k-NN; intensity correction [Wells, 1986]: convolutions. SMP, NUMA
Applications: Classification primarily of brain MRI [Morocz, 1995], [Kikinis, 1997], [Iosifescu, 1997]

Linear Registration: Intra-subject
Idea: Use inherent contrast similarity to align images
Method: Requires entropy and joint entropy computation [Wells, 1996a]
Parallelization: Joint histogram computation, parallelized by computing the histogram of data chunks. Joint entropy calculated by a loop over the histogram. SMP, MPI
Applications: Registration of slices for multichannel analysis [Huppi, 1998], [Nakajima, 1997]

Linear Registration: Inter-subject
Idea: Measure mismatch of alignment of two subjects by counting the number of voxel labels that don't match
Method: Multiresolution alignment using XOR function [Warfield, 1998]
Parallelization: First data is resampled, then misalignment is used to calculate registration. MPI
Applications: Initial alignment for template driven segmentation [Warfield, 1996]

Nonlinear Registration
Idea: Use rubbersheet transform to align two data sets from different subjects
Method: Multiresolution approach with fast local similarity measurement, and a simplified regularization model
Parallelization: Low pass filter, upsampling, downsampling, arithmetic operations, solve systems of equations. SMP
Applications: Template driven segmentation [Warfield, 1996]

Visualization: Surface Model Generation
Idea: Generate highly optimized triangle surface models
Method: Pipeline of marching cubes [Lorensen, 1987], triangle reduction [Schroeder, 1992], and triangle smoothing [Taubin, 1995]
Parallelization: Distributed computation of triangle models for each structure of a data set (up to 300). LSF
Applications: Visualization for surgical applications and for presentation purposes [Ozlen, 1998], [Chabrerie, 1998], [Chabrerie, 1998a], [Kikinis, 1996]

Visualization: Volume Rendering
Idea: Direct visualization of volume data without prior processing
Method: Shear warp algorithm [Ylä-Jääski, 1997], [Lacroute, 1994], [Saiviroonporn, 1998]
Parallelization: Render subvolumes separately. SMP, MPI
Applications: Visualize data before segmentation, interactive editing

Source: R. Kikinis, Brigham and Women's Hospital and Harvard Medical School
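The intra-subject registration entry describes computing a joint histogram over data chunks and then looping over the histogram to get the joint entropy. A minimal single-process sketch of that decomposition follows, assuming the two images are already resampled to the same grid and flattened to equal-length lists of integer intensities. The function names are illustrative. A real implementation would hand each chunk to a separate thread or MPI rank instead of looping.

```python
import math
from collections import Counter

def joint_histogram(chunk_a, chunk_b):
    """Count co-occurring intensity pairs for one chunk of two aligned images."""
    return Counter(zip(chunk_a, chunk_b))

def joint_entropy(hist):
    """Shannon joint entropy in bits, computed by one loop over the histogram."""
    total = sum(hist.values())
    return -sum((c / total) * math.log2(c / total) for c in hist.values())

def chunked_joint_entropy(image_a, image_b, n_chunks):
    """Histogram each chunk independently, merge the counts, then take entropy."""
    if not image_a or len(image_a) != len(image_b):
        raise ValueError("images must be non-empty and the same length")
    size = -(-len(image_a) // n_chunks)  # ceiling division
    merged = Counter()
    for start in range(0, len(image_a), size):
        # Chunks share no state, so this loop body is the parallel unit of work.
        merged.update(joint_histogram(image_a[start:start + size],
                                      image_b[start:start + size]))
    return joint_entropy(merged)

# Identical images: the joint distribution collapses onto the diagonal.
aligned = chunked_joint_entropy([0, 1, 0, 1], [0, 1, 0, 1], n_chunks=2)
# Unrelated images: every intensity pair is equally likely.
unrelated = chunked_joint_entropy([0, 0, 1, 1], [0, 1, 0, 1], n_chunks=2)
print(aligned, unrelated)
```

Merging per-chunk counts gives the same histogram as a single pass, so the number of chunks changes only how the work is divided and not the result. Lower joint entropy indicates better alignment, which is why the aligned pair scores below the unrelated pair.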


Biomedical Application and Kernels

Area | Kernel | Application | Source | Today
BioCatalysis | Ab Initio Quantum Chemistry | GAMESS | DoD HPCMP TI-03 | TeraOp/s sustained
 | Quantum Chemistry | GAUSSIAN | www.gaussian.com/ | TeraOp/s sustained
 | Quantum Mechanics | NWChem | PNNL | TeraOp/s sustained
Quantum and MM | Macromolecular Dynamics | CHARMM | http://yuri.harvard.edu/ | 10 TeraOp/s sustained
 | Energy Minimization | | |
 | Monte Carlo Simulation | | |
 | Molecular Mechanical Field Force | AMBER | http://www.amber.ucsf.edu/ | 10 TeraOp/s sustained
µ-Array 8000 Genes | Clustering | CLUSTALW | http://bimas.dcrt.nih.gov/sw.html | 200 GigaOps/s sustained
Multiple Alignment Phylogenetics | Pattern Matching | NONMEM | http://www.globomaxservice.com/products/nonmem.html | 100 GigaOps/s sustained
 | Pattern Matching | PHYLIP | http://evolution.genetics.washington.edu/phylip.html | 100 GigaOps/s sustained
 | Pattern Matching | FastME | http://www.ncbi.nlm.nih.gov/CBBresearch/Desper/FastME.html | 100 GigaOps/s sustained
Whole Genome Analysis | Sequence Comparison | Needleman-Wunsch | http://www.med.nyu.edu/rcr/rcr/course/sim-sw.html | 100 TeraOps/s sustained
 | Sequence Comparison | FASTA | http://www.ebi.ac.uk/fasta33/ | 100 TeraOps/s sustained
 | Sequence Comparison | HMMER | http://hmmer.wustl.edu/ | 100 TeraOps/s sustained
 | Sequence Comparison | GENSCAN | http://genes.mit.edu/GENSCANinfo.html | 100 TeraOps/s sustained
Systems Biology | Functional Genomics | | http://genomics.lbl.gov/~aparkin/Group/Codebase.html |
 | Biological Pathway Analysis | | |
 | Complex Systems Simulation and Analysis | | http://ecell.sourceforge.net/ |
 | Partial Differential Equation Solver | | http://www.nrcam.uchc.edu/ |
 | Ordinary Differential Equation Solver | | |
Digital Imaging | Marching Cubes | | Paper & Pencil for Kernels |
 | Triangle Reduction | | Paper & Pencil for Kernels |
 | Triangle Smoothing | | Paper & Pencil for Kernels |
 | Noise Reduction | | Paper & Pencil for Kernels |
 | Artifact Removal | | Paper & Pencil for Kernels |
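Of the kernels listed, Needleman-Wunsch is compact enough to show in full. It is an integer-only dynamic program, which fits the integer-intensive profile given for sequence comparison. The following is a minimal sketch that returns only the optimal global alignment score. The scoring values (match +1, mismatch -1, gap -1) are assumptions for illustration, whereas production tools typically use substitution matrices and affine gap penalties.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of sequences a and b.

    Uses O(len(a) * len(b)) time and keeps only two rows of the
    dynamic-programming table, so memory is O(len(b)).
    """
    # Row 0: aligning the empty prefix of a against prefixes of b costs gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]

print(needleman_wunsch_score("ACGT", "AGT"))
```

Every cell depends only on its left, upper, and upper-left neighbors, so cells on the same anti-diagonal are independent. That regular integer dataflow is one reason this kind of kernel is a candidate for the FPGA and vector approaches mentioned in the assessment table.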
