Bioinformatics and Life Sciences – Standards and Programming for Heterogeneous Architectures

Eric Stahlberg Ph.D. (SAIC-Frederick contractor)

stahlbergea@mail.nih.gov

SIAM Conference on Parallel Processing for Scientific Computing, Savannah, GA, February 16, 2012

Caveats: Content and statements following do not constitute any official position or endorsement, whether stated or implied.

All copyrights of referenced material remain with the original owner.

Context for Heterogeneous Acceleration

• Cancer kills every 55 seconds

• Cancer research utilizes bioinformatics heavily

• Bioinformatics is computationally intensive

• Faster solutions help cancer research move faster

• Faster and better clinical applications help to impact patient lives

• Today’s Goal: Encourage paths to improve bioinformatics applications for cancer research

Three Key Needs

• Faster and better applications

• Better education and preparation in parallel and distributed computing

• Better and faster data handling solutions

SAIC-Frederick, Inc.

• Technical and operations contractor to the U.S. National Cancer Institute

• Federally Funded Research and Development Center for DHHS

• Many technical and operational areas of support for the NCI including bioinformatics

NCI Center for Cancer Research

NCI Center for Cancer Research

BRANCHES

Cell and Cancer Biology

Dermatology

Experimental Immunology

Experimental Transplantation and Immunology

Genetics

HIV and AIDS Malignancy

HIV DRP Host-Virus Interaction

Medical Oncology

Metabolism

Neuro-Oncology

Pediatric Oncology

Radiation Biology

Radiation Oncology

Surgery

Urologic Oncology

Vaccine

LABS

Basic Research Laboratory

Cancer and Developmental Biology Laboratory

Chemical Biology Laboratory

Gene Regulation and Chromosome Biology Laboratory

HIV DRP Retroviral Replication Laboratory

Laboratory of Biochemistry and Molecular Biology

Laboratory of Cancer Biology and Genetics

Laboratory of Cancer Prevention

Laboratory of Cell and Developmental Signaling

Laboratory of Cell Biology

Laboratory of Cellular and Molecular Biology

Laboratory of Cellular Oncology

Laboratory of Experimental Carcinogenesis

Laboratory of Experimental Immunology

Laboratory of Genome Integrity

Laboratory of Human Carcinogenesis

Laboratory of Immune Cell Biology

Laboratory of Metabolism

Laboratory of Molecular Biology

Laboratory of Molecular Immunoregulation

Laboratory of Molecular Pharmacology

Laboratory of Pathology

Laboratory of Population Genetics

Laboratory of Protein Dynamics and Signaling

Laboratory of Receptor Biology and Gene Expression

Laboratory of Tumor Immunology and Biology

Macromolecular Crystallography Laboratory

Molecular Targets Laboratory

Structural Biophysics Laboratory

PROGRAMS

Cancer and Inflammation

CCR Nanobiology

HIV Drug Resistance

Molecular Discovery

Molecular Imaging

Mouse Cancer Genetics

Life Science Application Areas

• Image processing

– 3D imaging

– 2D imaging

• Sequence and protein analysis

– Microarray

– Next Generation Sequence Analysis

– Proteomics

• Simulation

– Molecular interactions and dynamics

– Complex systems biology simulations

• Data mining and analytics

– Statistics

– Graph and cluster analysis

– Population analysis

Dataflow View of Basic Biology

[Figure: dataflow diagram of basic biology. Labels: DNA, RNA, Proteins, Duplicated DNA, Intra-Cellular Functions, Cell Functions, new cell; Transcription, Translation, Mitosis. Legend: Data source, Transform, Process; DNA information flow, Protein feedback loop, Intercellular communication.]

Metabolic Pathways at Higher Resolution

Source: http://web.expasy.org/cgi-bin/pathways/show_thumbnails.pl

Next Generation Sequencing Focus

• Used to understand complex biological systems

• Common types of NGS applications

– ChIPseq

– RNAseq

– miRNAseq

– Epigenetic studies

• Large and growing dataset sizes

• Identify, associate, and compare within individual experiments

• Integrate and compare across experiments

Data Acquisition Costs Plummet

Large Data Challenges

Big Data = Big Challenges

• Volume of available data is growing rapidly

• One run produces hundreds of gigabytes of data*

• Policy issues

– HIPAA, security and protection

– Move it, store it, delete it?

– Validation and clinical liability

• Metadata - reliable secondary value

*Reference: Barski and Zhao, Journal of Cellular Biochemistry, 107:11-18, 2009

Generic NextGen Workflow

General NGS Workflow

Sequence Acquisition → Data Quality Evaluation → Sequence Read Mapping → Analysis of Mapped Reads → Compare Across Samples

• Experimental data is progressively concentrated to become knowledge for decisions

• Per-sample volume of information is reduced as data is analyzed

• Concentrated results are integrated to inform decisions

Example Areas in Next Gen Sequencing

Illustrative Next Gen Sequencing Apps

• Genome Assembly

– Combine small fragments of DNA/RNA into high-confidence composite contigs

– Connect the small pieces into a larger ‘string’ consistent with observed sequences and known biology

• Read Mapping

– Start with a known baseline reference genome

– Map smaller pieces of DNA/RNA to their “correct” location on the reference genome allowing for mismatches, insertions, deletions
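
The read-mapping step above can be illustrated with a minimal sketch. It assumes substitutions only (no insertions or deletions) and a brute-force scan of the reference; real mappers rely on indexed references and gapped alignment. The names `hamming` and `map_read` and the toy sequences are hypothetical and do not come from the talk.

```python
def hamming(a, b):
    """Count positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_read(reference, read, max_mismatches=1):
    """Return (position, mismatches) for each placement of `read` on
    `reference` that stays within the mismatch budget.

    Assumption: substitutions only; insertions and deletions are ignored.
    """
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = hamming(window, read)
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


# Toy data (hypothetical): a short reference and two short reads.
reference = "ACGTACGTTAGC"
print(map_read(reference, "ACGA"))                    # one mismatch allowed
print(map_read(reference, "TTAG", max_mismatches=0))  # exact match only
```

With this toy reference the first read fits two locations equally well, which is why "correct" appears in quotes on the slide: a mapper must also decide how to report or resolve ambiguous placements.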

RNA Sequencing Overview

Source: http://www.bgisequence.com/eu/services/sequencing-services/rna-sequencing/rna-seq/

Challenges in Next Gen Sequencing

Key NGS Challenges

• Transferring large datasets

• Processing huge datasets

• Integrating datasets

• Proliferation of sequencing capabilities

• Growing data volumes too great to store results

• Overcoming ambiguity with algorithmic improvements

• Reproducibility over time

• Translation to clinical application

• Applications are parallel but not system friendly

Contrasting Application Goals

Research vs. Clinical Application

Research Application Aims

• Agile

• Rapid incorporation of new advances

• Ad hoc development process

• Open source

• Documented as needed

• Generally portable

• Limited liability for failures

• Marginal testing

• Reproducibility

• Speed

Clinical Application Aims

• Stable

• Measured incorporation of proven advances

• Development process required

• Licensed and proprietary

• Well documented

• Supportable

• Liability for failure

• Certification of testing

• Reproducibility

• Speed

Three Key Needs

• Faster and better applications

• Better education and preparation in parallel and distributed computing

• Better and faster data handling solutions

Why Better Education in PDC?

• Improved application development

– Higher speed applications

– More robust applications in PDC environments

– More efficient applications

– Better interoperability among PDC technologies

• More effective application use at run time

– Analysts know how to use parallel computing effectively

– Understanding of scalability to better relate problem size to computational resources

– Improved planning of large computational analysis efforts

– Better run-time efficiency
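
One way to make the scalability point concrete is Amdahl's law, a standard model that the slides do not name and that is used here only as an illustration. The sketch below assumes a fixed problem size and a hypothetical job in which 95% of the work parallelizes; the function name `amdahl_speedup` is likewise an assumption.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Ideal speedup under Amdahl's law for a fixed-size problem.

    parallel_fraction: share of the run time that can be parallelized (0..1).
    workers: number of processing elements applied to that share.
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Hypothetical job: 95% of the work parallelizes.
for workers in (1, 4, 16, 64, 256):
    speedup = amdahl_speedup(0.95, workers)
    efficiency = speedup / workers
    print(f"{workers:4d} workers: speedup {speedup:6.2f}, efficiency {efficiency:5.2f}")
```

Under these assumptions the speedup can never exceed 1 / 0.05 = 20 however many workers are added, which is the kind of relation between problem size and resources an analyst needs when planning a large run.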

Changing a Way of Thinking

Education is key

– Teaching parallel and accelerated computing across the CS curriculum

– Innovative NSF funded project

– Incorporating parallel computing into CS, software development, and computational science

– Workshop and website under development

– See www.accel2apps.org for more information

Courses Enhanced

– Computer Literacy

– Intro to Programming

– Data Structures

– Algorithms

– Programming Languages

– Computer Hardware

– Computational Modeling

– Bioinformatics (applications)

– Computational chemistry (applications)

We gratefully acknowledge the support of the National Science Foundation Grant CCF-0915805, SHF:Small:RUI:Collaborative Research: Accelerators to Applications – Supercharging the Undergraduate Computer Science Curriculum

Why Heterogeneous Acceleration?

• Problems are large

– Recent sample runs have taken up to 4 days to compute

– Experiments include many samples

– Data is becoming too large to move

– Instrument systems are becoming smaller and cheaper

– Trend to generate much more data continues

• Technologies are heterogeneous

– Multicore is pervasive and proven

– GPU technology is affordable and available

– FPGAs have a history of use for fast bioinformatics

Parallel Computing in Bioinformatics

• Parallel Computing and bioinformatics

– 182 articles in PubMed since 1995

• GPU and Bioinformatics

– 50 articles dating back to 2007

– 33 articles in CUDA and bioinformatics

• Message Passing and bioinformatics

– 26 articles with ‘message passing’ and bioinformatics

• FPGA and Bioinformatics

– 22 articles in PubMed since 1993

• OpenMP and bioinformatics

– 6 articles in OpenMP and bioinformatics

• OpenCL and bioinformatics

– 3 articles reported

[Figure: bar chart of article counts by topic (OpenCL, OpenMP, FPGA, Message Passing, CUDA, GPU, Parallel Computing); axis from 0 to 200.]

Biowulf – NIH HPC Resource

Biowulf at NIH

GPU cluster available

Weighing the Merits of Standards

Relative Merits of Standards

• Pros

– Stabilize development efforts

– Improve portability of algorithms and applications

– Raise productivity and innovation

– Improve robustness of mission critical applications

– Improve supportability

– Channel creativity and innovation

– Ease education

• Cons

– Take time for community adoption

– Possible performance penalty in some cases

Open Accelerator

• Not to be confused with OpenACC

• Open Accelerator Initiative provides community knowledge base of accelerated computing activity

– Components, performance, literature, and more to come

• Encourages interoperability among technologies and standards

• Registration services support application reproducibility and certification

• Downloads: OpenFPGA draft GenAPI standard

• Visit www.openaccelerator.org

Summary

• Faster and better applications

– Heterogeneous acceleration

– Support standards and interoperability

– Multiple areas exist

• Better education and preparation in parallel and distributed computing

– Improved application development

– Ease of application use

• Better and faster data handling solutions

– Not addressed here

Acknowledgements

• Colleagues at NCI CCR, SAIC-F ABCC

• National Science Foundation CISE

• Colleagues at Wittenberg and Clemson

– Dr. Steven Bogaerts, Dr. Kyle Burke, Dr. Brian Shelburne, Dr. Melissa Smith

• OpenFPGA and OpenAccelerator communities

Contact information:

estahlberg (-at-) gmail.com or stahlbergea (-at-) mail.nih.gov
