
Genomics Medicine Solution on Power 8

Kathy Tzeng, PhD
Worldwide Technical Lead, Healthcare & Life Sciences
IBM Systems Group
November 13, 2015

GENOMIC MEDICINE: from Sequencing to Personalized Healthcare

NHGRI, a branch of NIH, has defined 5 steps for genomic medicine. (source: E. Green et al., Nature 470, 204–213)

Next Generation Sequencing (or other ingestion)

The focus is on very large data generation, mainly from $1000 whole genome sequencing, and on data processing and reduction. Includes human, plant, animal, and microbiome genomics.

Translational Research/Early Discovery

The focus is on data integration, including genomic data, and on the analytics required to identify biomarkers, understand disease mechanisms, and identify new medical treatments.

Personalized Healthcare/Clinical Genomics

The focus is on delivering genomic medicine to patients to improve outcomes by associating patients with known genomics-specific treatments.

Computational Challenges

• Feature combinatorics
• Large file sizes
• Large population sizes
• Unstructured data types

A Computationally Challenging Problem

Breakthroughs in Genomic Medicine require quantifying associations between known population traits, environmental factors, and biological responses

Predictive Response Function

Model of associations between features and responses as a function of time t

F(t): Known Traits or Environmental Features. Quantities describing population traits or environmental factors at time t.

R(t): Measured Biological Response. Quantities describing response events for an organism at time t.
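One simple way to quantify an association between a feature series F(t) and a response series R(t) is a correlation coefficient. The sketch below is illustrative only: the function name `pearson` and the sample values are assumptions, not data from this presentation, and a real predictive response function would be far richer than a single linear correlation.

```python
import math


def pearson(features, responses):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(features) != len(responses) or len(features) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(features)
    mean_f = sum(features) / n
    mean_r = sum(responses) / n
    cov = sum((f - mean_f) * (r - mean_r) for f, r in zip(features, responses))
    var_f = sum((f - mean_f) ** 2 for f in features)
    var_r = sum((r - mean_r) ** 2 for r in responses)
    if var_f == 0 or var_r == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_f * var_r)


# Hypothetical values: F(t) is an exposure level and R(t) a measured response,
# both sampled at the same five time points.
F = [0.0, 1.0, 2.0, 3.0, 4.0]
R = [1.0, 3.1, 4.9, 7.2, 9.0]
print(f"association between F(t) and R(t): {pearson(F, R):.3f}")
```

A value near +1 or -1 suggests a strong linear association; a value near 0 suggests none, although nonlinear relationships can still be present.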

Key Capabilities

Leading biomedical research organizations are asking for technology capabilities that will give them a low-cost solution to accelerate scientific discovery in Genomic Medicine

Flexible, scalable, and low-cost high-performance compute and storage solutions capable of efficiently processing rapidly growing quantities of genomic and other types of complex life science data

Seamless integration of complex life science data types on a common analytical platform

Rapid extraction and analysis of unstructured language content from very large volumes of clinical and scientific documents

Metadata collection capabilities providing detailed audit trails as source data are transformed into analytical results

Tools for scientific collaboration that enable data and workload sharing to cross organizations and geographic boundaries in a secure environment that ensures data privacy

A Foundation for Computational Science

IBM’s Reference Architecture for Genomics supports ‘big data’ computational research on a foundation of HPC compute, storage, and workload management capabilities

Researchers

Research Applications
• Performance optimization for open source and commercial analytics applications

‘Big Data’
• ‘Big’ Data Warehouse
• Text Analytics for the conversion of natural language concepts into structured data entities (Apache UIMA, IBM System T)
• IBM Research, IBM Watson, IBM Business Partners

Foundation

Resource & Workload Management
• Workload Orchestration, Monitoring & Metadata Capture
• Resource Management enabling Private Clouds
• Intelligent resource allocation, sharing, and monitoring across parallel HPC workloads (IBM Platform Computing, IBM Business Partners)

Data Management: File System & Storage | ILM
• Low-cost, easy-access storage & archiving of data and metadata across heterogeneous environments (IBM Spectrum Scale / Elastic Storage Server)

Heterogeneous Compute Resources
• Heterogeneous, flexible server infrastructure (IBM OpenPOWER servers)

Data management and analytics tools can be accessed and shared across heterogeneous systems in on-premise and cloud environments

IBM Systems Facilitate Scientific Collaboration

External Collaborators (Heterogeneous Environments)

Local Data Center

Virtual Private Clouds

Public Cloud Users, Private Cloud Users, On-Premise Users

On-Premise Cluster

Encrypted VPN

‘Big Data’ foundation enables data access, data management, and HPC workload orchestration across heterogeneous on-premise, private cloud, public cloud, and hybrid cloud environments

HPC Network

Data Management: File System / Storage ILM

WAN

Workload Burst

Applications

10GbE or InfiniBand

1/10 GbE

Workload Orchestration with Metadata Capture

‘Big’ Data Warehouse

AppCenter (PAC, Galaxy, DataBiology, Lab7)

Genomics, Translational, Personalized Healthcare

Orchestrator (ASC/EGO, LSF, Symphony, PPM)

HPC Cluster, Big Data, Spark Cluster, OpenStack, Docker

Datahub (Spectrum Scale, Zato, Nirvana)

SSD/Flash, FC/IB Attached, Low-cost Storage, HA/DR Storage, Cloud Storage

Application & Workflow, File & Database, Visualization, System & LogAcc

IBM Reference Architecture for Genomics
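As a rough illustration of what workload orchestration through the Orchestrator layer can look like with LSF, the sketch below builds `bsub` submissions for a chain of dependent jobs. The options used (`-J` job name, `-n` slots, `-o` output file with `%J` for the job ID, `-w` dependency expression) are standard `bsub` options; the helper name `bsub_command`, the step names, and the shell scripts are hypothetical placeholders, not part of the reference architecture.

```python
import shlex


def bsub_command(name, command, cores=1, after=None):
    """Build an LSF `bsub` argument list for one pipeline step.

    name    -- job name (bsub -J)
    command -- the command line the job runs
    cores   -- number of job slots requested (bsub -n)
    after   -- name of a job that must finish successfully first (bsub -w)
    """
    argv = ["bsub", "-J", name, "-n", str(cores), "-o", f"{name}.%J.out"]
    if after is not None:
        argv += ["-w", f"done({after})"]
    return argv + shlex.split(command)


# Hypothetical three-step chain; the script names are placeholders.
steps = [
    ("align", "./align.sh sample1", 16),
    ("sort_dedup", "./sort_dedup.sh sample1", 8),
    ("call_variants", "./call_variants.sh sample1", 16),
]

previous = None
for name, command, cores in steps:
    # Print the submission instead of running it, so the sketch works without LSF.
    print(shlex.join(bsub_command(name, command, cores, after=previous)))
    previous = name
```

Each step waits for the previous one through the `done(...)` dependency, which is one way a scheduler can express a linear workflow; workflow managers layered on top of LSF typically handle this bookkeeping automatically.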


Genomic Analysis System Architecture

IBM Genomics Reference Architecture

The IBM Reference Architecture is an ecosystem of data management and analytics tools developed by IBM and industry-leading commercial and open source software providers


BioBuilds – Open Source Bioinformatics

• Turn-key: Pre-built binaries and complete build scripts enable easy deployment

• Optimized: POWER8 binaries provide the best performance for your hardware

• Ready for the Clinic: A single source for tools streamlining verification and audit

• Long Term Support: Community sponsorship and support contracts ensure ongoing support for tools

http://biobuilds.org/

Open Source bioinformatics tools for research, commercial, and regulated environments.

A Portfolio of Open Source Applications on Power

BioBuilds 2015.04

• ALLPATHS-LG • Bedtools • Bfast • Bioconductor • BLAST (NCBI) • Bowtie • Bowtie2 • BWA • Cufflinks • FASTA • FastQC • HMMER • HTSeq • IGV • ISAAC • iRODS • Mothur • Numpy • PICARD • PLINK • Python • SAMTools • SHRiMP • SOAP3-DP • SOAPaligner • SOAPDenovo • SQLite • Tabix • TMAP • TopHat • Trinity • Velvet/Oases • R • RNAStar/STAR

BioBuilds Roadmap

2015.11 (SC15)

• bamtools • BarraCUDA • BioPython • ClustalW • Conda • EMBOSS • GraphViz • Htslib • Pysam • RSEM • Sratoolkit • Variant_tools

2016

• BioPerl • CEGMA • Celera Assembler • DIALIGN-TX • FASTX-Toolkit • GenomicConsensus • Infernal • T-Coffee • MUSCLE • Sailfish • SIFT • STAR-fusion • And more

https://www.broadinstitute.org/gatk/blog?id=4833

Optimization of GATK from Broad Institute

IBM works with genomics leaders to improve performance of analytical workflows like GATK on IBM Power 8 Systems

Note*: http://library.wolfram.com/infocenter/Conferences/9045/Intel_LifeSciences_Personalized_Medicine_Wolfram%202014_Paolo%20Narvaez.pdf

Broad Institute Best Practice Performance

Input Dataset: G15512.HCC1954.1, coverage: 65x

Both IBM and Intel solutions: # of Machines = 1; # of cores/Machine: IBM: 16, Intel: 24

IBM Solution: IBM 3.32 GHz 8335-GTA with SMT=8, ESS GL4

IBM Solution performance highlights:
• 65X Whole Human Genome analysis done in about 21 hours
• 150X Whole Exome analysis done in 2.55 hours

Step                   Intel Runtime* (hours)   IBM Runtime (hours)
BWA                    7                        4.26
Samtools               5                        2.08
MarkDuplicates         11                       6.86
RealignTargets         1                        0.29
IndelRealigner         6.5                      0.77
BaseRecalibrator       1.3                      1.49
PrintReads+Index       12.3                     2.55
Pre-processing Total   44                       18.3
HaplotypeCaller        (not given)              2.64
Total                  (not given)              20.94
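The per-step ratios implied by the runtime table can be checked with a few lines of arithmetic. This is a minimal sketch using only the numbers shown above; the unit is assumed to be hours, which is consistent with the "about 21 hours" highlight but is not stated explicitly in the table, and the variable names are illustrative.

```python
# Runtimes copied from the table above as (Intel, IBM); unit assumed to be hours.
runtimes = {
    "BWA": (7, 4.26),
    "Samtools": (5, 2.08),
    "MarkDuplicates": (11, 6.86),
    "RealignTargets": (1, 0.29),
    "IndelRealigner": (6.5, 0.77),
    "BaseRecalibrator": (1.3, 1.49),
    "PrintReads+Index": (12.3, 2.55),
}
HAPLOTYPE_CALLER_IBM = 2.64  # only the IBM value is given on the slide


def speedup(intel, ibm):
    """Ratio of Intel runtime to IBM runtime; above 1 means the IBM run was faster."""
    return intel / ibm


for step, (intel, ibm) in runtimes.items():
    print(f"{step:18s} {speedup(intel, ibm):5.2f}x")

intel_preprocessing = sum(intel for intel, _ in runtimes.values())
ibm_preprocessing = sum(ibm for _, ibm in runtimes.values())
ibm_total = ibm_preprocessing + HAPLOTYPE_CALLER_IBM

print(f"pre-processing: Intel {intel_preprocessing:.1f}, IBM {ibm_preprocessing:.2f}")
print(f"IBM total including HaplotypeCaller: {ibm_total:.2f}")
```

The IBM column sums to the totals shown on the slide (18.3 and 20.94). The Intel column sums to 44.1, which the slide rounds to 44. A ratio below 1, as for BaseRecalibrator, means the IBM run took longer for that step.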

+ The execution times were measured using the following hardware and software configuration (the speed-up with FPGA was estimated based on the throughput of the compression accelerator).
Hardware: POWER8 (10-core chip x 2, SMT8 @ 3.5 GHz, 510 GB memory) with GPFS, Ubuntu 14.04.
Software versions: Samtools (http://www.htslib.org/), built using the source obtained from git on 2015-08-25. Our tool was prototyped by modifying the same version of Samtools. Picard (http://broadinstitute.github.io/picard/), version 1.138. Java: java version "1.8.0", Java(TM) SE Runtime Environment SR1 FP10.
Input file: a SAM file of an unsorted and unmarked version of an NA12878 WGS chr20 file.

++ The order of sorted reads and the reads marked as duplicates were compared between samtools/picard and ours.

+++ https://www.broadinstitute.org/gatk/guide/best-practices

Significant Speed-up with Multi-threaded S/W and FPGA+ for NA12878 WGS chr20

The Same Results++

Pre-processing Part of a Typical Workflow+++
• Map to Reference
• Sort & Mark Duplicates
• Indel Realignment
• Base Recalibration

Samtools/Picard

Acceleration of genomics pipeline on POWER8
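For readers unfamiliar with the pre-processing steps named on this slide, the sketch below assembles typical open-source command lines for the first two (mapping, then sorting and duplicate marking). It only builds the command strings and does not run them. The file names, the `picard.jar` path, and the helper name `preprocessing_commands` are assumptions for illustration; exact command-line syntax varies between tool versions, and this is not the accelerated IBM prototype described above. Indel realignment and base recalibration (GATK steps) are omitted.

```python
import shlex


def preprocessing_commands(ref, fq1, fq2, sample, threads=16):
    """Return shell command strings for mapping, sorting and duplicate marking."""
    q = shlex.quote
    sorted_bam = f"{sample}.sorted.bam"
    dedup_bam = f"{sample}.dedup.bam"
    metrics = f"{sample}.dup_metrics.txt"
    return [
        # Map paired-end reads to the reference and coordinate-sort the result.
        f"bwa mem -t {threads} {q(ref)} {q(fq1)} {q(fq2)} | "
        f"samtools sort -@ {threads} -o {q(sorted_bam)} -",
        # Mark duplicate reads with Picard (classic KEY=VALUE argument style).
        f"java -jar picard.jar MarkDuplicates I={q(sorted_bam)} "
        f"O={q(dedup_bam)} M={q(metrics)}",
    ]


# Hypothetical inputs; replace with real paths before running anything.
for cmd in preprocessing_commands("ref.fa", "s1_R1.fastq", "s1_R2.fastq", "s1"):
    print(cmd)
```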

Application: Illumina’s Casava V. 1.8 (BCL to FASTQ)
Data Set: 8 lanes of HiSeq data

Without cache library: Elapsed Time = 1730 min
With cache library: Elapsed Time = 107 min
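The two elapsed times translate into a speed-up factor and a runtime reduction as follows. This is plain arithmetic on the figures shown for the Casava run; the variable names are illustrative.

```python
# Elapsed times in minutes, as reported for the Casava BCL-to-FASTQ run.
without_cache_min = 1730
with_cache_min = 107

speedup = without_cache_min / with_cache_min
reduction = 1 - with_cache_min / without_cache_min

print(f"speed-up: {speedup:.1f}x")            # prints "speed-up: 16.2x"
print(f"runtime reduction: {reduction:.1%}")  # prints "runtime reduction: 93.8%"
```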

IO Cache Library to Optimize Performance of Genomics Application

IBM uses a File Cache Library to improve I/O Performance and reduce workflow runtimes


IBM scales genomic analysis from the desktop to the enterprise using IBM ESS/Spectrum Scale

Speed of the file system matters

Genomic Workload Optimization

[Chart: analysis of 150x human genome WEX (NA12878); steps: BWA, Samtools; file systems compared: GPFS, Local, NFS]

• Spectrum Scale scalable I/O performance significantly benefits various NGS workloads.

• Spectrum Scale also provides seamless capacity expansion, improved enterprise wide efficiency, commercial-grade reliability, business continuity and the flexibility of supporting a wide variety of platforms.

Data Compression

Compression Algorithm: gzip on Power 8 with FPGA board (available now)
Compression ratio (lossless): On average 1:3 for fastq files
Speed/throughput: 2.5 GB/s on average (200 GB fastq can be compressed in 80 seconds)

Compression Algorithm: CRAM
Compression ratio (lossless): 1:2 to 1:4 with respect to BAM files, depending on the sequencing depth and other factors (the FASTQ to compressed BAM ratio is 16X)
Speed/throughput: Achieved more than 10 times speed-up using 12 cores (approximately 0.5 GB/min). FPGA acceleration is ongoing.
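The gzip throughput claim can be verified with simple arithmetic. The sketch below assumes that "1:3" means the compressed file is about one third of the original size; the function names are illustrative.

```python
def compression_seconds(size_gb, throughput_gb_per_s):
    """Time to compress a file of size_gb at a sustained throughput."""
    return size_gb / throughput_gb_per_s


def compressed_size_gb(size_gb, ratio):
    """Compressed size, assuming a 1:ratio compression ratio."""
    return size_gb / ratio


# Figures from the gzip-with-FPGA entry: 200 GB of FASTQ at 2.5 GB/s, ratio 1:3.
print(f"time: {compression_seconds(200, 2.5):.0f} s")
print(f"compressed size: {compressed_size_gb(200, 3):.1f} GB")
```

At 2.5 GB/s, 200 GB takes 80 seconds, matching the figure quoted for the FPGA-accelerated gzip.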

• Samtools sort: 1.19-1.25x faster

• Picard MarkDuplicate: 1.42-1.47x faster

• Picard AddOrReplaceReadGroups: 1.89-2.3x faster

• IBM is collaborating with Sanger Institute and EBI on improving compression for genomics data – Samtools, Picard, CRAM

Source: Baker M., Nature Methods 7, 495 - 499 (2010)

Noblis BioVelocity is Developed and Optimized on Power 8

L3 Bioinformatics BALSA on Power 8 with GPU

Power8 3.32 GHz, 2x K40 GPU and Spectrum Scale

Edico Genome


Analyze a Whole Human Genome at 30x coverage in under 30 minutes

BCL → Map/Align → Sort → Dedup → Variant Calling → VCF/GVCF

IBM works with Lab7 to deliver data provenance with performance, reliability and security

IBM Power System Solution with Spectrum Scale and Platform LSF delivers:

• Superior compute infrastructure: superior performance, scalability & maximum throughput
• Outstanding enterprise-grade reliability and security:
  - Reliability, Availability and Serviceability (RAS) features help avoid unplanned downtime
  - IBM Power Security and Compliance (PowerSC™) enables security compliance automation and includes reporting for compliance measurement and audit (HIPAA)
• Total cost of ownership: very affordable compared to like-sized x86 systems

Lab7 ESP

• Comprehensive software platform: combines LIMS and informatics functionalities
• Data provenance: maintains continuous data provenance by:
  - Tracking the history of samples, analyses, and results
  - Providing detailed audit trails
• Sequencing platform flexibility: manages data generated from any sequencing platform

Enterprise Data Management

IBM Power System Solution with Spectrum Scale and Platform LSF delivers:

• Superior compute infrastructure: superior performance, scalability & maximum throughput
• Outstanding enterprise-grade reliability and security:
  - Reliability, Availability and Serviceability (RAS) features help avoid unplanned downtime
  - IBM Power Security and Compliance (PowerSC™) enables security compliance automation and includes reporting for compliance measurement and audit (HIPAA)
• Total cost of ownership: very affordable compared to like-sized x86 systems

3 C’s (Configure, Command, Collaborate)

Ontologies, Annotation

Samples

Comments + Attachments

Roles + Access

Shopping Basket

Social

Scientific

Lifecycle Management, Meta Information

Financial + Resource Mgmt, Task Management

Project Management, Applications, Import, Analysis

Visualization

Infrastructure

Network, Storage

Compute

Configuration, Instruments

Compute and Storage

Softlayer – LSF – GPFS

Transport

DBE Download Manager

S3, SCP, RSync, SFTP, FTP, HTTP

Version Control + Reproducible

Data Provenance

Everything as an app: Scripts, Binaries, Pipelines, Workflow Management, Virtual Machines

Portal, API, Custom Web Apps via API

DBE Multiprot

Email + WF Integration

Identity Management

Databiology for Enterprise Functional Architecture

Databiology for Enterprise: SaaS + customer specific instances

Central hub to manage all ‘omics data and to orchestrate all activities

Functionally rich and oriented around key steps in the R&D life cycle

Insight to Instrument with best in class applications

Easy integration with existing environments

Automatic data provenance and reporting

Cost neutral deployment

Gradual roll-out / Low risk

Data Provenance with Performance, Reliability and Security


tranSMART - Optimized on Power8 and ESS/Spectrum Scale

• tranSMART associates genotypic & phenotypic data for complex analytics
• Watson Explorer extracts insight from scientific literature and data records and provides enrichment to tranSMART’s analysis

https://www.dropbox.com/s/9qw2kr339cl0mie/wats_tran.mp4?dl=0

IBM HPC, hardware, software, and libraries enable faster time to insight in translational medicine, from hours to minutes: tranSMART

Loading ‘TCGA_OV’ into PostgreSQL
Elapsed time, in seconds (5,789,362 records)

Data Type         File Size      Time
Clinical          Up to 0.5 MB   13s
Gene Expression   Up to 32 MB    227s (3min 47s)

tranSMART Analytics using ‘TCGA_OV’

Elapsed time by Analysis Type, in seconds, for single node

Type   Acquire Data   Run Analysis   Total
MAS    157s           55s            212s (3min 32s)
HCA    262s           83s            345s (5min 45s)
PCA    253s           558s           811s (13min 31s)
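The totals in the analytics table are the sum of the two phases, and the split shows where the time goes for each analysis type. The sketch below recomputes the totals and the share spent acquiring data, using only the numbers in the table; the helper name `fmt` is illustrative.

```python
# (acquire data, run analysis, reported total) in seconds, from the table above.
timings = {
    "MAS": (157, 55, 212),
    "HCA": (262, 83, 345),
    "PCA": (253, 558, 811),
}


def fmt(seconds):
    """Format a duration in seconds as 'Xmin Ys'."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}min {secs}s"


for name, (acquire, run, total) in timings.items():
    computed = acquire + run
    share = acquire / computed
    status = "ok" if computed == total else "mismatch"
    print(f"{name}: {computed}s ({fmt(computed)}), "
          f"{share:.0%} spent acquiring data [{status}]")
```

For MAS and HCA most of the elapsed time is data acquisition, while for PCA the analysis itself dominates, which suggests that storage and I/O performance matter most for the first two.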

Results are based on IBM internal testing and optimization of tranSMART using:
• IBM Power S822LC 16-core, 3.32 GHz 8335-GTA with SMT=2, Turbo Mode, 512 GB memory, IBM XL Compiler Suite, IBM ESSL
• IBM Elastic Storage Server Model GL4

Complete loading and analysis in mins instead of hours

tranSMART performance on Power 8 and ESS

Dataset      No. Records
TCGA_OV      5,789,632
Simulation   40,774,968
GSE32583     942,724
GSE13168     1,203,282
GSE1456      3,600,555
GSE15258     4,702,050

Accelerate tranSMART ETL by Power8/Spectrum Scale

[Diagram: tranSMART deployment on Power8. Components: Users (Browser, HTTP); Web Server (Apache2); tranSMART Application Server (Tomcat 7); i2b2 Application Server; PostgreSQL (JDBC); Quartz Job Call; R Analytics Tools; Solr Full Text Index; Gene Patterns; Watson Analytics Server; Spectrum Scale]

tranSMART Power8 Deployment Architecture

Spanning Data Centers in Parallel with a Single Pane of Glass for Clinical and Research Applications on Power 8 and ESS/Spectrum Scale

[Diagram: federated data sources connected over the Internet and VPN. Sources: NIH Data, CDC Data, NLM Data, Lab Results, Imaging Data, Radiology Reports, Microbiology Reports, Nursing Home Records, Claims Data, Electronic Health Record Data, Genomic Data, Accepted Medical Knowledge]

Zato Data Federation Solution for Healthcare and Genomics Data

Genomic Solution Enablement Team

Mission:
• Porting and optimization of Genomics/Translational applications on IBM solutions
• Developing solutions with partners
• Making IBM SW/HW available to software developers

Members:
• Independent Software Vendor (ISV) team
• Toronto Compiler Lab
• Boeblingen Development Lab
• Tokyo Research Lab
• Austin Research Lab

Variant information requires a computationally intensive analysis of raw sequence data across thousands of genomic samples

Workload Challenge #1: ‘Big Data’ Analytics

[Diagram: from high-throughput sequencing to annotated variants]

Sample: Whole Human Genome, 3 billion DNA base pairs @ 30x coverage

• High-Throughput Sequencing: Raw Reads
• Assembly & Alignment: De Novo Assembly (SOAPdenovo, Velvet …) or Reference-Based Mapping (BWA, Bowtie, SOAP …)
• Variant Calling (Picard, GATK, SAMtools, SOAPsnp …): VCF, 100 to 200 MB
• Variant Annotations: Annotation Tools (ANNOVAR, Gene Ontology …)

File sizes shown: ~150 GB (compressed), ~150 GB, 500 MB

Omics Data, Variant Databases, Reference Genomes: TCGA, GEO, dbSNP …

Each human genome can have a few million variants

Processing time per genome: 1 to 100 hours* on 1 compute node
* Duration depends on selection of analytical tools and hardware

Example annotations: intergenic … SNP in IL23R associated with Crohn's disease … exonic NOD2 16 … a frameshift … SNP … exonic GJB2 13 … associated with hearing loss … exonic CRYL1,GJB6 13 … a 342kb deletion

Phenotypic Data, e.g. Clinical Histories, Medical Images: “…was in good health until 2-3 months ago when she gradually developed fatigue and intermittent epigastric pain, …”

Scientific data must be extracted from very large volumes of natural language content, biomedical images, and other unstructured data, and transformed into a structured format for analysis

Workload Challenge #2: Unstructured Information

Scientific Literature

Peer-Reviewed Articles, Clinical Guidelines, Textbooks, Patents

Information must be transformed into normalized structured data for statistical analysis and relationship visualization

Workload Challenge #3: ‘Big Data’ Integration

Discovery of genotype-phenotype associations requires an analysis of complex data types that must be integrated within a common analytical environment

1 Omics Data: Variant Calls & Annotations
   ##FORMAT=<ID=DP, …
   ##FORMAT=<ID=HQ, …
   #CHROM POS ID REF ALT …
   20 14370 rs6054257 G A …

2 Phenotypic Data: Clinical Features, Environmental Factors, Biological Responses

3 Knowledge Base: Electronic Text & Web Sites

‘Big’ Data Warehouse Environment (RDBMS and/or NoSQL)

Patient-Centric Logical Data Model
• Entities: Patient Population, Genotypic Data, Phenotypic Data, Knowledge Base
• Keys: Patient ID, Variant ID, Phenotype ID
• Views: Variant List, Detail on a Single Variant, Observed Traits & Responses, Observation Detail
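To make the patient-centric model concrete, the sketch below parses the VCF data line shown on the slide and joins it to a phenotype record through a shared patient ID in an in-memory SQLite database. The table layout, the patient and phenotype identifiers, and the observation text are hypothetical; the slide names the entities and keys but does not define a schema, and a production warehouse would hold far more columns.

```python
import sqlite3


def parse_vcf_record(line):
    """Split one VCF data line into its first five fixed columns."""
    chrom, pos, var_id, ref, alt = line.split()[:5]
    return {"chrom": chrom, "pos": int(pos), "id": var_id, "ref": ref, "alt": alt}


# The data line shown on the slide (remaining columns are elided there).
record = parse_vcf_record("20 14370 rs6054257 G A")

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE patient   (patient_id TEXT PRIMARY KEY);
    CREATE TABLE genotype  (patient_id TEXT, variant_id TEXT);
    CREATE TABLE phenotype (patient_id TEXT, phenotype_id TEXT, observation TEXT);
""")

# Hypothetical patient, linked to the parsed variant and one observed trait.
db.execute("INSERT INTO patient VALUES ('P001')")
db.execute("INSERT INTO genotype VALUES ('P001', ?)", (record["id"],))
db.execute(
    "INSERT INTO phenotype VALUES ('P001', 'PH01', 'hypothetical observed trait')"
)

# Patient-centric view: every variant paired with every observation per patient.
rows = db.execute(
    """
    SELECT g.variant_id, p.phenotype_id, p.observation
    FROM genotype AS g JOIN phenotype AS p USING (patient_id)
    WHERE g.patient_id = ?
    """,
    ("P001",),
).fetchall()
print(rows)
```

Keying both genotypic and phenotypic tables on the patient ID is what lets a single query relate a variant to observed traits; the variant ID and phenotype ID can then link out to knowledge-base detail.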

A framework for NGS and HPC Systems Architecture

[Diagram: Users and Devices; Scale-out cluster; Scale-up SMP; Spectrum Scale ESS; Spectrum Archive, TSM/LTFS]

Solution: GenWEQ hardware compression

Three steps are tested:
• Samtools Sort
• Picard MarkDuplicate
• Picard AddOrReplaceReadGroups, a commonly used optional component

• Easy to plug in and set up. Does not require any code changes or recompilation. Benefits multiple applications that use compression.

• Reduces compression time by up to 98%, with a compression ratio of up to 72% for BAM files. Provides close to free compression.

• Significantly improves performance of important steps in the genomic pipeline with a small sacrifice in compressed file size (~1.2x bigger):
  - Samtools sort: 1.19-1.25x faster
  - Picard MarkDuplicate: 1.42-1.47x faster
  - Picard AddOrReplaceReadGroups: 1.89-2.30x faster
