Post on 18-Jan-2016
Genomics Medicine Solution on Power 8
Kathy Tzeng, PhD
Worldwide Technical Lead, Healthcare & Life Sciences
IBM Systems Group
November 13, 2015
GENOMIC MEDICINE: from Sequencing to Personalized Healthcare
NHGRI, a branch of NIH, has defined 5 steps for genomic medicine. (source: E. Green et al., Nature 470, 204–213)
Next Generation Sequencing (or other ingestion): the focus is on very large data generation, mainly from $1000 whole genome sequencing, and on data processing and reduction. Includes human, plant, animal, and microbiome genomics.
Translational Research/Early Discovery: the focus is on data integration, including genomic data, and on the analytics required to identify biomarkers, understand disease mechanisms, and identify new medical treatments.
Personalized Healthcare/Clinical Genomics: the focus is on delivering genomic medicine to patients to improve outcomes by associating patients with known genome-specific treatments.
Computational Challenges
• Feature combinatorics
• Large file sizes
• Large population sizes
• Unstructured data types
A Computationally Challenging Problem
Breakthroughs in Genomic Medicine require quantifying associations between known population traits, environmental factors, and biological responses
Predictive Response Function: Known Traits or Environmental Features F(t), passed through the model W(t), give the Measured Biological Response R(t)
• F(t): Quantities describing population traits or environmental factors at time t
• W(t): Model of associations between features and responses as a function of time t
• R(t): Quantities describing response events for an organism at time t
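To make the F(t), W(t), R(t) notation concrete, here is a minimal sketch at a single time point. It assumes, purely for illustration, that W(t) acts as a linear map so that R(t) is approximated by W(t) applied to F(t); the slides do not state the form of the model. The helper name `predict_response` and all numbers are hypothetical.

```python
from typing import List


def predict_response(weights: List[List[float]], features: List[float]) -> List[float]:
    """Apply an ASSUMED linear association model W(t) to a feature vector F(t).

    Each row of `weights` holds the association strengths between every
    feature and one response quantity. Hypothetical helper, not from the slides.
    """
    for row in weights:
        if len(row) != len(features):
            raise ValueError("each row of W(t) needs one weight per feature")
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


# Hypothetical values at one time point t: two features, two responses.
F_t = [1.0, 0.5]
W_t = [
    [0.2, 0.4],  # response 0 depends on both features
    [0.0, 1.0],  # response 1 depends only on feature 1
]
R_t = predict_response(W_t, F_t)
print(R_t)
```

In practice the association model would be fitted from population data and need not be linear; the sketch only shows how the three quantities relate.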
Key Capabilities
Leading biomedical research organizations are asking for technology capabilities that will give them a low-cost solution to accelerate scientific discovery in Genomic Medicine
Flexible, scalable, and low-cost high-performance compute and storage solutions capable of efficiently processing rapidly growing quantities of genomic and other types of complex life science data
Seamless integration of complex life science data types on a common analytical platform
Rapid extraction and analysis of unstructured language content from very large volumes of clinical and scientific documents
Metadata collection capabilities providing detailed audit trails as source data are transformed into analytical results
Tools for scientific collaboration that enable data and workload sharing to cross organizations and geographic boundaries in a secure environment that ensures data privacy
A Foundation for Computational Science
IBM’s Reference Architecture for Genomics supports ‘big data’ computational research on a foundation of HPC compute, storage, and workload management capabilities
[Figure: layered architecture diagram. Recoverable labels follow.]
Researchers; LAN
Research Applications: Genomic Analysis Pipelines; Translational Research; Text Analytics / NLP (Apache UIMA, IBM System T); Image Analysis
• Performance optimization for open source and commercial analytics applications
• Text Analytics for the conversion of natural language concepts into structured data entities
• IBM Research, IBM Watson, IBM Business Partners
‘Big Data’ Foundation:
• Resource & Workload Management: Workload Orchestration, Monitoring & Metadata Capture; Resource Management enabling Private Clouds. Intelligent resource allocation, sharing, and monitoring across parallel HPC workloads (IBM Platform Computing, IBM Business Partners)
• Data Management: File System & Storage | ILM; ‘Big’ Data Warehouse. Low-cost, easy-access storage & archiving of data and metadata across heterogeneous environments (IBM Spectrum Scale / Elastic Storage Server)
• Heterogeneous Compute Resources: heterogeneous, flexible server infrastructure (IBM OpenPOWER servers)
Data management and analytics tools can be accessed and shared across heterogeneous systems in on-premise and cloud environments
IBM Systems Facilitate Scientific Collaboration
[Figure: IBM Reference Architecture for Genomics, February 2015. Recoverable labels follow.]
Environments: External Collaborators (Heterogeneous Environments); Local Data Center; Virtual Private Clouds; On-Premise Cluster
Users: Public Cloud Users; Private Cloud Users; On-Premise Users
Connectivity labels: Encrypted VPN; WAN; HPC Network; 10GbE or InfiniBand; 1/10 GbE; Workload Burst
‘Big Data’ foundation enables data access, data management, and HPC workload orchestration across heterogeneous on-premise, private cloud, public cloud, and hybrid cloud environments
Layers: Applications; Workload Orchestration with Metadata Capture; Data Management: File System / Storage ILM; ‘Big’ Data Warehouse
Platforms:
• Application areas: Genomics; Translational; Personalized Healthcare; …
• AppCenter (PAC, Galaxy, DataBiology, Lab7)
• Orchestrator (ASC/EGO, LSF, Symphony, PPM)
• Datahub (Spectrum Scale, Zato, Nirvana)
• Access: Application & Workflow; File & Database; Visualization; System & Log
• Compute: HPC Cluster; Big Data; Spark Cluster; Openstack; Docker
• Storage: SSD/Flash; FC/IB Attached; Low-cost Storage; HA/DR Storage; Cloud Storage
Genomic Analysis System Architecture
IBM Genomics Reference Architecture
The IBM Reference Architecture is an ecosystem of data management and analytics tools developed by IBM and industry-leading commercial and open source software providers
BioBuilds – Open Source Bioinformatics
• Turn-key: Pre-built binaries and complete build scripts enable easy deployment
• Optimized: POWER8 binaries provide the best performance for your hardware
• Ready for the Clinic: A single source for tools streamlining verification and audit
• Long Term Support: Community sponsorship and support contracts ensure ongoing support for tools
http://biobuilds.org/
Open Source bioinformatics tools for research, commercial, and regulated environments.
A Portfolio of Open Source Applications on Power
BioBuilds 2015.04: ALLPATHS-LG, Bedtools, Bfast, Bioconductor, BLAST (NCBI), Bowtie, Bowtie2, BWA, Cufflinks, FASTA, FastQC, HMMER, HTSeq, IGV, ISAAC, iRODS, Mothur, Numpy, PICARD, PLINK, Python, SAMTools, SHRiMP, SOAP3-DP, SOAPaligner, SOAPDenovo, SQLite, Tabix, TMAP, TopHat, Trinity, Velvet/Oases, R, RNAStar/STAR
BioBuilds Roadmap
2015.11 (SC15): bamtools, BarraCUDA, BioPython, ClustalW, Conda, EMBOSS, GraphViz, Htslib, Pysam, RSEM, Sratoolkit, Variant_tools
2016: BioPerl, CEGMA, Celera Assembler, DIALIGN-TX, FASTX-Toolkit, GenomicConsensus, Infernal, T-Coffee, MUSCLE, Sailfish, SIFT, STAR-fusion, and more
https://www.broadinstitute.org/gatk/blog?id=4833
Optimization of GATK from Broad Institute
IBM works with genomics leaders to improve performance of analytical workflows like GATK on IBM Power 8 Systems
Note*: http://library.wolfram.com/infocenter/Conferences/9045/Intel_LifeSciences_Personalized_Medicine_Wolfram%202014_Paolo%20Narvaez.pdf
Broad Institute Best Practice Performance
Input Dataset: G15512.HCC1954.1, coverage: 65x
Both IBM and Intel solutions: number of machines = 1; cores per machine: IBM 16, Intel 24
IBM Solution: IBM 3.32 GHz 8335-GTA with SMT=8; ESS GL4
IBM Solution performance highlights:
• 65X Whole Human Genome analysis done in about 21 hours
• 150X Whole Exome analysis done in 2.55 hours
Step                    Intel Runtime* (hours)    IBM Runtime (hours)
BWA                     7                         4.26
Samtools                5                         2.08
MarkDuplicates          11                        6.86
RealignTargets          1                         0.29
IndelRealigner          6.5                       0.77
BaseRecalibrator        1.3                       1.49
PrintReads+Index        12.3                      2.55
Pre-processing Total    44                        18.3
HaplotypeCaller         -                         2.64
Total                   -                         20.94
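The per-step ratios implied by the table above can be worked out with a short script. This is a minimal sketch that assumes both columns are in hours (consistent with the "about 21 hours" highlight) and simply divides the Intel runtime by the IBM runtime; the names `RUNTIMES_HOURS` and `speedup` are illustrative only.

```python
# Runtimes as (Intel, IBM), copied from the table above; hours are assumed.
RUNTIMES_HOURS = {
    "BWA": (7.0, 4.26),
    "Samtools": (5.0, 2.08),
    "MarkDuplicates": (11.0, 6.86),
    "RealignTargets": (1.0, 0.29),
    "IndelRealigner": (6.5, 0.77),
    "BaseRecalibrator": (1.3, 1.49),
    "PrintReads+Index": (12.3, 2.55),
}


def speedup(intel_hours: float, ibm_hours: float) -> float:
    """Ratio of Intel runtime to IBM runtime; values above 1 favor IBM."""
    return intel_hours / ibm_hours


step_speedups = {step: speedup(*pair) for step, pair in RUNTIMES_HOURS.items()}
intel_total = sum(intel for intel, _ in RUNTIMES_HOURS.values())
ibm_total = sum(ibm for _, ibm in RUNTIMES_HOURS.values())
overall_speedup = speedup(intel_total, ibm_total)

for step, ratio in step_speedups.items():
    print(f"{step:18s} {ratio:5.2f}x")
print(f"{'Pre-processing':18s} {overall_speedup:5.2f}x")
```

Two things are worth keeping in mind when reading the ratios: BaseRecalibrator is the one step where the IBM runtime is longer, and the two systems have different core counts (16 versus 24), so these are whole-system ratios rather than per-core comparisons.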
+ The execution times were measured using the following hardware and software configuration (the speed-up with FPGA was estimated based on the throughput of the compression accelerator):
– Hardware: POWER8 (10-core chip x 2, SMT8 @ 3.5 GHz, 510 GB memory) with GPFS, Ubuntu 14.04.
– Software versions: Samtools (http://www.htslib.org/), built using the source obtained from git on 2015-08-25; our tool was prototyped by modifying the same version of Samtools. Picard (http://broadinstitute.github.io/picard/), version 1.138. Java: java version "1.8.0", Java(TM) SE Runtime Environment SR1 FP10.
– Input file: a SAM file of an unsorted and unmarked version of an NA12878 WGS chr20 file.
++ The order of sorted reads and the reads marked as duplicates were compared between samtools/picard and ours.
+++ https://www.broadinstitute.org/gatk/guide/best-practices
Significant Speed-up with Multi-threaded S/W and FPGA+ for NA12878 WGS chr20; The Same Results++
[Chart: Sort and Mark Duplicates, comparing Samtools/Picard with ours]
Pre-processing Part of a Typical Workflow+++:
Map to Reference → Sort & Mark Duplicates → Indel Realignment → Base Recalibration
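The pre-processing steps named above can be sketched as a list of command lines. The following is a minimal sketch, not the configuration used for the measurements: it only builds argument lists (nothing is executed), it assumes BWA-MEM, a Samtools release that accepts `sort -o`, the classic `KEY=value` Picard syntax, and the GATK 3.x `-T` walker syntax, and all file names and jar paths are hypothetical. Exact flags vary by tool version.

```python
from typing import Dict, List


def preprocessing_commands(ref: str, fq1: str, fq2: str, known_sites: str,
                           sample: str, threads: int = 8) -> Dict[str, List[str]]:
    """Build (but do not run) one argv list per pre-processing step.

    File names and jar locations are hypothetical placeholders.
    """
    gatk = ["java", "-jar", "GenomeAnalysisTK.jar"]  # GATK 3.x jar, path assumed
    picard = ["java", "-jar", "picard.jar"]          # Picard jar, path assumed
    sam = f"{sample}.sam"            # bwa writes SAM to stdout; redirect it here
    sorted_bam = f"{sample}.sorted.bam"
    dedup_bam = f"{sample}.dedup.bam"
    intervals = f"{sample}.intervals"
    realigned_bam = f"{sample}.realigned.bam"
    recal_table = f"{sample}.recal.table"
    return {
        # Map to Reference
        "map": ["bwa", "mem", "-t", str(threads), ref, fq1, fq2],
        # Sort & Mark Duplicates
        "sort": ["samtools", "sort", "-@", str(threads), "-o", sorted_bam, sam],
        "mark_duplicates": picard + ["MarkDuplicates", f"I={sorted_bam}",
                                     f"O={dedup_bam}", f"M={sample}.metrics.txt"],
        # Indel Realignment
        "realign_targets": gatk + ["-T", "RealignerTargetCreator", "-R", ref,
                                   "-I", dedup_bam, "-o", intervals],
        "indel_realign": gatk + ["-T", "IndelRealigner", "-R", ref,
                                 "-I", dedup_bam, "-targetIntervals", intervals,
                                 "-o", realigned_bam],
        # Base Recalibration
        "base_recal": gatk + ["-T", "BaseRecalibrator", "-R", ref,
                              "-I", realigned_bam, "-knownSites", known_sites,
                              "-o", recal_table],
        "print_reads": gatk + ["-T", "PrintReads", "-R", ref, "-I", realigned_bam,
                               "-BQSR", recal_table, "-o", f"{sample}.recal.bam"],
    }


commands = preprocessing_commands("ref.fa", "r1.fq", "r2.fq", "dbsnp.vcf", "NA12878")
for step, argv in commands.items():
    print(step, "->", " ".join(argv))
```

Keeping each step as a separate argv list makes it easy to time steps individually, which is how per-step comparisons like the ones in this deck are usually collected.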
Acceleration of genomics pipeline on POWER8
Application: Illumina’s Casava V. 1.8 (BCL to FASTQ)
Data Set: 8 lanes of HiSeq data
Without cache library: Elapsed Time = 1730 min
With cache library: Elapsed Time = 107 min
IO Cache Library to Optimize Performance of Genomics Application
IBM uses a File Cache Library to improve I/O performance and reduce workflow runtimes
IBM scales genomic analysis from the desktop to the enterprise using IBM ESS/Spectrum Scale
Speed of the file system matters
Genomic Workload Optimization
[Chart: Analysis of 150x human genome WEX (NA12878). Elapsed time (minutes), axis from 0 to 35, for BWA and Samtools on GPFS, Local, and NFS storage. Data labels in extraction order: 20, 14, 23, 16, 30, 20.]
• Spectrum Scale scalable I/O performance significantly benefits various NGS workloads.
• Spectrum Scale also provides seamless capacity expansion, improved enterprise wide efficiency, commercial-grade reliability, business continuity and the flexibility of supporting a wide variety of platforms.
Data Compression
Compression algorithm: gzip on Power 8 with FPGA board (available now)
• Compression ratio (lossless): on average 1:3 for fastq files
• Speed/throughput: 2.5 GB/s on average (200 GB fastq can be compressed in 80 seconds)
Compression algorithm: CRAM
• Compression ratio (lossless): 1:2 to 1:4 with respect to BAM files, depending on the sequencing depth and other factors (from FASTQ to compressed BAM, the ratio is 16X)
• Speed/throughput: achieved beyond 10 times speed-up using 12 cores (approximately 0.5 GB/min); FPGA acceleration is ongoing
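The gzip figures quoted above are easy to sanity-check. This minimal sketch assumes that "1:3" means the compressed file is one third of the original size and that the 2.5 GB/s throughput is measured on the uncompressed input; both function names are illustrative.

```python
def compression_seconds(size_gb: float, throughput_gb_per_s: float) -> float:
    """Time to compress `size_gb` of input at a sustained input throughput."""
    if throughput_gb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return size_gb / throughput_gb_per_s


def compressed_size_gb(size_gb: float, ratio: float) -> float:
    """Output size for a 1:ratio compression (ratio=3 is read as 1:3)."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return size_gb / ratio


fastq_gb = 200.0
seconds = compression_seconds(fastq_gb, 2.5)  # figures quoted on the slide
output_gb = compressed_size_gb(fastq_gb, 3.0)
print(f"{fastq_gb:.0f} GB fastq: {seconds:.0f} s to compress, about {output_gb:.0f} GB out")
```

The 80-second figure on the slide follows directly from 200 GB divided by 2.5 GB/s.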
• Samtools sort: 1.19-1.25x faster
• Picard MarkDuplicates: 1.42-1.47x faster
• Picard AddOrReplaceReadGroups: 1.89-2.3x faster
• IBM is collaborating with Sanger Institute and EBI on improving compression for genomics data – Samtools, Picard, CRAM
Source: Baker M., Nature Methods 7, 495 - 499 (2010)
Noblis BioVelocity is Developed and Optimized on Power 8
L3 Bioinformatics BALSA on Power 8 with GPU
Power8 3.32 GHz, 2x k40 GPU and Spectrum Scale
Edico Genome
Analyze a Whole Human Genome at 30x coverage in under 30 minutes
BCL → Map/Align → Sort → Dedup → Variant Calling → VCF/GVCF
IBM works with Lab7 to deliver data provenance with performance, reliability and security
IBM Power System Solution with Spectrum Scale and Platform LSF delivers:
• Superior compute infrastructure: superior performance, scalability & maximum throughput
• Outstanding enterprise-grade reliability and security:
  – Reliability, Availability and Serviceability (RAS) features help avoid unplanned downtime
  – IBM Power Security and Compliance (PowerSC™) enables security compliance automation and includes reporting for compliance measurement and audit (HIPAA)
• Total cost of ownership: very affordable compared to like-sized x86 systems
Lab7 ESP
• Comprehensive software platform: combines LIMS and informatics functionalities
• Data provenance: maintains continuous data provenance by:
  – Tracking the history of samples, analyses, and results
  – Providing detailed audit trails
• Sequencing platform flexibility: manages data generated from any sequencing platform
Enterprise Data Management
3 C’s (Configure, Command, Collaborate)
[Figure: Databiology for Enterprise Functional Architecture. Recoverable labels follow.]
Databiology for Enterprise: SaaS + customer-specific instances
Layers: Information Management; Interface; Orchestration
Labels: Ontologies; Annotation; Samples; Comments + Attachments; Roles + Access; Shopping Basket; Social; Scientific; Lifecycle Management; Meta Information; Financial + Resource Mgmt; Task Management; Project Management; Applications; Import; Analysis; Visualization; Infrastructure; Network; Storage; Compute; Configuration; Instruments; Compute and Storage (Softlayer – LSF – GPFS); Transport; DBE Download Manager; S3, SCP, RSync, SFTP, FTP, HTTP; Logic; Version Control + Reproducible; Data Provenance; Portal; API; Custom Web Apps via API; DBE Multiprot; Email + WF Integration; Identity Management
Everything as an app: Scripts, Binaries, Pipelines, Workflow Management, Virtual Machines
Central hub to manage all ‘omics data and to orchestrate all activities
Functionally rich and orientated on key steps in R&D life cycle
Insight to Instrument with best in class applications
Easy integration with existing environments
Automatic data provenance and reporting
Cost neutral deployment
Gradual roll-out / Low risk
Data Provenance with Performance, Reliability and Security
tranSMART - Optimized on Power8 and ESS/Spectrum Scale
• tranSMART associates genotypic & phenotypic data for complex analytics
• Watson Explorer extracts insight from scientific literature and data records and provides enrichment to tranSMART’s analysis
https://www.dropbox.com/s/9qw2kr339cl0mie/wats_tran.mp4?dl=0
IBM HPC, hardware, software, and libraries enable faster time to insight in translational medicine—from hours to minutes: tranSMART
Loading ‘TCGA_OV’ into PostgreSQL (5,789,362 records)
Elapsed time, in seconds
Data Type          File Size       Time
Clinical           Up to 0.5 MB    13s
Gene Expression    Up to 32 MB     227s (3min 47s)
tranSMART Analytics using ‘TCGA_OV’
Elapsed time by Analysis Type, in seconds, for single node
Type    Acquire Data    Run Analysis    Total
MAS     157s            55s             212s (3min 32s)
HCA     262s            83s             345s (5min 45s)
PCA     253s            558s            811s (13min 31s)
Results are based on IBM internal testing and optimization of tranSMART using:
• IBM Power S822LC 16-core, 3.32 GHz 8335-GTA with SMT=2, Turbo Mode, 512 GB memory, IBM XL Compiler Suite, IBM ESSL
• IBM Elastic Storage Server Model GL4
Complete loading and analysis in mins instead of hours
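The totals in the tranSMART timing tables can be re-derived from the per-phase figures. This is a small sketch using only the numbers reported above; the helper name `format_duration` is illustrative.

```python
def format_duration(seconds: int) -> str:
    """Render whole seconds in the 'Xmin Ys' style used in the tables."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}min {secs}s"


# (acquire data, run analysis) in seconds, as reported for 'TCGA_OV'.
ANALYSES = {"MAS": (157, 55), "HCA": (262, 83), "PCA": (253, 558)}
totals = {name: acquire + run for name, (acquire, run) in ANALYSES.items()}

# Loading times in seconds for the clinical and gene expression data.
load_seconds = 13 + 227

for name, total in totals.items():
    print(f"{name}: {total}s ({format_duration(total)})")
print(f"Load: {load_seconds}s ({format_duration(load_seconds)})")
```

Every phase, including the longest analysis (PCA), finishes in under 15 minutes on a single node, which is what the "minutes instead of hours" summary refers to.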
tranSMART performance on Power 8 and ESS
Dataset       No. Records
TCGA_OV       5,789,632
Simulation    40,774,968
GSE32583      942,724
GSE13168      1,203,282
GSE1456       3,600,555
GSE15258      4,702,050
Accelerate tranSMART ETL by Power8/Spectrum Scale
[Figure: deployment diagram. Components: Users; Browser; Web Server (Apache2); Application Server (Tomcat 7) running tranSMART; I2b2 Application Server; PostgreSQL tranSMART DB; Spectrum Scale; R Analytics Tools; Solr Full Text index; Gene Patterns; PLINK; Watson Analytics Server; Power8. Connection labels: HTTP; JDBC; Quartz Job Call; command.]
tranSMART Power8 Deployment Architecture
Spanning Data Centers in Parallel with a Single Pane of Glass for Clinical and Research Applications on Power 8 and ESS/Spectrum Scale
[Figure: federated data sources. Sources: NIH Data; CDC Data; NLM Data; Lab Results; Imaging Data; Radiology Reports; Microbiology Reports; Nursing Home Records; Claims Data; Electronic Health Record Data; Genomic Data; Accepted Medical Knowledge. Connection labels: Internet; VPN; LAN.]
Zato Data Federation Solution for Healthcare and Genomics Data
Genomic Solution Enablement Team
Mission:
• Porting and Optimization of Genomics/Translational applications on IBM solution
• Developing Solutions with Partners
• Making IBM SW/HW available to Software developers
Members:
• Independent Software Vendor (ISV) team
• Toronto Compiler Lab
• Boeblingen Development Lab
• Tokyo Research Lab
• Austin Research Lab
Variant information requires a computationally intensive analysis of raw sequence data across thousands of genomic samples
Workload Challenge #1: ‘Big Data’ Analytics
[Figure: genomic analysis data flow. Recoverable labels follow.]
High-Throughput Sequencing: Whole Human Genome, 3 billion DNA base pairs @ 30x coverage
Raw Reads (file format: FastQ)
Assembly & Alignment (file format: BAM):
• De Novo Assembly: SOAPdenovo, Velvet, …
• Reference-Based Mapping: BWA, Bowtie, SOAP, …
• Reference Genomes: TCGA, GEO, dbSNP, …
Variant Calling (file format: VCF, 100 to 200 MB): Picard, GATK, SAMtools, SOAPsnp, …
Variant Annotations (Annotation Tools: ANNOVAR, Gene Ontology, …). Sample: intergenic … SNP in IL23R associated with Crohn's disease …
Other file sizes shown in the figure: ~150 GB (compressed); ~150 GB; 500 MB
Each human genome can have a few million variants
Processing time per genome: 1 to 100 hours* on 1 compute node
* Duration depends on selection of analytical tools and hardware
Phenotypic Data (Ex. Clinical Histories, Medical Images): “…was in good health until 2-3 months ago when she gradually developed fatigue and intermittent epigastric pain, …”
Omics Data (Variant Databases): exonic NOD2 16 … a frameshift … SNP …; exonic GJB2 13 … associated with hearing loss …; exonic CRYL1,GJB6 13 … a 342kb deletion
Scientific data must be extracted from very large volumes of natural language content, biomedical images, and other unstructured data, and transformed into a structured format for analysis
Workload Challenge #2: Unstructured Information
Scientific Literature
Peer-Reviewed Articles, Clinical Guidelines, Textbooks, Patents
Information must be transformed into normalized structured data for statistical analysis and relationship visualization
Workload Challenge #3: ‘Big Data’ Integration
Data types: (1) Omics Data, (2) Phenotypic Data, (3) Knowledge Base
Discovery of genotype-phenotype associations requires an analysis of complex data types that must be integrated within a common analytical environment
[Figure: Patient-Centric Logical Data Model in a ‘Big’ Data Warehouse Environment (RDBMS and/or NoSQL). Recoverable labels follow.]
(1) Omics Data: Variant Calls & Annotations (VCF), for example:
##FORMAT=<ID=DP, …
##FORMAT=<ID=HQ, …
#CHROM POS ID REF ALT …
20 14370 rs6054257 G A …
(2) Phenotypic Data: Clinical Features, Environmental Factors, Biological Responses
(3) Knowledge Base: Electronic Text & Web Sites
Model entities: Patient Population; Patient ID; Genotypic Data (Variant List, Variant ID, Detail on a Single Variant); Phenotypic Data (Phenotype ID, Patient ID, Observation Detail, Observed Traits & Responses); Knowledge Base
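A patient-centric model of this kind can be sketched with a few relational tables. The example below is a minimal illustration only: it parses the first five fixed columns of the VCF data line shown on the slide and loads it into an in-memory SQLite database. The table and column names, the patient and phenotype identifiers, and the choice of SQLite are all assumptions made for the sketch, not part of the IBM architecture.

```python
import sqlite3

# Data line from the slide, written with the tab separators VCF uses.
VCF_LINE = "20\t14370\trs6054257\tG\tA"


def parse_vcf_record(line: str) -> dict:
    """Parse the first five fixed VCF columns: CHROM, POS, ID, REF, ALT."""
    chrom, pos, var_id, ref, alt = line.rstrip("\n").split("\t")[:5]
    return {"chrom": chrom, "pos": int(pos), "id": var_id, "ref": ref, "alt": alt}


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (patient_id TEXT PRIMARY KEY);
CREATE TABLE variant (
    variant_id TEXT PRIMARY KEY, chrom TEXT, pos INTEGER, ref TEXT, alt TEXT);
-- Genotypic data: which patient carries which variant.
CREATE TABLE genotype (
    patient_id TEXT REFERENCES patient(patient_id),
    variant_id TEXT REFERENCES variant(variant_id));
-- Phenotypic data: observed traits and responses per patient.
CREATE TABLE observation (
    patient_id TEXT REFERENCES patient(patient_id),
    phenotype_id TEXT, detail TEXT);
""")

record = parse_vcf_record(VCF_LINE)
conn.execute("INSERT INTO patient VALUES (?)", ("P001",))  # hypothetical patient
conn.execute("INSERT INTO variant VALUES (?, ?, ?, ?, ?)",
             (record["id"], record["chrom"], record["pos"],
              record["ref"], record["alt"]))
conn.execute("INSERT INTO genotype VALUES (?, ?)", ("P001", record["id"]))
conn.execute("INSERT INTO observation VALUES (?, ?, ?)",
             ("P001", "PH001", "hypothetical observed trait"))

# Join genotype and phenotype through the shared patient ID.
rows = conn.execute("""
    SELECT g.patient_id, v.variant_id, o.phenotype_id
    FROM genotype g
    JOIN variant v ON v.variant_id = g.variant_id
    JOIN observation o ON o.patient_id = g.patient_id
""").fetchall()
print(rows)
```

The point of keying everything on the patient ID is that genotype-phenotype association queries become plain joins; a production warehouse would add the knowledge-base links and far richer variant detail.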
A framework for NGS and HPC Systems Architecture
[Figure labels: Users; Devices; Scale-out cluster; Scale-up SMP; HPC Management Suite; Platform Software Stack; Spectrum Scale; ESS; Spectrum Archive; TSM/LTFS]
Solution: GenWQE hardware compression
Three steps are tested:
• Samtools sort
• Picard MarkDuplicates
• Picard AddOrReplaceReadGroups (a commonly used optional component)
• Easy to plug in and set up. Does not require any code changes or recompilation. Benefits multiple applications that use compression.
• Reduces compression time by up to 98%, with a compression ratio of up to 72% for BAM files. Provides close to free compression.
• Significantly improved performance of important steps in the genomic pipeline, with small sacrifices in compressed file size (~1.2x bigger):
  – Samtools sort: 1.19-1.25x faster
  – Picard MarkDuplicates: 1.42-1.47x faster
  – Picard AddOrReplaceReadGroups: 1.89-2.30x faster