
integrating Data for Analysis, Anonymization, and SHaring (iDASH)

Infrastructure to Host Sensitive Data: HIPAA Cloud Storage and Compute

Claudiu Farcas, Olivier Harismendy, Antonios Koures

UCSD

Outline

• History

• iDASH CLOUD/SHADE Current State

• Future Plans for CLOUD

• The Quest for Repeatable Science

• Genomics Collaborations


In the beginning…

There was a “MindMap” to serve the needs of a very diverse community.

We had a plan…

Conquer the world of hardware through extensive virtualization.

… and started our journey.

[Photos: Rack #1, Rack #2]

Some hardware … and lots of ideas: some good, others …

Roadblocks and mishaps…

Some things simply don’t work out… so we start from scratch.

… and successes to keep us going.


Tools and data create science!

Fast forward to today…


iDASH CLOUD

[Architecture diagram, summarized:]

• SHADE (Safe HIPAA-compliant Annotated Data deposit box Environment): holds HIPAA and non-public data as well as public data, tools, and recipes; powered by MIDAS; users upload & download data

• CLOUD (On-demand Virtualized Elastic Resilient Compute And Storage Technology): compute nodes, memory, disk storage, and networking; powered by VMware, automated; users issue compute requests and directly upload & download proprietary data, tools, and recipes

• Middleware and HIPAA security developed by iDASH


Quarantine → Development → Staging → Production

Successive progression through the environments towards Production

Technical Specifications

• 3 computation tiers
• 3 storage tiers
• 10GbE throughout
• Full redundancy
• RSA Two-Factor Authentication
• Remote data replication
• 1000+ cores, 9TB+ RAM, 1PB+ storage


Cloud Environments

• Quarantine

» An isolated environment for incoming code and applications

• Only accessible internally

• All ports closed except SSH

» Apps and code can be scanned for vulnerabilities, malware, etc.; a minimal scanning sketch follows below
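A minimal sketch of the kind of scan the Quarantine tier could run, assuming ClamAV is available inside the environment; the drop directory and workflow are hypothetical, since the slides name the goal but not the tooling:

```python
import subprocess
from pathlib import Path

# Hypothetical drop directory inside the Quarantine environment.
QUARANTINE_DIR = Path("/quarantine/incoming")

def scan_upload(bundle: Path) -> bool:
    """Recursively scan an uploaded bundle with ClamAV's clamscan.

    clamscan exits 0 when clean, 1 when malware is found, 2 on scanner error.
    """
    result = subprocess.run(
        ["clamscan", "-r", "--infected", str(bundle)],
        capture_output=True, text=True,
    )
    if result.returncode == 1:
        print(f"REJECT {bundle.name}:\n{result.stdout.strip()}")
    return result.returncode == 0

for bundle in sorted(QUARANTINE_DIR.iterdir()):
    print(f"{bundle.name}: {'clean' if scan_upload(bundle) else 'flagged'}")
```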


Cloud Environments

• Development & Testing » A controlled environment designed for agile

development

» No Personal Health Information(PHI) • VPN access, no 2-Factor

» Source code control

» Bug tracking

» Development wiki (Confluence)

» Group Chat utility


Cloud Environments

• Staging/QA Environment

» Uses both VPN and 2-factor authentication

» Mirrors, but is independent of, Production

• Used for pre-production tests and UAT

• Must have user acceptance before promotion into production

• Production

» Secured with VPN, 2-factor authentication, and significant development restrictions; a sketch of this access model follows below
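A minimal sketch of the access model and promotion order across the four environments, with hypothetical table and function names; the controls themselves are as stated on the preceding slides:

```python
# Access controls per environment, as read off the slides above. The table is
# illustrative; real enforcement lives in the VPN/2FA infrastructure.
ENVIRONMENTS = {
    "Quarantine":  {"access": "internal only", "open_ports": ["SSH"], "two_factor": False},
    "Development": {"access": "VPN",           "open_ports": None,    "two_factor": False},
    "Staging/QA":  {"access": "VPN",           "open_ports": None,    "two_factor": True},
    "Production":  {"access": "VPN",           "open_ports": None,    "two_factor": True},
}

PROMOTION_ORDER = ["Quarantine", "Development", "Staging/QA", "Production"]

def next_stage(env: str) -> str | None:
    """Return the environment an application is promoted into next, if any."""
    i = PROMOTION_ORDER.index(env)
    return PROMOTION_ORDER[i + 1] if i + 1 < len(PROMOTION_ORDER) else None

print(next_stage("Staging/QA"))  # -> Production
```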


Cloud Improvements Y5

• Added a Quarantine environment with tools/utilities to analyze unknown incoming applications and source code

• Added Development services to the Development environment

» GitLab source code control

» Jira for bug and issue tracking

» Openfire chat

» Confluence for organizing and sharing information


Cloud Improvements Y5

• Added three Dell FX630 chassis, each with four blades

• Each blade has two CPU sockets populated with Intel Xeon E5-2699 v3 (Haswell) processors and 512GB of RAM

• This brings the available core count to over 1000

• Added additional disks (SSD and non-SSD) to increase the capacity of the cloud to over 1PB


Future plans

• FISMA ATO

• Integration of popular pipelines (e.g., SeqWare, OmicsPipe) into blueprints

• Billing and accounting


Future State: NSX implementation

• Analogous to server virtualization for compute, the NSX network virtualization approach lets system admins treat the physical network as a pool of transport capacity that can be consumed and repurposed on demand

• Network services are programmatically distributed to each virtual machine, independent of the underlying network hardware or topology, so workloads can be dynamically added or moved and all of the network and security services attached to a virtual machine move with it

» Automate network provisioning for tenants, with customization and complete isolation

» Better, dynamically adjustable isolation of the cloud environments (Quarantine, Dev, QA, Prod)


Future State: Hybrid Cloud


[Diagram: a Private Cloud containing a Secure Section, bridged to a Public Cloud]

Challenges for reproducible research

• missing or obsolete source code

• undocumented or unexpected dependencies to install and configure applications

• undisclosed values of the parameters used in published analyses

• requirements for querying and pre-processing external reference datasets


Support for containers

Running Docker within Linux VMs:

• Flexibility of applications (bundle necessary libraries, retain provenance)

• Improved efficiency, scalability, economics, and security of the cloud

A minimal launch sketch follows below.
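A minimal sketch of launching a pinned container inside a Linux VM, assuming a hypothetical image and mount layout; pinning by digest and disabling the network are illustrative choices for provenance and data security, not documented iDASH policy:

```python
import subprocess

# Hypothetical image reference; pinning to an immutable digest (not a mutable
# tag) records the exact tool version, which helps retain provenance.
IMAGE = "example.org/variant-caller@sha256:PLACEHOLDER_DIGEST"

cmd = [
    "docker", "run", "--rm",
    "--network", "none",          # no network access; data stays inside the VM
    "-v", "/data/input:/in:ro",   # inputs mounted read-only
    "-v", "/data/output:/out",    # results written here
    IMAGE,
    "run-pipeline", "--input", "/in", "--output", "/out",  # hypothetical entrypoint
]
subprocess.run(cmd, check=True)
```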


FlightDeck


Repeatable Results


[Diagram: a genomics workflow (Short reads → Index reference → Align to reference → Call variants → Annotate variants → Pick high impact → Deleterious SNPs) running on iDASH On-Demand Resources: CLOUD, SHADE Repository, Automation]

Repeatable Results


[Diagram: the same workflow is packaged into Blueprints. Each Blueprint bundles the workflow with its Context: Reference DB, Test data, Configuration, Helper tools, OS. Blueprints live as Templates on a Bookshelf in the iDASH On-Demand Resources, alongside MyDATA]

Repeatable Results


[Diagram: a Blueprint from the Bookshelf is launched as an Instance: the workflow plus its Context, fed with Input from MyDATA and External Data, producing Results. A minimal sketch of the Blueprint/Instance idea follows below]
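A minimal sketch of the Blueprint/Instance distinction described above, in Python with hypothetical type and field names; the slides present the concept, not iDASH's actual blueprint format:

```python
from dataclasses import dataclass, field

@dataclass
class Context:
    """Everything a workflow needs besides its input (per the diagram)."""
    reference_db: str
    test_data: str
    configuration: dict
    helper_tools: list[str]
    os_image: str

@dataclass
class Blueprint:
    """A repeatable recipe: ordered workflow steps plus their Context."""
    workflow: list[str]
    context: Context

    def instantiate(self, input_data: str) -> "Instance":
        # Launching a Blueprint against concrete input yields an Instance.
        return Instance(blueprint=self, input_data=input_data)

@dataclass
class Instance:
    """A running copy of a Blueprint bound to real input; results accumulate."""
    blueprint: Blueprint
    input_data: str
    results: list[str] = field(default_factory=list)

# Hypothetical blueprint mirroring the workflow on the slide.
variant_calling = Blueprint(
    workflow=["Index reference", "Align to reference", "Call variants",
              "Annotate variants", "Pick high impact"],
    context=Context(reference_db="hg38", test_data="test_reads.fastq",
                    configuration={"caller": "VarScan"},
                    helper_tools=["samtools"], os_image="linux-vm"),
)
run = variant_calling.instantiate(input_data="MyDATA/short_reads.fastq")
```

Because the Context travels with the workflow, two launches of the same Blueprint see identical references, configuration, and tools, which is what makes the results repeatable.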

Protected Health Information

• Cancer Genomic Data is Protected Health Information

» DNA sequences

• Germline polymorphisms, insertions, deletions
• Somatic mutations
• Structural variations

» RNA sequences

» Genotyping arrays


Cancer Genomics Datasets

• Moores Cancer Center internal datasets (sequencing or genotyping)

» 1032 Chronic Lymphocytic Leukemia
» 20 Myelodysplastic syndromes
» 12 Mesotheliomas
» 29 Appendix cancers
» 38 Breast cancers (sequencing)
» 36 Breast cancers (genotyping)

• “Public” datasets

» The Cancer Genome Atlas (1078 Breast, 478 Lung)
» The International Cancer Genome Consortium (100 Ovarian)
» dbGaP datasets


Genomics Machines Available

Li Ding, Jay Mashl (WashU); Brad Chapman (Harvard)

Best Practice Pipelines:
• Germline variant calling
• Cancer variant calling
• Structural variant calling
• RNA-seq
• smallRNA-seq
• ChIP-seq
• Standard

Local vs Remote Public Data

[Chart: installation duration, remote vs local, for the hg38, ExAC, dbSNP, and dbNSFP reference datasets]

A local copy installs up to 1000x faster; cumulative time drops from 6.8 hrs to 29 seconds. A staging sketch follows below.
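A minimal sketch of staging a public reference dataset locally once and reusing it on later runs; the mirror URL and cache path are hypothetical, since the slides report the speedup but not the staging mechanism:

```python
import subprocess
from pathlib import Path

# Hypothetical local cache; fetching once and reusing the copy is what turns
# hours of repeated remote access into seconds.
CACHE = Path("/data/reference-cache")

def stage(name: str, url: str) -> Path:
    """Return a local path for a reference bundle, downloading it only once."""
    dest = CACHE / name
    if not dest.exists():
        CACHE.mkdir(parents=True, exist_ok=True)
        subprocess.run(["wget", "-q", "-O", str(dest), url], check=True)
    return dest

# Hypothetical URL; every later call is served from local disk.
dbsnp = stage("dbsnp.vcf.gz", "https://example.org/mirrors/dbsnp.vcf.gz")
```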

[Logos: Broad Institute, NCBI, SoftGenetics, UC Santa Cruz]

Tumor vs Normal Exome

[Pipeline diagram: NORMAL DNA and TUMOR DNA processed in parallel]

• .fastq → .bam: Alignment (BWA)
• .bam → .refined.bam: Duplicate removal (Picard), Quality Recalibration (GATK)
• .refined.bam → .realigned.bam: Indel Realignment (GATK)
• .realigned.bam → .vcf: Variant calling (VarScan)
• .vcf → .annotated.vcf, .copy_number: Variant Annotation (Oncotator, VariantTools)
• Databases: dbNSFP, ExAC, dbSNP, COSMIC

A stage-order sketch follows below.
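A minimal sketch encoding the stage order and file-type transitions read off the diagram above; actual tool command lines are omitted because the slide does not give them:

```python
# (input suffix, output suffix, tool) per pipeline stage, as on the diagram.
EXOME_STAGES = [
    (".fastq",         ".bam",           "Alignment (BWA)"),
    (".bam",           ".refined.bam",   "Duplicate removal (Picard) + Quality Recalibration (GATK)"),
    (".refined.bam",   ".realigned.bam", "Indel Realignment (GATK)"),
    (".realigned.bam", ".vcf",           "Variant calling (VarScan)"),
    (".vcf",           ".annotated.vcf", "Variant Annotation (Oncotator, VariantTools)"),
]

def plan(sample: str) -> None:
    """Print the processing plan for one sample; run for both NORMAL and TUMOR."""
    for src, dst, tool in EXOME_STAGES:
        print(f"{sample}{src} -> {sample}{dst}: {tool}")

for sample in ("normal", "tumor"):
    plan(sample)
```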

Performance is similar to the public cloud, with little to no overhead from Docker.

Pan Cancer Analysis of Whole Genomes

• 2,601 donors (Tumor-Normal WG pairs)

• ~300GB of data per donor (a sizing sketch follows after this list)

• Sanger Center Dockerized workflow

• 9 VMs, 32 CPUs, 256GB RAM, 1TB storage

• 115 donors analyzed

» Shortest: 21 hrs

» Longest: 17 days (not included in freeze)
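A back-of-envelope sketch of the data scale implied by the figures above; the per-donor size is approximate:

```python
# Figures from the slide: 2,601 donors at roughly 300 GB each.
donors, gb_per_donor = 2601, 300

total_tb = donors * gb_per_donor / 1000        # ~780 TB for the full cohort
analyzed_tb = 115 * gb_per_donor / 1000        # ~34.5 TB for the 115 analyzed donors

print(f"full cohort: ~{total_tb:.0f} TB; analyzed so far: ~{analyzed_tb:.1f} TB")
```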

PCAWG Sanger Workflow

[Workflow diagram: Download x2 → getBasFile x2 → input preparation (ASCAT allele count x2, Pindel input x2, BRASS input) → parallel callers: ASCAT (finalize), Pindel x24 (Pindel 2 VCF x24, merge), BRASS (filter, split, assemble xN, grass, tabix), Caveman (prepare x2, setup, split xN, split concat, mstep xN, merge, estep xN, merge, add ID, flag, cleanup) → Package results → Metrics → VCF upload]

PCAWG Sanger Wall Time

[Chart: per-donor wall time (s), by analysis center: bsc, dkfz, ebi, etri, idash, oicr, osdc, pdc, riken, sanger, ucsc]

• Not corrected for number of CPUs, available RAM, or file size

• Workflow versions 1.0.4/5/6

• OICR runs mainly on AWS spot instances

Acknowledgments

UC San Diego CTRI: Antonios Koures, Ashley Williams, Tony Chen

UC San Diego DBMI: Claudiu Farcas, Michelle Dow, Lucila Ohno-Machado, Jihoon Kim, Tyler Bath

PCAWG-Tech: Lincoln Stein, Brian O’Connor, Christina Yung

WashU: Li Ding, Jay Mashl

Supported by the NIH Grant U54 HL108460 to the University of California, San Diego.