TRANSCRIPT
Research Computing@Broad
An Update: Bio-IT World Expo
April, 2015
Chris Dwan ([email protected])
Director, Research Computing and Data Services
Take Home Messages
• Go ahead and do the legwork to federate your environment to at least one public cloud.
– It's "just work" at this point.
• Spend a lot of time understanding your data lifecycle, then stuff the overly bulky 95% of it in an object store fronted by a middleware application.
– The latency sensitive, constantly used bits fit in RAM.
• Human issues of consent, data privacy, and ownership are still the hardest part of the picture.
– We must learn to work together from a shared, standards-based framework.
– The time is now.
The world is quite ruthless in selecting between
the dream and the reality, even where we will not.
Cormac McCarthy, All the Pretty Horses
• The Broad Institute is a non-profit biomedical
research institute founded in 2004
• Fifty core faculty members and hundreds of associate
members from MIT and Harvard
• ~1000 research and administrative personnel, plus
~2,400+ associated researchers
Programs and Initiatives: focused on specific disease or biology areas
Cancer
Genome Biology
Cell Circuits
Psychiatric Disease
Metabolism
Medical and Population Genetics
Infectious Disease
Epigenomics
Platforms: focused on technological innovation and application
Genomics
Therapeutics
Imaging
Metabolite Profiling
Proteomics
Genetic Perturbation
The Broad Institute
• The Broad Institute is a non-profit biomedical
research institute founded in 2004
• Twelve core faculty members and more than 200
associate members from MIT and Harvard
• ~1000 research and administrative personnel, plus
~1000 associated researchers
Programs and Initiatives: focused on specific disease or biology areas
Cancer
Genome Biology
Cell Circuits
Psychiatric Disease
Metabolism
Medical and Population Genetics
Infectious Disease
Epigenomics
Platforms: focused on technological innovation and application
Genomics
Therapeutics
Imaging
Metabolite Profiling
Proteomics
Genetic Perturbation
The Broad Institute
60+ Illumina HiSeq instruments, including 14 ‘X’ sequencers
700,000+ genotyped samples
~18PB unique data / ~30PB usable file storage
The HPC Environment
Shared Everything: A reasonable architecture
• ~10,000 cores of Linux servers
• 10 Gb/sec Ethernet backplane
• All storage is available as files (NAS) from all servers, over the shared network
(Timeline: Chris Dwan joins the Broad; Matt Nicholson joins the Broad; gradual puppetization brings increased visibility; monitoring and metrics; we're pretty sure that we actually have ~15,000 cores)
Management Stack: Bare Metal
(Configuration stack, bottom to top; columns: Bare Metal | Private Cloud | Public Cloud | Containerized Wonderland)
• Metal
• Network topology (VLANs, et al.)
• Hardware provisioning (UCS, xCAT)
• Boot image provisioning (PXE / Cobbler, Kickstart)
• OS and vendor patches (Red Hat / yum, plus Satellite)
• Broad specific system configuration (Puppet)
• User or execution environment (Dotkit, Docker, JVM, Tomcat)
Many specific technical decisions do not matter, so long as you choose something and make it work (Dagdigian, 2015).
Shared Everything: Ugly reality
(Diagram: Genomics Platform, Cancer Program, and shared "Farm" all attached to the same 10 Gb/sec network)
• At least six discrete compute farms running at least five versions of batch schedulers (LSF and SGE)
• Nodes "shared" by mix-and-match between owners
• Nine Isilon clusters
• Five Infinidat filers
• ~19 distinct storage technologies
Overlapping usage = potential I/O bottleneck when multiple groups are doing heavy analysis.
Configuration Stack: Now with Private Cloud!
(Same stack; columns: Bare Metal | Private Cloud | Public Cloud | Containerized Wonderland)
• Metal
• Network topology (VLANs, et al.)
• Hardware provisioning (UCS, xCAT)
• Boot image provisioning (PXE / Cobbler, Kickstart)
• Hypervisor OS
• Instance provisioning (Openstack)
• OS and vendor patches (Red Hat / yum, plus Satellite)
• Broad specific configuration (Puppet)
• User or execution environment (Dotkit, Docker, JVM, Tomcat)
Re-use everything possible from the bare metal environment, while inserting things that make our life easier.
Openstack@Broad: The least cloudy cloud
(Diagram: Genomics Platform, Cancer Program, and shared "Farm" on the 10 Gb/sec network, now running on Openstack (RHEL, Icehouse))
• Least "cloudy" implementation possible
• IT / DevOps staff as users
• Simply virtualizing and abstracting away hardware from the user-facing OS
• Note that most former problems remain intact
• Incrementally easier to manage with very limited staff (3 FTE Linux admins)
Openstack: open issues
Excluded from our project:
– Software defined networking (Neutron)
– “Cloud” storage (Cinder / Swift)
– Monitoring / Billing (Ceilometer, Heat)
– High Availability on Controllers
Custom:
– Most deployment infrastructure / scripting, including DNS
– Network encapsulation
– Active Directory integration
– All core systems administration functions
Core Message:
– Do not change both what you do and how you do it at the same time.
– Openstack could have been a catastrophe without rather extreme project
scoping.
I need “telemetry,” rather than logs*.
Jisto: Software startup with smart, smart monitoring and potential for containerized cycle harvesting, à la Condor.
*Logs let you know why you crashed. Telemetry lets you steer.
(Diagram: Broad Institute: firewall, NAS filers, and compute behind an edge router, connected to the Internet and Internet 2)
We need elastic computing, a more cloudy cloud.
Starting at the bottom of the stack
(Diagram: the same Broad network, with the private cloud on subnet 10.200.x.x, domain openstack.broadinstitute.org, hostnames tenant-x-x)
• There is no good answer to the question of "DNS in your private cloud"
• Private BIND domains are your friend
• The particular naming scheme does not matter. Just pick a scheme (a small sketch follows).
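As an illustration of "just pick a scheme," here is a minimal Python sketch that emits BIND-style A records following the tenant-x-x convention in the diagram above. The helper function and record layout are hypothetical, not an existing Broad tool; the point is only that the scheme is mechanical once chosen.

```python
# Minimal sketch: emit private-zone BIND A records for OpenStack tenant instances,
# following the tenant-x-x.openstack.broadinstitute.org convention above.
# Illustrative only; not an existing Broad tool.

def tenant_records(project_id, instance_ips, domain="openstack.broadinstitute.org"):
    """Return one BIND A-record line per instance in a tenant/project."""
    records = []
    for idx, ip in enumerate(instance_ips, start=1):
        fqdn = "tenant-{}-{}.{}.".format(project_id, idx, domain)
        records.append("{} IN A {}".format(fqdn, ip))
    return records

if __name__ == "__main__":
    for line in tenant_records(7, ["10.200.7.11", "10.200.7.12"]):
        print(line)
```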
Network Engineering: You don't have to replace everything.
(Diagram: the Broad network, subnet 10.200.x.x, domain openstack.broadinstitute.org, hostnames tenant-x-x, joined by VPN endpoints to an Amazon VPC with more compute on subnet 10.199.x.x, domain aws.broadinstitute.org)
Network Engineering Makes Everything Simpler
(Same diagram: on-premise network and Amazon VPC joined by VPN endpoints; 10.200.x.x / openstack.broadinstitute.org on premise, 10.199.x.x / aws.broadinstitute.org in AWS)
• Ignore issues of latency, network transport costs, and data locality for the moment. We'll get to those later.
• Differentiate Layer 2 from Layer 3 connectivity.
• We are not using Amazon Direct Connect. We don't need to, because AWS is routable via Internet 2.
Physical Network Layout: More bits!
(Diagram: Broad Main St. and Charles St. sites, Markley data centers (Boston), and Internap data centers (Somerville))
• Between data centers: 80 Gb/sec dark fiber
• Metro ring: 20 Gb/sec dark fiber
• Internet: 1 Gb/sec and 10 Gb/sec
• Internet 2: 10 Gb/sec and 100 Gb/sec
• Failover Internet: 1 Gb/sec
Configuration Stack: Private Hybrid Cloud!
(Same stack; columns: Bare Metal | Private Cloud | Public Cloud | Containerized Wonderland)
• My metal / public cloud infrastructure
• Network topology (VLANs, et al.)
• Hardware provisioning (UCS, xCAT)
• Boot image provisioning (PXE / Cobbler, Kickstart)
• Hypervisor OS
• Instance provisioning (Openstack on premise, CycleCloud in the public cloud)
• OS and vendor patches (Red Hat / yum, plus Satellite)
• Broad specific configuration (Puppet)
• User or execution environment (Dotkit, Docker, JVM, Tomcat)
CycleCloud provides straightforward, recognizable cluster functionality with autoscaling and a clean management UI. Do not be fooled by the 85 page "quick start guide," it's just a cluster.
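The autoscaling behavior can be pictured with a toy policy. This is a sketch of the general idea only, not CycleCloud's actual algorithm or API, and the thresholds are assumptions: grow the cluster while jobs are queued, shrink it when nodes sit idle.

```python
# Toy autoscaling policy: add nodes while work is queued, release idle ones.
# Illustrates the idea only; this is not CycleCloud's implementation.

def scale_decision(queued_jobs, running_nodes, idle_nodes, max_nodes=100):
    """Return the change in node count: positive to add, negative to remove."""
    if queued_jobs > 0 and running_nodes < max_nodes:
        return min(queued_jobs, max_nodes - running_nodes)  # cap at the budgeted maximum
    if queued_jobs == 0 and idle_nodes > 0:
        return -idle_nodes                                   # release anything sitting idle
    return 0

print(scale_decision(queued_jobs=40, running_nodes=10, idle_nodes=0))   # +40
print(scale_decision(queued_jobs=0, running_nodes=50, idle_nodes=12))   # -12
```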
A social digression on cloud resources
• Researchers are generally:
– Remarkably hardworking
– Responsible, good stewards of resources
– Not terribly engaged with IT strategy
These good character traits present social barriers to cloud adoption.
• Researchers need:
– Guidance and guard rails.
– Confidence that they are not “wasting” resources
– A sufficiently familiar environment to get started
Batch Compute Farm: 2015 Edition
(Diagram: Production Farm and Shared Research Farm, spanning Openstack and multiple public clouds)
• Two clusters, running the same batch scheduler (Univa's Grid Engine)
• Production: a small number of humans operating several production systems for business-critical data delivery
• Research: many humans running ad-hoc tasks
End State: Compute Clusters
(Diagram: Production Farm and Shared Research Farm, spanning Openstack and multiple public clouds)
• A financially inelastic portion of the clusters is governed by traditional fairshare scheduling.
• Fairshare allocations change slowly (month to month) based on conversation, investment, and discussion of both business and emotional factors.
• This allows consistent budgeting, dynamic exploration, and ad-hoc use without fear or guilt.
• Multiple public clouds support auto-scaling queues for projects with funding and urgency.
• Openstack plus public clouds provides a consistent capacity.
End State: Compute Clusters
(Diagram: Production Farm and Shared Research Farm, as above)
• On a project basis, funds can be allocated for truly elastic burst computing. This allows business logic to drive delivery based on investment.
• A financially inelastic portion of the clusters is governed by traditional fairshare scheduling (a toy illustration follows).
• Fairshare allocations are changed slowly (month to month, perhaps) based on substantial conversation, investment, and discussion of both business logic and feelings.
• This allows consistent budgeting, dynamic exploration, and ad-hoc use without fear or guilt.
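The fairshare idea can be illustrated with a toy priority calculation. This is a sketch only, not Univa Grid Engine's actual share-tree algorithm, and the group names and numbers are made up: a group that has consumed less than its allocated share gets higher priority for the next job.

```python
# Toy fairshare priority: allocations change slowly, priorities track recent use.
# Not Univa Grid Engine's real share-tree algorithm; groups and numbers are made up.

def fairshare_priority(allocated_share, recent_usage, total_usage):
    """Positive when a group has used less than its allocated fraction."""
    if total_usage == 0:
        return allocated_share
    return allocated_share - (recent_usage / total_usage)

groups = {"genomics": (0.50, 700.0), "cancer": (0.30, 150.0), "shared": (0.20, 150.0)}
total = sum(usage for _, usage in groups.values())
for name, (share, usage) in groups.items():
    print("{:10s} priority = {:+.2f}".format(name, fairshare_priority(share, usage, total)))
```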
End State: Multiple Interconnected Public Clouds for collaboration
(Diagram: Broad Institute, Amazon VPC, Google Cloud, and sibling institutions, all interconnected)
The term I've heard for this is "intercloud."
Long term goal:
• Seamless collaboration inside and outside of Broad
• With elastic compute and storage
• With little or no copying of files or ad-hoc, one-off hacks
Configuration Stack: Now with containers!
(Same stack; columns: Bare Metal | Private Cloud | Public Cloud | Containerized Wonderland)
• My metal / public cloud infrastructure
• Network topology (VLANs, et al.)
• Hardware provisioning (UCS, xCAT)
• Boot image provisioning (PXE / Cobbler, Kickstart)
• Hypervisor OS
• Instance provisioning (Openstack on premise, CycleCloud in the public cloud, ??? for the containerized column: Docker / Mesos / Kubernetes / Cloud Foundry / Common Workflow Language / …)
• End user visible OS and vendor patches (Red Hat, plus Satellite)
• Broad configuration (Puppet)
• User or execution environment (Dotkit, Docker, JVM, Tomcat)
Scratch Space: "Pod local," SSD filers
(Diagram: Production Farm on Openstack; 10 Gb/sec network to shared storage, 80+ Gb/sec network to scratch)
Scratch space:
• 3 x 70TB filers from Scalable Informatics
• Running a relative of Lustre
• Over multiple 40 Gb/sec interfaces
• Managed using hostgroups, workload affinities, and an attentive operations team
For small data: Lots of SSD / Flash
• Unreasonable requirement: make it impossible for spindles to be my bottleneck
– 8 GByte per second throughput (multiple quotes came back with fewer than 8 x 10Gb/sec ports; see the arithmetic below)
– ~100 TByte usable capacity (I am not asking about large volume storage)
• Give me sustainable pricing
– On a NAS style file share (raw SAN / block device / iSCSI is not the deal)
• Solution: Scalable Informatics "Unison" filers
– A lot of vendors failed basic sanity checks on this one.
– Please listen carefully when I state requirements. I do not believe in single monolithic solutions anymore.
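A back-of-the-envelope check on why quotes with fewer than 8 x 10Gb/sec ports failed the sanity test; the 80% usable-line-rate figure below is my own assumption, not a vendor number.

```python
# 8 GByte/sec of NAS traffic cannot fit through fewer than ~7 x 10Gb/sec ports
# even at theoretical line rate; with realistic protocol overhead, plan on 8 or more.
# The 80% efficiency figure is an assumption, not a measurement.

TARGET_GBYTES_PER_SEC = 8
PORT_GBITS_PER_SEC = 10
ASSUMED_EFFICIENCY = 0.80

target_gbits = TARGET_GBYTES_PER_SEC * 8                                 # 64 Gb/sec
ports_ideal = target_gbits / PORT_GBITS_PER_SEC                          # 6.4 ports
ports_real = target_gbits / (PORT_GBITS_PER_SEC * ASSUMED_EFFICIENCY)    # 8.0 ports

print("{} Gb/sec needs {:.1f} ports at line rate, {:.1f} at {:.0%} efficiency".format(
    target_gbits, ports_ideal, ports_real, ASSUMED_EFFICIENCY))
```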
Caching edge filers for shared references
(Diagram: Production Farm and Shared Research Farm on Openstack; scratch space (3 x 70TB Scalable Informatics filers, workload managed by hostgroups, workload affinities, and an attentive operations team) on the 80+ Gb/sec network; a physical Avere edge filer in front of the on-premise data stores on the 10 Gb/sec network)
Coherence on small volumes of files is provided by a combination of clever network routing and Avere's caching algorithms.
Plus caching edge filers for shared references
(Same diagram, with multiple public clouds and cloud-backed data stores added alongside the on-premise data stores)
Plus caching edge filers for shared references
(Same diagram again, now with a virtual Avere edge filer in front of the cloud-backed data stores, mirroring the physical edge filer in front of the on-premise stores)
Cool thing: Avere
• Avere sells software with optional hardware:
• NFS front end whose block-store is an S3 bucket.
• It was born as a caching accelerator, and it does that well,
so the network considerations are in the right place.
• Since the hardware is optional …
• An NFS share that bridges on premise and cloud.
Broad Data Production, 2015: ~100TB /wk
Data production will continue to grow year over year
We can easily keep up with it, if we adopt appropriate
technologies.
100TB/wk ~= 1.3Gb/sec, but 1PB @ 1GB/sec ~= 12 days.
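A quick check of those two figures, using only the numbers on this slide:

```python
# Back-of-the-envelope check on the data-rate claims above.

SECONDS_PER_WEEK = 7 * 24 * 3600

weekly_tb = 100
rate_gbits = weekly_tb * 1e12 * 8 / SECONDS_PER_WEEK / 1e9
print("100 TB/week ~= {:.1f} Gb/sec sustained".format(rate_gbits))      # ~1.3 Gb/sec

days_at_1_gbyte = 1e15 / 1e9 / 86400
print("1 PB at 1 GByte/sec ~= {:.0f} days".format(days_at_1_gbyte))     # ~12 days
```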
Broad Data Production, 2015: ~100TB /wk of unique information
“Data is heavy: It goes to the cheapest, closest place, and it stays
there”
Jeff Hammerbacher
Data Sizes for one 30x Whole Genome
• Base calls from a single lane of an Illumina HiSeq X: 95 GB
– Approximately the coverage required for 30x on a whole human genome
– Record of a laboratory event
– Totally immutable
– Almost never directly used
• Aligned reads from that same lane: 60 GB
– Substantially better compression because of putting like with like
• Aggregated, topped up, and re-normalized BAM: 145 GB
– Near doubling in file size because of multiple quality scores per base
• Variant file (VCF) and other directly usable formats: tiny
– Even smaller when we cast the distilled information into a database of some sort
File based storage: The Information Limits
• Single namespace filers hit real-world limits at:
– ~5PB (restriping times, operational hotspots, MTBF headaches)
– ~10^9 files: directories must either be wider or deeper than human brains can handle.
• Filesystem paths are presumed to persist forever
– Leads inevitably to forests of symbolic links
• Access semantics are inadequate for the federated world.
– We need complex, dynamic, context sensitive semantics including
consent for research use.
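A quick illustration of the "wider or deeper" problem; round numbers and my own arithmetic, assuming a billion files.

```python
# Why ~10^9 files breaks human-scale directory layouts: keeping each directory
# small forces the tree deep; keeping the tree shallow forces huge directories.
# Round numbers; my own arithmetic.

def min_depth(total_files, entries_per_dir):
    """Smallest directory depth that can hold total_files at a given fanout."""
    depth, capacity = 1, entries_per_dir
    while capacity < total_files:
        capacity *= entries_per_dir
        depth += 1
    return depth

for fanout in (100, 1000, 10000):
    print("{:>6} entries/dir -> at least {} levels deep".format(fanout, min_depth(10**9, fanout)))
# Even at 10,000 entries per directory you still need 3 levels, and nobody can
# skim a 10,000-entry directory listing anyway.
```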
Limits of File Based Organization
• The fact that whatever.bam and whatever.log are in the same
directory implies a vast amount about their relationship.
• The suffixes “bam” and “log” are also laden with meaning
• That implicit organization and metadata must be made explicit
in order to transcend the boundaries of file based storage
Limits of File Based Organization
• Broad hosts genotypes derived from perhaps 700,000
individuals
• These genotypes are organized according to a variety of
standards (~1,000 cohorts), and are spread across a variety of
filesystems
• Metadata about consent, phenotype, etc. is scattered across
dozens to hundreds of “databases.”
Limits of File Based Organization
• This lack of organization is holding us back from:
– Collaboration and federation between sibling organizations
– Substantial cost savings using policy based data motion
– Integrative research efforts
– Large scale discoveries that are currently in reach
We’re all familiar with this
Early 2014: Conversations about object storage with:
• Amazon, Google
• EMC, Cleversafe
• Avere, Amplidata
• Data Direct Networks, Infinidat
• …
My object storage opinions
• The S3 standard defines object storage
– Any application that uses any special / proprietary features is a
nonstarter – including clever metadata stuff.
• All object storage must be durable to the loss of an entire
data center
– Conversations about sizing / usage need to be incredibly simple
• Must be cost effective at scale
– Throughput and latency are considerations, not requirements
– This breaks the data question into stewardship and usage
• Must not merely re-iterate the failure modes of filesystems
The dashboard should look opaque
• Object “names” should be a bag of UUIDs
• Object storage should be basically unusable without the
metadata index.
• Anything else recapitulates the failure mode of file based
storage.
• This should scare you.
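A minimal sketch of what "a bag of UUIDs" means in practice: the object key is deliberately meaningless, and nothing is findable without the metadata index. The names, fields, and in-memory "bucket" below are illustrative, not the Broad's actual schema.

```python
# Opaque object naming: the key carries no meaning; the metadata index is the
# only way back to the science. Illustrative only; not the Broad's real schema.

import uuid

metadata_index = {}   # in practice a real database, not a dict

def store_object(payload, metadata, bucket):
    """Write payload under a random UUID key; record meaning only in the index."""
    key = str(uuid.uuid4())
    bucket[key] = payload
    metadata_index[key] = metadata
    return key

bucket = {}
key = store_object(b"...bam bytes...",
                   {"sample": "NA12878", "consent": "GRU", "format": "bam"},
                   bucket)
print(key)                  # e.g. 3f0c7c1e-... : useless on its own
print(metadata_index[key])  # the index is the only route back to meaning
```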
Current Object Storage Architecture
(Diagram: "BOSS" middleware in front of an on-premise object store (2.6PB of EMC), legacy file:// storage, and cloud providers (AWS / Google); metadata sources include consent for research use, phenotype, and LIMS)
• Domain specific middleware ("BOSS") fronts the objects
• Mediates access by issuing pre-signed URLs
• Provides secured, time limited links
• A work in progress
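The pre-signed URL mechanism above can be sketched with boto3 against an S3-compatible store. The bucket and key names are placeholders, and the BOSS access-control logic itself is not shown; this is only the last step, handing back a time-limited link.

```python
# Minimal sketch of the pre-signed URL pattern: the caller never receives bucket
# credentials, only a secured, time-limited link. Bucket/key names are placeholders.

import boto3

def issue_download_link(bucket, key, ttl_seconds=3600):
    """Return a time-limited HTTPS URL for one object (after access checks pass)."""
    s3 = boto3.client("s3")
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=ttl_seconds,
    )

# Hypothetical usage:
# print(issue_download_link("example-genomics-bucket", "3f0c7c1e-aaaa-bbbb"))
```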
Current Object Storage Architecture
(Same diagram: "BOSS" middleware, on-premise object store (2.6PB of EMC), legacy file:// storage, cloud providers (AWS / Google), plus consent, phenotype, and LIMS metadata)
• Broad is currently decanting our two iRODS archives into 2.6PB of on-premise object storage.
• This will free up 4PB of NAS filer (enough for a year of data production).
• Have pushed at petabyte scale to Google's cloud storage.
• At every point: challenging but possible.
Data Deletion @ Scale
Me: "Blah blah … I think we're cool to delete about 600TB of data from a cloud bucket. What do you think?"
Ray: "BOOM!"
• This was my first deliberate data deletion at this scale.
• It scared me how fast / easy it was.
• Considering a "pull request" model for large scale deletions (sketched below).
Files must give way to APIs
At large scale, the file/folder model for managing data on computers becomes ineffective as a human interface, and eventually a hindrance to programmatic access. The solution: object storage + metadata.
(Diagram: regulatory issues, ethical issues, technical issues)
Federated Identity Management
• This one is not solved.
• I have the names of various technologies that I think are
involved: OpenAM, Shibboleth, NIH Commons, …
• It is up to us to understand the requirements and build a
system that meets them.
• Requirements are:
– Regulatory / legal
– Organizational
– Ethical.
This stuff is important
We have an opportunity to change lives and health
outcomes, and to realize the gains of genomic medicine, this
year.
We also have an opportunity to waste vast amounts of
money and still not really help the world.
I would like to work together with you to build a better future,
sooner.
Standards are needed for genomic data
“The mission of the Global Alliance for Genomics
and Health is to accelerate progress in human
health by helping to establish a common framework
of harmonized approaches to enable effective and
responsible sharing of genomic and clinical data,
and by catalyzing data sharing projects that drive
and demonstrate the value of data sharing.”
(Diagram: regulatory issues, ethical issues, technical issues)
Thank You
Research Computing Ops:
Katie Shakun, David Altschuler, Dave Gregoire, Steve Kaplan, Kirill Lozinskiy, Paul McMartin,
Zach Shulte, Brett Stogryn, Elsa Tsao
Scientific Computing Services:
Eric Jones, Jean Chang, Peter Ragone, Vince Ryan
DevOps:
Lukas Karlsson, Marc Monnar, Matt Nicholson, Ray Pete, Andrew Teixeira
DSDE Ops:
Kathleen Tibbetts, Sam Novod, Jason Rose, Charlotte Tolonen, Ellen Winchester
Emeritus:
Tim Fennell, Cope Frazier, Eric Golin, Jay Weatherell, Ken Streck
BITS: Matthew Trunnell, Rob Damian, Cathleen Bonner, Kathy Dooley, Katey Falvey, Eugene
Opredelennov, Ian Poynter, (and many more)
DSDE: Eric Banks, David An, Kristian Cibulskis, Gabrielle Franceschelli, Adam Kiezun, Nils
Homer, Doug Voet, (and many more)
KDUX: Scott Sutherland, May Carmichael, Andrew Zimmer (and many more)
Partner Thank Yous
• Accunet (Nick Brown), Amazon
• Avere, Cisco (Skip Giles)
• Cycle Computing, EMC (Melissa Crichton, Patrick Combes)
• Google (Will Brockman), Infinidat
• Intel (Mark Bagley), Internet 2
• Red Hat, Scalable Informatics (Joe Landman)
• Solina, Violin Memory
• …
Take Home Messages
• Go ahead and do the legwork to federate your environment to at least one public cloud.
– It's "just work" at this point.
• Spend a lot of time understanding your data lifecycle, then stuff the overly bulky 95% of it in an object store fronted by a middleware application.
– The latency sensitive, constantly used bits fit in RAM.
• Human issues of consent, data privacy, and ownership are still the hardest part of the picture.
– We must learn to work together from a shared, standards-based framework.
– The time is now.