TRANSCRIPT
Research Computing@Broad
An Update: Bio-IT World Expo
April, 2015
Chris Dwan ([email protected])
Director, Research Computing and Data Services
Take Home Messages
• Go ahead and do the legwork to federate your environment to at least one public cloud.
– It's "just work" at this point.
• Spend a lot of time understanding your data lifecycle, then stuff the overly bulky 95% of it in an object store fronted by a middleware application.
– The latency sensitive, constantly used bits fit in RAM.
• Human issues of consent, data privacy, and ownership are still the hardest part of the picture.
– We must learn to work together from a shared, standards-based framework.
– The time is now.
The world is quite ruthless in selecting between
the dream and the reality, even where we will not.
Cormac McCarthy, All the Pretty Horses
• The Broad Institute is a non-profit biomedical
research institute founded in 2004
• Fifty core faculty members and hundreds of associate
members from MIT and Harvard
• ~1000 research and administrative personnel, plus
~2,400+ associated researchers
Programs and Initiatives: focused on specific disease or biology areas
Cancer
Genome Biology
Cell Circuits
Psychiatric Disease
Metabolism
Medical and Population Genetics
Infectious Disease
Epigenomics
Platforms: focused on technological innovation and application
Genomics
Therapeutics
Imaging
Metabolite Profiling
Proteomics
Genetic Perturbation
The Broad Institute
• The Broad Institute is a non-profit biomedical
research institute founded in 2004
• Twelve core faculty members and more than 200
associate members from MIT and Harvard
• ~1000 research and administrative personnel, plus
~1000 associated researchers
Programs and Initiatives: focused on specific disease or biology areas
Cancer
Genome Biology
Cell Circuits
Psychiatric Disease
Metabolism
Medical and Population Genetics
Infectious Disease
Epigenomics
Platforms: focused on technological innovation and application
Genomics
Therapeutics
Imaging
Metabolite Profiling
Proteomics
Genetic Perturbation
The Broad Institute
60+ Illumina HiSeq instruments, including 14 ‘X’ sequencers
700,000+ genotyped samples
~18PB unique data / ~30PB usable file storage
The HPC Environment
Shared Everything: A reasonable architecture
• ~10,000 cores of Linux servers
• 10 Gb/sec Ethernet backplane
• All storage is available as files (NAS) from all servers, over the shared network
(Timeline: Chris Dwan joins the Broad; Matt Nicholson joins the Broad; gradual puppetization brings increased visibility; monitoring and metrics; we're pretty sure that we actually have ~15,000 cores)
Management Stack: Bare Metal
(Configuration stack, bottom to top; columns: Bare Metal | Private Cloud | Public Cloud | Containerized Wonderland)
• Metal
• Network topology (VLANs, et al.)
• Hardware provisioning (UCS, xCAT)
• Boot image provisioning (PXE / Cobbler, Kickstart)
• OS and vendor patches (Red Hat / yum, plus Satellite)
• Broad specific system configuration (Puppet)
• User or execution environment (Dotkit, Docker, JVM, Tomcat)
Many specific technical decisions do not matter, so long as you choose something and make it work (Dagdigian, 2015).
Shared Everything: Ugly reality
(Diagram: Genomics Platform, Cancer Program, and shared "Farm" all attached to the same 10 Gb/sec network)
• At least six discrete compute farms running at least five versions of batch schedulers (LSF and SGE)
• Nodes "shared" by mix-and-match between owners
• Nine Isilon clusters
• Five Infinidat filers
• ~19 distinct storage technologies
Overlapping usage = potential I/O bottleneck when multiple groups are doing heavy analysis.
Configuration Stack: Now with Private Cloud!
(Same stack; columns: Bare Metal | Private Cloud | Public Cloud | Containerized Wonderland)
• Metal
• Network topology (VLANs, et al.)
• Hardware provisioning (UCS, xCAT)
• Boot image provisioning (PXE / Cobbler, Kickstart)
• Hypervisor OS
• Instance provisioning (Openstack)
• OS and vendor patches (Red Hat / yum, plus Satellite)
• Broad specific configuration (Puppet)
• User or execution environment (Dotkit, Docker, JVM, Tomcat)
Re-use everything possible from the bare metal environment, while inserting things that make our life easier.
Openstack@Broad: The least cloudy cloud
(Diagram: Genomics Platform, Cancer Program, and shared "Farm" on the 10 Gb/sec network, now running on Openstack (RHEL, Icehouse))
• Least "cloudy" implementation possible
• IT / DevOps staff as users
• Simply virtualizing and abstracting away hardware from the user-facing OS
• Note that most former problems remain intact
• Incrementally easier to manage with very limited staff (3 FTE Linux admins)
Openstack: open issues
Excluded from our project:
– Software defined networking (Neutron)
– “Cloud” storage (Cinder / Swift)
– Monitoring / Billing (Ceilometer, Heat)
– High Availability on Controllers
Custom:
– Most deployment infrastructure / scripting, including DNS
– Network encapsulation
– Active Directory integration
– All core systems administration functions
Core Message:
– Do not change both what you do and how you do it at the same time.
– Openstack could have been a catastrophe without rather extreme project
scoping.
I need “telemetry,” rather than logs*.
Jisto: Software startup with smart, smart monitoring and potential for containerized cycle harvesting, à la Condor.
*Logs let you know why you crashed. Telemetry lets you steer.
(Diagram: Broad Institute: firewall, NAS filers, and compute behind an edge router, connected to the Internet and Internet 2)
We need elastic computing, a more cloudy cloud.
Starting at the bottom of the stack
(Diagram: the same Broad network, with the private cloud on subnet 10.200.x.x, domain openstack.broadinstitute.org, hostnames tenant-x-x)
• There is no good answer to the question of "DNS in your private cloud"
• Private BIND domains are your friend
• The particular naming scheme does not matter. Just pick a scheme (a small sketch follows).
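As an illustration of "just pick a scheme," here is a minimal Python sketch that emits BIND-style A records following the tenant-x-x convention in the diagram above. The helper function and record layout are hypothetical, not an existing Broad tool; the point is only that the scheme is mechanical once chosen.

```python
# Minimal sketch: emit private-zone BIND A records for OpenStack tenant instances,
# following the tenant-x-x.openstack.broadinstitute.org convention above.
# Illustrative only; not an existing Broad tool.

def tenant_records(project_id, instance_ips, domain="openstack.broadinstitute.org"):
    """Return one BIND A-record line per instance in a tenant/project."""
    records = []
    for idx, ip in enumerate(instance_ips, start=1):
        fqdn = "tenant-{}-{}.{}.".format(project_id, idx, domain)
        records.append("{} IN A {}".format(fqdn, ip))
    return records

if __name__ == "__main__":
    for line in tenant_records(7, ["10.200.7.11", "10.200.7.12"]):
        print(line)
```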
Network Engineering: You don't have to replace everything.
(Diagram: the Broad network, subnet 10.200.x.x, domain openstack.broadinstitute.org, hostnames tenant-x-x, joined by VPN endpoints to an Amazon VPC with more compute on subnet 10.199.x.x, domain aws.broadinstitute.org)
Network Engineering Makes Everything Simpler
(Same diagram: on-premise network and Amazon VPC joined by VPN endpoints; 10.200.x.x / openstack.broadinstitute.org on premise, 10.199.x.x / aws.broadinstitute.org in AWS)
• Ignore issues of latency, network transport costs, and data locality for the moment. We'll get to those later.
• Differentiate Layer 2 from Layer 3 connectivity.
• We are not using Amazon Direct Connect. We don't need to, because AWS is routable via Internet 2.
Physical Network Layout: More bits!
(Diagram: Broad Main St. and Charles St. sites, Markley data centers (Boston), and Internap data centers (Somerville))
• Between data centers: 80 Gb/sec dark fiber
• Metro ring: 20 Gb/sec dark fiber
• Internet: 1 Gb/sec and 10 Gb/sec
• Internet 2: 10 Gb/sec and 100 Gb/sec
• Failover Internet: 1 Gb/sec
Configuration Stack: Private Hybrid Cloud!
(Same stack; columns: Bare Metal | Private Cloud | Public Cloud | Containerized Wonderland)
• My metal / public cloud infrastructure
• Network topology (VLANs, et al.)
• Hardware provisioning (UCS, xCAT)
• Boot image provisioning (PXE / Cobbler, Kickstart)
• Hypervisor OS
• Instance provisioning (Openstack on premise, CycleCloud in the public cloud)
• OS and vendor patches (Red Hat / yum, plus Satellite)
• Broad specific configuration (Puppet)
• User or execution environment (Dotkit, Docker, JVM, Tomcat)
CycleCloud provides straightforward, recognizable cluster functionality with autoscaling and a clean management UI. Do not be fooled by the 85 page "quick start guide," it's just a cluster.
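The autoscaling behavior can be pictured with a toy policy. This is a sketch of the general idea only, not CycleCloud's actual algorithm or API, and the thresholds are assumptions: grow the cluster while jobs are queued, shrink it when nodes sit idle.

```python
# Toy autoscaling policy: add nodes while work is queued, release idle ones.
# Illustrates the idea only; this is not CycleCloud's implementation.

def scale_decision(queued_jobs, running_nodes, idle_nodes, max_nodes=100):
    """Return the change in node count: positive to add, negative to remove."""
    if queued_jobs > 0 and running_nodes < max_nodes:
        return min(queued_jobs, max_nodes - running_nodes)  # cap at the budgeted maximum
    if queued_jobs == 0 and idle_nodes > 0:
        return -idle_nodes                                   # release anything sitting idle
    return 0

print(scale_decision(queued_jobs=40, running_nodes=10, idle_nodes=0))   # +40
print(scale_decision(queued_jobs=0, running_nodes=50, idle_nodes=12))   # -12
```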
A social digression on cloud resources
• Researchers are generally:
– Remarkably hardworking
– Responsible, good stewards of resources
– Not terribly engaged with IT strategy
These good character traits present social barriers to cloud adoption.
• Researchers need:
– Guidance and guard rails.
– Confidence that they are not “wasting” resources
– A sufficiently familiar environment to get started
Batch Compute Farm: 2015 Edition
(Diagram: Production Farm and Shared Research Farm, spanning Openstack and multiple public clouds)
• Two clusters, running the same batch scheduler (Univa's Grid Engine)
• Production: a small number of humans operating several production systems for business-critical data delivery
• Research: many humans running ad-hoc tasks
End State: Compute Clusters
(Diagram: Production Farm and Shared Research Farm, spanning Openstack and multiple public clouds)
• A financially inelastic portion of the clusters is governed by traditional fairshare scheduling.
• Fairshare allocations change slowly (month to month) based on conversation, investment, and discussion of both business and emotional factors.
• This allows consistent budgeting, dynamic exploration, and ad-hoc use without fear or guilt.
• Multiple public clouds support auto-scaling queues for projects with funding and urgency.
• Openstack plus public clouds provides a consistent capacity.
End State: Compute Clusters
(Diagram: Production Farm and Shared Research Farm, as above)
• On a project basis, funds can be allocated for truly elastic burst computing. This allows business logic to drive delivery based on investment.
• A financially inelastic portion of the clusters is governed by traditional fairshare scheduling (a toy illustration follows).
• Fairshare allocations are changed slowly (month to month, perhaps) based on substantial conversation, investment, and discussion of both business logic and feelings.
• This allows consistent budgeting, dynamic exploration, and ad-hoc use without fear or guilt.
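The fairshare idea can be illustrated with a toy priority calculation. This is a sketch only, not Univa Grid Engine's actual share-tree algorithm, and the group names and numbers are made up: a group that has consumed less than its allocated share gets higher priority for the next job.

```python
# Toy fairshare priority: allocations change slowly, priorities track recent use.
# Not Univa Grid Engine's real share-tree algorithm; groups and numbers are made up.

def fairshare_priority(allocated_share, recent_usage, total_usage):
    """Positive when a group has used less than its allocated fraction."""
    if total_usage == 0:
        return allocated_share
    return allocated_share - (recent_usage / total_usage)

groups = {"genomics": (0.50, 700.0), "cancer": (0.30, 150.0), "shared": (0.20, 150.0)}
total = sum(usage for _, usage in groups.values())
for name, (share, usage) in groups.items():
    print("{:10s} priority = {:+.2f}".format(name, fairshare_priority(share, usage, total)))
```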
End State: Multiple Interconnected Public Clouds for collaboration
(Diagram: Broad Institute, Amazon VPC, Google Cloud, and sibling institutions, all interconnected)
The term I've heard for this is "intercloud."
Long term goal:
• Seamless collaboration inside and outside of Broad
• With elastic compute and storage
• With little or no copying of files or ad-hoc, one-off hacks
Configuration Stack: Now with containers!
(Same stack; columns: Bare Metal | Private Cloud | Public Cloud | Containerized Wonderland)
• My metal / public cloud infrastructure
• Network topology (VLANs, et al.)
• Hardware provisioning (UCS, xCAT)
• Boot image provisioning (PXE / Cobbler, Kickstart)
• Hypervisor OS
• Instance provisioning (Openstack on premise, CycleCloud in the public cloud, ??? for the containerized column: Docker / Mesos / Kubernetes / Cloud Foundry / Common Workflow Language / …)
• End user visible OS and vendor patches (Red Hat, plus Satellite)
• Broad configuration (Puppet)
• User or execution environment (Dotkit, Docker, JVM, Tomcat)
Scratch Space: "Pod local," SSD filers
(Diagram: Production Farm on Openstack; 10 Gb/sec network to shared storage, 80+ Gb/sec network to scratch)
Scratch space:
• 3 x 70TB filers from Scalable Informatics
• Running a relative of Lustre
• Over multiple 40 Gb/sec interfaces
• Managed using hostgroups, workload affinities, and an attentive operations team
For small data: Lots of SSD / Flash
• Unreasonable requirement: make it impossible for spindles to be my bottleneck
– 8 GByte per second throughput (multiple quotes came back with fewer than 8 x 10Gb/sec ports; see the arithmetic below)
– ~100 TByte usable capacity (I am not asking about large volume storage)
• Give me sustainable pricing
– On a NAS style file share (raw SAN / block device / iSCSI is not the deal)
• Solution: Scalable Informatics "Unison" filers
– A lot of vendors failed basic sanity checks on this one.
– Please listen carefully when I state requirements. I do not believe in single monolithic solutions anymore.
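A back-of-the-envelope check on why quotes with fewer than 8 x 10Gb/sec ports failed the sanity test; the 80% usable-line-rate figure below is my own assumption, not a vendor number.

```python
# 8 GByte/sec of NAS traffic cannot fit through fewer than ~7 x 10Gb/sec ports
# even at theoretical line rate; with realistic protocol overhead, plan on 8 or more.
# The 80% efficiency figure is an assumption, not a measurement.

TARGET_GBYTES_PER_SEC = 8
PORT_GBITS_PER_SEC = 10
ASSUMED_EFFICIENCY = 0.80

target_gbits = TARGET_GBYTES_PER_SEC * 8                                 # 64 Gb/sec
ports_ideal = target_gbits / PORT_GBITS_PER_SEC                          # 6.4 ports
ports_real = target_gbits / (PORT_GBITS_PER_SEC * ASSUMED_EFFICIENCY)    # 8.0 ports

print("{} Gb/sec needs {:.1f} ports at line rate, {:.1f} at {:.0%} efficiency".format(
    target_gbits, ports_ideal, ports_real, ASSUMED_EFFICIENCY))
```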
Caching edge filers for shared references
(Diagram: Production Farm and Shared Research Farm on Openstack; scratch space (3 x 70TB Scalable Informatics filers, workload managed by hostgroups, workload affinities, and an attentive operations team) on the 80+ Gb/sec network; a physical Avere edge filer in front of the on-premise data stores on the 10 Gb/sec network)
Coherence on small volumes of files is provided by a combination of clever network routing and Avere's caching algorithms.
Plus caching edge filers for shared references
(Same diagram, with multiple public clouds and cloud-backed data stores added alongside the on-premise data stores)
Plus caching edge filers for shared references
(Same diagram again, now with a virtual Avere edge filer in front of the cloud-backed data stores, mirroring the physical edge filer in front of the on-premise stores)
Cool thing: Avere
• Avere sells software with optional hardware:
• NFS front end whose block-store is an S3 bucket.
• It was born as a caching accelerator, and it does that well,
so the network considerations are in the right place.
• Since the hardware is optional …
• An NFS share that bridges on premise and cloud.
Broad Data Production, 2015: ~100TB /wk
Data production will continue to grow year over year
We can easily keep up with it, if we adopt appropriate
technologies.
100TB/wk ~= 1.3Gb/sec, but 1PB @ 1GB/sec ~= 12 days.
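A quick check of those two figures, using only the numbers on this slide:

```python
# Back-of-the-envelope check on the data-rate claims above.

SECONDS_PER_WEEK = 7 * 24 * 3600

weekly_tb = 100
rate_gbits = weekly_tb * 1e12 * 8 / SECONDS_PER_WEEK / 1e9
print("100 TB/week ~= {:.1f} Gb/sec sustained".format(rate_gbits))      # ~1.3 Gb/sec

days_at_1_gbyte = 1e15 / 1e9 / 86400
print("1 PB at 1 GByte/sec ~= {:.0f} days".format(days_at_1_gbyte))     # ~12 days
```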
Broad Data Production, 2015: ~100TB /wk of unique information
“Data is heavy: It goes to the cheapest, closest place, and it stays
there”
Jeff Hammerbacher
Data Sizes for one 30x Whole Genome
• Base calls from a single lane of an Illumina HiSeq X: 95 GB
– Approximately the coverage required for 30x on a whole human genome
– Record of a laboratory event
– Totally immutable
– Almost never directly used
• Aligned reads from that same lane: 60 GB
– Substantially better compression because of putting like with like
• Aggregated, topped up, and re-normalized BAM: 145 GB
– Near doubling in file size because of multiple quality scores per base
• Variant file (VCF) and other directly usable formats: tiny
– Even smaller when we cast the distilled information into a database of some sort
File based storage: The Information Limits
• Single namespace filers hit real-world limits at:
– ~5PB (restriping times, operational hotspots, MTBF headaches)
– ~10^9 files: directories must either be wider or deeper than human brains can handle.
• Filesystem paths are presumed to persist forever
– Leads inevitably to forests of symbolic links
• Access semantics are inadequate for the federated world.
– We need complex, dynamic, context sensitive semantics including
consent for research use.
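A quick illustration of the "wider or deeper" problem; round numbers and my own arithmetic, assuming a billion files.

```python
# Why ~10^9 files breaks human-scale directory layouts: keeping each directory
# small forces the tree deep; keeping the tree shallow forces huge directories.
# Round numbers; my own arithmetic.

def min_depth(total_files, entries_per_dir):
    """Smallest directory depth that can hold total_files at a given fanout."""
    depth, capacity = 1, entries_per_dir
    while capacity < total_files:
        capacity *= entries_per_dir
        depth += 1
    return depth

for fanout in (100, 1000, 10000):
    print("{:>6} entries/dir -> at least {} levels deep".format(fanout, min_depth(10**9, fanout)))
# Even at 10,000 entries per directory you still need 3 levels, and nobody can
# skim a 10,000-entry directory listing anyway.
```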
Limits of File Based Organization
• The fact that whatever.bam and whatever.log are in the same
directory implies a vast amount about their relationship.
• The suffixes “bam” and “log” are also laden with meaning
• That implicit organization and metadata must be made explicit
in order to transcend the boundaries of file based storage
Limits of File Based Organization
• Broad hosts genotypes derived from perhaps 700,000
individuals
• These genotypes are organized according to a variety of
standards (~1,000 cohorts), and are spread across a variety of
filesystems
• Metadata about consent, phenotype, etc. is scattered across
dozens to hundreds of “databases.”
Limits of File Based Organization
• This lack of organization is holding us back from:
– Collaboration and federation between sibling organizations
– Substantial cost savings using policy based data motion
– Integrative research efforts
– Large scale discoveries that are currently in reach
We’re all familiar with this
Early 2014: Conversations about object storage with:
• Amazon, Google
• EMC, Cleversafe
• Avere, Amplidata
• Data Direct Networks, Infinidat
• …
My object storage opinions
• The S3 standard defines object storage
– Any application that uses any special / proprietary features is a
nonstarter – including clever metadata stuff.
• All object storage must be durable to the loss of an entire
data center
– Conversations about sizing / usage need to be incredibly simple
• Must be cost effective at scale
– Throughput and latency are considerations, not requirements
– This breaks the data question into stewardship and usage
• Must not merely re-iterate the failure modes of filesystems
The dashboard should look opaque
• Object “names” should be a bag of UUIDs
• Object storage should be basically unusable without the
metadata index.
• Anything else recapitulates the failure mode of file based
storage.
• This should scare you.
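A minimal sketch of what "a bag of UUIDs" means in practice: the object key is deliberately meaningless, and nothing is findable without the metadata index. The names, fields, and in-memory "bucket" below are illustrative, not the Broad's actual schema.

```python
# Opaque object naming: the key carries no meaning; the metadata index is the
# only way back to the science. Illustrative only; not the Broad's real schema.

import uuid

metadata_index = {}   # in practice a real database, not a dict

def store_object(payload, metadata, bucket):
    """Write payload under a random UUID key; record meaning only in the index."""
    key = str(uuid.uuid4())
    bucket[key] = payload
    metadata_index[key] = metadata
    return key

bucket = {}
key = store_object(b"...bam bytes...",
                   {"sample": "NA12878", "consent": "GRU", "format": "bam"},
                   bucket)
print(key)                  # e.g. 3f0c7c1e-... : useless on its own
print(metadata_index[key])  # the index is the only route back to meaning
```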
Current Object Storage Architecture
(Diagram: "BOSS" middleware in front of an on-premise object store (2.6PB of EMC), legacy file:// storage, and cloud providers (AWS / Google); metadata sources include consent for research use, phenotype, and LIMS)
• Domain specific middleware ("BOSS") fronts the objects
• Mediates access by issuing pre-signed URLs
• Provides secured, time limited links
• A work in progress
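The pre-signed URL mechanism above can be sketched with boto3 against an S3-compatible store. The bucket and key names are placeholders, and the BOSS access-control logic itself is not shown; this is only the last step, handing back a time-limited link.

```python
# Minimal sketch of the pre-signed URL pattern: the caller never receives bucket
# credentials, only a secured, time-limited link. Bucket/key names are placeholders.

import boto3

def issue_download_link(bucket, key, ttl_seconds=3600):
    """Return a time-limited HTTPS URL for one object (after access checks pass)."""
    s3 = boto3.client("s3")
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=ttl_seconds,
    )

# Hypothetical usage:
# print(issue_download_link("example-genomics-bucket", "3f0c7c1e-aaaa-bbbb"))
```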
Current Object Storage Architecture
(Same diagram: "BOSS" middleware, on-premise object store (2.6PB of EMC), legacy file:// storage, cloud providers (AWS / Google), plus consent, phenotype, and LIMS metadata)
• Broad is currently decanting our two iRODS archives into 2.6PB of on-premise object storage.
• This will free up 4PB of NAS filer (enough for a year of data production).
• Have pushed at petabyte scale to Google's cloud storage.
• At every point: challenging but possible.
Data Deletion @ Scale
Me: "Blah blah … I think we're cool to delete about 600TB of data from a cloud bucket. What do you think?"
Ray: "BOOM!"
• This was my first deliberate data deletion at this scale.
• It scared me how fast / easy it was.
• Considering a "pull request" model for large scale deletions (sketched below).
Files must give way to APIs
At large scale, the file/folder model for managing data on computers becomes ineffective as a human interface, and eventually a hindrance to programmatic access. The solution: object storage + metadata.
(Diagram: regulatory issues, ethical issues, technical issues)
Federated Identity Management
• This one is not solved.
• I have the names of various technologies that I think are
involved: OpenAM, Shibboleth, NIH Commons, …
• It is up to us to understand the requirements and build a
system that meets them.
• Requirements are:
– Regulatory / legal
– Organizational
– Ethical.
This stuff is important
We have an opportunity to change lives and health
outcomes, and to realize the gains of genomic medicine, this
year.
We also have an opportunity to waste vast amounts of
money and still not really help the world.
I would like to work together with you to build a better future,
sooner.
Standards are needed for genomic data
“The mission of the Global Alliance for Genomics
and Health is to accelerate progress in human
health by helping to establish a common framework
of harmonized approaches to enable effective and
responsible sharing of genomic and clinical data,
and by catalyzing data sharing projects that drive
and demonstrate the value of data sharing.”
(Diagram: regulatory issues, ethical issues, technical issues)
Thank You
Research Computing Ops:
Katie Shakun, David Altschuler, Dave Gregoire, Steve Kaplan, Kirill Lozinskiy, Paul McMartin,
Zach Shulte, Brett Stogryn, Elsa Tsao
Scientific Computing Services:
Eric Jones, Jean Chang, Peter Ragone, Vince Ryan
DevOps:
Lukas Karlsson, Marc Monnar, Matt Nicholson, Ray Pete, Andrew Teixeira
DSDE Ops:
Kathleen Tibbetts, Sam Novod, Jason Rose, Charlotte Tolonen, Ellen Winchester
Emeritus:
Tim Fennell, Cope Frazier, Eric Golin, Jay Weatherell, Ken Streck
BITS: Matthew Trunnell, Rob Damian, Cathleen Bonner, Kathy Dooley, Katey Falvey, Eugene
Opredelennov, Ian Poynter, (and many more)
DSDE: Eric Banks, David An, Kristian Cibulskis, Gabrielle Franceschelli, Adam Kiezun, Nils
Homer, Doug Voet, (and many more)
KDUX: Scott Sutherland, May Carmichael, Andrew Zimmer (and many more)
Partner Thank Yous
• Accunet (Nick Brown), Amazon
• Avere, Cisco (Skip Giles)
• Cycle Computing, EMC (Melissa Crichton, Patrick Combes)
• Google (Will Brockman), Infinidat
• Intel (Mark Bagley), Internet 2
• Red Hat, Scalable Informatics (Joe Landman)
• Solina, Violin Memory
• …
Take Home Messages
• Go ahead and do the legwork to federate your environment to at least one public cloud.
– It's "just work" at this point.
• Spend a lot of time understanding your data lifecycle, then stuff the overly bulky 95% of it in an object store fronted by a middleware application.
– The latency sensitive, constantly used bits fit in RAM.
• Human issues of consent, data privacy, and ownership are still the hardest part of the picture.
– We must learn to work together from a shared, standards-based framework.
– The time is now.