From the Benchtop to the Datacenter: IT and Converged Infrastructure in Life Sciences
DESCRIPTION
Talk given at the Leverage Big Data '14 Event in May 2014. Big data has arrived in the life science research domain, and it has caught researchers and IT professionals alike off-guard. Workstations, Excel, and even small clusters under people's desks are no longer sufficient to meet the data storage and processing needs of modern biological research techniques. Data is being produced cheaply and rapidly at unprecedented rates in academic, commercial and clinical laboratories while budgets in those spaces continue to be slashed. Despite the reduced budgets, it is predicted that 25% of all researchers will require HPC to analyze their data in the coming year. Research organizations are starting to realize they have to run to catch up, or face failure in the wake of old-school IT infrastructures and policies. IT organizations have been forced to get creative and build amazing infrastructures for pennies, or fail in the wake of the user pressure being generated from the laboratories. Converged infrastructure is the present and the future for biomedical, clinical, and life sciences research. In this talk, I'll cover the IT challenges in life sciences, how and where they are being met, and talk about the near-future trends in IT infrastructure, services, and informatics and how they will affect medical discoveries in the next 5-10 years.
TRANSCRIPT
From the Benchtop to the Datacenter: IT and Converged Infrastructure in Life Science Research
Leverage Big Data ’14, San Diego, CA; May 2014
Thursday, May 22, 14
Ari E. Berman, Ph.D.
Who am I?
Director of Government Services, Principal Investigator
I’m a fallen scientist - Ph.D. Molecular Biology, Neuroscience, Bioinformatics
I’m an HPC/Infrastructure geek - 14 years
I help enable science!
BioTeam
‣ Independent consulting shop
‣ Staffed by scientists forced to learn IT, SW & HPC to get our own research done
‣ Infrastructure, Informatics, Software Development, Cross-disciplinary Assessments
‣ 11+ years bridging the “gap” between science, IT & high performance computing
‣ Our wide-ranging work is what gets us invited to speak at events like this ...
What do we do?
BioTeam
Laboratory Knowledge
Converged Solution
Mostly work in Life Sciences
Our domain coverage
• Government
• Universities
• Big pharma
• Biotech
• Private institutes
• Diagnostic startups
• Oil and Gas
• Geospatial
• Hollywood Animation
• Law Enforcement
A ton of .gov work on a massive scale
My Recent Work...
‣ National Institutes of Health
‣ US Department of Agriculture (USDA)
‣ Navy
‣ Centers for Disease Control
OK, so why are we here talking to you?
We have a unique perspective across much of life sciences
We’ve noticed a few things
‣ Big Data has arrived in Life Sciences
‣ Data is being generated at unprecedented rates
‣ Research and Biomedical Orgs were caught off guard
‣ IT running to catch up, limited budgets
‣ Money is tight, Orgs reluctant to invest in Bio-IT
25% of all Life Scientists will require HPC in 2015!
It’s being made harder by lack of support
Research is hard
‣ Scientists are getting frustrated
‣ Stubborn, fight for what they need
‣ They will build a cluster under their desk if it gets the job done
‣ In general they will win against IT
It’s a risky time to be doing Bio-IT
Big Picture / Meta Issue
‣ HUGE revolution in the rate at which lab platforms are being redesigned, improved & refreshed
‣ IT not a part of the conversation, running to catch up
Science progressing way faster than IT can refresh/change
The Central Problem Is ...
‣ Instrumentation & protocols are changing FAR FASTER than we can refresh our Research-IT & Scientific Computing infrastructure
• Bench science is changing month-to-month ...
• ... while our IT infrastructure only gets refreshed every 2-7 years
‣ We have to design systems TODAY that can support unknown research requirements & workflows over many years (gulp ...)
The Central Problem Is ...
‣ The easy period is over
‣ 5 years ago - under the desk solutions worked
‣ This doesn’t work any more!
The new normal for informatics
And a related problem ...
‣ Vast amounts of data can be acquired cheaply and easily
‣ Growth rate of data creation exceeds rate of storage improvements
‣ Not just a storage lifecycle problem.
• This data *moves* and often needs to be shared among multiple entities and providers
• ... ideally without punching holes in your firewall or consuming all available internet bandwidth
If you get it wrong ...
‣ Lost opportunity
‣ Missing capability
‣ Frustrated & very vocal scientific staff
‣ Problems in recruiting, retention, publication & product development
What are the drivers in Bio-IT today?
Genomics: Next Generation Sequencing (NGS)
It’s like the hard drive of life
The big deal about DNA
‣ DNA is the template of life
‣ DNA is read --> RNA
‣ RNA is read --> Proteins
‣ Proteins are the functional machinery that make life possible
‣ Understanding the template = understanding basis for disease
Sequencing by Synthesis
How does NGS work?
Reference assembly, variant calling
How does NGS work?
Gateway to personalized medicine
The Human Genome
‣ 3.2 Gbp
‣ 23 chromosomes
‣ ~21,000 genes
‣ Over 55M known variations
...and why NGS is the primary driver
The Problem...
‣ Sequencers are now relatively cheap and fast
‣ Some can generate a human genome in 18 hours, for $2,000
‣ Everyone is doing it
‣ Can generate 3TB of data in that time
‣ First genome took 13 years and $2.7B to complete
Other Methodologies Not Far Behind
High-throughput Imaging
‣ Robotics screening millions of compounds on live cells 24/7
• Not as much data as genomics in volume, but just as complex
• Data volumes in the 10’s TB/week
‣ Confocal Imaging
• Scanning 100’s of tissue sections/week, each with 10’s of scans, each with 20-40 layers and multiple fluorescent channels
• Data volumes in the 1’s - 10’s TB/week
High-power, dense detector MRI scanners in use 24/7 at large research hospitals
High-res medical imaging
‣ Creating 3D models of brains, comparing large datasets
‣ Using those models to perform detailed neurosurgery with real-time analytic feedback from supercomputer in the OR (cool stuff)
‣ Also generates 10’s of TB/week
This is a huge problem
‣ Causing a deluge of data, in the 10’s of Petabytes
‣ NIH generating 1.5PB of data/month
‣ First real case in life science where 100Gb networking might really be needed
‣ But, not enough storage or compute
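A rough back-of-the-envelope sketch of what those volumes mean as sustained network rates. The 1.5 PB/month and 3 TB in 18 hours figures come from the slides; the 30-day month, decimal units, and the helper name `sustained_gbps` are assumptions made for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60

TB = 1e12  # decimal terabyte in bytes (assumption: decimal, not binary, units)
PB = 1e15  # decimal petabyte in bytes


def sustained_gbps(total_bytes: float, days: float) -> float:
    """Average rate in gigabits per second needed to move total_bytes in `days`."""
    bits = total_bytes * 8
    return bits / (days * SECONDS_PER_DAY) / 1e9


# 1.5 PB per month (slide figure), assuming a 30-day month.
nih_rate = sustained_gbps(1.5 * PB, days=30)

# One sequencer run: 3 TB in 18 hours (slide figures).
sequencer_rate = sustained_gbps(3 * TB, days=18 / 24)

print(f"1.5 PB/month averages {nih_rate:.2f} Gb/s")
print(f"3 TB in 18 hours averages {sequencer_rate:.2f} Gb/s")
```

These are averages only. Real traffic is bursty and data usually moves more than once (instrument to storage, storage to compute, site to collaborator), so peak demand can sit well above the average.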
And the problem is getting even bigger
We have them all
File & Data Types
‣ Massive text files
‣ Massive binary files
‣ Flatfile ‘databases’
‣ Spreadsheets everywhere
‣ Directories w/ 6 million files
‣ Large files: 600GB+
‣ Small files: 30KB or smaller
Application characteristics
‣ Mostly SMP/threaded apps performance bound by IO and/or RAM
‣ Hundreds of apps, codes & toolkits
‣ 1TB - 2TB RAM “High Memory” nodes becoming essential
‣ Lots of Perl/Python/R
‣ MPI is rare
• Well written MPI is even rarer
‣ Few MPI apps actually benefit from expensive low-latency interconnects*
• *Chemistry, modeling and structure work is the exception
Why, giant meta-analyses, of course
What to do with all that data?
‣ Typical problem across all of big data: how do you use it?
‣ In life sciences: no real standards of data formats
‣ Data scattered all over, despite push for Data Commons
‣ Not always accessible
‣ Combining the data if you have it all is a real challenge
Scientists don’t like to share (really!)
A Compounding Problem...
‣ The fear:
• if someone sees data before it is published, they might steal it and publish it themselves (getting scooped)
‣ Causes:
• Long time to publication
• Outdated methods of assigning scientific credit
• Not properly incentivized
Sharing required
A Problem for Data Commons
‣ Data piling up
‣ Bad network infrastructures
‣ Few central analytics platforms
‣ Wild-west file formats/algorithms
‣ No sharing
Hyperscale analytics will only work if the data is accessible!
But wait! There’s more!
Local IT inhibiting research
‣ IT originally developed for business administration
‣ Tight policies, security at the expense of performance
‣ IT determines resources, not users
‣ Result: lack of institutional investment and lack of talent
‣ Doesn’t work for science
IT staff become a part of the research projects
Successful Bio-IT orgs
‣ When IT and scientists are at the table together, mutual understanding happens
‣ IT is driven to innovate to accommodate science
‣ Scientists better understand resourcing
‣ Roadblocks are removed, science is enabled
How are these issues being solved?
Compute related design patterns largely static
Core Compute
‣ Linux compute clusters are the baseline compute platform
‣ Even our lab instruments know how to submit jobs to common HPC cluster schedulers
‣ Compute is not hard. It’s a commodity that is easy to acquire & deploy in 2014
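As an illustration of an instrument-side hand-off to a scheduler, here is a minimal sketch that renders a batch script. The talk does not name a scheduler; Slurm is assumed here, and the job name, the `align_reads.sh` command, the data path, and the resource sizes are hypothetical.

```python
def render_batch_script(job_name: str, command: str,
                        cpus: int = 8, mem_gb: int = 32) -> str:
    """Return the text of a Slurm batch script that runs a single command."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --mem={mem_gb}G",
        # %j is replaced by Slurm with the numeric job ID.
        f"#SBATCH --output={job_name}-%j.log",
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


# Hypothetical post-run hook: analyze a finished instrument run.
script = render_batch_script("run42-align", "./align_reads.sh /data/run42")
print(script)
```

The rendered text would be written to a file and handed to `sbatch`; other schedulers use different directive prefixes, so only the template would need to change.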
Defensive hedge against Big Data / HDFS
Compute: Local Disk is Back
‣ We’ve started to see organizations move away from blade servers and 1U pizza box enclosures for HPC
‣ The “new normal” may be 4U enclosures with massive local disk spindles - not occupied, just available
‣ Why? Hadoop & Big Data
‣ This is a defensive hedge against future HDFS or similar requirements
• Remember the ‘meta’ problem - science is changing far faster than we can refresh IT. This is a defensive future-proofing play.
‣ Hardcore Hadoop rigs sometimes operate at 1:1 ratio between core count and disk count
New and refreshed HPC systems running many node types
Compute: Huge trend in ‘diversity’
‣ Accelerated trend since at least 2012 ...
• HPC compute resources no longer homogeneous; many types and flavors now deployed in single HPC stacks
‣ Newer clusters mix-and-match to match the known use cases:
• GPU nodes for compute
• GPU nodes for visualization
• Large memory nodes (512GB +)
• Very Large memory nodes (1TB +)
• ‘Fat’ nodes with many CPU cores
• ‘Thin’ nodes with super-fast CPUs
• Analytic nodes with SSD, FusionIO, flash or large local disk for ‘big data’ tasks
Most use Amazon Web Services
Big push for cloud use
‣ Many Orgs are pushing for cloud
‣ Unsupported scientists end up using cloud
‣ It’s fast, flexible, affordable, if done right
‣ If done wrong, way more expensive than local compute
‣ Biggest problem: getting data to it!
Storage & Data Management
‣ LifeSci core requirement:
• Shared, simultaneous read/write access across many instruments, desktops & HPC silos
‣ NAS = “easiest” option
• Scale Out NAS products are the mainstream standard
‣ Parallel & Distributed storage for edge cases and large organizations with known performance needs
• Becoming much more common: GPFS has taken hold in LifeSci
Storage & Data Management
‣ Storage & data mgmt. is the #1 infrastructure headache in life science environments
‣ Most labs need “peta capable” storage due to unpredictable future
• Only a small % will actually hit 1PB
• Often forced to trade away performance in order to obtain capacity
‣ Object stores, ZFS and commodity “Nexentastor-style” methods are making significant inroads
Data Movement & Data Sharing
‣ Peta-scale data movement needs
• Within an organization
• To/from collaborators
• To/from suppliers
• To/from public data repos
‣ Peta-scale data sharing needs
• Collaborators and partners may be all over the world
Physical & Network
We Have Both Ingest Problems
‣ Significant physical ingest occurring in Life Science
• Standard media: naked SATA drives shipped via FedEx
‣ Cliche example:
• 30 genomes outsourced means 30 drives will soon be sitting in your mail pile
‣ Organizations often use similar methods to freight data between buildings and among geographic sites
Physical Ingest Just Plain Nasty
‣ Most common high-speed network: FedEx
‣ Easy to talk about in theory
‣ Seems “easy” to scientists and even IT at first glance
‣ Really really nasty in practice
• Incredibly time consuming
• Significant operational burden
• Easy to do badly / lose data
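To put numbers on the trade-off, here is a hedged sketch comparing the effective rate of shipped drives with the time needed to move the same payload over a network. The 30 drives come from the example above; the 3 TB per drive, one day in transit, four days of copying and verification, and 80% usable line rate are all illustrative assumptions.

```python
TB = 1e12  # decimal terabyte in bytes
SECONDS_PER_DAY = 86_400


def shipment_gbps(total_bytes: float, transit_days: float,
                  handling_days: float) -> float:
    """Effective Gb/s of a shipment, counting copy/verify time at both ends."""
    seconds = (transit_days + handling_days) * SECONDS_PER_DAY
    return total_bytes * 8 / seconds / 1e9


def network_days(total_bytes: float, link_gbps: float,
                 efficiency: float = 0.8) -> float:
    """Days to move total_bytes over a link at an assumed fraction of line rate."""
    seconds = total_bytes * 8 / (link_gbps * 1e9 * efficiency)
    return seconds / SECONDS_PER_DAY


payload = 30 * 3 * TB  # 30 drives x 3 TB each (drive size is an assumption)

courier = shipment_gbps(payload, transit_days=1, handling_days=4)
print(f"Courier, 5 days door to disk: {courier:.2f} Gb/s effective")
for gbps in (1, 10, 100):
    print(f"{gbps:>3} Gb/s link: {network_days(payload, gbps):.2f} days")
```

Under these assumptions the shipment works out to roughly 1.7 Gb/s effective: better than a 1 Gb/s link (about 10 days) but worse than a well-run 10 Gb/s link (about 1 day). The handling time, not the truck, dominates the result, which matches the operational burden described above.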
Networking
‣ Major 2014 focus
‣ May surpass storage as our #1 infrastructure headache
‣ Why?
• Petascale storage meaningless if you can’t access/move it
• 10-Gig, 40-Gig and 100-Gig networking will force significant changes elsewhere in the ‘bio-IT’ infrastructure
Network: Speed @ Core and Edge
‣ Remember 2004 when research storage requirements started to dwarf what the enterprise was using?
‣ Same thing is happening now for networking
‣ Research core, edge and top-of-rack networking speeds may exceed what the rest of the organization has standardized on
Network: ‘ScienceDMZ’
‣ Very fast “low-friction” network links and paths with security policy and enforcement specific to scientific workflows
‣ “ScienceDMZ” concept is real and necessary
‣ BioTeam will be building them in 2014 and beyond
‣ Central premise:
• Legacy firewall, network and security methods architected for “many small data flows” use cases
• Not built to handle smaller #s of massive data flows
• Also very hard to deploy ‘traditional’ security gear on 10 Gigabit and faster networks
‣ More details, background & documents at http://fasterdata.es.net/science-dmz/
[Figure labels: background traffic or competing bursts; DTN traffic with wire-speed bursts; 10GE links]
Simple Science DMZ:
Image source: “The Science DMZ: Introduction & Architecture” -- esnet
What does it all mean?
What just happened?
Laboratory Knowledge
Converged Infrastructure
The meta issue
‣ Individual technologies and their general successful use are fine
‣ Unless they all work together as a unified solution, it all means nothing
‣ Creating an end-to-end solution based on the use case (science!): converged infrastructure
It’s what we do
Convergence
Laboratory Knowledge
Converged Solution
What’s the future look like?
Small desktop appliances; gateways
Local Compute Appliances
‣ BioTeam is developing appliances: affordable compute for labs without IT resources
‣ More of these will make their way into the field
‣ Act as gateways to larger resources
‣ Pre-packaged with best-practices, low IT burden
Also known as hybrid clouds
Hybrid HPC
‣ Relatively new idea
• small local footprint
• large, dynamic, scalable, orchestrated public cloud component
‣ DevOps is key to making this work
‣ High-speed network to public cloud required
‣ Software interface layer acting as the mediator between local and public resources
‣ Good for tight budgets, has to be done right to work
‣ Not many working examples yet
Commodity Science DMZs
Scientific Networks
‣ As data grows, need clean/fast networks
‣ Security needs to be addressed, but performance is key
‣ Internet2 enables direct, clean, long-distance, affordable fast networking
Reduce guessing game on scientists’ part
Commoditized Infrastructure
‣ Building blocks matched to use cases
‣ Blocks designed as converged infrastructure
‣ Let scientists get back to the science
‣ Make the infrastructure serve the science
Central storage of knowledge with compute
Data Commons
‣ Common structure for data storage and indexing (a cloud?)
‣ Associated compute for analytics
‣ Development platform for application development (PaaS)
‣ Make discovery more possible
Don’t move data, move compute
Data Integration Systems
[Diagram labels: Data Integration Platform; Source 1, Source 2, Source 3; Analytics/Data Retrieval]
‣ PaaS is a meta-index and development platform
‣ Index knows where the data is
‣ System reads data as needed from remote sources
‣ Requires good networks
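One way to picture the meta-index idea is the minimal sketch below: the index stores only where each dataset lives and pulls bytes from the remote source when asked. The class names, the reader interface, and the in-memory stand-in for a remote source are all hypothetical, not a description of any existing platform.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator


@dataclass(frozen=True)
class Location:
    source: str  # name of the remote source holding the data
    path: str    # where the dataset lives inside that source


class MetaIndex:
    """Knows where each dataset lives; stores no dataset contents itself."""

    def __init__(self) -> None:
        self._entries: Dict[str, Location] = {}
        self._readers: Dict[str, Callable[[str], Iterator[bytes]]] = {}

    def register_source(self, name: str,
                        reader: Callable[[str], Iterator[bytes]]) -> None:
        self._readers[name] = reader

    def add(self, dataset_id: str, location: Location) -> None:
        self._entries[dataset_id] = location

    def locate(self, dataset_id: str) -> Location:
        return self._entries[dataset_id]

    def read(self, dataset_id: str) -> Iterator[bytes]:
        """Stream a dataset from its remote source only when requested."""
        loc = self._entries[dataset_id]
        return self._readers[loc.source](loc.path)


def in_memory_reader(store: Dict[str, bytes], chunk: int = 4):
    """Stand-in for a remote source: yields a stored value in small chunks."""
    def read(path: str) -> Iterator[bytes]:
        data = store[path]
        for i in range(0, len(data), chunk):
            yield data[i:i + chunk]
    return read


index = MetaIndex()
index.register_source("Source 1",
                      in_memory_reader({"/genomes/sample_a": b"ACGTACGTAC"}))
index.add("sample_a", Location("Source 1", "/genomes/sample_a"))

chunks = list(index.read("sample_a"))
print(index.locate("sample_a"), chunks)
```

In a real system the reader would be a network client, which is why the approach depends on good networks: every read crosses the wire.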
Not so distant future!
Fully converged laboratories
Laboratory Knowledge
Converged Solution
New Bottleneck!
Not so distant future!
Fully converged laboratories
Laboratory
Knowledge
end; Thanks!
Slides can be found at http://www.slideshare.com/arieberman