
From the Benchtop to the Datacenter: IT and Converged Infrastructure in Life Science Research

Leverage Big Data ’14, San Diego, CA; May 2014

Thursday, May 22, 14

Ari E. Berman, Ph.D.

Who am I?


Director of Government Services, Principal Investigator

I’m a fallen scientist - Ph.D. Molecular Biology, Neuroscience, Bioinformatics

I’m an HPC/Infrastructure geek - 14 years

I help enable science!


BioTeam

‣ Independent consulting shop

‣ Staffed by scientists forced to learn IT, SW & HPC to get our own research done

‣ Infrastructure, Informatics, Software Development, Cross-disciplinary Assessments

‣ 11+ years bridging the “gap” between science, IT & high performance computing

‣ Our wide-ranging work is what gets us invited to speak at events like this ...


BioTeam

What do we do?

[Figure: Laboratory, Converged Solution, Knowledge]


Our domain coverage

Mostly work in Life Sciences

• Government

• Universities

• Big pharma

• Biotech

• Private institutes

• Diagnostic startups

• Oil and Gas

• Geospatial

• Hollywood Animation

• Law Enforcement


My Recent Work...

A ton of .gov work on a massive scale

‣ National Institutes of Health

‣ US Department of Agriculture (USDA)

‣ Navy

‣ Centers for Disease Control



OK, so why are we here talking to you?


We have a unique perspective across much of life sciences

We’ve noticed a few things

‣ Big Data has arrived in Life Sciences

‣ Data is being generated at unprecedented rates

‣ Research and Biomedical Orgs were caught off guard

‣ IT running to catch up, limited budgets

‣ Money is tight, Orgs reluctant to invest in Bio-IT


25% of all Life Scientists will require HPC in 2015!

Research is hard

It’s being made harder by lack of support

‣ Scientists are getting frustrated

‣ Stubborn, fight for what they need

‣ They will build a cluster under their desk if it gets the job done

‣ In general they will win against IT



It’s a risky time to be doing Bio-IT



Big Picture / Meta Issue

‣ HUGE revolution in the rate at which lab platforms are being redesigned, improved & refreshed

‣ IT not a part of the conversation, running to catch up


The Central Problem Is ...

Science progressing way faster than IT can refresh/change

‣ Instrumentation & protocols are changing FAR FASTER than we can refresh our Research-IT & Scientific Computing infrastructure

• Bench science is changing month-to-month ...

• ... while our IT infrastructure only gets refreshed every 2-7 years

‣ We have to design systems TODAY that can support unknown research requirements & workflows over many years (gulp ...)


The Central Problem Is ...

‣ The easy period is over

‣ 5 years ago - under the desk solutions worked

‣ This doesn’t work any more!



The new normal for informatics


And a related problem ...

‣ It is now cheap and easy to acquire vast amounts of data

‣ Growth rate of data creation exceeds rate of storage improvements

‣ Not just a storage lifecycle problem.

• This data *moves* and often needs to be shared among multiple entities and providers

• ... ideally without punching holes in your firewall or consuming all available internet bandwidth


If you get it wrong ...

‣ Lost opportunity

‣ Missing capability

‣ Frustrated & very vocal scientific staff

‣ Problems in recruiting, retention, publication & product development



What are the drivers in Bio-IT today?



Genomics: Next Generation Sequencing (NGS)


The big deal about DNA

It’s like the hard drive of life

‣ DNA is the template of life

‣ DNA is read --> RNA

‣ RNA is read --> Proteins

‣ Proteins are the functional machinery that make life possible

‣ Understanding the template = understanding basis for disease
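The read-out chain above (DNA --> RNA --> Proteins) can be sketched in a few lines. This is a toy illustration only: the `CODON_TABLE` below is a small, hand-picked subset of the standard genetic code, and the function names and example sequence are made up for this sketch.

```python
# Toy illustration of DNA -> RNA -> protein.
# Only a handful of codons from the standard genetic code are listed here;
# a real tool would carry all 64.
CODON_TABLE = {
    "AUG": "M",                          # methionine, the usual start codon
    "GCU": "A", "GCC": "A",              # alanine
    "UUU": "F", "UUC": "F",              # phenylalanine
    "AAA": "K", "AAG": "K",              # lysine
    "GGU": "G", "GGC": "G",              # glycine
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons
}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA for a DNA coding strand (T is replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read the mRNA three bases at a time until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        # "?" marks a codon that is missing from the toy table
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "?")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCCTTTAAAGGCTAA"  # hypothetical example sequence
mrna = transcribe(dna)
print(mrna)             # AUGGCCUUUAAAGGCUAA
print(translate(mrna))  # MAFKG
```

A change of a single base in `dna` can change the resulting amino acid, which is why knowing the template matters for understanding disease.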


How does NGS work?

Sequencing by Synthesis


How does NGS work?

Reference assembly, variant calling


The Human Genome

Gateway to personalized medicine

‣ 3.2 Gbp

‣ 23 chromosome pairs

‣ ~21,000 genes

‣ Over 55M known variations


The Problem...

...and why NGS is the primary driver

‣ Sequencers are now relatively cheap and fast

‣ Some can generate a human genome in 18 hours, for $2,000

‣ Everyone is doing it

‣ Can generate 3TB of data in that time

‣ First genome took 13 years and $2.7B to complete



Other Methodologies Not Far Behind


High-throughput Imaging

‣ Robotics screening millions of compounds on live cells 24/7

• Not as much data as genomics in volume, but just as complex

• Data volumes in the 10’s of TB/week

‣ Confocal Imaging

• Scanning 100’s of tissue sections/week, each with 10’s of scans, each with 20-40 layers and multiple fluorescent channels

• Data volumes in the 1’s - 10’s of TB/week


High-res medical imaging

High-power, dense detector MRI scanners in use 24/7 at large research hospitals

‣ Creating 3D models of brains, comparing large datasets

‣ Using those models to perform detailed neurosurgery with real-time analytic feedback from a supercomputer in the OR (cool stuff)

‣ Also generates 10’s of TB/week



This is a huge problem

‣ Causing a deluge of data, in the 10’s of Petabytes

‣ NIH generating 1.5PB of data/month

‣ First real case in life science where 100Gb networking might really be needed

‣ But, not enough storage or compute
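As a back-of-envelope check on the 1.5PB/month figure above, the sketch below converts it to an average line rate. It assumes decimal petabytes and a 30-day month; the function name is made up for this example.

```python
PB = 10**15  # decimal petabyte, in bytes (assumption: not PiB)


def sustained_gbps(bytes_per_period: float, period_days: float) -> float:
    """Average rate in Gbit/s needed to move the given volume in the given time."""
    seconds = period_days * 24 * 3600
    return bytes_per_period * 8 / seconds / 1e9


rate = sustained_gbps(1.5 * PB, period_days=30)
print(f"{rate:.2f} Gbit/s averaged over a 30-day month")  # 4.63 Gbit/s
```

This is only the flat average for a single copy of the data. Instruments and transfers are bursty, and data is usually moved more than once (to storage, to compute, to collaborators), so peak demand can sit well above this number.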



And the problem is getting even bigger


File & Data Types

We have them all

‣ Massive text files

‣ Massive binary files

‣ Flatfile ‘databases’

‣ Spreadsheets everywhere

‣ Directories w/ 6 million files

‣ Large files: 600GB+

‣ Small files: 30KB or smaller


Application characteristics

‣ Mostly SMP/threaded apps, performance bound by IO and/or RAM

‣ Hundreds of apps, codes & toolkits

‣ 1TB - 2TB RAM “High Memory” nodes becoming essential

‣ Lots of Perl/Python/R

‣ MPI is rare

• Well written MPI is even rarer

‣ Few MPI apps actually benefit from expensive low-latency interconnects*

• *Chemistry, modeling and structure work is the exception


What to do with all that data?

Why, giant meta-analyses, of course

‣ Typical problem across all of big data: how do you use it?

‣ In life sciences: no real standards for data formats

‣ Data scattered all over, despite push for Data Commons

‣ Not always accessible

‣ Even if you have all the data, combining it is a real challenge


A Compounding Problem...

Scientists don’t like to share (really!)

‣ The fear:

• if someone sees data before it is published, they might steal it and publish it themselves (getting scooped)

‣ Causes:

• Long time to publication

• Outdated methods of assigning scientific credit

• Not properly incentivized


A Problem for Data Commons

Sharing required

‣ Data piling up

‣ Bad network infrastructures

‣ Few central analytics platforms

‣ Wild-west file formats/algorithms

‣ No sharing



Hyperscale analytics will only work if the data is accessible!



But wait! There’s more!



Local IT inhibiting research

‣ IT originally developed for business administration

‣ Tight policies, security at the expense of performance

‣ IT determines resources, not users

‣ Result: lack of institutional investment and lack of talent

‣ Doesn’t work for science


Successful Bio-IT orgs

IT staff become a part of the research projects

‣ When IT and scientists are at the table together, mutual understanding happens

‣ IT is driven to innovate to accommodate science

‣ Scientists better understand resourcing

‣ Roadblocks are removed, science is enabled



How are these issues being solved?


Core Compute

Compute related design patterns largely static

‣ Linux compute clusters are the baseline compute platform

‣ Even our lab instruments know how to submit jobs to common HPC cluster schedulers

‣ Compute is not hard. It’s a commodity that is easy to acquire & deploy in 2014


Compute: Local Disk is Back

Defensive hedge against Big Data / HDFS

‣ We’ve started to see organizations move away from blade servers and 1U pizza box enclosures for HPC

‣ The “new normal” may be 4U enclosures with massive local disk spindles - not occupied, just available

‣ Why? Hadoop & Big Data

‣ This is a defensive hedge against future HDFS or similar requirements

• Remember the ‘meta’ problem - science is changing far faster than we can refresh IT. This is a defensive future-proofing play.

‣ Hardcore Hadoop rigs sometimes operate at a 1:1 ratio between core count and disk count


Compute: Huge trend in ‘diversity’

New and refreshed HPC systems running many node types

‣ Accelerated trend since at least 2012 ...

• HPC compute resources no longer homogeneous; many types and flavors now deployed in single HPC stacks

‣ Newer clusters mix-and-match to fit the known use cases:

• GPU nodes for compute

• GPU nodes for visualization

• Large memory nodes (512GB +)

• Very Large memory nodes (1TB +)

• ‘Fat’ nodes with many CPU cores

• ‘Thin’ nodes with super-fast CPUs

• Analytic nodes with SSD, FusionIO, flash or large local disk for ‘big data’ tasks


Big push for cloud use

Most use Amazon Web Services

‣ Many Orgs are pushing for cloud

‣ Unsupported scientists end up using cloud

‣ It’s fast, flexible, affordable, if done right

‣ If done wrong, way more expensive than local compute

‣ Biggest problem: getting data to it!


Storage & Data Management

‣ LifeSci core requirement:

• Shared, simultaneous read/write access across many instruments, desktops & HPC silos

‣ NAS = “easiest” option

• Scale Out NAS products are the mainstream standard

‣ Parallel & Distributed storage for edge cases and large organizations with known performance needs

• Becoming much more common: GPFS has taken hold in LifeSci



Storage & Data Management

‣ Storage & data mgmt. is the #1 infrastructure headache in life science environments

‣ Most labs need “peta capable” storage due to unpredictable future

• Only a small % will actually hit 1PB

• Often forced to trade away performance in order to obtain capacity

‣ Object stores, ZFS and commodity “NexentaStor-style” methods are making significant inroads



Data Movement & Data Sharing

‣ Peta-scale data movement needs

• Within an organization

• To/from collaborators

• To/from suppliers

• To/from public data repos

‣ Peta-scale data sharing needs

• Collaborators and partners may be all over the world


We Have Both Ingest Problems

Physical & Network

‣ Significant physical ingest occurring in Life Science

• Standard media: naked SATA drives shipped via FedEx

‣ Cliche example:

• 30 genomes outsourced means 30 drives will soon be sitting in your mail pile

‣ Organizations often use similar methods to freight data between buildings and among geographic sites



Physical Ingest Just Plain Nasty

‣ Most common high-speed network: FedEx

‣ Easy to talk about in theory

‣ Seems “easy” to scientists and even IT at first glance

‣ Really really nasty in practice

• Incredibly time consuming

• Significant operational burden

• Easy to do badly / lose data
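A rough way to see why shipping drives looks “easy” in theory: the sketch below estimates the effective bandwidth of a box of drives. The drive capacity (4 TB), the 24-hour transit time and the 48 hours of handling are illustrative assumptions, not figures from this talk.

```python
def shipment_gbps(drives: int, tb_per_drive: float, transit_hours: float,
                  handling_hours: float = 0.0) -> float:
    """Effective Gbit/s of a drive shipment, counting transit plus handling time."""
    total_bits = drives * tb_per_drive * 1e12 * 8
    total_seconds = (transit_hours + handling_hours) * 3600
    return total_bits / total_seconds / 1e9


# 30 drives, assumed 4 TB each, overnight shipping, no handling counted
print(f"{shipment_gbps(30, 4.0, 24):.1f} Gbit/s")                     # 11.1 Gbit/s
# Same shipment with an assumed 48 hours spent copying, labeling and verifying
print(f"{shipment_gbps(30, 4.0, 24, handling_hours=48):.1f} Gbit/s")  # 3.7 Gbit/s
```

The headline number looks great on paper. The handling term is where the operational burden shows up, and it is staff time rather than link capacity.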



Networking

‣ Major 2014 focus

‣ May surpass storage as our #1 infrastructure headache

‣ Why?

• Petascale storage meaningless if you can’t access/move it

• 10-Gig, 40-Gig and 100-Gig networking will force significant changes elsewhere in the ‘bio-IT’ infrastructure
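To put the “can’t access/move it” point in numbers, here is a minimal sketch of how long one decimal petabyte takes to cross a link at the three speeds named above. It assumes the link is fully available; the `efficiency` parameter is a made-up knob for modeling protocol overhead or shared links.

```python
def transfer_days(dataset_bytes: float, link_gbps: float,
                  efficiency: float = 1.0) -> float:
    """Days needed to move a dataset over a link at the given usable fraction."""
    seconds = dataset_bytes * 8 / (link_gbps * 1e9 * efficiency)
    return seconds / 86400


for gbps in (10, 40, 100):
    print(f"{gbps:>3} Gbit/s: {transfer_days(1e15, gbps):.2f} days per PB")
#  10 Gbit/s: 9.26 days per PB
#  40 Gbit/s: 2.31 days per PB
# 100 Gbit/s: 0.93 days per PB
```

Dropping `efficiency` to 0.5 doubles every figure, which is closer to what a busy shared network delivers.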



Network: Speed @ Core and Edge

‣ Remember 2004 when research storage requirements started to dwarf what the enterprise was using?

‣ Same thing is happening now for networking

‣ Research core, edge and top-of-rack networking speeds may exceed what the rest of the organization has standardized on


Network: ‘ScienceDMZ’

‣ Very fast “low-friction” network links and paths with security policy and enforcement specific to scientific workflows

‣ “ScienceDMZ” concept is real and necessary

‣ BioTeam will be building them in 2014 and beyond

‣ Central premise:

• Legacy firewall, network and security methods architected for “many small data flows” use cases

• Not built to handle smaller #s of massive data flows

• Also very hard to deploy ‘traditional’ security gear on 10 Gigabit and faster networks

‣ More details, background & documents at http://fasterdata.es.net/science-dmz/

Simple Science DMZ:

[Figure: 10GE links; DTN traffic with wire-speed bursts; background traffic or competing bursts]

Image source: “The Science DMZ: Introduction & Architecture” -- esnet, http://fasterdata.es.net/science-dmz/



What does it all mean?


What just happened?




The meta issue

Converged Infrastructure

‣ Individual technologies, each used successfully on its own, are fine

‣ Unless they all work together as a unified solution, it all means nothing

‣ Creating an end-to-end solution based on the use case (science!): converged infrastructure


Convergence

It’s what we do

[Figure: Converged Solution]



What’s the future look like?


Local Compute Appliances

Small desktop appliances; gateways

‣ BioTeam is developing appliances: affordable compute for labs without IT resources

‣ More of these will make their way into the field

‣ Act as gateways to larger resources

‣ Pre-packaged with best-practices, low IT burden


Hybrid HPC

Also known as hybrid clouds

‣ Relatively new idea

• small local footprint

• large, dynamic, scalable, orchestrated public cloud component

‣ DevOps is key to making this work

‣ High-speed network to public cloud required

‣ Software interface layer acting as the mediator between local and public resources

‣ Good for tight budgets, has to be done right to work

‣ Not many working examples yet


Scientific Networks

Commodity Science DMZs

‣ As data grows, need clean/fast networks

‣ Security needs to be addressed, but performance is key

‣ Internet2 enables direct, clean, long-distance, affordable fast networking


Commoditized Infrastructure

Reduce guessing game on scientists’ part

‣ Building blocks matched to use cases

‣ Blocks designed as converged infrastructure

‣ Let scientists get back to the science

‣ Make the infrastructure serve the science


Data Commons

Central storage of knowledge with compute

‣ Common structure for data storage and indexing (a cloud?)

‣ Associated compute for analytics

‣ Development platform for application development (PaaS)

‣ Make discovery more possible


Data Integration Systems

Don’t move data, move compute

[Figure: Data Integration Platform; Source 1, Source 2, Source 3; Analytics/Data Retrieval]

‣ PaaS is a meta-index and development platform

‣ Index knows where the data is

‣ System reads data as needed from remote sources

‣ Requires good networks
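The meta-index idea above can be sketched as follows. Everything here is hypothetical: the `MetaIndex` class, the dataset IDs and the in-memory dictionaries standing in for remote sources are invented for illustration, and a real platform would read over the network instead.

```python
from typing import Callable, Dict


class MetaIndex:
    """Knows where each dataset lives; reads it from the source only on demand."""

    def __init__(self) -> None:
        self._readers: Dict[str, Callable[[], bytes]] = {}

    def register(self, dataset_id: str, reader: Callable[[], bytes]) -> None:
        """Record how to fetch a dataset without copying it into the index."""
        self._readers[dataset_id] = reader

    def read(self, dataset_id: str) -> bytes:
        """Fetch the dataset from its source at the moment it is needed."""
        return self._readers[dataset_id]()


# Stand-ins for remote sources (hypothetical; a real reader would do network I/O)
SOURCE_1 = {"sample-a": b"ACGT"}
SOURCE_2 = {"sample-b": b"TTGA"}

index = MetaIndex()
index.register("sample-a", lambda: SOURCE_1["sample-a"])
index.register("sample-b", lambda: SOURCE_2["sample-b"])


def gc_count(dataset_id: str) -> int:
    """Example analytic: count G and C bases, reading via the index."""
    data = index.read(dataset_id)
    return data.count(b"G") + data.count(b"C")


print(gc_count("sample-a"))  # 2
print(gc_count("sample-b"))  # 1
```

The analytic never needs to know which source holds the data. Every `read` is a remote fetch in practice, which is why this pattern depends on good networks.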


Fully converged laboratories

Not so distant future!

[Figure: Converged Solution; New Bottleneck!]

[Figure: Laboratory; Knowledge]



end; Thanks!

Slides can be found at http://www.slideshare.com/arieberman

