time to science/time to results: transforming research in the cloud

Accelerating Time to Science:

Transforming Research in the Cloud

Jamie Kinney - @jamiekinney

Director of Scientific Computing, a.k.a. “SciCo” – Amazon Web Services

Michael Franklin - @amplab

Director, AMPLab - UC Berkeley

Agenda

• An introduction to scientific computing on AWS

• How are researchers using AWS today?

• Case study: The UC Berkeley AMP Lab

• Q & A

What do we mean by Scientific Computing?

Scientific Computing refers to the application of simulation,

mathematical modeling and quantitative analysis to analyze and

solve scientific problems.

How is AWS Used for Scientific Computing?

• High Performance Computing (HPC) for Engineering and Simulation

• High Throughput Computing (HTC) for Data-Intensive Analytics

• Hybrid Supercomputing centers

• Collaborative Research Environments

• Citizen Science

• Science-as-a-Service

Why do researchers love using AWS?

Time to Science: Access research infrastructure in minutes

Low Cost: Pay-as-you-go pricing

Elastic: Easily add or remove capacity

Globally Accessible: Easily collaborate with researchers around the world

Secure: A collection of tools to protect data and privacy

Scalable: Access to effectively limitless capacity

Why does AWS care about Scientific Computing?

• We want to improve our world by accelerating the pace of scientific discovery

• It is a great application of AWS with a broad customer base

• The scientific community helps us innovate on behalf of all customers

– Streaming data processing & analytics

– Exabyte scale data management solutions and exaflop scale compute

– Collaborative research tools and techniques

– New AWS regions

– Significant advances in low-power compute, storage and data centers

– Efficiencies which will lower our costs and therefore pricing for all customers

Research Grants

AWS provides free usage

credits to help researchers:

• Teach advanced courses

• Explore new projects

• Create resources for the

scientific community

aws.amazon.com/grants

Peering with all global research networks

Image courtesy John Hover - Brookhaven National Lab

Breaking news! Restricted-access genomics on

AWS

aws.amazon.com/genomics

How are researchers using AWS today?

High Throughput Computing at Scale

The Large Hadron Collider

@ CERN includes 6,000+

researchers from over 40

countries and produces

approximately 25PB of data

each year.

The ATLAS and CMS

experiments are using AWS

for Monte Carlo simulations

and analysis of LHC data.

Data-Intensive Computing

The Square Kilometer Array will link 250,000 radio

telescopes together, creating the world’s most

sensitive telescope. The SKA will generate zettabytes

of raw data, publishing exabytes annually over 30-40

years.

Researchers are using AWS to develop and test:

• Data processing pipelines

• Image visualization tools

• Exabyte-scale research data management

• Collaborative research environments

aws.amazon.com/solutions/case-studies/icrar/

High Performance Computing

Simulations in the Automotive Sector

• Crash and materials simulations

• Fluid and thermal dynamics simulations

• Car body aerodynamics

• Electronics and electromagnetic simulations

Honda materials science simulations on AWS:

• Deploying scalable HPC clusters on AWS Spot – up to 1000 C3 instances

• Running more simulations than before, for more accurate results

“Cloud offers us an opportunity, as we can innovate faster than before.”

- Ayumi Tada, IT System Administrator, Honda R&D

Schrodinger & Cycle Computing:

Computational Chemistry for Better Solar Power

Simulation by Mark Thompson of the

University of Southern California to see

which of 205,000 organic compounds

could be used for photovoltaic cells for

solar panel material.

Estimated computation time 264 years

completed in 18 hours.

• 156,314 core cluster, 8 regions

• 1.21 petaflops (Rpeak)

• $33,000 or 16¢ per molecule

Loosely

Coupled
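The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below assumes that the 264-year estimate refers to single-core compute time and that a year averages 365.25 days; neither assumption is stated on the slide.

```python
# Figures quoted on the slide.
compounds = 205_000
cost_usd = 33_000
cores = 156_314
estimated_years = 264          # estimated computation time
wall_clock_hours = 18          # actual run time on the cluster

HOURS_PER_YEAR = 365.25 * 24   # assumption: average calendar year

# Cost per screened compound.
cost_per_molecule = cost_usd / compounds

# Assumption: the 264-year estimate is single-core time.
sequential_hours = estimated_years * HOURS_PER_YEAR
speedup = sequential_hours / wall_clock_hours

# Lower bound on run time if the work scaled perfectly across all cores.
ideal_hours = sequential_hours / cores

print(f"cost per molecule: ${cost_per_molecule:.2f}")
print(f"speedup over the sequential estimate: {speedup:,.0f}x")
print(f"ideal run time on {cores:,} cores: {ideal_hours:.1f} hours")
```

Under these assumptions the per-molecule cost rounds to the quoted 16¢, and the ideal run time lands a little under the reported 18 hours, which is what one would expect from a loosely coupled workload.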

Science-as-a-Service

Globus Genomics, DNAnexus, and SevenBridges Genomics offer inexpensive, easy-

to-use, and secure platforms for processing and analyzing genomic data.

The Weather Company pushes four gigabytes of data to AWS

each second in order to deliver 15 billion forecasts each day

to their customers around the world.

aws.amazon.com/solutions/case-studies/the-weather-company/

Citizen Science

The Asteroid Data Hunters competition used AWS to develop better mechanisms for

finding near-Earth asteroids. The top algorithm is 18% better at finding asteroids!

Case Study: The UC Berkeley AMP Lab

Scalable Data-Driven

Science at the AMPLab

UC BERKELEY

Michael Franklin

April 9, 2015

AWS Summit SF

AMPLab Overview

• 80+ Students, Postdocs, Faculty and Staff from:

Databases, Machine Learning, Systems, Security, and Networking

• 28 Industry Sponsors +

White House Big Data Program:

NSF CISE Expeditions in Computing and DARPA XData

• Founding Sponsors:

“… Berkeley’s AMPLab has already left an indelible mark on the world of

information technology, and even the web. But we haven’t yet experienced

the full impact of the group … Not even close.”

– Derrick Harris, GigaOM, Aug 2, 2014

Franklin, Jordan, Stoica, Patterson, Shenker, Recht, Katz, Joseph, Goldberg, Culler

http://www.nsf.gov/start.htm


AMPLab: Integrating 3

Resources

Algorithms

• Machine Learning, Statistical Methods

• Prediction, Business Intelligence

Machines

• Clusters and Clouds

• Warehouse Scale Computing

People

• Crowdsourcing, Human Computation

• Data Scientists, Analysts

Berkeley Data Analytics Stack

(Apache and BSD open source)

Resource

Virtualization

Storage

Processing

Engine

Access and

Interfaces

In-house

Apps

Open Source Community Building

MeetUp on MLbase @Twitter (Aug 6, 2013)

Spark Summit SF (June 30, 2014)

Apps: Genomics Patterson et al.

Using BDAS, SNAP (Scalable Nucleotide

Alignment) aligns in minutes vs. days

Why Speed Matters: A real-world use case

ADAM – Data formats and Processing

Patterns for Genomics on Big Data Platforms

(e.g., Spark)

Collaborations with: UCSF, UCSC, OHSU,

Microsoft Research, Mt. Sinai

M. Wilson, …, and C. Chiu, “Actionable Diagnosis of Neuroleptospirosis by Next-Generation Sequencing”,

June 4, 2014, New England Journal of Medicine.


In a First, Test of DNA Finds Root of Illness By CARL ZIMMER JUNE 4, 2014

Joshua Osborn, 14, lay in a coma at American Family Children’s Hospital in Madison,

Wis. For weeks his brain had been swelling with fluid, and a battery of tests had failed to

reveal the cause.

The doctors told his parents, Clark and Julie, that they wanted to run one more test with

an experimental new technology. Scientists would search Joshua’s cerebrospinal fluid for

pieces of DNA. Some of them might belong to the pathogen causing his encephalitis.

The Osborns agreed, although they were skeptical that the test would succeed where so

many others had failed. But in the first procedure of its kind, researchers at the University

of California, San Francisco, managed to pinpoint the cause of Joshua’s problem —

within 48 hours. He had been infected with an obscure species of bacteria. Once

identified, it was eradicated within days.

The case, reported on Wednesday in The New England Journal of Medicine, signals an

important advance in the science of diagnosis. For years, scientists have been sequencing

DNA to identify pathogens. But until now, the process has been too cumbersome to yield

useful information about an individual patient in a life-threatening emergency.

“This is an absolutely great story — it’s a tremendous tour de force,” said Tom Slezak,

the leader of the pathogen informatics team at the Lawrence Livermore National

Laboratory, who was not involved in the study.

Mr. Slezak and other experts noted that it would take years of further research before

such a test might become approved for regular use. But it could be immensely useful: Not

only might it provide speedy diagnoses to critically ill patients, they said, it could lead to

more effective treatments for maladies that can be hard to identify, such as Lyme disease.

Diagnosis is a crucial step in medicine, but it can also be the most difficult. Doctors

usually must guess the most likely causes of a medical problem and then order individual

tests to see which is the right diagnosis.

The guessing game can waste precious time. The causes of some conditions, like

encephalitis, can be so hard to diagnose that doctors often end up with no answer at all.

“About 60 percent of the time, we never make a diagnosis” in encephalitis, said Dr.

Michael R. Wilson, a neurologist at the University of California, San Francisco, and an

author of the new paper. “It’s frustrating whenever someone is doing poorly, but it’s

especially frustrating when we can’t even tell the parents what the hell is going on.”

For the last decade, researchers at the university have been working on methods for

identifying pathogens based on their DNA. In 2003 Dr. Joseph DeRisi, a biochemist at

https://amplab.cs.berkeley.edu/2014/06/04/snap-helps-save-a-life/

SNAP

Carat Collaborative Battery App


750,000+

downloads

Big Data Ecosystem

Evolution

MapReduce

Pregel

Dremel

GraphLab

Storm

Giraph

Drill Tez

Impala

S4 …

Specialized systems (iterative, interactive and

streaming apps)

General batch

processing

AMPLab Unification Philosophy

Don’t specialize MapReduce – generalize it!

Two additions to Hadoop MR can enable all the models shown earlier!

1. General Task DAGs

2. Data Sharing

For Users:

Fewer Systems to Use

Less Data Movement

[Diagram: Spark Streaming, GraphX, Spark SQL, MLbase, … layered on Spark, an in-memory dataflow system]

M. Zaharia, M. Chowdhury, M. Franklin, I. Stoica, S. Shenker, “Spark: Cluster Computing with Working Sets”, USENIX HotCloud, 2010.

“It’s only September but it’s already clear that 2014 will

be the year of Apache Spark”

-- Datanami, 9/15/14

• Developed in AMPLab and its predecessor the RADLab

• Alternative to Hadoop MapReduce

• 10-100x speedup for ML and interactive queries

• Central component of the BDAS Stack

• “Graduated” to Apache Foundation -> Apache Spark

Apache Spark Contributors:

[Chart: Apache Spark contributors, 2011 to 2014; vertical axis 0 to 100]

400+ contributors to current release

Apache Spark: Compared to Other Projects

[Charts: commits (axis 0 to 2,000) and lines of code changed (axis 0 to 350,000) for MapReduce, YARN, HDFS, Storm, and Spark; activity in past 6 months]

2-3x more activity than: Hadoop, Storm, MongoDB, NumPy, D3, Julia, …

Iteration in MapReduce

[Diagram: training data feeds repeated Map and Reduce passes that refine an initial model w(0) through w(1), w(2), and w(3) into a learned model]

Cost of Iteration in MapReduce

[Diagram: the same loop; every pass repeatedly loads the same training data]

Cost of Iteration in MapReduce

[Diagram: the same loop; output is redundantly saved between stages]

Dataflow View

[Diagram: training data (HDFS) flows through three successive Map and Reduce stages]

Memory Opt. Dataflow

[Diagram: the same pipeline; training data is loaded once and cached]

Memory Opt. Dataflow View

[Diagram: the same pipeline; data moves efficiently between stages]
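The cost the diagrams describe can be illustrated with a toy, single-process sketch. It is not Spark or Hadoop code: `CountingSource` is a hypothetical stand-in for training data on HDFS, and the model is a one-parameter least-squares fit chosen only to keep the example short.

```python
class CountingSource:
    """Hypothetical stand-in for training data on HDFS; counts full reads."""

    def __init__(self, points):
        self._points = list(points)
        self.loads = 0

    def load(self):
        self.loads += 1
        return list(self._points)


def gradient_step(w, points, lr=0.1):
    # One gradient-descent step for least squares on y = w * x.
    grad = sum(2 * (w * x - y) * x for x, y in points) / len(points)
    return w - lr * grad


def train_reloading(source, iterations):
    # MapReduce-style iteration: every pass re-reads the training data.
    w = 0.0
    for _ in range(iterations):
        w = gradient_step(w, source.load())
    return w


def train_cached(source, iterations):
    # Memory-optimized dataflow: load once, keep the data in memory.
    points = source.load()
    w = 0.0
    for _ in range(iterations):
        w = gradient_step(w, points)
    return w


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # samples of y = 2x
reloading_source = CountingSource(data)
cached_source = CountingSource(data)

w_reloading = train_reloading(reloading_source, iterations=10)
w_cached = train_cached(cached_source, iterations=10)

print("loads with reloading:", reloading_source.loads)
print("loads with caching:", cached_source.loads)
```

Both loops compute the same model; the only difference is how many times the training data is read, which is the cost the cached dataflow removes.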

Spark: 10-100× faster than Hadoop MapReduce

Resilient Distributed Datasets (RDDs)

API: coarse-grained transformations (map, group-by, join, sort, filter, sample, …) on immutable collections

» Collections of objects that can be stored in memory or disk across a cluster

» Built via parallel transformations (map, filter, …)

» Automatically rebuilt on failure

Rich enough to capture many models:

» Data flow models: MapReduce, Dryad, SQL, …

» Specialized models: Pregel, Hama, …

M. Zaharia, et al, Resilient Distributed Datasets: A fault-tolerant abstraction for in-memory cluster computing, NSDI 2012.
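To make "automatically rebuilt on failure" concrete, here is a minimal single-process sketch of the lineage idea. `ToyRDD` is a hypothetical class written for this illustration, not the Spark API: each dataset remembers its parent and the transformation that produced it, so a lost in-memory copy can be recomputed. For simplicity it keeps every computed result in memory, which real RDDs do only on request.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical; not the Spark API)."""

    def __init__(self, compute, parent=None):
        self._compute = compute      # function: parent rows -> this dataset's rows
        self._parent = parent        # lineage: where the input comes from
        self._cache = None           # in-memory copy, may be lost
        self.computations = 0        # how many times the rows were (re)built

    @classmethod
    def from_list(cls, items):
        data = tuple(items)
        return cls(lambda _rows: data)

    def map(self, fn):
        # Transformations return a new immutable dataset.
        return ToyRDD(lambda rows: tuple(fn(r) for r in rows), parent=self)

    def filter(self, pred):
        return ToyRDD(lambda rows: tuple(r for r in rows if pred(r)), parent=self)

    def collect(self):
        if self._cache is None:
            parent_rows = self._parent.collect() if self._parent is not None else None
            self._cache = self._compute(parent_rows)
            self.computations += 1
        return list(self._cache)

    def lose_partition(self):
        """Simulate a node failure by dropping the in-memory copy."""
        self._cache = None


base = ToyRDD.from_list(range(10))
even_squares = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

before_failure = even_squares.collect()
even_squares.lose_partition()
after_failure = even_squares.collect()   # rebuilt from lineage, not from a replica
```

The design point is that fault tolerance comes from recording how the data was derived rather than from replicating the data itself.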

Abstraction: Dataflow Operators

map

filter

groupBy

sort

union

join

leftOuterJoin

rightOuterJoin

reduce

count

fold

reduceByKey

groupByKey

cogroup

cross

zip

sample

take

first

partitionBy

mapWith

pipe

save

...
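A few of these operators can be mimicked on ordinary Python lists to show what they compute. The helpers `reduce_by_key` and `group_by_key` below are illustrative local functions, not Spark calls, and the input lines are made up for the example.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, fn):
    """Combine the values that share a key using an associative function."""
    out = {}
    for key, value in pairs:
        out[key] = fn(out[key], value) if key in out else value
    return out


def group_by_key(pairs):
    """Collect all values that share a key into a list."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


lines = ["spark makes iteration fast", "spark shares data in memory"]

words = [w for line in lines for w in line.split()]     # split lines into words
long_words = filter(lambda w: len(w) > 2, words)        # filter
pairs = list(map(lambda w: (w, 1), long_words))         # map

counts = reduce_by_key(pairs, lambda a, b: a + b)       # reduceByKey
groups = group_by_key(pairs)                            # groupByKey
total = reduce(lambda a, b: a + b, counts.values())     # reduce

print(counts)
```

The reduce-by-key form combines values as it goes, while the group-by-key form materializes every value for a key first; that difference in intermediate data is why the two appear as separate operators.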

Apache Spark v1.3 (3/15)

Includes

» Spark (core)

» Spark Streaming

» GraphX

» MLlib

» Spark SQL – Query Processing

Wide range of interfaces:

» Enhanced DataFrames API

» Python / interactive ipython

» Scala / interactive scala shell

» R / interactive R-shell

» Java

Now included in all major Hadoop distributions

Data Intensive Genomics

New population-scale experiments will sequence 10-100k samples

• 100k samples @ 60x WGS will generate ~20PB of read data and ~300TB of genotype data

End-to-end pipeline latency is important to clinical work

We want to jointly analyze samples to uncover low

frequency variations
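A quick back-of-the-envelope calculation puts the population-scale figures in per-sample terms. The sketch assumes decimal units (1 PB = 1,000 TB = 1,000,000 GB), which the slide does not specify.

```python
# Figures quoted on the slide.
samples = 100_000
read_data_pb = 20        # ~20 PB of read data
genotype_data_tb = 300   # ~300 TB of genotype data

# Assumption: decimal units.
GB_PER_TB = 1_000
GB_PER_PB = 1_000_000

read_gb_per_sample = read_data_pb * GB_PER_PB / samples
genotype_gb_per_sample = genotype_data_tb * GB_PER_TB / samples

print(f"read data per sample: ~{read_gb_per_sample:.0f} GB")
print(f"genotype data per sample: ~{genotype_gb_per_sample:.0f} GB")
```

Reads dominate storage by well over an order of magnitude, which is why the read format and layout get so much attention in the slides that follow.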

How can we improve analysis productivity?

Flat file formats sacrifice interoperability but do not improve performance

Common sort order invariants imposed by tools compromise

correctness

Genomics APIs tend to be at a lower level of abstraction, which

compromises productivity

ADAM

An open source, high performance, distributed platform for genomic analysis

ADAM defines a:

1. Data schema and layout on disk*

2. Programming interface for distributed processing of genomic data**

3. Command line interface

* Via Parquet and Avro

** Work on Python integration is underway

Data Model is the "Narrow Waist"

Data Format

Schema can be updated without

breaking backwards compatibility

Normalize metadata fields into schema

for O(1) metadata access

Models are “dumb”; enhance as

necessary with rich objects

record AlignmentRecord {

union { null, Contig } contig = null;

union { null, long } start = null;

union { null, long } end = null;

union { null, int } mapq = null;

union { null, string } readName = null;

union { null, string } sequence = null;

union { null, string } mateReference = null;

union { null, long } mateAlignmentStart = null;

union { null, string } cigar = null;

union { null, string } qual = null;

union { null, string } recordGroupName = null;

union { int, null } basesTrimmedFromStart = 0;

union { int, null } basesTrimmedFromEnd = 0;

union { boolean, null } readPaired = false;

union { boolean, null } properPair = false;

union { boolean, null } readMapped = false;

union { boolean, null } mateMapped = false;

union { boolean, null } firstOfPair = false;

union { boolean, null } secondOfPair = false;

union { boolean, null } failedVendorQualityChecks = false;

union { boolean, null } duplicateRead = false;

union { boolean, null } readNegativeStrand = false;

union { boolean, null } mateNegativeStrand = false;

union { boolean, null } primaryAlignment = false;

union { boolean, null } secondaryAlignment = false;

union { boolean, null } supplementaryAlignment = false;

union { null, string } mismatchingPositions = null;

union { null, string } origQual = null;

union { null, string } attributes = null;

union { null, string } recordGroupSequencingCenter = null;

union { null, string } recordGroupDescription = null;

union { null, long } recordGroupRunDateEpoch = null;

union { null, string } recordGroupFlowOrder = null;

union { null, string } recordGroupKeySequence = null;

union { null, string } recordGroupLibrary = null;

union { null, int } recordGroupPredictedMedianInsertSize = null;

union { null, string } recordGroupPlatform = null;

union { null, string } recordGroupPlatformUnit = null;

union { null, string } recordGroupSample = null;

union { null, Contig } mateContig = null;

}

Schemas at https://www.github.com/bigdatagenomics/bdg-formats
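The claim that the schema can be updated without breaking backwards compatibility rests on every field having a default. The sketch below imitates that idea with Python dataclasses rather than Avro itself: it uses a handful of the fields above, and `exampleNewField` and `read_as` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class AlignmentRecordV1:
    # A small subset of the AlignmentRecord fields, each with a default.
    readName: Optional[str] = None
    sequence: Optional[str] = None
    start: Optional[int] = None
    end: Optional[int] = None
    mapq: Optional[int] = None
    readMapped: bool = False


@dataclass
class AlignmentRecordV2(AlignmentRecordV1):
    # Hypothetical field added in a later schema version.
    exampleNewField: Optional[str] = None


def read_as(schema, stored):
    """Decode a stored record with a reader schema.

    Fields the reader does not know are ignored; fields missing from the
    stored record fall back to the reader's defaults.
    """
    known = {f.name for f in fields(schema)}
    return schema(**{k: v for k, v in stored.items() if k in known})


old_row = {"readName": "r1", "sequence": "ACGT", "start": 100,
           "end": 104, "mapq": 60, "readMapped": True}
new_row = {**old_row, "exampleNewField": "x"}

upgraded = read_as(AlignmentRecordV2, old_row)    # old data, new reader
downgraded = read_as(AlignmentRecordV1, new_row)  # new data, old reader
```

Old data read with the new schema picks up the default for the added field, and new data read with the old schema simply drops it, so neither direction fails.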


Parquet: A Modern Big Data Storage Format

ASF Incubator project, based on Google Dremel

High performance columnar store with

support for projections and push-down

predicates

Short read data stored in Parquet achieves a

25% improvement in size over compressed

BAM

Enables scale-out using modern Big Data

technology (e.g., Spark)

Image from Parquet format definition: https://www.github.com/apache/incubator-parquet-format
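Projections and push-down predicates can be sketched on a tiny in-memory columnar layout. This is an illustration of the idea only, not the Parquet file format or any Parquet library: rows are split into groups, each group stores its columns separately with min/max statistics, and a scan skips groups whose statistics rule out a match. All function and field names are assumptions made for the example.

```python
def make_row_groups(rows, group_size):
    """Split rows into groups and store each group column by column."""
    groups = []
    for i in range(0, len(rows), group_size):
        chunk = rows[i:i + group_size]
        columns = {name: [r[name] for r in chunk] for name in chunk[0]}
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        groups.append({"columns": columns, "stats": stats})
    return groups


def scan(groups, project, column, lo, hi):
    """Return the projected columns of rows where lo <= row[column] <= hi."""
    out, groups_read = [], 0
    for group in groups:
        gmin, gmax = group["stats"][column]
        if gmax < lo or gmin > hi:
            continue                      # predicate push-down: skip the whole group
        groups_read += 1
        for idx, key in enumerate(group["columns"][column]):
            if lo <= key <= hi:
                # Projection: touch only the requested columns.
                out.append({name: group["columns"][name][idx] for name in project})
    return out, groups_read


# Made-up reads, sorted by start position.
rows = [{"readName": f"r{i}", "start": i * 100, "sequence": "ACGT"} for i in range(8)]
groups = make_row_groups(rows, group_size=4)

matches, groups_read = scan(groups, project=["readName", "start"],
                            column="start", lo=450, hi=650)
```

Here the first group is skipped from its statistics alone and the `sequence` column is never read, which is where the I/O savings of a columnar store come from.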


ADAM’s API

ADAM is built on top of Apache Spark, which provides the RDD

abstraction -> distributed arrays

Common primitives include:

• Aggregates: BQSR, Indel Realignment

• Bucketing: Duplicate Marking, Concordance

• Region Joins: Variant Calling and Filtration
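Of the three primitives, the region join is the least familiar outside genomics: it pairs records whose genomic intervals overlap. The sketch below is a simple single-machine version that buckets regions by contig and tests each pair; it is not ADAM's distributed implementation, and the record layout, names, and half-open interval convention are assumptions.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if the half-open intervals [a_start, a_end) and [b_start, b_end) overlap."""
    return a_start < b_end and b_start < a_end


def region_join(reads, regions):
    """Pair each read with every region on the same contig that it overlaps."""
    by_contig = {}
    for region in regions:
        by_contig.setdefault(region["contig"], []).append(region)

    joined = []
    for read in reads:
        for region in by_contig.get(read["contig"], []):
            if overlaps(read["start"], read["end"], region["start"], region["end"]):
                joined.append((read["name"], region["name"]))
    return joined


# Made-up example data.
reads = [
    {"name": "r1", "contig": "chr1", "start": 100, "end": 150},
    {"name": "r2", "contig": "chr1", "start": 400, "end": 450},
    {"name": "r3", "contig": "chr2", "start": 100, "end": 150},
]
regions = [
    {"name": "geneA", "contig": "chr1", "start": 120, "end": 300},
    {"name": "geneB", "contig": "chr2", "start": 0, "end": 100},
    {"name": "geneC", "contig": "chr1", "start": 140, "end": 420},
]

pairs = region_join(reads, regions)
```

Unlike an equality join, the match condition is interval overlap, so a read can join with several regions or with none.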

ADAM Performance Bottom Line

F. Nothaft, et al., “Rethinking Data-Intensive

Science Using Scalable Analytics Systems”,

ACM SIGMOD Conf., June 2015, to appear.

$214.39

$78.92

ADAM Performance Update

Analysis run using Amazon EC2, single node was hs1.8xlarge, cluster was m2.4xlarge

Scripts available at https://www.github.com/fnothaft/bdg-recipes.git, “sigmod” branch

Achieve linear scalability out to

128 nodes for most tasks

2-4x improvement over {GATK,

samtools, Picard} on single node


Scalable Analytics for Science

Data Model is the “narrow waist” of the architecture

Modern “NoSQL” models support evolution and heterogeneity with high

performance.

BDAS Declarative Analytics: Specify What not How

MLbase chooses:

• Algorithms/Operators

• Ordering and Physical Placement

• Parameter and Hyperparameter Settings

• Featurization

Leverages BDAS (Spark, GraphX, Tachyon) and Hadoop File System

for Speed and Scale
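"Specify what, not how" can be pictured as a caller handing over data and a goal while the system searches over algorithms and parameter settings. The sketch below is a toy illustration of that idea in plain Python, not the MLbase API; `choose_model`, the two candidate predictors, and the data are all hypothetical.

```python
def knn_predict(train, x, k):
    # k-nearest-neighbour vote on a one-dimensional feature.
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0


def threshold_predict(train, x, t):
    # Fixed decision threshold; ignores the training data.
    return 1 if x > t else 0


def accuracy(predict, train, valid, param):
    hits = sum(predict(train, x, param) == y for x, y in valid)
    return hits / len(valid)


def choose_model(train, valid):
    """Declarative entry point: the caller supplies data, not an algorithm."""
    search_space = (
        [("knn", knn_predict, k) for k in (1, 3, 5)]
        + [("threshold", threshold_predict, t) for t in (2.0, 5.0, 8.0)]
    )
    return max(search_space, key=lambda c: accuracy(c[1], train, valid, c[2]))


# Made-up data: label is 1 for feature values above roughly 5.
train = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
valid = [(0.5, 0), (3.5, 0), (4.5, 0), (5.5, 1), (6.5, 1), (9.5, 1)]

name, predict, param = choose_model(train, valid)
best_accuracy = accuracy(predict, train, valid, param)
print(name, param, best_accuracy)
```

The caller never names an algorithm or a hyperparameter; swapping in a smarter search or more candidates changes nothing on the calling side.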

To find out more or get

involved:

amplab.berkeley.edu


UC BERKELEY

Thanks to NSF CISE Expeditions in Computing, DARPA XData,

Founding Sponsors: Amazon Web Services, Google, and SAP,

the Thomas and Stacy Siebel Foundation,

all our industrial sponsors and partners, and all the members of the AMPLab Team.

Additional resources…

• aws.amazon.com/hpc

• aws.amazon.com/big-data

• aws.amazon.com/grants

• aws.amazon.com/genomics

• aws.amazon.com/compliance

• aws.amazon.com/security

Thank you!

Jamie Kinney

@jamiekinney
