rethinking data-intensive science using scalable analytics systems

Rethinking Data-Intensive Science Using Scalable Analytics Systems

Frank Austin Nothaft

UC Berkeley AMP/ASPIRE Lab, @fnothaft

With Matt Massie, Timothy Danford, Zhao Zhang, Uri Laserson, Carl Yeksigian, Jey Kottalam, Arun Ahuja, Jeff Hammerbacher, Michael Linderman, Michael J. Franklin, Anthony D. Joseph, David A. Patterson

Upload: fnothaft

Post on 04-Aug-2015


TRANSCRIPT


Page 2: Rethinking Data-Intensive Science Using Scalable Analytics Systems

Scientific revolutions are driven by data acquisition revolutions

Page 3: Rethinking Data-Intensive Science Using Scalable Analytics Systems

Genome Sequencing

Source: NIH National Human Genome Research Institute

2014: ~230,000 genomes sequenced at 15-250GB/genome = ~30TB/day = ~10PB/year

Human Genome Project: ~10GB

1000 Genomes: 15TB

TCGA: 3PB
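
As a rough sanity check on the 2014 figures above, the short Python sketch below redoes the arithmetic. It assumes decimal units (1 PB = 1,000 TB = 1,000,000 GB) and a 365-day year; neither assumption is stated on the slide, and the function names are made up for this illustration.

```python
# Rough check of the slide's data-volume figures.
# Assumptions (not stated on the slide): decimal units and a 365-day year.
GENOMES_2014 = 230_000
GB_PER_TB = 1_000
TB_PER_PB = 1_000


def yearly_petabytes(tb_per_day: float, days: int = 365) -> float:
    """Convert a daily rate in TB into a yearly volume in PB."""
    return tb_per_day * days / TB_PER_PB


def implied_gb_per_genome(pb_per_year: float, genomes: int) -> float:
    """Average per-genome size implied by a yearly total."""
    return pb_per_year * TB_PER_PB * GB_PER_TB / genomes


per_year = yearly_petabytes(30)  # the slide's ~30TB/day
avg = implied_gb_per_genome(per_year, GENOMES_2014)
print(f"{per_year:.2f} PB/year, ~{avg:.0f} GB/genome on average")
```

Under these assumptions, ~30TB/day comes to roughly 11PB/year, and the implied average of about 48GB per genome sits inside the 15-250GB range quoted on the slide.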

Page 4: Rethinking Data-Intensive Science Using Scalable Analytics Systems

Sequencing advances line up well with scalable analytics software

Source: NIH National Human Genome Research Institute

Figure labels: Google MapReduce, Hadoop MR, Spark, Parquet

Page 5: Rethinking Data-Intensive Science Using Scalable Analytics Systems

Mapping scientific systems to commodity analytics systems

• Contemporary scientific systems are custom-built

• Leads to functionality from commodity systems being rebuilt

• We have an opportunity to rethink the abstractions that scientific systems use:

• Migrate from a flat architecture to a stacked architecture

• Expose higher level programming primitives

• Use commodity tools wherever possible

Page 6: Rethinking Data-Intensive Science Using Scalable Analytics Systems

Common Traits of Legacy Data Intensive Scientific Systems

1. Computation is workflow/pipeline oriented

2. Processing system has monolithic/flat architecture

3. Data is stored in flat files

Page 7: Rethinking Data-Intensive Science Using Scalable Analytics Systems

Genomics Pipelines

Source: The Broad Institute of MIT/Harvard

Page 8: Rethinking Data-Intensive Science Using Scalable Analytics Systems

Flat File Formats

• Scientific data is typically stored in application-specific file formats:

• Genomic reads: SAM/BAM, CRAM

• Genomic variants: VCF/BCF, MAF

• Genomic features: BED, NarrowPeak, GTF

• Centralized metadata makes it difficult to parallelize applications

Page 9: Rethinking Data-Intensive Science Using Scalable Analytics Systems

Flat Architectures

• APIs present very barebones abstractions:

• GATK: Sorted iterator over the genome

• Why are flat architectures bad?

1. Trivial: low level abstractions are not productive

2. Trivial: flat architectures create technical lock-in

3. Subtle: low level abstractions can introduce bugs

Page 10: Rethinking Data-Intensive Science Using Scalable Analytics Systems

The perils of flattening…

• The trivial:

• You can improve performance by pushing data access order into your data layout

• But now, you can’t easily compose pipeline stages that have different access orders

• The obscure:

• If you access data via a sorted iterator, will you incorrectly implement your algorithm?
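
One way the "obscure" failure can arise is sketched below in Python. This is a hypothetical illustration, not an example taken from GATK or from the talk: the read tuples and the `pair_mates_*` helpers are invented here. A stage that pairs mates by read name is written against an iterator sorted by position. Because `itertools.groupby` only merges adjacent items, it silently splits pairs whose mates are not neighbors in coordinate order.

```python
from collections import defaultdict
from itertools import groupby

# Hypothetical reads as (read_name, contig, start), already sorted by
# position, as a coordinate-sorted iterator would deliver them.
reads_by_position = [
    ("readA", "chr1", 100),
    ("readB", "chr1", 150),
    ("readA", "chr1", 400),  # mate of readA, not adjacent to it
    ("readB", "chr1", 420),  # mate of readB
]


def pair_mates_naive(sorted_reads):
    """Buggy: groupby only merges *adjacent* records that share a key."""
    return [list(group) for _, group in groupby(sorted_reads, key=lambda r: r[0])]


def pair_mates_by_key(reads):
    """Correct: group on the read name regardless of input order."""
    groups = defaultdict(list)
    for read in reads:
        groups[read[0]].append(read)
    return list(groups.values())


naive = pair_mates_naive(reads_by_position)
correct = pair_mates_by_key(reads_by_position)
print(len(naive), "groups from the position-sorted iterator")
print(len(correct), "groups when grouping by read name")
```

The naive version runs without error and returns plausible-looking output, which is what makes this class of bug easy to miss when the access order is baked into the abstraction.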

Page 11: Rethinking Data-Intensive Science Using Scalable Analytics Systems

A green field approach

Page 12: Rethinking Data-Intensive Science Using Scalable Analytics Systems

First, define a schema

record AlignmentRecord {
  union { null, Contig } contig = null;
  union { null, long } start = null;
  union { null, long } end = null;
  union { null, int } mapq = null;
  union { null, string } readName = null;
  union { null, string } sequence = null;
  union { null, string } mateReference = null;
  union { null, long } mateAlignmentStart = null;
  union { null, string } cigar = null;
  union { null, string } qual = null;
  union { null, string } recordGroupName = null;
  union { int, null } basesTrimmedFromStart = 0;
  union { int, null } basesTrimmedFromEnd = 0;
  union { boolean, null } readPaired = false;
  union { boolean, null } properPair = false;
  union { boolean, null } readMapped = false;
  union { boolean, null } mateMapped = false;
  union { boolean, null } firstOfPair = false;
  union { boolean, null } secondOfPair = false;
  union { boolean, null } failedVendorQualityChecks = false;
  union { boolean, null } duplicateRead = false;
  union { boolean, null } readNegativeStrand = false;
  union { boolean, null } mateNegativeStrand = false;
  union { boolean, null } primaryAlignment = false;
  union { boolean, null } secondaryAlignment = false;
  union { boolean, null } supplementaryAlignment = false;
  union { null, string } mismatchingPositions = null;
  union { null, string } origQual = null;
  union { null, string } attributes = null;
  union { null, string } recordGroupSequencingCenter = null;
  union { null, string } recordGroupDescription = null;
  union { null, long } recordGroupRunDateEpoch = null;
  union { null, string } recordGroupFlowOrder = null;
  union { null, string } recordGroupKeySequence = null;
  union { null, string } recordGroupLibrary = null;
  union { null, int } recordGroupPredictedMedianInsertSize = null;
  union { null, string } recordGroupPlatform = null;
  union { null, string } recordGroupPlatformUnit = null;
  union { null, string } recordGroupSample = null;
  union { null, Contig } mateContig = null;
}

Stack layers, top to bottom:

• Application: Transformations

• Presentation: Enriched Models

• Evidence Access: MapReduce/DBMS

• Schema: Data Models

• Materialized Data: Columnar Storage

• Data Distribution: Parallel FS

• Physical Storage: Attached Storage

Page 13: Rethinking Data-Intensive Science Using Scalable Analytics Systems


A schema provides a narrow waist


Page 14: Rethinking Data-Intensive Science Using Scalable Analytics Systems

Accelerate common access patterns

• In genomics, we commonly have to find observations that overlap in a coordinate plane

• This coordinate plane is genomics specific, and is known a priori

• We can use our knowledge of the coordinate plane to implement a fast overlap join
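
The idea in the bullets above can be sketched in Python as follows. This is a minimal single-machine illustration, not ADAM's implementation: because the coordinate space (contig plus position) is known in advance, each region can be assigned to fixed-width bins, and only regions that share a bin need to be compared. The `Region` type, the `BIN_SIZE` value, and the function names are assumptions made for this example.

```python
from collections import defaultdict
from typing import NamedTuple


class Region(NamedTuple):
    contig: str
    start: int  # inclusive
    end: int    # exclusive; assumed to be greater than start


BIN_SIZE = 1_000  # hypothetical bin width


def bins(region):
    """Yield a (contig, bin index) key for every bin the region touches."""
    first = region.start // BIN_SIZE
    last = (region.end - 1) // BIN_SIZE
    for b in range(first, last + 1):
        yield (region.contig, b)


def overlaps(a, b):
    """True when two half-open regions on the same contig intersect."""
    return a.contig == b.contig and a.start < b.end and b.start < a.end


def overlap_join(left, right):
    """Return every (left, right) pair of overlapping regions."""
    index = defaultdict(list)
    for r in right:
        for key in bins(r):
            index[key].append(r)
    pairs = set()  # a set, because a pair can meet in more than one bin
    for l in left:
        for key in bins(l):
            for r in index.get(key, ()):
                if overlaps(l, r):
                    pairs.add((l, r))
    return sorted(pairs)


reads = [Region("chr1", 100, 200), Region("chr1", 950, 1050), Region("chr2", 100, 200)]
features = [Region("chr1", 150, 1500), Region("chr2", 500, 600)]
joined = overlap_join(reads, features)
for read, feature in joined:
    print(read, "overlaps", feature)
```

In a distributed setting the bin key would be a natural partitioning key, so that regions sharing a bin land on the same worker; that step is left out of this sketch.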


Page 15: Rethinking Data-Intensive Science Using Scalable Analytics Systems

Pick appropriate storage

• When accessing scientific datasets, we frequently slice and dice the dataset:

• Algorithms may touch subsets of columns

• We don’t always touch the whole dataset

• This is a good match for columnar storage
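
A toy Python sketch of why column subsets favor a columnar layout is shown below. It uses in-memory lists rather than a real columnar format such as Parquet, and the "values touched" counts are a simplified stand-in for I/O cost. The rows are hypothetical; only the field names are borrowed from the AlignmentRecord schema.

```python
# Hypothetical rows using a few field names from the AlignmentRecord schema.
rows = [
    {"readName": "r1", "start": 100, "mapq": 60, "sequence": "ACGT"},
    {"readName": "r2", "start": 150, "mapq": 20, "sequence": "TTGA"},
    {"readName": "r3", "start": 900, "mapq": 40, "sequence": "GGCC"},
]


def to_columns(records):
    """Pivot row records into one list per field (a columnar layout)."""
    return {name: [rec[name] for rec in records] for name in records[0]}


def mean_mapq_row_store(records):
    """Row layout: whole records are loaded even though one field is used."""
    values_touched = sum(len(rec) for rec in records)
    mean = sum(rec["mapq"] for rec in records) / len(records)
    return mean, values_touched


def mean_mapq_column_store(columns):
    """Columnar layout: only the mapq column is loaded."""
    col = columns["mapq"]
    return sum(col) / len(col), len(col)


columns = to_columns(rows)
row_mean, row_touched = mean_mapq_row_store(rows)
col_mean, col_touched = mean_mapq_column_store(columns)
print(f"row store:    mean={row_mean}, values touched={row_touched}")
print(f"column store: mean={col_mean}, values touched={col_touched}")
```

Both layouts give the same answer, but the columnar one reads a quarter of the values here; the gap widens as the schema gains fields that a given algorithm never looks at.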


Page 16: Rethinking Data-Intensive Science Using Scalable Analytics Systems

Is introducing a new data model really a good idea?

Source: XKCD, http://xkcd.com/927/

Page 17: Rethinking Data-Intensive Science Using Scalable Analytics Systems

A subtle point: Proper stack design can simplify backwards compatibility

To support legacy data formats, you define a way to serialize/deserialize the schema into/from the legacy flat file format!

• Stack with legacy file format: Schema (Data Models) over Materialized Data (Legacy File Format) over Data Distribution

• Stack with columnar storage: Schema (Data Models) over Materialized Data (Columnar Storage) over Data Distribution
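
The serialize/deserialize mapping can be sketched in Python as below, using SAM as the legacy format. This is a deliberately reduced illustration, not ADAM's converter: the `Alignment` class keeps only a handful of the schema's fields, the contig is simplified to a name, storing `start` as 0-based is an assumption of this sketch (SAM's POS column is 1-based), and the flag and mate columns are written as fixed placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alignment:
    """Small subset of the AlignmentRecord fields, for illustration only."""
    readName: Optional[str] = None
    contig: Optional[str] = None   # simplified: the schema uses a Contig record
    start: Optional[int] = None    # 0-based in this sketch (an assumption)
    mapq: Optional[int] = None
    cigar: Optional[str] = None
    sequence: Optional[str] = None
    qual: Optional[str] = None


def from_sam_line(line: str) -> Alignment:
    """Deserialize the mandatory SAM columns into the schema subset."""
    f = line.rstrip("\n").split("\t")
    return Alignment(
        readName=f[0], contig=f[2], start=int(f[3]) - 1,
        mapq=int(f[4]), cigar=f[5], sequence=f[9], qual=f[10],
    )


def to_sam_line(rec: Alignment, flag: int = 0) -> str:
    """Serialize back to SAM; RNEXT, PNEXT and TLEN are placeholders."""
    fields = [
        rec.readName, str(flag), rec.contig, str(rec.start + 1),
        str(rec.mapq), rec.cigar, "*", "0", "0", rec.sequence, rec.qual,
    ]
    return "\t".join(fields)


sam_line = "r1\t0\tchr1\t101\t60\t4M\t*\t0\t0\tACGT\tIIII"
record = from_sam_line(sam_line)
print(record)
print(to_sam_line(record) == sam_line)
```

Code above the schema layer only ever sees `Alignment` objects, so it does not need to know whether they were loaded from the legacy file or from columnar storage.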

Page 18: Rethinking Data-Intensive Science Using Scalable Analytics Systems

This is a view!

Page 19: Rethinking Data-Intensive Science Using Scalable Analytics Systems

A well-designed stack simplifies application design

• Application (Transformations): Variant calling & analysis, RNA-seq analysis, etc. Users define analyses via transformations.

• Presentation (Enriched Models): Enriched Read/Variant. Enriched models provide convenient methods on common models.

• Evidence Access (MapReduce/DBMS): Spark, Spark-SQL, Hadoop. The evidence access layer efficiently executes transformations.

• Schema (Data Models): Avro Schema for reads, variants, and genotypes. Schemas define the logical structure of basic genomic objects.

• Materialized Data (Columnar Storage): Load data from Parquet and legacy formats. Common interfaces map logical schema to bytes on disk.

• Data Distribution (Parallel FS): HDFS, Tachyon, HPC file systems, S3. Parallel file system layer coordinates distribution of data.

• Physical Storage (Attached Storage): Disk, SSD, block store, memory cache. Decoupling storage enables performance/cost tradeoff.

Page 20: Rethinking Data-Intensive Science Using Scalable Analytics Systems

How does this perform on real scientific data?

Page 21: Rethinking Data-Intensive Science Using Scalable Analytics Systems

ADAM performs genomic preprocessing

Source: The Broad Institute of MIT/Harvard

Page 22: Rethinking Data-Intensive Science Using Scalable Analytics Systems

ADAM’s Performance

• Achieve linear scalability out to 128 nodes for most tasks

• Up to 3x improvement over current tools on a single node

Analysis run using Amazon EC2; the single node was an i2.8xlarge and the cluster used r3.2xlarge instances.

Scripts available at https://www.github.com/bigdatagenomics/bdg-services.git

Page 23: Rethinking Data-Intensive Science Using Scalable Analytics Systems

Astronomy Pipelines

Source: The LSST Project

Page 24: Rethinking Data-Intensive Science Using Scalable Analytics Systems

Astronomy Image Co-addition Performance

• Scales out to 16 nodes

• ~3x improvement over extant tool on a single node

Analysis run using Amazon EC2, cluster was c3.8xlarge (HPC optimized)

Page 25: Rethinking Data-Intensive Science Using Scalable Analytics Systems

Conclusions

• There is a huge increase in the amount of scientific data being processed

• Although scientific processing pipelines tend to be custom solutions, we can replace these pipelines with general, DBMS backed solutions

• When we move to a general solution, we can gain performance without losing correctness

Page 26: Rethinking Data-Intensive Science Using Scalable Analytics Systems

Acknowledgements

• ADAM (https://www.github.com/bigdatagenomics/adam):

  • UC Berkeley: Matt Massie, Timothy Danford, André Schumacher, Jey Kottalam, Karen Feng, Eric Tu, Niranjan Kumar, Ananth Pallaseni, Anthony Joseph, Dave Patterson

  • Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Ryan Williams, Michael Linderman, Jeff Hammerbacher

  • GenomeBridge: Carl Yeksigian

  • Cloudera: Uri Laserson

  • Microsoft Research: Ravi Pandya

  • UC Santa Cruz: Benedict Paten, David Haussler

• KIRA (https://www.github.com/BIDS/Kira):

  • UC Berkeley: Zhao Zhang, Mike Franklin, Evan Sparks, Kyle Barbary, Oliver Zahn, Saul Perlmutter

  • PoC code at https://github.com/zhaozhang/SparkMontage