rethinking data-intensive science using scalable analytics systems
TRANSCRIPT
Rethinking Data-Intensive Science Using Scalable Analytics Systems
Frank Austin Nothaft, UC Berkeley AMP/ASPIRE Lab, @fnothaft
With Matt Massie, Timothy Danford, Zhao Zhang, Uri Laserson, Carl Yeksigian, Jey Kottalam, Arun Ahuja, Jeff Hammerbacher, Michael Linderman, Michael J. Franklin, Anthony D. Joseph, David A. Patterson
Scientific revolutions are driven by data acquisition revolutions
Genome Sequencing
Source: NIH National Human Genome Research Institute
2014: ~230,000 genomes sequenced; at 15-250 GB/genome, that is ~30 TB/day, or ~10 PB/year
Human Genome Project: ~10 GB
1000 Genomes: 15TB
TCGA: 3PB
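(As a rough consistency check on those figures: ~230,000 genomes per year is about 630 genomes per day; taking a representative ~50 GB per genome, that is roughly 30 TB/day, and 30 TB/day sustained over a year is about 11 PB, in line with the ~10 PB/year estimate.)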
Sequencing advances line up well with scalable analytics software
Source: NIH National Human Genome Research Institute
Google MapReduce
Hadoop MR
Spark
Parquet
Mapping scientific systems to commodity analytics systems
• Contemporary scientific systems are custom-built
• This leads to rebuilding functionality that commodity systems already provide
• We have an opportunity to rethink the abstractions that scientific systems use:
• Migrate from a flat architecture to a stacked architecture
• Expose higher level programming primitives
• Use commodity tools wherever possible
Common Traits of Legacy Data-Intensive Scientific Systems
1. Computation is workflow/pipeline oriented
2. Processing system has monolithic/flat architecture
3. Data is stored in flat files
Genomics Pipelines
Source: The Broad Institute of MIT/Harvard
Flat File Formats
• Scientific data is typically stored in application-specific file formats:
• Genomic reads: SAM/BAM, CRAM
• Genomic variants: VCF/BCF, MAF
• Genomic features: BED, NarrowPeak, GTF
• Centralized metadata makes it difficult to parallelize applications
Flat Architectures
• APIs present very barebones abstractions:
• GATK: a sorted iterator over the genome
• Why are flat architectures bad?
1. Trivial: low-level abstractions are not productive
2. Trivial: flat architectures create technical lock-in
3. Subtle: low-level abstractions can introduce bugs
The perils of flattening…
• The trivial:
• You can improve performance by pushing data access order into your data layout
• But then you can’t easily compose pipeline stages that have different access orders
• The obscure:
• If you access data via a sorted iterator, will you incorrectly implement your algorithm? (See the sketch below.)
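To make the obscure case concrete, here is a minimal sketch in Scala, using a toy Read type rather than the GATK or ADAM API, of the kind of bug a sorted iterator invites: selecting the reads for a window by their start position silently drops reads that begin before the window but overlap into it.

// Toy type: half-open interval [start, end), not a real genomics API.
case class Read(start: Long, end: Long)

// Buggy: counts only reads that *start* inside the window, dropping
// reads that begin before windowStart but still overlap the window.
def buggyWindowCount(windowStart: Long, windowEnd: Long,
                     sortedReads: Seq[Read]): Int =
  sortedReads.count(r => r.start >= windowStart && r.start < windowEnd)

// Correct: tests for interval overlap, independent of iteration order.
def overlapWindowCount(windowStart: Long, windowEnd: Long,
                       reads: Seq[Read]): Int =
  reads.count(r => r.start < windowEnd && r.end > windowStart)

// A read spanning [90, 110) overlaps the window [100, 200) but is
// missed by the start-based version.
val reads = Seq(Read(90, 110), Read(120, 170))
assert(buggyWindowCount(100, 200, reads) == 1)
assert(overlapWindowCount(100, 200, reads) == 2)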
A green field approach
First, define a schema:

record AlignmentRecord {
  union { null, Contig } contig = null;
  union { null, long } start = null;
  union { null, long } end = null;
  union { null, int } mapq = null;
  union { null, string } readName = null;
  union { null, string } sequence = null;
  union { null, string } mateReference = null;
  union { null, long } mateAlignmentStart = null;
  union { null, string } cigar = null;
  union { null, string } qual = null;
  union { null, string } recordGroupName = null;
  union { int, null } basesTrimmedFromStart = 0;
  union { int, null } basesTrimmedFromEnd = 0;
  union { boolean, null } readPaired = false;
  union { boolean, null } properPair = false;
  union { boolean, null } readMapped = false;
  union { boolean, null } mateMapped = false;
  union { boolean, null } firstOfPair = false;
  union { boolean, null } secondOfPair = false;
  union { boolean, null } failedVendorQualityChecks = false;
  union { boolean, null } duplicateRead = false;
  union { boolean, null } readNegativeStrand = false;
  union { boolean, null } mateNegativeStrand = false;
  union { boolean, null } primaryAlignment = false;
  union { boolean, null } secondaryAlignment = false;
  union { boolean, null } supplementaryAlignment = false;
  union { null, string } mismatchingPositions = null;
  union { null, string } origQual = null;
  union { null, string } attributes = null;
  union { null, string } recordGroupSequencingCenter = null;
  union { null, string } recordGroupDescription = null;
  union { null, long } recordGroupRunDateEpoch = null;
  union { null, string } recordGroupFlowOrder = null;
  union { null, string } recordGroupKeySequence = null;
  union { null, string } recordGroupLibrary = null;
  union { null, int } recordGroupPredictedMedianInsertSize = null;
  union { null, string } recordGroupPlatform = null;
  union { null, string } recordGroupPlatformUnit = null;
  union { null, string } recordGroupSample = null;
  union { null, Contig } mateContig = null;
}
The stack:
Application: Transformations
Presentation: Enriched Models
Evidence Access: MapReduce/DBMS
Schema: Data Models
Materialized Data: Columnar Storage
Data Distribution: Parallel FS
Physical Storage: Attached Storage
A schema provides a narrow waist
The AlignmentRecord schema above sits at the middle of the stack: every layer above and below is written against it.
Accelerate common access patterns
• In genomics, we commonly have to find observations that overlap in a coordinate plane
• This coordinate plane is genomics-specific and is known a priori
• We can use our knowledge of the coordinate plane to implement a fast overlap join, as sketched below
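Here is a minimal sketch of one way to implement such a join on Spark, assuming a toy ReferenceRegion type and a hypothetical bin size rather than the actual ADAM implementation: key every record by each fixed-size genomic bin it touches, equi-join on the bin key, then keep only the pairs that truly overlap.

import org.apache.spark.rdd.RDD

// Toy half-open region [start, end) on a named contig.
case class ReferenceRegion(contig: String, start: Long, end: Long) {
  def overlaps(other: ReferenceRegion): Boolean =
    contig == other.contig && start < other.end && end > other.start
}

val binSize = 10000L // hypothetical bin width

// Key each record by every bin its region touches; because bins tile
// the coordinate plane, overlapping regions must share at least one bin.
def byBin[T](rdd: RDD[(ReferenceRegion, T)]): RDD[((String, Long), (ReferenceRegion, T))] =
  rdd.flatMap { case (r, v) =>
    (r.start / binSize to (r.end - 1) / binSize).map(bin => ((r.contig, bin), (r, v)))
  }

// The equi-join on bin keys is cheap and parallel; the exact overlap
// test then discards near misses that only happen to share a bin.
def overlapJoin[A, B](left: RDD[(ReferenceRegion, A)],
                      right: RDD[(ReferenceRegion, B)]): RDD[(A, B)] =
  byBin(left).join(byBin(right))
    .filter { case (_, ((lr, _), (rr, _))) => lr.overlaps(rr) }
    .map { case (_, ((_, a), (_, b))) => (a, b) }
    .distinct() // a pair can meet in more than one bin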
Pick appropriate storage
• When accessing scientific datasets, we frequently slice and dice the dataset:
• Algorithms may touch subsets of columns
• We don’t always touch the whole dataset
• This is a good match for columnar storage (see the sketch below)
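As a concrete illustration in Spark, assuming reads have already been saved as Parquet under a hypothetical path (this is not ADAM's own loading API): column projection and predicate pushdown mean Parquet only decodes the bytes for the columns and rows the query touches.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("columnar-sketch")
  .master("local[*]")
  .getOrCreate()

// Hypothetical path to AlignmentRecords stored as Parquet.
val reads = spark.read.parquet("hdfs:///data/reads.parquet")

// Only the readName, start, and mapq columns are read from disk; the
// sequence and quality strings, which dominate file size, are skipped.
val wellMapped = reads
  .select("readName", "start", "mapq")
  .where(col("mapq") >= 30)

wellMapped.show()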
Is introducing a new data model really a good idea?
Source: XKCD, http://xkcd.com/927/
A subtle point: proper stack design can simplify backwards compatibility
To support legacy data formats, you define a way to serialize/deserialize the schema into/from the legacy flat file format
Legacy stack: Schema (Data Models) / Materialized Data (Legacy File Format) / Data Distribution
Columnar stack: Schema (Data Models) / Materialized Data (Columnar Storage) / Data Distribution
This is a view! The layers above see the same schema whether the bytes on disk are a legacy flat file or columnar storage; the loader sketch below shows the idea.
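Here is a minimal sketch, using toy types rather than the actual ADAM API, of how a legacy format becomes a view: one decoder per format deserializes legacy records into the common schema, and everything above the schema layer is unchanged.

// Toy, heavily trimmed version of the schema above.
case class AlignmentRecord(readName: Option[String],
                           start: Option[Long],
                           sequence: Option[String])

trait AlignmentSource {
  def load(path: String): Iterator[AlignmentRecord]
}

object SamSource extends AlignmentSource {
  // Deserialize legacy SAM lines into the schema on the fly;
  // QNAME, POS, and SEQ are SAM columns 1, 4, and 10.
  def load(path: String): Iterator[AlignmentRecord] =
    scala.io.Source.fromFile(path).getLines()
      .filterNot(_.startsWith("@")) // skip SAM header lines
      .map { line =>
        val f = line.split('\t')
        AlignmentRecord(Some(f(0)), Some(f(3).toLong), Some(f(9)))
      }
}

object ParquetSource extends AlignmentSource {
  // Stub standing in for a real columnar reader.
  def load(path: String): Iterator[AlignmentRecord] = Iterator.empty
}

// Dispatch on extension: legacy files and columnar files both present
// the same AlignmentRecord schema to the application.
def open(path: String): Iterator[AlignmentRecord] =
  if (path.endsWith(".sam")) SamSource.load(path)
  else ParquetSource.load(path)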
A well-designed stack simplifies application design

Application (Transformations): variant calling & analysis, RNA-seq analysis, etc. Users define analyses via transformations.
Presentation (Enriched Models): enriched read/variant models provide convenient methods on common models.
Evidence Access (MapReduce/DBMS): Spark, Spark SQL, Hadoop. The evidence access layer efficiently executes transformations.
Schema (Data Models): Avro schemas for reads, variants, and genotypes define the logical structure of basic genomic objects.
Materialized Data (Columnar Storage): loads data from Parquet and legacy formats. Common interfaces map the logical schema to bytes on disk.
Data Distribution (Parallel FS): HDFS, Tachyon, HPC file systems, S3. The parallel file system layer coordinates distribution of data.
Physical Storage (Attached Storage): disk, SSD, block store, memory cache. Decoupling storage enables a performance/cost tradeoff.
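Putting the layers together, a minimal sketch of the application layer, reusing the toy AlignmentRecord case class from the sketch above (again illustrative, not the actual ADAM operators): an analysis is just a composition of transformations over schema-typed records, and the lower layers decide how those records are materialized and distributed.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("stack-sketch").setMaster("local[*]"))

// Schema-typed records; in practice these would be loaded through the
// materialized data layer (Parquet, or a legacy format view).
val reads = sc.parallelize(Seq(
  AlignmentRecord(Some("read1"), Some(100L), Some("ACGT")),
  AlignmentRecord(Some("read2"), Some(250L), Some("TTAG"))
))

// A toy "analysis": keep reads aligned at or past position 200.
val downstream = reads.filter(_.start.exists(_ >= 200L))
println(downstream.count()) // 1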
How does this perform on real scientific data?
ADAM performs genomic preprocessing
Source: The Broad Institute of MIT/Harvard
ADAM’s Performance
• Achieves linear scalability out to 128 nodes for most tasks
• Up to a 3x improvement over current tools on a single node
Analysis run using Amazon EC2; the single node was an i2.8xlarge, and the cluster used r3.2xlarge instances. Scripts available at https://www.github.com/bigdatagenomics/bdg-services.git
Astronomy Pipelines
Source: The LSST Project
Astronomy Image Co-addition Performance
• Scales out to 16 nodes
• ~3x improvement over the extant tool on a single node
Analysis run using Amazon EC2; the cluster used c3.8xlarge (HPC-optimized) instances
Conclusions
• There is a huge increase in the amount of scientific data being processed
• Although scientific processing pipelines tend to be custom solutions, we can replace these pipelines with general, DBMS-backed solutions
• When we move to a general solution, we can gain performance without losing correctness
Acknowledgements
• ADAM (https://www.github.com/bigdatagenomics/adam):
• UC Berkeley: Matt Massie, Timothy Danford, André Schumacher, Jey Kottalam, Karen Feng, Eric Tu, Niranjan Kumar, Ananth Pallaseni, Anthony Joseph, Dave Patterson
• Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Ryan Williams, Michael Linderman, Jeff Hammerbacher
• GenomeBridge: Carl Yeksigian
• Cloudera: Uri Laserson
• Microsoft Research: Ravi Pandya
• UC Santa Cruz: Benedict Paten, David Haussler
• KIRA (https://www.github.com/BIDS/Kira):
• UC Berkeley: Zhao Zhang, Mike Franklin, Evan Sparks, Kyle Barbary, Oliver Zahn, Saul Perlmutter
• PoC code at https://github.com/zhaozhang/SparkMontage