strata big data science talk on adam
![Page 1: Strata Big Data Science Talk on ADAM](https://reader033.vdocument.in/reader033/viewer/2022051709/53ec625a8d7f72821e8bcbd8/html5/thumbnails/1.jpg)
ADAM: Fast, Scalable Genome Analysis
Frank Austin Nothaft
AMPLab, University of California, Berkeley
with: Matt Massie, André Schumacher, Timothy Danford, Chris Hartl, Jey Kottalam, Arun Ahuja, Neal Sidhwaney, Michael Linderman, Jeff Hammerbacher, Anthony Joseph, and Dave Patterson
https://github.com/bigdatagenomics
Wednesday, February 12, 14
![Page 2: Strata Big Data Science Talk on ADAM](https://reader033.vdocument.in/reader033/viewer/2022051709/53ec625a8d7f72821e8bcbd8/html5/thumbnails/2.jpg)
Problem
• Whole genome files are large
• Biological systems are complex
• Population analysis requires petabytes of data
• Analysis time is often a matter of life and death
![Page 3: Strata Big Data Science Talk on ADAM](https://reader033.vdocument.in/reader033/viewer/2022051709/53ec625a8d7f72821e8bcbd8/html5/thumbnails/3.jpg)
Shredded Book Analogy
Dickens accidentally shreds the first printing of A Tale of Two Cities
Text printed on 5 long spools
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
Slide credit to Michael Schatz, http://schatzlab.cshl.edu/
![Page 5: Strata Big Data Science Talk on ADAM](https://reader033.vdocument.in/reader033/viewer/2022051709/53ec625a8d7f72821e8bcbd8/html5/thumbnails/5.jpg)
Shredded Book Analogy
Dickens accidentally shreds the first printing of A Tale of Two Cities
Text printed on 5 long spools
It was the best of | times, it was the worst | of times, it was the | age of wisdom, it was the age of foolishness, …
It was the best | of times, it was the | worst of times, it was | the age of wisdom, it was the age of foolishness,
It was the | best of times, it was | the worst of times, it | was the age of wisdom, it was the age of foolishness, …
It was | the best of times, it | was the worst of times, | it was the age of wisdom, it was the age of foolishness, …
It | was the best of times, | it was the worst of | times, it was the age of wisdom, it was the age of foolishness, …
• How can he reconstruct the text?
– 5 copies x 138,656 words / 5 words per fragment = 138k fragments
– The short fragments from every copy are mixed together
– Some fragments are identical
Slide credit to Michael Schatz, http://schatzlab.cshl.edu/
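The fragment arithmetic on this slide can be checked with a short Python sketch. It is only an illustration of the analogy: the `shred` helper and its fixed five-word cuts with a per-copy offset are assumptions made here, not something shown in the talk.

```python
def shred(text, words_per_fragment=5, offset=0):
    """Cut text into consecutive word fragments.

    `offset` is the length of a shorter first fragment, so that each copy
    of the book can be cut at different positions (hypothetical helper).
    """
    words = text.split()
    fragments = []
    if offset:
        fragments.append(" ".join(words[:offset]))
    for start in range(offset, len(words), words_per_fragment):
        fragments.append(" ".join(words[start:start + words_per_fragment]))
    return fragments


# Estimate from the slide: copies * words in the book / words per fragment.
copies, words_in_book, words_per_fragment = 5, 138_656, 5
estimated_fragments = copies * words_in_book // words_per_fragment

opening = ("It was the best of times, it was the worst of times, "
           "it was the age of wisdom, it was the age of foolishness,")

# Five copies, each cut at a different offset, then pooled together.
pool = [frag for offset in range(copies) for frag in shred(opening, 5, offset)]

print(estimated_fragments)
print(len(pool), "fragments from the opening line alone")
```

Reconstructing the text then means finding overlaps between fragments from different copies, which is the same problem a genome assembler or aligner faces with short reads.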
![Page 7: Strata Big Data Science Talk on ADAM](https://reader033.vdocument.in/reader033/viewer/2022051709/53ec625a8d7f72821e8bcbd8/html5/thumbnails/7.jpg)
What is ADAM?
• File formats: columnar file format that allows efficient parallel access to genomes
• API: interface for transforming, analyzing, and querying genomic data
• CLI: a handy toolkit for quickly processing genomes
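To illustrate why a columnar layout helps, here is a minimal Python sketch. The `Read` fields and the `to_columns` helper are hypothetical and much simpler than ADAM's real schema or on-disk format; the point is only that a query touching two fields never has to read the others.

```python
from dataclasses import dataclass, fields


@dataclass
class Read:
    # Hypothetical, simplified fields; not ADAM's actual schema.
    name: str
    contig: str
    start: int
    sequence: str


def to_columns(records):
    """Pivot row-oriented records into one list per field (columnar layout)."""
    return {f.name: [getattr(r, f.name) for r in records] for f in fields(Read)}


rows = [
    Read("r1", "chr1", 100, "ACGT"),
    Read("r2", "chr1", 50, "TTGA"),
    Read("r3", "chr2", 7, "GGCC"),
]
columns = to_columns(rows)

# A position query reads only the contig and start columns;
# the (much larger) sequence column is never touched.
chr1_starts = [s for c, s in zip(columns["contig"], columns["start"]) if c == "chr1"]
print(chr1_starts)
```

In a distributed setting each column can also be split into independent chunks, which is what makes parallel access practical.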
![Page 8: Strata Big Data Science Talk on ADAM](https://reader033.vdocument.in/reader033/viewer/2022051709/53ec625a8d7f72821e8bcbd8/html5/thumbnails/8.jpg)
The Broad Institute “Best Practices” Pipeline
[Diagram labels: SNAP, ADAM, Avocado, MLBase, GraphX, SiRen, CAGE]
![Page 9: Strata Big Data Science Talk on ADAM](https://reader033.vdocument.in/reader033/viewer/2022051709/53ec625a8d7f72821e8bcbd8/html5/thumbnails/9.jpg)
Design Goals
• Develop processing pipeline that enables efficient, scalable use of cluster/cloud
• Provide data format that has efficient parallel/distributed access across platforms
• Enhance semantics of data and allow more flexible data access patterns
![Page 10: Strata Big Data Science Talk on ADAM](https://reader033.vdocument.in/reader033/viewer/2022051709/53ec625a8d7f72821e8bcbd8/html5/thumbnails/10.jpg)
Whole Genome Data Sizes

| Tool | Pipeline stage | Input | Output |
|---|---|---|---|
| SNAP | Alignment | 1 GB FASTA, 150 GB FASTQ | 250 GB BAM |
| ADAM | Pre-processing | 250 GB BAM | 200 GB ADAM |
| Avocado | Variant calling | 200 GB ADAM | 10 MB ADAM |

Variants found at about 1 in 1,000 loci
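The sizes on this slide give a feel for the scale. The sketch below is plain arithmetic on the slide's figures and assumes decimal units (1 GB = 10^9 bytes), which the slide does not specify.

```python
# Decimal units are an assumption; the slide does not say GB vs. GiB.
MB = 10**6
GB = 10**9
PB = 10**15

# (stage, input bytes, output bytes), taken from the data sizes slide.
stages = [
    ("SNAP (alignment)", 1 * GB + 150 * GB, 250 * GB),
    ("ADAM (pre-processing)", 250 * GB, 200 * GB),
    ("Avocado (variant calling)", 200 * GB, 10 * MB),
]
for name, input_bytes, output_bytes in stages:
    print(f"{name}: output is {output_bytes / input_bytes:.5f}x the input size")

# How many aligned whole genomes (250 GB BAM each) fit in one petabyte.
genomes_per_petabyte = PB // (250 * GB)

# How much smaller the called variants are than the pre-processed reads.
variant_reduction = (200 * GB) // (10 * MB)

print(genomes_per_petabyte, variant_reduction)
```

A few thousand aligned genomes already reach a petabyte, which is why population-scale analysis needs a distributed format, while the final variant output is small enough to handle on a laptop.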
![Page 11: Strata Big Data Science Talk on ADAM](https://reader033.vdocument.in/reader033/viewer/2022051709/53ec625a8d7f72821e8bcbd8/html5/thumbnails/11.jpg)
ADAM Stack
Physical
‣ Commodity hardware
‣ Cloud systems: Amazon, GCE, Azure
File/Block
‣ Hadoop Distributed Filesystem
‣ Local filesystem
Record/Split
‣ Schema-driven records with Apache Avro
‣ Store and retrieve records using Parquet
‣ Read BAM files using Hadoop-BAM
RDD
‣ Transform records using Apache Spark
‣ Query with SQL using Shark
‣ Graph processing with GraphX
‣ Machine learning using MLBase
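The upper layers of the stack treat a genome as a collection of typed records that are transformed as a whole. The sketch below shows that idea in plain Python for the two transforms benchmarked in this talk, sort and mark duplicates. It is not ADAM's implementation: the `Read` fields are hypothetical, and the duplicate rule (same contig and start, keep the highest quality) is a deliberate simplification of what real duplicate marking does.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Read:
    # Hypothetical, simplified record; not ADAM's actual schema.
    name: str
    contig: str
    start: int
    quality: int
    duplicate: bool = False


def sort_reads(reads):
    """Order reads by reference position."""
    return sorted(reads, key=lambda r: (r.contig, r.start))


def mark_duplicates(reads):
    """Flag every read except the best-quality one at each (contig, start)."""
    best = {}
    for r in reads:
        key = (r.contig, r.start)
        if key not in best or r.quality > best[key].quality:
            best[key] = r
    return [replace(r, duplicate=best[(r.contig, r.start)] is not r) for r in reads]


reads = [
    Read("r1", "chr1", 100, 30),
    Read("r2", "chr1", 100, 40),
    Read("r3", "chr1", 50, 20),
    Read("r4", "chr2", 7, 35),
]
marked = mark_duplicates(sort_reads(reads))
for r in marked:
    print(r.name, r.contig, r.start, "duplicate" if r.duplicate else "")
```

Both transforms are expressed as a key function plus a grouping or ordering step, which is the shape of computation that a framework like Spark can spread across a cluster.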
![Page 12: Strata Big Data Science Talk on ADAM](https://reader033.vdocument.in/reader033/viewer/2022051709/53ec625a8d7f72821e8bcbd8/html5/thumbnails/12.jpg)
Implementation
• Work accelerated at end of September
• 15K lines of Scala code
• 100% Apache-licensed open-source
• Code contributions from Mt. Sinai, GenomeBridge, The Broad Institute
• Proof of concept implementations at The Broad Institute, Duke, Harvard, UCSC
• Global Alliance looking at ADAM as potential standard
[Chart: Commits]
![Page 13: Strata Big Data Science Talk on ADAM](https://reader033.vdocument.in/reader033/viewer/2022051709/53ec625a8d7f72821e8bcbd8/html5/thumbnails/13.jpg)
AWS EC2 Performance
NA12878 whole genome: 234 GB as BAM, 229 GB as ADAM
Times in minutes (chart drawn on a log scale)

| Configuration | Sort (min) | Sort speedup vs. Picard | Mark Duplicates (min) |
|---|---|---|---|
| Picard | 1064 | baseline | 1222 |
| ADAM 1 node | 536 | 2x | not listed |
| ADAM 32 nodes | 32 | 33x | 70 |
| ADAM 100 nodes | 20 | 53x | 28 |
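The speedup labels on this chart (2x, 33x, 53x) can be reproduced from the minute values. The sketch below assumes the labels belong to the series whose Picard baseline is 1,064 minutes (taken here to be Sort) and that the values 536, 32 and 20 correspond to 1, 32 and 100 nodes; both mappings are inferred from the chart residue rather than stated on the slide.

```python
# Minutes read off the chart; the node-to-value mapping is an inference.
picard_minutes = 1064
adam_minutes = {1: 536, 32: 32, 100: 20}

# Speedup labels printed on the chart, keyed by node count.
chart_labels = {1: 2, 32: 33, 100: 53}

# Speedup is the baseline time divided by the ADAM time.
speedups = {nodes: picard_minutes / minutes for nodes, minutes in adam_minutes.items()}

for nodes, speedup in speedups.items():
    print(f"{nodes:>3} nodes: {speedup:.1f}x (chart label: {chart_labels[nodes]}x)")
```

Each computed ratio lands within one unit of its chart label; small differences are expected because the chart shows times rounded to whole minutes.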
![Page 14: Strata Big Data Science Talk on ADAM](https://reader033.vdocument.in/reader033/viewer/2022051709/53ec625a8d7f72821e8bcbd8/html5/thumbnails/14.jpg)
Mt. Sinai Performance
NA12878 whole genome: 234 GB as BAM, 229 GB as ADAM
Times in minutes (chart drawn on a log scale)

| Configuration | Sort (min) | Sort speedup vs. Picard | Mark Duplicates (min) |
|---|---|---|---|
| Picard | 1064 | baseline | 1222 |
| ADAM 16 nodes | 29 | 36x | 50 |
| ADAM 32 nodes | 17 | 62x | 34 |
| ADAM 64 nodes | 11 | 96x | 15 |
| ADAM 82 nodes | 8 | 133x | 14 |
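One way to read these numbers is as scaling efficiency: how much of the ideal linear speedup is retained as nodes are added. The sketch below uses the series with the 1,064-minute Picard baseline and assumes its ADAM values 29, 17, 11 and 8 minutes map to 16, 32, 64 and 82 nodes. The efficiency metric is an addition for illustration, not something reported in the talk.

```python
# Minutes per node count, inferred from the chart (assumed mapping).
minutes_by_nodes = {16: 29, 32: 17, 64: 11, 82: 8}
base_nodes = 16

efficiency = {}
for nodes, minutes in minutes_by_nodes.items():
    # Speedup relative to the smallest ADAM cluster shown.
    relative_speedup = minutes_by_nodes[base_nodes] / minutes
    # Ideal speedup if time fell in proportion to the node count.
    ideal_speedup = nodes / base_nodes
    efficiency[nodes] = relative_speedup / ideal_speedup
    print(f"{nodes:>2} nodes: {relative_speedup:.2f}x vs. ideal {ideal_speedup:.2f}x "
          f"(efficiency {efficiency[nodes]:.0%})")
```

Because the inputs are rounded to whole minutes, the efficiencies are rough; they are meant to show the trend rather than precise values.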
![Page 15: Strata Big Data Science Talk on ADAM](https://reader033.vdocument.in/reader033/viewer/2022051709/53ec625a8d7f72821e8bcbd8/html5/thumbnails/15.jpg)
Scalability
HG00096 → 16 GB
NA12878 → 234 GB
![Page 16: Strata Big Data Science Talk on ADAM](https://reader033.vdocument.in/reader033/viewer/2022051709/53ec625a8d7f72821e8bcbd8/html5/thumbnails/16.jpg)
Acknowledgements
• UC Berkeley: Matt Massie, André Schumacher, Jey Kottalam, Christos Kozanitis
• Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Michael Linderman, Jeff Hammerbacher
• GenomeBridge: Timothy Danford, Carl Yeksigian
![Page 17: Strata Big Data Science Talk on ADAM](https://reader033.vdocument.in/reader033/viewer/2022051709/53ec625a8d7f72821e8bcbd8/html5/thumbnails/17.jpg)
Acknowledgements
This research is supported in part by NSF CISE Expeditions award CCF-1139158 and DARPA
XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, SAP, Apple, Inc.,
Cisco, Clearstory Data, Cloudera, Ericsson, Facebook, GameOnTalis, General Electric,
Hortonworks, Huawei, Intel, Microsoft, NetApp, Oracle, Samsung, Splunk, VMware, WANdisco and
Yahoo!.
![Page 18: Strata Big Data Science Talk on ADAM](https://reader033.vdocument.in/reader033/viewer/2022051709/53ec625a8d7f72821e8bcbd8/html5/thumbnails/18.jpg)
“Before seeing your data, I would not think sorting could be done that fast.”
- research scientist at the Broad Institute
![Page 19: Strata Big Data Science Talk on ADAM](https://reader033.vdocument.in/reader033/viewer/2022051709/53ec625a8d7f72821e8bcbd8/html5/thumbnails/19.jpg)
Call for contributions
• As an open source project, we welcome contributions
• We maintain a list of open enhancements on our GitHub issue tracker
• Enhancements tagged with “Pick me up!” don’t require a genomics background
• GitHub: https://github.com/bigdatagenomics/