Petascale Genomics (Strata Singapore 20151203)
© Cloudera, Inc. All rights reserved.
Scaling Up Genomics with Hadoop and Spark
Uri Laserson | @laserson | 14 November 2015
Petascale Genomics
We come in peace.
Pioneer plaque
What is genomics?
Organism Cell Genome
Reference chromosome
Location
“… decoding the Book of Life”
...atatggaaccaaaaaagagcccgcatcgccaaggcaatcctaagccaaaagaacaaagctggaggcatcacactacctgacttcaaactatactaca
agcctacagtaaccaaaacagcatggtactggtaccaaaacagagatatagatcaatggaacagaacagagccctcagaaataacgccgcatatctacaa
ctatctgatctttgacgaacctgagaaaaacaagcaatggggaaaggattccctatttaataaatggtgctgggaaaactggctagccatatgtagaaag
ctgaaactggatcccttccttacaccttatacaaaaatcaattcaagatggattaaagacttaaacgttagacctaaaaccataaaaaccctagaagaaa
acctaggcagtaccattcaggacataggcatgggcaaggacttcatgtccaaaacaccaaaagcaatggcaacaaaagacaaaattgacaaatgggatct
aattaaactaaagagcttctgcacagcaaaagaaactaccatcagagtgaacaggaaacctacaaaatgggagaaaattttcgcaacctactcatctgac
aaagggctaatatccagaatctacaatgaactcaaacaaatttacaagaaaaaaacaaacaaccccatcaaaaagtgggcaaaggacatgaacagacact
tctcaaatgaagacatttatgcagccaaaaaacacatgaaaaaatgctcatcatcactggccatcagagaaatgcaaatcaaaaccacaatgagatacca
tctcacaccagttagaatggcaatcattaaaaagtcaggaaacaacaggtgctggagaggatgtggagaaataggaacacttttacactgttggtgggac
tgtaaactagttcaaccattgtggaagtcagtgtggtgattcctcagggatctagaactagaaataccatttgacccagccatcccattactgggtatat
acccaaaggactataaatcatgctgctataaagacacatgcacacgtatgtttattgcggcattattcacaatagcaaagacttggaaccaacccaaatg
tccaacaatgataaactggattaagaaaatgtggcacatatacaccatggaatactctgcagccataaaaaaggatgagttcatgtcctttgtagggaca
tggatgaaattggaaatcatcattctcagtaaactatcgcaagaataaaaaaccaaacaccgcatattctcactcataggtgggaattgaacaatgagat
cacatggacacaggaagaggaatatcacactctggggactgtggtggggtggggggaggggggagggatagcattgggagatatacctaatgctagatga
cgagttagtgggtgcagcgcaccagcatggcacatgtatacatatgtaactaacctgcacattgtgcacatgtaccctaaaacttaaagtataataaaaa
aataaaaaaaataaagtgtgtgtgtgtatgactttaattaacttgatcacccacacacacacaaacactgaccaaaattaatatcaagtcaggtctgtct
gaatgtaaagccaacagcaaacatccctctctccaaatggaaaagaaacagggggttatgggcagctacactgctaaatgttaaaactttatttttaaat
gtggccataaaaatcactaaataaaattgataatatatgtttttgatgaataaattttatatatgtctacactggaaactatatagcaataaaaactaac
catgtacaactaaactcataaatttcataaacataataagtaaaagaagccagacaaaaagtagtgtatactgttaaattccatttatataaaagttcaa
aaaagccaaaaagaaactatgctgttaaaagtaaggattatagttactattcagggaagagagtagtggctggaaagaaacataaagggggtctctgaag
tggaataatgttctgttttttgatctgggtattagggtgtttaatttcggaaaattattttatctttatacttattgtattattgattttttgcttaaca
aattactcaaaacttagaggtttaaaaaaaattaattattgtattaatttctctgggccaggaattggagagagcttagctgggtagttctggttcaaaa
tttctcatgagattaccgtcaagctgttggagggggctgcatcatctgaaggcttgaccgaggctagaggatctactttcaagatggcccactcacatgg
ctgttggcaagaagtttcagtttctcactagcttctagcaggaggccataatttctcaccacatagatctctctatagggctactcgagtgtcctcacag
caaggtagctggctttcttcagagccaagtgactcaaaggcaaagaggaagtcactatgccatttatgacctagttttggaactcacactttgttccgaa
ttgaccttccatcactttctagtcattaggatttaagtcactaactctgatccatagtcaaggggagtaaaatttggctttattgttggaggatggagta
gcaaagaatttgttgacacattttaaaactaccatacttaaacagttcatttttctgaatatgcttcaattagaagttaaaatgatgcaattttaaaaca
ttgtttcaaatgaacactgttagggagagaagtgcttcttctccatatctaatgtttcttccatatttagggagttccattagtttaacactttaag...
>read1
TTGGACATTTCGGGGTCTCAGATT
>read2
AATGTTGTTAGAGATCCGGGATTT
>read3
GGATTCCCCGCCGTTTGAGAGCCT
>read4
AGGTTGGTACCGCGAAAAGCGCAT
Bioinformatics!
Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
Pipelines!
##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
• Compressed text files (non-splittable)
• Semi-structured
• Poorly specified
• Global sort order
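A minimal sketch of what "semi-structured" means for the VCF records above: the INFO column mixes key=value pairs with bare flags, and every sample column has to be decoded against the FORMAT column. The function name `parse_vcf_line` and the returned dict layout are illustrative assumptions, not part of any VCF library. The sketch splits on whitespace because the slide text is space-separated; real VCF files are tab-delimited.

```python
def parse_vcf_line(line, sample_names):
    """Parse one VCF data line into a nested dict (illustrative only)."""
    fields = line.split()  # assumption: whitespace-separated, as on the slide
    chrom, pos, vid, ref, alt, qual, flt, info, fmt = fields[:9]

    # INFO mixes key=value pairs with bare flags such as DB or H2.
    info_dict = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        info_dict[key] = value if sep else True

    # Sample columns are positional and keyed by the FORMAT column.
    # Trailing fields may be omitted, so zip() stops at the shorter side.
    keys = fmt.split(":")
    samples = {
        name: dict(zip(keys, col.split(":")))
        for name, col in zip(sample_names, fields[9:])
    }
    return {
        "chrom": chrom, "pos": int(pos), "id": vid, "ref": ref,
        "alt": alt.split(","), "qual": float(qual), "filter": flt,
        "info": info_dict, "samples": samples,
    }


record = parse_vcf_line(
    "20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 "
    "GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.",
    ["NA00001", "NA00002", "NA00003"],
)
print(record["info"]["AF"], record["samples"]["NA00002"]["GT"])  # 0.5 1|0
```

Every value comes back as a string unless the caller converts it, which is one reason a schema-backed format is attractive for this kind of data.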
C · HPC (scheduler) · POSIX filesystem
Java · HPC (queue) · POSIX filesystem
C++ · Single-node · SQLite
It’s file formats all the way down!
Dedup
/**
 * Main work method. Reads the BAM file once and collects sorted information about
 * the 5' ends of both ends of each read (or just one end in the case of pairs).
 * Then makes a pass through those determining duplicates before re-reading the
 * input file and writing it out with duplication flags set correctly.
 */
protected int doWork() {
    // build some data structures
    buildSortedReadEndLists(useBarcodes);
    generateDuplicateIndexes(useBarcodes);

    final SAMFileWriter out =
        new SAMFileWriterFactory().makeSAMOrBAMWriter(outputHeader, true, OUTPUT);

    final CloseableIterator<SAMRecord> iterator = headerAndIterator.iterator;
    while (iterator.hasNext()) {
        final SAMRecord rec = iterator.next();
        if (!rec.isSecondaryOrSupplementary()) {
            if (recordInFileIndex == nextDuplicateIndex) {
                rec.setDuplicateReadFlag(true);
                // Now try and figure out the next duplicate index
                if (this.duplicateIndexes.hasNext()) {
                    nextDuplicateIndex = this.duplicateIndexes.next();
                } else {
                    // Only happens once we've marked all the duplicates
                    nextDuplicateIndex = -1;
                }
            } else {
                rec.setDuplicateReadFlag(false);
Method
Code
@Option(shortName = "MAX_FILE_HANDLES",
        doc = "Maximum number of file handles to keep open when spilling " +
              "read ends to disk. Set this number a little lower than the " +
              "per-process maximum number of file that may be open. This " +
              "number can be found by executing the 'ulimit -n' command on " +
              "a Unix system.")
public int MAX_FILE_HANDLES_FOR_READ_ENDS_MAP = 8000;
Dedup
Method/Algo
Code
Platform
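The method can be written down separately from file handles and spill-to-disk settings. The sketch below is not Picard's implementation. It assumes reads are reduced to a hypothetical `Read` tuple, treats reads that share chromosome, 5' position and strand as duplicates of one another, and keeps the highest-scoring read in each group.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    name: str
    chrom: str
    five_prime_pos: int
    strand: str
    score: int  # hypothetical quality score used to pick the read to keep


def mark_duplicates(reads):
    """Return the set of read names flagged as duplicates."""
    groups = defaultdict(list)
    for read in reads:
        groups[(read.chrom, read.five_prime_pos, read.strand)].append(read)

    duplicates = set()
    for group in groups.values():
        # Keep the best-scoring read; ties resolve to the first one seen.
        best = max(group, key=lambda r: r.score)
        duplicates.update(r.name for r in group if r is not best)
    return duplicates


reads = [
    Read("read1", "20", 14370, "+", 60),
    Read("read2", "20", 14370, "+", 72),
    Read("read3", "20", 14370, "-", 55),
    Read("read4", "20", 17330, "+", 40),
]
print(sorted(mark_duplicates(reads)))  # ['read1']
```

Written this way the core step is a group-by-key followed by a per-group reduction, with no assumption about where the reads are stored or how many machines process them.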
It’s pipelines all the way down!
Node 1: Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
Node 2: Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
Node 3: Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
Manually running pipelines on HPC
$ bsub -q shared_12h python split_genotypes.py
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_1.vcf agg1.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_2.vcf agg2.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_3.vcf agg3.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_4.vcf agg4.csv
$ bsub -q shared_12h python merge_maf.py
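The job sequence above is a hand-rolled scatter/gather: split the input, aggregate each piece, merge the results. A rough single-process sketch of the same shape follows. The scripts' contents are not shown on the slide, so `split_genotypes`, `query_agg` and `merge` here are hypothetical stand-ins that simply count alternate alleles per variant, and a thread pool stands in for the cluster queue.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_genotypes(calls, n_chunks):
    """Scatter: deal the calls round-robin into n_chunks lists."""
    return [calls[i::n_chunks] for i in range(n_chunks)]


def query_agg(chunk):
    """Per-chunk aggregate: count alternate alleles per variant."""
    counts = Counter()
    for variant, alt_alleles in chunk:
        counts[variant] += alt_alleles
    return counts


def merge(partials):
    """Gather: sum the per-chunk counters."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


calls = [("20:14370", 1), ("20:17330", 0), ("20:14370", 2), ("20:1110696", 1)]
chunks = split_genotypes(calls, n_chunks=4)
with ThreadPoolExecutor(max_workers=4) as pool:
    totals = merge(pool.map(query_agg, chunks))
print(totals["20:14370"])  # 3
```

In the manual version every arrow between these three functions is a file on a shared filesystem plus a job submission that someone has to watch.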
Node 1
Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
Node 2
Node 3
Alignment → Dedup → Recalibrate → QC/Filter
Alignment → Dedup → Recalibrate → QC/Filter
Node 4
Node 1
Alignment → Dedup → QC/Filter → Variant Calling → Variant Annotation
Node 2
Node 3
Alignment → Dedup → QC/Filter
Alignment → Dedup → QC/Filter
Node 4
Recalibrate
Why Are We Still Defining File Formats By Hand?
• Instead of defining custom file formats for each data type and access pattern…
• Parquet creates a compressed format for each Avro-defined data model
• Improvements over existing formats
  • ~20% for BAM
  • ~90% for VCF
ContEst Algorithm
YARN-managed Hadoop cluster
Spark executors (partial sums), each computing: $\sum_{j=1}^{d_i} P(b_{ij} \mid e_{ij}, f_i)$
Driver (application code): $\sum_{i=1}^{N} \sum_{j=1}^{d_i} P(b_{ij} \mid e_{ij}, f_i)$
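The executor/driver split on this slide can be sketched without Spark. The operator symbols in the formula did not survive extraction, so this sketch makes an assumption: the per-read terms P(b_ij | e_ij, f_i) are combined in log space, which turns a product of probabilities into a sum that can be computed piecewise. The function names and the example probabilities are hypothetical.

```python
import math


def partial_sum(partition):
    """Executor side: sum log-probabilities for the reads in one partition."""
    return sum(math.log(p) for p in partition)


def total_log_likelihood(partitions):
    """Driver side: add up the partial sums returned by each executor."""
    return sum(partial_sum(part) for part in partitions)


# Hypothetical per-read probabilities, grouped by partition.
partitions = [[0.9, 0.8], [0.99], [0.5, 0.7, 0.95]]
print(total_log_likelihood(partitions))
```

Because addition is associative, the grouping of reads into partitions changes the result only by floating-point rounding, so the work can be distributed freely.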
Hadoop provides layered abstractions for data processing
HDFS (scalable, distributed storage)
YARN (resource management)
MapReduce · Impala (SQL) · Solr (search) · Spark
ADAM · quince · guacamole · …
bdg-formats (Avro/Parquet)
Spark + Genomics = ADAM
• Hosted at Berkeley and the AMPLab
• Apache 2 License
• Contributors from both research and commercial organizations
• Core spatial primitives, variant calling
• Avro and Parquet for data models and file formats
Core Genomics Primitives: Spatial Join
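A single-machine sketch of what a spatial join computes: each variant position is paired with every region on the same chromosome that contains it, using half-open intervals [start, end). This is not ADAM's distributed implementation. The function name `region_join` and the tuple layouts are assumptions made for illustration.

```python
from bisect import bisect_right
from collections import defaultdict


def region_join(variants, regions):
    """Pair each (chrom, pos) with every (chrom, start, end) containing pos."""
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))
    for intervals in by_chrom.values():
        intervals.sort()

    joined = []
    for chrom, pos in variants:
        intervals = by_chrom.get(chrom, [])
        # Only intervals whose start is <= pos can contain pos.
        upper = bisect_right(intervals, (pos, float("inf")))
        for start, end in intervals[:upper]:
            if pos < end:  # half-open: end itself is excluded
                joined.append(((chrom, pos), (chrom, start, end)))
    return joined


variants = [("20", 14370), ("20", 17330), ("21", 100)]
regions = [("20", 14000, 15000), ("20", 17000, 17330), ("21", 50, 150)]
print(len(region_join(variants, regions)))  # 2
```

An ordinary equi-join cannot express the containment test, which is why genomics libraries treat the region join as a primitive of its own.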
Executing query in Hadoop: interactive Spark shell (ADAM)
// predicates (left as pseudocode on the slide)
def inDbSnp(g: Genotype): Boolean = ???        // true or false
def isDeleterious(g: Genotype): Boolean = ???  // based on g.getPolyPhen

// load data
val samples = sc.textFile("path/to/samples").map(parseJson(_)).collect()
val dbsnp = sc.textFile("path/to/dbSNP").map(_.split(",")).collect()
val dnaseRDD = sc.adamBEDFeatureLoad("path/to/dnase")
val genotypesRDD = sc.adamLoad("path/to/genotypes")

// apply predicates
val filteredRDD = genotypesRDD
  .filter(!inDbSnp(_))
  .filter(isDeleterious(_))
  .filter(isFramingham(_))

// join data
val joinedRDD = RegionJoin.partitionAndJoin(sc, filteredRDD, dnaseRDD)

// group-by, aggregate (MAF)
val maf = joinedRDD
  .keyBy(x => (x.getVariant, getPopulation(x)))
  .groupByKey()
  .map(computeMAF(_))

// persist data
maf.saveAsNewAPIHadoopFile("path/to/output")
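The aggregation step can be sketched in plain Python. The body of `computeMAF` is not shown on the slide, so this version is an assumption: it takes biallelic, diploid genotype calls as (variant, population, alternate-allele count) tuples, computes the alternate allele frequency per (variant, population) group, and reports the smaller of that frequency and its complement. The variant key format and the population label are made up for the example.

```python
from collections import defaultdict


def compute_maf(calls):
    """Minor allele frequency per (variant, population) for diploid calls."""
    alt = defaultdict(int)
    total = defaultdict(int)
    for variant, population, alt_count in calls:
        key = (variant, population)
        alt[key] += alt_count
        total[key] += 2  # two alleles per diploid genotype
    result = {}
    for key in total:
        af = alt[key] / total[key]
        result[key] = min(af, 1.0 - af)
    return result


calls = [
    ("20:14370:G:A", "EUR", 0),
    ("20:14370:G:A", "EUR", 1),
    ("20:14370:G:A", "EUR", 2),
    ("20:14370:G:A", "EUR", 2),
]
print(compute_maf(calls))  # {('20:14370:G:A', 'EUR'): 0.375}
```

Both running totals are sums, so the same computation can be done as per-partition partial counts that are merged afterwards.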
Executing query in Hadoop: distributed SQL
-- aggregate (UDAF)
SELECT g.chr, g.pos, g.ref, g.alt, s.pop, MAF(g.call)
-- "load" and join data
FROM genotypes g
INNER JOIN samples s
  ON g.sample = s.sample
INNER JOIN dnase d
  ON g.chr = d.chr
  AND g.pos >= d.start
  AND g.pos < d.end
LEFT OUTER JOIN dbsnp p
  ON g.chr = p.chr
  AND g.pos = p.pos
  AND g.ref = p.ref
  AND g.alt = p.alt
-- apply predicates
WHERE
  s.study = "framingham" AND
  p.pos IS NULL AND
  g.polyphen IN ( "possibly damaging", "probably damaging" )
-- group-by
GROUP BY g.chr, g.pos, g.ref, g.alt, s.pop
ADAM preliminary performance
“Myths of Bioinformatics Software”
1. Somebody will build on your code
2. You should have assembled a team to build your software
3. If you choose the right license, more people will use and build on your software.
4. Making software free for commercial use shows you are not against companies.
5. You should maintain your software indefinitely
6. Your “stable URL” can exist forever
7. You should make your software “idiot proof”
8. You used the right programming language for the task.
Lior Pachter, https://liorpachter.wordpress.com/2015/07/10/the-myths-of-bioinformatics-software/
Acknowledgements
UC Berkeley: Matt Massie, Frank Nothaft, Michael Heuer
Tamr: Timothy Danford
MSSM: Jeff Hammerbacher, Ryan Williams
Cloudera: Tom White, Sandy Ryza