PAGE: A Framework for Easy Parallelization of Genomic Applications
Mucahid Kutlu, Gagan Agrawal
Department of Computer Science and Engineering
The Ohio State University
IPDPS 2014, Phoenix, Arizona
IPDPS'14 2
Motivation
• The sequencing costs are decreasing
*Adapted from genome.gov/sequencingcosts
Motivation
• Big data problem
– The 1000 Genomes Project has already produced 200 TB of data
– Parallel processing is inevitable!
*Adapted from https://www.nlm.nih.gov/about/2015CJ.html
Typical Analysis on Genomic Data
• Single Nucleotide Polymorphism (SNP) calling
[Figure: two alignment files, each with four reads aligned against an 8-base reference; the column offsets of the reads and the placement of the ✖/✓ marks were lost in extraction]
Reference (positions 1 to 8): A G C G T A C C
Alignment File-1
– Read-1: A G C G
– Read-2: G C G G
– Read-3: G C G T A
– Read-4: C G T T C C
Alignment File-2
– Read-1: A G A G
– Read-2: A G A G T
– Read-3: G A G T
– Read-4: G T T C C
*Adapted from Wikipedia
A single SNP may cause Mendelian disease!
Outline
• Motivation
• Existing Solutions for Implementation
• Our Work
• Experimental Evaluation
• Conclusion
Existing Solutions for Implementation
• Serial tools
– SamTools, VCFTools, BedTools: file merging, sorting, etc.
– VarScan: SNP calling
• Parallel implementations
– Turboblast: searching local alignments
– SEAL: read mapping and duplicate removal
– Biodoop: statistical analysis
• Middleware systems
– Hadoop
  • Not designed for the specific needs of genetic data
  • Limited programmability
– Genome Analysis Tool Kit (GATK)
  • Designed for genetic data processing
  • Provides special data traversal patterns
  • Limited parallelization for some of its tools
Our Goal
• We want to develop a middleware system that
– Is specific to parallel genetic data processing
– Allows parallelization of a variety of algorithms on genetic data
– Works with the different popular genetic data formats
– Allows use of existing programs
Challenges
• Load imbalance due to the nature of genomic data
– It is not just an array of A, G, C and T characters
• High overhead of tasks
• I/O contention
[Figure: coverage variance]
Our Work
• PAGE: A Map-Reduce-like middleware for easy parallelization of genomic applications
• Mappers and reducers are executable programs
– Allows us to exploit existing applications
– No restriction on programming language
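Intra-dependent Processing
• Each file is processed independently
[Figure: for each input file File-1 … File-m, map tasks on Region-1 … Region-n produce intermediate outputs (O-11 … O-1n for File-1, O-m1 … O-mn for File-m), and a reduce task combines them into that file's output (Output-1 … Output-m)]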
Inter-dependent Processing
• Each map task processes a particular region of ALL files
[Figure: map tasks on Region-1 … Region-k … Region-n of the input files produce intermediate outputs O1 … Ok … On, and a single reduce task combines them into the output]
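The two execution models differ only in how map tasks are formed and how their outputs are grouped for reduction. The following is a minimal sequential sketch of that grouping, not PAGE's implementation: `run_map` and `run_reduce` are hypothetical callables standing in for the user-supplied map and reduce programs, and the parallel execution PAGE provides is deliberately left out.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for the user-supplied executables.
MapFn = Callable[[List[str], str], str]   # (input files, region) -> intermediate result
ReduceFn = Callable[[List[str]], str]     # intermediate results -> final output


def intra_dependent(files: List[str], regions: List[str],
                    run_map: MapFn, run_reduce: ReduceFn) -> Dict[str, str]:
    """One map task per (file, region) pair; one reduce per file."""
    outputs = {}
    for f in files:
        partial = [run_map([f], r) for r in regions]
        outputs[f] = run_reduce(partial)
    return outputs


def inter_dependent(files: List[str], regions: List[str],
                    run_map: MapFn, run_reduce: ReduceFn) -> str:
    """One map task per region, reading that region of ALL files; a single reduce."""
    partial = [run_map(files, r) for r in regions]
    return run_reduce(partial)


if __name__ == "__main__":
    toy_map = lambda fs, r: "+".join(fs) + "@" + r
    toy_reduce = " | ".join
    print(intra_dependent(["f1", "f2"], ["r1", "r2"], toy_map, toy_reduce))
    print(inter_dependent(["f1", "f2"], ["r1", "r2"], toy_map, toy_reduce))
```

With m files and n regions, the intra-dependent model yields m × n map tasks and m outputs, while the inter-dependent model yields n map tasks and one output.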
What Can PAGE Parallelize?
• PAGE can parallelize all applications that have the following property
– M is the map task
– R, R1 and R2 are three regions such that R is the concatenation of R1 and R2
– M(R) = M(R1) ⊕ M(R2), where ⊕ is the reduction function
[Figure: region R split into R1 and R2]
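As a concrete illustration of this property, the sketch below uses a toy map task that reports positions where a sample differs from the reference, with list concatenation as the reduction (mirroring a `cat` of per-region output files). The reference string is the one from the SNP slide; the `SAMPLE` string and all function names are assumptions made for illustration.

```python
from typing import List, Tuple

REFERENCE = "AGCGTACC"   # reference sequence from the SNP calling slide
SAMPLE = "AGAGTTCC"      # hypothetical sample sequence, for illustration only

Call = Tuple[int, str, str]  # (1-based position, reference base, sample base)


def map_task(start: int, end: int) -> List[Call]:
    """M: report mismatches in the half-open, 0-based region [start, end)."""
    return [(i + 1, REFERENCE[i], SAMPLE[i])
            for i in range(start, end)
            if REFERENCE[i] != SAMPLE[i]]


def reduce_fn(left: List[Call], right: List[Call]) -> List[Call]:
    """The reduction: concatenate per-region results in region order."""
    return left + right


def is_decomposable(start: int, split: int, end: int) -> bool:
    """Check M(R) == M(R1) (+) M(R2) for R = [start, end) split at `split`."""
    whole = map_task(start, end)
    parts = reduce_fn(map_task(start, split), map_task(split, end))
    return whole == parts


if __name__ == "__main__":
    print(map_task(0, len(REFERENCE)))
    print(all(is_decomposable(0, s, len(REFERENCE)) for s in range(len(REFERENCE) + 1)))
```

The property holds here because each reported position depends only on data inside its own region. A map task whose result at a position depends on bases outside the region would not satisfy it without extra handling at region boundaries.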
Data Partitioning
• Data is NOT packaged into equal-size data blocks as in Hadoop
– Each application has a different way of reading the data
– Equal-size data block packaging ignores nucleotide base location information
• The genome is divided into regions, and each map task is assigned a region
– Takes location information into account
– The map task is responsible for accessing its particular region of the input files
  • Region-based access is a common feature of many genomic tools (GATK, SamTools)
Genome Partition
• PAGE provides two data partitioning methods
– By-locus partitioning: chromosomes are divided into regions
– By-chromosome partitioning: chromosomes preserve their unity
[Figure: two rows of chromosomes Chr-1 to Chr-6, one row per partitioning method]
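The two partitioning methods can be sketched as follows. This is an illustrative sketch rather than PAGE's code: the slides do not say how by-locus region boundaries are chosen, so the fixed `region_size` parameter is an assumption, and the chromosome lengths in the usage example are toy values.

```python
from typing import Dict, List, Tuple

Region = Tuple[str, int, int]  # (chromosome, 1-based start, inclusive end)


def by_chromosome(chrom_lengths: Dict[str, int]) -> List[Region]:
    """Each chromosome is kept whole: one region per chromosome."""
    return [(chrom, 1, length) for chrom, length in chrom_lengths.items()]


def by_locus(chrom_lengths: Dict[str, int], region_size: int) -> List[Region]:
    """Each chromosome is cut into consecutive regions of at most region_size bases.

    A fixed region size is an assumption of this sketch.
    """
    if region_size <= 0:
        raise ValueError("region_size must be positive")
    regions = []
    for chrom, length in chrom_lengths.items():
        for start in range(1, length + 1, region_size):
            end = min(start + region_size - 1, length)
            regions.append((chrom, start, end))
    return regions


if __name__ == "__main__":
    toy_lengths = {"chr1": 10, "chr2": 4}  # toy values, not real chromosome lengths
    print(by_chromosome(toy_lengths))
    print(by_locus(toy_lengths, region_size=4))
```

By-locus partitioning produces more, smaller map tasks and so leaves more room for load balancing, while by-chromosome partitioning suits tools that need to see a whole chromosome at once.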
![Page 16: PAGE: A Framework for Easy Parallelization of Genomic Applications](https://reader035.vdocument.in/reader035/viewer/2022062811/56815fce550346895dcecd5a/html5/thumbnails/16.jpg)
IPDPS'14 16
Task Scheduling
PAGE provides two types of scheduling schemes.
Static
• Each processor is responsible for regions of equal length.
• All map tasks must finish before the reduce tasks execute.
Dynamic
• Map and reduce tasks are assigned by a master process.
• Reduce tasks can start once enough intermediate results are available.
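To see why the choice of scheme matters when regions differ in cost (for example because of coverage variance), the toy simulation below compares the completion time of the map phase under the two schemes. It is a sketch under simplifying assumptions, not PAGE's scheduler: per-region costs are known numbers, the static scheme gives each worker a contiguous block with the same number of regions, scheduling overhead is ignored, and the reduce phase is not modeled.

```python
import heapq
from typing import List


def static_makespan(costs: List[float], workers: int) -> float:
    """Each worker gets a contiguous block with an equal number of regions up front."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    if not costs:
        return 0.0
    block = -(-len(costs) // workers)  # ceiling division
    return float(max(sum(costs[i:i + block]) for i in range(0, len(costs), block)))


def dynamic_makespan(costs: List[float], workers: int) -> float:
    """A master hands the next region to whichever worker becomes idle first."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    finish_times = [0.0] * workers  # min-heap of per-worker finish times
    heapq.heapify(finish_times)
    for cost in costs:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)


if __name__ == "__main__":
    # One expensive (high-coverage) region followed by seven cheap ones.
    region_costs = [8, 1, 1, 1, 1, 1, 1, 1]
    print("static :", static_makespan(region_costs, workers=2))
    print("dynamic:", dynamic_makespan(region_costs, workers=2))
```

In this example the static scheme leaves one worker idle while the other finishes the expensive block, whereas the dynamic scheme lets the idle worker pick up the remaining cheap regions. Dynamic assignment pays for this with master-worker communication, which the simulation ignores.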
Applications Developed Using PAGE
• We parallelized 4 applications
– VarScan: SNP detection
– Realigner Target Creator: detects insertions/deletions in alignment files
– Indel Realigner: applies local realignment to improve the quality of alignment files
– Unified Genotyper: SNP detection
Sample Application Development with PAGE
• Serial execution command of the VarScan software
– samtools mpileup -b file_list -f reference | java -jar VarScan.jar mpileup2snp
• To parallelize VarScan with PAGE, the user needs to define:
– Genome partition: By-locus
– Scheduling scheme: Dynamic (or Static)
– Execution model: Inter-dependent
– Map command: samtools mpileup -b file_list -r regionloc -f reference | java -jar VarScan.jar mpileup2snp > outputloc
– Reduction: the cat shell command
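The sketch below shows how the `regionloc` and `outputloc` placeholders in the map command could be filled in for each region, and how a `cat` reduction could be formed from the resulting output files. It only builds command strings and does not run them. How PAGE itself performs the substitution is not described on the slide, so the function names, the output file naming scheme and the `result.out` file name are all hypothetical.

```python
from typing import List, Tuple

Region = Tuple[str, int, int]  # (chromosome, 1-based start, inclusive end)

# Map command from the slide, with its two placeholders made explicit.
MAP_TEMPLATE = ("samtools mpileup -b file_list -r {regionloc} -f reference"
                " | java -jar VarScan.jar mpileup2snp > {outputloc}")


def map_commands(regions: List[Region], out_prefix: str = "part") -> List[Tuple[str, str]]:
    """Return (command, output file) pairs, one per region."""
    jobs = []
    for idx, (chrom, start, end) in enumerate(regions):
        regionloc = f"{chrom}:{start}-{end}"       # samtools region syntax
        outputloc = f"{out_prefix}_{idx:04d}.out"  # hypothetical naming scheme
        command = MAP_TEMPLATE.format(regionloc=regionloc, outputloc=outputloc)
        jobs.append((command, outputloc))
    return jobs


def reduce_command(outputs: List[str], final: str = "result.out") -> str:
    """Concatenate per-region outputs in region order, as the cat reduction does."""
    return "cat " + " ".join(outputs) + " > " + final


if __name__ == "__main__":
    jobs = map_commands([("chr1", 1, 1000), ("chr1", 1001, 2000)])
    for command, _ in jobs:
        print(command)
    print(reduce_command([out for _, out in jobs]))
```

Zero-padded indices keep the output files in region order, so the concatenated result lists positions in the same order as a serial run would.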
Experiments
• Experimental setup
– In our cluster, each node has
  • 12 GB memory
  • 8 cores (2.53 GHz)
– We obtained the data from the 1000 Genomes Project
– We evaluated PAGE with 4 applications
– We compared PAGE with Hadoop Streaming and GATK
Comparison with GATK
- Indel Realigner tool of GATK
[Figure: two panels, Scalability and Data Size Impact; labels: Data Size: 11 GB, # of cores: 128; speedup annotations: 3.3x and 9x]
Comparison with GATK
- Unified Genotyper tool of GATK
[Figure: two panels, Scalability and Data Size Impact; labels: Data Size: 34 GB, # of cores: 128; speedup annotations: 10.9x and 12.8x]
Comparison with Hadoop Streaming
- VarScan Application
[Figure: two panels, Scalability and Data Size Impact; labels: Data Size: 52 GB, # of cores: 128; speedup annotations: 6.9x and 12.7x]
Summary of Experimental Results
Speedup when the computing power is increased by 16 times

| | Indel Realigner | Unified Genotyper | VarScan | Realigner Target Creator |
|---|---|---|---|---|
| PAGE | 9x | 12.8x | 12.7x | 14.1x |
| GATK | 3.3x | 10.9x | - | - |
| Hadoop Streaming | - | - | 6.9x | - |
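One way to read these numbers is as scaling efficiency: the achieved speedup divided by the 16-fold increase in computing power. The short sketch below computes that ratio from the reported speedups. The efficiency metric is an added interpretation rather than something the slides report, and it assumes each speedup is measured relative to the smallest configuration of the same system.

```python
RESOURCE_FACTOR = 16  # computing power was increased 16 times

# Speedups as reported in the summary; missing entries were not reported.
SPEEDUPS = {
    "PAGE": {
        "Indel Realigner": 9.0,
        "Unified Genotyper": 12.8,
        "VarScan": 12.7,
        "Realigner Target Creator": 14.1,
    },
    "GATK": {"Indel Realigner": 3.3, "Unified Genotyper": 10.9},
    "Hadoop Streaming": {"VarScan": 6.9},
}


def efficiency(speedup: float, factor: int = RESOURCE_FACTOR) -> float:
    """Fraction of the ideal (linear) speedup that was achieved."""
    return speedup / factor


if __name__ == "__main__":
    for system, results in SPEEDUPS.items():
        for app, speedup in results.items():
            print(f"{system:17s} {app:25s} {efficiency(speedup):.0%}")
```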
Conclusion
• We developed a middleware system that
– Easily parallelizes genomic applications
– Has high applicability
  • No restriction on programming language or data format
  • Allows use of existing applications
– Lets the user control the parallel execution while hiding the details
  • Alternative scheduling schemes, execution models and data partitioning types
– Scales well
Thank you for listening …
Questions