PAGE: A Framework for Easy Parallelization of Genomic Applications

Mucahid Kutlu, Gagan Agrawal

Department of Computer Science and Engineering

The Ohio State University

IPDPS 2014, Phoenix, Arizona

IPDPS'14 2

Motivation

• The sequencing costs are decreasing

*Adapted from genome.gov/sequencingcosts

Motivation

• Big data problem
  – The 1000 Human Genome Project has already produced 200 TB of data
  – Parallel processing is inevitable!

*Adapted from https://www.nlm.nih.gov/about/2015CJ.html


Typical Analysis on Genomic Data

• Single Nucleotide Polymorphism (SNP) calling

[Figure: two example alignment files, each with four reads aligned against reference positions 1-8]

Reference: A G C G T A C C

Alignment File 1
Read-1: A G C G
Read-2: G C G G
Read-3: G C G T A
Read-4: C G T T C C

Alignment File 2
Read-1: A G A G
Read-2: A G A G T
Read-3: G A G T
Read-4: G T T C C

*Adapted from Wikipedia

A single SNP may cause Mendelian disease!


Outline

• Motivation
• Existing Solutions for Implementation
• Our Work
• Experimental Evaluation
• Conclusion


Existing Solutions for Implementation

• Serial tools
  – SamTools, VCFTools, BedTools: file merging, sorting, etc.
  – VarScan: SNP calling
• Parallel implementations
  – Turboblast: searching local alignments
  – SEAL: read mapping and duplicate removal
  – Biodoop: statistical analysis
• Middleware systems
  – Hadoop
    • Not designed for the specific needs of genetic data
    • Limited programmability
  – Genome Analysis Tool Kit (GATK)
    • Designed for genetic data processing
    • Provides special data traversal patterns
    • Limited parallelization for some of its tools


Our Goal

• We want to develop a middleware system that
  – Is specific to parallel genetic data processing
  – Allows parallelization of a variety of genetic algorithms
  – Works with different popular genetic data formats
  – Allows the use of existing programs


Challenges

• Load imbalance due to the nature of genomic data
  – It is not just an array of A, G, C and T characters
• High overhead of tasks
• I/O contention

[Figure: coverage variance]


Our Work

• PAGE: A Map-Reduce-like middleware for easy parallelization of genomic applications
• Mappers and reducers are executable programs
  – Allows us to exploit existing applications
  – No restriction on programming language

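Intra-dependent Processing

• Each file is processed independently

[Figure: each input file (File-1, File-2, ..., File-m) is split into regions (Region-1, ..., Region-n); map tasks process the regions, and a reduce task combines their results into one output per file (Output-1, ..., Output-m)]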

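Inter-dependent Processing

• Each map task processes a particular region of ALL files

[Figure: the input files are divided into regions (Region-1, ..., Region-k, ..., Region-n); each map task produces an intermediate output (O1, ...), and a reduce task combines them into a single output]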
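The difference between the two execution models can be seen in how map tasks are enumerated. The following is a minimal sketch, not PAGE's actual interface: the function names and the tuple representation of a task are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def intra_dependent_tasks(files: List[str], regions: List[str]) -> List[Tuple[str, str]]:
    """One map task per (file, region) pair; each file is later reduced on its own."""
    return [(f, r) for f in files for r in regions]


def inter_dependent_tasks(files: List[str], regions: List[str]) -> List[Tuple[Tuple[str, ...], str]]:
    """One map task per region; every task reads that region from ALL files."""
    return [(tuple(files), r) for r in regions]


files = ["File-1", "File-2"]
regions = ["Region-1", "Region-2", "Region-3"]

intra = intra_dependent_tasks(files, regions)
inter = inter_dependent_tasks(files, regions)
print(len(intra), "intra-dependent map tasks")
print(len(inter), "inter-dependent map tasks")
```

With m files and n regions, this sketch yields m x n map tasks and m outputs in the intra-dependent model, and n map tasks and a single output in the inter-dependent model.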


What Can PAGE Parallelize?

• PAGE can parallelize all applications that have the following property
  – M is the map task
  – R, R1 and R2 are three regions such that R = concatenation of R1 and R2
  – M(R) = M(R1) ⊕ M(R2), where ⊕ is the reduction function
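The property can be checked on a toy example. The sketch below assumes a hypothetical map task that reports the positions where a sample differs from the reference, with list concatenation as the reduction function ⊕. The reference bases are the ones shown on the SNP-calling slide; the sample sequence is made up for illustration.

```python
from typing import List

REFERENCE = "AGCGTACC"  # reference bases from the SNP-calling slide
SAMPLE = "AGCGGTCC"     # hypothetical sample sequence, for illustration only


def map_task(start: int, end: int) -> List[int]:
    """M: return 1-based positions in [start, end) where SAMPLE differs from REFERENCE."""
    return [i + 1 for i in range(start, end) if REFERENCE[i] != SAMPLE[i]]


def reduce_fn(left: List[int], right: List[int]) -> List[int]:
    """The reduction function: concatenate the results of adjacent regions."""
    return left + right


whole = map_task(0, 8)                             # M(R)
split = reduce_fn(map_task(0, 4), map_task(4, 8))  # M(R1) ⊕ M(R2)
print(whole, split, whole == split)
```

Because each position is examined independently of its neighbours, splitting R at any point gives the same result after reduction, which is the condition stated above.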


Data Partitioning

• Data is NOT packaged into equal-size data blocks as in Hadoop
  – Each application has a different way of reading the data
  – Equal-size data block packaging ignores nucleotide base location information
• The genome is divided into regions and each map task is assigned a region
  – Takes location information into account
  – The map task is responsible for accessing its particular region of the input files
    • This is a common feature of many genomic tools (GATK, SamTools)


Genome Partition

• PAGE provides two data partitioning methods
  – By-locus partitioning: chromosomes are divided into regions
  – By-chromosome partitioning: chromosomes preserve their unity

[Figure: genome partition over chromosomes Chr-1 to Chr-6]
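The two partitioning methods can be sketched as follows. This is an illustration under assumptions, not PAGE's implementation: the function names, the fixed `region_size` parameter, and the toy chromosome lengths are hypothetical. Regions are written as `chrom:start-end`, the region syntax accepted by samtools.

```python
from typing import Dict, List


def by_chromosome(chrom_lengths: Dict[str, int]) -> List[str]:
    """One region per chromosome; chromosomes are kept whole."""
    return [f"{chrom}:1-{length}" for chrom, length in chrom_lengths.items()]


def by_locus(chrom_lengths: Dict[str, int], region_size: int) -> List[str]:
    """Split every chromosome into consecutive regions of at most region_size bases."""
    regions = []
    for chrom, length in chrom_lengths.items():
        for start in range(1, length + 1, region_size):
            end = min(start + region_size - 1, length)
            regions.append(f"{chrom}:{start}-{end}")
    return regions


lengths = {"Chr-1": 1000, "Chr-2": 450}  # toy lengths, not real chromosome sizes
print(by_chromosome(lengths))
print(by_locus(lengths, 400))
```

By-locus partitioning produces more, smaller tasks, while by-chromosome partitioning yields at most one task per chromosome.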


Task Scheduling

PAGE provides two types of scheduling schemes.

Static
• Each processor is responsible for regions of equal length.
• All map tasks must finish before the reduce tasks execute.

Dynamic
• Map and reduce tasks are assigned by a master process.
• Reduce tasks can start once enough intermediate results are available.
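The effect of the two schemes on load balance can be illustrated with a small simulation. This is a simplified sketch, not PAGE's scheduler: it models only map tasks, ignores master and I/O overhead, and uses hypothetical per-region costs to stand in for uneven coverage.

```python
import heapq
from typing import List


def static_makespan(costs: List[float], workers: int) -> float:
    """Static: each worker gets a contiguous block with the same number of regions."""
    per_worker = -(-len(costs) // workers)  # ceiling division
    loads = [sum(costs[i:i + per_worker]) for i in range(0, len(costs), per_worker)]
    return max(loads)


def dynamic_makespan(costs: List[float], workers: int) -> float:
    """Dynamic: a master hands the next region to whichever worker becomes idle first."""
    finish_times = [0.0] * workers
    heapq.heapify(finish_times)
    for cost in costs:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)


costs = [8, 1, 1, 1, 1, 1, 1, 2]  # hypothetical per-region processing costs
static_time = static_makespan(costs, 2)
dynamic_time = dynamic_makespan(costs, 2)
print("static:", static_time, "dynamic:", dynamic_time)
```

In this toy input one region is much more expensive than the others, so equal-length static assignment leaves one worker idle while the other finishes, whereas dynamic assignment evens out the load.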


Applications Developed Using PAGE

• We parallelized 4 applications
  – VarScan: SNP detection
  – Realigner Target Creator: Detects insertions/deletions in alignment files
  – Indel Realigner: Applies local realignment to improve the quality of alignment files
  – Unified Genotyper: SNP detection


Sample Application Development with PAGE

• Serial execution command of the VarScan software
  – samtools mpileup -b file_list -f reference | java -jar VarScan.jar mpileup2snp
• To parallelize VarScan with PAGE, the user needs to define:
  – Genome Partition: By-Locus
  – Scheduling Scheme: Dynamic (or Static)
  – Execution Model: Inter-dependent
  – Map command: samtools mpileup -b file_list -r regionloc -f reference | java -jar VarScan.jar mpileup2snp >outputloc
  – Reduction: cat bash shell command
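How a middleware could turn the map command above into per-region tasks can be sketched as follows. This assumes that `regionloc` and `outputloc` act as placeholders filled in per task, which the slide suggests but does not spell out. The helper names and the `part-NNNN.out` file naming are hypothetical, and the sketch only builds command strings without executing them.

```python
from typing import List

# Map command from the slide, with its two placeholders left in place.
MAP_TEMPLATE = (
    "samtools mpileup -b file_list -r regionloc -f reference"
    " | java -jar VarScan.jar mpileup2snp >outputloc"
)


def build_map_commands(template: str, regions: List[str], out_dir: str) -> List[str]:
    """Fill in the region and output placeholders once per map task."""
    commands = []
    for i, region in enumerate(regions):
        out_path = f"{out_dir}/part-{i:04d}.out"  # hypothetical naming scheme
        commands.append(template.replace("regionloc", region).replace("outputloc", out_path))
    return commands


def build_reduce_command(num_tasks: int, out_dir: str, final_path: str) -> str:
    """Reduction with cat: concatenate the map outputs in region order."""
    parts = " ".join(f"{out_dir}/part-{i:04d}.out" for i in range(num_tasks))
    return f"cat {parts} >{final_path}"


regions = ["1:1-1000", "1:1001-2000"]  # toy regions
map_commands = build_map_commands(MAP_TEMPLATE, regions, "out")
reduce_command = build_reduce_command(len(regions), "out", "final.snp")
for command in map_commands:
    print(command)
print(reduce_command)
```

Each generated string could then be handed to a worker as a shell command, which is what lets existing tools be reused without modification.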


Experiments

• Experimental Setup
  – In our cluster
    • Each node has 12 GB memory
    • 8 cores (2.53 GHz)
  – We obtained the data from 1000 Human Genome Project
  – We evaluated PAGE with 4 applications
  – We compared PAGE with Hadoop Streaming and GATK


Comparison with GATK

- Indel Realigner tool of GATK

[Figure, two panels: Scalability (data size: 11 GB) and Data Size Impact (# of cores: 128)]


Comparison with GATK

- Unified Genotyper tool of GATK

[Figure: speedup annotations 10.9x and 12.8x]

Comparison with Hadoop Streaming

- VarScan Application

[Figure: speedup annotations 6.9x and 12.7x]


Summary of Experimental Results

When the computing power increased by 16 times

                    Indel Realigner   Unified Genotyper   VarScan   Realigner Target Creator
PAGE                x                 12.8x               12.7x     14.1x
GATK                3.3x              10.9x               -         -
Hadoop Streaming    -                 -                   6.9x      -
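These speedups can also be read as parallel efficiency, that is, speedup divided by the 16-fold increase in computing power. The short calculation below is a sketch that uses only the values legible in the summary. It assumes the unlabeled first row belongs to PAGE, and it leaves out the Indel Realigner value for that row because the number is not legible.

```python
POWER_INCREASE = 16  # "computing power increased by 16 times"

# (system, application) -> speedup, as reported in the summary
speedups = {
    ("PAGE", "Unified Genotyper"): 12.8,  # row label assumed to be PAGE
    ("PAGE", "VarScan"): 12.7,
    ("PAGE", "Realigner Target Creator"): 14.1,
    ("GATK", "Indel Realigner"): 3.3,
    ("GATK", "Unified Genotyper"): 10.9,
    ("Hadoop Streaming", "VarScan"): 6.9,
}

# Efficiency = achieved speedup / ideal speedup
efficiency = {key: value / POWER_INCREASE for key, value in speedups.items()}

for (system, application), eff in efficiency.items():
    print(f"{system:17s} {application:25s} {eff:.0%}")
```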


Conclusion

• We developed a middleware that
  – Easily parallelizes genomic applications
  – Has high applicability
    • No restriction on programming language or data format
    • Allows the use of existing applications
  – Lets the user control the parallel execution while hiding the details
    • Alternative scheduling schemes, execution models and data partitioning types
  – Provides good scalability


Thank you for listening …

Questions
