PAGE: A Framework for Easy Parallelization of Genomic Applications

Mucahid Kutlu, Gagan Agrawal
Department of Computer Science and Engineering
The Ohio State University

IPDPS 2014, Phoenix, Arizona

IPDPS'14 2

Motivation

• Sequencing costs are decreasing

*Adapted from genome.gov/sequencingcosts

• Big data problem
  – The 1000 Genomes Project has already produced 200 TB of data
  – Parallel processing is inevitable!

*Adapted from https://www.nlm.nih.gov/about/2015CJ.html

Typical Analysis on Genomic Data

• Single Nucleotide Polymorphism (SNP) calling

Alignment File-1
Sequences: 1 2 3 4 5 6 7 8
Read-1: A G C G
Read-2: G C G G
Read-3: G C G T A
Read-4: C G T T C C

Reference: A G C G T A C C

Alignment File-2
Sequences: 1 2 3 4 5 6 7 8
Read-1: A G A G
Read-2: A G A G T
Read-3: G A G T
Read-4: G T T C C

*Adapted from Wikipedia

A single SNP may cause a Mendelian disease!

Outline

• Motivation
• Existing Solutions for Implementation
• Our Work
• Experimental Evaluation
• Conclusion

Existing Solutions for Implementation

• Serial tools
  – SamTools, VCFTools, BedTools: file merging, sorting, etc.
  – VarScan: SNP calling
• Parallel implementations
  – Turboblast: searching local alignments
  – SEAL: read mapping and duplicate removal
  – Biodoop: statistical analysis
• Middleware systems
  – Hadoop
    • Not designed for the specific needs of genetic data
    • Limited programmability
  – Genome Analysis Tool Kit (GATK)
    • Designed for genetic data processing
    • Provides special data traversal patterns
    • Limited parallelization for some of its tools


Our Goal

• We want to develop a middleware system that
  – Is specific to parallel genetic data processing
  – Allows parallelization of a variety of genetic data analysis algorithms
  – Works with different popular genetic data formats
  – Allows the use of existing programs

Challenges

• Load imbalance due to the nature of genomic data
  – It is not just an array of A, G, C and T characters
• High overhead of tasks
• I/O contention

[Figure: Coverage Variance]

Our Work

• PAGE: a Map-Reduce-like middleware for easy parallelization of genomic applications
• Mappers and reducers are executable programs
  – Allows us to exploit existing applications
  – No restriction on programming language

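Intra-dependent Processing

• Each file is processed independently

[Diagram: each input file (File-1, File-2, ..., File-m) is split into Region-1 ... Region-n. One map task per region produces intermediate outputs (O-11 ... O-1n for File-1, up to O-m1 ... O-mn for File-m), and a reduce step per file combines them into Output-1 ... Output-m.]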

Inter-dependent Processing

• Each map task processes a particular region of ALL files

[Diagram: map tasks for Region-1 ... Region-k ... Region-n read the input files and produce O1 ... Ok ... On; a single reduce step combines them into one Output.]
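The difference between the two execution models can be sketched as the list of tasks each one generates. This is only an illustration, not PAGE's implementation: the function names and the tuple-based task encoding are hypothetical.

```python
from typing import List, Tuple

# Hypothetical task encoding: (kind, arguments)
Task = Tuple[str, tuple]


def intra_dependent_tasks(files: List[str], regions: List[str]) -> List[Task]:
    """Each file is processed independently: one map per (file, region),
    followed by one reduce per file."""
    tasks: List[Task] = []
    for f in files:
        for r in regions:
            tasks.append(("map", (f, r)))
        tasks.append(("reduce", (f,)))
    return tasks


def inter_dependent_tasks(files: List[str], regions: List[str]) -> List[Task]:
    """Each map task covers one region of ALL files; a single reduce
    combines the per-region outputs."""
    tasks: List[Task] = [("map", (tuple(files), r)) for r in regions]
    tasks.append(("reduce", (tuple(files),)))
    return tasks


if __name__ == "__main__":
    files = ["File-1", "File-2"]
    regions = ["Region-1", "Region-2", "Region-3"]
    print(len(intra_dependent_tasks(files, regions)), "intra-dependent tasks")
    print(len(inter_dependent_tasks(files, regions)), "inter-dependent tasks")
```

With m files and n regions, the intra-dependent model yields m·n map tasks and m reduce tasks, while the inter-dependent model yields n map tasks and one reduce task.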

What Can PAGE Parallelize?

• PAGE can parallelize all applications that have the following property
  – M: map task
  – R, R1 and R2 are three regions such that R = concatenation of R1 and R2
  – M(R) = M(R1) ⊕ M(R2), where ⊕ is the reduction function

[Diagram: region R split into R1 and R2]
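A toy check of the property M(R) = M(R1) ⊕ M(R2), assuming a made-up map task (base counting) and a made-up reduction (adding counts). Neither is one of the applications from the talk; they only show what it means for a map task to satisfy, or violate, the condition.

```python
from collections import Counter


def count_bases(region: str) -> Counter:
    """Toy map task M: count each base in a region (a string of A/C/G/T)."""
    return Counter(region)


def merge_counts(a: Counter, b: Counter) -> Counter:
    """Toy reduction (the ⊕ operator): merge per-region counts by addition."""
    return a + b


def count_gt_pairs(region: str) -> int:
    """Toy map task that does NOT satisfy the property with addition:
    a 'GT' pair can straddle the boundary between two regions."""
    return region.count("GT")


r1 = "AGCG"
r2 = "TACC"
r = r1 + r2  # R is the concatenation of R1 and R2

# Base counting satisfies M(R) == M(R1) ⊕ M(R2).
splittable = count_bases(r) == merge_counts(count_bases(r1), count_bases(r2))

# Pair counting misses the 'GT' that spans the R1/R2 boundary.
pair_splittable = count_gt_pairs(r) == count_gt_pairs(r1) + count_gt_pairs(r2)

print("base counting splittable:", splittable)
print("GT-pair counting splittable:", pair_splittable)
```

The second map task shows why the condition matters: any computation that looks across a region boundary needs a different reduction, or overlapping regions, before it fits this model.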

Data Partitioning

• Data is NOT packaged into equal-size data blocks as in Hadoop
  – Each application has a different way of reading the data
  – Equal-size data block packaging ignores nucleotide base location information
• The genome is divided into regions and each map task is assigned a region
  – Takes location information into account
  – The map task is responsible for accessing its particular region of the input files
    • This is a common feature of many genomic tools (GATK, SamTools)

Genome Partition

• PAGE provides two data partitioning methods
  – By-locus partitioning: chromosomes are divided into regions
  – By-chromosome partitioning: chromosomes preserve their unity

[Diagram: Chr-1 ... Chr-6 shown under each partitioning method]
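The two partitioning methods might look roughly like this. It is a sketch under assumptions: the chromosome lengths are toy values, the fixed `region_len` parameter is hypothetical (the slides do not say how PAGE picks region boundaries), and regions are rendered in the `chr:start-end` form that samtools accepts.

```python
from typing import Dict, List, Tuple

Region = Tuple[str, int, int]  # (chromosome, start, end), 1-based inclusive


def by_chromosome(chrom_lengths: Dict[str, int]) -> List[Region]:
    """By-chromosome partitioning: each chromosome is one region."""
    return [(chrom, 1, length) for chrom, length in chrom_lengths.items()]


def by_locus(chrom_lengths: Dict[str, int], region_len: int) -> List[Region]:
    """By-locus partitioning: each chromosome is cut into regions of at most
    region_len bases (region_len is an assumed parameter)."""
    regions: List[Region] = []
    for chrom, length in chrom_lengths.items():
        for start in range(1, length + 1, region_len):
            end = min(start + region_len - 1, length)
            regions.append((chrom, start, end))
    return regions


def to_region_string(region: Region) -> str:
    """Format a region as 'chrom:start-end'."""
    chrom, start, end = region
    return f"{chrom}:{start}-{end}"


if __name__ == "__main__":
    toy_lengths = {"Chr-1": 250, "Chr-2": 100}  # toy values, not real data
    for region in by_locus(toy_lengths, region_len=100):
        print(to_region_string(region))
```

By-locus partitioning produces more, smaller tasks, which gives a scheduler more room to balance load; by-chromosome partitioning suits tools that need a whole chromosome at once.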

Task Scheduling

PAGE provides two types of scheduling schemes.

Static
• Each processor is responsible for regions of equal length.
• All map tasks must finish before the reduce tasks execute.

Dynamic
• Map and reduce tasks are assigned by a master process.
• Reduce tasks can start once enough intermediate results are available.
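A small simulation can show why dynamic assignment helps when regions of equal length take unequal time (for example, because of coverage variance). It is a simplified model, not PAGE's scheduler: it covers only map tasks, the per-region costs are invented, and the function names are hypothetical.

```python
import heapq
from typing import List


def static_makespan(costs: List[float], workers: int) -> float:
    """Static: regions are split up front into contiguous chunks with the same
    number of regions per worker; the slowest worker sets the finish time."""
    chunk = -(-len(costs) // workers)  # ceiling division
    loads = [sum(costs[i:i + chunk]) for i in range(0, len(costs), chunk)]
    return max(loads)


def dynamic_makespan(costs: List[float], workers: int) -> float:
    """Dynamic: a master hands the next region to whichever worker
    becomes free first."""
    free_at = [0.0] * workers  # time at which each worker becomes idle
    heapq.heapify(free_at)
    for cost in costs:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + cost)
    return max(free_at)


if __name__ == "__main__":
    # Invented per-region processing costs: one region is much heavier.
    costs = [8, 1, 1, 1, 1, 1, 1, 2]
    print("static :", static_makespan(costs, workers=2))
    print("dynamic:", dynamic_makespan(costs, workers=2))
```

In this toy input the heavy first region makes one statically assigned worker the bottleneck, while the dynamic scheme lets the other worker absorb the remaining regions.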

Applications Developed Using PAGE

• We parallelized 4 applications
  – VarScan: SNP detection
  – Realigner Target Creator: detects insertions/deletions in alignment files
  – Indel Realigner: applies local realignment to improve the quality of alignment files
  – Unified Genotyper: SNP detection

Sample Application Development with PAGE

• Serial execution command of the VarScan software
  – samtools mpileup -b file_list -f reference | java -jar VarScan.jar mpileup2snp
• To parallelize VarScan with PAGE, the user needs to define:
  – Genome partition: By-locus
  – Scheduling scheme: Dynamic (or Static)
  – Execution model: Inter-dependent
  – Map command: samtools mpileup -b file_list -r regionloc -f reference | java -jar VarScan.jar mpileup2snp >outputloc
  – Reduction: the cat shell command
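To make the roles of `regionloc` and `outputloc` concrete, here is a sketch of how the map command from the slide could be expanded once per region, with `cat` as the reduction. The slides do not show how PAGE performs this substitution; the template handling, the `part_NNNN.out` file names, and the function names are assumptions. The code only builds command strings and runs nothing.

```python
from typing import List

# Map command from the slide, with regionloc/outputloc as placeholders.
MAP_TEMPLATE = (
    "samtools mpileup -b {file_list} -r {regionloc} -f {reference} "
    "| java -jar VarScan.jar mpileup2snp > {outputloc}"
)


def part_name(index: int) -> str:
    """Hypothetical naming scheme for per-region output files."""
    return f"part_{index:04d}.out"


def map_commands(regions: List[str], file_list: str, reference: str) -> List[str]:
    """One map command per region, each writing to its own output file."""
    return [
        MAP_TEMPLATE.format(
            file_list=file_list,
            regionloc=region,
            reference=reference,
            outputloc=part_name(i),
        )
        for i, region in enumerate(regions)
    ]


def reduce_command(n_parts: int, final_output: str) -> str:
    """Reduction via cat: concatenate the per-region outputs in region order."""
    parts = " ".join(part_name(i) for i in range(n_parts))
    return f"cat {parts} > {final_output}"


if __name__ == "__main__":
    regions = ["chr1:1-1000", "chr1:1001-2000"]  # toy regions
    for cmd in map_commands(regions, "file_list", "reference"):
        print(cmd)
    print(reduce_command(len(regions), "all_snps.out"))
```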


Experiments

• Experimental setup
  – In our cluster, each node has
    • 12 GB memory
    • 8 cores (2.53 GHz)
  – We obtained the data from the 1000 Genomes Project
  – We evaluated PAGE with 4 applications
  – We compared PAGE with Hadoop Streaming and GATK

Comparison with GATK

- Indel Realigner tool of GATK

[Figure: two panels, Scalability (data size: 11 GB) and Data Size Impact (# of cores: 128). Annotated speedups: GATK 3.3x, PAGE 9x.]

Comparison with GATK

- Unified Genotyper tool of GATK

[Figure: two panels, Scalability (data size: 34 GB) and Data Size Impact (# of cores: 128). Annotated speedups: GATK 10.9x, PAGE 12.8x.]

Comparison with Hadoop Streaming

- VarScan Application

[Figure: two panels, Scalability (data size: 52 GB) and Data Size Impact (# of cores: 128). Annotated speedups: Hadoop Streaming 6.9x, PAGE 12.7x.]

Summary of Experimental Results

Speedup when the computing power is increased by 16 times:

                   Indel Realigner   Unified Genotyper   VarScan   Realigner Target Creator
PAGE               9x                12.8x               12.7x     14.1x
GATK               3.3x              10.9x               -         -
Hadoop Streaming   -                 -                   6.9x      -
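One way to read these speedups is as a fraction of the ideal 16x. The short calculation below does that arithmetic on the reported numbers. The "efficiency" metric (speedup divided by the 16x increase in computing power) is a derived quantity added here for illustration and does not appear on the slides.

```python
SCALE_UP = 16  # computing power increased by 16 times (from the slide)

# Speedups as reported in the summary; missing entries were not measured.
speedups = {
    "PAGE": {
        "Indel Realigner": 9.0,
        "Unified Genotyper": 12.8,
        "VarScan": 12.7,
        "Realigner Target Creator": 14.1,
    },
    "GATK": {"Indel Realigner": 3.3, "Unified Genotyper": 10.9},
    "Hadoop Streaming": {"VarScan": 6.9},
}


def efficiency(speedup: float, scale_up: int = SCALE_UP) -> float:
    """Fraction of the ideal speedup achieved (derived metric, an assumption)."""
    return speedup / scale_up


if __name__ == "__main__":
    for system, apps in speedups.items():
        for app, s in apps.items():
            print(f"{system:17s} {app:25s} {s:5.1f}x  efficiency {efficiency(s):.0%}")
```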

Conclusion

• We developed a middleware that
  – Easily parallelizes genomic applications
  – Has high applicability
    • No restriction on programming language or data format
    • Allows the use of existing applications
  – Lets the user control the parallel execution while hiding the details
    • Alternative scheduling schemes, execution models and data partitioning types
  – Shows good scalability

Thank you for listening …

Questions
