
IGen: The Illinois Genomics Execution Environment
Subho Sankar Banerjee, Ravishankar K. Iyer (Advisor)
University of Illinois at Urbana-Champaign

Goals
• Identify system performance bottlenecks in variant calling workflows running on extreme-scale systems
• Identify algorithmic patterns in the variant calling workflow to exploit available parallelism
• Inform the design of future genomics algorithms and their implementations on large computing systems

Application: Variant Calling Workflow
• Identifies and characterizes mutations in NGS data
• Maps NGS data to a reference genome
• Corrects for noisy data
• Differentiates strings in the presence of noise and ploidy
• First phase of the personalized-medicine flow
• Recurrently used
• Data-intensive part of NGS analytics

Performance Bottlenecks

Walltime for human genome @ 50x coverage

• 43.7 ± 0.01 h on a single node

• 28.3 ± 0.2 h on 22 Blue-Waters nodes

• Very poor resource utilization

• IO-dependent performance limitations:
  • File system as distributed memory
  • Htslib: inefficient for distributed file systems
  • Sorting large alignments on disk

Acknowledgements

This research is part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications. This material is based upon work supported in part by the National Science Foundation under Grant No. CNS 13-37732.

Efficient NGS Analytics

Genome Analytics as Data-Flow Graphs

• “Kernels” (Vertices): Data transformations

• “Patterns” (Edges): Data dependencies

[Figure: Single-Ended NGS Read Alignment as a DFG. The INDEXING phase maps over the reference sequence to find seed (k-mer) locations and tree-reduces them into a merged seed index. The ALIGNMENT phase maps over NGS reads to find seed matches (prefix-match algorithm), calculates permutations and alignment scores (Landau-Vishkin algorithm), discards poor matches, and tree-reduces to the k best locations; an all-to-all shuffle joins the reads with the seed index to produce the aligned NGS data.]
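To make the kernel/pattern vocabulary concrete, the sketch below strings the figure's main kernels (seed-index construction, seed-match lookup, k-best selection) together as plain Python functions; in ExEn these would be vertices connected by map, tree-reduce, and shuffle edges. The code is a simplified illustration under assumed data shapes, not IGen's aligner.

```python
# Tiny sketch of the alignment DFG's kernels: each function is a vertex, and
# map / tree-reduce / shuffle patterns would connect them in the runtime.
# Illustrative only, not the IGen/ExEn API.
from collections import defaultdict

def build_seed_index(reference, k=4):
    """Indexing phase: map k-mer extraction over the reference, reduce into one index."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)      # seed (k-mer) -> locations
    return index

def find_seed_matches(read, index, k=4):
    """Alignment phase: map over a read's k-mers and look up candidate locations."""
    hits = []
    for i in range(len(read) - k + 1):
        for pos in index.get(read[i:i + k], []):
            hits.append(pos - i)                 # implied alignment start position
    return hits

def k_best_locations(hits, k_best=2):
    """Tree-reduce stand-in: keep the most frequently voted start positions."""
    votes = defaultdict(int)
    for h in hits:
        votes[h] += 1
    return sorted(votes, key=votes.get, reverse=True)[:k_best]

reference = "ACGTACGAAATTACATCTTATGTATCATCGA"
reads = ["ACGTACGA", "TTACATCT"]

index = build_seed_index(reference)              # indexing phase
candidates = {r: k_best_locations(find_seed_matches(r, index)) for r in reads}
print(candidates)                                # true match starts are 0 and 10
```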

Workflow stage       Kernels
Error Correction     K-mer computation
Alignment            K-mer computation, prefix tree, edit-distance computation
Indel Re-Alignment   Edit-distance computation, table lookup
Re-Calibration       Yates correction, table lookup
Variant Calling      Entropy, convolution, assembly, edit-distance, Bayesian inference

Common kernels in the Variant Calling Workflow
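Most of these kernels are small, regular computations. As one hedged example, the sketch below shows a banded edit-distance kernel in the spirit of the Landau-Vishkin bound used during alignment; it is a minimal Python illustration, not IGen's implementation.

```python
# Minimal sketch of an edit-distance kernel with an error bound k, in the
# spirit of Landau-Vishkin-style banded alignment. Illustrative only.

def banded_edit_distance(read: str, ref: str, k: int) -> int:
    """Return the edit distance between read and ref, or k + 1 if it exceeds k."""
    n, m = len(read), len(ref)
    if abs(n - m) > k:
        return k + 1
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        curr = [i] + [k + 1] * m            # cells outside the band stay at k + 1
        lo, hi = max(1, i - k), min(m, i + k)
        for j in range(lo, hi + 1):
            cost = 0 if read[i - 1] == ref[j - 1] else 1
            curr[j] = min(prev[j] + 1,       # deletion
                          curr[j - 1] + 1,   # insertion
                          prev[j - 1] + cost)  # match / mismatch
        prev = curr
    return min(prev[m], k + 1)

# Example: a single mismatch yields distance 1.
assert banded_edit_distance("ACGTACGA", "ACGTTCGA", k=2) == 1
```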

Distributed DFG executions over heterogeneous clusters

Preliminary Results

• Execution Engine (ExEn): runtime framework for distributed, data-parallel execution of data-flow graphs on heterogeneous clusters

• Illinois Genomics Execution Environment (IGen): accelerated genomic analysis tools built on ExEn

• Synthetic human chromosome 1 @ 50x
• IGen Aligner (vs. SNAP)
  • Single node: 1.2x (35 min to 30 min)
  • Multiple nodes (10): 14x

• IGen Variant Caller (vs. GATK HaplotypeCaller)
  • Single node: 9x (36 min to 4.1 min)
  • Multiple nodes (10): 81x

Conclusions
• Measurement-driven study of performance bottlenecks in existing NGS analytics tools
• Similar performance pathologies appear across multiple tools, leaving scope for system-level optimization
• Presented a data-flow-based abstraction for NGS analytics
• Demonstrated preliminary results showing significant performance acceleration
• Simpler to build high-performance components

We would like to thank Zhengping Wang, Varun Bahl, Valerio Formicola, Luidmila Mainzer, Arjun P. Athreya, Zachary Stephens, Zbigniew Kalbarczyk, and Victor Jongeneel for their help, support, and advice.

[Figure: High-level view of the ExEn-IGen framework. ExEn provides the runtime layers: a task-farm controller (send task / done), task scheduler, language wrappers, fault tolerance, MMAP IO, and distributed PGAS memory with remote calls built on communication primitives (RDMA/MPI). IGen supplies the data-flow graph, computational kernels, synchronization patterns, data formats, and parsers on top of ExEn.]
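For intuition about the task-farm exchange in the figure, here is a minimal controller/worker sketch using mpi4py. The message tags, the priming loop, and the toy kernel are assumptions for illustration, not ExEn's actual protocol.

```python
# Minimal task-farm sketch: a controller hands tasks to workers and collects
# "done" replies. Assumes mpi4py; tags and loop structure are illustrative only.
from mpi4py import MPI

TASK, DONE, STOP = 1, 2, 3  # hypothetical message tags

def controller(comm, tasks):
    status = MPI.Status()
    results, pending = [], 0
    it = iter(tasks)
    # Prime every worker with one task (or stop it if no work is left).
    for w in range(1, comm.Get_size()):
        nxt = next(it, None)
        if nxt is None:
            comm.send(None, dest=w, tag=STOP)
        else:
            comm.send(nxt, dest=w, tag=TASK)
            pending += 1
    # Hand out the remaining tasks as workers report back.
    while pending:
        results.append(comm.recv(source=MPI.ANY_SOURCE, tag=DONE, status=status))
        nxt = next(it, None)
        if nxt is None:
            comm.send(None, dest=status.Get_source(), tag=STOP)
            pending -= 1
        else:
            comm.send(nxt, dest=status.Get_source(), tag=TASK)
    return results

def worker(comm, kernel):
    status = MPI.Status()
    while True:
        task = comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
        if status.Get_tag() == STOP:
            break
        comm.send(kernel(task), dest=0, tag=DONE)

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    if comm.Get_rank() == 0:
        print(controller(comm, tasks=range(100)))   # run with: mpiexec -n 4 ...
    else:
        worker(comm, kernel=lambda x: x * x)        # toy stand-in for a DFG kernel
```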

Ongoing Work

Accelerators
• Explore the use of GPUs for computationally heavy kernels
• Custom hardware accelerators

Deployment Mechanisms
• Containerized deployments using Docker
• Integration with HDFS and Tachyon

Improved Kernel Scheduling
• Optimal task assignment under the constraints of:
  • Affinity
  • Shared-resource contention
  • Data locality
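The sketch below shows one way such a constrained assignment could be scored with a greedy heuristic; the weights, cost terms, and data structures are hypothetical stand-ins for illustration, not the scheduler ExEn uses.

```python
# Hypothetical greedy task-assignment sketch scoring affinity, shared-resource
# contention, and data locality. Weights and cost terms are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    load: float = 0.0                                  # proxy for contention
    cached_blocks: set = field(default_factory=set)    # data already resident

@dataclass
class Task:
    name: str
    preferred_nodes: set    # affinity (e.g., node with the right accelerator)
    input_blocks: set       # data blocks the task reads

def cost(task, node, w_aff=1.0, w_load=0.5, w_data=2.0):
    affinity_penalty = 0.0 if node.name in task.preferred_nodes else 1.0
    remote_blocks = len(task.input_blocks - node.cached_blocks)  # locality miss
    return w_aff * affinity_penalty + w_load * node.load + w_data * remote_blocks

def assign(tasks, nodes):
    placement = {}
    for task in tasks:
        best = min(nodes, key=lambda n: cost(task, n))
        placement[task.name] = best.name
        best.load += 1.0                       # future tasks see the added load
        best.cached_blocks |= task.input_blocks
    return placement

nodes = [Node("n0", cached_blocks={"chr1.0"}), Node("n1")]
tasks = [Task("align-0", preferred_nodes={"n0"}, input_blocks={"chr1.0"}),
         Task("align-1", preferred_nodes={"n1"}, input_blocks={"chr1.1"})]
print(assign(tasks, nodes))   # e.g., {'align-0': 'n0', 'align-1': 'n1'}
```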

Performance Enhancements:
• Distributed execution of kernel functions
• RDMA to cut down data-serialization costs
  • RPC: control transfer
  • PGAS memory: data transfer
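As a rough illustration of the PGAS-style data path, the sketch below uses MPI one-sided communication (via mpi4py) so one rank writes a NumPy buffer directly into a peer's exposed memory window, avoiding Python object serialization; the window layout, ranks, and payload are assumptions for illustration.

```python
# Sketch of PGAS-style data transfer with MPI one-sided communication (RMA):
# rank 0 writes directly into rank 1's exposed memory window, so no object
# pickling/serialization is involved. Illustrative only; run with >= 2 ranks.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

N = 1024
local = np.zeros(N, dtype=np.float64)       # memory each rank exposes to peers
win = MPI.Win.Create(local, comm=comm)      # collective window creation

if rank == 0 and comm.Get_size() > 1:
    payload = np.arange(N, dtype=np.float64)   # e.g., alignment scores to ship
    win.Lock(1)                                # passive-target epoch on rank 1
    win.Put([payload, MPI.DOUBLE], 1)          # direct remote write, no pickling
    win.Unlock(1)                              # Put is complete after unlock

comm.Barrier()                                 # make sure the write has landed
if rank == 1:
    print("first values received:", local[:4])  # -> [0. 1. 2. 3.]
win.Free()
```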

• Efficient data formats
  • Columnar store
  • In-memory representation
  • Memory-mapped IO
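The sketch below shows the flavor of a columnar, memory-mapped layout for alignment records using NumPy; the column names, dtypes, and file names are hypothetical, not IGen's actual format.

```python
# Sketch of a columnar, memory-mapped layout for alignment records.
# Column names, dtypes, and file names are hypothetical, not IGen's format.
import numpy as np

n_records = 1_000_000
columns = {
    "ref_pos":  np.uint32,   # mapped position on the reference
    "map_qual": np.uint8,    # mapping quality
    "flags":    np.uint16,   # alignment flags
}

# One memory-mapped array per column: the OS pages data in on demand, and a
# scan over one column never touches the bytes of the other columns.
store = {
    name: np.lib.format.open_memmap(f"aln_{name}.npy", mode="w+",
                                    dtype=dtype, shape=(n_records,))
    for name, dtype in columns.items()
}

store["map_qual"][:] = 60                    # fill with dummy data
high_conf = np.count_nonzero(store["map_qual"] >= 30)
print(f"{high_conf} of {n_records} records pass the quality filter")
```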

• High performance kernel functions

• Implicitly data parallel

• Composable and pluggable model

• Reuse of computational kernels

• Potential for system-level optimizations
• Potential for accelerators
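To make the composable, pluggable kernel model concrete, here is a minimal sketch in which each kernel is a plain function over a chunk of records and a stage is a picklable chain of kernels mapped over chunks in parallel; the interface and the toy kernels are assumptions for illustration, not IGen's API.

```python
# Minimal sketch of a composable, pluggable kernel model: kernels are pure
# functions over a chunk of (read, quality) records, and a stage is a chain of
# kernels applied with a data-parallel map. Interface is illustrative only.
from multiprocessing import Pool

class Pipeline:
    """Chain kernels left to right; picklable, so it can be mapped in parallel."""
    def __init__(self, *kernels):
        self.kernels = kernels
    def __call__(self, chunk):
        for kernel in self.kernels:
            chunk = kernel(chunk)
        return chunk

def drop_low_quality(chunk, min_q=20):
    return [(read, q) for read, q in chunk if q >= min_q]

def uppercase_reads(chunk):
    return [(read.upper(), q) for read, q in chunk]

# Pluggable: swap, reorder, or extend kernels without touching the runtime.
stage = Pipeline(drop_low_quality, uppercase_reads)

if __name__ == "__main__":
    chunks = [[("acgt", 30), ("ttag", 10)], [("ggca", 25)]]
    with Pool(2) as pool:                  # implicitly data parallel over chunks
        print(pool.map(stage, chunks))     # [[('ACGT', 30)], [('GGCA', 25)]]
```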

• Data filtering process
  • In: 100 GB
  • Out: 50 MB

• Best Practices Workflow:
  • BWA for alignment
  • GATK for BQSR and realignment
  • GATK for SNP calling

• Data-parallel distributed computation

[Figure: Variant Calling and Genotyping Workflow. NGS sequencing feeds error correction, read mapping (alignment/assembly), realignment, dedup, and recalibration, followed by variant calling for SNPs and for structural variations.]

• Compute-dependent performance limitations:
  • Serialization across threads in GATK
  • Poor cache locality in Java