
GLOBAL CONFERENCE ON SIGNAL AND INFORMATION PROCESSING 1

A Hybrid Task Graph Scheduler for High Performance Image Processing Workflows
TIMOTHY BLATTNER
NIST | UMBC

12/15/2015


Outline

Introduction

Challenges

Image Stitching

Hybrid Task Graph Scheduler

Preliminary Results

Conclusions

Future Work



Credits

Walid Keyrouz (NIST)

Milton Halem (UMBC)

Shuvra Bhattacharyya (UMD)



Introduction

Hardware landscape is changing

Traditional software approaches to extracting performance from the hardware
◦ Reaching complexity limit
◦ Multiple GPUs on a node
◦ Complex memory hierarchies

We present a novel abstract machine model
◦ Hybrid task graph scheduler
◦ Hybrid pipeline workflows
◦ Scope: Single node with multiple CPUs and GPUs
◦ Emphasis on
◦ Execution pipelines to scale to multiple GPUs/CPU sockets
◦ Memory interface to attach to hierarchies of memory
◦ Can be expanded beyond single node (clusters)



Introduction – Future Architectures

Future hybrid architecture generation
◦ Few fat cores with many more simpler cores
◦ Intel Knights Landing
◦ POWER 9 + NVIDIA Volta + NVLink
◦ Sierra cluster
◦ Faster interconnect
◦ Deeper memory hierarchy

Programming methods must present the right machine model to programmers so they can extract performance


Figure: NVIDIA Volta GPU (nvidia.com)


Introduction – Data transfer costs

Copying data between address spaces is expensive
◦ PCI Express bottleneck

Current hybrid CPU+GPU systems contain multiple independent address spaces
◦ Unification of the address spaces
◦ Simplification for programmer
◦ Good for prototyping
◦ Obscures the cost of data motion

Techniques for improving hybrid utilization
◦ Have enough computation per data element
◦ Overlap data motion with computation
◦ Faster bus (80 GB/s NVLink versus 16 GB/s PCIe)
◦ NVLink requires multiple GPUs to reach peak performance [NVLink whitepaper 2014]
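The bus rates quoted above give a simple lower bound on copy time. The sketch below is illustrative arithmetic only: the 4 GB payload and the function name `transfer_seconds` are assumptions made for the example, and real transfers rarely sustain the peak rate.

```python
# Best-case copy time at the peak bus rates quoted on the slide.
# The payload size is a made-up example, not a measurement.

GB = 1e9  # bus rates are quoted in decimal gigabytes per second


def transfer_seconds(num_bytes: float, gb_per_s: float) -> float:
    """Lower bound on the time to move num_bytes over a bus at its peak rate."""
    return num_bytes / (gb_per_s * GB)


payload = 4 * GB                          # hypothetical 4 GB of image data
pcie = transfer_seconds(payload, 16.0)    # 16 GB/s PCIe
nvlink = transfer_seconds(payload, 80.0)  # 80 GB/s NVLink

print(f"PCIe:   {pcie:.3f} s")
print(f"NVLink: {nvlink:.3f} s")
print(f"ratio:  {pcie / nvlink:.1f}x")
```

Even at peak rate the copy is not free, which is why the slide pairs a faster bus with overlapping data motion and computation.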



Introduction – Complex Memory Hierarchies

Data locality is becoming more complex
◦ Non-volatile storage devices
◦ NVMe
◦ 3D XPoint (future)
◦ SATA SSD
◦ SATA HDD
◦ Volatile memories
◦ HBM / 3D stacked
◦ DDR
◦ GPU Shared Memory / L1, L2, L3 Cache

Need to model these memories within programming methods
◦ Effectively utilize based on size and speed
◦ Hierarchy-aware programming


Figure: Memory hierarchies: speed, cost, and capacity. [Ang et al. 2014]


Key Challenges

Changing H/W landscape
◦ Hierarchy-aware programming
◦ Manage data locality
◦ Wider data transfer channels
◦ Requires multi-GPU computation
◦ NVLink
◦ Hybrid computing
◦ Utilize all compute resources

A programming and execution machine model is needed to address the above challenges
◦ Hybrid Task Graph Scheduler (HTGS) model
◦ Expands on hybrid pipeline workflows [Blattner 2013]



Hybrid Pipeline Workflows

Hybrid pipeline workflow system
◦ Schedule tasks using a multiple-producer multiple-consumer model
◦ Prototype in 2013 Master’s thesis [Blattner 2013]
◦ Kept all GPUs busy
◦ Execution pipelines, one per GPU
◦ Stayed within memory limits
◦ Overlapped data motion with computation
◦ Tailored for image stitching
◦ Required significant programming effort to implement
◦ Prevent race conditions, manage dependencies, and maintain memory limits
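A multiple-producer multiple-consumer model can be sketched with a single bounded queue shared by several threads. This is a minimal illustration, not the thesis implementation: the name `run_mpmc`, the thread counts, and the queue bound are all assumptions. The bounded queue is what keeps the number of in-flight items, and therefore memory use, under a fixed limit.

```python
import queue
import threading

SENTINEL = None  # tells a consumer that no more work will arrive


def run_mpmc(items, num_producers=2, num_consumers=3, work=lambda x: x * x):
    """Process items with several producers and consumers sharing one queue."""
    tasks = queue.Queue(maxsize=8)  # bounded: producers block when consumers lag
    results = []
    results_lock = threading.Lock()

    def producer(chunk):
        for item in chunk:
            tasks.put(item)

    def consumer():
        while True:
            item = tasks.get()
            if item is SENTINEL:
                break
            value = work(item)
            with results_lock:  # the shared list is the only contended state
                results.append(value)

    # Split the input round-robin among the producers.
    chunks = [items[i::num_producers] for i in range(num_producers)]
    producers = [threading.Thread(target=producer, args=(c,)) for c in chunks]
    consumers = [threading.Thread(target=consumer) for _ in range(num_consumers)]
    for t in producers + consumers:
        t.start()
    for t in producers:
        t.join()
    for _ in consumers:  # one sentinel per consumer, queued after all real work
        tasks.put(SENTINEL)
    for t in consumers:
        t.join()
    return sorted(results)


print(run_mpmc(list(range(10))))
```

Items must not be `None`, since that value is reserved here as the shutdown marker.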

We expand on hybrid pipeline workflows
◦ Formulates a model for a variety of algorithms
◦ Will reduce programmer effort
◦ Hybrid Task Graph Scheduler (HTGS)



Hybrid Workflow Impact – Image Stitching

Image Stitching
◦ Addresses the scale mismatch between microscope field of view and a plate under study
◦ Need to ‘stitch’ overlapping images to form one large image
◦ Three compute stages
◦ (S1) Fast Fourier Transform (FFT) of an image
◦ (S2) Phase correlation image alignment method (PCIAM) [Kuglin & Hines 1975]
◦ (S3) Cross correlation factors (CCFs)
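The idea behind stages S1 and S2 can be shown on a one-dimensional signal. The sketch below is a simplification under stated assumptions: it uses a naive O(n^2) DFT instead of an FFT library, handles only circular shifts of 1-D lists, and omits the cross correlation factors of S3. The names `dft` and `estimate_shift` are hypothetical.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform; stands in for a real FFT."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for m in range(n):
        acc = sum(x[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for k in range(n))
        out.append(acc / n if inverse else acc)
    return out


def estimate_shift(a, b):
    """Return s such that b[k] == a[(k - s) % n], using phase correlation."""
    fa, fb = dft(a), dft(b)  # (S1) forward transforms of both signals
    ncps = []
    for x, y in zip(fa, fb):  # (S2) normalized cross-power spectrum
        c = x.conjugate() * y
        ncps.append(c / abs(c) if abs(c) > 1e-12 else 0)
    corr = dft(ncps, inverse=True)  # the peak position encodes the shift
    return max(range(len(corr)), key=lambda k: corr[k].real)


signal = [3, 1, 4, 1, 5, 9, 2, 6]
shifted = [signal[(k - 3) % len(signal)] for k in range(len(signal))]
print(estimate_shift(signal, shifted))
```

Normalizing each spectral term to unit magnitude keeps only phase information, so the inverse transform concentrates into a single peak at the displacement.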

Figure: Image stitching dataflow graph



Hybrid Workflow Impact – Image Stitching

Implementation using traditional parallel techniques (Simple-GPU)
◦ Port computationally intensive components to the GPU
◦ Copy to/from GPU as needed
◦ 1.14x end-to-end speedup compared to a sequential CPU-only implementation
◦ Data motion dominated the run-time

Implementation using hybrid workflow system
◦ Reuse existing compute kernels
◦ 24x end-to-end speedup compared to Simple-GPU
◦ Scales using multiple GPUs (~1.8x from one to two GPUs)
◦ Requires significant programming effort

[Blattner et al. 2014]



HTGS Motivation

Performance gains using a hybrid pipeline workflow

Figure 1: Simple-GPU Profile

Figure 2: Hybrid Workflow Profile



HTGS Motivation

Transforming dataflow graphs into task graphs



Dataflow and Task Graphs

Both contain a series of vertices and edges
◦ A vertex is a task/compute function
◦ Implements a function applied on data
◦ An edge is data flowing between tasks
◦ Main difference between dataflow and task graphs
◦ Scheduling
◦ Effective method for representing MIMD concurrency
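A graph of this kind can be written down as a mapping of vertices to functions plus a mapping of edges. The sketch below is a toy under stated assumptions: the three task names are invented, each task has at most one upstream task, and the tasks run sequentially in a dependency-respecting order using the standard library's `graphlib`. A task graph scheduler would instead run ready tasks concurrently.

```python
from graphlib import TopologicalSorter

# Vertices: each task is a function applied to the data on its incoming edge.
tasks = {
    "read": lambda _: [1.0, 2.0, 3.0],
    "scale": lambda xs: [2 * x for x in xs],
    "total": lambda xs: sum(xs),
}
# Edges: task -> the single upstream task whose output it consumes.
edges = {"scale": "read", "total": "scale"}


def run_graph(tasks, edges):
    """Execute tasks in an order that respects the edges; return all outputs."""
    deps = {name: {edges[name]} if name in edges else set() for name in tasks}
    outputs = {}
    for name in TopologicalSorter(deps).static_order():
        upstream = edges.get(name)  # None for source tasks
        outputs[name] = tasks[name](outputs.get(upstream))
    return outputs


print(run_graph(tasks, edges)["total"])
```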

Figure: Example task graph


Figure: Example dataflow graph


HTGS Motivation

Scale to multiple GPUs
◦ Partition task graph into sub-graphs
◦ Bind sub-graph to separate GPUs

Memory interface
◦ Represent separate address spaces
◦ CPU
◦ GPU
◦ Managing complex memory hierarchies (future)

Overlap computation with I/O
◦ Pipeline computation with I/O



Hybrid Task Graph Scheduler Model

Four primary components
◦ Tasks
◦ Data
◦ Dependency Rules
◦ Memory Rules

Construct task graphs using the four components
◦ Vertices are tasks
◦ Edges are data flow
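A dependency rule can be pictured as a small stateful object that receives data and decides what downstream work has become runnable. The sketch below is hypothetical and not taken from the HTGS API: the class `NeighborRule` and its `on_data` method are invented names, and the rule assumes an image-stitching grid in which a pair of adjacent tiles can be aligned once both tiles have been transformed.

```python
class NeighborRule:
    """Hypothetical dependency rule: release a tile pair once both tiles are ready."""

    def __init__(self):
        self.ready = set()  # grid positions (row, col) that have arrived so far

    def on_data(self, tile):
        """Record a finished tile; return the adjacent pairs that are now runnable."""
        row, col = tile
        self.ready.add(tile)
        pairs = []
        for other in ((row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)):
            if other in self.ready:
                pairs.append(tuple(sorted((tile, other))))
        return pairs


rule = NeighborRule()
for tile in [(0, 0), (1, 1), (0, 1), (1, 0)]:  # tiles may arrive in any order
    print(tile, "->", rule.on_data(tile))
```

Because a pair is emitted only when its second tile arrives, each adjacent pair is released exactly once regardless of arrival order, provided no tile is delivered twice.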


Figure: Task graph



Tasks
◦ Programmer implements ‘execute’
◦ Defines functionality of the task
◦ Special task types
◦ GPU Tasks
◦ Binds to device prior to execution
◦ Bookkeeper
◦ Manages dependencies
◦ Threading
◦ Each task is bound to one or more threads in a thread pool
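The task abstraction described above, where the programmer supplies only ‘execute’ and the framework supplies the threads, might look like the following sketch. It is an assumption-laden illustration rather than the HTGS prototype: `Task`, `run`, and `SquareTask` are invented names, and plain Python threads stand in for a managed thread pool.

```python
import queue
import threading


class Task:
    """Hypothetical task: subclasses override execute(); threads feed it data."""

    def __init__(self, num_threads=1):
        self.num_threads = num_threads
        self.input = queue.Queue()
        self.output = queue.Queue()

    def execute(self, data):
        raise NotImplementedError  # the programmer defines the task's behavior

    def _worker(self):
        while True:
            data = self.input.get()
            if data is None:  # None marks the end of the input stream
                break
            self.output.put(self.execute(data))

    def run(self, items):
        """Push items through execute() on num_threads threads; return results."""
        threads = [threading.Thread(target=self._worker) for _ in range(self.num_threads)]
        for t in threads:
            t.start()
        for item in items:
            self.input.put(item)
        for _ in threads:  # one end marker per worker thread
            self.input.put(None)
        for t in threads:
            t.join()
        results = []
        while not self.output.empty():  # safe: all workers have finished
            results.append(self.output.get())
        return results


class SquareTask(Task):
    def execute(self, data):
        return data * data


print(sorted(SquareTask(num_threads=4).run([1, 2, 3])))
```

With more than one thread the output order is not deterministic, so the example sorts the results before printing.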



CUDA Task

Binds CUDA graphics card to a task
◦ Provides CUDA context and stream to the execute function
◦ 1 CPU thread launches GPU kernels with thousands or millions of GPU threads

Figure: CUDA Task



Memory Interface

Attaches to a task needing reusable memory

Memory is freed based on memory rules
◦ Programmer defined

Task requests memory from manager
◦ Blocks if no memory is available

Acts as a separate channel from dataflow
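The behavior described above resembles a fixed pool of buffers whose return to the pool is gated by a user-supplied rule. The sketch below is a guess at the shape of such an interface, not the HTGS one: `MemoryManager`, `get`, `update`, and the `can_release` callback are hypothetical names, and the example rule releases a buffer once no remaining task needs it.

```python
import queue


class MemoryManager:
    """Hypothetical fixed-size buffer pool guarded by a programmer-defined rule."""

    def __init__(self, num_buffers, buffer_size, can_release):
        self.can_release = can_release  # memory rule: buffer state -> bool
        self.pool = queue.Queue()
        for _ in range(num_buffers):
            self.pool.put(bytearray(buffer_size))

    def get(self, timeout=None):
        """Hand out a buffer; blocks (up to timeout) while the pool is empty."""
        return self.pool.get(timeout=timeout)

    def update(self, buffer, state):
        """Return the buffer to the pool only if the memory rule allows it."""
        if self.can_release(state):
            self.pool.put(buffer)
            return True
        return False


# Example rule: a buffer is recycled once no remaining task needs it.
manager = MemoryManager(num_buffers=1, buffer_size=16,
                        can_release=lambda uses_left: uses_left == 0)
buf = manager.get()
print(manager.update(buf, 2))  # still needed, so it stays out of the pool
print(manager.update(buf, 0))  # no users left, so it is recycled
```

Capping the pool size is what bounds memory use: a task that asks for a buffer when none is free simply waits until the rule releases one.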

Figure: Memory Manager Interface




Execution Pipelines
◦ Encapsulates a sub graph
◦ Creates duplicate instances of the sub graph
◦ Each instance is scheduled and executed using new threads
◦ Can be distributed among available GPUs (one instance per GPU)
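Duplicating a sub-graph once per device can be sketched as below. Several things here are assumptions for illustration: no GPU is actually bound (`device_id` is only a label), the two-stage sub-graph is invented, and the round-robin split of the input is one possible decomposition that the slides do not specify.

```python
import threading


def make_pipeline(device_id):
    """Build one copy of a (hypothetical) two-stage sub-graph tied to a device."""

    def stage_one(x):
        return x + 1

    def stage_two(x):
        return (device_id, x * 2)  # tag each result with the copy that made it

    return lambda x: stage_two(stage_one(x))


def run_execution_pipelines(items, num_devices):
    """Duplicate the sub-graph once per device and split the input among copies."""
    results = [None] * len(items)

    def worker(device_id):
        pipeline = make_pipeline(device_id)
        # Static round-robin decomposition: copy d handles items d, d+n, d+2n, ...
        for i in range(device_id, len(items), num_devices):
            results[i] = pipeline(items[i])

    threads = [threading.Thread(target=worker, args=(d,)) for d in range(num_devices)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run_execution_pipelines([0, 1, 2, 3], num_devices=2))
```

Changing `num_devices` is the only edit needed to use more copies, which mirrors the appeal of adding GPUs without restructuring the graph.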


Figure: Execution Pipeline Task


HTGS API

Using the model, implement the HTGS API
◦ Tasks
◦ Default
◦ Bookkeeper
◦ Execution Pipeline
◦ CUDA
◦ Memory Interface
◦ Attaches to any task to allocate/free/update memory



Prototype HTGS API – Image Stitching

Full implementation in Java
◦ Uses image stitching as a test case

Figure: Image Stitching Task Graph



Preliminary Results

Machine specifications
◦ Two Xeon E5620 (16 logical cores)
◦ Two NVIDIA Tesla C2070s and one GTX 680
◦ Libraries: JCuda and JCuFFT
◦ Baseline implementation: [Blattner et al. 2014]
◦ Problem size: 42x59 images (70% overlap)
◦ HTGS prototype has a runtime similar to the baseline, with a 23.6% reduction in code size

HTGS | Exec Pipeline | GPUs | Runtime (s) | Lines of Code
     |               | 3    | 29.8        | 949   <- Baseline hybrid pipeline workflow
     |               | 1    | 43.3        | 725
     |               | 1    | 41.4        | 726
     |               | 2    | 26.6        | 726
     |               | 3    | 24.5        | 726
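The headline percentages can be checked against the table values with a few lines of arithmetic. The pairing of rows is an assumption: the 29.8 s, 949-line row is taken as the baseline, 725 lines as the HTGS code size, and 24.5 s as the HTGS run on three GPUs. The helper name `percent_reduction` is invented for this sketch.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative decrease from before to after, in percent."""
    return (before - after) / before * 100.0


# Values read from the results table; which row plays which role is an assumption.
code_reduction = percent_reduction(949, 725)        # baseline vs. HTGS lines of code
runtime_reduction = percent_reduction(29.8, 24.5)   # baseline vs. HTGS on 3 GPUs
scaling_1_to_3 = 41.4 / 24.5                        # 726-line rows, 1 GPU vs. 3 GPUs

print(f"code size reduction: {code_reduction:.1f}%")
print(f"runtime reduction:   {runtime_reduction:.1f}%")
print(f"1 -> 3 GPU scaling:  {scaling_1_to_3:.2f}x")
```

Under these assumptions the code size figure reproduces the quoted 23.6%, and the runtime comparison lands close to the quoted 17%.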



Conclusions

Prototype HTGS API
◦ Reduces code size by 23.6%
◦ Compared to the hybrid pipeline workflow implementation
◦ Speedup of 17%
◦ Enables multi-GPU execution by adding a single line of code

Coarse-grained parallelism
◦ Decomposition of algorithm and data structures
◦ Memory management
◦ Data locality
◦ Scheduling



Conclusions

The HTGS model and API
◦ Scales using multiple GPUs and CPUs
◦ Overlaps data motion
◦ Keeps processors busy
◦ Memory interface for separate address spaces
◦ Restricted to single node with multiple CPUs and multiple NVIDIA GPUs

A tool to represent complex image processing algorithms that require high performance



Future Work

Release of C++ implementation of HTGS (currently in development)

Use HTGS with other classes of algorithms
◦ Out-of-core matrix multiplication and LU factorization

Expand execution pipelines to support clusters and Intel MIC

Image Stitching with LIDE++
◦ Lightweight dataflow environment [Shen, Plishker, & Bhattacharyya 2012]
◦ Tool-assisted acceleration
◦ Annotated dataflow graphs
◦ Manage memory and data motion
◦ Enhanced scheduling
◦ Improved concurrency



References

[Ang et al. 2014] Ang, J. A.; Barrett, R. F.; Benner, R. E.; Burke, D.; Chan, C.; Cook, J.; Donofrio, D.; Hammond, S. D.; Hemmert, K. S.; Kelly, S. M.; Le, H.; Leung, V. J.; Resnick, D. R.; Rodrigues, A. F.; Shalf, J.; Stark, D.; Unat, D.; and Wright, N. J. 2014. Abstract machine models and proxy architectures for exascale computing. In Proceedings of the 1st International Workshop on Hardware-Software Co-Design for High Performance Computing, Co-HPC ’14, 25–32. IEEE Press.

[Blattner et al. 2014] Blattner, T.; Keyrouz, W.; Chalfoun, J.; Stivalet, B.; Brady, M.; and Zhou, S. 2014. A hybrid CPU-GPU system for stitching large scale optical microscopy images. In 43rd International Conference on Parallel Processing (ICPP), 1–9.

[Blattner 2013] Blattner, T. 2013. A Hybrid CPU/GPU Pipeline Workflow System. Master’s thesis, University of Maryland Baltimore County.

[Shen, Plishker, & Bhattacharyya 2012] Shen, C.; Plishker, W.; and Bhattacharyya, S. S. 2012. Dataflow-based design and implementation of image processing applications. In Guan, L.; He, Y.; and Kung, S.-Y., eds., Multimedia Image and Video Processing, second edition, chapter 24, 609–629. CRC Press.

[Kuglin & Hines 1975] Kuglin, C. D., and Hines, D. C. 1975. The phase correlation image alignment method. In Proceedings of the 1975 IEEE International Conference on Cybernetics and Society, 163–165.

[NVLink whitepaper 2014] NVIDIA. 2014. http://www.nvidia.com/object/nvlink.html



Thank You

Questions?
