GLOBAL CONFERENCE ON SIGNAL AND INFORMATION PROCESSING 1
A Hybrid Task Graph Scheduler for High Performance Image Processing Workflows
TIMOTHY BLATTNER
NIST | UMBC
12/15/2015
Outline
Introduction
Challenges
Image Stitching
Hybrid Task Graph Scheduler
Preliminary Results
Conclusions
Future Work
Credits
Walid Keyrouz (NIST)
Milton Halem (UMBC)
Shuvra Bhattacharyya (UMD)
Introduction
Hardware landscape is changing
Traditional software approaches to extracting performance from the hardware
◦ Reaching complexity limit
  ◦ Multiple GPUs on a node
  ◦ Complex memory hierarchies
We present a novel abstract machine model
◦ Hybrid task graph scheduler
  ◦ Hybrid pipeline workflows
◦ Scope: Single node with multiple CPUs and GPUs
◦ Emphasis on
  ◦ Execution pipelines to scale to multiple GPUs/CPU sockets
  ◦ Memory interface to attach to hierarchies of memory
◦ Can be expanded beyond single node (clusters)
Introduction – Future Architectures
Future hybrid architecture generation
◦ Few fat cores with many more simpler cores
  ◦ Intel Knights Landing
  ◦ POWER 9 + NVIDIA Volta + NVLink
    ◦ Sierra cluster
◦ Faster interconnect
◦ Deeper memory hierarchy
Programming methods must present the right machine model to programmers so they can extract performance
Figure: NVIDIA Volta GPU (nvidia.com)
Introduction – Data transfer costs
Copying data between address spaces is expensive
◦ PCI Express bottleneck
Current hybrid CPU+GPU systems contain multiple independent address spaces
◦ Unification of the address spaces
  ◦ Simplification for programmer
  ◦ Good for prototyping
  ◦ Obscures the cost of data motion
Techniques for improving hybrid utilization
◦ Have enough computation per data element
◦ Overlap data motion with computation
◦ Faster bus (80 GB/s NVLink versus 16 GB/s PCIe)
  ◦ NVLink requires multiple GPUs to reach peak performance [NVLink Whitepaper 2014]
Introduction – Complex Memory Hierarchies
Data locality is becoming more complex
◦ Non-volatile storage devices
  ◦ NVMe
  ◦ 3D XPoint (future)
  ◦ SATA SSD
  ◦ SATA HDD
◦ Volatile memories
  ◦ HBM / 3D stacked
  ◦ DDR
  ◦ GPU Shared Memory / L1, L2, L3 Cache
Need to model these memories within programming methods
◦ Effectively utilize based on size and speed
  ◦ Hierarchy-aware programming
Figure: Memory hierarchies: speed, cost, and capacity. [Ang et al. 2014]
Key Challenges
Changing H/W landscape
◦ Hierarchy-aware programming
  ◦ Manage data locality
◦ Wider data transfer channels
  ◦ Requires multi-GPU computation
  ◦ NVLink
◦ Hybrid computing
  ◦ Utilize all compute resources
A programming and execution machine model is needed to address the above challenges
◦ Hybrid Task Graph Scheduler (HTGS) model
  ◦ Expands on hybrid pipeline workflows [Blattner 2013]
Hybrid Pipeline Workflows
Hybrid pipeline workflow system
◦ Schedule tasks using a multiple-producer multiple-consumer model
◦ Prototype in 2013 Master’s thesis [Blattner 2013]
  ◦ Kept all GPUs busy
    ◦ Execution pipelines, one per GPU
  ◦ Stayed within memory limits
  ◦ Overlapped data motion with computation
  ◦ Tailored for image stitching
  ◦ Required significant programming effort to implement
    ◦ Prevent race conditions, manage dependencies, and maintain memory limits
We expand on hybrid pipeline workflows
◦ Formulate a model for a variety of algorithms
  ◦ Will reduce programmer effort
◦ Hybrid Task Graph Scheduler (HTGS)
Hybrid Workflow Impact – Image Stitching
Image Stitching
◦ Addresses the scale mismatch between microscope field of view and a plate under study
◦ Need to ‘stitch’ overlapping images to form one large image
◦ Three compute stages
  ◦ (S1) Fast Fourier Transform (FFT) of an image
  ◦ (S2) Phase correlation image alignment method (PCIAM) [Kuglin & Hines 1975]
  ◦ (S3) Cross correlation factors (CCFs)
Figure: Image stitching dataflow graph
Hybrid Workflow Impact – Image Stitching
Implementation using traditional parallel techniques (Simple-GPU)
◦ Port computationally intensive components to the GPU
◦ Copy to/from GPU as needed
◦ 1.14x end-to-end speedup compared to a sequential CPU-only implementation
  ◦ Data motion dominated the run-time
Implementation using hybrid workflow system
◦ Reuse existing compute kernels
◦ 24x end-to-end speedup compared to Simple-GPU
◦ Scales using multiple GPUs (~1.8x from one to two GPUs)
◦ Requires significant programming effort
[Blattner et al. 2014]
HTGS Motivation
Performance gains using a hybrid pipeline workflow
Figure 1: Simple-GPU Profile
Figure 2: Hybrid Workflow Profile
HTGS Motivation
Transforming dataflow graphs into task graphs
Dataflow and Task Graphs
Both contain a series of vertices and edges
◦ A vertex is a task/compute function
  ◦ Implements a function applied on data
◦ An edge is data flowing between tasks
◦ Main difference between dataflow and task graphs
  ◦ Scheduling
◦ Effective method for representing MIMD concurrency
Figure: Example task graph
Figure: Example dataflow graph
HTGS Motivation
Scale to multiple GPUs
◦ Partition task graph into sub-graphs
◦ Bind sub-graphs to separate GPUs
Memory interface
◦ Represent separate address spaces
  ◦ CPU
  ◦ GPU
◦ Managing complex memory hierarchies (future)
Overlap computation with I/O
◦ Pipeline computation with I/O
Hybrid Task Graph Scheduler Model
Four primary components
◦ Tasks
◦ Data
◦ Dependency Rules
◦ Memory Rules
Construct task graphs using the four components
◦ Vertices are tasks
◦ Edges are data flow
Figure: Task graph
Hybrid Task Graph Scheduler Model
Tasks
◦ Programmer implements ‘execute’
  ◦ Defines functionality of the task
◦ Special task types
  ◦ GPU Tasks
    ◦ Binds to device prior to execution
  ◦ Bookkeeper
    ◦ Manages dependencies
◦ Threading
  ◦ Each task is bound to one or more threads in a thread pool
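The task bullets above can be illustrated with a minimal Java sketch. It is not the HTGS API: `SketchTask`, `TaskRunner`, and `TaskSketchDemo` are hypothetical names, and the use of `BlockingQueue` edges with an empty `Optional` as an end-of-stream marker is an assumption made for this example. The programmer supplies only `execute`; the runner binds the task to a chosen number of threads.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical names throughout; this is NOT the HTGS API.
interface SketchTask<I, O> {
    // The only method the programmer writes: the task's functionality.
    O execute(I input);
}

// Binds one task to a fixed number of worker threads that share its queues.
final class TaskRunner<I, O> {
    private final SketchTask<I, O> task;
    private final BlockingQueue<Optional<I>> in;  // incoming edge
    private final BlockingQueue<O> out;           // outgoing edge
    private final List<Thread> workers = new ArrayList<>();

    TaskRunner(SketchTask<I, O> task, BlockingQueue<Optional<I>> in,
               BlockingQueue<O> out, int numThreads) {
        this.task = task;
        this.in = in;
        this.out = out;
        for (int i = 0; i < numThreads; i++) {
            workers.add(new Thread(this::workLoop));
        }
    }

    void start() {
        for (Thread t : workers) {
            t.start();
        }
    }

    // Each worker consumes until it sees the empty end-of-stream marker.
    private void workLoop() {
        try {
            while (true) {
                Optional<I> item = in.take();
                if (!item.isPresent()) {
                    return;
                }
                out.put(task.execute(item.get()));
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Sends one end-of-stream marker per worker, then waits for all of them.
    void finish() {
        try {
            for (int i = 0; i < workers.size(); i++) {
                in.put(Optional.empty());
            }
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}

final class TaskSketchDemo {
    // Squares every value using a task bound to numThreads threads.
    static List<Integer> squareAll(List<Integer> values, int numThreads) {
        BlockingQueue<Optional<Integer>> in = new LinkedBlockingQueue<>();
        BlockingQueue<Integer> out = new LinkedBlockingQueue<>();
        TaskRunner<Integer, Integer> runner =
                new TaskRunner<Integer, Integer>(x -> x * x, in, out, numThreads);
        runner.start();
        for (Integer v : values) {
            in.add(Optional.of(v));
        }
        runner.finish();
        List<Integer> results = new ArrayList<>(out);
        Collections.sort(results);  // completion order is nondeterministic
        return results;
    }

    public static void main(String[] args) {
        System.out.println(squareAll(Arrays.asList(3, 1, 2), 2));  // prints [1, 4, 9]
    }
}
```

Because several workers can share one input queue and several upstream tasks can feed it, the same structure covers the multiple-producer multiple-consumer scheduling mentioned earlier in the talk.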
CUDA Task
Binds CUDA graphics card to a task
◦ Provides CUDA context and stream to the execute function
◦ 1 CPU thread launches GPU kernels with thousands or millions of GPU threads
Figure: CUDA Task
Memory Interface
Attaches to a task needing reusable memory
Memory is freed based on memory rules
◦ Programmer defined
Task requests memory from manager
◦ Blocks if no memory is available
Acts as a separate channel from dataflow
Figure: Memory Manager Interface
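One way to picture the memory interface is a fixed pool of reusable buffers guarded by a programmer-defined release rule. The sketch below is an assumption-laden illustration, not the HTGS API: `MemoryRule`, `ReleaseCountRule`, `ManagedBuffer`, and `MemoryManagerSketch` are hypothetical names, and "release after a fixed number of uses" is just one possible rule.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical names throughout; this is NOT the HTGS API.
// A programmer-defined rule that decides when a buffer may be recycled.
interface MemoryRule {
    void memoryUsed();     // one consumer has finished with the buffer
    boolean canRelease();  // true once nobody needs the buffer any more
}

// Example rule: recycle the buffer after a fixed number of uses.
final class ReleaseCountRule implements MemoryRule {
    private int remaining;

    ReleaseCountRule(int uses) {
        this.remaining = uses;
    }

    @Override
    public synchronized void memoryUsed() {
        remaining--;
    }

    @Override
    public synchronized boolean canRelease() {
        return remaining <= 0;
    }
}

// A buffer together with the rule that governs its release.
final class ManagedBuffer {
    final float[] data;
    final MemoryRule rule;

    ManagedBuffer(float[] data, MemoryRule rule) {
        this.data = data;
        this.rule = rule;
    }
}

final class MemoryManagerSketch {
    private final BlockingQueue<float[]> free;

    // Pre-allocates the whole pool so memory use stays within a fixed limit.
    MemoryManagerSketch(int poolSize, int bufferLength) {
        free = new ArrayBlockingQueue<>(poolSize);
        for (int i = 0; i < poolSize; i++) {
            free.add(new float[bufferLength]);
        }
    }

    // Blocks the calling task until a buffer is available.
    ManagedBuffer request(MemoryRule rule) throws InterruptedException {
        return new ManagedBuffer(free.take(), rule);
    }

    // Non-blocking variant, included only to keep the usage example deterministic.
    ManagedBuffer tryRequest(MemoryRule rule) {
        float[] data = free.poll();
        return data == null ? null : new ManagedBuffer(data, rule);
    }

    // Records one use; the buffer returns to the pool only if its rule allows.
    synchronized boolean release(ManagedBuffer buffer) {
        buffer.rule.memoryUsed();
        if (buffer.rule.canRelease()) {
            return free.offer(buffer.data);
        }
        return false;
    }

    int available() {
        return free.size();
    }
}
```

With a pool of one buffer and `new ReleaseCountRule(2)`, the first `release` call leaves the buffer checked out and the second returns it to the pool, after which a blocked `request` could proceed.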
Hybrid Task Graph Scheduler Model
Execution Pipelines
◦ Encapsulates a sub-graph
◦ Creates duplicate instances of the sub-graph
  ◦ Each instance is scheduled and executed using new threads
  ◦ Can be distributed among available GPUs (one instance per GPU)
Figure: Execution Pipeline Task
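The duplication step of an execution pipeline can be sketched in Java as follows. The names `SubGraph`, `ExecutionPipelineSketch`, and `ExecutionPipelineDemo` are hypothetical, the device id is only an integer label (no GPU is touched), and the round-robin split of work across instances is an assumption of this example rather than a description of how HTGS distributes data.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.IntFunction;

// Hypothetical names throughout; this is NOT the HTGS API.
// Stands in for an encapsulated sub-graph of tasks.
interface SubGraph {
    void process(int item);
}

final class ExecutionPipelineSketch {
    private final List<SubGraph> instances = new ArrayList<>();

    // The factory builds one copy of the sub-graph per device id.
    ExecutionPipelineSketch(IntFunction<SubGraph> factory, int numDevices) {
        for (int device = 0; device < numDevices; device++) {
            instances.add(factory.apply(device));
        }
    }

    // Runs every instance on its own new thread. Items are split round-robin,
    // which is an assumption made for this sketch.
    void run(List<Integer> items) {
        List<Thread> threads = new ArrayList<>();
        int n = instances.size();
        for (int d = 0; d < n; d++) {
            final int device = d;
            Thread t = new Thread(() -> {
                SubGraph graph = instances.get(device);
                for (int i = device; i < items.size(); i += n) {
                    graph.process(items.get(i));
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
    }
}

final class ExecutionPipelineDemo {
    // Returns sorted "device:item" records showing which copy handled what.
    static List<String> run(int numDevices, int numItems) {
        ConcurrentLinkedQueue<String> log = new ConcurrentLinkedQueue<>();
        ExecutionPipelineSketch pipeline = new ExecutionPipelineSketch(
                device -> item -> log.add(device + ":" + item), numDevices);
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < numItems; i++) {
            items.add(i);
        }
        pipeline.run(items);
        List<String> result = new ArrayList<>(log);
        Collections.sort(result);
        return result;
    }
}
```

In this sketch, changing the number of devices is a one-argument change at the call site, which mirrors the idea that the sub-graph itself does not need to be rewritten to use more GPUs.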
HTGS API
Using the model, implement the HTGS API
◦ Tasks
  ◦ Default
  ◦ Bookkeeper
  ◦ Execution Pipeline
  ◦ CUDA
◦ Memory Interface
  ◦ Attaches to any task to allocate/free/update memory
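To make the bookkeeper's role concrete, here is a hedged Java sketch of a dependency rule for image stitching. It assumes that the alignment stage works on pairs of adjacent tiles and can start only once the FFTs of both tiles exist; that assumption, the class name `NeighborPairRule`, and the string encoding of a pair are all illustrative and not taken from the HTGS API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; this is NOT the HTGS API.
// Dependency rule: a pair of adjacent tiles becomes ready for alignment
// once the FFTs of both tiles have been computed.
final class NeighborPairRule {
    private final boolean[][] fftReady;
    private final int rows;
    private final int cols;

    NeighborPairRule(int rows, int cols) {
        this.rows = rows;
        this.cols = cols;
        this.fftReady = new boolean[rows][cols];
    }

    // Called when the FFT of tile (r, c) completes. Returns the newly ready
    // pairs, written as "(r1,c1)-(r2,c2)" with the north or west tile first.
    synchronized List<String> fftDone(int r, int c) {
        List<String> ready = new ArrayList<>();
        if (fftReady[r][c]) {
            return ready;  // duplicate notification: nothing new to emit
        }
        fftReady[r][c] = true;
        if (r > 0 && fftReady[r - 1][c]) {
            ready.add(pair(r - 1, c, r, c));  // north neighbor
        }
        if (r + 1 < rows && fftReady[r + 1][c]) {
            ready.add(pair(r, c, r + 1, c));  // south neighbor
        }
        if (c > 0 && fftReady[r][c - 1]) {
            ready.add(pair(r, c - 1, r, c));  // west neighbor
        }
        if (c + 1 < cols && fftReady[r][c + 1]) {
            ready.add(pair(r, c, r, c + 1));  // east neighbor
        }
        return ready;
    }

    private static String pair(int r1, int c1, int r2, int c2) {
        return "(" + r1 + "," + c1 + ")-(" + r2 + "," + c2 + ")";
    }
}
```

Each adjacent pair is emitted exactly once, by whichever of its two tiles finishes second, so downstream tasks never need to check readiness themselves.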
Prototype HTGS API – Image Stitching
Full implementation in Java
◦ Uses image stitching as a test case
Figure: Image Stitching Task Graph
Preliminary Results
Machine specifications
◦ Two Xeon E5620 (16 logical cores)
◦ Two NVIDIA Tesla C2070s and one GTX 680
◦ Libraries: JCuda and JCuFFT
◦ Baseline implementation: [Blattner et al. 2014]
◦ Problem size: 42x59 images (70% overlap)
◦ HTGS prototype similar runtime as baseline, 23.6% reduction in code size
HTGS  Exec Pipeline  GPUs  Runtime (s)  Lines of Code
                     3     29.8         949            <- Baseline: hybrid pipeline workflow
                     1     43.3         725
                     1     41.4         726
                     2     26.6         726
                     3     24.5         726
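The 23.6% code-size figure above and the 17% speedup quoted in the conclusions can be checked against the table. The arithmetic below assumes that the 949-line, 29.8 s row is the baseline hybrid pipeline workflow, that the 725-line row is the HTGS prototype without an execution pipeline, and that the 24.5 s row is the HTGS prototype on three GPUs; the pairing of rows is an inference from the slide, not something it states outright.

```latex
\[
\frac{949 - 725}{949} = \frac{224}{949} \approx 0.236
\quad \text{(about 23.6\% fewer lines of code)}
\]
\[
\frac{29.8 - 24.5}{29.8} = \frac{5.3}{29.8} \approx 0.178
\quad \text{(about 17.8\% less runtime on three GPUs)}
\]
```

Under the same reading, the step from 725 to 726 lines is consistent with the claim that multi-GPU execution is enabled by adding a single line of code.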
Conclusions
Prototype HTGS API
◦ Reduces code size by 23.6%
  ◦ Compared to the hybrid pipeline workflow implementation
◦ Speedup of 17%
◦ Enables multi-GPU execution by adding a single line of code
Coarse-grained parallelism
◦ Decomposition of algorithm and data structures
◦ Memory management
◦ Data locality
◦ Scheduling
Conclusions
The HTGS model and API
◦ Scales using multiple GPUs and CPUs
◦ Overlaps data motion with computation
◦ Keeps processors busy
◦ Memory interface for separate address spaces
◦ Restricted to single node with multiple CPUs and multiple NVIDIA GPUs
A tool to represent complex image processing algorithms that require high performance
Future Work
Release of C++ implementation of HTGS (currently in development)
Use HTGS with other classes of algorithms
◦ Out-of-core matrix multiplication and LU factorization
Expand execution pipelines to support clusters and Intel MIC
Image Stitching with LIDE++
◦ Lightweight dataflow environment [Shen, Plishker, & Bhattacharyya 2012]
◦ Tool-assisted acceleration
  ◦ Annotated dataflow graphs
  ◦ Manage memory and data motion
◦ Enhanced scheduling
  ◦ Improved concurrency
References
[Ang et al. 2014] Ang, J. A.; Barrett, R. F.; Benner, R. E.; Burke, D.; Chan, C.; Cook, J.; Donofrio, D.; Hammond, S. D.; Hemmert, K. S.; Kelly, S. M.; Le, H.; Leung, V. J.; Resnick, D. R.; Rodrigues, A. F.; Shalf, J.; Stark, D.; Unat, D.; and Wright, N. J. 2014. Abstract machine models and proxy architectures for exascale computing. In Proceedings of the 1st International Workshop on Hardware-Software Co-Design for High Performance Computing, Co-HPC ’14, 25–32. IEEE Press.
[Blattner et al. 2014] Blattner, T.; Keyrouz, W.; Chalfoun, J.; Stivalet, B.; Brady, M.; and Zhou, S. 2014. A hybrid CPU-GPU system for stitching large scale optical microscopy images. In 43rd International Conference on Parallel Processing (ICPP), 1–9.
[Blattner 2013] Blattner, T. 2013. A Hybrid CPU/GPU Pipeline Workflow System. Master’s thesis, University of Maryland Baltimore County.
[Shen, Plishker, & Bhattacharyya 2012] Shen, C.; Plishker, W.; and Bhattacharyya, S. S. 2012. Dataflow-based design and implementation of image processing applications. In Guan, L.; He, Y.; and Kung, S.-Y., eds., Multimedia Image and Video Processing, second edition, chapter 24, 609–629. CRC Press.
[Kuglin & Hines 1975] Kuglin, C. D., and Hines, D. C. 1975. The phase correlation image alignment method. In Proceedings of the 1975 IEEE International Conference on Cybernetics and Society, 163–165.
[NVLink Whitepaper 2014] NVIDIA. 2014. http://www.nvidia.com/object/nvlink.html
Thank You
Questions?