TRANSCRIPT
University of Wisconsin
Petascale Tools Workshop, Madison, WI
August 4-7th, 2014
The Hybrid Model: Experiences at Extreme Scale
Benjamin Welton
The Hybrid Model

o TBON + X
o Leveraging TBONs, GPUs, and CPUs in large-scale computation
o The combination creates a new computational model with new challenges
  o Management of multiple devices, local-node load balancing, and node-level data management
o Traditional distributed systems problems get worse
  o Cluster-wide load balancing, I/O management, and debugging
MRNet and GPUs

o To get more experience with GPUs at scale, we built a leadership-class application called Mr. Scan
o Mr. Scan is a density-based clustering algorithm utilizing GPUs
  o The first application able to cluster multi-billion-point datasets
  o Uses MRNet as its distribution framework
o However, we ran into some challenges
  o Load balancing, debugging, and I/O inhibited performance and increased development time
Density-Based Clustering

o Discovers the number of clusters
o Finds oddly-shaped clusters
Clustering Example (DBSCAN[1])

Goal: Find regions that meet minimum density and spatial distance characteristics

o The two parameters that determine whether a point is in a cluster are Epsilon (Eps) and MinPts (here, MinPts = 3)
o If the number of points within Eps of a point is >= MinPts, that point is a core point
o For every discovered point, the same calculation is performed until the cluster is fully expanded

[1] M. Ester et al., "A density-based algorithm for discovering clusters in large spatial databases with noise," 1996.
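The core-point and expansion rules above can be sketched as a minimal, CPU-only DBSCAN in Python. This is an illustration of the algorithm's definition, not Mr. Scan's GPU implementation; `-1` is used as the noise label.

```python
from collections import deque

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: label each 2-D point with a cluster id, or -1 for noise."""
    def neighbors(i):
        px, py = points[i]
        return [j for j, (qx, qy) in enumerate(points)
                if (px - qx) ** 2 + (py - qy) ** 2 <= eps ** 2]

    labels = [None] * len(points)      # None = unvisited, -1 = noise
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:        # not a core point
            labels[i] = -1
            continue
        # i is a core point: expand a new cluster from it
        labels[i] = cluster
        queue = deque(nbrs)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:        # previously-noise point becomes a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:  # j is also core: keep expanding
                queue.extend(j_nbrs)
        cluster += 1
    return labels
```

For example, two tight groups of four points plus one distant point yield cluster ids 0 and 1 and one noise label.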
MRNet – Multicast / Reduction Network

o General-purpose TBON API
o Network: user-defined topology
o Stream: logical data channel to a set of back-ends
  o multicast, gather, and custom reduction
o Packet: collection of data
o Filter: stream data operator
  o synchronization
  o transformation
o Widely adopted by HPC tools
  o CEPBA toolkit, Cray ATP & CCDB, Open|SpeedShop & CBTF, STAT, TAU
[Figure: Computation in a Tree-Based Overlay Network – a front-end (FE) at the root, communication processes (CP) at internal levels, and back-ends (BE) at the leaves attached to application processes, with a reduction F(x1,…,xn) applied at each level]
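The per-level reduction F(x1,…,xn) can be illustrated with a tiny simulation of one upward wave through a TBON. This is a generic sketch, not the MRNet API: `tbon_reduce`, its `fanout` parameter, and the use of plain Python callables as filters are all illustrative stand-ins.

```python
def tbon_reduce(leaves, fanout, filt):
    """Simulate one upward wave through a TBON: apply the reduction
    filter at every internal level until one result reaches the FE."""
    level = leaves
    while len(level) > 1:
        # group up to `fanout` children under each parent and reduce
        level = [filt(level[i:i + fanout]) for i in range(0, len(level), fanout)]
    return level[0]
```

For example, summing per-BE packet counts with fanout 2 computes F(F(x1,x2), F(x3,x4)); any associative filter (sum, max, a custom merge) fits the same shape.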
[Figure: MRNet tree with FE, CPs, and BEs; each BE produces 10 MB of data, and the total size of packets entering each level above is ≤10 MB]

o Adjustable for load balance
o Output sizes MUST be constant or decreasing at each level for scalability
o MRNet provides this process structure
MRNet Hybrid Computation

o A hybrid computation includes GPU processing elements alongside traditional CPU elements
o In MRNet, GPUs were included as filters
o A combination of CPU and GPU filters was used in MRNet

[Figure: MRNet tree with filters F(x1,…,xn) running at the BE, CP, and FE levels]
Intro to Mr. Scan

Mr. Scan Phases:
o Partition: Distributed
o DBSCAN: GPU (on BE)
o Merge: CPU (x #levels)
o Sweep: CPU (x #levels)

[Figure: the file system (FS) feeds the BEs, which run DBSCAN on their GPUs; Merge runs at each intermediate level of the tree, and Sweep runs at the levels up to the FE]
Mr. Scan SC 2013 Performance

Clustering 6.5 billion points: 18.2 minutes end to end, across the Partitioner, DBSCAN, and Merge & Sweep phases:
o FS Read: 224 secs
o MRNet Startup: 130 secs
o FS Read: 24 secs
o DBSCAN: 168 secs
o Merge: 6 secs
o Sweep: 4 secs
o Write Output: 19 secs
o FS Write: 489 secs
Load Balancing Issue

o In initial testing, imbalance in load between nodes was a significant limiting factor in performance
  o Increased the run time of Mr. Scan's computation phase by a factor of 10, adding an additional 25 minutes to the computation
  o Input point counts did not correlate with run times for a specific node
o Resolving the load-balance problem required numerous Mr. Scan-specific optimizations
  o Algorithmic tricks like Dense Box and heuristics in data partitioning
Partition Phase

o Goal: partitions that are computationally equivalent for DBSCAN
o Algorithm:
  o Form initial partitions
  o Add shadow regions
  o Rebalance
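The initial-partition and shadow-region steps can be sketched as follows. This is a simplified, one-dimensional illustration under assumed semantics (strips balanced by point count, with an eps-wide shadow of neighboring points duplicated on each side so per-strip DBSCAN sees its cross-boundary neighbors); the function and parameter names are hypothetical, not Mr. Scan's actual partitioner.

```python
def partition_with_shadows(points, n_parts, eps):
    """Split 2-D points into vertical strips of roughly equal point count,
    and attach each strip's eps-wide shadow regions from its neighbors."""
    xs = sorted(p[0] for p in points)
    # strip boundaries chosen so each strip holds ~equal point counts
    cuts = [xs[len(xs) * k // n_parts] for k in range(1, n_parts)]
    bounds = list(zip([float('-inf')] + cuts, cuts + [float('inf')]))
    parts = []
    for lo, hi in bounds:
        own = [p for p in points if lo <= p[0] < hi]
        # points just outside the strip that its DBSCAN run must still see
        shadow = [p for p in points
                  if (lo - eps <= p[0] < lo) or (hi <= p[0] < hi + eps)]
        parts.append((own, shadow))
    return parts
```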
GPU DBSCAN Computation

DBSCAN computation is performed in two distinct steps on the leaf nodes of the tree:
o Step 1: Detect core points
o Step 2: Expand core points and color

[Figure: both steps run across 900 thread blocks of 512 threads each (Block 1 … Block 900, threads T1 … T512)]
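The two-step split can be sketched in Python. Step 1 is the data-parallel pass (conceptually one GPU thread per point, each independently counting its eps-neighbors); Step 2 expands and colors from the core points. This is a CPU illustration of the structure, not the CUDA kernels themselves.

```python
def two_step_dbscan(points, eps, min_pts):
    """Step 1: mark core points (embarrassingly parallel on a GPU).
    Step 2: expand core points and color them with cluster ids."""
    n = len(points)

    def near(i, j):
        (ax, ay), (bx, by) = points[i], points[j]
        return (ax - bx) ** 2 + (ay - by) ** 2 <= eps ** 2

    # Step 1: detect core points -- each point's test is independent
    core = [sum(near(i, j) for j in range(n)) >= min_pts for i in range(n)]

    # Step 2: expand core points and color
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        stack = [i]
        labels[i] = cluster
        while stack:
            u = stack.pop()
            for v in range(n):
                if labels[v] == -1 and near(u, v):
                    labels[v] = cluster
                    if core[v]:          # only core points keep expanding
                        stack.append(v)
        cluster += 1
    return core, labels
```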
The DBSCAN Density Problem

o Imbalances in point density can cause huge differences in run times between thread groups inside a GPU (10-15x variance in time)
o The issue is caused by the lookup operation for a point's neighbors in the DBSCAN point-expansion phase
o Higher density within ε results in a higher neighbor count, which increases the number of comparison operations
Dense Box

o Dense Box eliminates the need to perform neighbor lookups on points in dense regions by labeling points as members of a cluster before DBSCAN is run.

1. Start with a region.
2. Divide the region of data into areas of size ε/(2√2) x ε/(2√2) for dense-area detection*.
3. For each area whose point count is >= MinPts, mark its points as members of a cluster. Do not expand these points.

* ε/(2√2) is chosen because it guarantees all points inside an area are within distance ε of each other.
Challenges of the Hybrid Model

o Debugging
  o Difficult to detect incorrect output without writing application-specific verification tools
o Load balancing
  o GPUs increased the difficulty of balancing load both cluster-wide and on a local node (due to large variance in run times with identically sized input)
  o An application-specific solution was required for load balancing
o Existing distributed framework components were stressed
  o The increased computational performance of GPUs stresses other, non-accelerated components of the system (such as I/O)
Debugging Mr. Scan

o Result verification was complicated because:
  o CUDA warp scheduling is not deterministic
  o Packet reception order is not deterministic in MRNet
o Both issues altered output slightly
  o DBSCAN non-core-point cluster selection is order dependent
  o Output cluster IDs would vary based on packet processing order in MRNet
o Easy verification of output, such as a bitwise comparison against a known correct output, was not possible
Debugging Mr. Scan

o We had to write verification tools to run after each run to ensure output was still correct
  o Very costly in terms of both programmer time (to write the tools) and wall-clock run time
o Worst of all, the tools used for verification are DBSCAN-specific
  o Generic solutions are badly needed for increased productivity
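One piece of such a verification tool can be sketched: a check that two outputs describe the same partition even when cluster IDs differ, which is exactly the case a bitwise comparison rejects. This sketch only handles ID permutation; it does not address the harder order-dependent border-point assignments the slides describe, which still need DBSCAN-specific logic.

```python
def same_clustering(labels_a, labels_b):
    """True iff the two label sequences are identical up to a one-to-one
    renaming of cluster ids (a bitwise compare would reject a mere
    id permutation caused by packet-ordering differences)."""
    if len(labels_a) != len(labels_b):
        return False
    fwd, rev = {}, {}
    for a, b in zip(labels_a, labels_b):
        # each id on one side must map to exactly one id on the other
        if fwd.setdefault(a, b) != b or rev.setdefault(b, a) != a:
            return False
    return True
```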
Load Balancing

o Load balancing between nodes proved to be a significant and serious issue
  o Identical input sizes would result in vastly differing run times (by up to 10x)
  o Without the load-balancing work, Mr. Scan would not have scaled
o An application-specific GPU load-balancing system was implemented
  o No existing frameworks could help with balancing GPU applications
Other Components

o GPU use revealed flaws that were hidden in the original non-GPU implementation of Mr. Scan
o I/O, start-up, and other components of the system impacted performance greatly
  o Accounting for a majority of the run time of Mr. Scan
o Solutions to these issues that scaled for a CPU-based application might not scale for a GPU-based application
Work in Progress

o We are currently looking at ways to perform load balancing/sharing in GPU applications in a generic way
o We are looking at methods that do not change the distributed models used by applications and require no direct vendor support
  o Getting users or hardware vendors to make massive changes to their applications/hardware is hard
Characteristics of an Ideal Load Balancing Framework

o Require as few changes to existing applications as possible
  o We cannot expect application developers to give up MPI, MapReduce, TBONs, or other computational frameworks to solve load imbalance
o Take advantage of the fine-grained computation decomposition we see with GPUs/accelerators
  o Coarse-grained solutions (such as moving entire kernel invocations/processes) limit options for balancing load
o Needs to play by the hardware vendors' "rules"
  o We cannot rely on support from hardware vendors for a distributed framework
An Idea: Automating Load Balancing

o Have a layer above the GPU but below the user application framework to manage and load-balance GPU computations across nodes
o The GPU Manager would execute user application code on the device while attempting to share load with idle GPUs

[Diagram: User Application (MPI/MRNet/MapReduce/etc.) → GPU Manager → GPU Device]
An Idea: A Load Balancing Service

1. The application supplies CUDA functions (PTX, CUBIN); each function binary is sent to the device, and a pointer to the function is saved.
2. Argument data for the functions is passed to the manager and forwarded to the device.
3. The application asks to run a function binary, supplying a data stride and a number of compute blocks; compute blocks (function pointer + data offset) are created and added to a queue.
4. A persistent kernel in the GPU pulls blocks off this queue and executes the user's function (SIMD).
5. At the completion of all queued blocks, the results are returned.
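The manager's lifecycle above can be sketched as a host-side simulation. Everything here is an assumed stand-in (`GPUManager`, `load_function`, `push_data`, `launch` are hypothetical names, plain Python callables play the role of PTX/CUBIN binaries, and a loop plays the role of the persistent kernel); it shows the queue-of-(function, offset) structure, not a real CUDA implementation.

```python
from collections import deque

class GPUManager:
    """Sketch of the proposed manager: register a function 'binary',
    stage argument data, enqueue (function, data offset) compute
    blocks, and let a persistent worker loop drain the queue."""
    def __init__(self):
        self.functions = {}   # name -> callable ("pointer to function" on device)
        self.data = []        # argument data staged on the "device"
        self.queue = deque()  # pending (function name, offset, stride) blocks

    def load_function(self, name, fn):
        self.functions[name] = fn        # binary sent to device, pointer saved

    def push_data(self, values):
        self.data.extend(values)         # argument data forwarded to device

    def launch(self, name, stride, n_blocks):
        # create one compute block per stride-sized slice of the data
        for b in range(n_blocks):
            self.queue.append((name, b * stride, stride))

    def run_persistent_kernel(self):
        # stand-in for the persistent GPU kernel pulling off the queue
        results = []
        while self.queue:
            name, off, stride = self.queue.popleft()
            results.append(self.functions[name](self.data[off:off + stride]))
        return results                   # returned once all queued blocks finish
```

For example, registering `sum`, staging eight values, and launching four blocks of stride 2 runs the function over each slice in turn.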
An Idea: A Load Balancing Service (continued)

o On detection of an idle GPU, load is shared between nodes:
  1. The user binary is transferred to the new host, and the binary is sent to its GPU.
  2. The data for the compute blocks is copied to that GPU.
  3. The blocks (function pointer + data offset) are moved, updating their data offsets.
  4. Each block is executed (SIMD), and the result is returned to the originating node.
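The migration steps can be sketched as moving pending blocks between two queues. This is a speculative illustration of the proposal, not an implemented system: `share_load` and its policy of moving roughly half the blocks are assumptions, and the key detail it demonstrates is rebasing each block's data offset to the copy of the data shipped to the new host.

```python
from collections import deque

def share_load(busy, idle, binaries, data):
    """When a peer GPU goes idle, migrate about half of the pending
    (name, offset, stride) blocks: ship each block's slice of data and
    re-enqueue the block with its offset rebased to the shipped copy."""
    moved = deque()
    shipped_data = []
    while len(busy) > len(moved):        # move roughly half the blocks
        name, off, stride = busy.pop()   # steal from the tail of the queue
        new_off = len(shipped_data)      # offset rebased to the shipped copy
        shipped_data.extend(data[off:off + stride])
        moved.appendleft((name, new_off, stride))
    idle.extend(moved)
    # the function binaries travel too, so the new host can execute them
    return dict(binaries), shipped_data
```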