TRANSCRIPT
Rahul Sharma (Stanford)
Michael Bauer (NVIDIA Research)
Alex Aiken (Stanford)
Verification of Producer-Consumer Synchronization in GPU Programs
June 15, 2015
Outline
GPU background
Motivating examples
Verification algorithm and implementation
Results
GPU background
[Figure: GPU architecture. Off-chip global memory feeds an array of streaming multiprocessors (SMs); each SM contains many ALUs and up to 48 KB of on-chip shared memory.]
Threadblock (CTA) ~ 100s of threads; warp = 32 threads
Load data from global to shared
__syncthreads() barrier
Compute on data in shared
__syncthreads() barrier
Store data from shared to global
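A minimal sketch of this pattern in CUDA; the kernel name and the computation (reversing a tile in shared memory) are illustrative, not from the talk.

    #define TILE 256

    // Each threadblock loads a tile into shared memory, transforms it,
    // and stores it back, with __syncthreads() between the phases.
    __global__ void reverse_tile(const float *in, float *out) {
        __shared__ float buf[TILE];
        int base = blockIdx.x * TILE;

        // Load data from global to shared
        buf[threadIdx.x] = in[base + threadIdx.x];
        __syncthreads();                 // all loads visible to the whole CTA

        // Compute on data in shared: reverse the tile in place
        float v = buf[TILE - 1 - threadIdx.x];
        __syncthreads();                 // everyone has read before anyone writes
        buf[threadIdx.x] = v;
        __syncthreads();                 // all writes done before the store phase

        // Store data from shared to global
        out[base + threadIdx.x] = buf[threadIdx.x];
    }

    // Hypothetical launch: reverse_tile<<<numTiles, TILE>>>(d_in, d_out);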
Named barriers
Synchronization primitive built into hardware
16 named barriers per SM
Two instructions:
Sync: blocking
Arrive: non-blocking
Both specify a participating thread count
__syncthreads() is a special case of named barriers: Sync 0, N
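A minimal sketch of how the two instructions can be exposed in CUDA through inline PTX (bar.sync and bar.arrive are the real PTX instructions; the wrapper names are ours). Counts must be multiples of the warp size, and barrier names range over 0-15.

    // Blocking: wait until `count` threads have reached barrier `name`.
    __device__ __forceinline__ void named_sync(unsigned name, unsigned count) {
        asm volatile("bar.sync %0, %1;" :: "r"(name), "r"(count) : "memory");
    }

    // Non-blocking: register arrival at barrier `name` and continue.
    __device__ __forceinline__ void named_arrive(unsigned name, unsigned count) {
        asm volatile("bar.arrive %0, %1;" :: "r"(name), "r"(count) : "memory");
    }

    // __syncthreads() is equivalent to Sync on barrier 0 with all CTA threads.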
Encode producer-consumer patterns
[Figure: a producer warp and a consumer warp communicate through named barrier 0; the producer signals with Arrive 0,64 and the consumer waits with Sync 0,64 (64 = the two participating warps).]
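A sketch of this pattern as a 64-thread kernel (the kernel and buffer names are illustrative): warp 0 produces into shared memory and arrives without blocking; warp 1 blocks until the data is ready.

    __global__ void producer_consumer(const float *in, float *out) {
        __shared__ float buf[32];
        unsigned lane = threadIdx.x % 32;

        if (threadIdx.x < 32) {
            // Producer warp: write, then signal named barrier 0 (non-blocking).
            buf[lane] = in[lane];
            asm volatile("bar.arrive 0, 64;" ::: "memory");   // Arrive 0,64
        } else {
            // Consumer warp: block until all 64 participants have arrived.
            asm volatile("bar.sync 0, 64;" ::: "memory");     // Sync 0,64
            out[lane] = 2.0f * buf[lane];
        }
    }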
CudaDMA library (SC 2011)
Simple library to abstract data movement between GPU memories
Global to shared
Shared to global
Specialize warps:
Compute warps: do math
DMA warps: move data
Use named barriers to synchronize transfers
Use more named barriers for double buffering
[Figure: the CudaDMA handshake, two named barriers shared by compute and DMA warps]

Compute warps                DMA warps
start_xfer (arrive 0,N)      wait_start (sync 0,N)
                             Load data into shared buffer
wait_finish (sync 1,N)       finish_xfer (arrive 1,N)
Compute on shared buffer     wait_start (sync 0,N)
start_xfer (arrive 0,N)      Load data into shared buffer
wait_finish (sync 1,N)       finish_xfer (arrive 1,N)
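A sketch of that handshake as a loop (not the real CudaDMA API; the kernel shape, one compute warp plus one DMA warp, and all names are illustrative):

    // 64 threads: warp 0 computes, warp 1 moves data. Barrier 0 signals
    // "buffer is free to fill"; barrier 1 signals "buffer is full".
    __global__ void specialized(const float *in, float *out, int iters) {
        __shared__ float buf[32];
        bool is_dma = (threadIdx.x >= 32);
        unsigned lane = threadIdx.x % 32;

        for (int i = 0; i < iters; i++) {
            if (is_dma) {
                asm volatile("bar.sync 0, 64;" ::: "memory");    // wait_start
                buf[lane] = in[i * 32 + lane];                   // load into shared
                asm volatile("bar.arrive 1, 64;" ::: "memory");  // finish_xfer
            } else {
                asm volatile("bar.arrive 0, 64;" ::: "memory");  // start_xfer
                asm volatile("bar.sync 1, 64;" ::: "memory");    // wait_finish
                out[i * 32 + lane] = buf[lane] + 1.0f;           // compute
            }
        }
    }

Double buffering (mentioned on the slide) would add a second shared buffer and two more named barriers so the next transfer can overlap the current computation.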
Singe compiler (PPoPP 2014)
DSL compiler for combustion chemistry
Up to 4X speedup
Kernels contain 10K lines
Maps static dataflow graphs onto warps
Uses shared memory for communication
Assigns synchronization points to named barriers
Analogous to register allocation
Manages passing of data through shared memory
[Figure: a static dataflow graph (operations A through J) mapped onto warps 0-3; the synchronization points between warps are assigned named barriers 0-3.]
Named barrier challenges
Three challenges:
Named barrier reuse: must prove that it is safe to recycle named barriers
Need a happens-before relationship
Must be self-consistent
Deadlock
Shared memory races: two accesses to the same location, at least one of them a write
Warp 0     Warp 1
sync 0     arrive 1
sync 1     arrive 0
WEFT architecture
[Figure: a GPU kernel for a threadblock of n threads is compiled into one straight-line thread program per thread (0, 1, …, n); from these, WEFT computes the happens-before relation and reports improper barrier recycling, shared memory data races, and deadlocks.]
Thread programs
Omit statements irrelevant to the properties
Straight-line programs: sequences of commands
Commands:
sync b [m]
arrive b [m]
read a
write a
Restrictive, but followed by the majority of GPU code
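One possible in-memory form of these thread programs, sketched as host-side types (our own illustrative names, not WEFT's actual data structures):

    #include <cstdint>
    #include <vector>

    enum class Op { Sync, Arrive, Read, Write };

    struct Command {
        Op       op;
        unsigned barrier;  // Sync/Arrive: named barrier id b (0..15)
        unsigned count;    // Sync/Arrive: participating thread count m
        uint64_t addr;     // Read/Write: shared memory address a
    };

    // A thread program is a straight-line sequence of commands;
    // a threadblock is one program per thread.
    using ThreadProgram = std::vector<Command>;
    using Threadblock   = std::vector<ThreadProgram>;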
Well synchronization
"Synchronization pattern is deterministic": the same commands synchronize, no double duty
Obey generations
Subsumes deadlock freedom and safe recycling

Producer     Consumer
sync 0       sync 0
write a      sync 1
arrive 1     read a
sync 0       sync 0

Generation 1 of barrier 0: the first sync 0 in each warp
Generation 1 of barrier 1: the producer's arrive 1 with the consumer's sync 1
Generation 2 of barrier 0: the final sync 0 in each warp
Check well synchronization
Need to know:
Which commands synchronize together
What the generation of the corresponding barrier is
First challenge: how to infer this information?
Generations are invariant over all executions
Statically emulate one execution and record the synchronization (sketched below)
Check that all executions respect the generations
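A simplified sketch of the emulation step (our own rendering of the idea, not WEFT's implementation); it reuses the Command/Threadblock types from the earlier sketch. Threads advance until they block at a sync; when a barrier's arrival count is reached, the barrier fires, a new generation is recorded for exactly the commands that arrived, and blocked threads resume. A thread still blocked when no progress is possible signals deadlock.

    #include <cstdio>
    #include <utility>
    #include <vector>

    struct Pending {                                  // in-flight generation of one barrier
        std::vector<std::pair<size_t, size_t>> cmds;  // (thread, pc) of each arrival
        unsigned arrived = 0;
    };

    void emulate(const Threadblock &tb) {
        std::vector<size_t> pc(tb.size(), 0);         // per-thread program counter
        std::vector<bool> blocked(tb.size(), false);
        Pending pending[16];                          // one slot per named barrier
        unsigned generation[16] = {0};

        bool progress = true;
        while (progress) {
            progress = false;
            for (size_t t = 0; t < tb.size(); t++) {
                while (!blocked[t] && pc[t] < tb[t].size()) {
                    const Command &c = tb[t][pc[t]];
                    if (c.op == Op::Read || c.op == Op::Write) { pc[t]++; continue; }
                    progress = true;
                    Pending &p = pending[c.barrier];
                    p.cmds.push_back({t, pc[t]});
                    p.arrived++;
                    if (c.op == Op::Arrive) pc[t]++; else blocked[t] = true;
                    if (p.arrived == c.count) {       // the barrier fires:
                        generation[c.barrier]++;      // these commands synchronize
                        for (auto [ti, ci] : p.cmds) {
                            printf("thread %zu cmd %zu: barrier %u gen %u\n",
                                   ti, ci, c.barrier, generation[c.barrier]);
                            if (tb[ti][ci].op == Op::Sync) { blocked[ti] = false; pc[ti]++; }
                        }
                        p = Pending{};
                    }
                }
            }
        }
        // Any thread still blocked here: the emulated execution deadlocks.
    }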
Happens before
HB relation: reachability. A happens before B if there is a path from A to B
The path must contain at least one synchronization edge (drawn black in the figure)
Check that successive generations have an HB relationship
Main result: the HB relation is sound and precise
[Figure: happens-before graph for the producer-consumer example above; program-order edges within each warp, plus synchronization edges between the warps, order generation 1 of barrier 0 before generation 2.]
Data races
For every two commands that can race, check for an HB relationship (sketched below)
Sound and complete for race detection
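The check, written out in its naïve pairwise form (WEFT's optimizations avoid enumerating all pairs); happens_before is an assumed oracle for HB reachability, and the types come from the thread-program sketch above.

    #include <functional>

    using HB = std::function<bool(size_t t1, size_t c1, size_t t2, size_t c2)>;

    // Two accesses to the same location, at least one of them a write.
    static bool conflicting(const Command &a, const Command &b) {
        bool mem_a = (a.op == Op::Read || a.op == Op::Write);
        bool mem_b = (b.op == Op::Read || b.op == Op::Write);
        return mem_a && mem_b && a.addr == b.addr &&
               (a.op == Op::Write || b.op == Op::Write);
    }

    bool race_free(const Threadblock &tb, const HB &happens_before) {
        for (size_t t1 = 0; t1 < tb.size(); t1++)
          for (size_t t2 = t1 + 1; t2 < tb.size(); t2++)
            for (size_t c1 = 0; c1 < tb[t1].size(); c1++)
              for (size_t c2 = 0; c2 < tb[t2].size(); c2++)
                if (conflicting(tb[t1][c1], tb[t2][c2]) &&
                    !happens_before(t1, c1, t2, c2) &&
                    !happens_before(t2, c2, t1, c1))
                    return false;   // unordered conflicting accesses: a race
        return true;
    }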
Implementation
Naïve implementation does not scale
Extensive optimizations: four orders of magnitude improvement
n: total commands across all thread programs
Memory: …
Time: …
Evaluation (Singe kernels)
Discovered bugs:
Write-after-read races
Benign data races
All kernels were well synchronized
Conclusion
GPUs are much more flexible than people realize: named barriers enable new ways of using them
Use of named barriers can create many complications: deadlock, improper recycling, data races
Providing good software verification is important: it is necessary to make named barriers easy to use
WEFT verifies code with named barriers: the algorithm is both sound and complete
Handles real production code efficiently
https://github.com/lightsighter/Weft