Increasing specialization
Need to program these accelerators
Challenges:
1. Consistent pointers
2. Data movement
3. Security
(Fast)
This talk: GPGPUs
*NVIDIA via anandtech.com
Programming accelerators (baseline)
int main() {
    int a[N], b[N], c[N];
    init(a, b, c);
    add(a, b, c);
    return 0;
}

void add(int *a, int *b, int *c) {
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }
}
Accelerator-side code / CPU-side code
Programming accelerators (GPU)
void add_gpu(int *a, int *b, int *c) {
    for (int i = get_global_id(0); i < N; i += get_global_size(0)) {
        c[i] = a[i] + b[i];
    }
}
int main() {
    int a[N], b[N], c[N];
    init(a, b, c);
    add_gpu(a, b, c);
    return 0;
}
Accelerator-side code / CPU-side code
Programming accelerators (GPU, today)
void add_gpu(int *a, int *b, int *c) {
    for (int i = get_global_id(0); i < N; i += get_global_size(0)) {
        c[i] = a[i] + b[i];
    }
}
int main() {
    int a[N], b[N], c[N];
    int *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, N*sizeof(int));
    cudaMalloc(&d_b, N*sizeof(int));
    cudaMalloc(&d_c, N*sizeof(int));
    init(a, b, c);
    cudaMemcpy(d_a, a, N*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, N*sizeof(int), cudaMemcpyHostToDevice);
    add_gpu(d_a, d_b, d_c);
    cudaMemcpy(c, d_c, N*sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
Accelerator-side code / CPU-side code
Programming accelerators (GOAL)
void add_gpu(int *a, int *b, int *c) {
    for (int i = get_global_id(0); i < N; i += get_global_size(0)) {
        c[i] = a[i] + b[i];
    }
}
int main() {
    int a[N], b[N], c[N];
    init(a, b, c);
    add_gpu(a, b, c);
    return 0;
}
Accelerator-side code / CPU-side code
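For concreteness, here is a hedged sketch of what this goal looks like with today's CUDA Unified Memory: cudaMallocManaged returns one pointer valid on both CPU and GPU, so the explicit copies disappear. The grid-stride kernel mirrors add_gpu above; the launch configuration and init values are illustrative, not from the talk.

#include <cuda_runtime.h>
#include <stdio.h>

#define N 1024

// Grid-stride kernel: the CUDA analogue of the add_gpu shown above.
__global__ void add_gpu(int *a, int *b, int *c) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < N;
         i += gridDim.x * blockDim.x) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    int *a, *b, *c;
    // One allocation, one pointer, valid on both CPU and GPU.
    cudaMallocManaged(&a, N * sizeof(int));
    cudaMallocManaged(&b, N * sizeof(int));
    cudaMallocManaged(&c, N * sizeof(int));

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }   // init on the CPU

    add_gpu<<<8, 128>>>(a, b, c);   // no explicit copies
    cudaDeviceSynchronize();        // wait before the CPU reads c

    printf("c[10] = %d\n", c[10]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}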
Key challenges

[Figure: a CPU with an MMU and cache. A virtual-address pointer (a[i], ld: 0x1000000) is translated by the MMU to a physical address (0x5000) in memory.]
Key challenges

[Figure: the CPU (MMU + cache) and a GPU (MMU) both access shared memory; the same virtual-address pointer (a[i], ld: 0x1000000 → physical 0x5000) must translate consistently on both.]

Consistent pointers
Data movement
Consistent pointers: Supporting x86-64 Address Translation for 100s of GPU Lanes [HPCA 2014]

Data movement: Heterogeneous System Coherence [MICRO 2013]
Heterogeneous System

[Figure: four CPU cores (each with private L1 and L2) and eight GPU cores (private L1s, one shared L2) connected through a directory to a shared memory controller.]
Why not CPU solutions?

It's all about bandwidth!
Translating 100s of addresses
500 GB/s at the directory (many accesses per cycle)

[Chart: theoretical memory bandwidth in GB/s. *NVIDIA via anandtech.com]
Consistent pointers: Supporting x86-64 Address Translation for 100s of GPU Lanes [HPCA 2014]

Data movement: Heterogeneous System Coherence [MICRO 2013]
Why virtual addresses?

Virtual memory
Why virtual addresses?

[Figure: with a separate GPU address space you cannot simply copy data; the data must be transformed to use new pointers. With virtual memory, a copy (or no copy at all) just works.]
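To make the "transform to new pointers" step concrete, a minimal sketch (the names and staging layout are assumed, not from the talk): copying a linked list into a separate GPU address space forces every embedded pointer to be rewritten, whereas with shared virtual memory the copied bytes are already valid.

#include <stddef.h>

struct node { int value; struct node *next; };

// Sketch: the list's nodes have been staged contiguously at host_stage
// and will live at dev_base in the GPU's separate address space; every
// next pointer must be rebased before the copy is usable. (Assumes all
// nodes live inside the staged array.) With shared virtual memory this
// whole pass disappears: the host pointers are already valid.
static void fixup_pointers(struct node *host_stage, size_t count,
                           struct node *dev_base) {
    for (size_t i = 0; i < count; i++) {
        if (host_stage[i].next != NULL) {
            size_t off = (size_t)(host_stage[i].next - host_stage);
            host_stage[i].next = dev_base + off;   // host address -> device address
        }
    }
}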
Bandwidth problem

[Figure: the CPU's TLB turns virtual memory requests into physical memory requests.]
Bandwidth problem

[Figure: one GPU core's processing elements: 16 lanes, each issuing its own translation requests (0.45x).]
Solution: Filtering

[Figure: the same 16 lanes (one GPU core), now filtered through shared memory (scratchpad) and a coalescer before the TLB; the request rate drops from 1x to 0.06x.]
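As a rough illustration of the filtering idea (the lane count and page size are assumptions): coalescing per-lane addresses into unique pages means the post-coalescer TLB sees one probe per page touched, not one per lane.

#include <stdint.h>
#include <stddef.h>

#define LANES 16        // one GPU core's SIMD width (assumption)
#define PAGE_SHIFT 12   // 4 KB pages

// Sketch: collapse the lanes' virtual addresses into unique page
// numbers; the post-coalescer TLB is probed once per unique page.
static size_t coalesce_to_pages(const uint64_t vaddr[LANES],
                                uint64_t pages_out[LANES]) {
    size_t n = 0;
    for (int lane = 0; lane < LANES; lane++) {
        uint64_t page = vaddr[lane] >> PAGE_SHIFT;
        int seen = 0;
        for (size_t j = 0; j < n; j++)        // tiny linear scan (n <= 16)
            if (pages_out[j] == page) { seen = 1; break; }
        if (!seen)
            pages_out[n++] = page;            // one TLB probe for this page
    }
    return n;   // unit-stride code often yields n == 1 instead of 16
}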
Poor performance: average 3x slowdown
Bottleneck 1: Bursty TLB misses

Average: 60 outstanding requests
Max: 140 requests
Huge queuing delays

Solution: highly-threaded page table walker (see the sketch below)
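A simplified sketch of what each walker context does, assuming a standard x86-64 four-level page table (the memory stub is an assumption for the sketch): each walk is several dependent memory reads, so a single blocking walker would serialize the ~60 outstanding misses, while many independent walk contexts overlap them.

#include <stdint.h>

// Stub standing in for a real memory read of table_pa + idx*8; the
// tiny simulated physical memory is only here to keep the sketch
// self-contained.
static uint64_t phys_mem[1 << 16];
static uint64_t read_pte(uint64_t table_pa, unsigned idx) {
    return phys_mem[(table_pa / 8 + idx) % (1 << 16)];
}

// One page-table walk: the work a single walker context performs per
// TLB miss. A highly-threaded walker keeps many of these in flight.
static uint64_t walk(uint64_t cr3, uint64_t vaddr) {
    uint64_t table = cr3;
    // PML4 -> PDPT -> PD -> PT: each level consumes 9 bits of vaddr.
    for (int level = 3; level >= 0; level--) {
        unsigned idx = (vaddr >> (12 + 9 * level)) & 0x1FF;
        uint64_t pte = read_pte(table, idx);   // one dependent memory read
        if (!(pte & 1))
            return 0;                          // not present: page fault
        table = pte & ~0xFFFULL;               // next table (or final frame)
    }
    return table | (vaddr & 0xFFF);            // physical address (4 KB pages)
}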
Bottleneck 2: High miss rate

A large 128-entry TLB doesn't help
Many address streams
Need low latency

Solution: shared page-walk cache (sketch below)
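And a correspondingly hedged sketch of a shared page-walk cache (hash-indexed and direct-mapped here only for brevity): by caching upper-level page-table entries, most walks skip straight to the leaf level even when the TLB itself misses.

#include <stdint.h>

#define PWC_ENTRIES 64

struct pwc_entry { uint64_t tag; uint64_t next_table; int valid; };
static struct pwc_entry pwc[PWC_ENTRIES];      // shared by all walkers

static uint64_t pwc_tag(int level, uint64_t table, unsigned idx) {
    return (table >> 12) ^ ((uint64_t)idx << 40) ^ ((uint64_t)level << 49);
}

// Returns 1 on hit and fills *next with the next-level table base, so
// the walker skips that level's memory read.
static int pwc_lookup(int level, uint64_t table, unsigned idx, uint64_t *next) {
    struct pwc_entry *e = &pwc[pwc_tag(level, table, idx) % PWC_ENTRIES];
    if (e->valid && e->tag == pwc_tag(level, table, idx)) {
        *next = e->next_table;
        return 1;
    }
    return 0;
}

// Called by a walker after it reads a PTE from memory on a miss.
static void pwc_fill(int level, uint64_t table, unsigned idx, uint64_t next) {
    struct pwc_entry *e = &pwc[pwc_tag(level, table, idx) % PWC_ENTRIES];
    e->tag = pwc_tag(level, table, idx);
    e->next_table = next;
    e->valid = 1;
}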
Performance: Low overhead

Worst case: 12% slowdown
Average: less than 2% slowdown
Consistent pointers: Supporting x86-64 Address Translation for 100s of GPU Lanes [HPCA 2014]

Shared virtual memory is important
Non-exotic MMU design:
• Post-coalescer L1 TLBs
• Highly-threaded page table walker
• Page walk cache
Full compatibility with minimal overhead
Still room to optimize
Consistent pointers: Supporting x86-64 Address Translation for 100s of GPU Lanes [HPCA 2014]

Data movement: Heterogeneous System Coherence [MICRO 2013]
Legacy Interface

[Figure: the heterogeneous system (CPU cores with L1/L2, GPU cores with L1s and a shared L2, directory, memory); GPU traffic bypasses the directory, with Inv arrows showing CPU cache invalidations.]

1. CPU writes memory
2. CPU initiates DMA
3. GPU direct access

High bandwidth: no directory access
CC Interface

[Figure: the same system, but all GPU traffic now flows through the directory; Inv arrows show invalidations.]

1. CPU writes memory
2. GPU access

Bottleneck: the directory
1. Access rate
2. Buffering
Directory Bottleneck 1: Access rate

[Chart: directory accesses per cycle for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm; several benchmarks exceed one access per cycle, ranging up to ~4.]

Many requests per cycle
Difficult to design a multi-ported directory
Directory Bottleneck 2: Buffering

[Chart: maximum MSHRs (log scale, 1 to 100,000) for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm.]

Must track many outstanding requests
Huge queuing delays

Solution: reduce pressure on the directory
HSC Design

[Figure: the heterogeneous system with a Region Directory in place of the block directory, and Region Buffers beside the CPU and GPU cache hierarchies; only permission traffic reaches the region directory.]

Goal: direct access (bandwidth) + cache coherence
Add: Region Directory and Region Buffers
Decouples permission from access (see the sketch below)
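A hedged sketch of the decoupling (the region size, table sizes, and names are assumptions): the region buffer caches permission for a whole region, so only the first touch of a region generates directory traffic, and later blocks in that region stream directly to memory.

#include <stdint.h>

#define REGION_SHIFT 10   // 1 KB regions (illustrative)
#define RB_ENTRIES 256

struct rb_entry { uint64_t region; int valid; int perm_write; };
static struct rb_entry rbuf[RB_ENTRIES];   // the GPU-side region buffer

// Stub for the slow path: one request to the region directory obtains
// permission for the whole region (real HSC also handles revocation).
static int region_dir_request(uint64_t region, int want_write) {
    (void)region;
    return want_write;   // grant exactly what was asked (sketch)
}

// Returns 1 when the access may stream directly to memory, bypassing
// the block-level directory entirely.
static int hsc_access(uint64_t paddr, int is_write) {
    uint64_t region = paddr >> REGION_SHIFT;
    struct rb_entry *e = &rbuf[region % RB_ENTRIES];
    if (e->valid && e->region == region && (e->perm_write || !is_write))
        return 1;                      // permission cached: zero directory traffic
    // First touch (or write upgrade): one permission request covers
    // every block in the region; later accesses hit in the buffer.
    e->region = region;
    e->perm_write = region_dir_request(region, is_write);
    e->valid = 1;
    return 1;
}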
HSC: Performance Improvement

[Chart: normalized speed-up for bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, mm, ranging up to ~5x.]
Data movement: Heterogeneous System Coherence

Want cache coherence without sacrificing bandwidth
Major bottlenecks in current coherence implementations:
1. High bandwidth difficult to support at the directory
2. Extreme resource requirements
Heterogeneous System Coherence leverages spatial locality
Reduces bandwidth and resource requirements by 95%
Increasing specialization
Need to program these accelerators
Challenges:
1. Consistent pointers
2. Data movement
3. Security
(Fast)
This talk: GPGPUs
*NVIDIA via anandtech.com
Security & tightly-integrated accelerators

[Figure: trusted CPU cores (L1/L2) and OS-protected data and process data in memory; untrusted third-party accelerators (each with an L1 and TLB) reach memory through an IOMMU.]

What if accelerators come from 3rd parties? Untrusted!
All accesses via the IOMMU: safe, but low performance
Bypass the IOMMU: high performance, but unsafe
Border control: sandboxing accelerators

[Figure: the same system, with a Border Control unit between each untrusted accelerator and memory, alongside the IOMMU.]

Solution: Border Control [MICRO 2015]
Key idea: decouple translation from safety
Safety + performance
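A loose sketch of the key idea (the flat protection table is an assumption, not the paper's exact structure): the accelerator may cache translations for speed, but a small trusted check at its border validates every outbound physical access against what the IOMMU actually granted.

#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PROT_PAGES (1 << 20)   // flat table covering 4 GB (illustrative)

// Per-page permissions recorded by the trusted side; bit 0 = readable
// by this accelerator, bit 1 = writable.
static uint8_t prot[PROT_PAGES];

// Called by the trusted IOMMU whenever it performs a translation on
// the accelerator's behalf.
static void border_record(uint64_t paddr, bool writable) {
    prot[paddr >> PAGE_SHIFT] = (uint8_t)(1 | (writable ? 2 : 0));
}

// Called on every request leaving the accelerator. Safety does not
// depend on the accelerator's (possibly stale or malicious) cached
// translations: only pages the IOMMU actually granted will pass.
static bool border_check(uint64_t paddr, bool is_write) {
    uint8_t p = prot[paddr >> PAGE_SHIFT];
    return (p & 1) && (!is_write || (p & 2));
}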
Conclusions

Challenges:
1. Consistent addresses: GPU MMU design
2. Data movement: Heterogeneous System Coherence
3. Security: Border Control

Goal: Enable programmers to use the whole chip
Consistent pointers: Supporting x86-64 Address Translation for 100s of GPU Lanes [HPCA 2014]
Jason Power, Mark D. Hill, David A. Wood

Data movement: Heterogeneous System Coherence [MICRO 2013]
Jason Power, Arkaprava Basu*, Junli Gu*, Sooraj Puthoor*, Bradford M. Beckmann*, Mark D. Hill, Steven K. Reinhardt*, David A. Wood

Security: Border Control: Sandboxing Accelerators [MICRO 2015]
Lena E. Olson, Jason Power, Mark D. Hill, David A. Wood

Contact: Jason Lowe-Power
[email protected]/~powerjg
I'm on the job market this year! Graduating in spring.
Other work

Analytic database + tightly-integrated GPUs:

When to Use 3D Die-Stacked Memory for Bandwidth-Constrained Big-Data Workloads [BPOE 2016]
Jason Lowe-Power, Mark D. Hill, David A. Wood

Towards GPUs Being Mainstream in Analytic Processing [DaMoN 2015]
Jason Power, Yinan Li, Mark D. Hill, Jignesh M. Patel, David A. Wood

Implications of Emerging 3D GPU Architecture on the Scan Primitive [SIGMOD Rec. 2015]
Jason Power, Yinan Li, Mark D. Hill, Jignesh M. Patel, David A. Wood

Simulation infrastructure:

gem5-gpu: A Heterogeneous CPU-GPU Simulator [CAL 2014]
Jason Power, Joel Hestness, Marc S. Orr, Mark D. Hill, David A. Wood
Comparison to CAPI/OpenCAPI

           Same virtual   Cache     System safety     Assumes         Allows accel.    Allows
           address space  coherent  from accelerator  on-chip accel.  physical caches  pre-translation
CAPI       Yes            Yes       Yes               No              No               No
My work    Yes            Yes       Yes               Yes             Yes              Yes

(The last three capabilities allow for high-performance accelerator optimizations.)