
Programmable Accelerators

Jason Lowe-Power

powerjg@cs.wisc.edu

cs.wisc.edu/~powerjg

Increasing specialization

Need to program these accelerators

Challenges:
1. Consistent pointers
2. Data movement
3. Security (fast)

This talk: GPGPUs
(Image: NVIDIA via anandtech.com)

Programming accelerators (baseline)

CPU-side code:

    int main() {
        int a[N], b[N], c[N];
        init(a, b, c);
        add(a, b, c);
        return 0;
    }

Accelerator-side code:

    void add(int *a, int *b, int *c) {
        for (int i = 0; i < N; i++) {
            c[i] = a[i] + b[i];
        }
    }

Programming accelerators (GPU)

Accelerator-side code:

    void add_gpu(int *a, int *b, int *c) {
        for (int i = get_global_id(0); i < N; i += get_global_size(0)) {
            c[i] = a[i] + b[i];
        }
    }

CPU-side code:

    int main() {
        int a[N], b[N], c[N];
        init(a, b, c);
        add(a, b, c);
        return 0;
    }

Programming accelerators (GPU, today)

Accelerator-side code:

    void add_gpu(int *a, int *b, int *c) {
        for (int i = get_global_id(0); i < N; i += get_global_size(0)) {
            c[i] = a[i] + b[i];
        }
    }

CPU-side code:

    int main() {
        int a[N], b[N], c[N];
        int *d_a, *d_b, *d_c;
        cudaMalloc(&d_a, N*sizeof(int));
        cudaMalloc(&d_b, N*sizeof(int));
        cudaMalloc(&d_c, N*sizeof(int));

        init(a, b, c);

        cudaMemcpy(d_a, a, N*sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b, N*sizeof(int), cudaMemcpyHostToDevice);

        add_gpu(d_a, d_b, d_c);

        cudaMemcpy(c, d_c, N*sizeof(int), cudaMemcpyDeviceToHost);

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }

Programming accelerators (GOAL)

Accelerator-side code:

    void add_gpu(int *a, int *b, int *c) {
        for (int i = get_global_id(0); i < N; i += get_global_size(0)) {
            c[i] = a[i] + b[i];
        }
    }

CPU-side code:

    int main() {
        int a[N], b[N], c[N];
        init(a, b, c);
        add_gpu(a, b, c);
        return 0;
    }
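For context, the goal above roughly corresponds to what CUDA unified memory provides today. A minimal sketch, assuming a standard CUDA toolchain; the grid size, N, and the trivial init() below are illustrative and not from the talk:

    #include <cuda_runtime.h>

    #define N 1024

    static void init(int *a, int *b, int *c) {
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; c[i] = 0; }
    }

    __global__ void add_gpu(int *a, int *b, int *c) {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < N;
             i += blockDim.x * gridDim.x)
            c[i] = a[i] + b[i];
    }

    int main() {
        int *a, *b, *c;
        // One allocation visible to both CPU and GPU: the same pointers work on both sides.
        cudaMallocManaged(&a, N * sizeof(int));
        cudaMallocManaged(&b, N * sizeof(int));
        cudaMallocManaged(&c, N * sizeof(int));
        init(a, b, c);
        add_gpu<<<4, 256>>>(a, b, c);   // no explicit copies or device buffers
        cudaDeviceSynchronize();        // wait for the GPU before the CPU reads c
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }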

Key challenges

[Figure: the CPU issues a load to the virtual-address pointer a[i] (ld: 0x1000000); the MMU translates it to a physical address (0x5000), which is used to access the cache and memory.]
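To make the translation step concrete, here is a simplified software model (my own sketch, not the talk's hardware) of the x86-64 4-level page-table walk an MMU performs on a TLB miss; the point is that each translation can cost up to four dependent memory accesses:

    #include <stdint.h>

    // Simplified model of an x86-64 4-level page-table walk (4 KB pages).
    // "Physical memory" is modeled as a flat array; a real walker issues one
    // memory access per level, which is why translation is expensive.
    #define PHYS_WORDS (1 << 20)
    static uint64_t phys_mem[PHYS_WORDS];

    static uint64_t pt_read(uint64_t paddr) {          // one walker memory access
        return phys_mem[(paddr / 8) % PHYS_WORDS];
    }

    uint64_t translate(uint64_t cr3, uint64_t vaddr) {
        uint64_t table = cr3;                           // physical base of the PML4
        for (int level = 3; level >= 0; level--) {      // PML4 -> PDPT -> PD -> PT
            uint64_t index = (vaddr >> (12 + 9 * level)) & 0x1FF;
            uint64_t pte   = pt_read(table + index * 8);
            if (!(pte & 1)) return 0;                   // not present: page fault
            table = pte & 0x000FFFFFFFFFF000ULL;        // next table (or final frame)
        }
        return table | (vaddr & 0xFFF);                 // frame base + page offset
    }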

Key challenges

[Figure: the same load (virtual-address pointer a[i], ld: 0x1000000, physical address 0x5000) now comes from both a CPU and a GPU, each with its own MMU and cache, sharing memory.]

Challenges: consistent pointers, data movement

Consistent pointers: Supporting x86-64 Address Translation for 100s of GPU Lanes [HPCA 2014]

Data movement: Heterogeneous System Coherence [MICRO 2013]

Heterogeneous System

[Figure: four CPU cores, each with private L1 and L2 caches, and eight GPU cores with private L1 caches and a shared GPU L2, all connected through a shared memory controller and directory.]

Why not CPU solutions?

It's all about bandwidth!

Translating 100s of addresses

500 GB/s at the directory (many accesses per cycle)

[Figure: theoretical memory bandwidth (GB/s) of recent GPUs; data: NVIDIA via anandtech.com]

Consistent pointers: Supporting x86-64 Address Translation for 100s of GPU Lanes [HPCA 2014]

Data movement: Heterogeneous System Coherence [MICRO 2013]

Why virtual addresses?

Virtual memory

Why virtual addresses?

[Figure: with a separate GPU address space, data moved to the GPU must have its pointers transformed to new pointers on the way in and transformed again on the way back; with shared virtual memory, you simply copy (or directly share) the data.]
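A hypothetical sketch of why a separate GPU address space forces pointer transformation (gpu_alloc is an invented stand-in for a device-side allocator): copying a linked structure is not enough, because every embedded pointer still refers to CPU virtual addresses and must be rewritten on the way in and again on the way out. With shared virtual memory, the GPU simply dereferences the original pointers.

    #include <stddef.h>

    struct node { int value; struct node *next; };

    extern void *gpu_alloc(size_t size);   // invented stand-in for a GPU-side allocator

    // Separate address space: a plain copy would leave CPU pointers in the 'next'
    // fields, so each one must be transformed to its newly allocated GPU equivalent.
    struct node *copy_to_gpu(const struct node *head) {
        struct node *gpu_head = NULL;
        struct node **gpu_link = &gpu_head;
        for (const struct node *n = head; n != NULL; n = n->next) {
            struct node *g = gpu_alloc(sizeof *g);
            g->value = n->value;
            g->next = NULL;
            *gpu_link = g;          // transform: store the new (GPU) pointer
            gpu_link = &g->next;
        }
        return gpu_head;
    }
    // Shared virtual memory: no copy_to_gpu(); the GPU walks head->next directly.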

Bandwidth problem

[Figure: on the CPU, virtual memory requests pass through a single TLB, which turns them into physical memory requests.]

Bandwidth problem

[Figure: one GPU core contains 16 lanes of processing elements, each of which can issue its own virtual memory request every cycle (figure label: 0.45x).]

Solution: Filtering

[Figure: lane requests are first filtered through the shared memory (scratchpad) and the coalescer before reaching a single post-coalescer TLB; figure labels show relative request rates of 1x, 0.45x, and 0.06x.]
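A minimal sketch of the filtering idea (the lane count and names are illustrative): before any address is translated, the per-lane addresses of one SIMD memory instruction are coalesced into the set of unique pages they touch, so the post-coalescer TLB sees a handful of requests rather than one per lane.

    #include <stdint.h>
    #include <stdbool.h>

    #define LANES     16
    #define PAGE_SIZE 4096u

    // Coalesce one SIMD memory instruction: return the number of unique pages touched.
    // Only these unique pages need a TLB lookup, not all LANES addresses.
    int coalesce_pages(const uint64_t vaddr[LANES], const bool active[LANES],
                       uint64_t unique_pages[LANES]) {
        int count = 0;
        for (int lane = 0; lane < LANES; lane++) {
            if (!active[lane]) continue;          // masked-off lanes issue no request
            uint64_t page = vaddr[lane] / PAGE_SIZE;
            bool seen = false;
            for (int i = 0; i < count; i++)
                if (unique_pages[i] == page) { seen = true; break; }
            if (!seen) unique_pages[count++] = page;
        }
        return count;  // e.g. a unit-stride access by 16 lanes usually touches 1-2 pages
    }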

Poor performance: average 3x slowdown

Bottleneck 1: Bursty TLB misses

Average: 60 outstanding requests
Max: 140 requests
Huge queuing delays

Solution: highly-threaded page table walker

Bottleneck 2: High miss rate

A large 128-entry TLB doesn't help
Many address streams
Need low latency

Solution: shared page-walk cache
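A sketch in C of what these two fixes buy (the structure is illustrative, not the exact hardware): the shared page-walk cache memoizes the upper three levels of the walk, so most TLB misses need a single memory access instead of four, and in hardware many such walks are in flight at once in the highly-threaded walker.

    #include <stdint.h>

    #define PWC_ENTRIES 64

    // Shared page-walk cache: maps the upper 27 bits of a virtual address
    // (PML4 + PDPT + PD indices) to the physical base of its last-level page table.
    struct pwc_entry { uint64_t tag; uint64_t pt_base; int valid; };
    static struct pwc_entry pwc[PWC_ENTRIES];

    // Stand-ins for the memory accesses a hardware walker would issue.
    extern uint64_t walk_upper_levels(uint64_t cr3, uint64_t vaddr);  // three accesses
    extern uint64_t pt_read(uint64_t paddr);                          // one access

    uint64_t translate_with_pwc(uint64_t cr3, uint64_t vaddr) {
        uint64_t tag = vaddr >> 21;                    // one tag per last-level table
        struct pwc_entry *e = &pwc[tag % PWC_ENTRIES];
        uint64_t pt_base;
        if (e->valid && e->tag == tag) {
            pt_base = e->pt_base;                      // hit: skip the upper-level reads
        } else {
            pt_base = walk_upper_levels(cr3, vaddr);   // miss: full walk, then fill
            e->tag = tag; e->pt_base = pt_base; e->valid = 1;
        }
        uint64_t pte = pt_read(pt_base + ((vaddr >> 12) & 0x1FF) * 8);
        return (pte & 0x000FFFFFFFFFF000ULL) | (vaddr & 0xFFF);
    }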

Performance: Low overhead

Worst case: 12% slowdown
Average: less than 2% slowdown

Consistent pointers: Supporting x86-64 Address Translation for 100s of GPU Lanes [HPCA 2014]

Shared virtual memory is important

Non-exotic MMU design:
• Post-coalescer L1 TLBs
• Highly-threaded page table walker
• Page walk cache

Full compatibility with minimal overhead

Still room to optimize

Consistent pointers: Supporting x86-64 Address Translation for 100s of GPU Lanes [HPCA 2014]

Data movement: Heterogeneous System Coherence [MICRO 2013]

Legacy Interface

[Figure: the same CPU-GPU system; GPU accesses take a direct, high-bandwidth path to memory that bypasses the directory, with CPU caches invalidated ("Inv") beforehand.]

1. CPU writes memory
2. CPU initiates DMA
3. GPU direct access

High bandwidth, no directory access

CC Interface

[Figure: the same CPU-GPU system, but GPU accesses now go through the directory, which must invalidate ("Inv") CPU caches on demand.]

1. CPU writes memory
2. GPU access

Bottleneck: the directory
1. Access rate
2. Buffering

Directory Bottleneck 1: Access rate

[Figure: directory accesses per cycle (0 to 4.5) for benchmarks bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, and mm.]

Many requests per cycle

Difficult to design a multi-ported directory

Directory Bottleneck 2: Buffering

[Figure: maximum outstanding MSHRs (log scale, 1 to 100,000) for benchmarks bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, and mm.]

Must track many outstanding requests

Huge queuing delays

Solution: reduce pressure on the directory

HSC Design

[Figure: a Region Buffer is added in front of the CPU cluster and the GPU cluster, and the directory is replaced by a Region Directory; only permission traffic reaches the Region Directory, while data goes directly to memory.]

Goal: direct access (bandwidth) + cache coherence

Add: Region Directory and Region Buffers

Decouples permission from access
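A sketch of the decoupling (the region size and structure names are illustrative, not the paper's exact hardware): the Region Buffer records which regions this cluster already has permission for; only the first access to a region generates Region Directory traffic, and all data accesses use the direct, high-bandwidth path to memory.

    #include <stdint.h>
    #include <stdbool.h>

    #define REGION_SIZE   1024u      // permission granularity (e.g. 1 KB regions)
    #define RBUF_ENTRIES  256

    struct rbuf_entry { uint64_t region; bool valid; };
    static struct rbuf_entry region_buffer[RBUF_ENTRIES];

    // Stand-ins: permission traffic to the Region Directory, and the direct data path.
    extern void region_directory_request(uint64_t region);
    extern uint64_t direct_memory_read(uint64_t paddr);

    uint64_t gpu_load(uint64_t paddr) {
        uint64_t region = paddr / REGION_SIZE;
        struct rbuf_entry *e = &region_buffer[region % RBUF_ENTRIES];
        if (!(e->valid && e->region == region)) {
            region_directory_request(region);   // only the first access in a region
            e->region = region;                 // pays the directory cost
            e->valid = true;                    // (evictions/revocation omitted)
        }
        return direct_memory_read(paddr);       // data bypasses the directory entirely
    }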

HSC: Performance Improvement

[Figure: normalized speed-up with HSC (scale 0 to 5) for benchmarks bp, bfs, hs, lud, nw, km, sd, bn, dct, hg, and mm.]

Data movement: Heterogeneous System Coherence

Want cache coherence without sacrificing bandwidth

Major bottlenecks in current coherence implementations:
1. High bandwidth difficult to support at the directory
2. Extreme resource requirements

Heterogeneous System Coherence leverages spatial locality

Reduces bandwidth and resource requirements by 95%

Increasing specialization

Need to program these accelerators

Challenges:
1. Consistent pointers
2. Data movement
3. Security (fast)

This talk: GPGPUs
(Image: NVIDIA via anandtech.com)

Security & tightly-integrated accelerators

[Figure: trusted CPU cores (each with L1/L2) and untrusted third-party accelerators (each with an L1 and a TLB) share memory containing protected OS data and process data; accelerator requests either pass through the IOMMU or bypass it.]

What if accelerators come from 3rd parties? Untrusted!

All accesses via IOMMU: safe, but low performance
Bypass the IOMMU: high performance, but unsafe

Border control: sandboxing accelerators [MICRO 2015]

[Figure: the same system, with a Border Control unit placed between each untrusted accelerator and memory, next to the IOMMU.]

Solution: Border Control

Key idea: decouple translation from safety

Safety + performance
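A simplified sketch of the key idea (the table layout is illustrative, not the paper's exact hardware): at translation time the trusted IOMMU records which physical pages the accelerator may read or write; afterwards, every request the untrusted accelerator emits, even one that hit in its own TLB or physical caches, is checked against that table before it can reach memory.

    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_SIZE 4096u
    #define MAX_PAGES (1u << 20)

    // Border Control permission table: 2 bits (read/write) per physical page,
    // filled in by the trusted IOMMU when it performs a translation.
    static uint8_t perm[MAX_PAGES];   // bit 0 = read allowed, bit 1 = write allowed

    void record_translation(uint64_t paddr, bool writable) {
        perm[paddr / PAGE_SIZE] = writable ? 0x3 : 0x1;
    }

    // Called on every request leaving the accelerator, even if it hit in the
    // accelerator's own (untrusted) TLB or physical caches.
    bool border_check(uint64_t paddr, bool is_write) {
        uint8_t p = perm[paddr / PAGE_SIZE];
        return is_write ? (p & 0x2) : (p & 0x1);   // drop/fault the request if false
    }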

Conclusions

Challenges:
1. Consistent addresses: GPU MMU design
2. Data movement: Heterogeneous System Coherence
3. Security: Border Control

Goal: Enable programmers to use the whole chip

Consistent pointers: Supporting x86-64 Address Translation for 100s of GPU Lanes [HPCA 2014]
Jason Power, Mark D. Hill, David A. Wood

Data movement: Heterogeneous System Coherence [MICRO 2013]
Jason Power, Arkaprava Basu*, Junli Gu*, Sooraj Puthoor*, Bradford M. Beckmann*, Mark D. Hill, Steven K. Reinhardt*, David A. Wood

Security: Border Control: Sandboxing Accelerators [MICRO 2015]
Lena E. Olson, Jason Power, Mark D. Hill, David A. Wood

Contact: Jason Lowe-Power
powerjg@cs.wisc.edu
cs.wisc.edu/~powerjg

I’m on the job market this year!

Graduating in Spring

Other work

Analytic database + tightly-integrated GPUs:

When to use 3D Die-Stacked Memory for Bandwidth-Constrained Big-Data Workloads [BPOE 2016]
Jason Lowe-Power, Mark D. Hill, David A. Wood

Towards GPUs being mainstream in analytic processing [DaMoN 2015]
Jason Power, Yinan Li, Mark D. Hill, Jignesh M. Patel, David A. Wood

Implications of Emerging 3D GPU Architecture on the Scan Primitive [SIGMOD Rec. 2015]
Jason Power, Yinan Li, Mark D. Hill, Jignesh M. Patel, David A. Wood

Simulation infrastructure:

gem5-gpu: A Heterogeneous CPU-GPU Simulator [CAL 2014]
Jason Power, Joel Hestness, Marc S. Orr, Mark D. Hill, David A. Wood

Comparison to CAPI/OpenCAPI

                                     CAPI   My work
Same virtual address space           Yes    Yes
Cache coherent                       Yes    Yes
System safety from accelerator       Yes    Yes
Assumes on-chip accelerator          No     Yes
Allows accelerator physical caches   No     Yes
Allows pre-translation               No     Yes

Allows for high-performance accelerator optimizations
