
Scaling Datacenter Accelerators With Compute-Reuse Architectures

Adi Fuchs and David Wentzlaff

ISCA 2018, Session 5A. June 5, 2018, Los Angeles, CA

[Slides 2-4: figures on transistor scaling trends and datacenter power demands.]

Sources: "Cramming more components onto integrated circuits", G. E. Moore, Electronics, 1965; "Next-Gen Power Solutions for Hyperscale Data Centers", Data Center Knowledge, 2016.

[Slides 5-7: figures on datacenter accelerator deployments for machine learning.]

Sources: "Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective", Hazelwood et al., HPCA 2018; "Cloud TPU", Google, https://cloud.google.com/tpu/; "FPGA Accelerated Computing Using AWS F1 Instances", David Pellerin, AWS Summit 2017; "Microsoft unveils Project Brainwave for real-time AI", Doug Burger, https://www.microsoft.com/en-us/research/blog/microsoft-unveils-project-brainwave/; "NVIDIA TESLA V100", NVIDIA, https://www.nvidia.com/en-us/data-center/tesla-v100/.

Transistor scaling stops. Chip specialization runs out of steam.

What’s Next?

Observation I: The Density of Emerging Memories Is Projected to Increase

[Figure: ITRS Logic Roadmap]

Observation II: Datacenter Accelerators Perform Redundant Computations

▪ Temporal locality introduces redundancy in video encoders (recurrent blocks shown in white)

[Video frames at t = 0, 2, and 4 sec show 0%, 38%, and 61% block recurrence, respectively.]

Source: "Face recognition in unconstrained videos with matched background similarity", Wolf et al., CVPR 2011
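A hedged sketch of how block recurrence like the percentages above could be estimated (this is an illustration, not the paper's measurement code): hash fixed-size pixel blocks and count how many of the current frame's blocks already appeared earlier. The block size and frame representation are assumptions.

```python
import hashlib

BLOCK = 16  # assumed block size in pixels

def block_hashes(frame):
    """frame: 2-D list of grayscale pixel values (0-255)."""
    h, w = len(frame), len(frame[0])
    hashes = set()
    for y in range(0, h - BLOCK + 1, BLOCK):
        for x in range(0, w - BLOCK + 1, BLOCK):
            blk = bytes(frame[y + dy][x + dx]
                        for dy in range(BLOCK) for dx in range(BLOCK))
            hashes.add(hashlib.sha1(blk).digest())
    return hashes

def recurrence(reference_frame, current_frame):
    """Fraction of the current frame's blocks whose exact contents recur."""
    seen = block_hashes(reference_frame)
    cur = block_hashes(current_frame)
    return len(cur & seen) / len(cur)
```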

Observation II: Datacenter Accelerators Perform Redundant Computations

▪ Search-term commonality retrieves similar content

[Example: the queries "intercontinental downtown los angeles" and "hotel in downtown los angeles near intercontinental" retrieve largely the same results. Source: Google.]
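A toy illustration of that commonality (an assumed example, not a search engine's actual pipeline): the two queries share most of their terms, so per-term work such as index lookups is a natural target for compute reuse.

```python
def terms(query: str) -> set:
    return set(query.lower().split())

q1 = "intercontinental downtown los angeles"
q2 = "hotel in downtown los angeles near intercontinental"

shared = terms(q1) & terms(q2)
print(sorted(shared))  # ['angeles', 'downtown', 'intercontinental', 'los']
```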

Observation II: Datacenter Accelerators Perform Redundant Computations

▪ Power laws suggest heavy recurrent processing of popular content

Source: Twitter

Memoization: Tables store past computation outputs. Reuse outputs of recurring inputs instead of recomputing.

COREx: Compute-Reuse Architecture For Accelerators

[Diagram: host processors and a shared LLC/NoC feed an acceleration fabric. Inside the fabric, the accelerator core reads inputs from scratchpad memory via a DMA engine. COREx adds an input-lookup path backed by compute-reuse storage: each input is looked up, and on a hit the fetched result is used in place of the core result; on a miss the core computes and the new input/output pair is stored.]
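A minimal software model of that flow, under assumed names (ComputeReuseTable and run_accelerator are illustrative, not the paper's API): hash the input, look it up in the reuse table, and skip the core on a hit.

```python
import hashlib

class ComputeReuseTable:
    def __init__(self):
        self.table = {}  # input hash -> stored output

    def lookup(self, blob: bytes):
        key = hashlib.sha1(blob).digest()  # stand-in for input hashing
        return key, self.table.get(key)

    def fill(self, key: bytes, output):
        self.table[key] = output

def run_accelerator(core_fn, blob: bytes, crt: ComputeReuseTable):
    key, hit = crt.lookup(blob)
    if hit is not None:
        return hit           # hit: fetched result replaces the core result
    out = core_fn(blob)      # miss: run the accelerator core
    crt.fill(key, out)       # store the new input/output pair for reuse
    return out
```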

Architectural Guidelines

▪ Accelerator memoization is natural
o Little or no additional programming effort
o Built-in input-compute-output flow

▪ But not straightforward!
o High lookup costs
o Unnecessary accesses
o High access costs

▪ COREx key ideas (sketched in code below):
o Hashing (reduce lookup costs)
o Lookup filtering (fewer accesses)
o Banking (reduce access costs)

[Diagram: accelerator core with specialized compute lanes, scratchpad, and DMA engine, alongside a general-purpose CMP and shared LLC; the input-compute-output flow is highlighted.]

Goal: Extend Specialization with Workload-Specific Memoization
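A hedged software sketch of the three key ideas; the filter here is a plain set standing in for a cheap membership filter (e.g., a Bloom filter), and N_BANKS is an assumed parameter, not a COREx design value.

```python
import hashlib

N_BANKS = 8  # assumed bank count

def input_hash(blob: bytes) -> int:
    # Hashing: compare short digests instead of full (possibly large) inputs.
    return int.from_bytes(hashlib.sha1(blob).digest()[:8], "little")

class FilteredBankedTable:
    def __init__(self):
        self.banks = [dict() for _ in range(N_BANKS)]
        self.filter = set()  # tracks hashes that might be present

    def lookup(self, h: int):
        if h not in self.filter:               # filtering: skip sure misses
            return None
        return self.banks[h % N_BANKS].get(h)  # banking: touch one small bank

    def fill(self, h: int, output):
        self.filter.add(h)
        self.banks[h % N_BANKS][h] = output
```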

Top Level Architecture

▪ New Modules:

o Input Hashing Unit (IHU)

o Input Lookup Unit (ILU)

o Computation History Table (CHT)

[Diagram: the accelerator core (specialized compute lanes, scratchpad, DMA engine) and a general-purpose CMP sit behind the shared LLC on the SoC interconnect. A dedicated COREx interconnect reaches the new modules on a memory chip: the IHU computes input hashes, the ILU (an associative cache with its own cache controller) matches the input, and on a match the CHT (a RAM-array table with its own controller) fetches the stored output for the core to use.]
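An illustrative two-level model of the ILU/CHT split, assuming (per the slides) a small associative ILU that maps input hashes to rows of a much larger CHT holding the memoized outputs; the LRU replacement policy is my assumption, not the paper's. A lookup first matches the hash in the small ILU, and only on a match pays for the wide CHT fetch.

```python
from collections import OrderedDict

class ILU:
    """Small associative store: input hash -> CHT row index (LRU)."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.tags = OrderedDict()

    def match(self, h: int):
        idx = self.tags.get(h)
        if idx is not None:
            self.tags.move_to_end(h)       # refresh LRU position on a match
        return idx

    def insert(self, h: int, idx: int):
        if len(self.tags) >= self.capacity:
            self.tags.popitem(last=False)  # evict the least-recent tag
        self.tags[h] = idx

class CHT:
    """Large RAM-array table of memoized outputs, addressed by row index."""
    def __init__(self):
        self.rows = []

    def fetch(self, idx: int):
        return self.rows[idx]

    def append(self, output) -> int:
        self.rows.append(output)
        return len(self.rows) - 1
```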

Building COREx

Case Study: Acceleration of Video Motion Estimation

▪ Optimization Goals:

o Runtime, Energy, and Energy-Delay Product (EDP)

▪ Baseline: highly tuned accelerators

o Sweep the space of design alternatives (Aladdin)

o Find the optimal accelerator design for each goal

Runtime OPT: 5.8 µs. Energy OPT: 6.2 µJ. EDP OPT: 148.7 pJ·s.
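EDP is simply energy times runtime; a small helper for ranking swept design points. The design points below are made-up placeholders, not the paper's numbers.

```python
def edp(energy_joules: float, runtime_seconds: float) -> float:
    return energy_joules * runtime_seconds

design_points = {
    # name: (energy [J], runtime [s]) -- illustrative values only
    "wide-datapath":   (8.0e-6, 20e-6),
    "narrow-datapath": (6.2e-6, 40e-6),
}
best = min(design_points, key=lambda d: edp(*design_points[d]))
print(best, edp(*design_points[best]))  # prints the EDP-optimal point
```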

▪ Memoization-Layers Specialization

o Extract input traces; examine hit and miss rates of different ILU/CHT sizes (see the sweep sketch below).

o Integrate accelerators with an emerging-memory-based ILU+CHT and sweep the gains space.

▪ Example: Resistive-RAM-based COREx

o Energy optimization: 56.6% energy saved (64 KB ILU, 8 MB CHT)

o EDP optimization: 63.5% EDP saved (512 KB ILU, 2 GB CHT)

o Runtime optimization: 2.7x speedup (512 KB ILU, 32 GB CHT)
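A hedged sketch of the sizing sweep: replay a hashed input trace against several candidate table capacities (LRU replacement assumed) and report the hit rate each capacity would achieve. The trace below is a toy stand-in for an extracted input trace.

```python
from collections import OrderedDict

def lru_hit_rate(trace, capacity: int) -> float:
    table, hits = OrderedDict(), 0
    for h in trace:
        if h in table:
            hits += 1
            table.move_to_end(h)
        else:
            if len(table) >= capacity:
                table.popitem(last=False)
            table[h] = True
    return hits / len(trace)

trace = [i % 100 for i in range(10_000)]  # toy trace with heavy recurrence
for cap in (16, 64, 256):
    print(cap, lru_hit_rate(trace, cap))  # hit rate jumps once the table fits
```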

Experimental Setup

Workloads:

Kernel         | Domain           | Use-Case                              | App Source         | Input Source and Description                           | Redundancy Type
DCT            | Video Encoding   | Video Server                          | x264               | YouTube Faces. 10 videos, 10 seconds, 24 FPS.          | Temporal Redundancy
SAD            | Video Encoding   | Video Server                          | PARBOIL            | YouTube Faces. 10 videos, 10 seconds, 24 FPS.          | Temporal Redundancy
SNAPPY ("SNP") | Compression      | Web-Server Traffic Compression        | TailBench Snappy-C | Wikipedia Abstracts. 13 million search queries.        | Search Commonality
SSSP ("SSP")   | Graph Processing | Maps Service: Shortest Walking Route  | Internal           | DIMACS NYC Streets, 10 million Zipfian transactions.   | Content Popularity
BFS            | Graph Processing | Online Retail                         | MachSuite          | Amazon Co-Purchasing, 10 million Zipfian transactions. | Content Popularity
RBM            | Machine Learning | Collaborative Filtering               | CortexSuite        | Netflix Prize, 10 million Zipfian transactions.        | Content Popularity

Content popularity is modeled at 75%, 90%, and 95% recurrence.

Methodology:

o Evaluate the ILU/CHT as ReRAM, STT-RAM, PCM, or Racetrack memory (Destiny)

o Integrate with highly tuned accelerators (Aladdin)
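A hedged sketch of generating a Zipfian transaction stream like the one the setup describes (the exponent s = 1.0 and the sizes here are assumptions), showing why popular content implies high recurrence.

```python
import random

def zipf_trace(n_items: int, n_txns: int, s: float = 1.0, seed: int = 0):
    rng = random.Random(seed)
    weights = [1.0 / (rank ** s) for rank in range(1, n_items + 1)]
    return rng.choices(range(n_items), weights=weights, k=n_txns)

trace = zipf_trace(n_items=10_000, n_txns=100_000)
hot_share = sum(item < 10 for item in trace) / len(trace)
print(f"share of transactions touching the 10 hottest items: {hot_share:.1%}")
```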

Results

▪ Runtime-OPT: avg. 6.0-6.4x speedup

o Negligible differences between memories

▪ EDP-OPT: avg. 50%-68% savings

o PCM/Racetrack suffer high write energy

o Lower gains for low-bias apps (frequent updates)

▪ Energy-OPT: avg. 22%-50% savings

o PCM is not beneficial for SSSP/RBM at 75% bias

▪ General trends:

o Large CHTs (MBs-TBs) are best for speedup; smaller (KBs-GBs) for EDP; smallest (KBs-MBs) for energy

Conclusions

▪ Memoization is Fit for Accelerators

o Memoization-ready programming environment and interface

▪ Memoization is Fit for Datacenters

o Temporal redundancy, search commonality, content popularity

▪ COREx Extends Hardware Specialization

o Memoization-layer specialization tailored to the workload

▪ COREx Opens New Opportunities for Future Architectures

o Shift compute from non-scaling CMOS to still-scaling memories

Scaling Datacenter Accelerators With Compute-Reuse Architectures

Adi Fuchs (adif@princeton.edu), David Wentzlaff (wentzlaf@princeton.edu)
