
Scaling Datacenter Accelerators With Compute-Reuse Architectures

Adi Fuchs and David Wentzlaff

ISCA 2018, Session 5A. June 5, 2018, Los Angeles, CA

[Slides 2-4: figures on transistor scaling trends and datacenter power demands.]

Sources: "Cramming more components onto integrated circuits", G. E. Moore, Electronics, 1965; "Next-Gen Power Solutions for Hyperscale Data Centers", Data Center Knowledge, 2016.

[Slides 5-7: figures on datacenter accelerator deployments for machine learning.]

Sources: "Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective", Hazelwood et al., HPCA 2018; "Cloud TPU", Google, https://cloud.google.com/tpu/; "FPGA Accelerated Computing Using AWS F1 Instances", David Pellerin, AWS Summit 2017; "Microsoft unveils Project Brainwave for real-time AI", Doug Burger, https://www.microsoft.com/en-us/research/blog/microsoft-unveils-project-brainwave/; "NVIDIA TESLA V100", NVIDIA, https://www.nvidia.com/en-us/data-center/tesla-v100/.

Transistor scaling stops. Chip specialization runs out of steam.

What’s Next?

Observation I: The Density of Emerging Memories Is Projected to Increase

[Figure: ITRS Logic Roadmap]

Observation II: Datacenter Accelerators Perform Redundant Computations

▪ Temporal locality introduces redundancy in video encoders (recurrent blocks shown in white)

[Video frames at t = 0, 2, and 4 sec show 0%, 38%, and 61% block recurrence, respectively.]

Source: "Face recognition in unconstrained videos with matched background similarity", Wolf et al., CVPR 2011
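A hedged sketch of how block recurrence like the percentages above could be estimated (this is an illustration, not the paper's measurement code): hash fixed-size pixel blocks and count how many of the current frame's blocks already appeared earlier. The block size and frame representation are assumptions.

```python
import hashlib

BLOCK = 16  # assumed block size in pixels

def block_hashes(frame):
    """frame: 2-D list of grayscale pixel values (0-255)."""
    h, w = len(frame), len(frame[0])
    hashes = set()
    for y in range(0, h - BLOCK + 1, BLOCK):
        for x in range(0, w - BLOCK + 1, BLOCK):
            blk = bytes(frame[y + dy][x + dx]
                        for dy in range(BLOCK) for dx in range(BLOCK))
            hashes.add(hashlib.sha1(blk).digest())
    return hashes

def recurrence(reference_frame, current_frame):
    """Fraction of the current frame's blocks whose exact contents recur."""
    seen = block_hashes(reference_frame)
    cur = block_hashes(current_frame)
    return len(cur & seen) / len(cur)
```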

Observation II: Datacenter Accelerators Perform Redundant Computations

▪ Search-term commonality retrieves similar content

[Example: the queries "intercontinental downtown los angeles" and "hotel in downtown los angeles near intercontinental" retrieve largely the same results. Source: Google.]
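A toy illustration of that commonality (an assumed example, not a search engine's actual pipeline): the two queries share most of their terms, so per-term work such as index lookups is a natural target for compute reuse.

```python
def terms(query: str) -> set:
    return set(query.lower().split())

q1 = "intercontinental downtown los angeles"
q2 = "hotel in downtown los angeles near intercontinental"

shared = terms(q1) & terms(q2)
print(sorted(shared))  # ['angeles', 'downtown', 'intercontinental', 'los']
```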

Observation II: Datacenter Accelerators Perform Redundant Computations

▪ Power laws suggest heavy recurrent processing of popular content

Source: Twitter

Memoization: Tables store past computation outputs. Reuse outputs of recurring inputs instead of recomputing.

COREx: Compute-Reuse Architecture For Accelerators

[Diagram: host processors and a shared LLC/NoC feed an acceleration fabric. Inside the fabric, the accelerator core reads inputs from scratchpad memory via a DMA engine. COREx adds an input-lookup path backed by compute-reuse storage: each input is looked up, and on a hit the fetched result is used in place of the core result; on a miss the core computes and the new input/output pair is stored.]
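A minimal software model of that flow, under assumed names (ComputeReuseTable and run_accelerator are illustrative, not the paper's API): hash the input, look it up in the reuse table, and skip the core on a hit.

```python
import hashlib

class ComputeReuseTable:
    def __init__(self):
        self.table = {}  # input hash -> stored output

    def lookup(self, blob: bytes):
        key = hashlib.sha1(blob).digest()  # stand-in for input hashing
        return key, self.table.get(key)

    def fill(self, key: bytes, output):
        self.table[key] = output

def run_accelerator(core_fn, blob: bytes, crt: ComputeReuseTable):
    key, hit = crt.lookup(blob)
    if hit is not None:
        return hit           # hit: fetched result replaces the core result
    out = core_fn(blob)      # miss: run the accelerator core
    crt.fill(key, out)       # store the new input/output pair for reuse
    return out
```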

Architectural Guidelines

▪ Accelerator memoization is natural
o Little or no additional programming effort
o Built-in input-compute-output flow

▪ But not straightforward!
o High lookup costs
o Unnecessary accesses
o High access costs

▪ COREx key ideas (sketched in code below):
o Hashing (reduce lookup costs)
o Lookup filtering (fewer accesses)
o Banking (reduce access costs)

[Diagram: accelerator core with specialized compute lanes, scratchpad, and DMA engine, alongside a general-purpose CMP and shared LLC; the input-compute-output flow is highlighted.]

Goal: Extend Specialization with Workload-Specific Memoization
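A hedged software sketch of the three key ideas; the filter here is a plain set standing in for a cheap membership filter (e.g., a Bloom filter), and N_BANKS is an assumed parameter, not a COREx design value.

```python
import hashlib

N_BANKS = 8  # assumed bank count

def input_hash(blob: bytes) -> int:
    # Hashing: compare short digests instead of full (possibly large) inputs.
    return int.from_bytes(hashlib.sha1(blob).digest()[:8], "little")

class FilteredBankedTable:
    def __init__(self):
        self.banks = [dict() for _ in range(N_BANKS)]
        self.filter = set()  # tracks hashes that might be present

    def lookup(self, h: int):
        if h not in self.filter:               # filtering: skip sure misses
            return None
        return self.banks[h % N_BANKS].get(h)  # banking: touch one small bank

    def fill(self, h: int, output):
        self.filter.add(h)
        self.banks[h % N_BANKS][h] = output
```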

Top Level Architecture

▪ New Modules:

o Input Hashing Unit (IHU)

o Input Lookup Unit (ILU)

o Computation History Table (CHT)

[Diagram: the accelerator core (specialized compute lanes, scratchpad, DMA engine) and a general-purpose CMP sit behind the shared LLC on the SoC interconnect. A dedicated COREx interconnect reaches the new modules on a memory chip: the IHU computes input hashes, the ILU (an associative cache with its own cache controller) matches the input, and on a match the CHT (a RAM-array table with its own controller) fetches the stored output for the core to use.]
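An illustrative two-level model of the ILU/CHT split, assuming (per the slides) a small associative ILU that maps input hashes to rows of a much larger CHT holding the memoized outputs; the LRU replacement policy is my assumption, not the paper's. A lookup first matches the hash in the small ILU, and only on a match pays for the wide CHT fetch.

```python
from collections import OrderedDict

class ILU:
    """Small associative store: input hash -> CHT row index (LRU)."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.tags = OrderedDict()

    def match(self, h: int):
        idx = self.tags.get(h)
        if idx is not None:
            self.tags.move_to_end(h)       # refresh LRU position on a match
        return idx

    def insert(self, h: int, idx: int):
        if len(self.tags) >= self.capacity:
            self.tags.popitem(last=False)  # evict the least-recent tag
        self.tags[h] = idx

class CHT:
    """Large RAM-array table of memoized outputs, addressed by row index."""
    def __init__(self):
        self.rows = []

    def fetch(self, idx: int):
        return self.rows[idx]

    def append(self, output) -> int:
        self.rows.append(output)
        return len(self.rows) - 1
```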

Building COREx

Case Study: Acceleration of Video Motion Estimation

▪ Optimization Goals:

o Runtime, Energy, and Energy-Delay Product (EDP)

▪ Baseline: highly tuned accelerators

o Sweep the space of design alternatives (Aladdin)

o Find the optimal accelerator design for each goal

Runtime OPT: 5.8 µs. Energy OPT: 6.2 µJ. EDP OPT: 148.7 pJ·s.
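EDP is simply energy times runtime; a small helper for ranking swept design points. The design points below are made-up placeholders, not the paper's numbers.

```python
def edp(energy_joules: float, runtime_seconds: float) -> float:
    return energy_joules * runtime_seconds

design_points = {
    # name: (energy [J], runtime [s]) -- illustrative values only
    "wide-datapath":   (8.0e-6, 20e-6),
    "narrow-datapath": (6.2e-6, 40e-6),
}
best = min(design_points, key=lambda d: edp(*design_points[d]))
print(best, edp(*design_points[best]))  # prints the EDP-optimal point
```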

▪ Memoization-Layers Specialization

o Extract input traces; examine hit and miss rates of different ILU/CHT sizes (see the sweep sketch below).

o Integrate accelerators with an emerging-memory-based ILU+CHT and sweep the gains space.

▪ Example: Resistive-RAM-based COREx

o Energy optimization: 56.6% energy saved (64 KB ILU, 8 MB CHT)

o EDP optimization: 63.5% EDP saved (512 KB ILU, 2 GB CHT)

o Runtime optimization: 2.7x speedup (512 KB ILU, 32 GB CHT)
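A hedged sketch of the sizing sweep: replay a hashed input trace against several candidate table capacities (LRU replacement assumed) and report the hit rate each capacity would achieve. The trace below is a toy stand-in for an extracted input trace.

```python
from collections import OrderedDict

def lru_hit_rate(trace, capacity: int) -> float:
    table, hits = OrderedDict(), 0
    for h in trace:
        if h in table:
            hits += 1
            table.move_to_end(h)
        else:
            if len(table) >= capacity:
                table.popitem(last=False)
            table[h] = True
    return hits / len(trace)

trace = [i % 100 for i in range(10_000)]  # toy trace with heavy recurrence
for cap in (16, 64, 256):
    print(cap, lru_hit_rate(trace, cap))  # hit rate jumps once the table fits
```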

Experimental Setup

Workloads:

Kernel         | Domain           | Use-Case                              | App Source         | Input Source and Description                           | Redundancy Type
DCT            | Video Encoding   | Video Server                          | x264               | YouTube Faces. 10 videos, 10 seconds, 24 FPS.          | Temporal Redundancy
SAD            | Video Encoding   | Video Server                          | PARBOIL            | YouTube Faces. 10 videos, 10 seconds, 24 FPS.          | Temporal Redundancy
SNAPPY ("SNP") | Compression      | Web-Server Traffic Compression        | TailBench Snappy-C | Wikipedia Abstracts. 13 million search queries.        | Search Commonality
SSSP ("SSP")   | Graph Processing | Maps Service: Shortest Walking Route  | Internal           | DIMACS NYC Streets, 10 million Zipfian transactions.   | Content Popularity
BFS            | Graph Processing | Online Retail                         | MachSuite          | Amazon Co-Purchasing, 10 million Zipfian transactions. | Content Popularity
RBM            | Machine Learning | Collaborative Filtering               | CortexSuite        | Netflix Prize, 10 million Zipfian transactions.        | Content Popularity

Content popularity is modeled at 75%, 90%, and 95% recurrence.

Methodology:

o Evaluate the ILU/CHT as ReRAM, STT-RAM, PCM, or Racetrack memory (Destiny)

o Integrate with highly tuned accelerators (Aladdin)
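A hedged sketch of generating a Zipfian transaction stream like the one the setup describes (the exponent s = 1.0 and the sizes here are assumptions), showing why popular content implies high recurrence.

```python
import random

def zipf_trace(n_items: int, n_txns: int, s: float = 1.0, seed: int = 0):
    rng = random.Random(seed)
    weights = [1.0 / (rank ** s) for rank in range(1, n_items + 1)]
    return rng.choices(range(n_items), weights=weights, k=n_txns)

trace = zipf_trace(n_items=10_000, n_txns=100_000)
hot_share = sum(item < 10 for item in trace) / len(trace)
print(f"share of transactions touching the 10 hottest items: {hot_share:.1%}")
```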

Results

▪ Runtime-OPT: avg. 6.0-6.4x speedup

o Negligible differences between memories

▪ EDP-OPT: avg. 50%-68% savings

o PCM/Racetrack suffer high write energy

o Lower gains for low-bias apps (frequent updates)

▪ Energy-OPT: avg. 22%-50% savings

o PCM is not beneficial for SSSP/RBM at 75% bias

▪ General trends:

o Large CHTs (MBs-TBs) are best for speedup; smaller (KBs-GBs) for EDP; smallest (KBs-MBs) for energy

Conclusions

▪ Memoization is Fit for Accelerators

o Memoization-ready programming environment and interface

▪ Memoization is Fit for Datacenters

o Temporal redundancy, search commonality, content popularity

▪ COREx Extends Hardware Specialization

o Memoization-layer specialization tailored to the workload

▪ COREx Opens New Opportunities for Future Architectures

o Shift compute from non-scaling CMOS to still-scaling memories

Scaling Datacenter Accelerators With Compute-Reuse Architectures

Adi Fuchs (adif@princeton.edu), David Wentzlaff (wentzlaf@princeton.edu)
