google workloads for consumer devices: mitigating data ... · tensorflow mobile on average, energy...

1
Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks Google Consumer Workloads Amirali Boroumand, Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, Onur Mutlu Chrome Google’s web browser TensorFlow Mobile Google’s machine learning framework Video Playback and Video Capture Google’s video codec Goals Data Movement Cost Data Movement 1 st key observaBon: 62.7% of total system energy is spent on data movement Processing-in-Memory (PIM) SoC DRAM L2 L1 CPU CPU CPU CPU Compute Unit 2 nd key observaBon: a significant fracBon of data movement oOen comes from simple funcBons We look at widely-used Google workloads to idenBfy major sources of energy consumpBon : 1 2 Understand the data movement related boSlenecks in modern consumer workloads Analyze the benefits of processing-in-memory (PIM) to miBgate data movement cost InvesBgate the PIM logic that can maximize energy efficiency given the limited area and energy budget in consumer devices 3 Browser 0% 20% 40% 60% 80% 100% Fraction of Total Energy Texture Tiling Color Blitting Other Google Docs Gmail Google Calendar Word- Press Twi6er Ani- ma8on AVG 41.9% of page scrolling energy is spent on texture Bling and color bliZng 77% of total energy consumpBon goes to data movement CPU L1 LLC Inter- connect Mem Ctrl DRAM Total Energy (pJ) Texture Tiling Color Blitting Other 0 × 10 12 3 × 10 12 6 × 10 12 9 × 10 12 12 × 10 12 15 × 10 12 18 × 10 12 49.1% of total data movement comes from texture Bling and color bliZng 0% 10% 20% 30% 40% Texture Tiling Color Blitting Fraction of Total Energy Data Movement Compute HTML CSS HTML Parser CSS Parser Render Tree Layout Rasteriza- tion Composit- ing Texture Tiling Color Blitting Rasteriza8on Invoke Composi8ng idle high Conversion Texture Tiles %me CPU Memory Rasteriza8on Read Bitmap Conversion Write Back CPU PIM CPU + PIM CPU-Only Linear Bitmap Texture Tiles Invoke Composi8ng Texture Tiling data movement Linear Bitmap Texture Bling is a good candidate for PIM execuBon Requires simple primiBves: memcopy, bitwise operaBons, and simple arithmeBc operaBons PIM Core PIM Accelerator Texture Tiling Linear Bitmap Tiled Texture 9.4% of the area available for PIM logic 7.1% of the area available for PIM logic PIM core and PIM accelerator are feasible to implement in-memory texture tiling Chrome Process Render Process (tab 1) Render Process (tab 2) Render Process (tab n) Chrome uses compression to reduce each tab’s memory footprint To study data movement during tab switching, we perform an experiment: A user opens 50 tabs (most-accessed websites) Scrolls through each for a few second Switches to the next tab Compression and decompression contribute to 18.1% of the total system energy 19.6 GB of data move between CPU and ZRAM TensorFlow Mobile 57.3% of the inference energy is spent on data movement 54.4% of the data movement energy comes from packing/unpacking and quanBzaBon Inference PredicBon Reorders elements of matrices to minimize cache misses Up to 40% of the inference energy and 31% of inference execuBon Bme Packing’s data movement accounts for up to 35.3% of the inference energy Packing Matrix Packed Matrix A simple data reorganizaBon process that requires simple arithmeBc Converts 32-bit floaBng point to 8-bit integers Up to 16.8% of the inference energy and 16.1% of inference execuBon Bme Majority of quanBzaBon energy comes from data movement Quantization floa8ng point integer A simple data conversion operaBon that requires shiO, addiBon, and mulBplicaBon Video Playback and Capture 63.5% of the system energy is spent on data movement Compressed video VP9 Decoder Display 80.4% of the data movement energy comes from sub-pixel interpolaBon and deblocking filter Deblocking filter : a simple low- pass filter that aSempts to remove disconBnuity in pixels Sub-pixel interpolaBon : interpolates the value of pixels at non-integer locaBon 59.1% of the system energy is spent on data movement Captured video VP9 Encoder Compressed Majority of the data movement energy comes from moBon esBmaBon MoBon esBmaBon : compresses the frames using temporal redundancy between them Evaluation 0 0.2 0.4 0.6 0.8 1 Texture Tiling Color Blitting Com- pression Decom- pression Packing Quantization Sub-Pixel Interpolation Deblocking Filter Motion Estimation Normalized Energy CPU-Only PIM-Core PIM-Acc On average, energy consumpBon reduces by 49.1% using PIM core and 55.4% using PIM accelerator 0.0 0.2 0.4 0.6 0.8 1.0 Texture Tiling Color Blitting Comp- ression Decomp- ression Sub-Pixel Interpolation Deblocking Filter Motion Estimation TensorFlow Normalized Runtime CPU-Only PIM-Core PIM-Acc Chrome Browser Video Playback and Capture TensorFlow Mobile On average, energy consumpBon reduces by 44.6% using PIM core and 54.2% using PIM accelerator Chrome Rendering Pipeline Scrolling Tab Switching Chrome employs a multi-process architecture Main operations during tab switching: Context switch Load the new page CPU DRAM ZRAM Compressed Tab 1 2 CPU CPU-Only Memory CPU CPU + PIM PIM %me Swap out N pages Read N Pages Compress Other tasks Write back compression ZRAM Swap out N pages Other tasks Compress ZRAM No off-chip data movement data movement high Uncompressed Pages Uncompressed Pages Both funcBons can benefit from PIM execuBon and can be implemented at PIM logic Color bliZng is a good candidate for PIM execuBon Generates a large amount of data movement Requires low-cost operaBons: Memset, simple arithmeBc, and shiO operaBons Accounts for 19.1% of the total system energy during scrolling Color BliZng Texture Tiling Color bliZng is feasible to implement in PIM logic

Upload: others

Post on 10-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Google Workloads for Consumer Devices: Mitigating Data ... · TensorFlow Mobile On average, energy consumpon reduces by 44.6% using PIM core and 54.2% using PIM accelerator Chrome

Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks

Google Consumer Workloads

Amirali Boroumand, Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, Onur Mutlu

ChromeGoogle’swebbrowser

TensorFlowMobileGoogle’smachinelearning

framework

VideoPlaybackandVideoCaptureGoogle’svideocodec

Goals Data Movement Cost

DataMovement

1stkeyobservaBon:62.7%oftotalsystemenergyisspentondatamovement

Processing-in-Memory(PIM)

SoC

DRAM L2 L1

CPU CPU CPU CPU

Compute Unit

2ndkeyobservaBon:asignificantfracBonofdatamovementoOencomesfromsimplefuncBons

Welookatwidely-usedGoogleworkloadstoidenBfymajorsourcesofenergyconsumpBon:

1

2

UnderstandthedatamovementrelatedboSlenecksinmodernconsumerworkloads

Analyzethebenefitsofprocessing-in-memory(PIM)tomiBgatedatamovementcost

InvesBgatethePIMlogicthatcanmaximizeenergyefficiencygiventhe

limitedareaandenergybudgetinconsumerdevices3

Browser

0% 20% 40% 60% 80%

100%

Frac

tion

of

Tota

l Ene

rgy

Texture Tiling Color Blitting Other

GoogleDocs Gmail Google

CalendarWord-Press Twi6er Ani-ma8on AVG

41.9%ofpagescrollingenergyisspentontextureBlingandcolorbliZng

77%oftotalenergyconsumpBongoestodatamovement

CPU L1 LLC Inter- connect

Mem Ctrl

DRAM

Tota

l Ene

rgy

(pJ) Texture Tiling Color Blitting Other

0×1012 3×1012 6×1012 9×1012

12×1012 15×1012 18×1012

49.1%oftotaldatamovementcomesfrom

textureBlingandcolorbliZng

0%

10%

20%

30%

40%

Texture Tiling

Color Blitting

Frac

tion

of

Tota

l Ene

rgy

Data Movement Compute

HTML

CSS

HTML Parser

CSS Parser

Render Tree Layout Rasteriza-

tion Composit-

ing

Texture Tiling

Color Blitting

Rasteriza8on

InvokeComposi8ng

idle

high

Conversion

TextureTiles

%meCPU Memory

Rasteriza8on

ReadBitmap

Conversion

WriteBack

CPU PIM

CPU + PIM CPU-Only

LinearBitmap

TextureTilesInvoke

Composi8ng

TextureTiling

datamovement

LinearBitmap

TextureBlingisagoodcandidateforPIMexecuBon

RequiressimpleprimiBves:memcopy,bitwiseoperaBons,andsimplearithmeBcoperaBons

PIM Core PIM Accelerator

Texture Tiling

LinearBitmap TiledTexture

9.4%oftheareaavailableforPIMlogic

7.1%oftheareaavailableforPIMlogic

PIMcoreandPIMacceleratorarefeasibletoimplementin-memorytexturetiling

ChromeProcess

RenderProcess(tab1)

RenderProcess(tab2)

RenderProcess(tabn)

•  Chrome uses compression to reduce each tab’s memory footprint

•  To study data movement during tab switching, we perform an experiment: –  A user opens 50 tabs (most-accessed websites) –  Scrolls through each for a few second –  Switches to the next tab

Compressionanddecompressioncontributeto

18.1%ofthetotalsystemenergy

19.6GBofdatamovebetweenCPUandZRAM

TensorFlow Mobile

57.3%oftheinferenceenergyisspentondatamovement

54.4%ofthedatamovementenergycomesfrompacking/unpackingandquanBzaBon

Inference PredicBon

Reorderselementsofmatricestominimizecachemisses

Upto40%oftheinferenceenergyand31%ofinferenceexecuBonBme

Packing’sdatamovementaccountsfor

upto35.3%oftheinferenceenergy

Packing Matrix PackedMatrix

AsimpledatareorganizaBonprocessthatrequiressimplearithmeBc

Converts32-bitfloaBngpointto8-bitintegers

Upto16.8%oftheinferenceenergyand16.1%ofinferenceexecuBonBme

MajorityofquanBzaBonenergycomesfromdatamovement

Quantization floa8ngpoint integer

AsimpledataconversionoperaBonthatrequiresshiO,addiBon,andmulBplicaBon

Video Playback and Capture

63.5%ofthesystemenergyisspentondatamovement

Compressedvideo VP9Decoder

Display

80.4%ofthedatamovementenergycomesfromsub-pixelinterpolaBonanddeblockingfilter

Deblockingfilter:asimplelow-passfilterthataSemptsto

removedisconBnuityinpixels

Sub-pixelinterpolaBon:interpolatesthevalueof

pixelsatnon-integerlocaBon

59.1%ofthesystemenergyisspentondatamovement

Capturedvideo

VP9Encoder

Compressed

MajorityofthedatamovementenergycomesfrommoBonesBmaBon

MoBonesBmaBon:compressestheframesusingtemporalredundancybetweenthem

Evaluation

0

0.2

0.4

0.6

0.8

1

Texture Tiling

Color Blitting

Com- pression

Decom- pression

Packing Quantization Sub-Pixel Interpolation

Deblocking Filter

Motion Estimation

Nor

mal

ized

Ene

rgy

CPU-Only PIM-Core PIM-Acc

Onaverage,energyconsumpBonreducesby49.1%usingPIMcoreand55.4%usingPIMaccelerator

0.0

0.2

0.4

0.6

0.8

1.0

Texture Tiling

Color Blitting

Comp- ression

Decomp- ression

Sub-Pixel Interpolation

Deblocking Filter

Motion Estimation

TensorFlow

Nor

mal

ized

Run

tim

e

CPU-Only PIM-Core PIM-Acc

ChromeBrowser VideoPlaybackandCapture

TensorFlowMobile

Onaverage,energyconsumpBonreducesby44.6%usingPIMcoreand54.2%usingPIMaccelerator

ChromeRenderingPipeline

Scrolling Tab Switching •  Chrome employs a multi-process architecture

•  Main operations during tab switching: –  Context switch –  Load the new page

CPU

DRAM

ZRAM

CompressedTab

12

CPUCPU-Only

Memory CPUCPU + PIM

PIM%me

SwapoutNpages

ReadNPages

Compress

Othertasks

Writeback

compression

ZRAM

SwapoutNpages

OthertasksCompress

ZRAMNooff-chipdata

movementdatamovement

high

UncompressedPages

UncompressedPages

BothfuncBonscanbenefitfromPIMexecuBonandcanbeimplementedatPIMlogic

ColorbliZngisagoodcandidateforPIMexecuBon

Generatesalargeamountofdatamovement

Requireslow-costoperaBons:Memset,simplearithmeBc,andshiOoperaBons

Accountsfor19.1%ofthetotalsystemenergyduringscrolling

ColorBliZng

TextureTiling

ColorbliZngisfeasibletoimplementinPIMlogic