google workloads for consumer devices: mitigating data ... · tensorflow mobile on average, energy...
TRANSCRIPT
Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks
Google Consumer Workloads
Amirali Boroumand, Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, Onur Mutlu
ChromeGoogle’swebbrowser
TensorFlowMobileGoogle’smachinelearning
framework
VideoPlaybackandVideoCaptureGoogle’svideocodec
Goals Data Movement Cost
DataMovement
1stkeyobservaBon:62.7%oftotalsystemenergyisspentondatamovement
Processing-in-Memory(PIM)
SoC
DRAM L2 L1
CPU CPU CPU CPU
Compute Unit
2ndkeyobservaBon:asignificantfracBonofdatamovementoOencomesfromsimplefuncBons
Welookatwidely-usedGoogleworkloadstoidenBfymajorsourcesofenergyconsumpBon:
1
2
UnderstandthedatamovementrelatedboSlenecksinmodernconsumerworkloads
Analyzethebenefitsofprocessing-in-memory(PIM)tomiBgatedatamovementcost
InvesBgatethePIMlogicthatcanmaximizeenergyefficiencygiventhe
limitedareaandenergybudgetinconsumerdevices3
Browser
0% 20% 40% 60% 80%
100%
Frac
tion
of
Tota
l Ene
rgy
Texture Tiling Color Blitting Other
GoogleDocs Gmail Google
CalendarWord-Press Twi6er Ani-ma8on AVG
41.9%ofpagescrollingenergyisspentontextureBlingandcolorbliZng
77%oftotalenergyconsumpBongoestodatamovement
CPU L1 LLC Inter- connect
Mem Ctrl
DRAM
Tota
l Ene
rgy
(pJ) Texture Tiling Color Blitting Other
0×1012 3×1012 6×1012 9×1012
12×1012 15×1012 18×1012
49.1%oftotaldatamovementcomesfrom
textureBlingandcolorbliZng
0%
10%
20%
30%
40%
Texture Tiling
Color Blitting
Frac
tion
of
Tota
l Ene
rgy
Data Movement Compute
HTML
CSS
HTML Parser
CSS Parser
Render Tree Layout Rasteriza-
tion Composit-
ing
Texture Tiling
Color Blitting
Rasteriza8on
InvokeComposi8ng
idle
high
Conversion
TextureTiles
%meCPU Memory
Rasteriza8on
ReadBitmap
Conversion
WriteBack
CPU PIM
CPU + PIM CPU-Only
LinearBitmap
TextureTilesInvoke
Composi8ng
TextureTiling
datamovement
LinearBitmap
TextureBlingisagoodcandidateforPIMexecuBon
RequiressimpleprimiBves:memcopy,bitwiseoperaBons,andsimplearithmeBcoperaBons
PIM Core PIM Accelerator
Texture Tiling
LinearBitmap TiledTexture
9.4%oftheareaavailableforPIMlogic
7.1%oftheareaavailableforPIMlogic
PIMcoreandPIMacceleratorarefeasibletoimplementin-memorytexturetiling
ChromeProcess
…
RenderProcess(tab1)
RenderProcess(tab2)
RenderProcess(tabn)
• Chrome uses compression to reduce each tab’s memory footprint
• To study data movement during tab switching, we perform an experiment: – A user opens 50 tabs (most-accessed websites) – Scrolls through each for a few second – Switches to the next tab
Compressionanddecompressioncontributeto
18.1%ofthetotalsystemenergy
19.6GBofdatamovebetweenCPUandZRAM
TensorFlow Mobile
57.3%oftheinferenceenergyisspentondatamovement
54.4%ofthedatamovementenergycomesfrompacking/unpackingandquanBzaBon
Inference PredicBon
Reorderselementsofmatricestominimizecachemisses
Upto40%oftheinferenceenergyand31%ofinferenceexecuBonBme
Packing’sdatamovementaccountsfor
upto35.3%oftheinferenceenergy
Packing Matrix PackedMatrix
AsimpledatareorganizaBonprocessthatrequiressimplearithmeBc
Converts32-bitfloaBngpointto8-bitintegers
Upto16.8%oftheinferenceenergyand16.1%ofinferenceexecuBonBme
MajorityofquanBzaBonenergycomesfromdatamovement
Quantization floa8ngpoint integer
AsimpledataconversionoperaBonthatrequiresshiO,addiBon,andmulBplicaBon
Video Playback and Capture
63.5%ofthesystemenergyisspentondatamovement
Compressedvideo VP9Decoder
Display
80.4%ofthedatamovementenergycomesfromsub-pixelinterpolaBonanddeblockingfilter
Deblockingfilter:asimplelow-passfilterthataSemptsto
removedisconBnuityinpixels
Sub-pixelinterpolaBon:interpolatesthevalueof
pixelsatnon-integerlocaBon
59.1%ofthesystemenergyisspentondatamovement
Capturedvideo
VP9Encoder
Compressed
MajorityofthedatamovementenergycomesfrommoBonesBmaBon
MoBonesBmaBon:compressestheframesusingtemporalredundancybetweenthem
Evaluation
0
0.2
0.4
0.6
0.8
1
Texture Tiling
Color Blitting
Com- pression
Decom- pression
Packing Quantization Sub-Pixel Interpolation
Deblocking Filter
Motion Estimation
Nor
mal
ized
Ene
rgy
CPU-Only PIM-Core PIM-Acc
Onaverage,energyconsumpBonreducesby49.1%usingPIMcoreand55.4%usingPIMaccelerator
0.0
0.2
0.4
0.6
0.8
1.0
Texture Tiling
Color Blitting
Comp- ression
Decomp- ression
Sub-Pixel Interpolation
Deblocking Filter
Motion Estimation
TensorFlow
Nor
mal
ized
Run
tim
e
CPU-Only PIM-Core PIM-Acc
ChromeBrowser VideoPlaybackandCapture
TensorFlowMobile
Onaverage,energyconsumpBonreducesby44.6%usingPIMcoreand54.2%usingPIMaccelerator
ChromeRenderingPipeline
Scrolling Tab Switching • Chrome employs a multi-process architecture
• Main operations during tab switching: – Context switch – Load the new page
CPU
DRAM
ZRAM
CompressedTab
12
CPUCPU-Only
Memory CPUCPU + PIM
PIM%me
SwapoutNpages
ReadNPages
Compress
Othertasks
Writeback
compression
ZRAM
SwapoutNpages
OthertasksCompress
ZRAMNooff-chipdata
movementdatamovement
high
UncompressedPages
UncompressedPages
BothfuncBonscanbenefitfromPIMexecuBonandcanbeimplementedatPIMlogic
ColorbliZngisagoodcandidateforPIMexecuBon
Generatesalargeamountofdatamovement
Requireslow-costoperaBons:Memset,simplearithmeBc,andshiOoperaBons
Accountsfor19.1%ofthetotalsystemenergyduringscrolling
ColorBliZng
TextureTiling
ColorbliZngisfeasibletoimplementinPIMlogic