GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks
Lifeng Nai, Ramyad Hadidi, Jaewoong Sim*, Hyojong Kim, Pranith Kumar, Hyesoon Kim
Georgia Tech, *Intel Labs
INTRODUCTION
- Graph computing: processing big network data
  - Social network, knowledge network, bioinformatics, etc.
- Graph computing is inefficient on conventional architectures
  - Inefficiency in memory subsystems
INTRODUCTION
- Processing-in-memory (PIM)
  - PIM has the potential of helping graph computing performance
  - PIM is being realized in real products: Hybrid Memory Cube (HMC) 2.0
Enable PIM for graph computing
CHALLENGES
- What are the benefits of PIM for graph computing?
  - Known benefits of PIM
    - Bandwidth savings, latency reduction, more computation power
    - But they are not good enough
    - We explore something more!
- How to enable PIM for graph computing in a practical way?
  - Minor hardware/software change
  - No programmer burden
OVERVIEW
GraphPIM: a PIM-enabled graph framework
- We identify a new benefit of PIM offloading
- We determine PIM offloading targets
- We enable PIM without user-application change
KNOWN PIM BENEFITS
- More computation power
  - Extra computation units in memory
- Bandwidth savings
  - Pulling data vs. pushing computation command
- Latency reduction
  - Bypassing cache for offloaded accesses avoids cache-checking overhead
  - Avoids cache pollution and increases effective cache size

| Operation | Request | Response | Total |
|---|---|---|---|
| 64-byte READ | 1 FLIT (addr) | 5 FLITs (data) | 6 FLITs |
| 64-byte WRITE | 5 FLITs (addr, data) | 1 FLIT (ack) | 6 FLITs |
| CPU rd-modify-wr (rd-miss; wr-evict) | 6 FLITs | 6 FLITs | 12 FLITs |
| PIM rd-modify-wr | 2 FLITs (addr, imm) | 1 FLIT (ack) | 3 FLITs |

(FLIT: 16 bytes, basic flow unit)
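The bandwidth argument in the table can be checked with a few lines of arithmetic. The sketch below is only an illustration: the per-transaction FLIT counts are copied from the table, while the dictionary keys and function names are made up for this example. It treats a CPU read-modify-write as one 64-byte read (on the miss) plus one 64-byte write (on the eviction), as the table row suggests.

```python
FLIT_BYTES = 16  # FLIT size stated on the slide

# (request FLITs, response FLITs) per transaction, copied from the table.
# The keys are illustrative names, not HMC command names.
TRANSACTIONS = {
    "read64": (1, 5),
    "write64": (5, 1),
    "pim_rmw": (2, 1),
}


def total_flits(name):
    """Total link traffic of one transaction, in FLITs."""
    request, response = TRANSACTIONS[name]
    return request + response


def cpu_rmw_flits():
    """CPU read-modify-write: read miss fetches the line, eviction writes it back."""
    return total_flits("read64") + total_flits("write64")


cpu = cpu_rmw_flits()
pim = total_flits("pim_rmw")
print(f"CPU RMW: {cpu} FLITs = {cpu * FLIT_BYTES} bytes")
print(f"PIM RMW: {pim} FLITs = {pim * FLIT_BYTES} bytes")
print(f"Traffic ratio: {cpu / pim:.1f}x")
```

Under these numbers the CPU path moves 12 FLITs (192 bytes) per read-modify-write and the PIM path 3 FLITs (48 bytes), a 4x difference in link traffic.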
PERFORMANCE BENEFITS
- More computation power?
  - Limited number of FUs in memory
- Bandwidth savings?
  - Not BW saturated
- Latency reduction?
  - Yes, but small
GRAPHPIM EXPLORES…
- Atomic overhead reduction
  - Atomic instructions on CPUs have substantial overhead [Schweizer'15]
    - RMW: read-modify-write
    - Cache operation: cache-line invalidation, coherence traffic, etc.
    - Data ordering: write buffer draining, pipeline freeze, etc.
  - Because of the characteristics of the graph programming model, PIM offloading can avoid the atomic overhead
[Figure: an atomic instruction broken into RMW, data ordering, and cache operation]
H. Schweizer et al., "Evaluating the Cost of Atomic Operations on Modern Architectures," PACT'15
ATOMIC OVERHEAD REDUCTION
[Figure: timelines of a CPU atomic and a PIM atomic. CPU atomic labels: RMW, Data Ordering, Cache Operation, Pipeline Stall (Serialization), Retire. PIM atomic labels: Offload, ACK, Retire, Continue Execution, Serialization in PIM.]
ATOMIC OVERHEAD ESTIMATION
- Atomic overhead experiments on a Xeon E5 machine
  - Atomic RMW → regular load + compute + store
- Atomic instructions incur 30% performance degradation
[Figure: normalized execution time (0 to 2), Atomic vs. Non-Atomic, for BFS, CComp, DC, kCore, SSSP, TC, BC, PRank, and GMean]
(Non-Atomic: artificial experiment, not precise estimation)
PERFORMANCE BENEFITS
- More computation power?
  - Limited number of FUs in memory
- Bandwidth savings?
  - Not BW saturated
- Latency reduction?
  - Yes, but small
- Atomic overhead reduction?
  - Yes and significant!
  - Main source of PIM benefit for graph computing
OFFLOADING TARGETS
- Code snippet: breadth-first search (BFS)

     1  F ← {source}
     2  while F is not empty
     3      F' ← {∅}
     4      for each u ∈ F in parallel
     5          d ← u.depth + 1
     6          for each v ∈ neighbor(u)
     7              ret ← CAS(v.depth, inf, d)
     8              if ret == success
     9                  F' ← F' ∪ v
    10              endif
    11          endfor
    12      endfor
    13      barrier()
    14      F ← F'
    15  endwhile

Legend:
- F: frontier vertex set of current step
- F': frontier vertex set of next step
- u.depth: depth value of vertex u
- neighbor(u): neighbor vertices of u
- CAS(v.depth, inf, d): atomic compare-and-swap operation

Access types:
- Lines 4-5, 8-10: accessing metadata
- Line 6: accessing graph structure
- Line 7: accessing graph property
Figure labels: ✔ ✔ Cache friendly; Cache unfriendly + atomic

Offload atomic operations on graph property
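The BFS pseudocode above can be sketched as runnable Python. This is a sequential illustration under stated assumptions, not the framework's implementation: the `cas` helper only imitates an atomic compare-and-swap (there is no real concurrency here), the adjacency-dict graph format and all function names are chosen for this example, and the "in parallel" loop runs sequentially.

```python
import math


def cas(depth, v, expected, new):
    """Stand-in for an atomic compare-and-swap on depth[v] (sequential here)."""
    if depth[v] == expected:
        depth[v] = new
        return True
    return False


def bfs(adj, source):
    """Level-synchronous BFS over an adjacency dict {vertex: [neighbors]}."""
    depth = {v: math.inf for v in adj}  # graph property
    depth[source] = 0
    frontier = [source]                 # F in the pseudocode
    while frontier:
        next_frontier = []              # F' in the pseudocode
        for u in frontier:              # "in parallel" on the slide
            d = depth[u] + 1
            for v in adj[u]:            # graph structure access
                # Graph property access: the atomic the slide marks for offloading.
                if cas(depth, v, math.inf, d):
                    next_frontier.append(v)
        # End of level: plays the role of barrier() followed by F <- F'.
        frontier = next_frontier
    return depth


adj = {0: [1, 2], 1: [3], 2: [3], 3: [], 4: []}
print(bfs(adj, 0))
```

Vertex 3 is reachable through both 1 and 2, but only the first CAS on its depth succeeds, so it is added to the next frontier once. Vertex 4 is unreachable and keeps depth `inf`.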
INDICATE OFFLOADING TARGETS
- How to indicate offloading targets?
- Option #1: Mark instructions
  - New instructions in ISA
  - Requires changes in user-level applications
- Option #2: Mark memory regions
  - Special memory region for offloading data
  - Can be transparent to application programmers
GRAPH FRAMEWORK
- Graph computing is framework-based
  - User application is designed on top of framework interfaces
  - Data is managed within the framework

User application:
    G = load_graph("Fruit");
    V1 = G.find_vertex("Apple");
    V1.property().price = 5;
    V1.add_neighbor("Orange");

load_graph (framework):
    G_structure = malloc(size1);
    G_property = malloc(size2);
    Open file & load data

[Figure: the user application on top of the framework's graph APIs and middleware]
ENABLE PIM IN GRAPH FRAMEWORK
[Figure: system stack. SW: user applications, graph framework (graph API, middleware, graph data management), OS. HW: hardware architecture with host processor core and PIM offloading unit.]
- Graph property: malloc() → pmr_malloc()
- No user application change

load_graph (framework):
    G_structure = malloc(size1);
    G_property = pmr_malloc(size2);    (replaces malloc(size2))
    Open file & load data
FRAMEWORK CHANGE
- PIM memory region (PMR)
  - Uncacheable memory region in virtual memory space
  - Utilizing existing uncacheable (UC) support in x86
- Framework change
  - malloc() → pmr_malloc()
  - pmr_malloc(): customized malloc function that allocates memory objects in PMR
[Figure: graph data management (framework) in the virtual memory space. Graph property is allocated with pmr_malloc() in the PIM memory region; graph structure and other data are allocated with malloc().]
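To make the framework change concrete, here is a toy model of the idea. Everything in it is an assumption for illustration: the `Arena` class, the address ranges, and the Python `malloc`/`pmr_malloc` stand-ins are hypothetical and only mimic the split between ordinary heap memory and the PIM memory region. The point it shows is that the allocator choice lives inside the framework's `load_graph`, so user code does not change.

```python
class Arena:
    """Toy bump allocator standing in for a region of virtual memory."""

    def __init__(self, name, base, size):
        self.name, self.base, self.size, self.used = name, base, size, 0

    def alloc(self, nbytes):
        if self.used + nbytes > self.size:
            raise MemoryError(f"{self.name} arena exhausted")
        addr = self.base + self.used
        self.used += nbytes
        return addr

    def contains(self, addr):
        return self.base <= addr < self.base + self.size


# Hypothetical address ranges, chosen only for this illustration.
HEAP = Arena("heap", base=0x1000_0000, size=1 << 20)
PMR = Arena("pmr", base=0x8000_0000, size=1 << 20)


def malloc(nbytes):
    """Ordinary (cacheable) allocation."""
    return HEAP.alloc(nbytes)


def pmr_malloc(nbytes):
    """Allocation inside the PIM memory region."""
    return PMR.alloc(nbytes)


def load_graph(structure_bytes, property_bytes, property_alloc=pmr_malloc):
    """Framework-side loader: the user application never sees the allocator."""
    return {
        "structure": malloc(structure_bytes),
        "property": property_alloc(property_bytes),
    }


g = load_graph(4096, 1024)
print(PMR.contains(g["property"]), PMR.contains(g["structure"]))
```

Passing `property_alloc=malloc` models the original framework without PIM, which mirrors the deck's point that GraphPIM can be switched on or off at the framework level.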
ARCHITECTURE CHANGE
- PIM offloading unit (POU)
  - Identifies atomic instructions that are accessing the PIM memory region
  - Offloads them as PIM instructions
[Figure: hardware architecture. Host processor: core, caches, POU. HMC: atomic unit.]
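The POU's role can be sketched as a small routing function. This is a hedged illustration, not the hardware design: the PMR address range, the function names, and the returned labels are hypothetical. The third case (non-atomic accesses to the PMR going to memory as regular uncached requests) is an assumption that follows from the PMR being an uncacheable region.

```python
# Hypothetical PMR address range, for illustration only.
PMR_BASE = 0x8000_0000
PMR_SIZE = 1 << 20


def in_pmr(addr):
    """True if the address falls inside the PIM memory region."""
    return PMR_BASE <= addr < PMR_BASE + PMR_SIZE


def route(addr, is_atomic):
    """Decide where a memory instruction goes.

    - Outside the PMR: normal cacheable path.
    - Atomic inside the PMR: offloaded as a PIM instruction.
    - Non-atomic inside the PMR: assumed to bypass the caches as a regular
      memory request, because the PMR is uncacheable.
    """
    if not in_pmr(addr):
        return "cache"
    if is_atomic:
        return "pim_request"
    return "uncached_mem_request"


print(route(0x1000_0000, is_atomic=True))    # atomic on ordinary memory
print(route(0x8000_0040, is_atomic=True))    # atomic on graph property
print(route(0x8000_0040, is_atomic=False))   # plain load/store on graph property
```

Note that an atomic instruction on ordinary memory still takes the cache path; only the combination of "atomic" and "in PMR" triggers offloading.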
CHANGES
- Software changes:
  - No user application change
  - Minor change in framework: malloc() → pmr_malloc()
- Hardware changes:
  - PIM memory region: utilizes existing uncacheable (UC) support
  - PIM offloading unit (POU): identifies offloading targets
No burden on programmers + minor HW/SW change
EVALUATION
- Simulation environment
  - SST (framework) + MacSim (CPU) + VaultSim (HMC)
- Benchmark
  - GraphBIG benchmark suite [Nai'15] (https://github.com/graphbig)
  - LDBC dataset from the Linked Data Benchmark Council (LDBC)
- Configuration
  - 16 OoO cores, 2 GHz, 4-issue
  - 32 KB L1 / 256 KB L2 / 16 MB shared L3
  - HMC 2.0 spec, 8 GB, 32 vaults, 512 banks, 4 links, 120 GB/s per link
L. Nai et al., "GraphBIG: Understanding Graph Computing in the Context of Industrial Solutions," SC'15
EVALUATION: PERFORMANCE
- Baseline: no PIM offloading
- U-PEI: performance upper bound of PIM-enabled instructions [Ahn'15]
[Figure: speedup over baseline (0 to 3) for Baseline, U-PEI, and GraphPIM, plotted with % PIM-Atomic in all instructions (0.0% to 3.0%)]
- Up to 2.4X speedup
- On average 1.6X speedup
J. Ahn et al., "PIM-Enabled Instructions: A Low-Overhead, Locality-Aware PIM Architecture," ISCA'15
EVALUATION: EXECUTION TIME BREAKDOWN
- Breakdown of normalized execution time
  - Atomic-inCore: atomic overhead of offloading targets (atomic inst.)
  - Atomic-inCache: cache-checking overhead of offloading targets
[Figure: normalized execution time (0 to 1) of Baseline vs. GraphPIM for BFS, CComp, DC, kCore, SSSP, TC, BC, and PRank, broken down into Other, Atomic-inCore, and Atomic-inCache]
CONCLUSION
- Graph computing is inefficient on conventional architectures
- GraphPIM enables PIM in graph computing frameworks
  - Explores a new benefit of PIM offloading: atomic overhead reduction
  - Identifies atomic operations on graph property as the offloading target
  - Requires no user-application change and only minor change in framework and architecture
THANK YOU!
BACKUP SLIDES
DYNAMIC CACHE BEHAVIOR?
- GraphPIM marks memory regions statically
  - Cons: cannot be adaptive to the working set sizes
  - But property accesses in graphs have very high cache miss rates regardless of graph inputs, except for really small graph sizes [JPDC'16, SC'15]
  - Pros: coherence support between memory and processor cache is not required
CONSISTENCY?
- PIM offloading for atomic instructions works fine because…
  - The programming model of graph applications naturally avoids consistency issues: all PIM writes are done before reads
  - Graph applications require only atomicity from atomic instructions
  - But atomic instructions on CPUs do not allow specifying atomicity without a fence
  - We also have follow-up work discussing the consistency issue for PIM instructions in the context of general applications [MEMSYS'17]
CONSISTENCY?
- Graph applications with the BSP model naturally avoid consistency issues
  - Barriers ensure all PIM writes are done before reads

Program phases and operations:
    loop:
        foreach vertex in task queue:
            read property                   // Reads
            fetch neighbor list
            foreach neighbor:
                update neighbor property    // HMC inst.
            update next-iter task queue
        barrier
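The barrier argument can be illustrated with a small threaded sketch. It is an analogy only, under clearly labeled assumptions: a `threading.Lock` stands in for an atomic add executed in memory, the in-degree workload and all names are chosen for this example, and Python threads are not a model of PIM hardware. What it does show is the BSP pattern itself: every update is issued before the barrier, and every read happens after it, so all readers observe the same final values.

```python
import threading


def bsp_degree_count(edges, num_vertices, num_workers=4):
    """One BSP superstep: count in-degrees, then read them after a barrier."""
    degree = [0] * num_vertices           # graph property
    lock = threading.Lock()               # stands in for an atomic add
    barrier = threading.Barrier(num_workers)
    seen = [None] * num_workers           # what each worker reads afterwards

    def worker(wid):
        # Update phase: only writes to the property array.
        for _, dst in edges[wid::num_workers]:
            with lock:
                degree[dst] += 1
        # Barrier: every worker's updates are complete past this point.
        barrier.wait()
        # Read phase: no writes are in flight, so all workers agree.
        seen[wid] = list(degree)

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return degree, seen


edges = [(0, 1), (0, 2), (1, 2), (3, 2), (2, 0)]
degree, seen = bsp_degree_count(edges, num_vertices=4)
print(degree)
print(all(snapshot == degree for snapshot in seen))
```

Because reads never overlap with updates, the only property the updates themselves need is atomicity, which matches the deck's claim that graph applications require only atomicity from atomic instructions.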
WHY GRAPHPIM IS A FRAMEWORK?
- GraphPIM
  - Considers the separation of framework and user application
  - Proposes a full-stack solution: SW framework + HW architecture
  - Requires no application programmers' efforts
- Users can easily enable/disable GraphPIM by switching between different framework libraries.
BANDWIDTH SENSITIVITY?
- Graph workloads on CPUs are not very sensitive to BW changes
  - Speedup over baseline system with different HMC link bandwidth
[Figure: speedup over baseline (0 to 2.5) for BFS, CComp, DC, kCore, SSSP, TC, BC, PRank, and GMean. Series: Baseline, Baseline-half-BW, Baseline-double-BW, GraphPIM, GraphPIM-Half-BW, GraphPIM-Double-BW.]
APPLICABILITY?

| Category | Workload | Applicable? | Offloading target | PIM inst. |
|---|---|---|---|---|
| Graph traversal | Breadth-first search | ✔ | lock cmpxchg | CAS if equal |
| | Degree centrality | ✔ | lock addw | Signed add |
| | Betweenness centrality | ✘ | (Floating-point add) | (FP add) |
| | Shortest path | ✔ | lock cmpxchg | CAS if equal |
| | K-core decomposition | ✔ | lock subw | Signed add |
| | Connected component | ✔ | lock cmpxchg | CAS if equal |
| | PageRank | ✘ | (Floating-point add) | (FP add) |
| Dynamic graph | Graph construction | ✘ | (Complex operation) | |
| | Graph update | ✘ | (Complex operation) | |
| | Topology morphing | ✘ | (Complex operation) | |
| Rich property | Triangle count | ✔ | lock add | Signed add |
| | Gibbs inference | ✘ | (Compute intensive) | |
GRAPHPIM: EVALUATION
- GraphPIM speedup over baseline with different dataset sizes
[Figure: speedup (0 to 5) for BFS, CComp, DC, kCore, SSSP, TC, BC, and PRank. Series: LDBC-1M, LDBC-100k, LDBC-10k, LDBC-1k.]
GRAPHPIM: EVALUATION
- Normalized uncore energy consumption
[Figure: normalized energy breakdown (0 to 1) of Baseline vs. GraphPIM for BFS, CComp, DC, kCore, SSSP, TC, BC, PRank, and GMean. Components: Caches, HMC Link, HMC FU, HMC Logic Layer, HMC DRAM.]
On average, GraphPIM saves 37% of uncore energy because of the reduction in cache accesses and memory bandwidth
GRAPHPIM: EVALUATION
- Normalized bandwidth consumption with request/response breakdown
[Figure: normalized memory bandwidth (0 to 1.2) of Baseline, U-PEI, and GraphPIM for BFS, CComp, DC, kCore, SSSP, TC, BC, and PRank, broken down into Request and Response]
GRAPHPIM: EVALUATION
- Performance and energy results of two real-world applications
  - Based on an analytical model
  - FD: financial fraud detection; RS: recommender system
[Figure: speedup over baseline (0 to 2.5) of Baseline vs. GraphPIM for FD and RS]
[Figure: normalized energy breakdown (0 to 1) of Baseline vs. GraphPIM for FD and RS. Components: Caches, HMC Link, HMC Other.]
HARDWARE CHANGES
- PIM memory region (PMR)
  - An uncacheable memory region in virtual memory space
  - Utilizing existing uncacheable (UC) support in x86
- PIM offloading unit (POU)
[Figure: host and HMC. Host: cores with L1 and L2 caches, POU, last-level cache, HMC controller. HMC: links, switch, and vaults 00 to 31, each with vault logic, an atomic unit, and a DRAM partition.]
[Figure: POU decision flow for a memory instruction, with checks "PIM mem region?" and "Atomic inst?" and outcomes "to L1", "to HMC mem req", and "HMC PIM request"]
MOTIVATION
- Profiling using HW performance counters
  - Execution cycle breakdown: top-down methodology from Intel
[Figure: misses per kilo instructions (0 to 200) for L1D, L2, and L3]
[Figure: execution cycle breakdown (0% to 100%) into Backend, Frontend, Bad Speculation, and Retiring]
- Bottleneck caused by backend stalls
- High number of cache misses
BACKGROUND: PIM OFFLOADING IN HMC 2.0
- Hybrid Memory Cube (HMC) 2.0
  - One of the first industrial PIM proposals
  - Instruction-level PIM offloading
  - 1 logic die + 4/8 DRAM dies
  - 32 vaults
  - 4 serial links
[Figure: HMC structure showing serial links, TSVs, logic layer, DRAM layers, and a vault]
BACKGROUND: PIM OFFLOADING IN HMC 2.0
- Packet-based protocol
  - Packet layout: Header (8 bytes), Payload (0~256 bytes), Tail (8 bytes)
- Regular READ/WRITE
  - FLIT: 16 bytes; basic flow unit

| Operation | Request | Response |
|---|---|---|
| 64-byte READ | 1 FLIT | 5 FLITs |
| 64-byte WRITE | 5 FLITs | 1 FLIT |
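The FLIT counts for regular READ/WRITE follow directly from the packet layout. The sketch below is a worked example under one assumption: a packet is taken to be header plus payload plus tail, rounded up to whole 16-byte FLITs. The constant and function names are illustrative, not taken from the HMC specification.

```python
FLIT_BYTES = 16      # basic flow unit, from the slide
HEADER_BYTES = 8     # from the slide
TAIL_BYTES = 8       # from the slide
MAX_PAYLOAD = 256    # upper end of the 0~256-byte payload range


def packet_flits(payload_bytes):
    """Number of FLITs for a packet carrying the given payload."""
    if not 0 <= payload_bytes <= MAX_PAYLOAD:
        raise ValueError("payload must be between 0 and 256 bytes")
    total = HEADER_BYTES + payload_bytes + TAIL_BYTES
    return -(-total // FLIT_BYTES)  # ceiling division


# 64-byte READ: the request carries no data, the response carries the line.
print(packet_flits(0), packet_flits(64))
# 64-byte WRITE: the request carries the line, the response is an ack.
print(packet_flits(64), packet_flits(0))
```

With no payload, header and tail fill exactly one FLIT; with 64 bytes of data the packet is 80 bytes, or 5 FLITs, which reproduces both table rows. By the same arithmetic, a request with a 16-byte payload occupies 2 FLITs.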
BACKGROUND: PIM OFFLOADING IN HMC 2.0
- PIM instruction: read-modify-write (RMW) operation
  - Similar to regular READ/WRITE, just a different CMD in the header
  - DRAM bank is locked during the whole RMW for atomicity
Example: PIM-ADD(addr, imm)
- Request packet: Header (PIM-ADD) | addr, imm | Tail
- Response: ACK