CoNDA: Efficient Cache Coherence Support for Near-Data Accelerators
Coherence For NDAs
Amirali Boroumand, Saugata Ghose, Minesh Patel, Hasan Hassan, Brandon Lucia, Rachata Ausavarungnirun, Kevin Hsieh, Nastaran Hajinazar, Krishna Malladi, Hongzhong Zheng, Onur Mutlu
Application Analysis
Analysis of Existing Coherence Mechanisms
Architecture Support Evaluation
CoNDA consistently retains most of Ideal-NDA’s benefits, coming within 10.4% of the Ideal-NDA performance
CoNDA significantly reduces energy consumption and comes within 4.4% of Ideal-NDA
Challenge: Coherence between NDAs and CPUs
[Figure: four CPUs with L1 and L2 caches, DRAM, and an NDA compute unit]
It is impractical to use traditional coherence protocols:
(1) Large cost of off-chip communication
(2) NDA applications generate a large amount of off-chip data movement
1st key observation: CPU threads often concurrently access the same region of data that NDA kernels are accessing, which leads to significant data sharing
Graph Processing Hybrid Databases (HTAP)
We find that not all portions of applications benefit from NDA
1 Memory-intensive portions benefit from NDA
2 Compute-intensive or cache-friendly portions should remain on the CPU
[Figure: Hybrid Database (HTAP): transactions run on the CPUs and analytics run on the NDA, with data sharing between them]
2nd key observation: CPU threads and NDA kernels typically do not concurrently access the same cache lines
CPU threads rarely update the same data that an NDA is actively working on
For the Connected Components application, only 5.1% of the CPU accesses collide with NDA accesses
Poor handling of coherence eliminates much of an NDA's performance and energy benefits
[Figure: Speedup of CPU-only, NC, CG, FG, and Ideal-NDA for CC, Radii, and PageRank on the arXiV and Gnutella graphs, plus GMEAN]
[Figure: Normalized energy of CPU-only, NC, CG, FG, and Ideal-NDA for CC, Radii, and PageRank on the arXiV and Gnutella graphs, plus GMEAN]
CoNDA
We propose CoNDA, a mechanism that uses optimistic NDA execution to avoid unnecessary coherence traffic
[Figure: CoNDA timeline for the CPU and the NDA: CPU thread execution; offload NDA kernel; concurrent CPU + NDA execution (optimistic execution, no coherence requests); send signatures; coherence resolution; commit or re-execute]
Identifying Coherence Violations
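[Figure: timeline of CPU and NDA operations]
CPU: C1. Wr Z, C2. Rd A, C3. Wr B
NDA: N1. Rd X, N2. Wr Y, N3. Rd Z
Any coherence violation? Yes. Flush Z to DRAM
NDA: N4. Rd X, N5. Wr Y, N6. Rd Z
CPU: C4. Wr Y, C5. Rd Y, C6. Wr X
Any coherence violation? No. Commit NDA operations
Effective ordering: C1. Wr X, C2. Rd X, C3. Rd Y, C4. Wr Y, C5. Wr Y, C6. Rd Y, N4. Rd Z, N5. Wr Y, N6. Rd X, C7. Wr X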
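The violation check in this example can be sketched with exact address sets. This is a minimal illustration, not CoNDA's hardware mechanism: it assumes a violation is any address that the NDA read and the CPU wrote during the same execution window, and it assumes that C6. Wr X falls after the second window commits. The `Op` type and the helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str  # label from the example, e.g. "C1" or "N3"
    kind: str  # "Rd" or "Wr"
    addr: str  # symbolic address, e.g. "Z"


def write_set(ops):
    """Addresses written by a sequence of operations."""
    return {op.addr for op in ops if op.kind == "Wr"}


def read_set(ops):
    """Addresses read by a sequence of operations."""
    return {op.addr for op in ops if op.kind == "Rd"}


def violations(cpu_ops, nda_ops):
    """Addresses the NDA read that the CPU wrote in the same window."""
    return read_set(nda_ops) & write_set(cpu_ops)


# First window: the NDA runs N1-N3 while the CPU runs C1-C3.
cpu_1 = [Op("C1", "Wr", "Z"), Op("C2", "Rd", "A"), Op("C3", "Wr", "B")]
nda_1 = [Op("N1", "Rd", "X"), Op("N2", "Wr", "Y"), Op("N3", "Rd", "Z")]

# Second window: the NDA re-executes as N4-N6 while the CPU runs C4-C5.
# Assumption: C6. Wr X happens after the NDA commits, so it is left out.
cpu_2 = [Op("C4", "Wr", "Y"), Op("C5", "Rd", "Y")]
nda_2 = [Op("N4", "Rd", "X"), Op("N5", "Wr", "Y"), Op("N6", "Rd", "Z")]

first = violations(cpu_1, nda_1)   # Z: flush Z to DRAM and re-execute
second = violations(cpu_2, nda_2)  # empty: commit the NDA operations
```

Under these assumptions the first window reports Z and the second reports nothing, which matches the "Yes. Flush Z to DRAM" and "No. Commit NDA operations" outcomes in the example.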
Non-Cacheable Approach
[Figure: Hybrid Database (HTAP): transactions run on the CPUs and analytics run on the NDA, with data sharing between them]
Mark the NDA data as non-cacheable
(1) Generates a large number of off-chip accesses
(2) Significantly hurts CPU thread performance
NC fails to provide any energy savings and performs 6.0% worse than CPU-only
High Level Architecture of CoNDA
[Figure: CoNDA architecture: CPUs with L1 caches and a shared LLC, the CPUWriteSet, the coherence resolution logic, DRAM, and an NDA core with its L1 cache, NDAReadSet, and NDAWriteSet]
Per-word dirty bit mask to mark all uncommitted data updates
The NDAReadSet and NDAWriteSet are used to track memory accesses from the NDA
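One plausible way to picture the per-word dirty bit mask is a cache line that records which words hold uncommitted updates, so that only those words are written back at commit time. The sketch below is an illustration under stated assumptions, not the actual hardware design: the line geometry (16 words per line) and the names `NdaCacheLine`, `commit_into`, and `discard` are hypothetical.

```python
WORDS_PER_LINE = 16  # assumption: e.g. a 64-byte line with 4-byte words


class NdaCacheLine:
    """Hypothetical NDA cache line with a per-word dirty bit mask."""

    def __init__(self, words):
        self.words = list(words)
        self.dirty_mask = 0  # bit i set => word i holds an uncommitted update

    def write_word(self, index, value):
        """Update one word and mark it as uncommitted."""
        self.words[index] = value
        self.dirty_mask |= 1 << index

    def commit_into(self, memory_line):
        """Copy only the dirty words into memory_line, then clear the mask."""
        for i in range(WORDS_PER_LINE):
            if self.dirty_mask & (1 << i):
                memory_line[i] = self.words[i]
        self.dirty_mask = 0

    def discard(self, memory_line):
        """Drop uncommitted updates by reloading the line from memory."""
        self.words = list(memory_line)
        self.dirty_mask = 0


memory = [0] * WORDS_PER_LINE
line = NdaCacheLine(memory)
line.write_word(3, 7)      # uncommitted NDA update to word 3
memory[5] = 9              # a different word of the line changes in memory
line.commit_into(memory)   # only word 3 is written back; word 5 is untouched
```

Tracking dirtiness per word rather than per line means a commit does not overwrite words the NDA never modified.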
Optimistic Execution
[Figure: Speedup of CPU-only, NDA-only, FG, CoNDA, and Ideal-NDA for CC, Radii, and PR on the arXiV, Gnutella, and Enron graphs and for HTAP (128, 256), plus GMEAN]
[Figure: Normalized energy of CPU-only, FG, CoNDA, and Ideal-NDA for CC, Radii, and PR on the arXiV, Gnutella, and Enron graphs and for HTAP (128, 256), plus GMEAN]
[Figure: Coherence resolution in the CoNDA architecture: each address is hashed by k hash functions (h0, h1, ..., hk-1) into a bit-vector signature, and the NDAReadSet and CPUWriteSet signatures are compared to detect a conflict]
If a conflict happens:
• The CPU flushes the dirty cache lines that match addresses in the NDAReadSet
• The NDA invalidates all uncommitted cache lines
• Signatures are erased and the NDA restarts execution
If there are no conflicts:
• Any clean cache lines in the CPU that match an address in the NDAWriteSet are invalidated
• The NDA commits data updates
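The two outcomes above can be sketched as a single resolution routine. This is a simplified software model under stated assumptions, not the hardware implementation: the sets are exact Python sets rather than signatures, the cache and DRAM are dictionaries with a hypothetical layout, a flushed CPU line is assumed to stay cached as clean, and `resolve_coherence` is a name introduced here for illustration.

```python
def resolve_coherence(cpu_write_set, nda_read_set, nda_write_set,
                      cpu_cache, nda_uncommitted, dram):
    """Return "re-execute" on a conflict, otherwise "commit".

    cpu_cache:       addr -> {"value": int, "dirty": bool} (hypothetical layout)
    nda_uncommitted: addr -> value written by the NDA but not yet committed
    dram:            addr -> value
    """
    if cpu_write_set & nda_read_set:
        # Conflict: the CPU flushes dirty lines that match the NDAReadSet.
        for addr, line in cpu_cache.items():
            if line["dirty"] and addr in nda_read_set:
                dram[addr] = line["value"]
                line["dirty"] = False  # assumption: the line stays cached, clean
        # The NDA invalidates all uncommitted cache lines.
        nda_uncommitted.clear()
        # Signatures are erased before the NDA restarts execution.
        for s in (cpu_write_set, nda_read_set, nda_write_set):
            s.clear()
        return "re-execute"

    # No conflict: invalidate clean CPU lines that match the NDAWriteSet.
    for addr in nda_write_set:
        line = cpu_cache.get(addr)
        if line is not None and not line["dirty"]:
            del cpu_cache[addr]
    # The NDA commits its data updates.
    dram.update(nda_uncommitted)
    nda_uncommitted.clear()
    return "commit"


cpu_cache = {"Z": {"value": 5, "dirty": True}, "Y": {"value": 1, "dirty": False}}
dram = {"X": 0, "Y": 1, "Z": 0}

# Window 1: the CPU wrote Z, which the NDA read, so the NDA must re-execute.
nda_updates = {"Y": 42}
first = resolve_coherence({"Z"}, {"X", "Z"}, {"Y"}, cpu_cache, nda_updates, dram)

# Window 2: no CPU write overlaps the NDA reads, so the NDA commits.
nda_updates_2 = {"Y": 42}
second = resolve_coherence(set(), {"X", "Z"}, {"Y"}, cpu_cache, nda_updates_2, dram)
```

In the first call the dirty copy of Z reaches DRAM before the NDA restarts, so the re-executed kernel reads the CPU's latest value.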
Coherence Resolution
A Bloom-filter-based signature has two benefits:
• Allows us to easily perform coherence resolution
• Allows for a large number of addresses to be stored within a fixed-length register
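Both benefits can be illustrated with a small Bloom-filter-style signature. The sketch below is one possible layout, not the actual CoNDA signature format: it assumes the register is split into segments with one hash function per segment, and the segment count, segment width, and the use of BLAKE2b as the hash are all illustrative choices. A nonempty bitwise AND in every segment means the two sets may share an address (false positives are possible); an empty AND in any segment means they definitely do not.

```python
import hashlib


class Signature:
    """Hypothetical fixed-length, segmented Bloom-filter signature."""

    def __init__(self, segments=4, bits_per_segment=256):
        self.k = segments
        self.m = bits_per_segment
        self.segs = [0] * segments  # each int acts as an m-bit vector

    def _bit(self, seg, addr):
        # One hash per segment, derived by salting BLAKE2b (illustrative choice).
        digest = hashlib.blake2b(addr.to_bytes(8, "little"),
                                 digest_size=8, salt=bytes([seg])).digest()
        return int.from_bytes(digest, "little") % self.m

    def insert(self, addr):
        """Record an address by setting one bit in each segment."""
        for s in range(self.k):
            self.segs[s] |= 1 << self._bit(s, addr)

    def may_contain(self, addr):
        """True if addr may have been inserted (never a false negative)."""
        return all((self.segs[s] >> self._bit(s, addr)) & 1
                   for s in range(self.k))

    def may_intersect(self, other):
        """True if every segment of the bitwise AND has a set bit."""
        return all(a & b for a, b in zip(self.segs, other.segs))

    def clear(self):
        """Erase the signature."""
        self.segs = [0] * self.k


nda_read_set = Signature()
cpu_write_set = Signature()
for addr in (0x1000, 0x1040, 0x1080):
    nda_read_set.insert(addr)
cpu_write_set.insert(0x1040)  # the CPU writes a line that the NDA read

conflict = nda_read_set.may_intersect(cpu_write_set)
```

The signature occupies the same number of bits no matter how many addresses are inserted, and the conflict check reduces to a handful of AND operations.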
Fine-Grained Coherence
[Figure: CPUs and NDA, with a high amount of off-chip coherence traffic]
Using fine-grained coherence has two benefits:
1 Simplifies the NDA programming model
2 Allows us to get permissions for only the pieces of data that are actually accessed
FG eliminates 71.8% of the energy benefits of an ideal NDA mechanism
Coarse-Grained Coherence
[Figure: CPUs and NDA; CPU and NDA timeline in which a CPU thread stalls on an access to NDA data]
Use coarse-grained locks to provide exclusive access
Get coherence permission for the NDA region
Unnecessarily flushes a large amount of dirty data
Blocks CPU threads when they access NDA data regions
CG fails to provide any performance benefit of NDA and performs 0.4% worse than CPU-only