Memory Ordering: A Value-based Approach
Trey Cain and Mikko Lipasti
University of Wisconsin-Madison
Cain and Lipasti, ISCA 2004
Value-based replay
- High ILP => large instruction windows:
  - Larger physical register file
  - Larger scheduler
  - Larger load/store queues
  - Result: increased access latency
- If load queue scalability is a problem... who needs one!
- Instead, re-execute load instructions a 2nd time in program order
- Filter replays: heuristics reduce extra cache bandwidth to 3.5% on average
Outline
- Conventional load queue functionality/microarchitecture
- Value-based memory ordering
- Replay-reduction heuristics
- Performance evaluation
Enforcing RAW dependences

Program order (execution order):
1. (1) store A
2. (3) store ?
3. (2) load A

- The load queue contains load addresses
- One search per store address calculation
- If a match is found, the load is squashed
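The search-and-squash behavior above can be sketched in software (a behavioral model only; the entry fields and naming are illustrative assumptions, not the hardware design):

```python
from dataclasses import dataclass

@dataclass
class LoadQueueEntry:
    age: int        # program-order sequence number
    address: int    # resolved load address
    executed: bool  # the load has already read the cache

def raw_violations(load_queue, store_age, store_address):
    """One search per store address calculation: find younger loads to
    the same address that already executed (they read stale data)."""
    return [ld for ld in load_queue
            if ld.executed
            and ld.age > store_age
            and ld.address == store_address]

# Example: the age-5 load ran before the older store's address resolved.
lq = [LoadQueueEntry(age=5, address=0x40, executed=True),
      LoadQueueEntry(age=7, address=0x80, executed=False)]
matches = raw_violations(lq, store_age=3, store_address=0x40)
```

Here the age-5 load matches the resolving store and would be squashed; the age-7 load has a different address and is unaffected.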
Enforcing memory consistency

Processor p1:
1. (3) load A
2. (1) load A

Processor p2:
1. (2) store A

[Figure shows RAW and WAR edges between the operations]

Two approaches:
- Snooping: search per incoming invalidate
- Insulated: search per load address calculation
Load queue implementation

[Figure: an address CAM paired with a load meta-data RAM; inputs include load/store addresses and ages and external addresses, driving squash determination, queue management, and external requests]

- # of write ports = load address calculation width
- # of read ports = load + store address calculation width (+1)
- Current-generation designs: 32-48 entries, 2 write ports, 2 (3) read ports
Load queue scaling
- Larger instruction window => larger load queue
  - Increases access latency
  - Increases energy consumption
- Wider issue width => more read/write ports
  - Also increases latency and energy
Related work: MICRO 2003
- Park et al., Purdue
  - Extra structure dedicated to enforcing memory consistency
  - Increase capacity through segmentation
- Sethumadhavan et al., UT-Austin
  - Add a set of filters summarizing the contents of the load queue
Keep it simple...
- Throw more hardware at the problem?
  - Need to design/implement/verify
  - The execution core is already complicated
- The load queue checks for rare errors
  - Why not move error checking away from the execution core?
Value-based ordering
- Replay: access the cache a second time - cheaply!
  - Almost always a cache hit
  - Reuse address calculation and translation
  - Share the cache port used by stores in the commit stage
- Compare: compare the new value to the original value
  - Squash if the values differ
- DIVA à la carte [Austin, MICRO 1999]

[Pipeline diagram: IF1 IF2 D R Q S EX C REP WB ...]
Rules of replayRules of replay
1.1. All prior stores must have written data to All prior stores must have written data to the cachethe cache
No store-to-load forwardingNo store-to-load forwarding
2.2. Loads must replay in program orderLoads must replay in program order If a cache miss occurs, all subsequent loads If a cache miss occurs, all subsequent loads
must be replayedmust be replayed
3.3. If a load is squashed, it should not be If a load is squashed, it should not be replayed a second timereplayed a second time
Ensures forward progressEnsures forward progress
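Under these rules, the replay-and-compare step at commit can be sketched as follows (a behavioral model with an assumed dict-based memory; squash handling is simplified to returning the first mismatching load):

```python
def commit_replay(loads_in_program_order, memory):
    """Replay committed loads in program order (rule 2), after all prior
    stores have written the cache (rule 1 is assumed). Return the id of
    the first load whose replayed value differs from its original value;
    that load is squashed and, per rule 3, not replayed again."""
    for ld in loads_in_program_order:
        replay_value = memory[ld["address"]]  # second, in-order cache access
        if replay_value != ld["value"]:
            return ld["id"]  # squash: instructions after it are re-fetched
    return None  # all loads verified

memory = {0x40: 7, 0x80: 3}
loads = [{"id": 0, "address": 0x40, "value": 7},   # matches: verified
         {"id": 1, "address": 0x80, "value": 9}]   # stale value: squash
```

The comparison against the in-order second access is what makes the associative load queue search unnecessary.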
Replay reduction
- Replay costs:
  - Consumes cache bandwidth (and power)
  - Increases reorder buffer occupancy
- Can we avoid these penalties?
  - Infer the correctness of certain operations
- Four replay filters
No-Reorder filter
- Avoid replay if the load isn't reordered with respect to other memory operations
- Can we do better?
Enforcing single-thread RAW dependences
- No-Unresolved-Store-Address filter
  - A load instruction i is replayed if there are prior stores with unresolved addresses when i issues
  - Works for intra-processor RAW dependences
  - Doesn't enforce memory consistency
Enforcing MP consistency
- No-Recent-Miss filter
  - Avoid replay if there have been no cache line fills (to any address) while the load was in the instruction window
- No-Recent-Snoop filter
  - Avoid replay if there have been no external invalidates (to any address) while the load was in the instruction window
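One way to model the combined filter decision in software (field names are illustrative assumptions; the paper evaluates the snoop and miss filters as alternatives, so checking both here is a conservative combination):

```python
def needs_replay(load):
    """Decide whether a committed load must be replayed.
    A load skips replay only when the filters can prove it safe."""
    if not load["reordered"]:
        # No-Reorder filter: executed in order wrt other memory ops.
        return False
    if load["unresolved_prior_store"]:
        # No-Unresolved-Store-Address filter: a prior store's address
        # was unknown when this load issued.
        return True
    # No-Recent-Miss / No-Recent-Snoop filters: a cache fill or an
    # external invalidate occurred while the load was in the window.
    return load["recent_miss"] or load["recent_snoop"]

safe = {"reordered": True, "unresolved_prior_store": False,
        "recent_miss": False, "recent_snoop": False}
snooped = dict(safe, recent_snoop=True)
```

A load that passes every filter commits without a second cache access, which is how the filters recover most of the replay bandwidth.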
Constraint graph
- Defined for sequential consistency by Landin et al., ISCA-18
- A directed graph represents a multithreaded execution:
  - Nodes represent dynamic instruction instances
  - Edges represent their transitive orders (program order, RAW, WAW, WAR)
- If the constraint graph is acyclic, then the execution is correct
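The acyclicity condition can be checked with ordinary DFS cycle detection (a sketch; in the paper the constraint graph is a correctness argument, not a runtime structure):

```python
def execution_is_correct(num_ops, edges):
    """True iff the constraint graph (nodes = dynamic memory operations,
    edges = program order, RAW, WAW, WAR) contains no cycle."""
    adj = {v: [] for v in range(num_ops)}
    for src, dst in edges:
        adj[src].append(dst)
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on DFS stack / finished
    color = [WHITE] * num_ops

    def acyclic_from(v):
        color[v] = GRAY
        for nxt in adj[v]:
            if color[nxt] == GRAY:   # back edge: cycle found
                return False
            if color[nxt] == WHITE and not acyclic_from(nxt):
                return False
        color[v] = BLACK
        return True

    return all(color[v] != WHITE or acyclic_from(v) for v in range(num_ops))

# Four operations whose ordering edges close a loop: incorrect execution.
cyclic = execution_is_correct(4, [(0, 1), (1, 2), (2, 3), (3, 0)])
# The same operations without the closing edge: correct execution.
acyclic = execution_is_correct(4, [(0, 1), (1, 2), (2, 3)])
```

A GRAY node reached again during the traversal is exactly a cycle in the transitive order, i.e. an execution that no sequential interleaving could produce.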
Constraint graph example - SC

[Figure: four operations across two processors - ST A and LD B on Proc 1, LD A and ST B on Proc 2 - linked by program-order, RAW, and WAR edges. A cycle indicates that the execution is incorrect.]
Anatomy of a cycle

[Figure: the same two-processor example, annotated with the incoming invalidate and the cache miss that allow the cycle to form.]
Enforcing MP consistency
- No-Recent-Miss filter
  - Avoid replay if there have been no cache line fills (to any address) while the load was in the instruction window
- No-Recent-Snoop filter
  - Avoid replay if there have been no external invalidates (to any address) while the load was in the instruction window
Filter summary (conservative to aggressive):
- Replay all committed loads
- No-Reorder filter
- No-Unresolved-Store / No-Recent-Snoop filter
- No-Unresolved-Store / No-Recent-Miss filter
Outline
- Conventional load queue functionality/microarchitecture
- Value-based memory ordering
- Replay-reduction heuristics
- Performance evaluation
Base machine model: PHARMsim
Based on SimpleMP, including a Sun Gigaplane-like snooping coherence protocol [Rajwar], within the SimOS-PPC full-system simulator.

Out-of-order execution core:
- 5 GHz, 15-stage, 8-wide pipeline
- 256-entry reorder buffer, 128-entry load/store queue
- 32-entry issue queue

Functional units (latency):
- 8 Int ALUs (1), 3 Int MULT/DIV (3/12), 4 FP ALUs (4), 4 FP MULT/DIV (4/4)
- 4 L1 dcache load ports in the OoO window
- 1 L1 dcache load/store port at commit

Front end:
- Combined bimodal (16k-entry)/gshare (16k-entry) branch predictor with a 16k-entry selection table, 64-entry RAS, 8k-entry 4-way BTB

Memory system (latency):
- 32KB direct-mapped L1 icache (1), 32KB direct-mapped L1 dcache (1)
- 256KB 8-way L2 (7), 8MB 8-way L3 (15), 64-byte cache lines
- Memory: 400-cycle/100 ns best-case latency, 10 GB/s bandwidth
- Stride-based prefetcher modeled after the Power4
% L1 dcache bandwidth increase

[Chart: bandwidth overhead of (a) replay all, (b) no-reorder filter, (c) no-recent-miss filter, (d) no-recent-snoop filter across SPECint2000, SPECfp2000, commercial, and multiprocessor workloads]

On average: 3.4% bandwidth overhead using the no-recent-snoop filter.
Value-based replay performance (relative to a constrained load queue)

[Chart: SPECint2000, SPECfp2000, commercial, and multiprocessor workloads]

Value-based replay is 8% faster on average than a baseline using a 16-entry load queue.
Value-based replay pros/cons
+ Eliminates associative lookup hardware
  - The load queue becomes a simple FIFO
  - Negligible IPC or L1D bandwidth impact
+ Can be used to fix value prediction
  - Enforces the dependence order consistency constraint [Martin et al., MICRO 2001]
- Requires additional pipeline stages
- Requires an additional cache datapath for loads
The End
Questions?
Backups
Does value locality help?
Not much...
- Value locality does avoid memory ordering violations:
  - 59% of single-thread violations avoided
  - 95% of consistency violations avoided
- But these violations rarely occur:
  - ~1 single-thread violation per 100 million instructions
  - 4 consistency violations per 10,000 instructions
What about power?
- Simple power model:

  Energy = #replays × (E_cache_access + E_word_comparison) + E_replay_overhead - (E_ldq_search × #ldq_searches)

- Empirically: 0.02 replayed loads per committed instruction
- If load queue CAM energy per instruction > 0.02 × the energy of a cache access and comparison, the value-based implementation saves power!
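Plugging numbers into this model shows the comparison directly; every energy value below is an invented placeholder for illustration, not a measurement from the paper:

```python
def value_based_saves_power(replays_per_insn, e_cache_access, e_compare,
                            e_replay_overhead, e_ldq_search,
                            searches_per_insn):
    """Per-committed-instruction energy from the slide's model: the
    value-based scheme wins when replay energy falls below the CAM
    search energy it eliminates."""
    replay_energy = (replays_per_insn * (e_cache_access + e_compare)
                     + e_replay_overhead)
    ldq_energy = e_ldq_search * searches_per_insn
    return replay_energy < ldq_energy

# 0.02 replays/insn is the empirical rate from the slide; the per-event
# energies (nJ) and the 0.4 searches/insn rate are made-up placeholders.
saves = value_based_saves_power(0.02, 0.5, 0.05, 0.01, 1.0, 0.4)
```

With these placeholders, replay costs 0.02 × 0.55 + 0.01 = 0.021 nJ/insn against 0.4 nJ/insn of eliminated CAM searches, so the condition holds.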
Caveat: memory dependence prediction
- Some predictors train using the conflicting store (e.g., the store-set predictor)
- The replay mechanism is unable to pinpoint the conflicting store
- Fair comparison:
  - Baseline machine: store-set predictor with a 4k-entry SSIT and a 128-entry LFST
  - Experimental machine: simple 21264-style dependence predictor with a 4k-entry history table
Load queue search energy

[Chart: access energy (nJ), from 0 to 3.5, vs. number of entries (16 to 512) for rd2wr2, rd4wr4, and rd6wr6 port configurations. Based on 0.09 micron process technology using Cacti v3.2.]
Load queue search latency

[Chart: access latency (ns), from 0 to 1.4, vs. number of entries (16 to 512) for rd2wr2, rd4wr4, and rd6wr6 port configurations. Based on 0.09 micron process technology using Cacti v3.2.]
Benchmarks
- MP (16-way):
  - Commercial workloads (SPECweb, TPC-H)
  - SPLASH2 scientific application (ocean)
  - Error bars signify 95% statistical confidence
- UP:
  - 3 from SPECfp2000, selected due to high reorder buffer utilization: apsi, art, wupwise
  - 3 commercial: SPECjbb2000, TPC-B, TPC-H
  - A few from SPECint2000
Life cycle of a load

[Figure: loads and stores flow through the OoO execution window into the load queue; when a store to address A finds an already-executed load to A in the queue - Blam! - the load is squashed.]
Performance relative to an unconstrained load queue

Good news: replay with the no-recent-snoop filter is only 1% slower on average.
Reorder-Buffer Utilization
Why focus on the load queue?
- The load queue has different constraints than the store queue:
  - More loads than stores (30% vs. 14% of dynamic instructions)
  - The load queue is searched more frequently (consuming more power)
  - Store-forwarding logic is performance critical
- Many non-scalable structures in an OoO processor:
  - Scheduler
  - Physical register file
  - Register map
Prior work: formal memory model representations
- Local, WRT, global "performance" of memory ops (Dubois et al., ISCA-13)
- Acyclic graph representation (Landin et al., ISCA-18)
- Modeling a memory operation as a series of sub-operations (Collier, RAPA)
- Acyclic graph + sub-operations (Adve, thesis)
- Initiation event, for modeling early store-to-load forwarding (Gharachorloo, thesis)