quantifying and comparing the impact of wrong-path memory references in multiple-cmp systems ayse...
Post on 01-Jan-2016
218 Views
Preview:
TRANSCRIPT
Quantifying and Comparing the Impact of Wrong-Path Memory References in Multiple-
CMP Systems
Ayse Yilmazer, University of Rhode Island Resit Sendag, University of Rhode Island
Joshua J. Yi, Freescale Semiconductor, Inc.
Motivation
Previous work on Wrong-path (WP) effects in Uniprocessors Positive Effects: Prefetching
Up to 20% better performance for 181.mcf (SPECint 2000) Negative Effects: Pollution
L1 and L2 cache pollution Extra traffic
Important to simulate WP, especially for some applications
How about WP effects in Multiple-CMP systems?
Outlines Wrong Path Effects in SMPs and multi-CMPs
Simulation Methodology
Evaluation Results
Conclusion
Wrong-path effects in SMPs – 0 / 4 Broadcast (snoop)- and directory-based SMP
systems MSI, MOSI, MESI, MOESI cache coherence protocols
Same issues in uniprocessors apply Pollution effect Prefetching effect Extra cache/memory traffic
In contrast to uniprocessor effects, WP cause: Extra coherence traffic:
data, invalidations, write-backs, acknowledgements Additional cache block state transitions
Wrong-path effects in SMPs – 1 / 4 Replacements
A speculatively replaces B
A is a Wrong-path Block !
Initial States
Block B M Block A I -> M
Processor 0 Processor 1
LRU
1 Intial: P1 writes on block A
Block A I -> S Block A M -> O
Processor 0 Processor 1
2 P0 speculatively reads block A
Block A S Block A O
Processor 0 Processor 1
3 Speculation resolves: Mis-speculation!
Block A S -> I Block A O -> M
Processor 0 Processor 1
4 P1 writes on block A
Wrong-path effects in SMPs – 2 / 4 Write-backs
Block B M Block A I -> M
Processor 0 Processor 1
LRU
1 Intial: P1 writes on block A
Block A I -> S Block A M -> O
Processor 0 Processor 1
2 P0 speculatively reads block A
Block A S Block A O
Processor 0 Processor 1
3 Speculation resolves: Mis-speculation!
Block A S -> I Block A O -> M
Processor 0 Processor 1
4 P1 writes on block A
Write-back dirty copy of B
Write-back dirty copy of AOnly for MESI (or MSI)
M -> S
Wrong-path effects in SMPs – 3 / 4 Invalidations
Block B M Block A I -> M
Processor 0 Processor 1
LRU
1 Intial: P1 writes on block A
Block A I -> S Block A M -> O
Processor 0 Processor 1
2 P0 speculatively reads block A
Block A S Block A O
Processor 0 Processor 1
3 Speculation resolves: Mis-speculation!
Block A S -> I Block A O -> M
Processor 0 Processor 1
4 P1 writes on block A
P1 loses its write privileges for block A
P1 asks for grant to write and sends invalidation
Wrong-path effects in SMPs – 4 / 4 Data/Bus and Coherence Traffic Increases
L1 references, L2 references, coherence traffic
snoop, directory requests for data and invalidations
Power Consumption Increases Due to extra cache references, coherence traffic and
cache block state transitions
Resource Contention Competing with correct-path resources
In contrast to uniprocessors, the increase in the frequency of full service buffers critical when many cache-to-cache transfers
WP effects in Multiple-CMPs – 0 / 2
CMP node and a 4 CMP system We studied inclusive L1 and L2 cache L2 cache also tracks the coherence of cache blocks in L1
Memory Controller
Mem I/F
P
I D
P
I D
P
I D
P
I D
Interconnection network
L2 L2 L2 L2 I/F
(a)
CMP CMP
CMP CMP
Intra-CMP coherence
Inter-CMP coherence
(b)
WP effects in Multiple-CMPs – 1 / 2
OIV SO
OIN I
S
I
Wrong-path Replacement
Place invalidation for sharer
copies
Write-back ACK received from dir
All A
CK
s received from sharers
Write-b
ack cach
e block
L2 CacheL1 Cache
Invalidation received from L
2 cache
Sen
d A
CK
to L2 cach
e
State Transitions when replacement of an SO line in L2 cache
SOOIV
OIN I
S
I
WP effects in Multiple-CMPs – 1 / 2
State Transitions when an MT line in L2 cache receives a WP request
MO MT
SO
M
S
Wrong-path shared copy request from local L1s or external world
Issue downgrade to writing L1 cache
Data and A
CK
received from
writing L
1 cache
Sen
d d
ata to requ
estor
L2 Cache L1 Cache
Dow
ngrade requested from
L2 cache
Sen
d d
ata and
AC
K to
L2 cach
e
MTMO
SO
M
S
Outlines Wrong Path Effects in SMPs and multi-CMPs
Simulation Methodology
Evaluation Results
Conclusion
Experimental Methodology
GEMS simulator – Wisconsin Multifacet Group Based on Virtutech SIMICS Aggressive out-of-order superscalar processor Detailed Shared-Memory Model
We evaluate 16-processor (4 and 8-CMPs) SPARC V9 system running unmodified Solaris 9
Evaluated 2-level MOSI directory coherence protocol MOSI: Modified, Owned, Shared, Invalid
We track the speculatively generated memory references and mark them as being on the wrong-path when the branch
misprediction is known
Experimental MethodologyProcessor Configuration
UltraSPARC III ISA
2 GHz 15-stage pipeline, OoO execution
8-wide dispatch/retirement
256/128-entry ROB/scheduler
10 cycle branch misprediction penalty
gshare branch predictor with 4K PHT
64-entry RAS and RAS exception table
32-entry CAS and CAS exception table
CMP Configuration
4 CMPs
4 processors per CMP
L1 caches: 32 KB, 2-way, 2 cycle latency
L2 cache: 2M, 2-way, 20 cycle latency
128 byte cache block size
32-entry MSHRs
4 GByte per bank
240 cycle DRAM latency
Benchmark Input Data Set fft 64K points
radix 2M integers, radix 1024 ocean 128x128 ocean
Water-spatial 512 molecules em3d 400K nodes, degree 2, span 5,
15% remote
Outlines Wrong Path Effects in SMPs and multi-CMPs
Simulation Methodology
Evaluation Results
Conclusion
Evaluation Results 1 / 5
4 CMPs 8 CMPs
-- L1 and L2 Cache Traffic
• Total memory references increase by 16% and 14% for 4- and 8-CMPs, respectively.
• L2 cache references increase by 35% and 36%, respectively.
• For em3d, the increase in the number of L1 misses increase as much as 70%.
0
10
20
30
40
50
60
70
80
90
100
FFT RADIX OCEAN WATER em3d AVERAGE
L1L2
0
10
20
30
40
50
60
70
80
90
100
FFT RADIX OCEAN WATER em3d AVERAGE
L1L2
Evaluation Results 2 / 5-- Coherence Traffic
• Internal -- 36% External -- 30%
0
10
20
30
40
50
60
70
80
90
100
FFT RADIX OCEAN WATER em3d AVERAGE
Internal Net.External Net.
0
10
20
30
40
50
60
70
80
90
100
FFT RADIX OCEAN WATER em3d AVERAGE
Internal Net.External Net.
4 CMPs 8 CMPs
Evaluation Results 3 / 5-- L1 and L2 cache replacements
• L1 -- 30%, L2 -- 17%
Potential Cache Performance Impact
Type Meaning L1 L2
Used used by a correct-path reference 50% 7%
Unusedevicted before being used or never used by a correct-path
42% 70%
Direct MissReplaces a cache block that is needed by a later correct-path load, and is evicted before being used.
4% 20%
Indirect MissChanges the LRU of a set, which may eventually cause correct-path misses
4% 3%
Evaluation Results 4 / 5-- Write Misses
4 CMPs 8 CMPs
On average 4% On average 7%
0
2
4
6
8
10
FFT RADIX OCEAN WATER em3d AVERAGE
External NetInternal Net
0
2
4
6
8
10
FFT RADIX OCEAN WATER em3d AVERAGE
External NetInternal Net
Evaluation Results 5 / 5-- Cache Line State Transitions
4 CMPs
• Internal: 2% to 13%• External: 1% to 9%
• Internal: 2% to 17%• External: 1% to 10%
0
2
4
6
8
10
12
14
16
18
Inte
rnal
Net
Ext
erna
l Net
Inte
rnal
Net
Ext
erna
l Net
Inte
rnal
Net
Ext
erna
l Net
Inte
rnal
Net
Ext
erna
l Net
Inte
rnal
Net
Ext
erna
l Net
Inte
rnal
Net
Ext
erna
l Net
FFT RADIX OCEAN WATER em3d AVERAGE
O/S -> MM -> 0/S
0
2
4
6
8
10
12
14
16
18
Inte
rnal N
et
Ext
ern
al N
et
Inte
rnal N
et
Ext
ern
al N
et
Inte
rnal N
et
Ext
ern
al N
et
Inte
rnal N
et
Ext
ern
al N
et
Inte
rnal N
et
Ext
ern
al N
et
Inte
rnal N
et
Ext
ern
al N
et
FFT RADIX OCEAN WATER em3d AVERAGE
O/S -> MM -> 0/S
8 CMPs
Outlines Wrong Path Effects in SMPs and multi-CMPs
Simulation Methodology
Evaluation Results
Conclusion
Conclusion It is important to model WP memory references in cache-
coherent multi-CMP systems
For multi-CMPs, not only do the WP affect the performance of individual processors due to prefetching and pollution, they also affect the performance of the entire system by increasing cache coherence transactions cache block state transitions write-backs invalidations resource contention
For a workload with many cache-to-cache transfers, WP can significantly affect coherence actions.
top related