quantifying and comparing the impact of wrong-path memory references in multiple-cmp systems ayse...

Quantifying and Comparing the Impact of Wrong-Path Memory References in Multiple-

CMP Systems

Ayse Yilmazer, University of Rhode Island Resit Sendag, University of Rhode Island

Joshua J. Yi, Freescale Semiconductor, Inc.

Motivation

Previous work on Wrong-path (WP) effects in Uniprocessors Positive Effects: Prefetching

Up to 20% better performance for 181.mcf (SPECint 2000) Negative Effects: Pollution

L1 and L2 cache pollution Extra traffic

Important to simulate WP, especially for some applications

How about WP effects in Multiple-CMP systems?

Outlines Wrong Path Effects in SMPs and multi-CMPs

Simulation Methodology

Evaluation Results

Conclusion

Wrong-path effects in SMPs – 0 / 4 Broadcast (snoop)- and directory-based SMP

systems MSI, MOSI, MESI, MOESI cache coherence protocols

Same issues in uniprocessors apply Pollution effect Prefetching effect Extra cache/memory traffic

In contrast to uniprocessor effects, WP cause: Extra coherence traffic:

data, invalidations, write-backs, acknowledgements Additional cache block state transitions

Wrong-path effects in SMPs – 1 / 4 Replacements

A speculatively replaces B

A is a Wrong-path Block !

Initial States

Block B M Block A I -> M

Processor 0 Processor 1

1 Intial: P1 writes on block A

Block A I -> S Block A M -> O

2 P0 speculatively reads block A

Block A S Block A O

3 Speculation resolves: Mis-speculation!

Block A S -> I Block A O -> M

4 P1 writes on block A

Wrong-path effects in SMPs – 2 / 4 Write-backs

Block A S Block A O

Write-back dirty copy of B

Write-back dirty copy of AOnly for MESI (or MSI)

M -> S

Wrong-path effects in SMPs – 3 / 4 Invalidations

Block A S Block A O

P1 loses its write privileges for block A

P1 asks for grant to write and sends invalidation

Wrong-path effects in SMPs – 4 / 4 Data/Bus and Coherence Traffic Increases

L1 references, L2 references, coherence traffic

snoop, directory requests for data and invalidations

Power Consumption Increases Due to extra cache references, coherence traffic and

cache block state transitions

Resource Contention Competing with correct-path resources

In contrast to uniprocessors, the increase in the frequency of full service buffers critical when many cache-to-cache transfers

WP effects in Multiple-CMPs – 0 / 2

CMP node and a 4 CMP system We studied inclusive L1 and L2 cache L2 cache also tracks the coherence of cache blocks in L1

Memory Controller

Mem I/F

Interconnection network

L2 L2 L2 L2 I/F

CMP CMP

Intra-CMP coherence

Inter-CMP coherence

OIV SO

Wrong-path Replacement

Place invalidation for sharer

copies

Write-back ACK received from dir

s received from sharers

Write-b

ack cach

e block

L2 CacheL1 Cache

Invalidation received from L

2 cache

to L2 cach

State Transitions when replacement of an SO line in L2 cache

State Transitions when an MT line in L2 cache receives a WP request

Wrong-path shared copy request from local L1s or external world

Issue downgrade to writing L1 cache

Data and A

received from

writing L

1 cache

ata to requ

L2 Cache L1 Cache

ngrade requested from

L2 cache

ata and

L2 cach

Evaluation Results

Conclusion

Experimental Methodology

GEMS simulator – Wisconsin Multifacet Group Based on Virtutech SIMICS Aggressive out-of-order superscalar processor Detailed Shared-Memory Model

We evaluate 16-processor (4 and 8-CMPs) SPARC V9 system running unmodified Solaris 9

Evaluated 2-level MOSI directory coherence protocol MOSI: Modified, Owned, Shared, Invalid

We track the speculatively generated memory references and mark them as being on the wrong-path when the branch

misprediction is known

Experimental MethodologyProcessor Configuration

UltraSPARC III ISA

2 GHz 15-stage pipeline, OoO execution

8-wide dispatch/retirement

256/128-entry ROB/scheduler

10 cycle branch misprediction penalty

gshare branch predictor with 4K PHT

64-entry RAS and RAS exception table

32-entry CAS and CAS exception table

CMP Configuration

4 CMPs

4 processors per CMP

L1 caches: 32 KB, 2-way, 2 cycle latency

L2 cache: 2M, 2-way, 20 cycle latency

128 byte cache block size

32-entry MSHRs

4 GByte per bank

240 cycle DRAM latency

Benchmark Input Data Set fft 64K points

radix 2M integers, radix 1024 ocean 128x128 ocean

Water-spatial 512 molecules em3d 400K nodes, degree 2, span 5,

15% remote

Evaluation Results

Conclusion

Evaluation Results 1 / 5

4 CMPs 8 CMPs

-- L1 and L2 Cache Traffic

• Total memory references increase by 16% and 14% for 4- and 8-CMPs, respectively.

• L2 cache references increase by 35% and 36%, respectively.

• For em3d, the increase in the number of L1 misses increase as much as 70%.

FFT RADIX OCEAN WATER em3d AVERAGE

Evaluation Results 2 / 5-- Coherence Traffic

• Internal -- 36% External -- 30%

Internal Net.External Net.

4 CMPs 8 CMPs

Evaluation Results 3 / 5-- L1 and L2 cache replacements

• L1 -- 30%, L2 -- 17%

Potential Cache Performance Impact

Type Meaning L1 L2

Used used by a correct-path reference 50% 7%

Unusedevicted before being used or never used by a correct-path

42% 70%

Direct MissReplaces a cache block that is needed by a later correct-path load, and is evicted before being used.

4% 20%

Indirect MissChanges the LRU of a set, which may eventually cause correct-path misses

Evaluation Results 4 / 5-- Write Misses

4 CMPs 8 CMPs

On average 4% On average 7%

External NetInternal Net

Evaluation Results 5 / 5-- Cache Line State Transitions

4 CMPs

• Internal: 2% to 13%• External: 1% to 9%

• Internal: 2% to 17%• External: 1% to 10%

O/S -> MM -> 0/S

rnal N

O/S -> MM -> 0/S

8 CMPs

Evaluation Results

Conclusion

Conclusion It is important to model WP memory references in cache-

coherent multi-CMP systems

For multi-CMPs, not only do the WP affect the performance of individual processors due to prefetching and pollution, they also affect the performance of the entire system by increasing cache coherence transactions cache block state transitions write-backs invalidations resource contention

For a workload with many cache-to-cache transfers, WP can significantly affect coherence actions.

The End

Thank You !

quantifying and comparing the impact of wrong-path memory references in multiple-cmp systems ayse...

block awrongpath effects

cache transferswp effects

outlineswrong path effects

text block

uniprocessor effects

wrongpath wp effects

negative effects

uniprocessorspositive

Documents

enhancing a gcse maths resit using a problem-solving

applied population ecology resit akcakaya

case study on water supply and sanitation in uzbekistan ayse...

ienac15/ econometrics 2 final examination (resit)

ayse gül altınay by hambersoum aghbashian

methodological approaches resit 2 2015 2014-06ylÃ³pez_ma

evaluation resit word

maths gcse resit support project - sussex learning...

addictive substances psychiatrist doctor ayse ozkiris

english workbook yuksel goknel & ayse goknel 1976

tem info sheet the pitfalls of a resit and...tips for a...

2014 english resit

m16mkt resit cw - nick stamelos.docx

inclass resit solutions (1).pdf

cemal resit rey turk muzik egitimine ve cagdas turk muzigine...

destinations erasmus 2005-2006 - fea-rp/usp · resit...

ayse second concert violin i

foundation portfolion evaluation resit

resit on work

ayse kucuk yilmaz chunlan qiu & yonglin xiao