
Cache Coherence Protocols in Multicore Architectures

Shahab Helmi
Supervised by: Professor Gita Alaghband
Feel free to ask questions.

Cache Coherence
[Figure: memory holds blocks 1A 2B 3C 4D 5E. Cache 1 holds 1A 2B 3C, Cache 2 holds 3C 4D 5E, Cache 3 holds 1A 3C 5E, and Cache 4 holds 1A 2B 4D. In the second panel Cache 3 holds 3F, while memory and Caches 1 and 2 still hold 3C.]

Cache Coherence
[Figure: the same system. In the second panel both memory and Cache 3 hold 3F, while Caches 1 and 2 still hold the stale 3C.]

The Baseline System
The goal of a coherence protocol is to maintain coherence by enforcing the SWMR invariant.
Single-Writer, Multiple-Reader (SWMR) invariant: for any memory location A, at any given time, there exists either a single core that may write to A (and may also read it) or some number of cores that may only read it.

[Figure: cache controller. On the core side it receives loads and stores from the core and returns loaded values. On the network side it issues and receives coherence requests and responses over the interconnection network.]
[Figure: memory controller. It sits between memory and the interconnection network, and issues and receives coherence requests and responses.]

Coherence controllers are finite state machines that implement the SWMR invariant. Each coherence controller implements a set of finite state machines, one per block.

Cache controller:
- Core side: interfaces to the processor core. It receives loads and stores from the core and returns values to the core.
- Network side: interfaces to the rest of the system through the interconnection network. On a cache miss, it issues a coherence request for the block. It receives either data in response to its own request or coherence requests from the network.

Memory controller: similar to the cache controller, but it has only the network side.
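The SWMR invariant stated above can be written as a small check. This is only an illustrative sketch: the function name `swmr_holds` and the permission letters "W", "R" and "N" are assumptions made for this example, not part of any protocol definition.

```python
from typing import Dict


def swmr_holds(permissions: Dict[str, str]) -> bool:
    """Check the SWMR invariant for ONE memory location.

    `permissions` maps a core name to a hypothetical permission letter:
    "W" = may write (and read), "R" = may only read, "N" = no access.
    """
    writers = [core for core, p in permissions.items() if p == "W"]
    readers = [core for core, p in permissions.items() if p == "R"]
    if writers:
        # A writer must be alone: exactly one writer and no readers.
        return len(writers) == 1 and not readers
    # With no writer, any number of readers is allowed.
    return True


print(swmr_holds({"core1": "R", "core2": "R", "core3": "N"}))  # True
print(swmr_holds({"core1": "W", "core2": "R"}))                # False
```

A coherence protocol has to keep this predicate true for every block at every point in time; the controllers described above are the hardware that does so.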

Cache/Memory Controllers
[Figure: a core with its cache controller and cache, and a memory controller with memory, each attached to the interconnection network.]

The state of a cache block consists of four main elements:
- Validity: a valid block holds the most up-to-date value for this block. A valid block may be read, but it may be written only if it is also exclusive.
- Dirtiness: a cache block is dirty if its value is the most up-to-date value and this value differs from the value in memory.
- Exclusivity: a cache block is exclusive if it is the only copy of this block among all caches.
- Ownership: a cache controller (or memory controller) is the owner of a block if it is responsible for responding to coherence requests for that block. An owned block cannot be evicted without transferring ownership to another controller. In most protocols, there is exactly one owner for each block.

States
Stable states: most protocols use a subset of the classic five-state MOESI model (pronounced "MO-zee"). Each state is a different combination of the elements described on the previous slide.
- Modified (M): valid, exclusive, owned, and potentially dirty. The block may be read or written. The cache holds the only valid copy of the block and must respond to requests for it. The memory copy of the block is potentially stale.
- Shared (S): valid, not exclusive, not dirty, and not owned. The cache has a read-only copy of the block. Other caches may have valid, read-only copies of the block.
- Invalid (I): the block is invalid. The cache either does not contain the block or holds a stale copy of it. The block may not be read or written.
- Owned (O): valid, owned, and potentially dirty, but not exclusive. The cache has a read-only copy of the block and must respond to requests for it. The memory copy is potentially stale.
- Exclusive (E): valid, exclusive, and not dirty. The cache has a read-only copy of the block. The memory copy of the block is up to date.
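The five stable states can be tabulated as combinations of the four elements. The sketch below is an assumption-laden illustration: the names `BlockState`, `MOESI`, `may_read` and `may_write_without_request` are invented here, ownership of E is left as `None` because the slide does not state it, and the non-validity fields of I are set to `False` by assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BlockState:
    valid: bool
    exclusive: bool
    maybe_dirty: bool
    owned: Optional[bool]  # None = not stated on the slide


# One entry per stable state, following the slide's descriptions.
MOESI = {
    "M": BlockState(valid=True, exclusive=True, maybe_dirty=True, owned=True),
    "O": BlockState(valid=True, exclusive=False, maybe_dirty=True, owned=True),
    "E": BlockState(valid=True, exclusive=True, maybe_dirty=False, owned=None),
    "S": BlockState(valid=True, exclusive=False, maybe_dirty=False, owned=False),
    # Assumption: an invalid block is treated as none of the other things.
    "I": BlockState(valid=False, exclusive=False, maybe_dirty=False, owned=False),
}


def may_read(state: str) -> bool:
    """A valid block may be read."""
    return MOESI[state].valid


def may_write_without_request(state: str) -> bool:
    """A block may be written only if it is valid AND exclusive."""
    s = MOESI[state]
    return s.valid and s.exclusive


for name in "MOESI":
    print(name, may_read(name), may_write_without_request(name))
```

Under this reading, a store hits without a coherence request only in M and E, which matches the rule that a valid block may be written only if it is also exclusive.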

Transient States
Transient states occur during the transition from one stable state to another.
- XY^Z: the block is transitioning from stable state X to stable state Y, and the transition will not complete until an event of type Z occurs.
- IM^D: the block was in state I and will enter state M when data (D) is received.

There are two general approaches to naming the states of blocks in memory. The choice of naming does not affect functionality or performance.
- Cache-centric: the state of a block in memory is an aggregation of the states of that block in the caches. For example, if the block is in state I in all caches, its memory state is I. If one or more copies are in S, the block is in S in memory. If the block is in state M in one cache, it is in M in memory.
- Memory-centric: the state of a block corresponds to the memory controller's permissions to that block. For example, if the block is in I in all caches, its memory state is O, because memory behaves like its owner. If all copies are in S, the memory state is also O. If the block is in M or O in one cache, its memory state is I, since memory holds an invalid copy.
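The two naming conventions can be compared side by side with a small sketch. The function names are hypothetical, only the cases listed above are covered (states M, O, S, I), and treating a mix of S and I copies like the all-S case is an assumption of this example.

```python
from typing import List


def cache_centric(cache_states: List[str]) -> str:
    """Memory state as an aggregation of the cache states (M, S, I only)."""
    if any(s not in ("M", "S", "I") for s in cache_states):
        raise ValueError("only the M/S/I cases from the slide are covered")
    if "M" in cache_states:
        return "M"
    if "S" in cache_states:
        return "S"
    return "I"


def memory_centric(cache_states: List[str]) -> str:
    """Memory state as the memory controller's own permission to the block."""
    if any(s not in ("M", "O", "S", "I") for s in cache_states):
        raise ValueError("only the M/O/S/I cases from the slide are covered")
    if "M" in cache_states or "O" in cache_states:
        return "I"  # some cache owns the block; memory's copy is invalid
    return "O"      # no cache owns the block, so memory acts as the owner


for states in (["I", "I"], ["S", "S"], ["M", "I"]):
    print(states, cache_centric(states), memory_centric(states))
```

Both functions describe the same system; they only label it differently, which is why the choice of naming has no effect on functionality or performance.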

States of Blocks in the Memory
To maintain the state of blocks in caches, the most common approach is to add a few extra bits to each block. For example, MOESI has five stable states, so 3 bits are needed to encode the state.
To maintain the state of blocks in memory, we can use the same approach. Alternatively, the state can be derived with logic gates. For example, a NOR gate can take the OWNED bit of each cache as input: if any input is 1, the gate outputs 0, and the state of the block in memory is I.
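Both ideas above reduce to a line of arithmetic each. The sketch below uses hypothetical helper names; `state_bits` computes the smallest number of bits that can distinguish a given number of states, and `memory_valid_bit` models the NOR gate over the caches' OWNED bits.

```python
from typing import Iterable


def state_bits(num_states: int) -> int:
    """Smallest number of bits that can encode `num_states` distinct states."""
    if num_states < 1:
        raise ValueError("need at least one state")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without floats.
    return (num_states - 1).bit_length()


def memory_valid_bit(owned_bits: Iterable[int]) -> int:
    """NOR of the caches' OWNED bits: 0 means the memory state is I."""
    return 0 if any(owned_bits) else 1


print(state_bits(5))                 # MOESI: 3 bits
print(state_bits(3))                 # MSI: 2 bits
print(memory_valid_bit([0, 1, 0]))   # a cache owns the block: 0 (I)
print(memory_valid_bit([0, 0, 0]))   # no cache owns the block: 1
```

The NOR-gate variant avoids storing state bits alongside every block in memory, at the cost of wiring each cache's OWNED bit to the gate.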

Maintaining the Block State
[Figure: each block is stored as block data followed by state bits: 10011.000 -> I, 11111.001 -> O, 00000.101 -> M. The block's states in caches 1, 2, and 3 are combined to give the state of the block in memory.]

Most protocols have a similar set of transactions, because the basic goals of the coherence controllers are similar.
Transactions are all initiated by cache controllers that are responding to requests from their associated cores.

Transactions
Transaction: goal
- GetShared (GetS): obtain block in Shared (read-only) state.
- GetModified (GetM): obtain block in Modified (read-write) state.
- Upgrade (Upg): upgrade block state from read-only (Shared or Owned) to read-write (Modified). Upg (unlike GetM) does not require data to be sent to the requestor.
- PutShared (PutS): evict block in Shared state.
- PutExclusive (PutE): evict block in Exclusive state.
- PutOwned (PutO): evict block in Owned state.
- PutModified (PutM): evict block in Modified state.

Events
Events are core requests to their cache controllers.
Event: response of cache controller
- Load: if cache hit, respond with data from cache; else initiate GetS transaction.
- Store: if cache hit in state E or M, write data into cache; else initiate GetM or Upg transaction.
- Atomic read-modify-write: if cache hit in state E or M, atomically execute read-modify-write semantics; else initiate GetM or Upg transaction.
- Instruction fetch: if cache hit (in I-cache), respond with instruction from cache; else initiate GetS transaction.
- Read-only prefetch: if cache hit, ignore; else may optionally initiate GetS transaction.
- Read-write prefetch: if cache hit in state M, ignore; else may optionally initiate GetM or Upg transaction.
- Replacement: depending on state of block, initiate PutS, PutE, PutO, or PutM transaction.

The other major design decision in a coherence protocol is what to do when a core writes to a block. There are two options:
- Invalidate protocols: when a core wishes to write to a block, it initiates a coherence transaction to invalidate the copies in all other caches. If other cores later want to read this block, they must issue a new request to obtain a new copy of it.
- Update protocols: when a core wishes to write to a block, it initiates a coherence transaction to update the copies in all other caches to reflect the new value it wrote.
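A toy model makes the difference between the two options concrete. It is a sketch under strong assumptions: each cache is a plain Python dict from address to value, there is no memory or bus model, and the function names are invented for this example.

```python
from typing import Dict

Caches = Dict[str, Dict[int, str]]  # core name -> {address: value}


def write_invalidate(caches: Caches, writer: str, addr: int, value: str) -> int:
    """Write `value`; remove the block from every other cache.

    Returns the number of other caches that had to be invalidated.
    """
    touched = 0
    for core, cache in caches.items():
        if core != writer and addr in cache:
            del cache[addr]  # the other core must re-request the block later
            touched += 1
    caches[writer][addr] = value
    return touched


def write_update(caches: Caches, writer: str, addr: int, value: str) -> int:
    """Write `value`; push the new value into every other cached copy.

    Returns the number of other caches that received the new data.
    """
    touched = 0
    for core, cache in caches.items():
        if core != writer and addr in cache:
            cache[addr] = value  # the message carries the data itself
            touched += 1
    caches[writer][addr] = value
    return touched


inv = {"c1": {3: "C"}, "c2": {3: "C"}, "c3": {3: "C"}}
upd = {"c1": {3: "C"}, "c2": {3: "C"}, "c3": {3: "C"}}
write_invalidate(inv, "c3", 3, "F")
write_update(upd, "c3", 3, "F")
print(inv)  # only c3 still holds block 3
print(upd)  # every cache holds the new value F
```

After the invalidating write, a read by c1 or c2 misses and needs a new request; after the updating write, the same read hits immediately, but every write had to carry data to each sharer.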

Tradeoffs (invalidate vs. update):
- Update protocols reduce read latency.
- They use more bandwidth, since their messages are larger (they carry data as well).

Cache Coherence Protocols
- Snooping protocols
- Directory protocols
- Hybrid (a combination of snooping and directory protocols)

Snooping Protocols
Idea: all coherence controllers observe (snoop) coherence requests in the same order. By requiring that all requests to a given block arrive in order, a snooping system enables the distributed coherence controllers to correctly update the finite state machines that collectively represent a cache block's state.
Traditional snooping protocols broadcast requests to all coherence controllers, including the controller that initiated the request. The coherence requests typically travel on an ordered broadcast network, such as a bus.

Example 1: C1 and C2 observe the requests for block A in the same order.
Time  C1                            C2                            Memory
0     A: I                          A: I                          A: I, owner
1     A: GetM from C1 / M, owner    A: GetM from C1 / I           A: GetM from C1 / M
2     A: GetM from C2 / I           A: GetM from C2 / M, owner    A: GetM from C2 / M

Example 2: C1 and C2 observe the requests in different orders. At time 1 both caches are in M, which violates the SWMR invariant.
Time  C1                            C2                            Memory
0     A: I                          A: I                          A: I, owner
1     A: GetM from C1 / M, owner    A: GetM from C2 / M, owner    A: GetM from C1 / M
2     A: GetM from C2 / I           A: GetM from C1 / I           A: GetM from C2 / M

The Baseline System
[Figure: a multicore processor chip. Each core has a private data (L1) cache with a cache controller. The cache controllers connect through the interconnection network to the last-level cache (LLC) and its LLC/directory controller, which connects to main memory.]

MSI
The baseline MSI snooping protocol implements two atomicity properties:
- Atomic requests: a coherence request is ordered in the same cycle in which it is issued.
- Atomic transactions: a subsequent request for the same block may not appear on the bus until after the first transaction completes (i.e., until after the response has appeared on the bus).

MSI, Cache Controller
Entries are written as action / next state. Core events are Load, Store, and Replacement. Bus events are the controller's own transaction (Own-GetS, Own-GetM, Own-PutM, Data) and other cores' transactions (Other-GetS, Other-GetM, Other-PutM). (A) marks a transition that cannot occur because transactions are atomic. Empty table cells are omitted.

State I
  Load: issue GetS / IS^D
  Store: issue GetM / IM^D
State IS^D
  Load: stall load. Store: stall store. Replacement: stall evict.
  Data: copy data into cache, load hit / S
  Other-GetS, Other-GetM, Other-PutM: (A)
State IM^D
  Load: stall load. Store: stall store. Replacement: stall evict.
  Data: copy data into cache, store hit / M
  Other-GetS, Other-GetM, Other-PutM: (A)
State S
  Load: load hit
  Store: issue GetM / SM^D
  Replacement: - / I
  Other-GetM: - / I
State SM^D
  Load: load hit. Store: stall store. Replacement: stall evict.
  Data: copy data into cache, store hit / M
  Other-GetS, Other-GetM, Other-PutM: (A)
State M
  Load: load hit
  Store: store hit
  Replacement: issue PutM, send data to memory / I
  Other-GetS: send data to requestor and memory / S
  Other-GetM: send data to requestor / I

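MSI, Memory Controller
Entries are written as action / next state. Bus events are GetS, GetM, PutM, and Data from Owner. (A) marks a transition that cannot occur because transactions are atomic. Empty table cells are omitted.

State IorS
  GetS: send data block to requestor / IorS
  GetM: send data block to requestor / M
State IorS^D
  GetS, GetM: (A)
  Data from Owner: update data block in memory / IorS
State M
  GetS: - / IorS^D
  PutM: - / IorS^D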
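The two MSI tables can be exercised with a very small simulator for a single block. This is a sketch, not the protocol specification: because requests and transactions are atomic, each transaction is run to completion in one call, so the transient states (IS^D, IM^D, SM^D, IorS^D) are collapsed away. The class name `MsiBlockSim` and its methods are invented for this example.

```python
from typing import List, Tuple


class MsiBlockSim:
    """Stable-state MSI behaviour of ONE block across `n` caches."""

    def __init__(self, n: int) -> None:
        self.state: List[str] = ["I"] * n       # per-cache stable state
        self.memory: str = "IorS"               # memory controller state
        self.bus_log: List[Tuple[int, str]] = []

    def _snoop(self, requester: int, txn: str) -> None:
        """Apply the 'other cores' transactions' columns to every other cache."""
        for c, s in enumerate(self.state):
            if c == requester:
                continue
            if s == "M":
                # Owner supplies data: GetS -> S (data also goes to memory),
                # GetM -> I.
                self.state[c] = "S" if txn == "GetS" else "I"
            elif s == "S" and txn == "GetM":
                self.state[c] = "I"

    def load(self, c: int) -> str:
        if self.state[c] in ("S", "M"):
            return "hit"
        self.bus_log.append((c, "GetS"))
        self._snoop(c, "GetS")
        self.state[c] = "S"
        self.memory = "IorS"  # memory is up to date once the data arrives
        return "miss"

    def store(self, c: int) -> str:
        if self.state[c] == "M":
            return "hit"
        self.bus_log.append((c, "GetM"))
        self._snoop(c, "GetM")
        self.state[c] = "M"
        self.memory = "M"
        return "miss"

    def evict(self, c: int) -> None:
        if self.state[c] == "M":
            self.bus_log.append((c, "PutM"))  # write the data back to memory
            self.memory = "IorS"
        self.state[c] = "I"                   # S is dropped silently


sim = MsiBlockSim(2)
sim.load(0)
sim.load(1)
sim.store(0)
print(sim.state, sim.memory)  # ['M', 'I'] M
sim.load(1)
print(sim.state, sim.memory)  # ['S', 'S'] IorS
```

Note that the store by cache 0 needs a GetM even though it already holds the block in S; this is the cost that the Exclusive state is meant to remove in the read-then-write case.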

MSI, Advantages/Disadvantages
Advantages:
- Small table and few possible states.
- Easy to understand and implement.
- Multiple copies of the same block can be available because of the Shared state.
Disadvantages:
- Many impossible transitions due to the atomic-transactions property, and many stalls: lower throughput and higher latency.
- Unnecessary broadcast of invalidate messages: a core that wants to write to a block must obtain the block in state M and send an invalidate message to all other cores, whether or not it already holds the only copy of that block.
Tradeoff: should a block downgrade from M to S or to I? This requires predicting whether the block will be used again.

The MESI protocol presented here implements atomic transactions and non-atomic requests.
The Exclusive state is used in almost all commercial coherence protocols because it optimizes a common case: a core first reads a block and then writes it. In MSI, the core must issue a GetS message to obtain read permission (on a cache miss) and then a GetM message to obtain write permission.
In MESI, the core can obtain the block in the Exclusive state when no other cache holds a copy. The core then does not need to issue a GetM message before writing.
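The read-then-write case can be counted directly. The sketch below assumes a single core accessing a block that no other cache holds, so a load miss in MESI always returns the block in E; the function name `bus_requests` is hypothetical.

```python
from typing import Iterable


def bus_requests(protocol: str, events: Iterable[str]) -> int:
    """Count coherence requests issued by one core for one block.

    Assumption: no other cache ever holds the block, so in MESI a load
    miss installs the block in E, and in MSI it installs it in S.
    """
    if protocol not in ("MSI", "MESI"):
        raise ValueError("protocol must be 'MSI' or 'MESI'")
    state = "I"
    requests = 0
    for event in events:
        if event == "load":
            if state == "I":
                requests += 1  # GetS
                state = "E" if protocol == "MESI" else "S"
        elif event == "store":
            if state == "E":
                state = "M"    # silent upgrade, nothing on the bus
            elif state in ("I", "S"):
                requests += 1  # GetM
                state = "M"
        else:
            raise ValueError(f"unknown event: {event}")
    return requests


print(bus_requests("MSI", ["load", "store"]))   # 2
print(bus_requests("MESI", ["load", "store"]))  # 1
```

The saving only appears when the block is unshared at the time of the load; if another cache holds a copy, the MESI load returns the block in S and the later store still needs a GetM.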

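MESI, Cache Controller
Entries are written as action / next state. Core events are Load, Store, and Replacement (Repl.). Bus events are the controller's own transactions (Own-GetS, Own-GetM, Own-PutM), other cores' transactions (Other-GetS, Other-GetM, Other-PutM), and two data responses (Data and Data (exclusive)). R is the requestor and M is memory in "data to R & M". (A) marks a transition that cannot occur because transactions are atomic. "-" means no action. Empty table cells are omitted.

State I
  Load: GetS / IS^AD. Store: GetM / IM^AD.
  Other-GetS, Other-GetM, Other-PutM: -
State IS^AD
  Load, Store, Repl.: stall
  Own-GetS: - / IS^D
  Other-GetS, Other-GetM, Other-PutM: -
State IS^D
  Load, Store, Repl.: stall
  Other-GetS, Other-GetM, Other-PutM: (A)
  Data: - / S. Data (exclusive): - / E.
State IM^AD
  Load, Store, Repl.: stall
  Own-GetM: - / IM^D
  Other-GetS, Other-GetM, Other-PutM: -
State IM^D
  Load, Store, Repl.: stall
  Other-GetS, Other-GetM, Other-PutM: (A)
  Data: - / M
State S
  Load: hit. Store: GetM / SM^AD. Repl.: - / I.
  Other-GetS: -. Other-GetM: - / I. Other-PutM: -.
State SM^AD
  Load: hit. Store: stall. Repl.: stall.
  Own-GetM: - / SM^D
  Other-GetS: -. Other-GetM: - / IM^AD. Other-PutM: -.
State SM^D
  Load: hit. Store: stall. Repl.: stall.
  Other-GetS, Other-GetM, Other-PutM: (A)
  Data: - / M
State E
  Load: hit. Store: hit / M. Repl.: PutM / EI^A.
  Other-GetS: data to R & M / S. Other-GetM: data to R / I. Other-PutM: -.
State M
  Load: hit. Store: hit. Repl.: PutM / MI^A.
  Other-GetS: data to R & M / S. Other-GetM: data to R / I. Other-PutM: -.
State MI^A
  Load: hit. Store: hit. Repl.: stall.
  Own-PutM: data to M / I
  Other-GetS: data to M & R / II^A. Other-GetM: data to R / II^A. Other-PutM: -.
State EI^A
  Load: hit. Store: stall. Repl.: stall.
  Own-PutM: - / I
  Other-GetS: data to M & R / II^A. Other-GetM: data to R / II^A. Other-PutM: -.
State II^A
  Load, Store, Repl.: stall
  Own-PutM: - / I
  Other-GetS, Other-GetM, Other-PutM: -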

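MESI, Memory Controller
Entries are written as action / next state. Bus events are GetS, GetM, PutM, Data, NoData, and NoData-E. R is the requestor and M is memory. (A) marks a transition that cannot occur because transactions are atomic. "-" means no action.

State I
  GetS: data to R / EorM. GetM: data to R / EorM. PutM: - / I^D.
State S
  GetS: data to R (memory stays in S). GetM: data to R / EorM. PutM: - / S^D.
State EorM
  GetS: - / S^D. GetM: -. PutM: - / EorM^D.
State I^D
  GetS, GetM, PutM: (A)
  Data: write data to M / I. NoData: - / I. NoData-E: - / I.
State S^D
  GetS, GetM, PutM: (A)
  Data: write data to M / S. NoData: - / S. NoData-E: - / S.
State EorM^D
  GetS, GetM, PutM: (A)
  Data: write data to M / I. NoData: - / EorM. NoData-E: - / I.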

MESI, Advantages/Disadvantages
Advantages:
- Silent transition from the Exclusive state to the Modified/Shared state. No unnecessary invalidate messages are issued.
- A read followed by a write needs only one request.
- Fewer messages.
- Less traffic on the bus and lower bandwidth usage.

Disadvantages:
- Extra hardware is needed to implement the Exclusive state.

Case Studies: Sun Starfire E10000
- Uses MOESI.
- Non-atomic requests and transactions.
- Supports up to 64-bit processors.
- Wired snooping buses consume a lot of energy and therefore do not scale to a large number of cores. To solve this problem, the E10000 uses point-to-point links instead.
- Uses a separate bus for sending out-of-order data response messages.

Evaluation
Benchmarks:
- SPLASH-2 (2007): implements 8 applications, including:
  - LU: dense matrix manipulation.
  - OCEAN: large-scale ocean movements.
  - Cholesky: sparse matrix manipulation.
  - Radix: radix-based integer sorting.
- SPECjbb: benchmark for measuring the performance of Java servers and applications.
- PARSEC: benchmark suite for shared-memory, multithreaded programs.
Metrics:
- Processor utilization
- Bus utilization
- Number of accesses to physical memory

Evaluation, Effects of Cache Size
Benchmark suite: Splash-2
Simulator: gem5, SE mode
Hardware: four CPUs. Each CPU has a private, 4-way set-associative L1 cache of 32 KB. The default cache line size is 64 bytes, which we configure for our experiment.

L1 cache size (KB)    Write-backs/memory references
16                    17300
32                    12672
64                    5251
128                   0

L1 block size (bytes) Write-backs/memory references
16                    11214
32                    12350
64                    12672
128                   13001

[Charts: write-backs vs. L1 cache size (KB); write-backs vs. L1 block size (bytes).]

Evaluations, # of Cores, Energy Consumption
Benchmark suite: PARSEC
Benchmark applications: blackscholes, bodytrack, canneal, facesim, fluidanimate, freqmine, raytrace, and swaptions.
Protocols: MESI, MOSI, and MOESI (compared to MSI).

Number of broadcasts and write-backs: across all benchmarks and input sizes, MESI and MOESI reduce the number of broadcasts by 7% on average. MOSI and MOESI reduce the number of write-backs by 5% on average.

Evaluation, # of Cores, Energy Consumption (contd)
LLC power consumption: since MOSI and MOESI substantially reduce the number of write-backs for these workloads, they reduce the energy consumption of the LLC by 4% on average.
Sensitivity to the number of cores: as the number of cores grows, MOSI and MOESI show only slightly increasing benefits in write-back traffic reduction compared to MSI and MESI.

Evaluations, Bus Traffic
Benchmark suite: Splash-2
Benchmark applications: Barnes-Hut, LU, OCEAN, Radiosity, Radix, Raytrace
Protocols: MESI and MSI
Hardware: ?

Evaluations, # of Invalidate Messages
Protocols: MSI, MESI, MOSI, and MOESI

Hardware
Splash-2 inputs and applications
