different directions in solving cache coherence problems

A survey of different techniques for solving cache coherence problems

Zahid Iqbal

National University of Computing and Emerging Sciences

Islamabad, Pakistan 44000

[email protected]

Muhammed Arslan

National University of Computing and Emerging Sciences


[email protected]

Usman Bashir

National University of Computing and Emerging Sciences


[email protected]

Abstract:

Cache coherence is now a well-developed research field. Much work has been done, but cache coherence protocols still need further refinement and optimization. In this paper, we analyze different techniques and methods for solving cache coherence problems. We state the pros and cons of each method discussed so that it is easier to judge which method is more appropriate under specific conditions.

Introduction:

A multi-core processor is an integrated circuit that consists of two or more individual processors (cores), each with its own cache. Multi-core systems address the heat and power problems of single-core systems. Because each processor has its own cache, cache coherence problems arise. A number of software-based, hardware-based, and hybrid techniques have been introduced to solve them.

Different authors have proposed different techniques for solving cache coherence problems. Huang Yongqin, Yuan Aidong, Li Jun, and Hu Xiangdong studied various cache coherence protocols, such as those of the Piranha [1] prototype system, GS320 [2], and AMD64 [3], and proposed NB2CC, a directory-based cache coherence protocol that divides serial processing into two stages: conflict detection and conflict solution.

Jing-Mei Li, Wen-Jia Liu, and Ping Jiao aim to reduce bus transactions, which improves the efficiency of processor data access.

Youhui Zhang, Ziqiang Qian, and Weimin Zheng propose a new software/hardware hybrid cache coherence optimization to reduce the overhead of broadcasting.

TECHNIQUES:

1. A Novel Directory-Based Non-Busy, Non-Blocking Cache Coherence

Directory-based cache coherence protocols are used in mainstream systems because of their good scalability.

In general, there are two ways to solve the cache coherence problem: direction and indirection.

According to M. M. K. Martin [4], directory-based protocols introduce a level of indirection to obtain scalability at the cost of increased sharing-miss latency.

Token counting directly ensures coherence safety: to determine whether the current block access is legal, processors pass tokens and apply rules based on the number of tokens they hold.
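
The token-counting rule can be sketched as follows. This is a minimal illustration, not code from the surveyed papers: it assumes the usual token-coherence convention that each block has a fixed number of tokens, that a reader must hold at least one token, and that a writer must hold all of them. The names `TokenBlock`, `TOTAL_TOKENS`, `transfer`, `can_read`, and `can_write` are hypothetical.

```python
TOTAL_TOKENS = 4  # assumption: one token per processor in a 4-core system


class TokenBlock:
    """Token holdings for a single cache block (illustrative only)."""

    def __init__(self, total=TOTAL_TOKENS):
        self.total = total
        # Initially memory (holder "mem") owns every token.
        self.tokens = {"mem": total}

    def held(self, holder):
        return self.tokens.get(holder, 0)

    def transfer(self, src, dst, count):
        """Move `count` tokens from src to dst; tokens are never created or lost."""
        if count <= 0 or self.held(src) < count:
            raise ValueError("source does not hold enough tokens")
        self.tokens[src] = self.held(src) - count
        self.tokens[dst] = self.held(dst) + count

    def can_read(self, holder):
        # Assumed rule: reading is legal with at least one token.
        return self.held(holder) >= 1

    def can_write(self, holder):
        # Assumed rule: writing requires every token, so no reader can coexist.
        return self.held(holder) == self.total


block = TokenBlock()
block.transfer("mem", "P0", 1)
block.transfer("mem", "P1", 1)
print(block.can_read("P0"), block.can_write("P0"))
```

Because the total number of tokens is conserved, at most one holder can ever satisfy the write condition, and it can do so only when no other holder satisfies the read condition.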

PATCH [] combines token counting with a standard directory protocol to support direct requests and destination-set prediction without requiring a non-scalable interconnect.

Direct protocols are rarely used in actual systems because of their high overhead and complex processing.

The authors therefore focus on indirect protocols. The home node forwards a local node's request to the owner when the home does not hold the newest data.

Serial processing has two steps: conflict detection and conflict solution. In directory protocols, conflicts are detected at the home node. Existing systems differ in where the conflict is then resolved:

SGI Origin [] resolves conflicts at the home node: the first request is admitted for execution, and the state stays busy until its execution completes. GS320 [] implements a global switch, so conflicts are usually resolved at a hot point. Piranha resolves conflicts at the end node, called the owner. Piranha also introduces other innovative techniques, such as clean exclusive, but its deadlock solution relies on buffering, and in GS320 the global switch limits applicability.

NB2CC distributes conflict solution to the owners of the newest data, taking advantage of modern processors. It avoids the negative-acknowledgement (NAK) technique at very small cost and avoids unnecessary ordering requirements to achieve more concurrency with less overhead.

Methodology

The authors' goal is to reduce complexity and make the protocol easy to understand. For convenience of description, they define several sets:

Definition 1: Request = {R1, R2, …, Rn-1}

Request is the set of requests to the shared address X.

Definition 2: Home = {H1, H2, …, Hn-1}

Home is the ordering in which the requests in set Request are processed by the home directory.

If Hi > Hj, then Ri is processed by the home directory before Rj.

Definition 3: Local = {L1, L2, …, Ln-1}

Local is the ordering in which the requests in set Request are satisfied by owners. If Li > Lj, then Ri is satisfied by the owner before Rj.

The premises and the deduction are as follows:

Premise 1: For any requests Ri and Ri+1, Hi > Hi+1.

Premise 2: If Hi > Hi+1, then Li > Li+1.

Deduction 1: For any requests Ri and Rj, if Hi > Hj, then Li > Lj.

NB2CC inherits characteristics of traditional protocols, such as the relaxed memory model, the method of avoiding protocol deadlock, and the basic processing of request races.

Summary

NB2CC is designed for highly concurrent, pipelined systems and solves the multi-cache coherence problem in a software-transparent way. Only requests for the same block need to be delayed, rather than blocking the request at the head of the queue. [1]

NB2CC is balanced and has no hot point, and the lack of NAKs/retries makes it a more efficient protocol. Because the owner node is guaranteed to service a forwarded request, the protocol can complete all directory state changes immediately, with no need for confirmation.

A conflict is detected at the home node when, according to the directory state, the home node cannot supply the newest data to the local node. The home then forwards the request to the owner without any temporary directory state. COH_ACK1 is sent from the home to help the requestor distinguish the order in which its own requests and those of other requestors reached the home directory. This lets the home process requests from different requestors in a pipelined way. NB2CC is an ordering-point protocol and therefore achieves high concurrency.
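
The non-busy forwarding idea summarized above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the NB2CC specification: it handles only exclusive requests, ignores data transfer and COH_ACK1, and uses hypothetical names (`HomeDirectory`, `handle_request`, `forwarded`). The home updates its owner record at once and forwards each request to the previous owner, so it never enters a busy state and never sends a NAK.

```python
from collections import deque


class HomeDirectory:
    """Per-block owner table with no busy/transient state (illustrative)."""

    def __init__(self):
        self.owner = {}           # block address -> current owner id
        self.forwarded = deque()  # log of (block, requestor, supplier)

    def handle_request(self, block, requestor):
        # Assumption: memory supplies the data when no cache owns the block.
        previous = self.owner.get(block, "memory")
        # The directory change completes immediately; no confirmation is awaited.
        self.owner[block] = requestor
        # The previous owner is responsible for supplying the newest data.
        self.forwarded.append((block, requestor, previous))
        return previous


home = HomeDirectory()
for cpu in ["P0", "P1", "P2"]:
    home.handle_request("X", cpu)
home.handle_request("Y", "P3")  # a request for another block is not delayed by X

for block, requestor, supplier in home.forwarded:
    print(f"{block}: {supplier} -> {requestor}")
```

In this sketch each requestor is served by the requestor ordered just before it at the home, which is one way to see how the home ordering (Definition 2) carries over to the local ordering (Definition 3).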

2. A New Kind of Cache Coherence Protocol with SC Cache for Multiprocessor

A multi-core processor is a processing system composed of two or more independent cores integrated onto a single integrated-circuit die, known as a chip multiprocessor (CMP). Compared with a single-core processor, a CMP has simpler control logic, higher frequency, and lower communication latency, and it is widely used in image processing and networking. A CMP architecture requires a multi-level cache within a hierarchical storage structure. In this paper, the authors add an SC-Cache to the classic CMP, located between the private caches and the bus, and use a combination of write-through and write-back. The protocol has four states: PI, PE, PD, and SS. The first three states apply to blocks in a local or remote cache, and the last occurs only in the SC-Cache. A write to a shared block updates the block and, at the same time, writes it back to memory without invalidating the other caches.

The processor first accesses the SC-Cache. If a miss occurs, the cache controller broadcasts the corresponding request to the remote caches on the bus.

If a remote cache holds the copy, the block state changes from PD to SS, a write-through updates the main memory copy, and the remote cache's copy is made invalid.

If no remote cache holds the requested copy, it is fetched from main memory and transferred to the local cache.

Write hit: the local cache block state is PD, or, if it is PE, the state changes to PD after the copy is updated. If the copy hits in the SC-Cache, the shared copy is updated.

Write miss: the requested copy is in neither the local cache nor the SC-Cache. If a remote cache holds it in state PD or PE, the request is put on the bus; the remote cache supplies the copy, the state changes to SS by write-through, and the remote copy is made invalid.

If the requested copy does not exist in any remote cache, it must be in main memory, and the requested copy is then sent to the SC-Cache.
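
The transitions described above can be sketched as a small state function. This is a simplified reading of the text, not the authors' implementation: it covers only write hits and misses served by a remote cache, it assumes that an invalidated remote copy is recorded as PI, and the names `State`, `write_hit`, and `miss_with_remote_copy` are hypothetical.

```python
from enum import Enum


class State(Enum):
    PI = "PI"
    PE = "PE"
    PD = "PD"
    SS = "SS"  # occurs only in the SC-Cache


def write_hit(state):
    """Next state of a block after a write hit."""
    if state in (State.PE, State.PD):
        return State.PD  # the private copy becomes (or stays) PD
    if state is State.SS:
        return State.SS  # shared copy updated in the SC-Cache, written through
    raise ValueError("a write hit cannot occur on an invalid block")


def miss_with_remote_copy(remote_state):
    """Return (sc_cache_state, new_remote_state) when a remote cache supplies the block."""
    if remote_state not in (State.PE, State.PD):
        raise ValueError("remote cache holds no valid private copy")
    # The block moves to the SC-Cache as SS, memory is updated by
    # write-through, and the remote copy is invalidated (assumed to mean PI).
    return State.SS, State.PI


print(write_hit(State.PE).value)
print([s.value for s in miss_with_remote_copy(State.PD)])
```

Keeping SS out of the private caches means a write to shared data touches only the SC-Cache and memory, which matches the stated goal of updating a shared block without invalidating other caches.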

CSC improves performance over Dragon and MESI, decreasing total execution time by 6% and 8%.

MESI<Dragon<CSC

10% < 7 %< CSC

Future Work

Storage performance improved, but cache misses were not clearly reduced. With further study, the CMP should make a great contribution to future research on high-performance processors.

References:

[1]. Luiz André Barroso, Kourosh Gharachorloo, Robert McNamara, Andreas Nowatzyk, Shaz Qadeer, Barton Sano, Scott Smith, Robert Stets, and Ben Verghese, “Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing”, In Proceedings of the 27th Annual International Symposium on Computer Architecture, June 2000.

[2]. Kourosh Gharachorloo, Madhu Sharma, Simon Steely, and Stephen Van Doren, “Architecture and Design of AlphaServer GS320”, ASPLOS, 2000.

[3]. AMD64 Architecture Programmer's Manual Vol 2, “System Programming”, Advanced Micro Devices, 2007.

[4]. P. Bannon, “Alpha 21364: A Scalable Single-Chip SMP”, In Microprocessor Forum '98, October 1998.
