ENHANCEMENT OF CACHE PERFORMANCE IN
MULTI-CORE PROCESSORS
A THESIS
Submitted by
MUTHUKUMAR.S
Under the Guidance of
Dr. P.K. JAWAHAR
in partial fulfillment for the award of the degree of
DOCTOR OF PHILOSOPHY
in
ELECTRONICS AND COMMUNICATION ENGINEERING
B.S.ABDUR RAHMAN UNIVERSITY
(B.S. ABDUR RAHMAN INSTITUTE OF SCIENCE & TECHNOLOGY)
(Estd. u/s 3 of the UGC Act. 1956)
www.bsauniv.ac.in
November 2014
B.S.ABDUR RAHMAN UNIVERSITY
(B.S. ABDUR RAHMAN INSTITUTE OF SCIENCE & TECHNOLOGY)
(Estd. u/s 3 of the UGC Act. 1956)
www.bsauniv.ac.in
BONAFIDE CERTIFICATE
Certified that this thesis ENHANCEMENT OF CACHE PERFORMANCE
IN MULTI-CORE PROCESSORS is the bonafide work of MUTHUKUMAR S.
(RRN: 1184211) who carried out the thesis work under my supervision. Certified
further, that to the best of my knowledge the work reported herein does not form
part of any other thesis or dissertation on the basis of which a degree or award
was conferred on an earlier occasion on this or any other candidate.
SIGNATURE
Dr. P.K.JAWAHAR
RESEARCH SUPERVISOR
Professor and Head
Department of ECE
B.S. Abdur Rahman University
Vandalur, Chennai – 600 048
ACKNOWLEDGEMENT
First and foremost, I would forever be thankful to the almighty for having
bestowed me with this life and knowledge. Next, I would like to express my
sincere gratitude to my supervisor Professor Dr. P.K. Jawahar for his
continuous support during my PhD study and research, for his patience,
motivation, enthusiasm, and immense knowledge. He gave me the freedom to
do the research in my chosen area, as a result of which, I felt more encouraged
and motivated to achieve my objectives. Besides my advisor, I would also like to
thank the doctoral committee members, Dr. P. Rangarajan and Dr. T.R.
Rangaswamy for their valuable suggestions, encouragement and insightful
comments.
My sincere thanks to Dr. Kaja Mohideen, Dean, School of Electrical and
Communication Sciences, for having provided me with this opportunity.
I would like to thank Professor Dr. V. Sankaranarayanan for his advice and
valuable suggestions throughout my research.
I would also like to thank Dr. M. Arunachalam, Secretary, Sri Venkateswara
College of Engineering for his encouragement and support.
Last but not least, I would like to thank my family members for their immense
support. Special thanks go to my wife Mrs. M. Shanthi, my son Mr. M.
Subramanian and my daughter M. Uma for their patience and moral support
that kept me going throughout the entire course.
I am also indebted to my parents who have stimulated my research interests
from the very early stage of my life. I sincerely hope that I can make my family
members proud.
ABSTRACT
The cache replacement policy is a main deciding factor of system performance
and efficiency in Chip Multi-core Processors (CMPs). It is imperative for any
level of cache memory in a multi-core architecture to have a well-defined,
dynamic replacement algorithm in place to ensure consistently high
performance. Many existing cache replacement policies, such as the Least
Recently Used (LRU) policy, have proved to work well in the shared Level 2
(L2) cache for most of the data-set patterns generated by single-threaded
applications. But when it comes to parallel multi-threaded applications that
generate differing patterns of workload at different intervals, the conventional
replacement policies can prove sub-optimal, as such workloads do not
generally abide by the spatial and temporal locality principles and the policies
do not adapt dynamically to the changes in the workload.
This dissertation proposes three novel cache replacement policies for the L2
cache that are targeted towards such applications. The Context Based Data
Pattern Exploitation technique assigns a counter to every block of the L2 cache
to monitor the data access patterns of various threads, and modifies the counter
values appropriately to maximize performance. The Logical Cache Partitioning
technique logically partitions the cache elements into four zones based on their
'likeliness' to be referenced by the processor in the near future; replacement
candidates are chosen from the zones in ascending order of their 'likeliness
factor' to minimize the number of cache misses. The Sharing and Hit based
Prioritizing replacement technique takes the sharing status of the data elements
into consideration while making replacement decisions; this information is
combined with the number of hits received by the element to arrive at a priority
on which the replacement decisions are based.
The proposed techniques are effective at improving cache performance and
offer an improvement of up to 9% in overall hits at the L2 cache level and an IPC
(Instructions Per Cycle) speedup of up to 1.08 times that of LRU for a wide
range of multithreaded benchmarks.
TABLE OF CONTENTS

CHAPTER NO.  TITLE  PAGE NO.

ABSTRACT  iv
LIST OF TABLES  viii
LIST OF FIGURES  ix
LIST OF SYMBOLS AND ABBREVIATIONS  xi

1. INTRODUCTION  1
   1.1 Chip Multi-core Processors  2
   1.2 Memory Management in CMP  3
   1.3 Overview of Existing Replacement Policies  3
   1.4 The Problem  6
   1.5 Scope of Work  6
   1.6 Research Objectives  7
   1.7 Contributions  7
   1.8 Outline of the Thesis  8
   1.9 Summary  9

2. LITERATURE SURVEY  10
   2.1 Replacement Policies  10
   2.2 Data Prefetching and Cache Bypassing  11
   2.3 Effectiveness of Cache Partitioning  11
   2.4 Embedded System Processors  12
   2.5 Performance of Shared Last-Level Caches  13
   2.6 Cache Thrashing and Cache Pollution  15
   2.7 Prevalence of Multi-Threaded Applications  16
   2.8 Summary  17

3. CONTEXT-BASED DATA PATTERN EXPLOITATION TECHNIQUE  18
   3.1 Introduction  18
       3.1.1 Different Types of Application Workloads  19
   3.2 Basic Structure of CB-DPET  22
   3.3 DPET  22
   3.4 DPET-V: A Thrash Proof Version  24
   3.5 Policy Selection Technique  31
   3.6 Thread Based Decision Making  33
   3.7 Block Status Based Decision Making  34
   3.8 Modifications in the Proposed Policies  34
   3.9 Comparison between LRU, NRU and CB-DPET  35
       3.9.1 Performance of LRU  35
       3.9.2 Performance of NRU  36
       3.9.3 Performance of DPET  37
   3.10 Results  38
       3.10.1 CB-DPET Performance  39
   3.11 Summary  43

4. LOGICAL CACHE PARTITIONING TECHNIQUE  45
   4.1 Introduction  45
   4.2 Logical Cache Partitioning Method  46
   4.3 Replacement Policy  47
   4.4 Insertion Policy  47
   4.5 Promotion Policy  48
   4.6 Boundary Condition  49
   4.7 Comparison with LRU  53
   4.8 Results  53
       4.8.1 LCP Performance  54
   4.9 Summary  57

5. SHARING AND HIT-BASED PRIORITIZING TECHNIQUE  58
   5.1 Introduction  58
   5.2 Sharing and Hit-Based Prioritizing Method  59
   5.3 Replacement Priority  60
   5.4 Replacement Policy  61
   5.5 Insertion Policy  66
   5.6 Promotion Policy  66
   5.7 Results  66
       5.7.1 SHP Performance  66
   5.8 Summary  69

6. EXPERIMENTAL METHODOLOGY  70
   6.1 Simulation Environment  70
   6.2 PARSEC Benchmark  71
   6.3 Evaluation of Many-Core Systems  71
   6.4 Storage Overhead  76
       6.4.1 CB-DPET  76
       6.4.2 LCP  76
       6.4.3 SHP  76

7. CONCLUSION AND FUTURE WORK  78
   7.1 Conclusion  78
   7.2 Scope for Further Work  79

REFERENCES  81
LIST OF PUBLICATIONS  97
TECHNICAL BIOGRAPHY  98
LIST OF TABLES

TABLE NO.  TITLE  PAGE NO.

4.1 LCP zone categorization  47
5.1 Sharing degree values and their descriptions  65
6.1 Summary of the key characteristics of the PARSEC benchmarks  70
6.2 Cache architecture configuration parameters  71
7.1 Percentage increase in overall hits compared to LRU  79
LIST OF FIGURES

FIGURE NO.  TITLE  PAGE NO.

1.1 Processor memory gap [1]  1
1.2 Two core processor - T1 and T2 represent the threads  2
1.3 General memory hierarchy in a CMP environment  3
1.4 Data sequence containing a recurring pattern {1,5,7}  5
3.1 Data sequence which will result in thrashing for DPET  24
3.2 (i) Contents of the cache at the end of each burst for DPET  27
    (ii) Contents of the cache at the end of each burst for DPET-V  27
3.3 (i) CB-DPET flow diagram  28
    (ii) Victim Selection flow  29
    (iii) Insert flow DPET  30
    (iv) Insert flow DPET-V  30
3.4 Policy selection using DM register  31
3.5 CB-DPET Policy selection flow  32
3.6 Data sequence containing a recurring pattern  34
3.7 Behavior of LRU for the data sequence  35
3.8 Behavior of NRU for the data sequence  37
3.9 Behavior of DPET for the data sequence  38
3.10 Overall number of hits recorded in L2 cache  39
3.11 L2 cache hit rate  40
3.12 Hits recorded in individual phases of ROI  41
3.13 Miss latency at L2 cache for individual cores  41
3.14 Percentage increase in L2 cache hits  42
3.15 Overall number of misses at L2 cache  43
4.1 (i) LCP flow diagram  48
    (ii) Victim Selection flow for LCP  49
4.2 Working of LRU and LCP for the data set  52
4.3 Percentage increase in overall hits at L2 cache  54
4.4 Difference in number of replacements made at L2 cache  54
4.5 Average cache occupancy percentage  55
4.6 Miss rate recorded by individual cores  56
5.1 (i) SHP flow diagram  64
    (ii) Victim Selection flow for SHP  65
5.2 Overall number of hits at L2 cache  67
5.3 Percentage decrease in number of replacements at L2 cache  68
5.4 Overall miss rate and core-wise miss rate at L2 cache  68
6.1 Performance comparison for 2 cores  72
6.2 Performance comparison for 4 cores  73
6.3 Performance comparison for 8 cores  73
6.4 CB-DPET - percentage increase in hits  74
6.5 LCP - percentage increase in hits  74
6.6 SHP - percentage increase in hits  75
LIST OF SYMBOLS AND ABBREVIATIONS

AMA      Age Manipulation Algorithm
CMP      Chip Multi-core Processors
CB-DPET  Context Based Data Pattern Exploitation Technique
DPET     Data Pattern Exploitation Technique
DPET-V   Data Pattern Exploitation Technique - Variant
DM       Decision Making Register
L2       Level 2
LCP      Logical Cache Partitioning
LR       Likely to be Referenced
LLR      Less Likely to be Referenced
LLC      Last Level Cache
LRU      Least Recently Used
LFU      Least Frequently Used
MRU      Most Recently Used
MLR      Most Likely to be Referenced
NRU      Not Recently Used
NLR      Never Likely to be Referenced
PP1      Parallel Phase 1
PP2      Parallel Phase 2
ROI      Region of Interest
SHP      Sharing and Hit-based Prioritizing
1. INTRODUCTION
The advent of multi-core computers, consisting of two or more processors
working in tandem on a single integrated circuit, has produced a phenomenal
breakthrough in performance in recent years. Efficient memory management is
required to utilize the complete potential of CMPs.

One of the main concerns in recent times is the processor-memory gap: a
scenario where the processor is stalled while a memory operation completes.
Fig 1.1 shows that this situation is getting worse, as processor performance has
increased manifold while memory latency has remained almost the same.
On-chip cache memory clock speeds have hit a wall where practically no
further improvement can be achieved. The use of multiple cores has mitigated
this effect to some extent, but the gap is expected to widen further in the future,
stressing the need for effective memory management techniques.
Fig 1.1 Processor memory gap [1]
1.1. Chip Multi-core Processors (CMP)
In a CMP, multiple processor cores are packaged into a single physical
processor chip capable of executing multiple tasks simultaneously. Multi-core
processors provide room for faster execution of applications by exploiting the
available parallelism (mainly thread-level parallelism), which implies the ability
to work on multiple problems simultaneously. Parallel multi-threaded
applications have high computational and throughput requirements; to expedite
their execution, they spawn multiple threads which run in parallel. Each thread
has its own execution context and is assigned a particular task. In a multi-core
architecture, these threads can run on the same core or on different cores. A
two-core processor with two threads T1 and T2 is depicted in Fig 1.2.
Fig 1.2 Two core processor-T1 and T2 represent the threads
Every core behaves like an individual processor and has its own execution
pipeline. The cores can execute multiple threads in parallel, expediting the
running time of the program as a whole and boosting overall system
performance and throughput. The efficiency of CMPs largely relies on how well
a software application can leverage the multiple cores during execution. Many
applications that have emerged in recent times have been optimized to exploit
multiple cores (when present) to improve performance. Networking
applications and other CPU-intensive applications have benefitted hugely from
the CMP architecture.
1.2. Memory Management in CMP
Memory management forms an integral part of determining the performance of
CMPs. Cache memory is used to hide the latency of main memory and also
helps in reducing processing time. Generally, every core in a CMP environment
has a private L1 cache, and all the cores share a relatively larger unified L2
cache (which in most cases can also be referred to as the shared Last Level
Cache). The L1 cache is split into an 'instruction cache' and a 'data cache' as
shown in Fig 1.3. The mapping method used is 'set-associative' mapping [1],
where the cache is divided into a specified number of sets, with each set
holding a specified number of blocks.

The number of blocks in each set indicates the associativity of the cache. The
size of the L1 cache has to be kept small owing to chip constraints. Since it is
integrated directly into the chip that carries the processor, its data retrieval
time is short.
Fig 1.3 General memory hierarchy in a CMP environment
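To make the set-associative mapping concrete, the sketch below splits a byte address into its tag, set index and block offset. The geometry constants are illustrative examples only, not the configuration evaluated in this thesis (that appears in Chapter 6).

```python
# Illustrative set-associative address decomposition. The geometry
# constants below are examples, not the thesis's simulated configuration.
CACHE_SIZE = 512 * 1024   # total cache capacity in bytes (512 KB)
BLOCK_SIZE = 64           # bytes per cache block
ASSOCIATIVITY = 8         # blocks (ways) per set

NUM_SETS = CACHE_SIZE // (BLOCK_SIZE * ASSOCIATIVITY)

def decompose(address: int):
    """Split a byte address into (tag, set index, block offset)."""
    offset = address % BLOCK_SIZE
    set_index = (address // BLOCK_SIZE) % NUM_SETS
    tag = address // (BLOCK_SIZE * NUM_SETS)
    return tag, set_index, offset
```

An incoming line may occupy any of the 8 ways of its set; which resident line it displaces is precisely what the replacement policy decides.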
1.3. Overview of Existing Replacement Policies
Replacement policies play a vital role in determining the performance of a
system. In a multi-core environment, where processing speed and throughput
are of essence, selecting a suitable replacement technique for the underlying
cache memories becomes crucial. Primary objective of any replacement
4
algorithm is to utilize the available cache space effectively and reduce the number
of data misses to the maximum extent possible. Misses can be of three types
namely- compulsory misses, conflict misses and capacity misses.
Compulsory misses happen when a data item is fetched from the main
memory and stored into the cache for the first time; in most cases they are
unavoidable. Conflict misses are common in direct-mapped cache
architectures, where more than one data item competes for the same cache
line, resulting in eviction of the previously inserted item. As the name suggests,
capacity misses arise from the inability of the cache to hold the entire working
set, which is observed in most applications owing to the small cache size. An
efficient replacement algorithm targets the capacity misses and tries to reduce
them as much as possible.
Since the L2 cache is accessed by multiple cores and holds shared data,
judicious replacement strategies need to be adopted here in order to improve
performance. At present, many applications make use of the LRU cache
replacement algorithm [1], which discards the 'oldest' cache line available, i.e.
the cache line which has not been accessed for the longest period of time. This
implies that the 'age' of every cache line needs to be updated after every
access, which is done with the help of either a counter or a queue data
structure. In the queue-based approach, data that is referenced by the
processor is moved to the top of the queue, pushing all the elements below
down by one step. At any point of time, to accommodate an incoming block,
LRU deletes the block found at the bottom of the queue (the least recently
used data).
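The queue-based bookkeeping described above can be sketched in a few lines. This is a minimal illustration using Python's OrderedDict as the recency queue, not the hardware implementation:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal queue-style LRU: the least recently used key sits at the
    front of the OrderedDict and is evicted first."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.lines = OrderedDict()   # insertion order doubles as recency order

    def access(self, key) -> bool:
        """Touch `key`; return True on a hit, False on a miss."""
        if key in self.lines:
            self.lines.move_to_end(key)      # promote to most recently used
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)   # evict the least recently used
        self.lines[key] = None
        return False
```

Replaying a Fig 1.4-style trace on a 4-entry cache shows the weakness discussed next: the accesses 1, 5, 7 followed by a burst of distinct items evict 1, 5 and 7 before they recur, so the recurring pattern never hits.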
This method seems to work well with many current applications. However, for
certain workloads which possess an unpredictable data reference pattern that
fluctuates frequently with time, LRU can result in poor performance. A closer
observation of LRU reveals the cause of this problem: it assumes that all
applications follow the same reference pattern, i.e. that they abide by the
spatial and temporal locality principles. Any data item that is not referenced by
the processor for some considerable amount of time is tagged as 'least recently
used' and evicted from the cache. When such a data item is required by the
processor in the future, it is not found in the cache and has to be fetched from
the main memory.
The Most Recently Used (MRU) policy works in a similar fashion, with the slight
difference that the victim is chosen from the opposite end of the data structure,
i.e. the most recently referenced data is evicted to accommodate an incoming
block. The above algorithms work fine when the 'locality of reference' is
high. But for applications whose data sets do not exhibit any degree of
regularity in their access pattern, these algorithms can result in sub-optimal
performance. An example of such a pattern is given in Fig 1.4. In the ensuing
sections, individual data items are referred to by enclosing them within curly
braces.

Fig 1.4 Data sequence containing a recurring pattern {1,5,7}

As observed from Fig 1.4, the starting pattern {1,5,7} recurs after a long burst
of random data. If LRU is applied to this set of data, it will result in poor
performance, as there is no regularity in the access pattern.
Pseudo-LRU keeps a single status bit for each cache block. Initially all bits are
reset; every access to a block sets its bit, indicating that the block was recently
used. This too may degrade performance for the above data sequence.

Another replacement technique commonly employed apart from LRU and
MRU is the Not Recently Used (NRU) algorithm. NRU associates a single-bit
counter with every cache line to aid replacement decisions. It can be effective
compared to LRU in certain cases, but a 1-bit counter may not be powerful
enough to provide the required dynamicity.
The random replacement algorithm arbitrarily selects a victim when a
replacement is necessary; it keeps no information about the cache access
history. The Least Frequently Used (LFU) technique tracks the number of times
a block is referenced using a counter, and the block with the lowest count is
selected for replacement.
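The single-bit NRU scheme described above can be sketched for one cache set as follows, with the common convention that a cleared bit marks a line as an eviction candidate and all bits are cleared once every line has been marked recently used. Details vary between implementations, so treat this as illustrative:

```python
class NRUSet:
    """One cache set under NRU: a single 'recently used' bit per line.
    The clear-all fallback is one common convention, not the only one."""
    def __init__(self, ways: int):
        self.tags = [None] * ways
        self.used = [False] * ways   # True = recently used

    def access(self, tag) -> bool:
        """Touch `tag`; return True on a hit, False on a miss."""
        if tag in self.tags:
            self.used[self.tags.index(tag)] = True
            return True
        victim = self._pick_victim()
        self.tags[victim] = tag
        self.used[victim] = True
        return False

    def _pick_victim(self) -> int:
        # Prefer an empty or not-recently-used way; if every bit is set,
        # clear them all and fall back to way 0.
        for i, recently_used in enumerate(self.used):
            if not recently_used:
                return i
        self.used = [False] * len(self.used)
        return 0
```

With one bit of history per line, all lines quickly look alike under an irregular access pattern, which is why the text notes that a 1-bit counter may not provide the required dynamicity.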
1.4. The Problem
An effective replacement algorithm is important for cache management in a
CMP. When it comes to the on-chip cache, it becomes even more crucial, as
the size of the cache memory is considerably small. In order to fine-tune the
performance of the cache, it is imperative to have an efficient replacement
technique in place. Apart from being efficient, it should also be dynamic, i.e.
adjust itself according to the changing data access patterns.

The most prevalently used replacement algorithms, such as Least Recently
Used, Most Recently Used and Not Recently Used, have been proven to work
well with many applications, but they are not dynamic in nature, as discussed
in the earlier section. Irrespective of the data pattern exhibited by the
application, they follow the same technique to select the replacement victim.

It was observed that if a replacement algorithm could make decisions based
on the access history and the current access pattern, the number of cache hits
can increase considerably, thereby reducing the number of misses and
improving the overall performance. The primary aim of this dissertation is to
propose and design such dynamic, novel cache replacement algorithms and
evaluate them against the conventional LRU approach.
1.5. Scope of Work
Managing cache memory involves many challenges that need to be addressed
from various angles. The behavior of various existing replacement policies on
the shared last level cache is analyzed. In multi-threaded applications, shared
data has to be handled with care, because a miss on a data item shared by
multiple threads stalls the execution of many threads. Replacement decisions
need to be dynamic and self-adjusting according to workload pattern changes.
Since the cache is small, the available cache space must be utilized efficiently.
The techniques proposed in this work cover all the above aspects, while
refraining from in-depth cache memory hardware architecture optimizations.
1.6. Research Objectives
The following are the objectives of this research:

- To augment the cache subsystem of a CMP with novel replacement
algorithms for improving the performance of the shared Last Level Cache
(LLC).
- To propose a dynamic and self-adjusting cache replacement policy with
low storage overhead and robustness across workloads.
- To manage the shared cache efficiently between multiple applications.
- To make the LLC thrash-resistant and scan-resistant and maximize its
utilization, which helps accelerate overall performance for memory-intensive
workloads.
1.7. Contributions
In order to address the shortcomings of the conventional replacement
algorithms, three novel counter-based replacement strategies, targeted
towards the shared LLC, have been proposed and evaluated over a set of
multi-threaded benchmarks.

The first of the three methods, the Context Based Data Pattern Exploitation
technique, maximizes cache performance by closely monitoring and adapting
itself to the patterns that occur in the workload sequence with the help of a
block-level 3-bit counter. Results show that the overall number of hits obtained
at the L2 cache level is on average 8% more than the LRU approach, and an
IPC speedup of up to 6% is achieved.
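CB-DPET's actual counter-update and victim-selection rules are specified in Chapter 3; purely to illustrate the kind of block-level 3-bit machinery involved, a generic saturating-counter policy (a placeholder scheme, not the thesis algorithm) can be sketched as:

```python
COUNTER_BITS = 3
COUNTER_MAX = (1 << COUNTER_BITS) - 1   # 3-bit counter saturates at 7

class CounterSet:
    """Generic per-block saturating counters for one cache set. These
    update rules are illustrative only; CB-DPET's own rules differ and
    are defined in Chapter 3."""
    def __init__(self, ways: int):
        self.tags = [None] * ways
        self.count = [0] * ways

    def access(self, tag) -> bool:
        """Touch `tag`; return True on a hit, False on a miss."""
        if tag in self.tags:
            i = self.tags.index(tag)
            self.count[i] = min(self.count[i] + 1, COUNTER_MAX)  # saturate
            return True
        victim = self.count.index(min(self.count))  # lowest-counter block
        self.tags[victim] = tag
        self.count[victim] = 0
        return False
```

Even this crude version shows the appeal over a single NRU bit: a block that keeps earning hits builds up counter value and survives a burst of one-off accesses.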
Logical cache partitioning, the second technique, logically partitions the cache
into four zones, namely Most Likely to be Referenced, Likely to be Referenced,
Less Likely to be Referenced and Never Likely to be Referenced, in decreasing
order of their likeliness factor. Replacement decisions are based on how likely
a block is to be referenced by the processor in the near future. Results show
an improvement of up to 9% in the overall number of hits and 5% in IPC
speedup compared to LRU.
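The ascending-likeliness search can be pictured with a toy victim selector over the four zone labels (MLR, LR, LLR, NLR, matching the thesis abbreviations); the real LCP insertion and promotion rules are given in Chapter 4.

```python
# Zones ordered from least to most likely to be referenced again;
# the victim search starts at the least likely zone.
ZONES = ("NLR", "LLR", "LR", "MLR")

def pick_victim(blocks):
    """blocks: list of (tag, zone) pairs in one cache set. Returns the tag
    of the first block found in the least-likely non-empty zone."""
    for zone in ZONES:
        for tag, z in blocks:
            if z == zone:
                return tag
    return None  # set is empty
```

A block in MLR is evicted only when every less-likely zone is empty, which is the sense in which the search minimizes misses on soon-to-be-referenced data.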
Finally, Sharing and Hit-based Prioritizing is a novel counter-based
replacement technique which makes replacement decisions based on the
sharing nature of the cache blocks. Every cache block is classified into one of
four degrees of sharing, namely not shared (or private), lightly shared, heavily
shared and very heavily shared, with the help of a 2-bit counter. This
sharing-degree information is combined with the number of hits received by the
element to derive a priority on which the replacement decisions are based.
Results show that this method achieves a 9% improvement in the overall
number of hits and an 8% improvement in IPC speedup compared to the
baseline LRU policy.
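The way sharing degree and hit count combine into a single priority can be illustrated as below. The 2-bit encoding mirrors the four sharing degrees named above, but the weighting is an assumption made for illustration; SHP's actual priority scheme is given in Chapter 5.

```python
# 2-bit sharing degree: 0 = not shared (private), 1 = lightly shared,
# 2 = heavily shared, 3 = very heavily shared.
def replacement_priority(sharing_degree: int, hits: int) -> int:
    """Lower value = better eviction candidate. Letting sharing dominate
    hits is an illustrative assumption, not the thesis's exact scheme."""
    return sharing_degree * 4 + min(hits, 3)

def pick_victim(blocks):
    """blocks: list of (tag, sharing_degree, hits) tuples in one set."""
    return min(blocks, key=lambda b: replacement_priority(b[1], b[2]))[0]
```

Under this ordering a private block with few hits is evicted before a heavily shared one, reflecting the observation that a miss on shared data can stall several threads at once.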
1.8. Outline of the Thesis
This thesis is organized into seven chapters, as follows.

Chapter 2 presents the literature survey, which discusses the wide range of
research that has been carried out on cache management in single-core and
multi-core architectures.

Chapter 3 presents a novel counter-based replacement technique, the Context
Based Data Pattern Exploitation Technique. It first presents the different types
of workloads and the basic structure of the policy, then elaborates on the
algorithm and results, and finally summarizes the method.

Chapter 4 presents the Logical Cache Partitioning technique. It first presents
the proposed replacement policy and then evaluates its significance and
results against LRU.

Chapter 5 describes the novel replacement policy, the Sharing and Hit-based
Prioritizing technique, and evaluates its performance.
Chapter 6 describes the simulation environment and the benchmarks used to
evaluate all three methods, and presents the results obtained for each
technique, comparing and contrasting them with the results obtained for the
same set of benchmarks under the conventional LRU approach. The final
section of the chapter presents the storage overhead.
Chapter 7 wraps up by presenting the conclusion and directions for future work.
1.9. Summary
This chapter has provided an overview of the problems associated with
existing cache replacement policies for the shared last level cache in chip
multiprocessors, which forms the essential foundation for Chapters 3, 4 and 5.
It has presented an outline of the thesis, specifically devoting three chapters to
addressing the drawbacks associated with the shared LLC in CMPs. The next
chapter presents the literature survey of various cache replacement policies for
multi-core processors.
2. LITERATURE SURVEY
2.1. Replacement Policies
Replacement policies play a crucial role in CMP cache management.
Considering the on-chip cache memory of a processor, choosing the right
replacement technique is extremely important in determining the performance
of the system. The more adaptive a replacement technique is to the incoming
workload, the lower the overhead involved in transferring data to and from
secondary memory. John Odule and Ademola Osingua have come up with a
dynamically self-adjusting cache replacement algorithm [2] that makes page
replacement decisions based on changing access patterns by adapting to the
spatial and temporal features of the workload. Though it adapts dynamically to
the workload's needs, it does not detect changes or make decisions at a finer
level, i.e. at block level, which can be crucial when the working set changes
frequently.
Many algorithms that work better than LRU on specific workloads have
evolved in recent times. One such algorithm is the Dueling Clock (DC)
algorithm [3], designed to be applied to the shared L2 cache; however, it does
not consider applications that involve multiple threads executing in parallel.
The Effectiveness-Based Replacement technique (EBR) [4] ranks the cache
elements within a set based on recency and frequency of access, and chooses
the replacement victim based on those ranks. In a multi-threaded environment,
a miss on shared data can be pivotal and calls for more attention, which is not
provided here. Competitive cache replacement [5] proposes a replacement
algorithm for inclusive, exclusive and partially-inclusive caches by considering
two knowledge models. Self-correcting LRU [6] proposes a technique in which
LRU is augmented with a framework that constantly monitors and corrects the
replacement mistakes LRU tends to make in its execution cycle. Adaptive
block management [7] proposes a method to use the victim cache more
efficiently, thereby reducing the number of accesses to the L2 cache.
2.2. Data Prefetching and Cache Bypassing
Data prefetching can be very useful in any cache architecture, since it helps in
reducing cache miss latency [8]. The Prediction-Aware Confidence-based
Pseudo LRU (PAC-PLRU) algorithm [9] efficiently makes use of prefetched
information and also saves that information from being unnecessarily evicted
from the cache. There can also be scenarios where the overhead can be
reduced by promoting the data fetched from main memory directly to the
processor, bypassing the cache [10]. Hongliang Gao and Chris Wilkerson's
Dual Segmented LRU algorithm [11] has explored the positive aspects of
cache bypassing in detail. Though cache bypassing [12] can help in reducing
the transfer overhead, it may not sit well with the 'locality principles', i.e.
bypassing a data item likely to be referenced in the near future might not be a
good idea. The counter-based cache replacement and cache bypassing
algorithm [13] tries to locate dead cache lines well in advance and chooses
them as replacement victims, thus preventing such lines from polluting the
cache. An augmented cache architecture comprises two parts: a larger
direct-mapped cache and a smaller, fully-associative cache, which are
accessed in parallel. A replacement technique similar to LRU, known as the
LRU-Like algorithm [14], was proposed for augmented cache architectures.

Backing up a few of the evicted lines in another cache can help increment the
overall hits received. The Adaptive Spill-Receive Cache [15] works on similar
lines by having a spiller cache (one that 'spills' the evicted line) and a receiver
cache (one that stores the 'spilled' lines). But if the cache line reuse prediction
is not efficient, this can result in lots of stale cache lines as the algorithm
executes.
2.3. Effectiveness of Cache Partitioning
In a multi-processor environment, efficient cache partitioning can help improve
the performance and throughput of the system [17,18,19]. Partition-aware
replacement algorithms exploit the cache partitions and try to minimize cache
misses [20,21]. Filter-based partitioning techniques have been widely
proposed in many studies [22,24]. Adaptive Bloom Filter Cache Partitioning
(ABFCP) [23] involves dynamic cache partitioning at periodic intervals, using
bloom filters and counters to adapt to the needs of concurrent applications.
But since every processor core is allotted an array of bloom filters, the
hardware complexity can increase as the number of cores grows. The Pseudo
LRU algorithm (similar to LRU but with lower hardware complexity) has been
proven to work well with partitioned caches [25]. Studies have also been
conducted on enhancing the performance of Non-Uniform Cache Architecture
(NUCA) [26,27,29]. A dynamic cache management scheme [28] aimed at the
Last-Level Caches in NUCA claims to have improved system performance.
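Part of the appeal of bloom filters in schemes such as ABFCP is that approximate membership ("might this core have touched this line?") costs only a few bits per query, with false positives possible but false negatives impossible. A minimal sketch follows; the hash choice and sizes here are arbitrary, and hardware versions use much cheaper hash functions:

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter: k hash positions over an m-bit array.
    May report false positives, never false negatives."""
    def __init__(self, m_bits: int = 1024, k: int = 3):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits)   # one byte per bit, for clarity

    def _positions(self, item: str):
        # Derive k positions by salting one cryptographic hash.
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item: str) -> bool:
        return all(self.bits[p] for p in self._positions(item))
```

The false-positive rate is tuned by the ratio of set bits to array size, which is why ABFCP-style designs must scale their filter arrays with core count, at the hardware cost noted above.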
Although many schemes work well by adapting to the dynamic data
requirements of the applications, they do not attach importance to data that is
shared between the threads of parallel workloads. Hits on such data items not
only improve the performance of those workloads, but also reduce the
overhead involved in frequently transferring data to and from secondary
memory. Dynamic cache clustering [31] groups cache blocks into multiple
dynamic cache banks on a per-core basis to expedite data access; placing
new incoming data items into the appropriate cache bank at run time can,
however, induce transfer overheads. The smaller split L1 data caches proposal
[32] uses a split design for the L1 data cache with separate array and scalar
caches, which has shown performance improvements. Yong Chen et al.
suggest the use of a separate cache to store data access histories and study
its associated prefetching mechanism, which has produced performance
gains [33].
2.4. Embedded System Processors
Embedded processors differ from conventional processors in that they are
tailor-made for very specific applications. Replacement policies have been
proposed that target embedded systems [36,37,38]. The Hierarchical
Non-Most-Recently-Used (H-NMRU) replacement policy [35] is one such
algorithm that has been proven to work well in those processors. [39] deals
with the cache block locking problem and provides a heuristic, efficient
solution, whereas [40] proposes improved procedure placement policies in the
cache memory. A Two Bloom Filter (TBF) based approach [42], proposed
primarily for flash-based cache systems, has been proven to work well
compared to LRU. The proposal made by Pepijn de Langen et al. [43] targets
improving the performance of direct-mapped caches for application-specific
processors. Dongxing Bao et al. suggest a method [44] to improve the
performance of direct-mapped caches in which blocks are evicted based on
their usage efficiency; this method has shown improvement in miss rate.
2.5. Performance of Shared Last-Level Caches
Performance of the shared LLC in multi-core architectures plays a crucial role in
deciding the overall performance of the system, since all the cores access the
data present there. Various studies [45,46,47] have been conducted on
improving shared LLC performance. Phase Change Memory (PCM) behind the
LLC forms an effective low-power alternative to the traditional DRAM in use
today. Writeback-aware Cache Partitioning (WCP) [52] efficiently partitions the
LLC above PCM among multiple applications, and the Write Queue Balancing
(WQB) replacement policy [52] helps in improving the throughput of PCM. But
one of the main drawbacks of PCM is that writes are much slower compared to
DRAM.
The Adaptive Cache Management technique [53] combines SRAM and DRAM
architectures into the LLC and proposes an adaptive DRAM replacement policy
to accommodate the changing access patterns of the applications. Though it
reports improvement over LRU, it does not discuss the hardware overhead that
might result from the change in architecture. Frequency based LRU (FLRU)
[54] takes recent access information, as LRU does, along with partition and
frequency information while making replacement decisions.
Cache Hierarchy Aware Replacement (CHAR) [55] makes replacement
decisions in LLC based on the activities that occur in the inner levels of the
cache. It studies the data pattern that hits/misses in the higher levels of cache
and makes appropriate decisions in LLC. Yingying Tian et al. [56] have
proposed a method which functions along similar lines for making decisions at
the LLC. But propagating such information frequently across the various levels
of cache might slow down the system over time. Cache space is another area
of concern. Instead of utilizing the entire cache space, an Adaptive Subset
Based Replacement Policy (ASRP) [57] has been proposed, where the cache
space is divided into multiple subsets and at any point of time only one subset is
active. Whenever a replacement needs to be made, the victim is selected only
from the active set. But there is a possibility of under-utilization of the available
cache space in this technique. For any replacement algorithm to be successful,
it is always essential to make good use of the available cache space. This can
be achieved by eradicating the dead cache lines i.e. the lines which will never be
referenced by the processor in the future. A great deal of research has been
carried out in this area [58,59]. But the problem which can arise here is that for
applications with an unpredictable data access pattern, the process of dead line
prediction becomes intricate and in the long run the predictor might lose its
accuracy.
Vineeth Mekkat et al. [60] have proposed a method that focuses on improving
LLC performance on heterogeneous multi-core processors by targeting and
optimizing Graphics Processing Unit (GPU) based applications, since they
spawn a huge number of threads compared to other applications.
The effect of miss penalty can have a serious impact on the system efficiency.
Locality Aware Cost Sensitive (LACS) replacement technique [61] works on
reducing the miss penalty in caches. It calculates, for every cache block, the
number of instructions issued during its miss, and when a victim needs to be
selected for replacement, it selects the block for which the minimum number of
instructions was issued. While this helps in reducing the miss penalty, it does
not attach importance to the data access pattern of the application under
execution. Another replacement policy, the LR+5LF policy [62], combines the
concepts of LRU and LFU to find a replacement candidate; hardware
complexity is an issue here. Replacement algorithms that work well with the L1 cache may not
suit the L2 cache. Thus choosing an efficient pair of replacement algorithms for
L1 and L2 becomes important for maximizing the hit rate [63]. Studies have
indicated that execution history like the initial hardware state of the cache can
also impact the sensitivity of replacement algorithms [64]. Memory Level
Parallelism (MLP) involves servicing multiple memory accesses simultaneously
which will save the processor from stalling due to a single long-latency memory
request. The replacement technique proposed in [65] incorporates MLP and
dynamically makes a replacement decision on a miss such that the memory
related stalls are reduced. Tag-based data duplication may occur in caches,
which can be exploited to eliminate lower-level memory accesses, thereby
reducing memory latency [66]. Demand-based associativity [67] involves
varying the associativity of the cache set (basically, the total number of cache
blocks per set) according to the requirements of the application's data access
pattern, though storage complexity will be high in this method. Cooperative caching [68]
proposes a unified framework for managing the on-chip cache. Set pinning [69]
proposes a technique for avoiding inter-core misses and reducing intra-core
misses in a shared cache, which has shown improvement in miss rate.
Adaptive caches [70] propose a method which dynamically selects between two
of four cache replacement algorithms (LRU, LFU, FIFO and Random) by
tracking locality information. Application-aware caching [71] tracks the lifetime
of cache blocks in the shared LLC at runtime for parallel applications and
thereby utilises the cache space more efficiently. EDIP [72] proposes a method
for memory-intensive workloads where a new block is inserted at the LRU
position, and claims to have obtained a lower miss rate than LRU for workloads
with large working sets.
2.6. Cache Thrashing and Cache Pollution
One predominant problem with LRU is that it may result in thrashing (a
condition where most of the time is spent on replacement and there are hardly
any hits at all) when it is applied to memory-intensive applications.
Investigations are being conducted to reduce the effect of thrashing while
making replacement decisions [73,74]. Adaptive Insertion Policies [76] have
tweaked the insertion policy of LRU to reduce cache misses for
memory-intensive workloads. The resulting policy, termed the LRU Insertion
Policy (LIP), has outperformed LRU with respect to thrashing. The OS buffer management technique
[73] tries to reduce thrashing in LLCs by adopting techniques to improve the
management of OS buffer caches. Since it has to operate at an OS level, it may
pull down system performance. The Evicted-Address Filter mechanism [75] keeps
track of the recently evicted blocks to predict the reuse of the incoming cache
blocks thereby mitigating the effects of thrashing but at the cost of
implementation complexity.
Cache pollution happens when the cache is predominantly filled up with stale
data that the processor will never reference in the future. Inefficient cache
replacement algorithms can lead to cache pollution. Cache line reuse prediction
techniques [77,78,79] are required to avoid or mitigate cache pollution.
The software-only pollute-buffer approach [45] works on eliminating LLC
polluters dynamically using a special buffer controlled at the operating-system
level, but as the algorithm executes it may burden the OS, degrading
performance. A Pollution Alleviative L2 cache [80] proposes a method to avoid
the cache pollution that can arise from wrong-path execution on branch
mispredictions, and claims to have improved the IPC and cache hit rate.
The scalable shared cache proposal [81] manages thrashing applications in the
shared LLC of embedded multi-core architectures.
2.7. Prevalence of Multi-Threaded Applications
Concurrent multi-threaded applications have become prevalent in today's
CMPs. Works have been proposed on how a cache can be effectively shared by
multiple threads running in parallel [82]. Co-operative cache partitioning [83]
multiple threads running in parallel [82]. Co-operative cache partitioning [83]
works by efficiently allocating cache resources to threads which are concurrently
executing on multiple cores. Thread Aware Dynamic Insertion Policy (TA DIP)
[84] takes the memory requirements of individual applications into consideration
and makes replacement decisions. Adaptive Timekeeping Replacement (ATR)
[85] policy is designed to be applied primarily at the LLC. It assigns a 'lifetime'
to every cache line. Unlike other replacement strategies such as LRU, it takes
process priority levels and application memory access patterns into
consideration and adjusts the cache line lifetime accordingly to improve
performance. But the conditions that can lead to thrashing have not been
considered in either method [84,85]. Pseudo-LIFO [86] suggests a method for
early eviction of dead blocks and retention of live blocks, which improves cache
utilization and average CPI for multi-programmed workloads.
2.8. Summary
This chapter has presented prior work on various cache replacement
algorithms, including the use of cache partitioning and bypassing, together with
the additional overhead involved in their implementation. An exhaustive study of prior
work on cache thrashing and cache pollution has helped in the design of
dynamic cache replacement schemes that can be applied in high performance
processors. The next chapter discusses a counter based cache replacement
technique for shared last level cache in multi-core processors.
3. CONTEXT-BASED DATA PATTERN EXPLOITATION TECHNIQUE
Data access patterns vary across applications and workloads. A replacement
algorithm can aid in improving system performance only when it works well for
the majority of applications. In short, if a replacement algorithm
could make decisions based on the access history and the current access
pattern of the workloads, the number of hits received can increase considerably
[87,88,89].
The cache replacement policy acts as a main deciding factor of system
performance and efficiency in Chip Multi-Core Processors.
Conventional replacement approaches work well for applications which have a
well-defined and a predictable access pattern for their workloads. But when the
workload of the applications is dynamic and constantly changes pattern,
traditional replacement techniques may fail to produce improvement in the cache
metrics. Hence this chapter discusses a novel cache replacement policy that is
targeted towards applications with dynamically varying workloads. The
Context-Based Data Pattern Exploitation Technique (CB-DPET) assigns a counter to every block of
the L2 cache. It then closely observes the data access patterns of various
threads and modifies the counter values appropriately to maximize the overall
performance. Experimental results obtained by using the PARSEC benchmarks
have shown an average improvement of 8% in overall hits at L2 cache level
when compared to the conventional LRU algorithm.
3.1. Introduction
Memory management plays an essential role in determining the performance
of CMPs. Cache memory is indispensable in CMPs to enhance data retrieval
time and to achieve high performance. The level 1 cache or the L1 cache needs
to be small in size owing to the hardware limitations but since it is directly
integrated into the chip that carries the processor, accessing it typically takes
only a few processor cycles when compared to the accessing time of the next
higher levels of memories. Accessing the L2 cache may not be as fast as
accessing the L1 cache, but since it is larger in size and is shared by all the
available cores, managing it effectively becomes more important. This chapter
puts forth a dynamic replacement approach called CB-DPET that is targeted
mainly towards the shared L2 cache.
3.1.1. Different Types of Application Workloads
The following are the types of workload patterns [90] exhibited by many
applications. Let 'x' and 'y' denote sequences of references, where 'xi' denotes
the address of a cache block. 'a', 'p' and 'n' are integers, where 'a' denotes the
number of times the sequence repeats itself.
Thrashing Workload:
[… (x1, x2, …, xn)a …], a > 0, n > cache size
These workloads have a working set size that is greater than the size of the
available shared cache. They may not perform well under LRU replacement
policy. Sample data pattern for a thrashing workload for a cache which can hold
8 blocks at a time can be as follows
. . . 8, 9, 2, 6, 11, 20, 4, 5, 10, 16, 80 . . .
Here, cache_size = 8 blocks, x1 = 8, x11 = 80, n = 11 and a = any value > 0.
On the first scan, elements '8' through '5' will be fetched and loaded into the
cache. When the next element '10' arrives, the cache will be full and hence a
replacement has to be made. The same is the case for the remaining data
items '16', '80' and so on.
Regular Workload:
[… (x1, x2, …, xn-1, xn, xn-1, …, x2, x1)a …], n > 0, a > 0
Applications possessing this type of workload benefit hugely from the cache
usage. The LRU policy suits them perfectly, since the data elements follow a
stack-based pattern. A sample data pattern for a regular workload for a cache
which can hold 8 blocks at a time is as follows
. . . 8, 9, 2, 6, 11, 20, 4, 5, 4, 20, 11 . . .
Here, cache_size = 8 blocks, x1 = 8, x7 = 4, x8 = 5, n = 8 and a = any value > 0.
After the cache gets filled up with the eighth element '5', the remaining
elements '4', '20' and '11' result in back-to-back cache hits.
Random Workload:
[… (x1, x2, …, xn) …], n → ∞
Random workloads possess poor cache re-use. They generally tend to thrash
under any replacement policy. Sample data pattern for a random workload for a
cache which can hold 8 blocks at a time can be as follows
. . . 1, 9, 2, 8, 0, 12, 22, 99, 101, 50, 3 . . .
Recurring Thrashing Workload:
[… (x1, x2, …, xn)a … (y1, y2, …, yp) … (x1, x2, …, xn)a …], a > 0, p > 0, n > cache size
Sample data pattern for a recurring thrashing workload for a cache which can
hold 8 blocks at a time can be as follows
. . . 8, 9, 2, 6, 11, 20, 4, 5, 7, 17, 46, 10, 16, 80, 90, 99, 8, 9, 2 . . .
Here, cache_size = 8 blocks, x1 = 8, x11 = 46, y1= 10, y5= 99, n = 11 and a = 1.
On the first scan, elements '8' through '5' will be fetched and loaded into the
cache. When the next element '7' arrives, the cache will be full and hence a
replacement has to be made. The same is the case for the remaining data items
'17', '46' and so on. By the time the initial pattern recurs towards the end, it
would already have been replaced, resulting in cache misses.
Recurring Non-Thrashing Workload:
[… (x1, x2, …, xn)a … (y1, y2, …, yp) … (x1, x2, …, xn)a …], a > 0, p > 0, n ≤ cache size
It is similar to the recurring thrashing workload, except that the working set fits within the cache.
Sample data pattern for a recurring non-thrashing workload for a cache which
can hold 8 blocks at a time can be as follows
. . . 8, 9, 2, 6, 11, 20, 4, 5, 10, 16, 80, 8, 9, 2 . . .
Here, cache_size = 8 blocks, x1 = 8, x8 = 5, y1 = 10, y3 = 80, n = 8 and a = 1.
On the first scan, elements '8' through '5' will be fetched and loaded into the
cache. When the next element '10' arrives, the cache will be full and hence a
replacement has to be made. The same is the case for the remaining data
items '16' and '80'. The initial pattern then recurs towards the end. If LRU is
applied here, it will result in 3 cache misses towards the end, as it would have
replaced data items '8', '9' and '2'.
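The LRU behaviour claimed for these sample patterns can be checked with a small simulation (a fully-associative, single-set model; the function name `lru_run` is this sketch's own, not from the thesis):

```python
from collections import OrderedDict

def lru_run(sequence, cache_size):
    """Simulate a fully-associative LRU cache; return (hits, misses)."""
    cache = OrderedDict()
    hits = misses = 0
    for block in sequence:
        if block in cache:
            hits += 1
            cache.move_to_end(block)        # promote to most-recently-used
        else:
            misses += 1
            if len(cache) >= cache_size:
                cache.popitem(last=False)   # evict the least-recently-used block
            cache[block] = True
    return hits, misses
```

On the thrashing sequence above this reports 11 misses and no hits; on the recurring non-thrashing sequence it reports 14 misses and no hits, confirming the 3 extra misses when {8, 9, 2} recur; the regular workload yields 3 hits out of 11 accesses.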
In the recurring non-thrashing pattern, if 'n' is large, then LRU can result in
sub-optimal performance. Data items x2 to xn which recur after a short burst of
random data (y1 to yp) may result in cache misses. The problem here is that
even if a data item has received many cache hits in the past, if it is not
referenced by the
the processor looks for that data item again it results in a miss. This stresses the
need for a replacement technique that adapts itself to the changing workload
and makes replacement decisions appropriately.
This chapter's contributions include:
A novel counter based replacement algorithm called CB-DPET which tries
to maximize the cache performance by closely monitoring and adapting
itself to the pattern that occurs in the sequence.
Association of a 3-bit counter (which will be referred to as the PET
counter from here onwards) with every cache block in the cache set.
A set of rules to adjust the counter value whenever an insertion,
promotion or a replacement happens such that the overall hit percentage
is maximized.
Extension of the method to suit multi-threaded applications and also to
avoid cache thrashing.
PET counter boundary determination such that there will not be any stale
data in the cache for long periods of time, i.e. non-accessed data 'matures'
gradually with every aging step and gets removed at some point of time to
pave the way for new incoming data sets.
3.2. Basic Structure of CB-DPET
This section and the following sub-sections elaborate on the basic concepts
that make up CB-DPET, namely DPET and DPET-V. The mapping method is
set-associative mapping [1]. A counter called the PET counter is associated
with every block in the cache set. Conceptually, this counter holds the 'age' of
each cache block. The maximum value that the counter can hold is set to 4,
which implies that only 3 bits are required to implement it at the hardware level,
thereby not increasing the complexity. The minimum value that the counter can
hold is '0'. The upper bound that a 3-bit counter could actually hold is '7'
(2^3 - 1), but the value of '4' has been chosen. This is because any larger value
can result in stale data polluting the cache for longer periods of time, while if a
value below '3' had been chosen, the performance would be more or less
similar to that of NRU. Thus the upper bound was taken to be '4'. The algorithm
which implements the task of assigning the counter values based on the access
pattern is referred to as the 'Age Manipulation Algorithm' (AMA).
3.3. DPET
A block with a lower PET counter value can be regarded as 'younger' when
compared to a block having a higher PET value. Initially, the counter values of
all the blocks are set to '4'; conceptually, every block is regarded as 'oldest'. In
this sense, the block considered 'oldest' (PET counter value 4) is chosen as the
replacement candidate when the cache gets filled up. If no oldest block can be
found, the counter value of every block is incremented repeatedly till an oldest
cache block is found. This block is then chosen for replacement. Once the new
block has been inserted into this slot, AMA decrements the counter associated
with it by '1' (it is set to '3'). This is done because the block can no longer be
regarded as 'oldest', as it has been accessed recently. It also cannot be given a
very low value, as AMA is not yet sure about its future access pattern. Hence it
is inserted with a value of '3'. When a data hit occurs, that block is promoted to
the 'youngest' level irrespective of which level it was in previously. In this way,
more time is given for data items which repeat themselves in a workload to stay
in the cache.
To put this in more concrete terms,
A. DPET associates a counter, which ranges from '0' to '4', with every cache
block.
B. Initially, the counter values of all the blocks are set to '4'. Conceptually,
every block is regarded as 'oldest'.
C. When a new data item comes in, the oldest block needs to be replaced.
Since contention can arise in this case, AMA chooses the first 'oldest' block
from the bottom of the cache as a tie-breaking mechanism.
D. Once the insertion is made, AMA decrements the counter associated with
the block by '1' (it is set to '3'). This is done because the block can no longer
be regarded as 'oldest', as it has been accessed recently. It also cannot be
given a very low value, as AMA is not sure about its future access pattern.
Hence it is inserted with a value of '3'.
E. In the case where all the blocks are filled up (without any hits in between),
a situation arises where all the blocks have a PET counter value of '3'.
When a new data item arrives, there is no block with a counter value of '4'.
As a result, AMA increments all the counter values by '1' and performs the
check again. If a replacement candidate is found this time, AMA executes
steps C and D again; otherwise it increments the values again. This is
carried out repeatedly until a suitable replacement candidate is found.
F. Now consider the situation where a hit occurs for a particular data item in
the cache. This might be the start of a specific data pattern, so AMA gives
high priority to this data item and restores its associated PET counter value
to '0', terming it the 'youngest' block.
On the other hand, if a miss is encountered, the situation becomes quite similar
to what AMA dealt with in the initial steps, i.e. it scans all the blocks from the
bottom to find the 'oldest' block. If found, it proceeds to replace it; otherwise the
counter values are incremented and the same search is performed again, and
this process continues till a replacement candidate is found.
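Steps A–F can be sketched as a small single-set simulator in Python (an illustrative model only; the class and method names are this sketch's own, not part of the thesis hardware design):

```python
class DPETSet:
    """One cache set under DPET; PET counters range from 0 ('youngest')
    to 4 ('oldest')."""
    OLDEST = 4

    def __init__(self, associativity):
        self.tags = [None] * associativity
        self.pet = [self.OLDEST] * associativity  # step B: all start 'oldest'

    def find_victim(self):
        # Scan from the bottom for the first 'oldest' block (step C);
        # if none exists, age every block by 1 and search again (step E).
        while True:
            for i in range(len(self.tags) - 1, -1, -1):
                if self.pet[i] == self.OLDEST:
                    return i
            for i in range(len(self.tags)):
                self.pet[i] += 1

    def access(self, tag):
        """Return True on a hit, False on a miss (with replacement)."""
        if tag in self.tags:
            self.pet[self.tags.index(tag)] = 0  # step F: promote to 'youngest'
            return True
        victim = self.find_victim()
        self.tags[victim] = tag
        self.pet[victim] = 3                    # step D: insert one below 'oldest'
        return False
```

For example, with a 2-way set, two misses leave both counters at '3'; a third distinct access ages both blocks to '4' and evicts the bottom-most one, exactly as described in step E.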
3.4. DPET-V : A Thrash-Proof Version
This method is a variant of DPET (in the following sections it will be referred to
as DPET-V). Sometimes, for workloads with a working set greater than the
available cache size, there is a possibility that there will not be any hits at all;
the entire time will be spent only in replacement. This condition is often referred
to as thrashing. The performance of the cache can be improved if some fraction
of a randomly selected working set is retained in the cache for some more time,
which will result in an increase in cache hits. Hence, very rarely, AMA
decrements the PET counter value by '2' (instead of '1') after an insertion is
made, so the new counter value post insertion will be '2' instead of '3'.
Conceptually, this allows more time for the data item to stay in the cache. This
technique is referred to as DPET-V. The probability of inserting with '2' is
chosen as approximately 5/100, i.e. out of 100 blocks, the PET counter value of
(approximately) 5 blocks will be set to '2' while the others will be set to '3'
during insertion.
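The only change DPET-V makes is at insertion time; that decision can be sketched as follows (the function name and the injectable `rng` parameter are assumptions of this sketch):

```python
import random

def dpet_v_insertion_counter(rng=random):
    """DPET-V insertion rule: roughly 5 of every 100 insertions start at
    counter value 2 (one extra aging round before eviction); the rest
    start at 3, exactly as in plain DPET."""
    return 2 if rng.random() < 0.05 else 3
```

Blocks inserted with '2' need two increment passes before reaching '4', so a small random slice of a thrashing working set survives the next burst and can hit when the pattern recurs.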
Fig 3.1 Data sequence which will result in thrashing for DPET
Algorithm 1: Implementation for Cache Miss, Hit and Insert Operations
Init:
Initialize all blocks with PET counter value '4'.
Input:
A cache set instance
Output:
removing_index /** Block id of the victim **/
Miss:
/** Invoke the FindVictim method; the set is passed as a parameter **/
FindVictim(set):
while 1 do /** loop until a victim is found **/
    removing_index = -1;
    /** Scan from the bottom **/
    for i = associativity-1 downto 0 do
        if set.blk[i].pet == 4 then
            /** Block found; exit loop **/
            removing_index = i;
            break;
    if removing_index == -1 then
        /** Increment the PET value of every block by 1 and search again **/
        for i = associativity-1 downto 0 do
            set.blk[i].pet = set.blk[i].pet + 1;
    else
        break; /** break from the main while loop **/
return removing_index;
Hit:
set.blk.pet = 0; /** on the block that hit **/
Insert:
set.blk.pet = 3; /** on the newly inserted block **/
Correctness of Algorithm
Invariant:
At any iteration i, the PET counter value of set.blk[i] holds a value in the range
'0' to '4'.
Initialization:
When i = 0, set.blk[i].pet is initially '4'. After the block is selected as the victim
and the new item has been inserted, set.blk[i].pet becomes '3'. Hence the
invariant holds at initialization.
Maintenance:
At i = n, set.blk[i].pet contains either '0' (if a cache hit has happened), or '3' (if
the block has been newly inserted), or '1', '2' or '4' (if a replacement needs to
be made but a block with PET counter '4' has not yet been found). Hence the
invariant holds for i = n, and likewise for i = n + 1, since the individual iterations
are mutually exclusive. Thus the invariant holds for i = n + 1.
Termination:
The inner loop terminates when a replacement candidate is found, and a
candidate will be found in every run, since the counter value of each block
remains in the range '0' to '4'. A candidate is found as soon as a counter value
reaches '4'. Since every block is incremented by '1' until a value of '4' appears,
the boundary condition "if set.blk[i].pet == 4" is bound to be satisfied,
terminating the loop after a finite number of iterations. The invariant can also
be seen to hold for all the cache blocks present when the loop terminates.
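The invariant and the termination argument can also be exercised mechanically (an illustrative fuzz harness; the function names, working-set size and seed are this sketch's own assumptions):

```python
import random

def find_victim(pet):
    """FindVictim from Algorithm 1: scan from the bottom for a counter of 4,
    aging every block by 1 until one is found."""
    while True:
        for i in range(len(pet) - 1, -1, -1):
            if pet[i] == 4:
                return i
        for i in range(len(pet)):
            pet[i] += 1

def fuzz(accesses=10000, associativity=8, seed=1):
    """Drive one set with random block addresses and check the invariant
    0 <= pet <= 4 after every operation."""
    rng = random.Random(seed)
    tags, pet = [None] * associativity, [4] * associativity
    for _ in range(accesses):
        tag = rng.randrange(64)
        if tag in tags:
            pet[tags.index(tag)] = 0          # hit: promote to 'youngest'
        else:
            v = find_victim(pet)              # terminates by the argument above
            tags[v], pet[v] = tag, 3          # insert one step below 'oldest'
        assert all(0 <= c <= 4 for c in pet)  # the invariant holds throughout
    return True
```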
It is observed that if DPET is applied over the working set shown in Fig 3.1,
there will not be any hits at all (Fig 3.2(i)). After the first burst {8,9,2,6,5,0}, the
PET counter value of all blocks is set to '3'. When the next burst {11,12,1,3,4,15}
arrives, none of them results in a hit. When the third burst arrives, again all the
old blocks are replaced with new ones. Fig 3.2(ii) shows the action of DPET-V
over the same data pattern. According to DPET-V, AMA has randomly selected
two blocks (containing data items {5} and {2}) whose counter values were set to
'2' instead of '3' while inserting data burst 1. When burst 2 arrives, those two
blocks are left untouched, as their PET counter values have not matured to '4'
yet. Finally, when the third burst of data {2,9,5,8,6} arrives, it results in two more
hits (for {5} and {2}) compared to DPET. Hence, by giving those two blocks
more time to stay in the cache, the hit rate has been improved. Fig 3.3(i),
3.3(ii), 3.3(iii) and 3.3(iv) show the working of CB-DPET in the form of flow
diagrams.
Fig 3.2 (i) Contents of the cache at the end of each burst for DPET
(ii) Contents of the cache at the end of each burst for DPET-V
[Flow diagram: the processor looks for a particular block 'c' in the cache and
AMA searches for it. On a HIT, the PET counter value of 'c' is set to 0. On a
MISS, the processor brings the data item from the next level of memory, a
replacement victim is found in the cache and the new data item is inserted.]
Fig 3.3 (i) CB-DPET flow Diagram
[Flow diagram: AMA searches from the cache bottom for the first block 'c' with
a PET counter value of '4'. If such a block is found, it is chosen as the
replacement victim; otherwise the counter values of all blocks are incremented
by 1 and the search is repeated.]
Fig 3.3 (ii) Victim Selection flow
[Flow diagram: the incoming data item is inserted into the victim block and its
PET counter value is set to 3.]
Fig 3.3 (iii) Insert flow- DPET
[Flow diagram: the incoming data item is inserted into the victim block; for 95
out of every 100 blocks the PET counter value is set to 3, and for 5 out of every
100 it is set to 2.]
Fig 3.3 (iv) Insert flow DPET-V
3.5. Policy Selection Technique
A few test sets in the cache are allocated to choose between DPET and
DPET-V. Among those test sets, DPET is applied over a few sets and DPET-V
is applied over the remaining sets and they are closely observed. Based on the
performance obtained here, either DPET or DPET-V is applied across the
remaining sets in the cache (apart from the test sets). A separate global register
(which from here onwards will be referred to as Decision Making register or DM
register) is maintained to keep track of the number of misses encountered in the
test sets for both the techniques. Initially the value of this register is set to zero.
Fig 3.4 Policy selection using DM register
[Flow diagram: on a data miss on cache block 'c', if 'c' lies in a DPET-assigned
test set, the DM register is incremented by 1 and the new data item is inserted
with DPET; if 'c' lies in a DPET-V-assigned test set, the DM register is
decremented by 1 and the new item is inserted with DPET-V; otherwise the
value of DM is checked, a replacement victim is found, and the new item is
inserted with DPET if DM <= 0 or with DPET-V if DM > 0.]
Fig 3.5 CB-DPET Policy selection flow
For every miss encountered in a test set which has DPET running on it, the
register value is incremented by one, and for every miss which DPET-V
produces in its test sets, AMA decrements the register value by one.
Effectively, if at any point of time the DM register value is positive, it means that
DPET has been issuing more misses; similarly, if the register value goes
negative, it indicates that DPET-V has resulted in more misses. Based on this
value, AMA makes the apt replacement decision at run time. This entire
algorithm, which alternates dynamically between DPET and DPET-V, is termed
Context-Based DPET. Fig 3.4 depicts policy selection using the DM register
and Fig 3.5 shows the policy selection flow.
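The DM-register mechanism can be sketched as follows (a minimal model; the class name, method names and string labels are this sketch's own, not the thesis's):

```python
class PolicySelector:
    """Sketch of CB-DPET policy selection via the DM register: misses in
    DPET test sets increment it, misses in DPET-V test sets decrement it."""

    def __init__(self):
        self.dm = 0  # Decision Making register, initially zero

    def record_test_set_miss(self, in_dpet_test_set):
        # each test-set miss nudges the register towards the other policy
        self.dm += 1 if in_dpet_test_set else -1

    def policy_for_follower_sets(self):
        # a positive DM means DPET has been missing more, so use DPET-V
        return "DPET-V" if self.dm > 0 else "DPET"
```

With DM at its initial value of zero, DPET is chosen, matching the default described in the following section.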
3.6. Thread Based Decision Making
The concept of CB-DPET is extended to suit multi-threaded applications with a
slight modification. For every thread, a DM register and a few test sets are
allocated. Before the arrival of the actual workload, each thread is assigned to
run DPET on one half of its allocated test sets and DPET-V on the remaining
half. The sets running DPET (or DPET-V) need not be consecutive.
These sets are selected in a random manner and can occur intermittently, i.e. a
set assigned to run DPET can immediately follow a set assigned to run
DPET-V, or vice versa. On the whole, the total number of test sets allocated for
each thread is equally split between DPET and DPET-V. Now when the actual
workload arrives, the thread starts to access the cache sets. Two possibilities
arise here.
Case 1:
The thread can access its own test set. Here, if there is a data miss, the thread
updates its DM register appropriately, depending on whether DPET or DPET-V
was assigned to that set, as discussed in the previous section.
Case 2:
The thread gets to access any of the other sets apart from its own test sets.
Here, either DPET or DPET-V is applied depending on the thread's DM register
value. If this access turns out to be the very first cache access of that thread,
then its DM register will have a value of '0', in which case the policy is chosen
as DPET.
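The per-thread extension can be sketched as follows (again an illustrative model; the class and method names are this sketch's own assumptions):

```python
class ThreadPolicySelector:
    """Per-thread DM registers (sketch): each thread tracks misses in its
    own DPET / DPET-V test sets and picks a policy for the other sets."""

    def __init__(self):
        self.dm = {}  # thread id -> DM register value

    def record_test_set_miss(self, tid, in_dpet_test_set):
        # a miss in the thread's DPET half adds 1; in its DPET-V half, -1
        self.dm[tid] = self.dm.get(tid, 0) + (1 if in_dpet_test_set else -1)

    def policy(self, tid):
        # a thread's very first access sees DM == 0, so DPET is chosen
        return "DPET-V" if self.dm.get(tid, 0) > 0 else "DPET"
```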
3.7. Block Status Based Decision Making
When more than one thread tries to access a data item, it can be regarded as
shared data. Compared to data that is accessed by only one thread (private
data), shared data has to be kept in the cache for a longer duration, because
multiple threads access it at different points of time and, if a miss is
encountered, the resulting overhead would be relatively high. Taking this into
consideration, it is possible to find out whether each block holds shared or
private data (using the id of the thread that accesses it) and decide on the
counter values appropriately. To serve this purpose, a 'blockstatus' register is
associated with every cache block. When a thread accesses the block, its
thread id is stored in the register. When another thread later tries to access the
same block, the value of the register is set to a random number that does not
fall within the threads' id range. So if the register holds such a number, the
corresponding block is regarded as shared; otherwise it is treated as private.
3.8. Modification in the Proposed Policies
In the case of DPET, when a block is found to be shared, insertion is made with
a counter value of '2' instead of '3'. Similarly, in DPET-V, for shared data,
insertion is always made with a PET counter value of '2'. If a hit is encountered
in either of the methods, the PET counter value is set to '0' (as discussed
earlier) only if the block is shared. For private blocks, the counter value is set to
'1' on a hit.
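These rules can be sketched as follows (the fixed sentinel value and the function names are assumptions of this sketch; the thesis stores a random out-of-range value, simplified here to a constant, with thread ids assumed non-negative):

```python
SHARED = -1  # sentinel outside any valid thread-id range (assumption)

def update_block_status(status, thread_id):
    """Track a block's 'blockstatus' register: the first accessor's id marks
    it private; an access by any other thread flips it to the shared sentinel."""
    if status is None:
        return thread_id          # first access: record the owner
    if status != thread_id and status != SHARED:
        return SHARED             # a second thread touched it: now shared
    return status

def insertion_counter(shared):
    # shared blocks enter 'younger' (2) so they stay longer; private enter at 3
    return 2 if shared else 3

def promotion_counter(shared):
    # on a hit, shared blocks become youngest (0); private blocks get 1
    return 0 if shared else 1
```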
Fig 3.6 Data sequence containing a recurring pattern {1,5,7}
3.9. Comparison between LRU, NRU and DPET
3.9.1. Performance of LRU
The behaviour of LRU for the data sequence in Fig 3.6 is shown in Fig 3.7. It is
assumed that there are five blocks in the cache set under consideration. The
workload sequence (divided into bursts) is specified in Fig. 3.7. Below every
data item, 'm' is used to indicate a miss in the cache and 'h' is used to indicate
a hit. The letter 'N' in the cache denotes null or invalid data. The cache set can
be viewed as a queue, with insertions taking place at the top and deletions at
the bottom. Additionally, if a hit occurs, that particular data item is promoted to
the top of the queue, pushing the other elements down by one step. Hence, at
any point of time, the most recently accessed element is kept at the top and the
'least recently used' element is found at the bottom, which becomes the
immediate candidate for replacement. In this example, initially the stream
{1, 5, 7} results in misses and the data is fetched from main memory and placed
in the cache.
Fig 3.7 Behaviour of LRU for the data sequence
Now data item {7} hits and is brought to the top of the queue (in this case it
is already at the top). Data items {1} and {5} result in two more consecutive
hits, bringing {5} to the top. The ensuing elements {9, 6} result in misses and
are brought into the cache one after another. In the burst {5, 2, 1}, only the
data item {2} misses. Then {0} too results in a miss and is brought from the
main memory. In the stream {4, 0, 8}, the data item {0} hits. Finally, the burst
{1, 5, 7} recurs towards the end, where {5, 7} result in two misses, as
indicated in Fig. 3.7. The overall number of misses encountered here amounts to
eleven.
3.9.2. Performance of NRU
The working of NRU on the given data set is shown in Fig. 3.8. NRU
associates a single-bit counter with every cache line, which is initialized to
'1'. When there is a hit, the value is changed to '0'. A '1' indicates that the
data block has remained unreferenced in the cache for a longer period of time
compared to other blocks. Thus, the first block from the top which has a counter
value of '1' is selected as the victim for replacement; scanning from the top
helps in breaking ties. If no such block is found, the counter values of all the
blocks are set to '1' and the search is performed again. In Fig. 3.8, the
counter values affected in every burst are underlined for clarity. As observed
from the figure, NRU results in one more miss than the overall misses
encountered in LRU. Towards the end, the burst {5, 7} results in two misses.
This is primarily because NRU tags the stream {5, 7} as 'not recently used', as
it had not seen these items for some time, and thus chooses them as replacement
victims. In the long run, as the data set grows in a similar fashion, the number
of misses will increase considerably.
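The NRU victim-selection rule described above can be sketched as follows (an illustrative sketch for a single cache set, not the thesis implementation; the list-of-bits representation is an assumption):

```python
def nru_find_victim(ref_bits):
    """Return the index of the first block (scanning from the top) whose
    reference bit is '1'; if none exists, reset all bits to '1' and take
    the first block. ref_bits: list of 0/1 values, one per cache line,
    where 1 means 'not recently used'."""
    for i, bit in enumerate(ref_bits):
        if bit == 1:
            return i
    # No candidate found: reset every bit and restart the scan.
    for i in range(len(ref_bits)):
        ref_bits[i] = 1
    return 0
```

Scanning from index 0 mirrors the top-down tie-breaking rule in the text.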
One feature common to both algorithms is that they follow the same technique
irrespective of the pattern that the data set follows. This can reduce the hit
rate in some cases. In the case of LRU, when an item is accessed by the
processor, it is immediately taken to the top of the queue, pushing the
remaining elements one step down, irrespective of whatever pattern the data set
might follow thereafter. Thus, even if an element in this pushed-down group
repeats itself after some time, it might already have been marked as 'least
recently used' and evicted from the cache. NRU is also not powerful, as it has
only a single-bit counter, which does not allocate enough time for data items
that might be accessed in the near future.
Fig 3.8 Behaviour of NRU for the data sequence
3.9.3. Performance of DPET
Fig. 3.9 shows the contents of the cache at various points when DPET is
applied, along with the burst of input data being processed at that moment.
The PET counter value associated with every block is shown as a subscript to
the actual data item, and the counter value of the block directly affected at
that instant is underlined for clarity. The input data set has the letters 'm'
and 'h' under every data item to indicate a miss or a hit respectively. The
behaviour is studied with a cache set consisting of five blocks. Initially, AMA
assigns a value of '4' to the PET counters associated with all the blocks. As
seen from the figure, the burst {1, 5, 7, 7} arrives first. Data items {1, 5, 7}
are placed in the cache in a bottom-up fashion and their PET values are
decremented by '1'. When the data item {7} arrives again, it hits in the cache
and the counter value associated with {7} is set to '0'. Data items {1, 5}
arrive individually and result in two hits; the corresponding PET values are set
to '0' as indicated in the figure. The next burst contains the elements {9, 6},
both resulting in cache misses. They are fetched from the main memory, stored in
the remaining two blocks of the cache set, and their counter values are
decremented appropriately. The fifth data burst comprises the data items
{5, 2, 1}. Among these, {5} is already present in the cache and hence its
counter value is set to '0'. Data item {2} is not present in the cache, so, as
with DPET, AMA searches bottom-up for a replacement candidate whose counter
value has matured to '4'. Since no such block is found, the PET values of all
the blocks are incremented by '1' and the search is performed again. Now the PET
value of the block containing {9} has matured to '4'; it is replaced by {2},
whose counter value is decremented to '3'. Data item {1} hits in the cache. The
overall number of misses is reduced compared to LRU and NRU.
Fig 3.9 Behaviour of DPET for the data sequence
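The DPET victim search walked through above can be sketched as follows. This is an illustrative approximation based solely on the walkthrough; details such as saturating the counter at the maximum value are assumptions.

```python
PET_MAX = 4  # initial/eviction PET counter value, per the walkthrough

def dpet_find_victim(pet):
    """Scan bottom-up for a block whose PET counter has matured to
    PET_MAX; if none exists, age every block by '1' and scan again.
    pet: list of PET counters for one set, index 0 = bottom of the set.
    Returns the index of the victim block."""
    while True:
        for i in range(len(pet)):          # bottom-up scan
            if pet[i] == PET_MAX:
                return i
        for i in range(len(pet)):          # no candidate: age all blocks
            pet[i] = min(pet[i] + 1, PET_MAX)
```

On the example burst {5, 2, 1}, no counter is initially at '4', so every block ages by one step before the block holding {9} matures and is selected.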
3.10. Results
To evaluate the effectiveness of the replacement policy on real workloads, a
performance analysis has been carried out with the Gem5 simulator using PARSEC
benchmark programs, discussed in Chapter 6, as input sets. The results are
shown in this section.
3.10.1. CB-DPET Performance
The effectiveness of this method is visible in the following figures, as it has
surpassed LRU. Fig. 3.10 shows the overall number of hits obtained for each
workload with respect to CB-DPET and LRU at the L2 cache. The overall number of
hits obtained in the ROI is taken. These numbers have been scaled down by
appropriate factors to keep them within the same range. It is observed from the
figure that CB-DPET has improved the number of hits compared to LRU for the
majority of the workloads, with fluidanimate showing a high performance with
respect to the baseline policy.
Fig 3.10 Overall number of hits recorded in L2 cache
Fig. 3.11 shows the hit rate obtained for individual benchmarks. Hit rate is given
by the following expression.
Hit Rate = (Overall number of hits in ROI) / (Overall number of accesses in ROI) ……… (1)
Six of the seven benchmarks have reported improvements in the hit rate.
Fluidanimate shows a maximum improvement of 8%, whereas swaptions and vips have
exhibited hit rates on par with LRU.
Fig. 3.12 shows the number of hits obtained in the individual parallel phases.
As discussed earlier, the ROI can be further divided into two phases, denoted
PP1 (Parallel Phase 1) and PP2 (Parallel Phase 2). These phases need not be
equal in complexity, number of instructions, etc.; PP1 might be more complex
with a huge instruction count compared to PP2, or vice versa. The number of hits
recorded in these phases is shown in the graph in Fig. 3.12. CB-DPET has shown
an increase in hits in most of the phases of the majority of the benchmarks.
Fig 3.11 L2 cache hit rate
Miss latency at a memory level refers to the time spent by the processor to
fetch data from the next level of memory (the primary memory in this case)
following a miss at that level. Miss latency at the L2 cache for individual
cores is shown in Fig. 3.13. It is usually measured in cycles. To present it in
a more readable format, it has been converted to seconds using the following
formula:
Miss latency (in seconds) = (Miss latency in cycles) / (CPU clock frequency) ……… (2)
The impact of miss latency is an important factor to take into consideration. A
greater miss latency means that the overhead involved in fetching the data is
high, resulting in a performance bottleneck.
Fig 3.12 Hits recorded in individual phases of ROI
Fig 3.13 Miss latency at L2 cache for individual cores
Overall miss latency at L2 is also shown in Fig. 3.13. Note that this is with
respect to the ROI of the benchmarks. Considerable reduction in latency is
observed across the majority of the benchmarks. Vips has shown the maximum
reduction in overall miss latency. Except for ferret and swaptions, all the
other workloads have shown a reduction in the overall miss latency.
Percentage increase in hits = ((Hits in CB-DPET − Hits in LRU) / Hits in LRU) × 100 ……… (3)
Fig. 3.14 shows the percentage increase in hits obtained by CB-DPET compared to
LRU. The figure shows an increase in the hit percentage for all the benchmarks
except ferret. Swaptions exhibits the highest percentage increase of 13.5%,
whereas dedup has shown a 2% increase. An average improvement of approximately
8% over LRU in the hit percentage is obtained when CB-DPET is applied at the L2
cache level. This is because CB-DPET adapts to the changing pattern of the
incoming workload and retains the frequently used data in the cache, thus
maximizing the number of hits. Fig. 3.15 shows the overall number of misses
recorded by each benchmark. A decrease in the number of misses is observed
across the majority of the benchmarks compared to LRU.
Fig 3.14 Percentage increase in L2 cache hits
Fig 3.15 Overall number of misses at L2 Cache
3.11. Summary
When it comes to concurrent multi-threaded applications in a CMP environment,
the LLC plays a crucial role in the overall system efficiency. Conventional
replacement strategies may not be effective in many instances, as they do not
adapt dynamically to the data requirements. Numerous research efforts
[45,46,79,90] are being made to find novel ways of improving the efficiency of
the replacement algorithms applied at the LLC. Both LRU and NRU tend to perform
poorly on data sets possessing irregular patterns, because they do not store any
extra information regarding the recent access pattern of the data items; they
make decisions purely based on the current access [15,45,51,79]. This chapter
has hence proposed a scheme called CB-DPET which addresses the shortcomings of
the conventional replacement methods and improves performance.
The conclusions of this chapter are as follows:
CB-DPET takes a more systematic and access-history based decision
compared to LRU. A counter associated with every cache block tracks the
behaviour of the application access pattern and guides the replacement,
insertion and promotion decisions.
DPET-V is a thrash proof version of DPET. Here some randomly selected
data items are allowed to stay in the cache for a longer time by adjusting the
counter values appropriately.
In multi-threaded applications, a DM register and some test sets are
allocated for every thread. Based on the number of misses in the test sets, a
best policy is chosen for the remaining sets.
These algorithms are made thread-aware using the thread id. Based on the
thread id, the policy identifies which thread is accessing a cache block; when
multiple threads access the same block, it is tagged as shared and allowed to
stay in the cache longer than other blocks by appropriately adjusting its
counter values.
Evaluation results on PARSEC benchmarks have shown that CB-DPET achieves an
average improvement of up to 8% in the overall hits when compared to LRU at the
L2 cache level.
4. LOGICAL CACHE PARTITIONING TECHNIQUE
It is imperative for any level of cache memory in a multi-core architecture to
have a well-defined, dynamic replacement algorithm in place to ensure
consistently superlative performance [92,93,94,95]. At the L2 cache level, the
most prevalently used LRU replacement policy does not adapt dynamically to
changes in the workload. As a result, it can lead to sub-optimal performance
for applications whose workloads exhibit frequently fluctuating patterns.
Hence, many works have strived to improve performance at the shared L2 cache
level [91,98,99].
To overcome the limitation of the conventional LRU approach, this chapter
proposes a novel counter-based replacement technique which logically partitions
the cache elements into four zones based on their 'likeliness' to be referenced
by the processor in the near future. Experimental results obtained using the
PARSEC benchmarks have shown almost 9% improvement in the overall number of
hits and 3% improvement in the average cache occupancy percentage when compared
to the LRU algorithm.
4.1. Introduction
Cache hits play a vital role in boosting system performance. When there are
more hits, the number of transactions with the next levels of memory is
reduced, thereby expediting the overall data access time. Many replacement
techniques use the cache miss metric to determine the replacement victim, but
in this chapter, the victim is chosen based on the number of hits received by
the data items in the cache. Based on the hits received, the cache blocks are
logically partitioned into separate zones. These virtual zones are searched in
a pre-determined order to select the replacement candidate.
The contributions of this chapter include:
A novel counter-based replacement technique that logically partitions the
cache into four different zones, namely Most Likely to be Referenced (MLR),
Likely to be Referenced (LR), Less Likely to be Referenced (LLR) and Never
Likely to be Referenced (NLR), in decreasing order of their likeliness factor.
Replacement, insertion and promotion of data elements within these zones in
such a manner that the overall hit rate is maximized.
Association of a 3-bit counter with every cache line to categorize the
elements into the different zones. On a cache hit, the corresponding element is
promoted from one zone to the next.
Selection of replacement candidates from the zones in ascending order of
their likeliness factor, i.e. the first search space for the victim is the NLR
zone, followed by the subsequent zones until the MLR zone is reached.
Periodic zone demotion of elements to make sure that stale data does not
pollute the cache.
4.2. Logical Cache Partitioning Method
LCP uses a 3-bit counter to dynamically shuffle the cache elements and logically
partition them into four different zones based on their likeliness to be
referenced by the processor in the near future. This counter is referred to as
the LCP (Logical Cache Partitioning) counter in subsequent sections. The lower
and upper bounds of the counter are set to '0' and '7' respectively. The
prediction about the likeliness of reference of the data items is made with the
help of the hits encountered in the cache. As the number of hits for a
particular element increases, it is moved up the zone list until it reaches the
MLR region.
Only if it stays unreferenced for a considerable amount of time is it evicted
from the cache. As their names imply, the zones are arranged in decreasing
order of their likeliness factor. Table 4.1 shows the counter values and their
corresponding zones. Any replacement policy consists of three phases,
replacement, insertion and promotion, and so does this method. Each is
explained in the subsequent sub-sections.
Table 4.1 LCP zone categorization
Counter Value Range    Zone
6-7                    MLR
3-5                    LR
1-2                    LLR
0                      NLR
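The counter-to-zone mapping in Table 4.1 can be expressed directly as a small function (an illustrative sketch; the function name is an assumption):

```python
def lcp_zone(counter):
    """Map a 3-bit LCP counter value (0-7) to its zone per Table 4.1."""
    if counter >= 6:
        return "MLR"   # Most Likely to be Referenced
    if counter >= 3:
        return "LR"    # Likely to be Referenced
    if counter >= 1:
        return "LLR"   # Less Likely to be Referenced
    return "NLR"       # Never Likely to be Referenced
```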
4.3. Replacement Policy
When a miss is encountered in the cache, the data item is fetched from the
next level of memory and brought into the cache, replacing one of the elements
already present. This element is referred to as the victim. The process of
selecting a victim needs to be efficient in order to improve the hit rate. In
this method, the victim is selected from the zones in increasing order of their
likeliness factor. Any cache line in the NLR zone is considered first for
replacement. If no such line is found, the search is repeated for the LLR
region, and so on until MLR is reached. If two or more lines possess the
counter value under consideration in an iteration, the line encountered first
is chosen as the victim.
Once the replacement is made, the LCP counters of all the other data elements
are decremented by '1'. This carries out the gradual zone demotion discussed
earlier, flushing unused data items out of the cache.
4.4. Insertion Policy
This phase is entered as soon as the replacement candidate is found. The new
incoming data item is inserted into the corresponding cache set and its LCP
counter value is set to '2' (LLR zone).
4.5. Promotion Policy
A hit on any data item in the cache triggers the promotion phase. The LCP
value associated with the cache line is raised to the highest value of its
immediate upper zone. For example, if the line was in the LLR zone (LCP value
'1' or '2'), its LCP value is set to '5', i.e. the element now falls within the
LR zone. By moving elements between the zones according to the workload
pattern, LCP is more dynamic than LRU.
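The promotion rule can be sketched as follows (an illustrative sketch, not the thesis implementation; the function name is an assumption):

```python
def lcp_promote(counter):
    """On a hit, raise the LCP counter to the top value of the next
    higher zone: NLR (0) -> LLR top (2), LLR (1-2) -> LR top (5),
    LR (3-5) -> MLR top (7); an MLR line stays at 7."""
    if counter == 0:
        return 2          # NLR -> LLR
    if counter <= 2:
        return 5          # LLR -> LR
    return 7              # LR -> MLR (or already MLR)
```

Returning 7 for any counter above 2 also covers the boundary condition: a line already in MLR cannot overshoot the 3-bit range.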
[Flowchart: the processor looks for a particular block 'c' in the cache and the
LCP algorithm starts the search. On a HIT, the LCP counter value is set to the
maximum value of the next higher zone. On a MISS, a replacement victim is found
in the cache and the processor brings the data item from the next level of
memory.]
Fig 4.1(i) LCP flow diagram
4.6. Boundary Condition
It is essential that the LCP counter value does not overshoot its specified
range. Thus whenever it is modified, a boundary condition check is carried out to
ensure that the value does not go below „0‟ or beyond „7‟. Fig. 4.1(i), 4.2(ii)
shows the working of LCP.
[Flowchart: starting from zone 'X', where 'X' denotes NLR initially, the
algorithm searches for the first block 'c' whose LCP counter value falls within
the range of zone 'X'. If no such block is found, 'X' is incremented to point to
the next higher zone and the search repeats. Once found, 'c' is chosen as the
replacement victim and the new data item is inserted with its LCP counter value
set to '2' (LLR zone).]
Fig 4.1 (ii) Victim Selection flow for LCP
Algorithm 2: Implementation for Cache Miss, Hit and Insert Operations
Init:
Initialize all blocks with LCP counter value '0'.
Input:
A cache set instance
Output:
removing_index /** Block id of the victim **/
Miss:
/** Invoke FindVictim method. The set is passed as parameter **/
FindVictim (set):
i = 0
removing_index = -1
while i < 8 do
/** Scanning the set **/
for j = 0 to associativity do
/** Search from the NLR zone upwards. LCP counter = i **/
if set.blks[j].lcp == i then
/** Block found. Exit loop **/
removing_index = j;
break;
if removing_index != -1 then
break;
i = i + 1;
for j = 0 to associativity do
/** Gradual zone demotion **/
if set.blks[j].lcp > 0 then
set.blks[j].lcp = set.blks[j].lcp - 1;
return removing_index
Hit:
/** Boundary condition check **/
if set.blk.lcp != 7 then
/** NLR zone. Promote to LLR **/
if set.blk.lcp == 0 then
set.blk.lcp = 2;
/** LLR zone. Promote to LR **/
else if set.blk.lcp == 1 or set.blk.lcp == 2 then
set.blk.lcp = 5;
/** LR zone. Promote to MLR **/
else
set.blk.lcp = 7
Insert:
/** LLR zone **/
set.blk.lcp = 2;
Correctness of the Algorithm
Invariant: At any iteration i, the LCP counter value of set.blk[i] holds a
value in the range [0, 7].
Initialization:
When i = 0, set.blk[i].lcp is '0' initially. After the block is selected as
victim and the new item has been inserted, set.blk[i].lcp is set to '2'. Hence
the invariant holds at initialization.
Maintenance:
At i = n, set.blk[i].lcp will contain
either '0' (cache block yet to be occupied for the first time), or
'2' (if it has been newly inserted), or
'1', '3', '4' or '6' (as a part of gradual zone demotion), or
'5' or '7' (on a cache hit from the previous zone).
The "if blk.lcp != 7" condition ensures that the LCP counter value does not
overshoot its range. Hence the invariant holds for i = n, and likewise for
i = n + 1, as the individual iterations are mutually exclusive.
Fig 4.2 Working of LRU and LCP for the data set
Termination:
The increment of the loop control variable 'i' ensures that the loop
terminates within a finite number of iterations, which is '8' in this case. The
condition "if set.blks[j].lcp > 0" ensures that the LCP counter value of a
block does not go below '0'. The invariant also holds for all the cache blocks
present when the loop terminates.
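The miss path of Algorithm 2 can be translated into a short runnable sketch (an illustrative approximation operating on a plain list of counters; the function name is an assumption):

```python
def lcp_miss(lcp):
    """Runnable sketch of Algorithm 2's miss path: find a victim by
    scanning the counter levels from the NLR zone (0) upwards, then
    apply gradual zone demotion to every occupied block.
    lcp: list of LCP counters for one set. Returns the victim index."""
    removing_index = -1
    for level in range(8):                 # counter values 0..7
        for j in range(len(lcp)):
            if lcp[j] == level:
                removing_index = j
                break
        if removing_index != -1:
            break
    for j in range(len(lcp)):              # gradual zone demotion
        if lcp[j] > 0:
            lcp[j] -= 1
    return removing_index
```

After the victim is returned, the insert phase would set its counter to '2' (LLR zone), as in the algorithm.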
4.7. Comparison with LRU
The following data set is considered:
. . . 2 9 1 7 6 1 7 5 9 1 0 7 5 4 8 3 6 1 7 . . .
It is observed that data items 1 and 7 occur frequently. Fig. 4.2 shows the
working of LRU and LCP on this data pattern. Assume that the cache can hold 4
blocks at a time. The incoming data items at each point of time are shown in
the leftmost column.
In the rightmost column, 'm' indicates a miss and 'h' indicates a hit.
Initially the cache contains invalid blocks, and the counter values associated
with all the blocks are set to '-1'. After applying both techniques, LCP
results in 3 more hits than LRU. The frequently occurring data items 1 and 7
result in hits towards the end, unlike in LRU. This is primarily because LRU
follows the same approach irrespective of the workload pattern and tags 1 and 7
as 'least recently used'.
4.8. Results
Full-system simulation using the Gem5 simulator with PARSEC benchmarks
(discussed in Chapter 6) as input sets shows improvement in many cache metrics.
To measure the performance of LCP, metrics such as the increase in hit
percentage, the number of replacements made at L2, the average cache occupancy
and the miss rate are used. Results are shown for two-, four- and eight-core
systems.
4.8.1. LCP Performance
In the following graphs, the term LCP is used to represent this method. The
percentage increase in the overall number of hits at L2 is shown in Fig. 4.3.
Fig 4.4 Difference in number of replacements made at L2 cache
Fig 4.3 Percentage increase in overall number of hits at L2 cache
Fig. 4.4 shows the total number of replacements made at L2. Fig. 4.5 shows
the average L2 cache occupancy percentage for all the workloads. Fig. 4.6
shows the miss rate recorded by the individual cores as well as the overall
miss rate at L2.
Fig 4.5 Average cache occupancy percentage
Miss rate is defined as the ratio of the number of misses to the total number
of accesses. As observed from Fig. 4.6, there is a reduction in the core-wise
and overall miss rates across the majority of the benchmarks when compared to
LRU.
Apart from two workloads, all the others have shown a significant improvement
in hits. The percentage increase in hits varies from a minimum of 0.1% (dedup)
to a maximum of about 11% (ferret), as shown in Fig. 4.3.
The number of replacements reflects the efficiency of any replacement
algorithm. A maximum reduction of almost 20% in the number of replacements is
observed across the given workloads compared to LRU, as seen in Fig. 4.4.
Cache occupancy refers to the amount of cache that is being effectively
utilized to improve performance for a workload. Fig. 4.5 indicates that cache
utilization is higher for the majority of the benchmarks when LCP is applied
compared to LRU.
Vips utilizes the cache space to the maximum possible extent.
Fig 4.6 Miss rate recorded by individual cores
4.9. Summary
This chapter has discussed a dynamic and structured replacement technique
that can be adopted at the LLC. The key points of this replacement policy are
as follows:
Elements of the cache are logically partitioned into four zones based on
their likeliness to be referenced by the processor, with the help of a 3-bit
LCP counter associated with every cache line. The minimum value the counter
can hold is '0' and the maximum is '7'.
Replacement candidates are chosen from the zones in increasing order of
their likeliness factor, starting from the NLR zone. Initially all the counter
values are set to '-1'; conceptually, all the blocks contain invalid data.
For every hit, the corresponding element is moved up by one zone by
adjusting the LCP counter value. If it has reached the topmost zone (MLR),
the counter value is left untouched.
For every miss, the counter values are decremented by '1' to prevent stale
data items from polluting the cache as the algorithm executes. When a new data
item arrives, its LCP value is set to '2'. A boundary-condition check is
applied whenever the LCP counter value is modified to make sure that it does
not overshoot its designated range.
Experimental results obtained by applying the method on the PARSEC
benchmarks have shown a maximum improvement of 9% in the overall number of
hits and a 3% improvement in the average cache occupancy percentage when
compared to the conventional LRU approach.
5. SHARING AND HIT BASED PRIORITIZING TECHNIQUE
The efficiency of cache memory can be enhanced by applying different
replacement algorithms. An optimal replacement algorithm must exhibit good
performance with minimal overhead. Traditional replacement methods currently
deployed across multi-core architectures try to classify elements purely based
on the number of hits they receive during their stay in the cache. In
multi-threaded applications, data can be shared by multiple threads (which
might run on the same core or across different cores). Those data items need to
be given higher priority than private data, because a miss on them may stall
the functioning of multiple threads, resulting in a performance bottleneck.
Since traditional replacement approaches do not possess this additional
capability, they can lead to inadequate performance for many current
multi-threaded applications.
To address this limitation, this chapter proposes a Sharing and Hit based
Prioritizing (SHP) replacement technique that takes the sharing status of the
data elements into consideration while making replacement decisions.
Evaluation results obtained using multi-threaded workloads derived from the
PARSEC benchmark suite show an average improvement of 9% in the overall hit
rate when compared to the LRU algorithm.
5.1. Introduction
In multi-threaded applications, when threads run across different cores, the
activity in the L2 cache tends to shoot up as multiple threads try to store and
retrieve shared data. At this point, there are many challenges that need to be
addressed: cache coherence has to be maintained, the replacement decisions made
need to be judicious, and the available cache space must be utilized
efficiently. When there is a miss on a data item which is shared by multiple
threads, the execution of many threads gets stalled. This is because, while the
first thread that encountered the miss is fetching the data from the next level
of memory, any subsequent thread that tries to access the same data will also
miss. Hence, it becomes imperative to handle shared data with more care than
other data items. Traditional replacement algorithms like LRU, MRU, etc. do not
possess this capability. They classify elements based on when they will be
required by the processor but do not check the status of the cache block, i.e.
whether it is shared or private at any point of time. Also, when thousands of
threads are running in parallel, it may not be very useful to tag a block
simply as 'shared' or 'private'. In those cases, additional information about
the sharing status can prove handy. This chapter provides an overview of a
novel counter-based cache replacement technique that associates a 'sharing
degree' with every cache block. This degree specifies a set of ranges: not
shared (or private), lightly shared, heavily shared and very heavily shared.
This information, combined with the number of hits received by the element
during its stay in the cache, is used to produce a priority for that element,
based on which judicious replacement decisions are made.
The contributions of this chapter include:
A novel counter-based replacement algorithm that makes replacement
decisions based on the sharing nature of the cache blocks.
Association of a 'sharing degree' with every cache line, which indicates
the extent to which the element is shared, depending on the number of threads
that access it.
Four degrees of sharing, namely not shared (or private), lightly shared,
heavily shared and very heavily shared.
A replacement priority for every cache line, arrived at by combining the
sharing degree with the number of hits received by the element, based on which
the replacement decisions are made.
5.2. Sharing and Hit Based Prioritizing Method
Similar to the other two replacement algorithms, every cache block is
associated with a counter. This 2-bit counter is called the Sharing Degree
counter, or SD counter. Table 5.1 shows all four possible values the counter
can hold and their individual descriptions. Depending on the number of threads
that access a cache block, it is classified into one of the sharing categories
shown in the table. To collect the sharing status of the cache blocks, it is
essential to have an efficient data structure in place to track the number of
threads that access each data item. For this purpose, a filter referred to as
the Thread Tracker filter (TT filter) is used. It is a flexible, dynamic,
software-based array that is created and associated with every cache block at
run time.
When a thread tries to access a data item, a search is conducted in the TT
filter to check whether the thread id is already present in it. If not, the id
is stored in the corresponding cache block's TT filter. The size of the filter
expands as thread ids get added to it. Once a cache block is about to be
evicted, the memory allocated for the associated TT filter is freed. At any
point of time, the sharing degree counter of every cache block is populated
with the help of the thread ids found in its TT filter.
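The TT filter and the derivation of the sharing degree can be sketched as follows. This is an illustrative approximation: the thesis does not specify the thread-count thresholds for each category, so the 'light' and 'heavy' cut-offs below are assumptions.

```python
class TTFilter:
    """Sketch of the Thread Tracker filter: a per-block, growable
    collection of thread ids used to derive the 2-bit sharing degree."""

    def __init__(self):
        self.thread_ids = set()

    def record(self, thread_id):
        # Only new thread ids are added; duplicates are ignored.
        self.thread_ids.add(thread_id)

    def sharing_degree(self, light=2, heavy=4):
        """Map the thread count to a 2-bit sharing degree:
        0 = private, 1 = lightly shared, 2 = heavily shared,
        3 = very heavily shared. Thresholds are assumed values."""
        n = len(self.thread_ids)
        if n <= 1:
            return 0
        if n <= light:
            return 1
        if n <= heavy:
            return 2
        return 3
```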
5.3. Replacement Priority
The sharing status of the blocks alone cannot be used to make replacement
decisions. The number of hits garnered by an element is also an important
factor to consider when performing a replacement. Hence, a priority is computed
for every individual cache block that not only gives more weightage to the
sharing nature but also attaches importance to the number of cache hits
garnered by the element. The priority is computed as follows:

Priority = ((Sharing Degree + 1) × Hit Count) / 2 ……… (4)
Sharing degree indicates the sharing degree counter value of the particular
cache block at that point of time. A software-based hit counter associated
with every cache block provides an estimate of the hits received by the
element during its stay in the cache. This counter is refreshed every time
a new element enters the block. The upper bound of the counter is fixed at a
specific value; once that value is reached, the counter is considered saturated
and any further hits are not taken into account. Experimental results have
shown that in a unified cache memory, instructions form about 50% of the
contents on average and data forms the remaining 50%. The priority calculation
focuses mainly on the data part, as it is accessed much more frequently than
instructions. Also, due to locality of reference, several threads can access
the same data several times. To reflect the former, the hit count value is
divided by 2, and to reflect the latter, the hit count value is scaled by a
factor derived from the sharing degree counter value. Whenever either of
these counters changes, the priority of the element is recomputed.
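Equation (4) together with the saturating hit counter can be sketched as follows; the saturation bound of 15 is an assumed value, since the text fixes an upper bound without stating it:

```python
HIT_COUNTER_MAX = 15   # assumed saturation bound for the hit counter

def priority(sharing_degree, hit_count):
    """Equation (4): Priority = (Sharing Degree + 1) * (Hit Count / 2)."""
    hit_count = min(hit_count, HIT_COUNTER_MAX)   # saturating counter
    return (sharing_degree + 1) * hit_count / 2
```

A heavily shared block (degree 2) with 6 hits receives priority 9.0, while a private block (degree 0) with the same 6 hits receives only 3.0, so shared data is retained longer.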
The cache replacement policy consists of deletion, insertion and promotion
operations. Each phase of the algorithm is explained in detail in the
subsequent sections. The mapping method used is set associative mapping.
5.4. Replacement Policy
When the cache becomes full, a replacement has to be made to pave the way
for new incoming data items. SHP makes replacement decisions based on the
computed priority values. Elements are evicted in increasing order of their
priority: the element with the least priority is chosen as the victim. If
more than one element shares the least priority, the one encountered first
while scanning the cache is taken as the victim, as a tie-breaking mechanism.
It is also essential to ensure that stale data does not pollute the cache for
long periods of time.
From this aspect, the hit counter can be regarded as an ageing counter. Every
time a victim is found, the hit counters of all the other elements in the cache
are decremented and their corresponding priorities are re-computed. If any
element remains unreferenced for a long period of time, its priority gradually
decreases and the element is eventually flushed out of the cache.
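This ageing step can be sketched as follows (the block fields and names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class AgedBlock:
    """Minimal stand-in for a cache block's SHP bookkeeping state."""
    sharing_degree: int = 0
    hit_counter: int = 0
    priority: float = 0.0

def age_cache_set(blocks, victim_index):
    """After a victim is chosen, decay every other block's hit counter
    and re-compute its Equation (4) priority."""
    for i, blk in enumerate(blocks):
        if i == victim_index:
            continue
        blk.hit_counter = max(0, blk.hit_counter - 1)   # never below zero
        blk.priority = (blk.sharing_degree + 1) * blk.hit_counter / 2
```

With every eviction the surviving blocks lose one hit each, so an unreferenced block's priority drifts toward zero until it becomes the victim itself.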
Algorithm 3: Implementation for Cache Miss, Hit and Insert Operations
Input:
A cache set instance
Output:
removing_index /** Block id of the victim **/
Miss:
/** Invoke the FindVictim method. The set is passed as a parameter **/
FindVictim(set)
FindVictim(set):
min = ∞
for i = 0 to associativity - 1 do
/** Choose the minimum priority element **/
if set.blk[i].priority < min then
min = set.blk[i].priority;
removing_index = i;
return removing_index;
Hit:
increment set.blk.hit_counter;
/** Scan the TT Filter **/
thread_found = 0;
for i = 0 to length(TT_Filter) - 1 do
if TT_Filter[i] == accessing_thread_id then
thread_found = 1;
/** If the thread id is not found in the TT Filter, append it **/
if thread_found != 1 then
append accessing_thread_id to TT_Filter;
set set.blk.sharing_degree counter value [0 to 3] according to TT_Filter length;
priority = (set.blk.sharing_degree + 1) * (set.blk.hit_counter) / 2;
Insert:
/** Private block **/
set.blk.sharing_degree = 0;
refresh blk.hit_counter;
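Algorithm 3 can be rendered as a runnable sketch; the set layout and helper names are illustrative rather than the GEM5 implementation:

```python
import math

class Block:
    """One cache block with the SHP bookkeeping state (illustrative layout)."""
    def __init__(self):
        self.sharing_degree = 0
        self.hit_counter = 0
        self.tt_filter = []
        self.priority = 0.0

def sharing_degree_of(num_threads):
    """Table 5.1 mapping from thread count to the 2-bit counter value."""
    return 0 if num_threads <= 1 else 1 if num_threads <= 3 \
        else 2 if num_threads <= 7 else 3

def recompute_priority(blk):
    blk.priority = (blk.sharing_degree + 1) * blk.hit_counter / 2  # Eq. (4)

def find_victim(blocks):
    """Miss path: the first block with the minimum priority is the victim."""
    removing_index, lowest = 0, math.inf
    for i, blk in enumerate(blocks):
        if blk.priority < lowest:     # strict '<' keeps the earliest tie
            lowest, removing_index = blk.priority, i
    return removing_index

def on_hit(blk, accessing_thread_id):
    """Hit path: count the hit, update the TT filter and sharing degree."""
    blk.hit_counter += 1
    if accessing_thread_id not in blk.tt_filter:
        blk.tt_filter.append(accessing_thread_id)
    blk.sharing_degree = sharing_degree_of(len(blk.tt_filter))
    recompute_priority(blk)

def on_insert(blk):
    """Insert path: the incoming block starts private with a fresh counter."""
    blk.sharing_degree = 0
    blk.hit_counter = 0
    blk.tt_filter = []
    recompute_priority(blk)
```

For instance, after two threads hit one block and a single thread hits another, an untouched block (priority 0) is chosen as the victim, with the lowest-indexed zero-priority block winning the tie.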
Correctness of Algorithm
Invariant: At any iteration i, the sharing degree counter of set.blk[i] holds a
value in the range [0, 3].
Initialization:
When i = 0, set.blk[i].sharing_degree is initially set to '0'. Hence the
invariant holds at initialization.
Maintenance:
At i = n, set.blk[i].sharing_degree contains either '0' (Private),
'1' (Lightly Shared), '2' (Heavily Shared) or '3' (Very Heavily Shared).
Hence the invariant holds for i = n. The same applies for i = n + 1, as the
individual iterations are mutually exclusive. Thus the invariant holds for
i = n + 1.
Termination:
The priority computation loop is bounded by the associativity of the cache,
which is always finite. The TT filter iteration is bounded by the size of the
TT filter, which equals the total number of executing threads and is therefore
also finite. The invariant can also be seen to hold for all the cache blocks
present when the loops terminate.
Fig 5.1 (i) SHP flow diagram: the processor looks for a block 'c' in the cache
and the SHP algorithm starts the search; on a HIT, the hit counter is
incremented, the TT filter and sharing degree counter are updated if needed,
and the priority is re-computed; on a MISS, a replacement victim is found in
the cache and the data item is brought in from the next level of memory.
Fig 5.1 (ii) Victim selection flow for SHP: the algorithm searches the cache
for the first block 'c' with the minimum priority value, chooses it as the
replacement victim, inserts the new data item in its place, sets the sharing
degree counter to '0' (private) and re-computes the priority.
Table 5.1 Sharing degree values and their descriptions

Number of Accessing Threads | Sharing Degree Counter Value | Nature of Sharing
1                           | 0                            | Private/Not Shared
2-3                         | 1                            | Lightly Shared
4-7                         | 2                            | Heavily Shared
8-10                        | 3                            | Very Heavily Shared
5.5. Insertion Policy
After evicting the victim, the new data item is inserted into the cache. Its
sharing degree counter is set to '0' to indicate that the incoming block is
currently not shared by any other thread, and the corresponding hit counter
is refreshed.
5.6. Promotion Policy
When a cache hit happens, the hit counter of the corresponding block is
incremented. The accessing thread id is also compared against the ids already
present in the TT filter and, if it is not present there, it is added to the
TT filter. The sharing degree counter is adjusted accordingly and the priority
is re-computed. Figs. 5.1(i) and 5.1(ii) show the basic flow of the SHP
technique.
5.7. Results
To evaluate the performance of SHP, simulations have been carried out with
the GEM5 simulator using PARSEC workloads (discussed in Chapter 6), and the
results are compared with LRU. Most of the applications benefit from this
replacement policy. The results are shown in this section.
5.7.1. SHP Performance
It is observed from the following figures that the performance of the proposed
replacement technique is better than LRU for multithreaded workloads. The
overall number of hits obtained at the LLC for SHP compared to LRU is shown
in Fig. 5.2, where blackscholes shows the maximum improvement. On average, a
9% improvement in the overall number of hits is observed across the given
benchmarks.
Fig 5.2 Overall number of hits at L2
Fig. 5.3 shows the overall number of replacements made at L2 for SHP and
LRU. The more replacements that are made, the greater the overhead involved,
so it is always desirable to keep this parameter as low as possible. With this
method, the average number of replacements made at L2 has decreased compared
to LRU. Among the benchmarks, blackscholes exhibits the largest reduction in
replacements (1.6% fewer than LRU) and swaptions shows a 0.55% decrease
compared to LRU.

Fig. 5.4 shows the L2 miss rate for different workloads. The miss rate is
computed from the overall number of misses and the overall number of accesses.
To obtain high performance, the LLC must provide a low miss rate. The majority
of the benchmarks show improvement in this metric compared to LRU, with dedup
showing superior performance.
Fig 5.3 Percentage decrease in number of replacements made at L2
Fig 5.4 Overall miss rate and core-wise miss rate at L2 cache
5.8. Summary
Shared data plays a critical role in determining the performance of cache
memory systems, especially in a multi-threaded environment. The conventional
LRU approach does not attach importance to such data, and hence this chapter
proposes a novel counter-based prioritizing algorithm.
- Every cache block is associated with a 2-bit sharing degree counter which
ranges from 0 to 3.
- A dynamic, software-based TT filter is associated with every block to keep
track of the threads that are accessing that block.
- A hit counter keeps track of the hits received by the data item.
- The values of the sharing degree and hit counters are used to compute a
priority for each cache block.
- This priority is then used to make judicious replacement decisions.
Evaluation results have shown an improvement of up to 9% on average in the
overall number of hits when compared to the traditional LRU approach.
6. EXPERIMENTAL METHODOLOGY
This section outlines the simulation infrastructure and benchmarks used to
evaluate the dynamic replacement policies. To evaluate the proposed
techniques, the Gem5 simulator [102] is used, with the PARSEC benchmark
suite [103,104,105] as the input workloads.
6.1. Simulation Environment
For simulation studies, Gem5 - a modular, discrete, event driven, completely
object oriented, computer system open source simulator platform, has been
chosen. It can simulate a complete system with devices and an operating
system, in full system mode (FS mode), or user space only programs where
system services are provided directly by the simulator in syscall emulation mode
(SE mode). The FS mode is currently being used. It is also capable of simulating
a variety of Instruction Set Architectures (ISAs).
Table 6.1 Summary of the key characteristics of the PARSEC benchmarks

Program      | Application Domain | Working Set
Blackscholes | Financial Analysis | Small
Canneal      | Computer Vision    | Medium
Dedup        | Enterprise Storage | Unbounded
Ferret       | Similarity Search  | Unbounded
Fluidanimate | Animation          | Large
Swaptions    | Financial Analysis | Medium
Vips         | Media Processing   | Medium
x264         | Media Processing   | Medium
6.2. PARSEC Benchmark
The Princeton Application Repository for Shared-Memory Computers
(PARSEC) is a benchmark suite that comprises numerous large scale
commercial multi-threaded workloads for CMP. For simulation purposes, seven
workloads have been picked up from PARSEC suite. Every workload is divided
into three phases: an initial sequential phase, a parallel phase (which is further
split into two phases), and a final sequential phase.
This work focuses on the performance obtained in the parallel phase, which is
also known as the Region of Interest (ROI). All the statistics collected to
measure cache performance correspond to the ROI of the workloads. The basic
characteristics of the chosen benchmarks are shown in Table 6.1, and Table
6.2 lists the cache configuration parameters used for simulation.
Table 6.2 Cache architecture configuration parameters

Parameter                                         | L1 Cache                     | L2 Cache
Total Cache Size                                  | i-cache 32 kB, d-cache 64 kB | 2 MB
Associativity                                     | 2-way                        | 8-way
Miss Information/Status Handling Registers (MSHR) | 10                           | 20
Cache Block Size                                  | 64 B                         | 64 B
Latency                                           | 1 ns                         | 10 ns
Replacement Algorithm                             | LRU                          | CB-DPET
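As a consistency check on Table 6.2, the set counts implied by the configuration (used again in the overhead calculations of Section 6.4) follow from size / (associativity × block size):

```python
def num_sets(cache_bytes, assoc, block_bytes):
    """Sets in a set-associative cache: size / (ways * block size)."""
    return cache_bytes // (assoc * block_bytes)

l2_sets = num_sets(2 * 1024 * 1024, 8, 64)   # 2 MB, 8-way, 64 B blocks
l1d_sets = num_sets(64 * 1024, 2, 64)        # 64 kB d-cache, 2-way
```

This reproduces the 4096 L2 sets assumed in the storage overhead calculations.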
6.3. Evaluation of Many-Core Systems
Evaluation using PARSEC benchmarks for all three methods on two, four and
eight core systems with a 2 MB LLC has yielded better results than LRU in
terms of parameters such as the number of hits, hit rate, throughput and IPC
speedup. Throughput is measured from the total number of instructions executed
per clock cycle (Instructions Per Cycle, or IPC). In multi-threaded
applications, the throughput is the sum of the individual thread IPCs:

Throughput = Σᵢ₌₁ⁿ IPCᵢ ……… (5)

IPC Speedup of Algorithm k over LRU = IPCₖ / IPC_LRU

where 'n' refers to the total number of threads and k refers to any one of the
methods CB-DPET, LCP or SHP.
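Equation (5) and the speedup ratio can be sketched directly; the IPC values in the usage note are illustrative, not measured results:

```python
def throughput(per_thread_ipc):
    """Equation (5): throughput is the sum of the individual thread IPCs."""
    return sum(per_thread_ipc)

def ipc_speedup(ipc_method, ipc_lru):
    """IPC speedup of a method (CB-DPET, LCP or SHP) over the LRU baseline."""
    return ipc_method / ipc_lru
```

For example, two threads running at 0.8 and 0.7 IPC give a throughput of 1.5; if a method lifts that to 1.59, its speedup over LRU is 1.06.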
Evaluation results using PARSEC benchmarks for two cores show an average
improvement in IPC speedup of up to 1.06, 1.05 and 1.08 times that of LRU
for the CB-DPET, LCP and SHP techniques respectively (Figs. 6.1, 6.2 and 6.3).
Fig 6.1 Performance comparison for 2 cores
Fig 6.2 Performance comparison for 4 cores
Fig 6.3 Performance comparison for 8 cores
Fig 6.4 CB-DPET- Percentage increase in hits
Fig 6.5 LCP- Percentage increase in hits
Fig 6.6 SHP- Percentage increase in hits
In the 2-core scenario for CB-DPET, canneal shows the highest IPC speedup
improvement (1.2 times that of LRU) and dedup records the least improvement,
at 1.02 times that of LRU. Vips and swaptions show the maximum IPC speedup
among the workloads for the LCP and SHP methods in the 2-core system. In the
4-core scenario, fluidanimate shows the maximum IPC speedup, at 1.075 and
1.07 times that of LRU for the CB-DPET and LCP methods respectively, while
blackscholes shows an IPC speedup of 1.07 times that of LRU, the maximum for
the SHP method in the 4-core environment. For 8 cores, ferret records the
maximum IPC speedup for the CB-DPET and LCP methods, whereas swaptions shows
the maximum IPC speedup, of almost 1.1 times that of LRU, for the SHP method.
An improvement in overall hits of up to 8% is observed using CB-DPET compared
to LRU at the L2 cache level (Fig. 6.4). LCP and SHP show average improvements
of up to 9% (Fig. 6.5) and 9% (Fig. 6.6) respectively in the overall number
of hits.
6.4. Storage Overhead
6.4.1. CB-DPET
The PET counter requires frequent lookup, so it is kept as part of the cache
hardware to expedite access. Consider an 8-way set associative 2 MB cache
with 4096 sets and a 64-byte cache block. The additional storage required is
calculated as follows:

Additional Overhead = no. of sets × no. of blocks per set × no. of bits per block
= 4096 × 8 × 3
= 98304 bits or 12 KB

The DM register is allocated on a per-thread basis and its size depends on
the application at run time, so it is kept as a software-based register
rather than a hardware one. Whether the block status register is implemented
in hardware or software is an implementation-time decision, since the range
of values it can hold is not yet fixed. Assuming it can take only '0' (say,
shared) or '1' (not shared), and that the thread id range does not start from
'0', the additional storage required for the block status register, if
implemented in hardware, is approximately 4 KB. The total additional storage
required is therefore approximately 16 KB, which amounts to a 0.78% overall
increase in area for a 2 MB L2 cache.
6.4.2. LCP
Similar to the PET counter, the LCP counter requires frequent lookup, so it
is also kept as part of the cache hardware to expedite access. The storage
overhead is similar to that of the PET counter, approximately 12 KB, since
LCP is also a 3-bit counter. As no other registers or counters are required,
LCP results in a 0.58% overall increase in area for a 2 MB L2 cache.
6.4.3. SHP
SHP requires a 2-bit Sharing Degree counter that needs frequent lookup, and
it is therefore kept as part of the hardware. The additional storage required
for the Sharing Degree counter is calculated as follows:

Additional Overhead = no. of sets × no. of blocks per set × no. of bits per block
= 4096 × 8 × 2
= 65536 bits or 8 KB

The Thread Tracker filter, which keeps track of the threads that access the
blocks, can be implemented in software, which gives the flexibility of
dynamically allocating and freeing memory as and when required. Hence SHP
results in a 0.39% overall increase in area for a 2 MB L2 cache.
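The overhead arithmetic for the three methods can be checked with a short script (constants taken from the cache configuration above):

```python
SETS, WAYS = 4096, 8            # 8-way, 4096-set L2
CACHE_BYTES = 2 * 1024 * 1024   # 2 MB L2 cache

def counter_overhead_kb(bits_per_block):
    """no. of sets * blocks per set * bits per block, converted to KB."""
    return SETS * WAYS * bits_per_block / 8 / 1024

cbdpet_kb = counter_overhead_kb(3)   # 3-bit PET counter
lcp_kb = counter_overhead_kb(3)      # 3-bit LCP counter
shp_kb = counter_overhead_kb(2)      # 2-bit Sharing Degree counter
shp_pct = shp_kb * 1024 / CACHE_BYTES * 100   # area increase over 2 MB
```

This reproduces the 12 KB (CB-DPET, LCP) and 8 KB (SHP) counter budgets, and the SHP figure of about 0.39% of the 2 MB cache.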
7. CONCLUSION AND FUTURE WORK
7.1. Conclusion
When it comes to concurrent multi-threaded applications in a CMP
environment, the LLC plays a crucial role in overall system efficiency.
Conventional replacement strategies such as LRU and NRU may not be effective
in many instances, as they do not adapt dynamically to the data requirements.
Much research is being conducted to find novel ways of improving the
efficiency of the replacement algorithms applied at the LLC. In this
dissertation, three novel counter-based replacement strategies have been
proposed.
- A novel replacement algorithm called CB-DPET has been proposed, which takes
a more systematic, access-history-based decision than LRU. It is further
split into DPET and DPET-V. DPET associates a counter with every cache block
to keep track of its recent access information. Based on this counter's
values, the algorithm makes replacement, insertion and promotion decisions.
- A dynamic and structured replacement technique called LCP has been
proposed for adoption across the LLC. Here the elements of the cache are
logically partitioned into four zones based on their likeliness to be
referenced by the processor, with the help of a 3-bit LCP counter associated
with every cache line. Replacement candidates are chosen from the zones in
increasing order of their likeliness factor, starting from the NLR zone. For
every hit, the corresponding element is moved up by one zone by adjusting the
LCP counter value; if it has already reached the topmost zone (MLR), the
counter value is left untouched. For every miss, the counter value is
decremented by '1' to prevent stale data items from polluting the cache.
- A novel counter-based prioritizing algorithm called SHP has been proposed
for application across the LLC. Here every cache block is associated with a
2-bit sharing degree counter which ranges from 0 to 3. A dynamic,
software-based TT filter is associated with every block to keep track of the
threads that are accessing that block. A hit counter keeps track of the hits
received by the data item. The values of the sharing degree and hit counters
are used to compute a priority for each cache block. This priority is then
used to make judicious replacement decisions.
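The LCP zone movement summarized in the second item above can be sketched as follows; the mapping of two counter values per zone is an assumed encoding, since the counter-to-zone layout is not fixed here:

```python
LCP_MAX = 7   # top of the 3-bit counter range

def zone(counter):
    """Assumed encoding: two counter values per zone, 0 = NLR ... 3 = MLR."""
    return counter // 2

def on_lcp_hit(counter):
    """Promote the hit element one zone, leaving MLR elements untouched."""
    return counter if zone(counter) == 3 else min(counter + 2, LCP_MAX)

def on_lcp_miss(counter):
    """Decrement by 1 on every miss so stale elements sink toward NLR."""
    return max(counter - 1, 0)
```

A freshly demoted element thus drifts down one counter value per miss but jumps a full zone on a hit, biasing retention toward recently referenced lines.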
Table 7.1 Percentage increase in overall hits at L2 cache compared to LRU

Method  | 2-Core | 4-Core | 8-Core
CB-DPET | 8%     | 4%     | 3%
LCP     | 6%     | 9%     | 5%
SHP     | 7%     | 9%     | 5%
Table 7.1 gives the percentage improvement in the overall number of hits
compared to LRU for all three methods. Up to a 9% improvement in the overall
hit count has been obtained at the L2 cache level (for LCP), along with an
IPC speedup of up to 1.08 times that of LRU (for SHP).
Simulation studies show that all three proposed counter-based replacement
techniques outperform LRU with minimal storage overhead.
7.2. Scope for Further Work
Dynamic replacement strategies have proven to work well with the versatile
set of application workloads that currently exist. Increasing this dynamicity
without inducing much hardware overhead is crucial. This research has
experimented with novel replacement strategies on the shared L2 cache; the
modifications that may be required to run the same algorithms in the
relatively smaller private L1 cache are an interesting area to explore.
Appending additional bits per cache block in CB-DPET can help store more
information about the access pattern of a data item, but the storage overhead
needs to be taken into consideration. In LCP, the victim search process could
be optimized to expedite replacement candidate selection. Multi-threading
could be a good solution here: instead of one thread searching the cache set
in multiple passes, multiple threads could search different zones, although
inter-thread communication may have to be explored to ensure the process
stops once a victim is found. In the sharing and hit-based prioritizing
algorithm, the replacement priority of cache blocks is computed from the
sharing degree and the number of hits received. To boost accuracy, other
parameters, such as the miss penalty incurred when replacing a block, could
also be considered when computing the priority.
REFERENCES [1] John L. Hennessy and David A. Patterson, "Computer architecture: a
quantitative approach", Fifth Edition, Elsevier, 2012.
[2] Tola John Odule and Idowun Ademola Osinuga, “Dynamically Self-
Adjusting Cache Replacement Algorithm”, International Journal of Future
Generation Communication and Networking, Vol. 6, No. 1, Feb. 2013.
[3] Andhi Janapsatya, Aleksandar Ignjatovic, Jorgen Peddersen and Sri
Parameswaran, "Dueling clock: adaptive cache replacement policy based
on the clock algorithm", IEEE Design, Automation & Test Europe
Conference & Exhibition (DATE), pp. 920-925, 2010.
[4] Geng Tian and Michael Liebelt, "An effectiveness-based adaptive cache
replacement policy", Microprocessors and Microsystems, Vol 38, No.1,
pp. 98-111, Elsevier, Feb. 2014.
[5] Anil Kumar Katti and Vijaya Ramachandran, "Competitive cache
replacement strategies for shared cache environments", IEEE 26th
International Parallel & Distributed Processing Symposium (IPDPS), pp.
215-226, 2012.
[6] Martin Kampe, Per Stenstrom and Michel Dubois, "Self-correcting LRU
replacement policies", 1st Conference on Computing Frontiers, ACM
Proceedings, pp. 181-191, 2004.
[7] Cheol Hong Kim, Jong Wook Kwak, Seong Tae Jhang and Chu Shik
Jhon, "Adaptive block management for victim cache by exploiting l1
cache history information", Embedded and Ubiquitous Computing, pp. 1-
11, Springer Berlin Heidelberg, 2004.
[8] Jorge Albericio, Ruben Gran, Pablo Ibanez, Víctor Vinals and Jose Maria
Llaberia, "ABS: A low-cost adaptive controller for prefetching in a banked
shared last-level cache", ACM Transactions on Architecture and Code
Optimization (TACO), Vol. 8, No. 4, Article. 19, 2012.
[9] Ke Zhang, Zhensong Wang, Yong Chen, Huaiyu Zhu and Xian-He Sun,
"PAC-PLRU: A Cache Replacement Policy to Salvage Discarded Predictions
from Hardware Prefetchers", 11th IEEE/ACM International Symposium on
Cluster, Cloud and Grid Computing, pp. 265-274, May 2011.
[10] Jayesh Gaur, Mainak Chaudhuri and Sreenivas Subramoney, "Bypass
and insertion algorithms for exclusive last-level caches", ACM SIGARCH
Computer Architecture News, Vol. 39, No. 3, 2011.
[11] Hongliang Gao and Chris Wilkerson, "A dueling segmented LRU
replacement algorithm with adaptive bypassing", JWAC 1st JILP
Workshop on Computer Architecture Competitions: cache replacement
Championship, 2010.
[12] Saurabh Gupta, Hongliang Gao and Huiyang Zhou, "Adaptive Cache
Bypassing for Inclusive Last Level Caches", IEEE 27th International
Symposium on Parallel & Distributed Processing (IPDPS), pp. 1243-1253,
2013.
[13] Mazen Kharbutli and Yan Solihin, "Counter-based cache replacement and
bypassing algorithms", IEEE Transactions on Computers, Vol. 57, No. 4,
pp. 433-447, 2008.
[14] Pierre Michaud, "Replacement policies for shared caches on symmetric
multicores: a programmer-centric point of view", 6th International
Conference on High Performance and Embedded Architectures and
Compilers, pp. 187-196, 2011.
[15] Moinuddin K Qureshi, "Adaptive spill-receive for robust high-performance
caching in CMPs", IEEE 15th International Symposium on High
Performance Computer Architecture (HPCA), pp. 45-54, 2009.
[16] Kathlene Hurt and Byeong Kil Lee, “Proposed Enhancements to Fixed
Segmented LRU Cache Replacement Policy”, IEEE 32nd International
Conference on Performance and Computing and Communications
(IPCCC), pp. 1-2, Dec. 2013.
[17] Manikantan, R, Kaushik Rajan and Ramaswamy Govindarajan.
"NUcache: An efficient multi-core cache organization based on next-use
distance", IEEE 17th International Symposium on High Performance
Computer Architecture (HPCA), pp. 243-253, 2011.
[18] Kedzierski, Kamil, Miquel Moreto, Francisco J. Cazorla and Mateo Valero,
"Adapting cache partitioning algorithms to pseudo-lru replacement
policies", IEEE International Symposium on Parallel & Distributed
Processing (IPDPS), pp. 1-12, 2010.
[19] Xie, Yuejian and Gabriel H. Loh, "PIPP: promotion/insertion pseudo-
partitioning of multi-core shared caches", ACM SIGARCH Computer
Architecture News, Vol. 37, No. 3, pp. 174-183, 2009.
[20] Hasenplaugh, William, Pritpal S. Ahuja, Aamer Jaleel, Simon Steely Jr
and Joel Emer, "The gradient-based cache partitioning algorithm", ACM
Transactions on Architecture and Code Optimization (TACO), Vol. 8, No.
4, Article 44, Jan 2012.
[21] Haakon Dybdahl, Per Stenstrom and Lasse Natvig, "A cache-partitioning
aware replacement policy for chip multiprocessors", High Performance
Computing-HiPC, Springer, pp. 22-34, 2006.
[22] Naoki Fujieda, and Kenji Kise, "A partitioning method of cooperative
caching with hit frequency counters for many-core processors", Second
International Conference on Networking and Computing (ICNC), pp. 160-
165, 2011.
[23] Konstantinos Nikas, Matthew Horsnell and Jim Garside, "An adaptive
bloom filter cache partitioning scheme for multi-core architectures”,
International Conference on Embedded Computer Systems,
Architectures, Modeling, and Simulation, SAMOS, pp. 25-32, 2008.
[24] Moinuddin K Qureshi and Yale N. Patt, "Utility-based cache partitioning: A
low-overhead, high-performance, runtime mechanism to partition shared
caches", 39th Annual IEEE/ACM International Symposium on
Microarchitecture, pp. 423-432, 2006.
[25] Daniel A Jimenez, "Insertion and promotion for tree-based PseudoLRU
last-level caches", 46th Annual IEEE/ACM International Symposium on
Microarchitecture, pp. 284-296, 2013.
[26] Lira, Javier, Carlos Molina and Antonio Gonzalez, "LRU-PEA: A smart
replacement policy for non-uniform cache architectures on chip
multiprocessors", IEEE International Conference on Computer Design,
pp. 275-281, 2009.
[27] Nikos Hardavellas, Michael Ferdman, Babak Falsafi and Anastasia
Ailamaki, "Reactive NUCA: near-optimal block placement and replication
in distributed caches", ACM SIGARCH Computer Architecture News, Vol.
37, No. 3, pp. 184-195, 2009.
[28] Fazal Hameed, Bauer L and Henkel J, “Dynamic Cache Management in
Multi-Core Architectures Through Runtime Adaptation”, Design,
Automation & Test in Europe Conference & Exhibition (DATE), pp. 485-
490, March 2012.
[29] Alberto Ros, Marcelo Cintra, Manuel E. Acacio and Jose M. García.
"Distance-aware round-robin mapping for large NUCA caches",
International Conference on High Performance Computing (HiPC), pp.
79-88, 2009.
[30] Javier Lira, Timothy M. Jones, Carlos Molina and Antonio Gonzalez. "The
migration prefetcher: Anticipating data promotion in dynamic NUCA
caches", ACM Transactions on Architecture and Code Optimization
(TACO), Vol. 8, No. 4, Article. 45, 2012.
[31] Mohammad Hammoud, Sangyeun Cho and Rami Melhem, "Dynamic
cache clustering for chip multiprocessors", ACM 23rd International
Conference on Supercomputing, pp. 56-67, 2009.
[32] O. Adamo, A. Naz, K. Kavi, T. Janjusic and C. P. Chung, "Smaller split
L-1 data caches for multi-core processing systems", IEEE 10th
International Symposium on Pervasive Systems, Algorithms and
Networks (I-SPAN 2009), Dec. 2009.
[33] Yong Chen, Surendra Byna and Xian-He Sun, "Data access history
cache and associated data prefetching mechanisms", ACM/IEEE
Conference on Supercomputing, pp. 1-12, 2007.
[34] Mohammad Hammoud, Sangyeun Cho and Rami Melhem, "ACM: An
efficient approach for managing shared caches in chip multiprocessors",
High Performance Embedded Architectures and Compilers, Springer
Berlin Heidelberg, pp. 355-372, 2009.
[35] Sourav Roy, "H-NMRU: a low area, high performance cache replacement
policy for embedded processors", 22nd International Conference on VLSI
Design, pp. 553-558, 2009.
[36] Nobuaki Tojo, Nozomu Togawa, Masao Yanagisawa and Tatsuo Ohtsuki,
"Exact and fast L1 cache simulation for embedded systems", Asia and
South Pacific Design Automation Conference, pp. 817-822, 2009.
[37] Sangyeun Cho, and Lory Al Moakar, "Augmented FIFO Cache
Replacement Policies for Low-Power Embedded Processors", Journal of
Circuits, Systems and Computers, Vol. 18, No. 6, pp. 1081-1092, 2009.
[38] Kimish Patel, Luca Benini, Enrico Macii and Massimo Poncino, "Reducing
conflict misses by application-specific reconfigurable indexing", IEEE
Transactions on Computer-Aided Design of Integrated Circuits and
Systems, Vol. 25, No. 12, pp. 2626-2637, 2006.
[39] Yun Liang, and Tulika Mitra, "Instruction cache locking using temporal
reuse profile", 47th Design Automation Conference, pp. 344-349, 2010.
[40] Yun Liang, and Tulika Mitra, "Improved procedure placement for set
associative caches", International Conference on Compilers,
Architectures and Synthesis for Embedded Systems, pp. 147-156, 2010.
[41] Cheol Hong Kim, Jong Wook Kwak, Seong Tae Jhang and Chu Shik
Jhon, "Adaptive block management for victim cache by exploiting l1
cache history information", Embedded and Ubiquitous Computing,
Springer Berlin Heidelberg, pp. 1-11, 2004.
[42] Cristian Ungureanu, Biplob Debnath, Stephen Rago and Akshat Aranya,
"TBF: A memory-efficient replacement policy for flash-based caches",
IEEE 29th International Conference on In Data Engineering (ICDE), pp.
1117-1128, 2013.
[43] Pepijn de Langen and Ben Juurlink, "Reducing Conflict Misses in Caches
by Using Application Specific Placement Functions", 17th Annual
Workshop on Circuits, Systems and Signal Processing, pp. 300-305.
2006.
[44] Dongxing Bao and Xiaoming Li, "A cache scheme based on LRU-like
algorithm", IEEE International Conference on Information and Automation
(ICIA), pp. 2055-2060, 2010.
[45] Livio Soares, David Tam and Michael Stumm, "Reducing the harmful
effects of last-level cache polluters with an OS-level, software-only pollute
buffer", 41st annual IEEE/ACM International Symposium on
Microarchitecture, IEEE Computer Society, pp. 258-269, 2008.
[46] Alejandro Valero, Julio Sahuquillo, Salvador Petit, Pedro Lopez and Jose
Duato, "MRU-Tour-based Replacement Algorithms for Last-Level
Caches", 23rd International Symposium on Computer Architecture and
High Performance Computing (SBAC-PAD), pp. 112-119, 2011.
[47] Gangyong Jia, Xi Li, Chao Wang, Xuehai Zhou and Zongwei Zhu, "Cache
Promotion Policy Using Re-reference Interval Prediction", IEEE
International Conference on Cluster Computing (CLUSTER), pp. 534-537,
2012.
[48] Junmin Wu, Xiufeng Sui, Yixuan Tang, Xiaodong Zhu, Jing Wang and
Guoliang Chen, "Cache Management with Partitioning-Aware Eviction
and Thread-Aware Insertion/Promotion Policy", International Symposium
on Parallel and Distributed Processing with Applications (ISPA), pp. 374-
381, 2010.
[49] Ali A. El-Moursy and Fadi N. Sibai, "V-Set cache design for LLC of
multi-core processors", IEEE 14th International Conference on High
Performance Computing and Communication (HPCC), pp. 995-1000,
2012.
[50] Aamer Jaleel, Matthew Mattina and Bruce Jacob, "Last level cache (LLC)
performance of data mining workloads on a CMP-a case study of parallel
bioinformatics workloads", The Twelfth International Symposium on High-
Performance Computer Architecture, pp. 88-98, 2006.
[51] Arkaprava Basu, Nevin Kirman, Meyrem Kirman, Mainak Chaudhuri and
Jose Martinez, "Scavenger: A new last level cache architecture with
global block priority", 40th Annual IEEE/ACM International Symposium on
Microarchitecture, IEEE Computer Society, pp. 421-432, 2007.
[52] Miao Zhou, Yu Du, Bruce Childers, Rami Melhem and Daniel Mossé,
"Writeback-aware partitioning and replacement for last-level caches in
phase change main memory systems", ACM Transactions on
Architecture and Code Optimization (TACO), Vol. 8, No. 4, Article 53,
2012.
[53] Fazal Hameed, Lars Bauer and Jorg Henkel, "Adaptive cache
management for a combined SRAM and DRAM cache hierarchy for multi-
cores", Design, Automation & Test in Europe Conference & Exhibition
(DATE), pp. 77-82, 2013.
[54] Fang Juan and Li Chengyan, "An improved multi-core shared cache
replacement algorithm", 11th International Symposium on Distributed
Computing and Applications to Business, Engineering & Science
(DCABES), pp. 13-17, 2012.
[55] Mainak Chaudhuri, Jayesh Gaur, Nithiyanandan Bashyam, Sreenivas
Subramoney and Joseph Nuzman, "Introducing hierarchy-awareness in
replacement and bypass algorithms for last-level caches", International
Conference on Parallel Architectures and Compilation Techniques, pp.
293-304, 2012.
[56] Yingying Tian, Samira M. Khan and Daniel A. Jimenez, "Temporal-based
multilevel correlating inclusive cache replacement", ACM Transactions on
Architecture and Code Optimization (TACO), Vol. 10, No. 4, Article 33,
2013.
[57] Liqiang He, Yan Sun and Chaozhong Zhang, "Adaptive Subset Based
Replacement Policy for High Performance Caching", 1st JILP Workshop
on Computer Architecture Competitions (JWAC-1): Cache Replacement
Championship, 2010.
[58] Baozhong Yu, Yifan Hu, Jianliang Ma and Tianzhou Chen, "Access
Pattern Based Re-reference Interval Table for Last Level Cache", 12th
International Conference on Parallel and Distributed Computing,
Applications and Technologies (PDCAT), pp. 251-256, 2011.
[59] Haiming Liu, Michael Ferdman, Jaehyuk Huh and Doug Burger, "Cache
bursts: A new approach for eliminating dead blocks and increasing cache
efficiency", 41st annual IEEE/ACM International Symposium on
Microarchitecture, IEEE Computer Society, pp. 222-233, 2008.
[60] Vineeth Mekkat, Anup Holey, Pen-Chung Yew and Antonia Zhai,
"Managing shared last-level cache in a heterogeneous multicore
processor", 22nd International Conference on Parallel Architectures and
Compilation Techniques, pp. 225-234, 2013.
[61] Rami Sheikh and Mazen Kharbutli, "Improving cache performance by
combining cost-sensitivity and locality principles in cache replacement
algorithms", IEEE International Conference on Computer Design (ICCD),
pp. 76-83, 2010.
[62] Adwan AbdelFattah and Aiman Abu Samra, "Least recently plus five least
frequently replacement policy (LR+5LF)", International Arab Journal of
Information Technology, Vol. 9, No. 1, pp. 16-21, 2012.
[63] Richa Gupta and Sanjiv Tokekar, "Proficient Pair of Replacement
Algorithms on L1 and L2 Cache for Merge Sort", Journal of Computing,
Vol. 2, No. 3, pp. 171-175, March 2010.
[64] Jan Reineke and Daniel Grund, "Sensitivity of Cache Replacement
Policies", ACM Transactions on Embedded Computing Systems, Vol. 9,
No. 4, Article 39, March 2010.
[65] Moinuddin K. Qureshi, Daniel N. Lynch, Onur Mutlu and Yale N. Patt, "A
case for MLP-aware cache replacement", ACM SIGARCH Computer
Architecture News, Vol. 34, No. 2, pp. 167-178, 2006.
[66] Marios Kleanthous and Yiannakis Sazeides, "CATCH: A mechanism for
dynamically detecting cache-content-duplication in instruction caches",
ACM Transactions on Architecture and Code Optimization (TACO),
Vol. 8, No. 3, Article 11, 2011.
[67] Moinuddin K. Qureshi, David Thompson and Yale N. Patt, "The V-Way
cache: demand-based associativity via global replacement", 32nd
International Symposium on Computer Architecture, ISCA'05
Proceedings, pp. 544-555, 2005.
[68] Jichuan Chang and Gurindar S. Sohi, "Cooperative caching for chip
multiprocessors", ACM SIGARCH Computer Architecture News, Vol. 34,
No. 2, 2006.
[69] Shekhar Srikantaiah, Mahmut Kandemir and Mary Jane Irwin, "Adaptive
set pinning: managing shared caches in chip multiprocessors", ACM
SIGARCH Computer Architecture News, Vol. 36, No. 1, pp. 135-144,
2008.
[70] Ranjith Subramanian, Yannis Smaragdakis and Gabriel H. Loh, "Adaptive
Caches: Effective Shaping of Cache Behavior to Workloads", 39th Annual
IEEE/ACM International Symposium on Microarchitecture (MICRO-39),
2006.
[71] Tripti S. Warrier, B. Anupama and Madhu Mutyam, "An application-aware
cache replacement policy for last-level caches", Architecture of
Computing Systems–ARCS, Springer Berlin Heidelberg, pp. 207-219,
2013.
[72] Chongmin Li, Dongsheng Wang, Yibo Xue, Haixia Wang and Xi Zhang.
"Enhanced adaptive insertion policy for shared caches", Advanced
Parallel Processing Technologies, Springer Berlin Heidelberg, pp. 16-30,
2011.
[73] Xiaoning Ding, Kaibo Wang and Xiaodong Zhang, "SRM-buffer: an OS
buffer management technique to prevent last level cache from thrashing
in multi-cores", ACM 6th Conference on Computer Systems, pp. 243-256,
2011.
[74] Jiayuan Meng and Kevin Skadron, "Avoiding cache thrashing due to
private data placement in last-level cache for many core scaling", IEEE
International Conference on Computer Design (ICCD), pp. 282-288,
2009.
[75] Vivek Seshadri, Onur Mutlu, Michael A. Kozuch and Todd C. Mowry, "The
Evicted-Address Filter: A unified mechanism to address both cache
pollution and thrashing", 21st International Conference on Parallel
Architectures and Compilation Techniques, pp. 355-366, 2012.
[76] Moinuddin K. Qureshi, Aamer Jaleel, Yale N. Patt, Simon C. Steely Jr and
Joel Emer, "Adaptive Insertion Policies for High Performance Caching",
ACM SIGARCH Computer Architecture News, Vol. 35, No. 2, pp. 381-
391, June 2007.
[77] Nam Duong, Dali Zhao, Taesu Kim, Rosario Cammarota, Mateo Valero
and Alexander V. Veidenbaum, "Improving cache management policies
using dynamic reuse distances", 45th Annual IEEE/ACM International
Symposium on Microarchitecture, IEEE Computer Society, pp. 389-400,
2012.
[78] Xi Zhang, Chongmin Li, Haixia Wang and Dongsheng Wang, "A cache
replacement policy using adaptive insertion and re-reference prediction",
22nd International Symposium on Computer Architecture and High
Performance Computing (SBAC-PAD), pp. 95-102, 2010.
[79] Min Feng, Chen Tian, Changhui Lin and Rajiv Gupta, "Dynamic access
distance driven cache replacement", ACM Transactions on Architecture
and Code Optimization (TACO), Vol. 8, No. 3, Article 14, 2011.
[80] Jun Zhang, Xiaoya Fan and Song-He Liu, "A pollution alleviative L2
cache replacement policy for chip multiprocessor architecture",
International Conference on Networking, Architecture and Storage,
NAS'08, pp. 310-316, 2008.
[81] Yuejian Xie and Gabriel H. Loh, "Scalable shared-cache management by
containing thrashing workloads", High Performance Embedded
Architectures and Compilers, Springer Berlin Heidelberg, pp. 262-276,
2010.
[82] Seongbeom Kim, Dhruba Chandra and Yan Solihin, "Fair cache sharing
and partitioning in a chip multiprocessor architecture", 13th International
Conference on Parallel Architectures and Compilation Techniques, IEEE
Computer Society, pp. 111-122, 2004.
[83] Jichuan Chang and Gurindar S. Sohi, "Cooperative cache partitioning for
chip multiprocessors", 21st ACM International Conference on
Supercomputing, pp. 242-252, 2007.
[84] Aamer Jaleel, William Hasenplaugh, Moinuddin Qureshi, Julien Sebot,
Simon Steely Jr and Joel Emer, "Adaptive insertion policies for managing
shared caches", 17th International Conference on Parallel Architectures
and Compilation Techniques, pp. 208-219, 2008.
[85] Carole-Jean Wu and Margaret Martonosi, "Adaptive timekeeping
replacement: Fine-grained capacity management for shared CMP
caches", ACM Transactions on Architecture and Code Optimization
(TACO), Vol. 8, No. 1, Article 3, 2011.
[86] Mainak Chaudhuri, "Pseudo-LIFO: the foundation of a new family of
replacement policies for last-level caches", 42nd Annual IEEE/ACM
International Symposium on Microarchitecture, pp. 401-412, 2009.
[87] Aamer Jaleel, Kevin B. Theobald, Simon C. Steely Jr and Joel Emer,
"High performance cache replacement using re-reference interval
prediction (RRIP)", ACM SIGARCH Computer Architecture News, Vol. 38,
No. 3, pp. 60-71, 2010.
[88] Viacheslav V. Fedorov, Sheng Qiu, A. L. Narasimha Reddy and Paul V.
Gratz, "ARI: Adaptive LLC-memory traffic management", ACM
Transactions on Architecture and Code Optimization (TACO), Vol. 10,
No. 4, Article 46, 2013.
[89] Moinuddin K. Qureshi, Aamer Jaleel, Yale N. Patt, Simon C. Steely Jr
and Joel Emer, "Set-dueling-controlled adaptive insertion for high-
performance caching", IEEE Micro, Vol. 28, No. 1, pp. 91-98, 2008.
[90] Aamer Jaleel, Hashem H. Najaf-Abadi, Samantika Subramaniam, Simon
C. Steely Jr and Joel Emer, "CRUISE: cache replacement and utility-aware
scheduling", ACM SIGARCH Computer Architecture News, Vol. 40, No.
1, pp. 249-260, 2012.
[91] George Kurian, Srinivas Devadas and Omer Khan, "Locality-Aware Data
Replication in the Last-Level Cache", IEEE 20th International Symposium
on High Performance Computer Architecture (HPCA), 2014.
[92] Jason Zebchuk, Srihari Makineni and Donald Newell, "Re-examining
cache replacement policies", IEEE International Conference on Computer
Design (ICCD), pp. 671-678, 2008.
[93] Samira Manabi Khan, Zhe Wang and Daniel A. Jimenez, "Decoupled
dynamic cache segmentation", IEEE 18th International Symposium
on High Performance Computer Architecture (HPCA), pp. 1-12, 2012.
[94] Song Jiang and Xiaodong Zhang, "LIRS: an efficient low inter-reference
recency set replacement policy to improve buffer cache performance",
ACM SIGMETRICS Performance Evaluation Review, Vol. 30, No. 1, pp.
31-42, 2002.
[95] Song Jiang and Xiaodong Zhang, "Making LRU friendly to weak locality
workloads: A novel replacement algorithm to improve buffer cache
performance", IEEE Transactions on Computers, Vol. 54, No. 8, pp. 939-
952, 2005.
[96] Izuchukwu Nwachukwu, Krishna Kavi, Fawibe Ademola and Chris Yan,
"Evaluation of techniques to improve cache access uniformities",
International Conference on Parallel Processing (ICPP), pp. 31-40, 2011.
[97] Dyer Rolán, Basilio B. Fraguela and Ramon Doallo, "Virtually split cache:
An efficient mechanism to distribute instructions and data", ACM
Transactions on Architecture and Code Optimization (TACO), Vol. 10,
No. 4, Article 27, December 2013.
[98] Huang ZhiBin, Zhu Mingfa and Xiao Limin, "LvtPPP: live-time protected
pseudopartitioning of multicore shared caches", IEEE Transactions on
Parallel and Distributed Systems, Vol. 24, No. 8, pp. 1622-1632, 2013.
[99] Bradford M Beckmann, Michael R. Marty and David A. Wood, "ASR:
Adaptive selective replication for CMP caches", 39th Annual IEEE/ACM
International Symposium on Microarchitecture, MICRO-39, pp. 443-454,
2006.
[100] Antonio Garcia-Guirado, Ricardo Fernandez-Pascual, Alberto Ros and
José M. García, "DAPSCO: Distance-aware partially shared cache
organization", ACM Transactions on Architecture and Code Optimization
(TACO), Vol. 8, No. 4, Article 25, 2012.
[101] Carole-Jean Wu, Aamer Jaleel, Will Hasenplaugh, Margaret Martonosi,
Simon C. Steely Jr and Joel Emer, "SHiP: signature-based hit predictor
for high performance caching", 44th Annual IEEE/ACM International
Symposium on Microarchitecture, pp. 430-441, 2011.
[102] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt,
Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar
Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad
Shoaib, Nilay Vaish, Mark D. Hill and David A. Wood, "The gem5
simulator", ACM SIGARCH Computer Architecture News, Vol. 39, No. 2,
pp. 1-7, 2011.
[103] Christian Bienia, Sanjeev Kumar and Kai Li, "PARSEC vs. SPLASH-2: A
quantitative comparison of two multithreaded benchmark suites on chip-
multiprocessors", IEEE International Symposium on Workload
Characterization (IISWC), pp. 47-56, 2008.
[104] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh and Kai Li, "The
PARSEC benchmark suite: Characterization and architectural
implications", 17th International Conference on Parallel Architectures and
Compilation Techniques, pp. 72-81, 2008.
[105] Mark Gebhart, Joel Hestness, Ehsan Fatehi, Paul Gratz and Stephen W.
Keckler, "Running PARSEC 2.1 on M5", University of Texas at Austin,
Department of Computer Science, Technical Report TR-09-32, 2009.
LIST OF PUBLICATIONS
[1] Muthukumar S and Jawahar P.K., “Hit Rate Maximization by Logical
Cache Partitioning in a Multi-Core Environment”, Journal of Computer
Science, Vol. 10, Issue 3, pp. 492-498, 2014.
[2] Muthukumar S and Jawahar P.K., “Sharing and Hit Based Prioritizing
Replacement Algorithm for Multi-Threaded Applications”, International
Journal of Computer Applications, Vol. 90, Issue 12, pp. 34-38, 2014.
[3] Muthukumar S and Jawahar P.K., “Cache Replacement for Multi-
Threaded Applications using Context-Based Data Pattern Exploitation
Technique”, Malaysian Journal of Computer Science, Vol. 26, Issue 4, pp.
277-293, 2014.
[4] Muthukumar S and Jawahar P.K., “Redundant Cache Data Eviction in a
Multi-Core Environment”, International Journal of Advances in
Engineering and Technology, Vol. 5, Issue 2, pp.168-175, 2013.
[5] Muthukumar S and Jawahar P.K., “Cache Replacement Policies for
Improving LLC Performance in Multi-Core Processors”, International
Journal of Computer Applications, Vol. 105, Issue 8, pp. 12-17, 2014.
TECHNICAL BIOGRAPHY
Muthukumar S (RRN: 1184211) was born on 15th April 1965, in Erode,
Tamil Nadu. He received the B.E. degree in Electronics and Communication
Engineering and the M.E. degree in Applied Electronics from Bharathiar
University, Coimbatore, in 1989 and 1992, respectively. He is currently
pursuing the Ph.D. degree in the Department of Electronics and Communication
Engineering at B.S. Abdur Rahman University, Chennai, India. He has more
than 22 years of teaching experience in various institutions, and is working as a
Professor in the Department of Electronics and Communication Engineering at
Sri Venkateswara College of Engineering, Chennai, India. His current research
interests include Multi-Core Processor Architectures, High Performance
Computing and Embedded Systems.
E-mail ID : [email protected]
Contact number : 9941437738