ENHANCEMENT OF CACHE PERFORMANCE IN
MULTI-CORE PROCESSORS
A THESIS
Submitted by
MUTHUKUMAR.S
Under the Guidance of
Dr. P.K. JAWAHAR
in partial fulfillment for the award of the degree of
DOCTOR OF PHILOSOPHY
in
ELECTRONICS AND COMMUNICATION ENGINEERING
B.S.ABDUR RAHMAN UNIVERSITY
(B.S. ABDUR RAHMAN INSTITUTE OF SCIENCE & TECHNOLOGY)
(Estd. u/s 3 of the UGC Act. 1956)
www.bsauniv.ac.in
November 2014
B.S.ABDUR RAHMAN UNIVERSITY
(B.S. ABDUR RAHMAN INSTITUTE OF SCIENCE & TECHNOLOGY)
(Estd. u/s 3 of the UGC Act. 1956)
www.bsauniv.ac.in
BONAFIDE CERTIFICATE
Certified that this thesis ENHANCEMENT OF CACHE PERFORMANCE
IN MULTI-CORE PROCESSORS is the bonafide work of MUTHUKUMAR S.
(RRN: 1184211) who carried out the thesis work under my supervision. Certified
further, that to the best of my knowledge the work reported herein does not form
part of any other thesis or dissertation on the basis of which a degree or award
was conferred on an earlier occasion on this or any other candidate.
SIGNATURE
Dr. P.K.JAWAHAR
RESEARCH SUPERVISOR
Professor and Head
Department of ECE
B.S. Abdur Rahman University
Vandalur, Chennai – 600 048
ACKNOWLEDGEMENT
First and foremost, I would forever be thankful to the almighty for having
bestowed me with this life and knowledge. Next, I would like to express my
sincere gratitude to my supervisor Professor Dr. P.K. Jawahar for his
continuous support during my PhD study and research, for his patience,
motivation, enthusiasm, and immense knowledge. He gave me the freedom to
do the research in my chosen area, as a result of which, I felt more encouraged
and motivated to achieve my objectives. Besides my advisor, I would also like to
thank the doctoral committee members, Dr. P. Rangarajan and Dr. T.R.
Rangaswamy for their valuable suggestions, encouragement and insightful
comments.
My sincere thanks to Dr. Kaja Mohideen, Dean, School of Electrical and
Communication Sciences, for having provided me with this opportunity.
I would like to thank Professor Dr. V. Sankaranarayanan for his advice and
valuable suggestions throughout my research.
I would also like to thank Dr. M. Arunachalam, Secretary, Sri Venkateswara
College of Engineering for his encouragement and support.
Last but not least, I would like to thank my family members for their immense
support. Special thanks go to my wife Mrs. M. Shanthi, my son Mr. M.
Subramanian and my daughter M. Uma for their patience and moral support
that kept me going throughout the entire course.
I am also indebted to my parents who have stimulated my research interests
from the very early stage of my life. I sincerely hope that I can make my family
members proud.
ABSTRACT
The cache replacement policy is a main deciding factor of system performance
and efficiency in Chip Multi-core Processors (CMPs). It is imperative for any
level of cache memory in a multi-core architecture to have a well-defined,
dynamic replacement algorithm in place to ensure consistently high
performance. Many existing cache replacement policies, such as the Least
Recently Used (LRU) policy, have proved to work well in the shared Level 2
(L2) cache for most of the data-set patterns generated by single-threaded
applications. But when it comes to parallel multi-threaded applications that
generate differing patterns of workload at different intervals, the conventional
replacement policies can prove sub-optimal, as such workloads do not
generally abide by the spatial and temporal locality principles and the policies
do not adapt dynamically to the changes in the workload.
This dissertation proposes three novel cache replacement policies for the L2
cache that are targeted towards such applications. The Context Based Data
Pattern Exploitation technique assigns a counter to every block of the L2 cache
to monitor the data access patterns of various threads, and modifies the counter
values appropriately to maximize performance. The Logical Cache Partitioning
technique logically partitions the cache elements into four zones based on their
'likeliness' to be referenced by the processor in the near future; replacement
candidates are chosen from the zones in ascending order of their 'likeliness
factor' to minimize the number of cache misses. The Sharing and Hit based
Prioritizing replacement technique takes the sharing status of the data elements
into consideration while making replacement decisions; this information is
combined with the number of hits received by the element to arrive at a priority
on which the replacement decisions are based.
The proposed techniques are effective at improving cache performance and
offer an improvement of up to 9% in overall hits at the L2 cache level and an IPC
(Instructions Per Cycle) speedup of up to 1.08 times that of LRU for a wide
range of multithreaded benchmarks.
TABLE OF CONTENTS

CHAPTER NO.  TITLE  PAGE NO.

ABSTRACT  iv
LIST OF TABLES  viii
LIST OF FIGURES  ix
LIST OF SYMBOLS AND ABBREVIATIONS  xi

1. INTRODUCTION  1
   1.1 Chip Multi-core Processors  2
   1.2 Memory Management in CMP  3
   1.3 Overview of Existing Replacement Policies  3
   1.4 The Problem  6
   1.5 Scope of Work  6
   1.6 Research Objectives  7
   1.7 Contributions  7
   1.8 Outline of the Thesis  8
   1.9 Summary  9

2. LITERATURE SURVEY  10
   2.1 Replacement Policies  10
   2.2 Data Prefetching and Cache Bypassing  11
   2.3 Effectiveness of Cache Partitioning  11
   2.4 Embedded System Processors  12
   2.5 Performance of Shared Last-Level Caches  13
   2.6 Cache Thrashing and Cache Pollution  15
   2.7 Prevalence of Multi-Threaded Applications  16
   2.8 Summary  17

3. CONTEXT-BASED DATA PATTERN EXPLOITATION TECHNIQUE  18
   3.1 Introduction  18
       3.1.1 Different Types of Application Workloads  19
   3.2 Basic Structure of CB-DPET  22
   3.3 DPET  22
   3.4 DPET-V: A Thrash Proof Version  24
   3.5 Policy Selection Technique  31
   3.6 Thread Based Decision Making  33
   3.7 Block Status Based Decision Making  34
   3.8 Modifications in the Proposed Policies  34
   3.9 Comparison between LRU, NRU and CB-DPET  35
       3.9.1 Performance of LRU  35
       3.9.2 Performance of NRU  36
       3.9.3 Performance of DPET  37
   3.10 Results  38
       3.10.1 CB-DPET Performance  39
   3.11 Summary  43

4. LOGICAL CACHE PARTITIONING TECHNIQUE  45
   4.1 Introduction  45
   4.2 Logical Cache Partitioning Method  46
   4.3 Replacement Policy  47
   4.4 Insertion Policy  47
   4.5 Promotion Policy  48
   4.6 Boundary Condition  49
   4.7 Comparison with LRU  53
   4.8 Results  53
       4.8.1 LCP Performance  54
   4.9 Summary  57

5. SHARING AND HIT-BASED PRIORITIZING TECHNIQUE  58
   5.1 Introduction  58
   5.2 Sharing and Hit-Based Prioritizing Method  59
   5.3 Replacement Priority  60
   5.4 Replacement Policy  61
   5.5 Insertion Policy  66
   5.6 Promotion Policy  66
   5.7 Results  66
       5.7.1 SHP Performance  66
   5.8 Summary  69

6. EXPERIMENTAL METHODOLOGY  70
   6.1 Simulation Environment  70
   6.2 PARSEC Benchmark  71
   6.3 Evaluation of Many-Core Systems  71
   6.4 Storage Overhead  76
       6.4.1 CB-DPET  76
       6.4.2 LCP  76
       6.4.3 SHP  76

7. CONCLUSION AND FUTURE WORK  78
   7.1 Conclusion  78
   7.2 Scope for Further Work  79

REFERENCES  81
LIST OF PUBLICATIONS  97
TECHNICAL BIOGRAPHY  98
LIST OF TABLES

TABLE NO.  TITLE  PAGE NO.

4.1 LCP zone categorization  47
5.1 Sharing degree values and their descriptions  65
6.1 Summary of the key characteristics of the PARSEC benchmarks  70
6.2 Cache architecture configuration parameters  71
7.1 Percentage increase in overall hits compared to LRU  79
LIST OF FIGURES

FIGURE NO.  TITLE  PAGE NO.

1.1 Processor memory gap [1]  1
1.2 Two core processor - T1 and T2 represent the threads  2
1.3 General memory hierarchy in a CMP environment  3
1.4 Data sequence containing a recurring pattern {1,5,7}  5
3.1 Data sequence which will result in thrashing for DPET  24
3.2 (i) Contents of the cache at the end of each burst for DPET  27
    (ii) Contents of the cache at the end of each burst for DPET-V  27
3.3 (i) CB-DPET flow diagram  28
    (ii) Victim Selection flow  29
    (iii) Insert flow DPET  30
    (iv) Insert flow DPET-V  30
3.4 Policy selection using DM register  31
3.5 CB-DPET Policy selection flow  32
3.6 Data sequence containing a recurring pattern  34
3.7 Behavior of LRU for the data sequence  35
3.8 Behavior of NRU for the data sequence  37
3.9 Behavior of DPET for the data sequence  38
3.10 Overall number of hits recorded in L2 cache  39
3.11 L2 cache hit rate  40
3.12 Hits recorded in individual phases of ROI  41
3.13 Miss latency at L2 cache for individual cores  41
3.14 Percentage increase in L2 cache hits  42
3.15 Overall number of misses at L2 cache  43
4.1 (i) LCP flow diagram  48
    (ii) Victim Selection flow for LCP  49
4.2 Working of LRU and LCP for the data set  52
4.3 Percentage increase in overall hits at L2 cache  54
4.4 Difference in number of replacements made at L2 cache  54
4.5 Average cache occupancy percentage  55
4.6 Miss rate recorded by individual cores  56
5.1 (i) SHP flow diagram  64
    (ii) Victim Selection flow for SHP  65
5.2 Overall number of hits at L2 cache  67
5.3 Percentage decrease in number of replacements at L2 cache  68
5.4 Overall miss rate and core-wise miss rate at L2 cache  68
6.1 Performance comparison for 2 cores  72
6.2 Performance comparison for 4 cores  73
6.3 Performance comparison for 8 cores  73
6.4 CB-DPET - percentage increase in hits  74
6.5 LCP - percentage increase in hits  74
6.6 SHP - percentage increase in hits  75
LIST OF SYMBOLS AND ABBREVIATIONS

AMA      Age Manipulation Algorithm
CMP      Chip Multi-core Processors
CB-DPET  Context Based Data Pattern Exploitation Technique
DPET     Data Pattern Exploitation Technique
DPET-V   Data Pattern Exploitation Technique - Variant
DM       Decision Making Register
L2       Level 2
LCP      Logical Cache Partitioning
LR       Likely to be Referenced
LLR      Less Likely to be Referenced
LLC      Last Level Cache
LRU      Least Recently Used
LFU      Least Frequently Used
MRU      Most Recently Used
MLR      Most Likely to be Referenced
NRU      Not Recently Used
NLR      Never Likely to be Referenced
PP1      Parallel Phase 1
PP2      Parallel Phase 2
ROI      Region of Interest
SHP      Sharing and Hit-based Prioritizing
1. INTRODUCTION
The advent of multi-core computers, consisting of two or more processors
working in tandem on a single integrated circuit, has produced a phenomenal
breakthrough in performance in recent years. Efficient memory management is
required to utilize the complete potential of CMPs.

One of the main concerns in recent times is the processor-memory gap: a
scenario where the processor is stalled while a memory operation completes.
Fig 1.1 shows that this situation is getting worse, as processor performance has
increased manifold while memory latency has remained almost the same.
On-chip cache memory clock speeds have hit a wall where practically no
further improvement can be achieved. The use of multiple cores has mitigated
this effect to some extent, but the gap is expected to widen further in the future,
stressing the need for effective memory management techniques.
Fig 1.1 Processor memory gap [1]
1.1. Chip Multi-core Processors (CMP)
In a CMP, multiple processor cores are packaged into a single physical
processor chip capable of executing multiple tasks simultaneously. Multi-core
processors provide room for faster execution of applications by exploiting the
available parallelism (mainly thread-level parallelism), which implies the ability
to work on multiple problems simultaneously. Parallel multi-threaded
applications have high computational and throughput requirements; to expedite
their execution, they spawn multiple threads which run in parallel. Each thread
has its own execution context and is assigned a particular task. In a multi-core
architecture, these threads can run on the same core or on different cores. A
two-core processor with two threads T1 and T2 is depicted in Fig 1.2.
Fig 1.2 Two core processor-T1 and T2 represent the threads
Every core behaves like an individual processor and has its own execution
pipeline. The cores can execute multiple threads in parallel, expediting the
running time of the program as a whole and boosting overall system
performance and throughput. The efficiency of CMPs largely relies on how well
a software application can leverage the multiple cores during execution. Many
applications that have emerged in recent times have been optimized to exploit
multiple cores (when present) to improve performance. Networking
applications and other CPU-intensive applications have benefitted hugely from
the CMP architecture.
1.2. Memory Management in CMP
Memory management forms an integral part of determining the performance of
CMPs. Cache memory is used to hide the latency of main memory and also
helps in reducing processing time. Generally, every core in a CMP environment
has a private L1 cache, and all the cores share a relatively larger unified L2
cache (which in most cases can also be referred to as the shared Last Level
Cache). The L1 cache is split into an 'instruction cache' and a 'data cache' as
shown in Fig 1.3. The mapping method used is 'set-associative' mapping [1],
where the cache is divided into a specified number of sets, with each set
holding a specified number of blocks.

The number of blocks in each set indicates the associativity of the cache. The
size of the L1 cache has to be kept small owing to chip constraints. Since it is
integrated directly into the chip that carries the processor, its data retrieval
time is short.
Fig 1.3 General memory hierarchy in a CMP environment
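To make the set-associative mapping concrete, the sketch below splits a byte address into its tag, set index and block offset. The geometry constants are illustrative examples only, not the configuration evaluated in this thesis (that appears in Chapter 6).

```python
# Illustrative set-associative address decomposition. The geometry
# constants below are examples, not the thesis's simulated configuration.
CACHE_SIZE = 512 * 1024   # total cache capacity in bytes (512 KB)
BLOCK_SIZE = 64           # bytes per cache block
ASSOCIATIVITY = 8         # blocks (ways) per set

NUM_SETS = CACHE_SIZE // (BLOCK_SIZE * ASSOCIATIVITY)

def decompose(address: int):
    """Split a byte address into (tag, set index, block offset)."""
    offset = address % BLOCK_SIZE
    set_index = (address // BLOCK_SIZE) % NUM_SETS
    tag = address // (BLOCK_SIZE * NUM_SETS)
    return tag, set_index, offset
```

An incoming line may occupy any of the 8 ways of its set; which resident line it displaces is precisely what the replacement policy decides.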
1.3. Overview of Existing Replacement Policies
Replacement policies play a vital role in determining the performance of a
system. In a multi-core environment, where processing speed and throughput
are of essence, selecting a suitable replacement technique for the underlying
cache memories becomes crucial. Primary objective of any replacement
4
algorithm is to utilize the available cache space effectively and reduce the number
of data misses to the maximum extent possible. Misses can be of three types
namely- compulsory misses, conflict misses and capacity misses.
Compulsory misses happen when a data item is fetched from the main
memory and stored into the cache for the first time; in most cases they are
unavoidable. Conflict misses are common in direct-mapped cache
architectures, where more than one data item competes for the same cache
line, resulting in eviction of the previously inserted item. As the name suggests,
capacity misses arise from the inability of the cache to hold the entire working
set, which is observed in most applications owing to the small cache size. An
efficient replacement algorithm targets the capacity misses and tries to reduce
them as much as possible.
Since the L2 cache is accessed by multiple cores and holds shared data,
judicious replacement strategies need to be adopted here in order to improve
performance. At present, many applications make use of the LRU cache
replacement algorithm [1], which discards the 'oldest' cache line available, i.e.
the cache line which has not been accessed for the longest period of time. This
implies that the 'age' of every cache line needs to be updated after every
access, which is done with the help of either a counter or a queue data
structure. In the queue-based approach, data that is referenced by the
processor is moved to the top of the queue, pushing all the elements below
down by one step. At any point of time, to accommodate an incoming block,
LRU deletes the block found at the bottom of the queue (the least recently
used data).
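The queue-based bookkeeping described above can be sketched in a few lines. This is a minimal illustration using Python's OrderedDict as the recency queue, not the hardware implementation:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal queue-style LRU: the least recently used key sits at the
    front of the OrderedDict and is evicted first."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.lines = OrderedDict()   # insertion order doubles as recency order

    def access(self, key) -> bool:
        """Touch `key`; return True on a hit, False on a miss."""
        if key in self.lines:
            self.lines.move_to_end(key)      # promote to most recently used
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)   # evict the least recently used
        self.lines[key] = None
        return False
```

Replaying a Fig 1.4-style trace on a 4-entry cache shows the weakness discussed next: the accesses 1, 5, 7 followed by a burst of distinct items evict 1, 5 and 7 before they recur, so the recurring pattern never hits.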
This method seems to work well with many current applications. However, for
certain workloads which possess an unpredictable data reference pattern that
fluctuates frequently with time, LRU can result in poor performance. A closer
observation of LRU reveals the cause of this problem: it assumes that all
applications follow the same reference pattern, i.e. that they abide by the
spatial and temporal locality principles. Any data item that is not referenced by
the processor for some considerable amount of time is tagged as 'least recently
used' and evicted from the cache. When such a data item is required by the
processor in the future, it is not found in the cache and has to be fetched from
the main memory.
The Most Recently Used (MRU) policy works in a similar fashion, with the slight
difference that the victim is chosen from the opposite end of the data structure,
i.e. the most recently referenced data is evicted to accommodate an incoming
block. The above algorithms work fine when the 'locality of reference' is
high. But for applications whose data sets do not exhibit any degree of
regularity in their access pattern, these algorithms can result in sub-optimal
performance. An example of such a pattern is given in Fig 1.4. In the ensuing
sections, individual data items are referred to by enclosing them within curly
braces.

Fig 1.4 Data sequence containing a recurring pattern {1,5,7}

As observed from Fig 1.4, the starting pattern {1,5,7} recurs after a long burst
of random data. If LRU is applied to this set of data, it will result in poor
performance, as there is no regularity in the access pattern.
Pseudo-LRU keeps a single status bit for each cache block. Initially all bits are
reset; every access to a block sets its bit, indicating that the block was recently
used. This too may degrade performance for the above data sequence.

Another replacement technique commonly employed apart from LRU and
MRU is the Not Recently Used (NRU) algorithm. NRU associates a single-bit
counter with every cache line to aid replacement decisions. It can be effective
compared to LRU in certain cases, but a 1-bit counter may not be powerful
enough to provide the required dynamicity.
The random replacement algorithm arbitrarily selects a victim when a
replacement is necessary; it keeps no information about the cache access
history. The Least Frequently Used (LFU) technique tracks the number of times
a block is referenced using a counter, and the block with the lowest count is
selected for replacement.
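The single-bit NRU scheme described above can be sketched for one cache set as follows, with the common convention that a cleared bit marks a line as an eviction candidate and all bits are cleared once every line has been marked recently used. Details vary between implementations, so treat this as illustrative:

```python
class NRUSet:
    """One cache set under NRU: a single 'recently used' bit per line.
    The clear-all fallback is one common convention, not the only one."""
    def __init__(self, ways: int):
        self.tags = [None] * ways
        self.used = [False] * ways   # True = recently used

    def access(self, tag) -> bool:
        """Touch `tag`; return True on a hit, False on a miss."""
        if tag in self.tags:
            self.used[self.tags.index(tag)] = True
            return True
        victim = self._pick_victim()
        self.tags[victim] = tag
        self.used[victim] = True
        return False

    def _pick_victim(self) -> int:
        # Prefer an empty or not-recently-used way; if every bit is set,
        # clear them all and fall back to way 0.
        for i, recently_used in enumerate(self.used):
            if not recently_used:
                return i
        self.used = [False] * len(self.used)
        return 0
```

With one bit of history per line, all lines quickly look alike under an irregular access pattern, which is why the text notes that a 1-bit counter may not provide the required dynamicity.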
1.4. The Problem
An effective replacement algorithm is important for cache management in a
CMP. When it comes to the on-chip cache, it becomes even more crucial, as
the size of the cache memory is considerably small. In order to fine-tune the
performance of the cache, it is imperative to have an efficient replacement
technique in place. Apart from being efficient, it should also be dynamic, i.e.
adjust itself according to the changing data access patterns.

The most prevalently used replacement algorithms, such as Least Recently
Used, Most Recently Used and Not Recently Used, have been proven to work
well with many applications, but they are not dynamic in nature, as discussed
in the earlier section. Irrespective of the data pattern exhibited by the
application, they follow the same technique to select the replacement victim.

It was observed that if a replacement algorithm could make decisions based
on the access history and the current access pattern, the number of cache hits
can increase considerably, thereby reducing the number of misses and
improving the overall performance. The primary aim of this dissertation is to
propose and design such dynamic, novel cache replacement algorithms and
evaluate them against the conventional LRU approach.
1.5. Scope of Work
Managing cache memory involves many challenges that need to be addressed
from various angles. The behavior of various existing replacement policies on
the shared last level cache is analyzed. In multi-threaded applications, shared
data has to be handled with care, because a miss on a data item shared by
multiple threads stalls the execution of many threads. Replacement decisions
need to be dynamic and self-adjusting according to workload pattern changes.
Since the cache is small, the available cache space must be utilized efficiently.
The techniques proposed in this work cover all the above aspects, while
refraining from in-depth cache memory hardware architecture optimizations.
1.6. Research Objectives
The following are the objectives of this research:

- To augment the cache subsystem of a CMP with novel replacement
algorithms for improving the performance of the shared Last Level Cache
(LLC).
- To propose a dynamic and self-adjusting cache replacement policy with
low storage overhead and robustness across workloads.
- To manage the shared cache efficiently between multiple applications.
- To make the LLC thrash-resistant and scan-resistant and maximize its
utilization, which helps accelerate overall performance for memory-intensive
workloads.
1.7. Contributions
In order to address the shortcomings of the conventional replacement
algorithms, three novel counter-based replacement strategies, targeted
towards the shared LLC, have been proposed and evaluated over a set of
multi-threaded benchmarks.

The first of the three methods, the Context Based Data Pattern Exploitation
technique, maximizes cache performance by closely monitoring and adapting
itself to the patterns that occur in the workload sequence with the help of a
block-level 3-bit counter. Results show that the overall number of hits obtained
at the L2 cache level is on average 8% more than the LRU approach, and an
IPC speedup of up to 6% is achieved.
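CB-DPET's actual counter-update and victim-selection rules are specified in Chapter 3; purely to illustrate the kind of block-level 3-bit machinery involved, a generic saturating-counter policy (a placeholder scheme, not the thesis algorithm) can be sketched as:

```python
COUNTER_BITS = 3
COUNTER_MAX = (1 << COUNTER_BITS) - 1   # 3-bit counter saturates at 7

class CounterSet:
    """Generic per-block saturating counters for one cache set. These
    update rules are illustrative only; CB-DPET's own rules differ and
    are defined in Chapter 3."""
    def __init__(self, ways: int):
        self.tags = [None] * ways
        self.count = [0] * ways

    def access(self, tag) -> bool:
        """Touch `tag`; return True on a hit, False on a miss."""
        if tag in self.tags:
            i = self.tags.index(tag)
            self.count[i] = min(self.count[i] + 1, COUNTER_MAX)  # saturate
            return True
        victim = self.count.index(min(self.count))  # lowest-counter block
        self.tags[victim] = tag
        self.count[victim] = 0
        return False
```

Even this crude version shows the appeal over a single NRU bit: a block that keeps earning hits builds up counter value and survives a burst of one-off accesses.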
Logical cache partitioning, the second technique, logically partitions the cache
into four zones, namely Most Likely to be Referenced, Likely to be Referenced,
Less Likely to be Referenced and Never Likely to be Referenced, in decreasing
order of their likeliness factor. Replacement decisions are based on how likely
a block is to be referenced by the processor in the near future. Results show
an improvement of up to 9% in the overall number of hits and 5% in IPC
speedup compared to LRU.
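The ascending-likeliness search can be pictured with a toy victim selector over the four zone labels (MLR, LR, LLR, NLR, matching the thesis abbreviations); the real LCP insertion and promotion rules are given in Chapter 4.

```python
# Zones ordered from least to most likely to be referenced again;
# the victim search starts at the least likely zone.
ZONES = ("NLR", "LLR", "LR", "MLR")

def pick_victim(blocks):
    """blocks: list of (tag, zone) pairs in one cache set. Returns the tag
    of the first block found in the least-likely non-empty zone."""
    for zone in ZONES:
        for tag, z in blocks:
            if z == zone:
                return tag
    return None  # set is empty
```

A block in MLR is evicted only when every less-likely zone is empty, which is the sense in which the search minimizes misses on soon-to-be-referenced data.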
Finally, Sharing and Hit-based Prioritizing is a novel counter-based
replacement technique which makes replacement decisions based on the
sharing nature of the cache blocks. Every cache block is classified into one of
four degrees of sharing, namely not shared (or private), lightly shared, heavily
shared and very heavily shared, with the help of a 2-bit counter. This
sharing-degree information is combined with the number of hits received by the
element to derive a priority on which the replacement decisions are based.
Results show that this method achieves a 9% improvement in the overall
number of hits and an 8% improvement in IPC speedup compared to the
baseline LRU policy.
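The way sharing degree and hit count combine into a single priority can be illustrated as below. The 2-bit encoding mirrors the four sharing degrees named above, but the weighting is an assumption made for illustration; SHP's actual priority scheme is given in Chapter 5.

```python
# 2-bit sharing degree: 0 = not shared (private), 1 = lightly shared,
# 2 = heavily shared, 3 = very heavily shared.
def replacement_priority(sharing_degree: int, hits: int) -> int:
    """Lower value = better eviction candidate. Letting sharing dominate
    hits is an illustrative assumption, not the thesis's exact scheme."""
    return sharing_degree * 4 + min(hits, 3)

def pick_victim(blocks):
    """blocks: list of (tag, sharing_degree, hits) tuples in one set."""
    return min(blocks, key=lambda b: replacement_priority(b[1], b[2]))[0]
```

Under this ordering a private block with few hits is evicted before a heavily shared one, reflecting the observation that a miss on shared data can stall several threads at once.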
1.8. Outline of the Thesis
This thesis is organized into seven chapters, as follows.

Chapter 2 presents the literature survey, which discusses the wide range of
research that has been carried out on cache management in single-core and
multi-core architectures.

Chapter 3 presents a novel counter-based replacement technique, the Context
Based Data Pattern Exploitation Technique. It first presents the different types
of workloads and the basic structure of the policy, then elaborates on the
algorithm and results, and finally summarizes the method.

Chapter 4 presents the Logical Cache Partitioning technique. It first presents
the proposed replacement policy and then evaluates its significance and
results against LRU.

Chapter 5 describes the novel replacement policy, the Sharing and Hit-based
Prioritizing technique, and evaluates its performance.
Chapter 6 describes the simulation environment and the benchmarks used to
evaluate all three methods, and presents the results obtained for each
technique, comparing and contrasting them with the results obtained for the
same set of benchmarks under the conventional LRU approach. The final
section of the chapter presents the storage overhead.
Chapter 7 wraps up by presenting the conclusion and directions for future work.
1.9. Summary
This chapter has provided an overview of the problems associated with
existing cache replacement policies for the shared last level cache in chip
multiprocessors, which forms the essential foundation for Chapters 3, 4 and 5.
It has presented an outline of the thesis, specifically devoting three chapters to
addressing the drawbacks associated with the shared LLC in CMPs. The next
chapter presents the literature survey of various cache replacement policies for
multi-core processors.
2. LITERATURE SURVEY
2.1. Replacement Policies
Replacement policies play a crucial role in CMP cache management.
Considering the on-chip cache memory of a processor, choosing the right
replacement technique is extremely important in determining the performance
of the system. The more adaptive a replacement technique is to the incoming
workload, the lower the overhead involved in transferring data to and from
secondary memory. John Odule and Ademola Osingua have come up with a
dynamically self-adjusting cache replacement algorithm [2] that makes page
replacement decisions based on changing access patterns by adapting to the
spatial and temporal features of the workload. Though it adapts dynamically to
the workload's needs, it does not detect changes or make decisions at a finer
level, i.e. at block level, which can be crucial when the working set changes
frequently.
Many algorithms that work better than LRU on specific workloads have
evolved in recent times. One such algorithm is the Dueling Clock (DC)
algorithm [3], designed to be applied to the shared L2 cache; however, it does
not consider applications that involve multiple threads executing in parallel.
The Effectiveness-Based Replacement technique (EBR) [4] ranks the cache
elements within a set based on recency and frequency of access, and chooses
the replacement victim based on those ranks. In a multi-threaded environment,
a miss on shared data can be pivotal and calls for more attention, which is not
provided here. Competitive cache replacement [5] proposes a replacement
algorithm for inclusive, exclusive and partially-inclusive caches by considering
two knowledge models. Self-correcting LRU [6] proposes a technique in which
LRU is augmented with a framework that constantly monitors and corrects the
replacement mistakes LRU tends to make in its execution cycle. Adaptive
block management [7] proposes a method to use the victim cache more
efficiently, thereby reducing the number of accesses to the L2 cache.
2.2. Data Prefetching and Cache Bypassing
Data prefetching can be very useful in any cache architecture, since it helps in
reducing cache miss latency [8]. The Prediction-Aware Confidence-based
Pseudo LRU (PAC-PLRU) algorithm [9] efficiently makes use of prefetched
information and also saves that information from being unnecessarily evicted
from the cache. There can also be scenarios where the overhead can be
reduced by promoting the data fetched from main memory directly to the
processor, bypassing the cache [10]. Hongliang Gao and Chris Wilkerson's
Dual Segmented LRU algorithm [11] has explored the positive aspects of
cache bypassing in detail. Though cache bypassing [12] can help in reducing
the transfer overhead, it may not sit well with the 'locality principles', i.e.
bypassing a data item likely to be referenced in the near future might not be a
good idea. The counter-based cache replacement and cache bypassing
algorithm [13] tries to locate dead cache lines well in advance and chooses
them as replacement victims, thus preventing such lines from polluting the
cache. An augmented cache architecture comprises two parts: a larger
direct-mapped cache and a smaller, fully-associative cache, which are
accessed in parallel. A replacement technique similar to LRU, known as the
LRU-Like algorithm [14], was proposed for augmented cache architectures.

Backing up a few of the evicted lines in another cache can help increment the
overall hits received. The Adaptive Spill-Receive Cache [15] works on similar
lines by having a spiller cache (one that 'spills' the evicted line) and a receiver
cache (one that stores the 'spilled' lines). But if the cache line reuse prediction
is not efficient, this can result in lots of stale cache lines as the algorithm
executes.
2.3. Effectiveness of Cache Partitioning
In a multi-processor environment, efficient cache partitioning can help improve
the performance and throughput of the system [17,18,19]. Partition-aware
replacement algorithms exploit the cache partitions and try to minimize cache
misses [20,21]. Filter-based partitioning techniques have been widely
proposed in many studies [22,24]. Adaptive Bloom Filter Cache Partitioning
(ABFCP) [23] involves dynamic cache partitioning at periodic intervals, using
bloom filters and counters to adapt to the needs of concurrent applications.
But since every processor core is allotted an array of bloom filters, the
hardware complexity can increase as the number of cores grows. The Pseudo
LRU algorithm (similar to LRU but with lower hardware complexity) has been
proven to work well with partitioned caches [25]. Studies have also been
conducted on enhancing the performance of Non-Uniform Cache Architecture
(NUCA) [26,27,29]. A dynamic cache management scheme [28] aimed at the
Last-Level Caches in NUCA claims to have improved system performance.
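Part of the appeal of bloom filters in schemes such as ABFCP is that approximate membership ("might this core have touched this line?") costs only a few bits per query, with false positives possible but false negatives impossible. A minimal sketch follows; the hash choice and sizes here are arbitrary, and hardware versions use much cheaper hash functions:

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter: k hash positions over an m-bit array.
    May report false positives, never false negatives."""
    def __init__(self, m_bits: int = 1024, k: int = 3):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits)   # one byte per bit, for clarity

    def _positions(self, item: str):
        # Derive k positions by salting one cryptographic hash.
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item: str) -> bool:
        return all(self.bits[p] for p in self._positions(item))
```

The false-positive rate is tuned by the ratio of set bits to array size, which is why ABFCP-style designs must scale their filter arrays with core count, at the hardware cost noted above.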
Although many schemes work well by adapting to the dynamic data
requirements of the applications, they do not attach importance to data that is
shared between the threads of parallel workloads. Hits on such data items not
only improve the performance of those workloads, but also reduce the
overhead involved in frequently transferring data to and from secondary
memory. Dynamic cache clustering [31] groups cache blocks into multiple
dynamic cache banks on a per-core basis to expedite data access; placing
new incoming data items into the appropriate cache bank at run time can,
however, induce transfer overheads. The smaller split L1 data caches proposal
[32] uses a split design for the L1 data cache with separate array and scalar
caches, which has shown performance improvements. Yong Chen et al.
suggest the use of a separate cache to store data access histories and study
its associated prefetching mechanism, which has produced performance
gains [33].
2.4. Embedded System Processors
Embedded processors differ from conventional processors in that they are
tailor-made for very specific applications. Replacement policies have been
proposed that target embedded systems [36,37,38]. The Hierarchical
Non-Most-Recently-Used (H-NMRU) replacement policy [35] is one such
algorithm that has been proven to work well in those processors. [39] deals
with the cache block locking problem and provides a heuristic, efficient
solution, whereas [40] proposes improved procedure placement policies in the
cache memory. A Two Bloom Filter (TBF) based approach [42], proposed
primarily for flash-based cache systems, has been proven to work well
compared to LRU. The proposal made by Pepijn de Langen et al. [43] targets
improving the performance of direct-mapped caches for application-specific
processors. Dongxing Bao et al. suggest a method [44] to improve the
performance of direct-mapped caches in which blocks are evicted based on
their usage efficiency; this method has shown improvement in miss rate.
2.5. Performance of Shared Last-Level Caches
Performance of the shared LLC in multi-core architectures plays a crucial role in
deciding the overall performance of the system, since all the cores access the
data present there. Various studies [45,46,47] have been conducted on
improving shared LLC performance. Phase Change Memory (PCM) behind the
LLC forms an effective low-power alternative to the traditional DRAM in use
today. Writeback-aware Cache Partitioning (WCP) [52] efficiently partitions the
LLC above PCM among multiple applications, and the Write Queue Balancing
(WQB) replacement policy [52] helps in improving the throughput of PCM. But
one of the main drawbacks of PCM is that writes are much slower compared to
DRAM.
The Adaptive Cache Management technique [53] combines SRAM and DRAM
architectures into the LLC and proposes an adaptive DRAM replacement policy
to accommodate the changing access patterns of the applications. Though it
reports improvement over LRU, it does not discuss the hardware overhead that
might result from the change in architecture. Frequency based LRU (FLRU)
[54] takes recent access information, as LRU does, along with partition and
frequency information while making replacement decisions.
Cache Hierarchy Aware Replacement (CHAR) [55] makes replacement
decisions in LLC based on the activities that occur in the inner levels of the
cache. It studies the data pattern that hits/misses in the higher levels of cache
and makes appropriate decisions in LLC. Yingying Tian et al. [56] have
proposed a method which functions along similar lines for making decisions at
the LLC. But propagating such information frequently across the various levels
of cache might slow down the system over time. Cache space is another area
of concern. Instead of utilizing the entire cache space, an Adaptive Subset
Based Replacement Policy (ASRP) [57] has been proposed, where the cache
space is divided into multiple subsets and at any point of time only one subset is
active. Whenever a replacement needs to be made, the victim is selected only
from the active set. But there is a possibility of under-utilization of the available
cache space in this technique. For any replacement algorithm to be successful,
it is always essential to make good use of the available cache space. This can
be achieved by eradicating the dead cache lines i.e. the lines which will never be
referenced by the processor in the future. A great deal of research has been
carried out in this area [58,59]. But the problem which can arise here is that for
applications with an unpredictable data access pattern, the process of dead line
prediction becomes intricate and in the long run the predictor might lose its
accuracy.
Vineeth Mekkat et al. [60] have proposed a method that focuses on improving
LLC performance on heterogeneous multi-core processors by targeting and
optimizing Graphics Processing Unit (GPU) based applications, since they
spawn a huge number of threads compared to other applications.
The effect of miss penalty can have a serious impact on the system efficiency.
Locality Aware Cost Sensitive (LACS) replacement technique [61] works on
reducing the miss penalty in caches. It calculates, for every cache block, the
number of instructions issued during its miss, and when a victim needs to be
selected for replacement, it selects the block for which the minimum number of
instructions was issued. While this helps in reducing the miss penalty, it does
not attach importance to the data access pattern of the application under
execution. Another replacement policy, the LR+5LF policy [62], combines the
concepts of LRU and LFU to find a replacement candidate; hardware
complexity is an issue here. Replacement algorithms that work well with the L1 cache may not
suit the L2 cache. Thus choosing an efficient pair of replacement algorithms for
L1 and L2 becomes important for maximizing the hit rate [63]. Studies have
indicated that execution history like the initial hardware state of the cache can
also impact the sensitivity of replacement algorithms [64]. Memory Level
Parallelism (MLP) involves servicing multiple memory accesses simultaneously
which will save the processor from stalling due to a single long-latency memory
request. The replacement technique proposed in [65] incorporates MLP and
dynamically makes a replacement decision on a miss such that the memory
related stalls are reduced. Tag-based data duplication may occur in caches,
which can be exploited to eliminate lower-level memory accesses, thereby
reducing memory latency [66]. Demand-based associativity [67] involves
varying the associativity of the cache set (basically, the total number of cache
blocks per set) according to the requirements of the application's data access
pattern, though storage complexity will be high in this method. Cooperative caching [68]
proposes a unified framework for managing the on-chip cache. Set pinning [69]
proposes a technique for avoiding inter-core misses and reducing intra-core
misses in a shared cache, which has shown improvement in miss rate.
Adaptive caches [70] propose a method which dynamically selects between two
of four cache replacement algorithms (LRU, LFU, FIFO and Random) by
tracking locality information. Application-aware caching [71] tracks the lifetime
of cache blocks in the shared LLC at runtime for parallel applications and
thereby utilises the cache space more efficiently. EDIP [72] proposes a method
for memory-intensive workloads where a new block is inserted at the LRU
position, and claims to have obtained a lower miss rate than LRU for workloads
with large working sets.
2.6. Cache Thrashing and Cache Pollution
One predominant problem with LRU is that it may result in thrashing (a
condition where most of the time is spent on replacement and there are hardly
any hits at all) when it is applied to memory-intensive applications.
Investigations are being conducted to reduce the effect of thrashing while
making replacement decisions [73,74]. Adaptive Insertion Policies [76] have
tweaked the insertion policy of LRU to reduce cache misses for
memory-intensive workloads. The resulting policy, termed the LRU Insertion
Policy (LIP), has outperformed LRU with respect to thrashing. The OS buffer management technique
[73] tries to reduce thrashing in LLCs by adopting techniques to improve the
management of OS buffer caches. Since it has to operate at an OS level, it may
pull down system performance. The Evicted-Address Filter mechanism [75] keeps
track of the recently evicted blocks to predict the reuse of the incoming cache
blocks thereby mitigating the effects of thrashing but at the cost of
implementation complexity.
Cache pollution happens when the cache is predominantly filled up with stale
data that the processor will never reference in the future. Inefficient cache
replacement algorithms can lead to cache pollution. Cache line reuse prediction
techniques [77,78,79] are required to avoid or mitigate cache pollution.
The software-only pollute-buffer approach [45] works on eliminating LLC
polluters dynamically using a special buffer controlled at the operating-system
level, but as the algorithm executes it may burden the OS, degrading
performance. A Pollution Alleviative L2 cache [80] proposes a method to avoid
the cache pollution that can arise from wrong-path execution on branch
mispredictions, and claims to have improved the IPC and cache hit rate.
The scalable shared cache proposal [81] manages thrashing applications in the
shared LLC of embedded multi-core architectures.
2.7. Prevalence of Multi-Threaded Applications
Concurrent multi-threaded applications have become prevalent in today's
CMPs. Works have been proposed on how a cache can be effectively shared by
multiple threads running in parallel [82]. Co-operative cache partitioning [83]
multiple threads running in parallel [82]. Co-operative cache partitioning [83]
works by efficiently allocating cache resources to threads which are concurrently
executing on multiple cores. Thread Aware Dynamic Insertion Policy (TA DIP)
[84] takes the memory requirements of individual applications into consideration
and makes replacement decisions. Adaptive Timekeeping Replacement (ATR)
[85] policy is designed to be applied primarily at the LLC. It assigns a 'lifetime'
to every cache line. Unlike other replacement strategies such as LRU, it takes
process priority levels and application memory access patterns into
consideration and adjusts the cache line lifetime accordingly to improve
performance. But the conditions that can lead to thrashing have not been
considered in either method [84,85]. Pseudo-LIFO [86] suggests a method for
early eviction of dead blocks and retention of live blocks, which improves cache
utilization and average CPI for multi-programmed workloads.
2.8. Summary
This chapter has presented prior work on various cache replacement
algorithms, including the use of cache partitioning and bypassing, together with
the additional overhead involved in their implementation. An exhaustive study of prior
work on cache thrashing and cache pollution has helped in the design of
dynamic cache replacement schemes that can be applied in high performance
processors. The next chapter discusses a counter based cache replacement
technique for shared last level cache in multi-core processors.
3. CONTEXT-BASED DATA PATTERN EXPLOITATION TECHNIQUE
Data access patterns vary across applications and workloads. A replacement
algorithm can aid in improving system performance only when it works well for
the majority of applications. In short, if a replacement algorithm
could make decisions based on the access history and the current access
pattern of the workloads, the number of hits received can increase considerably
[87,88,89].
The cache replacement policy acts as a main deciding factor of system
performance and efficiency in Chip Multi-Core Processors.
Conventional replacement approaches work well for applications which have a
well-defined and a predictable access pattern for their workloads. But when the
workload of the applications is dynamic and constantly changes pattern,
traditional replacement techniques may fail to produce improvement in the cache
metrics. Hence this chapter discusses a novel cache replacement policy that is
targeted towards applications with dynamically varying workloads. The
Context-Based Data Pattern Exploitation Technique (CB-DPET) assigns a counter to every block of
the L2 cache. It then closely observes the data access patterns of various
threads and modifies the counter values appropriately to maximize the overall
performance. Experimental results obtained by using the PARSEC benchmarks
have shown an average improvement of 8% in overall hits at L2 cache level
when compared to the conventional LRU algorithm.
3.1. Introduction
Memory management plays an essential role in determining the performance
of CMPs. Cache memory is indispensable in CMPs to enhance data retrieval
time and to achieve high performance. The level 1 cache or the L1 cache needs
to be small in size owing to the hardware limitations but since it is directly
integrated into the chip that carries the processor, accessing it typically takes
only a few processor cycles when compared to the accessing time of the next
higher levels of memories. Accessing the L2 cache may not be as fast as
accessing the L1 cache, but since it is larger in size and is shared by all the
available cores, managing it effectively becomes more important. This chapter
puts forth a dynamic replacement approach called CB-DPET that is targeted
mainly towards the shared L2 cache.
3.1.1. Different Types of Application Workloads
The following are the types of workload patterns [90] exhibited by many
applications. Let 'x' and 'y' denote sequences of references, where 'xi' denotes
the address of a cache block. 'a', 'p' and 'n' are integers, where 'a' denotes the
number of times the sequence repeats itself.
Thrashing Workload:
[… (x1, x2, …, xn)a …], a > 0, n > cache size
These workloads have a working set size that is greater than the size of the
available shared cache. They may not perform well under LRU replacement
policy. Sample data pattern for a thrashing workload for a cache which can hold
8 blocks at a time can be as follows
. . . 8, 9, 2, 6, 11, 20, 4, 5, 10, 16, 80 . . .
Here, cache_size = 8 blocks, x1 = 8, x11 = 80, n = 11 and a = any value > 0.
On the first scan, elements '8' through '5' will be fetched and loaded into the
cache. When the next element '10' arrives, the cache will be full and hence a
replacement has to be made. The same is the case for the remaining data
items '16', '80' and so on.
Regular Workload:
[… (x1, x2, …, xn-1, xn, xn-1, …, x2, x1)a …], n > 0, a > 0
Applications possessing this type of workload benefit hugely from the cache
usage. The LRU policy suits them perfectly, since the data elements follow a
stack-based pattern. A sample data pattern for a regular workload for a cache
which can hold 8 blocks at a time is as follows
. . . 8, 9, 2, 6, 11, 20, 4, 5, 4, 20, 11 . . .
Here, cache_size = 8 blocks, x1 = 8, x7 = 4, x8 = 5, n = 8 and a = any value > 0.
After the cache gets filled up with the eighth element '5', the remaining
elements '4', '20' and '11' result in back-to-back cache hits.
Random Workload:
[… (x1, x2, …, xn) …], n → ∞
Random workloads possess poor cache re-use. They generally tend to thrash
under any replacement policy. Sample data pattern for a random workload for a
cache which can hold 8 blocks at a time can be as follows
. . . 1, 9, 2, 8, 0, 12, 22, 99, 101, 50, 3 . . .
Recurring Thrashing Workload:
[… (x1, x2, …, xn)a … (y1, y2, …, yp) … (x1, x2, …, xn)a …], a > 0, p > 0, n > cache size
Sample data pattern for a recurring thrashing workload for a cache which can
hold 8 blocks at a time can be as follows
. . . 8, 9, 2, 6, 11, 20, 4, 5, 7, 17, 46, 10, 16, 80, 90, 99, 8, 9, 2 . . .
Here, cache_size = 8 blocks, x1 = 8, x11 = 46, y1= 10, y5= 99, n = 11 and a = 1.
On the first scan, elements '8' through '5' will be fetched and loaded into the
cache. When the next element '7' arrives, the cache will be full and hence a
replacement has to be made. The same is the case for the remaining data items
'17', '46' and so on. By the time the initial pattern recurs towards the end, it
would already have been replaced, resulting in cache misses.
Recurring Non-Thrashing Workload:
[… (x1, x2, …, xn)a … (y1, y2, …, yp) … (x1, x2, …, xn)a …], a > 0, p > 0, n ≤ cache size
It is similar to the recurring thrashing workload, except that the working set fits within the cache.
Sample data pattern for a recurring non-thrashing workload for a cache which
can hold 8 blocks at a time can be as follows
. . . 8, 9, 2, 6, 11, 20, 4, 5, 10, 16, 80, 8, 9, 2 . . .
Here, cache_size = 8 blocks, x1 = 8, x8 = 5, y1 = 10, y3 = 80, n = 8 and a = 1.
On the first scan, elements '8' through '5' will be fetched and loaded into the
cache. When the next element '10' arrives, the cache will be full and hence a
replacement has to be made. The same is the case for the remaining data
items '16' and '80'. The initial pattern then recurs towards the end. If LRU is
applied here, it will result in 3 cache misses towards the end, as it would have
replaced data items '8', '9' and '2'.
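The LRU behaviour claimed for these sample patterns can be checked with a small simulation (a fully-associative, single-set model; the function name `lru_run` is this sketch's own, not from the thesis):

```python
from collections import OrderedDict

def lru_run(sequence, cache_size):
    """Simulate a fully-associative LRU cache; return (hits, misses)."""
    cache = OrderedDict()
    hits = misses = 0
    for block in sequence:
        if block in cache:
            hits += 1
            cache.move_to_end(block)        # promote to most-recently-used
        else:
            misses += 1
            if len(cache) >= cache_size:
                cache.popitem(last=False)   # evict the least-recently-used block
            cache[block] = True
    return hits, misses
```

On the thrashing sequence above this reports 11 misses and no hits; on the recurring non-thrashing sequence it reports 14 misses and no hits, confirming the 3 extra misses when {8, 9, 2} recur; the regular workload yields 3 hits out of 11 accesses.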
In the recurring non-thrashing pattern, if 'n' is large, then LRU can result in
sub-optimal performance. Data items x2 to xn which recur after a short burst of
random data (y1 to yp) may result in cache misses. The problem here is that
even if a data item has received many cache hits in the past, if it is not
referenced by the
the processor looks for that data item again it results in a miss. This stresses the
need for a replacement technique that adapts itself to the changing workload
and makes replacement decisions appropriately.
This chapter's contributions include:
A novel counter based replacement algorithm called CB-DPET which tries
to maximize the cache performance by closely monitoring and adapting
itself to the pattern that occurs in the sequence.
Association of a 3-bit counter (which will be referred to as the PET
counter from here onwards) with every cache block in the cache set.
A set of rules to adjust the counter value whenever an insertion,
promotion or a replacement happens such that the overall hit percentage
is maximized.
Extension of the method to suit multi-threaded applications and also to
avoid cache thrashing.
PET counter boundary determination such that there will not be any stale
data in the cache for long periods of time, i.e. non-accessed data 'matures'
gradually with every aging step and gets removed at some point of time to
pave the way for new incoming data sets.
3.2. Basic Structure of CB-DPET
This section and the following sub-sections elaborate on the basic concepts
that make up CB-DPET, namely DPET and DPET-V. The mapping method is
set-associative mapping [1]. A counter called the PET counter is associated
with every block in the cache set. Conceptually, this counter holds the 'age' of
each cache block. The maximum value that the counter can hold is set to 4,
which implies that only 3 bits are required to implement it at the hardware level,
thereby not increasing the complexity. The minimum value that the counter can
hold is '0'. The upper bound that a 3-bit counter could actually hold is '7'
(2^3 - 1), but the value of '4' has been chosen. This is because any larger value
can result in stale data polluting the cache for longer periods of time, while if a
value below '3' had been chosen, the performance would be more or less
similar to that of NRU. Thus the upper bound was taken to be '4'. The algorithm
which implements the task of assigning the counter values based on the access
pattern is referred to as the 'Age Manipulation Algorithm' (AMA).
3.3. DPET
A block with a lower PET counter value can be regarded as 'younger' when
compared to a block having a higher PET value. Initially, the counter values of
all the blocks are set to '4'; conceptually, every block is regarded as 'oldest'. In
this sense, the block considered 'oldest' (PET counter value 4) is chosen as the
replacement candidate when the cache gets filled up. If no oldest block can be
found, the counter value of every block is incremented repeatedly till an oldest
cache block is found. This block is then chosen for replacement. Once the new
block has been inserted into this slot, AMA decrements the counter associated
with it by '1' (it is set to '3'). This is done because the block can no longer be
regarded as 'oldest', as it has been accessed recently. It also cannot be given a
very low value, as AMA is not yet sure about its future access pattern. Hence it
is inserted with a value of '3'. When a data hit occurs, that block is promoted to
the 'youngest' level irrespective of which level it was in previously. In this way,
more time is given for data items which repeat themselves in a workload to stay
in the cache.
To put this in more concrete terms,
A. DPET associates a counter, which ranges from '0' to '4', with every cache
block.
B. Initially, the counter values of all the blocks are set to '4'. Conceptually,
every block is regarded as 'oldest'.
C. When a new data item comes in, the oldest block needs to be replaced.
Since contention can arise in this case, AMA chooses the first 'oldest' block
from the bottom of the cache as a tie-breaking mechanism.
D. Once the insertion is made, AMA decrements the counter associated with
the block by '1' (it is set to '3'). This is done because the block can no longer
be regarded as 'oldest', as it has been accessed recently. It also cannot be
given a very low value, as AMA is not sure about its future access pattern.
Hence it is inserted with a value of '3'.
E. In the case where all the blocks are filled up (without any hits in between),
a situation arises where all the blocks have a PET counter value of '3'.
When a new data item arrives, there is no block with a counter value of '4'.
As a result, AMA increments all the counter values by '1' and performs the
check again. If a replacement candidate is found this time, AMA executes
steps C and D again; otherwise it increments the values again. This is
carried out repeatedly until a suitable replacement candidate is found.
F. Now consider the situation where a hit occurs for a particular data item in
the cache. This might be the start of a specific data pattern, so AMA gives
high priority to this data item and restores its associated PET counter value
to '0', terming it the 'youngest' block.
On the other hand, if a miss is encountered, the situation becomes quite similar
to what AMA dealt with in the initial steps, i.e. it scans all the blocks from the
bottom to find the 'oldest' block. If found, it proceeds to replace it; otherwise the
counter values are incremented and the same search is performed again, and
this process continues till a replacement candidate is found.
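Steps A–F can be sketched as a small single-set simulator in Python (an illustrative model only; the class and method names are this sketch's own, not part of the thesis hardware design):

```python
class DPETSet:
    """One cache set under DPET; PET counters range from 0 ('youngest')
    to 4 ('oldest')."""
    OLDEST = 4

    def __init__(self, associativity):
        self.tags = [None] * associativity
        self.pet = [self.OLDEST] * associativity  # step B: all start 'oldest'

    def find_victim(self):
        # Scan from the bottom for the first 'oldest' block (step C);
        # if none exists, age every block by 1 and search again (step E).
        while True:
            for i in range(len(self.tags) - 1, -1, -1):
                if self.pet[i] == self.OLDEST:
                    return i
            for i in range(len(self.tags)):
                self.pet[i] += 1

    def access(self, tag):
        """Return True on a hit, False on a miss (with replacement)."""
        if tag in self.tags:
            self.pet[self.tags.index(tag)] = 0  # step F: promote to 'youngest'
            return True
        victim = self.find_victim()
        self.tags[victim] = tag
        self.pet[victim] = 3                    # step D: insert one below 'oldest'
        return False
```

For example, with a 2-way set, two misses leave both counters at '3'; a third distinct access ages both blocks to '4' and evicts the bottom-most one, exactly as described in step E.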
3.4. DPET-V : A Thrash-Proof Version
This method is a variant of DPET (in the following sections it will be referred to
as DPET-V). Sometimes, for workloads with a working set greater than the
available cache size, there is a possibility that there will not be any hits at all;
the entire time will be spent only in replacement. This condition is often referred
to as thrashing. The performance of the cache can be improved if some fraction
of a randomly selected working set is retained in the cache for some more time,
which will result in an increase in cache hits. Hence, very rarely, AMA
decrements the PET counter value by '2' (instead of '1') after an insertion is
made, so the new counter value post insertion will be '2' instead of '3'.
Conceptually, this allows more time for the data item to stay in the cache. This
technique is referred to as DPET-V. The probability of inserting with '2' is
chosen as approximately 5/100, i.e. out of 100 blocks, the PET counter value of
(approximately) 5 blocks will be set to '2' while the others will be set to '3'
during insertion.
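The only change DPET-V makes is at insertion time; that decision can be sketched as follows (the function name and the injectable `rng` parameter are assumptions of this sketch):

```python
import random

def dpet_v_insertion_counter(rng=random):
    """DPET-V insertion rule: roughly 5 of every 100 insertions start at
    counter value 2 (one extra aging round before eviction); the rest
    start at 3, exactly as in plain DPET."""
    return 2 if rng.random() < 0.05 else 3
```

Blocks inserted with '2' need two increment passes before reaching '4', so a small random slice of a thrashing working set survives the next burst and can hit when the pattern recurs.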
Fig 3.1 Data sequence which will result in thrashing for DPET
Algorithm 1: Implementation for Cache Miss, Hit and Insert Operations
Init:
Initialize all blocks with PET counter value '4'.
Input:
A cache set instance
Output:
removing_index /** Block id of the victim **/
Miss:
/** Invoke the FindVictim method; the set is passed as a parameter **/
FindVictim(set):
while 1 do /** loop until a victim is found **/
    removing_index = -1;
    /** Scan from the bottom **/
    for i = associativity-1 downto 0 do
        if set.blk[i].pet == 4 then
            /** Block found; exit loop **/
            removing_index = i;
            break;
    if removing_index == -1 then
        /** Increment the PET value of every block by 1 and search again **/
        for i = associativity-1 downto 0 do
            set.blk[i].pet = set.blk[i].pet + 1;
    else
        break; /** break from the main while loop **/
return removing_index;
Hit:
set.blk.pet = 0; /** on the block that hit **/
Insert:
set.blk.pet = 3; /** on the newly inserted block **/
Correctness of Algorithm
Invariant:
At any iteration i, the PET counter value of set.blk[i] holds a value in the range
'0' to '4'.
Initialization:
When i = 0, set.blk[i].pet is initially '4'. After the block is selected as the victim
and the new item has been inserted, set.blk[i].pet becomes '3'. Hence the
invariant holds at initialization.
Maintenance:
At i = n, set.blk[i].pet contains either '0' (if a cache hit has happened), or '3' (if
the block has been newly inserted), or '1', '2' or '4' (if a replacement needs to
be made but a block with PET counter '4' has not yet been found). Hence the
invariant holds for i = n, and likewise for i = n + 1, since the individual iterations
are mutually exclusive. Thus the invariant holds for i = n + 1.
Termination:
The inner loop terminates when a replacement candidate is found, and a
candidate will be found in every run, since the counter value of each block
remains in the range '0' to '4'. A candidate is found as soon as a counter value
reaches '4'. Since every block is incremented by '1' until a value of '4' appears,
the boundary condition "if set.blk[i].pet == 4" is bound to be satisfied,
terminating the loop after a finite number of iterations. The invariant can also
be seen to hold for all the cache blocks present when the loop terminates.
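The invariant and the termination argument can also be exercised mechanically (an illustrative fuzz harness; the function names, working-set size and seed are this sketch's own assumptions):

```python
import random

def find_victim(pet):
    """FindVictim from Algorithm 1: scan from the bottom for a counter of 4,
    aging every block by 1 until one is found."""
    while True:
        for i in range(len(pet) - 1, -1, -1):
            if pet[i] == 4:
                return i
        for i in range(len(pet)):
            pet[i] += 1

def fuzz(accesses=10000, associativity=8, seed=1):
    """Drive one set with random block addresses and check the invariant
    0 <= pet <= 4 after every operation."""
    rng = random.Random(seed)
    tags, pet = [None] * associativity, [4] * associativity
    for _ in range(accesses):
        tag = rng.randrange(64)
        if tag in tags:
            pet[tags.index(tag)] = 0          # hit: promote to 'youngest'
        else:
            v = find_victim(pet)              # terminates by the argument above
            tags[v], pet[v] = tag, 3          # insert one step below 'oldest'
        assert all(0 <= c <= 4 for c in pet)  # the invariant holds throughout
    return True
```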
It is observed that if DPET is applied over the working set shown in Fig 3.1,
there will not be any hits at all (Fig 3.2(i)). After the first burst {8,9,2,6,5,0}, the
PET counter value of all blocks is set to '3'. When the next burst {11,12,1,3,4,15}
arrives, none of them results in a hit. When the third burst arrives, again all the
old blocks are replaced with new ones. Fig 3.2(ii) shows the action of DPET-V
over the same data pattern. According to DPET-V, AMA has randomly selected
two blocks (containing data items {5} and {2}) whose counter values were set to
'2' instead of '3' while inserting data burst 1. When burst 2 arrives, those two
blocks are left untouched, as their PET counter values have not matured to '4'
yet. Finally, when the third burst of data {2,9,5,8,6} arrives, it results in two more
hits (for {5} and {2}) compared to DPET. Hence, by giving those two blocks
more time to stay in the cache, the hit rate has been improved. Fig 3.3(i),
3.3(ii), 3.3(iii) and 3.3(iv) show the working of CB-DPET in the form of flow
diagrams.
Fig 3.2 (i) Contents of the cache at the end of each burst for DPET
(ii) Contents of the cache at the end of each burst for DPET-V
[Flow diagram: the processor looks for a particular block 'c' in the cache and
AMA searches for it. On a HIT, the PET counter value of 'c' is set to 0. On a
MISS, the processor brings the data item from the next level of memory, a
replacement victim is found in the cache and the new data item is inserted.]
Fig 3.3 (i) CB-DPET flow Diagram
[Flow diagram: AMA searches from the cache bottom for the first block 'c' with
a PET counter value of '4'. If such a block is found, it is chosen as the
replacement victim; otherwise the counter values of all blocks are incremented
by 1 and the search is repeated.]
Fig 3.3 (ii) Victim Selection flow
[Flow diagram: the incoming data item is inserted into the victim block and its
PET counter value is set to 3.]
Fig 3.3 (iii) Insert flow- DPET
[Flow diagram: the incoming data item is inserted into the victim block; for 95
out of every 100 blocks the PET counter value is set to 3, and for 5 out of every
100 it is set to 2.]
Fig 3.3 (iv) Insert flow DPET-V
3.5. Policy Selection Technique
A few test sets in the cache are allocated to choose between DPET and
DPET-V. Among those test sets, DPET is applied over a few sets and DPET-V
is applied over the remaining sets and they are closely observed. Based on the
performance obtained here, either DPET or DPET-V is applied across the
remaining sets in the cache (apart from the test sets). A separate global register
(which from here onwards will be referred to as Decision Making register or DM
register) is maintained to keep track of the number of misses encountered in the
test sets for both the techniques. Initially the value of this register is set to zero.
Fig 3.4 Policy selection using DM register
[Flow diagram: on a data miss on cache block 'c', if 'c' lies in a DPET-assigned
test set, the DM register is incremented by 1 and the new data item is inserted
with DPET; if 'c' lies in a DPET-V-assigned test set, the DM register is
decremented by 1 and the new item is inserted with DPET-V; otherwise the
value of DM is checked, a replacement victim is found, and the new item is
inserted with DPET if DM <= 0 or with DPET-V if DM > 0.]
Fig 3.5 CB-DPET Policy selection flow
For every miss encountered in a test set which has DPET running on it, the
register value is incremented by one, and for every miss which DPET-V
produces in its test sets, AMA decrements the register value by one.
Effectively, if at any point of time the DM register value is positive, it means that
DPET has been issuing more misses; similarly, if the register value goes
negative, it indicates that DPET-V has resulted in more misses. Based on this
value, AMA makes the apt replacement decision at run time. This entire
algorithm, which alternates dynamically between DPET and DPET-V, is termed
Context-Based DPET. Fig 3.4 depicts policy selection using the DM register
and Fig 3.5 shows the policy selection flow.
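The DM-register mechanism can be sketched as follows (a minimal model; the class name, method names and string labels are this sketch's own, not the thesis's):

```python
class PolicySelector:
    """Sketch of CB-DPET policy selection via the DM register: misses in
    DPET test sets increment it, misses in DPET-V test sets decrement it."""

    def __init__(self):
        self.dm = 0  # Decision Making register, initially zero

    def record_test_set_miss(self, in_dpet_test_set):
        # each test-set miss nudges the register towards the other policy
        self.dm += 1 if in_dpet_test_set else -1

    def policy_for_follower_sets(self):
        # a positive DM means DPET has been missing more, so use DPET-V
        return "DPET-V" if self.dm > 0 else "DPET"
```

With DM at its initial value of zero, DPET is chosen, matching the default described in the following section.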
3.6. Thread Based Decision Making
The concept of CB-DPET is extended to suit multi-threaded applications with a
slight modification. For every thread, a DM register and a few test sets are
allocated. Before the arrival of the actual workload, each thread is assigned to
run DPET on one half of its allocated test sets and DPET-V on the remaining
half. The sets running DPET (or DPET-V) need not be consecutive.
These sets are selected in a random manner and can occur intermittently, i.e. a
set assigned to run DPET can immediately follow a set assigned to run
DPET-V, or vice versa. On the whole, the total number of test sets allocated for
each thread is equally split between DPET and DPET-V. Now when the actual
workload arrives, the thread starts to access the cache sets. Two possibilities
arise here.
Case 1:
The thread can access its own test set. Here, if there is a data miss, the thread
updates its DM register appropriately, depending on whether DPET or DPET-V
was assigned to that set, as discussed in the previous section.
Case 2:
The thread gets to access any of the other sets apart from its own test sets.
Here, either DPET or DPET-V is applied depending on the thread's DM register
value. If this access turns out to be the very first cache access of that thread,
then its DM register will have a value of '0', in which case the policy is chosen
as DPET.
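The per-thread extension can be sketched as follows (again an illustrative model; the class and method names are this sketch's own assumptions):

```python
class ThreadPolicySelector:
    """Per-thread DM registers (sketch): each thread tracks misses in its
    own DPET / DPET-V test sets and picks a policy for the other sets."""

    def __init__(self):
        self.dm = {}  # thread id -> DM register value

    def record_test_set_miss(self, tid, in_dpet_test_set):
        # a miss in the thread's DPET half adds 1; in its DPET-V half, -1
        self.dm[tid] = self.dm.get(tid, 0) + (1 if in_dpet_test_set else -1)

    def policy(self, tid):
        # a thread's very first access sees DM == 0, so DPET is chosen
        return "DPET-V" if self.dm.get(tid, 0) > 0 else "DPET"
```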
3.7. Block Status Based Decision Making
When more than one thread tries to access a data item, it can be regarded as
shared data. Compared to data that is accessed by only one thread (private
data), shared data has to be kept in the cache for a longer duration, because
multiple threads access it at different points of time and, if a miss is
encountered, the resulting overhead would be relatively high. Taking this into
consideration, it is possible to find out whether each block holds shared or
private data (using the id of the thread that accesses it) and decide on the
counter values appropriately. To serve this purpose, a 'blockstatus' register is
associated with every cache block. When a thread accesses the block, its
thread id is stored in the register. When another thread later tries to access the
same block, the value of the register is set to a random number that does not
fall within the threads' id range. So if the register holds such a number, the
corresponding block is regarded as shared; otherwise it is treated as private.
3.8. Modification in the Proposed Policies
In the case of DPET, when a block is found to be shared, insertion is made with
a counter value of '2' instead of '3'. Similarly, in DPET-V, for shared data,
insertion is always made with a PET counter value of '2'. If a hit is encountered
in either of the methods, the PET counter value is set to '0' (as discussed
earlier) only if the block is shared. For private blocks, the counter value is set to
'1' on a hit.
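These rules can be sketched as follows (the fixed sentinel value and the function names are assumptions of this sketch; the thesis stores a random out-of-range value, simplified here to a constant, with thread ids assumed non-negative):

```python
SHARED = -1  # sentinel outside any valid thread-id range (assumption)

def update_block_status(status, thread_id):
    """Track a block's 'blockstatus' register: the first accessor's id marks
    it private; an access by any other thread flips it to the shared sentinel."""
    if status is None:
        return thread_id          # first access: record the owner
    if status != thread_id and status != SHARED:
        return SHARED             # a second thread touched it: now shared
    return status

def insertion_counter(shared):
    # shared blocks enter 'younger' (2) so they stay longer; private enter at 3
    return 2 if shared else 3

def promotion_counter(shared):
    # on a hit, shared blocks become youngest (0); private blocks get 1
    return 0 if shared else 1
```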
Fig 3.6 Data sequence containing a recurring pattern {1,5,7}
3.9. Comparison between LRU, NRU and DPET
3.9.1. Performance of LRU
The behaviour of LRU for the data sequence in Fig 3.6 is shown in Fig 3.7. It is
assumed that there are five blocks in the cache set under consideration. The
workload sequence (divided into bursts) is specified in Fig. 3.7. Below every
data item, 'm' is used to indicate a miss in the cache and 'h' is used to indicate
a hit. The letter 'N' in the cache denotes null or invalid data. The cache set can
be viewed as a queue, with insertions taking place at the top and deletions at
the bottom. Additionally, if a hit occurs, that particular data item is promoted to
the top of the queue, pushing the other elements down by one step. Hence, at
any point of time, the most recently accessed element is kept at the top and the
'least recently used' element is found at the bottom, which becomes the
immediate candidate for replacement. In this example, initially the stream
{1, 5, 7} results in misses and the data is fetched from main memory and placed
in the cache.
Fig 3.7 Behaviour of LRU for the data sequence
Now data item {7} hits and is brought to the top of the queue (in this case it
is already at the top). Data items {1} and {5} result in two more consecutive
hits, bringing {5} to the top. The ensuing elements {9, 6} result in misses and
are brought into the cache one after another. In the burst {5, 2, 1}, only the
data item {2} misses. Then {0} too results in a miss and is brought from the
main memory. In the stream {4, 0, 8}, the data item {0} hits. Finally, the burst
{1, 5, 7} recurs towards the end, where {5, 7} result in two misses, as
indicated in Fig. 3.7. The overall number of misses encountered here amounts to
eleven.
3.9.2. Performance of NRU
The working of NRU on the given data set is shown in Fig. 3.8. NRU
associates a single-bit counter with every cache line, which is initialized to
'1'. When there is a hit, the value is changed to '0'. A '1' indicates that the
data block has remained unreferenced in the cache for a longer period of time
compared to other blocks. Thus, the first block from the top which has a counter
value of '1' is selected as the victim for replacement; scanning from the top
helps in breaking ties. If no such block is found, the counter values of all the
blocks are set to '1' and the search is performed again. In Fig. 3.8, the
counter values affected in every burst are underlined for clarity. As observed
from the figure, NRU results in one more miss than the overall misses
encountered in LRU. Towards the end, the burst {5, 7} results in two misses.
This is primarily because NRU tags the stream {5, 7} as 'not recently used', as
it had not seen these items for some time, and thus chooses them as replacement
victims. In the long run, as the data set grows in a similar fashion, the number
of misses will increase considerably.
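The NRU victim-selection rule described above can be sketched as follows (an illustrative sketch for a single cache set, not the thesis implementation; the list-of-bits representation is an assumption):

```python
def nru_find_victim(ref_bits):
    """Return the index of the first block (scanning from the top) whose
    reference bit is '1'; if none exists, reset all bits to '1' and take
    the first block. ref_bits: list of 0/1 values, one per cache line,
    where 1 means 'not recently used'."""
    for i, bit in enumerate(ref_bits):
        if bit == 1:
            return i
    # No candidate found: reset every bit and restart the scan.
    for i in range(len(ref_bits)):
        ref_bits[i] = 1
    return 0
```

Scanning from index 0 mirrors the top-down tie-breaking rule in the text.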
One feature common to both algorithms is that they follow the same technique
irrespective of the pattern that the data set follows. This can reduce the hit
rate in some cases. In the case of LRU, when an item is accessed by the
processor, it is immediately taken to the top of the queue, pushing the
remaining elements one step down, irrespective of whatever pattern the data set
might follow thereafter. Thus, even if an element in this pushed-down group
repeats itself after some time, it might already have been marked as 'least
recently used' and evicted from the cache. NRU is also not powerful, as it has
only a single-bit counter, which does not allocate enough time for data items
that might be accessed in the near future.
Fig 3.8 Behaviour of NRU for the data sequence
3.9.3. Performance of DPET
Fig. 3.9 shows the contents of the cache at various points when DPET is
applied, along with the burst of input data being processed at that moment.
The PET counter value associated with every block is shown as a subscript to
the actual data item, and the counter value of the block directly affected at
that instant is underlined for clarity. The input data set has the letters 'm'
and 'h' under every data item to indicate a miss or a hit respectively. The
behaviour is studied with a cache set consisting of five blocks. Initially, AMA
assigns a value of '4' to the PET counters associated with all the blocks. As
seen from the figure, the burst {1, 5, 7, 7} arrives first. Data items {1, 5, 7}
are placed in the cache in a bottom-up fashion and their PET values are
decremented by '1'. When the data item {7} arrives again, it hits in the cache
and the counter value associated with {7} is set to '0'. Data items {1, 5}
arrive individually and result in two hits; the corresponding PET values are set
to '0' as indicated in the figure. The next burst contains the elements {9, 6},
both resulting in cache misses. They are fetched from the main memory, stored in
the remaining two blocks of the cache set, and their counter values are
decremented appropriately. The fifth data burst comprises the data items
{5, 2, 1}. Among these, {5} is already present in the cache and hence its
counter value is set to '0'. Data item {2} is not present in the cache, so, as
with DPET, AMA searches bottom-up for a replacement candidate whose counter
value has matured to '4'. Since no such block is found, the PET values of all
the blocks are incremented by '1' and the search is performed again. Now the PET
value of the block containing {9} has matured to '4'; it is replaced by {2},
whose counter value is decremented to '3'. Data item {1} hits in the cache. The
overall number of misses is reduced compared to LRU and NRU.
Fig 3.9 Behaviour of DPET for the data sequence
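The DPET victim search walked through above can be sketched as follows. This is an illustrative approximation based solely on the walkthrough; details such as saturating the counter at the maximum value are assumptions.

```python
PET_MAX = 4  # initial/eviction PET counter value, per the walkthrough

def dpet_find_victim(pet):
    """Scan bottom-up for a block whose PET counter has matured to
    PET_MAX; if none exists, age every block by '1' and scan again.
    pet: list of PET counters for one set, index 0 = bottom of the set.
    Returns the index of the victim block."""
    while True:
        for i in range(len(pet)):          # bottom-up scan
            if pet[i] == PET_MAX:
                return i
        for i in range(len(pet)):          # no candidate: age all blocks
            pet[i] = min(pet[i] + 1, PET_MAX)
```

On the example burst {5, 2, 1}, no counter is initially at '4', so every block ages by one step before the block holding {9} matures and is selected.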
3.10. Results
To evaluate the effectiveness of the replacement policy on real workloads, a
performance analysis has been carried out with the Gem5 simulator using PARSEC
benchmark programs, discussed in Chapter 6, as input sets. The results are
shown in this section.
3.10.1. CB-DPET Performance
The effectiveness of this method is visible in the following figures, as it has
surpassed LRU. Fig. 3.10 shows the overall number of hits obtained for each
workload with respect to CB-DPET and LRU at the L2 cache. The overall number of
hits obtained in the ROI is taken. These numbers have been scaled down by
appropriate factors to keep them within the same range. It is observed from the
figure that CB-DPET has improved the number of hits compared to LRU for the
majority of the workloads, with fluidanimate showing a high performance with
respect to the baseline policy.
Fig 3.10 Overall number of hits recorded in L2 cache
Fig. 3.11 shows the hit rate obtained for individual benchmarks. Hit rate is given
by the following expression.
Hit Rate = (Overall number of hits in ROI) / (Overall number of accesses in ROI) ……… (1)
Six of the seven benchmarks have reported improvements in the hit rate.
Fluidanimate shows a maximum improvement of 8%, whereas swaptions and vips have
exhibited hit rates on par with LRU.
Fig. 3.12 shows the number of hits obtained in the individual parallel phases.
As discussed earlier, the ROI can be further divided into two phases, denoted
PP1 (Parallel Phase 1) and PP2 (Parallel Phase 2). These phases need not be
equal in complexity, number of instructions, etc.; PP1 might be more complex
with a huge instruction count compared to PP2, or vice versa. The number of hits
recorded in these phases is shown in the graph in Fig. 3.12. CB-DPET has shown
an increase in hits in most of the phases of the majority of the benchmarks.
Fig 3.11 L2 cache hit rate
Miss latency at a memory level refers to the time spent by the processor to
fetch data from the next level of memory (the primary memory in this case)
following a miss at that level. Miss latency at the L2 cache for individual
cores is shown in Fig. 3.13. It is usually measured in cycles. To present it in
a more readable format, it has been converted to seconds using the following
formula:
Miss latency (in seconds) = (Miss latency in cycles) / (CPU clock frequency) ……… (2)
The impact of miss latency is an important factor to take into consideration. A
greater miss latency means that the overhead involved in fetching the data is
high, resulting in a performance bottleneck.
Fig 3.12 Hits recorded in individual phases of ROI
Fig 3.13 Miss latency at L2 cache for individual cores
Overall miss latency at L2 is also shown in Fig. 3.13. Note that this is with
respect to the ROI of the benchmarks. Considerable reduction in latency is
observed across the majority of the benchmarks. Vips has shown the maximum
reduction in overall miss latency. Except for ferret and swaptions, all the
other workloads have shown a reduction in the overall miss latency.
Percentage increase in hits = ((Hits in CB-DPET − Hits in LRU) / Hits in LRU) × 100 ……… (3)
Fig. 3.14 shows the percentage increase in hits obtained by CB-DPET compared to
LRU. The figure shows an increase in the hit percentage for all the benchmarks
except ferret. Swaptions exhibits the highest percentage increase of 13.5%,
whereas dedup has shown a 2% increase. An average improvement of approximately
8% over LRU in the hit percentage is obtained when CB-DPET is applied at the L2
cache level. This is because CB-DPET adapts to the changing pattern of the
incoming workload and retains the frequently used data in the cache, thus
maximizing the number of hits. Fig. 3.15 shows the overall number of misses
recorded by each benchmark. A decrease in the number of misses is observed
across the majority of the benchmarks compared to LRU.
Fig 3.14 Percentage increase in L2 cache hits
Fig 3.15 Overall number of misses at L2 Cache
3.11. Summary
When it comes to concurrent multi-threaded applications in a CMP environment,
the LLC plays a crucial role in the overall system efficiency. Conventional
replacement strategies may not be effective in many instances, as they do not
adapt dynamically to the data requirements. Numerous research efforts
[45,46,79,90] are being made to find novel ways of improving the efficiency of
the replacement algorithms applied at the LLC. Both LRU and NRU tend to perform
poorly on data sets possessing irregular patterns, because they do not store any
extra information regarding the recent access pattern of the data items; they
make decisions purely based on the current access [15,45,51,79]. This chapter
has hence proposed a scheme called CB-DPET which addresses the shortcomings of
the conventional replacement methods and improves performance.
The conclusions of this chapter are as follows:
CB-DPET takes a more systematic and access-history based decision
compared to LRU. A counter associated with every cache block tracks the
behaviour of the application access pattern and guides the replacement,
insertion and promotion decisions.
DPET-V is a thrash proof version of DPET. Here some randomly selected
data items are allowed to stay in the cache for a longer time by adjusting the
counter values appropriately.
In multi-threaded applications, a DM register and some test sets are
allocated for every thread. Based on the number of misses in the test sets, a
best policy is chosen for the remaining sets.
These algorithms are made thread-aware using the thread id. Based on the
thread id, the policy identifies which thread is accessing a cache block; when
multiple threads access the same block, it is tagged as shared and allowed to
stay in the cache longer than other blocks by appropriately adjusting its
counter values.
Evaluation results on PARSEC benchmarks have shown that CB-DPET achieves an
average improvement of up to 8% in the overall hits when compared to LRU at the
L2 cache level.
4. LOGICAL CACHE PARTITIONING TECHNIQUE
It is imperative for any level of cache memory in a multi-core architecture to
have a well-defined, dynamic replacement algorithm in place to ensure
consistently superlative performance [92,93,94,95]. At the L2 cache level, the
most prevalently used LRU replacement policy does not adapt dynamically to
changes in the workload. As a result, it can lead to sub-optimal performance
for applications whose workloads exhibit frequently fluctuating patterns.
Hence, many works have strived to improve performance at the shared L2 cache
level [91,98,99].
To overcome the limitation of the conventional LRU approach, this chapter
proposes a novel counter-based replacement technique which logically partitions
the cache elements into four zones based on their 'likeliness' to be referenced
by the processor in the near future. Experimental results obtained using the
PARSEC benchmarks have shown almost 9% improvement in the overall number of
hits and 3% improvement in the average cache occupancy percentage when compared
to the LRU algorithm.
4.1. Introduction
Cache hits play a vital role in boosting system performance. When there are
more hits, the number of transactions with the next levels of memory is
reduced, thereby expediting the overall data access time. Many replacement
techniques use the cache miss metric to determine the replacement victim, but
in this chapter, the victim is chosen based on the number of hits received by
the data items in the cache. Based on the hits received, the cache blocks are
logically partitioned into separate zones. These virtual zones are searched in
a pre-determined order to select the replacement candidate.
The contributions of this chapter include:
A novel counter-based replacement technique that logically partitions the
cache into four different zones, namely Most Likely to be Referenced (MLR),
Likely to be Referenced (LR), Less Likely to be Referenced (LLR) and Never
Likely to be Referenced (NLR), in decreasing order of their likeliness factor.
Replacement, insertion and promotion of data elements within these zones in
such a manner that the overall hit rate is maximized.
Association of a 3-bit counter with every cache line to categorize the
elements into the different zones. On a cache hit, the corresponding element is
promoted from one zone to the next.
Selection of replacement candidates from the zones in ascending order of
their likeliness factor, i.e. the first search space for the victim is the NLR
zone, followed by the subsequent zones until the MLR zone is reached.
Periodic zone demotion of elements to make sure that stale data does not
pollute the cache.
4.2. Logical Cache Partitioning Method
LCP uses a 3-bit counter to dynamically shuffle the cache elements and logically
partition them into four different zones based on their likeliness to be
referenced by the processor in the near future. This counter is referred to as
the LCP (Logical Cache Partitioning) counter in subsequent sections. The lower
and upper bounds of the counter are set to '0' and '7' respectively. The
prediction about the likeliness of reference of the data items is made with the
help of the hits encountered in the cache. As the number of hits for a
particular element increases, it is moved up the zone list until it reaches the
MLR region.
Only if it stays unreferenced for a considerable amount of time is it evicted
from the cache. As their names imply, the zones are arranged in decreasing
order of their likeliness factor. Table 4.1 shows the counter values and their
corresponding zones. Any replacement policy consists of three phases,
replacement, insertion and promotion, and so does this method. Each is
explained in the subsequent sub-sections.
Table 4.1 LCP zone categorization
Counter Value Range    Zone
6-7                    MLR
3-5                    LR
1-2                    LLR
0                      NLR
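The counter-to-zone mapping in Table 4.1 can be expressed directly as a small function (an illustrative sketch; the function name is an assumption):

```python
def lcp_zone(counter):
    """Map a 3-bit LCP counter value (0-7) to its zone per Table 4.1."""
    if counter >= 6:
        return "MLR"   # Most Likely to be Referenced
    if counter >= 3:
        return "LR"    # Likely to be Referenced
    if counter >= 1:
        return "LLR"   # Less Likely to be Referenced
    return "NLR"       # Never Likely to be Referenced
```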
4.3. Replacement Policy
When a miss is encountered in the cache, the data item is fetched from the
next level of memory and brought into the cache, replacing one of the elements
already present. This element is referred to as the victim. The process of
selecting a victim needs to be efficient in order to improve the hit rate. In
this method, the victim is selected from the zones in increasing order of their
likeliness factor. Any cache line in the NLR zone is considered first for
replacement. If no such line is found, the search is repeated for the LLR
region, and so on until MLR is reached. If two or more lines possess the
counter value under consideration in an iteration, the line encountered first
is chosen as the victim.
Once the replacement is made, the LCP counters of all the other data elements
are decremented by '1'. This carries out the gradual zone demotion discussed
earlier, flushing unused data items out of the cache.
4.4. Insertion Policy
This phase is entered as soon as the replacement candidate is found. The new
incoming data item is inserted into the corresponding cache set and its LCP
counter value is set to '2' (LLR zone).
4.5. Promotion Policy
A hit on any data item in the cache triggers the promotion phase. The LCP
value associated with the cache line is raised to the highest value of its
immediate upper zone. For example, if the line was in the LLR zone (LCP value
'1' or '2'), its LCP value is set to '5', i.e. the element now falls within the
LR zone. By moving elements between the zones according to the workload
pattern, LCP is more dynamic than LRU.
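The promotion rule can be sketched as follows (an illustrative sketch, not the thesis implementation; the function name is an assumption):

```python
def lcp_promote(counter):
    """On a hit, raise the LCP counter to the top value of the next
    higher zone: NLR (0) -> LLR top (2), LLR (1-2) -> LR top (5),
    LR (3-5) -> MLR top (7); an MLR line stays at 7."""
    if counter == 0:
        return 2          # NLR -> LLR
    if counter <= 2:
        return 5          # LLR -> LR
    return 7              # LR -> MLR (or already MLR)
```

Returning 7 for any counter above 2 also covers the boundary condition: a line already in MLR cannot overshoot the 3-bit range.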
[Flowchart: the processor looks for a particular block 'c' in the cache and the
LCP algorithm starts the search. On a HIT, the LCP counter value is set to the
maximum value of the next higher zone. On a MISS, a replacement victim is found
in the cache and the processor brings the data item from the next level of
memory.]
Fig 4.1(i) LCP flow diagram
4.6. Boundary Condition
It is essential that the LCP counter value does not overshoot its specified
range. Thus whenever it is modified, a boundary condition check is carried out to
ensure that the value does not go below „0‟ or beyond „7‟. Fig. 4.1(i), 4.2(ii)
shows the working of LCP.
[Flowchart: starting from zone 'X', where 'X' denotes NLR initially, the
algorithm searches for the first block 'c' whose LCP counter value falls within
the range of zone 'X'. If no such block is found, 'X' is incremented to point to
the next higher zone and the search repeats. Once found, 'c' is chosen as the
replacement victim and the new data item is inserted with its LCP counter value
set to '2' (LLR zone).]
Fig 4.1 (ii) Victim Selection flow for LCP
Algorithm 2: Implementation for Cache Miss, Hit and Insert Operations
Init:
Initialize all blocks with LCP counter value '0'.
Input:
A cache set instance
Output:
removing_index /** Block id of the victim **/
Miss:
/** Invoke FindVictim method. The set is passed as parameter **/
FindVictim (set):
i = 0
removing_index = -1
while i < 8 do
/** Scanning the set **/
for j = 0 to associativity do
/** Search from the NLR zone upwards. LCP counter = i **/
if set.blks[j].lcp == i then
/** Block found. Exit loop **/
removing_index = j;
break;
if removing_index != -1 then
break;
i = i + 1;
for j = 0 to associativity do
/** Gradual zone demotion **/
if set.blks[j].lcp > 0 then
set.blks[j].lcp = set.blks[j].lcp - 1;
return removing_index
Hit:
/** Boundary condition check **/
if set.blk.lcp != 7 then
/** NLR zone. Promote to LLR **/
if set.blk.lcp == 0 then
set.blk.lcp = 2;
/** LLR zone. Promote to LR **/
else if set.blk.lcp == 1 or set.blk.lcp == 2 then
set.blk.lcp = 5;
/** LR zone. Promote to MLR **/
else
set.blk.lcp = 7
Insert:
/** LLR zone **/
set.blk.lcp = 2;
Correctness of the Algorithm
Invariant: At any iteration i, the LCP counter value of set.blk[i] holds a
value in the range [0, 7].
Initialization:
When i = 0, set.blk[i].lcp is '0' initially. After the block is selected as
victim and the new item has been inserted, set.blk[i].lcp is set to '2'. Hence
the invariant holds at initialization.
Maintenance:
At i = n, set.blk[i].lcp will contain
either '0' (cache block yet to be occupied for the first time), or
'2' (if it has been newly inserted), or
'1', '3', '4' or '6' (as a part of gradual zone demotion), or
'5' or '7' (on a cache hit from the previous zone).
The "if blk.lcp != 7" condition ensures that the LCP counter value does not
overshoot its range. Hence the invariant holds for i = n, and likewise for
i = n + 1, as the individual iterations are mutually exclusive.
Fig 4.2 Working of LRU and LCP for the data set
Termination:
The increment of the loop control variable 'i' ensures that the loop
terminates within a finite number of iterations, which is '8' in this case. The
condition "if set.blks[j].lcp > 0" ensures that the LCP counter value of a
block does not go below '0'. The invariant also holds for all the cache blocks
present when the loop terminates.
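The miss path of Algorithm 2 can be translated into a short runnable sketch (an illustrative approximation operating on a plain list of counters; the function name is an assumption):

```python
def lcp_miss(lcp):
    """Runnable sketch of Algorithm 2's miss path: find a victim by
    scanning the counter levels from the NLR zone (0) upwards, then
    apply gradual zone demotion to every occupied block.
    lcp: list of LCP counters for one set. Returns the victim index."""
    removing_index = -1
    for level in range(8):                 # counter values 0..7
        for j in range(len(lcp)):
            if lcp[j] == level:
                removing_index = j
                break
        if removing_index != -1:
            break
    for j in range(len(lcp)):              # gradual zone demotion
        if lcp[j] > 0:
            lcp[j] -= 1
    return removing_index
```

After the victim is returned, the insert phase would set its counter to '2' (LLR zone), as in the algorithm.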
4.7. Comparison with LRU
The following data set is considered:
. . . 2 9 1 7 6 1 7 5 9 1 0 7 5 4 8 3 6 1 7 . . .
It is observed that data items 1 and 7 occur frequently. Fig. 4.2 shows the
working of LRU and LCP on this data pattern. Assume that the cache can hold 4
blocks at a time. The incoming data items at each point of time are shown in
the leftmost column.
In the rightmost column, 'm' indicates a miss and 'h' indicates a hit.
Initially the cache contains invalid blocks, and the counter values associated
with all the blocks are set to '-1'. After applying both techniques, LCP
results in 3 more hits than LRU. The frequently occurring data items 1 and 7
result in hits towards the end, unlike in LRU. This is primarily because LRU
follows the same approach irrespective of the workload pattern and tags 1 and 7
as 'least recently used'.
4.8. Results
Full-system simulation using the Gem5 simulator with PARSEC benchmarks
(discussed in Chapter 6) as input sets shows improvement in many cache metrics.
To measure the performance of LCP, metrics such as the increase in hit
percentage, the number of replacements made at L2, the average cache occupancy
and the miss rate are used. Results are shown for two-, four- and eight-core
systems.
4.8.1. LCP Performance
In the following graphs, the term LCP is used to represent this method. The
percentage increase in the overall number of hits at L2 is shown in Fig. 4.3.
Fig 4.4 Difference in number of replacements made at L2 cache
Fig 4.3 Percentage increase in overall number of hits at L2 cache
Fig. 4.4 shows the total number of replacements made at L2. Fig. 4.5 shows
the average L2 cache occupancy percentage for all the workloads. Fig. 4.6
shows the miss rate recorded by the individual cores as well as the overall
miss rate at L2.
Fig 4.5 Average cache occupancy percentage
Miss rate is defined as the ratio of the number of misses to the total number
of accesses. As observed from Fig. 4.6, there is a reduction in the core-wise
and overall miss rates across the majority of the benchmarks when compared to
LRU.
Apart from two workloads, all the others have shown a significant improvement
in hits. The percentage increase in hits varies from a minimum of 0.1% (dedup)
to a maximum of about 11% (ferret), as shown in Fig. 4.3.
The number of replacements reflects the efficiency of any replacement
algorithm. A maximum reduction of almost 20% in the number of replacements is
observed across the given workloads compared to LRU, as seen in Fig. 4.4.
Cache occupancy refers to the amount of cache that is being effectively
utilized to improve performance for a workload. Fig. 4.5 indicates that cache
utilization is higher for the majority of the benchmarks when LCP is applied
compared to LRU.
Vips utilizes the cache space to the maximum possible extent.
Fig 4.6 Miss rate recorded by individual cores
4.9. Summary
This chapter has discussed a dynamic and structured replacement technique
that can be adopted at the LLC. The key points of this replacement policy are
as follows:
Elements of the cache are logically partitioned into four zones based on
their likeliness to be referenced by the processor, with the help of a 3-bit
LCP counter associated with every cache line. The minimum value the counter
can hold is '0' and the maximum is '7'.
Replacement candidates are chosen from the zones in increasing order of
their likeliness factor, starting from the NLR zone. Initially all the counter
values are set to '-1'; conceptually, all the blocks contain invalid data.
For every hit, the corresponding element is moved up by one zone by
adjusting the LCP counter value. If it has reached the topmost zone (MLR),
the counter value is left untouched.
For every miss, the counter values are decremented by '1' to prevent stale
data items from polluting the cache as the algorithm executes. When a new data
item arrives, its LCP value is set to '2'. A boundary-condition check is
applied whenever the LCP counter value is modified to make sure that it does
not overshoot its designated range.
Experimental results obtained by applying the method on the PARSEC
benchmarks have shown a maximum improvement of 9% in the overall number of
hits and a 3% improvement in the average cache occupancy percentage when
compared to the conventional LRU approach.
5. SHARING AND HIT BASED PRIORITIZING TECHNIQUE
The efficiency of cache memory can be enhanced by applying different
replacement algorithms. An optimal replacement algorithm must exhibit good
performance with minimal overhead. Traditional replacement methods currently
deployed across multi-core architectures try to classify elements purely based
on the number of hits they receive during their stay in the cache. In
multi-threaded applications, data can be shared by multiple threads (which
might run on the same core or across different cores). Those data items need to
be given higher priority than private data, because a miss on them may stall
the functioning of multiple threads, resulting in a performance bottleneck.
Since traditional replacement approaches do not possess this additional
capability, they can lead to inadequate performance for many current
multi-threaded applications.
To address this limitation, this chapter proposes a Sharing and Hit based
Prioritizing (SHP) replacement technique that takes the sharing status of the
data elements into consideration while making replacement decisions.
Evaluation results obtained using multi-threaded workloads derived from the
PARSEC benchmark suite show an average improvement of 9% in the overall hit
rate when compared to the LRU algorithm.
5.1. Introduction
In multi-threaded applications, when threads run across different cores, the
activity in the L2 cache tends to shoot up as multiple threads try to store and
retrieve shared data. At this point, there are many challenges that need to be
addressed: cache coherence has to be maintained, the replacement decisions made
need to be judicious, and the available cache space must be utilized
efficiently. When there is a miss on a data item which is shared by multiple
threads, the execution of many threads gets stalled. This is because, while the
first thread that encountered the miss is fetching the data from the next level
of memory, any subsequent thread that tries to access the same data will also
miss. Hence, it becomes imperative to handle shared data with more care than
other data items. Traditional replacement algorithms like LRU, MRU, etc. do not
possess this capability. They classify elements based on when they will be
required by the processor but do not check the status of the cache block, i.e.
whether it is shared or private at any point of time. Also, when thousands of
threads are running in parallel, it may not be very useful to tag a block
simply as 'shared' or 'private'. In those cases, additional information about
the sharing status can prove handy. This chapter provides an overview of a
novel counter-based cache replacement technique that associates a 'sharing
degree' with every cache block. This degree specifies a set of ranges: not
shared (or private), lightly shared, heavily shared and very heavily shared.
This information, combined with the number of hits received by the element
during its stay in the cache, is used to produce a priority for that element,
based on which judicious replacement decisions are made.
The contributions of this chapter include:
A novel counter-based replacement algorithm that makes replacement
decisions based on the sharing nature of the cache blocks.
Association of a 'sharing degree' with every cache line, which indicates
the extent to which the element is shared, depending on the number of threads
that access it.
Four degrees of sharing, namely not shared (or private), lightly shared,
heavily shared and very heavily shared.
A replacement priority for every cache line, arrived at by combining the
sharing degree with the number of hits received by the element, based on which
the replacement decisions are made.
5.2. Sharing and Hit Based Prioritizing Method
Similar to the other two replacement algorithms, every cache block is
associated with a counter. This 2-bit counter is called the Sharing Degree
counter, or SD counter. Table 5.1 shows all four possible values the counter
can hold and their individual descriptions. Depending on the number of threads
that access a cache block, it is classified into one of the sharing categories
shown in the table. To collect the sharing status of the cache blocks, it is
essential to have an efficient data structure in place to track the number of
threads that access each data item. For this purpose, a filter referred to as
the Thread Tracker filter (TT filter) is used. It is a flexible, dynamic,
software-based array that is created and associated with every cache block at
run time.
When a thread tries to access a data item, a search is conducted in the TT
filter to check whether the thread id is already present in it. If not, the id
is stored in the corresponding cache block's TT filter. The size of the filter
expands as thread ids get added to it. Once a cache block is about to be
evicted, the memory allocated for the associated TT filter is freed. At any
point of time, the sharing degree counter of every cache block is populated
with the help of the thread ids found in its TT filter.
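The TT filter and the derivation of the sharing degree can be sketched as follows. This is an illustrative approximation: the thesis does not specify the thread-count thresholds for each category, so the 'light' and 'heavy' cut-offs below are assumptions.

```python
class TTFilter:
    """Sketch of the Thread Tracker filter: a per-block, growable
    collection of thread ids used to derive the 2-bit sharing degree."""

    def __init__(self):
        self.thread_ids = set()

    def record(self, thread_id):
        # Only new thread ids are added; duplicates are ignored.
        self.thread_ids.add(thread_id)

    def sharing_degree(self, light=2, heavy=4):
        """Map the thread count to a 2-bit sharing degree:
        0 = private, 1 = lightly shared, 2 = heavily shared,
        3 = very heavily shared. Thresholds are assumed values."""
        n = len(self.thread_ids)
        if n <= 1:
            return 0
        if n <= light:
            return 1
        if n <= heavy:
            return 2
        return 3
```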
5.3. Replacement Priority
The sharing status of the blocks alone cannot be used to make replacement
decisions. The number of hits garnered by an element is also an important
factor to consider when performing a replacement. Hence, a priority is computed
for every individual cache block that not only gives more weightage to the
sharing nature but also attaches importance to the number of cache hits
garnered by the element. The priority is computed as follows:

Priority = ((Sharing Degree + 1) × Hit Count) / 2 ……… (4)
Sharing degree indicates the sharing degree counter value of the particular
cache block at that point of time. A software-based hit counter associated
with every cache block provides an estimate of the hits received by the
element during its stay in the cache. This counter is refreshed every time
a new element enters the block. The upper bound of the counter is fixed at a
specific value; once that value is reached, the counter is considered saturated
and any further hits are not taken into account. Experimental results have
shown that in a unified cache memory, instructions form about 50% of the
contents on average and data forms the remaining 50%. The priority calculation
focuses mainly on the data part, as it is accessed much more frequently than
instructions. Also, due to locality of reference, several threads can access
the same data several times. To reflect the former, the hit count value is
divided by 2, and to reflect the latter, the hit count value is scaled by a
factor derived from the sharing degree counter value. Whenever either of
these counters changes, the priority of the element is recomputed.
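Equation (4) together with the saturating hit counter can be sketched as follows; the saturation bound of 15 is an assumed value, since the text fixes an upper bound without stating it:

```python
HIT_COUNTER_MAX = 15   # assumed saturation bound for the hit counter

def priority(sharing_degree, hit_count):
    """Equation (4): Priority = (Sharing Degree + 1) * (Hit Count / 2)."""
    hit_count = min(hit_count, HIT_COUNTER_MAX)   # saturating counter
    return (sharing_degree + 1) * hit_count / 2
```

A heavily shared block (degree 2) with 6 hits receives priority 9.0, while a private block (degree 0) with the same 6 hits receives only 3.0, so shared data is retained longer.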
The cache replacement policy consists of deletion, insertion and promotion
operations. Each phase of the algorithm is explained in detail in the
subsequent sections. The mapping method used is set associative mapping.
5.4. Replacement Policy
When the cache becomes full, a replacement has to be made to pave the way
for new incoming data items. SHP makes replacement decisions based on the
computed priority values. Elements are evicted in increasing order of their
priority: the element with the least priority is chosen as the victim. If
more than one element shares the least priority, the one encountered first
while scanning the cache is taken as the victim, as a tie-breaking mechanism.
It is also essential to ensure that stale data does not pollute the cache for
long periods of time.
From this aspect, the hit counter can be regarded as an ageing counter. Every
time a victim is found, the hit counters of all the other elements in the cache
are decremented and their corresponding priorities are re-computed. If any
element remains unreferenced for a long period of time, its priority gradually
decreases and the element is eventually flushed out of the cache.
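This ageing step can be sketched as follows (the block fields and names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class AgedBlock:
    """Minimal stand-in for a cache block's SHP bookkeeping state."""
    sharing_degree: int = 0
    hit_counter: int = 0
    priority: float = 0.0

def age_cache_set(blocks, victim_index):
    """After a victim is chosen, decay every other block's hit counter
    and re-compute its Equation (4) priority."""
    for i, blk in enumerate(blocks):
        if i == victim_index:
            continue
        blk.hit_counter = max(0, blk.hit_counter - 1)   # never below zero
        blk.priority = (blk.sharing_degree + 1) * blk.hit_counter / 2
```

With every eviction the surviving blocks lose one hit each, so an unreferenced block's priority drifts toward zero until it becomes the victim itself.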
Algorithm 3: Implementation for Cache Miss, Hit and Insert Operations
Input:
A cache set instance
Output:
removing_index /** Block id of the victim **/
Miss:
/** Invoke the FindVictim method. The set is passed as a parameter **/
FindVictim(set)
FindVictim(set):
min = ∞
for i = 0 to associativity - 1 do
/** Choose the minimum priority element **/
if set.blk[i].priority < min then
min = set.blk[i].priority;
removing_index = i;
return removing_index;
Hit:
increment set.blk.hit_counter;
/** Scan the TT Filter **/
thread_found = 0;
for i = 0 to length(TT_Filter) - 1 do
if TT_Filter[i] == accessing_thread_id then
thread_found = 1;
/** If the thread id is not found in the TT Filter, append it **/
if thread_found != 1 then
append accessing_thread_id to TT_Filter;
set set.blk.sharing_degree counter value [0 to 3] according to TT_Filter length;
priority = (set.blk.sharing_degree + 1) * (set.blk.hit_counter) / 2;
Insert:
/** Private block **/
set.blk.sharing_degree = 0;
refresh blk.hit_counter;
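Algorithm 3 can be rendered as a runnable sketch; the set layout and helper names are illustrative rather than the GEM5 implementation:

```python
import math

class Block:
    """One cache block with the SHP bookkeeping state (illustrative layout)."""
    def __init__(self):
        self.sharing_degree = 0
        self.hit_counter = 0
        self.tt_filter = []
        self.priority = 0.0

def sharing_degree_of(num_threads):
    """Table 5.1 mapping from thread count to the 2-bit counter value."""
    return 0 if num_threads <= 1 else 1 if num_threads <= 3 \
        else 2 if num_threads <= 7 else 3

def recompute_priority(blk):
    blk.priority = (blk.sharing_degree + 1) * blk.hit_counter / 2  # Eq. (4)

def find_victim(blocks):
    """Miss path: the first block with the minimum priority is the victim."""
    removing_index, lowest = 0, math.inf
    for i, blk in enumerate(blocks):
        if blk.priority < lowest:     # strict '<' keeps the earliest tie
            lowest, removing_index = blk.priority, i
    return removing_index

def on_hit(blk, accessing_thread_id):
    """Hit path: count the hit, update the TT filter and sharing degree."""
    blk.hit_counter += 1
    if accessing_thread_id not in blk.tt_filter:
        blk.tt_filter.append(accessing_thread_id)
    blk.sharing_degree = sharing_degree_of(len(blk.tt_filter))
    recompute_priority(blk)

def on_insert(blk):
    """Insert path: the incoming block starts private with a fresh counter."""
    blk.sharing_degree = 0
    blk.hit_counter = 0
    blk.tt_filter = []
    recompute_priority(blk)
```

For instance, after two threads hit one block and a single thread hits another, an untouched block (priority 0) is chosen as the victim, with the lowest-indexed zero-priority block winning the tie.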
Correctness of Algorithm
Invariant: At any iteration i, the sharing degree counter of set.blk[i] holds a
value in the range [0, 3].
Initialization:
When i = 0, set.blk[i].sharing_degree is initially set to '0'. Hence the
invariant holds at initialization.
Maintenance:
At i = n, set.blk[i].sharing_degree contains either '0' (Private),
'1' (Lightly Shared), '2' (Heavily Shared) or '3' (Very Heavily Shared).
Hence the invariant holds for i = n. The same applies for i = n + 1, as the
individual iterations are mutually exclusive. Thus the invariant holds for
i = n + 1.
Termination:
The priority computation loop is bounded by the associativity of the cache,
which is always finite. The TT filter iteration is bounded by the size of the
TT filter, which equals the total number of executing threads and is therefore
also finite. The invariant can also be seen to hold for all the cache blocks
present when the loops terminate.
Fig 5.1 (i) SHP flow diagram: the processor looks for a block 'c' in the cache
and the SHP algorithm starts the search; on a HIT, the hit counter is
incremented, the TT filter and sharing degree counter are updated if needed,
and the priority is re-computed; on a MISS, a replacement victim is found in
the cache and the data item is brought in from the next level of memory.
Fig 5.1 (ii) Victim selection flow for SHP: the algorithm searches the cache
for the first block 'c' with the minimum priority value, chooses it as the
replacement victim, inserts the new data item in its place, sets the sharing
degree counter to '0' (private) and re-computes the priority.
Table 5.1 Sharing degree values and their descriptions

Number of Accessing Threads | Sharing Degree Counter Value | Nature of Sharing
1                           | 0                            | Private/Not Shared
2-3                         | 1                            | Lightly Shared
4-7                         | 2                            | Heavily Shared
8-10                        | 3                            | Very Heavily Shared
5.5. Insertion Policy
After evicting the victim, the new data item is inserted into the cache. Its
sharing degree counter is set to '0' to indicate that the incoming block is
currently not shared by any other thread, and the corresponding hit counter
is refreshed.
5.6. Promotion Policy
When a cache hit happens, the hit counter of the corresponding block is
incremented. The accessing thread id is also compared against the ids already
present in the TT filter and, if it is not present there, it is added to the
TT filter. The sharing degree counter is adjusted accordingly and the priority
is re-computed. Figs. 5.1(i) and 5.1(ii) show the basic flow of the SHP
technique.
5.7. Results
To evaluate the performance of SHP, simulations have been carried out with
the GEM5 simulator using PARSEC workloads (discussed in Chapter 6), and the
results are compared with LRU. Most of the applications benefit from this
replacement policy. The results are shown in this section.
5.7.1. SHP Performance
It is observed from the following figures that the performance of the proposed
replacement technique is better than LRU for multithreaded workloads. The
overall number of hits obtained at the LLC for SHP compared to LRU is shown
in Fig. 5.2, where blackscholes shows the maximum improvement. On average, a
9% improvement in the overall number of hits is observed across the given
benchmarks.
Fig 5.2 Overall number of hits at L2
Fig. 5.3 shows the overall number of replacements made at L2 for SHP and
LRU. The more replacements that are made, the greater the overhead involved,
so it is always desirable to keep this parameter as low as possible. With this
method, the average number of replacements made at L2 has decreased compared
to LRU. Among the benchmarks, blackscholes exhibits the largest reduction in
replacements (1.6% fewer than LRU) and swaptions shows a 0.55% decrease
compared to LRU.

Fig. 5.4 shows the L2 miss rate for different workloads. The miss rate is
computed from the overall number of misses and the overall number of accesses.
To obtain high performance, the LLC must provide a low miss rate. The majority
of the benchmarks show improvement in this metric compared to LRU, with dedup
showing superior performance.
Fig 5.3 Percentage decrease in number of replacements made at L2
Fig 5.4 Overall miss rate and core-wise miss rate at L2 cache
5.8. Summary
Shared data plays a critical role in determining the performance of cache
memory systems, especially in a multi-threaded environment. The conventional
LRU approach does not attach importance to such data, and hence this chapter
proposes a novel counter-based prioritizing algorithm.
- Every cache block is associated with a 2-bit sharing degree counter which
ranges from 0 to 3.
- A dynamic, software-based TT filter is associated with every block to keep
track of the threads that are accessing that block.
- A hit counter keeps track of the hits received by the data item.
- The values of the sharing degree and hit counters are used to compute a
priority for each cache block.
- This priority is then used to make judicious replacement decisions.
Evaluation results have shown an improvement of up to 9% on average in the
overall number of hits when compared to the traditional LRU approach.
6. EXPERIMENTAL METHODOLOGY
This section outlines the simulation infrastructure and benchmarks used to
evaluate the dynamic replacement policies. To evaluate the proposed
techniques, the Gem5 simulator [102] is used, with the PARSEC benchmark
suite [103,104,105] as the input workloads.
6.1. Simulation Environment
For simulation studies, Gem5 - a modular, discrete, event driven, completely
object oriented, computer system open source simulator platform, has been
chosen. It can simulate a complete system with devices and an operating
system, in full system mode (FS mode), or user space only programs where
system services are provided directly by the simulator in syscall emulation mode
(SE mode). The FS mode is currently being used. It is also capable of simulating
a variety of Instruction Set Architectures (ISAs).
Table 6.1 Summary of the key characteristics of the PARSEC benchmarks

Program      | Application Domain | Working Set
Blackscholes | Financial Analysis | Small
Canneal      | Computer Vision    | Medium
Dedup        | Enterprise Storage | Unbounded
Ferret       | Similarity Search  | Unbounded
Fluidanimate | Animation          | Large
Swaptions    | Financial Analysis | Medium
Vips         | Media Processing   | Medium
x264         | Media Processing   | Medium
6.2. PARSEC Benchmark
The Princeton Application Repository for Shared-Memory Computers
(PARSEC) is a benchmark suite that comprises numerous large scale
commercial multi-threaded workloads for CMP. For simulation purposes, seven
workloads have been picked up from PARSEC suite. Every workload is divided
into three phases: an initial sequential phase, a parallel phase (which is further
split into two phases), and a final sequential phase.
This work focuses on the performance obtained in the parallel phase, which is
also known as the Region of Interest (ROI). All the statistics collected to
measure cache performance correspond to the ROI of the workloads. The basic
characteristics of the chosen benchmarks are shown in Table 6.1, and Table
6.2 lists the cache configuration parameters used for simulation.
Table 6.2 Cache architecture configuration parameters

Parameter                                         | L1 Cache                     | L2 Cache
Total Cache Size                                  | i-cache 32 kB, d-cache 64 kB | 2 MB
Associativity                                     | 2-way                        | 8-way
Miss Information/Status Handling Registers (MSHR) | 10                           | 20
Cache Block Size                                  | 64 B                         | 64 B
Latency                                           | 1 ns                         | 10 ns
Replacement Algorithm                             | LRU                          | CB-DPET
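As a consistency check on Table 6.2, the set counts implied by the configuration (used again in the overhead calculations of Section 6.4) follow from size / (associativity × block size):

```python
def num_sets(cache_bytes, assoc, block_bytes):
    """Sets in a set-associative cache: size / (ways * block size)."""
    return cache_bytes // (assoc * block_bytes)

l2_sets = num_sets(2 * 1024 * 1024, 8, 64)   # 2 MB, 8-way, 64 B blocks
l1d_sets = num_sets(64 * 1024, 2, 64)        # 64 kB d-cache, 2-way
```

This reproduces the 4096 L2 sets assumed in the storage overhead calculations.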
6.3. Evaluation of Many-Core Systems
Evaluation using PARSEC benchmarks for all three methods on two, four and
eight core systems with a 2 MB LLC has yielded better results than LRU in
terms of parameters such as the number of hits, hit rate, throughput and IPC
speedup. Throughput is measured from the total number of instructions executed
per clock cycle (Instructions Per Cycle, or IPC). In multi-threaded
applications, the throughput is the sum of the individual thread IPCs:

Throughput = Σᵢ₌₁ⁿ IPCᵢ ……… (5)

IPC Speedup of Algorithm k over LRU = IPCₖ / IPC_LRU

where 'n' refers to the total number of threads and k refers to any one of the
methods CB-DPET, LCP or SHP.
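Equation (5) and the speedup ratio can be sketched directly; the IPC values in the usage note are illustrative, not measured results:

```python
def throughput(per_thread_ipc):
    """Equation (5): throughput is the sum of the individual thread IPCs."""
    return sum(per_thread_ipc)

def ipc_speedup(ipc_method, ipc_lru):
    """IPC speedup of a method (CB-DPET, LCP or SHP) over the LRU baseline."""
    return ipc_method / ipc_lru
```

For example, two threads running at 0.8 and 0.7 IPC give a throughput of 1.5; if a method lifts that to 1.59, its speedup over LRU is 1.06.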
Evaluation results using PARSEC benchmarks for two cores show an average
improvement in IPC speedup of up to 1.06, 1.05 and 1.08 times that of LRU
for the CB-DPET, LCP and SHP techniques respectively (Figs. 6.1, 6.2 and 6.3).
Fig 6.1 Performance comparison for 2 cores
Fig 6.2 Performance comparison for 4 cores
Fig 6.3 Performance comparison for 8 cores
Fig 6.4 CB-DPET- Percentage increase in hits
Fig 6.5 LCP- Percentage increase in hits
Fig 6.6 SHP- Percentage increase in hits
In the 2-core scenario for CB-DPET, canneal shows the highest IPC speedup
improvement (1.2 times that of LRU) and dedup records the least improvement,
at 1.02 times that of LRU. Vips and swaptions show the maximum IPC speedup
among the workloads for the LCP and SHP methods in the 2-core system. In the
4-core scenario, fluidanimate shows the maximum IPC speedup, at 1.075 and
1.07 times that of LRU for the CB-DPET and LCP methods respectively, while
blackscholes shows an IPC speedup of 1.07 times that of LRU, the maximum for
the SHP method in the 4-core environment. For 8 cores, ferret records the
maximum IPC speedup for the CB-DPET and LCP methods, whereas swaptions shows
the maximum IPC speedup, of almost 1.1 times that of LRU, for the SHP method.
An improvement in overall hits of up to 8% is observed using CB-DPET compared
to LRU at the L2 cache level (Fig. 6.4). LCP and SHP show average improvements
of up to 9% (Fig. 6.5) and 9% (Fig. 6.6) respectively in the overall number
of hits.
6.4. Storage Overhead
6.4.1. CB-DPET
The PET counter requires frequent lookup, so it is kept as part of the cache
hardware to expedite access. Consider an 8-way set associative 2 MB cache
with 4096 sets and a 64-byte cache block. The additional storage required is
calculated as follows:

Additional Overhead = no. of sets × no. of blocks per set × no. of bits per block
= 4096 × 8 × 3
= 98304 bits or 12 KB

The DM register is allocated on a per-thread basis and its size depends on
the application at run time, so it is kept as a software-based register
rather than a hardware one. Whether the block status register is implemented
in hardware or software is an implementation-time decision, since the range
of values it can hold is not yet fixed. Assuming it can take only '0' (say,
shared) or '1' (not shared), and that the thread id range does not start from
'0', the additional storage required for the block status register, if
implemented in hardware, is approximately 4 KB. The total additional storage
required is therefore approximately 16 KB, which amounts to a 0.78% overall
increase in area for a 2 MB L2 cache.
6.4.2. LCP
Similar to the PET counter, the LCP counter requires frequent lookup, so it
is also kept as part of the cache hardware to expedite access. The storage
overhead is similar to that of the PET counter, approximately 12 KB, since
LCP is also a 3-bit counter. As no other registers or counters are required,
LCP results in a 0.58% overall increase in area for a 2 MB L2 cache.
6.4.3. SHP
SHP requires a 2-bit Sharing Degree counter that needs frequent lookup, and
it is therefore kept as part of the hardware. The additional storage required
for the Sharing Degree counter is calculated as follows:

Additional Overhead = no. of sets × no. of blocks per set × no. of bits per block
= 4096 × 8 × 2
= 65536 bits or 8 KB

The Thread Tracker filter, which keeps track of the threads that access the
blocks, can be implemented in software, which gives the flexibility of
dynamically allocating and freeing memory as and when required. Hence SHP
results in a 0.39% overall increase in area for a 2 MB L2 cache.
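The overhead arithmetic for the three methods can be checked with a short script (constants taken from the cache configuration above):

```python
SETS, WAYS = 4096, 8            # 8-way, 4096-set L2
CACHE_BYTES = 2 * 1024 * 1024   # 2 MB L2 cache

def counter_overhead_kb(bits_per_block):
    """no. of sets * blocks per set * bits per block, converted to KB."""
    return SETS * WAYS * bits_per_block / 8 / 1024

cbdpet_kb = counter_overhead_kb(3)   # 3-bit PET counter
lcp_kb = counter_overhead_kb(3)      # 3-bit LCP counter
shp_kb = counter_overhead_kb(2)      # 2-bit Sharing Degree counter
shp_pct = shp_kb * 1024 / CACHE_BYTES * 100   # area increase over 2 MB
```

This reproduces the 12 KB (CB-DPET, LCP) and 8 KB (SHP) counter budgets, and the SHP figure of about 0.39% of the 2 MB cache.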
7. CONCLUSION AND FUTURE WORK
7.1. Conclusion
When it comes to concurrent multi-threaded applications in a CMP
environment, the LLC plays a crucial role in overall system efficiency.
Conventional replacement strategies such as LRU and NRU may not be effective
in many instances, as they do not adapt dynamically to the data requirements.
Much research is being conducted to find novel ways of improving the
efficiency of the replacement algorithms applied at the LLC. In this
dissertation, three novel counter-based replacement strategies have been
proposed.
- A novel replacement algorithm called CB-DPET has been proposed, which takes
a more systematic, access-history-based decision than LRU. It is further
split into DPET and DPET-V. DPET associates a counter with every cache block
to keep track of its recent access information. Based on this counter's
values, the algorithm makes replacement, insertion and promotion decisions.
- A dynamic and structured replacement technique called LCP has been
proposed for adoption across the LLC. Here the elements of the cache are
logically partitioned into four zones based on their likeliness to be
referenced by the processor, with the help of a 3-bit LCP counter associated
with every cache line. Replacement candidates are chosen from the zones in
increasing order of their likeliness factor, starting from the NLR zone. For
every hit, the corresponding element is moved up by one zone by adjusting the
LCP counter value; if it has already reached the topmost zone (MLR), the
counter value is left untouched. For every miss, the counter value is
decremented by '1' to prevent stale data items from polluting the cache.
- A novel counter-based prioritizing algorithm called SHP has been proposed
for application across the LLC. Here every cache block is associated with a
2-bit sharing degree counter which ranges from 0 to 3. A dynamic,
software-based TT filter is associated with every block to keep track of the
threads that are accessing that block. A hit counter keeps track of the hits
received by the data item. The values of the sharing degree and hit counters
are used to compute a priority for each cache block. This priority is then
used to make judicious replacement decisions.
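The LCP zone movement summarized in the second item above can be sketched as follows; the mapping of two counter values per zone is an assumed encoding, since the counter-to-zone layout is not fixed here:

```python
LCP_MAX = 7   # top of the 3-bit counter range

def zone(counter):
    """Assumed encoding: two counter values per zone, 0 = NLR ... 3 = MLR."""
    return counter // 2

def on_lcp_hit(counter):
    """Promote the hit element one zone, leaving MLR elements untouched."""
    return counter if zone(counter) == 3 else min(counter + 2, LCP_MAX)

def on_lcp_miss(counter):
    """Decrement by 1 on every miss so stale elements sink toward NLR."""
    return max(counter - 1, 0)
```

A freshly demoted element thus drifts down one counter value per miss but jumps a full zone on a hit, biasing retention toward recently referenced lines.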
Table 7.1 Percentage increase in overall hits at L2 cache compared to LRU

Method  | 2-Core | 4-Core | 8-Core
CB-DPET | 8%     | 4%     | 3%
LCP     | 6%     | 9%     | 5%
SHP     | 7%     | 9%     | 5%
Table 7.1 gives the percentage improvement in the overall number of hits
compared to LRU for all three methods. Up to a 9% improvement in the overall
hit count has been obtained at the L2 cache level (for LCP), along with an
IPC speedup of up to 1.08 times that of LRU (for SHP).
Simulation studies show that all three proposed counter-based replacement
techniques outperform LRU with minimal storage overhead.
7.2. Scope for Further Work
Dynamic replacement strategies have proven to work well with the versatile
set of application workloads that currently exist. Increasing this dynamicity
without inducing much hardware overhead is crucial. This research has
experimented with novel replacement strategies on the shared L2 cache; the
modifications that may be required to run the same algorithms in the
relatively smaller private L1 cache are an interesting area to explore.
Appending additional bits per cache block in CB-DPET can help store more
information about the access pattern of a data item, but the storage overhead
needs to be taken into consideration. In LCP, the victim search process could
be optimized to expedite replacement candidate selection. Multi-threading
could be a good solution here: instead of one thread searching the cache set
in multiple passes, multiple threads could search different zones, although
inter-thread communication may have to be explored to ensure the process
stops once a victim is found. In the sharing and hit-based prioritizing
algorithm, the replacement priority of cache blocks is computed from the
sharing degree and the number of hits received. To boost accuracy, other
parameters, such as the miss penalty incurred when replacing a block, could
also be considered when computing the priority.
REFERENCES [1] John L. Hennessy and David A. Patterson, "Computer architecture: a
quantitative approach", Fifth Edition, Elsevier, 2012.
[2] Tola John Odule and Idowun Ademola Osinuga, “Dynamically Self-
Adjusting Cache Replacement Algorithm”, International Journal of Future
Generation Communication and Networking, Vol. 6, No. 1, Feb. 2013.
[3] Andhi Janapsatya, Aleksandar Ignjatovic, Jorgen Peddersen and Sri
Parameswaran, "Dueling clock: adaptive cache replacement policy based
on the clock algorithm", IEEE Design, Automation & Test Europe
Conference & Exhibition (DATE), pp. 920-925, 2010.
[4] Geng Tian and Michael Liebelt, "An effectiveness-based adaptive cache
replacement policy", Microprocessors and Microsystems, Vol 38, No.1,
pp. 98-111, Elsevier, Feb. 2014.
[5] Anil Kumar Katti and Vijaya Ramachandran, "Competitive cache
replacement strategies for shared cache environments", IEEE 26th
International Parallel & Distributed Processing Symposium (IPDPS), pp.
215-226, 2012.
[6] Martin Kampe, Per Stenstrom and Michel Dubois, "Self-correcting LRU
replacement policies", 1st Conference on Computing Frontiers, ACM
Proceedings, pp. 181-191, 2004.
[7] Cheol Hong Kim, Jong Wook Kwak, Seong Tae Jhang and Chu Shik
Jhon, "Adaptive block management for victim cache by exploiting l1
cache history information", Embedded and Ubiquitous Computing, pp. 1-
11, Springer Berlin Heidelberg, 2004.
[8] Jorge Albericio, Ruben Gran, Pablo Ibanez, Víctor Vinals and Jose Maria
Llaberia, "ABS: A low-cost adaptive controller for prefetching in a banked
shared last-level cache", ACM Transactions on Architecture and Code
Optimization (TACO), Vol. 8, No. 4, Article. 19, 2012.
[9] Ke Zhang, Zhensong Wang, Yong Chen, Huaiyu Zhu and Xian-He Sun,
"PAC-PLRU: A Cache Replacement Policy to Salvage Discarded Predictions
from Hardware Prefetchers", 11th IEEE/ACM International Symposium on
Cluster, Cloud and Grid Computing, pp. 265-274, May 2011.
[10] Jayesh Gaur, Mainak Chaudhuri and Sreenivas Subramoney, "Bypass
and insertion algorithms for exclusive last-level caches", ACM SIGARCH
Computer Architecture News, Vol. 39, No. 3, 2011.
[11] Hongliang Gao and Chris Wilkerson, "A dueling segmented LRU
replacement algorithm with adaptive bypassing", JWAC 1st JILP
Workshop on Computer Architecture Competitions: cache replacement
Championship, 2010.
[12] Saurabh Gupta, Hongliang Gao and Huiyang Zhou, "Adaptive Cache
Bypassing for Inclusive Last Level Caches", IEEE 27th International
Symposium on Parallel & Distributed Processing (IPDPS), pp. 1243-1253,
2013.
[13] Mazen Kharbutli and Yan Solihin, "Counter-based cache replacement and
bypassing algorithms", IEEE Transactions on Computers, Vol. 57, No. 4,
pp. 433-447, 2008.
[14] Pierre Michaud, "Replacement policies for shared caches on symmetric
multicores: a programmer-centric point of view", 6th International
Conference on High Performance and Embedded Architectures and
Compilers, pp. 187-196, 2011.
[15] Moinuddin K Qureshi, "Adaptive spill-receive for robust high-performance
caching in CMPs", IEEE 15th International Symposium on High
Performance Computer Architecture (HPCA), pp. 45-54, 2009.
[16] Kathlene Hurt and Byeong Kil Lee, “Proposed Enhancements to Fixed
Segmented LRU Cache Replacement Policy”, IEEE 32nd International
Conference on Performance and Computing and Communications
(IPCCC), pp. 1-2, Dec. 2013.
[17] Manikantan, R, Kaushik Rajan and Ramaswamy Govindarajan.
"NUcache: An efficient multi-core cache organization based on next-use
distance", IEEE 17th International Symposium on High Performance
Computer Architecture (HPCA), pp. 243-253, 2011.
[18] Kedzierski, Kamil, Miquel Moreto, Francisco J. Cazorla and Mateo Valero,
"Adapting cache partitioning algorithms to pseudo-lru replacement
policies", IEEE International Symposium on Parallel & Distributed
Processing (IPDPS), pp. 1-12, 2010.
[19] Xie, Yuejian and Gabriel H. Loh, "PIPP: promotion/insertion pseudo-
partitioning of multi-core shared caches", ACM SIGARCH Computer
Architecture News, Vol. 37, No. 3, pp. 174-183, 2009.
[20] Hasenplaugh, William, Pritpal S. Ahuja, Aamer Jaleel, Simon Steely Jr
and Joel Emer, "The gradient-based cache partitioning algorithm", ACM
Transactions on Architecture and Code Optimization (TACO), Vol. 8, No.
4, Article 44, Jan 2012.
[21] Haakon Dybdahl, Per Stenstrom and Lasse Natvig, "A cache-partitioning
aware replacement policy for chip multiprocessors", High Performance
Computing-HiPC, Springer, pp. 22-34, 2006.
[22] Naoki Fujieda, and Kenji Kise, "A partitioning method of cooperative
caching with hit frequency counters for many-core processors", Second
International Conference on Networking and Computing (ICNC), pp. 160-
165, 2011.
[23] Konstantinos Nikas, Matthew Horsnell and Jim Garside, "An adaptive
bloom filter cache partitioning scheme for multi-core architectures”,
International Conference on Embedded Computer Systems,
Architectures, Modeling, and Simulation, SAMOS, pp. 25-32, 2008.
[24] Moinuddin K Qureshi and Yale N. Patt, "Utility-based cache partitioning: A
low-overhead, high-performance, runtime mechanism to partition shared
caches", 39th Annual IEEE/ACM International Symposium on
Microarchitecture, pp. 423-432, 2006.
[25] Daniel A Jimenez, "Insertion and promotion for tree-based PseudoLRU
last-level caches", 46th Annual IEEE/ACM International Symposium on
Microarchitecture, pp. 284-296, 2013.
[26] Lira, Javier, Carlos Molina and Antonio Gonzalez, "LRU-PEA: A smart
replacement policy for non-uniform cache architectures on chip
multiprocessors", IEEE International Conference on Computer Design,
pp. 275-281, 2009.
[27] Nikos Hardavellas, Michael Ferdman, Babak Falsafi and Anastasia
Ailamaki, "Reactive NUCA: near-optimal block placement and replication
in distributed caches", ACM SIGARCH Computer Architecture News, Vol.
37, No. 3, pp. 184-195, 2009.
[28] Fazal Hameed, Bauer L and Henkel J, “Dynamic Cache Management in
Multi-Core Architectures Through Runtime Adaptation”, Design,
Automation & Test in Europe Conference & Exhibition (DATE), pp. 485-
490, March 2012.
[29] Alberto Ros, Marcelo Cintra, Manuel E. Acacio and Jose M. García.
"Distance-aware round-robin mapping for large NUCA caches",
International Conference on High Performance Computing (HiPC), pp.
79-88, 2009.
[30] Javier Lira, Timothy M. Jones, Carlos Molina and Antonio Gonzalez. "The
migration prefetcher: Anticipating data promotion in dynamic NUCA
caches", ACM Transactions on Architecture and Code Optimization
(TACO), Vol. 8, No. 4, Article. 45, 2012.
[31] Mohammad Hammoud, Sangyeun Cho and Rami Melhem, "Dynamic
cache clustering for chip multiprocessors", ACM 23rd International
Conference on Supercomputing, pp. 56-67, 2009.
[32] O. Adamo, A. Naz, K. Kavi, T. Janjusic and C. P. Chung, "Smaller split
L-1 data caches for multi-core processing systems", IEEE 10th
International Symposium on Pervasive Systems, Algorithms and
Networks (I-SPAN 2009), Dec. 2009.
[33] Yong Chen, Surendra Byna and Xian-He Sun, "Data access history
cache and associated data prefetching mechanisms", ACM/IEEE
Conference on Supercomputing, pp. 1-12, 2007.
[34] Mohammad Hammoud, Sangyeun Cho and Rami Melhem, "ACM: An
efficient approach for managing shared caches in chip multiprocessors",
High Performance Embedded Architectures and Compilers, Springer
Berlin Heidelberg, pp. 355-372, 2009.
[35] Sourav Roy, "H-NMRU: a low area, high performance cache replacement
policy for embedded processors", 22nd International Conference on VLSI
Design, pp. 553-558, 2009.
[36] Nobuaki Tojo, Nozomu Togawa, Masao Yanagisawa and Tatsuo Ohtsuki,
"Exact and fast L1 cache simulation for embedded systems", Asia and
South Pacific Design Automation Conference, pp. 817-822, 2009.
[37] Sangyeun Cho, and Lory Al Moakar, "Augmented FIFO Cache
Replacement Policies for Low-Power Embedded Processors", Journal of
Circuits, Systems and Computers, Vol. 18, No. 6, pp. 1081-1092, 2009.
[38] Kimish Patel, Luca Benini, Enrico Macii and Massimo Poncino, "Reducing
conflict misses by application-specific reconfigurable indexing", IEEE
Transactions on Computer-Aided Design of Integrated Circuits and
Systems, Vol. 25, No. 12, pp. 2626-2637, 2006.
[39] Yun Liang, and Tulika Mitra, "Instruction cache locking using temporal
reuse profile", 47th Design Automation Conference, pp. 344-349, 2010.
[40] Yun Liang, and Tulika Mitra, "Improved procedure placement for set
associative caches", International Conference on Compilers,
Architectures and Synthesis for Embedded Systems, pp. 147-156, 2010.
[41] Cheol Hong Kim, Jong Wook Kwak, Seong Tae Jhang and Chu Shik
Jhon, "Adaptive block management for victim cache by exploiting l1
cache history information", Embedded and Ubiquitous Computing,
Springer Berlin Heidelberg, pp. 1-11, 2004.
[42] Cristian Ungureanu, Biplob Debnath, Stephen Rago and Akshat Aranya,
"TBF: A memory-efficient replacement policy for flash-based caches",
IEEE 29th International Conference on In Data Engineering (ICDE), pp.
1117-1128, 2013.
[43] Pepijn de Langen and Ben Juurlink, "Reducing Conflict Misses in Caches
by Using Application Specific Placement Functions", 17th Annual
Workshop on Circuits, Systems and Signal Processing, pp. 300-305.
2006.
[44] Dongxing Bao and Xiaoming Li, "A cache scheme based on LRU-like
algorithm", IEEE International Conference on Information and Automation
(ICIA), pp. 2055-2060, 2010.
[45] Livio Soares, David Tam and Michael Stumm, "Reducing the harmful
effects of last-level cache polluters with an OS-level, software-only pollute
buffer", 41st annual IEEE/ACM International Symposium on
Microarchitecture, IEEE Computer Society, pp. 258-269, 2008.
[46] Alejandro Valero, Julio Sahuquillo, Salvador Petit, Pedro Lopez and Jose
Duato, "MRU-Tour-based Replacement Algorithms for Last-Level
Caches", 23rd International Symposium on Computer Architecture and
High Performance Computing (SBAC-PAD), pp. 112-119, 2011.
[47] Gangyong Jia, Xi Li, Chao Wang, Xuehai Zhou and Zongwei Zhu, "Cache
Promotion Policy Using Re-reference Interval Prediction", IEEE
International Conference on Cluster Computing (CLUSTER), pp. 534-537,
2012.
[48] Junmin Wu, Xiufeng Sui, Yixuan Tang, Xiaodong Zhu, Jing Wang and
Guoliang Chen, "Cache Management with Partitioning-Aware Eviction
and Thread-Aware Insertion/Promotion Policy", International Symposium
on Parallel and Distributed Processing with Applications (ISPA), pp. 374-
381, 2010.
[49] Ali A. El-Moursy and Fadi N. Sibai, "V-Set cache design for LLC of
multi-core processors", IEEE 14th International Conference on High
Performance Computing and Communication (HPCC), pp. 995-1000,
2012.
[50] Aamer Jaleel, Matthew Mattina and Bruce Jacob, "Last level cache (LLC)
performance of data mining workloads on a CMP-a case study of parallel
bioinformatics workloads", The Twelfth International Symposium on High-
Performance Computer Architecture, pp. 88-98, 2006.
[51] Arkaprava Basu, Nevin Kirman, Meyrem Kirman, Mainak Chaudhuri and
Jose Martinez, "Scavenger: A new last level cache architecture with
global block priority", 40th Annual IEEE/ACM International Symposium on
Microarchitecture, IEEE Computer Society, pp. 421-432, 2007.
[52] Miao Zhou, Yu Du, Bruce Childers, Rami Melhem and Daniel Mossé,
"Writeback-aware partitioning and replacement for last-level caches in
phase change main memory systems", ACM Transactions on
Architecture and Code Optimization (TACO), Vol. 8, No. 4, Article 53,
2012.
[53] Fazal Hameed, Lars Bauer and Jorg Henkel, "Adaptive cache
management for a combined SRAM and DRAM cache hierarchy for multi-
cores", Design, Automation & Test in Europe Conference & Exhibition
(DATE), pp. 77-82, 2013.
[54] Fang Juan and Li Chengyan, "An improved multi-core shared cache
replacement algorithm", 11th International Symposium on Distributed
Computing and Applications to Business, Engineering & Science
(DCABES), pp. 13-17, 2012.
[55] Mainak Chaudhuri, Jayesh Gaur, Nithiyanandan Bashyam, Sreenivas
Subramoney and Joseph Nuzman, "Introducing hierarchy-awareness in
replacement and bypass algorithms for last-level caches", International
Conference on Parallel Architectures and Compilation Techniques, pp.
293-304, 2012.
[56] Yingying Tian, Samira M. Khan and Daniel A. Jimenez, "Temporal-based
multilevel correlating inclusive cache replacement", ACM Transactions on
Architecture and Code Optimization (TACO), Vol. 10, No. 4, Article 33,
2013.
[57] Liqiang He, Yan Sun and Chaozhong Zhang, "Adaptive Subset Based
Replacement Policy for High Performance Caching", 1st JILP Workshop
on Computer Architecture Competitions (JWAC-1): Cache Replacement
Championship, 2010.
[58] Baozhong Yu, Yifan Hu, Jianliang Ma and Tianzhou Chen, "Access
Pattern Based Re-reference Interval Table for Last Level Cache", 12th
International Conference on Parallel and Distributed Computing,
Applications and Technologies (PDCAT), pp. 251-256, 2011.
[59] Haiming Liu, Michael Ferdman, Jaehyuk Huh and Doug Burger, "Cache
bursts: A new approach for eliminating dead blocks and increasing cache
efficiency", 41st annual IEEE/ACM International Symposium on
Microarchitecture, IEEE Computer Society, pp. 222-233, 2008.
[60] Vineeth Mekkat, Anup Holey, Pen-Chung Yew and Antonia Zhai,
"Managing shared last-level cache in a heterogeneous multicore
processor", 22nd International Conference on Parallel Architectures and
Compilation Techniques, pp. 225-234, 2013.
[61] Rami Sheikh and Mazen Kharbutli, "Improving cache performance by
combining cost-sensitivity and locality principles in cache replacement
algorithms", IEEE International Conference on Computer Design (ICCD),
pp. 76-83, 2010.
[62] Adwan AbdelFattah and Aiman Abu Samra, "Least recently plus five least
frequently replacement policy (LR+5LF)", International Arab Journal of
Information Technology, Vol. 9, No. 1, pp. 16-21, 2012.
[63] Richa Gupta and Sanjiv Tokekar, "Proficient Pair of Replacement
Algorithms on L1 and L2 Cache for Merge Sort", Journal of Computing,
Vol. 2, No. 3, pp. 171-175, March 2010.
[64] Jan Reineke and Daniel Grund, "Sensitivity of Cache Replacement
Policies", ACM Transactions on Embedded Computing Systems, Vol. 9,
No. 4, Article 39, March 2010.
[65] Moinuddin K. Qureshi, Daniel N. Lynch, Onur Mutlu and Yale N. Patt, "A
case for MLP-aware cache replacement", ACM SIGARCH Computer
Architecture News, Vol. 34, No. 2, pp. 167-178, 2006.
[66] Marios Kleanthous and Yiannakis Sazeides, "CATCH: A mechanism for
dynamically detecting cache-content-duplication in instruction caches",
ACM Transactions on Architecture and Code Optimization (TACO),
Vol. 8, No. 3, Article 11, 2011.
[67] Moinuddin K. Qureshi, David Thompson and Yale N. Patt, "The V-Way
cache: demand-based associativity via global replacement", 32nd
International Symposium on Computer Architecture, ISCA'05
Proceedings, pp. 544-555, 2005.
[68] Jichuan Chang and Gurindar S. Sohi, "Cooperative caching for chip
multiprocessors", ACM SIGARCH Computer Architecture News, Vol. 34,
No. 2, 2006.
[69] Shekhar Srikantaiah, Mahmut Kandemir and Mary Jane Irwin, "Adaptive
set pinning: managing shared caches in chip multiprocessors", ACM
SIGARCH Computer Architecture News, Vol. 36, No. 1, pp. 135-144,
2008.
[70] Ranjith Subramanian, Yannis Smaragdakis and Gabriel H. Loh, "Adaptive
Caches: Effective Shaping of Cache Behavior to Workloads", 39th Annual
IEEE/ACM International Symposium on Microarchitecture (MICRO-39),
2006.
[71] Tripti S. Warrier, B. Anupama and Madhu Mutyam, "An application-aware
cache replacement policy for last-level caches", Architecture of
Computing Systems–ARCS, Springer Berlin Heidelberg, pp. 207-219,
2013.
[72] Chongmin Li, Dongsheng Wang, Yibo Xue, Haixia Wang and Xi Zhang.
"Enhanced adaptive insertion policy for shared caches", Advanced
Parallel Processing Technologies, Springer Berlin Heidelberg, pp. 16-30,
2011.
[73] Xiaoning Ding, Kaibo Wang and Xiaodong Zhang, "SRM-buffer: an OS
buffer management technique to prevent last level cache from thrashing
in multi-cores", ACM 6th Conference on Computer Systems, pp. 243-256,
2011.
[74] Jiayuan Meng and Kevin Skadron, "Avoiding cache thrashing due to
private data placement in last-level cache for many core scaling", IEEE
International Conference on Computer Design (ICCD), pp. 282-288,
2009.
[75] Vivek Seshadri, Onur Mutlu, Michael A. Kozuch and Todd C. Mowry, "The
Evicted-Address Filter: A unified mechanism to address both cache
pollution and thrashing", 21st International Conference on Parallel
Architectures and Compilation Techniques, pp. 355-366, 2012.
[76] Moinuddin K. Qureshi, Aamer Jaleel, Yale N. Patt, Simon C. Steely Jr and
Joel Emer, "Adaptive Insertion Policies for High Performance Caching",
ACM SIGARCH Computer Architecture News, Vol. 35, No. 2, pp. 381-
391, June 2007.
[77] Nam Duong, Dali Zhao, Taesu Kim, Rosario Cammarota, Mateo Valero
and Alexander V. Veidenbaum, "Improving cache management policies
using dynamic reuse distances", 45th Annual IEEE/ACM International
Symposium on Microarchitecture, IEEE Computer Society, pp. 389-400,
2012.
[78] Xi Zhang, Chongmin Li, Haixia Wang and Dongsheng Wang, "A cache
replacement policy using adaptive insertion and re-reference prediction",
22nd International Symposium on Computer Architecture and High
Performance Computing (SBAC-PAD), pp. 95-102, 2010.
[79] Min Feng, Chen Tian, Changhui Lin and Rajiv Gupta, "Dynamic access
distance driven cache replacement", ACM Transactions on Architecture
and Code Optimization (TACO), Vol. 8, No. 3, Article 14, 2011.
[80] Jun Zhang, Xiaoya Fan and Song-He Liu, "A pollution alleviative L2
cache replacement policy for chip multiprocessor architecture",
International Conference on Networking, Architecture and Storage,
NAS'08, pp. 310-316, 2008.
[81] Yuejian Xie and Gabriel H. Loh, "Scalable shared-cache management by
containing thrashing workloads", High Performance Embedded
Architectures and Compilers, Springer Berlin Heidelberg, pp. 262-276,
2010.
[82] Seongbeom Kim, Dhruba Chandra and Yan Solihin, "Fair cache sharing
and partitioning in a chip multiprocessor architecture", 13th International
Conference on Parallel Architectures and Compilation Techniques, IEEE
Computer Society, pp. 111-122, 2004.
[83] Jichuan Chang and Gurindar S. Sohi, "Cooperative cache partitioning for
chip multiprocessors", 21st ACM International Conference on
Supercomputing, pp. 242-252, 2007.
[84] Aamer Jaleel, William Hasenplaugh, Moinuddin Qureshi, Julien Sebot,
Simon Steely Jr and Joel Emer, "Adaptive insertion policies for managing
shared caches", 17th International Conference on Parallel Architectures
and Compilation Techniques, pp. 208-219, 2008.
[85] Carole-Jean Wu and Margaret Martonosi, "Adaptive timekeeping
replacement: Fine-grained capacity management for shared CMP
caches", ACM Transactions on Architecture and Code Optimization
(TACO), Vol. 8, No. 1, Article 3, 2011.
[86] Mainak Chaudhuri, "Pseudo-LIFO: the foundation of a new family of
replacement policies for last-level caches", 42nd Annual IEEE/ACM
International Symposium on Microarchitecture, pp. 401-412, 2009.
[87] Aamer Jaleel, Kevin B. Theobald, Simon C. Steely Jr and Joel Emer,
"High performance cache replacement using re-reference interval
prediction (RRIP)", ACM SIGARCH Computer Architecture News, Vol. 38,
No. 3, pp. 60-71, 2010.
[88] Viacheslav V. Fedorov, Sheng Qiu, A. L. Narasimha Reddy and Paul V.
Gratz, "ARI: Adaptive LLC-memory traffic management", ACM
Transactions on Architecture and Code Optimization (TACO), Vol. 10,
No. 4, Article 46, 2013.
[89] Moinuddin K. Qureshi, Aamer Jaleel, Yale N. Patt, Simon C. Steely Jr
and Joel Emer, "Set-dueling-controlled adaptive insertion for high-
performance caching", IEEE Micro, Vol. 28, No. 1, pp. 91-98, 2008.
[90] Aamer Jaleel, Hashem H. Najaf-Abadi, Samantika Subramaniam, Simon
C. Steely Jr and Joel Emer, "CRUISE: cache replacement and utility-aware
scheduling", ACM SIGARCH Computer Architecture News, Vol. 40, No.
1, pp. 249-260, 2012.
[91] George Kurian, Srinivas Devadas and Omer Khan, "Locality-Aware Data
Replication in the Last-Level Cache", IEEE 20th International Symposium
on High Performance Computer Architecture (HPCA), 2014.
[92] Jason Zebchuk, Srihari Makineni and Donald Newell, "Re-examining
cache replacement policies", IEEE International Conference on Computer
Design (ICCD), pp. 671-678, 2008.
[93] Samira Manabi Khan, Zhe Wang and Daniel A. Jimenez, "Decoupled
dynamic cache segmentation", IEEE 18th International Symposium
on High Performance Computer Architecture (HPCA), pp. 1-12, 2012.
[94] Song Jiang and Xiaodong Zhang, "LIRS: an efficient low inter-reference
recency set replacement policy to improve buffer cache performance",
ACM SIGMETRICS Performance Evaluation Review, Vol. 30, No. 1, pp.
31-42, 2002.
[95] Song Jiang and Xiaodong Zhang, "Making LRU friendly to weak locality
workloads: A novel replacement algorithm to improve buffer cache
performance", IEEE Transactions on Computers, Vol. 54, No. 8, pp. 939-
952, 2005.
[96] Izuchukwu Nwachukwu, Krishna Kavi, Fawibe Ademola and Chris Yan,
"Evaluation of techniques to improve cache access uniformities",
International Conference on Parallel Processing (ICPP), pp. 31-40, 2011.
[97] Dyer Rolán, Basilio B. Fraguela and Ramon Doallo, "Virtually split cache:
An efficient mechanism to distribute instructions and data", ACM
Transactions on Architecture and Code Optimization (TACO), Vol. 10,
No. 4, Article 27, December 2013.
[98] Huang ZhiBin, Zhu Mingfa and Xiao Limin, "LvtPPP: live-time protected
pseudopartitioning of multicore shared caches", IEEE Transactions on
Parallel and Distributed Systems, Vol. 24, No. 8, pp. 1622-1632, 2013.
[99] Bradford M Beckmann, Michael R. Marty and David A. Wood, "ASR:
Adaptive selective replication for CMP caches", 39th Annual IEEE/ACM
International Symposium on Microarchitecture, MICRO-39, pp. 443-454,
2006.
[100] Antonio Garcia-Guirado, Ricardo Fernandez-Pascual, Alberto Ros and
José M. García, "DAPSCO: Distance-aware partially shared cache
organization", ACM Transactions on Architecture and Code Optimization
(TACO), Vol. 8, No. 4, Article 25, 2012.
[101] Carole-Jean Wu, Aamer Jaleel, Will Hasenplaugh, Margaret Martonosi,
Simon C. Steely Jr and Joel Emer, "SHiP: signature-based hit predictor
for high performance caching", 44th Annual IEEE/ACM International
Symposium on Microarchitecture, pp. 430-441, 2011.
[102] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt,
Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar
Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad
Shoaib, Nilay Vaish, Mark D. Hill and David A. Wood, "The gem5
simulator", ACM SIGARCH Computer Architecture News, Vol. 39, No. 2,
pp. 1-7, 2011.
[103] Christian Bienia, Sanjeev Kumar and Kai Li, "PARSEC vs. SPLASH-2: A
quantitative comparison of two multithreaded benchmark suites on chip-
multiprocessors", IEEE International Symposium on Workload
Characterization (IISWC), pp. 47-56, 2008.
[104] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh and Kai Li, "The
PARSEC benchmark suite: Characterization and architectural
implications", 17th International Conference on Parallel Architectures and
Compilation Techniques, pp. 72-81, 2008.
[105] Mark Gebhart, Joel Hestness, Ehsan Fatehi, Paul Gratz and Stephen W.
Keckler, "Running PARSEC 2.1 on M5", University of Texas at Austin,
Department of Computer Science, Technical Report TR-09-32, 2009.
LIST OF PUBLICATIONS
[1] Muthukumar S and Jawahar P.K., “Hit Rate Maximization by Logical
Cache Partitioning in a Multi-Core Environment”, Journal of Computer
Science, Vol. 10, Issue 3, pp. 492-498, 2014.
[2] Muthukumar S and Jawahar P.K., “Sharing and Hit Based Prioritizing
Replacement Algorithm for Multi-Threaded Applications”, International
Journal of Computer Applications, Vol. 90, Issue 12, pp. 34-38, 2014.
[3] Muthukumar S and Jawahar P.K., “Cache Replacement for Multi-
Threaded Applications using Context-Based Data Pattern Exploitation
Technique”, Malaysian Journal of Computer Science, Vol. 26, Issue 4, pp.
277-293, 2014.
[4] Muthukumar S and Jawahar P.K., “Redundant Cache Data Eviction in a
Multi-Core Environment”, International Journal of Advances in
Engineering and Technology, Vol. 5, Issue 2, pp.168-175, 2013.
[5] Muthukumar S and Jawahar P.K., “Cache Replacement Policies for
Improving LLC Performance in Multi-Core Processors”, International
Journal of Computer Applications, Vol. 105, Issue 8, pp. 12-17, 2014.
TECHNICAL BIOGRAPHY
Muthukumar S (RRN: 1184211) was born on 15th April 1965, in Erode,
Tamil Nadu. He received the B.E. degree in Electronics and Communication
Engineering and the M.E. degree in Applied Electronics from Bharathiar
University, Coimbatore, in 1989 and 1992, respectively. He is currently
pursuing the Ph.D. degree in the Department of Electronics and Communication
Engineering at B.S. Abdur Rahman University, Chennai, India. He has more
than 22 years of teaching experience in various institutions, and is working as a
Professor in the Department of Electronics and Communication Engineering at
Sri Venkateswara College of Engineering, Chennai, India. His current research
interests include Multi-Core Processor Architectures, High Performance
Computing and Embedded Systems.
E-mail ID : [email protected]
Contact number : 9941437738