Source: webee.technion.ac.il/vlsi/projects/archive/2005_6/alex_marcel.pdf
TRANSCRIPT
VLSI Project Winter 2005/2006
VLSI Project
Least Recently Frequently Used Caching Algorithm with Filtering Policies
Alexander Zlotnik, Marcel Apfelbaum
Supervised by: Michael Behar, Winter 2005/2006
Introduction (cont.)
Cache definition
Memory chip - part of the processor
Same technology
Speed: same order of magnitude as accessing registers
Relatively small and expensive
Acts like a hash function: holds part of the address space
Introduction (cont.)
Cache memories - main idea
When the processor needs an instruction or data, it first looks for it in the cache. If that fails, it brings the data from main memory to the cache and uses it from there.
The address space is partitioned into blocks.
The cache holds lines; each line holds a block. A block may not exist in the cache -> cache miss.
If we miss the cache: the entire block is fetched into a line buffer and then put into the cache. Before putting the new block in the cache, another block may need to be evicted from the cache (to make room for the new block).
Introduction (cont.)
Cache aims
Fast access time
Fast search mechanism
High hit ratio
Highly effective replacement mechanism
High adaptability - fast eviction of lines that are no longer needed
Long-sightedness - estimating whether a block will be used in the future
Project Objective
Develop an LRFU caching mechanism
Implement a cache-entrance filtering technique
Compare and analyze against LRU
Research various configurations of LRFU in order to achieve the maximum hit rate
Project Requirements
Develop for the SimpleScalar platform to simulate processor caches
Run the developed caching & filtering mechanisms on accepted benchmarks
C language
No hardware-component equivalence needed; software implementation only
Background and Theory
Cache replacement options: FIFO, LRU, Random, Pseudo-LRU, LFU
Currently used algorithms:
LRU (2-way: requires 1 bit per set to mark the latest accessed way)
Pseudo-LRU (4 ways and more, fully associative)
Pseudo-LRU (4-way example):
Bit 0 - specifies whether the last access was to ways (0,1) or (2,3)
Bit 1 - specifies which of ways 0 and 1 was accessed last
Bit 2 - specifies which of ways 2 and 3 was accessed last
[Diagram: binary decision tree with Bit 0 at the root and Bit 1, Bit 2 at the leaves.]
Background and Theory (cont)
LRU
Advantages: high adaptability; 1-cycle algorithm; low memory usage
Disadvantage: short-sighted
LFU
Advantages: long-sighted; smarter
Disadvantages: cache pollution; requires many cycles; more memory needed
Background and Theory (cont)
Observation: both recency and frequency affect the likelihood of future references.
Goal: a replacement algorithm that allows a flexible trade-off between recency and frequency.
The idea: LRFU (Least Recently/Frequently Used)
Subsumes both the LRU and LFU algorithms
Overcomes the cycles used by LFU by filtering cache entrances
Yields better performance than both
Development Stages
1. Studying the background
2. Learning the SimpleScalar sim-cache platform
3. Developing the LRFU caching algorithm for SimpleScalar
4. Developing the filtering policy
5. Benchmarking (smart environment)
6. Analyzing various LRFU configurations and comparing with the LRU algorithm
Principles
The LRFU policy associates a value with each block. This value quantifies the likelihood that the block will be referenced in the near future. Each reference to a block in the past adds a contribution to this value, and that contribution is determined by a weighing function F.
[Timeline: references to the block at times t1, t2, t3, with current time tc and gaps δ1, δ2, δ3 back to them.]

C_tc(block) = F(δ1) + F(δ2) + F(δ3), where δ1 = tc - t1, δ2 = tc - t2, δ3 = tc - t3
Principles (cont)
Weighing function F(x) = (1/2)^(λx)
Monotonically decreasing
Subsumes LRU and LFU:
When λ = 0 (i.e. F(x) = 1), it becomes LFU
When λ = 1 (i.e. F(x) = (1/2)^x), it becomes LRU
When 0 < λ < 1, it is between LFU and LRU
[Plot: F(x) versus x = current time - reference time. F(x) = 1 is the LFU extreme, F(x) = (1/2)^x is the LRU extreme; intermediate values of λ span the LRU/LFU spectrum between 0 and 1.]
Principles (cont)
Update of C(block) over time
Only two counters for each block are needed to calculate C(block).
Proof: suppose block b was last evaluated at time t1, with its references lying δ1, δ2, δ3 in the past, and is evaluated again at t2, with δ = t2 - t1. Then:

C_t2(b) = F(δ1 + δ) + F(δ2 + δ) + F(δ3 + δ)
        = (1/2)^(λ(δ1 + δ)) + (1/2)^(λ(δ2 + δ)) + (1/2)^(λ(δ3 + δ))
        = ((1/2)^(λδ1) + (1/2)^(λδ2) + (1/2)^(λδ3)) · (1/2)^(λδ)
        = C_t1(b) × F(δ)
Design and Implementation
Filtering

[Flowchart: a data address is looked up in the cache. If it is in the cache, the access ends there. If it is not, the victim cache is checked: a block found in the victim cache is inserted into the cache; otherwise the filter either inserts the block into the cache or filters it out into the victim cache. Data removed from the cache by LRFU is also inserted into the victim cache.]
Design and Implementation (cont)
Data structure
LRFU uses two BOUNDED counters for each block.
Hardware Budget
Counters: each block in the cache requires two bounded counters
Previous C(t)
Time that has passed since the previous access
Victim cache: its size will be based on empirical analysis
Algorithms
Filtering: we implemented a very simple filtering algorithm whose single task is to cause fewer changes in the cache. After a cache miss, the fetched block is entered into the cache with a probability 0 < p < 1, p configurable. If the block is not entered into the cache, it is automatically entered into the victim cache.
Replacement: after a cache miss, C(t) is calculated for each block in the set, and the one with the smallest C(t) is selected for replacement.
Results

[Plot: hit rate versus cache size (# of blocks).]
Results (cont)

[Plot: hit rate versus λ.]
Special Problems
Software simulation of hardware: utilizing the existing data structures of SimpleScalar
Finding the perfect C(t): putting mathematical theory into practice
Conclusions
We implemented a different cache replacement mechanism and obtained exciting results.
Hardware implementation of the mechanism is hard, but possible.
The implementation achieved the goals:
Subsumes both the LRU and LFU algorithms
Yields better performance than both (up to 30%!)
Future Research
Implementation of better filtering techniques
A dynamic version of the LRFU algorithm: adjust λ periodically depending on the evolution of the workload
Research into the hardware needed for LRFU