Source: webee.technion.ac.il/vlsi/projects/archive/2005_6/alex_marcel.pdf


VLSI Project Winter 2005/2006

VLSI Project

Least Recently Frequently Used Caching Algorithm

with Filtering Policies

Alexander Zlotnik, Marcel Apfelbaum

Supervised by: Michael Behar, Winter 2005/2006


Introduction (cont.)

Cache definition
Memory chip – part of the processor

Same technology

Speed: same order of magnitude as accessing registers

Relatively small and expensive

Acts like a hash function: holds part of the address space.


Introduction (cont.)
Cache memories – Main idea

When the processor needs an instruction or data, it first looks for it in the cache. If that fails, it brings the data from main memory to the cache and uses it from there.

Address space is partitioned into blocks

Cache holds lines; each line holds a block
A block may not exist in the cache -> cache miss

If we miss the cache:
The entire block is fetched into a line buffer and then put into the cache
Before putting the new block in the cache, another block may need to be evicted (to make room for the new block)


Introduction (cont.)
Cache aims

Fast access time

Fast search mechanism

High hit ratio

Highly effective replacement mechanism

High adaptability – fast replacement of lines that are no longer needed

Long sighted – estimating whether a block will be used in the future


Project Objective
Develop an LRFU caching mechanism

Implement a cache-entrance filtering technique

Compare and analyze against LRU

Research various configurations of LRFU, in order to achieve maximum hit rate


Project Requirements
Develop for the SimpleScalar platform to simulate processor caches

Run the developed caching & filtering mechanisms on accepted benchmarks

C language

No hardware-component equivalence needed; software implementation only


Background and Theory
Cache replacement options:

FIFO, LRU, Random, Pseudo-LRU, LFU

Currently used algorithms:
LRU (2-way: requires 1 bit per set to mark the latest-accessed way)
Pseudo-LRU (4 ways and more, fully associative)

Pseudo-LRU (4-way example):
Bit 0 – specifies whether the last access was to ways (0,1) or (2,3)
Bit 1 – specifies which of ways 0 and 1 was last accessed
Bit 2 – specifies which of ways 2 and 3 was last accessed

(Tree diagram: Bit 0 at the root, with Bit 1 over ways 0/1 and Bit 2 over ways 2/3.)


Background and Theory (cont)

LRU

Advantages:
High adaptability
1-cycle algorithm
Low memory usage

Disadvantage:
Short sighted

LFU

Advantages:
Long sighted
Smarter

Disadvantages:
Cache pollution
Requires many cycles
More memory needed


Background and Theory (cont)

Observation
Both recency and frequency affect the likelihood of future references

Goal
A replacement algorithm that allows a flexible trade-off between recency and frequency

The idea: LRFU (Least Recently/Frequently Used)
Subsumes both the LRU and LFU algorithms
Overcomes the cycles used by LFU by filtering cache entrances
Yields better performance than both


Development Stages

1. Studying the background
2. Learning the SimpleScalar sim-cache platform
3. Developing the LRFU caching algorithm for SimpleScalar
4. Developing the filtering policy
5. Benchmarking (smart environment)
6. Analyzing various LRFU configurations and comparing with the LRU algorithm


Principles
The LRFU policy associates a value with each block. This value quantifies the likelihood that the block will be referenced in the near future. Each past reference to a block adds a contribution to this value, and that contribution is determined by a weighing function F.

With past references at times t1, t2, t3 and current time tc, let δi = tc − ti. Then:

C_tc(block) = F(δ1) + F(δ2) + F(δ3)


Principles (cont)
Weighing function: F(x) = (1/2)^(λx)

Monotonically decreasing
Subsumes LRU and LFU:

When λ = 0 (i.e. F(x) = 1), it becomes LFU
When λ = 1 (i.e. F(x) = (1/2)^x), it becomes LRU
When 0 < λ < 1, it is between LFU and LRU

(Figure: F(x) plotted against x = current time − reference time, spanning the spectrum from the LRU extreme F(x) = (1/2)^x to the LFU extreme F(x) = 1.)


Principles (cont)
Update of C(block) over time

Only two counters per block are needed to calculate C(block).

Proof: suppose C was last evaluated at time t1, when the reference ages were δ1, δ2, δ3, and let δ = t2 − t1. Then:

C_t2(b) = F(δ1+δ) + F(δ2+δ) + F(δ3+δ)
        = (1/2)^λ(δ1+δ) + (1/2)^λ(δ2+δ) + (1/2)^λ(δ3+δ)
        = ((1/2)^λδ1 + (1/2)^λδ2 + (1/2)^λδ3) · (1/2)^λδ
        = C_t1(b) × F(δ)

So it suffices to store C_t1(b) and the time of the last evaluation.


Design and Implementation
Filtering

(Flow diagram: a data address is looked up in the cache and, on a miss, in the victims cache. The filter then either inserts the block into the cache, with the block removed from the cache by LRFU going into the victims cache, or filters the block out, inserting it directly into the victims cache.)


Design and Implementation (cont)

Data structure

For each block, LRFU uses two bounded counters


Hardware Budget

Counters
Each block in the cache requires two bounded counters:
Previous C(t)
Time that has passed since the previous access

Victims cache
Its size will be based on empirical analysis


Algorithms

Filtering
We implemented a very simple filtering algorithm, whose single task is to cause fewer changes in the cache.

After a cache miss, the fetched block is entered into the cache with a configurable probability 0 < p < 1. If the block is not entered into the cache, it is automatically entered into the victims cache.

Replacement
After a cache miss, C(t) is calculated for each block in the set, and the one with the smallest C(t) is selected for replacement.


Results

(Chart: hit rate as a function of cache size, in number of blocks.)


Results (cont)

(Chart: hit rate as a function of λ.)


Special Problems

Software simulation of hardware
Utilizing the existing data structures of SimpleScalar

Finding the perfect C(t)
Applying mathematical theory in practice


Conclusions
We implemented a different cache replacement mechanism and obtained exciting results

Hardware implementation of the mechanism is hard, but possible

The implementation achieved its goals:
Subsumes both the LRU and LFU algorithms
Yields better performance than both (up to 30%!)


Future Research

Implementation of better filtering techniques

A dynamic version of the LRFU algorithm:
Adjust λ periodically depending on the evolution of the workload

Research into the hardware needed for LRFU