Improving Cache Performance
• Four categories of optimisation:
  – Reduce miss rate
  – Reduce miss penalty
  – Reduce miss rate or miss penalty using parallelism
  – Reduce hit time
AMAT = Hit time + Miss rate × Miss penalty
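As a quick numerical sketch of the AMAT formula (the hit time, miss rate, and miss penalty below are invented for illustration, not taken from any particular machine):

```python
def amat(hit_time, miss_rate, miss_penalty):
    # Average Memory Access Time = hit time + miss rate x miss penalty
    return hit_time + miss_rate * miss_penalty

# Illustrative numbers: 1-cycle hit, 5% miss rate, 100-cycle miss penalty
print(amat(1.0, 0.05, 100))  # 6.0 cycles
```

Note how the miss penalty is weighted by the miss rate: halving either one buys the same improvement.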
5.5. Reducing Miss Rate
• Three sources of misses:
  – Compulsory
    • “Cold start” misses
  – Capacity
    • Cache is full
  – Conflict
    • Set is full/block is occupied
• Corresponding remedies:
  – Increase block size
  – Increase size of cache
  – Increase degree of associativity
Larger Block Size
• Bigger blocks reduce compulsory misses
  – Spatial locality
• BUT:
  – Increased miss penalty
    • More data to transfer
  – Possibly increased overall miss rate
    • More conflict and capacity misses as there are fewer blocks
Effect of Block Size
[Figure: three sketches plotted against block size — miss rate; miss penalty (access + transfer); and the resulting AMAT.]
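The trade-off in the figure can be reproduced with a toy model (every constant below is invented for illustration): miss rate falls with block size at first, then rises as fewer blocks fit; miss penalty grows linearly with transfer size; the resulting AMAT is U-shaped.

```python
def miss_penalty(block_bytes, access_cycles=40, cycles_per_byte=0.5):
    # Miss penalty = fixed access time + transfer time that grows with block size
    return access_cycles + cycles_per_byte * block_bytes

def miss_rate(block_bytes):
    # Toy model: spatial locality cuts compulsory misses at first,
    # then fewer blocks make conflict/capacity misses dominate
    return 0.05 * (16 / block_bytes) + 0.0002 * block_bytes

def amat(block_bytes, hit_time=1.0):
    return hit_time + miss_rate(block_bytes) * miss_penalty(block_bytes)

sizes = [16, 32, 64, 128, 256]
best = min(sizes, key=amat)  # 32 bytes is optimal for these made-up constants
```

The point is not the specific optimum but that one exists: past it, the growing penalty and miss rate overwhelm the spatial-locality gain.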
Larger Caches
• Reduces capacity misses
• Increases hit time and cost
Higher Associativity
• Miss rates improve with higher associativity
• Two rules of thumb:
  – An 8-way set associative cache is almost as effective as fully associative
    • But much simpler!
  – 2:1 cache rule
    • A direct-mapped cache of size N has about the same miss rate as a 2-way set associative cache of size N/2
Way Prediction
• Set-associative cache predicts which block will be needed on next access to the set
• Only one tag check is done
  – If mispredicted, the whole set must be checked
• E.g. Alpha 21264 instruction cache
  – Prediction rate > 85%
  – Correct prediction: 1-cycle hit
  – Misprediction: 3 cycles
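Plugging in the Alpha 21264 figures above gives the expected hit time under way prediction (treating the quoted ">85%" as exactly 85% for the arithmetic):

```python
def expected_hit_time(pred_rate, hit_cycles=1, mispredict_cycles=3):
    # Correct prediction: fast 1-cycle hit; misprediction: whole set checked
    return pred_rate * hit_cycles + (1 - pred_rate) * mispredict_cycles

t = expected_hit_time(0.85)  # about 1.3 cycles on average
```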
Pseudo-Associative Caches
• Check a direct mapped cache for a hit as usual
• If it misses, check a second block
  – Invert the MSB of the index
• One fast and one slow hit time
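A sketch of how the second ("pseudo") location is derived by inverting the most significant bit of the index (the cache geometry here is assumed purely for illustration):

```python
INDEX_BITS = 8          # 256 sets (assumed for illustration)
BLOCK_OFFSET_BITS = 5   # 32-byte blocks (assumed)

def index_of(addr):
    # Strip the block offset, then keep the index bits
    return (addr >> BLOCK_OFFSET_BITS) & ((1 << INDEX_BITS) - 1)

def alternate_index(index):
    # Invert the MSB of the index to locate the second block to probe
    return index ^ (1 << (INDEX_BITS - 1))

i = index_of(0x1234)      # 145
j = alternate_index(i)    # 17
```

Because inverting the MSB is an involution, the two blocks are each other's alternates, so a swap between them keeps both findable.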
Compiler Optimisations
• Compilers can optimise code to minimise miss rates:
  – Reordering procedures
  – Aligning basic blocks with cache blocks
  – Reorganising array element accesses
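A classic instance of "reorganising array element accesses" is loop interchange: traverse a row-major array row by row, so consecutive accesses fall in the same cache block. Python is used here only to show the transformation; in practice the compiler applies it to C or Fortran loops.

```python
N = 4
x = [[i * N + j for j in range(N)] for i in range(N)]

# Before: column-major traversal strides through memory (poor spatial locality)
total_before = 0
for j in range(N):
    for i in range(N):
        total_before += x[i][j]

# After loop interchange: row-major traversal touches consecutive elements
total_after = 0
for i in range(N):
    for j in range(N):
        total_after += x[i][j]

assert total_before == total_after  # the interchange preserves the result
```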
5.6. Reduce Miss Rate or Miss Penalty via Parallelism
• Three techniques that overlap instruction execution with memory access
Nonblocking caches
• Dynamic scheduling allows CPU to continue with other instructions while waiting for data
• Nonblocking cache allows other cache accesses to continue while waiting for data
Hardware Prefetching
• Fetch data/instructions before they are requested by the processor
  – Either into the cache or another buffer
• Particularly useful for instructions
  – High degree of spatial locality
• UltraSPARC III
  – Special prefetch cache for data
  – Increases effectiveness by about four times
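The benefit of even the simplest scheme, next-block prefetching, can be seen with a toy miss counter (the model ignores capacity and timing, and is only a sketch): on each miss the next sequential block is fetched as well, which removes half the misses of a purely sequential stream.

```python
def count_misses(block_addrs, prefetch=False):
    fetched = set()
    misses = 0
    for b in block_addrs:
        if b not in fetched:
            misses += 1
            fetched.add(b)
            if prefetch:
                fetched.add(b + 1)  # also fetch the next sequential block
    return misses

stream = list(range(100))               # purely sequential block accesses
m_plain = count_misses(stream)          # 100 misses
m_pref = count_misses(stream, True)     # 50 misses
```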
Compiler Prefetching
• Compiler inserts “prefetch” instructions
• Two types:
  – Prefetch register value
  – Prefetch data cache block
• Can be faulting or non-faulting
• Cache continues as normal while data is prefetched
SPARC V9
• Prefetch instruction:

  prefetch [%rs1 + %rs2], fcn
  prefetch [%rs1 + imm13], fcn

• fcn = prefetch function:
  – 0 = Prefetch for several reads
  – 1 = Prefetch for one read
  – 2 = Prefetch for several writes
  – 3 = Prefetch for one write
  – 4 = Prefetch page
5.7. Reducing Hit Time
• Critical
  – Often affects CPU clock cycle time
Small, simple caches
• Small usually equals fast in hardware
• A small cache may reside on the processor chip
  – Decreases communication
  – Compromise: tags on chip, data separate
• Direct mapped
  – Data can be read in parallel with tag checking
Avoiding address translation
• Physical caches
  – Use physical addresses
  – Address translation must happen before cache lookup
• Virtual caches
  – Use virtual addresses
  – Protection issues
  – High context switching overhead
Virtual caches
• Minimising context switch overhead:
  – Add a process-identifier tag to the cache
• Multiple virtual addresses may refer to a single physical address
  – Hardware enforces anti-aliasing
  – Software requires the less significant bits to be the same
Avoiding address translation (cont.)
• Choice of page size:
  – Bigger than cache index + offset
  – Address translation and tag lookup can then happen in parallel

[Diagram: the CPU issues Page no. | Page offset, and the cache splits the address into Tag | Index | Offset; the VM translation maps the page number while the cache indexes with the untranslated page-offset bits, so the two proceed in parallel.]
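The page-size condition above can be checked arithmetically: the cache index and block offset must fit inside the page offset, since those bits pass through translation unchanged. The cache geometries below are assumed purely for illustration.

```python
import math

def can_overlap(page_bytes, cache_bytes, block_bytes, associativity=1):
    # Bits the cache needs before translation completes: index + block offset
    sets = cache_bytes // (block_bytes * associativity)
    index_offset_bits = int(math.log2(sets)) + int(math.log2(block_bytes))
    # Bits unchanged by translation: the page offset
    page_offset_bits = int(math.log2(page_bytes))
    return index_offset_bits <= page_offset_bits

ok = can_overlap(8192, 8192, 32)     # 8 kB page, 8 kB direct-mapped: True
bad = can_overlap(8192, 32768, 32)   # 32 kB direct-mapped: index needs translated bits
```

Raising associativity shrinks the index, which is one reason larger caches that still want overlapped lookup tend to be more associative.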
Pipelining cache access
• Split cache access into several stages
• Impacts on branch and load delays
Trace caches
• Blocks follow program flow rather than spatial locality!
• Branch prediction is taken into account by cache
• Intel NetBurst microarchitecture
• Complicates address mapping
• Minimises wasted space within blocks
Cache Optimisation Summary
• Cache optimisation is very complex
  – Improving one factor may have a negative impact on another
5.8. Main Memory
• Latency and bandwidth are both important
• Latency is composed of two factors:
  – Access time
  – Cycle time
• Two main technologies:
  – DRAM
  – SRAM
5.9. Virtual Memory
• Physical memory is divided into blocks
  – Allocated to processes
  – Provides protection
  – Allows swapping to disk
  – Simplifies loading
• Historically: overlays
  – Programmer-controlled swapping
Terminology
• Block:
  – Page
  – Segment
• Miss:
  – Page fault
  – Address fault
• Memory mapping (address translation)
  – Virtual address → physical address
Characteristics
• Block size
  – 4 kB – 64 kB
• Hit time
  – 50 – 150 cycles
• Miss penalty
  – 1 000 000 – 10 000 000 cycles
• Miss rate
  – 0.000 01% – 0.001%
Categorising VM Systems
• Fixed block size
  – Pages
• Variable block size
  – Segments
  – Difficult replacement
• Hybrid approaches
  – Paged segments
  – Multiple page sizes (2^n × smallest)
Q1: Block placement?
• Anywhere in memory
  – “Fully associative”
  – Minimises miss rate
Q2: Block identification?
• Page/segment number gives the physical page address
  – Paging: offset concatenated
  – Segmentation: offset added
• Uses a page table
  – One entry per page of the virtual address space
  – To save space: an inverted page table
    • One entry per page of physical memory
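For paging, the lookup and concatenation described above amount to a few bit operations (page size and page-table contents assumed for illustration):

```python
PAGE_OFFSET_BITS = 12          # 4 kB pages (assumed)

page_table = {0: 5, 1: 9}      # virtual page number -> physical page number

def translate(vaddr):
    vpn = vaddr >> PAGE_OFFSET_BITS            # block identification
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    if vpn not in page_table:
        raise LookupError("page fault")        # the miss case
    # Paging: physical page number with the offset concatenated
    return (page_table[vpn] << PAGE_OFFSET_BITS) | offset

pa = translate(0x1ABC)   # VPN 1 -> PPN 9, offset 0xABC -> 0x9ABC
```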
Q3: Block replacement?
• Least-recently used (LRU)
  – Minimises miss rate
  – Hardware provides a use bit or reference bit
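Exact LRU over thousands of page frames is too expensive, so operating systems approximate it with the hardware use bit, e.g. the clock (second-chance) algorithm, sketched here in minimal form:

```python
def clock_choose_victim(frames, use_bits, hand):
    """Pick a victim frame index: recently used frames get a second chance."""
    n = len(frames)
    while True:
        if use_bits[hand]:
            use_bits[hand] = 0        # clear the use bit and move on
            hand = (hand + 1) % n
        else:
            return hand               # not recently used: evict this frame

frames = ["A", "B", "C", "D"]
use_bits = [1, 1, 0, 1]               # set by hardware on each reference
victim = clock_choose_victim(frames, use_bits, 0)   # frame 2 ("C")
```

The loop always terminates: after at most one full sweep every use bit has been cleared.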
Q4: Write strategy?
• Write back
  – With a dirty bit
You won’t become famous by being the first to try write through!
Fast Address Translation
• Page tables are big
  – Stored in memory themselves
  – Two memory accesses for every datum!
• Principle of locality
  – Cache recent translations
  – Translation look-aside buffer (TLB), or translation buffer (TB)
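A minimal sketch of a TLB sitting in front of the page table (page size, table contents, and an unbounded TLB are all assumptions of the sketch; a real TLB has a fixed number of entries and a replacement policy):

```python
PAGE_OFFSET_BITS = 12              # 4 kB pages (assumed)
page_table = {0: 7, 1: 3, 2: 8}    # VPN -> PPN (illustrative)
tlb = {}                           # cache of recent translations

def translate(vaddr):
    vpn = vaddr >> PAGE_OFFSET_BITS
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    if vpn in tlb:                 # TLB hit: no page-table access needed
        ppn = tlb[vpn]
    else:                          # TLB miss: walk the in-memory page table
        ppn = page_table[vpn]
        tlb[vpn] = ppn             # cache the translation for next time
    return (ppn << PAGE_OFFSET_BITS) | offset

translate(0x1004)   # miss: loads VPN 1 into the TLB
translate(0x1ABC)   # hit: same page, the page table is not touched
```

By locality, most translations hit in the TLB, removing the extra memory access on the common path.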
Alpha 21264 TLB
Selecting a Page Size
• Big
  – Smaller page table
  – Allows parallel cache access
  – Efficient disk transfers
  – Reduces TLB misses
• Small
  – Less memory wastage (internal fragmentation)
  – Quicker process startup
Putting it ALL Together!
SPARC Revisited
Two SPARCs
• SuperSPARC
  – 1992
  – 32-bit superscalar design
• UltraSPARC
  – Late 1990s
  – 64-bit design
  – Graphics support (VIS)
UltraSPARC
• Four-way superscalar execution
• Two integer ALUs
• FP unit
  – Five functional units
• Graphics unit
Pipeline
• 9 stages:
  – Fetch
– Decode
– Grouping
– Execution
– Cache access
– Load miss
– Integer pipe wait (for FP/graphics pipelines)
– Trap resolution
– Writeback
Branch Handling
• Dynamic branch prediction
  – Two-bit scheme
  – Every second instruction in the cache has prediction bits (predicts up to 2048 branches)
  – 88% success rate (integer)
• Target prediction
  – Fetches from the predicted path
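The two-bit scheme is a saturating counter per branch: it takes two consecutive mispredictions to flip the prediction, so a loop branch is mispredicted only once per loop exit. A minimal sketch (the initial counter state is an arbitrary choice of this example):

```python
class TwoBitPredictor:
    """Two-bit saturating counter: states 0,1 predict not-taken; 2,3 predict taken."""
    def __init__(self):
        self.state = 2  # start "weakly taken" (initial state is arbitrary)

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Saturate at the ends of the 0..3 range
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
# A loop branch: taken 8 times, one exit, then taken 8 more times
outcomes = [True] * 8 + [False] + [True] * 8
correct = sum(1 for t in outcomes if (p.predict() == t, p.update(t))[0])
# 16 of 17 correct: only the single loop exit is mispredicted
```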
FPU
• Five functional units:
  – Add
  – Multiply
  – Divide/square root
  – Two graphics units (add and multiply)
• Mostly fully pipelined (latency 3 cycles)
  – Except divide and square root (not pipelined; latency is 22 cycles for 64-bit)
Memory Hierarchy
• On-chip instruction and data caches
  – Data: 16 kB, direct-mapped, write-through
  – Instructions: 16 kB, 2-way set associative
  – Both virtually addressed
• External cache
  – Up to 4 MB
Virtual Memory
• 64-bit virtual addresses → 44-bit physical addresses
• TLB
  – 64-entry, fully-associative cache
Multimedia Support (VIS)
• Integrated with FPU
• Partitioned operations
  – Multiple smaller values packed into 64 bits
• Video compression instructions
  – E.g. a motion estimation instruction replaces 48 simple instructions for MPEG compression
The End!