cache optimization for mobile devices running multimedia applications
![Page 1: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/1.jpg)
Cache Optimization for Mobile Devices Running Multimedia Applications
Komal Kasat, Gaurav Chitroda, Nalini Kumar
![Page 2: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/2.jpg)
Outline
Introduction
MPEG-4
Architecture
Simulation
Results
Conclusion
![Page 3: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/3.jpg)
INTRODUCTION
![Page 4: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/4.jpg)
Multimedia
Combination of graphics, video, and audio
Operates on data presented visually and aurally
In multimedia operations, compression discards the data that is least significant to the viewer
Common events are represented by fewer bits, rare events by more bits
The transmitter encodes and transmits; the decoder decodes and plays back
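The idea that common events get fewer bits and rare events get more can be illustrated with ideal code lengths. This is a minimal sketch, not the coder used by any particular standard; the four symbols and their probabilities are hypothetical values chosen for illustration.

```python
import math

def ideal_code_lengths(probabilities):
    """Whole-bit code length for each symbol: ceil(-log2(p))."""
    return {sym: math.ceil(-math.log2(p)) for sym, p in probabilities.items()}

# Hypothetical symbol probabilities (assumption, not from the slides).
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = ideal_code_lengths(probs)

# Expected bits per symbol, compared with a fixed 2-bit code for 4 symbols.
average_bits = sum(probs[s] * lengths[s] for s in probs)
print(lengths, average_bits)
```

With these assumed probabilities the most common symbol needs 1 bit and the rarest need 3, and the average drops below the 2 bits a fixed-length code would use.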
![Page 5: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/5.jpg)
Caches
Size and complexity of multimedia applications are increasing
Critical applications have time constraints
They require more computational power and more traffic from CPU to memory
There is a significant processor/memory speed gap
To deal with memory bottlenecks we use caches
A cache improves performance by reducing data access time
![Page 6: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/6.jpg)
Memory Hierarchy (figure showing CPU, Main Memory and BUS)
![Page 7: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/7.jpg)
Memory Hierarchy (figure showing CPU, Cache, Main Memory and BUS)
![Page 8: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/8.jpg)
Memory Hierarchy (figure showing CPU, CL1, CL2, Main Memory and BUS)
![Page 9: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/9.jpg)
Data transfer among CPU, Cache and Main Memory
Data between the CPU and the cache is transferred as data objects
Data between the cache and main memory is transferred as blocks
(figure: CPU, Cache and Main Memory, with the links labeled "Data Object Transfer" and "Block Transfer")
![Page 10: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/10.jpg)
Why Cache Optimization?
With improved CPUs, memory subsystem deficiency is the main performance bottleneck
Video data shows sufficient reuse of values for caching to reduce the raw required memory bandwidth
High data rates, large sizes and distinctive memory access patterns of MPEG exert strain on caches
Though miss rates are acceptable, they increase cache-memory traffic
Dropped frames or blocking make caches inefficient
Power and bandwidth are limited in mobile embedded applications
Cache inefficiency has an impact on system cost
![Page 11: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/11.jpg)
MPEG-4
![Page 12: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/12.jpg)
MPEG-4
Moving Picture Experts Group
Next-generation global multimedia standard
Defines the compression of Audio and Visual (AV) digital data
Exploits both spatial and temporal redundancy for compression
What is the technique?
![Page 13: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/13.jpg)
Break data into 8 x 8 pixel blocks
Apply the Discrete Cosine Transform
Quantize, then apply RLE and an entropy coding algorithm
For temporal redundancy: motion compensation
3 types of frames:
◦ I (intra): contain the complete image; compressed for spatial redundancy only
◦ P (predicted): built from 16 x 16 macro blocks
  Macro block: consists of pixels from the closest previous I or P frame, so that fewer bits are required
◦ B (bidirectional): information not in the reference frames is encoded block by block
  There are 2 reference frames (I or P), one before and one after in temporal order
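The spatial part of the pipeline above (8 x 8 block, DCT, quantization, run-length encoding) can be sketched in a few lines. This is an illustrative sketch only: the uniform quantization step of 16, the row-major scan and the flat grey test block are assumptions, and real codecs use quantization matrices, a zigzag scan and an entropy coder after RLE.

```python
import math

N = 8  # block edge length, as on the slide

def dct_2d(block):
    """Naive 2-D DCT-II of an N x N block with orthonormal scaling."""
    def alpha(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = alpha(u) * alpha(v) * s
    return out

def quantize(coeffs, step=16):
    """Uniform quantization; the step size is an illustrative assumption."""
    return [[int(round(c / step)) for c in row] for row in coeffs]

def run_length_encode(values):
    """Encode a flat list as (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

# A flat grey block: all energy ends up in the single DC coefficient.
flat_block = [[128] * N for _ in range(N)]
quantized = quantize(dct_2d(flat_block))
scanned = [q for row in quantized for q in row]  # row-major scan (assumption)
encoded = run_length_encode(scanned)
print(encoded)
```

For the flat block, the 64 samples collapse to one non-zero coefficient followed by a single run of 63 zeros, which is why smooth image regions compress well.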
![Page 14: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/14.jpg)
Consider a GOP with 7 picture frames
Due to dependencies, frames are processed in non-temporal order
The encoding, transmission and decoding order should be the same
2 parameters, M and N, are specified at the encoder
◦ An I frame is decoded every N frames
◦ A P frame is decoded every M frames
◦ The rest are B frames
Consider the simplified bit stream hierarchical structure
![Page 15: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/15.jpg)
GOP with N=7 and M=3 (figure: frames I1, B2, B3, P4, B5, B6, P7 in temporal order, with arrows labeled "Prediction" and "Bidirectional Prediction")
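The frame pattern for N=7 and M=3, and the non-temporal processing order it implies, can be reproduced with a short sketch. The function names are hypothetical, and the reordering rule (each reference frame is moved ahead of the B frames that precede it in display order) is the usual textbook scheme, assumed here rather than taken from the slides.

```python
def gop_frame_types(n, m):
    """Frame labels in temporal (display) order for one GOP of n frames."""
    labels = []
    for k in range(n):
        if k == 0:
            kind = "I"          # first frame of the GOP
        elif k % m == 0:
            kind = "P"          # every m-th frame is predicted
        else:
            kind = "B"          # the rest are bidirectional
        labels.append(f"{kind}{k + 1}")
    return labels

def coding_order(frames):
    """Emit each I/P frame before the B frames that depend on it."""
    order, pending_b = [], []
    for frame in frames:
        if frame.startswith("B"):
            pending_b.append(frame)
        else:
            order.append(frame)
            order.extend(pending_b)
            pending_b = []
    # Trailing B frames would wait for the next GOP's reference frame.
    return order + pending_b

display = gop_frame_types(7, 3)
print(display)
print(coding_order(display))
```

For N=7 and M=3 the display order is I1 B2 B3 P4 B5 B6 P7, while the coding order becomes I1 P4 B2 B3 P7 B5 B6.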
![Page 16: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/16.jpg)
Simplified bit stream hierarchy (figure):
Sequence: Sequence Header, GOP, ..., GOP
GOP: GOP Header, Picture, ..., Picture
Picture: Picture Header, Slice, ..., Slice
Slice: Slice Header, Macro-block, ..., Macro-block
Macro-block: Macro-block Header, Block, ..., Block
![Page 17: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/17.jpg)
The decoder reads data as a stream of bits
Each section is identified by a unique bit pattern
A GOP contains at least one I frame and the dependent P and B frames
There are dependencies while decoding the encoded video
So selecting the right cache parameters improves cache performance significantly
Hence cache optimization is important
![Page 18: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/18.jpg)
ARCHITECTURE
![Page 19: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/19.jpg)
Cache Design Parameters
Cache Size:
◦ Most significant design parameter
◦ Usually increased by factors of two
◦ Increasing cache size shows improvement
◦ Cost and space constraints make it a critical design decision
Line Size:
◦ Larger line size: lower miss rates, better use of spatial locality
◦ Sub-block placement helps decouple the size of cache lines and the memory bus
◦ More data has to be read and written back on a miss
◦ Memory traffic is minimal with small lines
![Page 20: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/20.jpg)
Associativity:
◦ Better performance by increasing associativity for small caches
◦ Going from direct mapped to 2-way may reduce memory traffic by 50% for small cache sizes
◦ Associativity greater than 4 shows minimal benefit across all cache sizes
Multilevel Caches:
◦ A CL2 cache between CL1 and main memory significantly improves CPU performance
◦ Adding CL2 decreases bus traffic and latency
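How cache size, line size and associativity interact can be seen by computing the number of sets and splitting an address into tag, index and offset. This is a generic sketch of a set-associative cache layout, not the simulated hardware; the 8 KB, 16 B, 4-way configuration matches one CL1 setting used in the deck, and the address `0x12345` is an arbitrary illustrative value.

```python
def cache_geometry(size_bytes, line_bytes, ways):
    """Return (sets, offset_bits, index_bits) for power-of-two parameters."""
    sets = size_bytes // (line_bytes * ways)
    offset_bits = line_bytes.bit_length() - 1  # log2 of the line size
    index_bits = sets.bit_length() - 1         # log2 of the set count
    return sets, offset_bits, index_bits

def split_address(addr, offset_bits, index_bits):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

# 8 KB cache, 16 B lines, 4-way set associative.
sets, offset_bits, index_bits = cache_geometry(8 * 1024, 16, 4)
tag, index, offset = split_address(0x12345, offset_bits, index_bits)
print(sets, offset_bits, index_bits, tag, index, offset)
```

Doubling the associativity at a fixed size halves the number of sets, and doubling the line size does the same while moving one bit from the index to the offset.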
![Page 21: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/21.jpg)
Simulated Architecture
![Page 22: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/22.jpg)
DSP decodes the encoded video stream
CL1 is a split cache with D1 and I1
CL2 is a unified cache
DSP and main memory are connected via a shared bus
DMA I/O transfers and buffers data from storage to main memory
DSP decodes and writes video streams to main memory
CPU reads and writes into main memory through its cache hierarchy
![Page 23: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/23.jpg)
SIMULATION
![Page 24: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/24.jpg)
Simulation Tools
Cachegrind (from Valgrind)
◦ It is a 'cache profiler' simulation package
◦ Performs detailed simulation of D1, I1, CL2 caches
◦ Gives the total references, misses, miss rates
◦ It is useful for programs written in any language
VisualSim
◦ Provides block libraries for CPU, caches, bus, main memory
◦ Simulation model is developed by selecting appropriate blocks and making connections
◦ Has functionality to run the model and collect results
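A Cachegrind run for one cache configuration might be assembled as below. This is a sketch under assumptions: the program name `./mpeg4_decoder` is hypothetical, the 4-way associativity for the last-level cache is assumed, and the slides do not show the actual command line. The `--I1`, `--D1` and `--LL` options take `size,associativity,line size` in bytes; `--cache-sim=yes` is included because recent Valgrind releases may not simulate caches by default.

```python
def cachegrind_command(program, d1, i1, ll):
    """Build the argv list for a Cachegrind run.

    Each cache is given as (size_kb, associativity, line_bytes).
    """
    def spec(cfg):
        size_kb, assoc, line_bytes = cfg
        return f"{size_kb * 1024},{assoc},{line_bytes}"

    return [
        "valgrind",
        "--tool=cachegrind",
        "--cache-sim=yes",
        f"--I1={spec(i1)}",
        f"--D1={spec(d1)}",
        f"--LL={spec(ll)}",
        program,
    ]

# 8+8 KB split CL1 and 128 KB CL2, 16 B lines; the decoder name is hypothetical.
cmd = cachegrind_command("./mpeg4_decoder", d1=(8, 4, 16), i1=(8, 4, 16), ll=(128, 4, 16))
print(" ".join(cmd))
```

The list can be passed to `subprocess.run` on a machine where Valgrind and the target program are available.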
![Page 25: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/25.jpg)
MPEG-4 Workload
Workload defines all possible operating scenarios and environmental conditions
Quality of the workload is important for simulation accuracy and completeness
In the simulation, D1, I1 and CL2 hit ratios are used to model the system
This data is obtained from Cachegrind and used by the VisualSim simulation model
![Page 26: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/26.jpg)
Level 1 Data and Instruction References

| D1 (KB) | I1 (KB) | CL2 (KB) | Line Size (B) | D1 Refs Total (K) | D1 Refs Miss (K) | I1 Refs Total (K) | I1 Refs Miss (K) | CL1 Refs D1 % | CL1 Refs I1 % |
|---|---|---|---|---|---|---|---|---|---|
| 8 | 8 | 128 | 16 | 18782 | 521 | 38758 | 512 | 33 | 67 |
| 16 | 16 | 512 | 32 | 18782 | 430 | 38758 | 106 | 33 | 67 |
| 32 | 32 | 2048 | 64 | 18782 | 403 | 38758 | 39 | 33 | 67 |

Different combinations of D1, I1 and CL2 are used
About 33% of references are data and 67% are instructions
As cache size and line size increase, the miss rate decreases
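The data/instruction split and the downward trend in miss rate can be recomputed from the reference counts in the table. This is a small sketch using only the tabulated values; the raw miss/reference ratios it prints are not necessarily the same figures as the calculated hit rates reported elsewhere in the deck.

```python
# (D1 KB, I1 KB, CL2 KB, line B, D1 refs K, D1 miss K, I1 refs K, I1 miss K)
rows = [
    (8, 8, 128, 16, 18782, 521, 38758, 512),
    (16, 16, 512, 32, 18782, 430, 38758, 106),
    (32, 32, 2048, 64, 18782, 403, 38758, 39),
]

def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole

d1_miss_rates = [percent(r[5], r[4]) for r in rows]
i1_miss_rates = [percent(r[7], r[6]) for r in rows]

# Share of level-1 references that are data vs. instructions.
d1_refs, i1_refs = rows[0][4], rows[0][6]
data_share = percent(d1_refs, d1_refs + i1_refs)
instr_share = percent(i1_refs, d1_refs + i1_refs)

for row, d1, i1 in zip(rows, d1_miss_rates, i1_miss_rates):
    print(f"{row[0]}+{row[1]} KB, {row[3]} B lines: D1 {d1:.2f}%, I1 {i1:.2f}%")
print(f"data {data_share:.0f}%, instructions {instr_share:.0f}%")
```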
![Page 27: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/27.jpg)
D1, I1 and CL2 hit ratios

| D1 (KB) | I1 (KB) | CL2 (KB) | Line Size (B) | CL1 Hits D1 % | CL1 Hits I1 % | CL2 Hits % |
|---|---|---|---|---|---|---|
| 8 | 8 | 128 | 16 | 95.0 | 98.0 | 99.3 |
| 16 | 16 | 512 | 32 | 96.4 | 98.6 | 99.9 |
| 32 | 32 | 2048 | 64 | 98.0 | 99.5 | 100 |

Calculated hit rates for various sizes of CL1 and CL2 caches
As cache size increases, hit rate increases
![Page 28: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/28.jpg)
Read and Write References

| CL2 Size (KB) | D1 Read Refs (K) | D1 Write Refs (K) | D1 Refs R % | D1 Refs W % |
|---|---|---|---|---|
| 32 | 12391 | 6391 | 67 | 33 |
| 128 | 12391 | 6391 | 67 | 33 |
| 512 | 12391 | 6391 | 67 | 33 |
| 2048 | 12391 | 6391 | 67 | 33 |

About 67% of references are reads and about 33% of references are writes
![Page 29: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/29.jpg)
Input Parameters

| Item | Value |
|---|---|
| CL1 Cache Sizes | 8+8 to 32+32 KB |
| CL2 Cache Sizes | 32 to 4096 KB |
| Line Size | 16 to 256 B |
| Associativity | 2-way to 16-way |
| Cache Levels | L1 and L2 |
| Simulation Time | 2000.0 simulation time units |
| Task Time | 1.0 simulation time units |
| Task Rate | Task Time * 0.4 |
| CPU Time | Task Time * 0.4 |
| Mem Time | Task Time * 0.4 |
| Bus Time | Mem Time * 0.4 |
| CL1 Cache Time | Mem Time * 0.2 |
| CL2 Cache Time | Mem Time * 0.4 |
| Main Memory Time | Task Time |
| Bus Queue Length | 300 |
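The timing parameters are all defined relative to Task Time, so their absolute values follow by substitution. The sketch below just evaluates the expressions from the table; the dictionary keys are hypothetical names, not VisualSim parameter identifiers.

```python
TASK_TIME = 1.0  # simulation time units, from the table

params = {
    "simulation_time": 2000.0,
    "task_rate": TASK_TIME * 0.4,
    "cpu_time": TASK_TIME * 0.4,
    "mem_time": TASK_TIME * 0.4,
    "main_memory_time": TASK_TIME,
    "bus_queue_length": 300,
}
# These three are defined relative to Mem Time rather than Task Time.
params["bus_time"] = params["mem_time"] * 0.4
params["cl1_cache_time"] = params["mem_time"] * 0.2
params["cl2_cache_time"] = params["mem_time"] * 0.4

for name, value in params.items():
    print(f"{name}: {value}")
```

With Task Time 1.0 this gives a Bus Time and CL2 Cache Time of about 0.16 and a CL1 Cache Time of about 0.08 time units, against 1.0 for main memory.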
![Page 30: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/30.jpg)
Assumptions
The dedicated bus between CL1 and CL2 introduces negligible delay compared to the bus connecting CL2 and memory
A write-back update policy is implemented, so the CPU is released immediately after CL1 is updated
Task time has been divided proportionally among CPU, main memory, bus, L1 and L2 cache
![Page 31: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/31.jpg)
Performance Metrics
2 performance metrics
Utilization
◦ CPU utilization is the ratio of the time the CPU spent computing to the time the CPU spent transferring bits and performing un-tarring and tarring functions
Transactions
◦ The total number of transactions performed is the total number of tasks performed by a component during simulation
![Page 32: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/32.jpg)
RESULTS
![Page 33: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/33.jpg)
Miss rate variation as CL1 size changes while CL2 size is kept constant
Not much benefit from using a CL1 larger than 8+8 KB
![Page 34: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/34.jpg)
Effect on miss rate of changing CL2 cache size
◦ From 32KB to 512KB, miss rate decreases slowly
◦ From 512KB to 2MB, miss rate decreases sharply
◦ From 2MB to 4MB, miss rate is almost unchanged
From a cost, space and complexity standpoint, a larger CL2 does not provide significant benefits
![Page 35: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/35.jpg)
For a smaller cache like D1, the miss rate starts decreasing (the hit rate starts increasing) as line size increases
Miss rates start increasing after a point called the 'cache pollution point'
◦ From 16 to 64B, a larger line size gives better spatial locality
◦ From 128B there is no improvement, as on a miss more data has to be read and written
![Page 36: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/36.jpg)
Miss rate decreases significantly when going from 2-way to 4-way
Not much improvement for 8-way and higher
![Page 37: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/37.jpg)
Total Transactions for different CL2 Sizes

| Component | 32K | 128K | 256K | 512K | 1M | 2M |
|---|---|---|---|---|---|---|
| CPU | 10K | 10K | 10K | 10K | 10K | 10K |
| CL1 | 10K | 10K | 10K | 10K | 10K | 10K |
| CL2 | 303 | 303 | 303 | 303 | 303 | 303 |
| Bus | 3 | 3 | 2 | 2 | 1 | 0 |
| MM | 3 | 3 | 2 | 2 | 1 | 0 |

CL1: 8+8 KB size, 16B line size, 4-way set associativity
CL2 size varied from 32KB to 4MB
CPU utilization and transactions collected
![Page 38: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/38.jpg)
Memory requests initiated by the CPU are referred to CL1, then to CL2, and finally unsuccessful requests go to main memory (MM)
MM transactions decrease with increase in CL2 size
All tasks initiated at the CPU are referred to CL1
Considering 10000 tasks: 3333 data and 6667 instructions
For D1 miss ratio 5% and I1 miss ratio 2%
◦ 168+135 = 303 go to CL2
For CL2 32KB, miss ratio 0.9%
◦ Only 3 tasks go to MM
For CL2 2MB+, miss ratio 0%
◦ No tasks go to MM
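The flow of tasks through the hierarchy can be checked with a few multiplications. This sketch uses the nominal figures from the slide (5% D1 misses, 2% I1 misses, 0.9% CL2 misses); the function name is hypothetical. The nominal products give about 300 tasks reaching CL2, slightly below the 168+135 = 303 reported from the simulation, a small difference the slides do not explain.

```python
def tasks_reaching_cl2(data_tasks, instr_tasks, d1_miss, i1_miss):
    """Expected number of tasks that miss in CL1 and are referred to CL2."""
    return data_tasks * d1_miss + instr_tasks * i1_miss

# Nominal expectation from the slide's ratios.
expected_cl2 = tasks_reaching_cl2(3333, 6667, d1_miss=0.05, i1_miss=0.02)

# Figures reported on the slide.
reported_cl2 = 168 + 135

# Tasks that also miss in CL2 and go on to main memory.
to_mm_32kb = reported_cl2 * 0.009  # CL2 = 32 KB, miss ratio 0.9%
to_mm_2mb = reported_cl2 * 0.0     # CL2 = 2 MB and larger, miss ratio 0%

print(round(expected_cl2), reported_cl2, round(to_mm_32kb), round(to_mm_2mb))
```

Rounded to whole tasks, 0.9% of 303 gives the 3 main-memory transactions shown for the 32 KB CL2, and a 0% miss ratio gives none.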
![Page 39: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/39.jpg)
CPU utilization decreases with increase in CL2 size
Between 512KB and 2MB the decrease is significant
For 128KB and smaller, or 4MB and bigger, the change is not significant
![Page 40: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/40.jpg)
CONCLUSION
Focused on enhancing MPEG-4 decoding using cache optimization for mobile devices
Used Cachegrind and VisualSim simulation tools
Optimized cache size, line size, associativity and cache levels
Simulated architecture consists of and 2 level cache
Collected references from Cachegrind to drive the VisualSim simulation model
Future Scope: Improve system performance further by using techniques like Selective Caching, Cache Locking, Scratch Memory, Data Recording
![Page 41: Cache Optimization for Mobile Devices Running Multimedia Applications](https://reader036.vdocument.in/reader036/viewer/2022062815/56816930550346895de07bef/html5/thumbnails/41.jpg)
QUESTIONS