A Data Cache with Dynamic Mapping
P. D'Alberto, A. Nicolau and A. Veidenbaum
ICS-UCI
Speaker
Paolo D’Alberto
Problem Introduction
● Blocked algorithms have good performance on average
  – Because they exploit temporal data locality
● For some input sets, data cache interference nullifies the cache locality
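As a concrete illustration (not from the talk), a minimal sketch of a blocked algorithm: a tiled matrix multiply in which each tile is reused while it is resident in the data cache. The block size `bs` is a hypothetical tuning parameter.

```python
def blocked_matmul(A, B, n, bs):
    """Multiply two n x n matrices tile by tile so each bs x bs
    block of A and B is reused while it stays in the data cache."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):              # tile rows of C
        for kk in range(0, n, bs):          # tiles along the shared dimension
            for jj in range(0, n, bs):      # tile columns of C
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a = A[i][k]         # reused across the inner j loop
                        for j in range(jj, min(jj + bs, n)):
                            C[i][j] += a * B[k][j]
    return C
```

The reuse inside each tile is what the blocking buys; the talk's point is that for unlucky matrix sizes, conflict misses between the tiles cancel this benefit.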
Problem Introduction, cont.
[Figure: Blocked Matrix Multiplication, 16KB 1-way data cache. Miss rate (%) vs. matrix size, comparing regular matrix multiply against dynamic matrix multiply]
Problem Introduction, cont.
● What if we remove the spikes?
  – The average performance improves
  – Execution time is predictable
● We can achieve our goal by:
  ● Software only
  ● Hardware only
  ● Both HW and SW
Related Work (Software)
● Data layout reorganization [Flajolet et al. 91]
  – Data are reorganized before and after the computation
● Data copy [Granston et al. 93]
  – Data are moved in memory during the computation
● Padding [Panda et al. 99]
● Computation reorganization [Pingali et al. 02]
  – e.g., tiling
Related Work (Hardware)
● Changing the cache mapping
  – Using a different cache mapping function [Gonzalez 97]
  – Increasing cache associativity [IA64]
  – Changing the cache size
● Bypassing the cache
  – No interference: data are not stored in the cache [MIPS R5K]
  – HW-driven prefetching
Related Work (HW-SW)
● Profiling
  – Hardware adaptation [UCI]
  – Software adaptation [Gatlin et al. 99]
● Prefetching [Jouppi et al.]
  – Mostly for latency hiding; also used for cache interference reduction
● Static analysis [Ghosh et al. 99 - CME]
  – e.g., compiler-driven data cache line adaptation [UCI]
Dynamic Mapping (Software)
● We consider applications whose memory references are affine functions only
● We associate each memory reference with a twin affine function
● We use the twin function's result as the input address for the target data cache
● We use the original affine function to access memory
Example of a twin function
● We consider the references A[i][j] and B[i][j]
  – The affine functions are:
    ● A0 + (i*N + j)*4
    ● B0 + (i*N + j)*4
● When there is interference (i.e., |A0 - B0| mod C < L, where C is the cache size and L the cache line size):
  – We use the twin functions
    ● A0 + (i*N + j)*4
    ● B0 + (i*N + j)*4 + L
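The interference test and the twin offset above can be sketched as follows. This is a minimal model (hypothetical helper names; cache parameters assume the talk's 16KB direct-mapped cache with an assumed 32-byte line):

```python
C = 16 * 1024   # cache size in bytes (16KB, 1-way, as in the talk)
L = 32          # cache line size in bytes (assumed)

def interferes(a0, b0, cache=C, line=L):
    """True when the two base addresses fall into the same (or an
    overlapping) cache line index: |A0 - B0| mod C < L."""
    return abs(a0 - b0) % cache < line

def twin(b0, line=L):
    """Twin base address for B: shift its cache index by one line,
    leaving the real memory address used for the access unchanged."""
    return b0 + line

# Two arrays whose bases differ by exactly the cache size conflict
# on every access; the twin function separates their cache indices.
A0 = 0x10000
B0 = A0 + C
```

Only the cache-indexing address changes; loads still fetch from `B0 + (i*N + j)*4` in memory.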
Dynamic Mapping (Hardware)
● We introduce a new 3-address load instruction
  – A destination register
  – Two register operands: the results of the twin function and of the original affine function
● Note:
  – The twin function's result may not be a real address
  – The original function's result is a real address
    ● (and goes through the TLB - ACU)
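A toy software model of this load's semantics may help (my sketch, not the paper's hardware): the cache line index comes from the twin address, while the tag and the memory access use the real address. Cache parameters are assumed (16KB direct-mapped, 32-byte lines).

```python
LINE = 32                     # assumed cache line size in bytes
NLINES = 16 * 1024 // LINE    # 16KB direct-mapped cache

class TwinCache:
    """Direct-mapped cache where the index may come from a twin
    address while the tag always comes from the real address."""
    def __init__(self):
        self.tags = [None] * NLINES   # stores the real line address

    def access(self, real_addr, twin_addr=None):
        """Return True on hit. A plain 2-address load passes only
        real_addr; the 3-address load also passes twin_addr."""
        index_addr = twin_addr if twin_addr is not None else real_addr
        idx = (index_addr // LINE) % NLINES
        real_line = real_addr // LINE
        hit = self.tags[idx] == real_line
        if not hit:
            self.tags[idx] = real_line    # fill on miss
        return hit
```

Two conflicting streams (bases 16KB apart) thrash with ordinary loads but coexist once one stream indexes through its twin address, which is exactly the interference removal the talk targets.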
Pseudo Assembly Code

ORIGINAL CODE
Set  $R0, A_0
Set  $R1, B_0
…
Load $F0, $R0
Load $F1, $R1
Add  $R0, $R0, 4
Add  $R1, $R1, 4

MODIFIED CODE
Set  $R0, A_0
Set  $R1, B_0
Add  $R2, $R1, 32
…
Load $F0, $R0
Load $F1, $R1, $R2
Add  $R2, $R2, 4
Add  $R0, $R0, 4
Add  $R1, $R1, 4
…
Experimental Results
● We present experimental results obtained by using a combination of software approaches:
  – Padding
  – Data copy
  – Without using any cycle-accurate simulator
● Matrix multiplication:
  – Simulation of cache performance for a 16KB 1-way data cache and an optimally blocked algorithm
Matrix Multiply (simulation)

[Figure: Blocked Matrix Multiplication, 16KB 1-way data cache. Miss rate (%) vs. matrix size, comparing regular matrix multiply against dynamic matrix multiply]
Experimental Results, cont.
● n-point FFT, Cooley-Tukey algorithm using a balanced decomposition in factors
  – The algorithm was first proposed by Vitter et al.
  – Complexity:
    ● Best case O(n log log n), worst case O(n^2)
  – Normalized performance (MFLOPS)
    ● We use the codelets from FFTW
    ● For a 128KB 4-way data cache
  – A performance comparison with FFTW is in the paper
FFT, 128KB 4-way data cache

[Figure: normalized MFLOPS vs. FFT size on a 100MHz SPARC64, for powers of two from 128 to 8388608 and non-power-of-two sizes from 1000 to 362880; series: Upper Bound, Dynamic Mapping, Base Line]
Future Work
● Dynamic Mapping is not fully automated
  – The code is written by hand
● A cycle-accurate processor simulator is missing
  – To estimate the effects of the twin function computations on performance and energy
● Application to a set of benchmarks
Conclusions
● The hardware is relatively simple
  – Because it is the compiler (or the user) that activates the twin computation
    ● and changes the data cache mapping dynamically
● The approach aims to achieve a data cache mapping with:
  – zero interference,
  – no increase in cache hit latency,
  – minimum extra hardware