Apr 21, 2023 OMS 2007
A Model for the Effect of Caching on Algorithmic Efficiency in Radix based Sorting
Arne Maus and Stein Gjessing
Dept. of Informatics,
University of Oslo, Norway
Overview
Motivation: CPU versus memory speed
– Caches
A cache test
A simple model for the execution times of algorithms
– Do theoretical cache tests carry over to real programs?
A real example: three Radix sorting algorithms compared
The number of instructions executed is no longer a good measure for the performance of an algorithm
The need for caches, the CPU-Memory performance gap
From: John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers Inc., San Francisco, CA, 2003
A cache test: random vs. sequential access in large arrays
Both a and b are of length n (n = 100, 200, 400, ..., 97m)
2 test runs, with the same number of instructions performed:
– Random access: set b[i] = random(0..n-1). We will get 15 random accesses in b and 1 in a, and 1 sequential access in b (the innermost)
– Sequential access: set b[i] = i. Then b[b[.....b[i]....]] = i, and we will get 16 sequential accesses in b and 1 in a
for (int i= 0; i < n; i++) a[b[b[b[b[b[b[b[b[b[b[b[b[b[b[b[b[i]]]]]]]]]]]]]]]]] = i;
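The test loop above can be wrapped in a small harness along the following lines. This is a minimal sketch, not the authors' test program: the class name `CacheChaseSketch`, the helper names, the array size and the single warm-up run are all assumptions made for illustration, and timings from such a simple harness are only indicative.

```java
import java.util.Random;

class CacheChaseSketch {
    // Sequential case: b[i] = i, so every nested lookup stays at index i.
    static int[] sequentialIndex(int n) {
        int[] b = new int[n];
        for (int i = 0; i < n; i++) b[i] = i;
        return b;
    }

    // Random case: b[i] = random(0..n-1), so each nested lookup jumps elsewhere in b.
    static int[] randomIndex(int n, long seed) {
        Random rnd = new Random(seed);
        int[] b = new int[n];
        for (int i = 0; i < n; i++) b[i] = rnd.nextInt(n);
        return b;
    }

    // The 16-level nested access from the slide; returns elapsed nanoseconds.
    static long chase(int[] a, int[] b) {
        int n = a.length;
        long start = System.nanoTime();
        for (int i = 0; i < n; i++)
            a[b[b[b[b[b[b[b[b[b[b[b[b[b[b[b[b[i]]]]]]]]]]]]]]]]] = i;
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int n = 1 << 22;                  // hypothetical size, picked to exceed typical caches
        int[] a = new int[n];
        chase(a, sequentialIndex(n));     // warm-up run so the loop is likely JIT-compiled
        long seq = chase(a, sequentialIndex(n));
        long rnd = chase(a, randomIndex(n, 42L));
        System.out.printf("n=%d  random/sequential = %.1f%n", n, (double) rnd / seq);
    }
}
```

Both runs execute the same loop body, so any difference in the printed ratio comes from where the accesses land in memory rather than from the instruction count.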
Random vs. sequential read in arrays [0:n-1]
[Figure: random read, times slower than sequential read (y axis, 0 to 70), vs. n (log scale, 10 to 100 000 000). Series: AMD Opteron254 2.8GHz, Intel Xeon 2.8 GHz, Intel Core Duo U2500 1.16GHz, UltraSparc III 1.1 GHz. Annotations: start of cache miss from L1 to L2; start of cache miss from L2 to memory.]
Random vs. sequential access times, the same number of instructions performed. Cache misses slow random access down by a factor of 50 – 60 (4 CPUs)
Why a slowdown of 50-60 and not a factor of 400?
Patterson and Hennessy suggest a slowdown factor of 400, but the test shows 50 to 60. Why?
Answer: Every array access in Java is checked against the lower and upper array limits, say:
– load array index
– compare with zero (lower limit)
– load upper limit
– compare index and upper limit
– load array base address
– load/store array element ( = possible cache miss)
We see 5 cache hit operations + one cache miss, so the average = (5 + 400)/6 ≈ 67
A simple model for the execution time of a program
For every loop in the program:
– Count the number of sequential references
– Count the number of random accesses and the number of places n in which the randomly accessed object (array) is used
From the figure for random access, we see an asymptotic slowdown factor of:
1 if n < L1
4 if L1 < n < L2
50 if L2 < n
The access time TR for one random read or write is then:
TR = 1 * Pr(access in L1) + 4 * Pr(access in L2) + 50 * Pr(access in memory)
( = 1 * L1/n + 4 * L2/n + 50 * (n - L2)/n, when n > L2 )
The cost of a sequential read or write is set to 1, and we can then estimate the total execution time as the weighted sum over all loop accesses.
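The access-time formula and the weighted sum can be written down as a small calculator. This is a sketch under stated assumptions: the class and method names are hypothetical, the cache capacities are passed in as numbers of array elements, and the middle regime (L1 < n <= L2) is filled in by analogy because the slide only spells out the case n > L2.

```java
class AccessTimeModel {
    // Slowdown factors read off the figure: 1 (in L1), 4 (in L2), 50 (in memory).
    static final double COST_L1 = 1, COST_L2 = 4, COST_MEM = 50;

    // Expected cost TR of one random access in an array of n elements,
    // with cache capacities l1 and l2 expressed in array elements.
    static double randomAccessTime(double n, double l1, double l2) {
        if (n <= l1) return COST_L1;
        if (n <= l2) {
            // ASSUMPTION: not given on the slide, written by analogy with the n > L2 case.
            return COST_L1 * l1 / n + COST_L2 * (n - l1) / n;
        }
        // The case stated on the slide: 1*L1/n + 4*L2/n + 50*(n - L2)/n.
        return COST_L1 * l1 / n + COST_L2 * l2 / n + COST_MEM * (n - l2) / n;
    }

    // Weighted sum for one loop: each sequential access costs 1, each random access costs tr.
    static double loopTime(long iterations, int seqPerIter, int randPerIter, double tr) {
        return iterations * (seqPerIter + randPerIter * tr);
    }

    public static void main(String[] args) {
        double l1 = 1_000, l2 = 10_000;   // hypothetical capacities, in elements
        for (double n = 100; n <= 1e8; n *= 10)
            System.out.printf("n=%.0f  TR=%.2f%n", n, randomAccessTime(n, l1, l2));
    }
}
```

For large n the result approaches the memory factor of 50, which is the regime the radix comparison later relies on.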
[Figure: the plot "Random vs. sequential read in arrays [0:n-1]" (random read, times slower, vs. n on a log scale, for the same four CPUs), with the cache sizes L1 and L2 marked on the n axis.]
Applying the model to Radix sorting – the test
Three Radix algorithms:
– radix1, sorting the array in one pass with one ‘large’ digit
– radix2, sorting the array in two passes with two half-sized digits
– radix3, sorting the array in three passes with three ‘small’ digits
radix3 performs almost three times as many instructions as radix1
– should it be almost 3 times as slow as radix1?
radix2 performs almost twice as many instructions as radix1
– should it be almost 2 times as slow as radix1?
Base: Right Radix sorting algorithm
One pass of array a with one sorting digit of width maskLen (shifted shift bits up):
static void radixSort(int[] a, int[] b, int left, int right, int maskLen, int shift) {
    int acumVal = 0, j, n = right - left + 1;
    int mask = (1 << maskLen) - 1;
    int[] count = new int[mask + 1];

    // a) count = the frequency of each radix value in a
    for (int i = left; i <= right; i++)
        count[(a[i] >> shift) & mask]++;

    // b) Add up in 'count' - accumulated values
    for (int i = 0; i <= mask; i++) {
        j = count[i];
        count[i] = acumVal;
        acumVal += j;
    }

    // c) move numbers in sorted order a to b
    for (int i = 0; i < n; i++)
        b[count[(a[i + left] >> shift) & mask]++] = a[i + left];

    // d) copy back b to a
    for (int i = 0; i < n; i++)
        a[i + left] = b[i];
}
Radix sort with 1, 2 and 3 digits = 1,2 and 3 passes
// Both methods assume non-negative keys with max < 2^30, so that 1 << numBit does not overflow.
static void radix1(int[] a, int left, int right) {
    // 1 digit radixSort: a[left..right]
    int max = 0, numBit = 1, n = right - left + 1;

    // find the largest key and the number of bits needed to hold it
    for (int i = left; i <= right; i++)
        if (a[i] > max) max = a[i];
    while (max >= (1 << numBit)) numBit++;

    int[] b = new int[n];
    radixSort(a, b, left, right, numBit, 0);
}

static void radix3(int[] a, int left, int right) {
    // 3 digit radixSort: a[left..right]
    int max = 0, numBit = 3, n = right - left + 1;

    for (int i = left; i <= right; i++)
        if (a[i] > max) max = a[i];
    while (max >= (1 << numBit)) numBit++;

    // split the bits into three digits; the last digit takes the remainder
    int bit1 = numBit / 3, bit2 = bit1, bit3 = numBit - (bit1 + bit2);

    int[] b = new int[n];
    radixSort(a, b, left, right, bit1, 0);
    radixSort(a, b, left, right, bit2, bit1);
    radixSort(a, b, left, right, bit3, bit1 + bit2);
}
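The slides show radix1 and radix3 but not the two-pass variant. A radix2 could plausibly look like the sketch below, written by analogy with radix3: the even split of the bits into two digits, the class name `Radix2Sketch` and the correctness check in `main` are assumptions, not code taken from the presentation. The `radixSort` pass is repeated so that the example compiles on its own.

```java
import java.util.Arrays;
import java.util.Random;

class Radix2Sketch {
    // One counting-sort pass on the digit of width maskLen at bit position shift
    // (steps a-d of the radixSort pass from the slides).
    static void radixSort(int[] a, int[] b, int left, int right, int maskLen, int shift) {
        int acumVal = 0, j, n = right - left + 1;
        int mask = (1 << maskLen) - 1;
        int[] count = new int[mask + 1];
        for (int i = left; i <= right; i++) count[(a[i] >> shift) & mask]++;
        for (int i = 0; i <= mask; i++) { j = count[i]; count[i] = acumVal; acumVal += j; }
        for (int i = 0; i < n; i++) b[count[(a[i + left] >> shift) & mask]++] = a[i + left];
        for (int i = 0; i < n; i++) a[i + left] = b[i];
    }

    // ASSUMPTION: two-digit variant modelled on radix3; the second digit takes the remainder.
    static void radix2(int[] a, int left, int right) {
        int max = 0, numBit = 2, n = right - left + 1;
        for (int i = left; i <= right; i++) if (a[i] > max) max = a[i];
        while (max >= (1 << numBit)) numBit++;
        int bit1 = numBit / 2, bit2 = numBit - bit1;
        int[] b = new int[n];
        radixSort(a, b, left, right, bit1, 0);      // least significant digit first
        radixSort(a, b, left, right, bit2, bit1);   // then the most significant digit
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        Random rnd = new Random(1);
        int[] a = new int[n];
        for (int i = 0; i < n; i++) a[i] = rnd.nextInt(n);   // Uniform(0:n-1), as in the tests
        int[] expected = a.clone();
        Arrays.sort(expected);
        radix2(a, 0, n - 1);
        System.out.println("sorted correctly: " + Arrays.equals(a, expected));
    }
}
```

Sorting on the least significant digit first works because each pass is stable: keys that agree on the high digit keep the order established by the low-digit pass.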
Random/sequential test (AMD Opteron): Radix 1, 2 and 3 compared with Quicksort and Flashsort
[Figure: Opteron254 - 2.8Ghz - Uniform(0:n-1) distr. Relative performance to Quicksort (=1) on the y axis (0 to 1.8) vs. n (log scale, 10 to 100 000 000). Series: Java Arrays.sort (Quick), RRadix - 1pass, RRadix - 2pass, RRadix - 3pass, FlashSort - 1pass. Annotations: radix1 slowed down by a factor 7; radix2, slowdown started; radix3, no slowdown.]
Random/sequential test (Intel Xeon): Radix 1, 2 and 3 compared with Quicksort and Flashsort
[Figure: Xeon 2.8GHz - Uniform U(0:n-1) distr. Performance relative to Quicksort (=1) on the y axis (0 to 3) vs. n (log scale, 10 to 100 000 000). Series: Java Arrays.sort (Quick), RRadix - 1pass, RRadix - 2pass, RRadix - 3pass, FlashSort - 1pass.]
The model, careful counting of loops in radix1,2,3
Let Ek denote the number of the different operations for a k-pass radix algorithm (k = 1, 2, ...), S a sequential read or write, and Rk a random read or write in m different places in an array, where:
$m = 2^{(\log n)/k} = \sqrt[k]{n}$
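To get a feel for how quickly m shrinks with the number of passes, the digit widths can be computed directly. The sketch below assumes that m corresponds to the number of entries in the count array of one pass, and it generalises the radix3 bit split to any k; both points, as well as the class and method names, are illustrative assumptions.

```java
import java.util.Arrays;

class DigitWidthSketch {
    // Number of bits needed for the largest key, using the same loop as radix1.
    static int bitsFor(int max) {
        int numBit = 1;
        while (max >= (1 << numBit)) numBit++;
        return numBit;
    }

    // ASSUMPTION: generalises the radix3 split. The first k-1 digits get numBit/k bits,
    // the last digit takes whatever is left.
    static int[] digitWidths(int numBit, int k) {
        int[] w = new int[k];
        Arrays.fill(w, numBit / k);
        w[k - 1] = numBit - (k - 1) * (numBit / k);
        return w;
    }

    // Entries in the largest count array over all passes (the m of the model, under the assumption above).
    static int largestCountArray(int numBit, int k) {
        int widest = 0;
        for (int w : digitWidths(numBit, k)) widest = Math.max(widest, w);
        return 1 << widest;
    }

    public static void main(String[] args) {
        int numBit = bitsFor(52_000_000 - 1);   // keys from Uniform(0:n-1) with n = 52m
        for (int k = 1; k <= 3; k++)
            System.out.println("k=" + k + "  widths=" + Arrays.toString(digitWidths(numBit, k))
                    + "  largest count array=" + largestCountArray(numBit, k));
    }
}
```

With one pass the count array is as large as the key range itself, while with three passes it is a few thousand bytes, which is small enough to stay in a first-level cache on the machines tested.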
After some simplification:
and (+ some more simplifications):
Model vs. test results (Opteron and Xeon)
| | n = 100 | n = 52m |
|---|---|---|
| Test - Opteron | | |
| R2 / R1 | 1.44 | 0.31 |
| R3 / R1 | 1.77 | 0.25 |
| Model | n small: n < L1, R1 = R2= R3= S | n large: n >> L2, R1 = 10R2= 50R3= 50S |
| E2 / E1 | 21/13 = 1.61 | (15+6*10)/(10*1+3*50) = 0.46 |
| E3 / E1 | 31/13 = 2.38 | 31/(10*1+3*50) = 0.19 |
| Test - Xeon | | |
| R2 / R1 | 1.28 | 0.43 |
| R3 / R1 | 1.74 | 0.18 |
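The model rows can be recomputed with a few lines of code. The operation counts below are read off the arithmetic in the table (10 sequential plus 3 random accesses for radix1, 15 plus 6 for radix2, 31 in total for radix3), and the per-access costs are the ones that arithmetic plugs in for large n (50 for R1, 10 for R2, R3 equal to S). Treating them as E1 = 10S + 3R1 and E2 = 15S + 6R2 is an inference from the table, and the class and method names are hypothetical.

```java
class ModelRatioSketch {
    // ASSUMPTION: coefficients inferred from the table's arithmetic, with S as the unit cost.
    static double e1(double s, double r1) { return 10 * s + 3 * r1; }
    static double e2(double s, double r2) { return 15 * s + 6 * r2; }

    // The table only uses E3 with R3 = S, where it totals 31 operations;
    // how those 31 split into sequential and random accesses is not shown.
    static double e3WhenR3EqualsS(double s) { return 31 * s; }

    public static void main(String[] args) {
        double s = 1;
        // n small: every access hits L1, so R1 = R2 = R3 = S.
        System.out.printf("small n: E2/E1 = %.3f, E3/E1 = %.3f%n",
                e2(s, s) / e1(s, s), e3WhenR3EqualsS(s) / e1(s, s));
        // n large: costs as used in the table's arithmetic.
        System.out.printf("large n: E2/E1 = %.3f, E3/E1 = %.3f%n",
                e2(s, 10 * s) / e1(s, 50 * s), e3WhenR3EqualsS(s) / e1(s, 50 * s));
    }
}
```

Keeping the costs as parameters makes it easy to try other slowdown factors, for example the 4 and 50 measured in the cache test, and see how sensitive the predicted ratios are.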
Conclusions
The effects of cache misses are real and show up in ordinary user algorithms when doing random access in large arrays.
We have demonstrated that radix3, which performs almost 3 times as many instructions as radix1, is 4-5 times as fast as radix1 for large n, i.e. radix1 experiences a slowdown by a factor of 7-10 because of cache misses.
1. The number of instructions executed is no longer a good measure for the performance of an algorithm.
2. Algorithms should be rewritten such that random access in large data structures is removed.