OMS 2007, 21-Oct-15

A Model for the Effect of Caching on Algorithmic Efficiency in Radix based Sorting

Arne Maus and Stein Gjessing
Dept. of Informatics, University of Oslo, Norway


Page 1: Title

A Model for the Effect of Caching on Algorithmic Efficiency in Radix based Sorting

Arne Maus and Stein Gjessing
Dept. of Informatics, University of Oslo, Norway

Page 2: Overview

- Motivation: CPU versus memory speed; caches
- A cache test
- A simple model for the execution times of algorithms
  - Do theoretical cache tests carry over to real programs?
- A real example: three radix sorting algorithms compared
- Conclusion: the number of instructions executed is no longer a good measure for the performance of an algorithm

Page 3: The need for caches – the CPU-memory performance gap

[Figure: the CPU-memory performance gap, from John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers Inc., San Francisco, CA, 2003]

Page 4: A cache test – random vs. sequential access in large arrays

Both a and b are int arrays of length n (n = 100, 200, 400, ..., 97m).

Two test runs, with the same number of instructions performed:
- Random access: set b[i] = random(0..n-1). Each evaluation of the nested expression then makes 15 random accesses in b, 1 sequential access in b (the innermost b[i]), and 1 random access in a.
- Sequential access: set b[i] = i. Then b[b[...b[i]...]] = i, and each evaluation makes 16 sequential accesses in b and 1 in a.

for (int i = 0; i < n; i++)
    a[b[b[b[b[b[b[b[b[b[b[b[b[b[b[b[b[i]]]]]]]]]]]]]]]]] = i;
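The test loop above can be packaged as a small self-contained Java sketch. The class name, array size, and timing code are illustrative additions, not from the slides; with the random mapping the pass is dominated by cache misses once n exceeds the cache sizes:

```java
import java.util.Random;

public class CacheTest {
    // Time one pass of the 16-level indirection a[b[b[...b[i]...]]] = i.
    static long timePass(int[] a, int[] b) {
        long t0 = System.nanoTime();
        int n = a.length;
        for (int i = 0; i < n; i++)
            a[b[b[b[b[b[b[b[b[b[b[b[b[b[b[b[b[i]]]]]]]]]]]]]]]]] = i;
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        int n = 1 << 20;                      // large enough to exceed L1/L2 on most CPUs
        int[] a = new int[n], b = new int[n];

        // Sequential variant: b[i] = i, so every lookup reduces to b[i] -> i.
        for (int i = 0; i < n; i++) b[i] = i;
        long seq = timePass(a, b);

        // Random variant: b maps each index to a random place in [0, n).
        Random rnd = new Random(42);
        for (int i = 0; i < n; i++) b[i] = rnd.nextInt(n);
        long ran = timePass(a, b);

        System.out.println("sequential ns: " + seq + ", random ns: " + ran);
    }
}
```

On machines like those in the figure, the random variant should come out markedly slower for large n, even though both passes execute the same number of instructions.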

Page 5: Random vs. sequential read in arrays [0:n-1]

[Figure: ratio "random read times slower" (0-70) against n on a log scale (10 to 100 000 000), for four CPUs: AMD Opteron 254 2.8 GHz, Intel Xeon 2.8 GHz, Intel Core Duo U2500 1.16 GHz, UltraSparc III 1.1 GHz. Marked on the curves: the start of cache misses from L1 to L2, and from L2 to memory.]

Random vs. sequential access times, with the same number of instructions performed: cache misses slow random access down by a factor of 50-60 (all 4 CPUs).

Page 6: Why a slowdown of 50-60 and not a factor of 400?

Patterson and Hennessy suggest a slowdown factor of 400; the test shows 50 to 60 – why?

Answer: every array access in Java is checked against the lower and upper array limits – roughly:
- load array index
- compare with zero (lower limit)
- load upper limit
- compare index and upper limit
- load array base address
- load/store the array element (= possible cache miss)

That is 5 operations that hit in cache plus one possible cache miss, so the average is (5 + 400)/6 ≈ 67.

Page 7: A simple model for the execution time of a program

From the figure for random access, we see an asymptotic slowdown factor of:
- 1 if n < L1
- 4 if L1 < n < L2
- 50 if L2 < n

The access time TR for one random read or write is then:

TR = 1 · Pr(access in L1) + 4 · Pr(access in L2) + 50 · Pr(access in memory)
   ( = 1 · L1/n + 4 · L2/n + 50 · (n - L2)/n, when n > L2 )

The cost of a sequential read or write is set to 1, and we can then estimate the total execution time as the weighted sum over all loop accesses.

For every loop in the program:
- count the number of sequential accesses
- count the number of random accesses, and the number of places n over which the randomly accessed object (array) is spread
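The TR formula can be checked with a few lines of Java. The L1 and L2 sizes below are hypothetical values (in array elements) chosen for illustration; the slowdown factors 1, 4 and 50 are the ones read off the figure, with the middle region filled in by analogy with the slide's n > L2 formula:

```java
public class CacheCostModel {
    // Hypothetical cache sizes in array elements (illustrative, not from the slides).
    static final double L1 = 16_000;
    static final double L2 = 250_000;

    // Cost TR of one random access into an array spread over n places,
    // relative to a sequential access (cost 1), using the factors 1/4/50.
    static double randomAccessCost(double n) {
        if (n <= L1) return 1.0;                              // everything fits in L1
        if (n <= L2) return (1.0 * L1 + 4.0 * (n - L1)) / n;  // L1 hits plus L2 accesses
        return (1.0 * L1 + 4.0 * L2 + 50.0 * (n - L2)) / n;   // the slide's formula
    }

    public static void main(String[] args) {
        for (double n : new double[]{1e3, 1e5, 1e7})
            System.out.printf("n = %.0e: TR = %.2f%n", n, randomAccessCost(n));
    }
}
```

As n grows far beyond L2, TR approaches the asymptotic factor of 50.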

[Figure: the random vs. sequential read plot from Page 5 repeated, with the L1 and L2 cache sizes marked on the n axis.]

Page 8: Applying the model to Radix sorting – the test

Three radix algorithms:
- radix1, sorting the array in one pass with one 'large' digit
- radix2, sorting the array in two passes with two half-sized digits
- radix3, sorting the array in three passes with three 'small' digits

radix3 performs almost three times as many instructions as radix1 – should it then be almost 3 times as slow as radix1?

radix2 performs almost twice as many instructions as radix1 – should it then be almost 2 times as slow as radix1?

Page 9: Base: the Right Radix sorting algorithm

static void radixSort(int[] a, int[] b, int left, int right, int maskLen, int shift) {
    int acumVal = 0, j, n = right - left + 1;
    int mask = (1 << maskLen) - 1;
    int[] count = new int[mask + 1];

    // a) count the frequency of each radix value in a
    for (int i = left; i <= right; i++)
        count[(a[i] >> shift) & mask]++;

    // b) add up in 'count' - accumulated values
    for (int i = 0; i <= mask; i++) {
        j = count[i]; count[i] = acumVal; acumVal += j;
    }

    // c) move the numbers in sorted order from a to b
    for (int i = 0; i < n; i++)
        b[count[(a[i + left] >> shift) & mask]++] = a[i + left];

    // d) copy b back to a
    for (int i = 0; i < n; i++)
        a[i + left] = b[i];
}

One pass over array a with one sorting digit of width maskLen, shifted shift bits up.

Page 10: Radix sort with 1, 2 and 3 digits = 1, 2 and 3 passes

static void radix1(int[] a, int left, int right) {   // 1-digit radixSort: a[left..right]
    int max = 0, numBit = 1, n = right - left + 1;
    for (int i = left; i <= right; i++)
        if (a[i] > max) max = a[i];
    while (max >= (1 << numBit)) numBit++;
    int[] b = new int[n];
    radixSort(a, b, left, right, numBit, 0);
}

static void radix3(int[] a, int left, int right) {   // 3-digit radixSort: a[left..right]
    int max = 0, numBit = 3, n = right - left + 1;
    for (int i = left; i <= right; i++)
        if (a[i] > max) max = a[i];
    while (max >= (1 << numBit)) numBit++;
    int bit1 = numBit / 3, bit2 = bit1, bit3 = numBit - (bit1 + bit2);
    int[] b = new int[n];
    radixSort(a, b, left, right, bit1, 0);
    radixSort(a, b, left, right, bit2, bit1);
    radixSort(a, b, left, right, bit3, bit1 + bit2);
}
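For a runnable check, the routines above can be combined into one self-contained class (radix2 follows the same pattern with two half-width digits). The class name and the demo in main are illustrative additions that verify the result against Arrays.sort:

```java
import java.util.Arrays;
import java.util.Random;

public class RadixDemo {
    static void radixSort(int[] a, int[] b, int left, int right, int maskLen, int shift) {
        int acumVal = 0, j, n = right - left + 1;
        int mask = (1 << maskLen) - 1;
        int[] count = new int[mask + 1];
        // count frequencies, accumulate, move a to b, copy b back to a
        for (int i = left; i <= right; i++) count[(a[i] >> shift) & mask]++;
        for (int i = 0; i <= mask; i++) { j = count[i]; count[i] = acumVal; acumVal += j; }
        for (int i = 0; i < n; i++) b[count[(a[i + left] >> shift) & mask]++] = a[i + left];
        for (int i = 0; i < n; i++) a[i + left] = b[i];
    }

    static void radix3(int[] a, int left, int right) {
        int max = 0, numBit = 3, n = right - left + 1;
        for (int i = left; i <= right; i++) if (a[i] > max) max = a[i];
        while (max >= (1 << numBit)) numBit++;
        int bit1 = numBit / 3, bit2 = bit1, bit3 = numBit - (bit1 + bit2);
        int[] b = new int[n];
        radixSort(a, b, left, right, bit1, 0);
        radixSort(a, b, left, right, bit2, bit1);
        radixSort(a, b, left, right, bit3, bit1 + bit2);
    }

    public static void main(String[] args) {
        Random rnd = new Random(1);
        int n = 100_000;
        int[] a = new int[n];
        for (int i = 0; i < n; i++) a[i] = rnd.nextInt(n);   // uniform U(0:n-1), as in the tests
        int[] expected = a.clone();
        Arrays.sort(expected);
        radix3(a, 0, n - 1);
        System.out.println(Arrays.equals(a, expected) ? "sorted correctly" : "FAILED");
    }
}
```

Each pass is a stable counting sort on one digit, so sorting the digits from least to most significant yields a fully sorted array.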

Page 11: Random/sequential test (AMD Opteron) – Radix 1, 2 and 3 compared with Quicksort and Flashsort

[Figure: Opteron 254, 2.8 GHz, uniform U(0:n-1) distribution. Performance relative to Quicksort (= 1), y-axis 0-1.8, against n on a log scale (10 to 100 000 000). Curves: Java Arrays.sort (Quick), RRadix 1-pass, RRadix 2-pass, RRadix 3-pass, FlashSort 1-pass. Annotations: radix1 slowed down by a factor of 7; radix3 shows no slowdown; the slowdown of radix2 has started.]

Page 12: Random/sequential test (Intel Xeon) – Radix 1, 2 and 3 compared with Quicksort and Flashsort

[Figure: Xeon, 2.8 GHz, uniform U(0:n-1) distribution. Performance relative to Quicksort (= 1), y-axis 0-3, against n on a log scale (10 to 100 000 000). Curves: Java Arrays.sort (Quick), RRadix 1-pass, RRadix 2-pass, RRadix 3-pass, FlashSort 1-pass.]

Page 13: The model – careful counting of the loops in radix1, 2 and 3

Let Ek denote the number of the different operations for a k-pass radix algorithm (k = 1, 2, ...), let S denote a sequential read or write, and let Rk denote a random read or write spread over m different places in an array, where:

m = 2^(log2(n)/k) = n^(1/k)

After some simplification, the operation counts per element are:

E1 = 10·S + 3·R1
E2 = 15·S + 6·R2
E3 = 22·S + 9·R3

and (after some more simplification) the model predicts the ratios E2/E1 and E3/E1 shown on the next slide.

Page 14: Model vs. test results (Opteron and Xeon)

Test – Opteron      n = 100    n = 52m
R2 / R1             1.44       0.31
R3 / R1             1.77       0.25

Test – Xeon         n = 100    n = 52m
R2 / R1             1.28       0.43
R3 / R1             1.74       0.18

Model
n small (n < L1):   R1 = R2 = R3 = S
n large (n >> L2):  R1 = 10·R2 = 50·R3 = 50·S

E2 / E1:  21/13 = 1.61    (15 + 6·10)/(10·1 + 3·50) = 0.46
E3 / E1:  31/13 = 2.38    31/(10·1 + 3·50) = 0.19
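The model column reduces to simple arithmetic. The sketch below recomputes it from operation counts E1 = 10S + 3R1, E2 = 15S + 6R2, E3 = 22S + 9R3, which are reconstructed from the table's numbers rather than stated explicitly on the slides:

```java
public class ModelRatios {
    public static void main(String[] args) {
        // Small n (n < L1): every access hits the cache, so S = R1 = R2 = R3 = 1.
        double e1 = 10 + 3 * 1, e2 = 15 + 6 * 1, e3 = 22 + 9 * 1;
        System.out.printf("small n: E2/E1 = %.2f, E3/E1 = %.2f%n", e2 / e1, e3 / e1);

        // Large n (n >> L2): the table's formulas use R1 = 50, R2 = 10, R3 = 1 (in units of S).
        e1 = 10 + 3 * 50; e2 = 15 + 6 * 10; e3 = 22 + 9 * 1;
        System.out.printf("large n: E2/E1 = %.2f, E3/E1 = %.2f%n", e2 / e1, e3 / e1);
    }
}
```

Up to rounding this reproduces the model column: 1.61 and 2.38 for small n, 0.46 and 0.19 for large n.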

Page 15: Conclusions

The effects of cache misses are real and show up in ordinary user algorithms when doing random access in large arrays.

We have demonstrated that radix3, which performs almost 3 times as many instructions as radix1, is 4-5 times as fast as radix1 for large n – i.e. radix1 experiences a slowdown of a factor of 7-10 because of cache misses.

1. The number of instructions executed is no longer a good measure for the performance of an algorithm.
2. Algorithms should be rewritten so that random access in large data structures is removed.