OMS 2007, 21-Oct-15

A Model for the Effect of Caching on Algorithmic Efficiency in Radix based Sorting

Arne Maus and Stein Gjessing
Dept. of Informatics, University of Oslo, Norway


Page 1: Title

A Model for the Effect of Caching on Algorithmic Efficiency in Radix based Sorting

Arne Maus and Stein Gjessing
Dept. of Informatics, University of Oslo, Norway

Page 2: Overview

- Motivation: CPU versus memory speed; caches
- A cache test
- A simple model for the execution times of algorithms
  - Do theoretical cache tests carry over to real programs?
- A real example: three radix sorting algorithms compared
- Conclusion: the number of instructions executed is no longer a good measure for the performance of an algorithm

Page 3: The need for caches – the CPU-memory performance gap

[Figure: the CPU-memory performance gap, from John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers Inc., San Francisco, CA, 2003]

Page 4: A cache test – random vs. sequential access in large arrays

Both a and b are int arrays of length n (n = 100, 200, 400, ..., 97m).

Two test runs, with the same number of instructions performed:
- Random access: set b[i] = random(0..n-1). Each evaluation of the nested expression then makes 15 random accesses in b, 1 sequential access in b (the innermost b[i]), and 1 random access in a.
- Sequential access: set b[i] = i. Then b[b[...b[i]...]] = i, and each evaluation makes 16 sequential accesses in b and 1 in a.

for (int i = 0; i < n; i++)
    a[b[b[b[b[b[b[b[b[b[b[b[b[b[b[b[b[i]]]]]]]]]]]]]]]]] = i;
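The test loop above can be packaged as a small self-contained Java sketch. The class name, array size, and timing code are illustrative additions, not from the slides; with the random mapping the pass is dominated by cache misses once n exceeds the cache sizes:

```java
import java.util.Random;

public class CacheTest {
    // Time one pass of the 16-level indirection a[b[b[...b[i]...]]] = i.
    static long timePass(int[] a, int[] b) {
        long t0 = System.nanoTime();
        int n = a.length;
        for (int i = 0; i < n; i++)
            a[b[b[b[b[b[b[b[b[b[b[b[b[b[b[b[b[i]]]]]]]]]]]]]]]]] = i;
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        int n = 1 << 20;                      // large enough to exceed L1/L2 on most CPUs
        int[] a = new int[n], b = new int[n];

        // Sequential variant: b[i] = i, so every lookup reduces to b[i] -> i.
        for (int i = 0; i < n; i++) b[i] = i;
        long seq = timePass(a, b);

        // Random variant: b maps each index to a random place in [0, n).
        Random rnd = new Random(42);
        for (int i = 0; i < n; i++) b[i] = rnd.nextInt(n);
        long ran = timePass(a, b);

        System.out.println("sequential ns: " + seq + ", random ns: " + ran);
    }
}
```

On machines like those in the figure, the random variant should come out markedly slower for large n, even though both passes execute the same number of instructions.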

Page 5: Random vs. sequential read in arrays [0:n-1]

[Figure: ratio "random read times slower" (0-70) against n on a log scale (10 to 100 000 000), for four CPUs: AMD Opteron 254 2.8 GHz, Intel Xeon 2.8 GHz, Intel Core Duo U2500 1.16 GHz, UltraSparc III 1.1 GHz. Marked on the curves: the start of cache misses from L1 to L2, and from L2 to memory.]

Random vs. sequential access times, with the same number of instructions performed: cache misses slow random access down by a factor of 50-60 (all 4 CPUs).

Page 6: Why a slowdown of 50-60 and not a factor of 400?

Patterson and Hennessy suggest a slowdown factor of 400; the test shows 50 to 60 – why?

Answer: every array access in Java is checked against the lower and upper array limits – roughly:
- load array index
- compare with zero (lower limit)
- load upper limit
- compare index and upper limit
- load array base address
- load/store the array element (= possible cache miss)

That is 5 operations that hit in cache plus one possible cache miss, so the average is (5 + 400)/6 ≈ 67.

Page 7: A simple model for the execution time of a program

From the figure for random access, we see an asymptotic slowdown factor of:
- 1 if n < L1
- 4 if L1 < n < L2
- 50 if L2 < n

The access time TR for one random read or write is then:

TR = 1 · Pr(access in L1) + 4 · Pr(access in L2) + 50 · Pr(access in memory)
   ( = 1 · L1/n + 4 · L2/n + 50 · (n - L2)/n, when n > L2 )

The cost of a sequential read or write is set to 1, and we can then estimate the total execution time as the weighted sum over all loop accesses.

For every loop in the program:
- count the number of sequential accesses
- count the number of random accesses, and the number of places n over which the randomly accessed object (array) is spread
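The TR formula can be checked with a few lines of Java. The L1 and L2 sizes below are hypothetical values (in array elements) chosen for illustration; the slowdown factors 1, 4 and 50 are the ones read off the figure, with the middle region filled in by analogy with the slide's n > L2 formula:

```java
public class CacheCostModel {
    // Hypothetical cache sizes in array elements (illustrative, not from the slides).
    static final double L1 = 16_000;
    static final double L2 = 250_000;

    // Cost TR of one random access into an array spread over n places,
    // relative to a sequential access (cost 1), using the factors 1/4/50.
    static double randomAccessCost(double n) {
        if (n <= L1) return 1.0;                              // everything fits in L1
        if (n <= L2) return (1.0 * L1 + 4.0 * (n - L1)) / n;  // L1 hits plus L2 accesses
        return (1.0 * L1 + 4.0 * L2 + 50.0 * (n - L2)) / n;   // the slide's formula
    }

    public static void main(String[] args) {
        for (double n : new double[]{1e3, 1e5, 1e7})
            System.out.printf("n = %.0e: TR = %.2f%n", n, randomAccessCost(n));
    }
}
```

As n grows far beyond L2, TR approaches the asymptotic factor of 50.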

[Figure: the random vs. sequential read plot from Page 5 repeated, with the L1 and L2 cache sizes marked on the n axis.]

Page 8: Applying the model to Radix sorting – the test

Three radix algorithms:
- radix1, sorting the array in one pass with one 'large' digit
- radix2, sorting the array in two passes with two half-sized digits
- radix3, sorting the array in three passes with three 'small' digits

radix3 performs almost three times as many instructions as radix1 – should it then be almost 3 times as slow as radix1?

radix2 performs almost twice as many instructions as radix1 – should it then be almost 2 times as slow as radix1?

Page 9: Base: the Right Radix sorting algorithm

static void radixSort(int[] a, int[] b, int left, int right, int maskLen, int shift) {
    int acumVal = 0, j, n = right - left + 1;
    int mask = (1 << maskLen) - 1;
    int[] count = new int[mask + 1];

    // a) count the frequency of each radix value in a
    for (int i = left; i <= right; i++)
        count[(a[i] >> shift) & mask]++;

    // b) add up in 'count' - accumulated values
    for (int i = 0; i <= mask; i++) {
        j = count[i]; count[i] = acumVal; acumVal += j;
    }

    // c) move the numbers in sorted order from a to b
    for (int i = 0; i < n; i++)
        b[count[(a[i + left] >> shift) & mask]++] = a[i + left];

    // d) copy b back to a
    for (int i = 0; i < n; i++)
        a[i + left] = b[i];
}

One pass over array a with one sorting digit of width maskLen, shifted shift bits up.

Page 10: Radix sort with 1, 2 and 3 digits = 1, 2 and 3 passes

static void radix1(int[] a, int left, int right) {   // 1-digit radixSort: a[left..right]
    int max = 0, numBit = 1, n = right - left + 1;
    for (int i = left; i <= right; i++)
        if (a[i] > max) max = a[i];
    while (max >= (1 << numBit)) numBit++;
    int[] b = new int[n];
    radixSort(a, b, left, right, numBit, 0);
}

static void radix3(int[] a, int left, int right) {   // 3-digit radixSort: a[left..right]
    int max = 0, numBit = 3, n = right - left + 1;
    for (int i = left; i <= right; i++)
        if (a[i] > max) max = a[i];
    while (max >= (1 << numBit)) numBit++;
    int bit1 = numBit / 3, bit2 = bit1, bit3 = numBit - (bit1 + bit2);
    int[] b = new int[n];
    radixSort(a, b, left, right, bit1, 0);
    radixSort(a, b, left, right, bit2, bit1);
    radixSort(a, b, left, right, bit3, bit1 + bit2);
}
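For a runnable check, the routines above can be combined into one self-contained class (radix2 follows the same pattern with two half-width digits). The class name and the demo in main are illustrative additions that verify the result against Arrays.sort:

```java
import java.util.Arrays;
import java.util.Random;

public class RadixDemo {
    static void radixSort(int[] a, int[] b, int left, int right, int maskLen, int shift) {
        int acumVal = 0, j, n = right - left + 1;
        int mask = (1 << maskLen) - 1;
        int[] count = new int[mask + 1];
        // count frequencies, accumulate, move a to b, copy b back to a
        for (int i = left; i <= right; i++) count[(a[i] >> shift) & mask]++;
        for (int i = 0; i <= mask; i++) { j = count[i]; count[i] = acumVal; acumVal += j; }
        for (int i = 0; i < n; i++) b[count[(a[i + left] >> shift) & mask]++] = a[i + left];
        for (int i = 0; i < n; i++) a[i + left] = b[i];
    }

    static void radix3(int[] a, int left, int right) {
        int max = 0, numBit = 3, n = right - left + 1;
        for (int i = left; i <= right; i++) if (a[i] > max) max = a[i];
        while (max >= (1 << numBit)) numBit++;
        int bit1 = numBit / 3, bit2 = bit1, bit3 = numBit - (bit1 + bit2);
        int[] b = new int[n];
        radixSort(a, b, left, right, bit1, 0);
        radixSort(a, b, left, right, bit2, bit1);
        radixSort(a, b, left, right, bit3, bit1 + bit2);
    }

    public static void main(String[] args) {
        Random rnd = new Random(1);
        int n = 100_000;
        int[] a = new int[n];
        for (int i = 0; i < n; i++) a[i] = rnd.nextInt(n);   // uniform U(0:n-1), as in the tests
        int[] expected = a.clone();
        Arrays.sort(expected);
        radix3(a, 0, n - 1);
        System.out.println(Arrays.equals(a, expected) ? "sorted correctly" : "FAILED");
    }
}
```

Each pass is a stable counting sort on one digit, so sorting the digits from least to most significant yields a fully sorted array.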

Page 11: Random/sequential test (AMD Opteron) – Radix 1, 2 and 3 compared with Quicksort and Flashsort

[Figure: Opteron 254, 2.8 GHz, uniform U(0:n-1) distribution. Performance relative to Quicksort (= 1), y-axis 0-1.8, against n on a log scale (10 to 100 000 000). Curves: Java Arrays.sort (Quick), RRadix 1-pass, RRadix 2-pass, RRadix 3-pass, FlashSort 1-pass. Annotations: radix1 slowed down by a factor of 7; radix3 shows no slowdown; the slowdown of radix2 has started.]

Page 12: Random/sequential test (Intel Xeon) – Radix 1, 2 and 3 compared with Quicksort and Flashsort

[Figure: Xeon, 2.8 GHz, uniform U(0:n-1) distribution. Performance relative to Quicksort (= 1), y-axis 0-3, against n on a log scale (10 to 100 000 000). Curves: Java Arrays.sort (Quick), RRadix 1-pass, RRadix 2-pass, RRadix 3-pass, FlashSort 1-pass.]

Page 13: The model – careful counting of the loops in radix1, 2 and 3

Let Ek denote the number of the different operations for a k-pass radix algorithm (k = 1, 2, ...), let S denote a sequential read or write, and let Rk denote a random read or write spread over m different places in an array, where:

m = 2^(log2(n)/k) = n^(1/k)

After some simplification, the operation counts per element are:

E1 = 10·S + 3·R1
E2 = 15·S + 6·R2
E3 = 22·S + 9·R3

and (after some more simplification) the model predicts the ratios E2/E1 and E3/E1 shown on the next slide.

Page 14: Model vs. test results (Opteron and Xeon)

Test – Opteron      n = 100    n = 52m
R2 / R1             1.44       0.31
R3 / R1             1.77       0.25

Test – Xeon         n = 100    n = 52m
R2 / R1             1.28       0.43
R3 / R1             1.74       0.18

Model
n small (n < L1):   R1 = R2 = R3 = S
n large (n >> L2):  R1 = 10·R2 = 50·R3 = 50·S

E2 / E1:  21/13 = 1.61    (15 + 6·10)/(10·1 + 3·50) = 0.46
E3 / E1:  31/13 = 2.38    31/(10·1 + 3·50) = 0.19
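The model column reduces to simple arithmetic. The sketch below recomputes it from operation counts E1 = 10S + 3R1, E2 = 15S + 6R2, E3 = 22S + 9R3, which are reconstructed from the table's numbers rather than stated explicitly on the slides:

```java
public class ModelRatios {
    public static void main(String[] args) {
        // Small n (n < L1): every access hits the cache, so S = R1 = R2 = R3 = 1.
        double e1 = 10 + 3 * 1, e2 = 15 + 6 * 1, e3 = 22 + 9 * 1;
        System.out.printf("small n: E2/E1 = %.2f, E3/E1 = %.2f%n", e2 / e1, e3 / e1);

        // Large n (n >> L2): the table's formulas use R1 = 50, R2 = 10, R3 = 1 (in units of S).
        e1 = 10 + 3 * 50; e2 = 15 + 6 * 10; e3 = 22 + 9 * 1;
        System.out.printf("large n: E2/E1 = %.2f, E3/E1 = %.2f%n", e2 / e1, e3 / e1);
    }
}
```

Up to rounding this reproduces the model column: 1.61 and 2.38 for small n, 0.46 and 0.19 for large n.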

Page 15: Conclusions

The effects of cache misses are real and show up in ordinary user algorithms when doing random access in large arrays.

We have demonstrated that radix3, which performs almost 3 times as many instructions as radix1, is 4-5 times as fast as radix1 for large n – i.e. radix1 experiences a slowdown of a factor of 7-10 because of cache misses.

1. The number of instructions executed is no longer a good measure for the performance of an algorithm.
2. Algorithms should be rewritten so that random access in large data structures is removed.