
Post on 27-Dec-2015


Page 1: Software Data Prefetching Mohammad Al-Shurman & Amit Seth Instructor: Dr. Aleksandar Milenkovic Advanced Computer Architecture CPE631

Software Data Prefetching

Mohammad Al-Shurman & Amit Seth

Instructor: Dr. Aleksandar Milenkovic

Advanced Computer Architecture

CPE631

Page 2: Introduction

Processor-memory gap

Memory speed is the bottleneck in computer systems

At least 20% of stalls are D-cache stalls (Alpha)

A cache miss is expensive

Reduce cache misses by ensuring the data is already in L1

How?

Page 3: Data Prefetching

First appeared with multimedia applications using MMX technology or the SSE processor extension

Cache memory is designed for data with high temporal and spatial locality

Multimedia data has high spatial locality but low temporal locality

Page 4: Data Prefetching (cont'd)

Idea: bring data closer to the processor before it is actually needed

Advantages:

No extra hardware is needed (implemented in software)

Mitigates the memory latency problem

Disadvantages:

Increases code size

Page 5: Example

//Before prefetching
for (i=0; i<N; i++) {
  sum += A[i];
}

//After prefetching
for (i=0; i<N; i++) {
  _mm_prefetch((const char *)&A[i+1], _MM_HINT_NTA);
  sum += A[i];
}
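A self-contained version of the prefetching loop might look like the following sketch. It assumes an x86 compiler that provides `<xmmintrin.h>`; the function name `sum_with_prefetch`, the `float` element type, and the `PREFETCH_NTA` wrapper macro are illustrative choices, not part of the original slides. On targets without SSE the macro falls back to a no-op so the code still compiles.

```c
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_NTA */
#define PREFETCH_NTA(p) _mm_prefetch((const char *)(p), _MM_HINT_NTA)
#else
#define PREFETCH_NTA(p) ((void)(p))  /* assumption: no SSE, so do nothing */
#endif

/* Sum n floats, hinting the next element into cache on every iteration. */
float sum_with_prefetch(const float *A, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        /* &A[i + 1] is at most one past the end; it is never dereferenced. */
        PREFETCH_NTA(&A[i + 1]);
        sum += A[i];
    }
    return sum;
}
```

Because a prefetch is only a hint, the result is the same with or without it; only the timing can differ.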

Page 6: Properties

The prefetch instruction loads one cache line from main memory into cache memory

The processor must continue execution while the prefetch is in progress

The cache must support hits while prefetching occurs

Prefetching decreases the miss ratio

A prefetch is ignored if the requested data is already in the cache

Page 7: Prefetching Instructions

The temporal instructions:

prefetcht0: fetch data into all cache levels, that is, to L1 and L2 on Pentium III processors

prefetcht1: fetch data into all cache levels except the 0th level, that is, to L2 only on Pentium III processors

prefetcht2: fetch data into all cache levels except the 0th and 1st levels, that is, to L2 only on Pentium III processors

The non-temporal instruction:

prefetchnta: fetch data into the location closest to the processor, minimizing cache pollution. On the Pentium® III processor, this is the L1 cache.
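From C, these four instructions are normally reached through the `_mm_prefetch` intrinsic with one of the hint constants `_MM_HINT_T0`, `_MM_HINT_T1`, `_MM_HINT_T2`, or `_MM_HINT_NTA`. The hint has to be a compile-time constant, so the sketch below selects it with a `switch`. The enum `prefetch_kind` and the function `prefetch_line` are hypothetical names introduced for illustration, and the non-SSE fallback is an assumption.

```c
#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>  /* _mm_prefetch and the _MM_HINT_* constants */
#define HAVE_SSE_PREFETCH 1
#else
#define HAVE_SSE_PREFETCH 0  /* assumption: no SSE, nothing is issued */
#endif

/* Hypothetical selector for the four prefetch flavours. */
enum prefetch_kind { PF_T0, PF_T1, PF_T2, PF_NTA };

/* Issue one prefetch hint for the cache line containing p.
   Returns 1 if a prefetch was issued, 0 otherwise. */
int prefetch_line(const void *p, enum prefetch_kind kind)
{
#if HAVE_SSE_PREFETCH
    switch (kind) {
    case PF_T0:  _mm_prefetch((const char *)p, _MM_HINT_T0);  return 1;
    case PF_T1:  _mm_prefetch((const char *)p, _MM_HINT_T1);  return 1;
    case PF_T2:  _mm_prefetch((const char *)p, _MM_HINT_T2);  return 1;
    case PF_NTA: _mm_prefetch((const char *)p, _MM_HINT_NTA); return 1;
    }
    return 0;  /* unknown kind: issue nothing */
#else
    (void)p;
    (void)kind;
    return 0;
#endif
}
```

A call such as `prefetch_line(&A[i + 16], PF_T0)` requests the line into every cache level, while `PF_NTA` asks for minimal cache pollution.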

Page 8: Prefetching Guidelines

Prefetch scheduling distance: what is the next data to prefetch?

Minimize the number of prefetches: optimize execution time

Mix prefetch with computation instructions: minimize code size and cache stalls
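One common rule of thumb (an assumption here, not a formula taken from the slides) is to prefetch far enough ahead that the memory latency is covered by loop work: the distance in iterations is roughly the latency divided by the cycles per iteration, rounded up. The sketch below computes that distance and issues one prefetch per cache line instead of one per element. The 64-byte line size and all function names are hypothetical; real latency and line-size values have to be taken from the target machine.

```c
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>
#define PREFETCH_NTA(p) _mm_prefetch((const char *)(p), _MM_HINT_NTA)
#else
#define PREFETCH_NTA(p) ((void)(p))  /* assumption: no SSE, so do nothing */
#endif

#define LINE_BYTES 64  /* assumed cache-line size; check the target CPU */

/* Iterations to look ahead so that a line arrives before it is used:
   ceil(latency_cycles / cycles_per_iter). */
size_t prefetch_distance(size_t latency_cycles, size_t cycles_per_iter)
{
    if (cycles_per_iter == 0)
        return 0;
    return (latency_cycles + cycles_per_iter - 1) / cycles_per_iter;
}

/* Sum n floats, prefetching dist elements ahead, once per cache line. */
float sum_prefetch_per_line(const float *A, size_t n, size_t dist)
{
    const size_t per_line = LINE_BYTES / sizeof(float);
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        /* One hint per line's worth of elements, and only inside the array. */
        if (i % per_line == 0 && dist < n - i)
            PREFETCH_NTA(&A[i + dist]);
        sum += A[i];
    }
    return sum;
}
```

With a made-up latency of 100 cycles and 8 cycles per iteration, `prefetch_distance(100, 8)` gives 13 iterations of look-ahead.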

Page 9: Important Notice

Prefetching can be harmful if the loop is small

Combining prefetching with loop unrolling may improve the application's execution time

A prefetch cannot cause an exception: if we fetch beyond the end of the array, the call is ignored

Page 10: Support

Check whether the processor supports the SSE extension (using the CPUID instruction):

  mov eax, 1            ; request feature flags
  cpuid                 ; cpuid instruction
  test EDX, 002000000h  ; is bit 25 of the feature flags equal to 1?
  jnz Found

We used the Intel compiler in our simulation:

It has a built-in macro for prefetching

It supports loop unrolling
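The same feature check can be made from C. The sketch below assumes GCC or Clang on x86, where `<cpuid.h>` provides the `__get_cpuid` helper; on any other target it simply reports that SSE is unavailable. The function name `cpu_has_sse` is hypothetical.

```c
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>  /* __get_cpuid (GCC/Clang helper) */
#define CAN_QUERY_CPUID 1
#else
#define CAN_QUERY_CPUID 0  /* assumption: treat other targets as "no SSE" */
#endif

/* Return 1 if CPUID leaf 1 reports SSE (EDX bit 25), otherwise 0. */
int cpu_has_sse(void)
{
#if CAN_QUERY_CPUID
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;  /* leaf 1 is not supported */
    return (edx & (1u << 25)) != 0;
#else
    return 0;
#endif
}
```

A program could call `cpu_has_sse()` once at start-up and choose between a prefetching and a plain version of its loops.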

Page 11: Loop Unrolling

Idea: test the performance of code that combines data prefetching and loop unrolling

Advantages:

Unrolling reduces branch overhead, since it eliminates branches

Unrolling allows you to aggressively schedule the loop to hide latencies

Disadvantages:

Excessive unrolling, or unrolling of very large loops, can lead to increased code size

Page 12: Implementation of Loop Unrolling

//Prefetch without unroll
for (i=0; i<N; i++) {
  _mm_prefetch((const char *)&A[i+1], _MM_HINT_NTA);
  sum += A[i];
}

//Prefetch with unroll
#pragma unroll (1)
for (i=0; i<N; i++) {
  _mm_prefetch((const char *)&A[i+1], _MM_HINT_NTA);
  sum += A[i];
}
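What a compiler does in response to an unroll pragma can also be written out by hand. The sketch below unrolls the summation by four and issues a single prefetch per unrolled iteration. The unroll factor of 4, the look-ahead of 16 elements, and the function name are assumptions chosen for illustration, not the settings used in the slides.

```c
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>
#define PREFETCH_NTA(p) _mm_prefetch((const char *)(p), _MM_HINT_NTA)
#else
#define PREFETCH_NTA(p) ((void)(p))  /* assumption: no SSE, so do nothing */
#endif

#define AHEAD 16  /* assumed look-ahead, in elements */

/* Sum n floats with a 4-way hand-unrolled loop and one prefetch per pass. */
float sum_unrolled_prefetch(const float *A, size_t n)
{
    /* Four independent accumulators let the additions overlap. */
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        if (AHEAD < n - i)               /* stay inside the array */
            PREFETCH_NTA(&A[i + AHEAD]);
        s0 += A[i];
        s1 += A[i + 1];
        s2 += A[i + 2];
        s3 += A[i + 3];
    }
    for (; i < n; i++)                   /* leftover 0 to 3 elements */
        s0 += A[i];

    return (s0 + s1) + (s2 + s3);
}
```

The unrolled body executes one loop branch and one prefetch for every four additions, which is the overhead reduction that unrolling is meant to provide. Note that splitting the sum across accumulators changes the order of floating-point additions, so results can differ in the last bits from the simple loop.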

Page 13: Simulation

We simulate a simple addition loop:

for (i=0; i<size; i++) {
  prefetch(depth);  /* pseudocode: prefetch at the given depth */
  sum += A[i];
}

We studied the effects of two factors: data size and prefetch depth

Combination of loop unrolling and prefetching
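A minimal harness for this kind of experiment might look like the sketch below. It assumes that the prefetch depth is measured in array elements and uses the standard `clock()` timer rather than hardware event counters; both are assumptions, since the slides do not show the actual benchmark code. The function names `sum_depth` and `time_sum` are hypothetical.

```c
#include <stddef.h>
#include <time.h>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>
#define PREFETCH_NTA(p) _mm_prefetch((const char *)(p), _MM_HINT_NTA)
#else
#define PREFETCH_NTA(p) ((void)(p))  /* assumption: no SSE, so do nothing */
#endif

/* Sum size floats, prefetching depth elements ahead (0 disables prefetch). */
float sum_depth(const float *A, size_t size, size_t depth)
{
    float sum = 0.0f;
    for (size_t i = 0; i < size; i++) {
        if (depth != 0 && depth < size - i)  /* target is inside the array */
            PREFETCH_NTA(&A[i + depth]);
        sum += A[i];
    }
    return sum;
}

/* Time one run; returns CPU seconds and stores the sum in *out. */
double time_sum(const float *A, size_t size, size_t depth, float *out)
{
    clock_t t0 = clock();
    *out = sum_depth(A, size, depth);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Calling `time_sum` for each data size and for depths such as 1, 2, 4, and so on up to 2048 would give one timing per configuration; a single short run is noisy, so repeated runs are advisable.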

Page 14: Simulation (cont'd)

Intel VTune performance analyzer

Event-based simulation: CPI, L1 miss rate, clock ticks
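The derived metrics follow from raw event counts in the usual way: CPI is clock ticks divided by instructions retired, and a miss ratio is misses divided by accesses. The sketch below shows the arithmetic; the function names are hypothetical and the counts in the usage note are made up, not measurements from the slides.

```c
/* Cycles per retired instruction; 0.0 if no instructions were counted. */
double cpi(unsigned long long clockticks, unsigned long long instr_retired)
{
    if (instr_retired == 0)
        return 0.0;
    return (double)clockticks / (double)instr_retired;
}

/* Fraction of L1 accesses that missed; 0.0 if there were no accesses. */
double l1_miss_ratio(unsigned long long misses, unsigned long long accesses)
{
    if (accesses == 0)
        return 0.0;
    return (double)misses / (double)accesses;
}
```

For example, 300 clock ticks over 200 retired instructions give a CPI of 1.5, and 1 miss in 4 accesses gives a miss ratio of 0.25.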

Page 15: Size vs. CPI

[Figure: CPI (axis 0 to 3.5) for data sizes 0.5M, 1M, 2M, 3M, and 4M. Series: no optimization; loop unrolling; data prefetching; loop unrolling and data prefetching.]

Page 16: Size vs. L1 miss ratio

[Figure: L1 data miss ratio (axis 0 to 0.035) for data sizes 0.5M, 1M, 2M, 3M, and 4M. Series: no optimization; loop unrolling; data prefetching; loop unrolling and data prefetching.]

Page 17: Size vs. clock ticks

[Figure: clock ticks (axis 0 to 140,000,000) for data sizes 0.5M, 1M, and 2M. Series: no optimization; loop unrolling; data prefetching; loop unrolling and data prefetching.]

Page 18: Depth vs. CPI for prefetching with unrolling

[Figure: cycles per retired instruction, CPI (axis 0 to 3), for prefetch depths from 1 to 1024.]

Page 19: Depth vs. L1 miss ratio for prefetching with unrolling

[Figure: L1 read misses ratio (axis 0 to 0.018) for prefetch depths 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, and 2048.]

Page 20: Depth vs. clockticks for prefetching with loop unrolling

[Figure: clockticks events (axis 0 to 140,000,000) for prefetch depths 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, and 2048.]

Page 21: Depth vs. CPI for prefetching without loop unrolling

[Figure: cycles per retired instruction, CPI (axis 0 to 3), for prefetch depths 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, and 2048.]

Page 22: Depth vs. L1 miss ratio for prefetching without unrolling

[Figure: L1 read misses ratio (axis 0 to 0.03) for prefetch depths 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, and 2048.]

Page 23: Depth vs. clockticks for prefetching without loop unrolling

[Figure: clockticks events (axis 0 to 700,000) for prefetch depths 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, and 2048.]

Page 24: Questions!!