CACHE-DSP Tool How to avoid having a SHARC thrashing on a cache-line M. Smith, University of Calgary, Canada B. Howse, Cell-Loc, Calgary, Canada Contact -- smithmr @ ucalgary.ca

Post on 19-Dec-2015


CACHE-DSP Tool
How to avoid having a SHARC thrashing on a cache-line

M. Smith, University of Calgary, Canada
B. Howse, Cell-Loc, Calgary, Canada

Contact -- smithmr @ ucalgary.ca


Series of Talks and Workshops

CACHE-DSP – Talk on a simple process tool to identify cache conflicts in DSP code.
SQUISH-DSP – Talk on using a project management tool to automate identification of parallel DSP processor instructions.
SHARC Ecology 101 – Workshop showing how to systematically write parallel 2106X code.
SHARC Ecology 201 – Workshop on SQUISH-DSP and CACHE-DSP tools.


Concepts to be discussed

Concept behind 2106X instruction cache
Cache operation
Introduction of CACHE THRASHING
Solutions to avoid a Cache Thrash without delaying product release
Basis of Cache-DSP tool
Acknowledgements


Purpose of SHARC instruction cache

Harvard Processor Architecture
  One bus for fetching instructions
  Another bus for fetching data
  Twin bus architecture avoids instruction/data fetch conflicts

DSP algorithms
  Addition and multiplication intensive
  Multiple simultaneous accesses to data structures are typically needed
  Twin bus architecture does not avoid data/data fetch conflicts


Solutions to data/data fetch conflicts

Cache single instruction
  Single instruction loop
  Frees up instruction bus for use as data bus to fetch from separate data memory
  Very limited in application

Three bus processor
  Expensive to implement for all memory

ADSP21XXX approach is to have a three bus processor architecture available for a limited number of instructions on an 'as needed' basis: the instruction cache


Example

C-code

Converts temperature array from C to F

Assembly code has 6 PM( ) operations


Example

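Pipeline stages (Fetch | Decode | Execute), one row per cycle:

Cycle 1: Fetch F1=, r0=dm (instr. on PM)
Cycle 2: Fetch F13=, r2=dm, pm= (instr. on PM) | Decode F1=, r0=dm
Cycle 3: Fetch F8=, r0=dm (instr. on PM) | Decode F13=, r2=dm, pm= | Execute F1=, r0=dm (data on DM)
Cycle 4: Fetch F12=, r2=dm, pm= (instr. on PM) | Decode F8=, r0=dm | Execute F13=, r2=dm, pm= (data on DM, PM)
Cycle 5: Decode F12=, r2=dm, pm= | Execute F8=, r0=dm (data on DM)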


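First Time round loop -- STALL

Pipeline stages (Fetch | Decode | Execute), one row per cycle:

Cycle 1: Fetch F1=, r0=dm (instr. on PM)
Cycle 2: Fetch F13=, r2=dm, pm= (instr. on PM) | Decode F1=, r0=dm
Cycle 3: Fetch F8=, r0=dm (instr. on PM) | Decode F13=, r2=dm, pm= | Execute F1=, r0=dm (data on DM)
Cycle 4: Fetch F12=, r2=dm, pm= (instr. on PM / to cache) | Decode F8=, r0=dm | Execute F13=, r2=dm, pm= (data on DM, PM)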


2nd Time – 3 bus operation

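Pipeline stages (Fetch | Decode | Execute), one row per cycle:

Cycle 1: Fetch F1=, r0=dm (instr. on PM)
Cycle 2: Fetch F13=, r2=dm, pm= (instr. on PM) | Decode F1=, r0=dm
Cycle 3: Fetch F8=, r0=dm (instr. on PM) | Decode F13=, r2=dm, pm= | Execute F1=, r0=dm (data on DM)
Cycle 4: Fetch F12=, r2=dm, pm= (instr. from cache) | Decode F8=, r0=dm | Execute F13=, r2=dm, pm= (data on DM, PM)
Cycle 5: Decode F12=, r2=dm, pm= | Execute F8=, r0=dm (data on DM)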
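
The conflict shown in the pipeline diagrams can be sketched in a few lines of Python. This is only an illustrative model, assuming a three-stage fetch/decode/execute pipeline in which the instruction being fetched sits two positions after the one executing. The function name `pm_conflicts` and the short labels `F1`, `F13`, `F8`, `F12` are shorthand invented here for the four instructions on the slides.

```python
def pm_conflicts(instrs):
    """Return (executing, fetched) pairs that both need the PM bus.

    instrs: list of (label, uses_pm_data) tuples in program order.
    Assumes straight-line code and a 3-stage pipeline, so the
    instruction fetched while instrs[i] executes is instrs[i + 2].
    """
    pairs = []
    for i, (label, uses_pm_data) in enumerate(instrs):
        if uses_pm_data and i + 2 < len(instrs):
            pairs.append((label, instrs[i + 2][0]))
    return pairs


# The four instructions from the example; True marks a pm= data access.
code = [("F1", False), ("F13", True), ("F8", False), ("F12", True)]
print(pm_conflicts(code))  # [('F13', 'F12')]
```

Under this model the only clash is the one in the diagrams: F13 needs the PM bus for data in the same cycle in which F12 is fetched, so F12 is the instruction that goes into the cache.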


Instruction Cache Characteristics

32 cache locations
  32 locations looks small in number, but the cache is used ONLY when a data access on the PM bus conflicts with an instruction access on the PM bus
  Typically satisfactory for tight DSP algorithm loops of up to 100+ atomic operations.


MAJOR LIMITATION POSSIBLE

Cache is 2-way associative
  32 cache locations grouped in groups of 2
  Instruction storage location in cache determined by last 4 bits of address
  Instruction N stored at cache location N modulo 16

Also a least recently used (LRU) bit
  LRU instruction replaced on a cache miss.

Possible to induce -- CACHE THRASH
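
As a rough illustration of the placement rule above, the sketch below maps instruction addresses to cache locations and reports any location that more than two cached instructions compete for. The names `cache_set` and `overloaded_sets` and the sample addresses are made up for this example; only the modulo-16, 2-way rule comes from the slide.

```python
from collections import defaultdict

N_SETS = 16  # 32 cache locations grouped in pairs
WAYS = 2     # 2-way associative


def cache_set(address):
    """Cache location for an instruction address (its last 4 bits)."""
    return address % N_SETS


def overloaded_sets(addresses):
    """Group addresses by cache location; keep locations with > WAYS entries."""
    by_set = defaultdict(list)
    for address in addresses:
        by_set[cache_set(address)].append(address)
    return {s: found for s, found in by_set.items() if len(found) > WAYS}


# Hypothetical addresses: three of them share the same last 4 bits.
print(overloaded_sets([0x102, 0x112, 0x122, 0x105]))
# {2: [258, 274, 290]}
```

Three cached instructions mapping to one 2-entry location is exactly the situation in which the LRU replacement can start to thrash.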


Simple Example

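Assume that the cache is 2-way associative with 8 (not 32) locations, i.e. 4 cache lines of 2 entries each, so instruction N maps to cache line N modulo 4

6 cache operations to be placed into 8 cache locations

Instruction = cache line:
0 = %00, 1 = %01, 2 = %10, 3 = %11
4 = %00, 5 = %01, 6 = %10, 7 = %11
8 = %00, 9 = %01, 10 = %10, 11 = %11
12 = %00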


Simple Example -- First Cache Op

Instruction 2 forces Instruction 4 into cache line %00

Cache state:
%00: 4


Simple Example

Next 2 cache operations place instructions 6 and 9 into cache

Cache state:
%00: 4
%01: 9
%10: 6


Simple Example

4th and 5th Cache operations set LRU bits for cache lines %00 and %10

Cache state:
%00: 4 (LRU), 12
%01: 9
%10: 6 (LRU), 10


Execution of Instruction 12

Execution of instruction 12 occurs during Fetch of instruction 2 in loop

3rd Cache operation involving cache line %10

Instruction 2 to cache %10

Cache state as instruction 2 arrives:
%00: 4 (LRU), 12
%01: 9
%10: 6 (LRU), 10


Summary of Cache Operations

First time round loop
  Instr. 2 pushes Instr. 4 to cache line %00
  Instr. 4 pushes Instr. 6 to cache line %10
  Instr. 7 pushes Instr. 9 to cache line %01
  Instr. 8 pushes Instr. 10 to cache line %10
  Instr. 10 pushes Instr. 12 to cache line %00
  INSTR. 12 pushes INSTR. 2 to cache line %10 WHERE IT REPLACES INSTR. 6 (LRU)


Cache Thrash starts operating

Second time round loop
  Instr. 4 from cache line %00
  Instr. 4 pushes Instr. 6 to cache line %10, REPLACING INSTR. 10 (LRU for %10)
  Instr. 9 from cache line %01
  Instr. 8 pushes Instr. 10 to cache line %10, REPLACING INSTR. 2 (LRU for %10)
  Instr. 12 from cache line %00
  Instr. 12 pushes Instr. 2 to cache line %10, REPLACING INSTR. 6 (LRU for %10)

Losing 3 cycles each time around loop
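
The thrash can be reproduced with a small simulation. The sketch below is a simplified model, not the Cache-DSP tool itself: it assumes the loop runs instructions 1 to 12, that PM data accesses occur in instructions 2, 4, 7, 8, 10 and 12 (as in the summary slides), that the conflicting fetch is always two instructions ahead, and that each cache line keeps its 2 entries in LRU order. The function name `simulate` is hypothetical.

```python
def simulate(loop, pm_ops, n_sets=4, ways=2, iterations=4):
    """Return the number of cache misses in each pass round the loop."""
    # Each cache line is a list of addresses, least recently used first.
    sets = [[] for _ in range(n_sets)]
    misses_per_pass = []
    for _ in range(iterations):
        misses = 0
        for pos, addr in enumerate(loop):
            if addr not in pm_ops:
                continue
            # Instruction being fetched while `addr` executes (wraps at loop end).
            fetched = loop[(pos + 2) % len(loop)]
            line = sets[fetched % n_sets]
            if fetched in line:
                line.remove(fetched)   # hit: refresh its LRU position below
            else:
                misses += 1            # miss: stall and load into the cache
                if len(line) == ways:
                    line.pop(0)        # evict the least recently used entry
            line.append(fetched)
        misses_per_pass.append(misses)
    return misses_per_pass


print(simulate(list(range(1, 13)), {2, 4, 7, 8, 10, 12}))  # [6, 3, 3, 3]
```

The first pass fills the cache (6 misses). Every later pass misses 3 times, all in cache line %10, where instructions 6, 10 and 2 keep evicting one another.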


Easy to fix in this example

Can delay PM from INSTR. 2 till 3

This forces INSTR 5 to cache (%01) where it does not replace anything

Figure (cache-line map) annotations: 2 -- %10, 4 -- %00, 5 -- %01, 6 -- %10, 9 -- %01 LRU; label: PM =


Real Life more difficult

Larger number of instructions in loop
Jump operations (conditional or not)
Register dependencies
May need to move many PM operations

All this takes time
Need a systematic approach to gain speed while getting the product out the door in the shortest time
ADD-A-NOP – waste 1 cycle to gain 3


ADD A CACHE FREEZE at end of the loop

CACHE THRASH (3 cycles waste) replaced by STALL (instruction can’t go into cache) and Freeze instruction (2 cycles wasted)

Instruction map now extends to 13 = %01

Instruction 1 stalls

Cache state:
%00: 4 (LRU), 12
%01: 9 (LRU)
%10: 6 (LRU), 10

BIT SET MODE2 CAFRZ   Cache Freeze
BIT CLR MODE2 CAFRZ   Cache Unfreeze


ADD A NOP at end of the loop

CACHE THRASH (3 cycles waste) IS AVOIDED with a loss of only 1 cycle/loop because of additional NOP instruction

Instruction map now extends to 13 = %01 (the NOP)

Instruction 1 to cache %01

Cache state:
%00: 4 (LRU), 12
%01: 9 (LRU)
%10: 6 (LRU), 10
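
The trade-off on this slide can be checked with a short calculation. The sketch below is an illustrative model under stated assumptions: one cycle per instruction, one extra cycle per cache miss, PM data accesses in instructions 2, 4, 7, 8, 10 and 12, and the conflicting fetch two instructions ahead. The helper name `steady_misses` is invented for this example.

```python
def steady_misses(loop, pm_ops, n_sets=4, ways=2):
    """Cache misses per pass once the loop has settled (third pass)."""
    sets = [[] for _ in range(n_sets)]  # per line: addresses, LRU first
    misses = 0
    for _ in range(3):
        misses = 0
        for pos, addr in enumerate(loop):
            if addr in pm_ops:
                fetched = loop[(pos + 2) % len(loop)]
                line = sets[fetched % n_sets]
                if fetched in line:
                    line.remove(fetched)
                else:
                    misses += 1
                    if len(line) == ways:
                        line.pop(0)  # evict least recently used
                line.append(fetched)
    return misses


PM_OPS = {2, 4, 7, 8, 10, 12}
original = list(range(1, 13))   # instructions 1..12
with_nop = list(range(1, 14))   # instruction 13 is the added NOP

for name, loop in (("original", original), ("with NOP", with_nop)):
    print(name, len(loop) + steady_misses(loop, PM_OPS))
# original 15
# with NOP 13
```

With the NOP in place, instruction 12 forces instruction 1 (cache line %01) into the cache instead of instruction 2 (cache line %10), so the extra cycle for the NOP buys back the 3 cycles lost to the thrash.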


Cache-DSP tool concept

Original Code – Loop Cycles = C1
1, 2, 3, 4, 5, 6, 7, endloop

Trial 1 – Loop Cycles = C2
1, 2, 3, 4, 5, 6, 7, NOP, endloop

Trial 2 – Loop Cycles = C3
1, 2, 3, 4, 5, 6, NOP, 7, endloop

Trial 3 – Loop Cycles = C4
1, 2, 3, 4, 5, NOP, 6, 7, endloop


Cache-DSP tool

Identifies the number of cache operations and cache thrashes in the current code
Calculates the advantage, in reduced cache thrashes, of adding a NOP after/before each instruction in the loop
Remembers the best case scenario
Then determines the effect of placing 2 NOPs (or 3, 4, etc.) somewhere in the code (preferably at the end of the loop).
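
The brute-force search described above might look roughly like the following sketch. It is not the actual Cache-DSP tool: it assumes a toy 4-line, 2-way LRU cache, one cycle per instruction plus one per cache miss, a loop starting at address 1, and a conflicting fetch two instructions ahead. The names `steady_misses` and `try_single_nop` are hypothetical.

```python
def steady_misses(pm_flags, base=1, n_sets=4, ways=2):
    """Misses per pass for a loop given as a list of 'uses PM data' flags."""
    n = len(pm_flags)
    sets = [[] for _ in range(n_sets)]  # per line: addresses, LRU first
    misses = 0
    for _ in range(3):                  # third pass is taken as steady state
        misses = 0
        for pos, uses_pm in enumerate(pm_flags):
            if not uses_pm:
                continue
            fetched = base + (pos + 2) % n
            line = sets[fetched % n_sets]
            if fetched in line:
                line.remove(fetched)
            else:
                misses += 1
                if len(line) == ways:
                    line.pop(0)
            line.append(fetched)
    return misses


def try_single_nop(pm_flags):
    """Loop cycles for every possible position of one inserted NOP."""
    results = {}
    for k in range(len(pm_flags) + 1):
        trial = pm_flags[:k] + [False] + pm_flags[k:]  # NOP never uses PM
        results[k] = len(trial) + steady_misses(trial)
    return results


# Example loop from the earlier slides: PM data in instructions 2, 4, 7, 8, 10, 12.
flags = [addr in {2, 4, 7, 8, 10, 12} for addr in range(1, 13)]
baseline = len(flags) + steady_misses(flags)
results = try_single_nop(flags)
print(baseline, min(results.values()))  # 15 13
```

Several NOP positions can tie for the best cycle count; `results[len(flags)]` is the end-of-loop placement preferred on the slide. Extending the search to 2 or more NOPs means repeating the same trial over combinations of positions.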


Advantages

Typical DSP loops are small
  Can use a brute force approach in identifying where NOPs should be placed

If the code meets the time constraints of your project, then ship with the NOPs included
If it does not meet the time constraints, then the position of the NOPs gives hints as to which PM( ) operations to delay
Works with any processor architecture


Hint -- Instruction PM( ) Key

Reformat loop so that Instr. 1 is outside loop and repeated as Instr. 13 with Instr. 12 PM( ) moved

Now we have removed cache thrash with no waste

Instruction map now extends to 13 = %01

Instruction 1 outside loop

Instruction 3 to cache %11

Cache state:
%00: 4 (LRU), 12
%01: 9
%10: 6 (LRU), 10

F1=, r0=dm( ), pm( ) =
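
A quick static check shows why the reformatted loop no longer thrashes. The sketch below assumes the new loop runs instructions 2 to 13, that the PM( ) access formerly in instruction 12 now sits in instruction 13, and that the conflicting fetch is two instructions ahead; the function name `cached_per_line` is made up for this example.

```python
def cached_per_line(loop, pm_ops, n_sets=4):
    """Map each cache line to the set of instructions that must live in it."""
    lines = {}
    for pos, addr in enumerate(loop):
        if addr in pm_ops:
            fetched = loop[(pos + 2) % len(loop)]
            lines.setdefault(fetched % n_sets, set()).add(fetched)
    return lines


original = cached_per_line(list(range(1, 13)), {2, 4, 7, 8, 10, 12})
reformatted = cached_per_line(list(range(2, 14)), {2, 4, 7, 8, 10, 13})

print(max(len(v) for v in original.values()))     # 3
print(max(len(v) for v in reformatted.values()))  # 2
```

In the original loop three instructions (6, 10 and 2) compete for the two entries of cache line %10. After the change instruction 3 goes to the otherwise empty line %11, no line needs more than two entries, and the loop is still 12 instructions long.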


Problems to overcome

Jumps inside loops
  Complicates which instructions get cached
  Conditional jump changes which instruction gets cached (dynamic effect)
  Complicates the effect of placing a NOP into a delay slot and displacing an instruction out of the delay slot

Effect of loops inside loops


Concepts discussed

Concept behind ADI instruction cache
Cache operation
Introduction of CACHE THRASHING
Solutions to avoid a Cache Thrash without delaying product release
  Introduction of NOP instructions into code -- wasting one cycle to save 3 cycles
  Identification of PM( ) operations to move

Basis of Cache-DSP tool


Acknowledgements

Financial support of the Natural Sciences and Engineering Research Council of Canada (NSERC) and the University of Calgary
Financial support from Analog Devices through an ADI University professorship for 2001/2002 (Dr. Smith)
Future work will be financed in part by the Alberta Government through the Alberta Software Engineering Research Consortium (ASERC)