Detailed look at the TigerSHARC pipeline: Cycle counting for the IALU version of the DC_Removal algorithm

Post on 18-Dec-2015



DC_Removal algorithm performance 2 / 28

To be tackled today

Expected and actual cycle count for the J-IALU version of the DC_Removal algorithm

Understanding why the stalls occur and how to fix them

Differences between the first time into a function (cache empty) and the second time into the function


Set up time

In principle 1 cycle / instruction

2 + 4 instructions


First key element – Sum Loop – Order (N)

4 + N * 5 instructions

Second key element – Shift Loop – Order (log2N)

1 + 2 * log2N instructions


Third key element – FIFO circular buffer – Order (N)

Update outgoing parameters: 6 instructions

Update FIFO: 3 + 6 * N instructions

Function return: 2 instructions
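
The three key elements can be pictured with a short sketch. The slides' source code did not survive in this transcript, so the following is only one plausible reading of the steps named here (insert values, sum loop, shift loop, update outgoing values, FIFO update), not the lecture's actual code. The function name `dc_removal`, its parameters, and the choice to shift one bit per pass are all assumptions made for illustration; N is assumed to be a power of two.

```python
def dc_removal(left, right, left_buf, right_buf):
    """Hypothetical sketch: subtract the running mean (DC level) from one
    stereo sample pair. Each buffer holds the last N samples, N a power of 2."""
    n = len(left_buf)

    # Insert the new values into the buffers (newest sample in the last slot).
    left_buf[-1] = left
    right_buf[-1] = right

    # Sum loop, order N.
    left_sum = 0
    right_sum = 0
    for i in range(n):
        left_sum += left_buf[i]
        right_sum += right_buf[i]

    # Shift loop, order log2(N): divide by N one arithmetic shift at a time.
    for _ in range(n.bit_length() - 1):
        left_sum >>= 1
        right_sum >>= 1

    # Update the outgoing values with the DC level removed.
    left_out = left - left_sum
    right_out = right - right_sum

    # FIFO update, order N: slide every sample down one slot.
    for i in range(n - 1):
        left_buf[i] = left_buf[i + 1]
        right_buf[i] = right_buf[i + 1]

    return left_out, right_out
```

With a constant input the output settles to zero once the buffers have filled, which is the behaviour expected of a DC-removal step.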


TigerSHARC pipeline


Using the “Pipeline Viewer”

Available with the TigerSHARC simulator ONLY

VIEW | Debug Windows | Pipeline Viewer

F1 to F4 – instruction fetch unit pipeline

PD, D, I -- Integer ALU pipeline

A, EX1, EX2 – Compute Block pipeline


Pipeline symbols

Control - click

A – Abort
B – Bubble
H – BTB Hit (Jumps)
S – Stall
W – Wait
X – Illegal fetch (F1 – F4)
X – Illegal instruction (PD – EX2)


Time in theory

Set up pointers to buffers      2
Insert values into buffers      4
SUM LOOP                        4 + N * 5
SHIFT LOOP                      1 + 2 * log2N
Update outgoing parameters      6
Update FIFO                     3 + 6 * N
Function return                 2
---------------------------------------------------
Total                           22 + 11 N + 2 log2N

N = 128 – instructions = 1444
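
As a check on the arithmetic, the per-section counts can be added up in a few lines. This is a minimal sketch: the function name `expected_instructions` and the dictionary keys are made up for illustration, and N is assumed to be a power of two so that log2N is an integer.

```python
def expected_instructions(n):
    """Sum the theoretical instruction counts listed for each code section."""
    log2n = n.bit_length() - 1  # exact log2 for a power of two
    sections = {
        "set up pointers to buffers": 2,
        "insert values into buffers": 4,
        "sum loop": 4 + n * 5,
        "shift loop": 1 + 2 * log2n,
        "update outgoing parameters": 6,
        "update FIFO": 3 + 6 * n,
        "function return": 2,
    }
    return sum(sections.values())


# The section-by-section total agrees with the closed form 22 + 11 N + 2 log2N.
print(expected_instructions(128))
```

For N = 128 the two loops of order N account for 1408 of the instructions, so the set-up and return code hardly matters to the total.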

1444 cycles + 1100 delay cycles

C++ debug mode – 9500 cycles???????

Note: other tests were executed before this test, which means “cache filled”.


Test environment

Examine the pipeline the 2nd time around the loop

“Caches filled”?


Set up time

Expected: 2 + 4 instructions

Actual: 2 + 4 instructions + 2 stalls

Why not 4 stalls?


First time round sum loop

Expected 9 instructions

LC0 load – 3 stalls

Each memory fetch – 4 stalls

Actual 9 + 11 stalls


Other times around the loop

Expected 5 instructions

Each memory fetch – 4 stalls

Actual 5 + 8 stalls
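
To see what these stalls do over the whole sum loop, the two cases can be combined into one tally. This is a rough sketch under stated assumptions: each stall is taken to cost exactly one cycle, every pass after the first is taken to behave like the one shown (cache filled), and the name `sum_loop_cycles` is hypothetical.

```python
def sum_loop_cycles(n):
    """Tally instructions and stalls for the sum loop with the cache filled."""
    first_instr, first_stalls = 9, 11   # first pass: LC0 load (3) + two fetches (4 each)
    later_instr, later_stalls = 5, 8    # later passes: two fetches (4 each)
    instructions = first_instr + (n - 1) * later_instr
    stalls = first_stalls + (n - 1) * later_stalls
    return instructions, stalls, instructions + stalls


instructions, stalls, cycles = sum_loop_cycles(128)
print(instructions, stalls, cycles)
```

Under these assumptions the stalls outnumber the instructions in the sum loop, so this loop alone accounts for most of the delay cycles.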


Shift Loop – 1st time around

Expected 3 instructions

No stalls on LC0 load?

4 stalls on ASHIFTR

BTB hit followed by 5 aborts


Shift loop – 2nd and later times around

Expect 2

Get 2


Store back of &left, &right

Expect 6

Actual 6 + 3 stalls


Exercise 1

Based on the knowledge gained to this point, determine the expected stalls during the last piece of code – the FIFO buffer operation


Third key element – FIFO circular buffer – Order (N)

Update outgoing parameters: 6 instructions

Update FIFO: 3 + 6 * N instructions

Function return: 2 instructions


Answer


Second time into function


What happens if the cache is not full – the first time the function is called?

Was 2 + 2 stalls in loop

Now 11 + 12 stalls in loop


First time function called – 2nd time around the loop

Ditto 3, 4, 5, 6, 7, 8 times


9th time around the loop

Ditto 17th, 25th, 33rd, 41st, 49th


What is happening?

With cache filled – memory read accesses require 4 cycles

Unfilled – first one requires “12 cycles”

Then next 7 require 4 cycles

Total guess – is the extra time associated with doing extra reads to fill the cache?
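
The observed pattern can be written down as a toy cost model. This only restates the figures quoted above and, like the slide, is a guess rather than a description of how the TigerSHARC memory system really works. It assumes a single stream of consecutive reads in which every 8th read pays the longer cost; the names `read_cost` and `total_read_cycles` and their parameters are hypothetical.

```python
def read_cost(index, cache_filled, group=8, slow=12, fast=4):
    """Cycles for the read at position `index` (0-based) in a run of reads."""
    if cache_filled:
        return fast
    # Cache unfilled: the first read of each group of 8 is the slow one.
    return slow if index % group == 0 else fast


def total_read_cycles(n, cache_filled):
    """Total cycles for n consecutive reads."""
    return sum(read_cost(i, cache_filled) for i in range(n))


# Passes (1-based) that see the slow read when the cache starts out empty.
slow_passes = [i + 1 for i in range(50) if read_cost(i, cache_filled=False) == 12]
print(slow_passes)
print(total_read_cycles(128, cache_filled=False) - total_read_cycles(128, cache_filled=True))
```

In this model the slow passes fall on the 1st, 9th, 17th, ... reads, matching the pattern seen in the pipeline viewer, and the first call costs 8 extra cycles per group of 8 reads.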


Tackled today

Expected and actual cycle count for the J-IALU version of the DC_Removal algorithm

Understanding why the stalls occur and how to fix them

Differences between the first time into a function (cache empty) and the second time into the function

Further unknowns – how memory operations really work