
William Sandqvist [email protected]


Some symbols

Binary Semaphore

Counting Semaphore

Mutex Semaphore

Message Queue

Event Flag

Mailbox

Task ISR Timer


6-5 Synchronization with semaphores

Task 1

…
a = f1(…);
// Synchronization point
f2(b);
…

Task 2

…
b = g1(…);
// Synchronization point
g2(a);
…

Operations: accessSem(Sem) and releaseSem(Sem)

Synchronize Code with binary semaphores!


Binary semaphores Sem1 and Sem2

Task 1

…
accessSem(Sem1);
…
a = f1(…);
releaseSem(Sem1);
// Synchronization point
accessSem(Sem2);
f2(b);
releaseSem(Sem2);
…

Task 2

…
accessSem(Sem2);
…
b = g1(…);
releaseSem(Sem2);
// Synchronization point
accessSem(Sem1);
g2(a);
releaseSem(Sem1);
…

Sem1 and Sem2 are created with the value "1" at start!


Binary semaphores Sem1 and Sem2

Task 1

…
a = f1(…);
releaseSem(Sem1);
// Synchronization point
accessSem(Sem2);
f2(b);
releaseSem(Sem2);
…

Task 2

…
b = g1(…);
releaseSem(Sem2);
// Synchronization point
accessSem(Sem1);
g2(a);
releaseSem(Sem1);
…

Sem1 and Sem2 are created with the value "0" at start!
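The version with both semaphores starting at "0" can be tried out on a desktop system. The following is only a sketch under assumptions: it maps accessSem to POSIX sem_wait and releaseSem to sem_post, runs the two tasks as pthreads, and replaces f1 and g1 with the made-up constants 10 and 20. The names run_rendezvous, f2_arg and g2_arg are hypothetical helpers, not part of the exercise.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

static sem_t sem1, sem2;        /* both created with the value 0 */
static int a, b;                /* results of f1 and g1 */
static int f2_arg, g2_arg;      /* what each task read after the sync point */

static void *task1(void *unused)
{
    (void)unused;
    a = 10;                     /* a = f1(...), placeholder value */
    sem_post(&sem1);            /* releaseSem(Sem1): a is ready */
    sem_wait(&sem2);            /* accessSem(Sem2): wait until b is ready */
    f2_arg = b;                 /* f2(b) */
    sem_post(&sem2);            /* releaseSem(Sem2), as on the slide */
    return NULL;
}

static void *task2(void *unused)
{
    (void)unused;
    b = 20;                     /* b = g1(...), placeholder value */
    sem_post(&sem2);            /* releaseSem(Sem2): b is ready */
    sem_wait(&sem1);            /* accessSem(Sem1): wait until a is ready */
    g2_arg = a;                 /* g2(a) */
    sem_post(&sem1);            /* releaseSem(Sem1), as on the slide */
    return NULL;
}

/* Runs both tasks once; returns 1 if each task saw the other's result. */
int run_rendezvous(void)
{
    pthread_t t1, t2;
    int rc1 = sem_init(&sem1, 0, 0);
    int rc2 = sem_init(&sem2, 0, 0);
    assert(rc1 == 0 && rc2 == 0);
    (void)rc1; (void)rc2;
    pthread_create(&t1, NULL, task1, NULL);
    pthread_create(&t2, NULL, task2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    sem_destroy(&sem1);
    sem_destroy(&sem2);
    return f2_arg == 20 && g2_arg == 10;
}
```

Whichever task reaches the synchronization point first blocks in sem_wait until the other has produced its value, so run_rendezvous() should return 1 regardless of scheduling order. On older toolchains the program may need to be built with -pthread.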


Task Triplet

P( max execution time, period, deadline )

Create periodic tasks

A soft timer could release a semaphore periodically (releaseSem).

A task could access the semaphore before each execution (accessSem).
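The idea can be illustrated with a small single-threaded simulation. This is a sketch under assumptions, not RTOS code: BinSem, release_sem, try_access_sem and simulate_periodic are hypothetical names, time advances in whole ticks, and the task polls the semaphore where a real task would block in accessSem.

```c
#include <assert.h>

/* Minimal binary semaphore model: 0 = not available, 1 = available. */
typedef struct { int count; } BinSem;

static void release_sem(BinSem *s) { s->count = 1; }

/* Non-blocking access: returns 1 and takes the semaphore if it is available. */
static int try_access_sem(BinSem *s)
{
    if (s->count) {
        s->count = 0;
        return 1;
    }
    return 0;
}

/* Simulates `ticks` timer ticks. The soft timer releases the semaphore
 * every `period` ticks; the task runs once each time it gets the semaphore.
 * Returns the number of times the task body was executed. */
int simulate_periodic(int ticks, int period)
{
    BinSem sem = { 0 };
    int runs = 0;
    assert(period > 0);
    for (int t = 1; t <= ticks; t++) {
        if (t % period == 0)
            release_sem(&sem);          /* soft timer expires */
        if (try_access_sem(&sem))
            runs++;                     /* periodic task body would run here */
    }
    return runs;
}
```

With a period of 10 ticks, simulate_periodic(100, 10) lets the task run at ticks 10, 20, …, 100, which is ten executions.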


Finite Impulse Response filter

You have programmed a FIR filter in LAB 2. Every filter stage needs a MAC operation. MAC = Multiply and ACcumulate.

sample = input();
x[oldest] = sample;
y = 0;
for (k = 0; k < N; k++) {
    y += h[k] * x[(oldest + k) % N];
}
oldest = (oldest + 1) % N;
output(y);
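To experiment with the MAC loop outside the lab environment, the filter can be wrapped in a function. This is a sketch under assumptions: the Fir struct, the name fir_step and the length N = 4 are made up for illustration, and the buffer is indexed backwards from the newest sample so that h[k] weights the sample delayed by k steps, whereas the slide's loop walks the buffer forwards from oldest.

```c
#include <assert.h>
#include <stddef.h>

#define N 4                     /* number of filter stages (illustrative) */

typedef struct {
    double h[N];                /* filter coefficients */
    double x[N];                /* circular buffer of the last N samples */
    size_t newest;              /* index of the most recent sample */
} Fir;

/* Feeds one input sample through the filter and returns one output sample.
 * Each loop iteration is one MAC operation, so one output costs N MACs. */
double fir_step(Fir *f, double sample)
{
    assert(f != NULL);
    f->newest = (f->newest + 1) % N;
    f->x[f->newest] = sample;

    double y = 0.0;
    for (size_t k = 0; k < N; k++) {
        /* x[n - k] sits k slots behind the newest sample */
        y += f->h[k] * f->x[(f->newest + N - k) % N];
    }
    return y;
}
```

Feeding a unit impulse (1, 0, 0, 0, …) into a filter with coefficients {1, 2, 3, 4} returns the coefficients themselves, one per call, which is a quick way to check the indexing.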


7-1 Hardware Accelerators

DSP application. 15% of the execution time is spent in calls to a function that performs a MAC operation (Multiply and ACcumulate). An alternative is to use another processor which has a MAC instruction. Suppose that we have the ratio:

$t_{\mathrm{MAC\_instruction}} = 0.1 \cdot t_{\mathrm{MAC\_function}}$

How much could the total execution time be reduced if the processor with the MAC instruction is used?

Without MAC: 15% + 85% = 100%

With MAC: 1.5% + 85% + 13.5% = 100%

The MAC part shrinks from 15% to 1.5%, which frees 13.5% of the original execution time. A program could do 13.5% more in the same execution time.
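The arithmetic can be written as a one-line function. This is a sketch; the name remaining_time_fraction and its parameters are hypothetical, with mac_share standing for the 15% and ratio for the 0.1 from the exercise.

```c
#include <assert.h>

/* Fraction of the original execution time that is still needed when the
 * part `mac_share` of the program runs `ratio` times as long as before. */
double remaining_time_fraction(double mac_share, double ratio)
{
    assert(mac_share >= 0.0 && mac_share <= 1.0);
    assert(ratio >= 0.0);
    return (1.0 - mac_share) + mac_share * ratio;
}
```

For mac_share = 0.15 and ratio = 0.1 the result is 0.85 + 0.015 = 0.865, so the freed share is 1 - 0.865 = 0.135, the 13.5% stated above.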


7-3 Hardware accelerator

X = A * B + C * D


Processor only

X = A * B + C * D

load  p1,A      # 2 time units
load  p2,B      # 2
load  p3,C      # 2
load  p4,D      # 2
mul   p5,p1,p2  # 8
mul   p6,p3,p4  # 8
add   p7,p5,p6  # 1
store p7,X      # 2

Grand total = 27 time units

Can the Hardware Accelerator improve on this?


DFG

Detects possible parallelism

Processor and Accelerator

T = C * D
X = A * B + T

Processor:

load  p1,A      # 2
load  p2,B      # 2
mul   p3,p1,p2  # 8

Accelerator, in parallel with the processor:

load  a1,C      # 2
load  a2,D      # 2
mul   a3,a1,a2  # 1
store T,a3      # 2 (= 7)

Processor:

load  p4,T      # 2
add   p5,p4,p3  # 1
store p5,X      # 2

Grand total = 17 time units

Parallelism!
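The total of 17 follows from the longer of the two parallel instruction streams plus the processor's final instructions. A minimal sketch of that calculation, assuming the accelerator starts at the same time as the processor and that the name schedule_length is a hypothetical helper:

```c
#include <assert.h>

/* Length of a schedule in which the processor and the accelerator first
 * work side by side and the processor then finishes alone.
 * cpu_parallel:   processor time during the overlapped part
 * accel_parallel: accelerator time during the overlapped part
 * cpu_tail:       processor time after both parts are done */
int schedule_length(int cpu_parallel, int accel_parallel, int cpu_tail)
{
    assert(cpu_parallel >= 0 && accel_parallel >= 0 && cpu_tail >= 0);
    int overlap = cpu_parallel > accel_parallel ? cpu_parallel : accel_parallel;
    return overlap + cpu_tail;
}
```

Here the processor needs 2 + 2 + 8 = 12 units for A * B while the accelerator needs 7 units for T, so the accelerator's work is hidden completely: schedule_length(12, 7, 5) gives 12 + 5 = 17.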


Speedup

$$\text{Speedup} = \frac{\text{ExecutionTime}_{\text{Original}}}{\text{ExecutionTime}_{\text{Enhanced}}} = \frac{27}{17} \approx 1.6$$


All multiplications with the accelerator

S = A * B
T = C * D
X = S + T

load  a1,A      # 2
load  a2,B      # 2
load  a3,C      # 2
load  a4,D      # 2
mul   a5,a1,a2  # 1
mul   a6,a3,a4  # 1
store S,a5      # 2
store T,a6      # 2
load  p1,S      # 2
load  p2,T      # 2
add   p3,p2,p1  # 1
store p3,X      # 2

Grand total = 21 time units

No parallelism!


Speedup

$$\text{Speedup} = \frac{\text{ExecutionTime}_{\text{Original}}}{\text{ExecutionTime}_{\text{Enhanced}}} = \frac{27}{21} \approx 1.3$$
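The sequential totals and the speedup figures can be recomputed from the per-instruction costs. This is a small sketch; total_time and speedup are hypothetical helper names, and the cost arrays in the usage note are copied from the instruction listings.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of the instruction costs of a strictly sequential instruction stream. */
int total_time(const int *costs, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += costs[i];
    return sum;
}

/* Speedup = original execution time / enhanced execution time. */
double speedup(int original, int enhanced)
{
    assert(enhanced > 0);
    return (double)original / (double)enhanced;
}
```

With the costs {2, 2, 2, 2, 8, 8, 1, 2} for the processor alone, total_time gives 27. With {2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2} for the variant where the accelerator does both multiplications, it gives 21. speedup(27, 21) is about 1.29 and speedup(27, 17) about 1.59, which round to the 1.3 and 1.6 on the slides.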


Accelerators in the Cyclone II chip

The Cyclone II chip has embedded multipliers that can be used as hardware accelerators.

(They could be connected to the embedded Nios II processor with the Avalon bus.)

Up to 150 18-bit × 18-bit multiplier units can be used!


5-9 Cache performance

This is an example of a problem from part B of the written exam.

int i;
int y = 0;
int u[60];
int v[60];
. . .
for (i = 0; i < 60; i++)
    y += u[i] * v[i];
. . .

Data cache size 128 bytes, cache line/block 32 bytes (8 int). u and v are located in sequence in memory. Variables i and y are stored in processor registers.


Hitrate estimation

Draw the memory and the cache organized as cache lines/blocks. A block is then 8 int. Vectors u and v each occupy 7.5 blocks in memory. We don't know if the mapping looks exactly this way, but the conflicts will be the same.

u[0] M, v[0] M, u[1…3] HHH, v[1…3] HHH
u[4] H, v[4] M (conflict misses begin), u[5…7] MMM, v[5…7] MMM
…
MM HHH HHH H M MMM MMM … 50%

(The loop stops at i = 59, so the elements numbered 60…63 are never accessed; the hit rate will actually be > 50%.)


Program changes for max hitrate

int i;
int y = 0;
int u[72]; /* +12 dummy */
int v[60];
. . .
for (i = 0; i < 60; i++)
    y += u[i] * v[i];
. . .

v is moved 12 ints by extending u with dummy elements. Every cache line now gives the pattern MHHHHHHH. Hitrate 88%.

Is 100% possible?

No, there must always be one cold miss for every cache line. The index i counts forwards: every int is used only once, no int is reused!
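Both layouts can be checked with a small simulation. This is a sketch under stated assumptions that the exercise does not fix: the cache is direct mapped with 4 lines of 32 bytes, an int is 4 bytes, u starts at a block-aligned address 0, and v follows directly after the u_len elements of u. The name count_hits is a hypothetical helper.

```c
#include <assert.h>

#define LINE_BYTES 32           /* cache line / block size */
#define NUM_LINES  4            /* 128 bytes / 32 bytes */
#define INT_BYTES  4            /* assumed size of int */
#define N          60           /* loop iterations */

/* Counts the cache hits of the loop y += u[i] * v[i] for i = 0..N-1,
 * where u has u_len elements and v is placed directly after u. */
int count_hits(int u_len)
{
    long tag[NUM_LINES] = { 0 };
    int valid[NUM_LINES] = { 0 };
    int hits = 0;

    assert(u_len >= N);
    for (int i = 0; i < N; i++) {
        /* byte addresses of u[i] and v[i], accessed in that order */
        long addr[2] = { (long)INT_BYTES * i, (long)INT_BYTES * (u_len + i) };
        for (int k = 0; k < 2; k++) {
            long block = addr[k] / LINE_BYTES;
            int line = (int)(block % NUM_LINES);
            if (valid[line] && tag[line] == block) {
                hits++;
            } else {                    /* miss: load the block into its line */
                valid[line] = 1;
                tag[line] = block;
            }
        }
    }
    return hits;
}
```

Under these assumptions count_hits(60) gives 62 hits out of 120 accesses (about 52%), and count_hits(72) gives 104 out of 120 (about 87%). The padded layout falls slightly short of the 7-of-8 pattern because the loop ends in the middle of the last cache lines.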


Good Luck!
