Post on 18-Dec-2015
William Sandqvist [email protected]
Some symbols
Binary Semaphore
Counting Semaphore
Mutex Semaphore
Message Queue
Event Flag
Mailbox
Task ISR Timer
6-5 Synchronization with semaphores
Task 1
  …
  a = f1(…);
  // Synchronization point
  f2(b);
  …
Task 2
  …
  b = g1(…);
  // Synchronization point
  g2(a);
  …
Operations: accessSem(Sem) and releaseSem(Sem)
Synchronize Code with binary semaphores!
Binary semaphores Sem1 and Sem2
Task 1
  …
  accessSem(Sem1);
  …
  a = f1(…);
  releaseSem(Sem1);
  // Synchronization point
  accessSem(Sem2);
  f2(b);
  releaseSem(Sem2);
  …
Task 2
  …
  accessSem(Sem2);
  …
  b = g1(…);
  releaseSem(Sem2);
  // Synchronization point
  accessSem(Sem1);
  g2(a);
  releaseSem(Sem1);
  …
Sem1 and Sem2 are created with the value ”1” at start!
Binary semaphores Sem1 and Sem2
Task 1
  …
  a = f1(…);
  releaseSem(Sem1);
  // Synchronization point
  accessSem(Sem2);
  f2(b);
  releaseSem(Sem2);
  …
Task 2
  …
  b = g1(…);
  releaseSem(Sem2);
  // Synchronization point
  accessSem(Sem1);
  g2(a);
  releaseSem(Sem1);
  …
Sem1 and Sem2 are created with the value ”0” at start!
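The variant where both semaphores start at 0 can be tried out on a desktop system. The following is a minimal sketch, assuming POSIX threads and unnamed POSIX semaphores stand in for the tasks and for accessSem/releaseSem (sem_wait and sem_post respectively). The values 10 and 20 and the bodies of f2 and g2 are hypothetical placeholders, and the trailing release calls after f2(b) and g2(a) are left out because each task passes the synchronization point only once here.

```c
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

static sem_t sem1, sem2;      /* both created with the value 0 */
static int a, b;              /* values exchanged between the tasks */
static int result1, result2;  /* stand-ins for the effects of f2(b) and g2(a) */

static void *task1(void *arg)
{
    (void)arg;
    a = 10;               /* a = f1(...), hypothetical value */
    sem_post(&sem1);      /* releaseSem(Sem1): a is now valid */
    sem_wait(&sem2);      /* accessSem(Sem2): block until b is valid */
    result1 = b + 1;      /* f2(b), hypothetical body */
    return NULL;
}

static void *task2(void *arg)
{
    (void)arg;
    b = 20;               /* b = g1(...), hypothetical value */
    sem_post(&sem2);      /* releaseSem(Sem2): b is now valid */
    sem_wait(&sem1);      /* accessSem(Sem1): block until a is valid */
    result2 = a + 2;      /* g2(a), hypothetical body */
    return NULL;
}

/* Runs both tasks once and waits for them; returns 0 on success. */
int run_rendezvous(void)
{
    pthread_t t1, t2;

    if (sem_init(&sem1, 0, 0) != 0 || sem_init(&sem2, 0, 0) != 0)
        return -1;
    if (pthread_create(&t1, NULL, task1, NULL) != 0)
        return -1;
    if (pthread_create(&t2, NULL, task2, NULL) != 0)
        return -1;
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    sem_destroy(&sem1);
    sem_destroy(&sem2);
    return 0;
}
```

Whichever thread is scheduled first, neither f2(b) nor g2(a) can run before the other task has produced its value, because each task blocks on a semaphore that only the other task releases.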
Task Triplet
P( max execution time, period, deadline )
Create periodic tasks
A soft timer could release a semaphore periodically.
A task could access the semaphore before each execution.
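The timer-plus-semaphore idea can be illustrated without an RTOS. The sketch below is a single-threaded, tick-driven simulation, not real kernel code: the type sim_sem and the functions release_sem, try_access_sem and simulate_periodic are hypothetical names introduced only for this illustration, and the task's execution time is assumed to fit within one tick.

```c
/* Hypothetical stand-in for a semaphore in a single-threaded simulation. */
typedef struct {
    int count;
} sim_sem;

static void release_sem(sim_sem *s)
{
    s->count++;
}

/* Non-blocking access: returns 1 if the semaphore was taken, 0 otherwise. */
static int try_access_sem(sim_sem *s)
{
    if (s->count > 0) {
        s->count--;
        return 1;
    }
    return 0;
}

/* Simulates total_ticks timer ticks; returns how many times the task ran. */
int simulate_periodic(int total_ticks, int period)
{
    sim_sem s = { 0 };
    int runs = 0;

    for (int tick = 0; tick < total_ticks; tick++) {
        if (tick % period == 0)
            release_sem(&s);      /* soft timer expires: release the semaphore */
        if (try_access_sem(&s))
            runs++;               /* task got the semaphore: run one job */
    }
    return runs;
}
```

With a period of 10 ticks, simulate_periodic(100, 10) lets the task run 10 times, once per period.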
Finite Impulse Response filter
You programmed a FIR filter in LAB 2. Every filter stage needs a MAC operation. MAC = Multiply and ACcumulate.
sample = input();
x[oldest] = sample;
y = 0;
for (k = 0; k < N; k++) {
    y += h[k] * x[(oldest + k) % N];
}
oldest = (oldest + 1) % N;
output(y);
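To see the one-MAC-per-stage structure in isolation, here is a minimal self-contained sketch of a FIR step with a circular buffer. It is not the LAB 2 code: it tracks the newest sample instead of the oldest, so that h[k] multiplies the sample that is k steps old, and the tap count and coefficients are hypothetical.

```c
#define N 3   /* number of filter stages (taps), hypothetical */

static const double h[N] = { 1.0, 2.0, 3.0 };  /* hypothetical coefficients */
static double x[N];                            /* circular buffer, starts at zero */
static int newest = N - 1;                     /* index of the most recent sample */

/* Feeds one input sample through the filter and returns one output sample. */
double fir_step(double sample)
{
    double y = 0.0;

    newest = (newest + 1) % N;   /* overwrite the oldest slot */
    x[newest] = sample;

    for (int k = 0; k < N; k++) {
        /* One multiply-and-accumulate per filter stage. */
        y += h[k] * x[(newest + N - k) % N];
    }
    return y;
}
```

Feeding a unit impulse (1, 0, 0, 0) returns the coefficients 1, 2, 3 followed by 0, which is a quick way to check the indexing.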
7-1 Hardware Accelerators
DSP application. 15% of the execution time is spent in calls to a function that performs a MAC operation (Multiply and ACcumulate). An alternative is to use another processor that has a MAC instruction. Suppose that we have the ratio:
$t_{MAC\_instruction} = 0.1 \cdot t_{MAC\_function}$
How much of the total execution time could be saved if the processor with the MAC instruction is used?
Without MAC: 15% + 85% = 100%
With MAC: 1.5% + 85% + 13.5% (freed) = 100%
A program could do 13.5% more in the same execution time.
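The arithmetic above can be written as a small helper. This is only a sketch of the calculation on the slide; the function name remaining_time and its parameters are hypothetical.

```c
/*
 * Fraction of the original execution time that is still needed when a part
 * of the program (fraction) runs in ratio times its original time.
 */
double remaining_time(double fraction, double ratio)
{
    return (1.0 - fraction) + fraction * ratio;
}
```

remaining_time(0.15, 0.1) gives 0.865, so 1 - 0.865 = 0.135 of the original execution time is freed.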
Processor only
X = A * B + C * D
load p1,A      # 2 time units
load p2,B      # 2
load p3,C      # 2
load p4,D      # 2
mul p5,p1,p2   # 8
mul p6,p3,p4   # 8
add p7,p5,p6   # 1
store p7,X     # 2
Grand total = 27 time units
Can the Hardware Accelerator improve on this?
DFG
Detects possible parallelism
Processor and Accelerator
T = C*D
X = A*B + T
Processor:
  load p1,A      # 2
  load p2,B      # 2
  mul p3,p1,p2   # 8
Accelerator (in parallel with the processor):
  load a1,C      # 2
  load a2,D      # 2
  mul a3,a1,a2   # 1
  store T,a3     # 2 (=7)
Processor:
  load p4,T      # 2
  add p5,p4,p3   # 1
  store p5,X     # 2
Grand total = 17 time units
Parallelism!
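The grand total follows from the longer of the two parallel paths plus the final sequential part. A minimal sketch of that calculation, using the time units from the listing (the function names are hypothetical):

```c
static int max_int(int x, int y)
{
    return x > y ? x : y;
}

/* Total time when the processor and the accelerator work in parallel. */
int parallel_total(void)
{
    int processor_path   = 2 + 2 + 8;      /* load A, load B, mul */
    int accelerator_path = 2 + 2 + 1 + 2;  /* load C, load D, mul, store T */
    int join_path        = 2 + 1 + 2;      /* load T, add, store X */

    /* The processor cannot load T before the slower of the two paths is done. */
    return max_int(processor_path, accelerator_path) + join_path;
}
```

Here the processor path (12 time units) hides the accelerator path (7 time units) completely, so the total is 12 + 5 = 17.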
Speedup
$\text{Speedup} = \frac{\text{ExecutionTime}_{\text{Original}}}{\text{ExecutionTime}_{\text{Enhanced}}} = \frac{27}{17} \approx 1.6$
All multiplications on the accelerator
S = A*B
T = C*D
X = S + T
Accelerator:
  load a1,A      # 2
  load a2,B      # 2
  load a3,C      # 2
  load a4,D      # 2
  mul a5,a1,a2   # 1
  mul a6,a3,a4   # 1
  store S,a5     # 2
  store T,a6     # 2
Processor:
  load p1,S      # 2
  load p2,T      # 2
  add p3,p2,p1   # 1
  store p3,X     # 2
Grand total = 21 time units
No parallelism!
Speedup
$\text{Speedup} = \frac{\text{ExecutionTime}_{\text{Original}}}{\text{ExecutionTime}_{\text{Enhanced}}} = \frac{27}{21} \approx 1.3$
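The sequential totals and the speedup figures can be rechecked with a few lines of C. This is a sketch only; the array and function names are hypothetical, and the cost values are the time units from the instruction listings.

```c
/* Time units per instruction, in listing order. */
static const int processor_only[] = { 2, 2, 2, 2, 8, 8, 1, 2 };
static const int all_mul_on_accelerator[] = { 2, 2, 2, 2, 1, 1, 2, 2,   /* accelerator */
                                              2, 2, 1, 2 };             /* processor */

/* Sums the costs of n instructions executed one after another. */
int total_time(const int *cost, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += cost[i];
    return sum;
}

/* Speedup = original execution time / enhanced execution time. */
double speedup(int original, int enhanced)
{
    return (double)original / (double)enhanced;
}
```

total_time gives 27 and 21 for the two listings. speedup(27, 21) is about 1.3, while the parallel schedule with 17 time units reaches about 1.6, so moving every multiplication to the accelerator is the slower of the two accelerated variants.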
Accelerators in the Cyclone II chip
The Cyclone II chip has embedded multipliers that can be used as hardware accelerators.
(They could be connected to the embedded Nios II processor with the Avalon bus.)
Up to 150 18-bit × 18-bit multiplier units can be used!
5-9 Cache performance
This is an example of a problem from part B of the written exam.
int i;
int y = 0;
int u[60];
int v[60];
. . .
for (i = 0; i < 60; i++)
    y += u[i] * v[i];
. . .
Data cache size 128 bytes, cache line/block 32 bytes (8 int). u and v are located in sequence in memory. Variables i and y are stored in processor registers.
Hit rate estimation
Draw the memory and the cache organized as cache lines/blocks. A block is then 8 int. Vectors u and v each occupy 7.5 blocks in memory. We don't know if the mapping looks exactly this way, but the conflicts will be the same.
u[0] M, v[0] M, u[1…3] HHH, v[1…3] HHH
u[4] H, v[4] M, then conflict misses: u[5…7] MMM, v[5…7] MMM …
MM HHH HHH H M MMM MMM … 50%
(The loop stops at 59, so numbers 60…63 are not included; the hit rate will actually be > 50%.)
Program changes for max hit rate
int i;
int y = 0;
int u[72]; /* +12 dummy */
int v[60];
. . .
for (i = 0; i < 60; i++)
    y += u[i] * v[i];
. . .
v is moved 12 ints by extending u with dummy elements. The pattern for each cache line becomes M HHHHHHH. Hit rate 7/8 ≈ 88%.
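Both estimates can be checked with a small cache simulation. The sketch below rests on assumptions that the problem text does not state explicitly: the cache is direct-mapped, u[0] starts on a cache-line boundary, and v follows u directly in memory. The function name count_hits and its parameters are hypothetical.

```c
#define LINE_INTS 8   /* 32-byte cache line / 4-byte int */
#define NUM_LINES 4   /* 128-byte cache / 32-byte cache line */

/*
 * Counts cache hits for the loop y += u[i] * v[i], i = 0..n-1.
 * v_offset is the distance from u[0] to v[0] in ints (60 or 72 here).
 */
int count_hits(int n, int v_offset)
{
    long tag[NUM_LINES];
    int hits = 0;

    for (int line = 0; line < NUM_LINES; line++)
        tag[line] = -1;                     /* cache starts empty */

    for (int i = 0; i < n; i++) {
        int addr[2] = { i, v_offset + i };  /* u[i] first, then v[i] */

        for (int k = 0; k < 2; k++) {
            long block = addr[k] / LINE_INTS;
            int line = (int)(block % NUM_LINES);

            if (tag[line] == block)
                hits++;                     /* block is already cached */
            else
                tag[line] = block;          /* miss: load and replace */
        }
    }
    return hits;
}
```

Under these assumptions count_hits(60, 60) returns 62 and count_hits(60, 72) returns 104, out of 120 accesses. That is about 52% and 87%, in line with the estimates of just over 50% and roughly 88%.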
Is 100% possible?
No, there must always be one cold miss for every cache line. The index i counts forwards; every int is used only once, and no int is reused!
Good Luck!