send in audio signals and use sharp fir filter to pick out 42 hz and 59 hz signals and send out...

Lab 2 implementation demonstrated using a

Midterm Question

Send in audio signals and use a sharp FIR filter to pick out the 42 Hz and 59 Hz signals and send out warning tones
◦ Try an FIR filter of 256 taps, downsample, and then use a second FIR filter of 256 taps
   - equivalent to one FIR filter of 256 * 256 taps with a bandwidth of 96000 / (256 * 256) Hz
◦ Use code from Lab 0, Lab 1 and Assignment 1 as much as possible
Develop a C++ version (show that it fails unless the code is optimized)
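The bandwidth arithmetic for the two-stage filter can be checked with a short sketch. It assumes, as the slide's numbers imply, a 96 kHz sample rate and a bandwidth that scales as the sample rate divided by the equivalent tap count; the function name `cascadeBandwidthHz` is hypothetical.

```cpp
// Sketch only: FIR -> downsample -> FIR treated as one long FIR
// of taps1 * taps2 taps (assumption taken from the slide).
constexpr double cascadeBandwidthHz(double sampleRateHz, int taps1, int taps2) {
    return sampleRateHz /
           (static_cast<double>(taps1) * static_cast<double>(taps2));
}
```

For 96000 Hz and two 256-tap filters this gives roughly 1.46 Hz, narrow enough to separate the tones of interest from their neighbours.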

Assignment 1
Modify your Lab 1 assembly code to demonstrate (test and audio) the speed improvement for the following steps
◦ 1) software loop to hardware loop
◦ 2) parallel dm, pm access; don't unroll the loop
◦ 3) parallel dm, pm access; unroll the loop 4 times. Don't move code outside the loop; do the parallel dm, pm access in parallel with multiply instructions
◦ 4) parallel dm, pm access; unroll the loop 4 times. Don't move code outside the loop; do the parallel dm, pm access in parallel with multiply and add instructions
Remember to provide a resource chart and compare your timing to the expected timing

Lab 2 requirements

Can the processor meet the requirements?
Two forms of the code - which one is needed?
◦ Grab one audio value - process everything before the next individual audio sample arrives
◦ Grab one audio block - collect the next audio block and process the last audio block before the next audio block has been collected
Real life - worst case
◦ Each channel needs two 256-tap FIR filters
◦ Total channels - 42 Hz plus harmonics, 19 Hz plus harmonics (19 * 3 = 57 Hz) - say 8 channels
◦ Need to generate audio warning signals
◦ Modify the FIR filter coefficients to follow the signals - they might not be at a constant frequency
Do the best-case timing analysis to see whether the algorithm works

Step 1 -- Is it worth the effort?

Correlation measures the similarity between one signal and another, and at what locations the similarity occurs

Have a heartbeat signal 000ABcD0000
Have a signal from a running patient 00000000ABcD0000000ABcD0000000ABcD0000

Use 0000DcAB0000 as the coefficients in an FIR filter and slide it along the signal
00000000ABcD0000000ABcD0000000ABcD0000000ABcD0000
000ABcD0000 -- minimum filter output
000ABcD0000 -- some output
000ABcD0000 -- max output
000ABcD0000 -- less output
000ABcD0000 -- max again
000ABcD0000 -- max again

Correlation - essentially filtering a signal with FIR coefficients equal to the signal
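The heartbeat example can be sketched in C++ as a pattern slid along a longer signal. This is a minimal illustration rather than the lab code: the function name `correlateWithTemplate` and the numeric stand-ins for the A, B, c, D samples are assumptions.

```cpp
#include <cstddef>
#include <vector>

// out[n] = sum over k of signal[n + k] * pattern[k].
// This is the same arithmetic as an FIR filter whose coefficients
// are the time-reversed pattern.
std::vector<double> correlateWithTemplate(const std::vector<double>& signal,
                                          const std::vector<double>& pattern) {
    std::vector<double> out;
    if (pattern.empty() || signal.size() < pattern.size()) {
        return out;  // nothing to compare
    }
    out.assign(signal.size() - pattern.size() + 1, 0.0);
    for (std::size_t n = 0; n < out.size(); ++n) {
        for (std::size_t k = 0; k < pattern.size(); ++k) {
            out[n] += signal[n + k] * pattern[k];
        }
    }
    return out;
}
```

The output peaks wherever the pattern lines up with a copy of itself in the signal, which is the "max output" position in the sketch above.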

Draw a picture of the situation
Known signal sent to ultrasound transmitter A; noisy signal picked up at receiver B
◦ Do auto-correlation to get the best estimate of the delay
Known signal sent to ultrasound transmitter B; noisy signal picked up at receiver A
◦ Do auto-correlation to get the best estimate of the delay
◦ Differences in delay time are related to the speed of the air in the mine shaft

Mine shaft air speed calculation

Simplest step up from doing examples exactly the same as lab examples

Many standard formats for a complex array (real and imaginary parts)
Components stored alternately in memory: R1, I1, R2, I2, R3, I3 ...
◦ access using dm(IX, MdmX) where MdmX = 2
Components stored in separate blocks: R1, R2, R3, ... I1, I2, I3
◦ access using dm(I1X, MdmP1) and dm(I2X, MdmP1)
◦ or access using dm(IdmX, MdmP1) and pm(IpmX, MpmP1), where MdmP1 and MpmP1 are set to +1 by the compiler

Speed depends on format used and what you are doing with values
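The two storage formats can be compared in C++ with a small conversion sketch. The type `SplitComplex` and the function `deinterleave` are hypothetical names used only for illustration; the stride of 2 in the loop plays the role of MdmX = 2.

```cpp
#include <cstddef>
#include <vector>

// Split-block format: all real parts in one array, all imaginary parts in another.
struct SplitComplex {
    std::vector<float> re;
    std::vector<float> im;
};

// Convert from interleaved R1, I1, R2, I2, ... by stepping through memory
// with a stride of 2. A trailing unpaired value is ignored.
SplitComplex deinterleave(const std::vector<float>& interleaved) {
    SplitComplex out;
    for (std::size_t i = 0; i + 1 < interleaved.size(); i += 2) {
        out.re.push_back(interleaved[i]);
        out.im.push_back(interleaved[i + 1]);
    }
    return out;
}
```

With the split format the real and imaginary arrays can each be walked with a stride of 1, which is what makes a paired dm and pm fetch possible.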

Many algorithms use complex numbers.

complex CalculateComplexCorrelation (complex firstArray[ ], int numPts, int offset) {
    complex correlation = 0 + j0;      // Missing piece of code
    for (int k = 0; k < numPts - offset; k++) {
        // Could be other forms of the algorithm
        // This is more "autocorrelation" - comparing signal to itself
        // Would work best when information of interest is in the centre of the signals
        correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
    }
    return correlation;
}
Repeat many times along firstArray for different offsets
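For reference, the pseudocode can be made compilable with the standard library. This is a sketch, assuming `std::complex<float>` stands in for the course's `complex` type and `std::conj` for `Conjugate`; the caller must ensure numPts does not exceed the array length.

```cpp
#include <complex>
#include <vector>

// Direct translation of the pseudocode: sum of x[k] * conj(x[k + offset]).
std::complex<float> CalculateComplexCorrelation(
        const std::vector<std::complex<float>>& firstArray,
        int numPts, int offset) {
    std::complex<float> correlation(0.0f, 0.0f);
    for (int k = 0; k < numPts - offset; k++) {
        correlation += firstArray[k] * std::conj(firstArray[k + offset]);
    }
    return correlation;
}
```

A version like this is useful as the reference answer when testing the hand-written assembly.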

Auto-correlation, cross-correlation and convolution are all equivalent to FIR operations where the FIR coefficients are data values rather than fixed values

// How do you return a complex value? Don't know
// Two choices - in R0 (real part) and R1 (imaginary part)
// more likely (another exam) switch to SIMD mode and use R0 and S0

Example using midterm question

There is absolutely no point trying to optimize a loop that calls a subroutine / function
◦ The cost of setting up the subroutine call (handling incoming parameters and return values) and jumping in and out of the subroutine swamps any gain made inside the loop
The question reminded you of this
◦ Assume that the Conjugate function is in-lined for speed
◦ That means you need to write out the equation with the in-lined code

Optimization

Enter and exit CalculateCorrelation( ) - 20 cycles
Set up pointers inpar_Rx, Ix - 30 cycles
Set up and use hardware loop - 20 cycles
Set up sum < 10 cycles
So basically the timing is (numPts - offset) * loop body cycle count

correlation = realCorrelation + j * imagCorrelation
correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset])
            = realCorrelation + j * imagCorrelation + (a + jb) * (c - jd)     -- second value read in as c + jd
or
RC + jIC = RC + jIC + a*c + b*d + j(-a*d + c*b)

Work out the code timing

RC + jIC = RC + jIC + a*c + b*d + j(-a*d + c*b)

Means two sets of calculations (RX0 does not mean R0)
RC_RX0 = RC_RX0 + a*c + b*d
and
IC_RX1 = IC_RX1 + (-a*d + c*b)

Looks like 8 memory accesses per tap (point): fetch a, b, c, d TWICE
Actually could optimize to 4 fetches and reuse a, b, c, d, IF there are enough registers to store the fetched values and do all the calculations when we unroll the loop and have to cope with memory access delays
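The two accumulations can be written out in C++ with the conjugate in-lined and each of a, b, c, d fetched once per point. This is a sketch assuming the split-block format (separate real and imaginary arrays); `CorrelationParts` and `correlateSplit` are hypothetical names.

```cpp
// Real and imaginary accumulators, RC and IC in the notes.
struct CorrelationParts {
    float re;
    float im;
};

// re[] and im[] hold the real and imaginary parts in separate blocks.
CorrelationParts correlateSplit(const float* re, const float* im,
                                int numPts, int offset) {
    float rc = 0.0f;
    float ic = 0.0f;
    for (int k = 0; k < numPts - offset; k++) {
        const float a = re[k];           // 4 fetches per point,
        const float b = im[k];           // each value reused twice
        const float c = re[k + offset];
        const float d = im[k + offset];
        rc += a * c + b * d;             // RC = RC + a*c + b*d
        ic += -a * d + c * b;            // IC = IC + (-a*d + c*b)
    }
    return CorrelationParts{rc, ic};
}
```

Each point costs 4 multiplies and 4 adds, which is the operation count used in the timing estimate.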

Loop body

Reference sheet says
MULTIFUNCTION COMPUTE OPERATION - on certain registers only, unlike standard COMPUTE
Multiplication: FN = FQ * FR, with FQ = F(0,1,2,3) and FR = F(4,5,6,7)
ALU compute: FN = FX op FY, with FX = F(8,9,10,11) and FY = F(12,13,14,15)

So when doing RC_RX0 = RC_RX0 + a*c + b*d
◦ bring a and b into F(0,1,2,3); bring c and d into F(4,5,6,7)
◦ store the a * c result into F(8,9,10,11) and store the b * d result into F(12,13,14,15)
◦ store the a * c + b * d result into F(8,9,10,11), which would work if RC_RX0 was in F(12,13,14,15)

Questions to answer
1) Why?
2) How do we handle IC_R1 = IC_R1 + (-a*d + c*b) given the way the registers were being used by the RC_RX0 = RC_RX0 + a*c + b*d calculations?

Look ahead hint for Midterm 2 and Lab 2 parallel (super-scalar) instructions

RC_R0 = RC_R0 + a*c + b*d
and
IC_R1 = IC_R1 + (-a*d + c*b)

Looks like 8 memory accesses per tap (point), but actually could optimize to 4 and reuse (IF there are enough registers)
4 multiplies and 4 adds
Can (if we switch into SIMD mode) do 2 multiplications + 2 adds + 4 memory accesses per cycle

2 cycles needed per point in SIMD mode
Time: 2 * Numpoints / 500 us < 50% of 10 us (the sample period at 96 kHz)
Will work provided Numpoints < 5000 / 4

Problem to solve if working in SIMD mode - make sure that we don't end up with a in register R1 and c in register S1, because then we can't multiply them together
Could we unroll the loop so that we do the first dm, pm fetch into R1 and R4 and have SIMD do the (hidden) second dm, pm fetch into S1 and S4?
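The timing estimate can be checked with a few lines of C++. The sketch assumes that the "/ 500" in the estimate means 500 cycles per microsecond (a 500 MHz core clock) and rounds the 96 kHz sample period to 10 us as the notes do; the function names are hypothetical.

```cpp
// Time spent in the loop body, in microseconds.
constexpr double loopTimeUs(int numPoints, int cyclesPerPoint, double cyclesPerUs) {
    return static_cast<double>(numPoints) * cyclesPerPoint / cyclesPerUs;
}

// True if the loop uses less than the allowed fraction of one sample period.
constexpr bool fitsInBudget(int numPoints, int cyclesPerPoint, double cyclesPerUs,
                            double samplePeriodUs, double allowedFraction) {
    return loopTimeUs(numPoints, cyclesPerPoint, cyclesPerUs)
           < samplePeriodUs * allowedFraction;
}
```

With 2 cycles per point and a 50% budget, the limit lands at 5000 / 4 = 1250 points, matching the figure above. Without SIMD the cycle count per point doubles and the limit halves.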

Loop body

Even the simplest problem is essentially impossible to translate in the time available - that's why I say GPA A- starts around 80%

You need to demonstrate that
◦ You know what you need to do, so that if you had enough time you could complete it
◦ Really key - you are able to use this knowledge to check that the compiler was doing a good job

15 marks split across the following (16, as the first error is free)
1. REALLY KEY - Design the code before translating it
2. Format of assembly language code and course coding requirements
3. Demonstrate understanding of parameter passing and return - in R registers
4. Need to save and recover registers - know what is volatile and what is not
5. KEY - Need to move passed pointers (in R registers) into I registers
6. How to set up arrays to allow simultaneous dm, pm access
7. Hardware / software loop differences
8. KEY - Post-modify and pre-modify difference
9. KEY - USING F registers when doing mults and adds in multi-function mode
10. Complex number theory and format on DSP processors

Translation

#include <allNecessary files.h>
// How do you return a complex value? Don't know
// Two choices - in R0 (real part) and R1 (imaginary part)
// more likely (Midterm 2) switch to SIMD mode and use R0 and S0

.section seg_pmco;
.global _CalculateComplexCorrelation__NM;
_CalculateComplexCorrelation__NM:

complex CalculateComplexCorrelation (complex firstArray[ ], int numPts, int offset) {
R0, R1 for return values (pretend)
Passing 4 parameters is very complex, as it uses stack operations
Fake it by pretending the parameters arrive in R4 and R16 (dm and pm pointers), R8 and R12 - then move R16 into a real register Rx
R16 is not a real register, it is a fake - the real code would look like Rx = dm(2, SP) - but why learn that when you could cut-and-paste from a C++ code example

Demonstrate basic format and parameter passing

corrReal_F0 = 0.0;
corrImag_F1 = 0.0;
maxLoop_R8 = numPts_R8 - offset_R12;      // This sets Z, N flags
if LE JUMP END;                           // no DB
realPt_I4 = inPar_R4;
imagPt_I12 = inPar_R16;
// Want to handle offset into arrays easily
Save I5 and I13 to stack
// need more R registers
Save R3, R6, R7, R9, R10
inParR4Offset_R4 = inPar_R4 + offset_R12;
inParR4Offset_R5 = inPar_R5 + offset_R12;
realPtOffset_I5 = inParR4Offset_R4
imagPtOffset_I13 = inParR4Offset_R5

// Do a code review and fix the minor bug
correlation = 0 + k0
set up pointers
There are other ways of doing this using modify registers

Give one / two examples of saving things to registers - I would lose marks on my answer

Set up loop using R8 - information should be on the reference sheet
for (int k = 0; k < numPts - offset; k++) {

Accept the loss of some marks - did not have the reference sheet when I did the example

Would look something like this
Modify(SP, 3);
R0 = I5;               // Can't save Ix directly to memory
dm(1, SP) = R0
R0 = I13;              // Can't save Ix directly to memory
dm(2, SP) = R0
// Also there is no pm stack implemented

correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
// Use math explained above
// I am just writing code - not trying to optimize
// Valid code BUT these instructions ARE NOT executed in parallel - wrong syntax, wrong registers for multi-function

// Read real part of one and imaginary part of the other
firstReal_R6 = dm(realPt_I4, DMPLUS1), secondImag_R10 = pm(imagPtOffset_I13, PMPLUS1);
secondReal_R9 = dm(realPtOffset_I5, DMPLUS1), firstImag_R7 = pm(imagPt_I12, PMPLUS1);
// real update
temp_F2 = F6 * F9;
temp_F3 = F7 * F10;
real_F0 = F0 + F2;
real_F0 = F0 + F3;
// imag update - less documented; temp registers used and discarded quickly - okay under exam conditions
temp_F2 = F6 * F10;
temp_F3 = F7 * F9;
imag_F1 = F1 - F2;
imag_F1 = F1 + F3

Rough out the code syntax

END:
Recover registers in reverse order R10, R9, R7, R6, R3
Values already in R0 and R1
5 magic lines to return to C

}
return correlation; (R0 and R1)

Oops - forgot to recover I13 and I5

Demonstrate unrolling the loop - unroll 2 * p times
◦ Unrolling allows us to move (make parallel) parts of the first set of operations and the second set of operations
◦ In real life - may unroll up to 8 times to find parallel operations - demonstrate the concept in the midterm (time)
If switching to SIMD - unroll 4 * p times
Write the optimization design using C++ syntax
◦ Don't switch to assembly code until the VERY last moment
◦ Write in the simplest possible version of C
Concentrate on the loop, as that is where we get the speed

Step 2 – Lab 2 optimizing

for (int k = 0; k < numPts - offset; k++) {
    correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
}

Becomes

for (int k = 0; k < numPts - offset; k = k + 2) {
    correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
    correlation = correlation + firstArray[k + 1] * Conjugate(firstArray[k + offset + 1]);
}

Problem 1 - Can't switch to SIMD mode if k + offset is not divisible by 2
SIMD mode does R0 = dm[2 * x] and S0 = dm[2 * x + 1]
Meaning it can do the dual fetch dm[1000], dm[1001], but not dm[1001], dm[1002]
Means our speed estimate is out by a factor of 2 since we can't switch to SIMD mode - or, if we do switch, the code must become more complex - so don't switch to SIMD
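The alignment rule can be expressed as a small check. This is a sketch of the rule as stated in the notes (a paired fetch covers dm[2 * x] and dm[2 * x + 1]); the function names are hypothetical.

```cpp
// A SIMD paired fetch loads dm[2 * x] and dm[2 * x + 1],
// so the first index of the pair must be even.
constexpr bool simdPairFetchAllowed(int firstIndex) {
    return firstIndex % 2 == 0;
}

// With k stepping 0, 2, 4, ... the fetch at k is always aligned,
// so the fetch at k + offset is aligned only when offset is even.
constexpr bool canUseSimdForOffset(int offset) {
    return simdPairFetchAllowed(0) && simdPairFetchAllowed(offset);
}
```

Because the correlation is repeated for many offsets, roughly half of the calls would fail this check, which is why the notes recommend staying out of SIMD mode.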

Unroll the loop – problem 1
replace k? by k to avoid complex

for (int k = 0; k < numPts - offset; k++) {
    correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
}

Becomes

If (numPts - offset) is even then the unrolled code becomes
for (int k = 0; k < numPts - offset; k = k + 2) {
    correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
    correlation = correlation + firstArray[k + 1] * Conjugate(firstArray[k + offset + 1]);
}

Else
for (int k = 0; k < numPts - offset - 1; k = k + 2) {
    correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
    correlation = correlation + firstArray[k + 1] * Conjugate(firstArray[k + offset + 1]);
}
k = numPts - offset - 1;
correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
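The even and odd cases can be folded into one routine that unrolls by 2 and then mops up a single leftover point. This is a sketch using `std::complex<double>` in place of the course's `complex` type; `correlationUnrolledBy2` is a hypothetical name.

```cpp
#include <complex>
#include <vector>

using Complex = std::complex<double>;

Complex correlationUnrolledBy2(const std::vector<Complex>& firstArray,
                               int numPts, int offset) {
    Complex correlation(0.0, 0.0);
    const int count = numPts - offset;          // number of points to sum
    const int evenCount = count - (count % 2);  // largest even count not above it
    int k = 0;
    for (; k < evenCount; k += 2) {
        correlation += firstArray[k] * std::conj(firstArray[k + offset]);
        correlation += firstArray[k + 1] * std::conj(firstArray[k + offset + 1]);
    }
    if (k < count) {                            // odd count: one point left over
        correlation += firstArray[k] * std::conj(firstArray[k + offset]);
    }
    return correlation;
}
```

The result should match the simple loop for every offset, which makes the simple loop a convenient test oracle for the unrolled version.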

Unroll the loop – problem 2

correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
correlation = correlation + firstArray[k + 1] * Conjugate(firstArray[k + offset + 1]);

correlation = correlation + (a[k] + jb[k]) * (a[k + offset] - jb[k + offset]);
correlation = correlation + (a[k + 1] + jb[k + 1]) * (a[k + offset + 1] - jb[k + offset + 1]);

Look at the real part only - use
correlationRe = correlationRe + (a[k] * a[k + offset]) + (b[k] * b[k + offset])
correlationRe = correlationRe + (a[k + 1] * a[k + offset + 1]) + (b[k + 1] * b[k + offset + 1])

Simplify code -- in line

Note register renaming. Use this approach in case there are unexpected timing delays; then we can interlink the 2 unrolls. Plan to put the imag array on pm access.

Temp1 = a[k];
Temp2 = a[k + offset];
Mult3 = temp1 * temp2;
Temp4 = b[k];
Temp5 = b[k + offset];
Mult6 = temp4 * temp5;
corrRe = corrRe + Mult3;
corrRe = corrRe + Mult6;

Temp11 = a[k + 1];
Temp12 = a[k + offset + 1];
Mult13 = temp11 * temp12;
Temp14 = b[k + 1];
Temp15 = b[k + offset + 1];
Mult16 = temp14 * temp15;
corrRe = corrRe + Mult13;
corrRe = corrRe + Mult16;
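The one-operation-per-line design can be compiled as it stands, which lets it be tested before any assembly is written. This sketch covers the real part only and assumes the point count is even; `corrRealUnrolled` is a hypothetical name.

```cpp
// Real part only, unrolled twice, one fetch, multiply or add per line.
// Assumes count (numPts - offset) is even; a[] is real, b[] is imaginary.
float corrRealUnrolled(const float* a, const float* b, int count, int offset) {
    float corrRe = 0.0f;
    for (int k = 0; k < count; k += 2) {
        float temp1 = a[k];
        float temp2 = a[k + offset];
        float mult3 = temp1 * temp2;
        float temp4 = b[k];
        float temp5 = b[k + offset];
        float mult6 = temp4 * temp5;
        corrRe = corrRe + mult3;
        corrRe = corrRe + mult6;

        float temp11 = a[k + 1];
        float temp12 = a[k + offset + 1];
        float mult13 = temp11 * temp12;
        float temp14 = b[k + 1];
        float temp15 = b[k + offset + 1];
        float mult16 = temp14 * temp15;
        corrRe = corrRe + mult13;
        corrRe = corrRe + mult16;
    }
    return corrRe;
}
```

Each temporary has its own name, so the two halves of the unroll have no false dependencies and their lines can be interleaved freely in the resource chart.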

One operation per line of code

Use this order because of the instruction format
On certain registers only, unlike standard COMPUTE
Multiplication: FN = FQ * FR, with FQ = F(0,1,2,3) and FR = F(4,5,6,7)
ALU compute: FN = FX op FY, with FX = F(8,9,10,11) and FY = F(12,13,14,15)

Resource chart columns: Other | Mult | Add | DM | PM

Start with every operation from the one-operation-per-line listing in the Other column, one per row. Then move each fetch into the DM or PM column and each multiply or add into the Mult or Add column, and slide operations up into free slots in earlier rows.




Cycle | Mult                     | Add                      | DM                         | PM
1     |                          |                          | Temp1 = a[k]               | Temp4 = b[k]
2     |                          |                          | Temp2 = a[k + offset]      | Temp5 = b[k + offset]
3     | Mult3 = temp1 * temp2    |                          | Temp11 = a[k + 1]          | Temp14 = b[k + 1]
4     | Mult6 = temp4 * temp5    |                          | Temp12 = a[k + offset + 1] | Temp15 = b[k + offset + 1]
5     | Mult13 = temp11 * temp12 | corrRe = corrRe + Mult3  | Imag fetches               |
6     | Mult16 = temp14 * temp15 | corrRe = corrRe + Mult6  | Imag fetches               |
7     | imag mult 1              | corrRe = corrRe + Mult13 | Imag fetches               |
8     | imag mult 2              | corrRe = corrRe + Mult16 | Imag fetches               |
9     | imag mult 1              | imag add 1               |                            |
10    | imag mult 2              | imag add 2               |                            |
11    |                          | imag add 1               |                            |
12    |                          | imag add 2               |                            |

Efficiency 8 in 12


Multiplication: FN = FQ * FR, with FQ = F(0,1,2,3) and FR = F(4,5,6,7)
ALU compute: FN = FX op FY, with FX = F(8,9,10,11) and FY = F(12,13,14,15)

Operation                 | Registers
Mult3 = temp1 * temp2     | F2, F5
Mult6 = temp4 * temp5     | F3, F6
Mult13 = temp11 * temp12  | ?, ?
Mult16 = temp14 * temp15  | ?, ?
corrRe = corrRe + Mult3   | F0, F0 - illegal use of F0
corrRe = corrRe + Mult6   | F0, F0

Still to be given registers: Temp11 = a[k + 1], Temp14 = b[k + 1], corrRe = corrRe + Mult13, corrRe = corrRe + Mult16, imag mult 1, imag mult 2, imag add 1, imag add 2

What register for Mult3?
What register for Temp11?
