Lab 2 implementation demonstrated using a Midterm Question
Send in audio signals and use sharp FIR filter to pick out 42 Hz and 59 Hz signals and send out warning tones
◦ Try FIR filter of 256 taps, down sample and then use FIR filter of 256 taps – equivalent to 1 FIR filter of 256 * 256 taps with a bandwidth of 96000 / (256 * 256) Hz
◦ Use code from Lab 0, Lab 1, Assignment 1 as much as possible
Develop C++ version (show that it fails unless the code is optimized)
Assignment 1 – Modify your Lab 1 assembly code to demonstrate (test and audio) speed improvement for the following steps
◦ 1) software to hardware loop
◦ 2) parallel dm, pm access, don’t unroll loop
◦ 3) parallel dm, pm access, unroll loop 4 times. Don’t move code outside loop, do parallel dm, pm access in parallel with multiply instructions
◦ 4) parallel dm, pm access, unroll loop 4 times. Don’t move code outside loop, do parallel dm, pm access in parallel with multiply and add instructions
Remember to provide resource chart and compare your timing to expected
Lab 2 requirements
Can the processor meet the requirements?
Two forms of the code – which one is needed
◦ Grab one audio value -- process everything before the next individual audio sample
◦ Grab one audio block – collect next audio block and process last audio block before next audio block is collected
Real life – worst case
◦ Each channel needs 2 256-tap FIR filters
◦ Total channels – 42 Hz + harmonics, 19 Hz plus harmonics (19 * 3 = 57 Hz) – say 8 channels
◦ Need to generate audio warning signals
◦ Modify FIR filter coefficients to follow signals – might not be constant frequency
Do the best case timing analysis to see whether algorithm works
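A rough best-case timing check for the worst case above might look like the following sketch. The 500 MHz clock is an assumption (it matches the "/ 500 us" figure used later in these notes), every filter is assumed to run at the full 96 kHz rate, and all names are hypothetical.

```cpp
// ASSUMPTIONS: 500 MHz core clock, every filter runs at the full 96 kHz rate,
// and the loop overhead outside the taps is ignored (best case).
constexpr double kClockHz = 500.0e6;
constexpr double kAudioRateHz = 96000.0;
constexpr int kChannels = 8;           // "say 8 channels"
constexpr int kFiltersPerChannel = 2;  // 2 FIR filters per channel
constexpr int kTapsPerFilter = 256;    // 256 taps each

// Cycles available between two audio samples.
constexpr double CyclesPerSample() { return kClockHz / kAudioRateHz; }

// Taps that must be evaluated for every incoming sample.
constexpr long TapsPerSample() {
    return static_cast<long>(kChannels) * kFiltersPerChannel * kTapsPerFilter;
}

// Fraction of the processor used; a value above 1.0 means it cannot keep up.
constexpr double ProcessorLoad(double cyclesPerTap) {
    return TapsPerSample() * cyclesPerTap / CyclesPerSample();
}
```

Under these assumptions ProcessorLoad(1.0) is below 1 but ProcessorLoad(2.0) is not, so the design only fits if the loop really reaches one tap per cycle.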
Step 1 -- Is it worth the effort?
Similarity between one signal and another, and at what locations the similarity occurs
Have a heart beat signal 000ABcD0000
Have a signal from patient running 00000000ABcD0000000ABcD0000000ABcD0000
Use 0000DcAB0000 as coefficients in FIR filter
00000000ABcD0000000ABcD0000000ABcD0000000ABcD0000 -- minimum filter output
000ABcD0000 -- some output
000ABcD0000 -- max output
000ABcD0000 -- less output
000ABcD0000 – max again
000ABcD0000 – max again
Correlation – Essentially Filtering a signal with FIR coeffs equal to the signal
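The sliding-template idea can be sketched in C++ as below. The function names and the use of std::vector are illustrative choices, not the lab code. Running an FIR filter with the reversed template (DcAB) as coefficients gives the same sums.

```cpp
#include <cstddef>
#include <vector>

// Slide the template along the signal and record the similarity at each
// position: out[n] = sum over i of signal[n + i] * templ[i].
std::vector<double> Correlate(const std::vector<double>& signal,
                              const std::vector<double>& templ) {
    std::vector<double> out;
    if (templ.empty() || signal.size() < templ.size()) return out;
    out.resize(signal.size() - templ.size() + 1);
    for (std::size_t n = 0; n < out.size(); ++n) {
        double sum = 0.0;
        for (std::size_t i = 0; i < templ.size(); ++i) {
            sum += signal[n + i] * templ[i];
        }
        out[n] = sum;
    }
    return out;
}

// Position of the first largest output, i.e. where the template fits best.
std::size_t IndexOfMax(const std::vector<double>& values) {
    std::size_t best = 0;
    for (std::size_t n = 1; n < values.size(); ++n) {
        if (values[n] > values[best]) best = n;
    }
    return best;
}
```

With a template such as {1, 2, -1, 3} buried twice in a run of zeros, the output peaks at exactly the two positions where the template lines up with a copy of itself.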
Draw a picture of the situation
Known signal sent to ultrasound transmitter A
Noisy signal picked up at receiver B
◦ Do correlation (cross-correlation of received signal with the known signal) to get best estimate of delay
Known signal sent to ultrasound transmitter B
Noisy signal picked up at receiver A
◦ Do correlation (cross-correlation of received signal with the known signal) to get best estimate of delay
◦ Differences in delay time are related to speed of air in mine shaft
Mine shaft air speed calculation
Simplest step up from doing examples exactly the same as lab examples
Many standard formats
Complex array – real and imaginary
◦ Components stored alternately in memory: R1, I1, R2, I2, R3, I3 … access using dm(IX, MdmX) where MdmX = 2
◦ Components stored in alternate blocks: R1, R2, R3, … I1, I2, I3 … access using dm(I1X, MdmP1) and dm(I2X, MdmP1), or access using dm(IdmX, MdmP1) and pm(IpmX, MpmP1), where MdmP1 and MpmP1 are set to +1 by compiler
Speed depends on format used and what you are doing with values
Many algorithms use complex numbers.
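The two storage formats can be compared in plain C++. This is a sketch with made-up names: the interleaved form steps through memory in strides of 2, while the split form keeps two arrays that could sit in different memory blocks (dm and pm).

```cpp
#include <cstddef>
#include <vector>

// Split format: R1, R2, R3, ... in one block and I1, I2, I3, ... in another.
struct SplitComplex {
    std::vector<float> re;
    std::vector<float> im;
};

// Interleaved format: R1, I1, R2, I2, ... The real part of point n sits at
// index 2 * n, which is the stride-2 access (modify register = 2).
float RealAt(const std::vector<float>& interleaved, std::size_t n) {
    return interleaved[2 * n];
}

float ImagAt(const std::vector<float>& interleaved, std::size_t n) {
    return interleaved[2 * n + 1];
}

// Convert interleaved storage into split storage (stride 1 in each block).
SplitComplex Deinterleave(const std::vector<float>& interleaved) {
    SplitComplex out;
    for (std::size_t i = 0; i + 1 < interleaved.size(); i += 2) {
        out.re.push_back(interleaved[i]);
        out.im.push_back(interleaved[i + 1]);
    }
    return out;
}
```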
complex CalculateComplexCorrelation (complex firstArray[ ], int numPts, int offset) {
    complex correlation = 0 + j0   -- Missing piece of code
    for (int k = 0; k < numPts - offset; k++) {
        // Could be other forms of the algorithm
        // This is more “autocorrelation” – comparing signal to itself
        // Would work best when information of interest is in the centre of the signals
        correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
    }
    return correlation;
}
Repeat many times along firstArray for different offsets
Auto-correlation, cross-correlation and convolution are all equivalent to FIR operations where the FIR coefficients are data values rather than fixed values
// How do you return a complex value? Don’t know
// Two choices – in R0 (real part) and R1 (imaginary part)
// more likely (Another exam) switch to SIMD mode and use R0 and S0
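For reference, a compilable version of the midterm function is sketched here using std::complex. The std::vector parameter and the Complex alias are substitutions made for this sketch and are not the course's complex type.

```cpp
#include <complex>
#include <vector>

using Complex = std::complex<float>;  // stand-in for the course "complex" type

// Sum of firstArray[k] * conj(firstArray[k + offset]) over the overlapping
// points. The caller must ensure numPts does not exceed firstArray.size().
Complex CalculateComplexCorrelation(const std::vector<Complex>& firstArray,
                                    int numPts, int offset) {
    Complex correlation(0.0f, 0.0f);  // correlation = 0 + j0
    for (int k = 0; k < numPts - offset; k++) {
        correlation = correlation + firstArray[k] * std::conj(firstArray[k + offset]);
    }
    return correlation;
}
```

Repeating the call for a range of offsets gives the correlation as a function of lag.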
Example using midterm question
There is absolutely no point trying to optimize a loop that calls a subroutine / function
◦ The cost of setting up the subroutine call (handling incoming parameters and return values) and jumping in and out of the subroutine dominates the loop time
Question reminded you of this
◦ Assume that the Conjugate function is in-lined for speed.
◦ That means you need to go and write out the equation with inlined code
Optimization
Enter and exit CalculateCorrelation( ) – 20 cycles
Set up pointers inpar_Rx Ix – 30 cycles
Set up and use hardware loop – 20 cycles
Set up sum < 10 cycles
So basically timing is (numPts – offset) * loop body count
correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
realCorrelation + j imagCorrelation = realCorrelation + j imagCorrelation + (a + jb) * (c – jd)   -- read in as c + jd
or RC + jIC = RC + jIC + a*c + b*d + j(-a*d + c*b)
Work out the code timing
RC + jIC = RC + jIC + a*c +b *d + j( - a * d + c * b)
Means -- two sets of calculations
RC_RX0 = RC_RX0 + a*c + b*d   (RX0 does not mean R0)
and IC_RX1 = IC_RX1 + (-a*d + c*b)
Looks like 8 memory accesses per tap (point): fetch a, b, c, d TWICE
Actually could optimize to 4 fetches and reuse a, b, c, d, IF there are enough registers to store the fetched values and do all the calculations when we unroll the loop and have to cope with memory access delays
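The two sets of calculations can be written as a small C++ helper with the conjugate inlined. This is a sketch: the Accumulator struct and function name are hypothetical. Each of a, b, c, d is fetched once and used twice, which is the "4 fetches and reuse" case.

```cpp
// Running totals RC (real correlation) and IC (imaginary correlation).
struct Accumulator {
    float rc;
    float ic;
};

// One tap with the conjugate inlined: (a + jb) * (c - jd),
// where a + jb is firstArray[k] and c + jd is firstArray[k + offset].
inline void AccumulateTap(Accumulator& acc, float a, float b, float c, float d) {
    acc.rc = acc.rc + a * c + b * d;     // RC = RC + a*c + b*d
    acc.ic = acc.ic + (-a * d + c * b);  // IC = IC + (-a*d + c*b)
}
```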
Loop body
Reference sheet says
MULTIFUNCTION COMPUTE OPERATION – on certain registers only, unlike standard COMPUTE
Multiplication FN = FQ * FR, with FQ = F(0,1,2,3) and FR = F(4,5,6,7)
ALU Compute FN = FX op FY, with FX = F(8,9,10,11), FY = F(12,13,14,15)
So when doing this RC_RX0 = RC_RX0 + a*c + b*d
◦ bring a and b into F(0,1,2,3); bring c and d into F(4,5,6,7)
◦ store a * c result into F(8,9,10,11) and store b * d result into F(12,13,14,15)
◦ store a * c + b * d result into F(8,9,10,11), which would work if RC_RX0 was in F(12,13,14,15)
Questions to answer
1) Why?
2) How do we handle IC_R1 = IC_R1 + (-a*d + c*b) given the way the registers were being used by the RC_RX0 = RC_RX0 + a*c + b*d calculations
Look ahead hint for Midterm 2 and Lab 2 parallel (super-scalar) instructions
RC_R0 = RC_R0 + a*c + b*d
and IC_R1 = IC_R1 + (-a*d + c*b)
Looks like 8 memory accesses per tap (point), but actually could optimize to 4 and reuse (IF there are enough registers)
4 multiplies and 4 adds
Can (if switch into SIMD mode) do 2 multiplications + 2 adds + 4 memory accesses per cycle
2 cycles needed per point in SIMD mode
Time = 2 * Numpoints / 500 us (500 cycles per us) < 50% of 10 us (at 96 kHz)
Will work provided Numpoints < 5000 / 4
Problem to solve if working with SIMD mode – make sure that we don’t end up with a in register R1 and c in register S1 because then can’t multiply together
Could we -- unroll loop so do first dm pm fetch into R1 and R4 and have SIMD do the (hidden) second dm pm fetch into S1 and S4
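The SIMD timing estimate can be reproduced with the sketch below. It takes 500 cycles per us and a 10 us sample period from the figures above; the 50% budget leaves the other half of each period for everything else. Names are illustrative only.

```cpp
// Figures taken from the estimate above; treat them as ASSUMPTIONS about the board.
constexpr double kCyclesPerMicrosecond = 500.0;  // 500 MHz clock
constexpr double kSamplePeriodUs = 10.0;         // roughly 1 / 96 kHz
constexpr double kBudgetFraction = 0.5;          // use at most 50% of the period

// Time spent in the correlation loop, ignoring set-up overhead.
constexpr double LoopTimeUs(int numPoints, int cyclesPerPoint) {
    return static_cast<double>(cyclesPerPoint) * numPoints / kCyclesPerMicrosecond;
}

// True when the loop fits inside the allowed share of one sample period.
constexpr bool FitsBudget(int numPoints, int cyclesPerPoint) {
    return LoopTimeUs(numPoints, cyclesPerPoint) < kBudgetFraction * kSamplePeriodUs;
}
```

With 2 cycles per point the limit falls at Numpoints < 1250 (5000 / 4). If SIMD mode cannot be used and the loop needs 4 cycles per point, the limit halves.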
Loop body
Even the simplest problem is essentially impossible to translate in the time available – that’s why I say GPA A- starts around 80%
You need to demonstrate that
◦ You know what you need to do; so that if you had enough time you could complete
◦ Really key – able to use this knowledge to check that the compiler was doing a good job
15 marks split across the following (16 as first error is free)
1. REALLY KEY – Design the code before translating it
2. Format of assembly language code and course coding requirements
3. Demonstrate understanding of parameter passing and return – in R registers
4. Need to save and recover registers – know what is volatile and what is not
5. KEY -- Need to move passed pointers (in R registers) into I registers
6. How to set up arrays to allow simultaneous dm, pm access
7. Hardware / software loop differences
8. KEY -- Post-modify and pre-modify difference
9. KEY -- USING F registers when doing mults and adds in multi-function mode
10. Complex number theory and format on DSP processors
Translation
#include <allNecessary files.h>
// How do you return a complex value? Don’t know
// Two choices – in R0 (real part) and R1 (imaginary part)
// more likely (Midterm 2) switch to SIMD mode and use R0 and S0
.section seg_pmco;
.global _CalculateComplexCorrelation__NM;
_CalculateComplexCorrelation__NM:
complex CalculateComplexCorrelation (complex firstArray[ ], int numPts, int offset) {
R0, R1 for return values (pretend)
Passing 4 parameters is very complex as it uses stack operations
Fake by pretending R4 and R16 (dm and pm pointer), R8, R12 – then move R16 into real register Rx
R16 is a fake, not a real register – real code would look like Rx = dm(2, SP) – but why learn that when you could cut-and-paste from a C++ code example
Demonstrate basic format and parameter passing
corrReal_F0 = 0.0;
corrImag_F1 = 0.0;
maxLoop_R8 = numPts_R8 - offset_R12;   // This sets Z, N flags
if LE JUMP END;   // no DB
realPt_I4 = inPar_R4;
imagPt_I12 = inPar_R16;
// Want to handle offset into arrays easily
Save I5 and I13 to stack
// need more R registers
Save R3, R6, R7, R9, R10
inParR4Offset_R4 = inPar_R4 + offset_R12;
inParR4Offset_R5 = inPar_R5 + offset_R12;
realPtOffset_I5 = inParR4Offset_R4
imagPtOffset_I13 = inParR4Offset_R5
// Do a code review and fix the minor bug
correlation = 0 + k0
set up pointers
There are other ways of doing this using modify registers
Give one / two examples of saving things to registers – I would lose marks on my answer
set up loop using R8 – information should be on reference sheet
for (int k = 0; k < numPts - offset; k++) {
Accept the loss of some marks
Did not have reference sheet when I did the example
Would look something like this
Modify(SP, 3);
R0 = I5;    // Can’t save Ix directly to memory
dm(1, SP) = R0;
R0 = I13;   // Can’t save Ix directly to memory
dm(2, SP) = R0;
// Also there is no pm stack implemented
// Read real part of 1 and complex part of other
firstReal_R6 = dm(realPt_I4, DMPLUS1), secondImag_R10 = pm(imagPtOffset_I13, PMPLUS1);
secondReal_R9 = dm(realPtOffset_I5, DMPLUS1), firstImag_R7 = pm(imagPt_I12, PMPLUS1);
// real update
temp_F2 = F6 * F9;
temp_F3 = F7 * F10;
real_F0 = F0 + F2;
real_F0 = F0 + F3;
// imag update – less documented
temp_F2 = F6 * F10;
temp_F3 = F7 * F9;
imag_F1 = F1 - F2;
imag_F1 = F1 + F3;
correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
// Use math explained above
// I am just writing code – not trying to optimize
// Valid code BUT these instructions ARE NOT executed in parallel – wrong syntax, wrong registers for multi-function
// temp registers used and discarded quickly – okay under exam conditions
Rough out the code syntax
END:
Recover registers in reverse order R10, R9, R7, R6, R3
Values already in R0 and R1
5 magic lines to return to C
}
return correlation; (R0 and R1)
Oops - forgot to recover I13 and I5
Demonstrate unroll loop – unroll 2 * p times
◦ Unrolling allows us to move (make parallel) parts of the first set of operations and second operations
◦ In real life – may unroll up to 8 times to find parallel operations – demonstrate concept in midterm (time)
If switching to SIMD -- unroll 4 * p times
Write the optimization design using C++ syntax
◦ Don’t switch to assembly code until VERY last moments
◦ Write in the simplest possible version of C
Concentrate on the loop as that is where we get the speed
Step 2 – Lab 2 optimizing
for (int k = 0; k < numPts - offset; k++) {
    correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
}
Becomes
for (int k = 0; k < numPts - offset; k = k + 2) {
    correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
    correlation = correlation + firstArray[k + 1] * Conjugate(firstArray[k + offset + 1]);
}
Problem 1 – Can’t switch to SIMD mode if k + offset is not divisible by 2
SIMD mode does R0 = dm[2 * x] and S0 = dm[2 * x + 1]
Meaning it can do dual fetch dm[1000], dm[1001], but not dm[1001], dm[1002]
Means our speed estimate is out by factor of 2 since we can’t switch to SIMD mode – or if we do switch -- code must become more complex – so don’t switch to SIMD
Unroll the loop – problem 1
replace k? by k to avoid complex
for (int k = 0; k < numPts - offset; k++) {
    correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
}
Becomes
If (numPts - offset) is even then unrolled code becomes
for (int k = 0; k < numPts - offset; k = k + 2) {
    correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
    correlation = correlation + firstArray[k + 1] * Conjugate(firstArray[k + offset + 1]);
}
Else
for (int k = 0; k < numPts - offset - 1; k = k + 2) {
    correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
    correlation = correlation + firstArray[k + 1] * Conjugate(firstArray[k + offset + 1]);
}
k = numPts - offset - 1;
correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
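The even and odd cases can be folded into one routine, sketched here with std::complex standing in for the course complex type. CorrelationRolled is the original loop, kept only so the two versions can be compared; both names are hypothetical.

```cpp
#include <complex>
#include <vector>

using Complex = std::complex<float>;  // stand-in for the course "complex" type

// Original (rolled) loop, kept as the reference version.
Complex CorrelationRolled(const std::vector<Complex>& x, int numPts, int offset) {
    Complex correlation(0.0f, 0.0f);
    for (int k = 0; k < numPts - offset; k++) {
        correlation = correlation + x[k] * std::conj(x[k + offset]);
    }
    return correlation;
}

// Unrolled twice, with one clean-up tap when (numPts - offset) is odd.
Complex CorrelationUnrolled2(const std::vector<Complex>& x, int numPts, int offset) {
    Complex correlation(0.0f, 0.0f);
    const int count = numPts - offset;
    const int evenCount = count - (count % 2);  // largest even value <= count
    int k = 0;
    for (; k < evenCount; k = k + 2) {
        correlation = correlation + x[k] * std::conj(x[k + offset]);
        correlation = correlation + x[k + 1] * std::conj(x[k + offset + 1]);
    }
    if (k < count) {  // odd case: k == numPts - offset - 1
        correlation = correlation + x[k] * std::conj(x[k + offset]);
    }
    return correlation;
}
```

Because the additions happen in the same order in both versions, the results should match exactly, which makes the rolled loop a convenient test oracle for the unrolled one.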
Unroll the loop – problem 2
correlation = correlation + firstArray[k] * Conjugate(firstArray[k + offset]);
correlation = correlation + firstArray[k + 1] * Conjugate(firstArray[k + offset + 1]);
correlation = correlation + (a[k] + jb[k]) * (a[k + offset] - jb[k + offset]);
correlation = correlation + (a[k + 1] + jb[k + 1]) * (a[k + offset + 1] - jb[k + offset + 1]);
Look at real part only -- use
correlationRe = correlationRe + (a[k] * a[k + offset]) + (b[k] * b[k + offset])
correlationRe = correlationRe + (a[k + 1] * a[k + offset + 1]) + (b[k + 1] * b[k + offset + 1])
Simplify code -- in line
Temp1 = a[k];
Temp2 = a[k + offset];
Mult3 = temp1 * temp2
Temp4 = b[k];
Temp5 = b[k + offset];
Mult6 = temp4 * temp5;
corrRe = corrRe + Mult3
corrRe = corrRe + Mult6
Temp11 = a[k + 1];
Temp12 = a[k + offset + 1];
Mult13 = temp11 * temp12
Temp14 = b[k + 1];
Temp15 = b[k + offset + 1];
Mult16 = temp14 * temp15;
corrRe = corrRe + Mult13
corrRe = corrRe + Mult16
Note register renaming
Use this approach in case there are unexpected timing delays; then can interlink the 2 unrolls
Plan to put imag array on pm access
One operation per line of code
Use this order because of instruction format
On certain registers only, unlike standard COMPUTE
Multiplication FN = FQ * FR, with FQ = F(0,1,2,3) and FR = F(4,5,6,7)
ALU Compute FN = FX op FY, with FX = F(8,9,10,11), FY = F(12,13,14,15)
Other Mult add DM PM
Temp1 = a[k] ;
Temp2 = a[k + offset];
Mult3 = temp1 * temp2
Temp4 = b[k];
Temp5 = b[k+offset];
Mult6 = temp4 * temp5;
corrRe = corrRe + Mult3
corrRe = corrRe+ Mult6
Temp11 = a[k+ 1] ;
Temp12 = a[k + offset + 1];
Mult13 = temp11 * temp12
Temp14 = b[k + 1] ;
Temp15 = b[k+offset + 1];
Mult16 = temp14 * temp15;
corrRe = corrRe + Mult13
corrRe = corrRe+ Mult16
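Written as real C++, the one-operation-per-line form of the real part looks like the sketch below. The function name and std::vector arguments are illustrative. Each line maps onto one fetch, one multiply or one add in the resource chart.

```cpp
#include <vector>

// Real-part update for two unrolled taps, one operation per line.
// a holds the real parts, b the imaginary parts (planned for pm access).
float CorrReTwoTaps(const std::vector<float>& a, const std::vector<float>& b,
                    int k, int offset, float corrRe) {
    float temp1 = a[k];                // DM fetch
    float temp2 = a[k + offset];       // DM fetch
    float mult3 = temp1 * temp2;       // multiply
    float temp4 = b[k];                // PM fetch
    float temp5 = b[k + offset];       // PM fetch
    float mult6 = temp4 * temp5;       // multiply
    corrRe = corrRe + mult3;           // add
    corrRe = corrRe + mult6;           // add

    float temp11 = a[k + 1];           // second unroll uses renamed temporaries
    float temp12 = a[k + offset + 1];
    float mult13 = temp11 * temp12;
    float temp14 = b[k + 1];
    float temp15 = b[k + offset + 1];
    float mult16 = temp14 * temp15;
    corrRe = corrRe + mult13;
    corrRe = corrRe + mult16;
    return corrRe;
}
```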
Switch to resource chart
Cycle | Other | Mult | Add | DM | PM
1 | | | | Temp1 = a[k] |
2 | | | | Temp2 = a[k + offset] |
3 | | Mult3 = temp1 * temp2 | | |
4 | | | | | Temp4 = b[k]
5 | | | | | Temp5 = b[k + offset]
6 | | Mult6 = temp4 * temp5 | | |
7 | | | corrRe = corrRe + Mult3 | |
8 | | | corrRe = corrRe + Mult6 | |
9 | | | | Temp11 = a[k + 1] |
10 | | | | Temp12 = a[k + offset + 1] |
11 | | Mult13 = temp11 * temp12 | | |
12 | | | | | Temp14 = b[k + 1]
13 | | | | | Temp15 = b[k + offset + 1]
14 | | Mult16 = temp14 * temp15 | | |
15 | | | corrRe = corrRe + Mult13 | |
16 | | | corrRe = corrRe + Mult16 | |
Cycle | Other | Mult | Add | DM | PM
1 | | | | Temp1 = a[k] | Temp4 = b[k]
2 | | | | Temp2 = a[k + offset] | Temp5 = b[k + offset]
3 | | Mult3 = temp1 * temp2 | | |
4 | | Mult6 = temp4 * temp5 | | |
5 | | | corrRe = corrRe + Mult3 | |
6 | | | corrRe = corrRe + Mult6 | |
7 | | | | Temp11 = a[k + 1] | Temp14 = b[k + 1]
8 | | | | Temp12 = a[k + offset + 1] | Temp15 = b[k + offset + 1]
9 | | Mult13 = temp11 * temp12 | | |
10 | | Mult16 = temp14 * temp15 | | |
11 | | | corrRe = corrRe + Mult13 | |
12 | | | corrRe = corrRe + Mult16 | |
Cycle | Other | Mult | Add | DM | PM
1 | | | | Temp1 = a[k] | Temp4 = b[k]
2 | | | | Temp2 = a[k + offset] | Temp5 = b[k + offset]
3 | | Mult3 = temp1 * temp2 | | Temp11 = a[k + 1] | Temp14 = b[k + 1]
4 | | Mult6 = temp4 * temp5 | | Temp12 = a[k + offset + 1] | Temp15 = b[k + offset + 1]
5 | | Mult13 = temp11 * temp12 | corrRe = corrRe + Mult3 | Imag fetches |
6 | | Mult16 = temp14 * temp15 | corrRe = corrRe + Mult6 | Imag fetches |
7 | | imag mult 1 | corrRe = corrRe + Mult13 | Imag fetches |
8 | | imag mult 2 | corrRe = corrRe + Mult16 | Imag fetches |
9 | | Imag mult 1 | Imag add 1 | |
10 | | imag mult 1 | Imag add 2 | |
11 | | | Imag add 1 | |
12 | | | Imag add 2 | |
Efficiency | | 8 in 12 | 8 in 12 | 8 in 12 |
Multiplication FN = FQ * FR, with FQ = F(0,1,2,3) and FR = F(4,5,6,7)
ALU Compute FN = FX op FY, with FX = F(8,9,10,11), FY = F(12,13,14,15)
Cycle | Other | Mult (registers) | Add (registers) | DM | PM
1 | | | | Temp1 = a[k] | Temp4 = b[k]
2 | | | | Temp2 = a[k + offset] | Temp5 = b[k + offset]
3 | | Mult3 = temp1 * temp2 (F2, F5) | | Temp11 = a[k + 1] | Temp14 = b[k + 1]
4 | | Mult6 = temp4 * temp5 (F3, F6) | | Temp12 = a[k + offset + 1] | Temp15 = b[k + offset + 1]
5 | | Mult13 = temp11 * temp12 (?, ?) | corrRe = corrRe + Mult3 (F0, F0) | |
6 | | Mult16 = temp14 * temp15 (?, ?) | corrRe = corrRe + Mult6 (F0, F0) | |
7 | | imag mult 1 | corrRe = corrRe + Mult13 | |
8 | | imag mult 2 | corrRe = corrRe + Mult16 | |
9 | | | Imag add 1 | |
10 | | | Imag add 2 | |
What register for Mult3? What register for Temp11?
Illegal use of F0