Presented by: Tal Klein, Omer Manor
Digital Interactive Photomontage
• The project focuses on digital photomontage: a computer-assisted framework for combining parts of a set of photographs into a single composite picture
• We focused on one feature: extended depth of field (DOF)
• Extended DOF matters most in macro photography, where the depth of field is very shallow
• Extended DOF allows a photographer to take several pictures of the same frame, focusing on a different area in each picture, and then combine them into one picture using this feature
• Along with its benefits in photography, extended DOF is a "Heavy Resource Consumer" due to the complex calculations & image manipulations it needs; therefore our goal was to speed up this process
System Configuration
• Intel® Core 2 Duo E6600 @ 2.4 GHz
• 2 GB RAM
• Microsoft Windows XP x64
• Due to the nature of our platform (2 cores), we assumed that optimization could achieve a major boost in performance
The Optimization Process
• Analyzing the application
• Code Optimization
• SIMD
• Multithreading
Analyzing The Application
• We analyzed the application in 3 different ways:
1. VTune Performance Analyzer, in order to search for our program's bottlenecks.
2. We added counters of our own to functions we suspected to be called many times.
3. Call graph (using Intel’s VTune).
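The slides do not show the counters that were added to the suspected functions. As a minimal sketch of the idea, assuming a hypothetical `CallCounter` helper and a `COUNT_CALL()` macro (neither name comes from the project), a per-function call count could be collected like this:

```cpp
#include <cassert>
#include <cstdio>
#include <map>
#include <string>

// Hypothetical helper (not from the project): counts how often each
// instrumented function is entered. Not thread-safe as written.
class CallCounter {
public:
    void hit(const std::string& name) { ++counts_[name]; }

    unsigned long count(const std::string& name) const {
        std::map<std::string, unsigned long>::const_iterator it = counts_.find(name);
        return it == counts_.end() ? 0UL : it->second;
    }

    // Print one line per instrumented function.
    void report() const {
        for (const auto& kv : counts_)
            std::printf("%-24s %lu\n", kv.first.c_str(), kv.second);
    }

private:
    std::map<std::string, unsigned long> counts_;
};

// Single shared counter instance.
inline CallCounter& callCounter() {
    static CallCounter instance;
    return instance;
}

// Put this on the first line of a function that is suspected to be hot.
#define COUNT_CALL() callCounter().hit(__func__)

// Stand-in for a small, frequently called function such as displace().
int displaceStub(int x) {
    COUNT_CALL();
    return x + 1;
}
```

Calling `callCounter().report()` at program exit lists the counts. The map lookup adds overhead of its own, so counters like these are better suited to ranking functions by call count than to timing them.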
• transformedPixel() - declared an unnecessary variable.
• getDataCost() - we optimized the code and used SIMD instructions.
• BVZ_interaction_penalty() - we optimized the code by merging two loops into one and used SIMD instructions.
• displace() - we replaced the function call with a macro.
• BVZ_data_penalty() - calls displace(), which we changed into a macro.
• BVZ_Expand() - the function that calls the small functions and consumes the most CPU time. We used multithreading on it.
Code Optimization
• Replacement of FP variables with integer variables when no FP operation is needed
• Merging of 2 consecutive "for loops" into one
• Removal of the first of two assignments to the same pointer when the 1st assignment is never used
• Code replacement (macro) instead of a function call
• Removal of an unnecessary variable declaration
Optimized Code:
float PortraitCut::BVZ_interaction_penalty{
  int c;
  int ap = 0, anp = 0;
  float M = 0;
  int kp = 0, knp = 0;
  unsigned char *Il_np, *Inl_np;
  if (l==nl) return 0;
  unsigned char *Il, *Inl;
  if (_cuttype == C_NORMAL || _cuttype == C_GRAD) {
    Il = _imptr(l,p);
    Inl = _imptr(nl,p);
    Il_np = _imptr(l,np);
    Inl_np = _imptr(nl,np);
    for (c=0; c<3; ++c) {
      kp = Il[c] - Inl[c];
      knp = Il_np[c] - Inl_np[c];
      ap += kp*kp;
      anp += knp*knp;
    }
  }
  M = sqrt(float(ap)) + sqrt(float(anp));
Original Code:
float PortraitCut::BVZ_interaction_penalty{
  int c, k;
  float a, M=0;
  if (l==nl) return 0;
  unsigned char *Il, *Inl;
  if (_cuttype == C_NORMAL || _cuttype == C_GRAD) {
    a=0;
    Il = _imptr(l,p);
    Inl = _imptr(nl,p);
    for (c=0; c<3; ++c) {
      k = Il[c] - Inl[c];
      a += k*k;
    }
    M = sqrt(a);
    a=0;
    Il = _imptr(l,np);
    Inl = _imptr(nl,np);
    for (c=0; c<3; ++c) {
      k = Il[c] - Inl[c];
      a += k*k;
    }
  }
  M += sqrt(a);
Code Optimization: Replacement of FP variables with integer variables; merging of 2 consecutive "for loops" into one
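To make the two changes easier to compare, here is a self-contained sketch of both variants as free functions. The names `penaltyTwoLoops` and `penaltyMerged` and the pointer parameters are assumptions for illustration; the project code is a member function that fetches the pixels through `_imptr`. With 8-bit channels the sum of three squared differences is at most 3 * 255 * 255 = 195075, which a float still represents exactly, so both variants should give the same result.

```cpp
#include <cassert>
#include <cmath>

// Variant 1: two separate loops with a float accumulator.
float penaltyTwoLoops(const unsigned char* Il, const unsigned char* Inl,
                      const unsigned char* Il_np, const unsigned char* Inl_np) {
    float a = 0.f;
    for (int c = 0; c < 3; ++c) {
        int k = Il[c] - Inl[c];
        a += float(k * k);
    }
    float M = std::sqrt(a);

    a = 0.f;
    for (int c = 0; c < 3; ++c) {
        int k = Il_np[c] - Inl_np[c];
        a += float(k * k);
    }
    M += std::sqrt(a);
    return M;
}

// Variant 2: one merged loop with integer accumulators; the conversion
// to float happens once per accumulator, right before the square root.
float penaltyMerged(const unsigned char* Il, const unsigned char* Inl,
                    const unsigned char* Il_np, const unsigned char* Inl_np) {
    int ap = 0, anp = 0;
    for (int c = 0; c < 3; ++c) {
        int kp = Il[c] - Inl[c];
        int knp = Il_np[c] - Inl_np[c];
        ap += kp * kp;
        anp += knp * knp;
    }
    return std::sqrt(float(ap)) + std::sqrt(float(anp));
}
```

The merged variant walks the three colour channels once instead of twice and keeps the accumulation in integer arithmetic.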
Original Code:
for (y=p.y-2, i=0; y<=p.y+2; ++y) {
  for (x=p.x-2; x<=p.x+2; ++x, ++i) {
    I = _id->_imptr(d,Coord(x,y));
    I = _id->images((int)d)->data() + 3*(y * _id->images((int)d)->_size.x + x);
    lum = .3086f * (float)I[0] + .6094f * (float)I[1] + .082f * (float)I[2];
    mean += lum*_gaussianK5[i];
  } // x
} // y
Optimized Code:
for (y=p.y-2, i=0; y<=p.y+2; ++y) {
  for (x=p.x-2; x<=p.x+2; ++x, ++i) {
    I = _id->images((int)d)->data() + 3*(y * _id->images((int)d)->_size.x + x);
    lum = .3086f * (float)I[0] + .6094f * (float)I[1] + .082f * (float)I[2];
    mean += lum*_gaussianK5[i];
  } // x
} // y
Code Optimization: Two assignments to the same pointer without using the 1st assignment
Original Code:
float PortraitCut::BVZ_data_penalty(Coord p, ushort d) {
  assert(0);
  Coord dp = p;
  _displace(dp,d);
Optimized Code:
#define _displacedef(p,l) _idata->images(l)->displace(p)
float PortraitCut::BVZ_data_penalty(Coord p, ushort d) {
  Coord dp = p;
  _displacedef(dp,d);
Code Optimization: Code replacement (macro) instead of function call
Original Code:
const unsigned char* ImageAbs::transformedPixel(Coord p) const {
  if (transformed())
    displace(p);
  if (p>=Coord(0,0) && p<_size) {
    unsigned char *res = _data + 3*(p.y * _size.x + p.x);
    return res;
  }
  else
    return __black;
}
Optimized Code:
const unsigned char* ImageAbs::transformedPixel(Coord p) const {
  if (transformed())
    displace(p);
  if (p>=Coord(0,0) && p<_size) {
    return (_data + 3*(p.y * _size.x + p.x));
  }
  else
    return __black;
}
Code Optimization: Unnecessary variable declaration
Code Optimization: Optimized Code vs. Original Code, Time-Based Comparison
[Bar chart: Original Code vs. Optimized Code]
18% improvement!
SIMD - Single Instruction Multiple Data
• The main point when using SIMD instructions is that 128-bit registers are available to us, so we should use them wisely.
• We used these 128-bit registers in places in our code where we thought they would boost our application's performance
Optimized Code:
float PortraitCut::BVZ_interaction_penalty{
  int c;
  __m128 SimdM;
  int ap = 0, anp = 0;
  float M = 0;
  int kp = 0, knp = 0;
  unsigned char *Il_np, *Inl_np;
  if (l==nl) return 0;
  unsigned char *Il, *Inl;
  if (_cuttype == C_NORMAL || _cuttype == C_GRAD) {
    Il = _imptr(l,p);
    Inl = _imptr(nl,p);
    Il_np = _imptr(l,np);
    Inl_np = _imptr(nl,np);
    for (c=0; c<3; ++c) {
      kp = Il[c] - Inl[c];
      knp = Il_np[c] - Inl_np[c];
      ap += kp*kp;
      anp += knp*knp;
    }
  }
  SimdM = _mm_sqrt_ps(_mm_set_ps(0,0,float(ap),float(anp)));
  M = (SimdM.m128_f32[0] + SimdM.m128_f32[1])/6.f;
Original Code:
float PortraitCut::BVZ_interaction_penalty{
  int c, k;
  float a, M=0;
  if (l==nl) return 0;
  unsigned char *Il, *Inl;
  if (_cuttype == C_NORMAL || _cuttype == C_GRAD) {
    a=0;
    Il = _imptr(l,p);
    Inl = _imptr(nl,p);
    for (c=0; c<3; ++c) {
      k = Il[c] - Inl[c];
      a += k*k;
    }
    M = sqrt(a);
    a=0;
    Il = _imptr(l,np);
    Inl = _imptr(nl,np);
    for (c=0; c<3; ++c) {
      k = Il[c] - Inl[c];
      a += k*k;
    }
    M += sqrt(a);
    M /= 6.f;
SIMD
SIMD
• In the following example we used SIMD to compute the dot product of 2 vectors
• To make the process efficient, the data must be aligned in memory, so we used the __declspec(align(16)) specifier
Optimized Code:
float ContrastCut::getDataCost (Coord p, ushort d) {
  float mean=0, lum, contrast=0;
  const unsigned char* I;
  int y,x, i;
  __declspec(align(16)) float lumarr[25];
  __m128 SimdMult;
  __m128 SimdMean;
  __m128 *pLumArr = (__m128*)lumarr;
  __m128 *pGaussArr = (__m128*)_gaussianK5;
  SimdMean = _mm_set1_ps (0.f);
  for (y=p.y-2, i=0; y<=p.y+2; ++y) {
    for (x=p.x-2; x<=p.x+2; ++x, ++i) {
      I = _id->images((int)d)->data() + 3*(y * _id->images((int)d)->_size.x + x);
      lumarr[i] = .3086f * (float)I[0] + .6094f * (float)I[1] + .082f * (float)I[2];
    } // x
  } // y
  for (i = 0; i < 24 ; i+=4) {
    SimdMult = _mm_mul_ps (*pLumArr, *pGaussArr);
    SimdMean = _mm_add_ps (SimdMult, SimdMean);
    pLumArr++;
    pGaussArr++;
  }
  mean = SimdMean.m128_f32[0] + SimdMean.m128_f32[1] + SimdMean.m128_f32[2] + SimdMean.m128_f32[3];
  mean = (mean+lumarr[24]*_gaussianK5[24])/.9997f;
Original Code:
float ContrastCut::getDataCost (Coord p, ushort d) {
  float mean=0, lum, contrast=0;
  const unsigned char* I;
  int y,x, i;
  for (y=p.y-2, i=0; y<=p.y+2; ++y) {
    for (x=p.x-2; x<=p.x+2; ++x, ++i) {
      I = _id->_imptr(d,Coord(x,y));
      I = _id->images((int)d)->data() + 3*(y * _id->images((int)d)->_size.x + x);
      lum = .3086f * (float)I[0] + .6094f * (float)I[1] + .082f * (float)I[2];
      mean += lum*_gaussianK5[i];
    } // x
  } // y
  mean /= .9997f;
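The dot-product part of this slide can be isolated into a small portable sketch. The function name `dotProduct` and the `HAVE_SSE` guard are assumptions for illustration. The sketch uses standard `alignas(16)` and `_mm_store_ps` in place of the compiler-specific `__declspec(align(16))` and `.m128_f32` member, and it falls back to a scalar loop when SSE is not available. Both input arrays are assumed to be 16-byte aligned, as the slide requires.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

// Dot product of two float arrays of length n.
// Assumption: a and b both start on a 16-byte boundary.
float dotProduct(const float* a, const float* b, std::size_t n) {
    float sum = 0.f;
    std::size_t i = 0;
#if HAVE_SSE
    __m128 acc = _mm_setzero_ps();
    // Four multiply-adds per iteration (24 of the 25 elements for n = 25).
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(a + i), _mm_load_ps(b + i)));
    // Horizontal sum of the four lanes.
    alignas(16) float lanes[4];
    _mm_store_ps(lanes, acc);
    sum = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
#endif
    // Scalar tail: the leftover elements (element 24 for n = 25),
    // or the whole array when SSE is not available.
    for (; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

For the 5x5 window on the slide this would be called as `dotProduct(lumarr, _gaussianK5, 25)`, provided both arrays are declared with 16-byte alignment.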
SIMD
SIMD vs. Original Code: Time-Based Comparison
[Bar chart: Optimized Code vs. Original Code]
1.5% improvement??
SIMD Optimization
SIMD
• Instead of keeping the data (the variables ap & anp) in registers, the compiled code stores it in memory, which blocks store forwarding when the sqrtps instruction loads it
• Using SIMD speeds up the function by approximately 1 sec; however, the delay caused by the blocked store forwarding is larger than the speedup SIMD gained, so we got a slowdown
SimdM = _mm_sqrt_ps (_mm_set_ps(0,0,float(ap), float(anp)));
0041133B xorps xmm0,xmm0
0041133E sub eax,edi
00411340 mov edi,dword ptr [esp+18h]
00411344 add ecx,ebx
00411346 movzx ebx,byte ptr [edi+2]
0041134A mov edi,dword ptr [esp+20h]
0041134E movzx edi,byte ptr [edi+2]
00411352 sub edi,ebx
00411354 mov ebx,eax
00411356 imul ebx,eax
00411359 mov eax,edi
0041135B imul eax,edi
0041135E movss dword ptr [esp+2Ch],xmm0
00411364 movss dword ptr [esp+28h],xmm0
0041136A add ecx,ebx
0041136C cvtsi2ss xmm0,ecx
00411370 movss dword ptr [esp+24h],xmm0
00411376 add edx,eax
M = (SimdM.m128_f32[0] + SimdM.m128_f32[1])/6.f;
if (_cuttype == C_GRAD) {
00411378 cmp dword ptr [esi+50h],1
0041137C cvtsi2ss xmm0,edx
00411380 movss dword ptr [esp+20h],xmm0
00411386 sqrtps xmm0,xmmword ptr [esp+20h]
0041138B movaps xmmword ptr [esp+20h],xmm0
00411390 movss xmm0,dword ptr [esp+24h]
00411396 addss xmm0,dword ptr [esp+20h]
0041139C mulss xmm0,dword ptr [__real@3e2aaaab (5397A0h)]
004113A4 movss dword ptr [esp+0Ch],xmm0
004113AA jne 004114AE
Store Forwarding Blocked: sqrtps loads 16 bytes from [esp+20h] right after the separate 4-byte movss stores to [esp+20h] through [esp+2Ch]
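One possible way to avoid the narrow-store, wide-load pattern is to build the vector from the integer values with SSE2 intrinsics that operate on registers, and to extract the result without the `.m128_f32` member. The following is only a sketch under that assumption: the function name `sqrtSum` is hypothetical, the division by 6 from the slide is left out, and whether a given compiler really keeps everything in registers has to be checked in the disassembly.

```cpp
#include <cassert>
#include <cmath>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

// Returns sqrt(ap) + sqrt(anp) for two non-negative integers.
float sqrtSum(int ap, int anp) {
#if HAVE_SSE2
    // Integer lanes become {anp, ap, 0, 0}; no float array in memory is involved.
    __m128i packed = _mm_unpacklo_epi32(_mm_cvtsi32_si128(anp),
                                        _mm_cvtsi32_si128(ap));
    // Convert all four lanes to float and take four square roots at once.
    __m128 roots = _mm_sqrt_ps(_mm_cvtepi32_ps(packed));
    // Broadcast lane 1, then add it to lane 0 with a scalar add.
    __m128 lane1 = _mm_shuffle_ps(roots, roots, _MM_SHUFFLE(1, 1, 1, 1));
    return _mm_cvtss_f32(_mm_add_ss(roots, lane1));
#else
    // Scalar fallback for targets without SSE2.
    return std::sqrt(float(ap)) + std::sqrt(float(anp));
#endif
}
```

With only two useful lanes out of four, two scalar square roots may be just as fast; measuring both variants is the only reliable way to decide.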
SIMD
Multithreading
• Our major attempt to improve the original application was to divide the massive calculation into two independent threads that run simultaneously, one on each core
• The main procedure used in this application is the function "compute"
Original Compute function flow
Multithreading
ITER_MAX - Defined so the external loop won't loop forever.
N - Number of pictures in the stack.
Step - Image index descriptor.
BVZ_Expand - Calculates max flow on the image's labels and returns the Energy of the current step. According to the calculation, it also updates the final image labels (the outcome)
Inner Loop - Executed one time on each image.
External Loop - Runs as long as there is improvement in the max flow calculation. As long as the old energy (from the previous step) is bigger than the new one, we continue the iterations to the next image. If no improvement in the flow was made, we have achieved the maximum improvement and the function ends.
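The loop structure described by these definitions can be sketched as follows. This is our reading of the flow chart, not the project's code: the name `computeLoop` and the `expand` callback (standing in for BVZ_Expand) are hypothetical, and the energy is assumed never to increase from one step to the next.

```cpp
#include <cassert>
#include <functional>

// n        - number of pictures in the stack
// iterMax  - upper bound on external-loop iterations (ITER_MAX)
// expand   - hypothetical stand-in for BVZ_Expand: takes the step (image
//            index) and the previous energy, returns the new energy
float computeLoop(int n, int iterMax, float initialEnergy,
                  const std::function<float(int, float)>& expand) {
    assert(n > 0 && iterMax >= 0);
    float E_old = initialEnergy;
    int stepCounter = 0;  // consecutive steps that brought no improvement

    // External loop: stop after iterMax rounds, or once a full round of
    // n steps has passed without any change in energy.
    for (int iter = 0; iter < iterMax && stepCounter < n; ++iter) {
        // Inner loop: one step per image.
        for (int step = 0; step < n && stepCounter < n; ++step) {
            float E = expand(step, E_old);
            if (E == E_old)
                ++stepCounter;    // no improvement on this image
            else
                stepCounter = 0;  // improvement: restart the count
            E_old = E;
        }
    }
    return E_old;
}
```

Because each step needs the energy left by the previous one, the steps form a dependency chain, which is what makes the loop hard to parallelize.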
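[Flow chart: original Compute flow. External loop condition: Iter < ITER_MAX && step_counter < _n (++iter on each pass). Inner loop condition: Step < _n && step_counter < _n. Inner loop body: E = BVZ_Expand; if E_old = E then Step_counter++, otherwise Step_counter = 0; then Step++ and E_old = E. When the external loop condition fails: Finish Compute.]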
Our goal was to parallelize the energy computation in each step so that we can advance by 2 steps in each iteration.
We calculate the odd steps (images) in thread 1 and the even steps in thread 2.
Thread synchronization appears in two places:
BVZ_Expand - The calculation part of the max flow is parallelized (both threads), and at this point thread 2 waits for thread 1 to finish its energy calculation & label updates. Now thread 2 has the right E_old.
Compute - If thread 1 changed the label, thread 2 must recalculate the last step on the updated label.
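The scheme can be illustrated with a toy model. Everything in the sketch below is an assumption made for illustration: `expandToy` is a trivial stand-in for BVZ_Expand (the real function solves a max-flow problem), the threads are created per pair of steps, and a stale result from thread 2 is simply recomputed instead of being handled with Step-- as on the flow chart. What it does show is the core idea: two consecutive steps run in parallel on the same labels, and the second result is only kept if the first step changed nothing.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

using Labels = std::vector<int>;
using Costs  = std::vector<std::vector<float>>;  // cost[pixel][label]

struct StepResult {
    Labels labels;  // labels after one expansion step
    bool changed;   // true if any pixel switched label
};

// Toy stand-in for BVZ_Expand (hypothetical): a pixel takes `label`
// whenever that lowers its own cost.
StepResult expandToy(const Costs& cost, const Labels& in, int label) {
    StepResult r;
    r.labels = in;
    r.changed = false;
    for (std::size_t p = 0; p < in.size(); ++p) {
        if (cost[p][label] < cost[p][in[p]]) {
            r.labels[p] = label;
            r.changed = true;
        }
    }
    return r;
}

// Reference version: one step after another.
Labels passSerial(const Costs& cost, Labels labels, int numLabels) {
    for (int step = 0; step < numLabels; ++step)
        labels = expandToy(cost, labels, step).labels;
    return labels;
}

// Two steps per iteration, each on its own thread.
Labels passTwoThreads(const Costs& cost, Labels labels, int numLabels) {
    for (int step = 0; step < numLabels; step += 2) {
        const bool hasSecond = step + 1 < numLabels;
        StepResult r1{}, r2{};
        std::thread t1([&] { r1 = expandToy(cost, labels, step); });
        std::thread t2([&] { if (hasSecond) r2 = expandToy(cost, labels, step + 1); });
        t1.join();  // synchronization point: both results are needed
        t2.join();
        if (!hasSecond) {
            labels = r1.labels;
            continue;
        }
        // Thread 2 worked on the labels from before thread 1's update.
        // If thread 1 changed a label, that result is stale: redo the step.
        if (r1.changed)
            r2 = expandToy(cost, r1.labels, step + 1);
        labels = r2.labels;
    }
    return labels;
}
```

In this toy model the two-thread pass always produces the same labels as the serial pass; the parallel work only pays off when thread 1 rarely changes a label. A real implementation would keep two long-lived threads rather than creating new ones for every pair of steps.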
Optimized Compute function flow
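[Flow chart: optimized Compute flow ("Multi 1"). The loop conditions are the same as in the original flow: Iter < ITER_MAX && step_counter < _n for the external loop (++iter) and Step < _n && step_counter < _n for the inner loop. In each inner iteration, thread 1 and thread 2 each run E = BVZ_Expand on consecutive steps (Step++ between them). Thread 1 branch: E_old = E_thread1 (synchronize threads), then Step_counter++ or Step_counter = 0. Decision "Thread 1 changed label": if YES, ignore the calculation of thread 2 (Step--); if NO, continue with thread 2's result: E_old = E_thread2, then Step_counter++ or Step_counter = 0. Then Step++ and E_old = E. When the external loop condition fails: Finish Compute.]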
Multithreading
Multithreading Optimization: Multithreading vs. Original Code, Time-Based Comparison
[Bar chart: Optimized Code vs. Original Code]
25% improvement!
• Theoretically, when using 2 threads that work simultaneously, we would expect a 50% speedup
• Because the results of each thread depend on the previous iteration, synchronization points are required in the code
• Those synchronization points halt the threads and therefore cause delays
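The gap between the theoretical and the measured gain can be put into numbers with Amdahl's law, which the slides do not mention by name. The sketch below assumes that "50%" means the runtime is halved (a speedup factor of 2), and that a fraction p of the runtime can run in parallel while the rest (serial start-up, waiting at sync points) cannot; the function name `amdahlSpeedup` is ours.

```cpp
#include <cassert>

// Amdahl's law: speedup factor with n threads when a fraction p (0..1)
// of the single-threaded runtime can be parallelized.
double amdahlSpeedup(double p, int n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / n);
}
```

For example, `amdahlSpeedup(1.0, 2)` gives the ideal factor of 2, while `amdahlSpeedup(0.5, 2)` gives only about 1.33: every part of the run that is serialized by synchronization pulls the factor down from 2.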
We tried to enhance the speed by taking a different approach to the synchronization.
In this attempt, each thread changes the labels (temporary labels) in its own memory segment, and we merge the results after both threads complete.
Each label that thread 2 changes is marked using an auxiliary array.
In the merging process, the labels are updated using the temp labels of thread 1, unless the specific label was also changed by thread 2; in that case, the label is updated using thread 2's temp labels.
We can see, however, that the results show no noticeable differences.
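The merge rule described above can be sketched in a few lines. The names `mergeLabels`, `temp1`, `temp2` and `changedBy2` are assumptions for illustration; the slides only say that an auxiliary array marks the labels changed by thread 2.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// temp1, temp2 - temporary labels written by thread 1 and thread 2
// changedBy2   - auxiliary array: true where thread 2 changed the label
std::vector<int> mergeLabels(const std::vector<int>& temp1,
                             const std::vector<int>& temp2,
                             const std::vector<bool>& changedBy2) {
    assert(temp1.size() == temp2.size() && temp1.size() == changedBy2.size());
    // Start from thread 1's labels...
    std::vector<int> merged(temp1);
    // ...and let thread 2 win wherever it changed a label.
    for (std::size_t i = 0; i < merged.size(); ++i) {
        if (changedBy2[i])
            merged[i] = temp2[i];
    }
    return merged;
}
```

Since each thread writes only to its own buffer, no locking is needed while the threads run; the merge itself is a single linear pass after both have finished.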
Second Attempt
Multithreading 2
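[Flow chart: second-attempt Compute flow ("Multi 2"). The loop conditions are the same as in the original flow: Iter < ITER_MAX && step_counter < _n for the external loop (++iter) and Step < _n && step_counter < _n for the inner loop. In each inner iteration, thread 1 runs E = BVZ_Expand and updates temp_label1, thread 2 runs E = BVZ_Expand and updates temp_label2 (Step++ between them). After both threads finish: Merge labels, then E_old = E_thread1 and Step_counter++, E_old = E_thread2 and Step_counter++, then Step++ and E_old = E. When the external loop condition fails: Finish Compute.]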
Multithreading Optimization: Multithreading 2 vs. Original Code, Time-Based Comparison
[Bar chart: Optimized Code vs. Original Code]
99.8% improvement!
Multithreading: Thread Profiler view of Multithreaded Code
[Thread Profiler screenshots; annotations: "Our code's 2 threads", "Our 2 threads"]
Our program's utilization is high: except at our threads' sync points, both cores are working.
The serial part at the beginning of the program is required to load the image stack and to compute the first energy of the image; after that, our threads begin their computation.
The threads' sync points take little time; most of the code runs simultaneously.
Intel® Compiler
• The Intel compiler did not run on our SIMD configuration (class error).
• We used the Intel Compiler on 3 configurations we made and compared the runtimes with the same configurations built with Visual Studio's compiler
Time Comparison - Intel vs. Microsoft (Runtime [Sec])
Configuration        Time (Visual Studio Compiler)   Time (Intel Compiler)
code optimization    58.17                           56.95
thread 1             55.38                           53.91
thread 2             34.7                            34.83
Intel® Tuning Assistant
• Using Intel's Tuning Assistant, we found no significant areas where our code caused a slowdown
• All events collected by the Tuning Assistant indicate that our optimization is satisfactory
• The store forwarding issue in the SIMD code was not detected as a "hotspot" because the time consumed by the faulty code was only 0.7% of the overall time spent in the entire function (less than 1%)
Optimization Summary: Time Comparison
Run Type             Time (Seconds)
original             69.34
code optimization    58.17
SIMD instructions    68.36
thread 1             55.38
thread 2             34.7
thread1+SIMD         65.33
thread2+SIMD         33.94
Optimization Summary: Speed Up Comparison (Original is 100%)
Run Type             Speed up (%)
original             100
code optimization    116.109028
SIMD instructions    101.4133256
thread 1             120.1326796
thread 2             149.9567349
thread1+SIMD         105.7830978
thread2+SIMD         151.0527834
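The slides do not state how the speed-up percentages were derived from the measured times. The values are consistent with "100% plus the percentage of runtime saved relative to the original"; this formula is our inference from the numbers, and the function name `speedUpPercent` is hypothetical.

```cpp
#include <cassert>

// Speed-up on the chart's scale: the original run is 100 %, and every
// percent of runtime saved relative to the original adds one point.
double speedUpPercent(double originalSec, double optimizedSec) {
    assert(originalSec > 0.0);
    return 100.0 + 100.0 * (originalSec - optimizedSec) / originalSec;
}
```

With the measured times, `speedUpPercent(69.34, 34.7)` evaluates to about 149.957 and `speedUpPercent(69.34, 58.17)` to about 116.109, in line with the chart values for "thread 2" and "code optimization".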
Thank you,
Tal Klein & Omer Manor