
Presented by: Tal Klein, Omer Manor

Digital Interactive Photomontage

• The project focuses on digital photomontage: a computer-assisted framework for combining parts of a set of photographs into a single composite picture

• We focused on one feature: extended depth of field (DOF)

• Extended DOF matters most in macro photography, where the depth of field is very shallow

• Extended DOF lets a photographer take several pictures of the same frame, focusing on a different area in each picture, and then combine them using this feature

• Alongside its benefits in photography, extended DOF is a "heavy resource consumer" because of the complex calculations and image manipulations it requires; our goal was therefore to speed up this process

System Configuration

• Intel® Core 2 Duo E6600 @ 2.4 GHz

• 2 GB RAM

• Microsoft Windows XP x64

• Because our platform has 2 cores, we assumed that optimization could achieve a major boost in performance

The Optimization Process

• Analyzing the application

• Code Optimization

• SIMD

• Multithreading

Analyzing The Application

• Analyzing the application in 3 different ways:

1. VTune performance analyzer in order to search for our program's bottlenecks.

2. We added counters of our own to functions we suspected to be called many times.

3. Call graph (using Intel’s VTune).
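The second analysis method (hand-written call counters) can be sketched as follows. This is only an illustration: the names g_displaceCalls, displaceStub and countCalls are hypothetical and do not come from the project, and the stub merely stands in for a small, frequently called function.

```cpp
#include <cstdint>

// Hypothetical instrumentation: one global counter per suspected hot function.
static std::uint64_t g_displaceCalls = 0;

// Stand-in for a small, frequently called function (not the project's displace()).
static int displaceStub(int x) {
    ++g_displaceCalls;  // count every invocation
    return x + 1;
}

// Calls the stub n times and returns how many calls the counter recorded.
std::uint64_t countCalls(int n) {
    g_displaceCalls = 0;
    int v = 0;
    for (int i = 0; i < n; ++i) {
        v = displaceStub(v);
    }
    (void)v;  // the value itself is irrelevant here
    return g_displaceCalls;
}
```

Printing such counters at program exit gives a rough call count per function, which can then be compared with the profiler's call graph.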

• transformedPixel() - declared an unnecessary variable.

• getDataCost() - we optimized the code and used SIMD instructions.

• BVZ_interaction_penalty() - we optimized the code by merging two loops into one and used SIMD instructions.

• displace() - we replaced the function call with a macro.

• BVZ_data_penalty() - calls displace(), which we changed into a macro.

• BVZ_Expand() - the function that calls the small functions above and consumes the most CPU time; we used multithreading on it.

Code Optimization

• Replacing FP variables with integer variables when no FP operation is needed

• Merging 2 consecutive "for loops" into one

• Removing the first of two assignments to the same pointer when the first assignment is never used

• Replacing a function call with inline code (a macro)

• Removing an unnecessary variable declaration

Code Optimization: replacing FP variables with integer variables; merging 2 consecutive "for loops" into one

Original Code:

    float PortraitCut::BVZ_interaction_penalty(/* p, np, l, nl: parameter list not shown on the slide */) {
        int c, k;
        float a, M=0;
        if (l==nl) return 0;
        unsigned char *Il, *Inl;
        if (_cuttype == C_NORMAL || _cuttype == C_GRAD) {
            a=0;
            Il = _imptr(l,p);
            Inl = _imptr(nl,p);
            for (c=0; c<3; ++c) {
                k = Il[c] - Inl[c];
                a += k*k;
            }
            M = sqrt(a);
            a=0;
            Il = _imptr(l,np);
            Inl = _imptr(nl,np);
            for (c=0; c<3; ++c) {
                k = Il[c] - Inl[c];
                a += k*k;
            }
        }
        M += sqrt(a);
        // ... (rest of the function not shown on the slide)

Optimized Code:

    float PortraitCut::BVZ_interaction_penalty(/* p, np, l, nl: parameter list not shown on the slide */) {
        int c;
        int ap = 0, anp = 0;      // integer accumulators replace the float 'a'
        float M=0;
        int kp = 0, knp = 0;
        unsigned char *Il_np, *Inl_np;
        if (l==nl) return 0;
        unsigned char *Il, *Inl;
        if (_cuttype == C_NORMAL || _cuttype == C_GRAD) {
            Il = _imptr(l,p);
            Inl = _imptr(nl,p);
            Il_np = _imptr(l,np);
            Inl_np = _imptr(nl,np);
            for (c=0; c<3; ++c) {  // the two loops merged into one
                kp = Il[c] - Inl[c];
                knp = Il_np[c] - Inl_np[c];
                ap += kp*kp;
                anp += knp*knp;
            }
        }
        M = sqrt(float(ap)) + sqrt(float(anp));
        // ... (rest of the function not shown on the slide)
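A minimal, self-contained sketch of this transformation is given below, assuming plain 3-byte RGB pixels passed as pointers. The function names penaltyTwoLoops and penaltyMergedLoop are hypothetical and only mirror the shape of the slide code; they are not the project's functions.

```cpp
#include <cmath>

// Original shape: float accumulator, two separate loops over the 3 channels.
float penaltyTwoLoops(const unsigned char* Il, const unsigned char* Inl,
                      const unsigned char* Il_np, const unsigned char* Inl_np) {
    float a = 0.0f;
    float M = 0.0f;
    for (int c = 0; c < 3; ++c) {
        int k = Il[c] - Inl[c];
        a += k * k;
    }
    M = std::sqrt(a);
    a = 0.0f;
    for (int c = 0; c < 3; ++c) {
        int k = Il_np[c] - Inl_np[c];
        a += k * k;
    }
    M += std::sqrt(a);
    return M;
}

// Optimized shape: integer accumulators, one merged loop, FP only for sqrt.
float penaltyMergedLoop(const unsigned char* Il, const unsigned char* Inl,
                        const unsigned char* Il_np, const unsigned char* Inl_np) {
    int ap = 0;
    int anp = 0;
    for (int c = 0; c < 3; ++c) {
        int kp = Il[c] - Inl[c];
        int knp = Il_np[c] - Inl_np[c];
        ap += kp * kp;
        anp += knp * knp;
    }
    return std::sqrt(float(ap)) + std::sqrt(float(anp));
}
```

Because the sums of squares are small integers (at most 3 * 255 * 255), both versions accumulate them exactly, so the merged loop should return the same penalty as the two-loop version.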

Code Optimization: two assignments to the same pointer without using the 1st assignment

Original Code:

    for (y=p.y-2, i=0; y<=p.y+2; ++y) {
        for (x=p.x-2; x<=p.x+2; ++x, ++i) {
            I = _id->_imptr(d,Coord(x,y));   // first assignment, never used
            I = _id->images((int)d)->data() + 3*(y * _id->images((int)d)->_size.x + x);
            lum = .3086f * (float)I[0] + .6094f * (float)I[1] + .082f * (float)I[2];
            mean += lum*_gaussianK5[i];
        } // x
    } // y

Optimized Code:

    for (y=p.y-2, i=0; y<=p.y+2; ++y) {
        for (x=p.x-2; x<=p.x+2; ++x, ++i) {
            I = _id->images((int)d)->data() + 3*(y * _id->images((int)d)->_size.x + x);
            lum = .3086f * (float)I[0] + .6094f * (float)I[1] + .082f * (float)I[2];
            mean += lum*_gaussianK5[i];
        } // x
    } // y

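Code Optimization: code replacement instead of function call

Original Code:

    float PortraitCut::BVZ_data_penalty(Coord p, ushort d) {
        assert(0);
        Coord dp = p;
        _displace(dp,d);
        // ... (rest of the function not shown on the slide)

Optimized Code:

    #define _displacedef(p,l) _idata->images(l)->displace(p)

    float PortraitCut::BVZ_data_penalty(Coord p, ushort d) {
        Coord dp = p;
        _displacedef(dp,d);   // macro expands in place instead of calling _displace()
        // ... (rest of the function not shown on the slide)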

Code Optimization: unnecessary variable declaration

Original Code:

    const unsigned char* ImageAbs::transformedPixel(Coord p) const {
        if (transformed())
            displace(p);
        if (p>=Coord(0,0) && p<_size) {
            unsigned char *res = _data + 3*(p.y * _size.x + p.x);
            return res;
        }
        else
            return __black;
    }

Optimized Code:

    const unsigned char* ImageAbs::transformedPixel(Coord p) const {
        if (transformed())
            displace(p);
        if (p>=Coord(0,0) && p<_size) {
            return (_data + 3*(p.y * _size.x + p.x));   // no temporary 'res'
        }
        else
            return __black;
    }

Code Optimization: Optimized Code vs. Original Code, Time-Based Comparison

[Chart: run time of the original code vs. the optimized code]

18% improvement!

SIMD - Single Instruction Multiple Data

• The main point when using SIMD instructions is that 128-bit registers are available to us, so a single instruction can operate on four packed single-precision floats at once

• We used these 128-bit registers in the places in our code where we expected them to boost the application's performance

Original Code:

    float PortraitCut::BVZ_interaction_penalty(/* p, np, l, nl: parameter list not shown on the slide */) {
        int c, k;
        float a, M=0;
        if (l==nl) return 0;
        unsigned char *Il, *Inl;
        if (_cuttype == C_NORMAL || _cuttype == C_GRAD) {
            a=0;
            Il = _imptr(l,p);
            Inl = _imptr(nl,p);
            for (c=0; c<3; ++c) {
                k = Il[c] - Inl[c];
                a += k*k;
            }
            M = sqrt(a);
            a=0;
            Il = _imptr(l,np);
            Inl = _imptr(nl,np);
            for (c=0; c<3; ++c) {
                k = Il[c] - Inl[c];
                a += k*k;
            }
            M += sqrt(a);
            M /= 6.f;
            // ... (rest of the function not shown on the slide)

Optimized Code:

    float PortraitCut::BVZ_interaction_penalty(/* p, np, l, nl: parameter list not shown on the slide */) {
        int c;
        __m128 SimdM;
        int ap = 0, anp = 0;
        float M=0;
        int kp = 0, knp = 0;
        unsigned char *Il_np, *Inl_np;
        if (l==nl) return 0;
        unsigned char *Il, *Inl;
        if (_cuttype == C_NORMAL || _cuttype == C_GRAD) {
            Il = _imptr(l,p);
            Inl = _imptr(nl,p);
            Il_np = _imptr(l,np);
            Inl_np = _imptr(nl,np);
            for (c=0; c<3; ++c) {
                kp = Il[c] - Inl[c];
                knp = Il_np[c] - Inl_np[c];
                ap += kp*kp;
                anp += knp*knp;
            }
        }
        // both square roots computed by one packed instruction
        SimdM = _mm_sqrt_ps(_mm_set_ps(0, 0, float(ap), float(anp)));
        M = (SimdM.m128_f32[0] + SimdM.m128_f32[1]) / 6.f;
        // ... (rest of the function not shown on the slide)

SIMD

• In the following example we used SIMD to compute the dot product of 2 vectors

• To make the process efficient, the data must be aligned in memory on a 16-byte boundary, so we used the __declspec(align(16)) specifier

Original Code:

    float ContrastCut::getDataCost (Coord p, ushort d) {
        float mean=0, lum, contrast=0;
        const unsigned char* I;
        int y, x, i;
        for (y=p.y-2, i=0; y<=p.y+2; ++y) {
            for (x=p.x-2; x<=p.x+2; ++x, ++i) {
                I = _id->_imptr(d,Coord(x,y));
                I = _id->images((int)d)->data() + 3*(y * _id->images((int)d)->_size.x + x);
                lum = .3086f * (float)I[0] + .6094f * (float)I[1] + .082f * (float)I[2];
                mean += lum*_gaussianK5[i];
            } // x
        } // y
        mean /= .9997f;
        // ... (rest of the function not shown on the slide)

Optimized Code:

    float ContrastCut::getDataCost (Coord p, ushort d) {
        float mean=0, lum, contrast=0;
        const unsigned char* I;
        int y, x, i;
        __declspec(align(16)) float lumarr[25];
        __m128 SimdMult;
        __m128 SimdMean;
        __m128 *pLumArr = (__m128*)lumarr;
        __m128 *pGaussArr = (__m128*)_gaussianK5;
        SimdMean = _mm_set1_ps (0.f);
        for (y=p.y-2, i=0; y<=p.y+2; ++y) {
            for (x=p.x-2; x<=p.x+2; ++x, ++i) {
                I = _id->images((int)d)->data() + 3*(y * _id->images((int)d)->_size.x + x);
                lumarr[i] = .3086f * (float)I[0] + .6094f * (float)I[1] + .082f * (float)I[2];
            } // x
        } // y
        // 6 packed multiply-adds cover elements 0..23
        for (i = 0; i < 24; i+=4) {
            SimdMult = _mm_mul_ps (*pLumArr, *pGaussArr);
            SimdMean = _mm_add_ps (SimdMult, SimdMean);
            pLumArr++;
            pGaussArr++;
        }
        mean = SimdMean.m128_f32[0] + SimdMean.m128_f32[1] + SimdMean.m128_f32[2] + SimdMean.m128_f32[3];
        // element 24 is handled separately as a scalar
        mean = (mean + lumarr[24]*_gaussianK5[24]) / .9997f;
        // ... (rest of the function not shown on the slide)
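The dot-product pattern used in getDataCost can be isolated into a small, portable sketch. It assumes an x86 target with SSE and uses alignas(16) and _mm_store_ps in place of the MSVC-specific __declspec(align(16)) and m128_f32 member; the names dot25 and dot25Example are hypothetical.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: _mm_load_ps, _mm_mul_ps, _mm_add_ps, ...

// Dot product of two 25-element float arrays.
// Assumption: both arrays start on a 16-byte boundary, as _mm_load_ps requires.
float dot25(const float* a, const float* b) {
    __m128 acc = _mm_setzero_ps();
    // Six packed multiply-adds cover elements 0..23, four floats at a time.
    for (int i = 0; i < 24; i += 4) {
        __m128 prod = _mm_mul_ps(_mm_load_ps(a + i), _mm_load_ps(b + i));
        acc = _mm_add_ps(acc, prod);
    }
    // Horizontal sum of the four lanes, then the leftover 25th element.
    alignas(16) float lanes[4];
    _mm_store_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3] + a[24] * b[24];
}

// Usage example: a[i] = i and b[i] = 1, so the result is 0 + 1 + ... + 24.
float dot25Example() {
    alignas(16) float a[25];
    alignas(16) float b[25];
    for (int i = 0; i < 25; ++i) {
        a[i] = float(i);
        b[i] = 1.0f;
    }
    return dot25(a, b);
}
```

As in the slide code, 25 is not a multiple of 4, so the last element is handled as a scalar after the packed loop.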

SIMD vs. Original Code, Time-Based Comparison

[Chart: run time of the SIMD-optimized code vs. the original code]

1.5% improvement??

SIMD Optimization

• Instead of keeping the data (the variables ap & anp) in registers, the compiler stores it to memory; the sqrtps instruction then loads it back, which causes a blocked store forward

• Using SIMD shortens the function's run time by approximately 1 second; however, the delay caused by the blocked store forwarding is larger than the speedup SIMD gained, so overall we got a slowdown

    SimdM = _mm_sqrt_ps (_mm_set_ps(0,0,float(ap), float(anp)));
    0041133B  xorps     xmm0,xmm0
    0041133E  sub       eax,edi
    00411340  mov       edi,dword ptr [esp+18h]
    00411344  add       ecx,ebx
    00411346  movzx     ebx,byte ptr [edi+2]
    0041134A  mov       edi,dword ptr [esp+20h]
    0041134E  movzx     edi,byte ptr [edi+2]
    00411352  sub       edi,ebx
    00411354  mov       ebx,eax
    00411356  imul      ebx,eax
    00411359  mov       eax,edi
    0041135B  imul      eax,edi
    0041135E  movss     dword ptr [esp+2Ch],xmm0
    00411364  movss     dword ptr [esp+28h],xmm0
    0041136A  add       ecx,ebx
    0041136C  cvtsi2ss  xmm0,ecx
    00411370  movss     dword ptr [esp+24h],xmm0
    00411376  add       edx,eax

    M = (SimdM.m128_f32[0] + SimdM.m128_f32[1]) / 6.f;
    if (_cuttype == C_GRAD) {
    00411378  cmp       dword ptr [esi+50h],1
    0041137C  cvtsi2ss  xmm0,edx
    00411380  movss     dword ptr [esp+20h],xmm0
    00411386  sqrtps    xmm0,xmmword ptr [esp+20h]      <- Store Forwarding Blocked
    0041138B  movaps    xmmword ptr [esp+20h],xmm0
    00411390  movss     xmm0,dword ptr [esp+24h]
    00411396  addss     xmm0,dword ptr [esp+20h]
    0041139C  mulss     xmm0,dword ptr [__real@3e2aaaab (5397A0h)]
    004113A4  movss     dword ptr [esp+0Ch],xmm0
    004113AA  jne       004114AE

Multithreading

• Our major attempt to improve the original application was to divide the massive calculation into two independent threads that run simultaneously, one on each core

• The main procedure used in this application is the function "compute"

Original Compute function flow

ITER_MAX - Upper bound on the iteration count, defined so the external loop won't loop forever.

N - Number of pictures in the stack.

Step - Image index descriptor.

BVZ_Expand - Calculates max flow on the image's labels and returns the energy of the current step. According to the calculation, it also updates the final image labels (the outcome).

Inner Loop - Executed one time on each image.

External Loop - Runs as long as there is improvement in the max flow calculation. As long as the old energy (from the previous step) is bigger than the new one, we continue the iterations to the next image. If no improvement in the flow was made, we have achieved the maximum improvement and the function ends.

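[Flowchart "Original": Compute flow. Recoverable node labels: outer loop condition "Iter < ITER_MAX && step_counter < _n"; inner loop condition "Step < _n && step_counter < _n"; E = BVZ_Expand; decision on E_old vs. E leading to Step_counter++ or Step_counter = 0; Step++ with E_old = E; ++iter; Finish Compute.]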
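The loop structure described above can be sketched roughly as follows. This is a reading of the slide, not the project's code: the callback expand stands in for BVZ_Expand, the improvement test E < E_old is an assumption about the flowchart's decision node, and the names computeSketch and computeExample are hypothetical.

```cpp
#include <functional>

// Sketch of the original Compute flow. Returns the number of expand calls made.
// n       - number of pictures in the stack
// iterMax - bound on outer iterations (ITER_MAX on the slide)
// expand  - stand-in for BVZ_Expand: takes (step, E_old), returns the new energy
int computeSketch(int n, int iterMax,
                  const std::function<float(int, float)>& expand, float E0) {
    float E_old = E0;
    int step_counter = 0;  // consecutive steps without improvement
    int calls = 0;
    for (int iter = 0; iter < iterMax && step_counter < n; ++iter) {
        for (int step = 0; step < n && step_counter < n; ++step) {
            float E = expand(step, E_old);
            ++calls;
            if (E < E_old) {
                step_counter = 0;   // improvement: reset the counter
            } else {
                ++step_counter;     // no improvement on this image
            }
            E_old = E;
        }
    }
    return calls;
}

// Usage example: the energy drops by 1 on the first 5 calls, then stays flat.
int computeExample() {
    int remaining = 5;
    auto expand = [&remaining](int /*step*/, float E_old) {
        if (remaining > 0) {
            --remaining;
            return E_old - 1.0f;
        }
        return E_old;
    };
    return computeSketch(3, 100, expand, 10.0f);
}
```

With 3 images, the example stops after 5 improving calls plus 3 consecutive calls without improvement, that is, once every image has been tried without lowering the energy.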

Our goal was to parallelize the energy computation in each step so that each iteration advances by 2 steps.

We calculate the odd steps (images) in thread 1 and the even steps in thread 2.

Thread synchronization appears in two places:

BVZ_Expand - The max-flow calculation is parallelized across both threads; at this point thread 2 waits for thread 1 to finish its energy calculation and label updates, so that thread 2 has the right E_old.

Compute - If thread 1 changed a label, thread 2 must recalculate its last step on the updated labels.

Optimized Compute function flow

[Flowchart "Multi 1": optimized Compute flow. Recoverable node labels: ++iter; THREAD 2: E = BVZ_Expand; E_old = E_thread1 (synchronize threads); decision "Thread 1 changed label" with YES leading to "Ignore calc of thread 2 (Step--)"; E_old = E_thread1 and E_old = E_thread2, each followed by Step_counter++ or Step_counter = 0; Step++; Finish Compute.]

Multithreading Optimization: Multithreading vs. Original Code, Time-Based Comparison

[Chart: run time of the multithreaded code vs. the original code]

25% improvement!

• Theoretically, with 2 threads working simultaneously we would expect the run time to be cut roughly in half (a 50% reduction)

• Because the results of each thread depend on the previous iteration, synchronization points are required in the code

• Those synchronization points halt the threads and therefore cause delays

Multithreading Optimization: Second Attempt

We tried to enhance the speed by taking a different approach to the synchronization.

In this attempt, each thread changes the labels (temporary labels) in its own memory segment, and we merge the results after both threads complete.

Each label that thread 2 changes is marked using an auxiliary array.

In the merging process, each label is updated from thread 1's temp labels, unless that label was also changed by thread 2; in that case it is updated from thread 2's temp labels.

The results, however, show no noticeable differences.
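The merge rule of the second attempt can be sketched as below. The container types and the names mergeLabels, temp1, temp2 and changedBy2 are assumptions made for illustration; the project's actual data structures are not shown on the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Merge the per-thread temporary labels into the final label array.
// temp1      - labels as written by thread 1
// temp2      - labels as written by thread 2
// changedBy2 - auxiliary array: true where thread 2 changed the label
std::vector<unsigned short> mergeLabels(const std::vector<unsigned short>& temp1,
                                        const std::vector<unsigned short>& temp2,
                                        const std::vector<bool>& changedBy2) {
    assert(temp1.size() == temp2.size() && temp1.size() == changedBy2.size());
    std::vector<unsigned short> merged(temp1.size());
    for (std::size_t i = 0; i < temp1.size(); ++i) {
        // Thread 1's label is the default; thread 2's label wins where marked.
        merged[i] = changedBy2[i] ? temp2[i] : temp1[i];
    }
    return merged;
}
```

For example, merging temp1 = {1, 1, 1} and temp2 = {2, 2, 2} with only the middle label marked as changed by thread 2 yields {1, 2, 1}.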

[Flowchart "Multi 2": Compute flow of the second attempt. Recoverable node labels: ++iter; Update temp_label1 (thread 1) and Update temp_label2 (thread 2); Merge labels; E_old = E_thread1 and E_old = E_thread2, each followed by Step_counter++; Step++; Finish Compute.]

Multithreading Optimization: Multithreading 2 vs. Original Code, Time-Based Comparison

[Chart: run time of the second multithreaded version vs. the original code]

99.8% improvement!


Multithreading: Thread Profiler View of Multithreaded Code

[Screenshot: Intel Thread Profiler timeline, with our code's 2 threads marked]

Our program's utilization is high and, except for our threads' sync points, both cores are working.

The serial part at the beginning of the program is required to load the image stack and to compute the first energy of the image; after that, our threads begin their computation.

The threads' sync points take little time - most of the code runs simultaneously.


Intel® Compiler

• The Intel compiler did not run on our SIMD configuration (class error).

• We used the Intel compiler on 3 configurations we made and compared its run time with the same configurations built using Visual Studio's compiler

Time Comparison - Intel vs. Microsoft (runtime in seconds)

Configuration        Visual Studio Compiler   Intel Compiler
code optimization    58.17                    56.95
thread 1             55.38                    53.91
thread 2             34.7                     34.83

Intel® Tuning Assistant

• Using Intel's Tuning Assistant, we found no significant areas where our code caused a slowdown

• All events collected by the Tuning Assistant indicate that our optimization is satisfactory

• The store forwarding issue with SIMD was not detected as a "hotspot" because the time consumed by the faulty code was only 0.7% of the overall time spent in the entire function (less than 1%)

Optimization Summary: Time Comparison

Run Type             Time (seconds)
original             69.34
code optimization    58.17
SIMD instructions    68.36
thread 1             55.38
thread 2             34.7
thread1+SIMD         65.33
thread2+SIMD         33.94

Optimization Summary: Speed Up Comparison (Original is 100%)

Run Type             Speedup (%)
original             100
code optimization    116.11
SIMD instructions    101.41
thread 1             120.13
thread 2             149.96
thread1+SIMD         105.78
thread2+SIMD         151.05
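The relation between the measured times and the percentages can be checked with a few lines of arithmetic. The formulas below are our reading of the numbers, not something the slides state: the chart values match 100 plus the percent reduction in run time, while the "improvement" figures quoted on the comparison slides are closer to the ratio of old to new run time minus one. All function names are hypothetical.

```cpp
// Percent reduction in run time relative to the original run.
double timeReductionPct(double tOrig, double tNew) {
    return (1.0 - tNew / tOrig) * 100.0;
}

// Value as plotted in the speed-up chart (assumed: 100 + percent time reduction).
double chartSpeedupPct(double tOrig, double tNew) {
    return 100.0 + timeReductionPct(tOrig, tNew);
}

// Ratio-based improvement (assumed reading of the per-slide percentages).
double ratioImprovementPct(double tOrig, double tNew) {
    return (tOrig / tNew - 1.0) * 100.0;
}
```

For instance, chartSpeedupPct(69.34, 58.17) gives about 116.11, the chart value for code optimization, and ratioImprovementPct(69.34, 34.7) gives about 99.8, matching the improvement quoted for the second multithreading attempt.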

Thank you,

Tal Klein & Omer Manor
