presented by: tal klein omer manor. digital interactive photomontage the project focuses on digital...


Upload: damaris-pryde

Post on 31-Mar-2015



Presented by: Tal Klein & Omer Manor


Digital Interactive Photomontage

• The project focuses on digital photomontage: a computer-assisted framework for combining parts of a set of photographs into a single composite picture

• We focused on one feature: extended depth of field (DOF)

• Extended DOF matters most in macro photography, where the depth of field is very shallow


Digital Interactive Photomontage

• Extended DOF lets a photographer take several pictures of the same frame, focusing on a different area in each picture, and then combine them into a single image

• Alongside its benefits in photography, extended DOF is a "Heavy Resource Consumer" because of the complex calculations and image manipulations it requires, so our goal was to speed up this process



System Configuration

• Intel® Core 2 Duo E6600 @ 2.4 GHz

• 2 GB RAM

• Microsoft Windows XP x64

• Because our platform has 2 cores, we assumed that optimization could achieve a major boost in performance


The Optimization Process

• Analyzing the application

• Code Optimization

• SIMD

• Multithreading


Analyzing The Application

• We analyzed the application in 3 different ways:

1. VTune Performance Analyzer, to search for our program's bottlenecks.

2. Counters of our own, added to functions we suspected were called many times.

3. A call graph (using Intel's VTune).
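The hand-written counters from item 2 could look roughly like the sketch below. It is only an illustration: `CallCounter`, `g_displace_calls` and `displace_x` are hypothetical names, not taken from the project, and the counter is not thread-safe.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Hypothetical helper: counts how often an instrumented function runs.
struct CallCounter {
    const char* name;
    std::uint64_t calls;
    explicit CallCounter(const char* n) : name(n), calls(0) {}
    void hit() { ++calls; }
    void report() const {
        std::printf("%s: %llu calls\n", name,
                    static_cast<unsigned long long>(calls));
    }
};

// One counter per suspected hot function (assumed name).
static CallCounter g_displace_calls("displace");

// Stand-in for a small, frequently called function such as displace().
inline int displace_x(int x, int dx) {
    g_displace_calls.hit();  // single increment on entry
    return x + dx;
}
```

Calling `g_displace_calls.report()` at program exit prints the total, which can then be compared against the profiler's call graph.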


Analyzing The Application

• transformedPixel() - declared an unnecessary variable.

• getDataCost() - we optimized the code and used SIMD instructions.

• BVZ_interaction_penalty() - we optimized the code by merging two loops into one and used SIMD instructions.

• displace() - we replaced the function call with a macro.

• BVZ_data_penalty() - calls displace(), which we changed into a macro.


Analyzing The Application

BVZ_Expand() - the function that calls the small functions and consumes the most CPU time. We used multithreading on it.


Code Optimization

• Replacing FP variables with integer variables when no FP operation is needed

• Merging 2 consecutive "for" loops into one

• Removing the first of two assignments to the same pointer when the first assignment is never used

• Replacing a function call with inline code (a macro)

• Removing an unnecessary variable declaration


Code Optimization: Replacement of FP variables with integer variables; merging of 2 consecutive "for" loops into one

Original Code:

    float PortraitCut::BVZ_interaction_penalty /* parameter list not shown on the slide */ {
        int c, k;
        float a, M=0;
        if (l==nl) return 0;
        unsigned char *Il, *Inl;
        if (_cuttype == C_NORMAL || _cuttype == C_GRAD) {
            a=0;
            Il = _imptr(l,p);
            Inl = _imptr(nl,p);
            for (c=0; c<3; ++c) {
                k = Il[c] - Inl[c];
                a += k*k;
            }
            M = sqrt(a);
            a=0;
            Il = _imptr(l,np);
            Inl = _imptr(nl,np);
            for (c=0; c<3; ++c) {
                k = Il[c] - Inl[c];
                a += k*k;
            }
        }
        M += sqrt(a);
        // ... (the slide excerpt ends here)

Optimized Code:

    float PortraitCut::BVZ_interaction_penalty /* parameter list not shown on the slide */ {
        int c;
        int ap = 0, anp = 0;      // integer accumulators replace the float 'a'
        float M=0;
        int kp = 0, knp = 0;
        unsigned char *Il_np, *Inl_np;
        if (l==nl) return 0;
        unsigned char *Il, *Inl;
        if (_cuttype == C_NORMAL || _cuttype == C_GRAD) {
            Il = _imptr(l,p);
            Inl = _imptr(nl,p);
            Il_np = _imptr(l,np);
            Inl_np = _imptr(nl,np);
            for (c=0; c<3; ++c) {  // the two loops merged into one
                kp = Il[c] - Inl[c];
                knp = Il_np[c] - Inl_np[c];
                ap += kp*kp;
                anp += knp*knp;
            }
        }
        M = sqrt (float(ap)) + sqrt(float(anp));
        // ... (the slide excerpt ends here)
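To check that the two rewrites above preserve the result, the pattern can be isolated in a small self-contained sketch. The function names and the raw-pointer parameters are assumptions made for illustration; the project code reads pixels through `_imptr` instead.

```cpp
#include <cassert>
#include <cmath>

// Two separate loops with a float accumulator (the original pattern).
float penalty_two_loops(const unsigned char* il, const unsigned char* inl,
                        const unsigned char* il_np, const unsigned char* inl_np) {
    float a = 0.f;
    for (int c = 0; c < 3; ++c) {
        int k = il[c] - inl[c];          // operands promote to int, so k may be negative
        a += float(k * k);
    }
    float m = std::sqrt(a);
    a = 0.f;
    for (int c = 0; c < 3; ++c) {
        int k = il_np[c] - inl_np[c];
        a += float(k * k);
    }
    return m + std::sqrt(a);
}

// One merged loop with integer accumulators (the optimized pattern).
float penalty_merged(const unsigned char* il, const unsigned char* inl,
                     const unsigned char* il_np, const unsigned char* inl_np) {
    int ap = 0, anp = 0;
    for (int c = 0; c < 3; ++c) {
        int kp = il[c] - inl[c];
        int knp = il_np[c] - inl_np[c];
        ap += kp * kp;
        anp += knp * knp;
    }
    // Convert to float only once, right before the square roots.
    return std::sqrt(float(ap)) + std::sqrt(float(anp));
}
```

Three squared 8-bit differences sum to at most 3 × 255² = 195075, which both `int` and `float` represent exactly, so the two versions should agree bit for bit on RGB pixels.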


Code Optimization: Two assignments to the same pointer without using the 1st assignment

Original Code:

    for (y=p.y-2, i=0; y<=p.y+2; ++y) {
        for (x=p.x-2; x<=p.x+2; ++x, ++i) {
            I = _id->_imptr(d,Coord(x,y));
            I = _id->images((int)d)->data() + 3*(y * _id->images((int)d)->_size.x + x);
            lum = .3086f * (float)I[0] + .6094f * (float)I[1] + .082f * (float)I[2];
            mean += lum*_gaussianK5[i];
        } // x
    } // y

Optimized Code:

    for (y=p.y-2, i=0; y<=p.y+2; ++y) {
        for (x=p.x-2; x<=p.x+2; ++x, ++i) {
            I = _id->images((int)d)->data() + 3*(y * _id->images((int)d)->_size.x + x);
            lum = .3086f * (float)I[0] + .6094f * (float)I[1] + .082f * (float)I[2];
            mean += lum*_gaussianK5[i];
        } // x
    } // y


Code Optimization: Code replacement instead of function

Original Code:

    float PortraitCut::BVZ_data_penalty(Coord p, ushort d) {
        assert(0);
        Coord dp = p;
        _displace(dp,d);

Optimized Code:

    #define _displacedef(p,l) _idata->images(l)->displace(p)
    float PortraitCut::BVZ_data_penalty(Coord p, ushort d) {
        Coord dp = p;
        _displacedef(dp,d);
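The macro replacement can be tried in isolation with the sketch below. `Coord`, `Image` and `ImageSet` are simplified stand-ins invented for this example, not the project's classes; an `inline` function is shown next to the macro as a type-checked alternative that an optimizing compiler can usually expand in place as well.

```cpp
#include <cassert>

// Simplified stand-ins for the project's types (assumptions for this sketch).
struct Coord { int x, y; };

struct Image {
    Coord offset;
    void displace(Coord& p) const { p.x += offset.x; p.y += offset.y; }
};

struct ImageSet {
    Image imgs[2];
    const Image* images(int l) const { return &imgs[l]; }
};

// Wrapper function: one extra call per pixel unless the compiler inlines it.
void displace_fn(const ImageSet* set, Coord& p, int l) {
    set->images(l)->displace(p);
}

// Macro form, in the spirit of _displacedef on the slide.
#define DISPLACE_MACRO(set, p, l) ((set)->images(l)->displace(p))

// Type-checked alternative to the macro.
inline void displace_inline(const ImageSet* set, Coord& p, int l) {
    set->images(l)->displace(p);
}
```

All three forms perform the same displacement; which one is fastest depends on the compiler and its inlining settings, so it is worth measuring rather than assuming.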


Code Optimization: Unnecessary variable declaration

Original Code:

    const unsigned char* ImageAbs::transformedPixel(Coord p) const {
        if (transformed())
            displace(p);
        if (p>=Coord(0,0) && p<_size) {
            unsigned char *res = _data + 3*(p.y * _size.x + p.x);
            return res;
        }
        else
            return __black;
    }

Optimized Code:

    const unsigned char* ImageAbs::transformedPixel(Coord p) const {
        if (transformed())
            displace(p);
        if (p>=Coord(0,0) && p<_size) {
            return (_data + 3*(p.y * _size.x + p.x));
        }
        else
            return __black;
    }


Code Optimization: Optimized Code vs. Original Code, Time-Based Comparison

[Bar chart: Original Code vs. Optimized Code]

18% improvement!


SIMD - Single Instruction Multiple Data

• The key point when using SIMD instructions is that 128-bit registers are available to us (each holds four single-precision floats), so we should use them wisely.

• We used these 128-bit registers in the places in our code where we thought they would boost our application's performance


SIMD

Original Code:

    float PortraitCut::BVZ_interaction_penalty /* parameter list not shown on the slide */ {
        int c, k;
        float a, M=0;
        if (l==nl) return 0;
        unsigned char *Il, *Inl;
        if (_cuttype == C_NORMAL || _cuttype == C_GRAD) {
            a=0;
            Il = _imptr(l,p);
            Inl = _imptr(nl,p);
            for (c=0; c<3; ++c) {
                k = Il[c] - Inl[c];
                a += k*k;
            }
            M = sqrt(a);
            a=0;
            Il = _imptr(l,np);
            Inl = _imptr(nl,np);
            for (c=0; c<3; ++c) {
                k = Il[c] - Inl[c];
                a += k*k;
            }
            M += sqrt(a);
            M /=6.f;
            // ... (the slide excerpt ends here)

Optimized Code:

    float PortraitCut::BVZ_interaction_penalty /* parameter list not shown on the slide */ {
        int c;
        __m128 SimdM;
        int ap = 0, anp = 0;
        float M=0;
        int kp = 0, knp = 0;
        unsigned char *Il_np, *Inl_np;
        if (l==nl) return 0;
        unsigned char *Il, *Inl;
        if (_cuttype == C_NORMAL || _cuttype == C_GRAD) {
            Il = _imptr(l,p);
            Inl = _imptr(nl,p);
            Il_np = _imptr(l,np);
            Inl_np = _imptr(nl,np);
            for (c=0; c<3; ++c) {
                kp = Il[c] - Inl[c];
                knp = Il_np[c] - Inl_np[c];
                ap += kp*kp;
                anp += knp*knp;
            }
        }
        SimdM = _mm_sqrt_ps (_mm_set_ps(0,0,float(ap),float(anp)));
        M = (SimdM.m128_f32[0] + SimdM.m128_f32[1])/6.f;
        // ... (the slide excerpt ends here)


SIMD

• In the following example we used SIMD to compute the dot product of 2 vectors

• To make this efficient, the data must be 16-byte aligned in memory, so we used the __declspec(align(16)) specifier
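A portable sketch of the same idea, assuming an x86 target with SSE: a 25-element dot product done as six 4-wide multiply-adds plus one scalar tail element. It uses standard `alignas(16)` in place of the MSVC-specific `__declspec(align(16))`; `Window25`, `dot25_scalar` and `dot25_sse` are hypothetical names chosen for this example.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 only)
#include <cassert>

constexpr int kTaps = 25;  // a 5x5 window, as in getDataCost

// Wrapper that guarantees 16-byte alignment of the 25 floats.
struct Window25 {
    alignas(16) float v[kTaps];
};

// Scalar reference implementation.
float dot25_scalar(const float* a, const float* b) {
    float sum = 0.f;
    for (int i = 0; i < kTaps; ++i) sum += a[i] * b[i];
    return sum;
}

// SSE version. Both pointers must be 16-byte aligned because
// _mm_load_ps performs aligned loads.
float dot25_sse(const float* a, const float* b) {
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < 24; i += 4) {  // 6 iterations cover elements 0..23
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(a + i), _mm_load_ps(b + i)));
    }
    alignas(16) float lanes[4];
    _mm_store_ps(lanes, acc);          // horizontal sum of the 4 lanes
    return lanes[0] + lanes[1] + lanes[2] + lanes[3] + a[24] * b[24];  // tail element
}
```

Both input arrays need the alignment guarantee. The SSE version adds the products in a different order than the scalar loop, so results can differ in the last bits for general inputs.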


SIMD

Original Code:

    float ContrastCut::getDataCost (Coord p, ushort d) {
        float mean=0, lum, contrast=0;
        const unsigned char* I;
        int y,x, i;
        for (y=p.y-2, i=0; y<=p.y+2; ++y) {
            for (x=p.x-2; x<=p.x+2; ++x, ++i) {
                I = _id->_imptr(d,Coord(x,y));
                I = _id->images((int)d)->data() + 3*(y * _id->images((int)d)->_size.x + x);
                lum = .3086f * (float)I[0] + .6094f * (float)I[1] + .082f * (float)I[2];
                mean += lum*_gaussianK5[i];
            } // x
        } // y
        mean /= .9997f;

Optimized Code:

    float ContrastCut::getDataCost (Coord p, ushort d) {
        float mean=0, lum, contrast=0;
        const unsigned char* I;
        int y,x, i;
        __declspec(align(16)) float lumarr[25];
        __m128 SimdMult;
        __m128 SimdMean;
        __m128 *pLumArr = (__m128*)lumarr;
        __m128 *pGaussArr = (__m128*)_gaussianK5;
        SimdMean = _mm_set1_ps (0.f);
        for (y=p.y-2, i=0; y<=p.y+2; ++y) {
            for (x=p.x-2; x<=p.x+2; ++x, ++i) {
                I = _id->images((int)d)->data() + 3*(y * _id->images((int)d)->_size.x + x);
                lumarr[i] = .3086f * (float)I[0] + .6094f * (float)I[1] + .082f * (float)I[2];
            } // x
        } // y
        for (i = 0; i < 24 ; i+=4) {
            SimdMult = _mm_mul_ps (*pLumArr, *pGaussArr);
            SimdMean = _mm_add_ps (SimdMult, SimdMean);
            pLumArr++;
            pGaussArr++;
        }
        mean = SimdMean.m128_f32[0] + SimdMean.m128_f32[1] + SimdMean.m128_f32[2] + SimdMean.m128_f32[3];
        mean = (mean+lumarr[24]*_gaussianK5[24])/.9997f;


SIMD Optimization: SIMD vs. Original Code, Time-Based Comparison

[Bar chart: Optimized Code vs. Original Code]

1.5% improvement??


SIMD

• Instead of keeping the data (the variables ap and anp) in registers, the compiler stores it to memory, which causes a blocked store forward when the sqrtps instruction reads it back

• SIMD speeds up the function by approximately 1 sec; however, the delay caused by the store forwarding is larger than the speedup SIMD gained, so we got a slowdown
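One way to reduce the chance of such a stall is to read the results back with register-to-register intrinsics instead of indexing `m128_f32[]` (which is also MSVC-specific). The sketch below is an illustration under that assumption, not the authors' fix; `sqrt_sum_sse` is a hypothetical name, and whether the compiler actually keeps everything in registers depends on the compiler and its flags.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 only)
#include <cassert>

// Returns sqrt(ap) + sqrt(anp) using one packed square root.
float sqrt_sum_sse(int ap, int anp) {
    // _mm_set_ps takes lanes from highest to lowest: lane 0 = anp, lane 1 = ap.
    __m128 v = _mm_set_ps(0.f, 0.f, float(ap), float(anp));
    __m128 r = _mm_sqrt_ps(v);
    // Broadcast lane 1 so it can be added to lane 0 without going through memory.
    __m128 hi = _mm_shuffle_ps(r, r, _MM_SHUFFLE(1, 1, 1, 1));
    return _mm_cvtss_f32(_mm_add_ss(r, hi));
}
```

For only two square roots, two scalar `sqrt` calls may be just as fast, so a change like this is worth measuring before adopting.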


SIMD: Store Forwarding Blocked

SimdM = _mm_sqrt_ps (_mm_set_ps(0,0,float(ap), float(anp)));

    0041133B  xorps     xmm0,xmm0
    0041133E  sub       eax,edi
    00411340  mov       edi,dword ptr [esp+18h]
    00411344  add       ecx,ebx
    00411346  movzx     ebx,byte ptr [edi+2]
    0041134A  mov       edi,dword ptr [esp+20h]
    0041134E  movzx     edi,byte ptr [edi+2]
    00411352  sub       edi,ebx
    00411354  mov       ebx,eax
    00411356  imul      ebx,eax
    00411359  mov       eax,edi
    0041135B  imul      eax,edi
    0041135E  movss     dword ptr [esp+2Ch],xmm0
    00411364  movss     dword ptr [esp+28h],xmm0
    0041136A  add       ecx,ebx
    0041136C  cvtsi2ss  xmm0,ecx
    00411370  movss     dword ptr [esp+24h],xmm0
    00411376  add       edx,eax

M = (SimdM.m128_f32[0] + SimdM.m128_f32[1])/6.f;
if (_cuttype == C_GRAD) {

    00411378  cmp       dword ptr [esi+50h],1
    0041137C  cvtsi2ss  xmm0,edx
    00411380  movss     dword ptr [esp+20h],xmm0
    00411386  sqrtps    xmm0,xmmword ptr [esp+20h]
    0041138B  movaps    xmmword ptr [esp+20h],xmm0
    00411390  movss     xmm0,dword ptr [esp+24h]
    00411396  addss     xmm0,dword ptr [esp+20h]
    0041139C  mulss     xmm0,dword ptr [__real@3e2aaaab (5397A0h)]
    004113A4  movss     dword ptr [esp+0Ch],xmm0
    004113AA  jne       004114AE

Store forwarding is blocked here: four separate 4-byte movss stores write [esp+20h] through [esp+2Ch], and the sqrtps instruction then loads all 16 bytes from [esp+20h] at once.


Multithreading

• Our major attempt to improve the original application was to divide the massive calculation into two independent threads that run simultaneously, one on each core

• The main procedure used in this application is the function "compute"


Multithreading

Original Compute function flow

ITER_MAX - Defined so the external loop won't loop forever.

N - Number of pictures in the stack.

Step - Image index descriptor.

BVZ_Expand - Calculates max flow on the image's labels and returns the energy of the current step. According to the calculation, it also updates the final image labels (the outcome).

Inner Loop - Executed once for each image.

External Loop - Runs as long as there is improvement in the max flow calculation. As long as the old energy (from the previous step) is bigger than the new one, we continue the iterations to the next image. If no improvement in the flow was made, we have achieved the maximum improvement and the function ends.

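[Flowchart "Original": Compute flow. Nodes: Compute; outer loop test (Iter < ITER_MAX && step_counter < _n); inner loop test (Step < _n && step_counter < _n); E = BVZ_Expand; comparison of E with E_old, leading to either Step_counter++ or Step_counter = 0; Step++ with E_old = E; ++iter; Finish Compute.]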
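The flow described above can be approximated in code as follows. This is a reading of the flowchart, not the project's source: `compute`, `ExpandFn` and `ComputeResult` are hypothetical names, `expand` stands in for BVZ_Expand, and the sketch assumes that "improvement" means the new energy is strictly lower than the old one.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-in for BVZ_Expand: tries image `step` and returns the new energy.
using ExpandFn = std::function<double(int step, double e_old)>;

struct ComputeResult {
    double energy;   // final energy
    int iterations;  // completed outer-loop passes
    int expansions;  // calls to the expansion step
};

ComputeResult compute(int n, int iter_max, double e_initial, const ExpandFn& expand) {
    double e_old = e_initial;
    int step_counter = 0;  // consecutive steps without improvement
    int expansions = 0;
    int iter = 0;
    // External loop: stop after iter_max passes or after n steps in a row without improvement.
    for (; iter < iter_max && step_counter < n; ++iter) {
        // Inner loop: one expansion attempt per image in the stack.
        for (int step = 0; step < n && step_counter < n; ++step) {
            double e = expand(step, e_old);
            ++expansions;
            if (e < e_old) step_counter = 0;  // improvement: reset the counter
            else ++step_counter;              // no improvement on this image
            e_old = e;
        }
    }
    return ComputeResult{e_old, iter, expansions};
}
```

With a toy `expand` that lowers the energy by 1 until it reaches a floor, the loop stops once a full pass over all n images brings no improvement.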


Optimized Compute function flow

Our goal was to parallelize the energy computation in each step so that we can advance by 2 steps in each iteration.

We calculate the odd steps (images) in thread 1 and the even steps in thread 2.

Thread synchronization appears in two places:

BVZ_Expand - The max flow calculation is parallelized (both threads run it). At this point thread 2 waits for thread 1 to finish its energy calculation and label updates, so that thread 2 has the right E_old.

Compute - If thread 1 changed a label, thread 2 must recalculate the last step on the updated labels.

[Flowchart "Multi 1" (Multithreading): optimized Compute flow. Nodes: Compute; outer loop test (Iter < ITER_MAX && step_counter < _n); inner loop test (Step < _n && step_counter < _n); THREAD 1: E = BVZ_Expand; THREAD 2: E = BVZ_Expand; E_old = E_thread1 (synchronize threads); decision "Thread 1 changed label", which leads to "Ignore calc of thread 2 (Step--)" on YES; E_Old = E_thread1 and E_Old = E_thread2, each followed by Step_counter++ or Step_counter = 0; Step++; ++iter; Finish Compute.]


Multithreading Optimization: Multithreading vs. Original Code, Time-Based Comparison

[Bar chart: Optimized Code vs. Original Code]

25% improvement!


Multithreading Optimization: Multithreading vs. Original Code, Time-Based Comparison

• Theoretically, with 2 threads working simultaneously, we would expect a 50% speedup (half the original runtime)

• Because the result of each thread depends on the previous iteration, synchronization points are required in the code

• Those synchronization points halt the threads and therefore cause delays


We tried to enhance the speed by taking a different approach to synchronization.

In this attempt, each thread changes temporary labels in its own memory segment, and we merge the results after both threads complete.

Each label that thread 2 changes is marked in an auxiliary array.

In the merging process, each label is updated from thread 1's temp labels, unless thread 2 also changed that label; in that case the label is updated from thread 2's temp labels.

We can see, however, that the results show no noticeable differences.

Multithreading 2: Second Attempt

[Flowchart "Multi 2": Compute flow of the second attempt. Nodes: Compute; outer loop test (Iter < ITER_MAX && step_counter < _n); inner loop test (Step < _n && step_counter < _n); THREAD 1: E = BVZ_Expand, then update of thread 1's temp labels; THREAD 2: E = BVZ_Expand, then update of thread 2's temp labels; Merge labels; E_Old = E_thread1 and E_Old = E_thread2, each followed by Step_counter++; Step++; ++iter; Finish Compute.]
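The merge rule of the second attempt (take thread 1's temp label unless thread 2 changed that position) can be sketched as below. `merge_labels`, the `Label` type and the vector-based buffers are assumptions made for illustration; the slides do not show the project's actual data structures.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Label = unsigned short;  // assumed label type

// Merges the per-thread temporary labels into one label array.
// changed_by_thread2[i] is the auxiliary mark set when thread 2 changed label i.
std::vector<Label> merge_labels(const std::vector<Label>& temp1,
                                const std::vector<Label>& temp2,
                                const std::vector<bool>& changed_by_thread2) {
    assert(temp1.size() == temp2.size());
    assert(temp1.size() == changed_by_thread2.size());
    std::vector<Label> merged(temp1.size());
    for (std::size_t i = 0; i < temp1.size(); ++i) {
        // Thread 2's change wins where it is marked; otherwise use thread 1's label.
        merged[i] = changed_by_thread2[i] ? temp2[i] : temp1[i];
    }
    return merged;
}
```

Because each thread writes only to its own buffer and the merge runs after both have finished, the threads need no locking while they compute.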


Multithreading Optimization: Multithreading 2 vs. Original Code, Time-Based Comparison

[Bar chart: Optimized Code vs. Original Code]

99.8% improvement!


Multithreading: Thread Profiler View of Multithreaded Code

[Thread Profiler screenshot; callouts mark our code's 2 threads]

Our program's utilization is high: except at our threads' sync points, both cores are working.


Multithreading: Thread Profiler View of Multithreaded Code

The serial part at the beginning of the program is required to load the image stack and to compute the first energy of the image. After that, our threads begin their computation.


Multithreading: Thread Profiler View of Multithreaded Code

The threads' sync points take little time; for most of the runtime, both threads run simultaneously.


Intel® Compiler

• The Intel compiler did not run on our SIMD configuration (class error).

• We used the Intel compiler on 3 configurations we made and compared its runtimes with the same configurations built with Visual Studio's compiler

Time Comparison - Intel vs. Microsoft (runtime in seconds)

    Configuration        Visual Studio Compiler    Intel Compiler
    code optimization    58.17                     56.95
    thread 1             55.38                     53.91
    thread 2             34.7                      34.83


Intel® Tuning Assistant

• Using Intel's Tuning Assistant, we found no significant areas where our code caused a slowdown

• All events collected by the Tuning Assistant indicate that our optimization is satisfactory

• The store forwarding issue with SIMD was not detected as a "hotspot" because the time consumed by the faulty code was 0.7% of the overall time spent in the entire function (less than 1%)


Optimization Summary: Time Comparison

    Run Type             Time (seconds)
    original             69.34
    code optimization    58.17
    SIMD instructions    68.36
    thread 1             55.38
    thread 2             34.7
    thread1+SIMD         65.33
    thread2+SIMD         33.94


Optimization Summary: Speed Up Comparison (Original is 100%)

    Run Type             Speedup (%)
    original             100
    code optimization    116.109028
    SIMD instructions    101.4133256
    thread 1             120.1326796
    thread 2             149.9567349
    thread1+SIMD         105.7830978
    thread2+SIMD         151.0527834
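The slides do not state how the speedup percentages were computed. The plotted values appear consistent with 100 plus the percentage of runtime saved relative to the original 69.34 seconds, so the sketch below encodes that inferred formula; `speedup_percent` is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>

// Inferred metric (an assumption, not stated on the slides):
// 100% for the original run, plus the percentage of runtime saved.
double speedup_percent(double t_original, double t_run) {
    return 100.0 + 100.0 * (1.0 - t_run / t_original);
}
```

Under this metric, halving the runtime gives 150%, which matches the thread 2 run (34.7 s against 69.34 s, about 149.96%).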


Thank you,

Tal Klein & Omer Manor