Copyright © 2007 Intel Corporation.
Image 2x Shrink SSE implementation,
benchmarking comparison between Merom and Penryn
Dr. Zvi Danovich, Senior Application Engineer
November – December 2007
Agenda
General description of 2x Shrink
Step 1: weights computation
Step 2: components computation
Benchmarks and conclusions
General description
A pixel has 3 components (r, g, b) and a 4th, ‘a’, the weight; each is 1 byte long.
Each pair of pixel lines is interpolated into one line, shortened 2x: 2 opposite pixel pairs are combined into 1 pixel of the shrunk image.
New (interpolated) component: $C = \sum(ca)_{0-3} / \sum(a)_{0-3}$, where ‘c’ is r, g or b.
New weight ‘a’: $A = \min(255, \tfrac{1}{2}\sum(a)_{0-3})$.
Preliminary step: reading (loading) m128i_Ev01, m128i_Ev23, m128i_Od01, m128i_Od23.
[Figure: source even line and source odd line of (r, g, b, a) pixels, loaded as m128i_Ev01, m128i_Ev23 (even line) and m128i_Od01, m128i_Od23 (odd line), and the resulting “shrunk” pixels 0, 1, 2, 3.]
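As a point of comparison for the SSE steps that follow, the two formulas above can be sketched in scalar C. This is only an illustrative reference, not the serial code that was benchmarked: the `Pixel` struct, the function name `shrink_2x2`, the truncating integer division and the choice to return zero components when all four weights are zero are assumptions of the sketch.

```c
#include <stdint.h>

/* One pixel: components r, g, b and the weight a, one byte each. */
typedef struct { uint8_t r, g, b, a; } Pixel;

/* Combine the four source pixels of a 2x2 block into one shrunk pixel:
   C = sum(c_i * a_i) / sum(a_i),  A = min(255, sum(a_i) / 2). */
static Pixel shrink_2x2(const Pixel p[4])
{
    unsigned sum_a = 0, sum_r = 0, sum_g = 0, sum_b = 0;
    for (int i = 0; i < 4; ++i) {
        sum_a += p[i].a;
        sum_r += (unsigned)p[i].r * p[i].a;
        sum_g += (unsigned)p[i].g * p[i].a;
        sum_b += (unsigned)p[i].b * p[i].a;
    }

    Pixel out = {0, 0, 0, 0};
    if (sum_a != 0) {                 /* assumption: all-zero weights give 0 */
        out.r = (uint8_t)(sum_r / sum_a);
        out.g = (uint8_t)(sum_g / sum_a);
        out.b = (uint8_t)(sum_b / sum_a);
    }
    unsigned half = sum_a / 2;
    out.a = (uint8_t)(half > 255 ? 255 : half);
    return out;
}
```

Because each component is a weighted average of byte values, the result always fits in a byte; only the new weight needs the explicit clamp to 255.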
Step 1: weights computation
1.1 Building the partial sums $(a_0+a_1)$, $(a_2+a_3)$, …
Building 8 x 16-bit ‘a’-s by 2 shuffles and a logical ‘or’: m128i_8a = $(a_0, a_1, a_2, a_3, a_4, a_5, a_6, a_7)$, taken from the (r, g, b, a) pixels of the even and odd lines.
Partial sums by MADD with 8 x 16-bit ‘1’-s:
$(a_0, a_1, \dots, a_7)$ MADD $(1, 1, 1, 1, 1, 1, 1, 1)$ = $(a_0+a_1,\ a_2+a_3,\ a_4+a_5,\ a_6+a_7)$
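The partial-sum step can be sketched with SSE2 intrinsics as follows. The slides build the 16-bit ‘a’-s with two shuffles; this sketch swaps in 32-bit shifts instead so that it needs only SSE2, and it assumes each pixel is one 32-bit word with r in the lowest byte and ‘a’ in the highest. The function name and the plain-array interface are hypothetical, chosen to keep the example self-contained.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Assumed layout: one pixel per 32-bit word, r in the lowest byte and the
   weight a in the highest byte. For each of 4 columns, returns
   a(even-line pixel) + a(odd-line pixel) as a 32-bit partial sum. */
static void alpha_pair_sums(const uint32_t even4[4], const uint32_t odd4[4],
                            int32_t sums[4])
{
    __m128i ev = _mm_loadu_si128((const __m128i *)even4);
    __m128i od = _mm_loadu_si128((const __m128i *)odd4);

    /* Even-line weights go to the low 16 bits of each 32-bit lane ... */
    __m128i a_ev = _mm_srli_epi32(ev, 24);
    /* ... and odd-line weights to the high 16 bits. */
    __m128i a_od = _mm_slli_epi32(_mm_srli_epi32(od, 24), 16);
    /* Logical OR yields 8 x 16-bit weights. */
    __m128i a16 = _mm_or_si128(a_ev, a_od);

    /* MADD with ones adds each adjacent 16-bit pair into one 32-bit lane. */
    __m128i part = _mm_madd_epi16(a16, _mm_set1_epi16(1));
    _mm_storeu_si128((__m128i *)sums, part);
}
```

With this arrangement each adjacent pair $(a_0, a_1)$ is one even-line and one odd-line weight of the same column; whether the original code pairs them this way or horizontally is not stated on the slide.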
Step 1: weights computation (cont.)
1.2 Building the sums $(a_0+a_1+a_2+a_3)$, $(a_4+a_5+a_6+a_7)$, … and reciprocals
Perform the same computation for the second pair of pixel quads, obtaining $(a_8+a_9,\ a_{10}+a_{11},\ a_{12}+a_{13},\ a_{14}+a_{15})$.
Building the final sums using HADD on the two vectors of partial sums:
$(\sum(a)_{0-3},\ \sum(a)_{4-7},\ \sum(a)_{8-11},\ \sum(a)_{12-15})$, where $\sum(a)_{0-3} = a_0+a_1+a_2+a_3$.
Converting the result to floating point (FP) and computing the reciprocals:
$(1/\sum(a)_{0-3},\ 1/\sum(a)_{4-7},\ 1/\sum(a)_{8-11},\ 1/\sum(a)_{12-15})$
Here we have 4 FP ‘a’-sum reciprocals: the normalization coefficients.
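A minimal sketch of this sub-step is shown below. HADD on 32-bit lanes (`_mm_hadd_epi32`) is an SSSE3 instruction; to stay within SSE2 the sketch emulates it with two shuffles and an add, which is a substitution and not the instruction the slides use. The slides also do not say whether the reciprocal is an exact division or the approximate `_mm_rcp_ps`; exact division is assumed here, and a zero sum would produce infinity. The helper and function names are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* SSE2 stand-in for _mm_hadd_epi32: {x0+x1, x2+x3, y0+y1, y2+y3}. */
static __m128i hadd_epi32_sse2(__m128i x, __m128i y)
{
    __m128 xf = _mm_castsi128_ps(x), yf = _mm_castsi128_ps(y);
    /* even = {x0, x2, y0, y2}, odd = {x1, x3, y1, y3} */
    __m128i even = _mm_castps_si128(_mm_shuffle_ps(xf, yf, _MM_SHUFFLE(2, 0, 2, 0)));
    __m128i odd  = _mm_castps_si128(_mm_shuffle_ps(xf, yf, _MM_SHUFFLE(3, 1, 3, 1)));
    return _mm_add_epi32(even, odd);
}

/* part01/part23: partial sums of the first and second pair of pixel quads.
   recip receives the 4 normalization coefficients 1 / sum(a). */
static void alpha_sum_reciprocals(const int32_t part01[4],
                                  const int32_t part23[4], float recip[4])
{
    __m128i p01 = _mm_loadu_si128((const __m128i *)part01);
    __m128i p23 = _mm_loadu_si128((const __m128i *)part23);

    __m128i sums  = hadd_epi32_sse2(p01, p23);   /* 4 full sums of 'a'   */
    __m128  fsums = _mm_cvtepi32_ps(sums);       /* integer -> float     */
    /* Assumption: exact division; a sum of 0 yields +infinity. */
    _mm_storeu_ps(recip, _mm_div_ps(_mm_set1_ps(1.0f), fsums));
}
```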
Step 1: weights computation (cont.)
1.3 Building new $A_0, A_1, A_2, A_3$
Computing the new ‘a’: $\min(255, \tfrac{1}{2}\sum a)$
SRAI($((\sum a)_0, (\sum a)_1, (\sum a)_2, (\sum a)_3)$, 1) = $(\tfrac{1}{2}(\sum a)_0,\ \tfrac{1}{2}(\sum a)_1,\ \tfrac{1}{2}(\sum a)_2,\ \tfrac{1}{2}(\sum a)_3)$ (arithmetic shift 1 bit to the right: division by 2)
MIN($(\tfrac{1}{2}(\sum a)_0, \dots, \tfrac{1}{2}(\sum a)_3)$, $(255, 255, 255, 255)$) = $(A_0, A_1, A_2, A_3)$
The 4 x 32-bit results are equivalent to 4 single bytes, as the values are <= 255.
And, finally, a logical shift to the 4th position.
This is the basis of the resulting quad of pixels.
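The three operations of this sub-step map onto SSE2 intrinsics roughly as sketched below. The slides name only "MIN"; a 32-bit integer minimum (`_mm_min_epi32`) is an SSE4.1 instruction, so this sketch assumes the 16-bit `_mm_min_epi16`, which gives the same result here because every halved sum is at most 510 and the upper 16 bits of each lane are zero. The function name and array interface are hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* sum_a: the 4 full 'a' sums (each 0..1020).
   out:   4 x 32-bit words holding A = min(255, sum/2) in the top byte. */
static void new_alpha_words(const int32_t sum_a[4], uint32_t out[4])
{
    /* Precondition for the 16-bit min trick used below. */
    for (int i = 0; i < 4; ++i)
        assert(sum_a[i] >= 0 && sum_a[i] <= 4 * 255);

    __m128i s    = _mm_loadu_si128((const __m128i *)sum_a);
    __m128i half = _mm_srai_epi32(s, 1);            /* division by 2       */
    /* Assumption: 16-bit min stands in for a 32-bit min (see lead-in). */
    __m128i a    = _mm_min_epi16(half, _mm_set1_epi32(255));
    __m128i pos4 = _mm_slli_epi32(a, 24);           /* shift to 4th byte   */
    _mm_storeu_si128((__m128i *)out, pos4);
}
```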
Step 2: components computation
2.1 Computation of 4 ‘b’-s
Building the partial sums $(a_0b_0+a_1b_1)$, $(a_2b_2+a_3b_3)$, …
Building 8 x 16-bit ‘b’-s from the 8 x 8-bit ‘b’-s of the even and odd lines by 2 shuffles and a logical ‘or’: $(b_0, b_1, b_2, b_3, b_4, b_5, b_6, b_7)$.
Partial sums by MADD with the 8 x 16-bit ‘a’-s from the previous step:
$(b_0, \dots, b_7)$ MADD $(a_0, \dots, a_7)$ = $(a_0b_0+a_1b_1,\ a_2b_2+a_3b_3,\ a_4b_4+a_5b_5,\ a_6b_6+a_7b_7)$
Short notation: $(\sum(ab)_{0,1},\ \sum(ab)_{2,3},\ \sum(ab)_{4,5},\ \sum(ab)_{6,7})$
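The weighted partial sums can be sketched in the same way as the weight sums, with the vector of ones replaced by the 16-bit ‘a’-s. As before, shifts and masks stand in for the shuffles on the slide, and the layout (r in the lowest byte, then g, b, and ‘a’ in the highest byte of each 32-bit pixel) is an assumption; the function name is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Assumed layout: one pixel per 32-bit word, bytes r, g, b, a from lowest
   to highest. For each of 4 columns, returns
   a_even * b_even + a_odd * b_odd as a 32-bit partial sum. */
static void weighted_b_pair_sums(const uint32_t even4[4],
                                 const uint32_t odd4[4], int32_t sums[4])
{
    __m128i ev = _mm_loadu_si128((const __m128i *)even4);
    __m128i od = _mm_loadu_si128((const __m128i *)odd4);

    /* 8 x 16-bit weights: even line in low halves, odd line in high halves. */
    __m128i a16 = _mm_or_si128(
        _mm_srli_epi32(ev, 24),
        _mm_slli_epi32(_mm_srli_epi32(od, 24), 16));

    /* 8 x 16-bit 'b'-s in the same arrangement. The odd-line b already sits
       in bits 16..23, i.e. the low byte of the high half, so a mask suffices. */
    __m128i b_ev = _mm_and_si128(_mm_srli_epi32(ev, 16), _mm_set1_epi32(0xFF));
    __m128i b_od = _mm_and_si128(od, _mm_set1_epi32(0x00FF0000));
    __m128i b16  = _mm_or_si128(b_ev, b_od);

    /* MADD multiplies 16-bit pairs and adds adjacent products (32-bit). */
    _mm_storeu_si128((__m128i *)sums, _mm_madd_epi16(b16, a16));
}
```

Each product is at most 255 x 255, so the sum of two products fits comfortably in the signed 32-bit lane that MADD produces.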
Step 2: components computation
2.2 Building the sums $(a_0b_0+a_1b_1+a_2b_2+a_3b_3)$, … and final results in FP form
Perform the same computation for the second pair of pixel quads, obtaining $(\sum(ab)_{8,9},\ \sum(ab)_{10,11},\ \sum(ab)_{12,13},\ \sum(ab)_{14,15})$.
Building the final NON-normalized interpolation sums using HADD:
$(\sum(ab)_{0-3},\ \sum(ab)_{4-7},\ \sum(ab)_{8-11},\ \sum(ab)_{12-15})$
Converting the result to floating point (cvtepi32_ps) and normalizing by multiplication (mul_ps) with the ‘a’-sum reciprocals from Step 1:
$B_0 = \sum(ab)_{0-3} \cdot 1/\sum(a)_{0-3}$, $B_1 = \sum(ab)_{4-7} \cdot 1/\sum(a)_{4-7}$, $B_2 = \sum(ab)_{8-11} \cdot 1/\sum(a)_{8-11}$, $B_3 = \sum(ab)_{12-15} \cdot 1/\sum(a)_{12-15}$
Here we have 4 final ‘b’ values in FP form: $(B_0, B_1, B_2, B_3)$.
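The normalization itself is just the two intrinsics named on the slide. The sketch below assumes the HADD step has already produced the four non-normalized sums; the function name and the plain-array interface are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* sum_ab:  4 non-normalized sums of a*b (after HADD).
   recip_a: 4 reciprocals 1 / sum(a) from Step 1.
   out:     4 normalized component values in floating point. */
static void normalize_component(const int32_t sum_ab[4],
                                const float recip_a[4], float out[4])
{
    __m128i s = _mm_loadu_si128((const __m128i *)sum_ab);
    __m128  r = _mm_loadu_ps(recip_a);
    __m128  f = _mm_cvtepi32_ps(s);        /* integer sums -> float   */
    _mm_storeu_ps(out, _mm_mul_ps(f, r));  /* multiply by reciprocals */
}
```

Computing the reciprocals once in Step 1 means each of the three colour components costs one multiplication here, with no further division.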
Step 2: components computation
2.3 Building new $B_0, B_1, B_2, B_3$
Conversion of the new ‘B’-s to integer form (cvtps_epi32). The 4 x 32-bit results are equivalent to 4 single bytes, as the values are <= 255.
Logical shift to the 3rd position and logical sum (OR) with the quad of ‘A’-s from the previous step:
$(B_0, B_1, B_2, B_3)$ OR $(A_0, A_1, A_2, A_3)$ = $(B_0A_0,\ B_1A_1,\ B_2A_2,\ B_3A_3)$
Future resulting quad of pixels: A and B are ready.
Step 2: components computation
2.4-2.9 Building new quads of G and R and summing final results
Perform sub-steps 2.1-2.3 for the ‘G’-s and ‘R’-s; the ‘G’-s quad is shifted to the 2nd position before the logical sum, and the ‘R’-s quad is not shifted.
$(R_0, R_1, R_2, R_3)$ OR $(G_0, G_1, G_2, G_3)$ OR $(B_0A_0,\ B_1A_1,\ B_2A_2,\ B_3A_3)$ = $(R_0G_0B_0A_0,\ R_1G_1B_1A_1,\ R_2G_2B_2A_2,\ R_3G_3B_3A_3)$
This final quad of pixels is stored in the resulting image.
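The final assembly can be sketched as one function that converts the three FP component quads to integers, shifts each to its byte position and ORs them with the ‘A’ quad. It assumes the ‘A’ words already hold the weight in the top byte (as after the shift in 1.3), that every FP value lies in 0..255, and that the default round-to-nearest mode applies to `_mm_cvtps_epi32`. The function name and array interface are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* r, g, b: normalized components in FP form, each assumed within 0..255.
   a_words: 4 words with the new weight A already in the top byte.
   out:     4 packed pixels, bytes r, g, b, a from lowest to highest. */
static void pack_quad(const float r[4], const float g[4], const float b[4],
                      const uint32_t a_words[4], uint32_t out[4])
{
    /* FP -> 32-bit integers (rounded per the current rounding mode). */
    __m128i R = _mm_cvtps_epi32(_mm_loadu_ps(r));   /* 1st position: no shift */
    __m128i G = _mm_cvtps_epi32(_mm_loadu_ps(g));
    __m128i B = _mm_cvtps_epi32(_mm_loadu_ps(b));
    __m128i A = _mm_loadu_si128((const __m128i *)a_words);

    G = _mm_slli_epi32(G, 8);                       /* 2nd position */
    B = _mm_slli_epi32(B, 16);                      /* 3rd position */

    __m128i px = _mm_or_si128(_mm_or_si128(R, G), _mm_or_si128(B, A));
    _mm_storeu_si128((__m128i *)out, px);
}
```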
Benchmarks and conclusions
Benchmarking (1 thread)
Merom core - WC, 2.66GHz
Penryn core – HPTN, 2.88GHz
[Chart: Ser Time and Vector Time bars for each core; surviving value labels: 6.4 and 0.7.]
Speed-up on Penryn (7.0x) is 1.5x better than on Merom (4.6x).
It is close to the theoretical limit for 8-16-bit vector operations!
VTune CPI = 0.78
VTune CPI = 0.46
Overall speed-up Penryn(Vector)/Merom(Ser) = 8.1x