JPEG COMPRESSION ALGORITHM IN CUDA
Group Members: Pranit Patel, Manisha Tatikonda, Jeff Wong, Jarek Marczewski
Date: April 14, 2009
OUTLINE
- Motivation
- JPEG Algorithm
- Design Approach in CUDA
- Benchmark
- Conclusion
MOTIVATION
- Growth of digital imaging applications
- Need for an effective algorithm for image and video compression
- Loss of image information must be minimal
- JPEG is a lossy compression algorithm that reduces file size without visibly affecting image quality
- The human eye perceives small changes in brightness more readily than small changes in color
JPEG ALGORITHM
Step 1: Divide the sample image into 8x8 blocks
Step 2: Apply the DCT
- The DCT is applied to each block, replacing the actual color values with coefficients expressed relative to the average of the entire block
- This step does not itself compress the file
In general:
- A simple color space model stores [R, G, B] per pixel
- JPEG uses the [Y, Cb, Cr] model:
  Y = brightness, Cb = blueness, Cr = redness
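The [Y, Cb, Cr] split can be sketched with the standard JFIF conversion; the exact coefficients below are the usual BT.601 values, an assumption rather than something taken from the slides:

```python
# Sketch of the JPEG color model: convert one 8-bit RGB pixel to
# [Y, Cb, Cr] using the standard JFIF/BT.601 coefficients.
def rgb_to_ycbcr(r, g, b):
    y  =  0.299  * r + 0.587  * g + 0.114  * b        # brightness
    cb = -0.1687 * r - 0.3313 * g + 0.5    * b + 128  # "blueness"
    cr =  0.5    * r - 0.4187 * g - 0.0813 * b + 128  # "redness"
    return y, cb, cr

# A neutral gray pixel carries no chroma: Cb and Cr stay at the 128 midpoint.
print(rgb_to_ycbcr(128, 128, 128))
```

Because the eye is less sensitive to Cb/Cr than to Y, the chroma planes can later be compressed more aggressively.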
JPEG ALGORITHM
Step 3: Quantization
- The first compression step
- Each DCT coefficient is divided by its corresponding constant in the quantization table and rounded off to the nearest integer
- As a result, smaller, unimportant coefficients are replaced by zeros and larger coefficients lose precision. It is this rounding-off that causes the loss in image quality.
Step 4: Apply Huffman Encoding
- Huffman encoding is applied to the quantized DCT coefficients to reduce the image size further
Step 5: Decoder
- The JPEG decoder consists of: Huffman decoding, de-quantization, IDCT
DCT and IDCT
Discrete Cosine Transform
- Separable transform: the 2D DCT is performed in a two-pass approach, one 1D pass for the horizontal direction and one for the vertical direction (1st pass, then 2nd pass)
- The DCT translates into a matrix multiplication; pre-calculated cosine values are stored as a constant array
- The inverse DCT is calculated in the same way, only with the transposed cosine matrix
[Figure: 8x8 pixel block P (P00 ... P77) multiplied by the 8x8 cosine coefficient matrix C (C00 ... C77)]
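As a CPU reference for the matrix formulation above, here is a small sketch (plain Python, not the group's CUDA code) that builds the 8x8 orthonormal DCT-II cosine matrix and applies the two-pass transform; the inverse simply uses the transpose:

```python
import math

N = 8

# Pre-calculated cosine values, the analogue of the constant array the
# slides keep on the GPU. C is the orthonormal DCT-II matrix.
C = [[(math.sqrt(1.0 / N) if u == 0 else math.sqrt(2.0 / N))
      * math.cos((2 * x + 1) * u * math.pi / (2 * N))
      for x in range(N)] for u in range(N)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]

def transpose(A):
    return [list(col) for col in zip(*A)]

def dct2(P):
    # Two 1D passes: rows first (C * P), then columns (result * C^T).
    return matmul(matmul(C, P), transpose(C))

def idct2(F):
    # The inverse uses the transposed cosine matrix.
    return matmul(matmul(transpose(C), F), C)

# A flat block concentrates all of its energy in the DC coefficient.
flat = [[100.0] * N for _ in range(N)]
F = dct2(flat)
```

With an orthonormal C, the round trip idct2(dct2(P)) recovers the block exactly (up to floating-point error), which is what makes the transform itself lossless.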
DCT CUDA Implementation
- Every thread within each block performs the same number of calculations
- Each thread multiplies and accumulates eight elements
[Figure: the same P x C matrix product, highlighting the row of P and the column of C read by the thread at Thread.x = 2, Thread.y = 3]
DCT Grid and Block
Two methods were tried:
- Each thread block processes 1 macro block (64 threads)
- Each thread block processes 8 macro blocks (512 threads)
[Figure: 8x8 macro blocks laid out along x and y]
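The per-thread work described above (one output element per thread, eight multiply-accumulates each) can be emulated on the CPU like this; the loop structure illustrates the scheme, it is not the group's kernel:

```python
N = 8

def dct_pass_emulated(P, C):
    """Emulate one DCT matrix-multiply pass: the 'thread' at (y, x)
    multiplies and accumulates eight elements, a row of P against a
    column of C, so every thread does the same amount of work."""
    out = [[0.0] * N for _ in range(N)]
    for y in range(N):              # threadIdx.y
        for x in range(N):          # threadIdx.x
            acc = 0.0
            for k in range(N):      # eight multiply-accumulates
                acc += P[y][k] * C[k][x]
            out[y][x] = acc
    return out
```

Because every thread touches the same number of elements, the 64 threads of a macro block stay in lockstep, which is why the 8-macro-block (512-thread) variant simply replicates this pattern across blocks.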
DCT and IDCT GPU Results
[Chart: GPU run time (ms), 0 to 5, vs. raster size (262144, 786432, and 4194304 pixels, i.e. 512x512, 1024x768, and 2048x2048), for DCT with 1 macro block, DCT with 8 macro blocks, and IDCT with 1 macro block]
DCT Results
[Chart: DCT GPU performance gain over CPU, 0 to 25x, for raster sizes 512x512, 1024x768, and 2048x2048, comparing the 1 macro block and 8 macro block configurations]
IDCT Results
[Chart: IDCT GPU performance gain over CPU, roughly 20x to 25x, for raster sizes 512x512, 1024x768, and 2048x2048]
Quantization
- Break the image into 8x8 blocks; an 8x8 quantization matrix is applied to each block
- Each coefficient is divided by the corresponding quantization value and rounded to the nearest integer; the decoder multiplies by the same value to restore the magnitude
CUDA Programming
- Method 1: exact implementation as on the CPU
- Method 2: shared memory used to copy the 8x8 image block
- Method 3: load the divided values into shared memory
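The quantize/de-quantize round trip described above can be sketched as follows (a plain reference version in the spirit of Method 1; the toy table is illustrative, not the JPEG standard table):

```python
def quantize(block, qtable):
    # Divide each DCT coefficient by its table entry and round to the
    # nearest integer: small coefficients collapse to zero (the lossy step).
    return [[round(block[i][j] / qtable[i][j]) for j in range(8)]
            for i in range(8)]

def dequantize(qblock, qtable):
    # Decoder side: multiply back; the rounding loss is not recovered.
    return [[qblock[i][j] * qtable[i][j] for j in range(8)]
            for i in range(8)]

qtable = [[10] * 8 for _ in range(8)]        # toy uniform table
block = [[0.0] * 8 for _ in range(8)]
block[0][0], block[0][1] = 803.0, 4.0        # one large, one small coefficient
q = quantize(block, qtable)                  # 803 -> 80, 4 -> 0
```

The small coefficient becomes zero and the large one loses its low-order precision (803 comes back as 800), which is exactly the loss described in Step 3.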
Quantization CUDA Results
[Chart: quantization run time in ms, 0 to 1.5, for image resolutions 512x512, 1024x768, and 2048x2048, comparing Method 1, Method 2, and Method 3]
Quantization CPU vs. GPU Results
[Chart: quantization speedup over CPU (x faster than CPU), 0 to 350, for resolutions 512x512, 1024x768, and 2048x2048, comparing xCPU-1, xCPU-2, and xCPU-3]
Tabulated Results for Quantization
- Method 2 and Method 3 have similar performance on small image sizes
- Method 3 might perform better on images bigger than 2048x2048
- Quantization is ~70x faster for the first method, and more so as resolution increases
- Quantization is ~180x faster for Methods 2 and 3, and much more so as resolution increases

Resolution    Method 1  Method 2  Method 3  CPU     xCPU-1   xCPU-2   xCPU-3
512x512       0.102     0.039     0.039     7.37    72.25    188.97   188.97
1024x768      0.274     0.085     0.085     22      80.29    258.82   258.82
2048x2048     1.39      0.379     0.36      110     79.14    290.24   305.56
(times in ms)
Huffman Encode/Decode
Huffman Encoding Basics
- Utilizes the frequency of each symbol
- Lossless compression
- Uses a VARIABLE-length code for each symbol
Example frequencies: A = 7, B = 3, C = 3, D = 2, E = 1
[Figure: Huffman tree built from these frequencies, with internal node weights and 0/1 labels on the edges]
Symbol  Encoding  Length
A       0         1
B       11        2
C       100       3
D       1010      4
E       1011      4

Pipeline: IMAGE -> Get Symbol Freq -> Build Tree -> Build Table
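The Get Symbol Freq -> Build Tree -> Build Table pipeline can be sketched with a standard heap-based Huffman construction. Note the exact 0/1 code words depend on tie-breaking, so they may differ from the slide's table while having the same total cost:

```python
import heapq

def build_huffman(freqs):
    """Build a variable-length prefix code from symbol frequencies."""
    # Heap entries are (weight, tiebreak, node); a node is either a
    # symbol or a (left, right) pair of subtrees.
    heap = [(w, i, s) for i, (s, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # take the two lightest
        w2, _, right = heapq.heappop(heap)   # subtrees...
        heapq.heappush(heap, (w1 + w2, tie, (left, right)))  # ...and merge
        tie += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")      # left edge labelled 0
            walk(node[1], prefix + "1")      # right edge labelled 1
        else:
            codes[node] = prefix
    walk(heap[0][2], "")
    return codes

# The slide's example frequencies.
codes = build_huffman({"A": 7, "B": 3, "C": 3, "D": 2, "E": 1})
```

Whatever the tie-breaking, the most frequent symbol A gets a 1-bit code and the weighted total length is minimal, matching the slide's table cost.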
Challenges
- Encoding is a very, very serial process
- Variable symbol lengths are a problem
- Encoding: we don't know where a symbol needs to be written until all preceding symbols are encoded
- Decoding: we don't know where symbols start
DECODING
Example: the message B C D A E A A B (8 bytes) corresponds to the encoded bitstream 11100101 00101100 110 (bytes 0-2, i.e. 3 bytes).
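With the code table above (A = 0, B = 11, C = 100, D = 1010, E = 1011), the 8-byte example message can be encoded in a few lines:

```python
# Code table from the slides.
codes = {"A": "0", "B": "11", "C": "100", "D": "1010", "E": "1011"}

def encode(message):
    # Concatenate the variable-length code word of each symbol.
    return "".join(codes[s] for s in message)

bits = encode("BCDAEAAB")
print(bits)   # 18 bits, which fit in 3 bytes once padded, versus 8 input bytes
```

The concatenation step is what makes encoding serial: a symbol's bit position is not known until every earlier symbol has been encoded.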
ENCODING
STEP 1: Image -> encode each byte
STEP 2: combine the encoded symbols -> encoded image
CAPTAIN OBVIOUS SAYS: that's O(N log N); the CPU does it in O(N).
DECODING
- Decoding: we don't know where symbols start
- Redundant calculation is needed
- Uses a decoding table rather than a tree
- Decode, then shift by n bits
STEP 1: Divide the bitstream into overlapping 65-byte segments. Run 8 threads on each segment, each with a different starting bit position.
[Figure: the same bitstream decoded from 8 shifted starting positions, Thid = 0 through Thid = 7]

Encoding  Symbol  Length
XXX0      A       1
XX11      B       2
X110      C       3
1010      D       4
1011      E       4
(X = don't-care bit)
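The step-1 idea can be sketched as follows: every "thread" greedily decodes the same bitstream from a different starting bit offset, and only offsets that land on a real symbol boundary produce the right stream. This sketch uses the MSB-first code table from the encoding slide rather than the reversed-bit table above, which is an assumption about that table's layout:

```python
codes = {"A": "0", "B": "11", "C": "100", "D": "1010", "E": "1011"}
decode_table = {v: k for k, v in codes.items()}
MAX_LEN = 4  # longest code word in the table

def decode_from(bits, start):
    """Greedily decode `bits` from bit offset `start`; return the symbol
    list, or None if no code word matches at some position."""
    out, i = [], start
    while i < len(bits):
        for n in range(1, MAX_LEN + 1):   # try code words shortest-first
            sym = decode_table.get(bits[i:i + n])
            if sym is not None:
                out.append(sym)
                i += n
                break
        else:
            return None  # misaligned offset: nothing matched here
    return out

bits = "111001010010110011"  # the encoded B C D A E A A B example
# One speculative decode per candidate offset, as with the 8 threads.
results = {t: decode_from(bits, t) for t in range(8)}
```

Only the correctly aligned offset (0 here) reproduces the original message; step 2's job is to work out which offsets those are and discard the rest.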
DECODING STEP 2: Determine which threads are valid; throw away the others.
[Figure: table of candidate threads 0-F across segment offsets 0, 64, 128, 196, with the valid threads 6, 9, E highlighted]
DECODING - challenges
- Each segment consumes a fixed number of encoded bits but produces variable-length decoded output: 64 bits can result in up to 64 bytes of output (memory explosion)
- Memory addresses for the input do not advance in a fixed pattern relative to the output addresses (memory collisions)
- The decoding table doesn't fit into one address line
- Combining the threads' outputs is serial
NOTE: To simplify the algorithm, the maximum symbol length was assumed to be 8 bits (it didn't help much).
Huffman Results
Encoding:
- Step one is very fast: ~100x speedup
- Step two: the algorithm is wrong, so no results
Decoding:
- 3 times slower than the classic CPU method
- Using shared memory for the code table resolved only some conflicts (5x slower -> 4x slower)
- Conflicts remain on either the input bitstream or the output data
- Moving 65-byte chunks to shared memory and sharing them between 8 threads didn't help much (4x slower -> 3x slower)
ENCODING should be left to the CPU.
Conclusion & Results
Results
CPU run time (ms):
              512x512   1024x768   2048x2048
DCT           3.38      11.05      57.12
Quantization  5.74      17.16      75.97
IDCT          3.34      10.49      56.5

GPU run time (ms):
              512x512   1024x768   2048x2048
DCT           0.191     0.47       2.7
Quantization  0.039     0.085      0.379
IDCT          0.171     0.436      2.145

Performance gain (x):
              512x512   1024x768   2048x2048
DCT           17.70     23.51      21.16
Quantization  147.18    201.88     200.45
IDCT          19.53     24.06      26.34
Performance Gain
- DCT and IDCT are the major consumers of computation time
- Computation increases with resolution
- Total processing time for a 2K image is 5.224 ms on the GPU versus 189.59 ms on the CPU: a speedup of 36x
Performance Gain
[Chart: speedup over CPU, 0 to 250, by resolution (512x512, 1024x768, 2048x2048) for DCT, Quantization, and IDCT]
GPU Performance
- DCT and IDCT still take up the majority of the computation cycles, but reduced by about two orders of magnitude (x100)
- Processing a 2K-resolution image takes ~7 ms on the GPU, compared to ~900 ms on the CPU
GPU Performance
[Chart: time in ms, 0 to 3, for the JPEG components DCT, Quantization, and IDCT at 512x512, 1024x768, and 2048x2048 on the GPU]
Conclusion
- The CUDA implementation of the transform and quantization is much faster than the CPU (36x faster)
- The Huffman algorithm does not parallelize well; final results show it 3x slower than the CPU
- The GPU architecture is well optimized for image- and video-related processing
- High-performance applications: interframe coding, HD-resolution/real-time video compression and decompression
Conclusion – Image Quality
[Slides: side-by-side CPU vs. GPU output at 1024x768, 2048x2048, and 512x512]