JPEG COMPRESSION ALGORITHM IN CUDA
Group Members: Pranit Patel, Manisha Tatikonda, Jeff Wong, Jarek Marczewski
Date: April 14, 2009
OUTLINE
- Motivation
- JPEG Algorithm
- Design Approach in CUDA
- Benchmark
- Conclusion
MOTIVATION
- Growth of digital imaging applications
- Need for an effective algorithm for image and video compression
- Loss of image information must be minimal
- JPEG is a lossy compression algorithm that reduces file size without visibly affecting image quality
- The human eye perceives small changes in brightness more readily than small changes in color
JPEG ALGORITHM
Step 1: Divide the sample image into 8x8 blocks
Step 2: Apply the DCT
- The DCT is applied to each block, replacing the actual color values with coefficients expressed relative to the average of the entire block
- This step does not itself compress the file
In general:
- A simple color space model stores [R, G, B] per pixel
- JPEG uses the [Y, Cb, Cr] model:
  Y = brightness, Cb = blueness, Cr = redness
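The [Y, Cb, Cr] split can be sketched with the standard JFIF conversion; the exact coefficients below are the usual BT.601 values, an assumption rather than something taken from the slides:

```python
# Sketch of the JPEG color model: convert one 8-bit RGB pixel to
# [Y, Cb, Cr] using the standard JFIF/BT.601 coefficients.
def rgb_to_ycbcr(r, g, b):
    y  =  0.299  * r + 0.587  * g + 0.114  * b        # brightness
    cb = -0.1687 * r - 0.3313 * g + 0.5    * b + 128  # "blueness"
    cr =  0.5    * r - 0.4187 * g - 0.0813 * b + 128  # "redness"
    return y, cb, cr

# A neutral gray pixel carries no chroma: Cb and Cr stay at the 128 midpoint.
print(rgb_to_ycbcr(128, 128, 128))
```

Because the eye is less sensitive to Cb/Cr than to Y, the chroma planes can later be compressed more aggressively.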
JPEG ALGORITHM
Step 3: Quantization
- The first compression step
- Each DCT coefficient is divided by its corresponding constant in the quantization table and rounded off to the nearest integer
- As a result, smaller, unimportant coefficients are replaced by zeros and larger coefficients lose precision. It is this rounding-off that causes the loss in image quality.
Step 4: Apply Huffman Encoding
- Huffman encoding is applied to the quantized DCT coefficients to reduce the image size further
Step 5: Decoder
- The JPEG decoder consists of: Huffman decoding, de-quantization, IDCT
DCT and IDCT
Discrete Cosine Transform
- Separable transform: the 2D DCT is performed in a two-pass approach, one 1D pass for the horizontal direction and one for the vertical direction (1st pass, then 2nd pass)
- The DCT translates into a matrix multiplication; pre-calculated cosine values are stored as a constant array
- The inverse DCT is calculated in the same way, only with the transposed cosine matrix
[Figure: 8x8 pixel block P (P00 ... P77) multiplied by the 8x8 cosine coefficient matrix C (C00 ... C77)]
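As a CPU reference for the matrix formulation above, here is a small sketch (plain Python, not the group's CUDA code) that builds the 8x8 orthonormal DCT-II cosine matrix and applies the two-pass transform; the inverse simply uses the transpose:

```python
import math

N = 8

# Pre-calculated cosine values, the analogue of the constant array the
# slides keep on the GPU. C is the orthonormal DCT-II matrix.
C = [[(math.sqrt(1.0 / N) if u == 0 else math.sqrt(2.0 / N))
      * math.cos((2 * x + 1) * u * math.pi / (2 * N))
      for x in range(N)] for u in range(N)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]

def transpose(A):
    return [list(col) for col in zip(*A)]

def dct2(P):
    # Two 1D passes: rows first (C * P), then columns (result * C^T).
    return matmul(matmul(C, P), transpose(C))

def idct2(F):
    # The inverse uses the transposed cosine matrix.
    return matmul(matmul(transpose(C), F), C)

# A flat block concentrates all of its energy in the DC coefficient.
flat = [[100.0] * N for _ in range(N)]
F = dct2(flat)
```

With an orthonormal C, the round trip idct2(dct2(P)) recovers the block exactly (up to floating-point error), which is what makes the transform itself lossless.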
DCT CUDA Implementation
- Every thread within each block performs the same number of calculations
- Each thread multiplies and accumulates eight elements
[Figure: the same P x C matrix product, highlighting the row of P and the column of C read by the thread at Thread.x = 2, Thread.y = 3]
DCT Grid and Block
Two methods were tried:
- Each thread block processes 1 macro block (64 threads)
- Each thread block processes 8 macro blocks (512 threads)
[Figure: 8x8 macro blocks laid out along x and y]
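The per-thread work described above (one output element per thread, eight multiply-accumulates each) can be emulated on the CPU like this; the loop structure illustrates the scheme, it is not the group's kernel:

```python
N = 8

def dct_pass_emulated(P, C):
    """Emulate one DCT matrix-multiply pass: the 'thread' at (y, x)
    multiplies and accumulates eight elements, a row of P against a
    column of C, so every thread does the same amount of work."""
    out = [[0.0] * N for _ in range(N)]
    for y in range(N):              # threadIdx.y
        for x in range(N):          # threadIdx.x
            acc = 0.0
            for k in range(N):      # eight multiply-accumulates
                acc += P[y][k] * C[k][x]
            out[y][x] = acc
    return out
```

Because every thread touches the same number of elements, the 64 threads of a macro block stay in lockstep, which is why the 8-macro-block (512-thread) variant simply replicates this pattern across blocks.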
DCT and IDCT GPU Results
[Chart: GPU run time (ms), 0 to 5, vs. raster size (262144, 786432, and 4194304 pixels, i.e. 512x512, 1024x768, and 2048x2048), for DCT with 1 macro block, DCT with 8 macro blocks, and IDCT with 1 macro block]
DCT Results
[Chart: DCT GPU performance gain over CPU, 0 to 25x, for raster sizes 512x512, 1024x768, and 2048x2048, comparing the 1 macro block and 8 macro block configurations]
IDCT Results
[Chart: IDCT GPU performance gain over CPU, roughly 20x to 25x, for raster sizes 512x512, 1024x768, and 2048x2048]
Quantization
- Break the image into 8x8 blocks; an 8x8 quantization matrix is applied to each block
- Each coefficient is divided by the corresponding quantization value and rounded to the nearest integer; the decoder multiplies by the same value to restore the magnitude
CUDA Programming
- Method 1: exact implementation as on the CPU
- Method 2: shared memory used to copy the 8x8 image block
- Method 3: load the divided values into shared memory
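The quantize/de-quantize round trip described above can be sketched as follows (a plain reference version in the spirit of Method 1; the toy table is illustrative, not the JPEG standard table):

```python
def quantize(block, qtable):
    # Divide each DCT coefficient by its table entry and round to the
    # nearest integer: small coefficients collapse to zero (the lossy step).
    return [[round(block[i][j] / qtable[i][j]) for j in range(8)]
            for i in range(8)]

def dequantize(qblock, qtable):
    # Decoder side: multiply back; the rounding loss is not recovered.
    return [[qblock[i][j] * qtable[i][j] for j in range(8)]
            for i in range(8)]

qtable = [[10] * 8 for _ in range(8)]        # toy uniform table
block = [[0.0] * 8 for _ in range(8)]
block[0][0], block[0][1] = 803.0, 4.0        # one large, one small coefficient
q = quantize(block, qtable)                  # 803 -> 80, 4 -> 0
```

The small coefficient becomes zero and the large one loses its low-order precision (803 comes back as 800), which is exactly the loss described in Step 3.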
Quantization CUDA Results
[Chart: quantization run time in ms, 0 to 1.5, for image resolutions 512x512, 1024x768, and 2048x2048, comparing Method 1, Method 2, and Method 3]
Quantization CPU vs. GPU Results
[Chart: quantization speedup over CPU (x faster than CPU), 0 to 350, for resolutions 512x512, 1024x768, and 2048x2048, comparing xCPU-1, xCPU-2, and xCPU-3]
Tabulated Results for Quantization
- Method 2 and Method 3 have similar performance on small image sizes
- Method 3 might perform better on images bigger than 2048x2048
- Quantization is ~70x faster for the first method, and more so as resolution increases
- Quantization is ~180x faster for Methods 2 and 3, and much more so as resolution increases

Resolution    Method 1  Method 2  Method 3  CPU     xCPU-1   xCPU-2   xCPU-3
512x512       0.102     0.039     0.039     7.37    72.25    188.97   188.97
1024x768      0.274     0.085     0.085     22      80.29    258.82   258.82
2048x2048     1.39      0.379     0.36      110     79.14    290.24   305.56
(times in ms)
Huffman Encode/Decode
Huffman Encoding Basics
- Utilizes the frequency of each symbol
- Lossless compression
- Uses a VARIABLE-length code for each symbol
Example frequencies: A = 7, B = 3, C = 3, D = 2, E = 1
[Figure: Huffman tree built from these frequencies, with internal node weights and 0/1 labels on the edges]
Symbol  Encoding  Length
A       0         1
B       11        2
C       100       3
D       1010      4
E       1011      4

Pipeline: IMAGE -> Get Symbol Freq -> Build Tree -> Build Table
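The Get Symbol Freq -> Build Tree -> Build Table pipeline can be sketched with a standard heap-based Huffman construction. Note the exact 0/1 code words depend on tie-breaking, so they may differ from the slide's table while having the same total cost:

```python
import heapq

def build_huffman(freqs):
    """Build a variable-length prefix code from symbol frequencies."""
    # Heap entries are (weight, tiebreak, node); a node is either a
    # symbol or a (left, right) pair of subtrees.
    heap = [(w, i, s) for i, (s, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # take the two lightest
        w2, _, right = heapq.heappop(heap)   # subtrees...
        heapq.heappush(heap, (w1 + w2, tie, (left, right)))  # ...and merge
        tie += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")      # left edge labelled 0
            walk(node[1], prefix + "1")      # right edge labelled 1
        else:
            codes[node] = prefix
    walk(heap[0][2], "")
    return codes

# The slide's example frequencies.
codes = build_huffman({"A": 7, "B": 3, "C": 3, "D": 2, "E": 1})
```

Whatever the tie-breaking, the most frequent symbol A gets a 1-bit code and the weighted total length is minimal, matching the slide's table cost.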
Challenges
- Encoding is a very, very serial process
- Variable symbol lengths are a problem
- Encoding: we don't know where a symbol needs to be written until all preceding symbols are encoded
- Decoding: we don't know where symbols start
DECODING
Example: the message B C D A E A A B (8 bytes) corresponds to the encoded bitstream 11100101 00101100 110 (bytes 0-2, i.e. 3 bytes).
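With the code table above (A = 0, B = 11, C = 100, D = 1010, E = 1011), the 8-byte example message can be encoded in a few lines:

```python
# Code table from the slides.
codes = {"A": "0", "B": "11", "C": "100", "D": "1010", "E": "1011"}

def encode(message):
    # Concatenate the variable-length code word of each symbol.
    return "".join(codes[s] for s in message)

bits = encode("BCDAEAAB")
print(bits)   # 18 bits, which fit in 3 bytes once padded, versus 8 input bytes
```

The concatenation step is what makes encoding serial: a symbol's bit position is not known until every earlier symbol has been encoded.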
ENCODING
STEP 1: Image -> encode each byte
STEP 2: combine the encoded symbols -> encoded image
CAPTAIN OBVIOUS SAYS: that's O(N log N); the CPU does it in O(N).
DECODING
- Decoding: we don't know where symbols start
- Redundant calculation is needed
- Uses a decoding table rather than a tree
- Decode, then shift by n bits
STEP 1: Divide the bitstream into overlapping 65-byte segments. Run 8 threads on each segment, each with a different starting bit position.
[Figure: the same bitstream decoded from 8 shifted starting positions, Thid = 0 through Thid = 7]

Encoding  Symbol  Length
XXX0      A       1
XX11      B       2
X110      C       3
1010      D       4
1011      E       4
(X = don't-care bit)
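The step-1 idea can be sketched as follows: every "thread" greedily decodes the same bitstream from a different starting bit offset, and only offsets that land on a real symbol boundary produce the right stream. This sketch uses the MSB-first code table from the encoding slide rather than the reversed-bit table above, which is an assumption about that table's layout:

```python
codes = {"A": "0", "B": "11", "C": "100", "D": "1010", "E": "1011"}
decode_table = {v: k for k, v in codes.items()}
MAX_LEN = 4  # longest code word in the table

def decode_from(bits, start):
    """Greedily decode `bits` from bit offset `start`; return the symbol
    list, or None if no code word matches at some position."""
    out, i = [], start
    while i < len(bits):
        for n in range(1, MAX_LEN + 1):   # try code words shortest-first
            sym = decode_table.get(bits[i:i + n])
            if sym is not None:
                out.append(sym)
                i += n
                break
        else:
            return None  # misaligned offset: nothing matched here
    return out

bits = "111001010010110011"  # the encoded B C D A E A A B example
# One speculative decode per candidate offset, as with the 8 threads.
results = {t: decode_from(bits, t) for t in range(8)}
```

Only the correctly aligned offset (0 here) reproduces the original message; step 2's job is to work out which offsets those are and discard the rest.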
DECODING STEP 2: Determine which threads are valid; throw away the others.
[Figure: table of candidate threads 0-F across segment offsets 0, 64, 128, 196, with the valid threads 6, 9, E highlighted]
DECODING - challenges
- Each segment consumes a fixed number of encoded bits but produces variable-length decoded output: 64 bits can result in up to 64 bytes of output (memory explosion)
- Memory addresses for the input do not advance in a fixed pattern relative to the output addresses (memory collisions)
- The decoding table doesn't fit into one address line
- Combining the threads' outputs is serial
NOTE: To simplify the algorithm, the maximum symbol length was assumed to be 8 bits (it didn't help much).
Huffman Results
Encoding:
- Step one is very fast: ~100x speedup
- Step two: the algorithm is wrong, so no results
Decoding:
- 3 times slower than the classic CPU method
- Using shared memory for the code table resolved only some conflicts (5x slower -> 4x slower)
- Conflicts remain on either the input bitstream or the output data
- Moving 65-byte chunks to shared memory and sharing them between 8 threads didn't help much (4x slower -> 3x slower)
ENCODING should be left to the CPU.
Conclusion & Results
Results
CPU run time (ms):
              512x512   1024x768   2048x2048
DCT           3.38      11.05      57.12
Quantization  5.74      17.16      75.97
IDCT          3.34      10.49      56.5

GPU run time (ms):
              512x512   1024x768   2048x2048
DCT           0.191     0.47       2.7
Quantization  0.039     0.085      0.379
IDCT          0.171     0.436      2.145

Performance gain (x):
              512x512   1024x768   2048x2048
DCT           17.70     23.51      21.16
Quantization  147.18    201.88     200.45
IDCT          19.53     24.06      26.34
Performance Gain
- DCT and IDCT are the major consumers of computation time
- Computation increases with resolution
- Total processing time for a 2K image is 5.224 ms on the GPU versus 189.59 ms on the CPU: a speedup of 36x
Performance Gain
[Chart: speedup over CPU, 0 to 250, by resolution (512x512, 1024x768, 2048x2048) for DCT, Quantization, and IDCT]
GPU Performance
- DCT and IDCT still take up the majority of the computation cycles, but reduced by about two orders of magnitude (x100)
- Processing a 2K-resolution image takes ~7 ms on the GPU, compared to ~900 ms on the CPU
GPU Performance
[Chart: time in ms, 0 to 3, for the JPEG components DCT, Quantization, and IDCT at 512x512, 1024x768, and 2048x2048 on the GPU]
Conclusion
- The CUDA implementation of the transform and quantization is much faster than the CPU (36x faster)
- The Huffman algorithm does not parallelize well; final results show it 3x slower than the CPU
- The GPU architecture is well optimized for image- and video-related processing
- High-performance applications: interframe coding, HD-resolution/real-time video compression and decompression
Conclusion – Image Quality
[Slides: side-by-side CPU vs. GPU output at 1024x768, 2048x2048, and 512x512]