Christopher Mitchell, CDA 6938, Spring 2009
The Discrete Cosine Transform
In the same family as the Fourier Transform.
Converts data to the frequency domain.
Represents data as a summation of cosine waves of varying frequency.
Because it is a discrete transform, it is well suited to problems formatted for computer analysis.
Captures only the even (real) components of the function.
Discrete Sine Transform (DST) captures the odd (imaginary) components → not as useful.
Discrete Fourier Transform (DFT) captures both odd and even components → computationally intense.
Significance / Where is this used?
Image Processing: Compression - Ex.) JPEG; Scientific Analysis - Ex.) Radio Telescope Data
Audio Processing: Compression - Ex.) MPEG Layer 3, a.k.a. MP3
Scientific Computing / High Performance Computing (HPC): Partial Differential Equation Solvers
Significance, Cont.
Image Processing Example
Exhibits energy compaction: most of the signal energy ends up in a few coefficients.
Drop small-amplitude coefficients.
[Figure: original image and its DCT-transformed image]
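Energy compaction can be illustrated with a minimal sketch. It uses a 1D ramp signal rather than an image, and it assumes the standard orthonormal DCT-II and its inverse. The helper names (`dct`, `idct`, `keep_largest`) are hypothetical and are not taken from the slides.

```python
import math

def dct(f):
    """Orthonormal DCT-II of a sequence (direct evaluation)."""
    n = len(f)
    return [
        (math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n))
        * sum(f[x] * math.cos(math.pi * (2 * x + 1) * u / (2.0 * n)) for x in range(n))
        for u in range(n)
    ]

def idct(c):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(c)
    return [
        sum(
            (math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n))
            * c[u] * math.cos(math.pi * (2 * x + 1) * u / (2.0 * n))
            for u in range(n)
        )
        for x in range(n)
    ]

def keep_largest(c, k):
    """Zero every coefficient except the k with the largest magnitude."""
    keep = set(sorted(range(len(c)), key=lambda u: abs(c[u]), reverse=True)[:k])
    return [c[u] if u in keep else 0.0 for u in range(len(c))]

signal = [float(x) for x in range(16)]   # smooth test signal: a ramp
coeffs = dct(signal)
truncated = keep_largest(coeffs, 2)      # drop the 14 smallest coefficients

# Fraction of the total energy carried by the two kept coefficients.
energy_kept = sum(v * v for v in truncated) / sum(v * v for v in coeffs)
approx = idct(truncated)                 # lossy reconstruction from 2 coefficients
print(f"energy kept with 2 of 16 coefficients: {energy_kept:.4f}")
```

For a smooth signal like this, nearly all of the energy sits in the first couple of coefficients, which is why dropping the small ones loses little. JPEG-style compression applies the same idea to 2D image blocks.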
Implementation Platform, Cont.
What happened to the Cell/BE?
Too many technical challenges for the deadline.
Algorithm is embarrassingly parallel: conducive to launching hundreds of threads → GPU.
Algorithm requires too much data per pass compared to the local store size.
Would have to be creative with DMA, with no guarantee of mitigating the bottleneck.
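A rough back-of-the-envelope check of the local store constraint is sketched below. It assumes single-precision (4-byte) samples and that a pass needs the whole input matrix resident, neither of which the slides state. The 256 KB figure is the local store size of a Cell/BE SPE. The matrix sizes are the ones used in the test cases.

```python
LOCAL_STORE_BYTES = 256 * 1024  # Cell/BE SPE local store: 256 KB, shared by code and data
BYTES_PER_SAMPLE = 4            # assumption: single-precision samples

def matrix_bytes(n):
    """Bytes needed to hold one n x n input matrix."""
    return n * n * BYTES_PER_SAMPLE

def fits_in_local_store(n):
    # Strictly less than: the local store must also hold the program itself.
    return matrix_bytes(n) < LOCAL_STORE_BYTES

for n in (64, 128, 256, 512):
    print(f"{n} x {n}: {matrix_bytes(n) // 1024} KB, fits: {fits_in_local_store(n)}")
```

Under these assumptions a 256 x 256 matrix already fills the entire local store and a 512 x 512 matrix needs four times that, so the data would have to be streamed in tiles over DMA.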
Algorithm Walk Through
Mathematical Basis
1D Version:
Where:
2D Version:
Where α(u) and α(v) are defined as shown in the 1D case.
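The equations themselves did not survive in this transcript. As a stand-in, the standard orthonormal DCT-II definition, which uses the same α(u) and α(v) scale factors mentioned here, can be written as follows. The symbols f, C, N, x, y, u and v are assumed notation and may differ from the original slide.

```latex
% 1D DCT of a length-N sequence f(x), for u = 0, 1, ..., N-1
C(u) = \alpha(u) \sum_{x=0}^{N-1} f(x)
       \cos\!\left[ \frac{\pi (2x + 1) u}{2N} \right]

% Scale factor
\alpha(u) =
\begin{cases}
  \sqrt{1/N}, & u = 0 \\
  \sqrt{2/N}, & u \neq 0
\end{cases}

% 2D DCT of an N x N array f(x, y), for u, v = 0, 1, ..., N-1
C(u, v) = \alpha(u)\,\alpha(v) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x, y)
          \cos\!\left[ \frac{\pi (2x + 1) u}{2N} \right]
          \cos\!\left[ \frac{\pi (2y + 1) v}{2N} \right]
```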
Algorithm Walk Through
Problem
1D DCT is O(n^2).
2D DCT is O(n^3).
Additionally, the algorithm calls the cosine and square root functions: long-latency ALU operations.
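The quadratic cost of the 1D case can be seen in a direct serial implementation. This is a minimal sketch assuming the standard orthonormal DCT-II. The function name `dct_1d_naive` is hypothetical and this is not the author's CPU code.

```python
import math

def dct_1d_naive(f):
    """Direct DCT-II of a sequence f: n outputs, each a sum over n inputs."""
    n = len(f)
    out = []
    for u in range(n):                       # one pass per output coefficient
        # Square root call for the scale factor alpha(u).
        alpha = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
        total = 0.0
        for x in range(n):                   # inner sum over every input sample
            # One cosine call per (u, x) pair: n * n calls in total.
            total += f[x] * math.cos(math.pi * (2 * x + 1) * u / (2.0 * n))
        out.append(alpha * total)
    return out

coeffs = dct_1d_naive([1.0, 1.0, 1.0, 1.0])
print(coeffs)
```

The two nested loops give the O(n^2) behavior, and the cosine call sits in the innermost loop, which is where the long-latency operations hurt most.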
Algorithm Walk Through
Solution
1D DCT is now O(n).
2D DCT is now O(n^2).
Parallelization is key to success with this algorithm.
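One way to read the reduced complexity is as work per thread: each output coefficient depends only on the input, so it can be computed independently of the others. The sketch below assumes that reading and the standard orthonormal DCT-II. The function `dct_coefficient` is a hypothetical stand-in for what a single GPU thread might compute, and the list comprehension stands in for the parallel launch. The slides do not say how the cosine and square root latency was handled, so those calls are left in place.

```python
import math

def dct_coefficient(f, u):
    """Work of one hypothetical thread: a single coefficient, O(n) operations."""
    n = len(f)
    alpha = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
    return alpha * sum(
        f[x] * math.cos(math.pi * (2 * x + 1) * u / (2.0 * n)) for x in range(n)
    )

def dct_1d_per_coefficient(f):
    # On a GPU each u would map to its own thread; here they run one after another.
    return [dct_coefficient(f, u) for u in range(len(f))]

data = [1.0, 2.0, 3.0, 4.0]
coeffs = dct_1d_per_coefficient(data)
print(coeffs)
```

Because no coefficient reads another coefficient's result, the calls need no synchronization, which is what makes the algorithm embarrassingly parallel.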
Testing
Platform
Intel Core 2 Duo E6700 @ 2.66 GHz
Gigabyte GA-P35-DQ6 Motherboard
2 GB RAM
2 NVIDIA GeForce 8600 GTS Superclocked GPUs, each with:
  720 MHz core clock
  256 MB GDDR3 memory
  4 multiprocessors → 32 streaming processors
Windows XP Professional (32-bit) w/ SP3 and NVIDIA ForceWare 178.24 drivers
Testing - Overview
Vector Test Case      CPU Version      CUDA Version
Vector: 256               3.00 ms       0.016930 ms
Vector: 512              14.67 ms       0.027778 ms
Vector: 1024             64.33 ms       0.015876 ms
Vector: 2048            246.33 ms       0.015213 ms
Vector: 4096            989.33 ms       0.015721 ms
Matrix Test Case      CPU Version          GPU Version
Matrix: 64 x 64           1,055.67 ms      0.009612 ms
Matrix: 128 x 128        16,205.33 ms      0.010277 ms
Matrix: 256 x 256       254,448.33 ms      0.009850 ms
Matrix: 512 x 512     4,007,952.00 ms      0.014130 ms
Future Work
Multiple-GPU version
Have a dual-card setup to test this with.
Need to find an efficient way to split the problem between the two cards without incurring a large I/O penalty.
Still interested in trying a Cell/BE version of the algorithm.
Need to improve at CBEA programming.
DMA and local store size are the limiting factors for this particular problem.
References
NVIDIA CUDA Programming Guide, Version 2.1
http://developer.download.nvidia.com/compute/cuda/2_1/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.1.pdf
The Discrete Cosine Transform (DCT): Theory and Application
http://www.egr.msu.edu/waves/people/Ali_files/DCT_TR802.pdf
CDA 6938 Lecture Notes and Slides