Christopher Mitchell CDA 6938, Spring 2009



The Discrete Cosine Transform

In the same family as the Fourier Transform
  Converts data to the frequency domain.
  Represents data as a sum of cosine waves of varying frequency.
  Because it is a discrete transform, it is well suited to problems formatted for computer analysis.
Captures only the even (real) components of the function.
  Discrete Sine Transform (DST) captures the odd (imaginary) components → not as useful.
  Discrete Fourier Transform (DFT) captures both odd and even components → computationally intensive.

Significance / Where is this used?

Image Processing
  Compression - Ex.) JPEG
  Scientific Analysis - Ex.) Radio Telescope Data
Audio Processing
  Compression - Ex.) MPEG Layer 3, a.k.a. MP3
Scientific Computing / High Performance Computing (HPC)
  Partial Differential Equation Solvers

Significance, Cont.

Image Processing Example
  Exhibits energy compaction: most of the signal energy ends up in a small number of coefficients.
  Drop small amplitude coefficients

[Figure: Original Image; DCT Transformed Image]
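
The "drop small amplitude coefficients" step can be sketched as a simple threshold pass over the transformed data. This is a minimal illustration only: the function name `drop_small_coeffs` and the `threshold` parameter are hypothetical and do not come from the slides, which do not say how the cutoff was chosen.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: zero every DCT coefficient whose magnitude is
 * below `threshold` and return how many coefficients were dropped. */
size_t drop_small_coeffs(double *coeffs, size_t count, double threshold)
{
    assert(coeffs != NULL || count == 0);
    size_t dropped = 0;
    for (size_t i = 0; i < count; ++i) {
        if (fabs(coeffs[i]) < threshold) {
            coeffs[i] = 0.0;
            ++dropped;
        }
    }
    return dropped;
}
```

The more strongly the transform compacts energy, the more coefficients fall under a given threshold and the less data has to be stored.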

Implementation Platform

NVIDIA CUDA, Version 2.0

Implementation Platform, Cont.

What Happened to the Cell/BE?
  Too many technical challenges for the deadline.
  Algorithm is embarrassingly parallel
    Conducive to launching hundreds of threads → GPU
  Algorithm requires too much data per pass compared to the local store size.
    Would have to be creative with DMA, with no guarantee of bottleneck mitigation.

Algorithm Walk Through

Mathematical Basis

1D Version:

Where:

2D Version:

Where α(u) and α(v) are defined as shown in the 1D case.
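
A scale factor α(u) paired with a cosine sum is the usual shape of the orthonormal DCT-II. As a reference sketch, and assuming the slides follow the form given in the DCT report listed under References, the 1D transform of an N-point signal f(x), its scale factor, and the 2D transform of an N x N block f(x, y) can be written as:

```latex
C(u) = \alpha(u) \sum_{x=0}^{N-1} f(x) \cos\left[\frac{\pi (2x+1) u}{2N}\right],
\qquad u = 0, 1, \ldots, N-1

\alpha(u) =
\begin{cases}
  \sqrt{1/N} & u = 0 \\
  \sqrt{2/N} & u \neq 0
\end{cases}

C(u,v) = \alpha(u)\,\alpha(v) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x,y)
  \cos\left[\frac{\pi (2x+1) u}{2N}\right]
  \cos\left[\frac{\pi (2y+1) v}{2N}\right],
\qquad u, v = 0, 1, \ldots, N-1
```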

Algorithm Walk Through CPU Version – 1D DCT

Algorithm Walk Through CPU Version – 2D DCT

Algorithm Walk Through

Problem
  1D DCT is O(n^2)
  2D DCT is O(n^3)
  Additionally, the algorithm makes calls to calculate the cosine and square root.
    Long-latency ALU operations

Algorithm Walk Through CUDA Version – 1D DCT

Algorithm Walk Through CUDA Version – 2D DCT

Algorithm Walk Through

Solution
  1D DCT is now O(n)
  2D DCT is now O(n^2)
  Parallelization is key to success with this algorithm

Testing

Platform
  Intel Core 2 Duo E6700 @ 2.66 GHz
  Gigabyte GA-P35-DQ6 Motherboard
  2 GB RAM
  2 NVIDIA GeForce 8600 GTS Superclocked GPUs
    720 MHz core clock
    256 MB GDDR3 memory
    4 multiprocessors → 32 streaming processors
  Windows XP Professional (32-bit) w/ SP3 and NVIDIA ForceWare 178.24 drivers

Testing - Overview

Vector Test Case      CPU Version         CUDA Version
Vector: 256               3.00 ms         0.016930 ms
Vector: 512              14.67 ms         0.027778 ms
Vector: 1024             64.33 ms         0.015876 ms
Vector: 2048            246.33 ms         0.015213 ms
Vector: 4096            989.33 ms         0.015721 ms

Matrix Test Case      CPU Version         CUDA Version
Matrix: 64 x 64        1,055.67 ms        0.009612 ms
Matrix: 128 x 128     16,205.33 ms        0.010277 ms
Matrix: 256 x 256    254,448.33 ms        0.009850 ms
Matrix: 512 x 512  4,007,952.00 ms        0.014130 ms
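
The CPU columns also show how the cost grows with problem size. A quick check of the ratios between consecutive rows, using only the CPU figures reported above (vector cases first, then matrix cases):

```latex
\frac{14.67}{3.00} \approx 4.9, \qquad
\frac{64.33}{14.67} \approx 4.4, \qquad
\frac{246.33}{64.33} \approx 3.8, \qquad
\frac{989.33}{246.33} \approx 4.0

\frac{16205.33}{1055.67} \approx 15.4, \qquad
\frac{254448.33}{16205.33} \approx 15.7, \qquad
\frac{4007952.00}{254448.33} \approx 15.8
```

The vector times roughly quadruple from one row to the next, which is consistent with quadratic growth in the vector length. The matrix times grow by a factor of roughly 16 from one row to the next, which is the growth one would expect if the double sum is evaluated in full for every output coefficient.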

Testing – 1D DCT

Testing – 2D DCT

Future Work

Multiple GPU version
  Have a dual card setup to test this with.
  Need to find an efficient way to split the problem between the two cards without incurring a large I/O penalty.
Still interested in trying a Cell/BE version of the algorithm.
  Need to improve at CBEA programming.
  DMA and local store size are the limiting factors for this particular problem.

References

NVIDIA CUDA Programming Guide, Version 2.1
  http://developer.download.nvidia.com/compute/cuda/2_1/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.1.pdf
The Discrete Cosine Transform (DCT): Theory and Application
  http://www.egr.msu.edu/waves/people/Ali_files/DCT_TR802.pdf
CDA 6938 Lecture Notes and Slides