Outline
Overview
What is CUDA
◦ Architecture
◦ Programming Model
◦ Memory Model
H.264 Motion Estimation on CUDA
◦ Method
◦ Experimental Results
◦ Conclusion
Overview
General-purpose Computation on GPUs (GPGPU)
◦ Not only for accelerating the graphics display but also for speeding up non-graphics applications
   Linear algebra computation
   Scientific simulation
What is CUDA ?
Compute Unified Device Architecture
◦ http://www.nvidia.com.tw/object/cuda_home_tw.html# (nVidia CUDA Zone)
◦ Single program multiple data (SPMD) computing device
[Figure panels: Fast Object Detection, Leukocyte Tracking, Real-time 3D modeling]
Programming Model
◦ A program executes in two parts
   Host : CPU
   Device : GPU
[Figure: the main program runs on the host until the end of main; each kernel runs on the device until the end of the kernel, and this is where the parallel work is done]
Thread Batching
◦ CUDA creates many threads on the device, and each thread executes the kernel program on different data
◦ Threads in the same thread block can cooperate with each other through shared memory
◦ The number of threads in a thread block is limited
   Thread blocks with the same dimensions can be organized as a grid to batch threads
Thread Batching
[Figure: the host launches Kernel 1 and Kernel 2 on the device as Grid 1 and Grid 2. Each grid is a 2D array of thread blocks indexed Block (x, y). Block (1,0) is expanded to show a 5 x 3 array of threads, Thread (0,0) to Thread (4,2).]
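The grid/block/thread hierarchy can be mimicked on the CPU to see how each thread finds its own data. The following is a minimal Python sketch, not CUDA code: `launch` and `scale_kernel` are hypothetical names, the loops run serially where the GPU would run in parallel, and the 3 x 2 grid of 5 x 3 blocks is chosen only to resemble the figure.

```python
from itertools import product

def launch(kernel, grid_dim, block_dim, *args):
    """Call `kernel` once per (block, thread) pair, serially, like a kernel launch."""
    for by, bx in product(range(grid_dim[1]), range(grid_dim[0])):
        for ty, tx in product(range(block_dim[1]), range(block_dim[0])):
            kernel((bx, by), (tx, ty), block_dim, *args)

def scale_kernel(block_idx, thread_idx, block_dim, src, dst, width):
    # Global coordinates of this thread: blockIdx * blockDim + threadIdx.
    x = block_idx[0] * block_dim[0] + thread_idx[0]
    y = block_idx[1] * block_dim[1] + thread_idx[1]
    # Each thread processes exactly one element (same program, different data).
    dst[y * width + x] = 2 * src[y * width + x]

grid_dim, block_dim = (3, 2), (5, 3)          # 3 x 2 blocks of 5 x 3 threads
width = grid_dim[0] * block_dim[0]            # 15 elements per row
height = grid_dim[1] * block_dim[1]           # 6 rows
src = list(range(width * height))
dst = [None] * len(src)
launch(scale_kernel, grid_dim, block_dim, src, dst, width)
```

Every element is written exactly once because each (block, thread) pair maps to a unique global coordinate.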
Memory Model
◦ DRAM
◦ On-chip memory
[Figure: a grid containing Block (0,0) and Block (1,0). Each block has its own shared memory; each thread, Thread (0,0) and Thread (1,0), has its own registers and local memory; global memory is visible to the whole grid.]
H.264 ME on CUDA
In [1], an efficient block-level parallel algorithm is used for variable block size motion estimation in H.264/AVC
[Figure: MB modes P_16x16, P_16x8, P_8x16 and P_8x8; an 8x8 sub-block can be further coded as 8x8, 8x4, 4x8 or 4x4]
[1] “H.264/AVC Motion Estimation Implementation on Compute Unified Device Architecture (CUDA)”, IEEE International Conference on Multimedia & Expo (2008)
Two steps for deciding the final coding mode
◦ Step 1 : Find the best motion vectors of each MB mode
◦ Step 2 : Evaluate the R-D performance and choose the best mode
The H.264 ME algorithm is extremely complex and time consuming
◦ Fast motion estimation methods (TSS, DS, etc.) use too many branch instructions
◦ In [1], the authors focus on Full Search ME
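Step 2 can be illustrated with the Lagrangian cost J = D + λR that is commonly used for rate-distortion mode decision. The slides do not say which cost [1] uses, so the cost function, the name `choose_mode`, and all numbers below are assumptions made purely for illustration.

```python
def choose_mode(candidates, lam):
    """Pick the MB mode with the smallest cost J = D + lam * R.

    candidates maps a mode name to (distortion, rate_in_bits) obtained with the
    best motion vectors found for that mode in Step 1.
    """
    return min(candidates, key=lambda m: candidates[m][0] + lam * candidates[m][1])

# Made-up distortion/rate pairs, not measurements from [1].
candidates = {
    "P_16x16": (1200, 20),   # J = 1280 at lam = 4
    "P_16x8": (1100, 40),    # J = 1260 at lam = 4
    "P_8x8": (900, 95),      # J = 1280 at lam = 4
}
best = choose_mode(candidates, lam=4.0)
```

With λ = 4 the sketch picks P_16x8; with λ = 0 only distortion counts and the finest partition wins.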
Method
First stage : Calculate integer pixel MVs
◦ Compute all SAD values between each block and all reference candidates in parallel
◦ Merge 4x4 SADs to form all block sizes
◦ Find the minimal SAD and determine the integer pixel MV
[Figure: partitions of a 16x16 macroblock, with dimension labels 4, 8 and 16]
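As a CPU reference for the first bullet, the sketch below computes the SAD (sum of absolute differences) of one 4x4 block against every candidate in a small search window and keeps the smallest. It is a serial illustration, not the GPU kernel of [1]: the function names, the tiny 12x12 test frames, and the ±2 window (instead of 32 x 32) are assumptions chosen to keep the example short.

```python
def sad(cur, ref, bx, by, rx, ry, n=4):
    """SAD between the n x n block of `cur` at (bx, by) and of `ref` at (rx, ry)."""
    return sum(abs(cur[by + j][bx + i] - ref[ry + j][rx + i])
               for j in range(n) for i in range(n))

def full_search(cur, ref, bx, by, search=2, n=4):
    """Return (best SAD, (dx, dy)) over all candidates that lie inside the frame."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search):
        for dx in range(-search, search):
            rx, ry = bx + dx, by + dy
            if 0 <= rx <= w - n and 0 <= ry <= h - n:
                cand = (sad(cur, ref, bx, by, rx, ry, n), (dx, dy))
                if best is None or cand[0] < best[0]:
                    best = cand
    return best

# Synthetic frames: `cur` is `ref` shifted by one pixel right and one pixel up.
W = H = 12
ref = [[(7 * x + 13 * y * y + x * y) % 256 for x in range(W)] for y in range(H)]
cur = [[ref[(y + 1) % H][(x - 1) % W] for x in range(W)] for y in range(H)]
result = full_search(cur, ref, bx=4, by=4)
print(result)  # (0, (-1, 1))
```

On the GPU each candidate's SAD is independent of the others, which is what makes full search map well onto one thread per candidate.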
Second stage : Calculate fractional-pixel MVs
◦ The reference frame is interpolated using the six-tap filter and the bilinear filter defined in H.264/AVC
◦ Calculate the SADs at the 24 fractional-pixel positions adjacent to the best integer MV
[Figure: 5 x 5 grid of positions around the best integer pixel, marking integer-, half- and quarter-pixel positions; the 24 fractional positions are numbered 1 to 24]
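The two interpolation filters can be sketched in one dimension. This is a simplified illustration assuming the usual H.264/AVC luma rules: taps (1, -5, 20, 20, -5, 1) with rounding and a shift by 5 for half-pixel samples, and a rounded average for quarter-pixel samples. The function names are hypothetical, and the special handling of the centre half-pixel position (which filters intermediate values) is left out.

```python
def clip255(v):
    """Clamp a value to the 8-bit sample range."""
    return max(0, min(255, v))

def half_pel(p):
    """Half-pixel sample between p[2] and p[3], from six neighbouring integer samples."""
    a, b, c, d, e, f = p
    return clip255((a - 5 * b + 20 * c + 20 * d - 5 * e + f + 16) >> 5)

def quarter_pel(s0, s1):
    """Quarter-pixel sample as the rounded average of two neighbouring samples."""
    return (s0 + s1 + 1) >> 1

flat = half_pel([10, 10, 10, 10, 10, 10])      # flat area stays at 10
edge = half_pel([0, 0, 100, 100, 0, 0])        # 125: the filter overshoots a narrow peak
quarter = quarter_pel(100, edge)               # 113, between the integer and half-pixel sample
```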
4x4 Block-Size SAD Calculation
◦ Sequence resolution : 4CIF (704x576)
◦ Search range : 32 x 32 (leads to 1024 candidates)
◦ Each candidate SAD is computed by a thread
◦ 256 threads are executed in a thread block
◦ Every 256 candidates of one 4x4 block SAD calculation are assigned to a thread block

$$\text{Thread\_Block\_Number} = \frac{\text{Frame\_Width}}{4} \times \frac{\text{Frame\_Height}}{4} \times \text{Search\_Range}^2 \times \frac{1}{256} = \frac{704}{4} \times \frac{576}{4} \times 32^2 \times \frac{1}{256} = 101376$$

   Frame_Width/4 x Frame_Height/4 : number of 4x4 blocks in a frame
   Search_Range^2 : number of ME search candidates
   1/256 : 256 threads in a thread block
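The thread block count can be checked with a few lines of arithmetic. This is a small Python sketch; `thread_block_count` is a hypothetical helper name, and it assumes the frame dimensions are multiples of the block size, as they are for 4CIF.

```python
def thread_block_count(frame_width, frame_height, block_size, search_range,
                       threads_per_block=256):
    """Number of thread blocks when each thread computes one candidate SAD."""
    blocks_in_frame = (frame_width // block_size) * (frame_height // block_size)
    candidates = search_range * search_range        # 32 x 32 = 1024
    total_threads = blocks_in_frame * candidates    # one thread per candidate SAD
    return total_threads // threads_per_block

n_blocks = thread_block_count(704, 576, block_size=4, search_range=32)
print(n_blocks)  # 101376
```

Since 1024 candidates divided by 256 threads is 4, each 4x4 block of the frame is served by four thread blocks.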
Block diagram of 4x4 block SAD calculation
[Figure: the 1024 candidates of a 4x4 block are split across four thread blocks (B1 to B4), each with threads T1 to T256. The kernel runs thread blocks B1 to B101376, and each thread block writes its 256 SADs to DRAM.]
Variable Block-Size SAD Generation
◦ Merge the 4x4 SADs obtained in the previous step
◦ Each thread fetches the sixteen 4x4 SADs of one MB at a candidate position and combines them to form the other block sizes

$$\text{Thread\_Block\_Number} = \frac{\text{Frame\_Width}}{16} \times \frac{\text{Frame\_Height}}{16} \times \text{Search\_Range}^2 \times \frac{1}{256} = \frac{704}{16} \times \frac{576}{16} \times 32^2 \times \frac{1}{256} = 6336$$
Block diagram of variable block size SAD calculation
[Figure: the kernel runs thread blocks B1 to B6336, each with threads T1 to T256. Each thread reads 16 SADs from DRAM and writes back eight 4x8 SADs, eight 8x4 SADs, four 8x8 SADs, two 8x16 SADs, two 16x8 SADs and one 16x16 SAD.]
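The merge performed by one thread can be sketched as follows. This is a hedged Python illustration of combining sixteen 4x4 SADs into the larger partitions; `merge_sads` and its raster-order layout are assumptions, and a real kernel would more likely reuse partial sums than add each partition from scratch.

```python
def merge_sads(s):
    """Combine a 4 x 4 grid of 4x4-block SADs (one MB, one candidate position).

    Returns a dict keyed by partition size "WxH" (in pixels), each value being
    the list of merged SADs in raster order.
    """
    def block_sum(x0, y0, w, h):          # position and size in units of 4x4 blocks
        return sum(s[y][x] for y in range(y0, y0 + h) for x in range(x0, x0 + w))

    def tiles(w, h):
        return [block_sum(x, y, w, h) for y in range(0, 4, h) for x in range(0, 4, w)]

    return {"4x8": tiles(1, 2), "8x4": tiles(2, 1), "8x8": tiles(2, 2),
            "8x16": tiles(2, 4), "16x8": tiles(4, 2), "16x16": tiles(4, 4)}

# Illustrative input: 4x4 SADs numbered 1..16 in raster order.
s = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
merged = merge_sads(s)
print(merged["8x8"])    # [14, 22, 46, 54]
print(merged["16x16"])  # [136]
```

The list lengths (8, 8, 4, 2, 2, 1) match the counts in the block diagram.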
Integer Pixel SAD Comparison
◦ All 1024 SADs of one block are compared, and the candidate with the smallest SAD gives the integer-pixel MV
◦ Each block size (16x16 to 4x4) has its own kernel for SAD comparison
◦ Seven kernels are implemented and executed sequentially
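Block diagram of integer pixel SAD comparison
[Figure: the 1024 SADs of one block are read from DRAM by thread block B1 with threads T1 to T256. Each thread first compares 4 SADs and writes the smallest to shared memory. The SADs in shared memory are then reduced iteratively: in iteration n, threads T1 to T(128/2^(n-1)) compare the 256/2^(n-1) remaining SADs, until the integer-pel MV is found.]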
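The two phases in the diagram can be sketched on the CPU. In this Python illustration the strided assignment of four SADs per thread, the raster-order layout of candidates, and the window offsets of -16 to +15 are all assumptions, and `integer_mv` is a hypothetical name. Python's `min` stands in for the shared memory reduction.

```python
def integer_mv(sads, threads=256, search=32):
    """Return (best SAD, (dx, dy)) from the search*search SADs of one block."""
    per_thread = len(sads) // threads             # 1024 / 256 = 4 SADs per thread
    # Phase 1: thread t scans candidates t, t + 256, t + 512, t + 768 and keeps
    # the smallest (SAD, candidate index) pair in its "shared memory" slot.
    shared = [min((sads[t + k * threads], t + k * threads) for k in range(per_thread))
              for t in range(threads)]
    # Phase 2: reduce the 256 pairs to one.
    best_sad, best_idx = min(shared)
    # Map the candidate index back to a displacement inside the search window.
    dx = best_idx % search - search // 2
    dy = best_idx // search - search // 2
    return best_sad, (dx, dy)

# Synthetic SADs with a single clear minimum at candidate 500.
sads = [1000 + (i * 37) % 101 for i in range(1024)]
sads[500] = 3
best = integer_mv(sads)
print(best)  # (3, (4, -1))
```

Carrying the candidate index along with the SAD value is what lets the final pair be turned into a motion vector.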
During the thread reduction process, a problem may occur
◦ Shared memory bank conflict
A sequential addressing strategy with non-divergent branching is adopted
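SAD comparison using sequential addressing with non-divergent branching
[Figure: one reduction step on 16 shared memory entries (SAD value & index).
Before : 8 6 3 4 7 3 7 8 4 7 5 1 9 4 3 6
Threads 1 to 8 each compare one entry in the first half with the entry 8 positions later and keep the smaller one.
After : 4 6 3 1 7 3 3 6 4 7 5 1 9 4 3 6
The next step uses threads 1 to 4, and so on.]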
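The reduction in the figure can be reproduced step by step. This is a serial Python sketch of the addressing pattern only (function names are hypothetical, indices are 0-based, and the input length is assumed to be a power of two). In each step the active threads are the contiguous range 0 to active-1, and thread t reads entries t and t + active, so neighbouring threads touch neighbouring entries.

```python
def reduce_min_step(values, indices, active):
    """One reduction step: thread t (0 <= t < active) compares entry t with entry t + active."""
    for t in range(active):
        if values[t + active] < values[t]:
            values[t], indices[t] = values[t + active], indices[t + active]

def reduce_min(sads):
    """Return (smallest SAD, its original index) by repeated halving."""
    values, indices = list(sads), list(range(len(sads)))
    active = len(values) // 2
    while active >= 1:
        reduce_min_step(values, indices, active)
        active //= 2
    return values[0], indices[0]

sads = [8, 6, 3, 4, 7, 3, 7, 8, 4, 7, 5, 1, 9, 4, 3, 6]   # values from the figure
first = list(sads)
reduce_min_step(first, list(range(16)), 8)
print(first)             # [4, 6, 3, 1, 7, 3, 3, 6, 4, 7, 5, 1, 9, 4, 3, 6]
print(reduce_min(sads))  # (1, 11)
```

The first printed list matches the "after" row of the figure. Within a step, reads come from the upper half and writes go to the lower half, so the serial loop gives the same result as threads running in parallel.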
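Fractional Pixel MV Refinement
◦ Find the best fractional-pixel motion vector around the integer motion vector of every block
[Figure: the 24 fractional-pixel positions (half and quarter pixel) around the integer pixel, numbered 1 to 24]
[Figure: block diagram. The encoding frame, the integer-pel MV and the reference frame are read from DRAM. In the kernel, thread block B1 uses threads T1 to T24 and keeps the SADs in shared memory. The SADs are then reduced iteratively: in iteration n, threads T1 to T(12/2^(n-1)) compare the 24/2^(n-1) remaining SADs, until the fractional-pel MV is found.]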
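The 24 candidate positions can be enumerated as quarter-pixel offsets in a 5 x 5 grid around the integer MV, leaving out the centre. The Python sketch below assumes that layout. `refine` and the `sad_at` callback are hypothetical; in a real encoder `sad_at` would compute the SAD against the interpolated reference frame. Following the slide, only the 24 fractional positions compete.

```python
def fractional_candidates():
    """The 24 offsets (dx, dy), in quarter-pixel units, around the integer position."""
    return [(dx, dy) for dy in range(-2, 3) for dx in range(-2, 3) if (dx, dy) != (0, 0)]

def refine(int_mv, sad_at):
    """Return the best fractional MV in quarter-pixel units.

    int_mv is in integer pixels; sad_at(mvx, mvy) returns the SAD for a motion
    vector given in quarter-pixel units.
    """
    cx, cy = 4 * int_mv[0], 4 * int_mv[1]
    best = min(fractional_candidates(), key=lambda o: sad_at(cx + o[0], cy + o[1]))
    return cx + best[0], cy + best[1]

# Toy cost with its minimum at (13, -7) quarter pixels, i.e. (3.25, -1.75) pixels.
toy_sad = lambda mvx, mvy: abs(mvx - 13) + abs(mvy + 7)
frac_mv = refine((3, -2), toy_sad)
print(frac_mv)  # (13, -7)
```

Offsets with both components even are the 8 half-pixel positions; the remaining 16 are quarter-pixel positions.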
Experimental Results
Environment
◦ AMD Athlon 64 X2 Dual Core 2.1GHz with 2G memory
◦ NVIDIA GeForce 8800GTX with 768MB DRAM
◦ CUDA Toolkit and SDK 1.1
Parameters
◦ ME algorithm : Full Search
◦ Search Range : 32x32
The average execution time in ms for processing one frame using the proposed algorithm

Steps                                          Time (ms)   Percentage (%)
Step 1. 4x4 Block Size SADs Calculation        33.98       31.24
Step 2. Variable Block Size SADs Generation    30.64       28.16
Step 3. Integer Pixel SAD Comparison           9.69        8.90
Step 4. Fractional Pixel Interpolation         7.10        6.52
Step 5. Fractional Pixel ME Refinement         10.87       9.99
Others                                         16.49       15.16
Total                                          108.77      100
The ME performance comparison between CPU only and using GPU

Sequence        Frame rate (fps) using AMD CPU   Frame rate (fps) using GPU   Speed-up
Stefan (CIF)    3.04                             31.54                        10.38
City (4CIF)     0.78                             9.19                         11.78
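The reported figures can be cross-checked with simple arithmetic. The Python sketch below only reuses numbers from the two tables; the observation that the 108.77 ms total matches the City (4CIF) frame rate is an inference from the arithmetic, not something the slides state.

```python
total_ms = 108.77                        # average time per frame from the timing table
fps_from_total = 1000.0 / total_ms       # about 9.19 fps, the GPU frame rate listed for City (4CIF)

speedup_stefan = 31.54 / 3.04            # about 10.375, reported as 10.38
speedup_city = 9.19 / 0.78               # about 11.78

print(f"{fps_from_total:.2f} fps")
print(f"City speed-up: {speedup_city:.2f}")
```

The speed-up column is the GPU frame rate divided by the CPU frame rate.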