A Scalable Parallel H.264 Decoder on the Cell Broadband Engine Architecture

Michael A. Baker, Pravin Dalale, Karam S. Chatha, Sarma B. K. Vrudhula

Arizona State University

CODES+ISSS (The International Conference on Hardware-Software Codesign and System Synthesis) 2009




Outline

• Introduction and Motivation
• Opportunities for Parallelization in H.264
• Implementation
• Performance Optimizations
• Experimental Results
• Conclusion


Motivation

• Multicore Architectures
  • Scalability: more cores = more performance
• H.264
  • Standard for video applications including High Definition (HD)
  • Computationally expensive
• Cell Broadband Engine (CBE)
  • Common and inexpensive thanks to PS3
  • Low power, high performance design gives a glimpse of future embedded architectures


IBM Cell Broadband Engine Architecture

• 3.2 GHz
• 9 cores, 10 threads
• >200 Gflops (single precision)
• >20 Gflops (double precision)
• Up to 25 GB/s memory bandwidth
• Up to 75 GB/s I/O bandwidth
• >300 GB/s interconnect bus

http://domino.research.ibm.com/comm/research.nsf/pages/r.arch.innovation.htm

SPE: Synergistic Processor Element
SPU: Synergistic Processor Unit
SXU: SPU Core
LS: Local Storage
SMF: Synergistic Memory Flow Control
EIB: Element Interconnect Bus
PPE: PowerPC Processor Element
PPU: PowerPC Processor Unit
PXU: Power Processor Unit
MIC: Memory Interface Controller
BIC: Bus Interface Controller
L1: Memory Cache Internal to the CPU
L2: Memory Cache External to the CPU
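The single-precision figure of more than 200 Gflops can be sanity-checked from the clock rate and SPE count quoted in these slides. The sketch below assumes that each SPE issues one 4-wide single-precision fused multiply-add per cycle; the lane count and the multiply-add assumption are not stated on the slide, the PPE's contribution is ignored, and `peak_sp_gflops` is an illustrative name.

```python
# Figures taken from the slides.
CLOCK_GHZ = 3.2
NUM_SPES = 8

# Assumptions (not stated on the slide): 128-bit SIMD registers holding four
# 32-bit floats, and one fused multiply-add (2 flops) per lane per cycle.
SIMD_LANES = 4
FLOPS_PER_FMA = 2


def peak_sp_gflops(clock_ghz, num_spes, lanes=SIMD_LANES, flops_per_lane=FLOPS_PER_FMA):
    """Theoretical single-precision peak of the SPEs alone, in Gflops."""
    return clock_ghz * num_spes * lanes * flops_per_lane


if __name__ == "__main__":
    print(f"all eight SPEs: {peak_sp_gflops(CLOCK_GHZ, NUM_SPES):.1f} Gflops")
    print(f"six SPEs (PS3): {peak_sp_gflops(CLOCK_GHZ, 6):.1f} Gflops")
```

Under these assumptions the eight SPEs alone account for 204.8 Gflops, which is consistent with the quoted figure.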


H.264 Advanced Video Coding

• H.264 is a video compression standard
  • Version 1 completed May 2003
  • ITU-T Video Coding Experts Group (H.264)
  • ISO/IEC Moving Picture Experts Group (MPEG-4 AVC)
• Macroblock (MB) based CODEC closely related to MPEG-2
• Growing demand for HD and Wireless video
• 50% bit rate reduction over previous standard
• Computational complexity approximately 2.4 x MPEG-2


H.264: Decoder


Reference Code: FFmpeg (H.264 Decoder)

• Open source video and audio converter
• Handles a multitude of formats
• Codecs other than H.264 decoder removed
• About 250K Lines of Code after paring to H.264 only
• About 200 functions ported to SPU in our implementation

http://www.ffmpeg.org/


H.264 Frame Level Relationships

• I Frame: Independently Encoded
  • Intra Prediction
• P Frame: Predicted from a Preceding frame
  • Intra and Inter Prediction
• B Frame: Predicted from Both preceding and following frames
  • Intra and Inter Prediction


H.264 Opportunities for Parallelism: GOP and Frame Level

I, P, B Frames

Picture Sequence: IBBPBBP

Independent Group of Pictures (GOP)

Independent Frames within GOP
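The GOP and frame level opportunities can be illustrated with a small sketch. It assumes closed GOPs that each begin with an I frame, a display-order sequence such as IBBPBBP, and B frames that are never used as references; none of these restrictions is stated on the slide, and the function names are illustrative only.

```python
def split_into_gops(frame_types):
    """Split a display-order sequence of frame types into GOPs.

    Assumes closed GOPs: every I frame starts a new, independent GOP.
    """
    gops = []
    for frame_type in frame_types:
        if frame_type == "I" or not gops:
            gops.append([])
        gops[-1].append(frame_type)
    return gops


def frame_dependencies(gop):
    """Map each frame index in a GOP to the frame indices it is predicted from.

    Assumes only I and P frames serve as references (anchors).
    """
    anchors = [i for i, t in enumerate(gop) if t in ("I", "P")]
    deps = {}
    for i, frame_type in enumerate(gop):
        earlier = [a for a in anchors if a < i]
        later = [a for a in anchors if a > i]
        if frame_type == "I":
            deps[i] = []                         # independently encoded
        elif frame_type == "P":
            deps[i] = earlier[-1:]               # nearest preceding anchor
        else:
            deps[i] = earlier[-1:] + later[:1]   # both neighbouring anchors
    return deps


if __name__ == "__main__":
    for gop in split_into_gops("IBBPBBPIBBP"):
        print("".join(gop), frame_dependencies(gop))
```

In this model the GOPs can be decoded independently of one another, and the B frames between the same pair of anchors do not depend on each other, so they can be decoded concurrently once both anchors are available.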


H.264 Opportunities for Parallelism: Slice and MB Level

Slices: Independently encoded groups of MBs within a frame

Intra Dependencies:


Data Partitioning Scheme

Our Scheme: One row of MBs issued to each SPU

Possible Intra MB dependencies:
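As a rough illustration of how row-per-SPU partitioning interacts with intra dependencies, the sketch below simulates a wavefront schedule. It assumes the usual H.264 intra neighbourhood (left, top-left, top and top-right MBs), a uniform cost of one time step per MB, and round-robin assignment of rows to SPUs. These are modelling assumptions and `wavefront_schedule` is a hypothetical helper, not the authors' scheduler.

```python
def wavefront_schedule(width, height, num_spus):
    """Return finish[r][c], the time step at which MB (r, c) is decoded.

    Row r is handled by SPU r % num_spus. An MB can start once its left
    neighbour (same SPU, decoded in order) and its top-right neighbour in
    the row above are done; the top-right neighbour finishing implies the
    top and top-left neighbours are done as well.
    """
    finish = [[0] * width for _ in range(height)]
    spu_free = [0] * num_spus  # time at which each SPU finished its last row
    for r in range(height):
        spu = r % num_spus
        t = spu_free[spu]
        for c in range(width):
            ready = t
            if r > 0:
                # At the last column there is no top-right MB; use the top MB.
                ready = max(ready, finish[r - 1][min(c + 1, width - 1)])
            t = ready + 1
            finish[r][c] = t
        spu_free[spu] = t
    return finish


def makespan(width, height, num_spus):
    """Total time steps to decode one frame under the model above."""
    return wavefront_schedule(width, height, num_spus)[-1][-1]


if __name__ == "__main__":
    # 1080p frame: 120 x 68 macroblocks of 16 x 16 pixels.
    for n in (1, 2, 4, 6):
        print(n, "SPUs:", makespan(120, 68, n), "steps")
```

Each row starts two MBs behind the row above it, so in this idealised model extra SPUs help until the stagger between rows, rather than the SPU count, becomes the limit.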


Functional Partitioning

CBE Architecture:


FFmpeg main MB decoding loop

Intra

Inter


Scalable Implementation


FFmpeg Data Structure Modification

• Single threaded code: monolithic data structure
• Entire structure needed to decode single MB but majority is static from one MB to the next
• SPU only requires applicable subset for one row of MBs
• Only MB specific data replicated in SPU LS

Figure 10: Data structure modifications reducing memory requirements in the local store. W is the width of the video frame in macroblocks.
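The idea of splitting the monolithic structure can be sketched as follows. All class and field names here (`StaticContext`, `RowContext`, `mb_types`, `motion_vectors`) are hypothetical stand-ins, not FFmpeg's actual structures; the sketch only shows per-MB storage shrinking from one entry per MB in the frame to one entry per MB in a row of width W.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StaticContext:
    """State that does not change from one MB to the next (hypothetical)."""
    width_mbs: int   # W in Figure 10
    height_mbs: int


@dataclass
class RowContext:
    """Per-MB state for one row: the only part replicated in an SPU's LS."""
    row: int
    mb_types: list = field(default_factory=list)
    motion_vectors: list = field(default_factory=list)


def make_row_context(static, row):
    """Allocate per-MB storage for a single row of W macroblocks."""
    w = static.width_mbs
    return RowContext(row=row, mb_types=[0] * w, motion_vectors=[(0, 0)] * w)


def per_mb_entries_whole_frame(static):
    """Per-MB entries a monolithic, whole-frame structure would carry."""
    return static.width_mbs * static.height_mbs


if __name__ == "__main__":
    # 1080p: 120 x 68 macroblocks of 16 x 16 pixels.
    static = StaticContext(width_mbs=120, height_mbs=68)
    row_ctx = make_row_context(static, row=0)
    print("whole frame:", per_mb_entries_whole_frame(static), "entries per array")
    print("one row:    ", len(row_ctx.mb_types), "entries per array")
```

For a 1080p frame this reduces each per-MB array from 8160 entries to 120, which is the kind of saving that makes the working set fit in a small local store.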


SPU LS(Local Store) Limitations


Code Overlay

Code segment contains one or more functions

Memory region assigned one or more segments

At run time, region contains exactly one segment
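The three rules above can be modelled in a few lines. This is a toy simulation under the assumption that calling a function whose segment is not resident triggers a load of that segment into the region; `OverlayRegion` and the function names are hypothetical, and no real overlay manager API is implied.

```python
class OverlayRegion:
    """A memory region that holds exactly one of its segments at a time."""

    def __init__(self, name, segments):
        self.name = name
        # Mapping: segment name -> set of function names in that segment.
        self.segments = segments
        self.resident = None   # segment currently loaded in the region
        self.loads = 0         # number of segment loads performed so far

    def call(self, func):
        """Simulate calling func; load its segment first if necessary."""
        segment = next(s for s, funcs in self.segments.items() if func in funcs)
        if segment != self.resident:
            self.resident = segment   # the previous segment is overwritten
            self.loads += 1
        return segment


if __name__ == "__main__":
    region = OverlayRegion("region0", {"segA": {"f1", "f2"}, "segB": {"f3"}})
    for func in ["f1", "f2", "f3", "f1"]:
        region.call(func)
    print("segment loads:", region.loads)
```

Calls that alternate between segments of the same region force a reload each time, which is why the assignment of functions to segments and regions matters for performance.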


Designing an Overlay Scheme

Start with one flat region

1. Identify key functions and assign to new regions
• Profiling indicates f21() is most important with 50 calls
• However, f11() is present 80 times in the trace
• f11() is a key function

2. Create new regions based on profiling data until memory is exhausted
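One way to read the two steps is as a greedy selection driven by how often each function appears in the execution trace rather than by its raw call count. The sketch below assumes the trace is a list of function names recorded each time control enters or returns to a function, that a dedicated region costs the size of the function placed in it, and that sizes and the memory budget are known. The trace, sizes and budget shown are made-up illustrations, not measurements from the decoder.

```python
from collections import Counter


def pick_key_functions(trace, sizes, budget):
    """Greedily choose functions that get their own overlay region.

    trace  : function names in execution order (entries and returns)
    sizes  : mapping from function name to code size, in arbitrary units
    budget : memory available for dedicated regions, in the same units
    """
    counts = Counter(trace)
    chosen = []
    remaining = budget
    # Most frequent in the trace first; ties broken by name for determinism.
    for func, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if sizes[func] <= remaining:
            chosen.append(func)
            remaining -= sizes[func]
    return chosen, counts


if __name__ == "__main__":
    # f11 is called once but control returns to it after each callee,
    # so it appears in the trace more often than any function it calls.
    trace = ["f11", "f21", "f11", "f22", "f11", "f21"]
    sizes = {"f11": 4, "f21": 3, "f22": 5}
    chosen, counts = pick_key_functions(trace, sizes, budget=8)
    print(counts)
    print("dedicated regions for:", chosen)
```

A caller such as f11() would otherwise have to be reloaded every time a callee in the same region returns, which is why its trace count is a better guide than its call count.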


Designing an Overlay Scheme


Overlay Performance


Additional Performance Optimizations


Experimental Results

• Source videos from Microsoft’s WMV HD demonstration page [13]
• The source videos were transcoded into H.264 1920x1080 (1080p) format
  • 5 different bitrates: 2.5, 4, 8, 12, 16 Mbps
  • CAVLC and CABAC
• Transcoding used the x264 H.264 encoder integrated into ffmpeg
  • The videos were encoded using the x264 presets: baseline, normal, and hq
• Decoder performance is measured on Sony’s PlayStation 3 with a 3.2 GHz Cell Processor (limited by Sony to six of the CBE’s eight SPUs) running Fedora 9 Linux

[13] Microsoft Corporation. WMV HD Content Showcase. http://www.microsoft.com/windows/windowsmedia/musicandvideo/hdvideo/contentshowcase.aspx


Figure 14: Breakdown of decoder performance by component using a single SPU.

• Motion vector decoding and deblocking are the most expensive components
• The white band at the bottom is the PPU (entropy decoder) contribution


Decoder Performance

[4] H. Baik, K.-H. Sihn, Y. il Kim, S. Bae, N. Han, and H. J. Song. “Analysis and Parallelization of H.264 decoder on Cell Broadband Engine Architecture.” In Signal Processing and Information Technology, pages 791–795. Samsung Electron. Co., Ltd., Suwon, Korea, 2007.

Compared with [4], our implementation achieves an average of 25.23 fps, a 23% improvement, when decoding similarly encoded video streams on four SPUs.


• Our implementation achieved a “best case” average framerate of 34.94 fps on 2.5 Mbps modified-normal CAVLC encoded video streams on six SPUs
• And a “worst case” entropy decoder limited average framerate of 15.43 fps on 16 Mbps hq CABAC encoded video streams


Conclusion

Demonstrated scalable H.264 decoder for multicore processor

23% frame rate advantage over prior work [4] on similar videos and using same number of cores

Careful engineering required to efficiently manage data structures and scratchpad memory