


[IEEE 2010 Data Compression Conference - Snowbird, UT, USA (2010.03.24-2010.03.26)] 2010 Data Compression Conference - Batch-Pipelining for H.264 Decoding on Multicore Systems

Batch-Pipelining for H.264 Decoding on Multicore Systems

Tang-Hsun Tu and Chih-Wen Hsueh
Graduate Institute of Networking and Multimedia
National Taiwan University, Taipei, Taiwan 621, R.O.C.
{d98944004, cwhsueh}@csie.ntu.edu.tw

Pipelining has been applied in many areas to improve performance by overlapping the execution of computing stages. However, it is difficult to apply to H.264/AVC decoding at the frame level, because the bitstreams are encoded with many dependencies and little parallelism is left to exploit. Slice-level parallelism in H.264 is intuitive, but because there is usually only one slice per frame, it is rarely applicable. Therefore, after some software improvement, much research has had to resort to hardware assistance. Fortunately, pure software pipelining can be applied to H.264/AVC decoding at the macroblock level with reasonable performance gain.

However, the pipeline stages might need to synchronize with other stages, which incurs considerable extra overhead. Moreover, this overhead becomes relatively larger as the stages themselves execute faster with better hardware and software optimization. We first group multiple stages into larger groups, as "batched" pipelining, to execute concurrently on multicore systems. Stages in different groups might not need to synchronize with each other, so the approach incurs little overhead and can be highly scalable. We therefore propose a novel and effective batch-pipeline (BP) approach that combines the advantages of data and function decomposition for H.264/AVC decoding on multicore systems. Moreover, because of its flexibility, BP can be used with other hardware approaches or software technologies to further improve performance. To optimize our approach, we also analyze how to group the macroblocks and derive closed-form formulas to guide the grouping.
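The synchronization argument above can be illustrated with a toy model. The sketch below is not the authors' decoder or their BP scheme: it is a hypothetical two-stage pipeline in Python in which `run_pipeline`, `batch_size`, and the arithmetic standing in for decoding stages are all assumptions made for illustration. It only shows that handing work between stages in batches reduces the number of synchronized handoffs while leaving the output unchanged.

```python
import queue
import threading


def run_pipeline(num_macroblocks, batch_size):
    """Run a toy two-stage pipeline and return (results, handoffs).

    Hypothetical model: each "macroblock" is an integer, and the two
    stages are placeholder arithmetic, not real H.264 decoding steps.
    """
    channel = queue.Queue()  # synchronized handoff point between stages
    results = []
    handoffs = 0

    def stage_one():
        nonlocal handoffs
        batch = []
        for mb in range(num_macroblocks):
            batch.append(mb * 2)  # stand-in for the first stage's work
            if len(batch) == batch_size:
                channel.put(batch)  # one synchronization per batch
                handoffs += 1
                batch = []
        if batch:  # flush a final partial batch
            channel.put(batch)
            handoffs += 1
        channel.put(None)  # sentinel: no more work

    def stage_two():
        while True:
            batch = channel.get()
            if batch is None:
                break
            for value in batch:
                results.append(value + 1)  # stand-in for the second stage

    threads = [
        threading.Thread(target=stage_one),
        threading.Thread(target=stage_two),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, handoffs


if __name__ == "__main__":
    # A 1920x1088 coded frame holds 120 x 68 = 8160 macroblocks of 16x16.
    for size in (1, 120):
        _, count = run_pipeline(8160, size)
        print(f"batch_size={size}: {count} handoffs")
```

In this model, passing one macroblock at a time costs 8160 handoffs per frame, while passing one macroblock row (120 items) at a time costs 68. How the real approach chooses its groups is governed by the closed-form formulas mentioned above, which this sketch does not attempt to reproduce.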

We conduct experiments on various bitstreams to verify our approach, as shown in the following table. The results show that it achieves a speedup of up to 93% and reaches up to 249 and 70 FPS for 720P and 1080P resolutions, respectively, on a 4-core machine over a published optimized H.264 decoder. We believe our batch-pipelining approach creates a new effective direction for multimedia software codec development. The detailed paper can be found at http://www.csie.ntu.edu.tw/~cwhsueh/papers/BPh264 2010 DCC.pdf.

No.  Name       Res.   Size(M)/#    i%    OPT  PD   xBP   0BP   BP-i  BP   PD+BP  All(+UD)
01.  Cornell    1080P  109.7/3598   10.5  38   47%  -14%  6%    13%   14%  73%    69 (83%)
02.  Artbeats   1080P  115.6/2850   33.9  31   44%  -12%  0%    -5%   3%   57%    50 (61%)
03.  BBC-CFB    1080P  44.4/2433    21.8  36   55%  -11%  7%    5%    7%   80%    70 (93%)
04.  Shark      1080P  81.3/1801    20.8  27   42%  4%    8%    8%    16%  71%    50 (82%)
05.  Harbour    720P   9.5/300      1.7   48   20%  22%   9%    24%   26%  80%    87 (81%)
06.  Night      720P   6.6/300      9.2   64   24%  -4%   2%    9%    18%  67%    112 (74%)
07.  Jets       720P   0.9/300      5.2   142  22%  -42%  -8%   -4%   12%  59%    249 (75%)
08.  Harbour    480P   5.6/300      5.2   101  15%  21%   3%    9%    21%  56%    168 (65%)
09.  Crew       480P   3.1/300      33.5  133  21%  -1%   -10%  -8%   11%  36%    215 (62%)
10.  Sailormen  480P   3.4/300      8.0   140  20%  1%    -7%   4%    24%  54%    239 (71%)
11.  Night      480P   3.0/230      22.0  130  21%  -4%   -10%  -5%   10%  32%    197 (52%)
12.  Mobile     CIF    2.4/300      0.5   299  10%  5%    -17%  1%    17%  22%    427 (43%)
13.  Football   CIF    1.7/260      32.4  348  7%   -14%  -24%  -21%  5%   5%     460 (32%)
14.  Bus        CIF    1.1/150      8.3   344  5%   -21%  -34%  -26%  10%  -4%    477 (39%)
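As a reading aid for the table, the following sketch recomputes the percentage in the last column from the frame rates. It rests on assumptions the abstract does not state explicitly: that OPT is the frame rate (FPS) of the baseline optimized decoder, that the first number under All(+UD) is the resulting FPS, and that the parenthesized percentage is the relative FPS gain over OPT. The names `ROWS`, `relative_gain_percent`, and `recompute` are illustrative only. Because the printed FPS values are rounded, the recomputed gains are expected to match the reported ones only approximately.

```python
# (name, resolution, OPT FPS, All(+UD) FPS, reported gain in %),
# transcribed from the table; the column interpretation is an assumption.
ROWS = [
    ("Cornell", "1080P", 38, 69, 83),
    ("Artbeats", "1080P", 31, 50, 61),
    ("BBC-CFB", "1080P", 36, 70, 93),
    ("Shark", "1080P", 27, 50, 82),
    ("Harbour", "720P", 48, 87, 81),
    ("Night", "720P", 64, 112, 74),
    ("Jets", "720P", 142, 249, 75),
    ("Harbour", "480P", 101, 168, 65),
    ("Crew", "480P", 133, 215, 62),
    ("Sailormen", "480P", 140, 239, 71),
    ("Night", "480P", 130, 197, 52),
    ("Mobile", "CIF", 299, 427, 43),
    ("Football", "CIF", 348, 460, 32),
    ("Bus", "CIF", 344, 477, 39),
]


def relative_gain_percent(baseline_fps, improved_fps):
    """Relative frame-rate gain over the baseline, in percent."""
    return (improved_fps - baseline_fps) / baseline_fps * 100.0


def recompute(rows):
    """Return (name, resolution, recomputed gain, reported gain) tuples."""
    return [
        (name, res, relative_gain_percent(opt, fps), reported)
        for name, res, opt, fps, reported in rows
    ]


if __name__ == "__main__":
    for name, res, gain, reported in recompute(ROWS):
        print(f"{name:10s} {res:6s} recomputed {gain:5.1f}%  reported {reported}%")
```

Under this reading, the 93% headline figure corresponds to BBC-CFB at 1080P (36 to 70 FPS), and the 249 FPS figure to Jets at 720P.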

2010 Data Compression Conference

1068-0314/10 $26.00 © 2010 IEEE

DOI 10.1109/DCC.2010.57
