
ISSCC 2012 / SESSION 12 / MULTIMEDIA & COMMUNICATIONS SoCs / 12.6

12.6 A 2Gpixel/s H.264/AVC HP/MVC Video Decoder Chip for Super Hi-Vision and 3DTV/FTV Applications

Dajiang Zhou¹, Jinjia Zhou¹, Jiayi Zhu², Peilin Liu², Satoshi Goto¹

¹Waseda University, Kitakyushu, Japan
²Shanghai Jiao Tong University, Shanghai, China

8K×4K Super Hi-Vision (SHV) offers a significantly enhanced visual experience relative to 1080p, and is on its way to being the next digital TV standard. In addition, advanced 3DTV specifications involving a large number of camera views are targeted by emerging applications such as free-viewpoint TV (FTV). This paper presents a single-chip design that supports real-time H.264 decoding of SHV or up to 32 HD views. The design of the chip involved 3 key challenges: 1) Data dependencies of video coding algorithms restrict the degree of hardware parallelism. For SHV, each macroblock (MB) should be processed in less than 40 cycles at 300MHz, which is difficult to meet with a single pipeline; 2) due to the massive design and verification effort for video decoders, a scalable architecture that allows the maximum reuse of existing IP is desirable; and 3) the DRAM bandwidth requirements are always a bottleneck in high-throughput video decoders.
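As a sanity check on the cycle budget named in challenge 1), the arithmetic can be reproduced from the figures quoted in the paper (a minimal Python sketch; the variable names are ours, not the paper's):

# Back-of-the-envelope cycle budget for 7680x4320 @ 60fps (SHV),
# using only numbers stated in the paper.
width, height, fps = 7680, 4320, 60
mb = 16                                              # macroblock size in pixels
mbs_per_sec = (width // mb) * (height // mb) * fps   # 7,776,000 MB/s

cycles_per_mb_single = 300e6 / mbs_per_sec           # ~38.6 cycles at 300MHz
print(f"single pipeline at 300MHz: {cycles_per_mb_single:.1f} cycles/MB (< 40)")

# With two main decoders, each taking 64 cycles/MB, the required clock is
required_clk = mbs_per_sec / 2 * 64                  # ~249MHz, met by the 340MHz core
print(f"two MDs at 64 cycles/MB need {required_clk / 1e6:.0f}MHz (core runs at 340MHz)")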

The proposed architecture is characterized as follows: 1) A frame dependency protection (FDP) scheme enables frame-parallel decoding by reusing multiple copies of an existing design. This results in a system throughput of 2Gpixels/s, at least 3.75× better than previous chips [1-4]. 2) A reference-window synchronization (RWS) scheme provides efficient frame-level reference-sample reuse. This saves DRAM bandwidth for motion compensation (MC) by 35%. 3) A 2-level hybrid caching (2LHC) scheme resolves the unaligned data read issue of frame recompression [1], and contributes another 14% bandwidth saving.

Figure 12.6.1 shows the system architecture. The H.264 decoding flow is divided into two parts: the entropy decoders (EDs) and the main decoders (MDs). With NAL/slice parallelism [1], 4 EDs achieve a CABAC/CAVLC decoding performance of over 800Mbps. Each MD processes every MB in 64 cycles. With the support of FDP, RWS and 2LHC, parallel decoding of 2 frames is accomplished on 2 MDs sharing an L2 cache.

Figure 12.6.2 describes the FDP and RWS schemes. Since slice parallelism is unsuitable for a generic decoder, frame-parallel decoding, which imposes no restrictions on the encoding configuration, is applied. However, frame parallelism is limited by the data dependencies of inter-frame prediction. Frame dependency protection (FDP) is proposed to overcome this obstacle. As the two MDs operate in parallel, they update each other with the addresses of their last decoded MBs. Based on these addresses, every time one MD requires reference samples from the frame being decoded by the other, it first checks whether the motion vector (MV) points into an already-decoded area of the reference frame. If so, the reference samples can be fetched; otherwise, the fetch operation is stalled until the other MD finishes decoding the corresponding area. This approach enables correct parallel decoding of frames with dependencies.
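A minimal software sketch of the FDP check is given below, assuming raster-scan MB addressing; the function and signal names are illustrative and do not come from the chip's RTL:

# Minimal FDP sketch (assumed names and structure, for illustration only).
# Each MD publishes the raster-scan address of its last decoded MB. Before
# fetching reference samples from a frame still being decoded by the peer MD,
# check that the referenced block lies entirely in the already-decoded area.

MB_SIZE = 16

def mb_addr(x, y, frame_width_mbs):
    """Raster-scan MB address of the MB containing pixel (x, y)."""
    return (y // MB_SIZE) * frame_width_mbs + (x // MB_SIZE)

def fdp_can_fetch(ref_x, ref_y, block_w, block_h,
                  peer_last_mb_addr, frame_width_mbs):
    """True if the whole reference block is inside the peer's decoded area.
    In raster order the bottom-right pixel's MB has the largest address,
    so one comparison covers the whole block."""
    corner = mb_addr(ref_x + block_w - 1, ref_y + block_h - 1, frame_width_mbs)
    return corner <= peer_last_mb_addr

# Usage: an MD stalls its MC fetch until fdp_can_fetch(...) becomes true,
# i.e. until the peer MD's progress update unblocks the request.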

A reference-window synchronization (RWS) scheme is incorporated for non-dependent frame parallelism to save DRAM bandwidth. With a shared MC cache, an MD can reuse the reference samples fetched by the other MD, as is typical when a B frame and another B or P frame are decoded in parallel. To minimize bandwidth with a reasonable cache size, the reference windows of the two MDs should overlap with each other in time. RWS is realized by dynamically adjusting the speed of the MDs to keep them in sync. With RWS on, the shared cache grants higher priority to the MD with the smaller current MB address. When the two MB addresses differ by more than a threshold T, all requests from the faster MD are temporarily blocked, essentially permitting the slower MD to "catch up". This may locally slow down the decoding speed, but overall performance is improved due to reduced DRAM traffic.
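The arbitration rule can be summarized in a few lines. The sketch below assumes a simple MB-address comparison; the interface and names are illustrative rather than the chip's implementation:

# Minimal RWS arbitration sketch (illustrative names, assumed interface).
# The shared cache favors the MD that lags behind, and blocks the faster MD
# once the gap in MB addresses exceeds the threshold T.

T = 2  # threshold in MBs; T=2 was selected in this design for highest speed

def rws_grant(md0_mb_addr, md1_mb_addr, req_from_md0, req_from_md1):
    """Return the index of the MD whose cache request is granted, or None."""
    gap = md0_mb_addr - md1_mb_addr
    if gap > T:            # MD0 is too far ahead: block it, serve only MD1
        return 1 if req_from_md1 else None
    if gap < -T:           # MD1 is too far ahead: block it, serve only MD0
        return 0 if req_from_md0 else None
    if req_from_md0 and req_from_md1:
        # Within the window: prefer the MD with the smaller MB address
        return 0 if md0_mb_addr <= md1_mb_addr else 1
    return 0 if req_from_md0 else (1 if req_from_md1 else None)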

Figure 12.6.3 shows the frame-scheduling strategies. For groups of pictures (GOPs) with relatively low frame dependencies, a synchronous frame scheduling (SFS) strategy is applied, which always launches two frames at the same time. The MD that finishes earlier waits for the other to complete before starting a new frame. With SFS, RWS can be used to minimize DRAM traffic. Based on SFS, detailed scheduling schemes for various GOP structures such as IBP, IBBP and hierarchical B are also given. For IPPP GOPs with significant frame dependencies, an asynchronous frame scheduling (AFS) strategy is used. AFS allows one MD to start a new frame regardless of the status of the other MD. Although this eliminates the possibility of using RWS, it seldom affects worst-case performance, since I and P frames involve less DRAM traffic. Moreover, when using AFS, one MD may be stalled while waiting for pixels still being processed by the other, leading to a difference in the progress of the two MDs. As this difference increases, stalls become rarer, because motion vectors are limited in range. Consequently, shortly after initialization, stalls are rare and both MDs operate at full speed. A rough model of the two launch policies is sketched below.
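The following sketch models the launch decision for the two MDs under SFS and AFS; the scheduler interface and names are assumptions for illustration, not the chip's actual controller:

# Minimal SFS/AFS scheduling sketch (illustrative only).
# SFS: both MDs launch new frames together, so RWS can keep their
#      reference windows overlapped in time.
# AFS: each MD launches its next frame as soon as it is free (used for
#      IPPP GOPs, where frames depend heavily on their predecessors).

def next_frames(mode, md_idle, pending_frames):
    """Decide which MDs launch a new frame at this scheduling point.

    mode           -- "SFS" or "AFS"
    md_idle        -- [bool, bool], True if that MD has finished its frame
    pending_frames -- ordered list of frames waiting to be decoded
    Returns a list of (md_index, frame) launch pairs.
    """
    launches = []
    if mode == "SFS":
        # Launch only when both MDs are idle, two frames at a time.
        if all(md_idle) and len(pending_frames) >= 2:
            launches = [(0, pending_frames.pop(0)), (1, pending_frames.pop(0))]
    else:  # AFS
        # Each idle MD grabs the next pending frame independently.
        for md in (0, 1):
            if md_idle[md] and pending_frames:
                launches.append((md, pending_frames.pop(0)))
    return launches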

Figure 12.6.3 also illustrates the 2-level hybrid caching (2LHC) scheme. Frame recompression (FRC) is an effective bandwidth-optimization technique. The FRC approach in [1] suffers from unaligned storage of the compressed data. Since FRC decompression is located between the MC cache and DRAM to lower the decompression speed requirement, the cache is incapable of storing the incomplete data fractions, resulting in additional DRAM traffic. Therefore, a 2-level cache organization is proposed in this work. While the L1 local cache of each MD stores the data after decompression, the shared L2 cache stores the compressed data, including the incomplete data fractions, for future use. Note that the L2 cache contributes the majority of the bandwidth reduction, allowing the L1 cache to be shrunk, so that the total cache size per MD (18KB) is the same as in [1].
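One possible reading of the 2LHC lookup path is sketched below, with assumed names and a simplified single-L1 model (the chip keeps one L1 per MD, and the compressed format is the FRC of [1]); DRAM is touched only on an L2 miss:

# Minimal 2LHC lookup sketch (assumed structure, illustrative only).
# L1 (per MD): decompressed reference pixels.
# L2 (shared): compressed FRC blocks, including fractions that an
#              unaligned read could otherwise not retain.

class TwoLevelHybridCache:
    def __init__(self):
        self.l1 = {}  # block_id -> decompressed pixel block (per MD in the chip)
        self.l2 = {}  # block_id -> compressed FRC block (shared by both MDs)

    def read(self, block_id, dram_read, frc_decompress):
        """Return decompressed pixels for block_id.

        dram_read      -- function fetching a compressed block from DRAM
        frc_decompress -- function decompressing an FRC block
        """
        if block_id in self.l1:              # L1 hit: pixels already available
            return self.l1[block_id]
        if block_id not in self.l2:          # L2 miss: one DRAM access
            self.l2[block_id] = dram_read(block_id)
        pixels = frc_decompress(self.l2[block_id])
        self.l1[block_id] = pixels           # fill L1 with decompressed data
        return pixels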

Figure 12.6.4 gives the experimental results. Compared to [1], 1.97× better decoding performance is obtained due to the higher clock speed. An additional 1.9× is contributed by frame parallelism, with double the computational resources but only an 18% increase in bandwidth. This is mainly due to the adoption of 2LHC and RWS, which reduce the average and peak MC bandwidth by 45% and 50%, respectively. Figure 12.6.4 also plots the results of RWS under various thresholds (T). T=2 (MBs) is selected in our design for the highest speed. For IPPP sequences, AFS outperforms SFS with up to 12.4% lower decoding time.

Figure 12.6.5 shows the specifications of the chip in 65nm CMOS. The die size is 4×4mm², including a 64b DDR2 PHY, DLL, PLLs, and the digital core containing 1338K logic gates and 79.9KB of on-chip memory. With the core and DRAM working at 340MHz and 400MHz, respectively, a throughput of 2Gpixels/s is achieved for 7680×4320 video. In MVC mode, real-time decoding of 32 720p views or 16 1080p views is achieved with one MD working. Figure 12.6.7 shows the chip micrograph.

Figure 12.6.6 compares this work with the state of the art. The proposed architecture supports real-time 7680×4320 at 60fps decoding, which is 3.75 to 32 times faster than prior work [1-4]. Relative to [1], the overall DRAM bandwidth requirement is reduced by 24%. The consequent DRAM power saving is 22% (modeled from [5]). Our design also offers high area efficiency. In this work, the L2 cache is a new but relatively small feature required to realize the new proposals. The other major components, the EDs and MDs, reuse the RTL code of [1], with the only modifications being in the MC cache. This reusability demonstrates an effective methodology for saving design and verification effort in high-performance video decoders.

Acknowledgements:
This research was supported by the Knowledge Cluster Initiative (2nd Stage) and the Waseda University Ambient SoC GCOE Program of MEXT, Japan, and by the JST CREST Project.

References:
[1] D. Zhou, et al., "A 530Mpixels/s 4096×2160@60fps H.264/AVC high profile video decoder chip," IEEE J. Solid-State Circuits, vol. 46, no. 4, pp. 777-788, 2011.
[2] T.-D. Chuang, et al., "A 59.5mW scalable/multi-view video decoder chip for quad/3D full HDTV and video streaming applications," ISSCC Dig. Tech. Papers, pp. 330-331, 2010.
[3] D. Zhou, et al., "A 1080p@60fps multi-standard video decoder chip designed for power and cost efficiency in a system perspective," IEEE Symp. VLSI Circuits, pp. 262-263, 2009.
[4] C. Lin, et al., "A 160kgate 4.5kB SRAM H.264 video decoder for HDTV applications," ISSCC Dig. Tech. Papers, pp. 1596-1605, 2006.
[5] http://download.micron.com/downloads/misc/ddr2_power_calc_web.xls



Figure 12.6.1: Decoder block diagram.
Figure 12.6.2: Frame-parallel decoding with FDP and RWS schemes.
Figure 12.6.3: Frame scheduling and 2-level hybrid caching.
Figure 12.6.4: Experimental results.
Figure 12.6.5: Chip specifications.
Figure 12.6.6: Chip comparison.


Figure 12.6.7: Chip photo.