


1356 IEICE TRANS. FUNDAMENTALS, VOL.E98–A, NO.7 JULY 2015

PAPER Special Section on Design Methodologies for System on a Chip

Unified Parameter Decoder Architecture for H.265/HEVC Motion Vector and Boundary Strength Decoding

Shihao WANG†a), Nonmember, Dajiang ZHOU†, Member, Jianbin ZHOU†, Nonmember, Takeshi YOSHIMURA†, and Satoshi GOTO†, Fellows

SUMMARY In this paper, the VLSI architecture design of a unified motion vector (MV) and boundary strength (BS) parameter decoder (PDec) for an 8K UHDTV HEVC decoder is presented. The adoption of new coding tools in PDec, such as Advanced Motion Vector Prediction (AMVP), increases the VLSI hardware realization overhead and memory bandwidth requirement, especially for the 8K UHDTV application. We propose four techniques to address these challenges. Firstly, this work unifies the MV and BS parameter decoders for line buffer memory sharing. Secondly, to support high throughput, we propose a top-level CU-adaptive pipeline scheme that trades off implementation complexity against performance. Thirdly, a PDec process engine with optimizations is adopted for a 43.2k-gate area reduction. Finally, a PU-based coding scheme is proposed for a 30% DRAM bandwidth reduction. In a 90 nm process, our design costs 93.3k logic gates with a 23.0 kB line buffer. The proposed architecture can support real-time decoding of 7680x4320@60fps video at 249 MHz in the worst case.
key words: UHDTV, H.265/HEVC, parameter decoder, motion vector, boundary strength

1. Introduction

Nowadays Ultra High Definition Television (UHDTV) has been a hot topic because of the better visual experience it provides. The 8K UHDTV video format in particular has been regarded as the high-end application of the near future. At the same time, the new High Efficiency Video Coding (H.265/HEVC) standard was standardized in January 2013. HEVC, as a successor to H.264, can achieve double the video compression rate while guaranteeing equivalent compression quality [1], [2]. Against this background, a video decoder supporting both 8K UHDTV and HEVC is required by the market. However, in order to support real-time decoding, the VLSI implementation of such a decoder is challenged by two issues. Firstly, since high video resolution directly increases the burden on system throughput because of the huge data volume, a higher throughput is needed. Assuming a clock frequency of 250 MHz, a throughput of at least 8 pixels/cycle has to be achieved for 7680x4320@60fps. On the other hand, the new HEVC standard introduces complicated coding tools into the MV and BS calculation to achieve better compression performance, such as Advanced Motion Vector Prediction (AMVP) mode and merge mode. A flexible quad-tree CU structure is also introduced

Manuscript received September 17, 2014.
Manuscript revised January 12, 2015.

†The authors are with the Graduate School of Information, Production and Systems, Waseda Univ., Kitakyushu-shi, 808-0135 Japan.

a) E-mail: [email protected]
DOI: 10.1587/transfun.E98.A.1356

Fig. 1 Unified MV&BS parameter decoder.

in HEVC. All these new coding tools lead to challenges in memory bandwidth requirement, data dependency and throughput requirement for a high-performance VLSI implementation [3].
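As a sanity check, the 8 pixel/cycle figure in the introduction follows directly from the resolution and frame rate; a minimal Python sketch, assuming the 250 MHz clock stated above:

```python
# Back-of-envelope check of the throughput requirement quoted above,
# assuming the paper's 250 MHz clock.
pixels_per_second = 7680 * 4320 * 60
cycles_per_second = 250_000_000
required = pixels_per_second / cycles_per_second  # pixels per cycle, ~7.96
```

So a sustained rate of at least 8 pixels/cycle indeed covers the worst case with a small margin.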

A VLSI architecture of a unified MV and BS PDec is discussed for the following reasons. 1) As shown in Fig. 1, the MV and BS decoders play obligatory roles in an integrated video decoder system. 2) The original syntaxes from CABAD are too obscure to be directly used by motion compensation (MC) and the deblocking filter (DBF), so pre-processing is needed. The data and control flow in this pre-processing is characterized by irregularity, because these processes are highly input-dependent and evolve with the algorithm itself. 3) The pre-processing that decodes syntaxes into parameters has little dependency on the core calculation in MC and DBF. Further, the BS calculation relies on the result of the current block's MV, which inspires us to build a unified PDec architecture to share on-chip memory without much extra control overhead.

Several parameter decoder approaches have been reported previously. In [4] an approach for MPEG-2 is presented. For H.264/AVC, Xu et al. [5] first showed a solution for the QCIF format. Higher throughput is achieved in [6]–[8]. However, these realizations leave much room for optimization. The fastest throughput among them is 260 cycles/MB, which is far slower than the throughput requirement of 8K UHDTV. In Zhou et al.'s work [9], a joint parameter decoder with fixed pipeline granularity for the 4K@60fps UHDTV application is proposed for MV, BS and intra prediction mode. Considering that this H.264/AVC approach is not applicable to HEVC and that the technique cannot support the high throughput requirement of 8K video, we cannot directly inherit [9] for

Copyright © 2015 The Institute of Electronics, Information and Communication Engineers



the HEVC solution. Besides the VLSI works mentioned above, an FPGA-based implementation is reported in [10]. However, FPGA-based work is difficult to extend to the 8K UHDTV application. Meanwhile, much previous work has been done on motion compensation architectures for H.264/AVC and H.265/HEVC, such as [11]–[13]. However, few of these cover the realization of MV production, which is a tough task in HEVC for hardware implementation.

In this paper a novel VLSI PDec architecture is proposed with four contributions.

1) A unified architecture for the MV and BS parameter decoders is proposed to achieve memory sharing.

2) A CU-adaptive pipeline is proposed to support high throughput for up to 8K applications with reasonable implementation complexity.

3) Index-mapping and resource reuse schemes are introduced for the irregular process algorithms to achieve a 43.2k logic gate reduction.

4) The memory organization is carefully designed, and a PU-based coding scheme for co-located storage reduces the DRAM bandwidth requirement by 30%.

In total, our proposed parameter decoder architecture supports real-time decoding of 7680x4320@60fps video at a 249 MHz clock frequency in the worst case.

The rest of the paper is organized as follows. Section 2 presents an overview of PDec in HEVC. Section 3 describes the PDec pipeline design in detail. In Sect. 4, both spatial and temporal storage with optimized schemes are discussed in detail. Implementation results are shown in Sect. 5. Finally, we conclude the paper in Sect. 6.

2. Overview of PDec in HEVC

In this section an overview of the parameter decoder is given. Our proposed unified parameter decoder contains two major parts, motion vector calculation and boundary strength calculation, as shown in Fig. 1. The calculation of the intra prediction mode is excluded from the proposed PDec for the following reasons. Firstly, in the HEVC standard, decoding the syntaxes of transform units in CABAC relies on the result of the intra prediction mode of the current coding unit. If the intra part were encapsulated in PDec, a long feedback path inside the whole decoder would be generated, leading to performance degradation. Secondly, the algorithm for intra prediction is simpler than the MV's, and it does not need reference data located in different coding tree unit (CTU) rows. Therefore, the overhead of adding the intra part to CABAC is negligible compared to the MV and BS calculation. The rest of this section gives a brief overview of the MV and BS calculation in HEVC.

2.1 Motion Vector Calculation

The process of calculating a motion vector is to decode syntax elements into motion parameters, which can be directly used by the following motion compensation module. A block's

Fig. 2 Candidate selection in MV calculation.

Fig. 3 Scaling calculation should be executed when tb != td (tb & td represent POC differences). Left: neighbor PU is temporal; Right: neighbor PU is spatial.

motion parameters are highly likely to be similar to those of its spatial or temporal neighbors. In HEVC, an irregular coding algorithm is employed to eliminate this kind of redundancy for compression efficiency.

Advanced Motion Vector Prediction (AMVP) mode and merge prediction mode are employed by HEVC for coding MV parameters. Both of them take the prediction parameters of five spatial neighboring blocks and two temporal co-located blocks as input, as depicted in Fig. 2. Besides, each of the two prediction modes has its own unique algorithm, and resource sharing between them is limited. In AMVP, all seven reference blocks are categorized into three regions: A, B and Col. Each region produces at most one candidate, so a list of at most three candidates can be constructed. After removing the identical entries in the list, the final MV is selected by the syntax mvp_lx_flag. In merge mode, the motion parameter decision starts with constructing the merge candidate list. Firstly, valid blocks are pushed into the list in the order of B0, A0, A1, B1, B2 and Col. Then combined candidates are assembled from the contents of the candidate list and added after the valid candidates of the first step. Finally, zero candidates with different reference frames are produced if the list is still not full. After the list is constructed, the merge MV result is chosen from the list by the merge_idx. Note that in each mode, scaling operations are processed when there is a difference between the reference frame of the current block and that of the reference block [2].
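The merge-list construction steps above can be sketched in Python. This is a hedged simplification: the candidate names follow the ordering stated in the text, the dict-based block representation is hypothetical, the combined-candidate step is omitted, and duplicate removal is modeled as a plain membership test rather than the spec's pairwise comparisons.

```python
MAX_MERGE_CANDS = 5

def build_merge_list(spatial, col, num_ref_frames):
    """spatial: dict of name -> candidate or None; col: candidate or None."""
    cand_list = []
    # Step 1: push valid spatial candidates in the stated order, then Col.
    for name in ("B0", "A0", "A1", "B1", "B2"):
        c = spatial.get(name)
        if c is not None and c not in cand_list:   # simplified de-duplication
            cand_list.append(c)
        if len(cand_list) == MAX_MERGE_CANDS:
            return cand_list
    if col is not None and len(cand_list) < MAX_MERGE_CANDS:
        cand_list.append(col)
    # Step 2 (combined bi-predictive candidates) is omitted in this sketch.
    # Step 3: zero candidates with increasing reference indices fill the list.
    ref_idx = 0
    while len(cand_list) < MAX_MERGE_CANDS:
        cand_list.append(("zero_mv", min(ref_idx, num_ref_frames - 1)))
        ref_idx += 1
    return cand_list

def select_merge_mv(cand_list, merge_idx):
    # The final result is simply indexed out of the constructed list.
    return cand_list[merge_idx]
```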

The scaling calculator implements an operation frequently used in AMVP mode. Figure 3 shows two cases where the scaling calculator is utilized. In the figure, tb (td) indicates the POC difference between the current (neighbor) picture and its reference picture. When tb and td are not the same, mvNeighbor cannot be directly used as the prediction of the motion vector and



it must be scaled. Therefore, mvScaled should be deduced by the following equations defined in the HEVC specification [2].

tb = Clip3(−128, 127, DiffPicOrderCnt(curr_pic, curr_ref))
td = Clip3(−128, 127, DiffPicOrderCnt(neighbor_pic, neighbor_ref))
tx = (16384 + (Abs(td) >> 1)) / td
distScaleFactor = Clip3(−4096, 4095, (tb × tx + 32) >> 6)
mvScaled = Clip3(−32768, 32767, Sign(distScaleFactor × mvNeighbor)
           × ((Abs(distScaleFactor × mvNeighbor) + 127) >> 8))
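The equations above translate directly into code. The following is a Python transcription under two assumptions: the spec's "/" is integer division truncating toward zero, and the POC differences are passed in directly instead of computing DiffPicOrderCnt from picture structures.

```python
def clip3(lo, hi, x):
    # Clip3 as defined in the HEVC specification.
    return max(lo, min(hi, x))

def scale_mv(mv_neighbor, poc_diff_curr, poc_diff_neighbor):
    tb = clip3(-128, 127, poc_diff_curr)
    td = clip3(-128, 127, poc_diff_neighbor)
    # tx: integer division truncating toward zero (assumed spec semantics);
    # the numerator is always positive, so the sign comes from td.
    num = 16384 + (abs(td) >> 1)
    tx = (num // abs(td)) * (1 if td > 0 else -1)
    dist_scale = clip3(-4096, 4095, (tb * tx + 32) >> 6)
    prod = dist_scale * mv_neighbor
    sign = 1 if prod >= 0 else -1
    return clip3(-32768, 32767, sign * ((abs(prod) + 127) >> 8))
```

For example, with tb == td the scaled vector reproduces the neighbor's MV, and halving tb halves the result, as expected from the POC-distance interpretation in Fig. 3.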

2.2 Boundary Strength Calculation

Boundary strength is used in the deblocking filter for filter selection. The calculation of BS can be divided into two steps. The first step is to prepare the necessary data. Generally, we have to fetch the MV parameters of the current block and all the adjacent neighboring blocks to the left and top. Then in the second step, a specific algorithm produces the BS result by comparing the MV parameters of the current block and the neighboring blocks.

BS calculation is executed on all prediction unit (PU) and transform unit (TU) edges on the 8x8 block grid. Let P and Q be the two blocks beside a certain edge. If P or Q is intra predicted, BS is equal to 2. Otherwise, PU and TU edges have different algorithms to produce the BS value. The algorithm for TU edges is quite simple: it just checks whether P or Q has non-zero coefficients. In contrast, the algorithm for PU edges is much more complex. Almost all the motion information has to be compared between the P and Q blocks, such as the number of reference frames, the number of MVs and the MV difference [2].
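The two-branch decision just described can be sketched as follows. This is a deliberately simplified uni-prediction model with hypothetical dict fields, not the full decision table of the standard; the 4 quarter-pel MV-difference threshold corresponds to one integer sample.

```python
def boundary_strength(p, q, is_tu_edge):
    """Simplified BS decision for the edge between blocks P and Q."""
    # Intra on either side always gives the strongest filtering.
    if p["intra"] or q["intra"]:
        return 2
    # TU edge: a simple non-zero-coefficient check.
    if is_tu_edge and (p["nonzero_coeff"] or q["nonzero_coeff"]):
        return 1
    # PU edge: compare motion information (simplified to uni-prediction).
    if p["ref"] != q["ref"]:
        return 1
    if any(abs(a - b) >= 4 for a, b in zip(p["mv"], q["mv"])):
        return 1
    return 0
```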

3. Unified CU-adaptive Pipeline

3.1 Pipeline Analysis

In order to support the high throughput requirement, the pipeline design involves three aspects: pipeline granularity, pipeline processing stages and cycle resource allocation.

The pipeline granularity should be designed to trade off performance against implementation complexity. The pipeline granularity in PDec is the set of pixels processed by a processing element simultaneously, which means the pixels in one granule should be homogeneous. In HEVC, pixels in one coding unit share the same prediction mode (inter or intra), and pixels in each PU share one MV. On the other hand, 4-pixel-long edges on the 8x8 grid share one BS for the deblocking filter. Combining these two processes, it is better to define the pipeline granularity as 8x8 blocks or larger. In previous works, a fixed pipeline granularity is

adopted. Though the pipeline controller for that is simple, the throughput decreases in HEVC. For example, in [6], a fixed 4x4 or 8x8 pipeline block is designed. However, HEVC supports a maximum 64x64 block size. If a fixed 4x4 pipeline granularity is used, largely redundant calculation directly degrades the performance (for example, one 64x64 prediction unit needs up to 256 4x4 pipeline blocks, even though all pixels in the 64x64 block share one motion vector).

To overcome the redundancy problem, we propose the CU-adaptive pipeline granularity; in detail, the pipeline granularity is the coding unit. Though the size and partition mode vary between coding units, the CU-adaptive pipeline still has several advantages. Firstly, compared to the previous fixed pipeline blocks, the CU-adaptive pipeline scheme can efficiently eliminate the redundancy of larger CU cases, which are expected to occur frequently in 8K UHDTV. Secondly, compared with a PU-based pipeline, the CU-adaptive pipeline scheme reduces the implementation complexity of the pipeline controller. The reason is that the proposed scheme reduces the number of possible cases for the pipeline granularity, especially for asymmetrical partitions. Thirdly, in spite of the various sizes, the process for each CU follows a similar procedure, leading to easier implementation. Finally, even if a CU contains two or more PUs, the neighboring blocks of the PUs overlap, which means that data reuse can be utilized for memory bandwidth reduction, as shown in Fig. 4(a).

The simplified data flow in Fig. 4(b) illustrates the three pipeline processing stages designed for the unified PDec. Since the BS calculation needs the result of the current block's MV, the BS calculation should be scheduled after the corresponding MV process. Thus, by incorporating the two processes together, we define the unified pipeline as consisting of three main steps [14]: 1) memory reading for reference data fetching, 2) MV calculation for the current CU, and 3) BS calculation and MV write-back. Notice that only the BS calculation for PU edges is in stage 3, while that for TU edges is done outside the pipeline. The reason is that the BS calculation for PU edges requires MV information, so it ties tightly with the rest of the CU-adaptive pipeline. Meanwhile, BS for TU edges is quite simple, so separating it from the PU edges does not increase the hardware complexity much.

Table 1 shows the cycle resource allocation for each stage and each pipeline granularity. Generally, a pipeline's throughput is mainly decided by its slowest stage. Hence the performance is not degraded much if we decelerate the faster stages. Thus we unify the process speed of each granularity to that of the slowest of the three stages, shown in the Max. Cycles column of Table 1. Notice that the data in brackets indicate a CU containing two PUs. This is achieved by adding no-operation (NOP) cycles at the tail of the fast stages. The merit is that the implementation of the pipeline controller is greatly simplified. The throughput for each CU size is also listed in the table, which proves that the proposed PDec achieves at least 8 pixels per cycle.



Fig. 4 Analysis on designing the CU-adaptive pipeline.

Table 1 Cycle allocation for different CU cases. Cycles for a CU containing 2 PUs are shown in brackets.

CU Size | Stage 1. Memory Read | Stage 2. MV Calculation | Stage 3. BS Cal. and Mem. Wr. | Max. Cycles per CU | Throughput (pixel/cycle)
8x8     | 8 (8)    | 4 (8) | 3 (4)    | 8 (8)    | 8.0 (8.0)
16x16   | 10 (12)  | 4 (8) | 7 (10)   | 10 (12)  | 25.6 (21.3)
32x32   | 18 (20)  | 4 (8) | 15 (22)  | 18 (22)  | 56.9 (46.5)
64x64   | 34 (36)  | 4 (8) | 31 (46)  | 34 (46)  | 120.5 (89.0)
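The throughput column of Table 1 can be cross-checked from the cycle counts: throughput is the CU's pixel count divided by the slowest stage's cycle count, e.g. 256/10 = 25.6 pixels/cycle for a one-PU 16x16 CU. A minimal Python check:

```python
def throughput(cu_size, stage_cycles):
    # Pipeline throughput is limited by the slowest of the three stages.
    return cu_size * cu_size / max(stage_cycles)

# Stage cycles (memory read, MV calc, BS calc + write) for the 1-PU case,
# taken from Table 1.
one_pu = {8: (8, 4, 3), 16: (10, 4, 7), 32: (18, 4, 15), 64: (34, 4, 31)}
```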



3.2 Block Diagram of the CU-adaptive Pipeline

Fig. 4(d) shows the whole framework of the unified parameter decoder. The lower part illustrates the data flow inside the proposed CU-adaptive pipeline. Inside each pipeline stage, the important process units are depicted. The details of these three stages are given in the following Sects. 3.3–3.5, correspondingly. The upper part gives a brief representation of the memory in this design, which is discussed in Sect. 4. Based on the pipeline analysis, no matter what the current and next CU's partition modes and sizes are, they can be processed continuously without a pipeline pause, guaranteeing high pipeline performance.

3.3 Pipeline Stage 1

The first stage is in charge of fetching neighboring data from memory, which is explained in Sect. 4. Generally, each PU in a CU needs four cycles for fetching the MV reference blocks' information, as shown in Fig. 4(c). When the current CU contains more than one PU, only two extra cycles are needed for the second PU, as the rest of the neighboring blocks can be reused from the first PU. After finishing the MV reference fetching, the BS reference blocks are accessed and then pushed into a FIFO in preparation for stage 3.

Three memory controllers work in parallel as shown in Fig. 4(c). Operating these controllers in parallel rather than sequentially improves the throughput in the worst case. The worst case is defined as a CU size of 8x8 with a symmetric partition mode (two PUs in the CU). For each PU, the five spatial neighbors in the line buffer and the two temporal neighbors in the collocated SRAM need 7 cycles to be fetched. An extra 3 cycles are needed for fetching the BS neighboring blocks. Thus, 20 cycles in total (ten for each PU) are required in this worst case. Under this situation, the performance cannot even support an 8K@30fps video application. Therefore, we propose three memory controllers working in parallel. Together with the data reuse scheme in Fig. 4(a), the design achieves the throughput for 8K@60fps.

3.4 Pipeline Stage 2

The algorithm-irregular MV calculation (AMVP mode and merge mode) is scheduled in stage 2. Its implementation suffers from a huge area cost and a timing issue. Hence, we propose a resource sharing scheme and an index-mapping scheme for optimization.

Firstly, the MV scaling calculator is optimized and reused. The data path of the scaling calculator is depicted in Fig. 5. In this procedure, one division, two multiplications and several adders are employed. Without optimization, the synthesis result reports a 7 ns critical path and around a 10k gate cost. The critical path is marked in red in Fig. 5. Such serious costs in both timing and area are unacceptable for an efficient design. A sub-stage pipeline inside stage 2 is proposed for the scaling calculator. Four sub-stages are designed

Fig. 5 Data path of scaling calculator and the scheme for sub-stagepipeline.

Fig. 6 Scaling calculator.

in Fig. 5, segmented by dotted lines. By doing so, a balanced division of the critical path is achieved. Meanwhile, a LUT replaces the division implementation in sub-stage 2 for area and timing savings. The proposed sub-stage pipeline is capable of working under a 2.5 ns timing constraint. As one scaling calculator can process one motion vector, two calculators are necessary for scaling the two motion vectors of a bi-directional prediction case. Thus the optimized architecture for the scaling calculator can meet a 400 MHz timing constraint.
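The LUT replacement of the division works because td is clipped to [−128, 127], so the quotient can be precomputed for every |td|. A Python sketch of this idea; applying the sign of td outside the table is an assumption of this sketch, not a detail given in the paper.

```python
# 128-entry lookup table for tx = (16384 + (|td| >> 1)) / td, indexed by |td|.
# Entry 0 is a placeholder (td is never zero when scaling is invoked).
TX_LUT = [0] + [(16384 + (d >> 1)) // d for d in range(1, 129)]

def tx_from_lut(td):
    # td is already clipped to [-128, 127] and nonzero at this point.
    assert -128 <= td <= 127 and td != 0
    q = TX_LUT[abs(td)]
    return q if td > 0 else -q  # sign applied outside the table
```

Compared with a hardware divider, the table trades a small ROM for the divider's long combinational path, which is what lets sub-stage 2 fit the 2.5 ns budget.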

Secondly, an index-mapping scheme for the merge mode engine is proposed. The Candidate Producer in Fig. 6(a) follows the same procedure and accomplishes the identical function as merge mode in the HEVC standard. Taking the Combined Candidate Producer as an example, it receives the available candidates (OrigCand) from the 5 spatial and 2 temporal candidates.

Page 6: New PAPER Special Section on Design Methodologies for System on …pdfs.semanticscholar.org/260d/65d033723a2ff21fa6781a... · 2017. 10. 15. · 1356 IEICE TRANS. FUNDAMENTALS, VOL.E98–A,

WANG et al.: UNIFIED PARAMETER DECODER ARCHITECTURE FOR H.265/HEVC MOTION VECTOR AND BOUNDARY STRENGTH DECODING1361

Fig. 7 Line buffer and left-top register organization.

If the number of received candidates is less than five (say, n), then at most (5-n) combined candidates (CombCand) will be produced. The procedure for combined candidates is illustrated at the bottom right of Fig. 2. The two motion vectors for the two lists of a combined candidate are copied from different OrigCands to construct a combined one. Up to 12 kinds of combinations are employed in the HEVC specification, leading to 12 different combined candidates. The Zero Candidate Producer produces candidates (ZeroCand) when the number of OrigCand and CombCand is still less than five. Up to five ZeroCand may be used by merge mode.

For a direct hardware implementation, a candidate list of depth five is first generated from the 24 possible candidates mentioned above. Then the final merge result is chosen from this list by the merge index. However, a 24-to-5 MUX with the 95-bit entries of Fig. 7 would have to be utilized, which leads to a huge area cost.

Index-mapping is an optimization to reduce this area cost. The motivation is that the motion vector of the final result can only come from two sources: a copy of an input or the zero MV. Treating the zero MV as a special input, we can directly copy the input to the final result as long as we construct a mapping relation between them. As shown in Fig. 6(b), the index-mapping scheme receives a small data volume, such as the available flags, as input, and the five candidates in the final list are mapped to the original inputs. Based on the merge index and the mapping results, the merge result can be directly copied from the original inputs.

For hardware implementation of Index-mappingscheme, final motion vector is copied from seven original in-puts (5 spatial candidates and 2 temporal ones) or zero MV.Considering bi-direction prediction containing two motionvectors, only two 8-1 multiplexers are required, which savesarea cost a lot compared to the original design.

3.5 Pipeline Stage 3

This stage is in charge of calculating the BS parameters and writing the MV result back into memory. The BS calculation relies on information from the prediction unit and transform unit, as shown in Fig. 4(d). A split 4x4 pipeline scheme inside stage 3 is proposed for pipeline buffer reduction. The CU-adaptive pipeline is proposed at the top level of the PDec design. However, when the CU size is 64x64, 32 (64/4*2) neighboring 4x4 blocks in total would have to be buffered before the start of the BS calculation. We notice that the BS calculation is executed 4x4 block by 4x4 block, so a further separation of the pipeline granularity is feasible. By applying the split 4x4 pipeline scheme, only 8 neighboring blocks (the delay between stages 2 and 3) need to be buffered, which reduces the buffering memory by 75%.

The process speed is analyzed for the split 4x4 pipeline scheme. In stage 3 only the left and top edges of a PU are calculated to get an internal BS, which is further refined by TU information as shown in Fig. 4(d). For example, if a 64x64 CU consists entirely of one PU, the 64/4=16 4x4 blocks on the PU's left edge need 16 cycles, while another 16 cycles are for the PU's top edge. Considering that the left and top PU edges have one 4x4 block overlap, (16+16-1)=31 cycles in total are required. 46 cycles are for the case where the CU contains two PUs, since the extra PU edge inside the CU costs another (16-1)=15 cycles compared to the previous case, as shown in Table 1.
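The cycle counts quoted here (and the stage-3 column of Table 1) follow from a small formula over the CU edge length:

```python
def stage3_cycles(cu_size, num_pu):
    # Number of 4x4 blocks along one CU edge.
    n = cu_size // 4
    # Left edge + top edge, minus the one overlapping corner block.
    cycles = n + n - 1
    if num_pu == 2:
        cycles += n - 1  # one extra internal PU edge
    return cycles
```

Evaluating it for all four CU sizes reproduces the stage-3 entries 3(4), 7(10), 15(22) and 31(46) of Table 1.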

4. Proposed Memory Organization

4.1 Memory Organization for Spatial Storage

In the parameter decoder design, data access for spatial neighbors is considered the bottleneck for achieving the high throughput required by the 8K UHDTV application. As mentioned in Sect. 2, five neighboring blocks are needed for the MV process, and all left and top neighboring blocks for the BS calculation. All the data should be fetched within the scheduled cycles listed in Table 1. In order to meet the throughput, a line buffer is maintained as shown in Fig. 7. To retain the ability to distinguish different blocks, the proposed line buffer's cell is set equal to the minimum TU size (4x4). Meanwhile, to simplify the control logic, a single line buffer is employed, consisting of not only the storage of the top row (Top-Buffer) but also the left blocks of the current CTU (Left-Buffer). Under this memory organization, the neighboring blocks A0, A1 are stored in the Left-Buffer while B0, B1 are in the Top-Buffer, correspondingly. For each prediction unit, these four blocks can be read out from the line buffer in four cycles. If the CU is divided into two PUs, at most two blocks (A0, B1 or A1, B0) can be reused, so two extra cycles are needed. Further, A1 and B1 can be reused for the BS calculation to save another two cycles.

In addition, fifteen registers are maintained to store the



Fig. 8 PU-based coding scheme for DRAM bandwidth requirement reduction.

top-left B2 blocks. One reason is that the proposed replacement strategy for the line buffer cannot store the blocks at the concave corners of the decoded regions, so the top-left registers compensate for this defect of the line buffer. On the other hand, we use registers instead of reusing the line buffer for B2 storage. The reason is that B2 is read and refreshed frequently inside a CTU; implementation by registers eliminates the potential memory access conflict problem (reading and writing simultaneously). The left-top register file consists of fifteen identical registers, each of which is 95 bits. The content of these 95 bits is the same as that in the line buffer, as shown in Fig. 7. Whenever a block at a concave corner is written into the line buffer, it is also written into the corresponding left-top register simultaneously. Thus there is no communication between the line buffer and the left-top registers, so that further memory conflict problems are avoided.

The pipeline stages proposed in Sect. 3.1 lead to potential hazards for memory access. Forward bypassing is used to eliminate these problems. In detail, we allocate memory reading and writing to stage 1 and stage 3, respectively. Thus two potential memory conflict hazards exist. Firstly, the same memory address may be read and written simultaneously in one cycle. The second is a write-after-read hazard caused by the delayed writing operation. For these two hazards, two bypass registers are inserted between stages 1 and 2 as shown in Fig. 4(d). The proposed pipeline fixes the delays between stages so that stage 3 is always 12 cycles behind stage 1. Meanwhile, the minimum process time is 8 cycles, when the CU is 8x8. Therefore, when the current CU is in its third stage, at most the following two CUs are going through their first stage. Potential memory hazards can only happen during these 10 cycles. Thus we use registers to store the previous two CUs' results. Whenever the current CU wants to access the contents of the previous two CUs, we directly fetch them from the registers instead of reading from the line buffer. Therefore the forward bypass scheme can efficiently deal with the potential hazards for the line buffer.

4.2 Memory Organization for Temporal Storage

In this section, the PU-based coding scheme and the cyclically mapped SRAM are introduced in detail. All the co-located data is stored in off-chip DRAM because of its huge volume. DRAM bandwidth is generally regarded as the bottleneck of a real-time video decoder, and for 8K UHDTV more attention must be paid to relieving the DRAM bandwidth burden.

The PU-based coding scheme is proposed for DRAM bandwidth reduction. Writing data back to DRAM for every 4x4 block, as the line buffer does, is unacceptable because of the huge data volume. We notice that only the top-left 4x4 block in each 16x16 block is used as the temporal candidate in the HEVC standard, which means only one set of prediction parameters needs to be stored per 16x16 block for co-located storage. This alone reduces DRAM bandwidth by 93.75%. Further, for prediction units larger than 16x16, the same co-located information is repeatedly written into DRAM for each 16x16 block, placing a meaningless burden on the bandwidth requirement. We therefore propose the PU-based coding scheme for co-located data compression: the data stored in DRAM contains not only the prediction parameters but also an 8-bit description of the current PU in the CTU. As depicted in Fig. 8, a 32x32 PU would otherwise write identical data (95 bits) into DRAM four times; the PU-based scheme avoids this redundancy by writing the identical data (PU info.) only once, with 8 trailing bits. Although an extra 8 bits of PU description is added for each co-located block, the redundant DRAM storage is avoided. Experimental results show that the PU-based coding scheme achieves a further 30% reduction in the bandwidth requirement.
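The two reductions above can be checked with simple arithmetic. This is our own back-of-the-envelope verification, not the authors' exact accounting; in particular, the assumption that each covered 16x16 block contributes one 8-bit descriptor is our reading of Fig. 8.

```python
# 1) 16x16-block storage: only the top-left 4x4 block of each 16x16
#    block can serve as the temporal candidate, so 1 of 16 is kept.
reduction = 1 - 1 / 16
assert reduction == 0.9375  # the 93.75% figure in the text

# 2) PU-based coding for the Fig. 8 example: a 32x32 PU covers four
#    16x16 blocks. Assumption: the 95-bit parameter set is written
#    once, plus an 8-bit PU descriptor per covered block.
naive_bits = 4 * 95          # identical 95-bit data written four times
pu_based_bits = 95 + 4 * 8   # parameters once, plus descriptors
assert pu_based_bits < naive_bits  # 127 vs. 380 bits for this PU
```

The saving grows with PU size, which is why the measured average reduction (about 30%) depends on how often large PUs are chosen by the encoder.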

Figure 9 illustrates the proposed cyclic SRAM, which is designed to support random access to the co-located data. As mentioned above, the co-located data is pre-coded before storage, so on the reading side the decoding process is performed first; the decoded data is then written into the SRAM for random access. The proposed cyclic SRAM can be regarded as an addressable FIFO whose size is four CTUs' worth of co-located information. Based on our analysis of HEVC, the current decoding block only uses the data of the co-located CTU and the co-located next CTU for MV calculation, as shown in Fig. 9. Thus, as long as the contents of the following two co-located CTUs are stored inside, the cyclic SRAM can be read by PDec. In non-read cycles, the cyclic SRAM keeps writing decoded data until it meets the full condition, i.e., the writing cursor equals the write-stop address. A single-port SRAM is capable of all the above functions, which reduces control logic and saves area.

Table 2   Comparison with state-of-the-art works.

                                     Yin [6]           Tao [8]           Zhou [9]          This Work
Standard                             H.264/AVC         H.264/AVC & AVS   H.264/AVC         H.265/HEVC
Function                             MV                MV/BS/Intra       MV/BS/Intra       MV/BS
Worst-case throughput (pixel/cycle)  0.98              0.73              4.0               8.0
Avg. throughput (pixel/cycle)        1.6               1.6               4.0               17.8
SRAM size (line buffer)              2.8k (1080p)      4.75k (1080p)     3.6k (2160p)      23.0k (4320p)
Logic gates                          52k               63.0k             37.2k             93.3k
Norm. logic gates*                   41.80             101.27            7.47              4.69
Max. resolution                      1920x1080@60fps   1920x1080@30fps   3840x2160@60fps   7680x4320@60fps
Frequency                            126 MHz           84 MHz            124 MHz           249 MHz
Technology                           90 nm             65 nm             90 nm             90 nm

* As the supported maximum resolution varies, we define the normalized logic gate count as logic gates divided by processed pixels per second (in units of 10^-5).

Fig. 9   Working mechanism of the cyclically mapped SRAM.
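The addressable-FIFO behavior described above can be modeled as follows. This is a behavioral sketch under our own simplifications (the class and method names are ours, not the paper's); the capacity of four CTUs and the read condition of "current plus next co-located CTU present" follow the text.

```python
# Behavioral model of the cyclic SRAM: a four-CTU addressable FIFO.
# Writing proceeds until the write cursor reaches the write-stop
# address (full condition); a CTU is readable once it and the next
# co-located CTU have both been written.

CAPACITY = 4  # CTUs of co-located information

class CyclicSram:
    def __init__(self):
        self.mem = [None] * CAPACITY
        self.write_cursor = 0   # next CTU index to write (monotonic)
        self.read_cursor = 0    # current co-located CTU index

    def full(self):
        # the write cursor has wrapped onto the write-stop address
        return self.write_cursor - self.read_cursor == CAPACITY

    def write_ctu(self, data):
        if self.full():
            return False        # stall until a slot is released
        self.mem[self.write_cursor % CAPACITY] = data
        self.write_cursor += 1
        return True

    def readable(self):
        # the current CTU and the co-located next CTU must be inside
        return self.write_cursor - self.read_cursor >= 2

    def read_ctu(self):
        assert self.readable()
        data = self.mem[self.read_cursor % CAPACITY]
        self.read_cursor += 1
        return data

sram = CyclicSram()
sram.write_ctu("ctu0"); sram.write_ctu("ctu1")
assert sram.readable() and sram.read_ctu() == "ctu0"
```

Since reads and writes never target the same slot in the same cycle under this discipline, a single-port SRAM suffices, matching the area-saving argument in the text.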

5. Implementation Result

5.1 Simulation on PU-Based Coding Scheme

The PU-based coding scheme helps reduce DRAM bandwidth (BW). As this work targets UHDTV applications, we choose the Class A sequences (Traffic and PeopleOnStreet), which are 2560x1600 with 300 frames, and two 7680x4320 video sequences (06000 and 12000) with 150 frames as our test sequences. Different configurations (Random Access (RA) and Low Delay (LD)) and QP values are considered. The results are shown in Table 3.

In Table 3, all the bandwidths are normalized to 8K@60fps. If we use the 16x16 block storage mentioned in Sect. 4.2, the necessary DRAM BW is 92.3 MB/s. The PU-based coding scheme can further reduce the bandwidth by around 30% at least in most cases. For the 7680x4320 sequence 12000 at QP=40, up to 88.3% BW reduction is achieved.

Table 3   DRAM bandwidth reduction of the PU-based coding scheme.

Video Seq.   Cfg.   QP   Opt. BW     BW Reduction
06000        RA     25   28.4 MB/s   69.3%
06000        RA     40   11.2 MB/s   87.9%
12000        RA     25   23.3 MB/s   74.8%
12000        RA     40   10.8 MB/s   88.3%
ClassA       LD     22   66.6 MB/s   27.8%
ClassA       LD     37   35.8 MB/s   61.2%
ClassA       RA     22   61.0 MB/s   33.9%
ClassA       RA     37   34.7 MB/s   62.4%
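The 92.3 MB/s baseline can be reproduced from the figures given earlier: one 95-bit co-located entry per 16x16 block, normalized to 8K@60fps. This is our own check of the arithmetic, not the authors' measurement script.

```python
# Back-of-the-envelope reproduction of the 92.3 MB/s baseline:
# one 95-bit entry per 16x16 block at 7680x4320, 60 frames/s.

width, height, fps = 7680, 4320, 60
bits_per_entry = 95
blocks_per_frame = (width // 16) * (height // 16)   # 480 * 270
bw_bytes = blocks_per_frame * fps * bits_per_entry / 8
assert round(bw_bytes / 1e6, 1) == 92.3  # MB/s, matching the text
```

The percentages in Table 3 are reductions relative to this baseline, which is why the optimized bandwidths and reduction figures are mutually consistent.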

5.2 Synthesis Result

The unified PDec is implemented at the RTL level in Verilog HDL and synthesized with Synopsys Design Compiler using a TSMC 90 nm process.

In detail, the maximum clock frequency of our proposal reaches 400 MHz, and the area under a 300 MHz constraint is 93.3k gates. The on-chip line buffer size is 23.0 kB. From our analysis of the HEVC standard, the worst case is a frame coded entirely with 8x8 coding units. From Table 1, the worst case takes 8 cycles per CU, which equals 8 pixels per cycle. This throughput is sufficient for real-time decoding of 7680x4320@60fps video sequences. We also simulated our proposal on the HEVC test sequences; the results show an average processing speed of 17.8 pixels/cycle, with which the highest-level (6.2) 7680x4320@120fps applications can be decoded at only around 111.8 MHz on average.
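The worst-case frequency claim follows directly from the throughput figure. The short check below is ours; it reproduces the 249 MHz number from the 8 pixels/cycle worst case.

```python
# Worst case: every CU is 8x8 and takes 8 cycles, i.e. 8 pixels/cycle.
# Required frequency for 8K@60fps real-time decoding:

width, height, fps = 7680, 4320, 60
worst_pixels_per_cycle = 8.0
freq_hz = width * height * fps / worst_pixels_per_cycle
assert round(freq_hz / 1e6) == 249  # MHz, matching the claim
```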

Table 2 shows the comparison with other related works. Compared to the existing works, ours is the only one that supports the HEVC standard. Note that although intra prediction mode calculation is included in [8], [9], it affects the final area and timing cost little, since its algorithm is quite simple compared to that of MV derivation, especially in HEVC. On the other hand, our line buffer is much larger than the others' for three reasons: 1) HEVC supports PU edges as small as 4 pixels, so storage at 4x4 block granularity is needed; 2) the MV parameters in HEVC are more numerous than those in H.264; 3) the frame width of 8K UHDTV is twice that of 4K, which doubles the line buffer size. Finally, as the total area is related to the throughput, we define the normalized gate number in the table for a fair comparison. The normalized results show that our proposed unified architecture achieves around 36% better area efficiency, even though the complexity of the supported HEVC standard exceeds that of H.264/AVC.
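The normalized gate numbers in Table 2 can be reproduced from the footnote's definition. The snippet below is our own check, using the gate counts, resolutions, and frame rates listed in the table.

```python
# Normalized logic gates = gates / (processed pixels per second),
# expressed in units of 1e-5, per the Table 2 footnote.

def norm_gates(gates, width, height, fps):
    return gates / (width * height * fps) / 1e-5

results = [
    norm_gates(52_000, 1920, 1080, 60),   # Yin [6]
    norm_gates(63_000, 1920, 1080, 30),   # Tao [8]
    norm_gates(37_200, 3840, 2160, 60),   # Zhou [9]
    norm_gates(93_300, 7680, 4320, 60),   # this work
]
# -> [41.8, 101.27, 7.47, 4.69] after rounding, as in Table 2
```

The roughly 36% figure in the text corresponds to this work's 4.69 against the best prior result of 7.47.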

In total, our proposed unified motion vector and boundary strength PDec supports real-time decoding of 7680x4320@60fps video at 249 MHz in the worst case.

6. Conclusion

This paper proposes a unified parameter decoder architecture for UHDTV. The design performs both MV and BS calculation so that memory and logic resources can be shared. In particular, a CU-adaptive pipeline strategy is proposed to simplify the control logic while supporting HEVC's new coding tools. Moreover, an on-chip line buffer and a cyclic SRAM are designed for spatial and temporal reference storage, respectively, to meet the bandwidth requirements. The PU-based coding scheme reduces DRAM bandwidth by around 30%, and optimization of the irregular algorithms yields a 43.2k logic gate reduction. In total, the proposed unified parameter decoder supports real-time video decoding of 7680x4320@60fps at 249 MHz in the worst case.

Acknowledgments

This work is supported by the Waseda University Graduate Program for Embodiment Informatics (FY2013-FY2019).

References

[1] G.J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, “Overview of the High Efficiency Video Coding (HEVC) Standard,” IEEE Trans. Circuits Syst. Video Technol., vol.22, no.12, pp.1649–1668, Dec. 2012.

[2] JCT-VC, “High efficiency video coding (HEVC) test model 13 (HM13) encoder description,” JCTVC-O1002, Nov. 2013.

[3] G.-L. Li, C.-C. Wang, and K.-H. Chiang, “An efficient motion vector prediction method for avoiding AMVP data dependency for HEVC,” 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.7363–7366, May 2014.

[4] J. Zheng, D. Wu, L. Deng, D. Xie, and W. Gao, “A motion vector predictor architecture for AVS and MPEG-2 HDTV decoder,” Advances in Multimedia Information Processing — PCM 2006, Lecture Notes in Computer Science, vol.4261, pp.424–431, Springer Berlin Heidelberg, Berlin, Heidelberg, 2006.

[5] K. Xu and C.-S. Choy, “A power-efficient and self-adaptive prediction engine for H.264/AVC decoding,” IEEE Trans. VLSI Syst., vol.16, no.3, pp.302–313, March 2008.

[6] H. Yin, D. Zhang, X. Wang, and Z. Xia, “An efficient MV prediction VLSI architecture for H.264 video decoder,” 2008 International Conference on Audio, Language and Image Processing, ICALIP 2008, pp.423–428, July 2008.

[7] K. Yoo, J.-H. Lee, and K. Sohn, “VLSI architecture design of motion vector processor for H.264/AVC,” 2008 15th IEEE International Conference on Image Processing, ICIP 2008, pp.1412–1415, Oct. 2008.

[8] Y. Tao, G. He, W. He, Q. Wang, J. Ma, and Z. Mao, “Effective multi-standard macroblock prediction VLSI design for reconfigurable multimedia systems,” 2011 IEEE International Symposium on Circuits and Systems (ISCAS), pp.1487–1490, May 2011.

[9] J. Zhou, D. Zhou, X. He, and S. Goto, “A bandwidth optimized, 64 Cycles/MB joint parameter decoder architecture for ultra high definition H.264/AVC applications,” IEICE Trans. Fundamentals, vol.E93-A, no.8, pp.1425–1433, Aug. 2010.

[10] D. Palomino, F. Sampaio, S. Bampi, A. Susin, and L. Agostini, “FPGA based design for motion vector prediction in H.264/AVC encoders targeting HD1080p resolution,” 2012 VIII Southern Conference on Programmable Logic (SPL), pp.1–6, March 2012.

[11] X. Chen, P. Liu, D. Zhou, J. Zhu, X. Pan, and S. Goto, “A high performance and low bandwidth multi-standard motion compensation design for HD video decoder,” IEICE Trans. Electron., vol.E93-C, no.3, pp.253–260, March 2010.

[12] J. Kim, Y. Kim, and H.-J. Lee, “An H.264/AVC decoder with reduced external memory access for motion compensation,” IEICE Trans. Inf. & Syst., vol.E94-D, no.4, pp.798–808, April 2011.

[13] M. Tikekar, C.-T. Huang, C. Juvekar, V. Sze, and A.P. Chandrakasan, “A 249-Mpixel/s HEVC video-decoder chip for 4K ultra-HD applications,” IEEE J. Solid-State Circuits, vol.49, no.1, pp.61–72, Jan. 2014.

[14] S. Wang, D. Zhou, J. Zhou, T. Yoshimura, and S. Goto, “Unified VLSI architecture of motion vector and boundary strength parameter decoder for 8K UHDTV HEVC decoder,” Advances in Multimedia Information Processing — PCM 2014, Lecture Notes in Computer Science, vol.8879, pp.74–83, Springer International Publishing, Cham, 2014.

Appendix: Glossary of Acronyms

A glossary of the acronyms used in this paper is given below in alphabetical order.

Acronym   Full term
AMVP      Advanced Motion Vector Prediction
BS        Boundary Strength
BW        Bandwidth
CTU       Coding Tree Unit
CU        Coding Unit
DBF       Deblocking Filter
MC        Motion Compensation
MV        Motion Vector
NOP       No Operation
PDec      Parameter Decoder
PU        Prediction Unit
TU        Transform Unit

Shihao Wang received the B.E. degree in electronic engineering from Shanghai Jiao Tong University, China, in 2012 and the M.E. degree in engineering from Waseda University, Japan, in 2014. He is currently pursuing the Ph.D. degree in engineering at Waseda University, Japan. His interests are in VLSI architectures and algorithms for multimedia signal processing. He was a recipient of the Ting Hsin Scholarship in 2013 and 2014. He is supported by the Graduate Program of Embodiment Informatics held by Waseda University and MEXT, Japan.



Dajiang Zhou received B.E. and M.E. degrees from Shanghai Jiao Tong University, China. He earned his Ph.D. degree in engineering from Waseda University, Japan, in 2010. Dr. Zhou is currently with Waseda as an Assistant Professor at the Graduate School of Information, Production and Systems. His interests are in algorithms and implementations for multimedia and communications signal processing, especially in low-power high-performance VLSI architectures for video codecs including H.265/HEVC and H.264/AVC. Dr. Zhou received a number of awards including the Best Student Paper Award of the VLSI Circuits Symposium 2010, the International Low Power Design Contest Award of ACM ISLPED 2010, the 2013 Kenjiro Takayanagi Young Researcher Award, and the Chinese Government Award for Excellent Students Abroad of 2010. Dr. Zhou was a recipient of the research fellowship of the Japan Society for the Promotion of Science during 2009–2011. His work on an 8K UHDTV video decoder VLSI chip was granted the 2012 Semiconductor of the Year Award of Japan.

Jianbin Zhou received the B.E. degree in software engineering from South China University of Technology, China, in 2012 and the M.E. degree in engineering from Waseda University, Japan, in 2014. He is currently pursuing the Ph.D. degree in engineering at Waseda University, Japan. His interests are in VLSI architectures and algorithms for multimedia signal processing. He was a recipient of the Ting Hsin Scholarship in 2013 and 2014. He is supported by the Graduate Program of Embodiment Informatics held by Waseda University and MEXT, Japan.

Takeshi Yoshimura received B.E., M.E., and Dr. Eng. degrees from Osaka University, Osaka, Japan, in 1972, 1974, and 1997. He joined the NEC Corporation, Kawasaki, Japan, in 1974, where he has been engaged in research and development efforts devoted to computer application systems for communication network design, hydraulic network design, and VLSI CAD. From 1979 to 1980 he was on leave at the Electronics Research Laboratory, University of California, Berkeley, where he worked on very large scale integration computer-aided design layout. He received Best Paper Awards from the Institute of Electronics, Information and Communication Engineers of Japan (IEICE) and the IEEE CAS Society. Dr. Yoshimura is a professor of Waseda University and a member of IPSJ (the Information Processing Society of Japan) and IEEE.

Satoshi Goto received the B.E. and the M.E. degrees in Electronics and Communication Engineering from Waseda University in 1968 and 1970 respectively. He also received the Dr. of Engineering from the same university in 1978. He joined NEC Laboratories in 1970, where he worked on LSI design, multimedia systems and software as GM and Vice President. Since 2002, he has been a Professor at the Graduate School of Information, Production and Systems of Waseda University at Kitakyushu. He served as GC of ICCAD, ASPDAC, VLSI-SOC, ASICON and ISOCC, and was a board member of the IEEE CAS Society. He is an IEEE Life Fellow and IEICE Fellow. He is a Visiting Professor at Shanghai Jiao Tong University and Tsinghua University of China and a Member of the Science Council of Japan.