
Low Power and Low Area Transform–Quant & Inverse Quant–Inverse Transform Hardware Design for H.264 Encoder

Outline

I. H.264 TQ & IQIT

II. DESIGNED HARDWARE

III. RESULTS

H.264 TQ & IQIT

Each residual macroblock is transformed and quantized.

Previous standards such as MPEG-1, MPEG-2, MPEG-4 and H.263 made use of the 8x8 Discrete Cosine Transform (DCT) as the basic transform.

The “baseline” profile of H.264 uses three transforms, depending on the type of residual data:

1) A transform for the 4x4 array of luma DC coefficients in intra macroblocks (predicted in 16x16 mode),

2) A transform for the 2x2 array of chroma DC coefficients (in any macroblock),

3) A transform for all other 4x4 blocks in the residual data.

Work accomplished ... ( T, Q, IQ, IT)

... Future work ( MC, toplevel, ...)

Data within a macroblock are transmitted in the order shown in the figure.

If the macroblock is coded in 16x16 Intra mode, then the block labelled “-1” is transmitted first, containing the DC coefficient of each 4x4 luma block. Next, the luma residual blocks 0-15 are transmitted in the order shown (with the DC coefficient set to zero in a 16x16 Intra macroblock). Blocks 16 and 17 contain a 2x2 array of DC coefficients from the Cb and Cr chroma components respectively. Finally, chroma residual blocks 18-25 (with zero DC coefficients) are sent.
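The transmission order described above can be written down as a short Python sketch. The function name `intra16x16_block_order` is hypothetical, and the block labels are simply the ones used in the text; the scan order inside each group is taken to be ascending label order, which is an assumption.

```python
def intra16x16_block_order():
    """Return block labels in transmission order for a 16x16 Intra macroblock."""
    order = [-1]                   # 4x4 luma DC block, sent first
    order += list(range(0, 16))    # luma residual blocks 0-15 (DC set to zero)
    order += [16, 17]              # 2x2 chroma DC blocks: Cb, then Cr
    order += list(range(18, 26))   # chroma residual blocks 18-25 (zero DC)
    return order


order = intra16x16_block_order()
print(len(order), order[:3], order[-3:])
```

For macroblocks not coded in 16x16 Intra mode, block -1 is absent and the sequence starts at block 0.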

The entire process of transform and quantization can be carried out using 16-bit integer arithmetic

4x4 Integer Transform & Inverse Transform

It is an integer transform.

The core part of the transform is multiply-free: it requires only additions and shifts.

A scaling multiplication (part of the complete transform) is integrated into the quantizer (reducing the total number of multiplications).
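As an illustration of the multiply-free core, here is a minimal Python sketch of one 1-D pass of the forward transform built from additions, subtractions and left shifts only. The function name `forward_1d` and the butterfly arrangement are assumptions for illustration, not the datapath of the designed hardware.

```python
def forward_1d(x0, x1, x2, x3):
    """One 1-D pass of the forward core transform, without multiplications."""
    s03 = x0 + x3          # outer sum
    d03 = x0 - x3          # outer difference
    s12 = x1 + x2          # inner sum
    d12 = x1 - x2          # inner difference
    return (
        s03 + s12,         # x0 + x1 + x2 + x3
        (d03 << 1) + d12,  # 2*x0 + x1 - x2 - 2*x3
        s03 - s12,         # x0 - x1 - x2 + x3
        d03 - (d12 << 1),  # x0 - 2*x1 + 2*x2 - x3
    )


print(forward_1d(1, 2, 3, 4))
```

Applying this pass to the four columns and then to the four rows of a block gives the 2-D transform; the factors of 2 are the only "multiplications" and reduce to one-bit shifts.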

4x4 Forward Integer Transform

[
(x0+x4+x8+x12) + (x1+x5+x9+x13) + (x2+x6+x10+x14) + (x3+x7+x11+x15),
2*(x0+x4+x8+x12) + (x1+x5+x9+x13) - (x2+x6+x10+x14) - 2*(x3+x7+x11+x15),
(x0+x4+x8+x12) - (x1+x5+x9+x13) - (x2+x6+x10+x14) + (x3+x7+x11+x15),
(x0+x4+x8+x12) - 2*(x1+x5+x9+x13) + 2*(x2+x6+x10+x14) - (x3+x7+x11+x15);

(2*x0+x4-x8-2*x12) + (2*x1+x5-x9-2*x13) + (2*x2+x6-x10-2*x14) + (2*x3+x7-x11-2*x15),
2*(2*x0+x4-x8-2*x12) + (2*x1+x5-x9-2*x13) - (2*x2+x6-x10-2*x14) - 2*(2*x3+x7-x11-2*x15),
(2*x0+x4-x8-2*x12) - (2*x1+x5-x9-2*x13) - (2*x2+x6-x10-2*x14) + (2*x3+x7-x11-2*x15),
(2*x0+x4-x8-2*x12) - 2*(2*x1+x5-x9-2*x13) + 2*(2*x2+x6-x10-2*x14) - (2*x3+x7-x11-2*x15);

(x0-x4-x8+x12) + (x1-x5-x9+x13) + (x2-x6-x10+x14) + (x3-x7-x11+x15),
2*(x0-x4-x8+x12) + (x1-x5-x9+x13) - (x2-x6-x10+x14) - 2*(x3-x7-x11+x15),
(x0-x4-x8+x12) - (x1-x5-x9+x13) - (x2-x6-x10+x14) + (x3-x7-x11+x15),
(x0-x4-x8+x12) - 2*(x1-x5-x9+x13) + 2*(x2-x6-x10+x14) - (x3-x7-x11+x15);

(x0-2*x4+2*x8-x12) + (x1-2*x5+2*x9-x13) + (x2-2*x6+2*x10-x14) + (x3-2*x7+2*x11-x15),
2*(x0-2*x4+2*x8-x12) + (x1-2*x5+2*x9-x13) - (x2-2*x6+2*x10-x14) - 2*(x3-2*x7+2*x11-x15),
(x0-2*x4+2*x8-x12) - (x1-2*x5+2*x9-x13) - (x2-2*x6+2*x10-x14) + (x3-2*x7+2*x11-x15),
(x0-2*x4+2*x8-x12) - 2*(x1-2*x5+2*x9-x13) + 2*(x2-2*x6+2*x10-x14) - (x3-2*x7+2*x11-x15)
]
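The expanded expression above is the matrix product Cf · X · Cf^T, with the input samples x0..x15 stored row by row in X. The Python sketch below computes that product and checks one entry against the expanded form. The helper names (`CF`, `matmul`, `transpose`, `forward_4x4`) are hypothetical.

```python
# Core forward transform matrix (integer part only, no scaling).
CF = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]


def matmul(a, b):
    """Plain 4x4 matrix product on nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transpose(m):
    return [list(col) for col in zip(*m)]


def forward_4x4(block):
    """Y = Cf * X * Cf^T for a 4x4 block given as a list of rows."""
    return matmul(matmul(CF, block), transpose(CF))


x = list(range(16))                       # x0..x15, arbitrary test values
X = [x[4 * r:4 * r + 4] for r in range(4)]
Y = forward_4x4(X)

# Second entry of the first row, copied from the expanded expression.
y01 = (2 * (x[0] + x[4] + x[8] + x[12]) + (x[1] + x[5] + x[9] + x[13])
       - (x[2] + x[6] + x[10] + x[14]) - 2 * (x[3] + x[7] + x[11] + x[15]))
print(Y[0][1] == y01)
```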

4x4 Inverse Integer Transform

[
(y0 + y4 + y8 + y12/2) + (y1 + y5 + y9 + y13/2) + (y2 + y6 + y10 + y14/2) + 1/2 * (y3 + y7 + y11 + y15/2),
(y0 + y4 + y8 + y12/2) + 1/2 * (y1 + y5 + y9 + y13/2) - (y2 + y6 + y10 + y14/2) - (y3 + y7 + y11 + y15/2),
(y0 + y4 + y8 + y12/2) - 1/2 * (y1 + y5 + y9 + y13/2) - (y2 + y6 + y10 + y14/2) + (y3 + y7 + y11 + y15/2),
(y0 + y4 + y8 + y12/2) - (y1 + y5 + y9 + y13/2) + (y2 + y6 + y10 + y14/2) - 1/2 * (y3 + y7 + y11 + y15/2);

(y0 + y4/2 - y8 - y12) + (y1 + y5/2 - y9 - y13) + (y2 + y6/2 - y10 - y14) + 1/2 * (y3 + y7/2 - y11 - y15),
(y0 + y4/2 - y8 - y12) + 1/2 * (y1 + y5/2 - y9 - y13) - (y2 + y6/2 - y10 - y14) - (y3 + y7/2 - y11 - y15),
(y0 + y4/2 - y8 - y12) - 1/2 * (y1 + y5/2 - y9 - y13) - (y2 + y6/2 - y10 - y14) + (y3 + y7/2 - y11 - y15),
(y0 + y4/2 - y8 - y12) - (y1 + y5/2 - y9 - y13) + (y2 + y6/2 - y10 - y14) - 1/2 * (y3 + y7/2 - y11 - y15);

(y0 - y4/2 - y8 + y12) + (y1 - y5/2 - y9 + y13) + (y2 - y6/2 - y10 + y14) + 1/2 * (y3 - y7/2 - y11 + y15),
(y0 - y4/2 - y8 + y12) + 1/2 * (y1 - y5/2 - y9 + y13) - (y2 - y6/2 - y10 + y14) - (y3 - y7/2 - y11 + y15),
(y0 - y4/2 - y8 + y12) - 1/2 * (y1 - y5/2 - y9 + y13) - (y2 - y6/2 - y10 + y14) + (y3 - y7/2 - y11 + y15),
(y0 - y4/2 - y8 + y12) - (y1 - y5/2 - y9 + y13) + (y2 - y6/2 - y10 + y14) - 1/2 * (y3 - y7/2 - y11 + y15);

(y0 - y4 + y8 - y12/2) + (y1 - y5 + y9 - y13/2) + (y2 - y6 + y10 - y14/2) + 1/2 * (y3 - y7 + y11 - y15/2),
(y0 - y4 + y8 - y12/2) + 1/2 * (y1 - y5 + y9 - y13/2) - (y2 - y6 + y10 - y14/2) - (y3 - y7 + y11 - y15/2),
(y0 - y4 + y8 - y12/2) - 1/2 * (y1 - y5 + y9 - y13/2) - (y2 - y6 + y10 - y14/2) + (y3 - y7 + y11 - y15/2),
(y0 - y4 + y8 - y12/2) - (y1 - y5 + y9 - y13/2) + (y2 - y6 + y10 - y14/2) - 1/2 * (y3 - y7 + y11 - y15/2)
]
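The inverse expression above corresponds to Ci^T · Y · Ci. A minimal Python sketch, using exact fractions, shows that the forward and inverse core transforms recover the input once the scaling that the codec folds into quantization and rescaling is applied. The matrix `S` of combined scale factors (1/16, 1/20, 1/25) is derived here from Cf · Cf^T = diag(4, 10, 4, 10); it is an illustration, not a table from the slides, and all names are hypothetical.

```python
from fractions import Fraction as F

CF = [[1, 1, 1, 1], [2, 1, -1, -2], [1, -1, -1, 1], [1, -2, 2, -1]]
# Inverse core matrix; its transpose multiplies Y from the left.
CI = [[1, 1, 1, 1],
      [1, F(1, 2), F(-1, 2), -1],
      [1, -1, -1, 1],
      [F(1, 2), -1, 1, F(-1, 2)]]

# Combined forward/inverse scaling per position: d[i] * d[j].
d = [F(1, 4), F(1, 5), F(1, 4), F(1, 5)]
S = [[d[i] * d[j] for j in range(4)] for i in range(4)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transpose(m):
    return [list(col) for col in zip(*m)]


def forward_4x4(x):
    return matmul(matmul(CF, x), transpose(CF))


def inverse_4x4(y):
    return matmul(matmul(transpose(CI), y), CI)


X = [[5, 11, 8, 10], [9, 8, 4, 12], [1, 10, 11, 4], [19, 6, 15, 7]]
Y = forward_4x4(X)
Y_scaled = [[Y[i][j] * S[i][j] for j in range(4)] for i in range(4)]
X_back = inverse_4x4(Y_scaled)
print(X_back == X)
```

In the codec itself the scaling is not applied exactly; it is approximated by the integer factors used in quantization and rescaling, so the round trip is only exact in this idealised form.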

Quantization

>> indicates a binary shift right. In the reference model software, f is 2^qbits/3 for Intra blocks or 2^qbits/6 for Inter blocks.

For QP>5, the factors MF remain unchanged but the divisor 2^qbits increases by a factor of 2 for each increment of 6 in QP.
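The quantization step can be sketched in Python as follows, assuming the commonly published form |Z| = (|W| · MF + f) >> qbits with qbits = 15 + floor(QP/6) and the sign of W restored afterwards. The MF values are the usual H.264 multiplication factors, not numbers taken from these slides, and the use of integer division for f is an assumption. Names such as `MF_TABLE` and `quantize` are hypothetical.

```python
# MF per QP % 6 for three position classes:
# (both indices even, both indices odd, all other positions)
MF_TABLE = [
    (13107, 5243, 8066),
    (11916, 4660, 7490),
    (10082, 4194, 6554),
    (9362, 3647, 5825),
    (8192, 3355, 5243),
    (7282, 2893, 4559),
]


def mf(qp, i, j):
    """Multiplication factor for coefficient position (i, j)."""
    even_even, odd_odd, other = MF_TABLE[qp % 6]
    if i % 2 == 0 and j % 2 == 0:
        return even_even
    if i % 2 == 1 and j % 2 == 1:
        return odd_odd
    return other


def quantize(w, qp, i, j, intra=True):
    """Quantize one transform coefficient w at position (i, j)."""
    qbits = 15 + qp // 6
    f = (1 << qbits) // (3 if intra else 6)    # rounding offset (assumed floor)
    z = (abs(w) * mf(qp, i, j) + f) >> qbits   # single multiply, then shift
    return z if w >= 0 else -z


print(quantize(140, 10, 0, 0))
```

Because MF depends only on QP % 6, raising QP by 6 leaves MF unchanged and adds one to qbits, which doubles the divisor.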

Inverse Quantization

The rescaled output increases by a factor of 2 for every increment of 6 in QP.

A further constant scaling factor of 64 is included to avoid rounding errors.

The values at the output of the inverse transform are divided by 64 to remove the scaling factor.
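A minimal Python sketch of the rescaling, assuming the commonly published form W' = Z · V · 2^floor(QP/6), where V already contains the constant factor of 64. The V values are the usual H.264 rescaling factors rather than numbers from these slides, and the rounding offset of 32 in `remove_scaling` is an assumption about how the division by 64 is rounded. All names are hypothetical.

```python
# V per QP % 6 for three position classes:
# (both indices even, both indices odd, all other positions)
V_TABLE = [
    (10, 16, 13),
    (11, 18, 14),
    (13, 20, 16),
    (14, 23, 18),
    (16, 25, 20),
    (18, 29, 23),
]


def v(qp, i, j):
    """Rescaling factor for coefficient position (i, j)."""
    even_even, odd_odd, other = V_TABLE[qp % 6]
    if i % 2 == 0 and j % 2 == 0:
        return even_even
    if i % 2 == 1 and j % 2 == 1:
        return odd_odd
    return other


def rescale(z, qp, i, j):
    """Inverse quantization of one coefficient: doubles for every 6 steps of QP."""
    return (z * v(qp, i, j)) << (qp // 6)


def remove_scaling(x):
    """Divide an inverse-transform output by 64, rounding to nearest (assumed)."""
    return (x + 32) >> 6


print(rescale(17, 10, 0, 0))
```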

4x4 luma DC coefficient Transform & Quantization

16x16 Intra-mode only

At the decoder, an inverse Hadamard transform is applied, followed by rescaling (note that the order is not reversed as might be expected).

If QP is greater than or equal to 12, rescaling is performed by:

If QP is less than 12, rescaling is performed by:
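The two luma DC rescaling cases can be sketched in Python as below. This assumes the form commonly given for H.264: a left shift by floor(QP/6) - 2 when QP is at least 12, and a rounded right shift by 2 - floor(QP/6) otherwise, both using the V factor for position (0,0). Treat it as an assumption to be checked against the standard; the names `V00` and `rescale_luma_dc` are hypothetical.

```python
# V(QP % 6, 0, 0): rescaling factor for position (0,0), assumed standard values.
V00 = [10, 11, 13, 14, 16, 18]


def rescale_luma_dc(w, qp):
    """Rescale one luma DC coefficient after the inverse Hadamard transform."""
    v = V00[qp % 6]
    k = qp // 6
    if qp >= 12:
        return (w * v) << (k - 2)
    # QP < 12: add a rounding offset, then shift right
    return (w * v + (1 << (1 - k))) >> (2 - k)


print(rescale_luma_dc(1, 12), rescale_luma_dc(1, 0))
```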

4x4 Forward & Inverse Hadamard Transform

[
(z0+z4+z8+z12) + (z1+z5+z9+z13) + (z2+z6+z10+z14) + (z3+z7+z11+z15),
(z0+z4+z8+z12) + (z1+z5+z9+z13) - (z2+z6+z10+z14) - (z3+z7+z11+z15),
(z0+z4+z8+z12) - (z1+z5+z9+z13) - (z2+z6+z10+z14) + (z3+z7+z11+z15),
(z0+z4+z8+z12) - (z1+z5+z9+z13) + (z2+z6+z10+z14) - (z3+z7+z11+z15);

(z0+z4-z8-z12) + (z1+z5-z9-z13) + (z2+z6-z10-z14) + (z3+z7-z11-z15),
(z0+z4-z8-z12) + (z1+z5-z9-z13) - (z2+z6-z10-z14) - (z3+z7-z11-z15),
(z0+z4-z8-z12) - (z1+z5-z9-z13) - (z2+z6-z10-z14) + (z3+z7-z11-z15),
(z0+z4-z8-z12) - (z1+z5-z9-z13) + (z2+z6-z10-z14) - (z3+z7-z11-z15);

(z0-z4-z8+z12) + (z1-z5-z9+z13) + (z2-z6-z10+z14) + (z3-z7-z11+z15),
(z0-z4-z8+z12) + (z1-z5-z9+z13) - (z2-z6-z10+z14) - (z3-z7-z11+z15),
(z0-z4-z8+z12) - (z1-z5-z9+z13) - (z2-z6-z10+z14) + (z3-z7-z11+z15),
(z0-z4-z8+z12) - (z1-z5-z9+z13) + (z2-z6-z10+z14) - (z3-z7-z11+z15);

(z0-z4+z8-z12) + (z1-z5+z9-z13) + (z2-z6+z10-z14) + (z3-z7+z11-z15),
(z0-z4+z8-z12) + (z1-z5+z9-z13) - (z2-z6+z10-z14) - (z3-z7+z11-z15),
(z0-z4+z8-z12) - (z1-z5+z9-z13) - (z2-z6+z10-z14) + (z3-z7+z11-z15),
(z0-z4+z8-z12) - (z1-z5+z9-z13) + (z2-z6+z10-z14) - (z3-z7+z11-z15)
]
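The expression above is the product H · Z · H with a symmetric 4x4 Hadamard matrix H, which is why the same structure serves for both the forward and the inverse direction. A small Python sketch (helper names hypothetical) checks that applying it twice returns the input scaled by 16. Any normalising division is left out, as in the expression above.

```python
H = [
    [1,  1,  1,  1],
    [1,  1, -1, -1],
    [1, -1, -1,  1],
    [1, -1,  1, -1],
]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def hadamard_4x4(z):
    """H * Z * H; H is symmetric, so no transpose is needed."""
    return matmul(matmul(H, z), H)


Z = [[3, -1, 4, 1], [5, 9, -2, 6], [5, 3, 5, -8], [9, 7, 9, 3]]
twice = hadamard_4x4(hadamard_4x4(Z))
print(twice == [[16 * value for value in row] for row in Z])
```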

2x2 chroma DC coefficient Transform & Quantization

Inverse transform is identical

During decoding, the inverse transform is applied before rescaling

If QP is greater than or equal to 6, rescaling is performed by:

If QP is less than 6, rescaling is performed by:

The rescaled coefficients are replaced in their respective 4x4 blocks of chroma coefficients

[ (z0+z2) + (z1+z3), (z0+z2) - (z1+z3); (z0-z2) + (z1-z3), (z0-z2) - (z1-z3)]
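The 2x2 chroma DC transform and its rescaling can be sketched in Python as follows. The transform is copied from the expression above. The rescaling assumes the form commonly given for H.264 (left shift by floor(QP/6) - 1 when QP is at least 6, right shift by 1 otherwise, with the V factor for position (0,0)) and should be checked against the standard. The names are hypothetical.

```python
# V(QP % 6, 0, 0): rescaling factor for position (0,0), assumed standard values.
V00 = [10, 11, 13, 14, 16, 18]


def chroma_dc_2x2(z):
    """2x2 transform of [[z0, z1], [z2, z3]]; the inverse is identical."""
    (z0, z1), (z2, z3) = z
    return [
        [(z0 + z2) + (z1 + z3), (z0 + z2) - (z1 + z3)],
        [(z0 - z2) + (z1 - z3), (z0 - z2) - (z1 - z3)],
    ]


def rescale_chroma_dc(w, qp):
    """Rescale one chroma DC coefficient after the inverse 2x2 transform."""
    v = V00[qp % 6]
    if qp >= 6:
        return (w * v) << (qp // 6 - 1)
    return (w * v) >> 1


Z = [[7, -2], [3, 5]]
print(chroma_dc_2x2(chroma_dc_2x2(Z)))
```

Applying the transform twice returns the input scaled by 4, which is consistent with the inverse transform being identical to the forward one.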

DESIGNED HARDWARE

Problems encountered

Signed arithmetic

Initially designed for 100 MHz

Creating a dual-purpose datapath introduces extra MUX delays

Hardware specified in the standard to avoid rounding errors

An error in the book “H.264 and MPEG-4 Video Compression”!

An unexpected routing error!

Designed hardware supports up to H.264 level 2.2 (SDTV @ 15 fps).

A dual purpose datapath is designed.

Transform and Quantization of a 4x4 block is completed in 36 clock cycles.

Inverse Quantization of a 4x4 block takes 18 clock cycles.

Inverse Transform of a 4x4 block is done in 36 clock cycles.

It takes nearly 2400 cycles to complete an intra 16x16 predicted macroblock.

Working at 80 MHz, the designed hardware can process up to about 33,000 macroblocks per second.
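The throughput figure follows from the clock rate and the cycle count per macroblock. The short Python check below also compares it with the macroblock rate of SDTV at 15 fps, assuming a 720x480 frame. That frame size is an assumption, since the slides do not state it.

```python
CLOCK_HZ = 80_000_000      # FPGA clock frequency from the slides
CYCLES_PER_MB = 2400       # approximate cycles per intra 16x16 macroblock

mb_per_second = CLOCK_HZ // CYCLES_PER_MB

# Assumed SDTV frame: 720x480 pixels, 16x16 macroblocks, 15 frames per second.
mbs_per_frame = (720 // 16) * (480 // 16)
required_mb_per_second = mbs_per_frame * 15

print(mb_per_second, required_mb_per_second)
```

Under these assumptions the hardware has headroom over the required rate, consistent with the claim of level 2.2 support.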

RESULTS

Number of ports : 68
Number of nets : 212
Number of instances : 30
Number of references to this view : 0

Total accumulated area :
Number of Dffs or Latches : 493
Number of Function Generators : 2688
Number of MUX CARRYs : 148
Number of MUXF5 : 608
Number of MUXF6 : 184
Number of accumulated instances : 3847

Number of global buffers used: 0

Synthesis Results

Synthesis is done with LeonardoSpectrum.

Clock frequency is 80 MHz.

Device Utilization for 2V8000ff1152

Resource               Used    Avail   Utilization
--------------------------------------------------
IOs                      68      824     8.25%
Global Buffers            0       16     0.00%
Function Generators    2688    93184     2.88%
CLB Slices             1344    46592     2.88%
Dffs or Latches         493    95656     0.52%
Block RAMs                0      168     0.00%
Block Multipliers         1      168     0.60%

FPGA & ASIC

The design can be used either for FPGA or for ASIC.

Only one multiplier is used (2V8000ff1152 has 168 block multipliers).

A clock frequency of 80 MHz is achieved on the FPGA.

Many pipeline stages were added to reach 80 MHz.

The designed hardware may work at a clock frequency of up to 200 MHz in ASIC.

Removing pipelining registers will decrease the area and power consumption.

Thanks ...

Questions?
