Thread-Parallel MPEG-2, MPEG4 and H.264 Video Encoders for SoC Multi-Processor Architecture

Tom R. Jacobs, Vassilios A. Chouliaras, and David J. Mulvaney

IEEE Transactions on Consumer Electronics

Outline

Introduction
  Background knowledge
  Main purpose
Previous work
Methodology
Experimental results
Conclusions

Introduction: Background Knowledge (1/5)

A number of lossy video compression standards have been developed: MPEG-1, MPEG-2, MPEG-4 Part 2 and H.264.

Maintaining image quality while reducing bit-rates requires additional computation and power consumption.

Such processing-intensive consumer application algorithms are generally implemented in System-on-Chip (SoC) devices.

Parallelism: Data-Level Parallelism (DLP) and Thread-Level Parallelism (TLP)

Data-Level Parallelism (DLP): distributing the data across different parallel processing nodes.

Program:

if CPU = "a" then
    low_limit = 1; upper_limit = 5
else if CPU = "b" then
    low_limit = 6; upper_limit = 10
end if
do i = low_limit, upper_limit
    Task on d(i)
end do

end program

[Figure: data array D of size 10, elements 1 to 10, mapped onto processing nodes]
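The DLP pseudocode above can be sketched in Python as follows. This is only an illustration under stated assumptions: the per-element `task` function is hypothetical (the slide says only "Task on d(i)"), and two worker threads stand in for the processing nodes "a" and "b".

```python
from concurrent.futures import ThreadPoolExecutor

def task(x):
    # Hypothetical per-element task; the slide only says "Task on d(i)".
    return x * x

def process_range(d, low_limit, upper_limit):
    # Limits are 1-based and inclusive, as in the slide's pseudocode.
    return [task(d[i - 1]) for i in range(low_limit, upper_limit + 1)]

def run_dlp(d):
    # CPU "a" takes elements 1..5, CPU "b" takes elements 6..10.
    limits = {"a": (1, 5), "b": (6, 10)}
    with ThreadPoolExecutor(max_workers=len(limits)) as pool:
        futures = {cpu: pool.submit(process_range, d, lo, hi)
                   for cpu, (lo, hi) in limits.items()}
        return {cpu: f.result() for cpu, f in futures.items()}

D = list(range(1, 11))  # data array D of size 10
results = run_dlp(D)
```

Each worker runs the same code on its own half of the array, so no element is touched by more than one worker.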

Thread-Level Parallelism (TLP): the parallelism inherent in an application that runs multiple threads at once.

Benefit: the workload of a single high-performance processor is distributed among a number of slower and simpler processor cores.

Introduction: Main Purpose (1/2)

Utilize Thread-Level Parallelism (TLP) techniques to improve video coding performance. Reduce the dynamic instruction count (DIC).

How to improve? Distribute the workload among a number of parallel-executing processors.

Introduction: Main Purpose (2/2)

The results presented demonstrate that reductions in dynamic instruction count can be achieved.

Previous Work

The majority of previous research focuses on coarse-granularity TLP exploitation, with the workload most commonly distributed at the group-of-pictures (GOP) level.

[Figure: a sequence of GOPs processed by multiple threads]

Little inter-node communication

Previous Work

In 1995, K. Shen, L. A. Rowe, and E. J. Delp implemented a parallel MPEG-1 encoder at GOP level.

In 1996, S. Bozoki, S. J. P. Westen, R. L. Lagendijk and J. Biemond compared GOP-level and slice-level parallelism for MPEG-1.

Previous Work

In 1997, A. Bilas, J. Fritts and J. P. Singh evaluated the performance of MPEG-2 decoders using a shared-memory system.

Akramullah, Ahmad and Liou implemented a threaded MPEG-2 encoder at the MB level by using local memory.

Methodology: Overview

The threaded MPEG-2, MPEG-4 and H.264 encoders were compiled for a multi-context instruction-set simulator (MT-ISS) based on the SimpleScalar infrastructure.

The most important issue: data dependencies between processors. Race hazards must be avoided.

Methodology: Race hazards

[Figure: two threads updating a shared integer i, showing the expected condition and the error condition caused by a race hazard]
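The race hazard described above arises when two threads read, modify and write the same integer without coordination, so one update can overwrite the other. A minimal Python sketch of one common remedy, a lock around the read-modify-write, is given here. The function name `run_counter` and the use of a lock are assumptions made for illustration; the slides do not state which mechanism the authors used.

```python
import threading

def run_counter(num_threads, increments_per_thread):
    # Shared integer, playing the role of "i" on the slide.
    state = {"i": 0}
    lock = threading.Lock()

    def worker():
        for _ in range(increments_per_thread):
            # The read-modify-write is done while holding the lock, so no
            # other thread can interleave between the read and the write.
            with lock:
                state["i"] = state["i"] + 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["i"]

total = run_counter(num_threads=2, increments_per_thread=10000)
```

With the lock in place the final value equals the number of increments issued; without it, an update from one thread may be lost.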

Methodology: Thread-parallel MPEG-2 (1/5)

Test Model 5 (TM5) of the MPEG-2 encoder is used.

Computation analysis (QCIF):
DIST1: 52% to 73% of total DIC for search windows of 6 to 62 pels respectively.
FullSearch: 3.5% to 23.2% of total DIC. Can be improved by a less complex algorithmic ME method (such as 3-step, 4-step or diamond search).
FDCT and IDCT: 2.1% to 21% of total DIC.
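To show why DIST1 and FullSearch dominate the instruction count, here is a simplified Python sketch of an exhaustive block search. It assumes DIST1 denotes a sum-of-absolute-differences block distance; the function signatures, the block size `n` and the small test frames are hypothetical and are not taken from TM5.

```python
def dist1(cur, ref, bx, by, dx, dy, n):
    # Sum of absolute differences between the n-by-n block at (bx, by) in
    # the current frame and the block displaced by (dx, dy) in the reference.
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total

def full_search(cur, ref, bx, by, n, search_range):
    # Try every displacement in the window and keep the lowest cost.
    height, width = len(ref), len(ref[0])
    best = (None, (0, 0))
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            # Skip candidates that fall outside the reference frame.
            if not (0 <= bx + dx and bx + dx + n <= width
                    and 0 <= by + dy and by + dy + n <= height):
                continue
            cost = dist1(cur, ref, bx, by, dx, dy, n)
            if best[0] is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best

# Toy 8x8 frames: the current frame is the reference shifted by (2, 1).
ref = [[y * 8 + x for x in range(8)] for y in range(8)]
cur = [[ref[y + 1][x + 2] if y + 1 < 8 and x + 2 < 8 else 0
        for x in range(8)] for y in range(8)]
best_cost, best_mv = full_search(cur, ref, bx=2, by=2, n=2, search_range=2)
```

The number of `dist1` calls grows with the square of the search range, which is consistent with the DIST1 share rising as the search window widens.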

Motion Estimation:
The kernel implementation can take advantage of data-parallel techniques.
Store the information in the mbinfo structure for motion compensation.
Maintain exclusivity of all variables during the parallel sections.

Forward transform:
FDCT first scans the MBs on a row-by-row basis and processes the MBs in a row individually.
It determines the prediction error and applies the DCT to each block.
The thread-parallel transform function can be performed at block level.

Inverse transform:
IDCT scans the MBs first row-by-row and then block-by-block.
Due to the absence of data dependencies between blocks, it can be executed in parallel.
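Because the blocks are independent, a block-level transform maps directly onto a pool of workers. The sketch below is an assumption-laden illustration: it uses a direct, unoptimised 8x8 DCT-II with orthonormal scaling and a thread pool, and the names `dct_block` and `transform_blocks` are hypothetical rather than taken from TM5.

```python
import math
from concurrent.futures import ThreadPoolExecutor

N = 8  # block size

def dct_block(block):
    # Direct (unoptimised) 2-D DCT-II of one 8x8 block.
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            cu = math.sqrt(0.5) if u == 0 else 1.0
            cv = math.sqrt(0.5) if v == 0 else 1.0
            acc = 0.0
            for y in range(N):
                for x in range(N):
                    acc += (block[y][x]
                            * math.cos((2 * y + 1) * u * math.pi / (2 * N))
                            * math.cos((2 * x + 1) * v * math.pi / (2 * N)))
            out[u][v] = 0.25 * cu * cv * acc
    return out

def transform_blocks(blocks, workers=4):
    # Each block is transformed independently; no block reads another
    # block's data, so the work can be handed out in any order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(dct_block, blocks))

# Two flat blocks: only the DC coefficient should be non-zero.
blocks = [[[10] * N for _ in range(N)], [[20] * N for _ in range(N)]]
coeffs = transform_blocks(blocks)
```

`pool.map` returns results in input order, so the output blocks line up with the input blocks regardless of which worker finished first.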

The MPEG-4 implementation is based on the XviD project with Advanced Simple Profile (ASP) features:
Bidirectional frames
Quarter-pel motion compensation
Global motion compensation
Trellis quantization
Custom quantization matrices

Computation analysis (QCIF)

The nature of the XviD encoder: intra-frame encoding and inter-frame encoding.

Intra-frame encoding: FrameCodeI (row-by-row for each MB).
Parallelize the loop for encoding the MBs in a row of the image.
MB data structure pMB: shared memory array.
The highest DIC metric in FrameCodeI is MBTransQuantIntra.

MBTransQuantIntra: forward transformation, quantization and inverse transformation.
Shared data structure pEnc: includes a count of quantization values (serial code section).
Transforms specific MB pixel data into the frequency domain independently.

MBPrediction and MBCoding: responsible for VLC and writing to the bitstream.

Inter-frame encoding: FrameCodeP
Part 1: motion estimation
Part 2: transformation and quantization

Motion Estimation: determines an MV for every MB and applies certain criteria to indicate when intra coding should be used.
Scanning is in raster-line order.
Two kinds of process:
Motion prediction from the current frame.
ME relative to reference frames.

Motion Prediction: examines the MVs in neighbouring MBs and determines an initial estimate for ME.

[Figure: neighbouring-MB patterns used for motion prediction: ideal pattern, typical pattern, TLP pattern]
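The motion prediction step can be illustrated with a component-wise median over neighbouring motion vectors. The slides do not say how the neighbours are combined, so the choice of the left, top and top-right neighbours and of the median is an assumption here, as are the function names and the dictionary layout of `mvs`.

```python
def median3(a, b, c):
    # Middle value of three numbers.
    return sorted((a, b, c))[1]

def predict_mv(mvs, mb_x, mb_y, mb_width):
    # mvs maps (x, y) macroblock coordinates to an (mvx, mvy) vector.
    # Neighbours outside the frame or not yet estimated count as (0, 0).
    def neighbour(x, y):
        if 0 <= x < mb_width and y >= 0:
            return mvs.get((x, y), (0, 0))
        return (0, 0)

    left = neighbour(mb_x - 1, mb_y)
    top = neighbour(mb_x, mb_y - 1)
    top_right = neighbour(mb_x + 1, mb_y - 1)
    return (median3(left[0], top[0], top_right[0]),
            median3(left[1], top[1], top_right[1]))

# Vectors already found for the MBs around position (1, 1) in a 3-MB-wide frame.
known = {(0, 1): (4, -2), (1, 0): (2, 0), (2, 0): (6, 3)}
initial_estimate = predict_mv(known, mb_x=1, mb_y=1, mb_width=3)
```

The prediction depends on MBs processed earlier in raster order, which is the kind of dependency that has to be respected when MBs are distributed across threads.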

Methodology: H.264 (1/6)

x264 is used for the implementation. Parallelization is by frame slicing.

Main problems of using MB-level parallelism:
Wide variation in processor workload.
Modification of the prediction algorithm is needed.

Slice group in H.264: a group of MBs in a frame.
Can be encoded or decoded separately from the remainder of the frame.
Motion prediction across slice boundaries is not allowed.
Drawback: the required bit-rate increases.
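Frame slicing can be sketched as splitting the macroblock rows of a frame into contiguous groups and handing each group to its own worker. In the sketch below the row-based partition, the function names and the placeholder `encode_slice` (which only sums its rows) are all assumptions for illustration; this is not the x264 slicing code.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_slices(mb_rows, num_slices):
    # Returns (first_row, end_row_exclusive) per slice; slice sizes differ
    # by at most one MB row.
    base, extra = divmod(mb_rows, num_slices)
    bounds, start = [], 0
    for s in range(num_slices):
        size = base + (1 if s < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds

def encode_slice(frame, bounds):
    # Hypothetical stand-in for a slice encoder: it reads only its own rows,
    # mirroring the rule that prediction does not cross slice boundaries.
    first, end = bounds
    return sum(sum(row) for row in frame[first:end])

def encode_frame(frame, num_slices):
    bounds = split_into_slices(len(frame), num_slices)
    with ThreadPoolExecutor(max_workers=num_slices) as pool:
        return list(pool.map(lambda b: encode_slice(frame, b), bounds))

# A QCIF frame (176x144) has 11x9 macroblocks; one list entry per MB here.
frame = [[r] * 11 for r in range(9)]
slice_bounds = split_into_slices(9, 4)
slice_results = encode_frame(frame, 4)
```

With 9 MB rows and 4 slices the first slice gets one extra row, a small instance of the workload imbalance that slicing has to manage.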

Comparison of different numbers of slices

Different resolution with 4 slices

Computation analysis

Experimental Results: MPEG-2

[Figure: MPEG-2 results versus search range]

Experimental Results: MPEG-4

[Figure: MPEG-4 results versus quality setting]

Experimental Results: H.264

[Figure: H.264 results versus quantization parameter]

Experimental Results: Comparative results

Conclusions

The DIC metric of MPEG-2, MPEG-4, and H.264 can be greatly reduced by TLP.

For HD sequences, the improvement is around 84%, 92% and 96% respectively.

TLP has become more significant for each new generation of video encoders.
