PROCESSOR ARCHITECTURES FOR MULTIMEDIA APPLICATIONS

Oguz Karacuka

What Is Multimedia Processing?

Desktop:
– 3D graphics (games)
– Speech recognition (voice input)
– Video/audio decoding (MPEG, MP3 playback)

Servers:
– Video/audio encoding (video servers, IP telephony)
– Digital libraries and media mining (video servers)
– Computer animation, 3D modeling & rendering (movies)

Embedded:
– 3D graphics (game consoles)
– Video/audio decoding & encoding (set top boxes, PVR...)
– Image processing (digital cameras)
– Signal processing (cellular phones)

Characteristics Of Multimedia Apps.

Requirement for real-time response
– “Incorrect” result often preferred to slow result
– Unpredictability can be bad (e.g. dynamic execution)

Narrow data types
– Typical width of data in memory: 8 to 16 bits
– Typical width of data during computation: 16 to 32 bits
– 64-bit data types rarely needed
– Fixed-point arithmetic often replaces floating-point

Fine-grain (data) parallelism
– Identical operation applied on streams of input data
– Branches have high predictability
– High instruction locality in small loops or kernels
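To make the point about fixed-point arithmetic concrete, here is a minimal sketch of a 16-bit fixed-point multiply. The Q15 format (1 sign bit, 15 fractional bits) and the helper names `to_q15`, `q15_mul`, and `from_q15` are assumptions chosen for illustration; the slides do not name a specific format.

```python
Q = 15  # fractional bits; Q15 is an assumption of this sketch
Q15_MIN, Q15_MAX = -(1 << Q), (1 << Q) - 1


def to_q15(x: float) -> int:
    """Convert a float in [-1, 1) to a Q15 integer, clamping at the ends."""
    return max(Q15_MIN, min(Q15_MAX, int(round(x * (1 << Q)))))


def q15_mul(a: int, b: int) -> int:
    """Multiply two Q15 values using only integer operations."""
    prod = a * b                          # 32-bit intermediate in Q30
    prod = (prod + (1 << (Q - 1))) >> Q   # round to nearest, back to Q15
    return max(Q15_MIN, min(Q15_MAX, prod))  # saturate to 16 bits


def from_q15(v: int) -> float:
    """Convert a Q15 integer back to a float for inspection."""
    return v / (1 << Q)


half = to_q15(0.5)
print(from_q15(q15_mul(half, half)))  # 0.25
```

Note how the data is 16 bits wide in storage but needs a 32-bit intermediate during the multiply, which matches the "8 to 16 bits in memory, 16 to 32 bits during computation" observation above.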

Characteristics Of Multimedia Apps. (cont.)

Coarse-grain parallelism
– Most apps organized as a pipeline of functions
– Multiple threads of execution can be used

Memory requirements
– High bandwidth requirements but can tolerate high latency
– High spatial locality (predictable pattern) but low temporal locality
– Cache bypassing and prefetching can be crucial

Examples of Media Functions

Matrix transpose/multiply (3D graphics)

DCT/FFT (Video, audio, communications)

Motion estimation (Video encoding, deinterlacing)

Gamma correction (3D graphics)

Haar transform (Media mining)

Median filter (Image processing)

Separable convolution (Image processing)

Viterbi decode (Communications, speech)

Bit packing (Communications, cryptography)

…
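As one example of such a kernel, motion estimation is commonly implemented as block matching with a sum-of-absolute-differences (SAD) cost. The sketch below is a plain full search under that assumption; the function names `sad` and `best_match` and the frame layout (lists of rows of integer pixels) are hypothetical and chosen only for illustration.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def best_match(ref, cur_block, top, left, search):
    """Full search in `ref` around (top, left) within +/- `search` pixels.

    Returns (dy, dx, cost) for the candidate with the lowest SAD.
    """
    n = len(cur_block)
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + n > height or x + n > width:
                continue  # candidate block falls outside the frame
            candidate = [row[x:x + n] for row in ref[y:y + n]]
            cost = sad(cur_block, candidate)
            if best is None or cost < best[2]:
                best = (dy, dx, cost)
    return best


# Synthetic 6x6 reference frame and a 2x2 block copied from position (2, 3).
ref = [[10 * y + x for x in range(6)] for y in range(6)]
cur = [row[3:5] for row in ref[2:4]]
print(best_match(ref, cur, top=1, left=1, search=2))  # (1, 2, 0)
```

The inner loop applies the same narrow-integer operation to every pixel, which is the kind of regular, data-parallel pattern the characteristics slides describe.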

Approaches to Media Processing

Multimedia processing:
– General-purpose processors with SIMD extensions
– VLIW with SIMD extensions (aka media processors, Adapted Programmable Architectures)
– DSPs (Flexible Programmable Architectures)
– ASICs/FPGAs (Dedicated/Function Specific Architectures)
– Vector processors

Application Example: MPEG Dec.

MPEG Encoder & Decoder Complexity

Function Specific Architectures

Limited (if any) programmability

DSP or RISC core processor for main control

Special hardware accelerators for the DCT, quantization, entropy encoding, motion estimation...

High efficiency and speed: typically better than programmable architectures.

The silicon area optimization achieved by function-specific architectures allows lower production cost.

Programmable Dedicated Architectures

Increased flexibility: enables the processing of different tasks under software control.

Higher cost for design and manufacturing: additional hardware for program control is required.

Require software development for the application: parallelization strategies have to be applied.

Flexible Programmable Architectures

TI’s Multimedia Video Processor (MVP) TMS320C80

Adapted Programmable Architectures

C-Cube’s VRP – VRP2

VLIW Advanced Architectures

Reduce the number of cycles per instruction required for execution of highly complex and parallel algorithms.

Multiple independent functional units are directly controlled by long instruction words.

Inefficient use of silicon: requires a giant routing network of buses and crossbar switches.

All functional units share a common large register file.

Code compaction is typically done by a special compiler, which can predict branch outcomes by applying an algorithm known as trace scheduling.

Can be combined with SIMD architectures for increased parallelism, e.g. Mitsubishi D30V and Philips Semiconductor’s TriMedia.

Philips TriMedia CPU64 Arch.

5-slot VLIW architecture with a 64-bit word size

27 functional units, offering a choice of operation types in each slot of the instruction

Any operation can be guarded to provide conditional execution without branching

All functional units provide vector-style subword parallelism on byte, half-word, or word entities

Instruction set and functional units optimized with respect to media processing

A single multi-ported register file with bypass network, allowing 1-cycle latency operations

32 kB, 8-way instruction cache

16 kB, 8-way, quasi-dual-ported data cache

A variable-length (compressed) instruction set design
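The idea of guarded operations (conditional execution without branching) can be illustrated in a few lines. This is a generic sketch of predication, not TriMedia code: the helpers `guarded` and `clip_to_byte` are hypothetical, and the bit-mask select only models the effect of a guard bit on a result.

```python
def guarded(guard: int, new_value: int, old_value: int) -> int:
    """Model a guarded operation: the new value is kept only if guard is 1."""
    mask = -guard  # guard 0 -> all bits clear, guard 1 -> all bits set
    return (new_value & mask) | (old_value & ~mask)


def clip_to_byte(x: int) -> int:
    """Clip x to [0, 255] with compares and guarded moves, no if/else."""
    below = int(x < 0)     # guard bits produced by compare operations
    above = int(x > 255)
    x = guarded(below, 0, x)
    x = guarded(above, 255, x)
    return x


print([clip_to_byte(v) for v in (-5, 100, 300)])  # [0, 100, 255]
```

Because both guarded moves always execute, the instruction stream is the same for every input, which is what makes this style easy for a VLIW compiler to schedule statically.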

Multiple-instruction, multiple-data (MIMD) architectures

Offer 10 to 100 times more throughput than existing VLIW and SIMD architectures.

Multiple instructions are executed in parallel on multiple data: a control unit for each data path.

Asynchronous nature increases the complexity of software development.

SIMD Extensions to General Purp. Processors

Performance
– A 1.2 GHz Athlon can do MPEG-4 encoding at 6.4 fps
– One 384 Kbps W-CDMA channel requires 6.9 GOPS

Power consumption
– A 1.2 GHz Athlon consumes ~60 W
– Power consumption increases with clock frequency and complexity

Cost
– A 1.2 GHz Athlon costs ~$62 to manufacture and has a list price of ~$600 (module) (year 2000)
– Cost increases with complexity

WHY ?

SIMD Extensions to General Purp. Processors

Motivation
– Low media-processing performance of GPPs
– Cost and lack of flexibility of specialized ASICs for graphics/video
– Underutilized datapaths and registers

Basic idea: sub-word parallelism
– The mismatch between wide data paths and the relatively short data types found in multimedia applications
– Treat a 64-bit register as a vector of two 32-bit, four 16-bit, or eight 8-bit values (short vectors)
– Partition 64-bit datapaths to handle multiple narrow operations in parallel

Initial constraints
– No additional architecture state (registers)
– No additional exceptions
– Minimum area overhead
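Sub-word parallelism can be simulated in software to show what a partitioned datapath does. The sketch below treats a 64-bit integer as four 16-bit lanes and adds them without letting carries cross lane boundaries. The names `pack`, `unpack`, and `packed_add16` are hypothetical, and the masking trick stands in for the hardware's broken carry chain.

```python
LANES, WIDTH = 4, 16
MASK64 = (1 << 64) - 1
HIGH = 0x8000_8000_8000_8000   # top bit of each 16-bit lane
LOW = ~HIGH & MASK64           # the lower 15 bits of each lane


def pack(values):
    """Pack four 16-bit values into one 64-bit word, lane 0 in the low bits."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFFFF) << (i * WIDTH)
    return word


def unpack(word):
    """Split a 64-bit word back into its four 16-bit lanes."""
    return [(word >> (i * WIDTH)) & 0xFFFF for i in range(LANES)]


def packed_add16(a, b):
    """Four independent wraparound 16-bit additions in one 64-bit word."""
    # Add the low 15 bits of every lane; the sum fits in 16 bits per lane,
    # so no carry can leak into the neighbouring lane.
    low_sum = (a & LOW) + (b & LOW)
    # Fold the top bits back in with XOR, which adds them without a carry out.
    return (low_sum ^ ((a ^ b) & HIGH)) & MASK64


a = pack([1, 0xFFFF, 0x7FFF, 100])
b = pack([2, 1, 1, 200])
print(unpack(packed_add16(a, b)))      # [3, 0, 32768, 300]
print(unpack((a + b) & MASK64))        # [3, 0, 32769, 300]: carry leaked into lane 2
```

The second print shows why an ordinary 64-bit add is not enough: the overflow from lane 1 corrupts lane 2 unless the carry chain is partitioned.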

Overview of SIMD Extensions

Intel’s MMX Example

Targeted to accelerate multimedia and communications applications, especially on the Internet.

MMX extends the basic integer instructions (add, subtract, multiply, compare, and shift) into SIMD versions.

Added DCT / IDCT kernels

MPEG-1 video decompression speeds up by about 80% with MMX, while some other applications, such as image filtering, speed up by as much as 370%.

Summary of SIMD Instructions

Integer arithmetic
– Addition and subtraction with saturation
– Fixed-point rounding modes for multiply and shift
– Sum of absolute differences
– Multiply-add, multiplication with reduction
– Min, max

Floating-point arithmetic
– Packed floating-point operations
– Square root, reciprocal
– Exception masks

Data communication
– Merge, insert, extract
– Pack, unpack (width conversion)
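The difference between wraparound and saturating addition is easy to see on pixel data. The following sketch models both per element on Python lists; the function names are hypothetical, and the list-based form only mimics the per-lane semantics, not any particular instruction encoding.

```python
def add_wrap_u8(a, b):
    """Element-wise unsigned 8-bit addition with wraparound (modulo 256)."""
    return [(x + y) & 0xFF for x, y in zip(a, b)]


def add_sat_u8(a, b):
    """Element-wise unsigned 8-bit addition that saturates at 255."""
    return [min(x + y, 255) for x, y in zip(a, b)]


def add_sat_s16(a, b):
    """Element-wise signed 16-bit addition saturating to [-32768, 32767]."""
    return [max(-32768, min(32767, x + y)) for x, y in zip(a, b)]


pixels = [200, 100, 30]
boost = [100, 50, 20]
print(add_wrap_u8(pixels, boost))  # [44, 150, 50]: bright pixel wraps to dark
print(add_sat_u8(pixels, boost))   # [255, 150, 50]: bright pixel clips to white
```

For media data the saturated result is usually the visually acceptable one, which is why these extensions provide it directly instead of requiring a compare-and-select sequence.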

Summary of SIMD Instructions

Comparisons
– Integer and FP packed comparison
– Compare absolute values
– Element masks and bit vectors

Memory
– No new load-store instructions for short vector
– No support for strides or indexing
– Short vectors handled with 64b load and store instructions
– Pack, unpack, shift, rotate, shuffle to handle alignment of narrow data-types within a wider one
– Prefetch instructions for utilizing temporal locality
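Width conversion with unpack and pack can be sketched as follows: 8-bit pixels are widened to 16-bit lanes for computation, then narrowed back with saturation. This is a simplified model with hypothetical helper names; in particular it treats all lanes as unsigned, which real pack instructions may not do.

```python
def lanes_to_word(lanes, width):
    """Pack a list of unsigned lanes into one integer, lane 0 in the low bits."""
    word = 0
    for i, v in enumerate(lanes):
        word |= (v & ((1 << width) - 1)) << (i * width)
    return word


def word_to_lanes(word, width, count):
    """Split an integer into `count` unsigned lanes of `width` bits each."""
    return [(word >> (i * width)) & ((1 << width) - 1) for i in range(count)]


def unpack_low_u8_to_u16(word):
    """Zero-extend the four low bytes of a 64-bit word into four 16-bit lanes."""
    return lanes_to_word(word_to_lanes(word, 8, 8)[:4], 16)


def pack_u16_to_u8_sat(lo, hi):
    """Narrow eight unsigned 16-bit lanes (lo then hi) to bytes, clamping at 255."""
    lanes = word_to_lanes(lo, 16, 4) + word_to_lanes(hi, 16, 4)
    return lanes_to_word([min(v, 255) for v in lanes], 8)


pixels = lanes_to_word([10, 200, 255, 0, 1, 2, 3, 4], 8)
wide = word_to_lanes(unpack_low_u8_to_u16(pixels), 16, 4)   # [10, 200, 255, 0]
doubled = lanes_to_word([2 * v for v in wide], 16)          # needs 16-bit headroom
print(word_to_lanes(pack_u16_to_u8_sat(doubled, 0), 8, 8))  # [20, 255, 255, 0, 0, 0, 0, 0]
```

The extra unpack and pack steps around the actual arithmetic are the data-width adjustment overhead that comes with short-vector extensions.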

SIMD Ext. for GPP Summary

Narrow vector extensions for GPPs
– 64b or 128b registers as vectors of 32b, 16b, and 8b elements

Based on sub-word parallelism and partitioned datapaths

Instructions
– Packed fixed- and floating-point, multiply-add, reductions
– Pack, unpack, permutations

2x to 4x performance improvement over base architecture
– Limited by memory bandwidth

Difficult to use (no compilers)

Overhead of handling alignment and data-width adjustment

Optimized shared libraries
– Written in assembly, distributed by vendor
– Need well defined API for data format and use

SUMMARY

Computationally intensive multimedia functions, such as MPEG encoding, HDTV codecs, 3D processing, and virtual reality, will still require dedicated processors.

We should expect new generations of general-purpose processors to devote more and more of their transistors and available chip real estate to multimedia support.
