PROCESSOR ARCHITECTURES FOR MULTIMEDIA APPLICATIONS
Oguz Karacuka
What Is Multimedia Processing?
Desktop:
– 3D graphics (games)
– Speech recognition (voice input)
– Video/audio decoding (MPEG/MP3 playback)
Servers:
– Video/audio encoding (video servers, IP telephony)
– Digital libraries and media mining (video servers)
– Computer animation, 3D modeling & rendering (movies)
Embedded:
– 3D graphics (game consoles)
– Video/audio decoding & encoding (set-top boxes, PVR...)
– Image processing (digital cameras)
– Signal processing (cellular phones)
Characteristics Of Multimedia Apps.
Requirement for real-time response
– "Incorrect" result often preferred to slow result
– Unpredictability can be bad (e.g. dynamic execution)
Narrow data types
– Typical width of data in memory: 8 to 16 bits
– Typical width of data during computation: 16 to 32 bits
– 64-bit data types rarely needed
– Fixed-point arithmetic often replaces floating-point
Fine-grain (data) parallelism
– Identical operation applied on streams of input data
– Branches have high predictability
– High instruction locality in small loops or kernels
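The narrow data-type points above can be illustrated with a small fixed-point sketch. It assumes the common Q15 format (16-bit values with 15 fractional bits); the slides do not name a specific format, and the helper names `to_q15`, `q15_mul`, and `from_q15` are hypothetical. It shows 16-bit storage, a 32-bit intermediate product, and saturation in place of floating-point.

```python
Q = 15  # fractional bits; Q15 is an assumed format, not taken from the slides
INT16_MIN, INT16_MAX = -(1 << 15), (1 << 15) - 1


def to_q15(x: float) -> int:
    """Convert a float in [-1, 1) to a 16-bit Q15 integer, saturating at the range limits."""
    v = int(round(x * (1 << Q)))
    return max(INT16_MIN, min(INT16_MAX, v))


def q15_mul(a: int, b: int) -> int:
    """Multiply two Q15 values: form the wide product, round, shift back, saturate."""
    prod = a * b                          # fits in 32 bits for 16-bit inputs
    prod = (prod + (1 << (Q - 1))) >> Q   # add half an LSB, then drop the extra fraction bits
    return max(INT16_MIN, min(INT16_MAX, prod))


def from_q15(v: int) -> float:
    """Convert a Q15 integer back to a float, for inspection only."""
    return v / (1 << Q)
```

For example, `from_q15(q15_mul(to_q15(0.5), to_q15(0.5)))` gives 0.25, and multiplying -1.0 by -1.0 saturates to the largest representable value instead of overflowing.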
Characteristics Of Multimedia Apps. (cont.)
Coarse-grain parallelism
– Most apps organized as a pipeline of functions
– Multiple threads of execution can be used
Memory requirements
– High bandwidth requirements but can tolerate high latency
– High spatial locality (predictable pattern) but low temporal locality
– Cache bypassing and prefetching can be crucial
Examples of Media Functions
Matrix transpose/multiply (3D graphics)
DCT/FFT (Video, audio, communications)
Motion estimation (Video encoding, deinterlacing)
Gamma correction (3D graphics)
Haar transform (Media mining)
Median filter (Image processing)
Separable convolution (Image processing)
Viterbi decode (Communications, speech)
Bit packing (Communications, cryptography)
…
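As one concrete instance of the functions listed above, motion estimation can be sketched as an exhaustive block search that minimizes the sum of absolute differences (SAD). This is a minimal illustration, not a production encoder: frames are assumed to be plain lists of rows of 8-bit pixel values, and the function names and the full-search strategy are assumptions made for the example.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at (bx, by)
    and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def best_motion_vector(cur, ref, bx, by, n, search):
    """Try every displacement in [-search, search] that keeps the block inside `ref`.
    Returns (cost, dx, dy) for the first displacement with the lowest SAD."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            inside = (0 <= by + dy and by + dy + n <= h and
                      0 <= bx + dx and bx + dx + n <= w)
            if not inside:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best
```

The inner loop applies the same narrow operation to every pixel pair, which is the kind of kernel that packed sum-of-absolute-differences instructions target.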
Approaches to Media Processing
Multimedia Processing
– General-purpose processors with SIMD extensions
– VLIW with SIMD extensions (aka mediaprocessors, Adapted Programmable Architectures)
– DSPs (Flexible Programmable Architectures)
– ASICs/FPGAs (Dedicated/Function Specific Architectures)
– Vector Processors
Function Specific Architectures
Limited (if any) programmability
DSP or RISC core processor for main control
Special hardware accelerators for the DCT, quantization, entropy encoding, motion estimation...
High efficiency and speed: typically better compared to programmable architectures.
The silicon area optimization achieved by function-specific architectures allows lower production cost.
Programmable Dedicated Architectures
Increased flexibility: enables the processing of different tasks under software control.
Higher cost for design and manufacturing: additional hardware for program control is required.
Requires software development for the application: parallelization strategies have to be applied.
VLIW Advanced Architectures
Reduce the number of cycles per instruction required to execute highly complex and parallel algorithms.
Multiple independent functional units are directly controlled by long instruction words.
Inefficient use of silicon: requires a giant routing network of buses and crossbar switches.
All functional units share a common large register file.
Code compaction is typically done by a special compiler, which can predict branch outcomes by applying an algorithm known as trace scheduling.
Can be combined with SIMD architectures for increased parallelism, e.g. Mitsubishi D30V and Philips Semiconductors' TriMedia.
Philips TriMedia CPU64 Arch.
5-slot VLIW architecture with a 64-bit word size
27 functional units, offering a choice of operation types in each slot of the instruction
Any operation can be guarded to provide conditional execution without branching
All functional units provide vector-style subword parallelism on byte, half-word, or word entities
Instruction set and functional units optimized for media processing
A single multi-ported register file with bypass network, allowing 1-cycle latency operations
32 kB, 8-way instruction cache
16 kB, 8-way, quasi-dual-ported data cache
A variable-length (compressed) instruction set design
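The effect of guarded operations can be modeled in a few lines. The sketch below is not TriMedia code and does not use its instruction set. It only shows how a guard turns a conditional assignment into straight-line code, assuming 32-bit unsigned values. The names `guarded_move`, `clip_branching`, and `clip_guarded` are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # model 32-bit unsigned words (an assumption for this sketch)


def guarded_move(guard, new_value, old_value):
    """Model of a guarded move: the destination takes new_value only when the
    guard is true and otherwise keeps old_value. No branch is involved."""
    m = (-int(bool(guard))) & MASK32              # all ones if guard is true, else all zeros
    return (new_value & m) | (old_value & ~m & MASK32)


def clip_branching(x, lo, hi):
    """Reference version written with control flow."""
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x


def clip_guarded(x, lo, hi):
    """Same result as straight-line code: compares produce guards, moves are guarded."""
    x = guarded_move(x < lo, lo, x)
    x = guarded_move(x > hi, hi, x)
    return x
```

Because every operation is always issued, a VLIW compiler can schedule both guarded moves into fixed slots without having to predict which path a branch would take.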
Multiple-instruction, multiple-data (MIMD) architectures
Offer 10 to 100 times more throughput than existing VLIW and SIMD architectures.
Multiple instructions are executed in parallel on multiple data: a control unit for each data path.
Asynchronous nature increases the complexity of software development.
SIMD Extensions to General Purp. Processors
Performance
– A 1.2GHz Athlon can do MPEG-4 encoding at 6.4fps
– One 384Kbps W-CDMA channel requires 6.9 GOPS
Power consumption
– A 1.2GHz Athlon consumes ~60W
– Power consumption increases with clock frequency and complexity
Cost
– A 1.2GHz Athlon costs ~$62 to manufacture and has a list price of ~$600 (module) (year 2000)
– Cost increases with complexity
WHY?
SIMD Extensions to General Purp. Processors
Motivation
– Low media-processing performance of GPPs
– Cost and lack of flexibility of specialized ASICs for graphics/video
– Underutilized datapaths and registers
Basic idea: sub-word parallelism
– The mismatch between wide data paths and the relatively short data types found in multimedia applications
– Treat a 64-bit register as a vector of two 32-bit, four 16-bit, or eight 8-bit values (short vectors)
– Partition 64-bit datapaths to handle multiple narrow operations in parallel
Initial constraints
– No additional architecture state (registers)
– No additional exceptions
– Minimum area overhead
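Sub-word parallelism can be demonstrated in software by treating one integer as a 64-bit register holding eight byte lanes. The sketch below is an illustration of the partitioning idea under that assumption, not a model of any particular instruction set. The helper names are hypothetical, and Python integers stand in for the hardware register. The one thing a partitioned datapath must do is stop carries from crossing lane boundaries, which the masks emulate here.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF
LOW7 = 0x7F7F7F7F7F7F7F7F    # low 7 bits of every byte lane
HIGH1 = 0x8080808080808080   # top bit of every byte lane


def pack_bytes(values):
    """Pack eight 8-bit values into one 64-bit word, element 0 in the low byte."""
    assert len(values) == 8
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFF) << (8 * i)
    return word


def unpack_bytes(word):
    """Split a 64-bit word back into its eight byte lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(8)]


def packed_add_u8(a, b):
    """Add eight byte lanes with one wide addition (wraparound per lane)."""
    low = (a & LOW7) + (b & LOW7)               # carries stay inside each lane
    return (low ^ ((a ^ b) & HIGH1)) & MASK64   # add the top bits without a carry-out
```

A single call to `packed_add_u8` performs eight independent 8-bit additions, for example 250 + 10 wraps to 4 in its lane without disturbing its neighbors.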
Intel’s MMX Example
Targeted to accelerate multimedia and communications applications, especially on the Internet.
MMX extends the basic integer instructions (add, subtract, multiply, compare, and shift) into SIMD versions.
Added DCT/IDCT kernels.
MPEG-1 video decompression speeds up by about 80% with MMX, while some other applications, such as image filtering, speed up by as much as 370%.
Summary of SIMD Instructions
Integer arithmetic
– Addition and subtraction with saturation
– Fixed-point rounding modes for multiply and shift
– Sum of absolute differences
– Multiply-add, multiplication with reduction
– Min, max
Floating-point arithmetic
– Packed floating-point operations
– Square root, reciprocal
– Exception masks
Data communication
– Merge, insert, extract
– Pack, unpack (width conversion)
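The semantics of a few of these instruction classes can be written down lane by lane. The sketch below uses Python lists as stand-ins for short vectors. It describes the behavior in the spirit of MMX-style packed operations but is not a bit-exact model of any instruction, and the function names are hypothetical.

```python
def clamp(v, lo, hi):
    """Limit v to the closed range [lo, hi]."""
    return max(lo, min(hi, v))


def add_sat_u8(a, b):
    """Lane-wise unsigned 8-bit add that clamps at 255 instead of wrapping around."""
    return [clamp(x + y, 0, 255) for x, y in zip(a, b)]


def multiply_add_s16(a, b):
    """Multiply signed 16-bit lanes, then sum adjacent products into wider lanes.
    Python ints never overflow; hardware would hold each sum in 32 bits."""
    prods = [x * y for x, y in zip(a, b)]
    return [prods[i] + prods[i + 1] for i in range(0, len(prods), 2)]


def pack_s16_to_s8(a, b):
    """Width conversion: narrow two vectors of signed 16-bit lanes into one
    vector of signed 8-bit lanes, saturating values that do not fit."""
    return [clamp(v, -128, 127) for v in a + b]
```

Saturation matters for pixels: with wraparound, adding 10 to a brightness of 250 would give 4 (nearly black), while the saturating version gives 255.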
Summary of SIMD Instructions (cont.)
Comparisons
– Integer and FP packed comparison
– Compare absolute values
– Element masks and bit vectors
Memory
– No new load-store instructions for short vectors
– No support for strides or indexing
– Short vectors handled with 64b load and store instructions
– Pack, unpack, shift, rotate, shuffle to handle alignment of narrow data types within a wider one
– Prefetch instructions for utilizing temporal locality
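Packed comparisons produce element masks rather than setting a condition flag, and the masks then drive a bitwise select. The sketch below models that pattern for signed 16-bit lanes, assuming each compare result is all ones or all zeros per lane. The function names are hypothetical and the code is a behavioral illustration only.

```python
LANE_MASK = 0xFFFF  # 16-bit lanes (an assumption for this sketch)


def to_bits(v):
    """Two's-complement bit pattern of a signed 16-bit value."""
    return v & LANE_MASK


def from_bits(p):
    """Signed value of a 16-bit two's-complement bit pattern."""
    return p - 0x10000 if p & 0x8000 else p


def compare_gt_s16(a, b):
    """Lane-wise a > b; each result lane is all ones (0xFFFF) or all zeros."""
    return [LANE_MASK if x > y else 0 for x, y in zip(a, b)]


def select(mask, a_bits, b_bits):
    """Per lane: (a AND mask) OR (b AND NOT mask)."""
    return [(x & m) | (y & ~m & LANE_MASK) for m, x, y in zip(mask, a_bits, b_bits)]


def packed_max_s16(a, b):
    """Lane-wise maximum built from a compare followed by a mask select."""
    mask = compare_gt_s16(a, b)
    bits = select(mask, [to_bits(x) for x in a], [to_bits(y) for y in b])
    return [from_bits(p) for p in bits]
```

This compare-then-select idiom lets data-dependent choices run on all lanes at once, with no per-element branch.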
SIMD Ext. for GPP Summary
Narrow vector extensions for GPPs
– 64b or 128b registers as vectors of 32b, 16b, and 8b elements
Based on sub-word parallelism and partitioned datapaths
Instructions
– Packed fixed- and floating-point, multiply-add, reductions
– Pack, unpack, permutations
2x to 4x performance improvement over base architecture
– Limited by memory bandwidth
Difficult to use (no compilers)
Overhead of handling alignment and data-width adjustment
Optimized shared libraries
– Written in assembly, distributed by vendor
– Need well-defined API for data format and use
SUMMARY
Computationally intensive multimedia functions, such as MPEG encoding, HDTV codecs, 3D processing, and virtual reality, will still require dedicated processors.
We should expect new generations of general-purpose processors to devote more and more of the available chip real estate to multimedia support.