Multimedia Characteristics and Optimizations
Marilyn Wolf, Dept. of EE, Princeton University
© 2004 Marilyn Wolf
Outline
Fritts: compiler studies.
Lv: compiler studies.
Memory system optimizations.
Basic Characteristics
Comparison of operation frequencies with SPEC: (ALU, mem, branch, shift, FP, mult) => (4, 2, 1, 1, 1, 1)
 Lower frequency of memory and floating-point operations
 More arithmetic operations
 Larger variation in memory usage
Basic block statistics
 Average of 5.5 operations per basic block
 Need global scheduling techniques to extract ILP
Static branch prediction
 Average of 89.5% static branch prediction on training input
 Average of 85.9% static branch prediction on evaluation input
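The static prediction numbers above come from a simple scheme: fix each branch's prediction to its majority direction on a training input, then score that fixed prediction on a different evaluation input. A minimal sketch of that measurement; the traces of (branch PC, taken) pairs and the helper names are hypothetical:

```python
# Sketch: profile-based static branch prediction. Each branch gets one
# fixed prediction (its majority direction on the training trace),
# which is then scored on an evaluation trace. Traces are made up.
from collections import defaultdict

def train_static_predictor(trace):
    """Map each branch PC to its most frequent direction in the trace."""
    taken = defaultdict(int)
    total = defaultdict(int)
    for pc, t in trace:
        total[pc] += 1
        taken[pc] += t
    return {pc: taken[pc] * 2 >= total[pc] for pc in total}

def accuracy(predictor, trace, default=True):
    """Fraction of branches in the trace the fixed predictions get right."""
    hits = sum(predictor.get(pc, default) == bool(t) for pc, t in trace)
    return hits / len(trace)

train = [(0x100, 1)] * 9 + [(0x100, 0)] + [(0x200, 0)] * 4 + [(0x200, 1)]
evald = [(0x100, 1)] * 8 + [(0x100, 0)] * 2 + [(0x200, 0)] * 5
pred = train_static_predictor(train)
print(accuracy(pred, train), accuracy(pred, evald))
```

As in the slide's numbers, accuracy on the evaluation input is generally at or below the training-input accuracy, since the fixed direction was fitted to the training run.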
Data types and sizes
 Nearly 70% of all instructions require only 8- or 16-bit data types
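Because most operands fit in 8 or 16 bits, several narrow operations can share one ALU word (the subword parallelism noted later in the conclusions). A small sketch of the idea on a 32-bit word, using the classic carry-suppression trick; the helper names are ours and the code is illustrative, not any particular ISA's packed-add instruction:

```python
# Sketch: subword (SIMD-within-a-register) addition of four unsigned
# 8-bit lanes packed into one 32-bit word.

def pack4(lanes):
    """Pack four 8-bit values into a 32-bit word (lane 0 = low byte)."""
    a, b, c, d = lanes
    return a | (b << 8) | (c << 16) | (d << 24)

def unpack4(word):
    return [(word >> (8 * i)) & 0xFF for i in range(4)]

def add4_u8(x, y):
    """Add four 8-bit lanes at once, each wrapping modulo 256.

    Carry-suppression trick: add the low 7 bits of every lane, then
    fold the top bits back in with XOR so no carry crosses a lane
    boundary.
    """
    low = (x & 0x7F7F7F7F) + (y & 0x7F7F7F7F)
    return low ^ ((x ^ y) & 0x80808080)

x = pack4([10, 200, 100, 255])
y = pack4([20, 100, 50, 1])
print(unpack4(add4_u8(x, y)))  # -> [30, 44, 150, 0]
```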
Breakdown of Data Types by Media Type
[Figure: ratio of data types (%) per media type; categories: floating-point, pointers, word, halfword, byte.]
Memory Statistics
Working set size cache regression: cache sizes 1 KB to 4 MB; assumed line size of 64 bytes; measured read and write miss ratios
Spatial locality cache regression: line sizes 8 to 1024 bytes; assumed cache size of 64 KB; measured read and write miss ratios
Memory results: data memory: 32 KB working set and 60.8% spatial locality (up to 128 bytes); instruction memory: 8 KB working set and 84.8% spatial locality (up to 256 bytes)
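A cache regression of this kind amounts to replaying an address trace through a parameterized cache model and recording the miss ratio at each cache-size or line-size point. A minimal direct-mapped sketch, with a synthetic trace purely for illustration:

```python
# Sketch: direct-mapped cache simulator of the kind used for the
# working-set and spatial-locality regressions above.

def miss_ratio(trace, cache_bytes, line_bytes):
    """Replay byte addresses through a direct-mapped cache; return miss ratio."""
    n_lines = cache_bytes // line_bytes
    tags = [None] * n_lines          # one resident block number per line
    misses = 0
    for addr in trace:
        block = addr // line_bytes   # line-aligned block number
        index = block % n_lines
        if tags[index] != block:     # miss: fill the line
            misses += 1
            tags[index] = block
    return misses / len(trace)

# Sequential byte scan of a 64 KB region: with 64-byte lines there is
# one miss per line, so the miss ratio is 1/64.
trace = list(range(0, 64 * 1024))
print(miss_ratio(trace, 32 * 1024, 64))
```

Sweeping `cache_bytes` at fixed `line_bytes` gives the working-set curve; sweeping `line_bytes` at fixed `cache_bytes` gives the spatial-locality curve.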
Data Spatial Locality
[Figure: degree of spatial locality (%) versus line size (16 to 1024 bytes) for video, image, graphics, audio, speech, security, and the average.]
Multimedia Looping Characteristics
Highly loop centric: nearly 95% of execution time spent within the two innermost loop levels
Large number of iterations: significant processing regularity; about 10 iterations per loop on average
Path ratio indicates intra-loop complexity: computed as the ratio of the average number of instructions executed per loop invocation to the total number of instructions in the loop; average path ratio of 78%, indicating greater control complexity than expected
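The path ratio definition above can be written out directly; the loop profile below is a made-up example:

```python
# Sketch: path ratio = (average instructions executed per loop
# invocation) / (total static instruction count of the loop body).

def path_ratio(instrs_per_invocation, loop_size):
    """instrs_per_invocation: dynamic instruction count for each
    invocation of the loop; loop_size: static size of the loop body."""
    avg = sum(instrs_per_invocation) / len(instrs_per_invocation)
    return avg / loop_size

# A 100-instruction loop body whose branches skip some code on every
# pass executes only ~78 instructions per invocation on average.
print(path_ratio([80, 75, 80, 77], 100))  # -> 0.78
```

A ratio well below 1 means invocations routinely bypass part of the body, i.e., there is real intra-loop control flow rather than one straight-line path.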
Average Iterations per Loop and Path Ratio
[Figure: two panels per media type (video, image, graphics, audio, speech, security, average): average path ratio (0 to 1) and average number of loop iterations (log scale, 1 to 1000).]
Instruction Level Parallelism
Instruction level parallelism
 Base model: single issue using classical optimizations only
 Parallel model: 8-issue
Explores only parallel scheduling performance: assumes an ideal processor model; no performance penalties from branches, cache misses, etc.
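The ideal-processor model above can be sketched as a greedy list scheduler that packs a dependence DAG of single-cycle operations into w-issue cycles, with no branch or cache-miss penalties; speedup is single-issue cycles over w-issue cycles. The DAG below is hypothetical:

```python
# Sketch: ideal-processor parallel scheduling. Operations issue as
# soon as their dependences are done, up to `width` per cycle.

def schedule_cycles(deps, n_ops, width):
    """deps: {op: set of ops it depends on}; ops are 0..n_ops-1, all
    single-cycle. Returns the cycle count for the given issue width."""
    done, cycles = set(), 0
    while len(done) < n_ops:
        ready = [op for op in range(n_ops)
                 if op not in done and deps.get(op, set()) <= done]
        done.update(ready[:width])   # issue up to `width` ready ops
        cycles += 1
    return cycles

# 8 independent ops feeding one final op:
deps = {8: set(range(8))}
base = schedule_cycles(deps, 9, 1)   # single-issue base model
wide = schedule_cycles(deps, 9, 8)   # 8-issue parallel model
print(base, wide, base / wide)       # 9 cycles vs. 2 -> 4.5x speedup
```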
[Figure: 8-issue speedup over the single-issue base model per media type (video, image, graphics, audio, speech, security, average), for classical only, classical w/ inlining, superblock, and hyperblock compilation.]
Workload Evaluation Conclusions
Operation Characteristics: more arithmetic operations; less memory and floating-point usage; large variation in memory usage; (ALU, mem, branch, shift, FP, mult) => (4, 2, 1, 1, 1, 1)
Good Static Branch Prediction: multimedia: 10-15% avg. miss ratio; general-purpose: 20-30% avg. miss ratio; similar basic block sizes (5 instrs per basic block)
Primarily Small Data Types (8 or 16 bits): nearly 70% of instructions require 16-bit or smaller data types; significant opportunity for subword parallelism or narrower datapaths
Memory: typically small data and instruction working set sizes; high data and instruction spatial locality
Loop-Centric: majority of execution time spent in the two innermost loop levels; average of 10 iterations per loop invocation; path ratio indicates greater control complexity than expected
Architecture Evaluation
Determine fundamental architecture style
Statically scheduled => Very Long Instruction Word (VLIW): allows wider issue; simple hardware => potentially higher frequencies
Dynamically scheduled => Superscalar: allows decoupled data memory accesses; effective at reducing penalties from stalls
Examine a variety of architecture parameters: fundamental architecture style; instruction fetch architecture; high-frequency effects; cache memory hierarchy
Related Work
[Lee98] "Media Architecture: General Purpose vs. Multiple Application-Specific Programmable Processors," DAC-35, 1998.
[PChang91] "Comparing Static and Dynamic Code Scheduling for Multiple-Instruction-Issue Processors," MICRO-24, 1991.
[ZWu99] "Architecture Evaluation of Multi-Cluster Wide-Issue Video Signal Processors," Ph.D. Thesis, Princeton University, 1999.
[DZucker95] "A comparison of hardware prefetching techniques for multimedia benchmarks," Technical Report CSL-TR-95-683, Stanford University, 1995.
Fundamental Architecture Evaluation
Fundamental architecture evaluation included: static vs. dynamic scheduling; issue width
Focused on non-memory-limited applications: determine the impact of datapath features independent of memory; assume memory techniques can solve the memory bottleneck
Architecture model: 8-issue processor; operation latencies targeted for 500 MHz to 1 GHz; 64 integer and floating-point registers; pipeline: 1 fetch, 2 decode, 1 write-back, and a variable number of execute stages; 32 KB direct-mapped L1 data cache with 64-byte lines; 16 KB direct-mapped L1 instruction cache with 256-byte lines; 256 KB 4-way set-associative on-chip L2 cache; 4:1 processor-to-external-bus frequency ratio
Static versus Dynamic Scheduling
[Figure: IPC by compilation method (classical, superscalar, hyperblock) for VLIW, in-order superscalar, and out-of-order superscalar, each with real and perfect caches; static versus dynamic scheduling for the various compiler methods.]
[Figure: IPC versus issue width for VLIW, in-order superscalar, and out-of-order superscalar; the result of increasing issue width for the given architecture and compiler methods.]
Instruction Fetch Architecture
[Figure: IPC difference (%) by compilation method (classical, superscalar, hyperblock) for VLIW w/ fixed-width instructions, VLIW w/ variable-width instructions, in-order superscalar, and out-of-order superscalar; aggressive versus conservative fetch methods.]
[Figure: branch prediction rate (%) versus predictor size (0 to 25 KB) for 2-bit counter, PAs(6,16), and PAs(10,8) predictors under classical, superscalar, and hyperblock compilation; comparison of dynamic branch prediction schemes.]
Experimental Configuration: Single-Issue Processor
SimpleScalar sim-outorder
Single-issue configuration
RISC
Experimental Configuration: Benchmarks
Selected from different areas of MediaBench
Additional real-world applications
Baseline benchmark characteristics
 Measured on the single-issue processor
 Execution time closely related to dynamic instruction count
[Figure: dynamic instruction count and execution cycles (log scale, 1 to 10^10) for g721_enc, g721_dec, jpeg_enc, jpeg_dec, mpeg2_enc, mpeg2_dec, region, contour, ellipse, match, and hmm.]
VLIW vs. Single Issue
Static Code Size
Dynamic Operation Count
Execution Speed
Basic Block Size
Static Code Size Results
[Figure: unified static code size for each benchmark and the average (AVG); single issue versus TM1300.]
Static Code Size Analysis
Similar static code size
On average, TM1300 requires 17% more space
Dynamic Operation Count Results
[Figure: unified dynamic operation count for each benchmark and AVG; single issue versus TM1300.]
Dynamic Operation Count Analysis
Dynamic instruction counts are similar for the two types of processors
On average, TM1300 needs 20% more operations
The execution-time impact of the ISA difference is small
Execution Speed Results
[Figure: unified execution speed for each benchmark and AVG; single issue versus TM1300.]
Execution Speed Analysis
TM1300 executes all benchmarks faster than the single-issue processor
On average, the speedup is 3.4x, resulting partly from the wide-issue capability and partly from other architecture features
Unoptimized Basic Block Size Results
[Figure: unoptimized basic block size for each benchmark and AVG; single issue versus TM1300.]
Unoptimized Basic Block Size Analysis
The TriMedia compiler produces code with larger basic blocks
On average, a basic block on TM1300 is twice as large
Exploiting Special Features
Methods: using custom instructions; loop transformations
Metrics: execution speed; memory access count; basic block size; operation-level parallelism
Execution Time Results
[Figure: unified execution speed for each benchmark and AVG; un-optimized versus optimized.]
Execution Time Analysis
1.5x average speedup
Data-transfer-intensive, floating-point-intensive, and table-lookup-intensive applications see less speedup
Memory Access Count Results
[Figure: unified memory reference count for each benchmark and AVG; un-optimized versus optimized.]
Memory Access Count Analysis
Reduced average memory access count
Memory access can be a bottleneck (MPEG)
Optimized Basic Block Size
[Figure: optimized basic block size for each benchmark and AVG; un-optimized versus optimized.]
Optimized Basic Block Size Analysis
A significant increase in basic block size yields a performance gain (Region)
Operation Level Parallelism Results
[Figure: operations per cycle (OPC) for each benchmark and AVG.]
Operation Level Parallelism Analysis
OPC close to 2
Memory access can be a bottleneck
Wider bus and super-word parallelism could help
Overall Performance Change Results
[Figure: overall speedup for each benchmark and AVG.]
Overall Performance Change Analysis
TM1300 exhibits a significant performance gain over the single-issue processor
5x speedup on average; 10x best-case speedup
What type of memory system?
Cache: size, # sets, block size.
On-chip main memory: amount, type, banking, network to PEs.
Off-chip main memory: type, organization.
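The cache parameters listed above (size, number of sets, block size) determine how an address is carved into tag, set index, and block offset. A sketch of that decomposition, assuming power-of-two sizes; the example address is arbitrary:

```python
# Sketch: splitting an address into tag | set index | block offset
# for a cache with n_sets sets and block_bytes-byte blocks.

def split_address(addr, n_sets, block_bytes):
    offset = addr % block_bytes                 # byte within the block
    index = (addr // block_bytes) % n_sets      # which set
    tag = addr // (block_bytes * n_sets)        # remaining high bits
    return tag, index, offset

# A 32 KB direct-mapped cache with 64-byte lines has 512 sets.
print(split_address(0x12345678, n_sets=512, block_bytes=64))
```

Recombining tag, index, and offset reproduces the original address, which is a handy sanity check on any cache model.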
Memory system optimizations
Strictly software: effectively using the cache and partitioned memory.
Hardware + software: scratch-pad memories; custom memory hierarchies.
Taxonomy of memory optimizations (Wolf/Kandemir)
Data vs. code.
Array/buffer vs. non-array.
Cache/scratch pad vs. main memory.
Code size vs. data size.
Program vs. process.
Languages.
Software performance analysis
Worst-case execution time (WCET) analysis (Li/Malik): find the longest path through the CDFG; can use annotations of branch probabilities; can be mapped onto cache lines; difficult in practice: must analyze optimized code.
Trace-driven analysis: well understood; requires code and input vectors.
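The longest-path step of WCET analysis can be sketched as a longest source-to-sink path over a DAG of basic blocks with per-block cycle costs. Real analyses layer cache and branch-probability annotations on top; the graph below is a made-up diamond CFG:

```python
# Sketch: WCET as the most expensive entry-to-sink path through a
# CDFG treated as a DAG of basic blocks with fixed cycle costs.
import functools

def wcet(cost, succs, entry, sink):
    """Longest path (in cycles) from entry to sink in a DAG."""
    @functools.lru_cache(maxsize=None)
    def longest(b):
        if b == sink:
            return cost[b]
        return cost[b] + max(longest(s) for s in succs[b])
    return longest(entry)

# Diamond CFG: entry -> then/else -> sink; the 'then' side dominates.
cost = {'entry': 2, 'then': 10, 'else': 3, 'exit': 1}
succs = {'entry': ['then', 'else'], 'then': ['exit'], 'else': ['exit']}
print(wcet(cost, succs, 'entry', 'exit'))  # -> 13
```

The answer is a bound, not a measurement: the longest structural path may be infeasible in practice, which is exactly why branch annotations help tighten it.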
Software energy/power analysis
Analytical models of cache (Su/Despain, Kamble/Ghose, etc.): decoding, memory core, I/O path, etc.
System-level models (Li/Henkel).
Power simulators (Vijaykrishnan et al., Brooks et al.).
Power-optimizing transformations
Kandemir et al.:
 Most energy is consumed by the memory system, not the CPU core.
 Performance-oriented optimizations reduce memory system energy but increase datapath energy consumption.
 Larger caches increase cache energy consumption but reduce overall memory system energy.
Scratch pad memories
Explicitly managed local memory.
Panda et al. used a static management scheme: data structures assigned to off-chip memory or scratch pad at compile time; put scalars in the scratch pad, arrays in main memory.
May want to manage the scratch pad at run time.
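A static assignment in this spirit can be sketched as a compile-time greedy pass: rank variables by accesses per byte and fill the scratch pad with the densest ones. This is one plausible policy, not Panda et al.'s exact scheme, and the variable data is hypothetical:

```python
# Sketch: static scratch-pad assignment. Variables with the highest
# access count per byte go to the scratch pad until it is full;
# everything else goes to main memory.

def assign_scratch_pad(variables, pad_bytes):
    """variables: list of (name, size_bytes, access_count).
    Returns (scratch_pad_names, main_memory_names)."""
    ranked = sorted(variables, key=lambda v: v[2] / v[1], reverse=True)
    pad, main, free = [], [], pad_bytes
    for name, size, _ in ranked:
        if size <= free:
            pad.append(name)
            free -= size
        else:
            main.append(name)
    return pad, main

vars_ = [('scalar_i', 4, 1000),    # hot loop counter
         ('coeffs', 64, 5000),     # small, heavily reused table
         ('frame', 65536, 8000)]   # large array: stays in main memory
print(assign_scratch_pad(vars_, pad_bytes=1024))
```

Note how the access-per-byte ranking naturally reproduces the slide's rule of thumb: hot scalars land in the scratch pad, large arrays stay in main memory even when their total access count is high.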
Reconfigurable caches
Use the compiler to determine the best cache configuration for various program regions.
Must be able to quickly reconfigure the cache.
Must be able to identify where program behavior changes.
Software methods for cache placement
McFarling analyzed inter-function dependencies.
Tomiyama and Yasuura used ILP.
Li and Wolf used a process-level model.
Kirovski et al. use profiling information plus a graph model.
Dwyer/Fernando use bit vectors to construct bounds in instruction caches.
Parmeswaran and Henkel use heuristics.
Addressing optimizations
Addressing can be expensive: 55% of DSP56000 instructions performed addressing operations in MediaBench.
Utilize specialized addressing registers, pre/post-increment/decrement, etc.
Place variables in the proper order in memory so that simpler operations can be used to calculate the next address from the previous address.
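The variable-placement idea can be sketched by counting how often an access cannot reuse the address register: if consecutive accesses touch adjacent words, a post-increment/decrement mode updates the address for free, while any other stride needs an explicit address computation. The layouts below are hypothetical:

```python
# Sketch: cost of a memory layout under +-1 auto-increment addressing.
# An access whose address is not within 1 of the previous one needs an
# explicit address load/add.

def explicit_address_ops(access_seq, layout):
    """access_seq: variable names in access order; layout: name -> word
    address. Returns the number of explicit address computations."""
    cost, prev = 0, None
    for name in access_seq:
        addr = layout[name]
        if prev is None or abs(addr - prev) > 1:
            cost += 1        # cannot use post-increment/decrement
        prev = addr
    return cost

seq = ['a', 'b', 'c', 'a', 'b', 'c']
bad = {'a': 0, 'b': 10, 'c': 5}    # arbitrary placement
good = {'a': 0, 'b': 1, 'c': 2}    # ordered to match the access pattern
print(explicit_address_ops(seq, bad), explicit_address_ops(seq, good))
```

Ordering the variables to match the access pattern turns most of the address arithmetic into free auto-increment updates, which is precisely the optimization the slide describes.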
Hardware methods for cache optimization
Kirk and Strosnider divided the cache into sections and allocated timing-critical code to its own section.