Multimedia Characteristics and Optimizations
Marilyn Wolf, Dept. of EE, Princeton University
© 2004 Marilyn Wolf
Outline
Fritts: compiler studies.
Lv: compiler studies.
Memory system optimizations.
Basic Characteristics
Comparison of operation frequencies with SPEC: (ALU, mem, branch, shift, FP, mult) => (4, 2, 1, 1, 1, 1)
 Lower frequency of memory and floating-point operations
 More arithmetic operations
 Larger variation in memory usage
Basic block statistics
 Average of 5.5 operations per basic block
 Need global scheduling techniques to extract ILP
Static branch prediction
 Average of 89.5% static branch prediction on training input
 Average of 85.9% static branch prediction on evaluation input
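The static prediction numbers above come from a simple scheme: fix each branch's prediction to its majority direction on a training input, then score that fixed prediction on a different evaluation input. A minimal sketch of that measurement; the traces of (branch PC, taken) pairs and the helper names are hypothetical:

```python
# Sketch: profile-based static branch prediction. Each branch gets one
# fixed prediction (its majority direction on the training trace),
# which is then scored on an evaluation trace. Traces are made up.
from collections import defaultdict

def train_static_predictor(trace):
    """Map each branch PC to its most frequent direction in the trace."""
    taken = defaultdict(int)
    total = defaultdict(int)
    for pc, t in trace:
        total[pc] += 1
        taken[pc] += t
    return {pc: taken[pc] * 2 >= total[pc] for pc in total}

def accuracy(predictor, trace, default=True):
    """Fraction of branches in the trace the fixed predictions get right."""
    hits = sum(predictor.get(pc, default) == bool(t) for pc, t in trace)
    return hits / len(trace)

train = [(0x100, 1)] * 9 + [(0x100, 0)] + [(0x200, 0)] * 4 + [(0x200, 1)]
evald = [(0x100, 1)] * 8 + [(0x100, 0)] * 2 + [(0x200, 0)] * 5
pred = train_static_predictor(train)
print(accuracy(pred, train), accuracy(pred, evald))
```

As in the slide's numbers, accuracy on the evaluation input is generally at or below the training-input accuracy, since the fixed direction was fitted to the training run.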
Data types and sizes
 Nearly 70% of all instructions require only 8- or 16-bit data types
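Because most operands fit in 8 or 16 bits, several narrow operations can share one ALU word (the subword parallelism noted later in the conclusions). A small sketch of the idea on a 32-bit word, using the classic carry-suppression trick; the helper names are ours and the code is illustrative, not any particular ISA's packed-add instruction:

```python
# Sketch: subword (SIMD-within-a-register) addition of four unsigned
# 8-bit lanes packed into one 32-bit word.

def pack4(lanes):
    """Pack four 8-bit values into a 32-bit word (lane 0 = low byte)."""
    a, b, c, d = lanes
    return a | (b << 8) | (c << 16) | (d << 24)

def unpack4(word):
    return [(word >> (8 * i)) & 0xFF for i in range(4)]

def add4_u8(x, y):
    """Add four 8-bit lanes at once, each wrapping modulo 256.

    Carry-suppression trick: add the low 7 bits of every lane, then
    fold the top bits back in with XOR so no carry crosses a lane
    boundary.
    """
    low = (x & 0x7F7F7F7F) + (y & 0x7F7F7F7F)
    return low ^ ((x ^ y) & 0x80808080)

x = pack4([10, 200, 100, 255])
y = pack4([20, 100, 50, 1])
print(unpack4(add4_u8(x, y)))  # -> [30, 44, 150, 0]
```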
Breakdown of Data Types by Media Type
[Figure: ratio of data types (%) per media type; categories: floating-point, pointers, word, halfword, byte.]
Memory Statistics
Working set size cache regression: cache sizes 1 KB to 4 MB; assumed line size of 64 bytes; measured read and write miss ratios
Spatial locality cache regression: line sizes 8 to 1024 bytes; assumed cache size of 64 KB; measured read and write miss ratios
Memory results: data memory: 32 KB working set and 60.8% spatial locality (up to 128 bytes); instruction memory: 8 KB working set and 84.8% spatial locality (up to 256 bytes)
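A cache regression of this kind amounts to replaying an address trace through a parameterized cache model and recording the miss ratio at each cache-size or line-size point. A minimal direct-mapped sketch, with a synthetic trace purely for illustration:

```python
# Sketch: direct-mapped cache simulator of the kind used for the
# working-set and spatial-locality regressions above.

def miss_ratio(trace, cache_bytes, line_bytes):
    """Replay byte addresses through a direct-mapped cache; return miss ratio."""
    n_lines = cache_bytes // line_bytes
    tags = [None] * n_lines          # one resident block number per line
    misses = 0
    for addr in trace:
        block = addr // line_bytes   # line-aligned block number
        index = block % n_lines
        if tags[index] != block:     # miss: fill the line
            misses += 1
            tags[index] = block
    return misses / len(trace)

# Sequential byte scan of a 64 KB region: with 64-byte lines there is
# one miss per line, so the miss ratio is 1/64.
trace = list(range(0, 64 * 1024))
print(miss_ratio(trace, 32 * 1024, 64))
```

Sweeping `cache_bytes` at fixed `line_bytes` gives the working-set curve; sweeping `line_bytes` at fixed `cache_bytes` gives the spatial-locality curve.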
Data Spatial Locality
[Figure: degree of spatial locality (%) versus line size (16 to 1024 bytes) for video, image, graphics, audio, speech, security, and the average.]
Multimedia Looping Characteristics
Highly loop centric: nearly 95% of execution time spent within the two innermost loop levels
Large number of iterations: significant processing regularity; about 10 iterations per loop on average
Path ratio indicates intra-loop complexity: computed as the ratio of the average number of instructions executed per loop invocation to the total number of instructions in the loop; average path ratio of 78%, indicating greater control complexity than expected
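The path ratio definition above can be written out directly; the loop profile below is a made-up example:

```python
# Sketch: path ratio = (average instructions executed per loop
# invocation) / (total static instruction count of the loop body).

def path_ratio(instrs_per_invocation, loop_size):
    """instrs_per_invocation: dynamic instruction count for each
    invocation of the loop; loop_size: static size of the loop body."""
    avg = sum(instrs_per_invocation) / len(instrs_per_invocation)
    return avg / loop_size

# A 100-instruction loop body whose branches skip some code on every
# pass executes only ~78 instructions per invocation on average.
print(path_ratio([80, 75, 80, 77], 100))  # -> 0.78
```

A ratio well below 1 means invocations routinely bypass part of the body, i.e., there is real intra-loop control flow rather than one straight-line path.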
Average Iterations per Loop and Path Ratio
[Figure: two panels per media type (video, image, graphics, audio, speech, security, average): average path ratio (0 to 1) and average number of loop iterations (log scale, 1 to 1000).]
Instruction Level Parallelism
Instruction level parallelism
 Base model: single issue using classical optimizations only
 Parallel model: 8-issue
Explores only parallel scheduling performance: assumes an ideal processor model; no performance penalties from branches, cache misses, etc.
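The ideal-processor model above can be sketched as a greedy list scheduler that packs a dependence DAG of single-cycle operations into w-issue cycles, with no branch or cache-miss penalties; speedup is single-issue cycles over w-issue cycles. The DAG below is hypothetical:

```python
# Sketch: ideal-processor parallel scheduling. Operations issue as
# soon as their dependences are done, up to `width` per cycle.

def schedule_cycles(deps, n_ops, width):
    """deps: {op: set of ops it depends on}; ops are 0..n_ops-1, all
    single-cycle. Returns the cycle count for the given issue width."""
    done, cycles = set(), 0
    while len(done) < n_ops:
        ready = [op for op in range(n_ops)
                 if op not in done and deps.get(op, set()) <= done]
        done.update(ready[:width])   # issue up to `width` ready ops
        cycles += 1
    return cycles

# 8 independent ops feeding one final op:
deps = {8: set(range(8))}
base = schedule_cycles(deps, 9, 1)   # single-issue base model
wide = schedule_cycles(deps, 9, 8)   # 8-issue parallel model
print(base, wide, base / wide)       # 9 cycles vs. 2 -> 4.5x speedup
```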
[Figure: 8-issue speedup over the single-issue base model per media type (video, image, graphics, audio, speech, security, average), for classical only, classical w/ inlining, superblock, and hyperblock compilation.]
Workload Evaluation Conclusions
Operation Characteristics: more arithmetic operations; less memory and floating-point usage; large variation in memory usage; (ALU, mem, branch, shift, FP, mult) => (4, 2, 1, 1, 1, 1)
Good Static Branch Prediction: multimedia: 10-15% avg. miss ratio; general-purpose: 20-30% avg. miss ratio; similar basic block sizes (5 instrs per basic block)
Primarily Small Data Types (8 or 16 bits): nearly 70% of instructions require 16-bit or smaller data types; significant opportunity for subword parallelism or narrower datapaths
Memory: typically small data and instruction working set sizes; high data and instruction spatial locality
Loop-Centric: majority of execution time spent in the two innermost loop levels; average of 10 iterations per loop invocation; path ratio indicates greater control complexity than expected
Architecture Evaluation
Determine fundamental architecture style
Statically scheduled => Very Long Instruction Word (VLIW): allows wider issue; simple hardware => potentially higher frequencies
Dynamically scheduled => Superscalar: allows decoupled data memory accesses; effective at reducing penalties from stalls
Examine a variety of architecture parameters: fundamental architecture style; instruction fetch architecture; high-frequency effects; cache memory hierarchy
Related Work
[Lee98] "Media Architecture: General Purpose vs. Multiple Application-Specific Programmable Processors," DAC-35, 1998.
[PChang91] "Comparing Static and Dynamic Code Scheduling for Multiple-Instruction-Issue Processors," MICRO-24, 1991.
[ZWu99] "Architecture Evaluation of Multi-Cluster Wide-Issue Video Signal Processors," Ph.D. Thesis, Princeton University, 1999.
[DZucker95] "A comparison of hardware prefetching techniques for multimedia benchmarks," Technical Report CSL-TR-95-683, Stanford University, 1995.
Fundamental Architecture Evaluation
Fundamental architecture evaluation included: static vs. dynamic scheduling; issue width
Focused on non-memory-limited applications: determine the impact of datapath features independent of memory; assume memory techniques can solve the memory bottleneck
Architecture model: 8-issue processor; operation latencies targeted for 500 MHz to 1 GHz; 64 integer and floating-point registers; pipeline: 1 fetch, 2 decode, 1 write-back, and a variable number of execute stages; 32 KB direct-mapped L1 data cache with 64-byte lines; 16 KB direct-mapped L1 instruction cache with 256-byte lines; 256 KB 4-way set-associative on-chip L2 cache; 4:1 processor-to-external-bus frequency ratio
Static versus Dynamic Scheduling
[Figure: IPC by compilation method (classical, superscalar, hyperblock) for VLIW, in-order superscalar, and out-of-order superscalar, each with real and perfect caches; static versus dynamic scheduling for the various compiler methods.]
[Figure: IPC versus issue width for VLIW, in-order superscalar, and out-of-order superscalar; the result of increasing issue width for the given architecture and compiler methods.]
Instruction Fetch Architecture
[Figure: IPC difference (%) by compilation method (classical, superscalar, hyperblock) for VLIW w/ fixed-width instructions, VLIW w/ variable-width instructions, in-order superscalar, and out-of-order superscalar; aggressive versus conservative fetch methods.]
[Figure: branch prediction rate (%) versus predictor size (0 to 25 KB) for 2-bit counter, PAs(6,16), and PAs(10,8) predictors under classical, superscalar, and hyperblock compilation; comparison of dynamic branch prediction schemes.]
Experimental Configuration: Single-Issue Processor
SimpleScalar sim-outorder
Single-issue configuration
RISC
Experimental Configuration: Benchmarks
Selected from different areas of MediaBench
Additional real-world applications
Baseline benchmark characteristics
 Measured on the single-issue processor
 Execution time closely related to dynamic instruction count
[Figure: dynamic instruction count and execution cycles (log scale, 1 to 10^10) for g721_enc, g721_dec, jpeg_enc, jpeg_dec, mpeg2_enc, mpeg2_dec, region, contour, ellipse, match, and hmm.]
VLIW vs. Single Issue
Static Code Size
Dynamic Operation Count
Execution Speed
Basic Block Size
Static Code Size Results
[Figure: unified static code size for each benchmark and the average (AVG); single issue versus TM1300.]
Static Code Size Analysis
Similar static code size
On average, TM1300 requires 17% more space
Dynamic Operation Count Results
[Figure: unified dynamic operation count for each benchmark and AVG; single issue versus TM1300.]
Dynamic Operation Count Analysis
Dynamic instruction counts are similar for the two types of processors
On average, TM1300 needs 20% more operations
The execution-time impact of the ISA difference is small
Execution Speed Results
[Figure: unified execution speed for each benchmark and AVG; single issue versus TM1300.]
Execution Speed Analysis
TM1300 executes all benchmarks faster than the single-issue processor
On average, the speedup is 3.4x, resulting partly from the wide-issue capability and partly from other architecture features
Unoptimized Basic Block Size Results
[Figure: unoptimized basic block size for each benchmark and AVG; single issue versus TM1300.]
Unoptimized Basic Block Size Analysis
The TriMedia compiler produces code with larger basic blocks
On average, a basic block on TM1300 is twice as large
Exploiting Special Features
Methods: using custom instructions; loop transformations
Metrics: execution speed; memory access count; basic block size; operation-level parallelism
Execution Time Results
[Figure: unified execution speed for each benchmark and AVG; un-optimized versus optimized.]
Execution Time Analysis
1.5x average speedup
Data-transfer-intensive, floating-point-intensive, and table-lookup-intensive applications see less speedup
Memory Access Count Results
[Figure: unified memory reference count for each benchmark and AVG; un-optimized versus optimized.]
Memory Access Count Analysis
Reduced average memory access count
Memory access can be a bottleneck (MPEG)
Optimized Basic Block Size
[Figure: optimized basic block size for each benchmark and AVG; un-optimized versus optimized.]
Optimized Basic Block Size Analysis
A significant increase in basic block size yields a performance gain (Region)
Operation Level Parallelism Results
[Figure: operations per cycle (OPC) for each benchmark and AVG.]
Operation Level Parallelism Analysis
OPC close to 2
Memory access can be a bottleneck
Wider bus and super-word parallelism could help
Overall Performance Change Results
[Figure: overall speedup for each benchmark and AVG.]
Overall Performance Change Analysis
TM1300 exhibits a significant performance gain over the single-issue processor
5x speedup on average; 10x best-case speedup
What type of memory system?
Cache: size, # sets, block size.
On-chip main memory: amount, type, banking, network to PEs.
Off-chip main memory: type, organization.
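The cache parameters listed above (size, number of sets, block size) determine how an address is carved into tag, set index, and block offset. A sketch of that decomposition, assuming power-of-two sizes; the example address is arbitrary:

```python
# Sketch: splitting an address into tag | set index | block offset
# for a cache with n_sets sets and block_bytes-byte blocks.

def split_address(addr, n_sets, block_bytes):
    offset = addr % block_bytes                 # byte within the block
    index = (addr // block_bytes) % n_sets      # which set
    tag = addr // (block_bytes * n_sets)        # remaining high bits
    return tag, index, offset

# A 32 KB direct-mapped cache with 64-byte lines has 512 sets.
print(split_address(0x12345678, n_sets=512, block_bytes=64))
```

Recombining tag, index, and offset reproduces the original address, which is a handy sanity check on any cache model.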
Memory system optimizations
Strictly software: effectively using the cache and partitioned memory.
Hardware + software: scratch-pad memories; custom memory hierarchies.
Taxonomy of memory optimizations (Wolf/Kandemir)
Data vs. code.
Array/buffer vs. non-array.
Cache/scratch pad vs. main memory.
Code size vs. data size.
Program vs. process.
Languages.
Software performance analysis
Worst-case execution time (WCET) analysis (Li/Malik): find the longest path through the CDFG; can use annotations of branch probabilities; can be mapped onto cache lines; difficult in practice: must analyze optimized code.
Trace-driven analysis: well understood; requires code and input vectors.
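The longest-path step of WCET analysis can be sketched as a longest source-to-sink path over a DAG of basic blocks with per-block cycle costs. Real analyses layer cache and branch-probability annotations on top; the graph below is a made-up diamond CFG:

```python
# Sketch: WCET as the most expensive entry-to-sink path through a
# CDFG treated as a DAG of basic blocks with fixed cycle costs.
import functools

def wcet(cost, succs, entry, sink):
    """Longest path (in cycles) from entry to sink in a DAG."""
    @functools.lru_cache(maxsize=None)
    def longest(b):
        if b == sink:
            return cost[b]
        return cost[b] + max(longest(s) for s in succs[b])
    return longest(entry)

# Diamond CFG: entry -> then/else -> sink; the 'then' side dominates.
cost = {'entry': 2, 'then': 10, 'else': 3, 'exit': 1}
succs = {'entry': ['then', 'else'], 'then': ['exit'], 'else': ['exit']}
print(wcet(cost, succs, 'entry', 'exit'))  # -> 13
```

The answer is a bound, not a measurement: the longest structural path may be infeasible in practice, which is exactly why branch annotations help tighten it.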
Software energy/power analysis
Analytical models of cache (Su/Despain, Kamble/Ghose, etc.): decoding, memory core, I/O path, etc.
System-level models (Li/Henkel).
Power simulators (Vijaykrishnan et al., Brooks et al.).
Power-optimizing transformations
Kandemir et al.:
 Most energy is consumed by the memory system, not the CPU core.
 Performance-oriented optimizations reduce memory system energy but increase datapath energy consumption.
 Larger caches increase cache energy consumption but reduce overall memory system energy.
Scratch pad memories
Explicitly managed local memory.
Panda et al. used a static management scheme: data structures assigned to off-chip memory or scratch pad at compile time; put scalars in the scratch pad, arrays in main memory.
May want to manage the scratch pad at run time.
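A static assignment in this spirit can be sketched as a compile-time greedy pass: rank variables by accesses per byte and fill the scratch pad with the densest ones. This is one plausible policy, not Panda et al.'s exact scheme, and the variable data is hypothetical:

```python
# Sketch: static scratch-pad assignment. Variables with the highest
# access count per byte go to the scratch pad until it is full;
# everything else goes to main memory.

def assign_scratch_pad(variables, pad_bytes):
    """variables: list of (name, size_bytes, access_count).
    Returns (scratch_pad_names, main_memory_names)."""
    ranked = sorted(variables, key=lambda v: v[2] / v[1], reverse=True)
    pad, main, free = [], [], pad_bytes
    for name, size, _ in ranked:
        if size <= free:
            pad.append(name)
            free -= size
        else:
            main.append(name)
    return pad, main

vars_ = [('scalar_i', 4, 1000),    # hot loop counter
         ('coeffs', 64, 5000),     # small, heavily reused table
         ('frame', 65536, 8000)]   # large array: stays in main memory
print(assign_scratch_pad(vars_, pad_bytes=1024))
```

Note how the access-per-byte ranking naturally reproduces the slide's rule of thumb: hot scalars land in the scratch pad, large arrays stay in main memory even when their total access count is high.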
Reconfigurable caches
Use the compiler to determine the best cache configuration for various program regions.
Must be able to quickly reconfigure the cache.
Must be able to identify where program behavior changes.
Software methods for cache placement
McFarling analyzed inter-function dependencies.
Tomiyama and Yasuura used ILP.
Li and Wolf used a process-level model.
Kirovski et al. use profiling information plus a graph model.
Dwyer/Fernando use bit vectors to construct bounds in instruction caches.
Parmeswaran and Henkel use heuristics.
Addressing optimizations
Addressing can be expensive: 55% of DSP56000 instructions performed addressing operations in MediaBench.
Utilize specialized addressing registers, pre/post-increment/decrement, etc.
Place variables in the proper order in memory so that simpler operations can be used to calculate the next address from the previous address.
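The variable-placement idea can be sketched by counting how often an access cannot reuse the address register: if consecutive accesses touch adjacent words, a post-increment/decrement mode updates the address for free, while any other stride needs an explicit address computation. The layouts below are hypothetical:

```python
# Sketch: cost of a memory layout under +-1 auto-increment addressing.
# An access whose address is not within 1 of the previous one needs an
# explicit address load/add.

def explicit_address_ops(access_seq, layout):
    """access_seq: variable names in access order; layout: name -> word
    address. Returns the number of explicit address computations."""
    cost, prev = 0, None
    for name in access_seq:
        addr = layout[name]
        if prev is None or abs(addr - prev) > 1:
            cost += 1        # cannot use post-increment/decrement
        prev = addr
    return cost

seq = ['a', 'b', 'c', 'a', 'b', 'c']
bad = {'a': 0, 'b': 10, 'c': 5}    # arbitrary placement
good = {'a': 0, 'b': 1, 'c': 2}    # ordered to match the access pattern
print(explicit_address_ops(seq, bad), explicit_address_ops(seq, good))
```

Ordering the variables to match the access pattern turns most of the address arithmetic into free auto-increment updates, which is precisely the optimization the slide describes.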
Hardware methods for cache optimization
Kirk and Strosnider divided the cache into sections and allocated timing-critical code to its own section.