TRANSCRIPT

1
Low Power Compilation Techniques on VLIW Architectures
Ph.D. Thesis
Gilles POKAM*
July 15, 2004
*Thesis funded by a CIFRE grant from STMicroelectronics
3
Motivation: root causes of performance increase
higher clock frequency: growth rate of ~30% every two years makes programs run faster
higher integration density: process scaling following Moore's law grows the architecture complexity
power consumption is quickly becoming a limiting factor
4
Illustration of power density growth for general purpose systems
[Figure: power density (W/cm^2), log scale from 1 to 10000, versus year (1970-2010) for Intel processors from the 4004/8008 through the 8080/8085, 8086, 286/386, 486, Pentium and P6; the curve passes the "hot plate" level and heads toward "nuclear reactor" territory around 2004.]
5
Power as a design cost constraint in embedded systems
embedded systems examples: PDAs, cell phones, set-top boxes, etc.
key points affecting design cost include:
  average energy (battery autonomy)
  heat dissipation (packaging cost)
  peak power (component reliability)
in this thesis we are concerned with total power consumption
6
Agenda
  Motivation
  Thesis objectives
  Program analysis
  Power consumption
  ILP compilation analysis
  Adaptive cache strategy
  Adaptive processor data-path
  Conclusions
7
The goals of this thesis
to understand the energy issues involved when compiling for performance on VLIW architectures
to come up with hardware/software solutions that improve energy efficiency
8
Why VLIW architectures?
popular in embedded systems:
  Philips TriMedia processor
  Texas Instruments TMS320C62xx
  HP/STMicroelectronics Lx processor
provide a power/performance alternative to general purpose systems
statically scheduled processors: the compiler is responsible for extracting instruction level parallelism (ILP)
9
Research methodology
our analysis standpoint lies in the compiler: we therefore consider program analysis as a basis for exploring energy reduction techniques
power is also a property of the underlying micro-architecture: we also consider matching the hardware to the software to reduce energy consumption
10
Thesis contributions
1. Program analysis: a methodology for characterizing the dynamic behavior of programs at static time
2. VLIW energy issues: a heuristic for comprehending the energy issues involved when compiling for ILP
3. Hardware/software matching: adaptive compilation schemes targeting
   1. the cache subsystem
   2. the processor data-path
11
Thesis experimental environment
Lx VLIW processor
  4-issue width
  64 GPR, 8 CBR
  4 ALUs, 2 MULs, 1 LSU, 1 BU
  32KB 4-way data cache, 32B data cache block size
  32KB 1-way instruction cache, 64B instruction cache line size
  power model provided by STMicroelectronics
Benchmarks
  MiBench suite, e.g. fft, gsm, susan ...
  MediaBench suite, e.g. mpeg, epic ...
  PowerStone suite, e.g. summin, whetstone, v42bis ...
13
Why do we need to analyze programs?
knowledge of the dynamic behavior of a program is essential to determine which program regions may benefit most from an optimization
programs tend to execute as a series of phases, each phase having a varying dynamic behavior [Sherwood and Calder, 1999]
a phase can be viewed as a program path which occurs repeatedly
exposing the most frequently executed program paths, i.e. hot paths, to the compiler may help discriminate among power/performance optimizations
14
Our approach for program path analysis
whole-program level instrumentation ([Larus, PLDI 2000]) with a main focus on basic block regions
a signature to differentiate among dynamic instances of the same region
program paths processed with a suffix array to detect all occurrences of repeated sub-paths
heuristics to select hot paths among the sub-paths that appear repeatedly in the trace
15
Approach overview: detecting occurrences of repeated sub-paths
dynamic signature
suffix array
suffix sorting algorithm based on KMR to detect all occurrences of repeated sub-paths [Karp, Miller and Rosenberg, 1972]
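The suffix-array step above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: it sorts suffixes naively instead of using the KMR algorithm, and reports every sub-path of basic-block ids that occurs at least twice.

```python
def repeated_subpaths(trace, min_len=2):
    """Find sub-paths (tuples of basic-block ids) that occur at least
    twice in the trace, via a naive suffix array plus an LCP pass.
    (The thesis uses KMR suffix sorting; plain sorting is enough here.)"""
    n = len(trace)
    sa = sorted(range(n), key=lambda i: trace[i:])  # suffix array
    repeats = set()
    for a, b in zip(sa, sa[1:]):  # neighbouring suffixes in sorted order
        # longest common prefix of the two adjacent suffixes
        lcp = 0
        while a + lcp < n and b + lcp < n and trace[a + lcp] == trace[b + lcp]:
            lcp += 1
        for k in range(min_len, lcp + 1):  # every repeated prefix length
            repeats.add(tuple(trace[a:a + k]))
    return repeats

# toy trace of basic-block ids: the sub-path (1, 2, 3) repeats
trace = [1, 2, 3, 4, 1, 2, 3, 5]
print(sorted(repeated_subpaths(trace)))  # [(1, 2), (1, 2, 3), (2, 3)]
```

Any shared prefix between two adjacent suffixes in the sorted order is, by construction, a sub-path occurring at least twice, which is why one linear pass over the suffix array suffices.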
16
Hot path selection
not all repeated sub-paths are of interest:
  local coverage: captures the local behavior of a region
  global coverage: captures the weight of a region in the whole program
  reuse distance: average distance between consecutive accesses to a region
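Two of these selection metrics can be computed directly from a basic-block trace. The sketch below uses illustrative definitions (global coverage as the fraction of executed instructions falling inside occurrences of the path, reuse distance as the mean gap between consecutive occurrence starts); the thesis also uses a local-coverage metric not modeled here.

```python
def path_metrics(trace, path, instr_count):
    """Global coverage and reuse distance for one candidate hot path.
    trace: executed basic-block ids; instr_count: instructions per block.
    (Illustrative definitions, not the thesis' exact formulas.)"""
    k = len(path)
    starts = [i for i in range(len(trace) - k + 1)
              if tuple(trace[i:i + k]) == tuple(path)]
    total = sum(instr_count[b] for b in trace)
    covered = sum(instr_count[b] for s in starts for b in trace[s:s + k])
    gaps = [b - a for a, b in zip(starts, starts[1:])]
    reuse = sum(gaps) / len(gaps) if gaps else 0.0
    return {"global_coverage": covered / total, "reuse_distance": reuse}

trace = [1, 2, 3, 4, 1, 2, 3, 5]
instr = {1: 10, 2: 10, 3: 10, 4: 5, 5: 5}
print(path_metrics(trace, (1, 2, 3), instr))
```

A path with high global coverage and short reuse distance is a strong hot-path candidate: it dominates the instruction stream and comes back soon, so a per-region optimization pays off repeatedly.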
17
Results summary

Bench     | Percentage of | Local coverage  | Glo. coverage   | Dist. reuse
          | hot paths     | (% exec instr.) | (% exec instr.) | (# of BB)
----------+---------------+-----------------+-----------------+------------
dijkstra  | 2.81          | 0.09            | 47              | 1.74
adpcm     | 5.88          | < 0.005         | 90              | 0.00
blowfish  | 27.01         | 0.06            | 24              | 85.00
fft       | 11.7          | < 0.005         | 7               | 4.21
sha       | 20.0          | 0.06            | 72              | 0.75
bmath     | 15.22         | 0.05            | 37              | 19.21
patricia  | 5.85          | 0.15            | 65              | 24.84
19
Back to basics ...

Power = 1/2 * C_L * V_DD^2 * a * f  +  V_DD * I_leakage
        (dynamic power)               (static power)

current technology: ~90% dynamic, ~10% static
future technology trend [SIA, 1999]: ~50% dynamic, ~50% static
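The two terms of the power equation can be evaluated numerically. A minimal sketch, with purely illustrative component values (load capacitance, supply voltage, activity factor, frequency, and leakage current are assumptions, not figures from the thesis):

```python
def power(c_load, v_dd, activity, freq, i_leak):
    """Total power = dynamic switching power + static leakage power:
       P = 1/2 * C_L * Vdd^2 * a * f  +  Vdd * I_leak"""
    dynamic = 0.5 * c_load * v_dd ** 2 * activity * freq
    static = v_dd * i_leak
    return dynamic, static

# assumed, illustrative values (not taken from the thesis)
dyn, sta = power(c_load=1e-9, v_dd=1.2, activity=0.3, freq=400e6, i_leak=5e-3)
print(f"dynamic {dyn:.4f} W, static {sta:.4f} W")
```

Note how V_dd appears squared in the dynamic term and linearly in the static term: this is why supply-voltage scaling, mentioned on the next slide, attacks both components at once.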
20
Software opportunities for power reduction

dynamic power (1/2 * C_L * V_dd^2 * a * f), common techniques:
  clock-gating for activity reduction
  power supply voltage scaling
  frequency scaling

static power (V_dd * I_leak), common techniques:
  power supply voltage scaling
22
Problem summary
we want to understand under which conditions compiling for ILP may degrade energy
the main motivation comes from the relation between power growth and the ILP compiler: power grows with architecture complexity, while the VLIW compiler raises IPC
for the rest of this study, assume the micro-architecture is fixed and cannot be modified
23
Metric used
energy and performance must be considered jointly [Horowitz] to balance program slowdown against energy reduction
performance-to-energy ratio (PTE):

  PTE = performance / energy   (performance measured as IPC for a fixed operation count N)

  Energy = sum over basic blocks B of Cycle_B * E_B

Goals
compare two instances of the same program at the software level
lay emphasis on the range of performance (IPC) values that may degrade energy: for a given ILP transformation, if the energy growth outweighs the obtained performance improvement, the resulting PTE is degraded
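The PTE comparison between two versions of the same program can be made concrete. A minimal sketch, assuming the simplified reading of the metric above (performance taken as IPC for a fixed operation count); the numbers are illustrative, not measurements:

```python
def pte(ipc, energy):
    """Performance-to-energy ratio: performance (taken as IPC here,
    an illustrative simplification) divided by energy."""
    return ipc / energy

base = pte(ipc=1.5, energy=10.0)   # original region
opt = pte(ipc=1.65, energy=12.5)   # after an ILP transform: +10% IPC, +25% energy
print(opt > base)                  # prints False: the transform degraded PTE
```

This is exactly the degradation case flagged on the slide: a 10% IPC gain bought with a 25% energy increase lowers the ratio, so the ILP transformation should be rejected.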
24
Energy Model
the execution of a bundle dissipates an energy EPB that decomposes into:
  an energy base cost
  an energy due to the execution of the bundle (growing with IPC and the per-operation energy E_op)
  an energy due to D-cache misses
  an energy due to I-cache misses
consider loop-intensive kernels ...
25
We consider the hyperblock transformation
What is a hyperblock?
  construct a predicated basic block out of a region of basic blocks
  correct the effect of eliminating branch instructions by adding compensation code
Why the hyperblock?
  most optimizations do not generate extra work, so optimizing for performance = optimizing for power
  the hyperblock increases the instruction count: how does this affect energy?
[Figure: a hammock region R, ending in a branch, is converted into a hyperblock H.]
26
Tradeoff analysis
transformation heuristic: bound the impact due to the added instructions and the influence of IPC_H relative to IPC_R on PTE
a quantity c compares the energy of the hammock region R against that of the hyperblock H; it is computed from the operation counts N_R and N_H, the execution frequencies f_R and f_H, the number m of basic blocks in R, the bundle counts n_R and n_H, and the per-operation energy E_op
  c < 0: extra work due to compensation code dominates
  c = 0: no degradation, no benefit
  c > 0: optimal configuration
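The sign test on c can be illustrated with a deliberately simplified energy model: charge each region its execution frequency times its operation count times a per-operation energy, and compare. This is a sketch of the idea only; the thesis model also includes per-bundle and cache-miss terms, and `region_energy` is a hypothetical helper, not the thesis heuristic.

```python
def region_energy(freq, n_ops, e_op):
    """Simplified region energy: frequency x operation count x
    per-operation energy (illustrative; per-bundle and cache-miss
    terms of the full model are omitted)."""
    return freq * n_ops * e_op

e_op = 1.0
e_R = region_energy(freq=100, n_ops=20, e_op=e_op)  # hammock region R
e_H = region_energy(freq=100, n_ops=26, e_op=e_op)  # hyperblock H, with compensation code
c = e_R - e_H
print("hyperblock pays off" if c > 0 else "compensation code dominates")
```

Here the hyperblock's extra instructions are pure cost (c < 0); in practice the transformation can still win when branch elimination raises IPC enough to shorten execution, which is exactly the tradeoff the heuristic captures.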
27
Conclusions
the heuristic shows a 17% improvement on a small subset of the PowerStone benchmarks
improvement on all benchmarks is restricted due to:
  available ILP: for a given IPC value, the ILP transformation must result in a much higher IPC (e.g. the case c < 0)
  machine overhead: a small IPC improvement has no impact on energy whenever the machine overhead dominates (e.g. c <= 0)
suggested research directions:
  better use the available ILP via knowledge of phase execution behavior (hot program paths)
  better manage the machine overhead by matching the architecture to the requirements of a program region
29
Why the cache?
a highly power consuming component (dynamic and static):
  typically 80% of the total transistor count
  occupies about 50% of the total chip area
usually appears with a monolithic configuration in embedded systems (per-application configuration)
varying program phase behavior suggests that no single best cache size exists for a given application
matching the cache configuration to the program behavior on a per-phase basis reduces the number of active and passive transistors, and hence both dynamic and static power
30
Two major proposals
Albonesi [MICRO'99]: selective cache ways
  disable/enable cache ways (e.g. a 32K 4-way cache downsized to 16K 2-way by disconnecting two ways)
  problem: disabling cache ways causes loss of data; impossible to recover the previous cache cell state!
Zhang et al. [ISCA'03]: way-concatenation
  reduce the cache associativity while still maintaining the full cache capacity (e.g. a 32K 4-way cache reconfigured as 32K 2-way by concatenating ways pairwise)
  problem: data coherency problems across different cache configurations!
31
Program region analysis
program regions are sensitive to cache size and associativity
key idea: vary the associativity and the size according to the characteristics of program regions
[Figure: per-region cache configurations for summin (MiBench), e.g. Config 0: 32K 4-way / 32K 2-way / 16K 2-way; Config 1: 32K 1-way / 16K 1-way / 8K 1-way; Config 2: 32K 4-way / 32K 2-way; Config 3: 16K 2-way.]
32
Solution for varying the cache size
how to keep the data? unaccessed cache ways are put in a low power mode (drowsy mode)
drowsy mode [Flautner ISCA'02] scales down Vdd to a level that preserves the memory cell state
advantage: static power is reduced as a side effect of scaling down Vdd
disadvantage: a 1 cycle delay to wake up a drowsy cache way!
33
Solution for varying the degree of associativity
maintain data coherency via cache line invalidation
the tag array is kept active to monitor write accesses
the cache controller invalidates cache lines holding an old copy on a write access
we save dynamic energy because lower-associativity caches access fewer memory cells than higher-associativity ones (reduction of the switching activity "a")
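Why invalidation is needed at all can be seen from the set-index arithmetic: changing the size or associativity changes the number of sets, so the same line can live at two different set indices across configurations, leaving a stale copy behind. A minimal sketch, assuming the 32-byte data-cache lines of the Lx setup (the address is an arbitrary example):

```python
LINE = 32  # bytes per data-cache line, as in the Lx configuration

def set_index(addr, size_bytes, assoc):
    """Set index of a byte address in a (size, associativity) cache config."""
    n_sets = size_bytes // (LINE * assoc)
    return (addr // LINE) % n_sets

addr = 0x2A20
i_a = set_index(addr, 32 * 1024, 4)  # 32K 4-way: 256 sets
i_b = set_index(addr, 32 * 1024, 2)  # 32K 2-way: 512 sets
print(i_a, i_b, i_a == i_b)          # different sets for the same address
```

Because the address indexes set 81 in one configuration and set 337 in the other, a write performed under one configuration would leave an outdated copy visible under the other; keeping the tag array active and invalidating on writes closes that hole.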
34
Results summary
three cache designs are compared:
  1. no adaptive cache scheme
  2. adaptation on a per-application basis
  3. adaptation on a per-phase basis (our scheme)
6 out of 8 applications are sensitive to cache size and associativity, resulting in a dynamic power reduction of up to 12%
static energy is reduced drastically, on average 80% across all benchmarks
performance can suffer from the one cycle wake-up delay: two applications show ~30% degradation, of which 65% is due to the one cycle delay needed to wake up a drowsy cache way
a better cache way allocation policy can improve this result
36
Motivation
32-bit embedded processors are becoming popular
confluence of integer scalar programs and multimedia applications on modern embedded processors
multimedia applications typically operate on 8-bit (e.g. video) or 16-bit (e.g. audio) data
  typically 50% of instructions in MediaBench [Brooks et al., HPCA'99]
detecting the occurrence of these narrow-width operands on a per-region basis may allow matching the processor data-path width to the bit-width size of a program region
37
Techniques to detect narrow-width operands
Dynamic approach
  detection on a cycle-by-cycle basis by means of hardware (e.g. zero detection logic)
  clock-gate the non-significant bytes to save energy
  problem: efficient for general purpose systems, but the required hardware cost is often not affordable for embedded systems
  related work includes Brooks et al., HPCA'99 and Canal et al., MICRO'00
Compiler approach
  use static data flow analysis to compute ranges of bit-width values for program variables
  re-encode program variables with a smaller bit-width size to save energy
  problem: static analysis limits the opportunity for detecting more narrow-width operands; re-encoding must preserve program correctness; too conservative!
  related work includes Stephenson et al., PLDI 2000
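The conservatism of the static compiler approach is easy to see in a toy range analysis: bit-widths are propagated from worst-case input ranges, so any value that *could* be wide forces a wide encoding even if it rarely is at runtime. A minimal sketch of such a range propagation (illustrative helpers, not the analysis of Stephenson et al.):

```python
def bits_needed(lo, hi):
    """Bits required to encode an unsigned range [lo, hi] (conservative;
    lo is kept for symmetry, only the upper bound matters here)."""
    return max(hi.bit_length(), 1)

def add_range(a, b):
    """Worst-case range of x + y given the ranges of x and y."""
    return (a[0] + b[0], a[1] + b[1])

# pixel values are 8-bit; summing two pixels needs 9 bits in the worst case
pix = (0, 255)
s = add_range(pix, pix)   # (0, 510)
print(bits_needed(*s))    # prints 9
```

The analysis must budget 9 bits for the sum even if, dynamically, most pixel sums fit in 8, which is precisely the opportunity the speculative scheme on the following slides recovers.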
38
Program region analysis: adpcm (BB granularity)
the occurrence of dynamic narrow-width operands at the basic block level can be high
key idea: adapt the underlying processor data-path width to the dynamic bit-width size of the region
39
Our approach
avoid relying on hardware support to detect the occurrences of narrow-width operands (dynamic approach)
avoid relying on static data flow analysis to discover bit-width ranges (compiler approach; too conservative!)
instead: a speculative narrow-width execution mode
  take advantage of runtime information to expose dynamic narrow-width operands to the compiler
  use the compiler to decide when to switch from normal to narrow-width mode and vice versa (reconfiguration instructions)
40
Speculative narrow-width execution: micro-architecture
recovering scheme
  simple comparison logic at the execute stage
  upon a miss, the pipeline is flushed and the instruction is replayed with the correct mode
  the recovery scheme may impact both performance and energy
static energy saving
  adaptive register file width that can be viewed as an 8/16/32-bit register file
  unused register file slices are put in a low-power mode (drowsy mode) to reduce static energy
dynamic energy saving
  data-path clock-gating (pipeline latches, ALU) when a narrow execution mode is encountered
[Figure: register file split into 8-bit and 16-bit slices driven by a slice-enable signal; bypass and write-back paths operate in 8/16/32-bit mode.]
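The speculate/compare/replay loop can be modeled in a few lines. This is an illustrative behavioral model, not the Lx pipeline: the 5-cycle penalty is an assumed figure, and a "miss" is detected exactly when the full-width result does not fit the current narrow slice.

```python
def execute(a, b, mode_bits, full_bits=32, penalty=5):
    """Speculatively add in a narrow mode; if the full-width result does
    not fit in mode_bits, 'flush' the pipeline and replay in full width.
    Returns (result, cycles). Behavioral sketch, not the real hardware."""
    full = (a + b) & ((1 << full_bits) - 1)
    narrow = full & ((1 << mode_bits) - 1)
    if narrow == full:
        return full, 1                 # speculation hit: single cycle
    return full, 1 + penalty + 1       # miss: flush penalty, then replay

print(execute(100, 27, mode_bits=8))   # fits in 8 bits -> (127, 1)
print(execute(200, 100, mode_bits=8))  # 300 needs 9 bits -> (300, 7)
```

The model makes the results slide concrete: the cost of the scheme is the product of the mis-speculation rate and the flush penalty, which is why a 25-cycle penalty with only 60% narrow-width availability hurts so much more than 5 cycles at 80%.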
41
Speculative narrow-width execution: compiler support
regions are rarely composed of narrow-width operands only ...
address instructions (AI) usually require a larger bit-width; split each AI into
  an address calculation
  a memory access via an accumulator register
schedule instructions within a region such that those having an operand with a 32-bit width are moved around
insert reconfiguration instructions at each frontier of a region
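The last step, placing reconfiguration instructions at region frontiers, amounts to emitting a mode switch whenever the preferred data-path width changes between consecutive regions. A minimal sketch; `reconfig` is a hypothetical pseudo-instruction name, not the Lx encoding:

```python
def insert_reconfig(regions):
    """Given (region_name, preferred_width) pairs in schedule order, emit
    a hypothetical 'reconfig <width>' pseudo-instruction at each frontier
    where the preferred data-path width changes."""
    out, current = [], None
    for name, width in regions:
        if width != current:                # frontier: width changes here
            out.append(f"reconfig {width}")
            current = width
        out.append(name)
    return out

print(insert_reconfig([("R1", 8), ("R2", 8), ("R3", 32), ("R4", 16)]))
```

Consecutive regions sharing a width cost nothing extra (R1 and R2 above share one `reconfig 8`), so the compiler has an incentive to schedule same-width regions next to each other.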
42
Results summary
the impact of the recovery scheme varies with the mis-speculation penalty and the availability of narrow-width operands:
  with a 5 cycle penalty and 80% narrow-width availability, programs show no performance degradation
  with a 25 cycle penalty and 60% narrow-width availability, the IPC degradation reaches 30%
overall, on the 13 applications from PowerStone, the data-path dynamic energy is reduced by 17% on average
we achieve a 22% reduction of the register file static energy
44
Conclusions
power consumption is a matter of both software and hardware:
  software, because program execution causes switching transitions (dynamic power)
  hardware, because power consumption grows with architecture complexity
hardware/software techniques must be used jointly to provide an effective basis for reducing power consumption
this thesis has provided arguments in favor of a profile-driven, compiler-architecture symbiosis approach to reduce power consumption by:
  detecting the occurrences of program phases/regions
  discriminating the optimizations that best benefit a phase/region
  adapting the micro-architecture w.r.t. the behavior of a phase/region
45
Future work
analogy between ILP and DLP: investigate the energy issues involved with SIMD compilation
  need for a SIMD energy model
  measure the impact of overhead instructions (pack/unpack)
catching different program behaviors with a hot path signature will allow us to study the interplay of different reconfiguration techniques to save energy:
  energy impact of SIMD compilation with an adaptive i-cache
  effectiveness of SIMD compilation at exploiting narrow-width operands (speculative vectorization techniques?)