TRANSCRIPT

1
Low Power Compilation Techniques on VLIW Architectures
Ph.D. Thesis
Gilles POKAM*
July 15, 2004
*Thesis funded by a CIFRE grant from STMicroelectronics
3
Motivation: root causes of performance increase
higher clock frequency: growth rate of ~30% every two years makes programs run faster
higher integration density: process scaling following Moore's law grows the architecture complexity
power consumption is quickly becoming a limiting factor
4
Illustration of power density growth for general purpose systems
[Figure: power density (W/cm^2), log scale from 1 to 10000, versus year (1970-2010) for Intel processors from the 4004/8008 through the 8080/8085, 8086, 286/386, 486, Pentium and P6; the curve passes the "hot plate" level and heads toward "nuclear reactor" territory around 2004.]
5
Power as a design cost constraint in embedded systems
embedded systems examples: PDAs, cell phones, set-top boxes, etc.
key points affecting design cost include:
  average energy (battery autonomy)
  heat dissipation (packaging cost)
  peak power (component reliability)
in this thesis we are concerned with total power consumption
6
Agenda
  Motivation
  Thesis objectives
  Program analysis
  Power consumption
  ILP compilation analysis
  Adaptive cache strategy
  Adaptive processor data-path
  Conclusions
7
The goals of this thesis
to understand the energy issues involved when compiling for performance on VLIW architectures
to come up with hardware/software solutions that improve energy efficiency
8
Why VLIW architectures?
popular in embedded systems:
  Philips TriMedia processor
  Texas Instruments TMS320C62xx
  HP/STMicroelectronics Lx processor
provide a power/performance alternative to general purpose systems
statically scheduled processors: the compiler is responsible for extracting instruction level parallelism (ILP)
9
Research methodology
our analysis standpoint lies in the compiler: we therefore consider program analysis as a basis for exploring energy reduction techniques
power is also a property of the underlying micro-architecture: we also consider matching the hardware to the software to reduce energy consumption
10
Thesis contributions
1. Program analysis: a methodology for characterizing the dynamic behavior of programs at static time
2. VLIW energy issues: a heuristic for comprehending the energy issues involved when compiling for ILP
3. Hardware/software matching: adaptive compilation schemes targeting
   1. the cache subsystem
   2. the processor data-path
11
Thesis experimental environment
Lx VLIW processor
  4-issue width
  64 GPR, 8 CBR
  4 ALUs, 2 MULs, 1 LSU, 1 BU
  32KB 4-way data cache, 32B data cache block size
  32KB 1-way instruction cache, 64B instruction cache line size
  power model provided by STMicroelectronics
Benchmarks
  MiBench suite, e.g. fft, gsm, susan ...
  MediaBench suite, e.g. mpeg, epic ...
  PowerStone suite, e.g. summin, whetstone, v42bis ...
13
Why do we need to analyze programs?
knowledge of the dynamic behavior of a program is essential to determine which program regions may benefit most from an optimization
programs tend to execute as a series of phases, each phase having a varying dynamic behavior [Sherwood and Calder, 1999]
a phase can be viewed as a program path which occurs repeatedly
exposing the most frequently executed program paths, i.e. hot paths, to the compiler may help discriminate among power/performance optimizations
14
Our approach for program path analysis
whole-program level instrumentation ([Larus, PLDI 2000]) with a main focus on basic block regions
a signature to differentiate among dynamic instances of the same region
program paths processed with a suffix array to detect all occurrences of repeated sub-paths
heuristics to select hot paths among the sub-paths that appear repeatedly in the trace
15
Approach overview: detecting occurrences of repeated sub-paths
dynamic signature
suffix array
suffix sorting algorithm based on KMR to detect all occurrences of repeated sub-paths [Karp, Miller and Rosenberg, 1972]
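The suffix-array step above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: it sorts suffixes naively instead of using the KMR algorithm, and reports every sub-path of basic-block ids that occurs at least twice.

```python
def repeated_subpaths(trace, min_len=2):
    """Find sub-paths (tuples of basic-block ids) that occur at least
    twice in the trace, via a naive suffix array plus an LCP pass.
    (The thesis uses KMR suffix sorting; plain sorting is enough here.)"""
    n = len(trace)
    sa = sorted(range(n), key=lambda i: trace[i:])  # suffix array
    repeats = set()
    for a, b in zip(sa, sa[1:]):  # neighbouring suffixes in sorted order
        # longest common prefix of the two adjacent suffixes
        lcp = 0
        while a + lcp < n and b + lcp < n and trace[a + lcp] == trace[b + lcp]:
            lcp += 1
        for k in range(min_len, lcp + 1):  # every repeated prefix length
            repeats.add(tuple(trace[a:a + k]))
    return repeats

# toy trace of basic-block ids: the sub-path (1, 2, 3) repeats
trace = [1, 2, 3, 4, 1, 2, 3, 5]
print(sorted(repeated_subpaths(trace)))  # [(1, 2), (1, 2, 3), (2, 3)]
```

Any shared prefix between two adjacent suffixes in the sorted order is, by construction, a sub-path occurring at least twice, which is why one linear pass over the suffix array suffices.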
16
Hot path selection
not all repeated sub-paths are of interest:
  local coverage: captures the local behavior of a region
  global coverage: captures the weight of a region in the whole program
  reuse distance: average distance between consecutive accesses to a region
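Two of these selection metrics can be computed directly from a basic-block trace. The sketch below uses illustrative definitions (global coverage as the fraction of executed instructions falling inside occurrences of the path, reuse distance as the mean gap between consecutive occurrence starts); the thesis also uses a local-coverage metric not modeled here.

```python
def path_metrics(trace, path, instr_count):
    """Global coverage and reuse distance for one candidate hot path.
    trace: executed basic-block ids; instr_count: instructions per block.
    (Illustrative definitions, not the thesis' exact formulas.)"""
    k = len(path)
    starts = [i for i in range(len(trace) - k + 1)
              if tuple(trace[i:i + k]) == tuple(path)]
    total = sum(instr_count[b] for b in trace)
    covered = sum(instr_count[b] for s in starts for b in trace[s:s + k])
    gaps = [b - a for a, b in zip(starts, starts[1:])]
    reuse = sum(gaps) / len(gaps) if gaps else 0.0
    return {"global_coverage": covered / total, "reuse_distance": reuse}

trace = [1, 2, 3, 4, 1, 2, 3, 5]
instr = {1: 10, 2: 10, 3: 10, 4: 5, 5: 5}
print(path_metrics(trace, (1, 2, 3), instr))
```

A path with high global coverage and short reuse distance is a strong hot-path candidate: it dominates the instruction stream and comes back soon, so a per-region optimization pays off repeatedly.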
17
Results summary

Bench     | Percentage of | Local coverage  | Glo. coverage   | Dist. reuse
          | hot paths     | (% exec instr.) | (% exec instr.) | (# of BB)
----------+---------------+-----------------+-----------------+------------
dijkstra  | 2.81          | 0.09            | 47              | 1.74
adpcm     | 5.88          | < 0.005         | 90              | 0.00
blowfish  | 27.01         | 0.06            | 24              | 85.00
fft       | 11.7          | < 0.005         | 7               | 4.21
sha       | 20.0          | 0.06            | 72              | 0.75
bmath     | 15.22         | 0.05            | 37              | 19.21
patricia  | 5.85          | 0.15            | 65              | 24.84
19
Back to basics ...

Power = 1/2 * C_L * V_DD^2 * a * f  +  V_DD * I_leakage
        (dynamic power)               (static power)

current technology: ~90% dynamic, ~10% static
future technology trend [SIA, 1999]: ~50% dynamic, ~50% static
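The two terms of the power equation can be evaluated numerically. A minimal sketch, with purely illustrative component values (load capacitance, supply voltage, activity factor, frequency, and leakage current are assumptions, not figures from the thesis):

```python
def power(c_load, v_dd, activity, freq, i_leak):
    """Total power = dynamic switching power + static leakage power:
       P = 1/2 * C_L * Vdd^2 * a * f  +  Vdd * I_leak"""
    dynamic = 0.5 * c_load * v_dd ** 2 * activity * freq
    static = v_dd * i_leak
    return dynamic, static

# assumed, illustrative values (not taken from the thesis)
dyn, sta = power(c_load=1e-9, v_dd=1.2, activity=0.3, freq=400e6, i_leak=5e-3)
print(f"dynamic {dyn:.4f} W, static {sta:.4f} W")
```

Note how V_dd appears squared in the dynamic term and linearly in the static term: this is why supply-voltage scaling, mentioned on the next slide, attacks both components at once.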
20
Software opportunities for power reduction

dynamic power (1/2 * C_L * V_dd^2 * a * f), common techniques:
  clock-gating for activity reduction
  power supply voltage scaling
  frequency scaling

static power (V_dd * I_leak), common techniques:
  power supply voltage scaling
22
Problem summary
we want to understand under which conditions compiling for ILP may degrade energy
the main motivation comes from the relation between power growth and the ILP compiler: power grows with architecture complexity, while the VLIW compiler raises IPC
for the rest of this study, assume the micro-architecture is fixed and cannot be modified
23
Metric used
energy and performance must be considered jointly [Horowitz] to balance program slowdown against energy reduction
performance-to-energy ratio (PTE):

  PTE = performance / energy   (performance measured as IPC for a fixed operation count N)

  Energy = sum over basic blocks B of Cycle_B * E_B

Goals
compare two instances of the same program at the software level
lay emphasis on the range of performance (IPC) values that may degrade energy: for a given ILP transformation, if the energy growth outweighs the obtained performance improvement, the resulting PTE is degraded
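The PTE comparison between two versions of the same program can be made concrete. A minimal sketch, assuming the simplified reading of the metric above (performance taken as IPC for a fixed operation count); the numbers are illustrative, not measurements:

```python
def pte(ipc, energy):
    """Performance-to-energy ratio: performance (taken as IPC here,
    an illustrative simplification) divided by energy."""
    return ipc / energy

base = pte(ipc=1.5, energy=10.0)   # original region
opt = pte(ipc=1.65, energy=12.5)   # after an ILP transform: +10% IPC, +25% energy
print(opt > base)                  # prints False: the transform degraded PTE
```

This is exactly the degradation case flagged on the slide: a 10% IPC gain bought with a 25% energy increase lowers the ratio, so the ILP transformation should be rejected.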
24
Energy Model
the execution of a bundle dissipates an energy EPB that decomposes into:
  an energy base cost
  an energy due to the execution of the bundle (growing with IPC and the per-operation energy E_op)
  an energy due to D-cache misses
  an energy due to I-cache misses
consider loop-intensive kernels ...
25
We consider the hyperblock transformation
What is a hyperblock?
  construct a predicated basic block out of a region of basic blocks
  correct the effect of eliminating branch instructions by adding compensation code
Why the hyperblock?
  most optimizations do not generate extra work, so optimizing for performance = optimizing for power
  the hyperblock increases the instruction count: how does this affect energy?
[Figure: a hammock region R, ending in a branch, is converted into a hyperblock H.]
26
Tradeoff analysis
transformation heuristic: bound the impact due to the added instructions and the influence of IPC_H relative to IPC_R on PTE
a quantity c compares the energy of the hammock region R against that of the hyperblock H; it is computed from the operation counts N_R and N_H, the execution frequencies f_R and f_H, the number m of basic blocks in R, the bundle counts n_R and n_H, and the per-operation energy E_op
  c < 0: extra work due to compensation code dominates
  c = 0: no degradation, no benefit
  c > 0: optimal configuration
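The sign test on c can be illustrated with a deliberately simplified energy model: charge each region its execution frequency times its operation count times a per-operation energy, and compare. This is a sketch of the idea only; the thesis model also includes per-bundle and cache-miss terms, and `region_energy` is a hypothetical helper, not the thesis heuristic.

```python
def region_energy(freq, n_ops, e_op):
    """Simplified region energy: frequency x operation count x
    per-operation energy (illustrative; per-bundle and cache-miss
    terms of the full model are omitted)."""
    return freq * n_ops * e_op

e_op = 1.0
e_R = region_energy(freq=100, n_ops=20, e_op=e_op)  # hammock region R
e_H = region_energy(freq=100, n_ops=26, e_op=e_op)  # hyperblock H, with compensation code
c = e_R - e_H
print("hyperblock pays off" if c > 0 else "compensation code dominates")
```

Here the hyperblock's extra instructions are pure cost (c < 0); in practice the transformation can still win when branch elimination raises IPC enough to shorten execution, which is exactly the tradeoff the heuristic captures.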
27
Conclusions
the heuristic shows a 17% improvement on a small subset of the PowerStone benchmarks
improvement on all benchmarks is restricted due to:
  available ILP: for a given IPC value, the ILP transformation must result in a much higher IPC (e.g. the case c < 0)
  machine overhead: a small IPC improvement has no impact on energy whenever the machine overhead dominates (e.g. c <= 0)
suggested research directions:
  better use the available ILP via knowledge of phase execution behavior (hot program paths)
  better manage the machine overhead by matching the architecture to the requirements of a program region
29
Why the cache?
a highly power consuming component (dynamic and static):
  typically 80% of the total transistor count
  occupies about 50% of the total chip area
usually appears with a monolithic configuration in embedded systems (per-application configuration)
varying program phase behavior suggests that no single best cache size exists for a given application
matching the cache configuration to the program behavior on a per-phase basis reduces the number of active and passive transistors, and hence both dynamic and static power
30
Two major proposals
Albonesi [MICRO'99]: selective cache ways
  disable/enable cache ways (e.g. a 32K 4-way cache downsized to 16K 2-way by disconnecting two ways)
  problem: disabling cache ways causes loss of data; impossible to recover the previous cache cell state!
Zhang et al. [ISCA'03]: way-concatenation
  reduce the cache associativity while still maintaining the full cache capacity (e.g. a 32K 4-way cache reconfigured as 32K 2-way by concatenating ways pairwise)
  problem: data coherency problems across different cache configurations!
31
Program region analysis
program regions are sensitive to cache size and associativity
key idea: vary the associativity and the size according to the characteristics of program regions
[Figure: per-region cache configurations for summin (MiBench), e.g. Config 0: 32K 4-way / 32K 2-way / 16K 2-way; Config 1: 32K 1-way / 16K 1-way / 8K 1-way; Config 2: 32K 4-way / 32K 2-way; Config 3: 16K 2-way.]
32
Solution for varying the cache size
how to keep the data? unaccessed cache ways are put in a low power mode (drowsy mode)
drowsy mode [Flautner ISCA'02] scales down Vdd to a level that preserves the memory cell state
advantage: static power is reduced as a side effect of scaling down Vdd
disadvantage: a 1 cycle delay to wake up a drowsy cache way!
33
Solution for varying the degree of associativity
maintain data coherency via cache line invalidation
the tag array is kept active to monitor write accesses
the cache controller invalidates cache lines holding an old copy on a write access
we save dynamic energy because lower-associativity caches access fewer memory cells than higher-associativity ones (reduction of the switching activity "a")
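Why invalidation is needed at all can be seen from the set-index arithmetic: changing the size or associativity changes the number of sets, so the same line can live at two different set indices across configurations, leaving a stale copy behind. A minimal sketch, assuming the 32-byte data-cache lines of the Lx setup (the address is an arbitrary example):

```python
LINE = 32  # bytes per data-cache line, as in the Lx configuration

def set_index(addr, size_bytes, assoc):
    """Set index of a byte address in a (size, associativity) cache config."""
    n_sets = size_bytes // (LINE * assoc)
    return (addr // LINE) % n_sets

addr = 0x2A20
i_a = set_index(addr, 32 * 1024, 4)  # 32K 4-way: 256 sets
i_b = set_index(addr, 32 * 1024, 2)  # 32K 2-way: 512 sets
print(i_a, i_b, i_a == i_b)          # different sets for the same address
```

Because the address indexes set 81 in one configuration and set 337 in the other, a write performed under one configuration would leave an outdated copy visible under the other; keeping the tag array active and invalidating on writes closes that hole.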
34
Results summary
three cache designs are compared:
  1. no adaptive cache scheme
  2. adaptation on a per-application basis
  3. adaptation on a per-phase basis (our scheme)
6 out of 8 applications are sensitive to cache size and associativity, resulting in a dynamic power reduction of up to 12%
static energy is reduced drastically, on average 80% across all benchmarks
performance can suffer from the one cycle wake-up delay: two applications show ~30% degradation, of which 65% is due to the one cycle delay needed to wake up a drowsy cache way
a better cache way allocation policy can improve this result
36
Motivation
32-bit embedded processors are becoming popular
confluence of integer scalar programs and multimedia applications on modern embedded processors
multimedia applications typically operate on 8-bit (e.g. video) or 16-bit (e.g. audio) data
  typically 50% of instructions in MediaBench [Brooks et al., HPCA'99]
detecting the occurrence of these narrow-width operands on a per-region basis may allow matching the processor data-path width to the bit-width size of a program region
37
Techniques to detect narrow-width operands
Dynamic approach
  detection on a cycle-by-cycle basis by means of hardware (e.g. zero detection logic)
  clock-gate the non-significant bytes to save energy
  problem: efficient for general purpose systems, but the required hardware cost is often not affordable for embedded systems
  related work includes Brooks et al., HPCA'99 and Canal et al., MICRO'00
Compiler approach
  use static data flow analysis to compute ranges of bit-width values for program variables
  re-encode program variables with a smaller bit-width size to save energy
  problem: static analysis limits the opportunity for detecting more narrow-width operands; re-encoding must preserve program correctness; too conservative!
  related work includes Stephenson et al., PLDI 2000
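The conservatism of the static compiler approach is easy to see in a toy range analysis: bit-widths are propagated from worst-case input ranges, so any value that *could* be wide forces a wide encoding even if it rarely is at runtime. A minimal sketch of such a range propagation (illustrative helpers, not the analysis of Stephenson et al.):

```python
def bits_needed(lo, hi):
    """Bits required to encode an unsigned range [lo, hi] (conservative;
    lo is kept for symmetry, only the upper bound matters here)."""
    return max(hi.bit_length(), 1)

def add_range(a, b):
    """Worst-case range of x + y given the ranges of x and y."""
    return (a[0] + b[0], a[1] + b[1])

# pixel values are 8-bit; summing two pixels needs 9 bits in the worst case
pix = (0, 255)
s = add_range(pix, pix)   # (0, 510)
print(bits_needed(*s))    # prints 9
```

The analysis must budget 9 bits for the sum even if, dynamically, most pixel sums fit in 8, which is precisely the opportunity the speculative scheme on the following slides recovers.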
38
Program region analysis: adpcm (BB granularity)
the occurrence of dynamic narrow-width operands at the basic block level can be high
key idea: adapt the underlying processor data-path width to the dynamic bit-width size of the region
39
Our approach
avoid relying on hardware support to detect the occurrences of narrow-width operands (dynamic approach)
avoid relying on static data flow analysis to discover bit-width ranges (compiler approach; too conservative!)
instead: a speculative narrow-width execution mode
  take advantage of runtime information to expose dynamic narrow-width operands to the compiler
  use the compiler to decide when to switch from normal to narrow-width mode and vice versa (reconfiguration instructions)
40
Speculative narrow-width execution: micro-architecture
recovering scheme
  simple comparison logic at the execute stage
  upon a miss, the pipeline is flushed and the instruction is replayed with the correct mode
  the recovery scheme may impact both performance and energy
static energy saving
  adaptive register file width that can be viewed as an 8/16/32-bit register file
  unused register file slices are put in a low-power mode (drowsy mode) to reduce static energy
dynamic energy saving
  data-path clock-gating (pipeline latches, ALU) when a narrow execution mode is encountered
[Figure: register file split into 8-bit and 16-bit slices driven by a slice-enable signal; bypass and write-back paths operate in 8/16/32-bit mode.]
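The speculate/compare/replay loop can be modeled in a few lines. This is an illustrative behavioral model, not the Lx pipeline: the 5-cycle penalty is an assumed figure, and a "miss" is detected exactly when the full-width result does not fit the current narrow slice.

```python
def execute(a, b, mode_bits, full_bits=32, penalty=5):
    """Speculatively add in a narrow mode; if the full-width result does
    not fit in mode_bits, 'flush' the pipeline and replay in full width.
    Returns (result, cycles). Behavioral sketch, not the real hardware."""
    full = (a + b) & ((1 << full_bits) - 1)
    narrow = full & ((1 << mode_bits) - 1)
    if narrow == full:
        return full, 1                 # speculation hit: single cycle
    return full, 1 + penalty + 1       # miss: flush penalty, then replay

print(execute(100, 27, mode_bits=8))   # fits in 8 bits -> (127, 1)
print(execute(200, 100, mode_bits=8))  # 300 needs 9 bits -> (300, 7)
```

The model makes the results slide concrete: the cost of the scheme is the product of the mis-speculation rate and the flush penalty, which is why a 25-cycle penalty with only 60% narrow-width availability hurts so much more than 5 cycles at 80%.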
41
Speculative narrow-width execution: compiler support
regions are rarely composed of narrow-width operands only ...
address instructions (AI) usually require a larger bit-width; split each AI into
  an address calculation
  a memory access via an accumulator register
schedule instructions within a region such that those having an operand with a 32-bit width are moved around
insert reconfiguration instructions at each frontier of a region
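The last step, placing reconfiguration instructions at region frontiers, amounts to emitting a mode switch whenever the preferred data-path width changes between consecutive regions. A minimal sketch; `reconfig` is a hypothetical pseudo-instruction name, not the Lx encoding:

```python
def insert_reconfig(regions):
    """Given (region_name, preferred_width) pairs in schedule order, emit
    a hypothetical 'reconfig <width>' pseudo-instruction at each frontier
    where the preferred data-path width changes."""
    out, current = [], None
    for name, width in regions:
        if width != current:                # frontier: width changes here
            out.append(f"reconfig {width}")
            current = width
        out.append(name)
    return out

print(insert_reconfig([("R1", 8), ("R2", 8), ("R3", 32), ("R4", 16)]))
```

Consecutive regions sharing a width cost nothing extra (R1 and R2 above share one `reconfig 8`), so the compiler has an incentive to schedule same-width regions next to each other.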
42
Results summary
the impact of the recovery scheme varies with the mis-speculation penalty and the availability of narrow-width operands:
  with a 5 cycle penalty and 80% narrow-width availability, programs show no performance degradation
  with a 25 cycle penalty and 60% narrow-width availability, the IPC degradation reaches 30%
overall, on the 13 applications from PowerStone, the data-path dynamic energy is reduced by 17% on average
we achieve a 22% reduction of the register file static energy
44
Conclusions
power consumption is a matter of both software and hardware:
  software, because program execution causes switching transitions (dynamic power)
  hardware, because power consumption grows with architecture complexity
hardware/software techniques must be used jointly to provide an effective basis for reducing power consumption
this thesis has provided arguments in favor of a profile-driven, compiler-architecture symbiosis approach to reduce power consumption by:
  detecting the occurrences of program phases/regions
  discriminating the optimizations that best benefit a phase/region
  adapting the micro-architecture w.r.t. the behavior of a phase/region
45
Future work
analogy between ILP and DLP: investigate the energy issues involved with SIMD compilation
  need for a SIMD energy model
  measure the impact of overhead instructions (pack/unpack)
catching different program behaviors with a hot path signature will allow us to study the interplay of different reconfiguration techniques to save energy:
  energy impact of SIMD compilation with an adaptive i-cache
  effectiveness of SIMD compilation at exploiting narrow-width operands (speculative vectorization techniques?)