Power awareness through selective Dynamically Optimized Traces. Roni Rosner, Yoav Almog, Micha Moffie, Naftali Schwartz and Avi Mendelson, Microprocessor Research, Intel Labs, Haifa, Israel. ISCA 2004. Presented at the Computer Architecture Group, University of Cyprus, by Andreas Artemiou, 02/02/2005.

Power awareness through selective Dynamically Optimized Traces

Roni Rosner, Yoav Almog, Micha Moffie, Naftali Schwartz and Avi Mendelson

Microprocessor Research, Intel Labs, Haifa, Israel

ISCA 2004

Presented at the

Computer Architecture Group

University of Cyprus

by Andreas Artemiou

02-02-2005

PARROT Concept

Achieve higher performance with reduced energy consumption through gradual optimization of frequently executed code traces

Why is energy important?

Current processors have become power limited, that is, they can only operate at limited frequency, preventing them from achieving their full microarchitectural performance potential.

Modern Processors

Modern processors consist of:

Front end: fetch, dispatch, issue

Back end: execute, commit

The front-end bandwidth is crucial for the overall performance of the system

The power and complexity of dynamic scheduling depend on execution bandwidth as well as on program behavior and the instruction window size

Amdahl’s Law

This work tries to take advantage of the hot/cold (90/10) paradigm, following Amdahl's Law

That is, a small portion of the static program code is responsible for most of its dynamic execution

PARROT applies principles similar to those used in profiling compilers, dynamic translators, etc.
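As a rough illustration of why the 90/10 split matters, Amdahl's Law can be evaluated directly. The sketch below is not from the paper: the function name and the example speedup factor of 2 are assumptions chosen only to show the arithmetic.

```python
def amdahl_speedup(hot_fraction: float, hot_speedup: float) -> float:
    """Overall speedup when only `hot_fraction` of execution time is accelerated.

    Amdahl's Law: 1 / ((1 - f) + f / s), where f is the accelerated
    fraction and s is the speedup applied to that fraction.
    """
    if not 0.0 <= hot_fraction <= 1.0:
        raise ValueError("hot_fraction must be between 0 and 1")
    if hot_speedup <= 0.0:
        raise ValueError("hot_speedup must be positive")
    return 1.0 / ((1.0 - hot_fraction) + hot_fraction / hot_speedup)


# Hypothetical numbers: accelerating the hot 90% by 2x versus the cold 10% by 2x.
print(round(amdahl_speedup(0.9, 2.0), 2))  # 1.82
print(round(amdahl_speedup(0.1, 2.0), 2))  # 1.05
```

The same 2x improvement pays off far more on the hot part than on the cold part, which is the asymmetry PARROT relies on.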

Identifying techniques

To identify frequently executed code sections we have:

Software-based techniques, like those proposed by Mahlke S.A. et al. (MICRO 1992)

Hardware-based techniques, like those by Merten M.C. et al. (ISCA 26, 1999)

PARROT aims to aggressively exploit the hot/cold paradigm in hardware for the benefit of both processor performance and power awareness

PARROT=Power-Aware aRchitecture Running Optimized Traces

What does this work do?

Examines several microarchitectural alternatives based on the concept of PARROT (referred to as PARROT microarchitectures)

These are organized around an optimized trace cache. What is a trace cache?

PARROT parts

PARROT has three parts:

Identify the most frequent sequences of program code with a mechanism that performs trace selection and filtering

Aggressively optimize them once with a dynamic optimizer

Efficiently execute them many times (after they are stored in the trace cache)

PARROT parts (2)

Key factors for power awareness: gradual construction of traces, pipeline decoupling, and specific trace optimizations

Cold part: limiting the hardware for the cold part may exact a small price in performance

Hot part: more aggressive hardware may be used to improve performance/power tradeoffs for the dominant hot segments of the code

Results show that no additional energy is spent for the optimization of hot traces

Related work

PARROT's uniqueness lies in the application of decoupling and dynamic optimization techniques to achieve better performance with less energy consumption

Other similar ideas:

rePLay (Patel J.S. and Lumetta S.S., IEEE Transactions on Computers, Vol. 50, No. 6, June 2001) – hardware based

Turboscalar (B. Black and J. P. Shen, ISCA 27, June 2000) – hardware based

DAISY (K. Ebcioglu and E.R. Altman, ISCA 24, 1997) – software based

Motivation

PARROT is based on the following observations:

The working set of a program is relatively small

Much of the excess complexity of modern OOO processors results from handling rare cases

Small segments of code that are repeatedly executed (“hot traces”) typically cover most of the program's working set

The hot segments of the code behave differently from the rest of the code: they are more regular and predictable, and consequently they exhibit higher potential for ILP extraction than the other, less frequently executed parts of the code

PARROT Split Microarchitecture

Microarchitecture Components

Front-end and execution pipelines are tuned for either cold or hot portions of the code

Background components post-process the instruction flow out of the foreground pipeline, making “off critical path” decisions such as when to move from the cold subsystem to the hot subsystem and when to apply further optimizations

Synchronization elements are required for arbitrating and switching states between pipelines and for preserving global program order

Cold/Hot Subsystems

A trace cache can be very efficient in handling hot code, provided this code has been sufficiently well identified (Previous studies)

We base the cold subsystem on instructions fetched from an instruction cache whereas the hot subsystem is based on traces fetched from a trace cache

Power awareness and trace cache effectiveness limit trace construction and trace cache insertion to frequently executed code sections

PARROT applies dynamic optimizations gradually: the hotter the trace, the more aggressive the power-aware optimizations applied to it

Decoding/Optimizing

Reusability of hardware work and results is important for both performance and energy savings

In PARROT, the trace cache stores:

Decoded traces, and is thus a container for reuse of decoding results

Optimized traces, allowing multiple reuses of trace optimizations

Dynamic Optimization

Advantages of dynamic optimization:

Dynamic information (such as the outcome of trace-internal branches) is available, enabling optimizations that are impossible for a static compiler

Decoupling these optimizations from the execution pipeline allows more aggressive optimizations than the on-the-fly optimizations that can be performed within standard pipeline execution

To take full advantage of these optimizations, atomicity of a trace is assumed. This permits very aggressive optimizations across basic-block boundaries

Architectural transparency: the hardware is able to optimize legacy code without the need for recompilation

Traces

An execution trace is a sequence of operations representing a continuous segment of the dynamic flow of a program. A trace may continue past control transfer instructions (CTIs), so it may extend over several basic blocks

The current study considers decoded atomic traces

Decoded traces contain decoded micro-operations (uops) and enable reuse of decode activity, thus saving energy

Atomic traces are single-entry and single-exit

Trace Selection

Trace selection is the process of constructing particular traces out of a dynamic sequence of instructions. It may be:

Deterministic: if applied to the fully predictable sequence of in-order committed instructions

Speculative: if applied at any earlier pipeline stage, where instructions are potentially mispredicted

Trace Construction Criteria

Deterministic selection criteria:

Capacity limitation: traces are constructed into frames of at most 64 uops

Complete basic blocks: with the exception of large basic blocks, traces always terminate on CTIs

Terminating CTIs: all indirect jumps and software exceptions terminate traces, except RETURN instructions. In addition, backward taken branches terminate a trace

RETURN instructions terminate traces only if they exit the outermost procedure context already encountered in the current trace

If two or more consecutive traces are identical, they are joined into a single trace until the capacity limit is reached (achieving the effect of explicit loop unrolling)

Unique trace identifiers (TIDs) can be compacted into a single address and a sequence of branch directions (taken/not taken)
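The deterministic criteria can be sketched as a small pass over the committed instruction stream. This is an illustration only, not the paper's hardware: the `Instr` record, its `kind` labels and the `select_traces` name are assumptions. It also simplifies two points: a trace is cut wherever the 64-uop limit is hit rather than at the preceding CTI, and identical consecutive traces are not joined.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple

MAX_UOPS = 64  # frame capacity given on the slide


@dataclass(frozen=True)
class Instr:
    # Hypothetical record for one committed instruction.
    addr: int
    kind: str = "alu"  # "alu", "branch", "indirect", "exception", "call", "return"
    uops: int = 1      # number of uops the instruction decodes into
    taken: bool = False
    target: Optional[int] = None


# A TID is the entry address plus the directions of the trace's branches.
TID = Tuple[int, Tuple[bool, ...]]


def select_traces(committed: Iterable[Instr]) -> List[Tuple[TID, List[Instr]]]:
    traces: List[Tuple[TID, List[Instr]]] = []
    cur: List[Instr] = []
    dirs: List[bool] = []
    n_uops = 0
    depth = 0  # call depth relative to the start of the current trace

    def close() -> None:
        nonlocal cur, dirs, n_uops, depth
        if cur:
            traces.append(((cur[0].addr, tuple(dirs)), cur))
        cur, dirs, n_uops, depth = [], [], 0, 0

    for ins in committed:
        if n_uops + ins.uops > MAX_UOPS:  # capacity limitation
            close()
        cur.append(ins)
        n_uops += ins.uops
        if ins.kind == "branch":
            dirs.append(ins.taken)
            backward = ins.target is not None and ins.target <= ins.addr
            if ins.taken and backward:    # backward taken branch ends the trace
                close()
        elif ins.kind in ("indirect", "exception"):
            close()
        elif ins.kind == "call":
            depth += 1
        elif ins.kind == "return":
            if depth == 0:                # exits the outermost context seen so far
                close()
            else:
                depth -= 1
    close()  # emit the partial trace left at the end of the stream
    return traces
```

Two iterations of the same loop body that differ only in the final branch direction yield two distinct TIDs with the same entry address, which is why the branch directions are part of the identifier.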

Split Execution Microarchitecture

A split-execution implementation consists of two disjoint subsystems, one for the cold path and one for the hot path.

Different execution engines can be employed by each subsystem (wider execution engine, higher bandwidth etc)

There is an optimized unified-execution engine that shares the execution resources between the hot and cold subsystems

Unified Core Microarchitecture

Pipeline Phases

The following description applies to both split-execution and unified-execution implementations.

Both cold and hot pipelines operate in two phases:

Foreground phase, which is responsible for the fetch-to-execution pipeline

Background phase, which is responsible for selecting frequent parts of the just-executed code, optimizing them and potentially promoting them to a “hotter” level

Background phase

The background phase of the cold subsystem identifies frequent IA32 instruction sequences and captures them as traces in the trace cache.

It is composed of TID selection, TID hot-filtering, trace construction and insertion into the trace cache

Continuous training of both the trace predictor and the hot filter is assured.

Only those TIDs that pass the hot filter continue to the trace construction stage.

The background phase of the hot subsystem identifies the most frequent traces, optimizes them and inserts them back into the trace cache.

Post-processing is applied gradually, so the longer a trace is used, the more aggressive the optimizations applied to it

Detailed Schema

Predictors

Two predictors:

A branch predictor predicts the next cache line to be fetched from the I$ for execution on the cold pipeline

A trace predictor predicts the TID of the next trace to be fetched from the trace cache and executed on the hot pipeline

Both are based on a global history register (GHR)

The GHR is updated for each CTI being executed

Both support speculative update upon fetch and real update upon commit

NOTE: importantly, the trace predictor may predict a TID for a trace that is not present in the trace cache

Fetch Selector

The fetch selector chooses between the execution pipelines by consulting both the lower-priority branch predictor and the higher-priority trace predictor

When the trace predictor succeeds in making a next-TID prediction, the hot pipeline is selected, and if the trace is successfully fetched it is executed on the hot pipeline

All other cases result in cold-pipeline execution directed by branch prediction
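The selection rule amounts to a two-step check, sketched below. The function name, the use of `None` for "no prediction", and a dictionary standing in for the trace cache are all assumptions made for illustration.

```python
from typing import Any, Dict, Hashable, Optional, Tuple


def select_pipeline(
    predicted_tid: Optional[Hashable],
    trace_cache: Dict[Hashable, Any],
) -> Tuple[str, Optional[Any]]:
    """Return ("hot", trace) when a predicted trace can be fetched, else ("cold", None)."""
    if predicted_tid is not None:
        # The trace predictor has priority, but its TID may miss in the trace cache.
        trace = trace_cache.get(predicted_tid)
        if trace is not None:
            return "hot", trace
    # No prediction or trace-cache miss: fall back to branch-predicted cold fetch.
    return "cold", None
```

The miss case matters because a predicted TID is not guaranteed to have a trace resident in the trace cache.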

Foreground phase

In the foreground phase the pipelines execute sequences of uops originating from either cold instructions or hot traces.

A split core enables core specialization:

The cold core may focus on the execution of rare but complex operations, or be less performance-aggressive

The hot core may excel at aggressive execution of atomic traces, employ simplified renaming schemes or rely on dynamic scheduling performed by the optimizer

On the other hand, a split core increases die size and introduces complexities with cold/hot state switches

A unified core reduces both die size and idle power

This study considers standard superscalar out-of-order cores only, in both split and unified configurations

Switch Mechanism in Split Core

For the split core:

The state switch mechanism ensures that values computed and stored in the register file of one core are used at the appropriate time and place in the second core

It does so by tracking, for each register, the last writer uop preceding the switch and the first reader uop of the code following the switch, and ensuring that the reader is not executed until the writer completes writeback and the value has been communicated to the second core
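The last-writer/first-reader bookkeeping can be illustrated in software. This is a sketch under assumptions, not the hardware mechanism: uops are modelled as `(dest, sources)` pairs and the function name is hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Uop = Tuple[str, Sequence[str]]  # (destination register, source registers)


def switch_dependencies(before: List[Uop], after: List[Uop]) -> Dict[str, Tuple[int, int]]:
    """Map each register that crosses the switch to (last writer index, first reader index)."""
    last_writer: Dict[str, int] = {}
    for i, (dst, _srcs) in enumerate(before):
        last_writer[dst] = i  # later writers overwrite earlier ones

    first_reader: Dict[str, int] = {}
    redefined = set()  # registers already rewritten after the switch
    for j, (dst, srcs) in enumerate(after):
        for src in srcs:
            crosses = src in last_writer and src not in redefined
            if crosses and src not in first_reader:
                first_reader[src] = j
        redefined.add(dst)

    return {reg: (last_writer[reg], j) for reg, j in first_reader.items()}
```

Each returned pair names a reader that must wait until the corresponding writer has written back and its value has reached the other core. Registers that are redefined after the switch before being read need no transfer.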

Commit Stage

The commit stage is responsible for committing IA32 instructions to the architectural state.

Two synchronization issues:

Instructions must be committed in program order (in a split microarchitecture, instructions must carry markers to reconstruct the global order)

The atomic hot traces should be committed at once as a single entity, requiring a mechanism for state accumulation

Note that for such a gradual scheme only moderate enlargement of non-critical machine resources is necessary.

If any intermediate event prevents full completion of the trace, the remaining uops are flushed and the architectural state returns to what it was before the fetch of the trace. This may result from exceptions, failed assert uops, or external interrupts
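Atomic commit with state accumulation can be pictured as buffering all register writes and publishing them only if the whole trace completes. The sketch below is a software analogy; the two uop kinds (`addi`, `assert_eq`) and the function name are invented for illustration.

```python
from typing import Dict, List, Tuple

Uop = Tuple[str, str, str, int]  # (opcode, destination, source, immediate)


def commit_trace_atomically(arch_regs: Dict[str, int], uops: List[Uop]) -> bool:
    """Apply all uops to arch_regs, or none of them if an assert uop fails."""
    pending: Dict[str, int] = {}  # accumulated, not yet architectural, state

    def read(reg: str) -> int:
        return pending[reg] if reg in pending else arch_regs[reg]

    for op, dst, src, imm in uops:
        if op == "addi":
            pending[dst] = read(src) + imm
        elif op == "assert_eq":
            if read(src) != imm:
                return False  # abort: pending writes are dropped
        else:
            raise ValueError(f"unknown opcode: {op}")

    arch_regs.update(pending)  # the whole trace commits as a single entity
    return True
```

On a `False` return the caller's register state is exactly what it was before the trace started.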

Post Processing

For post-processing cold code, PARROT employs a deterministic TID/trace build scheme.

Uops from cold committed instructions are collected until a trace termination condition is reached. Then a new TID is generated from the entry address and the CTIs, and this TID is used to train the trace predictor

If this TID is identified as frequent, the collected uops are used to construct an executable trace that can be inserted into the trace cache.

Filtering Mechanisms

PARROT employs two filtering mechanisms:

Hot filter, which selects frequent TIDs from among those constructed on the cold pipeline

Blazing filter, which selects the most frequent TIDs from among those executed on the hot pipeline

Both are small caches that maintain a counter for each TID

Each trace execution increments the counter of the executed TID

Once the hot filter threshold is reached, the trace is constructed and inserted into the trace cache

When the blazing filter threshold is reached, the executed trace is optimized and written back to the trace cache, replacing the original
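A minimal software model of the two counting filters is sketched below. The slides give neither thresholds nor a replacement policy, so the thresholds (3 and 5), the capacity and the LRU eviction are assumptions, as are all names. The "optimization" is a flag standing in for the real optimizer.

```python
from collections import OrderedDict
from typing import Dict, Hashable, List


class CountingFilter:
    """Small cache of per-TID execution counters (LRU replacement is assumed)."""

    def __init__(self, capacity: int, threshold: int) -> None:
        self.capacity = capacity
        self.threshold = threshold
        self._counters: "OrderedDict[Hashable, int]" = OrderedDict()

    def observe(self, tid: Hashable) -> bool:
        """Count one execution; return True exactly when the threshold is reached."""
        count = self._counters.pop(tid, 0) + 1
        self._counters[tid] = count             # re-insert as most recently used
        if len(self._counters) > self.capacity:
            self._counters.popitem(last=False)  # evict the least recently used TID
        return count == self.threshold


hot_filter = CountingFilter(capacity=4, threshold=3)      # hypothetical sizes
blazing_filter = CountingFilter(capacity=4, threshold=5)  # hypothetical sizes
trace_cache: Dict[Hashable, dict] = {}


def on_cold_commit(tid: Hashable, uops: List[str]) -> None:
    # A TID seen often enough on the cold pipeline gets its trace built and cached.
    if tid not in trace_cache and hot_filter.observe(tid):
        trace_cache[tid] = {"uops": list(uops), "optimized": False}


def on_hot_commit(tid: Hashable) -> None:
    # A cached trace executed often enough is optimized and replaces the original.
    if tid in trace_cache and blazing_filter.observe(tid):
        trace_cache[tid]["optimized"] = True  # stand-in for the dynamic optimizer
```

Returning `True` only on the exact threshold crossing means each trace is constructed once and optimized once, as long as its counter is not evicted in between.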

Optimizations

PARROT employs dynamic optimizations on blazing traces. The optimizations can be classified as:

General purpose, which are independent of the underlying execution core (logic simplifications, constant propagation and dead code elimination)

Core-specific, which include functional transformations (such as micro-operation fusion and SIMDification) and global transformations

Optimization results in: uop reduction, dependency elimination, simplified renaming, improved scheduling

Virtual renaming results in power/energy savings
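To make one of the general-purpose optimizations concrete, here is a textbook backward liveness pass for dead code elimination over a straight-line trace. The slides name the optimization but not an algorithm, so this is a generic sketch: the uop tuple layout, the `store` opcode as the only side-effecting operation, and the function name are assumptions.

```python
from typing import Iterable, List, Optional, Set, Tuple

Uop = Tuple[Optional[str], str, Tuple[str, ...]]  # (dest register, opcode, source registers)


def eliminate_dead_uops(trace: List[Uop], live_out: Iterable[str]) -> List[Uop]:
    """Drop uops whose results are overwritten or never used before the trace exit."""
    live: Set[str] = set(live_out)  # registers needed after the single exit
    kept: List[Uop] = []
    for dst, op, srcs in reversed(trace):
        has_side_effect = op == "store"
        if has_side_effect or dst in live:
            kept.append((dst, op, srcs))
            if dst is not None:
                live.discard(dst)   # this uop defines dst, so earlier values are dead
            live.update(srcs)       # its sources are needed by this point
    kept.reverse()
    return kept
```

Because an atomic trace has a single entry and a single exit, a value overwritten inside the trace before any read is dead no matter which basic block produced it. This is what permits elimination across basic-block boundaries.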

Simulation Framework

Performance simulation

Energy simulation

Performance Simulation (1)

Incorporates a full memory hierarchy

Newly designed components for the post-processing phases

The software architecture includes a generic, highly configurable, object-oriented execution core class, which can be instantiated with a variable number of execution cores of widely differing characteristics

Performance Simulation (2)

One feature of the simulation framework is the abstract instruction, which can be defined as a “committable work unit” and so has a different interpretation within the cold and hot pipelines: in the cold pipeline it is an instruction, in the hot pipeline it is a trace

Energy Simulation (1)

Uses tools based on a combination of WATTCH-like and TEM2P2EST-like approaches.

Assumes uniform leakage, both in space and in time

Energy Simulation (2)

Total leakage energy: LE = PMAX * (0.05 * M + 0.4 * K) * CYC

PMAX is the average dynamic power of the base OOO model

0.05 is the technology constant for each MB of level-2 cache (so M is, by implication, the level-2 cache size in MB)

0.4 is the technology constant for the standard core

K is the factor by which the current core is larger than the standard OOO core

CYC is the number of cycles the application runs
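The leakage formula translates directly into code. In this sketch the parameter names are made up, and reading M as the level-2 cache size in MB is an inference from the "per MB" constant rather than something the slide states outright. The input values in the example are hypothetical.

```python
def leakage_energy(pmax: float, l2_mb: float, core_size_factor: float, cycles: int) -> float:
    """LE = PMAX * (0.05 * M + 0.4 * K) * CYC, in units of (power unit x cycles)."""
    l2_leak_per_mb = 0.05  # technology constant per MB of level-2 cache
    core_leak = 0.4        # technology constant for the standard core
    return pmax * (l2_leak_per_mb * l2_mb + core_leak * core_size_factor) * cycles


# Hypothetical comparison: same cache and run length, core twice the standard size.
base = leakage_energy(pmax=1.0, l2_mb=1.0, core_size_factor=1.0, cycles=1_000_000)
big = leakage_energy(pmax=1.0, l2_mb=1.0, core_size_factor=2.0, cycles=1_000_000)
print(big / base)
```

Because the model is uniform in time, leakage energy scales linearly with cycle count, so a configuration that finishes sooner also leaks less.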

Models (1)

Model a variety of configurations

Reference model (N): standard 4-wide OOO machine

Narrow: the different variants of this standard 4-wide machine

Wide: the different variants of a more generous 8-wide machine

A theoretical configuration was created where all stages are wide (W)

Models (2)

PARROT configurations are denoted:

TON for narrow

TOW for wide

TOS for split

T stands for selective trace cache

O stands for dynamic optimizations

Two further models: TN and TW (both without trace optimization)

Models (3)

Benchmarks

SPECint2000: 30M instructions

SPECfp2000: 30M instructions

Office/Windows applications: 100M instructions

Multimedia: from 30M to 100M instructions

DotNet: 100M instructions

Metrics

For processor performance: IPC, total energy, Cubic-MIPS-per-WATT (CMPW)

Parameters characterizing PARROT: coverage, uop reduction, energy breakdown
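Taking the CMPW metric at face value as MIPS cubed divided by watts, the relative CMPW of two designs follows from their performance and power ratios. The helper names and the example ratios below are hypothetical and are not results from the paper.

```python
def cmpw(mips: float, watts: float) -> float:
    """Cubic-MIPS-per-Watt: performance cubed over power."""
    return mips ** 3 / watts


def cmpw_ratio(perf_ratio: float, power_ratio: float) -> float:
    """CMPW of a new design relative to a baseline, from relative performance and power."""
    return perf_ratio ** 3 / power_ratio


# Hypothetical design: 10% faster while drawing 10% more power.
print(round(cmpw_ratio(1.10, 1.10), 2))  # 1.21
```

Cubing performance makes the metric reward speed heavily: a design may draw somewhat more power and still score better, provided its performance gain is large enough.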

Results

Presents the results of alternative enhancements applied to the reference machines N and W

The TOS conceptual microarchitecture statistics are presented only as a reference for alternative future development

Performance and Power Awareness (1)

TW/W: 7% performance increase

TN/N: negligible 2% increase

TON/N: 17% performance improvement

TOW/W: 25% improvement

Performance and Power Awareness (2)

All extensions of the wide machine actually save energy

Extensions of the narrow machine, even with PARROT-style optimizations, increase energy

Performance and Power Awareness (3)

The CMPW criterion weighs both performance gain and energy loss:

TON/N: about 32%

TOW/W: about 92%

Front-End Capabilities (1)

N model with a 4K-entry branch predictor

TON with a branch predictor (cold) and a trace predictor (hot), 2K entries each

Front-End Capabilities (2)

Optimizer Capabilities (1)

Average uop reduction with PARROT: 19%

Average dependency reduction: 8%

Optimizer Capabilities (2)

Energy Breakdown (1)

Conclusions (1)

Presents PARROT, improving processor performance and power awareness

Consists of asymmetric decoupling of the processor into subsystems responsible for handling the cold (infrequent) and hot (frequent) portions of code

Designs each part according to different power and performance considerations

Conclusions (2)

The presented simulation results demonstrate that applying the PARROT concept to a standard 4-wide OOO processor yields performance comparable to an 8-wide processor while consuming significantly less energy

Future Work

One major topic for future research is related to split-core microarchitectures

Investigate the potential advantage of such a design for establishing even better performance/energy tradeoffs by considering different alternatives

Thank you!!!!