from adaptive to self-tuning systems sudhakar yalamanchili, subramanian ramaswamy and gregory diamos...

From Adaptive to Self-Tuning Systems

Sudhakar Yalamanchili, Subramanian Ramaswamy and Gregory Diamos

School of Electrical and Computer Engineering

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 2

Architectural Challenges

Not much headroom left in the stage to stage times (currently 8-12 FO4 delays) [4]

[Figure labels: ILP; Pipeline: in-order, OOO, aggressive OOO]

1. P. Ranganathan, S. Adve, N. Jouppi. "Reconfigurable Caches and their Application to Media Processing." ISCA 2000
2. Michael Zhang, Krste Asanovic. "Fine-Grain CAM-Tag Cache Resizing Using Miss Tags." ISLPED 02
3. S. Borkar. "Design Challenges of Technology Scaling." Micro 1999
4. Vikas Agarwal, M. S. Hrishikesh, Stephen W. Keckler, Doug Burger. "Clock rate versus IPC: the end of the road for conventional microarchitectures." ISCA 2000

Cache Area
  80% of transistor budget
  50% of total area [1]

Defects in cache affect processor yield

Significant power consumers (e.g. > 40% of total power in StrongARM) [2]

On-chip-DRAM gap continues to grow

Power Wall

Frequency Wall

Single Thread Performance

Memory Wall

Economic Wall
  Costs of developing next generation processors
  Design & Manufacturing costs

Extreme Device Variability

• Negative returns with power
• Increasing inefficiencies due to
  • speculation
  • control flow

Source: http://techreport.com/reviews/2005q2/opteron-x75/dualcore-chip.jpg

[Figure axis: Power]

Leakage current increases 7.5X with each generation [3]


System View

[Figure: large-scale system built from many processor (P) and memory (M) nodes]

1. Capture and adapt to intrinsic application behavior

Many-core, Heterogeneous System

Static, off-line characterizations

Dynamic, on-line, evolutionary behaviors

Solution: Systems are self-tuning

2. Device-Level Variations reduce architecture yield


State of the Practice

The Space of Solutions

Workloads: Structured Workloads to Ill-Structured Workloads

Systems: Rigid HW/SW Boundaries to Evolutionary or Self-Tuning Systems

[Figure: processor (P) and memory (M) organizations across the solution space]

Traditional Architectures (Fixed)

Ability to Customize Architectures Before Application Deployment

Architectures Change At SW-determined Points of Execution

Architectures continuously autonomously evolve and adapt


From Adaptive to Self Tuning

Where do we make future investments in transistors and software?

Hardware software co-design for continuous monitoring and/or tuning

Expose and (dynamically) eliminate design redundancies

Two Examples
  Cache memory hierarchy
  On-Chip Networks


Generational Behavior of Caches

[Figure: accesses to memory lines over time, marking hits, misses, idle intervals, and the start of each new generation]

1. Kaxiras, S., Hu, Z. and Martonosi, M. "Cache Decay: Exploiting Generational Behavior to Reduce Cache Leakage Power." ISCA 2001
2. Jaume Abella, Antonio González, Xavier Vera, Michael F. P. O'Boyle. "IATAC: a smart predictor to turn-off L2 cache lines." TACO 2005
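
The generational picture above suggests a simple mechanism: if a line has been idle long enough, treat its generation as over and switch the line off to save leakage. The following is a minimal sketch of that idea, not the mechanism of either cited paper. The class name `DecayLine`, the `decay_interval` parameter, and the tick granularity are all assumptions made for illustration.

```python
class DecayLine:
    """One cache line with a hypothetical idle counter (illustrative only)."""

    def __init__(self, decay_interval):
        self.decay_interval = decay_interval  # idle ticks tolerated before power-off
        self.idle = 0          # ticks since the last access
        self.powered = False   # line starts switched off and empty
        self.tag = None

    def access(self, tag):
        """Return True on a hit. A miss fills the line and starts a new generation."""
        hit = self.powered and self.tag == tag
        if not hit:
            self.tag = tag
            self.powered = True
        self.idle = 0  # any access ends the current idle interval
        return hit

    def tick(self):
        """Advance time by one tick; switch the line off once it has idled too long."""
        if self.powered:
            self.idle += 1
            if self.idle >= self.decay_interval:
                self.powered = False  # contents are lost; leakage is saved


line = DecayLine(decay_interval=3)
first = line.access("A")    # miss: new generation begins
second = line.access("A")   # hit within the generation
for _ in range(3):
    line.tick()             # long idle interval: line decays
third = line.access("A")    # miss again: the old generation had ended
```

The trade-off in this sketch is the choice of `decay_interval`: too short and live lines are switched off early, causing extra misses; too long and dead lines keep leaking.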


Cache Tuning: Conceptual Model

Remap memory into the cache: shape the cache
Match the program footprint: resize the cache
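
The two operations in this model can be sketched as a programmable set-index function. This is an illustrative sketch only: the class `RemappableCacheIndex`, its per-set mapping table, and the method names are assumptions, not the authors' hardware design.

```python
class RemappableCacheIndex:
    """Hypothetical set-index logic supporting resize and remap (illustrative)."""

    def __init__(self, num_sets):
        self.num_sets = num_sets
        # set_map[c] is the physical set used for conventional set c.
        self.set_map = list(range(num_sets))

    def conventional_set(self, line_addr):
        """Conventional modulo placement."""
        return line_addr % self.num_sets

    def physical_set(self, line_addr):
        """Placement after tuning: look up the conventional set in the table."""
        return self.set_map[self.conventional_set(line_addr)]

    def remap(self, src_set, dst_set):
        """Shape the cache: send lines of conventional set src_set to dst_set."""
        self.set_map[src_set] = dst_set

    def resize(self, active_sets):
        """Match the footprint: use only the first active_sets sets.

        Note: this rebuilds the whole table, discarding earlier remaps.
        """
        self.set_map = [s % active_sets for s in range(self.num_sets)]

    def active(self):
        """Physical sets still in use; the rest could be switched off."""
        return sorted(set(self.set_map))


idx = RemappableCacheIndex(num_sets=8)
before = idx.physical_set(13)   # 13 % 8 = 5
idx.resize(4)                   # halve the cache
after_resize = idx.physical_set(13)  # set 5 now folds onto set 1
idx.remap(1, 3)                 # move conventional set 1 onto physical set 3
after_remap = idx.physical_set(9)    # 9 % 8 = 1, now placed in set 3
```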


Cache Tuning: System Model & Opportunities

Remapping directive in the program:

    Placement( B[][], param )
    loop Region A
        statement
        statement
    end loop

Sources of placement directives:
  Static analysis or programmer supplied
  Profile based insertion
  Run-time tuning

[Figure: Thread 1 and Thread 2 on processor P, with AT, L1, L2, and memory M. Alternative implementations of AT: LUT, logic. Structured accesses over x, y, z.]


Static Tuning: Scientific Applications

Targeted to programs with predictable access patterns
Compiler can both resize and remap

Advanced compiler optimizations made possible


Dynamic Tuning: Folding Heuristics

Find and utilize redundancies in the design
Miss folding: fold misses via re-mapping memory lines into the same cache set

S. Ramaswamy, S. Yalamanchili. "Improving Cache Efficiency via Resizing + Remapping." ICCD 2007

Comparisons shown for a 256KB L2 cache
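
The slide does not spell out the folding heuristic, so the following is only one hypothetical reading of miss folding: sets whose accesses almost always miss are remapped onto a single shared set, which frees the remaining sets for other uses. The function names, the per-set hit and miss counters, and the `miss_ratio_threshold` value are assumptions and are not taken from the cited paper.

```python
def choose_folds(hits, misses, miss_ratio_threshold=0.9):
    """Return a set map in which mostly-missing sets share one physical set.

    hits[s] and misses[s] are hypothetical per-set counters collected
    under conventional placement over some monitoring interval.
    """
    num_sets = len(hits)
    set_map = list(range(num_sets))
    fold_target = None
    for s in range(num_sets):
        total = hits[s] + misses[s]
        if total == 0:
            continue  # unused set: nothing to fold
        if misses[s] / total >= miss_ratio_threshold:
            if fold_target is None:
                fold_target = s  # first mostly-missing set hosts the others
            set_map[s] = fold_target
    return set_map


def freed_sets(set_map):
    """Physical sets that no conventional set maps to any more."""
    return sorted(set(range(len(set_map))) - set(set_map))


hits = [90, 1, 0, 50]
misses = [10, 99, 40, 5]
set_map = choose_folds(hits, misses)   # sets 1 and 2 miss almost always
freed = freed_sets(set_map)            # set 2 is folded onto set 1
```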


Tuning for Yield: Decreasing Defect Sensitivity*

Performance Yield: yield at a given performance (e.g. AMAT) for 1000 units

Up to four times greater than modulo placement
Exploiting redundancies: application to power management

Recovering Design Inefficiencies

* S. Ramaswamy, S. Yalamanchili, "Customizable Fault Tolerant Caches for Embedded Processors," ICCD 2006
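
Performance yield can be made concrete with a small calculation: compute each unit's average memory access time (AMAT) and count the fraction of units that meet a target. The sketch below uses the standard definition AMAT = hit time + miss rate × miss penalty. The per-unit miss rates, the hit time, the miss penalty, and the target are made-up illustrative values, not data from the cited study.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in the same unit as hit_time."""
    return hit_time + miss_rate * miss_penalty


def performance_yield(unit_amats, target_amat):
    """Fraction of manufactured units whose AMAT meets the target."""
    passing = sum(1 for a in unit_amats if a <= target_amat)
    return passing / len(unit_amats)


# Hypothetical miss rates of four units with different defect patterns.
unit_miss_rates = [0.02, 0.05, 0.10, 0.03]
unit_amats = [amat(hit_time=1, miss_rate=m, miss_penalty=100)
              for m in unit_miss_rates]
yield_at_5 = performance_yield(unit_amats, target_amat=5)
```

Under this definition, a placement scheme that keeps defective units close to their defect-free miss rate raises the yield at a given AMAT target without changing the manufacturing process.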


Opportunities

Voltage scaling
  Combine voltage scaling and remapping for program phase dependent power management

Compiler-directed hardware optimizations
  For example concurrent data layout + cache placement

Application to multi-threaded and multi-core domains
  Cache sharing across threads
  Challenge: coherency traffic


The On-Chip Network

The network is in the critical path (performance)
  Operand networks
  Cache hierarchy
  System on Chip

Increasing impact of wire (channel) delays
  Wire delays must be actively managed

On-demand resource management
  Initial studies: link tuning
  Reference: Research at EPFL & Stanford on robust link design


A System for Tuning and Actively Reconfiguring SoC Links (STARS)

Variable delays and cascaded registers measure link delay

Digital PLL tunes the clock to match the link delay

[Figure: timing diagram over time of Latch 1, Latch 2, and Latch 3 sampling the transition from Value 1 to Value 2, for the Well Tuned, Too Slow, and Too Fast cases]


FPGA Tests

[Figure: control flow with Monitoring and Tuning phases: Find Start of Link Transition, Find End of Link Transition, Determine Slack In the Link, Adjust Clock Frequency]

Low speed tests to validate the control strategy
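
The monitor-and-tune loop named on this slide can be sketched in software. This is a behavioral toy model under stated assumptions, not the STARS hardware: the `Link` model, the `sweep` and `tune` functions, the safety margin, and the halving update rule are all hypothetical. The sweep range and step count mirror a 10-bit variable delay spanning 118 ps to 1.47 ns.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """Toy link: output starts changing at start_ps and is stable at end_ps."""
    start_ps: float
    end_ps: float

    def sample(self, t_ps):
        if t_ps < self.start_ps:
            return "old"
        if t_ps < self.end_ps:
            return "changing"
        return "new"


def sweep(link, lo_ps, hi_ps, steps):
    """Monitoring: step the sampling delay to find the transition start and end."""
    step = (hi_ps - lo_ps) / (steps - 1)
    start = end = None
    for code in range(steps):
        t = lo_ps + code * step
        state = link.sample(t)
        if start is None and state != "old":
            start = t
        if state == "new":
            end = t
            break
    return start, end


def tune(link, period_ps, margin_ps=20.0, max_iters=50):
    """Tuning: move the clock period toward the measured link delay plus margin."""
    for _ in range(max_iters):
        start, end = sweep(link, 118.0, 1470.0, 1024)
        if end is None:
            raise ValueError("link delay is outside the measurable range")
        slack = period_ps - (end + margin_ps)  # positive: clock is too slow
        if abs(slack) < 1.0:
            break
        period_ps -= slack / 2  # remove half of the remaining slack each pass
    return period_ps


link = Link(start_ps=400.0, end_ps=600.0)
measured_start, measured_end = sweep(link, 118.0, 1470.0, 1024)
tuned_from_slow = tune(link, period_ps=1500.0)  # clock initially too slow
tuned_from_fast = tune(link, period_ps=300.0)   # clock initially too fast
```

Halving the slack on each pass is a deliberately cautious choice for the sketch; a real controller would pick its step from the resolution and lock behavior of the clock generator.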


Variable Delay Elements (VDE)
  Variable delay from 118ps to 1.47ns
  10 bits of resolution
  502 transistors

Digitally Controlled Oscillator (DCO)
  Clock period from 240ps to 2.97ns
  10 bits of resolution
  528 transistors

Digital Clock Divider (DCD)
  Min input clock period 480ps
  8 bits of resolution
  1127 transistors

Allows tuning links up to 2.083 GHz
  From reference clock of 8.13MHz

Prototyping: 180nm
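
The prototype figures can be cross-checked with a little arithmetic. The sketch below assumes evenly spaced control codes (a linear step size) and assumes the 8-bit divider relates the link clock to the reference by a factor of up to 2^8 = 256. Neither assumption is stated on the slide.

```python
def step_ps(lo_ps, hi_ps, bits):
    """Average step size if the range is covered by 2**bits evenly spaced codes."""
    return (hi_ps - lo_ps) / (2 ** bits - 1)


vde_step = step_ps(118.0, 1470.0, 10)   # about 1.32 ps per code
dco_step = step_ps(240.0, 2970.0, 10)   # about 2.67 ps per code

# A 480 ps minimum input period at the divider corresponds to the top link rate.
max_link_ghz = 1000.0 / 480.0           # about 2.083 GHz

# Reference clock multiplied by the assumed maximum ratio of 256.
ref_times_256_ghz = 8.13 * 2 ** 8 / 1000.0   # about 2.081 GHz
```

The last two values agree to within about 0.1 percent, which is consistent with reading the 2.083 GHz limit as the 480 ps divider input period and the 8.13 MHz reference as that rate divided by 256.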


Extensions

Modulate link widths

Modulate buffer organizations
  Channels/depth

Feedback between local congestion detection and link and buffer resources


Summary

Application demands will be time varying

Technology will introduce time-varying hardware characteristics

Continuous cooperative HW/SW tuning provides a methodology for addressing these concerns

Need the support of abstractions for tuning
  Influence of prior applications to datapaths (Razor-UMich), communication systems (Vizor-GT), and reliable links (Stanford/EPFL)
  Build on existing research in cache performance & power management
