From Adaptive to Self-Tuning Systems
Sudhakar Yalamanchili, Subramanian Ramaswamy and Gregory Diamos
School of Electrical and Computer Engineering
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 2
Architectural Challenges
Frequency Wall
  Not much headroom left in the stage-to-stage times (currently 8-12 FO4 delays) [4]
Single Thread Performance
  ILP
  Pipeline: in-order, OOO, aggressive OOO
  • Negative returns with power
  • Increasing inefficiencies due to speculation and control flow
Memory Wall
  Cache area: 80% of transistor budget, 50% of total area [1]
  Defects in cache affect processor yield
  Significant power consumers (e.g. > 40% of total power in StrongARM) [2]
  On-chip-DRAM gap continues to grow
Power Wall
  Leakage current increases 7.5X with each generation [3]
Economic Wall
  Costs of developing next generation processors
  Design & manufacturing costs
  Extreme device variability
Source: http://techreport.com/reviews/2005q2/opteron-x75/dualcore-chip.jpg
1. P. Ranganathan, S. Adve, N. Jouppi. "Reconfigurable Caches and their Application to Media Processing." ISCA 2000
2. Michael Zhang, Krste Asanovic. "Fine-Grain CAM-Tag Cache Resizing Using Miss Tags." ISLPED 02
3. S. Borkar. "Design Challenges of Technology Scaling." Micro 1999
4. Vikas Agarwal, M. S. Hrishikesh, Stephen W. Keckler, Doug Burger. "Clock rate versus IPC: the end of the road for conventional microarchitectures." ISCA 2000
System View
[Figure: large-scale system of processor (P) and memory (M) nodes]
Many-core, Heterogeneous System
1. Capture and adapt to intrinsic application behavior
  Static, off-line characterizations
  Dynamic, on-line, evolutionary behaviors
2. Device-level variations reduce architecture yield
Solution: Systems are self-tuning
The Space of Solutions
[Figure: design space ranging from structured workloads to ill-structured workloads, and from rigid HW/SW boundaries to evolutionary or self-tuning systems; each region is drawn as a processor (P) and memory (M) pair]
State of the Practice
Traditional architectures (fixed)
Ability to customize architectures before application deployment
Architectures change at SW-determined points of execution
Architectures continuously and autonomously evolve and adapt
From Adaptive to Self-Tuning
Where do we make future investments in transistors and software?
Hardware/software co-design for continuous monitoring and/or tuning
Expose and (dynamically) eliminate design redundancies
Two Examples
  Cache memory hierarchy
  On-Chip Networks
Generational Behavior of Caches
[Figure: accesses to memory lines over time, marking miss, hit, idle interval, and the start of each new generation]
1. S. Kaxiras, Z. Hu, M. Martonosi. "Cache Decay: Exploiting Generational Behavior to Reduce Cache Leakage Power." ISCA 2001
2. Jaume Abella, Antonio González, Xavier Vera, Michael F. P. O'Boyle. "IATAC: a smart predictor to turn-off L2 cache lines." TACO 2005
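The generational view can be made concrete with a small sketch. It is not taken from the talk or from the cited papers. It assumes a single cache frame, a trace of `(time, tag)` accesses, and a hypothetical `decay_interval` after which an idle frame could be powered off. It only counts the idle tail of each generation, which is a simplification.

```python
from dataclasses import dataclass


@dataclass
class Generation:
    tag: int       # memory line resident in the frame during this generation
    start: int     # time of the miss that brought the line in
    last_hit: int  # time of the last access before the line was replaced
    end: int       # time of the miss that replaced it (or end of trace)

    @property
    def live_time(self) -> int:
        return self.last_hit - self.start

    @property
    def idle_time(self) -> int:
        return self.end - self.last_hit


def generations(trace, end_of_trace):
    """Split the accesses to ONE cache frame into generations.

    trace: list of (time, tag) pairs in increasing time order.
    """
    result = []
    current = None
    for time, tag in trace:
        if current is not None and current.tag == tag:
            current.last_hit = time  # hit: the generation is still live
            continue
        if current is not None:
            current.end = time       # miss: the resident line is replaced
            result.append(current)
        current = Generation(tag, time, time, end_of_trace)
    if current is not None:
        result.append(current)
    return result


def recoverable_idle_cycles(gens, decay_interval):
    """Cycles the frame could be off if it decays after `decay_interval` idle cycles."""
    return sum(max(0, g.idle_time - decay_interval) for g in gens)


# Illustrative trace (made-up numbers): line 7, then line 9, then line 7 again.
trace = [(0, 7), (2, 7), (5, 7), (40, 9), (41, 9), (100, 7)]
gens = generations(trace, end_of_trace=120)
idle = [g.idle_time for g in gens]          # [35, 59, 20]
saved = recoverable_idle_cycles(gens, 16)   # 19 + 43 + 4 = 66
```

In this toy trace most of each generation is idle time, which is the redundancy that decay-style predictors try to recover.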
Cache Tuning: Conceptual Model
Remap memory into the cache: shape the cache
Match the program footprint: resize the cache
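A minimal sketch of the two knobs, resize and remap, follows. It assumes a set-indexed cache whose index function goes through a small lookup table. The class name `TunableIndex` and its methods are hypothetical and are not the mechanism described in the talk.

```python
class TunableIndex:
    """Set-index function with a remap table (hypothetical illustration)."""

    def __init__(self, num_sets):
        self.num_sets = num_sets
        # Identity table: behaves like conventional modulo placement.
        self.remap = list(range(num_sets))

    def resize(self, active_sets):
        """Keep only `active_sets` sets in use by folding the rest onto them."""
        if not 1 <= active_sets <= self.num_sets:
            raise ValueError("active_sets must be between 1 and num_sets")
        self.remap = [s % active_sets for s in range(self.num_sets)]

    def remap_set(self, src, dst):
        """Shape the cache: send lines that index set `src` to set `dst`."""
        self.remap[src] = dst

    def set_for(self, line_addr):
        return self.remap[line_addr % self.num_sets]

    def active(self):
        """Sets that can still receive lines; the others could be turned off."""
        return sorted(set(self.remap))


idx = TunableIndex(num_sets=8)
before = idx.set_for(13)   # modulo placement: 13 % 8 = 5
idx.resize(4)              # match a smaller footprint
after = idx.set_for(13)    # 5 folds onto 5 % 4 = 1
idx.remap_set(1, 3)        # reshape: lines that index set 1 now go to set 3
```

The table makes placement a software-visible parameter, so the same structure can express both a smaller cache and a differently shaped one.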
Cache Tuning: System Model & Opportunities
[Figure: processor (P) running Thread 1 and Thread 2, with AT, L1, L2, and memory (M)]
Remapping directive: Placement(B[][], param)
  loop Region A
    statement
    statement
  end loop
  Placement(B[][], param)
Static analysis or programmer supplied
Profile-based insertion
Alternative implementations: LUT, logic
Run-time tuning
Structured accesses (x, y, z)
Static Tuning: Scientific Applications
Targeted to programs with predictable access patterns
Compiler can both resize and remap
Advanced compiler optimizations made possible
Dynamic Tuning: Folding Heuristics
Find and utilize redundancies in the design
Miss folding: fold misses by remapping memory lines into the same cache set
S. Ramaswamy, S. Yalamanchili. "Improving Cache Efficiency via Resizing + Remapping." ICCD 2007
Comparisons shown for a 256KB L2 cache
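One way to read "miss folding" is as a greedy merge of the coldest sets, sketched below. This is an assumed heuristic for illustration only, not the algorithm from the ICCD 2007 paper. The function name `fold_sets` and the per-set miss counts are hypothetical.

```python
def fold_sets(miss_counts, target_active):
    """Greedily fold the coldest sets together until `target_active` remain.

    miss_counts: observed misses per cache set (index = set number).
    Returns (remap, load): remap[s] is the set that lines indexing s now use,
    and load maps each remaining active set to the misses it absorbs.
    """
    n = len(miss_counts)
    if not 1 <= target_active <= n:
        raise ValueError("target_active must be between 1 and the number of sets")
    remap = list(range(n))
    load = dict(enumerate(miss_counts))
    while len(load) > target_active:
        # Fold the coldest active set into the next coldest one.
        src, dst = sorted(load, key=lambda s: (load[s], s))[:2]
        load[dst] += load.pop(src)
        remap = [dst if r == src else r for r in remap]
    return remap, load


# Made-up miss counts for an 8-set cache: odd sets are nearly idle.
misses = [50, 2, 30, 1, 40, 3, 20, 4]
remap, load = fold_sets(misses, target_active=4)
# remap == [0, 6, 2, 6, 4, 6, 6, 6]; sets 1, 3, 5, 7 can be turned off
```

Folding the cold sets together keeps the total miss count where it was while halving the number of sets that must stay powered, under the assumption that the folded lines do not start conflicting with each other.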
Tuning for Yield: Decreasing Defect Sensitivity*
Performance yield: yield at a given performance (e.g. AMAT) for 1000 units
Up to four times greater than modulo placement
Exploiting redundancies: application to power management
Recovering design inefficiencies
* S. Ramaswamy, S. Yalamanchili. "Customizable Fault Tolerant Caches for Embedded Processors." ICCD 2006
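Performance yield can be illustrated as the fraction of manufactured units that still meet a performance target. The sketch below uses the standard AMAT formula (hit time plus miss rate times miss penalty). The defect model, in which each defective set adds a fixed amount to the miss rate, and all numbers are assumptions for illustration and are not results from the talk.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in cycles."""
    return hit_time + miss_rate * miss_penalty


def performance_yield(unit_amats, amat_limit):
    """Fraction of units whose AMAT is at or below `amat_limit`."""
    units = list(unit_amats)
    if not units:
        raise ValueError("need at least one unit")
    return sum(a <= amat_limit for a in units) / len(units)


# Toy population of 1000 units: unit i has (i % 10) defective sets, and each
# defective set is ASSUMED to add 0.01 to a 0.02 base miss rate.
defects = [i % 10 for i in range(1000)]
amats = [amat(1, 0.02 + 0.01 * d, 100) for d in defects]   # roughly 3 + d cycles
yield_at_limit = performance_yield(amats, amat_limit=6.5)  # units with d <= 3 pass
```

A placement scheme that steers lines away from defective sets would lower the per-defect penalty in this model, which is how remapping could raise yield at a fixed AMAT target.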
Opportunities
Voltage scaling
  Combine voltage scaling and remapping for program-phase-dependent power management
Compiler-directed hardware optimizations
  For example, concurrent data layout + cache placement
Application to multi-threaded and multi-core domains
  Cache sharing across threads
  Challenge: coherency traffic
The On-Chip Network
The network is in the critical path (performance)
  Operand networks
  Cache hierarchy
  System on Chip
Increasing impact of wire (channel) delays
  Wire delays must be actively managed
On-demand resource management
  Initial studies: link tuning
  Reference: research at EPFL & Stanford on robust link design
A System for Tuning and Actively Reconfiguring SoC Links (STARS)
Variable delays and cascaded registers measure link delay
Digital PLL tunes the clock to match the link delay
[Figure: timing diagram over time showing where Latch 1, Latch 2, and Latch 3 sample the Value 1 to Value 2 transition in three cases: well tuned, too slow, too fast]
FPGA Tests
Low-speed tests to validate the control strategy
[Figure: control flow with monitoring and tuning phases]
Monitoring
Find end of link transition
Tuning
Find start of link transition
Determine slack in the link
Adjust clock frequency
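The measure-then-adjust idea behind these steps can be sketched in software. The code below is an assumed simplification, not the STARS control logic: it sweeps a delay setting until a latch captures the new value, treats that delay as the link delay, and then picks a clock period with a safety margin. The function names, the 8 ps sweep step, and the 50 ps margin are hypothetical. The 118 ps to 1.47 ns sweep range and the 240 ps to 2.97 ns clock-period range are the VDE and DCO figures quoted for the 180nm prototype.

```python
def measure_link_delay_ps(captures_new_value, delay_settings_ps):
    """Return the smallest swept delay at which the new value is captured.

    captures_new_value: callable(delay_ps) -> bool, standing in for a latch
    clocked `delay_ps` after the value is launched onto the link.
    """
    for delay in delay_settings_ps:
        if captures_new_value(delay):
            return delay
    return None  # transition never observed within the sweep range


def retune_period_ps(period_ps, link_delay_ps, margin_ps,
                     min_period_ps=240, max_period_ps=2970):
    """Return (new_period, slack) for a link with the measured delay."""
    slack_ps = period_ps - link_delay_ps   # positive: clock slower than needed
    target = link_delay_ps + margin_ps
    new_period = min(max(target, min_period_ps), max_period_ps)
    return new_period, slack_ps


# Simulated link whose transition completes 730 ps after launch (made-up value).
TRUE_DELAY_PS = 730
sweep = range(118, 1471, 8)   # assumed 8 ps sweep step
measured = measure_link_delay_ps(lambda d: d >= TRUE_DELAY_PS, sweep)
new_period, slack = retune_period_ps(1000, measured, margin_ps=50)
```

Here a 1000 ps clock leaves a few hundred picoseconds of slack on the simulated link, so the period can be shortened. A negative slack would mean the clock is too fast for the link and the period must grow.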
Prototyping: 180nm
Variable Delay Elements (VDE)
  Variable delay from 118ps to 1.47ns
  10 bits of resolution
  502 transistors
Digitally Controlled Oscillator (DCO)
  Clock period from 240ps to 2.97ns
  10 bits of resolution
  528 transistors
Digital Clock Divider (DCD)
  Min input clock period 480ps
  8 bits of resolution
  1127 transistors
Allows tuning links up to 2.083 GHz
  From reference clock of 8.13MHz
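The quoted figures can be cross-checked with a few lines of arithmetic. The relation between the reference clock and the 8-bit divider, and the assumption of uniformly spaced 10-bit steps, are my reading of the numbers rather than statements from the slides.

```python
def ghz_from_period_ps(period_ps):
    """Convert a clock period in picoseconds to a frequency in GHz."""
    return 1000.0 / period_ps


# A 480 ps minimum input period corresponds to the quoted 2.083 GHz limit.
max_link_ghz = ghz_from_period_ps(480)           # about 2.0833 GHz

# ASSUMPTION: the reference clock is the maximum frequency divided by 2**8,
# which matches the quoted 8.13 MHz if that figure is truncated.
reference_mhz = max_link_ghz * 1000 / 2 ** 8     # about 8.138 MHz

# ASSUMPTION: 10-bit controls with uniformly spaced steps across the range.
steps = 2 ** 10 - 1
vde_step_ps = (1470 - 118) / steps               # about 1.32 ps per step
dco_step_ps = (2970 - 240) / steps               # about 2.67 ps per step
```

Under these assumptions the delay and period controls resolve a few picoseconds per step, which is fine compared with the sub-nanosecond link delays being tuned.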
Extensions
Modulate link widths
Modulate buffer organizations
  Channels/depth
Feedback between local congestion detection and link and buffer resources
Summary
Application demands will be time varying
Technology will introduce time-varying hardware characteristics
Continuous cooperative HW/SW tuning provides a methodology for addressing these concerns
Need the support of abstractions for tuning
  Influence of prior applications to datapaths (Razor-UMich), communication systems (Vizor-GT), and reliable links (Stanford/EPFL)
  Build on existing research in cache performance & power management