on tuning microarchitecture for programs
Post on 30-Dec-2015
On Tuning Microarchitecture for Programs
Daniel Crowell, Wenbin Fang, and Evan Samanas
Outline
Goal: Adapt µArch to meet program’s performance/energy requirement during runtime
• Motivation
• A flexible framework for µArch adaptivity
• Feasibility study on different adaptive components
• Case study on adaptive cache (selective-way/set)
• Evaluation on adaptive cache
• Conclusion
Motivation
• Optimizing for all is optimizing for nothing
• Software is increasingly complex, and much of it is closed source
• S/W and H/W codesign is infeasible for legacy software
Three Questions for Microarchitecture Adaptivity
• When to adapt? => Policy
  – Interval? Context switch? Function boundary?
• What goal(s)? => Policy
  – Performance first? Performance-power ratio first?
• How to adapt? => Mechanism
  – What technique to use to allow reconfiguration during runtime?
Reference: Lee and Brooks [1], and Albonesi et al. [7]
Adaptivity Framework
Reference: Lee and Brooks [1] and Albonesi et al. [7]
Policy
• Instruction 1: adapt_advise
  – Inspired by “madvise” in OS system calls
  – When to adapt: when this instruction is executed
  – What goal: an operand (performance? energy? both?)
• Instruction 2: adapt_setup
  – Privileged, only used by the OS
  – Operand: whether user programs are allowed to use adapt_advise
Reference: Ipek [5], and Clark [6]
Adding new instructions to SimpleScalar: http://ce.et.tudelft.nl/~demid/SSIAT/
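The interplay of the two proposed instructions can be sketched as a toy software model. This is only an illustration under stated assumptions: the slides do not give operand encodings or the behavior of a disallowed adapt_advise, so the Goal values, the AdaptiveCore class, and the choice to silently ignore a disallowed hint are all hypothetical.

```python
from enum import Enum


class Goal(Enum):
    """Hypothetical encodings for the adapt_advise goal operand."""
    PERFORMANCE = 0
    ENERGY = 1
    BOTH = 2


class AdaptiveCore:
    """Toy model of a core exposing adapt_setup and adapt_advise."""

    def __init__(self):
        self.user_advise_allowed = False  # controlled by adapt_setup
        self.current_goal = None          # last goal accepted via adapt_advise

    def adapt_setup(self, allow_user_advise, privileged):
        """Privileged: the OS allows or forbids adapt_advise in user programs."""
        if not privileged:
            raise PermissionError("adapt_setup is privileged")
        self.user_advise_allowed = bool(allow_user_advise)

    def adapt_advise(self, goal, privileged=False):
        """Advisory: record the requested goal; return True if it was accepted."""
        if not privileged and not self.user_advise_allowed:
            return False  # assumption: a disallowed hint is ignored
        self.current_goal = Goal(goal)
        return True
```

In this model a user-mode adapt_advise has no effect until the OS has executed adapt_setup with the allow operand set, which mirrors the advisory nature of madvise.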
Policy
• Application boundary (OS)
• Time interval (OS)
• Context switching (OS)
• User program (Compiler / User program)
[3]
[1][2]
[4]
Feasibility study
• Back up motivation: What should be configured?
• Ideal configuration differs by workload
• L1 Data Cache, TLB, Branch Predictor
• SimpleScalar, Wattch
• 6 Programs from SPEC2000Int
Feasibility study (TLB cont.)
Feasibility study (TLB)
Feasibility Study (Branch Predictor)
Feasibility Study (Cache)
Feasibility Study (Cache)
What We Learned
• TLB
  – Variability with # entries
  – Fully-associative better
• Branch Predictor: Combined better
• Cache: Variability in both
  – Size Variability > Assoc. Variability
• Cache most interesting
  – Lots of Literature
Selective set (Yang et al. 2001)
• Adjust size (# of sets) of L1 cache
  – Double size
  – Shrink by half
• Goal: Decrease static power by reducing leakage
• Adjust by miss rate threshold
• Size-bound
• Focus on I-Cache
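The resize rule on this slide can be sketched as a single function. This is a minimal sketch, not the mechanism from Yang et al.: the function name, its parameters, and the choice to shrink whenever the miss rate is at or below the threshold are assumptions made for illustration, with min_sets playing the role of the size-bound.

```python
def next_num_sets(num_sets, miss_rate, threshold, min_sets, max_sets):
    """Return the number of sets to use in the next interval.

    Above the threshold the cache doubles (capped at max_sets);
    otherwise it shrinks by half (never below min_sets, the size-bound).
    """
    if miss_rate > threshold:
        return min(num_sets * 2, max_sets)
    return max(num_sets // 2, min_sets)
```

For example, with a 1% threshold, a 256-set cache seeing a 2% miss rate would grow to 512 sets, while one seeing 0.5% would shrink to 128 sets, provided both stay within the bounds.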
Selective way (Albonesi 1999)
• Disables “unneeded” cache ways
  – Reduces cache switching activity
• When to disable: Extend ISA?
• When to enable: Performance Degradation Threshold
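One way to read the Performance Degradation Threshold is as a bound on how much IPC may be lost relative to running with all ways enabled. The sketch below picks the fewest ways that stay within that bound. The function name and the IPC figures are hypothetical and are not taken from Albonesi's paper or from the measurements in this talk.

```python
def fewest_ways_within_pdt(ipc_by_ways, pdt):
    """Pick the fewest enabled ways whose IPC loss stays within pdt.

    ipc_by_ways maps a number of enabled ways to the IPC measured with it.
    pdt is the tolerated fractional loss relative to all ways enabled.
    """
    full_ways = max(ipc_by_ways)
    baseline = ipc_by_ways[full_ways]
    for ways in sorted(ipc_by_ways):
        degradation = (baseline - ipc_by_ways[ways]) / baseline
        if degradation <= pdt:
            return ways
    return full_ways


# Hypothetical IPC measurements for an 8-way cache.
EXAMPLE_IPC = {1: 1.00, 2: 1.15, 4: 1.19, 8: 1.20}
```

With these made-up numbers, a 2% threshold selects 4 ways and a 5% threshold selects 2 ways; a threshold of zero keeps all 8 ways enabled.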
Evaluation
• Simulator
  – SimpleScalar 3.0
  – Wattch
• Workload
  – 6 programs from SPEC 2000
• Case study: Adaptive Cache
SimpleScalar changes
Two methods used:
• SimpleScalar implementation of Selective Sets
  – Used timer with miss counter to determine sets to disable
  – Power down portions of cache and selectively flush dirty data
• Scripting-based method
  – Can use this same design for both selective sets and selective ways
  – Completely replaces cache when resized, flushes all values at each interval
Application-boundary policy
• Instructions Per Cycle vs Energy Delay
  – IPC: considers only performance (higher better)
  – Energy Delay: considers both performance and power (lower better)
• Smaller cache size
  – Energy delay decreases at first, but rises later
  – Want to choose point where it is smallest
[Figure: six panels, each titled "Selective-set Cache"]
Configuration set at start of program, then remains unchanged
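The two metrics can be compared with a short calculation. The sketch below assumes energy delay means the energy-delay product (energy multiplied by execution time, here measured in cycles). The function names and the run data are hypothetical and are not results from this study.

```python
def summarize(instructions, cycles, energy_joules):
    """Return (IPC, energy-delay product) for one simulated run."""
    ipc = instructions / cycles
    energy_delay = energy_joules * cycles  # delay measured in cycles
    return ipc, energy_delay


def best_by_energy_delay(runs):
    """Return the configuration name with the smallest energy-delay product."""
    return min(runs, key=lambda name: summarize(*runs[name])[1])


def best_by_ipc(runs):
    """Return the configuration name with the highest IPC."""
    return max(runs, key=lambda name: summarize(*runs[name])[0])


# Hypothetical runs: name -> (instructions, cycles, energy in joules).
HYPOTHETICAL_RUNS = {
    "64KB": (1_000_000, 800_000, 2.0),
    "16KB": (1_000_000, 850_000, 1.5),
    "4KB": (1_000_000, 1_200_000, 1.3),
}
```

With these made-up numbers the largest cache wins on IPC, but the middle size has the smallest energy-delay product: shrinking further saves a little more energy at the cost of many more cycles, so the product rises again.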
Application-boundary policy
Selective-way Cache
• Similar tradeoffs in IPC and power to selective set
• Fewer choices
  – SimpleScalar limits associativity to a power of two
  – Unlike the number of sets, a power-of-two limit is not normally necessary for associativity
Time-interval policy
• Reconfigurations occur every fixed number of CPU cycles
• Why?
  – Good if program behavior not known before execution
  – Program may require fewer/more cached data later in execution
• For our cache study: relies on % cache misses to determine reconfiguration
• Performance hit to changing too frequently
  – May oscillate between two roughly equivalent states
  – Reconfiguration requires temporarily halting, possibly flushing values from cache
Time-interval policy
Selective-set Cache
• What is the minimum allowed cache miss rate? (1%, 2%, 3%, 4%? – policy choice)
• Notice positive energy delay on right graph (not good!)
  – Never resizes down, since miss rate always higher than 1%
  – So all adaptivity adds is overhead under those circumstances
[Figure: two graphs, each with axis label "Cache miss rate"]
Time-interval policy
Selective-way Cache
[Figure: two graphs, each with axis label "Cache miss rate"]
• Again, similar to selective sets
• Differences dependent upon program being executed
Cache miss rate
Program   8-way   4-way   2-way   64 KB   16 KB   4 KB
Gcc       0.71%   1.11%   1.19%   1.09%   2.64%   5.42%
Gzip      1.16%   1.68%   2.41%   1.16%   2.93%   4.02%
Mcf       0.11%   0.11%   0.13%   0.11%   0.14%   0.15%
Perlbmk   0.43%   0.51%   0.78%   0.51%   0.70%   5.87%
Vortex    0.19%   0.45%   1.34%   0.35%   1.90%   5.7%
Vpr       3.91%   4.58%   5.53%   4.55%   5.74%   8.54%
Decreasing the number of ways or sets almost always increases the miss rate
Problem mentioned earlier: Gzip and Vpr are always above 1%, which does not work well with a < 1% dynamic reconfiguration threshold
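The observation about Gzip and Vpr can be checked directly against the table. The sketch below copies the miss rates from the table above and lists the programs whose miss rate exceeds a given threshold in every configuration. The function name is hypothetical, and the check is only a reading of the table, not the simulator's reconfiguration logic.

```python
# Miss rates (%) from the table above, in column order:
# 8-way, 4-way, 2-way, 64 KB, 16 KB, 4 KB.
MISS_RATES = {
    "Gcc": [0.71, 1.11, 1.19, 1.09, 2.64, 5.42],
    "Gzip": [1.16, 1.68, 2.41, 1.16, 2.93, 4.02],
    "Mcf": [0.11, 0.11, 0.13, 0.11, 0.14, 0.15],
    "Perlbmk": [0.43, 0.51, 0.78, 0.51, 0.70, 5.87],
    "Vortex": [0.19, 0.45, 1.34, 0.35, 1.90, 5.7],
    "Vpr": [3.91, 4.58, 5.53, 4.55, 5.74, 8.54],
}


def always_above(threshold_pct):
    """Programs whose miss rate exceeds the threshold in every configuration."""
    return [prog for prog, rates in MISS_RATES.items() if min(rates) > threshold_pct]
```

Calling always_above(1.0) returns Gzip and Vpr, matching the note above. Raising the threshold to 2% or 3% leaves only Vpr, and at 4% no program is always above the threshold.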
Conclusion
• Adaptivity is useful
  – Tune for different program requirements
  – Save power
• A flexible adaptivity framework
  – Mechanism
  – Policy
• Cache just one of many areas where this is useful
Reference
[1] B. C. Lee and D. Brooks. Efficiency trends and limits from comprehensive microarchitectural adaptivity. In ASPLOS, 2008.
[2] S.-H. Yang, M. D. Powell, B. Falsafi, K. Roy, and T. Vijaykumar. An integrated circuit/architecture approach to reducing leakage in deep-submicron high-performance i-caches. In HPCA, 2001.
[3] D. H. Albonesi. Selective cache ways: On-demand cache resource allocation. In JILP, 2000.
[4] M. C. Huang, J. Renau, and J. Torrellas. Positional adaptation of processors: application to energy reduction. In ISCA, 2003.
[5] E. Ipek, M. Kirman, N. Kirman, and J. F. Martinez. Core fusion: accommodating software diversity in chip multiprocessors. In ISCA, 2007.
[6] M. Clark and L. K. John. Performance evaluation of configurable hardware features on the AMD-K5. In ICCD, 1999.
[7] D. H. Albonesi, R. Balasubramonian, S. G. Dropsho, S. Dwarkadas, E. G. Friedman, M. C. Huang, V. Kursun, G. Magklis, M. L. Scott, G. Semeraro, P. Bose, A. Buyuktosunoglu, P. W. Cook, and S. E. Schuster. Dynamically tuning processor resources with adaptive processing. In Computer, 2003.
Questions?