On Tuning Microarchitecture for Programs

Daniel Crowell, Wenbin Fang, and Evan Samanas

Outline

Goal: Adapt µArch to meet program’s performance/energy requirement during runtime

• Motivation
• A flexible framework for µArch adaptivity
• Feasibility study on different adaptive components
• Case study on adaptive cache (selective-way/set)
• Evaluation on adaptive cache
• Conclusion

Motivation

• Optimizing for all is optimizing for nothing
• Software is increasingly complex, and much of it is closed source
• S/W and H/W codesign is infeasible for legacy software

Three Questions for Microarchitecture Adaptivity

• When to adapt? => Policy
– Interval? Context switch? Function boundary?
• What goal(s)? => Policy
– Performance first? Performance-power ratio first?
• How to adapt? => Mechanism
– What technique to use to allow reconfiguration during runtime?

Reference: Lee and Brooks [1], and Albonesi et al. [7]

Adaptivity Framework

Reference: Lee and Brooks [1] and Albonesi et al. [7]

Policy

• Instruction 1: adapt_advise
– Inspired by “madvise” in OS system calls
– When to adapt: when this instruction is executed
– What goal: an operand (performance? energy? both?)
• Instruction 2: adapt_setup
– Privileged, only used by the OS
– Operand: whether user programs are allowed to use adapt_advise

Reference: Ipek [5], and Clark [6]

Adding new instructions to SimpleScalar: http://ce.et.tudelft.nl/~demid/SSIAT/
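The two instructions above can be pictured with a small software model. This is only a sketch under stated assumptions: the class AdaptiveCore, the Goal enum, and the choice to silently ignore a disallowed adapt_advise (by analogy with the advisory nature of madvise) are all hypothetical and are not specified on the slides.

```python
from enum import Enum


class Goal(Enum):
    """Possible operands of adapt_advise (names are assumptions)."""
    PERFORMANCE = 0
    ENERGY = 1
    BOTH = 2


class AdaptiveCore:
    """Toy model of a core that exposes the two proposed instructions."""

    def __init__(self):
        self.user_advise_allowed = False  # permission bit written by adapt_setup
        self.current_goal = None          # last goal accepted through adapt_advise

    def adapt_setup(self, allow_user, privileged):
        # Privileged instruction: only the OS may change the permission bit.
        if not privileged:
            raise PermissionError("adapt_setup is privileged")
        self.user_advise_allowed = bool(allow_user)

    def adapt_advise(self, goal, privileged=False):
        # Assumption: a disallowed hint is ignored rather than trapping.
        if not privileged and not self.user_advise_allowed:
            return False
        self.current_goal = goal
        return True
```

In this model a user program's adapt_advise has no effect until the OS has executed adapt_setup with the permission operand set.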

Policy

Application boundary (OS)

Time interval (OS)

Context switching (OS)

User program (Compiler / User program)

[1][2]

Feasibility study

• Back up motivation: What should be configured?

• Ideal configuration differs by workload
• L1 Data Cache, TLB, Branch Predictor
• SimpleScalar, Wattch
• 6 Programs from SPEC2000Int

Feasibility study (TLB)

Feasibility study (TLB cont.)

Feasibility Study (Branch Predictor)

Feasibility Study (Cache)

What We Learned

• TLB
– Variability with # entries
– Fully-associative better
• Branch Predictor: Combined better
• Cache: Variability in both size and associativity
– Size Variability > Assoc. Variability
• Cache most interesting
– Lots of Literature

Selective set (Yang et al. 2001)

• Adjust size (# of sets) of L1 cache
– Double size
– Shrink by half
• Goal: Decrease static power by reducing leakage
• Adjust by miss rate threshold
• Size-bound
• Focus on I-Cache
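The resizing rule on this slide can be sketched as a single decision function. This is a minimal sketch, assuming the cache grows when the observed miss rate exceeds the threshold and shrinks otherwise, and that the size-bound acts as a lower limit on the number of sets; the function and parameter names are hypothetical.

```python
def next_num_sets(num_sets, miss_rate, threshold, min_sets, max_sets):
    """Return the number of sets to use in the next interval.

    Assumed rule: double the cache when the miss rate is above the
    threshold, otherwise halve it. min_sets plays the role of the
    size-bound; max_sets is the full physical cache.
    """
    if miss_rate > threshold:
        candidate = num_sets * 2   # too many misses: double size
    else:
        candidate = num_sets // 2  # few misses: shrink by half to cut leakage
    # Clamp to the allowed range so the cache never leaves its bounds.
    return max(min_sets, min(max_sets, candidate))
```

For example, with a 1% threshold, a 256-set cache observing a 2% miss rate would move to 512 sets, while one observing 0.5% would move to 128 sets.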

Selective way (Albonesi 1999)

• Disables “unneeded” cache ways
– Reduces cache switching activity
• When to disable: Extend ISA?
• When to enable: Performance Degradation Threshold

Evaluation

• Simulator
– SimpleScalar 3.0
– Wattch
• Workload
– 6 programs from SPEC 2000
• Case study: Adaptive Cache

SimpleScalar changes

Two methods used:
• SimpleScalar implementation of Selective Sets
– Used timer with miss counter to determine sets to disable
– Power down portions of cache and selectively flush dirty blocks
• Scripting-based method
– Can use this same design for both selective sets and selective ways
– Completely replaces cache when resized, flushes all values at each interval

Application-boundary policy

• Instructions Per Cycle vs Energy Delay
– IPC: considers only performance (higher better)
– Energy Delay: considers both performance and power (lower better)
• Smaller cache size
– Energy delay decreases at first, but rises later
– Want to choose point where it is smallest
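Choosing the point where energy delay is smallest can be sketched as follows, assuming the metric is the energy-delay product (energy multiplied by execution time). The numbers in samples are hypothetical values chosen only to show the "decreases at first, rises later" shape; they are not results from the study.

```python
def energy_delay(energy_joules, delay_seconds):
    """Energy-delay product: lower is better."""
    return energy_joules * delay_seconds


def best_config(measurements):
    """Return the configuration name with the smallest energy-delay product.

    measurements maps a configuration name to an (energy, delay) pair.
    """
    return min(measurements, key=lambda name: energy_delay(*measurements[name]))


# Hypothetical measurements: shrinking the cache saves energy at first,
# but the extra misses eventually lengthen execution enough to cancel the gain.
samples = {
    "64KB": (10.0, 1.00),
    "16KB": (7.0, 1.10),
    "4KB": (6.0, 1.60),
}
```

With these made-up numbers the middle configuration wins: the smallest cache uses the least energy but runs long enough that its product is worse.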

Selective-set Cache

Configuration set at start of program, then remains unchanged

Application-boundary policy

Selective-way Cache

• Similar tradeoffs in IPC and Power to Selective Set
• Fewer choices: SimpleScalar limits associativity to powers of two
– Unlike the number of sets, associativity does not normally need to be a power of two

Time-interval policy

• Reconfigurations occur every so many CPU cycles
• Why?
– Good if program behavior is not known before execution
– Program may require fewer/more cached data later in execution
• For our cache study: relies on % cache misses to determine reconfiguration
• Performance hit from reconfiguring too frequently
– May oscillate between two roughly equivalent states
– Reconfiguration requires temporarily halting, possibly flushing values from cache
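The oscillation problem can be illustrated with a toy interval loop. This sketch assumes a single miss-rate threshold with a double-or-halve rule for the number of sets; the profile table of miss rates per cache size is hypothetical and stands in for a program whose miss rate sits just above the threshold at one size and just below it at the next.

```python
def run_intervals(miss_rate_for, start_sets, threshold, min_sets, max_sets, intervals):
    """Simulate a time-interval policy and return the set count after each interval."""
    sets = start_sets
    history = [sets]
    for _ in range(intervals):
        miss_rate = miss_rate_for[sets]
        # Assumed rule: grow above the threshold, shrink at or below it.
        target = sets * 2 if miss_rate > threshold else sets // 2
        sets = max(min_sets, min(max_sets, target))
        history.append(sets)
    return history


def count_reconfigurations(history):
    """Count intervals whose decision changed the cache size."""
    return sum(1 for before, after in zip(history, history[1:]) if before != after)


# Hypothetical miss rate for each number of sets.
profile = {64: 0.030, 128: 0.015, 256: 0.008, 512: 0.007}
history = run_intervals(profile, 512, 0.01, 64, 512, 6)
```

Here the controller settles into flipping between 128 and 256 sets, so every interval pays the halt-and-flush cost without reaching a stable configuration.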

Selective-set Cache

• What is the minimum allowed cache miss rate? (1%, 2%, 3%, 4%? Policy choice)
• Notice positive energy delay on right graph (not good!)
– Never resizes down, since miss rate is always higher than 1%
– So all adaptivity adds is overhead under those circumstances

[Two plots; axis label: Cache miss rate]

Selective-way Cache

[Two plots; axis label: Cache miss rate]

• Again, similar to selective sets
• Differences dependent upon program being executed

Cache miss rate

Program   8-way   4-way   2-way   64 KB   16 KB   4 KB
Gcc       0.71%   1.11%   1.19%   1.09%   2.64%   5.42%
Gzip      1.16%   1.68%   2.41%   1.16%   2.93%   4.02%
Mcf       0.11%   0.11%   0.13%   0.11%   0.14%   0.15%
Perlbmk   0.43%   0.51%   0.78%   0.51%   0.70%   5.87%
Vortex    0.19%   0.45%   1.34%   0.35%   1.90%   5.7%
Vpr       3.91%   4.58%   5.53%   4.55%   5.74%   8.54%

Decreasing number of ways or sets almost always increases miss rate

Problem mentioned earlier: the miss rates of Gzip and Vpr are always higher than 1%, so they do not work well with a < 1% dynamic reconfiguration threshold
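The observation can be checked directly against the miss rates reported in the table. The sketch below copies those values; the helper name never_below is hypothetical.

```python
# Miss rates in percent, in the column order
# 8-way, 4-way, 2-way, 64 KB, 16 KB, 4 KB.
MISS_RATE_PERCENT = {
    "Gcc": [0.71, 1.11, 1.19, 1.09, 2.64, 5.42],
    "Gzip": [1.16, 1.68, 2.41, 1.16, 2.93, 4.02],
    "Mcf": [0.11, 0.11, 0.13, 0.11, 0.14, 0.15],
    "Perlbmk": [0.43, 0.51, 0.78, 0.51, 0.70, 5.87],
    "Vortex": [0.19, 0.45, 1.34, 0.35, 1.90, 5.7],
    "Vpr": [3.91, 4.58, 5.53, 4.55, 5.74, 8.54],
}


def never_below(threshold_percent):
    """List programs whose miss rate is at or above the threshold in every configuration."""
    return [
        program
        for program, rates in MISS_RATE_PERCENT.items()
        if min(rates) >= threshold_percent
    ]
```

Calling never_below(1.0) returns Gzip and Vpr, the two programs for which a 1% threshold can never trigger a downsize.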

Conclusion

• Adaptivity is useful
– Tune for different program requirements
– Save power
• A flexible adaptivity framework
– Mechanism
– Policy
• Cache just one of many areas where this is useful

Reference

[1] B. C. Lee and D. Brooks. Efficiency trends and limits from comprehensive microarchitectural adaptivity. In ASPLOS, 2008.
[2] S.-H. Yang, M. D. Powell, B. Falsafi, K. Roy, and T. Vijaykumar. An integrated circuit/architecture approach to reducing leakage in deep-submicron high-performance I-caches. In HPCA, 2001.
[3] D. H. Albonesi. Selective cache ways: On-demand cache resource allocation. In JILP, 2000.
[4] M. C. Huang, J. Renau, and J. Torrellas. Positional adaptation of processors: application to energy reduction. In ISCA, 2003.
[5] E. Ipek, M. Kirman, N. Kirman, and J. F. Martinez. Core fusion: accommodating software diversity in chip multiprocessors. In ISCA, 2007.
[6] M. Clark and L. K. John. Performance evaluation of configurable hardware features on the AMD-K5. In ICCD, 1999.
[7] D. H. Albonesi, R. Balasubramonian, S. G. Dropsho, S. Dwarkadas, E. G. Friedman, M. C. Huang, V. Kursun, G. Magklis, M. L. Scott, G. Semeraro, P. Bose, A. Buyuktosunoglu, P. W. Cook, and S. E. Schuster. Dynamically tuning processor resources with adaptive processing. In Computer, 2003.

Question?
