MotivationComposite Cores: HeterogeneityMorphCore: Exploit ILP and TLP
Summary
Energy-Efficient, High-Performance Heterogeneous Core Design
Raj Parihar
Core Design Session, MICRO - 2012
Advanced Computer Architecture Lab, UofR, Rochester
April 18, 2013
Raj Parihar Energy-Efficient, High-Performance Heterogeneous Core Design
References
Composite Cores: Pushing Heterogeneity into a Core
A. Lukefahr, S. Padmanabha, R. Das, F. M. Sleiman, R. Dreslinski, T. F. Wenisch, and S. Mahlke
University of Michigan, Ann Arbor
MorphCore: An Energy-Efficient Microarchitecture for High Performance ILP and High Throughput TLP
Khubaib, M. A. Suleman, M. Hashemi, C. Wilkerson, Y. N. Patt
UT Austin, HPS Lab, Intel Labs - Hillsboro
Motivation
Workloads and applications exhibit different phases
Some phases are constrained by a fundamental ILP limit
In an inherently low-ILP phase, a simple in-order core can be used instead of an out-of-order core
The in-order core saves energy without degrading overall performance
Phases also have varying degrees of exploitable ILP and TLP
An out-of-order engine is more efficient in high-ILP phases
A highly threaded in-order SMT core is more beneficial in high-TLP phases
The overall idea is to identify the phase behavior and change the architecture on the fly to suit the need
Outline
Motivation
Composite Cores: Heterogeneity
MorphCore: Exploit ILP and TLP
Summary
Composite Cores: Heterogeneity within a Single Core
Heterogeneous multicore systems, capable of achieving either high performance or energy efficiency, are quite prominent
They often migrate applications/phases to the specific core that favors them
Issues with conventional heterogeneous systems:
Slow migrations, which require large phases (100s of millions of instructions)
Switching is often coarse-grained, so fine-grained opportunities are lost
Switching and migration have significant performance overhead
Proposed solution: a single-core microarchitecture that integrates big and little compute µEngines
An online controller can map 25% of the code to the little µEngine
Achieves 18% energy savings at a performance loss of ≤ 5%
Conventional Heterogeneous CMP: ARM’s big.LITTLE
Incorporates two different kinds of cores on the same chip
big: Cortex-A15 (3-way OoO), deeply pipelined (15-25 stages)
LITTLE: Cortex-A7 (2-way in-order), short pipeline (8-10 stages)
How do these fare against each other?
Performance: Cortex-A15 is 2-3x faster than Cortex-A7
Energy: Cortex-A7 is 3-4x more energy-efficient than Cortex-A15
The two kinds of cores are utilized, through migration, when an appropriate phase arrives
Migration happens through coherent L2 caches and costs about 20 µs
Requires large phases to amortize the cost of slow migration
Composite Cores: modify a single core to suit both needs
Fine-Grain Switching Interval
A conventional heterogeneous CMP requires large phases, typically a few million instructions, to amortize the cost of switching
The migration overhead precludes fine-grained switching in traditional heterogeneous core designs
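As a rough back-of-the-envelope check of why phases must be large, the sketch below converts the roughly 20 µs migration cost quoted for big.LITTLE into a minimum phase length. The 1 GHz clock, the IPC of 1, and the 1% overhead budget are illustrative assumptions, not figures from the talk.

```python
def min_phase_instructions(migration_s, clock_hz, ipc, overhead_budget):
    """Smallest phase length (in instructions) for which the migration
    cost is at most `overhead_budget` times the phase's execution time."""
    migration_cycles = migration_s * clock_hz
    phase_cycles = migration_cycles / overhead_budget
    return phase_cycles * ipc

MIGRATION_S = 20e-6      # ~20 µs migration cost, as quoted for big.LITTLE
CLOCK_HZ = 1e9           # assumption: 1 GHz core clock
IPC = 1.0                # assumption: one instruction per cycle
OVERHEAD_BUDGET = 0.01   # assumption: migration may cost at most 1% of a phase

phase = min_phase_instructions(MIGRATION_S, CLOCK_HZ, IPC, OVERHEAD_BUDGET)
print(f"minimum phase length: {phase:,.0f} instructions")
```

Under these assumptions the phase has to run for about two million instructions, which is in line with the "few million instructions" figure above. A tighter overhead budget or a faster clock pushes the requirement higher.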
Composite Cores: Architecture
Each core consists of two tightly coupled compute µEngines
Achieves high performance and energy efficiency by switching between the µEngines in response to changes in application performance
Shared: front-end, branch predictor, data and instruction caches
Extra component: a reactive online controller to perform switching
Switching requires only a register file transfer and some stalling
Reactive Online Controller
The online controller tries to maximize energy savings subject to a configurable maximum performance degradation, or slowdown
Estimates dynamic performance loss using a linear model
Switching happens when the loss is more than the acceptable threshold
The performance estimator is the most crucial, complex, and tricky component, and involves many approximations
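The control policy on this slide can be sketched as a per-quantum decision. This is a minimal illustration, not the authors' controller: the function name, the use of CPI as the performance metric, and the example numbers are all assumptions.

```python
def choose_engine(est_big_cpi, est_little_cpi, max_slowdown):
    """Pick the µEngine for the next quantum.

    Runs on the little µEngine (to save energy) only while the estimated
    slowdown relative to the big µEngine stays within `max_slowdown`;
    otherwise falls back to the big µEngine.
    """
    est_slowdown = est_little_cpi / est_big_cpi - 1.0
    return "little" if est_slowdown <= max_slowdown else "big"

# Hypothetical per-quantum estimates: (big CPI, little CPI)
quanta = [(1.00, 1.03), (0.60, 1.10), (2.50, 2.55)]
schedule = [choose_engine(big, little, max_slowdown=0.05) for big, little in quanta]
print(schedule)
```

This simplified version treats each quantum independently. A real controller would more plausibly track the slowdown accumulated over the whole run, so that a costly quantum can be offset by cheaper ones.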
Performance Estimator
The goal of this module is to provide an estimate of the performance of both µEngines in the previous quantum and overall
Performance estimation of the non-active µEngine is challenging
Uses a linear performance estimation model: $y = a_0 + \sum_i a_i x_i$
Various stats are collected: L2 misses, ILP, L2 hits, MLP, etc.
Ridge regression analysis is used to determine the coefficients
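To make the fitting step concrete, here is a small pure-Python sketch of ridge regression for a linear model of the form y = a0 + Σ ai·xi. It is not the authors' implementation: the closed-form solve, the unpenalized intercept, the two made-up input statistics, and the sample values are all assumptions chosen for illustration.

```python
def ridge_fit(rows, targets, lam):
    """Fit y = a0 + sum_i a_i * x_i by ridge regression.

    Solves (X^T X + lam * I') a = X^T y, where X has a leading column of
    ones and I' leaves the intercept a0 unpenalized.
    """
    X = [[1.0] + list(r) for r in rows]
    n = len(X[0])
    A = [[sum(x[i] * x[j] for x in X) for j in range(n)] for i in range(n)]
    b = [sum(x[i] * y for x, y in zip(X, targets)) for i in range(n)]
    for i in range(1, n):
        A[i][i] += lam  # penalize every coefficient except the intercept
    # Gaussian elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution
    coef = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(A[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (b[i] - tail) / A[i][i]
    return coef

def predict(coef, row):
    """Evaluate the linear model a0 + sum_i a_i * x_i."""
    return coef[0] + sum(a * x for a, x in zip(coef[1:], row))

# Made-up training data: two hypothetical statistics per quantum and an
# observed performance value generated from y = 0.5 + 2*x1 + 3*x2.
samples = [(1, 0), (0, 1), (1, 1), (2, 1), (0, 0)]
observed = [2.5, 3.5, 5.5, 7.5, 0.5]

plain = ridge_fit(samples, observed, lam=0.0)   # ordinary least squares
shrunk = ridge_fit(samples, observed, lam=1.0)  # coefficients pulled toward 0
print(plain, shrunk)
```

With `lam=0` the fit reduces to ordinary least squares. A positive `lam` shrinks the coefficients, which trades a little bias for more stable estimates when the collected statistics are correlated.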
Overall Energy Savings
Implementable regression model saves about 18% energy
Reduction in energy-delay-product is 21%
Switching Impact on Performance
Subject to a 5% slowdown, the acceptable margin in performance
mcf is memory bound; the decrease in branch misprediction latency actually causes a small performance improvement
Little Core Utilization
On average, about 25% of the code can be mapped to the little core
Given oracle knowledge, about 37% of the code can be mapped
Applications like mcf can be completely mapped to the little core
Average Little Core Power
The little µEngine consumes slightly more power than a standalone little core because of the over-provisioned shared resources
Performance Energy Sensitivity
Allowing only a 1% slowdown saves up to 4% of the energy
A 20% performance drop can save up to 44% of the energy
A good feature to have where maintaining usability is essential
Low battery levels in laptops and cell phones
MorphCore: Motivation
In general, industry builds two types of cores:
Large out-of-order cores: Intel’s Sandybridge, IBM’s Power 7
Small cores: Intel’s Larrabee, Sun’s Niagara, ARM’s A15
OoO cores provide high single-thread performance by exploiting ILP but are power inefficient for multi-threaded programs
MorphCore is built on two key insights:
A highly threaded in-order SMT core can achieve instruction issue throughput similar to that of an OoO core (Hily, Seznec)
An in-order SMT core can be built using a subset of the OoO hardware
MorphCore: start with a traditional OoO core and make minimal changes to transform it into a highly threaded in-order SMT core
In-order SMT vs Out-of-order Superscalar
Hily & Seznec: Highly-threaded in-order core can achieve similar
throughput to an OoO core on multi-threaded apps (HPCA’99)
In high TLP applications, high-performance and low energy
consumption can be achieved with in-order SMT execution
MorphCore Microarchitecture
Two modes of execution: OutOfOrder and InOrder
Based on a traditional OoO core, and also supports additional in-order SMT threads and in-order scheduling, execution, and commit of simultaneously running threads
Details of Microarchitecture
Fetch: using hardware muxes, 2 front-ends can be configured
InOrder SMT mode: 8 threads; OutOfOrder mode: 2 threads
Real Details: Too Specific
Hw mux, reconfigurable logic to “transform” OoO to in-order SMT
Modified rename stage: details are too involved!
Wakeup and Selection Logic
After all these modifications, the authors claim that only 2.5% of extra delay is added to the critical path, i.e. the clock frequency is about 2.5% lower
MorphCore Mode Switching
No switching overhead on the OS: the hardware does it itself
Not mentioned clearly (most of it is future work!)
The general idea is that when the OS schedules more threads, the program is in a parallel region, so in-order SMT is enabled (threshold: > 2 threads)
When the number of active threads is ≤ 2, the OoO engine is enabled
Assumes the thread library uses MONITOR/MWAIT instructions so that the MorphCore hardware can detect a thread becoming inactive
Claims that since no migration of instructions and data needs to happen on mode switches, the penalty is minimal:
Pipeline flushing and stalling
Register and mux reconfiguration
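The switching rule on this slide can be sketched as a function of the active-thread count. The function and constant names below are illustrative assumptions. The 2-thread threshold and the 8-thread InOrder limit come from the slides.

```python
OOO_MAX_THREADS = 2      # OutOfOrder mode handles up to 2 threads
INORDER_MAX_THREADS = 8  # InOrder SMT mode handles up to 8 threads

def select_mode(active_threads):
    """Return the MorphCore mode for a given number of active threads."""
    if active_threads < 0 or active_threads > INORDER_MAX_THREADS:
        raise ValueError("thread count outside the supported range")
    return "InOrder" if active_threads > OOO_MAX_THREADS else "OutOfOrder"

def mode_trace(thread_counts):
    """Yield (count, mode, switched) for a sequence of active-thread counts.

    `switched` marks the points where the mode changes, i.e. where the
    hardware would have to flush the pipeline and reconfigure.
    """
    previous = None
    for count in thread_counts:
        mode = select_mode(count)
        yield count, mode, previous is not None and mode != previous
        previous = mode

# Hypothetical run: serial phase, parallel region, then threads going idle
trace = list(mode_trace([1, 2, 6, 8, 3, 2, 1]))
for count, mode, switched in trace:
    print(count, mode, "switch" if switched else "")
```

In this hypothetical run the mode changes only twice, on entering and on leaving the parallel region, so the flush and reconfiguration cost is paid at region boundaries and not on every change in thread count.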
Performance Results
ST apps: MorphCore achieves performance very close to the OoO 2-way SMT core
MT apps: achieves performance close to the 6-thread in-order SMT core (SMALL)
Overall Speedup, Power and Energy
Performance and energy combined
MorphCore does better than all the other alternatives
Comparison with CoreFusion
Opposite approach: Instead of building a larger core from small
cores (CoreFusion), MorphCore tries to scale down the OoO
design to implement simple in-order SMT core
Other Metrics compared to CoreFusion
Reduces power by 19%, energy by 29%, and energy-delay-squared product by 29%
Summary
Both ideas are quite similar to each other
Both proposals bring the notion of heterogeneity within a core
Both designs try to leverage fine-grain phases at runtime
They also try to reuse (share) as much hardware as possible
Both designs also try to minimize the migration overhead
Both designs require significant modifications to the core microarchitecture
The savings/benefits are only a few percent
Complexity is quite high for these new core designs