
Reliability Implications of Power/Thermal Constrained Operation in Asymmetric Multicore Processors

William J. Song, Saibal Mukhopadhyay, and Sudhakar Yalamanchili

School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA

[email protected], {saibal, sudha}@ece.gatech.edu

Abstract

The emergence of the dark silicon era raises new issues in balancing performance, utilization, and reliability. Power and thermal constraints preclude core scaling according to the business-as-usual progression of Moore’s Law [1]. Such constraints invoke the utilization wall [2], where the number of active cores is limited and the silicon resources on the die are therefore underutilized, operating below their full switching capacity. This underutilized area is known as dark silicon [1, 2, 3, 4]. The consequence is a drag on performance growth for future processors [1, 2, 3].

A range of techniques has emerged to address the issue of dark silicon, including i) the use of heterogeneous and/or asymmetric architectures, often including specialized cores that deliver optimized energy/area for specific functions [2, 3], ii) dynamic voltage-frequency scaling (DVFS) [4], and iii) systematic techniques for power gating such as turbo-boost or computational sprinting [5]. While the focus of these and similar efforts has been on managing energy/power/performance tradeoffs, little attention has been paid to the impact of these management techniques on processor reliability. The application of DVFS and power gating techniques has a complicated impact on device degradation, and hence on core and processor degradation. For example, devices that are power-gated off experience some degree of regeneration, enabling limited recovery from thermal and electrical stresses (electromigration, negative bias temperature instability, stress migration, time-dependent dielectric breakdown, thermal cycling [7], etc.).

In this presentation, we focus on asymmetric multicore processors (AMPs) and the reliability impact of three management techniques: i) computational sprinting, ii) DVFS, and iii) low voltage operation. Computational sprinting [5] (alternatively known as race-to-idle or turbo-boost) accelerates the execution of cores by increasing voltage and frequency levels. It is followed by an idle period in which the cores are turned off. This technique was initially motivated by the leakage power savings in the idle period. Here, however, the power savings from idle cores are used to boost the execution of the active cores. Sprinting stresses different cores (i.e., out-of-order vs. in-order) in different ways, as do different workloads. The result is that cores degrade at different rates, which can lead to an overall reduction in lifetime reliability. Similarly, the use of DVFS or sustained low voltage operation has a different impact on the degradation (and regenerative ability) of devices and cores. Consequently, management techniques such as computational sprinting are not just techniques for extracting performance under thermal and power constraints. Rather, they also represent a choice of tradeoffs between performance and lifetime reliability. We argue that time-multiplexed operation of cores (e.g., power gating) must be orchestrated with its reliability impact in mind.

We present preliminary simulation results for a 64-core asymmetric processor and evaluate the lifetime reliability implications of power/thermal management techniques. The processor comprises 48 simple in-order execution cores and 16 complex out-of-order execution cores. Models at the 16nm technology node are used to generate power traces from a cycle-level x86 simulator. Cores execute in multi-programmed mode, where each core executes a distinct set of benchmarks. We utilize a modeling methodology for the integrated and coordinated modeling of power, energy, temperature, and reliability in multicore processors [6, 7, 8]. Figure 1 shows an example of the degradation distribution, represented as failure probability, for the 64-core model executing a mix of SPEC 2006 benchmarks. We apply this methodology to understand the relative degradation behaviors of i) computational sprinting, ii) DVFS, and iii) continuous low voltage operation.
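As a rough illustration of how per-core failure probabilities like those in the degradation distribution might be combined into a chip-level figure, the sketch below uses a simple series-system model: the chip is counted as failed once any core has failed, and core failures are treated as independent. Both assumptions, the function name `chip_failure_probability`, and the per-core values are hypothetical and are not taken from the paper's methodology.

```python
from math import prod

def chip_failure_probability(core_failure_probs):
    """Probability that at least one core has failed.

    Assumes independent core failures (a simplification made for this sketch).
    """
    return 1.0 - prod(1.0 - p for p in core_failure_probs)

# Hypothetical per-core failure probabilities, chosen only for illustration.
in_order_cores = [0.001] * 48       # 48 simple in-order cores
out_of_order_cores = [0.004] * 16   # 16 complex out-of-order cores

p_chip = chip_failure_probability(in_order_cores + out_of_order_cores)
print(f"chip-level failure probability: {p_chip:.4f}")
```

Under this model, a small number of heavily stressed cores can dominate the chip-level figure, which is one reason uneven degradation across cores matters.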

The simulation results illustrate some counter-intuitive trends. For example, the failure probability of a core increases rapidly during the computational sprinting period, but its rate of increase is minimal during the idle period. In contrast, a non-power-gated core operating at very low voltage is continuously stressed, and consequently its lifetime reliability can be worse than under sprinting [6]. From a reliability perspective, a similar observation can be made for the use of DVFS [4]. Further, using computational sprinting with dynamically adjusted idle periods, we observe a 7 to 12% throughput improvement compared to continuous low voltage operation while maintaining equivalent lifetime reliability. This improvement is particularly noticeable for compute-bound benchmarks executed on complex out-of-order cores, whereas throughput is actually worse when memory-bound benchmarks are sprinted on in-order cores. This presentation provides results from a series of experiments that seek to establish an understanding of the reliability impact of various architectural and management approaches to mitigating the effects of dark silicon.
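The qualitative trend described above (failure probability rising quickly while sprinting, staying nearly flat while idle, and rising steadily under continuous low voltage operation) can be sketched with a toy cumulative-damage model. This is not the simulation methodology used in the experiments: the damage-to-probability mapping F = 1 - exp(-damage), the stress rates, the phase durations, and the name `failure_trace` are all assumptions made for illustration, and the model ignores regeneration during idle periods.

```python
import math

def failure_trace(phases, repeats):
    """Return the failure probability after each phase of a repeated schedule.

    phases: list of (duration, stress_rate) pairs, all hypothetical values.
    Damage accumulates as duration * stress_rate and is mapped to a failure
    probability through F = 1 - exp(-damage) (an assumed toy model).
    """
    damage = 0.0
    trace = []
    for _ in range(repeats):
        for duration, stress_rate in phases:
            damage += duration * stress_rate
            trace.append(1.0 - math.exp(-damage))
    return trace

# Hypothetical stress rates in arbitrary units, chosen only for illustration.
SPRINT_RATE = 4e-4  # elevated voltage and frequency
IDLE_RATE = 1e-6    # power-gated core: near-zero stress
LOW_V_RATE = 1e-4   # sustained low voltage operation

# One time unit of sprinting followed by three idle units, versus four
# units of continuous low voltage operation, each repeated 100 times.
sprinting = failure_trace([(1.0, SPRINT_RATE), (3.0, IDLE_RATE)], repeats=100)
low_voltage = failure_trace([(4.0, LOW_V_RATE)], repeats=100)

print(f"sprint-and-idle:        {sprinting[-1]:.4f}")
print(f"continuous low voltage: {low_voltage[-1]:.4f}")
```

Which schedule ends up with the lower failure probability depends entirely on the assumed rates and on the sprint-to-idle ratio, which is why the idle period is treated as a tunable parameter.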

References

[1] H. Esmaeilzadeh, E. Blem, R. Amant, K. Sankaralingam, and D. Burger, “Dark Silicon and The End of Multicore Scaling,” ISCA, June 2011.
[2] N. Goulding-Hotta, J. Sampson, G. Venkatesh, S. Garcia, J. Auricchio, P. Huang, M. Arora, S. Nath, V. Bhatt, J. Babb, S. Swanson, and M. Taylor, “The GreenDroid Mobile Application Processor: An Architecture for Silicon’s Dark Future,” Micro, Dec. 2011.
[3] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki, “Toward Dark Silicon in Servers,” Micro, Dec. 2011.
[4] “AgileRegulator: A Hybrid Voltage Regulator Scheme Redeeming Dark Silicon for Power Efficiency in Multicore Architecture,” HPCA, Feb. 2012.
[5] A. Raghavan, Y. Luo, and A. Chandawalla, “Computational Sprinting,” HPCA, Feb. 2012.
[6] S. Gupta and S. Sapatnekar, “GNOMO: Greater-than-Nominal Vdd Operation for BTI Mitigation,” ASP-DAC, Feb. 2012.
[7] J. Srinivasan, S. Adve, P. Bose, and J. Rivers, “Lifetime Reliability: Toward An Architectural Solution,” Micro, June 2005.
[8] M. Cho, W. Song, S. Yalamanchili, and S. Mukhopadhyay, “Thermal System Identification: A Methodology for Post-Silicon Characterization and Prediction of The Transient Thermal Field in Multicore Chips,” Semi-Therm, Mar. 2012.

Figure 1. Degradation distribution for a 64-core asymmetric multicore processor.
