
Robust power management in the IBM z13

T. Webel, P. M. Lobo, R. Bertran, G. M. Salem, M. Allen-Ware, R. Rizzolo, S. M. Carey, T. Strach, A. Buyuktosunoglu, C. Lefurgy, P. Bose, R. Nigaglioni, T. Slegel, M. S. Floyd, B. W. Curran

The power management strategy adopted for the IBM z13 processor chip (referred to as the CP or Central Processor chip) is guided by three basic principles: (a) controlling the peak power consumption by setting a realistic limit on the so-called thermal design power or thermal design point (TDP) driven by customer workloads and maximum-power stress microbenchmarks; (b) reduction of the voltage margin by using a novel dynamic guard-banding technique; and (c) the creation of a rich new set of fine-grained, time-synchronized sensors that track performance, power, temperature, and power management behavior for a running machine. A prime requirement of the power management architecture is that the efficient control mechanisms be designed in such a manner that the high standards of IBM z Systems application performance and reliability be maintained without any compromise. In this paper, we describe the key features constituting the z13 CP robust power management architecture and design that meet the stipulated objectives.

Introduction

Power delivery and dissipation limits constitute a major constraint in achieving the market-driven performance targets of next-generation server and mainframe systems [1]. While technology scaling allows increased device and core count at near-historical growth rates, operational frequency and single-thread performance growth has slowed considerably because of the power “wall.” The IBM z13* system is no exception in this regard, and as such, a careful design of the power management support is essential to the product’s success in the marketplace.

Several basic goals exist with respect to the z13 power management (PM) architecture. For example, one major goal is the reduction of power over-allocation by power-capping [2]. Very high-power workloads that are outside the realm of known or expected customer applications need to be detected, and an appropriate throttling action has to be triggered to make sure that the chip stays within a stipulated power budget. One way to set a high-power threshold is to define a limit on sustained instructions completed per cycle (IPC) measured over a cycle window that is chosen carefully in consideration of the thermal time constant of the system. This high IPC (and hence activity) “sensor” is easy to construct via digital activity counters that do not require post-silicon calibration. Hence, in the z13 design, such an IPC-driven activity sensor can be used to serve the needs of the power management function. In the most conservative setting typical of reliable IBM mainframes, the actual power budget limit, which is often referred to as the thermal design power (TDP), must be determined via systematic generation of synthetic maximum-power loops that are above the power levels of all (real) customer workloads, but significantly below the theoretical worst-case power obtained by adding the highest-utilization power values of individual units or macros. The choice of the TDP limit must also ensure that it complies with the stipulated power delivery limit. In other words, the TDP must be less than the voltage regulator design point (RDP), which defines the ultimate limit. This approach ensures a realistic peak power limit while preserving the robust performance and functionality requirements.

Another key objective is to reduce the voltage margin guard band required to handle inductive noise events. Such events are caused by large temporal gradients in load current (di/dt). Depending on the magnitude of the net inductance (L) driven by the power supply and on-chip wires, the noise (L · di/dt) on the supply voltage rail is an effect that must suitably be provisioned for in setting the operational voltage point. Instead of a very conservative static guard band added to the supply voltage to protect against voltage droop (noise), the objective here is to use a lower voltage (to save power), but with reliance on specially crafted sensors to detect unacceptably large droops. The response to such voltage droops is via dynamic “activity throttle” actions in order to contain the droop level to safe limits. Again, this approach ensures energy-efficient resilience and robust performance. The adopted strategy is unique to IBM z Systems*, but draws upon the experiences gleaned from a similar exercise pursued in the context of the IBM Power Systems [3–5].

Digital Object Identifier: 10.1147/JRD.2015.2446872

T. WEBEL ET AL. 16 : 1 IBM J. RES. & DEV. VOL. 59 NO. 4/5 PAPER 16 JULY/SEPTEMBER 2015

© Copyright 2015 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied by any means or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.

0018-8646/15 © 2015 IBM

A third major goal of this design is to create high visibility into performance, power, temperature and power management behavior in a fine-grained, time-synchronized manner. Prior to the z13, there was a limited set of power management related data, on coarse-grain time scales, and in an asynchronous manner. These data could help interpret the power and thermal profile of the prior z Systems in only a limited manner, especially in terms of being able to relate that to the performance data being collected elsewhere. The new systems approach on the z13 collects data from just under 300 sensors that are precisely synchronized to cover all these categories via a single point of collection and control. A microcontroller component contained in the z13 CP chip is dedicated for this purpose and operates on a fine-grained time scale of 16 ms per sensor update.

The remainder of this paper is organized as follows. In the next section, we provide a general overview of the z13 power management architecture. It includes subsections on key component subsystems within the overall functional specification of the power management facility. For example, subsections on alternative methods for digital sensing of power and droop events (e.g., the so-called power-proxy apparatus and critical path monitors (CPMs)) are included within this power management overview. The section “Power-management characterization through automated microbenchmark generation” is devoted to pre-silicon and post-silicon characterization, tuning, and stress-test methodology via automated microbenchmark generation. Individual subsections within this section are devoted to key methods within the MicroProbe automation toolset that are relevant to noise stress generation and the training (or calibration) of power-proxy digital sensors. The final technical section (“Power management results and validation”) covers illustrative experimental results. Beyond that, we provide a brief concluding section, followed by acknowledgments.

Power-management overview

For a description of the CP chip microarchitecture, implementation and design methodology details, the reader is referred to other papers in this issue (e.g., [6, 7]). The z13 power management architecture is implemented as a two-level hierarchy: core-level and chip-level. Most of the functionalities implemented (at either level) are dedicated to detect situations where there is excessive power consumption or reduced voltage margin. The performance throttling mechanism (PTM), implemented at core-level, is the only actuator mechanism. This actuator is triggered locally by mechanisms implemented at the core level. Globally, the per-core PTM actuators can be triggered by control functions implemented at the chip level. The chip is instrumented with many sensors for platform characterization, providing accurate and fine-grained measurement of the chip activity. The remainder of the section provides more details of the power management architecture.

Power-management hardware-firmware microarchitecture

Figure 1 depicts the hardware-centric data flow between the processor cores and the chip-level power-management logic. It covers the computation of triggers for performance throttling and the critical path monitor (CPM) connections to the PTM. The CPM is a circuit-level sensor [8, 9] that effectively provides a gauge of the amount of timing margin available at any given time within workload execution. Typically, the largest loss of circuit timing margin occurs during a voltage droop event. Other sources of such loss include localized thermal hot spots, circuit aging over time, critical path timing profile changes due to dynamic voltage-frequency scaling (if applicable), and so on. In the z13 design, the other effects mentioned are either too small or non-existent by virtue of particular design choices.

If the detected voltage droop exceeds a pre-determined threshold level, dynamic actuation of the droop mitigation control logic is asserted. In principle, this could be achieved by adjusting the clock frequency via digital phase-locked loop (DPLL) controls as in [3, 4] or, as is chosen for initial implementation in the z13, via instruction throttling, which is similar to the mechanism used in [5]. The short-term activity proxy (STAP) and long-term activity proxy (LTAP) are activity-counter-based digital sensors architected to sense the level of active compute resource utilization at short and long time scales (respectively). These are described in more detail later on as part of the core-level power management functions.

The power management in the z13 CP is done using hardware-firmware (HW-FW) co-programming. In HW, the pervasive region within the CP/SC (where SC refers to the System Controller) chipset [6] contains a highly optimized microcontroller called the Self-Boot Engine (SBE), so named because the design was first used in the POWER8* microprocessor to assist or replace the “boot” function provided by the flexible service processor (FSP) in Power Systems [10]. The SBE has its own instruction set architecture (ISA) that supports (load/store) instructions that translate to SCOM (scan communications) registers’ read/write commands, ALU instructions [e.g., ADD, SUB, AND, OR, XOR, Rotate (left/right)], branch instructions, and others. Note that the SCOM unit serves as an interface to read and write on-chip processor registers. Off-chip data and control information is read into the processor via SCOM registers. In addition, SBE uses a local, on-chip SRAM memory (also present in the pervasive region) called PIBMEM as its code/data memory. SBE acts as the centralized power-management controller for the chip. In addition to SBE, there is a dedicated core-level power-management block in each core and a chip-level power-management block in the “uncore” or nest region that perform specialized power-management functions like power-proxy measurements, CBCA (cycle-by-cycle activity) detection, core-throttling and various modulation and control utilities associated with the CPM-sensed data (referred to as “CPM Filter and Thresholding” in Figure 1). Note that on a multi-core processor chip, the units comprising non-local (or shared) caches, interconnect elements, memory controllers and other non-processor-core functions are collectively referred to as the “nest” in IBM terminology. The rest of the industry typically calls this the “uncore.” There are also temperature sensors, skitters, and CPMs that are placed at various points in the chip, which can be monitored by the SBE. (The skitter [11], which is a precursor to the CPM, is a special circuit macro that serves as a simplified circuit delay sensor that is also useful in detecting voltage noise.)

Even though the SBE microcontroller off-loads the support element (SE) and the FSP quite significantly, there is firmware within the FSP that initially activates the SBE, reads hundreds of time-synchronized sensors from SBE’s memory (PIBMEM), and performs error handling and other high-level functions.

Core-level power management functions

In this subsection, we provide an overview of the core-level power management functions within the overall z13 power management architecture.

PTM function

The performance throttling mechanism (PTM) is responsible for reducing system performance in order to mitigate unwanted, resilience-threatening situations (e.g., excessive power consumption or voltage droops). Whenever the PTM mechanism is triggered, the execution pipeline is throttled, reducing the power consumption and the voltage droop. After the throttling event, a ramp-up function starts “un-throttling” the pipeline until the normal execution rate is re-established. The mechanism is fully configurable, providing parameters to control the degree of the initial throttling, the duration, and the ramp rate. This mechanism also supports sustained throttling events. That is, while still in the ramp-up phase, a new throttling event can be triggered and this would restart the whole process. The ramp function is necessary to prevent the PTM logic from inducing a voltage droop event when the throttling is released.

Figure 1

The z13 CP chip power management high-level data flow. (CPM: Critical Path Monitor; CBCA: Cycle-by-Cycle Activity; STAP: Short-term Activity Proxy; LTAP: Long-term Activity Proxy; ISU: Instruction Sequencing Unit; IFU: Instruction Fetch Unit; LSU: Load-Store Unit; VSU: Vector/Scalar Floating Point Unit; Mux: multiplexor.)
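The throttle, hold, and ramp-up behavior of the PTM can be illustrated with a small software model. This is a minimal sketch only: the parameter names (`initial_level`, `hold_steps`, `ramp_step`), the linear ramp shape, and the abstract time step are assumptions made for illustration, not details of the z13 hardware.

```python
from dataclasses import dataclass


@dataclass
class ThrottleConfig:
    # All three parameters are hypothetical stand-ins for the configurable
    # "degree of initial throttling, duration, and ramp rate".
    initial_level: float  # allowed execution rate (0..1) right after a trigger
    hold_steps: int       # steps to stay at the initial level before ramping
    ramp_step: float      # rate regained per step during ramp-up


def execution_rate(triggers, cfg, steps):
    """Return the allowed execution rate at each step.

    `triggers` is a set of step indices at which a throttle event fires.
    A trigger during ramp-up restarts the whole sequence.
    """
    rates = []
    rate = 1.0
    hold = 0
    for t in range(steps):
        if t in triggers:
            rate = cfg.initial_level  # throttle immediately
            hold = cfg.hold_steps
        elif hold > 0:
            hold -= 1                 # still holding the throttled rate
        elif rate < 1.0:
            rate = min(1.0, rate + cfg.ramp_step)  # gradual un-throttle
        rates.append(rate)
    return rates


cfg = ThrottleConfig(initial_level=0.25, hold_steps=2, ramp_step=0.25)
print(execution_rate({1}, cfg, 8))
print(execution_rate({1, 5}, cfg, 8))  # second trigger arrives mid-ramp
```

The gradual ramp models the point made above: releasing the throttle in one step would itself create a large change in activity, which is what the ramp function is meant to avoid.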

Throttle metering

In order to validate the PTM function, a throttle metering capability is implemented. This measures the degree to which PTM is invoked. This is crucial to be able to properly calibrate the power management functions and to understand the impact of throttling on performance measurements. Since the PTM function could affect the system’s performance, it is designed to be triggered only in extremely unlikely scenarios. This strategy ensures minimal impact to real system performance.

Activity proxies

Activity proxies constitute a well-known technique to estimate or track the power consumption profile of processor chips [12–16]. In the z13 design, they consist of a set of 16 activity counters. These counter values are multiplied by weights (and then summed up) to provide an estimation of the core (and then the chip) power consumption. One important step in defining activity proxies is the selection of the input activities to be monitored. The activity counters are selected with the aim of covering the different types of activities that can be generated in the core. The counters cover various functional units (fixed point unit, floating point unit, load-store unit, etc.), instruction cache, data cache, branch target buffer, completed instructions, dispatched instructions, etc. Another factor that affects the temporal sampling granularity (and hence the number of power estimations per second), as well as the estimation accuracy, is the number of activity counters that are used as input. As the number of activity counters increases, the accuracy of the power estimate improves. However, it takes longer to generate an estimation. The z13 CP chip improves on prior activity proxy designs by implementing three different activity proxies. This enables a wider timing window selection to support power management functions at different timescales. Additionally, the z13 provides hardware to sum across the core activity counters to produce a chip-level activity count. We describe the activity proxies in the rest of this subsection.

The long-term activity proxy (LTAP) uses all 16 activity counters as input. Therefore, it provides a more accurate estimation at the expense of slow estimation speed. This proxy is used to provide power characterization support via the firmware interface. The final product selection mode is to use the shortest time scale of 32 µs for the LTAP computation. The time period of 32 µs is determined by the firmware interface that is needed to manage power at the chip-level, spanning multiple cores with low-speed serial communication links. The result is a relatively slower time-constant gauge of the power at the chip-level. The 16-counter hardware apparatus to estimate per-core power is natively much faster, of course. The sampling period for doing the weighted sum accumulation is determined by the number of bits in each counter. This counter-width is chosen in such a manner that the event activity counts monitored are large enough on average to provide statistical significance in computing the weighted sum. As such, LTAP provides the highest achievable level of accuracy in the chip-level power capping function implemented within the SBE. This is a practical implementation choice, because it is used in conjunction with the CPMs that detect voltage droop. If there is a rapid increase in power that happens on a time scale that is smaller than 32 µs, then before the LTAP can engage, the regulator voltage output starts to drop due to the current draw. This engages the CPM throttling until the longer-time-scale LTAP can catch up and start to throttle for any sustained capping requirements.

The short-term activity proxy (STAP) is similar to the LTAP, but it uses only 8 counters, trading off some accuracy to improve the estimation speed. The appropriate set of counters has been selected using techniques successfully used in previous IBM chips [14]. The relatively fast and accurate estimations make it a potential choice in producing higher speed power estimations, if or where needed. STAP might also serve a role in future chips (for rapid current change slope detection) where CPMs are not present, but the accuracy of the STAP might need to be refined in that case.

The cycle-by-cycle activity (CBCA) proxy is a very fast proxy, capable of making estimates at a very fast rate (< 10 ns), when considering the hardware sampling interval. Firmware overheads to use this for core- and chip-level throttling are minimal, because the CBCA-driven PTM throttle is in effect a hardware-autonomous control loop within the larger firmware-driven power management control loop. The CBCA uses 4 activity counters as input, and therefore, its lower accuracy makes it unsuitable for performing chip-level power estimations. However, it is ideal for triggering the PTM function whenever a steep change in activity is detected at the chip-level. A fast triggering mechanism is crucial to reduce the voltage droop produced due to changes of activity at the chip-level.
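The three proxies share the same weighted-sum structure and differ only in how many counters feed them (16 for LTAP, 8 for STAP, 4 for CBCA). The sketch below illustrates that structure under stated assumptions: the counter values, the weights, and the choice of which counter indices feed each proxy are all invented for illustration and are not the calibrated z13 settings.

```python
def activity_proxy(counts, weights):
    """Weighted sum of activity counter values for one sampling interval."""
    if len(counts) != len(weights):
        raise ValueError("one weight is required per counter")
    return sum(c * w for c, w in zip(counts, weights))


def proxy_from_subset(all_counts, all_weights, indices):
    """Evaluate a proxy that only observes a subset of the counters."""
    counts = [all_counts[i] for i in indices]
    weights = [all_weights[i] for i in indices]
    return activity_proxy(counts, weights)


# Hypothetical per-interval counts and weights for 16 counters.
counts = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]
weights = [2, 1, 1, 3, 1, 2, 1, 1, 2, 1, 1, 1, 3, 1, 1, 2]

ltap = proxy_from_subset(counts, weights, range(16))  # all 16 counters
stap = proxy_from_subset(counts, weights, range(8))   # assumed subset of 8
cbca = proxy_from_subset(counts, weights, range(4))   # assumed subset of 4
print(ltap, stap, cbca)
```

Fewer inputs mean less work per estimate, which is the speed-versus-accuracy trade-off described above.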

Critical path monitors (CPM)

We use Critical Path Monitors (CPMs) in the z13 to measure available timing margin. These CPMs are refinements of the CPMs implemented in previous IBM chips [8, 9]. CPMs allow for measuring the timing margin and therefore are very sensitive to voltage droops. CPMs have been used in the past for actively reducing the voltage guard-band to save energy [3, 4] and to protect the chip from voltage droop events [5]. In the z13 architecture, they are used in a similar fashion, but instead of triggering voltage/frequency adjustments, the PTM function is triggered whenever a certain (configurable) threshold is met. Adjusting the voltage would be a relatively high-latency operation, since off-chip voltage regulation is used; adjusting the frequency can be quite fast, but it requires careful control of clock skew and jitter during ramp-down or ramp-up operations. As such, design verification under variable frequency operational modes can become quite complex in products that require very high reliability. The PTM knob was selected for voltage droop management in the z13.

Chip-level power and thermal management functions

In this subsection, we describe the chip-level power management logic (CLPM) that is implemented in z13 with a focus on the implementation macros outside the core within the pervasive infrastructure.

Chip-level CBCA-based throttle-event generation

Individual CBCA values from the cores are summed up at chip-level, and the current sum is subtracted from the average of the previous 8 CBCA values. The resulting difference is compared with a threshold. If the computed delta is above the threshold, a PTM instruction-throttle-event is generated.
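The detection step just described can be sketched as a sliding-window comparison. The sketch makes two assumptions that are not stated in the text: the sign convention (a rise of the current chip-level sum above the running average is treated as the positive delta of interest) and the behavior before 8 samples have been collected (no event is generated). The class and method names are hypothetical.

```python
from collections import deque


class ChipCbcaThrottle:
    """Compare the current chip-level CBCA sum with the average of the
    previous `history` sums and flag a throttle event on a large delta."""

    def __init__(self, threshold, history=8):
        self.threshold = threshold
        self.previous = deque(maxlen=history)

    def update(self, core_cbca_values):
        current = sum(core_cbca_values)  # chip-level sum across cores
        fire = False
        if len(self.previous) == self.previous.maxlen:
            average = sum(self.previous) / len(self.previous)
            delta = current - average    # assumed sign: rise in activity
            fire = delta > self.threshold
        self.previous.append(current)
        return fire


detector = ChipCbcaThrottle(threshold=10)
for _ in range(8):
    detector.update([5, 5])          # steady activity, chip sum = 10
print(detector.update([20, 20]))     # sudden jump to chip sum = 40
```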

Chip-level activity proxy

In order to monitor and manage the overall power consumption of all cores in the chip, a chip-level activity proxy is implemented. For this chip-level activity proxy, two different modes exist: the short-term mode and the long-term mode.

In the short-term mode, every time a core STAP value is updated, this value is sent via a serial link to the chip-level, where all of the cores’ STAP values are added together to form a chip-level STAP (CLSTAP). The short-term mode is useful for performing power-capping by hardware at the chip-level; thus, it enables effective management of chip power budgets.

In the long-term mode, every time a core LTAP value is updated, this value is sent via a serial link to the chip-level, where all of the cores’ LTAP values are added together to form a chip-level LTAP (CLLTAP). The long-term mode is useful in generating a chip-level activity number that might allow firmware to find hot spots in a system that would trigger selective throttling or power shifting between chips/nodes.

The hardware implements a multiplexor (MUX) that selects either the long-term or the short-term activity proxy in the core logic. As described before, the LTAP is chosen for high-accuracy power-capping decisions at a longer time scale; whereas, in principle, the STAP could be selected for smaller time-scale decisions where higher speed (but less accurate) power estimations are needed. Any update in the selected value at the output of the MUX is transferred serially to the chip-level logic. The serial link always transfers the maximum number of bits. The chip-level logic forms a 32-bit number and uses an adder tree for this to reduce latency and hardware complexity. An overview of this structure is depicted in Figure 1.

Chip-level power capping

The power capping function ensures that the chip continues to function even if the chip tries to exceed its power budget. The z13 CP chip has the option to use either STAP or LTAP as the basis for capping. The benefit of using activity proxies instead of true power measurement for power capping is that the performance is deterministic and repeatable for the same workload across different chips and different environmental conditions. This supports the z13 goal of very high reliability in that the performance for a workload under the power fault conditions can be bounded and does not vary across systems or chips. One way to dynamically adjust the power capping limits in a running system is to use the SBE to adapt the power capping threshold in real time based on chip internal information, which is accessible by the SBE microcontroller. Alternately, the SBE can actuate a sustained throttle level determined periodically by firmware rather than using the automated hardware throttle mechanisms. This allows the option of using a firmware-based feedback controller for power capping similar to those used in prior IBM systems [17]. The chip power estimation within a time period is computed as follows:

$$\mathrm{ChipPowerEstimation} = \sum_{i=1}^{\#\mathrm{cores}} \left( \frac{\sum_{j=1}^{\#\mathrm{events}} C_{i,j} \times w_j}{R} + L - f(T_i) \right)$$

where $C_{i,j}$ is the count of event $j$ in core $i$ within the interval, and $w_j$ is the corresponding weight of the event. $R$ is a scalar factor to convert the core activity proxy value to units of power consumption. $L$ is the worst-case leakage power (as obtained from the chip’s vital product data (VPD) repository), and $f(T_i)$ is the additional power reduction due to operating core $i$ at throttle level $T_i$. The weighted sum of core events is implemented in hardware, while firmware must account for the factors of $R$, $L$, and $T_i$. Note that the worst-case leakage value $L$ takes into account the maximum temperature bounds, and is valid for the nominal supply voltage level of the chip. The factor $R$ incorporates the effect of the particular operational voltage and frequency settings of the chip.

Since STAP and LTAP track only switching power, the power estimation includes margin for leakage power, which is modeled using worst-case manufacturing assumptions. Once throttling is invoked, the power estimation is further adjusted to account for some power reduction due to lower clock-power during the “throttled” time duration. Specifically, the reduction in the $C_{i,j}$ activity values resulting from throttling does not account for the additional leakage power reduction caused by the drop in temperature. The $f(T_i)$ adjustment accounts for this correction.
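The chip power estimation can be written out as a short worked example. This sketch assumes the reading of the equation in which, for each core, the weighted event sum is divided by R, the leakage term L is added, and the throttle correction f(T_i) is subtracted. The numeric values and the linear form chosen for `f` are invented for illustration; the source does not specify them.

```python
def chip_power_estimate(counts, weights, R, L, throttle_levels, f):
    """Sum over cores of (weighted event count / R + L - f(T_i)).

    counts[i][j]       : count of event j in core i within the interval
    weights[j]         : weight of event j
    throttle_levels[i] : throttle level T_i of core i
    f                  : hypothetical callable giving the power reduction
                         associated with a throttle level
    """
    total = 0.0
    for core_counts, level in zip(counts, throttle_levels):
        weighted = sum(c * w for c, w in zip(core_counts, weights))
        total += weighted / R + L - f(level)
    return total


# Two cores, two events; all numbers are illustrative only.
counts = [[10, 20], [30, 40]]
weights = [1, 2]
estimate = chip_power_estimate(
    counts, weights, R=10, L=2,
    throttle_levels=[0, 1],
    f=lambda level: 0.5 * level,  # assumed linear correction
)
print(estimate)  # core 0: 50/10 + 2 - 0 = 7.0; core 1: 110/10 + 2 - 0.5 = 12.5
```

In the split described above, only the inner weighted sum would come from hardware; the division by R, the leakage term, and the throttle correction correspond to the firmware-side factors.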

Thermal control

For thermal control, the pervasive-logic-centric microcontroller (SBE) reads the two hottest digital thermal sensors (DTSs) from each core in 16-ms loop intervals. It computes the average of the two, compares it against a threshold value, and, if the temperature exceeds the threshold, throttles that core for the next 16 ms. The SBE releases the throttle if the temperature falls below the threshold. This sequence continues every 16 ms.
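One iteration of this per-core thermal check can be sketched as follows. The function name, the temperature units, and the example readings are assumptions; the sketch also selects the two hottest readings in software, whereas the text only states that the SBE reads the two hottest sensors.

```python
def thermal_throttle_decision(dts_readings, threshold):
    """Decide whether a core should be throttled for the next interval.

    Averages the two hottest digital thermal sensor readings of the core
    and compares the result with the threshold.
    """
    if len(dts_readings) < 2:
        raise ValueError("at least two sensor readings are required")
    hottest_two = sorted(dts_readings, reverse=True)[:2]
    average = sum(hottest_two) / 2
    return average > threshold


# Hypothetical readings for one core in one 16-ms interval.
readings = [70, 85, 90, 60]
print(thermal_throttle_decision(readings, threshold=86))  # average 87.5
print(thermal_throttle_decision(readings, threshold=88))
```

Calling the function once per interval reproduces the described behavior: the throttle stays asserted while the average exceeds the threshold and is released once it falls below.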

Power management microcontroller (SBE)

In this subsection, we provide details of the SBE microcontroller operational semantics. The Self-Boot Engine is a highly specialized microcontroller with its own ISA (instruction set architecture) catering towards on-chip communication and initialization operations. The SBE in the z13 continuously runs a program out of its dedicated on-chip SRAM memory (PIBMEM), collecting data from 290 sensors in the CP chip, as well as 30 on the SC chip, in a 16-ms interval loop. For each sensor, the SBE records the instantaneous (average, maximum, minimum), accumulator, tick-counter, and squared accumulator value fields in its memory. This data is read by the system firmware once every hour, processed, and used for real time logging for running machines. The tick-counter keeps track of how many 16-ms sensor values were added to the accumulator and squared accumulators. This permits IBM to compute the hourly average and three sigma values from these sensor data structures. All 290 sensors are gathered with time synchronicity within 16 ms or less of one another. All the hourly maximum and minimum values have a time stamp that is accurate to within 64 ms for when the event happened during the hour. In addition, there is a specialized “snapshot” mechanism in the firmware that allows for the capture of a group of 88 critical sensors in their current state (out of the entire 290) for a given 16-ms interval. There are two modes of group capture. In the IBM lab environment, one can capture this group of sensors every 16 ms and create a temporal plot of the time-synchronized behavior of all 88 sensors by streaming the data out through the system control structure. The second mode is “application” mode in which the “snapshots” are taken based on 5 predetermined classes of sensors hitting maximum-value conditions in the 1-hour interval. Every hour, these 5 snapshots are returned with all the normal sensor data. However, the snapshots are time-stamped and can be directly correlated with a critical moment in time during that hour when a class of sensors peaked, allowing one to see the instantaneous state of the other 88 sensors at that moment.
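The accumulator, squared-accumulator, and tick-counter fields are sufficient to recover a mean and a three-sigma value without storing individual samples. The sketch below shows one way this could be done; the record layout, field names, and the use of the population variance are assumptions, since the text does not say exactly how the statistics are derived.

```python
import math
from dataclasses import dataclass


@dataclass
class SensorRecord:
    """Running statistics for one sensor (hypothetical layout)."""
    accumulator: float = 0.0          # sum of sampled values
    squared_accumulator: float = 0.0  # sum of squared sampled values
    ticks: int = 0                    # number of samples accumulated
    maximum: float = float("-inf")
    minimum: float = float("inf")

    def add(self, value: float) -> None:
        """Fold one periodic sample into the running fields."""
        self.accumulator += value
        self.squared_accumulator += value * value
        self.ticks += 1
        self.maximum = max(self.maximum, value)
        self.minimum = min(self.minimum, value)

    def mean(self) -> float:
        return self.accumulator / self.ticks

    def three_sigma(self) -> float:
        # Population variance: E[x^2] - (E[x])^2, clamped against rounding.
        variance = self.squared_accumulator / self.ticks - self.mean() ** 2
        return 3.0 * math.sqrt(max(variance, 0.0))


record = SensorRecord()
for sample in [2, 4, 4, 4, 5, 5, 7, 9]:
    record.add(sample)
print(record.mean(), record.three_sigma(), record.minimum, record.maximum)
```

Because only three running fields per sensor are updated each interval, the memory footprint stays constant regardless of how many samples fall in the hour.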

Power-management characterization through automated microbenchmark generation

In this section, we provide details of the methods used to test the functionality and operational “corners” of the various features within the power management architecture. The methodology is centered around the use of an automated microbenchmark generation toolset called MicroProbe [12].

Overview

Calibration and characterization of power-management mechanisms in a pre-silicon setting is typically inadequate. This is because it is difficult to model power and inductive noise behavior very precisely. Therefore, a systematic, direct-measurement-based calibration and characterization of power management mechanisms in a post-silicon setting (during the processor test and “bring-up” process) is required to firm up the calibration settings, voltage levels, and package characteristics that ensure robust power-management functionality.

A key part of such a post-silicon processor testing process is the use of specially crafted microbenchmarks to calibrate and test the power-management functionalities. These corner-case workloads are sometimes referred to as stressmarks in the research literature. The stressmarks maximize the voltage noise and power consumption or generate different microarchitecture activities to train or test activity proxies. Manual, expert-driven generation (and fine-tuning) of such stressmarks can be a tedious and error-prone process. To overcome these issues, we used the MicroProbe [12] framework to systematically generate the stressmarks required during the z13 post-silicon test/bring-up process. Automation in stressmark generation has been attempted by other research groups, but as discussed in [12], MicroProbe has significant new features, tested in real machine contexts, that have not been available before. The recent work of Kim et al. [18] is an example of inductive noise stress test generation that was used in an AMD processor context. However, MicroProbe uses a novel thread alignment technique that achieves superior results in a mainframe processor execution environment.

Noise stressmark generationIn order to reduce the voltage margin, the cores need to beable to detect voltage droop and trigger capping in case


the margin is insufficient under extreme situations.Therefore, the CBCA and CPMs must be tested on a wideset of cases, including the worst-case ones. Therefore, acomplete noise characterization is needed. We used themethodology tested on the IBM zEnterprise EC12 (zEC12)for the noise stressmark generation and characterization[13]. This methodology permits the control of differentparameters affecting the voltage noise, providing acomplete understanding of the voltage noise and thenoise mitigation mechanisms (CBCA/CPM).The noise stressmarks are built by concatenating

high-power instruction sequences with low-power instruction sequences. This generates transient fluctuations of voltage in the power delivery network. The frequency of changes between the high/low instruction sequences, which can be controlled via the sequence length, affects the overall voltage droop generated, which reaches its maximum at the power delivery network resonance frequency. Additionally, the transitions between the high-power and low-power phases are synchronized across the cores in order to maximize the voltage droop. Several alternative schemes were tried to make the inter-core synchronization as exact as practically feasible. One method that was used with success in the zEC12 experiments [13] utilizes a particular instruction at a specially crafted frequency that uses the system clock as a reference to effect the synchronization. In the z13, other techniques tied to spin-lock synchronization mechanisms in a hardware cache coherence protocol context were also exploited to achieve this objective.

Maximum and minimum power instruction sequences were derived via a systematic approach [12, 13]. First, every instruction of the ISA is profiled in order to generate an energy-centric instruction ranking. These ranks define which instructions are more power-hungry and which ones consume low power. The maximum-power instruction sequence is derived after (automatically) exploring a wide set of combinations of the most power-consuming instructions for each functional unit. Once the sequence is defined, memory and branch activity is added to further maximize the power consumption. Not only has this systematic, automated approach been proven to outperform previous manual efforts, but it also enables the definition of the maximum-power stressmark in the early stages of the testing process. The manual approach is inadequate because it cannot provide the search coverage afforded by automated generation of many thousands of candidate maximum-power instruction sequences. The systematic, automated search proves to be much faster, and it generally also yields a higher-power sequence than that achieved manually by any expert within the design team.

The minimum-power instruction sequence definition is also driven by the energy-centric instruction ranking methodology. It is derived using the last instruction within the ranking table. Once the maximum and minimum instruction sequences are defined, noise stressmarks are generated by concatenating them, as mentioned before.
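The construction described above can be illustrated with a minimal Python sketch. It is not the MicroProbe implementation: the instruction names, energy values, clock frequency, and toggle frequency are hypothetical placeholders, the sketch assumes one instruction per cycle, and it simply takes the top-ranked instructions instead of exploring combinations automatically as the actual methodology does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RankedInstruction:
    mnemonic: str         # hypothetical placeholder name, not a real z13 opcode
    energy_per_op: float  # arbitrary units, as if obtained from profiling


def rank_by_energy(profile):
    """Sort profiled instructions from most to least power-hungry."""
    return sorted(profile, key=lambda ins: ins.energy_per_op, reverse=True)


def phase_length_cycles(clock_hz, toggle_hz):
    """Cycles per phase so that one high+low period matches toggle_hz."""
    return max(1, round(clock_hz / (2.0 * toggle_hz)))


def build_noise_stressmark(profile, clock_hz, toggle_hz, top_n=2):
    """Concatenate a high-power phase and a low-power phase.

    Assumption: one instruction issues per cycle, so the phase length in
    cycles equals the phase length in instructions.
    """
    ranking = rank_by_energy(profile)
    high_pool = [ins.mnemonic for ins in ranking[:top_n]]
    low_op = ranking[-1].mnemonic  # last entry of the ranking table
    n = phase_length_cycles(clock_hz, toggle_hz)
    high_phase = [high_pool[i % len(high_pool)] for i in range(n)]
    low_phase = [low_op] * n
    return high_phase + low_phase


# Hypothetical profile and frequencies, chosen only for illustration.
profile = [
    RankedInstruction("OP_A", 1.0),
    RankedInstruction("OP_B", 5.0),
    RankedInstruction("OP_C", 3.0),
    RankedInstruction("OP_NOP", 0.2),
]
stressmark = build_noise_stressmark(profile, clock_hz=5.0e9, toggle_hz=50.0e6)
```

Sweeping `toggle_hz` in such a sketch corresponds to varying the sequence length in order to search for the resonance frequency at which the droop is largest.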

Activity proxy training
CBCA and short/long-term activity proxies must be calibrated in order to provide accurate power consumption estimations. The calibration methodology uses statistical regression techniques and genetic algorithm searches to find the optimal set of events (that are to be monitored via activity counters) and their weights for a given training set [14]. A proper training set definition is crucial to find a solution with large coverage (i.e., accurate estimates under a diverse range of execution environments). We used MicroProbe to implement a systematic method to generate rich training sets for power proxies [12, 15]. We generated low-to-high activity stressmarks for each functional unit as well as for each possible combination of functional units. In addition, several stressmarks covering memory activities were also automatically generated using MicroProbe. Overall, the training set was composed of more than ten thousand stressmarks with unique microarchitecture activity, covering all types of activities.
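As a rough illustration of the regression part of such a calibration, the following Python sketch fits weights for a fixed set of activity counters by ordinary least squares. It is a simplification under stated assumptions: the event set is taken as given (the genetic-algorithm search over events is not shown), and the counter values, power readings, and function names are hypothetical.

```python
def fit_proxy_weights(activity_rows, measured_power):
    """Least-squares weights w minimizing sum((row . w - power)^2).

    Solves the normal equations (A^T A) w = A^T p by Gauss-Jordan
    elimination with partial pivoting, using only the standard library.
    """
    n = len(activity_rows[0])
    ata = [[sum(r[i] * r[j] for r in activity_rows) for j in range(n)]
           for i in range(n)]
    atp = [sum(r[i] * p for r, p in zip(activity_rows, measured_power))
           for i in range(n)]
    aug = [row + [rhs] for row, rhs in zip(ata, atp)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("activity counters are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def mean_abs_pct_error(activity_rows, measured_power, weights):
    """Average absolute estimation error, in percent of measured power."""
    errors = []
    for row, power in zip(activity_rows, measured_power):
        estimate = sum(a * w for a, w in zip(row, weights))
        errors.append(abs(estimate - power) / power)
    return 100.0 * sum(errors) / len(errors)


# Hypothetical training set: each row is [1.0 (constant term), counter_a, counter_b].
rows = [
    [1.0, 0.0, 0.0],
    [1.0, 4.0, 0.0],
    [1.0, 0.0, 8.0],
    [1.0, 3.0, 5.0],
    [1.0, 6.0, 2.0],
]
# Synthetic power readings generated as 10 + 2*counter_a + 0.5*counter_b.
power = [10.0, 18.0, 14.0, 18.5, 23.0]
weights = fit_proxy_weights(rows, power)
```

Because the synthetic readings are exactly linear in the counters, the fit recovers the generating weights; with real measurements, the residual error would indicate how well the chosen event set covers the training stressmarks.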

Power-management results and validation
This section covers some representative results of power management features and characterization efforts for the z13.

Figure 2 compares the chip power values of a microbenchmark generated by the MicroProbe methodology against two benchmarks (hmmer and astar) selected from the CINT2006 benchmark suite, which is part of the larger SPEC CPU2006 suite. These two are chosen for illustration because they are among the highest-power workloads within the SPEC CPU2006 suite. The figure also includes the case of a microbenchmark that is generated through a manual approach. Chip power values are normalized to the MicroProbe-generated worst-case benchmark. Overall, the MicroProbe methodology generates a benchmark whose power can be up to 18% higher (“HMMER ST” in Figure 2) than that of a worst-case real workload. The methodology enables us to set accurate maximum power targets, ensuring a more robust product. It also enables rapid bring-up through automated generation of microbenchmarks, in contrast with manual generation. Although not shown in the figure, MicroProbe also generates a noise stressmark that produces 70% more noise than a typical functional stressmark.

Figure 3 shows experimental results that demonstrate
the ability of the z13 LTAP to track the chip Vdd power. In this experiment, we ran several steady workload exercisers that stressed a z13 chip in different ways and measured the Vdd power rail, chip temperature, and LTAP activity counters. Since the LTAP does not track leakage power, we subtracted the leakage power from the power rail measurement to train the LTAP. Our leakage model was formed by measuring the power while the chip was idle and correcting it for the average temperature while running each workload. We used a genetic algorithm to train the LTAP coefficients. This algorithm is described in our prior work [14]. The genetic algorithm optimizes the coefficients to find the best fit for reducing the average error in power across all workloads.

The results are shown in Figure 3. Each workload is
represented by a point on the plot. The x-axis shows the chip supply voltage (Vdd) rail power measured for the workload, scaled to the workload with the highest power consumption. The y-axis shows the absolute error of the leakage power estimated for the workload summed with the additional power estimated by the LTAP. The error includes both the error in the leakage power model and the LTAP error. The average absolute error across all workloads is 1%. The highest power consumption workloads show errors of up to 2%, which is similar to the expected error on hardware-based measurement in servers [14]. Although only two of the workload exercisers were in the “highest power” category in this particular experiment, our general experience with the very highest power workloads shows that the error margin for those is similarly small (i.e., < 2%). Power capping requires high-accuracy measurement near the upper end of the power consumption range, since capping is not required for system protection at low power consumption rates. Therefore, we believe the power proxy is appropriate as a system indicator to be used for the power capping capability in the z13.

Figure 4 shows actual oscilloscope measurements with
and without the CBCA mechanism enabled on a specially crafted microbenchmark that stresses the noise behavior on the chip. For this particular microbenchmark, all cores transition from a worst-case power consumption phase to an almost idle power consumption phase, thereby creating high power noise in the system. The CBCA mechanism captures the transition in power behavior and reacts accordingly to reduce noise. The y-axis shows the voltage measurements that are normalized to the average voltage value seen across the run. As Figure 4 shows, the CBCA mechanism achieves considerable noise reduction. As stated before, absolute accuracy in power estimation is not a required goal for the CBCA sensor; rather, the speed with which a power gradient is sensed is more important. As a result of power estimation inaccuracies, the CBCA-enabled voltage droop may, on rare occasions, be slightly greater than the CBCA-disabled voltage droop, but overall, the worst-case voltage droop is reduced. This means that the worst-case minimum voltage (due to noise) experienced at the circuit level is higher than before; hence the frequency guard band can be made smaller. This naturally translates to higher frequency (and performance) for the given nominal voltage.

Figure 2

Chip power consumption comparison. The values are normalized to the MicroProbe-generated stressmark running in SMT (simultaneous multithreading) mode. ST stands for single-thread mode. Astar and hmmer are typical SPEC INT workloads. (Hmmer SMT: hmmer workload running in SMT mode.)

Figure 3

LTAP accuracy. The x-axis shows the chip Vdd rail power measured for the workload, scaled to the workload with the highest power consumption. The y-axis shows the absolute error of the leakage power estimated for the workload summed with the additional power estimated by the LTAP.

Figure 5 shows the maximum and average temperature
values from the five digital thermal sensors (DTS) implemented on each core for different throttling levels. The data is normalized to the maximum temperature seen, and the workload run is the MicroProbe-generated maximum-power stressmark. The throttling mechanism reduces the temperature significantly (> 40%) when high throttling levels are used.

Figure 6 shows the improvements in Vmin (minimum operating voltage) across different chips with CPM
enabled. The data is normalized to the voltage data from the first chip (chip 1) where the CPM mechanism is disabled. The workload is a specialized functional test to stress different circuit paths. This workload was a manually crafted high resource utilization stressmark; it is not the worst-case di/dt inductive noise stressmark generated by MicroProbe. Depending on the chip that is being characterized, Vmin improvements range from 4.7% to 6.3% with the CPM mechanism enabled. It should be noted that in Figure 6, the frequency of operation is considered to be fixed, and Vmin in this context refers to the minimum operating voltage that can be sustained without circuit timing failure. Without CPM, a very conservative guard band has to be added to the voltage corresponding to the target frequency. Hence, the minimum permissible operating voltage is higher. With CPM enabled, the principle of dynamic guard banding [4] is invoked, so the minimum permissible voltage point can be pushed down to a lower value. In effect, this is a case of under-volting. In the rare case where the circuit voltage is sensed to approach the new (lower) Vmin, PTM is actuated to circumvent the potential voltage emergency.

Figure 4

CBCA actuation results with an oscilloscope measurement. The y-axis shows the voltage measurements that are plotted on a normalized scale relative to the average value seen across the run.

Figure 5

Temperature as a function of throttling level: real measurement data. The y-axis temperature values are shown on a normalized scale relative to the maximum temperature value recorded (in the absence of any throttling).
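The Vmin comparison described for Figure 6 amounts to simple arithmetic, sketched below in Python. The voltages are hypothetical and are not measured z13 data, and the sketch assumes that each chip's improvement is expressed relative to its own CPM-disabled Vmin, which the text does not state explicitly.

```python
def normalize_vmin(vmin_cpm_off, vmin_cpm_on):
    """Normalize all readings to chip 1 with CPM disabled (the first entry)."""
    reference = vmin_cpm_off[0]
    off = [v / reference for v in vmin_cpm_off]
    on = [v / reference for v in vmin_cpm_on]
    return off, on


def improvement_percent(vmin_cpm_off, vmin_cpm_on):
    """Per-chip Vmin reduction with CPM enabled, in percent of the CPM-off value.

    Assumption: the improvement is taken per chip against its own
    CPM-disabled Vmin at a fixed operating frequency.
    """
    return [100.0 * (off - on) / off
            for off, on in zip(vmin_cpm_off, vmin_cpm_on)]


# Hypothetical Vmin readings in volts for three chips (illustrative only).
cpm_off = [1.000, 0.990, 1.010]
cpm_on = [0.950, 0.9405, 0.9494]

norm_off, norm_on = normalize_vmin(cpm_off, cpm_on)
gains = improvement_percent(cpm_off, cpm_on)
```

A lower normalized Vmin with CPM enabled corresponds to the guard band that dynamic guard banding removes at the fixed target frequency.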

Related work and discussion
The z13 power management architecture is an innovative, new feature in z Systems. Many of the core ideas related to power estimation and dynamic guard banding have evolved from prior IBM designs within the Power Systems family [3, 4], as we already stated. However, the particular combination of three different activity proxies to account for decisions at various time scales has not been attempted before.

The concept of using digital counter-based sensors to estimate power has been used in non-IBM processor products as well. Initial attempts at monitoring on-chip power using analog circuitry, as incorporated in Intel’s Montecito (Itanium) design [19], reportedly met with calibration and accuracy issues. Subsequent Intel processor designs (Nehalem [20] and Sandy Bridge [21]) as well as AMD’s Kabini chip [22] all used digital power metering methods for on-chip power management.

When it comes to the z13 power management architecture, a key difference is that previous Power Systems products, as well as non-IBM offerings, have used complex digital power proxy implementations that use a relatively large number of activity counters for precision. In the z13, the emphasis was on using lower-complexity activity proxies to detect power change deltas. In other words, the focus was not so much on accurate power estimation on a per-core basis; rather, the objective was to set a tighter bound on maximum permissible power consumption and to reduce the noise-related guard band. The overriding concern was that system resilience should not be compromised in the attempt to lower the power cost for the targeted product performance.

The mainframe product space has historically been focused on high performance and high system-level utilization through virtualization of hardware CPU resources. However, in recent product cycles, z Systems have encountered the power wall in the form of current delivery limits, even if dissipation limits have been adequately met (or even extended) through advanced liquid cooling and/or proprietary packaging technology where needed. The growth in processor core count, spurred by significantly smaller single-thread performance growth, has further exacerbated the power delivery problem. At the same time, due to workload diversity across multiple cores and improved degrees of active power reduction via clock-gating, the maximum power swing and attendant inductive noise levels have been on the rise.

In view of the prioritized focus on hardware reliability in z Systems, the new problem of managing power consumption had to be addressed in a manner that also met the challenge posed by the inductive noise-related voltage droops. These motivations resulted in the idea of introducing a new power management unit that would set realistic limits on maximum power consumption, while also reducing the effective voltage-related guard band to protect against noise. As such, the z13 power management architecture represents a unique blend of concepts in power conservation coupled with those that provide immunity against voltage noise.

Conclusion
In this paper, we presented a summary overview of the key features within the z13 power management architecture. As mentioned at the outset, the driving principle of this design has been to provide robust power management without compromising real customer workload performance. The objectives were realized using an approach that avoids the worst-case voltage margins and overly conservative maximum-power assumptions employed by previous generations. Instead, the voltage guard band was lowered by relying on timely sensing of, and response to, voltage droop events to avoid circuit errors. This reduction in chip operating voltage by using dynamic throttling of instruction execution, and the underlying feedback control system, was stress-tested for robustness using an automated worst-case voltage noise generation methodology. Using automated microbenchmark generation, we were able to more effectively tune our activity proxies and, by generating worst-case noisy workloads, to more effectively characterize and stress test the chip to ensure the remaining voltage margin is always sufficient for correct operation. Illustrative experimental results demonstrated the “efficient resilience” principle that was pursued effectively in this project, where the chip, instead of being passively subjected to the workload running on it, is an active participant in ensuring optimal system performance.

Figure 6

Vmin improvements across various chips relative to highest Vmin recorded.

Acknowledgments
The MicroProbe tool development work at IBM Research was partially sponsored by Defense Advanced Research Projects Agency (DARPA), Microsystems Technology Office (MTO), under contract no. HR0011-13-C-0022. The views expressed are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government. This document is: “Approved for Public Release: Distribution Unlimited.” Many individuals across IBM made this work possible. We extend particular thanks to Tilman Gloekler, Mitesh Agrawal, Lee Eisen, Huajun Wen, and Karthick Rajamani.

*Trademark, service mark, or registered trademark of International Business Machines Corporation in the United States, other countries, or both.

**Trademark, service mark, or registered trademark of SoftLayer, Inc., an IBM Company, in the United States, other countries, or both.

References
1. R. Bianchini and R. Rajamony, “Power and energy management for server systems,” IEEE Comput., vol. 37, no. 11, pp. 68–74, Nov. 2004.

2. C. Lefurgy, X. Wang, and M. Ware, “Power capping: A prelude to power shifting,” Cluster Comput. J., vol. 11, no. 2, pp. 183–195, Jun. 2008.

3. M. Floyd, M. Ware, K. Rajamani, T. Gloekler, B. Brock, P. Bose, A. Buyuktosunoglu, J. Rubio, B. Schubert, B. Spruth, J. Tierno, and L. Pesantez, “Adaptive energy-management features of the IBM POWER7 chip,” IBM J. Res. & Dev., vol. 55, no. 3, Paper 8, pp. 8:1–8:18, Mar. 2011.

4. C. Lefurgy, A. Drake, M. Floyd, M. Ware, B. Brock, J. Tierno, J. Carter, and R. Berry, “Active guardband management in Power7+ to save energy and maintain reliability,” IEEE Micro, vol. 33, no. 4, pp. 35–45, Jul./Aug. 2013.

5. M. Floyd, A. Drake, R. Berry, H. Chase, R. Willaman, and J. Pena, “Voltage droop reduction using throttling controlled by timing margin feedback,” in Proc. IEEE Symp. VLSI Circuits, Jun. 2012, pp. 96–97.

6. J. Warnock, C. Berry, M. H. Wood, L. Sigal, Y. Chan, G. Mayer, M. Mayo, Y.-H. Chan, F. Malgioglio, G. Strevig, C. Nagarajan, S. Carey, G. Salem, F. Schroeder, H. H. Smith, D. Phan, R. H. Nigaglioni, T. Strach, M. M. Ziegler, N. Fricke, K. Lind, J. L. Neves, S. H. Rangarajan, J. P. Surprise, J. M. Isakson, J. Badar, D. Malone, D. W. Plass, A. Aipperspach, D. F. Wendel, R. M. Averill, III, and R. Puri, “IBM z13 circuit design and methodology,” IBM J. Res. & Dev., vol. 59, no. 4/5, Paper 15, pp. 15:1–15:15, 2015.

7. B. W. Curran, C. Jacobi, J. J. Bonanno, D. A. Schroter, K. J. Alexander, A. Puranik, and M. M. Helms, “The IBM z13 multithreaded microprocessor,” IBM J. Res. & Dev., vol. 59, no. 4/5, Paper 1, pp. 1:1–1:13, Jul./Sep. 2015, in this issue.

8. A. Drake, R. Senger, H. Deogun, G. Carpenter, S. Ghiasi, T. Nguyen, N. James, M. Floyd, and V. Pokala, “A distributed critical-path timing monitor for a 65 nm high-performance microprocessor,” in Proc. ISSCC Dig. Tech. Papers, Feb. 2007, pp. 398–399.

9. A. Drake, M. Floyd, R. Willaman, D. Hathaway, J. Hernandez, C. Soja, M. Tiner, G. Carpenter, and R. Senger, “Single-cycle, pulse-shaped critical path monitor in the POWER7+ microprocessor,” in Proc. ISLPED, Aug. 2013, pp. 193–198.

10. A. Caldeira, V. Haug, M.-E. Kahle, C. Maciel, and M. Sanchez, “IBM Power Systems S812L and S822L technical overview and introduction,” IBM Redbooks, Armonk, NY, USA, p. 119, Aug. 2014. [Online]. Available: http://www.redbooks.ibm.com/abstracts/redp5098.html?Open

11. P. Restle, R. Franch, N. James, W. Huott, T. Skergan, S. Wilson, N. Schwartz, and J. Clabes, “Timing uncertainty measurements on the POWER5 microprocessor,” in Proc. ISSCC Dig. Tech. Papers, Feb. 2004, pp. 354–355.

12. R. Bertran, A. Buyuktosunoglu, M. Sharma Gupta, M. Gonzàlez, and P. Bose, “Systematic energy characterization of CMP/SMT processor systems via automated microbenchmarks,” in Proc. 45th Annu. IEEE/ACM MICRO-45, Dec. 2012, pp. 199–211.

13. R. Bertran, A. Buyuktosunoglu, P. Bose, T. Slegel, G. Salem, S. Carey, R. Rizzolo, and T. Strach, “Voltage noise in multi-core processors: Empirical characterization and optimization opportunities,” in Proc. 47th Annu. IEEE/ACM MICRO-47, Dec. 2014, pp. 368–380.

14. W. Huang, C. Lefurgy, W. Kuk, A. Buyuktosunoglu, M. Floyd, K. Rajamani, M. Ware, and B. Brock, “Accurate fine-grained processor power proxies,” in Proc. 45th Annu. IEEE/ACM MICRO-45, Dec. 2012, pp. 224–234.

15. R. Bertran, M. González, X. Martorell, N. Navarro, and E. Ayguadé, “A systematic methodology to generate decomposable and responsive power models for CMPs,” IEEE Trans. Comput., vol. 62, no. 7, pp. 1289–1302, Jul. 2013.

16. H. Jacobson, A. Buyuktosunoglu, P. Bose, E. Acar, and R. Eickemeyer, “Abstraction and microarchitecture scaling in early-stage power modeling,” in Proc. Int. Symp. HPCA, Feb. 2011, pp. 394–405.

17. H.-Y. McCreary, M. Broyles, M. Floyd, A. Geissler, S. Hartman, F. Rawson, T. Rosedahl, and J. Rubio, “EnergyScale for IBM POWER6 microprocessor based systems,” IBM J. Res. & Dev., vol. 51, no. 6, pp. 775–786, Nov. 2007.

18. Y. Kim, L. John, S. Pant, S. Manne, M. Schulte, W. Bircher, and M. Govindan, “AUDIT: stress testing the automatic way,” in Proc. 45th Annu. IEEE/ACM MICRO-45, Dec. 2012, pp. 212–223.

19. R. McGowen, C. A. Poirier, C. Bostak, J. Ignowski, M. Millican, W. Parks, and S. Naffziger, “Power and temperature control on a 90-nm Itanium family processor,” IEEE J. Solid-State Circuits, vol. 41, no. 1, pp. 229–237, Jan. 2006.

20. R. Singhal, “Inside Intel Core microarchitecture,” presented at Hot Chips-20, Aug. 2008; presentation HC20.26.630. [Online]. Available: http://www.hotchips.org/wp-content/uploads/hc_archives/hc20/3_Tues/HC20.26.630.pdf

21. E. Rotem, A. Naveh, D. Rajwan, A. Ananthakrishnan, and E. Weissmann, “Power management architecture of the 2nd generation Intel Core microarchitecture, formerly codenamed Sandy Bridge,” presented at Hot Chips-23, Aug. 2011; presentation HC23.19.921. [Online]. Available: http://www.hotchips.org/wp-content/uploads/hc_archives/hc23/HC23.19.9-Desktop-CPUs/HC23.19.921.SandyBridge_Power_10-Rotem-Intel.pdf

22. D. Bouvier, B. Bates, W. Fry, and S. Godey, “AMD ‘Kabini’ APU SOC,” presented at Hot Chips-25, Aug. 2013; presentation HC25.26.111. [Online]. Available: http://www.hotchips.org/wp-content/uploads/hc_archives/hc25/HC25.10-SoC1-epub/HC25.26.111-Kabini-APU-Bouvier-AMD-Final.pdf


Received October 23, 2014; accepted for publication November 16, 2014

Tobias Webel IBM Systems, 71032 Böblingen, Germany ([email protected]). Mr. Webel is a Senior Development Engineer at the IBM Laboratory in Böblingen, Germany. After various roles in the context of pervasive logic design, Mr. Webel is now leading the pervasive architecture for both Power System and z System chips. Starting with the z13 machine, he is also leading the z System power management architecture.

Preetham M. Lobo IBM Systems, Bangalore, 560071 India ([email protected]). Mr. Lobo is a member of the pervasive and power-management logic design team for z System chips. He has worked on logic design in the IBM POWER7+* and POWER8 processors, and on power management logic design for the IBM z13. In addition, he was part of the pervasive “bring-up” team for the IBM zEnterprise EC12 and the IBM z13 processor products.

Ramon Bertran IBM Research, Thomas J. Watson Research Center, Yorktown Heights, NY 10598 USA ([email protected]). Dr. Bertran is a Research Staff Member in the Reliability- and Power-Aware Microarchitectures department. He has been involved in research and development work in support of IBM Power Systems and z Systems. His current focus is on power-aware and noise-aware computer architectures.

Gerard M. Salem IBM Systems, Williston, VT 05495 USA ([email protected]). Mr. Salem specializes in logic design and hardware debugging, and is involved in systems “bring-up” and integration as part of the overall system assurance and validation exercise prior to product readiness and customer shipment.

Malcolm Allen-Ware IBM Research, Austin, TX 78758 USA ([email protected]). Mr. Allen-Ware is a Distinguished Engineer who has been involved with power-management techniques across all of IBM systems including POWER* (Power System), mainframe (z System), and storage-related system offerings. In addition, he is now involved in Watson Software as a Service (SaaS) and POWER8 Cloud Infrastructure-as-a-Service (IaaS) based optimizations for SoftLayer** GTS Data Centers.

Richard Rizzolo IBM Systems, Poughkeepsie, NY 12601 USA ([email protected]). Mr. Rizzolo is a Senior Technical Staff Member who has served as the z13 characterization lead. He specializes in test, characterization, and diagnostics as part of system assurance and validation prior to product shipment.

Sean M. Carey IBM Systems, Poughkeepsie, NY 12601 USA ([email protected]). Mr. Carey is part of the hardware design team for z System processor products and has also served as part of the “bring-up” and characterization team for the z13 product.

Thomas Strach IBM Systems, 71032 Böblingen, Germany ([email protected]). Dr. Strach specializes in chip packaging, power noise analysis, and associated simulation-based modeling. He has been involved in post-silicon hardware test, characterization, and “bring-up” related work for the z13.

Alper Buyuktosunoglu IBM Research, Thomas J. Watson Research Center, Yorktown Heights, NY 10598 USA ([email protected]). Dr. Buyuktosunoglu is a Research Staff Member in the Reliability- and Power-Aware Microarchitectures department. His current focus is on microarchitecture definition and robust power management for Power Systems and z Systems.

Charles Lefurgy IBM Research, Austin, TX 78758 USA ([email protected]). Dr. Lefurgy is a Research Staff Member within the Computing as a Service Technology (CAST) Group. His current focus is on power management for servers and data centers.

Pradip Bose IBM Research, Thomas J. Watson Research Center, Yorktown Heights, NY 10598 USA ([email protected]). Dr. Bose is a Distinguished Research Staff Member and manager of the Reliability- and Power-Aware Microarchitectures Department. His primary responsibility is to supervise advanced research and development in support of power-efficient, reliable processor design within Power System and z System product offerings.

Ricardo Nigaglioni IBM Systems, Austin, TX 78758 USA. Mr. Nigaglioni is a circuit design engineering professional, specializing in digital circuit design. He was part of the z13 design team where he also served as the power lead of the z13 processor chip.

Timothy Slegel IBM Systems, Poughkeepsie, NY 12601 USA ([email protected]). Mr. Slegel is a Distinguished Engineer within the z System Processor Development team at IBM. He has been involved through many years in areas related to processor design, verification, and architectural stress-test generation exercises in support of z System processor products.

Michael S. Floyd IBM Systems, Austin, TX 78758 USA ([email protected]). Mr. Floyd is a Power Systems EnergyScale* hardware architect and has served as the power management hardware lead for several generations of Power System products. In the z13 power management unit design, Mr. Floyd served in an advisory role in which key experiences gleaned from Power Systems were used to help the z13 team avoid pitfalls that might have resulted in product delays.

Brian W. Curran IBM Systems, Poughkeepsie, NY 12601 USA ([email protected]). Mr. Curran is a Distinguished Engineer within the z System Processor Development team at IBM. He is a lead processor core designer in that team.

