runtime power measurement/modeling and thermal modeling
DESCRIPTION
Runtime Power Measurement/Modeling and Thermal Modeling. Research Seminar, Canturk ISCI. Motivation: power matters. Performance improves exponentially, and so does power density. Chip areas increase 7%/year. Battery life improves much more slowly. Thermal issues follow power density.
![Page 1: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/1.jpg)
Runtime Power Measurement/Modeling and Thermal Modeling
Research Seminar
Canturk ISCI
![Page 2: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/2.jpg)
2
MOTIVATION
Power Matters!
  Performance improves exponentially; SO DOES POWER DENSITY
  Chip areas increase 7%/year
  Battery Life: Improves Much Slower
  Thermal Issues
    Follows power density
    Packaging costs: +$1/W over ~40W
Need good Measurement/Modeling techniques for Power & Thermally aware/adaptive systems
  Using Measurement to probe microarchitectural details
    CASTLE, data activity experiment
  Compiler Level Power Optimizations
    SW Power Profiling and Optimization
  Power aware OS
    power modeling for decision making
  Dynamic thermal/power management
    Thermal hotspots & Power threshold
![Page 3: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/3.jpg)
3
MOTIVATION
Power Models reflecting modern processors
  Clock gating, power
  Voltage regulation, di/dt
Need for Fast-Realtime Modeling and Measurement to observe long time periods
  Thermal time constants: O(s)
  Not feasible even with architectural simulators
    i.e.: 1s of real run ~5 x IPC hrs of WATTCH simulation
Need live, run-time power/thermal measures
  Dynamic Thermal Management
  Power-Aware OS & Systems control
![Page 4: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/4.jpg)
4
THE BIG PICTURE
To estimate component power & temperature breakdowns for P4 at runtime…
Bottom line…
[Diagram: Performance Monitoring, Real Power Measurement, Power Modeling, Thermal Modeling]
![Page 5: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/5.jpg)
5
Remainder of Talk
[Diagram: Performance Monitoring, Real Power Measurement, Power Modeling, Thermal Modeling]
Related Work
Performance Monitoring
  P4 Performance Counters
  Performance Reader LKM
Real Power Measurement
  P4 Power Measurement Setup
  Examples
Power Modeling
  P4 Power Model
  Model + Measurement Sync Setup, Verification
Thermal Modeling
  Refined Thermal Model
  Ex: Ppro Thermal Model
![Page 6: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/6.jpg)
6
RELATED WORK
Implementing counter readers:
  PCL [Berrendorf 1998], Intel VTune, Brink & Abyss [Sprunt 2002]
Using counters for Performance:
  HPC [Crummey 2001], CPU profilers
Using counters for Power:
  CASTLE [Joseph 2001], power profilers
  event driven OS/cruise control [Bellosa 2000, 2002]
Real Power Measurement:
  Compiler Optimizations [Seng 2003]
  Cycle-accurate measurement with switch caps [Chang 2002]
![Page 7: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/7.jpg)
7
RELATED WORK
Power Management and Modeling Support:
  Instruction level energy [Tiwari 1994]
  PowerScope: Procedure level energy [Flinn 1999]
  Event counter driven energy coprocessor [Haid 2003]
  Power-breakdown driven energy reduction [Huang 2001]
  Virtual Energy Counters for Mem. [Kadayif 2001]
  ECOsystem: OS energy accounting [Ellis 2002]
Thermal Management and Modeling Support:
  PID based DTM [Skadron 2002]
  Architectural Thermal Model [Skadron 2003]
  Evaluating DTM techniques [Brooks 2001]
![Page 8: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/8.jpg)
8
Milestone 1: Performance Monitoring
![Page 9: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/9.jpg)
9
Live CPU Performance Monitoring with Hardware Counters
Most CPUs have hardware performance counters
P4 Performance Monitoring HW:
  18 Event Counters
  18 Counter Configuration Control Registers: configure how to count
  45 Event Selection Control Registers: configure what to count
  Additional Control Registers
![Page 10: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/10.jpg)
10
Counter Overview
Counting Types
  Non-retirement
  At-Retirement: can count BOGUS vs NBOGUS, tag uops, etc.
    Mechanisms: Front end tagging, Execution tagging, Replay tagging, No tags
  Also: Event Counting, Event Based Sampling, Precise EBS
Event Types
  59 event classes
  100s of events to count
  Metric Classifications:
    General (Ex: Speculative Uops retired)
    Branching (Ex: Mispredicted conditionals)
    Trace Cache and Front End (Ex: Processor N deliver mode)
    Memory (Ex: MOB Load replays)
    Bus (Ex: Prefetch bus accesses)
    Characterization (Ex: Packed SP retired)
    Machine Clear (Ex: Memory Order Machine Clear)
![Page 11: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/11.jpg)
11
Our Event-Counter: Performance Reader
Performance Reader implemented as Linux Loadable Kernel Module
Implements 6 syscalls:
  select_events()
  reset_event_counter()
  start_event_counter()
  stop_event_counter()
  get_event_counts()
  set_replay_MSRs()
User Level Interface:
  Defines the events
  Starts counters
  Stops counters
  Reads counters & TSC
![Page 12: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/12.jpg)
12
Performance Reader: Example Validation
L1_Dcache benchmark
  Controls cache hit behavior
  Validated against measured cache events
  Vary hit rate from 0-100%
[Figure: L1 Hit Rate Experiment. x-axis: Desired Hit Rate (Benchmark Input), 0.1 to 1; y-axis: Acquired Hit Rates, 0% to 120%. Series: Ideal Hit Rate, Acquired L1 Hit Rate, L1 hit rate from L2 Access.]
![Page 13: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/13.jpg)
13
Milestone 2: Real Power Measurement
![Page 14: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/14.jpg)
14
P4 Power Measuring Setup
Clamp ammeter on 12V lines on measured CPU
1mV/Adc conversion
DMM reading clamp voltages
Voltage readings via RS232 to logging machine
Serial Reader (PowerMeter) (PowerPlotter)
Convert to Power vs. time window
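As a rough illustration of the conversion step in this setup, the sketch below turns clamp readings into power. It assumes only what the slide states (a 1 mV per ampere clamp on the 12 V CPU supply lines); the function names, the window length, and the sample values are hypothetical, and any regulator loss between the 12 V lines and the CPU core is ignored because the slides do not discuss it.

```python
# Minimal sketch, assuming a 1 mV/A DC clamp on the 12 V supply lines.
CLAMP_MV_PER_AMP = 1.0   # clamp conversion factor from the slide
SUPPLY_VOLTS = 12.0      # voltage of the measured supply lines


def clamp_mv_to_watts(clamp_mv, mv_per_amp=CLAMP_MV_PER_AMP,
                      supply_volts=SUPPLY_VOLTS):
    """Convert one DMM reading of the clamp output (in mV) to watts."""
    current_amps = clamp_mv / mv_per_amp
    return current_amps * supply_volts


def windowed_power(samples_mv, window):
    """Average power over consecutive, non-overlapping windows of samples."""
    watts = [clamp_mv_to_watts(mv) for mv in samples_mv]
    return [sum(watts[i:i + window]) / window
            for i in range(0, len(watts) - window + 1, window)]


if __name__ == "__main__":
    readings_mv = [1.0, 3.0, 2.0, 4.0]   # hypothetical DMM readings
    print(windowed_power(readings_mv, window=2))
```

A reading of 4 mV corresponds to 4 A on the 12 V lines, that is 48 W under these assumptions.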
![Page 15: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/15.jpg)
15
PowerPlotter: Example
[Figure: measured power vs. time trace with annotated phases: Initialization; Benchmark Execution; “Fast”; “Branch exercise” (Taken rate: 1); “High-Low”; “L1Dcache” with array size 1/100 of L1; “L1Dcache” with array size x25 of L1 (~L2); “L1Dcache” with array size x4 of L2.]
![Page 16: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/16.jpg)
16
SPEC Power Examples
Different programs show very different power characteristics
Timescale of interest can be huge => inaccessible via simulation
[Figure: Spec GCC (O3) with specrun -a run. Power [W], 0 to 80, vs. time (s), 0 to 200.]
[Figure: Spec VPR (O3) with specrun -a run. Power [W], 0 to 60, vs. time (s), 0 to 500.]
![Page 17: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/17.jpg)
17
Milestone 3: Power Modeling
![Page 18: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/18.jpg)
18
P4 POWER MODEL
Define Components: define the components (e.g. L1 cache, BPU, Regs, etc.) whose powers we’ll model, from annotated layout
Define Events: determine the combination of P4 events that best represents component accesses
Performance Monitoring: gather counter info with minimal power overhead and program interruption
Power Modeling: convert counter info into component power breakdowns
Real Power Measurement: verify total power against measured processor power
![Page 19: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/19.jpg)
19
Defining Components
![Page 20: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/20.jpg)
20
Defining Components
![Page 21: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/21.jpg)
21
Defining Events -> Access Rates
We determined 24 events to approximate access rates for 22 components
Used several heuristics to represent each access rate
Ex: 2nd Level BPU:
  Metric 1: Instructions fetched from L2 (predict)
    Event: ITLB_Reference (counts ITLB translations)
    Mask: All hits, misses
  Metric 2: Branches retired (history update)
    Event: branch_retired (counts branches retired)
    Mask: Count all Taken/NT/Predicted/MissP
Need to rotate counters to collect all event data: used 15 counters & 4 rotations
![Page 22: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/22.jpg)
22
Access Rates -> Component Powers
We gather counter data at measured computer via the tiny counter reader
We send the access rates to logger machine
  Don’t want to do any computation at host
Logger machine converts access rates to the component power breakdowns
  Computation done externally, still at runtime
Access rates used as proxy to max component power weighting together with microarchitectural details
  EX: Trace cache delivers 3 uops/cycle max
    Power(TC) = Access-Rate(TC)/3 * MaxPower(TC) + Non-gated TC CLK power
![Page 23: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/23.jpg)
23
Generic Equation
Power(Component) = Access-Rate(Component) x Microarchitectural Scaling x MaxPower(Component) + Non-gated component Clock power
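The generic equation (component power = access rate x microarchitectural scaling x max power + non-gated clock power) can be sketched in a few lines. This is only an illustration: the function names are hypothetical, the scaling is assumed to be 1 / (maximum access rate) as in the trace cache example (3 uops/cycle max), and the wattages below are made-up placeholders rather than values from the talk.

```python
def component_power(access_rate, max_rate, max_power, nongated_clock_power):
    """Power(C) = AccessRate(C) x scaling x MaxPower(C) + non-gated clock power.

    The microarchitectural scaling is assumed to be 1 / max_rate, so that a
    fully utilized component draws max_power plus its non-gated clock power.
    """
    utilization = access_rate / max_rate
    return utilization * max_power + nongated_clock_power


def total_power(components):
    """Sum the modeled power over (access_rate, max_rate, max_power, clk) tuples."""
    return sum(component_power(*c) for c in components)


if __name__ == "__main__":
    # Hypothetical trace cache: 1.5 uops/cycle delivered out of a maximum of 3,
    # with placeholder MaxPower = 4.0 W and non-gated clock power = 1.0 W.
    print(component_power(1.5, 3.0, 4.0, 1.0))
```

The sum over all components is what gets compared against the measured total processor power.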
![Page 24: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/24.jpg)
24
Experiment Setup – Recall:
Clamp ammeter on 12V lines on measured CPU
1mV/Adc conversion
DMM reading clamp voltages
Voltage readings via RS232 to logging machine
Serial Reader (PowerMeter) (PowerPlotter)
Convert to Power vs. time window
![Page 25: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/25.jpg)
25
Experiment Setup
Voltage readings via RS232 to logging machine
1mV/Adc conversion
![Page 26: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/26.jpg)
26
Experiment Setup
POWERCLIENT
POWERSERVER
1mV/Adc conversion
Voltage readings via RS232 to logging machine
Component access rates over ethernet
Convert voltage to measured power
Convert access rates to modeled powers
Sync together in time window
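The slides do not say how the measured and modeled streams are synchronized, so the following is only one plausible sketch: both streams are assumed to carry timestamps on a common clock, samples are averaged into fixed-length windows, and only windows present in both streams are paired. All names and sample values are hypothetical.

```python
from collections import defaultdict


def bucket_mean(samples, window_s):
    """Average (timestamp, value) samples into windows of window_s seconds."""
    sums = defaultdict(lambda: [0.0, 0])
    for timestamp, value in samples:
        bucket = int(timestamp // window_s)
        sums[bucket][0] += value
        sums[bucket][1] += 1
    return {b: total / count for b, (total, count) in sums.items()}


def sync(measured, modeled, window_s):
    """Pair measured and modeled power for every window both streams cover."""
    m = bucket_mean(measured, window_s)
    p = bucket_mean(modeled, window_s)
    return [(b * window_s, m[b], p[b]) for b in sorted(m.keys() & p.keys())]


if __name__ == "__main__":
    measured = [(0.1, 40.0), (0.6, 44.0), (1.2, 50.0)]   # (seconds, watts)
    modeled = [(0.5, 41.0), (1.5, 48.0), (2.5, 30.0)]
    for start, meas, model in sync(measured, modeled, window_s=1.0):
        print(start, meas, model)
```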
![Page 27: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/27.jpg)
27
Area Based Power Estimate – Total Power Result
[Figure: Measured vs. Modeled total power over time for the benchmarks “Fast”, “Branch exercise” (Taken rate: 1), “High-Low”, and “L1Dcache” (Hit Rate: 0.1).]
![Page 28: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/28.jpg)
28
After Tuning?
[Figure: Measured vs. Modeled total power over time for the benchmarks “Fast”, “Branch exercise” (Taken rate: 1), “High-Low”, and “L1Dcache” (Hit Rate: 0.1).]
![Page 29: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/29.jpg)
29
Component Breakdowns
Component Breakdowns for “branch_exercise”
Colors for 4 CPU subsystems
[Figure legend, partially recovered: Issue - Retire; Execution]
![Page 30: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/30.jpg)
30
SPEC Results
[Figure: Measured vs. Modeled power for Gcc, Gzip, Vpr, Vortex, Gap, Crafty.]
![Page 31: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/31.jpg)
31
Milestone 4: Thermal Modeling
![Page 32: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/32.jpg)
32
THERMAL MODELING: A Basic Model
Based on lumped R-C model from packaging
Built upon power modeling
  Sampled Component Powers
  Respective component areas
  Physical processor Parameters
  Packaging
  Heat Transfer
[Figure: die with blocks Blk_i, Blk_j, Blk_k, Blk_l under a heatsink. Each block x has temperature T_b,x, thermal resistance R_th,x, thermal capacitance C_th,x, and power P_x; the heatsink node has T_h, R_th,h, C_th,h and receives P_i+P_j+P_k+P_l.]

$$P_i = C_{th,i}\,\frac{dT_i}{dt} + \frac{T_i}{R_{th,i}}, \qquad T_i = T_{b,i} - T_h$$

Final difference equation:

$$\Delta T_i = \frac{\Delta t}{C_{th,i}}\,P_i - \frac{\Delta t}{R_{th,i}\,C_{th,i}}\,T_i$$

Δt: Sampling interval
T_i: The temperature difference between block and the heatsink
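To make the final difference equation concrete (ΔT_i = Δt/C_th,i · P_i − Δt/(R_th,i · C_th,i) · T_i, with T_i the block-to-heatsink temperature difference), here is a minimal sketch that applies it once per power sample. The function names and the R, C, and power values are hypothetical placeholders, not parameters from the talk.

```python
def step_block_temperature(t_diff, power, r_th, c_th, dt):
    """Apply one update of dT = dt/C * P - dt/(R*C) * T to a single block.

    t_diff is the block temperature minus the heatsink temperature.
    """
    delta = (dt / c_th) * power - (dt / (r_th * c_th)) * t_diff
    return t_diff + delta


def simulate_block(power_samples, r_th, c_th, dt, t_diff0=0.0):
    """Return the temperature difference after each sampled power value."""
    history = []
    t_diff = t_diff0
    for power in power_samples:
        t_diff = step_block_temperature(t_diff, power, r_th, c_th, dt)
        history.append(t_diff)
    return history


if __name__ == "__main__":
    # Placeholder values: R = 2 K/W, C = 0.5 J/K (time constant 1 s), 10 W.
    trace = simulate_block([10.0] * 2000, r_th=2.0, c_th=0.5, dt=0.01)
    print(trace[0], trace[-1])
```

With constant power the difference settles at P · R_th, which is a quick sanity check on the update; the explicit step is only well behaved when Δt is small compared with R_th · C_th.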
![Page 33: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/33.jpg)
33
Refined Thermal Model
Steady State Analysis reveals that the Heatsink-Die abstraction is not sufficient for real systems
Proceeding to a multilayer thermal model:
  Active die thickness, metallization/insulation, chip-package interface, package, heatsink
  Requires searching of several materials/dimensions and thermal properties
  Multiple layers, multiple T nodes, multiple DEs
Baseline Heat removal Structure: Heatsink, Thermal Grease, Heat Spreader, Package, Die
![Page 34: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/34.jpg)
34
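Physical Structure vs. Thermal Model
[Figure: physical stack (Ambient Airflow, Heatsink, Thermal Grease, Heat Spreader, Package, Die) shown beside the equivalent R-C network. Network nodes: Ambient Temperature T_A; heatsink T_h, R_h, C_h, coupled to ambient through R_hXA; heat spreader T_spr, R_spr, C_spr, coupled through R_grXspr and carrying P_total; package T_p,i, R_p,i, C_p,i; die T_die,i, R_die,i, C_die,i with input power P_i.]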
![Page 35: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/35.jpg)
35
Analytical Derivation
4 Nodes, 4 DEs
1) Tspr:

$$P_{total} = C_{spr}\,\frac{dT_{spr}}{dt} + \frac{T_{spr} - T_h}{R_{grXspr}}$$

Discretizing time:

$$P_{total} = C_{spr}\,\frac{\Delta T_{spr}}{\Delta t} + \frac{T_{spr} - T_h}{R_{grXspr}}$$

Final difference equation:

$$\Delta T_{spr} = \frac{\Delta t}{C_{spr}}\,P_{total} - \frac{\Delta t}{C_{spr}\,R_{grXspr}}\,\left(T_{spr} - T_h\right)$$

[Figure: the R-C network of the previous slide, repeated: T_A, R_hXA; T_h, R_h, C_h; T_spr, R_spr, C_spr, R_grXspr, P_total; T_p,i, R_p,i, C_p,i; T_die,i, R_die,i, C_die,i, P_i.]
![Page 36: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/36.jpg)
36
EX: Ppro Thermal Model
Use CASTLE [Joseph, 2001] computed component powers
Determine component areas from Die photo
Determine processor/packaging physical parameters
Generate numerical thermal model
Apply component difference equations recursively along power flow
[Figure: update order along the power flow: Update Tdie,i, Update Tp,i, Update Tspr, Update Th.]
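The update order above (die, package, spreader, heatsink) can be sketched as a chain of R-C nodes. This is a simplified illustration, not the model used in the talk: it assumes a single die block, assumes every node follows the same form as a difference equation ΔT = Δt/C · (P − (T − T_downstream)/R) with the same input power P, and uses placeholder values for every R, C, and temperature. The function names are hypothetical.

```python
def step_chain(temps, power, resistances, capacitances, t_ambient, dt):
    """Advance all nodes by one sampling interval.

    Lists are ordered along the power flow: die, package, spreader, heatsink.
    resistances[n] couples node n to the next node; the last node (heatsink)
    couples to ambient. Downstream temperatures from the previous step are
    used for every node.
    """
    new_temps = list(temps)
    for n in range(len(temps)):
        downstream = temps[n + 1] if n + 1 < len(temps) else t_ambient
        flow_out = (temps[n] - downstream) / resistances[n]
        new_temps[n] = temps[n] + (dt / capacitances[n]) * (power - flow_out)
    return new_temps


def simulate_chain(power_samples, resistances, capacitances, t_ambient, dt):
    """Start every node at ambient and apply one update per power sample."""
    temps = [t_ambient] * len(resistances)
    for power in power_samples:
        temps = step_chain(temps, power, resistances, capacitances,
                           t_ambient, dt)
    return temps


if __name__ == "__main__":
    # Placeholder parameters: four identical layers, 10 W, 25 C ambient.
    final = simulate_chain([10.0] * 5000, [1.0] * 4, [1.0] * 4,
                           t_ambient=25.0, dt=0.01)
    print(final)   # die, package, spreader, heatsink temperatures
```

At steady state each node sits P · R above the node downstream of it, so the die ends up hottest and the heatsink closest to ambient.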
![Page 37: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/37.jpg)
37
Simulation Outputs
Thermal nodes updated every Δt ~20ms
Component temperatures build up to ~350K in ~5hrs
T_heatsink moves very slowly, as expected
[Figure: Pentium Pro Thermal Simulation. Bar chart of Temperature (C), 0 to 80, at startup and after 5 hours, for Ambient, Heatsink, Heat Spreader, Decode, Issue, Reorder, DMem, IMem, FUs, Other.]
![Page 38: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/38.jpg)
38
SUMMARY
[Diagram: Performance Monitoring, Real Power Measurement, Power Modeling, Thermal Modeling]
![Page 39: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/39.jpg)
39
Conclusions
Contributions:
  Portable runtime real power measurement system
  Performance counter based runtime power & thermal model and runtime verification with synchronous real power measurement
  Thermal model, which can be applied to ANY power model with good physical characterization, as long as physical component based power breakdowns are used
  Runtime modeling & measurement system for arbitrarily long timescales!
Outcomes:
  We can do reasonably accurate real power measurements at runtime without interfering with HW
  We can perform runtime power modeling with the tiny performance reader, without inducing any significant overhead to power profile
![Page 40: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/40.jpg)
40
What to do next?
Keep tuning for SPECs <1st Stop>
  Try regression at several corners
    Won’t do well due to clk gating??
  Get data from Intel?
  Try runtime self updating model?
  Compare all to actual data
  Experiment with µarch., evaluate several power properties <2nd Stop>
Add thermal
  Try to add lateral heat diffusion
  Get Contour results <New bkmrk> <3rd Result>
P4 thermal monitor stuff
  Could be played from kernel to modulate clock
  Can we use with our models to do power savings on REAL HW??
![Page 41: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/41.jpg)
41
![Page 42: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/42.jpg)
42
RELATED WORK – performance monitoring
implementing counter readers:
PCL: Performance Counter Library, by Rudolf Berrendorf (University of Applied Sciences Bonn-Rhein-Sieg), Heinz Ziegler, and Bernd Mohr at the Central Institute for Applied Mathematics (ZAM) at the Research Centre Juelich, Germany
  Uniform interface for several architectures (Intel Pentium, MMX, Pro, III, 4/Linux; IBM Power3, Power3-II/AIX; etc.)
  Software library with C, C++, Java & Fortran bindings
  Kernel patch (Mikael Pettersson), recompile
PAPI: Performance Application Programming Interface Project, by Jack Dongarra, Kevin London, Shirley Moore, Philip Mucci, etc., at Innovative Computing Lab, CS dept., University of Tennessee
  Standard simple high level API and low level programmable interface
  Supports Pentium, MMX, Pro, III/Linux, Windows; Power 3,4/AIX; etc.
  PerfCtr kernel patch (Mikael Pettersson), recompile
![Page 43: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/43.jpg)
43
RELATED WORK – performance monitoring
implementing counter readers:
Perfmon: Performance Monitoring Tool by Richard Enbody, Associate Professor, Department of Computer Science and Engineering, Michigan State University
  For SUN Ultra-Sparc & Ppro
  Device Driver (LKM)
Rabbit: Performance Counters Library by Don Heller, Scalable Computing Laboratory, Iowa State University
  For Intel Pentium MMX, Pro, II, III/Linux; AMD/Linux
  Functions to access from within C
  Cleanest of all, but still ~30 files & ~50 instructions
  LKM
Intel’s VTune Performance analyzer
  Windows & Linux <New>
IBM’s HPM toolkit
  Power 3,4/AIX
Brink and Abyss: Pentium 4 Performance Counter Tools For Linux, by Brinkley Sprunt, Electrical Engineering, Bucknell University
  brink: high level perl script to read experiment/config files
  abyss: C program to access counters
  abyss_dev: device driver for counter access
  EBS kernel patches: to handle PMIs
![Page 44: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/44.jpg)
44
RELATED WORK – performance monitoring
using counter readers:
CASTLE Project by Margaret Martonosi and Russ Joseph, Princeton University
  Acquire Ppro counter data to model component power breakdowns
Frank Bellosa, “Benefits of Event Driven energy Accounting in Power Sensitive Systems”, 9th SIGOPS European workshop, 2000
  Counters to show power ~ k x instr-ns/cycle (PII)
  OS power optimizations: throttle down CPU/extend thread time for cache hit/slow down CPU core if main memory is accessed
Andreas Weissel, Frank Bellosa, “Process Cruise Control: Event driven clock scaling for dynamic power management”, CASES 2002
  Use event counters info to scale individual thread frequencies
  Intel Xscale / Modified Linux kernel
![Page 45: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/45.jpg)
45
RELATED WORK – performance monitoring
using counter readers:
HPC Toolkit, by John Mellor-Crummey, Rob Fowler, CS Dept., Rice University
  Uses perf counter data for profiling
  Converts raw profiling information into platform independent XML formats and produces performance metric correlations from multiple sources
  Used in compiler optimizations
Jennifer Anderson, et al, “Continuous Profiling: Where Have All the Cycles Gone?”, ACM Transactions on Computer Systems, Vol. 15, No. 4, November 1997, pp. 357 - 390.
  Performance analysis example – from DEC
  Data collection by counter sampling, performance info from program level to individual instructions
![Page 46: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/46.jpg)
46
RELATED WORK – real power
CASTLE Project by Margaret Martonosi and Russ Joseph, Princeton University
  Shunt R over Ppro power lines to measure total processor power
John Seng, Dean Tullsen, “Effect of compiler optimizations on Pentium 4 Power consumption”, 7th Annual Workshop on Interaction between Compilers and Computer Architectures, February, 2003
  Shunt R between VRM and CPU
Marc A. Viredaz, Deborah A. Wallach, “Power Evaluation of Itsy Version 2.3”, tech. note TN-57, WRL, Compaq Computer Corp., 2000
  Similar series R to estimate battery life of Itsy pocket computer
![Page 47: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/47.jpg)
47
RELATED WORK – real power
Frank Bellosa, “Benefits of Event Driven energy Accounting in Power Sensitive Systems”, 9th SIGOPS European workshop, 2000
  Crude current measurement with DMM for Pentium II to help define per instruction powers
Andreas Weissel, Frank Bellosa, “Process Cruise Control: Event driven clock scaling for dynamic power management”, CASES 2002
  Series sense resistor added to Intel IQ 80310 evaluation platform power supply, to measure energy effect of frequency scaling
Naehyuck Chang, Kwanho Kim, and Hyun Gyu Lee, "Cycle-Accurate Energy Consumption Measurement and Analysis: Case Study of ARM7TDMI", ISLPED 2000 & IEEE Transactions on VLSI Systems, Vol. 10, pp. 146 - 154, Apr., 2002.
  Cycle accurate energy consumption measurement based on charge transfer
  Inserts switch caps between power supply and processor that switch with the same clock frequency!!
![Page 48: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/48.jpg)
48
RELATED WORK – power model
Simulation Tools:
WATTCH, by David Brooks and Margaret Martonosi, Princeton University, ISCA 2000
  Architectural power simulator
  Power models integrated upon SimpleScalar
SimplePower, by W. Ye, N. Vijaykrishnan, M. Kandemir, and M. Irwin, Penn-State University, “The Design and Use of SimplePower: A cycle-accurate energy estimation tool”, DAC, June 2000
  Execution driven, cycle accurate, RTL power estimation
  Emulates 5 stage pipe with SimpleScalar’s Integer ISA
![Page 49: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/49.jpg)
49
RELATED WORK – power model
Power Modeling:
R. Joseph and M. Martonosi, “Run-Time Power Estimation in High Performance Microprocessors”, International Symposium on Low Power Electronics and Design, 2001
  Complete CASTLE Project: collects Ppro counter data and models component power breakdowns, verifying against measured total power
  Also Wattch simulation vs. counter approximation for SimpleScalar architecture
Russ Joseph, David Brooks, and Margaret Martonosi, "Live, Runtime Power Measurements as a Foundation for Evaluating Power/Performance Tradeoffs", Workshop on Complexity Effective Design (WCED, held in conjunction with ISCA-28), 2001
  Evaluate power vs. performance by measuring total power and acquiring performance data from counters, i.e. cache hit rate, branch prediction, bitline activity
![Page 50: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/50.jpg)
50
RELATED WORK – power model
H. Zeng, X. Fan, C. Ellis, A. Lebeck, and A. Vahdat, “ECOSystem: Managing Energy as a First Class Operating System Resource”, Proceedings of ASPLOS X, Oct. 2002
  Uses Currentcy Model (fixed power & time budget for a task) for OS level energy management for battery life
  ECOsystem is the Linux OS implementation <No counters>
  Considers CPU ON/OFF; could do better with power model
H. Zeng, C. Ellis, A. Lebeck, A. Vahdat, “Currentcy: Unifying Policies for Resource Management”, USENIX 2003 Annual Technical Conference
  Detailed description of currentcy (OS scheduling, etc.)
Flinn J., Satyanarayanan, M., “PowerScope: A Tool for Profiling the Energy Usage of Mobile Applications”, Proceedings of the Second IEEE Workshop on Mobile Computing Systems and Applications, February, 1999
  Maps energy to program structure (Power Profiling – Energy efficient SW design)
  DMM gets energy for machine
  Kernel modification (system monitor) gets PIDs for processes and identifies procedures for profiling offline
![Page 51: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/51.jpg)
51
RELATED WORK – power model
V. Tiwari, S. Malik, and A. Wolfe, “Power analysis of embedded software: A first step towards software power minimization”, International Conference on Computer-Aided Design & IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 1994
  PIONEER WORK in Power Measurement/Modeling
  Measure current drawn by an Intel 486DX2 Processor and DRAM
  Generate energy cost table for instructions
  Identify inter-instruction effects: circuit state overhead, resource constraint effect, cache miss effects
  There are 1 million like this: modeling SW energy, I won’t put here
Lee, A. Ermedahl, and S. Min, “An accurate instruction-level energy consumption model for embedded risc processors”, ACM SIGPLAN Conf. on Languages, Compilers, and Tools for Embedded Systems (LCTES'01), Jun 2001
  Derives energy consumption for instructions rather than functional units for RISC ARM7TDMI processor
  Uses their cycle-accurate power measurement scheme
  Black box approach (similar to F. Bellosa) with linear regression
![Page 52: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/52.jpg)
52
RELATED WORK – power model
J. Russell and M.F. Jacome, "Software Power Estimation and Optimization for High Performance, 32-bit Embedded Processors," Proc. of ICCD '98
  Estimates SW energy for i960 family 32 bit embedded RISC processors
  Uses digitizing oscilloscope/series resistor over processor power lines for measurement
  Uses const Pest for processor power and estimates energy based on runtime (won’t work with clock gating!)
J. Haid, G. Kafer, et al, "Run-Time Energy Estimation in System-On-a-Chip Designs", ASP-DAC 2003
  Proposes a coprocessor for runtime energy estimation for SoC
  Defines similar event counters in coprocessor and uses power macro-models
M. Lajolo, A. Raghunathan, S. Dey, L. Lavagno, and A. Sangiovanni-Vincentelli, “Efficient power estimation techniques for hw/sw systems”, IEEE Proc. VOLTA'99 International Workshop on Low Power Design, pages 191--199, March 1999.
  Power estimation for HW/SW SoC designs
  RTL HW simulator and Instruction Set simulator using instruction level power models
![Page 53: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/53.jpg)
53
RELATED WORK – power model
M. Huang, J. Renau, and J. Torrellas. “Profile-based energy reduction in high-performance processors”, In 4th Workshop on Feedback-Directed and Dynamic Optimization, December 2001
  Use profiling to determine when to activate/deactivate low power methods, i.e. DVS, clock gating, etc.
  Use energy statistics (power breakdowns) from performance counters for profiling (SIM)
I. Kadayif, T. Chinoda, M. Kandemir, N. Vijaykrishnan, M. J. Irwin, A. Sivasubramaniam, “vEC: virtual energy counters”, Proceedings of the 2001 ACM SIGPLAN-SIGSOFT workshop on Program analysis for software tools and engineering, 2001
  Uses Perfmon library for UltraSPARC to read SPARC HW perf counters related to memory
  Converts readings to power using an analytical memory energy model
  Estimates memory system energy consumption
![Page 54: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/54.jpg)
54
RELATED WORK – power model
Luca Benini et al
“System-level power estimation and optimization”, Proceedings 1998 international symposium on Low power electronics and design
“System-level power optimization: techniques and tools”, Proceedings of international symposium on Low power electronics and design, 1999
  Tutorial on power conscious system level design
  Memory optimizations, hardware software partitioning, instruction level power optimizations, DVS, DPM (allow components to sleep)
“Supporting system-level power exploration for DSP applications”, Proceedings of the 10th Great Lakes Symposium on VLSI, 2000
  Modified ARM simulator for instruction level power estimation
![Page 55: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/55.jpg)
55
RELATED WORK – thermal model
K. Skadron, T. Abdelzaher, and M. R. Stan. “Control-theoretic techniques and thermal-RC modeling for accurate and localized dynamic thermal management”, In Proc. HPCA-8, pages 17--28, Feb. 2002.
  Single degree, component based thermal R-C model for MIPS R10000 scaled to 0.18 µm
  Only die-to-heatsink thermal conduction, with const. heatsink and Si properties only
  Power/thermal simulation using Wattch for verification of DTM with PID controller
Sabry, M.-N.; Bontemps, A.; Aubert, V.; Vahrmann, R, “Realistic and efficient simulation of electro-thermal effects in VLSI circuits”, Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, Volume: 5 Issue: 3, Sep 1997
  Transistor level with interdevice thermal resistances
Szekely, V.; Poppe, A.; Pahi, A.; Csendes, A.; Hajas, G.; Rencz, M, “Electro-thermal and logi-thermal simulation of VLSI designs”, Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, Volume: 5 Issue: 3, Sep 1997
  LOGITHERM simulator module for gate level thermal simulation, by thermal characterization of logic gates
![Page 56: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/56.jpg)
56
RELATED WORK – thermal model
COSMOS/FloWorks by NIKA
  Fluid flow and thermal analysis program
  Heat flow computation based on mesh analysis
A. Dhodapkar, C. H. Lim, G. Cai, and W. R. Daasch. “TEMPEST: A thermal enabled multi-model power / performance estimator”, Proceedings of Workshop on Power-Aware Computer Systems, Nov. 2000.
  Thermally enabled architectural simulator based on SimpleScalar
  Single R, C for the whole processor, packaging oriented
D. Brooks and M. Martonosi. “Dynamic thermal management for high-performance microprocessors”, In Proceedings of the Seventh International Symposium on High-Performance Computer Architecture, pages 171--82, Jan. 2001.
  Discusses microarchitectural and scaling DTM mechanisms
  Uses moving average of power for ~100K cycles of Wattch simulation as a proxy for temperature to detect thermal emergencies for DTM triggering
![Page 57: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/57.jpg)
57
RELATED WORK – thermal model
Thermal Monitoring, “Intel Architecture SW developer’s Manual vol. 3”
  Catastrophic shutdown detector
    thermal diode resets stop clock duty cycle
  Automatic Thermal monitor
    Internally modulate stop clock duty cycle
  Software controlled clock modulation
    SW modulates stop clock duty cycle
Kevin Skadron et al, “Temperature aware Microarchitecture”, 30th ISCA, 2003
  HotSpot: architecture level thermal simulator built upon Wattch
  Uses multiple degree thermal R-C model for die, packaging, heatsink and convection to ambient
  More realistic area estimates based on Alpha 21364
![Page 58: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/58.jpg)
58
![Page 59: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/59.jpg)
59
Counter Access Heuristics
1) BUS CONTROL:
  No 3rd Level cache, so BSQ allocations ~ IOQ allocations
  Metric 1: Bus accesses from all agents
    Event: IOQ_allocation
      Counts various types of bus transactions
      Should account for BSQ as well
      Access based rather than duration
    MASK: Default req. type, all read (128B) and write (64B) types, include OWN, OTHER and PREFETCH
  Metric 2: Bus Utilization (the % of time the bus is utilized)
    Event: FSB_data_activity
      Counts DataReaDY and DataBuSY events on bus
    Mask: Count when processor or other agents drive/read/reserve the bus
    Expression: FSB_data_activity x BusRatio / Clocks Elapsed
      To account for clock ratios
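The bus-utilization expression above can be sketched as a small helper. This is only an illustration of the arithmetic: the function name and the sample numbers (a bus ratio of 14 and the counter readings) are assumptions, not values taken from the slides.

```python
def bus_utilization(fsb_data_activity, bus_ratio, clocks_elapsed):
    """Fraction of time the bus is busy.

    fsb_data_activity: FSB_data_activity event count (bus clock domain)
    bus_ratio:         core clocks per bus clock (assumed meaning of BusRatio)
    clocks_elapsed:    core clock cycles elapsed in the sampling interval
    """
    if clocks_elapsed <= 0:
        raise ValueError("clocks_elapsed must be positive")
    return fsb_data_activity * bus_ratio / clocks_elapsed


# Hypothetical sample: 10M bus events in 1.4G core cycles with a ratio of 14.
example = bus_utilization(10_000_000, 14, 1_400_000_000)
```

Multiplying by the bus ratio converts bus-clock events into core-clock units before dividing by elapsed core cycles, which is what "to account for clock ratios" refers to.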
![Page 60: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/60.jpg)
60
Counter Access Heuristics
2) L2 Cache:
  Metric: 2nd Level cache references
    Event: BSQ_cache_reference
      Counts cache ref-s as seen by bus unit
    MASK:
      All MESI read misses (LD & RFO)
      2nd level WR misses
3) 2nd Level BPU:
  Metric 1: Instructions fetched from L2 (predict)
    Event: ITLB_Reference
      Counts ITLB translations
    Mask: All hits, misses & UC hits
  Metric 2: Branches retired (history update)
    Event: branch_retired
      Counts branches retired
    Mask: Count all Taken/NT/Predicted/MissP
![Page 61: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/61.jpg)
61
Counter Access Heuristics
4) ITLB & I-Fetch:
  etc………
10) FP Execution:
  Metric: FP instructions executed
    event1: packed_SP_uop
      counts packed single precision uops
    event2: packed_DP_uop
      counts packed double precision uops
    event3: scalar_SP_uop
      counts scalar single precision uops
    event4: scalar_DP_uop
      counts scalar double precision uops
    event5: 64bit_MMX_uop
      counts MMX uops with 64bit SIMD operands
    event6: 128bit_MMX_uop
      counts integer SSE2 uops with 128bit SIMD operands
    event7: x87_FP_UOP
      counts x87 FP uops
    event8: x87_SIMD_moves_uop
      counts x87, FP, MMX, SSE, SSE2 ld/st/mov uops
![Page 62: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/62.jpg)
62
![Page 63: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/63.jpg)
63
INTRODUCTION to RUNTIME
• What is Runtime Power/Thermal Measurement:
  Methodology for measuring CPU power / temperature and component breakdowns
  3 alternatives:
  1. Measuring power/temperature directly from hardware; i.e. with multimeter probes
     Impossible with VLSI
     Runtime speed
  2. Simulating processor execution with SW and extracting power/temperature data
     WATTCH, Tempest, etc.
     Computation time problems, especially with thermal
     Cycle level detail
  3. Runtime Measurement: Getting processor power/thermal data at runtime using both hardware and software
     Runtime speed and SW support – not cycle detail!
![Page 64: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/64.jpg)
64
INTRODUCTION to RUNTIME
• Why Runtime Power/Thermal Measurement:
  Offers a hybrid technique between slow but detailed simulation and crude but fast realtime measurements
  Hardware performance counters help extract lots of useful information – both performance and power – on the fly
  Can be used for ‘priming’ instead of a long simulation where only the last few million instructions are of interest
![Page 65: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/65.jpg)
65
WHY POWER & THERMAL
• Moore’s Law:
Transistor count x4 / 3 years
DRAM density x4 / 3 years
Performance improves exponentially SO DOES POWER [1]
• Nuclear Core Example:
![Page 66: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/66.jpg)
66
WHY POWER & THERMAL
![Page 67: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/67.jpg)
67
…WHY POWER & THERMAL
• Battery technology improves much more slowly
• Packaging costs: +$1/W over 35-40W [2]
![Page 68: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/68.jpg)
68
POWER BASICS
• Total Power = Dynamic Power + Static Power + Short Circuit Power
Dynamic Power (switching power):
  Charging/discharging of capacitances when switching occurs (0 ↔ 1) – data dependent
  $P_{sw} = \tfrac{1}{2}\,\alpha\,C_L\,V_{dd}^2\,f$
Where this came from
![Page 69: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/69.jpg)
69
Derivation of Switching Power
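$$\text{Energy} = \tfrac{1}{2} C V^2, \qquad i_C = C\,\frac{dV}{dt}, \qquad \text{Power} = V\,i = C\,V\,\frac{dV}{dt}$$
At each 0→1 transition:
$$\text{Energy} = C_L V_{dd}^2$$
At each 1→0 transition: this charge is dissipated
$$\text{Power} = \frac{\text{Energy}}{\text{time}} = \frac{C_L V_{dd}^2}{\text{clock period}} = C_L V_{dd}^2\, f$$
Averaging over all transitions:
$$\text{Energy}/\text{transition} = \frac{\text{Total Energy}}{\text{total transitions}} = \tfrac{1}{2}\, C_L V_{dd}^2$$
$$\text{Power} = (\text{Energy}/\text{transition}) \cdot f \cdot P_{0\leftrightarrow 1}$$
$P_{0\leftrightarrow 1} = \alpha$: probability of switching in a cycle (switching activity)
$$\text{Power} = \tfrac{1}{2}\,\alpha\, C_L V_{dd}^2\, f$$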
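The switching-power expression, $P = \tfrac{1}{2}\,\alpha\,C_L V_{dd}^2 f$, can be evaluated directly. The sketch below is only a numerical illustration: the switching activity and the lumped capacitance are made-up values, while 1.7 V and 1.4 GHz are the supply voltage and clock quoted for the P4 elsewhere in the deck.

```python
def dynamic_power(alpha, c_load, vdd, freq):
    """Switching power in watts: 0.5 * alpha * C_L * Vdd^2 * f."""
    return 0.5 * alpha * c_load * vdd ** 2 * freq


# Hypothetical inputs: alpha = 0.5, lumped switched capacitance of 1 nF.
p_nominal = dynamic_power(alpha=0.5, c_load=1e-9, vdd=1.7, freq=1.4e9)

# The quadratic voltage dependence: halving Vdd quarters the switching power.
p_half_vdd = dynamic_power(alpha=0.5, c_load=1e-9, vdd=0.85, freq=1.4e9)
```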
![Page 70: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/70.jpg)
70
POWER BASICS
Static Power (leakage power):
  Due to leakage through the N channel and through the drain-substrate junctions.
![Page 71: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/71.jpg)
71
POWER BASICS
Short Circuit Power:
  Due to finite rise time of input signal.
  Generic CMOS feature
• In comparison:
  Currently: 80% Sw. + 10% Leak + 10% SC
  Future: 45% Sw. + 45% Leak + 10% SC [3]
![Page 72: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/72.jpg)
72
NEED FOR SPEED
WATTCH simulates 80K instr-s/sec
SpecINT 164.GZIP runs: ~350s with average upc ~1.3 on 1.4 GHz P4, producing ~665 billion uops
  WATTCH simulation would take ~100 days
Assuming a 1GHz machine: 1s of real run ~5 x IPC hrs of WATTCH simulation
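The simulation-time estimates on this slide are simple rate arithmetic, sketched below. The function names are illustrative. With the quoted 80K instructions/s, the gzip run comes out at roughly 96 days, consistent with the slide's ~100 days. The per-second figure comes out at about 3.5 x IPC hours, the same order of magnitude as the slide's ~5 x IPC hours, which may have assumed a somewhat lower simulation rate.

```python
WATTCH_RATE = 80e3  # simulated instructions per second, as quoted on the slide


def simulation_seconds(instructions, sim_rate=WATTCH_RATE):
    """Wall-clock seconds needed to simulate the given instruction count."""
    return instructions / sim_rate


def sim_hours_per_real_second(ipc, clock_hz=1e9, sim_rate=WATTCH_RATE):
    """Hours of simulation needed to cover one second of real execution."""
    return simulation_seconds(ipc * clock_hz, sim_rate) / 3600.0


# gzip example from the slide: ~665 billion uops.
gzip_days = simulation_seconds(665e9) / 86400.0
```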
![Page 73: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/73.jpg)
73
P4 Details
Karelian.ee:
  P4 – 1.4GHz
  0.18 µm, C4-FC-PGA-423
  Heatsink: Folded Fin
  M6, Al interconnect
  Die Size: 217 mm²
  Package Size: 5.34cm x 5.17cm
  Power: Idle/typ./max = ??/51.8/71 W
  D$1 & T$1 / L2: 8K & 12K uops / 256K
  Voltage: 1.7/1.75V
![Page 74: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/74.jpg)
74
P4 Details
1st LKM: <LKM_CPUinfo & UserLevel_CPUinfo>
  Implements syscall: getCPUinfo()
  Gathers CPU info from:
    /asm/processor.h
    Intel control registers (CR4)
    CPUID instruction
  Reveals:
    Debug Store mechanism exists for PEBS
    TSC exists
    MSRs implemented
    We can read/write performance counters
  EX:
    karelian (P4, Willamette): UserLevel_CPUinfo
    viale (P4, Northwood): UserLevel_CPUinfo
![Page 75: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/75.jpg)
75
P4 Detector - Counter Clusters
[Figure: block diagram with labels P4 Components, EVENTS, Event Detectors, Event Counters, 4 bit wide bus]
![Page 76: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/76.jpg)
76
Counters, ESCRs & CCCRs
Simplified Recipe:
1. Select event to count
2. Select a counter (also defines CCCR)
3. Select an ESCR
4. Set ESCR fields
5. Set CCCR fields
6. Enable CCCR
![Page 77: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/77.jpg)
77
Counting Mechanisms
Counting Types
  Non-retirement: Events occur any time during execution
  At-Retirement: Events at the retirement of instruction
    Can count BOGUS vs NBOGUS, tag uops to count, etc.
    Terminology
Mechanisms:
  Front end tagging (i.e. LD/ST retired)
  Execution tagging (i.e. packed_DP_retired)
  Replay tagging (i.e. L1 misses)
  No tags (i.e. uops retired)
Also: Event Counting | IEBS | PEBS
![Page 78: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/78.jpg)
78
At Retirement Counting Terminology
  BOGUS/NBOGUS (speculative)
  Tagging (count uops that encounter event)
  Replay (data speculation)
![Page 79: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/79.jpg)
79
Verifying Counter Reader
1) L1Dcache_exercise:
  Uses pointer assignment
  L1=8K, L2=256K
  Array Size = (L1 Size/Hit Rate)
    i.e. for 10% hit rate: 80K → 20K entries
    Array Size < L2 size
  Array elements ← PRBS of array indices
  Bench loop:
    new index ← array[old index]
  However, gcc puts 5 LDs in the bench loop
    4 static → hit rate ~ 100%
    1 our load → our desired hit rate
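The sizing rule for L1Dcache_exercise can be sketched as follows. This is a minimal illustration, not the benchmark itself: the 4-byte entry size is an assumption inferred from "80K → 20K entries", and the combined hit rate simply averages the four always-hitting static loads with the one pointer-chasing load.

```python
L1_BYTES = 8 * 1024     # L1 data cache size from the slide
L2_BYTES = 256 * 1024   # L2 size from the slide
ENTRY_BYTES = 4         # assumed entry size (80K bytes -> 20K entries)


def array_bytes_for_hit_rate(hit_rate, l1_bytes=L1_BYTES, l2_bytes=L2_BYTES):
    """Array size = L1 size / desired hit rate; it must still fit in L2."""
    size = round(l1_bytes / hit_rate)
    if size >= l2_bytes:
        raise ValueError("array would not fit in L2")
    return size


def overall_hit_rate(desired, static_loads=4):
    """Hit rate seen by the counters when gcc adds always-hitting loads."""
    return (static_loads * 1.0 + desired) / (static_loads + 1)


entries_for_10_percent = array_bytes_for_hit_rate(0.10) // ENTRY_BYTES
```

For a desired hit rate of 0.25, the counters would be expected to report about 0.85 overall, so the rate of the benchmark's own load has to be backed out from the measured value.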
![Page 80: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/80.jpg)
80
…Verifying Counter Reader
1) L1Dcache_exercise results:
[Figure: "L1Dcache Experiment". x-axis: Desired Hit Rate (0.04 to 1000); y-axis: Acquired Rates (-20% to 120%); series: Acquired L1 Hit Rate, Our L1 Hit Rate From L2 Accesses]
Ex: L1Dcache_exercise, Hit Rate = 0.25
![Page 81: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/81.jpg)
81
…Verifying Counter Reader
2) branch_exercise:
  Uses random number comparison
  Assigns 400K PRBS array outside bench loop
    To avoid rand() instructions in bench loop
  Bench loop:
    Compares array[index] to threshold
    Threshold = RAND_MAX*TakenRate
  Repeats 1000 times, reseeding each time
  However, gcc adds 2 more branches into the bench loop:
    Loop exit condition (prediction ~ 100%)
    Unconditional JMP (prediction ~ 100%)
  Our branch’s expected mispredict rate: ~ (0.5 - |TakenRate – 0.5|)
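The expected mispredict rate for the benchmark's own branch can be checked with a small simulation. This sketch assumes an idealised predictor that always guesses the majority direction, which is what the approximation 0.5 - |TakenRate - 0.5| corresponds to; it does not model the real P4 predictor, and the function names are illustrative.

```python
import random


def expected_mispredict_rate(taken_rate):
    """Approximation from the slide: 0.5 - |TakenRate - 0.5|."""
    return 0.5 - abs(taken_rate - 0.5)


def simulated_mispredict_rate(taken_rate, n=100_000, seed=1):
    """Random branch outcomes against an always-majority static guess."""
    rng = random.Random(seed)
    predict_taken = taken_rate >= 0.5
    misses = 0
    for _ in range(n):
        taken = rng.random() < taken_rate  # mimics value < RAND_MAX * TakenRate
        if taken != predict_taken:
            misses += 1
    return misses / n
```

For example, a taken rate of 0.25 gives an expected mispredict rate of 0.25, and the simulated value should land close to that.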
![Page 82: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/82.jpg)
82
…Verifying Counter Reader
2) branch_exercise results:
Ex: branch_exercise, Taken Rate = 0.5
[Figure: "Branch Prediction Experiment". x-axis: Desired Taken Rate (0 to 1); y-axis: Acquired Rates (0% to 120%); series: Approximated Mispredict Rate, Our Branch's Taken Rate]
![Page 83: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/83.jpg)
83
P4 POWER MEASUREMENT
Complete Setup:
[Figure: measurement chain with labels: Clamp Current Probe over 12V lines; 1mV/Adc conversion; Voltage [V] Readings; Serial Reader; Log voltage readings; Convert to instantaneous power: 12 x Vsample x 1000; (PowerMeter) Log Power values; (PowerPlotter) Plot Power values]
![Page 84: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/84.jpg)
84
MEASUREMENT Method
  Select power lines that reflect CPU power
    P4 uses 12 V lines
  Clamp the current probe over the 12V lines
    1mV/Adc conversion
  Connect the clamp to the DMM
  Send voltage reading over serial
  Log the voltage readings
  Convert to instantaneous power as: 12 x Vsample x 1000
  Log power values
  Plot power values
![Page 85: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/85.jpg)
85
MEASUREMENT Tools
Poll serial port every ~20ms
  Quicker: overkill; slower: overlook
Compute a running average sample every Δt you select
  Easier to sync with power model
PowerMeter:
  Convert voltage reading to power and log
  P = 12 x Vread x 1000
PowerPlotter:
  Plot power samples over a sliding time window
  100 s history with 1000 samples (Δt = 100ms)
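The conversion used by PowerMeter can be sketched as below. The 1 mV/A clamp means a reading in volts times 1000 is the current in amps, which is then multiplied by the 12 V rail. The averaging helper is an assumption about how "running average every Δt" could be done: it groups consecutive polls (five 20 ms polls per 100 ms window) and is not the original tool's code.

```python
SUPPLY_VOLTS = 12.0     # the CPU is fed from the 12 V lines
AMPS_PER_VOLT = 1000.0  # clamp output is 1 mV per amp


def sample_to_power(v_sample):
    """Instantaneous power in watts: P = 12 x Vsample x 1000."""
    return SUPPLY_VOLTS * v_sample * AMPS_PER_VOLT


def averaged_power(v_samples, polls_per_window=5):
    """Average power over consecutive windows of polled voltage readings."""
    powers = [sample_to_power(v) for v in v_samples]
    windows = []
    for start in range(0, len(powers) - polls_per_window + 1, polls_per_window):
        chunk = powers[start:start + polls_per_window]
        windows.append(sum(chunk) / len(chunk))
    return windows
```

A reading of 4 mV, for instance, corresponds to 4 A on the 12 V lines, that is 48 W.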
![Page 86: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/86.jpg)
86
Current Probe
Fluke i410
Uses Hall voltage to measure current and convert to voltage: 1mV / Adc
Range: 0.5 – 400A
Accuracy: 3.5% + 0.5A
Generated voltage is fed to DMM
Compared against the Ppro Amoeba shunt setup for verification
![Page 87: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/87.jpg)
87
Clamp vs Shunt
[Figure: four panels of sampled current [A] vs sample index (100 ms samples): "sampled current for L1Dcache from clamp", "sampled current for L1Dcache from shunt", "current for grep from shunt", "current for grep from clamp"]
![Page 88: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/88.jpg)
88
DMM
Agilent 34401A
Measurement Motive:
  We should sample as quickly as possible (grep case)
Measurement Setup: Fast 4 digit, Autozero OFF, Display OFF
  From [8], 1000 readings/s (x150 faster than fast 6 digit)
Serial Interface:
  From [9], 55 ASCII readings/s
  Polling serial port faster than 20ms is overkill
![Page 89: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/89.jpg)
89
P4 Power Lines
Which power lines should we cut / clamp?
  [5] shows the power lines: 1-CPU power connector, 13-System power connector (P1 → 13 & P2 → 1)
  [6],[7] say P4 uses 12V lines for CPU, rather than 5V lines
  Both P1 & P2 have 12, 5 and 3.3 V lines
  I run branch_exercise (takenRate=1) and gzip_static to obtain the current variation on the lines
![Page 90: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/90.jpg)
90
Current on Power Lines
[Figure: eight panels of current I [A] vs time (s): Connector P1 line 7 (12V); Connector P1 lines 1,3,,6,18,19,20,22 (5V); Connector lines 11,12,23 (3.3V); Connector P2 line 1 (3.3V); Connector P2 line 14 (5V); Connector P2 line 3 (12V); Connector P2 line 7 (12V); Connector P2 line 9 (5V)]
Reveals ALL 3 12V lines’ currents follow CPU activity
All add to CPU power!
![Page 91: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/91.jpg)
91
Validating with Optimizations
Compare to Optimizations vs Power of [Seng & Tullsen]
[Figure: "SPECINT AVE. Power vs gcc Optimizations". Average power [W] (39 to 53) for GZIP, VPR, GCC; series: O0, O1, O2, O3, O3 unroll, O3 unroll ALL]
![Page 92: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/92.jpg)
92
Optimizations
-O0
  None at all
-O1 -fomit-frame-pointer
  thread-jumps, delayed-branches, defer-pop
-O2 -fomit-frame-pointer
  CSE related blocks, jumps, expensive optimizations, reschedule instr-ns, etc.
-O3 -fomit-frame-pointer
  O2 + inline functions heuristically
-O3 -fomit-frame-pointer -funroll-loops
  Only for #iterations known at compile/run time
-O3 -fomit-frame-pointer -funroll-all-loops
  Do for all loops (usually bad result)
![Page 93: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/93.jpg)
93
GZIP – power vs time
[Figure: "Power for GZIP Optimizations". Power [W] (0 to 70) vs time (s) (0 to 900); series: O0, O1, O2, O3, O3unroll, O3unrollALL]
![Page 94: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/94.jpg)
94
…GZIP – power vs time
All have similar power
Exec. time(O0) ~ x2 Exec. time(all other levels)
Different data sets provide different power profiles
![Page 95: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/95.jpg)
95
3 specINT average Power
[Figure: "SPECINT AVE. Power vs gcc Optimizations". Average power [W] (39 to 53) for GZIP, VPR, GCC; series: O0, O1, O2, O3, O3 unroll, O3 unroll ALL]
Optimized code runs quicker, and yet with less average power
specFP – art seems to be the exception?
![Page 96: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/96.jpg)
96
About the ripples
Add ripple stuff here…!
![Page 97: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/97.jpg)
97
P4 Architecture vs Layout
Components to Model:
1) Bus Control
2) L2 Cache
3) 2nd Level BPU
4) ITLB & Ifetch
5) L1 Cache
6) MOB
7) Mem Control
8) DTLB
9) Int EXE
10) FP EXE
11) Int RF
12) FP RF
13) Decode
14) Trace $
15) 1st Level BPU
16) Microcode ROM
17) Allocation
18) Rename
19) Inst-n Qs
20) Schedule
21) Inst-n Qs
22) Retirement
![Page 98: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/98.jpg)
98
Defining Components
![Page 99: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/99.jpg)
99
Counter Rotations
![Page 100: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/100.jpg)
100
Experiment Setup
[Figure: setup diagram with labels POWERCLIENT and POWERSERVER]
![Page 101: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/101.jpg)
Component Breakdowns
![Page 102: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/102.jpg)
102
THERMAL Basics
Duality: heat flow ↔ electrical flow
Thermal Mass (Capacitance):
  $C_{th} = c \cdot A \cdot t$ [J/K]
  c: Specific heat [J/m³K]
  A: Block area [m²]
  t: Wafer thickness [m]
Thermal Resistance:
  $R_{th,norm} = \rho \cdot t / A$ [K/W]
  ρ: Thermal resistivity [m.K/W]
  A: Block area [m²]
  t: Wafer thickness [m]
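The two definitions above can be turned into a short calculation. The material constants (c = 10^6 J/m³K, ρ = 10^-2 m.K/W) and the 0.1 mm wafer thickness are the ones quoted in the deck's quantitative example; the block areas are hypothetical. The sketch also shows that the time constant R·C = ρ·c·t² does not depend on block area.

```python
def thermal_capacitance(c, area, thickness):
    """C_th = c * A * t  [J/K]."""
    return c * area * thickness


def thermal_resistance(rho, area, thickness):
    """R_th = rho * t / A  [K/W]."""
    return rho * thickness / area


C_SPECIFIC = 1e6   # J/(m^3 K), from the quantitative example
RHO = 1e-2         # m K / W, from the quantitative example
THICKNESS = 1e-4   # 0.1 mm thinned wafer


def time_constant(area):
    """tau = R_th * C_th for a block of the given area (m^2)."""
    return (thermal_resistance(RHO, area, THICKNESS)
            * thermal_capacitance(C_SPECIFIC, area, THICKNESS))


tau_small = time_constant(1e-6)  # hypothetical 1 mm^2 block
tau_large = time_constant(2e-5)  # hypothetical 20 mm^2 block
```

Both blocks give a time constant of about 100 µs, even though their individual R and C differ by a factor of 20.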
![Page 103: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/103.jpg)
103
Simplified Thermal Model
Divide the CPU into component blocks
  Each block dissipates different power, $P_{block}$ → reveals different temperature changes, $\Delta T_{block}$
[Figure: DIE divided into blocks Blk_i, Blk_j, Blk_k, Blk_l above the HEATSINK; each block x is an R-C node ($T_{b,x}$, $R_{th,x}$, $C_{th,x}$) driven by $P_x$; the heatsink node ($T_h$, $R_{th,h}$, $C_{th,h}$) carries $P_i+P_j+P_k+P_l$]
$$P_i = C_{th,i}\,\frac{dT_{b,i}}{dt} + \frac{T_{b,i} - T_h}{R_{th,i}}$$
Final difference equation:
$$\Delta T_i = \frac{P_i\,\Delta t}{C_{th,i}} - \frac{T_i\,\Delta t}{R_{th,i}\,C_{th,i}}$$
Δt: Sampling interval
$T_i$: The temperature difference between block and the heatsink
Δt should be much smaller than the RC time constant, $\tau_{th,i}$
Numerical Values?
See Quantitative Example >>
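The final difference equation, $\Delta T_i = P_i\,\Delta t / C_{th,i} - T_i\,\Delta t / (R_{th,i} C_{th,i})$, can be stepped in a simple loop. The sketch below uses the register-file figures quoted on the steady-state slide (P = 10 W, R = 4 K/W, so the block should settle 40 K above the heatsink); the capacitance and sampling interval are assumptions chosen to give a 100 µs time constant with Δt well below it.

```python
def simulate_block(power, r_th, c_th, dt, steps, t_start=0.0):
    """Iterate the block temperature (relative to the heatsink) for `steps` intervals."""
    t_rel = t_start
    for _ in range(steps):
        delta = power * dt / c_th - t_rel * dt / (r_th * c_th)
        t_rel += delta
    return t_rel


P_REGFILE = 10.0   # W, from the steady-state example
R_REGFILE = 4.0    # K/W, from the steady-state example
C_REGFILE = 2.5e-5 # J/K, assumed so that R*C = 100 microseconds
DT = 1e-6          # s, assumed sampling interval (1% of the time constant)

t_after_20_tau = simulate_block(P_REGFILE, R_REGFILE, C_REGFILE, DT, steps=2000)
```

After 20 time constants the iterated value is within a fraction of a millikelvin of the steady-state value P·R = 40 K. With Δt larger than the RC time constant the same loop would oscillate or diverge, which is why Δt must stay much smaller than τ.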
![Page 104: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/104.jpg)
104
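QUANTITATIVE EXAMPLE
Use t = 0.1 mm – thinned wafer
Areas given in table (c = 10^6 [J/m³K] & ρ = 10^-2 [m.K/W])
$\tau_{th} = R_{th} C_{th} = \rho\, c\, t^2 = 10^{-4}\,\text{s} = 100\,\mu\text{s}$ → independent of area!
Temperature buildup for Regfile with Δt = 133.4 ns:
$$\Delta T_{blk}\ (\text{w.r.t. heatsink}) = \frac{P_{blk}\,\Delta t}{C_{th,blk}} - \frac{T_{blk}\,\Delta t}{R_{th,blk}\,C_{th,blk}}$$
[Slide values whose labels are not recoverable: 21.11, 42.85, 100, 100, 100]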
![Page 105: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/105.jpg)
105
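THERMAL FORMULATION
For any block, i:
[Figure: block node $T_{b,i}$ with $R_{th,i}$, $C_{th,i}$ and input $P_i$, connected to the heatsink node $T_h$]
$$P_i = C_{th,i}\,\frac{dT_{b,i}}{dt} + \frac{T_{b,i} - T_h}{R_{th,i}}$$
Discretizing time:
$$P_i = C_{th,i}\,\frac{\Delta T_{b,i}}{\Delta t} + \frac{T_{b,i} - T_h}{R_{th,i}}$$
Define: $T_i = T_{b,i} - T_h$
Assuming $T_h$ const. ($\Delta T_h = 0$):
$$\Delta T_i = \Delta T_{b,i} - \Delta T_h = \Delta T_{b,i}$$
$$P_i = \frac{T_i}{R_{th,i}} + C_{th,i}\,\frac{\Delta T_i}{\Delta t}$$
Final difference equation:
$$\Delta T_i = \frac{P_i\,\Delta t}{C_{th,i}} - \frac{T_i\,\Delta t}{R_{th,i}\,C_{th,i}}$$
Δt: Sampling interval
$T_i$: The temperature difference between block and the heatsink
Δt should be much smaller than the RC time constant, $\tau_{th,i}$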
![Page 106: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/106.jpg)
106
Refined Thermal Model
Steady state analysis reveals that the heatsink-die abstraction is not sufficient for real systems
Proceeding to a multilayer thermal model:
  Active die thickness → metalization/insulation → chip-package interface → package → heatsink
  Requires searching of several materials/dimensions and thermal properties
  Multiple layers → multiple T nodes → multiple DEs
Baseline heat removal structure:
[Figure: baseline structure, shown with a single block R-C node ($T_{b,j}$, $R_{th,j}$, $C_{th,j}$, $P_j$)]
![Page 107: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/107.jpg)
107
Refined Thermal Model
Need to define the physical structure
  All the layers heat-flux propagates through
Corresponding thermal model
  Multinode
  Different assumptions/decisions
Physical parameters for different elements
  Dimensions
  Material types
  $\rho_{th}$ and $c_{th}$
New set of thermal update DEs
![Page 108: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/108.jpg)
108
Physical Model vs. Thermal Model
[Figure: physical stack beside its thermal R-C network. Node labels: die block i ($T_{die,i}$, $R_{die,i}$, $C_{die,i}$, input $P_i$); package ($T_{p,i}$, $R_{p,i}$, $C_{p,i}$); spreader ($T_{spr}$, $R_{spr}$, $C_{spr}$, R_grXspr); heatsink ($T_h$, $R_h$, $C_h$, R_hXA); ambient $T_A$; $P_{total}$]
![Page 109: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/109.jpg)
109
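Analytical Derivation
4 nodes → 4 DEs
1) $T_{spr}$:
$$P_{total} - \frac{T_{spr} - T_h}{R_{spr+gr}} = C_{spr}\,\frac{dT_{spr}}{dt}$$
Discretizing time:
$$P_{total} - \frac{T_{spr} - T_h}{R_{spr+gr}} = C_{spr}\,\frac{\Delta T_{spr}}{\Delta t}$$
$$P_{total}\,\Delta t - \frac{1}{R_{spr+gr}}\,(T_{spr} - T_h)\,\Delta t = C_{spr}\,\Delta T_{spr}$$
Final difference equation:
$$\Delta T_{spr} = \frac{P_{total}\,\Delta t}{C_{spr}} - \frac{1}{C_{spr}\,R_{spr+gr}}\,(T_{spr} - T_h)\,\Delta t$$
$$T_{spr} = T_{spr} + \Delta T_{spr}$$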
![Page 110: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/110.jpg)
110
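…Analytical Derivation
2) $T_h$:
$$\Delta T_h = \frac{T_{spr} - T_h}{R_{spr+gr}}\cdot\frac{\Delta t}{C_h} - \frac{1}{C_h\,R_{hxa}}\,(T_h - T_A)\,\Delta t$$
$$T_h = T_h + \Delta T_h$$
3) $T_{die,i}$:
$$\Delta T_{die,i} = \frac{P_i\,\Delta t}{C_{die,i}} - \frac{1}{C_{die,i}\,R_{die,i}}\,(T_{die,i} - T_{p,i})\,\Delta t$$
$$T_{die,i} = T_{die,i} + \Delta T_{die,i}$$
4) $T_{p,i}$:
$$\Delta T_{p,i} = \frac{T_{die,i} - T_{p,i}}{R_{die,i}}\cdot\frac{\Delta t}{C_{p,i}} - \frac{1}{C_{p,i}\,R_{p,i}}\,(T_{p,i} - T_{spr})\,\Delta t$$
$$T_{p,i} = T_{p,i} + \Delta T_{p,i}$$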
![Page 111: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/111.jpg)
111
Temperature Updating and Initial Conditions
D.E.s should be updated along the direction of current (power) flow: Tdie,i → Tp,i → Tspr → Th
It is not reasonable to start from ambient temperature as the initial condition; in most cases the processor is already running
TA is given as ~50 °C by the Intel Thermal Design Guidelines
Assume idle power (Ppro: ~2 W):
Th = TA + 2 W · Rhxa ≈ 52 °C
Tspr = Th + 2 W · Rspr+gr ≈ 52 °C
Tp,i = Tdie,i = Tspr ≈ 52 °C
Update Tdie,i
Update Tp,i
Update Tspr
Update Th
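The update order and the idle-power initial conditions can be sketched as a small loop. This is only an illustration, not the seminar's simulation code: the R and C values, the restriction to a single die/package block (so that Ptotal equals Pi), and the names `initial_temps` and `step` are hypothetical. Only TA = 323 K, the ~2 W idle power and the update order are taken from the slides.

```python
# Hypothetical sketch of one time step of the 4-node thermal model.
# All R (K/W) and C (J/K) values below are placeholders, not the slide's numbers.
T_A = 323.0      # ambient temperature in K (slide: ~50 degrees C)
P_IDLE = 2.0     # idle power in W (slide)

R = {"die": 0.9, "p": 0.5, "spr_gr": 0.05, "hxa": 1.0}   # assumed values
C = {"die": 0.01, "p": 0.5, "spr": 5.0, "h": 100.0}      # assumed values


def initial_temps():
    """Idle-power starting point, following the slide's initial conditions."""
    t_h = T_A + P_IDLE * R["hxa"]
    t_spr = t_h + P_IDLE * R["spr_gr"]
    # The slide sets package and die temperatures equal to the spreader's.
    return {"die": t_spr, "p": t_spr, "spr": t_spr, "h": t_h}


def step(temps, p_i, dt):
    """Update the nodes in the order die -> package -> spreader -> heatsink."""
    t = dict(temps)
    t["die"] += (p_i / C["die"]
                 - (t["die"] - t["p"]) / (R["die"] * C["die"])) * dt
    t["p"] += ((t["die"] - t["p"]) / (R["die"] * C["p"])
               - (t["p"] - t["spr"]) / (R["p"] * C["p"])) * dt
    # With a single block, the total power entering the spreader is p_i.
    t["spr"] += (p_i / C["spr"]
                 - (t["spr"] - t["h"]) / (R["spr_gr"] * C["spr"])) * dt
    t["h"] += ((t["spr"] - t["h"]) / (R["spr_gr"] * C["h"])
               - (t["h"] - T_A) / (R["hxa"] * C["h"])) * dt
    return t


T0 = initial_temps()
T1 = step(T0, p_i=10.0, dt=5e-6)
```

Because each node is updated with the already-updated temperature of the node before it, the order of the four statements in `step` matters; reordering them gives slightly different numbers per step.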
![Page 112: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/112.jpg)
112
Steady State Solution
If Rth,i → Rth,i × 20, then Tss,i → Tss,i × 20
Regfile example of presentation 1: Pi = 10 & Rth,i = 4 → Ti,ss = 40 K
[Figure: DIE with blocks Blk i, Blk j, Blk k, Blk l; each block is an RC node (Tb, Rth, Cth) driven by its own power P; all blocks connect to a HEATSINK node (Th, Rth,h, Cth,h) carrying Pi+Pj+Pk+Pl]
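Final difference equation:
$$\Delta T_i = \frac{P_i}{C_{th,i}}\,\Delta t - \frac{T_i}{R_{th,i}\,C_{th,i}}\,\Delta t$$
Steady State Solution ($\Delta T_i = 0$):
$$0 = \frac{P_i}{C_{th,i}} - \frac{T_{ss,i}}{R_{th,i}\,C_{th,i}} \;\Rightarrow\; T_{ss,i} = P_i\,R_{th,i}$$
Numerically for decode: Tss,dec in K [numeric values garbled in the transcript: "15.0", "10.35"]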
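As a quick numerical check of the steady-state relation, the single-node difference equation can be iterated until it stops changing. This is a minimal sketch: P = 10 and Rth = 4 are the register-file example from the slide, while Cth, dt, the step counts and the function name `simulate_node` are arbitrary assumptions chosen only so that the loop settles.

```python
def simulate_node(p, r_th, c_th, dt, steps, t0=0.0):
    """Iterate dT = (P/C - T/(R*C)) * dt, where T is the rise over the base node."""
    t = t0
    for _ in range(steps):
        t += (p / c_th - t / (r_th * c_th)) * dt
    return t


# Register-file example from the slide: P = 10, R_th = 4.
t_ss = simulate_node(p=10.0, r_th=4.0, c_th=0.5, dt=0.01, steps=5000)

# Scaling R_th by 20 scales the steady-state value by 20 as well
# (more steps are needed because the time constant R*C also grows by 20).
t_ss_x20 = simulate_node(p=10.0, r_th=80.0, c_th=0.5, dt=0.01, steps=100000)
```

Both runs settle at P·Rth regardless of the assumed Cth; the capacitance only changes how long it takes to get there.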
![Page 113: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/113.jpg)
113
EX: Ppro Thermal Model
[Figure: single block node j with Tb,j, Rth,j, Cth,j and Pj]
Use CASTLE computed component powers
Select Δt, the thermal sampling interval
Determine component areas from die photo
Determine processor/packaging physical parameters
Generate numerical thermal model
Apply component difference equations recursively
![Page 114: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/114.jpg)
114
Simulation
Thermal material constants (including c) hardcoded for materials (except Si)
Areas/relative areas hardcoded for components
Individual R and C computed for components
D.E. loop is re-executed every Δt, in the discussed order
Updated thermal nodes displayed every t ~ 20 ms
Component temperatures build up to ~350 K in ~5 hrs
Clock temperature shoots up
Theatsink moves very slowly, as expected
For the complete set of computed numerical simulation results, go to the additional slides
![Page 115: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/115.jpg)
115
Simulation Outputs – at Startup
![Page 116: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/116.jpg)
116
Simulation Outputs – After 5 hrs
![Page 117: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/117.jpg)
117
Thermal Model Parameters
BASELINE AMBIENT TEMPERATURE: T_ambient = 323; /* in K */ (Intel Thermal Design Guidelines)
SAMPLING INTERVAL: dt = 5e-6 s (my choice)
Processor Specific Parameters
![Page 118: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/118.jpg)
118
Physical Parameters
15% of heatsink area has fins, 85% doesn’t
Overall Rth estimate: Rfin ∥ Rnofin (parallel combination)
![Page 119: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/119.jpg)
119
…Physical Parameters
Temperature assumed uniform along heat spreader – and therefore, above
![Page 120: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/120.jpg)
120
…Physical Parameters
We don’t use total R & C for the package, as it is decomposed into component areas in the model
DIE: Process info scaled from P4 data in [7] using ITRS 1999 & 2001 and interpolating MPU ½ pitch vs. wire pitch
Metal layer & isolation scale factor: 2.15
ITRS FEP Si final device thickness ~100 nm (130 nm tech.); I used the overall wafer thickness
Temperature-dependent Si: Si(T) = 1.5486·10^2 · (300/T)^(4/3)
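The temperature-dependent silicon term can be written as a small helper. This sketch assumes that the quantity on the slide is the thermal conductivity of silicon in W/(m·K), which the transcript does not state explicitly (the symbol was lost), and that a layer's resistance follows the usual one-dimensional conduction form t/(k·A). The names `k_si` and `slab_r_th` are hypothetical.

```python
def k_si(temp_k):
    """Si(T) = 1.5486e2 * (300/T)^(4/3), as given on the slide.

    Assumed here to be a thermal conductivity in W/(m*K).
    """
    return 1.5486e2 * (300.0 / temp_k) ** (4.0 / 3.0)


def slab_r_th(thickness_m, area_m2, temp_k):
    """1-D conduction resistance t / (k * A) in K/W, under the assumption above."""
    return thickness_m / (k_si(temp_k) * area_m2)


k_300 = k_si(300.0)   # equals the prefactor, since (300/300)^(4/3) = 1
k_350 = k_si(350.0)   # smaller: the value falls as temperature rises
```

Because the value depends on T, a simulation that uses it has to recompute the silicon resistance as the die temperature changes, which is presumably why Si is the one material not hardcoded.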
![Page 121: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/121.jpg)
121
…Physical Parameters
DIE Rth Estimate:
Rdie = RSi + Rmetal + Rpoly + RSiO2
For 10% die area:
  RSi ~ 0.1 K/W
  Rmetal ~ 0.0008 K/W
  Rpoly ~ single layer, ignorable
  RSiO2 ~ 0.86 K/W
Rdie ~ RSi + RSiO2
DIE Cth Estimate: Only Si considered, as the rest is much thinner
![Page 122: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/122.jpg)
122
Numerical Values
![Page 123: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/123.jpg)
123
Computed Thermal Values
![Page 124: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/124.jpg)
124
Computed Thermal v.2 Values
![Page 125: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/125.jpg)
125
Ppro info & Areas
Complete processor info ([4],[5],[6]):
  200 MHz
  4 metal layers
  Package: 387 pin DC-PGA
  Package size: 6.76 cm x 6.25 cm
  0.35 µm BiCMOS
  Die size: 196 mm2 (14x14)
Area estimates for die: scale component areas from [1]:
             [1]        Ours
  Clock      150 MHz    200 MHz
  Process    0.50 µm    0.35 µm    <process scaling x0.7>
  Die size   306 mm2    196 mm2    <area scaling x0.64>
I use x0.64 area scaling and [1]’s breakdowns for component area estimates
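The two scaling factors follow directly from the numbers on the slide. A small sketch of the arithmetic; the helper name `scale_area` is hypothetical, and the component areas of [1] are not reproduced here.

```python
ref_die_mm2 = 306.0   # die size reported for [1]
our_die_mm2 = 196.0   # Ppro die size (14 mm x 14 mm)

area_scale = our_die_mm2 / ref_die_mm2   # about 0.64, the factor used on the slide
process_scale = 0.35 / 0.50              # linear feature-size ratio, 0.7


def scale_area(ref_component_mm2):
    """Scale a component area from [1] to the 196 mm2 die."""
    return ref_component_mm2 * area_scale
```

Note that the area factor is the ratio of the two die sizes, not the square of the linear process factor (0.7 squared is 0.49).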
![Page 126: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/126.jpg)
126
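Component Areas
[Figure: component area fractions; the component labels are not recoverable from the transcript. Values: 3.9%, 11.8%, 7.9%, 4.0%, 4.4%, 4.2%, 7.6%, 8.6%, 14.3%, 4.1%, 2.5%, 2.2%, 4.6%, 1.3%]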
Close to Intel data:
These areas cover ~81.3% of die
Clock area found from Intel data as:
Aclk=Pclk/PwrDensityclk = 1.7%
![Page 127: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/127.jpg)
127
CASTLE Breakdown Areas
We need to convert the given areas to CASTLE components:
DECODE: ID + MIS = 11.7%
ISSUE: RS = 7.6%
REORDER: RAT + (ROB & RRF) = 8.6%
DMEM: DCU = 8.6%
IMEM: IFU = 11.8%
FUNC_UNIT: AGU + IEU + FEU = 10%
OTHER: 100 - above = 41.7%
CLOCK: 1.7%
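The OTHER entry is simply the remainder after the six named components, and the percentages can be turned into absolute areas with the 196 mm2 die size quoted earlier. A minimal sketch; the variable names are illustrative only.

```python
# Area fractions (percent of die) for the named CASTLE components, from the slide.
castle_area_pct = {
    "DECODE": 11.7,
    "ISSUE": 7.6,
    "REORDER": 8.6,
    "DMEM": 8.6,
    "IMEM": 11.8,
    "FUNC_UNIT": 10.0,
}

# OTHER is whatever is left of the die after the named components.
castle_area_pct["OTHER"] = round(100.0 - sum(castle_area_pct.values()), 1)

CLOCK_PCT = 1.7      # listed separately on the slide, not subtracted above
DIE_MM2 = 196.0      # Ppro die size from the processor-info slide

# Absolute area per component in mm2.
area_mm2 = {name: pct / 100.0 * DIE_MM2 for name, pct in castle_area_pct.items()}
```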
![Page 128: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/128.jpg)
128
CASTLE
• Power measurement / profiling tool
• Developed by Prof. Martonosi and Russ
• Implemented on a P6, Linux
• Generates power profiles for benchmarks at runtime
  Uses performance counters to gather utilization information
  Uses WATTCH’s per-usage wattage values for max power values ([8 p.3])
  Uses heuristics to extract usage counts for blocks
  Uses register sampling to compute activity factors for single-ended bitlines
  Computes total processor power
  Uses a digital multimeter for validation
![Page 129: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/129.jpg)
129
Performance Counters
• Exist on most new processors
• Mainly used to track performance-related events
  Cache misses
  Committed instructions, etc.
• Can be used to gather power-related data
• P6 has 2 performance counters that count 77 events
  Can be accessed with:
    RDMSR (Read Model-Specific Register)
    WRMSR (Write Model-Specific Register)
    RDTSC (Read Time Stamp Counter)
  Kernel level (Ring 0) instructions
  Exemplary events:
    0. TSC: elapsed machine cycles
    03. 03H: L1 read misses
    44. C0H: instructions retired
![Page 130: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/130.jpg)
130
Heuristics
• To extract power related data from performance counters
• Platform Dependent!
![Page 131: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/131.jpg)
131
CASTLE implementation
• Platform: P6, 200 MHz | Linux kernel v2.2.16-3
[Figure: block diagram with components: HW counters, Kernel Code, Server code, Series Resistance, Xmultimeter server, Client Code]
![Page 132: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/132.jpg)
132
CASTLE Filesystem – User Code
• Client: <cpu-probe>
  Includes cpu-monitor & cpu-network
  Cpu-monitor:
    Provides the X windows for power breakdown bar graphs <gtk and threads>
    Acquires power breakdowns from cpu-network
  Cpu-network:
    Connects to server side through Ethernet <sockets and threads>
    Gets event counts and number of elapsed cycles for each tracked event
    Constructs component power values from event data using heuristics
[Diagram label: Client Code]
![Page 133: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/133.jpg)
133
CASTLE Filesystem – User Code
• Multimeter: <xmmeter>
  The real multimeter reads the voltage over the series R and sends it over RS232
  Xmmeter reads the serial port and converts the voltage reading into power as:
    P = (Vread / Rs) · Vdd
  An X window displays the readings
[Diagram labels: Series Resistance, Xmultimeter server]
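The voltage-to-power conversion is a single expression: the voltage across the series resistor gives the supply current, which is multiplied by the supply voltage. A minimal sketch; the function name and the numeric values in the example call are hypothetical and are not taken from the slides or from the xmmeter code.

```python
def power_from_sense_voltage(v_read, r_series, v_dd):
    """P = (Vread / Rs) * Vdd: supply current through Rs times supply voltage."""
    if r_series <= 0:
        raise ValueError("series resistance must be positive")
    current = v_read / r_series   # amperes through the series resistor
    return current * v_dd         # watts drawn from the supply


# Illustrative values only: 50 mV across a 10 mOhm resistor on a 3.3 V supply.
p_watts = power_from_sense_voltage(v_read=0.05, r_series=0.01, v_dd=3.3)
```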
![Page 134: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/134.jpg)
134
CASTLE Filesystem – User Code
• Server: <probe-server>
  Reads the performance counts every second with the syscall “getglobaleventcount” defined in kernel code
  Acquires event counts and elapsed cycles for all events
  Sends the event and cycle data to the client as a stream of chars
[Diagram label: Server code]
![Page 135: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/135.jpg)
135
CASTLE Filesystem – Kernel Code
• Required to access counters
• Scattered in:
  /usr/src/linux/arch/i386/kernel/entry.S
  /usr/src/linux/include/linux/sched.h
  /usr/src/linux/kernel/fork.c
  /usr/src/linux/kernel/sched.c
• Defines 2 new system calls:
  Geteventcount
  Getglobaleventcount
• Accesses the counters, gets counter & cycle data
  Syscall returns event and cycle counts to the server as a 2D array
[Diagram label: Kernel Code]
![Page 136: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/136.jpg)
136
CASTLE Details
• In castle code, 12 distinct events are defined
• From [1] and [8], 10 of the events are used:
  instructions decoded
  instructions executed
  instructions retired
  floating point operations executed
  branches retired
  branches decoded
  L1 instruction cache accesses
  L1 data cache accesses
  L2 unified cache accesses
  main memory requests
• [1] and [8] suggest a 10 ms sampling period
• Probe-server samples counters every second
![Page 137: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/137.jpg)
137
Power Breakdown Components
CASTLE tracks 12 events
Develops power breakdowns for 8 units:
  DECODE
  ISSUE
  REORDER
  DMEM
  IMEM
  FUNC_UNIT
  OTHER
  CLOCK
Component powers recomputed every second in cpu-network
![Page 138: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/138.jpg)
138
Thermal Modeling with CASTLE
• Thermal model requires only power and sampling time information
  Thermal model can be added at user level, by:
    extending cpu-network for temperature updates
    extending cpu-monitor for a new thermal X window
• A pitfall is the sampling period
  Sampling time should be smaller than the time constant for reliable modeling (<< 100s)
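One way to see the sampling-period pitfall: the explicit update T ← T + (P/C − T/(R·C))·Δt multiplies the distance from steady state by (1 − Δt/(R·C)) at every step, so it only settles when Δt is below 2·R·C and is only accurate when Δt is much smaller than R·C. The sketch below uses P = 10 and Rth = 4 from the register-file example; Cth, the two Δt values and the function name `final_error` are assumptions made for illustration.

```python
def final_error(dt, r_th=4.0, c_th=0.5, p=10.0, sim_time=200.0):
    """Run the explicit update for sim_time seconds; return |T - P*R| at the end."""
    t = 0.0
    for _ in range(int(sim_time / dt)):
        t += (p / c_th - t / (r_th * c_th)) * dt
    return abs(t - p * r_th)


RC = 4.0 * 0.5                     # time constant of the assumed node: 2 s
err_small = final_error(dt=0.01)   # dt << RC: settles at P*R
err_large = final_error(dt=5.0)    # dt > 2*RC: oscillates with growing amplitude
```

With a one-second counter sampling period, any node whose time constant is comparable to or below one second would fall into the second case, so the thermal update would need a finer internal step than the power sampling itself.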
![Page 139: Runtime Power Measurement/Modeling and Thermal Modeling](https://reader038.vdocument.in/reader038/viewer/2022102819/5681466c550346895db390ef/html5/thumbnails/139.jpg)
139
EOP