TRANSCRIPT
On-board Performance Counters: What do they really tell us?
Pat Teller, The University of Texas at El Paso (UTEP)
PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002
Credits (Person Power)
Michael Maxwell, Graduate (Ph.D.) Student
Leonardo Salayandia, Graduate (M.S.) Student – graduating in Dec. 2002
Alonso Bayona, Undergraduate
Alexander Sainz, Undergraduate
Credits (Financial)
DoD PET Program
NSF MIE (Model Institutions of Excellence) REU (Research Experiences for Undergraduates) Program
UTEP Dodson Endowment
Motivation
Facilitate performance-tuning efforts that employ aggregate event counts (not time-multiplexed) accessed via PAPI
When possible, provide calibration data, i.e., quantify overhead related to PAPI and other sources
Identify unexpected results – errors? Misunderstandings of processor functionality?
Road Map
Scope of Research
Methodology
Results
Future Work and Conclusions
Processors Under Study
MIPS R10K and R12K: 2 counters, 32 events
IBM Power3: 8 counters, 100+ events
Linux/IA-64: 4 counters, 150 events
Linux/Pentium: 2 counters, 80+ events
Events Studied So Far
Number of load and store instructions executed
Number of floating-point instructions executed
Total number of instructions executed (issued/committed)
Number of L1 I-cache and L1 D-cache misses
Number of L2 cache misses
Number of TLB misses
Number of branch mispredictions
PAPI Overhead
Extra instructions – read counter before and after workload
Processing of counter overflow interrupts
Cache pollution
TLB pollution
Methodology
[Configuration micro-benchmark]
Validation micro-benchmark – used to predict event count
Prediction via tool, mathematical model, and/or simulation
Hardware-reported event count collection via PAPI (instrumented benchmark run 100 times; mean event count and standard deviation calculated)
Comparison/analysis
Report findings
Validation Micro-benchmark
Simple, usually small program
Stresses a portion of the microarchitecture or memory hierarchy
Its size, simplicity, or execution time facilitates the tracing of its execution path and/or prediction of the number of times an event is generated
Basic types: array, loop, in-line, and floating-point
Scalable w.r.t. granularity, i.e., number of generated events
Example – Loop Validation Micro-benchmark
for (i = 0; i < number_of_loops; i++) {
    sequence of 100 instructions with data dependencies that prevent compiler reordering or optimization
}
Used to stress a particular functional unit, e.g., the load/store unit
Configuration Micro-benchmark
Simple, usually small program
Designed to provide insight into the structure and management algorithms of the microarchitecture and/or memory hierarchy
Example: program to identify the page size used to store user data
Some Results
Reported Event Counts: Expected, Consistent, and Quantifiable Results
Overhead related to PAPI and other sources is consistent and quantifiable
Reported Event Count − Predicted Event Count = Overhead
Example 1: Number of Loads – Itanium, Power3, and R12K
[Chart: load counts from the loop benchmark – % error (0.0–0.8%) vs. expected value (0 to 250,000); series: Itanium, Power3, R12K]
Example 2: Number of Stores – Itanium, Power3, and R12K
[Chart: store count results – % difference (0.0–1.4%) vs. expected value (0 to 800,000); series: Itanium, Power3, R12K]
Example 2: Number of Stores – Power3 and Itanium
Platform        Loads   Stores
MIPS R12K       46      31
IBM Power3      28      129
Linux/IA-64     86      –
Linux/Pentium   N/A     N/A
Example 3: Total Number of Floating-Point Operations – Pentium II, R10K and R12K, and Itanium
[Table: Accurate / Consistent, per processor – Pentium II; MIPS R10K, R12K; Itanium]
Even when counters overflow. No overhead due to PAPI.
Reported Event Counts: Unexpected and Consistent Results – Errors?
The hardware-reported counts are multiples of the predicted counts
Reported Event Count / Multiplier = Predicted Event Count
Cannot identify overhead for calibration
[Chart: floating-point adds – % error (0–120%) vs. expected value (6,600 to 3×10^7); series: Itanium, Power3, R12K, Pentium]
Example 1: Total Number of Floating-Point Operations – Power3
[Table: Accurate / Consistent – Power3]
Reported Counts: Expected (Not Quantifiable) Results
Predictions: only possible under special circumstances
Reported event counts seem reasonable
But are they useful without knowing more about the algorithm used by the vendor?
Example 1: Total Data TLB Misses
Replacement policy can (unpredictably) affect event counts
PAPI may (unpredictably) affect event counts
Other processes may (unpredictably) affect event counts
Example 1: Total Compulsory Data TLB Misses for R10K
Predicted values consistently lower than reported
Small standard deviations
Greater predictability with increased no. of references
[Chart: % difference per no. of references – y-axis 3–15%, x-axis 1 to 10,000 references]
Example 2: L1 D-Cache Misses
# misses relatively constant as # of array references increases
[Chart: L1 D-cache misses using sequential access – % difference vs. data accesses (100 to 5×10^5); series: Itanium, Power3, R12K, Pentium]
Example 2: L1 D-Cache Misses
On some of the processors studied, as the number of accesses increased, the miss rate approached 0
Accessing the array with a stride of two cache sizes plus one cache line resulted in approximately the same event count as accessing it with a stride of one word
What’s going on?
Example 2: L1 D-Cache Misses with Random Access (Foils Prefetch Scheme Used by Stream Buffers)
[Chart: L1 D-cache misses as a function of % of cache filled (0–300%) – % error; series: Power3, R12K, Pentium]
Example 2: A Mathematical Model that Verifies that Execution Time Increases Proportionately with L1 D-Cache Misses
[Chart: cycles per data access – cycles vs. data accesses (0 to 10,000); series: Itanium, Power3, R12K, Pentium]
total_number_of_cycles = iterations * exec_cycles_per_iteration + cache_misses * cycles_per_cache_miss
Reported Event Counts: Unexpected but Consistent Results
Predicted counts and reported counts differ significantly but in a consistent manner
Is this an error? Are we missing something?
Example 1: Compulsory Data TLB Misses for Itanium
Reported counts consistently ~5 times greater than predicted
[Chart: % difference per no. of references – y-axis 399–404%, x-axis 1 to 10,000 references]
Example 3: Compulsory Data TLB Misses for Power3
Reported counts consistently ~5 times greater than predicted for small counts and ~2 times greater for large counts
[Chart: total TLB misses (Power3) – % discrepancy (150–550%) vs. no. of references (1 to 10,000)]
Reported Event Counts: Unexpected Results
Outliers
Puzzles
Example 1: Outliers – L1 D-Cache Misses for Itanium
[Chart: L1 D-cache misses using sequential access – % difference vs. data accesses (100 to 5×10^5); series: Itanium, Power3, R12K, Pentium]
Example 1: Supporting Data
Itanium L1 data cache misses (1M accesses):
90% of data – mean 1,290; standard deviation 170
10% of data – mean 782,891; standard deviation 566,370
Example 2: R10K Floating-Point Division Instructions
Fragment 1 – 1 FP instruction counted:
a = init_value; b = init_value; c = init_value;
a = b / init_value;
b = a / init_value;
c = b / init_value;

Fragment 2 – 3 FP instructions counted:
a = init_value; b = init_value; c = init_value;
a = a / init_value;
b = b / init_value;
c = c / init_value;
Example 2: Assembler Code Analysis
No optimization
Same instructions
Different (expected) operands
Three division instructions in both
No reason for different FP counts
Both fragments compile to the same instruction sequence:
l.d, s.d, l.d, s.d, l.d, s.d, l.d, l.d, div.d, s.d, l.d, l.d, div.d, s.d, l.d, l.d, div.d, s.d
Example 3: L1 D-Cache Misses with Random Access – Itanium (outlier only at array size = 10× cache size)
[Chart: L1 D-cache misses as a function of % of cache filled (0–300%) – % error; series: Itanium, Power3, R12K, Pentium]
Example 4: L1 I-Cache Misses and Instructions Retired – Itanium
[Chart: L1 I-cache misses – % error (−80 to 80%) vs. expected value (0 to 12,000); series: Itanium, Power3, R12K]
[Chart: total instructions retired – % error (0–20%) vs. expected value (0 to 3.5×10^6); series: Itanium, Power3, R12K, Pentium]
Both about 17% more than expected.
Future Work
Extend events studied – include multiprocessor events
Extend processors studied – include Power4
Study sampling on Power4; IBM collaboration re: workload characterization/system resource usage using sampling
Conclusions
Performance counters provide informative data that can be used for performance tuning
Expected frequency of an event may determine the usefulness of its counts
Calibration data can make event counts more useful to application programmers (loads, stores, floating-point instructions)
The usefulness of some event counts – as well as our research – could be enhanced with vendor collaboration
The usefulness of some event counts is questionable without documentation of the related behavior
Should we attach the following warning to some event counts on some platforms?
CAUTION: The values in the performance counters may be greater than you think.
And should we attach the PCAT Seal of Approval on others?
[PCAT seal graphic]
Invitation to Vendors
Help us understand what’s going on, when to attach the “warning,” and when to attach the “seal of approval.” Application programmers will appreciate your efforts, and so will we!
Question to You
On-board Performance Counters: What do they really tell you?
With all the caveats, are they useful nonetheless?