TRANSCRIPT
On-board Performance Counters: What do they really tell us?
Pat Teller, The University of Texas at El Paso (UTEP)
PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002
Credits (Person Power)
Michael Maxwell, Graduate (Ph.D.) Student
Leonardo Salayandia, Graduate (M.S.) Student – graduating in Dec. 2002
Alonso Bayona, Undergraduate
Alexander Sainz, Undergraduate
Credits (Financial)
DoD PET Program
NSF MIE (Model Institutions of Excellence) REU (Research Experiences for Undergraduates) Program
UTEP Dodson Endowment
Motivation
Facilitate performance-tuning efforts that employ aggregate event counts (not time-multiplexed) accessed via PAPI
When possible, provide calibration data, i.e., quantify overhead related to PAPI and other sources
Identify unexpected results – errors? Misunderstandings of processor functionality?
Road Map
Scope of Research
Methodology
Results
Future Work and Conclusions
Processors Under Study
MIPS R10K and R12K: 2 counters, 32 events
IBM Power3: 8 counters, 100+ events
Linux/IA-64: 4 counters, 150 events
Linux/Pentium: 2 counters, 80+ events
Events Studied So Far
Number of load and store instructions executed
Number of floating-point instructions executed
Total number of instructions executed (issued/committed)
Number of L1 I-cache and L1 D-cache misses
Number of L2 cache misses
Number of TLB misses
Number of branch mispredictions
PAPI Overhead
Extra instructions – read counter before and after workload
Processing of counter overflow interrupts
Cache pollution
TLB pollution
Methodology
[Configuration micro-benchmark]
Validation micro-benchmark – used to predict event count
Prediction via tool, mathematical model, and/or simulation
Hardware-reported event count collection via PAPI (instrumented benchmark run 100 times; mean event count and standard deviation calculated)
Comparison/analysis
Report findings
Validation Micro-benchmark
Simple, usually small program
Stresses a portion of the microarchitecture or memory hierarchy
Its size, simplicity, or execution time facilitates the tracing of its execution path and/or prediction of the number of times an event is generated
Basic types: array, loop, in-line, and floating-point
Scalable w.r.t. granularity, i.e., number of generated events
Example – Loop Validation Micro-benchmark
for (i = 0; i < number_of_loops; i++) {
    sequence of 100 instructions with data dependencies that prevent compiler reordering or optimization
}
Used to stress a particular functional unit, e.g., the load/store unit
Configuration Micro-benchmark
Simple, usually small program
Designed to provide insight into the structure and management algorithms of the microarchitecture and/or memory hierarchy
Example: program to identify the page size used to store user data
Some Results
Reported Event Counts: Expected, Consistent, and Quantifiable Results
Overhead related to PAPI and other sources is consistent and quantifiable
Reported Event Count − Predicted Event Count = Overhead
Example 1: Number of Loads – Itanium, Power3, and R12K
[Chart: load counts from the loop benchmark – % error (0.0–0.8%) vs. expected value (0 to 250,000); series: Itanium, Power3, R12K]
Example 2: Number of Stores – Itanium, Power3, and R12K
[Chart: store count results – % difference (0.0–1.4%) vs. expected value (0 to 800,000); series: Itanium, Power3, R12K]
Example 2: Number of Stores – Power3 and Itanium
Platform        Loads   Stores
MIPS R12K       46      31
IBM Power3      28      129
Linux/IA-64     86      –
Linux/Pentium   N/A     N/A
Example 3: Total Number of Floating-Point Operations – Pentium II, R10K and R12K, and Itanium
[Table: Accurate / Consistent, per processor – Pentium II; MIPS R10K, R12K; Itanium]
Even when counters overflow. No overhead due to PAPI.
Reported Event Counts: Unexpected and Consistent Results – Errors?
The hardware-reported counts are multiples of the predicted counts
Reported Event Count / Multiplier = Predicted Event Count
Cannot identify overhead for calibration
[Chart: floating-point adds – % error (0–120%) vs. expected value (6,600 to 3×10^7); series: Itanium, Power3, R12K, Pentium]
Example 1: Total Number of Floating-Point Operations – Power3
[Table: Accurate / Consistent – Power3]
Reported Counts: Expected (Not Quantifiable) Results
Predictions: only possible under special circumstances
Reported event counts seem reasonable
But are they useful without knowing more about the algorithm used by the vendor?
Example 1: Total Data TLB Misses
Replacement policy can (unpredictably) affect event counts
PAPI may (unpredictably) affect event counts
Other processes may (unpredictably) affect event counts
Example 1: Total Compulsory Data TLB Misses for R10K
Predicted values consistently lower than reported
Small standard deviations
Greater predictability with increased no. of references
[Chart: % difference per no. of references – y-axis 3–15%, x-axis 1 to 10,000 references]
Example 2: L1 D-Cache Misses
# misses relatively constant as # of array references increases
[Chart: L1 D-cache misses using sequential access – % difference vs. data accesses (100 to 5×10^5); series: Itanium, Power3, R12K, Pentium]
Example 2: L1 D-Cache Misses
On some of the processors studied, as the number of accesses increased, the miss rate approached 0
Accessing the array with a stride of two cache sizes plus one cache line resulted in approximately the same event count as accessing it with a stride of one word
What’s going on?
Example 2: L1 D-Cache Misses with Random Access (Foils Prefetch Scheme Used by Stream Buffers)
[Chart: L1 D-cache misses as a function of % of cache filled (0–300%) – % error; series: Power3, R12K, Pentium]
Example 2: A Mathematical Model that Verifies that Execution Time Increases Proportionately with L1 D-Cache Misses
[Chart: cycles per data access – cycles vs. data accesses (0 to 10,000); series: Itanium, Power3, R12K, Pentium]
total_number_of_cycles = iterations * exec_cycles_per_iteration + cache_misses * cycles_per_cache_miss
Reported Event Counts: Unexpected but Consistent Results
Predicted counts and reported counts differ significantly but in a consistent manner
Is this an error? Are we missing something?
Example 1: Compulsory Data TLB Misses for Itanium
Reported counts consistently ~5 times greater than predicted
[Chart: % difference per no. of references – y-axis 399–404%, x-axis 1 to 10,000 references]
Example 3: Compulsory Data TLB Misses for Power3
Reported counts consistently ~5 times greater than predicted for small counts and ~2 times greater for large counts
[Chart: total TLB misses (Power3) – % discrepancy (150–550%) vs. no. of references (1 to 10,000)]
Reported Event Counts: Unexpected Results
Outliers
Puzzles
Example 1: Outliers – L1 D-Cache Misses for Itanium
[Chart: L1 D-cache misses using sequential access – % difference vs. data accesses (100 to 5×10^5); series: Itanium, Power3, R12K, Pentium]
Example 1: Supporting Data
Itanium L1 data cache misses (1M accesses):
90% of data – mean 1,290; standard deviation 170
10% of data – mean 782,891; standard deviation 566,370
Example 2: R10K Floating-Point Division Instructions
Fragment 1 – 1 FP instruction counted:
a = init_value; b = init_value; c = init_value;
a = b / init_value;
b = a / init_value;
c = b / init_value;

Fragment 2 – 3 FP instructions counted:
a = init_value; b = init_value; c = init_value;
a = a / init_value;
b = b / init_value;
c = c / init_value;
Example 2: Assembler Code Analysis
No optimization
Same instructions
Different (expected) operands
Three division instructions in both
No reason for different FP counts
Both fragments compile to the same instruction sequence:
l.d, s.d, l.d, s.d, l.d, s.d, l.d, l.d, div.d, s.d, l.d, l.d, div.d, s.d, l.d, l.d, div.d, s.d
Example 3: L1 D-Cache Misses with Random Access – Itanium (outlier only at array size = 10× cache size)
[Chart: L1 D-cache misses as a function of % of cache filled (0–300%) – % error; series: Itanium, Power3, R12K, Pentium]
Example 4: L1 I-Cache Misses and Instructions Retired – Itanium
[Chart: L1 I-cache misses – % error (−80 to 80%) vs. expected value (0 to 12,000); series: Itanium, Power3, R12K]
[Chart: total instructions retired – % error (0–20%) vs. expected value (0 to 3.5×10^6); series: Itanium, Power3, R12K, Pentium]
Both about 17% more than expected.
Future Work
Extend events studied – include multiprocessor events
Extend processors studied – include Power4
Study sampling on Power4; IBM collaboration re: workload characterization/system resource usage using sampling
Conclusions
Performance counters provide informative data that can be used for performance tuning
Expected frequency of an event may determine the usefulness of its counts
Calibration data can make event counts more useful to application programmers (loads, stores, floating-point instructions)
The usefulness of some event counts – as well as our research – could be enhanced with vendor collaboration
The usefulness of some event counts is questionable without documentation of the related behavior
Should we attach the following warning to some event counts on some platforms?
CAUTION: The values in the performance counters may be greater than you think.
And should we attach the PCAT Seal of Approval on others?
[PCAT seal graphic]
Invitation to Vendors
Help us understand what’s going on, when to attach the “warning,” and when to attach the “seal of approval.” Application programmers will appreciate your efforts, and so will we!
Question to You
On-board Performance Counters: What do they really tell you?
With all the caveats, are they useful nonetheless?