
Page 1: On-board Performance Counters: What do they really tell us?

Pat Teller, The University of Texas at El Paso (UTEP)

PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

Page 2: Credits (Person Power)

Michael Maxwell, Graduate (Ph.D.) Student

Leonardo Salayandia, Graduate (M.S.) Student – graduating in Dec. 2002

Alonso Bayona, Undergraduate
Alexander Sainz, Undergraduate

Page 3: Credits (Financial)

DoD PET Program
NSF MIE (Model Institutions of Excellence) REU (Research Experiences for Undergraduates) Program
UTEP Dodson Endowment

Page 4: Motivation

Facilitate performance-tuning efforts that employ aggregate event counts (not time-multiplexed) accessed via PAPI

When possible, provide calibration data, i.e., quantify overhead related to PAPI and other sources

Identify unexpected results – errors? Misunderstandings of processor functionality?

Page 5: Road Map

Scope of Research
Methodology
Results
Future Work and Conclusions

Page 6: Processors Under Study

MIPS R10K and R12K: 2 counters, 32 events
IBM Power3: 8 counters, 100+ events
Linux/IA-64: 4 counters, 150 events
Linux/Pentium: 2 counters, 80+ events

Page 7: Events Studied So Far

Number of load and store instructions executed
Number of floating-point instructions executed
Total number of instructions executed (issued/committed)
Number of L1 I-cache and L1 D-cache misses
Number of L2 cache misses
Number of TLB misses
Number of branch mispredictions

Page 8: PAPI Overhead

Extra instructions – read counter before and after workload
Processing of counter-overflow interrupts
Cache pollution
TLB pollution
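For reference, a minimal sketch of the read-before/read-after pattern using PAPI's classic high-level counter interface; the event choice (PAPI_LD_INS) and the workload() routine are illustrative assumptions, not taken from the slides:

    #include <stdio.h>
    #include <papi.h>

    void workload(void);                   /* the micro-benchmark under test (assumed) */

    int main(void)
    {
        int events[1] = { PAPI_LD_INS };   /* preset event: load instructions */
        long long counts[1];

        if (PAPI_start_counters(events, 1) != PAPI_OK)   /* start/read before */
            return 1;
        workload();                                      /* measured region */
        if (PAPI_stop_counters(counts, 1) != PAPI_OK)    /* stop/read after */
            return 1;
        printf("loads counted: %lld\n", counts[0]);
        return 0;
    }

Everything PAPI itself executes between the start and stop calls is attributed to the workload, which is exactly the overhead this study sets out to quantify.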

Page 9: Methodology

[Configuration micro-benchmark]
Validation micro-benchmark – used to predict the event count
Prediction via tool, mathematical model, and/or simulation
Hardware-reported event count collection via PAPI (instrumented benchmark run 100 times; mean event count and standard deviation calculated – see the sketch below)
Comparison/analysis
Report findings
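A sketch of that collection step, assuming a run_once() wrapper that performs one instrumented run and returns the hardware-reported count:

    #include <math.h>
    #include <stdio.h>

    #define RUNS 100

    long long run_once(void);   /* one instrumented benchmark run (assumed) */

    int main(void)
    {
        double sum = 0.0, sumsq = 0.0;
        for (int i = 0; i < RUNS; i++) {
            double c = (double) run_once();
            sum   += c;
            sumsq += c * c;
        }
        double mean = sum / RUNS;
        double sd   = sqrt(sumsq / RUNS - mean * mean);  /* population std. dev. */
        printf("mean = %.1f, std dev = %.1f\n", mean, sd);
        return 0;
    }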


Page 10: Validation Micro-benchmark

Simple, usually small program
Stresses a portion of the microarchitecture or memory hierarchy
Its size, simplicity, or execution time facilitates the tracing of its execution path and/or prediction of the number of times an event is generated
Basic types: array, loop, in-line, and floating-point
Scalable w.r.t. granularity, i.e., number of generated events

Page 11: Example – Loop Validation Micro-benchmark

    for (i = 0; i < number_of_loops; i++) {
        /* sequence of 100 instructions with data dependencies
           that prevent compiler reordering or optimization */
    }

Used to stress a particular functional unit, e.g., the load/store unit; a fuller sketch follows below.
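A minimal C sketch of such a loop, shortened from 100 instructions; names and constants are illustrative, and the dependent chain (each statement consumes the previous result) is what blocks reordering:

    volatile double seed = 1.0;            /* volatile defeats constant folding */

    void loop_benchmark(long number_of_loops)
    {
        double x = seed;
        for (long i = 0; i < number_of_loops; i++) {
            x = x + 1.0;                   /* each statement reads the previous */
            x = x * 1.000001;              /* result, so the compiler can neither */
            x = x - 0.5;                   /* reorder nor eliminate the sequence */
            /* ... extended to ~100 dependent instructions ... */
        }
        seed = x;                          /* keep the final result live */
    }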


Page 12: Configuration Micro-benchmark

Simple, usually small program
Designed to provide insight into the structure and management algorithms of the microarchitecture and/or memory hierarchy

Example: a program to identify the page size used to store user data (see the sketch below)
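A minimal sketch of one way such a probe can work, assuming the PAPI preset PAPI_TLB_DM (data-TLB misses) and the classic high-level API; the slides do not give the actual program:

    #include <stdio.h>
    #include <stdlib.h>
    #include <papi.h>

    #define BUF_BYTES (64 * 1024 * 1024)   /* probe buffer (assumed size) */

    int main(void)
    {
        char *buf = malloc(BUF_BYTES);
        int events[1] = { PAPI_TLB_DM };   /* data TLB misses */
        long long misses[1];

        for (size_t stride = 1024; stride <= 64 * 1024; stride *= 2) {
            PAPI_start_counters(events, 1);
            for (size_t i = 0; i < BUF_BYTES; i += stride)
                buf[i]++;                  /* touch one byte per stride */
            PAPI_stop_counters(misses, 1);
            printf("stride %6zu: %lld TLB misses\n", stride, misses[0]);
        }
        /* The miss count stays roughly flat while stride <= page size
           (every page is still touched once) and then halves with each
           further doubling of the stride: the knee marks the page size. */
        free(buf);
        return 0;
    }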


Page 13: Some Results

Page 14: Reported Event Counts: Expected, Consistent, and Quantifiable Results

Overhead related to PAPI and other sources is consistent and quantifiable:

Reported Event Count − Predicted Event Count = Overhead

Page 15: Example 1: Number of Loads – Itanium, Power3, and R12K

[Figure: "Load data using loop benchmark" – % error vs. expected value (0 to 250,000) for Itanium, Power3, and R12K; errors range from 0.000% to 0.800%.]

Page 16: Example 2: Number of Stores – Itanium, Power3, and R12K

[Figure: "Store count results" – % difference vs. expected value (0 to 800,000) for Itanium, Power3, and R12K; differences range from 0.000% to 1.400%.]

Page 17: Example 2: Number of Stores – Power3 and Itanium

    Platform   MIPS R12K   IBM Power3   Linux/IA-64   Linux/Pentium
    Loads      46          28           86            N/A
    Stores     (missing)   31           129           N/A

Page 18: Example 3: Total Number of Floating-Point Operations – Pentium II, R10K and R12K, and Itanium

[Table: Pentium II, MIPS R10K/R12K, and Itanium each rated for "Accurate" and "Consistent"; the individual ratings did not survive extraction.]

Even when counters overflow. No overhead due to PAPI.

Page 19: Reported Event Counts: Unexpected and Consistent Results – Errors?

The hardware-reported counts are multiples of the predicted counts:

Reported Event Count / Multiplier = Predicted Event Count

Cannot identify overhead for calibration

Page 20: Example 1: Total Number of Floating-Point Operations – Power3

[Figure: "Floating Point Adds" – % error vs. expected value (6,600 to 3×10^7) for Itanium, Power3, R12K, and Pentium; errors range from 0% to 120%.]

Accurate and consistent.

Page 21: Reported Counts: Expected (Not Quantifiable) Results

Predictions: only possible under special circumstances

Reported event counts seem reasonable

But are they useful without knowing more about the algorithm used by the vendor?

Page 22: Example 1: Total Data TLB Misses

Replacement policy can (unpredictably) affect event counts
PAPI may (unpredictably) affect event counts
Other processes may (unpredictably) affect event counts

Page 23: Example 1: Total Compulsory Data TLB Misses for R10K

Predicted values consistently lower than reported
Small standard deviations
Greater predictability with increased no. of references

[Figure: % difference per no. of references (1 to 10,000); differences range from 3% to 15%.]
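For context, a sketch of the kind of prediction such an experiment compares against; for a contiguous array, a simple compulsory-miss model is one data-TLB miss per distinct page touched (an assumption of this illustration, not a statement of the authors' exact model):

    /* Predicted compulsory data-TLB misses for a contiguous array:
       one miss on the first touch of each distinct page. */
    long predicted_dtlb_misses(long array_bytes, long page_bytes)
    {
        return (array_bytes + page_bytes - 1) / page_bytes;   /* ceiling */
    }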


Page 24: Example 2: L1 D-Cache Misses – number of misses relatively constant as the number of array references increases

[Figure: "L1 D cache misses using sequential access" – % difference vs. data accesses (100 to 5×10^5) for Itanium, Power3, R12K, and Pentium; differences range from -200% to 2000%.]

Page 25: Example 2: L1 D-Cache Misses

On some of the processors studied, as the number of accesses increased, the miss rate approached 0

Accessing the array in strides of two cache-size units plus one cache line resulted in approximately the same event count as accessing the array in strides of one word (see the sketch below)

What’s going on?
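A sketch of the two access patterns being compared; the cache and line sizes are illustrative placeholders, and both patterns issue the same number of accesses so their event counts are directly comparable:

    #define CACHE_BYTES (32 * 1024)        /* assumed L1 D-cache size */
    #define LINE_BYTES  32                 /* assumed cache-line size */

    /* Perform 'accesses' reads over array a[0..n-1], advancing by
       'stride' bytes and wrapping at the end of the array. */
    double sweep(char *a, long n, long stride, long accesses)
    {
        double sum = 0.0;
        long idx = 0;
        for (long k = 0; k < accesses; k++) {
            sum += a[idx];
            idx = (idx + stride) % n;      /* wrap around the array */
        }
        return sum;
    }

    /* one-word stride:             sweep(a, n, sizeof(long), accesses)       */
    /* two cache sizes plus a line: sweep(a, n, 2*CACHE_BYTES + LINE_BYTES,
                                          accesses)                           */

One possible explanation: if a hardware prefetcher recognizes either pattern as a stream, it can hide the misses the model predicts, which would make the two counts nearly identical.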


Page 26: Example 2: L1 D-Cache Misses with Random Access (Foil Prefetch Scheme Used by Stream Buffers)

[Figure: "L1 D cache misses as a function of % filled" – % error vs. % of cache filled (0 to 300%) for Power3, R12K, and Pentium; errors range from -150% to 400%.]
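A sketch of the randomized-order idea, assuming a precomputed visit permutation; shuffling destroys the spatial regularity that stream buffers key on:

    #include <stdlib.h>

    /* Fisher-Yates shuffle of the visit order. */
    void shuffle(long *order, long n)
    {
        for (long i = n - 1; i > 0; i--) {
            long j = rand() % (i + 1);
            long t = order[i]; order[i] = order[j]; order[j] = t;
        }
    }

    /* Visit the array in the shuffled order so that consecutive
       accesses are not spatially related and cannot be prefetched
       by a sequential stream buffer. */
    double random_sweep(const double *a, const long *order, long n)
    {
        double sum = 0.0;
        for (long k = 0; k < n; k++)
            sum += a[order[k]];
        return sum;
    }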


Page 27: Example 2: A Mathematical Model that Verifies that Execution Time Increases Proportionately with L1 D-Cache Misses

[Figure: "Cycles per Data Access" – cycles vs. data accesses (0 to 10,000) for Itanium, Power3, R12K, and Pentium; cycles range from 0 to 1,800,000.]

total_number_of_cycles = iterations * exec_cycles_per_iteration + cache_misses * cycles_per_cache_miss
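The same model restated as code, with a made-up worked example; the per-iteration and per-miss cycle costs would come from calibration runs, not from the slides:

    /* total cycles = execution cycles + stall cycles from misses */
    long long predict_cycles(long long iterations,
                             long long exec_cycles_per_iteration,
                             long long cache_misses,
                             long long cycles_per_cache_miss)
    {
        return iterations * exec_cycles_per_iteration
             + cache_misses * cycles_per_cache_miss;
    }

    /* e.g., 10,000 iterations at 4 cycles each plus 1,250 misses at
       80 cycles each: 40,000 + 100,000 = 140,000 cycles */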


Page 28: Reported Event Counts: Unexpected but Consistent Results

Predicted counts and reported counts differ significantly but in a consistent manner

Is this an error? Are we missing something?

Page 29: Example 1: Compulsory Data TLB Misses for Itanium

Reported counts consistently ~5 times greater than predicted

[Figure: % difference per no. of references (1 to 10,000); differences range from 399% to 404%.]

Page 30: Example 3: Compulsory Data TLB Misses for Power3

Reported counts consistently ~5 times greater than predicted for small counts and ~2 times greater for large counts

[Figure: "Total TLB misses (Power3)" – % discrepancy vs. no. of references (1 to 10,000); discrepancies range from 150% to 550%.]

Page 31: Reported Event Counts: Unexpected Results

Outliers
Puzzles

Page 32: Example 1: Outliers – L1 D-Cache Misses for Itanium

[Figure: "L1 D cache misses using sequential access" – % difference vs. data accesses (100 to 5×10^5) for Itanium, Power3, R12K, and Pentium; differences range from -200% to 2000%.]

Page 33: Example 1: Supporting Data

Itanium L1 Data Cache Misses (1M accesses):

                   Mean      Standard Deviation
    90% of data    1,290     170
    10% of data    782,891   566,370

Page 34: Example 2: R10K Floating-Point Division Instructions

Snippet 1 (1 FP instruction counted):

    a = init_value;
    b = init_value;
    c = init_value;
    a = b / init_value;
    b = a / init_value;
    c = b / init_value;

Snippet 2 (3 FP instructions counted):

    a = init_value;
    b = init_value;
    c = init_value;
    a = a / init_value;
    b = b / init_value;
    c = c / init_value;

Page 35: Example 2: Assembler Code Analysis

No optimization
Same instructions
Different (expected) operands
Three division instructions in both
No reason for different FP counts

Both snippets compile to the same instruction sequence:

    l.d            # the three initializations:
    s.d            #   load init_value, store to a/b/c
    l.d
    s.d
    l.d
    s.d
    l.d            # each division: load operands,
    l.d            #   divide, store result
    div.d
    s.d
    l.d
    l.d
    div.d
    s.d
    l.d
    l.d
    div.d
    s.d

Page 36: Example 3: L1 D-Cache Misses with Random Access – Itanium (only when array size = 10x cache size)

[Figure: "L1 D cache misses as a function of % filled" – % error vs. % of cache filled (0 to 300%) for Itanium, Power3, R12K, and Pentium; errors range from -200% to 1600%.]

Page 37: Example 4: L1 I-Cache Misses and Instructions Retired – Itanium

[Figure: "L1 I cache misses" – % error vs. expected value (0 to 12,000) for Itanium, Power3, and R12K; errors range from -80% to 80%.]

[Figure: "Total instructions retired" – % error vs. expected value (0 to 3,500,000) for Itanium, Power3, R12K, and Pentium; errors range from 0% to 20%.]

Both about 17% more than expected.

Page 38: Future Work

Extend events studied – include multiprocessor events

Extend processors studied – include Power4

Study sampling on Power4; IBM collaboration re: workload characterization/system resource usage using sampling

Page 39: Conclusions

Performance counters provide informative data that can be used for performance tuning

The expected frequency of an event may determine the usefulness of its counts

Calibration data can make event counts more useful to application programmers (loads, stores, floating-point instructions)

The usefulness of some event counts – as well as our research – could be enhanced by vendor collaboration

The usefulness of some event counts is questionable without documentation of the related behavior

Page 40: Should we attach the following warning to some event counts on some platforms?

CAUTION: The values in the performance counters may be greater than you think.

Page 41: And should we attach the PCAT Seal of Approval on others?

[Image: PCAT seal]

Page 42: Invitation to Vendors

Help us understand what's going on, when to attach the "warning," and when to attach the "seal of approval." Application programmers will appreciate your efforts, and so will we!

Page 43: Question to You

On-board Performance Counters: What do they really tell you?

With all the caveats, are they useful nonetheless?