John Mellor-CrummeyRobert Fowler Nathan Tallent
Department of Computer ScienceRice University
SC04 Tutorial Presentation
S05: Practical Application PerformanceAnalysis on Linux Systems
http://www.hipersoft.rice.edu/hpctoolkit/
2
Goals for this Tutorial
Understand
• Factors affecting performance on modern architectures
• What performance monitoring hardware does
• Performance monitoring capabilities on common processors
• Strategies for measuring application node performance
—Performance calipers
—Sample-based profiling
—How and when to use each
• Performance tool resources
• How to analyze application performance using HPCToolkit
—Collect sample-based profiles of performance metrics
—Explore and understand application performance
• How to install HPCToolkit on your own system
3
Schedule
• Session 1
— Overview
— The 45-minute Performance Expert
— Survey of Performance Instrumentation
• Session 2
— Introduction to HPCToolkit on Linux
— Case Studies
• Session 3
— Hands-on Exercises
— Using HPCToolkit In Practice
• Session 4
— Advanced Exercises: our codes or yours
— HPCToolkit Installation Overview
— Wrap up
4
Session 1 a
The 45-Minute Performance Expert
5
Analysis and Tuning Questions
• How can we tell if a program has good performance?
• How can we tell that it doesn’t?
• If performance is not “good,” how can we pinpoint where?
• How can we identify the causes?
• What can we do about it?
6
Peak vs. Realized Performance
Scientific applications often realize as little as 5-25% of peak on microprocessor-based systems
Reason: mismatch between application and architecture capabilities
—Architecture has insufficient bandwidth to main memory
– microprocessors often provide < 1 byte from memory per FLOP; scientific applications often need more
—Application has insufficient locality
– irregular accesses can squander memory bandwidth: use only part of each data block fetched from memory
– may not adequately reuse costly virtual memory address translations
—Exposed memory latency
– architecture: inadequate memory parallelism to hide latency
– application: not structured to exploit memory parallelism
—Instruction mix doesn’t match available functional units
Peak performance = guaranteed not to exceed
Realized performance = what you achieve
7
Performance Analysis and Tuning
• Increasingly necessary
—Gap between realized and peak performance is growing
• Increasingly hard
—Complex architectures are harder to program effectively
– complex processors: pipelining, out-of-order execution, VLIW
– complex memory hierarchy: multi-level non-blocking caches, TLB
—Optimizing compilers are critical to achieving high performance
– small program changes may have a large impact
—Modern scientific applications pose challenges for tools
– multi-lingual programs
– many source files
– complex build process
– external libraries in binary-only form
—Many tools don’t help you identify or solve your problem
8
Mead’s “Tall, Thin Man:” VLSI System Designer
Needed Expertise:
— Embedded algorithms
— System architecture
— Circuit design
— Layout
— Device Physics
— Manufacturability
— Testing
— System Integration
The “tall, thin man†” knows just enough, but is supported by staff and tools.
†An engineer who works at every level of integration from circuits to application software. -- Carver Mead
(Also see “lead programmer teams”, “concurrent engineering”)
9
The Tall, Thin Application Performance Expert
Needed Expertise and Tools:
—Science
—Modeling
—Numerical Methods
—Programming Languages
—Libraries
—Compilers
—Computer Architecture
—Performance Analysis
—Testing
—Etc.
Requires support staff and tools to augment expertise. Necessary tools are not the same ones used by the system designer!
Stands on the shoulders of the tall, thin system designer.
10
Tuning Applications
• Performance is an interaction between
—Numerical model
—Algorithms
—Problem formulation (as a program)
—Data structures
—System software
—Hardware
• Removing performance bottlenecks may require dependent adjustments to all
11
Performance = Resources = Time
Tprogram = Tcompute + Twait

Tcompute is the time the CPU thinks it is busy.
Twait is the time it is waiting for external devices/events, including other processes, the operating system, I/O, and communication.

Tcompute = Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)

— Instructions/Program: determined by model, algorithm, and data structures.
— Cycles/Instruction: determined by architecture and by the compiler’s ability to use the architecture efficiently for your program.
— Seconds/Cycle: determined by technology and by hardware design.
12
Microprocessor-based Architectures
• Instruction Level Parallelism (ILP): systems are not really serial
—Deeply pipelined processors
—Multiple functional units
• Processor taxonomy
—Out-of-order superscalar: Alpha, Pentium 4, Opteron
– hardware dynamically schedules instructions: determines dependences and dispatches instructions
– many instructions can be in flight at once
—VLIW: Itanium
– issues a fixed-size “bundle” of instructions each cycle
– bundles tailored to mix of available functional units
– compiler pre-determines which instructions execute in parallel
• Complex memory hierarchy
—Non-blocking, multi-level caches
—TLB (Translation Lookaside Buffer)
13
Pipelining 101
• Basic microprocessor pipeline (RISC circa 1983)
— Instruction fetch (IF)
— Instruction decode (ID)
— Execute (EX)
— Memory access (MEM)
— Writeback (WB)
• Each instruction takes 5 cycles (latency).
• One instruction can complete per cycle (theoretical peak throughput).
• Disruptions and replays
— On simple processors: bubbles and stalls
— Recent complex/dynamic processors use an abort/replay approach
[Figure: the five pipeline stages: Fetch, Decode, Execute, Memory, Write]
14
Limits to Pipelining
• Hazards: conditions that prevent the next instruction from being launched or (in speculative systems) completed
— Structural hazard: can’t use the same hardware to do two different things at the same time
— Data hazard: instruction depends on the result of a prior instruction still in the pipeline. (Also, an instruction tries to overwrite values still needed by other instructions.)
— Control hazard: delay between fetching control instructions and decisions about changes in control flow (branches and jumps)
• In the presence of a hazard, introduce delays (pipeline bubbles) until the hazard is resolved
• Deep or complex pipelines increase the cost of hazards
• External delays
— Cache and memory delays
— Address translation (TLB) misses
15
Pentium 4 (Super-Pipelined, Super-Scalar)
• Stages 1-2 — Trace cache next instruction pointer
• Stages 3-4 — Trace cache fetch
• Stage 5 — Drive (wire delay!)
• Stages 6-8 — Allocate and rename
• Stages 10-12 — Schedule instructions: memory / fast ALU / slow ALU & general FP / simple FP
• Stages 13-14 — Dispatch
• Stages 15-16 — Register access
• Stage 17 — Execute
• Stage 18 — Set flags
• Stage 19 — Check branches
• Stage 20 — Drive (more wire delay!)

Resources: 5 operations issued per clock; 1 load and 1 store unit; 2 simple/fast and 1 complex/slower integer units; 1 FP execute and 1 FP move unit. Up to 126 instructions in flight: 48 loads, 24 stores, …
16
Opteron Pipeline (Super-Pipelined, Super-Scalar)
Integer
1. Fetch1
2. Fetch2
3. Pick
4. Decode1
5. Decode2
6. Pack
7. Pack/Decode
8. Dispatch
9. Schedule
10. AGU/ALU
11. DC access
12. DC response
Floating Point
8. Dispatch
9. Stack rename
10. Register rename
11. Schedule write
12. Schedule
13. Register read
14. FX0
15. FX1
16. FX2
17. FX3
48. …
L2 Cache
13. L2 tag
14. …
15. L2 data
16. …
17. Route/multiplex ECC
18. …
19. Write DC/forward data

DRAM
14. Address to SAQ
15. Clock crossing
16. System request queue
17. …
26. Request DRAM pins
27. … (memory access)

Resources: fetch/decode 3 instructions/cycle; 3 integer units; 3 address units; 3 FPU/multimedia units; 2 loads/stores to d-cache per clock.
17
Itanium2 Pipeline (VLIW/EPIC)
1. Inst. Ptr. Generation and Fetch
2. Inst. “Rotation”
3. Decode, expand, and disperse
4. Rename and decode registers
5. Register file read
6. Execution/ Memory(Cache) 1/ FP1
7. Exception detection/ Memory(Cache) 2/ FP2
8. Write back/ Memory(Cache) 3/ FP3
9. Memory(Cache) 4/ FP4
Resources: six instructions (two bundles) issued per clock; 2 integer units; 4 memory units; 3 branch units; 2 floating-point units.
18
A Modern Memory Hierarchy (Itanium 2)
[Figure: Itanium 2 memory hierarchy: processor (with write buffers and separate I/D TLBs); 16K L1-I and 16K L1-D, 1 cycle; 256K L2, 5 cycles (6 for FP); 3M L3, 13.3 cycles (13.1 for FP); memory controller feeding two memory banks at ~209.6 ns; FP loads bypass the L1 caches. See http://www.devx.com/Intel/Article/20521]
19
Memory Hierarchy 101
• Translation Lookaside Buffer (TLB)
—Fast virtual memory map for a small number (~64) of pages
—Touching lots of pages → lots of TLB misses → expensive
• Load/Store queues
—Write latency and bandwidth are usually* not an issue
• Caches
—Data in a cache is organized as a set of 32-128 byte blocks
—Spatial locality: use all of the data in a block
—Temporal locality: reuse data at the same location
—Load/store operations access data in the level of cache closest to the processor in which the data is resident
– a load of data in the L1 cache does not cause traffic between L1 & L2
—Typically, data in a cache close to the processor is also resident in lower-level caches (inclusion)
—A miss in the Lk cache causes an access to Lk+1
• Parallelism
—Multi-ported caches
—Multiple memory accesses in flight
20
The Memory Wall: LMBENCH Latency Results

                      P4 3.0GHz      Dual K8 1.6GHz   Dual K8 1.6GHz   McKinley 900MHz x2   P4 2GHz
                      i865PERL       Tyan 2882        Tyan 2882        HP zx6000            Dell
Memory                PC3200         Registered       Registered       PC2100 ECC           RB800
                                     PC2700 ECC,      PC2700 ECC,
                                     node IL on       node IL off
Compiler              gcc 3.2.3      gcc 3.3.2        gcc 3.3.2        gcc 3.2              gcc 3.2
Integer +/*/Div (ns)  .17/4.7/19.4   .67/1.9/26       .67/1.9/26       1.12/4/7.9           .25/.25/11
Double +/*/Div (ns)   1.67/2.34/14.6 2.54/2.54/11.1   2.54/2.54/11.1   4.45/4.45            2.5/3.75/
Lat. L1 (ns)          0.67           1.88             1.88             2.27                 1.18
Lat. L2 (ns)          6.15           12.40            12.40            7.00                 9.23
Lat. Main (ns)        91.00          136.50           107.50           212.00               176.50

Note: 64 bytes/miss @ 100 ns/miss delivers only 640 MB/sec
21
The Memory Wall: Streams (Bandwidth) Benchmark

                  P4 3.0GHz       Opteron (x2) 1.6GHz     McKinley (x2) 900MHz
                  i865PERL        Tyan 2882               HP zx6000
Bus/Memory        800MHz PC3200   Registered PC2700 ECC,  PC2100 ECC
                                  node IL off

Native compiler, 6M elements (MB/s): icc 8.0 -fast (P4), ecc 7.1 -O3 (McKinley)
  copy   2479   3318
  scan   2479   3306
  add    3029   3842
  triad  3024   3844

gcc, 6M elements (MB/s): gcc 3.2.3 -O3 (P4), gcc 3.3.2 -O3 -m64 (Opteron), gcc 3.2 -O3 (McKinley; with -funroll-all-loops in parentheses)
  copy   2422   1635   793 (820)
  scan   2459   1661   734 (756)
  add    2995   2350   843 (853)
  triad  2954   1967   844 (858)

Itanium requires careful explicit code scheduling! The native compiler’s results are good; gcc’s are terrible.
22
Memory Latency and Bandwidth: Fundamental Limits
Example: simple 2-d relaxation kernel
S: r(i,j) = a * x(i,j) + b * (x(i-1,j)+x(i+1,j)+x(i,j-1)+x(i,j+1))
• Each execution of S
—performs 6 ops
—uses 5 x values
—writes 1 result value to r
(not counting index computations or scalars kept in registers)
• Assume x is double, large, and always coming from main memory*
— S consumes 40 bytes of memory read bandwidth: 6.7 bytes/op
— 1 GB/sec memory → 150 MFLOPS → 25M updates/sec
* actually, some accesses to x will come from cache for this computation
23
Memory Latency and Bandwidth: Fundamental Limits
Example: simple 2-d relaxation kernel (part 2)
S: r(i,j) = a * x(i,j) + b * (x(i-1,j)+x(i+1,j)+x(i,j-1)+x(i,j+1))
• In each outer loop, each x value is used exactly 5 times (6 if r ≡ x)
• If a programmer/compiler gets perfect reuse in registers and cache:
—S consumes 5*(1/5)*8 = 8 bytes of memory read bandwidth: 1.3 bytes/op
—1 GB/sec memory → 750 MFLOPS
• How?
—First hide the latency of cache/memory access
– restructure to increase locality in the cache
– schedule loads ahead of use
– prefetch
– software pipeline
—Cache-friendly transformations
– unit-stride accesses
– tiling
(Latency hiding: compiler approaches. Cache-friendly transformations: programmer or compiler approaches.)
At 750 MFLOPS: 125M updates/sec
24
Memory Bandwidth
Example: realistic 2-d relaxation kernel (real materials and deformed meshes)
S: r(i,j) = c(i,j,self)*x(i,j)
+ c(i,j,up)*x(i-1,j) + c(i,j,down)*x(i+1,j)
+ c(i,j,left)*x(i,j-1) + c(i,j,right)*x(i,j+1)
—Each execution of S
– performs 9 ops
– uses 5 x and 5 c values (not counting index computations or scalars kept in registers)
—Assume x and c are double and always come from main memory
– S consumes 80 bytes of memory read bandwidth: 8.9 bytes/op
– 1 GB/sec memory → 112 MFLOPS → 12.5M updates/sec
25
Memory Bandwidth
Example: realistic 2-d relaxation kernel (part 2)(real materials and deformed meshes)
S: r(i,j) = c(i,j,self)*x(i,j)
+ c(i,j,up)*x(i-1,j) + c(i,j,down)*x(i+1,j)
+ c(i,j,left)*x(i,j-1) + c(i,j,right)*x(i,j+1)
—In each outer loop, each x value is used 5 times (6 if r ≡ x).
—Each c value is used only once (twice for i≠j if c is symmetric).
—If a programmer/compiler gets perfect reuse in registers and cache:
– S consumes (5 + 5*(1/5))*8 = 48 bytes of memory read bandwidth: 5.33 bytes/op
– 1 GB/sec memory → 187 MFLOPS upper bound → 20.8M updates/sec
26
Memory Bandwidth
Example: A realistic 2-d relaxation kernel (part 3)(real materials, deformed meshes)
S: r(i,j) = c(i,j,self)*x(i,j)
+ c(i,j,up)*x(i-1,j) + c(i,j,down)*x(i+1,j)
+ c(i,j,left)*x(i,j-1) + c(i,j,right)*x(i,j+1)
Solutions
— “Well-studied in the literature”
– hide latency and get all the reuse you can
— “Fundamental stuff”
– aggressive algorithms to increase reuse (time skewing, etc.)
– alternate algorithmic approaches: one should not care if the FLOP rate is low if the solution is better and faster
27
What about Parallel Codes?
Good parallel efficiency and scalability requires
• Even partitioning of computation
• Low serialization/good parallelism
• Minimum communication
• Low communication overhead
• Communication overlapped with computation
28
Tuning a Parallel Code
While (not satisfied) do:
1. Analyze parallelism
2. Get the parallelism right
3. Analyze node-wide issues
4. Fix any problems
5. Analyze your code
6. Fix the problems
29
Communication Overhead and Latency

Simple message transmission:

[Figure: timeline of a message from sender to receiver: sender overhead (processor busy), transmission time (size ÷ bandwidth), time of flight, transmission time at the receiver, receiver overhead (processor busy); transport latency spans the network portion.]

Total Latency = Sender Overhead + Time of Flight + (Message Size ÷ Bandwidth) + Receiver Overhead

Note: “size” includes protocol overhead.
30
A Tale of Two Parallelizations …
[Figure: two data partitionings of the same computation: multipartitioning vs. coarse-grain pipelining]

Same computational work; different data partitioning; serialization dramatically affects scalability and performance
31
Take Home Points
• Tuning parallel programs requires tuning the parallelism first
• Modern processors are complex and require instruction-level parallelism for performance
—Understanding hazards is the key to understanding performance
• Memory is much slower than processors; a multi-layer memory hierarchy is there to hide that gap
—The gap can’t be hidden if your bandwidth needs are too great
—Only data reuse will enable you to approach peak performance
• Performance tuning may require changing everything
—Algorithms, data structures, program structure
• Effective performance tuning requires a top-down approach
—Too many issues to study everything
—Must identify the BIGGEST impediment and go after that first
32
Session 1 b
Survey of Performance Instrumentation
33
Performance Monitoring Hardware
Purpose
— Capture information about performance-critical details that is otherwise inaccessible
– cycles in flight, TLB misses, mispredicted branches, etc.

What it does
— Characterize “events” and measure durations
— Record information about an instruction as it executes

Two flavors of performance monitoring hardware
1. Aggregate performance event counters
— sample events during execution: cycles, board cache misses
— limitation: out-of-order execution smears attribution of events
2. “ProfileMe” instruction execution trace hardware
— a set of boolean flags indicating occurrence of events (e.g., traps, replays, etc.) plus cycle counters
— limitation: not all sources of delay are counted; attribution is sometimes unintuitive
34
Types of Performance Events
• Program characterization
—Attributes independent of the processor’s implementation
– number and types of instructions, e.g. load/store/branch/FLOP
• Memory hierarchy utilization
—Cache and TLB events
—Memory access concurrency
• Execution pipeline stalls
—Analyze instruction flow through the execution pipeline
—Identify hazards
– conflicts that prevent load/store reordering
• Branch prediction
—Count mispredicts to identify hazards for pipeline stalls
• Resource utilization
—Number of cycles spent using a floating point divider
35
Performance Monitoring Hardware
• Event detectors signal
—Raw events
—Qualified events, qualified by:
– hazard description: MOB_load_replay: +NO_STA, +NO_STD, +UNALGN_ADDR, …
– type specifier: page_walk_type: +DTMISS, +ITMISS
– cache response: ITLB_reference: +HIT, +MISS
– cache line-specific state: BSQ_cache_reference_RD: +HITS, +HITE, +HITM, +MISS
– branch type: retired_mispred_branch: +COND, +CALL, +RET, +INDIR
– floating point assist type: x87_assist: +FPSU, +FPSO, +POAU, +PREA
– …
• Event counters
36
Counting Processor Events
Three ways to count
• Condition count
—Number of cycles in which a condition is true/false
– e.g. the number of cycles in which the pipeline was stalled
• Condition edge count
—Number of cycles in which the condition changed
– e.g. the number of cycles in which a pipeline stall began
• Thresholded count
—Useful for events that report more than 1 count per cycle
– e.g. # of instructions completing in a superscalar processor
– e.g. count # of cycles in which 3 or more instructions complete
37
Counting Events with Calipers
• Augment code with
—Start counter
—Read counter
—Stop counter
• Strengths
—Measure exactly what you want, anywhere you want
—Can be used to guide run-time adaptation
• Weaknesses
—Typically requires manual insertion of counters
—Monitoring multiple nested scopes can be problematic
—Perturbation can be severe with calipers in inner loops
– cost of monitoring
– interference with compiler optimization
38
Profiling
• Allocate a histogram: entry for each “block” of instructions
• Initialize: bucket[*] = 0, counter = –threshold
• At each event: counter++
• At each interrupt: bucket[PC]++, counter = –threshold
[Figure: a fragment of x87 assembly (the “program”) beside a PC histogram; when the counter interrupt occurs, the histogram bucket for the current PC is incremented.]
39
Types of Profiling
• Time-based profiling
—Initialize a periodic timer to interrupt execution every t seconds
—Service the interrupt at regular time intervals
• Event-based profiling
—Initialize an event counter to interrupt execution
—Interrupt every time a counter for a particular event type reaches a specified threshold
– e.g. sample the PC every 16K floating point operations
• Instruction sampling (Alpha only)
—Initialize the instruction sampling frequency
—At periodic intervals, trace the execution of the next instruction as it progresses through the execution pipeline
40
Benefits of Profiling
• Key benefits
—Provides perspective without instrumenting your program
—Does not depend on a preconceived model of where problems lie
– often problems are in unexpected parts of the code: floating point underflows handled by software, Fortran 90 sections passed by copy, instruction cache misses, …
—Focuses attention on costly parts of the code
• Secondary benefits
—Supports multiple languages, programming models
—Requires little overhead with an appropriate sampling frequency
41
Process vs. System Profiling
• Process profiling
—Sample events while the process is executing
—Benefits
– no security concerns
—Weaknesses
– can miss critical information affecting overall execution performance
• System-level profiling
—Sample events across all applications and the operating system
—Benefits
– capture all behavior on the system
– understand the impact of external events on the process: system processes interfere with execution; NFS activity delays a process expected at a barrier
42
Key Performance Counter Metrics
• Cycles: PAPI_TOT_CYC
• Memory hierarchy
—TLB misses: PAPI_TLB_TL (Total), PAPI_TLB_DM (Data), PAPI_TLB_IM (Instruction)
—Cache misses: PAPI_Lk_TCM (Total), PAPI_Lk_DCM (Data), PAPI_Lk_ICM (Instruction), PAPI_Lk_LDM (Load), PAPI_Lk_STM (Store), for k in [1 .. number of cache levels]
• Pipeline stalls: PAPI_STL_ICY (No-issue cycles), PAPI_STL_CCY (No-completion cycles), PAPI_RES_STL (Resource stall cycles), PAPI_FP_STAL (FP stall cycles)
• Branches: PAPI_BR_MSP (Mispredicted branches)
• Instructions: PAPI_TOT_INS (Completed), PAPI_TOT_IIS (Issued)
—Loads and stores: PAPI_LD_INS (Load), PAPI_SR_INS (Store)
—Floating point operations: PAPI_FP_OPS
• Events for shared-memory parallel codes: PAPI_CA_SNP (Snoop requests), PAPI_CA_INV (Invalidations)
43
Cycles vs. Wall Clock Time
• Wall clock time is an absolute measure of effort
• Cycles is a measure of time spent using the CPU
—Usually counts only in “user mode”
—May miss time spent in the operating system on your behalf
—Definitely misses time spent waiting
– for other processes
– for communication
• Whenever possible, compare wall clock time to cycles!
44
Useful Derived Metrics
• Processor utilization for this process: cycles/(wall clock time)
• Memory operations: load count + store count
• Instructions per memory operation: (graduated instructions)/(load count + store count)
• Avg number of loads per load miss (analogous metric for stores): (load count)/(load miss count)
• Avg number of memory operations per Lk miss: (load count + store count)/(Lk load miss count + Lk store miss count)
• Lk cache miss rate: (Lk load miss count + Lk store miss count)/(load count + store count)
• Branch mispredict percentage: 100 * (branch mispredictions)/(total branches)
• Instructions per cycle: (graduated instructions)/cycles
45
Derived Metrics for Memory Hierarchy
• TLB misses per cycle: (data TLB misses + instruction TLB misses)/cycles
• Avg number of loads per TLB miss: (load count)/(TLB misses)
• Total Lk data cache accesses: Lk-1 load misses + Lk-1 store misses
• Accesses from Lk per cycle: (Lk-1 load misses + Lk-1 store misses)/cycles
• Lk traffic (analogously, memory traffic): (Lk-1 load misses + Lk-1 store misses) * (Lk-1 cache line size)
• Lk bandwidth consumed (analogously, memory bandwidth): (Lk traffic)/(wall clock time)
46
Session 1 c
HPCToolkit High-level Overview
47
HPCToolkit Goals
• Support large, multi-lingual applications
—a mix of Fortran, C, C++
—external libraries (possibly binary only)
—thousands of procedures, hundreds of thousands of lines
—we must avoid
– manual instrumentation
– significantly altering the build process
– frequent recompilation
• Collect execution measurements scalably and efficiently
—don’t excessively dilate or perturb execution
—avoid large trace files for long-running codes
• Support measurement and analysis of serial and parallel codes
• Present analysis results effectively
—top-down analysis to cope with complex programs
—intuitive enough for physicists and engineers to use
—detailed enough to meet the needs of compiler writers
• Support a wide range of computer platforms
48
HPCToolkit Overview
• Use event-based sampling to profile unmodified applications: hpcrun (Linux)
• Interpret program counter histograms: hpcprof (Linux), xprof (Alpha+Tru64 only)
• Recover program structure, including loop nesting: bloop
• Correlate source code, structure and performance metrics: hpcview, hpcquick
• Explore and analyze performance databases: hpcviewer
49
HPCToolkit System Workflow
[Workflow: application source is compiled and linked into binary object code; profiling an execution produces performance profiles, while binary analysis recovers program structure; profile interpretation and source correlation combine these into a hyperlinked database for interactive analysis.]
http://www.hipersoft.rice.edu/hpctoolkit/
50
HPCToolkit System Workflow
—launch unmodified, optimized application binaries
—collect statistical profiles of events of interest
51
HPCToolkit System Workflow
—decode instructions and combine with profile data
52
HPCToolkit System Workflow
—extract loop nesting information from executables
53
HPCToolkit System Workflow
—synthesize new metrics by combining metrics
—relate metrics, structure, and program source
54
HPCToolkit System Workflow
—support top-down analysis with interactive viewer
—analyze results anytime, anywhere
55
HPCToolkit: Supported Platforms
• Linux—IA32: Pentium 3, Pentium 4, Athlon
—Opteron
—Itanium
• Tru64—Alpha
• Irix—MIPS R10000, R12000
56
Session 2a
Using HPCToolkit on Linux
http://www.hipersoft.rice.edu/hpctoolkit
57
HPCToolkit Workflow
58
hpcrun*
Capabilities
• Monitor unmodified, dynamically-linked application binaries
• Collect event-based program counter histograms
• Monitor multiple events during a single execution

Data collected for each (load module, event type) pair:
• Load module attributes: name, start address, length
• Event name
• Execution measurements: a sparse histogram of PC values for which the event counter overflowed

*Only for use on Linux at present
59
hpcrun in Action
% ls
README
% hpcrun ls
ls.PAPI_TOT_CYC.master4.rtc.2866 README
Profile the default event (PAPI_TOT_CYC) for the ls command using the default period
Data file to which PC histogram is written for monitored event or events
Data file naming convention: command.eventname.nodename.pid
60
hpcrun: Common Options
-h              print help information
-L              list events that can be monitored
                — PAPI standard events: name, derived = yes/no, event description
                — native processor events: name
-e <event>[:<period>]
                event and sample period for profiling; use multiple -e options for multiple events
-o <outpath>    optional directory for output data
--              subsequent arguments aren’t hpcrun options
61
A Running Example: sample.c
#define N (1 << 23)
#define T (10)
#include <string.h>
double a[N],b[N];
void cleara(double a[N]) {
  int i;
  for (i = 0; i < N; i++) {
    a[i] = 0;
  }
}
int main() {
  double s=0,s2=0; int i,j;
  for (j = 0; j < T; j++) {
    for (i = 0; i < N; i++) {
      b[i] = 0;
    }
    cleara(a);
    memset(a,0,sizeof(a));
    for (i = 0; i < N; i++) {
      s += a[i] * b[i];
      s2 += a[i] * a[i] + b[i] * b[i];
    }
  }
  printf("s %f s2 %f\n",s,s2);
}
• 2 user functions
• 2 calls to library functions
• main has imperfectly nested loops with function calls
62
Using hpcrun to Collect a Performance Profile
% hpcrun -e PAPI_TOT_CYC:499997 -e PAPI_L1_LDM \
-e PAPI_FP_INS -e PAPI_TOT_INS sample
Profile “total cycles” with period 499997; profile “L1 data cache load misses,” “floating point instructions,” and “total instructions” with their default periods.
Running this command on a machine “eddie” produced a data file sample.PAPI_TOT_CYC-etc.eddie.1736
63
HPCToolkit Workflow
64
hpcprof*
Capabilities
• Read program counter histograms gathered by hpcrun
• Read debugging information in the executable
• Use debugging information to map PC samples to
—Load module
—File
—Function
—Source line
• Output data in ASCII for human consumption
• Output data in XML for consumption by other tools

*Only for use on Linux at present
65
Using hpcprof to Interpret a Profile
% hpcprof -e sample sample.PAPI_TOT_CYC-etc.eddie.1736
Columns correspond to the following events [event:period]
PAPI_TOT_CYC:32767 - Total cycles (264211 samples)
PAPI_TOT_INS:32767 - Instructions completed (61551 samples)
PAPI_FP_INS:32767 - Float point instructions (15355 samples)
PAPI_L1_LDM:32767 - Level 1 load misses (6600 samples)
Load Module Summary:
85.5% 100.0% 100.0% 99.9% /tmp/sample
14.5% 0.0% 0.0% 0.1% /lib/libc-2.3.3.so
File Summary:
85.5% 100.0% 100.0% 99.9% <</tmp/sample>>/tmp/sample.c
14.5% 0.0% 0.0% 0.1% <</lib/libc-2.3.3.so>><unknown>
Continued …
66
hpcprof: Function and Line Output
Function Summary:
70.8% 83.3% 100.0% 99.7% <</tmp/sample>>main
14.7% 16.7% 0.0% 0.2% <</tmp/sample>>cleara
14.5% 0.0% 0.0% 0.1% <</lib/libc-2.3.3.so>>memset
Line Summary:
37.2% 35.0% 55.9% 51.4% <</tmp/sample>>/tmp/sample.c:20
13.4% 18.3% 30.2% 34.0% <</tmp/sample>>/tmp/sample.c:21
5.6% 9.3% 13.9% 14.2% <</tmp/sample>>/tmp/sample.c:19
11.6% 17.4% 0.0% 0.2% <</tmp/sample>>/tmp/sample.c:15
8.8% 10.5% 0.0% 0.1% <</tmp/sample>>/tmp/sample.c:8
14.5% 0.0% 0.0% 0.1% <</lib/libc-2.3.3.so>><unknown>:0
5.9% 6.1% 0.0% 0.0% <</tmp/sample>>/tmp/sample.c:7
3.1% 3.4% 0.0% 0.0% <</tmp/sample>>/tmp/sample.c:14
Continued …
67
hpcprof: Annotated Source File Output
File <</tmp/sample>>/tmp/sample.c with profile annotations.
  1                               #define N (1 << 23)
  2                               #define T (10)
  3                               #include <string.h>
  4                               double a[N],b[N];
  5                               void cleara(double a[N]) {
  6                                 int i;
  7  14.5% 16.7%  0.0%  0.2%        for (i = 0; i < N; i++) {
  8                                   a[i] = 0;
  9                                 }
 10                               }
 11                               int main() {
 12                                 double s=0,s2=0; int i,j;
 13                                 for (j = 0; j < T; j++) {
 14   9.0% 14.6%  0.0%  0.1%         for (i = 0; i < N; i++) {
 15   5.6%  6.2%  0.0%  0.1%           b[i] = 0;
 16                                   }
 17                                   cleara(a);
 18                                   memset(a,0,sizeof(a));
 19   3.6%  5.8% 10.6%  9.2%         for (i = 0; i < N; i++) {
 20  36.2% 32.9% 53.2% 49.1%           s += a[i]*b[i];
 21  16.7% 23.8% 36.2% 41.2%           s2 += a[i]*a[i]+b[i]*b[i];
 22                                   }
 23                                 }
 24                                 printf("s %d s2 %d\n",s,s2);
 25                               }
68
hpcprof: Common Options
-h , --help print help information
-d, --directory dir Search dir for source files.
-D, --recursive-directory dir Search dir recursively for source files.
Text mode [Default]
-e, --everything Show all information.
-f, --files Show all files.
-r, --funcs Show all functions.
-l, --lines Show all lines.
-a, --annotate file Annotate file.
-n, --number Show number of samples (not %).
-s, --show thresh Set threshold for showing aggregate data.
PROFILE mode
-p, --profile Dump PROFILE output (for use with hpcview).
69
HPCToolkit Workflow
[Workflow diagram: application source -> (compilation, linking) -> binary object code; binary object code -> profile execution -> performance profile; binary object code -> binary analysis -> program structure; performance profile + program structure -> interpret profile / source correlation -> hyperlinked database -> interactive analysis]
70
Why Binary Analysis?
Issues
• Line-level statistics offer a myopic view of performance—Example: loads on one line, arithmetic on another.
• Event detection “skid” makes line-level metrics inaccurate
• Optimizing compilers rearrange and interleave code.
• Source-code analysis is language-dependent, compiler- (and compiler-version-) dependent.
• For scientific programs, loop level is the unit of optimization and is where the action is
Approach
• Recover loop information from application binaries
71
bloop
Extract loop structure from binary modules (executables, dynamic libraries)
Approach
• Parse instructions in an application binary
• Identify branches and aggregate instructions into basic blocks
• Construct control flow graph using branch target analysis
• Use interval analysis to identify natural loop nests
• Map instructions to source lines with symbol table information—Quality of results depends on quality of symbol table information
• Normalize results to recover source-level view of optimized code
• Elide information about source lines NOT in loops
• Output program structure in XML for consumption by hpcview
Platforms: Alpha+Tru64, MIPS+IRIX, Linux+IA64, Linux+IA32, Solaris+SPARC
72
Sample Flow Graph from an Executable
Loop nesting structure—blue: outermost level
—red: loop level 1
—green: loop level 2
Observation: optimization complicates program structure!
73
Normalizing Program Structure
Coalesce multiple instances of lines
Repeat
(1) If the same line appears in different loops– find least common ancestor in scope tree; merge
corresponding loops along the paths to each duplicate
purpose: re-roll loops that have been split
(2) If the same line appears at multiple loop levels– discard all but the innermost instance
purpose: account for loop-invariant code motion
Until a fixed point is reached
Constraint: each source line may appear at most once
74
bloop in Action
% bloop sample > sample.bloop
<LM n="sample" version="4.0">
  <F n="/tmp/sample.c">
    <P n="cleara" b="5" e="10">
      <L b="7" e="8">
        <S b="7" e="7"/>
        <S b="8" e="8"/>
      </L>
    </P>
    <P n="main" b="11" e="25">
      <L b="13" e="21">
        <S b="13" e="13"/>
        <L b="14" e="15">
          <S b="14" e="14"/>
          <S b="15" e="15"/>
        </L>
        <S b="17" e="17"/>
        <S b="18" e="18"/>
        <L b="19" e="21">
          <S b="19" e="19"/>
          <S b="20" e="20"/>
          <S b="21" e="21"/>
        </L>
      </L>
    </P>
  </F>
</LM>
Load Module
File
Statement
Loop
Procedure
75
HPCToolkit Workflow
[Workflow diagram: application source -> (compilation, linking) -> binary object code; binary object code -> profile execution -> performance profile; binary object code -> binary analysis -> program structure; performance profile + program structure -> interpret profile / source correlation -> hyperlinked database -> interactive analysis]
76
Data Correlation
Motivation
• Need to analyze data from multiple experiments together—Different architectures—Different input data sets—Different metrics: any one metric provides a myopic view
– some measure potential causes (e.g. cache misses)– some measure effects (e.g. cycles)– e.g., cache misses are only a problem if latency not hidden
• Need to analyze metrics in the context of the source code
• Need to synthesize derived metrics, e.g. cache miss rate—Synthetic metrics provide needed information—Automation eliminates mental arithmetic—Serve as a key for sorting
77
hpcview and hpcquick
• hpcview correlates —Metrics expressed as PROFILE XML documents
– Produced on Linux with hpcprof
—Program structure expressed as XML documents– Produced by bloop
—Program source code in any language
• hpcview produces a database containing both performance data and source code for exploration with hpcviewer—Aggregates metrics up the program structure tree
• hpcquick is a script that simplifies hpcview usage. —Runs hpcprof on any raw profile data files
—Synthesizes a configuration file for hpcview
—Runs hpcview
—Writes an executable script that will reconstruct its actions
78
hpcquick in Action
% hpcquick -S sample.bloop \ -I . \ -P sample.PAPI_TOT_CYC-etc.eddie.1736
Program structure PROFILE XML file
List of paths to directories containing source code
List of metric PROFILE XML files
79
HPCToolkit System Overview
[Workflow diagram: application source -> (compilation, linking) -> binary object code; binary object code -> profile execution -> performance profile; binary object code -> binary analysis -> program structure; performance profile + program structure -> interpret profile / source correlation -> hyperlinked database -> interactive analysis]
80
hpcviewer
Tool for interactive top-down analysis of performance data
• Execution metric data plus program source code—Multiple metrics from one or more experiments
• Exploration is guided by rank-ordered table of metrics—Can sort on any metric in ascending or descending order
• Hierarchical view of execution metrics—Load modules
– Executable, shared libraries, operating system, other programs—Files—Procedures—Loops—Source lines
hpcviewer is written in Java and runs anywhere: your cluster, your desktop, your laptop
81
hpcviewer Screenshot
MetricsNavigation
Annotated Source View
82
hpcviewer Browser Actions
• Open additional experiments from file menu
• Navigate from source code to metrics—Show metric info (for lines and scopes with metrics)
• Browsing hierarchical experiment data
—Open : reveal metric data of children
—Close : hide metric data of children
—Zoom : focus analysis on a particular subtree of program data
—Unzoom : undo previous zoom operation
—Flatten : melt away a level of hierarchy to view children of
current root or siblings as peers
—Unflatten : undo previous flatten at current tree location
—Navigate from scope to source: select scope descriptive text
83
Flattening for Top Down Analysis
• Problem —strict hierarchical view of a program is too rigid
—want to compare program components at the same level as peers
• Solution —enable a scope’s descendants to be flattened to compare their
children as peers
flatten
Current scope
unflatten
84
Flattening in Action
Flattened 3 levels:
—load modules and files are hidden
—functions with loops are hidden
—leaf functions are shown
85
Summary of HPCToolkit Functionality
• Top down analysis focuses attention where it belongs —sorted views put the important things first
• Integrated browsing interface facilitates exploration—rich network of connections makes navigation simple
• Hierarchical, loop-level reporting facilitates analysis—more sensible view when statement-level data is imprecise
• Binary analysis handles multi-lingual applications and libraries—succeeds where language and compiler based tools can’t
• Sample-based profiling, aggregation and derived metrics—reduce manual effort in analysis and tuning cycle
• Multiple metrics provide a better picture of performance
• Multi-platform data collection
• Platform independent analysis tool
86
Session 2 b
Case Studies
87
A Typical “First Look” at a Big Program
• Let’s look at an execution of LANL’s POP 2.01 code
• In one execution we measured— Cycles— Instructions— Floating Point Instructions— Cache Misses at some level (L1, L2, L3)
• We computed derived metrics— Cycles per Instruction (CPI)— Cycles - (instructions * Peak CPI)— Flops per cycle— Misses per Flop
• Screen shot for handout
• Demo for live tutorial
Purpose: Gain familiarity with the viewer on a moderately-large code; rapidly gain familiarity with its performance issues.
88
LANL POP 2.0.1
(Live Demo In Tutorial)
89
“Optimized” Matrix-Matrix Multiply
• The demo program mmdemo.c provides 10 variants of the classic n-cubed matrix-matrix multiply algorithm.— Collects common tricks used by programmers and compilers
— Variants differ in the order they perform operations and access memory
— Performance and underlying phenomena vary widely– Differentiates processor implementations
– Illustrates interactions between code and compilers
– Exercises instrumentation, tool chain, and analyst.
• Goals
1. Gain intuition via cross-architecture/cross-compiler comparison.
2. Delve into questions of what hardware counters really tell us
3. Explore HPCToolkit/PAPI capabilities on different architectures
90
Naïve and “Simple” Variants
void naive_mmult()
{
  int i, j, k;
  for (i = 0; i < MSIZE; i++) {
    for (j = 0; j < MSIZE; j++) {
      c[i][j] = 0;
      for (k = 0; k < MSIZE; k++)
        c[i][j] = c[i][j] + a[i][k] * b[k][j];
    }
  }
}

void simple_mmult()
{
  int i, j, k;
  for (i = 0; i < MSIZE; i++) {
    for (j = 0; j < MSIZE; j++) {
      MTYPE tmp = 0;
      for (k = 0; k < MSIZE; k++)
        tmp = tmp + a[i][k] * b[k][j];
      c[i][j] = tmp;
    }
  }
}
Simple is like naïve, but accumulates c[i][j] in a temporary scalar. This is something we would expect any optimizing compiler to do automatically if there is no aliasing of c with a or b.
91
Blocked and Unrolled Variant

void rmm_mmult()
{
  int i, j, k;
  int ii, ilimit, jj, jlimit, kk, klimit;
  for (j = 0; j < MS; j += XSIZE) {
    for (i = 0; i < MS; i += XSIZE) {
      ilimit = (i + XSIZE) < MS ? i + XSIZE : MS;
      jlimit = (j + XSIZE) < MS ? j + XSIZE : MS;
      for (ii = i; ii < ilimit; ii++)
        for (jj = j; jj < jlimit; jj++) c[ii][jj] = 0;
      for (k = 0; k < MS; k += ZSIZE) {
        klimit = (k + ZSIZE) < MS ? k + ZSIZE : MS;
        for (ii = i; ii < ilimit; ii++) {
          for (jj = j; jj < jlimit; jj += 8) {
            MTYPE tmp0 = c[ii][jj];
            MTYPE tmp1 = c[ii][jj+1];
            MTYPE tmp2 = c[ii][jj+2];
            MTYPE tmp3 = c[ii][jj+3];
            MTYPE tmp4 = c[ii][jj+4];
            MTYPE tmp5 = c[ii][jj+5];
            MTYPE tmp6 = c[ii][jj+6];
            MTYPE tmp7 = c[ii][jj+7];
            for (kk = k; kk < klimit; kk++) {
              tmp0 = tmp0 + a[ii][kk] * b[kk][jj];
              tmp1 = tmp1 + a[ii][kk] * b[kk][jj+1];
              tmp2 = tmp2 + a[ii][kk] * b[kk][jj+2];
              tmp3 = tmp3 + a[ii][kk] * b[kk][jj+3];
              tmp4 = tmp4 + a[ii][kk] * b[kk][jj+4];
              tmp5 = tmp5 + a[ii][kk] * b[kk][jj+5];
              tmp6 = tmp6 + a[ii][kk] * b[kk][jj+6];
              tmp7 = tmp7 + a[ii][kk] * b[kk][jj+7];
            }
            c[ii][jj]   = tmp0;
            c[ii][jj+1] = tmp1;
            c[ii][jj+2] = tmp2;
            c[ii][jj+3] = tmp3;
            c[ii][jj+4] = tmp4;
            c[ii][jj+5] = tmp5;
            c[ii][jj+6] = tmp6;
            c[ii][jj+7] = tmp7;
          }
        }
      }
    }
  }
}

(assumes XSIZE mod 8 == 0)
92
Matrix Multiply: Summary
1000 x 1000 dense matrix multiply of doubles (seconds)

Columns:
  A: 900MHz Itanium 2, icc 8.0 -O3 no-alias (icc 8.1 in parentheses)
  B: 900MHz Itanium 2, gcc 3.3.3 strict-aliasing
  C: 3.0GHz Pentium 4, icc 8.1 -O3 noalias
  D: 3.0GHz Pentium 4, gcc 3.3.4 (strict-alias, align, unroll)
  E: 3.0GHz Pentium 4, gcc 3.3.4 (same flags) + -sse3
  F: 1.6GHz K8, PC2700 ecc, gcc 3.3.3 -funroll-loops

                     A               B        C      D      E       F
Naïve            1.450 (0.946)   41.738   9.463  9.618  9.554  18.127
Simple           13.90 (13.94)   37.998   9.480  9.449  9.523  17.902
R-Naive          1.45  (1.592)   25.254   2.964  2.977  3.116   5.843
Matrix-vector    13.71 (13.67)   27.809   7.520  7.505  7.547  17.450
Block mat-mat    4.34  (3.89)    18.234   2.075  2.041  2.034   3.053
Lam et al. rev.  1.44  (1.49)    25.252   2.945  2.918  3.002   5.925
Lam mat-vec      10.92 (10.70)   20.179   2.253  2.882  4.414   3.957
Lam unroll*2     5.28  (1.89)    19.681   1.877  2.153  3.765   3.727
4-way unroll     10.57 (4.05)    23.729   5.821  5.554  7.144   6.768
Blk and unroll   1.18  (0.845)    4.980   1.577  1.488  4.131   1.709
93
Matrix Multiply Notes and Questions
• No attempt made to tune specifically for these architectures.—Further opportunities for optimization?
• Naïve vs simple on Itanium – What’s going on?—Pattern recognition?—Analysis/code generation sensitivity to small changes?— tmp introduces a data dependence hazard?
• Use of SSE instructions?—Little advantage in these examples on P4—SSE vectorization vs. unrolling?
• Sources of inefficiencies?— Memory Hierarchy Components: TLB, Cache, Write buffers— Instruction level parallelism, latency hiding
• Opteron Issues—Performance using other compilers? Other variants?—Performance with faster clocks?
(Revisit this example in the exercises)
94
Matrix Multiply: Itanium-2 + icc
(Live Demo In Tutorial)
95
Matrix Multiply: Pentium 4 + icc
One level of unroll-and-jam applied to b_mmult eliminates write buffer stalls, reduces and hides L2 misses.
(Live Demo In Tutorial, also for Opteron)
96
What Events Can Be Measured?
• Itanium, Pentium 4, Opteron all support a rich set of events
• PAPI tries to present a common set—Architectural differences prevent exact semantic alignment
—Unique resources (L3 cache, Multi-level TLB, etc.) introduce differentiation
• Unique measurement resources—Itanium: Direct measurement of pipeline bubbles
– Enables a “decision tree” approach to diagnosis.
—Opteron: Extensive measurements of “system” (e.g., bus) events– Probably mostly of interest to system architects/implementers
—Pentium 4: Base events + tags– Probably more detail than you want/need
– Speculative instructions and replay
97
Typical Uses for HPCToolkit
• Identifying unproductive work—where is the program spending its time not performing FLOPS
• Memory hierarchy issues—bandwidth utilization: misses x line size/cycles
—exposed latency: ideal vs. measured
• Cross architecture or compiler comparisons—what program features cause performance differences?
• Gap between peak and observed performance—loop balance vs. machine balance?
• Evaluating load balance in a parallelized code—how do profiles for different processes compare
98
Session 3a
Exercises
See WWW page
http://www.hipersoft.rice.edu/hpctoolkit/sc04
99
Exercise 1
Using hpcviewer
1. Download and unpack exercise1.tar
2. Change to the exercise1 directory
3. Run hpcviewer
4. Open the performance database for the sample program in hpcview.output/hpcquick
5. Explore the program and performance metrics with hpcviewer
— open and close scopes in the navigation pane tree view
— sort in ascending and descending order by different metrics
— familiarize yourself with the flattening, zooming, unflattening, and unzooming capabilities of the browser
6. Analyzing the performance
— what load module has the most time spent in it?
— which is the most costly routine in the program?
— which is the most costly loop in the program?
100
Exercise 2
Top down analysis of a big program: WRF
1. Download and unpack exercise2.tar
2. Change to the exercise2 directory
3. Run hpcviewer
4. Open the performance database for the sample program in hpcview.output/hpcquick
5. Explore the program and performance metrics with hpcviewer
6. Where does the program spend its time?
7. Is it inefficient?
8. What are the main performance issues?
9. How might one improve the program efficiency?
101
Exercise 3
Comparing memcpy implementations
1. Download and unpack exercise3.tar
2. Change to the exercise3 directory
3. Look through the source files main.c, ccopy.c and fcopy.f90
— study the copy loops in ccopy.c and fcopy.f90
— look at how they and the C library’s memcpy are invoked identically in the main routine
— since they are all called identically, do you expect the same amount of time to be spent in each memcpy variant?
4. Compile them using the Makefile provided
5. Launch the program with hpcrun to profile cycles and total instructions using PAPI counters. Examine the profile results in ASCII with hpcprof -e.
6. What surprises did you find? Try to explain the results by collecting data using other event counters.
— feel free to ask for suggestions
102
Session 3 b
Using HPCToolkit in Practice
103
Preparing your Code for Analysis
HPCToolkit tools prefer line mapping information in executables
• hpcprof uses it to map PC samples to source lines
• bloop uses it to recover source lines in loop nests
Compilers may not record line mapping by default
Is line mapping information present in your executable?
• Use readelf --debug-dump=line <elf-executable>— readelf is part of the binutils tools
Getting line mapping information in your compiled code
• Intel icc, ifort compiler: use -g with explicit optimization flags
• gcc/g++/g77: add -g to any optimization flags
• Strategies for other compilers are similar
104
hpcrun Limitations (Part 1)
Not all PAPI events can be sampled with hpcrun
Cause—PAPI derived metrics combine 2 or more native events
—PAPI does not support sampling of derived metrics
Diagnosis—hpcrun will issue an error if invoked to sample a derived metric
Solution—Sample the constituent native events with hpcrun— hpcview can synthesize the derived metric from native metrics
105
hpcrun Limitations (Part 2)
Not all events can be collected in a single execution
• Causes—Processor does not have enough total hardware counters
—Multiple event detectors need to use the same counter
– For example, on Pentium 4
X87: count all x87 instructions
SSE_SP: count all unpacked SSE single precision instructions
SSE_DP: count all unpacked SSE double precision instructions
– The event register design lets only 2 of the 3 be counted at once
• Solution—With trial and error, find out which events conflict
– Use “hpcrun -e e1 -e e2 … /bin/ls” for this purpose
—Collect performance metrics for your application using multiple runs
– Sufficiently small number of non-conflicting events in each run
—Use hpcview to correlate the data from multiple runs with source code
106
Informing HPCToolkit about Source Code
HPCToolkit needs to know where to find source code (yours or library) when building performance databases
• When using hpcquick, specify one or more paths to use as roots for relative paths using -I dir1 dir2 ... dirN
• A root for relative paths can be specified to hpcview as —An absolute path
– e.g. /usr/local/mpich-1.2.6/src/pt2pt
—A relative path from the directory in which hpcview is launched– e.g. ../src
—A trailing “/*” wildcard specifies an entire directory tree. – e.g. “/usr/local/mpich-1.2.6/src/*”
– e.g. “../mpich-1.2.6/src/*”
Note: wildcard paths must be enclosed in quotation marks or have a backslash in front of the * to keep your shell from expanding it
107
Under the Hood with hpcview
• hpcquick provides a simple interface for
—invoking hpcprof to transform performance data into XML PROFILE format
—building an XML document that serves as an hpcview configuration file
—invoking hpcview
• hpcquick does not expose all of the capabilities of hpcview
• hpcview also supports—Adding titles to hpcviewer performance displays
—Replacing paths to force proper correlation with source
—Computation of derived metrics
—Fine-tuning displays by omitting display of intermediate metrics
108
<HPCVIEW>
  <TITLE name="POP 4-way shmem, model_size=medium" />
  <PATH name="." />
  <PATH name="./compile" />
  <PATH name="../sshmem" />
  <PATH name="../source" />
  <METRIC name="pcc" displayName="Cycles">
    <FILE name="pop.fcy_hwc.pxml" />
  </METRIC>
  <METRIC name="dc" displayName="L1 miss">
    <FILE name="pop.fdc_hwc.pxml" />
  </METRIC>
  <METRIC name="dsc" displayName="L2 miss">
    <FILE name="pop.fdsc_hwc.pxml" />
  </METRIC>
  <METRIC name="fp" displayName="FP insts">
    <FILE name="pop.fgfp_hwc.pxml" />
  </METRIC>
  <METRIC name="rat" displayName="cy per FLOP">
    <COMPUTE>
      <math>
        <apply> <divide/> <ci>pcc</ci> <ci>fp</ci> </apply>
      </math>
    </COMPUTE>
  </METRIC>
</HPCVIEW>
hpcview Configuration File
Paths to relevant source directories
Heading on Display
Metrics defined by Platform Independent Profile Files
Expression for derived metric
109
Why are Some of my Source Files Missing?
My performance database built with hpcview does not contain all of my source files
Explanation 1
• hpcview databases are sparse
• They contain only source files for which samples are present
• If no samples exist for a particular file, it will be omitted—.h files are typically omitted
• This is a design decision. There is no workaround.
Explanation 2
• The performance and program structure data may contain invalid paths
• See next slides for detailed explanation and solutions
110
How Can Invalid Paths Arise?
• bloop and hpcprof obtain file path name information from the application binary
• Path names encoded in the application binary may not be helpful—They may be relative to one of the directories in which the
application was compiled
—They may be relative to one of the directories in which a library was compiled
—They may refer to files using an inaccessible mount point, or one that is named differently– Files compiled on one node of a cluster may be inaccessible through
the same paths on another cluster node
111
Coping with Invalid File Paths
• Diagnosing invalid paths with hpcmissing— hpcmissing <xml-file1> … <xml-filen>
<xml-file*> may be output of bloop or XML output of hpcprof—For each missing file, hpcmissing will report
– The name of the missing file if the parent directory is present– The name of its parent directory if the parent directory is missing
• Correcting path problems in hpcview configuration files—if the executable binary contains the invalid file path
../../binutils/bfd/foo.c
and the corresponding source file can be found at
/scratch/tmp/binutils/bfd/foo.c
—Direct hpcview to perform a path prefix substitution with the directive
<REPLACE in="../../binutils" out="/scratch/tmp/binutils"/>
—In the configuration file, add REPLACE commands after any PATH commands and before any METRIC definitions
• See doc/dtd_hpcview.html for details
112
Working with Multiple Metrics in a Data File
• hpcrun can record multiple metrics in a single execution
• hpcprof will report multiple metrics in an XML output file
• Within the output file, metrics are known by two names— nativename, usually descriptive— shortname, usually a small integer
• To import data from a file with multiple metrics, use the select attribute
<METRIC name="dsc" displayName="L2 miss">
  <FILE name="pop.fdsc_hwc.pxml" select="1"/>
</METRIC>

The optional select attribute specifies the shortname of a metric in the file. It defaults to metric "0".
113
Computing Derived Metrics
COMPUTE metrics provide a way to compute derived metrics from other raw or computed metrics
Operators: <plus/>, <minus/>, <divide/>, <power/>, <times/>, <max/>, <min/>
<METRIC name="acc" displayName="L1 ACCESS">
  <FILE name="pop.fcy_hwc.pxml" />
</METRIC>
<METRIC name="miss" displayName="L1 MISS">
  <FILE name="pop.fgfp_hwc.pxml" />
</METRIC>
<METRIC name="l1m" displayName="L1 MISS RATE">
  <COMPUTE>
    <math>
      <apply> <divide/> <ci>miss</ci> <ci>acc</ci> </apply>
    </math>
  </COMPUTE>
</METRIC>
Internal name for use in expressions
Name for use in hpcviewer display
Raw metric from a file
computed metric
MathML formula delimiter
apply operator to terms
identifier
114
Complex Derived Metrics
• Where is the biggest unrealized opportunity for performing floating point operations?
• Let fpins be floating point instructions retired
• Let cyc be total cycles measured
• Let 2 be the maximum number of floating point operations that can retire per cycle
<METRIC name="uf" displayName="UNREALIZED FLOPS">
  <COMPUTE>
    <math>
      <apply>
        <minus/>
        <apply> <times/> <ci>cyc</ci> <cn>2</cn> </apply>
        <ci>fpins</ci>
      </apply>
    </math>
  </COMPUTE>
</METRIC>
Complex formulae nest operator applications
115
Find the Missing Cycles
<METRIC name="WALLCLK" displayName="WALLCLK" sortBy="true">
<FILE name="bm10dicc.WALLCLK.eddie.7911.hpcquick.pxml" select="0"/>
</METRIC>
<METRIC name="PAPI_TOT_CYC" displayName="PAPI_TOT_CYC">
<FILE name="bm10dicc.PAPI_TOT_CYC-etc.eddie.7921.hpcquick.pxml" select="0"/>
</METRIC>
<METRIC name="wall_cyc" displayName="WALL_CYC" percent="true">
<COMPUTE> <math>
<apply> <times/> <ci>WALLCLK</ci> <cn>300000</cn></apply>
</math> </COMPUTE>
</METRIC>
<METRIC name="missing" displayName="Missing Cycles" percent="true">
<COMPUTE> <math>
<apply> <minus/> <ci>wall_cyc</ci><ci>PAPI_TOT_CYC</ci></apply>
</math></COMPUTE>
</METRIC>
116
Avoiding Display Clutter
• hpcview’s configuration file lets you
—Suppress display of intermediate metrics
—Suppress display of percentages (irrelevant for ratios)
—Select the metric for initial display sorting using sortBy="true"
<METRIC name="acc" displayName="L1 ACCESS" display="false">
  <FILE name="pop.fcy_hwc.pxml" />
</METRIC>
<METRIC name="miss" displayName="L1 MISS" display="false">
  <FILE name="pop.fgfp_hwc.pxml" />
</METRIC>
<METRIC name="l1m" displayName="L1 MISS RATE" percent="false">
  <COMPUTE>
    <math>
      <apply> <divide/> <ci>miss</ci> <ci>acc</ci> </apply>
    </math>
  </COMPUTE>
</METRIC>
Raw
Computed
117
Lines for Loops and Procedures Are Imprecise
The line number reported for a loop or a procedure by hpcviewer does not match its true location in the source
• Cause—Inaccurate or missing line number information in symbol table
—No program structure information from bloop provided to hpcquick and hpcview
• Rationale—In the absence of program structure information from bloop, hpcview infers starting and ending lines of a procedure from the line numbers of PC samples within the procedure– Start line is reported as the first line with a sample
– End line is reported as the last line with a sample
118
Unknown File and Function Reports
Some costs are attributed to “unknown file” and “unknown function” in shared libraries
• Shared libraries have dispatch code that is used to call shared library entry points
• Instructions for this dispatch code are not mapped to any source file or function
• Samples that occur when calling shared library entry points through the dispatch table will be mapped to “unknown file” and “unknown function.”
119
Working with Stripped Binaries
• Source not available: only a summary/skeleton appears in the source pane
• Executable and/or shared libraries may have been stripped—Stripped executable: no symbols, except for those needed to
bind against shared libraries
—Stripped shared library: few symbols - just a record of externally visible function entry points
• hpcrun profiles stripped binaries like any others
• hpcprof reports brief profiles for stripped binaries—Stripped executable: samples mapped to
Load module, unknown file, unknown function, line 0
—Stripped shared library: samples mapped to Load module, unknown file, function, line 0
120
Trouble Spot: Multiple Instantiations
• Code that appears in one place is often instantiated elsewhere—Inlined functions
—Fortran statement functions
—Forward substitution of common subexpressions
• Symbol table line mapping information —Maps all instantiations of code only to its original source location
– Sensible choice for debugging, problematic for performance analysis
121
Current Problems with Multiple Instantiations
• hpcprof
—For sample at a PC: looks up PC’s source line in line map
—All metrics for any line are reported only in one place
—Evidence of multiple instantiations is lost, attribution is confusing
• bloop
—Lines from inlined functions can distort the range of statements that should be mapped to a function– If procedure A contains inlined code from procedure B, instructions
inside A’s procedure body will map to lines inside B
—All loops that contain instances of multiply instantiated code will be fused for presentation
122
Analyzing Parallel Executions
• hpcrun can be used when launching parallel jobs— e.g. mpirun -n 24 hpcrun -e PAPI_L1_DCM -- myjob myarg1 …
• hpcrun independently monitors each process—Records separate performance data file for each process
—Data file naming convention includes node name and PID
123
Working with Threads
hpcrun can be used to monitor threaded programs
Two modes
• -t each: collect data for each thread separately (default)
• -t all: collect data for all threads together
Works with posix threads only
124
Session 4a: Advanced Exercises
• Continue study of previous examples and exercises—Tune the mmdemo.c matrix multiplication example
—Compare mmdemo.c with libgoto and other BLAS implementations
• Analyze the Impact3D code on one processor—Study an Earth Simulator code on a small data size
• Analyze one or more of the NAS parallel benchmarks (MPI)—Examine how the MPI communication overhead scales as the
number of processors changes
• Apply HPCToolkit to your program
• Automate application of HPCToolkit tools—Scripts and Makefiles to integrate HPCToolkit tools with your
development process
125
Improving Program Performance
• Memory hierarchy: Improving spatial and temporal locality—Reduce cache and TLB misses with blocking
—Reduce loads and stores with scalar replacement
—Reduce the size of array temporaries
• Instruction cache—Reduce icache misses by distributing loops that are too large
• Execution pipeline—Compiler optimizations
– Unroll and software pipeline
—Hand transformations
126
Session 4 b
Installing HPCToolkit
127
Getting HPCToolkit
• Open source—Rice code under a BSD license
—Supporting code under LGPL and Apache licenses
• http://www.hipersoft.rice.edu/hpctoolkit
128
Installing HPCToolkit on Linux
• HPCToolkit is available only in source form
• Installing HPCToolkit means downloading and building it
• Downloading HPCToolkit requires—ssh
—cvs
—tar
• Building HPCToolkit requires—javac: Java compiler (1.4 or later for best results)
—C compiler (gcc suffices)
—C++ compiler (g++ suffices)
—gnumake
—autoconf and automake are helpful, but not required
129
Installation Components
• HPCToolkit—Profile collection, interpretation, program structure recovery,
source code and metric correlation, and interactive browser
• Xerces-C (use our version for better portability)—Apache XML parser infrastructure
• Binutils (use our version with enhanced symbol table support) —GNU software libraries and utilities for working with binary files
– Used to partially decode executables and interpret symbol tables
• OpenAnalysis—Rice toolkit for intermediate representation independent program
analysis– Used to construct control flow graphs for application binaries
• PAPI (download separately)—UTK performance counter API library
130
Downloading and Installing PAPI
• PAPI distributed separately from HPCToolkit—See http://icl.cs.utk.edu/papi/software
• Version 3+ required
• PAPI download choices—stable TAR file: lacks some recent fixes, but not changing daily
—CVS repository: up to date fixes, but issues may arise
• Instructions—Download PAPI using your method of choice
—Build using the appropriate platform makefile
—On IA32 platforms, kernel patching is necessary– The perfctr driver must be added to the kernel
– See http://www.hipersoft.rice.edu/hpctoolkit/papi for more info
—Verify it is working using self-diagnosis tests – look in ctests and ftests subdirectories
131
Downloading and Building HPCToolkit
• Download http://www.hipersoft.rice.edu/hpctoolkit/HPCToolkitRoot.tgz
• Unpack the starter tarball to create HPCToolkitRoot directory
• cd HPCToolkitRoot
• ./bin/updatesrc—obtains necessary sources from our CVS repository
• make -f Makefile.quick PAPI=<path_to_PAPI3_install>—Configures, builds and installs to HPCToolkit-XXX-Linux
– XXX is your processor type
— This self-contained installation can be moved anywhere
• Put the HPCToolkit-XXX-Linux/bin directory on your path
132
Other Performance Tuning Resources
• Open Source
—Oprofile http://oprofile.sourceforge.net/news/
—PAPI http://icl.cs.utk.edu/projects/papi
—Psrun http://perfsuite.ncsa.uiuc.edu
—Brink/Abyss http://www.eg.bucknell.edu/~bsprunt (P4+Linux)
• Commercial
—DCPI http://h30097.www3.hp.com/dcpi (Alpha+Tru64)
—Vtune (Intel processors+{Linux,Windows})
– http://www.developer.intel.com/software/products/vtune
—HP (HPUX only)
– Caliper http://www.hp.com/go/hpcaliper (Itanium+HP-UX)
– Prospect http://www.hp.com/go/prospect
—CHUD http://developer.apple.com/tools/performance (Mac OS X)
—IBM: HPMCOUNT
133
Resources (Continued)
• Parallel Performance Tools
—Open Source Tools
– Jumpshot http://www-unix.mcs.anl.gov/perfvis/software/viewers
– TAU http://www.cs.uoregon.edu/research/paracomp/tau/tautools
– SvPablo http://www-pablo.cs.uiuc.edu
—Commercial Tools
– Vampir http://www.pallas.com/e/products/vampir
• Tutorials/Compendia/Organizations
— NCSA Performance expedition http://www-unix.mcs.anl.gov/perf-expedition/tools.html
—SciDAC PERC http://perc.nersc.gov/main.htm
—Parallel Tools Consortium http://www.ptools.org/
• Manufacturers’ Optimization Guides/Manuals
—http://www.intel.com/design/pentium4/manuals/248966.htm
—http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF
— http://www.intel.com/design/itanium/downloads/245474.htm
134
Wrap Up
• Future Directions
• How to Contribute: Join the team!—There are many potential enhancements.
—Many hands make light work.
• Feedback
135
The End