John Mellor-CrummeyRobert Fowler Nathan Tallent
Department of Computer ScienceRice University
SC04 Tutorial Presentation
S05: Practical Application PerformanceAnalysis on Linux Systems
http://www.hipersoft.rice.edu/hpctoolkit/
2
Goals for this Tutorial
Understand
• Factors affecting performance on modern architectures
• What performance monitoring hardware does
• Performance monitoring capabilities on common processors
• Strategies for measuring application node performance
—Performance calipers
—Sample-based profiling
—How and when to use each
• Performance tool resources
• How to analyze application performance using HPCToolkit
—Collect sample-based profiles of performance metrics
—Explore and understand application performance
• How to install HPCToolkit on your own system
3
Schedule
• Session 1
— Overview
— The 45-minute Performance Expert
— Survey of Performance Instrumentation
• Session 2
— Introduction to HPCToolkit on Linux
— Case Studies
• Session 3
— Hands-on Exercises
— Using HPCToolkit In Practice
• Session 4
— Advanced Exercises: our codes or yours
— HPCToolkit Installation Overview
— Wrap up
4
Session 1 a
The 45-Minute Performance Expert
5
Analysis and Tuning Questions
• How can we tell if a program has good performance?
• How can we tell that it doesn’t?
• If performance is not “good,” how can we pinpoint where?
• How can we identify the causes?
• What can we do about it?
6
Peak vs. Realized Performance
Scientific applications often realize as little as 5-25% of peak on microprocessor-based systems
Reason: mismatch between application and architecture capabilities
—Architecture has insufficient bandwidth to main memory
– microprocessors often provide < 1 byte from memory per FLOP; scientific applications often need more
—Application has insufficient locality
– irregular accesses can squander memory bandwidth: use only part of each data block fetched from memory
– may not adequately reuse costly virtual memory address translations
—Exposed memory latency
– architecture: inadequate memory parallelism to hide latency
– application: not structured to exploit memory parallelism
—Instruction mix doesn’t match available functional units
Peak performance = guaranteed not to exceed
Realized performance = what you achieve
7
Performance Analysis and Tuning
• Increasingly necessary
—Gap between realized and peak performance is growing
• Increasingly hard
—Complex architectures are harder to program effectively
– complex processors: pipelining, out-of-order execution, VLIW
– complex memory hierarchy: multi-level non-blocking caches, TLB
—Optimizing compilers are critical to achieving high performance
– small program changes may have a large impact
—Modern scientific applications pose challenges for tools
– multi-lingual programs
– many source files
– complex build process
– external libraries in binary-only form
—Many tools don’t help you identify or solve your problem
8
Mead’s “Tall, Thin Man:” VLSI System Designer
Needed Expertise:
— Embedded algorithms
— System architecture
— Circuit design
— Layout
— Device Physics
— Manufacturability
— Testing
— System Integration
The “tall, thin man†” knows just enough, but is supported by staff and tools.
†An engineer who works at every level of integration from circuits to application software. -- Carver Mead
(Also see “lead programmer teams”, “concurrent engineering”)
9
The Tall, Thin Application Performance Expert
Needed Expertise and Tools:
—Science
—Modeling
—Numerical Methods
—Programming Languages
—Libraries
—Compilers
—Computer Architecture
—Performance Analysis
—Testing
—Etc.
Requires support staff and tools to augment expertise. Necessary tools are not the same ones used by the system designer!
Stands on the shoulders of the tall, thin system designer.
10
Tuning Applications
• Performance is an interaction between
—Numerical model
—Algorithms
—Problem formulation (as a program)
—Data structures
—System software
—Hardware
• Removing performance bottlenecks may require dependent adjustments to all
11
Performance = Resources = Time
Tprogram = Tcompute + Twait

Tcompute is the time the CPU thinks it is busy.
Twait is the time it is waiting for external devices/events, including other processes, the operating system, I/O, and communication.

Tcompute = Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)

— Instructions/Program: determined by model, algorithm, and data structures.
— Cycles/Instruction: determined by architecture and by the compiler’s ability to use the architecture efficiently for your program.
— Seconds/Cycle: determined by technology and by hardware design.
12
Microprocessor-based Architectures
• Instruction Level Parallelism (ILP): systems are not really serial
—Deeply pipelined processors
—Multiple functional units
• Processor taxonomy
—Out-of-order superscalar: Alpha, Pentium 4, Opteron
– hardware dynamically schedules instructions: determines dependences and dispatches instructions
– many instructions can be in flight at once
—VLIW: Itanium
– issues a fixed-size “bundle” of instructions each cycle
– bundles tailored to mix of available functional units
– compiler pre-determines which instructions execute in parallel
• Complex memory hierarchy
—Non-blocking, multi-level caches
—TLB (Translation Lookaside Buffer)
13
Pipelining 101
• Basic microprocessor pipeline (RISC circa 1983)
— Instruction fetch (IF)
— Instruction decode (ID)
— Execute (EX)
— Memory access (MEM)
— Writeback (WB)
• Each instruction takes 5 cycles (latency).
• One instruction can complete per cycle (theoretical peak throughput).
• Disruptions and replays
— On simple processors: bubbles and stalls
— Recent complex/dynamic processors use an abort/replay approach
[Figure: the five pipeline stages: Fetch, Decode, Execute, Memory, Write]
14
Limits to Pipelining
• Hazards: conditions that prevent the next instruction from being launched or (in speculative systems) completed
— Structural hazard: can’t use the same hardware to do two different things at the same time
— Data hazard: instruction depends on the result of a prior instruction still in the pipeline. (Also, an instruction tries to overwrite values still needed by other instructions.)
— Control hazard: delay between fetching control instructions and decisions about changes in control flow (branches and jumps)
• In the presence of a hazard, introduce delays (pipeline bubbles) until the hazard is resolved
• Deep or complex pipelines increase the cost of hazards
• External delays
— Cache and memory delays
— Address translation (TLB) misses
15
Pentium 4 (Super-Pipelined, Super-Scalar)
• Stages 1-2 — Trace cache next instruction pointer
• Stages 3-4 — Trace cache fetch
• Stage 5 — Drive (wire delay!)
• Stages 6-8 — Allocate and rename
• Stages 10-12 — Schedule instructions: memory / fast ALU / slow ALU & general FP / simple FP
• Stages 13-14 — Dispatch
• Stages 15-16 — Register access
• Stage 17 — Execute
• Stage 18 — Set flags
• Stage 19 — Check branches
• Stage 20 — Drive (more wire delay!)

Resources: 5 operations issued per clock; 1 load and 1 store unit; 2 simple/fast and 1 complex/slower integer units; 1 FP execute and 1 FP move unit. Up to 126 instructions in flight: 48 loads, 24 stores, …
16
Opteron Pipeline (Super-Pipelined, Super-Scalar)
Integer
1. Fetch1
2. Fetch2
3. Pick
4. Decode1
5. Decode2
6. Pack
7. Pack/Decode
8. Dispatch
9. Schedule
10. AGU/ALU
11. DC access
12. DC response
Floating Point
8. Dispatch
9. Stack rename
10. Register rename
11. Schedule write
12. Schedule
13. Register read
14. FX0
15. FX1
16. FX2
17. FX3
48. …
L2 Cache
13. L2 tag
14. …
15. L2 data
16. …
17. Route/multiplex ECC
18. …
19. Write DC/forward data

DRAM
14. Address to SAQ
15. Clock crossing
16. System request queue
17. …
26. Request DRAM pins
27. … (memory access)

Resources: fetch/decode 3 instructions/cycle; 3 integer units; 3 address units; 3 FPU/multimedia units; 2 loads/stores to d-cache per clock.
17
Itanium2 Pipeline (VLIW/EPIC)
1. Inst. Ptr. Generation and Fetch
2. Inst. “Rotation”
3. Decode, expand, and disperse
4. Rename and decode registers
5. Register file read
6. Execution/ Memory(Cache) 1/ FP1
7. Exception detection/ Memory(Cache) 2/ FP2
8. Write back/ Memory(Cache) 3/ FP3
9. Memory(Cache) 4/ FP4
Resources: six instructions (two bundles) issued per clock; 2 integer units; 4 memory units; 3 branch units; 2 floating-point units.
18
A Modern Memory Hierarchy (Itanium 2)
[Figure: Itanium 2 memory hierarchy: processor (with write buffers and separate I/D TLBs); 16K L1-I and 16K L1-D, 1 cycle; 256K L2, 5 cycles (6 for FP); 3M L3, 13.3 cycles (13.1 for FP); memory controller feeding two memory banks at ~209.6 ns; FP loads bypass the L1 caches. See http://www.devx.com/Intel/Article/20521]
19
Memory Hierarchy 101
• Translation Lookaside Buffer (TLB)
—Fast virtual memory map for a small number (~64) of pages
—Touching lots of pages → lots of TLB misses → expensive
• Load/Store queues
—Write latency and bandwidth are usually* not an issue
• Caches
—Data in a cache is organized as a set of 32-128 byte blocks
—Spatial locality: use all of the data in a block
—Temporal locality: reuse data at the same location
—Load/store operations access data in the level of cache closest to the processor in which the data is resident
– a load of data in the L1 cache does not cause traffic between L1 & L2
—Typically, data in a cache close to the processor is also resident in lower-level caches (inclusion)
—A miss in the Lk cache causes an access to Lk+1
• Parallelism
—Multi-ported caches
—Multiple memory accesses in flight
20
The Memory Wall: LMBENCH Latency Results

                      P4 3.0GHz      Dual K8 1.6GHz   Dual K8 1.6GHz   McKinley 900MHz x2   P4 2GHz
                      i865PERL       Tyan 2882        Tyan 2882        HP zx6000            Dell
Memory                PC3200         Registered       Registered       PC2100 ECC           RB800
                                     PC2700 ECC,      PC2700 ECC,
                                     node IL on       node IL off
Compiler              gcc 3.2.3      gcc 3.3.2        gcc 3.3.2        gcc 3.2              gcc 3.2
Integer +/*/Div (ns)  .17/4.7/19.4   .67/1.9/26       .67/1.9/26       1.12/4/7.9           .25/.25/11
Double +/*/Div (ns)   1.67/2.34/14.6 2.54/2.54/11.1   2.54/2.54/11.1   4.45/4.45            2.5/3.75/
Lat. L1 (ns)          0.67           1.88             1.88             2.27                 1.18
Lat. L2 (ns)          6.15           12.40            12.40            7.00                 9.23
Lat. Main (ns)        91.00          136.50           107.50           212.00               176.50

Note: 64 bytes/miss @ 100 ns/miss delivers only 640 MB/sec
21
The Memory Wall: Streams (Bandwidth) Benchmark

                  P4 3.0GHz       Opteron (x2) 1.6GHz     McKinley (x2) 900MHz
                  i865PERL        Tyan 2882               HP zx6000
Bus/Memory        800MHz PC3200   Registered PC2700 ECC,  PC2100 ECC
                                  node IL off

Native compiler, 6M elements (MB/s): icc 8.0 -fast (P4), ecc 7.1 -O3 (McKinley)
  copy   2479   3318
  scan   2479   3306
  add    3029   3842
  triad  3024   3844

gcc, 6M elements (MB/s): gcc 3.2.3 -O3 (P4), gcc 3.3.2 -O3 -m64 (Opteron), gcc 3.2 -O3 (McKinley; with -funroll-all-loops in parentheses)
  copy   2422   1635   793 (820)
  scan   2459   1661   734 (756)
  add    2995   2350   843 (853)
  triad  2954   1967   844 (858)

Itanium requires careful explicit code scheduling! The native compiler’s results are good; gcc’s are terrible.
22
Memory Latency and Bandwidth: Fundamental Limits
Example: simple 2-d relaxation kernel
S: r(i,j) = a * x(i,j) + b * (x(i-1,j)+x(i+1,j)+x(i,j-1)+x(i,j+1))
• Each execution of S
—performs 6 ops
—uses 5 x values
—writes 1 result value to r
(not counting index computations or scalars kept in registers)
• Assume x is double, large, and always coming from main memory*
— S consumes 40 bytes of memory read bandwidth: 6.7 bytes/op
— 1 GB/sec memory → 150 MFLOPS → 25M updates/sec
* actually, some accesses to x will come from cache for this computation
23
Memory Latency and Bandwidth: Fundamental Limits
Example: simple 2-d relaxation kernel (part 2)
S: r(i,j) = a * x(i,j) + b * (x(i-1,j)+x(i+1,j)+x(i,j-1)+x(i,j+1))
• In each outer loop, each x value is used exactly 5 times (6 if r ≡ x)
• If a programmer/compiler gets perfect reuse in registers and cache:
—S consumes 5*(1/5)*8 = 8 bytes of memory read bandwidth: 1.3 bytes/op
—1 GB/sec memory → 750 MFLOPS
• How?
—First hide the latency of cache/memory access
– restructure to increase locality in the cache
– schedule loads ahead of use
– prefetch
– software pipeline
—Cache-friendly transformations
– unit-stride accesses
– tiling
(Latency hiding: compiler approaches. Cache-friendly transformations: programmer or compiler approaches.)
At 750 MFLOPS: 125M updates/sec
24
Memory Bandwidth
Example: realistic 2-d relaxation kernel (real materials and deformed meshes)
S: r(i,j) = c(i,j,self)*x(i,j)
+ c(i,j,up)*x(i-1,j) + c(i,j,down)*x(i+1,j)
+ c(i,j,left)*x(i,j-1) + c(i,j,right)*x(i,j+1)
—Each execution of S
– performs 9 ops
– uses 5 x and 5 c values (not counting index computations or scalars kept in registers)
—Assume x and c are double and always come from main memory
– S consumes 80 bytes of memory read bandwidth: 8.9 bytes/op
– 1 GB/sec memory → 112 MFLOPS → 12.5M updates/sec
25
Memory Bandwidth
Example: realistic 2-d relaxation kernel (part 2)(real materials and deformed meshes)
S: r(i,j) = c(i,j,self)*x(i,j)
+ c(i,j,up)*x(i-1,j) + c(i,j,down)*x(i+1,j)
+ c(i,j,left)*x(i,j-1) + c(i,j,right)*x(i,j+1)
—In each outer loop, each x value is used 5 times (6 if r ≡ x).
—Each c value is used only once (twice for i≠j if c is symmetric).
—If a programmer/compiler gets perfect reuse in registers and cache:
– S consumes (5 + 5*(1/5))*8 = 48 bytes of memory read bandwidth: 5.33 bytes/op
– 1 GB/sec memory → 187 MFLOPS upper bound → 20.8M updates/sec
26
Memory Bandwidth
Example: A realistic 2-d relaxation kernel (part 3)(real materials, deformed meshes)
S: r(i,j) = c(i,j,self)*x(i,j)
+ c(i,j,up)*x(i-1,j) + c(i,j,down)*x(i+1,j)
+ c(i,j,left)*x(i,j-1) + c(i,j,right)*x(i,j+1)
Solutions
— “Well-studied in the literature”
– hide latency and get all the reuse you can
— “Fundamental stuff”
– aggressive algorithms to increase reuse (time skewing, etc.)
– alternate algorithmic approaches: one should not care if the FLOP rate is low if the solution is better and faster
27
What about Parallel Codes?
Good parallel efficiency and scalability requires
• Even partitioning of computation
• Low serialization/good parallelism
• Minimum communication
• Low communication overhead
• Communication overlapped with computation
28
Tuning a Parallel Code
While (not satisfied) do:
1. Analyze parallelism
2. Get the parallelism right
3. Analyze node-wide issues
4. Fix any problems
5. Analyze your code
6. Fix the problems
29
Communication Overhead and Latency

Simple message transmission:

[Figure: timeline of a message from sender to receiver: sender overhead (processor busy), transmission time (size ÷ bandwidth), time of flight, transmission time at the receiver, receiver overhead (processor busy); transport latency spans the network portion.]

Total Latency = Sender Overhead + Time of Flight + (Message Size ÷ Bandwidth) + Receiver Overhead

Note: “size” includes protocol overhead.
30
A Tale of Two Parallelizations …
[Figure: two data partitionings of the same computation: multipartitioning vs. coarse-grain pipelining]

Same computational work; different data partitioning; serialization dramatically affects scalability and performance
31
Take Home Points
• Tuning parallel programs requires tuning the parallelism first
• Modern processors are complex and require instruction-level parallelism for performance
—Understanding hazards is the key to understanding performance
• Memory is much slower than processors; a multi-layer memory hierarchy is there to hide that gap
—The gap can’t be hidden if your bandwidth needs are too great
—Only data reuse will enable you to approach peak performance
• Performance tuning may require changing everything
—Algorithms, data structures, program structure
• Effective performance tuning requires a top-down approach
—Too many issues to study everything
—Must identify the BIGGEST impediment and go after that first
32
Session 1 b
Survey of Performance Instrumentation
33
Performance Monitoring Hardware
Purpose
— Capture information about performance-critical details that is otherwise inaccessible
– cycles in flight, TLB misses, mispredicted branches, etc.

What it does
— Characterize “events” and measure durations
— Record information about an instruction as it executes

Two flavors of performance monitoring hardware
1. Aggregate performance event counters
— sample events during execution: cycles, board cache misses
— limitation: out-of-order execution smears attribution of events
2. “ProfileMe” instruction execution trace hardware
— a set of boolean flags indicating occurrence of events (e.g., traps, replays, etc.) plus cycle counters
— limitation: not all sources of delay are counted; attribution is sometimes unintuitive
34
Types of Performance Events
• Program characterization
—Attributes independent of the processor’s implementation
– number and types of instructions, e.g. load/store/branch/FLOP
• Memory hierarchy utilization
—Cache and TLB events
—Memory access concurrency
• Execution pipeline stalls
—Analyze instruction flow through the execution pipeline
—Identify hazards
– conflicts that prevent load/store reordering
• Branch prediction
—Count mispredicts to identify hazards for pipeline stalls
• Resource utilization
—Number of cycles spent using a floating point divider
35
Performance Monitoring Hardware
• Event detectors signal
—Raw events
—Qualified events, qualified by:
– hazard description: MOB_load_replay: +NO_STA, +NO_STD, +UNALGN_ADDR, …
– type specifier: page_walk_type: +DTMISS, +ITMISS
– cache response: ITLB_reference: +HIT, +MISS
– cache line-specific state: BSQ_cache_reference_RD: +HITS, +HITE, +HITM, +MISS
– branch type: retired_mispred_branch: +COND, +CALL, +RET, +INDIR
– floating point assist type: x87_assist: +FPSU, +FPSO, +POAU, +PREA
– …
• Event counters
36
Counting Processor Events
Three ways to count
• Condition count
—Number of cycles in which a condition is true/false
– e.g. the number of cycles in which the pipeline was stalled
• Condition edge count
—Number of cycles in which the condition changed
– e.g. the number of cycles in which a pipeline stall began
• Thresholded count
—Useful for events that report more than 1 count per cycle
– e.g. # of instructions completing in a superscalar processor
– e.g. count # of cycles in which 3 or more instructions complete
37
Counting Events with Calipers
• Augment code with
—Start counter
—Read counter
—Stop counter
• Strengths
—Measure exactly what you want, anywhere you want
—Can be used to guide run-time adaptation
• Weaknesses
—Typically requires manual insertion of counters
—Monitoring multiple nested scopes can be problematic
—Perturbation can be severe with calipers in inner loops
– cost of monitoring
– interference with compiler optimization
38
Profiling
• Allocate a histogram: entry for each “block” of instructions
• Initialize: bucket[*] = 0, counter = –threshold
• At each event: counter++
• At each interrupt: bucket[PC]++, counter = –threshold
[Figure: a fragment of x87 assembly (the “program”) beside a PC histogram; when the counter interrupt occurs, the histogram bucket for the current PC is incremented.]
39
Types of Profiling
• Time-based profiling
—Initialize a periodic timer to interrupt execution every t seconds
—Service the interrupt at regular time intervals
• Event-based profiling
—Initialize an event counter to interrupt execution
—Interrupt every time a counter for a particular event type reaches a specified threshold
– e.g. sample the PC every 16K floating point operations
• Instruction sampling (Alpha only)
—Initialize the instruction sampling frequency
—At periodic intervals, trace the execution of the next instruction as it progresses through the execution pipeline
40
Benefits of Profiling
• Key benefits
—Provides perspective without instrumenting your program
—Does not depend on a preconceived model of where problems lie
– often problems are in unexpected parts of the code: floating point underflows handled by software, Fortran 90 sections passed by copy, instruction cache misses, …
—Focuses attention on costly parts of the code
• Secondary benefits
—Supports multiple languages, programming models
—Requires little overhead with an appropriate sampling frequency
41
Process vs. System Profiling
• Process profiling
—Sample events while the process is executing
—Benefits
– no security concerns
—Weaknesses
– can miss critical information affecting overall execution performance
• System-level profiling
—Sample events across all applications and the operating system
—Benefits
– capture all behavior on the system
– understand the impact of external events on the process: system processes interfere with execution; NFS activity delays a process expected at a barrier
42
Key Performance Counter Metrics
• Cycles: PAPI_TOT_CYC
• Memory hierarchy
—TLB misses: PAPI_TLB_TL (Total), PAPI_TLB_DM (Data), PAPI_TLB_IM (Instruction)
—Cache misses: PAPI_Lk_TCM (Total), PAPI_Lk_DCM (Data), PAPI_Lk_ICM (Instruction), PAPI_Lk_LDM (Load), PAPI_Lk_STM (Store), for k in [1 .. number of cache levels]
• Pipeline stalls: PAPI_STL_ICY (No-issue cycles), PAPI_STL_CCY (No-completion cycles), PAPI_RES_STL (Resource stall cycles), PAPI_FP_STAL (FP stall cycles)
• Branches: PAPI_BR_MSP (Mispredicted branches)
• Instructions: PAPI_TOT_INS (Completed), PAPI_TOT_IIS (Issued)
—Loads and stores: PAPI_LD_INS (Load), PAPI_SR_INS (Store)
—Floating point operations: PAPI_FP_OPS
• Events for shared-memory parallel codes: PAPI_CA_SNP (Snoop requests), PAPI_CA_INV (Invalidations)
43
Cycles vs. Wall Clock Time
• Wall clock time is an absolute measure of effort
• Cycles is a measure of time spent using the CPU
—Usually counts only in “user mode”
—May miss time spent in the operating system on your behalf
—Definitely misses time spent waiting
– for other processes
– for communication
• Whenever possible, compare wall clock time to cycles!
44
Useful Derived Metrics
• Processor utilization for this process: cycles/(wall clock time)
• Memory operations: load count + store count
• Instructions per memory operation: (graduated instructions)/(load count + store count)
• Avg number of loads per load miss (analogous metric for stores): (load count)/(load miss count)
• Avg number of memory operations per Lk miss: (load count + store count)/(Lk load miss count + Lk store miss count)
• Lk cache miss rate: (Lk load miss count + Lk store miss count)/(load count + store count)
• Branch mispredict percentage: 100 * (branch mispredictions)/(total branches)
• Instructions per cycle: (graduated instructions)/cycles
45
Derived Metrics for Memory Hierarchy
• TLB misses per cycle: (data TLB misses + instruction TLB misses)/cycles
• Avg number of loads per TLB miss: (load count)/(TLB misses)
• Total Lk data cache accesses: Lk-1 load misses + Lk-1 store misses
• Accesses from Lk per cycle: (Lk-1 load misses + Lk-1 store misses)/cycles
• Lk traffic (analogously, memory traffic): (Lk-1 load misses + Lk-1 store misses) * (Lk-1 cache line size)
• Lk bandwidth consumed (analogously, memory bandwidth): (Lk traffic)/(wall clock time)
46
Session 1 c
HPCToolkit High-level Overview
47
HPCToolkit Goals
• Support large, multi-lingual applications
—a mix of Fortran, C, C++
—external libraries (possibly binary only)
—thousands of procedures, hundreds of thousands of lines
—we must avoid
– manual instrumentation
– significantly altering the build process
– frequent recompilation
• Collect execution measurements scalably and efficiently
—don’t excessively dilate or perturb execution
—avoid large trace files for long-running codes
• Support measurement and analysis of serial and parallel codes
• Present analysis results effectively
—top-down analysis to cope with complex programs
—intuitive enough for physicists and engineers to use
—detailed enough to meet the needs of compiler writers
• Support a wide range of computer platforms
48
HPCToolkit Overview
• Use event-based sampling to profile unmodified applications: hpcrun (Linux)
• Interpret program counter histograms: hpcprof (Linux), xprof (Alpha+Tru64 only)
• Recover program structure, including loop nesting: bloop
• Correlate source code, structure and performance metrics: hpcview, hpcquick
• Explore and analyze performance databases: hpcviewer
49
HPCToolkit System Workflow
[Workflow: application source is compiled and linked into binary object code; profiling an execution produces performance profiles, while binary analysis recovers program structure; profile interpretation and source correlation combine these into a hyperlinked database for interactive analysis.]
http://www.hipersoft.rice.edu/hpctoolkit/
50
HPCToolkit System Workflow
—launch unmodified, optimized application binaries
—collect statistical profiles of events of interest
51
HPCToolkit System Workflow
—decode instructions and combine with profile data
52
HPCToolkit System Workflow
—extract loop nesting information from executables
53
HPCToolkit System Workflow
—synthesize new metrics by combining metrics
—relate metrics, structure, and program source
54
HPCToolkit System Workflow
—support top-down analysis with interactive viewer
—analyze results anytime, anywhere
55
HPCToolkit: Supported Platforms
• Linux—IA32: Pentium 3, Pentium 4, Athlon
—Opteron
—Itanium
• Tru64—Alpha
• Irix—MIPS R10000, R12000
56
Session 2a
Using HPCToolkit on Linux
http://www.hipersoft.rice.edu/hpctoolkit
57
HPCToolkit Workflow
58
hpcrun*
Capabilities
• Monitor unmodified, dynamically-linked application binaries
• Collect event-based program counter histograms
• Monitor multiple events during a single execution

Data collected for each (load module, event type) pair:
• Load module attributes: name, start address, length
• Event name
• Execution measurements: a sparse histogram of PC values for which the event counter overflowed

*Only for use on Linux at present
59
hpcrun in Action
% ls
README
% hpcrun ls
ls.PAPI_TOT_CYC.master4.rtc.2866 README
Profile the default event (PAPI_TOT_CYC) for the ls command using the default period
Data file to which PC histogram is written for monitored event or events
Data file naming convention: command.eventname.nodename.pid
60
hpcrun: Common Options
-h              print help information
-L              list events that can be monitored
                — PAPI standard events: name, derived = yes/no, event description
                — native processor events: name
-e <event>[:<period>]
                event and sample period for profiling; use multiple -e options for multiple events
-o <outpath>    optional directory for output data
--              subsequent arguments aren’t hpcrun options
61
A Running Example: sample.c
#define N (1 << 23)
#define T (10)
#include <string.h>
double a[N],b[N];
void cleara(double a[N]) {
  int i;
  for (i = 0; i < N; i++) {
    a[i] = 0;
  }
}
int main() {
  double s=0,s2=0; int i,j;
  for (j = 0; j < T; j++) {
    for (i = 0; i < N; i++) {
      b[i] = 0;
    }
    cleara(a);
    memset(a,0,sizeof(a));
    for (i = 0; i < N; i++) {
      s += a[i] * b[i];
      s2 += a[i] * a[i] + b[i] * b[i];
    }
  }
  printf("s %f s2 %f\n",s,s2);
}
• 2 user functions
• 2 calls to library functions
• main has imperfectly nested loops with function calls
62
Using hpcrun to Collect a Performance Profile
% hpcrun -e PAPI_TOT_CYC:499997 -e PAPI_L1_LDM \
-e PAPI_FP_INS -e PAPI_TOT_INS sample
Profile “total cycles” with period 499997; profile “L1 data cache load misses,” “floating point instructions,” and “total instructions” with their default periods.
Running this command on a machine “eddie” produced a data file sample.PAPI_TOT_CYC-etc.eddie.1736
63
HPCToolkit Workflow
64
hpcprof*
Capabilities
• Read program counter histograms gathered by hpcrun
• Read debugging information in the executable
• Use debugging information to map PC samples to
—Load module
—File
—Function
—Source line
• Output data in ASCII for human consumption
• Output data in XML for consumption by other tools

*Only for use on Linux at present
65
Using hpcprof to Interpret a Profile
% hpcprof -e sample sample.PAPI_TOT_CYC-etc.eddie.1736
Columns correspond to the following events [event:period]
PAPI_TOT_CYC:32767 - Total cycles (264211 samples)
PAPI_TOT_INS:32767 - Instructions completed (61551 samples)
PAPI_FP_INS:32767 - Float point instructions (15355 samples)
PAPI_L1_LDM:32767 - Level 1 load misses (6600 samples)
Load Module Summary:
85.5% 100.0% 100.0% 99.9% /tmp/sample
14.5% 0.0% 0.0% 0.1% /lib/libc-2.3.3.so
File Summary:
85.5% 100.0% 100.0% 99.9% <</tmp/sample>>/tmp/sample.c
14.5% 0.0% 0.0% 0.1% <</lib/libc-2.3.3.so>><unknown>
Continued …
66
hpcprof: Function and Line Output
Function Summary:
70.8% 83.3% 100.0% 99.7% <</tmp/sample>>main
14.7% 16.7% 0.0% 0.2% <</tmp/sample>>cleara
14.5% 0.0% 0.0% 0.1% <</lib/libc-2.3.3.so>>memset
Line Summary:
37.2% 35.0% 55.9% 51.4% <</tmp/sample>>/tmp/sample.c:20
13.4% 18.3% 30.2% 34.0% <</tmp/sample>>/tmp/sample.c:21
5.6% 9.3% 13.9% 14.2% <</tmp/sample>>/tmp/sample.c:19
11.6% 17.4% 0.0% 0.2% <</tmp/sample>>/tmp/sample.c:15
8.8% 10.5% 0.0% 0.1% <</tmp/sample>>/tmp/sample.c:8
14.5% 0.0% 0.0% 0.1% <</lib/libc-2.3.3.so>><unknown>:0
5.9% 6.1% 0.0% 0.0% <</tmp/sample>>/tmp/sample.c:7
3.1% 3.4% 0.0% 0.0% <</tmp/sample>>/tmp/sample.c:14
Continued …
67
hpcprof: Annotated Source File Output
File <</tmp/sample>>/tmp/sample.c with profile annotations.
  1                               #define N (1 << 23)
  2                               #define T (10)
  3                               #include <string.h>
  4                               double a[N],b[N];
  5                               void cleara(double a[N]) {
  6                                 int i;
  7  14.5% 16.7%  0.0%  0.2%        for (i = 0; i < N; i++) {
  8                                   a[i] = 0;
  9                                 }
 10                               }
 11                               int main() {
 12                                 double s=0,s2=0; int i,j;
 13                                 for (j = 0; j < T; j++) {
 14   9.0% 14.6%  0.0%  0.1%         for (i = 0; i < N; i++) {
 15   5.6%  6.2%  0.0%  0.1%           b[i] = 0;
 16                                   }
 17                                   cleara(a);
 18                                   memset(a,0,sizeof(a));
 19   3.6%  5.8% 10.6%  9.2%         for (i = 0; i < N; i++) {
 20  36.2% 32.9% 53.2% 49.1%           s += a[i]*b[i];
 21  16.7% 23.8% 36.2% 41.2%           s2 += a[i]*a[i]+b[i]*b[i];
 22                                   }
 23                                 }
 24                                 printf("s %d s2 %d\n",s,s2);
 25                               }
68
hpcprof: Common Options
-h , --help print help information
-d, --directory dir Search dir for source files.
-D, --recursive-directory dir Search dir recursively for source files.
Text mode [Default]
-e, --everything Show all information.
-f, --files Show all files.
-r, --funcs Show all functions.
-l, --lines Show all lines.
-a, --annotate file Annotate file.
-n, --number Show number of samples (not %).
-s, --show thresh Set threshold for showing aggregate data.
PROFILE mode
-p, --profile Dump PROFILE output (for use with hpcview).
69
HPCToolkit Workflow
[Workflow diagram: application source -> (compilation, linking) -> binary object code; binary object code -> profile execution -> performance profile; binary object code -> binary analysis -> program structure; performance profile + program structure -> interpret profile / source correlation -> hyperlinked database -> interactive analysis]
70
Why Binary Analysis?
Issues
• Line-level statistics offer a myopic view of performance—Example: loads on one line, arithmetic on another.
• Event detection “skid” makes line-level metrics inaccurate
• Optimizing compilers rearrange and interleave code.
• Source-code analysis is language-dependent, compiler- (and compiler-version-) dependent.
• For scientific programs, loop level is the unit of optimization and is where the action is
Approach
• Recover loop information from application binaries
71
bloop
Extract loop structure from binary modules (executables, dynamic libraries)
Approach
• Parse instructions in an application binary
• Identify branches and aggregate instructions into basic blocks
• Construct control flow graph using branch target analysis
• Use interval analysis to identify natural loop nests
• Map instructions to source lines with symbol table information—Quality of results depends on quality of symbol table information
• Normalize results to recover source-level view of optimized code
• Elide information about source lines NOT in loops
• Output program structure in XML for consumption by hpcview
Platforms: Alpha+Tru64, MIPS+IRIX, Linux+IA64, Linux+IA32, Solaris+SPARC
72
Sample Flow Graph from an Executable
Loop nesting structure—blue: outermost level
—red: loop level 1
—green: loop level 2
Observation: optimization complicates program structure!
73
Normalizing Program Structure
Coalesce multiple instances of lines
Repeat
(1) If the same line appears in different loops– find least common ancestor in scope tree; merge
corresponding loops along the paths to each duplicate
purpose: re-roll loops that have been split
(2) If the same line appears at multiple loop levels– discard all but the innermost instance
purpose: account for loop-invariant code motion
Until a fixed point is reached
Constraint: each source line may appear at most once
74
bloop in Action
% bloop sample > sample.bloop
<LM n="sample" version="4.0">
  <F n="/tmp/sample.c">
    <P n="cleara" b="5" e="10">
      <L b="7" e="8">
        <S b="7" e="7"/>
        <S b="8" e="8"/>
      </L>
    </P>
    <P n="main" b="11" e="25">
      <L b="13" e="21">
        <S b="13" e="13"/>
        <L b="14" e="15">
          <S b="14" e="14"/>
          <S b="15" e="15"/>
        </L>
        <S b="17" e="17"/>
        <S b="18" e="18"/>
        <L b="19" e="21">
          <S b="19" e="19"/>
          <S b="20" e="20"/>
          <S b="21" e="21"/>
        </L>
      </L>
    </P>
  </F>
</LM>
Load Module
File
Statement
Loop
Procedure
75
HPCToolkit Workflow
[Workflow diagram: application source -> (compilation, linking) -> binary object code; binary object code -> profile execution -> performance profile; binary object code -> binary analysis -> program structure; performance profile + program structure -> interpret profile / source correlation -> hyperlinked database -> interactive analysis]
76
Data Correlation
Motivation
• Need to analyze data from multiple experiments together—Different architectures—Different input data sets—Different metrics: any one metric provides a myopic view
– some measure potential causes (e.g. cache misses)– some measure effects (e.g. cycles)– e.g., cache misses are only a problem if latency not hidden
• Need to analyze metrics in the context of the source code
• Need to synthesize derived metrics, e.g. cache miss rate—Synthetic metrics provide needed information—Automation eliminates mental arithmetic—Serve as a key for sorting
77
hpcview and hpcquick
• hpcview correlates —Metrics expressed as PROFILE XML documents
– Produced on Linux with hpcprof
—Program structure expressed as XML documents– Produced by bloop
—Program source code in any language
• hpcview produces a database containing both performance data and source code for exploration with hpcviewer—Aggregates metrics up the program structure tree
• hpcquick is a script that simplifies hpcview usage. —Runs hpcprof on any raw profile data files
—Synthesizes a configuration file for hpcview
—Runs hpcview
—Writes an executable script that will reconstruct its actions
78
hpcquick in Action
% hpcquick -S sample.bloop \ -I . \ -P sample.PAPI_TOT_CYC-etc.eddie.1736
Program structure PROFILE XML file
List of paths to directories containing source code
List of metric PROFILE XML files
79
HPCToolkit System Overview
[Workflow diagram: application source -> (compilation, linking) -> binary object code; binary object code -> profile execution -> performance profile; binary object code -> binary analysis -> program structure; performance profile + program structure -> interpret profile / source correlation -> hyperlinked database -> interactive analysis]
80
hpcviewer
Tool for interactive top-down analysis of performance data
• Execution metric data plus program source code—Multiple metrics from one or more experiments
• Exploration is guided by rank-ordered table of metrics—Can sort on any metric in ascending or descending order
• Hierarchical view of execution metrics—Load modules
– Executable, shared libraries, operating system, other programs—Files—Procedures—Loops—Source lines
hpcviewer is written in Java and runs anywhere: your cluster, your desktop, your laptop
81
hpcviewer Screenshot
MetricsNavigation
Annotated Source View
82
hpcviewer Browser Actions
• Open additional experiments from file menu
• Navigate from source code to metrics—Show metric info (for lines and scopes with metrics)
• Browsing hierarchical experiment data
—Open : reveal metric data of children
—Close : hide metric data of children
—Zoom : focus analysis on a particular subtree of program data
—Unzoom : undo previous zoom operation
—Flatten : melt away a level of hierarchy to view children of
current root or siblings as peers
—Unflatten : undo previous flatten at current tree location
—Navigate from scope to source: select scope descriptive text
83
Flattening for Top Down Analysis
• Problem —strict hierarchical view of a program is too rigid
—want to compare program components at the same level as peers
• Solution —enable a scope’s descendants to be flattened to compare their
children as peers
flatten
Current scope
unflatten
84
Flattening in Action
Flattened 3 levels:
—load modules and files are hidden
—functions with loops are hidden
—leaf functions are shown
85
Summary of HPCToolkit Functionality
• Top down analysis focuses attention where it belongs —sorted views put the important things first
• Integrated browsing interface facilitates exploration—rich network of connections makes navigation simple
• Hierarchical, loop-level reporting facilitates analysis—more sensible view when statement-level data is imprecise
• Binary analysis handles multi-lingual applications and libraries—succeeds where language and compiler based tools can’t
• Sample-based profiling, aggregation and derived metrics—reduce manual effort in analysis and tuning cycle
• Multiple metrics provide a better picture of performance
• Multi-platform data collection
• Platform independent analysis tool
86
Session 2 b
Case Studies
87
A Typical “First Look” at a Big Program
• Let’s look at an execution of LANL’s POP 2.01 code
• In one execution we measured— Cycles— Instructions— Floating Point Instructions— Cache Misses at some level (L1, L2, L3)
• We computed derived metrics— Cycles per Instruction (CPI)— Cycles - (instructions * Peak CPI)— Flops per cycle— Misses per Flop
• Screen shot for handout
• Demo for live tutorial
Purpose: Gain familiarity with the viewer on a moderately-large code; rapidly gain familiarity with its performance issues.
88
LANL POP 2.0.1
(Live Demo In Tutorial)
89
“Optimized” Matrix-Matrix Multiply
• The demo program mmdemo.c provides 10 variants of the classic n-cubed matrix-matrix multiply algorithm.— Collects common tricks used by programmers and compilers
— Variants differ in the order they perform operations and access memory
— Performance and underlying phenomena vary widely– Differentiates processor implementations
– Illustrates interactions between code and compilers
– Exercises instrumentation, tool chain, and analyst.
• Goals
1. Gain intuition via cross-architecture/cross-compiler comparison.
2. Delve into questions of what hardware counters really tell us
3. Explore HPCToolkit/PAPI capabilities on different architectures
90
Naïve and “Simple” Variants
void naive_mmult()
{
  int i, j, k;
  for (i = 0; i < MSIZE; i++) {
    for (j = 0; j < MSIZE; j++) {
      c[i][j] = 0;
      for (k = 0; k < MSIZE; k++)
        c[i][j] = c[i][j] + a[i][k] * b[k][j];
    }
  }
}

void simple_mmult()
{
  int i, j, k;
  for (i = 0; i < MSIZE; i++) {
    for (j = 0; j < MSIZE; j++) {
      MTYPE tmp = 0;
      for (k = 0; k < MSIZE; k++)
        tmp = tmp + a[i][k] * b[k][j];
      c[i][j] = tmp;
    }
  }
}
Simple is like naïve, but accumulates c[i][j] in a temporary scalar. This is something we would expect any optimizing compiler to do automatically if there is no aliasing of c with a or b.
91
Blocked and Unrolled Variant

void rmm_mmult()
{
  int i, j, k;
  int ii, ilimit, jj, jlimit, kk, klimit;
  for (j = 0; j < MS; j += XSIZE) {
    for (i = 0; i < MS; i += XSIZE) {
      ilimit = (i + XSIZE) < MS ? i + XSIZE : MS;
      jlimit = (j + XSIZE) < MS ? j + XSIZE : MS;
      for (ii = i; ii < ilimit; ii++)
        for (jj = j; jj < jlimit; jj++) c[ii][jj] = 0;
      for (k = 0; k < MS; k += ZSIZE) {
        klimit = (k + ZSIZE) < MS ? k + ZSIZE : MS;
        for (ii = i; ii < ilimit; ii++) {
          for (jj = j; jj < jlimit; jj += 8) {
            MTYPE tmp0 = c[ii][jj];
            MTYPE tmp1 = c[ii][jj+1];
            MTYPE tmp2 = c[ii][jj+2];
            MTYPE tmp3 = c[ii][jj+3];
            MTYPE tmp4 = c[ii][jj+4];
            MTYPE tmp5 = c[ii][jj+5];
            MTYPE tmp6 = c[ii][jj+6];
            MTYPE tmp7 = c[ii][jj+7];
            for (kk = k; kk < klimit; kk++) {
              tmp0 = tmp0 + a[ii][kk] * b[kk][jj];
              tmp1 = tmp1 + a[ii][kk] * b[kk][jj+1];
              tmp2 = tmp2 + a[ii][kk] * b[kk][jj+2];
              tmp3 = tmp3 + a[ii][kk] * b[kk][jj+3];
              tmp4 = tmp4 + a[ii][kk] * b[kk][jj+4];
              tmp5 = tmp5 + a[ii][kk] * b[kk][jj+5];
              tmp6 = tmp6 + a[ii][kk] * b[kk][jj+6];
              tmp7 = tmp7 + a[ii][kk] * b[kk][jj+7];
            }
            c[ii][jj]   = tmp0;
            c[ii][jj+1] = tmp1;
            c[ii][jj+2] = tmp2;
            c[ii][jj+3] = tmp3;
            c[ii][jj+4] = tmp4;
            c[ii][jj+5] = tmp5;
            c[ii][jj+6] = tmp6;
            c[ii][jj+7] = tmp7;
          }
        }
      }
    }
  }
}

(assumes XSIZE mod 8 == 0)
92
Matrix Multiply: Summary
1000 x 1000 dense matrix multiply of doubles (seconds)

Columns:
  A: 900MHz Itanium 2, icc 8.0 -O3 no-alias (icc 8.1 in parentheses)
  B: 900MHz Itanium 2, gcc 3.3.3 strict-aliasing
  C: 3.0GHz Pentium 4, icc 8.1 -O3 noalias
  D: 3.0GHz Pentium 4, gcc 3.3.4 (strict-alias, align, unroll)
  E: 3.0GHz Pentium 4, gcc 3.3.4 (same flags) + -sse3
  F: 1.6GHz K8, PC2700 ecc, gcc 3.3.3 -funroll-loops

                     A               B        C      D      E       F
Naïve            1.450 (0.946)   41.738   9.463  9.618  9.554  18.127
Simple           13.90 (13.94)   37.998   9.480  9.449  9.523  17.902
R-Naive          1.45  (1.592)   25.254   2.964  2.977  3.116   5.843
Matrix-vector    13.71 (13.67)   27.809   7.520  7.505  7.547  17.450
Block mat-mat    4.34  (3.89)    18.234   2.075  2.041  2.034   3.053
Lam et al. rev.  1.44  (1.49)    25.252   2.945  2.918  3.002   5.925
Lam mat-vec      10.92 (10.70)   20.179   2.253  2.882  4.414   3.957
Lam unroll*2     5.28  (1.89)    19.681   1.877  2.153  3.765   3.727
4-way unroll     10.57 (4.05)    23.729   5.821  5.554  7.144   6.768
Blk and unroll   1.18  (0.845)    4.980   1.577  1.488  4.131   1.709
93
Matrix Multiply Notes and Questions
• No attempt made to tune specifically for these architectures.—Further opportunities for optimization?
• Naïve vs simple on Itanium – What’s going on?—Pattern recognition?—Analysis/code generation sensitivity to small changes?— tmp introduces a data dependence hazard?
• Use of SSE instructions?—Little advantage in these examples on P4—SSE vectorization vs. unrolling?
• Sources of inefficiencies?— Memory Hierarchy Components: TLB, Cache, Write buffers— Instruction level parallelism, latency hiding
• Opteron Issues—Performance using other compilers? Other variants?—Performance with faster clocks?
(Revisit this example in the exercises)
94
Matrix Multiply: Itanium-2 + icc
(Live Demo In Tutorial)
95
Matrix Multiply: Pentium 4 + icc
One level of unroll-and-jam applied to b_mmult eliminates write buffer stalls, reduces and hides L2 misses.
(Live Demo In Tutorial, also for Opteron)
96
What Events Can Be Measured?
• Itanium, Pentium 4, Opteron all support a rich set of events
• PAPI tries to present a common set—Architectural differences prevent exact semantic alignment
—Unique resources (L3 cache, Multi-level TLB, etc.) introduce differentiation
• Unique measurement resources—Itanium: Direct measurement of pipeline bubbles
– Enables a “decision tree” approach to diagnosis.
—Opteron: Extensive measurements of “system” (e.g., bus) events– Probably mostly of interest to system architects/implementers
—Pentium 4: Base events + tags– Probably more detail than you want/need
– Speculative instructions and replay
97
Typical Uses for HPCToolkit
• Identifying unproductive work—where is the program spending its time not performing FLOPS
• Memory hierarchy issues—bandwidth utilization: misses x line size/cycles
—exposed latency: ideal vs. measured
• Cross architecture or compiler comparisons—what program features cause performance differences?
• Gap between peak and observed performance—loop balance vs. machine balance?
• Evaluating load balance in a parallelized code—how do profiles for different processes compare
98
Session 3a
Exercises
See WWW page
http://www.hipersoft.rice.edu/hpctoolkit/sc04
99
Exercise 1
Using hpcviewer
1. Download and unpack exercise1.tar
2. Change to the exercise1 directory
3. Run hpcviewer
4. Open the performance database for the sample program in hpcview.output/hpcquick
5. Explore the program and performance metrics with hpcviewer
— open and close scopes in the navigation pane tree view
— sort in ascending and descending order by different metrics
— familiarize yourself with the flattening, zooming, unflattening, and unzooming capabilities of the browser
6. Analyzing the performance
— what load module has the most time spent in it?
— which is the most costly routine in the program?
— which is the most costly loop in the program?
100
Exercise 2
Top down analysis of a big program: WRF
1. Download and unpack exercise2.tar
2. Change to the exercise2 directory
3. Run hpcviewer
4. Open the performance database for the sample program in hpcview.output/hpcquick
5. Explore the program and performance metrics with hpcviewer
6. Where does the program spend its time?
7. Is it inefficient?
8. What are the main performance issues?
9. How might one improve the program efficiency?
101
Exercise 3
Comparing memcpy implementations
1. Download and unpack exercise3.tar
2. Change to the exercise3 directory
3. Look through the source files main.c, ccopy.c and fcopy.f90
— study the copy loops in ccopy.c and fcopy.f90
— look at how they and the C library’s memcpy are invoked identically in the main routine
— since they are all called identically, do you expect the same amount of time to be spent in each memcpy variant?
4. Compile them using the Makefile provided
5. Launch the program with hpcrun to profile cycles and total instructions using PAPI counters. Examine the profile results in ASCII with hpcprof -e.
6. What surprises did you find? Try to explain the results by collecting data using other event counters.
— feel free to ask for suggestions
102
Session 3 b
Using HPCToolkit in Practice
103
Preparing your Code for Analysis
HPCToolkit tools prefer line mapping information in executables
• hpcprof uses it to map PC samples to source lines
• bloop uses it to recover source lines in loop nests
Compilers may not record line mapping by default
Is line mapping information present in your executable?
• Use readelf --debug-dump=line <elf-executable>— readelf is part of the binutils tools
Getting line mapping information in your compiled code
• Intel icc, ifort compiler: use -g with explicit optimization flags
• gcc/g++/g77: add -g to any optimization flags
• Strategies for other compilers are similar
104
hpcrun Limitations (Part 1)
Not all PAPI events can be sampled with hpcrun
Cause—PAPI derived metrics combine 2 or more native events
—PAPI does not support sampling of derived metrics
Diagnosis—hpcrun will issue an error if invoked to sample a derived metric
Solution—Sample the constituent native events with hpcrun— hpcview can synthesize the derived metric from native metrics
105
hpcrun Limitations (Part 2)
Not all events can be collected in a single execution
• Causes—Processor does not have enough total hardware counters
—Multiple event detectors need to use the same counter
– For example, on Pentium 4
X87: count all x87 instructions
SSE_SP: count all unpacked SSE single precision instructions
SSE_DP: count all unpacked SSE double precision instructions
– The event register design lets only 2 of the 3 be counted at once
• Solution—With trial and error, find out which events conflict
– Use “hpcrun -e e1 -e e2 … /bin/ls” for this purpose
—Collect performance metrics for your application using multiple runs
– Sufficiently small number of non-conflicting events in each run
—Use hpcview to correlate the data from multiple runs with source code
106
Informing HPCToolkit about Source Code
HPCToolkit needs to know where to find source code (yours or library) when building performance databases
• When using hpcquick, specify one or more paths to use as roots for relative paths using -I dir1 dir2 ... dirN
• A root for relative paths can be specified to hpcview as —An absolute path
– e.g. /usr/local/mpich-1.2.6/src/pt2pt
—A relative path from the directory in which hpcview is launched– e.g. ../src
—A trailing “/*” wildcard specifies an entire directory tree. – e.g. “/usr/local/mpich-1.2.6/src/*”
– e.g. “../mpich-1.2.6/src/*”
Note: wildcard paths must be enclosed in quotation marks or have a backslash in front of the * to keep your shell from expanding it
107
Under the Hood with hpcview
• hpcquick provides a simple interface for
—invoking hpcprof to transform performance data into XML PROFILE format
—building an XML document that serves as an hpcview configuration file
—invoking hpcview
• hpcquick does not expose all of the capabilities of hpcview
• hpcview also supports—Adding titles to hpcviewer performance displays
—Replacing paths to force proper correlation with source
—Computation of derived metrics
—Fine-tuning displays by omitting display of intermediate metrics
108
<HPCVIEW>
  <TITLE name="POP 4-way shmem, model_size=medium" />
  <PATH name="." />
  <PATH name="./compile" />
  <PATH name="../sshmem" />
  <PATH name="../source" />
  <METRIC name="pcc" displayName="Cycles">
    <FILE name="pop.fcy_hwc.pxml" />
  </METRIC>
  <METRIC name="dc" displayName="L1 miss">
    <FILE name="pop.fdc_hwc.pxml" />
  </METRIC>
  <METRIC name="dsc" displayName="L2 miss">
    <FILE name="pop.fdsc_hwc.pxml" />
  </METRIC>
  <METRIC name="fp" displayName="FP insts">
    <FILE name="pop.fgfp_hwc.pxml" />
  </METRIC>
  <METRIC name="rat" displayName="cy per FLOP">
    <COMPUTE>
      <math>
        <apply> <divide/> <ci>pcc</ci> <ci>fp</ci> </apply>
      </math>
    </COMPUTE>
  </METRIC>
</HPCVIEW>
hpcview Configuration File
Paths to relevant source directories
Heading on Display
Metrics defined by Platform Independent Profile Files
Expression for derived metric
109
Why are Some of my Source Files Missing?
My performance database built with hpcview does not contain all of my source files
Explanation 1
• hpcview databases are sparse
• They contain only source files for which samples are present
• If no samples exist for a particular file, it will be omitted—.h files are typically omitted
• This is a design decision. There is no workaround.
Explanation 2
• The performance and program structure data may contain invalid paths
• See next slides for detailed explanation and solutions
110
How Can Invalid Paths Arise?
• bloop and hpcprof obtain file path name information from the application binary
• Path names encoded in the application binary may not be helpful—They may be relative to one of the directories in which the
application was compiled
—They may be relative to one of the directories in which a library was compiled
—They may refer to files using an inaccessible mount point, or one that is named differently– Files compiled on one node of a cluster may be inaccessible through
the same paths on another cluster node
111
Coping with Invalid File Paths
• Diagnosing invalid paths with hpcmissing— hpcmissing <xml-file1> … <xml-filen>
<xml-file*> may be output of bloop or XML output of hpcprof—For each missing file, hpcmissing will report
– The name of the missing file if the parent directory is present– The name of its parent directory if the parent directory is missing
• Correcting path problems in hpcview configuration files—if the executable binary contains the invalid file path
../../binutils/bfd/foo.c
and the corresponding source file can be found at
/scratch/tmp/binutils/bfd/foo.c
—Direct hpcview to perform a path prefix substitution with the directive
<REPLACE in="../../binutils" out="/scratch/tmp/binutils"/>
—In the configuration file, add REPLACE commands after any PATH commands and before any METRIC definitions
• See doc/dtd_hpcview.html for details
112
Working with Multiple Metrics in a Data File
• hpcrun can record multiple metrics in a single execution
• hpcprof will report multiple metrics in an XML output file
• Within the output file, metrics are known by two names— nativename, usually descriptive— shortname, usually a small integer
• To import data from a file with multiple metrics, use the select attribute
<METRIC name="dsc" displayName="L2 miss">
  <FILE name="pop.fdsc_hwc.pxml" select="1"/>
</METRIC>

The optional select attribute specifies the shortname of a metric in the file. It defaults to metric "0".
113
Computing Derived Metrics
COMPUTE metrics provide a way to compute derived metrics from other raw or computed metrics
Operators: <plus/>, <minus/>, <divide/>, <power/>, <times/>, <max/>, <min/>
<METRIC name="acc" displayName="L1 ACCESS">
  <FILE name="pop.fcy_hwc.pxml" />
</METRIC>
<METRIC name="miss" displayName="L1 MISS">
  <FILE name="pop.fgfp_hwc.pxml" />
</METRIC>
<METRIC name="l1m" displayName="L1 MISS RATE">
  <COMPUTE>
    <math>
      <apply> <divide/> <ci>miss</ci> <ci>acc</ci> </apply>
    </math>
  </COMPUTE>
</METRIC>
Internal name for use in expressions
Name for use in hpcviewer display
Raw metric from a file
computed metric
MathML formula delimiter
apply operator to terms
identifier
114
Complex Derived Metrics
• Where is the biggest unrealized opportunity for performing floating point operations?
• Let fpins be floating point instructions retired
• Let cyc be total cycles measured
• Let 2 be the maximum number of floating point operations that can retire per cycle
<METRIC name="uf" displayName="UNREALIZED FLOPS">
  <COMPUTE>
    <math>
      <apply>
        <minus/>
        <apply> <times/> <ci>cyc</ci> <cn>2</cn> </apply>
        <ci>fpins</ci>
      </apply>
    </math>
  </COMPUTE>
</METRIC>
Complex formulae nest operator applications
115
Find the Missing Cycles
<METRIC name="WALLCLK" displayName="WALLCLK" sortBy="true">
<FILE name="bm10dicc.WALLCLK.eddie.7911.hpcquick.pxml" select="0"/>
</METRIC>
<METRIC name="PAPI_TOT_CYC" displayName="PAPI_TOT_CYC">
<FILE name="bm10dicc.PAPI_TOT_CYC-etc.eddie.7921.hpcquick.pxml" select="0"/>
</METRIC>
<METRIC name="wall_cyc" displayName="WALL_CYC" percent="true">
<COMPUTE> <math>
<apply> <times/> <ci>WALLCLK</ci> <cn>300000</cn></apply>
</math> </COMPUTE>
</METRIC>
<METRIC name="missing" displayName="Missing Cycles" percent="true">
<COMPUTE> <math>
<apply> <minus/> <ci>wall_cyc</ci><ci>PAPI_TOT_CYC</ci></apply>
</math></COMPUTE>
</METRIC>
116
Avoiding Display Clutter
• hpcview’s configuration file lets you
—Suppress display of intermediate metrics
—Suppress display of percentages (irrelevant for ratios)
—Select the metric for initial display sorting using sortBy="true"
<METRIC name="acc" displayName="L1 ACCESS" display="false">
  <FILE name="pop.fcy_hwc.pxml" />
</METRIC>
<METRIC name="miss" displayName="L1 MISS" display="false">
  <FILE name="pop.fgfp_hwc.pxml" />
</METRIC>
<METRIC name="l1m" displayName="L1 MISS RATE" percent="false">
  <COMPUTE>
    <math>
      <apply> <divide/> <ci>miss</ci> <ci>acc</ci> </apply>
    </math>
  </COMPUTE>
</METRIC>
Raw
Computed
117
Lines for Loops and Procedures Are Imprecise
The line number reported for a loop or a procedure by hpcviewer does not match its true location in the source
• Cause—Inaccurate or missing line number information in symbol table
—No program structure information from bloop provided to hpcquick and hpcview
• Rationale—In the absence of program structure information from bloop, hpcview infers starting and ending lines of a procedure from the line numbers of PC samples within the procedure– Start line is reported as the first line with a sample
– End line is reported as the last line with a sample
118
Unknown File and Function Reports
Some costs are attributed to “unknown file” and “unknown function” in shared libraries
• Shared libraries have dispatch code that is used to call shared library entry points
• Instructions for this dispatch code are not mapped to any source file or function
• Samples that occur when calling shared library entry points through the dispatch table will be mapped to “unknown file” and “unknown function.”
119
Working with Stripped Binaries
• Source not available: only a summary/skeleton appears in the source pane
• Executable and/or shared libraries may have been stripped—Stripped executable: no symbols, except for those needed to
bind against shared libraries
—Stripped shared library: few symbols - just a record of externally visible function entry points
• hpcrun profiles stripped binaries like any others
• hpcprof reports brief profiles for stripped binaries—Stripped executable: samples mapped to
Load module, unknown file, unknown function, line 0
—Stripped shared library: samples mapped to Load module, unknown file, function, line 0
120
Trouble Spot: Multiple Instantiations
• Code that appears in one place is often instantiated elsewhere—Inlined functions
—Fortran statement functions
—Forward substitution of common subexpressions
• Symbol table line mapping information —Maps all instantiations of code only to its original source location
– Sensible choice for debugging, problematic for performance analysis
121
Current Problems with Multiple Instantiations
• hpcprof
—For sample at a PC: looks up PC’s source line in line map
—All metrics for any line are reported only in one place
—Evidence of multiple instantiations is lost, attribution is confusing
• bloop
—Lines from inlined functions can distort the range of statements that should be mapped to a function– If procedure A contains inlined code from procedure B, instructions
inside A’s procedure body will map to lines inside B
—All loops that contain instances of multiply instantiated code will be fused for presentation
122
Analyzing Parallel Executions
• hpcrun can be used when launching parallel jobs— e.g. mpirun -n 24 hpcrun -e PAPI_L1_DCM -- myjob myarg1 …
• hpcrun independently monitors each process—Records separate performance data file for each process
—Data file naming convention includes node name and PID
123
Working with Threads
hpcrun can be used to monitor threaded programs
Two modes
• -t each: collect data for each thread separately (default)
• -t all: collect data for all threads together
Works with posix threads only
124
Session 4a: Advanced Exercises
• Continue study of previous examples and exercises—Tune the mmdemo.c matrix multiplication example
—Compare mmdemo.c with libgoto and other BLAS implementations
• Analyze the Impact3D code on one processor—Study an Earth Simulator code on a small data size
• Analyze one or more of the NAS parallel benchmarks (MPI)—Examine how the MPI communication overhead scales as the
number of processors changes
• Apply HPCToolkit to your program
• Automate application of HPCToolkit tools—Scripts and Makefiles to integrate HPCToolkit tools with your
development process
125
Improving Program Performance
• Memory hierarchy: Improving spatial and temporal locality—Reduce cache and TLB misses with blocking
—Reduce loads and stores with scalar replacement
—Reduce the size of array temporaries
• Instruction cache—Reduce icache misses by distributing loops that are too large
• Execution pipeline—Compiler optimizations
– Unroll and software pipeline
—Hand transformations
126
Session 4 b
Installing HPCToolkit
127
Getting HPCToolkit
• Open source—Rice code under a BSD license
—Supporting code under LGPL and Apache licenses
• http://www.hipersoft.rice.edu/hpctoolkit
128
Installing HPCToolkit on Linux
• HPCToolkit is available only in source form
• Installing HPCToolkit means downloading and building it
• Downloading HPCToolkit requires—ssh
—cvs
—tar
• Building HPCToolkit requires—javac: Java compiler (1.4 or later for best results)
—C compiler (gcc suffices)
—C++ compiler (g++ suffices)
—gnumake
—autoconf and automake are helpful, but not required
129
Installation Components
• HPCToolkit—Profile collection, interpretation, program structure recovery,
source code and metric correlation, and interactive browser
• Xerces-C (use our version for better portability)—Apache XML parser infrastructure
• Binutils (use our version with enhanced symbol table support) —GNU software libraries and utilities for working with binary files
– Used to partially decode executables and interpret symbol tables
• OpenAnalysis—Rice toolkit for intermediate representation independent program
analysis– Used to construct control flow graphs for application binaries
• PAPI (download separately)—UTK performance counter API library
130
Downloading and Installing PAPI
• PAPI distributed separately from HPCToolkit—See http://icl.cs.utk.edu/papi/software
• Version 3+ required
• PAPI download choices—stable TAR file: lacks some recent fixes, but not changing daily
—CVS repository: up to date fixes, but issues may arise
• Instructions—Download PAPI using your method of choice
—Build using the appropriate platform makefile
—On IA32 platforms, kernel patching is necessary– The perfctr driver must be added to the kernel
– See http://www.hipersoft.rice.edu/hpctoolkit/papi for more info
—Verify it is working using self-diagnosis tests – look in ctests and ftests subdirectories
131
Downloading and Building HPCToolkit
• Download http://www.hipersoft.rice.edu/hpctoolkit/HPCToolkitRoot.tgz
• Unpack the starter tarball to create HPCToolkitRoot directory
• cd HPCToolkitRoot
• ./bin/updatesrc—obtains necessary sources from our CVS repository
• make -f Makefile.quick PAPI=<path_to_PAPI3_install>—Configures, builds and installs to HPCToolkit-XXX-Linux
– XXX is your processor type
— This self-contained installation can be moved anywhere
• Put the HPCToolkit-XXX-Linux/bin directory on your path
132
Other Performance Tuning Resources
• Open Source
—Oprofile http://oprofile.sourceforge.net/news/
—PAPI http://icl.cs.utk.edu/projects/papi
—Psrun http://perfsuite.ncsa.uiuc.edu
—Brink/Abyss http://www.eg.bucknell.edu/~bsprunt (P4+Linux)
• Commercial
—DCPI http://h30097.www3.hp.com/dcpi (Alpha+Tru64)
—Vtune (Intel processors+{Linux,Windows})
– http://www.developer.intel.com/software/products/vtune
—HP (HPUX only)
– Caliper http://www.hp.com/go/hpcaliper (Itanium+HP-UX)
– Prospect http://www.hp.com/go/prospect
—CHUD http://developer.apple.com/tools/performance (Mac OS X)
—IBM: HPMCOUNT
133
Resources (Continued)
• Parallel Performance Tools
—Open Source Tools
– Jumpshot http://www-unix.mcs.anl.gov/perfvis/software/viewers
– TAU http://www.cs.uoregon.edu/research/paracomp/tau/tautools
– SvPablo http://www-pablo.cs.uiuc.edu
—Commercial Tools
– Vampir http://www.pallas.com/e/products/vampir
• Tutorials/Compendia/Organizations
— NCSA Performance expedition http://www-unix.mcs.anl.gov/perf-expedition/tools.html
—SciDAC PERC http://perc.nersc.gov/main.htm
—Parallel Tools Consortium http://www.ptools.org/
• Manufacturers’ Optimization Guides/Manuals
—http://www.intel.com/design/pentium4/manuals/248966.htm
—http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF
— http://www.intel.com/design/itanium/downloads/245474.htm
134
Wrap Up
• Future Directions
• How to Contribute: Join the team!—There are many potential enhancements.
—Many hands make light work.
• Feedback
135
The End