hpc workload performance tuning with ibm xl...
Embed Size (px)
TRANSCRIPT

© 2013 IBM Corporation
HPC Workload Performance Tuning with IBM XL Compilers and Libraries
SPXXL/Scicomp 2013 Summer Meeting
Yaoqing Gao [email protected]
IBM Toronto Lab

© 2013 IBM Corporation
IBM | Software Group | Rational
Disclaimer © Copyright IBM Corporation 2013. The information contained in these materials is provided for
informational purposes only, and is provided AS IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, these materials. Nothing contained in these materials is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software. References in these materials to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or capabilities referenced in these materials may change at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way. IBM, the IBM logo, Rational, the Rational logo, Telelogic, the Telelogic logo, and other IBM products and services are trademarks of the International Business Machines Corporation, in the United States, other countries or both. Other company, product, or service names may be trademarks or service marks of others.
Please Note:
– IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion.
– Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.
– The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
May 2013 HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries 2

© 2013 IBM Corporation
IBM | Software Group | Rational
Goals § What are frequently used compiler option sets
§ What are frequently used compiler directives/pragmas
§ How to detect program hot spots and performance bottlenecks with XL compilers and performance tools
§ How to write compiler-friendly code for better performance
§ How to do performance tuning with XL compilers
§ How to do POWER7 optimization
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013 3

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
Agenda
§ Overview of XL Compilers § Performance Controls
– Compiler options – Directives and pragmas
§ Hot spot and bottleneck detection § Program tuning for performance § POWER7 optimization § Documentation § Q&A
4

© 2013 IBM Corporation
IBM | Software Group | Rational
Session 1: Overview of XL Compilers
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013 5

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
Overview of XL Compiler Family § Similar compilation technology used to implement C, C++ and
Fortran Compilers § Target AIX, Linux on Power, zOS (C/C++ only), BlueGene, Cell § Language standard compliance
– Language standard compliance – C99 Standard compliance – C++98 and subsequent TRs, Selected C++11 features – Fortran 2003 Standard compliance – OpenMP implementation for parallel programming
§ Fully backward compatible with objects compiled with older compilers
– Supports mix-and-match of objects generated with different compilers and optimization levels
– Backward compatibility through option control in some rare situations: • C++ name mangling, OpenMP TLS, etc
§ GCC affinity – Partial source and full binary compatibility with gcc
6

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
Overview of XL Compiler Family
§ Platform exploitation – qarch: ISA exploitation – qtune: skew performance tuning for specific processor, including
tune=balanced – Large portfolio of compiler builtins and performance annotations
§ Advanced optimization capabilities – Five distinct optimization packages – Aggressive loop analysis and transformations (unimodular and polyhedral framework)
– Whole program optimization – SIMD code generation and Vectorization exploitation – Parallelization (automatic and user-driven through OpenMP)
– Profile-driven optimization 7

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
IBM XL Compiler architecture Basic compilation environment
§ Used at lower optimization levels
§ Focus on fast compilation noopt -O2
§ More aggressive optimization, with limited impact on compilation time
-O3 -qnohot Implies -qnostrict, which may affect program behavior (mainly precision of floating-point operations)
§ Optionally generate an assembly listing file
C FE C++ FE Fortran FE
xl*code
Source file
object file source.lst
8

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
IBM XL Compiler architecture Advanced compilation environment
§ Focus on runtime performance, at the expense of compilation time
– Aggressive loop transformations
– More precise dataflow analysis
§ Triggered by several compiler flags
-O3 -qhot -qsmp
§ Multiple levels of aggressiveness for loop transformations
-qhot=level=0 (default at -O3) -qhot=level=1 (default at -qhot) -qhot=level=2
§ Can be combined with -qstrict
C FE C++ FE Fortran FE
ipa
Source file
object file
xl*code
source.lst
9

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
IBM XL Compiler architecture Whole-program analysis environment - compile phase
§ Collect high-level program representation in preparation for link-time whole program optimization
§ Triggered by -qipa Implied by -O4, -O5, -qpdf1/-
qpdf2 Identical behavior at all -qipa
levels
§ Can be used independently of -qhot
§ Output is composite object file – Includes regular object file
and intermediate representation
– Allows linking the object file with or without link-time optimization
– Skip generation of regular object using -qipa=noobject
C FE C++ FE Fortran FE
ipa
Source file
extended object file
xl*code
object file source.lst
10

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
extended object file
system library object file
IBM XL Compiler architecture Whole-program analysis environment - link phase
§ Intercept the system linker and re-optimize whole program -qipa=level=0 (default with qpdf) -qipa=level=1 (default with qipa) -qipa=level=2
§ Must use the compiler invocation to link the program, with -qipa – Do not use ld directly
§ Flexible handling of extended objects – Can be placed in archives – Accepts combination of regular and
extended object files
§ Whole program assembly listing – Default name a.lst
§ Under -qpdf1/-qpdf2 the compiler collects and uses runtime profile information about the program
extended object file
ipa
system library object file
xl*code
final object file
system linker
executable
a.lst profile data
file
11

© 2013 IBM Corporation
IBM | Software Group | Rational
Session 2: Performance Controls with XL Compiler options, flags, directives and pragmas
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013 12

© 2013 IBM Corporation
IBM | Software Group | Rational
Performance Compiler Options § Optimization levels:
– -O0-5
§ High order transformations: – -qhot
§ Interprocedural analysis: – -qipa or –O4 or –O5
§ Profile directed feedback optimization: – -qpdf1/-qpdf2
§ Target machine specification: – -qarch=pwr7 –qtune=pwr7 –qcache=auto
§ Floating point options: – -qstrict=subopt, -qfloat=subopt
§ Program behavior options: – -qassert=subopt, etc.
§ Diagnostic options: – -qlist, -qreport, -qlistfmt, etc HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013 13

© 2013 IBM Corporation
IBM | Software Group | Rational
Summary of Optimization Levels § Noopt,-O0
– Quick local optimizations – Keep the semantics of a program (-qstrict)
§ -O2 – Optimizations for the best combination of compile speed and runtime performance – Keep the semantics of a program (-qstrict)
§ -O3 – Equivalent to –O3 –qhot=level=0 –qnostrict – Focus on runtime performance at the expense of compilation time: loop transformations, dataflow
analysis – May alter the semantics of a program (-qnostrict)
§ -O3 –qhot – Equivalent to –O3 –qhot=level=1 –qnostrict – Perform aggressive loop transformations and dataflow analysis at the expense of compilation time
§ -O4 – Equivalent to –O3 –qhot=level=1 –qipa=level=1 -qnostrict – Aggressive optimization: whole program optimization; aggressive dataflow analysis and loop
transformations
§ -O5 – Equivalent to –O3 –qhot=level=1 –qipa=level=2 -qnostrict – More aggressive optimization: more aggressive whole program optimization, more precise
dataflow analysis and loop transformations
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013 14

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
HPC Performance Tuning with XL Compilers
-O3 –qarch=pwr7 or
–O3 –qhot –qarch=pwr7
with –qnostrict or –qstrict
Profiling for hot spot detection:
§ Compiler instrumentation: –qpdf1=level={1,2} / pdf2
§ -pg for gprof/xprofiler; -qlist for tprof
§ User-provided profile functions: -qfunctrace
SIMDization:
§ Automatic SIMDization: –O3 or above with –qsimd
§ User explicit SIMD program: -qaltivec
Loop transformations:
§ Loop transformations: –O3 or above
Whole program optimizations:
§ -O4 or –O5 for inter-procedural optimization: inlining, code partition, data reorganization
Parallelization:
§ User explicit parallelization only: -qsmp=omp
§ Auto parallelization: -qsmp (-qsmp=auto)
XML Transformation Reports
§ -qlistfmt=xml=all
Frequently used
option set
15

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
Pragmas (C/ C++) § #pragma align
§ #pragma alloca (C only)
§ #pragma block_loop
§ #pragma chars
§ #pragma comment
§ #pragma define, #pragma instantiate (C++ only)
§ #pragma disjoint
§ #pragma do_not_instantiate (C++ only)
§ #pragma enum
§ #pragma execution_frequency
§ #pragma expected_value
§ #pragma fini (C only)
§ #pragma hashome (C++ only)
§ #pragma ibm snapshot
§ #pragma implementation (C++ only)
§ #pragma info
§ #pragma init (C only)
§ #pragma ishome (C++ only)
§ #pragma isolated_call
§ #pragma langlvl (C only)
§ #pragma leaves
§ #pragma loopid
§ #pragma map
§ #pragma mc_func
§ #pragma namemangling (C++ only)
§ #pragma namemanglingrule (C++ only)
§ #pragma nosimd
§ #pragma novector
§ #pragma object_model (C++ only)
§ #pragma operator_new (C++ only)
§ #pragma options
§ #pragma option_override
§ #pragma pack
§ #pragma pass_by_value (C++ only)
§ #pragma priority (C++ only)
§ #pragma reachable
§ #pragma reg_killed_by
§ #pragma report (C++ only)
§ #pragma simd_level
§ #pragma STDC cx_limited_range
§ #pragma stream_unroll
§ #pragma strings
§ #pragma unroll
§ #pragma unrollandfuse
§ #pragma weak
§ #pragma ibm independent_loop
16
Parallel processing
§ #pragma ibm critical (C only) § #pragma ibm independent_calls (C only) § #pragma ibm iterations (C only) § #pragma ibm parallel_loop (C only) § #pragma ibm permutation (C only) § #pragma ibm schedule (C only) § #pragma ibm sequential_loop (C only) § #pragma omp atomic § #pragma omp parallel § #pragma omp for § #pragma omp ordered § #pragma omp parallel for § #pragma omp section, #pragma omp sections § #pragma omp parallel sections § #pragma omp single § #pragma omp master § #pragma omp critical § #pragma omp barrier § #pragma omp flush § #pragma omp threadprivate § #pragma omp task § #pragma omp taskwait
Frequently used
16

© 2013 IBM Corporation
IBM | Software Group | Rational
Directives (Fortran) § ALIGN § ASSERT
§ BLOCK_LOOP
§ CNCALL
§ COLLAPSE
§ EJECT
§ EXECUTION_FREQUENCY
§ EXPECTED_VALUE
§ INCLUDE
§ INDEPENDENT
§ #LINE
§ LOOPID
§ MEM_DELAY
§ NOSIMD
§ NEW
§ NOVECTOR
§ PERMUTATION
§ SIMD_LEVEL
§ SNAPSHOT
§ SOURCEFORM
§ STREAM_UNROLL
§ SUBSCRIPTORDER
§ UNROLL
§ UNROLL_AND_FUSE
§ Hardware-specific directives
• CACHE_ZERO • DCBF • DCBFL • DCBST • EIEIO • ISYNC • LIGHT_SYNC • PREFETCH • PROTECTED STREAM
17
Frequently used
May 2013 HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
Frequently Used Pragmas/Directives/Attributes
§ Dependency – #pragma ibm independent_loop – #pragma disjoint
§ Frequency – #pragma execution_frequency – #pragma expected_value – #pragma ibm min_iterations – #pragma ibm max_iterations – #pragma ibm iterations
§ Alignment – __alginx – __attribute__ {(aligned(16))}
§ SIMDization – #pragma nosimd – #pragma simd_level
§ Unroll – #pragma unroll
18

© 2013 IBM Corporation
IBM | Software Group | Rational
Restricted Pointer Enhancements in XL C/C++ V12.1 Restricted parameter pointer: void function(float * restrict a1. float *restrict a2} {
for ( int i=0; i<n; i++) {
a1[i] = a2[i];
}
}
Block scope restricted pointer: float * restrict a1 = A1; float * restrict a2 = A2;
for ( int i=0; i<n; i++) {
a1[i] = a2[i];
}
Multiple level restricted pointer:
float * restrict *restrict * restrict aa1 = AA1;
float * restrict *restrict * restrict bb1 = BB1;
for (int k=0; k<n3; k++) {
for (int j=0; j<n2; j++) {
for (int i=0; i< n1; i++) {
aa1[i][j][k] = bb1[i][j][k];
} } }
• Determine if two different pointers are being used to reference different objects
• Refine aliasing to expose optimization opportunities;
May 2013 HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries 19

© 2013 IBM Corporation
IBM | Software Group | Rational
module mod
real :: x(500), y(500), z(500)
!ibm* align(16, x, y, z)
end module
subroutine partial_sum(m, n)
use mod
integer, intent(in) :: m, n
!ibm* assert(itercnt(40))
ibm* assert(itercnt(120))
do i=m, n
z(i) = x(i) + 1.37*y(i)
enddo
end subroutine
IBM align and iteration count directives
20 HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
Guide the compiler to align arrays x, y, and z to 16 bytes
• Avoid cache conflicts and false sharing
• Expose SIMDization opportunity
The frequently used loop iteration counts are 40 and 120
• Guide the compilers for profitability analysis
• Expose loop optimization opportunities

© 2013 IBM Corporation
IBM | Software Group | Rational
Summary
§ Frequently used compiler option sets – -O3 –qarch=pwr7 –qtune=pwr7 – -O3 –qhot –qarch=pwr7 –qtune=pwr7
§ Frequently used compiler directives/pragmas – Dependency and alias analysis – Alignment – Frequency – Program behavior – Transformations
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013 21

© 2013 IBM Corporation
IBM | Software Group | Rational
Session 3: Program hot spot and performance bottleneck detection with XL Compilers and performance tools
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013 22

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
Profiling with XL Compilers and gprof/xprof
§ Compiler option -pg for gprof/xprof – gmount.out will be generated at run-time
§ gprof output (gprof cba gmount.out > gprof.out ) % cumulative self self total
time seconds seconds calls us/call us/call name
92.84 18.18 18.18 213340 85.21 85.21 .bb_it
4.81 19.12 0.94 410042542 0.00 0.00 .__gmon_start__
1.69 19.45 0.33 .main
0.61 19.57 0.12 106669 1.13 171.54 ._bit_array
0.00 19.57 0.00 213336 0.00 0.01 .test_it
0.00 19.57 0.00 2 0.00 0.00 ._bit_array_mn 23

© 2013 IBM Corporation
IBM | Software Group | Rational
Profiling using functrace feature of XL Compilers
extern "C“ void __func_trace_enter (
const char *function_name,
const char *file_name,
int line_number,
void** const user_data){…}
extern "C“ void __func_trace_exit (
const char *function_name,
const char *file_name,
int line_number,
void** const user_data){…}
extern "C“ void __func_trace_catch(
const char *function_name,
const char *file_name,
int line_number,
void** const user_data){…}
void my_function(…)
{ /* entry point */
…..
/* exit point
}
-qfunctrace
-qfunctrace+<list> trace a list of function names,
class names and name scopes
-qfunctrace-<list> do not trace a list of function names,
class names and name scopes
-qoptfile=option_file specify the name of a file
containing –qfunctrace
User-defined tracing functions User functions
Tracing function insertion
by compiler under -qfunctrace
May 2013 HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries 24

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
Profile Directed Feedback Optimization with XL Compilers § Step 1: compiler instrumentation with the option –qpdf1 to generate an instrumented executable
§ Step 2: profile information gathering with the representative input data
§ Step 3: profile directed feedback optimization by re-compiling with –qpdf2 to generate an optimized executable
Call Counter Information Region # 1
Region Execution Count 1
Call Coverage 32 / 49
Call Counters
Call Name Call Execution Count Line #
smtctl_ 0 79
is_smt_on_ 1 55
jbind_ 1 109
jbind_ 0 108
aff_ 1 38
Region # 1
Region Execution Count 5
Block Coverage 81 / 81
Block Counters
Block Index
Block Execution Count
Start Line #
End Line #
3 5 1 33
4 5 33 34
Basic Block Counter Information
……
25

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
SOURCE CODE
COMPILE AND LINK WITH –qpdf1 Static analysis Profile based refinement
COMPILE AND LINK WITH –qpdf2 Profile directed optimizations
INSTRUMENTED APPLICATION
OPTIMIZED APPLICATION
PROFILE DATA
SAMPLE INPUTS
SAMPLE INPUTS
Multiple-pass Dynamic Profiling Infrastructure
Hardware and software constraints
Multiple sample runs for different hardware performance events
Profile based instrumentation refinement
26

© 2013 IBM Corporation
IBM | Software Group | Rational
Profiling with other performance tools § HPC/HPCS toolkit
– e.g. , hpccount -g <group[ ,<group>,...]> <program>
§ Oprofile on LINUX and tprof on AIX
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
Subroutine Ticks % Source Address Bytes
========== ===== ====== ====== ======= =====
.match 522 1.80 scanner.c 34b0 2100
.train_match 61 0.21 scanner.c 5950 30e0
.simtest2 12 0.04 scanner.c 55b0 3a0
4 566| 0004B0 stfd D8C40010 1 STFL (*)aggr#2.W.rns88.(gr4,16)=fp6
- 567| 0004B4 fmadd FC0601BA 1 FMA fp0=fp0,fp6,fp6,fcr
- 566| 0004B8 lwz 80670004 1 L4A gr3=(*)aggr#2.I.rns21.(gr7,4)
- 566| 0004BC addi 38840040 1 AI gr4=gr4,64
2 566| 0004C0 fmadd FD1D413A 1 FMA fp8=fp8,fp29,fp4,fcr
- 566| 0004C4 lwz 80A70014 1 L4A gr5=(*)aggr#2.I.rns21.(gr7,20)
- 566| 0004C8 lfd C8860000 1 LFL fp4=(*)double.rns81.(gr6,0)
- 566| 0004CC lwz 80C70008 1 L4A gr6=(*)aggr#2.I.rns21.(gr7,8)
23 566| 0004D0 lfd CBC90008 1 LFL fp30=(*)aggr#2.U.rns89.(gr9,8)
Annotated assembly
cpu distribution
27
ticks #line Instruction

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
Reading the listing file 79| 0054FC subf 7C641850 1 S gr3=gr3,gr4 79| 005500 sradi 7C630E74 1 SRA8C gr3=gr3,1 79| 005504 addze 7C630194 1 79| 005508 std F86102A8 1 ST8 @__17212.nat.3(gr1,680)=gr3 79| 00550C std F86102B0 1 ST8 __17212(gr1,688)=gr3 79| CL.592: 81| 005558 addi 390000DE 1 LI gr8=222 81| 00555C bl 4BFFAAA5 1 CALL gr3=ab_resize,6,@PALI_SHADOW_CONST.rns909.,gr3-gr8,ab_resize",gr1,cr[01567]",gr0",gr4"-gr12",fp0"-fp13“ 81| 005560 ori 60000000 1 81| 005564 std F86102D8 1 ST8 p(gr1,728)=gr3 53| 005568 std F86102E0 1 ST8 __17475(gr1,736)=gr3 53| 00556C b 4800000C 1 B CL.594,-1
Line no
XIL Meaningless cycle count from power1
Actual binary
Machine opcode
Code offset
28

© 2013 IBM Corporation
IBM | Software Group | Rational
Tips for Hot Spot and Bottleneck Detection § Identify hot spots by gathering the profile information such as timing, call
frequency, block frequency, frequently used values, etc., using – gprof/oprofile – Combination of user provided trace functions and compiler instrumentation with -
qfunctrace – Compiler instrumentation with –qpdf1/pdf2 for profile directed feedback(PDF) – HPC toolkit
§ Identify if a workload is memory latency/bandwidth intensive; computation intensive; or combination of both, by gathering performance counter information about
– CPI breakdown – FPU/FXU – Cache misses – Branch mispredictions – LSU – …
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013 29

© 2013 IBM Corporation
IBM | Software Group | Rational
Session 4: Performance tuning with XL compilers
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013 30

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
XML Compiler Transformation Reports § Generate compilation reports consumable by other tools
– Enable better visualization and analysis of compiler information – Help users do manual performance tuning – Help automatic performance tuning through performance tool
integration
§ Unified report from all compiler subcomponents and analysis – Compiler options – Pseudo-sources – Compiler transformations, including missed opportunities
§ Consistent support among Fortran, C/C++ § Controlled under option
-qlistfmt=[xml | html]=inlines generates inlining information -qlistfmt=[xml | html]=transform generates loop transformation information -qlistfmt=[xml | html]=data generates data reorganization information -qlistfmt=[xml | html]=pdf generates dynamic profiling information -qlistfmt=[xml | html]=all turns on all optimization content -qlistfmt=[xml | html]=none turns off all optimization content
Example
new
31

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
Source location
Loop information Loop Table Loop Index
Start Line #
End Line #
Parent Loop Index Nest Level Minimum
Cost Maximum
Cost Iteration
Count Attributes
1 203 1 19630 19630 75 (array)
• well behaved • bump normalized • lower bound
normalized
2 188 1 20413 20413 149 (array)
• perfect nest • well behaved • bump normalized • guarded • lower bound
normalized
3 141 1 13300 13300 100 (default)
• perfect nest • well behaved • bump normalized • guarded • lower bound
normalized
4 203 1 19630 19630 5 (PDF)
• residual • well behaved • bump normalized • guarded • lower bound
normalized
Loop iteration count based on static analysis or dynamic profiling
32

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
Loop Transformation Reports
Seq # Type Phase Region # Line # Loop Index
Description Attributes
2 LoopVector (success)
High Level Optimizer 4 217 1
Loop vectorization was performed.
not available
3 LoopFusion (success)
High Level Optimizer 4 108 3 Loops were
fused.
• Loop Line Number: 108 • Loop Line Number: 206
4 LoopVectorVersion (success)
High Level Optimizer 4 108 3
Vector versioning was performed.
not available
20 ModuloSchedule (success)
Low Level Optimizer 12 3499 26
Loop was modulo scheduled.
• Initiation Interval: 12
33

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
file.c foo (float *p, float *q, float *r, int n) { for (int i=0; i< n; i++) { p[i] = p[i] + q[i]*r[i]; } }
Performance Tuning with Compiler Transformation Reports -qlistfmt=xml=all
file.c foo (float * restrict p, float * restrict q, float * restrict r, int n) { for (int i=0; i< n; i++) { p[i] = p[i] + q[i]*r[i]; } }
file.xml Loop cannot be automatically parallelized. A dependency is carried by variable aliasing
Original source file modified source file
file.xml Loop was automatically parallelized Loop was modulo scheduled
Tuning
34

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
Explicit SIMD programming for POWER7 Enabled under -qaltivec § Successor to altivec programming extensions on POWER6/PPC970
– Altivec data types 16-byte vectors vector char 16 elements vector short 8 elements vector pixel 8 elements vector int 4 elements vector float 4 elements
– VSX Altivec extensions 16-byte vectors vector double 2 elements vector long long 2 elements
§ Altivec built-in functions extended to new data types vec_add(vector double, vector double), vec_sub(vector long long, vector long long),
§ New vector operations: vec_mul, vec_div, … § Unaligned load and store operations
– Altivec truncating loads/stores still available: vec_ld, vec_st – New non-truncating loads/stores: vec_xld2, vec_xstd2
35

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
VSX Example
Compiler Listing:
4| CL.5:
0| 000050 addi 38840010 1 AI gr4=gr4,16
7| 000054 xvsqrtdp F060032C 1 VDFSQRT vs3=vs0,fcr 7| 000058 xvcpsgnd F0021780 1 LRVS vs0=vs2
7| 00005C xvmuldp F0442380 1 VDFM vs2=vs4,vs4,fcr
6| 000060 lxvd2x 7C840698 1 VLQD vs4=y[]@gr612->.y(gr4,gr0,0)
7| 000064 xvmaddad F0010B08 1 VDFMA vs0=vs0,vs1,vs1,fcr
5| 000068 lxvd2x 7C250698 1 VLQD vs1=x[]@gr609->.x(gr5,gr0,0) 0| 00006C addi 38A50010 1 AI gr5=gr5,16
8| 000070 stxvd2x 7C630798 1 VSTQD z[]@gr621->.z(gr3,gr0,0)=vs3
0| 000074 addi 38630010 1 AI gr3=gr3,16
0| 000078 bc 4320FFD8 1 BCT ctr=CL.5,,100,0
C: #include <math.h> #include <altivec.h> extern double x[1000],y[1000],z[1000]; void sub(){ int i; vector double x2,y2,z2; for(i=0;i<1000;i+=2) { x2=vec_xld2(0,&x[i]); y2=vec_xld2(0,&y[i]); z2=vec_sqrt(vec_add( vec_mul(x2,x2),vec_mul(y2,y2))); vec_xstd2(z2,0,&z[i]); }}
Fortran: subroutine sub(x,y,z) real*8 x(*),y(*),z(*) vector(real(8)) x2,y2,z2 do i=1,1000,2 x2=vec_xld2(0,x(i)) y2=vec_xld2(0,y(i)) z2=vec_sqrt(vec_add( vec_mul(x2,x2),vec_mul(y2,y2))); call vec_xstd2(z2,0,z(i)) enddo return end
xlc –O3 –qarch=pwr7 –qvecnvol –qaltivec py.c xlf90 –O3 –qarch=pwr7 –qvecnvol py.f
36

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013 37
Automatic SIMDization § Automatic SIMDization for VMX and VSX
– Supports data types of INTEGER, UNSIGNED, REAL and COMPLEX
§ Features: – Basic block level SIMDizaton – Loop level aggregation – Data conversion, reduction – Loop with limited control flow – Automatic SIMDization with -qstrict (VSX) and -qnostrict – Support of unaligned vector memory accesses (VSX) – Automatic SIMDization enabled at -O3 -qsimd
37

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
AutoSIMD: VSX example
Compiler Listing:
0| CL.115:
9| VLQD [email protected][].rns2.0(gr11,gr12,0)
9| VLQD [email protected][].rns2.0(gr11,gr31,0)
9| VLQD [email protected][].rns3.1(gr8,gr0,0)
9| VLQD [email protected][].rns3.1(gr8,gr23,0)
9| VDFM vs4=vs4,vs4,fcr
9| VDFM vs5=vs5,vs5,fcr
9| LRVS vs8=vs2
9| VDFMA vs8=vs8,vs9,vs9,fcr
9| LRVS vs9=vs3
9| VDFMA vs9=vs9,vs10,vs10,fcr
9| VDFSQRT vs10=vs0,fcr
9| VDFSQRT vs11=vs1,fcr
9| VSTQD @V.z[].rns1.2(gr9,gr26,0)=vs7
9| VSTQD @V.z[].rns1.2(gr9,gr25,0)=vs6
0| AI gr9=gr9,64
...
0| BCT ctr=CL.115,,100,0
subroutine sub(x,y,z) integer i real*8 x(*),y(*),z(*) CALL ALIGNX(16, x(1)) CALL ALIGNX(16, y(1)) CALL ALIGNX(16, z(1)) do i=1,1000 z(i)=sqrt(x(i)*x(i)+y(i)*y(i)) enddo return end
Command: xlf90 –O3 –qhot –qstrict –qarch=pwr7 –qsimd –qvecnvol py.f
38

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
SIMDization Tuning
memory accesses have non-vectorizable alignment.
§ Use __attribute__((aligned(n)) to set data alignment
§ Use __alignx(16, a) to indicate the data alignment to the compiler
§ Use -qassert=refalign if all references are naturally aligned
§ Use array references instead of pointers where possible
data dependence prevents SIMD vectorization
§ Use fewer pointers when possible
§ Use #pragma independent if it has no loop carried dependency
§ Use #pragma disjoint (*a, *b) if a and b are disjoint
§ Use restrict keyword or compiler option –qrestrict
User actions Transformation report
Loop was SIMD vectorized
§ Use #pragma simd_level(10) to force the compiler to do SIMDization It is not profitable
to vectorize
39

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
SIMDization Tuning
memory accesses have non-vectorizable strides
§ Loop interchange for stride-one accesses, when possible
§ Data layout reshape for stride-one accesses
§ Higher optimization to propagate compile known stride information
§ Stride versioning
§ Do statement splitting and loop splitting
User actions Transformation report
either operation or data type is not suitable for SIMD vectorization.
§ Convert while-loops into do-loops when possible
§ Limited use of control flow in a loop
§ Use MIN, MAX instead of if-then-else
§ Eliminate function calls in a loop through inlining
loop structure prevents SIMD vectorization
40

© 2013 IBM Corporation
IBM | Software Group | Rational
MASS and MASSV Libraries § Libraries of mathematical routines tuned for optimal performance on
various POWER architectures – General implementation tuned for POWER – Specific implementations tuned for specific POWER processors (pwr5, pwr6, pwr7)
§ Compiler will automatically insert calls to MASS/MASSV routines at higher optimization levels
– Users can add explicit calls to the library
41 HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
for (i=0;i<n;i++) {
b[i]=sqrt(a[i]);
} __vsqrt_P7(b,a,n);
Loop vectorization was performed.
Transformation report

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
MASS enhancements and Auto-vectorization on POWER7 § MASS enhancements for POWER7
– POWER7 vector MASS library (libmassvp7.a) • Internally exploit VSX instructions
SP: average speedup of 1.99 vs Power5 MASSV DP: average speedup of 1.27 vs Power5 MASSV
– POWER7 SIMD MASS library (libmass_simdp7.a) • Tuned math routines operating on vector data types • Over 35 frequently used mathematical functions • Both simple and double precision • To be used in conjunction with explicit SIMD programming
§ Auto-vectorization at optimization level –O3 or above
§ -qstrict=vectorprecision to maintain precision over all loop iterations
42

© 2013 IBM Corporation
IBM | Software Group | Rational
ESSL and PESSL Libraries § ESSL (Engineering and Scientific Subroutine Library) is a collection of
high performance mathematical subroutines providing a wide range of functions for many common scientific and engineering applications.
§ ESSL Symmetric Multi-Processing (SMP) Library provides thread-safe versions of the ESSL subroutines for use on all SMP processors. In addition, some of these subroutines support the shared memory parallel processing programming model.
§ Parallel ESSL is a scalable mathematical subroutine library for standalone clusters or clusters of servers connected via a switch and running AIX and Linux. Parallel ESSL supports the Single Program Multiple Data (SPMD) programming model using the Message Passing Interface (MPI) library.
43 HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
Software-controlled data prefetching for POWER7
§ Software control over POWER7 prefetch engine, supporting up to 12 data streams
§ Fine grained software controlled data prefetch including stream type, stream length, stream stride, prefetch depth at optimization level -O3 -qhot or above
– More aggressive exploitation under option -qprefetch=aggressive
§ Global analysis for coarse grained prefetch engine control at optimization level -O5
44

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
Built-in functions for POWER7 data prefetching and cache control
§ Transient cache line touch void __dcbtt(void *address); void __dcbtstt (void * address);
§ Partial cache line touch void __partial_dcbt(void *address);
§ Stride-N stream prefetch void __protected_stream_stride(offset, stride, stream_ID);
§ Transient stream prefetch: void __transient_protected_stream_count_depth(unit_count, depth, stream_ID)
void __transient_unlimited_protected_stream_depth(prefetch_depth, stream_ID)
45

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
Example of POWER7 data prefetching
for (i=0; i< n; i++) { a[i] = b[i] + ...;
}
__protected_store_stream_set(FORWARD, &a, 11);
__protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 11);
__protected_stream_set(FORWARD, &b, 0);
__transient_protected_stream_count_depth(n*sizeof(double)/128, DEEPER, 0);
__eieio();
__protected_stream_go();
Store stream prefetch for array a;
transient stream prefetch for array b Stream id
Stream length
Stream direction
Prefetch depth
Start stream prefetch
46

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
Loop Optimization
§ Traditional unimodular loop transformations for prefect regular loop nests
– Compiler loop transformations including loop fusion, loop distribution, unroll-and-jam, loop tiling, loop rerolling, loop collapsing, loop unrolling
– Compiler pragmas: unroll, stream_unroll, block_loop, unrollandfuse – Under the optimization level O3, O3 –qhot or above
§ Polyhedral framework for any loop nests – Provide abstract representation for aggressive analysis and
complicated transformations of arbitrary loop nests and shapes under option control –qhot=level=2 with -qsmp
• Loop skewing, loop tiling for triangular loop shapes – Perform exact dependence testing through unified dependence
formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks under at all hot levels (-O3 –qhot or above)
47

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
Polyhedral Loop Transformation Examples
§ Dependence analysis do i = 1,N do j = 1,i do k = i+1,N c(k) = c(j) + b(j,k)
§ Enable interchange of j and k loops to improve access locality for b
– Identifies independence of memory accesses to c
§ Affects all optimization levels that include -qhot
§ Loop transformations
§ Tiling of triangular matrix multiplication do i = 1,N do j = i+1,N do k = i+1,N c(j,i)+=a(k,i)*b(j,k)
§ Also allows transformation of imperfect loop nests
– Intervening code between loops
§ Only available at -qhot=level=2 48

© 2013 IBM Corporation
IBM | Software Group | Rational
Data Reorganization § Data reorganization transformations to reshape data layout to reduce memory
latency, enhance cache utilization and memory bandwidth.
§ Data reorganization transformations enabled at O5 – Data splitting – Data interleaving – Data transposing – Data merging – Data grouping – Data compressing – Data padding
§ -qlistfmt=xml=data (-qlistfmt=html=data) generates data reorganization transformation information in xml (html)
May 2013 Performance Programming with IBM XL Compilers and Libraries
Data Reorganization Report Seq # Type Phase Data
Name Category Region # Line # Description
1 ArraySplitting High Level Optimizer iplus 9
An array of a large aggregated data-type was split into multiple arrays of smaller data-types.
27 ArrayCoalescing High Level Optimizer net Global variables were
aggregated.
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries 49

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
S[0].F0 S[0].F1 S[0].F2 S[0].F3
S[1].F0 S[1].F1 S[1].F2 S[1].F3
S[2].F0 S[2].F1 S[2].F2 S[2].F3
S[3].F0 S[3].F1 S[3].F2 S[3].F3
… … … …
F0[0]
F1[0]
F2[0]
F3[0]
F0[1]
F1[1]
F2[1]
F3[1]
F0[2]
F1[2]
F2[2]
F3[2]
F0[3]
F1[3]
F2[3]
F3[3]
…
…
…
…
A[0]
A[1]
A[2]
A[3]
A[0][2] A[0][3] A[0][1] A[0][0]
A[1][2] A[1][3] A[1][1] A[1][0]
A[2][2] A[2][3] A[2][1] A[2][0]
A[3][2] A[3][3] A[3][1] A[3][0]
…
…
…
…
A[3] A[3][2] A[3][3] A[3][1] A[3][0] …
A[0][0] A[0][1] A[0][2] A[0][3]
A[1][0] A[1][1] A[1][2] A[1][3]
A[2][0] A[2][1] A[2][2] A[2][3]
A[3][0] A[3][1] A[3][2] A[3][3]
… … … …
…
…
…
…
…
A’[0][0]
A’[1][0]
A’[2][0]
A’[3][0]
A’[0][1]
A’[1][1]
A’[2][1]
A’[3][1]
A’[0][2]
A’[1][2]
A’[2][2]
A’[3][2]
A’[0][3]
A’[1][3]
A’[2][3]
A’[3][3]
…
…
…
…
… … … … …
Array splitting
Array merging Array transposing
Data locality, cache utilization 50

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
User Explicit Parallelization with OpenMP § Full OpenMP 3.0 implementation on C, C++ and Fortran
– Full OpenMP task parallelization – Privatization of Fortran descriptor-based arrays
§ Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default
– Bypass expensive pthread key mechanism – pthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility
§ Improved interaction between OpenMP and automatic SIMD
51

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
Automatic parallelization § Improvements to automatic parallelization
– More effective array data flow analysis for array privatization
– Automatic privatization of descriptor-based Fortran arrays – Runtime-dependence testing
§ SMP runtime improvements – Leverage of TLS on SMP runtime implementation
§ Enable automatic parallelization with the compiler option -qsmp
52

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
XLSMPOPTS Environment Variable for Runtime Tuning
§ XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs
§ Some suboptions of interest: – spins and yields to define the behavior of idle threads – Thread binding using startproc and stride suboptions – new bind suboption on AIX, bind=SDL=<start resource>,<number of resources>,<stride>;bindlist=SDL=i0,i1,…,ix
– schedule to define the runtime scheduling algorithm used for parallel loops (static, dynamic, guided) • Note that the default schedule has changed from
runtime to auto in V11/V13
53

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
Inlining § Single control knob to enable inlining
– Simplifies inlining control for programmer -qinline=level=X X=0..10 (default 5)
– Consistent across all languages and optimization levels – Previous mechanisms still available
§ User inline control with –qinline{+|-}<function_name>
§ inlining performed by – C++ FE
• Small methods • inline and always_inline (explain when to use, impacts compile time) • Some exception-handling capabilities • Within a single source file
– High-level Optimizer • Larger methods • Will inline within a source file (-O3, -qhot) and across source files (-O4, -O5, -qipa) • Inline across languages (-O4, -O5, -qipa)
– Low-level optimizer • Larger methods • Within a source file • Enabled at -O2 and higher
54

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
Control over optimizations that may affect program results -qstrict suboptions
§ Aggressive optimization may affect the results of the program – Precision of floating-point computation – Handling of special cases of IEEE FP standard (INF, NAN, etc) – Use of alternate math libraries
§ -qstrict guarantees identical result to noopt, at the expense of optimization
– Suboptions allow fine-grain control over this guarantee – Examples:
-qstrict=precision Strict FP precision -qstrict=exceptions Strict FP exceptions -qstrict=ieeefp Strict IEEE FP implementation -qstrict=nans Strict general and computation of NANs -qstrict=order Do not modify evaluation order -qstrict=vectorprecision Maintain precision over all loop iterations
§ Can be combined: -qstrict=precision:nonans 55

© 2013 IBM Corporation
IBM | Software Group | Rational
Tips for Compiler Friendly Programming § Obey all language aliasing rules (avoid –qalias=noansi in C/C++)
§ Avoid unnecessary use of globals and pointers; use restrict keyword or compiler directives/pragmas to help the compiler do dependence and alias analysis
§ Use “const” for globals, parameters and functions whenever possible
§ Group frequently used functions into the same file (compilation unit) to expose compiler optimization opportunity (e.g., intra compilation unit inlining, instruction cache utilization)
§ Excessive hand-optimization such as unrolling and inlining may impede the compiler
§ Keep array index expressions as simple as possible for easy dependency analysis
§ Consider using the highly tuned MASS and ESSL libraries rather than custom implementations or generic libraries
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013 56

© 2013 IBM Corporation
IBM | Software Group | Rational
Tips for POWER7 Optimizations § POWER7 exploitation
– POWER7 specific ISA exploitation under –qarch=pwr7 • Extended FP register file • 64-bit population count, bit permutation, fixed point pipelined multiply, fix point select,
divide check for software divide assistance • VMS/VSX • Stride-N stream prefetch, partial cache line touch
– Scheduling and instruction selection under –qtune=pwr7
§ Automatic SIMDization – Use simd_level(0..10) pragma to exploit aggressive SIMDization – Use align attribute to force the compiler to align static data by 16-byte; use
MALLOCALIGN=16 to force OS to align malloced data by 16-byte; use alignx directive to tell the compile the alignment.
– Limited use of control flow – Limited use of pointers. Use independent_loop directive to tell the compiler a loop has no
loop carried dependency; use either restrict keyword or disjoint pragma to tell the compiler the references do not share the same physical storage whenever possible
– Limited use of stride accesses. Expose stride-one accesses whenever possible
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013 57

© 2013 IBM Corporation
IBM | Software Group | Rational
Tips for POWER7 Optimizations § POWER7 aware loop transformations
– Loop distribution, unroll-and-jam, stream unrolling controlled by 12 streams on each core, shared by SMTs
– Loop blocking controlled by L2 cache size – Selection of SIMDization and vectorization controlled by the threshold;
use loop iteration directives to guide the compiler
§ Memory hierarchy optimization – Data prefetch
• Automatic data prefetch at O3 –qhot or above. • -qprefetch=aggressive to enable aggressive data prefetch; • DSCR setting for the default data prefetching; • Enable DCBZ insertion on POWER7 IH • Partial cache line touch
– Data reorganization enabled at O5
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013 58

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013
XL Compiler Documentation § An information centre containing the documentation for the XL Fortran V14.1 and XL C/C++ V12.1 versions is available at: http://publib.boulder.ibm.com/infocenter/comphelp/v121v141/index.jsp for AIX compilers and http://publib.boulder.ibm.com/infocenter/lnxpcomp/v121v141/index.jsp for the LINUX compilers:
• Installation Guide • Getting Started with XL C/C++ • Compiler Reference • Language Reference
§ Whitepaper “Code optimization with the IBM XL Compilers” – http://www-01.ibm.com/support/docview.wss?uid=swg27005174
§ Whitepaper “Overview of the IBM XL C/C++ and XL Fortran Compiler Family” available at:
– http://www.ibm.com/support/docview.wss?uid=swg27005175
§ Please send any comments or suggestions on this information center or about the existing C, C++ or Fortran documentation shipped with the products to [email protected].
59

© 2013 IBM Corporation
IBM | Software Group | Rational
HPC Workload Perfromance Tuning with IBM XL Compilers and Libraries May 2013 60 60