© 2014 IBM Corporation
HPC Workload Performance Tuning on POWER8 with IBM XL Compilers and Libraries
Yaoqing Gao, STSM, IBM Canada Lab [email protected]
SPXXL/Scicomp Summer Workshop 2014
© Copyright IBM Corporation 2014. The information contained in these materials is provided for informational purposes only, and is provided AS IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, these materials. Nothing contained in these materials is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software. References in these materials to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or capabilities referenced in these materials may change at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way. IBM, the IBM logo, Rational, the Rational logo, Telelogic, the Telelogic logo, and other IBM products and services are trademarks of the International Business Machines Corporation, in the United States, other countries or both. Other company, product, or service names may be trademarks or service marks of others.
Please Note:
– IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion.
– Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.
– The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
Disclaimer
Outline
§ Overview of IBM XL Compiler Family
§ Major Features in XL C/C++ V13.1 and XL Fortran V15.1
§ POWER8 Exploitation
§ Performance Tuning Tips
Overview of XL Compiler Family
• Target Linux, AIX on POWER
  – Common technology for Blue Gene/Q and z/OS (XL C/C++ only for z/OS)
• Language standard compliance
  – C99 standard compliance
  – C++98 and subsequent TRs, selected C++11 features
  – Fortran 2003 standard compliance, selected Fortran 2008 features
  – OpenMP 3.1 conformance, partial support of OpenMP 4.1
• Fully backward compatible with objects compiled with older compilers
  – Supports mix-and-match of objects generated with different compilers and optimization levels
  – Backward compatibility through option control in some rare situations: C++ name mangling, OpenMP TLS, etc.
• GCC affinity
  – Partial source and full binary compatibility with gcc
Overview of XL Compiler Family
• Platform exploitation
  – -qarch: ISA exploitation
  – -qtune: skew performance tuning for specific processor, including -qtune=balanced
  – Large portfolio of compiler builtins and performance annotations
• Advanced optimization capabilities
  – Five distinct optimization packages
  – Aggressive loop analysis and transformations
  – Whole program optimization
  – SIMD code generation and vectorization exploitation
  – Parallelization (automatic and user-driven through OpenMP)
  – Profile-driven optimization
IBM XL Compiler architecture: Basic compilation environment
• Used at lower optimization levels
• Focus on fast compilation: noopt, -O2
• More aggressive optimization, with limited impact on compilation time: -O3 -qnohot
  – Implies -qnostrict, which may affect program behavior (mainly precision of floating-point operations)
• Optionally generate an assembly listing file
[Diagram: source file → C, C++, and Fortran front ends (FE) → xl*code → object file and source.lst]
IBM XL Compiler architecture: Advanced compilation environment
• Focus on runtime performance, at the expense of compilation time
  – Aggressive loop transformations
  – More precise dataflow analysis
• Triggered by several compiler flags: -O3 -qhot -qsmp
• Multiple levels of aggressiveness for loop transformations
  – -qhot=level=0 (default at -O3)
  – -qhot=level=1 (default at -qhot)
  – -qhot=level=2
• Can be combined with -qstrict
[Diagram: source file → C, C++, and Fortran front ends (FE) → ipa → xl*code → object file and source.lst]
IBM XL Compiler architecture: Whole-program analysis environment - compile phase
• Collect high-level program representation in preparation for link-time whole program optimization
• Triggered by -qipa
  – Implied by -O4, -O5, -qpdf1/-qpdf2
  – Identical behavior at all -qipa levels
• Can be used independently of -qhot
• Output is composite object file
  – Includes regular object file and intermediate representation
  – Allows linking the object file with or without link-time optimization
  – Skip generation of regular object using -qipa=noobject
[Diagram: source file → C, C++, and Fortran front ends (FE) → ipa → extended object file; xl*code → object file and source.lst]
IBM XL Compiler architecture: Whole-program analysis environment - link phase
• Intercept the system linker and re-optimize whole program
  – -qipa=level=0 (default with -qpdf)
  – -qipa=level=1 (default with -qipa)
  – -qipa=level=2
• Must use the compiler invocation to link the program, with -qipa
  – Do not use ld directly
• Flexible handling of extended objects
  – Can be placed in archives
  – Accepts combination of regular and extended object files
• Whole program assembly listing
  – Default name a.lst
• Under -qpdf1/-qpdf2 the compiler collects and uses runtime profile information about the program
[Diagram: extended object files and system library object files → ipa → xl*code → final object file → system linker → executable; side outputs: a.lst, profile data file]
§ C++11
  – defaulted and deleted functions
  – uniform initialization (part 1, to support Linux header usage)
  – rvalue references
  – constexpr
§ C11
  – generic selection (type-generic expressions)
  – typedef redeclaration
  – _Bool bit-field
Major C and C++ Language Features in XLC V13.1
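The three C11 items listed on this slide can be illustrated with a minimal sketch. The names `counter_t`, `TYPE_CODE`, `struct flags`, and `all_set` are hypothetical and chosen only for this example; the code assumes a compiler running in C11 mode.

```c
/* C11 allows a typedef to be repeated in the same scope
   as long as it names the same type. */
typedef unsigned long counter_t;
typedef unsigned long counter_t;

/* C11 generic selection: choose a value based on the argument's type.
   The codes 1, 2, and 0 are arbitrary labels for this sketch. */
#define TYPE_CODE(x) _Generic((x), \
    int:     1,                    \
    double:  2,                    \
    default: 0)

/* _Bool bit-fields: each flag occupies a single bit. */
struct flags {
    _Bool ready : 1;
    _Bool done  : 1;
};

/* Hypothetical helper: returns 1 only when both flags are set. */
int all_set(struct flags f) {
    return f.ready && f.done;
}
```

Here `TYPE_CODE(1)` selects the `int` branch, `TYPE_CODE(1.0)` the `double` branch, and any other type falls through to `default`.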
§ F2008 Language Features – Submodules
– Declare index variables in FORALL
– Generic resolution improvements
– IMPURE elemental procedure support
– Bit processing intrinsics for shifting, combined shifting, masking, merging
– BACK= argument in the MAXLOC and MINLOC intrinsics
– FINDLOC intrinsic
§ TS29113 (Further C Interoperability) Features – C Descriptor support for allocatable and pointer arguments
– Assumed-type objects, assumed-rank objects
– Support for optional arguments, asynchronous arguments
Major Fortran Language Features in XLF V15.1
Larger Caching Structures vs. POWER7
• 2x L1 data cache (64 KB)
• 2x outstanding data cache misses
• 4x translation cache
Wider Load/Store
• 32B → 64B L2 to L1 data bus
• 2x data cache to execution dataflow
Enhanced Prefetch
• Instruction speculation awareness
• Data prefetch depth awareness
• Adaptive bandwidth awareness
• Topology awareness
Execution Improvement vs. POWER7
• SMT4 → SMT8
• 8 dispatch
• 10 issue
• 16 execution pipes: 2 FXU, 2 LSU, 2 LU, 4 FPU, 2 VMX, 1 Crypto, 1 DFU, 1 CR, 1 BR
• Larger issue queues (4 x 16-entry)
• Larger global completion, load/store reorder
• Improved branch prediction
• Improved unaligned storage access
Core Performance vs. POWER7
• ~1.6x single thread
• ~2x max SMT
[Diagram: POWER8 core floorplan showing the VSU, FXU, IFU, DFU, ISU, and LSU]
POWER8 Core
Caches
• 512 KB SRAM L2 / core
• 96 MB eDRAM shared L3
• Up to 128 MB eDRAM L4 (off-chip)
Memory
• Up to 230 GB/s sustained bandwidth
Bus Interfaces
• Durable open memory attach interface
• Integrated PCIe Gen3
• SMP Interconnect
• CAPI (Coherent Accelerator Processor Interface)
Cores
• 12 cores (SMT8)
• 8 dispatch, 10 issue, 16 exec pipe
• 2X internal data flows/queues
• Enhanced prefetching
• 64K data cache, 32K instruction cache
Accelerators
• Crypto & memory expansion
• Transactional Memory
• VMM assist
• CAPI
Energy Management
• On-chip Power Management Micro-controller
• Integrated per-core VRM
• Critical Path Monitors
Technology
• 22nm SOI, eDRAM, 15 ML, 650 mm²
[Diagram: POWER8 chip layout with 12 cores, each paired with its own L2, around the L3 cache and chip interconnect (8M L3 regions), plus memory controllers, SMP links, accelerators, and PCIe]
POWER8 Processor
POWER8 Performance Improvements
[Charts: memory bandwidth in GB/s (POWER5 to POWER8); I/O bandwidth in GB/s (POWER6, POWER7, POWER7+, POWER8); SMT8 core performance (POWER5 to POWER8); socket performance (POWER5 to POWER8)]
POWER8 Support
§ Automatic exploitation of POWER8 ISA
  – -qarch=pwr8 -qtune=pwr8
§ Built-in support for cryptography, hardware transactional memory, BCD, DSCR prefetch setting, assorted other instructions
§ SMT-aware tuning
  – Sub-option to -qtune for SMT mode: balanced | st | smt2 | smt4 | smt8
  – SMT-aware optimizations: locality transformation, instruction scheduling, etc.
• Definition
  – Technique that allows a group of instructions, including updates to the memory image, to execute speculatively and atomically. This group of instructions is called a transaction.
• Value
  – Reducing program development effort
  – Reducing customer cost (higher SLA attainment through fewer images, and high scalability)
  – Improving performance of legacy software with large sequential components
[Figure: lock-based model (sequential execution of T1, T2, … Tn over time) versus TM model (concurrent execution of T1, T2, … Tn)]
Enable scaling with ease of programming

Trace optimization
[Figure: trace of basic blocks a to f, with aggressive back off]

Speculative loop optimization/vectorization

Original loop order, 41 ms (n = 1000):

    for (int j=0; j<n; j++)
        for (int i=0; i<n; i++)
            Y[i][j] = X[i][j];

Speculative version, 1.8 ms (n = 1000):

    speculate {
        for (int i=0; i<n; i++)
            for (int j=0; j<n; j++)
                Y[i][j] = X[i][j];
    } catch (Exception e) {
        for (int j=0; j<n; j++)
            for (int i=0; i<n; i++)
                Y[i][j] = X[i][j];
    }

Example: MemcacheD Scaling
[Chart: normalized throughput versus cores (1 to 16), TLE versus no TLE]
POWER8 Hardware Transactional Memory
To enter a critical section (pthread_mutex_lock):

    if (__TM_begin(tm_buff) == 0) {
        long val = mutex->mt_lock;
        if (val == UL_FREE) { /* Free */
            /* Enter critical section using TM */
            return 0;
        }
        /* Busy */
        __TM_abort();
    } else {
        /* Not in a transaction */
        ...
        /* Giving up - Not using TM - Need to acquire lock */
        ... <acquire lock> ...
        /* Enter critical section holding lock - Not using TM */
    }

To exit critical section (pthread_mutex_unlock):

    if (__TM_end() == 0) {
        /* Was inside transaction - No need to do anything */
        return 0;
    } else {
        /* Must have acquired lock instead of using TM */
        ... <release lock> ...
    }
Built-ins for POWER8 Hardware Transactional Memory
§ POWER8 SIMD Hardware Improvements
  – Major improvements to misaligned vector load/store
  – Fully symmetric VMX units and SP enhancements to POWER7: doubling of throughput
  – New 2-way 64b integer operations
  – Direct move facility for VSR/GPR transfers
§ Compiler Enhancements for SIMD
  – New SIMD infrastructure for simplicity and performance, enabled by default at -O3, -qhot
  – More aggressive SIMDization for both loops and basic blocks for POWER7 and POWER8, e.g., partial loop SIMDization, basic block SIMDization for scalars
SIMD Performance Improvements on POWER8
§ Out-of-box performance
  – Simplify performance tuning with a set of recommended options (-O2 -qipa, -O3, and -O3 -qhot) for different workloads that provide a balance between compile time and runtime performance
  – Up to 10-15% improvement in application performance
§ C/C++ performance
  – Improvements to inlining heuristics
    • Always honour always_inline attribute for C++ (C with optimization enabled)
    • Enable automatic inlining at -O3
§ Profile Directed Feedback Optimization (PDF) enhancements
  – Multiple PDF workload support
  – Enhanced PDF tools to manipulate profile data
§ OpenMP performance and scalability
  – Reduced overhead for most OpenMP constructs
  – Improved performance and scalability for OpenMP tasks
  – OpenMP nested parallelization support
Runtime Performance Improvements
§ Ease of use
  – Improved static analysis and runtime checking to validate program correctness
    • Compile-time checks to detect uninitialized variables
    • Compile-time checks to detect potential race conditions in synchronization code
    • Runtime checks to detect uninitialized variables
    • Runtime checks to detect stack clobbering
  – Improved DWARF information for debuggers
  – Support for make dependency file generation through -MMD, -MT, -MF in XL Fortran
  – Fortran interfaces for common POSIX functions and constants
§ Ease of migration and compatibility
  – GNU compatibility (-M, -MF, -MT)
  – C++ libraries and other open source libraries
Ease of Use and Migration
• Compile-time check enabled with -qinfo=unset
  – Additional cases can be detected by the optimizer using -O2
• Runtime check enabled with -qcheck=unset
uni.cpp

     1 #include <stdio.h>
     2 int C;
     3 int main() {
     4   int a;
     5   if (C) {
     6     a = 0;
     7   }
     8   printf("a=%d\n", a);
     9   return 0;
    10 }

$ xlC_r uni.cpp -qinfo=unset -O2
Warning: Line 8 in function "main": "a" may be used before define.
Detecting Uninitialized Variables
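One way to resolve the warning reported for uni.cpp is to give the variable a defined value on every path. The sketch below is only an illustration: the helper name `value_of_a` and the default of -1 are assumptions, not part of the original slide.

```c
int C;  /* same global flag as in uni.cpp; zero-initialized at startup */

/* Hypothetical helper mirroring main() in uni.cpp, except that `a`
   is defined on every path before it is read. */
int value_of_a(void) {
    int a = -1;   /* assumption: -1 stands for "not set by the branch" */
    if (C) {
        a = 0;
    }
    return a;     /* never reads an indeterminate value */
}
```

With this change the read of `a` is well defined whether or not `C` is set, so neither the compile-time nor the runtime check has anything to report for it.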
§ Enabled with -qinfo=mt -O3

mt.cpp

     1 #include <stdio.h>
     2 int sum;
     3 volatile int done;
     4 int producer() {
     5   for (int i=0; i<1000; i++) {
     6     sum += i;
     7   }
     8   done = 1;
     9   return 0;
    10 }
    11
    12 int consumer() {
    13   while (! done);
    14   printf("sum=%d\n", sum);
    15   return 0;
    16 }

$ xlC_r -qinfo=mt mt.cpp -O3 -c
1586-669 (I) "mt.cpp", line 13: If this loop is used as a synchronization point, additional synchronization via a directive or built-in function might be needed.
1586-670 (I) "mt.cpp", line 8: If this statement is used as a synchronization point, additional synchronization via a directive or built-in function might be needed.

Slide callouts on the listing: line 6 updates the shared value, line 8 signals finished, line 13 waits for the signal, line 14 uses the shared value. The slide marks two __lwsync() calls as the additional synchronization.
Detecting Potential Race Conditions in Synchronization Code
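A portable way to express the synchronization that the diagnostics ask for is C11 release/acquire atomics, shown here as a hedged sketch in place of the XL-specific __lwsync() built-in. It assumes the two __lwsync() callouts on the slide belong between the update and the signal, and between the wait and the use. The driver `run_producer_consumer` is hypothetical, and POSIX threads are assumed to be available.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int sum;           /* shared value, as in mt.cpp */
static atomic_int done;   /* completion flag, replaces `volatile int done` */

static void *producer(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        sum += i;
    }
    /* Release store: all writes to sum are ordered before done becomes 1. */
    atomic_store_explicit(&done, 1, memory_order_release);
    return NULL;
}

/* Hypothetical driver: runs the producer on a second thread, acts as the
   consumer on the calling thread, and returns the value it observed. */
int run_producer_consumer(void) {
    pthread_t t;
    sum = 0;
    atomic_store(&done, 0);
    if (pthread_create(&t, NULL, producer, NULL) != 0) {
        return -1;
    }
    /* Acquire load: once done == 1 is observed, sum is safe to read. */
    while (!atomic_load_explicit(&done, memory_order_acquire)) {
        /* spin until the producer signals completion */
    }
    int result = sum;
    pthread_join(t, NULL);
    return result;
}
```

The release store pairs with the acquire load, so the consumer cannot observe the flag without also observing the completed sum. A plain `volatile` flag, as in mt.cpp, gives no such ordering guarantee.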
Tips for Compiler Friendly Programming
• Obey all language aliasing rules (avoid -qalias=noansi in C/C++)
• Avoid unnecessary use of globals and pointers; use the restrict keyword (XLC supports multiple-level and scope-restricted pointers) or compiler directives/pragmas to help the compiler do dependence and alias analysis
• Use "const" for globals, parameters and functions whenever possible
• Group frequently used functions into the same file (compilation unit) to expose compiler optimization opportunity (e.g., intra compilation unit inlining, instruction cache utilization)
• Limit exception handling
• Excessive hand-optimization such as unrolling and inlining may impede the compiler
• Keep array index expressions as simple as possible for easy dependency analysis
• Consider using the highly tuned MASS and ESSL libraries
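As a minimal sketch of the restrict and const advice in the list above, the following routine passes its arrays as parameters rather than globals, and keeps the index expression trivial. The function name `saxpy` and its signature are chosen for illustration only.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i]
   restrict promises the compiler that x and y never overlap;
   const marks x as read-only inside the routine. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y) {
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

Because the caller guarantees that the two arrays are disjoint, the compiler does not have to assume that a store to `y[i]` could change a later `x[i]`, which removes one common obstacle to reordering and SIMDization.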
Tips for POWER Optimization
• POWER8 exploitation
  – POWER8-specific ISA exploitation under -qarch=pwr8
  – Scheduling and instruction selection under -qtune=pwr8
• Frequently used compiler option sets
  – -O3 -qarch=pwr8 -qtune=pwr8
  – -O3 -qhot -qarch=pwr8 -qtune=pwr8
• Frequently used pragmas and directives
  – Dependency and alias analysis
  – Alignment
  – Frequency
  – Program behavior
  – Transformations
Tips for POWER Optimization
• Data prefetch
  – Automatic data prefetch at -O3 -qhot or above
  – Problem-state control of DSCR (data stream control register), provided to the user via builtins to control data prefetch
• Automatic SIMDization at -O3 -qhot
  – Limited use of control flow
  – Limited use of pointers. Use the independent_loop directive to tell the compiler a loop has no loop-carried dependency; use either the restrict keyword or the disjoint pragma to tell the compiler the references do not share the same physical storage whenever possible
  – Limited use of strided accesses. Expose stride-one accesses whenever possible
SIMDization Tuning
Transformation report: Loop was SIMD vectorized
User actions: none

Transformation report: It is not profitable to vectorize
User actions:
§ Use #pragma simd_level(10) to force the compiler to do SIMDization

Transformation report: memory accesses have non-vectorizable alignment
User actions:
§ Use __attribute__((aligned(n))) to set data alignment
§ Use __alignx(16, a) to indicate the data alignment to the compiler
§ Use -qassert=refalign if all references are naturally aligned
§ Use array references instead of pointers where possible

Transformation report: data dependence prevents SIMD vectorization
User actions:
§ Use fewer pointers when possible
§ Use #pragma independent if it has no loop-carried dependency
§ Use #pragma disjoint (*a, *b) if a and b are disjoint
§ Use restrict keyword or compiler option -qrestrict
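The alignment and array-reference actions above can be sketched as follows. Only the `__attribute__((aligned(n)))` form is used, since it is also accepted by GCC-compatible compilers; the XL-specific `__alignx` and pragmas are left out so the sketch stays portable. The array names, `N`, and the helpers `is_aligned16` and `fill_and_add` are hypothetical.

```c
#include <stdint.h>

#define N 1024

/* 16-byte alignment matches the 128-bit VMX/VSX vector register width. */
static float a[N] __attribute__((aligned(16)));
static float b[N] __attribute__((aligned(16)));
static float c[N] __attribute__((aligned(16)));

/* Array references with a simple unit-stride index: c[i] = a[i] + b[i]. */
void vec_add(void) {
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }
}

/* Hypothetical helper: returns 1 if p sits on a 16-byte boundary. */
int is_aligned16(const void *p) {
    return ((uintptr_t)p % 16) == 0;
}

/* Hypothetical helper: fills the inputs, runs the loop, and returns
   the last element of the result. */
float fill_and_add(void) {
    for (int i = 0; i < N; i++) {
        a[i] = (float)i;
        b[i] = 1.0f;
    }
    vec_add();
    return c[N - 1];
}
```

Declaring the data as aligned arrays, rather than passing raw pointers of unknown origin, gives the compiler both the alignment and the no-overlap information at the point where the loop is compiled.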
SIMDization Tuning
Transformation report: memory accesses have non-vectorizable strides
User actions:
§ Loop interchange for stride-one accesses, when possible
§ Data layout reshape for stride-one accesses
§ Higher optimization to propagate compile-time-known stride information
§ Stride versioning

Transformation report: either operation or data type is not suitable for SIMD vectorization
User actions:
§ Do statement splitting and loop splitting

Transformation report: loop structure prevents SIMD vectorization
User actions:
§ Convert while-loops into do-loops when possible
§ Limited use of control flow in a loop
§ Use MIN, MAX instead of if-then-else
§ Eliminate function calls in a loop through inlining
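Two of the actions above, loop interchange for stride-one access and replacing if-then-else with MIN/MAX, can be sketched in C. The macro names `MINF` and `MAXF`, the array sizes, and both function names are assumptions made for this example.

```c
#define ROWS 4
#define COLS 8

/* Select-style min and max written as expressions rather than branches. */
#define MINF(p, q) ((p) < (q) ? (p) : (q))
#define MAXF(p, q) ((p) > (q) ? (p) : (q))

/* C arrays are row-major, so making j the inner loop gives stride-one
   accesses to both dst and src. */
void copy_stride_one(float dst[ROWS][COLS], float src[ROWS][COLS]) {
    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++) {
            dst[i][j] = src[i][j];
        }
    }
}

/* Clamp each element to [lo, hi] with a counted loop and no if-then-else
   in the body. */
void clamp(float *restrict x, int n, float lo, float hi) {
    for (int i = 0; i < n; i++) {
        x[i] = MINF(MAXF(x[i], lo), hi);
    }
}
```

Both loops have a known trip count, a single simple index expression, and no calls in the body, which are the loop shapes the report messages above steer toward.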
Tips for POWER Optimization
• Make use of visibility attribute
  – Load time improvement
  – Better code with PLT overhead reduction
  – Code size reduction
  – Symbol collision avoidance
• Inline tuning
  – inline keyword, inline threshold
  – Call overhead reduction
  – Load-hit-store avoidance
• Whole program optimization by IPA
  – Across-file inlining
  – Code partitioning
  – Data reorganization
  – TOC pressure reduction
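The visibility attribute mentioned in the first bullet can be sketched with the GCC-style syntax. The function names `scale_internal` and `mylib_scale_plus_one` are hypothetical, and the example assumes the code is built into a shared library on an ELF platform, where the attribute has an effect.

```c
/* Internal helper: hidden visibility keeps the symbol out of the shared
   library's dynamic symbol table, so it cannot collide with or be
   overridden by a symbol of the same name in another module. */
__attribute__((visibility("hidden")))
int scale_internal(int x) {
    return 3 * x;
}

/* Public entry point: default visibility, exported from the library. */
__attribute__((visibility("default")))
int mylib_scale_plus_one(int x) {
    /* The callee is known to bind locally, so this call typically
       does not need to go through the PLT. */
    return scale_internal(x) + 1;
}
```

A common pattern is to compile the whole library with hidden visibility as the default and mark only the intended interface as `default`, which keeps the exported symbol set small.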