
© 2014 IBM Corporation

HPC Workload Performance Tuning on POWER8 with IBM XL Compilers and Libraries

Yaoqing Gao, STSM, IBM Canada Lab

SPXXL/Scicomp Summer Workshop 2014

Disclaimer

© Copyright IBM Corporation 2014. The information contained in these materials is provided for informational purposes only, and is provided AS IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, these materials. Nothing contained in these materials is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software. References in these materials to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or capabilities referenced in these materials may change at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way. IBM, the IBM logo, Rational, the Rational logo, Telelogic, the Telelogic logo, and other IBM products and services are trademarks of the International Business Machines Corporation, in the United States, other countries or both. Other company, product, or service names may be trademarks or service marks of others.

Please Note:

–  IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion.

–  Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.

–  The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.

Outline

§ Overview of IBM XL Compiler Family
§ Major Features in XL C/C++ V13.1 and XL Fortran V15.1
§ POWER8 Exploitation
§ Performance Tuning Tips


Overview of XL Compiler Family

•  Target Linux and AIX on POWER
   –  Common technology for Blue Gene/Q and z/OS (XL C/C++ only for z/OS)
•  Language standard compliance
   –  C99 standard compliance
   –  C++98 and subsequent TRs, selected C++11 features
   –  Fortran 2003 standard compliance, selected Fortran 2008 features
   –  OpenMP 3.1 conformance, partial support of OpenMP 4.0
•  Fully backward compatible with objects compiled with older compilers
   –  Supports mix-and-match of objects generated with different compilers and optimization levels
   –  Backward compatibility through option control in some rare situations:
      •  C++ name mangling, OpenMP TLS, etc.
•  GCC affinity
   –  Partial source and full binary compatibility with GCC


Overview of XL Compiler Family

•  Platform exploitation
   –  -qarch: ISA exploitation
   –  -qtune: skew performance tuning for a specific processor, including -qtune=balanced
   –  Large portfolio of compiler built-ins and performance annotations
•  Advanced optimization capabilities
   –  Five distinct optimization packages
   –  Aggressive loop analysis and transformations
   –  Whole program optimization
   –  SIMD code generation and vectorization exploitation
   –  Parallelization (automatic and user-driven through OpenMP)
   –  Profile-driven optimization


IBM XL Compiler architecture: Basic compilation environment

•  Used at lower optimization levels
•  Focus on fast compilation
   noopt, -O2
•  More aggressive optimization, with limited impact on compilation time
   -O3 -qnohot
   Implies -qnostrict, which may affect program behavior (mainly precision of floating-point operations)
•  Optionally generate an assembly listing file

[Figure: source file, language front ends (C FE, C++ FE, Fortran FE), xl*code, object file, source.lst]


IBM XL Compiler architecture: Advanced compilation environment

•  Focus on runtime performance, at the expense of compilation time
   –  Aggressive loop transformations
   –  More precise dataflow analysis
•  Triggered by several compiler flags
   -O3 -qhot -qsmp
•  Multiple levels of aggressiveness for loop transformations
   -qhot=level=0 (default at -O3)
   -qhot=level=1 (default at -qhot)
   -qhot=level=2
•  Can be combined with -qstrict

[Figure: source file, ipa, xl*code, object file, source.lst]


IBM XL Compiler architecture: Whole-program analysis environment, compile phase

•  Collect high-level program representation in preparation for link-time whole program optimization
•  Triggered by -qipa
   Implied by -O4, -O5, -qpdf1/-qpdf2
   Identical behavior at all -qipa levels
•  Can be used independently of -qhot
•  Output is a composite object file
   –  Includes regular object file and intermediate representation
   –  Allows linking the object file with or without link-time optimization
   –  Skip generation of regular object using -qipa=noobject

[Figure: source file, ipa, extended object file, xl*code, object file, source.lst]



IBM XL Compiler architecture: Whole-program analysis environment, link phase

•  Intercept the system linker and re-optimize whole program
   -qipa=level=0 (default with -qpdf1/-qpdf2)
   -qipa=level=1 (default with -qipa)
   -qipa=level=2
•  Must use the compiler invocation to link the program, with -qipa
   –  Do not use ld directly
•  Flexible handling of extended objects
   –  Can be placed in archives
   –  Accepts combination of regular and extended object files
•  Whole program assembly listing
   –  Default name a.lst
•  Under -qpdf1/-qpdf2 the compiler collects and uses runtime profile information about the program

[Figure: system library object file, ipa, xl*code, final object file, system linker, executable, a.lst, profile data file]

Major C and C++ Language Features in XLC V13.1

§  C++11
   –  defaulted and deleted functions
   –  uniform initialization (part 1, to support Linux header usage)
   –  rvalue references
   –  constexpr
§  C11
   –  generic selection (type-generic expressions)
   –  typedef redeclaration
   –  _Bool bitfield

Major Fortran Language Features in XLF V15.1

§  F2008 Language Features
   –  Submodules
   –  Declare index variables in FORALL
   –  Generic resolution improvements
   –  IMPURE elemental procedure support
   –  Bit processing intrinsics for shifting, combined shifting, masking, merging
   –  BACK= argument in the MAXLOC and MINLOC intrinsics
   –  FINDLOC intrinsic
§  TS29113 (Further C Interoperability) Features
   –  C Descriptor support for allocatable and pointer arguments
   –  Assumed-type objects, assumed-rank objects
   –  Support for optional arguments, asynchronous arguments

POWER8 Core

Execution Improvement vs. POWER7
•  SMT4 to SMT8
•  8 dispatch
•  10 issue
•  16 execution pipes: 2 FXU, 2 LSU, 2 LU, 4 FPU, 2 VMX, 1 Crypto, 1 DFU, 1 CR, 1 BR
•  Larger issue queues (4 x 16-entry)
•  Larger global completion, Load/Store reorder
•  Improved branch prediction
•  Improved unaligned storage access

Larger Caching Structures vs. POWER7
•  2x L1 data cache (64 KB)
•  2x outstanding data cache misses
•  4x translation cache

Wider Load/Store
•  32B to 64B L2 to L1 data bus
•  2x data cache to execution dataflow

Enhanced Prefetch
•  Instruction speculation awareness
•  Data prefetch depth awareness
•  Adaptive bandwidth awareness
•  Topology awareness

Core Performance vs. POWER7
•  ~1.6x single thread
•  ~2x max SMT

[Figure: POWER8 core floorplan showing the VSU, FXU, IFU, DFU, ISU and LSU]

POWER8 Processor

Caches
•  512 KB SRAM L2 / core
•  96 MB eDRAM shared L3
•  Up to 128 MB eDRAM L4 (off-chip)

Memory
•  Up to 230 GB/s sustained bandwidth

Bus Interfaces
•  Durable open memory attach interface
•  Integrated PCIe Gen3
•  SMP Interconnect
•  CAPI (Coherent Accelerator Processor Interface)

Cores
•  12 cores (SMT8)
•  8 dispatch, 10 issue, 16 exec pipe
•  2X internal data flows/queues
•  Enhanced prefetching
•  64K data cache, 32K instruction cache

Accelerators
•  Crypto & memory expansion
•  Transactional Memory
•  VMM assist
•  CAPI

Energy Management
•  On-chip Power Management Micro-controller
•  Integrated Per-core VRM
•  Critical Path Monitors

Technology
•  22nm SOI, eDRAM, 15 ML, 650 mm2

[Figure: chip floorplan with 12 core/L2 pairs around the L3 cache and chip interconnect (8M L3 regions), memory controllers, SMP links, accelerators and PCIe]



POWER8 Performance Improvements

[Figure: four bar charts. Memory bandwidth in GB/s (POWER5, POWER6, POWER7, POWER8; axis 0 to 250); I/O bandwidth in GB/s (POWER6, POWER7, POWER7+, POWER8; axis 0 to 200); SMT8 core performance (POWER5, POWER6, POWER7, POWER8); socket performance (POWER5, POWER6, POWER7, POWER8)]

POWER8 Support

§ Automatic exploitation of POWER8 ISA
   –  -qarch=pwr8 -qtune=pwr8
§ Built-in support for cryptography, hardware transactional memory, BCD, DSCR prefetch setting, assorted other instructions
§ SMT-aware tuning
   –  Sub-option to -qtune for SMT mode: balanced | st | smt2 | smt4 | smt8
   –  SMT-aware optimizations: locality transformation, instruction scheduling, etc.


POWER8 Hardware Transactional Memory

•  Definition
   –  Technique that allows a group of instructions, including updates to the memory image, to execute speculatively and atomically. This group of instructions is called a transaction.
•  Value
   –  Reducing program development effort
   –  Reducing customer cost (higher SLA attainment through fewer images, and high scalability)
   –  Improving performance of legacy software with large sequential components

Enable scaling with ease of programming

[Figure: threads T1, T2, … Tn over time. Lock-based model: sequential execution. TM model: concurrent execution.]

Trace optimization

[Figure: trace of blocks a to f, with aggressive back off]

Speculative loop optimization/vectorization

   for (int j=0; j<n; j++)
     for (int i=0; i<n; i++)
       Y[i][j] = X[i][j];

   41 ms (n = 1000)

   speculate {
     for (int i=0; i<n; i++)
       for (int j=0; j<n; j++)
         Y[i][j] = X[i][j];
   } catch (Exception e) {
     for (int j=0; j<n; j++)
       for (int i=0; i<n; i++)
         Y[i][j] = X[i][j];
   }

   1.8 ms (n = 1000)

Example: MemcacheD Scaling

[Figure: normalized throughput versus number of cores (1 to 16), comparing TLE and NOTLE]

Built-ins for POWER8 Hardware Transactional Memory

To enter a critical section (pthread_mutex_lock):

   if (__TM_begin(tm_buff) == 0) {
     long val = mutex->mt_lock;
     if (val == UL_FREE) {
       /* Free */
       /* Enter critical section using TM */
       return 0;
     }
     /* Busy */
     __TM_abort();
   } else {
     /* Not in a transaction */
     ...
     /* Giving up - Not using TM - Need to acquire lock */
     ... <acquire lock> ...
     /* Enter critical section holding lock - Not using TM */
   }

To exit a critical section (pthread_mutex_unlock):

   if (__TM_end() == 0) {
     /* Was inside transaction - No need to do anything */
     return 0;
   } else {
     /* Must have acquired lock instead of using TM */
     ... <release lock> ...
   }

SIMD Performance Improvements on POWER8

§ POWER8 SIMD Hardware Improvements
   –  Major improvements to misaligned vector load/store
   –  Fully symmetric VMX units and SP enhancements to POWER7: doubling of throughput
   –  New 2-way 64b integer operations
   –  Direct move facility for VSR/GPR transfers
§ Compiler Enhancements for SIMD
   –  New SIMD infrastructure for simplicity and performance, enabled by default at -O3, -qhot
   –  More aggressive SIMDization for both loops and basic blocks for POWER7 and POWER8, e.g., partial loop SIMDization, basic block SIMDization for scalars

Runtime Performance Improvements

§  Out-of-box performance
   –  Simplify performance tuning with a set of recommended options (-O2 -qipa, -O3, and -O3 -qhot) for different workloads, balancing compile time against runtime performance
   –  Up to 10-15% improvement in application performance
§  C/C++ performance
   –  Improvements to inlining heuristics
      •  Always honour the always_inline attribute for C++ (for C, when optimization is enabled)
      •  Enable automatic inlining at -O3
§  Profile Directed Feedback Optimization (PDF) enhancements
   –  Multiple PDF workload support
   –  Enhanced PDF tools to manipulate profile data
§  OpenMP performance and scalability
   –  Reduced overhead for most OpenMP constructs
   –  Improved performance and scalability for OpenMP tasks
   –  OpenMP nested parallelization support

Ease of Use and Migration

§  Ease of use
   –  Improved static analysis and runtime checking to validate program correctness
      •  Compile-time checks to detect uninitialized variables
      •  Compile-time checks to detect potential race conditions in synchronization code
      •  Runtime checks to detect uninitialized variables
      •  Runtime checks to detect stack clobbering
   –  Improved DWARF information for debuggers
   –  Support for make dependency file generation through -MMD, -MT, -MF in XL Fortran
   –  Fortran interfaces for common POSIX functions and constants
§  Ease of migration and compatibility
   –  GNU compatibility (-M, -MF, -MT)
   –  C++ libraries and other open source libraries

Detecting Uninitialized Variables

•  Compile-time check enabled with -qinfo=unset
   –  Additional cases can be detected by the optimizer using -O2
•  Runtime check enabled with -qcheck=unset

uni.cpp:

    1 #include <stdio.h>
    2 int C;
    3 int main() {
    4   int a;
    5   if (C) {
    6     a = 0;
    7   }
    8   printf("a=%d\n", a);
    9   return 0;
   10 }

$ xlC_r uni.cpp -qinfo=unset -O2
Warning: Line 8 in function "main": "a" may be used before define.
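One way to silence the warning in uni.cpp is to give the variable a defined value on every path. The sketch below is an illustrative rewrite, not code from the slides: the function name `pick_value` and the fallback value -1 are assumptions.

```c
/* Hypothetical helper: name and fallback value are assumptions for illustration. */
int pick_value(int flag)
{
    int a = -1;   /* initialized up front, so every path reads a defined value */
    if (flag) {
        a = 0;
    }
    return a;
}
```

With the initializer in place, neither the -qinfo=unset compile-time check nor the -qcheck=unset runtime check should have anything to report for this variable.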


Detecting Potential Race Conditions in Synchronization Code

§  Enabled with -qinfo=mt -O3

mt.cpp:

    1 #include <stdio.h>
    2 int sum;
    3 volatile int done;
    4 int producer() {
    5   for (int i=0; i<1000; i++) {
    6     sum += i;
    7   }
    8   done = 1;
    9   return 0;
   10 }
   11
   12 int consumer() {
   13   while (! done);
   14   printf("sum=%d\n", sum);
   15   return 0;
   16 }

$ xlC_r -qinfo=mt mt.cpp -O3 -c
1586-669 (I) "mt.cpp", line 13: If this loop is used as a synchronization point, additional synchronization via a directive or built-in function might be needed.
1586-670 (I) "mt.cpp", line 8: If this statement is used as a synchronization point, additional synchronization via a directive or built-in function might be needed.

Annotations on mt.cpp: line 6 updates the shared value, line 8 signals finished, line 13 waits for the signal, and line 14 uses the shared value. The slide marks two __lwsync() calls as the additional synchronization: one in the producer before signalling, and one in the consumer after the wait.
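The __lwsync() built-in is specific to POWER. As a portable alternative (a swapped-in technique, not what the slide shows), the same producer/consumer handshake can be sketched with C11 release/acquire atomics. The names `publish_sum` and `wait_for_sum` are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int sum;            /* shared value, written before the flag is set */
static atomic_bool done;   /* synchronization flag, zero-initialized */

/* Producer side: update the shared value, then signal with release ordering
   so the writes to sum become visible before done is observed as true. */
void publish_sum(void)
{
    for (int i = 0; i < 1000; i++) {
        sum += i;
    }
    atomic_store_explicit(&done, true, memory_order_release);
}

/* Consumer side: spin until the flag is set, using acquire ordering so the
   read of sum cannot be moved ahead of the flag check. */
int wait_for_sum(void)
{
    while (!atomic_load_explicit(&done, memory_order_acquire)) {
        /* busy-wait */
    }
    return sum;
}
```

In a real program the two functions would run on different threads; the release store and acquire load play the roles of the two barriers marked on the slide.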


Tips for Compiler Friendly Programming

•  Obey all language aliasing rules (avoid -qalias=noansi in C/C++)
•  Avoid unnecessary use of globals and pointers; use the restrict keyword (XLC supports multiple-level and scope-restricted pointers) or compiler directives/pragmas to help the compiler do dependence and alias analysis
•  Use “const” for globals, parameters and functions whenever possible
•  Group frequently used functions into the same file (compilation unit) to expose compiler optimization opportunities (e.g., intra-compilation-unit inlining, instruction cache utilization)
•  Limit exception handling
•  Excessive hand-optimization such as unrolling and inlining may impede the compiler
•  Keep array index expressions as simple as possible for easy dependency analysis
•  Consider using the highly tuned MASS and ESSL libraries
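As a small illustration of the restrict and const tips, the sketch below declares that the input and output arrays never overlap and keeps the index expressions simple. The function name `scaled_add` and its signature are assumptions made for this example.

```c
#include <stddef.h>

/* Hypothetical kernel: y[i] += a * x[i].
   restrict promises the compiler that x and y do not alias, and const marks
   x as read-only; both simplify dependence and alias analysis. */
void scaled_add(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++) {
        y[i] += a * x[i];
    }
}
```

The promise made by restrict is the caller's responsibility: passing overlapping arrays to such a function is undefined behavior.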


Tips for POWER Optimization

•  POWER8 exploitation
   –  POWER8 specific ISA exploitation under -qarch=pwr8
   –  Scheduling and instruction selection under -qtune=pwr8
•  Frequently used compiler option sets
   –  -O3 -qarch=pwr8 -qtune=pwr8
   –  -O3 -qhot -qarch=pwr8 -qtune=pwr8
•  Frequently used pragmas and directives
   –  Dependency and alias analysis
   –  Alignment
   –  Frequency
   –  Program behavior
   –  Transformations



•  Data prefetch
   –  Automatic data prefetch at -O3 -qhot or above
   –  Problem-state control of the DSCR (data stream control register), provided to the user via built-ins to control data prefetch
•  Automatic SIMDization at -O3 -qhot
   –  Limit use of control flow
   –  Limit use of pointers. Use the independent_loop directive to tell the compiler a loop has no loop-carried dependency; whenever possible, use either the restrict keyword or the disjoint pragma to tell the compiler that the references do not share the same physical storage
   –  Limit use of strided accesses. Expose stride-one accesses whenever possible
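To make the stride-one advice concrete, the sketch below contrasts two loop orders for the same array copy. It is an illustration under assumptions: the size `DIM` and both function names are hypothetical, and no timings are claimed.

```c
#include <stddef.h>

enum { DIM = 64 };  /* illustrative size, not taken from the slides */

/* Inner loop walks down a column: in row-major C arrays, consecutive
   accesses are DIM elements apart (a non-unit stride). */
void copy_column_order(double dst[DIM][DIM], double src[DIM][DIM])
{
    for (size_t j = 0; j < DIM; j++)
        for (size_t i = 0; i < DIM; i++)
            dst[i][j] = src[i][j];
}

/* Loops interchanged: the inner loop walks along a row, so consecutive
   accesses are adjacent in memory (stride-one). */
void copy_row_order(double dst[DIM][DIM], double src[DIM][DIM])
{
    for (size_t i = 0; i < DIM; i++)
        for (size_t j = 0; j < DIM; j++)
            dst[i][j] = src[i][j];
}
```

Both functions produce the same result; only the memory access pattern differs, which is what matters to the prefetcher and to the SIMDizer.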

SIMDization Tuning

Transformation report: Loop was SIMD vectorized
User actions: (none needed)

Transformation report: It is not profitable to vectorize
User actions:
§ Use #pragma simd_level(10) to force the compiler to do SIMDization

Transformation report: memory accesses have non-vectorizable alignment.
User actions:
§ Use __attribute__((aligned(n))) to set data alignment
§ Use __alignx(16, a) to indicate the data alignment to the compiler
§ Use -qassert=refalign if all references are naturally aligned
§ Use array references instead of pointers where possible

Transformation report: data dependence prevents SIMD vectorization
User actions:
§ Use fewer pointers when possible
§ Use #pragma independent if it has no loop carried dependency
§ Use #pragma disjoint (*a, *b) if a and b are disjoint
§ Use restrict keyword or compiler option -qrestrict
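The alignment action can be sketched with the aligned attribute on statically allocated arrays. This is a minimal example under assumptions: the array names, `VEC_LEN`, and the helper `is_aligned16` are hypothetical, and the XL-specific __alignx built-in is deliberately left out so the code stays portable to other compilers that accept the attribute.

```c
#include <stddef.h>
#include <stdint.h>

enum { VEC_LEN = 1024 };  /* illustrative length */

/* Request 16-byte alignment, matching the 128-bit vector registers. */
float vec_in[VEC_LEN]  __attribute__((aligned(16)));
float vec_out[VEC_LEN] __attribute__((aligned(16)));

/* Returns nonzero when the address is a multiple of 16 bytes. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16u) == 0;
}

/* Array references (not pointer walks) on aligned data with a stride-one
   index: the shape of loop the alignment advice is aimed at. */
void scale_vec(float s)
{
    for (size_t i = 0; i < VEC_LEN; i++) {
        vec_out[i] = s * vec_in[i];
    }
}
```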


SIMDization Tuning

Transformation report: memory accesses have non-vectorizable strides
User actions:
§ Loop interchange for stride-one accesses, when possible
§ Data layout reshape for stride-one accesses
§ Higher optimization to propagate compile-time-known stride information
§ Stride versioning

Transformation report: either operation or data type is not suitable for SIMD vectorization.
User actions:
§ Do statement splitting and loop splitting

Transformation report: loop structure prevents SIMD vectorization
User actions:
§ Convert while-loops into do-loops when possible
§ Limit use of control flow in a loop
§ Use MIN, MAX instead of if-then-else
§ Eliminate function calls in a loop through inlining
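The MIN/MAX advice can be illustrated with a clamp loop written as a single expression per element instead of nested if-then-else statements. The macro definitions and the function name `clamp_array` are assumptions made for this sketch; whether a given compiler vectorizes it still depends on the options and target.

```c
#include <stddef.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))
#define MAX(a, b) ((a) > (b) ? (a) : (b))

/* Hypothetical kernel: clamp every element of in[] to the range [lo, hi].
   The body is one select-style expression, with a countable for-loop and
   no function calls or branches that split the loop body. */
void clamp_array(size_t n, const float *restrict in, float *restrict out,
                 float lo, float hi)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = MIN(MAX(in[i], lo), hi);
    }
}
```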



•  Make use of visibility attribute
   –  Load time improvement
   –  Better code with PLT overhead reduction
   –  Code size reduction
   –  Symbol collision avoidance
•  Inline tuning: inline keyword, inline threshold
   –  Call overhead reduction
   –  Load-hit-store avoidance
•  Whole program optimization by IPA
   –  Across-file inlining
   –  Code partitioning
   –  Data reorganization
   –  TOC pressure reduction
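A minimal sketch of the visibility attribute, using the GNU-style syntax. The function names are hypothetical; the idea is that only the intended entry point is exported from a shared library, while the helper stays internal to it.

```c
/* Internal helper: hidden visibility keeps it out of the shared library's
   exported symbol table, so calls to it need not go through the PLT. */
__attribute__((visibility("hidden")))
int scale_by_two(int x)
{
    return 2 * x;
}

/* Public entry point: default visibility, exported as usual. */
__attribute__((visibility("default")))
int api_scale_plus_one(int x)
{
    return scale_by_two(x) + 1;
}
```

Within the library both functions are callable as normal; the difference only shows up in what other modules can link against.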
