- Tuning and Analysis Utilities: TAU - Debugging on Blue Gene/P System - Intel Multi-core Architecture - Hybrid / Multi-core Programming Tulin Kaman Department of Applied Mathematics and Statistics Stony Brook University Stony Brook Center for Computational Center June 5th 2009

Page 1: - Tuning and Analysis Utilities: TAU - Debugging on Blue Gene/P …tkaman/Presentations/TulinKaman... · 2009. 6. 24. · - Tuning and Analysis Utilities: TAU - Debugging on Blue

- Tuning and Analysis Utilities: TAU
- Debugging on Blue Gene/P System
- Intel Multi-core Architecture
- Hybrid / Multi-core Programming

Tulin Kaman
Department of Applied Mathematics and Statistics
Stony Brook University

Stony Brook Center for Computational Center
June 5th 2009

Overview

PART-I
• Blue Gene Overview
• Performance Tool
• Debugging Tool

PART-II
• Intel Multi-core Architectures
• Hybrid / Multi-core Programming

Blue Gene Systems

New York Blue/L: 18 racks, 18432 compute nodes (36864 CPUs)

New York Blue/P: 2 racks, 2048 compute nodes (8192 CPUs)
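
The figures above imply a per-rack node count and a per-node CPU count for each system. The following is a minimal Python sketch of that arithmetic; it uses only the numbers quoted on the slide, and the function name `derive` is chosen here for illustration.

```python
# Figures copied from the slide: racks, compute nodes, CPUs.
systems = {
    "New York Blue/L": {"racks": 18, "nodes": 18432, "cpus": 36864},
    "New York Blue/P": {"racks": 2, "nodes": 2048, "cpus": 8192},
}


def derive(system):
    """Return (compute nodes per rack, CPUs per compute node)."""
    nodes_per_rack = system["nodes"] // system["racks"]
    cpus_per_node = system["cpus"] // system["nodes"]
    return nodes_per_rack, cpus_per_node


for name, figures in systems.items():
    nodes_per_rack, cpus_per_node = derive(figures)
    print(f"{name}: {nodes_per_rack} nodes/rack, {cpus_per_node} CPUs/node")
```

Both systems work out to the same number of nodes per rack; the difference in CPU count comes from the number of CPUs on each node.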

Blue Gene System Software

The Blue Gene system software consists of five integrated software subsystems:
• System administration and management
• Partition and job management
• Application development and debugging tools
• Compute Node kernel and services
• I/O Node kernel and services

BG Hardware Subsystems

• These software subsystems run on the following hardware subsystems:
    • Front-end node (fen/fenp): gives users access to edit and compile programs, create job script files, and submit jobs.
    • Service node: manages and controls the system.
    • I/O nodes (IO): provide access to external devices through an Ethernet port to the 10-gigabit functional network and can perform file I/O operations.
    • Compute nodes (CN): run user applications.

Nodes (Compute and I/O)

• Each Compute Node on BG/P is made of one quad-core chip with 2GB of physical memory.
• Compute nodes are reserved for computations; I/O is carried out via the I/O nodes.
• Each Blue Gene I/O Node serves a group of Compute Nodes.
• The I/O Node includes a complete Internet Protocol (IP) stack, with TCP and UDP services.
• CNs share the IP address of the I/O node, and the socket port number is a single address space within the processor set.

CN and I/O node properties

NYBlue partition naming convention

• B denotes the Block.
• Partition size
• The nodes are interconnected through six networks, one of which connects the nearest neighbors into a 3D torus or mesh.
• Pset ratio: specifies the I/O node to compute node ratio.
    • A designates a pset ratio of 1:16
    • B designates a pset ratio of 1:32
    • C designates a pset ratio of 1:64
    • D designates a pset ratio of 1:128
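
The pset letter fixes how many I/O nodes serve a partition. A minimal Python sketch of that relationship follows; the 512-node partition is a hypothetical example, and `io_nodes` is a name introduced here, not a system utility.

```python
# Pset letter -> compute nodes served by one I/O node (from the slide).
PSET_RATIO = {"A": 16, "B": 32, "C": 64, "D": 128}


def io_nodes(compute_nodes, pset_letter):
    """Number of I/O nodes serving a partition of the given size."""
    ratio = PSET_RATIO[pset_letter]
    if compute_nodes % ratio:
        raise ValueError("partition size is not a multiple of the pset ratio")
    return compute_nodes // ratio


# Hypothetical 512-node partition, evaluated for every pset letter.
for letter in PSET_RATIO:
    print(f"pset {letter}: {io_nodes(512, letter)} I/O nodes")
```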

The Compute Node Execution Mode

• The CN on BG/P: quad-core on a single chip with 2GB.

Execution process modes on BG/P
• -mode SMP: Symmetrical MultiProcessing. One MPI task/node, four threads/task, 2GB
• -mode DUAL: Dual Node. Two MPI tasks/node, two threads/task, 1GB
• -mode VN: Virtual Node. Four single-threaded MPI tasks/node, 512MB
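
The memory figure for each mode is the node's 2GB divided by the number of MPI tasks on the node. The sketch below works this out in Python for a hypothetical 64-node partition; the `layout` helper is illustrative only.

```python
NODE_MEMORY_MB = 2048  # 2GB of physical memory per compute node

# Mode -> (MPI tasks per node, threads per task), as listed on the slide.
MODES = {"SMP": (1, 4), "DUAL": (2, 2), "VN": (4, 1)}


def layout(mode, nodes):
    """Total MPI tasks, threads per task, and memory per task for a partition."""
    tasks_per_node, threads_per_task = MODES[mode]
    return {
        "mpi_tasks": tasks_per_node * nodes,
        "threads_per_task": threads_per_task,
        "memory_per_task_mb": NODE_MEMORY_MB // tasks_per_node,
    }


for mode in MODES:
    print(mode, layout(mode, nodes=64))
```

Choosing VN mode quadruples the number of MPI ranks but leaves each rank a quarter of the node memory, which matters for memory-hungry applications.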

Compilers Overview

• IBM XL Family of Compilers:
    • xlc, xlc++, xlf, …: these compilers are for FENP, not for CN.
    • bgxlc, bgxlc++, bgxlc_r, bgxlc++_r, bgxlf, bgxlf_r, …: the programmer explicitly identifies all the libraries and include files.
    • C and C++: /opt/ibmcmp/vacpp/bg/9.0/bin/
    • Fortran: /opt/ibmcmp/xlf/bg/11.1/bin/
• GNU Compiler Collection:
    • gcc, g++, gfortran: these compilers are for FENP, not for CN.
    • powerpc-bgp-linux-gcc, powerpc-bgp-linux-g++, powerpc-bgp-linux-gfortran, …: the programmer explicitly identifies all the libraries and include files.
    • /bgsys/drivers/ppcfloor/gnu-linux/bin/

Building Executables with MPI wrappers

• IBM XL Family of Compilers: /bgsys/drivers/ppcfloor/comm/bin/
    • MPI wrappers: mpixlc, mpixlcxx, mpixlf77, mpixlf90, …
    • Thread-safe version of MPI wrappers: mpixlc_r, mpixlcxx_r, mpixlf77_r, mpixlf90_r, …
• GNU Compiler Collection: /bgsys/drivers/ppcfloor/comm/bin/
    • MPI wrappers: mpicc, mpicxx, mpif77, mpif90, …

IBM XL Compilers Optimization Options

• IBM XL compilers support several levels of optimization.
    • Basic command-line optimization
        -O0 = quick local optimization
        -O2 = -O0 -qmaxmem=8192
    • Advanced command-line optimization
        -O3 = -O2 -qnostrict -qmaxmem=-1 -qhot=level=0
        -O4 = -O3 -qarch=auto -qtune=auto -qcache=auto -qhot -qipa
        -O5 = “All of -O4” -qipa=level=2
• Make sure that the application first compiles and executes properly at low optimization levels.

IBM XL compilers for better performance

• Specify the -qarch and -qtune values; otherwise you may get -qarch=auto -qtune=auto, which optimizes for the front-end node. To optimize for the compute nodes, use:
    • -qarch=450: generates code for a single FPU
    • -qarch=450d: generates code for a double FPU (double hummer)
    • -qtune=450: optimizes object code for the 450 family of processors
• -qhot: enables and customizes high-order loop analysis and transformation.

Level                              Time (sec)*
-O0                                420
-O2                                290
-O3                                280
-O4                                275
-O3 -qtune=450 -qarch=450/450d     278 / 282
-O3 -qhot                          270

* Time study is performed on FronTier code.

Usage of -qsmp

• auto: enables automatic parallelization.
• omp: automatic parallelization is disabled. Only OpenMP parallelization pragmas are recognized.
• opt: instructs the compiler to optimize as well as parallelize. The optimization is -O2 -qhot.
• The compiler can automatically locate and parallelize a countable loop if
    • The order in which loop iterations start and end doesn’t affect the result.
    • The loop doesn’t contain I/O operations.
    • The code is compiled with a thread-safe version of the compiler (_r suffix).
    • The -qnostrict_induction compiler option is in effect.
    • The -qsmp=auto option is in effect.

Problem Size    mpixlc_r    mpixlc_r -qsmp=auto -qnostrict_induction
512x512         420 sec     298 sec
1024x1024       1612 sec    1145 sec

Speedup: 1.4x
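
The 1.4x figure can be checked against the timings. The sketch below is a small Python calculation; it assumes the larger pair of times belongs to the 1024x1024 problem and the smaller pair to 512x512, which is the natural reading of the table.

```python
# Wall-clock seconds from the slide, keyed by problem size.
plain = {"512x512": 420, "1024x1024": 1612}         # mpixlc_r
auto_parallel = {"512x512": 298, "1024x1024": 1145}  # mpixlc_r -qsmp=auto -qnostrict_induction


def speedup(before, after):
    """Ratio of run time before to run time after, per problem size."""
    return {size: before[size] / after[size] for size in before}


for size, factor in speedup(plain, auto_parallel).items():
    print(f"{size}: {factor:.2f}x")
```

Both problem sizes give roughly the same ratio, which suggests the gain from automatic parallelization did not depend on problem size in this study.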

GNU Profiler

• GNU profiler is available on BG/P.
• Compile and link with -g -pg.
• Run the application. See the gmon.out files in your working directory.
• Analyze the data with
    > gprof <yourexe> gmon.out.0 > report.0
    > vi report.0

Mathematical Acceleration Subsystem Libraries

• MASS libraries offer improved performance over the standard mathematical library routines, are thread-safe, and can be used from C, C++, and Fortran applications.
• Specify the library explicitly on the linker command:
    -L/opt/ibmcmp/xlmass/bg/4.4/bglib -lmass
    • /opt/…/bglib contains the "cross-compiled" versions of the libraries.
    • /opt/…/lib and /opt/../lib64 contain the "native" versions of the libraries.

Overview

PART-I
• BG/P Overview
• Performance Tool
• Debugging Tool

PART-II
• Intel Multi-core Architectures
• Hybrid / Multi-core Programming

Tuning and Analysis Utilities: TAU

-PROFILE
-PROFILECALLPATH
-MULTIPLECOUNTERS
…

TAU

• Performance evaluation tool

• Profiling and tracing toolkit for performance analysis of parallel programs written in C, C++, Fortran, Java and Python.

• Support for multiple parallel programming paradigms: MPI, Multi-threading, Hybrid (MPI+Threads)

• Access to hardware counters.

• Automatically instruments your code.

How to use TAU?

• Set a couple of environment variables: $PATH, $TAU_MAKEFILE, $TAU_OPTIONS
• Instrument the program either by inserting TAU macros or automatically.
• TAU's automatic instrumentation features rely on the Program Database Toolkit (PDT), which gives analysis tools and applications access to a high-level interface to the source code.
• For automatic instrumentation, replace the compiler with the TAU compiler script.

TAU Configuration

• Each configuration is labeled with the options used.
    ./configure -mpi -arch=bgp -pdt=<pdt-dir> -pdt=xlC
    -PROFILE (default), -PROFILECALLPATH, -MULTIPLECOUNTERS
• Each configuration creates a unique Makefile under
    /bgl/apps/TAUL/tau-2.18/bgl/lib/
    /bgsys/apps/TAUP/tau-2.18/bgp/lib/
• TAU compiler scripts are installed in
    /bgl/apps/TAUL/tau-2.18/bgl/bin/
    /bgsys/apps/TAUP/tau-2.18/bgp/bin/

• Add the bin directory to your path.

Set TAU_MAKEFILE

• Set the environment variable TAU_MAKEFILE to the location of the TAU makefile.
• List of TAU’s Makefiles:
    Makefile.tau-mpi-pdt
    Makefile.tau-callpath-mpi-pdt
    Makefile.tau-multiplecounters-mpi-papi-pdt
    …
• Start with MPI instrumentation and PDT for automatic source instrumentation.
    > export TAU_MAKEFILE=/bgl/apps/TAUL/tau-2.18/bgl/lib/Makefile.tau-mpi-pdt
    > export TAU_MAKEFILE=/bgsys/apps/TAUP/tau-2.18/bgp/lib/Makefile.tau-mpi-pdt

TAU Shell Scripts

• Compile your code with TAU shell scripts.

• If your Fortran code is fixed-format, use "tau_f90.sh -qfixed".

• -D options to XLF: the XL Fortran compilers require a slightly different syntax to define preprocessor macro symbols; use "-WF,-D".

TAU shell scripts    IBM XL Compilers                             GNU Compilers
tau_cc.sh            mpixlc/mpixlc_r                              mpicc
tau_cxx.sh           mpixlcxx/mpixlcxx_r                          mpicxx
tau_f90.sh           mpixlf77/mpixlf77_r, mpixlf90/mpixlf90_r     mpif77, mpif90

Analyze Performance Data

• pprof (for text-based display)
    • Sorts and displays profile data generated by TAU.
    • Execute pprof in the directory where the profile files are located.
• paraprof (for GUI display)
    • TAU has a Java-based performance data viewer.
    • Requires Java 1.4 or above; add it to your path.
    • The --pack option packs the data into packed (.ppk) format and does not launch the paraprof GUI.
        > paraprof --pack filename.ppk
    • To launch the GUI:
        > paraprof filename.ppk

Generate a Flat Profile

• Set environment variables.
    > export PATH=/bgl/apps/TAUL/tau-2.18/bgl/bin:$PATH
    > export TAU_MAKEFILE=/bgl/apps/TAUL/tau-2.18/bgl/lib/Makefile.tau-mpi-pdt
• Compile your code with TAU shell scripts.
    > make CC=tau_cc.sh CXX=tau_cxx.sh F90="tau_f90.sh -qfixed"
• Provide the full path to the directory where you want to store the profile files (profile.x.0.0). In your batch job script file, set the environment variable PROFILEDIR.
    # @ arguments = -np 16 -env PROFILEDIR=<profile-dir> -exe …
• Run your job.
• Go to the directory where you store the profile files. Pack the data into packed (.ppk) format.
    > paraprof --pack filename.ppk
• Launch the GUI to analyze the data.
    > paraprof filename.ppk

Identify the routines that use the most time

• Configuration with -PROFILE(default) creates Makefile.tau-mpi-pdt.

Generate call path profiles

• f1 => f2 shows the time spent in f2 when it is called by f1
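
To make the "f1 => f2" notation concrete, here is a minimal Python sketch that groups the time spent in one routine by its direct caller. The timings are hypothetical values for illustration, not measurements, and the assumption that longer paths are written "a => b => c" is ours; the helper names are not part of TAU.

```python
def split_callpath(name):
    """Split a call path such as 'f1 => f2' into its list of routines."""
    return [part.strip() for part in name.split("=>")]


def time_by_caller(profile, callee):
    """Sum the time spent in `callee`, grouped by its direct caller."""
    totals = {}
    for name, seconds in profile.items():
        path = split_callpath(name)
        if len(path) >= 2 and path[-1] == callee:
            caller = path[-2]
            totals[caller] = totals.get(caller, 0.0) + seconds
    return totals


# Hypothetical timings in seconds, for illustration only.
profile = {
    "main => f1 => f2": 3.0,
    "main => f3 => f2": 1.5,
    "main => f1": 4.0,
}
print(time_by_caller(profile, "f2"))
```

A flat profile would report only the total for f2; the call path view shows which caller is responsible for most of it.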

paraprof→Windows→Threads→Call Graph

-MULTIPLECOUNTERS -papi=<papi-dir>

• Modern CPUs, including those in Blue Gene, provide on-chip hardware performance counters that can record several events:
    • The number of instructions issued
    • The number of L1, L2 and L3 data and instruction cache misses, hits, accesses, reads, and writes.
• TAU uses the Performance Data Standard and API (PAPI, Performance Application Programming Interface) to access these performance counters.

Generate Hardware Counter Profile

• Set the environment variable TAU_MAKEFILE to Makefile.tau-multiplecounters-mpi-papi-pdt
• Set the COUNTERx environment variables in your job script file to specify the type of counter to profile.
    # @ arguments = -np 1 -env PROFILEDIR=<profile-dir> -env "COUNTER1=GET_TIME_OF_DAY COUNTER2=PAPI_L1_DCM \
    COUNTER3=PAPI_L1_ICM COUNTER4=PAPI_L1_TCM" -exe …
• The following subdirectories will be created:
    <profile-dir>/MULTI__GET_TIME_OF_DAY
    <profile-dir>/MULTI__PAPI_L1_DCM
    <profile-dir>/MULTI__PAPI_L1_ICM
    <profile-dir>/MULTI__PAPI_L1_TCM
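
The subdirectory names follow directly from the counter names. As a rough Python sketch, the mapping can be written as below; `/tmp/profiles` is a placeholder for `<profile-dir>` and `counter_dirs` is a helper introduced here, not a TAU tool.

```python
from pathlib import PurePosixPath


def counter_dirs(profile_dir, counters):
    """Expected per-counter subdirectories: <profile-dir>/MULTI__<counter>."""
    base = PurePosixPath(profile_dir)
    return [str(base / f"MULTI__{counter}") for counter in counters]


# Counter names taken from the COUNTER1..COUNTER4 settings on the slide.
counters = ["GET_TIME_OF_DAY", "PAPI_L1_DCM", "PAPI_L1_ICM", "PAPI_L1_TCM"]

for path in counter_dirs("/tmp/profiles", counters):
    print(path)
```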

Performance Counters

Fast Blue Gene Timers

• Blue Gene systems have a special clock cycle counter that can be used for low-overhead timings.

-BGLTIMERS Use fast low-overhead timers on IBM BG/L

-BGPTIMERS Use fast low-overhead timers on IBM BG/P

PerfExplorer

• Framework for parallel performance data mining.

• Enables the development and integration of data mining operations that will be applied to large-scale parallel performance profiles.

• Requires Java Run Time Environment 5

• Requires PerfDMF (Performance Data Management Framework) from TAU.

Running PerfExplorer

• Make sure you have Java 5 or better in your PATH.
• Configure PerfDMF by running perfdmf_configure.
• Generate .ppk files.
    > llsubmit tau_app16.run
    > paraprof --pack tau_np32.ppk
    > …
    > llsubmit tau_app512.run
    > paraprof --pack tau_np512.ppk
• > paraprof
    • Add trial to the DB.
    • Trial type: Paraprof Packed Profile
    • Select File(s) -> OK.
    • Uploading Trial.
• > perfexplorer
    • Choose Experiments.
    • The options under the Chart menu provide analysis.

Overview

PART-I
• BG/P Overview
• Performance Tool
• Debugging Tool

PART-II
• Intel Multi-core Architectures
• Hybrid / Multi-core Programming

Debugging on Blue Gene/P Systems

Debugging Architecture

• The Compute Node Kernel provides the low-level primitives for debugging.
• The Control and I/O Daemon (CIOD) running on the I/O Nodes supports debugger access into the compute nodes.
• The debug server, running on the I/O Nodes, is the code that interfaces with the CIOD.
• The debug client, running on a Front End Node, is the code that the user interacts with directly; it makes remote requests to the debug server.

The GNU Debugger

• Debugging an application on Blue Gene is different from debugging on a cluster because
    • The application runs on the compute nodes.
    • The application developer is logged into the front-end nodes.
• GDB has the ability to do remote debugging.
• A debug server called gdbserver allows GDB to work with applications running on compute nodes.

Compiling your program

• Compile your code with -g. This tells the compiler to include debug information.
• Do not use high-level compiler optimization.
• Set the optimization level with -O[N], where N is zero. This tells the compiler to disable optimization.
• Specify the complementary -qarch and -qtune values for PPC440 on BG/L and PPC450 on BG/P.

Starting gdbserver

• The mpirun command is used to start gdbserver.
• The -start_gdbserver parameter is used on the mpirun command to start the program under debugger control.
• Open two separate console shells.
    • The first shell: start the debugger server on the I/O Node.
    • The second shell: make remote requests to the debugger server and debug the application on the Compute Node.

Step 1: Starting the gdbserver using mpirun

• Start your application using mpirun with -start_gdbserver
• Partition called B064TC00 is used.
• The executable is gas_O0_gdb.

What happens

• The application does not start running immediately.

• The application is loaded to compute nodes.

• The debug server called gdbserver is started on the I/O node in the specific partition.

• mpirun stops and waits for you to connect GDB clients to the compute nodes.

Step 2: Finding the address of the Compute Node

• After starting the gdbserver, you will get a prompt from the gdb server.
• Find the IP address and the port number for the compute node that you want to debug. dump_proctable will give you the list of port numbers for all the compute nodes you can debug.
    > dump_proctable
    MPI Rank 0: Connect to 172.30.200.101:10063
    MPI Rank 1: Connect to 172.30.200.101:10058
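
When many ranks are listed, it can help to pull the addresses out of the dump_proctable output mechanically. The following is a minimal Python sketch that parses lines in the format shown above and prints the matching gdb `target remote` argument; the parser is our own illustration and assumes the output always follows that line format.

```python
import re

# Matches lines such as "MPI Rank 0: Connect to 172.30.200.101:10063".
LINE = re.compile(r"MPI Rank (\d+): Connect to (\d+\.\d+\.\d+\.\d+):(\d+)")


def parse_proctable(text):
    """Map each MPI rank to the 'ip:port' address of its debug server."""
    return {int(rank): f"{ip}:{port}" for rank, ip, port in LINE.findall(text)}


# Sample output copied from the slide.
sample = """\
MPI Rank 0: Connect to 172.30.200.101:10063
MPI Rank 1: Connect to 172.30.200.101:10058
"""

table = parse_proctable(sample)
for rank, address in sorted(table.items()):
    print(f"rank {rank}: target remote {address}")
```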

Step 3: Starting gdb client

• Open a new window.
• Start gdb on the front-end node and connect to gdbserver.
• Do not use the default gdb (/usr/bin/gdb).
• After you get the gdb prompt, use the target remote command of gdb to connect to gdbserver using the address of the compute node: <IPaddress:PortNumber>

Step 4: Start running the application on compute node

• When you have connected the GDB clients to the specific compute node, press Enter to send a signal to the mpirun command.
• You are now debugging your application on the configured compute node.
• From this point on you can use the standard gdb commands.
• Before you quit gdb, end the remote debugging session with the detach command of gdb. Otherwise gdb will kill the application.

Overview

PART-I
• BG/P Overview
• Performance Tool
• Debugging Tool

PART-II
• Intel Multi-core Architectures
• Hybrid / Multi-core Programming

Intel Multi-core Architecture

NEHALEM

Nehalem: Intel microarchitecture

• An Intel white paper says that this new dynamically and design-scalable microarchitecture rewrites the book on energy efficiency and performance.
• To extract greater performance from this new microarchitecture, Intel introduces Intel QuickPath Technology. With Intel QuickPath Technology, each processor core features an integrated memory controller and high-speed interconnect.
• Designed from the ground up to take advantage of 45-nanometer (nm) high-k metal gate silicon technology, which
    • helps to dramatically increase processor energy efficiency
    • increases transistor switching speeds to enable higher core and bus clock frequencies
    • thus delivers more performance in the same power and thermal envelope.

• The focus is on how the processor uses available clock cycles and power, rather than just pushing up ever higher clock speeds and energy needs.

• Nehalem’s biggest innovations come from new optimizations of the individual cores and the overall multi-core microarchitecture to increase single-thread and multi-thread performance.

Microarchitecture’s performance and power management innovations include:

• Dynamically managed cores, threads, cache, interfaces, and power.
• Simultaneous multi-threading (SMT) capability enables running two simultaneous threads per core:
    • 8 simultaneous threads per quad-core processor
    • 16 simultaneous threads for dual-processor quad-core designs.
• Superior multi-level cache, including an inclusive shared L3 cache.
• New high-end system architecture that delivers from two to three times more peak bandwidth and up to four times more realized bandwidth.
• Performance-enhanced dynamic power management

Nehalem Goals

• Low latency to retrieve data
    • Keep execution engine fed without stalling
    • Latency: how long it takes for the first byte sent from one node to reach its target node
• High data bandwidth
    • Handle requests from multiple cores/threads seamlessly
    • Bandwidth: how many megabytes of data one node can send to another node in a second

• Scalability
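
The two definitions above combine into a common first-order estimate of transfer time: a fixed latency plus message size divided by bandwidth. The Python sketch below illustrates that model; the model is a standard simplification we are assuming here rather than something stated on the slide, and the link parameters are hypothetical.

```python
def transfer_time(message_bytes, latency_s, bandwidth_mb_per_s):
    """Estimated seconds to deliver a message: latency + size / bandwidth."""
    bytes_per_second = bandwidth_mb_per_s * 1_000_000
    return latency_s + message_bytes / bytes_per_second


# Hypothetical link: 5 microseconds latency, 400 MB/s bandwidth.
LATENCY_S = 5e-6
BANDWIDTH_MB_PER_S = 400

for size in (8, 4_000, 4_000_000):
    seconds = transfer_time(size, LATENCY_S, BANDWIDTH_MB_PER_S)
    print(f"{size} bytes: {seconds * 1e6:.1f} microseconds")
```

Under this model small messages are dominated by latency and large messages by bandwidth, which is why both appear as separate goals.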

Performance Improvements

Intel’s core enhancements to further improve the performance of the individual processor cores (Nehalem).

• Instructions Per Cycle Improvements: The more instructions that can be run per clock cycle, the greater the performance.

• Enhanced Branch Prediction: Branch prediction attempts to guess whether a conditional branch will be taken or not. Branch predictors are crucial in today’s processors for achieving high performance. They allow processors to fetch and execute instructions without waiting for a branch to be resolved.

Intel’s core enhancements

• Simultaneous Multi-Threading: Intel introduces Hyper-Threading Technology (HT), a technique that enables a single execution core to run two threads at the same time. A quad-core processor could run up to eight threads simultaneously.

Intel’s core enhancements

• Intel Smart Cache Enhancement: enhances the Intel Smart Cache by adding an inclusive shared L3 (last-level) cache that can be up to 8 MB in size.
    • Inclusive cache policy for best performance: an address residing in L1/L2 must be present in the third (last) level cache.
    • A miss in the inclusive shared L3 cache guarantees that the data is outside the processor, not on-die. This is designed to eliminate unnecessary snoop traffic between cores, reducing latency and improving performance.
    • If a data request misses in an exclusive shared L3 cache, each processor core must be searched in case its individual caches contain the requested data. This can increase latency and traffic between cores.
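
The difference between the two policies can be illustrated with a toy lookup model. The Python sketch below is a deliberate simplification (caches are plain sets of addresses, with no eviction or coherence states) and is not a description of the actual hardware protocol; all names are introduced here for illustration.

```python
def lookup(address, l3, core_caches, inclusive):
    """Return (where the data was found, number of core caches snooped)."""
    if address in l3:
        return "L3", 0
    if inclusive:
        # An inclusive L3 holds every line present in any core cache,
        # so a miss here proves the line is not on the die.
        return "memory", 0
    # Otherwise each core's private cache has to be searched in turn.
    snoops = 0
    for cache in core_caches:
        snoops += 1
        if address in cache:
            return "core cache", snoops
    return "memory", snoops


core_caches = [{0x10}, {0x20}, set(), set()]  # four cores, toy contents
inclusive_l3 = {0x10, 0x20, 0x30}             # superset of all core caches
exclusive_l3 = {0x30}                         # holds only lines not in the cores

print(lookup(0x40, inclusive_l3, core_caches, inclusive=True))
print(lookup(0x40, exclusive_l3, core_caches, inclusive=False))
```

In this model an L3 miss under the inclusive policy costs no snoops, while the same miss under the exclusive policy touches every core before going to memory.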

80 core Teraflops Research Chip

• The 80 core Teraflops Research Chip is the first programmable chip to deliver more than one trillion floating point operations per second (1 Teraflops) of performance while consuming very little power.

• This research project focuses on exploring new, energy-efficient designs for future multi-core chips, as well as approaches to interconnect and core-to-core communications.
  http://techresearch.intel.com/articles/Tera-Scale/1449.htm

Tukwila

• Tukwila is Intel's next-generation Itanium processor with four cores
• 30MB total cache
• Intel QuickPath Interconnect Technology
• Dual Integrated Memory Controller
• Expected in 2010

IBM Blue Gene/Q Architecture

• IBM is not releasing low-level details of the Blue Gene/Q architecture.
• Blue Gene/Q node will contain 16 cores. This can be implemented as one 16-core chip or two 8-core chips or even four quad-core chips.
• Each node stands to have 16 GB.
• In 2011 Lawrence Livermore National Laboratory will install a 20 petaflop system.

Overview

PART-I
• BG/P Overview
• Performance Tool
• Debugging Tool

PART-II
• Intel Multi-core Architectures
• Hybrid / Multi-core Programming

Hybrid / Multi-core Programming

How to program machines that can be built?

• Distributed-Memory Machines
    • Each node in the computer has a locally addressable memory space.
    • Parallel programs consist of cooperating processes, each with its own memory.
    • Processes send data to one another as messages.
    • Message Passing Interface (MPI) is a widely used standard for writing message-passing programs.
• Shared-Memory Machines
    • Each core can access the entire data space.
    • In shared memory multi-core architectures, OpenMP and Pthreads can be used to implement parallelism.

Start with auto parallelization

IBM XL thread-safe compilers will automatically parallelize a loop in your code if:
  There is no branching into or out of the loop.
  The increment expression is not within a critical section.
  The order in which loop iterations start and end doesn't affect the result.
  The loop doesn't contain I/O operations.
  The program is compiled with a thread-safe version of the compiler (_r suffix: mpixlc_r, mpixlcxx_r, mpixlf77_r, …).
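To make the conditions concrete, here is a minimal sketch in C of one loop that meets them and one that does not. The function names `scale` and `prefix_sum` are hypothetical and do not come from the slides, and whether a particular compiler actually parallelizes the first loop remains the compiler's decision.

```c
#include <assert.h>
#include <stddef.h>

/* Candidate for auto-parallelization: each iteration touches only x[i],
   there is no branch out of the loop and no I/O, so the order in which
   iterations run does not change the result. */
void scale(double *x, int n, double factor)
{
    int i;

    assert(x != NULL && n >= 0);
    for (i = 0; i < n; i++)
        x[i] = factor * x[i];
}

/* Not a candidate: iteration i reads the value written by iteration i-1,
   so the result depends on the order of the iterations. */
void prefix_sum(double *x, int n)
{
    int i;

    assert(x != NULL && n >= 0);
    for (i = 1; i < n; i++)
        x[i] = x[i] + x[i - 1];
}
```

Both functions are correct when run serially; only the first one keeps its result if the iterations are distributed over threads.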

Page 62: - Tuning and Analysis Utilities: TAU - Debugging on Blue Gene/P …tkaman/Presentations/TulinKaman... · 2009. 6. 24. · - Tuning and Analysis Utilities: TAU - Debugging on Blue

Shared Memory Model

• All threads have access to the same, globally shared, memory.

• Threads also have their own private memory.

• Shared data is accessible by all threads.

• Private data can be accessed only by the thread that owns it.

• Programmers are responsible for synchronizing access to (protecting) globally shared data.

Page 63: - Tuning and Analysis Utilities: TAU - Debugging on Blue Gene/P …tkaman/Presentations/TulinKaman... · 2009. 6. 24. · - Tuning and Analysis Utilities: TAU - Debugging on Blue

Shared Memory Programming: Pthreads

Pthreads = POSIX threads
  Hardware vendors have their own versions of threads, which makes it difficult for programmers to develop portable threaded applications.
  For UNIX systems, a standardized programming interface is specified by the IEEE POSIX 1003.1c standard. Implementations that follow this standard are referred to as POSIX (Portable Operating System Interface) threads.
  Lower-level UNIX library for building multi-threaded programs.
  Pthreads are defined as a set of C programming language types and function calls.

Page 64: - Tuning and Analysis Utilities: TAU - Debugging on Blue Gene/P …tkaman/Presentations/TulinKaman... · 2009. 6. 24. · - Tuning and Analysis Utilities: TAU - Debugging on Blue

Shared Memory Programming: OpenMP

OpenMP = Open Multi-Processing
  The OpenMP Application Program Interface (API) is used for writing shared-memory parallel programs.
  Supports multi-platform shared-memory parallel programming in C/C++ and Fortran on all architectures.
  Consists of
    Compiler Directives
    Runtime Library Routines
    Environment Variables
  An OpenMP program is portable. Compilers have OpenMP support.
  Requires little programming effort.

Page 65: - Tuning and Analysis Utilities: TAU - Debugging on Blue Gene/P …tkaman/Presentations/TulinKaman... · 2009. 6. 24. · - Tuning and Analysis Utilities: TAU - Debugging on Blue

Labeling the data

• As programmers we need to think about where in memory to put the data:
  Shared
    All threads can read and write the data simultaneously.
    The changes are visible to all threads.
  Private
    Each thread has its own copy of the data.
    Other threads cannot access this data.
    The changes are visible only to the thread that owns the data.
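The two labels can be sketched with OpenMP data-sharing clauses. The function `square_all` and its parameter names are hypothetical, chosen only for illustration: the arrays are labeled shared, while the loop index and the temporary are labeled private.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: writes in[i] * in[i] into out[i] for i in [0, n). */
void square_all(const double *in, double *out, int n)
{
    int i;
    double tmp;

    assert(in != NULL && out != NULL && n >= 0);

    /* in, out and n are shared: every thread sees the same arrays.
       i and tmp are private: each thread works on its own copies, so one
       thread cannot overwrite a temporary that another thread is using. */
    #pragma omp parallel for default(none) shared(in, out, n) private(i, tmp)
    for (i = 0; i < n; i++) {
        tmp = in[i] * in[i];
        out[i] = tmp;
    }
}
```

If the compiler is run without OpenMP support, the pragma is ignored and the loop runs serially with the same result.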

Page 66: - Tuning and Analysis Utilities: TAU - Debugging on Blue Gene/P …tkaman/Presentations/TulinKaman... · 2009. 6. 24. · - Tuning and Analysis Utilities: TAU - Debugging on Blue

Common model for threaded programs

• Fork-Join Model

The master thread runs from start to end.

When parallelism is specified, the master thread gets help from other threads, called worker threads.

At the end of the parallel portion of the work, the threads synchronize and the worker threads terminate.

Only the master thread continues.

[Figure: Master Thread → Fork → Parallel Region → Join → Fork → Parallel Region → Join]
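The fork-join pattern can be sketched in C with an OpenMP parallel region. The function `count_team_members` is a hypothetical name used only for this illustration; it returns how many threads took part in the region.

```c
#include <assert.h>

/* Hypothetical helper: returns the number of threads that executed the
   parallel region (1 if the code is compiled without OpenMP support). */
int count_team_members(void)
{
    int members = 0;   /* serial part: only the master thread runs here */

    /* Fork: worker threads join the master thread for this region. */
    #pragma omp parallel reduction(+:members)
    {
        members += 1;  /* executed once by every thread of the team */
    }
    /* Join: implicit barrier at the end of the region; afterwards only
       the master thread continues. */

    assert(members >= 1);
    return members;
}
```

The reduction clause gives every thread a private counter and adds the counters together at the join, so the shared result needs no explicit locking.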

Page 67: - Tuning and Analysis Utilities: TAU - Debugging on Blue Gene/P …tkaman/Presentations/TulinKaman... · 2009. 6. 24. · - Tuning and Analysis Utilities: TAU - Debugging on Blue

Simple Example

• For-Loop:

  All the iterations are independent.

• OpenMP:

  Specify the include file "omp.h".
  Compile with
    > mpixlc_r -qsmp=omp ex1.c -o ex1
  By default, the runtime environment uses all available threads. The default execution mode is SMP (four threads/task).
  It divides the work among the threads.
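A loop of this kind might look as follows. This is a minimal sketch, not the example shown on the original slide; the function name `vector_add` is an assumption. The header omp.h is needed only when OpenMP runtime library routines are called, so the sketch omits it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: z = x + y, element by element. */
void vector_add(const double *x, const double *y, double *z, int n)
{
    int i;

    assert(x != NULL && y != NULL && z != NULL && n >= 0);

    /* Every iteration writes a different z[i] and reads nothing written
       by another iteration, so the runtime may hand any subset of the
       iterations to any thread. */
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        z[i] = x[i] + y[i];
}
```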

Page 68: - Tuning and Analysis Utilities: TAU - Debugging on Blue Gene/P …tkaman/Presentations/TulinKaman... · 2009. 6. 24. · - Tuning and Analysis Utilities: TAU - Debugging on Blue

Setting Environment variables

• If you want to use fewer threads than the number available, set the PARTHDS suboption of the XLSMPOPTS environment variable or the OMP_NUM_THREADS environment variable.

• In your LoadLeveler batch job file:

# @ arguments = -env XLSMPOPTS=PARTHDS=2 -exe …

OR

# @ arguments = -env OMP_NUM_THREADS=2 -exe …

Page 69: - Tuning and Analysis Utilities: TAU - Debugging on Blue Gene/P …tkaman/Presentations/TulinKaman... · 2009. 6. 24. · - Tuning and Analysis Utilities: TAU - Debugging on Blue

C/C++ Directives Format

#pragma omp directive-name [clause, ...] newline

• Case sensitive

• Directives follow conventions of the C/C++ standards for compiler directives

• Only one directive-name may be specified per directive

• You can synchronize the threads

Page 70: - Tuning and Analysis Utilities: TAU - Debugging on Blue Gene/P …tkaman/Presentations/TulinKaman... · 2009. 6. 24. · - Tuning and Analysis Utilities: TAU - Debugging on Blue

• Run-Time Library Routines
  The OpenMP runtime library provides routines for a variety of functions:
    Get the number of threads and set the number of threads to use: omp_get_num_threads(); omp_set_num_threads(4);
    Get the thread ID: omp_get_thread_num();
    Wallclock timer functions
    Locking functions

• Environment Variables
  OpenMP provides environment variables for controlling the execution of parallel code:
    Set the number of threads: OMP_NUM_THREADS
    Scheduling, which determines how iterations of the loop are scheduled on processors: OMP_SCHEDULE
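The routines named above can be combined as in the following sketch. The wrapper `team_size_for` is hypothetical; the `#ifdef _OPENMP` guard is an addition of this sketch so that the code also builds, with a serial fallback, when OpenMP is not enabled.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: requests `requested` threads and returns the size
   of the team that the runtime actually provided. */
int team_size_for(int requested)
{
    int team = 1;   /* serial fallback when compiled without OpenMP */

    assert(requested >= 1);

#ifdef _OPENMP
    omp_set_num_threads(requested);        /* request a team size */
    #pragma omp parallel
    {
        /* omp_get_thread_num() is the thread ID; the master thread is 0. */
        if (omp_get_thread_num() == 0)
            team = omp_get_num_threads();  /* size of the current team */
    }
#endif
    return team;
}
```

The runtime is allowed to provide fewer threads than requested, so callers should treat the request as an upper bound rather than a guarantee.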

Page 71: - Tuning and Analysis Utilities: TAU - Debugging on Blue Gene/P …tkaman/Presentations/TulinKaman... · 2009. 6. 24. · - Tuning and Analysis Utilities: TAU - Debugging on Blue

Matrix-vector multiplication: OpenMP

#pragma omp parallel for default(none) \
        shared(m,n,a,b,c) private(i,j,sum)
for (i=0; i<m; i++)
{
    sum = 0.0;   /* reset the private accumulator for each row */
    for (j=0; j<n; j++)
        sum += b[i*n+j]*c[j];
    a[i] = sum;
}

• default(none) forces the programmer to specify every variable as shared, private, or some other kind.
• private(i,j,sum): the loop variables and sum must be private; otherwise one thread could overwrite values that another thread is using.
• shared(m,n,a,b,c): every thread can access these variables; each thread reads and writes its own portion of the array a.

Page 72: - Tuning and Analysis Utilities: TAU - Debugging on Blue Gene/P …tkaman/Presentations/TulinKaman... · 2009. 6. 24. · - Tuning and Analysis Utilities: TAU - Debugging on Blue

OpenMP Performance

Page 73: - Tuning and Analysis Utilities: TAU - Debugging on Blue Gene/P …tkaman/Presentations/TulinKaman... · 2009. 6. 24. · - Tuning and Analysis Utilities: TAU - Debugging on Blue

OpenMP Performance

Small problem size

Page 74: - Tuning and Analysis Utilities: TAU - Debugging on Blue Gene/P …tkaman/Presentations/TulinKaman... · 2009. 6. 24. · - Tuning and Analysis Utilities: TAU - Debugging on Blue

If clause

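#pragma omp parallel if(n > 256) default(none) \
        shared(m,n,a,b,c) private(i,j,sum)
{
    #pragma omp for
    for (i=0; i<m; i++)
    {
        sum = 0.0;   /* reset the private accumulator for each row */
        for (j=0; j<n; j++)
            sum += b[i*n+j]*c[j];
        a[i] = sum;
    } /*-- End of omp for --*/
} /*-- End of parallel region --*/

The region runs in parallel only when n > 256; for smaller problems it is executed by a single thread, which avoids the threading overhead.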

Page 75: - Tuning and Analysis Utilities: TAU - Debugging on Blue Gene/P …tkaman/Presentations/TulinKaman... · 2009. 6. 24. · - Tuning and Analysis Utilities: TAU - Debugging on Blue

Pthreads

• int pthread_create(pthread_t *thread, const pthread_attr_t *attr, void *(*start_routine)(void *), void *arg);

  thread: receives a unique identifier for the newly created thread.
  attr: a thread attribute object that specifies various characteristics of the new thread. NULL for the default values.
  start_routine: the routine that the thread will start executing.
  arg: a parameter to be passed to start_routine.

• int pthread_join(pthread_t thread, void **value_ptr);

  Causes the calling thread to wait for the specified thread's termination.

• Specify the include file "pthread.h".
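The two calls can be put together in a minimal sketch. The names `double_it` and `run_in_thread` are hypothetical; the sketch creates one thread with default attributes, passes it a single argument, and waits for it to finish.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Start routine: receives a pointer to an int and doubles it in place. */
static void *double_it(void *arg)
{
    int *value = (int *)arg;

    *value *= 2;
    return arg;   /* this pointer is handed back through pthread_join */
}

/* Hypothetical helper: runs double_it in a new thread and waits for it. */
int run_in_thread(int input)
{
    pthread_t thread;
    void *result = NULL;
    int value = input;
    int rc;

    /* NULL attributes select the default thread characteristics. */
    rc = pthread_create(&thread, NULL, double_it, &value);
    assert(rc == 0);

    /* Block until the thread terminates and collect its return value. */
    rc = pthread_join(thread, &result);
    assert(rc == 0);
    assert(result == &value);

    return value;
}
```

Passing the address of a local variable is safe here only because the caller joins the thread before the variable goes out of scope.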

Page 76: - Tuning and Analysis Utilities: TAU - Debugging on Blue Gene/P …tkaman/Presentations/TulinKaman... · 2009. 6. 24. · - Tuning and Analysis Utilities: TAU - Debugging on Blue

Who am I? Who are you?

pthread_t io_thread;
. . .
main()
{
    . . .
    pthread_create(&io_thread, . . . );
    . . .
}

io_routine()
{
    pthread_t thread;
    thread = pthread_self();
    if (pthread_equal(io_thread, thread)) {
        . . .
    }
}

pthread_create returns a thread handle of type pthread_t.

pthread_self obtains the thread handle of the calling thread.

pthread_equal compares one thread handle to another thread handle.

Page 77: - Tuning and Analysis Utilities: TAU - Debugging on Blue Gene/P …tkaman/Presentations/TulinKaman... · 2009. 6. 24. · - Tuning and Analysis Utilities: TAU - Debugging on Blue

Matrix-vector multiplication: Pthreads

void do_work(int row_m, int n, double *a, double *b, double *c);
void *do_work_pth(void *argpth);

main()
{
    pthread_t thread[NUMTHREADS];
    . . .
    for (t = 0; t < NUMTHREADS; t++)
        pthread_create(&thread[t], NULL, do_work_pth, (void *) t);

    for (t = 0; t < NUMTHREADS; t++)
        pthread_join(thread[t], NULL);
    . . .
}

void *do_work_pth(void *argpth)
{
    long tid = (long)argpth;
    . . .
    do_work(row_m, n, a, b, c);
    return NULL;
}

Page 78: - Tuning and Analysis Utilities: TAU - Debugging on Blue Gene/P …tkaman/Presentations/TulinKaman... · 2009. 6. 24. · - Tuning and Analysis Utilities: TAU - Debugging on Blue

Pthreads - OpenMP Comparison

• To make use of Pthreads, developers must write their code specifically for this API. This means they must include header files, declare Pthreads data structures, and call Pthreads-specific functions.

• Portable.

• Easy to implement.

• Portable, but Pthreads offers a much greater range of primitive functions that provide finer-grained control over threading operations.

Page 79: - Tuning and Analysis Utilities: TAU - Debugging on Blue Gene/P …tkaman/Presentations/TulinKaman... · 2009. 6. 24. · - Tuning and Analysis Utilities: TAU - Debugging on Blue

The Programming Model

• Shared-memory parallelism within the nodes• Distributed parallelism across the nodes

• Three mesh levels
  Finest (usual) mesh: for computation of PDE solutions
  Middle level (thread) mesh: each block belongs to one thread, and has a fraction of the memory of the chip for writing
  Coarsest level (MPI) mesh: each block belongs to one chip; messages between blocks. Each thread can read from the entire memory of the chip, but can write only to the restricted memory of the thread

Page 80: - Tuning and Analysis Utilities: TAU - Debugging on Blue Gene/P …tkaman/Presentations/TulinKaman... · 2009. 6. 24. · - Tuning and Analysis Utilities: TAU - Debugging on Blue

Three Mesh Levels: 4x4 (MPI) x 8x8 (threads) x 8x8 (PDE computational cells)

Processor mesh block (4x4 of these): MPI across boundaries.

Thread mesh block (8x8 of these per processor block): allows 64 threads (cores) per processor.

Global read and local write across processor mesh block.

Single computational cell for PDE solution: 8x8 of these across a single thread mesh block.

Page 81: - Tuning and Analysis Utilities: TAU - Debugging on Blue Gene/P …tkaman/Presentations/TulinKaman... · 2009. 6. 24. · - Tuning and Analysis Utilities: TAU - Debugging on Blue

FronTier - Riemann Problem

Page 82: - Tuning and Analysis Utilities: TAU - Debugging on Blue Gene/P …tkaman/Presentations/TulinKaman... · 2009. 6. 24. · - Tuning and Analysis Utilities: TAU - Debugging on Blue

Implementation:

• For the directional sweep on the rectangular grid states.

• The pthread_create() routine permits the programmer to pass only one argument to the thread start routine. To pass multiple arguments, create a structure that contains all of the arguments and pass a pointer to it.

• Reader/writer locks for shared data structure.

• pthread_join()
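The structure-argument technique can be sketched on the matrix-vector product used earlier in the talk. This is not the FronTier implementation; `struct row_block`, `multiply_rows`, `matvec_pthreads`, and the fixed `NUM_THREADS` are assumptions made for the illustration. Each thread writes only its own rows of the result, so the sketch needs no lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4   /* assumption: fixed team size for this sketch */

/* All arguments for one thread, bundled into a single structure. */
struct row_block {
    int row_begin, row_end;  /* rows [row_begin, row_end) of this thread */
    int n;                   /* number of columns */
    double *a;               /* result vector, written only in own rows */
    const double *b;         /* m x n matrix, row-major, read-only */
    const double *c;         /* input vector, read-only */
};

/* Start routine: unpacks the structure and multiplies its rows. */
static void *multiply_rows(void *arg)
{
    const struct row_block *blk = (const struct row_block *)arg;
    int i, j;

    for (i = blk->row_begin; i < blk->row_end; i++) {
        double sum = 0.0;
        for (j = 0; j < blk->n; j++)
            sum += blk->b[i * blk->n + j] * blk->c[j];
        blk->a[i] = sum;
    }
    return NULL;
}

/* Hypothetical driver: computes a = b * c with NUM_THREADS threads. */
void matvec_pthreads(int m, int n, double *a, const double *b, const double *c)
{
    pthread_t thread[NUM_THREADS];
    struct row_block block[NUM_THREADS];
    int t, rc;

    assert(m >= 0 && n >= 0);

    for (t = 0; t < NUM_THREADS; t++) {
        /* Split the m rows into NUM_THREADS contiguous ranges. */
        block[t].row_begin = (int)((long)m * t / NUM_THREADS);
        block[t].row_end = (int)((long)m * (t + 1) / NUM_THREADS);
        block[t].n = n;
        block[t].a = a;
        block[t].b = b;
        block[t].c = c;
        rc = pthread_create(&thread[t], NULL, multiply_rows, &block[t]);
        assert(rc == 0);
    }
    for (t = 0; t < NUM_THREADS; t++) {
        rc = pthread_join(thread[t], NULL);   /* wait for every worker */
        assert(rc == 0);
    }
}
```

Each thread gets its own structure from the `block` array; reusing a single structure for all threads would let the loop overwrite the arguments before a thread has read them.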

Page 83: - Tuning and Analysis Utilities: TAU - Debugging on Blue Gene/P …tkaman/Presentations/TulinKaman... · 2009. 6. 24. · - Tuning and Analysis Utilities: TAU - Debugging on Blue

References

IBM REDBOOKS:
• IBM System Blue Gene Solution: Blue Gene/P Application Development, Chapter 9.2: Debugging Applications

• Unfolding the IBM eServer Blue Gene Solution, Chapter 6.5: Debugging

TAU Documentation
www.cs.uoregon.edu/research/tau/

PSC/Intel Multi-Core Programming and Performance Tuning Workshop, March 23 - 26, 2009, Pittsburgh Supercomputing Center

Page 84: - Tuning and Analysis Utilities: TAU - Debugging on Blue Gene/P …tkaman/Presentations/TulinKaman... · 2009. 6. 24. · - Tuning and Analysis Utilities: TAU - Debugging on Blue

References

http://www.openmp.org

An Overview of OpenMP Video, Ruud van der Pas

Parallel Programming in OpenMP by Rohit Chandra, Leo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, Ramesh Menon

PThreads Programming by Bradford Nichols, Dick Buttlar, Jacqueline Proulx Farrell

POSIX Threads Programming, Blaise Barney, Lawrence Livermore National Laboratory
https://computing.llnl.gov/tutorials/pthreads/

OpenMP, Blaise Barney, Lawrence Livermore National Laboratory
https://computing.llnl.gov/tutorials/openMP/