© 2011 IBM Corporation
Systems and Technology Group
Parallel Environment - Today and Tomorrow
Chulho Kim – PE & Protocols Architect
SciComp – Summer Meeting
May 9-13, 2011
Features introduced in
PE 5.2.* in 2010
Recap
Parallel Environment PE 5.2.*
OS Support
• Power Platform
– OS
– AIX 5.3 & AIX 6.1
– SLES11 SP1 & RHEL6
– Interconnect
– DDR IB switches
– IBM Galaxy2 HCA, Mellanox ConnectX2 HCA
– Shared Memory (SMP), IP (UDP) and User Space (IB Verbs)
• x86 Platform
– OS
– RHEL6 GA supported in December 2010
– SLES11 SP1 GA support available May 2011
– Interconnect
– DDR & QDR IB switches
– Mellanox ConnectX2 HCA
– Shared Memory (SMP), IP (UDP) and User Space (IB Verbs)
Parallel Environment PE 5.2
• Dynamic Tasking Support
– Dynamic process management, the process creation and management capability provided by MPI-2.1, using an initial static allocation of resources
• Support for OpenSSH and RSH authentication
– CTsec method is withdrawn
• Scalability improvements in large-scale environments
– Scaling support for > 1M tasks
– POE/PMD are now 64-bit binaries
– MP_DEBUG_ATTACH={no, yes (default)}
– MP_DEBUG_ATTACH_DIR={/tmp (default), user-specified dir}
– MP_INFOLEVEL debug output is reduced
• Support for third-party resource manager APIs
– APIs define a set of common resource management interfaces for use by LoadLeveler and other resource managers
• Support for third-party schedulers
– PE CD contains LL RM install images to support 3rd-party schedulers
Parallel Environment PE 5.2
• POE runtime enhanced to support launching of User Space jobs without using any resource manager
– MP_RESD=poe & new environment variable: MP_POE_LAUNCH
– Useful in interactive development environments
– /etc/poe.limits can be modified by the system administrator to control this
– MP_POE_LAUNCH={ip, us, all, none (disables interactive jobs)}
• Runtime support for multi-protocol applications
– Mixed MPI, LAPI, UPC, CAF, OpenSHMEM, etc.
• On AIX, POE now utilizes the IBM MetaCluster Distributed Checkpoint Restart (MDCR) function, and its associated components (Application WPAR), to coordinate the checkpointing and restarting of jobs.
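For the interactive User Space launch without a resource manager described above, the pieces fit together roughly as follows. This is a sketch only: it assumes a working PE install, and ./myapp and the task count are placeholders.

```shell
# Administrator side: a hypothetical /etc/poe.limits entry restricting
# interactive POE to User Space jobs (value from the slide's option list):
#   MP_POE_LAUNCH=us

# User side: interactive launch with POE acting as its own resource manager
export MP_RESD=poe
poe ./myapp -procs 4
```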
Parallel Environment PE 5.2.1
MPI-IO Enhancements – All OS Platforms
• New MP environment variable and poe option
– MP_IOAGENT_CNT or -ioagent_cnt
– Specifies the number of I/O agents for a job; the default is 1 I/O agent per node
– Range is 0 to ‘all’
– MP_IOTASKLIST or -iotasklist
– Specifies which task(s) are to be I/O agents
– Example: MP_IOTASKLIST=4:2:4:6:8
– The first number is the number of I/O agents to specify
– The rest is the list of task IDs to be I/O agents – here, ranks 2, 4, 6, and 8 are I/O agents
– NOTE: When MP_IOTASKLIST is used, MP_IONODEFILE is ignored
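As a sanity check, the colon-separated MP_IOTASKLIST format above can be decoded with plain shell parameter expansion. The value is the slide's own example; nothing here requires PE itself:

```shell
#!/bin/sh
# Decode the MP_IOTASKLIST format: the first field is the number of
# I/O agents, the remaining fields are the task ranks to use as agents.
MP_IOTASKLIST=4:2:4:6:8
count=${MP_IOTASKLIST%%:*}   # everything before the first ":" -> "4"
ranks=${MP_IOTASKLIST#*:}    # everything after the first ":"  -> "2:4:6:8"
echo "agents=$count ranks=$ranks"
```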
• AIX 6.1 TL4 – Enhanced OS jitter co-scheduling function to exploit enhanced AIX scheduling hooks
– AIX 6.1 TL4 and above with PE 5.2.1; looking for customers willing to work with the HPC Performance team to help configure and set up
– New priority window mechanism
– Groups all non-HPC processes (i.e. daemons) to be scheduled in the less favored window
– Turned on through an internal environment variable that enables the new function
– Helps minimize OS jitter
• Please check the PE 5.2.1 Operation and Use manual for further info
Parallel Environment PE 5.2.1
POE Sub Job & Workflow Solution
Why do we need a parallel runtime to manage multi-job-step or workflow
execution within a static compute resource allocation?
• Schedulers can't easily guarantee that resources can be obtained to complete workflow requirements in a timely manner
• It is very hard or impossible to dynamically change a sub job depending on the previous outcome without resubmitting another request to the scheduler
• Workflow or multi-job-step execution is only supported under batch
• Batch submission scripts can't easily describe MPMD jobs
• It is hard to initiate a different environment setup per task or group of tasks for SPMD or MPMD jobs
• The output of a user's workflow is not consolidated into one file for the user
POE Workflow
[Diagram: a POE Master Task drives successive job steps within static compute resources (i.e. nodes, CPUs, memory, HCA, duration) allocated by the scheduler and resource manager]
• Job Step 1 – single job across all static compute resources
• Job Step 2 – multiple (n+m) distinct jobs across most of the static compute resources
• Job Step 3 – 2 distinct jobs, each using half of the static compute resources
• STDIN – can be a script, a master workflow program that writes the job steps for POE to execute to its STDOUT, or a user's command input file
• STDOUT/STDERR – output from each job step gets consolidated
NOTE: Each job can be SPMD or MPMD model, serial or parallel
Example demonstrating POE's Sub Job & Workflow capabilities
[Diagram: job submission through the LoadLeveler job queue]
• The user requests, through the scheduler (LL) or by running interactive POE, 14 nodes with 32 cores in dedicated mode
• The POE task starts on the same node as Task 0 and takes input from a command file, a script, or the user's workflow manager
• POE job steps 1-14: run 14 single-node serial jobs
• POE job steps 15-16 (SPMD): run two 7-node parallel jobs
• POE job step 17: WAIT for completion
• POE job steps 18-20 (MPMD): run 3 parallel MPMD jobs
• POE job steps 21-23, mixed (SPMD & MPMD): run 2 SPMD & 1 MPMD
• POE job step 24: COMPLETE – end job
Customer use case for POE Workflow/Subjob
Problem use case: parameter screening before submitting THE VERY LONG SIMULATION

# @ total_tasks = 1024 (get many nodes allocated)
# Run the parameter screening phase on a small number of iterations
for parm in 1 2 3 .... 64
do
time poe ./binary -procs 16 -nbiter 5 -param $parm > out.$parm &
done
# Find the optimal value of parm from the screening phase
bestparm = findbestparam()
# And now run THE VERY LONG LASTING SIMULATION
time poe ./binary -procs 1024 -nbiter 50000 -param $bestparm

This LoadLeveler script will not work as is :-(
=> POE subjobs to the rescue!
Customer use case for POE Workflow/Subjob
Solution use case: parameter screening before submitting THE VERY LONG SIMULATION

# @ total_tasks = 1024 (get many nodes allocated)
export MP_NEWJOB=parallel
export MP_LABELIO=yes
# We create a subjobs file containing n invocations of poe
for parm in 1 2 3 ...64; do
cat >> subjobs << EOF
/bin/ksh -c "./binary -nbiter 5 -param $parm"@$parm%16%mpi:*
EOF
done
cat >> subjobs << EOF
WAIT
COMPLETE
EOF
# Kick them off all at once
time poe -cmdfile subjobs
# Find the optimal value of parm from the screening phase
bestparm = findbestparam()
# And now run THE VERY LONG LASTING SIMULATION
time poe ./binary -nbiter 50000 -param $bestparm
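The script leaves findbestparam() as pseudocode. A minimal sketch of what that step might look like, assuming each out.$parm file ends with a single integer figure of merit and that smaller is better (both assumptions, not from the slide):

```shell
#!/bin/sh
# Stand-in screening results (in a real run these are written by the
# poe invocations in the screening phase):
printf '12\n' > out.1
printf '7\n'  > out.2
printf '9\n'  > out.3
# Pick the parm whose output file reports the smallest value.
best=""
bestval=""
for f in out.*; do
    val=$(tail -1 "$f")
    if [ -z "$best" ] || [ "$val" -lt "$bestval" ]; then
        best=${f#out.}
        bestval=$val
    fi
done
echo "bestparm=$best"
```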
POE Sub Job & Workflow Solution
Current documentation:
http://www.ibm.com/developerworks/wikis/display/hpccentral/Parallel+Environment+Docs+and+README+updates
• PE 5.2.1 README update – Sub jobs/Workflow support
Enhanced runtime to support workflow or sub jobs and MPMD runs
• MP_POE_CONTINUE={no (default), yes}
• MP_NEWJOB=parallel (new option)

Specifying subjobs
To specify the subjobs to be launched by POE, the MP_CMDFILE (-cmdfile) format
includes a new extension, as follows:
WAIT and COMPLETE commands
Format 1: Launching one instance of the executable
<executable name>@<user subjob id>%<subjob size>%<protocol> [options]
Format 2: Launching multiple instances of the executable
<executable name>@<user subjob id>%<subjob size>%<protocol>: <number>
Format 3: Launching the number of executable instances required to fill out the current subjob
<executable name>@<user subjob id>%<subjob size>%<protocol>: <*> [options]
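A hypothetical command file combining the three formats above (the executable names, subjob IDs, sizes, and options are placeholders, not from the slide):

```shell
# contents of a -cmdfile "subjobs" file (illustrative only):
./a.out@step1%16%mpi -v        # Format 1: one instance, 16-task subjob
./b.out@step2%32%mpi: 8        # Format 2: 8 instances of b.out
./c.out@step3%64%mpi: *        # Format 3: fill out the 64-task subjob
WAIT                           # wait for the steps above to finish
COMPLETE                       # end the POE session
```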
Parallel Environment PE 5.2
What's new in PDB 2.0
• What's new?
– Data aggregation
– Enhanced scalability via a pure tree launch/communication mechanism – using SCI
– Grouping capability
– Filtering output through expression matching
– Coloring of output messages (same, similar, different)
– Sharing a debug session with different users
– Multiple console support
– Multi-console attach with broadcast capability
• Please see the PE 5.2.1 documentation for details
ESSL & PESSL Update
ESSL Strategy
• Maximize performance for IBM Power platforms
• Algorithms are redesigned to support each major architecture (e.g. Power7 VSX)
• Support serial, SMP, SIMD and SPMD
• Support OpenMP and MPI applications
• Callable from Fortran, C, and C++
• Support AIX, SLES, RHEL
• Provide an easy way for customers who use industry-standard APIs to move their applications to IBM Power platforms and obtain high performance
GA Products
• ESSL 5.1 contains over 550 high-performance serial and SMP mathematical subroutines tuned for Power7, Power6, and Power6+
– AIX 5.3/AIX 6.1 GA 6/25/2010
– AIX 7.1 GA 9/2010
– SLES11SP1 GA 10/22/2010
– RHEL6 GA 12/2010
• Parallel ESSL 3.3 contains over 125 high-performance SPMD mathematical subroutines specifically designed to exploit the full power of clusters of Power servers connected with a high-performance switch
• Callable from FORTRAN, C, and C++
• SMP libraries are OpenMP based
• BLAS, LAPACK, Parallel BLAS and ScaLAPACK compatibility
• FFTW Version 3.1.2 to ESSL wrapper libraries
• http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.essl.doc/esslbooks.html
Mathematical areas
• ESSL
– Linear Algebra Subprograms
– Matrix Operations
– Linear Algebraic Equations
– Eigensystems Analysis
– Fourier Transforms, Convolution & Correlation & Related Computations
– Sorting & Searching
– Interpolation
– Numerical Quadrature
– Random Number Generation
• Parallel ESSL
– BLACS
– Level 2 Parallel BLAS
– Level 3 Parallel BLAS
– Linear Algebraic Equations
– Eigensystems Analysis
– Fourier Transforms
– Random Number Generation
Libraries
• ESSL 5.1
– Power thread-safe serial and SMP (OpenMP based) libraries
– 32 bit integers/32 bit pointers
– 32 bit integers/64 bit pointers
– 64 bit integers/64 bit pointers
• Parallel ESSL 4.1
– SMP libraries for use with the Parallel Environment MPI library
– 32 bit integers/32 bit pointers
– 32 bit integers/64 bit pointers
– Single Message Passing Thread with one or more Computational Threads
– 1 MPI Task per Core and single-threaded computations
– Mix MPI Tasks and multiple computational threads per node
– Infiniband
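The two execution modes listed above might be expressed at launch time roughly as follows. This is a sketch: the program name, task counts, and thread counts are placeholders, and OMP_NUM_THREADS is used here only because the SMP libraries are OpenMP based.

```shell
# Mode 1: 1 MPI task per core, single-threaded computation
export OMP_NUM_THREADS=1
poe ./pessl_app -procs 32

# Mode 2: mix MPI tasks with multiple computational threads per node
export OMP_NUM_THREADS=8
poe ./pessl_app -procs 4
```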
ESSL 5.1 (2010)
• Power7 support, including VSX versions of most BLAS and FFT subroutines
– 2 VSX (SIMD) units that can each handle 2 double-precision FP instructions per cycle
– 1 VSX unit can also handle 4 single-precision instructions per cycle
– 8 FLOPS per cycle
• New FFT & LAPACK subroutines
– scrftd, dcrftd
– srcftd, drcftd
– dsygvx
• XLF 13.1, XLC/C++ 11.1, UPC, X10
• AIX 5.3 & AIX 6.1 (6/25/2010)
• AIX 7.1 (9/2010)
• SLES11 SP1 (10/22/2010)
• RHEL6 (12/2010)
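The 8-FLOPS-per-cycle figure above follows from the unit counts on the slide, assuming each double-precision instruction is a fused multiply-add counting as 2 floating-point operations (the FMA factor is an assumption, not stated on the slide):

```shell
#!/bin/sh
units=2      # VSX (SIMD) units per core
dp_lanes=2   # double-precision instructions per unit per cycle
fma=2        # flops per fused multiply-add (assumption)
flops=$((units * dp_lanes * fma))
echo "peak DP FLOPS per cycle per core: $flops"
```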
Parallel ESSL 3.3.x (2010)
• 2/2010
– Power7 HV32 DDR IB support on AIX 6.1
• 6/2010
– Power7 HV32 DDR IB support on SLES11
• 12/2010
– Power7 HV32 DDR IB support on RHEL6
The futures material presented represents a mix of experimentation, prototyping and development.
While topics discussed may appear in some form in future IBM products, there is no guarantee that any particular feature will appear precisely as described.
Some work described may never go further than prototype form.
Parallel Environment
Long Range
Possibilities
The following slides represent a mix of ideas, prototyping efforts
and optimistic looks to the future – None of these are certain
Scaling
• The Parallel Environment (POE, MPI, LAPI & Tools) teams are investigating support for hundreds of thousands of tasks
– Smaller, tighter data structures in the protocol implementation √
– Retain performance at 100s of tasks while making 100s of thousands possible √ - ongoing
– Improved Collective Communications scaling √ - ongoing
– Revised early arrival buffer management √
– Better MPI-IO strategies √ - need additional user input (use cases)
– Robustness at scale √ - ongoing
– OS Jitter control strategies √ - ongoing
Runtime Enhancements
• Enhanced runtime to support workflow or sub jobs and MPMD runs √ - part of PE 5.2.1
• Support for multiple PE installs √ - ongoing
Alternate Programming Models
• MPI programming is difficult. Many HPC communities are seeking more intuitive ways to exploit parallelism
– PGAS (Partitioned Global Address Space) models
– Unified Parallel C √ – working with the IBM compiler team and the Berkeley UPC team
– CoArray Fortran
– Shared memory models
– OpenSHMEM (SHMEM™) √ – we are participating in this group
– Potential new programming model from IBM Research √ – working with the IBM Research team
– Part of the DARPA High Productivity Computing Languages effort
Options for Productivity Tools
• Infrastructures for tools (i.e. debugging) at huge scale √
– Collaboration with NCSA √ - contributed the SCI framework to the Eclipse PTP project; PDB is the first use case
• Leveraging of the Eclipse Parallel Tools Platform √
– An MPI code development assistant √
• Watson Research HPC Toolkit for performance analysis/tuning
– Part of PE 5.1 √
• Implementing MPI 2.0 process management within pre-defined resources √ - part of PE 5.2
MPI 3.0 Forum
• IBM and the MPI team are working with the MPI Forum on MPI 2.1, 2.2 and 3.0
• MPI 2.1 and 2.2 have fairly modest goals √
– MPI 2.1 will offer a single MPI standard (1.1 and 2.0 merged, and errata corrections formalized). The draft is in the formal approval process now
– MPI 2.2 content is being defined now; the approval target is very early 2009. Modest API extensions. √ - ongoing
– No changes required for MPI applications
– Low implementation effort – prompt availability predicted
MPI 3.0
• Major extensions possible (but not certain)
• Reference implementation required
– This was part of MPI 1 but not part of MPI 2. The lesson has been learned.
• Target for approval – 2010 – moving target – now maybe by end of 2011
• Implementations may deliver MPI 3.0 in stages
• There is a handful of proposal sets today
– Some may fall away as they are debated
– The process is open to new proposals now but presumably will close in 2009 - moving target
Chulho Kim
IBM Poughkeepsie UNIX Development Lab
Contact Information