© 2011 IBM Corporation
Systems and Technology Group
Parallel Environment - Today and Tomorrow
Chulho Kim – PE & Protocols Architect
SciComp – Summer Meeting
May 9-13, 2011
Features introduced in
PE 5.2.* in 2010
Recap
Parallel Environment PE 5.2.*
OS Support
• Power Platform
– OS
– AIX 5.3 & AIX 6.1
– SLES11 SP1 & RHEL6
– Interconnect
– DDR IB switches
– IBM Galaxy2 HCA, Mellanox ConnectX2 HCA
– Shared Memory (SMP), IP (UDP) and User Space (IB Verbs)
• x86 Platform
– OS
– RHEL6 GA supported in December 2010
– SLES11 SP1 GA support available May 2011
– Interconnect
– DDR & QDR IB switches
– Mellanox ConnectX2 HCA
– Shared Memory (SMP), IP (UDP) and User Space (IB Verbs)
Parallel Environment PE 5.2
• Dynamic Tasking Support
– Dynamic process management, the process creation and management capability provided by MPI-2.1, using an initial static allocation of resources
• Support for OpenSSH and RSH authentication
– CTsec method is withdrawn
• Scalability improvements in large-scale environments
– Scaling support for > 1M tasks
– POE/PMD are now 64-bit binaries
– MP_DEBUG_ATTACH={no, yes (default)}
– MP_DEBUG_ATTACH_DIR={/tmp (default), user-specified dir}
– MP_INFOLEVEL debug output is reduced
• Support for third-party resource manager APIs
– APIs define a set of common resource management interfaces for use by LoadLeveler and other resource managers
• Support for third-party schedulers
– PE CD contains LL RM install images to support 3rd-party schedulers
Parallel Environment PE 5.2
• POE runtime enhanced to support launching of User Space jobs without using any resource manager
– MP_RESD=poe & new environment variable: MP_POE_LAUNCH
– Useful in interactive development environments
– /etc/poe.limits can be modified by the system administrator to control this
– MP_POE_LAUNCH={ip, us, all, none (disables interactive jobs)}
• Runtime support for multi-protocol applications
– Mixed MPI, LAPI, UPC, CAF, OpenSHMEM, etc.
• On AIX, POE now utilizes the IBM MetaCluster Distributed Checkpoint Restart (MDCR) function, and its associated components (Application WPAR), to coordinate the checkpointing and restarting of jobs.
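For the interactive User Space launch without a resource manager described above, the pieces fit together roughly as follows. This is a sketch only: it assumes a working PE install, and ./myapp and the task count are placeholders.

```shell
# Administrator side: a hypothetical /etc/poe.limits entry restricting
# interactive POE to User Space jobs (value from the slide's option list):
#   MP_POE_LAUNCH=us

# User side: interactive launch with POE acting as its own resource manager
export MP_RESD=poe
poe ./myapp -procs 4
```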
Parallel Environment PE 5.2.1
MPI-IO Enhancements – All OS Platforms
• New MP environment variable and poe option
– MP_IOAGENT_CNT or -ioagent_cnt
– Specifies the number of I/O agents for a job; the default is 1 I/O agent per node
– Range is 0 to ‘all’
– MP_IOTASKLIST or -iotasklist
– Specifies which task(s) are to be I/O agents
– Example: MP_IOTASKLIST=4:2:4:6:8
– The first number is the number of I/O agents to specify
– The rest is the list of task IDs to be I/O agents – here, ranks 2, 4, 6, and 8 are I/O agents
– NOTE: When MP_IOTASKLIST is used, MP_IONODEFILE is ignored
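As a sanity check, the colon-separated MP_IOTASKLIST format above can be decoded with plain shell parameter expansion. The value is the slide's own example; nothing here requires PE itself:

```shell
#!/bin/sh
# Decode the MP_IOTASKLIST format: the first field is the number of
# I/O agents, the remaining fields are the task ranks to use as agents.
MP_IOTASKLIST=4:2:4:6:8
count=${MP_IOTASKLIST%%:*}   # everything before the first ":" -> "4"
ranks=${MP_IOTASKLIST#*:}    # everything after the first ":"  -> "2:4:6:8"
echo "agents=$count ranks=$ranks"
```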
• AIX 6.1 TL4 – Enhanced OS jitter co-scheduling function to exploit enhanced AIX scheduling hooks
– AIX 6.1 TL4 and above with PE 5.2.1; looking for customers willing to work with the HPC Performance team to help configure and set up
– New priority window mechanism
– Groups all non-HPC processes (i.e. daemons) to be scheduled in the less favored window
– Turned on through an internal environment variable that enables the new function
– Helps minimize OS jitter
• Please check the PE 5.2.1 Operation and Use manual for further info
Parallel Environment PE 5.2.1
POE Sub Job & Workflow Solution
Why do we need a parallel runtime to manage multi-job-step or workflow
execution within a static compute resource allocation?
• Schedulers can't easily guarantee that resources can be obtained to complete workflow requirements in a timely manner
• It is very hard or impossible to dynamically change a sub job depending on the previous outcome without resubmitting another request to the scheduler
• Workflow or multi-job-step execution is only supported under batch
• Batch submission scripts can't easily describe MPMD jobs
• It is hard to initiate a different environment setup per task or group of tasks for SPMD or MPMD jobs
• The output of a user's workflow is not consolidated into one file for the user
POE Workflow
[Diagram: a POE Master Task drives successive job steps within static compute resources (i.e. nodes, CPUs, memory, HCA, duration) allocated by the scheduler and resource manager]
• Job Step 1 – single job across all static compute resources
• Job Step 2 – multiple (n+m) distinct jobs across most of the static compute resources
• Job Step 3 – 2 distinct jobs, each using half of the static compute resources
• STDIN – can be a script, a master workflow program that writes the job steps for POE to execute to its STDOUT, or a user's command input file
• STDOUT/STDERR – output from each job step gets consolidated
NOTE: Each job can be SPMD or MPMD model, serial or parallel
Example demonstrating POE's Sub Job & Workflow capabilities
[Diagram: job submission through the LoadLeveler job queue]
• The user requests, through the scheduler (LL) or by running interactive POE, 14 nodes with 32 cores in dedicated mode
• The POE task starts on the same node as Task 0 and takes input from a command file, a script, or the user's workflow manager
• POE job steps 1-14: run 14 single-node serial jobs
• POE job steps 15-16 (SPMD): run two 7-node parallel jobs
• POE job step 17: WAIT for completion
• POE job steps 18-20 (MPMD): run 3 parallel MPMD jobs
• POE job steps 21-23, mixed (SPMD & MPMD): run 2 SPMD & 1 MPMD
• POE job step 24: COMPLETE – end job
Customer use case for POE Workflow/Subjob
Problem use case: parameter screening before submitting THE VERY LONG SIMULATION

# @ total_tasks = 1024 (get many nodes allocated)
# Run the parameter screening phase on a small number of iterations
for parm in 1 2 3 .... 64
do
time poe ./binary -procs 16 -nbiter 5 -param $parm > out.$parm &
done
# Find the optimal value of parm from the screening phase
bestparm = findbestparam()
# And now run THE VERY LONG LASTING SIMULATION
time poe ./binary -procs 1024 -nbiter 50000 -param $bestparm

This LoadLeveler script will not work as is :-(
=> POE subjobs to the rescue!
Customer use case for POE Workflow/Subjob
Solution use case: parameter screening before submitting THE VERY LONG SIMULATION

# @ total_tasks = 1024 (get many nodes allocated)
export MP_NEWJOB=parallel
export MP_LABELIO=yes
# We create a subjobs file containing n invocations of poe
for parm in 1 2 3 ...64; do
cat >> subjobs << EOF
/bin/ksh -c "./binary -nbiter 5 -param $parm"@$parm%16%mpi:*
EOF
done
cat >> subjobs << EOF
WAIT
COMPLETE
EOF
# Kick them off all at once
time poe -cmdfile subjobs
# Find the optimal value of parm from the screening phase
bestparm = findbestparam()
# And now run THE VERY LONG LASTING SIMULATION
time poe ./binary -nbiter 50000 -param $bestparm
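The script leaves findbestparam() as pseudocode. A minimal sketch of what that step might look like, assuming each out.$parm file ends with a single integer figure of merit and that smaller is better (both assumptions, not from the slide):

```shell
#!/bin/sh
# Stand-in screening results (in a real run these are written by the
# poe invocations in the screening phase):
printf '12\n' > out.1
printf '7\n'  > out.2
printf '9\n'  > out.3
# Pick the parm whose output file reports the smallest value.
best=""
bestval=""
for f in out.*; do
    val=$(tail -1 "$f")
    if [ -z "$best" ] || [ "$val" -lt "$bestval" ]; then
        best=${f#out.}
        bestval=$val
    fi
done
echo "bestparm=$best"
```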
POE Sub Job & Workflow Solution
Current documentation:
http://www.ibm.com/developerworks/wikis/display/hpccentral/Parallel+Environment+Docs+and+README+updates
• PE 5.2.1 README update – Sub jobs/Workflow support
Enhanced runtime to support workflow or sub jobs and MPMD runs
• MP_POE_CONTINUE={no (default), yes}
• MP_NEWJOB=parallel (new option)

Specifying subjobs
To specify the subjobs to be launched by POE, the MP_CMDFILE (-cmdfile) format
includes a new extension, as follows:
WAIT and COMPLETE commands
Format 1: Launching one instance of the executable
<executable name>@<user subjob id>%<subjob size>%<protocol> [options]
Format 2: Launching multiple instances of the executable
<executable name>@<user subjob id>%<subjob size>%<protocol>: <number>
Format 3: Launching the number of executable instances required to fill out the current subjob
<executable name>@<user subjob id>%<subjob size>%<protocol>: <*> [options]
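A hypothetical command file combining the three formats above (the executable names, subjob IDs, sizes, and options are placeholders, not from the slide):

```shell
# contents of a -cmdfile "subjobs" file (illustrative only):
./a.out@step1%16%mpi -v        # Format 1: one instance, 16-task subjob
./b.out@step2%32%mpi: 8        # Format 2: 8 instances of b.out
./c.out@step3%64%mpi: *        # Format 3: fill out the 64-task subjob
WAIT                           # wait for the steps above to finish
COMPLETE                       # end the POE session
```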
Parallel Environment PE 5.2
What's new in PDB 2.0
• What's new?
– Data aggregation
– Enhanced scalability via a pure tree launch/communication mechanism – using SCI
– Grouping capability
– Filtering output through expression matching
– Coloring of output messages (same, similar, different)
– Sharing a debug session with different users
– Multiple console support
– Multi-console attach with broadcast capability
• Please see the PE 5.2.1 documentation for details
ESSL & PESSL Update
ESSL Strategy
• Maximize performance for IBM Power platforms
• Algorithms are redesigned to support each major architecture (e.g. Power7 VSX)
• Support serial, SMP, SIMD and SPMD
• Support OpenMP and MPI applications
• Callable from Fortran, C, and C++
• Support AIX, SLES, RHEL
• Provide an easy way for customers who use industry-standard APIs to move their applications to IBM Power platforms and obtain high performance
GA Products
• ESSL 5.1 contains over 550 high-performance serial and SMP mathematical subroutines tuned for Power7, Power6, and Power6+
– AIX 5.3/AIX 6.1 GA 6/25/2010
– AIX 7.1 GA 9/2010
– SLES11SP1 GA 10/22/2010
– RHEL6 GA 12/2010
• Parallel ESSL 3.3 contains over 125 high-performance SPMD mathematical subroutines specifically designed to exploit the full power of clusters of Power servers connected with a high-performance switch
• Callable from FORTRAN, C, and C++
• SMP libraries are OpenMP based
• BLAS, LAPACK, Parallel BLAS and ScaLAPACK compatibility
• FFTW Version 3.1.2 to ESSL wrapper libraries
• http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.essl.doc/esslbooks.html
Mathematical areas
• ESSL
– Linear Algebra Subprograms
– Matrix Operations
– Linear Algebraic Equations
– Eigensystems Analysis
– Fourier Transforms, Convolution & Correlation & Related Computations
– Sorting & Searching
– Interpolation
– Numerical Quadrature
– Random Number Generation
• Parallel ESSL
– BLACS
– Level 2 Parallel BLAS
– Level 3 Parallel BLAS
– Linear Algebraic Equations
– Eigensystems Analysis
– Fourier Transforms
– Random Number Generation
Libraries
• ESSL 5.1
– Power thread-safe serial and SMP (OpenMP based) libraries
– 32 bit integers/32 bit pointers
– 32 bit integers/64 bit pointers
– 64 bit integers/64 bit pointers
• Parallel ESSL 4.1
– SMP libraries for use with the Parallel Environment MPI library
– 32 bit integers/32 bit pointers
– 32 bit integers/64 bit pointers
– Single Message Passing Thread with one or more Computational Threads
– 1 MPI Task per Core and single-threaded computations
– Mix MPI Tasks and multiple computational threads per node
– Infiniband
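The two execution modes listed above might be expressed at launch time roughly as follows. This is a sketch: the program name, task counts, and thread counts are placeholders, and OMP_NUM_THREADS is used here only because the SMP libraries are OpenMP based.

```shell
# Mode 1: 1 MPI task per core, single-threaded computation
export OMP_NUM_THREADS=1
poe ./pessl_app -procs 32

# Mode 2: mix MPI tasks with multiple computational threads per node
export OMP_NUM_THREADS=8
poe ./pessl_app -procs 4
```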
ESSL 5.1 (2010)
• Power7 support, including VSX versions of most BLAS and FFT subroutines
– 2 VSX (SIMD) units that can each handle 2 double-precision FP instructions per cycle
– 1 VSX unit can also handle 4 single-precision instructions per cycle
– 8 FLOPS per cycle
• New FFT & LAPACK subroutines
– scrftd, dcrftd
– srcftd, drcftd
– dsygvx
• XLF 13.1, XLC/C++ 11.1, UPC, X10
• AIX 5.3 & AIX 6.1 (6/25/2010)
• AIX 7.1 (9/2010)
• SLES11 SP1 (10/22/2010)
• RHEL6 (12/2010)
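The 8-FLOPS-per-cycle figure above follows from the unit counts on the slide, assuming each double-precision instruction is a fused multiply-add counting as 2 floating-point operations (the FMA factor is an assumption, not stated on the slide):

```shell
#!/bin/sh
units=2      # VSX (SIMD) units per core
dp_lanes=2   # double-precision instructions per unit per cycle
fma=2        # flops per fused multiply-add (assumption)
flops=$((units * dp_lanes * fma))
echo "peak DP FLOPS per cycle per core: $flops"
```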
Parallel ESSL 3.3.x (2010)
• 2/2010
– Power7 HV32 DDR IB support on AIX 6.1
• 6/2010
– Power7 HV32 DDR IB support on SLES11
• 12/2010
– Power7 HV32 DDR IB support on RHEL6
The futures material presented represents a mix of experimentation, prototyping and development.
While topics discussed may appear in some form in future IBM products, there is no guarantee that any particular feature will appear precisely as described.
Some work described may never go further than prototype form.
Parallel Environment
Long Range
Possibilities
The following slides represent a mix of ideas, prototyping efforts
and optimistic looks to the future – None of these are certain
Scaling
• The Parallel Environment (POE, MPI, LAPI & Tools) teams are investigating support for hundreds of thousands of tasks
– Smaller, tighter data structures in the protocol implementation √
– Retain performance at 100s of tasks while making 100s of thousands possible √ - ongoing
– Improved Collective Communications scaling √ - ongoing
– Revised early arrival buffer management √
– Better MPI-IO strategies √ - need additional user input (use cases)
– Robustness at scale √ - ongoing
– OS Jitter control strategies √ - ongoing
Runtime Enhancements
• Enhanced runtime to support workflow or sub jobs and MPMD runs √ - part of PE 5.2.1
• Support for multiple PE installs √ - ongoing
Alternate Programming Models
• MPI programming is difficult. Many HPC communities are seeking more intuitive ways to exploit parallelism
– PGAS (Partitioned Global Address Space) models
– Unified Parallel C √ – working with the IBM compiler team and the Berkeley UPC team
– CoArray Fortran
– Shared memory models
– OpenSHMEM (SHMEM™) √ – we are participating in this group
– Potential new programming model from IBM Research √ – working with the IBM Research team
– Part of the DARPA High Productivity Computing Languages effort
Options for Productivity Tools
• Infrastructures for tools (i.e. debugging) at huge scale √
– Collaboration with NCSA √ - contributed the SCI framework to the Eclipse PTP project; PDB is the first use case
• Leveraging of the Eclipse Parallel Tools Platform √
– An MPI code development assistant √
• Watson Research HPC Toolkit for performance analysis/tuning
– Part of PE 5.1 √
• Implementing MPI 2.0 process management within pre-defined resources √ - part of PE 5.2
MPI 3.0 Forum
• IBM and the MPI team are working with the MPI Forum on MPI 2.1, 2.2 and 3.0
• MPI 2.1 and 2.2 have fairly modest goals √
– MPI 2.1 will offer a single MPI standard (1.1 and 2.0 merged, and errata corrections formalized). The draft is in the formal approval process now
– MPI 2.2 content is being defined now; the approval target is very early 2009. Modest API extensions. √ - ongoing
– No changes required for MPI applications
– Low implementation effort – prompt availability predicted
MPI 3.0
• Major extensions possible (but not certain)
• Reference implementation required
– This was part of MPI 1 but not part of MPI 2. The lesson has been learned.
• Target for approval – 2010 – moving target – now maybe by end of 2011
• Implementations may deliver MPI 3.0 in stages
• There is a handful of proposal sets today
– Some may fall away as they are debated
– The process is open to new proposals now but presumably will close in 2009 - moving target
Chulho Kim
IBM Poughkeepsie UNIX Development Lab
Contact Information