

© 2009 IBM Corporation

Parallel Programming with X10/APGAS

The Challenge

Parallelism scaling replaces frequency scaling as the foundation for increased performance
Profound impact on future software

Multi-core chips, Cluster Parallelism, Heterogeneous Parallelism

[Figures: Heterogeneous parallelism – Cell Broadband Engine block diagram: 64-bit Power Architecture PPE (PPU, PXU, L1, L2) plus eight SPEs (SPU, SXU, local store, SMF) on the EIB (up to 96B/cycle), with MIC/Dual XDR memory and BIC/FlexIO interfaces. Multi-core chips – SMP node: PEs with L1 caches sharing an L2 cache over an interconnect. Cluster parallelism – SMP nodes (PEs, memory) and I/O gateway nodes on a "Scalable Unit" cluster interconnect switch/fabric (100's of such cluster nodes).]

Large Scale Parallelism: Blue Gene, Road Runner

IBM UPC and X10 teams

Through languages
– Asynchronous Co-Array Fortran: extension of CAF with asyncs
– Asynchronous UPC (AUPC): proper extension of UPC with asyncs
– X10 (already asynchronous): extension of sequential Java

Language runtimes share a common APGAS runtime, through an APGAS library in C, Fortran, Java (co-habiting with MPI)
– Implements PGAS: remote references, global data-structures
– Implements inter-place messaging; optimizes inlineable asyncs
– Implements global and/or collective operations
– Implements intra-place concurrency: atomic operations, algorithmic scheduler

Libraries reduce cost of adoption; languages offer enhanced productivity benefits

XL UPC status: on path to an IBM-supported product in 2011

APGAS Realization

Programming model is still based on shared memory.

– Familiar to many programmers.

Place hierarchies provide a way to deal with heterogeneity.
– Async data transfers between places are not an ad-hoc artifact of the Cell.

Asyncs offer an elegant framework subsuming multi-core parallelism and messaging.

There are many opportunities for compiler optimizations
– E.g., communication aggregation (a sketch follows after this slide).

– So the programmer can write more abstractly and still get good performance

There are many opportunities for static checking for concurrency/distribution design errors.

Programming model is implementable on a variety of hardware architectures.
– Leads to better application portability.

– There are many opportunities for hardware optimizations based on APGAS

APGAS Advantages
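
One way to picture the aggregation opportunity mentioned above: the fine-grained version below ships one inter-place async per item, while the aggregated version ships a single async per place that carries the whole batch. This is a hypothetical sketch (class and method names are illustrative, not from the slides), written in X10 2.x-style syntax where details such as Place.places() varied between releases; it shows the kind of rewrite an aggregating compiler could perform automatically, letting the programmer keep the fine-grained form.

```x10
import x10.io.Console;

// Hypothetical sketch of communication aggregation using only core constructs.
public class Aggregation {
    static val N = 8;                        // items to deliver to each place

    // Fine-grained version: N inter-place messages (asyncs) per place.
    static def fineGrained() {
        finish for (p in Place.places()) {
            for (var i: Int = 0; i < N; i++) {
                val item = i;                // captured value is copied to p
                at (p) async {
                    Console.OUT.println("place " + here.id + " got item " + item);
                }
            }
        }
    }

    // Aggregated version: one message per place carrying the whole batch.
    static def aggregated() {
        finish for (p in Place.places()) {
            at (p) async {
                for (var i: Int = 0; i < N; i++)
                    Console.OUT.println("place " + here.id + " got item " + i);
            }
        }
    }

    public static def main(args: Rail[String]) {
        fineGrained();
        aggregated();
    }
}
```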

X10 Project Status

X10 is an APGAS language in the Java family of languages

X10 is an open source project (Eclipse Public License)
• Documentation, releases, implementation source code, benchmarks, etc. all publicly available at http://x10-lang.org

X10 and X10DT 2.0 Just Released!
• Added structs for improved space/time efficiency
• More flexible distributed object model (global fields/methods)
• Static checking of place types (locality constraints)
• X10DT 2.0 supports the X10 C++ backend
• X10 2.0 used in the 2009 HPC Challenge (Class 2) submission

X10 2.0 Platforms
• Java backend (compiles X10 to Java)
  – Runs on any Java 5 JVM
  – Single-process implementation (all places in one JVM)
• C++ backend (compiles X10 to C++)
  – AIX, Linux, Cygwin, Mac OS, Solaris
  – PowerPC, x86, x86_64, SPARC
  – Multi-process implementation (one place per process)
  – Uses the common APGAS runtime

• X10 Innovation Grants
  – http://www.ibm.com/developerworks/university/innovation/
  – Program to support academic research and curricular development activities in the area of computing at scale on cloud computing platforms based on the X10 programming language

Asynchronous PGAS Programming Model

• A programming model provides an abstraction of the architecture that enables programmers to express their solutions in a manner relevant to their domain
  – Mathematicians write equations
  – MBAs write business logic

• Compilers, language runtimes, libraries, and operating systems implement the programming model, bridging the gap to the hardware.

• Development and performance tools provide the surrounding ecosystem for a programming model and its implementation.

• The evolution of programming models impacts
  – Design methodologies
  – Operating systems
  – Programming environments

[Figure: the programming model sits between design methodologies, operating systems, and programming environments above, and compilers, runtimes, libraries, and operating systems below.]

Programming Models: Bridging the Gap Between Programmer and Hardware

Fine grained concurrency

• async S

Atomicity

• atomic S

• when (c) S

Global data-structures

• points, regions, distributions, arrays

Place-shifting operations

• at (P) S

Ordering

• finish S

• clock

Two basic ideas: Places and Asynchrony
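
To make the constructs above concrete, here is a minimal hypothetical sketch, not taken from the slides; it is written against X10 2.x-style syntax, and details such as Place.places(), the Rail[String] main signature, and plain Int literals varied between releases. It uses async, at, and finish to run one activity per place, then atomic to serialize updates to place-local shared state.

```x10
import x10.io.Console;

class Counter { public var n: Int = 0; }      // mutable, place-local state

public class ApgasBasics {
    public static def main(args: Rail[String]) {
        // async + at (P) S + finish S: one activity per place, all joined here.
        finish for (p in Place.places()) {
            at (p) async {                    // place-shifting operation
                Console.OUT.println("hello from place " + here.id);
            }
        }

        // atomic S: isolated updates by many fine-grained asyncs in one place.
        val c = new Counter();
        finish for (var i: Int = 0; i < 100; i++) {
            async atomic { c.n = c.n + 1; }
        }
        Console.OUT.println("count = " + c.n); // 100 once the finish completes
    }
}
```

when (c) S and clock cover conditional atomicity and phased ordering in the same style; points, regions, distributions, and arrays provide the global data structures these activities operate on.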

X10
nodes   LU (GFlop/s)   RA (MUP/s)   Stream (GBytes/s)   FFT (GFlop/s)
4       354            6.34         325.7               23.67
8       666            12.31        650.5               40.62
16      1268           23.02        1287.8              65.92
32      -              43.1         2601.5              -

UPC
nodes   LU (GFlop/s)   RA (MUP/s)   Stream (GBytes/s)   FFT (GFlop/s)
4       379            5.5          140                 7.9
8       747            10.8         256                 13
16      1442           21.5         523                 26.3
32      2333           43.3         1224                39.8

Performance results: Power5+ cluster

[Charts: HPL perf. comparison and FFT perf. comparison, X10 vs. UPC (and peak for HPL), GFlop/s on a log scale, 4 to 32 nodes.]

IBM Poughkeepsie Benchmark Center

32 Power5+ nodes
16 SMT 2x processors/node
64 GB/node; 1.9 GHz

HPS switch, 2 GBytes/s/link

Performance results – Blue Gene/P

X10
nodes   LU (GFlop/s)   RA (GUP/s)   Stream (GBytes/s)   FFT (GFlop/s)
32      117            0.042        141                 -
1024    3893           1.05         4516                -
2048    -              -            -                   -
4096    -              -            -                   -

UPC
nodes   LU (GFlop/s)   RA (GUP/s)   Stream (GBytes/s)   FFT (GFlop/s)
32      242            0.04         168                 6.4
1024    7744           1.27         5376                156
2048    15538          2.54         -                   -
4096    28062          5.04         -                   -

IBM T.J. Watson Research Center (Watson Shaheen)

4 racks Blue Gene/P
1024 nodes/rack
4 CPUs/node; 850 MHz
4 GBytes/node RAM

16 x 16 x 16 torus

[Charts: HPL perf. comparison and FFT perf. comparison, X10 vs. UPC (and peak for HPL), GFlop/s on a log scale, 32 to 4096 nodes.]