TRANSCRIPT
© 2009 IBM Corporation
Parallel Programming with X10/APGAS
The Challenge

Parallelism scaling replaces frequency scaling as the foundation for increased performance, with a profound impact on future software.
– Multi-core chips
– Cluster parallelism
– Heterogeneous parallelism
[Diagram: two Cell Broadband Engine chips. Each has a 64-bit Power Architecture PPE with VMX (PPU, PXU, L1, L2 at 32B/cycle) and eight SPEs (each with SPU, SXU, Local Store, and SMF), connected by the EIB (up to 96B/cycle; 16B/cycle per port), with a MIC to dual XDR memory and a BIC to FlexIO.]

[Diagram: cluster architecture. SMP nodes (PEs with L1 $, shared L2 cache, and memory) connected by a "Scalable Unit" cluster interconnect switch/fabric with I/O gateway nodes; hundreds of such cluster nodes.]

Large Scale Parallelism
– Blue Gene
– Road Runner
APGAS Realization
IBM UPC and X10 teams

Through languages:
– Asynchronous Co-Array Fortran: extension of CAF with asyncs
– Asynchronous UPC (AUPC): a proper extension of UPC with asyncs
– X10 (already asynchronous): an extension of the sequential Java language

Language runtimes share a common APGAS runtime, exposed through an APGAS library in C, Fortran, and Java (co-habiting with MPI):
– Implements PGAS: remote references, global data structures
– Implements inter-place messaging; optimizes inlineable asyncs
– Implements global and/or collective operations
– Implements intra-place concurrency: atomic operations, algorithmic scheduler

Libraries reduce the cost of adoption; languages offer enhanced productivity benefits.

XL UPC status: on path to an IBM-supported product in 2011.

APGAS Advantages

The programming model is still based on shared memory.
– Familiar to many programmers.
Place hierarchies provide a way to deal with heterogeneity.
– Async data transfers between places are not an ad-hoc artifact of the Cell.
Asyncs offer an elegant framework subsuming multi-core parallelism and messaging.
There are many opportunities for compiler optimizations, e.g. communication aggregation.
– So the programmer can write more abstractly and still get good performance.
There are many opportunities for static checking for concurrency/distribution design errors.
The programming model is implementable on a variety of hardware architectures.
– Leads to better application portability.
– There are many opportunities for hardware optimizations based on APGAS.
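To illustrate the claim that asyncs subsume both multi-core parallelism and messaging, consider this sketch in the slides' own X10 notation (`S` is a statement placeholder as in the talk; `p` is an assumed place value):

```x10
async S;          // local async: multi-core parallelism within a place
at (p) async S;   // remote async: in effect a one-sided message to
                  // place p carrying the computation S with it
```

The same construct expresses both cases; only the target place differs, which is why one runtime mechanism can serve shared-memory and distributed execution.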
X10 Project Status

X10 is an APGAS language in the Java family of languages.

X10 is an open-source project (Eclipse Public License).
• Documentation, releases, implementation source code, benchmarks, etc. all publicly available at http://x10-lang.org

X10 and X10DT 2.0 just released!
• Added structs for improved space/time efficiency
• More flexible distributed object model (global fields/methods)
• Static checking of place types (locality constraints)
• X10DT 2.0 supports the X10 C++ backend
• X10 2.0 used in the 2009 HPC Challenge (Class 2) submission

X10 2.0 platforms
• Java backend (compiles X10 to Java)
  – Runs on any Java 5 JVM
  – Single-process implementation (all places in one JVM)
• C++ backend (compiles X10 to C++)
  – AIX, Linux, Cygwin, MacOS, Solaris; PowerPC, x86, x86_64, SPARC
  – Multi-process implementation (one place per process)
  – Uses the common APGAS runtime

X10 Innovation Grants
• http://www.ibm.com/developerworks/university/innovation/
• Program to support academic research and curricular development activities in the area of computing at scale on cloud-computing platforms based on the X10 programming language.
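As a point of reference, a minimal X10 program looks roughly like the following (approximate X10 2.0 syntax; exact signatures varied between releases, so treat this as a sketch rather than release-accurate code):

```x10
// Minimal X10 program (approximate X10 2.0 syntax).
public class Hello {
    public static def main(args: Rail[String]) {
        // 'here' is the current place; with the C++ backend each
        // place runs as a separate process.
        x10.io.Console.OUT.println("Hello from place " + here.id);
    }
}
```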
Asynchronous PGAS Programming Model

Programming Models: Bridging the Gap Between Programmer and Hardware

• A programming model provides an abstraction of the architecture that enables programmers to express their solutions in a manner relevant to their domain.
  – Mathematicians write equations
  – MBAs write business logic
• Compilers, language runtimes, libraries, and operating systems implement the programming model, bridging the gap to the hardware.
• Development and performance tools provide the surrounding ecosystem for a programming model and its implementation.
• The evolution of programming models impacts:
  – Design methodologies
  – Operating systems
  – Programming environments

[Diagram: the programming model sits between design methodologies, operating systems, and programming environments above, and compilers, runtimes, libraries, and operating systems below.]
Two basic ideas: Places and Asynchrony

Fine-grained concurrency
• async S
Atomicity
• atomic S
• when (c) S
Global data-structures
• points, regions, distributions, arrays
Place-shifting operations
• at (P) S
Ordering
• finish S
• clock
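A minimal sketch of how these constructs compose, using the slides' own `S` statement placeholders (`p`, `x`, `done`, and the statements `S1`–`S3` are assumed names, not from the talk):

```x10
finish {                     // ordering: wait for every async spawned below
    async S1;                // fine-grained concurrency: run S1 in parallel
    at (p) async S2;         // place-shifting: run S2 asynchronously at place p
    atomic { x = x + 1; }    // atomicity: update shared x without interference
    when (done) S3;          // conditional atomicity: run S3 once 'done' holds
}
```

The key design point is composition: `finish` gives a join over arbitrarily nested asyncs, so the same few constructs cover loop-level parallelism, remote messaging, and synchronization.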
Performance results: Power5+ cluster

IBM Poughkeepsie Benchmark Center:
32 Power5+ nodes; 16 two-way SMT processors/node; 64 GB/node; 1.9 GHz; HPS switch, 2 GBytes/s/link

X10
nodes  LU (GFlop/s)  RA (MUP/s)  Stream (GBytes/s)  FFT (GFlops/s)
4      354           6.34        325.7              23.67
8      666           12.31       650.5              40.62
16     1268          23.02       1287.8             65.92
32                   43.1        2601.5

UPC
nodes  LU (GFlop/s)  RA (MUP/s)  Stream (GBytes/s)  FFT (GFlops/s)
4      379           5.5         140                7.9
8      747           10.8        256                13
16     1442          21.5        523                26.3
32     2333          43.3        1224               39.8

[Charts: HPL and FFT performance comparisons, GFlop/s vs. nodes (4–32, log scale), for X10, UPC, and peak.]
Performance results – Blue Gene/P

IBM T.J. Watson Research Center, Watson Shaheen:
4 racks of Blue Gene/P; 1024 nodes/rack; 4 CPUs/node at 850 MHz; 4 GBytes/node RAM; 16 x 16 x 16 torus

X10
nodes  LU (GFlop/s)  RA (GUP/s)  Stream (GBytes/s)  FFT (GFlops/s)
32     117           0.042       141
1024   3893          1.05        4516
2048
4096

UPC
nodes  LU (GFlop/s)  RA (GUP/s)  Stream (GBytes/s)  FFT (GFlops/s)
32     242           0.04        168                6.4
1024   7744          1.27        5376               156
2048   15538         2.54
4096   28062         5.04

[Charts: HPL and FFT performance comparisons, GFlop/s vs. nodes (32–4096, log scale), for X10, UPC, and peak.]