TRANSCRIPT
© 2009 IBM Corporation
Parallel Programming with X10/APGAS
The Challenge

Parallelism scaling replaces frequency scaling as the foundation for increased performance, with a profound impact on future software.
– Multi-core chips
– Cluster parallelism
– Heterogeneous parallelism
[Diagram: two Cell Broadband Engine chips. Each has a 64-bit Power Architecture PPE with VMX (PPU, PXU, L1, L2 at 32B/cycle) and eight SPEs (each with SPU, SXU, Local Store, and SMF), connected by the EIB (up to 96B/cycle; 16B/cycle per port), with a MIC to dual XDR memory and a BIC to FlexIO.]

[Diagram: cluster architecture. SMP nodes (PEs with L1 $, shared L2 cache, and memory) connected by a "Scalable Unit" cluster interconnect switch/fabric with I/O gateway nodes; hundreds of such cluster nodes.]

Large Scale Parallelism
– Blue Gene
– Road Runner
APGAS Realization
IBM UPC and X10 teams

Through languages:
– Asynchronous Co-Array Fortran: extension of CAF with asyncs
– Asynchronous UPC (AUPC): a proper extension of UPC with asyncs
– X10 (already asynchronous): an extension of the sequential Java language

Language runtimes share a common APGAS runtime, exposed through an APGAS library in C, Fortran, and Java (co-habiting with MPI):
– Implements PGAS: remote references, global data structures
– Implements inter-place messaging; optimizes inlineable asyncs
– Implements global and/or collective operations
– Implements intra-place concurrency: atomic operations, algorithmic scheduler

Libraries reduce the cost of adoption; languages offer enhanced productivity benefits.

XL UPC status: on path to an IBM-supported product in 2011.

APGAS Advantages

The programming model is still based on shared memory.
– Familiar to many programmers.
Place hierarchies provide a way to deal with heterogeneity.
– Async data transfers between places are not an ad-hoc artifact of the Cell.
Asyncs offer an elegant framework subsuming multi-core parallelism and messaging.
There are many opportunities for compiler optimizations, e.g. communication aggregation.
– So the programmer can write more abstractly and still get good performance.
There are many opportunities for static checking for concurrency/distribution design errors.
The programming model is implementable on a variety of hardware architectures.
– Leads to better application portability.
– There are many opportunities for hardware optimizations based on APGAS.
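To illustrate the claim that asyncs subsume both multi-core parallelism and messaging, consider this sketch in the slides' own X10 notation (`S` is a statement placeholder as in the talk; `p` is an assumed place value):

```x10
async S;          // local async: multi-core parallelism within a place
at (p) async S;   // remote async: in effect a one-sided message to
                  // place p carrying the computation S with it
```

The same construct expresses both cases; only the target place differs, which is why one runtime mechanism can serve shared-memory and distributed execution.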
X10 Project Status

X10 is an APGAS language in the Java family of languages.

X10 is an open-source project (Eclipse Public License).
• Documentation, releases, implementation source code, benchmarks, etc. all publicly available at http://x10-lang.org

X10 and X10DT 2.0 just released!
• Added structs for improved space/time efficiency
• More flexible distributed object model (global fields/methods)
• Static checking of place types (locality constraints)
• X10DT 2.0 supports the X10 C++ backend
• X10 2.0 used in the 2009 HPC Challenge (Class 2) submission

X10 2.0 platforms
• Java backend (compiles X10 to Java)
  – Runs on any Java 5 JVM
  – Single-process implementation (all places in one JVM)
• C++ backend (compiles X10 to C++)
  – AIX, Linux, Cygwin, MacOS, Solaris; PowerPC, x86, x86_64, SPARC
  – Multi-process implementation (one place per process)
  – Uses the common APGAS runtime

X10 Innovation Grants
• http://www.ibm.com/developerworks/university/innovation/
• Program to support academic research and curricular development activities in the area of computing at scale on cloud-computing platforms based on the X10 programming language.
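As a point of reference, a minimal X10 program looks roughly like the following (approximate X10 2.0 syntax; exact signatures varied between releases, so treat this as a sketch rather than release-accurate code):

```x10
// Minimal X10 program (approximate X10 2.0 syntax).
public class Hello {
    public static def main(args: Rail[String]) {
        // 'here' is the current place; with the C++ backend each
        // place runs as a separate process.
        x10.io.Console.OUT.println("Hello from place " + here.id);
    }
}
```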
Asynchronous PGAS Programming Model

Programming Models: Bridging the Gap Between Programmer and Hardware

• A programming model provides an abstraction of the architecture that enables programmers to express their solutions in a manner relevant to their domain.
  – Mathematicians write equations
  – MBAs write business logic
• Compilers, language runtimes, libraries, and operating systems implement the programming model, bridging the gap to the hardware.
• Development and performance tools provide the surrounding ecosystem for a programming model and its implementation.
• The evolution of programming models impacts:
  – Design methodologies
  – Operating systems
  – Programming environments

[Diagram: the programming model sits between design methodologies, operating systems, and programming environments above, and compilers, runtimes, libraries, and operating systems below.]
Two basic ideas: Places and Asynchrony

Fine-grained concurrency
• async S
Atomicity
• atomic S
• when (c) S
Global data-structures
• points, regions, distributions, arrays
Place-shifting operations
• at (P) S
Ordering
• finish S
• clock
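A minimal sketch of how these constructs compose, using the slides' own `S` statement placeholders (`p`, `x`, `done`, and the statements `S1`–`S3` are assumed names, not from the talk):

```x10
finish {                     // ordering: wait for every async spawned below
    async S1;                // fine-grained concurrency: run S1 in parallel
    at (p) async S2;         // place-shifting: run S2 asynchronously at place p
    atomic { x = x + 1; }    // atomicity: update shared x without interference
    when (done) S3;          // conditional atomicity: run S3 once 'done' holds
}
```

The key design point is composition: `finish` gives a join over arbitrarily nested asyncs, so the same few constructs cover loop-level parallelism, remote messaging, and synchronization.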
Performance results: Power5+ cluster

IBM Poughkeepsie Benchmark Center:
32 Power5+ nodes; 16 two-way SMT processors/node; 64 GB/node; 1.9 GHz; HPS switch, 2 GBytes/s/link

X10
nodes  LU (GFlop/s)  RA (MUP/s)  Stream (GBytes/s)  FFT (GFlops/s)
4      354           6.34        325.7              23.67
8      666           12.31       650.5              40.62
16     1268          23.02       1287.8             65.92
32                   43.1        2601.5

UPC
nodes  LU (GFlop/s)  RA (MUP/s)  Stream (GBytes/s)  FFT (GFlops/s)
4      379           5.5         140                7.9
8      747           10.8        256                13
16     1442          21.5        523                26.3
32     2333          43.3        1224               39.8

[Charts: HPL and FFT performance comparisons, GFlop/s vs. nodes (4–32, log scale), for X10, UPC, and peak.]
Performance results – Blue Gene/P

IBM T.J. Watson Research Center, Watson Shaheen:
4 racks of Blue Gene/P; 1024 nodes/rack; 4 CPUs/node at 850 MHz; 4 GBytes/node RAM; 16 x 16 x 16 torus

X10
nodes  LU (GFlop/s)  RA (GUP/s)  Stream (GBytes/s)  FFT (GFlops/s)
32     117           0.042       141
1024   3893          1.05        4516
2048
4096

UPC
nodes  LU (GFlop/s)  RA (GUP/s)  Stream (GBytes/s)  FFT (GFlops/s)
32     242           0.04        168                6.4
1024   7744          1.27        5376               156
2048   15538         2.54
4096   28062         5.04

[Charts: HPL and FFT performance comparisons, GFlop/s vs. nodes (32–4096, log scale), for X10, UPC, and peak.]