Evaluating the Impact of Programming Language Features on the
Performance of Parallel Applications on Cluster Architectures

Konstantin Berlin¹, Jun Huan², Mary Jacob³,
Garima Kochhar³, Jan Prins², Bill Pugh¹,
P. Sadayappan³, Jaime Spacco¹, Chau-Wen Tseng¹

¹ University of Maryland, College Park
² University of North Carolina, Chapel Hill
³ Ohio State University
Motivation
• Irregular, fine-grain remote accesses
  – Several important applications
  – Message passing (MPI) is inefficient
• Language support for fine-grain remote accesses?
  – Less programmer effort than MPI
  – How efficient is it on clusters?
Contributions
• Experimental evaluation of language features
• Observations on programmability & performance
• Suggestions for efficient programming style
• Predictions on impact of architectural trends
The findings are not surprising, but we quantify the penalties that these language features incur on challenging applications.
Outline
• Introduction
• Evaluation
  – Parallel paradigms
  – Fine-grain applications
  – Performance
• Observations & recommendations
• Impact of architecture trends
• Related work
Parallel Paradigms

• Shared-memory (see the sketch after this list)
  – Pthreads, Java threads, OpenMP, HPF
  – Remote accesses same as normal accesses
• Distributed-memory
  – MPI, SHMEM
  – Remote accesses through explicit (aggregated) messages
  – User distributes data, translates addresses
• Distributed-memory with special remote accesses
  – Library to copy remote array sections (Global Arrays)
  – Extra processor dimension for arrays (Co-Array Fortran)
  – Global pointers (UPC)
  – Compiler / run-time system converts accesses to messages
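To make the shared-memory row concrete, here is a minimal sketch (ours, not from the talk) of the irregular table update under that paradigm: remote accesses are plain loads and stores. TABSIZE and LSTSIZE stand in for the benchmark's parameters, and next_ran() is an illustrative thread-safe 64-bit generator.

  #include <omp.h>
  #include <stdint.h>

  #define TABSIZE (1UL << 22)              /* table size, a power of two  */
  #define LSTSIZE 7                        /* log2 of the stable[] size   */

  extern const uint64_t stable[1 << LSTSIZE];

  /* xorshift step; any thread-safe 64-bit PRNG would do */
  static inline uint64_t next_ran(uint64_t *s) {
      *s ^= *s << 13; *s ^= *s >> 7; *s ^= *s << 17;
      return *s;
  }

  void smp_table_update(uint64_t *table, long nupdates)
  {
      #pragma omp parallel
      {
          uint64_t seed = 0x9E3779B97F4A7C15ULL + omp_get_thread_num();
          #pragma omp for
          for (long i = 0; i < nupdates; i++) {
              uint64_t ran = next_ran(&seed);
              #pragma omp atomic           /* colliding updates serialized */
              table[ran & (TABSIZE-1)] ^= stable[ran >> (64-LSTSIZE)];
          }
      }
  }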
Global Arrays

• Characteristics
  – Provides illusion of shared multidimensional arrays
  – Library routines
    • Copy rectangular-shaped data in & out of global arrays
    • Scatter / gather / accumulate operations on global arrays
  – Designed to be more restrictive, easier to use than MPI
• Example
  NGA_Access(g_a, lo, hi, &table, &ld);    /* direct pointer to the local patch */
  for (j = 0; j < PROCS; j++) {
      for (i = 0; i < counts[j]; i++) {
          /* index is derived from the incoming key copy[i] (elided on the slide) */
          table[index - lo[0]] ^= stable[copy[i] >> (64-LSTSIZE)];
      }
  }
  NGA_Release_update(g_a, lo, hi);         /* release the patch, mark it modified */
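The "copy rectangular data" routines in the bullets above give the coarse-grain path. A hedged sketch (start, end, and BLOCK are illustrative names) of pulling a patch of a 1-D global array into a private buffer with NGA_Get and writing it back with NGA_Put:

  int lo[1] = { start };              /* first global index of the patch     */
  int hi[1] = { end };                /* last global index (inclusive)       */
  int ld[1];                          /* leading dimensions; unused for 1-D  */
  uint64_t buf[BLOCK];                /* private scratch buffer              */

  NGA_Get(g_a, lo, hi, buf, ld);      /* one-sided copy in; no receiver code */
  /* ... update buf with ordinary local code ... */
  NGA_Put(g_a, lo, hi, buf, ld);      /* one-sided copy back out             */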
UPC

• Characteristics
  – Provides illusion of shared one-dimensional arrays
  – Language features
    • Global pointers to cyclically distributed arrays
    • Explicit one-sided messages (upc_memput(), upc_memget())
  – Compilers translate global pointers, generate communication
• Example

  shared unsigned int table[TABSIZE];
  for (i = 0; i < NUM_UPDATES/THREADS; i++) {
      long ran = random();          /* 64-bit value on the LP64 Alpha */
      table[ran & (TABSIZE-1)] ^= stable[ran >> (64-LSTSIZE)];
  }
  upc_barrier;
UPC
• Most flexible method for arbitrary remote references
• Supported by many vendors
• Can cast global pointers to local pointers
  – Efficiently access local portions of a global array
• Can program using a hybrid paradigm (sketch below)
  – Global pointers for fine-grain accesses
  – upc_memput(), upc_memget() for coarse-grain accesses
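A minimal sketch (our illustration, not code from the talk) of the hybrid style for the table update: updates are collected into private per-destination buckets (filling code elided), shipped with one upc_memput() per destination, and applied locally through casted pointers. MAXB, stage, rcount, bucket, and counts are invented names; TABSIZE and LSTSIZE are assumed macros as in the earlier examples, and a static-THREADS compilation environment is assumed.

  #include <upc_relaxed.h>
  #include <stdint.h>

  #define MAXB 4096                  /* max updates per (sender, dest) pair   */

  /* stage[dst] is thread dst's inbox; rcount[dst][src] its per-sender count */
  shared [TABSIZE/THREADS] uint64_t table[TABSIZE];
  shared [THREADS*MAXB]    uint64_t stage[THREADS][THREADS*MAXB];
  shared [THREADS]         int      rcount[THREADS][THREADS];

  uint64_t bucket[THREADS][MAXB];    /* private per-destination buckets       */
  int      counts[THREADS];
  extern uint64_t stable[];          /* lookup table, as in the examples above */

  void flush_buckets(void)
  {
      /* one bulk one-sided message per destination, not one per update */
      for (int dst = 0; dst < THREADS; dst++) {
          upc_memput(&stage[dst][MYTHREAD * MAXB],
                     bucket[dst], counts[dst] * sizeof(uint64_t));
          rcount[dst][MYTHREAD] = counts[dst];
          counts[dst] = 0;
      }
      upc_barrier;

      /* apply my inbox through local pointers (the casts keep this fast) */
      uint64_t *tab = (uint64_t *)&table[MYTHREAD * (TABSIZE/THREADS)];
      uint64_t *in  = (uint64_t *)&stage[MYTHREAD][0];
      int      *cnt = (int *)&rcount[MYTHREAD][0];
      for (int src = 0; src < THREADS; src++)
          for (int i = 0; i < cnt[src]; i++) {
              uint64_t ran = in[src * MAXB + i];
              tab[(ran & (TABSIZE-1)) % (TABSIZE/THREADS)]
                  ^= stable[ran >> (64-LSTSIZE)];
          }
      upc_barrier;                   /* inbox reusable once everyone is done  */
  }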
Target Applications
• Parallel applications
  – Most standard benchmarks are easy
    • Coarse-grain parallelism
    • Regular memory access patterns
• Applications with irregular, fine-grain parallelism
  – Irregular table access
  – Irregular dynamic access
  – Integer sort
Options for Fine-grain Parallelism

• Implement fine-grain algorithm
  – Low user effort, inefficient
• Implement coarse-grain algorithm
  – High user effort, efficient
• Implement hybrid algorithm
  – Most code uses fine-grain remote accesses
  – Performance-critical sections use coarse-grain algorithm
  – Reduces user effort at the cost of performance
• How much performance is lost on clusters?
Experimental Evaluation
• Cluster: Compaq AlphaServer SC (ORNL)
  – 64 nodes, each a 4-way Alpha EV67 SMP with 2 GB memory
  – Single Quadrics adapter per node
• SMP: Sun SunFire 6800 (UMD)
  – 24 UltraSPARC III processors, 24 GB memory total
  – Crossbar interconnect
Irregular Table Update
• Applications
  – Parallel databases, giant histogram / hash table
• Characteristics
  – Irregular parallel accesses to a large distributed table
  – Bucket version (aggregated non-local accesses) possible; an MPI sketch follows the example
• Example
  for (i = 0; i < NUM_UPDATES; i++) {
      ran = random();
      table[ran & (TABSIZE-1)] ^= stable[ran >> (64-LSTSIZE)];
  }
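For reference, a minimal MPI sketch (ours, not the talk's code) of the bucket version: each rank sorts its updates into per-owner buckets, exchanges all buckets in a single MPI_Alltoallv, and applies the received updates with purely local XORs. rand64() is a hypothetical 64-bit generator; TABSIZE and LSTSIZE are assumed macros as above, with TABSIZE divisible by the number of ranks.

  #include <mpi.h>
  #include <stdint.h>
  #include <stdlib.h>
  #include <string.h>

  uint64_t rand64(void);                               /* hypothetical PRNG */

  void bucket_update(uint64_t *local_table, const uint64_t *stable,
                     long nupd, int nprocs, MPI_Comm comm)
  {
      uint64_t block = TABSIZE / nprocs;               /* keys per rank     */
      uint64_t *keys = malloc(nupd * sizeof *keys);
      uint64_t *sbuf = malloc(nupd * sizeof *sbuf);
      int *scnt = calloc(nprocs, sizeof *scnt), *rcnt = malloc(nprocs * sizeof *rcnt);
      int *sdsp = malloc(nprocs * sizeof *sdsp), *rdsp = malloc(nprocs * sizeof *rdsp);
      int *fill = malloc(nprocs * sizeof *fill);

      for (long i = 0; i < nupd; i++) {                /* pass 1: count     */
          keys[i] = rand64();
          scnt[(keys[i] & (TABSIZE-1)) / block]++;
      }
      sdsp[0] = 0;
      for (int p = 1; p < nprocs; p++) sdsp[p] = sdsp[p-1] + scnt[p-1];
      memcpy(fill, sdsp, nprocs * sizeof *fill);
      for (long i = 0; i < nupd; i++)                  /* pass 2: bucket    */
          sbuf[fill[(keys[i] & (TABSIZE-1)) / block]++] = keys[i];

      MPI_Alltoall(scnt, 1, MPI_INT, rcnt, 1, MPI_INT, comm);
      rdsp[0] = 0;
      long total = rcnt[0];
      for (int p = 1; p < nprocs; p++) { rdsp[p] = rdsp[p-1] + rcnt[p-1]; total += rcnt[p]; }

      uint64_t *rbuf = malloc(total * sizeof *rbuf);
      MPI_Alltoallv(sbuf, scnt, sdsp, MPI_UINT64_T,    /* one aggregated    */
                    rbuf, rcnt, rdsp, MPI_UINT64_T, comm); /* exchange      */

      for (long i = 0; i < total; i++) {               /* apply locally     */
          uint64_t ran = rbuf[i];
          local_table[(ran & (TABSIZE-1)) % block] ^= stable[ran >> (64-LSTSIZE)];
      }
      free(keys); free(sbuf); free(rbuf);
      free(scnt); free(rcnt); free(sdsp); free(rdsp); free(fill);
  }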
[Chart: Table Update (AlphaServer, 2^22-entry table). Updates / msec / processor (log scale) vs. processors (1 to 32) for MPI, UPC (bucket), UPC, and Global Arrays.]
• UPC / Global Arrays fine-grain accesses inefficient (100x)
• Hybrid coarse-grain (bucket) version closer to MPI
[Chart: Table Update (Sun SMP, 2^25-entry table). Updates / msec / processor vs. processors (1 to 23) for Java, C / OpenMP, C / Pthreads, and UPC.]
• UPC fine-grain accesses inefficient even on SMP
Irregular Dynamic Accesses
• Applications
  – NAS CG (sparse conjugate gradient)
• Characteristics
  – Irregular parallel accesses to sparse data structures
  – Limited aggregation of non-local accesses
• Example (NAS CG)
  for (j = 0; j < n; j++) {
      sum = 0.0;
      for (k = rowstr[j]; k < rowstr[j+1]; k++)
          sum = sum + a[k] * v[colidx[k]];
      w[j] = sum;
  }
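For comparison, a hedged sketch of the fine-grain UPC port of this loop: declaring v shared is essentially the only source change, but each v[colidx[k]] can then become a remote reference generated by the run-time system.

  shared double v[N];        /* v now cyclically distributed across threads */

  for (j = 0; j < n; j++) {
      sum = 0.0;
      for (k = rowstr[j]; k < rowstr[j+1]; k++)
          sum = sum + a[k] * v[colidx[k]];   /* potentially one message each */
      w[j] = sum;
  }

The coarse-grain alternative is to fetch the needed sections of v in bulk (e.g., with upc_memget()) before the loop.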
[Chart: NAS Conjugate Gradient (AlphaServer, Class B). MFLOPS / processor vs. processors (1 to 32) for MPI, OpenMP, UPC (MPI), and UPC (OpenMP).]
• UPC fine-grain accesses inefficient (4x)
• Hybrid coarse-grain version slightly closer to MPI
Integer Sort
• Applications
  – NAS IS (integer sort)
• Characteristics
  – Parallel sort of a large list of integers
  – Non-local accesses can be aggregated
• Example (NAS IS)
  for (i = 0; i < NUM_KEYS; i++) {     /* sort local keys into buckets   */
      key = key_array[i];
      key_buff1[bucket_ptrs[key >> shift]++] = key;
  }
  upc_reduced_sum(…);                  /* get bucket size totals         */
  for (i = 0; i < THREADS; i++) {      /* get my bucket from every proc  */
      upc_memget(…);
  }
[Chart: NAS Integer Sort (AlphaServer, 128K keys). Parallel efficiency vs. processors (1 to 32) for MPI and UPC.]
• UPC fine-grain accesses inefficient
• Coarse-grain version closer to MPI (2-3x)
UPC Microbenchmarks
• Compare memory access costs (see the sketch after this list)
  – Quantify software overhead
• Private
  – Local memory, local pointer
• Shared-local
  – Local memory, global pointer
• Shared-same-node
  – Non-local memory (but on same SMP node)
• Shared-remote
  – Non-local memory
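The four categories map onto four ways of reaching a word in UPC. A minimal sketch with our labels; whether the last pointer is same-node or remote depends on thread placement:

  #include <upc_relaxed.h>

  shared double A[THREADS];     /* one element with affinity to each thread */

  void touch(void)
  {
      double x = 0.0;                                 /* private variable   */
      double        *pp = &x;                         /* private            */
      shared double *ps = &A[MYTHREAD];               /* shared-local       */
      double        *pc = (double *)&A[MYTHREAD];     /* cast to local ptr  */
      shared double *pr = &A[(MYTHREAD+1) % THREADS]; /* same-node / remote */

      *pp += 1.0;   /* plain load/store                                     */
      *ps += 1.0;   /* goes through the UPC runtime despite local data      */
      *pc += 1.0;   /* recovers plain load/store cost                       */
      *pr += 1.0;   /* may generate network traffic, depending on placement */
  }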
UPC Microbenchmarks
• Architectures
  – Compaq AlphaServer SC, v1.7 compiler (ORNL)
  – Compaq AlphaServer Marvel, v2.1 compiler (Florida)
  – Sun SunFire 6800 (UMD)
  – AMD Athlon PC cluster (OSU)
  – Cray T3E (MTU)
  – SGI Origin 2000 (UNC)
[Bar chart: UPC Point-wise Data Access Costs (read-modify-write of a double). Time per operation (ns, log scale) for Private, Shared-local, Shared-same-node, and Shared-remote accesses on the Alpha SC (v1.7), Alpha Marvel (v2.1), Sun SunFire 6800, AMD Athlon PC cluster, Cray T3E, and SGI Origin 2000. On the Alpha SC, for example, a private access costs 2.5 ns versus 1450 ns through a global pointer to the same local memory.]
• Global pointers significantly slower
• Improvement with newer UPC compilers
Observations
• Fine-grain programming model is seductive
  – Fine-grain access to shared data
  – Simple, clean, easy to program
• Not a good reflection of clusters
  – Efficient fine-grain communication not supported in hardware
  – Architectural trend towards clusters, away from the Cray T3E
Observations
• Programming model encourages poor performance
  – Easy to write simple fine-grain parallel programs
  – Poor performance on clusters
  – Can code around this, often at the cost of complicating your model or changing your algorithm
• Dubious that compiler techniques will solve this problem
  – Parallel algorithms with block data movement needed for clusters
  – Compilers cannot robustly transform fine-grain code into efficient block parallel algorithms
Observations
• Hybrid programming model is easy to use
  – Fine-grain shared data access is easy to program
  – Use coarse-grain message passing for performance
  – Faster code development, prototyping
  – Resulting code cleaner, more maintainable
• Must avoid degrading local computations
  – Allow the compiler to fully optimize code
  – Usually not achieved in fine-grain programming
  – A strength of explicit messages (MPI)
Recommendations
• Irregular coarse-grain algorithms
  – For peak cluster performance, use message passing
  – For quicker development, use the hybrid paradigm
• Use fine-grain remote accesses sparingly
  – Exploit existing code / libraries where possible
• Irregular fine-grain algorithms
  – Execute smaller problems on large SMPs
  – Must develop coarse-grain alternatives for clusters
• Fine-grain programming on clusters is still just a dream
  – Though compilers can help for regular access patterns
Impact of Architecture Trends
• Trends
  – Faster cluster interconnects (Quadrics, InfiniBand)
  – Larger memories
  – Processor / memory integration
  – Multithreading
• Raw performance improving
  – Faster networks (lower latency, higher bandwidth)
  – Absolute performance will improve
• But the same performance limitations remain
  – Avoid small messages
  – Avoid software communication overhead
  – Avoid penalizing local computation
Related Work
• Parallel paradigms
  – Many studies
  – PMODELS (DOE / NSF) project
• UPC benchmarking
  – T. El-Ghazawi et al. (GWU)
    • Good performance on NAS benchmarks
    • Mostly relies on upc_memput(), upc_memget()
  – K. Yelick et al. (Berkeley)
    • UPC compiler targeting GASNet
    • Compiler attempts to aggregate remote accesses
End of Talk