CSE 260 – Parallel Processing UCSD Fall 2006
A Performance Characterization of UPC
Presented by:
Anup Tapadia
Fallon Chen
Introduction
Unified Parallel C (UPC) is:
● An explicit parallel extension of ANSI C
● A partitioned global address space language
● Similar to the C language philosophy
  ● Concise and efficient syntax
  ● Common and familiar syntax and semantics for parallel C, with simple extensions to ANSI C
● Based on ideas in Split-C, AC, and PCP
UPC Execution Model
● A number of threads working independently in a SPMD fashion
● Number of threads specified at compile time or run time; available as the program variable THREADS
● MYTHREAD specifies the thread index (0..THREADS-1)
● upc_barrier is a global synchronization: all threads wait
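A minimal sketch showing all three of these primitives together (this program is not from the talk, but any UPC-compliant compiler should accept it):

    #include <stdio.h>
    #include <upc.h>

    int main(void) {
        /* every thread executes main() in SPMD style */
        printf("hello from thread %d of %d\n", MYTHREAD, THREADS);
        upc_barrier;   /* global synchronization: all threads wait */
        if (MYTHREAD == 0)
            printf("all %d threads passed the barrier\n", THREADS);
        return 0;
    }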
Simple Shared Memory Example

shared [1] int data[4][THREADS];
[Diagram: with block size 1 the array is distributed round-robin by element, so column j of data lands entirely on thread j: thread 0 holds data[0..3][0], thread 1 holds data[0..3][1], ..., thread n holds data[0..3][n].]
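As a hedged sketch of how this layout pairs with work distribution (the loop body and the values written are illustrative, not from the slides), a upc_forall loop can hand each column to its owning thread, so every access below is local:

    #include <upc.h>

    shared [1] int data[4][THREADS];  /* cyclic: column j lives on thread j */

    int main(void) {
        int i, j;
        /* the affinity expression &data[0][j] runs iteration j on thread j */
        upc_forall (j = 0; j < THREADS; j++; &data[0][j]) {
            for (i = 0; i < 4; i++)
                data[i][j] = i * THREADS + j;   /* purely local writes */
        }
        upc_barrier;
        return 0;
    }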
Example: Monte Carlo Pi Calculation
● Estimate π by throwing darts at a unit square
● Calculate the percentage that fall in the unit circle
  ● Area of square = r^2 = 1
  ● Area of circle quadrant = (1/4) · π · r^2 = π/4
● Randomly throw darts at (x, y) positions
● If x^2 + y^2 < 1, then the point is inside the circle
● Compute the ratio: π ≈ 4 · (# points inside / # points total)
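The deck does not include the source; a minimal UPC sketch of this computation could look like the following (TRIALS, the per-thread seeding, and the output format are illustrative assumptions):

    #include <stdio.h>
    #include <stdlib.h>
    #include <upc.h>

    #define TRIALS 1000000           /* darts per thread (illustrative) */

    shared int hits[THREADS];        /* one slot per thread, written locally */

    int main(void) {
        int i, my_hits = 0;
        srand(1 + MYTHREAD);         /* crude per-thread seed */
        for (i = 0; i < TRIALS; i++) {
            double x = (double)rand() / RAND_MAX;
            double y = (double)rand() / RAND_MAX;
            if (x * x + y * y < 1.0)
                my_hits++;           /* dart landed inside the quadrant */
        }
        hits[MYTHREAD] = my_hits;
        upc_barrier;                 /* wait until every count is written */
        if (MYTHREAD == 0) {
            long total = 0;
            for (i = 0; i < THREADS; i++)
                total += hits[i];    /* remote reads of each thread's count */
            printf("pi is approximately %f\n",
                   4.0 * total / ((double)TRIALS * THREADS));
        }
        return 0;
    }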
Monte Carlo Pi Scaling
[Chart: UPC Monte Carlo Pi; execution time in seconds (0-45) vs. number of procs (4, 8, 16, 32).]
Ring Performance - DataStar
[Chart: Ring Bandwidth on 32 Procs - DataStar; bandwidth (0-3000) vs. message size from 1 B to 4 MB, UPC_Ring vs. MPI Ring.]
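The benchmark source is not in the slides; its inner step presumably looks something like this hedged sketch, where each thread pushes a message of the given size to its ring neighbour with one bulk put (the buffer names and the 4 MB cap are assumptions, and the block size must stay within the implementation's UPC_MAX_BLOCK_SIZE):

    #include <stddef.h>
    #include <upc.h>

    #define MAXBYTES (1 << 22)       /* largest message tested: 4 MB */

    /* one receive block per thread; row t has affinity to thread t */
    shared [MAXBYTES] char rbuf[THREADS][MAXBYTES];
    char sbuf[MAXBYTES];             /* private send buffer */

    /* push nbytes to the next thread around the ring, then synchronize;
       timing this over sizes 1 B .. 4 MB yields curves like those above */
    void ring_step(size_t nbytes) {
        int next = (MYTHREAD + 1) % THREADS;
        upc_memput(&rbuf[next][0], sbuf, nbytes);
        upc_barrier;
    }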
Ring Performance - Spindel
[Chart: Ring with 8 nodes - Spindel Test Cluster; bandwidth (0-140) vs. message size from 1 B to 4 MB, UPC Ring vs. MPI Ring.]
Ring Performance - DataStar
[Chart: Ring Delay for 32 procs - DataStar; delay in usec (0-12000) vs. message size from 1 B to 4 MB, UPC Ring vs. MPI Ring.]
Ring Performance - Spindel
[Chart: Ring Delay on 8 procs - Spindel Test Cluster; delay (0-70000) vs. message size from 1 B to 4 MB, UPC Ring vs. MPI Ring.]
Parallel Binary Sort
1 2 3 4 5 6 7 8                  (final merge)
1 2 5 7  |  3 4 6 8              (merge sorted pairs of pairs)
1 7  |  2 5  |  4 8  |  3 6      (merge sorted pairs)
1  7  5  2  8  4  6  3           (unsorted input, one element per leaf)
Parallel Binary Sort (cont.)
1 7 5 2 8 4 6 3                  (unsorted input)
1 7 5 2  |  8 4 6 3              (split into halves)
1 7  |  5 2  |  8 4  |  6 3      (split into quarters)
1 | 7 | 5 | 2 | 8 | 4 | 6 | 3    (one element per processor)
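The talk's sort source is not included; the sketch below follows the same split/sort/merge pattern in UPC (all names, the problem size, and the random input are invented for illustration, and THREADS is assumed to be a power of two that divides N):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <upc.h>

    #define N 1024                   /* total elements (illustrative) */

    /* runs[t] is one contiguous block with affinity to thread t; it holds
       thread t's current sorted run as the merge tree is climbed */
    shared [N] int runs[THREADS][N];

    static int cmp(const void *x, const void *y) {
        int u = *(const int *)x, v = *(const int *)y;
        return (u > v) - (u < v);
    }

    int main(void) {
        int chunk = N / THREADS, mine = chunk, step, i;
        int *buf = malloc(N * sizeof(int));
        int *his = malloc(N * sizeof(int));
        int *out = malloc(N * sizeof(int));

        /* phase 1: sort a local chunk (random data for illustration) */
        srand(MYTHREAD + 1);
        for (i = 0; i < chunk; i++) buf[i] = rand() % 1000;
        qsort(buf, chunk, sizeof(int), cmp);
        upc_memput(&runs[MYTHREAD][0], buf, chunk * sizeof(int));
        upc_barrier;

        /* phase 2: merge pairwise up the binary tree */
        for (step = 1; step < THREADS; step *= 2) {
            if (MYTHREAD % (2 * step) == 0) {
                int partner = MYTHREAD + step, a = 0, b = 0, k = 0;
                upc_memget(his, &runs[partner][0], mine * sizeof(int));
                while (a < mine && b < mine)     /* standard two-way merge */
                    out[k++] = (buf[a] <= his[b]) ? buf[a++] : his[b++];
                while (a < mine) out[k++] = buf[a++];
                while (b < mine) out[k++] = his[b++];
                memcpy(buf, out, k * sizeof(int));
                mine = k;
                upc_memput(&runs[MYTHREAD][0], buf, mine * sizeof(int));
            }
            upc_barrier;
        }
        if (MYTHREAD == 0)
            printf("sorted %d elements on %d threads\n", mine, THREADS);
        return 0;
    }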
MPI Binary Sort Scaling (Spindel Test Cluster)
[Chart: MPI Binary Sort; execution time in seconds (0-0.2) vs. number of procs (2-8).]
A Performance Characterization of UPC
Fallon Chen
Matrix Multiply
● Basic square matrix multiply: A x B = C
● A, B, and C are NxN matrices
● In UPC, we can take advantage of the data layout for matrix multiply when N is a multiple of the number of THREADS
● Store A row-wise
● Store B column-wise
Data Layout
● A (N x P) is stored row-wise: thread 0 holds elements 0 .. (N*P/THREADS)-1, thread 1 holds (N*P/THREADS) .. (2*N*P/THREADS)-1, ..., thread THREADS-1 holds ((THREADS-1)*N*P)/THREADS .. (THREADS*N*P/THREADS)-1
● B (P x M) is stored column-wise: thread 0 holds columns 0 .. (M/THREADS)-1, and so on up to thread THREADS-1
● Note: N and M are assumed to be multiples of THREADS
(images by Kathy Yelick, from the UPC Tutorial)
Algorithm
● At each thread, get a local copy of the row(s) of A that have affinity to that particular thread
● At each thread, broadcast the columns of B using a UPC collective function, so that at the end each thread has a copy of B
● Multiply the row(s) of A by B to produce a row (or rows) of C
● Very short: about 100 lines of code
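The actual 100-line program is not reproduced in the deck; the sketch below shows declarations matching the layout slide plus a simplified kernel that reads B directly through shared references instead of broadcasting it (a static-THREADS compilation environment is assumed so THREADS may appear in the block sizes, and the dimensions are illustrative multiples of THREADS):

    #include <upc.h>

    #define N 512
    #define P 512
    #define M 512

    shared [N*P/THREADS] double A[N][P];  /* rows of A blocked by thread */
    shared [M/THREADS]   double B[P][M];  /* columns of B split by thread */
    shared [N*M/THREADS] double C[N][M];  /* C distributed like A */

    int main(void) {
        int i, j, k;
        /* each iteration of i runs on the thread owning row i of C,
           so A[i][k] and C[i][j] are local; B[k][j] may be remote */
        upc_forall (i = 0; i < N; i++; &C[i][0]) {
            for (j = 0; j < M; j++) {
                double sum = 0.0;
                for (k = 0; k < P; k++)
                    sum += A[i][k] * B[k][j];
                C[i][j] = sum;
            }
        }
        upc_barrier;
        return 0;
    }

The version measured in the talk instead copies its rows of A into private memory and replicates B with a collective (e.g. upc_all_broadcast from <upc_collective.h>), turning the fine-grained remote reads of B[k][j] into bulk transfers.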
[Chart: Latency, UPC vs. MPI, 4 and 8 Processors; time in seconds (0-0.7) vs. N (grid size, 0-1200) for UPC-4, UPC-8, MPI-4, MPI-8.]
[Chart: UPC Scaling, Matrix Multiply; time in seconds (0-0.4) vs. N (grid size, 0-1200) for 4 and 8 processors.]
Connected Components Labeling
● Used a union-find algorithm for global relabeling
● Stored global labels as a shared array, and used a shared array to exchange ghost cells
● Directly accessing a shared array in a loop is slow for large amounts of data
● Need to use the bulk copies upc_memput and upc_memget, but then you have to attend carefully to how the data is laid out (see the next two slides for what happens if you don't)
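As a hedged illustration of that last point (the array names and ghost-row length here are invented, not from the CCL code), the two styles of neighbour exchange look like this, and only the second amortizes the per-access remote latency:

    #include <upc.h>

    #define NGHOST 1024              /* ghost-row length (illustrative) */

    /* one ghost row per thread; row t has affinity to thread t */
    shared [NGHOST] int ghost[THREADS][NGHOST];
    int local_ghost[NGHOST];         /* private landing area */

    void fetch_neighbour_ghosts(void) {
        int i;
        if (MYTHREAD + 1 < THREADS) {
            /* slow (shown for contrast): one fine-grained remote read
               per element of the shared array */
            for (i = 0; i < NGHOST; i++)
                local_ghost[i] = ghost[MYTHREAD + 1][i];

            /* fast: the same transfer as a single bulk copy */
            upc_memget(local_ghost, &ghost[MYTHREAD + 1][0],
                       NGHOST * sizeof(int));
        }
        upc_barrier;
    }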
UPC CCL Scaling
[Chart: UPC CCL execution time (0-6) vs. problem size (100-900) for 4, 8, 16, and 32 processors.]
[Chart: Latency, UPC vs. MPI, Connected Components Labeling, 16 and 32 Processors; time on a log scale (0.0001-10) vs. problem size (0-1000) for MPI-16, UPC-16, MPI-32, UPC-32.]
Did UPC Help or Hurt?
● The global view of memory is a useful aid in debugging and development
● Redistribution routines are pretty easy to write
● Efficient code is no easier to write than in MPI, because you have to consider the shared-memory data layout when fine-tuning the code
Conclusions
● UPC is easy to program in for C writers, at times significantly easier than alternative paradigms
● UPC exhibits very little overhead compared with MPI for problems that are embarrassingly parallel; no tuning is necessary
● For other problems, compiler optimizations are happening but are not fully there yet
● With hand-tuning, UPC performance compared favorably with MPI
● Hand-tuned code, with block moves, is still substantially simpler than message-passing code