CSE 260 – Parallel Processing UCSD Fall 2006
A Performance Characterization of UPC
Presented by:
Anup Tapadia
Fallon Chen
Introduction
Unified Parallel C (UPC) is:
● An explicit parallel extension of ANSI C
● A partitioned global address space language
● Similar to the C language philosophy
  ● Concise and efficient syntax
  ● Common and familiar syntax and semantics for parallel C, with simple extensions to ANSI C
● Based on ideas in Split-C, AC, and PCP
UPC Execution Model
● A number of threads working independently in a SPMD fashion
● Number of threads specified at compile time or run time; available as the program variable THREADS
● MYTHREAD specifies the thread index (0..THREADS-1)
● upc_barrier is a global synchronization: all threads wait
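A minimal sketch showing all three of these primitives together (this program is not from the talk, but any UPC-compliant compiler should accept it):

    #include <stdio.h>
    #include <upc.h>

    int main(void) {
        /* every thread executes main() in SPMD style */
        printf("hello from thread %d of %d\n", MYTHREAD, THREADS);
        upc_barrier;   /* global synchronization: all threads wait */
        if (MYTHREAD == 0)
            printf("all %d threads passed the barrier\n", THREADS);
        return 0;
    }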
Simple Shared Memory Example

shared [1] int data[4][THREADS];
[Diagram: with block size 1 the array is distributed round-robin by element, so column j of data lands entirely on thread j: thread 0 holds data[0..3][0], thread 1 holds data[0..3][1], ..., thread n holds data[0..3][n].]
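As a hedged sketch of how this layout pairs with work distribution (the loop body and the values written are illustrative, not from the slides), a upc_forall loop can hand each column to its owning thread, so every access below is local:

    #include <upc.h>

    shared [1] int data[4][THREADS];  /* cyclic: column j lives on thread j */

    int main(void) {
        int i, j;
        /* the affinity expression &data[0][j] runs iteration j on thread j */
        upc_forall (j = 0; j < THREADS; j++; &data[0][j]) {
            for (i = 0; i < 4; i++)
                data[i][j] = i * THREADS + j;   /* purely local writes */
        }
        upc_barrier;
        return 0;
    }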
Example: Monte Carlo Pi Calculation
● Estimate π by throwing darts at a unit square
● Calculate the percentage that fall in the unit circle
  ● Area of square = r^2 = 1
  ● Area of circle quadrant = (1/4) · π · r^2 = π/4
● Randomly throw darts at (x, y) positions
● If x^2 + y^2 < 1, then the point is inside the circle
● Compute the ratio: π ≈ 4 · (# points inside / # points total)
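The deck does not include the source; a minimal UPC sketch of this computation could look like the following (TRIALS, the per-thread seeding, and the output format are illustrative assumptions):

    #include <stdio.h>
    #include <stdlib.h>
    #include <upc.h>

    #define TRIALS 1000000           /* darts per thread (illustrative) */

    shared int hits[THREADS];        /* one slot per thread, written locally */

    int main(void) {
        int i, my_hits = 0;
        srand(1 + MYTHREAD);         /* crude per-thread seed */
        for (i = 0; i < TRIALS; i++) {
            double x = (double)rand() / RAND_MAX;
            double y = (double)rand() / RAND_MAX;
            if (x * x + y * y < 1.0)
                my_hits++;           /* dart landed inside the quadrant */
        }
        hits[MYTHREAD] = my_hits;
        upc_barrier;                 /* wait until every count is written */
        if (MYTHREAD == 0) {
            long total = 0;
            for (i = 0; i < THREADS; i++)
                total += hits[i];    /* remote reads of each thread's count */
            printf("pi is approximately %f\n",
                   4.0 * total / ((double)TRIALS * THREADS));
        }
        return 0;
    }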
Monte Carlo Pi Scaling
[Chart: UPC Monte Carlo Pi; execution time in seconds (0-45) vs. number of procs (4, 8, 16, 32).]
Ring Performance - DataStar
[Chart: Ring Bandwidth on 32 Procs - DataStar; bandwidth (0-3000) vs. message size from 1 B to 4 MB, UPC_Ring vs. MPI Ring.]
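The benchmark source is not in the slides; its inner step presumably looks something like this hedged sketch, where each thread pushes a message of the given size to its ring neighbour with one bulk put (the buffer names and the 4 MB cap are assumptions, and the block size must stay within the implementation's UPC_MAX_BLOCK_SIZE):

    #include <stddef.h>
    #include <upc.h>

    #define MAXBYTES (1 << 22)       /* largest message tested: 4 MB */

    /* one receive block per thread; row t has affinity to thread t */
    shared [MAXBYTES] char rbuf[THREADS][MAXBYTES];
    char sbuf[MAXBYTES];             /* private send buffer */

    /* push nbytes to the next thread around the ring, then synchronize;
       timing this over sizes 1 B .. 4 MB yields curves like those above */
    void ring_step(size_t nbytes) {
        int next = (MYTHREAD + 1) % THREADS;
        upc_memput(&rbuf[next][0], sbuf, nbytes);
        upc_barrier;
    }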
Ring Performance - Spindel
[Chart: Ring with 8 nodes - Spindel Test Cluster; bandwidth (0-140) vs. message size from 1 B to 4 MB, UPC Ring vs. MPI Ring.]
Ring Performance - DataStar
[Chart: Ring Delay for 32 procs - DataStar; delay in usec (0-12000) vs. message size from 1 B to 4 MB, UPC Ring vs. MPI Ring.]
Ring Performance - Spindel
[Chart: Ring Delay on 8 procs - Spindel Test Cluster; delay (0-70000) vs. message size from 1 B to 4 MB, UPC Ring vs. MPI Ring.]
Parallel Binary Sort
1 2 3 4 5 6 7 8                  (final merge)
1 2 5 7  |  3 4 6 8              (merge sorted pairs of pairs)
1 7  |  2 5  |  4 8  |  3 6      (merge sorted pairs)
1  7  5  2  8  4  6  3           (unsorted input, one element per leaf)
Parallel Binary Sort (cont.)
1 7 5 2 8 4 6 3                  (unsorted input)
1 7 5 2  |  8 4 6 3              (split into halves)
1 7  |  5 2  |  8 4  |  6 3      (split into quarters)
1 | 7 | 5 | 2 | 8 | 4 | 6 | 3    (one element per processor)
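The talk's sort source is not included; the sketch below follows the same split/sort/merge pattern in UPC (all names, the problem size, and the random input are invented for illustration, and THREADS is assumed to be a power of two that divides N):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <upc.h>

    #define N 1024                   /* total elements (illustrative) */

    /* runs[t] is one contiguous block with affinity to thread t; it holds
       thread t's current sorted run as the merge tree is climbed */
    shared [N] int runs[THREADS][N];

    static int cmp(const void *x, const void *y) {
        int u = *(const int *)x, v = *(const int *)y;
        return (u > v) - (u < v);
    }

    int main(void) {
        int chunk = N / THREADS, mine = chunk, step, i;
        int *buf = malloc(N * sizeof(int));
        int *his = malloc(N * sizeof(int));
        int *out = malloc(N * sizeof(int));

        /* phase 1: sort a local chunk (random data for illustration) */
        srand(MYTHREAD + 1);
        for (i = 0; i < chunk; i++) buf[i] = rand() % 1000;
        qsort(buf, chunk, sizeof(int), cmp);
        upc_memput(&runs[MYTHREAD][0], buf, chunk * sizeof(int));
        upc_barrier;

        /* phase 2: merge pairwise up the binary tree */
        for (step = 1; step < THREADS; step *= 2) {
            if (MYTHREAD % (2 * step) == 0) {
                int partner = MYTHREAD + step, a = 0, b = 0, k = 0;
                upc_memget(his, &runs[partner][0], mine * sizeof(int));
                while (a < mine && b < mine)     /* standard two-way merge */
                    out[k++] = (buf[a] <= his[b]) ? buf[a++] : his[b++];
                while (a < mine) out[k++] = buf[a++];
                while (b < mine) out[k++] = his[b++];
                memcpy(buf, out, k * sizeof(int));
                mine = k;
                upc_memput(&runs[MYTHREAD][0], buf, mine * sizeof(int));
            }
            upc_barrier;
        }
        if (MYTHREAD == 0)
            printf("sorted %d elements on %d threads\n", mine, THREADS);
        return 0;
    }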
MPI Binary Sort Scaling (Spindel Test Cluster)
[Chart: MPI Binary Sort; execution time in seconds (0-0.2) vs. number of procs (2-8).]
A Performance Characterization of UPC
Fallon Chen
Matrix Multiply
● Basic square matrix multiply: A x B = C
● A, B, and C are NxN matrices
● In UPC, we can take advantage of the data layout for matrix multiply when N is a multiple of the number of THREADS
● Store A row-wise
● Store B column-wise
Data Layout
● A (N x P) is stored row-wise: thread 0 holds elements 0 .. (N*P/THREADS)-1, thread 1 holds (N*P/THREADS) .. (2*N*P/THREADS)-1, ..., thread THREADS-1 holds ((THREADS-1)*N*P)/THREADS .. (THREADS*N*P/THREADS)-1
● B (P x M) is stored column-wise: thread 0 holds columns 0 .. (M/THREADS)-1, and so on up to thread THREADS-1
● Note: N and M are assumed to be multiples of THREADS
(images by Kathy Yelick, from the UPC Tutorial)
Algorithm
● At each thread, get a local copy of the row(s) of A that have affinity to that particular thread
● At each thread, broadcast the columns of B using a UPC collective function, so that at the end each thread has a copy of B
● Multiply the row(s) of A by B to produce a row (or rows) of C
● Very short: about 100 lines of code
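The actual 100-line program is not reproduced in the deck; the sketch below shows declarations matching the layout slide plus a simplified kernel that reads B directly through shared references instead of broadcasting it (a static-THREADS compilation environment is assumed so THREADS may appear in the block sizes, and the dimensions are illustrative multiples of THREADS):

    #include <upc.h>

    #define N 512
    #define P 512
    #define M 512

    shared [N*P/THREADS] double A[N][P];  /* rows of A blocked by thread */
    shared [M/THREADS]   double B[P][M];  /* columns of B split by thread */
    shared [N*M/THREADS] double C[N][M];  /* C distributed like A */

    int main(void) {
        int i, j, k;
        /* each iteration of i runs on the thread owning row i of C,
           so A[i][k] and C[i][j] are local; B[k][j] may be remote */
        upc_forall (i = 0; i < N; i++; &C[i][0]) {
            for (j = 0; j < M; j++) {
                double sum = 0.0;
                for (k = 0; k < P; k++)
                    sum += A[i][k] * B[k][j];
                C[i][j] = sum;
            }
        }
        upc_barrier;
        return 0;
    }

The version measured in the talk instead copies its rows of A into private memory and replicates B with a collective (e.g. upc_all_broadcast from <upc_collective.h>), turning the fine-grained remote reads of B[k][j] into bulk transfers.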
[Chart: Latency, UPC vs. MPI, 4 and 8 Processors; time in seconds (0-0.7) vs. N (grid size, 0-1200) for UPC-4, UPC-8, MPI-4, MPI-8.]
[Chart: UPC Scaling, Matrix Multiply; time in seconds (0-0.4) vs. N (grid size, 0-1200) for 4 and 8 processors.]
Connected Components Labeling
● Used a union-find algorithm for global relabeling
● Stored global labels as a shared array, and used a shared array to exchange ghost cells
● Directly accessing a shared array in a loop is slow for large amounts of data
● Need to use the bulk copies upc_memput and upc_memget, but then you have to attend carefully to how the data is laid out (see the next two slides for what happens if you don't)
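As a hedged illustration of that last point (the array names and ghost-row length here are invented, not from the CCL code), the two styles of neighbour exchange look like this, and only the second amortizes the per-access remote latency:

    #include <upc.h>

    #define NGHOST 1024              /* ghost-row length (illustrative) */

    /* one ghost row per thread; row t has affinity to thread t */
    shared [NGHOST] int ghost[THREADS][NGHOST];
    int local_ghost[NGHOST];         /* private landing area */

    void fetch_neighbour_ghosts(void) {
        int i;
        if (MYTHREAD + 1 < THREADS) {
            /* slow (shown for contrast): one fine-grained remote read
               per element of the shared array */
            for (i = 0; i < NGHOST; i++)
                local_ghost[i] = ghost[MYTHREAD + 1][i];

            /* fast: the same transfer as a single bulk copy */
            upc_memget(local_ghost, &ghost[MYTHREAD + 1][0],
                       NGHOST * sizeof(int));
        }
        upc_barrier;
    }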
UPC CCL Scaling
[Chart: UPC CCL execution time (0-6) vs. problem size (100-900) for 4, 8, 16, and 32 processors.]
[Chart: Latency, UPC vs. MPI, Connected Components Labeling, 16 and 32 Processors; time on a log scale (0.0001-10) vs. problem size (0-1000) for MPI-16, UPC-16, MPI-32, UPC-32.]
Did UPC Help or Hurt?
● The global view of memory is a useful aid in debugging and development
● Redistribution routines are pretty easy to write
● Efficient code is no easier to write than in MPI, because you have to consider the shared-memory data layout when fine-tuning the code
Conclusions
● UPC is easy to program in for C writers, at times significantly easier than alternative paradigms
● UPC exhibits very little overhead compared with MPI for problems that are embarrassingly parallel; no tuning is necessary
● For other problems, compiler optimizations are happening but are not fully there yet
● With hand-tuning, UPC performance compared favorably with MPI
● Hand-tuned code, with block moves, is still substantially simpler than message-passing code