
Slide 1/58

Basic Concepts in Parallelization

IWOMP 2010, CCS, University of Tsukuba
Tsukuba, Japan, June 14-16, 2010

Ruud van der Pas
Senior Staff Engineer, Oracle Solaris Studio, Oracle
Menlo Park, CA, USA

Slide 2/58

    Outline

    Introduction

    Parallel Architectures

    Parallel Programming Models

    Data Races

    Summary

Slide 3/58

    Introduction

Slide 4/58

Why Parallelization?

Parallelization is another optimization technique. The goal is to reduce the execution time. To this end, multiple processors, or cores, are used.

[Figure: execution time on 1 core vs. 4 cores. Using 4 cores, the execution time is 1/4 of the single-core time.]

Slide 5/58

What Is Parallelization?

"Something" is parallel if there is a certain level of independence in the order of operations. That "something" can be, in increasing order of granularity:

  A sequence of machine instructions
  A collection of program statements
  An algorithm
  The problem you're trying to solve

In other words, it doesn't matter in what order those operations are performed.

Slide 6/58

What is a Thread?

Loosely said, a thread consists of a series of instructions with its own program counter (PC) and state. A parallel program executes threads in parallel. These threads are then scheduled onto processors.

[Figure: four threads (Thread 0 through Thread 3), each an independent stream of instructions with its own program counter (PC), scheduled onto processors (P).]
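To make this concrete, here is a minimal POSIX Threads sketch in C (not from the slides; the thread count and names are illustrative). Each thread executes its own instruction stream, and the OS schedules the threads onto the available processors. Compile with, e.g., cc -pthread.

    /* Minimal sketch: create four threads, each running its own
       instruction stream; the OS schedules them onto processors. */
    #include <pthread.h>
    #include <stdio.h>

    void *work(void *arg)
    {
        long tid = (long) arg;    /* thread ID passed in by the creator */
        printf("Thread %ld is running\n", tid);
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[4];
        for (long t = 0; t < 4; t++)    /* create 4 threads */
            pthread_create(&threads[t], NULL, work, (void *) t);
        for (long t = 0; t < 4; t++)    /* wait for all of them to finish */
            pthread_join(threads[t], NULL);
        return 0;
    }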

Slide 7/58

    Parallel overhead

The total CPU time often exceeds the serial CPU time:

  The newly introduced parallel portions in your program need to be executed
  Threads need time sending data to each other and synchronizing (communication)
    This is often the key contributor, spoiling all the fun
  Typically, things also get worse when increasing the number of threads

Efficient parallelization is about minimizing the communication overhead.

Slide 8/58

    Communication

[Figure: wallclock time of serial execution vs. parallel execution on 4 cores, with and without communication.]

Parallel, without communication: embarrassingly parallel, 4x faster; the wallclock time is 1/4 of the serial wallclock time.

Parallel, with communication: the additional communication makes it less than 4x faster and consumes additional resources; the wallclock time is more than 1/4 of the serial wallclock time; the total CPU time increases.

Slide 9/58

    Load balancing

[Figure: per-thread execution times under perfect load balancing vs. load imbalance; with imbalance, threads 1, 2 and 3 sit idle.]

Perfect load balancing: all threads finish in the same amount of time; no thread is idle.

Load imbalance: different threads need a different amount of time to finish their task; threads sit idle; the total wall clock time increases; the program does not scale well.

Slide 10/58

About scalability

Define the speed-up S(P) as S(P) := T(1)/T(P).

The efficiency E(P) is defined as E(P) := S(P)/P.

In the ideal case, S(P) = P and E(P) = P/P = 1 = 100%.

Unless the application is embarrassingly parallel, S(P) eventually starts to deviate from the ideal curve. Past this point, Popt, the application sees less and less benefit from adding processors.

Note that both metrics give no information on the actual run-time. As such, they can be dangerous to use.

In some cases, S(P) exceeds P. This is called "superlinear" behaviour. Don't count on this to happen, though.

[Figure: speed-up S(P) vs. number of processors P; the measured curve follows the ideal line up to Popt and then flattens.]
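As a small illustration (not from the slides; function names are illustrative), the two metrics translate directly into code:

    /* Minimal sketch: speed-up and efficiency from measured wallclock
       times, where t1 = T(1) and tp = T(P). */
    double speedup(double t1, double tp)           { return t1 / tp; }       /* S(P) = T(1)/T(P) */
    double efficiency(double t1, double tp, int p) { return (t1 / tp) / p; } /* E(P) = S(P)/P */

For example, speedup(100.0, 37.0) returns 2.70 and efficiency(100.0, 37.0, 4) about 0.68, the timings used in the worked example a few slides further on.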

Slide 11/58

    Amdahl's Law

    This "law' describes the effect the non-parallelizable part of a

    program has on scalability Note that the additional overhead caused by parallelization and

    speed-up because of cache effects are not taken into account

    Assume our program has a parallel fractionf

    This implies the execution timeT(1) := f*T(1) + (1-f)*T(1)

    On P processors:T(P) = (f/P)*T(1) + (1-f)*T(1)

    Amdahl's law:

    S(P) = T(1)/T(P) = 1 / (f/P + 1-f)

    Comments:

Slide 12/58

    Amdahl's Law

It is easy to scale on a small number of processors. Scalable performance, however, requires a high degree of parallelization, i.e. f very close to 1. This implies that you need to parallelize that part of the code where the majority of the time is spent.

[Figure: speed-up as a function of the number of processors (1-16), plotted for f = 1.00, 0.99, 0.75, 0.50, 0.25 and 0.00.]

Slide 13/58

    Amdahl's Law in practice

We can estimate the parallel fraction f. Recall:

    T(P) = (f/P)*T(1) + (1-f)*T(1)

It is trivial to solve this equation for f:

    f = (1 - T(P)/T(1)) / (1 - 1/P)

Example: T(1) = 100 and T(4) = 37 => S(4) = T(1)/T(4) = 2.70
f = (1 - 37/100)/(1 - 1/4) = 0.63/0.75 = 0.84 = 84%

Estimated performance on 8 processors is then:

    T(8) = (0.84/8)*100 + (1 - 0.84)*100 = 26.5
    S(8) = T(1)/T(8) = 3.78
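The calculation on this slide is easy to script; here is a sketch in C with the values from the example hard-coded:

    /* Estimate the parallel fraction f from T(1) and T(4), then use
       Amdahl's law to predict the run time and speed-up on 8 processors. */
    #include <stdio.h>

    int main(void)
    {
        double t1 = 100.0, t4 = 37.0;                   /* measured times */
        double f  = (1.0 - t4/t1) / (1.0 - 1.0/4.0);    /* f = 0.84 */
        double t8 = (f/8.0)*t1 + (1.0 - f)*t1;          /* T(8) = 26.5 */
        printf("f = %.2f  T(8) = %.1f  S(8) = %.2f\n", f, t8, t1/t8);
        return 0;
    }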

Slide 14/58

    Threads Are Getting Cheap .....

[Figure: elapsed time and speed-up as a function of the number of cores (1-16) for f = 84%. Using 8 cores: 100/26.5 = 3.8x faster.]

Slide 15/58

    Numerical results

Consider: A = B + C + D + E

Serial processing:

    A = B + C
    A = A + D
    A = A + E

Parallel processing (two threads):

    Thread 0: T1 = B + C    Thread 1: T2 = D + E
    Thread 0: T1 = T1 + T2

The roundoff behaviour is different, and so the numerical results may be different too. This is natural for parallel programs, but it may be hard to differentiate it from an ordinary bug ....
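A small self-contained sketch (the input values are chosen for illustration; they are not from the slides) shows how the two evaluation orders can round differently in IEEE single precision:

    #include <stdio.h>

    int main(void)
    {
        float B = 1.0f, C = 1.0e8f, D = -1.0e8f, E = 1.0f;

        /* Serial order: A = ((B + C) + D) + E */
        float A = B + C;    /* 1 + 1e8 rounds to 1e8 in single precision */
        A = A + D;          /* 0 */
        A = A + E;          /* 1 */

        /* Parallel order: thread 0 computes T1 = B + C, thread 1
           computes T2 = D + E, then T1 = T1 + T2 */
        float T1 = B + C;   /* 1e8 */
        float T2 = D + E;   /* -1e8 + 1 rounds back to -1e8 */
        T1 = T1 + T2;       /* 0 */

        printf("serial: %g  parallel: %g\n", A, T1);
        return 0;
    }

The serial order prints 1, the parallel order prints 0; both are valid floating-point results, they merely round differently.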

Slide 16/58

    Cache Coherence

Slide 17/58

    Typical cache based system

[Figure: a CPU connected to memory through an L1 and an L2 cache; what happens when cached data is modified?]

Slide 18/58

Cache line modifications

[Figure: a page in main memory, with one of its cache lines held in several caches; modifying one copy creates a cache line inconsistency.]

Slide 19/58

    Caches in a parallel computer

A cache line starts in memory. Over time, multiple copies of this line may exist.

Cache Coherence ("cc"):

  Tracks changes in the copies
  Makes sure the correct cache line is used
  Different implementations are possible
  Hardware support is needed to make it efficient

[Figure: CPUs 0-3 with their caches and memory; when one CPU modifies a line, the stale copies in the other caches are invalidated.]

Slide 20/58

    Parallel Architectures

Slide 21/58

    Uniform Memory Access (UMA)

Also called "SMP" (Symmetric Multi-Processor)

Memory access time is uniform for all CPUs

A CPU can be multicore

The interconnect (bus or crossbar) is "cc" (cache coherent)

No fragmentation: memory and I/O are shared resources

Pro: easy to use and to administer; efficient use of resources
Con: said to be expensive; said to be non-scalable

[Figure: several CPUs, each with a cache, on a cache coherent interconnect that connects them to shared memory and I/O.]

Slide 22/58

    NUMA

Also called "Distributed Memory" or NORMA (No Remote Memory Access)

Memory access time is non-uniform; hence the name "NUMA"

The interconnect is not "cc": Ethernet, InfiniBand, etc.

Runs 'N' copies of the OS

Memory and I/O are distributed resources

Pro: said to be cheap; said to be scalable
Con: difficult to use and administer; inefficient use of resources

[Figure: several nodes, each with a CPU, cache, memory (M) and I/O, connected by a non-coherent interconnect.]

Slide 23/58

    The Hybrid Architecture

SMP/multicore nodes are connected by a second-level interconnect. This second-level interconnect is not cache coherent: Ethernet, InfiniBand, etc.

This is a hybrid architecture with all the corresponding pros and cons:

  UMA within one SMP/multicore node
  NUMA across nodes

[Figure: two SMP/MC nodes connected by a second-level interconnect.]

Slide 24/58

    cc-NUMA

Two-level interconnect:

  UMA/SMP within one system
  NUMA between the systems

Both interconnects support cache coherence, i.e. the system is fully cache coherent.

Has all the advantages ('look and feel') of an SMP. The downside is the non-uniform memory access time.

[Figure: SMP systems connected by a cache coherent second-level interconnect.]

Slide 25/58

    Parallel Programming Models

Slide 26/58

    How To Program A Parallel Computer?

There are numerous parallel programming models. The ones most well-known are:

Distributed Memory (runs on a cluster and/or a single SMP):

  Sockets (standardized, low level)
  PVM - Parallel Virtual Machine (obsolete)
  MPI - Message Passing Interface (de-facto standard)

Shared Memory (SMP only):

  Posix Threads (standardized, low level)
  OpenMP (de-facto standard)
  Automatic Parallelization (the compiler does it for you)

Slide 27/58

Parallel Programming Models: Distributed Memory - MPI

Slide 28/58

    What is MPI?

MPI stands for the Message Passing Interface.

MPI is a very extensive de-facto parallel programming API for distributed memory systems (i.e. a cluster). An MPI program can however also be executed on a shared memory system.

First specification: 1994. The current MPI-2 specification was released in 1997. Major enhancements:

  Remote memory management
  Parallel I/O
  Dynamic process management

Slide 29/58

    More about MPI

MPI has its own data types (e.g. MPI_INT); user defined data types are supported as well.

MPI supports C, C++ and Fortran: include file mpi.h in C/C++ and mpif.h in Fortran.

An MPI environment typically consists of at least:

  A library implementing the API
  A compiler and linker that support the library
  A run time environment to launch an MPI program

Various implementations are available: HPC ClusterTools, MPICH, MVAPICH, LAM, Voltaire MPI, Scali MPI, HP-MPI, .....


Slide 31/58

    The MPI Execution Model

[Figure: all processes start; each calls MPI_Init to enter the MPI environment; the processes then run and communicate with each other; each calls MPI_Finalize; all processes finish.]


Slide 32/58

    The MPI Memory Model

[Figure: processes (P), each with its own private memory, exchanging data through an explicit transfer mechanism.]

All threads/processes have access to their own, private, memory only.

Data transfer and most synchronization has to be programmed explicitly.

All data is private; data is shared explicitly by exchanging buffers.


Slide 33/58

    Example - Send N Integers

    #include <mpi.h>    /* include file */

    #define N 1000      /* number of integers to transfer (value illustrative) */

    int main(int argc, char *argv[])
    {
        int me, you, him, error_code;
        int data_buffer[N];

        you = 0; him = 1;

        MPI_Init(&argc, &argv);                 /* initialize MPI environment */
        MPI_Comm_rank(MPI_COMM_WORLD, &me);     /* get process ID */

        if ( me == 0 ) {                        /* process 0 sends */
            error_code = MPI_Send(&data_buffer, N, MPI_INT,
                                  him, 1957, MPI_COMM_WORLD);
        } else if ( me == 1 ) {                 /* process 1 receives */
            error_code = MPI_Recv(&data_buffer, N, MPI_INT,
                                  you, 1957, MPI_COMM_WORLD,
                                  MPI_STATUS_IGNORE);
        }

        MPI_Finalize();                         /* leave the MPI environment */
        return 0;
    }
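A typical way to build and run this example with two processes (the exact commands depend on the MPI implementation; the file name is illustrative):

    mpicc send_n.c -o send_n
    mpirun -np 2 ./send_n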


Slide 34/58

Run time Behavior

Process 0 (me = 0) calls MPI_Send: N integers, destination = him = 1, label = 1957.

Process 1 (me = 1) calls MPI_Recv: N integers, sender = you = 0, label = 1957.

The envelopes match. Yes! Connection established.


Slide 36/58

Parallel Programming Models: Shared Memory - Automatic Parallelization


Slide 38/58

Parallel Programming Models: Shared Memory - OpenMP

Slide 39/58

    http://www.openmp.org

[Figure: the OpenMP shared memory model: threads 0 through P, each with its own memory (M), connected to a common shared memory.]

Slide 40/58

    Shared Memory Model

[Figure: threads (T), each with private data, all attached to a common shared memory.]

All threads have access to the same, globally shared, memory.

Data can be shared or private:

  Shared data is accessible by all threads
  Private data can only be accessed by the thread that owns it

Data transfer is transparent to the programmer. Synchronization takes place, but it is mostly implicit.


Slide 41/58

    About data

In a shared memory parallel program, variables have a "label" attached to them:

Labelled "Private": visible to one thread only. A change made to local data is not seen by the others. Example: local variables in a function that is executed in parallel.

Labelled "Shared": visible to all threads. A change made to global data is seen by all the others. Example: global data.


Slide 42/58

Example - Matrix times vector

    for (i=0; i<m; i++)
    {
        sum = 0.0;
        for (j=0; j<n; j++)
            sum += b[i][j]*c[j];
        a[i] = sum;
    }

With two threads, the iterations of the i-loop are split over the threads, e.g. TID = 0 executes i = 0,1,2,3,4 and TID = 1 executes i = 5,6,7,8,9:

    TID = 0                    TID = 1
    i = 0                      i = 5
    sum = b[i=0][j]*c[j]       sum = b[i=5][j]*c[j]
    a[0] = sum                 a[5] = sum
    i = 1                      i = 6
    sum = b[i=1][j]*c[j]       sum = b[i=6][j]*c[j]
    a[1] = sum                 a[6] = sum
    ... etc ...
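A complete OpenMP version of this example might look as follows (a sketch with illustrative sizes and test data; the slides show only the loop itself). Compile with OpenMP enabled, e.g. gcc -fopenmp.

    #include <stdio.h>
    #define M 10
    #define N 10

    int main(void)
    {
        double a[M], b[M][N], c[N];
        int i, j;

        for (i = 0; i < M; i++)             /* some test data */
            for (j = 0; j < N; j++)
                b[i][j] = i + j;
        for (j = 0; j < N; j++)
            c[j] = 1.0;

        /* The iterations of the i-loop are independent, so they can be
           divided over the threads; i and sum are private, j is made
           private explicitly, and a, b, c are shared. */
        #pragma omp parallel for private(j)
        for (i = 0; i < M; i++) {
            double sum = 0.0;
            for (j = 0; j < N; j++)
                sum += b[i][j] * c[j];
            a[i] = sum;
        }

        for (i = 0; i < M; i++)
            printf("a[%d] = %g\n", i, a[i]);
        return 0;
    }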

Slide 43/58

    A Black and White comparison

    MPI                                   | OpenMP
    --------------------------------------|----------------------------------
    De-facto standard                     | De-facto standard
    Endorsed by all key players           | Endorsed by all key players
    Runs on any number of (cheap) systems | Limited to one (SMP) system
    Grid Ready                            | Not (yet?) Grid Ready
    High and steep learning curve         | Easier to get started (but, ...)
    You're on your own                    | Assistance from the compiler
    All or nothing model                  | Mix and match model
    No data scoping (shared, private, ..) | Requires data scoping
    More widely used (but ....)           | Increasingly popular (CMT !)
    Sequential version is not preserved   | Preserves the sequential code
    Requires a library only               | Needs a compiler
    Requires a run-time environment       | No special environment
    Easier to understand performance      | Performance issues implicit

Slide 44/58

The Hybrid Parallel Programming Model

Slide 45/58

    The Hybrid Programming Model

[Figure: two shared memory nodes, each with threads 0 through P and their memories (M) on a common shared memory, connected to each other through distributed memory.]

Slide 46/58

    Data Races


Slide 47/58

    About Parallelism

Parallelism = Independence = No Fixed Ordering

"Something" that does not obey this rule is not parallel (at that level ...)

    Shared Memory Programming

Slide 48/58

    Shared Memory Programming

    Threads communicate via shared memory

[Figure: threads (T), each with private data, communicating by reading and writing the variables X and Y in shared memory.]

Slide 49/58

What is a Data Race?

A data race occurs when two different threads in a multi-threaded shared memory program:

  Access the same (= shared) memory location
  Asynchronously, and
  Without holding any common exclusive locks, and
  At least one of the accesses is a write/store

Slide 50/58

Example of a data race

    #pragma omp parallel shared(n)
    {
        n = omp_get_thread_num();
    }

[Figure: two threads, each with private data, both write (W) the shared variable n in shared memory.]

Slide 51/58

Another example

    #pragma omp parallel shared(x)
    {
        x = x + 1;
    }

[Figure: both threads read and write (R/W) the shared variable x in shared memory.]
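The following runnable sketch (the surrounding harness is assumed, not from the slides) makes this race visible, and also shows one way to protect the update, using the OpenMP atomic construct. Compile with e.g. gcc -fopenmp.

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        int nthreads = omp_get_max_threads();
        int x = 0, y = 0;

        #pragma omp parallel shared(x)
        {
            for (int i = 0; i < 100000; i++)
                x = x + 1;          /* unprotected read-modify-write: data race */
        }

        #pragma omp parallel shared(y)
        {
            for (int i = 0; i < 100000; i++) {
                #pragma omp atomic  /* protected update: no data race */
                y = y + 1;
            }
        }

        /* x is often less than nthreads * 100000; y always equals it */
        printf("threads %d: x = %d, y = %d, expected %d\n",
               nthreads, x, y, nthreads * 100000);
        return 0;
    }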

Slide 52/58

About Data Races

Loosely described, a data race means that the update of a shared variable is not well protected.

A data race tends to show up in a nasty way:

  Numerical results are (somewhat) different from run to run
  Especially with floating-point data, this is difficult to distinguish from a numerical side-effect
  Changing the number of threads can cause the problem to seemingly (dis)appear
  It may also depend on the load on the system
  It may only show up when using many threads

Slide 54/58

Not a parallel loop

    for (i=0; i<n-1; i++)
        a[i] = a[i+1] + b[i];

Each iteration reads a[i+1] before iteration i+1 overwrites it, so the iterations cannot be executed in an arbitrary order: this is a loop-carried data dependence.

Slide 55/58

We manually parallelized the previous loop. The compiler detects the data dependence and does not parallelize the loop.

Vectors a and b are of type integer.

We use the checksum of a as a measure for correctness:

    checksum += a[i]   for i = 0, 1, 2, ...., n-2

The correct, sequential, checksum result is computed as a reference.

We ran the program using 1, 2, 4, 32 and 48 threads. Each of these experiments was repeated 4 times.
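The slides do not include the full test program; a sketch of the experiment could look like this (array size and data are illustrative, and the loop is the reconstructed one from the previous slide):

    #include <omp.h>
    #include <stdio.h>
    #define N 64

    int main(void)
    {
        int a[N], b[N], i, checksum = 0, ref = 0;

        /* Sequential reference result */
        for (i = 0; i < N; i++) { a[i] = i; b[i] = 1; }
        for (i = 0; i < N-1; i++) a[i] = a[i+1] + b[i];
        for (i = 0; i < N-1; i++) ref += a[i];

        /* Reset, then force the dependent loop to run in parallel; at a
           chunk boundary a thread may read an a[i+1] that has already
           been overwritten by the neighbouring thread */
        for (i = 0; i < N; i++) { a[i] = i; b[i] = 1; }
        #pragma omp parallel for
        for (i = 0; i < N-1; i++) a[i] = a[i+1] + b[i];
        for (i = 0; i < N-1; i++) checksum += a[i];

        printf("threads: %d checksum %d correct value %d\n",
               omp_get_max_threads(), checksum, ref);
        return 0;
    }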

Slide 56/58

Numerical results

    Threads | Checksums from 4 runs  | Correct value
    --------|------------------------|--------------
       1    | 1953, 1953, 1953, 1953 | 1953
       2    | 1953, 1953, 1953, 1953 | 1953
       4    | 1905, 1905, 1953, 1937 | 1953
      32    | 1525, 1473, 1489, 1513 | 1953
      48    |  936, 1007,  887,  822 | 1953

Data Race In Action!

Slide 57/58

    Summary

Slide 58/58

Parallelism Is Everywhere

Multiple levels of parallelism:

    Granularity       | Technology  | Programming Model
    ------------------|-------------|----------------------
    Instruction Level | Superscalar | Compiler
    Chip Level        | Multicore   | Compiler, OpenMP, MPI
    System Level      | SMP/cc-NUMA | Compiler, OpenMP, MPI
    Grid Level        | Cluster     | MPI

    Threads Are Getting Cheap