Aspects of practical parallel programming – parallel programming models – data parallel
Post on 12-Jan-2016
1
Parallel Computing 3
Models of Parallel Computations
Ondřej Jakl Institute of Geonics, Academy of Sci. of the CR
2
• Aspects of practical parallel programming
• Parallel programming models
• Data parallel
  – High Performance Fortran
• Shared variables/memory
  – compiler's support: automatic/assisted parallelization
  – OpenMP
  – thread libraries
• Message passing
Outline of the lecture
3
• Primary goal: maximization of performance
  – specific approaches are expected to be more efficient than universal ones
    • considerable diversity in parallel hardware
    • techniques/tools are much more dependent on the target platform than in sequential programming
  – understanding the hardware makes it easier to write programs that achieve high performance
    • back to the era of assembly programming?
• On the contrary, standard/portable/universal methods increase productivity in software development and maintenance
Parallel programming (1)
Trade-off
4
• Parallel programs are more difficult to write and debug than sequential ones
  – parallel algorithms can in general be qualitatively different from the corresponding sequential ones
    • changing the form of the code may not be enough
  – several new classes of potential software bugs (e.g. race conditions)
  – difficult debugging
  – issues of scalability
Parallel programming (2)
5
• Special programming language supporting concurrency
  – theoretically advantageous, in practice not very popular
  – ex.: Ada, Occam, Sisal, etc. (there are dozens of designs)
  – language extensions: CC++, Fortran M, etc.
• Universal programming language (C, Fortran, ...) with a parallelizing compiler
  – autodetection of parallelism in the sequential code
  – easier for shared memory, limited efficiency
  – a matter of the future? (despite 30 years of intense research)
  – ex.: Forge90 for Fortran (1992), some standard compilers
• Universal programming language plus a library of external parallelizing functions
  – mainstream nowadays
  – ex.: PVM (Parallel Virtual Machine), MPI (Message Passing Interface), Pthreads, among others
General approaches
6
• A parallel programming model is a set of software technologies to express parallel algorithms and match applications with the underlying parallel systems [Wikipedia]
• Considered models:– data parallel [just introductory info in this course]
– shared variables/memory [related to the OpenMP lecture in part II of the course]
– message passing [continued in the next lecture (MPI)]
Parallel programming models
Data parallel model
8
• Assumed underlying hardware: multicomputer or multiprocessor
  – originally associated with SIMD machines such as the CM-200
    • multiple processing elements perform the same operation on multiple data simultaneously
    • array processors
Hardware requirements
[Wikipedia]
9
• Based on the concept of applying the same operation (e.g. "add 1 to every array element") to the members of a data ensemble in parallel
  – a set of tasks operate collectively on the same data structure (usually an array), each task on a different partition
• On multicomputers the data structure is split up and resides as "chunks" in the local memory of each task
• On multiprocessors, all tasks may have access to the data structure through global memory
• The tasks are loosely synchronized
  – at the beginning and end of the parallel operations
• SPMD execution model
Data parallel model
Fortran90 fragment:
  real A(100)
  A = A+1

Task 1:
  do i = 1, 50
    A(i) = A(i)+1
  enddo

Task 2:
  do i = 51, 100
    A(i) = A(i)+1
  enddo
10
• Higher-level parallel programming
  – data distribution and communication done by the compiler
    • transfers low-level details from the programmer to the compiler
  – the compiler converts the program into standard code with calls to a message passing library (usually MPI); all message passing is done invisibly to the programmer
+ Ease of use
  – simple to write, debug and maintain
    • no explicit message passing
    • single-threaded control (no spawn, fork, etc.)
– Restricted flexibility and control
  – only suitable for certain applications
    • data in large arrays
    • similar independent operations on each element
    • naturally load-balanced
  – harder to get top performance
    • reliant on good compilers
Characteristics
11
• The best known representative of data parallel programming languages
• HPF version 1.0 in 1993 (extends Fortran 90), version 2.0 in 1997
• Extensions to Fortran 90 to support the data parallel model, including:
  – directives to tell the compiler how to distribute data
    • DISTRIBUTE, ALIGN directives
    • ignored as comments by serial Fortran compilers
  – mathematical operations on array-valued arguments
  – reduction operations on arrays
  – FORALL construct
  – assertions that can improve optimization of generated code
    • INDEPENDENT directive
  – additional intrinsics and library routines
• Available e.g. in the Portland Group PGI Workstation package– http://www.pgroup.com/products/pgiworkstation.htm
• Nowadays not frequently used
High Performance Fortran
12
REAL A(12, 12)                          ! declaration
REAL B(16, 16)                          ! of arrays
!HPF$ TEMPLATE T(16,16)                 ! and a template
!HPF$ ALIGN B WITH T                    ! align B with T
!HPF$ ALIGN A(i, j) WITH T(i+2, j+2)    ! align A with T and shift
!HPF$ PROCESSORS P(2, 2)                ! declare number of processors 2*2
!HPF$ DISTRIBUTE T(BLOCK, BLOCK) ONTO P ! distribution of the arrays
HPF data mapping example
T,B
A
[Mozdren 2010]
13
• Parallel MATLAB (the MathWorks): Parallel Computing Toolbox – plus Distributed Computing Server for greater parallel environments
– released in 2004; increasing popularity
• Some features coherent with the data parallel model
  – codistributed arrays: arrays partitioned into segments, each of which resides in the workspace of a different task
    • allow handling larger data sets than in a single MATLAB session
    • support for more than 150 MATLAB functions (e.g. finding eigenvalues)
    • used in a very similar way as regular arrays
– parallel FOR loop: loop iterations without enforcing their particular ordering
• distributes loop iterations over a set of tasks
• iterations must be independent of each other
Codistributed arrays
Data parallel in MATLAB
parfor i = (1:nsteps)
  x = i * step;
  s = s + (4 /(1 + x^2));
end
Shared variables model
15
• Assumed underlying hardware: multiprocessor
  – collection of processors that share common memory
  – interconnection fabric (bus, crossbar) supporting a single address space
• Not applicable to multicomputers
  – but: Intel Cluster OpenMP
• Easier to apply than message passing
  – allows incremental parallelization
• Based on the notion of threads
Hardware requirements
after [Wilkinson2004]
16
Thread vs. process (1)
[Figure: a process contains code, heap, files and interrupt routines shared by all threads; each thread has its own stack and instruction pointer (IP)]
17
• A thread ("lightweight" process) differs from a ("heavyweight") process:
  – all threads in a process share the same memory space
  – each thread has a thread-private area for its local variables, e.g. the stack
  – threads can work on shared data structures
  – threads can communicate with each other via the shared data
• Threads were originally not targeted at technical or HPC computing
  – low-level, task (rather than data) parallelism
• Details of the thread/process relationship are very OS dependent
Thread vs. process (2)
18
• A parallel application generates, when appropriate, a set of cooperating threads
  – usually one per processor
  – distinguished by enumeration
• Shared memory provides the means to exchange data among threads
  – shared data can be accessed by all threads
  – no message passing necessary
Thread communication
[Figure: Thread 1 keeps private data my_a = 23 and writes it to shared data sh_a; Thread 2 reads sh_a and runs my_a = sh_a+1, obtaining 24]
19
• Threads execute their programs asynchronously
• Writes and reads are always nonblocking
• Accessing shared data needs careful control
  – some mechanism is needed to ensure that the actions occur in the correct order
    • e.g. the write of A in thread 1 must occur before its read in thread 2
• Most common synchronization constructs:
  – master section: a section of code executed by one thread only
    • e.g. initialisation, writing a file
  – barrier: all threads must arrive at a barrier before any thread can proceed past it
    • e.g. delimiting phases of computation (e.g. a timestep)
  – critical section: only one thread at a time can enter a section of code
    • e.g. modification of shared variables
• This makes shared-variables programming error-prone
Thread synchronization
20
• Consider two threads, each of which is to add 1 to a shared data item X, e.g. X = 10:
  1. read X
  2. compute X+1
  3. write X back
• If step 1 is performed at the same time by both threads, the result will be 11 (instead of the expected 12)
• Race condition: two or more threads (processes) are reading or writing shared data, and the result depends on who runs precisely when
• X = X+1 must be an atomic operation
• This can be ensured by mechanisms of mutual exclusion
  – e.g. critical section, mutex, lock, semaphore, monitor
Accessing shared data
[Wilkinson2004]
21
• Initially only the master thread is active
  – executes the sequential code
• Basic operations:
  – fork: the master thread creates / awakens additional threads to execute in a parallel region
  – join: at the end of the parallel region the created threads die / are suspended
• Dynamic thread creation
  – the number of active threads changes during execution
  – fork is not an expensive operation
• A sequential program is a special / trivial case of a shared-memory parallel program
Fork/Join parallelism
[Quinn 2004]
[Figure: time flows downward; the master thread repeatedly forks other threads and joins them at the end of each parallel region]
22
• Compiler's support:
  – automatic parallelization
  – assisted parallelization
  – OpenMP
• Thread libraries:
  – POSIX threads, Windows threads
[next slides]
Computer realization
23
• The code is instrumented automatically by the compiler
  – according to compilation flags and/or environment variables
• Parallelizes independent loops only
  – processed by the prescribed number of parallel threads
• Usually provided by Fortran compilers for multiprocessors
  – as a rule, proprietary solutions
• Simple and sometimes fairly efficient
• Applicable to programs with a simple structure
• Ex.:
  – XL Fortran (IBM, AIX): -qsmp=auto option, XLSMPOPTS environment variable (the number of threads)
  – Fortran (SUN, Solaris): -autopar flag, PARALLEL environment variable
  – PGI C (Portland Group, Linux): -Mconcur flag
Automatic parallelization
24
• The programmer provides the compiler with additional information by adding compiler directives
  – special lines of source code with meaning only to a compiler that understands them
    • in the form of stylized Fortran comments or #pragma in C
    • ignored by nonparallelizing compilers
• Assertive and prescriptive directives [next slides]
• Diverse formats of the parallelizing directives, but similar capabilities
  – a standard is required
Assisted parallelization
25
• Hints that state facts the compiler might not guess from the code itself
• Evaluation is context dependent
• Ex.: XL Fortran (IBM, AIX)
  – no dependencies (the references in the loop do not overlap, parallelization possible): !SMP$ ASSERT (NODEPS)
  – trip count (average number of iterations of the loop; helps decide whether to unroll or parallelize the loop): !SMP$ ASSERT (INTERCNT(100))
Assertive directives
26
• Instructions for the parallelizing compiler, which it must obey
  – clauses may specify additional information
• A means for manual parallelization
• Ex.: XL Fortran (IBM, AIX)
  – parallel region: defines a block of code that can be executed by a team of threads concurrently
  – parallel loop: enables specifying which loops the compiler should parallelize
• Besides directives, additional constructs within the base language can be introduced to express parallelism
  – e.g. the forall statement in Fortran 95
Prescriptive directives
!SMP$ PARALLEL <clauses>
  <block>
!SMP$ END PARALLEL

!SMP$ PARALLEL DO <clauses>
  <do loop>
!SMP$ END PARALLEL DO
27
• API for writing portable multithreaded applications based on the shared variables model
  – the master thread spawns a team of threads as needed
  – relatively high level (compared to thread libraries)
• A standard developed by the OpenMP Architecture Review Board
  – http://www.openmp.org
  – first specification in 1997
• A set of compiler directives and library routines
• Language interfaces for Fortran, C and C++
  – OpenMP-like interfaces for other languages (e.g. Java)
• Parallelism can be added incrementally
  – i.e. the sequential program evolves into a parallel program
  – single source code for both the sequential and parallel versions
• OpenMP compilers are available on most platforms (Unix, Windows, etc.)
[More in a special lecture]
OpenMP
28
• Collections of routines to create, manage, and coordinate threads
• Main representatives:
  – POSIX threads (Pthreads)
  – Windows threads (Windows (Win32) API)
• Explicit threading was not primarily intended for parallel programming
  – low-level, quite complex coding
Thread libraries
29
Numerical integration based on the rectangle method:

set n (number of strips)
for each strip
  calculate the height y of the strip (rectangle) at its midpoint
  sum all y to the result S
endfor
multiply S by the width of the strips
print result
Example: PI calculation
Calculation of π by the numerical integration formula

  ∫₀¹ 4/(1+x²) dx = π

[Figure: plot of F(x) = 4/(1+x²) for x in [0.0, 1.0], with values between 2.0 and 4.0]
30
/* Pi, Win32 API */
#include <windows.h>
#define NUM_THREADS 2

HANDLE thread_handles[NUM_THREADS];
CRITICAL_SECTION hUpdateMutex;
static long num_steps = 100000;
double step, global_sum = 0.0;

void Pi (void *arg)
{
  int i, start;
  double x, sum = 0.0;

  start = *(int *) arg;
  step = 1.0 / (double) num_steps;
  for (i = start; i <= num_steps; i = i + NUM_THREADS) {
    x = (i - 0.5) * step;
    sum = sum + 4.0 / (1.0 + x * x);
  }
  EnterCriticalSection(&hUpdateMutex);
  global_sum += sum;
  LeaveCriticalSection(&hUpdateMutex);
}
PI in Windows threads (1)
31
void main ()
{
  double pi; int i;
  DWORD threadID;
  int threadArg[NUM_THREADS];

  for (i = 0; i < NUM_THREADS; i++) threadArg[i] = i + 1;
  InitializeCriticalSection(&hUpdateMutex);
  for (i = 0; i < NUM_THREADS; i++) {
    thread_handles[i] = CreateThread(0, 0, (LPTHREAD_START_ROUTINE) Pi,
                                     &threadArg[i], 0, &threadID);
  }
  WaitForMultipleObjects(NUM_THREADS, thread_handles, TRUE, INFINITE);
  pi = global_sum * step;
  printf(" pi is %f \n", pi);
}
PI in Windows threads (2)
32
/* Pi, pthreads library */
#define _REENTRANT
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define NUM_THREADS 2

pthread_t thread_handles[NUM_THREADS];
pthread_mutex_t hUpdateMutex;
pthread_attr_t attr;
static long num_steps = 100000;
double step, global_sum = 0.0;

void* Pi (void *arg)
{
  int i, start;
  double x, sum = 0.0;

  start = *(int *) arg;
  step = 1.0 / (double) num_steps;
  for (i = start; i <= num_steps; i = i + NUM_THREADS) {
    x = (i - 0.5) * step;
    sum = sum + 4.0 / (1.0 + x * x);
  }
  pthread_mutex_lock(&hUpdateMutex);
  global_sum += sum;
  pthread_mutex_unlock(&hUpdateMutex);
  return NULL;
}
PI in POSIX threads (1)
33
void main ()
{
  double pi; int i; int retval;
  pthread_t threadID;
  int threadArg[NUM_THREADS];

  pthread_attr_init(&attr);
  pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
  pthread_mutex_init(&hUpdateMutex, NULL);
  for (i = 0; i < NUM_THREADS; i++) threadArg[i] = i + 1;
  for (i = 0; i < NUM_THREADS; i++) {
    retval = pthread_create(&threadID, NULL, Pi, &threadArg[i]);
    thread_handles[i] = threadID;
  }
  for (i = 0; i < NUM_THREADS; i++) {
    retval = pthread_join(thread_handles[i], NULL);
  }
  pi = global_sum * step;
  printf(" pi is %.10f \n", pi);
}
PI in POSIX threads (2)
34
/* Pi, OpenMP, using parallel for and reduction */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define NUM_THREADS 2

static long num_steps = 1000000;
double step;

void main ()
{
  int i;
  double x, pi, sum = 0.0;

  step = 1.0 / (double) num_steps;
  omp_set_num_threads(NUM_THREADS);
PI in OpenMP (1)
35
PI in OpenMP (2)
  #pragma omp parallel for reduction(+:sum) private(x)
  for (i = 1; i <= num_steps; i++) {
    x = (i - 0.5) * step;
    sum += 4.0 / (1.0 + x * x);
  }
  pi = sum * step;
  printf("Pi is %.10f \n", pi);
}
NB: Programs such as PI calculation are likely to be successfully parallelized through automatic parallelization as well
Message passing model
37
• Assumed underlying hardware: multicomputer
  – collection of processors, each with its own local memory
  – interconnection network supporting message transfer between every pair of processors
• Supported by all (parallel) architectures – the most general model
  – naturally fits multicomputers
  – easily implemented on multiprocessors
• Complete control: data distribution and communication
• May not be easy to apply
  – the sequential-to-parallel transformation requires a major effort
  – one giant step rather than many tiny steps
  – message passing = the "assembler" of parallel computing
Hardware requirements
[Quinn2004]
38
• A parallel application generates (next slide) a set of cooperating processes
  – process = instance of a running program
  – usually one per processor
  – distinguished by a unique ID number
    • rank (MPI), tid (PVM), etc.
• To solve a problem, processes alternately perform computations and exchange messages
  – basic operations: send, receive
  – no shared memory space necessary
• Messages transport the contents of variables of one process to variables of another process
• Message passing also has a synchronization function
Message passing
[Figure: data transfer – Process 1 executes send(&x, 2), Process 2 executes recv(&y, 1); the contents of x are copied to y]
[Wilkinson2004]
39
• Static process creation
  – fixed number of processes over time
  – specified before the execution (e.g. on the command line)
  – usually the processes follow the same code, but their control paths through the code can differ, depending on the ID
    • SPMD (Single Program Multiple Data) model
    • one master process (ID 0) – several slave processes
• Dynamic process creation
  – varying number of processes over time
    • just one process at the beginning
  – processes can create (destroy) other processes: the spawn operation
    • rather expensive!
  – the processes often differ in code
    • MPMD (Multiple Program Multiple Data) model
Process creation
[Figure: Process 1 executes spawn(); some time later, Process 2 starts]
[Wilkinson2004]
40
• Exactly two processes are involved
• One process (sender / source) sends a message and another process (receiver / destination) receives it
  – active participation of processes on both sides is usually required
    • two-sided communication
• In general, the source and destination processes operate asynchronously
  – the source may complete sending a message long before the destination gets around to receiving it
  – the destination may initiate receiving a message that has not yet been sent
• The order of messages is guaranteed (they do not overtake each other)
• Examples of technical issues:
  – handling several messages waiting to be received
  – sending complex data structures
  – using message buffers
  – send and receive routines – blocking vs. nonblocking
Point-to-point communication
41
• Blocking operation: only returns (from the subroutine call) when the operation has completed
  – ex.: sending a fax on a standard machine
• Nonblocking operation: returns immediately; the operation need not be completed yet, and other work may be performed in the meantime
  – the completion of the operation can/must be tested
  – ex.: sending a fax on a machine with memory
• Synchronous send: does not complete until the message has been received
  – provides (synchronizing) information about the message delivery
  – ex.: sending a fax (on a standard machine)
• Asynchronous send: completes as soon as the message is on its way
  – the sender only knows when the message has left
  – ex.: sending a letter
(Non-)blocking & (a-)synchronous
42
• Transfer of data in a set of processes
• Provided by most message passing systems
• Basic operations [next slides]:
  – barrier: synchronization of processes
  – broadcast: one-to-many communication of the same data
  – scatter: one-to-many communication of different portions of data
  – gather: many-to-one communication of (different, but related) data
  – reduction: gather plus combination of the data with an arithmetic/logical operation
• Root – in some collective operations, the single prominent source / destination
  – e.g. in broadcast
• Collective operations can be built out of point-to-point operations, but these "black box" routines
  – hide a lot of the messy details
  – are usually more efficient
    • can take advantage of special communication hardware
Collective communication
43
• A basic mechanism for synchronizing processes
• Inserted at the point in each process where it must wait for the others
• All processes can continue from this point when all the processes have reached it
  – or when a stated number of processes have reached this point
• Often involved in other operations
Barrier
[Wilkinson2004]
44
• Distributes the same piece of data from a single source (root) to all processes (concerned with the problem)
  – multicast – sends the message to a defined group of processes
Broadcast
[Figure: broadcast – the root's item B is copied to every process]
45
• Distributes each element of an array in the root to a separate process
  – including the root
  – the contents of the ith array element are sent to the ith process
Scatter
[Figure: scatter – the root's array A B C D is distributed, one element to each process]
46
• Collects data from each process at the root
  – the value from the ith process is stored in the ith array element (rank order)
Gather
[Figure: gather – items A, B, C, D from the processes are collected into an array at the root]
47
• Gather operation combined with a specified arithmetic/logical operation:
  1. collect data from each processor
  2. reduce these data to a single value (such as a sum or max)
  3. store the reduced result on the root processor
Reduction
[Figure: reduce – the arrays of the four processes are combined elementwise into a single array at the root]
48
• Computer realization of the message passing model
• Most popular message passing systems (MPS):
  – Message Passing Interface (MPI) [next lecture]
  – Parallel Virtual Machine (PVM)
  – in distributed computing: Corba, Java RMI, DCOM, etc.
Message passing system (1)
49
• Information needed by the MPS to transfer a message includes:
  – sending process and the location, type and amount of transferred data
    • no interest in the data itself (message body)
  – receiving process(es) and storage to receive the data
• Most of this information is attached as the message envelope
  – may be (partially) available to the receiving process
• The MPS may provide various information to the processes
  – e.g. about the progress of communication
• A lot of other technical aspects, e.g.:
  – process enrolment in the MPS
  – addressing scheme
  – content of the envelope
  – using message buffers (system, user space)
Message passing system (2)
50
Message passing (MPI)
+ easier to debug
+ easiest to optimize
+ can overlap communication and computation
+ potential for high scalability
+ support on all parallel architectures
– harder to program
– load balancing, deadlock prevention, etc. need to be addressed
⇒ most freedom and responsibility

Shared variables (OMP)
+ easier to program than MP, code is simpler
+ implementation can be incremental
+ no message start-up costs
+ can cope with irregular communication patterns
– limited to shared-memory systems
– harder to debug and optimize
– scalability limited
– usually less efficient than MP equivalents

Data parallel (HPF)
+ easier to program than MP
+ simpler to debug than SV
+ does not require shared memory
– DP style suitable only for certain applications
– restricted control over data and work distribution
– difficult to obtain top performance
– few APIs available
– out of date?

WWW (what, when, why)
51
• The definition of parallel programming models is not uniform in the literature; other models include e.g.
  – the thread programming model
  – hybrid models, e.g. the combination of the message passing and shared variables models
    • explicit message passing between the nodes of a cluster, shared memory and multithreading within the nodes
• Models continue to evolve along with the changing world of computer hardware and software
  – e.g. the CUDA parallel programming model for the CUDA GPU architecture
Conclusions
52
Further study
• The message passing model and the shared variables model are treated to some extent in all general textbooks on parallel programming
  – exception: [Foster 1995] almost skips data sharing
• There are plenty of books dedicated to shared objects, synchronisation and shared memory, e.g. [Andrews 2000] Foundations of Multithreaded, Parallel, and Distributed Programming
  – not necessarily focusing on parallel processing
• Data parallelism is usually a marginal topic
53