Aspects of practical parallel programming – parallel programming models – data parallel
Post on 12-Jan-2016
1
Parallel Computing 3
Models of Parallel Computations
Ondřej Jakl Institute of Geonics, Academy of Sci. of the CR
2
• Aspects of practical parallel programming
• Parallel programming models
• Data parallel
  – High Performance Fortran
• Shared variables/memory
  – compiler's support: automatic/assisted parallelization
  – OpenMP
  – thread libraries
• Message passing
Outline of the lecture
3
• Primary goal: maximization of performance
  – specific approaches are expected to be more efficient than universal ones
    • considerable diversity in parallel hardware
    • techniques/tools are much more dependent on the target platform than in sequential programming
  – understanding the hardware makes it easier to write programs that achieve high performance
    • back to the era of assembly programming?
• On the contrary, standard/portable/universal methods increase productivity in software development and maintenance
Parallel programming (1)
Trade-off
4
• Parallel programs are more difficult to write and debug than sequential ones
  – parallel algorithms can in general be qualitatively different from the corresponding sequential ones
    • changing the form of the code may not be enough
  – several new classes of potential software bugs (e.g. race conditions)
  – difficult debugging
  – issues of scalability
Parallel programming (2)
5
• Special programming language supporting concurrency
  – theoretically advantageous, in practice not very popular
  – ex.: Ada, Occam, Sisal, etc. (there are dozens of designs)
  – language extensions: CC++, Fortran M, etc.
• Universal programming language (C, Fortran, ...) with a parallelizing compiler
  – autodetection of parallelism in the sequential code
  – easier for shared memory, limited efficiency
  – a matter of the future? (despite 30 years of intense research)
  – ex.: Forge90 for Fortran (1992), some standard compilers
• Universal programming language plus a library of external parallelizing functions
  – mainstream nowadays
  – ex.: PVM (Parallel Virtual Machine), MPI (Message Passing Interface), Pthreads, among others
General approaches
6
• A parallel programming model is a set of software technologies to express parallel algorithms and match applications with the underlying parallel systems [Wikipedia]
• Considered models:– data parallel [just introductory info in this course]
– shared variables/memory [related to the OpenMP lecture in part II of the course]
– message passing [continued in the next lecture (MPI)]
Parallel programming models
Data parallel model
8
• Assumed underlying hardware: multicomputer or multiprocessor
  – originally associated with SIMD machines such as the CM-200
    • multiple processing elements perform the same operation on multiple data simultaneously
    • array processors
Hardware requirements
[Wikipedia]
9
• Based on the concept of applying the same operation (e.g. "add 1 to every array element") to the members of a data ensemble in parallel
  – a set of tasks operate collectively on the same data structure (usually an array), each task on a different partition
• On multicomputers the data structure is split up and resides as "chunks" in the local memory of each task
• On multiprocessors, all tasks may have access to the data structure through global memory
• The tasks are loosely synchronized
  – at the beginning and end of the parallel operations
• SPMD execution model
Data parallel model
Fortran90 fragment:
  real A(100)
  A = A+1

Task 1:
  do i = 1, 50
    A(i) = A(i)+1
  enddo

Task 2:
  do i = 51, 100
    A(i) = A(i)+1
  enddo
10
• Higher-level parallel programming
  – data distribution and communication done by the compiler
    • transfers low-level details from the programmer to the compiler
  – the compiler converts the program into standard code with calls to a message passing library (usually MPI); all message passing is done invisibly to the programmer
+ Ease of use
  – simple to write, debug and maintain
    • no explicit message passing
    • single-threaded control (no spawn, fork, etc.)
– Restricted flexibility and control
  – only suitable for certain applications
    • data in large arrays
    • similar independent operations on each element
    • naturally load-balanced
  – harder to get top performance
    • reliant on good compilers
Characteristics
11
• The best known representative of data parallel programming languages
• HPF version 1.0 in 1993 (extends Fortran 90), version 2.0 in 1997
• Extensions to Fortran 90 to support the data parallel model, including:
  – directives to tell the compiler how to distribute data
    • DISTRIBUTE, ALIGN directives
    • ignored as comments by serial Fortran compilers
  – mathematical operations on array-valued arguments
  – reduction operations on arrays
  – FORALL construct
  – assertions that can improve optimization of generated code
    • INDEPENDENT directive
  – additional intrinsics and library routines
• Available e.g. in the Portland Group PGI Workstation package– http://www.pgroup.com/products/pgiworkstation.htm
• Nowadays not frequently used
High Performance Fortran
12
REAL A(12, 12)                          ! declaration
REAL B(16, 16)                          ! of arrays
!HPF$ TEMPLATE T(16,16)                 ! and a template
!HPF$ ALIGN B WITH T                    ! align B with T
!HPF$ ALIGN A(i, j) WITH T(i+2, j+2)    ! align A with T and shift
!HPF$ PROCESSORS P(2, 2)                ! declare number of processors 2*2
!HPF$ DISTRIBUTE T(BLOCK, BLOCK) ONTO P ! distribution of the arrays
HPF data mapping example
T,B
A
[Mozdren 2010]
13
• Parallel MATLAB (the MathWorks): Parallel Computing Toolbox – plus Distributed Computing Server for greater parallel environments
– released in 2004; increasing popularity
• Some features coherent with the data parallel model
  – codistributed arrays: arrays partitioned into segments, each of which resides in the workspace of a different task
    • allow handling larger data sets than in a single MATLAB session
    • support for more than 150 MATLAB functions (e.g. finding eigenvalues)
    • used in a very similar way as regular arrays
– parallel FOR loop: loop iterations without enforcing their particular ordering
• distributes loop iterations over a set of tasks
• iterations must be independent of each other
Codistributed arrays
Data parallel in MATLAB
parfor i = (1:nsteps)
  x = i * step;
  s = s + (4 /(1 + x^2));
end
Shared variables model
15
• Assumed underlying hardware: multiprocessor
  – collection of processors that share common memory
  – interconnection fabric (bus, crossbar) supporting a single address space
• Not applicable to multicomputers
  – but: Intel Cluster OpenMP
• Easier to apply than message passing
  – allows incremental parallelization
• Based on the notion of threads
Hardware requirements
after [Wilkinson2004]
16
Thread vs. process (1)
[Figure: a process contains code, heap, files and interrupt routines shared by all threads; each thread has its own stack and instruction pointer (IP)]
17
• A thread ("lightweight" process) differs from a ("heavyweight") process:
  – all threads in a process share the same memory space
  – each thread has a thread-private area for its local variables, e.g. the stack
  – threads can work on shared data structures
  – threads can communicate with each other via the shared data
• Threads were originally not targeted at technical or HPC computing
  – low-level, task (rather than data) parallelism
• Details of the thread/process relationship are very OS dependent
Thread vs. process (2)
18
• A parallel application generates, when appropriate, a set of cooperating threads
  – usually one per processor
  – distinguished by enumeration
• Shared memory provides the means to exchange data among threads
  – shared data can be accessed by all threads
  – no message passing necessary
Thread communication
[Figure: Thread 1 keeps private data my_a = 23 and writes it to shared data sh_a; Thread 2 reads sh_a and runs my_a = sh_a+1, obtaining 24]
19
• Threads execute their programs asynchronously
• Writes and reads are always nonblocking
• Accessing shared data needs careful control
  – some mechanism is needed to ensure that the actions occur in the correct order
    • e.g. the write of A in thread 1 must occur before its read in thread 2
• Most common synchronization constructs:
  – master section: a section of code executed by one thread only
    • e.g. initialisation, writing a file
  – barrier: all threads must arrive at a barrier before any thread can proceed past it
    • e.g. delimiting phases of computation (e.g. a timestep)
  – critical section: only one thread at a time can enter a section of code
    • e.g. modification of shared variables
• This makes shared-variables programming error-prone
Thread synchronization
20
• Consider two threads, each of which is to add 1 to a shared data item X, e.g. X = 10:
  1. read X
  2. compute X+1
  3. write X back
• If step 1 is performed at the same time by both threads, the result will be 11 (instead of the expected 12)
• Race condition: two or more threads (processes) are reading or writing shared data, and the result depends on who runs precisely when
• X = X+1 must be an atomic operation
• This can be ensured by mechanisms of mutual exclusion
  – e.g. critical section, mutex, lock, semaphore, monitor
Accessing shared data
[Wilkinson2004]
21
• Initially only the master thread is active
  – executes the sequential code
• Basic operations:
  – fork: the master thread creates / awakens additional threads to execute in a parallel region
  – join: at the end of the parallel region the created threads die / are suspended
• Dynamic thread creation
  – the number of active threads changes during execution
  – fork is not an expensive operation
• A sequential program is a special / trivial case of a shared-memory parallel program
Fork/Join parallelism
[Quinn 2004]
[Figure: time flows downward; the master thread repeatedly forks other threads and joins them at the end of each parallel region]
22
• Compiler's support:
  – automatic parallelization
  – assisted parallelization
  – OpenMP
• Thread libraries:
  – POSIX threads, Windows threads
[next slides]
Computer realization
23
• The code is instrumented automatically by the compiler
  – according to compilation flags and/or environment variables
• Parallelizes independent loops only
  – processed by the prescribed number of parallel threads
• Usually provided by Fortran compilers for multiprocessors
  – as a rule, proprietary solutions
• Simple and sometimes fairly efficient
• Applicable to programs with a simple structure
• Ex.:
  – XL Fortran (IBM, AIX): -qsmp=auto option, XLSMPOPTS environment variable (the number of threads)
  – Fortran (SUN, Solaris): -autopar flag, PARALLEL environment variable
  – PGI C (Portland Group, Linux): -Mconcur flag
Automatic parallelization
24
• The programmer provides the compiler with additional information by adding compiler directives
  – special lines of source code with meaning only to a compiler that understands them
    • in the form of stylized Fortran comments or #pragma in C
    • ignored by nonparallelizing compilers
• Assertive and prescriptive directives [next slides]
• Diverse formats of the parallelizing directives, but similar capabilities
  – a standard is required
Assisted parallelization
25
• Hints that state facts the compiler might not guess from the code itself
• Evaluation is context dependent
• Ex.: XL Fortran (IBM, AIX)
  – no dependencies (the references in the loop do not overlap, parallelization possible): !SMP$ ASSERT (NODEPS)
  – trip count (average number of iterations of the loop; helps decide whether to unroll or parallelize the loop): !SMP$ ASSERT (INTERCNT(100))
Assertive directives
26
• Instructions for the parallelizing compiler, which it must obey
  – clauses may specify additional information
• A means for manual parallelization
• Ex.: XL Fortran (IBM, AIX)
  – parallel region: defines a block of code that can be executed by a team of threads concurrently
  – parallel loop: enables specifying which loops the compiler should parallelize
• Besides directives, additional constructs within the base language can be introduced to express parallelism
  – e.g. the forall statement in Fortran 95
Prescriptive directives
!SMP$ PARALLEL <clauses>
  <block>
!SMP$ END PARALLEL

!SMP$ PARALLEL DO <clauses>
  <do loop>
!SMP$ END PARALLEL DO
27
• API for writing portable multithreaded applications based on the shared variables model
  – the master thread spawns a team of threads as needed
  – relatively high level (compared to thread libraries)
• A standard developed by the OpenMP Architecture Review Board
  – http://www.openmp.org
  – first specification in 1997
• A set of compiler directives and library routines
• Language interfaces for Fortran, C and C++
  – OpenMP-like interfaces for other languages (e.g. Java)
• Parallelism can be added incrementally
  – i.e. the sequential program evolves into a parallel program
  – single source code for both the sequential and parallel versions
• OpenMP compilers are available on most platforms (Unix, Windows, etc.)
[More in a special lecture]
OpenMP
28
• Collections of routines to create, manage, and coordinate threads
• Main representatives:
  – POSIX threads (Pthreads)
  – Windows threads (Windows (Win32) API)
• Explicit threading was not primarily intended for parallel programming
  – low-level, quite complex coding
Thread libraries
29
Numerical integration based on the rectangle method:

set n (number of strips)
for each strip
  calculate the height y of the strip (rectangle) at its midpoint
  sum all y to the result S
endfor
multiply S by the width of the strips
print result
Example: PI calculation
Calculation of π by the numerical integration formula

  ∫₀¹ 4/(1+x²) dx = π

[Figure: plot of F(x) = 4/(1+x²) for x in [0.0, 1.0], with values between 2.0 and 4.0]
30
/* Pi, Win32 API */
#include <windows.h>
#define NUM_THREADS 2

HANDLE thread_handles[NUM_THREADS];
CRITICAL_SECTION hUpdateMutex;
static long num_steps = 100000;
double step, global_sum = 0.0;

void Pi (void *arg)
{
  int i, start;
  double x, sum = 0.0;

  start = *(int *) arg;
  step = 1.0 / (double) num_steps;
  for (i = start; i <= num_steps; i = i + NUM_THREADS) {
    x = (i - 0.5) * step;
    sum = sum + 4.0 / (1.0 + x * x);
  }
  EnterCriticalSection(&hUpdateMutex);
  global_sum += sum;
  LeaveCriticalSection(&hUpdateMutex);
}
PI in Windows threads (1)
31
void main ()
{
  double pi; int i;
  DWORD threadID;
  int threadArg[NUM_THREADS];

  for (i = 0; i < NUM_THREADS; i++) threadArg[i] = i + 1;
  InitializeCriticalSection(&hUpdateMutex);
  for (i = 0; i < NUM_THREADS; i++) {
    thread_handles[i] = CreateThread(0, 0, (LPTHREAD_START_ROUTINE) Pi,
                                     &threadArg[i], 0, &threadID);
  }
  WaitForMultipleObjects(NUM_THREADS, thread_handles, TRUE, INFINITE);
  pi = global_sum * step;
  printf(" pi is %f \n", pi);
}
PI in Windows threads (2)
32
/* Pi, pthreads library */
#define _REENTRANT
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define NUM_THREADS 2

pthread_t thread_handles[NUM_THREADS];
pthread_mutex_t hUpdateMutex;
pthread_attr_t attr;
static long num_steps = 100000;
double step, global_sum = 0.0;

void* Pi (void *arg)
{
  int i, start;
  double x, sum = 0.0;

  start = *(int *) arg;
  step = 1.0 / (double) num_steps;
  for (i = start; i <= num_steps; i = i + NUM_THREADS) {
    x = (i - 0.5) * step;
    sum = sum + 4.0 / (1.0 + x * x);
  }
  pthread_mutex_lock(&hUpdateMutex);
  global_sum += sum;
  pthread_mutex_unlock(&hUpdateMutex);
  return NULL;
}
PI in POSIX threads (1)
33
void main ()
{
  double pi; int i; int retval;
  pthread_t threadID;
  int threadArg[NUM_THREADS];

  pthread_attr_init(&attr);
  pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
  pthread_mutex_init(&hUpdateMutex, NULL);
  for (i = 0; i < NUM_THREADS; i++) threadArg[i] = i + 1;
  for (i = 0; i < NUM_THREADS; i++) {
    retval = pthread_create(&threadID, NULL, Pi, &threadArg[i]);
    thread_handles[i] = threadID;
  }
  for (i = 0; i < NUM_THREADS; i++) {
    retval = pthread_join(thread_handles[i], NULL);
  }
  pi = global_sum * step;
  printf(" pi is %.10f \n", pi);
}
PI in POSIX threads (2)
34
/* Pi, OpenMP, using parallel for and reduction */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define NUM_THREADS 2

static long num_steps = 1000000;
double step;

void main ()
{
  int i;
  double x, pi, sum = 0.0;

  step = 1.0 / (double) num_steps;
  omp_set_num_threads(NUM_THREADS);
PI in OpenMP (1)
35
PI in OpenMP (2)
  #pragma omp parallel for reduction(+:sum) private(x)
  for (i = 1; i <= num_steps; i++) {
    x = (i - 0.5) * step;
    sum += 4.0 / (1.0 + x * x);
  }
  pi = sum * step;
  printf("Pi is %.10f \n", pi);
}
NB: Programs such as PI calculation are likely to be successfully parallelized through automatic parallelization as well
Message passing model
37
• Assumed underlying hardware: multicomputer
  – collection of processors, each with its own local memory
  – interconnection network supporting message transfer between every pair of processors
• Supported by all (parallel) architectures – the most general model
  – naturally fits multicomputers
  – easily implemented on multiprocessors
• Complete control: data distribution and communication
• May not be easy to apply
  – the sequential-to-parallel transformation requires a major effort
  – one giant step rather than many tiny steps
  – message passing = the "assembler" of parallel computing
Hardware requirements
[Quinn2004]
38
• A parallel application generates (next slide) a set of cooperating processes
  – process = instance of a running program
  – usually one per processor
  – distinguished by a unique ID number
    • rank (MPI), tid (PVM), etc.
• To solve a problem, processes alternately perform computations and exchange messages
  – basic operations: send, receive
  – no shared memory space necessary
• Messages transport the contents of variables of one process to variables of another process
• Message passing also has a synchronization function
Message passing
[Figure: data transfer – Process 1 executes send(&x, 2), Process 2 executes recv(&y, 1); the contents of x are copied to y]
[Wilkinson2004]
39
• Static process creation
  – fixed number of processes over time
  – specified before the execution (e.g. on the command line)
  – usually the processes follow the same code, but their control paths through the code can differ, depending on the ID
    • SPMD (Single Program Multiple Data) model
    • one master process (ID 0) – several slave processes
• Dynamic process creation
  – varying number of processes over time
    • just one process at the beginning
  – processes can create (destroy) other processes: the spawn operation
    • rather expensive!
  – the processes often differ in code
    • MPMD (Multiple Program Multiple Data) model
Process creation
[Figure: Process 1 executes spawn(); some time later, Process 2 starts]
[Wilkinson2004]
40
• Exactly two processes are involved
• One process (sender / source) sends a message and another process (receiver / destination) receives it
  – active participation of processes on both sides is usually required
    • two-sided communication
• In general, the source and destination processes operate asynchronously
  – the source may complete sending a message long before the destination gets around to receiving it
  – the destination may initiate receiving a message that has not yet been sent
• The order of messages is guaranteed (they do not overtake each other)
• Examples of technical issues:
  – handling several messages waiting to be received
  – sending complex data structures
  – using message buffers
  – send and receive routines – blocking vs. nonblocking
Point-to-point communication
41
• Blocking operation: only returns (from the subroutine call) when the operation has completed
  – ex.: sending a fax on a standard machine
• Nonblocking operation: returns immediately; the operation need not be completed yet, and other work may be performed in the meantime
  – the completion of the operation can/must be tested
  – ex.: sending a fax on a machine with memory
• Synchronous send: does not complete until the message has been received
  – provides (synchronizing) information about the message delivery
  – ex.: sending a fax (on a standard machine)
• Asynchronous send: completes as soon as the message is on its way
  – the sender only knows when the message has left
  – ex.: sending a letter
(Non-)blocking & (a-)synchronous
42
• Transfer of data in a set of processes
• Provided by most message passing systems
• Basic operations [next slides]:
  – barrier: synchronization of processes
  – broadcast: one-to-many communication of the same data
  – scatter: one-to-many communication of different portions of data
  – gather: many-to-one communication of (different, but related) data
  – reduction: gather plus combination of the data with an arithmetic/logical operation
• Root – in some collective operations, the single prominent source / destination
  – e.g. in broadcast
• Collective operations can be built out of point-to-point operations, but these "black box" routines
  – hide a lot of the messy details
  – are usually more efficient
    • can take advantage of special communication hardware
Collective communication
43
• A basic mechanism for synchronizing processes
• Inserted at the point in each process where it must wait for the others
• All processes can continue from this point when all the processes have reached it
  – or when a stated number of processes have reached this point
• Often involved in other operations
Barrier
[Wilkinson2004]
44
• Distributes the same piece of data from a single source (root) to all processes (concerned with the problem)
  – multicast – sends the message to a defined group of processes
Broadcast
[Figure: broadcast – the root's item B is copied to every process]
45
• Distributes each element of an array in the root to a separate process
  – including the root
  – the contents of the ith array element are sent to the ith process
Scatter
[Figure: scatter – the root's array A B C D is distributed, one element to each process]
46
• Collects data from each process at the root
  – the value from the ith process is stored in the ith array element (rank order)
Gather
[Figure: gather – items A, B, C, D from the processes are collected into an array at the root]
47
• Gather operation combined with a specified arithmetic/logical operation:
  1. collect data from each processor
  2. reduce these data to a single value (such as a sum or max)
  3. store the reduced result on the root processor
Reduction
[Figure: reduce – the arrays of the four processes are combined elementwise into a single array at the root]
48
• Computer realization of the message passing model
• Most popular message passing systems (MPS):
  – Message Passing Interface (MPI) [next lecture]
  – Parallel Virtual Machine (PVM)
  – in distributed computing: Corba, Java RMI, DCOM, etc.
Message passing system (1)
49
• Information needed by the MPS to transfer a message includes:
  – sending process and the location, type and amount of transferred data
    • no interest in the data itself (message body)
  – receiving process(es) and storage to receive the data
• Most of this information is attached as the message envelope
  – may be (partially) available to the receiving process
• The MPS may provide various information to the processes
  – e.g. about the progress of communication
• A lot of other technical aspects, e.g.:
  – process enrolment in the MPS
  – addressing scheme
  – content of the envelope
  – using message buffers (system, user space)
Message passing system (2)
50
Message passing (MPI)
+ easier to debug
+ easiest to optimize
+ can overlap communication and computation
+ potential for high scalability
+ support on all parallel architectures
– harder to program
– load balancing, deadlock prevention, etc. need to be addressed
⇒ most freedom and responsibility

Shared variables (OMP)
+ easier to program than MP, code is simpler
+ implementation can be incremental
+ no message start-up costs
+ can cope with irregular communication patterns
– limited to shared-memory systems
– harder to debug and optimize
– scalability limited
– usually less efficient than MP equivalents

Data parallel (HPF)
+ easier to program than MP
+ simpler to debug than SV
+ does not require shared memory
– DP style suitable only for certain applications
– restricted control over data and work distribution
– difficult to obtain top performance
– few APIs available
– out of date?

WWW (what, when, why)
51
• The definition of parallel programming models is not uniform in the literature; other models include e.g.
  – the thread programming model
  – hybrid models, e.g. the combination of the message passing and shared variables models
    • explicit message passing between the nodes of a cluster, shared memory and multithreading within the nodes
• Models continue to evolve along with the changing world of computer hardware and software
  – e.g. the CUDA parallel programming model for the CUDA GPU architecture
Conclusions
52
Further study
• The message passing model and the shared variables model are treated to some extent in all general textbooks on parallel programming
  – exception: [Foster 1995] almost skips data sharing
• There are plenty of books dedicated to shared objects, synchronisation and shared memory, e.g. [Andrews 2000] Foundations of Multithreaded, Parallel, and Distributed Programming
  – not necessarily focusing on parallel processing
• Data parallelism is usually a marginal topic
53