Parallel Processing 1
High Performance Computing (CS 540)
Shared Memory Programming with OpenMP and Pthreads*
Jeremy R. Johnson
*Some of this lecture was derived from Pthreads Programming by Nichols, Buttlar, and Farrell, and from the POSIX Threads Programming Tutorial (computing.llnl.gov/tutorials/pthreads) by Blaise Barney.
Introduction
• Objective: To further study the shared memory model of parallel programming. Introduction to OpenMP and Pthreads for shared memory parallel programming.
• Topics
  – Concurrent programming with UNIX Processes
  – Introduction to shared memory parallel programming with Pthreads
    • Threads
    • fork/join
    • race conditions
    • Synchronization
    • performance issues: synchronization overhead, contention and granularity, load balance, cache coherency and false sharing
  – Introduction to parallel program design paradigms
    • Data parallelism (static scheduling)
    • Task parallelism with workers
    • Divide and conquer parallelism (fork/join)
Introduction
• Topics
  – OpenMP vs. Pthreads
    • hello_pthreads.c
    • hello_openmp.c
  – Parallel Regions and execution model
  – Data parallelism with loops
  – Shared vs. private variables
  – Scheduling and chunk size
  – Synchronization and reduction variables
  – Functional parallelism with parallel sections
  – Case Studies
Processes
• Processes contain information about program resources and program execution state
  – Process ID, process group ID, user ID, and group ID
  – Environment
  – Working directory
  – Program instructions
  – Registers
  – Stack
  – Heap
  – File descriptors
  – Signal actions
  – Shared libraries
  – Inter-process communication tools (such as message queues, pipes, semaphores, or shared memory)
UNIX Process
Threads
• An independent stream of instructions that can be scheduled to run. Each thread has its own:
  – Stack pointer
  – Registers (including the program counter)
  – Scheduling properties (such as policy or priority)
  – Set of pending and blocked signals
  – Thread-specific data
• "Lightweight process"
  – The cost of creating and managing threads is much less than for processes
  – Threads live within a process and share process resources such as the address space
• Pthreads – standard thread API (IEEE Std 1003.1)
Threads within a UNIX Process
Shared Memory Model
• All threads have access to the same global, shared memory
• All threads within a process share the same address space
• Threads also have their own private data
• Programmers are responsible for synchronizing (protecting) access to globally shared data.
Simple Example
void do_one_thing(int *);
void do_another_thing(int *);
void do_wrap_up(int, int);
int r1 = 0, r2 = 0;
extern int
main(void)
{
do_one_thing(&r1);
do_another_thing(&r2);
do_wrap_up(r1, r2);
return 0;
}
[Figure: the single process running the sequential program: identity (PID, UID, GID), resources (open files, locks, sockets, ...), registers (SP, PC, GP0, GP1, ...), and a virtual address space with stack, text (main(), do_one_thing(), do_another_thing()), data (r1, r2), and heap]
Simple Example (Processes)
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>
void do_one_thing(int *);
void do_another_thing(int *);
void do_wrap_up(int, int);
int shared_mem_id, *shared_mem_ptr;
int *r1p, *r2p;
extern int main(void)
{
pid_t child1_pid, child2_pid;
int status;
/* initialize shared memory segment */
if ((shared_mem_id = shmget(IPC_PRIVATE, 2*sizeof(int), 0660)) == -1)
perror("shmget"), exit(1);
if ((shared_mem_ptr = (int *)shmat(shared_mem_id, (void *)0, 0)) == (void *)-1)
perror("shmat failed"), exit(1);
r1p = shared_mem_ptr;
r2p = (shared_mem_ptr + 1);
*r1p = 0;
*r2p = 0;
if ((child1_pid = fork()) == 0) {
/* first child */
do_one_thing(r1p);
return 0;
} else if (child1_pid == -1) {
perror("fork"), exit(1);
}
/* parent */
if ((child2_pid = fork()) == 0) {
/* second child */
do_another_thing(r2p);
return 0;
} else if (child2_pid == -1) {
perror("fork"), exit(1);
}
/* parent */
if ((waitpid(child1_pid, &status, 0) == -1))
perror("waitpid"), exit(1);
if ((waitpid(child2_pid, &status, 0) == -1))
perror("waitpid"), exit(1);
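do_wrap_up(*r1p, *r2p);
/* detach and remove the shared memory segment */
shmdt(shared_mem_ptr);
shmctl(shared_mem_id, IPC_RMID, NULL);
return 0;
}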
[Figure: two child processes, one running do_one_thing() and one running do_another_thing(), each with its own identity, resources, registers, and virtual address space (stack, text, data, heap); both are attached to a common shared memory segment]
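Simple Example (PThreads)
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
void do_one_thing(int *);
void do_another_thing(int *);
void do_wrap_up(int, int);
int r1 = 0, r2 = 0;
extern int
main(void)
{
pthread_t thread1, thread2;
/* the routines take an int *, so cast them to the void *(*)(void *) type pthread_create expects */
if (pthread_create(&thread1,
NULL,
(void *(*)(void *)) do_one_thing,
(void *) &r1) != 0)
perror("pthread_create"), exit(1);
if (pthread_create(&thread2,
NULL,
(void *(*)(void *)) do_another_thing,
(void *) &r2) != 0)
perror("pthread_create"), exit(1);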
if (pthread_join(thread1, NULL) != 0)
perror("pthread_join"),exit(1);
if (pthread_join(thread2, NULL) != 0)
perror("pthread_join"),exit(1);
do_wrap_up(r1, r2);
return 0;
}
[Figure: one process with a single identity, set of resources, and virtual address space (text, data with r1 and r2, heap) containing two threads; Thread 1 and Thread 2 each have their own stack and registers (SP, PC, GP0, GP1, ...)]
Concurrency and Parallelism
[Figure: timelines comparing sequential, concurrent, and parallel execution of do_one_thing(), do_another_thing(), and do_wrap_up()]
Unix Fork
• The fork() call
– Creates a child process that is identical to the parent process
– The child has its own PID
– The fork() call provides different return values to the parent [child’s PID] and the child [0]
[Figure: a parent process (PID = 7274) calls fork(); the child (PID = 7275) continues from the same point in the code]
Thread Creation
• pthread_create creates a new thread and makes it executable
  – pthread_create(thread, attr, start_routine, arg)
    • thread: unique identifier for the new thread, set by pthread_create
    • attr: attribute object for the new thread (NULL for the default attributes)
    • start_routine: the routine the newly created thread will execute
    • arg: a single argument passed to start_routine
Thread Creation
• Once created, threads are peers, and may create other threads
Thread Join
• "Joining" is one way to accomplish synchronization between threads.
• The pthread_join() subroutine blocks the calling thread until the thread identified by threadid terminates.
Fork/Join Overhead
• Compare the overhead of procedure call, process fork/join, thread create/join
  – Procedure call (no args)
    • 1.2 × 10^-8 sec (12 ns)
  – Process
    • 0.0012 sec (1.2 ms)
  – Thread
    • 0.000042 sec (42 µs)
Race Conditions
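• When two or more threads access the same resource at the same time, at least one of them modifies it, and the result depends on the order in which the accesses happen
Time (increasing downward)
Thread 1 | Thread 2 | Balance
Withdraw $50 | Withdraw $50 | $125
Read Balance: $125 | Read Balance: $125 | $125
Set Balance: $75 | Set Balance: $75 | $75
• Both withdrawals succeed, but the final balance is $75 instead of $25: one update is lost.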
Bad Count
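#include <pthread.h>
#include <stdio.h>
int sum= 0;
void count(int *arg)
{
int i;
for (i=0;i<*arg;i++) {
sum++; /* unsynchronized read-modify-write of the shared sum: increments can be lost */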
}
}
int main(int argc, char **argv)
{
int error,i;
int numcounters = NUMCOUNTERS;
int limit = LIMIT;
pthread_t tid[NUMCOUNTERS];
pthread_setconcurrency(numcounters);
for (i=0;i<numcounters;i++)
{
error = pthread_create(&tid[i],NULL,(void *(*)(void *))count,&limit);
}
for (i=0;i<numcounters;i++)
{
error = pthread_join(tid[i],NULL);
}
printf("Counters finished with count = %d\n",sum);
printf("Count should be %d X %d = %d\n",numcounters,limit,numcounters*limit);
return 0;
}
Mutex
• Mutex variables are for protecting shared data when multiple writes occur.
• A mutex variable acts like a "lock" protecting access to a shared data resource. Only one thread can own (lock) a mutex at any given time
Mutex Operations
• pthread_mutex_lock(mutex)
  – The pthread_mutex_lock() routine is used by a thread to acquire a lock on the specified mutex variable. If the mutex is already locked by another thread, this call blocks the calling thread until the mutex is unlocked.
• pthread_mutex_unlock(mutex)
  – Unlocks a mutex if called by the owning thread. Calling this routine is required after a thread has completed its use of protected data if other threads are to acquire the mutex for their work with the protected data.
Good Count
#include <pthread.h>
#include <stdio.h>
int sum= 0;
pthread_mutex_t lock;
void count(int *arg)
{
int i;
for (i=0;i<*arg;i++)
{
pthread_mutex_lock(&lock);
sum++;
pthread_mutex_unlock(&lock);
}
}
int main(int argc, char **argv)
{
int error,i;
int numcounters = NUMCOUNTERS;
int limit = LIMIT;
pthread_t mytid, tid[MAXCOUNTERS];
pthread_setconcurrency(numcounters);
pthread_mutex_init(&lock,NULL);
for (i=0;i<numcounters;i++)
{
error = pthread_create(&tid[i],NULL,(void *(*)(void *))count, &limit);
}
for (i=0;i<numcounters;i++)
{
error = pthread_join(tid[i],NULL);
}
printf("Counters finished with count = %d\n",sum);
printf("Count should be %d X %d = %d\n",numcounters,limit,numcounters*limit);
return 0;
}
Better Count
int sum= 0;
pthread_mutex_t lock;
void count(int *arg)
{
int i;
int localsum = 0;
for (i=0;i<*arg;i++)
{
localsum++;
}
pthread_mutex_lock(&lock);
sum = sum + localsum;
pthread_mutex_unlock(&lock);
}
Threadsafe Code
• Refers to an application's ability to execute multiple threads simultaneously without "clobbering" shared data or creating "race" conditions.
Condition Variables
• While mutexes implement synchronization by controlling thread access to data, condition variables allow threads to synchronize based upon the actual value of data.
• Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section), to check if the condition is met.
• A condition variable is a way to achieve the same goal without polling.
• Always used with a mutex.
Using Condition variables
Thread A
• Do work up to the point where a certain condition must occur (such as "count" must reach a specified value)
• Lock associated mutex and check value of a global variable
• Call pthread_cond_wait() to perform a blocking wait for a signal from Thread-B. Note that a call to pthread_cond_wait() automatically and atomically unlocks the associated mutex variable so that it can be used by Thread-B. Because the wait can return before the condition holds (a spurious wakeup), re-check the condition in a loop.
• When signalled, wake up. The mutex is automatically and atomically locked again.
• Explicitly unlock mutex
• Continue
Thread B
• Do work
• Lock associated mutex
• Change the value of the global variable that Thread-A is waiting upon.
• Check value of the global Thread-A wait variable. If it fulfills the desired condition, signal Thread-A.
• Unlock mutex.
• Continue
Condition Variable Example
void *watch_count(void *idp)
{
int i=0, save_state, save_type;
int *my_id = idp;
pthread_mutex_lock(&count_lock);
while (count < COUNT_THRES) {
pthread_cond_wait(&count_hit_threshold, &count_lock);
}
pthread_mutex_unlock(&count_lock);
return(NULL);
}
void *inc_count(void *idp)
{
int i=0, save_state, save_type;
int *my_id = idp;
for (i=0; i<TCOUNT; i++) {
pthread_mutex_lock(&count_lock);
count++;
if (count == COUNT_THRES) {
pthread_cond_signal(&count_hit_threshold);
}
pthread_mutex_unlock(&count_lock);
}
return(NULL);
}
OpenMP
• Extension to FORTRAN, C/C++
  – Uses directives (comments in FORTRAN, pragmas in C/C++)
    • ignored without compiler support
    • Some library support required
• Shared memory model
  – parallel regions
  – loop level parallelism
  – implicit thread model
  – communication via shared address space
  – private vs. shared variables (declaration)
  – explicit synchronization via directives (e.g. critical)
  – library routines for returning thread information (e.g. omp_get_num_threads(), omp_get_thread_num())
  – Environment variables used to provide system info (e.g. OMP_NUM_THREADS)
Benefits
• Provides incremental parallelism
• Small increase in code size
• Simpler model than message passing
• Easier to use than thread library
• With hardware and compiler support, finer granularity than message passing
Further Information
• Adopted as a standard in 1997
  – Initiated by SGI
• www.openmp.org
• computing.llnl.gov/tutorials/openMP
• Chandra, Dagum, Kohr, Maydan, McDonald, Menon, “Parallel Programming in OpenMP”, Morgan Kaufmann Publishers, 2001.
• Chapman, Jost, and Van der Pas, “Using OpenMP: Portable Shared Memory Parallel Programming,” The MIT Press, 2008.
Shared vs. Distributed Memory
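[Figure: shared memory (processors P0, P1, ..., Pn attached to a single memory) vs. distributed memory (processors P0, P1, ..., Pn, each with a local memory M0, M1, ..., Mn, connected by an interconnection network)]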
Shared Memory Programming Model
• Shared memory programming does not require physically shared memory so long as there is support for logically shared memory (in either hardware or software)
• If the shared memory is only logical, the cost of a memory access may depend on the physical location of the data.
• UMA: uniform memory access
  – SMP: symmetric multi-processor
  – typically memory connected to processors via a bus
• NUMA: non-uniform memory access
  – typically physically distributed memory connected via an interconnection network
Hello_openmp.c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(int argc, char **argv)
{
int n;
if (argc > 1) {
n = atoi(argv[1]);
omp_set_num_threads(n);
}
/* outside a parallel region only the master thread runs, so this prints 1 */
printf("Number of threads = %d\n",omp_get_num_threads());
#pragma omp parallel
{
int id = omp_get_thread_num();
printf("Hello World from %d\n",id);
if (id == 0)
printf("Number of threads = %d\n",omp_get_num_threads());
}
exit(0);
}
Compiling & Running Hello_openmp
% gcc -fopenmp hello_openmp.c -o hello
% ./hello 4
Number of threads = 1
Hello World from 1
Hello World from 0
Hello World from 3
Number of threads = 4
Hello World from 2
The order of the print statements is nondeterministic
Execution Model
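[Figure: the master thread runs alone until a parallel region; there, threads are created implicitly (fork), master and slave threads execute the region together, and an implicit barrier synchronization (join) ends it, after which only the master thread continues]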
Explicit Barrier
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(int argc, char **argv)
{
int n;
if (argc > 1) {
n = atoi(argv[1]);
omp_set_num_threads(n);
}
printf("Number of threads = %d\n",omp_get_num_threads());
#pragma omp parallel
{
int id = omp_get_thread_num();
printf("Hello World from %d\n",id);
#pragma omp barrier
if (id == 0) printf("Number of threads = %d\n",omp_get_num_threads());
}
exit(0);
}
Output with Barrier
%./hellob 4
Number of threads = 1
Hello World from 1
Hello World from 0
Hello World from 2
Hello World from 3
Number of threads = 4
The order of the “Hello World” print statements is nondeterministic; however, the final “Number of threads” line always comes after all of them, because no thread passes the barrier until every thread has printed.
Hello_pthreads.c
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <errno.h>
#define MAXTHREADS 32
int main(int argc, char **argv)
{
int error,i;
int n = 1; /* default to one thread when no argument is given */
void hello(int *pid);
pthread_t tid[MAXTHREADS],mytid;
int pid[MAXTHREADS];
if (argc > 1) {
n = atoi(argv[1]);
if (n > MAXTHREADS) {
printf("Too many threads\n"); exit(1);
}
pthread_setconcurrency(n);
}
printf("Number of threads = %d\n",pthread_getconcurrency());
for (i=0;i<n;i++) {
pid[i]=i;
error = pthread_create(&tid[i], NULL,(void *(*)(void *))hello, &pid[i]);
}
for (i=0;i<n;i++) {
error = pthread_join(tid[i],NULL);
}
exit(0);
}
void hello(int *pid)
{
pthread_t tid;
tid = pthread_self();
printf("Hello World from %d (tid = %u)\n",*pid,(unsigned int) tid);
if (*pid == 0)
printf("Number of threads = %d\n",pthread_getconcurrency());
}
% gcc -pthread hello.c -o hello
% ./hello 4
Number of threads = 4
Hello World from 0 (tid = 1832728912)
Hello World from 1 (tid = 1824336208)
Number of threads = 4
Hello World from 3 (tid = 1807550800)
Hello World from 2 (tid = 1815943504)
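The order of the print statements is nondeterministic. Note that pthread_getconcurrency() returns the concurrency level requested with pthread_setconcurrency(), which is only a hint to the implementation, not a count of running threads.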
Types of Parallelism
• Data Parallelism
  – Threads execute the same instructions
  – … but on different data
• Functional Parallelism
  – Threads execute different instructions
  – … and can read the same data but should write different data
[Figure: functional parallelism with functions F1, F2, F3, F4]
Parallel Loop
Sequential version:
int a[1000], b[1000];
int main()
{
int i;
int N = 1000;
for (i=0; i<N; i++) {
a[i] = i; b[i] = N-i;
}
for (i=0;i<N;i++) {
a[i] = a[i] + b[i];
}
return 0;
}
OpenMP version:
int a[1000], b[1000];
int main()
{
int i;
int N = 1000;
// Serial Initialization
for (i=0; i<N; i++) {
a[i] = i; b[i] = N-i;
}
// "parallel for" creates the team of threads and divides the iterations among them
#pragma omp parallel for shared(a,b) private(i) schedule(static)
for (i=0;i<N;i++) {
a[i] = a[i] + b[i];
}
return 0;
}
Scheduling of Parallel Loop
[Figure: stripmining: the element-wise sum a + b, with elements distributed across threads tid = 0, 1, 2, ..., Nthreads-1]
Implementation of Parallel Loop
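void vadd(int *id)
{
int i;
/* cyclic (stripmined) assignment: thread *id handles elements *id, *id+numthreads, ...
   In OpenMP terms this corresponds to schedule(static,1); schedule(static) with no
   chunk size instead gives each thread one contiguous block of iterations. */
for (i=*id;i<N;i+=numthreads) {
a[i] = a[i] + b[i];
}
}
for (i=0;i<numthreads;i++) {
id[i] = i;
error = pthread_create(&tid[i],NULL,(void *(*)(void *))vadd, &id[i]);
}
for (i=0;i<numthreads;i++) {
error = pthread_join(tid[i],NULL);
}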
Scheduling Chunks of Parallel Loop
[Figure: a and b divided into chunks; chunk 0 goes to thread 0, chunk 1 to thread 1, chunk 2 to thread 2, ..., chunk Nthreads-1 to thread Nthreads-1]
Implementation of Chunking
#pragma omp parallel for shared(a,b) private(i) schedule(static,CHUNK)
for (i=0;i<N;i++) {
a[i] = a[i] + b[i];
}
void vadd(int *id)
{
int i,j;
for (i=*id*CHUNK;i<N;i+=numthreads*CHUNK) {
/* i+j<N stops the last chunk at the end of the array when N is not a multiple of CHUNK */
for (j=0;j<CHUNK && i+j<N;j++)
a[i+j] = a[i+j] + b[i+j];
}
}
Race Condition
int x[10000000];
int main(int argc, char **argv)
{
int sum=0;
…….
omp_set_num_threads(numcounters);
for (i=0;i<numcounters*limit;i++) x[i] = 1;
#pragma omp parallel for schedule(static) private(i) shared(sum,x)
for (i=0;i<numcounters*limit;i++) {
sum = sum + x[i]; /* race: unsynchronized updates to the shared sum */
if (i==0) printf("num threads = %d\n",omp_get_num_threads());
}
Critical Sections
int x[10000000];
int main(int argc, char **argv)
{
int sum=0;
…….
#pragma omp parallel for schedule(static) private(i) shared(sum,x)
for (i=0;i<numcounters*limit;i++) {
#pragma omp critical(sum)
sum = sum + x[i];
}
Reduction Variables
int x[10000000];
int main(int argc, char **argv)
{
int sum=0;
…….
#pragma omp parallel for schedule(static) private(i) shared(x) reduction(+:sum)
for (i=0;i<numcounters*limit;i++) {
sum = sum + x[i];
}
Reduction
[Figure: the elements of X[] are split into groups, each group is added into a partial sum, and the partial sums are added into the total sum]
Implementing Reduction
#pragma omp parallel shared(sum,x)
{
int i;
int localsum=0;
int id;
id = omp_get_thread_num();
/* assumes the team has numcounters threads, as set by omp_set_num_threads(numcounters) */
for (i=id;i<numcounters*limit;i+=numcounters) {
localsum = localsum + x[i];
}
#pragma omp critical(sum)
sum = sum+localsum;
}
Functional Parallelism Example
int main()
{
int i;
double a[N], b[N], c[N], d[N];
// Parallel Function
#pragma omp parallel shared(a,b,c,d) private(i)
{
#pragma omp sections
{
#pragma omp section
for (i=0; i<N; i++)
c[i] = a[i] + b[i];
#pragma omp section
for (i=0; i<N; i++)
d[i] = a[i] * b[i];
}
}
return 0;
}
Parallel Programming
• Task parallelism vs. data parallelism
• Fork/join parallelism (divide & conquer)
• Static scheduling
• Dynamic scheduling with workers
Sequential Count
int X[MAXSIZE];
int icount(int l,int u)
{
int i;
int y = 0;
for (i=l; i<=u;i++)
y = y + X[i];
return y;
}
int rcount(int l,int u)
{
int m;
int y1,y2;
if ( (u-l) == 0)
return X[l];
else
{
m = (l+u)/2;
y1 = rcount(l,m);
y2 = rcount(m+1,u);
return (y1 + y2);
}
}
Counting with a Parallel Loop
int sum= 0;
int numcounters;
int size;
pthread_mutex_t lock;
void count(int *id)
{
int i,lsum;
lsum = 0;
for (i=*id;i<size;i+=numcounters)
{
lsum = lsum + X[i];
}
pthread_mutex_lock(&lock);
sum = sum + lsum;
pthread_mutex_unlock(&lock);
}
Counting with Workers
void get_task(int *start, int *stop)
{
pthread_mutex_lock(&task_lock);
*start = task_index;
if (*start + task_chunk > n)
*stop = n;
else
*stop = *start + task_chunk;
task_index = *stop;
pthread_mutex_unlock(&task_lock);
}
void worker()
{
int start,stop,i;
int y = 0;
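/* keep requesting chunks until get_task returns an empty range (start == stop) */
for (get_task(&start,&stop); start < stop; get_task(&start,&stop))
for (i=start; i<stop;i++)
y = y + X[i];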
pthread_mutex_lock(&sum_lock);
sum = sum + y;
pthread_mutex_unlock(&sum_lock);
}
Parallel Divide & Conquer
int pcount(int *arg)
{
int error,arg1[3],arg2[3];
int l,u,m;
int y,y1,y2;
pthread_t tid1,tid2;
l = arg[0];
u = arg[1];
if ( (u-l) <= cutoff)
y = icount(l,u);
else
{
m = (l+u)/2;
arg1[0] = l;
arg1[1] = m;
error = pthread_create(&tid1,NULL,(void *(*)(void *))pcount,arg1);
/* y2 = icount(m+1,u); */
arg2[0] = m+1;
arg2[1] = u;
error = pthread_create(&tid2,NULL,(void *(*)(void *))pcount,arg2);
error = pthread_join(tid1,NULL);
y1 = arg1[2];
error = pthread_join(tid2,NULL);
y2 = arg2[2];
y = y1 + y2;
}
/* thr_exit(&y); */
arg[2] = y;
return y;
}