Shared Memory Programming: pThreads and OpenMP


Page 1: Shared Memory Programming: pThreads and OpenMP

IBM, July 2006

Page 2: Agenda

• Shared Memory
• Computer Architecture
• Programming Paradigm
• pThreads
• OpenMP Compilers
• Implementation
• Compiler Internals
• Automatic Parallelization
• Work Distribution
• Performance
• Problems and Concerns

Page 3: Compiling and Running an OpenMP Program

xlf90_r -O3 -qsmp -c openmp_prog.c
xlf90_r -qsmp -o a.out openmp_prog.o
export OMP_NUM_THREADS=4
./a.out &
ps -m -o THREAD

_r = reentrant code.
Use -qsmp both for compiling and for linking.
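For reference, a minimal OpenMP program that these commands could build and run; the slides do not show the contents of openmp_prog.c, so the body below is an illustrative sketch (note the slide invokes the Fortran driver xlf90_r on a .c file; the C driver xlc_r accepts the same -qsmp flag):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* OMP_NUM_THREADS=4 above gives four threads in this region */
    #pragma omp parallel
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}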

Page 4: Shared Memory Programming

pThreads
• POSIX standard
• Library functions
• Explicit fork and join
• Explicit synchronization
• Explicit locks
• Often used in "operational" environments
• Often used for M:N applications = many short tasks for a few processors

OpenMP
• Industry standard
• Compiler assist: directives (Fortran), pragmas (C, C++)
• Explicit "fork" and "join"
• Implicit synchronization
• Implicit locks
• Often used in data analysis
• Often used for 1:1 applications = one task per processor

Page 5: Shared Memory Architecture

• One address space
• Multiple processor access
• Software synchronization
• Hardware (and software) cache coherence

[Diagram: CPU0 and CPU1 both connected to a single shared Memory]

Page 6: Shared Memory Architecture Variation: NUMA

• Single address space
• Non-uniform memory latency
• Non-uniform memory bandwidth

[Diagram: CPU0 and CPU1, each with its own local Memory, linked together]

Shared memory definition: single address space.

Page 7: pThreads

• Standard for shared memory programming
• POSIX standard
• Available on ALL modern computers
• Reentrant code:
  • Does not hold static data across successive calls
  • Important for f77 (static storage) programs
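To illustrate the reentrancy point, a sketch (not from the slides) of a non-reentrant function and its reentrant fix; the function names are hypothetical:

#include <string.h>

/* Non-reentrant: the static buffer is shared by all threads, so
   concurrent calls overwrite each other's result, just like f77
   static storage. */
char *label_broken(const char *name)
{
    static char buf[64];
    strncpy(buf, name, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
    return buf;
}

/* Reentrant: the caller supplies the storage, so no state is held
   across successive calls. */
char *label_safe(const char *name, char *buf, size_t len)
{
    strncpy(buf, name, len - 1);
    buf[len - 1] = '\0';
    return buf;
}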

Page 8: pThreads Thread Scope

• AIXTHREAD_SCOPE={S|P}
• S: Thread is bound to a kernel thread
  • Scheduled by the kernel
• P: Thread is subject to the user scheduler
  • Does not have a dedicated kernel thread
  • Sleeps in user mode
  • Placed on the user run queue when waiting for a processor
  • Subject to time slicing by the user scheduler
  • Priority of thread is controlled by the user

Page 9: pThreads Thread Scope

• Typically, the user is concerned with process threads
• Typically, the operating system uses system threads
• Process-to-system thread mapping (pThreads default): 8 to 1
  • 8 process threads map onto 1 system thread
• User can choose system or process threads:
  • pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
  • export AIXTHREAD_SCOPE=S
  • export AIXTHREAD_MNRATIO=m:n
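A short sketch of the pthread_attr_setscope() call listed above, requesting system (S) scope for one new thread; error handling is omitted for brevity:

#include <pthread.h>

void *worker(void *arg) { /* ... thread body ... */ return NULL; }

int start_bound_thread(pthread_t *tid)
{
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    /* Bind the new thread to its own kernel thread; same effect as
       exporting AIXTHREAD_SCOPE=S for the whole process. */
    pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
    int rc = pthread_create(tid, &attr, worker, NULL);
    pthread_attr_destroy(&attr);
    return rc;
}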

Page 10: m:n Thread Scope

void main() {
    ...
    for (i = 0; i < nthreads; i++)
        pthread_create(....);
    ....
}

[Diagram: the program, linked with -lpthreads, creates m process threads; AIX maps them onto n system threads]

Page 11: OpenMP Thread Scope

• 1:1 model
• Each user (process) thread is mapped to one system thread

void main() {
    ...
    #pragma omp parallel
    for (i = 0; i < nthreads; i++)
        work();
    ....
}

[Diagram: the program, compiled with -qsmp, maps each user thread onto its own system thread]

Page 12: OpenMP and pThreads

• OpenMP uses a shared memory model similar to pThreads:
  • Fork
  • Join
  • Barriers (mutexes)
• NOT strictly built on top of pThreads

Page 13: OpenMP and pThreads

// "Master" thread forks slave threads
main() {
    ...
    for (i = 0; i < num_threads; i++)
        pthread_create(&tid[i], ...);
    for (i = 0; i < num_threads; i++)
        pthread_join(tid[i], ...);
}

void do_work(void *it_num) {
    for (i = start; i < ending; i++)
        A[i] = func(B);
    return;
}

void sub(n, A, B) {
    #pragma omp parallel
    {
        #pragma omp for
        for (i = 0; i < n; i++) {
            A[i] = func(B);
            ...
        }
    }
    ....
}
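Filling in the elided pieces, a self-contained version of the pThreads skeleton above; the array size, func(), and the block partitioning are illustrative assumptions:

#include <pthread.h>
#include <stdio.h>

#define N 1000
#define NUM_THREADS 4

double A[N];
double B = 2.0;

double func(double b) { return b * b; }   /* illustrative */

/* Each thread computes one contiguous block of A. */
void *do_work(void *it_num)
{
    long t = (long)it_num;
    long start = t * N / NUM_THREADS;
    long ending = (t + 1) * N / NUM_THREADS;
    for (long i = start; i < ending; i++)
        A[i] = func(B);
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_THREADS];
    long i;
    for (i = 0; i < NUM_THREADS; i++)      /* explicit fork */
        pthread_create(&tid[i], NULL, do_work, (void *)i);
    for (i = 0; i < NUM_THREADS; i++)      /* explicit join */
        pthread_join(tid[i], NULL);
    printf("A[0] = %g\n", A[0]);
    return 0;
}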

Page 14: OpenMP Example

void user_sub() {
    ...
    #pragma omp parallel     // fork (locks guard region entry)
    {
        ...
        #pragma omp for
        for (i = 0; i < n; i++) {
            A[i] = ....
            B[i] = ..
        }
        ...
    }                        // join (locks guard region exit)
    return;
}

Page 15: Parallel Regions and Work Sharing

• Parallel Region
  • Master thread forks slave threads
  • Slave threads enter "re-entrant" code
• Work Sharing
  • Slave threads collectively divide the work
  • Various scheduling schemes
  • User choice

void user_sub() {
    ...
    #pragma omp parallel
    {
        ...
        #pragma omp for
        for (i = 0; i < n; i++) {
            A[i] = ....
            B[i] = ....
        }
        ...
    }
    return;
}

Page 16: Parallelization Example

void sub(n, A, B) {
    #pragma omp parallel
    {
        #pragma omp for
        for (i = 0; i < n; i++)
            A[i] = func(B);
    }
    ....
}

Page 17: Parallelization Example

How the compiler conceptually restructures the code: the parallel region and the work-sharing loop are outlined into separate routines, and each thread runs its own ns..ne slice.

void sub(n, A, B) {
    ...
    #pragma omp parallel
    sub1(A, B);
    ...
}

void sub1(A, B) {
    ...
    #pragma omp for    /* schematic: each thread gets its own ns..ne range */
    sub2(ns, ne, A, B);
    ....
}

void sub2(ns, ne, A, B) {
    for (i = ns; i < ne; i++)
        A[i] = func(B[i]);
}
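A hand-outlined, compilable sketch of the same transformation; the typed signatures, the func() body, and the static ns/ne computation are assumptions, since the slide leaves them schematic:

#include <omp.h>

double func(double b) { return b * b; }   /* illustrative */

void sub2(int ns, int ne, double A[], double B[])
{
    for (int i = ns; i < ne; i++)
        A[i] = func(B[i]);
}

void sub1(int n, double A[], double B[])
{
    /* Each thread derives its own [ns, ne) slice from its id. */
    int p = omp_get_num_threads();
    int t = omp_get_thread_num();
    int ns = t * n / p;
    int ne = (t + 1) * n / p;
    sub2(ns, ne, A, B);
}

void sub(int n, double A[], double B[])
{
    #pragma omp parallel      /* every thread runs the outlined body */
    sub1(n, A, B);
}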

Page 18: OpenMP Work Distribution

• Scheduling determined by the user
  • Directives
  • Environment variables
• STATIC: small overhead
  • Distribution done at compile time
• DYNAMIC: better for load balancing
  • Distribution done during execution
• GUIDED

Page 19: Scheduling

• Environment variable:
  • OMP_SCHEDULE={static, dynamic, guided}
• Pragma:
  • #pragma omp for schedule({static, dynamic, guided})
• Full syntax:
  • OMP_SCHEDULE="{static|dynamic|guided}[,chunk_size]"
  • #pragma omp for schedule({static|dynamic|guided}[, chunk_size])
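Note that OMP_SCHEDULE takes effect for loops declared schedule(runtime); otherwise the schedule clause in the pragma decides. A small illustrative fragment (the routine name and loop body are assumptions):

double func(double b);          /* illustrative */

void scale(int n, double A[], double B[])
{
    int i;
    /* chunk size 5, matching the dynamic example on a later slide */
    #pragma omp parallel for schedule(dynamic, 5)
    for (i = 0; i < n; i++)
        A[i] = func(B[i]);

    /* defer the choice to the environment:
       export OMP_SCHEDULE="guided,4" */
    #pragma omp parallel for schedule(runtime)
    for (i = 0; i < n; i++)
        A[i] = func(B[i]);
}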

Page 20: Scheduling: Static

• Default (xlf and xlc)
• Number of iterations per thread determined before loop execution

[Diagram: at compile time, the loop for (i = 0; i < n; i++) { ... } over data 0..(n-1) is split into p fixed chunks of n/p iterations; at execution, thread 0 runs for (i = 0; i < n/p; i++) { ... }, thread 1 runs for (i = n/p; i < 2*n/p; i++) { ... }, and so on]

Page 21: Static Scheduling Example

n = 100 iterations, 4 threads.

[Diagram: threads 0, 1, 2, and 3 each receive one contiguous block of 25 iterations]

Page 22: Scheduling: Dynamic

• Next thread takes the next slab

[Diagram: at compile time, the loop for (i = 0; i < n; i++) { ... } over data 0..(n-1) is cut into slabs of size n/p, e.g. for (i = 0; i < n/p; i++) { ... }; at execution, the next available thread takes the next available slab]

Page 23: Dynamic Scheduling Example

n = 100 iterations, 4 threads, chunk_size = 5.

[Diagram: the 100 iterations are cut into 20 slabs of 5; threads 0-3 grab slabs one at a time as they finish, so the number of slabs per thread varies with how fast each thread runs]

Page 24: Scheduling: Guided

• Next thread takes a reduced-size slab

[Diagram: at compile time, the loop for (i = 0; i < n; i++) { ... } over data 0..(n-1) is cut into shrinking slabs: the first is n/p iterations, the next thread takes a slab of size (n - n/p)/p, i.e. the remaining work divided by p; at execution, the next available thread takes the next, further reduced slab]

Page 25: Guided Scheduling Example

n = 100 iterations, 4 threads.

[Diagram: slab sizes shrink as remaining/p: 25, 18, 14, 10, 8, 6, 4, 3, 3, 2, then seven slabs of 1, handed to threads 0-3 as each becomes available]
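A sketch that reproduces the slab sizes above, assuming each grab takes (remaining iterations)/p with a minimum of 1:

#include <stdio.h>

int main(void)
{
    int remaining = 100, p = 4;
    while (remaining > 0) {
        int slab = remaining / p;
        if (slab < 1)
            slab = 1;
        printf("%d ", slab);   /* prints 25 18 14 10 8 6 4 3 3 2 1 ... */
        remaining -= slab;
    }
    printf("\n");
    return 0;
}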

Page 26: Performance Concerns

• False sharing
  • Cache coherency thrashing
• Load balance
  • Uneven iteration distribution
  • Uneven work units
  • Triangles (triangular iteration spaces)
• Barriers
  • Synchronization is expensive

Page 27: False Sharing

• Multiple processors (threads) write to the same cache line
  • Valid shared memory operation, but causes a severe performance penalty
  • Common in older Cray parallel/vector codes
• Dangerous programming practice
  • Difficult to detect

float sum[8];
#pragma omp parallel
{
    p = ...my thread number...;
    #pragma omp for
    for (i = 1; i < n; i++)
        sum[p] = sum[p] + func(i, n);
}

[Diagram: sum[0] through sum[3] all fall within one cache line]

Page 28: Corrected False Sharing

• Each processor (thread) writes to its own cache line
  • Wastes a tiny bit of memory
  • Cache line is 128 bytes = thirty-two 4-byte words

float sum[8*32];
#pragma omp parallel
{
    p = ...my thread number...;
    #pragma omp for
    for (i = 1; i < n; i++)
        sum[p*32] = sum[p*32] + func(i, n);
}

[Diagram: sum[0] now sits in cache line 1 and sum[3*32] in a different line, so each thread's accumulator occupies its own line]
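A runnable version of the padded fix; omp_get_thread_num() fills in the slide's "...my thread number...", and the loop bound, func(), and the 8-thread sizing are illustrative assumptions:

#include <stdio.h>
#include <omp.h>

#define PAD 32                  /* 128-byte line / 4-byte float */

float func(int i, int n) { return (float)i / n; }   /* illustrative */

int main(void)
{
    int n = 1000000;
    static float sum[8 * PAD];  /* sized for up to 8 threads, zeroed */
    #pragma omp parallel
    {
        int p = omp_get_thread_num();
        /* Each thread's accumulator sits on its own cache line. */
        #pragma omp for
        for (int i = 1; i < n; i++)
            sum[p * PAD] = sum[p * PAD] + func(i, n);
    }
    printf("thread 0 partial sum = %g\n", sum[0]);
    return 0;
}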

Page 29: False Sharing

• Multiple threads accessing the SAME line cause contention

#pragma omp for
for (i = 1; i < n; i++) {
    k = index[i];
    A[i] = B[k] + i;
}

[Diagram: B[1] through B[100] may share a cache line, so indexed reads of B from many threads hit the same line]

Page 30: Effect of False Sharing

[Chart: % overhead vs. array length (0 to 1024), one curve per thread count (2, 4, 8, 16, 24, 32); overhead grows with thread count, reaching roughly 1000%]

Page 31: Effect of False Sharing: Two Threads on Same Chip

[Chart: % overhead vs. array length (0 to 1024) for two threads, comparing placement on 1 chip vs. 2 chips; overhead reaches roughly 250%]

Page 32: False Sharing Summary

[Chart: "Time vs. Length" — % overhead vs. array length (0 to 1024), with overhead up to roughly 1800%]

• Effect is worse with smaller features
• Effect is worse with more threads

Page 33: Summary

• Shared memory programming
  • Only works on a single-address-space node
• Dynamic constructs
  • Good for load balancing
• Concerns:
  • False sharing
  • Critical regions