Module 1: Parallel Programming and Threads
TRANSCRIPT
Parallelism and Concurrency: System and Environment

Parallelism: exploit system resources to speed up computation.
Concurrency: respond quickly/properly to events from the environment or from other parts of the system.

[Diagram: events flow between the system and its environment]

Practical Parallel and Concurrent Programming, DRAFT: comments to [email protected], 04/21/23
Components and Parallelism

A component can use parallelism internally to improve performance.
Usually, clients need not be aware of internal parallelism.
Why would the interface change because of internal parallelism?

[Diagram: a client calls a component; internally the component runs m(0), m(1), …, m(N-1) in parallel between the call and the return]
Encapsulating Parallelism

A component can have a parallel implementation; it is an "implementation" detail whether or not there is internal parallelism.
The behavior of a parallel implementation should be the "same" as the sequential one, where the component specification defines "same".
Examples

Parallel parsing of HTML.
Parallel XML query processing.
Use of commands in Linux: applying the same command to multiple files.
Searching different Internet sites.
A Simple Microprocessor Model (~1985)

Single h/w thread.
Instructions execute one after the other.
Memory access time ~ clock cycle time.
ALU: arithmetic logic unit.

[Diagram: a processor core with one ALU executes an instruction stream 1, 2, 3, 4, 5, … fetched from main memory; over clocks 0 through 12 the completion times advance steadily, roughly one instruction every couple of cycles]
Fast-Forward Two Decades (circa 2005): Power-Hungry Superscalar with Caches

Multiple levels of cache: 2 cycles for L1, 20 cycles for L2, 200 cycles for main memory.
Dynamic out-of-order instruction execution.
Pipelined memory accesses.
Speculation: execute instructions before a branch is resolved.

[Diagram: a core with several ALUs, an L1 cache (64KB), an L2 cache (4MB), and main memory; instruction completion times range from a few cycles for cached work up to ~26 for a hit in L2 and ~204 for a main-memory access]
Power wall: we can't clock processors faster.
Memory wall: many workloads' performance is dominated by memory access times.
Instruction-level parallelism (ILP) wall: we can't find extra work to keep functional units busy while waiting for memory accesses.
Multi-core h/w: Common L2

[Diagram: two processor cores, each with its own ALUs, L1 cache, and instruction stream 1, 2, 3, 4, 5, …; both cores share one L2 cache in front of main memory]
Multi-core h/w: Additional L3

[Diagram: single-threaded cores, each with a private L1 cache and L2 cache; a shared L3 cache sits between the L2 caches and main memory]
SMP Multiprocessor

[Diagram: multiple single-threaded cores, each with private L1 and L2 caches, all sharing a single main memory]
NUMA Multiprocessor (non-uniform memory access)

[Diagram: four nodes connected by an interconnect; each node contains single-threaded cores with private L1 and L2 caches plus a local memory & directory; a core reaches a remote node's memory across the interconnect more slowly than its local memory]
Three Kinds of Parallel Hardware

Multi-threaded cores: increase utilization of a core or memory bandwidth; peak ops/cycle fixed.
Multiple cores: increase ops/cycle; don't necessarily scale caches and off-chip resources proportionately.
Multi-processor machines: increase ops/cycle; often scale cache and memory capacities and bandwidth proportionately.
Sequential Program

<state: nothing is known>
int sum = 5;
<state: sum = 5>
for (int i=0; i<5; i++) sum += i;
<state: i = 5, sum = 5+0+1+2+3+4 = 15>

Determinism: given a current program state and a code fragment, determine the next program state.
Termination: prove that a program terminates; usually depends on loop or procedure-recursion termination conditions.
Parallel Programs

Concurrent: non-deterministic (given state and code, the next state is ???); non-terminating.
Distributed: concurrent, but can survive partial failure of a thread or process.
Byzantine: distributed, but can survive partial failure at the worst time and in the worst way.
Program Representation

#include <stdio.h>
static int X = 5;
int main(int argc, char *argv[]) {
    printf("%d %s \n", argc, argv[0]);
    return 0;
}

Preprocessor: the #include line.
Allocated space and initialized at compile time: static int X.
Local variables: argc, argv.
String constant: "%d %s \n".
Program Representation (after compilation)

Object Module (.o, .obj)
  Code
  Uninitialized static data
  Initialized static data:
    X, 32 bits, 0x00000005
    String, 64 bits, "%d %s \n"
  Symbol Table:
    Defined: main
    Referenced: printf
Linking and Loading

Linker (can also create libraries)
  Combines multiple object modules into one.
  Satisfies any symbol references among the combined modules.
Loader
  Combines object modules and libraries into an executable file (a.out or .exe).
  All symbol references must be satisfied.
  Symbol table used by debuggers.
Dynamic linking
  Stops the program on a reference to an undefined symbol, finds the object in the file system, links, and continues.
Demand loading
  Symbol reference satisfied before execution, but the load is delayed.
Program Representation at Runtime

Same as in the object modules:
  Code
  Static data
Created at runtime:
  Procedure call frame stack
  Heap to support new/delete on dynamic variables
    & / ref: implicit pointer variable to data in the heap
    *: explicit pointer variable to data in the heap
Coroutine, State Vector

State vector: the hardware registers that must be saved when a thread loses control of a physical processor and restored when it regains control of a physical processor.
Coroutine: the data structure in which the saved state vector is kept.
Intel x86 State Vector

AX=0000 BX=0000 CX=0000
DX=0000 SI=0000 DI=0000
SP=FFEE  top-of-stack pointer
BP=0000  procedure call frame pointer
DS=0AD5  data segment pointer
SS=0AD5  stack segment pointer
CS=0AD5  code segment pointer
ES=0AD5
IP=0100  instruction pointer (next instruction to execute)
NV UP EI PL NZ NA PO NC  (processor status bits)

CS:IP      Code Bytes  Instruction
0AD5:0100  E8 FD 00    CALL 0200
C Procedure Call Frame
Command Line

./a.out apple do* pear

The shell expands command-line arguments:

./a.out apple donut doright pear
argc = 5
argv[0] = "/Users/bobcook/home/bin/a.out"
argv[1] = "apple"
argv[2] = "donut"
argv[3] = "doright"
argv[4] = "pear"
Apple Xcode Debugger

Macintosh-5:fact bobcook$ gcc -g main.c
Macintosh-5:fact bobcook$ gdb a.out
(gdb) list
1   #include <stdio.h>
2
3   int factorial(int n) {
4     if (n<2)
5       return 1;
6     return n*factorial(n-1);
7   }
8
9   int main(int argc, char *argv[]) {
10    printf("%d\n", factorial(atoi(argv[1])));

(gdb) set args 5
(gdb) b 5
Breakpoint 1 at 0x1f0a: file main.c, line 5.
(gdb) r
Starting program: /Users/bobcook/Desktop/fact/a.out 5
Reading symbols for shared libraries +. done
Breakpoint 1, factorial (n=1) at main.c:5
5       return 1;
(gdb) bt
#0  factorial (n=1) at main.c:5
#1  0x00001f1f in factorial (n=2) at main.c:6
#2  0x00001f1f in factorial (n=3) at main.c:6
#3  0x00001f1f in factorial (n=4) at main.c:6
#4  0x00001f1f in factorial (n=5) at main.c:6
#5  0x00001f52 in main (argc=2, argv=0xbffff840) at main.c:10
Context Block (user struct in UNIX)

Operating system information to define its virtual processor:
  Coroutine
  Code, data, stack, heap segments
  User id, group id, process id, parent id
  Resource usage information
  Scheduling information (priority)
Process

A program in execution.
Thread: an entity within a process that can be scheduled for execution; it has a coroutine, thread id, thread priority, thread-local storage, and a unique call stack.
All threads in a process share code, data, and heap.
#include <stdio.h>
#include <pthread.h>
#include <assert.h>
#include <unistd.h>

void *p(void *arg) {
    int i;
    for (i=0; i<5; i++) { printf("X\n"); sleep(1); }
    pthread_exit((void *)99);
}

int main() {  // X Y interleaving is unpredictable
    pthread_t x; void *r; int i;
    assert(pthread_create(&x, NULL, p, (void *)34) == 0);
    for (i=0; i<5; i++) { printf("Y\n"); sleep(1); }
    assert(pthread_join(x, &r) == 0);
    return 0;
}
Thread State Transitions
Multi-Thread Debugging

[Diagram: the debugger lists several thread IDs; each thread has its own call stack (Nth frame … 1st frame), and each frame has its own local variables]