Module 1 Parallel Programming And Threads

Upload: esther-dora-grant

Post on 18-Jan-2016


TRANSCRIPT

Page 1: Module 1 Parallel Programming and Threads

Module 1
Parallel Programming and Threads

Page 2: Module 1 Parallel Programming and Threads

Parallelism and Concurrency: System and Environment

Parallelism: exploit system resources to speed up computation.
Concurrency: respond quickly/properly to events from the environment or from other parts of the system.

[Diagram: the Environment sends Events to the System.]

Practical Parallel and Concurrent Programming. DRAFT: comments to [email protected]. 04/21/23

Page 3: Module 1 Parallel Programming and Threads

Components and Parallelism

A component can use parallelism internally to improve performance.
Usually, clients need not be aware of internal parallelism.
Why would the interface change because of internal parallelism?

[Diagram: a Client's call to, and return from, a component that internally fans out work to m(0), m(1), …, m(N-1).]

Page 4: Module 1 Parallel Programming and Threads

Encapsulating Parallelism

A component can have a parallel implementation: whether or not there is internal parallelism is an "implementation" detail.
The behavior of the parallel implementation should be the "same" as the sequential one, where the component specification defines "same".

Page 5: Module 1 Parallel Programming and Threads

Examples

- Parallel parsing of HTML
- Parallel XML query processing
- Use of commands in Linux: applying the same command to multiple files
- Searching different Internet sites

Page 6: Module 1 Parallel Programming and Threads

A simple microprocessor model, ~1985

- Single h/w thread
- Instructions execute one after the other
- Memory access time ~ clock cycle time

[Diagram: main memory feeds an instruction stream (1, 2, 3, 4, 5, …) to a processor core with one ALU; over clocks 0-12 the instructions complete at times 2, 4, 6, 9, 12.]

ALU: arithmetic logic unit

Page 7: Module 1 Parallel Programming and Threads

Fast-forward two decades (circa 2005): power-hungry superscalar with caches

- Multiple levels of cache: 2 cycles for L1, 20 cycles for L2, 200 cycles for memory
- Dynamic out-of-order instruction execution
- Pipelined memory accesses
- Speculation: execute instructions before a branch is resolved

[Diagram: instruction stream (1, 2, 3, 4, 5, …) into a core with several ALUs, an L1 cache (64KB), an L2 cache (4MB), and main memory; most instructions complete in 2 cycles, while one completes at 204 (main memory access) and another at 226 (hit in L2).]

Page 8: Module 1 Parallel Programming and Threads

[No transcript text; the slide is a figure.]

Page 9: Module 1 Parallel Programming and Threads

- Power wall: we can't clock processors faster
- Memory wall: many workloads' performance is dominated by memory access times
- Instruction-level parallelism (ILP) wall: we can't find extra work to keep functional units busy while waiting for memory accesses

Page 10: Module 1 Parallel Programming and Threads

Multi-core h/w: common L2

[Diagram: two cores, each with its own ALUs, instruction stream (1, 2, 3, 4, 5, …), and private L1 cache; both share one L2 cache and main memory.]

Page 11: Module 1 Parallel Programming and Threads

Multi-core h/w: additional L3

[Diagram: single-threaded cores, each with a private L1 cache and instruction stream (1, 2, 3, 4, 5, …); pairs share an L2 cache, and all cores share an L3 cache and main memory.]

Page 12: Module 1 Parallel Programming and Threads

SMP multiprocessor

[Diagram: single-threaded cores, each with its own instruction stream (1, 2, 3, 4, 5, …) and private L1 and L2 caches, all sharing one main memory.]

Page 13: Module 1 Parallel Programming and Threads

NUMA multiprocessor: non-uniform memory access

[Diagram: several nodes connected by an interconnect; each node contains single-threaded cores with private L1 caches, shared L2 caches, and its own memory & directory, so a core's access time depends on whether an address lives in local or remote memory.]

Page 14: Module 1 Parallel Programming and Threads

Three kinds of parallel hardware

- Multi-threaded cores: increase utilization of a core or memory bandwidth; peak ops/cycle fixed
- Multiple cores: increase ops/cycle; don't necessarily scale caches and off-chip resources proportionately
- Multi-processor machines: increase ops/cycle; often scale cache and memory capacities and bandwidth proportionately

Page 15: Module 1 Parallel Programming and Threads

Sequential Program

<state: nothing is known>
int sum = 5;
<state: int sum=5>
for (int i=0; i<5; i++) sum += i;
<state: int i=5, int sum=5+0+1+2+3+4=15>

Page 16: Module 1 Parallel Programming and Threads

Sequential Program

Determinism: given a current program state and a code fragment, determine the next program state.
Termination: prove that a program terminates; this usually depends on loop or procedure-recursion termination conditions.

Page 17: Module 1 Parallel Programming and Threads

Parallel Programs

- Concurrent: non-deterministic (given state and code, the next state is ???); possibly non-terminating
- Distributed: concurrent, but can survive partial failure of a thread or process
- Byzantine: distributed, but can survive partial failure at the worst time and in the worst way

Page 18: Module 1 Parallel Programming and Threads

Program Representation

#include <stdio.h>                     /* handled by the preprocessor */
static int X=5;                        /* space allocated and initialized at compile time */
int main(int argc, char *argv[]) {     /* argc, argv: local variables */
    printf("%d %s \n", argc, argv[0]); /* "%d %s \n": string constant */
    return 0;
}

Page 19: Module 1 Parallel Programming and Threads

Program Representation (after compilation)

Object Module (.o, .obj)
- Code
- Uninitialized static data
- Initialized static data: X, 32 bits, 0x00000005; String, 64 bits, "%d %s \n"
- Symbol Table: defined: main; referenced: printf

Page 20: Module 1 Parallel Programming and Threads

Linking and Loading

- Linker (can also create libraries): combines multiple object modules into one; satisfies any symbol references among the combined modules
- Loader: combines object modules and libraries into an executable file (a.out or .exe); all symbol references must be satisfied; the symbol table is used by debuggers
- Dynamic linking: stops the program on a reference to an undefined symbol, finds the object in the file system, links it, and continues
- Demand loading: symbol references are satisfied before execution, but the load is delayed

Page 21: Module 1 Parallel Programming and Threads

Program Representation at Runtime

Same as in the object modules: code, static data.
Created at runtime:
- Procedure call frame stack
- Heap, to support new/delete on dynamic variables
  - & (ref): implicit pointer variable to data in the heap
  - *: explicit pointer variable to data in the heap

Page 22: Module 1 Parallel Programming and Threads

Coroutine, State Vector

State vector: the hardware registers that must be saved when losing control of a physical processor and restored when gaining control of a physical processor.
Coroutine: the data structure holding the saved state vector.

Page 23: Module 1 Parallel Programming and Threads

Intel x86 State Vector

AX=0000  BX=0000  CX=0000  DX=0000  SI=0000  DI=0000
SP=FFEE     top-of-stack pointer
BP=0000     procedure call frame pointer
DS=0AD5     data segment pointer
SS=0AD5     stack segment pointer
CS=0AD5     code segment pointer
ES=0AD5
IP=0100     instruction pointer (next instruction to execute)
NV UP EI PL NZ NA PO NC     (processor status bits)

CS:IP          Code Bytes       Instruction
0AD5:0100      E8 FD 00         CALL 0200

Page 24: Module 1 Parallel Programming and Threads

C Procedure Call Frame

[Slide shows a diagram of a C procedure call frame.]

Page 25: Module 1 Parallel Programming and Threads

Command Line

./a.out apple do* pear

The shell expands command-line arguments:

./a.out apple donut doright pear
argc = 5
argv[0] = "/Users/bobcook/home/bin/a.out"
argv[1] = "apple"
argv[2] = "donut"
argv[3] = "doright"
argv[4] = "pear"

Page 26: Module 1 Parallel Programming and Threads

Apple Xcode Debugger

[Slide shows a screenshot of the Xcode debugger.]

Page 27: Module 1 Parallel Programming and Threads

Macintosh-5:fact bobcook$ gcc -g main.c
Macintosh-5:fact bobcook$ gdb a.out
(gdb) list
1   #include <stdio.h>
2
3   int factorial(int n) {
4     if (n<2)
5       return 1;
6     return n*factorial(n-1);
7   }
8
9   int main(int argc, char *argv[]) {
10    printf("%d\n", factorial(atoi(argv[1])));

Page 28: Module 1 Parallel Programming And Threads Parallel Programming And Threads

(gdb) set args 5(gdb) b 5Breakpoint 1 at 0x1f0a: file main.c, line 5.(gdb) rStarting program: /Users/bobcook/Desktop/fact/a.out 5Reading symbols for shared libraries +. doneBreakpoint 1, factorial (n=1) at main.c:55 return 1;(gdb) bt#0 factorial (n=1) at main.c:5#1 0x00001f1f in factorial (n=2) at main.c:6#2 0x00001f1f in factorial (n=3) at main.c:6#3 0x00001f1f in factorial (n=4) at main.c:6#4 0x00001f1f in factorial (n=5) at main.c:6#5 0x00001f52 in main (argc=2, argv=0xbffff840) at

main.c:10

(gdb) set args 5(gdb) b 5Breakpoint 1 at 0x1f0a: file main.c, line 5.(gdb) rStarting program: /Users/bobcook/Desktop/fact/a.out 5Reading symbols for shared libraries +. doneBreakpoint 1, factorial (n=1) at main.c:55 return 1;(gdb) bt#0 factorial (n=1) at main.c:5#1 0x00001f1f in factorial (n=2) at main.c:6#2 0x00001f1f in factorial (n=3) at main.c:6#3 0x00001f1f in factorial (n=4) at main.c:6#4 0x00001f1f in factorial (n=5) at main.c:6#5 0x00001f52 in main (argc=2, argv=0xbffff840) at

main.c:10

Page 29: Module 1 Parallel Programming and Threads

Context Block (user struct in UNIX)

Operating system information that defines a process's virtual processor:
- Coroutine
- Code, data, stack, heap segments
- User id, group id, process id, parent id
- Resource usage information
- Scheduling information (priority)

Page 30: Module 1 Parallel Programming and Threads

Process

- Process: a program in execution
- Thread: an entity within a process that can be scheduled for execution; has a coroutine, thread id, thread priority, thread-local storage, and a unique call stack
- All threads in a process share code, data, and heap

Page 31: Module 1 Parallel Programming and Threads

#include <stdio.h>
#include <pthread.h>
#include <assert.h>
#include <unistd.h>

void *p(void *arg) {
    int i;
    for (i=0; i<5; i++) { printf("X\n"); sleep(1); }
    pthread_exit((void *)99);
}

int main() {   // X Y interleaving is unpredictable
    pthread_t x; void *r; int i;
    assert(pthread_create(&x, NULL, p, (void *)34) == 0);
    for (i=0; i<5; i++) { printf("Y\n"); sleep(1); }
    assert(pthread_join(x, &r) == 0);
    return 0;
}

Page 32: Module 1 Parallel Programming and Threads

Thread State Transitions

[Slide shows a thread state-transition diagram.]

Page 33: Module 1 Parallel Programming and Threads

Multi-Thread Debugging

[Diagram: several thread IDs, each with its own call stack (Nth frame … 1st frame), and each frame with its own local variables.]