•moore’s lawcs.iit.edu/~iraicu/teaching/cs595-f11/lecture08.pdf · 2011-09-27 · •moore’s...
TRANSCRIPT
![Page 1: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/1.jpg)
![Page 2: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/2.jpg)
• Moore’s Law
– The number of transistors that can be placed
inexpensively on an integrated circuit will
double approximately every 18 months.
– Self-fulfilling prophecy
• Computer architect goal
• Software developer assumption
2
![Page 3: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/3.jpg)
• Impediments to Moore’s Law
– Theoretical Limit
– What to do with all that die space?
– Design complexity
– How do you meet the expected performance
increase?
3
![Page 4: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/4.jpg)
• von Neumann model – Execute a stream of instructions (machine code) – Instructions can specify
• Arithmetic operations • Data addresses • Next instruction to execute
– Complexity • Track billions of data locations and millions of instructions • Manage with:
– Modular design – High-level programming languages
4
![Page 5: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/5.jpg)
• Parallelism
– Continue to increase performance via
parallelism.
5
0
50
100
150
200
250
300
2004 2006 2008 2010 2012 2014 2016 2018
Nu
mb
er
of
Co
res
0
10
20
30
40
50
60
70
80
90
100
Man
ufa
ctu
rin
g P
rocess
Number of Cores
Processing
![Page 6: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/6.jpg)
• From a software point-of-view, need to
solve demanding problems
– Engineering Simulations
– Scientific Applications
– Commercial Applications
• Need the performance, resource gains
afforded by parallelism
6
![Page 7: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/7.jpg)
• Engineering Simulations – Aerodynamics – Engine efficiency
7
![Page 8: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/8.jpg)
• Scientific Applications – Bioinformatics – Thermonuclear processes – Weather modeling
8
![Page 9: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/9.jpg)
• Commercial Applications – Financial transaction processing – Data mining – Web Indexing
9
![Page 10: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/10.jpg)
• Unfortunately, greatly increases coding
complexity
– Coordinating concurrent tasks
– Parallelizing algorithms
– Lack of standard environments and support
10
![Page 11: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/11.jpg)
• The challenge
– Provide the abstractions, programming
paradigms, and algorithms needed to
effectively design, implement, and maintain
applications that exploit the parallelism
provided by the underlying hardware in order
to solve modern problems.
11
![Page 12: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/12.jpg)
• Standard sequential architecture
CPU RAM
BUS
Bottlenecks
12
![Page 13: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/13.jpg)
• Use multiple
– Datapaths
– Memory units
– Processing units
13
![Page 14: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/14.jpg)
• SIMD
– Single instruction stream, multiple data
stream Processing
Unit
Control
Unit
Interco
nnect
Processing
Unit
Processing
Unit
Processing
Unit
Processing
Unit 14
![Page 15: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/15.jpg)
• SIMD
– Advantages
• Performs vector/matrix operations well
– EX: Intel’s MMX chip
– Disadvantages
• Too dependent on type of computation
– EX: Graphics
• Performance/resource utilization suffers if
computations aren’t “embarrasingly parallel”.
15
![Page 16: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/16.jpg)
• MIMD
– Multiple instruction stream, multiple data
stream Processing/Control
Unit
Processing/Control
Unit
Processing/Control
Unit
Processing/Control
Unit
Interco
nnect
16
![Page 17: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/17.jpg)
• MIMD – Advantages
• Can be built with off-the-shelf components • Better suited to irregular data access patterns
– Disadvantages • Requires more hardware (!sharing control unit) • Store program/OS at each processor
• Ex: Typical commodity SMP machines we see
today.
17
![Page 18: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/18.jpg)
• Task Communication
– Shared address space
• Use common memory to exchange data
• Communication and replication are implicit
– Message passing
• Use send()/receive() primitives to exchange data
• Communication and replication are explicit
18
![Page 19: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/19.jpg)
• Shared address space
– Uniform memory access (UMA)
• Access to a memory location is independent of
which processing unit makes the request.
– Non-uniform memory access (NUMA)
• Access to a memory location depends on the
location of the processing unit relative to the
memory accessed.
19
![Page 20: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/20.jpg)
• Message passing
– Each processing unit has its own private
memory
– Exchange of messages used to pass data
– APIs
• Message Passing Interface (MPI)
• Parallel Virtual Machine (PVM)
20
![Page 21: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/21.jpg)
• Algorithm
– a sequence of finite instructions, often used
for calculation and data processing.
• Parallel Algorithm
– An algorithm that which can be executed a
piece at a time on many different processing
devices, and then put back together again at
the end to get the correct result
21
![Page 22: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/22.jpg)
• Challenges
– Identifying work that can be done
concurrently.
– Mapping work to processing units.
– Distributing the work
– Managing access to shared data
– Synchronizing various stages of execution.
22
![Page 23: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/23.jpg)
• Models
– A way to structure a parallel algorithm by
selecting decomposition and mapping
techniques in a manner to minimize
interactions.
23
![Page 24: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/24.jpg)
• Models
– Data-parallel
– Task graph
– Work pool
– Master-slave
– Pipeline
– Hybrid
24
![Page 25: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/25.jpg)
• Data-parallel
– Mapping of Work
• Static
• Tasks -> Processes
– Mapping of Data
• Independent data items assigned to processes
(Data Parallelism)
25
![Page 26: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/26.jpg)
• Data-parallel – Computation
• Tasks process data, synchronize to get new data or exchange results, continue until all data processed
– Load Balancing • Uniform partitioning of data
– Synchronization • Minimal or barrier needed at end of a phase
– Examples • Ray Tracing
26
![Page 27: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/27.jpg)
• Data-parallel
P
P
P
P
P
D D
D D
D
O
O
O
O
O
27
![Page 28: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/28.jpg)
• Task graph
– Mapping of Work
• Static
• Tasks are mapped to nodes in a data dependency
task dependency graph (Task parallelism)
– Mapping of Data
• Data moves through graph (Source to Sink)
28
![Page 29: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/29.jpg)
• Task graph – Computation
• Each node processes input from previous node(s) and send output to next node(s) in the graph
– Load Balancing • Assign more processes to a given task • Eliminate graph bottlenecks
– Synchronization • Node data exchange
– Examples • Parallel Quicksort, Divide and Conquer approaches • Scientific Applications that can be expressed in workflows
(e.g. DAGs)
29
![Page 30: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/30.jpg)
• Task graph
P
P P
P
P
P
D
D
D
O
O
30
![Page 31: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/31.jpg)
• Work pool
– Mapping of Work/Data
• No desired pre-mapping
• Any task performed by any process
• Pull-model oriented
– Computation
• Processes work as data becomes available (or
requests arrive)
31
![Page 32: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/32.jpg)
• Work pool
– Load Balancing
• Dynamic mapping of tasks to processes
– Synchronization
• Adding/removing work from input queue
– Examples
• Web Server
• Bag-of-tasks
32
![Page 33: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/33.jpg)
• Work pool
P
Work Pool
P P
P P
Inp
ut q
ueu
e
Outp
ut q
ueu
e
33
![Page 34: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/34.jpg)
• Master-slave
– Modification to Worker Pool Model
• One or more Master processes generate and
assign work to worker processes\
• Push-model oriented
– Load Balancing
• A Master process can better distribute load to
worker processes
34
![Page 35: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/35.jpg)
• Pipeline
– Mapping of work
• Processes are assigned tasks that correspond to
stages in the pipeline
• Static
– Mapping of Data
• Data processed in FIFO order
– Stream parallelism
35
![Page 36: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/36.jpg)
• Pipeline – Computation
• Data is passed through a succession of processes, each of which will perform some task on it
– Load Balancing • Insure all stages of the pipeline are balanced
(contain the same amount of work)
– Synchronization • Producer/Consumer buffers between stages
– Ex: Processor pipeline, graphics pipeline
36
![Page 37: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/37.jpg)
• Pipeline
P
Input q
ueu
e
Outp
ut q
ueu
e
P P
buffer
buffer
37
![Page 38: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/38.jpg)
• Message-Passing
• Shared Address Space
38
![Page 39: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/39.jpg)
• Message-Passing
– Most widely used for programming parallel
computers (clusters of workstations)
– Key attributes:
• Partitioned address space
• Explicit parallelization
– Process interactions
• Send and receive data
39
![Page 40: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/40.jpg)
• Message-Passing – Communications
• Sending and receiving messages • Primitives
– send(buff, size, destination) – receive(buff, size, source) – Blocking vs non-blocking – Buffered vs non-buffered
• Message Passing Interface (MPI) – Popular message passing library – ~125 functions
40
![Page 41: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/41.jpg)
• Message-Passing
Workstation
Workstation
Workstation
Workstation
Data
P4 P3 P2 P1
send(buff1, 1024, p3) receive(buff3, 1024, p1)
41
![Page 42: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/42.jpg)
• Shared Address Space – Mostly used for programming SMP machines
(multicore chips) – Key attributes
• Shared address space – Threads – Shmget/shmat UNIX operations
• Implicit parallelization
– Process/Thread communication • Memory reads/stores
42
![Page 43: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/43.jpg)
• Shared Address Space – Communication
• Read/write memory – EX: x++;
– Posix Thread API • Popular thread API • Operations
– Creation/deletion of threads – Synchronization (mutexes, semaphores) – Thread management
43
![Page 44: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/44.jpg)
• Shared Address Space
Workstation
T3 T4 T2 T1
Data SMP RAM
44
![Page 45: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/45.jpg)
• Synchronization
– Deadlock
– Livelock
– Fairness
• Efficiency
– Maximize parallelism
• Reliability
– Correctness
– Debugging
45
![Page 46: •Moore’s Lawcs.iit.edu/~iraicu/teaching/CS595-F11/lecture08.pdf · 2011-09-27 · •Moore’s Law –The number of transistors that can be placed inexpensively on an integrated](https://reader030.vdocument.in/reader030/viewer/2022040814/5e5b91a90132db7a876b6bb1/html5/thumbnails/46.jpg)
46