COMP 322 Lecture 3, 1 September 2009
COMP 322: Principles of Parallel Programming
Lecture 3: Reasoning about Performance (Chapter 3)
Fall 2009 http://www.cs.rice.edu/~vsarkar/comp322
Vivek Sarkar Department of Computer Science
Rice University [email protected]
COMP 322, Fall 2009 (V.Sarkar) 2
Summary of Last Lecture
• Parallel Algorithms for:
  — Prefix Sum: T1 = O(N), TN = O(log N)
  — Quicksort: T1 = O(N log N), TN = O(log² N)
• Upper and lower bounds for greedy schedulers
  — max(T1/P, T∞) ≤ TP ≤ T1/P + T∞
• Amdahl’s Law
  — Speedup(P) = T1/TP ≤ P / (fPAR + P * fSEQ)
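Amdahl’s bound can be checked numerically; the sketch below (a Python illustration, with the 10% sequential fraction and 100 processors as hypothetical inputs) evaluates Speedup(P) ≤ P / (fPAR + P * fSEQ):

```python
def amdahl_speedup_bound(f_seq, p):
    # Speedup(P) <= 1 / (f_seq + f_par/P) = P / (f_par + P * f_seq)
    f_par = 1.0 - f_seq
    return p / (f_par + p * f_seq)

# With 10% sequential work, 100 processors yield a bound of only ~9.2x.
bound = amdahl_speedup_bound(0.1, 100)
assert abs(bound - 100 / 10.9) < 1e-9
```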
Acknowledgments for Todayʼs Lecture
• “Scaling to Petascale: Concepts & Beyond”, Thomas Sterling, LSU, August 3, 2009
• “CS380P: Parallel Systems”, course lectures by Prof. Calvin Lin, UT Austin, Spring 2009 — http://www.cs.utexas.edu/users/lin/cs380p/
• Course text: “Principles of Parallel Programming”, Calvin Lin & Lawrence Snyder
• “Introduction to Parallel Computing”, 2nd Edition, Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar, Addison-Wesley, 2003
• COMP 422 lectures, Spring 2008 — http://www.cs.rice.edu/~vsarkar/comp422
• COMP 515 lectures, Spring 2009 — http://www.cs.rice.edu/~vsarkar/comp515
Example 3: Parallelizing QuickSort

procedure QUICKSORT(S) {
  if S contains at most one element then
    return S
  else {
    choose an element a randomly from S;
    // Opportunity 1: Parallel Partition
    let S1, S2 and S3 be the sequences of elements in S
      less than, equal to, and greater than a, respectively;
    // Opportunity 2: Parallel Calls
    return (QUICKSORT(S1) followed by S2 followed by QUICKSORT(S3))
  } // else
} // procedure
Approach 3: parallel partition, parallel calls
Depth = O(lg n) levels, and each level takes O(lg n) parallel time, so the overall span is O(lg² n)
Non-Parallelizable Code
• Non-parallelizable code = code that is inherently sequential or limited to a small degree of parallelism
• Sources of non-parallelizable code include:
  — Fraction of sequential code (Amdahl’s Law)
  — Dependences in the computation graph
  — Large critical path length, T∞
• Mitigate by removing dependences
  — So as to reduce critical path length, T∞
Data Dependences (pg. 68)
• Simple example of a data dependence:
  S1: PI = 3.14
  S2: R = 5.0
  S3: AREA = PI * R ** 2
• Statement S3 cannot be executed in parallel with either S1 or S2 without compromising correct results
Classification of Data Dependences
• Formally: there is a data dependence from statement S1 to statement S2 (S2 depends on S1) if:
  1. Both statements access the same memory location and at least one of them stores onto it, and
  2. There is a feasible run-time execution path from S1 to S2
• True dependence (read-after-write)
  — S2 reads and S1 writes
• Anti dependence (write-after-read)
  — S2 writes and S1 reads
• Output dependence (write-after-write)
  — S2 writes and S1 writes
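The three classes can be seen in a few lines of straight-line code (a Python sketch; the variable names are illustrative):

```python
# True (flow) dependence, read-after-write: S2 reads what S1 wrote.
a = 1           # S1 writes a
b = a + 1       # S2 reads a  -> S2 must run after S1

# Anti dependence, write-after-read: S2 overwrites what S1 read.
c = b + 1       # S1 reads b
b = 10          # S2 writes b -> swapping S1 and S2 would change c

# Output dependence, write-after-write: both statements write d.
d = 2           # S1 writes d
d = 3           # S2 writes d -> the final value depends on their order

assert (a, b, c, d) == (1, 10, 3, 3)
```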
Removing False (Anti / Output) Dependences by Renaming
Before renaming:
  1. sum = a + 1;
  2. first_term = sum * scale1;
  3. sum = b + 1;
  4. second_term = sum * scale2;

After renaming sum to first_sum and second_sum:
  1. first_sum = a + 1;
  2. first_term = first_sum * scale1;
  3. second_sum = b + 1;
  4. second_term = second_sum * scale2;
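A minimal sketch of why renaming helps: after renaming, the two statement chains share no variables, so they can run on separate threads (Python illustration with hypothetical values a = 2, b = 3, scale1 = 10, scale2 = 100):

```python
from concurrent.futures import ThreadPoolExecutor

a, b = 2.0, 3.0                 # hypothetical inputs
scale1, scale2 = 10.0, 100.0

def chain1():                   # statements 1-2: touches only first_sum
    first_sum = a + 1
    return first_sum * scale1

def chain2():                   # statements 3-4: touches only second_sum
    second_sum = b + 1
    return second_sum * scale2

with ThreadPoolExecutor(max_workers=2) as pool:
    f1, f2 = pool.submit(chain1), pool.submit(chain2)
    first_term, second_term = f1.result(), f2.result()

assert (first_term, second_term) == (30.0, 400.0)
```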
Latency and Throughput (pp. 62 – 63)
• Latency = amount of time it takes to complete a given unit of work
• Throughput = amount of work that can be completed per unit time
  — Throughput is also referred to as bandwidth, especially when the work involves data transfer
• Little’s Law
  — A system must provide Parallelism ≥ Latency * Throughput to fully utilize the available throughput (bandwidth)
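Little’s Law is a one-line computation; the numbers below (a 100 ns / 4-requests-per-ns memory system) are hypothetical stand-ins:

```python
def min_parallelism(latency, throughput):
    # Little's Law: concurrency that must be in flight to keep a system
    # of the given latency running at the given throughput.
    return latency * throughput

# Hypothetical memory system: 100 ns access latency at 4 requests per ns
# needs at least 400 outstanding requests to stay fully utilized.
assert min_parallelism(100, 4) == 400
# The laundry example: 1.5 h latency * 1.5 loads/hour = 2.25 in flight.
assert min_parallelism(1.5, 1.5) == 2.25
```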
Bandwidth vs. Latency in a Pipeline
• Dave Patterson’s laundry example: 4 people doing laundry
  — Each load: wash (30 min) + dry (40 min) + fold (20 min) = 90 min
  [Figure: Gantt chart of loads A–D pipelined along the time axis from 6 PM to 9:30 PM; stage durations 30, 40, 40, 40, 40, 20 min]
• In this example:
  — Sequential execution takes 4 * 90 min = 6 hours
  — Pipelined execution takes 30 + 4*40 + 20 min = 3.5 hours
• Bandwidth = loads/hour
  — BW = 4/6 loads/hour without pipelining
  — BW = 4/3.5 loads/hour with pipelining
  — BW → 1.5 loads/hour with pipelining, as total loads → ∞
• Pipelining helps bandwidth but not latency (still 90 min per load)
• Bandwidth is limited by the slowest pipeline stage (40 min)
• Little’s Law: need at least 3 stages (> 1.5 * 1.5 = 2.25)
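The timings above follow from a simple formula: a pipeline finishes one pass through all stages, then emits one task per slowest-stage interval. A Python sketch with the laundry numbers:

```python
def pipelined_time(stage_times, n_tasks):
    # Fill the pipeline once (sum of all stages), then the slowest
    # stage paces each additional task.
    return sum(stage_times) + (n_tasks - 1) * max(stage_times)

stages = [30, 40, 20]                  # wash, dry, fold (minutes)
sequential = 4 * sum(stages)           # 4 * 90 = 360 min = 6 hours
pipelined = pipelined_time(stages, 4)  # 90 + 3 * 40 = 210 min = 3.5 hours
assert (sequential, pipelined) == (360, 210)

# Asymptotic bandwidth is set by the slowest stage: 60/40 = 1.5 loads/hour.
assert 60 / max(stages) == 1.5
```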
Sources of Performance Loss in Real Parallel Machines (pp. 64 – 66)
• Our discussion of computation graphs and parallel algorithms thus far assumed ideal abstract parallel machines — reasoning about ideal execution time helps remove algorithmic dependences that can lead to insufficient parallelism and processor starvation
• However, real parallel machines exhibit additional sources of performance loss due to latency, contention, and overhead
• These can be mitigated by paying attention to the three dimensions in the “DIG” acronym when mapping from ideal parallelism to useful parallelism:
  1. Increase Data locality in all computations
     – Ideal machine models assume that a memory access is a constant-time operation; however, the latency of a memory access can vary by multiple orders of magnitude in real machines. Increasing data locality helps bridge this gap.
  2. Decrease load Imbalance across computation, memory, & communication resources
     – Ideal machine models ignore sources of contention in real machines, e.g., from two virtual channels mapped to the same physical link or two variables mapped to the same memory module. Redistributing requests across physical resources helps bridge this gap.
  3. Increase Granularity of computation and communication
     – Ideal machine models ignore the large overheads involved in scheduling tasks and communicating data. Increasing the granularity of tasks and data transfers helps bridge this gap.
The Memory Wall
[Figure: the ratio of memory access time to CPU cycle time growing over the years — “the wall”]
SMP Node Diagram
[Figure: block diagram of an SMP node — microprocessors (MP), each with private L1 and L2 caches, sharing L3 caches; a controller connects them to memory banks M1 … Mn-1, storage (S), and NICs for Ethernet and PCI-e, plus USB peripherals and JTAG]
Legend: MP = MicroProcessor; L1, L2, L3 = caches; M1… = memory banks; S = storage; NIC = Network Interface Card
Levels of the Memory Hierarchy
Level        | Capacity         | Access Time            | Cost           | Staging Xfer Unit        | Managed by
-------------|------------------|------------------------|----------------|--------------------------|---------------
Registers    | 100s bytes       | < 1 ns                 |                | Instr. operands, 1-8 B   | prog./compiler
Cache        | 10s-100s KBytes  | 1-10 ns                | $10/MByte      | Blocks, 8-128 B          | cache cntl
Main Memory  | MBytes           | 100-300 ns             | $1/MByte       | Pages, 512-4K B          | OS
Disk         | 10s GBytes       | 10 ms (10,000,000 ns)  | $0.0031/MByte  | Files, MBytes            | user/operator
Tape         | infinite         | sec-min                | $0.0014/MByte  |                          |

(Upper levels are faster; lower levels are larger.)
Copyright 2001, UCB, David Patterson
Cache Performance
T = total execution time
Tcycle = time for a single processor cycle
Icount = total number of instructions
IALU = number of ALU instructions (e.g., register-register)
IMEM = number of memory access instructions (e.g., load, store)
CPI = average cycles per instruction
CPIALU = average cycles per ALU instruction
CPIMEM = average cycles per memory instruction
rmiss = cache miss rate
rhit = cache hit rate
CPIMEM-MISS = cycles per cache miss
CPIMEM-HIT = cycles per cache hit
MALU = instruction mix for ALU instructions
MMEM = instruction mix for memory access instructions
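These quantities combine in the standard average-CPI model (a sketch; the slide’s own formula image is not reproduced here): CPIMEM = rhit * CPIMEM-HIT + rmiss * CPIMEM-MISS, CPI = MALU * CPIALU + MMEM * CPIMEM, and T = Icount * CPI * Tcycle. In Python, with hypothetical parameter values:

```python
def exec_time(i_count, m_alu, cpi_alu, m_mem, cpi_hit, cpi_miss, r_miss, t_cycle):
    # Average cycles per memory instruction, weighted by hit/miss rates.
    cpi_mem = (1.0 - r_miss) * cpi_hit + r_miss * cpi_miss
    # Overall CPI is the instruction-mix weighted average.
    cpi = m_alu * cpi_alu + m_mem * cpi_mem
    return i_count * cpi * t_cycle

# Hypothetical machine: 10^9 instructions, 70% ALU at 1 cycle, 30% memory,
# 1-cycle hits, 100-cycle misses, 2% miss rate, 1 ns cycle time.
t = exec_time(1e9, 0.7, 1.0, 0.3, 1.0, 100.0, 0.02, 1e-9)
assert abs(t - 1.594) < 1e-6   # even a 2% miss rate dominates CPI_MEM
```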
Cache Performance: Example
Performance: Locality
• Temporal locality: if a program accesses a memory location, there is a much higher than random probability that the same location will be accessed again soon.
• Spatial locality: if a program accesses a memory location, there is a much higher than random probability that nearby locations will be accessed soon.
• A couple of key factors affect the relationship between locality and scheduling:
  — Size of the dataset being processed by each processor
  — How much reuse is present in the code processing a chunk of iterations
  These factors were ignored in our idealized model of computation-graph scheduling.
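Spatial locality can be made concrete with a toy direct-mapped cache model (a Python sketch; the cache geometry and access patterns are hypothetical):

```python
def hit_rate(addresses, n_lines=64, line_bytes=64):
    # Toy direct-mapped cache: one tag per line, no associativity.
    cache = [None] * n_lines
    hits = 0
    for addr in addresses:
        line = addr // line_bytes       # which cache line the byte maps to
        idx = line % n_lines
        if cache[idx] == line:
            hits += 1
        else:
            cache[idx] = line           # miss: fill the line
    return hits / len(addresses)

# Sequential byte accesses (spatial locality): 63 of every 64 accesses hit.
seq = hit_rate(range(4096))
# Accesses strided by the line size touch a new line every time: no hits.
strided = hit_rate(range(0, 4096 * 64, 64))
assert seq > 0.9 and strided == 0.0
```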
Idleness
• Idleness = state of a processor when it cannot find any useful work to execute
• Sources of idleness include:
  — Load imbalance that prevents the workload from being “infinitely divisible”
    – Mitigate by reducing load imbalance
  — Memory-bound computations
    – Mitigate by increasing locality
    – Mitigate by overlapping computation with memory latency
Contention
• Contention = degradation of system performance caused by competition for a shared resource
  — Impact increases with the number of processors
  — The shared resource is often called a serialization bottleneck
• Sources of contention include:
  — Acquiring and releasing a single lock (lock contention)
  — Acquiring and releasing a single cache line (cache contention)
Cache Coherence on Bus-Based Machines (pp. 34 – 36)
Prefetching and Multithreading Approaches for Hiding Memory Latency
• Consider the problem of browsing the web on a very slow network connection. We can deal with the problem in one of two ways:
  — anticipate which pages we are going to browse ahead of time and issue requests for them in advance; or
  — open multiple browsers and access a different page in each browser, so that while we are waiting for one page to load, we can be reading others
• The first approach is called prefetching; the second, multithreading
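The multithreading option can be sketched with Python threads and a simulated page-load latency (the 0.2 s delay and page names are stand-ins):

```python
import threading
import time

def fetch(page, out):
    time.sleep(0.2)                     # simulated network latency
    out[page] = "contents of " + page

pages, results = ["a", "b", "c", "d"], {}
start = time.monotonic()
threads = [threading.Thread(target=fetch, args=(p, results)) for p in pages]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.monotonic() - start

# The four 0.2 s latencies overlap: wall-clock time is ~0.2 s, not 0.8 s.
assert len(results) == 4 and elapsed < 0.6
```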
Overhead
• Overhead = any cost that gets added to a sequential computation so as to enable it to run in parallel
• Sources of overhead include:
  — Communication: can be explicit via messages, or implicit via a memory hierarchy (caches), e.g., transmission delay, data marshalling & demarshalling
  — Synchronization: extra processing to ensure that dependences in the computation graph are satisfied
  — Computation: extra work added to obtain a parallel algorithm
  — Memory: extra memory used to obtain a parallel algorithm
  — Task creation and termination: extra processing performed at the start and end of each task, e.g., a forall iteration or an async statement
• For simplicity, we assume that all overhead can be executed in parallel
Overhead --- Mitigate by Increasing Task Granularity
Legend: v = overhead; w = work unit; W = total work; Ti = execution time with i processors; P = # processors; S = speedup

Assumption: the workload is infinitely divisible
Implication: for a given P, task granularity w = W/P > v is a necessary condition to obtain speedup S > P/2
Larger overhead v ⇒ larger task granularity w is needed
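Under the model TP = v + W/P (each processor pays overhead v on its share of the work), the condition w = W/P > v is exactly what guarantees S > P/2; a quick Python check with hypothetical numbers:

```python
def speedup(W, v, P):
    # Workload W split evenly over P processors, each paying overhead v:
    # T_P = v + W/P, so S = T_1 / T_P = W / (v + W/P).
    return W / (v + W / P)

P, v = 8, 10.0
assert speedup(P * v * 4, v, P) > P / 2    # w = W/P = 4v  -> S above P/2
assert speedup(P * v * 0.5, v, P) < P / 2  # w = W/P = v/2 -> S below P/2
```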
Scalable Performance: Key Terms and Concepts
• Scalable Speedup: Relative reduction of execution time of a fixed size workload through parallel execution (ideally = N, but is < N in practice)
• Scalable Efficiency : Ratio of the actual performance to the best possible performance (ideally = 1, but is < 1 in practice)
Efficiency = execution_time_on_one_processor / (execution_time_on_multiple_processors × number_of_processors)
Example with Overhead (pg. 82)
• Consider a problem with sequential execution time TS that incurs 0.2 TS overhead per processor when executed in parallel
• Therefore:
  — T1 = TS
  — T2 = TS/2 + 0.2 TS = 0.7 TS
  — T10 = TS/10 + 0.2 TS = 0.3 TS
  — T100 = TS/100 + 0.2 TS = 0.21 TS
• Speedup(1) = 1, Efficiency(1) = 1
• Speedup(2) = 1/0.7 = 1.43, Efficiency(2) = 1.43/2 = 0.71
• Speedup(10) = 1/0.3 = 3.33, Efficiency(10) = 3.33/10 = 0.33
• Speedup(100) = 1/0.21 = 4.76, Efficiency(100) = 4.76/100 = 0.048
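The arithmetic above can be verified directly (a Python sketch, normalizing TS = 1):

```python
TS = 1.0   # normalize the sequential time

def T(p):
    # Parallel time under the slide's 0.2*TS per-processor overhead model.
    return TS if p == 1 else TS / p + 0.2 * TS

for p in (1, 2, 10, 100):
    s = TS / T(p)
    print(p, round(s, 2), round(s / p, 3))   # P, speedup, efficiency
```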
Another Example with Overhead
• Consider a problem with sequential execution time TS(N), a function of input size N, and assume a fixed overhead of TOVHD per processor when executed in parallel
• Therefore:
  — T1(N) = TS(N)
  — TP(N) = TS(N)/P + TOVHD, for P > 1
  — Speedup(P) = T1(N) / TP(N) = TS(N) / (TS(N)/P + TOVHD)
  — Efficiency(P) = Speedup(P) / P = TS(N) / (TS(N) + P * TOVHD)
• Half-performance metric
  — N1/2 = input size that achieves Efficiency(P) = 0.5 for a given P
  — A larger value of N1/2 indicates that the problem is harder to parallelize efficiently
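Setting Efficiency(P) = 0.5 in the formula above gives TS(N1/2) = P * TOVHD; a Python check with hypothetical P and TOVHD values:

```python
def efficiency(ts, p, t_ovhd):
    # Efficiency(P) = TS(N) / (TS(N) + P * T_OVHD)
    return ts / (ts + p * t_ovhd)

# For a linear-time problem TS(N) = N, Efficiency(P) = 0.5 exactly when
# TS(N) = P * T_OVHD, so N_1/2 = P * T_OVHD (P and T_OVHD hypothetical).
P, T_OVHD = 16, 100.0
n_half = P * T_OVHD
assert efficiency(n_half, P, T_OVHD) == 0.5
```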
Strong Scaling, Weak Scaling
[Figure: total problem size and granularity (size per node) vs. machine scale (# of nodes) — strong scaling holds the total problem size fixed as nodes are added, so granularity per node shrinks; weak scaling holds the granularity per node fixed, so the total problem size grows]
Summary of Todayʼs Lecture
Key concepts:
• Latency
• Throughput / Bandwidth
• Little’s Law
• DIG acronym
  1. Increase Data locality in all computations
     – Addresses idleness arising from large memory access latencies
  2. Decrease load Imbalance across computation, memory, & communication resources
     – Addresses contention in physical resources
  3. Increase Granularity of computation and communication
     – Addresses overheads in real machines
HOMEWORK #1 (Written Assignment)
1. Exercise 6, Chapter 3, page 85
2. Analyze the speedup, efficiency, and half-performance metric of Parallel Quicksort as a function of N and P.

Due Date: In class on Thursday, Sep 10th