TRANSCRIPT
Copyright 2005 Intel Corporation 1
Multithreaded Programming
Arch D. Robison, Intel Corporation
Overall Goal
• Acquire a mental model of how parallel computers work
• Learn basic guidelines for designing parallel programs
• Know what you need to look up later
• Emphasis is on multithreaded shared-memory programs, but most of the high-level concepts apply to any parallel program.
Outline
• Episode I: Parallelism 101
• Episode II: Hardware
• Episode III: Decomposition
• Episode IV: Specification of parallelism and synchronization
• Episode V: Correctness
• Episode VI: Practical issues
Episode I
• Reasons to write parallel programs
• Kinds of parallelism
• Speedup
Reasons to Write Parallel Programs
• Functionality
• Throughput
• Turn-around
• Capacity
Functionality
• Sometimes the parallel version is easier to write.
  – Often the case in stream processing
  – expand | sed 's/  */ /' | fold -s
• Responsiveness of human interface
  – Decouple the human interface from the computing portions
    • Compute in background
  – Decouple portions of the user interface
    • One window should not block because another window’s dialog box is open.
Throughput
• Throughput = useful work per unit time
• A parallel program can improve throughput even on a single CPU
  – Example: hide I/O latency
Turn-Around
• Solve problem faster
  – Typically by applying more resources
  – Parallel solution is sometimes a better algorithm!
    • Parallel branch-and-bound may tighten the bound quicker, resulting in less wasted work.
• Examples
  – Practical weather prediction
    • A 24-hour forecast that takes 24 hours to compute is not useful.
  – Real time (e.g. video games, cybernetics)
    • Must meet a deadline.
Capacity
• Big parallel processors have huge memory capacity
  – Can solve problem “in core” without having to resort to slow “out of core” techniques

RAM on Top 5 Supercomputers in June 2005:
  – Blue Gene/L: 32 Terabytes?
  – BGW (IBM Thomas J. Watson Research Center): 20 Terabytes?
  – Columbia (NASA/Ames): 20 Terabytes
  – Earth Simulator (Japan): 10 Terabytes
  – MareNostrum (Barcelona): 9.6 Terabytes
Basic Parallel Programming Models
• Single Instruction Multiple Data (SIMD)
  – Single thread operating on long vector
• Multiple Instruction Multiple Data (MIMD)
  – Message passing
    • Each thread has its own address space
    • Threads communicate by sending and receiving messages
  – Shared memory
    • Threads share a common address space
    • Threads communicate by writing and reading shared memory locations
• Hybrid
  – Networks of shared-memory machines
  – Virtual shared memory across a cluster
• Totally implicit
  – E.g., applicative programming languages
Parallelism vs. Concurrency
• Concurrency = multiple threads in progress at the same time
  – Achievable on a single processor by interleaving
• Parallelism = multiple threads executing simultaneously
  – Requires multiple execution units
Reasons for Concurrency
• Throughput
  – Hide I/O latency.
  – Example: Let task A compute while task B waits to read the disk.
• Functionality
  – Can improve functionality (e.g. listen to a music file and surf the Web)
Reasons for Parallelism
• Functionality
  – Multithreaded CPU: avoids annoying pauses when a task hogs a physical thread.
• Throughput
  – Achieve a good frame rate in a video game: the CPU simulates the universe while the graphics processor renders.
• Turn-Around
  – Almost all champion chess programs.
• Capacity
  – Most large-scale scientific problems
Speedup
• Speedup measures improvement from parallelization
• Efficiency is relative to ideal speedup

  speedup(p) = (time for best serial version) / (time for version with p physical threads)

  efficiency(p) = speedup(p) / p
Typical Speedup Chart
[Chart: “Speedup of Parallel Foo”, speedup (relative to serial version) vs. physical threads (1-8). Geomean of 10 runs; blocksize = 16x16 pixels; HT-enabled Xeon with 8 TPUs, 512K cache. Three curves: IDEAL, NEW AND IMPROVED!, and BRAND X; the speedup axis runs from 0.5 to 3.5.]

Why does this chart violate our definition of speedup?
Use Best Serial Code As Baseline
• Speedup(1) > 1 in the example indicates that the original serial code was not the best serial code.
  – This commonly happens: while tuning for parallelism, serial improvements are found.
    • Example: “no alias” information useful for parallelizing can also help the compiler generate better serial code.
    • Example: reblocking loops to help parallel decomposition sometimes helps the serial code’s reuse of cache.
• Reference: “Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers”, http://www.sc2000.org/bell/twelve-ways.txt
Amdahl’s “Law”
• Suppose the program takes 1 unit of time to execute serially.
• Part of the program is inherently serial (unparallelizable).
  s = fraction of time spent in the serial portion
  1-s = remaining parallelizable time

  [ s | 1-s ]       time on a single processor
  [ s | (1-s)/p ]   time on a parallel processor (in an ideal world)
Amdahl’s Formulae

  speedup(p) ≤ 1 / (s + (1-s)/p)   (serial execution time over best possible parallel execution time)

  speedup(∞) ≤ 1/s

• Speedup is limited by the serial portion.
• The situation is worse in the real world, because the parallel portion usually has inefficiencies.
Diminishing Returns
• Suppose the program has a 20% serial portion

  speedup(2) ≤ 1 / (0.2 + 0.8/2) ≈ 1.67
  speedup(4) ≤ 1 / (0.2 + 0.8/4) = 2.5
  speedup(∞) ≤ 1 / 0.2 = 5

Doubling the number of processors from 2 to 4 got us only a 50% improvement!

This is depressing. Is there a way to evade Amdahl’s law?
Gustafson/Barsis’s Argument
• Scale up problem size N as the number of processors increases.
  – Assume that work in the parallel portion grows as N does
    • E.g., finer mesh and more particles for a gorier video game
  – The parallelizable portion now requires time N(1-s) to execute serially
• Keep the serial portion independent of problem size
  – This is often achievable, though it requires care.
  – E.g., must never loop over N in the serial portion.
  – E.g., if the amount of I/O depends on N, I/O must be parallel too.
• Net effect is to shrink the serial fraction as the number of processors grows.
Gustafson/Barsis’s Formulae

• Revised speedup (serial execution time over best possible parallel execution time):

  speedup(p) ≤ (s + N(1-s)) / (s + N(1-s)/p)

• Now set N = p:

  speedup(p) ≤ s + p(1-s)
Growing Returns
• Revisit the program with a 20% serial portion
• Assume that the parallel work scales up with N

  speedup(2) ≤ 0.2 + 2(1-0.2) = 1.8
  speedup(4) ≤ 0.2 + 4(1-0.2) = 3.4
  speedup(∞) ≤ 0.2 + ∞(1-0.2) = ∞

Doubling the number of processors and doubling the problem size now almost doubles speedup.

Reference: John Gustafson, “Reevaluating Amdahl’s Law”, CACM, May 1988, Volume 31.
Summary of Episode I
• Reasons to write parallel programs
  – functionality
  – throughput
  – turn-around
  – capacity
• Two flavors
  – shared memory
  – message passing
• Two laws
  – Amdahl: speedup limited by serial portion
  – Gustafson/Barsis: evade the limit by scaling up the problem
Episode II
• Approaches to single-chip parallelism
• Structure of processors and memory
Simple Programmer’s View
[Figure: “Mr. CPU” and his variables.]
• Memory as equally accessible boxes.
• Fast access to a handful of registers.
Pentium® 4 Processor
Pentium 4 is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries.
Latency vs. Throughput
• Latency = how long to execute an instruction.
• Throughput = instructions per unit time
• Example: Pentium® 4 integer multiply
  – latency = 15-18 clock cycles
  – throughput = one result per 5 cycles
Reference: Appendix C of IA-32 Intel Architecture Optimization Reference Manual, http://www.intel.com/design/pentium4/manuals/248966.htm
Instruction Level Parallelism
• SIMD
  – Instruction operates on a vector of data
  – 4-element vectors have become very popular
• Superscalar
  – Execute more than one instruction at once
  – No clear winner in the “in-order” vs. “out-of-order” debate
• In-order
  – Instructions executed in program order
  – Smaller, simpler, less power, but suffers from cache misses
• Out-of-order
  – Instructions executed out of order so that instructions waiting for data do not delay subsequent independent instructions
  – Bigger, more complex, almost quadratic power demand, but more tolerant of cache misses
Multithreaded CPUs
• Like an expert chess player playing against multiple opponents
  – Each board is the state of a thread
  – Move on to the next board while waiting for a response (hide latency)
  – Susan Polgar set a record of 326 concurrent games (309 wins!)
• Been around since the 1950s
  – http://www.cs.clemson.edu/~mark/multithreading.html
• Simultaneous Multithreading (SMT)
  – Feed execution units from multiple instruction streams
  – Good fit for an out-of-order machine
• Switch On Event Multithreading (SOEMT)
  – Switch to another instruction stream while waiting on an outer-level cache miss.
  – Good fit for an in-order machine
Multicore Chips
• Multiple CPUs in the same socket or die.
• Multiple cores on the same die has both advantages and disadvantages
  ☺ Enables communication at chip speed instead of slower board speed.
  – Lower yield from wafer: assuming a random distribution of failures, if fraction x of CPUs are good, then only x² of CPU-pairs are good.
Future Speed Improvements Will Mostly Be From Multithreading
• Improvements in the past have been for a single thread
  – Examples
    • Higher clock rate
    • Bigger functional units (e.g. full array multipliers)
    • Pipelining and out-of-order execution
  – These all consume superlinear transistors or power
    • Approximately quadratic
    • Cooling capacity is limited (cannot put a toaster in your lap)
• Improvements in the future are for multiple threads
  – Multiple cores and more threads per core
  – As long as the program scales significantly better than √p, we are ahead.
Memory
• Memory also has latency and throughput
  – Both are slower than CPU speeds
    • Latency is hundreds of clocks
    • Bandwidth can be 20x slower than the CPU
• Memory cannot be large, fast, and cheap
  – Hence the cache hierarchy
  – Some small, fast, expensive memory at the top
  – Lots of big, slow, cheap memory at the bottom
See the STREAM benchmark at http://www.cs.virginia.edu/stream/
Example: One Core of Pentium® D Processor
[Figure: cache hierarchy; bandwidth ∝ width of arrow.]
• register set
• 16 KB Level 1 Data Cache
• 12 KB Level 1 Instruction Cache
• 1 MB Level 2 Cache
• Main Memory (would be 5x21 feet if 1 GB were drawn to paper scale)
Off-Chip Memory Is Slow!
[Figure: HT0 and HT1 share an L1 cache, behind it an L2 cache, behind that memory; latency ∝ length of arrow, bandwidth ∝ width of arrow.]
Design your parallel programs to minimize off-chip communication.
• Intra-chip communication is good.
• Inter-chip communication is bad.
Shared Bus
[Figure: four CPUs and three memories attached to a single bus.]
☺ Uniform Memory Access (UMA)
  – Any processor can reach any memory in the same time
– All devices on the bus share the bus bandwidth
  – Uniformly slow for everyone
Point to Point
[Figure: four CPUs, each with its own memory, connected point to point.]
– Non-Uniform Memory Access (NUMA)
  – Processors can reach near memory faster than far memory.
☺ Bus bandwidth grows as nodes are added
  – Do not have to share bandwidth
All sorts of topologies are possible: meshes, tori, hypercubes, fat trees, ...
Cluster
• Each node in the cluster is a single- or multiprocessor box.
• Nodes are interconnected by:
  – Bus (e.g. antique Ethernet*)
  – Point to point (e.g. Scalable Coherent Interconnect*)
  – Switched fabric (e.g. InfiniBand*)
*Other names and brands may be claimed as the property of others.
Some Interconnect Metrics
• Bisection bandwidth
  – Adversary cuts the machine in half.
  – Measure the bandwidth between the halves.
• Latency
  – Near vs. far reference
  – For message passing, software adds an order of magnitude (or worse) to the hardware latency.
Shared Memory Is Really Message Passing
• Messages are cache lines/sectors
  – Typically around 32-128 bytes per line
  – Pentium® 4 Processor has sectored lines
    • 64-byte “sector” and 128-byte “line”
    • Writes invalidate sectors
    • Reads pull in lines
  – For a cache without sectors, a sector is the same as a line
• When a processor writes to a cache sector, the other processors have to be notified of the change.
Example: False Sharing
• Consider two processors without shared cache.• If two processors write different locations that
are on same cache sector, and at least one access is write, then cache sector must move between processors.
Cache Line Ping Pong
[Figure: three processors repeatedly executing x[0]++, x[1]++, and x[2]++ on elements that share a cache sector; each increment forces the sector to bounce to the writing processor.]
8 Threads on 4-Socket HT Machine
[Chart: thread i does "x[i*stride]++"; normalized time (log scale, roughly 1 to 1000; geomean of 3 runs) vs. stride from 0 to 32. Time drops sharply once the stride separates the elements onto distinct cache sectors.]

Pop quiz: Cache sector is 64 bytes. What is sizeof(x[0])?
Summary of Episode II
• Approaches to single-chip parallelism
  – SIMD
  – Superscalar
  – Multithreaded CPU
  – Multicore
• Most future improvement in chip performance will come from parallelism
• Off-chip memory is slow
  – Current machines depend on caches
  – Uniform vs. Non-Uniform Memory Access
  – Cache line/sector is the unit of information interchange