TRANSCRIPT
Copyright 2005 Intel Corporation 1
Multithreaded Programming
Arch D. Robison, Intel Corporation
Overall Goal
• Acquire a mental model of how parallel computers work
• Learn basic guidelines for designing parallel programs
• Know what you need to look up later
• Emphasis is on multithreaded shared-memory programs, but most of the high-level concepts apply to any parallel program.
Outline
• Episode I: Parallelism 101
• Episode II: Hardware
• Episode III: Decomposition
• Episode IV: Specification of parallelism and synchronization
• Episode V: Correctness
• Episode VI: Practical issues
Episode I
• Reasons to write parallel programs
• Kinds of parallelism
• Speedup
Reasons to Write Parallel Programs
• Functionality
• Throughput
• Turn-around
• Capacity
Functionality
• Sometimes the parallel version is easier to write.
  – Often the case in stream processing
  – expand | sed 's/  */ /' | fold -s
• Responsiveness of human interface
  – Decouple the human interface from the computing portions
    • Compute in background
  – Decouple portions of the user interface
    • One window should not block because another window’s dialog box is open.
Throughput
• Throughput = useful work per unit time
• A parallel program can improve throughput even on a single CPU
  – Example: hide I/O latency
Turn-Around
• Solve problem faster
  – Typically by applying more resources
  – Parallel solution is sometimes a better algorithm!
    • Parallel branch-and-bound may tighten the bound quicker, resulting in less wasted work.
• Examples
  – Practical weather prediction
    • A 24-hour forecast that takes 24 hours to compute is not useful.
  – Real time (e.g. video games, cybernetics)
    • Must meet a deadline.
Capacity
• Big parallel processors have huge memory capacity
  – Can solve problem “in core” without having to resort to slow “out of core” techniques

RAM on Top 5 Supercomputers in June 2005:
  – Blue Gene/L: 32 Terabytes?
  – BGW (IBM Thomas J. Watson Research Center): 20 Terabytes?
  – Columbia (NASA/Ames): 20 Terabytes
  – Earth Simulator (Japan): 10 Terabytes
  – MareNostrum (Barcelona): 9.6 Terabytes
Basic Parallel Programming Models
• Single Instruction Multiple Data (SIMD)
  – Single thread operating on long vector
• Multiple Instruction Multiple Data (MIMD)
  – Message passing
    • Each thread has its own address space
    • Threads communicate by sending and receiving messages
  – Shared memory
    • Threads share a common address space
    • Threads communicate by writing and reading shared memory locations
• Hybrid
  – Networks of shared-memory machines
  – Virtual shared memory across a cluster
• Totally implicit
  – E.g., applicative programming languages
Parallelism vs. Concurrency
• Concurrency = multiple threads in progress at the same time
  – Achievable on a single processor by interleaving
• Parallelism = multiple threads executing simultaneously
  – Requires multiple execution units
Reasons for Concurrency
• Throughput
  – Hide I/O latency.
  – Example: Let task A compute while task B waits to read the disk.
• Functionality
  – Can improve functionality (e.g. listen to a music file and surf the Web)
Reasons for Parallelism
• Functionality
  – Multithreaded CPU: avoids annoying pauses when a task hogs a physical thread.
• Throughput
  – Achieve a good frame rate in a video game: the CPU simulates the universe while the graphics processor renders.
• Turn-Around
  – Almost all champion chess programs.
• Capacity
  – Most large-scale scientific problems
Speedup
• Speedup measures improvement from parallelization
• Efficiency is relative to ideal speedup

  speedup(p) = (time for best serial version) / (time for version with p physical threads)

  efficiency(p) = speedup(p) / p
Typical Speedup Chart
[Chart: “Speedup of Parallel Foo”, speedup (relative to serial version) vs. physical threads (1-8). Geomean of 10 runs; blocksize = 16x16 pixels; HT-enabled Xeon with 8 TPUs, 512K cache. Three curves: IDEAL, NEW AND IMPROVED!, and BRAND X; the speedup axis runs from 0.5 to 3.5.]

Why does this chart violate our definition of speedup?
Use Best Serial Code As Baseline
• Speedup(1) > 1 in the example indicates that the original serial code was not the best serial code.
  – This commonly happens: while tuning for parallelism, serial improvements are found.
    • Example: “no alias” information useful for parallelizing can also help the compiler generate better serial code.
    • Example: reblocking loops to help parallel decomposition sometimes helps the serial code’s reuse of cache.
• Reference: “Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers”, http://www.sc2000.org/bell/twelve-ways.txt
Amdahl’s “Law”
• Suppose the program takes 1 unit of time to execute serially.
• Part of the program is inherently serial (unparallelizable).
  s = fraction of time spent in the serial portion
  1-s = remaining parallelizable time

  [ s | 1-s ]       time on a single processor
  [ s | (1-s)/p ]   time on a parallel processor (in an ideal world)
Amdahl’s Formulae

  speedup(p) ≤ 1 / (s + (1-s)/p)   (serial execution time over best possible parallel execution time)

  speedup(∞) ≤ 1/s

• Speedup is limited by the serial portion.
• The situation is worse in the real world, because the parallel portion usually has inefficiencies.
Diminishing Returns
• Suppose the program has a 20% serial portion

  speedup(2) ≤ 1 / (0.2 + 0.8/2) ≈ 1.67
  speedup(4) ≤ 1 / (0.2 + 0.8/4) = 2.5
  speedup(∞) ≤ 1 / 0.2 = 5

Doubling the number of processors from 2 to 4 got us only a 50% improvement!

This is depressing. Is there a way to evade Amdahl’s law?
Gustafson/Barsis’s Argument
• Scale up problem size N as the number of processors increases.
  – Assume that work in the parallel portion grows as N does
    • E.g., finer mesh and more particles for a gorier video game
  – The parallelizable portion now requires time N(1-s) to execute serially
• Keep the serial portion independent of problem size
  – This is often achievable, though it requires care.
  – E.g., must never loop over N in the serial portion.
  – E.g., if the amount of I/O depends on N, I/O must be parallel too.
• Net effect is to shrink the serial fraction as the number of processors grows.
Gustafson/Barsis’s Formulae

• Revised speedup (serial execution time over best possible parallel execution time):

  speedup(p) ≤ (s + N(1-s)) / (s + N(1-s)/p)

• Now set N = p:

  speedup(p) ≤ s + p(1-s)
Growing Returns
• Revisit the program with a 20% serial portion
• Assume that the parallel work scales up with N

  speedup(2) ≤ 0.2 + 2(1-0.2) = 1.8
  speedup(4) ≤ 0.2 + 4(1-0.2) = 3.4
  speedup(∞) ≤ 0.2 + ∞(1-0.2) = ∞

Doubling the number of processors and doubling the problem size now almost doubles speedup.

Reference: John Gustafson, “Reevaluating Amdahl’s Law”, CACM, May 1988, Volume 31.
Summary of Episode I
• Reasons to write parallel programs
  – functionality
  – throughput
  – turn-around
  – capacity
• Two flavors
  – shared memory
  – message passing
• Two laws
  – Amdahl: speedup limited by serial portion
  – Gustafson/Barsis: evade the limit by scaling up the problem
Episode II
• Approaches to single-chip parallelism
• Structure of processors and memory
Simple Programmer’s View
[Figure: “Mr. CPU” and his variables.]
• Memory as equally accessible boxes.
• Fast access to a handful of registers.
Pentium® 4 Processor
Pentium 4 is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries.
Latency vs. Throughput
• Latency = how long to execute an instruction.
• Throughput = instructions per unit time
• Example: Pentium® 4 integer multiply
  – latency = 15-18 clock cycles
  – throughput = one result per 5 cycles
Reference: Appendix C of IA-32 Intel Architecture Optimization Reference Manual, http://www.intel.com/design/pentium4/manuals/248966.htm
Instruction Level Parallelism
• SIMD
  – Instruction operates on a vector of data
  – 4-element vectors have become very popular
• Superscalar
  – Execute more than one instruction at once
  – No clear winner in the “in-order” vs. “out-of-order” debate
• In-order
  – Instructions executed in program order
  – Smaller, simpler, less power, but suffers from cache misses
• Out-of-order
  – Instructions executed out of order so that instructions waiting for data do not delay subsequent independent instructions
  – Bigger, more complex, almost quadratic power demand, but more tolerant of cache misses
Multithreaded CPUs
• Like an expert chess player playing against multiple opponents
  – Each board is the state of a thread
  – Move on to the next board while waiting for a response (hide latency)
  – Susan Polgar set a record of 326 concurrent games (309 wins!)
• Been around since the 1950s
  – http://www.cs.clemson.edu/~mark/multithreading.html
• Simultaneous Multithreading (SMT)
  – Feed execution units from multiple instruction streams
  – Good fit for an out-of-order machine
• Switch On Event Multithreading (SOEMT)
  – Switch to another instruction stream while waiting on an outer-level cache miss.
  – Good fit for an in-order machine
Multicore Chips
• Multiple CPUs in the same socket or die.
• Multiple cores on the same die has both advantages and disadvantages
  ☺ Enables communication at chip speed instead of slower board speed.
  – Lower yield from wafer: assuming a random distribution of failures, if fraction x of CPUs are good, then only x² of CPU-pairs are good.
Future Speed Improvements Will Mostly Be From Multithreading
• Improvements in the past have been for a single thread
  – Examples
    • Higher clock rate
    • Bigger functional units (e.g. full array multipliers)
    • Pipelining and out-of-order execution
  – These all consume superlinear transistors or power
    • Approximately quadratic
    • Cooling capacity is limited (cannot put a toaster in your lap)
• Improvements in the future are for multiple threads
  – Multiple cores and more threads per core
  – As long as the program scales significantly better than √p, we are ahead.
Memory
• Memory also has latency and throughput
  – Both are slower than CPU speeds
    • Latency is hundreds of clocks
    • Bandwidth can be 20x slower than the CPU
• Memory cannot be large, fast, and cheap
  – Hence the cache hierarchy
  – Some small, fast, expensive memory at the top
  – Lots of big, slow, cheap memory at the bottom
See the STREAM benchmark at http://www.cs.virginia.edu/stream/
Example: One Core of Pentium® D Processor
[Figure: cache hierarchy; bandwidth ∝ width of arrow.]
• register set
• 16 KB Level 1 Data Cache
• 12 KB Level 1 Instruction Cache
• 1 MB Level 2 Cache
• Main Memory (would be 5x21 feet if 1 GB were drawn to paper scale)
Off-Chip Memory Is Slow!
[Figure: HT0 and HT1 share an L1 cache, behind it an L2 cache, behind that memory; latency ∝ length of arrow, bandwidth ∝ width of arrow.]
Design your parallel programs to minimize off-chip communication.
• Intra-chip communication is good.
• Inter-chip communication is bad.
Shared Bus
[Figure: four CPUs and three memories attached to a single bus.]
☺ Uniform Memory Access (UMA)
  – Any processor can reach any memory in the same time
– All devices on the bus share the bus bandwidth
  – Uniformly slow for everyone
Point to Point
[Figure: four CPUs, each with its own memory, connected point to point.]
– Non-Uniform Memory Access (NUMA)
  – Processors can reach near memory faster than far memory.
☺ Bus bandwidth grows as nodes are added
  – Do not have to share bandwidth
All sorts of topologies are possible: meshes, tori, hypercubes, fat trees, ...
Cluster
• Each node in the cluster is a single- or multiprocessor box.
• Nodes are interconnected by:
  – Bus (e.g. antique Ethernet*)
  – Point to point (e.g. Scalable Coherent Interconnect*)
  – Switched fabric (e.g. InfiniBand*)
*Other names and brands may be claimed as the property of others.
Some Interconnect Metrics
• Bisection bandwidth
  – Adversary cuts the machine in half.
  – Measure the bandwidth between the halves.
• Latency
  – Near vs. far reference
  – For message passing, software adds an order of magnitude (or worse) to the hardware latency.
Shared Memory Is Really Message Passing
• Messages are cache lines/sectors
  – Typically around 32-128 bytes per line
  – Pentium® 4 Processor has sectored lines
    • 64-byte “sector” and 128-byte “line”
    • Writes invalidate sectors
    • Reads pull in lines
  – For a cache without sectors, a sector is the same as a line
• When a processor writes to a cache sector, the other processors have to be notified of the change.
Example: False Sharing
• Consider two processors without shared cache.• If two processors write different locations that
are on same cache sector, and at least one access is write, then cache sector must move between processors.
Cache Line Ping Pong
[Figure: three processors repeatedly executing x[0]++, x[1]++, and x[2]++ on elements that share a cache sector; each increment forces the sector to bounce to the writing processor.]
8 Threads on 4-Socket HT Machine
[Chart: thread i does "x[i*stride]++"; normalized time (log scale, roughly 1 to 1000; geomean of 3 runs) vs. stride from 0 to 32. Time drops sharply once the stride separates the elements onto distinct cache sectors.]

Pop quiz: Cache sector is 64 bytes. What is sizeof(x[0])?
Summary of Episode II
• Approaches to single-chip parallelism
  – SIMD
  – Superscalar
  – Multithreaded CPU
  – Multicore
• Most future improvement in chip performance will come from parallelism
• Off-chip memory is slow
  – Current machines depend on caches
  – Uniform vs. Non-Uniform Memory Access
  – Cache line/sector is the unit of information interchange