arch d. robison intel corporationpolaris.cs.uiuc.edu/~padua/cs498dhp/parallelprogramming12.pdf ·...

43
Copyright 2005 Intel Corporation 1 Multithreaded Programming Arch D. Robison Intel Corporation

Upload: others

Post on 19-Jul-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 1

Multithreaded Programming

Arch D. RobisonIntel Corporation

Page 2: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 2

Overall Goal• Acquire mental model of how parallel computers work• Learn basic guidelines on designing parallel programs• Know what you need to look up later• Emphasis is on multithreaded shared-memory programs,

but most of the high-level concepts apply to any parallel program.

Page 3: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 3

Outline

Episode

Practical issuesVI

CorrectnessV

Specification of parallelism and synchronizationIV

DecompositionIII

HardwareII

Parallelism 101I

Page 4: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 4

Episode I

• Reasons to write parallel programs• Kinds of parallelism• Speedup

Page 5: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 5

Reasons to Write Parallel Programs

• Functionality• Throughput• Turn-around• Capacity

Page 6: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 6

Functionality

• Sometimes parallel version is easier to write.– Often the case in stream processing– expand | sed 's/ */ /' | fold –s

• Responsiveness of human interface– Decouple human interface from computing portions

• Compute in background

– Decouple portions of user interface• Not have window block because another window’s dialog

box is open.

Page 7: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 7

Throughput

• Thoughput = useful work per unit time• Parallel program can improve throughput even

on a single CPU– Example: hide I/O latency

Page 8: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 8

Turn-Around• Solve problem faster

– Typically by applying more resources– Parallel solution is sometimes better algorithm!

• Parallel branch-and-bound may tighten bound quicker, resulting in less wasted work.

• Examples– Practical weather prediction

• 24-hour forecast that takes 24 hours to compute is not useful.

– Real time (e.g. video games, cybernetics)• Must meet deadline.

Page 9: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 9

Capacity• Big parallel processors have huge memory capacity

– Can solve problem “in core” without having to resort to slow “out of core” techniques

RAM on Top 5 Supercomputers in June 2005

Barcelona

Japan

NSA/Ames

IBM Thomas J. Watson Research Center

NASA/Ames 32 Terabytes?Blue Gene/L

20 Terabytes?BGW

9.6 TerabytesMareNostrum

10 TerabytesEarth Simulator

20 TerabytesColumbia

Page 10: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 10

Basic Parallel Programming Models• Single Instruction Multiple Data (SIMD)

– Single thread operating on long vector• Multiple Instruction Multiple Data (MIMD)

– Message passing• Each thread has its own address space• Threads communicate by sending and receiving messages

– Shared memory• Threads share a common address space• Threads communicate by writing and reading shared memory

locations

• Hybrid– Networks of shared-memory machines– Virtual shared memory across a cluster

• Totally implicit– E.g., applicative programming languages

Page 11: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 11

Parallelism vs. Concurrency• Concurrency = multiple threads in progress at same time

– Achievable on a single processor by interleaving

• Parallelism = multiple threads executing simultaneously– Requires multiple execution units

Page 12: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 12

Reasons for Concurrency

• Throughput– Hide I/O latency. – Example: Let task A compute while task B waits to

read disk.• Functionality

– can improve functionality (e.g. listen to music file and surf Web)

Page 13: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 13

Reasons for Parallelism• Functionality

– Multithreaded CPU: avoids annoying pauses when a task hogs a physical thread.

• Throughput– Achieve good frame rate in video game: CPU

simulates universe while graphics processor renders. • Turn-Around

– Almost all champion chess programs.• Capacity

– Most large-scale scientific problems

Page 14: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 14

• Speedup measures improvement from parallelization

• Efficiency is relative to ideal speedup

Speedup

speedup(p) =time for best serial version

time for version with p physical threads

efficiency(p) =speedup

p

Page 15: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 15

Typical Speedup ChartSpeedup of Parallel Foo

Geomean of 10 runs [blocksize=16x16 pixels, HT-enabled: Xeon with 8 TPUs, 512K Cache ]

0.5

1.0

1.5

2.0

2.5

3.0

3.5

1 2 3 4 5 6 7 8Physical Threads

Spe

edup

(Rel

ativ

e to

Ser

ial V

ersi

on) IDEAL

NEW AND IMPROVED!

BRAND X

Why does this chart violate our definition of speedup?

Page 16: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 16

Use Best Serial Code As Baseline

• Speedup(1)>1 in example indicates that the original serial code was not the best serial code.– This commonly happens– While tuning for parallelism, serial improvements are found.

• Example: “no alias” information useful for parallelizing can also help compiler generate better serial code

• Example: reblocking loops to help parallel decomposition sometimes helps serial code’s reuse of cache.

•Reference: Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers http://www.sc2000.org/bell/twelve-ways.txt

Page 17: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 17

Amdahl’s ``Law’’• Suppose program takes 1 unit of time to execute serially.• Part of program is inherently serial (unparallelizable).

s = fraction of time spent in serial portion1-s = remaining parallelizable time

s 1-s

time on single processor

s (1-s)/p

time on parallel processor (in ideal world)

Page 18: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 18

Amdahl’s Formulae

speedup(p) ≤1

1−sp

+ s

speedup(∞) ≤ 1s

serial execution time

best possible parallel execution time

• Speedup limited by serial portion

• Situation is worse in real world, because parallel portion usually has inefficiencies.

Page 19: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 19

Diminishing Returns• Suppose program has 20% serial portion

speedup(2) ≤ 11−0.2

2+ 0.2

≈ 1.67

speedup(4) ≤ 11−0.2

4+ 0.2

= 2.5

speedup(∞) ≤ 11−0.2

∞+ 0.2

= 5

This is depressing. Is there a way to evade Amdahl’s law?

Doubling the number of processors from 2 to 4 of processors got us only 50% improvement!

Page 20: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 20

Gustafson/Barsis’s Argument• Scale up problem size N as number of processors

increases.– Assume that work in parallel portion grows as N does

• E.g., finer mesh and more particles for gorier video game– Parallelizable portion now requires time N(1-s) to execute serially

• Keep serial portion independent of problem size– This is often achievable, though it requires care.– E.g., must never loop over N in serial portion.– E.g., if amount of I/O depends on N, I/O must be parallel too.

• Net effect is to shrink serial fraction as number of processors grows.

Page 21: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 21

Gustafson/Barsis’s Formulae

• Revised speedup:

• Now set N=p:

speedup(p) ≤ s + p(1-s)

speedup(p) ≤s+N(1-s)

N(1−s)p

+ s

serial execution time

best possible parallel execution time

Page 22: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 22

Growing Returns• Revisit program with 20% serial portion• Assume that parallel work scales up with N

Reference: John Gustafson, “Reevaluatng Amdahl’s Law, CACM May 1988, Volume 31.

Doubling the number of processors and doubling the problem size now almost doubles speedup.

speedup(2) ≤ .2 + 2(1−0.2) = 1.8

speedup(4) ≤ .2 + 4(1−0.2) = 3.4

speedup(∞) ≤ .2+ ∞(1−0.2) = ∞

Page 23: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

23Copyright 2005 Intel Corporation

Summary of Episode I• Reasons to Write Parallel Programs

– functionality– throughput– turn-around– capacity

• Two flavors– shared memory– message passing

• Two laws– Amdahl: speedup limited by serial portion– Gustafson/Barsis: evade limit by scaling up problem

Page 24: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 24

Episode II

• Approaches to single-chip parallelism• Structure of processors and memory

Page 25: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 25

Simple Programmer’s ViewMr. CPU

variables

Memory as equally accessible boxes.

Fast access to handful of registers

Page 26: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 26

Pentium® 4 Processor

Pentium 4 is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries.

Page 27: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 27

Latency vs. Throughput

• Latency = how long to execute an instruction.• Throughput = instructions per unit time• Example: Pentium® 4 integer multiply

– latency = 15-18 clock cycles– throughput is one result per 5 cycles

Reference: Appendix C of IA-32 Intel Archtecture Optimization Reference Manual http://www.intel.com/design/pentium4/manuals/248966.htm

Page 28: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 28

Instruction Level Parallelism• SIMD

– Instruction operates on vector of data– 4-element vectors have become very popular

• Super scalar– Execute more than one instruction at once– No clear winner in “in-order” vs. “out-of-order” debate

• In-order– Instructions executed in program order– Smaller, simpler, less power, but suffers from cache misses

• Out-of-order– Instructions executed out of order so that instructions waiting for

data do no delay subsequent independent instructions– Bigger, more complex, almost quadratic power demand, but

more tolerant of cache misses

Page 29: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 29

Multithreaded CPUs• Like expert chess player playing against multiple

opponents– Each board is state of a thread– Move on to next board while waiting for response (hide latency)– Susan Polgar set record of 326 concurrent games (309 wins!)

• Been around since 1950s– http://www.cs.clemson.edu/~mark/multithreading.html

• Simultaneous Multithreading (SMT)– Feed execution units from multiple instruction streams– Good fit for out-of-order machine

• Switch On Event Multithreading (SOEMT)– Switch to another instruction stream while waiting on an outer-

level cache miss.– Good fit for in-order machine

Page 30: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 30

Multicore Chips

• Multiple CPUs in same socket or die.• Multiple cores on same die has both

advantages and disadvantages☺Enables communication at chip speed instead of

slower board speed.Lower yield from wafer. Assuming random distribution of failures, if fraction x of CPUs are good, then only x2 CPU-pairs are good.

Page 31: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 31

Future Speed Improvements Will Mostly Be From Multithreading

• Improvements in past have been for single thread– Examples

• Higher clock rate• Bigger functional units (e.g. full array multipliers)• Pipelining and out-of-order execution.

– These all consume superlinear transistors or power• Approximately quadratic• Cooling capacity limited (cannot put toaster in lap)

• Improvements in future are for multiple threads– Multiple cores and more threads per core– As long as program scales significantly better than ≥√p,

we are ahead.

Page 32: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 32

Memory

• Memory also has latency and throughput– Both are slower than CPU speeds

• Latency is hundreds of clocks• Bandwidth can be 20x slower than CPU

• Memory cannot be large, fast, and cheap– Hence cache hierarchy– Some small fast expensive memory at top– Lots of big slow cheap memory at bottom

See STREAM benchmark at http://www.cs.virginia.edu/stream/

Page 33: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 33

Example: One Core of Pentium® D Processor

16 KB Level 1 Data Cache

Main Memory

(Would be 5x21 feet if 1 GB were drawn to paper scale)

register set

12 KB Level 1 Instruction Cache

1 MB Level 2 Cache

Bandwidth ∝ width of arrow

Page 34: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 34

Off-Chip Memory Is Slow!

HT0 HT1L1 cache

L2 cache

memory

Latency ∝ length of arrow

Bandwidth ∝ width of arrow

Design your parallel programs to minimize off-chip communication.

Intrachip locality is good.

Interchip locality is bad.

Page 35: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 35

Shared Bus

☺ Uniform Memory Access (UMA)– Any processor can reach any memory in the same time

All devices on bus share the bus bandwidth– Uniformly slow for everyone

CPUCPUCPUCPU

memorymemorymemory

Page 36: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 36

Point to Point

Non-Uniform Memory Access (NUMA)– Processors can reach near memory faster than far memory.

☺ Bus bandwidth grows as nodes are added– Do not have to share bandwidth

CPUCPU

CPUCPU

memory

memory

memory

memory

All sorts of topologies possible

Mesh, toruses, hypercube, fat trees, ...

Page 37: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 37

Cluster

• Each node in cluster is single or multiprocessor box.• Nodes interconnected

– Bus (e.g. antique Ethernet*)– Point to point (e.g. Scalable Coherent Interconnect*)– Switched fabric (e.g. Infiniband*)

*Other names and brands may be claimed as the property of others.

Page 38: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 38

Some Interconnect Metrics

• Bisection bandwidth– Adversary cuts machine in half.– Measure bandwidth between halves.

• Latency– Near vs. far reference– For message passing, software adds order of

magnitude (or worse) to hardware latency.

Page 39: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 39

Shared Memory Is Really Message Passing

• Messages are cache lines/sectors– Typically around 32-128 bytes per line– Pentium® 4 Processor has sectored lines

• 64-byte “sector” and 128-byte “line”• Writes invalidate sectors• Reads pull in lines

– For cache without sectors, sector is same as line

• When a processor writes to a cache sector, the other processors have to be notified of the change.

Page 40: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 40

Example: False Sharing

• Consider two processors without shared cache.• If two processors write different locations that

are on same cache sector, and at least one access is write, then cache sector must move between processors.

Page 41: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 41

Cache Line Ping Pong

0 0

1 0

x[0]++

1 0

1 1

x[1]++

Processor #0

...x[2]++

...

...x[0]++

...

Processor #1 Processor #2

Page 42: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

Copyright 2005 Intel Corporation 42

8 Threads on 4-Socket HT MachineThread i does "x[i*stride]++"

1

10

100

1000

0 8 16 24 32

Stride

Norm

aliz

ed T

ime

(Log

Sca

le)

(Geo

mea

n of

3 R

uns)

Pop quiz: Cache sector is 64 bytes. What is sizeof(x[0])?

Page 43: Arch D. Robison Intel Corporationpolaris.cs.uiuc.edu/~padua/cs498DHP/parallelprogramming12.pdf · • E.g., finer mesh and more particles for gorier video game – Parallelizable

43Copyright 2005 Intel Corporation

Summary of Episode II• Approaches to single-chip parallelism

– SIMD– Super-scalar– Multithreaded CPU– Multicore

• Most of future improvement in chip performance will come from parallelism

• Off-chip memory is slow– Current machines depend on caches– Uniform vs. Non-Uniform Memory Access– Cache line/sector is unit of information interchange