Page 1: Advanced Computing Techniques & Applications Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn

Advanced Computing Techniques & Applications

Dr. Bo Yuan

E-mail: yuanb@sz.tsinghua.edu.cn

Course Profile

• Lecturer: Dr. Bo Yuan
• Contact
– Phone: 2603 6067
– E-mail: yuanb@sz.tsinghua.edu.cn
– Room: F-301B
• Time: 10:25 am – 12:00 pm, Friday
• Venue: CI-107 & B-204 (Lab)
• Teaching Assistant
– Mr. Pengtao Huang
– [email protected]

We will study ...

• MPI
– Message Passing Interface
– API for distributed memory parallel computing (multiple processes)
– The dominant model used in cluster computing
• OpenMP
– Open Multi-Processing
– API for shared memory parallel computing (multiple threads)
• GPU Computing with CUDA
– Graphics Processing Unit
– Compute Unified Device Architecture
– API for shared memory parallel computing in C (multiple threads)
• Parallel Matlab
– A popular high-level technical computing language and interactive environment

Aims & Objectives

• Learning Objectives

– Understand the main issues and core techniques in parallel computing.

– Be able to develop MPI-based parallel programs.

– Be able to develop OpenMP-based parallel programs.

– Be able to develop GPU-based parallel programs.

– Be able to develop Matlab-based parallel programs.

• Graduate Attributes

– In-depth Knowledge of the Field of Study

– Effective Communication

– Independence and Teamwork

– Critical Judgment

Learning Activities

• Lecture (9)
– Introduction (3)
– MPI and OpenMP (3)
– GPU Computing (3)
• Practice (4)
– MPI (1)
– OpenMP (1)
– GPU Programming (1)
– Parallel Matlab (1)
• Others (3)
– Industry Tour (1)
– Presentation (1)
– Final Exam (1)

Learning Resources

• Books
– http://www.mcs.anl.gov/~itf/dbpp/
– https://computing.llnl.gov/tutorials/parallel_comp/
– http://www-users.cs.umn.edu/~karypis/parbook/
• Journals
– http://www.computer.org/tpds
– http://www.journals.elsevier.com/parallel-computing/
– http://www.journals.elsevier.com/journal-of-parallel-and-distributed-computing/
• Amazon Cloud Computing Services
– http://aws.amazon.com
• CUDA
– http://developer.nvidia.com
• https://www.coursera.org/course/hetero

Rules & Policies

• Plagiarism
– Plagiarism is the act of misrepresenting as one's own original work the ideas, interpretations, words or creative works of another.
– Direct copying of paragraphs, sentences, a single sentence or significant parts of a sentence.
– Presenting work done in collaboration with others as independent work.
– Copying ideas, concepts, research results, computer codes, statistical tables, designs, images, sounds or text or any combination of these.
– Paraphrasing, summarizing or simply rearranging another person's words or ideas without changing the basic structure and/or meaning of the text.
– Copying or adapting another student's original work into a submitted assessment item.

Half Adder

A: Augend B: Addend

S: Sum C: Carry

Full Adder
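The circuit diagrams for these two slides did not survive extraction, so the following is a minimal sketch of the standard gate-level logic, using the slide's names A (augend), B (addend), S (sum) and C (carry). The type `AdderOut` and the function names are illustrative assumptions, not part of the course material.

```c
#include <assert.h>

/* One-bit sum and carry-out, as produced by an adder. */
typedef struct {
    unsigned sum;
    unsigned carry;
} AdderOut;

/* Half adder: S = A XOR B, C = A AND B. */
AdderOut half_adder(unsigned a, unsigned b)
{
    assert(a <= 1 && b <= 1);          /* inputs are single bits */
    AdderOut out = { a ^ b, a & b };
    return out;
}

/* Full adder built from two half adders and an OR gate. */
AdderOut full_adder(unsigned a, unsigned b, unsigned carry_in)
{
    AdderOut first  = half_adder(a, b);
    AdderOut second = half_adder(first.sum, carry_in);
    AdderOut out = { second.sum, first.carry | second.carry };
    return out;
}
```

Chaining `full_adder` calls, with each carry feeding the next bit position, gives a ripple-carry adder for multi-bit numbers.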

SR Latch

S   R   Q
0   0   Q (unchanged)
0   1   0
1   0   1
1   1   N/A

Address Decoder

Electronic Numerical Integrator And Computer (ENIAC)

• Speed (10-digit decimal numbers)
– Machine Cycle: 5000 cycles per second
– Multiplication: 357 times per second
– Division/Square Root: 35 times per second
• Programming
– Programmable
– Switches and Cables
– Setting up a program usually took days.
– I/O: Punched Cards

Stored-Program Computer

Personal Computer in 1980s

BASIC IBM PC/AT

Top 500 Supercomputers

[Figure: Top 500 supercomputers; axis label: GFLOPS]

Cost of Computing

Date            Approx. cost per GFLOPS    Approx. cost per GFLOPS (inflation-adjusted to 2013 dollars)
1984            $15,000,000                $33,000,000
1997            $30,000                    $42,000
April 2000      $1,000                     $1,300
May 2000        $640                       $836
August 2003     $82                        $100
August 2007     $48                        $52
March 2011      $1.80                      $1.80
August 2012     $0.75                      $0.73
December 2013   $0.12                      $0.12

Complexity of Computing

• A: 10×100   B: 100×5   C: 5×50
• (AB)C vs. A(BC)
• A: N×N   B: N×N   C = AB
• Time Complexity: O(N^3)
• Space Complexity: O(1) (extra space, beyond the input and output matrices)
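To make the (AB)C vs. A(BC) comparison concrete, the sketch below counts scalar multiplications for both orders, assuming the schoolbook algorithm in which a p×q by q×r product costs p·q·r multiplications. The function names are illustrative.

```c
#include <assert.h>

/* Scalar multiplications needed to multiply a (rows x inner) matrix
   by an (inner x cols) matrix with the schoolbook algorithm. */
long mult_cost(long rows, long inner, long cols)
{
    assert(rows > 0 && inner > 0 && cols > 0);
    return rows * inner * cols;
}

/* Cost of (AB)C where A is p x q, B is q x r, C is r x s. */
long cost_left_first(long p, long q, long r, long s)
{
    return mult_cost(p, q, r) + mult_cost(p, r, s);
}

/* Cost of A(BC) for the same shapes. */
long cost_right_first(long p, long q, long r, long s)
{
    return mult_cost(q, r, s) + mult_cost(p, q, s);
}
```

For the slide's shapes (10×100, 100×5, 5×50), `cost_left_first(10, 100, 5, 50)` is 5,000 + 2,500 = 7,500, while `cost_right_first(10, 100, 5, 50)` is 25,000 + 50,000 = 75,000, so the order of evaluation changes the work by a factor of ten.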

Why Parallel Computing?

• Why we need ever-increasing performance:
– Big Data Analysis
– Climate Modeling
– Gaming
• Why we need to build parallel systems:
– Increase the speed of integrated circuits → Overheating
– Increase the number of transistors → Multi-Core
• Why we need to learn parallel programming:
– Running multiple instances of the same program is unlikely to help.
– Need to rewrite serial programs to make them parallel.

Parallel Sum

Data:          1, 4, 3 | 9, 2, 8 | 5, 1, 1 | 6, 2, 7 | 2, 5, 0 | 4, 1, 8 | 6, 5, 1 | 2, 3, 9
Core:             0    |    1    |    2    |    3    |    4    |    5    |    6    |    7
Partial sum:      8    |   19    |    7    |   15    |    7    |   13    |   12    |   14

Core 0 adds up all eight partial sums: 95

Parallel Sum

Data:          1, 4, 3 | 9, 2, 8 | 5, 1, 1 | 6, 2, 7 | 2, 5, 0 | 4, 1, 8 | 6, 5, 1 | 2, 3, 9
Core:             0    |    1    |    2    |    3    |    4    |    5    |    6    |    7
Partial sum:      8    |   19    |    7    |   15    |    7    |   13    |   12    |   14

Step 1 (cores 0, 2, 4, 6):   27   22   20   26
Step 2 (cores 0, 4):         49   46
Step 3 (core 0):             95
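The tree-structured sum can be sketched as below. This is a serial simulation under the assumption that each array slot stands for one core's partial sum; the additions inside one round are independent and are the part that would run in parallel. The function name `tree_sum` is illustrative.

```c
#include <assert.h>

/* Pairwise (tree) reduction over per-core partial sums.
   In each round, "core" i adds in the value held by core i + stride;
   the additions within a round are independent of each other and
   could run in parallel. Here they are simulated serially.
   The array is overwritten; the total ends up in partial[0]. */
int tree_sum(int partial[], int cores)
{
    assert(cores > 0);
    for (int stride = 1; stride < cores; stride *= 2) {
        for (int i = 0; i + stride < cores; i += 2 * stride) {
            partial[i] += partial[i + stride];
        }
    }
    return partial[0];
}
```

With eight cores the total is ready after three rounds instead of the seven sequential additions that a single core would perform.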

Prefix Scan

Original Vector:          3   5   2   5   7   9   4   6
Inclusive Prefix Scan:    3   8  10  15  22  31  35  41
Exclusive Prefix Scan:    0   3   8  10  15  22  31  35

prefixScan[0] = A[0];
for (i = 1; i < N; i++)
    prefixScan[i] = prefixScan[i-1] + A[i];

Parallel Prefix Scan

Original vector:                     3  5  2  5   7  9 -4  6   7 -3  1  7   6  8 -1  2
Split into four blocks:              3  5  2  5 | 7  9 -4  6 | 7 -3  1  7 | 6  8 -1  2
Inclusive scan within each block:    3  8 10 15 | 7 16 12 18 | 7  4  5 12 | 6 14 13 15
Block sums:                          15 18 12 15
Exclusive Prefix Scan of the sums:   0 15 33 45
Add each offset to its block:        3  8 10 15 | 22 31 27 33 | 40 37 38 45 | 51 59 58 60
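A minimal sketch of this blocked scan follows, written serially. It assumes the vector length is a multiple of the number of blocks; `blocked_scan` and `MAX_BLOCKS` are illustrative names. Steps 1 and 3 touch each block independently, so they are the candidates for parallel execution, while step 2 is a short serial scan over one value per block.

```c
#include <assert.h>

#define MAX_BLOCKS 64

/* Blocked inclusive prefix scan, done in place.
   Step 1: scan each block independently (parallel across blocks).
   Step 2: exclusive scan of the block totals (short and serial).
   Step 3: add each block's offset to its elements (parallel again).
   Assumes n is a positive multiple of blocks, for brevity. */
void blocked_scan(int a[], int n, int blocks)
{
    assert(n > 0 && blocks > 0 && blocks <= MAX_BLOCKS && n % blocks == 0);
    int len = n / blocks;
    int offset[MAX_BLOCKS];

    for (int b = 0; b < blocks; b++)            /* step 1 */
        for (int i = b * len + 1; i < (b + 1) * len; i++)
            a[i] += a[i - 1];

    offset[0] = 0;                              /* step 2 */
    for (int b = 1; b < blocks; b++)
        offset[b] = offset[b - 1] + a[b * len - 1];

    for (int b = 1; b < blocks; b++)            /* step 3 */
        for (int i = b * len; i < (b + 1) * len; i++)
            a[i] += offset[b];
}
```

Calling `blocked_scan` on the 16-element vector from the slide with 4 blocks reproduces the block offsets 0, 15, 33, 45 internally and the final row ending in 60.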

Levels of Parallelism

• Embarrassingly Parallel
– No dependency or communication between parallel tasks
• Coarse-Grained Parallelism
– Infrequent communication, large amounts of computation
• Fine-Grained Parallelism
– Frequent communication, small amounts of computation
– Greater potential for parallelism
– More overhead
• Not Parallel
– Giving birth to a baby takes 9 months.
– Can this be done in 1 month by having 9 women?

Data Decomposition

2 Cores

Granularity

8 Cores

Coordination

• Communication
– Sending partial results to other cores
• Load Balancing
– Wooden Barrel Principle
• Synchronization
– Race Condition

Thread A                          Thread B
1A: Read variable V               1B: Read variable V
2A: Add 1 to variable V           2B: Add 1 to variable V
3A: Write back to variable V      3B: Write back to variable V
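The effect of the race can be shown without real threads by replaying the six steps in a fixed order. The sketch below is a deterministic simulation, not a threaded program; the `Thread` struct stands for each thread's private register, and all names are illustrative.

```c
#include <assert.h>

/* Deterministic replay of the race: each "thread" has a private
   register, and the three steps are executed in a chosen order. */
typedef struct { int reg; } Thread;

static void read_v(Thread *t, const int *v)  { t->reg = *v; }   /* step 1 */
static void add_one(Thread *t)               { t->reg += 1; }   /* step 2 */
static void write_v(const Thread *t, int *v) { *v = t->reg; }   /* step 3 */

/* Interleaved order 1A 1B 2A 2B 3A 3B: both threads read the old value. */
int run_interleaved(int v)
{
    Thread a, b;
    read_v(&a, &v);  read_v(&b, &v);
    assert(a.reg == b.reg);          /* both saw the same stale value */
    add_one(&a);     add_one(&b);
    write_v(&a, &v); write_v(&b, &v);
    return v;
}

/* Serialized order 1A 2A 3A 1B 2B 3B, as mutual exclusion would enforce. */
int run_serialized(int v)
{
    Thread a, b;
    read_v(&a, &v); add_one(&a); write_v(&a, &v);
    read_v(&b, &v); add_one(&b); write_v(&b, &v);
    return v;
}
```

Starting from V = 0, the interleaved order leaves V at 1 because one increment is lost, while the serialized order gives the intended 2.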

Data Dependency

• Bernstein's Conditions: statements $S_i$ and $S_j$ (with input sets $I$ and output sets $O$) can be executed in parallel if

$I_j \cap O_i = \emptyset$   (violated: Flow Dependency)
$I_i \cap O_j = \emptyset$   (violated: Anti-Dependency)
$O_i \cap O_j = \emptyset$   (violated: Output Dependency)

• Examples

function Dep(a, b)
    c = a·b
    d = 3·c
end function

function NoDep(a, b)
    c = a·b
    d = 3·b
    e = a+b
end function
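Bernstein's conditions can be checked mechanically once each statement's input and output sets are written down. The sketch below encodes variables as bits so that set intersection becomes a bitwise AND; the `Stmt` type, the `VAR_*` constants and the function name are illustrative assumptions.

```c
#include <stdbool.h>

/* Variables are encoded as bits so that set intersection is bitwise AND. */
enum { VAR_A = 1 << 0, VAR_B = 1 << 1, VAR_C = 1 << 2,
       VAR_D = 1 << 3, VAR_E = 1 << 4 };

/* Input (read) and output (written) sets of one statement. */
typedef struct {
    unsigned in;
    unsigned out;
} Stmt;

/* True if statements si (earlier) and sj (later) satisfy all three of
   Bernstein's conditions and can therefore be executed in parallel. */
bool can_run_in_parallel(Stmt si, Stmt sj)
{
    bool no_flow   = (sj.in  & si.out) == 0;  /* I_j and O_i disjoint */
    bool no_anti   = (si.in  & sj.out) == 0;  /* I_i and O_j disjoint */
    bool no_output = (si.out & sj.out) == 0;  /* O_i and O_j disjoint */
    return no_flow && no_anti && no_output;
}
```

In `Dep`, the statement `d = 3·c` reads `c`, which `c = a·b` writes, so the first condition fails. In `NoDep`, no statement reads or writes another statement's output, so all three statements can run in parallel.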

What is not parallel?

Recurrences

for (i = 1; i < N; i++)
    a[i] = a[i-1] + b[i];

Loop-Carried Dependence

for (k = 5; k < N; k++) {
    b[k] = DoSomething(k);
    a[k] = b[k-5] + MoreStuff(k);
}

Atypical Loop-Carried Dependence

wrap = a[0] * b[0];
for (i = 1; i < N; i++) {
    c[i] = wrap;
    wrap = a[i] * b[i];
    d[i] = 2 * wrap;
}

Solution

for (i = 1; i < N; i++) {
    wrap = a[i-1] * b[i-1];
    c[i] = wrap;
    wrap = a[i] * b[i];
    d[i] = 2 * wrap;
}

What is not parallel?

Induction Variables

i1 = 4;
i2 = 0;
for (k = 1; k < N; k++) {
    B[i1++] = function1(k, q, r);
    i2 += k;
    A[i2] = function2(k, r, q);
}

Solution

i1 = 4;
i2 = 0;
for (k = 1; k < N; k++) {
    B[k+3] = function1(k, q, r);
    i2 = (k*k + k) / 2;
    A[i2] = function2(k, r, q);
}
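The solution works because each induction variable has a closed form in k, so an iteration no longer needs the value left behind by the previous one. The sketch below checks that claim by comparing the running sum against the closed form; the helper names are illustrative.

```c
#include <assert.h>

/* Index written to A[] in iteration k of the original loop,
   obtained by replaying the running sum i2 += k from the start. */
int a_index_serial(int k)
{
    assert(k >= 1);
    int i2 = 0;
    for (int step = 1; step <= k; step++)
        i2 += step;
    return i2;
}

/* Closed form used in the solution: 1 + 2 + ... + k = k(k+1)/2. */
int a_index_closed(int k)
{
    return (k * k + k) / 2;
}

/* Index written to B[] in iteration k: i1 starts at 4 and is
   post-incremented once per iteration, so iteration k uses k + 3. */
int b_index_closed(int k)
{
    return k + 3;
}
```

Since `a_index_closed(k)` and `b_index_closed(k)` depend only on k, the iterations can be distributed across threads in any order.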

Types of Parallelism

• Instruction Level Parallelism
• Task Parallelism
– Different tasks on the same/different sets of data
• Data Parallelism
– Similar tasks on different sets of the data
• Example
– 5 TAs, 100 exam papers, 5 questions
– How to make it task parallel?
– How to make it data parallel?

Assembly Line

[Figure: an assembly line with three stages taking 15, 20 and 5 time units]

• How long does it take to produce a single car?
• How many cars can be worked on at the same time?
• How long is the gap between producing the first and the second car?
• The longest stage on the assembly line determines the throughput.
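The questions above can be answered with a small timing model. This sketch assumes three stages, that a car enters a stage as soon as it has left the previous one and the car ahead has left that stage, and that cars may wait between stages; `finish_time` and `STAGES` are illustrative names.

```c
#include <assert.h>

#define STAGES 3

/* Time at which car number `car` (1-based) leaves the last stage.
   A car enters a stage once it has left the previous stage and the
   car in front of it has left this stage. */
int finish_time(const int stage_time[STAGES], int car)
{
    assert(car >= 1);
    int done[STAGES] = {0};   /* when the previous car left each stage */
    for (int c = 1; c <= car; c++) {
        int ready = 0;        /* when this car left the previous stage */
        for (int s = 0; s < STAGES; s++) {
            int start = ready > done[s] ? ready : done[s];
            done[s] = start + stage_time[s];
            ready = done[s];
        }
    }
    return done[STAGES - 1];
}
```

With stage times 15, 20 and 5, the first car is finished at time 40 (the sum of the stages) and each later car follows 20 time units after the previous one, which is the duration of the longest stage.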

Instruction Pipeline

IF: Instruction fetch

ID: Instruction decode and register fetch

EX: Execute

MEM: Memory access

WB: Register write back

1: Add 1 to R5.

2: Copy R5 to R6.

Superscalar

Computing Models

• Concurrent Computing
– Multiple tasks can be in progress at any instant.
• Parallel Computing
– Multiple tasks can be run simultaneously.
• Distributed Computing
– Multiple programs on networked computers work collaboratively.
• Cluster Computing
– Homogeneous, Dedicated, Centralized
• Grid Computing
– Heterogeneous, Loosely Coupled, Autonomous, Geographically Distributed

Concurrent vs. Parallel

[Figure, three panels: a single core running Job 1 and Job 2; Core 1 and Core 2 running Job 1 and Job 2; Core 1 and Core 2 running Job 1 to Job 4]

Process & Thread

• Process
– An instance of a computer program being executed
• Threads
– The smallest units of processing scheduled by OS
– Exist as a subset of a process.
– Share the same resources from the process.
– Switching between threads is much faster than switching between processes.
• Multithreading
– Better use of computing resources
– Concurrent execution
– Makes the application more responsive

[Figure: a process containing two threads]

Parallel Processes

[Figure: one program running as Process 1, Process 2 and Process 3 on Node 1, Node 2 and Node 3]

Single Program, Multiple Data

Parallel Threads

Graphics Processing Unit

CPU vs. GPU

CUDA

GPU Computing Showcase

MapReduce vs. GPU

• Pros:
– Runs on clusters of hundreds or thousands of commodity computers.
– Can handle very large amounts of data with fault tolerance.
– Minimal effort required from programmers: Map & Reduce
• Cons:
– Intermediate results are stored on disks and transferred via network links.
– Suitable for processing independent or loosely coupled jobs.
– High upfront hardware cost and operational cost
– Low Efficiency: GFLOPS per Watt, GFLOPS per Dollar

Parallel Computing in Matlab

for i=1:1024
    A(i) = sin(i*2*pi/1024);
end
plot(A);

matlabpool open local 3

parfor i=1:1024
    A(i) = sin(i*2*pi/1024);
end
plot(A);

matlabpool close

GPU Computing in Matlab

http://www.mathworks.cn/discovery/matlab-gpu.html

Cloud Computing

Five Attributes of Cloud Computing

• Service Based
– What the service needs to do is more important than how the technologies are used to implement the solution.
• Scalable and Elastic
– The service can scale capacity up or down as the consumer demands at the speed of full automation.
• Shared
– Services share a pool of resources to build economies of scale.
• Metered by Use
– Services are tracked with usage metrics to enable multiple payment models.
• Uses Internet Technologies
– The service is delivered using Internet identifiers, formats and protocols.

Flynn’s Taxonomy

• Single Instruction, Single Data (SISD)
– von Neumann System
• Single Instruction, Multiple Data (SIMD)
– Vector Processors, GPU
• Multiple Instruction, Single Data (MISD)
– Generally used for fault tolerance
• Multiple Instruction, Multiple Data (MIMD)
– Distributed Systems
– Single Program, Multiple Data (SPMD)
– Multiple Program, Multiple Data (MPMD)

Von Neumann Architecture

Harvard Architecture

Inside a PC ...

Front-Side Bus (Core 2 Extreme)
8 B × 400 MHz × 4/cycle = 12.8 GB/s

Memory (DDR3-1600)
8 B × 200 MHz × 4 × 2/cycle = 12.8 GB/s

PCI Express 3.0 (×16)
1 GB/s × 16 = 16 GB/s

Shared Memory System

[Figure: several CPUs connected through an interconnect to a single shared memory]

Non-Uniform Memory Access

[Figure: two groups of cores (Core 1, Core 2), each with its own interconnect and memory; each group has local access to its own memory and remote access to the other group's memory]

Distributed Memory System

[Figure: several CPUs, each with its own memory, connected by communication networks]

Crossbar Switch

[Figure: crossbar switch connecting processors P1 to P4 with memory modules M1 to M4]

Cache

• Component that transparently stores data so that future requests for that data can be served faster
– Compared to main memory: smaller, faster, more expensive
– Spatial Locality
– Temporal Locality
• Cache Line
– A block of data that is accessed together
• Cache Miss
– Failed attempt to read or write a piece of data in the cache
– Main memory access required
– Read Miss, Write Miss
– Compulsory Miss, Capacity Miss, Conflict Miss

Writing Policies

Cache Mapping

[Figure, two panels: "Direct Mapped" and "2-Way Associative", each mapping memory blocks (index 0, 1, 2, 3, 4, 5, ...) to cache lines (index 0 to 3)]

Cache Miss

[Figure: a 4×4 array with elements 0,0 to 3,3, traversed in row-major and column-major order, shown alongside the cache and memory]

#define MAX 4
double A[MAX][MAX], x[MAX], y[MAX];

/* Initialize A and x, assign y=0 */

for (i = 0; i < MAX; i++)
    for (j = 0; j < MAX; j++)
        y[i] += A[i][j] * x[j];

/* Assign y=0 */

for (j = 0; j < MAX; j++)
    for (i = 0; i < MAX; i++)
        y[i] += A[i][j] * x[j];

How many cache misses?
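The slide's question can be explored with a tiny cache model. The parameters below are assumptions chosen for illustration, since the slide's figure did not survive extraction: a direct-mapped cache with two lines of four doubles each, with only accesses to A counted. The function name `count_misses` is likewise illustrative.

```c
#include <assert.h>

#define MAX        4   /* matrix dimension, as on the slide          */
#define LINE_ELEMS 4   /* assumed: one cache line holds 4 doubles    */
#define NUM_LINES  2   /* assumed: the cache holds only 2 lines      */

/* Count misses on A for one traversal order of the MAX x MAX matrix,
   using a direct-mapped cache. Accesses to x and y are ignored. */
int count_misses(int row_major)
{
    int cached[NUM_LINES];            /* memory line held in each slot */
    int misses = 0;
    for (int s = 0; s < NUM_LINES; s++)
        cached[s] = -1;               /* cache starts empty */

    for (int outer = 0; outer < MAX; outer++) {
        for (int inner = 0; inner < MAX; inner++) {
            int i = row_major ? outer : inner;
            int j = row_major ? inner : outer;
            int line = (i * MAX + j) / LINE_ELEMS;  /* row-major layout */
            int slot = line % NUM_LINES;
            if (cached[slot] != line) {
                cached[slot] = line;
                misses++;
            }
        }
    }
    assert(misses <= MAX * MAX);
    return misses;
}
```

Under these assumptions the row-major loop misses once per row (4 misses in total), while the column-major loop evicts each line before it is reused and misses on all 16 accesses.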

Cache Coherence

[Figure: Core 0 with Cache 0 and Core 1 with Cache 1, connected by an interconnect; variables x = 2, y0, y1, z1]

Time   Core 0                   Core 1
0      y0 = x;                  y1 = 3*x;
1      x = 7;                   Statements without x
2      Statements without x     z1 = 4*x;

What is the value of z1?

With write through policy …

With write back policy …

Cache Coherence

[Figure, top: Core 0 and Core 1 each cache A; Core 0 executes A=5 and Core 1 executes B=A2; labels: update A, invalidate, reload A (A=5)]

[Figure, bottom: Core 0 and Core 1 each cache the line AB; Core 0 executes A=5 and Core 1 executes B=B+1; labels: update AB, invalidate, reload AB]

When A and B share a cache line in this way, the situation is called false sharing.

False Sharing

int i, j, m, n;
double y[m];

/* Assign y=0 */

for (i = 0; i < m; i++)
    for (j = 0; j < n; j++)
        y[i] += f(i, j);

/* Private variables */
int i, j, iter_count;

/* Shared variables */
int m, n, core_count;
double y[m];

iter_count = m / core_count;

/* Core 0 does this */
for (i = 0; i < iter_count; i++)
    for (j = 0; j < n; j++)
        y[i] += f(i, j);

/* Core 1 does this */
for (i = iter_count; i < 2*iter_count; i++)
    for (j = 0; j < n; j++)
        y[i] += f(i, j);

m=8, two cores

cache line: 64 bytes

With 8-byte doubles, all of y (8 × 8 = 64 bytes) can fit in a single cache line, so every update by one core invalidates the other core's copy of the line.

Virtual Memory

• Virtualization of various forms of computer data storage into a unified address space
– Logically increases the capacity of main memory (e.g., DOS can only access 1 MB of RAM).
• Page
– A block of contiguous virtual memory addresses
– The smallest unit to be swapped in/out of main memory from/into secondary storage.
• Page Table
– Used to store the mapping between virtual addresses and physical addresses.
• Page Fault
– The accessed page is not in the physical memory.

Interleaving Statements

[Figure: the possible interleavings of statements s1 and s2 executed by two threads T0 and T1]

Number of interleavings of M statements from one thread with N statements from another:

$$C_{M+N}^{M} = \frac{(M+N)!}{M!\,N!}$$
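To get a feel for how fast the number of interleavings grows, the binomial coefficient C(M+N, M) = (M+N)!/(M! N!) can be evaluated directly. The sketch below uses an incremental product to avoid forming large factorials; the function name is illustrative.

```c
/* Number of ways to interleave m statements of one thread with
   n statements of another while keeping each thread's own order:
   C(m+n, m) = (m+n)! / (m! n!).
   Computed incrementally so that no large factorial is formed. */
unsigned long long interleavings(unsigned m, unsigned n)
{
    unsigned long long result = 1;
    for (unsigned k = 1; k <= m; k++) {
        /* result holds C(n+k-1, k-1); the product below is always
           divisible by k, so the division is exact. */
        result = result * (n + k) / k;
    }
    return result;
}
```

Two threads with two statements each already allow 6 interleavings, and two threads with ten statements each allow 184,756, which is why exhaustively testing every ordering quickly becomes impractical.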

Critical Region

• A portion of code where shared resources are accessed and updated
• Resources: data structure (variables), device (printer)
• Threads are disallowed from entering the critical region when another thread is occupying the critical region.
• A means of mutual exclusion is required.
• If a thread is not executing within the critical region, that thread must not prevent another thread seeking entry from entering the region.
• We consider two threads and one core in the following examples.

First Attempt

int threadNumber = 0;

void ThreadZero()
{
    while (TRUE) {
        while (threadNumber == 1) {}   // spin-wait
        CriticalRegionZero;
        threadNumber = 1;
        OtherStuffZero;
    }
}

void ThreadOne()
{
    while (TRUE) {
        while (threadNumber == 0) {}   // spin-wait
        CriticalRegionOne;
        threadNumber = 0;
        OtherStuffOne;
    }
}

• Q1: Can T1 enter the critical region more times than T0?
• Q2: What would happen if T0 terminates (by design or by accident)?

Second Attempt

int Thread0inside = 0;
int Thread1inside = 0;

void ThreadZero()
{
    while (TRUE) {
        while (Thread1inside) {}       // spin-wait
        Thread0inside = 1;
        CriticalRegionZero;
        Thread0inside = 0;
        OtherStuffZero;
    }
}

void ThreadOne()
{
    while (TRUE) {
        while (Thread0inside) {}       // spin-wait
        Thread1inside = 1;
        CriticalRegionOne;
        Thread1inside = 0;
        OtherStuffOne;
    }
}

• Q1: Can T1 enter the critical region multiple times when T0 is not within the critical region?
• Q2: Can T0 and T1 be allowed to enter the critical region at the same time?

Third Attempt

int Thread0WantsToEnter = 0;
int Thread1WantsToEnter = 0;

void ThreadZero()
{
    while (TRUE) {
        Thread0WantsToEnter = 1;
        while (Thread1WantsToEnter) {} // spin-wait
        CriticalRegionZero;
        Thread0WantsToEnter = 0;
        OtherStuffZero;
    }
}

void ThreadOne()
{
    while (TRUE) {
        Thread1WantsToEnter = 1;
        while (Thread0WantsToEnter) {} // spin-wait
        CriticalRegionOne;
        Thread1WantsToEnter = 0;
        OtherStuffOne;
    }
}

Fourth Attempt

int Thread0WantsToEnter = 0;
int Thread1WantsToEnter = 0;

void ThreadZero()
{
    while (TRUE) {
        Thread0WantsToEnter = 1;
        while (Thread1WantsToEnter) {
            Thread0WantsToEnter = 0;
            delay(someRandomCycles);
            Thread0WantsToEnter = 1;
        }
        CriticalRegionZero;
        Thread0WantsToEnter = 0;
        OtherStuffZero;
    }
}

void ThreadOne()
{
    while (TRUE) {
        Thread1WantsToEnter = 1;
        while (Thread0WantsToEnter) {
            Thread1WantsToEnter = 0;
            delay(someRandomCycles);
            Thread1WantsToEnter = 1;
        }
        CriticalRegionOne;
        Thread1WantsToEnter = 0;
        OtherStuffOne;
    }
}

Dekker’s Algorithm

int Thread0WantsToEnter = 0, Thread1WantsToEnter = 0, favored = 0;

void ThreadZero()
{
    while (TRUE) {
        Thread0WantsToEnter = 1;
        while (Thread1WantsToEnter) {
            if (favored == 1) {
                Thread0WantsToEnter = 0;
                while (favored == 1) {}    // wait until T0 is favored
                Thread0WantsToEnter = 1;
            }
        }
        CriticalRegionZero;
        favored = 1;
        Thread0WantsToEnter = 0;
        OtherStuffZero;
    }
}

void ThreadOne()
{
    while (TRUE) {
        Thread1WantsToEnter = 1;
        while (Thread0WantsToEnter) {
            if (favored == 0) {
                Thread1WantsToEnter = 0;
                while (favored == 0) {}    // wait until T1 is favored
                Thread1WantsToEnter = 1;
            }
        }
        CriticalRegionOne;
        favored = 0;
        Thread1WantsToEnter = 0;
        OtherStuffOne;
    }
}

Parallel Program Design

• Foster’s Methodology
• Partitioning
– Divide the computation to be performed and the data operated on by the computation into small tasks.
• Communication
– Determine what communication needs to be carried out among the tasks.
• Agglomeration
– Combine tasks that communicate intensively with each other or must be executed sequentially into larger tasks.
• Mapping
– Assign the composite tasks to processes/threads to minimize inter-processor communication and maximize processor utilization.

Parallel Histogram

[Figure: bin edges 0, 1, 2, 3, 4, 5. For each of data[i-1], data[i], data[i+1], a Find_bin() task is followed by an "Increment bin_counts" task: bin_counts[b-1]++ or bin_counts[b]++]

Parallel Histogram

[Figure: data[i-1], data[i], data[i+1], data[i+2] each increment a local counter (loc_bin_cts[b-1]++ or loc_bin_cts[b]++); the local counters are then added into the shared totals (bin_counts[b-1] += ..., bin_counts[b] += ...)]
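The local-counter design can be sketched as follows. This is a serial simulation in which each "worker" stands for one thread or process; the bin layout (five unit-width bins over [0, 5)), the worker count and the helper `find_bin` are assumptions made for illustration, while `loc_bin_cts` and `bin_counts` follow the names on the slide.

```c
#include <assert.h>

#define BINS    5
#define WORKERS 2

/* Hypothetical stand-in for the slide's Find_bin(): values in
   [0, 5) fall into unit-width bins 0..4. */
static int find_bin(double value)
{
    assert(value >= 0.0 && value < BINS);
    return (int)value;
}

/* Each worker counts its own contiguous slice into a private array
   (no sharing, so no synchronization), then the private counts are
   added into the shared bin_counts. The worker loop is serial here;
   in a real program each iteration would be one thread or process. */
void histogram(const double data[], int n, int bin_counts[BINS])
{
    int loc_bin_cts[WORKERS][BINS] = {{0}};

    for (int w = 0; w < WORKERS; w++) {
        int begin = w * n / WORKERS;
        int end   = (w + 1) * n / WORKERS;
        for (int i = begin; i < end; i++)
            loc_bin_cts[w][find_bin(data[i])]++;
    }

    for (int b = 0; b < BINS; b++) {
        bin_counts[b] = 0;
        for (int w = 0; w < WORKERS; w++)
            bin_counts[b] += loc_bin_cts[w][b];
    }
}
```

The design choice is to trade a small amount of extra memory (one private array per worker) for the removal of contention on the shared counters: only the final merge touches `bin_counts`.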

Performance

• Speedup: $S = \frac{T_{Serial}}{T_{Parallel}}$
• Efficiency: $E = \frac{S}{N} = \frac{T_{Serial}}{N \cdot T_{Parallel}}$
• Scalability
– Problem Size, Number of Processors
• Strongly Scalable
– Same efficiency for larger N with fixed problem size
• Weakly Scalable
– Same efficiency for larger N with a fixed problem size per processor

Amdahl's Law

$$S(N) = \frac{1}{(1 - P) + \frac{P}{N}}$$

where P is the fraction of the work that can be parallelized and N is the number of processors.
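A short numerical sketch of Amdahl's Law, S(N) = 1 / ((1 - P) + P/N), helps show how the serial fraction caps the speedup. The function names are illustrative, and P is assumed to be the parallelizable fraction of the serial run time.

```c
#include <assert.h>

/* Amdahl's Law: speedup on n processors when a fraction p of the
   serial run time can be parallelized perfectly. */
double amdahl_speedup(double p, int n)
{
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / n);
}

/* Upper bound on the speedup as n grows without limit: 1 / (1 - p). */
double amdahl_limit(double p)
{
    assert(p >= 0.0 && p < 1.0);
    return 1.0 / (1.0 - p);
}
```

For example, with P = 0.9 the speedup on 10 processors is about 5.26, and no number of processors can push it past 10.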

Gustafson's Law

$$T_{Parallel} = a + b, \qquad T_{Serial} = a + N b$$

(a: sequential part, b: parallel part)

$$S(N) = \frac{a + N b}{a + b} = N - \alpha (N - 1), \quad \text{for } \alpha = \frac{a}{a + b}$$

• Linear speedup can be achieved when:
– Problem size is allowed to grow monotonically with N.
– The sequential part is fixed or grows slowly.
• Is it possible to achieve superlinear speedup?
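Gustafson's Law can be evaluated numerically in its two equivalent forms, S(N) = (a + N b)/(a + b) and S(N) = N - alpha (N - 1) with alpha = a/(a + b). The sketch below assumes a is the sequential time and b the parallel time measured on the N-processor run; the function names are illustrative.

```c
#include <assert.h>

/* Gustafson's Law: a is the sequential time and b the parallel time
   measured on the n-processor run. */
double gustafson_speedup(double a, double b, int n)
{
    assert(a >= 0.0 && b >= 0.0 && a + b > 0.0 && n >= 1);
    return (a + n * b) / (a + b);
}

/* Equivalent form with alpha = a / (a + b): S(n) = n - alpha (n - 1). */
double gustafson_speedup_alpha(double alpha, int n)
{
    assert(alpha >= 0.0 && alpha <= 1.0 && n >= 1);
    return n - alpha * (n - 1);
}
```

With a = 1, b = 9 and N = 10, both forms give a scaled speedup of about 9.1, which grows almost linearly with N because the problem size grows with the machine.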

Review

• Why is parallel computing important?

• What is data dependency?

• What are the benefits and issues of fine-grained parallelism?

• What are the three types of parallelism?

• What is the difference between concurrent and parallel computing?

• What are the essential features of cloud computing?

• What is Flynn’s Taxonomy?


• Name the four categories of memory systems.

• What are the two common cache writing policies?

• Name the two types of cache mapping strategies.

• What is a cache miss, and how can it be avoided?

• What may cause the false sharing issue?

• What is a critical region?

• How can the correctness of a concurrent program be verified?


• Name three major APIs for parallel computing.

• What are the benefits of GPU computing compared to MapReduce?

• What is the basic procedure of parallel program design?

• What are the key performance factors in parallel programming?

• What is a strongly/weakly scalable parallel program?

• What is the implication of Amdahl's Law?

• What does Gustafson's Law tell us?