Figure 2.1 Example of a two-way


Page 1

(a) Three different code fragments for adding a list of four numbers.

(i)
1. load R1, @1000
2. load R2, @1008
3. add R1, @1004
4. add R2, @100C
5. add R1, R2
6. store R1, @2000

(ii)
1. load R1, @1000
2. add R1, @1004
3. add R1, @1008
4. add R1, @100C
5. store R1, @2000

(iii)
1. load R1, @1000
2. add R1, @1004
3. load R2, @1008
4. add R2, @100C
5. add R1, R2
6. store R1, @2000

(b) Execution schedule for code fragment (i) above. [Pipeline diagram; horizontal axis: instruction cycles, 0 to 8. Stage labels: IF: Instruction Fetch; ID: Instruction Decode; OF: Operand Fetch; E: Instruction Execute; WB: Write-back; NA: No Action.]

(c) Hardware utilization trace for schedule in (b). [Diagram of adder utilization by clock cycle (cycles 4 to 7 labeled), marking full issue slots, empty issue slots, vertical waste, and horizontal waste.]

Figure 2.1 Example of a two-way superscalar execution of instructions.
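The three fragments in panel (a) differ only in instruction order and register use, so they should store the same sum. The following is a minimal sketch of a toy interpreter that checks this. The function name `run`, the dictionary-based memory, and the sample values 1, 2, 3, 4 are assumptions made for illustration; they are not part of the figure, and the sketch says nothing about pipeline timing.

```python
def run(fragment, memory):
    """Execute a fragment on a copy of memory and return the final memory."""
    memory = dict(memory)
    regs = {}

    def value(operand):
        # "@addr" names a memory word; anything else names a register.
        return memory[operand] if operand.startswith("@") else regs[operand]

    for line in fragment:
        op, operands = line.split(None, 1)
        first, second = [s.strip() for s in operands.split(",")]
        if op == "load":
            regs[first] = value(second)
        elif op == "add":
            regs[first] += value(second)
        elif op == "store":
            memory[second] = regs[first]
        else:
            raise ValueError(f"unknown instruction: {op}")
    return memory


FRAGMENT_I = [
    "load R1, @1000", "load R2, @1008", "add R1, @1004",
    "add R2, @100C", "add R1, R2", "store R1, @2000",
]
FRAGMENT_II = [
    "load R1, @1000", "add R1, @1004", "add R1, @1008",
    "add R1, @100C", "store R1, @2000",
]
FRAGMENT_III = [
    "load R1, @1000", "add R1, @1004", "load R2, @1008",
    "add R2, @100C", "add R1, R2", "store R1, @2000",
]

# Assumed sample data: the figure does not give the four numbers.
MEMORY = {"@1000": 1, "@1004": 2, "@1008": 3, "@100C": 4}

for name, fragment in [("(i)", FRAGMENT_I), ("(ii)", FRAGMENT_II), ("(iii)", FRAGMENT_III)]:
    print(name, run(fragment, MEMORY)["@2000"])  # each fragment stores 10
```

Fragment (i) exposes two independent load/add chains, which is what lets a two-way superscalar processor issue instructions in pairs; fragment (ii) is one long dependency chain on R1.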

Page 2

[Figure panels: (a) Column major data access; (b) Row major data access.]

Figure 2.2 Multiplying a matrix with a vector: (a) multiplying column-by-column, keeping a running sum; (b) computing each element of the result as a dot product of a row of the matrix with the vector.
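The two access patterns in the caption can be sketched as follows. This is an illustrative sketch only: the function names and the small test matrix are assumptions, and plain Python lists stand in for whatever storage layout the figure has in mind.

```python
def matvec_row_major(A, b):
    """Each result element is the dot product of one row of A with b."""
    return [sum(a_ij * b_j for a_ij, b_j in zip(row, b)) for row in A]


def matvec_column_major(A, b):
    """Walk A column by column, adding b[j] * column j into a running sum."""
    n_rows = len(A)
    y = [0] * n_rows
    for j, b_j in enumerate(b):
        for i in range(n_rows):
            y[i] += A[i][j] * b_j
    return y


A = [[1, 2], [3, 4], [5, 6]]
b = [10, 1]
print(matvec_row_major(A, b))     # [12, 34, 56]
print(matvec_column_major(A, b))  # [12, 34, 56]
```

Both loops compute the same product; they differ only in the order in which elements of the matrix are touched, which is what matters for cache behavior.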

Page 3

[Figure panels (a) and (b). Labels: PE; Global control unit; control unit; Interconnection Network. PE: Processing Element.]

Figure 2.3 A typical SIMD architecture (a) and a typical MIMD architecture (b).

Page 4

(a)

if (B == 0)
    C = A;
else
    C = A/B;

(b) [Three diagrams of Processor 0 to Processor 3: Initial values, Step 1, and Step 2. Processors not taking part in a step are marked Idle.]

Initial values:
Processor 0: A = 5, B = 0, C = 0
Processor 1: A = 4, B = 2, C = 0
Processor 2: A = 1, B = 1, C = 0
Processor 3: A = 0, B = 0, C = 0

Final values of C: 5, 2, 1, 0.

Figure 2.4 Executing a conditional statement on an SIMD computer with four processors: (a) the conditional statement; (b) the execution of the statement in two steps.
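The two-step execution in panel (b) can be sketched with an activity mask. This is a minimal sketch, assuming the per-processor values of A and B as read from the figure; the function name `simd_conditional` is hypothetical, and the sequential `for` loops stand in for what the processing elements would do in lockstep.

```python
def simd_conditional(A, B):
    """Emulate 'if (B == 0) C = A; else C = A/B;' in two masked steps."""
    n = len(A)
    C = [0] * n
    # Every processor evaluates the condition on its own data.
    mask = [b == 0 for b in B]

    # Step 1: only processors whose condition holds are active; the rest idle.
    for p in range(n):
        if mask[p]:
            C[p] = A[p]

    # Step 2: the mask is inverted; the processors active in step 1 now idle.
    for p in range(n):
        if not mask[p]:
            C[p] = A[p] // B[p]  # integer division, as for C-style ints

    return C


print(simd_conditional([5, 4, 1, 0], [0, 2, 1, 0]))  # [5, 2, 1, 0]
```

Because the two branches run one after the other, every processor is idle during one of the two steps.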

Page 5

[Figure panels (a), (b), and (c). Labels: P; C; M; Interconnection Network.]

Figure 2.5 Typical shared-address-space architectures: (a) Uniform-memory-access shared-address-space computer; (b) Uniform-memory-access shared-address-space computer with caches and memories; (c) Non-uniform-memory-access shared-address-space computer with local memory only.

Page 6

[Figure labels: Static network; Indirect network; Processing node (P); Switching element; Network interface/switch.]

Figure 2.6 Classification of interconnection networks: (a) a static network; and (b) a dynamic network.

Page 7

[Figure panels (a) and (b). Labels: Processor 0; Processor 1; Shared Memory; Address; Data; Cache / Local Memory.]

Figure 2.7 Bus-based interconnects (a) with no local caches; (b) with local memory/caches.

Page 8

[Figure labels: Processing Elements, 0 to p-1; Memory Banks, 0 to b-1; A switching element.]

Figure 2.8 A completely non-blocking crossbar network connecting p processors to b memory banks.

Page 9

[Figure labels: Processors, 0 to p-1; Multistage interconnection network, Stage 1, Stage 2, ..., Stage n; Memory banks, 0 to b-1.]

Figure 2.9 The schematic of a typical multistage interconnection network.

Page 10

[Figure: inputs 0 to 7 (000 to 111) wired to outputs 0 to 7 (000 to 111). Output labels:]

000 = left_rotate(000)
001 = left_rotate(100)
010 = left_rotate(001)
011 = left_rotate(101)
100 = left_rotate(010)
101 = left_rotate(110)
110 = left_rotate(011)
111 = left_rotate(111)

Figure 2.10 A perfect shuffle interconnection for eight inputs and outputs.
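The left_rotate labels in the figure can be reproduced with a one-bit circular left shift. The sketch below is illustrative; the `bits` parameter and the list name `shuffle` are assumptions, not names from the figure.

```python
def left_rotate(i, bits):
    """Rotate the bits-wide binary label i left by one position."""
    msb = (i >> (bits - 1)) & 1
    return ((i << 1) & ((1 << bits) - 1)) | msb


# Input i is wired to output left_rotate(i).
shuffle = [left_rotate(i, 3) for i in range(8)]
print(shuffle)  # [0, 2, 4, 6, 1, 3, 5, 7]

for i, out in enumerate(shuffle):
    print(f"{out:03b} = left_rotate({i:03b})")
```

For example, input 100 is wired to output 001 and input 101 to output 011, matching the labels in the figure.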

Page 11

Figure 2.11 Two switching configurations of the 2 × 2 switch: (a) Pass-through; (b) Cross-over.

Page 12

[Figure: inputs 000 to 111 on one side, outputs 000 to 111 on the other.]

Figure 2.12 A complete omega network connecting eight inputs and eight outputs.

Page 13

[Figure: inputs 000 to 111 on one side, outputs 000 to 111 on the other; link AB is marked.]

Figure 2.13 An example of blocking in omega network: one of the messages (010 to 111 or 110 to 100) is blocked at link AB.
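The conflict in the caption can be checked with a small routing sketch. It assumes the usual destination-tag scheme for an omega network: before each stage the position is perfect-shuffled, and the switch then sets the low bit to the next destination bit, most significant first. The function names and the (stage, position) link encoding are hypothetical, and the sketch does not claim which of the shared links corresponds to the label AB in the figure.

```python
def left_rotate(i, bits):
    """Rotate the bits-wide binary label i left by one position."""
    msb = (i >> (bits - 1)) & 1
    return ((i << 1) & ((1 << bits) - 1)) | msb


def omega_path(src, dst, bits):
    """Return the switch output links used, as (stage, position) pairs."""
    pos = src
    links = []
    for stage in range(bits):
        pos = left_rotate(pos, bits)                 # perfect shuffle wiring
        dst_bit = (dst >> (bits - 1 - stage)) & 1    # next destination bit
        pos = (pos & ~1) | dst_bit                   # switch picks upper/lower output
        links.append((stage, pos))
    return links


path_1 = omega_path(0b010, 0b111, 3)
path_2 = omega_path(0b110, 0b100, 3)
shared = sorted(set(path_1) & set(path_2))

print(path_1)  # [(0, 5), (1, 3), (2, 7)]
print(path_2)  # [(0, 5), (1, 2), (2, 4)]
print(shared)  # [(0, 5)]
```

Under these assumptions both messages need the same first-stage output (position 101), so only one of them can proceed at a time.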

Page 14

Figure 2.14 (a) A completely-connected network of eight nodes; (b) a star connected network of nine nodes.

Page 15

Figure 2.15 Linear arrays: (a) with no wraparound links; (b) with wraparound link.

Page 16

Figure 2.16 Two and three dimensional meshes: (a) 2-D mesh with no wraparound; (b) 2-D mesh with wraparound link (2-D torus); and (c) a 3-D mesh with no wraparound.

Page 17

[Figure: 0-D, 1-D, 2-D, 3-D, and 4-D hypercubes with binary node labels (0 and 1; 00 to 11; 000 to 111; 0000 to 1111).]

Figure 2.17 Construction of hypercubes from hypercubes of lower dimension.
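The recursive construction in the caption can be sketched as follows: a d-dimensional hypercube is two copies of the (d-1)-dimensional one, with corresponding nodes joined by a new link. The function name `hypercube_edges` and the integer node labels are assumptions made for illustration.

```python
def hypercube_edges(d):
    """Return the edge set of a d-dimensional hypercube on nodes 0 .. 2**d - 1."""
    if d == 0:
        return set()  # a 0-D hypercube is a single node with no links
    lower = hypercube_edges(d - 1)
    half = 1 << (d - 1)
    edges = set(lower)                                   # copy with leading bit 0
    edges |= {(u + half, v + half) for u, v in lower}    # copy with leading bit 1
    edges |= {(u, u + half) for u in range(half)}        # links between the copies
    return edges


for d in range(5):
    print(d, len(hypercube_edges(d)))  # 0, 1, 4, 12, 32 links

# Every link joins two labels that differ in exactly one bit.
print(all(bin(u ^ v).count("1") == 1 for u, v in hypercube_edges(4)))  # True
```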

Page 18

[Figure legend: Processing nodes; Switching nodes.]

Figure 2.18 Complete binary tree networks: (a) a static tree network; and (b) a dynamic tree network.

Page 19

Figure 2.19 A fat tree network of 16 processing nodes.

Page 20

[Figure labels: P; A, B, C.]

Figure 2.20 Bisection width of a dynamic network is computed by examining various equi-partitions of the processing nodes and selecting the minimum number of edges crossing the partition. In this case, each partition yields an edge cut of four. Therefore, the bisection width of this graph is four.

Page 21

[Figure panels: (a) Invalidate; (b) Update. Labels: P0; P1; Memory; load x; write #3, x; x = 1; x = 3.]

Figure 2.21 Cache coherence in multiprocessor systems: (a) Invalidate protocol; (b) Update protocol for shared variables.

Page 22

[State diagram. States: Invalid; Shared; Dirty. Edge labels: read; write; read/write; C_read; C_write; flush.]

Figure 2.22 State diagram of a simple three-state coherence protocol.
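A state diagram like this one can be written down as a transition table. The sketch below is hedged: the state and event names come from the figure labels, but the arrows did not survive in this transcript, so the endpoints of each transition are an assumption based on the common form of a three-state invalidate protocol. `TRANSITIONS` and `next_state` are hypothetical names.

```python
# (current state, event) -> next state.
# read/write are issued by the local processor; C_read/C_write are
# coherence actions caused by another processor; flush evicts the line.
# ASSUMPTION: edge endpoints are inferred, not read from the figure.
TRANSITIONS = {
    ("Invalid", "read"): "Shared",
    ("Invalid", "write"): "Dirty",
    ("Shared", "read"): "Shared",
    ("Shared", "write"): "Dirty",
    ("Shared", "C_write"): "Invalid",
    ("Dirty", "read"): "Dirty",
    ("Dirty", "write"): "Dirty",
    ("Dirty", "C_read"): "Shared",
    ("Dirty", "C_write"): "Invalid",
    ("Dirty", "flush"): "Invalid",
}


def next_state(state, event):
    """Return the next state; events with no listed edge leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


state = "Invalid"
for event in ["read", "write", "C_read", "C_write"]:
    state = next_state(state, event)
    print(event, "->", state)
```

The table has ten entries, one for each edge label occurrence in the diagram (counting read/write as two).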

Page 23

[Table columns: Time; Instruction at Processor 0; Instruction at Processor 1; Variables and their states at Processor 0; Variables and their states at Processor 1; Variables and their states in Global mem. Instructions shown: read x; read y; x = x + 1; x = x + y; y = y + 1; y = x + y. Each state entry gives a variable, its value, and its state (S, I, or D), for example "x = 5, S" or "y = 13, D".]

Figure 2.23 Example of parallel program execution with the simple three-state coherence protocol discussed in Section ??.

Page 24

[Figure labels: Processor; Cache; Tags; Snoop H/W; Dirty; Address/data; Memory.]

Figure 2.24 A simple snoopy bus based cache coherence system.

Page 25

[Figure panels (a) and (b). Labels: Processor; Cache; Interconnection Network; Memory; Directory; Data; State; Presence Bits; Presence bits / State.]

Figure 2.25 Architecture of typical directory based systems: (a) a centralized directory; and (b) a distributed directory.

Page 26

[Three timing diagrams for nodes P0, P1, P2, P3 against Time:]

(a) A single message sent over a store-and-forward network.
(b) The same message broken into two parts and sent over the network.
(c) The same message broken into four parts and sent over the network.

Figure 2.26 Passing a message from node P0 to P3 (a) through a store-and-forward communication network; (b) and (c) extending the concept to cut-through routing. The shaded regions represent the time that the message is in transit. The startup time associated with this message transfer is assumed to be zero.
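The effect of splitting the message can be estimated with a simple pipeline model. This is a sketch under stated assumptions, not a formula taken from the figure: startup time is zero (as in the caption), per-hop switching delay is ignored, every part must be fully received by a node before it is forwarded, and a node can receive one part while sending another. The names `transfer_time`, `m_words`, and `t_w` (time per word per link) are hypothetical.

```python
def transfer_time(m_words, links, parts, t_w=1.0):
    """Time for m_words to cross `links` links when split into equal parts.

    The first part needs `links` part-times to arrive; each remaining part
    arrives one part-time after the previous one.
    """
    part_time = (m_words / parts) * t_w
    return (links + parts - 1) * part_time


# P0 to P3 is three links; the message size of 8 words is an assumed example.
for parts in (1, 2, 4):
    print(parts, transfer_time(m_words=8, links=3, parts=parts))
# 1 part: 24.0, 2 parts: 16.0, 4 parts: 12.0
```

As the parts get smaller, the total approaches the time to send the message over a single link, which is the idea that cut-through routing takes to its limit.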

Page 27

[Figure labels: A, B, C, D; Flit buffers; Flit from message 0; Flit from message 1; Flit from message 2; Flit from message 3; Desired direction of message traversal.]

Figure 2.27 An example of deadlock in a cut-through routing network.

Page 28

[Three 3-D hypercubes with node labels 000 to 111; Ps is at 010 and Pd is at 111. Step 1 (010 → 110); Step 2 (110 → 111).]

Figure 2.28 Routing a message from node Ps (010) to node Pd (111) in a three-dimensional hypercube using E-cube routing.
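Dimension-ordered routing of this kind can be sketched with an exclusive-or of the two labels: each set bit of Ps XOR Pd is a dimension the message still has to cross. The function below is an illustrative sketch with hypothetical names. The order in which dimensions are corrected is a parameter, because the steps shown in the figure (010 to 110, then 110 to 111) correspond to highest dimension first, while many descriptions of E-cube routing correct the least significant differing bit first.

```python
def dimension_ordered_route(src, dst, bits, msb_first=True):
    """Return the list of nodes visited from src to dst in a bits-D hypercube."""
    dims = range(bits - 1, -1, -1) if msb_first else range(bits)
    node = src
    path = [node]
    for d in dims:
        if (node ^ dst) & (1 << d):   # labels differ in dimension d
            node ^= 1 << d            # cross the link in that dimension
            path.append(node)
    return path


route = dimension_ordered_route(0b010, 0b111, 3)
print([f"{n:03b}" for n in route])  # ['010', '110', '111']

alt = dimension_ordered_route(0b010, 0b111, 3, msb_first=False)
print([f"{n:03b}" for n in alt])    # ['010', '011', '111']
```

Either order takes two steps here, since 010 and 111 differ in two bit positions.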

Page 29

[Figure panels (a) to (d): nodes numbered 1 to 16 in rows of four; processes labeled a to p.]

Figure 2.29 Impact of process mapping on performance: (a) underlying architecture; (b) processes and their interactions; (c) an intuitive mapping of processes to nodes; and (d) a random mapping of processes to nodes.

Page 30

(a)

1-bit Gray code   2-bit Gray code   3-bit Gray code   3-D hypercube   8-processor ring
0                 0 0               0 0 0             0               0
1                 0 1               0 0 1             1               1
                  1 1               0 1 1             3               2
                  1 0               0 1 0             2               3
                                    1 1 0             6               4
                                    1 1 1             7               5
                                    1 0 1             5               6
                                    1 0 0             4               7

[Each code is obtained from the previous one by reflection: "Reflect along this line".]

(b) [3-D hypercube with node labels 000 to 111.]

Figure 2.30 (a) A three-bit reflected Gray code ring; and (b) its embedding into a three-dimensional hypercube.
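The reflection step in panel (a) can be sketched recursively: the d-bit code is the (d-1)-bit code followed by its mirror image with the new leading bit set. The function name `gray_code` is an assumption made for illustration.

```python
def gray_code(bits):
    """Return the reflected Gray code on `bits` bits as a list of integers."""
    if bits == 0:
        return [0]
    prev = gray_code(bits - 1)
    high = 1 << (bits - 1)
    # First half: previous code with leading 0; second half: reflected, leading 1.
    return prev + [high | g for g in reversed(prev)]


ring = gray_code(3)
print(ring)                        # [0, 1, 3, 2, 6, 7, 5, 4]
print([f"{g:03b}" for g in ring])  # ['000', '001', '011', '010', '110', '111', '101', '100']

# Ring position i is placed on hypercube node ring[i]; neighbors on the ring
# (including the wraparound pair) differ in exactly one bit.
print(all(bin(ring[i] ^ ring[(i + 1) % 8]).count("1") == 1 for i in range(8)))  # True
```

Because consecutive codes differ in one bit, every ring link maps onto a single hypercube link.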

Page 31

(a)

(0,0) 00 00   (0,1) 00 01   (0,2) 00 11   (0,3) 00 10
(1,0) 01 00   (1,1) 01 01   (1,2) 01 11   (1,3) 01 10
(2,0) 11 00   (2,1) 11 01   (2,2) 11 11   (2,3) 11 10
(3,0) 10 00   (3,1) 10 01   (3,2) 10 11   (3,3) 10 10

Processors in a column have identical two least-significant bits.
Processors in a row have identical two most-significant bits.

(b)

(0,0) 0 00   (0,1) 0 01   (0,2) 0 11   (0,3) 0 10
(1,0) 1 00   (1,1) 1 01   (1,2) 1 11   (1,3) 1 10

[3-D hypercube with node labels 000 to 111.]

Figure 2.31 (a) A 4 × 4 mesh illustrating the mapping of mesh nodes to the nodes in a four-dimensional hypercube; and (b) a 2 × 4 mesh embedded into a three-dimensional hypercube.
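The labels in both panels follow one rule: mesh node (i, j) is labeled with the Gray code of i followed by the Gray code of j. The sketch below reproduces that labeling; the names `gray` and `mesh_to_hypercube` and the `col_bits` parameter are assumptions made for illustration, and `i ^ (i >> 1)` is the standard closed form for the reflected Gray code.

```python
def gray(i):
    """i-th entry of the reflected Gray code."""
    return i ^ (i >> 1)


def mesh_to_hypercube(i, j, col_bits):
    """Hypercube label of mesh node (i, j): gray(i) followed by gray(j)."""
    return (gray(i) << col_bits) | gray(j)


# 4 x 4 mesh into a 4-D hypercube (two bits per coordinate).
print(f"{mesh_to_hypercube(2, 3, 2):04b}")  # 1110
print(f"{mesh_to_hypercube(3, 1, 2):04b}")  # 1001

# 2 x 4 mesh into a 3-D hypercube (one row bit, two column bits).
print(f"{mesh_to_hypercube(1, 2, 2):03b}")  # 111


def neighbors_stay_adjacent(rows, cols, col_bits):
    """Check that every mesh link maps onto a single hypercube link."""
    for i in range(rows):
        for j in range(cols):
            here = mesh_to_hypercube(i, j, col_bits)
            for ni, nj in ((i + 1, j), (i, j + 1)):
                if ni < rows and nj < cols:
                    there = mesh_to_hypercube(ni, nj, col_bits)
                    if bin(here ^ there).count("1") != 1:
                        return False
    return True


print(neighbors_stay_adjacent(4, 4, 2))  # True
print(neighbors_stay_adjacent(2, 4, 2))  # True
```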

Page 32

(a) Mapping a linear array into a 2D mesh (congestion 1).
(b) Inverting the mapping: mapping a 2D mesh into a linear array (congestion 5).

Figure 2.32 (a) Embedding a 16 node linear array into a 2-D mesh; and (b) the inverse of the mapping. Solid lines correspond to links in the linear array and normal lines to links in the mesh.

Page 33

[Figure panels: (a) P = 16; (b) P = 32.]

Figure 2.33 Embedding a hypercube into a 2-D mesh.

Page 34

[Figure panels: (a) CPU (1 GF); (b) Chip (32 GF); (c) Board (2 TF); (d) Tower (16 TF); (e) Blue Gene (1 PF).]

Figure 2.34 The hierarchical architecture of Blue Gene.

Page 35

[Figure panels (a) and (b). Labels: P; Control; Memory; Router.]

Figure 2.35 Interconnection network of the Cray T3E: (a) node architecture; (b) network topology.

Page 36

[Figure panels: 32 Processor Configuration; 128 Processor Configuration; 512 Processor Configuration. Labels: C-Brick; R-Brick; I/P/D/X Brick; Metarouter; Processor; Crossbar; Memory/Directory; To 4 C-Bricks; To 8 other R-Bricks; To meta-router; (16 processors); 128 processors; 1 R-Brick, 4 C-Bricks, and 16 processors at each vertex.]

Figure 2.36 Architecture of the SGI Origin 3000 family of servers.

Page 37

[Figure panels: Sun Ultra 6000 (6 - 30 processors); Starfire Ultra 1000 (up to 64 processors). Labels: System Board; Address bus; Four address buses; 32 byte data bus; 16 x 16 non-blocking crossbar.]

Figure 2.37 Architecture of the Sun Enterprise family of servers.

Page 38

[Figure labels: Host; Interface; Switch; Fiber to external network.]

Figure 2.38 A typical connection pattern for switches and hosts in a Myrinet. The figure also illustrates routing of messages between hosts. At any point of time, multiple pairs of processors with non-conflicting paths may communicate with each other.

Page 39

[Figure: node labels 000 to 111 on each side of the network.]

Figure 2.39 A Butterfly network with eight processing nodes.

Page 40

[Figure labels: switch; N, E, W, S.]

Figure 2.40 Switch connection patterns in a reconfigurable mesh.

Page 41

Figure 2.41 The construction of a 4 × 4 mesh of trees: (a) a 4 × 4 grid, (b) complete binary trees imposed over individual rows, (c) complete binary trees imposed over each column, and (d) the complete 4 × 4 mesh of trees.

Page 42

Figure 2.42 A 4 × 4 pyramidal mesh.