(a) Three different code fragments for adding a list of four numbers.

(i)
    1. load R1, @1000
    2. load R2, @1008
    3. add R1, @1004
    4. add R2, @100C
    5. add R1, R2
    6. store R1, @2000

(ii)
    1. load R1, @1000
    2. add R1, @1004
    3. add R1, @1008
    4. add R1, @100C
    5. store R1, @2000

(iii)
    1. load R1, @1000
    2. add R1, @1004
    3. load R2, @1008
    4. add R2, @100C
    5. add R1, R2
    6. store R1, @2000

(b) Execution schedule for code fragment (i) above. [Schedule of the six instructions against instruction cycles 0 to 8. Stage legend: IF: Instruction Fetch; ID: Instruction Decode; OF: Operand Fetch; E: Instruction Execute; WB: Write-back; NA: No Action.]

(c) Hardware utilization trace for schedule in (b). [Adder utilization by clock cycle (cycles 4 to 7 labeled), marking full issue slots, empty issue slots, horizontal waste, and vertical waste.]

Figure 2.1 Example of a two-way superscalar execution of instructions.
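
One way to see why the code fragments of Figure 2.1 behave differently on a superscalar processor is to measure their register dependency chains. The sketch below is illustrative only: it encodes the six-instruction fragment that uses R1 and R2 and the five-instruction fragment that uses only R1, and the names `FRAGMENT_I`, `FRAGMENT_II`, and `dependency_depth` are assumptions, not taken from the source. It looks only at read-after-write dependencies through registers and ignores memory dependencies and pipeline timing.

```python
# Each instruction: (text, registers read, register written or None).
FRAGMENT_I = [
    ("load R1, @1000", (), "R1"),
    ("load R2, @1008", (), "R2"),
    ("add R1, @1004", ("R1",), "R1"),
    ("add R2, @100C", ("R2",), "R2"),
    ("add R1, R2", ("R1", "R2"), "R1"),
    ("store R1, @2000", ("R1",), None),
]

FRAGMENT_II = [
    ("load R1, @1000", (), "R1"),
    ("add R1, @1004", ("R1",), "R1"),
    ("add R1, @1008", ("R1",), "R1"),
    ("add R1, @100C", ("R1",), "R1"),
    ("store R1, @2000", ("R1",), None),
]


def dependency_depth(fragment):
    """Length of the longest read-after-write chain through registers."""
    depth_of_last_writer = {}  # register -> depth of the instruction that last wrote it
    longest = 0
    for _text, reads, writes in fragment:
        # An instruction sits one level below the deepest value it reads.
        depth = 1 + max((depth_of_last_writer.get(r, 0) for r in reads), default=0)
        if writes is not None:
            depth_of_last_writer[writes] = depth
        longest = max(longest, depth)
    return longest


print(dependency_depth(FRAGMENT_I))   # 4: two independent load/add chains, then add, store
print(dependency_depth(FRAGMENT_II))  # 5: every instruction depends on the previous one
```

The shorter chain in the two-register fragment is what leaves room for two instructions to issue in the same cycle.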

[Figure: (a) Column major data access; (b) Row major data access.]

Figure 2.2 Multiplying a matrix with a vector: (a) multiplying column-by-column, keeping a running sum; (b) computing each element of the result as a dot product of a row of the matrix with the vector.
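
The two access patterns of Figure 2.2 can be sketched as follows. This is a minimal illustration on nested Python lists; the function names and the sample matrix are assumptions made for the example.

```python
def matvec_by_columns(A, b):
    """y = A b, built as a running sum of the columns of A scaled by entries of b."""
    n_rows, n_cols = len(A), len(b)
    y = [0] * n_rows
    for j in range(n_cols):        # visit column j of A ...
        for i in range(n_rows):    # ... and add b[j] * A[i][j] to the running sum
            y[i] += A[i][j] * b[j]
    return y


def matvec_by_rows(A, b):
    """y = A b, where each y[i] is the dot product of row i of A with b."""
    return [sum(a_ij * b_j for a_ij, b_j in zip(row, b)) for row in A]


A = [[1, 2, 3],
     [4, 5, 6]]
b = [1, 0, 2]

print(matvec_by_columns(A, b))  # [7, 16]
print(matvec_by_rows(A, b))     # [7, 16]
```

Both functions perform the same multiplications; they differ only in the order in which the entries of the matrix are visited.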

[Figure: (a) processing elements driven by a single global control unit and joined by an interconnection network; (b) processing elements, each with its own control unit, joined by an interconnection network. PE: Processing Element.]

Figure 2.3 A typical SIMD architecture (a) and a typical MIMD architecture (b).

[Figure: (a) the conditional statement

    if (B == 0)
        C = A;
    else
        C = A/B;

(b) Execution in two steps on Processors 0 to 3.
Initial values: Processor 0: A = 5, B = 0, C = 0; Processor 1: A = 4, B = 2, C = 0; Processor 2: A = 1, B = 1, C = 0; Processor 3: A = 0, B = 0, C = 0.
Step 1: Processors 0 and 3 execute C = A (C = 5 and C = 0); Processors 1 and 2 are idle.
Step 2: Processors 1 and 2 execute C = A/B (C = 2 and C = 1); Processors 0 and 3 are idle.]

Figure 2.4 Executing a conditional statement on an SIMD computer with four processors: (a) the conditional statement; (b) the execution of the statement in two steps.
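
The two-step execution in Figure 2.4 can be imitated with an activity mask. The sketch below is a sequential simulation, not SIMD code: the function name `simd_conditional` is an assumption, each loop stands for one lockstep instruction, and integer division is assumed for A/B.

```python
def simd_conditional(A, B):
    """Simulate `if (B == 0) C = A; else C = A/B;` on one processor per element."""
    n = len(A)
    C = [0] * n
    mask = [b == 0 for b in B]  # every processor evaluates the condition

    # Step 1: processors whose condition holds execute C = A; the rest idle.
    for pe in range(n):
        if mask[pe]:
            C[pe] = A[pe]
    after_step_1 = list(C)

    # Step 2: the remaining processors execute C = A/B; the first group idles.
    for pe in range(n):
        if not mask[pe]:
            C[pe] = A[pe] // B[pe]  # integer division assumed
    return after_step_1, C


step_1, final = simd_conditional(A=[5, 4, 1, 0], B=[0, 2, 1, 0])
print(step_1)  # [5, 0, 0, 0]
print(final)   # [5, 2, 1, 0]
```

Each processor is idle in exactly one of the two steps, which is why a conditional can cost an SIMD machine the time of both branches.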

[Figure: panels (a), (b), and (c), each showing processors (P) attached to an interconnection network, with memories (M) and caches (C).]

Figure 2.5 Typical shared-address-space architectures: (a) Uniform-memory-access shared-address-space computer; (b) Uniform-memory-access shared-address-space computer with caches and memories; (c) Non-uniform-memory-access shared-address-space computer with local memory only.

[Figure: a static network and an indirect network. Legend: processing node (P); switching element; network interface/switch.]

Figure 2.6 Classification of interconnection networks: (a) a static network; and (b) a dynamic network.

[Figure: (a) Processor 0 and Processor 1 connected to shared memory by address and data lines; (b) the same arrangement with a cache / local memory at each processor.]

Figure 2.7 Bus-based interconnects (a) with no local caches; (b) with local memory/caches.

[Figure: processing elements 0 to p-1 along one side and memory banks 0 to b-1 along the other, with a switching element at each crossing.]

Figure 2.8 A completely non-blocking crossbar network connecting p processors to b memory banks.

[Figure: processors 0 to p-1 connected to memory banks 0 to b-1 through a multistage interconnection network made of Stage 1, Stage 2, ..., Stage n.]

Figure 2.9 The schematic of a typical multistage interconnection network.

[Figure: inputs 0 to 7 (000 to 111) on the left and outputs 0 to 7 (000 to 111) on the right. Each output is labeled as the left rotation of the input connected to it:
    000 = left_rotate(000)
    001 = left_rotate(100)
    010 = left_rotate(001)
    011 = left_rotate(101)
    100 = left_rotate(010)
    101 = left_rotate(110)
    110 = left_rotate(011)
    111 = left_rotate(111)]

Figure 2.10 A perfect shuffle interconnection for eight inputs and outputs.
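
The left-rotate rule behind the perfect shuffle of Figure 2.10 can be sketched in a few lines. The function names are assumptions made for the example, and the number of inputs is assumed to be a power of two.

```python
def left_rotate(i, bits):
    """Rotate the `bits`-bit binary representation of i left by one position."""
    msb = (i >> (bits - 1)) & 1
    return ((i << 1) & ((1 << bits) - 1)) | msb


def perfect_shuffle(n_inputs):
    """Output index reached by each input 0 .. n_inputs-1 (n_inputs a power of two)."""
    if n_inputs < 2 or n_inputs & (n_inputs - 1):
        raise ValueError("n_inputs must be a power of two, at least 2")
    bits = n_inputs.bit_length() - 1
    return [left_rotate(i, bits) for i in range(n_inputs)]


print(perfect_shuffle(8))  # [0, 2, 4, 6, 1, 3, 5, 7]
```

Input 100, for example, is connected to output 001, which matches the label 001 = left_rotate(100) in the figure.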

Figure 2.11 Two switching configurations of the 2 × 2 switch: (a) Pass-through; (b) Cross-over.

[Figure: inputs 000 to 111 on the left and outputs 000 to 111 on the right.]

Figure 2.12 A complete omega network connecting eight inputs and eight outputs.

[Figure: inputs 000 to 111 on the left and outputs 000 to 111 on the right, with link AB marked.]

Figure 2.13 An example of blocking in omega network: one of the messages (010 to 111 or 110 to 100) is blocked at link AB.
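
The blocking in Figure 2.13 can be reproduced with a small model of omega-network routing. The sketch assumes the usual destination-tag scheme: each stage applies a perfect shuffle, and the switch then sets the lowest bit of the line number to the next destination bit, most significant bit first. The function names are illustrative, and the model numbers the output lines of each stage rather than naming the link AB.

```python
def omega_route(src, dst, bits):
    """Line number occupied by a message after each of the `bits` stages."""
    mask = (1 << bits) - 1
    pos = src
    lines = []
    for stage in range(bits):
        # Perfect shuffle between stages: rotate the line number left by one bit.
        pos = ((pos << 1) & mask) | (pos >> (bits - 1))
        # The 2 x 2 switch passes through or crosses over so that the lowest
        # bit equals the destination bit handled by this stage.
        dst_bit = (dst >> (bits - 1 - stage)) & 1
        pos = (pos & ~1) | dst_bit
        lines.append(pos)
    return lines


def first_conflict(route_a, route_b):
    """Return (stage, line) of the first shared output line, or None."""
    for stage, (a, b) in enumerate(zip(route_a, route_b), start=1):
        if a == b:
            return stage, a
    return None


route_1 = omega_route(0b010, 0b111, bits=3)
route_2 = omega_route(0b110, 0b100, bits=3)
print(route_1)                           # [5, 3, 7]
print(route_2)                           # [5, 2, 4]
print(first_conflict(route_1, route_2))  # (1, 5)
```

In this model both messages need output line 101 of the first stage, so only one of them can proceed at a time.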

Figure 2.14 (a) A completely-connected network of eight nodes; (b) a star connected network of nine nodes.

Figure 2.15 Linear arrays: (a) with no wraparound links; (b) with wraparound link.

Figure 2.16 Two and three dimensional meshes: (a) 2-D mesh with no wraparound; (b) 2-D mesh with wraparound link (2-D torus); and (c) a 3-D mesh with no wraparound.

[Figure: 0-D, 1-D, 2-D, 3-D, and 4-D hypercubes with binary node labels: 0 and 1; 00 to 11; 000 to 111; 0000 to 1111.]

Figure 2.17 Construction of hypercubes from hypercubes of lower dimension.
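
The recursive construction in Figure 2.17 can be sketched as follows: a d-dimensional hypercube is two copies of the (d-1)-dimensional one, with corresponding nodes joined. The function name and the edge representation (pairs of integer labels) are assumptions made for the example.

```python
def hypercube_edges(d):
    """Edges of a d-dimensional hypercube as (smaller_label, larger_label) pairs."""
    if d == 0:
        return set()  # a single node, no links
    lower = hypercube_edges(d - 1)
    high = 1 << (d - 1)  # the new most significant label bit
    edges = set(lower)                                   # copy whose labels start with 0
    edges |= {(u | high, v | high) for u, v in lower}    # copy whose labels start with 1
    edges |= {(u, u | high) for u in range(high)}        # join corresponding nodes
    return edges


edges = hypercube_edges(3)
print(len(edges))  # 12
# Two nodes are linked exactly when their labels differ in one bit.
print(all(bin(u ^ v).count("1") == 1 for u, v in edges))  # True
```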

[Figure legend: processing nodes; switching nodes.]

Figure 2.18 Complete binary tree networks: (a) a static tree network; and (b) a dynamic tree network.

Figure 2.19 A fat tree network of 16 processing nodes.

[Figure: four processing nodes (P) with partitions labeled A, B, and C.]

Figure 2.20 Bisection width of a dynamic network is computed by examining various equi-partitions of the processing nodes and selecting the minimum number of edges crossing the partition. In this case, each partition yields an edge cut of four. Therefore, the bisection width of this graph is four.

[Figure: panels (a) Invalidate and (b) Update, each showing memory and processors P0 and P1 before and after a write. Both processors execute load x (x = 1); P0 then executes write #3, x (x = 3).]

Figure 2.21 Cache coherence in multiprocessor systems: (a) Invalidate protocol; (b) Update protocol for shared variables.

[Figure: states Invalid, Shared, and Dirty; transition labels read, write, read/write, flush, C_read, and C_write.]

Figure 2.22 State diagram of a simple three-state coherence protocol.
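
A three-state protocol of the kind shown in Figure 2.22 can be written as a transition table. The arrows of the diagram did not survive extraction, so the table below is an assumption based on the usual invalidate-style protocol: `read` and `write` are actions of the local processor, while `C_read` and `C_write` are taken here to be coherence actions triggered by other processors. The names `TRANSITIONS`, `next_state`, and `run` are illustrative.

```python
# (current state, event) -> next state; unlisted pairs leave the state unchanged.
TRANSITIONS = {
    ("Invalid", "read"): "Shared",
    ("Invalid", "write"): "Dirty",
    ("Shared", "read"): "Shared",
    ("Shared", "write"): "Dirty",
    ("Shared", "C_write"): "Invalid",  # another processor wrote the line
    ("Dirty", "read"): "Dirty",
    ("Dirty", "write"): "Dirty",
    ("Dirty", "C_read"): "Shared",     # another processor read the line
    ("Dirty", "C_write"): "Invalid",
    ("Dirty", "flush"): "Invalid",
}


def next_state(state, event):
    """State of one cached copy after `event` (assumed transition table)."""
    return TRANSITIONS.get((state, event), state)


def run(events, state="Invalid"):
    """Trace of states visited while applying `events` in order."""
    trace = [state]
    for event in events:
        state = next_state(state, event)
        trace.append(state)
    return trace


print(run(["read", "write", "C_read", "C_write"]))
# ['Invalid', 'Shared', 'Dirty', 'Shared', 'Invalid']
```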

[Table; rows are in time order, top to bottom. S: Shared, D: Dirty, I: Invalid.]

Instruction at   Instruction at   Variables and their     Variables and their     Variables and their
Processor 0      Processor 1      states at Processor 0   states at Processor 1   states in Global mem.

                                                                                  x = 5, D; y = 12, D
read x           read y           x = 5, S                y = 12, S               x = 5, S; y = 12, S
x = x + 1        y = y + 1        x = 6, D                y = 13, D               x = 5, I; y = 12, I
read y           read x           y = 13, S; x = 6, S     x = 6, S; y = 13, S     x = 6, S; y = 13, S
x = x + y        y = x + y        x = 19, D; y = 13, I    y = 19, D; x = 6, I     x = 6, I; y = 13, I
x = x + 1        y = y + 1        x = 20, D               y = 20, D               x = 6, I; y = 13, I

Figure 2.23 Example of parallel program execution with the simple three-state coherence protocol discussed in Section ??.

[Figure: three processors, each with a cache, tags, and snoop hardware (Snoop H/W), attached with memory to a shared address/data bus; a Dirty label is also shown.]

Figure 2.24 A simple snoopy bus based cache coherence system.

[Figure: (a) processors with caches connected through an interconnection network to memory with a directory (data, state, presence bits); (b) processors with caches, each with memory and presence bits / state, connected through an interconnection network.]

Figure 2.25 Architecture of typical directory based systems: (a) a centralized directory; and (b) a distributed directory.

[Figure: three time diagrams for nodes P0, P1, P2, and P3. (a) A single message sent over a store-and-forward network. (b) The same message broken into two parts and sent over the network. (c) The same message broken into four parts and sent over the network.]

Figure 2.26 Passing a message from node P0 to P3 (a) through a store-and-forward communication network; (b) and (c) extending the concept to cut-through routing. The shaded regions represent the time that the message is in transit. The startup time associated with this message transfer is assumed to be zero.
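
The effect illustrated in Figure 2.26 can be estimated with a simple pipelining model. The sketch below is an assumption-laden illustration: `m` (message size in words) and `t_w` (per-word transfer time) are symbols introduced for the example, startup time is zero as in the caption, and per-hop header latency is ignored. Each part is stored and forwarded whole, so the last of `parts` pieces arrives after `links + parts - 1` piece times.

```python
def pipelined_transfer_time(m, links, parts, t_w=1.0):
    """Time to move m words over `links` hops when split into `parts` equal pieces.

    Assumes zero startup time and zero per-hop latency; each piece is received
    in full at a node before being forwarded on the next link.
    """
    piece_time = (m / parts) * t_w
    return (links + parts - 1) * piece_time


# P0 to P3 crosses three links.
for parts in (1, 2, 4):
    print(parts, pipelined_transfer_time(m=8, links=3, parts=parts))
# 1 24.0
# 2 16.0
# 4 12.0
```

As the number of parts grows, the total approaches the time to send the message over a single link, which is the idea that cut-through routing takes to the limit.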

[Figure: flits from messages 0, 1, 2, and 3 held in flit buffers at routers A, B, C, and D; arrows mark the desired direction of message traversal.]

Figure 2.27 An example of deadlock in a cut-through routing network.

[Figure: three copies of a three-dimensional hypercube (nodes 000 to 111) with Ps and Pd marked. Step 1 (010 to 110); Step 2 (110 to 111).]

Figure 2.28 Routing a message from node Ps (010) to node Pd (111) in a three-dimensional hypercube using E-cube routing.
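
The route in Figure 2.28 can be reproduced with a sketch of dimension-ordered routing, in which the bits where the current node and the destination differ are corrected one at a time. The function name `ecube_route` and the `msb_first` flag are assumptions. The default order follows the step labels in the figure (010 to 110, then 110 to 111); descriptions of E-cube routing often start from the least significant differing bit instead, which the flag also allows.

```python
def ecube_route(src, dst, bits, msb_first=True):
    """Nodes visited from src to dst, fixing one differing label bit per hop."""
    dims = range(bits - 1, -1, -1) if msb_first else range(bits)
    node = src
    path = [node]
    for d in dims:
        if ((node ^ dst) >> d) & 1:  # labels differ in dimension d
            node ^= 1 << d           # cross the link along dimension d
            path.append(node)
    return path


path = ecube_route(0b010, 0b111, bits=3)
print([format(n, "03b") for n in path])  # ['010', '110', '111']

alt = ecube_route(0b010, 0b111, bits=3, msb_first=False)
print([format(n, "03b") for n in alt])   # ['010', '011', '111']
```

Either order uses as many hops as there are differing bits, here two.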

[Figure: (a) nodes numbered 1 to 16 in rows of four; (b) processes a to p arranged in rows of four; (c) processes placed on the nodes in the same order; (d) processes placed on the nodes in a scrambled order.]

Figure 2.29 Impact of process mapping on performance: (a) underlying architecture; (b) processes and their interactions; (c) an intuitive mapping of processes to nodes; and (d) a random mapping of processes to nodes.

[Figure: (a) reflected Gray codes, each built by reflecting the previous code along a line. 1-bit Gray code: 0, 1. 2-bit Gray code: 00, 01, 11, 10. 3-bit Gray code: 000, 001, 011, 010, 110, 111, 101, 100. Positions 0 to 7 of the 8-processor ring therefore map to 3-D hypercube nodes 0, 1, 3, 2, 6, 7, 5, 4. (b) The ring drawn on hypercube nodes 000 to 111.]

Figure 2.30 (a) A three-bit reflected Gray code ring; and (b) its embedding into a three-dimensional hypercube.
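
The reflection step of Figure 2.30 can be sketched recursively: the d-bit code is the (d-1)-bit code followed by its mirror image with a leading 1. The function names are assumptions made for the example.

```python
def reflected_gray_code(bits):
    """List of 2**bits integers in reflected Gray code order."""
    if bits == 0:
        return [0]
    prev = reflected_gray_code(bits - 1)
    # Original list keeps a leading 0; the reflected list gets a leading 1.
    return prev + [(1 << (bits - 1)) | g for g in reversed(prev)]


def is_ring_embedding(code):
    """True if consecutive entries (with wraparound) differ in exactly one bit."""
    n = len(code)
    return all(bin(code[i] ^ code[(i + 1) % n]).count("1") == 1 for i in range(n))


code = reflected_gray_code(3)
print(code)                     # [0, 1, 3, 2, 6, 7, 5, 4]
print(is_ring_embedding(code))  # True
```

Ring position i is placed on hypercube node `code[i]`, so every ring link, including the wraparound link, lies on a hypercube link.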

[Figure: (a) 4 × 4 mesh; each node (i, j) is shown with its hypercube label.

    (0,0) 00 00    (0,1) 00 01    (0,2) 00 11    (0,3) 00 10
    (1,0) 01 00    (1,1) 01 01    (1,2) 01 11    (1,3) 01 10
    (2,0) 11 00    (2,1) 11 01    (2,2) 11 11    (2,3) 11 10
    (3,0) 10 00    (3,1) 10 01    (3,2) 10 11    (3,3) 10 10

Processors in a column have identical two least-significant bits. Processors in a row have identical two most-significant bits.

(b) 2 × 4 mesh, drawn on hypercube nodes 000 to 111.

    (0,0) 0 00    (0,1) 0 01    (0,2) 0 11    (0,3) 0 10
    (1,0) 1 00    (1,1) 1 01    (1,2) 1 11    (1,3) 1 10]

Figure 2.31 (a) A 4 × 4 mesh illustrating the mapping of mesh nodes to the nodes in a four-dimensional hypercube; and (b) a 2 × 4 mesh embedded into a three-dimensional hypercube.
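
The labels in Figure 2.31 follow from concatenating the Gray code of the row index with the Gray code of the column index. The sketch below assumes mesh dimensions that are powers of two; the function names and the `col_bits` parameter (number of label bits used for the column) are introduced for the example.

```python
def gray(i):
    """i-th entry of the reflected Gray code, as an integer."""
    return i ^ (i >> 1)


def mesh_to_hypercube(i, j, col_bits):
    """Hypercube label of mesh node (i, j): Gray(i) followed by Gray(j)."""
    return (gray(i) << col_bits) | gray(j)


# 4 x 4 mesh into a four-dimensional hypercube.
print(format(mesh_to_hypercube(2, 3, col_bits=2), "04b"))  # 1110
# 2 x 4 mesh into a three-dimensional hypercube.
print(format(mesh_to_hypercube(1, 2, col_bits=2), "03b"))  # 111
```

Because neighboring Gray code entries differ in one bit, neighboring mesh nodes land on neighboring hypercube nodes.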

(a) Mapping a linear array into a 2D mesh (congestion 1).

(b) Inverting the mapping: mapping a 2D mesh into a linear array (congestion 5).

Figure 2.32 (a) Embedding a 16 node linear array into a 2-D mesh; and (b) the inverse of the mapping. Solid lines correspond to links in the linear array and normal lines to links in the mesh.

[Figure: (a) P = 16; (b) P = 32.]

Figure 2.33 Embedding a hypercube into a 2-D mesh.

[Figure: (a) CPU (1 GF); (b) Chip (32 GF); (c) Board (2 TF); (d) Tower (16 TF); (e) Blue Gene (1 PF).]

Figure 2.34 The hierarchical architecture of Blue Gene.

[Figure: (a) node with processor (P), control, memory, and router; (b) network topology.]

Figure 2.35 Interconnection network of the Cray T3E: (a) node architecture; (b) network topology.

[Figure labels: 128 Processor Configuration; 1 R-Brick, 4 C-Bricks, and 16 processors at each vertex; Processor; (16 processors); R-Brick; To meta-router; To 8 other R-Bricks; To 4 C-Bricks; C-Brick; Metarouter; 128 processors; Crossbar; Memory/Directory; I/P/D/X Brick.]

Figure 2.36 Architecture of the SGI Origin 3000 family of servers.

[Figure labels: Starfire Ultra 1000 (up to 64 processors); Sun Ultra 6000 (6 - 30 processors); System Board; 16 x 16 non-blocking crossbar; Four address buses; Address bus; 32 byte data bus.]

Figure 2.37 Architecture of the Sun Enterprise family of servers.

[Figure: hosts, each with an interface, attached to interconnected switches; one link is labeled Fiber to external network.]

Figure 2.38 A typical connection pattern for switches and hosts in a Myrinet. The figure also illustrates routing of messages between hosts. At any point of time, multiple pairs of processors with non-conflicting paths may communicate with each other.

[Figure: nodes 000 to 111 on the left and nodes 000 to 111 on the right.]

Figure 2.39 A Butterfly network with eight processing nodes.

[Figure labels: switch with ports N, E, W, and S.]

Figure 2.40 Switch connection patterns in a reconfigurable mesh.

Figure 2.41 The construction of a 4 × 4 mesh of trees: (a) a 4 × 4 grid, (b) complete binary trees imposed over individual rows, (c) complete binary trees imposed over each column, and (d) the complete 4 × 4 mesh of trees.

Figure 2.42 A 4 × 4 pyramidal mesh.
