
CS/ECE 757: Advanced Computer Architecture II (Parallel Computer Architecture)

Scalable Multiprocessors (Chapter 7)

Copyright 2001 Mark D. Hill, University of Wisconsin-Madison

Slides are derived from work by Sarita Adve (Illinois), Babak Falsafi (CMU), Alvy Lebeck (Duke), Steve Reinhardt (Michigan), and J. P. Singh (Princeton). Thanks!

(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 2

Outline

• Motivation

• Network Transaction Primitive

• Supporting Programming Models

• Case Studies


(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 3

Scalable Systems

• Buses support 8 to 16 processors well

• What about scaling beyond 16 processors?

• What about cost?

• How do we build larger systems that are cost-effective?

(c) 2003, J. E. Smith 4

Scaling

• Bandwidth Scaling
  – should increase proportional to # procs.
  – crossbars
  – multi-stage nets

• Latency Scaling
  – ideally remains constant
  – in practice, log n scaling can be achieved
  – local memories help
  – latency may be dominated by overheads, anyway

• Cost Scaling
  – low overhead cost (most cost incremental)
  – in contrast to many large SMPs

• Physical Scaling
  – loose vs. dense packing
  – chip-level integration vs. commodity parts


(c) 2003, J. E. Smith 5

Scalable Systems

• Generic NUMA System

[Figure: generic NUMA system (copyright 1999 Morgan Kaufmann Publishers, Inc.)]

(c) 2003, J. E. Smith 6

Scalable Systems

• Dancehall System

[Figure: dancehall system (copyright 1999 Morgan Kaufmann Publishers, Inc.)]


(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 7

Scalable Systems

• Cluster of workstation/server-like nodes
• Commodity System-Area/Local-Area Network: switches/fiber
• What are the hardware/software issues?

[Figure: cluster of nodes, each with a processor (P), cache ($), memory (Mem), and network interface (NI), connected by a network]

(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 8

Node Communication Architecture

• How does a processor communicate with others?
• Is there processor/memory support for communication?
• Where does the network interface sit?

[Figure: node with processor (P), cache ($), and memory; the NI can sit on the memory bus or on the I/O bus]


(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 9

Communication Architecture Alternatives

• NI on I/O bus: Berkeley NOW, Wisconsin COW
• NI on memory bus: Intel Paragon, TMC CM-5
• Support for protected messaging: Princeton SHRIMP
• TLB support for shared address space: Cray T3D, T3E
• Full support for messaging: MIT J-Machine, nCube/2
• Full support for coherent shared memory: Alpha 21364

(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 10

Network Transaction Primitive

• One-way transfer of information from source to destination

• causes some action at the destination
  – process info and/or deposit in buffer
  – state change (e.g., set flag, interrupt program)
  – maybe initiate reply (separate network transaction)

[Figure: a serialized message leaves the source node's output buffer, crosses the communication network, and arrives in the destination node's input buffer]


(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 11

Node-to-Node Architecture

• Scalable architectures use point-to-point messaging
• Compare with SMP bus
• Communication architecture issues:
  – Protection
  – Format
  – Output buffering
  – Input buffering
  – Media arbitration & flow control
  – Destination name & routing
  – Action
  – Completion detection
  – Deadlock avoidance
  – Delivery guarantees

(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 12

Communication Architecture

• Protection:
  – larger and more loosely coupled components
  – multiple protection domains
  – How do you isolate faulty hardware or software?

• Format:
  – message descriptor and length
  – flexibility in the contents depending on the communication model
  – may support a simple protocol stack
    » user information wrapped in communication information
  – e.g., shared memory messages: commands vs. data
  – e.g., Active Messages: src, dest, PC, args


(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 13

Communication Architecture

• Buffering required:
  – when there is a mismatch between arrival/departure rates

• Output buffers:
  – data from processor/memory going to the network
  – can use a special buffer
  – buffer can be in main memory
  – what happens when output buffers fill up?

• Input buffers:
  – incoming data from other procs
  – may be picked up by the proc or sent to memory
  – what happens when input buffers fill up?

(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 14

Communication Architecture

• Media arbitration & flow control:
  – distributed among the network switches
  – packets traverse multiple switches
  (flow control discussed later)

• Destination name and routing:
  – the source must specify a destination name and routing
  – e.g., shared memory: a physical memory address & a proc id
  (routing discussed later)

• Action:
  – may be a memory write or read, or executing code
  – may also send a response


(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 15

Communication Architecture

• Completion detection:
  – sender must know the message was received
  – receiver must know there is a message

• Transaction ordering:
  – do all transactions from source to dest arrive in order?
  – must distinguish out-of-order routing from out-of-order delivery
  – e.g., what happens to shared-memory consistency if out of order?

(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 16

Communication Architecture

• Deadlock avoidance:
  – switches require buffer space for routing
  – must manage buffers carefully to avoid deadlocks
  – end-to-end: sender must always receive while sending!

• Delivery guarantees:
  – when input buffers fill up:
    » drop message and resend
    » flow control: buffers back up into the network


(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 17

Outline

• Motivation

• Network Transaction Primitive

• Supporting Programming Models

• Case Studies

(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 18

Shared Address Space vs Message Passing

• Do not confuse model/abstraction with system implementation:

[Figure: programming interfaces (Message Passing, e.g., DBMS, MPI; Data Parallel, e.g., HPF; Shared Address Space, e.g., Pthreads, PARMACS) can each be implemented on any of the systems (H/W Shared Memory, e.g., Intel SMP, SGI Origin; Network of PCs, e.g., NOW, COW; MP Supercomputer, e.g., Cray T3E)]
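To make the distinction concrete, here is a minimal sketch (not from the slides) of moving one value under each abstraction; it assumes some hardware or runtime makes shared_x visible everywhere, and uses MPI to stand in for the message-passing side.

    #include <mpi.h>

    /* Shared address space: communication is implicit in ordinary loads and
       stores to shared data (synchronization is still needed, omitted here). */
    double shared_x;                          /* visible to every thread/processor */
    void   produce(void) { shared_x = 3.14; }
    double consume(void) { return shared_x; }

    /* Message passing: the same value must be moved by an explicit send/receive pair. */
    void move_x(int rank) {
        double x = 3.14;
        if (rank == 0)
            MPI_Send(&x, 1, MPI_DOUBLE, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(&x, 1, MPI_DOUBLE, /*source=*/0, /*tag=*/0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }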


(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 19

Shared Address Space Requirements

• Shared address space on a distributed memory computer

• Two-way request/response protocol:
  – basic: read a block, write a block, and block ownership
  – more complicated protocol optimizations on top of basic

[Figure: a load of X at the source becomes a request network transaction; the source waits while the destination accesses X and returns it in a response transaction]

(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 20

Shared Address Space Issues

• Coherent or non-coherent:
  – fully cache-coherent: similar to an SMP
  – coherence granularity: cache block, page, etc.
  – remote shared accesses only: Cray T3D

• Does it allow overlapping multiple transactions?
  – consider memory consistency models: chapter 9

• Implemented in H/W or S/W:
  – for S/W systems, H/W provides messaging
  – H/W or S/W can provide protection, flow control, etc.


(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 21

Shared Address Space Issues

• Assume implemented in H/W
  – issues similar to SMPs (see chapter 8)
  – protection primarily enforced by virtual memory
  – data address: sender & receiver NIs know what to do

• Coherent:
  – block transfers from cache/memory to network
  – H/W protocol implements coherence

• Non-coherent:
  – single/double word reads and writes from proc to network

• Support for larger transfers using DMA H/W
  – efficient block moves

(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 22

Message Passing Requirements

• Messaging abstraction on a distributed memory computer

• Conceptually a one-way communication:
  – sender sends (variable-sized) message
  – receiver receives

• But the implementation is more complicated

• Asynchronous vs synchronous messaging (see the sketch below)
  – synchronous: sender waits until reception is assured
  – asynchronous: sender sends and goes back to computation
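A minimal sketch of the two flavors using standard MPI calls (not from the slides): MPI_Ssend does not complete until the matching receive has started, while MPI_Isend returns immediately and the sender must check completion later.

    #include <mpi.h>

    /* Synchronous send: returns only after the receiver has begun the matching
       receive, so the sender knows reception is assured. */
    void send_sync(double *buf, int n, int dest) {
        MPI_Ssend(buf, n, MPI_DOUBLE, dest, /*tag=*/0, MPI_COMM_WORLD);
    }

    /* Asynchronous send: returns immediately; the sender goes back to computation
       and must call MPI_Wait on req before reusing buf. */
    void send_async(double *buf, int n, int dest, MPI_Request *req) {
        MPI_Isend(buf, n, MPI_DOUBLE, dest, /*tag=*/0, MPI_COMM_WORLD, req);
    }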


(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 23

Message Passing Requirements

• Synchronous messaging

[Figure: three-phase synchronous protocol — the sender issues a send request (send X, dest) and waits; the destination performs a tag check and, when the receiver is ready, returns a ready message; the sender then performs the data transfer]

(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 24

Message Passing Issues

• Sender must specify where the data is:
  – typically a contiguous block of memory
  – can optimize for direct transfers from data structures: gather/scatter (see the sketch below)

• Proc or NI must transfer the data from memory to NI buffers
  – what about NI protection?
  – can send through the OS, but that requires multiple data copies
  – does the NI have access to application memory?

• Unless there is support for low-overhead messaging → send large messages to amortize overheads
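For example, a strided piece of a data structure can be described to the messaging layer instead of being packed into a contiguous buffer first; a minimal MPI sketch (not from the slides):

    #include <mpi.h>

    #define N 1024
    double a[N][N];

    /* Send column 3 of the matrix without copying it into a contiguous buffer:
       N blocks of 1 double, spaced N doubles apart. */
    void send_column(int dest) {
        MPI_Datatype column;
        MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);
        MPI_Send(&a[0][3], 1, column, dest, /*tag=*/0, MPI_COMM_WORLD);
        MPI_Type_free(&column);
    }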


(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 25

Message Passing Requirements

• Asynchronous messaging: (optimistic)

[Figure: one-phase optimistic protocol — the sender sends (send X, dest) and goes back to computation; the destination performs a tag check and, if no matching receive is posted, buffers the message]

• What are the problems with this approach?

(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 26

Message Passing Issues

• Performing the tag match in software to find where the data should go:
  – slow process
  – can always provide a buffer for later processing (see the sketch below)
  – how much buffering is enough?

• Data typically arrives at a high rate for large messages

• What if multiple senders are sending to one receiver?
  – need flow control, or messages may be dropped
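A minimal sketch of receive-side tag matching with an unexpected-message queue (all names hypothetical; a real library also keeps a posted-receive queue and bounds the buffering):

    #include <stdlib.h>
    #include <string.h>

    struct msg { int tag; size_t len; char *payload; struct msg *next; };
    static struct msg *unexpected_q;        /* arrivals with no matching receive yet */

    /* Called when a packet arrives: if no matching receive has been posted,
       copy the data into a buffer for later processing. */
    void on_arrival(int tag, const char *data, size_t len) {
        struct msg *m = malloc(sizeof *m);
        m->tag = tag;
        m->len = len;
        m->payload = malloc(len);           /* how much buffering is enough? */
        memcpy(m->payload, data, len);
        m->next = unexpected_q;
        unexpected_q = m;
    }

    /* Called by the application's receive: walk the queue looking for a tag match. */
    struct msg *match_receive(int tag) {
        for (struct msg **p = &unexpected_q; *p; p = &(*p)->next) {
            if ((*p)->tag == tag) {
                struct msg *m = *p;
                *p = m->next;               /* unlink and hand to the application */
                return m;
            }
        }
        return NULL;                        /* message has not arrived yet */
    }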


(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 27

Message Passing Requirements

• Asynchronous messaging: (conservative)

[Figure: the sender issues a send request (send X, dest) and goes back to computation; the destination performs a tag check and, if the receive is not yet posted, waits; once it is, a ready message returns to the sender and the data transfer completes the receive]

(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 28

Active Messages

• User-level analog of a network transaction
  – invoke a handler function at the receiver to extract the packet from the network
  – grew out of attempts to do dataflow programming on msg-passing machines
  – handler may send a reply, but no other messages

• May also perform memory-to-memory transfer

[Figure: a request message invokes a handler at the destination; that handler may issue a reply, which invokes a handler back at the requester]
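A minimal sketch of the dispatch (hypothetical handler-table scheme, not any particular AM library): the message names a handler plus arguments, and the receiver runs that handler to drain the packet immediately.

    #include <stdint.h>

    /* A handler receives the source node and the arguments carried in the message. */
    typedef void (*am_handler_t)(int src_node, uint64_t arg0, uint64_t arg1);

    /* Example handler: deposit a value at a local address supplied by the sender.
       A handler may send a reply, but no other messages. */
    void deposit_handler(int src_node, uint64_t local_addr, uint64_t value) {
        (void)src_node;
        *(uint64_t *)(uintptr_t)local_addr = value;
    }

    /* The message carries an index into this table rather than a raw PC, so the
       receiver can validate it before dispatching. */
    static am_handler_t handler_table[] = { deposit_handler };

    /* Called for each arriving packet: vector to the named handler so the packet
       is extracted from the network right away (no buffering or scheduling). */
    void on_packet(int src_node, unsigned handler_idx, uint64_t a0, uint64_t a1) {
        if (handler_idx < sizeof handler_table / sizeof handler_table[0])
            handler_table[handler_idx](src_node, a0, a1);
    }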


(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 29

Outline

• Motivation

• Network Transaction Primitive

• Supporting Programming Models

• Case Studies

(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 30

Massively Parallel Processor (MPP) Architectures

• Network interface typically close to the processor
  – Memory bus:
    » locked to a specific processor architecture/bus protocol
  – Registers/cache:
    » only in research machines

• Time-to-market is long → use an available processor or work closely with processor designers

• Maximize performance and cost

[Figure: MPP node — processor and cache sit on the memory bus with main memory and the network interface; an I/O bridge connects the I/O bus with a disk controller and disks; the NI connects to the network]


(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 31

Network of Workstations

• Network interface on the I/O bus

• Standards (e.g., PCI) → longer life, faster to market

• Slow (microseconds) to access the network interface

• "System Area Network" (SAN): between LAN & MPP

[Figure: workstation node — processor, cache, main memory, and core chip set on the memory bus; the I/O bus holds the disk controller, graphics controller, and network interface, which raises interrupts and connects to the network]

(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 32

Transaction Interpretation

• Simplest MP: assist doesn't interpret
  – DMA from/to buffer, interrupt or set flag on completion

• User-level messaging: get the OS out of the way
  – assist does protection checks to allow direct user access to the network
  – may have minimal interpretation otherwise

• Virtual DMA: get the CPU out of the way (maybe)
  – basic protection plus address translation: user-level bulk DMA

• Global physical address space (NUMA): everything in hardware
  – complexity increases, but performance does too (if done right)

• Cache coherence: even more so
  – stay tuned


(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 33

Spectrum of Designs

• None: Physical bit stream– physical DMA nCUBE, iPSC, . . .

• User/System– User-level port CM-5, *T– User-level handler J-Machine, Monsoon, . . .

• Remote virtual address– Processing, translation Paragon, Meiko CS-2, Myrinet

Memory Channel, SHRIMP

• Global physical address– Proc + Memory controller RP3, BBN, T3D, T3E

• Cache-to-cache (later)– Cache controller Dash, KSR, Flash

Incr

easi

ng

HW

Su

pp

ort

, Sp

ecia

lizat

ion

, In

tru

sive

nes

s, P

erfo

rman

ce (

??

?)

(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 34

Net Transactions: Physical DMA

• Physical addresses: OS must initiate transfers
  – system call per message on both ends: very high overhead

• Sending OS copies data to kernel buffer w/ header/trailer
  – can avoid copy if interface does scatter/gather

• Receiver copies packet into OS buffer, then interprets
  – user message then copied (or mapped) into user space

[Figure: physical DMA — the sending processor posts a command (dest, data, addr, length, rdy) for the DMA channels; at the receiver, DMA deposits the message into memory and signals status or an interrupt]


(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 35

Conventional LAN Network Interface

[Figure: conventional LAN NIC on the I/O bus — the NIC controller contains DMA registers (addr, len), transmit (TX) and receive (RX) paths, and walks rings of descriptors (addr, len, status, next) in host memory to move packet data]

(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 36

User Level Ports

• Map network hardware into the user's address space
  – talk directly to the network via loads & stores (see the sketch below)

• User-to-user communication without OS intervention: low latency

• Protection: user/user & user/system

• DMA difficult... CPU involvement (copying) becomes the bottleneck

[Figure: user-level port — each node's processor and memory connect to NI queues split into user and system levels; the sender stores dest and data directly, and the receiver sees status or an interrupt]
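A minimal sketch of what "loads & stores" to such a port might look like (the register layout and names are hypothetical, not any particular machine's):

    #include <stdint.h>

    /* NI registers and queues mapped into the user's address space by the OS. */
    struct ni_port {
        volatile uint64_t status;        /* e.g., bit 0 = output queue full */
        volatile uint64_t out_dest;      /* destination node */
        volatile uint64_t out_data[4];   /* small-message payload */
        volatile uint64_t out_send;      /* writing here injects the message */
        volatile uint64_t in_ready;      /* nonzero when a message is waiting */
        volatile uint64_t in_data[4];
    };

    /* Send is just a sequence of ordinary stores: no system call. */
    static inline void ni_send(struct ni_port *ni, uint64_t dest,
                               const uint64_t msg[4]) {
        while (ni->status & 1)           /* spin while the output queue is full */
            ;
        ni->out_dest = dest;
        for (int i = 0; i < 4; i++)
            ni->out_data[i] = msg[i];
        ni->out_send = 1;
    }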


(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 37

User Level Network Ports

• Appears to the user as logical message queues plus status
• What happens if the user never pops the input queue?

[Figure: the user's virtual address space contains the net output port, net input port, and a status word, alongside the usual program counter and registers]

(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 38

Example: CM-5

• Input and output FIFO for each network
• Two data networks
• Save/restore network buffers on context switch

[Figure: CM-5 — processing partitions, control processors, and an I/O partition connect to separate data, control, and diagnostics networks; each node couples a SPARC processor, FPU, cache ($ ctrl, $ SRAM), and DRAM (with vector units in the DRAM controllers) to an NI on the MBUS]


(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 39

User Level Handlers

• Hardware support to vector to the address specified in the message
  – message ports in registers
  – alternate register set for handler?

• Examples: J-Machine, Monsoon, *T (MIT), iWARP (CMU)

[Figure: user-level handler — the message carries dest, data, and an address (the handler to run); nodes exchange such messages directly at user/system level]

(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 40

J-Machine

• Each node is a small message-driven processor

• HW support to queue msgs and dispatch to a msg handler task


(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 41

Dedicated Message Processing Without Specialized Hardware

• Microprocessor performs arbitrary output processing (at system level)
• Microprocessor interprets incoming network transactions (in system)
• User Processor <–> Msg Processor share memory
• Msg Processor <–> Msg Processor via system network transaction

[Figure: each node pairs a user processor (P) and a message processor (MP) sharing memory and an NI; the destination is reached through system-level network transactions between the MPs]

(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 42

Levels of Network Transaction

• User Processor stores cmd / msg / data into shared output queue
  – must still check for output queue full (or grow it dynamically)

• Communication assists make the transaction happen
  – checking, translation, scheduling, transport, interpretation

• Avoids system call overhead

• Multiple bus crossings are the likely bottleneck

[Figure: same node organization as on the previous slide — the user processor deposits into the shared queue and the message processors move the transaction across the network to the destination]


(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 43

Example: Intel Paragon

[Figure: Intel Paragon node — i860 XP processors (50 MHz, 16 KB 4-way caches, 32 B blocks, MESI) and memory on a 400 MB/s bus; the NI provides send/receive DMA (sDMA/rDMA) with 2048 B buffering and 175 MB/s duplex network links; packets carry a route, variable data, and an EOP handled by the MP handler; I/O and service nodes attach devices]

(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 44

Dedicated MP w/ Specialized NI: Meiko CS-2

• Integrate the message processor into the network interface
  – active-messages-like capability
  – dedicated threads for DMA, reply handling, simple remote memory access
  – supports user-level virtual DMA
    » own page table
    » can take a page fault, signal the OS, restart
      • meanwhile, nack the other node

• Problem: processor is slow, time-slices threads
  – fundamental issue with building your own CPU


(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 45

Myricom Myrinet (Berkeley NOW)

• Programmable network interface on the I/O bus (Sun SBus or PCI)
  – embedded custom CPU ("LANai", ~40 MHz RISC CPU)
  – 256 KB SRAM
  – 3 DMA engines: to network, from network, to/from host memory

• Downloadable firmware executes in kernel mode
  – includes source-based routing protocol

• SRAM pages can be mapped into user space
  – separate pages for separate processes
  – firmware can define status words, queues, etc.
    » data for short messages or pointers for long ones
    » firmware can do address translation too... w/ OS help
  – poll to check for sends from user

• Bottom line: I/O bus still the bottleneck, CPU could be faster

(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 46

Shared Physical Address Space

• Implement the SAS model in hardware w/o caching
  – actual caching must be done by copying from remote memory to local
  – programming paradigm looks more like message passing than Pthreads
    » yet low-latency & low-overhead transfers thanks to HW interpretation; high bandwidth too if done right
    » result: great platform for MPI & compiled data-parallel codes

• Implementation: (see the sketch below)
  – "pseudo-memory" acts as the memory controller for remote memory, converting accesses into network transactions (requests)
  – "pseudo-CPU" on the remote node receives requests, performs them on local memory, and sends replies
  – split-transaction or retry-capable bus required (or dual-ported memory)
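A minimal software rendering of that division of labor (all names hypothetical; net_send and net_recv_blocking stand in for whatever transport the hardware provides):

    #include <stdint.h>

    enum { REQ_READ, REQ_WRITE };
    struct net_req { int type; uint64_t addr; uint64_t data; };
    struct net_rsp { uint64_t data; };

    /* Hypothetical transport primitives supplied by the NI. */
    void net_send(int node, const void *pkt, int len);
    void net_recv_blocking(void *pkt, int len);

    /* "Pseudo-memory": turns an access that falls in a remote node's part of the
       physical address space into a request transaction and awaits the reply. */
    uint64_t remote_read(int home_node, uint64_t addr) {
        struct net_req rq = { REQ_READ, addr, 0 };
        struct net_rsp rsp;
        net_send(home_node, &rq, sizeof rq);
        net_recv_blocking(&rsp, sizeof rsp);   /* the bus must not be held while waiting */
        return rsp.data;
    }

    /* "Pseudo-CPU": on the home node, performs each incoming request on local
       memory and sends the reply transaction. */
    void pseudo_cpu_service(int src_node, struct net_req *rq) {
        if (rq->type == REQ_READ) {
            struct net_rsp rsp = { *(volatile uint64_t *)(uintptr_t)rq->addr };
            net_send(src_node, &rsp, sizeof rsp);
        } else {
            *(volatile uint64_t *)(uintptr_t)rq->addr = rq->data;
        }
    }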


(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 47

Example: Cray T3D

• Up to 2,048 Alpha 21064s
  – no off-chip L2, to avoid its inherent latency

• In addition to remote mem ops, includes:
  – prefetch buffer (hide remote latency)
  – DMA engine (requires OS trap)
  – synchronization operations (swap, fetch&inc, global AND/OR)
  – message queue (requires OS trap on receiver)

• Big problem: physical address space
  – 21064 supports only 32 bits
  – a 2K-node machine is limited to 2 MB per node
  – external "DTB annex" provides segment-like registers for extended addressing, but management is expensive & ugly

(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 48

Cray T3E

• Similar to T3D, but uses the Alpha 21164 instead of the 21064 (on-chip L2)
  – still has physical address space problems

• E-registers for remote communication and synchronization
  – 512 user, 128 system; 64 bits each
  – replace/unify the DTB annex, prefetch queue, block transfer engine, remote load/store, and message queue
  – address specifies source or destination E-register and command
  – data contains a pointer to a block of 4 E-regs and an index for the centrifuge

• Centrifuge
  – supports data distributions used in data-parallel languages (HPF)
  – 4 E-regs for a global memory operation: mask, base, two arguments

• Get & Put operations


(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 49

T3E (continued)

• Atomic memory operations
  – E-registers & centrifuge used
  – F&I, F&Add, Compare&Swap, Masked_Swap

• Messaging
  – arbitrary number of queues (user or system)
  – 64-byte messages
  – create a msg queue by storing a message control word to a memory location

• Msg send
  – construct data in an aligned block of 8 E-regs
  – send like a put, but the dest must be a message control word
  – processor is responsible for queue space (buffer management)

• Barrier and Eureka synchronization
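The programming interface that grew out of this get/put hardware survives as SHMEM; a minimal sketch using OpenSHMEM names (the original Cray library differs in detail, so this is illustrative rather than T3E-specific):

    #include <shmem.h>

    long counter = 0;                     /* symmetric: same address on every PE */
    long buf[256], copy_of_zero[256];

    int main(void) {
        shmem_init();
        int me = shmem_my_pe();

        shmem_putmem(buf, buf, sizeof buf, 0);            /* put: one-sided write into PE 0's buf */
        shmem_getmem(copy_of_zero, buf, sizeof buf, 0);   /* get: one-sided read of PE 0's buf */

        /* atomic fetch&add on PE 0's counter (cf. the atomic memory operations above) */
        long old = shmem_long_atomic_fetch_add(&counter, 1, 0);

        shmem_barrier_all();              /* cf. barrier synchronization */
        (void)me; (void)old;
        shmem_finalize();
        return 0;
    }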

(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 50

DEC Memory Channel (Princeton SHRIMP)

• Reflective memory

• Writes on the sender appear in the receiver's memory
  – send & receive regions
  – page control table

• Receive region is pinned in memory

• Requires duplicate writes; really just message buffers

[Figure: a virtual page on the sender maps to a physical send region whose writes are reflected into the physical receive region mapped into the receiver's virtual address space]
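A minimal sketch of how such regions are used once mapped (the mapping setup goes through the Memory Channel API and page control table, which is not shown; all names here are hypothetical):

    #include <stdint.h>

    volatile uint64_t *send_region;   /* mapped so that stores are reflected to the receiver */
    volatile uint64_t *recv_region;   /* pinned local memory where remote writes land */

    /* Sender: stores to the send region go out over the network rather than to
       useful local memory, so a sender that also wants the data locally must
       write it twice ("duplicate writes"); the regions act as message buffers. */
    void post_message(const uint64_t msg[8]) {
        for (int i = 1; i < 8; i++)
            send_region[i] = msg[i];
        send_region[0] = msg[0];      /* header/flag word last so the receiver sees a complete message */
    }

    /* Receiver: poll the pinned receive region for the flag word. */
    int poll_message(uint64_t msg[8]) {
        if (recv_region[0] == 0)
            return 0;                 /* nothing has arrived yet */
        for (int i = 0; i < 8; i++)
            msg[i] = recv_region[i];
        recv_region[0] = 0;           /* clear locally, ready for the next message */
        return 1;
    }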


(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 51

Performance of Distributed Memory Machines

• Microbenchmarking

• One-way latency of a small (five-word) message
  – echo test
  – round-trip time divided by 2 (see the sketch below)

• Shared memory remote read

• Message passing operations
  – see text
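A minimal MPI version of the echo test (not from the text): rank 0 bounces a small message off rank 1 and reports half the average round-trip time.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, iters = 10000;
        char msg[40] = {0};                   /* roughly a five-word message */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {                  /* send, then wait for the echo */
                MPI_Send(msg, sizeof msg, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(msg, sizeof msg, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {           /* echo everything straight back */
                MPI_Recv(msg, sizeof msg, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(msg, sizeof msg, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        if (rank == 0)
            printf("one-way latency: %.2f us\n",
                   (MPI_Wtime() - t0) * 1e6 / (2.0 * iters));

        MPI_Finalize();
        return 0;
    }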

(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 52

Network Transaction Performance

[Figure 7.31: network transaction performance in microseconds for CM-5, Paragon, CS-2, NOW, and T3D, broken into send overhead (Os), receive overhead (Or), latency (L), and gap]


(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 53

Remote Read Performance

[Figure 7.32: remote read performance in microseconds for CM-5, Paragon, CS-2, NOW, and T3D, broken into issue, latency (L), and gap]

(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 54

Summary of Distributed Memory Machines

• Convergence of architectures
  – everything "looks basically the same"
  – processor, cache, memory, communication assist

• Communication assist
  – where is it? (I/O bus, memory bus, processor registers)
  – what does it know?
    » does it just move bytes, or does it perform some functions?
  – is it programmable?
  – does it run user code?

• Network transaction
  – input & output buffering
  – action on remote node


(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 55

Outline

• Motivation

• Network Transaction Primitive

• Supporting Programming Models

• Case Studies