CS/ECE 757: Advanced Computer Architecture II (Parallel Computer Architecture)
Scalable Multiprocessors (Chapter 7)
Copyright 2001 Mark D. Hill, University of Wisconsin-Madison
Slides are derived from work by Sarita Adve (Illinois), Babak Falsafi (CMU),
Alvy Lebeck (Duke), Steve Reinhardt (Michigan), and J. P. Singh (Princeton). Thanks!
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 2
Outline
• Motivation
• Network Transaction Primitive
• Supporting Programming Models
• Case Studies
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 3
Scalable Systems
• Buses support roughly 8 to 16 processors well
• What about scaling beyond 16 processors?
• What about cost?
• How do we build larger systems that are cost-effective?
(c) 2003, J. E. Smith 4
Scaling
• Bandwidth Scaling
  – should increase in proportion to # procs
  – crossbars
  – multi-stage nets
• Latency Scaling
  – ideally remains constant
  – in practice, log n scaling can be achieved
  – local memories help
  – latency may be dominated by overheads, anyway
• Cost Scaling
  – low overhead cost (most cost incremental)
  – in contrast to many large SMPs
• Physical Scaling
  – loose vs. dense packing
  – chip-level integration vs. commodity parts
(c) 2003, J. E. Smith 5
Scalable Systems
• Generic NUMA System
copyright 1999 Morgan Kaufmann Publishers, Inc
(c) 2003, J. E. Smith 6
Scalable Systems
• Dancehall System
copyright 1999 Morgan Kaufmann Publishers, Inc
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 7
Scalable Systems
• Cluster of workstation/server-like nodes
• Commodity System-Area/Local-Area Network switches/fiber
• What are the hardware/software issues?
[Figure: multiple nodes, each with a processor (P), cache ($), memory (Mem), and network interface (NI), connected by a network]
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 8
Node Communication Architecture
• How does a processor communicate with others?
• Is there processor/memory support for communication?
• Where does the network interface sit?
[Figure: a node with processor (P), cache ($), and memory, showing two NI placements: on the memory bus and on the I/O bus]
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 9
Communication Architecture Alternatives
• NI on I/O bus: Berkeley NOW, Wisconsin COW
• NI on memory bus: Intel Paragon, TMC CM-5
• Support for protected messaging: Princeton SHRIMP
• TLB support for shared address space: Cray T3D, T3E
• Full support for messaging: MIT J-Machine, nCube/2
• Full support for coherent shared memory: Alpha 21364
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 10
Network Transaction Primitive
• One-way transfer of information from source to destination
• causes some action at the destination
  – process info and/or deposit in buffer
  – state change (e.g., set flag, interrupt program)
  – maybe initiate reply (a separate network transaction)
[Figure: a serialized message leaves the source node's output buffer, crosses the communication network, and lands in the destination node's input buffer]
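For concreteness, here is a minimal C sketch of what such a one-way transaction might look like at the two endpoints. It is illustrative only: the net_txn_t layout and the ni_inject()/ni_extract()/handle_action() calls are invented stand-ins for whatever the communication assist actually provides, not any real machine's interface.

/* Minimal sketch of a one-way network transaction (illustrative only). */
#include <stdint.h>
#include <string.h>

/* Assumed (hypothetical) hooks into the NI's output/input buffers. */
extern void ni_inject(const void *txn, size_t bytes);
extern int  ni_extract(void *txn, size_t bytes);   /* returns 0 if input buffer empty */
extern void handle_action(uint16_t action, const uint8_t *payload, uint32_t len);

typedef struct {
    uint16_t dest_node;    /* destination name */
    uint16_t action;       /* e.g., deposit in buffer, set flag, interrupt */
    uint32_t length;       /* payload bytes */
    uint8_t  payload[64];  /* serialized message body */
} net_txn_t;

/* Source side: build the transaction and hand it to the output buffer. */
void send_txn(uint16_t dest, uint16_t action, const void *data, uint32_t len) {
    net_txn_t t = { .dest_node = dest, .action = action, .length = len };
    memcpy(t.payload, data, len);
    ni_inject(&t, sizeof t);
}

/* Destination side: drain the input buffer and perform each action. */
void drain_input(void) {
    net_txn_t t;
    while (ni_extract(&t, sizeof t))
        handle_action(t.action, t.payload, t.length);
}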
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 11
Node-to-Node Architecture
• Scalable architectures use point-to-point messaging
• Compare with an SMP bus
• Communication architecture issues:
  – Protection
  – Format
  – Output buffering
  – Input buffering
  – Media arbitration & flow control
  – Destination name & routing
  – Action
  – Completion detection
  – Deadlock avoidance
  – Delivery guarantees
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 12
Communication Architecture
• Protection:
  – larger and more loosely coupled components
  – multiple protection domains
  – How do you isolate faulty hardware or software?
• Format:
  – message descriptor and length
  – flexibility in the contents depending on the communication model
  – may support a simple protocol stack
    » user information wrapped in communication information
  – e.g., shared memory messages: commands vs data
  – e.g., Active Messages: src, dest, PC, args
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 13
Communication Architecture
• Buffering required:
  – when there is a mismatch between arrival and departure rates
• Output buffers:
  – data from processor/memory going to the network
  – can use a special buffer
  – buffer can be in main memory
  – what happens when output buffers fill up?
• Input buffers:
  – incoming data from other procs
  – may be picked up by the proc or sent to memory
  – what happens when input buffers fill up?
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 14
Communication Architecture
• Media arbitration & flow control:
  – distributed among the network switches
  – packets traverse multiple switches
  – (flow control covered later)
• Destination name and routing:
  – the source must specify the destination name and routing
  – e.g., for shared memory, a physical memory address & a proc id
  – (routing covered later)
• Action:
  – may be a memory write or read, or executing code
  – may also send a response
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 15
Communication Architecture
• Completion detection:
  – sender must know the message was received
  – receiver must know there is a message
• Transaction ordering:
  – do all transactions from source to dest arrive in order?
  – must distinguish out-of-order routing vs out-of-order delivery
  – e.g., what happens to shared-memory consistency if out of order?
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 16
Communication Architecture
• Deadlock avoidance:
  – switches require buffer space for routing
  – must manage buffers carefully to avoid deadlocks
  – end-to-end: sender must always be able to receive while sending!
• Delivery guarantees: when input buffers fill up, either
  » drop the message and resend, or
  » apply flow control; buffers back up into the network
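One common answer to the "buffers fill up" problem is end-to-end, credit-based flow control. The sketch below is illustrative only: the credit accounting is generic, and ni_send(), ni_send_credit(), and ni_recv_credit() are hypothetical driver hooks rather than any particular machine's interface.

/* Sketch of sender-side credit (window) flow control (illustrative only). */
#include <stdint.h>

#define MAX_NODES        1024
#define INPUT_BUF_SLOTS  16

/* Assumed (hypothetical) NI hooks. */
extern void ni_send(int dest, const void *msg, uint32_t len);
extern void ni_send_credit(int src, int n);
extern void ni_recv_credit(int credits[]);   /* blocks until some credits return */

static int credits[MAX_NODES];   /* initialized to INPUT_BUF_SLOTS per node at startup */

/* Sender: never inject more messages than the destination has input slots. */
void credit_send(int dest, const void *msg, uint32_t len) {
    while (credits[dest] == 0)       /* no room left at the destination */
        ni_recv_credit(credits);     /* wait for the destination to return credits */
    credits[dest]--;
    ni_send(dest, msg, len);
}

/* Receiver: return one credit each time an input-buffer slot is drained. */
void on_input_slot_freed(int src) {
    ni_send_credit(src, 1);
}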
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 17
Outline
• Motivation
• Network Transaction Primitive
• Supporting Programming Models
• Case Studies
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 18
Shared Address Space vs Message Passing
• Do not confuse model/abstraction with system implementation:
Programming interfaces (models):
  – Message Passing (e.g., DBMS, MPI)
  – Data Parallel (e.g., HPF)
  – Shared Address Space (e.g., Pthreads, PARMACS)
Systems (implementations):
  – H/W Shared Memory (e.g., Intel SMP, SGI Origin)
  – Network of PCs (e.g., NOW, COW)
  – MP Supercomputer (e.g., Cray T3E)
[Figure: any of the programming interfaces above can be implemented on any of the systems]
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 19
Shared Address Space Requirements
• Shared address space on a distributed memory computer
• Two-way request/response protocol:
  – basic: read a block, write a block, and block ownership
  – more complicated protocol optimizations on top of the basic ones
[Figure: the source issues load X, which becomes a request to the destination; the source waits while the destination accesses X and returns it in a response]
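A rough C sketch of the basic remote block read as a request/response pair, seen from the source side. The message formats are invented, and my_node(), node_of(), net_send(), and net_recv() are assumed helper routines, not a real machine's interface.

/* Sketch of a remote block read over a request/response pair (illustrative only). */
#include <stdint.h>
#include <string.h>

#define BLOCK 64

/* Assumed (hypothetical) helpers: address decode and blocking messaging. */
extern uint16_t my_node(void);
extern uint16_t node_of(uint64_t global_addr);
extern void net_send(uint16_t dest, const void *msg, size_t len);
extern void net_recv(void *buf, size_t len);

typedef struct { uint64_t addr; uint16_t src_node; } read_req_t;
typedef struct { uint8_t data[BLOCK]; } read_reply_t;

void remote_read_block(uint64_t global_addr, void *buf) {
    read_req_t  req = { .addr = global_addr, .src_node = my_node() };
    read_reply_t reply;
    net_send(node_of(global_addr), &req, sizeof req);  /* request transaction */
    net_recv(&reply, sizeof reply);                     /* stall until the response */
    memcpy(buf, reply.data, BLOCK);
}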
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 20
Shared Address Space Issues
• Coherent or non-coherent:
  – fully cache-coherent, similar to an SMP
  – coherence granularity: cache block, page, etc.
  – remote shared accesses only: Cray T3D
• Does it allow overlapping multiple transactions?
  – consider memory consistency models: chapter 9
• Implemented in H/W or S/W:
  – for S/W systems, H/W provides messaging
  – H/W or S/W can provide protection, flow control, etc.
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 21
Shared Address Space Issues
• Assume implemented in H/W
  – issues similar to SMPs (see chapter 8)
  – protection primarily enforced by virtual memory
  – data address: sender & receiver NIs know what to do
• Coherent:
  – block transfers from cache/memory to the network
  – H/W protocol implements coherence
• Non-coherent:
  – single/double-word reads and writes from proc to network
• Support for larger transfers using DMA H/W
  – efficient block moves
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 22
Message Passing Requirements
• Messaging abstraction on a distributed memory computer
• Conceptually a one-way communication:
  – sender sends a (variable-sized) message
  – receiver receives
• But the implementation is more complicated
• Asynchronous vs synchronous messaging
  – synchronous: sender waits until reception is assured
  – asynchronous: sender sends and goes back to computation
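In MPI terms (MPI is named on the earlier model/implementation slide), the two flavors look roughly like this; do_other_work() stands in for whatever computation the sender overlaps with the asynchronous send.

/* Synchronous vs asynchronous sends, sketched with standard MPI calls. */
#include <mpi.h>

extern void do_other_work(void);   /* hypothetical overlapped computation */

void send_both_ways(double *x, int n, int dest, MPI_Comm comm) {
    /* Synchronous: returns only after the matching receive has started. */
    MPI_Ssend(x, n, MPI_DOUBLE, dest, 0, comm);

    /* Asynchronous: start the send, go back to computation, complete later. */
    MPI_Request req;
    MPI_Isend(x, n, MPI_DOUBLE, dest, 1, comm, &req);
    do_other_work();
    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* x may be reused only after this */
}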
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 23
Message Passing Requirements
• Synchronous messaging
[Figure: synchronous messaging — the source's "send X, dest" becomes a send request to the destination, which does a tag check to see whether the receiver is ready; the source waits until a "ready" reply arrives, and only then does the data transfer take place]
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 24
Message Passing Issues
• Sender must specify where the data is:– typically in contiguous block of memory– can optimize for direct transfers from data structures: gather/scatter
• Proc or NI must transfer from memory to NI buffers– what about NI protection?– Can send through OS: but requires multiple data copies– Does NI have access to application memory?
• Unless support for low-overhead messaging ���� send large messages to amortize overheads
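A sketch of the gather idea: describe the pieces of the message in place and let the NI (or a copy loop) walk the list, instead of packing everything into a contiguous staging buffer first. ni_send_gather() is a hypothetical NI call; struct iovec is the standard POSIX descriptor, used here only for illustration.

/* Gather-list send: the message is assembled from non-contiguous pieces. */
#include <sys/uio.h>

/* Hypothetical NI call that walks a gather list without an intermediate copy. */
extern void ni_send_gather(int dest, const struct iovec *v, int count);

void send_row_with_halo(double *left, double *row, int n, double *right, int dest) {
    struct iovec gather[3] = {
        { .iov_base = left,  .iov_len = sizeof(double) },
        { .iov_base = row,   .iov_len = (size_t)n * sizeof(double) },
        { .iov_base = right, .iov_len = sizeof(double) },
    };
    ni_send_gather(dest, gather, 3);   /* three pieces, no staging buffer */
}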
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 25
Message Passing Requirements
• Asynchronous messaging: (optimistic)
[Figure: optimistic asynchronous messaging — the source's "send X, dest" goes straight to the destination and the sender goes back to computation; the destination does a tag check and, if no matching receive is posted, buffers the message]
What are the problems with this approach?
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 26
Message Passing Issues
• Performing the tag match in software to find where the data goes:
  – slow process
  – can always buffer the message for later processing
  – how much buffering is enough?
• Data typically arrives at a high rate for large messages
• What if multiple senders are sending to one receiver?
  – need flow control, or messages may be dropped
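A simplified sketch of receiver-side tag matching in software: an arriving message is matched against posted receives and buffered on an "unexpected" queue otherwise, which is exactly where the "how much buffering is enough?" question bites. The data structures and names below are invented.

/* Software tag matching at the receiver (illustrative only). */
#include <stdlib.h>
#include <string.h>

typedef struct posted { int src, tag; void *buf; size_t len; struct posted *next; } posted_t;
typedef struct unexp  { int src, tag; void *data; size_t len; struct unexp  *next; } unexp_t;

static posted_t *posted;      /* receives the application has posted */
static unexp_t  *unexpected;  /* messages that arrived before a matching receive */

void on_arrival(int src, int tag, const void *data, size_t len) {
    /* Scan posted receives; deliver directly into the user buffer on a match. */
    for (posted_t **pp = &posted; *pp; pp = &(*pp)->next) {
        if ((*pp)->src == src && (*pp)->tag == tag) {
            memcpy((*pp)->buf, data, len < (*pp)->len ? len : (*pp)->len);
            *pp = (*pp)->next;   /* unlink; the posted record is owned by the poster */
            return;
        }
    }
    /* No match: buffer the whole message -- how much of this can we afford? */
    unexp_t *u = malloc(sizeof *u);
    u->src = src; u->tag = tag; u->len = len;
    u->data = malloc(len);
    memcpy(u->data, data, len);
    u->next = unexpected;
    unexpected = u;
}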
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 27
Message Passing Requirements
• Asynchronous messaging: (conservative)
[Figure: the source's "send X, dest" first issues a send request and the sender goes back to computation; the destination does a tag check and, if no matching receive is posted, waits; when the receive is posted it replies "ready", and the data transfer then completes at the receive]
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 28
Active Messages
• User-level analog of the network transaction
  – invoke a handler function at the receiver to extract the packet from the network
  – grew out of attempts to do dataflow programming on message-passing machines
  – handler may send a reply, but no other messages
• May also perform memory-to-memory transfer
[Figure: a request message invokes a handler at the destination; that handler's reply in turn invokes a handler back at the source]
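A sketch of the active-message idea, not the exact Berkeley AM interface: the packet carries the handler to run (the "PC") plus its arguments, and the receiver simply calls it when the packet is extracted from the network.

/* Active-message delivery (illustrative only). */
#include <stdint.h>

/* The handler "PC" travels in the message; this only works as sketched if the
 * code is loaded at the same address on every node. */
typedef void (*am_handler_t)(int src_node, uint64_t arg0, uint64_t arg1);

typedef struct {
    int          src_node;
    am_handler_t handler;    /* which function to run at the receiver */
    uint64_t     arg0, arg1;
} am_packet_t;

/* Receiver: pull the packet out of the network and run its handler at once.
 * The handler may send a reply, but no other messages. */
void am_deliver(am_packet_t *pkt) {
    pkt->handler(pkt->src_node, pkt->arg0, pkt->arg1);
}

/* Example handler: deposit a value into a counter (no reply needed). */
static uint64_t counter;
void add_handler(int src, uint64_t amount, uint64_t unused) {
    (void)src; (void)unused;
    counter += amount;
}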
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 29
Outline
• Motivation
• Network Transaction Primitive
• Supporting Programming Models
• Case Studies
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 30
Massively Parallel Processor (MPP) Architectures
• Network interface typically close to processor
– Memory bus:
  » locked to a specific processor architecture/bus protocol
– Registers/cache:
  » only in research machines
• Time-to-market is long → use an available processor or work closely with processor designers
• Maximize performance and cost
[Figure: MPP node — processor and cache sit on the memory bus with main memory and the network interface; an I/O bridge connects the memory bus to an I/O bus with a disk controller and disks]
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 31
Network of Workstations
• Network interface on I/O bus
• Standards (e.g., PCI) → longer life, faster to market
• Slow (microseconds) to access network interface
• “System Area Network” (SAN): between LAN & MPP
[Figure: workstation node — processor, cache, and main memory behind a core chip set; the network interface sits on the standard I/O bus alongside the disk and graphics controllers and signals the processor with interrupts]
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 32
Transaction Interpretation
• Simplest MP: assist doesn’t interpret
  – DMA from/to buffer; interrupt or set flag on completion
• User-level messaging: get the OS out of the way
  – assist does protection checks to allow direct user access to the network
  – may have minimal interpretation otherwise
• Virtual DMA: get the CPU out of the way (maybe)
  – basic protection plus address translation: user-level bulk DMA
• Global physical address space (NUMA): everything in hardware
  – complexity increases, but performance does too (if done right)
• Cache coherence: even more so
  – stay tuned
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 33
Spectrum of Designs
• None: physical bit stream
  – physical DMA: nCUBE, iPSC, ...
• User/System
  – user-level port: CM-5, *T
  – user-level handler: J-Machine, Monsoon, ...
• Remote virtual address
  – processing, translation: Paragon, Meiko CS-2, Myrinet, Memory Channel, SHRIMP
• Global physical address
  – proc + memory controller: RP3, BBN, T3D, T3E
• Cache-to-cache (later)
  – cache controller: Dash, KSR, Flash
(Down the list: increasing HW support, specialization, intrusiveness, performance (???))
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 34
Net Transactions: Physical DMA
• Physical addresses: the OS must initiate transfers
  – system call per message on both ends: very high overhead
• Sending OS copies data to a kernel buffer w/ header/trailer
  – can avoid the copy if the interface does scatter/gather
• Receiver copies the packet into an OS buffer, then interprets it
  – user message then copied (or mapped) into user space
[Figure: physical DMA — the sending processor/memory posts a command (dest, data) and an address/length/ready descriptor for the DMA channels; on the receiving node the DMA channel deposits the packet at an address/length and signals status or an interrupt]
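A sketch of the OS-mediated send path described above: one system call per message, a copy into a kernel buffer wrapped with header and trailer, then the DMA channel is started on physical addresses. kbuf_alloc(), virt_to_phys(), and dma_start() are hypothetical kernel helpers, not a real kernel API.

/* System-call send path for physical DMA (illustrative only). */
#include <stdint.h>
#include <string.h>

/* Assumed (hypothetical) kernel helpers. */
extern uint8_t *kbuf_alloc(size_t bytes);
extern uint64_t virt_to_phys(const void *p);
extern void     dma_start(uint64_t phys_addr, size_t bytes);

typedef struct { uint16_t dest; uint32_t len; } hdr_t;

/* One trap per message, one copy into a kernel buffer. */
long sys_msg_send(int dest, const void *user_buf, uint32_t len) {
    size_t total = sizeof(hdr_t) + len + sizeof(uint32_t);   /* header + data + trailer */
    uint8_t *kbuf = kbuf_alloc(total);
    hdr_t hdr = { .dest = (uint16_t)dest, .len = len };

    memcpy(kbuf, &hdr, sizeof hdr);                         /* header */
    memcpy(kbuf + sizeof hdr, user_buf, len);               /* copy: user -> kernel */
    memset(kbuf + sizeof hdr + len, 0, sizeof(uint32_t));   /* trailer/checksum slot */

    dma_start(virt_to_phys(kbuf), total);   /* NI DMAs from physical memory */
    return 0;   /* completion signaled later by interrupt or status flag */
}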
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 35
Conventional LAN Network Interface
[Figure: conventional LAN NIC on the I/O bus — the NIC controller DMAs to/from host memory using rings of transmit (TX) and receive (RX) descriptors, each holding an address, length, status, and next pointer]
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 36
User Level Ports
• Map network hardware into the user’s address space
  – talk directly to the network via loads & stores
• User-to-user communication without OS intervention: low latency
• Protection: user/user & user/system
• DMA difficult… CPU involvement (copying) becomes the bottleneck
[Figure: user-level ports — the sending processor writes dest/data directly to the NI, tagged user/system; the receiving processor sees status or takes an interrupt]
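A sketch of what a user-level port looks like to software once the NI registers are mapped into the user's address space: sends and receives become ordinary stores and loads, with no system call. The register layout below is invented.

/* Memory-mapped user-level port (illustrative only; register layout invented). */
#include <stdint.h>

typedef struct {
    volatile uint32_t out_status;   /* nonzero => output FIFO has room */
    volatile uint32_t out_data;     /* store here to inject a word */
    volatile uint32_t in_status;    /* nonzero => input FIFO has a word */
    volatile uint32_t in_data;      /* load here to extract a word */
} user_port_t;

void port_send_word(user_port_t *p, uint32_t w) {
    while (!p->out_status) ;        /* spin: user-level flow control */
    p->out_data = w;                /* a plain store reaches the network */
}

uint32_t port_recv_word(user_port_t *p) {
    while (!p->in_status) ;         /* what happens if the user never pops? */
    return p->in_data;              /* a plain load pulls from the input FIFO */
}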
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 37
User Level Network ports
• Appears to the user as logical message queues plus status
• What happens if the user never pops the input queue?
[Figure: the net output port, net input port, and status appear in the process's virtual address space, alongside its program counter and registers]
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 38
Example: CM-5
• Input and output FIFOs for each network
• Two data networks
• Save/restore network buffers on context switch
[Figure: CM-5 — data, control, and diagnostics networks connect processing partitions, control processors, and an I/O partition; each node has a SPARC processor on the MBUS with FPU, cache controller and SRAM, DRAM controllers with DRAM, vector units, and an NI attached to the data and control networks]
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 39
User Level Handlers
• Hardware support to vector to the address specified in the message
  – message ports in registers
  – alternate register set for the handler?
• Examples: J-Machine, Monsoon, *T (MIT), iWARP (CMU)
[Figure: like the user-level port organization, but the message carries a destination, data, and a handler address, tagged user/system]
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 40
J-Machine
• Each node a small message-driven processor
• HW support to queue msgs and dispatch to msg handler task
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 41
Dedicated Message Processing Without Specialized Hardware
• Microprocessor performs arbitrary output processing (at system level)
• Microprocessor interprets incoming network transactions (in system)
• User Processor <–> Msg Processor: share memory
• Msg Processor <–> Msg Processor: via system network transactions
[Figure: each node pairs a compute processor (P) and a message processor (MP) sharing memory (Mem) and an NI on the network; the user level runs on P, the system level on MP]
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 42
Levels of Network Transaction
• User Processor stores cmd / msg / data into a shared output queue (see the sketch below)
  – must still check for output queue full (or grow it dynamically)
• Communication assists make the transaction happen
  – checking, translation, scheduling, transport, interpretation
• Avoids system call overhead
• Multiple bus crossings are the likely bottleneck
[Figure: same node organization as the previous slide — the compute processor (P) and message processor (MP) share memory and the NI; the user-level output queue lives in that shared memory]
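A sketch of that shared output queue: the user (compute) processor produces send descriptors, and the message processor consumes them and drives the NI. The queue layout, ordering assumptions, and msgproc_transmit() are invented for illustration.

/* Shared output queue between user processor and message processor (illustrative). */
#include <stdint.h>

#define QSIZE 64

typedef struct { uint16_t dest; uint32_t len; uint64_t addr; } out_desc_t;

typedef struct {
    volatile uint32_t head;       /* consumed by the message processor */
    volatile uint32_t tail;       /* produced by the user processor */
    out_desc_t slot[QSIZE];
} out_queue_t;

/* Hypothetical: MP-side routine that turns a descriptor into a network transaction. */
extern void msgproc_transmit(const out_desc_t *d);

/* User processor: enqueue a send descriptor; must still handle a full queue. */
int user_enqueue(out_queue_t *q, out_desc_t d) {
    uint32_t next = (q->tail + 1) % QSIZE;
    if (next == q->head) return -1;   /* full: retry, block, or grow the queue */
    q->slot[q->tail] = d;
    q->tail = next;                   /* assumes adequate memory ordering/fences */
    return 0;
}

/* Message processor: system-level loop that drains the queue into the NI. */
void msgproc_loop(out_queue_t *q) {
    for (;;) {
        while (q->head == q->tail) ;  /* poll the shared queue */
        msgproc_transmit(&q->slot[q->head]);
        q->head = (q->head + 1) % QSIZE;
    }
}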
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 43
Example: Intel Paragon
[Figure: Paragon node — i860 XP compute and message processors (50 MHz, 16 KB 4-way caches, 32 B blocks, MESI) share memory over a 64-bit, 400 MB/s bus; the NI has send and receive DMA (sDMA/rDMA), 2048 B of buffering, and 175 MB/s duplex network links; the network also connects I/O and service nodes with their devices; packets carry a route, an MP handler, variable data, and an EOP]
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 44
Dedicated MP w/ Specialized NI: Meiko CS-2
• Integrate the message processor into the network interface
  – active-messages-like capability
  – dedicated threads for DMA, reply handling, simple remote memory access
  – supports user-level virtual DMA
    » own page table
    » can take a page fault, signal the OS, restart
      • meanwhile, nack the other node
• Problem: the processor is slow and time-slices threads
  – fundamental issue with building your own CPU
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 45
Myricom Myrinet (Berkeley NOW)
• Programmable network interface on the I/O bus (Sun SBUS or PCI)
  – embedded custom CPU (“Lanai”, ~40 MHz RISC CPU)
  – 256 KB SRAM
  – 3 DMA engines: to network, from network, to/from host memory
• Downloadable firmware executes in kernel mode
  – includes a source-based routing protocol
• SRAM pages can be mapped into user space
  – separate pages for separate processes
  – firmware can define status words, queues, etc.
    » data for short messages or pointers for long ones
    » firmware can do address translation too… w/ OS help
  – polls to check for sends from the user
• Bottom line: the I/O bus is still the bottleneck, and the CPU could be faster
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 46
Shared Physical Address Space
• Implement the SAS model in hardware w/o caching
  – actual caching must be done by copying from remote memory to local memory
  – programming paradigm looks more like message passing than Pthreads
    » yet, low-latency & low-overhead transfers thanks to HW interpretation; high bandwidth too if done right
    » result: great platform for MPI & compiled data-parallel codes
• Implementation:
  – “pseudo-memory” acts as the memory controller for remote memory, converting accesses into network transactions (requests)
  – “pseudo-CPU” on the remote node receives requests, performs them on local memory, and sends replies
  – a split-transaction or retry-capable bus is required (or dual-ported memory)
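A sketch of the remote ("pseudo-CPU") side of this scheme: it pulls request transactions off the network, performs them on local memory, and sends the reply that the requester's pseudo-memory is stalled on. The request format and the net_recv_req()/net_send_reply()/local() helpers are invented.

/* Pseudo-CPU service loop for a non-cached shared physical address space (illustrative). */
#include <stdint.h>

typedef struct { uint8_t is_write; uint16_t src; uint64_t addr; uint64_t wdata; } req_t;

/* Assumed (hypothetical) helpers. */
extern void net_recv_req(req_t *r);
extern volatile uint64_t *local(uint64_t global_addr);   /* global addr -> local pointer */
extern void net_send_reply(uint16_t dest, uint64_t data);

void pseudo_cpu_loop(void) {
    for (;;) {
        req_t r;
        net_recv_req(&r);                    /* next incoming request transaction */
        volatile uint64_t *p = local(r.addr);
        if (r.is_write) {
            *p = r.wdata;
            net_send_reply(r.src, 0);        /* write acknowledgment */
        } else {
            net_send_reply(r.src, *p);       /* read data back to the requester */
        }
    }
}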
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 47
Example: Cray T3D
• Up to 2,048 Alpha 21064s
  – no off-chip L2, to avoid its inherent latency
• In addition to remote memory ops, includes:
  – prefetch buffer (hides remote latency)
  – DMA engine (requires an OS trap)
  – synchronization operations (swap, fetch&inc, global AND/OR)
  – message queue (requires an OS trap on the receiver)
• Big problem: physical address space
  – the 21064 supports only 32 bits
  – a 2K-node machine is limited to 2 MB per node
  – an external “DTB annex” provides segment-like registers for extended addressing, but managing them is expensive & ugly
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 48
Cray T3E
• Similar to the T3D, but uses the Alpha 21164 instead of the 21064 (on-chip L2)
  – still has physical address space problems
• E-registers for remote communication and synchronization
  – 512 user, 128 system; 64 bits each
  – replace/unify the DTB Annex, prefetch queue, block transfer engine, remote load/store, and message queue
  – the address specifies the source or destination E-register and a command
  – the data contains a pointer to a block of 4 E-regs and an index for the centrifuge
• Centrifuge
  – supports data distributions used in data-parallel languages (HPF)
  – 4 E-regs for a global memory operation: mask, base, two arguments
• Get & Put Operations
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 49
T3E (continued)
• Atomic memory operations
  – E-registers & centrifuge used
  – F&I, F&Add, Compare&Swap, Masked_Swap
• Messaging
  – arbitrary number of queues (user or system)
  – 64-byte messages
  – create a msg queue by storing a message control word to a memory location
• Msg Send
  – construct the data in an aligned block of 8 E-regs
  – send like a put, but the dest must be a message control word
  – the processor is responsible for queue space (buffer management)
• Barrier and Eureka synchronization
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 50
DEC Memory Channel (Princeton SHRIMP)
• Reflective memory
• Writes on the sender appear in the receiver’s memory
  – send & receive regions
  – page control table
• Receive region is pinned in memory
• Requires duplicate writes; really just message buffers
[Figure: a virtual send region on the sender and a virtual receive region on the receiver, each mapped to physical memory; writes to the sender's pages are reflected into the receiver's physical pages]
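A sketch of how such reflective memory gets used: the sender's stores to a mapped transmit region reappear at the same offsets in the receiver's pinned receive region, so the regions behave like message buffers with a sequence word written last. The slot layout and consume() are invented; this is not the actual Memory Channel API.

/* Reflective-memory style message buffer (illustrative only). */
#include <stdint.h>

extern void consume(uint32_t v);   /* hypothetical use of the delivered data */

/* The same slot layout exists in the sender's mapped transmit region and in
 * the receiver's pinned receive region. */
typedef struct { uint32_t payload[15]; volatile uint32_t seq; } mc_slot_t;

void mc_send(mc_slot_t *tx, uint32_t value, uint32_t seq) {
    tx->payload[0] = value;   /* a duplicate write: it also goes out on the wire */
    tx->seq = seq;            /* written last, marks the message as complete */
}

void mc_recv(mc_slot_t *rx, uint32_t expected_seq) {
    while (rx->seq != expected_seq) ;   /* wait for the sender's writes to appear */
    consume(rx->payload[0]);
}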
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 51
Performance of Distributed Memory Machines
• Microbenchmarking
• One-way latency of a small (five-word) message
  – echo test
  – round-trip time divided by 2
• Shared-memory remote read
• Message-passing operations
  – see text
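A minimal echo-test microbenchmark in MPI, in the spirit of the measurement described above: time many round trips of a small five-word message and divide by two for the one-way latency. The iteration count is arbitrary; run with two ranks.

/* Ping-pong (echo) latency microbenchmark. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    long msg[5] = {0};               /* five-word message */
    const int iters = 10000;

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(msg, 5, MPI_LONG, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(msg, 5, MPI_LONG, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(msg, 5, MPI_LONG, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(msg, 5, MPI_LONG, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* round trip / 2 = one-way latency */
        printf("one-way latency: %.2f microseconds\n", (t1 - t0) / iters / 2.0 * 1e6);
    MPI_Finalize();
    return 0;
}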
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 52
Network Transaction Performance
[Figure 7.31: network transaction performance in microseconds — send overhead (Os), receive overhead (Or), latency (L), and gap for the CM-5, Paragon, CS-2, NOW, and T3D]
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 53
Remote Read Performance
[Figure 7.32: remote read performance in microseconds — issue time, latency (L), and gap for the CM-5, Paragon, CS-2, NOW, and T3D]
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 54
Summary of Distributed Memory Machines
• Convergence of architectures
  – everything “looks basically the same”
  – processor, cache, memory, communication assist
• Communication assist
  – where is it? (I/O bus, memory bus, processor registers)
  – what does it know?
    » does it just move bytes, or does it perform some functions?
  – is it programmable?
  – does it run user code?
• Network transaction
  – input & output buffering
  – action on the remote node
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 55
Outline
• Motivation
• Network Transaction Primitive
• Supporting Programming Models
• Case Studies