CS/ECE 757: Advanced Computer Architecture II (Parallel Computer Architecture)
Scalable Multiprocessors (Chapter 7)
Copyright 2001 Mark D. Hill, University of Wisconsin-Madison
Slides are derived from work by Sarita Adve (Illinois), Babak Falsafi (CMU),
Alvy Lebeck (Duke), Steve Reinhardt (Michigan), and J. P. Singh (Princeton). Thanks!
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 2
Outline
• Motivation
• Network Transaction Primitive
• Supporting Programming Models
• Case Studies
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 3
Scalable Systems
• Buses support roughly 8 to 16 processors well
• What about scaling beyond 16 processors?
• What about cost?
• How do we build larger systems that are cost-effective?
(c) 2003, J. E. Smith 4
Scaling
• Bandwidth Scaling
  – should increase in proportion to # procs
  – crossbars
  – multi-stage nets
• Latency Scaling
  – ideally remains constant
  – in practice, log n scaling can be achieved
  – local memories help
  – latency may be dominated by overheads, anyway
• Cost Scaling
  – low overhead cost (most cost incremental)
  – in contrast to many large SMPs
• Physical Scaling
  – loose vs. dense packing
  – chip-level integration vs. commodity parts
(c) 2003, J. E. Smith 5
Scalable Systems
• Generic NUMA System
copyright 1999 Morgan Kaufmann Publishers, Inc
(c) 2003, J. E. Smith 6
Scalable Systems
• Dancehall System
copyright 1999 Morgan Kaufmann Publishers, Inc
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 7
Scalable Systems
• Cluster of workstation/server-like nodes
• Commodity System-Area/Local-Area Network switches/fiber
• What are the hardware/software issues?
[Figure: multiple nodes, each with a processor (P), cache ($), memory (Mem), and network interface (NI), connected by a network]
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 8
Node Communication Architecture
• How does a processor communicate with others?
• Is there processor/memory support for communication?
• Where does the network interface sit?
[Figure: a node with processor (P), cache ($), and memory, showing two NI placements: on the memory bus and on the I/O bus]
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 9
Communication Architecture Alternatives
• NI on I/O bus: Berkeley NOW, Wisconsin COW
• NI on memory bus: Intel Paragon, TMC CM-5
• Support for protected messaging: Princeton SHRIMP
• TLB support for shared address space: Cray T3D, T3E
• Full support for messaging: MIT J-Machine, nCube/2
• Full support for coherent shared memory: Alpha 21364
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 10
Network Transaction Primitive
• One-way transfer of information from source to destination
• causes some action at the destination
  – process info and/or deposit in buffer
  – state change (e.g., set flag, interrupt program)
  – maybe initiate reply (a separate network transaction)
[Figure: a serialized message leaves the source node's output buffer, crosses the communication network, and lands in the destination node's input buffer]
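For concreteness, here is a minimal C sketch of what such a one-way transaction might look like at the two endpoints. It is illustrative only: the net_txn_t layout and the ni_inject()/ni_extract()/handle_action() calls are invented stand-ins for whatever the communication assist actually provides, not any real machine's interface.

/* Minimal sketch of a one-way network transaction (illustrative only). */
#include <stdint.h>
#include <string.h>

/* Assumed (hypothetical) hooks into the NI's output/input buffers. */
extern void ni_inject(const void *txn, size_t bytes);
extern int  ni_extract(void *txn, size_t bytes);   /* returns 0 if input buffer empty */
extern void handle_action(uint16_t action, const uint8_t *payload, uint32_t len);

typedef struct {
    uint16_t dest_node;    /* destination name */
    uint16_t action;       /* e.g., deposit in buffer, set flag, interrupt */
    uint32_t length;       /* payload bytes */
    uint8_t  payload[64];  /* serialized message body */
} net_txn_t;

/* Source side: build the transaction and hand it to the output buffer. */
void send_txn(uint16_t dest, uint16_t action, const void *data, uint32_t len) {
    net_txn_t t = { .dest_node = dest, .action = action, .length = len };
    memcpy(t.payload, data, len);
    ni_inject(&t, sizeof t);
}

/* Destination side: drain the input buffer and perform each action. */
void drain_input(void) {
    net_txn_t t;
    while (ni_extract(&t, sizeof t))
        handle_action(t.action, t.payload, t.length);
}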
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 11
Node-to-Node Architecture
• Scalable architectures use point-to-point messaging
• Compare with an SMP bus
• Communication architecture issues:
  – Protection
  – Format
  – Output buffering
  – Input buffering
  – Media arbitration & flow control
  – Destination name & routing
  – Action
  – Completion detection
  – Deadlock avoidance
  – Delivery guarantees
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 12
Communication Architecture
• Protection:
  – larger and more loosely coupled components
  – multiple protection domains
  – How do you isolate faulty hardware or software?
• Format:
  – message descriptor and length
  – flexibility in the contents depending on the communication model
  – may support a simple protocol stack
    » user information wrapped in communication information
  – e.g., shared memory messages: commands vs data
  – e.g., Active Messages: src, dest, PC, args
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 13
Communication Architecture
• Buffering required:
  – when there is a mismatch between arrival and departure rates
• Output buffers:
  – data from processor/memory going to the network
  – can use a special buffer
  – buffer can be in main memory
  – what happens when output buffers fill up?
• Input buffers:
  – incoming data from other procs
  – may be picked up by the proc or sent to memory
  – what happens when input buffers fill up?
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 14
Communication Architecture
• Media arbitration & flow control:
  – distributed among the network switches
  – packets traverse multiple switches
  – (flow control covered later)
• Destination name and routing:
  – the source must specify the destination name and routing
  – e.g., for shared memory, a physical memory address & a proc id
  – (routing covered later)
• Action:
  – may be a memory write or read, or executing code
  – may also send a response
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 15
Communication Architecture
• Completion detection:
  – sender must know the message was received
  – receiver must know there is a message
• Transaction ordering:
  – do all transactions from source to dest arrive in order?
  – must distinguish out-of-order routing vs out-of-order delivery
  – e.g., what happens to shared-memory consistency if out of order?
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 16
Communication Architecture
• Deadlock avoidance:
  – switches require buffer space for routing
  – must manage buffers carefully to avoid deadlocks
  – end-to-end: sender must always be able to receive while sending!
• Delivery guarantees: when input buffers fill up, either
  » drop the message and resend, or
  » apply flow control; buffers back up into the network
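One common answer to the "buffers fill up" problem is end-to-end, credit-based flow control. The sketch below is illustrative only: the credit accounting is generic, and ni_send(), ni_send_credit(), and ni_recv_credit() are hypothetical driver hooks rather than any particular machine's interface.

/* Sketch of sender-side credit (window) flow control (illustrative only). */
#include <stdint.h>

#define MAX_NODES        1024
#define INPUT_BUF_SLOTS  16

/* Assumed (hypothetical) NI hooks. */
extern void ni_send(int dest, const void *msg, uint32_t len);
extern void ni_send_credit(int src, int n);
extern void ni_recv_credit(int credits[]);   /* blocks until some credits return */

static int credits[MAX_NODES];   /* initialized to INPUT_BUF_SLOTS per node at startup */

/* Sender: never inject more messages than the destination has input slots. */
void credit_send(int dest, const void *msg, uint32_t len) {
    while (credits[dest] == 0)       /* no room left at the destination */
        ni_recv_credit(credits);     /* wait for the destination to return credits */
    credits[dest]--;
    ni_send(dest, msg, len);
}

/* Receiver: return one credit each time an input-buffer slot is drained. */
void on_input_slot_freed(int src) {
    ni_send_credit(src, 1);
}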
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 17
Outline
• Motivation
• Network Transaction Primitive
• Supporting Programming Models
• Case Studies
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 18
Shared Address Space vs Message Passing
• Do not confuse model/abstraction with system implementation:
Programming interfaces (models):
  – Message Passing (e.g., DBMS, MPI)
  – Data Parallel (e.g., HPF)
  – Shared Address Space (e.g., Pthreads, PARMACS)
Systems (implementations):
  – H/W Shared Memory (e.g., Intel SMP, SGI Origin)
  – Network of PCs (e.g., NOW, COW)
  – MP Supercomputer (e.g., Cray T3E)
[Figure: any of the programming interfaces above can be implemented on any of the systems]
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 19
Shared Address Space Requirements
• Shared address space on a distributed memory computer
• Two-way request/response protocol:
  – basic: read a block, write a block, and block ownership
  – more complicated protocol optimizations on top of the basic ones
[Figure: the source issues load X, which becomes a request to the destination; the source waits while the destination accesses X and returns it in a response]
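A rough C sketch of the basic remote block read as a request/response pair, seen from the source side. The message formats are invented, and my_node(), node_of(), net_send(), and net_recv() are assumed helper routines, not a real machine's interface.

/* Sketch of a remote block read over a request/response pair (illustrative only). */
#include <stdint.h>
#include <string.h>

#define BLOCK 64

/* Assumed (hypothetical) helpers: address decode and blocking messaging. */
extern uint16_t my_node(void);
extern uint16_t node_of(uint64_t global_addr);
extern void net_send(uint16_t dest, const void *msg, size_t len);
extern void net_recv(void *buf, size_t len);

typedef struct { uint64_t addr; uint16_t src_node; } read_req_t;
typedef struct { uint8_t data[BLOCK]; } read_reply_t;

void remote_read_block(uint64_t global_addr, void *buf) {
    read_req_t  req = { .addr = global_addr, .src_node = my_node() };
    read_reply_t reply;
    net_send(node_of(global_addr), &req, sizeof req);  /* request transaction */
    net_recv(&reply, sizeof reply);                     /* stall until the response */
    memcpy(buf, reply.data, BLOCK);
}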
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 20
Shared Address Space Issues
• Coherent or non-coherent:
  – fully cache-coherent, similar to an SMP
  – coherence granularity: cache block, page, etc.
  – remote shared accesses only: Cray T3D
• Does it allow overlapping multiple transactions?
  – consider memory consistency models: chapter 9
• Implemented in H/W or S/W:
  – for S/W systems, H/W provides messaging
  – H/W or S/W can provide protection, flow control, etc.
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 21
Shared Address Space Issues
• Assume implemented in H/W
  – issues similar to SMPs (see chapter 8)
  – protection primarily enforced by virtual memory
  – data address: sender & receiver NIs know what to do
• Coherent:
  – block transfers from cache/memory to the network
  – H/W protocol implements coherence
• Non-coherent:
  – single/double-word reads and writes from proc to network
• Support for larger transfers using DMA H/W
  – efficient block moves
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 22
Message Passing Requirements
• Messaging abstraction on a distributed memory computer
• Conceptually a one-way communication:
  – sender sends a (variable-sized) message
  – receiver receives
• But the implementation is more complicated
• Asynchronous vs synchronous messaging
  – synchronous: sender waits until reception is assured
  – asynchronous: sender sends and goes back to computation
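In MPI terms (MPI is named on the earlier model/implementation slide), the two flavors look roughly like this; do_other_work() stands in for whatever computation the sender overlaps with the asynchronous send.

/* Synchronous vs asynchronous sends, sketched with standard MPI calls. */
#include <mpi.h>

extern void do_other_work(void);   /* hypothetical overlapped computation */

void send_both_ways(double *x, int n, int dest, MPI_Comm comm) {
    /* Synchronous: returns only after the matching receive has started. */
    MPI_Ssend(x, n, MPI_DOUBLE, dest, 0, comm);

    /* Asynchronous: start the send, go back to computation, complete later. */
    MPI_Request req;
    MPI_Isend(x, n, MPI_DOUBLE, dest, 1, comm, &req);
    do_other_work();
    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* x may be reused only after this */
}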
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 23
Message Passing Requirements
• Synchronous messaging
[Figure: synchronous messaging — the source's "send X, dest" becomes a send request to the destination, which does a tag check to see whether the receiver is ready; the source waits until a "ready" reply arrives, and only then does the data transfer take place]
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 24
Message Passing Issues
• Sender must specify where the data is:– typically in contiguous block of memory– can optimize for direct transfers from data structures: gather/scatter
• Proc or NI must transfer from memory to NI buffers– what about NI protection?– Can send through OS: but requires multiple data copies– Does NI have access to application memory?
• Unless support for low-overhead messaging ���� send large messages to amortize overheads
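A sketch of the gather idea: describe the pieces of the message in place and let the NI (or a copy loop) walk the list, instead of packing everything into a contiguous staging buffer first. ni_send_gather() is a hypothetical NI call; struct iovec is the standard POSIX descriptor, used here only for illustration.

/* Gather-list send: the message is assembled from non-contiguous pieces. */
#include <sys/uio.h>

/* Hypothetical NI call that walks a gather list without an intermediate copy. */
extern void ni_send_gather(int dest, const struct iovec *v, int count);

void send_row_with_halo(double *left, double *row, int n, double *right, int dest) {
    struct iovec gather[3] = {
        { .iov_base = left,  .iov_len = sizeof(double) },
        { .iov_base = row,   .iov_len = (size_t)n * sizeof(double) },
        { .iov_base = right, .iov_len = sizeof(double) },
    };
    ni_send_gather(dest, gather, 3);   /* three pieces, no staging buffer */
}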
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 25
Message Passing Requirements
• Asynchronous messaging: (optimistic)
[Figure: optimistic asynchronous messaging — the source's "send X, dest" goes straight to the destination and the sender goes back to computation; the destination does a tag check and, if no matching receive is posted, buffers the message]
What are the problems with this approach?
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 26
Message Passing Issues
• Performing the tag match in software to find where the data goes:
  – slow process
  – can always buffer the message for later processing
  – how much buffering is enough?
• Data typically arrives at a high rate for large messages
• What if multiple senders are sending to one receiver?
  – need flow control, or messages may be dropped
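A simplified sketch of receiver-side tag matching in software: an arriving message is matched against posted receives and buffered on an "unexpected" queue otherwise, which is exactly where the "how much buffering is enough?" question bites. The data structures and names below are invented.

/* Software tag matching at the receiver (illustrative only). */
#include <stdlib.h>
#include <string.h>

typedef struct posted { int src, tag; void *buf; size_t len; struct posted *next; } posted_t;
typedef struct unexp  { int src, tag; void *data; size_t len; struct unexp  *next; } unexp_t;

static posted_t *posted;      /* receives the application has posted */
static unexp_t  *unexpected;  /* messages that arrived before a matching receive */

void on_arrival(int src, int tag, const void *data, size_t len) {
    /* Scan posted receives; deliver directly into the user buffer on a match. */
    for (posted_t **pp = &posted; *pp; pp = &(*pp)->next) {
        if ((*pp)->src == src && (*pp)->tag == tag) {
            memcpy((*pp)->buf, data, len < (*pp)->len ? len : (*pp)->len);
            *pp = (*pp)->next;   /* unlink; the posted record is owned by the poster */
            return;
        }
    }
    /* No match: buffer the whole message -- how much of this can we afford? */
    unexp_t *u = malloc(sizeof *u);
    u->src = src; u->tag = tag; u->len = len;
    u->data = malloc(len);
    memcpy(u->data, data, len);
    u->next = unexpected;
    unexpected = u;
}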
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 27
Message Passing Requirements
• Asynchronous messaging: (conservative)
[Figure: the source's "send X, dest" first issues a send request and the sender goes back to computation; the destination does a tag check and, if no matching receive is posted, waits; when the receive is posted it replies "ready", and the data transfer then completes at the receive]
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 28
Active Messages
• User-level analog of the network transaction
  – invoke a handler function at the receiver to extract the packet from the network
  – grew out of attempts to do dataflow programming on message-passing machines
  – handler may send a reply, but no other messages
• May also perform memory-to-memory transfer
[Figure: a request message invokes a handler at the destination; that handler's reply in turn invokes a handler back at the source]
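A sketch of the active-message idea, not the exact Berkeley AM interface: the packet carries the handler to run (the "PC") plus its arguments, and the receiver simply calls it when the packet is extracted from the network.

/* Active-message delivery (illustrative only). */
#include <stdint.h>

/* The handler "PC" travels in the message; this only works as sketched if the
 * code is loaded at the same address on every node. */
typedef void (*am_handler_t)(int src_node, uint64_t arg0, uint64_t arg1);

typedef struct {
    int          src_node;
    am_handler_t handler;    /* which function to run at the receiver */
    uint64_t     arg0, arg1;
} am_packet_t;

/* Receiver: pull the packet out of the network and run its handler at once.
 * The handler may send a reply, but no other messages. */
void am_deliver(am_packet_t *pkt) {
    pkt->handler(pkt->src_node, pkt->arg0, pkt->arg1);
}

/* Example handler: deposit a value into a counter (no reply needed). */
static uint64_t counter;
void add_handler(int src, uint64_t amount, uint64_t unused) {
    (void)src; (void)unused;
    counter += amount;
}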
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 29
Outline
• Motivation
• Network Transaction Primitive
• Supporting Programming Models
• Case Studies
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 30
Massively Parallel Processor (MPP) Architectures
• Network interface typically close to processor
– Memory bus:
  » locked to a specific processor architecture/bus protocol
– Registers/cache:
  » only in research machines
• Time-to-market is long → use an available processor or work closely with processor designers
• Maximize performance and cost
[Figure: MPP node — processor and cache sit on the memory bus with main memory and the network interface; an I/O bridge connects the memory bus to an I/O bus with a disk controller and disks]
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 31
Network of Workstations
• Network interface on I/O bus
• Standards (e.g., PCI) → longer life, faster to market
• Slow (microseconds) to access network interface
• “System Area Network” (SAN): between LAN & MPP
[Figure: workstation node — processor, cache, and main memory behind a core chip set; the network interface sits on the standard I/O bus alongside the disk and graphics controllers and signals the processor with interrupts]
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 32
Transaction Interpretation
• Simplest MP: assist doesn’t interpret
  – DMA from/to buffer; interrupt or set flag on completion
• User-level messaging: get the OS out of the way
  – assist does protection checks to allow direct user access to the network
  – may have minimal interpretation otherwise
• Virtual DMA: get the CPU out of the way (maybe)
  – basic protection plus address translation: user-level bulk DMA
• Global physical address space (NUMA): everything in hardware
  – complexity increases, but performance does too (if done right)
• Cache coherence: even more so
  – stay tuned
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 33
Spectrum of Designs
• None: physical bit stream
  – physical DMA: nCUBE, iPSC, ...
• User/System
  – user-level port: CM-5, *T
  – user-level handler: J-Machine, Monsoon, ...
• Remote virtual address
  – processing, translation: Paragon, Meiko CS-2, Myrinet, Memory Channel, SHRIMP
• Global physical address
  – proc + memory controller: RP3, BBN, T3D, T3E
• Cache-to-cache (later)
  – cache controller: Dash, KSR, Flash
(Down the list: increasing HW support, specialization, intrusiveness, performance (???))
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 34
Net Transactions: Physical DMA
• Physical addresses: the OS must initiate transfers
  – system call per message on both ends: very high overhead
• Sending OS copies data to a kernel buffer w/ header/trailer
  – can avoid the copy if the interface does scatter/gather
• Receiver copies the packet into an OS buffer, then interprets it
  – user message then copied (or mapped) into user space
[Figure: physical DMA — the sending processor/memory posts a command (dest, data) and an address/length/ready descriptor for the DMA channels; on the receiving node the DMA channel deposits the packet at an address/length and signals status or an interrupt]
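A sketch of the OS-mediated send path described above: one system call per message, a copy into a kernel buffer wrapped with header and trailer, then the DMA channel is started on physical addresses. kbuf_alloc(), virt_to_phys(), and dma_start() are hypothetical kernel helpers, not a real kernel API.

/* System-call send path for physical DMA (illustrative only). */
#include <stdint.h>
#include <string.h>

/* Assumed (hypothetical) kernel helpers. */
extern uint8_t *kbuf_alloc(size_t bytes);
extern uint64_t virt_to_phys(const void *p);
extern void     dma_start(uint64_t phys_addr, size_t bytes);

typedef struct { uint16_t dest; uint32_t len; } hdr_t;

/* One trap per message, one copy into a kernel buffer. */
long sys_msg_send(int dest, const void *user_buf, uint32_t len) {
    size_t total = sizeof(hdr_t) + len + sizeof(uint32_t);   /* header + data + trailer */
    uint8_t *kbuf = kbuf_alloc(total);
    hdr_t hdr = { .dest = (uint16_t)dest, .len = len };

    memcpy(kbuf, &hdr, sizeof hdr);                         /* header */
    memcpy(kbuf + sizeof hdr, user_buf, len);               /* copy: user -> kernel */
    memset(kbuf + sizeof hdr + len, 0, sizeof(uint32_t));   /* trailer/checksum slot */

    dma_start(virt_to_phys(kbuf), total);   /* NI DMAs from physical memory */
    return 0;   /* completion signaled later by interrupt or status flag */
}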
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 35
Conventional LAN Network Interface
[Figure: conventional LAN NIC on the I/O bus — the NIC controller DMAs to/from host memory using rings of transmit (TX) and receive (RX) descriptors, each holding an address, length, status, and next pointer]
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 36
User Level Ports
• Map network hardware into the user’s address space
  – talk directly to the network via loads & stores
• User-to-user communication without OS intervention: low latency
• Protection: user/user & user/system
• DMA difficult… CPU involvement (copying) becomes the bottleneck
[Figure: user-level ports — the sending processor writes dest/data directly to the NI, tagged user/system; the receiving processor sees status or takes an interrupt]
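A sketch of what a user-level port looks like to software once the NI registers are mapped into the user's address space: sends and receives become ordinary stores and loads, with no system call. The register layout below is invented.

/* Memory-mapped user-level port (illustrative only; register layout invented). */
#include <stdint.h>

typedef struct {
    volatile uint32_t out_status;   /* nonzero => output FIFO has room */
    volatile uint32_t out_data;     /* store here to inject a word */
    volatile uint32_t in_status;    /* nonzero => input FIFO has a word */
    volatile uint32_t in_data;      /* load here to extract a word */
} user_port_t;

void port_send_word(user_port_t *p, uint32_t w) {
    while (!p->out_status) ;        /* spin: user-level flow control */
    p->out_data = w;                /* a plain store reaches the network */
}

uint32_t port_recv_word(user_port_t *p) {
    while (!p->in_status) ;         /* what happens if the user never pops? */
    return p->in_data;              /* a plain load pulls from the input FIFO */
}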
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 37
User Level Network ports
• Appears to the user as logical message queues plus status
• What happens if the user never pops the input queue?
[Figure: the net output port, net input port, and status appear in the process's virtual address space, alongside its program counter and registers]
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 38
Example: CM-5
• Input and output FIFOs for each network
• Two data networks
• Save/restore network buffers on context switch
[Figure: CM-5 — data, control, and diagnostics networks connect processing partitions, control processors, and an I/O partition; each node has a SPARC processor on the MBUS with FPU, cache controller and SRAM, DRAM controllers with DRAM, vector units, and an NI attached to the data and control networks]
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 39
User Level Handlers
• Hardware support to vector to the address specified in the message
  – message ports in registers
  – alternate register set for the handler?
• Examples: J-Machine, Monsoon, *T (MIT), iWARP (CMU)
[Figure: like the user-level port organization, but the message carries a destination, data, and a handler address, tagged user/system]
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 40
J-Machine
• Each node a small message-driven processor
• HW support to queue msgs and dispatch to msg handler task
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 41
Dedicated Message Processing Without Specialized Hardware
• Microprocessor performs arbitrary output processing (at system level)
• Microprocessor interprets incoming network transactions (in system)
• User Processor <–> Msg Processor: share memory
• Msg Processor <–> Msg Processor: via system network transactions
[Figure: each node pairs a compute processor (P) and a message processor (MP) sharing memory (Mem) and an NI on the network; the user level runs on P, the system level on MP]
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 42
Levels of Network Transaction
• User Processor stores cmd / msg / data into a shared output queue (see the sketch below)
  – must still check for output queue full (or grow it dynamically)
• Communication assists make the transaction happen
  – checking, translation, scheduling, transport, interpretation
• Avoids system call overhead
• Multiple bus crossings are the likely bottleneck
[Figure: same node organization as the previous slide — the compute processor (P) and message processor (MP) share memory and the NI; the user-level output queue lives in that shared memory]
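A sketch of that shared output queue: the user (compute) processor produces send descriptors, and the message processor consumes them and drives the NI. The queue layout, ordering assumptions, and msgproc_transmit() are invented for illustration.

/* Shared output queue between user processor and message processor (illustrative). */
#include <stdint.h>

#define QSIZE 64

typedef struct { uint16_t dest; uint32_t len; uint64_t addr; } out_desc_t;

typedef struct {
    volatile uint32_t head;       /* consumed by the message processor */
    volatile uint32_t tail;       /* produced by the user processor */
    out_desc_t slot[QSIZE];
} out_queue_t;

/* Hypothetical: MP-side routine that turns a descriptor into a network transaction. */
extern void msgproc_transmit(const out_desc_t *d);

/* User processor: enqueue a send descriptor; must still handle a full queue. */
int user_enqueue(out_queue_t *q, out_desc_t d) {
    uint32_t next = (q->tail + 1) % QSIZE;
    if (next == q->head) return -1;   /* full: retry, block, or grow the queue */
    q->slot[q->tail] = d;
    q->tail = next;                   /* assumes adequate memory ordering/fences */
    return 0;
}

/* Message processor: system-level loop that drains the queue into the NI. */
void msgproc_loop(out_queue_t *q) {
    for (;;) {
        while (q->head == q->tail) ;  /* poll the shared queue */
        msgproc_transmit(&q->slot[q->head]);
        q->head = (q->head + 1) % QSIZE;
    }
}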
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 43
Example: Intel Paragon
[Figure: Paragon node — i860 XP compute and message processors (50 MHz, 16 KB 4-way caches, 32 B blocks, MESI) share memory over a 64-bit, 400 MB/s bus; the NI has send and receive DMA (sDMA/rDMA), 2048 B of buffering, and 175 MB/s duplex network links; the network also connects I/O and service nodes with their devices; packets carry a route, an MP handler, variable data, and an EOP]
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 44
Dedicated MP w/ Specialized NI: Meiko CS-2
• Integrate the message processor into the network interface
  – active-messages-like capability
  – dedicated threads for DMA, reply handling, simple remote memory access
  – supports user-level virtual DMA
    » own page table
    » can take a page fault, signal the OS, restart
      • meanwhile, nack the other node
• Problem: the processor is slow and time-slices threads
  – fundamental issue with building your own CPU
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 45
Myricom Myrinet (Berkeley NOW)
• Programmable network interface on the I/O bus (Sun SBUS or PCI)
  – embedded custom CPU (“Lanai”, ~40 MHz RISC CPU)
  – 256 KB SRAM
  – 3 DMA engines: to network, from network, to/from host memory
• Downloadable firmware executes in kernel mode
  – includes a source-based routing protocol
• SRAM pages can be mapped into user space
  – separate pages for separate processes
  – firmware can define status words, queues, etc.
    » data for short messages or pointers for long ones
    » firmware can do address translation too… w/ OS help
  – polls to check for sends from the user
• Bottom line: the I/O bus is still the bottleneck, and the CPU could be faster
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 46
Shared Physical Address Space
• Implement the SAS model in hardware w/o caching
  – actual caching must be done by copying from remote memory to local memory
  – programming paradigm looks more like message passing than Pthreads
    » yet, low-latency & low-overhead transfers thanks to HW interpretation; high bandwidth too if done right
    » result: great platform for MPI & compiled data-parallel codes
• Implementation:
  – “pseudo-memory” acts as the memory controller for remote memory, converting accesses into network transactions (requests)
  – “pseudo-CPU” on the remote node receives requests, performs them on local memory, and sends replies
  – a split-transaction or retry-capable bus is required (or dual-ported memory)
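A sketch of the remote ("pseudo-CPU") side of this scheme: it pulls request transactions off the network, performs them on local memory, and sends the reply that the requester's pseudo-memory is stalled on. The request format and the net_recv_req()/net_send_reply()/local() helpers are invented.

/* Pseudo-CPU service loop for a non-cached shared physical address space (illustrative). */
#include <stdint.h>

typedef struct { uint8_t is_write; uint16_t src; uint64_t addr; uint64_t wdata; } req_t;

/* Assumed (hypothetical) helpers. */
extern void net_recv_req(req_t *r);
extern volatile uint64_t *local(uint64_t global_addr);   /* global addr -> local pointer */
extern void net_send_reply(uint16_t dest, uint64_t data);

void pseudo_cpu_loop(void) {
    for (;;) {
        req_t r;
        net_recv_req(&r);                    /* next incoming request transaction */
        volatile uint64_t *p = local(r.addr);
        if (r.is_write) {
            *p = r.wdata;
            net_send_reply(r.src, 0);        /* write acknowledgment */
        } else {
            net_send_reply(r.src, *p);       /* read data back to the requester */
        }
    }
}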
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 47
Example: Cray T3D
• Up to 2,048 Alpha 21064s
  – no off-chip L2, to avoid its inherent latency
• In addition to remote memory ops, includes:
  – prefetch buffer (hides remote latency)
  – DMA engine (requires an OS trap)
  – synchronization operations (swap, fetch&inc, global AND/OR)
  – message queue (requires an OS trap on the receiver)
• Big problem: physical address space
  – the 21064 supports only 32 bits
  – a 2K-node machine is limited to 2 MB per node
  – an external “DTB annex” provides segment-like registers for extended addressing, but managing them is expensive & ugly
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 48
Cray T3E
• Similar to the T3D, but uses the Alpha 21164 instead of the 21064 (on-chip L2)
  – still has physical address space problems
• E-registers for remote communication and synchronization
  – 512 user, 128 system; 64 bits each
  – replace/unify the DTB Annex, prefetch queue, block transfer engine, remote load/store, and message queue
  – the address specifies the source or destination E-register and a command
  – the data contains a pointer to a block of 4 E-regs and an index for the centrifuge
• Centrifuge
  – supports data distributions used in data-parallel languages (HPF)
  – 4 E-regs for a global memory operation: mask, base, two arguments
• Get & Put Operations
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 49
T3E (continued)
• Atomic memory operations
  – E-registers & centrifuge used
  – F&I, F&Add, Compare&Swap, Masked_Swap
• Messaging
  – arbitrary number of queues (user or system)
  – 64-byte messages
  – create a msg queue by storing a message control word to a memory location
• Msg Send
  – construct the data in an aligned block of 8 E-regs
  – send like a put, but the dest must be a message control word
  – the processor is responsible for queue space (buffer management)
• Barrier and Eureka synchronization
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 50
DEC Memory Channel (Princeton SHRIMP)
• Reflective memory
• Writes on the sender appear in the receiver’s memory
  – send & receive regions
  – page control table
• Receive region is pinned in memory
• Requires duplicate writes; really just message buffers
[Figure: a virtual send region on the sender and a virtual receive region on the receiver, each mapped to physical memory; writes to the sender's pages are reflected into the receiver's physical pages]
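A sketch of how such reflective memory gets used: the sender's stores to a mapped transmit region reappear at the same offsets in the receiver's pinned receive region, so the regions behave like message buffers with a sequence word written last. The slot layout and consume() are invented; this is not the actual Memory Channel API.

/* Reflective-memory style message buffer (illustrative only). */
#include <stdint.h>

extern void consume(uint32_t v);   /* hypothetical use of the delivered data */

/* The same slot layout exists in the sender's mapped transmit region and in
 * the receiver's pinned receive region. */
typedef struct { uint32_t payload[15]; volatile uint32_t seq; } mc_slot_t;

void mc_send(mc_slot_t *tx, uint32_t value, uint32_t seq) {
    tx->payload[0] = value;   /* a duplicate write: it also goes out on the wire */
    tx->seq = seq;            /* written last, marks the message as complete */
}

void mc_recv(mc_slot_t *rx, uint32_t expected_seq) {
    while (rx->seq != expected_seq) ;   /* wait for the sender's writes to appear */
    consume(rx->payload[0]);
}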
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 51
Performance of Distributed Memory Machines
• Microbenchmarking
• One-way latency of a small (five-word) message
  – echo test
  – round-trip time divided by 2
• Shared-memory remote read
• Message-passing operations
  – see text
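A minimal echo-test microbenchmark in MPI, in the spirit of the measurement described above: time many round trips of a small five-word message and divide by two for the one-way latency. The iteration count is arbitrary; run with two ranks.

/* Ping-pong (echo) latency microbenchmark. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    long msg[5] = {0};               /* five-word message */
    const int iters = 10000;

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(msg, 5, MPI_LONG, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(msg, 5, MPI_LONG, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(msg, 5, MPI_LONG, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(msg, 5, MPI_LONG, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* round trip / 2 = one-way latency */
        printf("one-way latency: %.2f microseconds\n", (t1 - t0) / iters / 2.0 * 1e6);
    MPI_Finalize();
    return 0;
}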
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 52
Network Transaction Performance
[Figure 7.31: network transaction performance in microseconds — send overhead (Os), receive overhead (Or), latency (L), and gap for the CM-5, Paragon, CS-2, NOW, and T3D]
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 53
Remote Read Performance
[Figure 7.32: remote read performance in microseconds — issue time, latency (L), and gap for the CM-5, Paragon, CS-2, NOW, and T3D]
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 54
Summary of Distributed Memory Machines
• Convergence of architectures
  – everything “looks basically the same”
  – processor, cache, memory, communication assist
• Communication assist
  – where is it? (I/O bus, memory bus, processor registers)
  – what does it know?
    » does it just move bytes, or does it perform some functions?
  – is it programmable?
  – does it run user code?
• Network transaction
  – input & output buffering
  – action on the remote node
(c) 2001 Mark D. Hill from Adve, Falsafi, Lebeck, Reinhardt & Singh 55
Outline
• Motivation
• Network Transaction Primitive
• Supporting Programming Models
• Case Studies