CS 213 Commercial Multiprocessors

Post on 19-Dec-2015


Origin2000 System – Shared Memory

• Directory state in same or separate DRAMs, accessed in parallel
• Up to 512 nodes (1024 processors)
• With 195 MHz R10K processor, peak 390 MFLOPS or 780 MIPS per processor
• Peak SysAD bus bandwidth is 780 MB/s, as is Hub-to-memory bandwidth
• Hub to router chip and to Xbow is 1.56 GB/s (both are off-board)
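The peak figures in the bullets above can be checked with a little arithmetic. The sketch below is a back-of-envelope check, not a vendor formula: the per-cycle issue rates, the bus width, the bus clock ratio, and the two-processors-per-node count are assumptions chosen because they reproduce the numbers on the slide.

```python
# Back-of-envelope check of the Origin2000 peak figures quoted on the slide.
CLOCK_MHZ = 195   # R10K clock, from the slide
MAX_NODES = 512   # from the slide

# Assumptions (not stated on the slide):
FLOPS_PER_CYCLE = 2     # floating-point results per cycle
INSTR_PER_CYCLE = 4     # instructions issued per cycle
PROCS_PER_NODE = 2      # processors attached to each Hub
SYSAD_BYTES = 8         # SysAD data path width in bytes
SYSAD_CLOCK_RATIO = 0.5 # SysAD clock as a fraction of the CPU clock

peak_mflops = CLOCK_MHZ * FLOPS_PER_CYCLE
peak_mips = CLOCK_MHZ * INSTR_PER_CYCLE
sysad_mb_per_s = SYSAD_BYTES * CLOCK_MHZ * SYSAD_CLOCK_RATIO
max_processors = MAX_NODES * PROCS_PER_NODE

print(f"peak MFLOPS per processor: {peak_mflops}")
print(f"peak MIPS per processor:   {peak_mips}")
print(f"peak SysAD bandwidth MB/s: {sysad_mb_per_s:.0f}")
print(f"maximum processors:        {max_processors}")
```

Under these assumptions the script reproduces 390 MFLOPS, 780 MIPS, 780 MB/s, and 1024 processors.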

[Figure: two Origin2000 nodes connected by the interconnection network. Each node has two processors (P), each with a 1-4 MB L2 cache, on a SysAD bus to a Hub; the Hub connects to main memory (1-4 GB) and the directory.]

Origin Network

• Each router has six pairs of 1.56 GB/s unidirectional links
– Two to nodes, four to other routers
– Latency: 41 ns pin to pin across a router

• Flexible cables up to 3 ft long

• Four “virtual channels”: request, reply, other two for priority or I/O
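The 41 ns pin-to-pin figure gives a rough lower bound on the time a message spends inside routers. The sketch below multiplies that figure by the number of routers traversed; the function name is hypothetical, and the estimate deliberately ignores cable delay, Hub processing, and contention, none of which are quantified on the slide.

```python
ROUTER_PIN_TO_PIN_NS = 41  # per-router latency, from the slide


def router_delay_ns(routers_traversed: int) -> int:
    """Time spent inside routers only (no wire, Hub, or queuing delay)."""
    if routers_traversed < 0:
        raise ValueError("routers_traversed must be non-negative")
    return routers_traversed * ROUTER_PIN_TO_PIN_NS


# Example: in-router delay for a few path lengths.
for n in (1, 2, 5):
    print(f"{n} router(s): {router_delay_ns(n)} ns")
```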

[Figure: Origin network topologies. Panels: (b) 4-node, (c) 8-node, (d) 16-node, 32-node, (e) 64-node. Labels: N (node), meta-router.]

Cray T3D – Shared Memory

• Build up info in ‘shell’
• Remote memory operations encoded in address

[Figure: Cray T3D node. Labels recovered from the diagram:]

• Processor (P, $, MMU): 150-MHz DEC Alpha (64 bit), 8-KB instruction + 8-KB data, 43-bit virtual address, 32-bit physical address
• Processor features: prefetch; load-lock, store-conditional; 32- and 64-bit memory and byte operations; nonblocking stores and memory barrier
• DTB; PE# + FC
• Prefetch queue: 16 × 64
• Message queue: 4,080 × 4 × 64
• Special registers: swaperand, fetch&add, barrier
• Block transfer engine (DMA)
• Ports: Req in, Req out, Resp in, Resp out
• DRAM
• 3D torus of pairs of PEs: share net and BLT, up to 2,048, 64 MB each
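To illustrate what "remote memory operations encoded in address" can mean, here is a minimal sketch in which the upper bits of a 32-bit physical address select a table entry holding a destination PE number and a function code (the "PE# + FC" label), and the lower bits give the offset in that PE's memory. The field widths, the table size, and the names `AnnexEntry`, `encode`, and `decode` are assumptions made for illustration; the slide does not give the actual layout.

```python
from dataclasses import dataclass

# Hypothetical field widths, chosen only so that they add up to 32 bits.
INDEX_BITS = 5    # selects one of 32 table entries
OFFSET_BITS = 27  # offset within the target PE's local memory


@dataclass(frozen=True)
class AnnexEntry:
    pe: int    # destination processing element number (PE#)
    func: str  # function code (FC), e.g. "read", "write", "fetch&add"


def encode(index: int, offset: int) -> int:
    """Build a physical address from a table index and a local offset."""
    if not 0 <= index < (1 << INDEX_BITS):
        raise ValueError("index out of range")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset out of range")
    return (index << OFFSET_BITS) | offset


def decode(addr: int, table: list) -> tuple:
    """Return (PE#, function code, offset) for a physical address."""
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    offset = addr & ((1 << OFFSET_BITS) - 1)
    entry = table[index]
    return entry.pe, entry.func, offset


# Usage: entry 3 is set up to mean "fetch&add on PE 17".
table = [AnnexEntry(pe=0, func="local")] * (1 << INDEX_BITS)
table[3] = AnnexEntry(pe=17, func="fetch&add")
print(decode(encode(3, 0x100), table))
```

In a scheme like this, an ordinary load or store instruction becomes a remote operation purely through the address it uses, so no new instructions are needed.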

IBM Power 4 – Shared Memory

Power-4 Multi-chip Module

32-way SMP

NOW – Message Passing

• General purpose processor embedded in NIC to implement VIA – to be discussed later

[Figure: NOW node and Myrinet network. Labels recovered from the diagram:]

• Host: UltraSparc, L2 $, Mem, bus adapter, SBUS (25 MHz)
• Myricom Lanai NIC (37.5-MHz processor, 256-MB SRAM, 3 DMA units): main processor, SRAM, host DMA, s DMA, r DMA, bus interface, link interface
• Myrinet: 160-MB/s bidirectional links, eight-port wormhole switches, X-bar

Myrinet – Message Passing

Interface Processor

InfiniBand – Message Passing

Latency Comparison

Cray XD1 – Message Passing

Four chassis, each holding six blades; each blade contains a dual (quad) 2.4 GHz AMD Opteron motherboard with 4 GB of RAM and one 74 GB hard disk. The interconnection topology, shown in Fig. 1, has three levels of latency:

1. communication between the CPUs inside one blade, through shared memory;

2. very fast message-passing communication among blades within a chassis;

3. slower message-passing communication between two different chassis.
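The three latency levels can be expressed as a small classification over CPU locations. The sketch below is illustrative only: `CpuLocation` and `communication_level` are hypothetical names, and the numbering of chassis and blades is an assumption.

```python
from typing import NamedTuple


class CpuLocation(NamedTuple):
    chassis: int  # 0..3 in the four-chassis system described above
    blade: int    # 0..5 within a chassis


def communication_level(a: CpuLocation, b: CpuLocation) -> int:
    """Return 1, 2, or 3, matching the numbered latency levels above."""
    if a.chassis != b.chassis:
        return 3  # message passing between two different chassis
    if a.blade != b.blade:
        return 2  # message passing among blades within one chassis
    return 1      # shared memory inside one blade


# Usage: two CPUs on the same blade, then on different blades, then chassis.
print(communication_level(CpuLocation(0, 2), CpuLocation(0, 2)))
print(communication_level(CpuLocation(0, 2), CpuLocation(0, 5)))
print(communication_level(CpuLocation(0, 2), CpuLocation(3, 2)))
```

A job scheduler or MPI rank-placement script could use a function like this to keep heavily communicating processes at the lowest available level.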

Latency Comparison

Dual (Quad) SMP and Hardware Accelerator

IBM SP2 – Message Passing