
SGI’2000 Parallel Programming Tutorial

Supercomputers 2

With the acknowledgement of Igor Zacharov and Wolfgang Mertz, SGI European Headquarters

Classification of Computers

MIMD

• Multiprocessors: single address space, shared memory
    • UMA (central memory)
        • PVP (Cray T90)
        • SMP (Intel SHV, SUN E10000, DEC 8400, SGI Power Challenge, IBM R60, etc.)
    • NUMA (distributed memory)
        • COMA (KSR-1, DDM)
        • CC-NUMA (SGI Origin2000, SN1 (SGI3000), Cray T3E, HP Exemplar, Sequent NUMA-Q, Data General)
        • NCC-NUMA (Cray T3D, IBM SP3)
• Multicomputers: multiple address spaces
    • NORMA (no remote memory access)
        • Cluster (IBM SP2, DEC TruCluster, Microsoft Wolfpack, “Beowulf”, etc.): loosely coupled, multiple OS
        • “MPP” (Intel TFLOPS, TM-5): tightly coupled & single OS

Abbreviations:

• MIMD: Multiple Instructions, Multiple Data
• PVP: Parallel Vector Processor
• UMA: Uniform Memory Access
• SMP: Symmetric Multi-Processor
• NUMA: Non-Uniform Memory Access
• COMA: Cache Only Memory Architecture
• NORMA: No-Remote Memory Access
• CC-NUMA: Cache-Coherent NUMA
• MPP: Massively Parallel Processor
• NCC-NUMA: Non-Cache Coherent NUMA

Design Space of Competing Computer Architecture

Structure of an SMP System (1)

[Figure: processors, each with its own cache, together with main memory banks and I/O units, all attached to a central bus]

• Does NOT scale due to Bus-saturation

• Bus is a very complex Component

• High Memory-Latency due to the Complexity
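The bus-saturation point can be illustrated with a toy model. This is only a sketch: the bandwidth figures are hypothetical and do not come from the tutorial. A shared bus has a fixed total bandwidth, so once the combined demand of the processors exceeds it, each additional processor lowers the share available to all of them.

```python
def bus_share(total_bus_bw, demand_per_cpu, n_cpus):
    """Bandwidth each CPU actually gets on a shared bus (same units as the inputs)."""
    fair_share = total_bus_bw / n_cpus
    return min(demand_per_cpu, fair_share)


def saturation_point(total_bus_bw, demand_per_cpu):
    """Largest CPU count whose combined demand still fits on the bus."""
    return int(total_bus_bw // demand_per_cpu)


# Hypothetical numbers, for illustration only.
BUS_BW = 1200.0      # MB/s, assumed total bus bandwidth
CPU_DEMAND = 150.0   # MB/s, assumed memory traffic of one processor

for n in (2, 4, 8, 16, 32):
    print(n, "CPUs ->", bus_share(BUS_BW, CPU_DEMAND, n), "MB/s per CPU")
print("bus saturates beyond", saturation_point(BUS_BW, CPU_DEMAND), "CPUs")
```

With these assumed values every processor is fully served up to 8 CPUs; beyond that the per-processor share halves with each doubling of the processor count.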

Structure of an SMP System (2)

[Figure: processors, each with its own cache, together with main memory banks and I/O units, all attached to a central crossbar]

• Scales very well

• Crossbar is a very complex Component

• High Memory-Latency due to the Complexity
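One way to see why the crossbar is a complex component: a full crossbar between N processor ports and M memory ports needs N × M crosspoints, whereas a bus needs only one attachment per device. The sketch below simply counts both; the port counts are illustrative and not taken from the slide.

```python
def crossbar_crosspoints(n_cpu_ports, n_mem_ports):
    """Crosspoint switches in a full crossbar: one per (CPU port, memory port) pair."""
    return n_cpu_ports * n_mem_ports


def bus_attachments(n_cpu_ports, n_mem_ports):
    """A shared bus needs one attachment per connected device."""
    return n_cpu_ports + n_mem_ports


# Illustrative sizes only: equal numbers of CPU and memory ports.
for n in (4, 8, 16, 32):
    print(n, "ports per side:",
          bus_attachments(n, n), "bus attachments,",
          crossbar_crosspoints(n, n), "crosspoints")
```

Doubling the number of ports on both sides doubles the bus attachments but quadruples the crosspoints.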

Structure of an SMP System (3): Origin SGI NUMA Architecture

[Figure: SGI NUMA hypercube. Nodeboards (N), each with I/O, attach to routers (R); the routers are joined by a global switch interconnect.]
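The slide names the router topology only as a hypercube. As a minimal sketch of what that implies in general (the node numbering is an assumption for illustration, not SGI's actual router addressing): in a d-dimensional hypercube each node has d neighbours whose binary labels differ in exactly one bit, and the number of hops between two nodes is the number of differing bits.

```python
def hypercube_neighbors(node, dim):
    """Labels of the `dim` neighbours of `node`: flip one bit at a time."""
    return [node ^ (1 << k) for k in range(dim)]


def hop_count(a, b):
    """Minimum number of hops between nodes a and b (Hamming distance)."""
    return bin(a ^ b).count("1")


DIM = 3  # hypothetical example: 2**3 = 8 routers
for node in range(2 ** DIM):
    print(format(node, "03b"), "->",
          [format(n, "03b") for n in hypercube_neighbors(node, DIM)])

# The worst case is between complementary labels: `DIM` hops.
print(hop_count(0b000, 0b111))
```

The worst-case hop count grows only with log2 of the number of routers, which is why remote latency can rise slowly as the system grows.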

Systems are built from Modules

• Deskside (Module): 2-8 CPUs
• Rack (2 Modules): 16 CPUs
• Multi-rack (4 Modules): 32 CPUs
• Etc...: up to 128 CPUs

New High-End Products: Origin 3000 Servers – Onyx 3 Systems

• SGI Origin 3200, SGI Onyx 3200
• SGI Origin 3400, SGI Onyx 3400
• SGI Origin 3800, SGI Onyx 3800

IRIX 6.5

SGI 3800 System (16-512p)

[Figure: rack layouts of the minimum (16p) system and a 128p system, built from C-Bricks, R-Bricks (8-port router), I-Bricks, P-, I-, or X-Bricks, and Power Bays]

[Figure: 128P system topology. Racks 1 to 4, each with C-Bricks (C) connected to R-Bricks (R)]

ASCI Blue Mountain, Los Alamos National Laboratory

Origin 2000 with 3+ Tflops peak

1+ Tflop Application Performance

48 Systems with 128 CPUs each = 6144 CPUs

1536 Gbyte Memory

76 Tbyte Diskspace
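The figures above can be cross-checked with a little arithmetic. The sketch below uses only the numbers on the slide, with two labelled assumptions: "3+ Tflops" is treated as exactly 3 Tflops (so the per-CPU result is a lower bound), and 1 Gbyte is taken as 1024 Mbyte.

```python
SYSTEMS = 48
CPUS_PER_SYSTEM = 128
MEMORY_GBYTE = 1536
PEAK_TFLOPS = 3.0   # slide says "3+", so treat this as a lower bound

total_cpus = SYSTEMS * CPUS_PER_SYSTEM

# Assumption: 1 Gbyte = 1024 Mbyte.
memory_per_cpu_mbyte = MEMORY_GBYTE * 1024 / total_cpus

# 1 Tflops = 1e6 Mflops.
peak_per_cpu_mflops = PEAK_TFLOPS * 1e6 / total_cpus

print(total_cpus)                  # 6144
print(memory_per_cpu_mbyte)        # 256.0
print(round(peak_per_cpu_mflops))  # 488
```

So the machine carries 256 Mbyte of memory per CPU, and the quoted peak corresponds to at least roughly 488 Mflops per CPU.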

Memory hierarchy

[Figure: speed of access (1/clock; axis ticks 1, 0.1, 0.01) versus device capacity (size)]

• Registers: 64
• Cache subsystem, L1: 32 KB, ~2-3 cy
• Cache subsystem, L2: 8 MB, ~10 cy
• Memory: ~1 - 100s GB, ~100 - 300 cy (NUMA)
• Disk: ~4000 cy
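A rough sketch of how the latencies on the memory-hierarchy slide combine into an average access cost. The cycle counts (about 3 cy for L1, 10 cy for L2, 100 to 300 cy for NUMA memory) follow the slide; the hit rates are hypothetical values chosen only for illustration, and the formula is the usual textbook average-memory-access-time model rather than anything stated in the tutorial.

```python
def average_access_cycles(l1_hit, l2_hit, l1_cy, l2_cy, mem_cy):
    """Average cycles per memory access.

    l1_hit: fraction of accesses that hit in L1.
    l2_hit: fraction of L1 misses that hit in L2.
    """
    l1_miss = 1.0 - l1_hit
    l2_miss = 1.0 - l2_hit
    return l1_cy + l1_miss * (l2_cy + l2_miss * mem_cy)


# Hit rates below are assumptions, not measurements.
L1_HIT, L2_HIT = 0.95, 0.90

near = average_access_cycles(L1_HIT, L2_HIT, l1_cy=3, l2_cy=10, mem_cy=100)
far = average_access_cycles(L1_HIT, L2_HIT, l1_cy=3, l2_cy=10, mem_cy=300)
print(f"memory at 100 cy: {near:.2f} cycles per access")
print(f"memory at 300 cy: {far:.2f} cycles per access")
```

Under these assumed hit rates the average cost is about 4 cycles with memory at 100 cy and about 5 cycles at 300 cy, so good cache behaviour hides most of the NUMA latency range.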

[Figure: remote latency (ns) by system size (2p, 4p, 8p, 16p, 32p, 64p, 128p, 256p, 512p), y-axis 0 to 1400 ns. Series: SN-MIPS latency and Origin2000 latency. Data labels: 175, 175, 235, 285, 335, 335, 435, 485, 585; and 343, 554, 759, 759, 836, 1067, 1169.]

NUMAflex™ Flexible Configuration: Scale in Any and All Dimensions

[Figure: configuration space with axes CPU, I/O, and Storage. Example workloads placed in it: web serving, weather simulation, repository / archive, signal processing, media streaming, traditional big supercomputer.]