sr8000 concept

All Rights Reserved. Copyright © 2000 Hitachi Europe Ltd.

SR8000 Concept

Tim Lanfear

Hitachi Europe GmbH.

[email protected]

2 All Rights Reserved. Copyright © 2000 Hitachi Europe Ltd.

SR8000 Model Range4 8 16 32 64 128 256 512

SR8000 32 64 128 256 512 1,024 - -SR8000 Model E1 38.4 76.8 153.6 307.2 614.4 1,228.8 2,457.6 4,915.2SR8000 Model F1 48 96 192 384 768 1,536 3,072 6,144SR8000 3-D

Crossbar- -

SR8000 Models E1, F1SR8000SR8000 Model E1SR8000 Model F1SR8000 32 64 128 256 512 1,024 - -SR8000 Models E1, F1 64 128 256 512 1,024 2,048 4,096 8,192

SR8000SR8000 Model E1SR8000 Model F1SR8000SR8000 Models E1, F1

External Interfaces

Number of NodesPeak Performance (GFlops)

Inter-node Network

Max. Memory Capacity (GB)

One Dimensional Crossbar

Two Dimensional Crossbar

One Dimensional Two Dimensional Crossbar Three Dimensional CrossbarInter-node Transfer Speed

1 GB/s (single direction) x2 -1.2 GB/s (single direction) x21 GB/s (single direction) x2

Ultra SCSI, Ethernet/Fast Ethernet, Gigabit Ethernet, ATM, HIPPI, Fibre Channel

Node

Peak Performance (GFlops)

89.612

Memory Capacity (GB)

2 / 4 / 82 / 4 / 8 / 16

System


SR8000 Appearance


Compact Model

Model Model A Model B Model C

Peak Performance 4 GFlops 8 GFlops 12 GFlops Memory Capacity 2GB / 4GB / 8GB /

16GB

External Interfaces

Number of I/O Interfaces

System Expandability

Input Current (Single Phase 200V-240V)

18 A

Power Consumption 3.3 kW

Noise Level

Dimension (mm) <W x D x H>

OS

Operation Management

Languages

Development Support

Matrix Calculation Library

Graphics

500 x 910 x 1500

Physical Planning

HI-UX/MPP for SR8000

NQS, NFS, RealTime Monitor, ADSM (ADSTAR Distributed Storage Manager)

SoftwareFORTRAN77, FORTRAN90, Parallel FORTRAN, C, C++,

Kuck and Associates C++

Application Development Environment, Parallel Debugger

MATRIX/MPP, MATRIX/MPP/SSS, MSL2

X Window System, OSF/Motif, GKS, PHIGS, PEXlib

Hardware

17 A

3.0 kW

57 dB

Ultra SCSI, Fibre Channel, Ethernet/Fast Ethernet, Gigabit Ethernet, HIPPI, ATM

2GB / 4GB / 8GB

8 (maximum)

Available


Vector vs SMP vs MPP

Feature Vector SMP MPPSingle Node Performance

Scalability

Programming Effort

Development Cycle

Energy Requirements


System Architecture

Cross-bar Inter-node Network

Node (PRN)

Node (PRN)

Node (ION) CPU CPU

System Control

Main MemoryNetwork Control

Ether, ATM, HIPPI RAID Disk

PCI

Service Processor

Console


Programming Models

Hardware Programming Model Example Single CPU Pseudo-vector processing Vector application

Independent processing on each IP

Compilation, parallel make

Message passing MPP application DO loop distribution with COMPAS

Vector application Single Node

Parallel processing of independent blocks of code

Message passing MPP application Multiple Nodes COMPAS and message

passing Vector parallel application


CPU Architecture

• 16 bytes/cycle memory BW• 128 Kbyte L1 cache• Pre-fetch and pre-load

instructions• 160 f.p. registers• 2 f.p. pipelines• 4 flops/cycle

Main MemoryMain Memory

Pre-fetchPre-fetch

Pre-loadPre-loadCache

Floating Point Registers

Load

Arithmetic UnitArithmetic Unit

Memory SwitchMemory Switch


Slide Window Registers

• Registers for all instructions• Registers for extended instructions only• Fixed registers: 4, 8, 16, 32 (16 illustrated)• Fixed + sliding = 128

Physical Sliding part: 0 to 127 Global part: 128 to 159

Logical0 to 1532 to 12516 to 31126-7

0 to 1532 to 12316 to 31124-7

Base=2

Base=4


Instruction Set Extensions

• Load and store with extended registers

• Floating point arithmetic with extended registers

• Slide window control

• Pre-fetch and pre-load

• Thread start-up and finish

• Predicate instructions


SR8000 Programming

Instruction Level Parallelism

(Pseudo-vector Processing: PVP)


Pre-fetch and Pre-load

• Pre-fetch: load cache line from memory to cache

• Pre-load: load one word from memory to register

• 16 streams

Main MemoryMain Memory

Pre-fetchPre-fetch

Pre-loadPre-loadCache

Floating Point Registers

Load

Arithmetic UnitArithmetic Unit

Memory SwitchMemory Switch


Pre-fetchIteration

PF Latency Use dataLD1

Use dataLD

Use dataLD

Use dataLD

2

3

4

PF Latency Use dataLD

Use dataLD

5

6

• Pre-fetch 128 bytes to cache• Follow by LD to register


Pre-load

PL Latency Use data1

Latency Use data

Latency Use data

Latency Use data

Latency Use data

Latency Use data

2

3

4

5

6

PL

PL

PL

PL

PL

• Pre-load 8 bytes to register• LD not required

Iteration


Software Pipelining

I=1 I=2 I=3

No SWPL

I=1

I=2

I=3

Infinite resource

Recurrence

a=

=a a=

=a a=

=a I=1

I=2

I=3

Resources:

registers, f.p. units, instruction issue, memory bandwidth etc

I=1

I=2

I=3

Finite resource

Initiation interval


VST

Pseudo-vector Processing

A(:) = A(:) + N

Vector Pseudo-Vector

PF Lat LD + ST

LD + ST

LD + ST

LD + ST

PF Lat LD + ST

LD + ST

LD + ST

VADDVLD


Effect of PVP

0

100

200

300

400

500

600

700

800

900

1 10 100 1000 10000 100000

Loop length (N)

Mfl

op

s

PVP off

PVP on

Dot product: S = A(1:N)*B(1:N)


SR8000 Programming

Multi-thread Parallelism

(Cooperative Microprocessors in a Single Address Space: COMPAS)


COMPASMulti-dimensional Crossbar Network

Node

thread

process

. . .

.

IP

Pre-fetchLoadArithmeticStoreBranch

Pre-fetchLoadArithmeticStoreBranch

IP

IP

Automatic Parallel Processing

Node

COMPAS (Start Inst.)

IP: Instruction Processor

Node Node Node

Main memory (shared)

IP IP IP IP

COMPAS ( End Inst.)

COMPAS: Co-operative Micro-Processors in single Address Space


Loop Part

IP

(waiting for startup)

Hardware SupportSoftware

IP

Scalar Part

Start Parallel Inst.

Loop Part

End Parallel Inst.

Hardware Support

SC

MS

IP:Instruction ProcessorSC:Storage ControllerMS:Main Storage

Barrier SynchronizationMechanism

Scalar Part

Loop Part

IP


Loop Part

IP


IPIPIPIP


Loop Parallelisation

DO i =1,NA(i)=B(i)+C(i)

ENDDO

[fork]DO i =start,end

A(i)=B(i)+C(i)ENDDO[join]

i loop parallelisation

DO j=1,MW(j)=C(j)+D(j)DO i=1,N

A(i,j)=B(i,j)+W(j)ENDDO

ENDDO

[fork]DO j=start,end

W(j)=C(j)+D(j)DO i=1,N

A(i,j)=B(i,j)+W(j)ENDDO

ENDDO[join]

j loop parallelisation



DO j=2,MDO i=1,N

A(i,j) = A(i,j-1)+A(i,j)ENDDO

ENDDO

[fork]DO j=2,M

DO i=start,endA(i,j) = A(i,j-1)+A(i,j)

ENDDOENDDO[join]


DO i=1,NA(i) = B(i)+C(i)

ENDDODO j=1,M

D(j) = E(j)*F(j)ENDDO

[fork]DO i=start,end

A(i) = B(i)+C(i)ENDDODO j=start,end

D(j) = E(j)*F(j)ENDDO[join]


j loop parallelisation



DO i = 1,N

CALL sub(a,b,i)

ENDDO

*poption parallel force parallelisation

*poption tlocal(a,b,i) thread local variables

[fork]

DO i = 1,N

CALL sub(a,b,i)

ENDDO

[join]


Section Parallelisation

*poption parallel_sections

*poption section

CALL SUB1

*poption section

CALL SUB2

*poption end_parallel_sections

Execution of independent blocks of code in different threads

(sections are always single threaded)


Effect of COMPAS

0

500

1000

1500

2000

2500

3000

3500

4000

4500

5000

1 10 100 1000 10000 100000

Loop length (N)

Mfl

op

s

COMPAS off

COMPAS on

Dot product: S = A(1:N)*B(1:N)


SR8000 Programming

Message Passing

(MPI)


Remote DMA

Receive Buffer

ProgramProgram

OSOS

memory copy

Send Buffer

Crossbar Network

Normal TransferProtocol ProcessingContext SwitchInterrupt Handling

Node Node

memory copy

Remote DMA Transfer

No Buffering in KernelNo OS System Call

data

data

data

data


Inter-node MPI


One MPI process per node; RDMA transfer possible

MPI MPI MPI


Intra-node MPI


One MPI process per IP; RDMA transfer not possible

MPI

MPI

MPI

Shared mem

ory

MPI

MPI

MPI

Shared mem

ory

MPI

MPI

MPIShared m

emory


MPI Ping-pong

0.00E+00

2.00E+02

4.00E+02

6.00E+02

8.00E+02

1.00E+03

1.20E+03

1.40E+03

1 10 100 1000 10000 100000 1000000

Message length

Ba

nd

wid

th (

Mb

yte

s/s

ec

)

Intra-node

Inter-node RDMAInter-node no RDMA


SR8000 Parallelism

Instruction level (PVP)

Multi-thread (COMPAS)

Node 1 Node 2

Message passing (MPI)


SR8000 Programming

Memory Architecture


Memory Hierarchy

fp registers (128+32)

L1 cache (128 Kb 4-way)

Store buffer (16 entries)

Switch

Memory (2 to 16 Gb, 512 banks)

16 b/cyc

16 b/cyc32 b/cyc

Other IPs


Address Translation

Virtual page number Page offset

Page

table

Main

memory

Virtual address

Cache recently used entries of page table in

TLB


Large TLB

Virtual page number Page offset

Large

page

table

Main

memory

Virtual address

Large TLB covers whole address space with 256 entries.

Page size 16Mb to 128 Mb


Memory Address Hashing

32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63

xor

memory controller data path

xor

storage controller data path


Key Features of SR8000

• High performance RISC CPU with PVP

• High performance node with COMPAS

• High sustained memory bandwidth

• High scalability with fast network

• Low energy and space requirements


SR8000 Programming

Performance


Top 500 – June 2000

Manufacturer Computer Rmax Installation Site

1 Intel ASCI Red 2379 Sandia National Lab

2 IBM ASCI Blue Pacific 2144 Lawrence Livermore National Lab

3 SGI ASCI Blue Mountain 1608 Los Alamos National Lab

4 IBM SP Power3 375 MHz 1417 NAVOCEANO

5 Hitachi SR8000-F1/112 1035 LRZ Munich

6 Hitachi SR8000-F1/100 917 KEK Tsukuba

7 Cray Inc T3E/1200 891 US Government

8 Cray Inc T3E/1200 891 US Army HPC Research Center

9 Hitachi SR8000/128 873 University of Tokyo

10 Cray Inc T3E/1200 815 US Government


Linpack Performance

80.25 (8 nodes)159.51 (16 nodes)

313.32 (32nodes)

917.15(100 nodes)

605.30 (64 nodes)577.49 (60 nodes)

0

100

200

300

400

500

600

700

800

900

1000

0 20 40 60 80 100 120Number of nodes

GFl

op

s

10.88 Gflops on 1 node

20.50 Gflops on 2 nodes


10.88 Gflops on 1 node




NAS Parallel FT

5.147.92

26.16

14.01

8.31

27.95

14.84

5.39

28.78

15.10

8.37

0

5

10

15

20

25

30

35

1 2 4 8Number of Nodes

GFl

ops

ClassA

ClassB

ClassC


NAS Parallel CG

7.98

14.54

6.353.67

10.52

16.46

4.14

27.86

0

5

10

15

20

25

30


GFl

ops

ClassB

ClassC

7.98

14.54

6.353.67

10.52

16.46

4.14

27.86

0

5

10

15

20

25

30


GFl

ops

ClassB

ClassC

sr8000 concept

Documents