SR8000 Concept

Page 1: SR8000 Concept

All Rights Reserved. Copyright © 2000 Hitachi Europe Ltd.

SR8000 Concept

Tim Lanfear

Hitachi Europe GmbH.

[email protected]

Page 2: SR8000 Concept

SR8000 Model Range

System

Number of nodes                 4      8      16     32     64     128      256      512

Peak performance (GFlops)
  SR8000                        32     64     128    256    512    1,024    -        -
  SR8000 Model E1               38.4   76.8   153.6  307.2  614.4  1,228.8  2,457.6  4,915.2
  SR8000 Model F1               48     96     192    384    768    1,536    3,072    6,144

Max. memory capacity (GB)
  SR8000                        32     64     128    256    512    1,024    -        -
  SR8000 Models E1, F1          64     128    256    512    1,024  2,048    4,096    8,192

Inter-node network
  SR8000                        One-, two-, or three-dimensional crossbar, depending on the number of nodes (not offered at 256 or 512 nodes)
  SR8000 Models E1, F1          One-, two-, or three-dimensional crossbar, depending on the number of nodes

Inter-node transfer speed
  SR8000                        1 GB/s (single direction) x2
  SR8000 Model E1               1.2 GB/s (single direction) x2
  SR8000 Model F1               1 GB/s (single direction) x2

External interfaces
  Ultra SCSI, Ethernet/Fast Ethernet, Gigabit Ethernet, ATM, HIPPI, Fibre Channel

Node

Peak performance (GFlops)
  SR8000                        8
  SR8000 Model E1               9.6
  SR8000 Model F1               12

Memory capacity (GB)
  SR8000                        2 / 4 / 8
  SR8000 Models E1, F1          2 / 4 / 8 / 16
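The system peak figures on this slide are consistent with multiplying the per-node peak by the number of nodes. A minimal Python sketch of that arithmetic follows, using only numbers from the slide. The names `system_peak_gflops` and `MAX_NODES` are illustrative, and the 128-node limit for the original SR8000 is read from the dashes in its 256- and 512-node columns.

```python
# Per-node peak performance in GFlops, as listed on the slide.
NODE_PEAK_GFLOPS = {"SR8000": 8.0, "SR8000 Model E1": 9.6, "SR8000 Model F1": 12.0}
# Largest configuration listed for each model (illustrative constant).
MAX_NODES = {"SR8000": 128, "SR8000 Model E1": 512, "SR8000 Model F1": 512}
NODE_COUNTS = [4, 8, 16, 32, 64, 128, 256, 512]

def system_peak_gflops(model, nodes):
    """System peak = number of nodes x per-node peak."""
    return round(nodes * NODE_PEAK_GFLOPS[model], 1)

for model in NODE_PEAK_GFLOPS:
    row = [system_peak_gflops(model, n) for n in NODE_COUNTS if n <= MAX_NODES[model]]
    print(model, row)
```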

Page 3: SR8000 Concept


SR8000 Appearance

Page 4: SR8000 Concept

Compact Model

Hardware
  Model                          Model A, Model B, Model C
  Peak performance               Model A: 4 GFlops; Model B: 8 GFlops; Model C: 12 GFlops
  Memory capacity                2GB / 4GB / 8GB or 2GB / 4GB / 8GB / 16GB, depending on model
  External interfaces            Ultra SCSI, Fibre Channel, Ethernet/Fast Ethernet, Gigabit Ethernet, HIPPI, ATM
  Number of I/O interfaces       8 (maximum)
  System expandability           Available

Physical Planning
  Input current (single phase 200V-240V)   17 A or 18 A, depending on model
  Power consumption              3.0 kW or 3.3 kW, depending on model
  Noise level                    57 dB
  Dimensions (mm) <W x D x H>    500 x 910 x 1500

Software
  OS                             HI-UX/MPP for SR8000
  Operation management           NQS, NFS, RealTime Monitor, ADSM (ADSTAR Distributed Storage Manager)
  Languages                      FORTRAN77, FORTRAN90, Parallel FORTRAN, C, C++, Kuck and Associates C++
  Development support            Application Development Environment, Parallel Debugger
  Matrix calculation library     MATRIX/MPP, MATRIX/MPP/SSS, MSL2
  Graphics                       X Window System, OSF/Motif, GKS, PHIGS, PEXlib

Page 5: SR8000 Concept

Vector vs SMP vs MPP

[Table: Vector, SMP, and MPP systems compared feature by feature; cell entries not reproduced.]

Features compared:
• Single node performance
• Scalability
• Programming effort
• Development cycle
• Energy requirements

Page 6: SR8000 Concept

System Architecture

[Diagram: Node (PRN), Node (PRN), and Node (ION) attached to the cross-bar inter-node network. Labels inside a node: CPU, CPU, Main Memory, Network Control. Labels on the ION: PCI, with Ether, ATM, HIPPI, and RAID Disk. System Control: Service Processor and Console.]

Page 7: SR8000 Concept

Programming Models

Hardware        Programming Model                                   Example
Single CPU      Pseudo-vector processing                            Vector application
Single Node     Independent processing on each IP                   Compilation, parallel make
                Message passing                                     MPP application
                DO loop distribution with COMPAS                    Vector application
                Parallel processing of independent blocks of code
Multiple Nodes  Message passing                                     MPP application
                COMPAS and message passing                          Vector parallel application

Page 8: SR8000 Concept

CPU Architecture

• 16 bytes/cycle memory BW
• 128 Kbyte L1 cache
• Pre-fetch and pre-load instructions
• 160 f.p. registers
• 2 f.p. pipelines
• 4 flops/cycle

[Diagram: Main Memory, Memory Switch, Cache, Floating Point Registers, and Arithmetic Unit, with data paths labelled Pre-fetch, Pre-load, and Load.]
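The bandwidth and flop rates on this slide imply a simple balance estimate. The sketch below is a rough roofline-style calculation rather than a Hitachi figure: it assumes 8-byte operands and that both dot-product operands stream from main memory on every iteration. The function name is illustrative.

```python
MEM_BYTES_PER_CYCLE = 16   # from the slide: 16 bytes/cycle memory bandwidth
PEAK_FLOPS_PER_CYCLE = 4   # from the slide: 4 flops/cycle

def memory_bound_fraction(bytes_per_iter, flops_per_iter):
    """Upper bound on the fraction of peak reachable when data streams from memory."""
    cycles_for_memory = bytes_per_iter / MEM_BYTES_PER_CYCLE
    cycles_for_flops = flops_per_iter / PEAK_FLOPS_PER_CYCLE
    return min(1.0, cycles_for_flops / cycles_for_memory)

# Dot product S = S + A(i)*B(i): two 8-byte loads, one multiply and one add.
dot_fraction = memory_bound_fraction(bytes_per_iter=16, flops_per_iter=2)
print(dot_fraction)  # 0.5
```

Under these assumptions a memory-resident dot product can reach at most half of the arithmetic peak, which is why the pre-fetch and pre-load mechanisms concentrate on keeping the memory pipeline full.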

Page 9: SR8000 Concept

Slide Window Registers

• Registers for all instructions
• Registers for extended instructions only
• Fixed registers: 4, 8, 16, 32 (16 illustrated)
• Fixed + sliding = 128

[Diagram: mapping of logical to physical registers. Physical registers: sliding part 0 to 127, global part 128 to 159. Logical register ranges shown for Base=2: 0 to 15, 16 to 31, 32 to 125, 126 to 127. Logical register ranges shown for Base=4: 0 to 15, 16 to 31, 32 to 123, 124 to 127.]

Page 10: SR8000 Concept


Instruction Set Extensions

• Load and store with extended registers

• Floating point arithmetic with extended registers

• Slide window control

• Pre-fetch and pre-load

• Thread start-up and finish

• Predicate instructions

Page 11: SR8000 Concept


SR8000 Programming

Instruction Level Parallelism

(Pseudo-vector Processing: PVP)

Page 12: SR8000 Concept

Pre-fetch and Pre-load

• Pre-fetch: load cache line from memory to cache
• Pre-load: load one word from memory to register
• 16 streams

[Diagram: Main Memory, Memory Switch, Cache, Floating Point Registers, and Arithmetic Unit, with data paths labelled Pre-fetch, Pre-load, and Load.]

Page 13: SR8000 Concept

Pre-fetch

[Timing diagram: iterations 1 to 6. A pre-fetch (PF) is issued ahead of iteration 1; after the latency, iterations 1 to 4 each issue a load (LD) and then use the data. A second PF is issued ahead of iteration 5 and serves the LD and use of data in iterations 5 and 6.]

• Pre-fetch 128 bytes to cache
• Follow by LD to register

Page 14: SR8000 Concept

Pre-load

[Timing diagram: iterations 1 to 6. Each iteration issues its own pre-load (PL), waits for the latency, and then uses the data.]

• Pre-load 8 bytes to register
• LD not required
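The two mechanisms move different amounts of data per request: 128 bytes per pre-fetch and 8 bytes per pre-load. The Python sketch below counts the bytes transferred from memory for a strided walk over 8-byte elements under each mechanism. It illustrates the trade-off under simplifying assumptions (the array starts on a line boundary, each touched line is fetched exactly once, no reuse), and the function names are hypothetical.

```python
LINE_BYTES = 128   # pre-fetch granularity from the slides
WORD_BYTES = 8     # pre-load granularity from the slides

def prefetch_bytes(n_elems, stride_elems):
    """Bytes moved if every cache line touched is pre-fetched once."""
    lines = {(i * stride_elems * WORD_BYTES) // LINE_BYTES for i in range(n_elems)}
    return len(lines) * LINE_BYTES

def preload_bytes(n_elems):
    """Bytes moved if every element is pre-loaded straight into a register."""
    return n_elems * WORD_BYTES

for stride in (1, 4, 16):
    print(stride, prefetch_bytes(1024, stride), preload_bytes(1024))
```

With unit stride both mechanisms move the same amount of data. With larger strides the pre-fetched lines carry mostly unused words, while the pre-load traffic stays constant.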

Page 15: SR8000 Concept

Software Pipelining

[Diagram: four schedules of loop iterations I=1, I=2, I=3.
 No SWPL: the schedule without software pipelining.
 Infinite resource: the schedule with unlimited resources.
 Recurrence: one iteration defines a (a=) and the next uses it (=a).
 Finite resource: successive iterations are separated by the initiation interval.]

Resources: registers, f.p. units, instruction issue, memory bandwidth etc
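The initiation interval determines how much software pipelining gains. The Python sketch below uses the standard textbook relations: the resource-limited interval is the largest per-resource ratio of uses to units, and a pipelined loop takes (iterations - 1) x interval + latency cycles. The loop body and unit counts are hypothetical numbers chosen for illustration, not SR8000 measurements.

```python
from math import ceil

def resource_bound_ii(uses_per_iter, units):
    """Smallest initiation interval allowed by the resource limits."""
    return max(ceil(uses_per_iter[r] / units[r]) for r in uses_per_iter)

def serial_cycles(n_iter, latency):
    """No software pipelining: each iteration waits for the previous one."""
    return n_iter * latency

def pipelined_cycles(n_iter, latency, ii):
    """A new iteration starts every ii cycles; the last one needs `latency` cycles."""
    return (n_iter - 1) * ii + latency

# Hypothetical loop body: 3 f.p. operations and 2 memory accesses per iteration,
# on hypothetical hardware with 2 f.p. units and 1 memory port.
ii = resource_bound_ii({"fp": 3, "mem": 2}, {"fp": 2, "mem": 1})
print(ii, serial_cycles(100, 10), pipelined_cycles(100, 10, ii))
```

A recurrence between iterations adds a second lower bound on the interval, because the next iteration cannot start before the value it needs has been produced.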

Page 16: SR8000 Concept

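Pseudo-vector Processing

A(:) = A(:) + N

[Diagram comparing two executions of this statement. Vector: VLD, VADD, VST. Pseudo-Vector: a pre-fetch (PF) is followed, after the latency (Lat), by LD + ST in each of four iterations; a second PF, again followed by its latency, serves LD + ST in the next three iterations.]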

Page 17: SR8000 Concept

Effect of PVP

[Chart: Mflops (0 to 900) against loop length N (1 to 100000, logarithmic axis), with one curve for PVP off and one for PVP on.]

Dot product: S = A(1:N)*B(1:N)

Page 18: SR8000 Concept


SR8000 Programming

Multi-thread Parallelism

(Cooperative Microprocessors in a Single Address Space: COMPAS)

Page 19: SR8000 Concept

COMPAS

[Diagram: nodes connected by a multi-dimensional crossbar network. Inside a node, the IPs share main memory. A process runs on a node; automatic parallel processing runs threads on the IPs between a COMPAS start instruction and a COMPAS end instruction. Each thread executes its own pre-fetch, load, arithmetic, store, and branch instructions.]

IP: Instruction Processor

COMPAS: Co-operative Micro-Processors in single Address Space

Page 20: SR8000 Concept

[Diagram: software view and hardware support for parallel execution. Software: one IP executes the scalar part while the other IPs wait for startup; a start parallel instruction begins the loop part on every IP, and an end parallel instruction returns to the scalar part. Hardware support: a barrier synchronization mechanism, with the IPs connected through the SC to the MS.]

IP: Instruction Processor
SC: Storage Controller
MS: Main Storage

Page 21: SR8000 Concept

Loop Parallelisation

i loop parallelisation

      DO i = 1,N
        A(i) = B(i)+C(i)
      ENDDO

is executed as

      [fork]
      DO i = start,end
        A(i) = B(i)+C(i)
      ENDDO
      [join]

j loop parallelisation

      DO j = 1,M
        W(j) = C(j)+D(j)
        DO i = 1,N
          A(i,j) = B(i,j)+W(j)
        ENDDO
      ENDDO

is executed as

      [fork]
      DO j = start,end
        W(j) = C(j)+D(j)
        DO i = 1,N
          A(i,j) = B(i,j)+W(j)
        ENDDO
      ENDDO
      [join]
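The slides do not say how `start` and `end` are chosen for each thread. One common choice is a block distribution, sketched below in Python with 1-based inclusive bounds to match the Fortran loops. The helper `block_bounds` and the thread count of 8 are assumptions for illustration, not the compiler's documented behaviour.

```python
def block_bounds(n, n_threads, tid):
    """1-based inclusive (start, end) of the block owned by thread tid (0-based)."""
    base, extra = divmod(n, n_threads)
    start = tid * base + min(tid, extra) + 1
    end = start + base - 1 + (1 if tid < extra else 0)
    return start, end

n, n_threads = 20, 8
B = list(range(1, n + 1))
C = [10 * b for b in B]
A = [0] * n
for tid in range(n_threads):             # each pass stands in for one thread
    start, end = block_bounds(n, n_threads, tid)
    for i in range(start, end + 1):      # DO i = start,end
        A[i - 1] = B[i - 1] + C[i - 1]   # A(i) = B(i)+C(i)
print(A)
```

When the iteration count is not a multiple of the thread count, the first few threads take one extra iteration each, so block sizes differ by at most one.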

Page 22: SR8000 Concept

Loop Parallelisation

i loop parallelisation

      DO j = 2,M
        DO i = 1,N
          A(i,j) = A(i,j-1)+A(i,j)
        ENDDO
      ENDDO

is executed as

      [fork]
      DO j = 2,M
        DO i = start,end
          A(i,j) = A(i,j-1)+A(i,j)
        ENDDO
      ENDDO
      [join]

i loop parallelisation and j loop parallelisation

      DO i = 1,N
        A(i) = B(i)+C(i)
      ENDDO
      DO j = 1,M
        D(j) = E(j)*F(j)
      ENDDO

is executed as

      [fork]
      DO i = start,end
        A(i) = B(i)+C(i)
      ENDDO
      DO j = start,end
        D(j) = E(j)*F(j)
      ENDDO
      [join]
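In the first example on this slide the j loop carries a dependence (column j needs column j-1), so only the i loop is split across threads. The Python sketch below uses a thread pool to stand in for the IPs and checks that splitting over i reproduces the serial result. The helper names and the two-way split are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def serial(a):
    """DO j=2,M / DO i=1,N: A(i,j) = A(i,j-1) + A(i,j), on a copy of a[i][j]."""
    a = [row[:] for row in a]
    for j in range(1, len(a[0])):
        for i in range(len(a)):
            a[i][j] = a[i][j - 1] + a[i][j]
    return a

def split_over_i(a, n_threads=2):
    """Each worker owns a block of rows i and runs the whole j loop on it."""
    a = [row[:] for row in a]
    n = len(a)

    def work(tid):
        lo, hi = tid * n // n_threads, (tid + 1) * n // n_threads
        for j in range(1, len(a[0])):
            for i in range(lo, hi):
                a[i][j] = a[i][j - 1] + a[i][j]

    with ThreadPoolExecutor(n_threads) as pool:
        list(pool.map(work, range(n_threads)))
    return a

grid = [[i + j for j in range(5)] for i in range(6)]
print(split_over_i(grid) == serial(grid))
```

Each worker touches only its own rows, so no synchronisation is needed inside the j loop.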

Page 23: SR8000 Concept

Loop Parallelisation

      DO i = 1,N
        CALL sub(a,b,i)
      ENDDO

Directives:

*poption parallel          force parallelisation
*poption tlocal(a,b,i)     thread local variables

With these directives the loop is executed as

      [fork]
      DO i = 1,N
        CALL sub(a,b,i)
      ENDDO
      [join]

Page 24: SR8000 Concept


Section Parallelisation

*poption parallel_sections

*poption section

CALL SUB1

*poption section

CALL SUB2

*poption end_parallel_sections

Execution of independent blocks of code in different threads

(sections are always single threaded)
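As a rough analogue of the directive pattern above, the Python sketch below runs two independent blocks in separate threads and waits for both before continuing. The bodies of `sub1` and `sub2` are placeholders, and the thread pool only stands in for the `*poption` directives; it is not how the SR8000 compiler implements them.

```python
from concurrent.futures import ThreadPoolExecutor

def sub1():
    return sum(range(1000))          # stand-in for CALL SUB1

def sub2():
    return max(range(1000))          # stand-in for CALL SUB2

with ThreadPoolExecutor(max_workers=2) as pool:   # parallel_sections
    f1 = pool.submit(sub1)                        # section
    f2 = pool.submit(sub2)                        # section
    r1, r2 = f1.result(), f2.result()             # end_parallel_sections: wait for both
print(r1, r2)
```

Each section runs in a single thread, so the speed-up is bounded by the number of sections and by the longest one.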

Page 25: SR8000 Concept

Effect of COMPAS

[Chart: Mflops (0 to 5000) against loop length N (1 to 100000, logarithmic axis), with one curve for COMPAS off and one for COMPAS on.]

Dot product: S = A(1:N)*B(1:N)

Page 26: SR8000 Concept


SR8000 Programming

Message Passing

(MPI)

Page 27: SR8000 Concept

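Remote DMA

[Diagram: two nodes connected by the crossbar network, each running a program on top of the OS.
 Normal transfer: data goes from the program's send buffer through a memory copy in the OS, across the network, and through another memory copy in the OS into the program's receive buffer. This involves protocol processing, context switch, and interrupt handling.
 Remote DMA transfer: data goes from the send buffer to the receive buffer with no buffering in the kernel and no OS system call.]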

Page 28: SR8000 Concept

Inter-node MPI

One MPI process per node; RDMA transfer possible

[Diagram: nodes attached to the cross-bar inter-node network, each running one MPI process.]

Page 29: SR8000 Concept

Intra-node MPI

One MPI process per IP; RDMA transfer not possible

[Diagram: nodes attached to the cross-bar inter-node network; within each node, several MPI processes communicate through shared memory.]

Page 30: SR8000 Concept

MPI Ping-pong

[Chart: bandwidth (Mbytes/sec, 0 to 1400) against message length (1 to 1000000, logarithmic axis), with curves for intra-node, inter-node RDMA, and inter-node no RDMA.]
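Ping-pong curves of this kind are often described with a two-parameter model: a fixed start-up time plus a transfer time proportional to message length. The Python sketch below uses that textbook model. The start-up time and asymptotic bandwidth are hypothetical placeholders, not values read from the chart.

```python
def effective_bandwidth(n_bytes, startup_s, peak_bytes_per_s):
    """Bandwidth seen by a message of n_bytes under t = startup + n / peak."""
    return n_bytes / (startup_s + n_bytes / peak_bytes_per_s)

def half_performance_length(startup_s, peak_bytes_per_s):
    """Message length at which half of the peak bandwidth is reached."""
    return startup_s * peak_bytes_per_s

STARTUP_S = 10e-6   # hypothetical start-up time in seconds
PEAK = 1.0e9        # hypothetical asymptotic bandwidth in bytes/s

for n in (1, 100, 10_000, 1_000_000):
    print(n, round(effective_bandwidth(n, STARTUP_S, PEAK) / 1e6, 1), "Mbytes/s")
```

In this model, lowering the start-up time (for example by avoiding kernel buffering and system calls) shifts the curve to the left, so short messages see more of the peak bandwidth.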

Page 31: SR8000 Concept


SR8000 Parallelism

Instruction level (PVP)

Multi-thread (COMPAS)

Node 1 Node 2

Message passing (MPI)

Page 32: SR8000 Concept


SR8000 Programming

Memory Architecture

Page 33: SR8000 Concept

Memory Hierarchy

fp registers (128+32)
L1 cache (128 Kbyte, 4-way)
Store buffer (16 entries)
Switch
Memory (2 to 16 GB, 512 banks)

[Diagram: the levels above are drawn from registers down to memory, with other IPs also attached to the switch. The data paths are labelled 16 bytes/cycle, 16 bytes/cycle, and 32 bytes/cycle.]

Page 34: SR8000 Concept

Address Translation

[Diagram: a virtual address is split into a virtual page number and a page offset; the page table maps the virtual page number to a page in main memory.]

Cache recently used entries of page table in TLB

Page 35: SR8000 Concept

Large TLB

[Diagram: a virtual address is split into a virtual page number and a page offset; a large page table maps the virtual page number to a large page in main memory.]

Large TLB covers whole address space with 256 entries.

Page size 16 MB to 128 MB
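The reach of a TLB is its entry count multiplied by the page size. A quick Python check of the figures on this slide follows; it assumes the page sizes are in megabytes of 2^20 bytes, and the function name is illustrative.

```python
TLB_ENTRIES = 256      # from the slide
MB = 2**20             # assumption: binary megabytes
GB = 2**30

def tlb_reach_gb(page_size_mb, entries=TLB_ENTRIES):
    """Memory covered when every TLB entry maps one page of the given size."""
    return entries * page_size_mb * MB / GB

for page_mb in (16, 128):
    print(page_mb, "MB pages ->", tlb_reach_gb(page_mb), "GB")
```

With 16 MB pages the 256 entries map 4 GB, and with 128 MB pages they map 32 GB, which exceeds the 16 GB maximum node memory listed in the model range.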

Page 36: SR8000 Concept

Memory Address Hashing

[Diagram: address bits 32 to 63. One xor of selected address bits feeds the memory controller data path, and a second xor feeds the storage controller data path.]
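The bit selection used by the hardware is not given here, so the Python sketch below is only a generic illustration of xor-based bank hashing with an assumed bit selection. It shows why hashing helps: a power-of-two stride that would hit the same bank on every access is spread over all banks. The bank count of 512 comes from the Memory Hierarchy slide; the 128-byte interleave unit and the function names are assumptions.

```python
N_BANKS = 512        # from the Memory Hierarchy slide
UNIT_BYTES = 128     # assumed interleave unit
BANK_BITS = 9        # log2(512)

def bank_plain(addr):
    """Bank chosen from the low-order unit index only."""
    return (addr // UNIT_BYTES) % N_BANKS

def bank_hashed(addr):
    """Bank chosen by xor-ing low-order and higher-order unit-index bits."""
    unit = addr // UNIT_BYTES
    return (unit ^ (unit >> BANK_BITS)) % N_BANKS

stride = N_BANKS * UNIT_BYTES                 # worst case for the plain mapping
addrs = [i * stride for i in range(N_BANKS)]
plain_banks = len({bank_plain(a) for a in addrs})
hashed_banks = len({bank_hashed(a) for a in addrs})
print(plain_banks, hashed_banks)  # 1 512
```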

Page 37: SR8000 Concept


Key Features of SR8000

• High performance RISC CPU with PVP

• High performance node with COMPAS

• High sustained memory bandwidth

• High scalability with fast network

• Low energy and space requirements

Page 38: SR8000 Concept


SR8000 Programming

Performance

Page 39: SR8000 Concept

Top 500 – June 2000

Rank  Manufacturer  Computer              Rmax (GFlops)  Installation Site
 1    Intel         ASCI Red              2379           Sandia National Lab
 2    IBM           ASCI Blue Pacific     2144           Lawrence Livermore National Lab
 3    SGI           ASCI Blue Mountain    1608           Los Alamos National Lab
 4    IBM           SP Power3 375 MHz     1417           NAVOCEANO
 5    Hitachi       SR8000-F1/112         1035           LRZ Munich
 6    Hitachi       SR8000-F1/100          917           KEK Tsukuba
 7    Cray Inc      T3E/1200               891           US Government
 8    Cray Inc      T3E/1200               891           US Army HPC Research Center
 9    Hitachi       SR8000/128             873           University of Tokyo
10    Cray Inc      T3E/1200               815           US Government

Page 40: SR8000 Concept

Linpack Performance

[Chart: GFlops (0 to 1000) against number of nodes (0 to 120).]

Nodes   GFlops
1       10.88
2       20.50
4       40.76
8       80.25
16      159.51
32      313.32
60      577.49
64      605.30
100     917.15
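Parallel efficiency can be estimated by dividing each measured figure by the peak for that node count. The Python sketch below assumes Model F1 nodes at 12 GFlops peak each. The slide does not name the model, although the 100-node figure matches the SR8000-F1/100 entry in the Top 500 list. The function name is illustrative.

```python
# (nodes, measured Linpack GFlops) as printed on the slide.
LINPACK = [(1, 10.88), (2, 20.50), (4, 40.76), (8, 80.25), (16, 159.51),
           (32, 313.32), (60, 577.49), (64, 605.30), (100, 917.15)]
NODE_PEAK_GFLOPS = 12.0   # assumption: Model F1 node

def efficiency(nodes, gflops, node_peak=NODE_PEAK_GFLOPS):
    """Measured performance as a fraction of nodes x per-node peak."""
    return gflops / (nodes * node_peak)

for nodes, gflops in LINPACK:
    print(f"{nodes:4d} nodes: {efficiency(nodes, gflops):.1%} of peak")
```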

Page 41: SR8000 Concept

NAS Parallel FT

[Chart: GFlops (0 to 35) against number of nodes (1, 2, 4, 8) for Class A, Class B, and Class C. Data labels in the order printed: 5.14, 7.92, 26.16, 14.01, 8.31, 27.95, 14.84, 5.39, 28.78, 15.10, 8.37.]

Page 42: SR8000 Concept

NAS Parallel CG

[Chart: GFlops (0 to 30) against number of nodes (1, 2, 4, 8) for Class B and Class C. Data labels in the order printed: 7.98, 14.54, 6.35, 3.67, 10.52, 16.46, 4.14, 27.86.]