"The Architecture of Massively Parallel Processor CP-PACS"
Taisuke Boku, Hiroshi Nakamura, et al.
University of Tsukuba, Japan
by Emre Tapcı


TRANSCRIPT

Page 1: "The Architecture of Massively Parallel Processor CP-PACS"

Taisuke Boku, Hiroshi Nakamura, et al.
University of Tsukuba, Japan
by Emre Tapcı

Page 2: Outline

Introduction
Specification of CP-PACS
Pseudo Vector Processor PVP-SW
Interconnection Network of CP-PACS
  Hyper-crossbar Network
  Remote DMA message transfer
  Message broadcasting
  Barrier synchronization
Performance Evaluation
Conclusion, References, Questions & Comments

Page 3: Introduction

CP-PACS: Computational Physics by Parallel Array Computer Systems.
Goal: construct a dedicated MPP for computational physics, in particular for the study of Quantum Chromodynamics (QCD).
Center for Computational Physics, University of Tsukuba, Japan.

Page 4: Specification of CP-PACS

MIMD parallel processing system with distributed memory.
Each Processing Unit (PU) has a RISC processor and a local memory.
2048 such PUs, connected by an interconnection network.
128 I/O units, which provide a distributed disk space.

Page 5: Specification of CP-PACS

Page 6: Specification of CP-PACS

Theoretical performance:
  To solve problems such as QCD and astro-fluid dynamics, a great number of PUs is required.
  For budget and reliability reasons, the number of PUs is limited to 2048.

Page 7: Specification of CP-PACS

Node processor:
  Improve the capability of the node processor first.
  Caches do not work efficiently on ordinary RISC processors for these applications.
  A new technique for the cache function is introduced: PVP-SW.

Page 8: Specification of CP-PACS

Interconnection network:
  3-dimensional Hyper-Crossbar (3-D HXB).
  Peak throughput of a single link: 300 MB/s.
  Provides:
    hardware message broadcasting,
    block-stride message transfer,
    barrier synchronization.

Page 9: Specification of CP-PACS

I/O system:
  128 I/O units, equipped with a RAID-5 hard disk system.
  528 GB of total system disk space.
  The RAID-5 organization increases fault tolerance.

Page 10: Pseudo Vector Processor PVP-SW

MPPs require high-performance node processors.
A node processor cannot achieve high performance unless its cache system works efficiently, but here:
  little temporal locality exists, and
  the data space of the application is much larger than the cache size.

Page 11: Pseudo Vector Processor PVP-SW

Vector processors:
  Main memory is pipelined.
  The vector length of loads/stores is long.
  Loads/stores are executed in parallel with arithmetic execution.
We require these properties in our node processor, so PVP-SW is introduced.
It is a pseudo-vector scheme.

Page 12: Pseudo Vector Processor PVP-SW

The number of registers cannot simply be increased, because the register field in the instruction format is limited.
So a new technique, Slide-Windowed Registers, is introduced.

Page 13: Pseudo Vector Processor PVP-SW

Slide-Windowed Registers:
  The physical registers are organized into logical windows; a window consists of 32 registers.
  The total number of registers is 128.
  There are global registers and window registers:
    global registers are static and shared by all windows;
    local (window) registers are not shared.
  Only one window is active at a time (a rough model of the addressing follows this slide).
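As a rough, purely illustrative model of how such a windowed register file could be addressed (the slides do not give the actual split between global and window registers or the encoding), a logical register name either selects a shared global register or an offset into the window chosen by the window pointer, called FW-STP on the next slide:

```c
/* Illustrative model of slide-windowed register addressing.
 * Assumptions (not from the talk): the first NUM_GLOBAL logical names
 * are global registers; the rest are offsets into the window selected
 * by the window pointer (FW-STP on the next slide). */
#define NUM_PHYSICAL 128   /* total physical registers (from the slide) */
#define WINDOW_SIZE   32   /* registers visible through one window      */
#define NUM_GLOBAL     8   /* assumed number of shared global registers */

/* Map a logical register name to a physical register for the window
 * whose base is fw_stp. */
static int physical_reg(int logical, int fw_stp)
{
    if (logical < NUM_GLOBAL)
        return logical;   /* globals: identical in every window */
    /* window-relative: slide by fw_stp and wrap around the register file */
    return (fw_stp + (logical - NUM_GLOBAL)) % NUM_PHYSICAL;
}
```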

Page 14: Pseudo Vector Processor PVP-SW

Slide-Windowed Registers:
  The active window is identified by a pointer, FW-STP.
  New instructions are introduced to deal with FW-STP (their intended use is sketched after this slide):
    FWSTPSet: sets a new location for FW-STP.
    FRPreload: loads data from memory into a window.
    FRPoststore: stores data from a window into memory.
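To make the intended use of these instructions concrete, here is a minimal C model of the strip-mined pattern PVP-SW is meant to accelerate: operands are loaded into a small working set (standing in for a register window), arithmetic runs on that working set, and results are stored back. In the real scheme the three phases overlap, FWSTPSet slides the window pointer instead of the array index used here, and FRPreload/FRPoststore perform the transfers; this is only an assumed illustration of the pattern, not the actual programming interface.

```c
#include <stddef.h>

/* Strip-mined daxpy, y[i] = a*x[i] + y[i], as a software model of the
 * PVP-SW pattern.  WIN is an assumed strip length; on CP-PACS a strip
 * would live in one register window, filled by FRPreload, computed on
 * while another window is being loaded, and drained by FRPoststore. */
#define WIN 8

void daxpy_pvp_like(size_t n, double a, const double *x, double *y)
{
    for (size_t s = 0; s < n; s += WIN) {
        size_t m = (n - s < WIN) ? (n - s) : WIN;
        double xw[WIN], yw[WIN];

        for (size_t i = 0; i < m; ++i) {   /* stands in for FRPreload   */
            xw[i] = x[s + i];
            yw[i] = y[s + i];
        }
        for (size_t i = 0; i < m; ++i)     /* compute on the "window"   */
            yw[i] = a * xw[i] + yw[i];
        for (size_t i = 0; i < m; ++i)     /* stands in for FRPoststore */
            y[s + i] = yw[i];
    }
}
```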

Page 15: Pseudo Vector Processor PVP-SW

Page 16: Interconnection Network of CP-PACS

The topology is a Hyper-Crossbar Network (HXB).
8 x 17 x 16 arrangement: 2048 PUs and 128 I/O units.
Along each dimension, the nodes are interconnected by a crossbar; for example, on the Y dimension a Y x Y crossbar is used.
Routing is simple: route on the three dimensions consecutively (see the sketch below).
Wormhole routing is employed.
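A small sketch of the dimension-ordered routing just described, assuming the route fixes the X, then Y, then Z coordinate with one crossbar hop each (the slides only say the three dimensions are routed consecutively, so the specific order here is an assumption):

```c
/* Node coordinates in the 3-D hyper-crossbar.  On CP-PACS the three
 * dimensions have sizes 8 x 17 x 16, covering 2048 PUs and 128 I/O units. */
typedef struct { int x, y, z; } coord;

/* Dimension-ordered routing: each hop crosses one crossbar and fixes one
 * coordinate, so any destination is reached in at most three hops. */
static int route(coord src, coord dst, coord path[3])
{
    int hops = 0;
    coord cur = src;
    if (cur.x != dst.x) { cur.x = dst.x; path[hops++] = cur; }  /* X crossbar */
    if (cur.y != dst.y) { cur.y = dst.y; path[hops++] = cur; }  /* Y crossbar */
    if (cur.z != dst.z) { cur.z = dst.z; path[hops++] = cur; }  /* Z crossbar */
    return hops;  /* network diameter: 3 crossbar hops */
}
```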

Page 17: Interconnection Network of CP-PACS

Wormhole routing and the HXB together have these properties:
  Small network diameter.
  A torus of the same size can be simulated.
  Message broadcasting is done by hardware.
  A binary hypercube can be emulated.
  Throughput under uniform random transfer is high.

Page 18: Interconnection Network of CP-PACS

Remote DMA transfer:
  Making a system call to the OS and copying data into the OS area for every transfer is costly.
  Instead, the remote node's memory is accessed directly (a hypothetical descriptor is sketched below).
  Remote DMA is good because:
    kernel/user mode switching, which is tedious, is avoided;
    redundant data copying between user and kernel space is not done.
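As an illustration of what a user-level remote DMA transfer has to specify, here is a hypothetical descriptor; the actual CP-PACS interface is not given in the talk, so every field name below is an assumption. The essential point is that the network interface reads directly from user memory on the sender and writes directly into user memory on the receiver, so neither a mode switch nor a user/kernel copy sits on the transfer path.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical remote DMA "put" descriptor (field names are assumed,
 * not taken from the CP-PACS documentation).  It also shows where the
 * block-stride transfer mentioned on page 8 would be described. */
struct rdma_put {
    int      dst_pu;       /* destination processing unit                  */
    uint64_t local_addr;   /* source address in the sender's user memory   */
    uint64_t remote_addr;  /* destination address in the receiver's memory */
    size_t   block_len;    /* length of each contiguous block              */
    size_t   stride;       /* distance between consecutive blocks          */
    size_t   nblocks;      /* number of blocks to transfer                 */
};
```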

Page 19: Interconnection Network of CP-PACS

Message broadcasting:
  Supported by hardware.
    First performed on one dimension, then on the other dimensions (sketched below).
  Hardware mechanisms prevent the deadlock that could arise when two nodes broadcast at the same time.
  Hardware partitioning is possible: a broadcast message is sent only to the nodes in the sender's partition.
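The dimension-by-dimension broadcast can be pictured with a small back-of-the-envelope sketch; the X-then-Y-then-Z order below is an assumption, since the slide only says one dimension is handled first and then the others:

```c
/* Nodes reached after each broadcast phase on the 8 x 17 x 16
 * hyper-crossbar (PUs plus I/O units).  Phase 1 covers the sender's
 * X crossbar, phase 2 adds the Y dimension, phase 3 the Z dimension. */
#define NX 8
#define NY 17
#define NZ 16

static long nodes_reached(int phase)
{
    long reached = 1;                /* the sender itself              */
    if (phase >= 1) reached *= NX;   /* after the X-crossbar broadcast */
    if (phase >= 2) reached *= NY;   /* after the Y-crossbar broadcast */
    if (phase >= 3) reached *= NZ;   /* 8 * 17 * 16 = 2176 endpoints   */
    return reached;
}
```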

Page 20: Interconnection Network of CP-PACS

Barrier synchronization:
  A synchronization mechanism is required in inter-processor communication systems (a usage sketch follows this slide).
  CP-PACS provides a hardware barrier synchronization facility.
    It uses special synchronization packets, distinct from the usual data packets.
  Barrier synchronization can also be used by partitioned pieces of the network.
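A typical usage pattern for such a barrier, with a hypothetical wrapper name since the talk does not give the programming interface: every PU must finish its local update before any PU reads a neighbour's data in the next step.

```c
/* Sketch of barrier use in a data-parallel sweep.  barrier_sync() is a
 * hypothetical wrapper around the hardware facility described above
 * (which uses special synchronization packets rather than data packets). */
extern void barrier_sync(void);          /* hypothetical wrapper name */
extern void update_local_block(void);    /* application-specific work */
extern void exchange_boundaries(void);   /* neighbour communication   */

void sweep(int nsteps)
{
    for (int t = 0; t < nsteps; ++t) {
        update_local_block();    /* local computation on this PU         */
        barrier_sync();          /* wait until every PU in the partition */
                                 /* has reached this point               */
        exchange_boundaries();   /* now safe to read neighbours' results */
    }
}
```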

Page 21: Performance Evaluation

Based on the LINPACK benchmark: LU decomposition of a matrix.
The outer-product method is used, based on a 2-dimensional block-cyclic distribution (see the sketch below).
All floating-point and data load/store operations are done in the PVP-SW manner.
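For readers unfamiliar with the distribution, here is a small sketch of 2-D block-cyclic ownership: block (I, J) of the matrix belongs to process (I mod P, J mod Q) on a P x Q process grid, so as the outer-product factorization shrinks the active submatrix, the remaining work stays spread over all PUs. The grid shape and block size are not stated on the slide, so the values in the example are assumptions.

```c
/* 2-D block-cyclic ownership: the matrix is cut into NB x NB blocks and
 * block (I, J) is owned by process (I % P, J % Q) on a P x Q grid. */
typedef struct { int row, col; } proc_coord;

static proc_coord owner_of_block(int I, int J, int P, int Q)
{
    proc_coord p = { I % P, J % Q };
    return p;
}

/* Example (assumed values): with P = 32, Q = 64 (2048 PUs) each process
 * owns every 32nd block-row and every 64th block-column, so even late
 * outer-product steps, which touch only the trailing submatrix, still
 * involve every PU. */
```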

Page 22: Performance Evaluation

[Figure: Rmax (Gflop/s) versus number of PUs (2^n).]

Page 23: Performance Evaluation

[Figure: Nmax (matrix size) versus number of PUs (2^n).]

Page 24: Performance Evaluation

[Figure: Rmax/peak ratio (effectiveness) versus number of PUs (2^n).]

Page 25: Conclusion

CP-PACS is operational at the University of Tsukuba.
It is working on large-scale QCD calculations.
Sponsored by Hitachi Ltd. and a Grant-in-Aid of the Ministry of Education, Science and Culture of Japan.

Page 26: References

T. Boku, H. Nakamura, K. Nakazawa, Y. Iwasaki, "The Architecture of Massively Parallel Processor CP-PACS", Institute of Information Sciences and Electronics, University of Tsukuba.

Page 27: Questions & Comments