"The Architecture of Massively Parallel Processor CP-PACS"
Taisuke Boku, Hiroshi Nakamura, et al.
University of Tsukuba, Japan
by Emre Tapcı


TRANSCRIPT

Page 1: "The Architecture of Massively Parallel Processor CP-PACS"

Taisuke Boku, Hiroshi Nakamura, et al.
University of Tsukuba, Japan
by Emre Tapcı

Page 2: Outline

Introduction
Specification of CP-PACS
Pseudo Vector Processor PVP-SW
Interconnection Network of CP-PACS
  Hyper-crossbar Network
  Remote DMA message transfer
  Message broadcasting
  Barrier synchronization
Performance Evaluation
Conclusion, References, Questions & Comments

Page 3: Introduction

CP-PACS: Computational Physics by Parallel Array Computer Systems.
Goal: construct a dedicated MPP for computational physics, in particular for the study of Quantum Chromodynamics (QCD).
Center for Computational Physics, University of Tsukuba, Japan.

Page 4: Specification of CP-PACS

MIMD parallel processing system with distributed memory.
Each Processing Unit (PU) has a RISC processor and a local memory.
2048 such PUs, connected by an interconnection network.
128 I/O units, which provide a distributed disk space.

Page 5: Specification of CP-PACS

Page 6: Specification of CP-PACS

Theoretical performance:
  To solve problems such as QCD and astro-fluid dynamics, a great number of PUs is required.
  For budget and reliability reasons, the number of PUs is limited to 2048.

Page 7: Specification of CP-PACS

Node processor:
  Improve the capability of the node processor first.
  Caches do not work efficiently on ordinary RISC processors for these applications.
  A new technique for the cache function is introduced: PVP-SW.

Page 8: Specification of CP-PACS

Interconnection network:
  3-dimensional Hyper-Crossbar (3-D HXB).
  Peak throughput of a single link: 300 MB/s.
  Provides:
    hardware message broadcasting,
    block-stride message transfer,
    barrier synchronization.

Page 9: Specification of CP-PACS

I/O system:
  128 I/O units, equipped with a RAID-5 hard disk system.
  528 GB of total system disk space.
  The RAID-5 organization increases fault tolerance.

Page 10: Pseudo Vector Processor PVP-SW

MPPs require high-performance node processors.
A node processor cannot achieve high performance unless its cache system works efficiently, but here:
  little temporal locality exists, and
  the data space of the application is much larger than the cache size.

Page 11: Pseudo Vector Processor PVP-SW

Vector processors:
  Main memory is pipelined.
  The vector length of loads/stores is long.
  Loads/stores are executed in parallel with arithmetic execution.
We require these properties in our node processor, so PVP-SW is introduced.
It is a pseudo-vector scheme.

Page 12: Pseudo Vector Processor PVP-SW

The number of registers cannot simply be increased, because the register field in the instruction format is limited.
So a new technique, Slide-Windowed Registers, is introduced.

Page 13: Pseudo Vector Processor PVP-SW

Slide-Windowed Registers:
  The physical registers are organized into logical windows; a window consists of 32 registers.
  The total number of registers is 128.
  There are global registers and window registers:
    global registers are static and shared by all windows;
    local (window) registers are not shared.
  Only one window is active at a time (a rough model of the addressing follows this slide).
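As a rough, purely illustrative model of how such a windowed register file could be addressed (the slides do not give the actual split between global and window registers or the encoding), a logical register name either selects a shared global register or an offset into the window chosen by the window pointer, called FW-STP on the next slide:

```c
/* Illustrative model of slide-windowed register addressing.
 * Assumptions (not from the talk): the first NUM_GLOBAL logical names
 * are global registers; the rest are offsets into the window selected
 * by the window pointer (FW-STP on the next slide). */
#define NUM_PHYSICAL 128   /* total physical registers (from the slide) */
#define WINDOW_SIZE   32   /* registers visible through one window      */
#define NUM_GLOBAL     8   /* assumed number of shared global registers */

/* Map a logical register name to a physical register for the window
 * whose base is fw_stp. */
static int physical_reg(int logical, int fw_stp)
{
    if (logical < NUM_GLOBAL)
        return logical;   /* globals: identical in every window */
    /* window-relative: slide by fw_stp and wrap around the register file */
    return (fw_stp + (logical - NUM_GLOBAL)) % NUM_PHYSICAL;
}
```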

Page 14: Pseudo Vector Processor PVP-SW

Slide-Windowed Registers:
  The active window is identified by a pointer, FW-STP.
  New instructions are introduced to deal with FW-STP (their intended use is sketched after this slide):
    FWSTPSet: sets a new location for FW-STP.
    FRPreload: loads data from memory into a window.
    FRPoststore: stores data from a window into memory.
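To make the intended use of these instructions concrete, here is a minimal C model of the strip-mined pattern PVP-SW is meant to accelerate: operands are loaded into a small working set (standing in for a register window), arithmetic runs on that working set, and results are stored back. In the real scheme the three phases overlap, FWSTPSet slides the window pointer instead of the array index used here, and FRPreload/FRPoststore perform the transfers; this is only an assumed illustration of the pattern, not the actual programming interface.

```c
#include <stddef.h>

/* Strip-mined daxpy, y[i] = a*x[i] + y[i], as a software model of the
 * PVP-SW pattern.  WIN is an assumed strip length; on CP-PACS a strip
 * would live in one register window, filled by FRPreload, computed on
 * while another window is being loaded, and drained by FRPoststore. */
#define WIN 8

void daxpy_pvp_like(size_t n, double a, const double *x, double *y)
{
    for (size_t s = 0; s < n; s += WIN) {
        size_t m = (n - s < WIN) ? (n - s) : WIN;
        double xw[WIN], yw[WIN];

        for (size_t i = 0; i < m; ++i) {   /* stands in for FRPreload   */
            xw[i] = x[s + i];
            yw[i] = y[s + i];
        }
        for (size_t i = 0; i < m; ++i)     /* compute on the "window"   */
            yw[i] = a * xw[i] + yw[i];
        for (size_t i = 0; i < m; ++i)     /* stands in for FRPoststore */
            y[s + i] = yw[i];
    }
}
```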

Page 15: Pseudo Vector Processor PVP-SW

Page 16: Interconnection Network of CP-PACS

The topology is a Hyper-Crossbar Network (HXB).
8 x 17 x 16 arrangement: 2048 PUs and 128 I/O units.
Along each dimension, the nodes are interconnected by a crossbar; for example, on the Y dimension a Y x Y crossbar is used.
Routing is simple: route on the three dimensions consecutively (see the sketch below).
Wormhole routing is employed.
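A small sketch of the dimension-ordered routing just described, assuming the route fixes the X, then Y, then Z coordinate with one crossbar hop each (the slides only say the three dimensions are routed consecutively, so the specific order here is an assumption):

```c
/* Node coordinates in the 3-D hyper-crossbar.  On CP-PACS the three
 * dimensions have sizes 8 x 17 x 16, covering 2048 PUs and 128 I/O units. */
typedef struct { int x, y, z; } coord;

/* Dimension-ordered routing: each hop crosses one crossbar and fixes one
 * coordinate, so any destination is reached in at most three hops. */
static int route(coord src, coord dst, coord path[3])
{
    int hops = 0;
    coord cur = src;
    if (cur.x != dst.x) { cur.x = dst.x; path[hops++] = cur; }  /* X crossbar */
    if (cur.y != dst.y) { cur.y = dst.y; path[hops++] = cur; }  /* Y crossbar */
    if (cur.z != dst.z) { cur.z = dst.z; path[hops++] = cur; }  /* Z crossbar */
    return hops;  /* network diameter: 3 crossbar hops */
}
```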

Page 17: Interconnection Network of CP-PACS

Wormhole routing and the HXB together have these properties:
  Small network diameter.
  A torus of the same size can be simulated.
  Message broadcasting is done by hardware.
  A binary hypercube can be emulated.
  Throughput under uniform random transfer is high.

Page 18: Interconnection Network of CP-PACS

Remote DMA transfer:
  Making a system call to the OS and copying data into the OS area for every transfer is costly.
  Instead, the remote node's memory is accessed directly (a hypothetical descriptor is sketched below).
  Remote DMA is good because:
    kernel/user mode switching, which is tedious, is avoided;
    redundant data copying between user and kernel space is not done.
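As an illustration of what a user-level remote DMA transfer has to specify, here is a hypothetical descriptor; the actual CP-PACS interface is not given in the talk, so every field name below is an assumption. The essential point is that the network interface reads directly from user memory on the sender and writes directly into user memory on the receiver, so neither a mode switch nor a user/kernel copy sits on the transfer path.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical remote DMA "put" descriptor (field names are assumed,
 * not taken from the CP-PACS documentation).  It also shows where the
 * block-stride transfer mentioned on page 8 would be described. */
struct rdma_put {
    int      dst_pu;       /* destination processing unit                  */
    uint64_t local_addr;   /* source address in the sender's user memory   */
    uint64_t remote_addr;  /* destination address in the receiver's memory */
    size_t   block_len;    /* length of each contiguous block              */
    size_t   stride;       /* distance between consecutive blocks          */
    size_t   nblocks;      /* number of blocks to transfer                 */
};
```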

Page 19: Interconnection Network of CP-PACS

Message broadcasting:
  Supported by hardware.
    First performed on one dimension, then on the other dimensions (sketched below).
  Hardware mechanisms prevent the deadlock that could arise when two nodes broadcast at the same time.
  Hardware partitioning is possible: a broadcast message is sent only to the nodes in the sender's partition.
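The dimension-by-dimension broadcast can be pictured with a small back-of-the-envelope sketch; the X-then-Y-then-Z order below is an assumption, since the slide only says one dimension is handled first and then the others:

```c
/* Nodes reached after each broadcast phase on the 8 x 17 x 16
 * hyper-crossbar (PUs plus I/O units).  Phase 1 covers the sender's
 * X crossbar, phase 2 adds the Y dimension, phase 3 the Z dimension. */
#define NX 8
#define NY 17
#define NZ 16

static long nodes_reached(int phase)
{
    long reached = 1;                /* the sender itself              */
    if (phase >= 1) reached *= NX;   /* after the X-crossbar broadcast */
    if (phase >= 2) reached *= NY;   /* after the Y-crossbar broadcast */
    if (phase >= 3) reached *= NZ;   /* 8 * 17 * 16 = 2176 endpoints   */
    return reached;
}
```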

Page 20: Interconnection Network of CP-PACS

Barrier synchronization:
  A synchronization mechanism is required in inter-processor communication systems (a usage sketch follows this slide).
  CP-PACS provides a hardware barrier synchronization facility.
    It uses special synchronization packets, distinct from the usual data packets.
  Barrier synchronization can also be used by partitioned pieces of the network.
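A typical usage pattern for such a barrier, with a hypothetical wrapper name since the talk does not give the programming interface: every PU must finish its local update before any PU reads a neighbour's data in the next step.

```c
/* Sketch of barrier use in a data-parallel sweep.  barrier_sync() is a
 * hypothetical wrapper around the hardware facility described above
 * (which uses special synchronization packets rather than data packets). */
extern void barrier_sync(void);          /* hypothetical wrapper name */
extern void update_local_block(void);    /* application-specific work */
extern void exchange_boundaries(void);   /* neighbour communication   */

void sweep(int nsteps)
{
    for (int t = 0; t < nsteps; ++t) {
        update_local_block();    /* local computation on this PU         */
        barrier_sync();          /* wait until every PU in the partition */
                                 /* has reached this point               */
        exchange_boundaries();   /* now safe to read neighbours' results */
    }
}
```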

Page 21: Performance Evaluation

Based on the LINPACK benchmark: LU decomposition of a matrix.
The outer-product method is used, based on a 2-dimensional block-cyclic distribution (see the sketch below).
All floating-point and data load/store operations are done in the PVP-SW manner.
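For readers unfamiliar with the distribution, here is a small sketch of 2-D block-cyclic ownership: block (I, J) of the matrix belongs to process (I mod P, J mod Q) on a P x Q process grid, so as the outer-product factorization shrinks the active submatrix, the remaining work stays spread over all PUs. The grid shape and block size are not stated on the slide, so the values in the example are assumptions.

```c
/* 2-D block-cyclic ownership: the matrix is cut into NB x NB blocks and
 * block (I, J) is owned by process (I % P, J % Q) on a P x Q grid. */
typedef struct { int row, col; } proc_coord;

static proc_coord owner_of_block(int I, int J, int P, int Q)
{
    proc_coord p = { I % P, J % Q };
    return p;
}

/* Example (assumed values): with P = 32, Q = 64 (2048 PUs) each process
 * owns every 32nd block-row and every 64th block-column, so even late
 * outer-product steps, which touch only the trailing submatrix, still
 * involve every PU. */
```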

Page 22: Performance Evaluation

[Figure: Rmax (Gflop/s) versus number of PUs (2^n).]

Page 23: Performance Evaluation

[Figure: Nmax (matrix size) versus number of PUs (2^n).]

Page 24: Performance Evaluation

[Figure: Rmax/peak ratio (effectiveness) versus number of PUs (2^n).]

Page 25: Conclusion

CP-PACS is operational at the University of Tsukuba.
It is working on large-scale QCD calculations.
Sponsored by Hitachi Ltd. and a Grant-in-Aid of the Ministry of Education, Science and Culture of Japan.

Page 26: References

T. Boku, H. Nakamura, K. Nakazawa, Y. Iwasaki, "The Architecture of Massively Parallel Processor CP-PACS", Institute of Information Sciences and Electronics, University of Tsukuba.

Page 27: Questions & Comments