Case Study on PBS Pro Operation on Large Scale Scientific GPU Cluster
PBS Works 2015, Mountain View
Case Study on PBS Pro Operation on Large Scale Scientific GPU Cluster
Taisuke Boku
Deputy Director, Center for Computational Sciences
University of Tsukuba
2015/09/16
Outline
• Introduction to CCS, Univ. of Tsukuba
• HA-PACS Project at CCS
• Mission of CCS and Job Requirements
• HA-PACS Operation Statistics
• Problems and Solutions
• Future Requests to Scheduler
• Summary
What is CCS?
Center for Computational Sciences
• Established in 1992; spent 12 years as the Center for Computational Physics
• Reorganized as the Center for Computational Sciences in 2004
• Daily collaborative research between two kinds of researchers (about 30 in total):
  - Computational scientists who have NEEDS (applications)
  - Computer scientists who have SEEDS (systems & solutions)
[Map: location of Tsukuba relative to Tokyo and Narita]
COMA (latest supercomputer in CCS)
Item: Specification
Computation node: Cray CS300 Cluster Unit with two Xeon Phi
CPU: Intel E5-2670v2 (Ivy Bridge EP)
# of cores: 10 cores/socket x 2 sockets = 20 cores/node
Clock: 2.5 GHz
Peak performance: 400 GFLOPS/node
PCI Express: Gen 3 x 80 lanes (40 lanes/CPU)
Memory: 64 GiB, DDR3 1866 MHz, 4 channels/socket, 119 GB/s/node
MIC: Intel Xeon Phi 7110P
# of MICs/node: 2
Peak performance (MIC): 2.14 TFLOPS/node (1.07 TFLOPS/MIC)
Memory (MIC): 16 GiB/node (8 GiB/MIC)
Interconnect: InfiniBand FDR (Mellanox ConnectX-3)
• COMA (Cluster Of Many-core Architecture), operation started in Apr. 2014
• 393 nodes, 1.001 PFLOPS peak (ranked #51 in TOP500, Jun. 2014)
• Vendor: Cray Inc.
HA-PACS Project at CCS, U. Tsukuba
• HA-PACS: Highly Accelerated Parallel Advanced system for Computational Sciences
• Design and deployment of a very high density, high performance GPU cluster for large scale scientific applications
• TCA (Tightly Coupled Accelerators): experimental implementation
• History
  - Apr. 2011: project launched
  - Feb. 2012: first Base Cluster part deployed (scheduler: SGE)
  - Nov. 2013: additional TCA part deployed (scheduler: PBS Pro for the entire system)
  - The system will be operated until Mar. 2017
• System vendor: Appro Inc. (Base Cluster) + Cray Inc. (TCA)
Spec. of HA-PACS Base Cluster & HA-PACS/TCA

Item: Base Cluster (Feb. 2012) / TCA (Nov. 2013)
Node: Cray GreenBlade 8204 / Cray 3623G4-SM
M/B: Intel Washington Pass / SuperMicro X9DRG-QF
CPU: Intel Xeon E5-2670 x 2 sockets (SandyBridge-EP, 2.6 GHz, 8 cores) / Intel Xeon E5-2680 v2 x 2 sockets (IvyBridge-EP, 2.8 GHz, 10 cores)
Memory: DDR3-1600 128 GB / DDR3-1866 128 GB
GPU: NVIDIA M2090 x 4 / NVIDIA K20X x 4
# of nodes (racks): 268 (26) / 64 (10)
Interconnect: Mellanox InfiniBand QDR x 2 (ConnectX-3) / Mellanox InfiniBand QDR x 2 + PEACH2
Peak perf.: 802 TFLOPS / 364 TFLOPS
Power: 408 kW / 99.3 kW
TOP500 rank: #41 (Jun. 2012) / #134 (Nov. 2013)
HA-PACS/TCA (Compute Node)
[Block diagram; red parts indicate components upgraded from the base cluster to TCA]
• CPU: 2 x Ivy Bridge, 2.8 GHz x 8 flop/clock (AVX): 22.4 GFLOPS x 20 cores = 448.0 GFLOPS
• Memory: DDR3-1866, 4 channels/socket at 59.7 GB/s; (16 GB, 14.9 GB/s) x 8 = 128 GB, 119.4 GB/s
• GPU: 4 x NVIDIA K20X on PCIe Gen2 x16 (8 GB/s each): 1.31 TFLOPS x 4 = 5.24 TFLOPS; GPU memory (6 GB, 250 GB/s) x 4 = 24 GB, 1 TB/s
• PEACH2 board (proprietary interconnect for TCA) and legacy devices attached via PCIe Gen2 x8
• Node total: 5.688 TFLOPS
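The per-node and TCA-partition peak figures above follow directly from the quoted clock rates and device counts; the short Python sketch below simply redoes that arithmetic with the slide's numbers as inputs.

```python
# Peak-performance arithmetic for one HA-PACS/TCA node, using the figures
# quoted on the slides (a checking sketch, not vendor data).

CPU_CLOCK_GHZ = 2.8        # Xeon E5-2680 v2
FLOPS_PER_CLOCK = 8        # double precision with AVX
CORES_PER_NODE = 20        # 10 cores/socket x 2 sockets
GPU_PEAK_TFLOPS = 1.31     # NVIDIA K20X, as listed on the slide
GPUS_PER_NODE = 4
TCA_NODES = 64

cpu_peak_gflops = CPU_CLOCK_GHZ * FLOPS_PER_CLOCK * CORES_PER_NODE  # 448.0 GFLOPS
gpu_peak_tflops = GPU_PEAK_TFLOPS * GPUS_PER_NODE                   # 5.24 TFLOPS
node_peak_tflops = gpu_peak_tflops + cpu_peak_gflops / 1000.0       # 5.688 TFLOPS
tca_peak_tflops = node_peak_tflops * TCA_NODES                      # ~364 TFLOPS

print(f"node peak: {node_peak_tflops:.3f} TFLOPS")
print(f"TCA peak:  {tca_peak_tflops:.1f} TFLOPS")
```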
HA-PACS Base Cluster
HA-PACS/TCA (Nov. 2013) + Base Cluster
[Photo: the base cluster and the TCA extension]
• TCA LINPACK: 277 TFLOPS (efficiency 76%)
• 3.52 GFLOPS/W, ranked #3 on the Green500 list (Nov. 2013)
Communication on TCA Architecture
[Diagram: two nodes, each with CPU, CPU memory, PCIe switch, GPUs, and GPU memory; the PEACH2 chips in the two nodes are connected directly over PCIe]
• PCIe is used as the communication link between accelerators across nodes
• Direct device-to-device (P2P) communication is available through PCIe
• PEACH2: PCI Express Adaptive Communication Hub ver. 2
  - Implementation of the interface and data transfer engine for TCA
PEACH2 board (production version for HA-PACS/TCA)
[Photo: main board + sub board]
• Most parts operate at 250 MHz (the PCIe Gen2 logic runs at 250 MHz)
• FPGA: Altera Stratix IV 530GX
• DDR3-SDRAM; power supply for various voltages
• PCI Express x8 card edge; PCIe x16 and x8 cable connectors
Mission of CCS
• Main mission
  - Research on advanced computational sciences, including particle physics, astrophysics, material science, climate, biology, and computer science (HPC)
• Secondary mission
  - CCS is organized as a nation-wide supercomputer center in Japan to support users in advanced computational sciences and to provide supercomputer resources to them
  - Most of the resources are dedicated to users free of charge, based on their scientific proposals and the judgement of a proposal review board, in every fiscal year (April to March)
  - Some resources are also provided to user groups with their own budgets, on a request basis
  - Even users from CCS must apply with a proposal and be assigned a CPU budget approved by the committee
Utilization of HA-PACS
• “Interdisciplinary Advanced Computational Science” program (FIXED-BUDGET)
  - Proposal-based and free of charge
  - Approx. 80% of the system resources are dedicated to it
  - Each accepted project is assigned a “node budget” for the year => users consume the total yearly budget accumulatively
• “Large Scale Scientific Use” program (FIXED-NODE)
  - Charged utilization, but still scientific use only
  - The remaining part of the system is dedicated to it
  - Each accepted project is assigned “nodes x months” => users can use the assigned number of nodes every month (the two accounting models are sketched below)
• Problem
  - How to mix these two categories of utilization efficiently
  - How to keep the system utilization ratio high
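To make the two accounting models concrete, a small Python sketch follows; the class names and all numbers are invented examples, not actual CCS allocation rules.

```python
# Toy accounting for the two programs described above.
# FIXED-BUDGET: a yearly node-hour budget, consumed accumulatively.
# FIXED-NODE: a fixed number of nodes available every month.
# All figures are invented for illustration.

class FixedBudgetProject:
    def __init__(self, yearly_node_hours):
        self.remaining = yearly_node_hours

    def charge(self, nodes, hours):
        """Consume the yearly budget; jobs stop once it is exhausted."""
        cost = nodes * hours
        if cost > self.remaining:
            raise RuntimeError("yearly node budget exhausted")
        self.remaining -= cost

class FixedNodeProject:
    def __init__(self, nodes_per_month):
        self.nodes_per_month = nodes_per_month  # renewed every month

    def nodes_available(self):
        return self.nodes_per_month

p1 = FixedBudgetProject(yearly_node_hours=200_000)
p1.charge(nodes=64, hours=24)   # one 24-hour, 64-node job
print(p1.remaining)             # 198464 node-hours left for the year
p2 = FixedNodeProject(nodes_per_month=16)
print(p2.nodes_available())     # 16 nodes every month, regardless of past use
```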
Assumption and Solution
• Assumption
  - Since the system is for large scale advanced computational sciences, we accept jobs up to full-system size => job size varies over a very wide range
  - Every job is assumed to run in parallel (MPI + OpenMP + CUDA)
  - Long-running jobs are allowed: up to 24 hours
• Solution to maximize system utilization (see the sketch below)
  - We do not create an individual partition for FIXED-NODE projects; all nodes are shared by both FIXED-BUDGET and FIXED-NODE projects
  - FIXED-NODE projects are always assigned very high priority
  - A fair-share policy is applied to FIXED-BUDGET projects
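As an illustration only (not the actual PBS Pro configuration used on HA-PACS), here is a minimal Python sketch of the dispatch ordering described above: FIXED-NODE jobs always sort first, while FIXED-BUDGET jobs are ordered by their project's accumulated usage, least-served first. The project names, node counts, and usage figures are invented.

```python
# Toy model of the HA-PACS dispatch order described on this slide.
# FIXED-NODE jobs get absolute priority; FIXED-BUDGET jobs are ordered
# by a fair-share metric (accumulated node-hours of their project).
# All project names and numbers below are made up for illustration.

from dataclasses import dataclass

@dataclass
class Job:
    project: str
    category: str            # "FIXED-NODE" or "FIXED-BUDGET"
    nodes: int
    used_node_hours: float   # project's accumulated usage (fair-share input)

def dispatch_order(jobs):
    """Sort queued jobs: FIXED-NODE first, then FIXED-BUDGET by lowest usage."""
    return sorted(
        jobs,
        key=lambda j: (j.category != "FIXED-NODE", j.used_node_hours),
    )

queue = [
    Job("astro-A", "FIXED-BUDGET", 64,  used_node_hours=120_000.0),
    Job("qcd-B",   "FIXED-NODE",   32,  used_node_hours=30_000.0),
    Job("bio-C",   "FIXED-BUDGET", 128, used_node_hours=5_000.0),
]

for job in dispatch_order(queue):
    print(job.project, job.category, job.nodes)
# qcd-B runs first (FIXED-NODE), then bio-C (least usage), then astro-A
```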
Actual job execution situation
• FIXED-NODE projects may have to wait for resources in some cases, up to 24 hours in the worst case (this is included in the contract of supercomputer usage)
• Since there is no separate partition for FIXED-NODE projects, FIXED-BUDGET projects can use the resources while FIXED-NODE projects are not using the system
• Fair-share for FIXED-BUDGET projects is important because there are roughly two kinds of users:
  (1) jump-starters: run their jobs steadily from the beginning of the yearly program
  (2) slow-starters: do not use their budget in the first half of the yearly program, then run their jobs heavily in the latter half
• With fair-share, jump-starters and slow-starters can coexist peacefully every year
Effect of mixing FIXED-NODE and FIXED-BUDGET
[Figure: with the originally assigned ratio of FIXED-BUDGET to FIXED-NODE resources, low utilization by FIXED-NODE projects leaves a lost part of the system; sharing all nodes recovers it toward full system utilization]
• We ask FIXED-NODE users to wait a moment => greatly improved system utilization
HA-PACS system utilization statistics
[Chart: operation history from Feb. 2012 to Aug. 2015; green = operation rate, red = utilization rate; annotated phases: SGE operation (base only), TCA extension, non-fair-share, fair-share working correctly]
• 02/2012-09/2012: test operation
• 10/2012-09/2013: SGE-based operation with a wrong fair-share setting + no FIXED-NODE
• 10/2013-11/2013: TCA part extension (system stopped)
• 12/2014: problem with the wrong fair-share setting found => fixed around 04/2015
• 04/2015: FIXED-NODE operation starts
• After the correction of the fair-share setting, user complaints dropped drastically
Our experience with PBS Pro
• The basic features are very good, provided we configure the system correctly
• We run the system in a node-by-node manner (not core-by-core), so some features are redundant => this caused some mistakes in parameter settings
• The many options sometimes confuse the system manager (or even the system vendor)
• A “system consistency check” function in the scheduler would be desirable (a small sketch of the idea follows this slide)
  - e.g., if two setting parameters are inconsistent, issue a warning to the system engineer
• A “REPLAY” feature would also be desirable
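A minimal Python sketch of the kind of consistency check meant here, run over a made-up parameter dictionary; the parameter names and rules are hypothetical, not real PBS Pro settings.

```python
# Sketch of a "system consistency check": scan a configuration for pairs of
# settings that contradict each other and warn the administrator.
# The parameter names and rules are invented for illustration only.

config = {
    "node_allocation": "exclusive",    # hypothetical: whole nodes per job
    "share_cores_between_jobs": True,  # hypothetical: contradicts the above
    "max_walltime_hours": 24,
}

# Each rule: (description, predicate that is True when the config is inconsistent)
rules = [
    ("exclusive node allocation contradicts core sharing between jobs",
     lambda c: c["node_allocation"] == "exclusive"
               and c["share_cores_between_jobs"]),
    ("walltime limit must be positive",
     lambda c: c["max_walltime_hours"] <= 0),
]

for description, is_inconsistent in rules:
    if is_inconsistent(config):
        print(f"WARNING: {description}")
```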
Idea of “REPLAY”
• In such a complicated operation situation, we would like to test several parameter sets or scheduling policies
• The system has a complete log of past executions
• If the scheduler could simulate (REPLAY) a new set of parameters and policies against that past history, it would greatly help system utilization analysis (a sketch of the idea follows)
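One way to picture the requested REPLAY feature is the minimal Python sketch below: feed a recorded job log through a candidate policy and report the utilization it would have achieved. The log format, the greedy policy, and the numbers are all hypothetical; this is not an existing PBS Pro interface.

```python
# Sketch of the "REPLAY" idea: re-run a recorded job log under a candidate
# scheduling policy and report the node utilization it would have achieved.
# Log format, policy, and figures are hypothetical.

TOTAL_NODES = 332          # base cluster (268) + TCA (64)

# Each record: (submit_hour, nodes_requested, run_hours) taken from a past log.
job_log = [
    (0, 64, 12.0),
    (1, 128, 6.0),
    (2, 332, 24.0),
]

def replay(log, total_nodes, horizon_hours):
    """Greedy first-come-first-served replay (toy model: one job at a time);
    swap in other policies here to compare their utilization."""
    free_at = 0.0
    busy_node_hours = 0.0
    for submit, nodes, run in sorted(log):
        start = max(submit, free_at)
        free_at = start + run
        busy_node_hours += nodes * run
    return busy_node_hours / (total_nodes * horizon_hours)

print(f"utilization: {replay(job_log, TOTAL_NODES, horizon_hours=48):.1%}")
```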
Summary
• At CCS, University of Tsukuba, we run the large scale GPU cluster HA-PACS under the PBS Pro scheduler
• Our operational assumptions are complicated, and we need to set the scheduling policy and parameters very carefully
• The current scheduling policy forces the paying users (FIXED-NODE) to wait for a moment before job execution, but it greatly improves the system utilization rate
• Some more sophisticated features to support the operation policy are desired