Case Study on PBS Pro Operation on Large Scale Scientific GPU Cluster
PBS Works 2015, Mountain View
Case Study on PBS Pro Operation on Large Scale Scientific GPU Cluster
Taisuke Boku
Deputy Director, Center for Computational Sciences
University of Tsukuba
2015/09/16
Outline
• Introduction to CCS, Univ. of Tsukuba
• HA-PACS Project at CCS
• Mission of CCS and Job Requirements
• HA-PACS Operation Statistics
• Problems and Solutions
• Future Requests to Scheduler
• Summary
What is CCS?
Center for Computational Sciences
• Established in 1992; spent 12 years as the Center for Computational Physics
• Reorganized as the Center for Computational Sciences in 2004
• Daily collaborative research between two kinds of researchers (about 30 in total):
  - Computational scientists who have NEEDS (applications)
  - Computer scientists who have SEEDS (systems & solutions)
[Map: location of Tsukuba relative to Tokyo and Narita]
COMA (latest supercomputer in CCS)
Item: Specification
Computation node: Cray CS300 Cluster Unit with two Xeon Phi
CPU: Intel E5-2670v2 (Ivy Bridge EP)
# of cores: 10 cores/socket x 2 sockets = 20 cores/node
Clock: 2.5 GHz
Peak performance: 400 GFLOPS/node
PCI Express: Gen 3 x 80 lanes (40 lanes/CPU)
Memory: 64 GiB, DDR3 1866 MHz, 4 channels/socket, 119 GB/s/node
MIC: Intel Xeon Phi 7110P
# of MICs/node: 2
Peak performance (MIC): 2.14 TFLOPS/node (1.07 TFLOPS/MIC)
Memory (MIC): 16 GiB/node (8 GiB/MIC)
Interconnect: InfiniBand FDR (Mellanox ConnectX-3)
• COMA (Cluster Of Many-core Architecture), operation started in Apr. 2014
• 393 nodes, 1.001 PFLOPS peak (ranked #51 in TOP500, Jun. 2014)
• Vendor: Cray Inc.
HA-PACS Project at CCS, U. Tsukuba
• HA-PACS: Highly Accelerated Parallel Advanced system for Computational Sciences
• Design and deployment of a very high density, high performance GPU cluster for large scale scientific applications
• TCA (Tightly Coupled Accelerators): experimental implementation
• History
  - Apr. 2011: project launched
  - Feb. 2012: first Base Cluster part deployed (scheduler: SGE)
  - Nov. 2013: additional TCA part deployed (scheduler: PBS Pro for the entire system)
  - The system will be operated until Mar. 2017
• System vendor: Appro Inc. (Base Cluster) + Cray Inc. (TCA)
Spec. of HA-PACS Base Cluster & HA-PACS/TCA

Item: Base Cluster (Feb. 2012) / TCA (Nov. 2013)
Node: Cray GreenBlade 8204 / Cray 3623G4-SM
M/B: Intel Washington Pass / SuperMicro X9DRG-QF
CPU: Intel Xeon E5-2670 x 2 sockets (SandyBridge-EP, 2.6 GHz, 8 cores) / Intel Xeon E5-2680 v2 x 2 sockets (IvyBridge-EP, 2.8 GHz, 10 cores)
Memory: DDR3-1600 128 GB / DDR3-1866 128 GB
GPU: NVIDIA M2090 x 4 / NVIDIA K20X x 4
# of nodes (racks): 268 (26) / 64 (10)
Interconnect: Mellanox InfiniBand QDR x 2 (ConnectX-3) / Mellanox InfiniBand QDR x 2 + PEACH2
Peak perf.: 802 TFLOPS / 364 TFLOPS
Power: 408 kW / 99.3 kW
TOP500 rank: #41 (Jun. 2012) / #134 (Nov. 2013)
HA-PACS/TCA (Compute Node)
[Block diagram; red parts indicate components upgraded from the base cluster to TCA]
• CPU: 2 x Ivy Bridge, 2.8 GHz x 8 flop/clock (AVX): 22.4 GFLOPS x 20 cores = 448.0 GFLOPS
• Memory: DDR3-1866, 4 channels/socket at 59.7 GB/s; (16 GB, 14.9 GB/s) x 8 = 128 GB, 119.4 GB/s
• GPU: 4 x NVIDIA K20X on PCIe Gen2 x16 (8 GB/s each): 1.31 TFLOPS x 4 = 5.24 TFLOPS; GPU memory (6 GB, 250 GB/s) x 4 = 24 GB, 1 TB/s
• PEACH2 board (proprietary interconnect for TCA) and legacy devices attached via PCIe Gen2 x8
• Node total: 5.688 TFLOPS
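The per-node and TCA-partition peak figures above follow directly from the quoted clock rates and device counts; the short Python sketch below simply redoes that arithmetic with the slide's numbers as inputs.

```python
# Peak-performance arithmetic for one HA-PACS/TCA node, using the figures
# quoted on the slides (a checking sketch, not vendor data).

CPU_CLOCK_GHZ = 2.8        # Xeon E5-2680 v2
FLOPS_PER_CLOCK = 8        # double precision with AVX
CORES_PER_NODE = 20        # 10 cores/socket x 2 sockets
GPU_PEAK_TFLOPS = 1.31     # NVIDIA K20X, as listed on the slide
GPUS_PER_NODE = 4
TCA_NODES = 64

cpu_peak_gflops = CPU_CLOCK_GHZ * FLOPS_PER_CLOCK * CORES_PER_NODE  # 448.0 GFLOPS
gpu_peak_tflops = GPU_PEAK_TFLOPS * GPUS_PER_NODE                   # 5.24 TFLOPS
node_peak_tflops = gpu_peak_tflops + cpu_peak_gflops / 1000.0       # 5.688 TFLOPS
tca_peak_tflops = node_peak_tflops * TCA_NODES                      # ~364 TFLOPS

print(f"node peak: {node_peak_tflops:.3f} TFLOPS")
print(f"TCA peak:  {tca_peak_tflops:.1f} TFLOPS")
```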
HA-PACS Base Cluster
HA-PACS/TCA (Nov. 2013) + Base Cluster
[Photo: the base cluster and the TCA extension]
• TCA LINPACK: 277 TFLOPS (efficiency 76%)
• 3.52 GFLOPS/W, ranked #3 on the Green500 list (Nov. 2013)
Communication on TCA Architecture
[Diagram: two nodes, each with CPU, CPU memory, PCIe switch, GPUs, and GPU memory; the PEACH2 chips in the two nodes are connected directly over PCIe]
• PCIe is used as the communication link between accelerators across nodes
• Direct device-to-device (P2P) communication is available through PCIe
• PEACH2: PCI Express Adaptive Communication Hub ver. 2
  - Implementation of the interface and data transfer engine for TCA
PEACH2 board (production version for HA-PACS/TCA)
[Photo: main board + sub board]
• Most parts operate at 250 MHz (the PCIe Gen2 logic runs at 250 MHz)
• FPGA: Altera Stratix IV 530GX
• DDR3-SDRAM; power supply for various voltages
• PCI Express x8 card edge; PCIe x16 and x8 cable connectors
Mission of CCS
• Main mission
  - Research on advanced computational sciences, including particle physics, astrophysics, material science, climate, biology, and computer science (HPC)
• Secondary mission
  - CCS is organized as a nation-wide supercomputer center in Japan to support users in advanced computational sciences and to provide supercomputer resources to them
  - Most of the resources are dedicated to users free of charge, based on their scientific proposals and the judgement of a proposal review board, in every fiscal year (April to March)
  - Some resources are also provided to user groups with their own budgets, on a request basis
  - Even users from CCS must apply with a proposal and be assigned a CPU budget approved by the committee
Utilization of HA-PACS
• “Interdisciplinary Advanced Computational Science” program (FIXED-BUDGET)
  - Proposal-based and free of charge
  - Approx. 80% of the system resources are dedicated to it
  - Each accepted project is assigned a “node budget” for the year => users consume the total yearly budget accumulatively
• “Large Scale Scientific Use” program (FIXED-NODE)
  - Charged utilization, but still scientific use only
  - The remaining part of the system is dedicated to it
  - Each accepted project is assigned “nodes x months” => users can use the assigned number of nodes every month (the two accounting models are sketched below)
• Problem
  - How to mix these two categories of utilization efficiently
  - How to keep the system utilization ratio high
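To make the two accounting models concrete, a small Python sketch follows; the class names and all numbers are invented examples, not actual CCS allocation rules.

```python
# Toy accounting for the two programs described above.
# FIXED-BUDGET: a yearly node-hour budget, consumed accumulatively.
# FIXED-NODE: a fixed number of nodes available every month.
# All figures are invented for illustration.

class FixedBudgetProject:
    def __init__(self, yearly_node_hours):
        self.remaining = yearly_node_hours

    def charge(self, nodes, hours):
        """Consume the yearly budget; jobs stop once it is exhausted."""
        cost = nodes * hours
        if cost > self.remaining:
            raise RuntimeError("yearly node budget exhausted")
        self.remaining -= cost

class FixedNodeProject:
    def __init__(self, nodes_per_month):
        self.nodes_per_month = nodes_per_month  # renewed every month

    def nodes_available(self):
        return self.nodes_per_month

p1 = FixedBudgetProject(yearly_node_hours=200_000)
p1.charge(nodes=64, hours=24)   # one 24-hour, 64-node job
print(p1.remaining)             # 198464 node-hours left for the year
p2 = FixedNodeProject(nodes_per_month=16)
print(p2.nodes_available())     # 16 nodes every month, regardless of past use
```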
Assumption and Solution
• Assumption
  - Since the system is for large scale advanced computational sciences, we accept jobs up to full-system size => job size varies over a very wide range
  - Every job is assumed to run in parallel (MPI + OpenMP + CUDA)
  - Long-running jobs are allowed: up to 24 hours
• Solution to maximize system utilization (see the sketch below)
  - We do not create an individual partition for FIXED-NODE projects; all nodes are shared by both FIXED-BUDGET and FIXED-NODE projects
  - FIXED-NODE projects are always assigned very high priority
  - A fair-share policy is applied to FIXED-BUDGET projects
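As an illustration only (not the actual PBS Pro configuration used on HA-PACS), here is a minimal Python sketch of the dispatch ordering described above: FIXED-NODE jobs always sort first, while FIXED-BUDGET jobs are ordered by their project's accumulated usage, least-served first. The project names, node counts, and usage figures are invented.

```python
# Toy model of the HA-PACS dispatch order described on this slide.
# FIXED-NODE jobs get absolute priority; FIXED-BUDGET jobs are ordered
# by a fair-share metric (accumulated node-hours of their project).
# All project names and numbers below are made up for illustration.

from dataclasses import dataclass

@dataclass
class Job:
    project: str
    category: str            # "FIXED-NODE" or "FIXED-BUDGET"
    nodes: int
    used_node_hours: float   # project's accumulated usage (fair-share input)

def dispatch_order(jobs):
    """Sort queued jobs: FIXED-NODE first, then FIXED-BUDGET by lowest usage."""
    return sorted(
        jobs,
        key=lambda j: (j.category != "FIXED-NODE", j.used_node_hours),
    )

queue = [
    Job("astro-A", "FIXED-BUDGET", 64,  used_node_hours=120_000.0),
    Job("qcd-B",   "FIXED-NODE",   32,  used_node_hours=30_000.0),
    Job("bio-C",   "FIXED-BUDGET", 128, used_node_hours=5_000.0),
]

for job in dispatch_order(queue):
    print(job.project, job.category, job.nodes)
# qcd-B runs first (FIXED-NODE), then bio-C (least usage), then astro-A
```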
Actual job execution situation
• FIXED-NODE projects may have to wait for resources in some cases, up to 24 hours in the worst case (this is included in the contract of supercomputer usage)
• Since there is no separate partition for FIXED-NODE projects, FIXED-BUDGET projects can use the resources while FIXED-NODE projects are not using the system
• Fair-share for FIXED-BUDGET projects is important because there are roughly two kinds of users:
  (1) jump-starters: run their jobs steadily from the beginning of the yearly program
  (2) slow-starters: do not use their budget in the first half of the yearly program, then run their jobs heavily in the latter half
• With fair-share, jump-starters and slow-starters can coexist peacefully every year
Effect of mixing FIXED-NODE and FIXED-BUDGET
[Figure: with the originally assigned ratio of FIXED-BUDGET to FIXED-NODE resources, low utilization by FIXED-NODE projects leaves a lost part of the system; sharing all nodes recovers it toward full system utilization]
• We ask FIXED-NODE users to wait a moment => greatly improved system utilization
HA-PACS system utilization statistics
[Chart: operation history from Feb. 2012 to Aug. 2015; green = operation rate, red = utilization rate; annotated phases: SGE operation (base only), TCA extension, non-fair-share, fair-share working correctly]
• 02/2012-09/2012: test operation
• 10/2012-09/2013: SGE-based operation with a wrong fair-share setting + no FIXED-NODE
• 10/2013-11/2013: TCA part extension (system stopped)
• 12/2014: problem with the wrong fair-share setting found => fixed around 04/2015
• 04/2015: FIXED-NODE operation starts
• After the correction of the fair-share setting, user complaints dropped drastically
Our experience with PBS Pro
• The basic features are very good, provided we configure the system correctly
• We run the system in a node-by-node manner (not core-by-core), so some features are redundant => this caused some mistakes in parameter settings
• The many options sometimes confuse the system manager (or even the system vendor)
• A “system consistency check” function in the scheduler would be desirable (a small sketch of the idea follows this slide)
  - e.g., if two setting parameters are inconsistent, issue a warning to the system engineer
• A “REPLAY” feature would also be desirable
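A minimal Python sketch of the kind of consistency check meant here, run over a made-up parameter dictionary; the parameter names and rules are hypothetical, not real PBS Pro settings.

```python
# Sketch of a "system consistency check": scan a configuration for pairs of
# settings that contradict each other and warn the administrator.
# The parameter names and rules are invented for illustration only.

config = {
    "node_allocation": "exclusive",    # hypothetical: whole nodes per job
    "share_cores_between_jobs": True,  # hypothetical: contradicts the above
    "max_walltime_hours": 24,
}

# Each rule: (description, predicate that is True when the config is inconsistent)
rules = [
    ("exclusive node allocation contradicts core sharing between jobs",
     lambda c: c["node_allocation"] == "exclusive"
               and c["share_cores_between_jobs"]),
    ("walltime limit must be positive",
     lambda c: c["max_walltime_hours"] <= 0),
]

for description, is_inconsistent in rules:
    if is_inconsistent(config):
        print(f"WARNING: {description}")
```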
Idea of “REPLAY”
• In such a complicated operation situation, we would like to test several parameter sets or scheduling policies
• The system has a complete log of past executions
• If the scheduler could simulate (REPLAY) a new set of parameters and policies against that past history, it would greatly help system utilization analysis (a sketch of the idea follows)
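One way to picture the requested REPLAY feature is the minimal Python sketch below: feed a recorded job log through a candidate policy and report the utilization it would have achieved. The log format, the greedy policy, and the numbers are all hypothetical; this is not an existing PBS Pro interface.

```python
# Sketch of the "REPLAY" idea: re-run a recorded job log under a candidate
# scheduling policy and report the node utilization it would have achieved.
# Log format, policy, and figures are hypothetical.

TOTAL_NODES = 332          # base cluster (268) + TCA (64)

# Each record: (submit_hour, nodes_requested, run_hours) taken from a past log.
job_log = [
    (0, 64, 12.0),
    (1, 128, 6.0),
    (2, 332, 24.0),
]

def replay(log, total_nodes, horizon_hours):
    """Greedy first-come-first-served replay (toy model: one job at a time);
    swap in other policies here to compare their utilization."""
    free_at = 0.0
    busy_node_hours = 0.0
    for submit, nodes, run in sorted(log):
        start = max(submit, free_at)
        free_at = start + run
        busy_node_hours += nodes * run
    return busy_node_hours / (total_nodes * horizon_hours)

print(f"utilization: {replay(job_log, TOTAL_NODES, horizon_hours=48):.1%}")
```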
Summary
• At CCS, University of Tsukuba, we run the large scale GPU cluster HA-PACS under the PBS Pro scheduler
• Our operational assumptions are complicated, and we need to set the scheduling policy and parameters very carefully
• The current scheduling policy forces the paying users (FIXED-NODE) to wait for a moment before job execution, but it greatly improves the system utilization rate
• Some more sophisticated features to support the operation policy are desired