Resource Management Issues in Large-Scale Cluster

Yutaka Ishikawa [email protected]
Computer Science Department / Information Technology Center
The University of Tokyo
http://www.il.is.s.u-tokyo.ac.jp/
http://www.itc.u-tokyo.ac.jp

2007/11/2 The University of Tokyo
Outline
• Jittering
• Memory Affinity
• Power Management
• Bottleneck Resource Management
Issues
• Jittering Problem
– The execution of a parallel application is disturbed by system processes running independently on each node. This delays global operations such as allreduce.
References:
• Terry Jones, William Tuel, Brian Maskell, “Improving the Scalability of Parallel Jobs by adding Parallel Awareness to the Operating System,” SC2003.
• Fabrizio Petrini, Darren J. Kerbyson, Scott Pakin, “The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q,” SC2003.
(Figure: execution timelines of processes #0–#3; uncoordinated system noise on each node delays the collective operation)
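The delay mechanism can be illustrated with a toy Python simulation (not the authors' code; all parameters are made up): an allreduce-like step finishes only when the slowest node arrives, so independent per-node daemon noise inflates almost every step, while noise synchronized across all nodes, as gang scheduling arranges, costs only one interruption per affected step.

```python
import random

def allreduce_time(n_nodes, n_steps, compute=1.0, noise=0.05,
                   p_noise=0.2, synchronized=False, seed=0):
    """Toy model: each allreduce step ends when the slowest node arrives."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_steps):
        if synchronized:
            # Gang-scheduled noise: all nodes are interrupted together.
            total += compute + (noise if rng.random() < p_noise else 0.0)
        else:
            # Independent noise: the collective waits for the slowest node,
            # and with many nodes some node is almost always interrupted.
            total += compute + max(
                noise if rng.random() < p_noise else 0.0
                for _ in range(n_nodes))
    return total

ideal = allreduce_time(1024, 1000, p_noise=0.0)        # no noise at all
jittered = allreduce_time(1024, 1000)                  # independent noise
gang = allreduce_time(1024, 1000, synchronized=True)   # synchronized noise
```

With 1024 nodes and a 20 % per-step interruption probability, nearly every unsynchronized step pays the noise penalty, while the synchronized run pays it only about one step in five.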
Jittering Problem
• Our Approach
– Clusters usually have two types of networks:
• Network for computing
• Network for management
– The management network is used to deliver a global clock:
• The interval timer is turned off
• A broadcast packet is sent by the global clock generator
– Gang scheduling is employed for all system and application processes
(Figure: network for computing, e.g., Myrinet or InfiniBand; network for management, e.g., Gigabit Ethernet; a global clock generator attached to the management network)
Jittering Problem
• Preliminary Experience
– The management network is used to deliver the global clock
– The interval timer is turned off
– On each arrival of the special broadcast packet, the tick counter is updated (the kernel code has been modified)
– No cluster daemons, such as the batch scheduler or information daemon, are running, but system daemons are running
CPU: AMD Opteron 275 (2.2 GHz)
Memory: 2 GB
Network: Myri-10G; BCM5721 Gigabit Ethernet
# of Hosts: 16
Kernel: Linux 2.6.18 x86_64 (modified)
MPI: mpich-mx 1.2.6
MX: MX Version 1.2.0
Daemons: syslog, portmap, sshd, sysstat, netfs, nfslock, autofs, acpid, mx, ypbind, rpcgssd, rpcidmapd, network
Preliminary Global Clock Experience
NAS Parallel Benchmark MG
(Figure: elapsed time in seconds of 20 executions, sorted; + = no global clock, × = global clock)
Preliminary Global Clock Experience
NAS Parallel Benchmark FT
(Figure: elapsed time in seconds; + = no global clock, × = global clock)
Preliminary Global Clock Experience
NAS Parallel Benchmark CG
(Figure: elapsed time in seconds; + = no global clock, × = global clock)
What kinds of heavy daemons run in a cluster?
• Batch job system
– In the case of Torque:
– Every 1 second, the daemon runs for 50 microseconds
– Every 45 seconds, the daemon runs for about 8 milliseconds
• Monitoring system
– Not yet measured
• Simple Formulation
In the case of a 1000-node cluster: 0.000050 × 1000/1 + 0.008 × 1000/45 = 22.8 %
Worst-Case Overhead = Σ_i MIN(TI_i, TR_i × N) / TI_i

N: number of nodes
TI_i: interval time of daemon i
TR_i: running time of daemon i

(Figure: timeline t showing one interval TI containing a running burst TR)

The worst case might never happen!
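The formulation above can be sketched as a small function; with the Torque figures from the previous slide it reproduces the 22.8 % estimate for a 1000-node cluster.

```python
def worst_case_overhead(daemons, n_nodes):
    """Overhead = sum_i min(TI_i, TR_i * N) / TI_i,
    where TI_i is daemon i's wakeup interval and TR_i its running time.
    The min() caps each term: delays cannot exceed the whole interval."""
    return sum(min(ti, tr * n_nodes) / ti for ti, tr in daemons)

# Torque from the slide: 50 us every 1 s, plus about 8 ms every 45 s.
torque = [(1.0, 50e-6), (45.0, 8e-3)]
overhead = worst_case_overhead(torque, n_nodes=1000)  # ~0.228, i.e., 22.8 %
```

The worst case assumes every node's daemon run delays the collective back-to-back, which is exactly what gang scheduling or a global clock prevents.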
Issues on NUMA
• Memory affinity in NUMA
– CPU ↔ memory
– Network ↔ memory
• An example of network and memory configuration:
(Figure: two NUMA nodes, Node 0 and Node 1; each has two dual-core CPUs with their own memory, and NICs attached via NFP3600 and NFP3050 chipsets; memory is "near" or "far" depending on which CPU and NIC are involved)
Memory Location and Communication
(Figure: placements of processors (P), memory (M), chipsets (C), and NICs (N); communication paths differ for near and far memory)
Note: The result depends on the BIOS settings.
• Communication performance depends on data location.
• Data is also accessed by the CPU.
• The location of data should be determined based on both the CPU and the network location.
• Is a dynamic data-migration mechanism needed?
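On Linux, one way to act on this is to pin a process to the cores of the socket nearest the NIC it communicates through. Below is a minimal sketch using the standard-library `os.sched_setaffinity`; the choice of cores 0–1 as the "near" socket is an assumption about one particular board layout, not taken from the slides.

```python
import os

# Hypothetical layout (an assumption, not from the slides): cores 0 and 1
# sit on the socket whose memory is "near" the NIC this process uses.
NEAR_CORES = {0, 1}

def choose_near_cores(available, near=NEAR_CORES):
    """Pick the cores near the NIC, falling back to all available cores."""
    return (near & set(available)) or set(available)

def pin_to_near_socket():
    """Restrict this process to the near cores (Linux-only syscalls)."""
    if not hasattr(os, "sched_setaffinity"):
        return None  # affinity syscalls are unavailable on this platform
    target = choose_near_cores(os.sched_getaffinity(0))
    os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)
```

Pinning only fixes the CPU side; keeping the data in that socket's memory additionally requires a NUMA-aware allocation policy (e.g., first-touch after pinning).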
Power Management
• A 100 Tflops cluster machine
– 1666 nodes
• If machine resource utilization is 80 % (332 nodes are idle):
– 66 kW of power is wasted by the idle nodes
• $55K (6.6 million yen) per year
• This is an underestimate, because the memory size is small and no network switches are included
– 10.6 kW of power is wasted even though the machines are turned off!!
• $9K (1.1 million yen) per year
Power Consumption Issue

Measured node: Supermicro AS-2021-M-UR+V, Opteron 2347 × 2 (Barcelona, 1.9 GHz, 60.8 Gflops), 4 GB memory, InfiniBand HCA × 2, Fedora Core 7. Power consumption of a single node, measured with a FLUKE 105B digital ammeter.

Power consumption (Amps):
HPL running (not optimized): 2.92
Idle (1.9 GHz): 2.44
Idle (1.0 GHz): 2.02
Suspended: 1.61
No power, but power cable plugged in (BMC running): 0.32
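The waste figures on the previous slide can be roughly reproduced from these ammeter readings, assuming 100 V mains (Japan) and an electricity price of about $0.095/kWh; both values are assumptions, not stated on the slides.

```python
MAINS_VOLT = 100.0               # assumed: Japanese mains voltage
IDLE_AMP, OFF_AMP = 2.02, 0.32   # from the ammeter table above
IDLE_NODES = 332                 # 20 % of 1666 nodes
HOURS_PER_YEAR = 24 * 365
PRICE_PER_KWH = 0.095            # assumed USD price implied by the slide's $55K

def wasted_kw(amps, nodes, volts=MAINS_VOLT):
    """Aggregate power draw of `nodes` machines, each drawing `amps`."""
    return amps * volts * nodes / 1000.0

idle_kw = wasted_kw(IDLE_AMP, IDLE_NODES)   # ~67 kW (slide says 66 kW)
off_kw = wasted_kw(OFF_AMP, IDLE_NODES)     # ~10.6 kW, matching the slide
idle_cost = idle_kw * HOURS_PER_YEAR * PRICE_PER_KWH  # ~$56K/year
```

The close match between 0.32 A × 100 V × 332 nodes and the slide's 10.6 kW suggests the "power off" draw is exactly the BMC standby current.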
Power Management
• Cooperating with the batch job system
– Idle machines are turned off
– When those machines are needed, they are turned on using the IPMI (Intelligent Platform Management Interface) protocol via the BMC
– However, each powered-off machine still draws about 300 mA (for the BMC)
• Quick shutdown/restart and synchronization mechanisms are needed
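The turn-on step could be driven by `ipmitool` (a real CLI; the host name and credentials below are placeholders). Building the command as a list lets it be inspected and tested without a reachable BMC.

```python
import subprocess

def ipmi_power_cmd(bmc_host, user, password, action):
    """Build an ipmitool chassis power command ('on', 'off', or 'status')."""
    assert action in ("on", "off", "status")
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host,
            "-U", user, "-P", password, "chassis", "power", action]

def set_node_power(bmc_host, user, password, action):
    """Run the command against a node's BMC (requires a reachable BMC)."""
    return subprocess.run(ipmi_power_cmd(bmc_host, user, password, action),
                          check=True)

# Placeholder host and credentials, for illustration only.
cmd = ipmi_power_cmd("node042-bmc", "admin", "secret", "on")
```

A batch scheduler would call `set_node_power(..., "on")` when dispatching a job to a powered-off node, then wait for the node to rejoin the resource pool.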
(Figure: batch job system timeline — JOB1 and JOB2 are running; when nodes become idle they are turned off; when JOB3 is submitted, the needed nodes are turned on, put back in service, and JOB3 is dispatched and runs while JOB2 continues)
Bottleneck Resource Management
• What are bottleneck resources?
– A cluster machine has many resources, while some shared resources are limited.
– When the whole cluster accesses such a resource, overloading or congestion happens.
• Examples:
(Figure: N nodes with 10 GB/sec links on the cluster side converge on a single 10 GB/sec link to the Internet)
– Internet
• We have been focusing on bottleneck links in GridMPI
– Global file system
• From the file system's viewpoint, N file operations are performed independently, where N is the number of nodes
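For the file-system example, one simple mitigation is to cap how many of the N nodes' operations run concurrently. Below is a toy sketch using a semaphore; the limit of 4 is arbitrary and the "file operation" is a stub.

```python
import threading

MAX_CONCURRENT = 4                      # arbitrary cap, not from the slides
_gate = threading.Semaphore(MAX_CONCURRENT)
_lock = threading.Lock()
in_flight = 0
peak = 0

def throttled_file_op(node_id):
    """One node's 'file operation', gated by the semaphore."""
    global in_flight, peak
    with _gate:                         # at most MAX_CONCURRENT holders
        with _lock:
            in_flight += 1
            peak = max(peak, in_flight)
        # ... the real file-system access would go here ...
        with _lock:
            in_flight -= 1

# 64 "nodes" all issue an operation at once; only 4 proceed at a time.
threads = [threading.Thread(target=throttled_file_op, args=(i,))
           for i in range(64)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In a real cluster the gate would live in a central I/O forwarding layer rather than in each client, but the congestion-control idea is the same.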