Toward a practical “HPC Cloud”: Performance tuning of a virtualized HPC cluster

SC2011@Seattle, Nov.15 2011

Ryousei Takano
Information Technology Research Institute, National Institute of Advanced Industrial Science and Technology (AIST), Japan


DESCRIPTION

AIST booth presentation slides at SC11.

TRANSCRIPT

Page 1: Toward a practical “HPC Cloud”: Performance tuning of a virtualized HPC cluster

Toward a practical “HPC Cloud”: Performance tuning of a virtualized HPC cluster

SC2011@Seattle, Nov.15 2011

Ryousei Takano

Information Technology Research Institute, National Institute of Advanced Industrial Science and Technology (AIST), Japan

Page 2

Outline

•  What is HPC Cloud?
•  Performance tuning method for HPC Cloud
  –  PCI passthrough
  –  NUMA affinity
  –  VMM noise reduction

•  Performance evaluation


Page 3

HPC Cloud

HPC Cloud utilizes cloud resources for High Performance Computing (HPC) applications.


[Figure: a physical cluster hosting virtualized clusters. Users request resources according to their needs, and the provider allocates each user a dedicated virtual cluster on demand.]

Page 4

HPC Cloud (cont’d)

•  Pros:
  –  User side: easy deployment
  –  Provider side: high resource utilization
•  Cons:
  –  Performance degradation?


A performance tuning method for virtualized environments has not been established.

Page 5

Current HPC Cloud: its performance is poor and unstable.

“True” HPC Cloud: its performance approaches that of bare metal.

Toward a practical HPC Cloud:
•  Use PCI passthrough
•  Set NUMA affinity
•  Reduce VMM noise (not completed)

[Figure: miniature versions of the two diagrams used later: the PCI passthrough diagram (VM1, NIC, VMM, physical driver, guest OS) and the KVM scheduling diagram (a VM as a QEMU process on the Linux kernel with KVM; VCPU threads carry the guest OS threads onto physical CPUs in a CPU socket), annotated “to reduce the overhead of interrupt virtualization” and “to disable unnecessary services on the host OS (e.g., ksmd)”.]
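The slides name ksmd as an example of host-side noise but do not list the commands used; the following is a minimal sketch, assuming a stock Debian/KVM host like the one used in the evaluation. The services other than ksmd are illustrative candidates, not taken from the slides.

    # Stop KSM's page-scanning daemon (ksmd) so it no longer wakes up on the
    # host; writing 0 stops scanning while leaving already-merged pages alone.
    echo 0 > /sys/kernel/mm/ksm/run

    # Illustrative examples of other host daemons one might stop while a
    # dedicated HPC guest is running (site-specific; not listed in the slides).
    /etc/init.d/cron stop
    /etc/init.d/irqbalance stop

Which services can safely be stopped depends on the host's role; the slides mark VMM noise reduction as not completed.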

Page 6

PCI passthrough

[Figure: three I/O virtualization models (VM1, VM2, VMM, NIC). I/O emulation: the guest driver goes through a virtual switch (vSwitch) and the physical driver inside the VMM. PCI passthrough: the physical driver runs inside the guest OS and drives the NIC directly. SR-IOV: the guests drive the NIC directly, and a switch (VEB) embedded in the NIC shares it among them. A table compares the three models in terms of VM sharing and performance.]
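The slides do not show how the InfiniBand HCA is handed to the guest; the sketch below is one way to do it on a Linux/KVM host, using the modern VFIO path (the 2011-era setup would have used the legacy pci-stub/pci-assign mechanism instead). The PCI address, vendor/device IDs, and disk image are hypothetical, and the host must have the IOMMU enabled (e.g., intel_iommu=on on the kernel command line).

    # Hypothetical PCI address of the Mellanox ConnectX HCA; find the real one
    # with `lspci | grep Mellanox` and its vendor/device IDs with `lspci -n`.
    BDF=0000:0b:00.0

    # Detach the HCA from its host driver and let vfio-pci claim it.
    modprobe vfio-pci
    echo "$BDF" > /sys/bus/pci/devices/$BDF/driver/unbind
    echo "VVVV DDDD" > /sys/bus/pci/drivers/vfio-pci/new_id   # vendor/device IDs from lspci -n

    # Boot the guest with the HCA passed through (8 VCPUs / 45 GB as in the
    # VM environment on the Evaluation slide; guest.img is a placeholder).
    qemu-system-x86_64 -enable-kvm -smp 8 -m 45G \
      -device vfio-pci,host=$BDF \
      -drive file=guest.img,if=virtio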

Page 7

Virtual CPU scheduling

[Figure: virtual CPU scheduling in KVM vs. Xen. Bare Metal / KVM: a VM is a QEMU process on the Linux kernel with KVM; the process scheduler maps its VCPU threads (V0–V3), which carry the guest OS threads, onto the physical CPUs (P0–P3) of a CPU socket. Xen: the Xen hypervisor's domain scheduler maps the VCPUs of the guest (DomU) and of Dom0 onto the physical CPUs. In both cases a virtual machine runs on a Virtual Machine Monitor (VMM) on the hardware.]

A guest OS cannot run numactl (the host's NUMA topology is not visible from inside the VM).
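As a quick check of this point (not shown in the slides), one can compare what numactl reports on the host and inside the guest:

    # On the host: the dual-socket Xeon E5540 node reports two NUMA nodes.
    numactl --hardware

    # Inside the guest: typically a single node (or no NUMA topology at all)
    # is reported, so NUMA placement has to be handled on the host side.
    numactl --hardware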

Page 8

NUMA affinity

[Figure: NUMA affinity on bare metal vs. KVM. Bare Metal (Linux): numactl binds the application threads and their memory to a CPU socket. KVM: the VCPU threads of the QEMU process are pinned to physical CPUs (Vn = Pn) and the guest threads are bound to the corresponding virtual socket, using taskset and numactl.]
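The slides name taskset and numactl but do not give the exact invocations; the sketch below shows one way to realize Vn = Pn on the host and the socket binding in the guest. The thread IDs and the CPU-to-socket numbering are illustrative and should be checked with the QEMU monitor and numactl --hardware.

    # On bare metal, the binding is done directly with numactl, e.g.
    # confining an application's threads and memory to NUMA node 0:
    numactl --cpunodebind=0 --membind=0 ./app

    # On the KVM host: pin each VCPU thread of the QEMU process to one
    # physical CPU (Vn = Pn). The TIDs below are illustrative; they can be
    # read from the QEMU monitor (`info cpus`) or from /proc/<qemu-pid>/task/.
    taskset -p -c 0 12340   # VCPU0 -> physical CPU 0
    taskset -p -c 1 12341   # VCPU1 -> physical CPU 1
    # ... and so on for the remaining VCPUs ...
    taskset -p -c 7 12347   # VCPU7 -> physical CPU 7

    # Inside the guest: bind the application threads to one virtual socket,
    # mirroring what numactl does on bare metal (assuming virtual CPUs 0-3
    # correspond to one socket).
    taskset -c 0-3 ./app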

Page 9

Evaluation

Evaluation of HPC applications on a 16-node cluster (part of the AIST Green Cloud Cluster)

Compute node (Dell PowerEdge M610):
  CPU: Intel quad-core Xeon E5540 / 2.53 GHz x2
  Chipset: Intel 5520
  Memory: 48 GB DDR3
  InfiniBand: Mellanox ConnectX (MT26428)

Blade switch:
  InfiniBand: Mellanox M3601Q (QDR, 16 ports)

Host machine environment:
  OS: Debian 6.0.1
  Linux kernel: 2.6.32-5-amd64
  KVM: 0.12.50
  Compiler: gcc/gfortran 4.4.5
  MPI: Open MPI 1.4.2

VM environment:
  VCPU: 8
  Memory: 45 GB

Page 10

MPI Point-to-Point communication performance

[Figure: bandwidth [MB/sec] vs. message size [byte] from 1 byte to 1 GB, Bare Metal vs. KVM; higher is better.]

With PCI passthrough, MPI communication throughput is close to that of the bare metal machines.

Bare Metal: non-virtualized cluster

Page 11

NUMA affinity

Execution time on a single node: NPB multi-zone (Computational Fluid Dynamics) and Bloss (non-linear eigensolver). Numbers in parentheses are relative to Bare Metal.

                SP-MZ [sec]     BT-MZ [sec]     Bloss [min]
Bare Metal      94.41 (1.00)    138.01 (1.00)   21.02 (1.00)
KVM             104.57 (1.11)   141.69 (1.03)   22.12 (1.05)
KVM (w/ bind)   96.14 (1.02)    139.32 (1.01)   21.28 (1.01)

NUMA affinity is an important performance factor not only on bare metal machines but also on virtual machines.

Page 12

NPB BT-MZ: Parallel efficiency

[Figure: performance [Gop/s total] and parallel efficiency [%] vs. number of nodes (1, 2, 4, 8, 16) for Bare Metal, KVM, and Amazon EC2; higher is better.]

Degradation of PE: KVM 2%, EC2 14%
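The slides do not spell out how parallel efficiency (PE) is computed; presumably it is the usual ratio of measured to ideal scaling, something like

    PE(N) = \frac{\mathrm{Perf}(N)}{N \cdot \mathrm{Perf}(1)} \times 100\,[\%]

where Perf(N) is the total performance (Gop/s) on N nodes; the 2% and 14% figures above would then be the drop in PE relative to Bare Metal.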

Page 13

Bloss: Parallel efficiency

[Figure: parallel efficiency [%] vs. number of nodes (1, 2, 4, 8, 16) for Bare Metal, KVM, and Amazon EC2, with the ideal efficiency shown for reference.]

Degradation of PE: KVM 8%, EC2 22%

Bloss: non-linear internal eigensolver
  –  Hierarchical parallel program using MPI and OpenMP

Overhead of communication and virtualization

Page 14

Summary

HPC Cloud is promising!
•  The performance of coarse-grained parallel applications is comparable to that of bare metal machines
•  We plan to operate a private cloud service, “AIST Cloud”, for HPC users
•  Open issues
  –  VMM noise reduction
  –  VMM-bypass device-aware VM scheduling
  –  Live migration with VMM-bypass devices


Page 15

LINPACK Efficiency

[Figure: LINPACK efficiency (%) vs. TOP500 rank for the June 2011 TOP500 list, with systems grouped by interconnect (InfiniBand, Gigabit Ethernet, 10 Gigabit Ethernet); GPGPU machines and the Amazon EC2 cluster compute instances (ranked #451) are marked.]

※ Efficiency = (maximum LINPACK performance, Rmax) / (theoretical peak performance, Rpeak)

InfiniBand: 79%    Gigabit Ethernet: 54%    10 Gigabit Ethernet: 74%

Virtualization causes the performance degradation!

Page 16

Bloss: Parallel efficiency

[Figure: parallel efficiency [%] vs. number of nodes (1, 2, 4, 8, 16) for Bare Metal, KVM, KVM (w/ bind), and Amazon EC2, with the ideal efficiency shown for reference.]

Binding threads to physical CPUs can be sensitive to VMM noise and degrade performance.

Bloss: non-linear internal eigensolver
  –  Hierarchical parallel program using MPI and OpenMP