Technology Evaluation at CSCS, including the BeeGFS Parallel Filesystem
TRANSCRIPT
Agenda
• CSCS
• About the Systems Integration (SI) Unit
• Technology Overview
  • DDN IME
  • DDN WOS
  • OpenStack
• BeeGFS Case Study
  • What is BeeGFS?
  • Test System Layout
  • Tuning
  • Monitoring
  • Benchmark tools
  • Results
  • Next Steps
  • Monitoring and Profiling
• Q&A
CSCS (Swiss National Supercomputing Centre)
• Founded in 1991
• Enables world-class research with a scientific user lab
• Available to domestic and international researchers through a transparent, peer-reviewed allocation process
• Open to academia and also available to users from industry and the business sector
• Operated by ETH Zurich and located in Lugano
24 years of supercomputers at CSCS
1991: NEC SX3, 5.5 GF (Adula)
1996: NEC SX4, 10 GF (Gottardo)
1999: NEC SX5, 64 GF (Prometeo)
2002: IBM SP4, 1.3 TF (Venus)
2005: Cray XT3, 5.8 TF (Palu)
2006: IBM P5, 4.5 TF (Blanc)
2009-12: Cray XE6, 402 TF (Monte Rosa)
2012-13: Cray XC30, 7.7 PF (Piz Daint)
2014: XC30, 1.25 PF (Piz Daint extension)
Data Centre
- 2000 sq.m machine room
- 20 MW of power and cooling capacity
- Lake water cooling (700 litres/s)
Overview of Systems Integration (SI) Unit
Unit missions:
- Managing projects
- Relations with vendors
- Evaluating technologies
- Software deployments
What is BeeGFS?
• Parallel filesystem
• HPC oriented
• Formerly called FhGFS
• Alternative to Lustre and GPFS
• Developed by Fraunhofer
• Open source
• Support delivered by ThinkParQ
[Architecture diagram; image courtesy of BeeGFS]
Basic Features of BeeGFS
• Supports failover for data and metadata using tools such as Pacemaker and Heartbeat
• Replication-based failover mechanism
• Supports multiple data and metadata servers and targets
• Supports quotas
• Robinhood can be used to scan the entire filesystem
• BeeGFS On Demand (BeeOND) filesystem
• Easy to deploy and manage (see the example below)
• Supports x86 and OpenPOWER platforms
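Day-to-day management is typically done with the beegfs-ctl tool. The following is a minimal sketch, not taken from the slides; the file and user names are placeholders, and option spellings should be checked against the installed BeeGFS version.

# Hypothetical management queries with beegfs-ctl (paths/users are placeholders)
beegfs-ctl --listnodes --nodetype=meta        # list metadata servers
beegfs-ctl --listnodes --nodetype=storage     # list storage servers
beegfs-ctl --getentryinfo /beegfs/somefile    # show the stripe pattern of a file
beegfs-ctl --getquota --uid someuser          # report quota usage for a user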
BeeOND
- Creates a filesystem on demand
- Uses the hard drives / SSDs of every compute node
- The filesystem is created by submitting a job to the scheduler (we are working on confirming SLURM support); see the sketch after this list
- Memory could be used instead of SSDs
- We used 20 SSDs on 20 nodes for our tests
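As a rough illustration of the workflow, a per-job BeeOND instance might be started and torn down along the following lines. This is a sketch only: the beeond wrapper ships with BeeGFS, but the node list, SSD path, mount point and exact flags are assumptions, not the CSCS configuration.

# Hypothetical job prolog/epilog; paths are placeholders
scontrol show hostnames "$SLURM_JOB_NODELIST" > /tmp/nodefile.$SLURM_JOB_ID
beeond start -n /tmp/nodefile.$SLURM_JOB_ID -d /local/ssd/beeond -c /mnt/beeond
# ... the job then reads and writes under /mnt/beeond ...
beeond stop -n /tmp/nodefile.$SLURM_JOB_ID -L -d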
Benefits of BeeOND
Benefits from otherwise unused local space
No impact on the parallel filesystem
Real utilization of the high-speed network
Filesystem scales with the number of compute nodes
Open point:
What is the overhead on the compute nodes?
Test System Layout
[Diagram: DDN 7700 couplet, 4 × FDR links, 2 × FDR links, dual-socket SB servers with 128 GB memory, fabric attachment via 1 × FDR link]
• One couplet (two controllers)
• Two x86 servers
• One enclosure with 60 drives
• 6 SSDs in one RAID volume
• 6 RAID 5 volumes of 9 drives each
Tuning the servers
# VM writeback and cache tuning
echo 5 > /proc/sys/vm/dirty_background_ratio
echo 20 > /proc/sys/vm/dirty_ratio
echo 50 > /proc/sys/vm/vfs_cache_pressure
echo 262144 > /proc/sys/vm/min_free_kbytes
echo always > /sys/kernel/mm/transparent_hugepage/enabled
echo always > /sys/kernel/mm/transparent_hugepage/defrag

# I/O scheduler and queue tuning for each data volume
for dev in dm-0 dm-1 dm-2 dm-3 dm-4 dm-5 dm-6; do
    echo deadline > /sys/block/$dev/queue/scheduler
    echo 4096 > /sys/block/$dev/queue/nr_requests
    echo 32768 > /sys/block/$dev/queue/read_ahead_kb
    echo 32767 > /sys/block/$dev/queue/max_sectors_kb
done

# CPU governor and NUMA reclaim
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
echo 1 > /proc/sys/vm/zone_reclaim_mode
Documentation for the tuned parameters:
https://www.kernel.org/doc/Documentation/sysctl/vm.txt
https://access.redhat.com/solutions/46111
http://www.slideshare.net/rampalliraj/linux-kernel-io-schedulers?from_action=save
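The vm.* values above can also be made persistent through sysctl. A minimal sketch follows; it is not part of the original tuning, the file name is an assumption, and the /sys/block and transparent_hugepage settings would still need a boot-time script.

# Hypothetical: persist the vm.* tunables via sysctl.d
cat > /etc/sysctl.d/90-beegfs-storage.conf <<'EOF'
vm.dirty_background_ratio = 5
vm.dirty_ratio = 20
vm.vfs_cache_pressure = 50
vm.min_free_kbytes = 262144
vm.zone_reclaim_mode = 1
EOF
sysctl --system    # reload all sysctl configuration files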
Benchmark tools
• mdtest, for measuring metadata performance
  https://sourceforge.net/projects/mdtest/
• IOzone, for measuring read and write throughput (sample invocations below)
  http://www.iozone.org
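For reference, invocations along the following lines produce the kinds of numbers shown on the next slides. The exact parameters used in the tests are not recorded here, so the directory, record size, file size and process counts below are illustrative only.

# Illustrative benchmark runs; paths and sizes are placeholders
srun -n 64 mdtest -d /beegfs/mdtest -n 1000 -i 3       # metadata: create/stat/remove
iozone -i 0 -i 1 -r 1m -s 4g -t 64 -+m clients.txt     # throughput: write/rewrite, read/reread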
IOzone results on /beegfs
Test running:
Children see throughput for 64 initial writers = 5032700.90 kB/sec
    Min throughput per process = 63754.09 kB/sec
    Max throughput per process = 103798.58 kB/sec
    Avg throughput per process = 78635.95 kB/sec
    Min xfer = 12880896.00 kB

Test running:
Children see throughput for 64 rewriters = 4996297.63 kB/sec
    Min throughput per process = 68781.82 kB/sec
    Max throughput per process = 90666.23 kB/sec
    Avg throughput per process = 78067.15 kB/sec
    Min xfer = 16473088.00 kB

Test running:
Children see throughput for 64 readers = 4225632.91 kB/sec
    Min throughput per process = 40047.24 kB/sec
    Max throughput per process = 77678.61 kB/sec
    Avg throughput per process = 66025.51 kB/sec
    Min xfer = 10813440.00 kB

Test running:
Children see throughput for 64 re-readers = 4253662.00 kB/sec
    Min throughput per process = 56998.73 kB/sec
    Max throughput per process = 76042.87 kB/sec
    Avg throughput per process = 66463.47 kB/sec
    Min xfer = 15729664.00 kB
Mdtest results on BeeOND
[Charts: directories per second vs. number of MDSs (1, 2, 4, 8, 16, 20) for directory creation, directory stat, and directory removal]
Mdtest results on BeeOND
[Charts: files per second vs. number of MDSs for file creation, file stat, and file removal]
Next steps
• Scaling on a bigger cluster
• Verifying the failover procedures
• Verifying the BeeOND overhead on the compute nodes
• Using NVMe instead of SSDs
• Using tmpfs
• Creating BeeOND filesystems through SLURM jobs
• Using Robinhood to scan millions of files