Technology Evaluation at CSCS, including the BeeGFS Parallel Filesystem
TRANSCRIPT
Agenda
• CSCS
• About the Systems Integration (SI) Unit
• Technology Overview
  • DDN IME
  • DDN WOS
  • OpenStack
• BeeGFS Case Study
  • What is BeeGFS?
  • Test System Layout
  • Tuning
  • Monitoring
  • Benchmark tools
  • Results
  • Next Steps
  • Monitoring and Profiling
• Q&A
CSCS (Swiss National Supercomputing Centre)
• Founded in 1991
• Enables world-class research with a scientific user lab
• Available to domestic and international researchers through a transparent, peer-reviewed allocation process
• Open to academia and also available to users from industry and the business sector
• Operated by ETH Zurich and located in Lugano
24 years of supercomputers at CSCS
1991: NEC SX3, 5.5 GF (Adula)
1996: NEC SX4, 10 GF (Gottardo)
1999: NEC SX5, 64 GF (Prometeo)
2002: IBM SP4, 1.3 TF (Venus)
2005: Cray XT3, 5.8 TF (Palu)
2006: IBM P5, 4.5 TF (Blanc)
2009-12: Cray XE6, 402 TF (Monte Rosa)
2012-13: Cray XC30, 7.7 PF (Piz Daint)
2014: XC30, 1.25 PF (Piz Daint extension)
Data Centre
- 2000 sq.m machine room
- 20 MW of power and cooling capacity
- Lake water cooling (700 litres/s)
Overview of Systems Integration (SI) Unit
Unit missions:
- Managing projects
- Relations with vendors
- Evaluating technologies
- Software deployments
What is BeeGFS?
• Parallel filesystem
• HPC oriented
• Formerly called FhGFS
• Alternative to Lustre and GPFS
• Developed by Fraunhofer
• Open source
• Support delivered by ThinkParQ
[Architecture diagram; image courtesy of BeeGFS]
Basic Features of BeeGFS
• Supports failover for data and metadata using tools such as Pacemaker and Heartbeat
• Replication-based failover mechanism
• Supports multiple data and metadata servers and targets
• Supports quotas
• Robinhood can be used to scan the entire filesystem
• BeeGFS On Demand (BeeOND) filesystem
• Easy to deploy and manage (see the example below)
• Supports x86 and OpenPOWER platforms
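Day-to-day management is typically done with the beegfs-ctl tool. The following is a minimal sketch, not taken from the slides; the file and user names are placeholders, and option spellings should be checked against the installed BeeGFS version.

# Hypothetical management queries with beegfs-ctl (paths/users are placeholders)
beegfs-ctl --listnodes --nodetype=meta        # list metadata servers
beegfs-ctl --listnodes --nodetype=storage     # list storage servers
beegfs-ctl --getentryinfo /beegfs/somefile    # show the stripe pattern of a file
beegfs-ctl --getquota --uid someuser          # report quota usage for a user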
BeeOND
- Creates a filesystem on demand
- Uses the hard drives / SSDs of every compute node
- The filesystem is created by submitting a job to the scheduler (we are working on confirming SLURM support); see the sketch after this list
- Memory could be used instead of SSDs
- We used 20 SSDs on 20 nodes for our tests
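As a rough illustration of the workflow, a per-job BeeOND instance might be started and torn down along the following lines. This is a sketch only: the beeond wrapper ships with BeeGFS, but the node list, SSD path, mount point and exact flags are assumptions, not the CSCS configuration.

# Hypothetical job prolog/epilog; paths are placeholders
scontrol show hostnames "$SLURM_JOB_NODELIST" > /tmp/nodefile.$SLURM_JOB_ID
beeond start -n /tmp/nodefile.$SLURM_JOB_ID -d /local/ssd/beeond -c /mnt/beeond
# ... the job then reads and writes under /mnt/beeond ...
beeond stop -n /tmp/nodefile.$SLURM_JOB_ID -L -d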
Benefits of BeeOND
Benefits from otherwise unused local space
No impact on the parallel filesystem
Real utilization of the high-speed network
Filesystem scales with the number of compute nodes
Open point:
What is the overhead on the compute nodes?
Test System Layout
[Diagram: DDN 7700 couplet, 4 × FDR links, 2 × FDR links, dual-socket SB servers with 128 GB memory, fabric attachment via 1 × FDR link]
• One couplet (two controllers)
• Two x86 servers
• One enclosure with 60 drives
• 6 SSDs in one RAID volume
• 6 RAID 5 volumes of 9 drives each
Tuning the servers
# VM writeback and cache tuning
echo 5 > /proc/sys/vm/dirty_background_ratio
echo 20 > /proc/sys/vm/dirty_ratio
echo 50 > /proc/sys/vm/vfs_cache_pressure
echo 262144 > /proc/sys/vm/min_free_kbytes
echo always > /sys/kernel/mm/transparent_hugepage/enabled
echo always > /sys/kernel/mm/transparent_hugepage/defrag

# I/O scheduler and queue tuning for each data volume
for dev in dm-0 dm-1 dm-2 dm-3 dm-4 dm-5 dm-6; do
    echo deadline > /sys/block/$dev/queue/scheduler
    echo 4096 > /sys/block/$dev/queue/nr_requests
    echo 32768 > /sys/block/$dev/queue/read_ahead_kb
    echo 32767 > /sys/block/$dev/queue/max_sectors_kb
done

# CPU governor and NUMA reclaim
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
echo 1 > /proc/sys/vm/zone_reclaim_mode
Documentation for the tuned parameters:
https://www.kernel.org/doc/Documentation/sysctl/vm.txt
https://access.redhat.com/solutions/46111
http://www.slideshare.net/rampalliraj/linux-kernel-io-schedulers?from_action=save
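The vm.* values above can also be made persistent through sysctl. A minimal sketch follows; it is not part of the original tuning, the file name is an assumption, and the /sys/block and transparent_hugepage settings would still need a boot-time script.

# Hypothetical: persist the vm.* tunables via sysctl.d
cat > /etc/sysctl.d/90-beegfs-storage.conf <<'EOF'
vm.dirty_background_ratio = 5
vm.dirty_ratio = 20
vm.vfs_cache_pressure = 50
vm.min_free_kbytes = 262144
vm.zone_reclaim_mode = 1
EOF
sysctl --system    # reload all sysctl configuration files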
Benchmark tools
• mdtest, for measuring metadata performance
  https://sourceforge.net/projects/mdtest/
• IOzone, for measuring read and write throughput (sample invocations below)
  http://www.iozone.org
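For reference, invocations along the following lines produce the kinds of numbers shown on the next slides. The exact parameters used in the tests are not recorded here, so the directory, record size, file size and process counts below are illustrative only.

# Illustrative benchmark runs; paths and sizes are placeholders
srun -n 64 mdtest -d /beegfs/mdtest -n 1000 -i 3       # metadata: create/stat/remove
iozone -i 0 -i 1 -r 1m -s 4g -t 64 -+m clients.txt     # throughput: write/rewrite, read/reread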
IOzone results on /beegfs
Test running:
Children see throughput for 64 initial writers = 5032700.90 kB/sec
    Min throughput per process = 63754.09 kB/sec
    Max throughput per process = 103798.58 kB/sec
    Avg throughput per process = 78635.95 kB/sec
    Min xfer = 12880896.00 kB

Test running:
Children see throughput for 64 rewriters = 4996297.63 kB/sec
    Min throughput per process = 68781.82 kB/sec
    Max throughput per process = 90666.23 kB/sec
    Avg throughput per process = 78067.15 kB/sec
    Min xfer = 16473088.00 kB

Test running:
Children see throughput for 64 readers = 4225632.91 kB/sec
    Min throughput per process = 40047.24 kB/sec
    Max throughput per process = 77678.61 kB/sec
    Avg throughput per process = 66025.51 kB/sec
    Min xfer = 10813440.00 kB

Test running:
Children see throughput for 64 re-readers = 4253662.00 kB/sec
    Min throughput per process = 56998.73 kB/sec
    Max throughput per process = 76042.87 kB/sec
    Avg throughput per process = 66463.47 kB/sec
    Min xfer = 15729664.00 kB
Mdtest results on BeeOND
[Charts: directories per second vs. number of MDSs (1, 2, 4, 8, 16, 20) for directory creation, directory stat, and directory removal]
Mdtest results on BeeOND
[Charts: files per second vs. number of MDSs for file creation, file stat, and file removal]
Next steps
• Scaling on a bigger cluster
• Verifying the failover procedures
• Verifying the BeeOND overhead on the compute nodes
• Using NVMe instead of SSDs
• Using tmpfs
• Creating BeeOND filesystems through SLURM jobs
• Using Robinhood to scan millions of files