How BeeGFS excels in extreme HPC scale-out environments
TRANSCRIPT
www.beegfs.io | Alexander Eekhoff, Manager System Engineering | 2019
HPC Knowledge Meeting ‘19
About ThinkParQ
• Established in 2014 as a spinoff from the Fraunhofer Center for High-Performance Computing, with a strong focus on R&D
• 5 rankings in the top 20 on the IO-500 list.
• Awarded the HPCwire 2018 Best Storage Product or Technology Award
• Together with its partners, ThinkParQ provides fast, flexible, and solid storage solutions around BeeGFS, tailored to users’ needs
HPC Knowledge Meeting ‘19
Delivering solutions for
HPC, AI / Deep Learning, Life Sciences, Oil and Gas
HPC Knowledge Meeting ‘19
Technology Partners
HPC Knowledge Meeting ‘19
Partners
Platinum Partners
Gold Partners APAC
Gold Partners EMEA
Gold Partners NA
HPC Knowledge Meeting ‘19
BeeGFS – The Leading Parallel Cluster File System
• Performance: well balanced from small to large files
• Scalability: increase file system performance and capacity, seamlessly and non-disruptively
• Ease of Use: easy to deploy and integrate with existing infrastructure
• Robustness: high availability design enabling continuous operations
[Diagram: Client Service with direct parallel file access to the Metadata Service and Storage Service]
HPC Knowledge Meeting ‘19
Quick Facts: BeeGFS
[Diagram: files in /mnt/beegfs/dir1 are striped in chunks across Storage Servers #1–#5, with metadata (M) held on Metadata Server #1]
Simply grow capacity and performance to the level that you need
• A hardware-independent parallel file system (aka Software-defined Parallel Storage)
• Runs on various platforms: x86, ARM, OpenPOWER, …
• Multiple networks (InfiniBand, Omni-Path, Ethernet, …)
• Open Source
• Runs on various Linux distros: RHEL, SLES, Ubuntu, …
• NFS, CIFS, Hadoop enabled
HPC Knowledge Meeting ‘19
Enterprise Features
BeeGFS Enterprise Features (under support contract):
• High Availability
• Quota Enforcement
• Access Control Lists (ACLs)
• Storage Pools
Support Benefits:
• Professional Support
• Customer Portal (Training videos, additional documentation)
• Special repositories with early updates and hotfixes
• Guaranteed next business day response
End User License Agreement
https://www.beegfs.io/docs/BeeGFS_EULA.txt
beegfs.io
How BeeGFS Works
HPC Knowledge Meeting ‘19
What is BeeGFS
[Diagram: files in /mnt/beegfs/dir1 are striped in chunks across Storage Servers #1–#5, with metadata (M) on Metadata Server #1]
Simply grow capacity and performance to the level that you need
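As an illustration of how the striping shown above is controlled, the following is a minimal sketch with beegfs-ctl; the directory path and values are examples, and option names may vary slightly between releases.
# Show the current stripe pattern of a directory
$ beegfs-ctl --getentryinfo /mnt/beegfs/dir1
# Stripe new files in the directory across 4 storage targets with 1 MiB chunks
$ beegfs-ctl --setpattern --numtargets=4 --chunksize=1m /mnt/beegfs/dir1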
HPC Knowledge Meeting ‘19
BeeGFS Architecture
• Client Service
• Native Linux module to mount the file system
• Management Service
• Service registry and watchdog
• Metadata Service
• Maintain striping information for files
• Not involved in data access between file open/close
• Storage Service
• Store the (distributed) file contents
• Graphical Administration and Monitoring Service
• GUI to perform administrative tasks and monitor system information
• Can be used for “Windows-style installation“
[Diagram: Client Service with direct parallel file access to the Metadata Service and Storage Service]
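A quick way to see these services on a running system is via the standard command-line tools; this is a sketch, assuming a mounted client, and flags may differ between versions.
# List registered metadata and storage services with their network interfaces
$ beegfs-ctl --listnodes --nodetype=meta --nicdetails
$ beegfs-ctl --listnodes --nodetype=storage --nicdetails
# Check reachability and free space from a client
$ beegfs-check-servers
$ beegfs-df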
HPC Knowledge Meeting ‘19
BeeGFS Architecture
• Management Service
• Meeting point for servers and clients
• Watches registered services and checks their state
• Not critical for performance, stores no user data
• Typically not running on a dedicated machine
[Diagram: clients with direct, parallel file access to metadata servers and storage servers; management host with graphical administration & monitoring system]
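Because the management service is only a meeting point, every service and client simply points at the management host in its configuration; a minimal sketch, where the host name "mgmt01" is an example.
# The same sysMgmtdHost key exists in beegfs-meta.conf, beegfs-storage.conf and beegfs-client.conf
$ grep sysMgmtdHost /etc/beegfs/beegfs-client.conf
sysMgmtdHost = mgmt01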
HPC Knowledge Meeting ‘19
BeeGFS Architecture
• Metadata Service
• Stores information about the data
• Directory information
• File and directory ownership
• Location of user data files on storage targets
• Not involved in data access between file open/close
• Faster CPU cores improve latency
• Manages one metadata target
• In general, any directory on an existing local file system
• Typically a RAID1 or RAID10 on SSD or NVMe devices
• Stores complete metadata including file size
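To see which metadata node owns an entry and what striping information it holds for a file, a sketch along these lines can be used (the path is an example).
# Print the owning metadata node, stripe pattern and chunk locations for a file
$ beegfs-ctl --getentryinfo --verbose /mnt/beegfs/dir1/file1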
HPC Knowledge Meeting ‘19
BeeGFS Architecture
• Storage Service
• Stores striped user file contents (data chunk files)
• One or multiple storage services per BeeGFS instance
• Manages one or more storage targets
• In general, any directory on an existing local file system
• Typically a RAID-6 (8+2 or 10+2) or ZFS RAIDz2 volume, either internal or externally attached
• It can also be a single HDD, NVMe, or SSD device
• Multiple RDMA interfaces per server possible (see the sketch below)
• Different storage service instances bind to different interfaces
• Different IP subnets for the interfaces so that routing works correctly
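For the multi-interface case, the interface a storage service binds to is typically pinned via an interfaces file referenced from its configuration; the file name and the interface name (ib0) below are assumptions.
# Preferred interface(s) for this storage service instance, one per line
$ cat /etc/beegfs/conn-storage1.txt
ib0
# Reference it from the storage service configuration
$ grep connInterfacesFile /etc/beegfs/beegfs-storage.conf
connInterfacesFile = /etc/beegfs/conn-storage1.txt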
HPC Knowledge Meeting ‘19
Live per-Client and per-User Statistics
HPC Knowledge Meeting ‘19
BeeGFS - Design Philosophy
• Designed for Performance, Scalability, Robustness and Ease of Use
• Distributed Metadata
• No Linux kernel patches; runs on top of ext4, XFS, ZFS, BTRFS, …
• Scalable multithreaded architecture
• Supports RDMA / RoCE & TCP (InfiniBand, Omni-Path, 100/40/10/1GbE, …)
• Easy to install and maintain (user space servers)
• Robust and flexible (all services can be placed independently)
• Hardware agnostic
beegfs.io
Key Features
HPC Knowledge Meeting ‘19
High Availability I – Buddy Mirroring
• Built-in Replication for High Availability
• Flexible setting per directory
• Individual for metadata and/or storage
• Buddies can be in different racks or different fire zones.
[Diagram: Targets #101, #201, #301, #401 on Storage Servers #1–#4, paired into Buddy Group #1 and Buddy Group #2]
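Buddy mirroring is typically enabled in two steps: pair the targets into buddy groups, then switch mirroring on for the directories that need it. A minimal sketch (the directory path is an example; check the beegfs-ctl help of your release for exact options).
# Pair storage targets into buddy groups automatically
$ beegfs-ctl --addmirrorgroup --automatic --nodetype=storage
# Enable buddy mirroring for new files in a directory
$ beegfs-ctl --setpattern --buddymirror /mnt/beegfs/projects
# Metadata mirroring is enabled separately
$ beegfs-ctl --mirrormd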
HPC Knowledge Meeting ‘19
High Availability II – Shared storage
• Shared storage together with Pacemaker/Corosync
• No extra storage space needed
• Works in active/active layout
• BeeGFS ha-utils simplify setup and administration
[Diagram: Storage Servers #1–#4 with shared access to Targets #101, #201, #301, #401]
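In this variant there is no BeeGFS-level replication; the cluster manager moves a failed server's storage service and targets to a partner node. The following is a rough, generic Pacemaker sketch with assumed resource names, not the beegfs ha-utils workflow.
# Manage the storage service as a systemd resource and keep it with its shared target
$ pcs resource create beegfs_storage1 systemd:beegfs-storage op monitor interval=30s
$ pcs constraint colocation add beegfs_storage1 with target101_fs INFINITY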
HPC Knowledge Meeting ‘19
Storage Pool
• Support for different types of storage
• Single namespace across all tiers
[Diagram: Storage Service with a Performance Pool (CurrentProjects) and a Capacity Pool (FinishedProjects)]
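Pools are managed with beegfs-ctl, and directories are pinned to a pool via their stripe pattern; a sketch with example target IDs, pool ID and path.
# Create a pool from the fast targets and list the result
$ beegfs-ctl --addstoragepool --desc="performance" --targets=101,102
$ beegfs-ctl --liststoragepools
# Place new files of a project on the performance pool (pool ID 2 is an example)
$ beegfs-ctl --setpattern --storagepoolid=2 /mnt/beegfs/CurrentProjects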
HPC Knowledge Meeting ‘19
BeeOND – BeeGFS On Demand
• Create a parallel file system instance on-the-fly
• Start/stop with one simple command
• Use cases: cloud computing, test systems, cluster compute nodes, …
• Can be integrated into the cluster batch system
• Common use case: per-job parallel file system
• Aggregate the performance and capacity of local SSDs/disks in the compute nodes of a job
• Take load off the global storage
• Speed up "nasty" I/O patterns
[Diagram: BeeOND instance spanning Compute Nodes #1–#n, with user-controlled data staging to and from global storage]
HPC Knowledge Meeting ‘19
The easiest way to set up a parallel file system…
# GENERAL USAGE
$ beeond start -n <nodefile> -d <storagedir> -c <clientmount>
-------------------------------------------------
# EXAMPLE
$ beeond start -n $NODEFILE -d /local_disk/beeond -c /my_scratch
Starting BeeOND Services…
Mounting BeeOND at /my_scratch…
Done.
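Tearing the instance down at the end of a job is equally simple; a sketch assuming the same nodefile (check the beeond help of your release for the exact cleanup flags).
# STOP (typically wired into the batch system epilog)
$ beeond stop -n $NODEFILE -L -d
The -L and -d options are commonly used to also unmount the instance and delete the local chunk data on the nodes; exact flags may differ between BeeOND versions.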
HPC Knowledge Meeting ‘19
BeeGFS Additional Features
• HA support
• Quota user/group
• ACL
• Support for different types of storage
• Modification Event Logging
• Statistics in time series database
• Cluster manager integration, e.g. Bright Cluster Manager, Univa
• Cloud readiness for AWS / Azure
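As an example of the quota feature, per-user limits can be inspected and set with beegfs-ctl once quota enforcement is enabled in the server configuration; the user name and limits below are examples, and limit syntax may vary by release.
# Report current usage for a user
$ beegfs-ctl --getquota --uid alice
# Set size and inode limits for that user
$ beegfs-ctl --setquota --uid alice --sizelimit=10T --inodelimit=1000000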
HPC Knowledge Meeting ‘19
Bright Cluster Manager Integration
beegfs.io
BeeGFS and BeeOND
HPC Knowledge Meeting ‘19
Scale from small
Converged Setup
HPC Knowledge Meeting ‘19
Into Enterprise
[Diagram: clients with direct parallel file access to multiple dedicated storage services]
HPC Knowledge Meeting ‘19
to BeeOND
[Diagram: BeeOND storage services running on NVMe inside the compute nodes]
beegfs.io
BeeGFS Use Cases
HPC Knowledge Meeting ‘19
Alfred Wegener Institute for Polar and Marine Research
• The institute was founded in 1980 and is named after the meteorologist, climatologist and geologist Alfred Wegener.
• Government funded
• Conducts research in the Arctic, in the Antarctic and in the high and mid latitude oceans
• Additional research topics are:
• North Sea research
• Marine biological monitoring
• Technical marine developments
• Current mission: in September 2019 the icebreaker Polarstern will drift through the Arctic Ocean for one year with 600 team members from 17 countries, and the data gathered will be used to take climate and ecosystem research to the next level.
HPC Knowledge Meeting ‘19
Day to day HPC operations @AWI
• CS400
• 11,548 Cores
• 316 Nodes:
• 2x Intel Xeon Broadwell 18-Core CPUs
• 64GB RAM (DDR4 2400MHz)
• 400GB SSD
• 4 fat compute nodes, as above, but 512GB RAM
• 1 very fat node, 2x Intel Broadwell 14-Core CPUs, 1.5TB RAM
• Intel Omni-Path network
• 1024TB fast parallel file system (BeeGFS)
• 128TB home and software file system
HPC Knowledge Meeting ‘19
Do you remember BeeOND?
• Global BeeGFS storage on spinning disks
• 1PB of scratch_fs providing 80GB/s
• 316 compute nodes
• Each equipped with a 400GB SSD
• 316 × 500MB/s per SSD ≈ 150GB/s aggregate
BeeOND burst buffer "for free"
"Robust and stable, even in a case of unexpected power failure." – Dr. Malte Thoma, Alfred Wegener Institute, Helmholtz Centre for Polar and Marine Research (Bremerhaven, Germany)
HPC Knowledge Meeting ‘19
Tokyo Institute of Technology: Tsubame 3
• Top national university for science and technology in Japan
• 130 year history
• Over 10,000 students located in the Tokyo Area
Tsubame 3
• Latest Tsubame Supercomputer
• #1 on the Green500 in November 2017
• 14.110 GFLOPS per watt
• BeeOND uses 1PB of available NVMe
HPC Knowledge Meeting ‘19
Tsubame 3 Configuration
• 540 nodes
• Four Nvidia Tesla P100 GPUs per node (2,160 total)
• Two 14-core Intel Xeon Processor E5-2680 v4 (15,120 cores total)
• Two dual-port Intel Omni-Path Architecture HFIs (2,160 ports total)
• 2TB of Intel SSD DC Product Family for NVMe storage per node
• Simple integration with Univa Grid Engine
HPC Knowledge Meeting ‘19
AIST (National Institute of Advanced Industrial Science and Technology)
• Japanese Research Institute located in the Greater Tokyo Area
• Over 2,000 researchers
• Part of the Ministry of Economy, Trade and Industry
ABCI (AI Bridging Cloud Infrastructure)
• Japanese supercomputer in production since July 2018
• Theoretical performance is 130 petaflops, one of the fastest in the world
• Will make its resources available through the cloud to various private and public entities in Japan
• #7 on the Top 500 list
HPC Knowledge Meeting ‘19
Largest Machine Learning Environment in Japan uses BeeOND
• 1,088 servers
• Two Intel Xeon Gold processor CPUs (a total of 2,176 CPUs)
• Four NVIDIA Tesla V100 GPU computing cards (a total of 4,352 GPUs)
• Intel SSD DC P4600 series NVMe drives as local storage, 1.6TB per node (a total of about 1.6PB)
• InfiniBand EDR
• Simple integration with Univa Grid Engine
HPC Knowledge Meeting ‘19
Spookfish
• Aerial survey system based in Western Australia
• High resolution images are provided to customers who need up to date information on terrain they plan to utilize
• Information can be fed into Geographical Information System and CAD applications.
HPC Knowledge Meeting ‘19
Spookfish System Architecture
• Metadata server x 6
• Supermicro chassis with 4 x Intel Xeon X7560 and 256GB RAM
• Only performs MDS services
• Metadata target x 6 with buddy mirroring
• Converged storage server x 40
• DELL R730 with 2 x Intel Xeon E5-2650v4 CPUs and 128GB of RAM
• Storage servers also perform processing for applications
• Uses Linux cgroups to avoid out-of-memory events (see the sketch below)
• cgroups are not used for CPU, and so far there have been no issues with CPU shortage
• Storage target x 160 with buddy mirroring
• 10Gb/s Ethernet
• Performance exceeded expectations with 10GB/s read and 5-6GB/s write after tuning
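For the converged setup, cgroups are the guard against out-of-memory events; a minimal cgroups-v1 sketch follows, where the group name, memory limit and PID variable are assumptions, not Spookfish's actual configuration.
# Create a memory cgroup for the processing jobs and cap it
$ sudo mkdir /sys/fs/cgroup/memory/processing
$ echo 96G | sudo tee /sys/fs/cgroup/memory/processing/memory.limit_in_bytes
# Move an application process into the group
$ echo $APP_PID | sudo tee /sys/fs/cgroup/memory/processing/cgroup.procs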
"The result [of switching to BeeGFS] is that we’re now able to process about 3 times faster with BeeGFS than with our old NFS server. We’re seeing speeds of up to 10GB/s read and 5-6GB/s write.” –Spookfish
HPC Knowledge Meeting ‘19
CSIRO
• The Commonwealth Scientific and Industrial Research Organisation (CSIRO) has adopted BeeGFS file system for their 2PB all NVMe storage in Australia, making it one of the largest NVMe storage systems in the world.
Overview:
• 4 x Metadata Server
• 32 x Storage Server
• 2 PiB usable capacity, DELL all-NVMe
• Look forward to ISC to see what the beast can do!
• Further details: http://www.pacificteck.com/?p=437
[Diagram: Metadata x 4, Storage x 32, 3.2TB NVMe x 24 per server]
HPC Knowledge Meeting ‘19
Follow BeeGFS:
beegfs.io
Sales engagement
HPC Knowledge Meeting ‘19
What do we need?
• Full Address:
• Name:
• Email:
• Phone #:
• Business (university, research institute, life science, HPC etc):
• Quantity Single Target (MDS) Servers:
• Quantity Multi Target (OSS) Servers:
• RAID Settings
• Server Type:
• Hardware Platform (e.g. Intel Xeon, AMD, ARM):
• Capacity Requirement:
• Performance Requirement:
• Quantity Clients (rough number):
• BeeOND up to 100, up to 500, > 500 nodes
• Support duration (3 years, 5 years):
• Expected support start date:
• System Nickname (to distinguish multiple systems):
• Interconnect (EDR, FDR, QDR, OPA, 40/10 GigE):
• Linux Distribution (e.g. Red Hat):
HPC Knowledge Meeting ‘19
BeeGFS terms for sizing
• The term “target” refers to a storage device exported through BeeGFS. Typically, a target is a RAID6 volume (for BeeGFS storage servers) or a RAID10 volume (for BeeGFS metadata servers) consisting of several disks, but it can optionally also be a single HDD or SSD.
• A “Single Target Server” exports exactly one target, either for storage or metadata.
• A “Multi Target Server” exports up to six targets for storage and optionally one target for metadata.
• An “Unlimited Target Server” exports an unlimited number of targets for storage and/or metadata.
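For sizing, the number of targets per server can be read directly from a running system; a sketch (flags may differ slightly between releases).
# List storage and metadata targets together with the node that exports them
$ beegfs-ctl --listtargets --nodetype=storage --longnodes
$ beegfs-ctl --listtargets --nodetype=meta --longnodes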
HPC Knowledge Meeting ‘19
Pricing structure
HPC Knowledge Meeting ‘19
• 1st Level Support (Partner)
• 1st level support is provided by the reseller or a qualified partner (e.g. sub-contractor / sub-reseller) of the reseller. 1st level support staff have the knowledge level of general system administrators and perform the following tasks:
• Definition of the problem, steps to reproduce, and expected behavior
• Description of the customer hardware setup
• Description of the customer software setup (e.g. operating system, software, firmware and driver versions)
• Gathering of other potentially relevant information such as log files
• Attempts to solve problems based on previously known similar cases (e.g. recommendation of software updates or configuration changes)
• 2nd Level Support (Partner)
• 2nd level support is provided by the Gold Reseller or a qualified partner (e.g. sub-contractor / sub-reseller) of the Gold Reseller. 2nd level support staff have knowledge of BeeGFS concepts and tools as well as of the general storage system stack (e.g. storage devices and network tools / testing) and perform the following tasks:
• Problem and root cause analysis (e.g. based on log file analysis)
• Hardware check (e.g. network, storage devices, cables) and software check, including attempts to reproduce issues on a different test system to verify whether problems are caused by a hardware malfunction at the customer site
• Issue discussion and potential solution or work-around discussion with the customer
• Definition of a minimal setup to reproduce problems before escalation to the higher support level
• 3rd Level Support (ThinkParQ)
• 3rd level support is provided by ThinkParQ. The BeeGFS support team has detailed knowledge of BeeGFS internals. Incoming support tickets are prioritized based on severity.
• Full problem and root cause analysis, optionally including remote login to the customer system via ssh
• Code inspection for detailed internal analysis
• Patch development with early update releases for supported customers
• Recommendation of performance tuning methods and HPC consulting
• Reaction time is next business day (German working hours)
HPC Knowledge Meeting ‘19
Installation / Training
• Installation and training can be done remotely.
• Remote ssh installation: 1,200 USD per day
• BeeGFS remote training (agenda & time): 10-hour session (free of charge for partners; 1,200 USD for end customers)
• Introduction: BeeGFS basic concepts, architecture and features
• How do I ... with BeeGFS: typical administrative tasks
• Sizing and tuning
• Designing and implementing storage solutions with BeeGFS
• Projects with BeeGFS
• Reference installation and best practices
• Sales & Presales Training: BeeGFS sales and pre-sales training mapped to your customer focus, with file system comparison and use cases, to get your sales force up to speed. Please let us know when you are ready to schedule it.
HPC Knowledge Meeting ‘19
Pricing Model
HPC Knowledge Meeting ‘19
BeeGFS Storage Engine under the Hood