How BeeGFS excels in extreme HPC scale-out environments
TRANSCRIPT
www.beegfs.io | Alexander Eekhoff, Manager System Engineering | 2019
HPC Knowledge Meeting ‘19
About ThinkParQ
• Established in 2014 as a spinoff from the Fraunhofer Center for High-Performance Computing, with a strong focus on R&D
• 5 rankings in the top 20 on the IO-500 list.
• Awarded the HPCwire 2018 Best Storage Product or Technology Award
• Together with its partners, ThinkParQ provides fast, flexible, and solid storage solutions around BeeGFS, tailored to users’ needs
HPC Knowledge Meeting ‘19
Delivering solutions for
HPC, AI / Deep Learning, Life Sciences, Oil and Gas
HPC Knowledge Meeting ‘19
Technology Partners
HPC Knowledge Meeting ‘19
Partners
Platinum Partners
Gold Partners APAC
Gold Partners EMEA
Gold Partners NA
HPC Knowledge Meeting ‘19
BeeGFS – The Leading Parallel Cluster File System
• Performance: well balanced from small to large files
• Scalability: increase file system performance and capacity, seamlessly and non-disruptively
• Ease of Use: easy to deploy and integrate with existing infrastructure
• Robustness: high availability design enabling continuous operations
[Diagram: Client Service with direct parallel file access to the Metadata Service and Storage Service]
HPC Knowledge Meeting ‘19
Quick Facts: BeeGFS
[Diagram: files in /mnt/beegfs/dir1 are striped in chunks across Storage Servers #1–#5, with metadata (M) held on Metadata Server #1]
Simply grow capacity and performance to the level that you need
• A hardware-independent parallel file system (aka Software-defined Parallel Storage)
• Runs on various platforms: x86, ARM, OpenPOWER, …
• Multiple networks (InfiniBand, Omni-Path, Ethernet, …)
• Open Source
• Runs on various Linux distros: RHEL, SLES, Ubuntu, …
• NFS, CIFS, Hadoop enabled
HPC Knowledge Meeting ‘19
Enterprise Features
BeeGFS Enterprise Features (under support contract):
• High Availability
• Quota Enforcement
• Access Control Lists (ACLs)
• Storage Pools
Support Benefits:
• Professional Support
• Customer Portal (Training videos, additional documentation)
• Special repositories with early updates and hotfixes
• Guaranteed next business day response
End User License Agreement
https://www.beegfs.io/docs/BeeGFS_EULA.txt
beegfs.io
How BeeGFS Works
HPC Knowledge Meeting ‘19
What is BeeGFS
[Diagram: files in /mnt/beegfs/dir1 are striped in chunks across Storage Servers #1–#5, with metadata (M) on Metadata Server #1]
Simply grow capacity and performance to the level that you need
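As an illustration of how the striping shown above is controlled, the following is a minimal sketch with beegfs-ctl; the directory path and values are examples, and option names may vary slightly between releases.
# Show the current stripe pattern of a directory
$ beegfs-ctl --getentryinfo /mnt/beegfs/dir1
# Stripe new files in the directory across 4 storage targets with 1 MiB chunks
$ beegfs-ctl --setpattern --numtargets=4 --chunksize=1m /mnt/beegfs/dir1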
HPC Knowledge Meeting ‘19
BeeGFS Architecture
• Client Service
• Native Linux module to mount the file system
• Management Service
• Service registry and watchdog
• Metadata Service
• Maintain striping information for files
• Not involved in data access between file open/close
• Storage Service
• Store the (distributed) file contents
• Graphical Administration and Monitoring Service
• GUI to perform administrative tasks and monitor system information
• Can be used for “Windows-style installation“
[Diagram: Client Service with direct parallel file access to the Metadata Service and Storage Service]
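A quick way to see these services on a running system is via the standard command-line tools; this is a sketch, assuming a mounted client, and flags may differ between versions.
# List registered metadata and storage services with their network interfaces
$ beegfs-ctl --listnodes --nodetype=meta --nicdetails
$ beegfs-ctl --listnodes --nodetype=storage --nicdetails
# Check reachability and free space from a client
$ beegfs-check-servers
$ beegfs-df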
HPC Knowledge Meeting ‘19
BeeGFS Architecture
• Management Service
• Meeting point for servers and clients
• Watches registered services and checks their state
• Not critical for performance, stores no user data
• Typically not running on a dedicated machine
[Diagram: clients with direct, parallel file access to metadata servers and storage servers; management host with graphical administration & monitoring system]
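Because the management service is only a meeting point, every service and client simply points at the management host in its configuration; a minimal sketch, where the host name "mgmt01" is an example.
# The same sysMgmtdHost key exists in beegfs-meta.conf, beegfs-storage.conf and beegfs-client.conf
$ grep sysMgmtdHost /etc/beegfs/beegfs-client.conf
sysMgmtdHost = mgmt01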
HPC Knowledge Meeting ‘19
BeeGFS Architecture
• Metadata Service
• Stores information about the data
• Directory information
• File and directory ownership
• Location of user data files on storage targets
• Not involved in data access between file open/close
• Faster CPU cores improve latency
• Manages one metadata target
• In general, any directory on an existing local file system
• Typically a RAID1 or RAID10 on SSD or NVMe devices
• Stores complete metadata including file size
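To see which metadata node owns an entry and what striping information it holds for a file, a sketch along these lines can be used (the path is an example).
# Print the owning metadata node, stripe pattern and chunk locations for a file
$ beegfs-ctl --getentryinfo --verbose /mnt/beegfs/dir1/file1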
HPC Knowledge Meeting ‘19
BeeGFS Architecture
• Storage Service
• Stores striped user file contents (data chunk files)
• One or multiple storage services per BeeGFS instance
• Manages one or more storage targets
• In general, any directory on an existing local file system
• Typically a RAID-6 (8+2 or 10+2) or ZFS RAIDz2 volume, either internal or externally attached
• It can also be a single HDD, NVMe, or SSD device
• Multiple RDMA interfaces per server possible (see the sketch below)
• Different storage service instances bind to different interfaces
• Different IP subnets for the interfaces so that routing works correctly
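For the multi-interface case, the interface a storage service binds to is typically pinned via an interfaces file referenced from its configuration; the file name and the interface name (ib0) below are assumptions.
# Preferred interface(s) for this storage service instance, one per line
$ cat /etc/beegfs/conn-storage1.txt
ib0
# Reference it from the storage service configuration
$ grep connInterfacesFile /etc/beegfs/beegfs-storage.conf
connInterfacesFile = /etc/beegfs/conn-storage1.txt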
HPC Knowledge Meeting ‘19
Live per-Client and per-User Statistics
HPC Knowledge Meeting ‘19
BeeGFS - Design Philosophy
• Designed for Performance, Scalability, Robustness and Ease of Use
• Distributed Metadata
• No Linux kernel patches; runs on top of ext4, XFS, ZFS, BTRFS, …
• Scalable multithreaded architecture
• Supports RDMA / RoCE & TCP (InfiniBand, Omni-Path, 100/40/10/1GbE, …)
• Easy to install and maintain (user space servers)
• Robust and flexible (all services can be placed independently)
• Hardware agnostic
beegfs.io
Key Features
HPC Knowledge Meeting ‘19
High Availability I – Buddy Mirroring
• Built-in Replication for High Availability
• Flexible setting per directory
• Individual for metadata and/or storage
• Buddies can be in different racks or different fire zones.
[Diagram: Targets #101, #201, #301, #401 on Storage Servers #1–#4, paired into Buddy Group #1 and Buddy Group #2]
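Buddy mirroring is typically enabled in two steps: pair the targets into buddy groups, then switch mirroring on for the directories that need it. A minimal sketch (the directory path is an example; check the beegfs-ctl help of your release for exact options).
# Pair storage targets into buddy groups automatically
$ beegfs-ctl --addmirrorgroup --automatic --nodetype=storage
# Enable buddy mirroring for new files in a directory
$ beegfs-ctl --setpattern --buddymirror /mnt/beegfs/projects
# Metadata mirroring is enabled separately
$ beegfs-ctl --mirrormd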
HPC Knowledge Meeting ‘19
High Availability II – Shared storage
• Shared storage together with Pacemaker/Corosync
• No extra storage space needed
• Works in active/active layout
• BeeGFS ha-utils simplify setup and administration
[Diagram: Storage Servers #1–#4 with shared access to Targets #101, #201, #301, #401]
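In this variant there is no BeeGFS-level replication; the cluster manager moves a failed server's storage service and targets to a partner node. The following is a rough, generic Pacemaker sketch with assumed resource names, not the beegfs ha-utils workflow.
# Manage the storage service as a systemd resource and keep it with its shared target
$ pcs resource create beegfs_storage1 systemd:beegfs-storage op monitor interval=30s
$ pcs constraint colocation add beegfs_storage1 with target101_fs INFINITY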
HPC Knowledge Meeting ‘19
Storage Pool
• Support for different types of storage
• Single namespace across all tiers
[Diagram: Storage Service with a Performance Pool (CurrentProjects) and a Capacity Pool (FinishedProjects)]
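Pools are managed with beegfs-ctl, and directories are pinned to a pool via their stripe pattern; a sketch with example target IDs, pool ID and path.
# Create a pool from the fast targets and list the result
$ beegfs-ctl --addstoragepool --desc="performance" --targets=101,102
$ beegfs-ctl --liststoragepools
# Place new files of a project on the performance pool (pool ID 2 is an example)
$ beegfs-ctl --setpattern --storagepoolid=2 /mnt/beegfs/CurrentProjects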
HPC Knowledge Meeting ‘19
BeeOND – BeeGFS On Demand
• Create a parallel file system instance on-the-fly
• Start/stop with one simple command
• Use cases: cloud computing, test systems, cluster compute nodes, …
• Can be integrated into the cluster batch system
• Common use case: per-job parallel file system
• Aggregate the performance and capacity of local SSDs/disks in the compute nodes of a job
• Take load off the global storage
• Speed up "nasty" I/O patterns
[Diagram: BeeOND instance spanning Compute Nodes #1–#n, with user-controlled data staging to and from global storage]
HPC Knowledge Meeting ‘19
The easiest way to set up a parallel file system…
# GENERAL USAGE
$ beeond start -n <nodefile> -d <storagedir> -c <clientmount>
-------------------------------------------------
# EXAMPLE
$ beeond start -n $NODEFILE -d /local_disk/beeond -c /my_scratch
Starting BeeOND Services…
Mounting BeeOND at /my_scratch…
Done.
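Tearing the instance down at the end of a job is equally simple; a sketch assuming the same nodefile (check the beeond help of your release for the exact cleanup flags).
# STOP (typically wired into the batch system epilog)
$ beeond stop -n $NODEFILE -L -d
The -L and -d options are commonly used to also unmount the instance and delete the local chunk data on the nodes; exact flags may differ between BeeOND versions.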
HPC Knowledge Meeting ‘19
BeeGFS Additional Features
• HA support
• Quota user/group
• ACL
• Support for different types of storage
• Modification Event Logging
• Statistics in time series database
• Cluster manager integration, e.g. Bright Cluster Manager, Univa
• Cloud readiness for AWS / Azure
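As an example of the quota feature, per-user limits can be inspected and set with beegfs-ctl once quota enforcement is enabled in the server configuration; the user name and limits below are examples, and limit syntax may vary by release.
# Report current usage for a user
$ beegfs-ctl --getquota --uid alice
# Set size and inode limits for that user
$ beegfs-ctl --setquota --uid alice --sizelimit=10T --inodelimit=1000000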
HPC Knowledge Meeting ‘19
Bright Cluster Manager Integration
beegfs.io
BeeGFS and BeeOND
HPC Knowledge Meeting ‘19
Scale from small
Converged Setup
HPC Knowledge Meeting ‘19
Into Enterprise
[Diagram: clients with direct parallel file access to multiple dedicated storage services]
HPC Knowledge Meeting ‘19
to BeeOND
[Diagram: BeeOND storage services running on NVMe inside the compute nodes]
beegfs.io
BeeGFS Use Cases
HPC Knowledge Meeting ‘19
Alfred Wegener Institute for Polar and Marine Research
• The institute was founded in 1980 and is named after the meteorologist, climatologist and geologist Alfred Wegener.
• Government funded
• Conducts research in the Arctic, in the Antarctic and in the high and mid latitude oceans
• Additional research topics are:
• North Sea research
• Marine biological monitoring
• Technical marine developments
• Current mission: in September 2019 the icebreaker Polarstern will drift through the Arctic Ocean for one year with 600 team members from 17 countries, and the data gathered will be used to take climate and ecosystem research to the next level.
HPC Knowledge Meeting ‘19
Day to day HPC operations @AWI
• CS400
• 11,548 Cores
• 316 Nodes:
• 2x Intel Xeon Broadwell 18-Core CPUs
• 64GB RAM (DDR4 2400MHz)
• 400GB SSD
• 4 fat compute nodes, as above, but 512GB RAM
• 1 very fat node, 2x Intel Broadwell 14-Core CPUs, 1.5TB RAM
• Intel Omni-Path network
• 1024TB fast parallel file system (BeeGFS)
• 128TB home and software file system
HPC Knowledge Meeting ‘19
Do you remember BeeOND?
• Global BeeGFS storage on spinning disks
• 1PB of scratch_fs providing 80GB/s
• 316 compute nodes
• Each equipped with a 400GB SSD
• 316 × 500MB/s per SSD ≈ 150GB/s aggregate
BeeOND burst buffer "for free"
"Robust and stable, even in a case of unexpected power failure." – Dr. Malte Thoma, Alfred Wegener Institute, Helmholtz Centre for Polar and Marine Research (Bremerhaven, Germany)
HPC Knowledge Meeting ‘19
Tokyo Institute of Technology: Tsubame 3
• Top national university for science and technology in Japan
• 130 year history
• Over 10,000 students located in the Tokyo Area
Tsubame 3
• Latest Tsubame Supercomputer
• #1 on the Green500 in November 2017
• 14.110 GFLOPS per watt
• BeeOND uses 1PB of available NVMe
HPC Knowledge Meeting ‘19
Tsubame 3 Configuration
• 540 nodes
• Four Nvidia Tesla P100 GPUs per node (2,160 total)
• Two 14-core Intel Xeon Processor E5-2680 v4 (15,120 cores total)
• Two dual-port Intel Omni-Path Architecture HFIs (2,160 ports total)
• 2TB of Intel SSD DC Product Family for NVMe storage per node
• Simple integration with Univa Grid Engine
HPC Knowledge Meeting ‘19
AIST (National Institute of Advanced Industrial Science and Technology)
• Japanese Research Institute located in the Greater Tokyo Area
• Over 2,000 researchers
• Part of the Ministry of Economy, Trade and Industry
ABCI (AI Bridging Cloud Infrastructure)
• Japanese supercomputer in production since July 2018
• Theoretical performance is 130 petaflops, one of the fastest in the world
• Will make its resources available through the cloud to various private and public entities in Japan
• #7 on the Top 500 list
HPC Knowledge Meeting ‘19
Largest Machine Learning Environment in Japan uses BeeOND
• 1,088 servers
• Two Intel Xeon Gold processor CPUs (a total of 2,176 CPUs)
• Four NVIDIA Tesla V100 GPU computing cards (a total of 4,352 GPUs)
• Intel SSD DC P4600 series NVMe drives as local storage, 1.6TB per node (a total of about 1.6PB)
• InfiniBand EDR
• Simple integration with Univa Grid Engine
HPC Knowledge Meeting ‘19
Spookfish
• Aerial survey system based in Western Australia
• High resolution images are provided to customers who need up to date information on terrain they plan to utilize
• Information can be fed into Geographical Information System and CAD applications.
HPC Knowledge Meeting ‘19
Spookfish System Architecture
• Metadata server x 6
• Supermicro chassis with 4 x Intel Xeon X7560 and 256GB RAM
• Only performs MDS services
• Metadata target x 6 with buddy mirroring
• Converged storage server x 40
• DELL R730 with 2 x Intel Xeon E5-2650v4 CPUs and 128GB of RAM
• Storage servers also perform processing for applications
• Uses Linux cgroups to avoid out-of-memory events (see the sketch below)
• cgroups are not used for CPU, and so far there have been no issues with CPU shortage
• Storage target x 160 with buddy mirroring
• 10Gb/s Ethernet
• Performance exceeded expectations with 10GB/s read and 5-6GB/s write after tuning
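For the converged setup, cgroups are the guard against out-of-memory events; a minimal cgroups-v1 sketch follows, where the group name, memory limit and PID variable are assumptions, not Spookfish's actual configuration.
# Create a memory cgroup for the processing jobs and cap it
$ sudo mkdir /sys/fs/cgroup/memory/processing
$ echo 96G | sudo tee /sys/fs/cgroup/memory/processing/memory.limit_in_bytes
# Move an application process into the group
$ echo $APP_PID | sudo tee /sys/fs/cgroup/memory/processing/cgroup.procs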
"The result [of switching to BeeGFS] is that we’re now able to process about 3 times faster with BeeGFS than with our old NFS server. We’re seeing speeds of up to 10GB/s read and 5-6GB/s write.” –Spookfish
HPC Knowledge Meeting ‘19
CSIRO
• The Commonwealth Scientific and Industrial Research Organisation (CSIRO) has adopted BeeGFS file system for their 2PB all NVMe storage in Australia, making it one of the largest NVMe storage systems in the world.
Overview:
• 4 x Metadata Server
• 32 x Storage Server
• 2 PiB usable capacity, DELL all-NVMe
• Look forward to ISC to see what the beast can do!
• Further details: http://www.pacificteck.com/?p=437
[Diagram: Metadata x 4, Storage x 32, 3.2TB NVMe x 24 per server]
HPC Knowledge Meeting ‘19
Follow BeeGFS:
beegfs.io
Sales engagement
HPC Knowledge Meeting ‘19
What do we need?
• Full Address:
• Name:
• Email:
• Phone #:
• Business (university, research institute, life science, HPC etc):
• Quantity Single Target (MDS) Servers:
• Quantity Multi Target (OSS) Servers:
• RAID Settings
• Server Type:
• Hardware Platform (e.g. Intel Xeon, AMD, ARM):
• Capacity Requirement:
• Performance Requirement:
• Quantity Clients (rough number):
• BeeOND up to 100, up to 500, > 500 nodes
• Support duration (3 years, 5 years):
• Expected support start date:
• System Nickname (to distinguish multiple systems):
• Interconnect (EDR, FDR, QDR, OPA, 40/10 GigE):
• Linux Distribution (e.g. Red Hat):
HPC Knowledge Meeting ‘19
BeeGFS terms for sizing
• The term “target” refers to a storage device exported through BeeGFS. Typically, a target is a RAID6 volume (for BeeGFS storage servers) or a RAID10 volume (for BeeGFS metadata servers) consisting of several disks, but it can optionally also be a single HDD or SSD.
• A “Single Target Server” exports exactly one target, either for storage or metadata.
• A “Multi Target Server” exports up to six targets for storage and optionally one target for metadata.
• An “Unlimited Target Server” exports an unlimited number of targets for storage and/or metadata.
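For sizing, the number of targets per server can be read directly from a running system; a sketch (flags may differ slightly between releases).
# List storage and metadata targets together with the node that exports them
$ beegfs-ctl --listtargets --nodetype=storage --longnodes
$ beegfs-ctl --listtargets --nodetype=meta --longnodes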
HPC Knowledge Meeting ‘19
Pricing structure
HPC Knowledge Meeting ‘19
• 1st Level Support (Partner)
• 1st level support is provided by the reseller or a qualified partner (e.g. sub-contractor / sub-reseller) of the reseller. 1st level support staff have the knowledge level of general system administrators and perform the following tasks:
• Definition of the problem, steps to reproduce, and expected behavior
• Description of the customer hardware setup
• Description of the customer software setup (e.g. operating system, software, firmware and driver versions)
• Gathering of other potentially relevant information such as log files
• Attempts to solve problems based on previously known similar cases (e.g. recommendation of software updates or configuration changes)
• 2nd Level Support (Partner)
• 2nd level support is provided by the Gold Reseller or a qualified partner (e.g. sub-contractor / sub-reseller) of the Gold Reseller. 2nd level support staff have knowledge of BeeGFS concepts and tools as well as of the general storage system stack (e.g. storage devices and network tools / testing) and perform the following tasks:
• Problem and root cause analysis (e.g. based on log file analysis)
• Hardware check (e.g. network, storage devices, cables) and software check, including attempts to reproduce issues on a different test system to verify whether problems are caused by a hardware malfunction at the customer site
• Issue discussion and potential solution or work-around discussion with the customer
• Definition of a minimal setup to reproduce problems before escalation to the higher support level
• 3rd Level Support (ThinkParQ)
• 3rd level support is provided by ThinkParQ. The BeeGFS support team has detailed knowledge of BeeGFS internals. Incoming support tickets are prioritized based on severity.
• Full problem and root cause analysis, optionally including remote login to the customer system via ssh
• Code inspection for detailed internal analysis
• Patch development with early update releases for supported customers
• Recommendation of performance tuning methods and HPC consulting
• Reaction time is next business day (German working hours)
HPC Knowledge Meeting ‘19
Installation / Training
• Installation and training can be done remotely.
• Remote ssh installation: 1,200 USD per day
• BeeGFS remote training (agenda & time): 10-hour session (free of charge for partners; 1,200 USD for end customers)
• Introduction: BeeGFS basic concepts, architecture and features
• How do I ... with BeeGFS: typical administrative tasks
• Sizing and tuning
• Designing and implementing storage solutions with BeeGFS
• Projects with BeeGFS
• Reference installation and best practices
• Sales & Presales Training: BeeGFS sales and pre-sales training mapped to your customer focus, with file system comparison and use cases, to get your sales force up to speed. Please let us know when you are ready to schedule it.
HPC Knowledge Meeting ‘19
Pricing Model
HPC Knowledge Meeting ‘19
BeeGFS Storage Engine under the Hood