
Lustre Towards ExaScaler

DDN/Intel Webinar

HPC Storage • September 2015

Content

• Exascale Storage Challenges (Intel)
• From HDDs to NVM (Intel)
• New Storage Architectures Beyond Lustre (Intel)
• DDN ExaScaler: Software and Hardware Combined
• Enhancing Lustre with NVM
• Options for Using NVM in Lustre Today
• Tools for Lustre at Scale

DDN & Intel

• Close collaboration with Intel HPDD
– Collaboration on Lustre features such as directory-level quota and novel security features
– DDN ExaScaler is a private branch within Intel IEEL
– DDN automated build & test environment
• Member of the Intel Storage Builder Program
• Working with Intel to incorporate novel storage devices
• Collaborating with Intel on new storage architectures towards Exascale

Brent Gorda, Intel

Legal Information

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.

Cost reduction scenarios described are intended as examples of how a given Intel- based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

• Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer, or learn more at http://www.intel.com/content/www/us/en/software/intel-solutions-for-lustre-software.html.
• Intel may make changes to specifications and product descriptions at any time, without notice.
• Any code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services, and any such use of Intel's internal code names is at the sole risk of the user.
• Intel, Intel Inside, the Intel logo and Xeon Phi are trademarks of Intel Corporation in the United States and other countries.
• Material in this presentation is intended as product positioning and not approved end user messaging.
• *Other names and brands may be claimed as the property of others.

© 2016 Intel Corporation


Intel® Scalable System Framework
A Holistic Design Solution for All HPC Needs

• Small clusters through supercomputers
• Compute- and data-centric computing
• Standards-based programmability
• On-premise and cloud-based

• Compute: Intel® Xeon® Processors, Intel® Xeon Phi™ Processors, Intel® Xeon Phi™ Coprocessors, Intel® Server Boards and Platforms
• Memory/Storage: Intel® Solutions for Lustre*, Intel® Optane™ Technology, 3D XPoint™ Technology, Intel® SSDs
• Fabric: Intel® Omni-Path Architecture, Intel® True Scale Fabric, Intel® Ethernet, Intel® Silicon Photonics
• Software: HPC System Software Stack, Intel® Software Tools, Intel® Cluster Ready Program, Intel Supported SDVis

Disruptive Change: 3D XPoint™ + Intel® Omni-Path Architecture

• Byte-granular storage
• Sub-µs storage latency
• µs network latency

Conventional storage software
• High overhead
– 10s of µs lost to communications software
– 100s of µs lost to the file system & block I/O stack
• Block/page granular & alignment sensitive
– False sharing limits scalability

Exascale storage
• Low overhead
– OS-bypass communications + storage
• Byte granular & alignment insensitive
– Ultra-low latency enables fine granularity
• Persistent memory
– All metadata & fine-grained application data
• Block storage
– NAND: high-performance bulk data
– Disk: high-capacity cold data
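To make the overhead argument concrete, here is a rough, illustrative calculation (not from the original deck) using the ballpark figures above: with a conventional stack, tens to hundreds of microseconds of software overhead dwarf a sub-microsecond device, so almost the entire latency budget is lost to software unless the stack is bypassed.

```python
# Illustrative latency-budget arithmetic using the ballpark numbers on this slide.
# All figures are order-of-magnitude assumptions, not measurements.

def software_share(device_latency_us: float, sw_overhead_us: float) -> float:
    """Fraction of total I/O latency spent in software."""
    return sw_overhead_us / (device_latency_us + sw_overhead_us)

# Conventional stack: ~10s of us in comms software + ~100s of us in FS/block layer.
conventional_overhead_us = 10 + 100

# NVM device with sub-microsecond media latency, vs. a spinning disk at several ms.
nvm_latency_us = 0.5
disk_latency_us = 5000.0

print(f"Disk: {software_share(disk_latency_us, conventional_overhead_us):.1%} of latency is software")
print(f"NVM : {software_share(nvm_latency_us, conventional_overhead_us):.1%} of latency is software")
# Disk: ~2% of latency is software  -> overhead is negligible.
# NVM : ~99.5% of latency is software -> motivates OS-bypass comms + storage.
```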


Global Namespaces

Containers
• Shared system namespace
– "Where's my stuff"
• Private namespaces
– "My stuff"
– Entire simulation datasets

Multiple Schemas
• POSIX
– Shared (system) & private (legacy datasets)
– No discontinuity for application developers
• Scientific: HDF5*, ADIOS*, SciDB*, …
• Big Data: HDFS*, Spark*, graph analytics, …


(Diagram) Compute nodes hold hot datasets; shared storage nodes hold the system namespace plus warm/staging datasets. The system POSIX namespace /projects contains /posix, /Particle and /Climate subtrees. An HDF5 dataset is a tree of groups containing datasets; ADIOS and POSIX datasets are trees of directories containing data files.
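As a purely illustrative sketch (not part of the original deck), the HDF5 layout in the figure, a tree of groups containing datasets, can be built with the h5py library; the file name, group names and dataset names below are made up for the example.

```python
import h5py
import numpy as np

# Hypothetical names mirroring the group/dataset tree in the figure:
# a /Particle group containing sub-groups, each holding array datasets.
with h5py.File("particle_simulation.h5", "w") as f:
    particle = f.create_group("Particle")
    for step in range(3):
        grp = particle.create_group(f"timestep_{step}")
        grp.create_dataset("positions", data=np.random.rand(1000, 3))
        grp.create_dataset("velocities", data=np.random.rand(1000, 3))

# Reading back: groups behave like nested dictionaries of datasets.
with h5py.File("particle_simulation.h5", "r") as f:
    print(list(f["Particle"]))                       # ['timestep_0', 'timestep_1', 'timestep_2']
    print(f["Particle/timestep_0/positions"].shape)  # (1000, 3)
```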

Lustre* is evolving in response to:
• A growing customer base
• Changing use cases
• Emerging hardware capabilities

DAOS is exploring new territory:
• What may lie beyond POSIX
• Using new hardware capabilities as storage
• Raising the level of abstraction

The future is both evolutionary & revolutionary.


Intel® Solutions for Lustre* Roadmap (Q2 '16 – Q2 '17)

Enterprise: HPC, commercial technical computing, and analytics
• Enterprise Edition 3.0 (Lustre 2.7.x): client metadata RPC scaling, OpenZFS performance, OpenZFS snapshots, SELinux on client + Kerberos, Intel Omni-Path (OPA) support, RHEL 7 server/client, SLES 12 SP1 client, OpenZFS 0.6.5.3
• Enterprise Edition 3.1 (Lustre 2.7.x): expanded OpenZFS support in IML, IML graphical UI enhancements, bulk I/O performance (16 MB I/O), OpenZFS metadata performance, subdirectory mounts, patchless server support

Foundation: community Lustre software coupled with Intel support
• Foundation Edition 2.8 (Lustre 2.8.x): metadata directory striping (DNE 2), client metadata RPC scaling, LFSCK performance tuning, SELinux on client + Kerberos updates
• Foundation Edition 2.9 (Lustre 2.9.x): UID/GID mapping, shared key crypto, lock ahead, subdirectory mounts
• Foundation Edition 2.10 (Lustre 2.10.x): OpenZFS snapshots, multi-rail LNET, project quotas

Cloud: HPC, commercial technical computing and analytics delivered on cloud services
• Cloud Edition 1.2.1 (Lustre 2.7.x): client support for EL6 & EL7, support for new regions
• Cloud Edition 1.3 (Lustre 2.8.x): RHEL 7 server support, Lustre release update
• Cloud Edition 1.4 (Lustre 2.9.x): updated Lustre and base OS builds, new instance support

Key: Development / Shipping / Planning

* Some names and brands may be claimed as the property of others. Product placement is not representative of final launch date within the specified quarter.

DDN | ExaScaler Software Stack

(Stack diagram) Layers and components:
• Management and Monitoring: Intel IML, DDN ExaScaler, DDN DirectMon, DDN ExaScaler Monitor
• Global Filesystem: DDN Lustre Edition, DDN IME, NFS/CIFS/S3 gateways, DDN clients, Intel Hadoop
• Local Filesystem: ldiskfs, OpenZFS, btrfs
• Storage Hardware: DDN block storage with SFX cache, Intel DSS, other hardware
• Data Management: ExaScaler Data Management Framework, fast data copy to Object (WOS), S3 cloud and tape

Components are supplied by DDN, DDN & Intel, Intel HPDD, or third parties.

Roadmap (Q2 2016 – Q3 2017)

ExaScaler 2.3 (Lustre 2.5.x, IEEL 2.4)
Hardware: SFA12K, SFA14K, SFA7700X, ES7K, ES14K
• SFA14K, ES14K appliance, SFA14K & ES14K rack configurations
• InfiniBand EDR support

ExaScaler 3.0 (Lustre 2.7.x, IEEL 3.0)
Hardware: SFA12K, SFA14K, SFA7700X, ES7K, ES14K
• RHEL 7 and SLES 12 SP1
• IME 1.0 integration
• L2RC and File Heat
• Enhanced SFX integration and fadvise() support
• Client metadata performance
• SELinux (client) & Kerberos support
• OpenTSDB scalable monitoring
• Scalable POSIX copy tool
• NFS/CIFS gateway tools
• OpenZFS support (not optimized)
• OpenZFS snapshots

ExaScaler 3.1 (Lustre 2.7.x, IEEL 3.0)
Hardware: SFA12K, SFA14K, SFA7700X, ES7K, ES14K
• Project quota
• Enhanced QoS policies
• S3 protocol export
• Intel Omni-Path (OPA)
• 14KXE
• Online upgrades

ExaScaler 3.2 (Lustre 2.7.x, IEEL 3.2)
Hardware: SFA12K, SFA14K, SFA14KX, SFA8700, SFA12KX, SFA7700X, ES7K, ES14K
• Enhanced management & monitoring
• Platform enhancements

ExaScaler 4.0 (Lustre xx, IEEL 4.0)
Hardware: SFA12K, SFA14K, SFA14KX, SFA8700, SFA12KX, SFA7700X, ES7K, ES14K
• Data on MDT
• MDT striped directories
• IME metadata caching
• New appliance configurations
• Progressive file layouts
• SELinux server
• UID/GID mapping
• Shared key encryption
• MLS design phase 1
• Multi-tenancy and Docker integration

Key: Shipping / Planning / Investigation / Scheduled

DDN | ExaScaler SSUs
Hardware & Software Combined

• XS – ES7K-M: up to 5 GB/s, up to 1,148 TB, 7.2K rpm SAS; internal OSS, internal MDS, internal MDT
• S – ES7K: up to 12 GB/s, up to 1,148 TB, 7.2K rpm SAS; internal OSS, external MDS, external MDT
• M – SFA12KX: up to 38 GB/s, up to 3,360 TB, 7.2K rpm SAS; external OSS, external MDS, internal MDT
• L – ES14K: up to 74 GB/s, up to 4,032 TB, 7.2K rpm SAS; internal OSS, internal MDS, internal MDT
• XL – ES14K: up to 96 GB/s, up to 1,500 TB, 10K rpm SAS; internal OSS, external MDS, external MDT
• XXL – ES14K: up to 320 GB/s, up to 1,700 TB, SAS SSD; internal OSS, external MDS, external MDT

DDN | ES14K: Hardware & Software Combined
Designed for Flash and NVM

Industry-leading performance in 4U:
• Up to 40 GB/s throughput
• Up to 6 million IOPS to cache
• Up to 3.5 million IOPS to storage
• 1 PB+ capacity (with 16 TB SSDs)
• 100 millisecond latency

Configuration options:
• 72 SAS SSDs or 48 NVMe SSDs, or HDDs only
• HDDs with SSD caching
• SSDs with HDD tier

Connectivity:
• FDR/EDR
• OmniPath
• 40/100 GbE

ES14K Architecture

(Diagram) SFA14KE (Haswell) and SFA14KEX (Broadwell) controllers, each with two CPUs (CPU0/CPU1) linked by QPI, embedded OSS 0 and OSS 1 instances running alongside SFAOS, 6 x SAS3 back-end connections per controller, and IB/OPA or 40/100 GbE host connectivity.

SSD Options

1. SFX – block-level read cache; instant commit, DSS, fadvise(); millions of read IOPS
2. L2RC – OSS-level read cache; heuristics with File Heat; millions of read IOPS
3. IME – I/O-level write and read cache; transparent + hints; tens of millions of read/write IOPS
4. All-SSD Lustre – Lustre HSM for data tiering to an HDD namespace; generic Lustre I/O; millions of read and write IOPS

Rack Performance: Lustre

(Charts: IOR file-per-process throughput in GB/s, write and read, and 4k random read IOPS)
• 350 GB/s read and write (IOR)
• 4 million IOPS
• Up to 4 PB flash capacity

SFX & ReACT – Accelerating Reads, Integrated with Lustre DSS

(Diagram) The OSS combines a DRAM cache, an SFX tier and an HDD tier, driven through the DSS / SFX API: large reads go to the HDD tier, while small reads and small rereads are served from the SFX tier once the cache is warm.

Lustre L2RC and File Heat

• OSS-based read caching
– Uses SSDs (or SFA SSD pools) on the OSS as a read cache
– Automatic prefetch management based on file heat
– File heat is a relative (tunable) attribute that reflects file access frequency: higher heat means higher access frequency, lower heat means lower access frequency
– Indexes are kept in memory (worst case: about 10 GB of memory for 1 TB of SSD)
– Efficient space management for the SSD cache space (4 KB – 1 MB extents)
– Full support for ladvise in Lustre
• Each OSS keeps a heap of object heat values that drives prefetch into the cache (illustrated conceptually below)
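The following is a purely illustrative sketch (not Lustre's actual implementation) of how a relative, tunable heat attribute could drive prefetch: accesses bump an object's heat, heat decays over time, and the hottest objects become candidates for the SSD cache. The decay factor and field names are assumptions.

```python
import heapq
import time

DECAY = 0.5       # fraction of heat retained per decay period (assumed)
PERIOD_S = 60.0   # how often heat decays (assumed)

class HeatTracker:
    def __init__(self):
        self.heat = {}   # object id -> (heat value, last update time)

    def record_access(self, obj_id, now=None):
        now = now or time.time()
        value, last = self.heat.get(obj_id, (0.0, now))
        # Apply exponential decay for the elapsed time, then count this access.
        periods = (now - last) / PERIOD_S
        value = value * (DECAY ** periods) + 1.0
        self.heat[obj_id] = (value, now)

    def hottest(self, n):
        """Return the n hottest objects: prefetch candidates for the SSD cache."""
        return heapq.nlargest(n, self.heat, key=lambda o: self.heat[o][0])

tracker = HeatTracker()
for obj in ["objA", "objA", "objB", "objA", "objC"]:
    tracker.record_access(obj)
print(tracker.hottest(2))   # ['objA', ...] -- objA is hottest after three accesses
```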

DDN | IME – Application I/O Workflow

(Compute nodes → NVM tier → Lustre)
• A lightweight IME client intercepts application I/O and places fragments into buffers plus parity
• The IME client sends fragments to the IME servers
• IME servers write the buffers to NVM and manage internal metadata
• IME servers write aligned, sequential I/O to the SFA backend
• The parallel file system operates at maximum efficiency

IME Write Dataflow

1. The application issues fragmented, misaligned I/O.
2. IME clients on the compute nodes send the fragments to IME servers.
3. Fragments on the IME servers are accessible to all clients via a DHT.
4. Fragments to be flushed from IME are assembled into PFS stripes (see the sketch below).
5. The PFS receives complete, aligned stripes.
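To illustrate the assembly step, here is a minimal, conceptual Python sketch (not IME code; the stripe size, fragment format and buffering policy are all assumptions) that collects misaligned write fragments and flushes only complete, stripe-aligned buffers to the backing file system.

```python
STRIPE_SIZE = 1 << 20   # assume 1 MiB PFS stripes (illustrative)

class StripeAssembler:
    """Collect misaligned write fragments; flush only complete, aligned stripes."""

    def __init__(self, flush_fn):
        self.flush_fn = flush_fn   # callback that writes one aligned stripe
        self.stripes = {}          # stripe index -> bytearray being filled
        self.filled = {}           # stripe index -> bytes filled so far

    def write_fragment(self, offset: int, data: bytes):
        # A fragment may span stripe boundaries; split it per stripe.
        # (Overlapping writes are not handled in this simplified sketch.)
        while data:
            idx, pos = divmod(offset, STRIPE_SIZE)
            chunk = data[:STRIPE_SIZE - pos]
            buf = self.stripes.setdefault(idx, bytearray(STRIPE_SIZE))
            buf[pos:pos + len(chunk)] = chunk
            self.filled[idx] = self.filled.get(idx, 0) + len(chunk)
            if self.filled[idx] == STRIPE_SIZE:   # stripe is complete: flush it
                self.flush_fn(idx * STRIPE_SIZE, bytes(buf))
                del self.stripes[idx], self.filled[idx]
            offset += len(chunk)
            data = data[len(chunk):]

# Usage: fragments arrive out of order and misaligned, but the backend only
# ever sees full, aligned 1 MiB stripes.
flushed = []
asm = StripeAssembler(lambda off, buf: flushed.append((off, len(buf))))
asm.write_fragment(512 * 1024, b"x" * (512 * 1024))   # second half of stripe 0
asm.write_fragment(0, b"y" * (512 * 1024))            # first half completes it
print(flushed)   # [(0, 1048576)]
```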

IME Erasure Coding

• Data protection against IME server or SSD failure is optional
– (the lost data is "just cache")
• Erasure coding is calculated at the client
– Great scaling with extremely high client counts
– Servers don't get clogged up
• Erasure coding does reduce usable client bandwidth and usable IME capacity (see the sketch below):
– 3+1: 56 Gb/s → 42 Gb/s
– 5+1: 56 Gb/s → 47 Gb/s
– 7+1: 56 Gb/s → 49 Gb/s
– 8+1: 56 Gb/s → 50 Gb/s

(Diagram) Compute nodes send data buffers plus a parity buffer through IME to the PFS (Lustre).
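The bandwidth numbers above follow directly from the parity overhead: with k data buffers per parity buffer, only k/(k+1) of the link carries user data. A quick check (illustrative script, not part of the deck):

```python
# Usable client bandwidth under k+1 erasure coding: k data buffers out of
# every k+1 sent carry user data, the rest is parity.
LINK_GBPS = 56   # e.g. one FDR InfiniBand link

for k in (3, 5, 7, 8):
    usable = LINK_GBPS * k / (k + 1)
    print(f"{k}+1: {LINK_GBPS} Gb/s -> {usable:.0f} Gb/s usable")

# 3+1: 56 Gb/s -> 42 Gb/s usable
# 5+1: 56 Gb/s -> 47 Gb/s usable
# 7+1: 56 Gb/s -> 49 Gb/s usable
# 8+1: 56 Gb/s -> 50 Gb/s usable
```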

Rack Performance: IME

(Charts: IOR file-per-process throughput in GB/s and 4k random IOPS, write and read)
• 500 GB/s read and write
• 50 million IOPS
• 768 TB flash capacity

Monitoring for Lustre at Scale

• Monitoring with OpenTSDB
• Collects tens of millions of metrics per second
– Lustre file system
– Storage backend
• Makes full use of Lustre jobstats
• Helps storage administrators understand what is happening in their system
– Top-N query tool
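As a rough illustration of the approach (not DDN's tool), OpenTSDB exposes an HTTP /api/put endpoint that accepts JSON data points, so per-job Lustre metrics such as jobstats counters can be pushed like this; the metric name, tags and host below are made up for the example.

```python
import json
import time
import urllib.request

# Hypothetical metric derived from Lustre jobstats; names and tags are examples only.
datapoint = {
    "metric": "lustre.jobstats.write_bytes",
    "timestamp": int(time.time()),
    "value": 123456789,
    "tags": {"fs": "scratch", "jobid": "12345", "oss": "oss01"},
}

# OpenTSDB accepts single data points or JSON arrays on /api/put (default port 4242).
req = urllib.request.Request(
    "http://opentsdb.example.com:4242/api/put",
    data=json.dumps(datapoint).encode(),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)   # returns HTTP 204 on success
```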

Monitoring for Lustre at Scale

• Example: top metadata processes (screenshot)