

White Paper
Intel® Enterprise Edition for Lustre* Software
High-Performance Computing

Executive Overview
The volume of data available to application developers and research scientists is increasing exponentially. Putting that data to work requires high-performance computing (HPC) using a parallel file system. In many cases, scientific applications are I/O-intensive: they require not only reading massive amounts of data, but also significant amounts of writing.

In this paper we explore how system architects, application developers, and researchers can tune the Lustre* file system to optimize read and write operations to obtain the maximum benefit from parallel I/O. In one of our tests, we achieved an 84-percent increase in performance simply by tuning Lustre metrics.

Deep I/O characterization of a parallel file system running seismic wave simulation: A case study to significantly increase performance and throughput

Enhancing Scalability and Performance of Parallel File Systems

Table of Contents
General Introduction
Spectral Finite Element Application: EFISPEC3D Code
EFISPEC3D Test Cases
Lustre* Introduction
Intel® Enterprise Edition for Lustre* Software
High-Performance Computing (HPC) Storage and Compute Nodes
Relevant Lustre* Metrics for Developers
Real-Time Overview Using Intel® Manager for Lustre* Software
Jobstats
Server-Side Metrics
Striping Layout and Lustre
Best Practices to Avoid Metadata Bottlenecks
Seismic Wave Simulation Example
Strong Scalability Tests
Weak Scalability Test
Measured Bandwidth and Application Apparent Bandwidth
A Tuning Exercise
Using Lustre* Metrics to Understand I/O Patterns
Tuning the Stripe Layout
Tuning the System Administrator Parameters
Custom Tuning Strategy
Conclusion
Bibliography
About the Authors

Familiarity with Lustre* file system metrics can aid in understanding how parallel I/O performance can be enhanced.

Authors
Gabriele Paciucci, Solution Architect, Enterprise & HPC Platform Group, Intel
Florent De Martin, Seismologist, BRGM, French Geological Survey
Philippe Thierry, Principal Engineer, Energy Application Engineering Team, Intel


General Introduction
The main objective of this paper is to demonstrate the capability of parallel file systems and to illustrate some best-known methods using seismic wave simulation software.

Several scientific domains perform intensive I/O (input/output) on very large clusters of servers. In many domains, such as weather forecasting, astrophysics, and advanced physics experiments, the I/O is limited to reading the data acquired from the field. In our case, we chose to illustrate techniques that involve a substantial amount of writing in addition to the initial data reading.

Our work is based on seismic wave propagation, which is used in several domains at different scales, from seismology (earth imaging and seismic hazard evaluation) to oil and gas exploration and civil engineering.

One of the main differences in this domain is that wave propagation is used not only to simulate the physical phenomena (the so-called forward problem) [Virieux 2009a] but also to provide images of the subsurface using inverse problem theory [Virieux 2009b].

One of the main I/O-intensive applications used in the oil and gas industry is called “reverse time migration” (RTM), part of a more general class of time-reversal algorithms.

This technique consists of a first simulation of the source wave field, which must be stored in one way or another, and a backward propagation of the wave fields recorded at the receiver stations. During this back propagation, an extra step must be performed: the imaging condition (also called the correlation of the forward and backward fields). This step requires reading the forward fields that were previously stored and is among the most challenging steps for the oil and gas industry. Several techniques exist to work around this mandatory cross-correlation challenge, including optimal checkpointing [Symes 2007] and random boundary techniques [Clapp 2009].

Each of these approaches has advantages and disadvantages depending on the physical parameter in use (the wave equation description), the domain of propagation (time, frequency), the kind of solvers (explicit, implicit), and the hardware balance between memory per node, local distributed storage, or parallel file systems.

We will not discuss all these different cases in this paper, since several references are already available [Anderson 2012, Imbert 2011].

To focus on the management of I/O, we will concentrate on the storage of the snapshots of the wave field during forward simulation. To illustrate a broader application domain, we will use a spectral finite element code currently used in applications ranging from small-scale civil engineering and seismic hazard projects to oil and gas imaging and seismology.

Compared to other numerical schemes such as finite differences, the spectral finite element method is slightly more complicated to implement and optimize, but it offers the advantage of easily handling the topography of the region under investigation (Figure 1).

Whatever approach is chosen, the technique remains the same: every given number of time steps of the propagation (10 time steps in our case), we write the 3D wave fields to disk. Apart from the imaging technique described above, one of the other reasons to store the incident wave fields is to conduct acquisition design. This is widely used in the oil and gas industry to verify the correct illumination of a given reservoir. In civil engineering, it may help to optimally distribute the limited number of acquisition stations.

Figure 1. The left side of this figure shows an example of the meshes in a Switzerland alpine region with topography reaching several thousand meters. On the right is an example of Peak Ground Acceleration (PGA) computed at the surface of the test case used in this paper.



In this paper, we consider medium cases in terms of I/O to illustrate our purpose, since we will not show results on more than 5,000 cores and the total size written on disk will be about 17 GB per time step.

There is a strong correlation between the compute power of the high-performance computing (HPC) infrastructure and the ability of the underlying data storage solution to process data fast enough to limit the impact of I/O on CPU time.

As processor power increases, the traditional goal of a system architect is to design systems with an appropriate balance of data storage, network bandwidth, and data compute power. It is also relevant for a researcher and a scientist, who may be developing or simply running applications on these infrastructures, to understand how to take advantage of a parallel file system like Intel® Enterprise Edition for Lustre* (Intel® EE for Lustre*) software.

The objective of this paper is to analyze the capacity of Intel EE for Lustre software to sustain a strong scalability experiment for an open source seismic application using the latest hardware technology available, including high-end DataDirect Networks (DDN) storage arrays and servers equipped with next-generation Intel® Xeon® processor E5-2697 v4 cores. We will also guide developers on how to monitor the I/O pattern of applications and how to tune Lustre and the developers’ code in order to obtain maximum benefit from parallel I/O.

In the following sections, we will describe first the main features of the seismic wave simulation code and the kind of outputs we will be considering, as well as our test cases.

Then we will present a quick introduction to the Lustre* file system as well as provide an example of hardware configuration for parallel file systems.

The next section will discuss the file system features we can use to monitor the system and to fine-tune it at both the user level and the system administrator level.

The technical development of our tests follows a standard optimization procedure. We started with a strong scalability analysis and, based on the results, analyzed the different visible features. We then show a weak scalability analysis using a uniform test case to avoid, as much as possible, any impact of the test case complexity on the performance results. We also demonstrate the value of fine-tuning the file system from a user-space point of view.

Spectral Finite Element Application: EFISPEC3D Code
The numerical method that we use to compute the 3D seismic wave field is the spectral-element method (SEM) [Maday and Patera 1989; Komatitsch and Vilotte 1998; Karniadakis and Sherwin 2013]. The SEM code itself is EFISPEC3D [Eléments FInis SPECtraux 3D; open source code under GPLv3/CeCILLv2 available at efispec.free.fr; De Martin 2011], which is widely used for computing seismic wave propagation in complex geological media [De Martin et al. 2013; Matsushima et al. 2014; Chaljub et al. 2015; Maufroy et al. 2015]. EFISPEC3D is programmed in FORTRAN95/MPI to solve the matrix-vector system of ordinary differential equations MÜ(t) + KU(t) = Fext(t), where M is the diagonal mass matrix, Ü is the acceleration vector, K is the stiffness matrix, U is the displacement vector, and Fext is the vector of the external forces.

The time marching is done by an explicit Newmark scheme (Newmark 1959) using blocking or non-blocking MPI communications. The biggest advantage of non-blocking communications is to allow processes to continue computations while communication with another process is still pending. As we will see later in the paper, the impact of blocking versus non-blocking communications varies with the number of MPI ranks, and non-blocking communications can become faster when the core count increases significantly. Within EFISPEC3D, the MPI communications happen between the processes in contact—that is, the processes sharing common Gauss-Lobatto-Legendre (GLL) points between spectral elements (Figure 2).

When communications are blocking, each process waits for the others to compute their vector (K.U) before exchanging data and moving on to the assembly phase. Once the assembly is done between each process in contact, the time marching that requires the assembly to be completed can be updated for the next time step. When communications are non-blocking, each process first computes the vector (K.U) for its spectral elements that are in contact with other processes and sends/receives data needed for the assembly while, in the meantime, the remaining spectral elements (that is, those not in contact with other processes) compute their vector (K.U). Once they finish this computation, spectral elements in contact with other processes have already exchanged their data so that the time marching can be updated for the next time step.

EFISPEC3D is also linked with several open source libraries:

• EXODUS II (Sandia National Laboratories; Schoof and Yarberry 1994) is used to read finite element data models made of millions of finite elements; EXODUS II is also linked with the NetCDF (www.unidata.ucar.edu/software/netcdf) and HDF5 (The HDF Group. Hierarchical Data Format, version 5, 1997-2016) libraries.

• METIS [Karypis and Kumar 1999] is used for partitioning finite element meshes.

• Lib_VTK_IO (sites.google.com/site/stefanozaghi/lib_vtk_io) is used to write data conforming to the Visualization Toolkit (VTK) XML binary standard at several billions of computational points (also referred to as GLL points in spectral element nomenclature).

For testing the performance of the Intel EE for Lustre file system, EFISPEC3D has been configured to output snapshots of the three components’ (x/y/z) velocity field over the entire domain of computation at every GLL point and for every 10 time steps.


The workflow, from the mesh generation to its visualization to the decomposition of GLL spectral elements into VTK 8-node hexahedron elements, is shown in Figure 2. During the initialization phase, the mesh generated by CUBIT software, composed of “Ne” non-overlapping finite elements, is first read using the EXODUS II data model and is then partitioned into “N” sub-meshes using METIS. Once the partitioning has been completed, each MPI process (in our test, one MPI process = one core) computes “Ne/N” elements. For all our tests, each spectral element holds a 4th-order polynomial approximation as the expansion basis (that is, each element holds 5x5x5 = 125 GLL points). When an output is needed, each MPI process writes out, in parallel, its own results at the VTK elements level in an XML file using the Lib_VTK_IO library.

The VTK XML binary file format is used by EFISPEC3D because it facilitates data streaming and parallel I/O. It also allows for data structure of different data types (for example, Int8, UInt8, Int16, UInt16, Int32, UInt32, Int64, UInt64, Float32, Float64, and so on) and it allows for flexible data structure within which data arrays are organized at will. To that extent, in order to stress both the writing and the reading capabilities of Intel EE for Lustre software, we opted for a complex data structure for which a binary scratch file first needs to be written to store temporary data arrays that will be then reread and appended to the end of the final VTK XML file. This file contains an XML header that is written while the scratch file is being created so that the header describes the final structure of the appended binary data arrays.

Figure 2. This figure shows the workflow from mesh generation to wave propagation visualization. In the case of a spectral element with 125 GLL points, this VTK mesh contains 4x4x4 = 64 times the number of spectral elements contained in a sub-mesh.

The process MPI-0 reads, using the EXODUS II library, the mesh generated by CUBIT software. The mesh is partitioned into n sub-meshes with the METIS library. The GLL points used for solving the wave propagation equations are also the vertices of the VTK 8-node hexahedron elements used for visualization. Each MPI process writes the results of its sub-mesh at the VTK elements level.


EFISPEC3D Test Cases
The reference test case of this paper is the one designed for the Euroseistest Verification and Validation Project (E2VP). E2VP (Figure 3) is an international collaborative project organized jointly by: the Aristotle University of Thessaloniki, Greece; the ITSAK (Institute of Engineering Seismology and Earthquake Engineering of Thessaloniki), Greece; the Cashima research project (supported by CEA – the French Alternative Energies and Atomic Energy Commission and by ILL – the Laue-Langevin Institute, Grenoble); and ISTerre at Grenoble Alpes University, France. The E2VP target site is the Mygdonian basin near Thessaloniki, Greece, which is the international research and test site of many international seismological and earthquake-engineering projects. To foster the use of linear 3D numerical simulations in practical prediction, E2VP aimed at (a) evaluating the accuracy of the most-advanced numerical methods when applied to realistic 3D models, and (b) providing an objective, quantitative comparison between recorded earthquake ground motions and their numerical predictions.

For scaling purposes, the reference test case composed of 4.7 million hexahedron elements has been down-scaled to 148,877 elements or up-scaled to 76 million elements.

From that original project we built three test cases that save wave fields of a given size every 10 time steps.

The strong scaling test exhibits a decrease in the work and I/O to be done per core as the number of cores grows.

The weak scaling tests keep a constant amount of work and I/O per core regardless of the number of cores used for the computation. With the two different test cases used for weak scaling (TC2 and TC3 below), we expect to analyze I/O scalability with respect to file size.

• Test case 1 (TC1). “E2VP2” used for strong scaling analysis from 1 to 128 nodes
  – Total number of hexahedra: 4.7 million
  – I/O size per time step and per core (on 36-core nodes): from 500 MB (1 node) to 4 MB (128 nodes)

• Test case 2 (TC2). “small Cubic” used for weak scaling from 1 to 128 nodes
  – Total number of hexahedra: from 149,000 to 19 million
  – I/O size per time step and per core (on 36-core nodes): 4 MB

• Test case 3 (TC3). “large Cubic” used for weak scaling from 1 to 16 nodes
  – Total number of hexahedra: from 149,000 to 19 million
  – I/O size per time step and per core (on 36-core nodes): 500 MB

Figure 3. The reference test case of this paper is the one designed for the Euroseistest Verification and Validation Project. (Courtesy of Maufroy et al., 2015)

Figure panels: location of the Euroseistest site within the Mygdonian basin in northeastern Greece; thicknesses of the three layers of the basin; total sediment thickness in the basin.


Lustre* Introduction
The Lustre file system [Barton 2014] is an open source (GPLv2) high-performance parallel file system, widely used in the HPC community to provide a global POSIX (Portable Operating System Interface) namespace and horizontally scalable I/O to the computing resources of an entire data center. It was originally designed for scalability and is capable of handling extremely large volumes and capacities of data using tens of thousands of disks with high availability and strong coherence of both data and metadata. This makes it the premier choice for the most demanding HPC applications spanning the world’s largest HPC clusters.

Lustre is a Linux* file system implemented entirely in the kernel. Its architecture is founded upon distributed object-based storage, which delegates block storage management to its backend servers and helps reduce significant scaling and performance issues associated with the consistent management of distributed block storage metadata (Figure 4).

Each file is split into data and metadata objects. The Lustre object storage device (OSD), an abstraction that allows the use of different backend file systems including ext4 and ZFS, stores objects. A single OSD instance corresponds to a single backend storage volume and is termed a storage target. The storage target depends on the underlying file system and volume for resilience to storage device failure. DDN provides a high-end, high-density, robust storage solution optimized and proven with Intel EE for Lustre software to host backend storage targets.

Storage targets are exported as either metadata targets (MDTs) for file and directory operations, or as object storage targets (OSTs) to store file data. Servers configured specifically for their respective metadata or data I/O workloads usually export these targets. RAID-10 high-IOPS storage hardware and high core counts are used for metadata servers (MDSs), while high-capacity, high-bandwidth RAID-6 storage hardware and lower core counts are used for object storage servers (OSSs).

Lustre clusters consist of at least two MDS nodes configured for active-passive or active-active failover and multiple OSSs configured for active-active failover.

Lustre clients and servers communicate with each other using a layered communications stack. The Lustre Networking (LNET) layer abstracts the underlying networks such as InfiniBand* or TCP/IP. LNET provides both message passing and remote memory access (RMA) for efficient zero-copy bulk data movement. The Lustre RPC (Remote Procedure Call) layer (called Portal RPC, or PtlRPC) is built on top of the LNET layer to provide robust client–server communications in the face of message loss and server failures.

The Lustre Distributed Lock Manager (LDLM) is a service provided by storage targets in addition to object storage services. LDLM locks are used to serialize conflicting file system operations on objects managed by that target, and are the mechanism used to ensure distributed cache coherency, even while tens of thousands of clients are modifying the same file concurrently. The combination of coherent locking, together with recovery protocols exercised on server startup, ensures that caches remain consistent through server restart or failover. This boosts server throughput for file system-modifying operations by allowing the use of write-back rather than write-through caches, since uncommitted operations are recovered from the clients in case of server failure.

Applications perform parallel I/O from the Lustre client. The Lustre client is responsible for combining the data and index objects exported by the data and metadata storage targets that make up a cluster of Lustre servers into a single coherent POSIX-compliant file system.

Since this aggregation is done at the client, and since extent metadata is handled locally by storage targets, non-contending data access at the POSIX level can be mapped to non-contending data access at the object level, leading to near-linear scaling of I/O performance.

Clients aggregate I/O in their local caches to ensure bulk data is streamed to or from the servers efficiently. On read, the client can detect strided read patterns and use this to guide read-ahead. Similarly on write, dirty pages are aggregated whenever possible to ensure efficient network and disk transfers. In both cases, many aggregated bulk data RPCs may be kept “on the wire” to hide latency and ensure full bandwidth utilization. The Lustre client also aligns these aggregated bulk RPCs at regular offsets and sizes to help servers maximize consistency between allocating writes and subsequent reads and reduce disk seeks, and also aligns the disk I/O operations with the underlying RAID storage chunks.

Figure 4. This figure illustrates the basic write I/O operation in Lustre*. When an application is writing a file into the Lustre file system, the Lustre client running on the compute node transparently executes several operations to complete the request: (1) the metadata client sends the open() request to the metadata server (MDS) using the Logical Metadata Volume framework; (2) the MDS answers the request with a layout that the object storage clients (OSCs) can use to write the file in objects (Object A, Object B, Object C); (3) the OSCs, organized in a Logical Object Volume, write the objects in parallel into the object storage targets hosted on object storage servers according to the stripe layout. No metadata interaction occurs during the streaming.



Intel® Enterprise Edition for Lustre* Software
Intel EE for Lustre software is a supported distribution for Lustre specifically designed to allow the oil and gas industry (which needs large-scale, high-bandwidth storage) to tap into the power and scalability of Lustre, but with the simplified installation, configuration, and monitoring features of Intel® Manager for Lustre* software, a management solution purpose-built for the Lustre file system. Intel EE for Lustre software includes proven support from the Lustre experts at Intel, including worldwide 24x7 technical support.

High-Performance Computing (HPC) Storage and Compute Nodes
The storage array used for this experiment was provided by DDN. DDN is a leader in high-performance storage solutions designed to meet the most pressing storage challenges of the seismic processing industry for capacity, speed, performance, and scalability. DDN is also a provider of massively scalable storage systems for unstructured data and big data environments.

At 6 million IOPS and 60 GB/sec from a single 4U appliance, the SFA14K* is the fastest storage solution in the industry today, and is able to drive an unmatched number of solid-state drives (SSDs) and spinning drives in the least amount of space, making it also the most dense storage solution on the market.

Based on its SFA* (Storage Fusion Architecture*) block appliances (Figure 7), DDN offers file storage. The DDN EXAScaler* appliance provides a scalable shared namespace for use in high-performance compute environments. EXAScaler can be used in environments ranging from a few dozen to thousands of client nodes. The main differentiator for typical scale-out NAS (network-attached storage) solutions is a system design that has no bottlenecks. Where NAS solutions rely on standard protocols such as NFS (Network File System) and SMB (Server Message Block), EXAScaler leverages Intel EE for Lustre software to not just scale capacity but also performance. Where NAS protocols establish a point-to-point connection between one client and one server to access data, EXAScaler allows every client to communicate with multiple storage servers at the same time and therefore increases the level of parallelism. It allows each client to access data faster individually—and all clients aggregated will see better overall performance. Furthermore, EXAScaler ensures coherence and presents all clients with a consistent view of the data at any point in time.

EXAScaler is developed by DDN in close collaboration with Intel. It is based on Intel EE for Lustre software but includes several enhancements in features and performance. Additionally it is a full software stack (and not just the Lustre parallel file system) that is tweaked for the DDN storage system and fully supported by DDN’s worldwide support organization.

Figure 6. This screenshot shows another part of the Intel® Manager for Lustre* software dashboard. Each metadata operation is represented by a different color. On the y-axis is the stacked metadata operations per second and on the x-axis is the time. The data shown here represents the metadata operations executed by the metadata server (MDS) during a strong scalability test using between 1 to 8 nodes. (Source: Intel data center)

Figure 5. This screenshot shows part of the Intel® Manager for Lustre* software dashboard. On the left, the write heat map displays (on the y-axis) the write utilization in bytes/sec of each object storage target (OST) and the time on the x-axis: dark red represents higher utilization. On the right, the read/write bandwidth chart displays read operations (dark blue not displayed) and write operations (light blue in the chart); aggregate bandwidth in bytes/sec is shown on the y-axis and time is shown on the x-axis. (Source: Intel data center)

Figure 7 components: Storage Fusion Architecture* foundation and storage operating system; multimedia drive support (SATA, SAS, SSD); DataDirect Networks hardware platforms (12KX, 12KXE); DataDirect Networks* file storage (GRIDScaler*, EXAScaler*).

Figure 7. The design concepts of DataDirect Networks Storage Fusion Architecture*.


The environment that has been used for this white paper consists of a single DDN SFA12KX-40* system with a total of 320 hard-disk drives (HDDs) and 10 SSDs. The 320 HDDs are configured as 32 OSTs. Each OST consists of 10 HDDs in an 8+2 RAID6 configuration. The 10 SSDs were configured in RAID10 as a single MDT.

The single SFA12KX-40 was connected to six Lustre servers using direct and redundant FDR InfiniBand connections. The six Lustre servers were organized in pairs in order to provide high-availability protection from hardware failures. Four servers were configured as OSSs and two as MDSs. We used an FDR InfiniBand network to export the file system to the client with the LNET layer. Each Lustre server was configured using a dual-socket Intel Xeon processor E5-2697 v4 (2.30 GHz) and 128 GB of DDR4 2.4-GHz RAM.

During the test we also used an SSD-based Lustre file system provided by the Intel Data Center group [Hebenstreit 2014]. The Intel Data Center team provided access to 128 compute nodes configured with a dual-socket Intel Xeon processor E5-2697 v4 (2.30 GHz) and 128 GB of DDR4 2.4-GHz RAM connected to the Lustre file system using an FDR InfiniBand network.

The application is built using the Intel® Fortran Compiler (IFORT), linked with the Intel® MPI Library, and scheduled on the compute nodes using IBM Spectrum LSF* (Load Sharing Facility).

Relevant Lustre* Metrics for Developers
Lustre is a software storage solution with a rich set of metrics relating to every aspect of the file system and the LNET layer.

The Lustre file system and the LNET layer expose metrics information into the Linux /proc file system. The number of metric files in a medium-size Lustre file system can exceed thousands.

In its distribution for Lustre, Intel developed and included a web-based interface that collects real-time Lustre metrics on the server side using agents and presents them in a web-based dashboard (Figure 5, Figure 6, and Figure 8) or makes them available through a REST API.

Many other metrics are available to help developers optimize their code or troubleshoot I/O delays.

Real-Time Overview Using Intel® Manager for Lustre* Software
Intel Manager for Lustre software is a web-based interface designed to lower the complexity of Lustre and present high-level server-side metrics in real time. The tool is useful for discovering immediate problems and strange behavior of storage servers and applications.

Figure 8 shows an example of analysis of an application that reads a single, very large file on the Lustre file system using different stripe layouts.

The manager server web interface included with the Intel Manager for Lustre software is built on the REST API, which is accessed using HTTP. This API is available for integration with third-party applications. The types of operations possible using the API include creating a file system, checking the system for alert conditions, and downloading performance metrics. All functionality provided in the manager server web interface is based on this API, so anything that can be done using the web interface can also potentially be done from third-party applications.

The API is based on the REST style, and uses JSON serialization. Some of the resources exposed in the API correspond to functionality within the Lustre file system, while others refer to functionality specific to the Intel Manager for Lustre software.

It's possible to query the REST API and get a list of targets:
curl -k "https://<URL>/api/target/?limit=0" | python -m json.tool

Similarly, you can get information on the file systems as follows:
curl -k "https://<URL>/api/filesystem/" | python -m json.tool

In that case the URL corresponds to the IP address of the Intel Manager for Lustre server.

To get a sum total, we can apply the group_by and reduce_fn clauses (both are needed). For example:
curl -k "https://<URL>/api/target/metric/?kind=OST&metrics=stats_readbytes&format=json&begin=2015-03-18T14:53:46.224Z&end=2015-06-28T06:25:00.000Z&num_points=30&group_by=filesystem&reduce_fn=sum" | python -m json.tool

This returns the sum totals for each point in time across all OSTs.

The Intel Manager for Lustre agent collects the following metrics:
• Object storage targets
  – Read/write bandwidth and IOPS
  – Size
  – Jobstats
• Metadata storage targets
  – Metadata operations
  – Size
• CPU and RAM utilization for each server

Figure 8. This is a screenshot of part of the Intel® Manager for Lustre* software dashboard. The data represents an application reading with several threads of a single 1 TB file. The read/write heat map (left side) provides real-time information about how the I/O is spread across the object storage targets. The read/write bandwidth chart (right) provides the performance achieved by the application when, for example, different stripe layouts (stripe counts) are used to read the single file. (Source: Intel data center)



Data is collected every 10 seconds and transferred to the Intel Manager for Lustre server. The server saves this information into a PostgreSQL* database.

Another benefit of using Intel Manager for Lustre is the ability to get this important information without having direct access to the Lustre servers. Users of the cluster normally are not allowed to have direct access to the storage servers. With Intel Manager for Lustre it is possible to provide read-only access to users who need to view the metrics.

Jobstats
The Lustre Jobstats metric collects file system operation statistics for the jobs running on Lustre clients, and exposes them using the /proc file system on the server. Job schedulers known to work with jobstats include SLURM* (Simple Linux Utility for Resource Management*), SGE* (Sun Grid Engine*), IBM Spectrum LSF, Loadleveler*, PBS* (Portable Batch System*), Maui Cluster Scheduler*, and MOAB Cluster Suite*.

The current state of jobstats can be verified by checking lctl get_param jobid_var on a client. Jobstats are disabled by default.

The jobstats code extracts the job identifier from an environment variable set by the scheduler when the job is started. To enable jobstats, set jobid_var to name the environment variable used by the scheduler, as in the following example:
lctl conf_param <fsname>.sys.jobid_var=SLURM_JOB_ID
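The compute nodes used in this paper are scheduled with IBM Spectrum LSF, so the same setting would name LSF's job ID variable instead. A minimal sketch, assuming LSB_JOBID is the environment variable that LSF exports to each job:

lctl conf_param <fsname>.sys.jobid_var=LSB_JOBID
lctl get_param jobid_var        # verify the value seen by a client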

Intel Manager for Lustre is able to collect the jobstats statistics and correlate them with the outstanding I/O of each OST. Intel Manager for Lustre displays this information in terms of bandwidth and IOPS (Figure 9).

Lustre metrics can be accessed in several ways:
• Directly accessing the specific metric from the /proc/{fs,sys}/{lnet,lustre} directories
• Using the lctl get_param command
• Using the llstat command

For example, the global stats metric for a single client can be accessed using the following commands:
• cat /proc/fs/Lustre/llite/Lustre01-ffff8800b41cb800/stats
• lctl get_param -n llite.*.stats
• llstat /proc/fs/Lustre/llite/Lustre01-ffff8800b41cb800/stats

We will use the lctl command and the following convention in this paper:
obdtype|fsname.obdname.proc_file_name

It is always possible to reset a counter for a metric by injecting a 0 into the file:
lctl set_param -n llite.*.stats=0

Some of the metrics are not enabled by default and must be activated:
• lctl set_param -n llite.*.extents_stats=1 to enable
• lctl set_param -n llite.*.extents_stats=0 to disable

In the following paragraphs, we explain the most important metric files for a developer in order to troubleshoot the behavior of an application. We suggest a top-down technique as summarized in Figure 10.
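As a minimal sketch of this top-down approach on a single client (assuming root access, and using only the counters described in the rest of this section), a capture session could look like the following:

lctl set_param -n llite.*.stats=0            # reset the aggregate client counters
lctl set_param llite.*.extents_stats=1       # enable the block-size histogram
# ... run the application ...
lctl get_param -n llite.*.stats              # calls issued by the application
lctl get_param -n llite.*.extents_stats      # histogram of read/write sizes
lctl get_param -n osc.*OST*.rpc_stats        # pages per RPC sent to each OST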

Figure 9. This screenshot shows how the Intel® Manager for Lustre* software Job Stats window can be integrated with the IBM Spectrum LSF*. The Job Stats window reports, for a specific period of time and for a specific OST, the utilization in terms of read bytes (throughput) and read IOPS (left) and write bytes and write IOPS (right). The Job ID reported is the Job ID extracted from the LSF. (Source: Intel data center)

The diagram starts from the application and groups the metric files by layer: client-side metrics closely related to the application (llite.*.stats, llite.*.extents_stats); the Logical Object Volume, the framework that organizes all object storage clients and streams data to the object storage targets (osc.*OST*.rpc_stats); the Logical Metadata Volume, the framework that organizes all metadata clients and communicates with the metadata targets (mdc.*MDT*.md_stats); the Lustre Networking layer (peers); and the server-side metrics on the object storage servers (obdfilter.*OST*.stats, obdfilter.*OST*.brw_stats, obdfilter.*OST*.exports.*.stats, obdfilter.*OST*.job_stats) and on the metadata servers (mdt.*MDT*.md_stats, mdt.*MDT*.job_stats).

Figure 10. Starting from the application (gray), this diagram summarizes the relevant Lustre* metrics that should be considered when troubleshooting the end-to-end behavior of a software application.


Description of llite.*.extents_stats
It is always possible to understand the I/O pattern of an application using a legacy command such as strace:

strace -T -e trace=open,read,write,close -p <PID>

The output (Table 1) provides an idea of the block size requested and written (1,048,576 bytes) and the time for the execution of the specific call (last column).

Enabling extents_stats gives us a better understanding of the I/O pattern requested by the application to the Lustre client. Unfortunately, to enable this metric it is necessary to be a privileged user (root).

lctl set_param llite.*.extents_stats=1

The following output (Table 2) provides an immediate understanding of how the application is reading and writing data into the file system.

An application can obtain the best performance from Lustre when the block size is aligned with the amount of data carried by the RPCs and the block size available at the storage level. By default, Lustre can transfer data with a block size of 1 MB and the backend is generally configured to achieve the best performance with the same segment size. It is also possible to increase the RPC block size to 4 MB. In the latest version of the DDN EXAScaler distribution for Lustre, the system administrator can increase the RPC block size to 16 MB and can also configure the SFA storage to use this very large segment size.

Developers can take advantage of this very large block size by increasing the MPI-IO block size using MPI-IO hints:

• MPI_Info_set(finfo, "cb_block_size", "4194304");
• MPI_Info_set(finfo, "cb_buffer_size", "134217728");

A per-process version of this metric file is available:

llite.*.extents_stats_per_process
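To read the per-process histogram, the same lctl pattern shown above applies; a one-line sketch:

lctl get_param -n llite.*.extents_stats_per_process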

Description of llite.*.stats
This file collects the aggregate stats from and to the Lustre clients. Entries are displayed only if the corresponding operations are performed. In Table 3, after each call, the data shows the number of samples, the metric, the minimum block size, the maximum block size, and the number of bytes collected since the last reset of the counter.

The read_bytes and write_bytes are the I/O requested by the application to the Linux Virtual Filesystem Switch (VFS). The osc_write and osc_read (not displayed) are the I/O requested by the Lustre client to the Lustre server through the network.

Table 1. Strace output: for each call, the last two columns provide size and time of execution.

read(0,  "H6\276\f\10\275qMc\266?\261$\335\276$K\"..., 1048576) = 1048576 <0.115464>
write(1, "H6\276\f\10\275qMc\266?\261$\335\276$K\"..., 1048576) = 1048576 <0.001608>
read(0,  "<=\"\326G\5\361\330~\351v\ "..., 1048576) = 1048576 <0.124830>
write(1, "<=\"\326G\5\361\330~\351v\ "..., 1048576) = 1048576 <0.001812>

Table 2. This metric file represents a histogram of the blocks of data read and written by the application since the last reset of the counter. Each row represents a range of block sizes in bytes. For each read and write column, the number of samplings, the percentage, and the cumulative percentage is displayed. In this example, the application has written 44 blocks of data in a range between 1 MB and 2 MB.

# lctl get_param -n llite.*.extents_stats
snapshot_time: 1453981032.125203 (secs.usecs)
                           read         |        write
extents              calls   %   cum%   |  calls   %   cum%
0K - 4K      :           0   0      0   |      0   0      0
4K - 8K      :           0   0      0   |      0   0      0
8K - 16K     :           0   0      0   |      0   0      0
16K - 32K    :           0   0      0   |      0   0      0
32K - 64K    :           0   0      0   |      0   0      0
64K - 128K   :           0   0      0   |      0   0      0
128K - 256K  :           0   0      0   |      0   0      0
256K - 512K  :           0   0      0   |      0   0      0
512K - 1024K :           0   0      0   |      0   0      0
1M - 2M      :           0   0      0   |     44 100    100


As Table 3 shows, the same amount of data is written by the application to both the Linux VFS and transferred to the Lustre server. In the output we don’t see any osc_read (no network activity) because the application during the read (read_bytes) took advantage of the data already cached into the Linux VFS.

This metric file also collects other calls like mmap, fsync, and truncate.

It is also possible to filter the statistics for a specific PID, PPID, or GID (an example follows the list):
• llite.*.stats_track_gid
• llite.*.stats_track_pid
• llite.*.stats_track_ppid
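A minimal sketch of isolating a single process, assuming the commands are run as root on the client and that <pid> is the process ID of the application under study:

lctl set_param llite.*.stats_track_pid=<pid>   # collect stats only for this PID
lctl set_param -n llite.*.stats=0              # reset the counters
# ... let the application run ...
lctl get_param -n llite.*.stats                # output now reflects only <pid>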

Description of osc.*OST*.rpc_stats
For each OSS, Lustre maintains statistics about the communication between the client and the server:

lctl get_param -n osc.*OST0001*.rpc_stats

The output of the file is complex but the most relevant part is the following histogram (Table 4).

Lustre moves data from memory to storage (and when it is available, Lustre can perform this operation using the RDMA protocol). Linux manages memory allocation in 4K pages, so in the histogram in Table 4 the number of pages should be multiplied by 4K to find the size of the block of data (256 pages = 1 MB) that is transferred from the client to the server. The histogram also shows the number of samples, the percentage, and the cumulative percentage for each class.

Understanding the number of pages utilized for each RPC by the Lustre client for an application is critical because this gives the developer information about how to optimize the code and how to tune the file system.

For example, suppose an application is writing and reading a large file but using a very small block size (4K). From the llite.*.stats file (Table 5), we can see from read_bytes and write_bytes that the block size used from and to the application is 4K, but the block size transferred on the network is larger (1 MB). This requires a smaller number of RPCs to move the same amount of data.

Table 3. This metric file displays the most important calls executed from any application doing I/O to the Lustre* client running on this compute node. The first column shows the list of the calls executed by applications (only calls with at least 1 sample are displayed). In the following columns: number of samples, metric unit, minimum block size, maximum block size, and total since the last reset of the counter.

# lctl get_param -n llite.*.stats
snapshot_time         1453981730.512857 secs.usecs
read_bytes            231 samples [bytes] 0 1048576 241172480
write_bytes           343 samples [bytes] 1048576 1048576 359661568
osc_write             343 samples [bytes] 1048576 1048576 359661568
open                  2 samples [regs]
close                 1 samples [regs]
seek                  1 samples [regs]
readdir               2 samples [regs]
getattr               3 samples [regs]
create                1 samples [regs]
alloc_inode           1 samples [regs]
getxattr              343 samples [regs]
inode_permission      8 samples [regs]

Table 4. This metric file represents a histogram of the pages of memory transferred by the Lustre* OSC to the OST per each Remote Procedure Call (RPC) since the last reset of the counter. Each row represents a number of Linux memory pages transferred in one RPC (1 page is 4K). For each read and write column, the number of samplings, the percentage, and the cumulative percentage are displayed. In this example the client sent 391 RPCs with 256 pages and received 10 RPCs with 1 page and 391 with 256 pages.

                         read                 |           write
pages per rpc      rpcs   %   cum %           |   rpcs   %   cum %
1:                   10   2       2           |      0   0       0
2:                    0   0       2           |      0   0       0
4:                    0   0       2           |      0   0       0
8:                    0   0       2           |      0   0       0
16:                   0   0       2           |      0   0       0
32:                   0   0       2           |      0   0       0
64:                   0   0       2           |      0   0       0
128:                  0   0       2           |      0   0       0
256:                391  97     100           |    391 100     100


In Table 4, the output of osc.*OST0001*.rpc_stats confirms that almost all the requests are transferred by the Lustre client using a larger block size and, thanks to the dirty page cache, a smaller number of requests. This cache helps to pack an optimal amount of data into each I/O RPC.

It is also possible to control the number of issued RPCs and optimize the size and/or number of pages for each RPC in progress at any time to help optimize the client I/O RPC stream.

RPC stream tunables include the following (an example of inspecting and adjusting them from a client is shown after this list):
• osc.*.max_dirty_mb. Controls how many MBs of dirty data can be written and queued up in the OSC. POSIX file writes that are cached contribute to this count. When the limit is reached, additional writes stall until previously-cached writes are written to the server.
• osc.*.cur_dirty_bytes. A read-only value that returns the current number of bytes written and cached on this OSC.
• osc.osc_instance.max_pages_per_rpc. The maximum number of pages that will undergo I/O in a single RPC to the OST. The minimum setting is a single page and the maximum setting is 4096 (4096 is available at the moment only on DDN’s Lustre branch; the normal maximum is 1024).
• osc.osc_instance.max_rpcs_in_flight. The maximum number of concurrent RPCs in flight from an OSC to its OST. If the OSC tries to initiate an RPC but finds that it already has this number of RPCs outstanding, it will wait to issue further RPCs until some complete. The minimum setting is 1 and the maximum setting is 256.
• llite.*.max_cache_mb. The maximum amount of inactive data cached by the client.
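As a sketch of inspecting and adjusting these tunables from a client (the parameter names are those listed above; the values are illustrative only, not tuning recommendations, and setting them normally requires root):

lctl get_param osc.*.max_pages_per_rpc        # 256 pages = 1 MB RPCs
lctl get_param osc.*.max_rpcs_in_flight
lctl set_param osc.*.max_pages_per_rpc=1024   # allow 4 MB RPCs where the servers support it
lctl set_param osc.*.max_rpcs_in_flight=16    # allow more concurrent RPCs per OST
lctl set_param osc.*.max_dirty_mb=64          # allow more dirty data to be cached per OSC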

Descriptions of peers
The Lustre file system uses LNET for every communication. LNET is an abstraction layer that enables the file system to access every possible underlying hardware network technology and use the best-in-class protocol for each of them.

Each client has a limited number of slots that can be used to connect to the servers. The following metric files can provide the status of the slots:

lctl get_param -n peers

The resulting file (Table 6) shows all the peers known to this node and provides information on the queue state. Here, “max” is the maximum number of concurrent sends from this peer and “tx” is the number of peer credits currently available for this peer.

A possible negative value in the “min” column means that the number of slots on the LNET layer was not sufficient and the queue was overloaded. This is an indication to increase the number of peer credits and credits. Increasing the credits value has some drawbacks, including increased memory requirements and possible congestion in networks with a very large number of peers.

On the other hand, increasing the number of peer credits and credits makes it possible to increase the number of outstanding RPCs and increase the performance of the file system when applications need to create a large number of (small) files.
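These values are usually raised through the LNET module configuration (for example, a modprobe options file) applied on both clients and servers. A sketch only, assuming the InfiniBand fabric used here is driven by the ko2iblnd LNET driver; the values are illustrative and must be consistent across the fabric:

options ko2iblnd peer_credits=128 credits=1024

The new values take effect when the LNET modules are reloaded.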

Table 5. This metric file represents the llite.*.stats output for an application reading and writing almost 4 GB of data using a block size of 4K. The read_bytes and write_bytes rows report exactly the size and the block size requested by the application. The Lustre* client is able to pack the data in larger Remote Procedure Calls (RPCs) and transfer the same amount of data using fewer calls.

snapshot_time         1454067781.979713 secs.usecs
read_bytes            100000 samples [bytes] 4096 4096 409600000
write_bytes           100000 samples [bytes] 4096 4096 409600000
osc_read              401 samples [bytes] 4096 1048576 409600000
osc_write             391 samples [bytes] 655360 1048576 409600000
open                  3 samples [regs]
close                 3 samples [regs]
seek                  2 samples [regs]
truncate              1 samples [regs]
getxattr              100000 samples [regs]
inode_permission      7 samples [regs]

Table 6. This metric file lists all the peers connected to this client and provides information about the queue status. The most relevant columns are as follows: “max” is the maximum number of concurrent sends from this peer; “tx” is the number of peer credits currently available for this peer; a possible negative value in the “min” column means that the number of slots on the Lustre* Network (LNET) was not sufficient and the queue was overloaded.

nid                refs  state  last  max  rtr  min  tx  min  queue
10.10.130.1@tcp1      1     NA    -1    8    8    8   8    5      0
10.10.130.3@tcp1      1     NA    -1    8    8    8   8    2      0
10.10.130.2@tcp1      1     NA    -1    8    8    8   8    5      0
10.10.130.4@tcp1      1     NA    -1    8    8    8   8    4      0


Server-Side Metrics

It is also interesting to follow, end to end, the behavior of an application on the Lustre server side, although users of the Lustre file system typically are not allowed to log in to the Lustre servers.

This is a list of the most important metrics:

• obdfilter.*OST*.stats. Generic read and write performance of the specific OST, including block size
• obdfilter.*OST*.brw_stats. Histogram data showing statistics for the number of I/O requests sent to the disk, their size, and whether or not they are contiguous on the disk
• obdfilter.*OST*.export.*.stats. Statistics per client connected to the specific OST
• mdt.*MDT*.md_stats. Metadata call operations per MDT

In a multitenant environment it is difficult to follow a specific application using these metric files.

The jobstats metrics can correlate applications running across a cluster doing I/O on a specific target. Metadata operations statistics are collected on MDTs (Table 7). These statistics can be accessed for all file systems and all jobs on the MDT using the lctl get_param mdt.*.job_stats command.

In Table 7, the job_id is the unique number assigned by the scheduler to the specific application across the cluster. Data operations statistics are collected on OSTs (Table 8). Data operations statistics can be accessed using the lctl get_param obdfilter.*.job_stats command. The read and write format includes minimum and maximum block size.

A plugin [Jette and Grondona 2003] is available for the SLURM scheduler to associate these I/O statistics with a job.

Striping Layout and Lustre

A key feature of the Lustre file system is its ability to distribute the segments of a single file across multiple OSTs using a technique called file striping (Figure 11). A file is said to be striped when its linear sequence of bytes is separated into small chunks, or stripes, so that read and write operations can access multiple OSTs concurrently.

Table 7. The output of the jobstats metric for the MDT is in JSON format and displays the job_id of the application performing metadata operations on the cluster. Each row after the job_id row shows the specific metadata call performed with the number of samples.

job_stats:
- job_id:          289923
  snapshot_time:   1454075498
  open:            { samples: 1, unit: reqs }
  close:           { samples: 1, unit: reqs }
  mknod:           { samples: 0, unit: reqs }
  link:            { samples: 0, unit: reqs }
  unlink:          { samples: 0, unit: reqs }
  mkdir:           { samples: 0, unit: reqs }
  rmdir:           { samples: 0, unit: reqs }
  rename:          { samples: 0, unit: reqs }
  getattr:         { samples: 0, unit: reqs }
  setattr:         { samples: 1, unit: reqs }
  getxattr:        { samples: 0, unit: reqs }
  setxattr:        { samples: 0, unit: reqs }
  statfs:          { samples: 0, unit: reqs }
  sync:            { samples: 0, unit: reqs }
  samedir_rename:  { samples: 0, unit: reqs }
  crossdir_rename: { samples: 0, unit: reqs }

Table 8. The output of the jobstats metric for one OST is in JSON format and displays the job_id of the application performing I/O. Each row provides information about the specific I/O call performed including the number of samples. For read and write calls, the output shows the minimum and maximum block size and the total size of the data transfer (in bytes).

job_stats:
- job_id:        289923
  snapshot_time: 1454075524
  read:    { samples: 0,   unit: bytes, min: 0,      max: 0,       sum: 0 }
  write:   { samples: 390, unit: bytes, min: 655360, max: 1048576, sum: 408551424 }
  setattr: { samples: 0,   unit: reqs }
  punch:   { samples: 1,   unit: reqs }
  sync:    { samples: 0,   unit: reqs }


Figure 11. This diagram shows how files with different stripe layouts are striped across OSTs in a Lustre* file system. File A (dark blue) is a small file saved as a single object in OST01. File B (light blue) is a large file saved in 7 chunks across all the OSTs available. File C (blue) is a large file saved as single object with a bigger stripe size.


The following values affect how a file will be striped:

• Stripe count. The number of OSTs over which a file is distributed
• Stripe size. The number of bytes written on one OST before cycling to the next
• Stripe offset. The preferred allocation algorithm

For each file or directory, users can use the lfs setstripe command to set the stripe layout. If the stripe layout is not set in advance, the file inherits the stripe layout of the parent directory. The default stripe layout for a Lustre file system is as follows:

• Stripe count = 1. Each file is written to one OST only.
• Stripe size = 1 MB. Each file is split into 1-MB chunks. The stripe size is aligned with the default RPC size.
• Stripe offset = -1. By default, Lustre's allocation algorithm decides on which OST to write the data.

Lustre's file striping method is the same principle used in striping for a redundant array of independent disks (RAID), in which data is striped across a certain number of objects.

Lustre can stripe files across up to 2,000 objects, and each object can be up to 16 TB in size. This leads to a maximum file size of about 31.25 PB (2,000 stripes × 16 TB), as shown in Table 9.

File striping increases I/O performance since writing or reading from multiple OSTs simultaneously increases the available I/O bandwidth.

Many applications require high-bandwidth access to a single file—more than can be provided by a single OSS. In addition, file striping provides space for very large files. This is possible since files that are too large to be written to a single OST can be striped across multiple OSTs.

However, file striping is not without drawbacks. In some cases, using it can increase both the overhead and risk associated with I/O operations. For example, the increased overhead of common file system operations, such as stat, can decrease I/O performance when file striping is used unnecessarily. When file striping is used correctly, any increase in overhead is completely hidden by file system parallelism.

It is possible to set Lustre stripe properties in C using the llapi library included in the Lustre client.
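The sketch below shows what this can look like. It is a minimal example assuming the llapi_file_create() helper from <lustre/lustreapi.h> (the header name and exact prototype may differ slightly between Lustre releases, and the program is typically linked with -llustreapi); the path is illustrative:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <lustre/lustreapi.h>

int main(void)
{
    /* Create a file striped across 4 OSTs with a 1-MB stripe size.
     * Arguments: path, stripe size, stripe offset (-1 = let Lustre choose),
     * stripe count, stripe pattern (0 = default). */
    int rc = llapi_file_create("/lustre/scratch/snapshot.bin",
                               1048576ULL, -1, 4, 0);
    if (rc != 0) {
        fprintf(stderr, "llapi_file_create failed: %d\n", rc);
        return 1;
    }

    /* The layout is fixed at creation time; the file can now be opened
     * and written with regular POSIX calls. */
    int fd = open("/lustre/scratch/snapshot.bin", O_WRONLY);
    if (fd >= 0)
        close(fd);
    return 0;
}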

MPI-IO libraries (including MVAPICH, OpenMPI, and Intel MPI) support setting the Lustre stripe count and stripe size from MPI code using MPI-IO hints.

Fortran:
call mpi_info_set(myinfo, "striping_factor", stripe_count, mpierr)
call mpi_info_set(myinfo, "striping_unit", stripe_size, mpierr)

C:
MPI_Info_set(myinfo, "striping_factor", stripe_count);
MPI_Info_set(myinfo, "striping_unit", stripe_size);
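A slightly more complete C sketch shows where these hints fit in an MPI-IO program. The file name and values are illustrative, and the striping hints typically take effect only when the file is created:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "32");     /* stripe count */
    MPI_Info_set(info, "striping_unit", "1048576");  /* stripe size in bytes */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "/lustre/scratch/dump.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* ... independent or collective writes to fh ... */

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}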

Best Practices to Avoid Metadata Bottlenecks

It is important to organize the metadata infrastructure of the application being developed with an awareness of the limits of Lustre. Table 9 shows how Lustre's limits change based on the backend file system selected.

The most common choice for Lustre storage is LDISKFS as the backend file system. In this case, file systems with several hundred million files are common, and it is also common to hit the limit on the maximum number of files in a single directory. For example, with 4,096 cores across 128 nodes, dumping velocity, acceleration, and ground displacement for each time step into a single directory reaches the limit in only about 800 steps.

With LDISKFS, we also suggest not hosting more than 1 million files in a directory, because updating LDISKFS's htree index becomes slow and limits the metadata performance of the directory.

Organize the metadata structure of the application to avoid large directories as much as possible, especially if POSIX I/O is used.

Table 9. The limits of Lustre* are different depending on the backend file system used. LDISKFS is the most common option in the market. ZFS is emerging as a more modern and enterprise-ready alternative.

Limit                                            LDISKFS backend      ZFS backend            Notes
Maximum stripe count                             2,000                2,000                  Limit is 160 for LDISKFS if the "ea_inode" feature is not enabled on the metadata target
Maximum stripe size                              < 4 GB               < 4 GB
Minimum stripe size                              64 KB                64 KB
Maximum object size                              16 TB                256 TB
Maximum file size                                31.25 PB             8 EB
Maximum files or subdirectories per directory
  48-byte filenames                              10 million           2^48
  128-byte filenames                             5 million            2^48
Maximum number of files in the file system       4 billion per MDT    256 trillion per MDT
Maximum length of a filename                     255 bytes            255 bytes
Maximum length of a pathname                     4,096 bytes          4,096 bytes            Limited by the Linux* Virtual File System


Large directories also affect locking performance, because locks are set at the directory level. The Lustre MDS (in both the LDLM and LDISKFS layers) does have a parallel locking mechanism for create operations within the same directory, and this lock parallelism grows as the size of the directory grows, but very large directories still concentrate lock traffic on a single resource.

The design of the LDLM was based on the VAX/VMS Distributed Lock Manager, in which named abstract resources can be locked in a variety of modes that control how the corresponding physical resources may be accessed. For example, protected read locks are mutually compatible and are used to permit read-only access to a shared resource, while protected write locks are used by clients to cache dirty data. Exclusive locks are incompatible with all lock modes other than null and are therefore used by servers that wish to modify a resource and invalidate the caches of all other lock holders.

If a file to be opened is not subject to writes, it should be opened as read-only. Furthermore, if the access time on the file does not need to be updated, the open flags should be O_RDONLY | O_NOATIME. If the file is opened by all processes in an MPI group, the master process (rank 0) should open it with O_RDONLY and all non-master processes (rank > 0) should open it with O_RDONLY | O_NOATIME.

If a shared file is to be read and the data to be shared among the process group is less than approximately 100 MB, it is preferable to have one reader and then broadcast the contents rather than have everyone read the file.
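A minimal MPI/C sketch combining the two recommendations above (read-only open with O_NOATIME on non-master ranks, and a single reader followed by a broadcast); the file name and the data size are illustrative:

#define _GNU_SOURCE          /* for O_NOATIME on Linux */
#include <mpi.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Read-only open; only rank 0 is allowed to update the access time. */
    int flags = (rank == 0) ? O_RDONLY : (O_RDONLY | O_NOATIME);
    int fd = open("/lustre/scratch/model.bin", flags);

    /* Shared data well below ~100 MB: one reader, then broadcast. */
    size_t nbytes = 64 * 1024 * 1024;       /* illustrative size */
    char *buf = malloc(nbytes);
    if (rank == 0 && fd >= 0)
        read(fd, buf, nbytes);
    MPI_Bcast(buf, (int)nbytes, MPI_BYTE, 0, MPI_COMM_WORLD);

    if (fd >= 0)
        close(fd);
    free(buf);
    MPI_Finalize();
    return 0;
}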

Consider using available I/O middleware libraries. For large-scale applications that are going to share large amounts of data, one way to improve performance is to use a middleware library, such as ADIOS, HDF5, or MPI-IO.

Seismic Wave Simulation Example

In this section, we describe strong scalability tests, weak scalability tests, and measured bandwidth versus application apparent bandwidth.

Strong Scalability Tests

We used the first test case (TC1), described in the section EFISPEC3D Test Cases and based on the E2VP2 experiment. In a strong scalability analysis, a problem of fixed size is analyzed; to reduce the elapsed time, we spread the problem across more cores and more nodes. The goal is to keep the scalability as close as possible to linear so that the elapsed time continues to decrease.

First of all, we evaluated the scalability of the code without I/O but using blocking and non-blocking MPI communications (Figure 12).

Introducing I/O on the Parallel File System

With I/O enabled, each core dumps a file every 10 time steps. The scalability is limited by the bandwidth available from the storage and by how efficiently the application uses the capabilities of the installed HPC storage.

As shown in Figure 13, we calculated the scalability of EFISPEC3D in terms of elapsed time when I/O is enabled and compared it to the scalability of the same application without I/O. We noticed three different behaviors during the experiment:

1. Super-linearity at 8 and 16 compute nodes
2. No scalability beyond 32 nodes
3. Regression beyond 64 nodes

Two effects can explain these behaviors: a limited number of I/O threads when the number of compute nodes is small, and a smaller file size as the number of cores increases.

Figure 12. Strong scalability test (TC1) without I/O using blocking and non-blocking MPI communications. Comparison between the elapsed time of a theoretical linear scalability (Linear) and runs using MPI blocking and MPI non-blocking communications.

Figure 13. Strong scalability test (TC1) with I/O and without I/O. Comparison between the elapsed time for a theoretical linear scalability (Linear), without I/O (MPI non-blocking), and dumping on a Lustre* file system (with I/O) using MPI non-blocking communications.


The bandwidth available is limited when the number of clients is low (Figure 14). This explains the super-linearity: only once 8 or more clients were connected did we start to achieve the full bandwidth from the storage.

In the strong scalability test the global experiment has a fixed size; when we increased the number of cores (by increasing the number of compute nodes) used by the application, the single file dumped every 10 time steps decreased in size (Figure 15).

The I/O pattern became more random and the Lustre file system started to have difficulties sustaining the bandwidth, especially because the backend storage uses NL-SAS drives capable of only a low number of IOPS. This explains the regression in Figure 13.

Using an SSD-based Lustre file system, we can better sustain a random-like workload when the number of cores is high and the file size becomes small (Figure 16).

During this test, we demonstrated that a well-designed application that can scale almost linearly across thousands of cores is heavily affected in its scalability when I/O is enabled. Even if a parallel file system like Lustre can sustain thousands of cores doing I/O, the number of threads and the size of the file dumped can impact the scalability curve.

Figure 15. File size as the number of compute nodes increases. During the strong scalability test (TC1), the file dumped by each core becomes smaller and the workload generated by the application becomes more random.

Figure 14. Bandwidth measured using a standard benchmark (IOR). The HPC storage starts to achieve its full bandwidth when the number of clients used for the benchmark is more than 8. When the number of clients is limited, Lustre* can deliver only a limited bandwidth.

Figure 16. Bandwidth measured from the application, comparing a traditional hard disk drive (HDD)-based and a solid-state drive (SSD)-based Lustre* file system during the strong scalability test (TC1). Each SSD can sustain a higher number of IOPS, so even when the file size per core decreases, the bandwidth delivered stays almost steady when the number of clients is more than 32. In contrast, the HDD-based file system cannot sustain the random I/O generated by EFISPEC3D.


Weak Scalability Test

To verify the effect of different file sizes while scaling, we changed the model (TC2, explained in EFISPEC3D Test Cases) and ran a weak scalability test in which the file size per thread remained constant while the global experiment size increased.

In Figure 17 we compared the scalability of HDD-based and SSD-based Lustre file systems during the TC2 weak scalability test. To calculate the scalability, we used the apparent bandwidth (as defined in “Measured Bandwidth and Application Apparent Bandwidth”). When the file size is constant, the two effects (super-linearity and regression) are less evident.

In this experiment we demonstrated that when the number of clients is higher, Lustre can sustain an increased number of threads with the same bandwidth without regression. But after we consumed all the bandwidth available from the storage through the Lustre file system, we wondered if it was possible for an application to get more.

Measured Bandwidth and Application Apparent Bandwidth

We used the concept of apparent bandwidth to decide how to tune Lustre and to compare results. The apparent bandwidth is defined as the size in MB of the total amount of data to be written on disk, divided by the difference in elapsed time of the application running with and without I/O:

Apparent bandwidth = (sum of the file sizes dumped) / ([elapsed time with I/O] - [elapsed time without I/O])
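As a worked example with hypothetical numbers: if an experiment dumps 500 GB in total, and the run takes 700 seconds with I/O versus 200 seconds without, the apparent bandwidth is 500 GB / 500 s = 1 GB/sec, regardless of any higher peaks the storage may reach during the individual dumps.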

Figure 18 compares the apparent bandwidth with the measured bandwidth using the benchmark IOR during the assessment of the storage.

Even though we used the same number of threads per client for both applications, and both applications were writing a single file per process, the I/O pattern of IOR [Shan 2007] is completely different from the I/O pattern of EFISPEC3D. IOR used very large files and did not use the CPU for any calculation, whereas EFISPEC3D used the CPU and RAM for its algorithm and dumped a file every few time steps as fast as possible.

Figure 19 shows the real-time performance collected by Intel Manager for Lustre with a 10-second sampling interval during one of the experiments. Even though the peak bandwidth measured by Intel Manager for Lustre exceeded 2 GB/sec, the calculated apparent bandwidth was no more than 1 GB/sec.

Another consideration is that the calculation time is also impacted by the presence of the Lustre client, which is using CPU time and memory.

In every case, using the apparent bandwidth enables us to understand the success of our tuning techniques; it is more difficult to use the apparent bandwidth to design and size the storage requirements.

Figure 17. Weak scalability test (TC2) comparing a hard disk drive (HDD)-based and solid-state drive (SSD)-based Lustre* file system. The scalability is calculated using the apparent bandwidth. The file system can scale until the backend storage can sustain the performance. This chart clearly shows the advantage of SSDs compared to standard hard drives.

Figure 18. Comparison between application apparent bandwidth and measured bandwidth. The apparent bandwidth was calculated during the weak scalability test (TC2) on the HDD-based Lustre* file system. The measured bandwidth was the result of several IOR benchmark tests with the optimal file size and number of threads for Lustre. Even though the HPC storage was able to sustain 11.5 GB/sec, the application wasn't able to utilize all the bandwidth.


A Tuning Exercise

Lustre exposes a large number of tuning parameters to optimize the file system for different workloads. The default parameters should meet the requirements of a "streaming" workload, where many users run a large number of large I/O threads using a large block size.

In this section, we describe a tuning exercise using the weak scalability experiment described in EFISPEC3D Test Cases as TC3, with a large dump size of 500 MB, between 1 and 16 compute nodes, and an HDD-based Lustre file system provided by DDN.

We can divide the Lustre tuning parameters into two major categories:

• User parameters. Can be modified by users to optimize the I/O of the specific user or for a specific experiment. Users use the lfs command or MPI-IO hints to change these parameters.

• System administrator parameters. Can be changed only by the system administrator of the cluster, using the lctl command. Changes normally impact all the users and applications.

In practice, users can change only the stripe layout parameters without any special permission, using the lfs setstripe command or directly through the MPI-IO hints interface.

Using Lustre* Metrics to Understand I/O Patterns

We used the techniques described in the previous paragraphs to understand the I/O pattern of EFISPEC3D, collecting the llite.*.stats and osc.*.rpc_stats statistics from one client:

$ lctl get_param llite.*.stats

The output of llite.*.stats (Table 10) illustrates the following:

• The application is doing a very limited number of reads and most of them are satisfied from the client cache (compare read_bytes with osc_read). Most of the activity is on writes.

• The write block size (write_bytes) is between 1 and 4 MB; the llite.*.extents or osc.*.rpc_stats metrics would provide a better understanding.

Figure 19. Intel® Manager for Lustre* software can track the real-time activity (read/write bandwidth) of the Lustre servers during the EFISPEC3D experiment.

Table 10. llite.*.stats on a Lustre* client with default values during the weak scalability test (TC3). This output shows that the block size used by EFISPEC3D for writing (write_bytes row) is 4 MB while the RPC (Remote Procedure Call) block size (osc_write row) is only 1 MB.

snapshot_time     1448365136.88615 secs.usecs
read_bytes        381412 samples [bytes] 0 4194304 98989269015
write_bytes       2764264 samples [bytes] 1 4194304 5592552407475
osc_read          19850 samples [bytes] 6 1048576 15913089868
osc_write         5454999 samples [bytes] 6 1048576 5592632028478
ioctl             46639 samples [regs]
open              49073 samples [regs]
close             48955 samples [regs]
mmap              1144 samples [regs]
seek              52615 samples [regs]
readdir           44 samples [regs]
truncate          46358 samples [regs]
getattr           46011 samples [regs]
unlink            4400 samples [regs]
mkdir             7 samples [regs]
statfs            1 samples [regs]
alloc_inode       41705 samples [regs]
getxattr          3484992 samples [regs]
removexattr       46358 samples [regs]
inode_permission  440588 samples [regs]


• Based on the number of samples it is likely that the application is writing with a block size of 4 MB. Unfortunately, the default RPC size is 1 MB, which increases the number of I/O operations done by Lustre on the network versus what the application is requesting. In fact, the number of osc_write samples is double the number of write_bytes samples.

• The number of metadata operations (open, close, statfs, unlink, getattr) is limited with the notable exception of getxattr.

$ lctl get_param osc.*.rpc_stats

The output of osc.*.rpc_stats in Table 11 confirms the limited number of reads and shows that 97 percent of the write RPCs are 1 MB in size. A privileged account is necessary to enable the llite.*.extents metric file, but it is likely that the application is using a block size of 4 MB.

It is also possible to obtain a chart of the metrics available from llite.*.stats using llstat and plot-llstat (Figure 20). The stats were collected with a sample rate of 1 second during the TC3 experiment on a single compute node using 36 cores of a dual-socket Intel Xeon processor E5-2697 v4 (2.30 GHz).

It is also important to understand whether the application is using all the OSTs on the Lustre servers. Intel Manager for Lustre provides this information with a heat map (Figure 21). This chart qualitatively represents the utilization, in terms of IOPS and bandwidth, of each OST. Developers can see whether the application is using all the OSTs and optimize the stripe layout configuration accordingly.

Table 11. osc.*.rpc_stats on a Lustre* client with default values during the weak scalability test (TC3). This histogram shows that 97 percent of the write activity uses 256 Linux* pages. In Linux a memory page is 4K; therefore, the Remote Procedure Calls (RPCs) size is 1 MB.

                     read              |            write
pages per rpc   rpcs    %   cum %      |    rpcs     %   cum %
1:               126  100     100      |    1994     1      1
2:                 0    0     100      |     255     0      1
4:                 0    0     100      |      40     0      1
8:                 0    0     100      |      37     0      1
16:                0    0     100      |      96     0      1
32:                0    0     100      |     239     0      1
64:                0    0     100      |     539     0      1
128:               0    0     100      |    1387     0      2
256:               0    0     100      |  171995    97    100

Figure 20. Lustre* performance on a single client during the weak scalability test (TC3). The gnuplot and Lustre tools (llstat and plot-llstat) were used to plot the write_bytes and osc_write values from llite.*.stats every second. The x-axis is time in seconds and the y-axis is bandwidth in MB/second. The application dumps a 500-MB file every 10 time steps.

Figure 21. This is a heat map generated in Intel® Manager for Lustre* software during the weak scalability test (TC3). The x-axis is time and each block on the y-axis represents an OST. The different shades of blue and red represent the bandwidth delivered: high bandwidth (bold red) to no activity (faded blue). This chart provides a qualitative overview of how the application uses the storage on the server side. In this case the application is using all the OSTs available only for a few 10-time-step intervals.


Tuning the Stripe Layout

We tested EFISPEC3D using different stripe layouts to understand whether these parameters can improve the performance. We used the apparent bandwidth to compare the results.

First, we changed the stripe size from the default 1 MB to 4 MB and 16 MB; the results are shown in Figure 22. We did not notice any improvement in performance.

We used the lfs setstripe command to change the stripe layout with the following parameters (it is also possible to use MPI-IO hints to set these parameters for the application):

• Stripe count = 1; Stripe size = 1 MB (default)
• Stripe count = 1; Stripe size = 4 MB
• Stripe count = 1; Stripe size = 16 MB

We did not see any improvement in the application's write performance by changing the stripe size, because the I/O is influenced more by the number of pages that Lustre can carry per RPC over the network. Therefore, we did not change this parameter further in this experiment.

Changing the stripe count for each file from 1 to 32 increased the performance (Figure 23).

Next, we used the lfs setstripe command to change the stripe layout with the following parameters:

• Stripe count = 1; Stripe size = 1 MB (default)
• Stripe count = 4; Stripe size = 1 MB
• Stripe count = -1 (ALL 32 OSTs); Stripe size = 1 MB

In this case, the increased stripe count enables us to spread each file across all the OSTs, significantly boosting the performance when the number of clients is higher. For a lower number of clients, the bottleneck is the bandwidth available from the network adapters.

Tuning the System Administrator Parameters

In the previous section, we observed that EFISPEC3D is able to write with a block size of 4 MB, but Lustre was limiting the data transfer to 1 MB due to the maximum number of pages per RPC. We decided to increase this value in order to take advantage of EFISPEC3D's larger block size.

It is not always possible to perform this operation, because a privileged (root) account is needed and this tunable impacts all the applications running on the clients:

# lctl set_param osc.*.max_pages_per_rpc=1024


Figure 22. Apparent bandwidth during weak scalability test (TC3) scaling stripe size. The x-axis is the number of clients and the y-axis is the bandwidth in MB/second. We ran the application with the default stripe size of 1 MB and an increased stripe size of 4 MB and 16 MB without changing the number of memory pages per RPC (Remote Procedure Call).

Figure 23. Apparent bandwidth during weak scalability test (TC3) scaling stripe count. The x-axis is the number of clients and the y-axis is the bandwidth in MB/second. We ran the application with the default stripe count of 1 and then increased stripe count to 4 and to ALL the object storage targets (OSTs).


When we re-ran the application and collected llite.*.stats (Table 12), we noticed a decrease in the number of osc_write operations relative to the write_bytes operations requested, so we expected better utilization of the network compared to Table 10.

The osc.*.rpc_stats output also confirmed the use of a higher number of memory pages per RPC (Table 13).

Custom Tuning Strategy

In the weak scalability test, the file size dumped by each thread is constant, but the number of threads doing I/O to the storage grows as the number of clients involved in the experiment increases. The tuning strategy should change based on these factors.

The stripe layout can be changed easily in user space for each run, but changing the system administrator's tunables is more difficult and impacts all the applications running on the cluster.

Table 13. osc.*.rpc_stats on a Lustre* client with an increased number of memory pages per RPC (Remote Procedure Call) during the weak scalability test (TC3). This histogram shows that 94 percent of the write activity uses 1,024 Linux* pages. In Linux a memory page is 4K; therefore, the RPCs size is 4 MB.

                     read              |            write
pages per rpc   rpcs    %   cum %      |    rpcs     %   cum %
1:                60  100     100      |     440     1      1
2:                 0    0     100      |      80     0      1
4:                 0    0     100      |       7     0      1
8:                 0    0     100      |       4     0      1
16:                0    0     100      |      10     0      1
32:                0    0     100      |      19     0      1
64:                0    0     100      |      14     0      1
128:               0    0     100      |      62     0      2
256:               0    0     100      |     437     1      3
512:               0    0     100      |     417     1      5
1024:              0    0     100      |   27447    94    100

Table 12. llite.*.stats on a Lustre* client with increased RPC (Remote Procedure Call) page size during the weak scalability test (TC3). This output shows that the block size used by EFISPEC3D for writing (write_bytes row) is 4 MB and the RPC block size (osc_write row) is increased to 4 MB compared to Table 10.

snapshot_time     1448556086.743401 secs.usecs
read_bytes        218879 samples [bytes] 0 4194304 81125429092
write_bytes       1638417 samples [bytes] 1 4194304 3554524773484
osc_read          11120 samples [bytes] 6 4194304 8326030329
osc_write         893758 samples [bytes] 6 4194304 3554523428966
ioctl             31271 samples [regs]
open              34768 samples [regs]
close             34677 samples [regs]
mmap              616 samples [regs]
seek              38576 samples [regs]
readdir           26 samples [regs]
truncate          33148 samples [regs]
getattr           31268 samples [regs]
unlink            4400 samples [regs]
mkdir             4 samples [regs]
alloc_inode       28718 samples [regs]
getxattr          1638761 samples [regs]
removexattr       33148 samples [regs]
inode_permission  315545 samples [regs]


In Figure 24 we show two strategies that provided better results than Lustre's default settings:

• Strategy 1. Stripe count = 4; Stripe Size = 1 MB; RPC max size = 1024

• Strategy 2. Stripe count = ALL (32 OSTs); Stripe Size = 1 MB; RPC max size = 1024

Strategy 1 increases the speed of the application when the number of clients is limited to 4; when the number of clients is more than 8, Strategy 2 can boost the performance of EFISPEC3D.

We used the lfs setstripe command to change the stripe layout of the directory where EFISPEC3D was dumping the data. The stripe layout was calculated on-the-fly based on the number of clients used for the test. This very simple method provided great results, but a more elegant method can be developed if MPI-IO hints are used.
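A hedged C sketch of that more elegant route: choose the striping hints at run time from the number of clients, mirroring Strategy 1 and Strategy 2. The ranks-per-node value, the threshold, the OST count of 32, and the path are illustrative and must be adapted to the target system:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nprocs, ranks_per_node = 36;          /* illustrative value */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    int nclients = nprocs / ranks_per_node;   /* number of compute nodes */

    MPI_Info info;
    MPI_Info_create(&info);
    /* Strategy 1 (4 stripes) for few clients, Strategy 2 (all 32 OSTs)
     * when 8 or more clients are writing. */
    MPI_Info_set(info, "striping_factor", nclients >= 8 ? "32" : "4");
    MPI_Info_set(info, "striping_unit", "1048576");   /* 1-MB stripes */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "/lustre/scratch/dump.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    /* ... dump the snapshot ... */
    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}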

The results of the custom tuning strategy are shown in Figure 25 and Figure 26.

Figure 25 plots the scalability curve using the apparent bandwidth, because in the weak scalability experiment the elapsed time is constant. Applying this custom tuning technique, we achieved a good scalability speedup and, most importantly, as the number of cores increased the application could take advantage of the additional computational power.

Figure 26 confirms the improvement in performance—more than 84 percent on 16 compute nodes—of the custom tuning technique, compared with the default Lustre parameters.

Figure 24. Apparent bandwidth comparing Lustre* defaults and different tuning strategies during the weak scalability test (TC3). Strategy 1 increases the performance of the application when the number of clients is limited. Strategy 2 boosts the performance of the application when the number of clients is greater than 8.

Figure 25. Scalability outcome applying a custom tuning strategy for the weak scalability test (TC3) compared with Lustre* defaults. The scalability curve was calculated using the apparent bandwidth. The tuning technique suggested enables the application to scale nicely beyond 8 clients.


Figure 26. Apparent bandwidth comparison between Lustre* default values and custom tuning during the weak scalability test (TC3). Applying different tunables during the experiment allowed us to maintain very good scalability and outperform the default values by 84 percent.



Conclusion

In the last few years, the Lustre community has created several documents to help system administrators design Lustre solutions and fine-tune this powerful file system.

The objective of this paper was instead to focus on developers, scientists, and researchers and to guide them in understanding how to take advantage of an open source parallel file system like Lustre.

For that purpose, we used a seismic wave propagation application typically used by the oil and gas industry and the seismology community. During such a propagation, it is necessary to take "pictures" or "snapshots" of the 3D wave fields at a given time-step frequency. This snapshot frequency can be tuned depending on the usage, but it still has a significant impact on I/O performance.

Using this application in a real environment, with all the typical restrictions that a non-privileged user normally experiences, was an excellent opportunity to demonstrate what is achievable and where the limitations are. We selected Intel EE for Lustre, installed on HPC storage provided by DDN, a leader in this market.

In the second part of the document, we described the metrics exposed by Lustre and made available to developers to troubleshoot their applications, focusing on the information available on the compute nodes, which non-privileged users can access. We also described the physical limits of Lustre, especially in the area of metadata, to help organize the data allocation in the file system in the most optimal way.

In the analytic part of the document, we demonstrated how an application that can scale nicely on 128 compute nodes (4,608 Intel Xeon processor E5-2697 v4 cores) is heavily limited by I/O and can’t scale beyond 32 clients.

The application’s scalability was not only limited by the theoretical bandwidth available but also by the number of threads used for I/O and by the file size. The throughput efficiency of a Lustre file system based on NL-SAS HDDs decreases when the workload becomes random. An SSD-based Lustre file system can better sustain a strong scalability experiment when the data set is analyzed by thousands of cores that are dumping a file of a few MBs. We used the weak scalability test TC2, where the file size is constant, to verify that Lustre can sustain an increased number of threads without regression.

In the last part of the document, we guided developers in a tuning exercise using a specific test case (TC3) with a relatively large data set. We achieved an 84-percent improvement in bandwidth compared to the default Lustre settings, while—most importantly—maintaining the scalability of the application.

In this document we also guided the audience through the important aspects of parallel file systems and parallel I/O.

Understanding the I/O patterns of applications using the metrics files exposed by Lustre is essential for developers to find bottlenecks in their code. These can include a wrong block size and suboptimal I/O calls. This understanding also helps system administrators develop a tuning strategy to adapt Lustre and increase the performance of applications. Intel Manager for Lustre, included in Intel Enterprise Edition for Lustre software, can help with high-level and real-time information, but most of the time a deeper dig into the metrics files is necessary.

The I/O strategy for a large-scale application can be a challenge. A developer must adapt the application based on the planned scalability and decide how many cores per node should perform the read/write dumps:

• Up to hundreds of cores – all cores should read and write to reach the maximum bandwidth
• Thousands of cores – one core per node should dump to sustain the maximum bandwidth and maintain the scalability (a minimal sketch of this pattern follows the list)
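A minimal MPI-3 C sketch of the one-writer-per-node pattern mentioned in the second bullet; the actual gathering and dumping of the node's data is left out:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Group the ranks that share a compute node. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    if (node_rank == 0) {
        /* This rank gathers the node's data (for example with MPI_Gather
         * on node_comm) and performs the dump on behalf of the node. */
        printf("world rank %d is the writer for its node\n", world_rank);
    }

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}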

In the case of an application performing a strong scalability exercise, it is important to predict the file size to avoid dumping a huge number of small files, which decreases the efficiency of the HPC storage subsystem. The development of an additional layer to buffer the files, or the use of MPI collective I/O, should be considered.
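As one possible illustration of the collective I/O option (not the method used in this paper): every rank writes its block of a single shared file with a collective call, letting the MPI-IO layer aggregate the many small requests; sizes and the path are illustrative:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const MPI_Offset block = 4 * 1024 * 1024;   /* 4 MB per rank, illustrative */
    char *buf = calloc((size_t)block, 1);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "/lustre/scratch/shared_dump.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    /* Each rank writes at its own offset; the _all variant is collective,
     * so the MPI-IO layer can aggregate requests before they reach Lustre. */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * block, buf, (int)block,
                          MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    free(buf);
    MPI_Finalize();
    return 0;
}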

Many MPI implementations support Lustre (such as Intel MPI, OpenMPI, and Cray MPI), so a developer can take advantage of the stripe optimization available on Lustre inside their Fortran or C application (as we explored in the last part of this document).


Bibliography [Anderson 2012] Anderson, J.E., Tan, L. & Wang, D., 2012. Time-reversal checkpointing methods for RTM and FWI, Geophysics, 77, S93–S103, 2012.

[Barton 2014] Barton, E., Dilger, A. High Performance Parallel I/O. Chapman and Hall/CRC, 91-106, 2014.

[Chaljub 2015] Chaljub, E., Maufroy, E., Moczo, P., Kristek, J., Hollender, F., Bard, P-Y., Priolo, E., Klin, P., De Martin, F., Zhang, Z., Zhang, W., and Chen, X. 3-D numerical simulations of earthquake ground motion in sedimentary basins: testing accuracy through stringent models. Geophysical Journal International, 201(1), 90-111, 2015.

[Clapp 2009] Robert G. Clapp. Reverse time migration with random boundaries. SEG Technical Program Expanded Abstracts 2009: pp. 2809-2813, 2009.

[Demartin 2011] De Martin, F. Verification of a spectral-element method code for the southern California earthquake center LOH. 3 viscoelastic Case. Bulletin of the Seismological Society of America, 101(6), 2855-2865, 2011.

[Demartin 2013] De Martin, F., Matsushima, S., Kawase, H. Impact of geometric effects on near-surface Green’s functions. Bulletin of the Seismological Society of America, vol. 103(6): 3289-3304, doi: 10.1785/0120130039, 2013.

[Dussaud 2008] Dussaud, E., W. W. Symes, L. Lemaistre, P. Singer, B. Denel, and A. Cherrett. Computational strategies for reverse-time migration: 78th Annual Internat. Mtg., Soc. Expl. Geophys., Expanded Abstracts, SPMI 3.3, 2008.

[Karypi 1999] Karypi, G. and Kumar, V. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, Vol. 20, No. 1, pp. 359-392, 1999.

[Hebenstreit 2014] Hebenstreit, M. Performance Evaluation of Intel® SSD-Based Lustre* Cluster File Systems at the Intel® CRT-DC (online), 2014.

[Jette and Grondona, 2003] Jette, M. and Grondona, M. SLURM: Simple Linux Utility for Resource Management. Proceedings of ClusterWorld Conference and Expo. San Jose, California, 2003.

[Imbert 2011] Imbert, D., Imadoueddine, K, Thierry, P., Chauris H., and Borges, L., Tips and tricks for finite difference and i/o‐less FWI. SEG, Expanded Abstracts,30,3174, 2011.

[Karniadakis 2013] Karniadakis, G., & Sherwin, S. Spectral/hp element methods for computational fluid dynamics. Oxford University Press, 2013.

[Komatitsch 1998] Komatitsch, D., & Vilotte, J. P. (1998). The spectral element method: an efficient tool to simulate the seismic response of 2D and 3D geological structures. Bulletin of the Seismological Society of America, 88(2), 368-392, 1998.

[Maday 1989] Maday, Y., & Patera, A. T. Spectral element methods for the incompressible Navier-Stokes equations. In IN: State-of-the-art surveys on computational mechanics (A90-47176 21-64). New York, American Society of Mechanical Engineers, 1989, p. 71-143. Research supported by DARPA. (Vol. 1, pp. 71-143), 1989.

[Matsushima 2014] Matsushima, S., T. Hirokawa, F. De Martin, H. Kawase, F. J. Sánchez‐Sesma. The Effect of Lateral Heterogeneity on Horizontal‐to‐Vertical Spectral Ratio of Microtremors Inferred from Observation and Synthetics Bulletin of the Seismological Society of America, vol. 104(1):381-393, doi:10.1785/0120120321, 2014.

[Maufroy 2015] Maufroy, E., E. Chaljub, F. Hollender, J. Kristek, P. Moczo, P. Klin, E. Priolo, A. Iwaki, T. Iwata, V. Etienne, F. De Martin, N. Theodoulidis, M. Manakou, C. Guyonnet-Benaize, K. Pitilakis, and P.-Y. Bard (accepted) Earthquake ground motion in the Mygdonian basin, Greece: the E2VP verification and validation of 3D numerical simulation up to 4 Hz. BSSA, vol. 105(3):1398-1418, 2015.

[Moczo 2002] Moczo, P., Kristek, J., Vavryčuk, V., Archuleta, R. J., & Halada, L. 3D heterogeneous staggered-grid finite-difference modeling of seismic motion with volume harmonic and arithmetic averaging of elastic moduli and densities. Bulletin of the Seismological Society of America, 92(8), 3042-3066, 2002.

[Newmark 1959] Newmark, N. M. A method of computation for structural dynamics. Journal of Engineering Mechanics, ASCE, 85 (EM3) 67-94, 1959.

[Shan 2007] Shan, H. and Shalf, J. Using IOR to Analyze the I/O performance for HPC Platforms. Cray User Group proceedings (online), 2007.

[Schoof 1994] Schoof, L. A., and Yarberry, V. R. EXODUS II: a finite element data model (No. SAND--92-2137). Sandia National Labs, Albuquerque, NM (United States). DOI:10.2172/10102115, 1994.

[Sochala 2013] Sochala, P., Le Maître, O. Polynomial Chaos expansion for subsurface flows with uncertain soil parameters, AWR, 62,139-154, 2013.

[Symes 2007] Symes, W. M., 2007, Reverse time migration with optimal checkpointing: Geophysics, 72, SM213–SM221.

[Virieux 2009a] Virieux, J., S. Operto, H. Ben Hadj Ali, R. Brossier, V. Etienne, F. Sourbier, L. Giraud & A. Haidar, Seismic wave modeling for seismic imaging, The Leading Edge, 28, 538 - Special section: Seismic modeling, 2009.

[Virieux 2009b] Virieux, J. & S. Operto. An overview of full waveform inversion in exploration geophysics, Geophysics, 74(6), WCC127-WCC152, 2009.


1. ZFS is a combined file system and logical volume manager designed by Sun Microsystems.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See intel.com/products/processor_number for details.

Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps. Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer, or learn more at intel.com. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/performance/resources/benchmark_limitations.htm or call (U.S.) 1-800-628-8686 or 1-916-356-3104.

THE INFORMATION PROVIDED IN THIS PAPER IS INTENDED TO BE GENERAL IN NATURE AND IS NOT SPECIFIC GUIDANCE. RECOMMENDATIONS (INCLUDING POTENTIAL COST SAVINGS) ARE BASED UPON INTEL’S EXPERIENCE AND ARE ESTIMATES ONLY. INTEL DOES NOT GUARANTEE OR WARRANT OTHERS WILL OBTAIN SIMILAR RESULTS.

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Copyright © 2016 Intel Corporation. All rights reserved. Intel, the Intel logo, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries. * Other names and brands may be claimed as the property of others. 0816/EYAR/KC/PDF Please Recycle 334702-001US

About the Authors

Gabriele Paciucci is a Solution Architect in the Enterprise & HPC Platform Group at Intel. In this role, he provides technical consultation to partners and customers and evangelizes the Lustre technology worldwide. Gabriele is author of several research papers presented at major Lustre and HPC events. He is involved in several open source projects and has promoted the adoption of open source software and Linux since 2000, when he worked as software engineer at Red Hat. Gabriele received his Master's Degree in Chemical Engineering from Università degli Studi di Roma La Sapienza in 1999.

Florent De Martin is a seismologist at BRGM (French Geological Survey). His research is dedicated to improving the realism of physics-based numerical earthquake simulations in order to better assess seismic hazard. For this purpose, he has been developing EFISPEC3D, an MPI computer program that solves the three-dimensional viscoelastic equations of motion in complex geological media using a continuous Galerkin spectral finite element method running on current HPC architectures. Florent received his engineering degree from Ecole Nationale des Ponts ParisTech and his Ph.D. from Ecole Centrale Paris.

Philippe Thierry is a Principal Engineer in the Energy Application Engineering Team at Intel, leading the technical orientation of the group, and specializing in HPC and seismic imaging. His current research focuses on uncertainties quantification and performance prediction with respect to upcoming hardware. Philippe received his Ph.D. in Geophysics from the Paris School of Mines and University of Paris VII.