
Architecture guide for HPE servers and WekaIO Matrix: Solving AI storage bottlenecks

Architecture guide


Contents

Executive summary
  A storage bottleneck problem
  A solution architecture to address the problem
WekaIO Matrix software
  Overview
  Matrix software architecture
Deep Learning use case
  Deep Learning
Cluster design
  Recommended platform choices
  Utilizing WekaIO storage
Best practices, sizing, and configuration choices
  Cluster design guidelines
Reference configurations
  Reference cluster properties
  Sample Bill of Materials (BOM)
Summary
Resources


Executive summary

A storage bottleneck problem
For distributed AI clusters used for Deep Learning or inferencing, improvements in CPU and GPU compute speed pose a challenge: performance throughput requirements have dramatically outpaced the bandwidth improvements in storage solutions available for these applications.

A good example of the issue is seen in parallel file system design, a frequently used storage paradigm for technical compute. Parallel file systems provide storage built from fabric-connected servers—typically with server-attached SAS/SATA hard drives, or SAN/NAS storage solutions using the same. To efficiently deliver tens or hundreds of GB/s of bandwidth to compute clusters with SAS/SATA devices, the building blocks must either contain a large number of storage devices or be split into many less dense building blocks with an increased number of network ports.

Even if application capacity demands aren't high, a significant number of storage devices may still be needed to hit high bandwidth requirements—particularly with hard drive-based solutions, where each device provides only a small fraction of the total performance needed. The parallel file systems used by the industry for decades weren't designed for the latency or IOPS requirements that the latest storage technology can deliver. So yesterday's solution may not meet newer technical compute requirements, regardless of the hardware it runs on.

A different approach to providing performant storage for HPC and Deep Learning is to use server local NVMe SSDs. This design provides performance with a much-reduced footprint over dense hard drive systems, but might not provide the robustness of a parallel file system or SAN/NAS-attached storage solution. Management of cluster storage is typically nonexistent or ad hoc. Because the storage isn't shared, large data sets prompt time- and resource-intensive processes for copying data to and from bulk storage. Loss of a node removes all access to its storage, and storage islands provide worse data durability the more you scale. Performance and capacity are limited by the NVMe on each compute node.

A solution architecture to address the problem
Scalable storage building blocks are still the right choice, and storage solutions built on top of servers make sense. That raises the question: if storage provided by Lustre, IBM Spectrum Scale (formerly IBM GPFS), or legacy SAN/NAS solutions doesn't meet AI compute application requirements, what alternative does Hewlett Packard Enterprise propose?

A better way to address AI storage bottlenecks
WekaIO Matrix software on HPE servers can provide the storage performance and features required by modern technical compute.

Matrix provides an incredibly fast file system built for the speed of NVMe over fabrics. Matrix also forms a complete, easy-to-manage storage solution. The WekaIO MatrixFS file system has a POSIX interface that can present all data in a single namespace, so it can work easily with existing applications. MatrixFS also provides additional interface flexibility for various data ingest or processing needs (NFS and SMB). Data is reliably protected on MatrixFS and can be tiered to object storage for cost optimization.

Task-optimized HPE Gen10 servers support all the hardware required for maximizing storage performance over the fabric: NVMe drives and fast InfiniBand or Ethernet network adapters. HPE Gen10 servers are designed to provide the compute power and storage performance for the most demanding AI workloads.

Together, HPE servers and WekaIO Matrix software provide the building blocks of a parallel file system that can keep up with the demands of technical compute applications while providing a full-fledged, easy-to-manage storage solution.

WekaIO Matrix software

Overview
WekaIO Matrix is a software-only, high-performance, file-based storage solution that is elastic and highly scalable. It is also easy to deploy, configure, manage, and expand. The design philosophy behind Matrix was to create a radically simple storage solution that has the performance of all-flash arrays with the scalability and economics of the cloud. Matrix transforms NVMe-based flash storage, compute nodes, and interconnect fabrics into a high-performance parallel file system that is well suited for Deep Learning use cases. Members of the Matrix cluster may be clients only, or may contribute storage to the file system.

The two key components of Matrix are WekaIO MatrixFS and WekaIO Trinity. MatrixFS is the file system and data management layer while Trinity is a software platform for system management, visualization, and reporting.


MatrixFS
MatrixFS was written entirely from scratch; it does not rely on legacy algorithms. The software also includes integrated tiering that seamlessly migrates data to and from the cloud (public, private, or hybrid) without special software or complex scripts; all data resides in a single namespace for easy management through the Trinity management console. Fully distributed data and metadata ensure there are no hotspots or bottlenecks, and MatrixFS provides distributed resilience and data protection without traditional bottlenecks. WekaIO has invented a patented and patent-pending distributed data protection coding scheme called MatrixDDP that improves performance with scale. Essentially, MatrixDDP delivers the scalability and durability of Erasure Coding but without the performance penalty. Because all nodes in the Matrix cluster participate in the recovery process, recovery back to full reliability is also much faster than with mirroring or Erasure Coding.

With Matrix, there is no sense of data locality, which improves performance and resiliency. Contrary to popular belief, data locality actually contributes to performance and reliability issues by creating data hotspots and system scalability issues. By directly managing data placement, MatrixFS can shard the data and distribute it for optimal placement based on user-configurable stripe sizes. Sharded data perfectly matches the block sizes used by the underlying flash memory to improve performance and extend SSD service life.

Trinity
Trinity has an intuitive graphical user interface that allows a single administrator, without any specialized storage training, to quickly and easily manage hundreds of petabytes. Reporting, visualization, and overall system management functions are accessible through the command-line interface (CLI) or the Trinity management console. CLI functionality is also available via an easy-to-use API, allowing integration with existing management stacks. Trinity is entirely cloud based, eliminating the need to physically install and maintain any software or dedicated hardware resources, and you always have access to the latest management console features.

Figure 1. WekaIO Trinity

Matrix software architecture
Matrix supports all major Linux® distributions and leverages virtualization and low-level Linux container techniques to run its own real-time operating system (RTOS) in user space, alongside the original Linux kernel. Matrix manages its assigned resources (CPU cores, memory regions, network interface cards, and SSDs) to provide process scheduling and memory management, and to control the I/O and networking stacks. Matrix has a very small resource footprint, typically about 5%, leaving 95% for application processing. Matrix only uses the resources that are allocated to it, from as little as one server core and a small amount of RAM up to all of the resources of the server.

Figure 2 shows the software architecture including flexible application access (top right). Matrix core components—including the MatrixFS unified namespace and other functions—execute in user space, effectively eliminating time-sharing and other kernel-specific dependencies. The notable exception is the WekaIO kernel driver shown in the lower right of Figure 2, which provides the POSIX file system interface to applications. Using the kernel driver provides significantly higher performance than can be achieved using an NFS or SMB mount point.

Figure 2. WekaIO MatrixFS software architecture

By not relying on the Linux kernel for its I/O stack, Matrix avoids the CPU buffer copies and context switches that accompany that code path. Thus, Matrix effectively utilizes a zero-copy architecture with much more predictable latencies. Scheduling and memory management also bypass the kernel. To run the network stack and NVMe from user space as depicted in Figure 2, Matrix leverages acceleration technologies like DPDK, SPDK (NVMe only), and SR-IOV (Ethernet only). Additionally, these technologies and WekaIO’s own innovative software help Matrix maximize performance over InfiniBand or Ethernet networks.

Matrix also helps optimize storage costs by utilizing S3-based on- or off-premises object storage. A key use is policy-based tiering that operates transparently underneath the single namespace view, keeping only desired data in the flash tier. Matrix also supports cloud bursting with snap to object, allowing additional work on the same file system data without expanding the local compute cluster.

Deep Learning use case
Before discussing the architecture or hardware for building Matrix, it's helpful to understand how distributed storage and NVMe benefit Deep Learning clusters.

As discussed, MatrixFS provides the performance, low latency, and consistent response times of local NVMe storage within a distributed storage solution. This allows Matrix to support common Deep Learning access patterns that traditional parallel file systems struggled with, such as reading many small files or intensive metadata activity.

The following Deep Learning use case examines some areas where storage can bottleneck specific workloads or inhibit operation—and how Matrix helps address these problems. The examples are focused around general workflow and design benefits.


Deep Learning
Deep Learning growth has been driven both by massive compute resources through GPUs and by the availability of large-scale training and test data sets. This subset of machine learning has seen a dramatic rise in models that are trained from a significant quantity of data—the more the better. While the impact of storage performance is not always obvious at small scale, production workloads and data sets can starve GPUs and reduce application efficiency without proper storage design. Some application workloads may be held back by data ingest and/or commit between runs, or configurations can hit data transfer bottlenecks as the GPU cluster grows.

As the amount of data has grown, the need for a storage layer that can keep up with all stages of the Deep Learning pipeline has arisen.

• Instrument ingest: Sensor count and/or high-definition sensor input creates a need for the capacity at scale and sustained bandwidth that MatrixFS provides. Burst ingest solutions are not sufficient for constant, intense data flows. Think of autonomous automobile data sets or large-scale IoT instrumentation.

• Cleaning and preprocessing: When the raw data is being processed and tagged for training sets, software has to again read through massive chunks of data (and potentially generate a fair amount of small write I/O for that tagging). This can be particularly performance sensitive if the data is being processed in a streaming fashion and fed to further workflow stages.

• Training and inference: WekaIO Matrix can feed more data to each GPU than a local storage design and supports a much larger available data set than local NVMe. This also avoids data shuffling between ingest and local storage. Better data ingest prevents GPU starvation by keeping memory populated, and the absence of data locality means your compute and storage design stays flexible as you scale. Matrix can drastically improve the ability to iterate epochs (full passes of the training data), particularly for neural net layers that compute quickly and whose execution time is dominated by data load times. A minimal data-loading sketch follows this list.

• Archiving and cataloging: Data flow to object storage for active archiving is transparent and cost-optimizable within the MatrixFS namespace.
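To make the training and inference point concrete, the following minimal sketch streams training samples for each epoch directly from a shared MatrixFS mount instead of staging copies on local NVMe. The mount point and file layout are assumptions for illustration only; the decode step would be adapted to the framework in use.

from pathlib import Path

def epoch_batches(root="/mnt/weka/train", batch_size=64):  # hypothetical mount point
    """Yield batches of raw sample bytes read straight from the shared namespace."""
    files = sorted(Path(root).glob("*.bin"))
    for i in range(0, len(files), batch_size):
        yield [f.read_bytes() for f in files[i:i + batch_size]]  # decode/augment as needed

for step, batch in enumerate(epoch_batches()):
    pass  # hand the decoded batch to the GPU framework of your choice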

Cluster design
This section covers design choices and diagrams to help visualize how Matrix storage can be connected within your data center. Use of the WekaIO logo (a W in a circle) indicates systems where Matrix software is installed.

Typical storage design separates storage hardware from compute clients. In this model, WekaIO MatrixFS clients (frontend) contribute no capacity to MatrixFS. Storage infrastructure servers (back end) can then dedicate all system resources to WekaIO Matrix for performance. This model allows back-end servers to be designed and cost-optimized solely for per-node storage density and/or I/O performance.

Figure 3. Matrix servers in the data center

A percentage of the hardware resources on every server in the cluster must be reserved for WekaIO Matrix—CPU, memory, and fabric bandwidth. Matrix resource utilization is clearly defined and isolated from application impact. For example, a frontend client core reserved for Matrix appears busy to the OS because its usage time is being fully dedicated to Matrix software. Therefore, the OS will not schedule jobs on that core, and Matrix will not take CPU time on other cores.


WekaIO Matrix clusters can also be deployed in a hyperconverged model, where application software runs on servers with both Matrix frontend and back-end functions. Hyperconverged makes sense if the goal is a single building block for scaling a technical compute cluster's compute and storage together. It is, however, a more complicated design, given that hardware must be chosen so Matrix and the application have sufficient memory, compute, and networking to coexist.

Due to that added complexity in hyperconverged design, this architecture guide focuses on dedicated storage architectures. However, best practices and configuration guidelines around Matrix requirements are applicable to all designs.

Recommended platform choices
HPE provides a variety of servers that can be considered for your particular solution design, but the following are the best fits for the majority of Deep Learning use cases. Further details around hardware choices are included later in the document, under Reference configurations.

Compute
• HPE Apollo 6500 Gen10 systems for GPU-intensive roles and maximizing per-node GPU density.

• HPE ProLiant DL380 Gen10 servers for inference-focused clusters or designs centered on fewer GPUs per server.

Storage
• HPE ProLiant DL360 Gen10 servers for dense NVMe storage per node, with a good balance of CPU and networking resources to realize that performance.

• HPE Apollo 2000 Gen10 systems with HPE ProLiant XL170r servers are a good alternative for storage clusters with high node counts in limited rack space, or for designs focused on maximum performance per node.

• HPE Apollo 4200 Gen10 systems in an AI data node configuration. This provides Matrix and Scality RING on a single server, reducing cost and solution complexity for Matrix deployments that require on-premises object storage, but it will not fit all designs.

Utilizing WekaIO storage

POSIX connectivity
The POSIX interface to MatrixFS provides optimal performance and is intended as the primary ingest and operational interface for clients.
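Because MatrixFS is consumed through standard POSIX calls, applications need no special client library. The short sketch below is purely illustrative; the mount point and file names are assumptions, not part of the WekaIO product.

from pathlib import Path

mount = Path("/mnt/weka/datasets")           # hypothetical MatrixFS mount point
sample = mount / "run42" / "batch-0001.npy"  # hypothetical file in the shared namespace
with open(sample, "rb") as f:                # ordinary open/read -- no special API
    payload = f.read()
print(f"read {len(payload)} bytes from the shared namespace")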

Object storage
The recommended cluster design incorporates an optional on- or off-premises S3-compatible object storage tier, to optimize storage costs and (potentially) provide snapshot to cloud. It can also provide remote disaster recovery capability. Matrix can connect to from 0 to 8 supported object stores—one object store per one or more filesystem groups.

On-premises object storage would use HPE Apollo 4200 Gen10 or HPE Apollo 4510 Gen10 dense storage servers. HPE recommends either Scality RING or SUSE Enterprise Storage for the on-premises object storage software; in the case of the AI data node design only Scality RING is supported.

NFS and SMB
Many data centers have existing applications or workflows that require SMB or NFS protocol access to the same shared storage as technical compute. Matrix is not intended as a replacement for traditional NFS or SMB solutions but is designed to support this requirement.

Non-POSIX clients share a common namespace with POSIX clients, and receive the best possible performance within the limitations of the SMB and/or NFS protocols. NFS and SMB have additional configuration requirements covered in more detail in WekaIO's documentation.

Cluster diagrams
These diagrams show high-level compute and storage functions, and how they interact with Matrix in typical data center designs. This will help clarify data flows between Matrix and other functional roles.

Rack scale clusters
Rack scale indicates a Matrix cluster that can be contained within a single rack. No intra-Matrix traffic (e.g., rebuild or data distribution chunks) travels between racks, meaning top-of-rack/edge switching design is simplified. This type of Matrix cluster will typically be connected to only a single object store, which could be local or geo-distributed.

At rack scale, each building block could scale by as little as a single server. So one or more compute nodes or storage servers will be added as performance and capacity requirements grow.


Figure 4. Rack scale cluster

Data center scale clusters
At data center scale, scalable building blocks are more typically partial or full racks fulfilling certain roles rather than individual servers or chassis. Concepts are largely the same as in the rack scale design, but the variety and distribution of hardware increase the complexity.

Typical impacts to storage:

• Multiple applications or workloads may run concurrently, complicating storage performance requirements.

• With multiple storage racks, inter-rack storage networking must accommodate WekaIO Matrix traffic overhead. This means traffic isolation so storage doesn’t negatively impact compute traffic, and sufficient inter-rack bandwidth to not degrade storage solution performance.

• Data retention needs and economics are such that a local/private object storage solution is required.

• Power and thermal requirements may dictate specific rack layouts.


Figure 5. Data center scale cluster design

There are, of course, many options and considerations for design at larger scale. The design process for data center scale is often consultative and based on detailed requirements.

Best practices, sizing, and configuration choices
The intent of this section is to help the reader understand typical choices and requirements when building a Matrix solution, match that information to HPE reference design choices, and know what to consider before building a cluster specific to their needs.

This section doesn’t cover all possible considerations, but should address most common ones. Software or hardware updates may impact best practices or features—WekaIO documentation is the most complete resource for Matrix software requirements on your deployed version.

Cluster design guidelines
On-premises WekaIO Matrix cluster design on HPE hardware targets NVMe storage requirements of at least tens of TB, a minimum of 4 rack units of data center space, and 100 Gb/s networking infrastructure. Matrix is at its core a storage solution designed for realizing the peak performance of flash at distributed scale, so while tradeoffs are possible, Matrix may not be the best fit for lesser requirements. For example, requirements for 10GbE networking between storage nodes, under 1 TB of usable shared storage capacity, or 1–2 server cluster designs are more appropriately addressed by other HPE offerings.

A solution delivered on industry-standard servers has both the flexibility and the burden of choice. This leads to the topic of how to make the best design choices for Matrix. These best practices and guidelines cover the decisions any designer needs to make to deploy Matrix in the data center.

Key design points for any network attached, software-defined storage cluster are:

• What total usable storage capacity do I require?

• What key performance indicators (e.g., IOPS, bandwidth) must this cluster achieve?

• What type of fabric connectivity do I need in my data center for the storage (InfiniBand, Ethernet)?


To best answer these questions for Matrix consider:

• The server type and quantity to choose

• CPU and Memory reservation requirements for WekaIO Matrix software

• Choices around data protection and storage efficiency to hit capacity and performance targets

• And finally, specific software and hardware choices and configuration required to build the cluster and integrate it into the data center

Cost, operational, or application requirements may dictate changes from the example reference designs, and it may not be possible to understand all performance or application requirements up front. HPE’s ability to consult and provide reference choices will help simplify this process.

HPE also recommends deploying HPE solutions with WekaIO Matrix using HPE-qualified software releases to simplify rollout. The guidelines listed here assume the latest HPE-qualified software release at the time of writing, Matrix 3.2.3. Different versions of Matrix can have differing dependencies and hardware sizing requirements, and might not be fully qualified on HPE hardware.

Storage server choice

Platform type
There are many different storage servers to choose from, and even limiting choices to NVMe-only server designs can present a confusing number of options. The HPE ProLiant DL360 Gen10 and the HPE Apollo 2000 Gen10 with the HPE ProLiant XL170r server reference choices are driven by one design guideline: you don't want too many drives in a single server.

While it can be tempting to reduce costs by maximizing NVMe per server, if your goal is to realize peak performance there are balancing factors.

• WekaIO shines at larger total server counts, and a larger building block means significantly more storage as a minimum requirement.

• Larger drive counts have a proportionately greater impact on rebuilds if an entire node fails.

• Boxes with high NVMe storage density typically don't have the PCIe slots needed to deliver all of that drive performance over the network: either the available PCIe lanes are consumed by the drives, or the drives are bottlenecked behind PCIe bridges. Maximizing data flow over the network is core to a distributed file system design.

• At larger drive counts, it may be impossible to reserve enough local CPU resources to realize full drive performance.

The sweet spot for a WekaIO Matrix NVMe storage server focused on performance is 2–10 drives, with 1–2 100 Gb/s adapters. For the NVMe in this reference architecture, one 100GbE or EDR adapter per four or five PCIe Gen3 x4 NVMe devices is the target balance of network bandwidth against drive count and drive bandwidth for each storage server. Use cases where the key performance indicators are low latency and/or IOPS rather than bandwidth could increase the number of drives per network adapter.
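A rough balance check behind that four-to-five-drive guideline, as a sketch only: the per-drive throughput below is an assumed figure for a PCIe Gen3 x4 NVMe SSD, not a measured WekaIO result.

ADAPTER_GBPS = 100                                   # EDR InfiniBand or 100GbE
adapter_gbs = ADAPTER_GBPS / 8                       # ~12.5 GB/s of line rate
drive_read_gbs = 3.0                                 # assumed sequential read per NVMe drive
drives_to_saturate = adapter_gbs / drive_read_gbs    # ~4 drives fill one 100 Gb/s port
print(f"~{drives_to_saturate:.1f} drives saturate one 100 Gb/s adapter")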

Platform quantity
WekaIO Matrix sees the best performance and storage efficiency improvements by scaling wide (more nodes) rather than up (more performance per server). More storage nodes also allow for better distribution and parallelism of I/O.

Total cluster size can range from 6 to 4096 hosts, of which at least 6 must be nodes contributing storage to MatrixFS. Your storage cluster's starting node count is one of the most important choices you make early on, as it locks you into a MatrixDDP choice and dictates what resources you'll have to address your performance needs. The power, port count, and space requirements that come from this decision are central to data center planning.

While 6 storage servers is a supported and performant configuration—and is how the AI data node minimal configuration is designed—HPE’s reference recommendation for a WekaIO Matrix-only solution starts at 8 servers. The key reasons for this are:

• The 4+2 MatrixDDP usable storage efficiency of eight storage servers versus 3+2 MatrixDDP at six storage servers.

• At six servers there is only one additional server beyond the MatrixDDP width (typically there are two), which reduces the ability to protect or self-heal data.

• Eight nodes fits well in an HPE Apollo 2000-based design, as there are four HPE ProLiant XL170r nodes per chassis.


Resource reservation

CPU core reservation
Matrix allows for flexible allocation of CPU resources to specific functions. These functions cover frontend cores that handle client traffic, and back-end cores that handle MatrixFS processing or can be specifically dedicated to drive I/O. The reference designs in this guide use the recommended CPU core reservations detailed below. Consult with your representative if CPU resources need to be tuned for specific application workloads.

From 0 to 19 Matrix-dedicated CPU cores can be reserved on a server—at least one core must be left for the Linux OS. The maximum number of reserved cores depends on the availability of cores and memory. As a general guideline, when you add drives to a Matrix cluster, you should also reserve more CPU cores to take advantage of the additional storage performance. These reserved cores are seen by the OS as 100% utilized and cannot run other tasks.

Role reservation
For clients, the recommendation is one frontend core per client. A reserved frontend core is not required but is strongly recommended for best performance.

The recommended core reservation for each storage server is:

• A dedicated core per drive for I/O (more than one does not improve performance).

• An additional core per drive for general purpose Matrix back-end tasks.

• A single frontend core reserved for non-POSIX access support. This is recommended even if the initial deployment does not require this feature, as it makes extending cluster client access simpler.

If this totals higher than the maximum possible Matrix core count for a single server, reduce the number of general purpose back-end cores accordingly.
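The recommendation above can be expressed as a small sizing helper. This is a sketch of the stated guideline only (the function is an illustration, not a WekaIO tool): one drive core and one general back-end core per NVMe device, one frontend core, and a cap of 19 Matrix cores enforced by trimming general back-end cores.

MAX_MATRIX_CORES = 19

def storage_server_cores(nvme_drives: int) -> dict:
    drive = nvme_drives                    # dedicated drive I/O cores
    backend = nvme_drives                  # general purpose back-end cores
    frontend = 1                           # reserved for non-POSIX access support
    over = drive + backend + frontend - MAX_MATRIX_CORES
    if over > 0:
        backend -= over                    # reduce general back-end cores first
    return {"frontend": frontend, "backend": backend, "drive": drive}

print(storage_server_cores(4))    # 4-drive entry-level server -> 1 + 4 + 4 = 9 cores
print(storage_server_cores(10))   # 10-drive high-end server   -> 1 + 8 + 10 = 19 cores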

Performance guidelines
These default recommendations provide a good balance for Matrix performance and functionality. However, flexibility is available to reserve cores for application workload optimization or target use cases.

For example:

• More frontend cores are appropriate for particularly high IOPS or throughput clients. Only clients with significant performance requirements (such as hundreds of thousands of IOPS, or bandwidth saturating the 100 Gb/s link) should require more than one frontend core.

• More general purpose back-end cores can benefit write-heavy, IOPS-based loads, which tend to saturate back-end compute resources first.

• Fewer back-end cores could still meet requirements for Matrix configurations that mostly need low-response latency.

Memory reservation
Matrix software on a dedicated storage server reserves most of the memory in the system by default, but making sure there's enough memory present to support the desired functionality is the responsibility of the solution architect. Matrix requires a base amount of memory for Matrix services plus an amount for each reserved core in the system, depending on the core's role. The average size of files further impacts the total reservation.

Any calculation for reserved memory should typically leave at least 8 GB of system memory available for the OS.

Main formula
The formula for memory requirements is somewhat complicated, and is broken down into two parts below: base and core reservations, and huge page allocation.

Matrix memory reservation =
((# FE cores + # drive cores) × 1.7 GiB) + (# BE cores × 3.8 GiB) + 1.7 GiB (base memory) + huge page allocation (default 1.4 GiB)

The huge page allocation defaults to 1.4 GiB if not specified, but the Matrix requirement can grow for file systems with many files—particularly with the larger file counts possible with an object tier. The fewer storage nodes in the cluster, the greater this memory requirement is for each.


This formula is:

Huge page allocation =
(8 bytes × (system usable capacity / average file size) + 24 bytes × (system usable capacity / 1 MiB)) / # of back-end hosts

Example calculations
These sample numbers are from an 8-node entry-level reference configuration built on HPE ProLiant XL170r nodes (without an object tier), assuming a 1 MiB average file size. A full file system would require:

Huge page allocation = (8 B × (54.6 TB / 1 MiB) + 24 B × (54.6 TB / 1 MiB)) / 8 ≈ 0.21 GiB

Given the above, we'll leave the huge page allocation at the default of 1.4 GiB, and the total Matrix memory reservation on a server would then be:

((1 + 4) × 1.7 GiB) + (4 × 3.8 GiB) + 1.7 GiB + 1.4 GiB = 8.5 GiB + 15.2 GiB + 1.7 GiB + 1.4 GiB = 26.8 GiB

This is significantly less than the total memory in the reference example, 96 GB. As it happens, the memory quantity chosen for this reference system is driven by hardware requirements for best performance rather than by a total file system or core count minimum.

A typical default client reservation for this cluster should require approximately this much memory on every client:

1 FE core × 1.7 GiB + 1.7 GiB (base) = 3.4 GiB
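The memory formulas above can be combined into a small sizing sketch (GiB throughout). The helper names are assumptions for illustration only, and the huge page default of 1.4 GiB is treated here as a floor, matching the choice made in the example.

def huge_pages_gib(usable_capacity_bytes, avg_file_bytes, backend_hosts):
    needed = 8 * (usable_capacity_bytes / avg_file_bytes) \
             + 24 * (usable_capacity_bytes / 2**20)
    return max(needed / backend_hosts / 2**30, 1.4)   # never below the 1.4 GiB default

def matrix_reservation_gib(fe_cores, drive_cores, be_cores, huge_pages=1.4):
    return (fe_cores + drive_cores) * 1.7 + be_cores * 3.8 + 1.7 + huge_pages

# Entry-level XL170r example: 54.6 TB usable, 1 MiB average file size, 8 back-end hosts
hp = huge_pages_gib(54.6e12, 2**20, 8)                 # ~0.2 GiB computed, so the default applies
print(round(matrix_reservation_gib(1, 4, 4, hp), 1))   # ~26.8 GiB per storage server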

Storage efficiency, performance, and protection
The biggest factor for all three of these storage properties is choosing the appropriate MatrixDDP configuration.

MatrixDDP is not a dynamic setting for the cluster—once chosen, it is fixed and the cluster would need to be rebuilt to change it. A cluster that starts with a small MatrixDDP will continue to be impacted by that choice even as it scales to more storage nodes.

There are a few other decisions that can be made to protect data through transparent hot spares and use of object storage.

Distributed data protection
WekaIO Matrix supports dynamic data and protection distribution through MatrixDDP, where data can be distributed across 3 to 16 chunks and the protection uses an additional two or four chunks. Each data or protection chunk must reside in a different failure domain to minimize the probability of losing more than one chunk at a time. Failure domains can simply be individual nodes, but may also be defined as chassis, racks, or another user-defined boundary appropriate to the data center configuration. The chunks are striped so that they are distributed across all nodes of the Matrix cluster, based on those failure domain definitions.

The total data and protection chunk count chosen for MatrixDDP balances safety and performance. The recommended configuration with two protection chunks—and no additional failure domains defined—is server count minus four data chunks (so for protection across the eight nodes of the entry-level reference configuration, that would be a MatrixDDP with four data and two protection chunks, or 4+2). Making the protection distribution smaller than the entire cluster improves resiliency, as the full chunk count can still be distributed to healthy cluster members even with failed servers. If free capacity is available, self-healing can also occur as missing chunks are regenerated and redistributed to the healthy members.

The MatrixDDP size chosen balances protection efficiency against infrastructure cost. More nodes can better distribute rebuild traffic in case of a failure and provide greater parallelism. A larger ratio of data to protection chunks also means a greater percentage of usable storage. However, not all applications require the maximum possible protection or the additional capacity and footprint of a larger cluster. Note that while Matrix can support and optimize for differing types and quantities of performance in each failure domain, it is also a best practice for performance and storage utilization to provide homogeneous failure domains where possible—in other words, building blocks with the same quantity and type of storage.
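As a purely conceptual illustration of placing one stripe's chunks across distinct failure domains (this is not WekaIO's MatrixDDP implementation, just a generic round-robin sketch):

def place_stripe(stripe_id, data_chunks, protection_chunks, failure_domains):
    """Return one failure domain per chunk, rotating the start point per stripe."""
    width = data_chunks + protection_chunks
    assert width <= len(failure_domains), "need at least one domain per chunk"
    start = stripe_id % len(failure_domains)           # spread load across the cluster
    ordered = failure_domains[start:] + failure_domains[:start]
    return ordered[:width]

domains = [f"node{i}" for i in range(1, 9)]            # e.g., eight storage servers
print(place_stripe(0, 4, 2, domains))                  # a 4+2 stripe lands on six distinct nodes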

Performance impact
Specific performance is very workload and cluster dependent, but these are general behaviors seen from scaling tests with particular types of fixed synthetic workloads—i.e., read and write bandwidth, and read and write IOPS.

• Wider MatrixDDP benefits write I/O performance the most (where not constrained by CPU or network bottlenecks).

• Only large-scale clusters or the highest protection requirements should design for four protection chunks. Expect n+4 codes to add a further performance impact for write bandwidth and IOPS even on a healthy cluster.

• Peak read bandwidth performance could actually benefit from even fewer than 4 drives per network adapter due to MatrixDDP overhead. But it’s not always practical (or possible) to double network cards per node for that one specific I/O requirement.


Rebuild reservation
Additional capacity can be reserved through transparent hot sparing for designs that require higher guarantees of reliability. Storage reserved in this way does not add to maximum MatrixFS capacity, but is still actively used for file system I/O. This means workloads driven more by total reserved core and device count than by the width of the code will still benefit. The reserved overhead is applied to failure recovery, keeping rebuild space available even if the file system itself is completely full.

Object storage can also address rebuild design requirements, pushing colder data to object storage and freeing up rebuild space in the hot NVMe tier. Cluster designs that tier to object would then only need hot spare capacity reservation based on specific performance requirements.

Capacity impact
To understand how data protection affects the amount of usable flash storage, the formula (for all flash storage not reserved as hot spares) is:

Usable storage =
Raw storage × (MatrixDDP data chunks / (MatrixDDP data chunks + MatrixDDP protection chunks)) × FS overhead factor (80%)

The FS overhead means 80% of drive storage is used for file system capacity; the remaining 20% of flash space is reserved for other file system use. This reservation includes read caching and flash overprovisioning to provide consistent high performance.

For example, consider the entry-level reference configuration of eight HPE ProLiant DL360 servers with four 3.2 TB Mixed Use drives each, for a total raw storage of 102.4 TB. This configuration has a MatrixDDP of 4+2, so usable storage would be:

Usable storage = 102.4 TB × (4/6) × 0.8 = 54.6 TB (53% of raw)

For a 20-node midrange configuration (256 TB raw), the code could be at maximum 16+2, resulting in:

Usable storage = 256 TB × (16/18) × 0.8 = 182 TB (71% of raw)

But if the same server count was reached by expanding a cluster built with the original 4+2 code, usable storage stays at the smaller ratio:

Usable storage = 256 TB × (4/6) × 0.8 = 136.5 TB (53% of raw)
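The capacity formula and the recommended "server count minus four" data chunk rule translate into a short sketch that reproduces the three examples above; the helper is illustrative only.

FS_OVERHEAD = 0.8                                      # 20% reserved for file system use

def usable_tb(raw_tb, data_chunks, protection_chunks):
    return raw_tb * data_chunks / (data_chunks + protection_chunks) * FS_OVERHEAD

print(round(usable_tb(102.4, 4, 2), 1))    # entry level, 8 nodes on 4+2   -> ~54.6 TB
print(round(usable_tb(256.0, 16, 2), 1))   # midrange, 20 nodes on 16+2    -> ~182.0 TB
print(round(usable_tb(256.0, 4, 2), 1))    # same nodes expanded from 4+2  -> ~136.5 TB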

Software and hardware requirements

OS versions
The following OS versions are supported by both HPE platform hardware and Matrix software.

Table 1. Matrix 3.2 OS support on HPE qualified hardware

Storage server operating system Versions

Red Hat® Enterprise Linux 6.8, 6.9, 6.10, 7.2, 7.3, 7.4, 7.5, 7.6

CentOS 6.8, 6.9, 6.10, 7.2, 7.3, 7.4, 7.5, 7.6

Ubuntu 16.04, 18.04

Mellanox OFED
Ideally, all OFED packages on your fabric—not just on Matrix storage nodes—should be at the same revision. The recommended OFED version for Matrix 3.2 is 4.5-1.0.1.0, but it must be one of the following: 4.2-1.0.0.0, 4.2-1.2.0.0, 4.3-1.0.1.0, 4.4-1.0.0.0, 4.4-2.0.7.0, or 4.5-1.0.1.0.

Matrix boot FS requirements
Files are installed in /opt/weka. This path should be on an SSD or storage with SSD-like performance; it cannot be mounted over the network (NFS/SMB) or on a RAM drive. Installation requires at least 26 GB available for the WekaIO system, with an additional 10 GB reserved for each core used by Matrix on the system.
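A quick sanity check of that sizing rule (26 GB base plus 10 GB per reserved Matrix core), sketched with the Python standard library; the nine-core figure is simply the entry-level example used earlier in this guide.

import shutil

def required_opt_weka_gb(matrix_cores: int) -> int:
    return 26 + 10 * matrix_cores                       # 26 GB base + 10 GB per reserved core

free_gb = shutil.disk_usage("/opt/weka").free / 1e9     # path must already exist on local SSD
print(f"need ~{required_opt_weka_gb(9)} GB, have {free_gb:.0f} GB free")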

CPU choices
• Both Intel® Xeon® and AMD EPYC architectures are supported.

• WekaIO Matrix sees performance benefits from higher frequency CPUs.

• Disable hyper-threading to maximize performance, preferably in the BIOS. If not disabled on a client, Matrix will consume both logical cores associated with the same physical core.


Storage
HPE believes NVMe is the optimal choice for a WekaIO flash tier and recommends High Performance NVMe with the capacity and durability to meet application requirements.

Read Intensive flash is appropriate for Deep Learning workloads, and the durability of one drive write per day for our reference drive choice suits the typical daily write requirements of a training- and/or inference-focused cluster. Workloads focused more specifically on write IOPS, or with a significant amount of write bandwidth (e.g., checkpoint-restart), may be better served by Mixed Use NVMe drives.

The fit of other varieties of flash is more task dependent. Mainstream or Write Intensive NVMe devices trade off capacity against performance and durability in ways that don't make sense for most technical compute workloads. SAS or SATA flash can achieve lower cost for a given capacity, but adds complexity to Matrix server configuration when balancing for optimal per-node performance.

WekaIO does not support spinning disk in Matrix storage nodes, although a connected object store could be based on spinning disk.

DIMMs
WekaIO is designed so that sockets with reserved cores should have all memory channels populated. Reference designs incorporate this requirement along with Matrix memory requirements in the most cost-effective fashion for each platform. Very high file counts relative to the Matrix cluster size or a hyperconverged architecture could dictate more or denser DIMMs in a server design.

Networking hardware
Matrix supports both Ethernet and InfiniBand connectivity, but the cluster must standardize on only one of these fabrics.

HPE and WekaIO recommend adapters based on Mellanox ConnectX-5 and later chipsets, which support both InfiniBand and Ethernet. On HPE platforms, ConnectX-5 chipsets are used in HPE 841QSFP28 adapters. These adapters have been qualified for use in both frontend and back-end roles, and should be running the latest HPE firmware deliverables.

Ethernet-only alternatives supported by Matrix—but not qualified as a solution—include adapters based on the Intel® 82599 chipset and Intel X710 chipset. It is important that the NIC support Intel DPDK. Other network adapters will not have been optimized or qualified for use with Matrix.

Mellanox technology at 100 Gb/s (EDR or 100GbE) is the recommended and qualified fabric choice. Slower speeds may be required for integrating clients on legacy infrastructure. Ethernet speeds supported are 10GbE, 25GbE, 40GbE, 50GbE, and 100GbE, and supported InfiniBand speeds are FDR or EDR.

HPE recommends either Mellanox switches for an InfiniBand fabric or appropriate switches from our Ethernet offerings. In general, networking should use an enterprise-class, nonblocking switch. InfiniBand switch infrastructure should be at least EDR. For customers who need Ethernet switch redundancy, MLAG must be supported.

Fabric and OS configuration
Some of the more important configuration and planning requirements for WekaIO networking:

Ethernet
• Ethernet networks should enable jumbo frames. An MTU of 9000 is recommended for both switch and OS network settings.

• One IPv4 management IP address per host and one IPv4 data plane IP address per WekaIO core must be allocated for Ethernet installations. As an example, each storage server would require 20 reserved IPs if maximum possible cores were reserved.

• No bonding configured.

• SR-IOV should be supported and enabled, both for network adapters and in BIOS.

InfiniBand
• OS network settings for IB ports should use an MTU of 4092.

• The Subnet Manager must be configured for 4K frames.

• The InfiniBand Subnet Manager can ideally run on the switch (if managed) or on a host if the switch is not managed, but not both.

• One IPv4 IP address for management and data plane per host.

Consult with your representative for additional guidance if your Matrix systems use multiple network ports per server for data traffic.


Reference configurations
This section provides specific examples of software and hardware component choices for Matrix clusters built on HPE ProLiant DL360 Gen10 servers and HPE Apollo 2000 Gen10 systems with HPE ProLiant XL170r servers. The Matrix reference configurations incorporate the best practices and configuration details presented previously. A specific hyperconverged solution built for AI, the AI data node solution with WekaIO Matrix and Scality RING in a single HPE Apollo 4200 Gen10 chassis, is described in this reference architecture.

Reference platforms go through specific integration and performance testing with Matrix 3.2 software. Because of the flexibility of Matrix, a variety of Intel and AMD platforms can be supported for Matrix storage and client roles. Individually, all HPE server platforms and the Matrix software go through their own robust qualification processes. However, without being able to evaluate the overall solution there may be additional effort and time to integrate or optimize other platforms into a Matrix deployment.

This section covers only SKUs and guidance for Matrix server hardware and software. Other solution deployment items (server factory options, service offerings for the platform, compute clients, racks and power, data center fabric choices, etc.) are outside of the scope of this document. Contact your HPE sales or account representative for further information and assistance around solution deployment requirements and customization.

Reference cluster properties
HPE and WekaIO have defined three categories of recommended scales for cluster designs: entry level, midrange, and high end. Example cluster properties for each scale are detailed to help envision both the capabilities and the data center impact.

Table 2. Reference configuration categories

Type Focus Node quantity

Entry level Starter configuration. The minimum investment for representative performance and reliability. 8

Midrange Maximizes the width of MatrixDDP. Best starting scale for storage efficiency and performance per node. 20

High end Linear scale from midrange, around where multi-rack design starts to impact storage deployment. 40

Capacities and performance estimates for common I/O patterns and sizes on these reference configurations give a general idea of the level of performance to expect. This data is not representative of any given application or technical compute workload—rather, it provides a peak rating for bandwidth or IOPS in an ideal case. All numbers assume the default choice of 3.2 TB Mixed Use NVMe.

Table 3. HPE ProLiant DL360 Gen10 server reference capacity and performance

Type Raw capacity (TB) Usable capacity (TB) Read 4K IOPS (millions) Read 1M bandwidth (GB/s)

Entry level 102.4 54.6 2.5 30

Midrange 512 364.1 8.7 160

High end 1280 910.2 17.4 320

Table 4. HPE Apollo 2000 Gen10 system reference capacity and performance

Type Raw capacity (TB) Usable capacity (TB) Read 4K IOPS (millions) Read 1M bandwidth (GB/s)

Entry level 102.4 54.6 2.5 30

Midrange 256 182.0 8.7 125

High end 512 364.1 17.4 250

Sample Bill of Materials (BOM)
These sections give detailed lists of the SKUs included in hardware and software builds for a reference cluster. Further guidance is also included around hardware choices and relevant details for deploying Matrix on that platform.


HPE ProLiant DL360 Gen10 server

Solution focus
The HPE ProLiant DL360 Gen10 server is for solutions focused on denser storage per server and improved per-node cost efficiency, while still keeping rack footprint and storage-per-node ratios manageable. Each server can contain up to 10 NVMe devices and two InfiniBand HCAs.

The HPE ProLiant DL360 Gen10 server supports the latest generation of Intel Xeon Scalable processors, along with 2666 MT/s HPE DDR4 SmartMemory supporting up to 3.0 TB. Deploy this secure 2P platform for diverse workloads in space-constrained environments; the dense 1U rack design provides up to three PCIe 3.0 slots and 10 SFF storage bays.

Figure 6. HPE ProLiant DL360 Gen10

All table quantities indicate the number of that component in a single server. Memory quantity supports a system with all NVMe bays populated and/or significant additional object storage tiering. OS drives support a Linux software RAID 1 configuration, and network adapters can support InfiniBand or Ethernet.

Table 5. Server components

Component name Quantity SKU

HPE DL360 Gen10 Premium 10NVMe CTO Svr 1 867960-B21

HPE DL360 Gen10 Intel Xeon-Gold 6226 (2.7GHz/12-core/125W) FIO Processor Kit 1 P02601-L21

HPE DL360 Gen10 Intel Xeon-Gold 6226 (2.7GHz/12-core/125W) Processor Kit 1 P02601-B21

HPE 16GB (1x16GB) Single Rank x4 DDR4-2933 CAS-21-21-21 Registered Smart Memory Kit 12 P00920-B21

HPE 800W Flex Slot Titanium Hot Plug Low Halogen Power Supply Kit 2 865438-B21

HPE InfiniBand EDR/Ethernet 100Gb 2-port 841QSFP28 Adapter 1 (entry level); 2 (midrange and high end) 872726-B21

HPE DL360 Gen10 SATA M.2 2280 Riser Kit 1 867978-B21

HPE 1U Gen10 SFF Easy Install Rail Kit 1 874543-B21

HPE 480GB SATA 6G Mixed Use M.2 2280 3yr Wty Digitally Signed Firmware SSD 2 875490-B21

HPE 3.2TB NVMe x4 Lanes Mixed Use SFF (2.5in) SCN 3yr Wty Digitally Signed Firmware SSD 4 (entry level); 8 (midrange); 10 (high end) P10224-B21

Matrix core reservations
All systems have the recommended reservation of one frontend core. In addition, the entry-level configuration would reserve four back-end cores and four dedicated drive cores, for a total of nine reserved cores. The midrange servers would reserve eight back-end cores and eight dedicated drive cores, for a total of seventeen reserved cores. Finally, the high-end server would reserve ten dedicated drive cores and eight back-end cores, for the maximum of nineteen reserved cores.

HPE Apollo 2000 Gen10 system

Solution focus
HPE Apollo 2000 Gen10 systems provide the highest reference compute and storage density per rack unit. Each HPE ProLiant XL170r Gen10 reference node contains up to four NVMe devices and one network adapter.

The HPE Apollo 2000 Gen10 system is an enterprise bridge to scale-out architecture in a smaller footprint that saves data center floor space while improving performance and energy consumption. A flexible, density-optimized system, it is ideal for compute-intensive tasks that require high performance in a dense, scale-out form factor. Mix and match servers, with HPE ProLiant XL170r Gen10 for general-purpose workloads and HPE ProLiant XL190r Gen10 for workloads requiring GPUs.


Figure 7. HPE Apollo 2000 Gen10

Tables indicate the component quantity in a single chassis (4 nodes per chassis). Memory is designed for a flash-only Matrix, with Mixed Use data drives. OS drives support a Linux software RAID 1 configuration, and network adapters can support InfiniBand or Ethernet.

Table 6. Chassis components

Component name | Quantity | SKU
HPE Apollo r2800 24SFF-Flex Gen10 CTO Chassis | 1 | 867159-B21
HPE r2800 Gen10 16SFF NVMe Backplane FIO Kit | 1 | 874800-B21
HPE r2x00 Gen10 Redundant Fan Module Kit | 1 | 874308-B21
HPE 1600W Flex Slot Platinum Hot Plug LH Power Supply Kit | 2 | 830272-B21
HPE r2x00 Gen10 PSU Enablement Kit | 1 | 880186-B21
HPE 2U Shelf-Mount Adjustable Rail Kit | 1 | 822731-B21

Table 7. Node components

Component name | Quantity | SKU
HPE ProLiant XL170r Gen10 1U Node Configure-to-order Server | 4 | 867055-B21
HPE XL1x0r Gen10 Intel Xeon-Gold 6226 (2.7GHz/12-core/125W) FIO Processor Kit | 4 | P12324-L21
HPE XL1x0r Gen10 Intel Xeon-Gold 6226 (2.7GHz/12-core/125W) Processor Kit | 4 | P12324-B21
HPE 8GB (1x8GB) Single Rank x8 DDR4-2933 CAS-21-21-21 Registered Smart Memory Kit | 48 | P00918-B21
HPE XL1x0r Gen10 Left Low Profile Riser Kit | 4 | 874296-B21
HPE XL170r Gen10 16NVMe P2 Low Profile Riser Kit | 4 | 874304-B21
HPE InfiniBand EDR/Ethernet 100Gb 2-port 841QSFP28 Adapter | 4 | 872726-B21
HPE XL1x0r Gen10 M2 (NGFF) Riser Kit | 4 | 874853-B21
HPE 480GB SATA 6G Mixed Use M.2 2280 3yr Wty Digitally Signed Firmware SSD | 8 | 875490-B21
HPE Ethernet 1Gb 2-port 368FLR-T Media Module Adapter | 4 | 866464-B21
HPE XL170r Gen10 S100i SATA Cable Kit | 4 | 874305-B21
HPE 3.2TB NVMe x4 Lanes Mixed Use SFF (2.5in) SCN 3yr Wty Digitally Signed Firmware SSD | 16 | P10224-B21

Matrix core reservations

All cluster scales have a recommended reservation of one frontend core, four back-end cores, and four dedicated drive cores, for a total of nine reserved cores.

Software SKUs

WekaIO software is licensed for support based on length of entitlement and raw storage capacity on Matrix servers, with additional licensing for object storage capacity. See the HPE QuickSpecs for more detail and full ordering rules.

To order the correct quantities of the available WekaIO software options:

1. Decide on the number of years of subscription and support to purchase up front.

2. Select the appropriate number of the per TB E-LTU SKUs to license the raw capacity of the cluster (fractional TB rounded up).

3. Note that the Matrix SKUs for NVMe storage include 1 TB of usable tiering capacity for each 1 TB of raw WekaIO MatrixFS capacity purchased. Order additional tiering SKUs for each TB of usable object capacity addressed by Matrix beyond that allowance.

Choose the “Education/Government” SKUs only if ordering as, or on behalf of, an education or government entity. All customers use the same Matrix SKUs for tiering to object storage.
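
The ordering arithmetic above can be expressed in a few lines of Python. This is a minimal sketch for illustration only; the function names are hypothetical, and the HPE QuickSpecs remain the authoritative source for ordering rules.

```python
import math

# Illustrative sketch of the WekaIO software ordering rules described above.
# Function names are hypothetical; consult the HPE QuickSpecs for the full rules.

def matrix_license_skus(raw_tb: float) -> int:
    """Per-TB subscription/support E-LTUs: raw cluster capacity, fractional TB rounded up."""
    return math.ceil(raw_tb)

def additional_tiering_skus(object_tb_usable: float, raw_tb: float) -> int:
    """Tiering E-LTUs beyond the 1 TB included per TB of raw MatrixFS capacity."""
    return max(0, math.ceil(object_tb_usable - raw_tb))

print(matrix_license_skus(102.4))           # -> 103, matching Example #1 below
print(additional_tiering_skus(5000, 1280))  # -> 3720, matching Example #2 below
```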


Software BOM Example #1

This example is a non-education, non-government, entry-level Matrix cluster with three years of subscription and support, using the entry-level HPE Apollo 2000 Gen10 configuration with 3.2 TB NVMe drives (102.4 TB raw in total).

Table 8. WekaIO software BOM—Entry level HPE Apollo 2000 Gen10 system configuration

Component name | Quantity | SKU
WekaIO Matrix 3yr Subscription/Support per TB E-LTU for HPE Servers | 103 | Q9Q95AAE

Software BOM Example #2

A government organization wants to purchase a high-end cluster based on HPE ProLiant DL360 Gen10 servers. The purchased licenses cover ten 3.2 TB NVMe drives per server (1280 TB raw in total) for five years of subscription and support.

This configuration is attached to 5 PB of usable object capacity dedicated to Matrix tiering. Because 1280 TB of tiering capacity is already included with the raw-capacity licenses, 5000 - 1280 = 3720 additional tiering SKUs are required to cover the remainder.
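
As a worked check of this example, the same arithmetic can be run directly; the variable names below are illustrative only and do not represent an HPE or WekaIO ordering tool.

```python
import math

# Worked check of Example #2 (illustrative arithmetic only).
raw_tb = 1280        # ten 3.2 TB NVMe drives per server, 1280 TB raw across the cluster
object_tb = 5000     # 5 PB of usable object capacity dedicated to Matrix tiering

license_skus = math.ceil(raw_tb)               # 1280 Education/Government 5yr E-LTUs
extra_tiering = math.ceil(object_tb - raw_tb)  # 3720 additional tiering E-LTUs
print(license_skus, extra_tiering)             # 1280 3720
```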

Table 9. WekaIO software BOM—High end HPE ProLiant DL360 Gen10 server configuration

Component name | Quantity | SKU
WekaIO Matrix Education/Government 5yr Subscription/Support per TB E-LTU for HPE Servers | 1280 | Q9R00AAE
WekaIO Matrix 5yr Tiering per TB E-LTU for HPE Servers | 3720 | Q9Q97AAE

NVMe options

NVMe drive choice is a key decision for both cost and workload reasons. HPE’s default choice for these reference designs is the 3.2 TB Mixed Use NVMe drive, which provides a good balance of capabilities and capacity in the absence of other requirements.

Read Intensive drives can further increase capacity at the expense of write performance and durability (down from 5 drive writes per day for Mixed Use to 1 for the Read Intensive devices listed here), but this tradeoff frequently makes sense for deep learning workloads.

Below are HPE’s recommended NVMe selections for Matrix reference platforms. Other options from the respective platform QuickSpecs may be supportable by Matrix, but consult with HPE for further guidance.

More details on drive properties are available in HPE Solid State Disk Drives QuickSpecs.

Table 10. Recommended Mixed Use NVMe (both servers)

Component name | SKU
HPE 1.6TB NVMe x4 Lanes Mixed Use SFF (2.5in) SCN 3yr Wty Digitally Signed Firmware SSD | P10222-B21
HPE 3.2TB NVMe x4 Lanes Mixed Use SFF (2.5in) SCN 3yr Wty Digitally Signed Firmware SSD | P10224-B21
HPE 6.4TB NVMe x4 Lanes Mixed Use SFF (2.5in) SCN 3yr Wty Digitally Signed Firmware SSD | P10226-B21

Table 11. Recommended Read Intensive NVMe (HPE ProLiant DL360)

Component name | SKU
HPE 1.92TB NVMe x4 Lanes Read Intensive SFF (2.5in) SCN 3yr Wty Digitally Signed Firmware SSD | P10214-B21
HPE 3.84TB NVMe x4 Lanes Read Intensive SFF (2.5in) SCN 3yr Wty Digitally Signed Firmware SSD | P10216-B21
HPE 7.68TB NVMe x4 Lanes Read Intensive SFF (2.5in) SCN 3yr Wty Digitally Signed Firmware SSD | P10218-B21


© Copyright 2018–2019 Hewlett Packard Enterprise Development LP. The information contained herein is subject to change without notice. The only warranties for Hewlett Packard Enterprise products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. Hewlett Packard Enterprise shall not be liable for technical or editorial errors or omissions contained herein.

AMD is a trademark of Advanced Micro Devices, Inc. Intel and Intel Xeon are trademarks of Intel Corporation in the U.S. and other countries. Red Hat is a registered trademark of Red Hat, Inc. in the United States and other countries. Linux is the registered trademark of Linus Torvalds in the U.S. and other countries. All other third-party marks are property of their respective owners.

a00045986ENW, August 2019, Rev. 4

Summary

Dramatic improvements in computational power and the explosion of data set sizes mean that the parallel file systems traditionally used for technical compute workloads are often impractical or inadequate for the task. However, the data integrity and management features of parallel file system solutions are still necessary, along with a focus on modern technology and data workflows.

WekaIO Matrix on HPE servers provides an attractive performance, protection, and data management story for Deep Learning workloads. Matrix removes your computational storage bottlenecks by leveraging the power of NVMe and task-optimized servers with software designed for performance, scalability, and flexibility.

This architecture guide presents an introduction to the technology, the use cases it helps address in the technical compute space, and some of the best practices behind designing clusters with WekaIO Matrix and HPE servers. The example hardware and test data in this guide provide a good starting point for working with HPE and WekaIO to build a cluster that fits your needs.

Resources

AI Data Node

HPE HPC solutions

WekaIO website

Learn more at hpe.com/storage/wekaio