
Reference Architecture Guide for IBM Spectrum Scale and Spectrum Archive with IBM DCS3860 Storage and IBM TS4500 Tape Library

Overview

Today's powerful computers and applications (HPC environments, IoT, VoIP, data analytics, and so on) generate massive amounts of data that are growing at an exponential rate. Solution providers are tasked with finding a storage technology that is expandable and intelligent enough to support such demand within the users' budget. Additionally, it is known that simply collecting massive amounts of data does not generate value; value is only found after the data is analyzed. In other words, high-performance computing resources are used to analyze this massive amount of data. A storage system has to provide enough throughput to support the computing needs as well as enough capacity to store the data.

The many different storage technologies such as SSDs, SAS, SATA, tape, and software solutions available today make tiered storage solutions a viable and more efficient way of organizing data. Tiered storage solutions offer one way to assign 'value' to data by correlating it with a particular storage type based on the performance, capacity, reliability, or retention needs of the underlying storage media. Through intelligent assignment of data to appropriate storage devices, one is able to more efficiently map the cost and speed of storage to the needs of the overlying application, thereby increasing the performance of a solution while minimizing the cost of the entire solution.

Applications or environments which are bottlenecked by slower file systems benefit from assignment to faster (and more expensive) disk technologies such as SSDs, while applications which are bottlenecked by processor power or other components in an environment may be placed on slower storage technologies such as SAS, SATA, or tape without impacting performance. The cost savings by using the slower storage technologies can then be reinvested in faster processor power or storage technologies for other applications thereby making the entire solution faster and/or more efficient.

There are a number of reasons to limit the amount of data in a small, medium, or large data warehouse, including expense, storage capacity, and backup and restore limitations, to name only a few. Typically, high-demand data is termed "Hot/Active," while low-demand data is termed "Cold/Passive." At one time, business users and IT personnel would separate hot data (active) from cold data (passive). This may be done to avoid SSD burn-out, or for economic or performance reasons.

Page 2: Reference Architecture Guide for IBM Spectrum Scale and

2

Below are examples of sample data directories and how hot and cold data might reside on them.

SSDs make sense when looking at performance per unit dollar in a solution. Server workloads requiring very high IOPS rates per GB are more cost effective on SSDs. Large online transaction systems such as reservation systems (hotels/airlines), e-commerce systems (Amazon/eBay), and anything with small, random reads/writes will run more cost effectively on SSDs.

In a competitive world where time is a critical constraint, businesses are always on the lookout for technological innovations that can provide an advantage over their competitors, and Solid State Disks (SSD) offer significantly better performance than traditional hard drives for most I/O patterns. Because of this, many companies use mixed media types, HDD and SSD, as well as sizes and speeds. This means that high capacity 7,200 RPM HDDs are available for storing backups, or data that is not accessed very often; 10,000 RPM or 15,000 RPM HDDs might be deployed to house databases or other data that needs to be accessed and manipulated relatively quickly; and SSDs are utilized for storing data that needs to be readily available or data that is sensitive to access latency, such as transaction logs or parent VHDX files.

One storage solution that facilitates this type of provisioning is IBM's Spectrum Storage Suite. The Spectrum Storage Suite allows the user to take advantage of the various characteristics of different disk technologies without having to buy the highest-performing type of disk for every job. Two of the six software solutions within the IBM Spectrum Storage family are IBM Spectrum Scale and IBM Spectrum Archive.

IBM Spectrum Scale is a clustered file system developed by IBM that supports scale-out growth, intelligent policy-driven data migration between storage tiers, high I/O throughput, and other features required for mission-critical operation. Spectrum Scale is hardware-agnostic, software-defined storage: as long as a supported OS and compatible hardware are used, a storage system can be created using IBM Spectrum Scale on top of any storage hardware.

IBM Spectrum Archive enables direct and graphical access to data stored on IBM tape drives and libraries by incorporating the Linear Tape File System (LTFS) format standard for reading, writing, and exchanging descriptive metadata on formatted tape cartridges.

Combining IBM Spectrum Scale and IBM Spectrum Archive software-defined storage with the IBM DCS3860 creates a highly intelligent, flexible storage infrastructure that enables high performance at a fraction of the cost. In real-world analytics applications, typically only a subset of the data is needed at any given time for processing. The Spectrum Scale policy engine is leveraged to place massive amounts of inactive data, based on age, type, and other attributes, on low-cost Spectrum Archive-enabled tape, while active data is placed on a high-performance SSD tier. Figure 1 illustrates the solution.

Examples of where hot and cold data might reside include the data root directory, file system metadata, the temp directory, the user database log directory, archived data, and application data.
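A minimal sketch of how such placement rules can be expressed in the Spectrum Scale policy language is shown below; the pool names, thresholds, file system name, and policy file path are illustrative only, and the tape migration policy actually used in this study appears in the Best Practices section.

# Sketch only: keep new files on an SSD pool and drain them to NL-SAS as the
# pool fills, then install the rules. All names here are placeholders.
cat > /root/policy_tier <<'EOF'
RULE 'hot'     SET POOL 'ssd' LIMIT(90)
RULE 'default' SET POOL 'system'
RULE 'cool'    MIGRATE FROM POOL 'ssd' THRESHOLD(80,60)
               WEIGHT(CURRENT_TIMESTAMP - ACCESS_TIME) TO POOL 'system'
EOF
mmchpolicy gpfs1 /root/policy_tier        # placement rules take effect immediately
mmapplypolicy gpfs1 -P /root/policy_tier  # evaluate the migration rule on demand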


With this in mind, the intent of this testing was two-fold.

• Benchmark the performance of IBM’s Spectrum Scale file system on SSD and NL-SAS based logical volumes. Test the performance of IBM’s Spectrum Archive for moving data to and from tape from SSD vs. NL-SAS based logical volumes.

• Apply a tiered storage solution to an application currently in use in a non-tiered environment and analyze the benefits and disadvantages of implementing such a solution.

IBM Spectrum Scale file system

Spectrum Scale's performance was benchmarked and compared on Network Shared Disks (NSDs) that were both NL-SAS and SSD resident as part of this analysis. Additionally, the performance of a tiered solution, separating metadata and file system data on different disk types, was analyzed.

The following infrastructure was used to benchmark the performance of the IBM Spectrum Scale file system:

Hardware

• Nine compute nodes as follows

- Six IBM System x3690 X5 servers

• Dual-socket Intel Xeon E6540, 2.0GHz, 6-core processors, 128GB DDR3 memory

• Two Mellanox single-port FDR Infiniband PCIe Gen3 x8 HBA

- Three E5-1650 v3 Haswell servers

• Single socket, 2.4GHz, 6-core processor, 96GB DDR4 memory

• Single Mellanox dual-port FDR Infiniband PCIe Gen3 x16 HBA

Figure 1: IBM Spectrum Scale with IBM Spectrum Archive architectural diagram. A single global namespace serves users (for example CIO, Finance, and Engineering), with Tier 1 and Tier 2 provided by IBM Spectrum Scale on the IBM DCS3860 and Tier 3 provided by IBM Spectrum Archive on the IBM TS4500.


• Two Lenovo System x3650 M5, dual-processor E5-2650 v4 12-core Broadwell servers used for storage nodes

- Dual socket Intel Xeon E5-2650 v4, 2.2GHz, 12 core processor, 128 GB DDR4 memory

• Two Mellanox dual-port FDR Infiniband PCIe Gen3 x8 HBA

• Two LSI quad-port SAS9302-16e 12Gbps x16 SAS HBAs

• Two IBM DCS3860 direct attached storage units

- One IBM DCS3860 Gen2 using 20 800GB SSD with 12Gbps SAS interface

- One IBM DCS3860 Gen2 using 60 6TB 7.2K RPM NL-SAS with 12Gbps SAS interface

- Each component configuration

• FW: 08.20.09.00; NVSRAM: N1813P38R0820V01 software code

• 10 drives (either NL-SAS or SSD) were used to create each logical drive, selecting 2 drives from each drawer of the DCS storage unit for drawer loss protection

• Each volume group was created using 10 drives with RAID6: 8+P+Q and a segment size of 512KB for large sequential I/O

• One Mellanox SX6036 FDR Infiniband switch

Software

• CentOS 7.2 for x86_64 (compute and storage nodes)

• IBM Spectrum Scale version 4.2.0 standard edition was used to create a Spectrum Scale cluster and file system

• IBM DS Storage Manager version 11.20 was used to create and configure the storage devices

• Mellanox OFED Linux driver on both the compute and storage nodes

• Streaming I/O performance was measured with public-domain software called the "IOR HPC benchmark". IOR was developed for benchmarking parallel file systems using POSIX, MPIIO, or HDF5 interfaces and is a good tool for measuring streaming I/O performance

A single logical volume (LUN) is created from each volume group, owned by either controller A or B to balance the load. To ensure redundancy against path failure, the Linux multipath device-mapper is used on the storage servers. There are two active paths to each logical volume, one connected to each server. In other words, all the logical volumes are visible and accessible from both servers in a redundant fashion, in what is called an active-active multipath configuration. With active-active multipath, IBM Spectrum Scale can continue serving the file system to clients even if one of the storage nodes goes offline.
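A quick way to sanity-check this layout from each storage node is sketched below; the WWID prefix shown matches the DCS3860 LUN device names used later in this guide, and the exact output format will vary with the multipath-tools version.

# Sketch only: list each DCS3860 multipath map and confirm two active paths
# per LUN before defining NSDs on the /dev/mapper devices.
multipath -ll | grep -B1 -A6 360080e5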

Figure 2: IBM Spectrum Scale configuration diagram. Compute nodes connect to the two storage nodes through a Mellanox SX6036 FDR InfiniBand switch (2 nodes x 2 dual-port FDR IB HBAs per node on PCIe Gen3 x8 = 56GB/s); the storage nodes attach to the IBM DCS3860 storage modules over 12Gb/s SAS (2 nodes x 2 LSI SAS9300-16e per node on PCIe Gen3 x16 = 24GB/s, one connection per controller), with either 60x 6TB 7200RPM NL-SAS or 20x 800GB SSD drives.


Tuning parameters that were used to benchmark Spectrum Scale on SSD and NL-SAS are listed in the Best Practices section. The benchmark test was tuned for sequential write and random read. The cache settings on the IBM DCS3860 storage subsystem were read cache, write cache, and cache mirroring ON, and read prefetch OFF. With this tuning and configuration, test runs were performed on various numbers and combinations of SSD and/or NL-SAS based NSDs, starting with two and going up to 12 NSDs, to obtain the following summary of results:

Table 1: Cluster and scatter performance comparison. "Normalized" is throughput per drive (MB/sec); the final row gives the ratio of the normalized 20-drive SSD and NL-SAS results.

| Drives | Cluster Write (MB/sec) | Normalized | Cluster Read (MB/sec) | Normalized | Scatter Write (MB/sec) | Normalized | Scatter Read (MB/sec) | Normalized | Usable Capacity |
| 20 SSD | 3579 | 179.0 | 9426 | 471.3 | 2557 | 127.9 | 9124 | 456.2 | 12TB |
| 20 NL-SAS | 1963 | 98.15 | 2756 | 137.8 | 930 | 46.5 | 1480 | 74 | 88TB |
| 60 NL-SAS | 2221 | 37.02 | 5789 | 96.48 | 2271 | 37.85 | 4731 | 78.85 | 262TB |
| Normalized SSD:NL-SAS (20 drives) | | 1.82 | | 3.42 | | 2.75 | | 6.16 | |

Summarizing the benchmark results above leads to the following conclusions:

• The testing showed that SSD offered much better performance than NL-SAS in all scenarios, independent of I/O type (read, write, scatter, cluster, to and from tape)

• The normalized (20-drive) performance of the Spectrum Scale file system was, on average, 3.54 times faster on SSD than on NL-SAS devices over all tests

• Storage capacity of NL-SAS drives used was 7.7 times as large as SSD capacity in the same rack space

IBM Spectrum Archive

The IBM TS4500 Tape Library was deployed for this technical analysis. The IBM TS4500 tape library (Machine Type 3584) is a modular tape library that consists of a high-density base frame and up to 17 high-density expansion frames. The frames join side by side and can grow to the left or to the right of the base frame. A single cartridge accessor supports all frames. The L55 base frame was used, equipped with four LTO-6 and four LTO-7 tape drives.

A benchmark was performed for data migration (recall) to (from) a Linear Tape File System (LTFS) on both LTO-6 and LTO-7 tape drives. The source (target) files were hosted on a Spectrum Scale file system and were composed of key and output data for the Boeing 787-800 LS-Dyna application described in the next section. The total size of the data files for the 787 simulation was 320GB.

A Spectrum Scale file system was created on one NSD. The NSD was created on an underlying 10-disk RAID6: 8+P+Q (either NL-SAS or SSD) volume group with a segment size of 512KB, which is optimal for large sequential I/O. One LTFS was created across 4x LTO-6 cartridges, each with its own tape drive. Another LTFS was created across 4x LTO-7 cartridges, each with its own tape drive as well. Performance was then benchmarked between the two NSD types and the two LTO drive types, using the previously described application data files.

Two Lenovo System x3650 servers were used for the storage nodes. Both servers were running IBM Spectrum Scale version 4.2.0 standard edition. Additionally, one of the storage nodes was running IBM Spectrum Archive version ltfsee-1.2.1.0-10230. IBM Spectrum Archive enables direct and graphical access to data stored in IBM tape drives and libraries by incorporating the Linear Tape File System (LTFS) format standard for reading, writing, and exchanging descriptive metadata on formatted tape cartridges. The Lenovo servers running Spectrum Scale were direct attached via two LSI quad-port SAS9302-16e 12Gbps x16 SAS HBAs to the DCS3860 storage units. Additionally, the Lenovo server running Spectrum Archive was redundantly attached via a dual-port Emulex Lightpulse LPE 16002 to a Brocade 6505 switch that was connected redundantly to both ports of each of the 8 tape drives.


Hardware

• Two Lenovo System x3650 M5, dual-processor E5-2650 v4 12-core Broadwell servers used for storage nodes

- Dual socket Intel Xeon E5-2650 v4, 2.2GHz, 12 core processor, 128 GB DDR4 memory

- Two Mellanox dual-port FDR Infiniband PCIe Gen3 x8 HBA

- Two LSI quad-port SAS9302-16e 12Gbps x16 SAS HBAs

- One dual port Emulex Lightpulse LPE 16002 FC HBA

• One IBM DCS3860 storage unit with either 20 800GB SSD or 60 6TB 7.2K RPM NL-SAS drives

- 12Gbps SAS interface connection

- FW: 08.20.09.00; NVSRAM: N1813P38R0820V01 software code

- 10 drives (either NL-SAS or SSD) were used to create each logical drive, selecting 2 drives from each drawer of the DCS storage unit for drawer loss protection

- Each volume group was created using 10 drives with RAID6: 8+P+Q and a segment size of 512KB for large sequential I/O

• One Mellanox SX6036 FDR Infiniband switch

• Two redundant Brocade 6505 switches

• One IBM TS4500 Tape Library

- 4x LTO-6 tape drives with 4 LTO-6 cartridges

- 4x LTO-7 tape drives with 4 LTO-7 cartridges

- One LTFS pool was created across each set of four same-type drives

Software

• CentOS 7.2 for x86_64

• IBM Spectrum Scale version 4.2.0 standard edition was used to create a Spectrum Scale cluster and file system.

• IBM Spectrum Archive version ltfsee-1.2.1.0-10230

• IBM DS Storage Manager version 11.20 was used to create and configure the storage devices

• Mellanox OFED Linux driver on both the compute and storage nodes

• The "IOR HPC benchmark", a piece of public-domain software, was used to measure streaming I/O performance. IOR benchmarks parallel file systems using POSIX, MPIIO, or HDF5 interfaces.

! "# $%&' () *+*

, -. /, ,. 0, .. 1, +. ). *. .. 2. +. -/. ,01-),*.2+

! "# $%&' () *+*

, -. /, ,. 0, .. 1, +. ). *. .. 2. +. -/. ,01-),*.2+

Brocade 6505 switches

IBM TS45004x LTO6 and 4x LTO7 drives

2 nodes x 2 LSI-SAS9300-16e per node on PCIe Gen3 (x16) = 24GB/s

2 nodes x 2 x 2-port FDR IB HBA per node on PCIe Gen3 (x8) = 56GB/s

Storage Nodes

IBM DCS3860Storage Module

Mellanox SX6036 FDR Infiniband Switch

IBM DCS3860 2x quad-port 12Gbs SAS60x NL-SAS 6TB 7200RPM

IBM TS4500Tape Library

Figure 3: IBM Spectrum Archive configuration

diagram


Using the above configuration, the transfer times for the various storage drives and LTO tape drive types in the following table were obtained:

Table 2: Transfer time comparison

| | 6TB 7200 RPM NL-SAS: Migrate (sec) | Recall (sec) | 800GB SSD: Migrate (sec) | Recall (sec) |
| LTO-6 | 594 | 703 | 405 | 576 |
| LTO-7 | 385 | 559 | 311 | 491 |
| Performance increase, LTO-6 to LTO-7 | 67.3% | 38.66% | 30.23% | 17.31% |

In summary, we were able to get close to or better than the advertised LTO-6 (160 MB/sec) and LTO-7 (300 MB/sec) performance thresholds with very little tuning. Specifically, 87.9% (55.7%) of theoretical LTO-7 performance on migrates (recalls) to (from) SSD was achieved. The LTO-7 drives performed much better than the LTO-6 tape drives, as advertised. The use case data was a mix of both large and small files; actual results will vary according to the number and size of files. The use case data used to calculate the times above was determined to be around 40% compressible, and the compression was performed automatically by the tape drive hardware. Finally, this solution makes it very easy to scale up tape performance by adding additional tape drives.
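As a rough check on the percentages quoted above: moving 320GB (about 327,680MB) with four LTO-7 drives in 311 seconds corresponds to roughly 1,054 MB/sec, which is about 87.9% of the 4 x 300 MB/sec theoretical aggregate, and the 491-second recall corresponds to roughly 667 MB/sec, or about 55.6% of theoretical.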

LS-Dyna

The use case that was selected for this reference architecture is a solution currently being used by the National Institute for Aviation Research (NIAR) at Wichita State University. NIAR is currently using a finite element program called LS-Dyna to simulate and perform impact analysis of real-world problems. Two recent simulations that NIAR has modeled are a Dodge Neon crash simulation and a Boeing 787-800 hard landing simulation. Both of these simulations have been performed by NIAR on a single Dell PowerEdge R920 four-socket Intel Xeon E7-4880 v2, 2.5GHz, 60-core server running Windows Server 2012 R2 Standard with the HPC Add-on and using single-precision LS-Dyna version 9.71.

The Dodge Neon simulation was the smaller of the two and was used for vetting the distributed memory solution designed to increase the performance of the LS-Dyna application. The larger Boeing 787 simulation was then performed with the vetted architecture to measure any performance increase of the LS-Dyna simulation that came from using multiple compute nodes and a distributed memory solution. Additionally, the larger data set associated with the Boeing 787 simulation was used to benchmark the migrate and recall rates to and from tape for both SSD and NL-SAS based Spectrum Scale NSDs, as further described in the IBM Spectrum Archive section.

Each of the simulations was run on two architectures for this reference. The first architecture used six IBM x3690s as the compute nodes and the second architecture used four Lenovo System x3650 servers. Both of these server types are described in the IBM Spectrum Scale section. Open MPI version 1.10.2 was used to perform the same simulations using the MPP package of LS-Dyna version 8.1.105897 on the four- and six-server configurations in a distributed memory (DMP) environment. The compute nodes were connected to each other through a Mellanox SX6036 FDR InfiniBand switch.

A Spectrum Scale file system was used to store the LS-Dyna key and output files. Two Lenovo System x3650 servers were used as Spectrum Scale servers. In the second configuration, the two Lenovo servers running Spectrum Scale were also compute nodes running the LS-Dyna application as described in the previous paragraph. In addition, one of the Spectrum Scale servers was also running Spectrum Archive. IBM Spectrum Scale version 4.2.0 standard edition and IBM Spectrum Archive version ltfsee-1.2.1.0-10230 were used. The Spectrum Scale file system was composed of two NSDs, each resident on a 10-drive NL-SAS volume group created with RAID6: 8+P+Q and a segment size of 512KB, as described in the preceding sections.


The details of the servers in each configuration and software packages used are:

Hardware Configuration #1

• Six IBM System x3690 X5 servers used for compute nodes

- Dual-socket Intel Xeon E6540, 2.0GHz, 6-core processors, 128GB DDR3 memory

- Two Mellanox single-port FDR Infiniband PCIe Gen3 x8 HBA

• Two Lenovo System x3650 M5, dual-processor E5-2650 v4 12-core Broadwell servers used for storage nodes

- Dual socket Intel Xeon E5-2650 v4, 2.2GHz, 12 core processor, 128 GB DDR4 memory

- Two Mellanox dual-port FDR Infiniband PCIe Gen3 x8 HBA

- Two LSI quad-port SAS9302-16e 12Gbps x16 SAS HBAs

• One Mellanox SX6036 FDR Infiniband switch.

• One IBM DCS3860 direct attached storage module with 12Gbps SAS interface

- 60 6TB 7.2K RPM NL-SAS

- FW: 08.20.09.00; NVSRAM: N1813P38R0820V01

- Six logical drives were created, each from 10 NL-SAS drives, selecting 2 HDDs from each drawer of the DCS storage unit for drawer loss protection

- Each volume group was created using the 10 HDDs with RAID level 6: 8+P+Q and a segment size of 512KB for large sequential I/O

Hardware Configuration #2

• Four Lenovo System x3650 M5, dual-processor E5-2650 v4 12-core Broadwell servers used for compute nodes (two of these were also used for storage nodes)

- Dual socket Intel Xeon E5-2650 v4, 2.2GHz, 12 core processor, 128 GB DDR4 memory

- Two Mellanox dual-port FDR Infiniband PCIe Gen3 x8 HBA

- Two LSI quad-port SAS9302-16e 12Gbps x16 SAS HBAs

• One Mellanox SX6036 FDR Infiniband switch.

• One IBM DCS3860 direct attached storage module with 12Gbps SAS interface

- 60 6TB 7.2K RPM NL-SAS

- FW: 08.20.09.00; NVSRAM: N1813P38R0820V01

- Six logical drives were created, each from 10 NL-SAS drives, selecting 2 HDDs from each drawer of the DCS storage unit for drawer loss protection

- Each volume group was created using the 10 HDDs with RAID level 6: 8+P+Q and a segment size of 512KB for large sequential I/O

Figure 4: LS-Dyna hardware configuration #1 diagram. The compute nodes and the two storage nodes connect through a Mellanox SX6036 FDR InfiniBand switch (2 nodes x 2 dual-port FDR IB HBAs per node on PCIe Gen3 x8 = 56GB/s); the storage nodes attach to the IBM DCS3860 storage module (2x quad-port 12Gb/s SAS, 60x 6TB 7200RPM NL-SAS) via 2 nodes x 2 LSI SAS9300-16e per node on PCIe Gen3 x16 = 24GB/s.


Figure 5: LS-Dyna hardware configuration #2 diagram. The four Lenovo x3650 nodes (two of which also act as storage nodes) connect through a Mellanox SX6036 FDR InfiniBand switch (2 nodes x 2 dual-port FDR IB HBAs per node on PCIe Gen3 x8 = 56GB/s); the storage nodes attach to the IBM DCS3860 storage module (2x quad-port 12Gb/s SAS, 60x 6TB 7200RPM NL-SAS) via 2 nodes x 2 LSI SAS9300-16e per node on PCIe Gen3 x16 = 24GB/s.

Software

• CentOS 7.2 for x86_64 (compute and storage nodes)

• Open MPI version 1.10.2 was used to perform the same simulations using the MPP package of LS-Dyna version 8.1.105897

• IBM Spectrum Scale version 4.2.0 standard edition was used to create a Spectrum Scale cluster and file system

• IBM DS Storage Manager version 11.20 was used to create and configure the storage devices

• Mellanox OFED Linux driver on both the compute and storage nodes

Using the single-precision build of LS-Dyna on the single Dell PowerEdge R920, it took NIAR 60 hours to run the Boeing 787-800 simulation. The Dodge Neon simulation was used as a baseline simulation; previous hardware architectures and runtimes at NIAR are unavailable for it. The total size of the LS-Dyna key and d3plot files for the Dodge Neon simulation was 1.5GB, and the total size of the key and d3plot files for the Boeing 787-800 simulation was 320GB. The times for the various runs performed on the different architectures are summarized in the table below:

Table 3: Simulation runtime comparison

| | Dodge Neon | Boeing 787-800 |
| Single Dell PowerEdge R920 | NA | 60 hours |
| Six IBM x3690 | 264 sec | 96 hours |
| Four Lenovo System x3650 | 135 sec | 46.5 hours |

Additional runtimes showing the scalability of the performance improvement that comes from adding additional processors to the Dodge Neon simulation are in the Appendix.

Tiered Solution

To test the viability of the entire solution, the Boeing 787-800 LS-Dyna simulation was run concurrently while multiple data migrations and recalls to and from the TS4500 tape library were performed with random data on the same Spectrum Scale file system. For this test, four Lenovo servers were used for the LS-Dyna simulation. Two of these servers were also used as Spectrum Scale servers, and one of the two servers running Spectrum Scale was also running Spectrum Archive. The data migration to and recall from tape was performed on the LTO-6 and LTO-7 drives defined in the IBM Spectrum Archive section. A high-level overview of the configuration used is below:


Hardware

• Four Lenovo System x3650 M5, dual-processor E5-2650 v4 12-core Broadwell servers used for compute nodes (two of these were also used for storage nodes and one of these was also running Spectrum Archive)

- Dual socket Intel Xeon E5-2650 v4, 2.2GHz, 12 core processor, 128 GB DDR4 memory

- Two Mellanox dual-port FDR Infiniband PCIe Gen3 x8 HBA

- Two LSI quad-port SAS9302-16e 12Gbps x16 SAS HBAs

- One dual port Emulex Lightpulse LPE 16002 FC HBA (Spectrum Archive node)

• One Mellanox SX6036 FDR Infiniband switch

• Two redundant Brocade 6505 switches

• The Lenovo servers were direct attached to an IBM DCS storage unit

• One IBM DCS3860 direct attached storage module with 12Gbps SAS interface

- 60 6TB 7.2K RPM NL-SAS

- FW: 08.20.09.00; NVSRAM: N1813P38R0820V01

- Six logical drives were created, each from 10 NL-SAS drives, selecting 2 HDDs from each drawer of the DCS storage unit for drawer loss protection

- Each volume group was created using the 10 HDDs with RAID6: 8+P+Q and a segment size of 512KB for large sequential I/O

• One IBM TS4500 Tape Library

- 4x LTO-6 tape drives with 4 LTO-6 cartridges

- 4x LTO-7 tape drives with 4 LTO-7 cartridges

- One LTFS pool was created across each set of four same-type drives

Software

• CentOS 7.2 for x86_64 (compute and storage nodes)

• Open MPI version 1.10.2 was used to perform the same simulations using the mpp package of LS-Dyna version 8.1.105897

• IBM Spectrum Scale version 4.2.0 standard edition was used to create a Spectrum Scale cluster and file system

• IBM Spectrum Archive version ltfsee-1.2.1.0-10230

Figure 6: Tiered Solution diagram. IBM Spectrum Scale manages an SSD pool (hot data) and an NL-SAS pool (warm data) on the IBM DCS3860, while IBM Spectrum Archive provides the tape pool (cold data) on the IBM TS4500.


• IBM DS Storage Manager version 11.20 was used to create and configure the storage devices

• Mellanox OFED Linux driver on both the compute and storage nodes
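To exercise the tiers concurrently, the tape migration can be driven in the background while the LS-Dyna job runs against the same file system. A minimal sketch is shown below; run_lsdyna.csh is a placeholder name for the csh launcher listed in the Best Practices section, and the policy file is the migrate policy shown there as well.

# Sketch only: start a policy-driven migration of the random test data in the
# background, then launch the LS-Dyna simulation against the same file system.
mmapplypolicy /gpfs1/tape -P /root/policy_flape > /tmp/migrate.log 2>&1 &
csh /root/run_lsdyna.csh
wait   # block until the background migration completes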

The data migrations and recalls had no noticeable performance impact on the LS-Dyna simulations for the use case scenario being tested. This is because the limiting performance factor in our configuration was the application's processor demand, not the file system performance of either of the Spectrum Storage solutions. In summary, data could be staged from, and de-staged to, the tape library without impacting the performance of the host application(s). Based on this result, it was determined that the FLAPE (flash plus tape) solution tested for this exercise is viable and reliable for the use case that was tested.

Hardware Components

IBM TS4500 Tape Library

IBM TS4500 Tape Library is a next-generation storage solution designed with dual robotic accessors to store up to 5.5 petabytes (PBs) of uncompressed data in a single library or scale up at 1 PB per square foot to 175.5 PBs. The IBM TS4500 answers the challenges of data volume growth in cloud and hybrid cloud infrastructures, increasing cost of storage footprints, the difficulty of migrating data across vendor platforms, and the increased complexity of IT training and management as staff resources shrink.

| Drive Types | Frame Definition | Number of Tape Cartridges | Capacity |
| LTO Ultrium 7 | L55 | Up to 882 | Up to 139 PB per library |
| LTO Ultrium 6 | L55 | | Up to 57 PB per library |

! "# $%&' () *+*

, -. /, ,. 0, .. 1, +. ). *. .. 2. +. -/. ,01-),*.2+

! "# $%&' () *+*

, -. /, ,. 0, .. 1, +. ). *. .. 2. +. -/. ,01-),*.2+

Brocade 6505 switches

IBM TS45004x LTO6 and 4x LTO7 drives

2 nodes x 2 LSI-SAS9300-16e per node on PCIe Gen3 (x16) = 24GB/s

2 nodes x 2 x 2-port FDR IB HBA per node on PCIe Gen3 (x8) = 56GB/s

Storage Nodes

IBM DCS3860Storage Module

Mellanox SX6036 FDR Infiniband Switch

IBM DCS3860 2x quad-port 12Gbs SAS60x NL-SAS 6TB 7200RPM

IBM TS4500Tape Library

Storage NodesFigure 4: Tiered Solution configuration diagram


DCS3860

The IBM System Storage DCS3860 storage system delivers the performance and scalability organizations need to succeed in this new era of big data. Designed for high-performance computing applications, the DCS3860 system supports up to 60 drives in just 4U of rack space, and it can scale up to 360 drives, including up to 24 solid-state drives (SSDs), with the attachment of five expansion units.

Benefits and features:

• 12 Gbps SAS extends storage performance
• RAID levels 0, 1, 3, 5, 6, 10 and DDP
• Read cache uses SSDs as a level-two data cache
• 12 Gbps SAS, 10 Gbps iSCSI, and 16 Gbps FC host interface cards (HICs)
• 128GB RAM

This high-density system also helps make the most of today’s IT budgets by increasing capacity while reducing the storage footprint, power consumption and related operational costs.

Lenovo System x3650 M5

With the powerful, versatile 2U two-socket System x3650 M5 rack server, you can run even more enterprise workloads, 24/7, and gain faster business insights. Integrated with up to two Intel® Xeon® E5-2600 v4 series processors (up to 44 cores per system), fast TruDDR4 2400MHz memory, and massive storage capacity, the x3650 M5 fast forwards your business.

You can select from an impressive array of storage configurations (up to 28 drive bays) that optimize diverse workloads from Cloud to Big Data.

Benefits and features:

• 12Gb SAS storage controller
• Up to 112TB internal storage
• Up to 44 cores in one box
• Up to 1.5TB RAM
• PCI-E 3.0
• Integrated management module
• Lightpath diagnostics
• Supports self-encrypting drives
• As tested: 128GB RAM, dual Intel Xeon E5-2650 v4 @ 2.2GHz (12 cores each)

IBM X3690

IBM® System x3690 X5 is a powerful two-socket 2U rack-mount server using the latest Intel Xeon processors. The x3690 X5 servers can be combined with the IBM MAX5 memory expansion unit for up to 2 TB of memory. Add to that the 16 2.5-inch disk drive bays and you have a high performance workhorse in a rack-dense package. The x3690 X5 server belongs to the family of a new generation of Enterprise X-Architecture® servers.

The server delivers innovation with enhanced reliability and availability features to enable optimal performance for databases, enterprise applications, and virtualized environments.

Benefits and features:

• 16 HDD bays, SAS/SATA
• Up to 2TB RAM
• Onboard RAID controller
• Integrated management module
• Lightpath diagnostics
• PCI-E 2.0
• As tested: 128GB RAM, dual Intel Xeon E6540 @ 2.0GHz (6 cores each)


Mellanox SX6036 InfiniBand Switch

This 36-port, managed bidirectional 56Gb/s (per-port) InfiniBand/VPI SDN switch provides the highest performing fabric solution in a 1U form factor by delivering up to 4Tb/s of non-blocking bandwidth with 200ns port-to-port latency.

Superb for PCI Express Gen3 servers, Virtual Protocol Interconnect (VPI) simplifies system development by serving multiple fabrics with one hardware design, running both InfiniBand and Ethernet subnets on the same chassis.

Benefits and features:

• Virtual Protocol Interconnect
• Ultra-low latency
• Granular QoS for cluster, LAN, and SAN traffic
• Eliminates fabric congestion
• Cluster and converged fabric management
• 4.032Tb/s switching capacity
• FDR/FDR10 support for Forward Error Correction
• IBTA Specification 1.3 and 1.21 compliant
• QoS, port mirroring, adaptive routing
• Integrated subnet manager agent (up to 648 nodes)
• 36 FDR (56Gb/s) ports in a 1U switch

Brocade 6505 Switch

The Brocade 6505 combines market-leading throughput with an affordable switch form factor, making it ideal for growing SAN workloads. The 24 ports produce an aggregate 384 Gbps full-duplex throughput; any eight ports can be trunked for 128 Gbps Inter-Switch Links (ISLs). Exchange-based Dynamic Path Selection (DPS) optimizes fabric-wide performance and load balancing by automatically routing data to the most efficient and available path in the fabric.

In addition, the Brocade 6505 switch augments ISL trunking to provide more effective load balancing in certain configurations while providing a low Total Cost of Ownership (TCO) thanks to a 12-port base configuration, easy administration, a 1U footprint, and low energy consumption (0.22 watts per Gbps, 3.3 watts per port). Enterprise-class capabilities combined with a low TCO yield 40 percent higher performance compared to 10 Gigabit Ethernet (GbE) alternatives at a similar cost.

Benefits and features:

• Forward Error Correction (FEC)
• Hot-swap module replacement
• IP address filtering, IPsec pass-through
• IPv6 support, ISL support, LDAP support
• Management Information Base (MIB)
• Ports on Demand
• Quality of Service (QoS), Registered State Change Notification (RSCN) support
• Role-based access control
• Syslog support
• Transparent (N_Port ID Virtualization, NPIV) mode support
• LUN zoning


Mellanox Infiniband ConnectX-3 HBA

ConnectX-3 adapter cards with Virtual Protocol Interconnect (VPI) supporting InfiniBand and Ethernet connectivity provide the highest performing and most flexible interconnect solution for PCI Express Gen3 servers used in Enterprise Data Centers, High-Performance Computing, and Embedded environments.

Clustered databases, parallel processing, transactional services, and high-performance embedded I/O applications will achieve significant performance improvements, resulting in reduced completion time and lower cost per operation. ConnectX-3 with VPI also simplifies system development by serving multiple fabrics with one hardware design.

Models used in our testing were the MCX353A-FCBT (single port) and the MCX354A-FCBT (dual port), with the following settings in /etc/modprobe.d/mlnx.conf:

# Module parameters for MLNX_OFED kernel modules
options mlx4_core log_num_mtt=24
options mlx4_core log_mtts_per_seg=0
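A hedged sketch of one way to pick up and verify the new values is shown below; the openibd service is installed by MLNX_OFED, and simply rebooting the node works as well.

# Reload the Mellanox driver stack so the mlx4_core parameters take effect,
# then read back the values the kernel actually applied.
/etc/init.d/openibd restart
cat /sys/module/mlx4_core/parameters/log_num_mtt
cat /sys/module/mlx4_core/parameters/log_mtts_per_seg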

Benefits and features:

• One adapter for FDR/QDR InfiniBand, 10/40 GbE Ethernet, or Data Center Bridging fabrics
• World-class cluster, network, and storage performance
• Guaranteed bandwidth and low-latency services
• I/O consolidation
• Virtualization acceleration
• Power efficient
• Scales to tens of thousands of nodes
• Virtual Protocol Interconnect
• 1μs MPI ping latency
• Up to 56Gb/s InfiniBand
• Single- and dual-port options available
• PCI Express 3.0 (up to 8GT/s)
• CPU offload of transport operations
• Application offload
• GPU communication acceleration
• Precision Clock Synchronization
• End-to-end QoS and congestion control
• Hardware-based I/O virtualization
• Fibre Channel encapsulation (FCoIB or FCoE)
• Ethernet encapsulation (EoIB)

Emulex Lightpulse LPE 16002

The new 16 Gb/s Fibre Channel Host Bus Adapter (HBA) is optimal for mission critical deployments. The Fibre Channel HBA LPe16000/16002 takes performance to a new level with PCIe 3.0 support, data integrity capabilities and cloud scale reliability with ease of use.

Organizations looking to improve performance, protect against data corruption, or simplify high-performance FC environments can benefit from deploying the LPe16000/16002.

For this reference architecture the following settings were made in /etc/modprobe.d/lpfc.conf:

# config file to be placed in modprobe.d
# directory for emulex lightpulse fc hba
options lpfc lpfc_use_msi=2
options lpfc lpfc_sg_seg_cnt=256

Benefits and features:

• Integrates seamlessly into existing SANs
• Supports IT server consolidation initiatives
• Assures data availability and data integrity
• Supports N_Port ID Virtualization (NPIV) and Virtual Fabric
• BlockGuard ready
• Support for 16 Gb/s, 8 Gb/s, and 4 Gb/s Fibre Channel devices


SAS HBA: LSI SAS 3008

The LSI SAS3008 HBAs (which share the same controller as the LSI 9300-16e HBAs) are engineered to deliver maximum performance. Delivering over a million IOPS and 6,000+ MB/s, the cards are designed to meet the growing demands of enterprises that require even more robust throughput in a range of applications that includes transactional databases, Web 2.0, data mining, and video streaming and editing.

The LSI SAS 3008 controller supports 8 lanes of PCIe 3.0, provides SATA/SAS links at rates between 3 and 12Gb/s, and supports RAID 0, RAID 1, and RAID 10 in the IR version.

Benefits and features:

• DataBolt technology: 12Gb/s SAS with existing 6Gb/s devices
• Support for RAID levels 0, 1, 1E, 10, and MegaRAID options
• Delivers more than a million IOPS and 6,000 MB/s throughput
• SAS transfer rates of up to 12Gb/s
• SATA rates up to 6Gb/s
• Eight PCI Express 3.0 lanes

Best Practices

Testing was conducted at the Ennovar™ Institute of Emerging Technologies and Market Solutions, located at Wichita State University, where several best practice guidelines were either followed or discovered for this reference architecture. This section contains a list of these guidelines. Note that some of these guidelines are user or use-case dependent and may not be optimal for every particular environment.

IBM Spectrum Scale with DCS3860 Tuning and Best Practices for Linux

At the system level, if there are more than two nodes in a cluster, then hyperthreading should be disabled in the BIOS. Turbo boost provides little benefit once you move past two nodes and should be disabled on most modern systems. Fast, RDMA-capable interconnectivity between nodes will significantly minimize computational latency.

To benchmark Spectrum Scale, the public-domain software IOR was used. IOR was developed for benchmarking parallel file systems using POSIX, MPIIO, or HDF5 interfaces, and it is a good tool to measure streaming I/O performance. The execution command used for this benchmark is:

mpirun -n total_number_of_processes --hostfile hostsfile IOR -a POSIX -w -r -Z -t 4M -b 20G -F -e -g

where:

-a: API for I/O [POSIX|MPIIO|HDF5|NCMPI]
-w: write file
-r: read existing file
-Z: access is to random offsets
-t: size of transfer in bytes (e.g., 8, 4k, 2m, 1g)
-b: contiguous bytes to write per task (e.g., 8, 4k, 2m, 1g)
-F: file-per-process
-e: perform fsync upon POSIX write close
-g: use barriers between open, write/read, and close

The controller cache configuration settings used were: read cache on, dynamic prefetch off, write cache on, and write cache mirroring on. Write performance can be significantly (~100%) improved by disabling cache mirroring on the DCS3860. However, this obviously comes at the expense of losing data redundancy at the cache level.

A network shared disk (NSD) is created before creating a Spectrum Scale file system. The NSDs are created according to the stanza file. The below example NSD definition file creates two Spectrum Scale storage pools using two NSDs per pool. NSDs S1V00 and S1V01 contain metadata only stored on a pool named SSD_Metadata, while NSDs S1V02 and S1V03 contain data only stored on a pool named system. In a typical tiered file system, the SSD_Metadata pool would be created on underlying SSD drives and the system pool would be created on HDDs in order to optimize the cost/performance ratio.


%nsd: nsd=S1V00 device=/dev/mapper/360080e50004316780000126d57d9482e servers=lenovo1_gpfs,lenovo2_gpfs usage=metadataOnly pool=SSD_Metadata
%nsd: nsd=S1V01 device=/dev/mapper/360080e50004316780000126357d94766 servers=lenovo2_gpfs,lenovo1_gpfs usage=metadataOnly pool=SSD_Metadata
%nsd: nsd=S1V02 device=/dev/mapper/360080e5000430fc00000126357d946b9 servers=lenovo2_gpfs,lenovo1_gpfs usage=dataOnly pool=system
%nsd: nsd=S1V03 device=/dev/mapper/360080e50004316780000126857d947cd servers=lenovo1_gpfs,lenovo2_gpfs usage=dataOnly pool=system

An important attribute of the stanza file is that the primary/secondary server for each NSD can be set here. For better I/O performance, it is best to balance the I/O as much as possible. To do this, the logical volumes owned by controller A and controller B on each DCS storage unit should be alternated between primary and secondary servers in order to balance the I/O.

File system specific configuration settings that were modified from their default values are:

flag  value    description
-i    4096     Inode size in bytes
-m    1        Default number of metadata replicas
-r    1        Default number of data replicas
-j    cluster  Block allocation type
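A minimal sketch of how these pieces fit together is shown below; the stanza file path is a placeholder, and the 4MB block size is the one discussed in the notes that follow.

# Sketch only: create the NSDs from the stanza file shown above, then create
# the file system with the non-default settings listed in this section.
mmcrnsd -F /root/nsd.stanza
mmcrfs gpfs1 -F /root/nsd.stanza -B 4M -i 4096 -m 1 -r 1 -j cluster -T /gpfs1 -A yes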

GPFS global configuration settings modified from their defaults are:

verbsRdma enable
prefetchThreads 144
nsdMaxWorkerThreads 1024
verbsRdmasPerConnection 14
verbsRdmasPerNode 1024
verbsRdmaSend no
scatterBufferSize 262144
worker1Threads 256
nsdThreadsPerQueue 12
nsdSmallThreadRatio 1
verbsPorts mlx4_0/1 mlx4_1/1
pagepool 20G

• The file system block size was chosen to match the RAID stripe size of the logical volume on the DCS in order to minimize the partial stripe write penalty: 4MB = 512KB x 8

• pagepool: size of the GPFS file data block cache

• worker1Threads: total number of concurrent requests that can be processed at one time.

• nsdSmallThreadRatio: the ratio of NSD server queues for small IO (default < 64KiB) to the number of NSD server queues that handle large IO (> 64KiB)

• nsdThreadsPerQueue: number of threads assigned to process each NSD server IO queue

• nsdMaxWorkerThreads: maximum number of NSD threads on an NSD server

• worker1Threads + prefetchThreads + nsdMaxWorkerThreads < 1500 on 64bit architectures

• scatterBufferSize: this size determines how much file data can be stored in the pagepool


• The value of worker1Threads is changed for better streaming I/O.

• RDMA is used to transport the data over the InfiniBand network.
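A hedged sketch of applying the settings above with mmchconfig follows; most of these values only take effect after the GPFS daemons are restarted, so a maintenance window is assumed.

# Apply a few of the non-default cluster settings, restart GPFS on all nodes,
# and confirm the values now in effect.
mmchconfig pagepool=20G,worker1Threads=256,prefetchThreads=144,nsdMaxWorkerThreads=1024
mmchconfig verbsRdma=enable,verbsPorts="mlx4_0/1 mlx4_1/1"
mmshutdown -a && mmstartup -a
mmlsconfig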

IBM Spectrum Scale uses large file system block sizes; in this benchmark, 4MB. The default kernel parameters that come with Linux distributions are too small to take advantage of Spectrum Scale and should be modified. The following parameters should be changed to take advantage of the Spectrum Scale large file system block size:

• max_hw_sectors_kb: The maximum size of an I/O request the device can handle in kilobytes. Set to 4096 for our testing.

• max_sectors_kb: The maximum size of an I/O request in kilobytes. The default value is 512 KB. The maximum value for this parameter can’t exceed the value of max_hw_sectors_kb. Set to 4096 for our testing.

• read_ahead_kb: The maximum amount of data the operating system reads ahead during a sequential read operation, in order to store information likely to be needed soon in the page cache. Set to 4096 for our testing.

• nr_requests: The maximum number of read and write requests that can be queued at one time. The default value is 128. Set to 16 for our testing.

• scheduler: The I/O scheduler determines when and for how long I/O operations run on a storage device; it is also known as the I/O elevator. The default value is deadline for all block devices except SATA drives, and cfq for SATA drives. Set to noop for our testing.

max_sectors_kb and read_ahead_kb should be larger than or equal to the file system block size. The value of max_sectors_kb cannot exceed max_hw_sectors_kb; for our testing, max_hw_sectors_kb is 16MB. If you have a good storage unit, the "noop" scheduler should be selected.

NOTE: These parameters will be reset every time the server is restarted, so they must be reapplied after a reboot or set in the system start-up file.
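One way to reapply them automatically is sketched below (for example from /etc/rc.d/rc.local); the /dev/mapper name pattern matches the multipath devices used for the NSDs in this guide and should be adjusted for other environments.

#! /usr/bin/bash
# Sketch only: re-apply the block-layer tuning for every DCS3860 multipath
# device after a reboot.
for dm in /dev/mapper/360080e5*; do
    dev=$(basename "$(readlink -f "$dm")")   # resolve the underlying dm-N name
    echo 4096 > /sys/block/$dev/queue/max_sectors_kb
    echo 4096 > /sys/block/$dev/queue/read_ahead_kb
    echo 16   > /sys/block/$dev/queue/nr_requests
    echo noop > /sys/block/$dev/queue/scheduler
done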

Spectrum Archive

To improve the performance of Spectrum Archive, the following changes were made in /etc/modprobe.d/lpfc.conf to the Emulex Fibre Channel driver settings:

# config file to be placed in modprobe.d
# directory for emulex lightpulse fc hba
options lpfc lpfc_use_msi=2
options lpfc lpfc_sg_seg_cnt=256

No other changes were necessary to tune Spectrum Archive. Spectrum Archive automatically performed hardware compression on our use-case data, resulting in better than estimated threshold values. The actual performance of Spectrum Archive is dependent on the data set being used. For reference, in our use cases it was determined that the Dodge Neon data was 21% compressible and the Boeing 787-800 data was about 40% compressible.

Compressibility is not easy to determine when striping data across multiple tape drives. To determine the compressibility of our data, the data was first written to a single drive. We then queried LTFS EE to find out how many MBs were remaining on tape, migrated the dataset, and queried again. The compressibility ratio was the number of MBs used on tape divided by the total number of MBs in the dataset.
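As an illustrative example of the calculation (the tape-usage figure here is made up, not a measured value): if migrating the 320GB (327,680MB) Boeing 787-800 dataset had consumed about 196,600MB of tape capacity, the ratio 196,600 / 327,680 ≈ 0.60 would indicate the data shrank to roughly 60% of its original size, i.e. about 40% compressible, which matches the figure quoted above.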

For our testing, the following migrate policy was used:

define(is_premigrated,(MISC_ATTRIBUTES LIKE '%M%' AND MISC_ATTRIBUTES NOT LIKE '%V%'))
define(is_migrated,(MISC_ATTRIBUTES LIKE '%V%'))
define(is_resident,(NOT MISC_ATTRIBUTES LIKE '%M%'))
define(MB,1048576)
define(user_exclude_list,(PATH_NAME LIKE '/gpfs2/.ltfsee/%' OR PATH_NAME LIKE '/gpfs2/.SpaceMan/%'))

RULE EXTERNAL POOL 'flape'
EXEC '/opt/ibm/ltfsee/bin/ltfsee' OPTS '-p primary backup'

RULE 'flape' MIGRATE FROM POOL 'system'
TO POOL 'flape'
FOR FILESET ('flape')
WHERE (is_resident OR is_premigrated)
AND NOT user_exclude_list
AND FILE_SIZE > 0
/* AND (CURRENT_TIMESTAMP - MODIFICATION_TIME > INTERVAL '5' MINUTES) */

Additionally, the following script was used to benchmark the Spectrum Archive performance:

#! /usr/bin/bash
# this script will recall and re-migrate data that has already been committed to tape
# define our binaries
DATE="/usr/bin/date"
ECHO="/usr/bin/echo"
FIND="/usr/bin/find"
IOSTAT="/usr/bin/iostat"
KILLALL="/usr/bin/killall"
LOGGER="/usr/bin/logger"
LS="/usr/bin/ls"
LTFSEE="/opt/ibm/ltfsee/bin/ltfsee"
MMAPPLY="/usr/lpp/mmfs/bin/mmapplypolicy"
PERL="/usr/bin/perl"
TEE="/usr/bin/tee"
# this is the tag to use for log output to /var/log/messages
# so we can filter for messages from this script
LOGTAG="FLAPE"
COMBO="LTO6SSD"
# iostat parameters
IOSTATPARMS="-xkt 5"
# FILESYSTEM
FS=gpfs1
# enter directory where files are stored
cd /$FS/tape
# get time and date
TIME=$($DATE +"%H-%M")
DATESTAMP=$($DATE +"%d-%m-%Y")
filename=LTO6_SSD_${TIME}_${DATESTAMP}
# log the start of the script
$LOGGER -t $LOGTAG $COMBO" Beginning recall and re-migrate script execution"
# start iostat and background it
$IOSTAT $IOSTATPARMS > /tmp/stats.log &
# we are assuming the data has been migrated before we start; add a check later to verify this
# clear cache
$ECHO 3 > /proc/sys/vm/drop_caches
# print start timestamp to log
$LOGGER -t $LOGTAG "Recall start"
# get list of files and recall migrated data
#$FIND . -type f | $LTFSEE recall | $TEE ~/bench_results/$filename-recall.log
$LS | $LTFSEE recall | $TEE ~/bench_results/$filename-recall.log
# tell when the recall finished
$LOGGER -t $LOGTAG "Recall finish"
# do a repair to make the files resident again so we can get an accurate migrate time
$LOGGER -t $LOGTAG "Repairing files to get them back to resident on storage before migration"
# get filelist
FILES=`$LS /$FS/tape/`
# iterate through the list
for i in $FILES; do
    # repair the file so it is resident
    $LTFSEE repair /$FS/tape/$i
done
# clear cache
$ECHO 3 > /proc/sys/vm/drop_caches
# beginning timestamp for migration
$LOGGER -t $LOGTAG "Migrate start"
# re-migrate the data
$MMAPPLY /$FS/tape -P /root/policy_flape | $TEE ~/bench_results/$filename-migrate.log
# ending timestamp for data migration
$LOGGER -t $LOGTAG "Migrate finish"
# stop iostat data collection
$KILLALL -9 iostat
# tell the user we are done
$LOGGER -t $LOGTAG $COMBO" Script execution complete."
$PERL /root/bin/email2.pl

LS-Dyna

The following script was used to run LS-Dyna in the distributed memory environment for our testing:

#! /usr/bin/csh
#setenv LSTC_MEMORY AUTO
setenv DYNA_SCRDIR "/gpfs1/LS-dyna/jobs/AirCraftModel"
setenv DYNA_ARGS "i=/gpfs1/LS-dyna/jobs/AirCraftModel/737-800-FULL-ASSY.key memory=1002m memory2=300m ncpu=-96"
cd $DYNA_SCRDIR
setenv LSTC_FILE /usr/local/lstc/LSTC_FILE
setenv HOSTFILE /gpfs1/LS-dyna/mpd.hosts
setenv NUMPROCS 96
# --mca orte_base_help_aggregate 0
/opt/openmpi-1.10.2/bin/mpirun --allow-run-as-root --wdir /gpfs1/LS-dyna/jobs/AirCraftModel -np $NUMPROCS -machinefile $HOSTFILE /opt/LS-dyna/ls-dyna_mpp_s_r8_1_105897_x64_redhat54_ifort131_sse2_openmpi183 $DYNA_ARGS

HOSTFILE contains a list of the hosts to be used as compute nodes. It was found during our testing that the format of HOSTFILE was very important to the performance of the LS-Dyna simulation. If the hosts in HOSTFILE are all listed on the same line, the performance of LS-Dyna is much slower than when each host is on its own line, so it is important to put each host on a separate line in HOSTFILE. Also, it is important to match the CPU count passed to LS-Dyna to the actual number of physical processor cores.
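A sketch of a correctly formatted HOSTFILE is shown below, written to the path used by the launch script; the node names are placeholders, and the Open MPI slots value should match the number of physical cores on each node (24 on the Lenovo x3650 M5 nodes).

# Sketch only: one host per line, with a per-node slot count.
cat > /gpfs1/LS-dyna/mpd.hosts <<'EOF'
lenovo1 slots=24
lenovo2 slots=24
lenovo3 slots=24
lenovo4 slots=24
EOF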

LS-Dyna was found to be very particular about the memory parameters used during our testing and would generate seemingly random error messages and crash when the values were unsuitable. For our tests, we used the following guidelines for the two memory parameters:

memory: 30-70% of available memory
memory2: 20-40% of available memory

Conclusion

This solution can most likely be extended to any architecture where the bottleneck is the processor power or communication speed of the compute nodes, because Spectrum Scale and Spectrum Archive do a good job of insulating the host application from file system performance. Additionally, based on previous research carried out by Ennovar, Spectrum Scale performance scales nicely should the size of an existing Spectrum Scale system become a bottleneck.

Our testing demonstrated that it is possible to combine the performance benefits of flash devices for an application with the capacity benefits and retention policies of tape to create a tiered solution that purposes storage according to the value and needs of the data being utilized.

In fairness, it must be noted that the configuration was not balanced for the application and storage model being used. For example:

• The application was CPU intensive, creating a bottleneck on the compute nodes well before any bottleneck would have been seen on the Spectrum Scale file system

• The tape drive performance was substantially slower than Spectrum Scale performance

Also, the use case did not exercise the file system enough to see any impact from the slower LTFS speed, and there were not enough tape drives to test the impact of the FC bandwidth threshold on LTFS performance.

That being said, our testing still demonstrated that flash and tape can be combined in this way, and the solution tested is scalable for better performance or for greater data retention and/or capacity requirements, allowing a user to more efficiently allocate their budget.


Appendix

The table below contains benchmark results for Spectrum Scale performance on various combinations of 800GB SSD, 6TB 7.2K NL-SAS, tiered and non-tiered storage media. All drives contain both Spectrum Scale metadata and file system data unless specifically noted in the first column.

Table 4: Various drive combination performance results. "Normalized" is throughput per drive (MB/sec).

| Drives | Cluster Write (MB/sec) | Normalized | Cluster Read (MB/sec) | Normalized | Scatter Write (MB/sec) | Normalized | Scatter Read (MB/sec) | Normalized | Usable Capacity (TB) |
| 20 SSD | 3579 | 179.0 | 9426 | 471.3 | 2557 | 127.9 | 9124 | 456.2 | 12 |
| 20 NL-SAS | 1963 | 98.15 | 2756 | 137.8 | 930 | 46.5 | 1480 | 74 | 88 |
| 20 SSD + 20 NL-SAS | 2503 | 62.58 | 3419 | 85.58 | 1063 | 26.58 | 1701 | 42.53 | 99 |
| 40 NL-SAS | 2228 | 55.70 | 5040 | 126.00 | 1678 | 41.95 | 3416 | 85.40 | 175 |
| 20 SSD (metadata) + 40 NL-SAS (system) | 3237 | 53.95 | 4302 | 71.70 | 3350 | 55.83 | 4698 | 78.30 | 175 |
| 20 SSD + 40 NL-SAS | 3907 | 65.12 | 6206 | 103.43 | 2175 | 36.25 | 4144 | 69.07 | 187 |
| 60 NL-SAS | 2221 | 37.02 | 5789 | 96.48 | 2271 | 37.85 | 4731 | 78.85 | 262 |
| 20 SSD (metadata) + 100 NL-SAS (system) | 7025 | 58.54 | 10871 | 90.59 | 7005 | 58.38 | 10813 | 90.11 | 437 |
| 20 SSD + 100 NL-SAS | 8758 | 72.98 | 13076 | 108.97 | 5182 | 43.18 | 8874 | 73.95 | 448 |
| 120 NL-SAS | 8584 | 71.53 | 12915 | 107.63 | 8562 | 71.35 | 13263 | 110.53 | 554 |

The table below contains various runtimes for the LS-Dyna Dodge Neon simulation, which was used to baseline the configuration for the much larger and longer-running Boeing 787-800 simulation. The table demonstrates the scalability of adding CPUs to the distributed memory LS-Dyna simulation, as well as the negative impact that comes from having non-uniform compute nodes in the configuration.

Table 5: LS-Dyna simulation run times

| Number of Lenovo System x3650s | Number of IBM x3690s | Total Number of CPUs | Runtime (seconds) |
| 1 | 0 | 24 | 338 |
| 2 | 0 | 48 | 215 |
| 3 | 0 | 72 | 176 |
| 4 | 0 | 96 | 155 |
| 0 | 1 | 12 | 926 |
| 0 | 2 | 24 | 523 |
| 0 | 3 | 36 | 397 |
| 0 | 4 | 48 | 329 |
| 0 | 5 | 60 | 288 |
| 0 | 6 | 72 | 264 |
| 4 | 1 | 108 | 217 |
| 4 | 2 | 120 | 213 |
| 4 | 3 | 132 | 213 |
| 4 | 4 | 144 | 211 |
| 4 | 5 | 156 | 218 |
| 4 | 6 | 168 | 220 |


© Copyright 2016 Wichita State University. All rights reserved. November 2016.

Ennovar™ is a technology institute of Wichita State University, a non-profit state public institution. The Ennovar and Wichita State University names are trademarks of Wichita State University. All other trademarks used herein are the property of their respective owners.
