Dell HPC Scalable Storage Building Block Disk Pool Manager (DPM)
A Dell Technical White Paper
Wahid Bhimji, Philip J. Clark, University of
Edinburgh
Matt Doidge, Roger Jones, University of
Lancaster
Stephen Gray, DELL
Dell HPC Scalable Storage Building Block – Disk Pool Manager
Page ii
THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL
ERRORS AND TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS, WITHOUT EXPRESS OR
IMPLIED WARRANTIES OF ANY KIND.
© 2010 Dell Inc. All rights reserved. Reproduction of this material in any manner whatsoever without
the express written permission of Dell Inc. is strictly forbidden. For more information, contact Dell.
Dell, the DELL logo, and the DELL badge, PowerConnect, and PowerVault are trademarks of Dell Inc.
Other trademarks and trade names may be used in this document to refer to either the entities
claiming the marks and names or their products. Dell Inc. disclaims any proprietary interest in
trademarks and trade names other than its own.
February 2012
Contents

Executive Summary
Scope
1. Introduction
2. Dell HPC Scalable Storage Building Block Reference Architecture Technical Review
3. DPM Description
   3.1. Hardware
   3.2. Software
   3.3. Installation Specifics
4. Evaluation
   4.1. Methodology
   4.2. Test bed
   4.3. Local tests
   4.4. Remote tests
5. Performance Benchmark Results
   5.1. Local Tests
   5.2. Remote Tests
6. Conclusions
7. References
Appendix A: Installation Resources
Appendix B: Benchmarks and Test Tools
   1. dd
   2. IOzone
   3. rfcp
   4. Direct reading with ROOT over rfio
Executive Summary

This solution guide describes the Disk Pool Manager (DPM) configuration of the Dell HPC Scalable
Storage Building Block. Guaranteeing the performance of unstructured user data is becoming a common
requirement in Large Hadron Collider (LHC) environments. The configuration described here uses a Dell
reference architecture to improve the performance of data access, with networked servers in an
additive configuration providing data access to the HPC compute cluster. The goal is to provide
affordable, high-performance storage that is easily deployed and managed. Described here are the
architecture, performance and best practices for building such solutions.
Scope

This paper covers installing and configuring a DPM disk server. It describes a series of local and
remote tests that can be used to determine the functionality and performance of such a server. From
the results of those tests it makes some recommendations on tunings that can be applied to this
server when used for DPM.
1. Introduction

The Large Hadron Collider (LHC) and its experiments have created a huge demand on local and global
storage appliances. The LHC produces petabytes of data per year. The primary focus of the LHC
experiments is to verify the existence of the Higgs boson and other as-yet-undiscovered particles,
hidden in a massive and growing unstructured data pool. This creates an unprecedented requirement for
inexpensive, high-performance, accessible, and scalable storage on a global scale. The data must be
organized and distributed to the researchers in a tiered membership that is unique to the LHC
community. The community needs to understand the storage requirements for its research. With 449
institutions across the globe working on the LHC project, many institutions need documentation on
how to repeatably implement a mature storage platform specific to their function in the LHC research
community.
The LHC community today uses many types of storage organization and functionality. The community
refers to this storage as a file system. In truth, it is a combination of many parts: software,
hardware, and strategy. One type is responsible for moving data off the trigger farm to the lower
tiers. Some are used to provide a high-performance interface on local server disks for computational
requirements. Others are for archiving the data for future reference. Today Dell is working with the
LHC community directly to understand the storage requirements and provide the best possible practices
and storage hardware available. It was in this context that Dell engineered an HPC scalable storage
building block reference architecture (DSBRA) to meet the storage needs of the community, both large
and small.
In the spirit of Dell and the community working together, Edinburgh University, Lancaster University
and members of the Dell Global LHC team have together produced this white paper for the general LHC
community. Lancaster University uses the Disk Pool Manager (DPM) storage solution to provide mass
data storage for LHC analysis at their site. Using the DSBRA as the target storage appliance they
have built, installed, tuned, and tested DPM. The experience captured in this paper can now be used
by both novice and experienced users in the community to install such a system within their
environment and produce science quickly.
It is the Dell Global LHC team's goal to provide the best solutions available to the community. The
team is working continually to improve the LHC community's ability to produce science quickly. Please
check regularly with the Global LHC team or your local Dell account team for the latest testing and
white papers available. The DSBRA is discussed next, followed by the work performed by Lancaster and
Edinburgh Universities.
2. Dell HPC Scalable Storage Building Block Reference Architecture Technical Review

This section provides a quick summary of the technical details of the current DSBRA. The basic DSBRA
consists of one PowerVault MD3200 and four PowerVault MD1200s attached to a PowerEdge R710 server.
The server is configured with 10GbE network connectivity and also several Gigabit Ethernet
connections. The storage is configured using the PERC H700 controller, and different RAID
configurations for the LUNs are tested here. A High Performance Tier license key is added to enable
the additional performance features of the array. Each MD storage array has 12 x 2TB 7200rpm disks
configured as virtual disks. The multiple RAID LUNs are combined using a variety of different RAID
configurations and the xfs or ext4 filesystems. This file system is exported to the compute nodes via
DPM. The DSBRA can be configured in three ways: Small, Medium and Large, corresponding to 360TB,
720TB and 1440TB of raw capacity respectively.
Figure 1 shows a DSBRA entry configuration in a redundant format. For redundancy the configuration
Figure 1 shows a DSBRA entry configuration in a redundant format. For redundancy the configuration
can be duplicated. The raw bandwidth performance for the DSBRA shown in Figure 1 is 3.4GB/s writes
and 5.2GB/s reads with no operating system overhead.
The Dell white paper on DT-HSS has details of the DSBRA configuration, best practices, performance
tuning and performance results (1). Further documents are currently being created.
Figure 1 - Redundant Entry Level DSBRA configuration
The DSBRA solution can be made highly available and scaled by duplicating the configuration. It
leverages the modularity of its building blocks (such as the server, software, storage and RAID
configuration) and the performance tuning best practices as far as possible. Also remember that the
DSBRA is a reference architecture: to reproduce it, one must order it as components, since it is not
a standard Dell product.
3. DPM Description

The DPM storage solution consists of a pair of Dell PowerEdge R710 servers that both have physical
access to a shared PowerVault MD3200 storage enclosure that is extended with PowerVault MD1200
storage arrays. The servers are in a configuration where each accesses a separate set of LUNs. The
servers were named "Dellboy" and "Rodney".
3.1. Hardware
Figure 2 – DPM Dell Setup
OS Setup
The install of Scientific Linux was performed using an SL5.5 install DVD [Appendix A1] and the onboard
optical drive of the R710. We chose to leave hyper-threading on (to maximise the number of CPUs
available for transfer threads). Following the on-screen prompts during the install we chose the
bare-bones "Server" package set from the installation menu, keeping the install as lightweight as
possible. Hostname and (external) network interfaces were set to be assigned using DHCP. After the
install was complete we immediately updated the nodes using yum, tightened the firewall to allow only
necessary traffic [Appendix A2] and rebooted to apply kernel security patches, as is best practice.
Network Setup
The servers are connected to a Gigabit Ethernet and a 10GbE switch. Due to problems with the 10GbE
switch of the test cluster, network connectivity was split between a 1Gb external interface which
faces the world and a 3 x 1Gb bonded internal interface on a separate, private subnet shared with
the internal NICs of our other DPM pools and, more importantly, the compute nodes in our clusters.
This is a very common setup, as most sites have this distinction between external and internal
traffic, and for many reasons (such as monitoring, security and potential network traffic contention)
it is wise to keep them separate. Details of the bonding method can be found in Appendix A3.
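A bonded interface of this kind is configured on SL5 along the following lines. The bonding mode,
device names and addresses below are assumptions for illustration only; Appendix A3 documents the
settings actually used.

```
# /etc/modprobe.conf (RHEL/SL 5): load the bonding driver for bond0
alias bond0 bonding
options bond0 mode=balance-alb miimon=100

# /etc/sysconfig/network-scripts/ifcfg-bond0 (private address is an example)
DEVICE=bond0
IPADDR=10.0.0.10
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-eth1 (repeated for each slave NIC)
DEVICE=eth1
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
```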
Configuring the MD3200:

Before attempting to install the DPM software we needed to perform some extra steps to enable the OS
on the R710s to communicate with the RAID controllers on the MD3200s. First we installed Dell
OpenManage [Appendix A4]. We then downloaded the disk image containing the MD32xx management
software, loop-mounted the ISO and ran the install program (md32xx_install.bin). The RAID arrays were
then configured using the `SMclient' GUI provided by these steps (this required installing and
configuring X and opening a new shell with X-forwarding enabled). A pointer to detailed instructions
for using the GUI and details of the RAID configuration we used can be found in Appendix A.
Filesystem Setup

Disk Setup

The disks were set up in a variety of different RAID configurations to test the performance of each.
These are listed below and in Table 1.
First MD3200 system (Dellboy)

Each of the 5 storage units was made into a separate RAID 6 volume. Each volume was packaged into a
whole virtual disk and mapped to its own LUN, except for the 5th, which was made into 2 equal virtual
disks.
- The first 2 volumes had xfs partitions made upon them, mounted using the noatime and logbufs=8
  options.
- The second two RAID 6 volumes had a software stripe made over them (Appendix A) to make a RAID 60
  volume, which was formatted with xfs.
- The final 2 "half" volumes had an xfs and an ext4 filesystem on them respectively.
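Given that the combined volume in Table 1 has the summed capacity of its two members (38TB), the
software layer over the two hardware RAID 6 LUNs behaves as a stripe, making the result RAID 60
overall. A minimal sketch with mdadm, using invented device names for the two LUNs:

```
# stripe two hardware RAID 6 LUNs (example device names) into one RAID 60 volume
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdc /dev/sdd
mkfs.xfs /dev/md0
```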
Second MD3200 system (Rodney)

The first two storage units were made into a single, large RAID 10 volume packaged in one virtual
disk; the third unit was a single unit in a RAID 10; the 4th and 5th units were put into a
"double-sized" RAID 6.
After splitting up the RAID arrays into the described volumes, we rebooted the R710s so that the OS
would detect the new devices. Once back up and running we installed the xfsprogs and e4fsprogs
packages, and then created partitions on each of the SSB disk chunks using the parted tool. We then
formed xfs or ext4 filesystems on these fresh partitions, added entries for them into /etc/fstab,
created the mount points and mounted them normally (mount -a). Details of the commands and options
used can be found in Appendix A5.
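The per-volume steps described above amount to something like the following for each LUN. The device
name and mount point are examples; the real commands and options are those of Appendix A5 (the mount
options match the ones given for the xfs volumes above).

```
# partition, format and mount one LUN (/dev/sdb is an assumed device name)
parted -s /dev/sdb mklabel gpt
parted -s /dev/sdb mkpart primary 0% 100%
mkfs.xfs /dev/sdb1            # or mkfs.ext4 for the ext4 volume
mkdir -p /test1
echo '/dev/sdb1 /test1 xfs noatime,logbufs=8 0 0' >> /etc/fstab
mount -a
```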
Raid Setup          Capacity (TB)   # Spindles   Filesystem   Mount point
single RAID 6       19              12           xfs          delboy:/test1
single RAID 6       19              12           xfs          delboy:/test2
double RAID 60      38              24           xfs          delboy:/test3
"half" RAID 6       9               6            ext4         delboy:/test4
"half" RAID 6       9               6            xfs          delboy:/test5
"double" RAID 10    24              24           xfs          rodney:/test1
single RAID 10      12              12           xfs          rodney:/test2
double RAID 6       41              24           xfs          rodney:/test3

Table 1: RAID configurations and filesystems used.
The motivation for looking at such a large number of different configurations was to gain performance
information for as many options as possible. Because of how DPM distributes data, it is best practice
to have all of your DPM pools roughly the same size. So, if one is incorporating a DSBRA into an
existing DPM, it can be split into "chunks" of a size that interoperates with the preexisting pools.
It is also best practice not to expose a pool as a single, huge volume, due to administration
problems (rebuilding such an array would take a long time, as would fscking such a volume) as well as
the simple wisdom of not keeping all your data-eggs in one giant basket.
The PowerVault MD3200 is configured to have read cache enabled, write cache enabled, cache
mirroring enabled and read prefetch enabled. The cache block size is set to 32k. Since the cache
mirroring feature is enabled, the two RAID controllers in the MD3200 have mirrored caches. A single
RAID controller failure can be tolerated with no impact to data availability.
Each server is connected to both controllers on the MD3200. Each server has two SAS cables directly
connected to the MD3200, which eliminates a single point of server-to-storage I/O path failure. A
redundant path from the MD3200 to the MD1200 arrays is deployed to enhance the availability of the
storage I/O path. While a failover setup is not available within DPM at present, this setup could
still be used to
allow easier data recovery in the event of server failure.
3.2. Software
DPM software [8] provides a head node, on which meta-data operations are carried out, and disk
servers that provide a series of data transfer protocols as indicated in Figure 3 (here rfio was
tested for local file access while GridFTP was used for remote access). The specifics of this
installation are provided in the next section. DPM is agnostic as to the underlying file system used.
Here we primarily used xfs, but a single partition was set up with ext4 as indicated in Table 1.
Figure 3 – DPM architecture
3.3. Installation Specifics
• The following describes how to install a gLite DPM. The repositories, package names and some
  paths vary slightly for other "flavours" of DPM, such as EMI; however, the services and the
  principles under which they run remain the same.
• The first steps in installing and configuring a DPM pool node (or almost any grid service) are to
  install an x509 grid host certificate and key in /etc/grid-security/ (with the correct
  permissions). As it can take a day or two to obtain a valid host certificate, it is advised to
  "order" one ahead of time. As a locally specific step we also manually configured a dpmmgr
  (DPM manager) user and group. This user owns all the files within the DPM pool and is configured
  automatically if you use yaim [described in Appendix A6] to set up your disk pool. This user must
  have the same UID and GID on both the DPM head node and throughout your disk pools. The dpmmgr
  user and group must own any directory to be exported into the DPM pool.
• Once these first steps were complete we enabled the gLite yum repositories as well as the DAG
  repo [Appendix A6], being sure to disable the EPEL repositories (to avoid DPM version conflicts
  with other DPM flavours). The install was then triggered by simply "yum installing" the glite-DPM
  disk metapackage.
• Once the many packages are installed, one can use yaim to perform the rest of the configuration,
  which requires a properly configured site-info.def config file. It is also possible to set up a
  disk pool manually, as on a disk pool yaim only performs two main tasks: setting up
  /etc/shift.conf and setting up the authentication mechanism. The latter is by far the more
  complicated of the two, and it is advised to leave it to yaim.
• /etc/shift.conf contains a list of trusted hostnames for the various DPM access methods. At its
  simplest it should just contain the DPM head node and the DPM pool node. To speed up
  inter-disk-pool communication (by removing the need for authentication) one can add the hostnames of
  other disk pools. To enable free testing between the two disk pools we added each one's hostname
  to the other's shift.conf.
• Once a pool node has been installed there are a few further steps one must take to enable it in
  the DPM, mainly involving properly configuring the DPM head node to acknowledge and accept the new
  pool node. First, one must add the pool's hostname to the shift.conf on the DPM head node. Second,
  one must ensure that communication between the head node, the new disk pool, the outside world and
  any internal clusters works as intended (this may involve some editing of routing tables or
  /etc/hosts files and possibly firewall rules). Finally, one must make sure that the directory to
  be "exported" on the disk pool is owned by the dpmmgr user and group. Once all these have been
  checked, one can enable the disk pool by issuing a dpm-addfs command on the DPM head node.
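The head-node side of these steps can be sketched as follows. All hostnames and the pool name are
invented, and the shift.conf keywords shown should be cross-checked against the DPM documentation or
against what yaim generates.

```
# /etc/shift.conf on the DPM head node: trust the new disk server
# (dpmhead.example.org and rodney.example.org are example hostnames)
RFIOD TRUST  dpmhead.example.org rodney.example.org
RFIOD RTRUST dpmhead.example.org rodney.example.org
RFIOD WTRUST dpmhead.example.org rodney.example.org

# then publish the exported filesystem into an existing pool:
dpm-addfs --poolname examplepool --server rodney.example.org --fs /test1
```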
4. Evaluation

The architecture proposed in this white paper was evaluated in Lancaster University's computing
centre. This section describes the test methodology and the hardware used for verification. It also
contains details of the functionality and performance tests. Performance results are presented in
subsequent sections.
4.1. Methodology
A series of local and remote file-system tests was employed. The local tests used the standard
tools dd and iozone. The remote tests used the protocols commonly used by the DPM storage system:
gsiftp and rfio. The remote tests also used workflows matching those used by the ATLAS LHC
experiment. Most focus was given to the latter tests, where a variety of parameters were tested.
These are described below and in more detail in Appendix B.
4.2. Test bed
The test bed used to evaluate the functionality and performance of the DPM solution is described
here. Figure 4 shows the test bed used in this study.
The HPC compute cluster consisted of 512 cores across 64 servers.
Figure 4 – The Lancaster DPM Testbed
Two PowerEdge R710 servers were used as the DPM disk servers. Both servers were connected to
PowerVault MD3200 storage extended with PowerVault MD1200 arrays. A switch configured with bonded
Gigabit Ethernet connections provided the private network between the servers and the compute
nodes.
4.3. Local tests

To test the standalone performance of our SSB partitions outside of the DPM paradigm, we used the
well-known tools dd [Appendix B] and iozone [Appendix B].
The dd tests were split into two sections. The first used a test suite, written at SARA [2], which
performs repeated sets of multiple simultaneous dd write and read tests with large files [Appendix B].
We ran these tests using 8 simultaneous threads and let them run for approximately 24 hours, after
which average read and write rates were calculated. The second set of tests was a much simpler
single-threaded dd write test, writing 90GB files from /dev/zero and then following up with a read
test to /dev/null. These tests were intended to provide a baseline for our other results.
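The single-stream baseline can be reproduced along the lines below. The sizes here are deliberately
scaled down for illustration; the tests themselves wrote 90GB files precisely to defeat the 24GB of
RAM, and the target path is an example.

```shell
# write test: stream zeros to the volume under test (scaled-down size here)
TARGET=${TARGET:-/tmp/ddtest.img}
dd if=/dev/zero of="$TARGET" bs=1M count=16 conv=fdatasync
# read test: stream the file back to /dev/null
dd if="$TARGET" of=/dev/null bs=1M
```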
Our iozone tests were similarly split: each volume was individually subjected to a sequential read,
a sequential write and a random read/write test using 8 threads and large files. We chose large
files to remove the effects of RAM caching (they had to be so large because of the 24GB of RAM on
the host machines), which seemed to affect our first set of tests using smaller files. The exact
commands used can be seen in Appendix B.
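An iozone invocation of the kind described might look like the following; the per-thread file size,
record size and target paths are assumptions for this sketch, and the exact commands are those of
Appendix B.

```
# 8 threads: sequential write (-i 0), sequential read (-i 1) and random
# read/write (-i 2); -s is the per-thread file size, chosen large relative
# to the 24GB of host RAM; the target paths are examples
iozone -i 0 -i 1 -i 2 -r 1024k -s 8g -t 8 \
    -F /test1/ioz1 /test1/ioz2 /test1/ioz3 /test1/ioz4 \
       /test1/ioz5 /test1/ioz6 /test1/ioz7 /test1/ioz8
```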
We performed another batch of iozone tests on a single volume, varying the number of threads used
during a random-read iozone run; the aim was to map any performance degradation as the number of
simultaneous threads accessing a volume increased.
4.4. Remote tests

GridFTP:

To test remote transfers using GridFTP we utilized a test developed by SARA [10]. This test sets up
a series of remote transfers using the GridFTP protocol. A set of files was created on the test
servers and copied back to the client. 2GB file sizes were used, with the files created from random
data taken from /dev/random.
Rfcp:

Rfcp is the copy command most commonly used for local copies on DPM storage elements. As ATLAS jobs
currently copy their input data files to the worker node before running, this test simulates the
interaction those jobs would have with the storage on DPM sites. Real ATLAS "AOD" files of 2GB size
were used; files were copied continuously, using a different (randomly chosen) file each time. A
real analysis job would have copied the file and then processed it locally, so this test represents
the heaviest possible IO load from that number of concurrent jobs (i.e. the case where all the
copies occur at the same time).
The wrapper scripts used to submit the copies are given in Appendix B.
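The flavour of those wrappers can be sketched as a small shell function that backgrounds one copy
per input file and waits for the whole batch. Everything here is illustrative (the function name,
paths and the COPY_CMD override are invented); the real Appendix B scripts differ in detail, with
rfcp as the copy client.

```shell
# launch one background copy per source file, then wait for the whole batch;
# COPY_CMD defaults to rfcp, the DPM local-copy client
run_copies() {
    dest=$1; shift
    i=0
    for src in "$@"; do
        i=$((i + 1))
        ${COPY_CMD:-rfcp} "$src" "$dest/copy.$i" &   # copies run concurrently
    done
    wait
}
```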
Up to 250 simultaneous copies were performed on each disk server, representing a realistic maximum
number of jobs for a server of this capacity.
Direct ROOT Reading over RFIO:

ROOT is the data analysis framework used by particle physicists. This test uses the ROOT libraries
to open and read a file directly using RFIO (i.e. without copying it first). A real ATLAS "AOD" file
is used, but one which has not been "reordered" by entry (see [9] for more details); files are
continuously opened and read (using a different file each time), but no computation is done. In
these ways the test is both realistic and "worst-case" in terms of IO load.

The code for both the test and the wrapper is given in Appendix B. A version of this test is now
available through the DPM performance test suite [11][12].

Up to 100 simultaneous jobs were run against a single filesystem on each server. This represents a
realistic load for the capacity provided.
5. Performance Benchmark Results

This section presents the results of performance benchmarking on the DPM solution.
5.1. Local Tests
The results of the dd and iozone tests are shown in the tables below, while the multi-threaded
iozone read results are shown in the figure.
dd test results

Volume                      dd suite result       Straight dd result
                            (MB/s read/write)     (MB/s read/write)
19TB single RAID 6          178/20                688/301
38TB double RAID 60         83/80                 675/290
9TB "half" RAID 6           176/18                501/196
9TB "half" RAID 6 (ext4)    93/52                 268/256
24TB "double" RAID 10       184/28                289/125
12TB single RAID 10         174/29                626/294
41TB double RAID 6          168/24                713/252
The multi-threaded "dd suite" results in the table above show only a small variation in read and
write rates across most of the volume setups, with the exception of the large software RAID 60 and
the ext4 volume, which had much lower read rates than the others. Both of these volumes did, however,
make up for lower read performance with much higher write rates. For a typical grid storage element
most of the operations are of the form "write once, read many", so the additional write performance
wouldn't be a great benefit, but for other applications it could be useful.
In the case of the single-threaded, standard dd tests there was slightly greater variation in
performance between the setups. The ext4 partition again had a lower read rate but this time
displayed no considerable gain in write speed; any gains in write rates seem to depend on the number
of threads. The large RAID 10 partition also seemed to perform worse than the others. However, these
tests should be taken with the proverbial pinch of salt, as few real-world applications would
involve a single-threaded dump to a partition of any real size.
Iozone test results

Volume                      Read (KB/s)   Write (KB/s)   Random R/W (KB/s)
19TB single RAID 6          537171        229333         136903/101411
38TB double RAID 60         799849        407770         176563/127596
9TB "half" RAID 6 (ext4)    427529        239111         173113/92647
9TB "half" RAID 6           511624        220575         136544/96933
24TB "double" RAID 10       564728        215497         202388/210904
12TB single RAID 10         455706        230065         122292/174229
41TB double RAID 6          571992        204378         199682/66559
The more sophisticated multi-threaded iozone results show little variation between most of the
volumes for the sequential read and write tests. The volume that stands out here is the large
RAID 60 volume. This volume also performs well in the random r/w tests, although there the
best-performing volume is the large RAID 10 volume. Random read/write performance in general seems
to correlate roughly with the number of spindles in the volume.
Multi-thread Random Read test:
This plot shows an interesting phenomenon: although the rate of increase in the total read rate
falls off rapidly as the number of threads grows, it did not, within the scope of these results,
fully level off, and an increase in read rate is seen even when the number of threads is greater
than the number of CPUs on the machine. This suggests that multi-threading reads is a good way of
squeezing every last bit of read performance from a volume.

Had time allowed, we would have liked to conduct this test with hyper-threading off (reducing the
effective number of cores on the machine to 8), and also to repeat it on all the unique volumes.
5.2. Remote Tests
GridFTP

For the remote GridFTP tests, single-file transfer rates of 110 MB/s were obtained. This rate was
seen to scale with the number of simultaneous files transferred, illustrating that the test is
network limited (the external link, when accessed from a single client, was limited to 1 Gbit/s).
It was observed that similar transfer rates could be achieved even while the local tests below were
carried out (with an independent interface used for access).
Rfcp
As mentioned above, 250 simultaneous rfcp processes were launched from the cluster, targeting
randomly chosen partitions on one server at a time. On launching these, high levels of IO wait were
observed on the servers, as indicated in Figure 5.1. It was found that increasing the block device
read-ahead with the following command could alleviate this, as has been observed on other DPM disk
servers in the UK [13]. Those studies suggested the 8MB value that was used here, but the optimum
value is likely to be infrastructure dependent.

/sbin/blockdev --setra 16384 /dev/dm-5
As indicated in Figure 5.2, this immediately alleviated the load and left the throughput of data
limited only by the available network bandwidth.
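The argument to --setra is a count of 512-byte sectors, which is how 16384 corresponds to the 8MB
read-ahead suggested by [13]:

```shell
# blockdev --setra takes a count of 512-byte sectors,
# so 16384 sectors = 16384 * 512 bytes = 8 MiB of read-ahead
readahead_sectors=16384
readahead_mb=$((readahead_sectors * 512 / 1024 / 1024))
echo "${readahead_mb} MB read-ahead"
```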
Figure 5.1: Load and WAIT CPU on "Rodney" with 250 rfcp jobs, before the block device read-ahead
was increased.
Figure 5.2: System load and network utilization on "Rodney" for 250 rfcp jobs, showing that once
the block device read-ahead value is increased, the load is reduced and the available network is
fully utilized.
ROOT over RFIO
As mentioned above, we ran 100 simultaneous jobs. The block device tuning mentioned in the last
section was already applied. However, as shown in Figure 5.3, there were once more high levels of
WAIT CPU and the network was not fully utilized. In this case, setting the RFIO buffer size to a
larger value can alleviate the situation. This can be done in /etc/shift.conf on the client (i.e.
the worker node), or (as in this case) in the application, by using the environment variable:

export RFIO_IOBUFSIZE=524288
As shown in Figure 5.4, this substantially reduces the CPU wait and means that the network can be
fully utilized. However, it is worth noting that a lower buffer size is better for job efficiency
(as shown in Table 5.1). This is because with larger buffer sizes and random IO, a large amount of
data is shipped for every call, only some of which is required by the job. The balance chosen for a
site will depend on the system and job mix.
Figure 5.3: System load, CPU and network usage on "Rodney" for 100 ROOT direct RFIO jobs, showing
high levels of IO wait with the default RFIO buffer size of 128k.

Figure 5.4: System load, CPU and network usage on "Rodney" for 100 ROOT direct RFIO jobs, showing
that setting the RFIO buffer to 512k alleviates the CPU WAIT and enables the network to be fully
utilized.
Single job:
RFIO buffer size    128k    512k
CPU time            249     317
Wall time           921     2444
CPU / wall time     27%     13%

100 simultaneous jobs:
RFIO buffer size    4k      128k    512k
CPU time            227     267     405
Wall time           19813   4321    71654
CPU / wall time     1.1%    4.5%    0.6%

Table 5.1: Times taken for a single job (top) and for 100 simultaneous ROOT direct RFIO jobs
(bottom), showing that 128k buffer sizes offer lower overall job times and better CPU efficiencies.
6. Conclusions
This solution guide provides information on deploying a DPM solution for HPC clusters. The
guidelines include complete hardware and software information along with detailed configuration
steps, best practices and performance tuning notes to make such a solution easy to deploy and
manage.
We have found that the Dell HPC scalable storage building block reference architecture (DSBRA)
is suitable for use as a storage server with DPM for ATLAS analysis workloads, even when
stressed with a realistic number of jobs for the capacity provided.
To deal with these workloads it is necessary to tune the block device read-ahead (both when
copying files to the worker node and when reading directly via RFIO) and the RFIO_IOBUFSIZE
(particularly for direct reading), and we suggest values for these parameters. Larger buffers
can, however, mean worse CPU efficiency for direct random reading, so the values will depend on
the available network bandwidth and should be tuned for each installation.
With large buffers the system is network limited; we therefore recommend that, if such high
capacity servers are used, 10 Gigabit Ethernet should be deployed.
7. References
1) Dell | Terascala HPC Storage Solution (DT-HSS)
http://content.dell.com/us/en/enterprise/d/business~solutions~hpcc~en/Documents~Dell-terascala-dt-hss2.pdf.aspx
2) Dell NFS Storage Solution for HPC (NSS)
http://i.dell.com/sites/content/business/solutions/hpcc/en/Documents/Dell-NSS-NFS-Storage-solution-final.pdf
3) Red Hat Enterprise Linux 5 Cluster Suite Overview
http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/pdf/Cluster_Suite_Overview/Red_Hat_Enterprise_Linux-5-Cluster_Suite_Overview-en-US.pdf
4) Deploying a Highly Available Web Server on Red Hat Enterprise Linux 5
http://www.redhat.com/f/pdf/rhel/Deploying_HA_Web_Server_RHEL.pdf
5) Platform Cluster Manager
http://www.platform.com/cluster-computing/cluster-management
6) Optimizing DELL™ PowerVault™ MD1200 Storage Arrays for High Performance Computing (HPC)
Deployments
http://i.dell.com/sites/content/business/solutions/power/en/Documents/Md-1200-for-hpc.pdf
7) Array Tuning Best Practices
http://www.dell.com/downloads/global/products/pvaul/en/powervault-md3200i-performance-tuning-white-paper.pdf
8) DPM
https://svnweb.cern.ch/trac/lcgdm/wiki/Dpm
9) ROOT I/O
Vukotic, I., Bhimji, W., Biscarat, C., Brandt, G., Duckeck, G., van Gemmeren, P., Peters, A.
and Schaffer, R. D., 2010. Optimization and performance measurements of ROOT-based data formats
in the ATLAS experiment. ATL-COM-SOFT-2010-081. To be published in J. Phys.: Conf. Series.
10) SARA test suite (available from http://web.grid.sara.nl/acceptance_test)
11) Hellmich, M. Stress testing and developing the distributed data storage used for the Large
Hadron Collider. Available from:
http://www2.ph.ed.ac.uk/~wbhimji/GridStorage/StressTestingAndDevelopingDistributedDataStorage-MH.pdf
12) DPM performance testsuite: https://svnweb.cern.ch/trac/lcgdm/wiki/Dpm/Admin/Performance
13) http://northgrid-tech.blogspot.com/2010/08/tuning-areca-raid-controllers-for-xfs.html [accessed
February 2012]
8. Acknowledgements
The tuning applied in this paper makes use of a huge amount of work carried out in the UK,
including that by John Bland, Sam Skipsey, Alessandra Forti and others in the GridPP Storage
Group.
Appendix A: Installation Resources
1 - Scientific Linux ISO Download:
https://www.scientificlinux.org/download
2 - Ports used by DPM Disk Server
PORT SERVICE PROTOCOL
5001 RFIO TCP
2811 GRIDFTP TCP
20000:25000 GLOBUS PORT RANGE TCP
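The port list above translates directly into firewall rules. The following sketch only echoes the corresponding iptables commands (a default-deny INPUT chain is an assumption about a typical setup); pipe the output to sh on the disk server to apply them.

```shell
# Echo iptables ACCEPT rules for the DPM disk-server ports listed above.
# Assumes a default-deny INPUT chain; adjust to your firewall layout.
rules=$(for p in 5001 2811 20000:25000; do
    echo "iptables -A INPUT -p tcp --dport $p -j ACCEPT"
done)
echo "$rules"
```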
3 - NIC Bonding Config files.
ifcfg-bond0:
DEVICE=bond0
ONBOOT=yes
BOOTPROTO=static
NETMASK=255.255.240.0
IPADDR=10.41.52.101
NETWORK=10.41.48.0
USERCTL=no
BONDING_OPTS='mode=balance-alb miimon=100 xmit_hash_policy=layer3+4'

ifcfg-ethX (where X is a bond member):
DEVICE=eth3
HWADDR=84:2B:2B:72:66:45
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
MASTER=bond0
SLAVE=yes
4 - Dell OpenManage & other Utility Documentation Links:
http://www.dell.com/content/topics/global.aspx/sitelets/solutions/management/en/openmanage?c=us&l=en&cs=555
5 - File System and Mounting Options.
Partitions were created using the parted CLI (given a gpt label with the mklabel command and
created with the mkpart command), consuming the whole virtual disk volume in most cases.
Filesystems were created using `mkfs.xfs -f /dev/XXX` (or `mkfs.ext4 -F` in the case of the
ext4 partition).
The xfs volumes were mounted using the following options in /etc/fstab:
rw,noatime,logbufs=8
The ext4 volume was mounted using just “defaults,noatime”.
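The partitioning, filesystem and mount steps above can be sketched end to end. The commands are echoed rather than executed, and /dev/sdX and the /storage mount point are illustrative names, not the ones used on our servers.

```shell
# End-to-end sketch of the steps described above, echoed for safety.
DEV=/dev/sdX            # illustrative device name
MNT=/storage            # illustrative mount point
echo "parted -s $DEV mklabel gpt"
echo "parted -s $DEV mkpart primary 0% 100%"
echo "mkfs.xfs -f ${DEV}1"
echo "${DEV}1 $MNT xfs rw,noatime,logbufs=8 0 0   # append to /etc/fstab"
```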
6 - DPM Installation and Configuration.
Yum repositories:
EGI-trust.repo
[EGI-trustanchors]
name=EGI-trustanchors
baseurl=http://repository.egi.eu/sw/production/cas/1/current/
gpgkey=http://repository.egi.eu/sw/production/cas/1/GPG-KEY-EUGridPMA-RPM-3
gpgcheck=1
enabled=1
glite-SE_dpm_disk.repo
[glite-SE_dpm_disk]
name=gLite 3.2 glite-SE_dpm_disk
baseurl=ftp://glitesoft.cern.ch/EGEE/gLite/R3.2/glite-SE_dpm_disk/sl5/x86_64/RPMS.release/
gpgkey=ftp://glite.web.cern.ch/glite/glite_key_gd.asc
gpgcheck=0
enabled=1
[glite-SE_dpm_disk_updates]
name=gLite 3.2 glite-SE_dpm_disk
baseurl=ftp://glitesoft.cern.ch/EGEE/gLite/R3.2/glite-SE_dpm_disk/sl5/x86_64/RPMS.updates/
gpgkey=ftp://glite.web.cern.ch/glite/glite_key_gd.asc
gpgcheck=0
enabled=1
[glite-SE_dpm_disk_ext]
name=gLite 3.2 glite-SE_dpm_disk
baseurl=ftp://glitesoft.cern.ch/EGEE/gLite/R3.2/glite-SE_dpm_disk/sl5/x86_64/RPMS.externals/
gpgcheck=0
enabled=1
Yaim documentation:
https://twiki.cern.ch/twiki/bin/view/LCG/GenericInstallGuide320
Example shift.conf:
RFIOD TRUST fal-pygrid-30.lancs.ac.uk rodney.lancs.ac.uk delboy.lancs.ac.uk
RFIOD WTRUST fal-pygrid-30.lancs.ac.uk rodney.lancs.ac.uk delboy.lancs.ac.uk
RFIOD RTRUST fal-pygrid-30.lancs.ac.uk rodney.lancs.ac.uk delboy.lancs.ac.uk
RFIOD XTRUST fal-pygrid-30.lancs.ac.uk rodney.lancs.ac.uk delboy.lancs.ac.uk
RFIOD FTRUST fal-pygrid-30.lancs.ac.uk rodney.lancs.ac.uk delboy.lancs.ac.uk
DPM PROTOCOLS rfio gsiftp
Appendix B: Benchmarks and Test Tools
1. dd
dd is a Linux utility provided by the coreutils rpm distributed with SL 5.5. It was used to
measure raw data throughput.
dd if=/dev/zero of=zerofile bs=1M count=90000
dd if=zerofile of=/dev/null bs=1M count=90000
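dd reports its own throughput figure, but it can also be derived by hand: the commands above move 90000 MiB, so dividing by the elapsed time gives MiB/s. The elapsed time below is a made-up illustration, not a measured result from these tests.

```shell
# Throughput = data moved / elapsed time. 90000 x 1 MiB blocks, as in the
# dd commands above; the 600 s elapsed time is purely illustrative.
size_mib=90000
elapsed_s=600
echo "$((size_mib / elapsed_s)) MiB/s"    # prints 150 MiB/s
```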
2. IOzone
IOzone can be downloaded from http://www.iozone.org/. Version 3.353 was used for these tests and
installed on the servers. The iozone benchmark was used to measure sequential read and write
throughput (MB/sec) as well as random read and write I/O operations per second (IOPS).
iozone commands used:
{write, read, r/w}
iozone -i {0,1,2} -c -e -w -r 1024k -s 64g -t 8 -+n | tee -a resultfile.txt
random read for X={1,2,4,8,16,32} threads
iozone -i 5 -c -e -w -r 1024k -s 64g -t $X -+n
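The random-read sweep over thread counts expands to one iozone invocation per value of X; a sketch (commands echoed rather than run):

```shell
# Expand the X={1,2,4,8,16,32} random-read sweep into explicit commands.
for X in 1 2 4 8 16 32; do
    echo "iozone -i 5 -c -e -w -r 1024k -s 64g -t $X -+n"
done
```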
The IOzone tests were run from 1-64 nodes in clustered mode. All tests were N-to-N, i.e. N clients
would read or write N independent files.
The following table describes the command line arguments.
IOzone ARGUMENT DESCRIPTION
-i 0    Write test
-i 1    Read test
-i 2    Random Access test
-+n     No retest
-c      Includes close in the timing calculations
-e      Includes flush in the timing calculations
-t      Number of threads
-r      Record size
-s      File size
-+m     Location of clients to run IOzone on when in clustered mode
-w      Does not unlink (delete) temporary file
-I      Use O_DIRECT, bypass client cache
3. rfcp
The test script used for copying with rfcp is detailed below. It is necessary to generate a
proxy and point to it with the X509_USER_PROXY environment variable (this can be done with the
voms-proxy-init command on a grid UI, for example).
RANGE=50
echo Starting Test
export X509_USER_PROXY=/opt/sl5_soft/wahid/x509up_u521399
for k in `seq 1 40`
do
  echo Test $k
  date
  let number=$RANDOM%$RANGE
  let number2=$RANDOM%5+1
  echo rfio://delboy.lancs.ac.uk/test$number2/remotetestfiles/rtest$number.2G.file
  rfcp rfio://delboy.lancs.ac.uk/test$number2/remotetestfiles/rtest$number.2G.file /dev/null
done
Multiple copies of this test were submitted to the batch system with the following command.
for i in `seq 1 250` ; do qsub -N Dellboy250Test${i}
-o /home/wahid/DellboysStressTests/250test${i}.out
-e /home/wahid/DellboysStressTests/250test${i}.err RfcpStressTest ;done
4. Direct reading with ROOT over rfio
ROOT is the standard package used for data analysis by particle physicists and is built into
the ATLAS data models and analysis software, which makes this test realistic for an ATLAS file.
The code for the program is given below. It requires the ROOT libraries and the DPM libraries
(mentioned below) to be available, and it is compiled against them using the Makefile, also
given below. It also requires access to an ATLAS-like AOD file and a shared library (called
aod.so here) built from it. The latter can be built using TFile::MakeProject in ROOT (see
http://root.cern.ch/root/html/TFile.html#TFile:MakeProject). For more details please contact
the authors.
#include <iostream>
#include <iomanip>
#include <stdlib.h>
#include <fstream>
#include <TROOT.h>
#include <TRFIOFile.h>
#include <TFile.h>
#include <TString.h>
#include <TTreePerfStats.h>
#include <TTree.h>
#include "TPluginManager.h"

using namespace std;

int main(int argc, char *argv[]) {
  TString inputFile = argv[1];
  Int_t cachesize = 0;
  if (argc > 2) {
    cachesize = atoi(argv[2]);
  }
  TFile *_file0 = TFile::Open(inputFile, "READ");
  TTree *T = (TTree*)_file0->Get("CollectionTree");
  Long64_t nentries = T->GetEntries();
  if (argc > 3) {
    nentries = atoi(argv[3]);
  }
  if (cachesize > 0) {
    cout << "setting cache " << endl;
    cout << cachesize << endl;
    T->SetCacheSize(cachesize);
    T->SetCacheEntryRange(0, nentries);
    T->AddBranchToCache("*", kTRUE);
  }
  TTreePerfStats ps("ioperf", T);
  cout << "Total Entries: " << nentries << endl;
  for (Long64_t i = 0; i < nentries; i++) {
    if (i%100 == 0) {
      cout << "processed " << i << " entries" << endl;
    }
    T->GetEntry(i);
  }
  ps.SaveAs("aodperStraightRFIO.root");
  ps.Print();
}
Makefile:
ROOTCFLAGS = $(shell root-config --cflags)
ROOTLIBS = $(shell root-config --libs)
ROOTGLIBS = $(shell root-config --glibs)
CXX = g++
CXXFLAGS = -g -Wall -fPIC
LD = g++
LDFLAGS = -g
LDFLAGS += -m32
SOFLAGS = -shared
CXXFLAGS += $(ROOTCFLAGS)
LIBS = $(ROOTLIBS)
NGLIBS = $(ROOTGLIBS)
NGLIBS += -lTreePlayer
NGLIBS += -lRFIO
GLIBS = $(filter-out -lNew -lPostscript -lPhysics -lGui, $(NGLIBS))
.SUFFIXES: .cc .C
# ====================================================================
IOPerformerGrid: IOPerformerGrid.o
# -------------------------
$(LD) $(LDFLAGS) -o IOPerformerGrid IOPerformerGrid.o aod/aod.so
libshift.so.2.1 liblcgdm.so $(GLIBS)
.cc.o:
$(CXX) $(CXXFLAGS) -c $<
The script used for this test is given below. As for the test above, it is necessary to
generate a proxy. It is also necessary to create a symbolic link named libshift.so.2.1
pointing to libdpm.so, and for that link to be in the LD_LIBRARY_PATH. libdpm.so should be
found in $LCG_LOCATION in a standard grid worker node installation. However, for our test it
was necessary to replace this with a more recent version of the library to allow the
RFIO_IOBUFSIZE to be set by environment variable.
RANGE=25
echo Starting Test
export X509_USER_PROXY=/opt/sl5_soft/wahid/x509up_u521399
echo "Setting paths"
export LD_LIBRARY_PATH=/opt/sl5_soft/wahid/libs:$LD_LIBRARY_PATH
#ln -s $LCG_LOCATION/lib/libdpm.so /opt/sl5_soft/wahid/libs/libshift.so.2.1
export RFIO_IOBUFSIZE=4
for k in `seq 1 10`
do
echo Test $k
date
/opt/sl5_soft/wahid/IOPerformerGrid rfio://delboy.lancs.ac.uk/test1/aodfiles/AOD.067184.big.pool.root.7154799.$RFTSTNO
done
100 simultaneous jobs are submitted to the batch system with the following command.
for i in `seq 1 100` ; do qsub -N ARodBuf100Rfio512kTest${i} -o
/home/wahid/DellboysStressTests/ARodRfio512k${i}.out -e
/home/wahid/DellboysStressTests/ARodRfio512k${i}.err -v RFTSTNO=${i}
RfioTestRod ; done
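The library-link step mentioned above (making libshift.so.2.1 point at libdpm.so and putting it on the library path) can be sketched as follows. The scratch directory stands in for /opt/sl5_soft/wahid/libs, and the $LCG_LOCATION default is an assumption, not a value from our installation.

```shell
# Create the libshift.so.2.1 -> libdpm.so link in a scratch directory
# (a stand-in for the real library directory) and put it on the path.
LCG_LOCATION=${LCG_LOCATION:-/opt/lcg}     # assumed default location
LIBDIR=$(mktemp -d)
ln -s "$LCG_LOCATION/lib/libdpm.so" "$LIBDIR/libshift.so.2.1"
export LD_LIBRARY_PATH=$LIBDIR:$LD_LIBRARY_PATH
readlink "$LIBDIR/libshift.so.2.1"
```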