Dell HPC Scalable Storage Building Block Disk Pool Manager (DPM)
A Dell Technical White Paper
Wahid Bhimji, Philip J. Clark, University of
Edinburgh
Matt Doidge, Roger Jones, University of
Lancaster
Stephen Gray, DELL
Dell HPC Scalable Storage Building Block – Disk Pool Manager
Page ii
THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL
ERRORS AND TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS, WITHOUT EXPRESS OR
IMPLIED WARRANTIES OF ANY KIND.
© 2010 Dell Inc. All rights reserved. Reproduction of this material in any manner whatsoever without
the express written permission of Dell Inc. is strictly forbidden. For more information, contact Dell.
Dell, the DELL logo, and the DELL badge, PowerConnect, and PowerVault are trademarks of Dell Inc.
Other trademarks and trade names may be used in this document to refer to either the entities
claiming the marks and names or their products. Dell Inc. disclaims any proprietary interest in
trademarks and trade names other than its own.
February 2012
Contents

Executive Summary
Scope
1. Introduction
2. Dell HPC Scalable Storage Building Block Reference Architecture Technical Review
3. DPM Description
   3.1. Hardware
   3.2. Software
   3.3. Installation Specifics
4. Evaluation
   4.1. Methodology
   4.2. Test bed
   4.3. Local tests
   4.4. Remote tests
5. Performance Benchmark Results
   5.1. Local Tests
   5.2. Remote Tests
6. Conclusions
7. References
Appendix A: Installation Resources
Appendix B: Benchmarks and Test Tools
   1. dd
   2. IOzone
   3. rfcp
   4. Direct reading with ROOT over rfio
Executive Summary

This solution guide describes the Disk Pool Manager (DPM) configuration of the Dell HPC Scalable
Storage Building Block. Guaranteeing the performance of unstructured user data is becoming a common
requirement in Large Hadron Collider (LHC) environments. The configuration described here uses a Dell
reference architecture to improve the performance of data access, with networked servers in an
additive configuration providing data access to the HPC compute cluster. The goal is to provide
affordable, high-performance storage that is easily deployed and managed. Described here are the
architecture, performance and best practices for building such solutions.
Scope

This paper covers installing and configuring a DPM disk server. It describes a series of local and
remote tests that can be used to determine the functionality and performance of such a server. From
the results of those tests it makes some recommendations on tunings that can be applied to this
server when used for DPM.
1. Introduction

The Large Hadron Collider (LHC) and its experiments have created a huge demand on local and global
storage appliances. The LHC produces petabytes of data per year. The primary focus of the LHC
experiments is to verify the existence of the Higgs boson and other as-yet-undiscovered particles,
hidden in a massive and growing unstructured data pool. This creates an unprecedented requirement for
inexpensive, high-performance, accessible, and scalable storage on a global scale. The data must be
organized and distributed to the researchers in a tiered membership that is unique to the LHC
community. The community needs to understand the storage requirements for its research. With 449
institutions across the globe working on the LHC project, many institutions need documentation on
how to repeatably implement a mature storage platform specific to their function in the LHC research
community.
The LHC community today uses many types of storage organization and functionality. The community
refers to this storage as a file system. In truth, it is a combination of many parts: software,
hardware, and strategy. One type is responsible for moving data off the trigger farm to the lower
tiers. Some are used to provide a high-performance interface on local server disks for computational
requirements. Others are for archiving the data for future reference. Today Dell is working with the
LHC community directly to understand the storage requirements and provide the best possible practices
and storage hardware available. It was in this context that Dell engineered an HPC scalable storage
building block reference architecture (DSBRA) to meet the storage needs of the community, both large
and small.
In the spirit of Dell and the community working together, Edinburgh University, Lancaster University
and members of the Dell Global LHC team have together produced this white paper for the general LHC
community. Lancaster University uses the Disk Pool Manager (DPM) storage solution to provide mass
data storage for LHC analysis at their site. Using the DSBRA as the target storage appliance they
have built, installed, tuned, and tested DPM. The experience captured in this paper can now be used
by both novice and experienced users in the community to install such a system within their
environment and produce science quickly.
It is the Dell Global LHC team's goal to provide the best solutions available to the community. The
team is working continually to improve the LHC community's ability to produce science quickly. Please
check regularly with the Global LHC team or your local Dell account team for the latest testing and
white papers available. The DSBRA is discussed next, followed by the work performed by Lancaster and
Edinburgh Universities.
2. Dell HPC Scalable Storage Building Block Reference Architecture Technical Review

This section provides a quick summary of the technical details of the current DSBRA. The basic DSBRA
consists of one PowerVault MD3200 and four PowerVault MD1200s attached to a PowerEdge R710 server.
The server is configured with 10GbE network connectivity and also several Gigabit Ethernet
connections. The storage is configured using the PERC H700 controller, and different RAID
configurations for the LUNs are tested here. A High Performance Tier license key is added to enable
the additional performance features of the array. Each MD storage array has 12 x 2TB 7200rpm disks
configured as virtual disks. The multiple RAID LUNs are combined using a variety of different RAID
configurations and the xfs or ext4 filesystems. This file system is exported to the compute nodes via
DPM. The DSBRA can be configured in three ways: Small, Medium and Large, corresponding to 360TB,
720TB and 1440TB of raw capacity respectively.
Figure 1 shows a DSBRA entry configuration in a redundant format. For redundancy the configuration
Figure 1 shows a DSBRA entry configuration in a redundant format. For redundancy the configuration
can be duplicated. The raw bandwidth performance for the DSBRA shown in Figure 1 is 3.4GB/s writes
and 5.2GB/s reads with no operating system overhead.
The Dell white paper on DT-HSS has details of the DSBRA configuration, best practices, performance
tuning and performance results (1). Further documents are currently being created.
Figure 1 - Redundant Entry Level DSBRA configuration
The DSBRA solution can be made highly available and scaled by duplicating the configuration. It
leverages the modularity of its building blocks (such as the server, software, storage and RAID
configuration) and the performance tuning best practices as far as possible. Also remember that the
DSBRA is a reference architecture: to reproduce it, one must order it as components, since it is not
a standard Dell product.
3. DPM Description

The DPM storage solution consists of a pair of Dell PowerEdge R710 servers that both have physical
access to a shared PowerVault MD3200 storage enclosure that is extended with PowerVault MD1200
storage arrays. The servers are in a configuration where each accesses a separate set of LUNs. The
servers were named "Dellboy" and "Rodney".
3.1. Hardware
Figure 2 – DPM Dell Setup
OS Setup
The install of Scientific Linux was performed using an SL5.5 install DVD [Appendix A1] and the onboard
optical drive of the R710. We chose to leave hyper-threading on (to maximise the number of CPUs
available for transfer threads). Following the on-screen prompts during the install we chose the
bare-bones "Server" package set from the installation menu, keeping the install as lightweight as
possible. Hostname and (external) network interfaces were set to be assigned using DHCP. After the
install was complete we immediately updated the nodes using yum, tightened the firewall to allow only
necessary traffic [Appendix A2] and rebooted to apply kernel security patches, as is best practice.
Network Setup
The servers are connected to a Gigabit Ethernet and a 10GbE switch. Due to problems with the 10GbE
switch of the test cluster, network connectivity was split between a 1Gb external interface which
faces the world and a 3 x 1Gb bonded internal interface on a separate, private subnet shared with
the internal NICs of our other DPM pools and, more importantly, the compute nodes in our clusters.
This is a very common setup, as most sites have this distinction between external and internal
traffic, and for many reasons (such as monitoring, security and potential network traffic contention)
it is wise to keep them separate. Details of the bonding method can be found in Appendix A3.
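A bonded interface of this kind is configured on SL5 along the following lines. The bonding mode,
device names and addresses below are assumptions for illustration only; Appendix A3 documents the
settings actually used.

```
# /etc/modprobe.conf (RHEL/SL 5): load the bonding driver for bond0
alias bond0 bonding
options bond0 mode=balance-alb miimon=100

# /etc/sysconfig/network-scripts/ifcfg-bond0 (private address is an example)
DEVICE=bond0
IPADDR=10.0.0.10
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-eth1 (repeated for each slave NIC)
DEVICE=eth1
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
```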
Configuring the MD3200:

Before attempting to install the DPM software we needed to perform some extra steps to enable the OS
on the R710s to communicate with the RAID controllers on the MD3200s. First we installed Dell
OpenManage [Appendix A4]. We then downloaded the disk image containing the MD32xx management
software, loop-mounted the ISO and ran the install program (md32xx_install.bin). The RAID arrays were
then configured using the `SMclient' GUI provided by these steps (this required installing and
configuring X and opening a new shell with X-forwarding enabled). A pointer to detailed instructions
for using the GUI and details of the RAID configuration we used can be found in Appendix A.
Filesystem Setup

Disk Setup

The disks were set up in a variety of different RAID configurations to test the performance of each.
These are listed below and in Table 1.
First MD3200 system (Dellboy)

Each of the 5 storage units was made into a separate RAID 6 volume. Each volume was packaged into a
whole virtual disk and mapped to its own LUN, except for the 5th, which was made into 2 equal virtual
disks.
- The first 2 volumes had xfs partitions made upon them, mounted using the noatime and logbufs=8
  options.
- The second two RAID 6 volumes had a software stripe made over them (Appendix A) to make a RAID 60
  volume, which was formatted with xfs.
- The final 2 "half" volumes had an xfs and an ext4 filesystem on them respectively.
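Given that the combined volume in Table 1 has the summed capacity of its two members (38TB), the
software layer over the two hardware RAID 6 LUNs behaves as a stripe, making the result RAID 60
overall. A minimal sketch with mdadm, using invented device names for the two LUNs:

```
# stripe two hardware RAID 6 LUNs (example device names) into one RAID 60 volume
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdc /dev/sdd
mkfs.xfs /dev/md0
```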
Second MD3200 system (Rodney)

The first two storage units were made into a single, large RAID 10 volume packaged in one virtual
disk; the third unit was a single unit in a RAID 10; the 4th and 5th units were put into a
"double-sized" RAID 6.
After splitting up the RAID arrays into the described volumes, we rebooted the R710s so that the OS
would detect the new devices. Once back up and running we installed the xfsprogs and e4fsprogs
packages, and then created partitions on each of the SSB disk chunks using the parted tool. We then
formed xfs or ext4 filesystems on these fresh partitions, added entries for them into /etc/fstab,
created the mount points and mounted them normally (mount -a). Details of the commands and options
used can be found in Appendix A5.
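The per-volume steps described above amount to something like the following for each LUN. The device
name and mount point are examples; the real commands and options are those of Appendix A5 (the mount
options match the ones given for the xfs volumes above).

```
# partition, format and mount one LUN (/dev/sdb is an assumed device name)
parted -s /dev/sdb mklabel gpt
parted -s /dev/sdb mkpart primary 0% 100%
mkfs.xfs /dev/sdb1            # or mkfs.ext4 for the ext4 volume
mkdir -p /test1
echo '/dev/sdb1 /test1 xfs noatime,logbufs=8 0 0' >> /etc/fstab
mount -a
```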
Raid Setup          Capacity (TB)   # Spindles   Filesystem   Mount point
single RAID 6       19              12           xfs          delboy:/test1
single RAID 6       19              12           xfs          delboy:/test2
double RAID 60      38              24           xfs          delboy:/test3
"half" RAID 6       9               6            ext4         delboy:/test4
"half" RAID 6       9               6            xfs          delboy:/test5
"double" RAID 10    24              24           xfs          rodney:/test1
single RAID 10      12              12           xfs          rodney:/test2
double RAID 6       41              24           xfs          rodney:/test3

Table 1: RAID configurations and filesystems used.
The motivation for looking at such a large number of different configurations was to gain performance
information for as many options as possible. Because of how DPM distributes data, it is best practice
to have all of your DPM pools roughly the same size. So, if one is incorporating a DSBRA into an
existing DPM, it can be split into "chunks" of a size that interoperates with the preexisting pools.
It is also best practice not to expose a pool as a single, huge volume, due to administration
problems (rebuilding such an array would take a long time, as would fscking such a volume) as well as
the simple wisdom of not keeping all your data-eggs in one giant basket.
The PowerVault MD3200 is configured to have read cache enabled, write cache enabled, cache
mirroring enabled and read prefetch enabled. The cache block size is set to 32k. Since the cache
mirroring feature is enabled, the two RAID controllers in the MD3200 have mirrored caches. A single
RAID controller failure can be tolerated with no impact to data availability.
Each server is connected to both controllers on the MD3200. Each server has two SAS cables directly
connected to the MD3200, which eliminates a single point of server-to-storage I/O path failure. A
redundant path from the MD3200 to the MD1200 arrays is deployed to enhance the availability of the
storage I/O path. While a failover setup is not available within DPM at present, this setup could
still be used to
allow easier data recovery in the event of server failure.
3.2. Software
DPM software [8] provides a head node, on which meta-data operations are carried out, and disk
servers that provide a series of data transfer protocols as indicated in Figure 3 (here rfio was
tested for local file access while GridFTP was used for remote access). The specifics of this
installation are provided in the next section. DPM is agnostic as to the underlying file system used.
Here we primarily used xfs, but a single partition was set up with ext4 as indicated in Table 1.
Figure 3 – DPM architecture
3.3. Installation Specifics
• The following describes how to install a gLite DPM. The repositories, package names and some
  paths vary slightly for other "flavours" of DPM, such as EMI; however, the services and the
  principles under which they run remain the same.
• The first steps in installing and configuring a DPM pool node (or almost any grid service) are to
  install an x509 grid host certificate and key in /etc/grid-security/ (with the correct
  permissions). As it can take a day or two to obtain a valid host certificate, it is advised to
  "order" one ahead of time. As a locally specific step we also manually configured a dpmmgr
  (DPM manager) user and group. This user owns all the files within the DPM pool and is configured
  automatically if you use yaim [described in Appendix A6] to set up your disk pool. This user must
  have the same UID and GID on both the DPM head node and throughout your disk pools. The dpmmgr
  user and group must own any directory to be exported into the DPM pool.
• Once these first steps were complete we enabled the gLite yum repositories as well as the DAG
  repo [Appendix A6], being sure to disable the EPEL repositories (to avoid DPM version conflicts
  with other DPM flavours). The install was then triggered by simply "yum installing" the glite-DPM
  disk metapackage.
• Once the many packages are installed, one can use yaim to perform the rest of the configuration,
  which requires a properly configured site-info.def config file. It is also possible to set up a
  disk pool manually, as on a disk pool yaim only performs two main tasks: setting up
  /etc/shift.conf and setting up the authentication mechanism. The latter is by far the more
  complicated of the two, and it is advised to leave it to yaim.
• /etc/shift.conf contains a list of trusted hostnames for the various DPM access methods. At its
  simplest it should just contain the DPM head node and the DPM pool node. To speed up
  inter-disk-pool communication (by removing the need for authentication) one can add the hostnames of
  other disk pools. To enable free testing between the two disk pools we added each one's hostname
  to the other's shift.conf.
• Once a pool node has been installed there are a few further steps one must take to enable it in
  the DPM, mainly involving properly configuring the DPM head node to acknowledge and accept the new
  pool node. First, one must add the pool's hostname to the shift.conf on the DPM head node. Second,
  one must ensure that communication between the head node, the new disk pool, the outside world and
  any internal clusters works as intended (this may involve some editing of routing tables or
  /etc/hosts files and possibly firewall rules). Finally, one must make sure that the directory to
  be "exported" on the disk pool is owned by the dpmmgr user and group. Once all these have been
  checked, one can enable the disk pool by issuing a dpm-addfs command on the DPM head node.
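The head-node side of these steps can be sketched as follows. All hostnames and the pool name are
invented, and the shift.conf keywords shown should be cross-checked against the DPM documentation or
against what yaim generates.

```
# /etc/shift.conf on the DPM head node: trust the new disk server
# (dpmhead.example.org and rodney.example.org are example hostnames)
RFIOD TRUST  dpmhead.example.org rodney.example.org
RFIOD RTRUST dpmhead.example.org rodney.example.org
RFIOD WTRUST dpmhead.example.org rodney.example.org

# then publish the exported filesystem into an existing pool:
dpm-addfs --poolname examplepool --server rodney.example.org --fs /test1
```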
4. Evaluation

The architecture proposed in this white paper was evaluated in Lancaster University's computing
centre. This section describes the test methodology and the hardware used for verification. It also
contains details of the functionality and performance tests. Performance results are presented in
subsequent sections.
4.1. Methodology
A series of local and remote file-system tests was employed. The local tests used the standard
tools dd and iozone. The remote tests used the protocols commonly used by the DPM storage system:
gsiftp and rfio. The remote tests also used workflows matching those used by the ATLAS LHC
experiment. Most focus was given to the latter tests, where a variety of parameters were tested.
These are described below and in more detail in Appendix B.
4.2. Test bed
The test bed used to evaluate the functionality and performance of the DPM solution is described
here. Figure 4 shows the test bed used in this study.
The HPC compute cluster consisted of 512 cores across 64 servers.
Figure 4 – The Lancaster DPM Testbed
Two PowerEdge R710 servers were used as the DPM disk servers. Both servers were connected to
PowerVault MD3200 storage extended with PowerVault MD1200 arrays. A switch configured with bonded
Gigabit Ethernet connections provided the private network between the servers and the compute
nodes.
4.3. Local tests

To test the standalone performance of our SSB partitions outside of the DPM paradigm, we used the
well-known tools dd [Appendix B] and iozone [Appendix B].
The dd tests were split into two sections. The first used a test suite, written at SARA [2], which
performs repeated sets of multiple simultaneous dd write and read tests with large files [Appendix B].
We ran these tests using 8 simultaneous threads and let them run for approximately 24 hours, after
which average read and write rates were calculated. The second set of tests was a much simpler
single-threaded dd write test, writing 90GB files from /dev/zero and then following up with a read
test to /dev/null. These tests were intended to provide a baseline for our other results.
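The single-stream baseline can be reproduced along the lines below. The sizes here are deliberately
scaled down for illustration; the tests themselves wrote 90GB files precisely to defeat the 24GB of
RAM, and the target path is an example.

```shell
# write test: stream zeros to the volume under test (scaled-down size here)
TARGET=${TARGET:-/tmp/ddtest.img}
dd if=/dev/zero of="$TARGET" bs=1M count=16 conv=fdatasync
# read test: stream the file back to /dev/null
dd if="$TARGET" of=/dev/null bs=1M
```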
Our iozone tests were similarly split: each volume was individually subjected to a sequential read,
a sequential write and a random read/write test using 8 threads and large files. We chose large
files to remove the effects of RAM caching (they had to be so large because of the 24GB of RAM on
the host machines), which seemed to affect our first set of tests using smaller files. The exact
commands used can be seen in Appendix B.
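An iozone invocation of the kind described might look like the following; the per-thread file size,
record size and target paths are assumptions for this sketch, and the exact commands are those of
Appendix B.

```
# 8 threads: sequential write (-i 0), sequential read (-i 1) and random
# read/write (-i 2); -s is the per-thread file size, chosen large relative
# to the 24GB of host RAM; the target paths are examples
iozone -i 0 -i 1 -i 2 -r 1024k -s 8g -t 8 \
    -F /test1/ioz1 /test1/ioz2 /test1/ioz3 /test1/ioz4 \
       /test1/ioz5 /test1/ioz6 /test1/ioz7 /test1/ioz8
```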
We performed another batch of iozone tests on a single volume, varying the number of threads used
during a random-read iozone run; the aim was to map any performance degradation as the number of
simultaneous threads accessing a volume increased.
4.4. Remote tests

GridFTP:

To test remote transfers using GridFTP we utilized a test developed by SARA [10]. This test sets up
a series of remote transfers using the GridFTP protocol. A set of files was created on the test
servers and copied back to the client. 2GB file sizes were used, with the files created from random
data taken from /dev/random.
Rfcp:

Rfcp is the copy command most commonly used for local copies on DPM storage elements. As ATLAS jobs
currently copy their input data files to the worker node before running, this test simulates the
interaction those jobs would have with the storage on DPM sites. Real ATLAS "AOD" files of 2GB size
were used; files were copied continuously, using a different (randomly chosen) file each time. A
real analysis job would have copied the file and then processed it locally, so this test represents
the heaviest possible IO load from that number of concurrent jobs (i.e. the case where all the
copies occur at the same time).
The wrapper scripts used to submit the copies are given in Appendix B.
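The flavour of those wrappers can be sketched as a small shell function that backgrounds one copy
per input file and waits for the whole batch. Everything here is illustrative (the function name,
paths and the COPY_CMD override are invented); the real Appendix B scripts differ in detail, with
rfcp as the copy client.

```shell
# launch one background copy per source file, then wait for the whole batch;
# COPY_CMD defaults to rfcp, the DPM local-copy client
run_copies() {
    dest=$1; shift
    i=0
    for src in "$@"; do
        i=$((i + 1))
        ${COPY_CMD:-rfcp} "$src" "$dest/copy.$i" &   # copies run concurrently
    done
    wait
}
```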
Up to 250 simultaneous copies were performed on each disk server, representing a realistic maximum
number of jobs for a server of this capacity.
Direct ROOT Reading over RFIO:

ROOT is the data analysis framework used by particle physicists. This test uses the ROOT libraries
to open and read a file directly using RFIO (i.e. without copying it first). A real ATLAS "AOD" file
is used, but one which has not been "reordered" by entry (see [9] for more details); files are
continuously opened and read (using a different file each time), but no computation is done. In
these ways the test is both realistic and "worst-case" in terms of IO load.

The code for both the test and the wrapper is given in Appendix B. A version of this test is now
available through the DPM performance test suite [11][12].

Up to 100 simultaneous jobs were run against a single filesystem on each server. This represents a
realistic load for the capacity provided.
5. Performance Benchmark Results

This section presents the results of performance benchmarking on the DPM solution.
5.1. Local Tests
The results of the dd and iozone tests are shown in the tables below, while the multi-threaded
iozone read results are shown in the figure.
dd test results

Volume                      dd suite result       Straight dd result
                            (MB/s read/write)     (MB/s read/write)
19TB single RAID 6          178/20                688/301
38TB double RAID 60         83/80                 675/290
9TB "half" RAID 6           176/18                501/196
9TB "half" RAID 6 (ext4)    93/52                 268/256
24TB "double" RAID 10       184/28                289/125
12TB single RAID 10         174/29                626/294
41TB double RAID 6          168/24                713/252
The multi-threaded "dd suite" results in the table above show only a small variation in read and
write rates across most of the volume setups, with the exception of the large software RAID 60 and
the ext4 volume, which had much lower read rates than the others. Both of these volumes did, however,
make up for lower read performance with much higher write rates. For a typical grid storage element
most of the operations are of the form "write once, read many", so the additional write performance
wouldn't be a great benefit, but for other applications it could be useful.
In the case of the single-threaded, standard dd tests there was slightly greater variation in
performance between the setups. The ext4 partition again had a lower read rate but this time
displayed no considerable gain in write speed; any gains in write rates seem to depend on the number
of threads. The large RAID 10 partition also seemed to perform worse than the others. However, these
tests should be taken with the proverbial pinch of salt, as few real-world applications would
involve a single-threaded dump to a partition of any real size.
Iozone test results

Volume                      Read (KB/s)   Write (KB/s)   Random R/W (KB/s)
19TB single RAID 6          537171        229333         136903/101411
38TB double RAID 60         799849        407770         176563/127596
9TB "half" RAID 6 (ext4)    427529        239111         173113/92647
9TB "half" RAID 6           511624        220575         136544/96933
24TB "double" RAID 10       564728        215497         202388/210904
12TB single RAID 10         455706        230065         122292/174229
41TB double RAID 6          571992        204378         199682/66559
The more sophisticated multi-threaded iozone results show little variation between most of the
volumes for the sequential read and write tests. The volume that stands out here is the large
RAID 60 volume. This volume also performs well in the random r/w tests, although there the
best-performing volume is the large RAID 10 volume. Random read/write performance in general seems
to correlate roughly with the number of spindles in the volume.
Multi-thread Random Read test:
This plot shows an interesting phenomenon: although the rate of increase in the total read rate
falls off rapidly as the number of threads grows, it did not, within the scope of these results,
fully level off, and an increase in read rate is seen even when the number of threads is greater
than the number of CPUs on the machine. This suggests that multi-threading reads is a good way of
squeezing every last bit of read performance from a volume.

Had time allowed, we would have liked to conduct this test with hyper-threading off (reducing the
effective number of cores on the machine to 8), and also to repeat it on all the unique volumes.
5.2. Remote Tests
GridFTP

For the remote GridFTP tests, single-file transfer rates of 110 MB/s were obtained. This rate was
seen to scale with the number of simultaneous files transferred, illustrating that the test is
network limited (the external link, when accessed from a single client, was limited to 1 Gbit/s).
It was observed that similar transfer rates could be achieved even while the local tests below were
carried out (with an independent interface used for access).
Rfcp
As mentioned above, 250 simultaneous rfcp processes were launched from the cluster, targeting
randomly chosen partitions on one server at a time. On launching these, high levels of IO wait were
observed on the servers, as indicated in Figure 5.1. It was found that increasing the block device
read-ahead with the following command could alleviate this, as has been observed on other DPM disk
servers in the UK [13]. Those studies suggested the 8MB value that was used here, but the optimum
value is likely to be infrastructure dependent.

/sbin/blockdev --setra 16384 /dev/dm-5
As indicated in Figure 5.2, this immediately alleviated the load and left the throughput of data
limited only by the available network bandwidth.
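The argument to --setra is a count of 512-byte sectors, which is how 16384 corresponds to the 8MB
read-ahead suggested by [13]:

```shell
# blockdev --setra takes a count of 512-byte sectors,
# so 16384 sectors = 16384 * 512 bytes = 8 MiB of read-ahead
readahead_sectors=16384
readahead_mb=$((readahead_sectors * 512 / 1024 / 1024))
echo "${readahead_mb} MB read-ahead"
```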
Figure 5.1: Load and WAIT CPU on "Rodney" with 250 rfcp jobs, before the block device read-ahead
was increased.
Figure 5.2: System load and network utilization on "Rodney" for 250 rfcp jobs, showing that once
the block device read-ahead value is increased, the load is reduced and the available network is
fully utilized.
ROOT over RFIO
As mentioned above, we ran 100 simultaneous jobs. The block device tuning mentioned in the last
section was already applied. However, as shown in Figure 5.3, there were once more high levels of
WAIT CPU and the network was not fully utilized. In this case, setting the RFIO buffer size to a
larger value can alleviate the situation. This can be done in /etc/shift.conf on the client (i.e.
the worker node), or (as in this case) in the application, by using the environment variable:

export RFIO_IOBUFSIZE=524288
As shown in Figure 5.4, this substantially reduces the CPU wait and means that the network can be
fully utilized. However, it is worth noting that a lower buffer size is better for job efficiency
(as shown in Table 5.1). This is because with larger buffer sizes and random IO, a large amount of
data is shipped for every call, only some of which is required by the job. The balance chosen for a
site will depend on the system and job mix.
Figure 5.3: System load, CPU and network usage on "Rodney" for 100 ROOT direct RFIO jobs, showing
high levels of IO wait with the default RFIO buffer size of 128k.

Figure 5.4: System load, CPU and network usage on "Rodney" for 100 ROOT direct RFIO jobs, showing
that setting the RFIO buffer to 512k alleviates the CPU WAIT and enables the network to be fully
utilized.
Single job:
RFIO buffer size    128k    512k
CPU time            249     317
Wall time           921     2444
CPU / wall time     27%     13%

100 simultaneous jobs:
RFIO buffer size    4k      128k    512k
CPU time            227     267     405
Wall time           19813   4321    71654
CPU / wall time     1.1%    4.5%    0.6%

Table 5.1: Times taken for a single job (top) and for 100 simultaneous ROOT direct RFIO jobs
(bottom), showing that 128k buffer sizes offer lower overall job times and better CPU efficiencies.
6. Conclusions
This solution guide provides information on deploying a DPM solution for HPC clusters. The
guidelines include complete hardware and software information along with detailed configuration
steps, best practices and performance tuning notes to make such a solution easy to deploy and
manage.
We have found that the Dell HPC scalable storage building block reference architecture (DSBRA)
is suitable for use as a storage server with DPM for ATLAS analysis workloads, even when
stressed with a realistic number of jobs for the capacity provided.
To deal with these workloads it is necessary to tune the block device read-ahead (both when
copying files to the worker node and when reading directly via RFIO) and the RFIO_IOBUFSIZE
(particularly for direct reading), and we suggest values for these parameters. Larger buffers
can, however, mean worse CPU efficiency for direct random reading, so the values will depend on
the available network bandwidth and should be tuned for each installation.
With large buffers the system is network limited; we therefore recommend that, if such high
capacity servers are used, 10 Gigabit Ethernet should be deployed.
7. References
1) Dell | Terascala HPC Storage Solution (DT-HSS)
http://content.dell.com/us/en/enterprise/d/business~solutions~hpcc~en/Documents~Dell-terascala-dt-hss2.pdf.aspx
2) Dell NFS Storage Solution for HPC (NSS)
http://i.dell.com/sites/content/business/solutions/hpcc/en/Documents/Dell-NSS-NFS-Storage-solution-final.pdf
3) Red Hat Enterprise Linux 5 Cluster Suite Overview
http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/pdf/Cluster_Suite_Overview/Red_Hat_Enterprise_Linux-5-Cluster_Suite_Overview-en-US.pdf
4) Deploying a Highly Available Web Server on Red Hat Enterprise Linux 5
http://www.redhat.com/f/pdf/rhel/Deploying_HA_Web_Server_RHEL.pdf
5) Platform Cluster Manager
http://www.platform.com/cluster-computing/cluster-management
6) Optimizing DELL™ PowerVault™ MD1200 Storage Arrays for High Performance Computing (HPC)
Deployments
http://i.dell.com/sites/content/business/solutions/power/en/Documents/Md-1200-for-hpc.pdf
7) Array Tuning Best Practices
http://www.dell.com/downloads/global/products/pvaul/en/powervault-md3200i-performance-tuning-white-paper.pdf
8) DPM
https://svnweb.cern.ch/trac/lcgdm/wiki/Dpm
9) ROOT I/O
Vukotic, I., Bhimji, W., Biscarat, C., Brandt, G., Duckeck, G., van Gemmeren, P., Peters, A.
and Schaffer, R. D., 2010. Optimization and performance measurements of ROOT-based data formats
in the ATLAS experiment. ATL-COM-SOFT-2010-081. To be published in J. Phys.: Conf. Series.
10) SARA test suite (available from http://web.grid.sara.nl/acceptance_test)
11) Hellmich, M. Stress testing and developing the distributed data storage used for the Large
Hadron Collider. Available from:
http://www2.ph.ed.ac.uk/~wbhimji/GridStorage/StressTestingAndDevelopingDistributedDataStorage-MH.pdf
12) DPM performance testsuite: https://svnweb.cern.ch/trac/lcgdm/wiki/Dpm/Admin/Performance
13) http://northgrid-tech.blogspot.com/2010/08/tuning-areca-raid-controllers-for-xfs.html [accessed
February 2012]
8. Acknowledgements
The tuning applied in this paper makes use of a huge amount of work carried out in the UK,
including that by John Bland, Sam Skipsey, Alessandra Forti and others in the GridPP Storage
Group.
Appendix A: Installation Resources
1 - Scientific Linux ISO Download:
https://www.scientificlinux.org/download
2 - Ports used by DPM Disk Server
PORT SERVICE PROTOCOL
5001 RFIO TCP
2811 GRIDFTP TCP
20000:25000 GLOBUS PORT RANGE TCP
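The port list above translates directly into firewall rules. The following sketch only echoes the corresponding iptables commands (a default-deny INPUT chain is an assumption about a typical setup); pipe the output to sh on the disk server to apply them.

```shell
# Echo iptables ACCEPT rules for the DPM disk-server ports listed above.
# Assumes a default-deny INPUT chain; adjust to your firewall layout.
rules=$(for p in 5001 2811 20000:25000; do
    echo "iptables -A INPUT -p tcp --dport $p -j ACCEPT"
done)
echo "$rules"
```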
3 - NIC Bonding Config files.
ifcfg-bond0:
DEVICE=bond0
ONBOOT=yes
BOOTPROTO=static
NETMASK=255.255.240.0
IPADDR=10.41.52.101
NETWORK=10.41.48.0
USERCTL=no
BONDING_OPTS='mode=balance-alb miimon=100 xmit_hash_policy=layer3+4'

ifcfg-ethX (where X is a bond member):
DEVICE=eth3
HWADDR=84:2B:2B:72:66:45
ONBOOT=yes
BOOTPROTO=none
USERCTL=no
MASTER=bond0
SLAVE=yes
4 - Dell OpenManage & other Utility Documentation Links:
http://www.dell.com/content/topics/global.aspx/sitelets/solutions/management/en/openmanage?c=us&l=en&cs=555
5 - File System and Mounting Options.
Partitions were created using the parted CLI (given a gpt label with the mklabel command and
created with the mkpart command), consuming the whole virtual disk volume in most cases.
Filesystems were created using `mkfs.xfs -f /dev/XXX` (or `mkfs.ext4 -F` in the case of the
ext4 partition).
The xfs volumes were mounted using the following options in /etc/fstab:
rw,noatime,logbufs=8
The ext4 volume was mounted using just “defaults,noatime”.
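The partitioning, filesystem and mount steps above can be sketched end to end. The commands are echoed rather than executed, and /dev/sdX and the /storage mount point are illustrative names, not the ones used on our servers.

```shell
# End-to-end sketch of the steps described above, echoed for safety.
DEV=/dev/sdX            # illustrative device name
MNT=/storage            # illustrative mount point
echo "parted -s $DEV mklabel gpt"
echo "parted -s $DEV mkpart primary 0% 100%"
echo "mkfs.xfs -f ${DEV}1"
echo "${DEV}1 $MNT xfs rw,noatime,logbufs=8 0 0   # append to /etc/fstab"
```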
6 - DPM Installation and Configuration.
Yum repositories:
EGI-trust.repo
[EGI-trustanchors]
name=EGI-trustanchors
baseurl=http://repository.egi.eu/sw/production/cas/1/current/
gpgkey=http://repository.egi.eu/sw/production/cas/1/GPG-KEY-EUGridPMA-RPM-3
gpgcheck=1
enabled=1
glite-SE_dpm_disk.repo
[glite-SE_dpm_disk]
name=gLite 3.2 glite-SE_dpm_disk
baseurl=ftp://glitesoft.cern.ch/EGEE/gLite/R3.2/glite-SE_dpm_disk/sl5/x86_64/RPMS.release/
gpgkey=ftp://glite.web.cern.ch/glite/glite_key_gd.asc
gpgcheck=0
enabled=1
[glite-SE_dpm_disk_updates]
name=gLite 3.2 glite-SE_dpm_disk
baseurl=ftp://glitesoft.cern.ch/EGEE/gLite/R3.2/glite-SE_dpm_disk/sl5/x86_64/RPMS.updates/
gpgkey=ftp://glite.web.cern.ch/glite/glite_key_gd.asc
gpgcheck=0
enabled=1
[glite-SE_dpm_disk_ext]
name=gLite 3.2 glite-SE_dpm_disk
baseurl=ftp://glitesoft.cern.ch/EGEE/gLite/R3.2/glite-SE_dpm_disk/sl5/x86_64/RPMS.externals/
gpgcheck=0
enabled=1
Yaim documentation:
https://twiki.cern.ch/twiki/bin/view/LCG/GenericInstallGuide320
Example shift.conf:
RFIOD TRUST fal-pygrid-30.lancs.ac.uk rodney.lancs.ac.uk delboy.lancs.ac.uk
RFIOD WTRUST fal-pygrid-30.lancs.ac.uk rodney.lancs.ac.uk delboy.lancs.ac.uk
RFIOD RTRUST fal-pygrid-30.lancs.ac.uk rodney.lancs.ac.uk delboy.lancs.ac.uk
RFIOD XTRUST fal-pygrid-30.lancs.ac.uk rodney.lancs.ac.uk delboy.lancs.ac.uk
RFIOD FTRUST fal-pygrid-30.lancs.ac.uk rodney.lancs.ac.uk delboy.lancs.ac.uk
DPM PROTOCOLS rfio gsiftp
Appendix B: Benchmarks and Test Tools
1. dd
dd is a Linux utility provided by the coreutils rpm distributed with SL 5.5. It was used to
measure raw data throughput.
dd if=/dev/zero of=zerofile bs=1M count=90000
dd if=zerofile of=/dev/null bs=1M count=90000
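dd reports its own throughput figure, but it can also be derived by hand: the commands above move 90000 MiB, so dividing by the elapsed time gives MiB/s. The elapsed time below is a made-up illustration, not a measured result from these tests.

```shell
# Throughput = data moved / elapsed time. 90000 x 1 MiB blocks, as in the
# dd commands above; the 600 s elapsed time is purely illustrative.
size_mib=90000
elapsed_s=600
echo "$((size_mib / elapsed_s)) MiB/s"    # prints 150 MiB/s
```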
2. IOzone
IOzone can be downloaded from http://www.iozone.org/. Version 3.353 was used for these tests and
installed on the servers. The iozone benchmark was used to measure sequential read and write
throughput (MB/sec) as well as random read and write I/O operations per second (IOPS).
iozone commands used:
{write, read, r/w}
iozone -i {0,1,2} -c -e -w -r 1024k -s 64g -t 8 -+n | tee -a resultfile.txt
random read for X={1,2,4,8,16,32} threads
iozone -i 5 -c -e -w -r 1024k -s 64g -t $X -+n
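The random-read sweep over thread counts expands to one iozone invocation per value of X; a sketch (commands echoed rather than run):

```shell
# Expand the X={1,2,4,8,16,32} random-read sweep into explicit commands.
for X in 1 2 4 8 16 32; do
    echo "iozone -i 5 -c -e -w -r 1024k -s 64g -t $X -+n"
done
```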
The IOzone tests were run from 1-64 nodes in clustered mode. All tests were N-to-N, i.e. N clients
would read or write N independent files.
The following table describes the command line arguments.
IOzone ARGUMENT DESCRIPTION
-i 0    Write test
-i 1    Read test
-i 2    Random Access test
-+n     No retest
-c      Includes close in the timing calculations
-e      Includes flush in the timing calculations
-t      Number of threads
-r      Record size
-s      File size
-+m     Location of clients to run IOzone on when in clustered mode
-w      Does not unlink (delete) temporary file
-I      Use O_DIRECT, bypass client cache
3. rfcp
The test script used for copying with rfcp is detailed below. It is necessary to generate a
proxy and point to it with the X509_USER_PROXY environment variable (this can be done with the
voms-proxy-init command on a grid UI, for example).
RANGE=50
echo Starting Test
export X509_USER_PROXY=/opt/sl5_soft/wahid/x509up_u521399
for k in `seq 1 40`
do
  echo Test $k
  date
  let number=$RANDOM%$RANGE
  let number2=$RANDOM%5+1
  echo rfio://delboy.lancs.ac.uk/test$number2/remotetestfiles/rtest$number.2G.file
  rfcp rfio://delboy.lancs.ac.uk/test$number2/remotetestfiles/rtest$number.2G.file /dev/null
done
Multiple copies of this test were submitted to the batch system with the following command.
for i in `seq 1 250` ; do qsub -N Dellboy250Test${i}
-o /home/wahid/DellboysStressTests/250test${i}.out
-e /home/wahid/DellboysStressTests/250test${i}.err RfcpStressTest ;done
4. Direct reading with ROOT over rfio
ROOT is the standard package used for data analysis by particle physicists and is built into
the ATLAS data models and analysis software, which makes this test realistic for an ATLAS file.
The code for the program is given below. It requires the ROOT libraries and the DPM libraries
(mentioned below) to be available, and it is compiled against them using the Makefile, also
given below. It also requires access to an ATLAS-like AOD file and a shared library (called
aod.so here) built from it. The latter can be built using TFile::MakeProject in ROOT (see
http://root.cern.ch/root/html/TFile.html#TFile:MakeProject). For more details please contact
the authors.
#include <iostream>
#include <iomanip>
#include <stdlib.h>
#include <fstream>
#include <TROOT.h>
#include <TRFIOFile.h>
#include <TFile.h>
#include <TString.h>
#include <TTreePerfStats.h>
#include <TTree.h>
#include "TPluginManager.h"

using namespace std;

int main(int argc, char *argv[]) {
  TString inputFile = argv[1];
  Int_t cachesize = 0;
  if (argc > 2) {
    cachesize = atoi(argv[2]);
  }
  TFile *_file0 = TFile::Open(inputFile, "READ");
  TTree *T = (TTree*)_file0->Get("CollectionTree");
  Long64_t nentries = T->GetEntries();
  if (argc > 3) {
    nentries = atoi(argv[3]);
  }
  if (cachesize > 0) {
    cout << "setting cache " << endl;
    cout << cachesize << endl;
    T->SetCacheSize(cachesize);
    T->SetCacheEntryRange(0, nentries);
    T->AddBranchToCache("*", kTRUE);
  }
  TTreePerfStats ps("ioperf", T);
  cout << "Total Entries: " << nentries << endl;
  for (Long64_t i = 0; i < nentries; i++) {
    if (i%100 == 0) {
      cout << "processed " << i << " entries" << endl;
    }
    T->GetEntry(i);
  }
  ps.SaveAs("aodperStraightRFIO.root");
  ps.Print();
}
Makefile:
ROOTCFLAGS = $(shell root-config --cflags)
ROOTLIBS = $(shell root-config --libs)
ROOTGLIBS = $(shell root-config --glibs)
CXX = g++
CXXFLAGS = -g -Wall -fPIC
LD = g++
LDFLAGS = -g
LDFLAGS += -m32
SOFLAGS = -shared
CXXFLAGS += $(ROOTCFLAGS)
LIBS = $(ROOTLIBS)
NGLIBS = $(ROOTGLIBS)
NGLIBS += -lTreePlayer
NGLIBS += -lRFIO
GLIBS = $(filter-out -lNew -lPostscript -lPhysics -lGui, $(NGLIBS))
.SUFFIXES: .cc .C
# ====================================================================
IOPerformerGrid: IOPerformerGrid.o
# -------------------------
$(LD) $(LDFLAGS) -o IOPerformerGrid IOPerformerGrid.o aod/aod.so
libshift.so.2.1 liblcgdm.so $(GLIBS)
.cc.o:
$(CXX) $(CXXFLAGS) -c $<
The script used for this test is given below. As for the test above, it is necessary to
generate a proxy. It is also necessary to create a symbolic link named libshift.so.2.1
pointing to libdpm.so, and for that link to be in the LD_LIBRARY_PATH. libdpm.so should be
found in $LCG_LOCATION in a standard grid worker node installation. However, for our test it
was necessary to replace this with a more recent version of the library to allow the
RFIO_IOBUFSIZE to be set by environment variable.
RANGE=25
echo Starting Test
export X509_USER_PROXY=/opt/sl5_soft/wahid/x509up_u521399
echo "Setting paths"
export LD_LIBRARY_PATH=/opt/sl5_soft/wahid/libs:$LD_LIBRARY_PATH
#ln -s $LCG_LOCATION/lib/libdpm.so /opt/sl5_soft/wahid/libs/libshift.so.2.1
export RFIO_IOBUFSIZE=4
for k in `seq 1 10`
do
echo Test $k
date
/opt/sl5_soft/wahid/IOPerformerGrid rfio://delboy.lancs.ac.uk/test1/aodfiles/AOD.067184.big.pool.root.7154799.$RFTSTNO
done
100 simultaneous jobs are submitted to the batch system with the following command.
for i in `seq 1 100` ; do qsub -N ARodBuf100Rfio512kTest${i} -o
/home/wahid/DellboysStressTests/ARodRfio512k${i}.out -e
/home/wahid/DellboysStressTests/ARodRfio512k${i}.err -v RFTSTNO=${i}
RfioTestRod ; done
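The library-link step mentioned above (making libshift.so.2.1 point at libdpm.so and putting it on the library path) can be sketched as follows. The scratch directory stands in for /opt/sl5_soft/wahid/libs, and the $LCG_LOCATION default is an assumption, not a value from our installation.

```shell
# Create the libshift.so.2.1 -> libdpm.so link in a scratch directory
# (a stand-in for the real library directory) and put it on the path.
LCG_LOCATION=${LCG_LOCATION:-/opt/lcg}     # assumed default location
LIBDIR=$(mktemp -d)
ln -s "$LCG_LOCATION/lib/libdpm.so" "$LIBDIR/libshift.so.2.1"
export LD_LIBRARY_PATH=$LIBDIR:$LD_LIBRARY_PATH
readlink "$LIBDIR/libshift.so.2.1"
```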