Parallel IO
These slides are possible thanks to these sources: Jonathan Dursi (SciNet, Toronto) – Parallel I/O Tutorial; Argonne National Labs – HPC I/O for Computational Scientists; TACC/Cornell MPI-IO Tutorial; Course in Data Intensive Computing – SICS; NERSC Lustre Notes; Quincey Koziol – HDF Group
nci.org.au
References
• eBook: High Performance Parallel I/O – Chapter 8: Lustre; Chapter 13: MPI-IO; Chapter 15: HDF5
• HPC I/O for Computational Scientists (YouTube); Slides
• Parallel IO Basics – Paper
• eBook: Memory Systems: Cache, DRAM, Disk – Bruce Jacob
5/11/2017
2
nci.org.au
The Advent of Big Data
• Big Data refers to datasets and data flows large enough to have outpaced our capability to store, process, analyse and understand them
– Increases in computing power make simulations larger and more frequent
– Increases in sensor resolution create larger observational datasets
• Data sizes that used to be measured in MBs or GBs are now measured in TBs or PBs
• It is easier to generate the data than to store it
nci.org.au
The Four V’s
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
5/11/2017
3
nci.org.au
BIG DATA PROJECTS AT THE NCI
nci.org.au© National ComputationalInfrastructure 2014
Numerical Weather Prediction Roadmap (c/- Tim Pugh, BoM)
Model topography of Sydney, NSW (research 1.5 km topography)
2013: 40 km Global Model – 2 x daily 10-day & 3-day forecast; 12 km Regional Model – 4 x daily 3-day forecast; 4 km City/State Model – 4 x daily 36-hour forecast
2020: 12 km Global Model – 2 x daily 10-day & 3-day forecast; 5 km Regional Model – 8 x daily 3-day forecast; 1.0 km City/State Model – 24 x daily 18-hour or 36-hour forecast
Increasing model resolution for improved local information; future model ensembles for likelihood of significant weather
IN53E-01: “NCI Computational Environments and HPC/HPD” #AGU14, 19 December 2014, @BenJKEvans
5/11/2017
4
nci.org.au© National ComputationalInfrastructure 2014
Capture, analysis & application of Earth Obs
c/- Adam Lewis, GA
IN53E-01:“NCI Computational Environments and HPC/HPD” #AGU14, 19 December, 2014 @BenJKEvans
nci.org.au
ESA’s Sentinel Constellation
We care for a safer world
• Sentinel-1's systematic observation scenario, with one/two main high-rate modes of operation, results in significantly large acquisition segments (data takes of a few minutes)
• 25 min in high-rate modes leads to about 2.4 TB of compressed raw data per day for the 2 satellites
• Wave Mode is operated continuously over ocean where the high-rate modes are not used
[Figure: Sentinel-1 observation scenario and impact on data volumes – IW-mode products at 2 m, 6 m and 15 m resolution; 16 GB for SLC / 4 GB for GRD-HR; 46 GB for SLC / 12 GB for GRD-HR]
5/11/2017
5
nci.org.au
ESA’s Sentinel Constellation
Sentinel-1s: 2.4 TB/day; Sentinel-2s: 1.6 TB/day (high-res optical land monitoring); Sentinel-3s: 0.6 TB/day (land + marine observation)
nci.org.au
Nepal Earthquake Interferogram using
Sentinel SAR Data
5/11/2017
6
nci.org.au© National ComputationalInfrastructure 2014
How do we bring as much observational scrutiny as possible to the CMIP/IPCC process? How do we best utilize the wealth of satellite observations for the CMIP/IPCC process?
c/- Robert Ferraro, NASA/JPL, ESGF F2F, 2014
IN53E-01: “NCI Computational Environments and HPC/HPD” #AGU14, 19 December 2014, @BenJKEvans
Combining Satellite and Climate Observations
nci.org.au
HARDWARE TRENDS
5/11/2017
7
nci.org.au
Disk and CPU Performance (HPCS2012)
[Figure: disk bandwidth (MB/s) vs CPU speed (MIPS) over time – the gap is roughly 1000x, and disks are getting relatively slower]
Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf
nci.org.au
Memory and Storage Latency
Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf
nci.org.au
Assessing Storage Performance
• Data Rate – MB/sec
–Peak or sustained
–Writes are faster than reads
• IOPS – IO Operations Per Second
–open(), close(), seek(), read(),
write()
5/11/2017
9
nci.org.au
Lab – measuring MB/s and IOPS
nci.org.au
Storage Performance
• Data Rate – MB/sec
– Peak or sustained
– Writes are faster than reads
• IOPS – IO Operations Per Second
– open(), close(), seek(), read(), write()
Device      Bandwidth (MB/s)   IOPS
SATA HDD    100                100
SSD         250                10000
HDD:
Open, Write, Close 1000 x 1 kB files: 30.01 s (eff: 0.033 MB/s)
Open, Write, Close 1 x 1 MB file: 40 ms (eff: 25 MB/s)
Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf
5/11/2017
10
nci.org.au
SSD:
Open, Write, Close 1000 x 1 kB files: 300 ms (eff: 3.3 MB/s)
Open, Write, Close 1 x 1 MB file: 4 ms (eff: 232 MB/s)
• SSDs are better at IOPS – no moving parts; the remaining latency is at the controller, system calls, etc.
• SSDs are still very expensive, so disk is here to stay
(A small timing sketch follows below.)
Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf
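To make the HDD/SSD timings above concrete, here is a minimal timing sketch (my illustration, not from the slides; file names and sizes are assumptions) that reproduces the "1000 x 1 kB files vs 1 x 1 MB file" comparison on whatever filesystem it is run in:

#include <stdio.h>
#include <string.h>
#include <time.h>

/* Wall-clock helper (POSIX clock_gettime). */
static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    enum { NFILES = 1000, SMALL = 1024, BIG = 1024 * 1024 };
    static char buf[BIG];
    char name[64];
    memset(buf, 'x', sizeof buf);

    /* 1000 x 1 kB files: cost is dominated by open/close (IOPS), not bandwidth. */
    double t0 = now_sec();
    for (int i = 0; i < NFILES; i++) {
        snprintf(name, sizeof name, "small_%04d.dat", i);
        FILE *f = fopen(name, "wb");
        fwrite(buf, 1, SMALL, f);
        fclose(f);
    }
    double t_small = now_sec() - t0;

    /* 1 x 1 MB file: a single open/write/close, essentially bandwidth-limited. */
    t0 = now_sec();
    FILE *g = fopen("big.dat", "wb");
    fwrite(buf, 1, BIG, g);
    fclose(g);
    double t_big = now_sec() - t0;

    printf("1000 x 1 kB files: %.3f s (eff %.3f MB/s)\n", t_small, NFILES * SMALL / t_small / 1e6);
    printf("1 x 1 MB file:     %.3f s (eff %.3f MB/s)\n", t_big, (double)BIG / t_big / 1e6);
    return 0;
}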
nci.org.au
Raijin /short – aggregate – 150 GB/sec (writes), 120 GB/sec (reads)
5 DDN SFA12K arrays serve /short; each is capable of 1.3M read IOPS and 700,000 write IOPS, yielding a total of 6.5M read IOPS and 3.5M write IOPS
Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf
5/11/2017
11
nci.org.au
[Figure: The Linux Storage Stack Diagram (version 4.10, 2017-03-10) – the full kernel I/O path: applications issue read(2)/write(2)/open(2)/stat(2)/mmap calls through the VFS; block-based filesystems (ext2/3/4, xfs, btrfs, ...), network filesystems (NFS, smbfs, ceph, ...) and pseudo/special-purpose filesystems sit above the page cache; BIOs flow through optional stackable devices (LVM, dm-crypt, dm-raid, mdraid, bcache, ...) into the block layer with its multi-queue (blk-mq) software and hardware dispatch queues and I/O schedulers (noop, deadline, cfq); requests then pass to the SCSI mid layer, low-level drivers and transport classes, or to BIO-based drivers, and finally to the physical devices (HDD, SSD, NVMe, SD/MMC, virtio, ...).]
nci.org.au
Scientific I/O
• I/O is commonly used by scientific applications to achieve goals like:
– Storing numerical output from simulations for later analysis
– Implementing 'out-of-core' techniques for algorithms that process more data than can fit in system memory and must page data in from disk
– Checkpointing to files that save the state of an application in case of system failure
• In most cases, scientific applications write large amounts of data in a structured or sequential 'append-only' way that does not overwrite previously written data or require random seeks throughout the file
– Having said that, there are seeky workloads, such as graph traversal and bioinformatics problems
• Most HPC systems are equipped with a parallel file system such as Lustre or GPFS that abstracts away spinning disks, RAID arrays, and I/O subservers to present the user with a simplified view of a single address space for reading and writing files
https://www.nersc.gov/users/training/online-tutorials/introduction-to-scientific-i-o/?show_all=1
nci.org.au
Data Access in Current Large-Scale Systems
Current systems have greater support on the logical side, more complexity on the physical side.
Logical (data model) view of data access: Application → Data Model Library → I/O Transform Layer(s) → Files (POSIX)
Physical (hardware) view of data access (data movement): Compute Node Memory → Internal System Network(s) → I/O Gateways → External Sys. Network(s) → I/O Servers → SAN and RAID Enclosures
http://press3.mcs.anl.gov/computingschool/files/2014/01/hpc-io-all-final.pdf
5/11/2017
15
nci.org.au
Common Methods for Accessing Data in Parallel
https://www.nersc.gov/users/training/online-tutorials/introduction-to-scientific-i-o/?show_all=1
nci.org.au
Common Methods for Accessing Data in Parallel
• File per process: simplest to implement – each processor has its own file handle and works independently of other nodes
• Parallel file systems perform well on this type of IO, but it creates a metadata bottleneck
– ls breaks, for example
• Another downside is that program restarts are now dependent on getting the same processor layout
(A minimal file-per-process sketch follows below.)
https://www.nersc.gov/users/training/online-tutorials/introduction-to-scientific-i-o/?show_all=1
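A minimal sketch of the file-per-process pattern just described (my illustration; file names and sizes are assumptions): each rank writes its own file through an independent handle, which is exactly why a restart depends on getting the same processor layout back.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    double local_data[1000];
    char fname[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < 1000; i++)
        local_data[i] = rank + i * 1e-3;

    /* One file per rank: no coordination needed, but many files for the MDS. */
    snprintf(fname, sizeof fname, "output_rank%05d.dat", rank);
    FILE *f = fopen(fname, "wb");
    fwrite(local_data, sizeof(double), 1000, f);
    fclose(f);

    MPI_Finalize();
    return 0;
}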
5/11/2017
16
nci.org.au
Common Methods for Accessing Data in Parallel
• Shared file: many processors share the same file handle, but write to their own distinct sections of a shared file
• If there are shared regions of the file, a lock manager is used to serialize access
– For large process counts O(N), the locking is an impediment to performance
– Even in ideal cases where the file system is guaranteed that processors are writing to exclusive regions, shared-file performance can be lower than file-per-processor
• The advantage of shared-file access lies in data management and portability, especially when a higher-level I/O format such as HDF5 or netCDF is used to encapsulate the data in the file
https://www.nersc.gov/users/training/online-tutorials/introduction-to-scientific-i-o/?show_all=1
nci.org.au
Common Methods for Accessing Data in Parallel
• Collective buffering is a technique used to improve the performance of shared-file access by offloading some of the coordination work from the file system to the application (a hedged MPI-IO sketch follows below)
– A subset of the processors is chosen to be the 'aggregators'
– These collect data from other processors and pack it into contiguous buffers in memory that are then written to the file system
• Reducing the number of processors that interact with the I/O subservers reduces PFS contention
• Originally developed to reduce the number of small, noncontiguous writes
– Another benefit, important for file systems such as Lustre, is that the buffer size can be set to a multiple of the ideal transfer size preferred by the file system
https://www.nersc.gov/users/training/online-tutorials/introduction-to-scientific-i-o/?show_all=1
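Below is a hedged MPI-IO sketch of a collective shared-file write. cb_nodes and cb_buffer_size are reserved MPI-IO hints (how, or whether, an implementation honours them varies); the collective MPI_File_write_at_all call is what lets the library funnel the data through a subset of aggregator ranks as described above. The file name and sizes are illustrative.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                      /* 1 Mi doubles per rank */
    double *buf = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) buf[i] = rank;

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_nodes", "8");               /* number of aggregator ranks */
    MPI_Info_set(info, "cb_buffer_size", "16777216");  /* 16 MB collective buffers    */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* Every rank writes its own contiguous region of the single shared file. */
    MPI_Offset offset = (MPI_Offset)rank * n * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, n, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    free(buf);
    MPI_Finalize();
    return 0;
}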
5/11/2017
17
nci.org.au
HPC IO – How it works
http://press3.mcs.anl.gov/computingschool/files/2014/01/hpc-io-all-final.pdf
nci.org.au
Storing and Organizing Data: Storage Model
http://press3.mcs.anl.gov/computingschool/files/2014/01/hpc-io-all-final.pdf
5/11/2017
18
nci.org.au
Reading and Writing Data to a PFS
http://press3.mcs.anl.gov/computingschool/files/2014/01/hpc-io-all-final.pdf
nci.org.au
Data Distribution in Parallel File Systems
Distribution across multiple servers allows concurrent access.
http://press3.mcs.anl.gov/computingschool/files/2014/01/hpc-io-all-final.pdf
5/11/2017
19
nci.org.au
Request Size and IO Rate
Interconnect latency has a significant impact on effective rate
of I/O. Typically I/Os should be in the O(Mbytes) range.
Tests run on 2K processes of IBM Blue Gene/P at ANL.
http://press3.mcs.anl.gov/computingschool/files/2014/01/hpc-io-all-final.pdf
nci.org.au
Why are writes faster than reads?
http://press3.mcs.anl.gov/computingschool/files/2014/01/hpc-io-all-final.pdf
5/11/2017
20
nci.org.au
Where, how you do I/O matters.
• Binary – smaller files, much faster to read/write.
• You're not going to read GB/TB of data yourself; don't bother trying.
• Write in one chunk, rather than a few numbers at a time.
Timing data: writing 128M double-precision numbers
  Large parallel file system: ASCII 173 s, binary 6 s
  Ramdisk:                    ASCII 174 s, binary 1 s
  Typical workstation disk:   ASCII 260 s, binary 20 s
Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf
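A rough sanity check on those numbers (my arithmetic, not from the slides): 128M doubles is 1 GB in binary, so 6 s corresponds to roughly 170 MB/s; written as ASCII, each value takes 20 or more bytes plus a text conversion, so the same data becomes several GB of formatted output, which is why the ASCII runs are one to two orders of magnitude slower.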
nci.org.au
Where, how you do I/O matters.
• All disk systems do best when reading/writing large, contiguous chunks
• I/O operations (IOPS) are themselves expensive:
– moving around within a file
– opening/closing
• Seeks – 3-15 ms – enough time to read 0.75 MB!
Timing data: reading 128M double-precision numbers (typical workstation disk)
  binary – one large read:               14 s
  binary – 8k at a time:                 20 s
  binary – 8k chunks, lots of seeks:     150 s
  binary – seeky + opens and closes:     205 s
Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf
5/11/2017
21
nci.org.au
Where, how you do I/O matters.
• RAM is much better for random accesses
• Use the right storage medium for the job!
• Where possible, read in contiguous large chunks, then do the random access in memory
• Much better if you use most of the data read in
Timing data: reading 128M double-precision numbers (ramdisk)
  binary – one large read:               1 s
  binary – 8k at a time:                 1 s
  binary – 8k chunks, lots of seeks:     1 s
  binary – seeky + opens and closes:     1.5 s
Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf
nci.org.au
Where, how you do I/O matters.
• Well-built parallel file systems can greatly increase bandwidth
• Many pipes to the network (servers), many spinning disks (bandwidth off of the disks)
• But typically even worse penalties for seeky/IOPSy operations (coordinating all those disks)
• Parallel FS can help with big data in two ways
Timing data: reading 128M double-precision numbers (large parallel file system)
  binary – one large read:               7.5 s
  binary – 8k at a time:                 62 s
  binary – 8k chunks, lots of seeks:     428 s
  binary – seeky + opens and closes:     2137 s
Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf
5/11/2017
22
nci.org.au
The two ways: striping, and multiple readers + writers
Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf
nci.org.au
Storing and Organizing Data:
Application Models
Application data models are supported via libraries that map
down to files (and sometimes directories).
http://press3.mcs.anl.gov/computingschool/files/2014/01/hpc-io-all-final.pdf
5/11/2017
23
nci.org.au
HPC IO Software Stack
http://press3.mcs.anl.gov/computingschool/files/2014/01/hpc-io-all-final.pdf
nci.org.au
Lustre Components
• All of Raijin's filesystems are Lustre, which is a distributed filesystem
• The primary components are the MDSes (metadata servers) and OSSes (object storage servers). The OSSes store the file data as objects, whereas the MDSes map these objects into files
5/11/2017
24
nci.org.au
nci.org.au
Parts of the Lustre System
http://www.cac.cornell.edu/education/training/ParallelMay2012/ParallelIOMay2012.pdf
5/11/2017
25
nci.org.au
Lustre File System and Striping
http://www.cac.cornell.edu/education/training/ParallelMay2012/ParallelIOMay2012.pdf
nci.org.au
How does striping help?
• Due to striping, the Lustre file system scales with the number of OSSes available
• The capacity of a Lustre file system equals the sum of the capacities of the storage targets
– Benefit #1: the maximum file size is not limited by the size of a single target
– Benefit #2: the I/O rate to a file is the aggregate I/O rate to its objects
• Raijin provides 6 MDSes and 50 OSSes, capable of 150 GB/sec, but this speed is shared by all users of the system
• Metadata access can be a bottleneck, so the MDS needs to have especially good performance (e.g., solid state disks on some systems)
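A rough worked example (illustrative numbers, not Raijin specifics): a file striped across 4 OSTs whose servers each sustain ~500 MB/s can in principle be read or written at ~2 GB/s in aggregate, whereas the same file confined to a single OST is capped at ~500 MB/s.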
5/11/2017
26
nci.org.au
Striping data across disks
Parallel FS
• A single client can make use of multiple disk systems simultaneously
• “Stripe” a file across many drives
• One drive can be finding the next block while another is sending the current block
http://www.cac.cornell.edu/education/training/ParallelMay2012/ParallelIOMay2012.pdf
nci.org.au
Lustre on Raijin 1/2
• Lustre is a scalable, POSIX-compliant parallel
file system designed for large, distributed-
memory systems, such as Raijin at NCI
• It uses a server-client model with separate
servers for file metadata and file content
https://www.nersc.gov/users/training/online-tutorials/introduction-to-scientific-i-o/?show_all=1
5/11/2017
27
nci.org.au
Lustre on Raijin 2/2
• For example, on Raijin, the /short and /gdata{1,2} file systems each have a single metadata server (which can be a bottleneck when working with thousands of files) and 720, 520 and 240 Object Storage Targets (OSTs) for /short, /gdata1 and /gdata2 respectively that store the contents of files
• Although Lustre is designed to correctly handle any POSIX-compliant I/O pattern, in practice it performs much better when the I/O accesses are aligned to Lustre's fundamental unit of storage, called a stripe, which has a default size (on NCI systems) of 1 MB (see the stripe-aligned sketch below)
• Striping is a method of dividing up a shared file across many OSTs. Each stripe is stored on a different OST, and the assignment of stripes to OSTs is round-robin
• Striping increases the available bandwidth by using several OSTs in parallel
https://www.nersc.gov/users/training/online-tutorials/introduction-to-scientific-i-o/?show_all=1
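A stripe-aligned MPI-IO write sketch (my illustration, assuming the 1 MB default stripe size quoted above; the file name is an assumption): each rank writes exactly one stripe-sized block at a stripe-aligned offset, so no request straddles a stripe boundary and, with round-robin placement, consecutive blocks land on different OSTs.

#include <mpi.h>

#define STRIPE_SIZE (1 << 20)   /* 1 MiB, the default stripe size on NCI systems */

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static char block[STRIPE_SIZE];            /* exactly one stripe of data */
    for (int i = 0; i < STRIPE_SIZE; i++)
        block[i] = (char)rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "striped.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Offsets are multiples of the stripe size, so writes align with stripe boundaries. */
    MPI_Offset offset = (MPI_Offset)rank * STRIPE_SIZE;
    MPI_File_write_at(fh, offset, block, STRIPE_SIZE, MPI_CHAR, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}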
nci.org.au
Invoking Striping
http://www.cac.cornell.edu/education/training/ParallelMay2012/ParallelIOMay2012.pdf
5/11/2017
28
nci.org.au
Invoking Striping
Lab – perform striping on Raijin:/short and measure IOPS, B/W
nci.org.au
End-to-End View
Application → High-level Library (HDF5, NetCDF, ADIOS) → I/O Middleware (MPI-IO) → Parallel FS (GPFS, PVFS, ...)
Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf
5/11/2017
29
nci.org.au
Abstraction Layers
• High-level libraries can simplify the programmer's task
• Express I/O in terms of the data structures of the code, not bytes and blocks
• I/O middleware can coordinate and improve performance
• Data sieving
• Two-phase I/O
Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf
nci.org.au
Data Sieving
• Combine many non-contiguous IO requests into fewer, bigger IO requests
• “Sieve” unwanted data out
• Reduces IOPS, makes use of high bandwidth for sequential IO
Jonathan Dursi
5/11/2017
30
nci.org.au
Two-Phase IO
• Collect requests into larger chunks
• Have individual nodes read big blocks
• Then use network communications to exchange pieces
• Fewer IOPS, faster IO
• Network communication is usually faster
(A hedged sketch of the MPI-IO hints that control these optimizations follows below.)
Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf
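For reference, data sieving and two-phase (collective-buffering) I/O can be turned on or off through MPI-IO hints. The helper sketch below uses ROMIO's hint names (romio_ds_*, romio_cb_*), which MPICH/OpenMPI-based libraries understand but other implementations may ignore, so treat the exact keys as an assumption to check on your system.

#include <mpi.h>

/* Open a file with data sieving and two-phase collective I/O explicitly enabled. */
void open_with_romio_hints(const char *path, MPI_File *fh) {
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_ds_read",  "enable");   /* data sieving for reads        */
    MPI_Info_set(info, "romio_ds_write", "enable");   /* data sieving for writes       */
    MPI_Info_set(info, "romio_cb_read",  "enable");   /* two-phase (collective) reads  */
    MPI_Info_set(info, "romio_cb_write", "enable");   /* two-phase (collective) writes */
    MPI_File_open(MPI_COMM_WORLD, (char *)path,
                  MPI_MODE_CREATE | MPI_MODE_RDWR, info, fh);
    MPI_Info_free(&info);
}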
nci.org.au
BEHIND THE SCENES
5/11/2017
31
nci.org.au
How it works – Concurrent Access
Files are treated like global shared memory regions. Locks are used to manage concurrent access:
– Files are broken up into lock units
– Clients obtain locks on units that they will access before I/O occurs
– This enables caching on clients as well (as long as a client holds a lock, it knows its cached data is valid)
– Locks are reclaimed from clients when others desire access
If an access touches any data in a lock unit, the lock for that region must be obtained before the access occurs.
http://press3.mcs.anl.gov/computingschool/files/2014/01/hpc-io-all-final.pdf
nci.org.au
Implications of Locking in Concurrent Access
http://press3.mcs.anl.gov/computingschool/files/2014/01/hpc-io-all-final.pdf
5/11/2017
32
nci.org.au
Reducing the Number of Operations – Data Sieving
http://press3.mcs.anl.gov/computingschool/files/2014/01/hpc-io-all-final.pdf
nci.org.au
Avoiding Lock Contention
http://press3.mcs.anl.gov/computingschool/files/2014/01/hpc-io-all-final.pdf
5/11/2017
33
nci.org.au
nci.org.au
@NCInews
Data Storage at NCI
nci.org.au
Data Storage Subsystems at the NCI
• The NCI compute and data environments allow researchers to work seamlessly with HPC and cloud-based compute cycles, while having unified data storage.
• How is this done?
5/11/2017
34
nci.org.au
NCI’s integrated high-performance
environment
[Figure: Raijin HPC compute, Raijin login + data-mover nodes, VMware and cloud, and NCI data movers are linked by a Raijin 56 Gb FDR IB fabric, a /g/data 56 Gb FDR IB fabric and 10 GigE, with links to the Huxley DC and the Internet. Storage: /g/data1 ~6.3 PB and /g/data2 ~3.1 PB (persistent global parallel filesystems), /short 7.6 PB and /home, /system, /images, /apps (Raijin high-speed filesystems), and Massdata tape (1.0 PB cache, 12.3 PB tape).]
nci.org.au
The NCI Compute Environment
5/11/2017
35
nci.org.au
The NCI Compute Environment - Raijin
• The NCI compute environment consists of Raijin, the 'Cloud' and data movers
• Raijin itself is composed of:
– Login nodes and data-mover nodes
– Compute nodes
• Users see the login and data-mover nodes
nci.org.au
The NCI Compute Environment – The Cloud
• The NCI hosts two OpenStack clouds – the NCI NeCTAR node and the NCI partner cloud called Tenjin
– The Tenjin cloud node is managed by the NCI VL team and has a mature ecosystem for hosting virtual laboratories and on-demand science desktops
– If you need access to the 10 PB+ of Lustre storage, it is only available via Tenjin
• Our VMware instance is reserved for internal NCI use, such as hosting our accounting infrastructure, wikis and LDAP
nci.org.au
The NCI Compute Environment – Data Movers
• Three types of data-mover nodes exist
– Shell access
– Aspera (in-production)
– GridFTP (Q3, 2017)
5/11/2017
37
nci.org.au
NCI's Data Infrastructure
[Figure: /g/data1 ~6.3 PB and /g/data2 ~3.1 PB – persistent global parallel filesystems; /short 7.6 PB and /home, /system, /images, /apps – Raijin high-speed filesystems; Massdata (tape) – 1.0 PB cache, 12.3 PB tape.]
nci.org.au
NCI’s Data Infrastructure – Raijin /short
• /short provides Raijin's fast Lustre-based scratch storage
• It is intended as short-term, fast disk for compute jobs
• Striping data across multiple OSTs gives good performance. Be mindful of whether your IO workload is IOPS-bound or streaming-bandwidth-bound when making choices about stripe width and stripe size
5/11/2017
38
nci.org.au
NCI’s Data Infrastructure - Massdata
• Massdata is an SGI DMF-based tape archive, or HSM
• The HSM is dual-site and keeps two copies on tape; no separate backups are kept
• There is a 1 PB disk cache in front of massdata to help stage data to tape or retrieve it from tape
• Data is aged in the cache and, when certain policies are met, migrated to two tape copies
• In mid-2017, NCI's Lustre systems will be interfaced to the HSM, i.e. you won't have to manually copy data into the massdata area
nci.org.au
NCI’s Data Infrastructure - gdata
• gdata provides persistent disk storage for jobs
running on Raijin and for VMs running on the cloud
• gdata is designed for bulk storage and streaming,
large writes. Think of it as fast tape!
• Users should stage data from gdata into Raijin’s
/short filesystem for compute jobs
5/11/2017
39
nci.org.au
PARALLEL I/O WITH MPI-IO
nci.org.au
Simple MPI-IO
Each MPI task reads/writes a single block of the file: P0, P1, P2, ..., P(n-1) each map a contiguous region of their own memory to a contiguous region of the shared FILE. (P# is a single processor with rank #.)
http://www.cac.cornell.edu/education/training/ParallelMay2012/ParallelIOMay2012.pdf
nci.org.au
MPI-IO Hello World (helloworldc.c)

#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int ierr, rank, size;
    MPI_Offset offset;
    MPI_File file;
    MPI_Status status;
    const int msgsize = 6;
    char message[msgsize+1];

    ierr  = MPI_Init(&argc, &argv);
    ierr |= MPI_Comm_size(MPI_COMM_WORLD, &size);
    ierr |= MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if ((rank % 2) == 0)
        strcpy(message, "Hello ");
    else
        strcpy(message, "World!");

    offset = (msgsize * rank);

    MPI_File_open(MPI_COMM_WORLD, "helloworld.txt",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &file);
    MPI_File_seek(file, offset, MPI_SEEK_SET);
    MPI_File_write(file, message, msgsize, MPI_CHAR, &status);
    MPI_File_close(&file);

    MPI_Finalize();
    return 0;
}
Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf
nci.org.au
MPI-IO Hello World – the program begins and ends with the usual MPI startup/teardown boilerplate (MPI_Init, MPI_Comm_size, MPI_Comm_rank ... MPI_Finalize); the I/O happens in between.
Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf
5/11/2017
44
nci.org.au
MPI-IO Hello World – run with 4 ranks (0 1 2 3):
mpiexec -n 4 ./helloworldc
Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf
nci.org.au
MPI-IO Hello World
Each rank's message: Hello_ World! Hello_ World!
if ((rank % 2) == 0) strcpy(message, "Hello ");
else strcpy(message, "World!");
Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf
5/11/2017
45
nci.org.au
MPI-IO Hello World
helloworld.txt:
MPI_File_open(MPI_COMM_WORLD, "helloworld.txt", MPI_MODE_CREATE|MPI_MODE_WRONLY, MPI_INFO_NULL, &file);
Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf
nci.org.au
int MPI_File_open(MPI_Comm communicator, char *filename, int mode, MPI_Info info, MPI_File *handle);
The communicator: opening a file is a collective operation over that communicator.
Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf
5/11/2017
46
nci.org.au
Info allows us to send extra hints to the MPI-IO layer about the file (performance tuning, special-case handling). MPI_INFO_NULL: no extra info.
Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf
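As a concrete (hedged) example of such hints: instead of MPI_INFO_NULL, the MPI_File_open call in helloworldc.c could pass an info object carrying striping hints. striping_factor and striping_unit are reserved MPI-IO hints for parallel-filesystem striping; they only take effect when the file is first created, and an implementation is free to ignore them. The values below are illustrative, and the fragment reuses the declarations from helloworldc.c.

MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "striping_factor", "4");        /* stripe the new file over 4 OSTs */
MPI_Info_set(info, "striping_unit",   "1048576");  /* 1 MB stripe size                */
MPI_File_open(MPI_COMM_WORLD, "helloworld.txt",
              MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &file);
MPI_Info_free(&info);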
nci.org.au
MPI-IO Hello World
helloworld.txt:
offset = (msgsize*rank);
MPI_File_seek(file, offset, MPI_SEEK_SET);
Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf
5/11/2017
47
nci.org.au
int MPI_File_seek(MPI_File mpi_fh, MPI_Offset offset, int mode);
MPI_SEEK_SET: set the file pointer to position offset
MPI_SEEK_CUR: pointer ← current position + offset
MPI_SEEK_END: pointer ← end of file - offset
Not collective; each rank adjusts its own local file pointer.
Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf
nci.org.au
MPI-IO Hello World
helloworld.txt:
MPI_File_write(file, message, msgsize, MPI_CHAR, &status);
File contents after the writes: Hello_ World! Hello_ World!
Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf
5/11/2017
48
nci.org.au
MPI-IO
Hello
World
Hello_ World! Hello_ World!
helloworld.tx t:
MPI_File_write(file, message, msgsize, MPI_CHAR, &status);
Hello_ Hello_World! World!
int MPI_File_write(MPI_File fh, void *buf, int count, MPI_Datatype datatype, MPI_Status *status)
• MPI_File_write is very much like an MPI_Send.
• “Sending” count items of datatype from buf “to” the file.
• Here, writing 6 MPI_CHARs.
• Contiguous in memory, starting at the buffer.
• (A matching read-back sketch follows below.)
Jonathan Dursi
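The read-back counterpart (analogous to MPI_Recv) is sketched below; it reuses the declarations from helloworldc.c and would replace the write block, so that each rank reads its own msgsize characters back from the shared file.

MPI_File_open(MPI_COMM_WORLD, "helloworld.txt",
              MPI_MODE_RDONLY, MPI_INFO_NULL, &file);
MPI_File_seek(file, (MPI_Offset)(msgsize * rank), MPI_SEEK_SET);
MPI_File_read(file, message, msgsize, MPI_CHAR, &status);   /* like an MPI_Recv "from" the file */
message[msgsize] = '\0';   /* message[] was declared with msgsize+1 chars */
MPI_File_close(&file);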
nci.org.au
MPI-IO Hello World
helloworld.txt:
MPI_File_close(&file);
Final file contents: Hello_ World! Hello_ World!
Jonathan Dursi https://support.scinet.utoronto.ca/wiki/images/3/3f/ParIO-HPCS2012.pdf
5/11/2017
49
nci.org.au
HDF5 – Beyond MPI-IO
www.hdfgroup.org
The HDF Group
Parallel HDF5
March 4, 2015 HPC Oil & Gas Workshop
Quincey Koziol, Director of Core Software & HPC
The HDF Group – [email protected]
http://bit.ly/QuinceyKoziol98
5/11/2017
50
www.hdfgroup.org
QUICK INTRO TO HDF5
March 4, 2015 HPC Oil & Gas Workshop 99
www.hdfgroup.org
What is HDF5?
March 4, 2015 HPC Oil & Gas Workshop 100
• HDF5 == Hierarchical Data Format, v5
• A flexible data model – structures for data organization and specification
• Open source software – works with data in the format
• An open file format
• Designed for high-volume or complex data
http://bit.ly/ParallelHDF5-HPCOGW-2015
5/11/2017
51
www.hdfgroup.org
What is HDF5, in detail?
• A versatile data model that can represent very complex data objects and a wide variety of metadata.
• An open source software library that runs on a wide range of computational platforms, from cell phones to massively parallel systems, and implements a high-level API with C, C++, Fortran, and Java interfaces.
• A rich set of integrated performance features that allow for access time and storage space optimizations.
• Tools and applications for managing, manipulating, viewing, and analyzing the data in the collection.
• A completely portable file format with no limit on the number or size of data objects stored.
March 4, 2015 101HPC Oil & Gas Workshop
http://bit.ly/ParallelHDF5-HPCOGW-2015
www.hdfgroup.org
HDF5 is like …
March 4, 2015 HPC Oil & Gas Workshop 102
5/11/2017
52
www.hdfgroup.org
Why use HDF5?
• Challenging data:
– Application data that pushes the limits of what can be addressed by traditional database systems, XML documents, or in-house data formats.
• Software solutions:
– For very large datasets, very fast access requirements, or very complex datasets.
– To easily share data across a wide variety of computational platforms using applications written in different programming languages.
– That take advantage of the many open-source and commercial tools that understand HDF5.
• Enabling long-term preservation of data.
March 4, 2015 103HPC Oil & Gas Workshop
www.hdfgroup.org
Who uses HDF5?
• Examples of HDF5 user communities:
– Astrophysics, astronomers
– NASA Earth Science Enterprise
– Dept. of Energy labs
– Supercomputing centers in the US, Europe and Asia
– Synchrotrons and light sources in the US and Europe
– Financial institutions
– NOAA
– Engineering & manufacturing industries
– Many others
• For a more detailed list, visit• http://www.hdfgroup.org/HDF5/users5.html
March 4, 2015 104HPC Oil & Gas Workshop
5/11/2017
53
www.hdfgroup.orgMarch 4, 2015 105
HDF5 Data Model
• File – container for objects
• Groups – provide structure among objects
• Datasets – where the primary data goes
– Data arrays
– Rich set of datatype options
– Flexible, efficient storage and I/O
• Attributes, for metadata
Everything else is built essentially from these parts.
HPC Oil & Gas Workshop
www.hdfgroup.orgMarch 4, 2015 106
Structures to organize objects
[Figure: an example HDF5 file – the root group “/” contains a group “/TestData”; the groups hold datasets such as raster images, a 2-D array, a 3-D array, a palette, and a table (lat | lon | temp rows such as 12 | 23 | 3.1).]
HPC Oil & Gas Workshop
5/11/2017
54
www.hdfgroup.org
HDF5 Dataset
March 4, 2015 107
• HDF5 datasets organize and contain data elements.
– The HDF5 datatype describes individual data elements.
– The HDF5 dataspace describes the logical layout of the data elements.
[Figure: a dataset is a multi-dimensional array of identically typed data elements – e.g. datatype: 32-bit little-endian integer; dataspace: rank 3, dimensions Dim[0]=4, Dim[1]=5, Dim[2]=7.]
HPC Oil & Gas Workshop
www.hdfgroup.org
HDF5 Attributes
• Typically contain user metadata
• Have a name and a value
• Attributes “decorate” HDF5 objects
• Value is described by a datatype and a dataspace
• Analogous to a dataset, but do not support partial I/O operations; nor can they be compressed or extended
March 4, 2015 108HPC Oil & Gas Workshop
5/11/2017
55
www.hdfgroup.org
HDF5 Software Distribution
HDF5 home page: http://hdfgroup.org/HDF5/
• Latest release: HDF5 1.8.14 (1.8.15 coming in May 2015)
HDF5 source code:
• Written in C, and includes optional C++, Fortran 90 APIs, and High-Level APIs
• Contains command-line utilities (h5dump, h5repack, h5diff, ...) and compile scripts
HDF5 pre-built binaries:
• When possible, include the C, C++, F90, and High-Level libraries. Check the ./lib/libhdf5.settings file.
• Built with and require the SZIP and ZLIB external libraries
March 4, 2015 HPC Oil & Gas Workshop 109
www.hdfgroup.org
The General HDF5 API
• C, Fortran, Java, C++, and .NET bindings
• IDL, MATLAB, Python (h5py, PyTables)
• C routines begin with the prefix H5?, where ? is a character corresponding to the type of object the function acts on.
Example interfaces:
H5D : Dataset interface, e.g. H5Dread
H5F : File interface, e.g. H5Fopen
H5S : dataSpace interface, e.g. H5Sclose
5/11/2017
56
www.hdfgroup.org
The HDF5 API
• For flexibility, the API is extensive: 300+ functions
• This can be daunting... but there is hope:
– A few functions can do a lot
– Start simple
– Build up knowledge as more features are needed
[Image: Victorinox Swiss Army Cybertool 34]
March 4, 2015 HPC Oil & Gas Workshop 111
www.hdfgroup.org
General Programming Paradigm
• An object is opened or created
• The object is accessed, possibly many times
• The object is closed
• Properties of the object are optionally defined:
– Creation properties (e.g., use chunked storage)
– Access properties
March 4, 2015 HPC Oil & Gas Workshop 112
5/11/2017
57
www.hdfgroup.org
Basic Functions
H5Fcreate (H5Fopen) create (open) File
H5Screate_simple/H5Screate create Dataspace
H5Dcreate (H5Dopen) create (open) Dataset
H5Dread, H5Dwrite access Dataset
H5Dclose close Dataset
H5Sclose close Dataspace
H5Fclose close File
March 4, 2015 HPC Oil & Gas Workshop 113
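Putting these basic functions together, a minimal serial example (my sketch, not from the workshop slides; names and sizes are illustrative) creates a file, a 2-D dataspace and an integer dataset, writes the data and closes everything. It should build with the h5cc compile script mentioned on the next slide.

#include "hdf5.h"

int main(void) {
    hsize_t dims[2] = {4, 6};
    int data[4][6];
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 6; j++)
            data[i][j] = i * 6 + j;

    /* Create the file, the dataspace describing a 4 x 6 array, and the dataset. */
    hid_t file  = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dset  = H5Dcreate(file, "/TestData", H5T_NATIVE_INT, space,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Write the whole array, then close everything in reverse order. */
    H5Dwrite(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);
    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}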
www.hdfgroup.org
Useful Tools For New Users
March 4, 2015 HPC Oil & Gas Workshop 114
h5dump: tool to “dump” or display the contents of HDF5 files
h5cc, h5c++, h5fc: scripts to compile applications
HDFView: Java browser to view HDF5 files – http://www.hdfgroup.org/hdf-java-html/hdfview/
HDF5 Examples (C, Fortran, Java, Python, Matlab): http://www.hdfgroup.org/ftp/HDF5/examples/
5/11/2017
58
www.hdfgroup.org
OVERVIEW OF PARALLEL HDF5 DESIGN
March 4, 2015 HPC Oil & Gas Workshop 115
www.hdfgroup.org
Parallel HDF5 Requirements
• Parallel HDF5 should allow multiple processes to perform I/O to an HDF5 file at the same time
– Single file image for all processes
– Compare with a one-file-per-process design:
• Expensive post-processing
• Not usable by a different number of processes
• Too many files produced for the file system
• Parallel HDF5 should use a standard parallel I/O interface
• Must be portable to different platforms
March 4, 2015 HPC Oil & Gas Workshop 116
5/11/2017
59
www.hdfgroup.org
Design requirements, continued
• Support Message Passing Interface (MPI) programming
• Parallel HDF5 files are compatible with serial HDF5 files
– Shareable between different serial or parallel platforms
March 4, 2015 HPC Oil & Gas Workshop 117
www.hdfgroup.org
Design Dependencies
• MPI with MPI-IO
– MPICH, OpenMPI
– Vendor's MPI-IO
• Parallel file system
– IBM GPFS
– Lustre
– PVFS
March 4, 2015 HPC Oil & Gas Workshop 118
5/11/2017
60
www.hdfgroup.org
PHDF5 implementation layers
HDF5 Application (on each compute node)
→ HDF5 Library
→ MPI Library
→ switch network + I/O servers
→ HDF5 file on the parallel file system
→ disk architecture and layout of data on disk
March 4, 2015 HPC Oil & Gas Workshop 119
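A hedged sketch of those layers in action (assumes an HDF5 library built with parallel/MPI-IO support; file and dataset names are illustrative): all ranks collectively create one shared file and dataset, then each rank writes its own row with a collective transfer.

#include "hdf5.h"
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Open the file through the MPI-IO file driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("parallel.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Whole dataset: nprocs rows x 8 columns (creation is collective). */
    hsize_t dims[2] = {(hsize_t)nprocs, 8};
    hid_t filespace = H5Screate_simple(2, dims, NULL);
    hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_INT, filespace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Each rank selects its own row in the file and prepares matching memory. */
    hsize_t start[2] = {(hsize_t)rank, 0}, count[2] = {1, 8};
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t memspace = H5Screate_simple(2, count, NULL);

    int row[8];
    for (int j = 0; j < 8; j++) row[j] = rank * 8 + j;

    /* Collective transfer: every rank participates in the same H5Dwrite call. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_INT, memspace, filespace, dxpl, row);

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Pclose(fapl); H5Fclose(file);
    MPI_Finalize();
    return 0;
}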
www.hdfgroup.org
MPI-IO VS. HDF5
March 4, 2015 HPC Oil & Gas Workshop 120
5/11/2017
61
www.hdfgroup.org
MPI-IO
• MPI-IO is an input/output API
• It treats the data file as a “linear byte stream”, and each MPI application needs to provide its own file and data representations to interpret those bytes
www.hdfgroup.org
MPI-IO
• All data stored is machine-dependent except the “external32” representation
• External32 is defined in big-endianness
– Little-endian machines have to do the data conversion in both read and write operations
– 64-bit sized data types may lose information
March 4, 2015 HPC Oil & Gas Workshop 122
5/11/2017
62
www.hdfgroup.org
MPI-IO vs. HDF5
• HDF5 is data management software
– It stores data and metadata according to the HDF5 data format definition
– An HDF5 file is self-describing
• Each machine can store the data in its own native representation for efficient I/O without loss of data precision
• Any necessary data representation conversion is done by the HDF5 library automatically
March 4, 2015 HPC Oil & Gas Workshop 123
nci.org.au
LABS
5/11/2017
63
nci.org.au
Lab 1 - Measuring IO Performance
• Specifically: Measure IO performance (IOPS,
streaming b/w and latency);
• Objective 1: Learning to use the FIO tool to measure
IOPS and streaming bandwidth;
• Objective 2: Learning to use the ioping tool to
measure IO latency;
• Objective 3: Data collection and analysis;
• Objective 4: Learning how to interact with remote
code repositories like github;
nci.org.au
Lab 1 – Background – Tools
• What is FIO?
fio is a tool that will spawn a number of
threads or processes doing a particular type
of io action as specified by the user
• What is ioping?
A tool to monitor I/O latency in real time. It
shows disk latency in the same way as ping
shows network latency
5/11/2017
64
nci.org.au
Lab 1 – ioping Tasks – 1/3
a) Log into Raijin and in your /short/c04 area create a
directory called Lab1, then check out the following Git
repos:
a.1) FIO: https://github.com/axboe/fio.git
a.2) ioping: https://github.com/koct9i/ioping.git
b) Build both repos for FIO and ioping in your
raijin:/short/c04 area
nci.org.au
Lab 1 – ioping Tasks – 2/3
c) Raijin has three filesystems accessible to end-users: /home, /jobfs and /short
c.1) Measuring latency for single threaded sequential workloads
Using ioping, measure the IO latency for /short. Construct PBS jobs or
use the express queue to use 1 CPU and run the ioping executable and record latency (in milliseconds),
IOPS and B/W for block sizes of 4KB, 128KB and 1MB and working sets of 10MB, 100MB, 1024MB.
An example run:
% /short/z00/jxa900/ioping -c 20 -s 8KB -C -S 1024MB /home/900/jxa900 # run ioping for 20 requests, 8KB block size and for working set of 1GB on my home directory
...
8 KiB from /home/900/jxa900 (lustre 10.9.103.3@o2ib3:10.9.103.4@o2ib3:/homsys): request=20 time=29 us
--- /home/900/jxa900 (lustre 10.9.103.3@o2ib3:10.9.103.4@o2ib3:/homsys) ioping statistics ---
20 requests completed in 2.79 ms, 160 KiB read, 7.16 k iops, 55.9 MiB/s
min/avg/max/mdev = 28 us / 139 us / 289 us / 103 us
From the above experiment, the corresponding data values are:
Title: Working Set (MB), Block Size (KB), Time Taken (ms), Data Read (KB), IOPS, B/W (MB/sec)
Row: 1024, 8, 2.79, 160, 7160, 55.9
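A quick consistency check on these numbers (my arithmetic, not part of the lab output): bandwidth ≈ IOPS x block size, e.g. 7.16 k requests/s x 8 KiB ≈ 55.9 MiB/s, which matches the last two columns. The same relation should hold for your own runs.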
5/11/2017
65
nci.org.au
Lab 1 – ioping Tasks – 3/3
From the previous experiment, note down in a table :
Can you explain the observed trends for IOPS and B/W for the three working set sizes, on changing the request block size?
Extra credit: Run this lab on a Raijin compute node's /jobfs and compare
Working Set (MB)   Block Size (KB)   Time Taken (ms)   Data Read (KB)   IOPS   B/W (MB/sec)
10                 4
10                 128
10                 1024
100                4
100                128
100                1024
1024               4
1024               128
1024               1024
nci.org.au
Lab 1 – FIO Tasks – 1/2
Using FIO, measure read and write IOPS for two, four and eight threads of IO running on /short, for a block size of 1MB with a file size of 1024MB. These can be set when calling FIO using the --bs=1M and --size=1024MB flags. For a sequential write workload add --readwrite=write. (An example command line is sketched after the table below.)
Fill in the table below.
SEQUENTIAL WRITE IO PERFORMANCE:
No. of Threads   Working Set (MB)   Block Size (KB)   Time Taken (ms)   Data Written (MB)   IOPS   B/W (MB/sec)
2                1024               1024
4                1024               1024
8                1024               1024
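For reference, a sequential-write invocation along these lines (assembled from the flags shown in this lab and the sample run on the next slide; the output path is illustrative) should produce the rows above, varying --numjobs between 2, 4 and 8:
% /short/z00/jxa900/fio-src/fio --name=seqwrite --ioengine=libaio --readwrite=write --bs=1M --size=1024MB --thread --numjobs=2 --filename=/short/<project>/<user>/test.out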
5/11/2017
66
nci.org.au
Lab 1 – FIO Tasks – 2/2
• Sample run for random IO:# Random Workload – set using --readwrite=randrw, ratio of reads and writes set using --rwmixread=50 (i.e. 50% read, 50% writes)
% /short/z00/jxa900/fio-src/fio --randrepeat=1 --ioengine=libaio --gtod_reduce=1 --bs=4k --iodepth=8 --size=1024MB --readwrite=randrw --rwmixread=50 --thread --numjobs=2 --name=test --filename=/short/jxa900/test.out
test: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
...
fio-2.2.6-15-g3765
Starting 2 threads
Jobs: 2 (f=2): [m(2)] [100.0% done] [47712KB/0KB/0KB /s] [11.1K/0/0 iops] [eta 00m:00s]
test: (groupid=0, jobs=2): err= 0: pid=7014: Sat Apr 25 18:46:09 2015
mixed: io=2048.0MB, bw=43254KB/s, iops=10813, runt= 48485msec
cpu : usr=1.42%, sys=20.51%, ctx=243452, majf=0, minf=5
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued : total=r=524288/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
MIXED: io=2048.0MB, aggrb=43253KB/s, minb=43253KB/s, maxb=43253KB/s, mint=48485msec, maxt=48485msec
In the previous table note the data:
Title: Num Threads, Working Set (MB), Block Size (KB), Time Taken (ms), Data Read/Written (MB), IOPS, B/W (MB/sec)
Row: 2, 1024, 4, 48485, 2048, 10813, 42.2 (=43253/1024)
nci.org.au
Lab 2 – Lustre Striping Parameters on Raijin
Lustre allows you to modify three striping parameters for a shared file:
- the stripe count controls how many OSTs the file is striped over (for example, the stripe count is 4 for the
file shown above);
- the stripe size controls the number of bytes in each stripe; and
- the start index chooses where in the list of OSTs to start the round-robin assignment
(the default value -1 allows the system to choose the offset in order to load balance the file system).
The default parameters on Raijin:/short are [count=2, size=1MB, index=-1], but these can be changed and
viewed on a per-file or per-directory basis using the commands:
% lfs setstripe [file,dir] -c [count] -s [size] -i [index]
% lfs getstripe [file,dir]
A file automatically inherits the striping parameters of the directory it is created in, so changing the
parameters of a directory is a convenient way to set the parameters for a collection of files you are about to
create. For instance, if your application creates output files in a subdirectory called output/, you can set the
striping parameters on that directory once before your application runs, and all of your output files will
inherit those parameters.
5/11/2017
67
nci.org.au
Lab 2 – Tasks – 1/2
a) Create a directory under Raijin:/short and
run lfs getstripe <myDir>. Explain the output
of the command, use the man-pages if
required;
b) Re-run the lab exercise with FIO using the
following stripe sizes and counts for a 1GB file,
for a sequential write workload
Stripe Size: 1MB, 4MB
Stripe Count: -1, 2, 4
nci.org.au
Lab 2 – Tasks – 2/2
Stripe Size (MB)   Stripe Count   No. of Threads   Working Set (MB)   Block Size (KB)   Time Taken (ms)   Data Written (MB)   IOPS   B/W (MB/sec)
1                  -1             4                1024               1024
1                  2              4                1024               1024
1                  4              4                1024               1024
4                  -1             4                1024               1024
4                  2              4                1024               1024
4                  4              4                1024               1024
Bonus Credit: Using your ANU UniID and password, log onto the NeCTAR cloud and bring up a single-core VM using a CentOS or Ubuntu image on any NeCTAR node. Perform the same tests as in Lab 1.
URL for creating VMs: https://dashboard.rc.nectar.org.au/
Setting up SSH keys: http://darlinglab.org/tutorials/instance_startup/
5/11/2017
68
nci.org.au
Lab 3 – MPI-IO – Tasks – 1/2
• Copy Lab3 into your /short area from /short/c07/src-io/Lab3
• Make and run the MPI-IO example helloworld
• Fix sine.c to output sine.dat to plot using gnuplot – see the next slide. Remember to use ssh -XY [email protected] to get X11 forwarding working. There is a subtle bug that needs to be fixed AND you need to add the MPI-IO code to write out the output from computing sin(x)
nci.org.au
Lab 3 – MPI-IO – Tasks 2/2