TRANSCRIPT
I/O Software and Data
Alok Choudhary
Northwestern University
Large Scale Computing and Storage Requirements for Advanced Scientific Computing Research, ASCR / NERSC Workshop, January 5-6, 2011
I/O software stack in HPC

[Stack diagram, top to bottom: Applications; High-level parallel I/O libraries / Data-model based I/O libraries; MPI-IO; POSIX or file system API; Client-side file system; Server-side file system; Storage system]

• MPI-IO
  – MPI standard, uniform API to all file systems
  – Lower-level parallel I/O optimizations: coordinate processes to rearrange I/O requests; collective I/O, data sieving, caching, lock alignment (a minimal collective-write sketch follows this list)
• High-level parallel I/O libraries
  – Portable file format, self-describing, metadata-rich
  – Examples: PnetCDF, netCDF4, HDF5
• Data-model based I/O libraries
  – Domain-specific; encapsulate multiple lower-level I/O interfaces (e.g. HDF5, netCDF)
  – Examples: PIO (Community Climate System Model, CCSM); GIO (Global Cloud Resolving Model, GCRM)
• Parallel file systems
  – Lustre, PVFS2, GPFS, PanFS
  – File striping, caching, consistency controls, scalable metadata operations, data reliability, fault tolerance
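The collective I/O path through this stack can be illustrated with a minimal sketch: each rank writes one contiguous block of a shared file through a single collective call, which gives the MPI-IO layer the opportunity to rearrange and merge the requests (two-phase collective I/O, data sieving). The file name and block size below are illustrative, not from the talk.

/* Minimal sketch of a collective MPI-IO write: each rank writes a
 * contiguous block of a shared file.  File name and block size are
 * illustrative. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    int rank, nprocs;
    const int count = 1 << 20;               /* 1M doubles per process */
    double *buf;
    MPI_Offset offset;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    buf = malloc(count * sizeof(double));
    for (int i = 0; i < count; i++) buf[i] = rank;

    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank owns one contiguous file region; the collective call lets
     * MPI-IO coordinate the processes and rearrange the I/O requests. */
    offset = (MPI_Offset)rank * count * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, count, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}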
I/O resources at NERSC

• NERSC Global Filesystem (NGF), an IBM GPFS
  – HPC: Franklin (Cray XT4), Hopper (Cray XT5), Hopper II (Cray XE6), Carver (IBM)
  – Analytics clusters: PDSF (Linux), Euclid (Sun)
  – Others: Tesla/Tuning, Dirac (GPU cluster), Magellan (Cloud)
• Parallel file systems
  – Lustre, GPFS
• Archival storage (HPSS)
• Consulting team
  – Very helpful for answering I/O-related questions, including hardware configuration, software availability, run-time environment setup, system performance numbers, communication with Cray, etc.
I/O software at NERSC

• High-level I/O libraries
  – HDF5, netCDF4, Parallel netCDF
• I/O tracing libraries
  – IPMIO (profiling POSIX calls)
• Parallel I/O middleware
  – MPI-IO (Cray's implementation)
  – ROMIO with NWU's optimization (file domain alignment)
• Lustre parallel file systems
  – User-customizable striping configuration, an important feature for I/O developers (a striping-hint sketch follows this list)
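As a hedged illustration of the user-customizable striping mentioned above, the sketch below requests a Lustre stripe count and stripe size through MPI-IO hints when creating a file. The hint names "striping_factor" and "striping_unit" are the ones understood by ROMIO and Cray MPI-IO; the values shown are illustrative. The same layout can also be set on a directory beforehand with lfs setstripe.

/* Sketch: requesting a Lustre striping layout through MPI-IO hints when a
 * new file is created.  Stripe count and stripe size values are
 * illustrative. */
#include <mpi.h>

void open_striped(MPI_Comm comm, const char *path, MPI_File *fh)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "16");      /* stripe over 16 OSTs */
    MPI_Info_set(info, "striping_unit", "1048576");   /* 1 MiB stripe size   */

    MPI_File_open(comm, (char *)path,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, fh);
    MPI_Info_free(&info);
}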
Case studies

• MPI-IO file domain alignments
  – Reorganize I/O requests to match the Lustre locking protocol
  – Significantly improves performance on Franklin
• I/O delegation
  – An additional set of compute nodes to enable caching, prefetching, and aggregation
  – I/O requests are forwarded from application processes to delegates
  – Boosts independent I/O to be competitive with collective I/O
• Parallel netCDF non-blocking I/O
  – Aggregates many small requests for better bandwidth
• Data-model based I/O library
  – Next generation high-level I/O library
  – Supports data models (seven dwarfs) with new file formats and data layouts
MPI-IO file domain alignments

• Lustre
  – The file system uses locks to keep data consistent
• File domain alignment in collective I/O
  – Minimize the number of clients accessing each file server (a simplified sketch follows the reference below)
• Two implementations
  – NWU (single-stage), published in SC08*
  – Cray MPI-IO (multi-stage), available since June 2009
[Figure: file domain partitioning in collective I/O over processes P0-P5, file stripes 0-2, and I/O servers S, showing the aggregate access region of the file. An unaligned, evenly divided partitioning sends requests from multiple processes to each stripe and server; the stripe-size-aligned partitioning assigns each stripe to a single aggregator.]
*W. Liao and A. Choudhary. Dynamically Adapting File Domain Partitioning Methods for Collective I/O Based on Underlying Parallel File System Locking Protocols, SC 2008
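A simplified sketch of the alignment idea follows. It is an illustration of the approach described in the paper above, not the actual ROMIO or Cray implementation, and all names are illustrative: the aggregate access region is first divided evenly among the aggregators, and each boundary is then snapped down to a Lustre stripe boundary, so each stripe (and its lock) is touched by at most one aggregator.

/* Sketch of stripe-aligned file domain partitioning for collective I/O.
 * Each aggregator gets an even share of the aggregate access region, then
 * rounds its boundaries down to the nearest stripe boundary. */
#include <mpi.h>

void align_file_domain(MPI_Offset agg_start, MPI_Offset agg_end,
                       int naggs, int my_agg, MPI_Offset stripe_size,
                       MPI_Offset *fd_start, MPI_Offset *fd_end)
{
    /* Evenly divide the aggregate access region among aggregators ... */
    MPI_Offset len   = (agg_end - agg_start + naggs - 1) / naggs;
    MPI_Offset start = agg_start + (MPI_Offset)my_agg * len;
    MPI_Offset end   = start + len;

    /* ... then snap both boundaries down to a stripe boundary, so each
     * stripe belongs to exactly one aggregator.  (Some aggregators may end
     * up with an empty domain; that is acceptable for this sketch.) */
    *fd_start = (start / stripe_size) * stripe_size;
    *fd_end   = (end   / stripe_size) * stripe_size;

    /* The first and last aggregators keep the true region boundaries. */
    if (my_agg == 0)         *fd_start = agg_start;
    if (my_agg == naggs - 1) *fd_end   = agg_end;
}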
Improvement from file domain alignment

• Franklin @ NERSC
  – Compare Cray's and NWU's MPI-IO implementations
  – Measured peak for write is 16 GB/sec
• S3D
  – Combustion application from Sandia National Laboratories
  – Four global arrays: two 3D and two 4D
  – Each process subarray size 50x50x50
• Flash
  – Astrophysics application from U. of Chicago
  – I/O method: HDF5
  – Each process writes 80~82 arrays of size 32x32x32
!"
#"
$"
%"
&"
'!"
'#"
(#" %$" '#&" #)%" )'#" '!#$" #!$&" $!*%" &'*#"
!"#$%&'(
)*+#*$,&#)&-'./%0&
1234%"&56&05"%/&
789&
+,-."/01213" 456"/01213"
!"
#"
$"
%"
&"
'!"
'#"
'$"
(#" %$" '#&" #)%" )'#" '!#$" #!$&" $!*%" &'*#"
!"#$%&'(
)*+#*$,&#)&-./0%1&
234'%"&56&15"%0&
78(0,&
+,-."/01213" 456"/01213"
I/O delegation

• Runs inside of MPI-IO
• Runs on a small set of additional MPI processes (a process-split sketch follows the figure below)
• All I/O delegates collaborate for better performance
• Related work
  – I/O forwarding developed by Rob Ross's team at ANL
[Figure: MPI application processes (A) on compute nodes forward their I/O requests to a small set of I/O delegate nodes (D), which access the I/O servers (S)]
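As a rough sketch of how a run might be partitioned into application ranks and delegate ranks: the 1-in-8 ratio matches the configuration reported on the next slide, but the forwarding protocol itself is not shown and the splitting scheme is an illustrative assumption, not the paper's implementation.

/* Sketch: carve the MPI job into application ranks and I/O delegate ranks
 * with MPI_Comm_split.  Every 8th rank becomes a delegate. */
#include <mpi.h>

void split_delegates(MPI_Comm world, int *is_delegate, MPI_Comm *sub_comm)
{
    int rank;
    MPI_Comm_rank(world, &rank);

    *is_delegate = (rank % 8 == 0);   /* 1/8 of processes act as delegates */

    /* color 0 = application processes, color 1 = I/O delegates;
     * ranks keep their relative order within each sub-communicator. */
    MPI_Comm_split(world, *is_delegate, rank, sub_comm);
}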
Improvement from I/O delegation

• Franklin
  – Compared with Cray's collective I/O
• I/O delegation
  – 1/8 of application processes as I/O delegates
  – Enhances independent I/O performance to be similar to collective I/O
  – Using independent I/O eases I/O programming
  – These results were presented in a paper accepted by IEEE TPDS*

[Charts: S3D and Flash results]
*A. Nisar, W. Liao, and A. Choudhary. Scaling Parallel I/O Performance through I/O Delegate and Caching System, SC 2008
PnetCDF

• A parallel I/O library for accessing files in CDF format
  – Version 1.0 released in 2005, now in version 1.2
• Optimizations
  – Built on top of MPI-IO
  – File header and dataset alignment
  – Non-blocking I/O enables aggregation of multiple requests (a minimal sketch follows this list)
• Developers
  – NU: Jianwei Li (graduated in 2006), Kui Gao (postdoc), Wei-keng Liao, Alok Choudhary
  – ANL: Rob Ross, Rob Latham, Rajeev Thakur, Bill Gropp
• Recent application collaborators
  – GIO (PNNL), FLASH (U. Chicago), CCSM (NCAR)
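A minimal sketch of the non-blocking interface (variable names and dimensionality are illustrative): several small writes are posted with ncmpi_iput_vara_double and then flushed together with one ncmpi_wait_all, which lets PnetCDF combine them into larger MPI-IO requests.

/* Sketch of PnetCDF non-blocking I/O: post several small writes, then
 * flush them all in one collective call so the library can aggregate
 * them into larger requests. */
#include <pnetcdf.h>

void write_blocks(int ncid, int varid, int nblocks,
                  MPI_Offset starts[][2], MPI_Offset counts[][2],
                  double *bufs[])
{
    int reqs[nblocks], stats[nblocks];

    /* Post each small, noncontiguous write; nothing hits the file yet. */
    for (int i = 0; i < nblocks; i++)
        ncmpi_iput_vara_double(ncid, varid, starts[i], counts[i],
                               bufs[i], &reqs[i]);

    /* One collective flush carries out all pending requests. */
    ncmpi_wait_all(ncid, nblocks, reqs, stats);
}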
GCRM I/O

• GIO (Geodesic Parallel IO API), developed at PNNL
  – An I/O library developed by Karen Schuchardt at PNNL
  – Used by the Global Cloud Resolving Model (GCRM), developed at Colorado State University to simulate the global climate
  – Provides a user-configurable I/O method: PnetCDF, HDF5, or netCDF4

Pictures courtesy of Karen Schuchardt
GCRM I/O performance

• I/O pattern
  – Each process writes many noncontiguous data blocks for each variable
• GIO strategies
  – Direct and interleaved messaging methods to exchange I/O requests in order to get better performance
• PnetCDF non-blocking I/O
  – Delays requests so that small requests can be aggregated into large ones
  – Simplifies the programming task
  – Results were presented at the Workshop on High-Resolution Climate Modeling 2010*
[Chart: GCRM run on Franklin @ NERSC; write bandwidth (GB/sec) vs. number of application processes (640 to 2560), blocking vs. non-blocking I/O, for several I/O process counts]
Resolution level   Number of cells   Grid-point spacing
9                  2.6 million       15.6 km
10                 10.5 million      7.8 km
11                 41.9 million      3.9 km
*B. Palmer, K. Schuchardt, A. Koontz, R. Jacob, R. Latham, and W. Liao. IO for High Resolution Climate Models. Workshop on High-Resolution Climate Modeling, 2010
[Figure: (a) combined AMR grid, (b) mesh refinement with coarsest, intermediate, and finest levels, (c) hierarchical grid in box layout; a block-structured AMR grid with three levels of refinement (Level 0, 1, 2) distributed over four processes p0-p3]
Chombo I/O

• PDE tool from LBNL
  – Supports block-structured AMR grids
• I/O pattern
  – Array variables are partitioned among a subset of processes
  – Calls the MPI independent I/O API, as collective I/O is not feasible
• PnetCDF non-blocking I/O
  – Aggregates requests to multiple variables
  – One collective I/O call carries out the aggregated request
#"
$"
%"
&"
'!"
'#"
%$" '#&" #(%" ('#" '!#$" #!$&" $')%" &')#"
*+,-./0"+1+231456+7"
*+,-./0"231456+7"
8/0("231456+7"
9:;2,<"1=">??364>@1+"?<14,AA,A"
B<6-,"2>+CD6C-E"6+"FGHA,4"
Chombo run on Franklin @ NERSC
13
Data-model based I/O library (X-stack)
(PIs: Alok Choudhary, Wei-keng Liao, Northwestern; Rob Ross, Tim Tautges, ANL; Nagiza Samatova, NCSU; Quincy Koziol, HDF Group)

• DAMSEL: next generation of high-level I/O library
• Supports various data models
  – Describes unstructured data relationships: trees, graph-based, space-filling curves, etc.
  – Scalable I/O for irregular distributed data objects
  – More sophisticated data query API
• Virtual filing
  – A file container of multiple files appears as a single file
  – Balances concurrent access (to reduce contention) against the number of files created (to ease file manageability)

Space-filling curves are used in climate codes when partitioning the grid to improve scalability (a small Z-order sketch follows). Image courtesy of John Dennis (NCAR).
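To make the space-filling-curve point concrete, here is a small, hypothetical illustration (not part of DAMSEL or any of the climate codes mentioned): a 2-D Morton (Z-order) index obtained by interleaving coordinate bits, so that cells close together on the grid tend to stay close in the 1-D ordering used for partitioning.

/* Hypothetical illustration of a space-filling curve: the 2-D Morton
 * (Z-order) index of grid cell (i, j), computed by interleaving the bits
 * of the two coordinates. */
#include <stdint.h>

static uint64_t spread_bits(uint32_t x)
{
    uint64_t v = x;
    v = (v | (v << 16)) & 0x0000FFFF0000FFFFULL;
    v = (v | (v <<  8)) & 0x00FF00FF00FF00FFULL;
    v = (v | (v <<  4)) & 0x0F0F0F0F0F0F0F0FULL;
    v = (v | (v <<  2)) & 0x3333333333333333ULL;
    v = (v | (v <<  1)) & 0x5555555555555555ULL;
    return v;
}

uint64_t morton2d(uint32_t i, uint32_t j)
{
    return spread_bits(i) | (spread_bits(j) << 1);
}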
Computational motifs
Future

• Challenges
  – Reads in data analysis, as in most cases only subsets of the data are read
• Hardware accelerators
  – GPUs for on-line data compression and analysis
• Faster storage devices
  – SSDs as a read cache at compute nodes or I/O servers
Active storage

• Offload I/O-intensive computation to the file servers
• If servers are equipped with GPUs, the operations can run faster
  – TS: traditional storage, task runs on clients
  – AS: active storage, task runs on the server's CPU
  – AS + GPU: task runs on the server's GPU

[Chart: KMEANS results]
GPU to accelerate data analytics

• Our I/O delegation work demonstrates that set-aside processes improve I/O performance
• Delegate processes can also be used for off-loading data-intensive computation
• An HPC system with a subset of compute nodes equipped with GPUs and SSDs provides a rich experimental platform
Publications

• Alok Choudhary, Wei-keng Liao, Kui Gao, Arifa Nisar, Robert Ross, Rajeev Thakur, and Robert Latham. Scalable I/O and Analytics. In the Journal of Physics: Conference Series, Volume 180, Number 012048 (10 pp), August 2009.
• Kui Gao, Wei-keng Liao, Arifa Nisar, Alok Choudhary, Robert Ross, and Robert Latham. Using Subfiling to Improve Programming Flexibility and Performance of Parallel Shared-file I/O. In the Proceedings of the International Conference on Parallel Processing, Vienna, Austria, September 2009.
• Kui Gao, Wei-keng Liao, Alok Choudhary, Robert Ross, and Robert Latham. Combining I/O Operations for Multiple Array Variables in Parallel NetCDF. In the Proceedings of the Workshop on Interfaces and Architectures for Scientific Data Storage, held in conjunction with the IEEE Cluster Conference, New Orleans, Louisiana, September 2009.
• B. Palmer, K. Schuchardt, A. Koontz, R. Jacob, R. Latham, and W. Liao. IO for High Resolution Climate Models. Workshop on High-Resolution Climate Modeling, 2010.
• Arifa Nisar, Wei-keng Liao, and Alok Choudhary. Delegation-based I/O Software Architecture for High Performance Computing Systems. Accepted by IEEE TPDS, 2010.
I/O resources from NERSC

• NERSC Global Filesystem (NGF), an IBM GPFS
  – HPC: Franklin (Cray XT4), Hopper (Cray XT5), Hopper II (Cray XE6), Carver (IBM)
  – Analytics clusters: PDSF (Linux), Euclid (Sun)
  – Others: Tesla/Tuning, Dirac (GPU cluster), Magellan (Cloud)
• Parallel file systems
  – Lustre, GPFS
• Archival storage (HPSS)
• Consulting team
  – Very helpful for answering I/O-related questions, including hardware configuration, software availability, run-time environment setup, system performance numbers, communication with Cray, etc.