DoE I/O characterizations, infrastructures for wide-area
collaborative science, and future opportunities
Jeffrey S. Vetter, Micah Beck, Philip Roth
Future Technologies Group @ ORNL
University of Tennessee, Knoxville
NCCS Resources
Summary, September 2005: 7 systems; supercomputers with 7,622 CPUs, 16 TB of memory, and 45 TFlops; 238.5 TB of disk; 5 PB of archival storage.
– Cray XT3 "Jaguar": 5,294 processors at 2.4 GHz, 11 TB memory
– Cray X1E "Phoenix": 1,024 processors at 0.5 GHz, 2 TB memory
– SGI Altix "Ram": 256 processors at 1.5 GHz, 2 TB memory
– IBM SP4 "Cheetah": 864 processors at 1.3 GHz, 1.1 TB memory
– IBM Linux NSTG cluster: 56 processors at 3 GHz, 76 GB memory
– Visualization cluster: 128 processors at 2.2 GHz, 128 GB memory
– IBM HPSS archive (many storage devices supported): 5 PB
Disk: 120 TB shared disk, per-system disks of 32 TB, 32 TB, 36 TB, 9 TB, and 4.5 TB, plus 5 TB backup storage (238.5 TB total)
Networks: UltraScience (10 GigE), 1 GigE, control network, and network routers
Scientific Visualization Lab: 27-projector Power Wall
Test systems: 96-processor Cray XT3, 32-processor Cray X1E, 16-processor SGI Altix
Evaluation platforms: 144-processor Cray XD1 with FPGAs, SRC Mapstation, ClearSpeed, BlueGene (at ANL)
Implications for Storage, Data, Networking
Decommissioning affects important GPFS storage systems
– IBM SP3 decommissioned in June
– IBM p690 cluster disabled on Oct 4
Next-generation systems bring new storage architectures
– Lustre on the 25 TF XT3 with a lightweight kernel (limited set of OS services)
– ADIC StorNext on the 18 TF X1E
Visualization and analysis systems
– SGI Altix, visualization cluster, 35-megapixel Powerwall
Networking
– Connected to ESnet, Internet2, TeraGrid, UltraNet, etc.
– ORNL internal network being upgraded now
End-to-end solutions underway (see Klasky's talk)
Initial Lustre Performance on Cray XT3
Acceptance criterion for the XT3 is 5 GB/s of bandwidth
– A 32-OSS, 64-OST configuration hit 6.7 GB/s write, 6.2 GB/s read
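For reference, aggregate bandwidth numbers like these are usually obtained with a file-per-process streaming write test. The sketch below is a minimal MPI-IO version of such a test, not the actual XT3 acceptance benchmark; the file name, transfer size, and block count are illustrative assumptions.

```c
/* Minimal file-per-process write-bandwidth sketch (not the XT3 acceptance test).
 * Each rank streams BLOCKS blocks of BLOCK_BYTES to its own file; rank 0 reports
 * the aggregate bandwidth.  Build with an MPI C compiler (e.g. mpicc). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_BYTES (4 * 1024 * 1024)   /* 4 MB per write (illustrative) */
#define BLOCKS      256                  /* 1 GB per rank (illustrative)  */

int main(int argc, char **argv)
{
    int rank, nprocs;
    char fname[256];
    MPI_File fh;
    char *buf;
    double t0, t1, elapsed, max_elapsed;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    buf = malloc(BLOCK_BYTES);
    memset(buf, rank & 0xff, BLOCK_BYTES);
    snprintf(fname, sizeof(fname), "iotest.%05d", rank);   /* hypothetical name */

    MPI_File_open(MPI_COMM_SELF, fname,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < BLOCKS; i++)
        MPI_File_write(fh, buf, BLOCK_BYTES, MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);                 /* include the close in the timing */
    t1 = MPI_Wtime();

    elapsed = t1 - t0;
    MPI_Reduce(&elapsed, &max_elapsed, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        double gb = (double)nprocs * BLOCKS * BLOCK_BYTES / 1.0e9;
        printf("aggregate write: %.2f GB in %.2f s = %.2f GB/s\n",
               gb, max_elapsed, gb / max_elapsed);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```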
Preliminary I/O Survey (Spring 2005)
Mechanisms surveyed: Fortran I/O, MPI-IO, NetCDF.

GYRO 3.0.0
– Read input files: rank 0; < 1 MB; at initialization
– Write checkpoint files: rank 0; 87.5 MB; once per 1000 time-steps
– Write logging/debug files: rank 0; ~150 KB; file-dependent

POP (standalone) 1.4.3/2.0
– Read input files (Fortran I/O; NetCDF in 2.0 only): parallel, rank 0; at initialization
– Read forcing files: parallel, rank 0; every few time-steps
– Write 3-D field files: parallel, rank 0; 1.4 GB; several per simulation-month

CAM (standalone) 3.0
– Read input files: rank 0; ~300 MB; at initialization
– Write checkpoint files: rank 0; once per simulation-day
– Write output files: rank 0; ~110 MB; at termination

AORSA2D
– Read input files: all ranks; ~26 MB; at initialization
– Write output files: rank 0; ~10 MB; at termination

TSI
– Read input files: rank 0; at initialization
– Write timestep files (VH-1): all ranks, post-processed into a single NetCDF file; 28 GB per timestep; hundreds per run
Need I/O access patterns.
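Most of the codes above funnel their I/O through rank 0. The sketch below shows that pattern in its simplest form; the field size, file name, and gather-based collection are illustrative and are not taken from any of the surveyed codes.

```c
/* Illustrative "rank 0 writes the checkpoint" pattern, as used (in spirit) by
 * several of the surveyed codes.  LOCAL_N and the file name are hypothetical. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define LOCAL_N 100000   /* doubles owned by each rank (illustrative) */

void write_checkpoint(double *local, int step, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    /* Collect the distributed field onto rank 0 ... */
    double *global = NULL;
    if (rank == 0)
        global = malloc((size_t)nprocs * LOCAL_N * sizeof(double));
    MPI_Gather(local, LOCAL_N, MPI_DOUBLE,
               global, LOCAL_N, MPI_DOUBLE, 0, comm);

    /* ... and let rank 0 alone write the (architecture-specific) binary file. */
    if (rank == 0) {
        char fname[64];
        snprintf(fname, sizeof(fname), "checkpoint_%06d.dat", step);
        FILE *fp = fopen(fname, "wb");
        fwrite(global, sizeof(double), (size_t)nprocs * LOCAL_N, fp);
        fclose(fp);
        free(global);
    }
}
```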
Initial Observations
Most users have limited I/O capability because libraries and runtime systems are inconsistent across platforms
– Little use of [Parallel] NetCDF or HDF5
– Seldom direct use of MPI-IO
Widely varying file size distribution
– 1 MB, 10 MB, 100 MB, 1 GB, 10 GB
Comments from the community
– POP: baseline parallel I/O works; not clear the new decomposition scheme is easily parallelized
– TSI: would like to use higher-level libraries (Parallel NetCDF, HDF5), but they are not implemented or perform poorly on target architectures
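For context, the kind of higher-level library usage that the survey found to be rare looks roughly like the following Parallel NetCDF sketch; the file name, dimension, variable name, and simple 1-D decomposition are illustrative assumptions, and error checking is omitted.

```c
/* Rough sketch of a collective Parallel NetCDF write of one distributed field.
 * Names and sizes are illustrative. */
#include <mpi.h>
#include <pnetcdf.h>

void write_field_pnetcdf(const double *local, MPI_Offset local_n, MPI_Comm comm)
{
    int rank, nprocs, ncid, dimid, varid;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    /* One shared, self-describing file for all ranks. */
    ncmpi_create(comm, "field.nc", NC_CLOBBER, MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "n", local_n * nprocs, &dimid);
    ncmpi_def_var(ncid, "density", NC_DOUBLE, 1, &dimid, &varid);
    ncmpi_enddef(ncid);

    /* Each rank writes its contiguous slice collectively. */
    MPI_Offset start = (MPI_Offset)rank * local_n;
    MPI_Offset count = local_n;
    ncmpi_put_vara_double_all(ncid, varid, &start, &count, local);
    ncmpi_close(ncid);
}
```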
Preliminary I/O Performance of VH-1 on X1E using Different I/O Strategies
Learning From Experience with VH-1
The fast way to write (today)
– Each rank writes a separate architecture-specific file
• Native-mode MPI-IO, file-per-rank: O(1000) MB/s
– Manipulating a dataset in this form is awkward
Two solutions
– Write one portable file using collective I/O
• Native-mode MPI-IO, single file: < 100 MB/s (no external32 support on Phoenix)
– Post-process sequentially, rewriting in a portable format
In either case the computing platform is doing a lot of work to generate a convenient metadata structure in the file system
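The two write strategies compared above differ only in a few MPI-IO calls. The sketch below contrasts them; the buffer size, file names, and simple contiguous per-rank layout are illustrative assumptions (VH-1's actual data layout is more involved).

```c
/* Sketch of the two MPI-IO write strategies discussed above.
 * N, file names, and the contiguous layout are illustrative. */
#include <mpi.h>
#include <stdio.h>

#define N 1000000   /* doubles per rank (illustrative) */

/* (a) Fast today: each rank writes its own file in native representation. */
void write_file_per_rank(double *buf, int rank)
{
    char fname[64];
    MPI_File fh;
    snprintf(fname, sizeof(fname), "timestep.%05d", rank);
    MPI_File_open(MPI_COMM_SELF, fname,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write(fh, buf, N, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}

/* (b) Convenient but slower: all ranks write one shared file collectively.
 * Using the "external32" representation here would make the file portable,
 * but that representation was not available on Phoenix. */
void write_single_file(double *buf, MPI_Comm comm)
{
    int rank;
    MPI_File fh;
    MPI_Comm_rank(comm, &rank);
    MPI_File_open(comm, "timestep.all",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, (MPI_Offset)rank * N * sizeof(double),
                      MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, buf, N, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}
```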
Future Opportunities
Future Directions
Exposing structure to applications
High performance and fault tolerance in data-intensive scientific work
Analysis and benchmarking of data-intensive scientific codes
Advanced object storage devices
Adapting parallel I/O and parallel FS to wide-area collaborative science
Exposing File Structure to Applications
VH-1 experiences motivate some issues
Expensive compute platforms perform I/O optimized to their own architecture and configuration
– Metadata describes organization and encoding, using a portable schema
– Processing is performed between writer and reader
File systems are local managers of structure & metadata
– File description metadata can be managed outside a FS
– In some cases, it may be possible to bypass file systems and access Object Storage Devices directly from the application
– Exposing resources enables application autonomy
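As a sketch of what "file description metadata managed outside the FS" might contain, the hypothetical descriptor below records organization and encoding for a dataset written as one file per rank. It is not an existing schema; every field is an illustrative assumption.

```c
/* Hypothetical descriptor for a dataset written as one file per rank.
 * This is not an existing schema; it only illustrates the kind of
 * organization/encoding metadata that could live outside the file system. */
#include <stdint.h>

enum byte_order { LITTLE_ENDIAN_DATA, BIG_ENDIAN_DATA };

struct block_desc {
    char     path[256];      /* file (or object) holding this block       */
    uint64_t offset;         /* byte offset of the block within that file */
    uint64_t length;         /* block length in bytes                     */
    uint64_t global_start;   /* starting index in the global array        */
};

struct dataset_desc {
    char               name[64];       /* e.g. a field name and timestep   */
    int                timestep;
    int                element_bytes;  /* 8 for double precision           */
    enum byte_order    order;          /* encoding of the raw bytes        */
    uint64_t           global_elems;   /* total elements across all ranks  */
    int                nblocks;
    struct block_desc *blocks;         /* one entry per writer/rank        */
};
```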
High Performance and Fault Tolerance in Data Intensive Scientific Work
In complex workflows, performance and fault tolerance are defined in terms of global results, not local ones
Logistical Networking has proved valuable to application and tool builders
We will continue collaborating in and investigating the design of SDM tools that incorporate LN functionality
– Caching and distribution of massive datasets
– Interoperation with GridFTP-based infrastructure
– Robustness through the use of widely distributed resources
Science Efforts Currently Leveraging Logistical Networking
Transfers over long networks decrease the latency seen by TCP flows by storing and forwarding at intermediate points
– Producer and consumer are buffered, and the transfer is accelerated
Data movement within the Terascale Supernova Initiative and Fusion Energy Simulation projects
Storage model for the implementation of an SRM interface for the Open Science Grid project at Vanderbilt ACCRE
AmericaView distribution of MODIS satellite data to a national audience
Continue Analysis and Benchmarking of Data Intensive Scientific Codes
Scientists may (and should) have an abstract idea of their I/O characteristics (especially access patterns and access sizes), but these are sensitive to systems and libraries
Instrumentation at the source level is difficult and may be misleading
Standard benchmarks and tools must be developed as a basis for comparison
– Some tools exist: ORNL's mpiP provides runtime statistics about MPI-IO
– I/O performance metrics should be routinely collected
– Parallel I/O behavior should be accessible to users
– Non-determinism in performance must be addressed
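mpiP gathers its data by link-time interposition, but even a hand-rolled timer illustrates what "routinely collecting I/O metrics" means in practice. The sketch below (function and phase names are illustrative) reports per-phase minimum, maximum, and mean write times across ranks, which also exposes the non-determinism mentioned above.

```c
/* Hand-rolled collection of per-phase I/O timing across ranks (illustrative;
 * tools such as mpiP gather this kind of data without modifying the code). */
#include <mpi.h>
#include <stdio.h>

/* Time a write phase and report min/max/mean seconds across all ranks. */
void report_write_time(double t_start, double t_end,
                       const char *phase, MPI_Comm comm)
{
    int rank, nprocs;
    double t = t_end - t_start, tmin, tmax, tsum;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);
    MPI_Reduce(&t, &tmin, 1, MPI_DOUBLE, MPI_MIN, 0, comm);
    MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, comm);
    MPI_Reduce(&t, &tsum, 1, MPI_DOUBLE, MPI_SUM, 0, comm);

    /* A large max/min ratio is one symptom of non-deterministic I/O behavior. */
    if (rank == 0)
        printf("%s: min %.3f s  max %.3f s  mean %.3f s\n",
               phase, tmin, tmax, tsum / nprocs);
}

/* Usage sketch:
 *   double t0 = MPI_Wtime();
 *   ... checkpoint write ...
 *   report_write_time(t0, MPI_Wtime(), "checkpoint", MPI_COMM_WORLD);
 */
```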
Move Toward Advanced Object Storage Devices
OSD is an emerging trend in file and storage systems (e.g., Lustre, Panasas)
Current architecture is evolutionary from SCSI
– Need to develop advanced forms of OSD that can be accessed directly by multiple users
– Active OSD technology may enable pre- and post-processing at network storage nodes to offload hosts
These nodes must fit into a larger middleware framework and workflow scheme
Adapt Parallel I/O and Parallel FS to Wide Area Collaborative Science
Emergence of massive digital libraries and shared workspaces requires common tools that provide more control than file transfer or distributed FS solutions
Direct control over tape, wide-area transfer, and local caching is an important element of application optimization
New standards are required for expressing file system concepts interoperably in a heterogeneous wide area environment.
Enable Uniform Component Coupling
Major components of scientific workflows interact through asynchronous file I/O interfaces
– Granularity was traditionally the output of a complete run
– Today, as in Klasky & Bhat's data streaming, granularity is one time step, due to the increased scale of computation
Flexible management of state is required for customization of component interactions
– Localization (e.g., caching)
– Fault tolerance (e.g., redundancy)
– Optimization (e.g., point-to-multipoint)
Questions?
Bonus Slides
Important Attributes of a Common Interface
Primitive, generic in order to serve many purposes
– Sufficient to implement application requirements
Well layered in order to allow for diversity
– Not imposing the costs of complex higher layers on users of the lower-layer functionality
Easily ported to new platforms and widely acceptable within the developer community
– Who are the designers? What is the process?
Developer Conversations / POP
Parallel Ocean Program
– Already have a working parallel I/O scheme
– Initial look at MPI-IO seemed to indicate an impedance mismatch with POP's decomposition scheme
– NetCDF option doesn't use parallel I/O
– I/O is a low priority compared to other performance issues
Developer Conversations / TSI
Terascale Supernova Initiative
– Would like to use Parallel NetCDF or HDF5, but they are unavailable or perform poorly on the platform of choice (Cray X1)
– Negligible performance impact of each rank writing an individual timestep file, at least up to 140 PEs
Closer investigation of VH-1 shows
– Performance impact of writing file-per-rank is not negligible
– Major costs are imposed by writing architecture-independent files and forming a single timestep file
– Parallel file systems address only some of these issues
Interfaces Used in Logistical Networking
On the network side
– Sockets/TCP link clients to servers
– XML metadata schema
On the client side
– Procedure calls in C/Fortran/Java/…
– Application-layer I/O libraries
End-user tools
– Command line
– GUI implemented in Tcl