DoE I/O characterizations, infrastructures for wide-area
collaborative science, and future opportunities
Jeffrey S. Vetter, Micah Beck, Philip Roth
Future Technologies Group @ ORNL
University of Tennessee, Knoxville
NCCS Resources
Summary, September 2005: 7 systems; supercomputers with 7,622 CPUs, 16 TB of memory, and 45 TFlops; 238.5 TB of disk; 5 PB of archival storage.
– Cray XT3 "Jaguar": 5,294 processors at 2.4 GHz, 11 TB memory
– Cray X1E "Phoenix": 1,024 processors at 0.5 GHz, 2 TB memory
– SGI Altix "Ram": 256 processors at 1.5 GHz, 2 TB memory
– IBM SP4 "Cheetah": 864 processors at 1.3 GHz, 1.1 TB memory
– IBM Linux NSTG cluster: 56 processors at 3 GHz, 76 GB memory
– Visualization cluster: 128 processors at 2.2 GHz, 128 GB memory
– IBM HPSS archive (many storage devices supported): 5 PB
Disk: 120 TB shared disk, per-system disks of 32 TB, 32 TB, 36 TB, 9 TB, and 4.5 TB, plus 5 TB backup storage (238.5 TB total)
Networks: UltraScience (10 GigE), 1 GigE, control network, and network routers
Scientific Visualization Lab: 27-projector Power Wall
Test systems: 96-processor Cray XT3, 32-processor Cray X1E, 16-processor SGI Altix
Evaluation platforms: 144-processor Cray XD1 with FPGAs, SRC Mapstation, ClearSpeed, BlueGene (at ANL)
Implications for Storage, Data, Networking
Decommissioning affects important GPFS storage systems
– IBM SP3 decommissioned in June
– IBM p690 cluster disabled on Oct 4
Next-generation systems bring new storage architectures
– Lustre on the 25 TF XT3 with a lightweight kernel (limited set of OS services)
– ADIC StorNext on the 18 TF X1E
Visualization and analysis systems
– SGI Altix, visualization cluster, 35-megapixel Powerwall
Networking
– Connected to ESnet, Internet2, TeraGrid, UltraNet, etc.
– ORNL internal network being upgraded now
End-to-end solutions underway (see Klasky's talk)
Initial Lustre Performance on Cray XT3
Acceptance criterion for the XT3 is 5 GB/s of bandwidth
– A 32-OSS, 64-OST configuration hit 6.7 GB/s write, 6.2 GB/s read
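For reference, aggregate bandwidth numbers like these are usually obtained with a file-per-process streaming write test. The sketch below is a minimal MPI-IO version of such a test, not the actual XT3 acceptance benchmark; the file name, transfer size, and block count are illustrative assumptions.

```c
/* Minimal file-per-process write-bandwidth sketch (not the XT3 acceptance test).
 * Each rank streams BLOCKS blocks of BLOCK_BYTES to its own file; rank 0 reports
 * the aggregate bandwidth.  Build with an MPI C compiler (e.g. mpicc). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_BYTES (4 * 1024 * 1024)   /* 4 MB per write (illustrative) */
#define BLOCKS      256                  /* 1 GB per rank (illustrative)  */

int main(int argc, char **argv)
{
    int rank, nprocs;
    char fname[256];
    MPI_File fh;
    char *buf;
    double t0, t1, elapsed, max_elapsed;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    buf = malloc(BLOCK_BYTES);
    memset(buf, rank & 0xff, BLOCK_BYTES);
    snprintf(fname, sizeof(fname), "iotest.%05d", rank);   /* hypothetical name */

    MPI_File_open(MPI_COMM_SELF, fname,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < BLOCKS; i++)
        MPI_File_write(fh, buf, BLOCK_BYTES, MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);                 /* include the close in the timing */
    t1 = MPI_Wtime();

    elapsed = t1 - t0;
    MPI_Reduce(&elapsed, &max_elapsed, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        double gb = (double)nprocs * BLOCKS * BLOCK_BYTES / 1.0e9;
        printf("aggregate write: %.2f GB in %.2f s = %.2f GB/s\n",
               gb, max_elapsed, gb / max_elapsed);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```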
Preliminary I/O Survey (Spring 2005)
Mechanisms surveyed: Fortran I/O, MPI-IO, NetCDF.

GYRO 3.0.0
– Read input files: rank 0; < 1 MB; at initialization
– Write checkpoint files: rank 0; 87.5 MB; once per 1000 time-steps
– Write logging/debug files: rank 0; ~150 KB; file-dependent

POP (standalone) 1.4.3/2.0
– Read input files (Fortran I/O; NetCDF in 2.0 only): parallel, rank 0; at initialization
– Read forcing files: parallel, rank 0; every few time-steps
– Write 3-D field files: parallel, rank 0; 1.4 GB; several per simulation-month

CAM (standalone) 3.0
– Read input files: rank 0; ~300 MB; at initialization
– Write checkpoint files: rank 0; once per simulation-day
– Write output files: rank 0; ~110 MB; at termination

AORSA2D
– Read input files: all ranks; ~26 MB; at initialization
– Write output files: rank 0; ~10 MB; at termination

TSI
– Read input files: rank 0; at initialization
– Write timestep files (VH-1): all ranks, post-processed into a single NetCDF file; 28 GB per timestep; hundreds per run
Need I/O access patterns.
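Most of the codes above funnel their I/O through rank 0. The sketch below shows that pattern in its simplest form; the field size, file name, and gather-based collection are illustrative and are not taken from any of the surveyed codes.

```c
/* Illustrative "rank 0 writes the checkpoint" pattern, as used (in spirit) by
 * several of the surveyed codes.  LOCAL_N and the file name are hypothetical. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define LOCAL_N 100000   /* doubles owned by each rank (illustrative) */

void write_checkpoint(double *local, int step, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    /* Collect the distributed field onto rank 0 ... */
    double *global = NULL;
    if (rank == 0)
        global = malloc((size_t)nprocs * LOCAL_N * sizeof(double));
    MPI_Gather(local, LOCAL_N, MPI_DOUBLE,
               global, LOCAL_N, MPI_DOUBLE, 0, comm);

    /* ... and let rank 0 alone write the (architecture-specific) binary file. */
    if (rank == 0) {
        char fname[64];
        snprintf(fname, sizeof(fname), "checkpoint_%06d.dat", step);
        FILE *fp = fopen(fname, "wb");
        fwrite(global, sizeof(double), (size_t)nprocs * LOCAL_N, fp);
        fclose(fp);
        free(global);
    }
}
```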
Initial Observations
Most users have limited I/O capability because libraries and runtime systems are inconsistent across platforms
– Little use of [Parallel] NetCDF or HDF5
– Seldom direct use of MPI-IO
Widely varying file size distribution
– 1 MB, 10 MB, 100 MB, 1 GB, 10 GB
Comments from the community
– POP: baseline parallel I/O works; not clear the new decomposition scheme is easily parallelized
– TSI: would like to use higher-level libraries (Parallel NetCDF, HDF5), but they are not implemented or perform poorly on target architectures
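For context, the kind of higher-level library usage that the survey found to be rare looks roughly like the following Parallel NetCDF sketch; the file name, dimension, variable name, and simple 1-D decomposition are illustrative assumptions, and error checking is omitted.

```c
/* Rough sketch of a collective Parallel NetCDF write of one distributed field.
 * Names and sizes are illustrative. */
#include <mpi.h>
#include <pnetcdf.h>

void write_field_pnetcdf(const double *local, MPI_Offset local_n, MPI_Comm comm)
{
    int rank, nprocs, ncid, dimid, varid;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    /* One shared, self-describing file for all ranks. */
    ncmpi_create(comm, "field.nc", NC_CLOBBER, MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "n", local_n * nprocs, &dimid);
    ncmpi_def_var(ncid, "density", NC_DOUBLE, 1, &dimid, &varid);
    ncmpi_enddef(ncid);

    /* Each rank writes its contiguous slice collectively. */
    MPI_Offset start = (MPI_Offset)rank * local_n;
    MPI_Offset count = local_n;
    ncmpi_put_vara_double_all(ncid, varid, &start, &count, local);
    ncmpi_close(ncid);
}
```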
Preliminary I/O Performance of VH-1 on X1E using Different I/O Strategies
Learning From Experience with VH-1
The fast way to write (today)
– Each rank writes a separate architecture-specific file
• Native-mode MPI-IO, file-per-rank: O(1000) MB/s
– Manipulating a dataset in this form is awkward
Two solutions
– Write one portable file using collective I/O
• Native-mode MPI-IO, single file: < 100 MB/s (no external32 support on Phoenix)
– Post-process sequentially, rewriting in a portable format
In either case the computing platform is doing a lot of work to generate a convenient metadata structure in the file system
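The two write strategies compared above differ only in a few MPI-IO calls. The sketch below contrasts them; the buffer size, file names, and simple contiguous per-rank layout are illustrative assumptions (VH-1's actual data layout is more involved).

```c
/* Sketch of the two MPI-IO write strategies discussed above.
 * N, file names, and the contiguous layout are illustrative. */
#include <mpi.h>
#include <stdio.h>

#define N 1000000   /* doubles per rank (illustrative) */

/* (a) Fast today: each rank writes its own file in native representation. */
void write_file_per_rank(double *buf, int rank)
{
    char fname[64];
    MPI_File fh;
    snprintf(fname, sizeof(fname), "timestep.%05d", rank);
    MPI_File_open(MPI_COMM_SELF, fname,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write(fh, buf, N, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}

/* (b) Convenient but slower: all ranks write one shared file collectively.
 * Using the "external32" representation here would make the file portable,
 * but that representation was not available on Phoenix. */
void write_single_file(double *buf, MPI_Comm comm)
{
    int rank;
    MPI_File fh;
    MPI_Comm_rank(comm, &rank);
    MPI_File_open(comm, "timestep.all",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, (MPI_Offset)rank * N * sizeof(double),
                      MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, buf, N, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}
```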
Future Opportunities
Future Directions
Exposing structure to applications
High performance and fault tolerance in data-intensive scientific work
Analysis and benchmarking of data-intensive scientific codes
Advanced object storage devices
Adapting parallel I/O and parallel FS to wide-area collaborative science
Exposing File Structure to Applications
VH-1 experiences motivate some issues
Expensive compute platforms perform I/O optimized to their own architecture and configuration
– Metadata describes organization and encoding, using a portable schema
– Processing is performed between writer and reader
File systems are local managers of structure & metadata
– File description metadata can be managed outside a FS
– In some cases, it may be possible to bypass file systems and access Object Storage Devices directly from the application
– Exposing resources enables application autonomy
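As a sketch of what "file description metadata managed outside the FS" might contain, the hypothetical descriptor below records organization and encoding for a dataset written as one file per rank. It is not an existing schema; every field is an illustrative assumption.

```c
/* Hypothetical descriptor for a dataset written as one file per rank.
 * This is not an existing schema; it only illustrates the kind of
 * organization/encoding metadata that could live outside the file system. */
#include <stdint.h>

enum byte_order { LITTLE_ENDIAN_DATA, BIG_ENDIAN_DATA };

struct block_desc {
    char     path[256];      /* file (or object) holding this block       */
    uint64_t offset;         /* byte offset of the block within that file */
    uint64_t length;         /* block length in bytes                     */
    uint64_t global_start;   /* starting index in the global array        */
};

struct dataset_desc {
    char               name[64];       /* e.g. a field name and timestep   */
    int                timestep;
    int                element_bytes;  /* 8 for double precision           */
    enum byte_order    order;          /* encoding of the raw bytes        */
    uint64_t           global_elems;   /* total elements across all ranks  */
    int                nblocks;
    struct block_desc *blocks;         /* one entry per writer/rank        */
};
```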
High Performance and Fault Tolerance in Data Intensive Scientific Work
In complex workflows, performance and fault tolerance are defined in terms of global results, not local ones
Logistical Networking has proved valuable to application and tool builders
We will continue collaborating in and investigating the design of SDM tools that incorporate LN functionality
– Caching and distribution of massive datasets
– Interoperation with GridFTP-based infrastructure
– Robustness through the use of widely distributed resources
Science Efforts Currently Leveraging Logistical Networking
Transfers over long networks decrease the latency seen by TCP flows by storing and forwarding at intermediate points
– Producer and consumer are buffered, and the transfer is accelerated
Data movement within the Terascale Supernova Initiative and Fusion Energy Simulation projects
Storage model for the implementation of an SRM interface for the Open Science Grid project at Vanderbilt ACCRE
AmericaView distribution of MODIS satellite data to a national audience
Continue Analysis and Benchmarking of Data Intensive Scientific Codes
Scientists may (and should) have an abstract idea of their I/O characteristics (especially access patterns and access sizes), but these are sensitive to systems and libraries
Instrumentation at the source level is difficult and may be misleading
Standard benchmarks and tools must be developed as a basis for comparison
– Some tools exist: ORNL's mpiP provides runtime statistics about MPI-IO
– I/O performance metrics should be routinely collected
– Parallel I/O behavior should be accessible to users
– Non-determinism in performance must be addressed
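mpiP gathers its data by link-time interposition, but even a hand-rolled timer illustrates what "routinely collecting I/O metrics" means in practice. The sketch below (function and phase names are illustrative) reports per-phase minimum, maximum, and mean write times across ranks, which also exposes the non-determinism mentioned above.

```c
/* Hand-rolled collection of per-phase I/O timing across ranks (illustrative;
 * tools such as mpiP gather this kind of data without modifying the code). */
#include <mpi.h>
#include <stdio.h>

/* Time a write phase and report min/max/mean seconds across all ranks. */
void report_write_time(double t_start, double t_end,
                       const char *phase, MPI_Comm comm)
{
    int rank, nprocs;
    double t = t_end - t_start, tmin, tmax, tsum;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);
    MPI_Reduce(&t, &tmin, 1, MPI_DOUBLE, MPI_MIN, 0, comm);
    MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, comm);
    MPI_Reduce(&t, &tsum, 1, MPI_DOUBLE, MPI_SUM, 0, comm);

    /* A large max/min ratio is one symptom of non-deterministic I/O behavior. */
    if (rank == 0)
        printf("%s: min %.3f s  max %.3f s  mean %.3f s\n",
               phase, tmin, tmax, tsum / nprocs);
}

/* Usage sketch:
 *   double t0 = MPI_Wtime();
 *   ... checkpoint write ...
 *   report_write_time(t0, MPI_Wtime(), "checkpoint", MPI_COMM_WORLD);
 */
```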
Move Toward Advanced Object Storage Devices
OSD is an emerging trend in file and storage systems (e.g., Lustre, Panasas)
Current architecture is evolutionary from SCSI
– Need to develop advanced forms of OSD that can be accessed directly by multiple users
– Active OSD technology may enable pre- and post-processing at network storage nodes to offload hosts
These nodes must fit into a larger middleware framework and workflow scheme
Adapt Parallel I/O and Parallel FS to Wide Area Collaborative Science
Emergence of massive digital libraries and shared workspaces requires common tools that provide more control than file transfer or distributed FS solutions
Direct control over tape, wide-area transfer, and local caching is an important element of application optimization
New standards are required for expressing file system concepts interoperably in a heterogeneous wide area environment.
Enable Uniform Component Coupling
Major components of scientific workflows interact through asynchronous file I/O interfaces
– Granularity was traditionally the output of a complete run
– Today, as in Klasky & Bhat's data streaming, granularity is one time step, due to the increased scale of computation
Flexible management of state is required for customization of component interactions
– Localization (e.g., caching)
– Fault tolerance (e.g., redundancy)
– Optimization (e.g., point-to-multipoint)
Questions?
Bonus Slides
Important Attributes of a Common Interface
Primitive, generic in order to serve many purposes
– Sufficient to implement application requirements
Well layered in order to allow for diversity
– Not imposing the costs of complex higher layers on users of the lower-layer functionality
Easily ported to new platforms and widely acceptable within the developer community
– Who are the designers? What is the process?
Developer Conversations / POP
Parallel Ocean Program
– Already have a working parallel I/O scheme
– Initial look at MPI-IO seemed to indicate an impedance mismatch with POP's decomposition scheme
– NetCDF option doesn't use parallel I/O
– I/O is a low priority compared to other performance issues
Developer Conversations / TSI
Terascale Supernova Initiative
– Would like to use Parallel NetCDF or HDF5, but they are unavailable or perform poorly on the platform of choice (Cray X1)
– Negligible performance impact of each rank writing an individual timestep file, at least up to 140 PEs
Closer investigation of VH-1 shows
– Performance impact of writing file-per-rank is not negligible
– Major costs are imposed by writing architecture-independent files and forming a single timestep file
– Parallel file systems address only some of these issues
Interfaces Used in Logistical Networking
On the network side
– Sockets/TCP link clients to servers
– XML metadata schema
On the client side
– Procedure calls in C/Fortran/Java/…
– Application-layer I/O libraries
End-user tools
– Command line
– GUI implemented in Tcl