SciDAC SDM Center All Hands Meeting, October 5-7, 2005
Northwestern University PIs: Alok Choudhary, Wei-keng Liao
Graduate Students: Jianwei Li, Avery Ching, Kenin Coloma
ANL Collaborators: Bill Gropp, Rob Ross, Rajeev Thakur, Rob Latham
Parallel I/O Middleware Optimizations and Future Directions
Outline
• Progress and accomplishments (Wei-keng Liao)
  – Parallel netCDF
  – Client-side file caching in MPI-IO
  – Data-type I/O for non-contiguous file access in PVFS
• Future research directions (Alok Choudhary)
  – I/O middleware
  – Autonomic and active storage systems
Parallel NetCDF
• NetCDF defines:
  – A set of APIs for file access
  – A machine-independent file format
• Parallel netCDF work
  – New APIs for parallel access
  – Maintaining the same file format
• Tasks
  – Built on top of MPI for portability and high performance
  – Support C and Fortran interfaces
  – Support external data representations
[Figure, two panels: processes P0 to P3 accessing a parallel file system through netCDF, and processes P0 to P3 accessing a parallel file system through Parallel netCDF]
PnetCDF Current Status
• Version 1.0.0 was released on July 27, 2005
• Supported platforms
  – Linux Cluster, IBM SP, SGI Origin, Cray X, NEC SX
• Two sets of parallel APIs are completed
  – High-level APIs (mimicking the serial netCDF APIs)
  – Flexible APIs (extended to utilize MPI derived datatypes)
• Fully supported in both C and Fortran
• Support for large files (> 4 GB)
• Test suites
  – Self-test codes ported from the Unidata netCDF package to validate against single-process results
  – Parallel test codes for both sets of APIs
Illustrative PnetCDF Users
• FLASH – astrophysical thermonuclear application from the ASCI/Alliances center at the University of Chicago
• ACTM – atmospheric chemical transport model, LLNL
• WRF-ROMS – regional ocean model system I/O module from the scientific data technologies group, NCSA
• ASPECT – data understanding infrastructure, ORNL
• pVTK – parallel visualization toolkit, ORNL
• PETSc – portable, extensible toolkit for scientific computation, ANL
• PRISM – PRogram for Integrated Earth System Modeling, users from C&C Research Laboratories, NEC Europe Ltd.
• ESMF – earth system modeling framework, National Center for Atmospheric Research
• More …
PnetCDF Future Work
• Non-blocking I/O APIs
• Performance improvement for data type conversion
  – Type conversion while packing non-contiguous buffers
• Extending PnetCDF for newer applications, e.g., data analysis and mining
• Collaboration with application users
File Caching in MPI-IO
[Figure: I/O software stack with the layers Applications, Parallel netCDF, MPI-IO, PVFS, and Storage devices]
File Caching for Parallel Apps
• Why file caching?
  – Improves performance for repeated file access
  – Enables a write-behind strategy
    • Accumulates multiple small writes to better utilize network bandwidth
    • May balance the workload for irregular I/O patterns
    • Useful for checkpointing
  – Enables data pre-fetching
    • Useful for read-only applications (parallel data mining, visualization)
• Why not just use traditional caching strategies?
  – Each client caches independently, which leads to cache incoherence
  – I/O servers are in charge of cache coherence control, which can serialize I/O
  – Inadequate for parallel environments where application clients frequently read/write shared files
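The write-behind idea above (accumulate small writes, then send fewer and larger requests) can be sketched as follows. This is a minimal single-process illustration, not the MPI-IO implementation from the talk: the class name `WriteBehindBuffer` and its methods are hypothetical, and the sketch merges only exactly adjacent writes.

```python
from dataclasses import dataclass, field


@dataclass
class WriteBehindBuffer:
    """Hypothetical sketch: buffer small writes, flush them as merged requests."""
    flushed: list = field(default_factory=list)  # (offset, data) requests actually sent
    pending: dict = field(default_factory=dict)  # offset -> bytes, buffered locally

    def write(self, offset: int, data: bytes) -> None:
        # Only buffer the write; nothing is sent yet.
        self.pending[offset] = data

    def flush(self) -> None:
        # Merge writes that are exactly adjacent (overlaps are not handled
        # in this sketch), then emit each merged run as one request.
        run_start, run = None, b""
        for offset in sorted(self.pending):
            data = self.pending[offset]
            if run_start is not None and offset == run_start + len(run):
                run += data
            else:
                if run_start is not None:
                    self.flushed.append((run_start, run))
                run_start, run = offset, data
        if run_start is not None:
            self.flushed.append((run_start, run))
        self.pending.clear()


buf = WriteBehindBuffer()
for i in range(8):  # eight 4-byte writes, issued out of order
    buf.write(((i * 3) % 8) * 4, bytes([i]) * 4)
buf.flush()
print(len(buf.flushed), "request(s),", len(buf.flushed[0][1]), "bytes in the first")
```

Here eight small writes leave the buffer as a single 32-byte request, which is the bandwidth argument made on the slide.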
Caching Sub-system in MPI-IO
• Application-aware file caching
  – A user-level implementation in the MPI-IO library
  – MPI communicators define the subsets of processes operating on a shared file
  – Processes cooperate with each other to perform caching
  – Data cached in one client can be directly accessed by another
  – Moves cache coherence control from servers to clients
  – Distributed coherence control (less overhead)
• Supports both collective and independent I/O
[Figure: client processors whose local cache buffers in memory form a global cache pool, connected through the network interconnect to the I/O servers]
Design
• Cache metadata
  – File-block-based granularity
  – Cyclically stored in all processes
• Global cache pool
  – Comprises local memory of all processes
  – Single copy of file data to avoid coherence issues
• Two implementations:
  – Using an I/O thread (POSIX thread)
  – Using the MPI remote-memory-access (RMA) facility
[Figure: file logical partitioning into blocks 0, 1, 2, 3, 4, ...; distributed cache metadata, with P0 holding the status of blocks 0, 4, 8, P1 of blocks 1, 5, 9, P2 of blocks 2, 6, 10, and P3 of blocks 3, 7, 11; global cache pool made up of pages 1 to 3 in the local memory of each of P0 to P3]
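The cyclic placement of cache metadata shown in the figure (P0 holds the status of blocks 0, 4, 8; P1 holds 1, 5, 9; and so on) amounts to a modulo mapping. A minimal sketch follows; the function names and the 64 KiB block size are assumptions chosen for illustration, not values from the talk.

```python
def block_of(offset: int, block_size: int) -> int:
    """File block that contains the given byte offset."""
    return offset // block_size


def metadata_owner(block: int, nprocs: int) -> int:
    """Rank that stores the status of a block under cyclic (round-robin) placement."""
    return block % nprocs


nprocs = 4
owners = {
    rank: [b for b in range(12) if metadata_owner(b, nprocs) == rank]
    for rank in range(nprocs)
}
print(owners)

# Assumed 64 KiB blocks: byte offset 200,000 falls in block 3, owned by P3.
block = block_of(200_000, 64 * 1024)
print(block, metadata_owner(block, nprocs))
```

Because any process can compute the owner locally, finding the metadata for a block needs no central directory.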
Example Read Operation
[Figure: a read of block 3, drawn over the file logical partitioning, the distributed metadata on P0 to P3, and the global pool of pages in each process's local memory. Annotations: "metadata lookup", "lock it!", "If not yet cached" (page 1), "Already cached" (page 2), "unlock it!"]
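One way to read the annotated steps in this figure is: look up the block's metadata at its owner, lock it, read from the file only if no process caches the block yet, otherwise take the data from the caching process, then unlock. The sketch below simulates that sequence inside a single Python process. It is an interpretation under stated assumptions: `CacheSim` is a hypothetical name, one lock per metadata owner stands in for whatever locking granularity the real system uses, and remote memory access is modeled as a plain dictionary lookup.

```python
import threading


class CacheSim:
    """Hypothetical single-process model of the cooperative read path."""

    def __init__(self, nprocs: int, file_blocks: dict):
        self.nprocs = nprocs
        self.file_blocks = file_blocks  # block id -> bytes, stands in for the file
        self.metadata = [dict() for _ in range(nprocs)]  # per owner: block -> caching rank
        self.locks = [threading.Lock() for _ in range(nprocs)]
        self.pages = [dict() for _ in range(nprocs)]  # per rank: block -> cached bytes
        self.file_reads = 0  # how many times the file system was touched

    def read(self, rank: int, block: int) -> bytes:
        owner = block % self.nprocs  # cyclic metadata placement
        with self.locks[owner]:  # "lock it!" ... "unlock it!"
            holder = self.metadata[owner].get(block)  # metadata lookup
            if holder is None:  # not yet cached: read from file into a local page
                self.file_reads += 1
                self.pages[rank][block] = self.file_blocks[block]
                self.metadata[owner][block] = rank
                holder = rank
            # already cached: serve the single cached copy from its holder
            return self.pages[holder][block]


sim = CacheSim(4, {b: bytes([b]) * 8 for b in range(12)})
first = sim.read(rank=1, block=3)   # miss: rank 1 caches block 3
second = sim.read(rank=2, block=3)  # hit: served from rank 1's page
print(sim.file_reads, first == second)
```

The second read does not create another copy on rank 2, which matches the single-copy rule stated on the Design slide.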
Future Work
• Data pre-fetching
  – Instructional (through MPI info) and non-instructional (based on sequential access)
• Collective write-behind for data checkpointing
• Stand-alone distributed lock sub-system
  – Using the MPI-2 remote-memory-access facility
• Design new MPI file hints for caching
• Application I/O pattern study
  – Structured/unstructured AMR
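Non-instructional pre-fetching, as listed above, means guessing future reads from the observed access sequence rather than from user hints. The sketch below shows one simple policy of that kind; the class name, the `threshold` and `depth` parameters, and the policy itself are assumptions for illustration, since the slide does not specify a mechanism.

```python
class SequentialPrefetcher:
    """Hypothetical policy: after `threshold` consecutive sequential steps,
    suggest the next `depth` blocks for pre-fetching."""

    def __init__(self, threshold: int = 2, depth: int = 2):
        self.threshold = threshold
        self.depth = depth
        self.last_block = None
        self.run_length = 0  # consecutive "previous block + 1" accesses seen

    def access(self, block: int) -> list:
        if self.last_block is not None and block == self.last_block + 1:
            self.run_length += 1
        else:
            self.run_length = 0  # pattern broken, start over
        self.last_block = block
        if self.run_length >= self.threshold:
            return [block + i for i in range(1, self.depth + 1)]
        return []


prefetcher = SequentialPrefetcher()
suggestions = [prefetcher.access(b) for b in [0, 1, 2, 3, 10]]
print(suggestions)
```

An instructional variant would skip the detection step and take the access style directly from a hint supplied by the application.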
Data-type I/O in PVFS
[Figure: I/O software stack with the layers Applications, Parallel netCDF, MPI-IO, PVFS, and Storage devices]
Non-contiguous I/O
• Four types
  – Contiguous both in memory and file
  – Contiguous in memory, non-contiguous in file
  – Non-contiguous in memory, contiguous in file
  – Non-contiguous both in memory and file
• Each segment is an I/O request of (offset, length)
[Figure: memory and file layouts for each of the four types]
Implementations
• POSIX I/O
  – One call per (offset, length)
  – Generates a large number of I/O requests
• Data sieving
  – Single (offset, length) covering multiple segments
  – Accesses unused data and introduces consistency control overhead
• List I/O
  – A single call handles multiple non-contiguous accesses
  – Passes multiple (offset, length) pairs across the network
[Figure: an application process issuing separate I/O requests, and an application process issuing one list I/O request, each through the client-side file system and across the network to the server-side file system]
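The trade-offs among the three approaches can be made concrete by counting requests and bytes for one access pattern. The sketch below is a cost model only, with hypothetical function names and an assumed pattern of 100 segments of 64 bytes placed 1,024 bytes apart; it performs no real I/O.

```python
def posix_io(segments):
    """One request per (offset, length) segment."""
    return {"requests": len(segments),
            "bytes_moved": sum(length for _, length in segments)}


def data_sieving(segments):
    """One request spanning from the first byte to the last byte of all segments."""
    start = min(offset for offset, _ in segments)
    end = max(offset + length for offset, length in segments)
    return {"requests": 1, "bytes_moved": end - start}


def list_io(segments):
    """One request that carries every (offset, length) pair with it."""
    return {"requests": 1,
            "bytes_moved": sum(length for _, length in segments),
            "pairs_sent": len(segments)}


# Assumed pattern: 100 segments of 64 bytes, one every 1,024 bytes.
segments = [(i * 1024, 64) for i in range(100)]
for model in (posix_io, data_sieving, list_io):
    print(model.__name__, model(segments))
```

For this pattern POSIX I/O issues 100 requests, data sieving moves 101,440 bytes to deliver 6,400 useful ones, and list I/O sends a single request but has to ship 100 offset-length pairs.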
Data-type I/O
• A single request all the way to the servers
• Abandons the offset-length pair representation
  – Borrows the MPI datatype concept to describe non-contiguous access patterns
  – New file system data types
  – New file system interfaces
• An implementation in PVFS
  – Both client and server sides
[Figure: an application process sends a datatype I/O request to the PVFS client, which passes a single request over the network to the PVFS server]
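To illustrate why a datatype description can replace a list of offset-length pairs, the sketch below models a regular strided pattern with three integers, in the spirit of an MPI vector type (count, block length, stride). The `Vector` class is hypothetical and measures everything in bytes; it is not the PVFS datatype interface, which the slides do not show.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vector:
    """Hypothetical strided datatype: `count` blocks of `blocklength` bytes,
    with consecutive blocks starting `stride` bytes apart."""
    count: int
    blocklength: int
    stride: int

    def flatten(self, base: int = 0) -> list:
        # Expand the compact description into explicit (offset, length) pairs,
        # as a server could do locally after receiving the single request.
        return [(base + i * self.stride, self.blocklength)
                for i in range(self.count)]


pattern = Vector(count=100, blocklength=64, stride=1024)
pairs = pattern.flatten()

# The datatype is 3 integers; the equivalent pair list is 200 integers.
print(len(pairs), pairs[:3])
```

The description stays the same size no matter how many segments the pattern covers, whereas the pair list grows linearly with the segment count.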
Summary of Accomplishments
• High-level I/O
  – Parallel netCDF
• Low-level I/O
  – MPI-IO file caching
• Parallel file system
  – Data-type I/O in PVFS
[Figure: software stack with the layers Parallel netCDF, MPI-IO, and PVFS]
Future Research
Typical Components in I/O Systems
• Based on a lot of current apps
• High-level
  – E.g., NetCDF, HDF, ABC
  – Applications use these
• Mid-level
  – E.g., MPI-IO
  – Performance experience
• Low-level
  – E.g., file systems
  – Critical for the performance of the layers above
• More access information is lost as more components are used
[Figure: compute nodes running the layers Applications; Parallel netCDF, HDF5, ...; MPI-IO; and client-side file system, connected over the network to I/O servers. Caption: End-to-end performance critical]
• Collectives, independents
• I/O hints: access style (read_once, write_mostly, sequential, random, …), collective buffering, chunking, striping
• Open mode (O_RDONLY, O_WRONLY, O_SYNC), file status, locking, flushing, cache invalidation
• Machine dependent: data shipping, sparse access, double buffering
• Access based on: file blocks, objects scheduling, aggregation
• Read-ahead, write-behind, metadata management, file striping, security, redundancy
• Save attributes along with data, external data types (byte-alignment), data structures (flexible dimensionality), hierarchical data model
• Access patterns: shared files, individual files, data partitioning, checkpointing, data structures, inter-data relationship
[Figure: the layers Applications; Parallel netCDF, HDF5, ...; MPI-IO; and client-side file system, connected over the network to I/O servers]
• Application-aware caching, pre-fetching, file grouping, “vector of bytes”, flexible caching control, object-based data alignment, memory-file layout mapping, more control over hardware, shared file descriptors
• Group locks, flexible locking control, scalable metadata management, zero-copying, QoS, shared file descriptors
• Active storage: data filtering, object-based/hierarchical storage management, indexing, mining, power management
• Caching, fault tolerance, read-ahead, write-behind, I/O load balance, wide-area, heterogeneous FS support, thread-safe
• Graph-based data model
Goal
• Decouple “What” from “How” and be proactive
Current
• User burdened
• Ineffective interfaces
• Non-communicating layers
[Figure labels: App 1, App 2, App 3, App 4; I/O SW OPT; Understand; FS, DM, Datasets, HSS; caching, collective, reorganize, load balance, fault-tolerance; streaming, small/large, regular/irregular, local/remote, configuration, s/w layer; speed, BW, latency, QoS]
Component Design for I/O
• Application-aware
  – Capture the application’s file access information
  – Relationship between files, objects, users
• Environment-aware
  – Network (reliability, security), storage devices (active disks)
• Context-aware
  – Binding data attributes to files, indexing for fast search
• High-performance I/O needs support from
  – Languages + compilers
  – I/O libraries
  – File systems
  – Storage devices
Component Interface Design
• Informative
  – Should deliver access/storage information top-down/bottom-up
• Flexibility
  – Should describe arbitrary data distribution in memory buffers, files, storage devices
• Functionality
  – Asynchronous operations, read-ahead, write-behind, replication
  – Provides the ability for additional innovation
• Object-based I/O
  – For hardware control (I/O co-processor, active disk, object-based file systems, etc.)
Future Work in MPI-IO
• Investigate interface extensions
• Client-side caching sub-system
  – Implementations for various I/O strategies: buffering, pre-fetching, replication, migration
  – Adaptive caching mechanisms and algorithms for optimizing different access patterns
• Distributed mutual-exclusion locking sub-system
  – Shared resources, such as files and memory
  – Pipeline locking (overlap lock waiting time with I/O)
• Work with HDF5 and parallel netCDF
  – Design I/O strategies for metadata and data
    • Metadata: small, overlapping, repeated, strong consistency requirement
    • Array data: large, less frequent updates
Future Work in Parallel File Systems
• File caching (focus on parallel apps)
• File versioning
  – Alternative to file locking
  – Reliability and availability aspects
    • Guarantee atomicity in the presence of client or I/O system failure
    • Can enable efficient RAID-type schemes in PFS (because of atomicity)
• Dynamic rebalancing of I/O
• File list lock
  – Locks multiple regions in a single request
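A file list lock, as described in the last bullet, takes several byte ranges in one request. The sketch below shows one possible all-or-nothing semantics for such a request; the `ListLockManager` class, its method names, and the policy of refusing the whole list on any conflict are assumptions for illustration, since the slide lists this only as future work.

```python
class ListLockManager:
    """Hypothetical manager that grants a whole list of byte ranges or none of it."""

    def __init__(self):
        self.held = {}  # owner -> list of (offset, length) regions currently locked

    @staticmethod
    def _overlap(a, b) -> bool:
        # Half-open byte ranges [offset, offset + length) intersect.
        return a[0] < b[0] + b[1] and b[0] < a[0] + a[1]

    def acquire(self, owner, regions) -> bool:
        # Check every requested region against every region held by others.
        for other, theirs in self.held.items():
            if other != owner and any(
                self._overlap(r, t) for r in regions for t in theirs
            ):
                return False  # one conflict rejects the whole request
        self.held.setdefault(owner, []).extend(regions)
        return True

    def release(self, owner) -> None:
        self.held.pop(owner, None)


mgr = ListLockManager()
got_p0 = mgr.acquire("P0", [(0, 64), (1024, 64)])
got_p1 = mgr.acquire("P1", [(64, 64), (1088, 64)])    # adjacent, no overlap
got_p2 = mgr.acquire("P2", [(2048, 64), (1050, 10)])  # second range hits P0
mgr.release("P0")
retry_p2 = mgr.acquire("P2", [(2048, 64), (1050, 10)])
print(got_p0, got_p1, got_p2, retry_p2)
```

Granting the list atomically keeps a process from holding part of a non-contiguous region while waiting for the rest.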
Active Storage System (reconfigurable system)
[Figure: ML310-host connected to the external net and, through a switch, to ML310-board1, ML310-board2, ML310-board3, and ML310-board4]
• Xilinx XC2VP30 Virtex-II Pro family
  – 30,816 logic cells (3424 CLBs)
  – 2 PPC405 embedded cores
  – 2,448 Kb (136 × 18 Kb blocks) BRAM
  – 136 dedicated 18x18 multiplier blocks
• Software:
  – Data mining
  – Encryption
  – Functions and runtime libs
  – Linux micro-kernel
MineBench - data mining benchmark suite