
SDS: A Framework for Scientific Data Services

Bin Dong, Suren Byna, John Wu

Scientific Data Management Group, Lawrence Berkeley National Laboratory

The Scientific Data Services (SDS) Framework

Problem
•  File systems are static
   •  Data stored on current file systems is immutable
   •  Data is written, and often laid out, on file systems by data producers (simulations and experiments)
   •  Consumers of data are responsible for organizing the data for accelerated analysis; this step is often skipped, resulting in poor analysis performance

Solution
•  The Scientific Data Services (SDS) framework
   •  Brings the merits of database management systems to managing scientific data on file systems
   •  Transparent access to data with existing scientific data interfaces (starting with HDF5)
   •  Transparent data reorganization and scientific data querying support

SDS Client Implementation

HDF5 API
•  The HDF5 Virtual Object Layer (VOL) feature allows capturing HDF5 calls
•  Developed a VOL plugin for SDS that captures file open, read, and close functions

SDS Query Interface
•  An interface to perform SQL-style queries on arrays
•  Functional API that can be used from C/C++ applications

Parser
•  Checks the conditions in a query
•  Verifies the validity of file names, etc.

Server Connector
•  Packages a query or HDF5 read call information and sends it to the SDS server
•  Uses protocol buffers for communication
•  MPI rank 0 communicates with the server and then informs the remaining MPI processes

Reader
•  Reads data from the dataset location returned by the SDS server

Post Processor
•  Performs any post-processing needed before copying to the application memory
   •  E.g., decompression, transposition, etc.
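The client-side path described in this section (Parser, Server Connector, Reader, Post Processor) can be sketched roughly as follows. This is a minimal illustration, not the SDS code: the dictionaries `FAKE_SERVER` and `FAKE_STORAGE`, the dataset name, the query grammar, and every function name are assumptions made for the example. The real client is a C library that talks to a separate server process using protocol buffers; here an in-memory lookup stands in for that exchange.

```python
import re
import zlib
from array import array

# Hypothetical stand-in for the SDS server's answer: dataset -> record.
FAKE_SERVER = {
    "/particles/energy": {"location": "energy.sorted", "compressed": True},
}

# Hypothetical stand-in for the parallel file system: location -> raw bytes.
FAKE_STORAGE = {
    "energy.sorted": zlib.compress(array("d", [1.05, 1.2, 1.31, 1.48]).tobytes()),
}

# Assumed query grammar for this sketch: "<name> <op> <number>".
QUERY_RE = re.compile(r"^\s*(\w+)\s*(<=|>=|<|>)\s*(-?\d+(?:\.\d+)?)\s*$")


def parse_query(query):
    """Parser: check that the condition is well formed."""
    match = QUERY_RE.match(query)
    if match is None:
        raise ValueError(f"malformed query: {query!r}")
    name, op, value = match.groups()
    return name, op, float(value)


def ask_server(dataset):
    """Server Connector: find out where the data for `dataset` lives."""
    if dataset not in FAKE_SERVER:
        raise KeyError(f"unknown dataset: {dataset}")
    return FAKE_SERVER[dataset]


def read_and_post_process(record):
    """Reader + Post Processor: fetch the bytes, decompress if needed."""
    raw = FAKE_STORAGE[record["location"]]
    if record["compressed"]:
        raw = zlib.decompress(raw)
    values = array("d")
    values.frombytes(raw)
    return list(values)


def run_query(dataset, query):
    """Full client path: parse, locate, read, post-process, filter."""
    _, op, threshold = parse_query(query)
    values = read_and_post_process(ask_server(dataset))
    tests = {
        ">": lambda v: v > threshold,
        ">=": lambda v: v >= threshold,
        "<": lambda v: v < threshold,
        "<=": lambda v: v <= threshold,
    }
    return [v for v in values if tests[op](v)]


selected = run_query("/particles/energy", "E > 1.3")
```

The point of the sketch is the separation of concerns: the application only supplies a dataset and a condition, while the location lookup and the decompression step stay hidden behind the client.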

[Figure: SDS server architecture. Components: SDS Client, SDS Admin Interface, SDS Server, Request Dispatcher, Query Evaluator, Reorganization Evaluator, Data Organization Recommender, Data Reorganizer, SDS Metadata Manager, Parallel File System (Original File, Reorganized File). Labeled flows: Client Request; Query Request; Query Response; Organization List; Read Statistics; SDS Metadata; Frequently Read File Handle and its SDS Metadata; Attribute of Reorganized File; File Handle and Reorganization Type; Reorganization Job Running; Job Script; Job Results; File Data; Reorganization Request; Periodic Self-start Request; User Reorganization Request.]

SDS Server Implementation

Request Dispatcher
•  Receives requests from SDS clients and from the SDS Admin interface
•  The SDS Admin interface issues reorganization commands
•  Based on the request, the dispatcher passes it on to the Query Evaluator and the Reorganization Evaluator

Query Evaluator
•  Looks up the SDS Metadata to find the available reorganized datasets, and their locations, for a given dataset
•  SDS Metadata
   •  File name, HDF5 dataset info, permissions

Reorganization Evaluator
•  Decides whether to reorganize based on the frequency of read accesses
•  Takes commands from the Admin interface
•  Instructs the Data Reorganizer to create a reorganization job script

Data Organization Recommender
•  Identifies optimal data reorganization
•  Informs the Reorganization Evaluator of the selected strategy

Data Reorganizer
•  Locates reorganization code, such as sorting and indexing algorithms
•  Decides on the number of cores to use
•  Prepares a batch job script
•  Monitors the job execution
•  After the reorganization job is complete, stores the new data location in the SDS Metadata Manager
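The interplay between the Query Evaluator, the Reorganization Evaluator, and the metadata can be illustrated with a small sketch. Everything here is hypothetical: the class and function names, the file names, and in particular the fixed `READ_THRESHOLD`, since the poster says only that the decision is based on the frequency of read accesses and does not give the rule.

```python
from collections import Counter

READ_THRESHOLD = 3  # hypothetical: reads of a file before reorganizing it


class MetadataManager:
    """Maps an original file to the reorganized copies known for it."""

    def __init__(self):
        self._reorganized = {}

    def register(self, original, kind, location):
        """Record that a reorganized copy of `original` exists."""
        self._reorganized.setdefault(original, {})[kind] = location

    def lookup(self, original):
        """Return {reorganization kind: location} for `original`."""
        return dict(self._reorganized.get(original, {}))


class ReorganizationEvaluator:
    """Counts reads per file and flags files that are read often."""

    def __init__(self, metadata, threshold=READ_THRESHOLD):
        self.metadata = metadata
        self.threshold = threshold
        self.reads = Counter()

    def record_read(self, original):
        """Return True when this read should trigger a reorganization job."""
        self.reads[original] += 1
        already_done = bool(self.metadata.lookup(original))
        return self.reads[original] >= self.threshold and not already_done


def query_evaluator(metadata, original):
    """Return the location a client should read: reorganized if available."""
    return metadata.lookup(original).get("sorted", original)


metadata = MetadataManager()
evaluator = ReorganizationEvaluator(metadata)

# Three reads of the same (hypothetical) file; the third crosses the threshold.
decisions = [evaluator.record_read("vpic.h5") for _ in range(3)]

# Pretend the reorganization job finished and stored its result.
metadata.register("vpic.h5", "sorted", "vpic.sorted.h5")
redirected = query_evaluator(metadata, "vpic.h5")
```

After the reorganized copy is registered, later reads are redirected to it and no further job is requested for that file, which mirrors the last step listed for the Data Reorganizer.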

[Figure: system overview. Labels: Network; MPI Process 0 through MPI Process N, each with Application, HDF5 API, SDS Query API, and SDS Client; SDS Server; Database; Batch System; Parallel File System.]

Results

[Figure: query time, full data scan vs. reading reorganized data. X-axis: Query (E>1.10, E>1.15, E>1.20, E>1.25, E>1.30, E>1.35, E>1.40, E>1.45, E>1.50). Y-axis: Time (sec), 0 to 140. Series: Full Data Scan, Read Reorganized Data. Bar labels, in extraction order: 1.2X, 2.3X, 8.1X, 18.9X, 25.7X, 36.3X, 42.0X, 44.8X, 51.2X.]

[Figure: time vs. number of concurrent clients. X-axis: Number of Concurrent Clients (40 to 320). Y-axis: Time (sec), 0 to 1. Series: HDF5 Open+Close, SDS Metadata Read.]
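Why reading reorganized data can beat a full data scan for threshold queries can be illustrated with a toy example. The sketch assumes the reorganization is a sort on the queried variable, which is only one of the options the poster mentions; the data is random and the function names are invented for the example, so it says nothing about the measured numbers.

```python
import random
from bisect import bisect_right


def full_scan(values, threshold):
    """Examine every element; the work grows with the dataset size."""
    return [v for v in values if v > threshold]


def read_reorganized(sorted_values, threshold):
    """Binary-search the sorted copy, then read only the matching tail."""
    start = bisect_right(sorted_values, threshold)
    return sorted_values[start:]


random.seed(0)
energy = [random.uniform(0.0, 1.6) for _ in range(10_000)]
energy_sorted = sorted(energy)  # one-time reorganization cost

hits_scan = full_scan(energy, 1.5)
hits_sorted = read_reorganized(energy_sorted, 1.5)
```

Both functions return the same set of values, but the second touches only the elements that satisfy the condition. The more selective the query, the smaller that tail is relative to the whole dataset.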

[Figure: SDS client data flow. Components: Application, HDF5 API, SDS Query Interface, SDS Client, Parser, Server Connector, Reader, Post Processor, SDS Server, Parallel File System. Labeled flows: File Handle and Read Parameters; File Handle and Query String; File Handle and Final Query String; Original File Metadata or Reorganized File Metadata; Data; Requested Data.]

Implementation with SDS Client library and persistent SDS Server

The authors thank Quincey Koziol and Mohamad Chaarawi from The HDF Group for their support with HDF5 VOL; Homa Karimabadi, William Daughton, and Vadim Roytershteyn for their guidance in understanding the read patterns of VPIC data analysis; and Arie Shoshani and Spyros Blanas for their thoughtful comments about the work. This work is supported in part by the Director, Office of Laboratory Policy and Infrastructure Management of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231, and used resources of the National Energy Research Scientific Computing Center (NERSC).

Contact: Suren Byna ([email protected])

More results available in: Bin Dong, Suren Byna, and John Wu, “Expediting Scientific Data Analysis with Reorganization”, IEEE Cluster 2013
