

SDS: A Framework for Scientific Data Services

Bin Dong, Suren Byna, John Wu

Scientific Data Management Group, Lawrence Berkeley National Laboratory

The Scientific Data Services (SDS) Framework

Problem
•  File systems are static
•  Data stored on current file systems is immutable
•  Data is written and often laid out on file systems by data producers (simulations and experiments)
•  Consumers of data are responsible for organizing the data for accelerated analysis, a step that is often ignored, resulting in poor analysis performance

Solution
•  The Scientific Data Services (SDS) framework
•  Brings the merits of database management systems to managing scientific data on file systems
•  Transparent access to data with existing scientific data interfaces (starting with HDF5)
•  Transparent data reorganization and scientific data querying support

SDS Client Implementation

HDF5 API
•  The HDF5 Virtual Object Layer (VOL) feature allows capturing HDF5 calls
•  Developed a VOL plugin for SDS that captures the file open, read, and close functions

SDS Query Interface
•  An interface to perform SQL-style queries on arrays
•  Functional API that can be used from C/C++ applications

Parser
•  Checks the conditions in a query
•  Verifies the validity of file names, etc.

Server Connector
•  Packages a query or HDF5 read call information and sends it to the SDS Server
•  Uses protocol buffers for communication
•  MPI rank 0 communicates with the server and then informs the remaining MPI processes

Reader
•  Reads data from the dataset location returned by the SDS Server

Post Processor
•  Performs any post-processing needed before copying data to application memory
•  E.g., decompression, transposition, etc.
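The Parser step described for the client can be illustrated with a short sketch. It is not the SDS parser: the one-condition grammar (`<dataset> <operator> <number>`), the function name `parse_condition`, and the `known_datasets` argument are all assumptions made for illustration. It only shows the two checks the poster names, validating the condition and validating the names it refers to.

```python
import re

# Hypothetical grammar for one condition, e.g. "E > 1.10".
# Two-character operators are listed first so "<=" is not read as "<".
CONDITION = re.compile(
    r"^\s*([A-Za-z_]\w*)\s*(<=|>=|==|!=|<|>)\s*([-+]?\d+(?:\.\d+)?)\s*$"
)


def parse_condition(text, known_datasets):
    """Return (dataset, operator, value), or raise ValueError if invalid."""
    match = CONDITION.match(text)
    if match is None:
        raise ValueError(f"malformed condition: {text!r}")
    name, operator, value = match.groups()
    # Name validation: reject datasets the caller does not know about.
    if name not in known_datasets:
        raise ValueError(f"unknown dataset: {name!r}")
    return name, operator, float(value)


if __name__ == "__main__":
    print(parse_condition("E > 1.10", {"E", "x", "y"}))
```

Rejecting a bad query on the client side keeps malformed requests from ever reaching the Server Connector.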

SDS Server Implementation

Figure: SDS Server architecture. Components: SDS Client, SDS Admin Interface, Request Dispatcher, Query Evaluator, Reorganization Evaluator, Data Organization Recommender, Data Reorganizer, SDS Metadata Manager, Parallel File System (Original File, Reorganized File). Arrow labels: Client Request, Query Request, Query Response, Reorganization Request, User Reorganization Request, Periodic Self-start Request, Read Statistics, Organization List, SDS Metadata, Frequently Read File Handle and its SDS Metadata, File Handle and Reorganization Type, Attribute of Reorganized File, Job Script, Reorganization Job Running, Job Results, File Data.

Request Dispatcher
•  Receives requests from SDS clients and from the SDS Admin interface
•  The SDS Admin interface issues reorganization commands
•  Depending on the request, the dispatcher passes it on to the Query Evaluator and the Reorganization Evaluator

Query Evaluator
•  Looks up SDS Metadata to find the available reorganized datasets, and their locations, for a given dataset
•  SDS Metadata: file name, HDF5 dataset info, permissions

Reorganization Evaluator
•  Decides whether to reorganize based on the frequency of read accesses
•  Takes commands from the Admin interface
•  Instructs the Data Reorganizer to create a reorganization job script

Data Organization Recommender
•  Identifies the optimal data reorganization
•  Informs the Reorganization Evaluator of the selected strategy

Data Reorganizer
•  Locates reorganization code, such as sorting and indexing algorithms
•  Decides on the number of cores to use
•  Prepares a batch job script
•  Monitors the job execution
•  After the reorganization job is complete, stores the new data location in the SDS Metadata Manager
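The interplay between the Query Evaluator, the Reorganization Evaluator, and the SDS Metadata Manager can be sketched as follows. This is a minimal sketch under stated assumptions, not the SDS server code: the class names, the in-memory dictionary standing in for the metadata database, and the fixed read-count threshold are all hypothetical. The poster says only that the decision is based on the frequency of read accesses.

```python
from collections import Counter


class MetadataStore:
    """Hypothetical stand-in for the SDS Metadata Manager."""

    def __init__(self):
        # (file name, dataset) -> {organization type: location of the copy}
        self._reorganized = {}

    def add(self, file_name, dataset, organization, location):
        key = (file_name, dataset)
        self._reorganized.setdefault(key, {})[organization] = location

    def lookup(self, file_name, dataset):
        return dict(self._reorganized.get((file_name, dataset), {}))


def evaluate_query(store, file_name, dataset, wanted_organization):
    """Return the reorganized location if one exists, else the original file."""
    copies = store.lookup(file_name, dataset)
    return copies.get(wanted_organization, file_name)


class ReorganizationEvaluator:
    """Flags a dataset for reorganization once it has been read often enough."""

    def __init__(self, threshold):
        self.threshold = threshold  # assumed policy: a simple read count
        self.reads = Counter()

    def record_read(self, file_name, dataset):
        key = (file_name, dataset)
        self.reads[key] += 1
        return self.reads[key] >= self.threshold


if __name__ == "__main__":
    store = MetadataStore()
    evaluator = ReorganizationEvaluator(threshold=3)
    for _ in range(3):
        should_reorganize = evaluator.record_read("particles.h5", "E")
    if should_reorganize:
        # In SDS a batch job would produce this copy; here it is only recorded.
        store.add("particles.h5", "E", "sorted", "particles_sorted.h5")
    print(evaluate_query(store, "particles.h5", "E", "sorted"))
```

Keeping the lookup separate from the decision logic mirrors the split on the poster: queries only read metadata, while reorganization is triggered independently by read statistics or by an admin command.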

Figure: SDS framework overview. Labels: MPI Process 0 through MPI Process N (each with Application, HDF5 API, SDS Query API, and SDS Client), Network, SDS Server, Database, Batch System, Parallel File System.

Results

Figure: Query time for a full data scan versus reading reorganized data. Y-axis: Time (sec), 0 to 140. X-axis: Query (E>1.10, E>1.15, E>1.20, E>1.25, E>1.30, E>1.35, E>1.40, E>1.45, E>1.50). Series: Full Data Scan, Read Reorganized Data. Speedup labels, in the order extracted: 1.2X, 2.3X, 8.1X, 18.9X, 25.7X, 36.3X, 42.0X, 44.8X, 51.2X.

Figure: Time (sec), 0 to 1, versus Number of Concurrent Clients (40 to 320). Series: HDF5 Open+Close, SDS Metadata Read.
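To make the comparison between a full data scan and reading reorganized data concrete, here is a small sketch. It assumes the reorganized copy is sorted by the queried variable, which is one of the reorganizations the poster mentions (sorting, indexing) but not necessarily the one used for these measurements. The data values and function names are made up for illustration, and the sketch makes no claim about the measured speedups.

```python
from bisect import bisect_right


def full_scan(values, threshold):
    """Touch every element: the cost grows with the size of the dataset."""
    return [v for v in values if v > threshold]


def read_reorganized(sorted_values, threshold):
    """Binary-search the cut point, then read only the matching tail."""
    start = bisect_right(sorted_values, threshold)
    return sorted_values[start:]


if __name__ == "__main__":
    energies = [1.05, 1.42, 0.97, 1.21, 1.33, 1.12]  # illustrative values
    reorganized = sorted(energies)                    # done once, ahead of time
    print(full_scan(energies, 1.20))
    print(read_reorganized(reorganized, 1.20))
```

With a sorted copy, a more selective query reads a shorter tail, whereas a full scan always visits every element.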

Figure: SDS Client data flow. Components: Application, HDF5 API, SDS Query Interface, Parser, Server Connector, Reader, Post Processor, SDS Server, Parallel File System. Arrow labels: File Handle and Read Parameters; File Handle and Query String; File Handle and Final Query String; Original File Metadata or Reorganized File Metadata; Data; Requested Data.
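The last stage of the client data flow, the Post Processor, turns what the Reader fetched into the layout the application asked for. The sketch below illustrates the two examples the poster gives, decompression and transposition. The storage format is an assumption: zlib-compressed, little-endian 64-bit floats in column-major order. The function `post_process` and its parameters are hypothetical and not part of SDS.

```python
import struct
import zlib


def post_process(raw, rows, cols, compressed=True, transposed=True):
    """Undo the assumed storage transforms; return a row-major list of rows."""
    payload = zlib.decompress(raw) if compressed else raw
    flat = struct.unpack(f"<{rows * cols}d", payload)
    if transposed:
        # Stored column-major: element (r, c) sits at index c * rows + r.
        return [[flat[c * rows + r] for c in range(cols)] for r in range(rows)]
    return [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]


if __name__ == "__main__":
    # A 2x3 array [[1, 2, 3], [4, 5, 6]] as a reorganized copy might hold it.
    column_major = [1.0, 4.0, 2.0, 5.0, 3.0, 6.0]
    stored = zlib.compress(struct.pack("<6d", *column_major))
    print(post_process(stored, rows=2, cols=3))
```

Doing this step in the client library is what lets the application keep issuing ordinary reads even when the bytes on disk come from a reorganized file.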

Implementation with SDS Client library and persistent SDS Server

The authors thank Quincey Koziol and Mohamad Chaarawi from The HDF5 Group for their support with HDF5 VOL; Homa Karimabadi, William Daughton, and Vadim Roytershteyn for their guidance in understanding the read patterns of VPIC data analysis; and Arie Shoshani and Spyros Blanas for their thoughtful comments about the work. This work is supported in part by the Director, Office of Laboratory Policy and Infrastructure Management of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231, and used resources of the National Energy Research Scientific Computing Center (NERSC).

Contact: Suren Byna ([email protected])

More results available in: Bin Dong, Suren Byna, and John Wu, “Expediting Scientific Data Analysis with Reorganization”, IEEE Cluster 2013