Connecting arbitrary data sources to the grid

Shunde Zhang, Australian Research Collaboration Service (ARCS), eResearch SA, School of Computer Science, University of Adelaide


Post on 14-Jan-2016



Page 1: Connecting arbitrary data sources to the grid

Connecting arbitrary data sources to the grid

Shunde Zhang

Australian Research Collaboration Service (ARCS)

eResearch SA

School of Computer Science, University of Adelaide

Page 2: Connecting arbitrary data sources to the grid

Background

Australian Research Collaboration Service

A successor of APAC

Services
– HPC
– Data
– Collaboration tools: AccessGrid, EVO, Plone, Drupal, Sakai

Page 3: Connecting arbitrary data sources to the grid

ARCS Data Fabric

Page 4: Connecting arbitrary data sources to the grid

ARCS Data Fabric (cont.)

A national service

Provided to all Australian researchers

Based on iRODS

Page 5: Connecting arbitrary data sources to the grid

The Problem

Interoperability with “The Grid”
– “The Grid”: Globus, gLite, Condor, etc.
– Data sources
• GridFTP-compatible: dCache
• Non-GridFTP-compatible: iRODS, SRB

Possible solutions
– “Manual” copy (or do it in a PBS script)
– Copy queue

Page 6: Connecting arbitrary data sources to the grid

The Problem (cont.)

Movement of massive data
– Both ends use the same software (talk the same protocol)
– Different systems are used (talk different protocols)
– Efficiency

Possible solutions
– Transfer via an intermediate point

Page 7: Connecting arbitrary data sources to the grid

A solution - old fashioned

AWS Import/Export for Amazon S3
– Ship the hard disks by courier company

Page 8: Connecting arbitrary data sources to the grid

Our Solution - GridFTP

De facto standard
– Compatible with the Grid, and many grid clients

Efficiency
– Parallel transfer
– Data channel reuse
– Large file transfer - in small blocks

Compatible with many file transfer services
– Monitoring
– Scheduling

Page 9: Connecting arbitrary data sources to the grid

An overview of GridFTP protocol

Based on FTP with extensions

Third-party transfer
– Intermediate point not needed

Security - GSI

Extended block mode
– Parallel transfer
– Striped transfer
– Partial transfer

Reliable and restartable

TCP and UDP
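As a rough illustration of why extended block mode makes parallel and partial transfer possible: each block travels with its own offset and length, so blocks can be sent over any data connection and written in any order. The sketch below is not Griffin's code. The class name `BlockPlanner`, the round-robin assignment, and the sizes used are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: cuts a file into blocks and deals them round-robin to streams. */
final class BlockPlanner {

    /** One block of the file: its starting offset and its length in bytes. */
    static final class Block {
        final long offset;
        final long length;

        Block(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }

        @Override
        public String toString() {
            return "[offset=" + offset + ", length=" + length + "]";
        }
    }

    /** Returns one list of blocks per stream; together they cover the file exactly once. */
    static List<List<Block>> plan(long fileLength, long blockSize, int streams) {
        if (fileLength < 0 || blockSize <= 0 || streams <= 0) {
            throw new IllegalArgumentException(
                    "fileLength must be >= 0; blockSize and streams must be > 0");
        }
        List<List<Block>> perStream = new ArrayList<>();
        for (int i = 0; i < streams; i++) {
            perStream.add(new ArrayList<>());
        }
        int next = 0;
        for (long offset = 0; offset < fileLength; offset += blockSize) {
            // The last block may be shorter than blockSize.
            long length = Math.min(blockSize, fileLength - offset);
            perStream.get(next).add(new Block(offset, length));
            next = (next + 1) % streams;
        }
        return perStream;
    }

    public static void main(String[] args) {
        // A 10-byte "file" in 4-byte blocks over 2 streams.
        List<List<Block>> plan = plan(10, 4, 2);
        for (int i = 0; i < plan.size(); i++) {
            System.out.println("stream " + i + ": " + plan.get(i));
        }
    }
}
```

A partial transfer follows the same idea restricted to a sub-range of offsets. A real server is free to assign blocks to connections differently.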

Page 10: Connecting arbitrary data sources to the grid

The Architecture

GridFTP interface

Generic File System Framework

Data Source Plugin

Data Source

Page 11: Connecting arbitrary data sources to the grid

Generic File System Framework

FileSystem creates FileSystemConnection

FileSystemConnection creates FileObject

FileObject creates RandomAccessFileObject
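The creation chain can be sketched in a few lines. This is a minimal sketch, not the framework itself: the four interfaces are trimmed, hypothetical stand-ins with one method each, the credential is replaced by a plain user name, and `LocalFileSystemSketch` is an invented class backed by the local disk.

```java
import java.io.File;

// Trimmed, hypothetical stand-ins for the four framework types: one method each,
// just enough to show who creates whom. The credential is replaced by a plain
// user name to keep the sketch self-contained.
interface RandomAccessFileObject { long length(); }
interface FileObject { RandomAccessFileObject getRandomAccessFileObject(String type); }
interface FileSystemConnection { FileObject getFileObject(String path); }
interface FileSystem { FileSystemConnection createFileSystemConnection(String user); }

/** Hypothetical local-disk plugin rooted at a directory. */
final class LocalFileSystemSketch implements FileSystem {
    private final File root;

    LocalFileSystemSketch(File root) {
        this.root = root;
    }

    @Override
    public FileSystemConnection createFileSystemConnection(String user) {
        // Each level hands out the next one. Lambdas work here only because
        // every trimmed interface has a single method.
        return path -> {
            File file = new File(root, path);
            return type -> file::length;
        };
    }

    public static void main(String[] args) {
        FileSystem fs = new LocalFileSystemSketch(new File("."));
        FileSystemConnection connection = fs.createFileSystemConnection("demo-user");
        FileObject object = connection.getFileObject("example.dat");
        RandomAccessFileObject data = object.getRandomAccessFileObject("r");
        System.out.println("length = " + data.length());
    }
}
```

The point of the layering is that the GridFTP side only ever talks to these types, so a new data source needs only a new set of implementations.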

Page 12: Connecting arbitrary data sources to the grid

FileSystem interface

public String getSeparator();

public void init() throws IOException;

public FileSystemConnection createFileSystemConnection(GSSCredential credential)
        throws FtpConfigException, IOException;

public void exit();

Page 13: Connecting arbitrary data sources to the grid

FileSystemConnection interface

public FileObject getFileObject(String path);

public String getHomeDir();

public String getUser();

public void close() throws IOException;

public boolean isConnected();

public long getFreeSpace(String path);

Page 14: Connecting arbitrary data sources to the grid

FileObject interface

public String getName();
public String getPath();
public boolean exists();
public boolean isFile();
public boolean isDirectory();
public int getPermission();
public String getCanonicalPath() throws IOException;
public FileObject[] listFiles();
public long length();
public long lastModified();
public RandomAccessFileObject getRandomAccessFileObject(String type) throws IOException;
public boolean delete();
public FileObject getParent();
public boolean mkdir();
public boolean renameTo(FileObject file);
public boolean setLastModified(long t);

Page 15: Connecting arbitrary data sources to the grid

RandomAccessFileObject interface

public void seek(long offset) throws IOException;
public int read() throws IOException;
public int read(byte[] b) throws IOException;
public int read(byte[] b, int off, int len) throws IOException;
public void close() throws IOException;
public String readLine() throws IOException;
public void write(int b) throws IOException;
public void write(byte[] b) throws IOException;
public void write(byte[] b, int off, int len) throws IOException;
public long length() throws IOException;
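The method list closely mirrors `java.io.RandomAccessFile`, so a local-disk adaptor can plausibly be a thin delegate. The sketch below assumes exactly that. The interface is re-declared from the slide so the example compiles on its own, `LocalRandomAccessFileObject` is a hypothetical name, and treating the mode as `"r"` or `"rw"` is an assumption rather than something the slides state.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

/** Re-declared from the slide; the real Griffin interface may differ in detail. */
interface RandomAccessFileObject {
    void seek(long offset) throws IOException;
    int read() throws IOException;
    int read(byte[] b) throws IOException;
    int read(byte[] b, int off, int len) throws IOException;
    void close() throws IOException;
    String readLine() throws IOException;
    void write(int b) throws IOException;
    void write(byte[] b) throws IOException;
    void write(byte[] b, int off, int len) throws IOException;
    long length() throws IOException;
}

/** Hypothetical adaptor: every call is delegated to java.io.RandomAccessFile. */
final class LocalRandomAccessFileObject implements RandomAccessFileObject {
    private final RandomAccessFile raf;

    /** The mode ("r" or "rw") is passed straight through; this is an assumption. */
    LocalRandomAccessFileObject(File file, String mode) throws IOException {
        this.raf = new RandomAccessFile(file, mode);
    }

    @Override public void seek(long offset) throws IOException { raf.seek(offset); }
    @Override public int read() throws IOException { return raf.read(); }
    @Override public int read(byte[] b) throws IOException { return raf.read(b); }
    @Override public int read(byte[] b, int off, int len) throws IOException {
        return raf.read(b, off, len);
    }
    @Override public void close() throws IOException { raf.close(); }
    @Override public String readLine() throws IOException { return raf.readLine(); }
    @Override public void write(int b) throws IOException { raf.write(b); }
    @Override public void write(byte[] b) throws IOException { raf.write(b); }
    @Override public void write(byte[] b, int off, int len) throws IOException {
        raf.write(b, off, len);
    }
    @Override public long length() throws IOException { return raf.length(); }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("griffin-sketch", ".bin");
        tmp.deleteOnExit();
        RandomAccessFileObject file = new LocalRandomAccessFileObject(tmp, "rw");
        file.write("hello grid".getBytes(StandardCharsets.US_ASCII));
        file.seek(6); // position at the start of the word "grid"
        byte[] buffer = new byte[4];
        int n = file.read(buffer);
        System.out.println(new String(buffer, 0, n, StandardCharsets.US_ASCII));
        file.close();
    }
}
```

The `seek` method is what lets a server write blocks that arrive out of order: each block is written at its own offset regardless of arrival order.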

Page 16: Connecting arbitrary data sources to the grid

The Implementation - Griffin

Clients: GridFTP client, grid job submission system, data transfer service

Griffin
– GridFTP interface
– Generic file system framework
– Adaptor for iRODS, adaptor for local file system, other adaptors

Data sources: iRODS, local file system, other data source

Page 17: Connecting arbitrary data sources to the grid

Features

GridFTP protocol version 1

Java-based
– Spring framework
– OS-independent

Lightweight, stand-alone, self-contained
– No need to install Globus Toolkit

Two plugins included
– iRODS plugin
– Local file system plugin

Open source (Apache 2 & GPL)

Page 18: Connecting arbitrary data sources to the grid

Parallel transfer with Griffin

Client to Griffin: WAN

Griffin to Data Source: LAN/localhost

Page 19: Connecting arbitrary data sources to the grid

Authentication

GSI
– iRODS plugin

User mapping – local file system plugin
– XML file
• Maps GSI authentication (certificate DN) to internal user management system
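The DN-to-user mapping could look roughly like the sketch below. The slides do not show the XML layout, so the `<users>`/`<user dn="..." name="..."/>` format, the class name `DnUserMap`, and the sample DN are all hypothetical, chosen only to illustrate the lookup.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical mapping from certificate DN to a local user name. */
final class DnUserMap {
    private final Map<String, String> dnToUser = new HashMap<>();

    /** Parses the assumed layout: <users><user dn="..." name="..."/></users>. */
    static DnUserMap fromXml(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        DnUserMap map = new DnUserMap();
        NodeList users = doc.getElementsByTagName("user");
        for (int i = 0; i < users.getLength(); i++) {
            Element user = (Element) users.item(i);
            map.dnToUser.put(user.getAttribute("dn"), user.getAttribute("name"));
        }
        return map;
    }

    /** Returns the local user for a DN, or null if the DN is not mapped. */
    String lookup(String dn) {
        return dnToUser.get(dn);
    }

    public static void main(String[] args) throws Exception {
        // Made-up DN and user name, for illustration only.
        String xml = "<users>"
                + "<user dn=\"/C=AU/O=Example/CN=Jane Doe\" name=\"jane\"/>"
                + "</users>";
        DnUserMap map = DnUserMap.fromXml(xml);
        System.out.println(map.lookup("/C=AU/O=Example/CN=Jane Doe"));
        System.out.println(map.lookup("/C=AU/O=Example/CN=Unknown"));
    }
}
```

A server using such a map would presumably refuse the login when `lookup` returns null, since an authenticated certificate with no local account has nothing to map to.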

Page 20: Connecting arbitrary data sources to the grid

Use case

Integration of the Grid and Data Fabric
– iRODS plugin for Data Fabric
– Third-party transfer to cluster (Globus GridFTP)

Tested with
– Globus.org
– globus-url-copy (5.0 and 4.x)
– Globus GridFTP GUI

Page 21: Connecting arbitrary data sources to the grid

Performance Evaluation

Server: two quad-core Xeon 3.16GHz CPUs, 16GB memory

Client: IBM xSeries 346 with two hyper-threaded Intel Xeon 3.20GHz CPUs, 4GB memory

Network: 1Gbps LAN

WAN: two 10Gbps links

Transfer: 256MB, 512MB, 1GB, 2GB, 4GB, 8GB, 16GB
– iCommands
– globus-url-copy

Page 22: Connecting arbitrary data sources to the grid

Evaluation Set up - Griffin vs iCommands

Client (globus-url-copy) → Griffin (Jargon Adaptor) → iRODS → Local File System

Client (iCommands) → iRODS → Local File System

Page 23: Connecting arbitrary data sources to the grid

Evaluation Result Chart - Griffin vs iCommands

Page 24: Connecting arbitrary data sources to the grid

Evaluation Set up - Griffin vs Globus GridFTP

Client (globus-url-copy) → Griffin (Local FS Adaptor) → Local File System

Client (globus-url-copy) → Globus GridFTP server → Local File System

Page 25: Connecting arbitrary data sources to the grid

Evaluation Result Chart - Griffin vs Globus GridFTP

Page 26: Connecting arbitrary data sources to the grid

Related work

Client library
– SAGA/jSAGA
– Commons-vfs

Data transfer service
– Stork
– PAFTP

Globus
– XIO
– DSI

Page 27: Connecting arbitrary data sources to the grid

Griffin vs. Globus GridFTP

Griffin               Globus GridFTP
Java                  C
OS-independent        *nix
Simple, standalone    Complex

Page 28: Connecting arbitrary data sources to the grid

Conclusion

A generic solution to connect arbitrary data sources to the grid
– Data in/out of the grid
– Data transfer between different data sources

Java-based implementation
– Standalone, lightweight
– Pluggable
– Does not depend on Globus

Page 29: Connecting arbitrary data sources to the grid

Future work

Currently working on a plugin for MongoDB

Java NIO

UDP

Striped transfer

Page 30: Connecting arbitrary data sources to the grid

MongoDB plugin

MongoDB
– NoSQL database
– Stores JSON-style documents
– GridFS component
• Stores files

Plugin for Griffin
– Read/write files via GridFS

Page 31: Connecting arbitrary data sources to the grid

Acknowledgements

ARCS funded

Page 32: Connecting arbitrary data sources to the grid

Current Status

ARCS production service

Used to transfer data in/out of ARCS Data Fabric

Website
– https://projects.arcs.org.au/trac/griffin

Page 33: Connecting arbitrary data sources to the grid

Thank you!

Questions/Comments?