

November 15th, 2010, MTAGS 2010, New Orleans, USA

Easy and Instantaneous Processing for Data-Intensive Workflows
Nan Dun, Kenjiro Taura, and Akinori Yonezawa
Graduate School of Information Science and Technology, The University of Tokyo
Contact email: dunnan@yl.is.s.u-tokyo.ac.jp

Background

✤ More computing resources

✤ Desktops, clusters, clouds, and supercomputers

✤ More data-intensive applications

✤ Astronomy, bio-genome, medical science, etc.

✤ More domain researchers using distributed computing

✤ Know their workflows well, but have little systems knowledge


Motivation


Domain researchers:

• Able to use resources provided by universities, institutions, etc.

• Know their apps well

• Know little about systems, especially distributed systems

System people:

• Only able to control the resources they administrate

• Know systems well

• Know little about domain apps

Domain researcher: "Would you please help me run my applications?"

System person: "OK, but teach me about your applications first."

Our answer: "Actually, you can do it by yourself, on any machine!"

Outline

✤ Brief description of our processing framework

✤ GXP parallel/distributed shell

✤ GMount distributed file system

✤ GXP Make workflow engine

✤ Experiments

✤ Practices on clusters and a supercomputer

✤ From the viewpoint of the underlying data sharing, since our target is data-intensive!


Usage: How simple it is!

1. Write the workflow description in a makefile (a sample makefile sketch follows these steps)

2. Resource exploration (Start from one single node!)

• $ gxpc use ssh clusterA clusterB
  $ gxpc explore clusterA[[000-020]] clusterB[[100-200]]

3. Deploy distributed file system

• $ gmnt /export/on/each/node /file/system/mountpoint

4. Run workflow

• $ gxpc make -f makefile -j N_parallelism
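
For step 1, a minimal makefile might look like the sketch below. The input names, output names, and the analyze command are hypothetical placeholders, not taken from the talk; only the mount point path echoes step 3.

# Hypothetical workflow makefile (illustrative names only).
# GXP Make accepts ordinary GNU Make syntax and dispatches each recipe
# to the nodes gathered by "gxpc explore".
INPUTS  := $(wildcard /file/system/mountpoint/input/*.dat)
OUTPUTS := $(INPUTS:.dat=.out)

all: $(OUTPUTS)

# One independent task per input file; tasks run in parallel under -j.
%.out: %.dat
	analyze $< > $@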

GXP Parallel/Distributed Shell

✤ GXP shell magic

✤ Install and start from one single node

✤ Implemented in Python, no compilation

✤ Support various login channels, e.g. SSH, RSH, TORQUE

✤ Efficiently issues commands and invokes processes on many nodes in parallel

[Figure: "$ gxpc e ls" fans the ls command out to many nodes in parallel over SSH, RSH, and TORQUE channels]

GMount Distributed File System

✤ Building block: SSHFS-MUX

✤ Mount multiple remote directories onto one local directory

✤ SFTP protocol over SSH/Socket channel

✤ Parallel mount

✤ Use the GXP shell to execute the mount on every node

[Figure: "$ gmnt" runs SSHFS-MUX on nodes A, B, C, and D; node A mounts B, C, and D ("sshfsm B C D") while B, C, and D each mount A ("sshfsm A")]

GMount (Cont.)


✤ GMount features esp. for wide-area environments

✤ No centralized servers

✤ Locality-aware file lookup

✤ Efficient when application has access locality

✤ New files are created locally

[Figure: clusters A, B, C, and D, shared by GMount]

GXP Make Workflow Engine

✤ Fully compatible with GNU Make

✤ Straightforward to write data-oriented applications

✤ Integrated in GXP shell

✤ Practical dispatching throughput in wide-area environments

✤ 62 tasks/sec in InTrigger vs. 56 tasks/sec by Swift+Falkon in TeraGrid

[Figure: "$ gxp make" dispatches jobs a.job, b.job, c.job, and d.job to nodes A, B, C, and D; the depicted makefile rule is "out.dat: in.dat" with recipe lines running each job]
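
Laid out as an ordinary makefile, the rule drawn in the figure would read roughly as follows; the recipe layout is a reconstruction from the slide, not a verbatim copy.

# Reconstruction of the rule shown on the slide (layout assumed):
# out.dat is built from in.dat by running the four jobs.
out.dat: in.dat
	run a.job
	run b.job
	run c.job
	run d.job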

GXP Make (Cont.)

✤ Why is Make good?

✤ Straightforward to write data-oriented applications

✤ Expressive: embarrassingly parallel, MapReduce, etc. (see the sketch after this list)

✤ Fault tolerance: continue from the failure point

✤ Easy to debug: “make -n” option

✤ Concurrency control: “make -j” option

✤ Widely used and thus easy to learn
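
For instance, a MapReduce-style workflow fits naturally into make rules. The sketch below is illustrative only; the chunk layout and the map and reduce commands are hypothetical, not tools from the talk.

# Hypothetical MapReduce-style workflow in plain GNU Make syntax.
# Map steps are independent (embarrassingly parallel under "make -j N"),
# and "make -n" previews the whole task graph without running anything.
CHUNKS := $(wildcard data/chunk*.txt)
MAPPED := $(patsubst data/%.txt,out/%.mapped,$(CHUNKS))

result.txt: $(MAPPED)
	reduce $(MAPPED) > $@

out/%.mapped: data/%.txt
	@mkdir -p out
	map $< > $@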

Evaluation


✤ Experimental environments

✤ InTrigger multi-cluster platform

✤ 16 clusters, 400 nodes, 1600 CPU cores

✤ Connected by heterogeneous wide-area links

✤ HA8000 cluster system

✤ 512 nodes, 8192 cores

✤ Tightly coupled network

Evaluation


✤ Benchmark

✤ ParaMark

✤ Parallel metadata I/O benchmark

✤ Real-world application

✤ Event recognition from PubMed database

Pipeline: Medline XML -> Text Extraction -> Protein Name Recognizer -> Enju Parser and Sagae's Dependency Parser -> Event Recognizer -> Event Structure (a makefile sketch of this pipeline follows)
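
As an illustration only, the stages of this pipeline map onto chained make rules. Every command name below (extract_text, recognize_proteins, enju, sagae_parse, recognize_events) is a hypothetical stand-in for the actual tools.

# Hypothetical per-article encoding of the event-recognition pipeline.
# Independent articles, and the two parsers, can run in parallel under GXP Make.
%.txt: %.xml
	extract_text $< > $@

%.prot: %.txt
	recognize_proteins $< > $@

%.enju: %.txt
	enju < $< > $@

%.dep: %.txt
	sagae_parse $< > $@

%.events: %.prot %.enju %.dep
	recognize_events $^ > $@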

Task Characteristics

[Figure: task characteristics; processing time (sec) and input file size (KB) for each input file]

Experiments (Cont.)


✤ Comparison with two other data-sharing approaches

[Figure: the two alternative setups; the Gfarm distributed file system (metadata server and data storage servers with clients across sites) and an SSHFS-MUX all-to-one mount onto the NFS RAID at the master site]

Summary of Data Sharing

Operations          | NFS            | SSHFS-MUX All-to-One | Gfarm            | GMount
Metadata operations | Central server | Central server       | Metadata servers | Locality-aware
I/O operations      | Central server | Central server       | Data servers     | Data servers

Transfer Rate in WAN

[Figure: transfer rate in WAN; throughput (MB/sec) vs. block size (4 KB to 16 MB) for Gfarm, SSHFSM (direct), SSHFS, and iperf]

Workflow in LAN

✤ Single cluster using NFS and SSHFSM All-to-One, 158 tasks, 72 workers

[Figure: workflow in LAN; parallelism over execution time (sec) for NFS and SSHFSM All-to-One]

Workflow in WAN

✤ 11 clusters using SSHFSM All-to-One, 821 tasks, 584 workers

[Figure: workflow in WAN; parallelism over execution time (sec) for all jobs and for long jobs]

Long jobs dominate the execution time

GMount vs. Gfarm

[Figure: aggregate metadata performance; operations/sec vs. number of concurrent clients (2 to 16) for Gfarm in LAN, Gfarm in WAN, and GMount in WAN]

[Figure: aggregate I/O performance; MB/sec vs. number of concurrent clients (2 to 16) for Gfarm read/write and GMount read/write]

GMount vs. Gfarm (Cont.)

✤ 4 clusters using Gfarm and GMount, 159 tasks, 252 workers

[Figure: parallelism over execution time (sec) for GMount and Gfarm; annotations: "15% speed up" and "Many new files are created"]

GMount vs. Gfarm (Cont.)

[Figure: elapsed time (seconds) of "create" jobs on GMount vs. on Gfarm; these small jobs only create new empty files in the "/" directory]


On Supercomputer

[Figure: setup on the HA8000 supercomputer; ha8000 nodes on a Lustre FS connect through gateways to the cloko and hongo clusters, with the GXP Make master running in the hongo cluster and data shared via sshfs mounts]

On Supercomputer (Cont.)

[Figure: progress of the workflow run on the HA8000 supercomputer]

• Allocated a 6-hour time slot

• External workers were appended

Conclusion

✤ GXP shell + GMount + GXP Make

✤ Simplicity: no need for a stack of middleware; wide compatibility

✤ Ease of use: effortless and rapid to build

✤ User-level: no privileges required; usable by any user

✤ Adaptability: uniform interface for clusters, clouds, and supercomputers

✤ Scalability: scales to hundreds of nodes

✤ Performance: high throughput in wide-area environments

Future Work

✤ Improve GMount for better create performance

✤ Improve GXP Make with better and smarter scheduling

✤ Better user interface: from configuration to workflow execution

✤ Further reduce installation cost

✤ Implement SSHFS-MUX in Python


Open Source Software

✤ SSHFS-MUX/GMount

✤ http://sshfsmux.googlecode.com/

✤ GXP parallel/distributed shell

✤ http://gxp.sourceforge.net/

✤ ParaMark

✤ http://paramark.googlecode.com/


Questions?
