TRANSCRIPT
November 15th, 2010 MTAGS 2010 New Orleans, USA
Easy and Instantaneous Processing for Data-Intensive Workflows
Nan Dun, Kenjiro Taura, and Akinori Yonezawa
Graduate School of Information Science and Technology
The University of Tokyo
Contact Email: [email protected]
Background
✤ More computing resources
✤ Desktops, clusters, clouds, and supercomputers
✤ More data-intensive applications
✤ Astronomy, bio-genome, medical science, etc.
✤ More domain researchers using distributed computing
✤ Know their workflows well, but have little systems knowledge
Motivation
Domain Researchers:
• Able to use resources provided by universities, institutions, etc.
• Know their apps well
• Know little about systems, esp. distributed systems

System People:
• Only able to control the resources they administrate
• Know the system well
• Know little about domain apps

Domain researcher: "Would you please help me run my applications?"
System person: "OK, but teach me about your applications first."
Our answer: "Actually, you can do it by yourself, on any machines!"
Outline
✤ Brief description of our processing framework
✤ GXP parallel/distributed shell
✤ GMount distributed file system
✤ GXP Make workflow engine
✤ Experiments
✤ Practice on clusters and a supercomputer
✤ From the viewpoint of the underlying data sharing, since our target is data-intensive!
Usage: How simple it is!
1. Write the workflow description in a makefile
2. Resource exploration (start from one single node!)
• $ gxpc use ssh clusterA clusterB
• $ gxpc explore clusterA[[000-020]] clusterB[[100-200]]
3. Deploy the distributed file system
• $ gmnt /export/on/each/node /file/system/mountpoint
4. Run the workflow
• $ gxpc make -f makefile -j N_parallelism
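Step 1, the workflow description, is ordinary GNU Make. A minimal sketch of what such a makefile could look like (hypothetical file and task names; locally it runs with plain make, on the cluster with gxpc make):

```shell
# Sketch of a workflow makefile (hypothetical files). Each rule is one
# task; the file dependencies form the workflow DAG.
cat > workflow.mk <<'EOF'
all: out.dat
# Final task: merge the two intermediate results.
out.dat: a.out b.out ; cat a.out b.out > out.dat
# Two independent tasks; GXP Make can dispatch them to different nodes.
a.out: in.dat ; tr 'a-z' 'A-Z' < in.dat > a.out
b.out: in.dat ; wc -l < in.dat > b.out
EOF
echo "hello" > in.dat
# Locally this is plain make; on the cluster: gxpc make -f workflow.mk -j 16
make -f workflow.mk
cat out.dat
```

Because the description is just Make rules, nothing changes between testing it on a laptop and running it across clusters.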
GXP Parallel/Distributed Shell
✤ GXP shell magic
✤ Install and start from one single node
✤ Implemented in Python, no compilation needed
✤ Supports various login channels, e.g. SSH, RSH, TORQUE
✤ Efficiently issues commands and invokes processes on many nodes in parallel
[Figure: "$ gxpc e ls" fans the "ls" command out to all nodes in parallel over SSH, RSH, and TORQUE channels]
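Conceptually, "gxpc e cmd" runs the same command on every node in parallel and collects the output. A local stand-in sketch, where background jobs replace the real SSH/RSH/TORQUE channels and the node names are made up:

```shell
# Local stand-in for "gxpc e <cmd>": launch the command once per node in
# parallel, tag each result with the node name, and wait for all of them.
rm -f fanout.txt
for node in node000 node001 node002; do
    # With a real channel this line would be: ssh "$node" uname -s
    ( echo "[$node] $(uname -s)" >> fanout.txt ) &
done
wait
cat fanout.txt
```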
GMount Distributed File System
✤ Building block: SSHFS-MUX
✤ Mounts multiple remote directories onto one local directory
✤ SFTP protocol over an SSH/socket channel
✤ Parallel mount
✤ Uses the GXP shell to execute the mount on every node
[Figure: "$ gmnt" uses GXP to run sshfsm on every node: node A mounts B, C, and D ("sshfsm B C D"), while B, C, and D each mount A ("sshfsm A")]
GMount (Cont.)
✤ GMount features, esp. for wide-area environments
✤ No centralized servers
✤ Locality-aware file lookup
✤ Efficient when the application has access locality
✤ New files are created locally
[Figure: clusters A, B, C, and D shared by GMount]
GXP Make Workflow Engine
✤ Fully compatible with GNU Make
✤ Straightforward to write data-oriented applications
✤ Integrated in GXP shell
✤ Practical dispatching throughput in wide-area environments
✤ 62 tasks/sec in InTrigger vs. 56 tasks/sec by Swift+Falkon in TeraGrid
[Figure: "$ gxpc make" runs the rule "out.dat: in.dat" whose recipe runs a.job, b.job, c.job, and d.job, dispatched across nodes A, B, C, and D]
GXP Make (Cont.)
✤ Why is Make good?
✤ Straightforward for writing data-oriented applications
✤ Expressive: embarrassingly parallel, MapReduce, etc.
✤ Fault tolerance: continues from the failure point
✤ Easy to debug: the "make -n" option
✤ Concurrency control: the "make -j" option
✤ Widely used and thus easy to learn
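The points above can be tried with any GNU Make. A small embarrassingly parallel sketch (hypothetical file names): one pattern rule processes each input independently, "-n" previews the tasks without running them, "-j" sets the concurrency, and re-running after a failure skips tasks that already finished:

```shell
# Embarrassingly parallel sketch: one independent task per input file.
cat > par.mk <<'EOF'
INPUTS := $(wildcard part*.in)
all: $(INPUTS:.in=.out)
%.out: %.in ; tr 'a-z' 'A-Z' < $< > $@
EOF
printf 'x\n' > part1.in
printf 'y\n' > part2.in
printf 'z\n' > part3.in

make -n -f par.mk      # debug: list the commands without executing them
make -j 3 -f par.mk    # run the three tasks concurrently
make -j 3 -f par.mk    # fault tolerance: up-to-date tasks are skipped
```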
Evaluation
✤ Experimental environments
✤ InTrigger multi-cluster platform
✤ 16 clusters, 400 nodes, 1600 CPU cores
✤ Connected by heterogeneous wide-area links
✤ HA8000 cluster system
✤ 512 nodes, 8192 cores
✤ Tightly coupled network
Evaluation (Cont.)
✤ Benchmark
✤ ParaMark
✤ Parallel metadata I/O benchmark
✤ Real-world application
✤ Event recognition from PubMed database
[Figure: workflow pipeline: Medline XML → Text Extraction → Protein Name Recognizer → Enju Parser / Sagae's Dependency Parser → Event Recognizer → Event Structure]
Task Characteristics
[Figure: per-input-file processing time (sec, 0 to 3,000) and file size (KB) across ~90 input files]
Experiments (Cont.)
✤ Comparison with two other data-sharing approaches
[Figure: left, SSHFS-MUX all-to-one mount, where all worker nodes (N) mount a single master node (M) backed by an NFS RAID; right, the Gfarm distributed file system, with a metadata server (MDS) at the master site and data storage servers (DSS) with clients at each site]
Summary of Data Sharing
Operations    | NFS            | SSHFS-MUX All-to-One | Gfarm             | GMount
Metadata I/O  | Central server | Central server       | Metadata servers  | Locality-aware
Data I/O      | Central server | Central server       | Data servers      | Data servers
Transfer Rate in WAN
[Figure: throughput (MB/sec, 0 to 20) vs. block size (4 KB to 16 MB) for Gfarm, SSHFSM (direct), SSHFS, and iperf]
Workflow in LAN
✤ Single cluster using NFS and SSHFSM All-to-One, 158 tasks, 72 workers
[Figure: parallelism (0 to 85) over execution time (sec) for NFS and SSHFSM All-to-One]
Workflow in WAN
✤ 11 clusters using SSHFSM All-to-One, 821 tasks, 584 workers
[Figure: parallelism (0 to 600) over execution time (sec), all jobs vs. long jobs]
Long jobs dominate the execution time
GMount vs. Gfarm
Aggregate Metadata Performance
[Figure: metadata ops/sec (log scale, 1 to 10^5) vs. number of concurrent clients (2, 4, 8, 16) for Gfarm in LAN, Gfarm in WAN, and GMount in WAN]
Aggregate I/O Performance
[Figure: aggregate MB/sec (log scale) vs. number of concurrent clients (2, 4, 8, 16) for Gfarm read/write and GMount read/write]
GMount vs. Gfarm (Cont.)
✤ 4 clusters using Gfarm and GMount, 159 tasks, 252 workers
[Figure: parallelism (0 to 175) over execution time (sec) for GMount and Gfarm]
15% speedup
Many new files are created
GMount vs. Gfarm (Cont.)
✤ Small jobs that only create new empty files in the "/" directory
[Figure: elapsed time (sec, log scale) of "create" jobs on GMount vs. on Gfarm]
On Supercomputer
[Figure: HA8000 compute nodes backed by a Lustre FS, reached through gateways; the GXP Make master in the hongo cluster links the cloko and hongo clusters to HA8000 via an sshfs mount]
On Supercomputer (Cont.)
[Figure: parallelism (0 to ~9,000) over execution time (0 to ~50,000 sec)]
• Allocated a 6-hour time slot
• External workers were appended
Conclusion
✤ GXP shell + GMount + GXP Make
✤ Simplicity: no stack of middleware needed, wide compatibility
✤ Easiness: built effortlessly and rapidly
✤ User-level: no privileges required, usable by any user
✤ Adaptability: uniform interface for clusters, clouds, and supercomputers
✤ Scalability: scales to hundreds of nodes
✤ Performance: high throughput in wide-area environments
Future Work
✤ Improve GMount for better create performance
✤ Improve GXP Make for better, smarter scheduling
✤ Better user interface: configuration ---> workflow execution
✤ Further reduce installation cost
✤ Implement SSHFS-MUX in Python
Open Source Software
✤ SSHFS-MUX/GMount
✤ http://sshfsmux.googlecode.com/
✤ GXP parallel/distributed shell
✤ http://gxp.sourceforge.net/
✤ ParaMark
✤ http://paramark.googlecode.com/
Questions?