TRANSCRIPT
November 15th, 2010 MTAGS 2010 New Orleans, USA
Easy and Instantaneous Processing for Data-Intensive Workflows
Nan Dun, Kenjiro Taura, and Akinori Yonezawa
Graduate School of Information Science and Technology
The University of Tokyo
Contact Email: [email protected]
Background
✤ More computing resources
✤ Desktops, clusters, clouds, and supercomputers
✤ More data-intensive applications
✤ Astronomy, bio-genome, medical science, etc.
✤ More domain researchers using distributed computing
✤ Know their workflows well, but have little systems knowledge
Motivation
Domain Researchers:
• Able to use resources provided by universities, institutions, etc.
• Know their apps well
• Know little about systems, esp. distributed systems

System People:
• Only able to control the resources they administrate
• Know the system well
• Know little about domain apps

Domain researcher: "Would you please help me run my applications?"
System person: "OK, but teach me about your applications first."
Our answer: "Actually, you can do it by yourself, on any machines!"
Outline
✤ Brief description of our processing framework
✤ GXP parallel/distributed shell
✤ GMount distributed file system
✤ GXP Make workflow engine
✤ Experiments
✤ Practice on clusters and a supercomputer
✤ From the viewpoint of the underlying data sharing, since our target is data-intensive!
Usage: How simple it is!
1. Write the workflow description in a makefile
2. Resource exploration (start from one single node!)
• $ gxpc use ssh clusterA clusterB
• $ gxpc explore clusterA[[000-020]] clusterB[[100-200]]
3. Deploy the distributed file system
• $ gmnt /export/on/each/node /file/system/mountpoint
4. Run the workflow
• $ gxpc make -f makefile -j N_parallelism
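Step 1, the workflow description, is ordinary GNU Make. A minimal sketch of what such a makefile could look like (hypothetical file and task names; locally it runs with plain make, on the cluster with gxpc make):

```shell
# Sketch of a workflow makefile (hypothetical files). Each rule is one
# task; the file dependencies form the workflow DAG.
cat > workflow.mk <<'EOF'
all: out.dat
# Final task: merge the two intermediate results.
out.dat: a.out b.out ; cat a.out b.out > out.dat
# Two independent tasks; GXP Make can dispatch them to different nodes.
a.out: in.dat ; tr 'a-z' 'A-Z' < in.dat > a.out
b.out: in.dat ; wc -l < in.dat > b.out
EOF
echo "hello" > in.dat
# Locally this is plain make; on the cluster: gxpc make -f workflow.mk -j 16
make -f workflow.mk
cat out.dat
```

Because the description is just Make rules, nothing changes between testing it on a laptop and running it across clusters.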
GXP Parallel/Distributed Shell
✤ GXP shell magic
✤ Install and start from one single node
✤ Implemented in Python, no compilation needed
✤ Supports various login channels, e.g. SSH, RSH, TORQUE
✤ Efficiently issues commands and invokes processes on many nodes in parallel
[Figure: "$ gxpc e ls" fans the "ls" command out to all nodes in parallel over SSH, RSH, and TORQUE channels]
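Conceptually, "gxpc e cmd" runs the same command on every node in parallel and collects the output. A local stand-in sketch, where background jobs replace the real SSH/RSH/TORQUE channels and the node names are made up:

```shell
# Local stand-in for "gxpc e <cmd>": launch the command once per node in
# parallel, tag each result with the node name, and wait for all of them.
rm -f fanout.txt
for node in node000 node001 node002; do
    # With a real channel this line would be: ssh "$node" uname -s
    ( echo "[$node] $(uname -s)" >> fanout.txt ) &
done
wait
cat fanout.txt
```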
GMount Distributed File System
✤ Building block: SSHFS-MUX
✤ Mounts multiple remote directories onto one local directory
✤ SFTP protocol over an SSH/socket channel
✤ Parallel mount
✤ Uses the GXP shell to execute the mount on every node
[Figure: "$ gmnt" uses GXP to run sshfsm on every node: node A mounts B, C, and D ("sshfsm B C D"), while B, C, and D each mount A ("sshfsm A")]
GMount (Cont.)
✤ GMount features, esp. for wide-area environments
✤ No centralized servers
✤ Locality-aware file lookup
✤ Efficient when the application has access locality
✤ New files are created locally
[Figure: clusters A, B, C, and D shared by GMount]
GXP Make Workflow Engine
✤ Fully compatible with GNU Make
✤ Straightforward to write data-oriented applications
✤ Integrated in GXP shell
✤ Practical dispatching throughput in wide-area environments
✤ 62 tasks/sec in InTrigger vs. 56 tasks/sec by Swift+Falkon in TeraGrid
[Figure: "$ gxpc make" runs the rule "out.dat: in.dat" whose recipe runs a.job, b.job, c.job, and d.job, dispatched across nodes A, B, C, and D]
GXP Make (Cont.)
✤ Why is Make good?
✤ Straightforward for writing data-oriented applications
✤ Expressive: embarrassingly parallel, MapReduce, etc.
✤ Fault tolerance: continues from the failure point
✤ Easy to debug: the "make -n" option
✤ Concurrency control: the "make -j" option
✤ Widely used and thus easy to learn
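The points above can be tried with any GNU Make. A small embarrassingly parallel sketch (hypothetical file names): one pattern rule processes each input independently, "-n" previews the tasks without running them, "-j" sets the concurrency, and re-running after a failure skips tasks that already finished:

```shell
# Embarrassingly parallel sketch: one independent task per input file.
cat > par.mk <<'EOF'
INPUTS := $(wildcard part*.in)
all: $(INPUTS:.in=.out)
%.out: %.in ; tr 'a-z' 'A-Z' < $< > $@
EOF
printf 'x\n' > part1.in
printf 'y\n' > part2.in
printf 'z\n' > part3.in

make -n -f par.mk      # debug: list the commands without executing them
make -j 3 -f par.mk    # run the three tasks concurrently
make -j 3 -f par.mk    # fault tolerance: up-to-date tasks are skipped
```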
Evaluation
✤ Experimental environments
✤ InTrigger multi-cluster platform
✤ 16 clusters, 400 nodes, 1600 CPU cores
✤ Connected by heterogeneous wide-area links
✤ HA8000 cluster system
✤ 512 nodes, 8192 cores
✤ Tightly coupled network
Evaluation (Cont.)
✤ Benchmark
✤ ParaMark
✤ Parallel metadata I/O benchmark
✤ Real-world application
✤ Event recognition from PubMed database
[Figure: workflow pipeline: Medline XML → Text Extraction → Protein Name Recognizer → Enju Parser / Sagae's Dependency Parser → Event Recognizer → Event Structure]
Task Characteristics
[Figure: per-input-file processing time (sec, 0 to 3,000) and file size (KB) across ~90 input files]
Experiments (Cont.)
✤ Comparison with two other data-sharing approaches
[Figure: left, SSHFS-MUX all-to-one mount, where all worker nodes (N) mount a single master node (M) backed by an NFS RAID; right, the Gfarm distributed file system, with a metadata server (MDS) at the master site and data storage servers (DSS) with clients at each site]
Summary of Data Sharing
Operations    | NFS            | SSHFS-MUX All-to-One | Gfarm             | GMount
Metadata I/O  | Central server | Central server       | Metadata servers  | Locality-aware
Data I/O      | Central server | Central server       | Data servers      | Data servers
Transfer Rate in WAN
[Figure: throughput (MB/sec, 0 to 20) vs. block size (4 KB to 16 MB) for Gfarm, SSHFSM (direct), SSHFS, and iperf]
Workflow in LAN
✤ Single cluster using NFS and SSHFSM All-to-One, 158 tasks, 72 workers
[Figure: parallelism (0 to 85) over execution time (sec) for NFS and SSHFSM All-to-One]
Workflow in WAN
✤ 11 clusters using SSHFSM All-to-One, 821 tasks, 584 workers
[Figure: parallelism (0 to 600) over execution time (sec), all jobs vs. long jobs]
Long jobs dominate the execution time
GMount vs. Gfarm
Aggregate Metadata Performance
[Figure: metadata ops/sec (log scale, 1 to 10^5) vs. number of concurrent clients (2, 4, 8, 16) for Gfarm in LAN, Gfarm in WAN, and GMount in WAN]
Aggregate I/O Performance
[Figure: aggregate MB/sec (log scale) vs. number of concurrent clients (2, 4, 8, 16) for Gfarm read/write and GMount read/write]
GMount vs. Gfarm (Cont.)
✤ 4 clusters using Gfarm and GMount, 159 tasks, 252 workers
[Figure: parallelism (0 to 175) over execution time (sec) for GMount and Gfarm]
15% speedup
Many new files are created
GMount vs. Gfarm (Cont.)
✤ Small jobs that only create new empty files in the "/" directory
[Figure: elapsed time (sec, log scale) of "create" jobs on GMount vs. on Gfarm]
On Supercomputer
[Figure: HA8000 compute nodes backed by a Lustre FS, reached through gateways; the GXP Make master in the hongo cluster links the cloko and hongo clusters to HA8000 via an sshfs mount]
On Supercomputer (Cont.)
[Figure: parallelism (0 to ~9,000) over execution time (0 to ~50,000 sec)]
• Allocated a 6-hour time slot
• External workers were appended
Conclusion
✤ GXP shell + GMount + GXP Make
✤ Simplicity: no stack of middleware needed, wide compatibility
✤ Easiness: built effortlessly and rapidly
✤ User-level: no privileges required, usable by any user
✤ Adaptability: uniform interface for clusters, clouds, and supercomputers
✤ Scalability: scales to hundreds of nodes
✤ Performance: high throughput in wide-area environments
Future Work
✤ Improve GMount for better create performance
✤ Improve GXP Make for better, smarter scheduling
✤ Better user interface: configuration ---> workflow execution
✤ Further reduce installation cost
✤ Implement SSHFS-MUX in Python
Open Source Software
✤ SSHFS-MUX/GMount
✤ http://sshfsmux.googlecode.com/
✤ GXP parallel/distributed shell
✤ http://gxp.sourceforge.net/
✤ ParaMark
✤ http://paramark.googlecode.com/
Questions?