June 6, 2007, TeraGrid '07

Clustering the Reliable File Transfer Service

Jim Basney and Patrick Duda
NCSA, University of Illinois

This material is based upon work supported by the National Science Foundation under Grant No. 0426972.


Goal

• Provide a highly available Reliable File Transfer (RFT) Service
  – Tolerate server failures
    • Hardware/software faults and resource exhaustion
  – Continue to handle incoming requests
  – Continue to make forward progress on file transfers in the queue


Globus Toolkit Reliable File Transfer Service

[Diagram labels: RFT, Client, GridFTP, GridFTP]


RFT and GridFTP Clustering

[Diagram labels: RFT (three instances), GridFTP control (two instances), GridFTP data (four instances)]


Clustering Approach

[Diagram labels: Load Balancer, RFT (three instances), HA DBMS]


RFT State Management

[Diagram labels: Web Service Container, RFT, Delegation Service, Client, DBMS]


RFT DB Tables

• Request: ID, Termination Time, Started Flag, Max Attempts, Delegated EPR, Container ID, Start Time
• Transfer: ID, Request ID, Source URL, Destination URL, Status, Attempts, Retry Time
• Restart: Transfer ID, Restart Marker, Last Update Time

[Slide legend: Added Fields]
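A minimal sketch of how the three tables listed on this slide could be declared. Only the field names come from the slide. The assignment of fields to tables, the column types, the snake_case names, and the foreign keys are assumptions, and Python's built-in sqlite3 module stands in for the MySQL server used in the evaluation so that the example is self-contained.

```python
import sqlite3

# Field names follow the slide; types, keys, and the grouping of fields
# into tables are assumptions made for illustration only.
SCHEMA = """
CREATE TABLE request (
    id               INTEGER PRIMARY KEY,
    termination_time REAL,
    started_flag     INTEGER,
    max_attempts     INTEGER,
    delegated_epr    TEXT,
    container_id     TEXT,
    start_time       REAL
);
CREATE TABLE transfer (
    id              INTEGER PRIMARY KEY,
    request_id      INTEGER REFERENCES request(id),
    source_url      TEXT,
    destination_url TEXT,
    status          TEXT,
    attempts        INTEGER,
    retry_time      REAL
);
CREATE TABLE restart (
    transfer_id      INTEGER REFERENCES transfer(id),
    restart_marker   TEXT,
    last_update_time REAL
);
"""


def create_schema(conn):
    """Create the three illustrative tables on an open connection."""
    conn.executescript(SCHEMA)


def table_columns(conn, table):
    """Return the column names of a table, in declaration order."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
```

Keeping the container identifier and the timestamps in the shared database, rather than in one container's local state, is what would let any RFT instance see which requests another instance owns and when they last made progress.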


New Tables

• Delegation Service: Resource ID, Caller DN, Local Name, Termination Time, Listener, Certificate, Container ID
• Persistent Subscription: Consumer, Producer, Policy, Precondition, Selector, Topic, Security Descriptor, …


RFT Fail-Over

• Based on time-outs
• Periodically query database for pending requests with no recent activity
  – Stalled requests could be caused by RFT service crash, hardware failure, RFT service overload, etc.
  – If found, obtain DB write lock, query again, claim stalled requests, and release lock
• Configuration values:
  – Query interval (default: 30 seconds)
  – Recent interval (default: 60 seconds)
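The claim procedure above can be sketched as follows. This is an illustration under stated assumptions, not the RFT implementation. The single simplified `request` table, the `'pending'` status value, and the function names are hypothetical, and Python's sqlite3 module with `BEGIN IMMEDIATE` stands in for the write lock a MySQL deployment would take.

```python
import sqlite3
import time

QUERY_INTERVAL = 30   # seconds between scans (default from the slide)
RECENT_INTERVAL = 60  # seconds of inactivity before a request counts as stalled


def open_db(path=":memory:"):
    """Open the shared database; the table layout here is a simplification."""
    # isolation_level=None gives autocommit, so transactions are explicit below.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS request (
               id               INTEGER PRIMARY KEY,
               status           TEXT NOT NULL,
               container_id     TEXT,
               last_update_time REAL NOT NULL)"""
    )
    return conn


def find_stalled(conn, now):
    """Return ids of pending requests with no activity in the recent interval."""
    cutoff = now - RECENT_INTERVAL
    rows = conn.execute(
        "SELECT id FROM request "
        "WHERE status = 'pending' AND last_update_time < ? ORDER BY id",
        (cutoff,),
    ).fetchall()
    return [row[0] for row in rows]


def claim_stalled(conn, my_container_id, now=None):
    """Claim stalled requests for this container and return their ids."""
    now = time.time() if now is None else now
    # Unlocked check first, so the common no-work case takes no lock.
    if not find_stalled(conn, now):
        return []
    conn.execute("BEGIN IMMEDIATE")  # acquire the database write lock
    try:
        # Query again under the lock: another instance may have claimed
        # the same requests between the first query and the lock.
        stalled = find_stalled(conn, now)
        for request_id in stalled:
            conn.execute(
                "UPDATE request SET container_id = ?, last_update_time = ? "
                "WHERE id = ?",
                (my_container_id, now, request_id),
            )
        conn.execute("COMMIT")  # releases the lock
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return stalled
```

Each RFT instance would call `claim_stalled` once every `QUERY_INTERVAL` seconds. In this sketch, with the default values, a stalled request is picked up roughly 60 to 90 seconds after its last recorded activity, depending on where in the scan cycle the stall begins.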


Evaluation Environment

• Dedicated 12-node Linux cluster
  – Red Hat Enterprise Linux AS Release 3
  – Switched Gigabit Ethernet
  – 2 GB RAM
  – Dual 2 GHz Intel Xeon CPUs, 512 KB cache
• Globus Toolkit 4.0.3
• MySQL Standard 5.0.27


Evaluation

• Correctness / Effectiveness
  – Submitted multiple RFT requests of different sizes to 12 RFT instances
  – Verified fail-over and notification functionality
• Performance
  – Evaluate overhead of shared DBMS
  – Stress test: transfer many small files


[Figure: files transferred per second (y-axis, 0 to 14) versus seconds (x-axis, 0 to 165). Annotations: "web services container stopped", "fail-over", "60 second fail-over interval".]


[Figure: total seconds (y-axis, 0 to 6) versus number of nodes (x-axis, 1 to 10). Series: GT4 submit time, cluster submit time.]


[Figure: total seconds (y-axis, 0 to 200) versus number of nodes (x-axis, 1 to 10). Series: GT4 transfer time, cluster transfer time. Data labels: 4%, 6%, 10%, 14%, 22%, 43%, 57%, 82%, 95%.]


Related Work

• HAND: Highly Available Dynamic Deployment Infrastructure for GT4
  – Migrate services between containers to maintain availability during planned outages
  – Does not address management of persistent service state or fail-over for unplanned outages
• myGrid
  – DBMS persistence of WS-ResourceProperties in Apache WSRF
  – Points to a general-purpose approach for DBMS-based persistence of stateful WSRF services


Conclusion

• Clustering RFT provides load-balancing and fail-over with acceptable performance for small clusters

• Clustering is a promising approach for application to other grid services


Future Work

• Correctly handle replay of FTP deletes
• Implement credentialRefreshListener
• Evaluate use of different DBMS solutions
• Investigate GT4 DBMS persistence in general
• Investigate use of WS-Naming


Thanks!

• Questions? Comments?

• This material is based upon work supported by the National Science Foundation under Grant No. 0426972.

• Performance experiments were conducted on computers at the Technology Research, Education, and Commercialization Center (TRECC), a program of the University of Illinois at Urbana-Champaign, funded by the Office of Naval Research and administered by the National Center for Supercomputing Applications. We thank Tom Roney for his assistance with the TRECC cluster.

• We also thank Ravi Madduri from the Globus project for answering our questions about RFT.
