Clustering the Reliable File Transfer Service

June 6, 2007 TeraGrid '07 1 Clustering the Reliable File Transfer Service Jim Basney and Patrick Duda NCSA, University of Illinois This material is based upon work supported by the National Science Foundation under Grant No. 0426972.

Page 1: Clustering the Reliable File Transfer Service

June 6, 2007 TeraGrid '07

Clustering the Reliable File Transfer Service

Jim Basney and Patrick Duda
NCSA, University of Illinois

This material is based upon work supported by the National Science Foundation under Grant No. 0426972.

Page 2: Clustering the Reliable File Transfer Service

Goal

• Provide a highly available Reliable File Transfer (RFT) Service
  – Tolerate server failures
    • Hardware/software faults and resource exhaustion
  – Continue to handle incoming requests
  – Continue to make forward progress on file transfers in the queue

Page 3: Clustering the Reliable File Transfer Service

Globus Toolkit Reliable File Transfer Service

[Diagram; labels: RFT, Client, GridFTP, GridFTP]

Page 4: Clustering the Reliable File Transfer Service

RFT and GridFTP Clustering

[Diagram; labels: RFT (×3), GridFTP control (×2), GridFTP data (×4)]

Page 5: Clustering the Reliable File Transfer Service

Clustering Approach

[Diagram; labels: Load Balancer, RFT (×3), HA DBMS]

Page 6: Clustering the Reliable File Transfer Service

RFT State Management

[Diagram; labels: Client, Web Service Container, RFT, Delegation Service, DBMS]

Page 7: Clustering the Reliable File Transfer Service

RFT DB Tables

• Request: ID, Termination Time, Started Flag, Max Attempts, Delegated EPR, Container ID, Start Time
• Transfer: ID, Request ID, Source URL, Destination URL, Status, Attempts, Retry Time
• Restart: Transfer ID, Restart Marker, Last Update Time

Legend: Added Fields (the marking that identified them is not preserved in this transcript)
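As a rough illustration of the three tables listed above, here is a minimal sketch using Python's built-in sqlite3 module. Only the table and field names come from the slide. The snake_case spellings, every column type, the keys, and the grouping of fields into tables (inferred from the order in which they appear) are assumptions. The evaluation used MySQL, so this is not the actual RFT schema.

```python
import sqlite3

# Hypothetical schema: names follow the slide; all types, keys, and
# constraints below are assumptions made for illustration only.
SCHEMA = """
CREATE TABLE request (
    id               INTEGER PRIMARY KEY,
    termination_time TEXT,
    started_flag     INTEGER,
    max_attempts     INTEGER,
    delegated_epr    TEXT,
    container_id     TEXT,
    start_time       TEXT
);
CREATE TABLE transfer (
    id              INTEGER PRIMARY KEY,
    request_id      INTEGER REFERENCES request(id),
    source_url      TEXT,
    destination_url TEXT,
    status          TEXT,
    attempts        INTEGER,
    retry_time      TEXT
);
CREATE TABLE restart (
    transfer_id      INTEGER PRIMARY KEY REFERENCES transfer(id),
    restart_marker   TEXT,
    last_update_time TEXT
);
"""


def create_schema(conn):
    """Create the three illustrative tables on the given connection."""
    conn.executescript(SCHEMA)


def table_columns(conn, table):
    """Return the column names of a table, in declaration order."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


conn = sqlite3.connect(":memory:")
create_schema(conn)
print(table_columns(conn, "restart"))
```

Keeping this state in a shared database rather than in one container's memory is what would let another RFT instance pick up a request after a failure.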

Page 8: Clustering the Reliable File Transfer Service

New Tables

• Delegation Service: Resource ID, Caller DN, Local Name, Termination Time, Listener, Certificate, Container ID
• Persistent Subscription: Consumer, Producer, Policy, Precondition, Selector, Topic, Security Descriptor

Page 9: Clustering the Reliable File Transfer Service

RFT Fail-Over

• Based on time-outs
• Periodically query database for pending requests with no recent activity
  – Stalled requests could be caused by RFT service crash, hardware failure, RFT service overload, etc.
  – If found, obtain DB write lock, query again, claim stalled requests, and release lock
• Configuration values:
  – Query interval (default: 30 seconds)
  – Recent interval (default: 60 seconds)
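The fail-over steps above (query, lock, query again, claim, release) can be sketched roughly as follows. This is a minimal illustration, not the RFT implementation. The single `request` table, its column names, the `'pending'` status value, and the function names are hypothetical, and SQLite's `BEGIN IMMEDIATE` stands in for whatever write lock the real DBMS provides. The two interval defaults are the ones given on the slide.

```python
import sqlite3
import time

QUERY_INTERVAL = 30   # seconds between scans (default from the slide)
RECENT_INTERVAL = 60  # seconds of inactivity before a request counts as stalled


def find_stalled(conn, now):
    """Return IDs of pending requests with no activity in the recent interval."""
    rows = conn.execute(
        "SELECT id FROM request WHERE status = 'pending' AND last_update < ?",
        (now - RECENT_INTERVAL,),
    ).fetchall()
    return [row[0] for row in rows]


def claim_stalled(conn, container_id, now):
    """One fail-over pass: unlocked check, then lock, re-query, claim, release."""
    if not find_stalled(conn, now):
        return []
    conn.execute("BEGIN IMMEDIATE")  # assumed stand-in for the DB write lock
    try:
        stalled = find_stalled(conn, now)  # query again while holding the lock
        for request_id in stalled:
            conn.execute(
                "UPDATE request SET container_id = ?, last_update = ? WHERE id = ?",
                (container_id, now, request_id),
            )
        conn.execute("COMMIT")  # releases the lock
        return stalled
    except Exception:
        conn.execute("ROLLBACK")
        raise


def run(conn, container_id, clock=time.time, sleep=time.sleep):
    """Scan loop a service instance might run; not invoked in this example."""
    while True:
        claim_stalled(conn, container_id, clock())
        sleep(QUERY_INTERVAL)


# isolation_level=None puts sqlite3 in autocommit mode so BEGIN/COMMIT are explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE request ("
    "id INTEGER PRIMARY KEY, container_id TEXT, status TEXT, last_update REAL)"
)
conn.executemany(
    "INSERT INTO request VALUES (?, ?, ?, ?)",
    [
        (1, "A", "pending", 0.0),    # stalled: no activity for a long time
        (2, "A", "pending", 950.0),  # recently active, left alone
        (3, "A", "done", 0.0),       # finished, left alone
    ],
)
claimed = claim_stalled(conn, "B", now=1000.0)
print(claimed)
```

The second query under the lock matters because another instance may have claimed the same requests between the unlocked check and the lock acquisition. Updating `last_update` when claiming keeps a freshly claimed request from looking stalled to the other instances.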

Page 10: Clustering the Reliable File Transfer Service

Evaluation Environment

• Dedicated 12-node Linux cluster
  – Red Hat Enterprise Linux AS Release 3
  – Switched Gigabit Ethernet
  – 2 GB RAM
  – Dual 2 GHz Intel Xeon CPUs, 512 KB cache
• Globus Toolkit 4.0.3
• MySQL Standard 5.0.27

Page 11: Clustering the Reliable File Transfer Service

Evaluation

• Correctness / Effectiveness
  – Submitted multiple RFT requests of different sizes to 12 RFT instances
  – Verified fail-over and notification functionality
• Performance
  – Evaluate overhead of shared DBMS
  – Stress test: transfer many small files

Page 12: Clustering the Reliable File Transfer Service

[Plot: files transferred per second (y-axis, 0 to 14) versus seconds (x-axis, 0 to 165); annotations: "web services container stopped", "fail-over"; 60 second fail-over interval]

Page 13: Clustering the Reliable File Transfer Service

[Plot: total seconds (y-axis, 0 to 6) versus number of nodes (x-axis, 1 to 10); series: GT4 submit time, cluster submit time]

Page 14: Clustering the Reliable File Transfer Service

[Plot: total seconds (y-axis, 0 to 200) versus number of nodes (x-axis, 1 to 10); series: GT4 transfer time, cluster transfer time; data labels: 4%, 6%, 10%, 14%, 22%, 43%, 57%, 82%, 95%]

Page 15: Clustering the Reliable File Transfer Service

Related Work

• HAND: Highly Available Dynamic Deployment Infrastructure for GT4
  – Migrate services between containers to maintain availability during planned outages
  – Does not address management of persistent service state or fail-over for unplanned outages
• myGrid
  – DBMS persistence of WS-ResourceProperties in Apache WSRF
  – Points to a general-purpose approach for DBMS-based persistence of stateful WSRF services

Page 16: Clustering the Reliable File Transfer Service

Conclusion

• Clustering RFT provides load-balancing and fail-over with acceptable performance for small clusters

• Clustering is a promising approach for application to other grid services

Page 17: Clustering the Reliable File Transfer Service

Future Work

• Correctly handle replay of FTP deletes
• Implement credentialRefreshListener
• Evaluate use of different DBMS solutions
• Investigate GT4 DBMS persistence in general
• Investigate use of WS-Naming

Page 18: Clustering the Reliable File Transfer Service

Thanks!

• Questions? Comments?

• This material is based upon work supported by the National Science Foundation under Grant No. 0426972.

• Performance experiments were conducted on computers at the Technology Research, Education, and Commercialization Center (TRECC), a program of the University of Illinois at Urbana-Champaign, funded by the Office of Naval Research and administered by the National Center for Supercomputing Applications. We thank Tom Roney for his assistance with the TRECC cluster.

• We also thank Ravi Madduri from the Globus project for answering our questions about RFT.