
TeraGrid Data Transfer

Jeffrey P. Gardner
Pittsburgh Supercomputing Center
gardnerj@psc.edu

CIG MCW, Boulder, CO

Outline

GSISSH
  Use passwordless login between TeraGrid machines
  Hands-on exercises
TeraGrid File Management
Data Transfer Performance
GridFTP
  Terminology
  TeraGrid deployment
  Hands-on exercises: use of GridFTP clients & servers to transfer files

Hands-on: Preparation

Prepare for the exercises by logging into NCSA and getting a valid proxy certificate.

Log in to tg-login.ncsa.teragrid.org:
    ssh [email protected]
    Enter your password: xxxxxx

Get a valid proxy certificate:
    tg-login1> grid-proxy-init
    Enter GRID pass phrase for this identity: yyyyyy
    Creating proxy . . . . . . . . . . . Done
    Your proxy is valid until: Tue Jun 21 08:06:03 2005

GSISSH: SSH using TG Certificates

Now log in to TACC using GSISSH:
    tg-login> gsissh tg-login.tacc.teragrid.org

TA DA!

See that your NCSA certificate DN and user account name have been entered into TACC’s grid-mapfile:
    > grep -i userid /etc/grid-security/grid-mapfile
    "/C=US/O=National Center for Supercomputing Applications/CN=Jeff Gardner" gardnerj

Log out of TACC:
    > exit
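The grid-mapfile entry shown above pairs a quoted certificate DN with a local account name. A minimal Python sketch of reading one such entry follows. It assumes exactly one account per line, as in the example; the helper name `parse_grid_mapfile_line` is made up for illustration and is not part of any TeraGrid tool.

```python
import shlex

def parse_grid_mapfile_line(line):
    """Split one grid-mapfile entry into (certificate DN, local account)."""
    # shlex honors the double quotes, so the DN (which contains spaces)
    # comes back as a single token.
    fields = shlex.split(line)
    if len(fields) != 2:
        raise ValueError(f"expected a quoted DN and one account name: {line!r}")
    dn, account = fields
    return dn, account

# Entry copied from the slide above.
entry = '"/C=US/O=National Center for Supercomputing Applications/CN=Jeff Gardner" gardnerj'
dn, account = parse_grid_mapfile_line(entry)
print(account, "<-", dn)
```

This only mirrors what the `grep` on the slide shows: GSISSH lets you in because your DN maps to an account on the remote site.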

TeraGrid File Placement

No common cross-site filesystems (currently)
  This will change very shortly!
  NCSA, SDSC, TACC, ANL will install GPFS (“General Parallel File System”)
User controls where their data resides
  Appropriate site(s)
  Appropriate storage
Online filesystem(s)
  Speed, visibility, quotas, backup policy
  Each filesystem directly accessible from a single site
Mass storage systems
  Long-term storage, slower access

TeraGrid File Movement

File movement is the responsibility of the user
Between online filesystems
  Intra-site
  Cross-site*
Between mass storage and online filesystems
  Intra-site*
  Cross-site*

* Session focuses on these types of transfers

TeraGrid Transfer Environment

TeraGrid backbone bandwidth means the Wide Area Network is rarely a bottleneck
  SDSC<->Caltech<->NCSA<->PSC: 40 Gb/sec
  NCSA<->TACC: 10 Gb/sec
GSI authentication and proxy certificates provide automagic security for transfers
  Just do “grid-proxy-init” and you’re in
Transfer requests can be integrated into job execution scripts
  Moving input data to site(s) of job execution
  Moving results to another filesystem, site, or archive

Data Transfer Performance

What impacts transfer rates?
  Disk and filesystem speed
  Connectivity of filesystem to node
  Node characteristics & load
  Connectivity of node to WAN
  For all networks: bandwidth, latency, buffer size, protocol, load, encryption, …
Don’t expect 40 Gb/sec!

[Figure: network path between nodes at two sites. Labels: node, switch, WAN (TG Backbone) 40 Gb/s; link speeds shown: 30 Gb/s and 1 Gb/s]
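Why not 40 Gb/sec? A rough Python sketch using the link speeds labeled in the diagram for this slide: the slowest link on the path caps the end-to-end rate. The order of the links is a guess at the figure's layout, and the function names are illustrative only.

```python
def bottleneck_gbps(link_rates_gbps):
    """The slowest link on a path caps the end-to-end rate."""
    return min(link_rates_gbps)

def seconds_to_send(size_bytes, rate_gbps):
    """Ideal transfer time, ignoring protocol overhead and disk speed."""
    bits = size_bytes * 8
    return bits / (rate_gbps * 1e9)

# Assumed path: node -> switch (1 Gb/s) -> WAN (30 Gb/s uplink, 40 Gb/s backbone)
# -> switch (30 Gb/s) -> node (1 Gb/s). The ordering is a guess at the figure.
path = [1, 30, 40, 30, 1]
rate = bottleneck_gbps(path)
print(rate, "Gb/s bottleneck")
print(seconds_to_send(10**9, rate), "seconds for 10^9 bytes at best")
```

Under these assumptions a single node never sees more than its own 1 Gb/s link, however fast the backbone is.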

Performance – Choices Matter

Transfer large files for best performance

Use fast filesystems, dedicated transfer nodes, optimized transfer parameters

Transfer 1 GByte file from NCSA to SDSC (10/6/2004)

Choices                                                      Transfer Time   Transfer Rate
Home filesystems, login nodes, default parameters            20 min 18 sec   .845 MBytes/sec (.0066 Gbits/sec)
Parallel filesystems, transfer nodes, optimized parameters   11 sec          93.091 MBytes/sec (.727 Gbits/sec)
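The rates in this comparison can be checked with a few lines of Python. The sketch assumes 1 GByte = 1024 MBytes and 1 Gbit = 1024 Mbits, which reproduces the 11-second row exactly; with the stated 20 min 18 sec the first row comes out near 0.84 MBytes/sec, close to the .845 quoted.

```python
def rate_mbytes_per_sec(size_mbytes, seconds):
    return size_mbytes / seconds

def mbytes_to_gbits(rate_mbytes):
    # Assumption: 1 Gbit = 1024 Mbit, which reproduces the slide's figures.
    return rate_mbytes * 8 / 1024

SIZE_MBYTES = 1024  # assumption: 1 GByte = 1024 MBytes

slow = rate_mbytes_per_sec(SIZE_MBYTES, 20 * 60 + 18)
fast = rate_mbytes_per_sec(SIZE_MBYTES, 11)
print(f"slow: {slow:.3f} MBytes/sec")
print(f"fast: {fast:.3f} MBytes/sec ({mbytes_to_gbits(fast):.3f} Gbits/sec)")
print(f"speedup: {fast / slow:.0f}x")
```

The choice of filesystem, node, and parameters is worth roughly two orders of magnitude here.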

GridFTP Terminology - Protocol

“GridFTP is a high-performance, secure, reliable data transfer protocol optimized for high-bandwidth, wide-area networks. GridFTP is based on FTP, the highly popular Internet file transfer protocol.”

- Quoted from Globus Alliance website

Terminology - Client

GridFTP client programs issue requests that adhere to the GridFTP protocol

Users run GridFTP client programs to transfer files

There is no client program named gridFTP, which can be confusing because users are told “use gridFTP to transfer your files”

tgcp, globus-url-copy and uberftp are three GridFTP client programs that are part of the Common TeraGrid Software Stack (CTSS)

Terminology – 3rd Party Transfer

A GridFTP transfer between two GridFTP servers, rather than between a server and a client, is called a third-party transfer

A third-party transfer occurs when the GridFTP client initiating the transfer is run on a system that is neither the source nor the destination of the transfer operation

Allows use of dedicated transfer nodes

[Figure: the user runs a GridFTP client on Host A to request a data transfer. Requests in the GridFTP protocol go to the GridFTP server processes on Host B (source of data) and Host C (destination of data). The data flows from Host B to Host C.]
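One way to tell the two cases apart is to look at the URL schemes handed to the client. The Python sketch below is illustrative only: `transfer_kind` is a made-up helper, and the paths under /home/userid are placeholders. It treats two `gsiftp://` endpoints as a third-party transfer and a `file:` endpoint as a plain client-server transfer.

```python
from urllib.parse import urlsplit

def transfer_kind(source_url, destination_url):
    """Classify a transfer by the URL schemes of its two endpoints."""
    schemes = (urlsplit(source_url).scheme, urlsplit(destination_url).scheme)
    if schemes == ("gsiftp", "gsiftp"):
        return "third-party"      # server to server; the client only issues requests
    if "gsiftp" in schemes:
        return "client-server"    # one endpoint is a local file on the client host
    return "local"

print(transfer_kind("file:/home/userid/test.file",
                    "gsiftp://tg-login.tacc.teragrid.org/home/userid/test.file"))
print(transfer_kind("gsiftp://tg-gridftp.ncsa.teragrid.org/home/userid/test.file",
                    "gsiftp://tg-gridftp.tacc.teragrid.org/home/userid/test.file"))
```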

Terminology - Server

A GridFTP server process understands requests that adhere to the GridFTP protocol, and performs authentication and data transfer operations based on those requests

TeraGrid GridFTP servers usually run on:
  Login nodes: tg-login.<site>.teragrid.org
  Dedicated GridFTP nodes: tg-gridftp.<site>.teragrid.org
Some mass storage front-ends are GridFTP servers
  mss.ncsa.teragrid.org

TG GridFTP Server Deployment

tg-login.<site>.teragrid.org is a login node and also runs a GridFTP server
  Shared resource; many tasks
tg-gridftp.<site>.teragrid.org is a dedicated GridFTP server
  Dedicated file transfer resource
  Usually better connectivity

TG GridFTP Client Deployment

uberftp
  Interactive GridFTP transfer client
  Configurable TCP buffer size and number of parallel streams

TG GridFTP Client Deployment

globus-url-copy <source_url> <destination_url>
  Command line interface
  -tcp-bs <size> | -tcp-buffer-size <size>
    Specify the size (in bytes) of the buffer to be used by the underlying ftp data channels
  -p <parallelism> | -parallel <parallelism>
    Specify the number of streams to be used in the ftp transfer
tgcp [gridFTP-server1:]file1 [gridFTP-server2:]file2
  Command line interface
  Friendly “scp-like” wrapper around globus-url-copy
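If you drive globus-url-copy from a script, you can assemble the command line from the two options listed above. This is a minimal Python sketch: the helper name and the /home/userid paths are placeholders, and it only builds the argument list rather than running anything.

```python
import shlex

def globus_url_copy_argv(source_url, destination_url,
                         tcp_buffer_bytes=None, parallelism=None):
    """Build an argument list for globus-url-copy using only the options listed above."""
    argv = ["globus-url-copy"]
    if tcp_buffer_bytes is not None:
        argv += ["-tcp-bs", str(tcp_buffer_bytes)]
    if parallelism is not None:
        argv += ["-p", str(parallelism)]
    argv += [source_url, destination_url]
    return argv

argv = globus_url_copy_argv(
    "gsiftp://tg-gridftp.ncsa.teragrid.org/home/userid/test.file",  # placeholder path
    "gsiftp://tg-gridftp.tacc.teragrid.org/home/userid/test.file",  # placeholder path
    tcp_buffer_bytes=4000000,
    parallelism=4,
)
print(shlex.join(argv))
```

On a machine where the client is installed and a valid proxy exists, the list could be passed to `subprocess.run(argv, check=True)`.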

Hands-on:

Participants will be led through a series of exercises using tgcp, globus-url-copy and uberftp.

Demonstrates transferring files
  Between TeraGrid sites
  Between TG machines and archival storage systems

Hands-on preparation:

Login to tg-login.ncsa.teragrid.org if you have not already done so

Get the test data file:
    wget http://www.psc.edu/~gardnerj/test.file

Hands-on: Exercise 1
GridFTP between login nodes

Copy a 9 MByte file from the current directory at NCSA to your home directory at TACC. Use the login node at TACC as the remote GridFTP server. Use default transfer parameters.

Use globus-url-copy to transfer the file (type the command on a single line, no carriage return!):
    tg-login1> /usr/bin/time -f %e globus-url-copy file:`pwd`/test.file gsiftp://tg-login.tacc.teragrid.org/~/test.file-Ex1
    3.18

Hands-on: Exercise 2
GridFTP between GridFTP Servers

Copy a 9 MByte file from the current directory at NCSA to your home directory at TACC. Use a third-party transfer and the GridFTP server nodes at both NCSA and TACC.

Use globus-url-copy to transfer the file:
    tg-login1> /usr/bin/time -f %E globus-url-copy gsiftp://tg-gridftp.ncsa.teragrid.org/`pwd`/test.file gsiftp://tg-gridftp.tacc.teragrid.org/~/test.file-Ex2
    3.01

Hands-on: Exercise 3
GridFTP between GridFTP Servers

Copy a 9 MByte file from the current directory at NCSA to your home directory at TACC. Use a third-party transfer and the GridFTP server nodes at both NCSA and TACC. Use optimized transfer parameters.

Use globus-url-copy to transfer the file:
    tg-login1> /usr/bin/time -f %E globus-url-copy -tcp-bs 4000000 -p 4 gsiftp://tg-gridftp.ncsa.teragrid.org/`pwd`/test.file gsiftp://tg-gridftp.tacc.teragrid.org/~/test.file-Ex3
    2.54
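To compare Exercises 1 to 3, convert the elapsed times into effective rates. The Python sketch below uses the 9621728-byte file size reported in the UberFTP listing later in the session and counts 1 MB as 10^6 bytes. `/usr/bin/time` measures the whole command, including authentication and connection setup, so for a file this small the numbers understate what the network can do.

```python
FILE_BYTES = 9621728  # size shown in the UberFTP listing later in the session

# Elapsed seconds reported by /usr/bin/time in Exercises 1-3.
elapsed = {
    "Ex1 login nodes, defaults": 3.18,
    "Ex2 third-party, defaults": 3.01,
    "Ex3 third-party, tuned": 2.54,
}

def mbytes_per_sec(size_bytes, seconds):
    """Effective rate in MB/s, with 1 MB = 10**6 bytes."""
    return size_bytes / 1e6 / seconds

for label, seconds in elapsed.items():
    print(f"{label}: {mbytes_per_sec(FILE_BYTES, seconds):.2f} MB/s")
```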

Hands-on: Exercise 4
Using tgcp

Copy a 9 MByte file from your home directory at NCSA to your home directory at TACC using tgcp. tgcp automatically uses third-party transfers and optimized transfer parameters.

Add tgcp to your path (it is not there by default):
    tg-login1> soft add +tgcp

Use tgcp to transfer the file:
    tg-login1> /usr/bin/time -f %E tgcp test.file tg-gridftp.tacc.teragrid.org:/home/userid/test.file-Ex4
    globus-url-copy -p 4 -tcp-bs 2000000 gsiftp://tg-gridftp.ncsa.teragrid.org:2812/home/ac/gardnerj/test.file gsiftp://tg-gridftp.tacc.teragrid.org:2812/home/gardnerj/test.file
    4.06 (?!!)
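The echoed globus-url-copy line suggests how an scp-like argument becomes a gsiftp URL. The Python sketch below only illustrates that mapping and is not tgcp's actual implementation: `to_gsiftp_url` is a made-up helper, port 2812 is copied from the echoed command, and relative paths are rejected for simplicity.

```python
def to_gsiftp_url(spec, default_server, port=2812):
    """Turn an scp-like '[server:]path' into a gsiftp URL (illustrative only)."""
    server, sep, path = spec.partition(":")
    if not sep:
        # No 'server:' prefix: fall back to the caller's default server.
        server, path = default_server, spec
    if not path.startswith("/"):
        raise ValueError("this sketch only handles absolute paths")
    return f"gsiftp://{server}:{port}{path}"

local_default = "tg-gridftp.ncsa.teragrid.org"
print(to_gsiftp_url("/home/ac/gardnerj/test.file", local_default))
print(to_gsiftp_url("tg-gridftp.tacc.teragrid.org:/home/gardnerj/test.file", local_default))
```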

Hands-on: Exercise 5 – pg 1
UberFTP between login nodes

Copy a 9 MByte file from your NCSA home directory to TACC. Use optimized transfer parameters. Interactive session.

Start uberftp and set transfer parameters:
    tg-login1> uberftp
    uberftp> parallel 4
    uberftp> tcpbuf 4000000
    TCP buffer set to 4000000 bytes

Open connection to TACC:
    uberftp> open tg-login.tacc.teragrid.org
    %%% BANNER %%%
    220 UNIX Archive FTP server ready.
    230 User xxx logged in.

Hands-on: Exercise 5 – pg 2
UberFTP between login nodes

Copy the file:
    uberftp> put test.file test.file-Ex5
    150 Opening BINARY connection(s) for test.file-Ex5.
    226 Transfer complete.
    Transfer rate 9621728 bytes in 0.51 seconds. 19017.90 KB/sec

Get a listing of the TACC home directory:
    uberftp> ls
    -rw---- user group 9621728 date test.file-Ex1
    -rw---- user group 9621728 date test.file-Ex2
    -rw---- user group 9621728 date test.file-Ex3
    . . .

Exit UberFTP:
    uberftp> quit

Hands-on: Exercise 6 – pg 1
UberFTP between GridFTP servers

Copy a 9 MByte file from your NCSA home directory to TACC using third-party transfers. Use optimized transfer parameters. Interactive session.

Start uberftp and set transfer parameters:
    tg-login1> uberftp
    uberftp> parallel 4
    uberftp> tcpbuf 4000000
    TCP buffer set to 4000000 bytes

Hands-on: Exercise 6 – pg 2
UberFTP between GridFTP servers

Open “local” connection to NCSA dedicated GridFTP server:
    uberftp> lopen tg-gridftp.ncsa.teragrid.org
    220 tg-gridftp4.ncsa...blah..blah ready.
    230 User xxx logged in.

Open “remote” connection to TACC dedicated GridFTP server:
    uberftp> open tg-gridftp.tacc.teragrid.org
    220 lonestar GridFTP...blah..blah ready.
    230 User xxx logged in.

Hands-on: Exercise 6 – pg 3
UberFTP between GridFTP servers

Copy the file:
    uberftp> put test.file test.file-ex6
    src> 150 Opening BINARY mode data connection(s).
    dst> 150 Opening BINARY mode data connection(s).
    src> 226 Transfer complete.
    dst> 226 Transfer complete.

Exit UberFTP:
    uberftp> quit

Useful UberFTP commands

Unix-like commands
  ls, cd, mkdir, rmdir, pwd, rm
Put “l” in front for “local” versions of commands
  lls, lcd, lmkdir, lrmdir, lpwd, lrm
put
  Transfer from local host to remote host
get
  Transfer from remote host to local host
mput, mget
  Transfer multiple files between hosts
help

Tweaking Optimization Parameters

globus-url-copy
  -tcp-bs <size> | -tcp-buffer-size <size>
    Specify the size (in bytes) of the buffer to be used by the underlying ftp data channels
    “Low” network traffic: 8000000
    “High” network traffic: 4000000
  -p <parallelism> | -parallel <parallelism>
    Specify the number of streams to be used in the ftp transfer
    “Low” network traffic: 1
    “High” network traffic: 2 - 4
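A common rule of thumb for picking a TCP buffer size is the bandwidth-delay product: link bandwidth times round-trip time. The Python sketch below shows the arithmetic. The 1 Gb/s link and 64 ms round trip are assumed inputs, not measurements of any TeraGrid path, and they were picked so the results land on the values recommended above.

```python
def bdp_bytes(bandwidth_bits_per_sec, round_trip_ms):
    """Bandwidth-delay product: bytes in flight needed to keep the path full."""
    # bits/sec * ms / 1000 gives bits in flight; divide by 8 for bytes.
    return bandwidth_bits_per_sec * round_trip_ms // 8000

def per_stream_buffer(bandwidth_bits_per_sec, round_trip_ms, streams):
    """With several parallel streams, each one needs only its share of the window."""
    return bdp_bytes(bandwidth_bits_per_sec, round_trip_ms) // streams

# Assumed inputs: 1 Gb/s node link and a 64 ms round trip (illustrative only).
print(bdp_bytes(10**9, 64))             # one stream
print(per_stream_buffer(10**9, 64, 2))  # two parallel streams
```

This also hints at why the slide pairs a larger buffer with one stream and a smaller buffer with several.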

Tweaking Optimization Parameters

uberftp
  tcpbuf <size>
    Specify the size (in bytes) of the buffer to be used by the underlying ftp data channels
    “Low” network traffic: 8000000
    “High” network traffic: 4000000
  parallel <parallelism>
    Specify the number of streams to be used in the ftp transfer
    “Low” network traffic: 1
    “High” network traffic: 2 - 4

Using Robotic-Tape Archival Resources

NCSA Mass Storage System (MSS)
  Accessible using GridFTP to mss.ncsa.teragrid.org
TACC SGI Data Migration Facility (DMF)
  Accessible by simply placing files in $ARCHIVE directory
SDSC HPSS archival storage system
  Use HSI from SDSC cluster only
PSC “Golem”
  Accessible using GridFTP to tg-gridftp.psc.teragrid.org

Using Robotic-Tape Archival Resources

Files on these machines are transferred to their local disks, but may be automatically migrated to tape if necessary.

If you access a file that has been migrated to tape, it will be retrieved automatically, but expect some delay (up to a few minutes)

Storage capacity is essentially infinite!

Hands-on: Exercise 7 – pg 1

Copy several 9 MByte files from your home directory at TACC to the NCSA Mass Storage System. Use 3rd party transfer at TACC.

GSISSH from NCSA to TACC:
    tg-login> gsissh tg-login.tacc.teragrid.org

Start uberftp session:
    lonestar> uberftp

Establish “local” connection to TACC dedicated GridFTP server:
    uberftp> lopen tg-gridftp.tacc.teragrid.org
    220 lonestar GridFTP..blah..blah..ready.
    230 User xxx logged in.

Establish “remote” connection to the NCSA Mass Storage System:
    uberftp> open mss.ncsa.teragrid.org
    %%%%%Lots of Stuff%%%%%%%
    230 User xxx logged in.

Hands-on: Exercise 7 – pg 2

Put multiple files to NCSA MSS:
    uberftp> mput test.file*
    src> 150 Opening BINARY mode data connection for test file...
    dst> 150 Opening BINARY mode data connection for test file...
    src> 226 Transfer complete.
    dst> 226 Transfer complete.
    . . .

Hands-on: Exercise 7 – pg 3

Get a listing of the Mass Storage System directory:
    uberftp> ls
    -rw---- user group DK common 9621728 date test.file-Ex1
    -rw---- user group DK common 9621728 date test.file-Ex2
    -rw---- user group DK common 9621728 date test.file-Ex3
    . . .

DK in the listing means the file is on disk. AR is used to indicate a file on tape.

Quit uberftp:
    uberftp> quit
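If you save such a listing to a file, the DK/AR flag can be used to find out which files would need a tape recall. The Python sketch below assumes the flag is the fourth whitespace-separated field and the file name is the last, as in the listing above; real listings may differ, and the AR line is a made-up example.

```python
def residency(listing_line):
    """Return 'disk' or 'tape' from the DK/AR flag in an MSS listing line."""
    fields = listing_line.split()
    # Assumed layout, copied from the listing above:
    # perms, user, group, flag, family, size, date, name
    flag = fields[3]
    return {"DK": "disk", "AR": "tape"}[flag]

def files_on_tape(listing_lines):
    """Names (last field) of the entries whose flag says they are on tape."""
    return [line.split()[-1] for line in listing_lines if residency(line) == "tape"]

listing = [
    "-rw---- user group DK common 9621728 date test.file-Ex1",
    "-rw---- user group AR common 9621728 date test.file-Ex2",  # hypothetical AR entry
]
print(files_on_tape(listing))
```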

Using PSC “Golem”

tg-gridftp.psc.teragrid.org maps directly onto Golem’s filesystem.

Example:
    tg-login1> globus-url-copy -tcp-bs 4000000 -p 4 gsiftp://tg-gridftp.ncsa.teragrid.org/`pwd`/test.file gsiftp://tg-gridftp.psc.teragrid.org/~/test.file

Using TACC DMF

Simply copy files to $ARCHIVE directory

Files in this directory are automatically migrated to tape if necessary.

If you access a file that has been migrated to tape, it will be retrieved automatically, but expect some delay (up to a few minutes)

/archive/teragrid/username is visible from the login nodes, but not the TACC dedicated GridFTP servers.
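Since archiving at TACC is just a file copy, a job's post-processing step can do it directly. This is a minimal Python sketch, assuming it runs on a login node where `$ARCHIVE` is set; `archive_file` is a made-up helper name.

```python
import os
import shutil
from pathlib import Path

def archive_file(path, archive_dir=None):
    """Copy a file into the archive directory and return the new path."""
    # $ARCHIVE is the variable named on the slide; archive_dir overrides it,
    # which is handy for trying the function out somewhere else.
    target_dir = Path(archive_dir or os.environ["ARCHIVE"])
    # copy2 keeps timestamps and returns the path of the new copy.
    return Path(shutil.copy2(path, target_dir))

# Example (on a TACC login node): archive_file("test.file")
```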

Hands-on: Wrapup

Log out of TACC gsissh session:
    lonestar> exit

Destroy your proxy:
    tg-login> grid-proxy-destroy

Log out of NCSA ssh session:
    tg-login> exit

Data Transfer Summary

GridFTP clients tgcp, globus-url-copy and uberftp can be used to perform transfers between many TeraGrid online filesystems and mass storage systems accessible via GridFTP servers.

Users are responsible for managing data transfers, including job-related data movement, which can be incorporated into job scripts.

Choose servers, filesystems, and transfer parameters wisely to optimize performance.

Ongoing efforts to improve rates and usability.

Useful URLs for help

TeraGrid user information overview
  http://www.teragrid.org/userinfo/index.html
Summary of TG Resources
  http://www.teragrid.org/userinfo/guide_hardware_table.html
Summary of machines with links to site-specific user guides (just click on the name of each site)
  http://www.teragrid.org/userinfo/guide_hardware_specs.html
Data Transfer guide
  http://www.teragrid.org/userinfo/guide_data_transfer.html
Archival Storage guide
  http://www.teragrid.org/userinfo/guide_data_storage.html#archival