

SAM Job Submission

• What is SAM?

• sam submit ……

• Data Management

• Details.

• Conclusions.

Rod Walker, 10th May, GridPP, Manchester.

What is SAM?

• SAM is Sequential data Access via Meta-data

• Project started in 1997 to handle D0’s needs for the Run II data system.

• Current SAM team includes:

– Andrew Baranovski, Lauri Loebel-Carpenter, Gabriele Garzoglio, Chris Jozwiak, Lee Lueking*, Carmenita Moore, Igor Terekhov, Julie Trumbo, Sinisa Veseli, Matthew Vranicar, Stephen P. White, Victoria White*. (*project leaders)

• http://d0db.fnal.gov/sam

SAM is a Distributed System

[Diagram. Components: Database Server(s) (Central Database), Name Server, Global Resource Manager(s), Log server, Station 1 … Station n Servers, Mass Storage System(s). Legend: Shared Globally, Local, Shared Locally. Arrows indicate control and data flow.]

Job Submission

• Executable

  – Runtime environment

    • Executable & assoc. files (user specific).

    • Experiment environment.

• Data

  – Dataset definition

    • Select by metadata.

    • Converted to LFNs at submit time, i.e. datasets change.

    • Build SQL query…then…execute query.

Dataset
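The point that a dataset definition is resolved to LFNs at submit time can be sketched as below. This is an illustration only: the `files` table, its columns, and the file names are hypothetical, and SQLite stands in for the real database, whose schema the slides do not show.

```python
import sqlite3

# Hypothetical metadata table; the real SAM schema is not shown in the slides.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (lfn TEXT PRIMARY KEY, run INTEGER, tier TEXT)")
conn.executemany(
    "INSERT INTO files VALUES (?, ?, ?)",
    [
        ("reco_a.raw", 151193, "reco"),
        ("reco_b.raw", 153170, "reco"),
        ("raw_c.raw", 153170, "raw"),
    ],
)


def resolve_dataset(conn, constraints):
    """Build a SQL query from metadata constraints, execute it, return LFNs."""
    # Column names come from trusted code here; values are bound as parameters.
    where = " AND ".join(f"{column} = ?" for column in constraints)
    sql = f"SELECT lfn FROM files WHERE {where} ORDER BY lfn"
    return [row[0] for row in conn.execute(sql, list(constraints.values()))]


definition = {"tier": "reco"}                   # the stored dataset definition
snapshot = resolve_dataset(conn, definition)    # resolved at submit time

# A new file is catalogued later: the definition is unchanged, but a later
# submission resolves to a different list of LFNs.
conn.execute("INSERT INTO files VALUES ('reco_d.raw', 153171, 'reco')")
later = resolve_dataset(conn, definition)
```

A job submitted before the insert works on `snapshot`; one submitted afterwards works on `later`, which is why the same definition can name a changing set of files.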

Job Running & Job Control

[Diagram. Components: Client, Local SM (Station Master), Job Manager (Project Master), Batch System, Process Manager (SAM wrapper script), User Task. Labelled steps:]

1. sam submit --defname=mydata --script=myexe (Run this exe | on this data)

2. submit to SM

3. invoke

4. submit to BS

5. Submission ok

6. start job

7. Started

8. invoke

9. setJobCount/stop

10. resubmit

jobEnd (unnumbered)

[Diagram. Components: User exe, Job control, Replica Catalogue, Stager, BS. Labels: getNextFile(); "Here's the path to a local file: /sam/cache1/boo/mydata1.dat"; LFN; PFN; Fetch PFN; Wait; Finished; Release; Physics & wrapper. Steps are numbered 1 to 4 in the figure.]
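The getNextFile() exchange can be sketched as a loop in which the user executable only ever sees local paths. Everything below is a toy model: the class, method names other than the getNextFile idea, and the catalogue contents are hypothetical, and no file is actually copied.

```python
from dataclasses import dataclass, field


@dataclass
class JobControl:
    """Toy stand-in for the job-control side; all names here are hypothetical."""
    lfns: list                      # dataset resolved at submit time
    replica_catalogue: dict         # LFN -> PFN ("node:/path")
    cache_dir: str = "/sam/cache1/boo"
    in_use: set = field(default_factory=set)

    def get_next_file(self):
        """Return a local path for the next file, or None when finished."""
        if not self.lfns:
            return None
        lfn = self.lfns.pop(0)
        pfn = self.replica_catalogue[lfn]      # LFN -> PFN lookup
        local_path = self._stage(pfn, lfn)     # stager fetches the PFN
        self.in_use.add(local_path)
        return local_path

    def _stage(self, pfn, lfn):
        # A real stager would copy the file; here we only compute the target.
        return f"{self.cache_dir}/{lfn}"

    def release(self, local_path):
        """Tell job control the file is no longer needed in cache."""
        self.in_use.discard(local_path)


def user_exe(job_control):
    """The 'physics' code: sees only local paths, never the transfer."""
    processed = []
    while (path := job_control.get_next_file()) is not None:
        processed.append(path)       # real code would open and analyse the file
        job_control.release(path)
    return processed


jc = JobControl(
    lfns=["mydata1.dat", "mydata2.dat"],
    replica_catalogue={
        "mydata1.dat": "nodeA:/path/to/cache/mydata1.dat",
        "mydata2.dat": "nodeB:/path/to/cache/mydata2.dat",
    },
)
paths = user_exe(jc)
```

The design choice this illustrates is the split of responsibilities: location lookup, staging, and release all sit behind one call, so the executable needs no knowledge of where the data came from.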

Data Management

• Replica Catalogue

• Replication

• Cache Management

Replica Catalogue

• Combined with Metadata in an Oracle database, although logically distinct

  – Query on metadata to create a dataset

    • list of LFNs

    • Experiment specific (D0/CDF).

  – Query on LFN to locate physical file.

    • Generic replica catalogue.

    • node:/path/to/cache/myfile.dat

Replica Catalogue

600,000 files increasing at 3000/day, 120TB.

150,000 in cache

5000 files per day replicated, 5000 destroyed.

½ million queries per day, (90% SELECT).

Cache Management

• 13.6TB, in several hundred individually managed caches.

• 1TB in and out/day (10k files)

• Cache lifetime ~10 days

• Various prescriptions for cache replacement, e.g. 1st in 1st out, last use.

70% hit rate (~6000 files/day)
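Two of the replacement prescriptions can be compared with a small sketch. It assumes that "last use" means evicting the least recently used file, which is an interpretation rather than something the slide states; the class, the capacity, and the request sequence are hypothetical.

```python
from collections import OrderedDict


class FileCache:
    """Fixed-capacity cache of file names with a selectable replacement policy."""

    def __init__(self, capacity, policy="fifo"):
        if policy not in ("fifo", "lru"):
            raise ValueError("policy must be 'fifo' or 'lru'")
        self.capacity = capacity
        self.policy = policy
        self.files = OrderedDict()   # oldest entry first
        self.hits = 0
        self.requests = 0

    def request(self, name):
        """Record a request; return True on a cache hit."""
        self.requests += 1
        if name in self.files:
            self.hits += 1
            if self.policy == "lru":
                self.files.move_to_end(name)   # refresh position on use
            return True
        if len(self.files) >= self.capacity:
            self.files.popitem(last=False)     # evict the oldest entry
        self.files[name] = True
        return False


def hit_rate(policy, requests, capacity=2):
    """Replay a request sequence and return hits / requests."""
    cache = FileCache(capacity, policy)
    for name in requests:
        cache.request(name)
    return cache.hits / cache.requests


requests = ["a", "b", "a", "c", "a"]
fifo_rate = hit_rate("fifo", requests)   # 1 hit out of 5
lru_rate = hit_rate("lru", requests)     # 2 hits out of 5
```

With this sequence the frequently reused file survives under the last-use policy but is evicted under first in, first out, which is the kind of difference that shows up in the hit rate.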

Replication

• Easy – use your favourite ftp.

• BUT……what could go wrong.

  – Cache space – Cache Management.

  – network, dead node, corrupted file – retries.

  – dead disk, uncached – fail-over.

  – sluggish robot, slow delivery – hold job.

• A stroll through my log file.

05/07/02 16:01:52 imperial-test.SM.imperial-test 11698: Delivery status:
Simple Status:
  Code: delivery error (Category SAM Internal)
  Severity level: ERROR
  Generated on 07 May 16:01:51 by eworker
  In the context: executed process samcp cab:d0cs015.fnal.gov:/sam/cache/boo/reco_all_0000151193_021.raw_p10.15.01_000 imperial-test:d0mino.fnal.gov:/sam/cache20/lancs/boo, result: EXIT CODE:256
  STDOUT: Executing Kerberos rcp: /usr/krb5/bin/rcp d0cs015.fnal.gov:/sam/cache/boo/reco_all_0000151193_021.raw_p10.15.01_000 /sam/cache20/lancs/boo
  STDERR: kshd: Logins currently disabled.
    trying normal rcp (/usr/bsd/rcp)
    WARNING: NO ENCRYPTION!
    d0cs015.fnal.gov: Connection refused, method name: samcp
  Recommended action: Please contact [email protected]
05/07/02 16:01:52 imperial-test.SM.imperial-test 11698: Delivery failed, scheduling retry in 3 seconds

Retry

05/07/02 16:02:35 imperial-test.SM.imperial-test 11698: Delivery status:
Simple Status:
  Code: delivery error (Category SAM Internal)
  Severity level: ERROR
  Generated on 07 May 16:02:35 by eworker
  In the context: executed process samcp cab:d0cs015.fnal.gov:/sam/cache/boo/reco_all_0000151193_021.raw_p10.15.01_000 imperial-test:d0mino.fnal.gov:/sam/cache20/lancs/boo, result: EXIT CODE:256
  STDOUT: Executing Kerberos rcp: /usr/krb5/bin/rcp d0cs015.fnal.gov:/sam/cache/boo/reco_all_0000151193_021.raw_p10.15.01_000 /sam/cache20/lancs/boo
  STDERR: kshd: Logins currently disabled.
    trying normal rcp (/usr/bsd/rcp)
    WARNING: NO ENCRYPTION!
    d0cs015.fnal.gov: Connection refused, method name: samcp
  Recommended action: Please contact [email protected]
05/07/02 16:02:35 imperial-test.SM.imperial-test 11698: Maximum number of retrials exceeded. Will not retry again from this source!
05/07/02 16:02:35 imperial-test.SM.Repler 11698: Will avoid locations:(cab:d0cs015.fnal.gov:/sam/cache/boo)
05/07/02 16:02:35 imperial-test.SM.Repler 11698: No loc is preferred, selecting enstore:/pnfs/sam/dzero/copy1/datalogger/initial_runs/d0farm/reco/all (prl733.24)

Give up on this source.

Avoid this location. Get another location from RC, and retry.
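The retry and fail-over behaviour in the log can be sketched as follows. This is a minimal illustration, not SAM's code: `deliver`, the `copy` callable, and the retry limit of 2 are assumptions, and only the two location strings are taken from the log.

```python
def deliver(lfn, locations, copy, max_retries=2):
    """Try each replica location in turn; return (location_used, avoided).

    `copy(location, lfn)` is a hypothetical callable that raises OSError on
    failure. Locations that exhaust their retries are avoided from then on.
    """
    avoided = []
    for location in locations:
        for _attempt in range(max_retries):
            try:
                copy(location, lfn)
                return location, avoided
            except OSError:
                continue              # a real system would wait before retrying
        avoided.append(location)      # give up on this source
    raise RuntimeError(f"no usable location for {lfn}; avoided {avoided}")


attempts = []


def fake_copy(location, lfn):
    """Stand-in transfer: the 'cab:' cache node always refuses connections."""
    attempts.append(location)
    if location.startswith("cab:"):
        raise ConnectionRefusedError("Connection refused")


locations = [
    "cab:d0cs015.fnal.gov:/sam/cache/boo",
    "enstore:/pnfs/sam/dzero/copy1/datalogger/initial_runs/d0farm/reco/all",
]
used, avoided = deliver("some_file.raw", locations, fake_copy)
```

Here the cache node is tried twice, marked as avoided, and the transfer then succeeds from the mass storage location, mirroring the sequence of log messages.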

05/07/02 16:10:53 imperial-test.SM.imperial-test 11698: Delivery status:
Simple Status:
  Code: OK (Category Enstore)
  Severity level: SUCCESS
  Generated on 07 May 16:10:53 by eworker
  In the context: executed process samcp enstore:/pnfs/sam/dzero/copy1/datalogger/initial_runs/d0farm/reco/all/reco_all_0000153170_021.raw_p10.15.01_000 imperial-test:d0mino.fnal.gov:/sam/cache20/lancs/boo, result: EXIT CODE:0
  STDOUT:
    INFILE=/pnfs/sam/dzero/copy1/datalogger/initial_runs/d0farm/reco/all/reco_all_0000153170_021.raw_p10.15.01_000
    OUTFILE=/sam/cache20/lancs/boo
    FILESIZE=1369320147
    LABEL=PRL859
    LOCATION=0000_000000000_0000067
    DRIVE=d0enmvr9a:/dev/rmt/tps0d1n
    DRIVE_SN=4560020042
    TRANSFER_TIME=160.38
    SEEK_TIME=73.47
    MOUNT_TIME=25.36
    QWAIT_TIME=65.79
    TIME2NOW=329.78
    STATUS=ok
  STDERR: Completed transferring 1369320147 bytes in 1 files in 329.720216036 sec.
    Overall rate = 3.96 MB/sec.
    Drive rate = 8.14 MB/sec.
    Network rate = 8.13 MB/sec.
    Exit status
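The rates reported in this record can be reproduced from its own numbers. The sketch below assumes that "MB" in the log means 2^20 bytes and that the drive rate is file size divided by TRANSFER_TIME; both are inferences from the figures, not documented behaviour. The helper name `parse_stats` is hypothetical.

```python
MIB = 1024 * 1024


def parse_stats(text):
    """Parse whitespace-separated KEY=VALUE tokens into a dict of strings."""
    return dict(token.split("=", 1) for token in text.split() if "=" in token)


stats = parse_stats(
    "FILESIZE=1369320147 TRANSFER_TIME=160.38 SEEK_TIME=73.47 "
    "MOUNT_TIME=25.36 QWAIT_TIME=65.79 TIME2NOW=329.78 STATUS=ok"
)
size = int(stats["FILESIZE"])

# Elapsed time taken from the "Completed transferring" line of the record.
overall_rate = size / 329.720216036 / MIB
drive_rate = size / float(stats["TRANSFER_TIME"]) / MIB

print(f"overall {overall_rate:.2f} MB/s, drive {drive_rate:.2f} MB/s")
```

The gap between the two rates is the time spent queueing, mounting, and seeking before any bytes move, which is why the overall rate is roughly half the drive rate for this file.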

Got it

05/07/02 15:46:09 imperial-test.SM.PBS BS Adapter 11698: Remembering that job 1760.gw39.hep.ph.ic.ac.uk for project 61983_sam_ is held
--------------------------
05/07/02 16:00:56 imperial-test.SM.imperial-test 11698: Delivery status:
Simple Status:
  Code: OK (Category Enstore)
  Severity level: SUCCESS
  Generated on 07 May 16:00:56 by eworker
  In the context: executed process samcp enstore:/pnfs/sam/dzero/copy1/datalogger/initial_runs/d0farm/reco/all/reco_all_0000153170_012.raw_p10.15.01_000 imperial-test:d0mino.fnal.gov:/sam/cache20/lancs/boo, result: EXIT CODE:0
  STDOUT:
    INFILE=/pnfs/sam/dzero/copy1/datalogger/initial_runs/d0farm/reco/all/reco_all_0000153170_012.raw_p10.15.01_000
    OUTFILE=/sam/cache20/lancs/boo
    FILESIZE=788805399
    LABEL=PRL829
    LOCATION=0000_000000000_0000025
    DRIVE=d0enmvr9a:/dev/rmt/tps0d1n
    DRIVE_SN=4560020042
    TRANSFER_TIME=90.08
    SEEK_TIME=45.05
    MOUNT_TIME=27.14
    QWAIT_TIME=225.50
    TIME2NOW=392.28
    STATUS=ok
  STDERR: Completed transferring 788805399 bytes in 1 files in 392.221878052 sec.
    Overall rate = 1.92 MB/sec.
    Drive rate = 8.35 MB/sec.
    Network rate = 8.35 MB/sec.
    Exit status = 0., method name: samcp
  Recommended action: Please contact [email protected]
05/07/02 16:00:57 imperial-test.SM.PBS BS Adapter 11698: Will execute: qrls 1760.gw39.hep.ph.ic.ac.uk

Hold in queue until 1st file delivered.

Release

File arrives

Conclusions

• Executable is stupid - no knowledge of data transfer. Job manager does the clever stuff.

• SAM has a fully featured, tried and tested data management system.

• No GSI, GridFTP, or CondorG as yet,

…but you need more than G's to make a grid!