atlas tier-0 exercise
Distributed Data Management
Miguel Branco
ATLAS Tier-0 exercise: Export report
2
Outline
• Goals
• Setup
• Ramp-up
• Observations
• CASTOR stager limits
• dCache problems
• Conclusions
• Future plans
3
Goals
• Nominal export to all Tier-1 sites
  – First to each site individually
  – Then collectively to all Tier-1s
• Gradual involvement of Tier-2s
• Rates (in MB/s):
  ASGC     65
  FZK      88
  CNAF     88
  BNL     287
  RAL     102
  IN2P3   109
  SARA    109
  PIC      50
  NDGF     50
  TRIUMF   48
• Total of 997 MB/s out of CERN
• These are the daily peak rates we want to achieve.
• Average rates are 40% lower (the ATLAS Computing Model day has 50k secs)
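As a quick cross-check of the numbers on this slide, the sketch below sums the per-site peak rates and scales them by the 50k-second Computing Model day. It assumes (this is an interpretation, not something the slide states) that the "40% lower" average comes from spreading 50k seconds of data over a full 86,400-second day. The listed rates sum to 996 MB/s, one short of the quoted 997, presumably because the per-site values are rounded.

```python
# Peak export rates per Tier-1 in MB/s, copied from the slide.
PEAK_RATES_MB_S = {
    "ASGC": 65, "FZK": 88, "CNAF": 88, "BNL": 287, "RAL": 102,
    "IN2P3": 109, "SARA": 109, "PIC": 50, "NDGF": 50, "TRIUMF": 48,
}

COMPUTING_MODEL_DAY_S = 50_000   # from the slide
WALL_CLOCK_DAY_S = 24 * 3600     # assumption: averaged over a full day

# Sum of the listed per-site peaks.
total_peak = sum(PEAK_RATES_MB_S.values())

# Assumed scaling from peak rate to daily average rate.
duty_factor = COMPUTING_MODEL_DAY_S / WALL_CLOCK_DAY_S
average_total = total_peak * duty_factor

print(f"sum of listed peak rates: {total_peak} MB/s")
print(f"duty factor: {duty_factor:.3f} (about {100 * (1 - duty_factor):.0f}% lower)")
print(f"implied average out of CERN: {average_total:.0f} MB/s")
```

Under this reading the reduction is closer to 42% than 40%, which is consistent with the slide quoting a rounded figure.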
4
Goals
• Use DQ2 0.3
• Use new ARDA Dashboard monitoring functionality
• Exercise central VO BOXes
• New LFC bulk GUID lookup method
5
Setup
• New LFC server version @ CERN
  – (1.6.3)
• ORACLE-based DQ2 0.3 central catalogs
  – 2 load-balanced front-ends
• Central DQ2 0.3 site services
  – Contacting FTS servers and LFCs, handling bookkeeping and file transfers
• Managed machines, lemon monitoring, RPM auto-updates, …
6
Setup
• Tier-0 plug-in updated for DQ2 0.3
  – Improved exception handling
  – File sizes, checksums recorded centrally on DQ2 catalogs (previously only on LFC)
• Separate ARDA dashboard instance for Tier-0 tests
  – Will be kept later as a DQ2 testbed
7
Setup
• ATLAS daily operations morning meeting
  – Involving all developers
    • ARDA
    • DDM
    • Tier-0
  – … and others whenever necessary
8
Ramp-up (summary)
• 1st week:
  – finalizing setup, SLC3/4/python versions
  – FZK, LYON, BNL
• End of 1st week (weekend):
  – SARA, CNAF
• 2nd week:
  – PIC, TRIUMF, ASGC
• 3rd week:
  – NDGF
• Missing:
  – RAL - was upgrading to CASTOR 2
9
Observations
• A single observation at first…
• … unstable CASTOR
  – GridFTP errors dominating the export
    • Similar errors (rfcp timeouts) when importing data to CASTOR
  – Failure rate as high as 95% reading from CASTOR during ‘near collapse’ periods
    • … and constant multiple transfer attempts per file
  – Did spot T1 failures (e.g. unstable SRM/storage) but these were at the noise level
    • hard to draw meaningful conclusions regarding T1 stability from this run
  – … then went to the details
10
CASTOR stager limits
• The CASTOR architecture is highly dependent on a database (ORACLE) and a scheduling system (LSF)
• The LSF-plugin - the part that “connects” CASTOR with its LSF backend - had a well-known limitation
  – Could cope with ~1000 PENDing jobs in queue (configured limit to avoid a complete stop of the scheduler)
    • As long as the scheduler is capable of dispatching the jobs to the diskservers there could be many thousands of parallel jobs…
• The Tier-0 internal processing puts a load of ~100 jobs into CASTOR
• The Tier-0->Tier-1 export puts a load of ~300 to 600 jobs into CASTOR
  – depending on rate, sites being served etc
• So, not much left for the rest:
  – Simulation production, Tier-1->Tier-0, Tier-2->Tier-0, analysis @ CERNCAF
    • as soon as other activities started, CASTOR died
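The "not much left for the rest" argument is simple arithmetic, sketched here. As a simplification, the sketch treats every job quoted on the slide as counting against the ~1000 pending-job limit. In reality only jobs the scheduler has not yet dispatched are pending, so this is a pessimistic budget rather than a model of LSF.

```python
# Figures quoted on the slide (all approximate).
PENDING_LIMIT = 1000              # pending jobs the LSF plug-in could cope with
TIER0_PROCESSING_JOBS = 100       # Tier-0 internal processing
EXPORT_JOBS_RANGE = (300, 600)    # Tier-0 -> Tier-1 export, low and high estimate


def headroom(limit: int, tier0_jobs: int, export_jobs: int) -> int:
    """Jobs left for all other activities under the simplified budget."""
    return limit - tier0_jobs - export_jobs


# Light export load leaves the most room, heavy export load the least.
best_case = headroom(PENDING_LIMIT, TIER0_PROCESSING_JOBS, EXPORT_JOBS_RANGE[0])
worst_case = headroom(PENDING_LIMIT, TIER0_PROCESSING_JOBS, EXPORT_JOBS_RANGE[1])

print(f"headroom for other activities: {worst_case} to {best_case} jobs")
```

With only a few hundred jobs of headroom shared by simulation production, imports and analysis, a burst from any one of them is enough to hit the limit.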
11
CASTOR stager limits
• ATLAS has centrally organized its CASTOR usage into 4 groups:
  atlast0    Internal Tier-0 user (1st pass processing)
  atlas001   ATLAS users (also including simulation production activities at CERN)
  atlas002   Tier-0 export (managed by DQ2)
  atlas003   CERNCAF (managed by DQ2)
[Figure: plots labelled atlas001 and atlas002]
12
CASTOR stager limits
• Shown are the ‘atlas001’ requests to CASTOR
  – (will show later atlas002, atlast0)
• … under load, requests stopped being served or were served very slowly
[Figure annotations: this T0 exercise didn’t yet have the dedicated ‘atlast0’ account; CASTOR overloaded]
13
CASTOR stager limits
• … same period for export (atlas002) and internal Tier-0 (atlast0)
  – correlation found between simulation production activities, internal Tier-0 processing and Tier-0 export
  – Tier-0 affected by other users’ activities
    • Particularly BAD since CASTOR would be serving the most aggressive user (and Tier-0 is the most carefully throttled user!)
[Figure annotations: atlast0, atlas002; Jobs pending! (long timeouts)]
14
CASTOR limits
• Simultaneous activities of various users caused spikes that exceeded the 1K limit on CASTOR-LSF
  – which then led to slow serving of requests
  – … rfcp’s hanging for > 1 hour (same for GridFTP transfers)
• But the stager was not the only problem with CASTOR
  – files ending up in invalid states (very few)
  – backend database problems
    • ORACLE DB appeared to be swapping and under heavy load
    • Database H/W was upgraded (DB server memory) and the situation has since improved
15
3rd week of exercise, 4-hour period
Overview of export to Tier-1s
Transfer requests not served and overall many timeouts
16
CASTOR problems
Useful error categorization
17
… more error categorization
a sample of non-T0 errors
18
… keeping track of full file transfer history
(example above: 3 DQ2 transfer attempts, no internal FTS retries, CASTOR timeouts)
19
7-day period in March
Throughput measurement also includes downtime periods of CASTOR (several hours)
20
New CASTOR stager
• Agreed with IT to fix CASTOR with maximum urgency
  – task force set up and led by Bernd
  – the existing CASTOR had reached its limits and was no longer sufficient for our needs!
  – decided to set up a separate stager initially, running a new CASTOR version
    • to keep our normal production activities going as before while we test the new one
• New stager introduces a rewritten LSF-plugin that should cope with a significantly higher number of jobs in the queue
  – Will also enable more advanced scheduling as originally planned in CASTOR design
21
While waiting for new CASTOR stager…
• … decided to use the new ARDA data management monitoring to understand all transfer errors
• First lowered the rate for processing at CERN and reduced export to BNL and LYON only
  – so as not to reach CASTOR limits
• Then went on to analyze all error messages as well as throughput:
  – (next slides)
22
Sorting out the streams…
• FTS sets the number of “file streams”
  – this is the number of parallel file transfers
  – but unfortunately, these are not necessarily all active all the time
    • because the SRM handling is included in the FTS transfer “slot”
[Diagram: Slot 1 and Slot 2, each a sequence of SRM, GridFTP, SRM phases; annotation: no network usage… :-(]
23
Sorting out the streams…
• GridFTP sets the number of TCP streams
  – That’s TCP streams per file transfer
• Current numbers for TCP streams and FTS file streams are a result of the Service Challenge tests
  – a few things varied though: e.g. ATLAS file sizes
  – ATLAS is also running 1st pass processing at the Tier-0!
• An important point is that there is no established activity to monitor the state of the network
  – … particularly without all the “overhead”
    • from storage, SRM, FTS, …
  – so we do not really know if our GridFTP timeouts are actually due to CASTOR, dCache or the network (or all combined!)
• Therefore, we have now started a parallel activity doing pure GridFTP transfers to BNL
  – From diskserver to diskserver, no CASTOR or dCache, to try and understand TCP streams and network first
    • then parallel file transfers
    • and double check if we see the same GridFTP errors as for the export
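The interplay between FTS file streams, SRM handling and TCP streams can be put into numbers with a small sketch. All values below are hypothetical (the slides give no per-phase timings or stream counts) and the function names are illustrative only. The point is that a slot spending part of its lifetime on SRM handling moves no data during that time, so the effective number of active transfers is lower than the configured number of file streams.

```python
def network_busy_fraction(srm_seconds: float, gridftp_seconds: float) -> float:
    """Fraction of an FTS slot's lifetime spent in the GridFTP (data) phase."""
    total = srm_seconds + gridftp_seconds
    if total <= 0:
        raise ValueError("slot duration must be positive")
    return gridftp_seconds / total


def effective_parallel_transfers(file_streams: int,
                                 srm_seconds: float,
                                 gridftp_seconds: float) -> float:
    """Average number of slots actually moving data at any moment."""
    return file_streams * network_busy_fraction(srm_seconds, gridftp_seconds)


def max_tcp_streams(file_streams: int, tcp_streams_per_file: int) -> int:
    """Upper bound on concurrent TCP streams on the channel."""
    return file_streams * tcp_streams_per_file


# Hypothetical example: 20 file streams, 30 s of SRM handling and
# 90 s of GridFTP transfer per file, 10 TCP streams per transfer.
busy = network_busy_fraction(30, 90)
active = effective_parallel_transfers(20, 30, 90)
tcp_bound = max_tcp_streams(20, 10)
print(busy, active, tcp_bound)
```

In this made-up case a quarter of each slot is spent outside the network, which is why separating SRM handling from the GridFTP phase matters for throttling file streams accurately.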
24
Sorting out the errors…
• Also did detailed analysis of all error messages
• Important CASTOR errors:
  – 451 Local resource failure: malloc: Cannot allocate memory: not actually a CASTOR error but due to GridView taking up all memory on diskserver!!
• Important GridFTP errors:
  – 421 Timeout (900 seconds): closing control connection.: only dominant error not understood so far
    • Happens all the time whether the system is under load or not
  – Operation was aborted (the GridFTP transfer timed out).: error appears to happen when either CASTOR is overloaded or destination storage is overloaded or as a consequence of other errors (e.g. “end-of-file was reached”)
• Important dCache errors:
  – an end-of-file was reached: appears to happen when file system on diskserver is full (why doesn’t PtP fail??)
• We do have many other errors but at low rate (1/10000) which disappear after a retry
25
dCache problems
• One occurrence of a dCache storage full at LYON
  – expect “graceful” handling from dCache
  – not quite…
State from FTS: Failed; Retries: 1; Reason: TRANSFER error during TRANSFER phase: [GRIDFTP] an end-of-file was reached 8215
State from FTS: Failed; Retries: 1; Reason: TRANSFER error during TRANSFER phase: [GRIDFTP] the server sent an error response: 425 425 Cannot open port: java.lang.Exception: Pool manager error: Bestpool <pool-disk-sc3-12> too high : 2.0E8 4585
State from FTS: Failed; Retries: 1; Reason: TRANSFER error during TRANSFER phase: [GRIDFTP] the server sent an error response: 425 425 Cannot open port: java.lang.Exception: Pool manager error: Bestpool <pool-disk-sc3-6> too high : 2.0E8 1677
State from FTS: Failed; Retries: 1; Reason: TRANSFER error during TRANSFER phase: [GRIDFTP] a system call failed (Connection refused) 126
State from FTS: Failed; Retries: 1; Reason: TRANSFER error during TRANSFER phase: [GRIDFTP] the server sent an error response: 421 421 Timeout (900 seconds): closing control connection. 123
State from FTS: Failed; Retries: 1; Reason: SOURCE error during PREPARATION phase: [GENERAL_FAILURE] CastorStagerInterface.c:2145 Internal error (errno=0, serrno=1015) 69
State from FTS: Failed; Retries: 1; Reason: TRANSFER error during TRANSFER phase: [GRIDFTP] the server sent an error response: 451 451 Local resource failure: malloc: Cannot allocate memory. 46
State from FTS: Failed; Retries: 1; Reason: DESTINATION error during PREPARATION phase: [REQUEST_TIMEOUT] failed to prepare Destination file in 180 seconds 28
State from FTS: Failed; Retries: 1; Reason: TRANSFER error during TRANSFER phase: [GRIDFTP] the server sent an error response: 425 425 Cannot open port: java.lang.Exception: Pool manager error: Bestpool <pool-disk-sc3-6> too high : 2.000000000357143E8 17
State from FTS: Failed; Retries: 1; Reason: TRANSFER error during TRANSFER phase: [GRIDFTP] the server sent an error response: 426 426 Data connection. data_write() failed: Handle not in the proper state 11
State from FTS: Failed; Retries: 1; Reason: TRANSFER error during TRANSFER phase: [GRIDFTP] globus_l_ftp_control_read_cb: Error while searching for end of reply 9
State from FTS: Failed; Retries: 1; Reason: TRANSFER error during TRANSFER phase: [GRIDFTP] the server sent an error response: 553 553 /pnfs/in2p3.fr/data/atlas/disk/sc4/multi_vo_tests/sc4tier0/04/05/T0.E.run008462.ESD._lumi0030._0001__DQ2-1175760332:Cannot create file: CacheException(rc=10006;msg=Pnfs request timed out) 3
State from FTS: Failed; Retries: 1; Reason: SOURCE error during PREPARATION phase: [PERMISSION] [SrmPing] failed: SOAP-ENV:Client - CGSI-gSOAP: GSS Major Status: General failure GSS Minor Status Error Chain: acquire_cred.c:125: gss_acquire_cred: Error with GSI credential globus_i_gsi_gss_utils.c:1323: globus_i_gsi_gss_cred_read: Error with gss credential handle globus_i_gsi_gss_utils.c:1532: globus_i_gsi_gss_create_cred: Error with gss credential handle globus_i_gsi_gss_utils.c:2103: globus_i_gsi_gssapi_init_ssl_context: Error with GSI credential globus_gsi_system_config.c:3475: globus_gsi_sysconfig_get_cert_dir_unix: Could not find a valid trusted CA certificates directory globus_gsi_system_config.c:2996: globus_i_gsi_sysconfig_get_home_dir_unix: Error getting password entry for current user: Error occured for uid: 17680 2
State from FTS: Failed; Retries: 1; Reason: TRANSFER error during TRANSFER phase: [GRIDFTP] the server sent an error response: 451 451 rfio read failure: Connection reset by peer. 2
State from FTS: Failed; Retries: 1; Reason: DESTINATION error during PREPARATION phase: [PERMISSION] [SrmPing] failed: SOAP-ENV:Client - CGSI-gSOAP: GSS Major Status: General failure GSS Minor Status Error Chain: acquire_cred.c:125: gss_acquire_cred: Error with GSI credential globus_i_gsi_gss_utils.c:1323: globus_i_gsi_gss_cred_read: Error with gss credential handle globus_i_gsi_gss_utils.c:1532: globus_i_gsi_gss_create_cred: Error with gss credential handle globus_i_gsi_gss_utils.c:2103: globus_i_gsi_gssapi_init_ssl_context: Error with GSI credential globus_gsi_system_config.c:3475: globus_gsi_sysconfig_get_cert_dir_unix: Could not find a valid trusted CA certificates directory globus_gsi_system_config.c:2996: globus_i_gsi_sysconfig_get_home_dir_unix: Error getting password entry for current user: Error occured for uid: 17680 1
State from FTS: Failed; Retries: 1; Reason: SOURCE error during PREPARATION phase: [PERMISSION] [SrmGet] failed: SOAP-ENV:Client - CGSI-gSOAP: GSS Major Status: General failure GSS Minor Status Error Chain: acquire_cred.c:125: gss_acquire_cred: Error with GSI credential globus_i_gsi_gss_utils.c:1323: globus_i_gsi_gss_cred_read: Error with gss credential handle globus_i_gsi_gss_utils.c:1532: globus_i_gsi_gss_create_cred: Error with gss credential handle globus_i_gsi_gss_utils.c:2103: globus_i_gsi_gssapi_init_ssl_context: Error with GSI credential globus_gsi_system_config.c:3475: globus_gsi_sysconfig_get_cert_dir_unix: Could not find a valid trusted CA certificates directory globus_gsi_system_config.c:2996: globus_i_gsi_sysconfig_get_home_dir_unix: Error getting password entry for current user: Error occured for uid: 17680 1
State from FTS: Failed; Retries: 1; Reason: TRANSFER error during TRANSFER phase: [GRIDFTP] the server sent an error response: 553 553 /pnfs/in2p3.fr/data/atlas/disk/sc4/multi_vo_tests/sc4tier0/04/05/T0.A.run008462.AOD.AOD02._0001__DQ2-1175760207:Cannot create file: CacheException(rc=10006;msg=Pnfs request timed out) 1
State from FTS: Failed; Retries: 1; Reason: TRANSFER error during TRANSFER phase: [GRIDFTP] the server sent an error response: 550 550 /castor/cern.ch/grid/atlas/t0/perm/T0.B.run008456.AOD.AOD04/T0.B.run008456.AOD.AOD04._0001:Address already in use. 1
State from FTS: Failed; Retries: 1; Reason: SOURCE error during PREPARATION phase: [PERMISSION] [SrmGetRequestStatus] failed: SOAP-ENV:Client - CGSI-gSOAP: GSS Major Status: General failure GSS Minor Status Error Chain: acquire_cred.c:125: gss_acquire_cred: Error with GSI credential globus_i_gsi_gss_utils.c:1323: globus_i_gsi_gss_cred_read: Error with gss credential handle globus_i_gsi_gss_utils.c:1532: globus_i_gsi_gss_create_cred: Error with gss credential handle globus_i_gsi_gss_utils.c:2103: globus_i_gsi_gssapi_init_ssl_context: Error with GSI credential globus_gsi_system_config.c:3475: globus_gsi_sysconfig_get_cert_dir_unix: Could not find a valid trusted CA certificates directory globus_gsi_system_config.c:2996: globus_i_gsi_sysconfig_get_home_dir_unix: Error getting password entry for current user: Error occured for uid: 17680 1
State from FTS: Failed; Retries: 1; Reason: SOURCE error during PREPARATION phase: [PERMISSION] [SetFileStatus] failed: SOAP-ENV:Client - CGSI-gSOAP: GSS Major Status: General failure GSS Minor Status Error Chain: (null) 1
State from FTS: Failed; Retries: 1; Reason: TRANSFER error during TRANSFER phase: [GRIDFTP] the server sent an error response: 451 451 rfio read failure: Connection closed by remote end. 1
State from FTS: Failed; Retries: 1; Reason: No status updates received since more than [360] seconds. Probably the process serving the transfer is stuck
Illegible I know…
But the point is, a single cause - storage full - originated over 11k errors of ~20 types!
(we were expecting PtP fail, storage full)
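A listing like the one above becomes easier to read once the lines are grouped by the fields FTS already provides. The sketch below is not the ARDA dashboard code. It is a minimal illustration that assumes each line ends with its occurrence count and follows the "Reason: <scope> error during <phase> phase: [<category>] <message> <count>" shape seen in most entries. The regular expression and the `tally` helper are hypothetical names introduced here, and lines that do not fit the shape are simply skipped.

```python
import re
from collections import Counter

# Assumed line shape; \s* tolerates the fused "duringTRANSFER" seen in raw dumps.
FTS_LINE = re.compile(
    r"Reason: (?P<scope>SOURCE|DESTINATION|TRANSFER) error during\s*"
    r"(?P<phase>\w+) phase: \[(?P<category>[A-Z_]+)\] "
    r"(?P<message>.*) (?P<count>\d+)$"
)

# Three entries copied from the listing (spacing normalized).
SAMPLE = [
    "State from FTS: Failed; Retries: 1; Reason: TRANSFER error during "
    "TRANSFER phase: [GRIDFTP] an end-of-file was reached 8215",
    "State from FTS: Failed; Retries: 1; Reason: TRANSFER error during "
    "TRANSFER phase: [GRIDFTP] a system call failed (Connection refused) 126",
    "State from FTS: Failed; Retries: 1; Reason: DESTINATION error during "
    "PREPARATION phase: [REQUEST_TIMEOUT] failed to prepare Destination "
    "file in 180 seconds 28",
]


def tally(lines):
    """Sum occurrence counts per (scope, phase, category)."""
    totals = Counter()
    for line in lines:
        match = FTS_LINE.search(line)
        if match is None:
            continue  # line does not follow the assumed shape
        key = (match["scope"], match["phase"], match["category"])
        totals[key] += int(match["count"])
    return totals


totals = tally(SAMPLE)
for key, count in totals.most_common():
    print(key, count)
```

Grouping by category alone would not have revealed the root cause here, since the "storage full" condition surfaced mostly as generic [GRIDFTP] errors; the message text still has to be inspected.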
26
dCache problems
• We found transfer problems to BNL on another occasion
• After investigation by BNL (explanation by M. Ernst):
“a full filesystem has lead to the "end-of-file reached" error at BNL, we meant to say that this condition arose on the gridftp server node (the log filled the filesystem because of a high log level configured on this particular node). The problem was not caused due to lack of space on the actual storage repositories. So, since this failure occurred due to the fact that log rotation wasn't working properly - which is rarely happening - I would consider a re-occurance as fairly unlikely.”
• Such errors put considerably more load onto the Tier-0 storage system
  – which again affected the overall behaviour…
27
FTS 2.0
• About 1 week ago started using FTS 2.0 (Pilot installation) to serve LYON transfers from the Tier-0 exercise
  – FTS 2.0 introduces error categorization, e.g.:
    • [REQUEST_TIMEOUT]
    • [INVALID_PATH]
    • [GRIDFTP]
    • [GENERAL_FAILURE]
    • [SRM_FAILURE]
    • [PERMISSION]
    • [INVALID_SIZE]
• We are using FTS 2.0 server with existing FTS 1.x clients
  – Smooth transition without any problem so far!
• Plan for FTS 2.0:
  – continue with existing pilot service
  – then try new FTS 2.0 client
  – then FTS 2.0 to Production?
• FTS 2.0 will be able, in the future, to split SRM from GridFTP handling…
  – this is very important to carefully throttle the number of file streams
28
Conclusions (so far…)
• Detailed analysis of transfer errors, including automatic categorization provided by our monitoring, is very useful!
  – FTS 2.0 clearly an improvement here
• At some point have to assume errors will always exist:
  – e.g. when everything goes “smooth” we see error rate varying between 15% and 5%…
• Still figuring out what to do with bad error reporting:
  – e.g. storage full, how to cope with it?
• Interference between sources/destinations is worrying:
  – a bad file system in one of the BNL diskservers blocked a few transfer slots onto CASTOR@CERN
    • and we have so “few” slots with current CASTOR :(
• Diverting traffic away from Tier-0?
  – e.g. our AOD goes to all Tier-1s but should it all come from CERN?
29
Conclusions (so far…)
• Half-way in our tests
• Tests a success since we found early a critical CASTOR limitation
  – … for which there is a very good prospect of being solved quickly!
• Nonetheless, ATLAS testing schedule is essentially ~1 month late
  – … we should be testing T1->T2s by now!
• Big thank you to everyone in IT and at the Tier-1s for their support!
30
Future plans
1. Fix CASTOR
  • new stager almost ready for ATLAS
2. Re-run exercise using new stager
  1. Tier-0 -> Tier-1s
  2. Then Tier-1 -> Tier-2s
• More monitoring summaries
  – huge amount of information now available centrally and we need to summarize it …
• DQ2 developments now considering:
  • Further split between data taking, simulation production and end-user activities
  • Improvements on data re-routing
31
More information
• https://twiki.cern.ch/twiki/bin/view/Atlas/TierZero20071
• http://dashb-atlas-data-test.cern.ch/dashboard/request.py/site
• http://atlas.web.cern.ch/Atlas/GROUPS/SOFTWARE/DC/Tier0/monitoring/short.html