CERN IT Department
CH-1211 Geneva 23
Switzerland
www.cern.ch/it
ES
ATLAS Site Status Board
Automatic queue exclusion based on downtimes
21st Feb 2012
• ATLAS site topology
• Site exclusion algorithm
• Test results
• First real exclusion and recovery
C. Borrego, S. Campana, S. Gayazov, A. DiGirolamo, X. Espinal, E. Magradze, L. Rinaldi, J. Schovancova, G. Stewart, M. Wrigth
ATLAS site topology
• Based on information from AGIS, Schedconfig
• Mapping between various ATLAS site naming conventions
– AGIS (based on GOCDB/OIM), Panda, DDM
• Populated “exception file”
• ATLAS site-oriented topology
• http://adc-ssb.cern.ch/SITE_EXCLUSION/ATLAS_sites.json
• ATLAS Panda queue-oriented topology
• http://adc-ssb.cern.ch/SITE_EXCLUSION/panda_queues.json
• http://adc-ssb.cern.ch/SITE_EXCLUSION/panda_queues_dict.json
• In touch with the Pilot factory monitoring developers to get the mapping between queues and resources as the Pilot factories see it
– This will enable us to map ANALY queues to CE downtimes
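A minimal sketch of how a site-oriented topology could be turned into a queue-to-site lookup. The record layout and every site, queue and key name below are hypothetical; the actual layout of the JSON files listed above is not shown on this slide.

```python
# Hypothetical site-oriented topology records (NOT the real JSON layout).
# Each record carries the name of one site in the three naming conventions
# plus the Panda queues that belong to it.
SITES = [
    {
        "agis": "SITE-A",
        "panda": "SITE_A",
        "ddm": "SITEA_DATADISK",
        "queues": ["SITE_A-prod", "ANALY_SITE_A"],
    },
    {
        "agis": "SITE-B",
        "panda": "SITE_B",
        "ddm": "SITEB_DATADISK",
        "queues": ["SITE_B-prod"],
    },
]


def build_queue_index(sites):
    """Return {panda_queue_name: agis_site_name} for fast reverse lookup."""
    index = {}
    for site in sites:
        for queue in site["queues"]:
            index[queue] = site["agis"]
    return index


QUEUE_TO_SITE = build_queue_index(SITES)
```

With such an index, a downtime declared for an AGIS site name can be matched against Panda queue names without rescanning the site list.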
Site exclusion
• Queue exclusion based on downtime of a SE, CE (, LFC)
– The exclusion tool underwent thorough testing before it was put into production for the first queues
[Diagram, 18 Oct 2011: flow of site downtime information]
• GOCDB and OIM DB feed AGIS, which holds the site downtime information
• DDM exclusion collector: fetches SE downtime from AGIS
– Site A, SE downtime starts: SE excluded
– Site B, SE downtime over: SE recovered
• Site exclusion collector: fetches SE/CE/LFC downtime from AGIS
– Site C, SE downtime starts: CEs excluded
– Site D, LFC downtime starts: CE(s) excluded, SE(s) excluded
– Site E, CE(s) downtime starts: CE(s) excluded
• Legend: in production / in testing phase
Site exclusion algorithm
• Fetch ongoing and future downtimes from AGIS
• Map downtimes from sites to queues (topology!)
– SRM downtime: action on every queue type (ANALY, prod)
– CE downtime: action only on prod queues
• Decide on the exclusion/recovery action, considering
– time of downtime
– queue type (production, analysis, “special”)
– current queue status
– current queue comment
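The mapping step above can be sketched as follows. This is an illustration under stated assumptions, not the tool's code: the topology layout, the queue names and the function name are hypothetical; only the SRM/CE rule comes from the slide.

```python
# Queue types acted on for a downtime of a given service (rule from the slide):
# an SRM downtime touches every queue type, a CE downtime only production queues.
AFFECTED_TYPES = {
    "SRM": {"analysis", "production"},
    "CE": {"production"},
}


def queues_to_act_on(site, service, topology):
    """Return the queues of `site` that a downtime of `service` acts on.

    `topology` is a hypothetical mapping {site: [(queue_name, queue_type), ...]}.
    Unknown services or sites yield an empty list.
    """
    types = AFFECTED_TYPES.get(service, set())
    return [name for name, qtype in topology.get(site, []) if qtype in types]


# Hypothetical example topology
TOPOLOGY = {
    "SITE_A": [("SITE_A-prod", "production"), ("ANALY_SITE_A", "analysis")],
}

ce_queues = queues_to_act_on("SITE_A", "CE", TOPOLOGY)
srm_queues = queues_to_act_on("SITE_A", "SRM", TOPOLOGY)
```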
Exclusion of a production queue
• 12 hr in advance of a downtime: setoffline with comment “set.offline.by.SSB” if the queue is:
– Online with any comment
– Brokeroff with comment “set.brokeroff.by.SSB”
– Test with comment “HC.Test.Me”
– Otherwise do not touch the queue!
• When the downtime starts: make sure that the queue is set offline when appropriate
– The rules above apply throughout the interval from T-12h to T
• End of downtime, or the downtime disappears (recovery): settest with comment “HC.Test.Me” if the current status is Offline with comment “set.offline.by.SSB”
– Otherwise do not touch the queue!
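The production-queue rules can be written as a small decision function. This is a sketch, not the tool's implementation: the function name, the `hours_to_start` convention and the exact-match comparison of comments are assumptions; the statuses and comments are the ones on the slide.

```python
from typing import Optional, Tuple

OFFLINE_BY_SSB = "set.offline.by.SSB"
BROKEROFF_BY_SSB = "set.brokeroff.by.SSB"
HC_TEST_ME = "HC.Test.Me"

Action = Optional[Tuple[str, str]]  # (new status, new comment), or None = do not touch


def production_action(status: str, comment: str,
                      hours_to_start: Optional[float]) -> Action:
    """Decide what to do with a production queue.

    hours_to_start (assumed convention): hours until the downtime starts,
    <= 0 while it is ongoing, None when there is no (longer a) downtime.
    Comments are compared by exact match (an assumption).
    """
    if hours_to_start is None:
        # Recovery: only queues that were set offline by the SSB go to test.
        if status == "offline" and comment == OFFLINE_BY_SSB:
            return ("test", HC_TEST_ME)
        return None
    if hours_to_start <= 12:
        excludable = (
            status == "online"
            or (status == "brokeroff" and comment == BROKEROFF_BY_SSB)
            or (status == "test" and comment == HC_TEST_ME)
        )
        if excludable:
            return ("offline", OFFLINE_BY_SSB)
    return None  # otherwise do not touch the queue
```

A queue that somebody set offline by hand (any other comment) is never recovered automatically, which is the point of the "otherwise do not touch" rule.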
Exclusion of an analysis queue
• 6 hr in advance of a downtime: setbrokeroff with comment “set.brokeroff.by.SSB” if the queue is:
– Online with any comment
– Brokeroff with comment “set.brokeroff.by.SSB”
– Offline with comment “set.offline.by.SSB”
– Otherwise do not touch the queue!
• 2 hr in advance of a downtime and during the downtime: setoffline with comment “set.offline.by.SSB” if the queue is:
– Online with any comment
– Brokeroff with comment “set.brokeroff.by.SSB”
– Test with comment “HC.Test.Me”
– Otherwise do not touch the queue!
• End of downtime, or the downtime disappears (recovery): settest with comment “HC.Test.Me” if the current status is Offline with comment “set.offline.by.SSB”
– Otherwise do not touch the queue!
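The two-stage analysis-queue rules can be sketched the same way. As before, the function name, the `hours_to_start` convention and exact-match comment comparison are assumptions; the sketch follows the slide literally, including the T-6h rule for queues already offline with the SSB comment.

```python
from typing import Optional, Tuple

OFFLINE_BY_SSB = "set.offline.by.SSB"
BROKEROFF_BY_SSB = "set.brokeroff.by.SSB"
HC_TEST_ME = "HC.Test.Me"

Action = Optional[Tuple[str, str]]  # (new status, new comment), or None = do not touch


def analysis_action(status: str, comment: str,
                    hours_to_start: Optional[float]) -> Action:
    """Decide what to do with an analysis queue.

    hours_to_start (assumed convention): hours until the downtime starts,
    <= 0 while it is ongoing, None when there is no (longer a) downtime.
    """
    if hours_to_start is None:
        # Recovery: only queues set offline by the SSB go to test.
        if status == "offline" and comment == OFFLINE_BY_SSB:
            return ("test", HC_TEST_ME)
        return None
    if hours_to_start <= 2:
        # From T-2h and during the downtime: set offline.
        if (status == "online"
                or (status == "brokeroff" and comment == BROKEROFF_BY_SSB)
                or (status == "test" and comment == HC_TEST_ME)):
            return ("offline", OFFLINE_BY_SSB)
        return None
    if hours_to_start <= 6:
        # Between T-6h and T-2h: set brokeroff.
        if (status == "online"
                or (status == "brokeroff" and comment == BROKEROFF_BY_SSB)
                or (status == "offline" and comment == OFFLINE_BY_SSB)):
            return ("brokeroff", BROKEROFF_BY_SSB)
    return None  # otherwise do not touch the queue
```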
Testing the exclusion idea - 1
• Assembled test data:
– 2 flavours of production queues (only 1 enabled)
– 2 flavours of analysis queues (only 1 enabled)
• The phase space of queue status contains every possible combination of [queue type, queue status, queue comment]:
– FAKE_QUEUE_TYPES (x) FAKE_QUEUE_PREFIXES (x) FAKE_STATES (x) FAKE_COMMENTS, where
– FAKE_QUEUE_TYPES=[DEFAULT_QUEUE_TYPE_PRODUCTION, DEFAULT_QUEUE_TYPE_ANALYSIS, DEFAULT_QUEUE_TYPE_SPECIAL]
– FAKE_QUEUE_PREFIXES={DEFAULT_QUEUE_TYPE_PRODUCTION: ['testsite-testsitece02-at2testsite-pbs_test', 'testsite-testsitece03-at2testsite-pbs_test'], DEFAULT_QUEUE_TYPE_ANALYSIS:['ANALY', 'ANALY2'], DEFAULT_QUEUE_TYPE_SPECIAL:['SPECIAL1', 'SPECIAL2']}
– FAKE_STATES=['online', 'offline', 'test', 'brokeroff']
– FAKE_COMMENTS=['', 'dummy', 'set.offline.by.SSB', 'set.offline.by.SSB.dummy', 'set.brokeroff.by.SSB', 'set.brokeroff.by.SSB.dummy', 'set.online.by.SSB', 'set.online.by.SSB.dummy', 'HC.Test.Me', 'HC.Test.Me.dummy']
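One way to enumerate this phase space is with `itertools.product`. The lists are copied from the slide; the string values given to the three `DEFAULT_QUEUE_TYPE_*` constants are placeholders, since their real values are not shown.

```python
from itertools import product

# Placeholder values: the real constants are not shown on the slide.
DEFAULT_QUEUE_TYPE_PRODUCTION = "production"
DEFAULT_QUEUE_TYPE_ANALYSIS = "analysis"
DEFAULT_QUEUE_TYPE_SPECIAL = "special"

FAKE_QUEUE_TYPES = [DEFAULT_QUEUE_TYPE_PRODUCTION,
                    DEFAULT_QUEUE_TYPE_ANALYSIS,
                    DEFAULT_QUEUE_TYPE_SPECIAL]
FAKE_QUEUE_PREFIXES = {
    DEFAULT_QUEUE_TYPE_PRODUCTION: ['testsite-testsitece02-at2testsite-pbs_test',
                                    'testsite-testsitece03-at2testsite-pbs_test'],
    DEFAULT_QUEUE_TYPE_ANALYSIS: ['ANALY', 'ANALY2'],
    DEFAULT_QUEUE_TYPE_SPECIAL: ['SPECIAL1', 'SPECIAL2'],
}
FAKE_STATES = ['online', 'offline', 'test', 'brokeroff']
FAKE_COMMENTS = ['', 'dummy',
                 'set.offline.by.SSB', 'set.offline.by.SSB.dummy',
                 'set.brokeroff.by.SSB', 'set.brokeroff.by.SSB.dummy',
                 'set.online.by.SSB', 'set.online.by.SSB.dummy',
                 'HC.Test.Me', 'HC.Test.Me.dummy']


def phase_space():
    """Yield every (queue type, prefix, state, comment) combination."""
    for qtype in FAKE_QUEUE_TYPES:
        # Prefixes depend on the queue type, so the product is taken per type.
        for prefix, state, comment in product(FAKE_QUEUE_PREFIXES[qtype],
                                              FAKE_STATES, FAKE_COMMENTS):
            yield (qtype, prefix, state, comment)


CASES = list(phase_space())
```

With 3 queue types, 2 prefixes per type, 4 states and 10 comments this gives 240 test queues.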
Testing the exclusion idea - 2
• “Dashboard” with the timeline for each queue class from the phase space
– http://adc-ssb.cern.ch/SITE_EXCLUSION/switcher/switcher_digest.html
• Log with the actions described in detail
– http://adc-ssb.cern.ch/SITE_EXCLUSION/switcher/switcher_digest.log
• Test downtimes:
– SRM: from 2012-02-05 23:30 UTC to 2012-02-06 02:00 UTC
– SRM: from 2012-02-06 04:30 UTC to 2012-02-06 06:00 UTC
– SRM: from 2012-02-07 04:30 UTC to 2012-02-07 06:00 UTC
– CE: for each queue, 2012-02-06 from 8am to 9am UTC
The exclusion algorithm does what is expected, when it is expected!
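As a worked example, the lead times from the exclusion rules (12 hr for production queues; 6 hr and 2 hr for analysis queues) can be applied to the first SRM test downtime. The function and dictionary names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Lead times taken from the exclusion rules: (queue type, action) -> advance notice.
LEAD_TIMES = {
    ("production", "offline"): timedelta(hours=12),
    ("analysis", "brokeroff"): timedelta(hours=6),
    ("analysis", "offline"): timedelta(hours=2),
}


def action_times(downtime_start):
    """Return the earliest time each exclusion action is due for a downtime."""
    return {key: downtime_start - lead for key, lead in LEAD_TIMES.items()}


# First SRM test downtime: starts 2012-02-05 23:30 UTC.
SRM_START = datetime(2012, 2, 5, 23, 30, tzinfo=timezone.utc)
TIMES = action_times(SRM_START)

for (qtype, action), when in TIMES.items():
    print(f"{qtype:<10} -> {action:<9} from {when:%Y-%m-%d %H:%M} UTC")
```

For this downtime, production queues would go offline from 11:30 UTC, and analysis queues would go brokeroff from 17:30 UTC and offline from 21:30 UTC, all on 2012-02-05.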
Real actions
• After thorough testing and improvements to the log debugging features for operations, we started taking real actions for several queues
– https://atlas-logbook.cern.ch/elog/ATLAS+Computer+Operations+Logbook/33952
The exclusion tool does what is expected, when it is expected!
• Tested with ifae and UKI-SCOTGRID-DURHAM, which have downtimes today.
• Next in the pipeline is SFU-LCG2.
Operational experience - 1
• Every action is logged, so it is easier to debug if something goes wrong.
– http://adc-ssb.cern.ch/SITE_EXCLUSION/switcher/switcher.log
• Found a few minor issues on the way
– Fetched only future downtimes from AGIS. Fixed: now fetching ongoing and future downtimes.
– Disabled all real queues for the past night. Fixed: now all queues from elog:33952 are enabled again.
The exclusion tool takes only the actions we intend it to take!
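The first issue is easy to picture as a filter on downtime records. In the sketch below the record layout (`site`, `start`, `end`), the site names and both function names are hypothetical; it only illustrates the difference between "future" and "ongoing and future".

```python
from datetime import datetime, timedelta, timezone


def future_only(downtimes, now):
    """Selection with the issue: a downtime that already started is dropped."""
    return [d for d in downtimes if d["start"] > now]


def ongoing_and_future(downtimes, now):
    """Fixed selection: keep every downtime that has not ended yet."""
    return [d for d in downtimes if d["end"] > now]


NOW = datetime(2012, 2, 21, 12, 0, tzinfo=timezone.utc)
DOWNTIMES = [
    # ongoing: started 2 h ago, ends in 2 h
    {"site": "SITE_A", "start": NOW - timedelta(hours=2), "end": NOW + timedelta(hours=2)},
    # future: starts tomorrow
    {"site": "SITE_B", "start": NOW + timedelta(days=1), "end": NOW + timedelta(days=2)},
    # past: already over
    {"site": "SITE_C", "start": NOW - timedelta(days=2), "end": NOW - timedelta(days=1)},
]

BEFORE_FIX = [d["site"] for d in future_only(DOWNTIMES, NOW)]
AFTER_FIX = [d["site"] for d in ongoing_and_future(DOWNTIMES, NOW)]
```

With the future-only selection, a queue at SITE_A would look as if its downtime had disappeared the moment the downtime started, which the recovery rule would treat as a reason to set the queue to test.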
Summary
Using ATLAS site topology
– http://adc-ssb.cern.ch/SITE_EXCLUSION/ATLAS_sites.json
First real exclusions and recoveries successful!
Next steps:
– Add more queues to real actions
– Add more configurability (now: system-wide)
Questions?