ATLAS Central Services and Computing Security
Flavia Donno, CERN/IT-ES-VOS
04/21/23 1Flavia Donno, Tier1 Service Coordination
Outline
• Security in ATLAS: can this model be exported to other LHC experiments?
  – Why we do it
  – How we do it: policies and plan
  – Handling a security incident in ATLAS
• The ATLAS Central Services Operations Team: can this evolve into a general service for LHC experiments?
  – Who we are and what we do
  – Some details: service inventory and the web redirector
  – Better interface and documentation to CERN/IT available tools
ATLAS Security: The Goal
• Minimize ATLAS central services unavailability during data taking, mitigating vulnerabilities and improving operation and management of ATLAS services. Preserve ATLAS reputation worldwide.
  – The focus is on security
  – Security cannot be enforced without good service management practices (→ Policies)
  – Strategy: increase robustness and availability of critical services while containing vulnerabilities of less critical ones (→ Plan)
• Other sites supporting ATLAS can follow the same policies where applicable
ATLAS Security: references and documentation
• Various CERN IT and wLCG/EGEE documents define policies and good practices:
  – VOBox Service Level Agreement (CERN-IT/FIO-FS)
    • https://twiki.cern.ch/twiki/pub/FIOgroup/FsSLA/sla-v1.2.1.pdf
  – VOBox Security Recommendations and Questionnaire (wLCG Joint Security Policy Group)
    • https://edms.cern.ch/file/639856/0.6/VO-Box-security-policy-0-6.pdf
  – GRID system administrators best practice and guidelines (EGEE security group)
    • http://rss-grid-security.cern.ch/glite.php?display=1
SECURITY POLICIES AND PLAN FOR CERN ATLAS CENTRAL SERVICES
https://twiki.cern.ch/twiki/pub/Atlas/ATLASInternalSecurityPlan/ATLAS_Security_PlanV20.pdf
–Special thanks to Sebastian Lopienski, Romain Wartel and Stefan Lueders
Key figures and roles
The policies
P1 (AM and SM): Requests for new hardware must be made through the ATLAS VOC (VO Contact), via [email protected] or https://savannah.cern.ch/support/?group=atlascsops, and must be justified. Details about the services the machine will run must be provided.
P2 (AM and SM): AM and SM are responsible for requesting new hardware to replace machines nearing warranty expiration. They are also responsible for requesting hardware to guarantee high availability of their service if needed (DNS load balancing, hot spares, etc.).
Sensors are provided to warn SM of warranty expirations.
Changes cannot be made to existing hardware configurations.
The policies
P3 (AM and SM): They must respond in a timely manner (within 1 working day) to inquiries coming from the VOC. AM and SM are also responsible for security issues with their service. For security inquiries, the response must come within 1 hour in case of severe incidents, and within a few hours otherwise. The subject of the e-mail is of the form “[Important|Severe]: SI#”, where # is an internal sequence number. If they are unresponsive, the call will be escalated to the ADC coordinator and the ATLAS computing coordinator.
P4 (VOC): The VOC must keep an updated list of all ATLAS SM and AM contacts.
P5 (AM): The AM must provide the VOC with installation and configuration details about the services, the need for data backup, connectivity details, procedures for draining a service, etc. Details must be given according to the ATLAS Service Documentation Template (*).
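The notification convention in P3 can be sketched in a few lines of Python. This is only an illustration, not ATLAS tooling: the names `incident_subject` and `response_deadline` are hypothetical, the brackets in “[Important|Severe]” are read as "one of the two words", and the 4-hour window is an assumed stand-in for "a few hours", which the policy does not quantify.

```python
from datetime import datetime, timedelta

# Response windows per severity. Only the 1-hour window for severe incidents
# comes from policy P3; 4 hours is an ASSUMED value for "a few hours".
RESPONSE_WINDOWS = {
    "Severe": timedelta(hours=1),
    "Important": timedelta(hours=4),  # assumption, not stated in the policy
}


def incident_subject(severity: str, sequence_number: int) -> str:
    """Build an e-mail subject of the form 'Severe: SI12' (hypothetical helper)."""
    if severity not in RESPONSE_WINDOWS:
        raise ValueError(f"unknown severity: {severity!r}")
    return f"{severity}: SI{sequence_number}"


def response_deadline(severity: str, sent_at: datetime) -> datetime:
    """Return the latest time by which the AM/SM is expected to respond."""
    if severity not in RESPONSE_WINDOWS:
        raise ValueError(f"unknown severity: {severity!r}")
    return sent_at + RESPONSE_WINDOWS[severity]
```

A VOC script could call `incident_subject("Severe", 12)` when sending the notification and compare the current time with `response_deadline(...)` to decide when to escalate to the ADC coordinator.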
The Service Documentation Card
If you edit your service twiki page, you are presented with the edit window. At the bottom you will find the “Add Form” button.
(*) https://twiki.cern.ch/twiki/bin/view/Atlas/ATLASServiceDocumentationTemplate
No need to copy and paste. Just follow the instructions to add a form to your twiki page.
The policies
P6 (VOC and AM): All ATLAS central services at CERN must be “quattorized” (**).
P7 (SM and AM): SM and AM should provide lemon sensors and procedures that allow for monitoring, raising alarms and directing the operators for their services (***).
(**) https://twiki.cern.ch/twiki/bin/view/Atlas/CentralServicesManagementPoliciesAndProcedures
Procedures and good practices are listed in the twiki. It is ongoing work.
• Distribution through rpms (ATLAS software repository)
• SLC5 migration
• ATLAS secure external configuration repository (based on SINDES)
• Protecting DB passwords
• Good practices (aliases, http vs https and CERN SSO, svn checkout, …)
(***) The VOBox service is provided as a business hours service on CERN working days. For 24/7 coverage, procedures can be provided for the operators (OP).
– The VOC or AM must react to monitoring alarms escalated to them.
– The VOC or AM should be careful not to take actions that could raise alarms.
The policies
P8 (VOC, SM, AM): They must attend security courses and keep informed and up to date about security threats for the services they provide and manage. They are responsible for ensuring that their software does not pose security threats, that access to databases is secure and sufficiently monitored, that stored data are compliant with legal requirements, and that VO services, including pilot job frameworks, are operated according to the applicable policy documents. It is the responsibility of the ATLAS VOC to publicize available security courses within the ATLAS collaboration. Operators/shifters must be trained to detect security anomalies.
System-level security compliance is monitored by the CERN IT security team.
SM/AM must proactively check for security-related updates.
Housekeeping of the machines is the responsibility of the VOC and AM (log rotation, tmp space management, etc.).
The policies
P9 (ATLAS, AM, SM): The management agrees that people working on computing will spend 5-10% of their time on security (1-2 days per month).
P10 (All): Security incidents must be reported to the ATLAS VOC and to CS ([email protected] or phone 70500) in accordance with [a] and [b]. Depending on the severity of the incident, services can be taken offline either in agreement with AM/SM or, in case of unresponsiveness, independently. Services can be restored on hot spares if available.
[a] Report a Computer Security Incident: http://lcg.web.cern.ch/LCG/incident.htm
[b] LCG Agreement on Incident Response: https://edms.cern.ch/document/428035
More on Security in ATLAS
– The AM must make available to the VOC and VO operators the list of possible threats/risks and the actions taken to mitigate them (software updates, security patches, etc.). In particular, services should depend as little as possible on specific package versions, allowing for automatic updates.
– The AM is responsible for training ATLAS operators to identify anomalies with the services.
– The SM is responsible for monitoring the service for correct behavior (load, performance, available free space, etc.), since deviations might reveal an intrusion.
– The VOC will regularly review machine configurations to identify possible security threats (limited interactive or root access, interactive access [or sudo privileges!!!] to generic accounts, open doors, access control lists, unneeded services, world-writable files/directories [used in init.d], etc.). We work with FIO to make this process semi-automatic.
– The VOC might run security challenges against specific services if needed.
– VOC, AM and SM are responsible for following the recommended procedures:
  • http://rss-grid-security.cern.ch/glite.php
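One of the configuration checks listed above, looking for world-writable files and directories, is easy to sketch. The following is a minimal illustration using only the Python standard library; the function name `find_world_writable` is hypothetical and this is not the semi-automatic procedure developed with FIO.

```python
import os
import stat
from typing import List


def find_world_writable(root: str) -> List[str]:
    """Return paths below root whose mode grants write access to 'others'."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                mode = os.lstat(path).st_mode  # lstat: do not follow symlinks
            except OSError:
                continue  # entry vanished or cannot be inspected; skip it
            if stat.S_ISLNK(mode):
                continue  # permissions on the link itself are not meaningful
            if mode & stat.S_IWOTH:
                hits.append(path)
    return sorted(hits)
```

Running it on a directory such as `/etc/init.d` would list candidates for review. Directories that are world-writable by design (for example `/tmp`, protected by the sticky bit) will also show up, so the output is a starting point for inspection rather than a verdict.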
The plan
…such tasks will be repeated every three months.
Handling a security incident in ATLAS
• Report all suspected security incidents to the ATLAS VOC (atlas-adc-central-[email protected]) and to [email protected], or phone 70500. Please use caution and discretion.
• The ATLAS VOC will coordinate the overall incident response process in ATLAS: in particular, the VOC will contact the AM/SM for the specific service. The AM/SM will be contacted both via phone (whenever provided) and e-mail. The subject of the e-mail is of the form “[Important|Severe]: SI#”, where # is an internal sequence number. In case of a severe incident we expect the AM/SM to respond within 1 hour.
• The service will be removed from the public network (restricting it to local access if possible), weighing this against the impact on the experiment and the organization. Depending on the severity of the incident, services can be shut down either in agreement with AM/SM or, in case of unresponsiveness, independently (contacting ATLAS management and informing user communities).
Handling a security incident in ATLAS
• As in the case of normal maintenance interventions, if specific procedures are provided by the AM/SM concerning the unavailability of the service, the ATLAS VOC will follow them (e.g. informing user communities, advertising downtime, redirecting dependent services, operation procedures, etc.).
• In agreement with the AM/SM, and if the nature of the incident allows it, the service could be partially restored on hot spares, if previously arranged.
• The original machines on which the compromised service was running will be made available to the CERN Security Team for analysis. However, it is the responsibility of the AM and SM to investigate the incident with the advice and guidance of the CERN Security Team.
Handling a security incident in ATLAS
• Once the service vulnerability is mitigated by the FM/AM/SM, the ATLAS VOC will coordinate with the CS for a service examination/scan to make sure the mitigation is effective.
• Once the vulnerability problem is solved, the service is restored.
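The incident-handling steps described on these slides form a simple linear workflow, which can be sketched as a small state machine. The stage names and the `advance` helper below are hypothetical labels chosen for illustration; the slides describe the steps in prose only, and the optional partial restoration on hot spares is left out for brevity.

```python
from enum import Enum


class Stage(Enum):
    """Hypothetical labels for the steps described in the slides."""
    REPORTED = "incident reported to the ATLAS VOC and the CERN Security Team"
    CONTACTED = "AM/SM contacted by phone and e-mail"
    ISOLATED = "service removed from the public network or shut down"
    MITIGATED = "vulnerability mitigated by the AM/SM"
    SCANNED = "mitigation verified by a service examination/scan"
    RESTORED = "service restored"


# Allowed forward transitions, in the order the slides present them.
NEXT_STAGE = {
    Stage.REPORTED: Stage.CONTACTED,
    Stage.CONTACTED: Stage.ISOLATED,
    Stage.ISOLATED: Stage.MITIGATED,
    Stage.MITIGATED: Stage.SCANNED,
    Stage.SCANNED: Stage.RESTORED,
}


def advance(stage: Stage) -> Stage:
    """Move an incident to the next stage; RESTORED is final."""
    if stage not in NEXT_STAGE:
        raise ValueError(f"{stage.name} is a final stage")
    return NEXT_STAGE[stage]
```

Encoding the order explicitly makes one point of the procedure visible: a service cannot reach `RESTORED` without passing through `SCANNED`, which mirrors the requirement that the mitigation be checked before the service comes back.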
• Started with B. Koblitz who left in July 2009
• The team:
  – Flavia Donno (CERN-IT/GS), coordination
  – Serguei Baranov (JINR, Russia)
  – Sergey Makarychev (ITEP, Russia), here until the end of December 2009
  – Alexey Buzykaev (BINP, Russia), left on November 18th, 2009
The ATLAS Central Services Team
• The mandate:
  – ATLAS contact for CERN/IT (Alarm Tickets, security, etc.)
  – Management of ATLAS Central Services according to the SLA between CERN-IT and ATLAS (hardware requests, quattor, best practices, etc.)
  – Providing assistance with machine, software and service management to ATLAS service developers and providers (distribution practices, software versions, rebooting, etc.)
  – Provision of general frameworks, tools and practices to guarantee availability and reliability of services (squid/frontier distribution, web redirector, hot spares, sensors, better documentation, etc.)
  – Collection of information and operational statistics (service inventory, machine usage, etc.)
  – Enforcing the application of security policies
  – Training of newcomers in the team
  – Spreading knowledge among service developers and providers about the tools available within CERN/IT (Shibboleth2, SLS, etc.)
  – Participating in the work of the ADC Operations Team
  – And more …
The ATLAS Central Services Team
• Communication channels:
  – Support requests are submitted here: https://savannah.cern.ch/support/?group=atlascsops
  – You can contact us via e-mail: [email protected]
• Documentation:
  – Useful twiki pages:
    https://twiki.cern.ch/twiki/bin/view/Atlas/CentralServicesManagementPoliciesAndProcedures
    https://twiki.cern.ch/twiki/bin/view/Atlas/ADCMachinesHowto
  – Public Machine and Service Inventory: https://twiki.cern.ch/twiki/bin/view/Atlas/DistributedComputingMachines
  – Security Document: https://twiki.cern.ch/twiki/pub/Atlas/ATLASInternalSecurityPlan/ATLAS_Security_Planv20.pdf
Service Inventory
https://twiki.cern.ch/twiki/bin/view/Atlas/DistributedComputingMachines
Service Inventory
• We manage a total of 184 machines [virtual/development/production]
• Production
  – 13 machines for the Run Time Tester service
  – 64 machines for the ATLAS Build services
  – 53 machines used for production services (DQM, Site Services, Central Catalogues, T0, etc.)
• Development
  – 23 development machines
  – 20 virtual machines (on 3 physical ones)
• Spares
  – 5 spares
  – 6 unknown
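As a quick consistency check, the inventory figures can be tallied per category and compared with the stated total of 184. The dictionary layout below is only an illustrative way of holding the numbers from the slide; the grouping of the 6 "unknown" machines with the spares follows the slide's layout and is otherwise an assumption.

```python
# Machine counts as listed on the slide, grouped by its three headings.
INVENTORY = {
    "Production": {
        "Run Time Tester service": 13,
        "ATLAS Build services": 64,
        "Production services": 53,
    },
    "Development": {
        "Development machines": 23,
        "Virtual machines": 20,
    },
    "Spares": {
        "Spares": 5,
        "Unknown": 6,  # grouped here because the slide lists it under Spares
    },
}


def category_totals(inventory):
    """Sum the machine counts within each category."""
    return {category: sum(items.values()) for category, items in inventory.items()}


totals = category_totals(INVENTORY)
grand_total = sum(totals.values())
print(totals)       # per-category subtotals
print(grand_total)  # should match the 184 machines quoted on the slide
```

The subtotals come to 130 production, 43 development and 11 spare or unknown machines, which add up to the 184 quoted.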
Availability in ATLAS
– Normal operations are complemented by the following actions to ensure availability of services:
– Every ~6 months the ATLAS VOC will exercise rebooting and moving some critical, load-balanced services onto hot spares in order to be prepared for emergencies.
– In coordination with the Security Team, the ATLAS VOC performs regular checks on the existing services for reported vulnerabilities.
– The ATLAS VOC makes web framework services (such as the ATLAS web redirector) available to AM/SM. Such frameworks provide a secure and CERN-supported environment in which to run web services (Shibboleth, CERN-approved software packages, safe services, etc.). In case the web redirector cannot be used, recommendations will be given to avoid common problems.
– The ATLAS VOC advises and provides support on specific software packages and their versions to be used on specific platforms (Python 2.6, emacs, mod_python, mod_wsgi, Django, etc.).
– The ATLAS VOC provides sensors for the most common alarm needs.
The web redirector
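• https://atlas-minimumbias.cern.ch
  – Administration account: atmbadm
  – Server account: atmbsrv
• https://atlas-trigconf.cern.ch
  – Administration account: attrcadm
  – Server account: attrcsrv
• https://atlas-runquery.cern.ch
  – Administration account: atrqadm
  – Server account: atrqsrv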
Conclusions
• Security is important both for obtaining what we are working for (physics results) and for the good reputation of ATLAS and all LHC experiments
  – Please help us enforce it!
  – The ATLAS security model is general and can be applied to other LHC experiments
• Central Services Operations Teams have an important role in the smooth running of experiment services
  – The goal is to improve service availability, reliability and management
  – We depend on CERN/IT, but a lot remains to be done
  – Can the ATLAS model be expanded in order to offer a common service to all LHC experiments?
Comments, suggestions, corrections?
Please send e-mail to [email protected]