Planning for LCG Emergencies, HEPiX, Fall 2005, SLAC, 13 October 2005, David Kelsey, CCLRC/RAL, UK
Planning for LCG Emergencies
HEPiX, Fall 2005
SLAC, 13 October 2005
David Kelsey
CCLRC/RAL, UK
13-Oct-05 David Kelsey, LCG Emergencies 2
LHC Tier 0/1/2 Network Architecture
[Diagram: T0 (CERN) linked to the Tier 1 centres (IN2P3, GridKa, TRIUMF, ASCC, Fermilab, Brookhaven, Nordic, CNAF, SARA, PIC, RAL), with Tier 2 (T2) sites around them.]
• Special Purpose Optical Private Network: GEANT2 + NREN 10 Gbit circuits and LHCNet dedicated 10 Gbit links to the US
• General Purpose IP Research Networks: NRENs, GEANT2, LHCNet, ESnet, Abilene, dedicated links, etc.
Background
• Computing and networking are essential
  – Tier 0 (CERN) and 12 Tier 1 critical for data taking
    • 10 Gbps Optical Private link to each T1
  – The T1s collectively keep a second copy of the raw data
  – The T1s play a vital role in (re)processing and providing access to derived data
  – During data taking, can cope with a Tier 0 - Tier 1 link being down for 12 hours to less than a few days. All T1s down – very bad!
  – LCG MoU requires avg T1 uptime during data taking: 99%
• LCG TDR says
  – “Special attention needs to be paid to the security aspects of the Tier-0, the Tier-1s and their network connections to maintain these essential services during or after an incident so as to reduce the effect on LHC data taking.”
• LCG also essential for analysis
• Need to keep the Grid running at all times
  – Therefore must deal quickly with incidents
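To get a feel for the MoU figure of 99% average T1 uptime, the sketch below converts an uptime percentage into a downtime budget. The 30-day window is an assumption made for illustration only; the slide does not say over what period the average is taken, and the function name is hypothetical.

```python
def downtime_budget_hours(uptime_percent: float, period_days: float) -> float:
    """Hours of downtime allowed in a period at a given average uptime."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    total_hours = period_days * 24.0
    return total_hours * (100.0 - uptime_percent) / 100.0


# Assumed 30-day averaging window (hypothetical; not taken from the MoU).
budget = downtime_budget_hours(99.0, 30)
print(f"{budget:.1f} hours of downtime allowed in 30 days")
```

Under that assumed window, a single 12-hour outage at a T1 would already exceed the budget.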
Security Incident Response
• Joint (LCG/EGEE) Security Policy Group & EGEE Operational Security Coordination Team
  – Based Security Incident Response Policy and procedures on work of Open Science Grid
• Agreement on Incident Response
  See https://edms.cern.ch/document/428035/
• Sites must
  – Take local action to prevent disruption
  – Report to local security officers
  – Report to others via Grid Incident Response mail list
• “Volunteer” incident response team created when needed
Incident classification
• High: (team leader required)
  – The incident could lead to exploitation of the trust fabric, i.e. user and host identities, or the incident could lead to instability of the overall Grid, or a denial-of-service is in progress against all replicas of a given Grid service.
• Medium: (team leader required if widespread)
  – The incident affects an instance of a Grid service, but Grid stability is not at risk, or a denial-of-service affects one replica of a given Grid service, or a local attack compromised a privileged user account.
• Low: (team leader probably not required)
  – A local attack compromised individual user, non-privileged credentials, or a denial-of-service attack or compromise affects only local grid resources.
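The three levels above can be read as a small decision rule. The following is a minimal sketch of that reading, not an official tool: every field and function name is hypothetical, and treating anything that matches neither High nor Medium as Low is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


@dataclass
class Incident:
    # Hypothetical flags that mirror the wording of the classification.
    trust_fabric_at_risk: bool = False       # user or host identities exploitable
    grid_stability_at_risk: bool = False     # overall Grid could become unstable
    dos_against_all_replicas: bool = False   # DoS on every replica of a service
    service_instance_affected: bool = False  # one Grid service instance affected
    dos_against_one_replica: bool = False    # DoS on a single replica
    privileged_account_compromised: bool = False


# Team-leader guidance quoted from the classification above.
TEAM_LEADER = {
    Severity.HIGH: "required",
    Severity.MEDIUM: "required if widespread",
    Severity.LOW: "probably not required",
}


def classify(incident: Incident) -> Severity:
    """Return the highest level whose criteria the incident meets."""
    if (incident.trust_fabric_at_risk
            or incident.grid_stability_at_risk
            or incident.dos_against_all_replicas):
        return Severity.HIGH
    if (incident.service_instance_affected
            or incident.dos_against_one_replica
            or incident.privileged_account_compromised):
        return Severity.MEDIUM
    # Assumption: everything else (local, non-privileged) counts as Low.
    return Severity.LOW


level = classify(Incident(privileged_account_compromised=True))
print(level.value, "- team leader", TEAM_LEADER[level])
```

Checking the High criteria first means an incident that matches several levels is always reported at the most severe one.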
Emergency procedures
• JSPG discussed this at last meeting (Sep 2005)
• Started from point of view of Security incidents
  – But quickly realised that other disasters are also likely, so should deal with these too
• Very early overview of the issues at this point
  – Certainly no plan yet
  – Invite feedback from HEPiX
    • There must be lots of site-based plans
• JSPG will produce a draft emergency plan (and address policy issues)
  – Grid Operations and OSCT will need to define the details
JSPG discussion topics
• What is the scope?
  – LCG vs EGEE?
  – Critical: Tier 0/1, data taking, data integrity
• Inter-site information flow
  – This is the critical point to be tackled
  – Users, Sys Admins and Managers
• External information
  – including interface(s) to the Press
• How do we keep the infrastructure operational?
  – Is this the aim?
• What do we take down?
  – And who decides?
• Can optical private networks remain up?
  – And are they sufficient for LCG data taking?
• How do we deal with Tier 2 problems?
LCG/EGEE Emergency Procedures
Denise Heagerty
CERN
When are emergency procedures required?
Emergency procedures are required to cover the following cases:
• Incident response plans cannot be followed: critical parts of the infrastructure are unavailable (e.g. mailing lists)
• Incident response plans are inappropriate: e.g. there is a need to rapidly inform large parts of the community beyond the security contacts, or incident communication channels are compromised
• Examples
  – Major power cut at Site A lasted several days
  – Cable cut network access to Site B
  – Major worm disrupted network access at Site C
  – Security incident blocks user access to accounts at Site D
  – Wide area exploit of the (homogeneous) security fabric
What is needed in an emergency?
• Out-of-band communication channels
  – Alternative service providers (Internet, telephony)
  – Alternative contact details (e-mail, chat, …)
  – Alternative technology
• Clear decision-making roles
  – There is no time for consensus during a crisis
  – Usual decision-making process needs to be bypassed
• Clear information flow and roles
  – For at least management, users, the press
  – Reduce the risk of miscommunication
• Disaster Recovery Plan
  – Definition of critical infrastructure to be kept running or repaired quickly
  – Dependencies and sequence must be clear for restoring services
  – Mailing lists (at CERN) are key to restoring communication
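One way to make "dependencies and sequence" explicit is to record which services each service needs and derive a restore order by topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9+). The service names and their dependencies are purely illustrative assumptions, not an actual LCG inventory.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists what must be up first.
DEPENDS_ON = {
    "network": set(),
    "dns": {"network"},
    "mailing_lists": {"network", "dns"},
    "web_status_page": {"network", "dns"},
    "grid_services": {"network", "dns", "mailing_lists"},
}


def restore_order(depends_on):
    """Return services in an order that restores dependencies first.

    TopologicalSorter raises graphlib.CycleError if the map is circular,
    which would mean the recovery plan itself is inconsistent.
    """
    return list(TopologicalSorter(depends_on).static_order())


for step, service in enumerate(restore_order(DEPENDS_ON), start=1):
    print(step, service)
```

Several valid orders may exist for the same map; the sort guarantees only that no service appears before something it depends on.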
Some ideas to stimulate discussion
• Define an emergency advisory committee?
  – Members, mandate
  – Goal is to ensure rapid and appropriate decisions
• Assure information flow
  – E.g. update DNS servers to point to temporary (web) servers
  – Pre-record messages on telephone help services
• Prepare alternative communication channels
  – E.g. commercial conference call facilities
  – Alternative Internet providers (e-mail addresses, chat, phone, …)
• When/do we return to normal Incident Response?
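The idea of prepared alternative channels can be sketched as an ordered fallback list: try the normal channel first, then each alternative in turn. This is a hedged illustration only. The channel names and the stub availability checks are hypothetical, and a real check would have to run out of band.

```python
from typing import Callable, Optional, Sequence, Tuple

# A channel is a name plus a zero-argument availability check.
Channel = Tuple[str, Callable[[], bool]]


def first_working_channel(channels: Sequence[Channel]) -> Optional[str]:
    """Return the name of the first channel whose check succeeds, else None."""
    for name, is_up in channels:
        try:
            if is_up():
                return name
        except Exception:
            # A check that fails outright is treated as "channel down".
            continue
    return None


# Stub checks standing in for real probes (hypothetical values).
channels = [
    ("primary_mailing_list", lambda: False),
    ("alternative_email_provider", lambda: False),
    ("commercial_conference_call", lambda: True),
]
print(first_working_channel(channels))
```

Keeping the list ordered by preference makes the choice of channel a matter of prior agreement rather than something to be decided during the crisis.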
Final words
• LCG needs a written plan
• Clear definition of roles
• Operations staff need to know what to do
  – Training
• The sites need to agree to policy and procedures
  – Recognise the powers of operations staff
• Sites already have their own internal plans
  – Now trying to extend to the Grid
• Feedback and advice are welcome!