Maite Barroso: Grid Operations LHCC review, CERN, 25th September 2006. Enabling Grids for E-sciencE, EGEE-II INFSO-RI-031688, OSG-doc-498. Operations EGEE and OSG. Maite Barroso, CERN; Ruth Pordes, Fermilab. LHCC Comprehensive Review, 25th September, 2006


Page 1:

Maite Barroso: Grid Operations LHCC review, CERN, 25th September 2006

Enabling Grids for E-sciencE

EGEE-II INFSO-RI-031688
OSG-doc-498

Operations EGEE and OSG

Maite Barroso, CERN
Ruth Pordes, Fermilab

LHCC Comprehensive Review

25th September, 2006

Page 2:

Outline

• EGEE operations
• OSG operations
• EGEE – OSG interoperations

Page 3:

EGEE Infrastructure: size

EGEE:
• > 190 sites, 40 countries
• ~ 155 sites certified and in production
• > 28,000 processors, ~ 26 PB storage

Page 4:

EGEE Infrastructure: usage

[Figure: Jobs per day, split into LCG, BioMed and Other; x-axis May-05 to Aug-06; y-axis 0 to 60000]

[Figure: Normalised CPU in k.SI2k hours, split into LCG, BioMed and Other; x-axis May-05 to Aug-06; y-axis 0 to 5,000,000; annotation: ~6000 cpu-months/month]
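
As a rough cross-check of the "~6000 cpu-months/month" annotation, the sketch below converts between CPU-months and normalised kSI2k hours. The rating of 1 kSI2k per CPU and the 730-hour month are assumptions made for illustration; neither figure appears on the slides.

```python
# Assumed values (not from the slides): average CPU rating and month length.
KSI2K_PER_CPU = 1.0      # hypothetical average rating of one CPU, in kSI2k
HOURS_PER_MONTH = 730.0  # 365 days * 24 hours / 12 months


def cpu_months_to_ksi2k_hours(cpu_months: float) -> float:
    """Convert CPU-months of work into normalised kSI2k hours."""
    return cpu_months * HOURS_PER_MONTH * KSI2K_PER_CPU


def ksi2k_hours_to_cpu_months(ksi2k_hours: float) -> float:
    """Convert normalised kSI2k hours back into CPU-months."""
    return ksi2k_hours / (HOURS_PER_MONTH * KSI2K_PER_CPU)


monthly = cpu_months_to_ksi2k_hours(6000)
print(f"{monthly:,.0f} kSI2k hours per month")
```

Under these assumptions, 6000 cpu-months corresponds to about 4.4 million kSI2k hours, which sits inside the 0 to 5,000,000 range of the chart's axis.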

Page 5:

EGEE operation: Key objectives

• Grid management
– ROCs, relations with resource providers through negotiation of service-level agreements (SLAs)
• Middleware deployment and introducing new resources
• Operate a set of essential core infrastructure services
• Grid monitoring and control
• Resource and user support
• International collaboration
– to drive collaboration with peer organisations in the Americas and the Asia-Pacific region to ensure the interoperability of Grid infrastructures and services so that the EGEE-II user communities
• Capture and provide middleware requirements
• Grid security and incident response
• Long term sustainability of the infrastructure
– to work both within the project and with the other related infrastructure projects and embryonic National Grid Infrastructures to put in place the necessary structures and organisation to ensure a long term sustainable infrastructure

Page 6:

Grid management: structure

• Operations Coordination Centre (OCC)

– responsible for the overall activity management, oversight of all operational and support activities

• Regional Operations Centres (ROC)

– providing the core of the support infrastructure, each supporting a number of resource centres within its region

• Resource centres

– providing resources (computing, storage, network, etc.)

• Grid User Support (GGUS)

– coordination and management of user support activities, single point of contact (portal) for users

Page 7:

Operations coordination

• ROC managers meeting
– Biweekly
– Discuss inter-ROC issues, general coordination, interfaces with other activities
• WLCG-EGEE-OSG Operations meeting
– Weekly, Mondays at 16:00 (Swiss time)
– WLCG/OSG/EGEE
– Pre-reports from sites, ROCs and VOs through CIC portal
– Discuss, track and solve operation related issues from the previous week
• Operation Workshops
– Twice per year. Some joint between WLCG/OSG/EGEE
– Last one: June 2006, http://agenda.cern.ch/fullAgenda.php?ida=a062031
– Next one: Spring 2007

Page 8:

Middleware deployment

[Diagram: middleware release flow. Labels: Development teams 1, 2 and 3; Integration (tagged RPMs); Certification (build is ready; Certification APT repository); Pre-production Service, PPS (software passes certification; PPS APT repository); Production service (software OK in PPS; Production APT repository); bugs reported in gLite Middleware Savannah; Technical Coordination Group, TCG (longer term strategy); EMT (steer next release)]
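
The release flow in the diagram can be read as a linear promotion pipeline. The sketch below is one way to model it, assuming that a build moves one stage forward when the current stage passes and otherwise stays put with a bug filed. The stage names follow the diagram labels; the `promote` function and its return shape are hypothetical.

```python
from typing import Optional, Tuple

# Stage order as read from the diagram labels.
STAGES = ["integration", "certification", "pre-production", "production"]


def promote(stage: str, passed: bool) -> Tuple[str, Optional[str]]:
    """Return (next_stage, bug_note) for a build currently at `stage`.

    Assumed rule: a passing build advances by one stage (production is
    final); a failing build stays where it is and a bug note is returned.
    """
    index = STAGES.index(stage)
    if not passed:
        return stage, f"bug filed against build in {stage}"
    next_index = min(index + 1, len(STAGES) - 1)
    return STAGES[next_index], None


print(promote("certification", True))
print(promote("pre-production", False))
```

Keeping the stages in an ordered list makes it easy to insert or rename a stage without touching the promotion logic.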

Page 9:

Grid monitoring and control

The goal is to proactively monitor the operational state of the Grid and its performance, initiating corrective action to remedy problems arising with either core infrastructure or Grid resources.

[Diagram: several Regional Operations Centres, each with its Resource Centres, together with the OSCT and the Grid Operator on-duty (COD); annotation: "Monitoring shows a problem"]

Page 10:

Grid Operator on Duty

• Role:
– Watch the problems detected by the grid monitoring tools
– Problem diagnosis
– Report these problems (GGUS tickets)
– Follow and escalate them if needed (well defined procedure)
– Provide help, propose solutions
– Build and maintain a central knowledge database (WIKI)
• Who does it?
– 9 ROC teams working in pairs (one lead and one backup) on a weekly rotation
– CERN, France, Italy, UK, Russia, Asia-Pacific, Southeastern-Europe, Central-Europe, Germany-Switzerland
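
A weekly lead/backup rotation over the nine ROC teams can be generated mechanically. The sketch below is only an illustration: the pairing rule (the backup is the next team in the list, and the rotation advances by one team each week) is an assumption, since the slides do not say how pairs are chosen.

```python
ROC_TEAMS = [
    "CERN", "France", "Italy", "UK", "Russia", "Asia-Pacific",
    "Southeastern-Europe", "Central-Europe", "Germany-Switzerland",
]


def duty_pair(week: int, teams=ROC_TEAMS):
    """Return (lead, backup) for a 0-based week number.

    Assumed rule: the lead advances one team per week and the backup is
    the team that follows the lead in the list, wrapping around.
    """
    lead = teams[week % len(teams)]
    backup = teams[(week + 1) % len(teams)]
    return lead, backup


for week in range(3):
    print(week, duty_pair(week))
```

With this rule each team's backup week comes immediately before its lead week, so it serves two consecutive weeks in every nine-week cycle.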

Page 11:

Grid monitoring tools

• Tools used by the Grid Operator on Duty team to detect problems

• Distributed responsibility

• CIC portal
– single entry point
– Integrated view of monitoring tools

• Site Functional Tests (SFT) -> Service Availability Monitoring (SAM)

• Grid Operations Centre Core Database (GOCDB)

• GIIS monitor (Gstat)

• GOC certificate lifetime

• GOC job monitor

• Others

Page 12:

Site Functional Tests

• Site Functional Tests (SFT)
– Framework to test (sample) services at all sites
– Shows results matrix
– Detailed test log available for troubleshooting and debugging
– History of individual tests is kept
– Can include VO-specific tests (e.g. sw environment)
– Normally >80% of sites pass SFTs (NB: of 180 sites, some are not well managed)
• Very important in stabilising sites:
– Apps use only good sites
– Bad sites are automatically excluded
– Sites work hard to fix problems
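
To make the results matrix and the automatic exclusion of bad sites concrete, here is a minimal sketch. The site names, the test names, and the rule that a site counts as "good" only if every test passes are all hypothetical; the real SFT framework may define its critical tests differently.

```python
# Hypothetical results matrix: site -> {test name -> passed?}
results = {
    "site-a": {"job-submit": True, "replica-mgmt": True, "sw-env": True},
    "site-b": {"job-submit": True, "replica-mgmt": False, "sw-env": True},
    "site-c": {"job-submit": True, "replica-mgmt": True, "sw-env": True},
}


def good_sites(matrix):
    """Sites where every test passed (assumed exclusion rule)."""
    return sorted(site for site, tests in matrix.items() if all(tests.values()))


def pass_fraction(matrix):
    """Fraction of sites that pass all of their tests."""
    return len(good_sites(matrix)) / len(matrix)


print(good_sites(results))
print(f"{pass_fraction(results):.0%} of sites pass")
```

An application would then restrict job submission to the list returned by `good_sites`, which is the sense in which bad sites are excluded automatically.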

Page 13:

Service Availability Monitoring

• Service Availability Monitoring (SAM)

– Will cover all core grid services

– measure availability by service, site, VO

– each service has associated service class defining required availability (Critical, highly available, etc.)

– Will be used to generate alarms

– to generate trouble tickets

– to call out support staff
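
The idea of a service class with a required availability can be sketched as follows. The threshold values and the sample data are assumptions for illustration; the slides name the classes but give no numbers.

```python
# Assumed required availability per service class (not from the slides).
REQUIRED = {"critical": 0.99, "highly available": 0.95}


def availability(test_results):
    """Fraction of test runs that succeeded."""
    return sum(test_results) / len(test_results)


def needs_alarm(service_class, test_results):
    """True when measured availability falls below the class requirement."""
    return availability(test_results) < REQUIRED[service_class]


# Hypothetical sample: 100 test runs with 3 failures.
runs = [True] * 97 + [False] * 3
print(availability(runs))
print(needs_alarm("critical", runs))
print(needs_alarm("highly available", runs))
```

The same measured availability can raise an alarm for one class and not for another, which is why the class is attached to the service and not to the test.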

Page 14:

Site availability

Page 15:

Operational procedures

• Described in the operations manual: https://edms.cern.ch/document/701575
• Introducing new resources
• Resource registration and contact information
– Stored in GOCDB
• Site downtime scheduling
• Broadcast of planned and unplanned interventions
– EGEE broadcast tool
• Site suspension
– The site is then removed from the top-level BDII and monitoring is turned off
• Escalation procedures

Page 16:

Operational security

From the EGEE Operational Security Coordination Team (OSCT)

• Recent security incident:
– Many HEP sites affected by the recent incident
– Local root compromises (on up to date machines)
– Many compromised accounts (password sniffers)
– Not a Grid attack as such but involved many LCG sites
• What went well?
– Many people worked very hard
– Collaboration was excellent
– Sharing of necessary information was good
– The Grid csirts list (and HEPIX security list) kept people informed
• What did not go so well? (matters for OSCT)
– UK site decided (on the basis of following guidance) not to inform the Grid csirts
– No incident handling team created (but CERN took the lead)
– Private information leaked out on to several public mail lists and Google-searchable archives and web sites
– Discussion supposed to happen on “contacts” list not “csirts” list; much activity on csirts list
– Concern that sites who said they were not involved had not looked carefully enough
– Need to strive for the correct balance in Open vs Closed communication
– But must encourage sites to report

Page 17:

Open Science Grid and WLCG

The Open Science Grid contributes to the WLCG as the US distributed facility infrastructure.

OSG delivers accountable resources and cycles for LHC experiment production and analysis.

OSG federates with other infrastructures and interoperates with managerial, operational and technical activities.

OSG cooperates with the EGEE to ensure an effective and transparent system for the experiments.

Page 18:

Current OSG deployment

96 Resources across production & integration infrastructures

27 Virtual Organizations including operations and monitoring groups

>15,000 CPUs

~6 PB MSS

~4 PB disk

Page 19:

August OSG Usage: 3 largest VOs

[Figure: usage by ATLAS, CDF and CMS; annotation: 50K & 90K CPU Hours/day]

Page 20:

Running Jobs of Rest of the VOs

OSG jobs are “jobs submitted via OSG interfaces or services”

3 large VOs had ~3500 simultaneous jobs in same period

[Figure: running jobs of the remaining VOs; axis marker: 1000 jobs]
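
The daily CPU-hour figures and the simultaneous-job count quoted on these two slides can be related with simple arithmetic. The sketch assumes that each running job occupies one CPU around the clock and that the 50K and 90K values bound the combined daily usage of the three large VOs; both are idealisations, not statements from the slides.

```python
HOURS_PER_DAY = 24


def average_busy_cpus(cpu_hours_per_day: float) -> float:
    """Average number of CPUs kept busy around the clock."""
    return cpu_hours_per_day / HOURS_PER_DAY


for daily in (50_000, 90_000):
    print(daily, round(average_busy_cpus(daily)))
```

Under those assumptions, 50K to 90K CPU hours per day corresponds to roughly 2,100 to 3,750 CPUs busy on average, the same order of magnitude as the ~3500 simultaneous jobs quoted for the three large VOs.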

Page 21:

Software Release & Patches

These are subsets of the VDT, tailored to OSG

2 OSG major releases a year.

>4 minor releases a year.

Development releases for testing.

Critical patches have separate path.

Page 22:

Site and Service Validation

• Validation services are being packaged for use by any VO.
• Grid Operations also runs the validations:
– Site-Verify executed by Operations under the operations VO.
– Job execution and file transfer tests executed under the GridEx VO.
• GridCat displays the results of validations in a “red”/“green” presentation.
• Integration Grid provides a system for application validation of releases and patches to the software and new services.

Page 23:

Support Model in OSG

• Distributed set of Support Centers covers all aspects of OSG
– VO, Resources, Services, Middleware, Community
– A support center may support multiple activities.
• The goal of the OSG support model is to provide OSG users and resources with rapid responses to reported issues.
• Each VO supports their own users and resources.
• There is an OSG Grid Operations Center for coordination and routing of issues along with critical infrastructure components.
• OSG GOC has final responsibility for releases of the OSG software stack (including patches).

Page 24:

OSG Grid Operations Center

• Supports Centralized Grid Services

– Monitoring Tools (MonALISA, GridCat)

– Resource Information Tools (VORS, BDII)

– Centralized Trouble Ticketing

– Interaction with Peering Grids (EGEE/TeraGrid)

– Communication Hub

– Software Packaging

– Documentation of Operations Information

– Security Response

– Keeps Definitive Contact Directory for VOs, Resources, and Support Centers

– Releasing Critical Patches/Upgrades to OSG

• And supports the OSG VO

Page 25:

Support Mechanisms in OSG

• Distributed set of Support Centers for all production activities in OSG
– VO, Resources, Services, Middleware, Community
– A support center may support multiple activities.

• When VOs, Resources, or Services are registered they identify a Support Center (may be Community Support).

• All Support Centers participate in OSG Operations.

Page 26:

Example Support Services

• Middleware
– VDT is the core-middleware support center. Other direct middleware support contacts, e.g. MonALISA.
– VOs and other support centers are provided with a path to the middleware representatives
– VDT has weekly office hours and an independent trouble ticket system
• Community Support
– Open support for Users and Resources not covered by a specific support center.
– Voluntary participation on mail lists & Community Chat Room
• User Support
– VO Users contact their VO support center to begin the troubleshooting process
– Problems are routed by the OSG-GOC to the responsible Support Center if the problem moves outside the VO
– Support Documents should be made available from the VO Support Center and recorded on the OSG Twiki along with VO policy
– Local Ticketing Systems for some VOs
• Application Support
– Application questions go directly to the VO Support Center for routing/troubleshooting.

Page 27:

Security Operations

• Security Officer plans and coordinates Integrated Security Management consisting of Risk Assessment of vulnerabilities resulting in Management, Operations and Technical controls.

• Equivalence of Site and VO responsibilities and procedures.
• Incident Response includes identified security contacts of all OSG organizations.

Page 28:

EGEE – OSG interoperations

• Coordination
– WLCG-EGEE-OSG operations meeting
– Operations workshop: the focus of the last one was OSG-EGEE interoperations, much progress achieved
– Regular phone calls to make progress on specific areas
• Operations tools: common and/or interoperable
– Global BDII extracted from EGEE and OSG registration DBs
– GGUS interfaced to OSG FootPrints
– Site/service monitoring tools interfacing being discussed
• Security: work is underway to share security contact information and incident information
– Cross population of mail lists: EGEE sites in the OSG lists, and vice-versa
– Technical details still to be agreed (read access to GOC-DB etc.)
– Ensure consistent (and many times common) policies through joint working groups.

Page 29:

Problem Reports

• 3 WLCG ROCs in the US: US-ATLAS, US-CMS, OSG-GOC.
• All tickets routed from WLCG through OSG-GOC. OSG GOC and EGEE GGUS exchange and automatically route tickets.
• OSG-GOC automatically routes tickets to US-CMS-ROC and, currently, manually routes tickets to US-ATLAS-ROC.
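
The routing described above can be sketched as a small dispatch table. The dictionary, the `route_ticket` function, and the fallback for unknown destinations are hypothetical names and behaviour chosen for illustration; only the two destinations and their routing modes come from the slide.

```python
# Routing mode per destination ROC, as stated on the slide.
ROUTING_MODE = {
    "US-CMS-ROC": "automatic",
    "US-ATLAS-ROC": "manual",
}


def route_ticket(destination: str) -> str:
    """Describe how OSG-GOC forwards a ticket to `destination`.

    Destinations not in the table are kept at OSG-GOC (an assumed fallback).
    """
    mode = ROUTING_MODE.get(destination)
    if mode is None:
        return "handled at OSG-GOC"
    return f"{mode} routing to {destination}"


print(route_ticket("US-CMS-ROC"))
print(route_ticket("US-ATLAS-ROC"))
```

Switching US-ATLAS-ROC to automatic routing later would be a one-line change to the table.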

Page 30:

EGEE OSG Activities

• Completed
– Interoperation of information published in BDII for use by WLCG Resource Brokers.
• In progress
– Operations VO, “Ops”, on EGEE and OSG for common tests and validations.
– Programmatic interface to the trouble ticket system which allows retrieval of EGEE - OSG resource scheduled downtimes.
• To watch for
– How do we communicate and test interoperability of changes (interfaces and capabilities) before they get to production?
– How do we communicate about new s/w developments in time to have common approaches & avoid duplication & divergence?
– How do we manage ourselves to not give in to “panic mode” responses & give ourselves time to not organize “just in time”?
– How do we prioritize support for our non-WLCG stakeholders during data taking?

Page 31:

Summary

• WLCG Operations is a focus of EGEE and OSG Operations.
• The 2 grid infrastructures are working together to ensure smooth, scalable, and effective production support.