WLCG Service Requirements WLCG Workshop Mumbai Tim Bell CERN/IT/FIO


WLCG Service Requirements

WLCG Workshop, Mumbai

Tim Bell CERN/IT/FIO


11th February 2006 Service Checklist

Agenda

LCG Memorandum of Understanding

Defining what needs to be delivered

Checking the plan

Tracking delivery using a dashboard


What the MoU provides

A high level definition of the service

Basis for estimating Tier investments

Tier responsibilities

Overall capacity

Basic support structure

Implementation schedule

Governance

Roles


Tier0 service levels

Maximum delay in responding to operational problems, and average availability, by service:

Service | Max delay: service interruption | Max delay: capacity degraded by more than 50% | Max delay: capacity degraded by more than 20% | Availability: during accelerator operation | Availability: at all other times
Raw data recording | 4 hours | 6 hours | 6 hours | 99% | n/a
Event reconstruction or distribution of data to Tier-1 Centres during accelerator operation | 6 hours | 6 hours | 12 hours | 99% | n/a
Networking service to Tier-1 Centres during accelerator operation | 6 hours | 6 hours | 12 hours | 99% | n/a
All other Tier-0 services | 12 hours | 24 hours | 48 hours | 98% | 98%
All other services – prime service hours | 1 hour | 1 hour | 4 hours | 98% | 98%
All other services – outwith prime service hours | 12 hours | 24 hours | 48 hours | 97% | 97%


Tier1 service levels

Maximum delay in responding to operational problems, and average availability measured on an annual basis, by service:

Service | Max delay: service interruption | Max delay: capacity degraded by more than 50% | Max delay: capacity degraded by more than 20% | Availability: during accelerator operation | Availability: at all other times
Acceptance of data from the Tier-0 Centre during accelerator operation | 12 hours | 12 hours | 24 hours | 99% | n/a
Networking service to the Tier-0 Centre during accelerator operation | 12 hours | 24 hours | 48 hours | 98% | n/a
Data-intensive analysis services, including networking to Tier-0, Tier-1 Centres outwith accelerator operation | 24 hours | 48 hours | 48 hours | n/a | 98%
All other services – prime service hours | 2 hours | 2 hours | 4 hours | 98% | 98%
All other services – outwith prime service hours | 24 hours | 48 hours | 48 hours | 97% | 97%


The MoU is not …

An implementation bible

What grid services at which site

How to run the services

How to deploy

Magic recipe for service delivery

Application 99% = 1.5 hours down / week

Administrator 40 hours/week = 24% up
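The two figures on this slide come from simple arithmetic over a 168-hour week. A minimal Python sketch of that arithmetic follows; the function names are illustrative and not from the slides. Computed exactly, 99% availability allows about 1.7 hours of downtime per week, which the slide rounds to 1.5.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours


def allowed_downtime_hours(availability, period_hours=HOURS_PER_WEEK):
    """Hours a service may be down in the period at the given availability."""
    return (1.0 - availability) * period_hours


def coverage_fraction(staffed_hours, period_hours=HOURS_PER_WEEK):
    """Fraction of the period covered by a person working staffed_hours."""
    return staffed_hours / period_hours


print(round(allowed_downtime_hours(0.99), 2))   # 1.68
print(round(coverage_fraction(40) * 100))       # 24
```

The gap between the two numbers is the point of the slide: a single administrator on a 40-hour week covers roughly a quarter of the time in which a 99% service must be running.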


What is your quest ?


We seek the holy grail !

A stable and functional Grid


Define the site services

What services do we provide ?

Who is responsible ?

What level of service is required ?

What capacity of service ?

What is the support structure ?

Who pays for what ?


Service catalog approach

A service catalog consists of:

Service Class – Criticality

Calendar – Variation with time

Product – What application

Customer – Which VO

Service = Service Class x Calendar x Product x Customer
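One way to read "Service = Service Class x Calendar x Product x Customer" is as a record with those four fields, one record per calendar slot. The sketch below assumes that reading; the `Service` type and `expand` helper are hypothetical names, and the sample values (Grid View, customer SH, classes M L M L) are taken from the catalog slide later in the deck.

```python
from dataclasses import dataclass

# Calendar slots used in the deck: accelerator on/off x prime/second shift.
CALENDARS = ("AP", "AS", "OP", "OS")


@dataclass(frozen=True)
class Service:
    service_class: str  # criticality, e.g. "C", "H", "M", "L", "U"
    calendar: str       # one of CALENDARS
    product: str        # product short code, e.g. "GRVW"
    customer: str       # which VO, or "SH" as used in the catalog


def expand(product, customer, classes):
    """Build one Service per calendar slot.

    classes holds one class letter per slot, in CALENDARS order.
    """
    return [Service(cls, cal, product, customer)
            for cal, cls in zip(CALENDARS, classes)]


grid_view = expand("GRVW", "SH", "MLML")
for service in grid_view:
    print(service)
```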


Service class

https://uimon.cern.ch/twiki/bin/view/LCG/ScFourServiceDefinition

Class | Description | Downtime | Reduced | Degraded | Avail
C | Critical | 1 hour | 1 hour | 4 hours | 99%
H | High | 4 hours | 6 hours | 6 hours | 99%
M | Medium | 6 hours | 6 hours | 12 hours | 99%
L | Low | 12 hours | 24 hours | 48 hours | 98%
U | Unmanaged | None | None | None | None
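The class table can be held as a small lookup and used to check an incident against its targets. This is a sketch only: it assumes the three time columns are maximum hours from the start of a problem until each restoration level is reached, and the names `ClassTargets` and `meets_targets` are made up for illustration.

```python
from typing import NamedTuple, Optional


class ClassTargets(NamedTuple):
    downtime_h: Optional[float]   # max hours to restore minimal service
    reduced_h: Optional[float]    # max hours to restore reduced capacity
    degraded_h: Optional[float]   # max hours to restore degraded capacity
    availability: Optional[float]


SERVICE_CLASSES = {
    "C": ClassTargets(1, 1, 4, 0.99),
    "H": ClassTargets(4, 6, 6, 0.99),
    "M": ClassTargets(6, 6, 12, 0.99),
    "L": ClassTargets(12, 24, 48, 0.98),
    "U": ClassTargets(None, None, None, None),  # unmanaged: no targets
}


def meets_targets(service_class, downtime_h, reduced_h, degraded_h):
    """True if the observed restoration times are within the class targets."""
    targets = SERVICE_CLASSES[service_class]
    if targets.downtime_h is None:
        return True  # nothing is promised for an unmanaged service
    return (downtime_h <= targets.downtime_h
            and reduced_h <= targets.reduced_h
            and degraded_h <= targets.degraded_h)


print(meets_targets("C", 0.5, 1, 3))  # True
print(meets_targets("C", 2, 2, 3))    # False
```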


Class notes

Downtime defines the time between the start of the problem and restoration of service at minimal capacity (i.e. basic function but capacity < 50%)

Reduced defines the time between the start of the problem and the restoration of a reduced capacity service (i.e. >50%)

Degraded defines the time between the start of the problem and the restoration of a degraded capacity service (i.e. >80%)

Availability compares the total time that the service is down with the total time in the calendar period for the service. Site wide failures are not considered as part of the availability calculations.

None means the service is running unattended
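The availability note can be sketched as a short calculation. The outage representation here (a list of hours-down values flagged as site-wide or not) is an assumption for illustration; the slides do not prescribe a format.

```python
def availability(outages, period_hours):
    """Availability over a calendar period.

    outages is a list of (hours_down, site_wide) pairs. Site-wide failures
    are left out of the calculation, as the class notes state.
    """
    down = sum(hours for hours, site_wide in outages if not site_wide)
    return 1.0 - down / period_hours


# One service outage of 1 hour and one site-wide failure of 5 hours
# in a 100-hour calendar period: only the first one counts.
print(round(availability([(1.0, False), (5.0, True)], 100.0), 4))  # 0.99
```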


Service calendar

Calendar | Description | AccOn | Prime
AP | Accelerator operating, prime shift | Y | Y
AS | Accelerator operating, second shift | Y | N
OP | Accelerator off, prime shift | N | Y
OS | Accelerator off, second shift | N | N

Some services are critical only during accelerator shift

Other services are less critical outside working hours
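The calendar code is just the combination of the two yes/no columns. A minimal sketch, with a hypothetical function name:

```python
def calendar_slot(accelerator_on, prime_shift):
    """Map the AccOn and Prime flags to the calendar code AP, AS, OP or OS."""
    return ("A" if accelerator_on else "O") + ("P" if prime_shift else "S")


print(calendar_slot(True, False))  # AS
```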


Products

Product Name | Short Code | Description
Resource Broker | RB | Farms out jobs to sites + logging and book-keeping
MyProxy | PX | Renew/acquire credentials
BDII | BDII | Grid Information System
Compute Element | CE | Gateway to local batch systems
Mon Box | MONB | Grid Monitoring including archiver
Grid View | GRVW | Monitoring of Grid activity
Site Functional Test | SFT | Regular test of components per site
Grid Peek | GRPK | Storage of outputs of running jobs
VOMS | VOMS | Manage user/roles for VOs


Products (cont)

Product Name | Short Code | Description
LCG File Catalog | LFC | Maps file names to storage locations
File Transfer Service | FTS | Reliable file transfer delivery
Storage Element | SE | SRM Compatible Storage Service


Products notes

Provides 1st level breakdown of the grid to smaller units

Surprisingly dynamic list. New products arriving weekly.

Short codes provide basis for naming conventions
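The slides do not spell out the naming convention. The sketch below is inferred from service names in the catalog later in the deck (RBP, BDIIS, LFCP-ALICE): short code, a one-letter suffix, and an optional VO. Both the rule and the function name are assumptions, and not every catalog entry follows it (CSTRP, for example, does not use the SE short code).

```python
def service_name(short_code, suffix="P", vo=None):
    """Compose a service name from a product short code.

    Assumed pattern: <short code><suffix>[-<VO in upper case>].
    """
    name = f"{short_code}{suffix}"
    return f"{name}-{vo.upper()}" if vo else name


print(service_name("RB"))               # RBP
print(service_name("LFC", vo="LHCb"))   # LFCP-LHCB
```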


Service catalog

Service | Instance | Product | Cst | AP | AS | OP | OS
RBP | Production Resource Broker | RB | SH | C | C | C | C
PXP | Production My Proxy | PX | SH | C | C | C | C
BDIIP | Production Global BDII | BDII | SH | C | C | C | C
BDIIS | Production Site BDII | BDII | SH | H | H | H | H
CEP | Production Compute Element | CE | SH | C | C | C | C
MONBP | Production Monbox | MONB | SH | M | M | M | M
GRVWP | Production Grid View | GRVW | SH | M | L | M | L
SFTP | Production Site Func Test | SFT | SH | M | M | M | M
GRPKP | Production Grid Peek Service | GRPK | SH | M | M | M | M
VOMSP | Production VOMS | VOMS | SH | C | C | C | C

Match product with customer and service class in each calendar slot

Multiple services (e.g. production, test, site…) for single product


Service catalog (cont)

Service | Instance | Product | Cst | AP | AS | OP | OS
LFCP-ALICE | Alice Production LCG File Catalog | LFC | Alice | H | H | H | H
LFCP-ATLAS | Atlas Production LCG File Catalog | LFC | Atlas | H | H | H | H
LFCP-CMS | CMS Production LCG File Catalog | LFC | CMS | H | H | H | H
LFCP-LHCB | LHCb Production LCG File Catalog | LFC | LHCb | C | C | C | C
FTSP | Production file transfer service | FTS | SH | C | C | C | C
CSTRP | Production Castor + SRM | SE | SH | C | C | C | C


Questionnaire

Simple questions to assess readiness for production

It is not actually necessary to fill out the answers but the questions should be asked

Focus is on the infrastructure


Service questions

What service levels are required for each calendar period ?

Who is providing support for the application ?

Who supports the infrastructure ?

How should the support be contacted ?

What support service do they provide ?


Configuration questions

What are the application interfaces?

What server does the application run on ?

Is there a picture of the configuration?

What are the application parameters and how are they set up?



Facilities questions

Are all systems in a machine room ?

Is the room access controlled ?

Is there good power provision ? UPS ? Batteries ?

What is the response time for facilities problems ?


Hardware questions

What kind of machine is required ? CPU, RAM, Disk

Do we need redundancy ? Power Supply, Disk, …

Do maintenance contracts match the service ?

Currently, there are no capacity guides for each application. These are required to avoid purchase of inappropriate machines.


Sample RB disk calculation

Parameter | Value
Size of input sandbox (MB) | 10
Size of output sandbox (MB) | 10
Jobs / Day currently | 21000
Estimated Factor for LHC | 3
Sandbox Purge Time (days) | 14
Jobs in queue | 35000
Total Disk Space Required (MB) | 17,640,000
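The slide does not state the formula, but the total matches the product of sandbox size per job, jobs per day, the LHC growth factor and the purge time. The sketch below assumes that reading; the function name is illustrative, and the "Jobs in queue" figure does not enter this product.

```python
def rb_sandbox_disk_mb(input_mb, output_mb, jobs_per_day, lhc_factor, purge_days):
    """Disk needed to keep sandboxes until they are purged, in MB.

    Assumed formula: (input + output sandbox) x jobs/day x factor x days kept.
    """
    return (input_mb + output_mb) * jobs_per_day * lhc_factor * purge_days


total_mb = rb_sandbox_disk_mb(10, 10, 21000, 3, 14)
print(total_mb)              # 17640000
print(total_mb / 1_000_000)  # 17.64 (TB, decimal units)
```

At 17,640,000 MB the Resource Broker would need roughly 17.6 TB for sandboxes alone, which illustrates why a capacity guide per application matters.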


Network questions

What network capacity ?

OPN connectivity ?

Bandwidth ?

Firewall ports ?

Currently, there is no connectivity guide for each application. This is required for secure set up and appropriate network configuration.


Sample CE ports sheet

Function | Direction | Port
Globus Job Manager | Outgoing | 20000-21000
GridFTP | Incoming | 2811
GRIS BDII | Incoming | 2135
EDG Log Daemon | Incoming | 9002
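A ports sheet like this can be kept as data and queried when reviewing firewall rules. The sketch below encodes the four rows from the slide; the structure and the `functions_for` helper are illustrative assumptions, not an existing tool.

```python
# (function, direction, ports) rows from the sample CE ports sheet.
# range() excludes its end value, so 20000-21000 is range(20000, 21001).
CE_PORTS = [
    ("Globus Job Manager", "outgoing", range(20000, 21001)),
    ("GridFTP", "incoming", range(2811, 2812)),
    ("GRIS BDII", "incoming", range(2135, 2136)),
    ("EDG Log Daemon", "incoming", range(9002, 9003)),
]


def functions_for(direction, port):
    """Names of the CE functions that need this port open in this direction."""
    return [name for name, way, ports in CE_PORTS
            if way == direction and port in ports]


print(functions_for("incoming", 2811))   # ['GridFTP']
print(functions_for("incoming", 22))     # []
```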


Database questions

What is your site's preferred database ?

What are the options for each application ?

Expected database size / growth ?

High Availability options ?


Backup / Restore questions

What needs to be backed up for each service ?

How do we ensure consistency in the event of a restore ? e.g. RB / CE.

Does the software corruption risk differ by application ? e.g. LFC/SE vs Proxy

Has a restore test been done ?

There is currently no list of critical state data for each application or steps to be executed after a restore.


Operations questions

How are problems identified ? Local console ? Grid Monitoring ?

Who should be contacted to resolve the problem ?

Who should be informed of the problem ?

What new procedures / operations guides are required ?

What is the local coverage for nights / weekends ?

How do local and Grid operations interwork ?


Validation

Check that the service class matches the answers

A critical service cannot have the server in an office

Check the dependencies so that no critical services depend on non-critical services

FTS, critical, requires MyProxy, therefore the MyProxy service must be critical
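The dependency check can be automated once classes and dependencies are written down. The sketch below generalises the slide's rule slightly, as an assumption: a service should not depend on anything with a lower class than its own. The class ordering, the function name and the sample data are illustrative.

```python
# Lower rank means more critical (class letters from the service class table).
RANK = {"C": 0, "H": 1, "M": 2, "L": 3, "U": 4}


def dependency_violations(classes, depends_on):
    """Return (service, dependency) pairs where the dependency is less
    critical than the service that relies on it."""
    violations = []
    for service, dependencies in depends_on.items():
        for dependency in dependencies:
            if RANK[classes[dependency]] > RANK[classes[service]]:
                violations.append((service, dependency))
    return violations


# Hypothetical state before the fix: FTS is critical, MyProxy only medium.
classes = {"FTS": "C", "MyProxy": "M"}
depends_on = {"FTS": ["MyProxy"]}
print(dependency_violations(classes, depends_on))  # [('FTS', 'MyProxy')]

classes["MyProxy"] = "C"  # raise MyProxy to critical, as the slide concludes
print(dependency_violations(classes, depends_on))  # []
```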


Implementation Tracking at CERN

A dashboard approach on the Wiki

Service | Area | Class | Status | Req | Dvl | HW | Ops
RB | WMS | C | WlcgScDashRb | Green | Green | Green | RED
CE | WMS | C | WlcgScDashCe | Green | Green | Green | Yellow
GRPK | WMS | M | WlcgScDashGrpk | Green | Yellow | Green | RED
FTS | DMS | H | WlcgScDashFts | Green | Green | Green | Green
LFC | DMS | C | WlcgScDashLfc | Green | Green | Yellow | Green
BDII | IS | C | WlcgScDashBdii | Green | Green | Green | Green
MYPX | AAS | C | WlcgScDashPx | Green | Green | Green | Yellow
VOMS | AAS | C | WlcgScDashVOMS | Green | RED | Green | RED
MONB | IS | M | WlcgScDashMon | Green | Green | Green | RED
GRVW | IS | M | WlcgScDashGrvw | Green | Green | Green | Yellow
SFT | IS | M | WlcgScDashSft | Green | Green | Yellow | RED
UI | WMS | C | WlcgScDashUi? | Green | Green | Green | Green
SE | DMS | C | WlcgScDashSe? | Green | Yellow | Green | Yellow
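A dashboard in this shape is easy to query for the items that need attention first. The sketch below uses a subset of the rows from the slide (RB, FTS, VOMS, SFT); the dictionary layout and the `items_with` helper are assumptions for illustration, not the Wiki's format.

```python
# Status per service and column, copied from four rows of the dashboard.
DASHBOARD = {
    "RB":   {"Req": "Green", "Dvl": "Green", "HW": "Green",  "Ops": "RED"},
    "FTS":  {"Req": "Green", "Dvl": "Green", "HW": "Green",  "Ops": "Green"},
    "VOMS": {"Req": "Green", "Dvl": "RED",   "HW": "Green",  "Ops": "RED"},
    "SFT":  {"Req": "Green", "Dvl": "Green", "HW": "Yellow", "Ops": "RED"},
}


def items_with(status):
    """Sorted (service, column) pairs whose status matches, ignoring case."""
    return sorted(
        (service, column)
        for service, columns in DASHBOARD.items()
        for column, value in columns.items()
        if value.lower() == status.lower()
    )


print(items_with("red"))
# [('RB', 'Ops'), ('SFT', 'Ops'), ('VOMS', 'Dvl'), ('VOMS', 'Ops')]
```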


Common Themes

But it's all green ? What's the problem ?

Green does not mean no problems. We are often generous with assessments since red/yellow everywhere does not highlight issues.

Operations:

No operations or problem determination guides.

Limited administration guides.

Support call-tree unclear

Backup/Restore details are missing

Hardware:

Limited or no capacity planning information leads to incorrect server sizing

'Forgot a box' problems e.g. one per-VO not one per site

Development:

Difficult to match the user expectations (e.g. a critical service) with implementation (e.g. stateful)


Summary

Complete a service catalog for your sites

Check the questions and prepare an action plan to address items under your control

Assess the status by service and concentrate on getting the reds to yellows


More Information

LCG MoU http://lcg.web.cern.ch/lcg/C-RRB/MoU/WLCGMoU.pdf

SC4 Service Definitions for CERN https://uimon.cern.ch/twiki/bin/view/LCG/ScFourServiceDefinition

SC4 CERN Dashboard https://uimon.cern.ch/twiki/bin/view/LCG/WlcgScDash