
Page 1: Operating  Central European EGEE ROC

EGEE-II INFSO-RI-031688

Enabling Grids for E-sciencE

www.eu-egee.org

EGEE and gLite are registered trademarks

CGW’06, 17 October 2006

Operating Central European EGEE ROC

Marcin Radecki, Tomasz Szepieniec, Aleksander Kusznir and Marian Bubak

ACC CYFRONET AGH

Page 2: Operating  Central European EGEE ROC

CGW’06; Cracow; 15-18th October 2006

Outline

• Introduction

– EGEE and Central European (CE) Region

• Challenges for CE Regional Operating Centre

– Applications & Users

– Cooperation

– Grid Infrastructure

• Conclusions

Page 3: Operating  Central European EGEE ROC


EGEE – Community

• Possibly the largest production grid infrastructure, spanning 32 countries

• ca. 200 sites grouped under 11 ROCs

• Scientific community of over 2000 people

• EGEE’06 conference in Geneva

– 700 attendees

– 32 “partner” projects present

ID        Name       Discipline               Users
EGEE-001  Atlas      Physics                    890
EGEE-002  Alice      Physics                    175
EGEE-003  LHCb       Physics                    159
EGEE-004  CMS        Physics                    632
EGEE-010  ESR        Earth Sciences              42
EGEE-014  Biomed     Biomed                     114
EGEE-039  Comp Chem  Chemistry                   15
EGEE-040  Magic      Astro particle physics      16
EGEE-042  dteam      Infrastructure testing      30
EGEE-065  EGEODE     Geo-Physics                 33
EGEE-066  Planck     Astrophysics                 8
Total                                          2114

Page 4: Operating  Central European EGEE ROC


Central European Region in EGEE

• 7 countries, 22 sites, 1493 CPUs, 70 TB storage space

• Supports 10 of the 11 EGEE-approved VOs plus many associated VOs

• Site sizes range from 2-3 to 300 CPUs

• Need for solutions suitable for both large computing centres and small sites

– Maintenance model

– Skills & experience

– Scalable across a site’s resources

Page 5: Operating  Central European EGEE ROC


Challenges for CE ROC

• We need to attract new users to the grid and enable them to work in the new environment, so that the resources are used efficiently. We must provide the services that users require.

• The grid spans many administrative domains, each of which needs to cooperate actively in order to share resources and collaborate productively. This is an excellent opportunity for sharing expertise.

• Having resources is not enough: the infrastructure needs to be stable before real users start to use it, and we should maximize its utilization as far as possible.

Page 6: Operating  Central European EGEE ROC


Grid-enabling users

• Ways to attract and retain users

– Understand users’ needs and satisfy them

– Easy access, how-to-use documentation (in national languages)

– Stable working environment

– User Support infrastructure

• Results:

– Computational chemistry: Mariusz Sterzel (CYFRONET) coordinates computational chemistry applications in EGEE; commercial software enabled through the Gaussian VO; study of pyrazoloquinolines (PQ) used for laser light generation

– Bioinformatics: folding and function recognition of Never Born Proteins, by the team of Prof. Irena Roterman (CM-UJ)

– Others: many small teams work within the regional catch-all VO, VOCE

Page 7: Operating  Central European EGEE ROC


VOs in the Region

• Supported VOs list: alice, atlas, auger, balticgrid, belle, biomed, cms, compass, compchem, crogrid, esr, euchina, gamess, gaussian, geant4, gear, geclipse, hone, hungrid, lhcb, magic, ops, skgrid, voce, vocet, zeus

• Service/Data Challenges and test productions

– Atlas Service Challenge 4

– World-wide In Silico Docking On Malaria data challenge, 1st and 2nd (ongoing)

– EGEE-ITU: international digital broadcasting agreement, compatibility and complementary analysis of the new frequency plan

Page 8: Operating  Central European EGEE ROC


Management of CE ROC

[Organization chart. Institutions shown: CYFRONET, IISAS/PSNC, CESNET/PSNC, ICM WARSAW. Roles and activities shown: ROC Manager; User Support Responsible; Operations Responsible; Security Responsible; 1st Line Support; Core Grid Services; Regional Certification of Middleware; Grid Operator On Duty; Pre-Production Service. The assignment of institutions to roles did not survive extraction.]

• ROC Manager

– Represents the region in the Project’s management bodies

– Supervises all Service Activities

• Operations

– Coordinates actions related to infrastructure and middleware

– Escalates problems that cannot be solved in the region to the next level

– Fits the Project requirements to the region

• User Support

– Provides support tools for users

– Takes part in shifts handling all user tickets in the GGUS system

• Security

– Incident handling procedures

– Incident response team

Page 9: Operating  Central European EGEE ROC


Procedures and Commitments

• Well-defined procedures make collaboration more efficient

– Clear paths for how we deal with things, to avoid misunderstandings

– There are always newcomers

– People tend to forget things over time

• Examples of procedures:

– New site registration

– New site admin joining

– Site problem handling

– Sending Weekly Reports

• Monitoring commitments keeps people motivated

Page 10: Operating  Central European EGEE ROC


Operations - coordinate the work

• Operations is the most time-consuming task

– Make sure that operational procedures are understood and followed properly

– Ensure production requirements are met at the sites

– Work out the best solutions to problems

– Understand expectations and needs

– Make sure problems are being solved properly

– Ensure weekly reports are completed and sent

• Three styles of site administration observed

– Keeps all services ready all the time: “I’m the best admin in the city”

– Reacts only on receiving a problem report: “I’m a bit occupied”

– Reacts only when their name appears on a publicly available “black list”: “I’m hard at work on… something important”

Page 11: Operating  Central European EGEE ROC


Resources and their usage

• Accounting in EGEE

– July-October ’06: over 672k CPU hours computed in the CE region, equivalent to 275 CPUs running 24x7

– Problems with “missing” data

– Update rate: daily

• Our approach to accounting

– Site performance efficiency study: up-to-date information on what is going on at a site; maximize site utilization (it is better to have jobs queued at a site than idle CPUs)

– Is being extended towards a new system for fine-grained accounting

[Chart: Jobs Executing and Jobs Queued, with the Max. CPUs level marked; annotation: “Avoid low usage periods”]
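The equivalence quoted on this slide (672k CPU hours, 275 CPUs running 24x7) can be cross-checked with simple arithmetic. The sketch below is only illustrative: the slide does not state the exact reporting window, so the 102-day period (1 July to about 10 October 2006) is an assumption, and the function names are hypothetical rather than part of any EGEE accounting tool. The utilization figure it prints relates the CPU hours to the 1493 CPUs quoted earlier for the region and should be read as a rough estimate under that same assumption.

```python
def equivalent_cpus(cpu_hours: float, period_days: float) -> float:
    """Number of CPUs that, kept busy 24x7 for period_days, deliver cpu_hours."""
    return cpu_hours / (period_days * 24)


def utilization(cpu_hours: float, total_cpus: int, period_days: float) -> float:
    """Fraction of the available CPU time that was actually consumed."""
    return cpu_hours / (total_cpus * period_days * 24)


# Figures taken from the slides.
CPU_HOURS = 672_000      # "over 672k CPU hours"
REGION_CPUS = 1493       # CPUs in the CE region

# Assumption (not stated on the slide): 1 July to about 10 October 2006.
PERIOD_DAYS = 102

print(round(equivalent_cpus(CPU_HOURS, PERIOD_DAYS)))
print(f"{utilization(CPU_HOURS, REGION_CPUS, PERIOD_DAYS):.1%}")
```

Under the assumed window the first value agrees with the 275 CPUs on the slide; a different window length would shift both numbers proportionally.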

Page 12: Operating  Central European EGEE ROC


Stable infrastructure - social aspect

• How EGEE keeps the Grid stable

– Grid Operator on Duty (GOD) watches the entire grid; CE joined this activity in the first round of EGEE-II

– Raises a ticket for each detected problem

– Diagnoses problems and suggests solutions

– Uses monitoring tools for problem detection and availability metrics

• 1st Line Support in CE: how to be better than average?

– Detect and fix failures before the GOD team notices them and raises a ticket

– Support site admins with remedial actions

– Suggest practices known to work well (expertise sharing)

– Getting knowledge out of people’s heads is painful: although sharing it saves a lot of time at work, people need a lot of encouragement to do it

Page 13: Operating  Central European EGEE ROC


Stable infrastructure - monitoring with NAGIOS

• Try to monitor as much functionality as possible

– E.g. certificate expiration dates on all machines

– Reasonable probe frequency

• Send a problem notification immediately, but…

– Do not spam every 5 minutes

• Allow the site admin to signal that the problem is being worked on

– Send no further notifications until the admin reports back

• Allow the site admin to schedule an extra check at will

– So the admin can see at once how well a workaround is working

• Smart testing hierarchy

• Monitors CE Core Services

– Added tests for checking RB, BDII, LFC, VOMS

• Used by 1st line support

– Overview of the region

– Detailed check of services

– Schedule checks when working on fixes
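The notification rules on this slide (notify immediately, do not repeat every few minutes, stay quiet once the admin has said the problem is being worked on) can be expressed as a small state machine. The sketch below is a hypothetical helper written for illustration, not Nagios code or the CE ROC configuration; Nagios has its own notification-interval and acknowledgement mechanisms. The one-hour repeat interval is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NotificationPolicy:
    """Decides whether a failing check should notify the site admin.

    Hypothetical helper that mirrors the rules on the slide.
    """
    repeat_interval_s: float = 3600.0      # assumption: re-notify at most hourly
    last_sent_at: Optional[float] = None   # time of the last notification sent
    acknowledged: bool = False             # admin said "being worked on"

    def should_notify(self, now: float, failing: bool) -> bool:
        if not failing:
            # Recovery: reset state so the next failure notifies immediately.
            self.last_sent_at = None
            self.acknowledged = False
            return False
        if self.acknowledged:
            # The admin is working on the problem: stay quiet.
            return False
        if (self.last_sent_at is None
                or now - self.last_sent_at >= self.repeat_interval_s):
            self.last_sent_at = now
            return True
        # Still failing, but the last notification is too recent to repeat.
        return False

    def acknowledge(self) -> None:
        """Record that the admin has taken ownership of the problem."""
        self.acknowledged = True


policy = NotificationPolicy(repeat_interval_s=3600)
print(policy.should_notify(now=0, failing=True))     # first failure
print(policy.should_notify(now=300, failing=True))   # 5 minutes later
policy.acknowledge()
print(policy.should_notify(now=4000, failing=True))  # after acknowledgement
```

Resetting the state on recovery is a design choice: it ensures that a fresh failure after a fix is reported at once rather than being hidden by an old acknowledgement.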

Page 14: Operating  Central European EGEE ROC


Data from EGEE CIC portal: https://egee.in2p3.fr/CIC/index.php?id=cic&subid=cic_roc_metrics&scope=project&project=&metrics=sft

Operations metrics results

[Two charts, monthly from Dec 05 to Sep 06, each with the series EGEE, CE and Best player: “Functional test failure % ratio” (y-axis: % of failures, scale 0-9) and “Time unavailable % ratio” (y-axis: % of time, scale 0-9)]

EGEE Operations metrics results from last 10 months

Page 15: Operating  Central European EGEE ROC


Conclusions

• CYFRONET gained know-how in:

– Coordination of a large initiative

– Organization of work for different subtasks

– Running a stable production infrastructure

– Accurate Grid job accounting

– Sensible and precise Grid infrastructure monitoring

– Facilitating the introduction of application users to the Grid

• Experience gathered in CE ROC can readily be reused in building a national Polish grid

Page 16: Operating  Central European EGEE ROC

PL-Grid, Warsaw, 22.09.2006

PL-Grid: a nationwide Polish grid infrastructure

Team of the Academic Computer Centre CYFRONET AGH

Cracow, June-September 2006

This document presents the motivation, goals, concept and approach for creating a national grid infrastructure, which is essential for modern scientific research (e-Science) and consistent with the European infrastructure.

PL-Grid as an infrastructure for e-Science

Scientific research today requires the use of advanced information technologies. A growing number of research teams collaborate intensively, and this requires IT tools for collecting and exchanging the knowledge gained on a global scale. Experimental results are huge, distributed data sets of diverse structure, and working with them requires tools for data access, integration and processing. Computer simulation is a fully accepted research method, and results obtained from simulations and experiments are increasingly combined. This novel approach is most visible in high-energy physics, astrophysics, the biological and medical sciences, and the Earth sciences.

Realizing this new paradigm of scientific research, called e-Science, requires a grid infrastructure (also called Cyber-Science Infrastructure), comprising software that enables the sharing of various computing resources and tools that support cooperation between partners within so-called virtual organizations.

Fig. 1. PL-Grid as an infrastructure for e-Science

Page 17: Operating  Central European EGEE ROC

[Figure labels: Users; Access / application development layer; Grid portals, programming tools; Grid services; Virtual organization management; Job management; Data management; Security system; Monitoring; Basic grid services; Globus; UNICORE (DEISA); LCG/gLite (EGEE); Grid resources; Distributed computing resources; Distributed data repositories; National computer network]

Simplified architecture of PL-Grid

Page 18: Operating  Central European EGEE ROC

[Figure labels: Domain grids; PL-Grid; Infrastructure (hardware, network); Coordination; Reports; Recommendations; Information; Proposals; Evaluation; Consortium Management Board (Coordinator + members); Operations Centre; Users’ Council; Consortium Council]

Organizational structure of PL-Grid

Page 19: Operating  Central European EGEE ROC

[Gantt chart: topics against months 0 to 36, in steps of 3; the bar positions did not survive extraction]

Topics: Project preparation and approval; Consortium organization; Staff recruitment; Equipment purchases; Research and training infrastructure; Production infrastructure; Software development; Grid training; Activity reviews

Phases: test phase; pilot phase; maintenance and development phase

Work schedule