CERN local High Availability solutions and experiences

DESCRIPTION

CERN local High Availability solutions and experiences. Thorsten Kleinwort, CERN IT/FIO. WLCG Tier 2 workshop, CERN, 16.06.2006. Introduction: different h/w used for GRID services; various techniques, first experiences and recommendations; GRID Service issues.

TRANSCRIPT

Page 1: CERN local High Availability solutions and experiences


CERN local High Availability solutions and experiences

Thorsten Kleinwort, CERN IT/FIO

WLCG Tier 2 workshop, CERN, 16.06.2006

Page 2: CERN local High Availability solutions and experiences


Introduction

• Different h/w used for GRID services
• Various techniques & first experiences & recommendations

Page 3: CERN local High Availability solutions and experiences


GRID Service issues

• Most GRID Services do not have built-in failover functionality
• Stateful servers: information lost on crash
• Domino effects on failures
• Scalability issues: e.g. >1 CE

Page 4: CERN local High Availability solutions and experiences


Different approaches

Server hardware setup:
• Cheap 'off the shelf' h/w suffices only for batch nodes (WN)
• Improve reliability (at higher cost):
  • Dual power supply (and test that it works!)
  • UPS
  • RAID on system/data disks (hot swappable)
  • Remote console/BIOS access
  • Fibre Channel disk infrastructure
• CERN: 'Mid-Range-(Disk-)Server'

Page 5: CERN local High Availability solutions and experiences


Page 6: CERN local High Availability solutions and experiences


Various Techniques

Where possible, increase the number of 'servers' to improve the 'service':

• WN, UI: trivial
• BDII: possible, state information is very volatile and re-queried periodically
• CE, RB, …: more difficult, but possible for scalability

Page 7: CERN local High Availability solutions and experiences


Load Balancing

• For multiple servers, use a DNS alias/load balancing to select the right one:
  • Simple DNS aliasing:
    • For selecting the master or slave server
    • Useful for scheduled interventions
    • Dedicated (VO-specific) host names pointing to the same machine (e.g. RB)
  • Pure DNS round robin (no server check), configurable in DNS, to balance the load equally (see the sketch below)
  • Load Balancing Service, based on server load/availability/checks: uses monitoring information to determine the best machine(s)
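A minimal sketch of the pure DNS round-robin case, assuming only that the alias resolves to several A records (lxplus.cern.ch is used simply as an example of such an alias):

```python
import socket

# Resolve a round-robin alias and show every A record the DNS returns.
# With pure DNS round robin the order of the addresses rotates between
# queries, so clients that simply take the first entry spread the load,
# but nothing checks whether the chosen server is actually alive.
ALIAS = "lxplus.cern.ch"  # example round-robin alias

name, aliases, addresses = socket.gethostbyname_ex(ALIAS)
print(ALIAS, "->", addresses)
print("client would connect to:", addresses[0])
```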

Page 8: CERN local High Availability solutions and experiences


DNS Load Balancing

Diagram: the LB server monitors the LXPLUSxxx nodes via SNMP/Lemon, selects the least 'loaded' ones, and updates the LXPLUS.CERN.CH alias in the DNS.
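A minimal sketch of the selection step such a load-balancing service performs; the load numbers below are made up for illustration, whereas the real service derives them from SNMP/Lemon monitoring and then pushes the chosen addresses into the DNS:

```python
# Pick the least loaded nodes to publish behind the alias.
# The metric values below are purely illustrative.
node_load = {
    "lxplus001.cern.ch": 0.70,
    "lxplus002.cern.ch": 0.15,
    "lxplus003.cern.ch": 0.40,
}

N_BEST = 2  # how many addresses to publish behind the alias
best = sorted(node_load, key=node_load.get)[:N_BEST]
print("publish behind lxplus.cern.ch:", best)
# The real service would now update the LXPLUS.CERN.CH record in the DNS.
```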

Page 9: CERN local High Availability solutions and experiences


Load balancing & scaling

Publish several (distinct) servers for the same site to distribute the load:

• Can also be done for stateful services, because clients stick to the selected server
• Helps to reduce high load (CE, RB)

Page 10: CERN local High Availability solutions and experiences


Use existing 'Services'

• All GRID applications that support ORACLE as a data backend (will) use the existing ORACLE Service at CERN for state information
• This allows for stateless and therefore load-balanced applications
• The ORACLE solution is based on RAC servers with FC-attached disks

Page 11: CERN local High Availability solutions and experiences


High Availability Linux

• Add-on to standard Linux
• Switches the IP address between two servers:
  • If one server crashes
  • On request
• Machines monitor each other (see the sketch below)
• To further increase high availability: state information can be kept on (shared) FC
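A minimal sketch of the heartbeat idea: the standby node keeps probing its peer and claims the service IP when the peer stops answering. The host name, address and interface are invented for illustration; the real setup uses the Linux-HA heartbeat daemon rather than a script like this:

```python
import subprocess
import time

PEER = "px101.cern.ch"         # hypothetical primary node
SERVICE_IP = "192.0.2.10/24"   # hypothetical service address
FAILS_BEFORE_TAKEOVER = 3      # tolerate a few missed heartbeats

def peer_alive() -> bool:
    """One ICMP probe to the peer; True if it answered within 2 s."""
    return subprocess.call(
        ["ping", "-c", "1", "-W", "2", PEER],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0

failures = 0
while failures < FAILS_BEFORE_TAKEOVER:
    failures = 0 if peer_alive() else failures + 1
    time.sleep(5)

# Peer looks dead: bring the service address up locally (needs root).
print("peer unreachable, taking over", SERVICE_IP)
subprocess.call(["ip", "addr", "add", SERVICE_IP, "dev", "eth0"])
```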

Page 12: CERN local High Availability solutions and experiences


HA Linux + FC disk

Diagram: two servers behind the SERVICE.CERN.CH alias, linked by heartbeat and both attached through redundant FC switches to shared FC disks; no single point of failure.

Page 13: CERN local High Availability solutions and experiences


WMS: CE & RB

CE & RB are stateful servers:
• Load balancing only for scaling:
  • Done for RB and CE
  • CEs and RBs on 'Mid-Range-Server' h/w
• If one server goes, the information it keeps is lost
• [Possibly FC storage backend behind load-balanced, stateless servers]

Page 14: CERN local High Availability solutions and experiences


GRID Services: DMS & IS

DMS (Data Management System):
• SE: Storage Element
• FTS: File Transfer Service
• LFC: LCG File Catalogue

IS (Information System):
• BDII: Berkeley Database Information Index

Page 15: CERN local High Availability solutions and experiences


DMS: SE

SE: castorgrid:
• Load-balanced front-end cluster with CASTOR storage backend
• 8 simple machines (batch type)

Here load balancing and failover work (but not for simple GridFTP)

Page 16: CERN local High Availability solutions and experiences


DMS: FTS & LFC

LFC & FTS Service:
• Load-balanced front end
• ORACLE database backend

In addition, for the FTS Service:
• VO agent daemons: one 'warm' spare for all, which gets 'hot' when needed (see the sketch below)
• Channel agent daemons (Master & Slave)
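A minimal sketch of the 'one warm spare for all' idea: a single idle node is shared by the four VO agent hosts and takes over the agents of whichever VO's host fails. Host names and the failure report are invented for illustration:

```python
# Map each VO to the node running its FTS agent daemons (names invented).
agents = {"alice": "ftsa01", "atlas": "ftsa02", "cms": "ftsa03", "lhcb": "ftsa04"}
spare = "ftsa05"  # the shared 'warm' spare, kept configured but idle

def handle_failure(vo: str) -> None:
    """Turn the warm spare 'hot' for the VO whose agent node has died."""
    print(f"{agents[vo]} failed: starting {vo} agent daemons on {spare}")
    agents[vo] = spare

handle_failure("cms")  # e.g. monitoring reports the CMS agent node as dead
print(agents)
```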

Page 17: CERN local High Availability solutions and experiences


FTS: VO agent (failover)

Diagram: one VO agent node each for LHCb, ATLAS, CMS and ALICE, plus one shared SPARE node that can replace any of them.

Page 18: CERN local High Availability solutions and experiences


Page 19: CERN local High Availability solutions and experiences


IS: BDII

BDII: load-balanced Mid-Range-Servers:
• LCG-BDII: top-level BDII
• PROD-BDII: site BDII
• In preparation: <VO>-BDII
• State information in a 'volatile' cache, re-queried within minutes (see the sketch below)
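A minimal sketch of why a BDII can be restarted or replaced without losing anything important: its state is only a cache that is rebuilt by re-querying the information providers every few minutes anyway. The provider query is a placeholder:

```python
import time

REFRESH_SECONDS = 120  # re-query interval: of the order of minutes

def query_information_providers() -> dict:
    """Placeholder for the queries the BDII makes to its information sources."""
    return {"refreshed_at": time.time()}

cache: dict = {}       # a freshly (re)started BDII simply begins empty
for _ in range(3):     # a few refresh cycles, for illustration
    cache = query_information_providers()
    print("cache rebuilt at", cache["refreshed_at"])
    time.sleep(REFRESH_SECONDS)
```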

Page 20: CERN local High Availability solutions and experiences


Page 21: CERN local High Availability solutions and experiences


Example: BDII

• 17.11.2005 -> First BDII in production
• 29.11.2005 -> Second BDII in production

Page 22: CERN local High Availability solutions and experiences


AAS: MyProxy

• MyProxy has a replication function for a slave server that allows read-only proxy retrieval [not used any more]
• The slave server gets a 'read-write' copy from the master regularly to allow a DNS switch-over (rsync, every minute; see the sketch below)
• HA Linux handles the failover
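A minimal sketch of that replication loop, run on the slave; the master host name and the credential-store path are assumptions for illustration:

```python
import subprocess
import time

MASTER = "px101.cern.ch"   # assumed master MyProxy host
REPO = "/var/myproxy/"     # assumed location of the credential store

while True:
    # Pull a full read-write copy of the master's repository.
    subprocess.call(["rsync", "-a", "--delete", f"{MASTER}:{REPO}", REPO])
    time.sleep(60)         # 'every minute', as stated on the slide
```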

Page 23: CERN local High Availability solutions and experiences


HA Linux

Diagram: PX101 and PX102 behind the MYPROXY.CERN.CH alias, connected by heartbeat, with rsync replication between them.

Page 24: CERN local High Availability solutions and experiences


AAS: VOMS

Diagram: voms102 and voms103 behind the LCG-VOMS alias (HA Linux), checked with voms-ping, with the VOMS DB on an ORACLE 10g RAC.

Page 25: CERN local High Availability solutions and experiences


MS: SFT, GRVW, MONB

• Not seen as critical
• Nevertheless, servers are migrated to 'Mid-Range-Servers'
• GRVW has a 'hot standby'
• SFT & MONB on Mid-Range-Servers