Post on 23-Aug-2020
4th TERENA NRENs and Grids Workshop
Nikos Trikoupis - CERN IT/CS
Fault Isolation and service quality assurance
in a 10GbE redundant grid infrastructure
Nikos Trikoupis
Infrastructure and Operations
CERN IT/CS
Agenda
Requirements and challenges for Monitoring in the CERN network
Fault Isolation essentials
Service Quality Assurance
Introduction
CERN’s “business processes” rely on network services availability.
Our mission is to deliver and manage an infrastructure that is reliable but capable of sustaining a high rate of change.
Being the LCG Tier-0 network provider is a huge responsibility.
Facts and Challenges
A collaborative environment with highly complex applications
Network redesign: multi-10GbE core, 10GbE to the farms
The 10G WAN PHY standard allows for WAN connectivity at LAN speeds and the use of the same management tools
For the first time, the barrier between Campus and WAN disappears.
CERN Network overview
[Diagram: CERN network layout. Labels include the Meyrin and Prevessin sites; the LCG, GPN, TN and CNIC networks; LHC, LHCopn, Tier1s, FARMS, FIREWALL, EXT and Internet; and the SPECTRUM primary and hot-standby servers with OneClick "Primary" and "Secondary".]
Network Layout: Marc Collignon, IT/CS/IO
Pressures on Network Management
Chaotic, Unpredictable Traffic Patterns
Increased Demand
Restrictions in budget and personnel
New Types of network and user equipment
More and Higher-Speed Bandwidth Choices
Managing the Network Foundation
It’s the Network’s Fault!
Physicists
Technical Services System Managers
Application Developers
Defending the Network
Faults may occur, but:
They have to be detected as quickly as possible, before users notice them.
The cause of the fault has to be identified so that corrective action may be taken.
This task has to be performed by operations on a 24x7 basis.
Time To Repair must be reduced.
Faults must be prioritized based on impact.
The size and complexity of the network infrastructure dictate the use of automated network management tools.
Requirements
Event filtering, deduplication, suppression and correlation
Automated network discovery and update
Accurate Layer 2 and Layer 3 network topology, including redundancy and routing protocols
Network Root Cause Analysis
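To make the first requirement concrete, here is a minimal Python sketch of event filtering and deduplication. It is an illustration only, not the behaviour of any particular product: the `Event` fields, the event kinds and the 60-second window are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float   # seconds since an arbitrary start (assumed unit)
    device: str        # hypothetical device name
    kind: str          # hypothetical event kind, e.g. "link_down"


def drop_kinds(events, ignored_kinds):
    """Filtering: discard event kinds that should never reach an operator."""
    return [ev for ev in events if ev.kind not in ignored_kinds]


def deduplicate(events, window=60.0):
    """Keep an event only if no event with the same (device, kind)
    was kept during the preceding `window` seconds."""
    last_kept = {}
    kept = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.device, ev.kind)
        previous = last_kept.get(key)
        if previous is None or ev.timestamp - previous >= window:
            kept.append(ev)
            last_kept[key] = ev.timestamp
    return kept


events = [
    Event(0.0, "router-a", "link_down"),
    Event(5.0, "router-a", "link_down"),    # repeat inside the window
    Event(10.0, "router-a", "auth_info"),   # filtered out by kind
    Event(90.0, "router-a", "link_down"),   # outside the window, kept
]
filtered = deduplicate(drop_kinds(events, {"auth_info"}))
```

Filtering runs before deduplication so that ignored event kinds never occupy a slot in the deduplication table.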
Traditional Procedure Without Network Root Cause Analysis
How to prioritize?
Which to fix?
Oops!
Pinpointing the root cause
An alarm displayed in CERN’s alarm manager is the result of a fault isolation process, Root Cause Analysis.
ONE alarm displayed for one problem
symptomatic faults suppressed
operational procedures are followed to complete the troubleshooting process
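The idea of showing one alarm and suppressing the symptoms can be sketched with a topology walk. This is a generic illustration and not a description of SPECTRUM internals: the function name, the adjacency-dict topology and the device names are assumptions. Starting from the monitoring station, the walk only crosses devices that answer polls. A silent device next to a responsive one is a root-cause candidate, and every other silent device is treated as symptomatic.

```python
from collections import deque


def isolate_faults(topology, station, unresponsive):
    """Split unresponsive devices into root-cause candidates and symptoms.

    topology:     dict mapping each node to the set of its neighbours
    station:      node from which the devices are polled
    unresponsive: set of devices that do not answer polls
    """
    reachable = {station}
    queue = deque([station])
    root_causes = set()
    while queue:
        node = queue.popleft()
        for neighbour in topology[node]:
            if neighbour in unresponsive:
                # Silent device with a healthy neighbour: candidate root cause.
                root_causes.add(neighbour)
            elif neighbour not in reachable:
                reachable.add(neighbour)
                queue.append(neighbour)
    symptomatic = unresponsive - root_causes
    return root_causes, symptomatic


# Hypothetical topology: station - core - dist - eight access switches.
topology = {
    "station": {"core"},
    "core": {"station", "dist"},
    "dist": {"core"} | {f"sw{i}" for i in range(1, 9)},
}
for i in range(1, 9):
    topology[f"sw{i}"] = {"dist"}

# "dist" fails, so it and the eight switches behind it stop answering.
unresponsive = {"dist"} | {f"sw{i}" for i in range(1, 9)}
root_causes, symptomatic = isolate_faults(topology, "station", unresponsive)
```

In this example nine devices stop responding, yet only `dist` is reported as the root cause and the eight switches behind it are suppressed.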
A failure scenario: device fault
One problem, but 9 ‘device unreachable’ alarms!
A failure scenario: Loss of redundancy
Undetected problem?
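A loss of redundancy is easy to miss because every device stays reachable over the surviving path. One way to catch it, sketched below under assumed data structures, is to compare the uplinks that should be operational with those that actually are. The function name, device names and link names are hypothetical.

```python
def redundancy_alarms(expected_uplinks, link_state):
    """Return devices that are still connected but have lost redundancy.

    expected_uplinks: dict device -> list of uplinks that should be up
    link_state:       dict (device, uplink) -> True if the link is operational
    """
    alarms = {}
    for device, uplinks in expected_uplinks.items():
        up = [u for u in uplinks if link_state.get((device, u), False)]
        # Some, but not all, uplinks are working: reachable yet unprotected.
        if 0 < len(up) < len(uplinks):
            alarms[device] = sorted(set(uplinks) - set(up))
    return alarms


expected_uplinks = {
    "farm-sw1": ["uplink-a", "uplink-b"],
    "farm-sw2": ["uplink-a", "uplink-b"],
}
link_state = {
    ("farm-sw1", "uplink-a"): True,
    ("farm-sw1", "uplink-b"): False,   # backup path lost, device still reachable
    ("farm-sw2", "uplink-a"): True,
    ("farm-sw2", "uplink-b"): True,
}
alarms = redundancy_alarms(expected_uplinks, link_state)
```

A device with no working uplink at all is deliberately left out here, since that case already produces an unreachable-device alarm.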
On Requirement 4
SPECTRUM in the device failure scenario
On Requirement 4
SPECTRUM in the loss of redundancy scenario
Distinguishing events and meaningful alarms
[Diagram: Event Management System. Labels: inputs API, Syslog, SpectroWATCHES and TRAPS; IMT; Event Rules; Condition Correlation; Alarm/Event DB.]
Event Management System allows configuration, creation and control of traps, events and alarms
~25,000 events a day (30 Nov 06)
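The gap between thousands of daily events and a handful of alarms can be illustrated with a small set/clear rule table. This is a generic Python sketch, not SPECTRUM configuration syntax: the rule table, event kinds and condition names are invented for the example.

```python
# Hypothetical rules: which event kind sets a condition and which clears it.
RULES = {
    "link_down": ("set", "LINK"),
    "link_up": ("clear", "LINK"),
    "fan_fail": ("set", "FAN"),
    "fan_ok": ("clear", "FAN"),
}


def active_alarms(events):
    """Replay (device, kind) events in order and return the open conditions."""
    active = set()
    for device, kind in events:
        rule = RULES.get(kind)
        if rule is None:
            continue  # no rule: the event never becomes an alarm
        action, condition = rule
        if action == "set":
            active.add((device, condition))
        else:
            active.discard((device, condition))
    return active


events = [
    ("sw1", "link_down"),
    ("sw1", "link_up"),     # clears the LINK condition on sw1
    ("sw2", "fan_fail"),
    ("sw2", "user_login"),  # informational, no rule
]
alarms = active_alarms(events)
```

Four events go in and one alarm comes out, which is the kind of reduction the slide refers to.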
To do the job, our monitoring system must:
1. understand topology and relationships.
2. work across multiple-vendor and technology solutions.
3. distinguish between a plethora of events and meaningful alarms.
4. quickly pinpoint the root cause and suppress all symptomatic faults.
5. help prioritize based on impact.
6. be fault-tolerant.
Scope of the Service Quality Assurance Problem: From Silos to Services
[Diagram: the components behind a service and what is monitored on each]
Web Server: CPU, Memory, Disk; Apache.exe process; User Response time (HTTP test)
Application Servers, Redundant Servers: Response time (TCP test); Log file parsing; CPU, Memory, Disk; …
Database Server: CPU, Memory; SQL Log file parsing; SQL Processes; …
Network Connection
Request Service
DNS
Other Required Services
…
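Moving from silos to services means rolling component checks up into one service status. The sketch below shows one possible roll-up rule in Python. The three status values, the function names and the rule that a redundant group with a surviving member counts as degraded are assumptions for illustration.

```python
UP, DEGRADED, DOWN = "up", "degraded", "down"


def group_status(members, redundant=False):
    """Status of a group of checks, given each member's pass/fail result."""
    alive = sum(1 for ok in members.values() if ok)
    if alive == len(members):
        return UP
    if redundant and alive > 0:
        return DEGRADED   # service continues on the surviving members
    return DOWN


def service_status(group_statuses):
    """A service is only as healthy as its least healthy required group."""
    order = {UP: 0, DEGRADED: 1, DOWN: 2}
    return max(group_statuses, key=order.__getitem__)


# Hypothetical check results for the three tiers in the diagram.
web = group_status({"http_test": True, "apache_process": True})
app = group_status({"app-server-1": True, "app-server-2": False}, redundant=True)
db = group_status({"sql_processes": True, "disk": True})
status = service_status([web, app, db])
```

Here one of two redundant application servers has failed, so the service is reported as degraded, not down.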
Availability and Performance Monitoring Best Practices
Don't fall into the trap of collecting all possible data available.
Try to focus on key metrics that are indicators of total end-to-end service quality.
Automate!
Configuring Services: The Basics
Service Dashboard
The tool to provide real time service status and statistics
Allows at-a-glance understanding of:
How well the services are running
Problems and status
Transparency towards users and IT management
Status and Statistics exported to PerfSonar, MonaLisa as well as other databases and alarm systems.
Summary: Current Service Status, Current “Customer” Status
General Details: MTTR, MTBF, % Uptime, Downtime, Degraded
Outage Details: Duration, Cause, Troubleshooter, Impact
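The MTTR, MTBF and % Uptime figures can be derived from a list of outage durations. The sketch below assumes one common set of definitions, since conventions vary: MTTR is total downtime divided by the number of outages, MTBF is total uptime divided by the number of outages, and uptime is the share of the reporting period without an outage. The function name and the sample numbers are hypothetical.

```python
def service_statistics(period_hours, outage_hours):
    """Compute MTTR, MTBF and uptime percentage for a reporting period.

    period_hours: length of the reporting period in hours
    outage_hours: list with the duration of each outage in hours
    """
    n = len(outage_hours)
    downtime = sum(outage_hours)
    uptime = period_hours - downtime
    return {
        "mttr_hours": downtime / n if n else 0.0,
        "mtbf_hours": uptime / n if n else float("inf"),
        "uptime_percent": 100.0 * uptime / period_hours,
    }


# A 30-day month (720 hours) with three outages of 1, 2 and 3 hours.
stats = service_statistics(period_hours=720.0, outage_hours=[1.0, 2.0, 3.0])
```

For this sample the total downtime is 6 hours, giving an MTTR of 2 hours, an MTBF of 238 hours and an uptime of about 99.17%.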
Conclusions
Monitoring is essential for robust network operation.
Root Cause Analysis enables quick reactions to faults.
Transparent, real time reporting and information exchange demonstrates service quality and gains the trust of users and collaborators.
Focus on collecting and storing relevant data.
Thank you!
Q & A
http://cern.ch/monitoring