Monitoring and operational management in USLHCNet
Ramiro Voicu, Iosif Legrand, Harvey Newman, Artur Barczyk, Costin Grigoras, Ciprian Dobre, Alexandru Costan, Azher Mughal, Sandor Rozsa
CHEP09 - March 2009 Prague
Outline
MonALISA Framework
Architecture
Data handling
Automatic actions
USLHCNet
Network topology
Monitoring modules
Reliable monitoring & accounting
Alarms & triggers
Conclusions
The MonALISA Architecture
Regional or Global High Level Services, Repositories & Clients
Secure and reliable communication
Dynamic load balancing
Scalability & Replication
AAA for Clients
Distributed Dynamic Registration and Discovery, based on a lease mechanism and remote events
Network of JINI-Lookup Services (Secure & Public)
MonALISA services
Proxies
HL services
Agents
Distributed System for gathering and analyzing information based on mobile agents: Customized aggregation, Triggers, Actions
Fully Distributed System with no Single Point of Failure
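The lease mechanism mentioned on this slide can be illustrated with a minimal sketch. It is not the JINI or MonALISA API: the class `LookupRegistry`, its methods, and the service names in the usage note are hypothetical, chosen only to show how an entry disappears from discovery once its lease is no longer renewed.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Lease:
    service_name: str
    expires_at: float


class LookupRegistry:
    """Toy lookup service: an entry survives only while its lease is renewed."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        # The clock is injectable so the behaviour can be tested without sleeping.
        self._clock = clock
        self._leases: Dict[str, Lease] = {}

    def register(self, service_name: str, duration_s: float) -> Lease:
        """Register a service and grant it a lease of duration_s seconds."""
        lease = Lease(service_name, self._clock() + duration_s)
        self._leases[service_name] = lease
        return lease

    def renew(self, service_name: str, duration_s: float) -> bool:
        """Extend a lease; returns False if the lease has already expired."""
        self._expire()
        if service_name not in self._leases:
            return False
        self._leases[service_name].expires_at = self._clock() + duration_s
        return True

    def discover(self) -> List[str]:
        """Return the names of all services whose lease is still valid."""
        self._expire()
        return sorted(self._leases)

    def _expire(self) -> None:
        # Drop every entry whose lease ran out; no explicit deregistration needed.
        now = self._clock()
        self._leases = {
            name: lease
            for name, lease in self._leases.items()
            if lease.expires_at > now
        }
```

A service that crashes simply stops renewing, so it drops out of `discover()` after at most one lease period. That is the property that lets a registry of this kind stay current without a central failure detector.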
MonALISA Service & Data Handling
Data Store
Data Cache Service & DB
Configuration Control (SSL)
Predicates & Agents
Data (via ML Proxy)
Applications Clients or Higher Level Services
WS Clients and service
WebService
WSDL SOAP
LookupService
LookupService
Registration
Discovery
Postgres
AGENTS
FILTERS / TRIGGERS
Monitoring Modules
Collects any type of information
Dynamic (Re)Loading
Push and Pull
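The push and pull access modes, together with predicate-based selection, can be sketched as follows. This is an illustration under assumptions, not the MonALISA client API: `Result`, `Predicate`, and `DataCache` are hypothetical names, and the wildcard matching on cluster, node, and parameter is an assumed predicate form.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Result:
    cluster: str
    node: str
    param: str
    value: float


@dataclass(frozen=True)
class Predicate:
    """Hypothetical predicate: shell-style wildcards on each field."""
    cluster: str = "*"
    node: str = "*"
    param: str = "*"

    def matches(self, r: Result) -> bool:
        return (fnmatchcase(r.cluster, self.cluster)
                and fnmatchcase(r.node, self.node)
                and fnmatchcase(r.param, self.param))


class DataCache:
    """Keeps published results and serves both push and pull clients."""

    def __init__(self) -> None:
        self._results: List[Result] = []
        self._subscribers: List[Tuple[Predicate, Callable[[Result], None]]] = []

    def subscribe(self, predicate: Predicate,
                  callback: Callable[[Result], None]) -> None:
        # Push mode: the callback fires for every future matching result.
        self._subscribers.append((predicate, callback))

    def publish(self, result: Result) -> None:
        # Called by a monitoring module whenever it produces a value.
        self._results.append(result)
        for predicate, callback in self._subscribers:
            if predicate.matches(result):
                callback(result)

    def query(self, predicate: Predicate) -> List[Result]:
        # Pull mode: the client asks for stored results on demand.
        return [r for r in self._results if predicate.matches(r)]
```

In this sketch a client subscribes once with a predicate and receives only matching values as they arrive, while `query` covers the case where a client wants stored history instead.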
Two levels of decisions:
local (autonomous),
global (correlations).
Actions triggered by:
values above/below given thresholds,
absence/presence of values,
correlations between any values.
Action types:
alerts (emails/instant msg/atom feeds),
running an external command,
automatic charts annotations in the repository,
running custom code, like securely ordering a ML service to (re)start a site service.
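The three trigger conditions listed above (thresholds, absence of values, correlations) can be sketched as small predicates over the latest values. All names here (`Trigger`, `above`, `absent`, `evaluate`) are hypothetical and the action is a plain callback; the real actions framework is not shown in the slides.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Values = Dict[str, Optional[float]]


@dataclass
class Trigger:
    name: str
    condition: Callable[[Values], bool]
    action: Callable[[str], None]


def above(param: str, threshold: float) -> Callable[[Values], bool]:
    """Condition: the parameter is present and exceeds the threshold."""
    def check(values: Values) -> bool:
        v = values.get(param)
        return v is not None and v > threshold
    return check


def absent(param: str) -> Callable[[Values], bool]:
    """Condition: no value was received for the parameter."""
    return lambda values: values.get(param) is None


def evaluate(triggers: List[Trigger], values: Values) -> List[str]:
    """Run each trigger's action when its condition holds; return fired names."""
    fired = []
    for trigger in triggers:
        if trigger.condition(values):
            trigger.action(trigger.name)
            fired.append(trigger.name)
    return fired
```

A correlation is then just a condition that combines several parameters, for example `lambda v: above("temperature", 25.0)(v) and above("humidity", 80.0)(v)`. The action callback stands in for an email alert, an external command, or custom code.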
ML Service
ML Service
Actions based on global information
Actions based on local information
• Traffic • Jobs • Hosts • Apps
• Temperature • Humidity • A/C Power • …
Sensors
Local decisions
Global decisions
Local and Global Decision Framework
Global ML Services
Monitoring architecture in ALICE
Diagram components:
Long History DB, LCG Tools, MonALISA Repository (Aggregated Data, Alerts, Actions)
MonALISA @Site, MonALISA @CERN, MonALISA LCG Site
ApMon, paired with: AliEn Job Agents, AliEn CE, AliEn SE, ClusterMonitor, AliEn TQ, AliEn IS, AliEn Optimizers, AliEn Brokers, MySQL Servers, CastorGrid Scripts, API Services
Monitored parameters: rss, vsz, cpu time, run time, job slots, free space, nr. of files, open files, queued JobAgents, cpu ksi2k, job status, disk used, processes, load, net In/out, jobs status, sockets, migrated mbytes, active sessions, MyProxy status
See Costin Grigoras’ poster (067):
Automated agents for management and control of the
ALICE Computing Grid
USLHCNet
USLHCNet provides transatlantic connections of the Tier1 computing facilities at Fermilab and Brookhaven with the Tier0 and Tier1 facilities at CERN as well as Tier1s elsewhere in Europe and Asia.
Together with ESnet, Internet2 and GEANT, USLHCNet supports connections between the Tier2 centers.
The USLHCNet core infrastructure uses Ciena Core Director devices, which provide time-division multiplexing and packet-forwarding protocols that support virtual circuits with bandwidth guarantees. These virtual circuits make it possible to develop efficient data transfer services with support for QoS and priorities.
Hybrid network: uses both Ciena CD and Force10 routers
4 transatlantic 10G links at the moment (6 links in the second part of this year)*
* See Harvey Newman's talk [502] from Monday: “Status and outlook of the HEP network”
USLHCNet ML weather map
Monitoring modules
We developed a set of monitoring modules for USLHCNet network devices:
Force10 (SNMP & sFlow)
  Traffic per interface
  sFlow traffic
  Link status monitoring
Ciena Core Director (TL1 – Transaction Language 1)
  ETTP (Ethernet Termination Point) traffic
  EFLOW (Ethernet Flow) traffic
  OSRP (routing protocol) topology
  Dynamic circuits inside the optical core of the network
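Per-interface traffic is typically derived from SNMP octet counters by differencing two samples. The sketch below shows that arithmetic only; it does not query any device. The function names are hypothetical, and the 64-bit default is an assumption (the slides do not say which counters the modules read).

```python
def counter_delta(previous: int, current: int, bits: int = 64) -> int:
    """Difference between two counter readings, tolerating one wraparound."""
    modulus = 1 << bits
    return (current - previous) % modulus


def rate_bps(prev_octets: int, curr_octets: int,
             interval_s: float, bits: int = 64) -> float:
    """Average traffic in bits per second between two octet-counter samples."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # Octets are bytes, so multiply by 8 to express the rate in bits.
    return counter_delta(prev_octets, curr_octets, bits) * 8 / interval_s
```

The modulo handles a counter that wrapped once between samples. With 32-bit counters a 10G link can wrap in a few seconds, which is why short polling intervals or 64-bit counters matter at these speeds.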
USLHCNet monitoring
MonALISA @GVA
MonALISA @CHI
MonALISA @NYC
MonALISA @AMS
SNMP
TL1
SNMP
USLHCNet redundant monitoring
MonALISA @GVA
MonALISA @CHI
MonALISA @NYC
MonALISA @AMS
Each circuit is monitored at both ends by at least two MonALISA services; the monitored data is aggregated by global filters in the repository
Local and global filters
Based on the MonALISA actions framework, a set of triggers has been deployed inside the service to notify the USLHCNet network engineers by email, SMS and IM in case of problems.
The filters developed for the USLHCNet repository aggregate the redundant monitoring data (traffic and link status) collected from all the MonALISA services.
The link status is computed as a logical “AND” between both end points of a link. This also cross checks the status reported by the hardware equipment.
We collect data in two repository instances, each with replicated database back-ends. These instances are dynamically balanced in DNS.
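The link status rule described on this slide (a logical AND between both end points, cross-checked against the hardware) can be sketched as below. The function names are hypothetical. How redundant reports for one end point are combined is not stated in the slides; treating an end point as up when any of its services reports it up is an assumption made here.

```python
from typing import Iterable


def endpoint_up(reports: Iterable[bool]) -> bool:
    """Combine redundant reports for one end point.

    Assumption: the end point counts as up if at least one monitoring
    service reports it up. An empty set of reports counts as down.
    """
    return any(reports)


def link_up(end_a_reports: Iterable[bool], end_b_reports: Iterable[bool]) -> bool:
    """Link status as a logical AND between both end points of the link."""
    return endpoint_up(end_a_reports) and endpoint_up(end_b_reports)


def cross_check(computed_up: bool, hardware_up: bool) -> bool:
    """True when the computed status agrees with the equipment's own report."""
    return computed_up == hardware_up
```

For example, `link_up([True, False], [True, True])` is `True` even though one service failed to see end A, while `link_up([True, True], [False, False])` is `False`. A `cross_check` result of `False` would be a natural input for an alarm trigger.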
USLHCNet: Precise measurements for the Operational Status on the WAN Link
Operations & management assisted by agent-based software
Used on the new CIENA equipment used for network management
USLHCNet: Traffic on different segments
USLHCNet: Accounting for Integrated Traffic
USLHCNet: Ciena alarms monitoring
The Need for Planning and Scheduling for Large Data Transfers
In Parallel vs. Sequential
Performing the two reading tasks sequentially is 2.5x faster than running them in parallel
Monitoring Optical Switches
Dynamic restoration of lightpath if a segment has problems
Controlling Optical Planes: Automatic Path Recovery
CERN Geneva
CALTECH Pasadena
Starlight
Manlan
USLHCNet
Internet2
“Fiber cut” simulations
The traffic moves from one transatlantic line to the other one
FDT transfer (CERN – CALTECH) continues uninterrupted
TCP fully recovers in ~ 20s
FDT transfer: 200+ MBytes/sec from a 1U node
4 fiber cut emulations (marked 1 to 4 on the plot)
For more details, see Iosif Legrand’s poster (054):
A High Performance Data Transfer Service
Conclusions
The MonALISA framework provides a flexible and reliable monitoring infrastructure
350+ installed services, 1.5M+ unique parameters, 25kHz value updates
Truly distributed architecture with no single points of failure
Highly modular platform
Automatic decision-making capability at both local and global levels
USLHCNet provides a state-of-the-art hybrid network with support for circuit oriented network services
Monitoring this infrastructure proved to be a challenging task, but we are running with 99.5+% monitoring uptime
We are investigating dynamic provisioning of circuits from collaborating agents
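To put the 99.5+% monitoring uptime figure in concrete terms, the small calculation below converts an uptime percentage into the corresponding downtime budget. The function name and the choice of a 365-day year are assumptions for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def max_downtime_hours(uptime_percent: float,
                       period_hours: float = HOURS_PER_YEAR) -> float:
    """Downtime allowed over the period for a given uptime percentage."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_hours * (100.0 - uptime_percent) / 100.0
```

With these assumptions, an uptime of at least 99.5% corresponds to at most 43.8 hours without monitoring data per year.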
http://monalisa.caltech.edu
http://repository.uslhcnet.org