Bull's vision
© Bull, 2014
October 14th 2014
Dave Williams, Technical Architect
Multi-Tenant Nagios Monitoring
Agenda
  Background
  Multi-Tenant Monitoring
  Why Multi-Tenant
  Multi-Tenant Design
  Service Catalogue
  Futures & ‘Blue Sky thinking’
  Questions
Background
  UK based
    Mainframe (IBM & Honeywell)
    Unix (HP-UX, AIX, Solaris)
    Linux (RedHat, SLES, Debian)
    Network (CASE, 3COM, CISCO)
  Working for Bull
    French Computer Manufacturer
    Mainframes, Unix, HPC, Security, Managed Services, Advisory Services
Background
  System Monitoring
    OpenView
    Netview
    Open Master
  Open Source Monitoring
    NetSaint on AIX
    Nagios
Why Multi-Tenant?
  Outsourcing Support & Monitoring
    Multiple Customers
      –Different Levels of security
      –Different Hardware / Software Platforms
    One Support Team
      –Only need to know about real problems
      –Can be driven by support ticket, not Nagios
    Required 365 x 24
      –Infrastructure must survive all outages without loss of service
Multi-Tenant Design
  Each customer may have 2-3000 hosts
    10-100 services per host
    Real time monitoring
  Customer profile
    SLA Reporting
    Batch Event completion
    Different SLAs for each Business Process per customer
    Different alerting & escalation methods per customer
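The host and service counts above translate into a check rate that the platform has to sustain. The following is a rough sizing sketch only: it takes the slide's upper figure of 3000 hosts, and the 5-minute check interval is an assumption that does not come from the presentation.

```python
def checks_per_second(hosts: int, services_per_host: int, interval_seconds: int) -> float:
    """Average active checks per second if every service is checked once per interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return hosts * services_per_host / interval_seconds


# Assumed check interval of 5 minutes (not stated in the slides).
ASSUMED_INTERVAL = 300

# 3000 hosts at the low and high ends of the services-per-host range.
low_rate = checks_per_second(3000, 10, ASSUMED_INTERVAL)
high_rate = checks_per_second(3000, 100, ASSUMED_INTERVAL)

print(f"{low_rate:.0f} to {high_rate:.0f} checks per second per customer")
```

Under these assumptions a single large customer needs between 100 and 1000 checks per second, which is one way to see why check execution is worth spreading over several nodes.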
Multi-Tenant Design
  Hardware Platform – Central Support
    Virtualised Platform (Intel based)
      –XenServer Hypervisor
        Allows clustering with shared storage
        Inexpensive Licensing
    Shared Storage
      –NAS
        Using QNAP Appliances with underlying RAID-5 & Hot Spare protection
        Network connection using dual interfaces bonded across multiple switches
        Could have used FreeNAS
    LAN Infrastructure
      –Dual connections to all hardware
      –SNMP managed switches
[Figure: Hardware Platform – Basic Schematic]
Multi-Tenant Design
  Hardware Platform – Resilience
    Virtualised Platform (Intel based)
      –XenServer Hypervisor
        Allows clustering with shared storage
        If Primary node fails, cluster will ‘spin up’ image on 2nd node
        Same data / logs (Shared storage)
    LAN Infrastructure
      –Dual connections to all hardware
        Bonded interfaces for NAS access – no data loss / access loss with failure
        SNMP managed switches
[Figure: Hardware Setup]
Multi-Tenant Design
  Hardware Platform – Recovery
    Virtualised Platform (Intel based)
      –XenServer Hypervisor
        Allows clustering with shared storage
        If Primary Site fails, will spin up image
        Internet Access fails over – using BGP
    Shared Storage – replicated from Primary Site
      –NAS
        Using QNAP Appliances with underlying RAID-5 & Hot Spare protection
        Using RTRR (Real Time Remote Replication) between sites
        Network connection using dual interfaces bonded across multiple switches
    LAN Infrastructure
      –Dual connections to all hardware
        Bonded interfaces for NAS access – no data loss / access loss with failure
        SNMP managed switches
[Figure: Hardware Platform – Resilience]
Hardware Platform – Customer Site
  Using generic netbooks
    Minimum requirement
      –1 GB Memory, Atom processor, Ethernet Port
      –Running CentOS 6.4 64-bit Operating System
  Can use Raspberry Pi for small customers
      –512 MB Memory, ARM processor, Ethernet Port
      –Running Raspbian Operating System
Software Platform – Central Site
  Nagios – Core
    Running latest 4.0.8
    Using MK Livestatus for interfacing
    Using Thruk for Visualisation
  Graylog2 / Elasticsearch
    Store all logs & Syslog in ‘Big Data’ repository using MongoDB
  Asterisk PBX
    Allow all alerting to use standard dial-up with speech synthesis + IVR
  SMS-Client
    Still using TAPI to SMS Text contacts
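To illustrate what "using MK Livestatus for interfacing" can look like, here is a minimal sketch that builds a Livestatus query (LQL) for the critical services of one tenant. The helper name build_lql and the tenant prefix filter are assumptions made for illustration; the presentation does not show the queries actually used. The text would be written to the Livestatus socket, which is not done here.

```python
def build_lql(table, columns, filters=()):
    """Build a Livestatus query: GET line, headers, then a blank line to end the request."""
    lines = [f"GET {table}", "Columns: " + " ".join(columns)]
    # Several Filter headers are combined with a logical AND by Livestatus.
    lines += [f"Filter: {condition}" for condition in filters]
    lines.append("OutputFormat: json")
    return "\n".join(lines) + "\n\n"


# Hypothetical tenant selection: host names starting with the customer prefix.
query = build_lql(
    "services",
    ["host_name", "description", "state", "plugin_output"],
    ["host_name ~ ^Custloc1-", "state = 2"],  # state 2 is CRITICAL
)
print(query)
```

Filtering on a per-customer host name prefix is only one possible way to separate tenants; host groups or contact-based authorisation would be alternatives.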
Software Platform – Central Site (contd)
  NRPE
    Running 2.15
  NSCA & NSCA-ng
    Using NSCA for external communication
    Using NSCA-ng for issuing remote commands
  Postfix / Procmail
    Used to generate emails but also handle responses
    Routes unsolicited alerting emails (HP Insight, Pingdom)
  OTRS
    Record alerts, track issues
Software Platform – Remote Site
  Nagios – Core
    Running latest 4.0.8
  NRPE
    Running 2.14
  NSCA
    Using NSCA for external communication
  OpenVPN
    Communication via IPSec VPN
[Figure: Customer Multi-Tenant]
[Figure: Multi Tenant Schematic]
Service Catalogue
  ITIL Flavour
    Really just services & their characteristics
Service Catalogue
  Agreed list of servers / services
    With importance levels
    With alerting paths
    With escalation paths
    Recovery options
  Feeds into Service Level Agreements and Operational Level Agreements
    Basis of agreed reporting structures
Examples
  Basic Spreadsheet plus Shell script
    Usually easy to create; the Shell script differs for each customer but is based on an initial standard script
  Chef or Puppet
    Use Exported Resources
    Nagios Cookbook – Nagios Conference 2012 Presentation
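The spreadsheet approach can be sketched as a small generator that turns catalogue rows into Nagios object definitions. This is written in Python rather than shell purely for illustration. The column names, the sample rows and the generic-host / generic-service templates are all assumptions, not the presenter's actual catalogue or script.

```python
import csv
import io

# Hypothetical service catalogue exported from a spreadsheet as CSV.
CATALOGUE_CSV = """customer,host,address,service,check_command
Custloc1,swfeltsw01,192.0.2.10,SSH,check_ssh
Custloc1,swfeltsw01,192.0.2.10,HTTP,check_http
Custloc2,nwfeltsw01,192.0.2.20,SSH,check_ssh
"""


def render_config(csv_text):
    """Emit one host definition per unique host and one service definition per row."""
    seen_hosts = set()
    blocks = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Prefix the customer location so names stay unique across tenants.
        name = f"{row['customer']}-{row['host']}"
        if name not in seen_hosts:
            seen_hosts.add(name)
            blocks.append(
                "define host {\n"
                "    use        generic-host\n"
                f"    host_name  {name}\n"
                f"    address    {row['address']}\n"
                "}\n"
            )
        blocks.append(
            "define service {\n"
            "    use                  generic-service\n"
            f"    host_name            {name}\n"
            f"    service_description  {row['service']}\n"
            f"    check_command        {row['check_command']}\n"
            "}\n"
        )
    return "\n".join(blocks)


config = render_config(CATALOGUE_CSV)
print(config)
```

A real catalogue would also carry the importance levels, alerting paths and escalation paths listed on the previous slide; those columns are left out to keep the sketch short.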
Multi Tenant Issues
  Naming conventions
    Every customer has a server01
    Customers' naming conventions are obscure
    Customers have multiple physical locations or levels of security
      –This gives rise to Nagios names that differ from the actual names:
      –Custloc1-swfeltsw01
      –Custloc2-nwfeltsw01
  Not so smart when a non-Nagios originated alert is received
      –‘swfeltsw01 – RAID battery backup failure’ from HP Insight, for example
      –The external alert processor has to perform table lookups before building the appropriate NSCA command, for example
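The table lookup described above can be sketched as follows. The lookup table, the customer keys and the service description "HP Insight Alerts" are hypothetical. The output line uses the tab-separated host, service, return code and plugin output format that send_nsca reads on standard input.

```python
# Hypothetical mapping from (customer, name used in the alert) to the Nagios host name.
HOST_TABLE = {
    ("cust1", "swfeltsw01"): "Custloc1-swfeltsw01",
    ("cust2", "nwfeltsw01"): "Custloc2-nwfeltsw01",
}

CRITICAL = 2  # Nagios return code for a critical state


def nsca_line(customer, alert_text, service="HP Insight Alerts", state=CRITICAL):
    """Translate an external alert into one passive check result for send_nsca."""
    raw_host, separator, message = alert_text.partition(" – ")
    if not separator:
        raise ValueError(f"unrecognised alert format: {alert_text!r}")
    key = (customer, raw_host.strip())
    if key not in HOST_TABLE:
        raise KeyError(f"no Nagios host mapped for {key}")
    return f"{HOST_TABLE[key]}\t{service}\t{state}\t{message.strip()}\n"


line = nsca_line("cust1", "swfeltsw01 – RAID battery backup failure")
print(line, end="")
```

The customer has to be part of the lookup key because, as noted above, the same raw host name can exist at several customers. How the customer is identified (sender address, source IP or similar) is left open here.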
Futures & Blue Sky thinking
  The Nagios Visualisation is resource heavy
    All Customers want their own Dashboard
    All Customers want a different screen layout
  Why not move the visualisation into the cloud?
    Use an Amazon EC2 image to access central Livestatus via https
    Allow end user to authenticate
    Customer portal allows ‘spin up’ & ‘spin down’ of images
      –Move billing to the customer
      –Scale horizontally for Visualisation
Load Sharing
  Using plugins like check_wmi_plus puts a strain on the monitoring system: they issue a large number of queries that take wall clock time to complete and parse.
  Better to have ‘worker nodes’ via Merlin, Mod Gearman or similar to perform these functions – Raspberry Pi for example.
  No great expense to add 2/3 Pis to customer site configurations; easy fall back if they fail – no unique locally stored data.
[Figure: BPI Example]
[Figure: Dashboard Example]
Questions?