Monitoring best practices & tools for running highly available databases
CERN IT Department
CH-1211 Genève 23
Switzerland
www.cern.ch/it
Internet Services
Monitoring best practices & tools for running highly available databases
Miguel Anjo & Dawid Wojcik
DM meeting – 20.May.2008
Oracle Real Application Clusters
Architecture
[Diagram: clusters RAC1, RAC2, RAC3, RAC4, RAC5, RAC6]
Highly Available databases – Oracle ‘services’
• Resources distributed among Oracle services
  – Applications assigned to dedicated service
  – On node failure, resources re-distributed

CMS_COND          Preferred  A1         A2         A3
CMS_C2K           Preferred  A3         A1         A2
CMS_DBS           A2         A3         A1         Preferred
CMS_DBS_W         A3         A1         A2         Preferred
CMS_SSTRACKER     Preferred  Preferred  Preferred  Preferred
CMS_TRANSFERMGMT  A2         Preferred  Preferred  A1

CMS_COND          Preferred  A1         A2
CMS_C2K           A2         Preferred  A1
CMS_DBS           A2         A1         Preferred
CMS_DBS_W         A1         A2         Preferred
CMS_SSTRACKER     Preferred  Preferred  Preferred
CMS_TRANSFERMGMT  Preferred  Preferred  A1
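The preferred/available placement shown in the tables can be modelled roughly as follows. This is a minimal sketch for illustration, not Oracle Clusterware's actual algorithm: the function `place_services`, the node names `node1` to `node4`, and the reading of A1/A2/A3 as an ordered failover list are all assumptions.

```python
from typing import Dict, List, Set

# Hypothetical service definitions: preferred nodes plus an ordered list of
# "available" (failover) nodes. Node names are invented for this example.
SERVICES: Dict[str, Dict[str, List[str]]] = {
    "CMS_COND": {"preferred": ["node1"],
                 "available": ["node2", "node3", "node4"]},
    "CMS_DBS": {"preferred": ["node4"],
                "available": ["node3", "node1", "node2"]},
    "CMS_SSTRACKER": {"preferred": ["node1", "node2", "node3", "node4"],
                      "available": []},
}


def place_services(services: Dict[str, Dict[str, List[str]]],
                   live_nodes: Set[str]) -> Dict[str, List[str]]:
    """Return, per service, the nodes it runs on given the live nodes."""
    placement: Dict[str, List[str]] = {}
    for name, cfg in services.items():
        # Keep every preferred node that is still alive.
        running = [n for n in cfg["preferred"] if n in live_nodes]
        lost = len(cfg["preferred"]) - len(running)
        # Replace each lost preferred node by the next live available node.
        for node in cfg["available"]:
            if lost == 0:
                break
            if node in live_nodes:
                running.append(node)
                lost -= 1
        placement[name] = running
    return placement
```

With all four nodes alive each service stays on its preferred nodes; removing `node1` from `live_nodes` moves `CMS_COND` to `node2` and leaves `CMS_SSTRACKER` on the three surviving nodes.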
Highly Available databases – Apps and DB release cycle
• Applications’ release cycle
  – Development service, Validation service, Production service
• Database software release cycle
  – Validation service (version 10.2.0.(n+1)), Production service (version 10.2.0.n), Production service (version 10.2.0.(n+1))
Why monitor?
• Monitor (n.)
  – Computer Science. A program that observes, supervises, or controls the activities of other programs.
• Need to keep all components in a healthy state
• We are prepared for single failures, some double failures
• Commitment to give 24/7 best effort service
• SW misbehavior affecting performance
• Trends might indicate need to grow system
• Security breaches
Monitoring participants
What we monitor
• 25 database clusters
• 124 servers, 450 cores, 150 disk-arrays, 2000 disks at Tier0
• 10 Tier1 sites for Streams replication
• 150+ Oracle ‘services’ / applications
• 2000+ user schemas
• 1M+ connections/day
PDB-Backup
• 2 node cluster
• Using Oracle Clusterware
• Running:
  – RACMon (monitoring agents)
  – StreamMon (monitoring agents)
  – Backups
  – Scripts repository
• Monitored by Lemon. Set as Critical in Operator procedures
Monitored components
• Servers
  – Accessibility
  – CDB state
  – Tools: Lemon + RACMon + OEM
• Disk arrays
  – Accessibility
  – State given by controller
    • Firmware, disk state, disk size, disk speed
  – Tools: Lemon + RACMon
• Database SW
  – Clusterware state
  – Service accessibility
  – Space available
  – Oracle Streams
  – Tools: RACMon + OEM + StreamMon
• Database usage
  – OS CPU, I/O
  – User Sessions, CPU, I/O
  – User quotas, tablespace usage
  – Bad usage (short connections, bind variables)
  – Table fragmentation
  – Tools: RACMon, Reports
Best practices (I)
• No overhead to DB (monitored object)
• Monitor as much as possible
• Presentation layer simple & compact
• Possibility to drill down
Best practices (II)
• Hierarchy of alarms and notifications
• Simplicity → reliability
• Centralized version vs. deployed everywhere
• Independent blocks (monitoring, dashboard, reporting) for HA
Monitoring tools
• Monitoring tools
  – Lemon, SLS
  – Basic Monitoring (in house development)
  – SQL scripts (reactive monitoring)
  – RACMon (in house development, openlab)
  – StreamMon (in house development, openlab)
  – OEM: Oracle Enterprise Manager (Grid Control), openlab
  – Service oriented monitoring tools
    • Experiment reports
    • DB Availability & Performance Pages
Basic monitoring
• Checking every 5 minutes
• Each failure → e-mail with error
• 3 consecutive failures → SMS
• Almost perfect for single instance databases
• Limitations
  – On RAC, the system survives single HW failures
  – Users connect to ‘service’, not database instance
  – No monitoring of other components (storage, clusterware)
  – Missing dashboard view
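The escalation rule on this slide (an e-mail for each failed check, an SMS after three consecutive failures) can be sketched as below. This is an illustrative sketch, not the in-house tool's code: the class `Escalator`, its method names, and the choice to send the SMS only once, at exactly the third failure, are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

SMS_THRESHOLD = 3  # consecutive failures before an SMS, as on the slide


@dataclass
class Escalator:
    """Tracks consecutive failures per target and decides what to send."""
    failures: Dict[str, int] = field(default_factory=dict)

    def record(self, target: str, ok: bool, error: str = "") -> List[Tuple[str, str]]:
        """Return the (channel, message) notifications for one check result."""
        if ok:
            # A successful check resets the consecutive-failure counter.
            self.failures[target] = 0
            return []
        count = self.failures.get(target, 0) + 1
        self.failures[target] = count
        # Every failure produces an e-mail carrying the error text.
        notes = [("email", f"{target}: {error}")]
        # Assumption: the SMS is sent once, when the threshold is reached.
        if count == SMS_THRESHOLD:
            notes.append(("sms", f"{target}: {count} consecutive failures"))
        return notes
```

A scheduler (for example a cron job every 5 minutes) would call `record` once per target and check, and pass the returned tuples to whatever mail and SMS gateways are in use.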
DBA monitoring
• SQL scripts – reactive monitoring (ad-hoc monitoring)
• Pros:
  – Easy to use
  – Fast real time information
• Cons:
  – No global overview
  – Diagnosing single problem
  – Requires expert knowledge
RACMon requirements
• Reliable (24/7)
• Easy to use and configure
• Provides up to date information (frequent runs)
• Centralized – no configuration or deployment on RAC side
• Web interface (RAC monitoring dashboard) – one common place for RACs’ status
• Monitoring of Oracle services (DB and user level) and Oracle clusterware
• Monitoring of ASM instances (diskgroups and failgroups)
• Monitoring other parts of the infrastructure – backups, storage, … (easy extensibility)
• Notifications sent via e-mail & SMS to DBAs
• Availability numbers (over extended periods of time)
• Disabling monitoring for specific machines or clusters (scheduled and unscheduled intervention logbook)
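The last two requirements (availability numbers and an intervention logbook) could be combined as in the sketch below. It assumes, purely for illustration, that checks falling inside a logged intervention window are left out of the availability figure; the slides do not say how RACMon actually computes its numbers, and the function name and data layout are hypothetical.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[float, float]  # (start, end) timestamps in seconds


def availability(checks: Iterable[Tuple[float, bool]],
                 interventions: List[Interval]) -> float:
    """Percentage of successful checks, ignoring intervention windows.

    checks: (timestamp, ok) pairs, one per monitoring run.
    interventions: logged windows during which monitoring is disabled.
    """
    counted = 0
    ok_count = 0
    for timestamp, ok in checks:
        # Skip checks taken while an intervention was in progress.
        if any(start <= timestamp < end for start, end in interventions):
            continue
        counted += 1
        ok_count += 1 if ok else 0
    if counted == 0:
        raise ValueError("no checks outside intervention windows")
    return 100.0 * ok_count / counted
```

Counting checks rather than integrating downtime keeps the calculation simple; its resolution is then limited by the check interval.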
RACMon Architecture
[Diagram labels: CERN Web Services; RAC databases; PDB-Backup monitoring cluster; “Create, ship and run monitoring script”; “Retrieve and parse output”]
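The flow named in the architecture diagram (create, ship and run a monitoring script, then retrieve and parse its output) can be sketched as follows. This is not RACMon's implementation: the use of `ssh` with `sh -s`, the `name=value` output format, and all function names are assumptions chosen to show an agentless, centralized check.

```python
import subprocess
from typing import Callable, Dict


def build_script(checks: Dict[str, str]) -> str:
    """Create a shell script printing one 'name=value' line per check."""
    lines = ["#!/bin/sh"]
    for name, command in checks.items():
        lines.append(f'echo "{name}=$({command})"')
    return "\n".join(lines) + "\n"


def run_over_ssh(host: str, script: str) -> str:
    """Ship the script on stdin and run it with 'sh -s' on the remote host."""
    result = subprocess.run(
        ["ssh", host, "sh", "-s"],
        input=script, capture_output=True, text=True, timeout=60, check=True,
    )
    return result.stdout


def parse_output(output: str) -> Dict[str, str]:
    """Parse 'name=value' lines; lines without '=' are ignored."""
    metrics: Dict[str, str] = {}
    for line in output.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            metrics[name.strip()] = value.strip()
    return metrics


def monitor(host: str, checks: Dict[str, str],
            runner: Callable[[str, str], str] = run_over_ssh) -> Dict[str, str]:
    """Create, ship and run the script, then retrieve and parse its output."""
    return parse_output(runner(host, build_script(checks)))
```

Nothing has to be installed on the monitored node beyond a shell and SSH access, which matches the "no configuration or deployment on RAC side" requirement. The `runner` parameter exists so the transport can be swapped or faked in tests.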
RACMon - examples
RACMon
• Pros/Features:
  – Customized for our environment
  – Gives an overview of all our HW and RACs
  – Configurable alerts (via email and SMS) and alert levels (production or non-production systems)
  – Drill down details available via multiple links to other types of monitoring software (OEM, Lemon, StreamMon)
• Cons:
  – Requires manpower for development
Oracle Streams
• “Oracle Streams enables the propagation and management of data, transactions and events in a data stream either within a database, or from one database to another.”
StreamMon
• Streams availability and usage monitoring
• Built-in alerting in case of any error in the Streams stack
• Pros:
  – Monitoring of all T1 sites in one place (Streams monitoring not available in any other tool, including OEM)
  – Convenient and easy to use web interface
  – Advanced plotting utilities
• Cons:
  – Required manpower for development (currently in maintenance only)
  – Uses non-standard libraries, requires customized server
Oracle Enterprise Manager
• Architecture:
  – Agent running on each server uploads information to the central repository; if the repository is not available, it caches data
  – Management Service provides insight into any monitored target's details
  – Management Service sends e-mails (SMSes) based on the configured metrics and policies
  – Proactive monitoring possible (actions based on problem diagnostics)
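The agent behaviour described above (upload to the repository, cache when it is unavailable) is a store-and-forward pattern. The sketch below illustrates that pattern only; it is not how the OEM agent is implemented, and `CachingAgent`, the `upload` callback and the bounded in-memory cache are assumptions.

```python
from collections import deque
from typing import Callable, Deque, Dict


class CachingAgent:
    """Uploads samples in order; buffers them while the repository is down."""

    def __init__(self, upload: Callable[[Dict], None], max_cached: int = 1000):
        # 'upload' is a hypothetical callback that raises ConnectionError
        # when the central repository cannot be reached.
        self._upload = upload
        # Bounded cache: when full, the oldest sample is dropped.
        self._cache: Deque[Dict] = deque(maxlen=max_cached)

    def report(self, sample: Dict) -> int:
        """Queue the sample, try to flush, return how many remain cached."""
        self._cache.append(sample)
        while self._cache:
            try:
                self._upload(self._cache[0])
            except ConnectionError:
                break  # repository unavailable: keep samples for later
            self._cache.popleft()  # remove only after a successful upload
        return len(self._cache)
```

Removing a sample only after its upload succeeds preserves ordering and avoids losing data on a failed attempt; the bound on the cache keeps a long outage from exhausting memory on the monitored server.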
Oracle Enterprise Manager
• Oracle Enterprise Manager Grid Control features
Oracle Enterprise Manager
• Pros:
  – Highly configurable alerts, metrics and notification policies
  – Advanced and easy to use web interface
  – Easy drill down
  – External product – fully supported
• Cons:
  – Universal – requires more navigation
  – No global overview (per target oriented)
  – Customization for many targets requires much work
  – Bugs may be intrusive (e.g. affecting streams, excessive memory/CPU consumption, storage, DB instances)
  – Manpower required for maintenance and configuration
  – Not reliable enough for 24/7 monitoring
Weekly reports
• Targeted to experiment DBAs and Coordinators
• Information about
  – Bookkeeping: application names, contacts
  – Resource usage: sessions, CPU, logical and physical I/O
  – Security: connection errors, expiring passwords, unused schemas
  – Space: consumed, fragmentation, recycle bin
  – Bad usage: short connections, queries missing bind variables
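One way to spot "queries missing bind variables" is to replace literals with placeholders and count how many distinct statements collapse to the same shape. The sketch below shows that heuristic only; it is not how the weekly reports (which are PHP scripts) detect it, and the function names, regular expressions and threshold are assumptions.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set

_STRING = re.compile(r"'(?:[^']|'')*'")    # quoted string literals
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")  # standalone numeric literals


def normalize(sql: str) -> str:
    """Replace literals with placeholders and collapse case and whitespace."""
    sql = _STRING.sub(":s", sql)
    sql = _NUMBER.sub(":n", sql)
    return " ".join(sql.lower().split())


def literal_offenders(statements: Iterable[str],
                      threshold: int = 3) -> Dict[str, int]:
    """Return statement shapes seen with at least 'threshold' literal variants."""
    variants: Dict[str, Set[str]] = defaultdict(set)
    for stmt in statements:
        variants[normalize(stmt)].add(stmt)
    return {shape: len(v) for shape, v in variants.items() if len(v) >= threshold}
```

A statement that already uses a bind variable appears with a single text no matter how often it runs, so it never reaches the threshold; statements built by concatenating values produce many variants of one shape and are flagged.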
Weekly reports
• PHP scripts
• Generate report over last 7 days
• Specific to one RAC cluster
Weekly reports
• Current functionality
  – Simple way to visualize whole DB usage
  – Concentrates on main users (dynamic)
  – Easy to spot problems (color coded)
  – Very good feedback from our users
• Now working on user configurable reports
DB availability and performance page
• PHP, aggregation of other tools
• Requested by experiments
• Dashboard of “current” DB activity
• Almost real time monitoring (up to last hour)
• Application resource usage
• No extra load
  – uses SLS, RACMon, StreamMon, weekly reports
• Possibility to drill down
Summary
• Many monitoring components developed for our environment
  – Out of the box tools not sufficient
  – Open frameworks – new features easily added
  – Feedback given to Oracle Enterprise Manager development (openlab)
• Very good feedback from T1s and experiments
  – Components included in experiment dashboards, WLCG ServiceMaps, SLS