Monitoring best practices & tools for running highly available databases

CERN IT Department, CH-1211 Genève 23, Switzerland, www.cern.ch/
Internet Services
Monitoring best practices & tools for running highly available databases
Miguel Anjo & Dawid Wojcik
DM meeting – 20.May.2008

DESCRIPTION

Monitoring best practices & tools for running highly available databases. Miguel Anjo & Dawid Wojcik, DM meeting – 20.May.2008. Topics: Oracle Real Application Clusters; architecture (RAC1 to RAC6); highly available databases – Oracle ‘services’.

TRANSCRIPT

Page 1: Monitoring best practices & tools for running highly available databases

CERN IT Department
CH-1211 Genève 23
Switzerland
www.cern.ch/it

Internet Services

Monitoring best practices & tools for running highly available databases

Miguel Anjo & Dawid Wojcik
DM meeting – 20.May.2008

Page 2: Monitoring best practices & tools for running highly available databases

Oracle Real Application Clusters

Page 3: Monitoring best practices & tools for running highly available databases

Architecture

[Diagram: RAC1, RAC2, RAC3, RAC4, RAC5, RAC6]

Page 4: Monitoring best practices & tools for running highly available databases

Highly Available databases – Oracle ‘services’

• Resources distributed among Oracle services
  – Applications assigned to dedicated service
  – On node failure, resources re-distributed

Service placement (one column per cluster node):

CMS_COND          Preferred  A1         A2         A3
CMS_C2K           Preferred  A3         A1         A2
CMS_DBS           A2         A3         A1         Preferred
CMS_DBS_W         A3         A1         A2         Preferred
CMS_SSTRACKER     Preferred  Preferred  Preferred  Preferred
CMS_TRANSFERMGMT  A2         Preferred  Preferred  A1

Service placement with one node fewer:

CMS_COND          Preferred  A1         A2
CMS_C2K           A2         Preferred  A1
CMS_DBS           A2         A1         Preferred
CMS_DBS_W         A1         A2         Preferred
CMS_SSTRACKER     Preferred  Preferred  Preferred
CMS_TRANSFERMGMT  Preferred  Preferred  A1
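
The two placements on this slide are consistent with a simple rule, sketched below in Python. This is only an inference from the slide, not a description of Oracle Clusterware internals: the first node's column is dropped, a lost Preferred node is replaced by the best-ranked available one, and the remaining available nodes are renumbered. The function name `redistribute` and the 0-based column index are assumptions made for the illustration.

```python
PREFERRED = "Preferred"


def redistribute(placement, failed_node):
    """Return the placement after the node in column `failed_node` is lost."""
    result = {}
    for service, roles in placement.items():
        lost = roles[failed_node]
        survivors = roles[:failed_node] + roles[failed_node + 1:]
        # Ranks of the surviving 'available' nodes, e.g. "A2" -> 2.
        ranks = sorted(int(r[1:]) for r in survivors if r != PREFERRED)
        # Assumed rule: a lost Preferred node is replaced by the best-ranked
        # available node, if there is one.
        promoted = ranks[0] if lost == PREFERRED and ranks else None
        remaining = [r for r in ranks if r != promoted]
        new_roles = []
        for role in survivors:
            if role == PREFERRED or int(role[1:]) == promoted:
                new_roles.append(PREFERRED)
            else:
                # Renumber the remaining available nodes as A1, A2, ...
                new_roles.append(f"A{remaining.index(int(role[1:])) + 1}")
        result[service] = new_roles
    return result


# Four-node placement, copied from the slide.
before = {
    "CMS_COND": ["Preferred", "A1", "A2", "A3"],
    "CMS_C2K": ["Preferred", "A3", "A1", "A2"],
    "CMS_DBS": ["A2", "A3", "A1", "Preferred"],
    "CMS_DBS_W": ["A3", "A1", "A2", "Preferred"],
    "CMS_SSTRACKER": ["Preferred"] * 4,
    "CMS_TRANSFERMGMT": ["A2", "Preferred", "Preferred", "A1"],
}

after = redistribute(before, failed_node=0)
for service, roles in after.items():
    print(f"{service:<18}" + "  ".join(roles))
```

Dropping the first column under this rule reproduces the three-node placement shown on the slide.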

Page 5: Monitoring best practices & tools for running highly available databases

Highly Available databases – Apps and DB release cycle

• Applications’ release cycle
  – Development service → Validation service → Production service

• Database software release cycle
  – Validation service, version 10.2.0.(n+1)
  – Production service, version 10.2.0.n
  – Production service, version 10.2.0.(n+1)

Page 6: Monitoring best practices & tools for running highly available databases

Why monitor?

• Monitor (n.)
  – Computer Science. A program that observes, supervises, or controls the activities of other programs.

• Need to keep all components in a healthy state
• We are prepared for single failures, some double failures
• Commitment to give 24/7 best effort service
• SW misbehavior affecting performance
• Trends might indicate need to grow system
• Security breaches

Page 7: Monitoring best practices & tools for running highly available databases

Monitoring participants

Page 8: Monitoring best practices & tools for running highly available databases

Monitoring participants

Page 9: Monitoring best practices & tools for running highly available databases

What we monitor

• 25 database clusters
• 124 servers, 450 cores, 150 disk-arrays, 2000 disks at Tier0
• 10 Tier1 sites for Streams replication

• 150+ Oracle ‘services’ / applications
• 2000+ user schemas
• 1M+ connections/day

Page 10: Monitoring best practices & tools for running highly available databases

PDB-Backup

• 2 node cluster
• Using Oracle Clusterware

• Running:
  – RACMon (monitoring agents)
  – StreamMon (monitoring agents)
  – Backups
  – Scripts repository

• Monitored by Lemon. Set as Critical in Operator procedures

Page 11: Monitoring best practices & tools for running highly available databases

Monitored components

• Servers
  – Accessibility
  – CDB state
  – Tools: Lemon + RACMon + OEM

• Disk arrays
  – Accessibility
  – State given by controller
    • Firmware, disk state, disk size, disk speed
  – Tools: Lemon + RACMon

• Database SW
  – Clusterware state
  – Service accessibility
  – Space available
  – Oracle Streams
  – Tools: RACMon + OEM + StreamMon

• Database usage
  – OS CPU, I/O
  – User Sessions, CPU, I/O
  – User quotas, tablespace usage
  – Bad usage (short connections, bind variables)
  – Table fragmentation
  – Tools: RACMon, Reports

Page 12: Monitoring best practices & tools for running highly available databases

Best practices (I)

• No overhead to DB (monitored object)

• Monitor as much as possible

• Presentation layer simple & compact

• Possibility to drill down

Page 13: Monitoring best practices & tools for running highly available databases

Best practices (II)

• Hierarchy of alarms and notifications

• Simplicity → reliability

• Centralized version vs. deployed everywhere

• Independent blocks (monitoring, dashboard, reporting) for HA

Page 14: Monitoring best practices & tools for running highly available databases

Monitoring tools

• Monitoring tools
  – Lemon, SLS
  – Basic Monitoring (in house development)
  – SQL scripts (reactive monitoring)
  – RACMon (in house development, openlab)
  – StreamMon (in house development, openlab)
  – OEM – Oracle Enterprise Manager (Grid Control) - openlab
  – Service oriented monitoring tools
    • Experiment reports
    • DB Availability & Performance Pages

Page 15: Monitoring best practices & tools for running highly available databases

Basic monitoring

• Checking every 5 minutes
• Each failure → e-mail with error
• 3 consecutive failures → SMS

• Almost perfect for single instance databases

• Limitations
  – On RAC, the system survives single HW failures
  – Users connect to ‘service’, not database instance
  – No monitoring of other components (storage, clusterware)
  – Missing dashboard view
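
The escalation rule on this slide (an e-mail for every failed check, an SMS after three consecutive failures) can be sketched as follows. This is a minimal Python illustration, not the in-house tool: the class name `BasicMonitor`, the `sent` list standing in for real e-mail and SMS delivery, and the choice to send the SMS only once per failure streak are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class BasicMonitor:
    sms_threshold: int = 3          # consecutive failures before an SMS
    consecutive_failures: int = 0
    sent: list = field(default_factory=list)  # stand-in for real delivery

    def record(self, ok, error=""):
        """Record the result of one periodic check (every 5 minutes on the slide)."""
        if ok:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        # Every failure produces an e-mail carrying the error text.
        self.sent.append(("email", error))
        # Assumption: the SMS goes out once, when the streak reaches the threshold.
        if self.consecutive_failures == self.sms_threshold:
            self.sent.append(("sms", error))


monitor = BasicMonitor()
for _ in range(3):
    monitor.record(ok=False, error="connection refused")
print(monitor.sent)
```

A successful check resets the counter, so an isolated failure never escalates beyond e-mail.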

Page 16: Monitoring best practices & tools for running highly available databases

DBA monitoring

• SQL scripts – reactive monitoring (ad-hoc monitoring)

• Pros:
  – Easy to use
  – Fast real time information

• Cons:
  – No global overview
  – Diagnosing single problem
  – Requires expert knowledge

Page 17: Monitoring best practices & tools for running highly available databases

RACMon requirements

• Reliable (24/7)

• Easy to use and configure

• Provides up to date information (frequent runs)

• Centralized – no configuration or deployment on RAC side

• Web interface (RAC monitoring dashboard) – one common place for RACs’ status

• Monitoring of Oracle services (DB and user level) and Oracle clusterware

• Monitoring of ASM instances (diskgroups and failgroups)

• Monitoring other parts of the infrastructure – backups, storage, … (easy extensibility)

• Notifications sent via emails & SMSs to DBAs

• Availability numbers (over extended periods of time)

• Disabling monitoring for specific machines or clusters (scheduled and unscheduled intervention logbook)
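
The last two requirements can be combined in a small calculation: availability over a period, with checks that fall inside a logged intervention left out. The Python sketch below is illustrative only. The slides do not say how RACMon computes availability, so the function `availability`, the sample format and the exclusion of intervention windows are assumptions.

```python
from datetime import datetime, timedelta


def availability(samples, interventions=()):
    """Percentage of successful checks, ignoring logged intervention windows.

    samples:       iterable of (timestamp, ok) pairs
    interventions: iterable of (start, end) pairs; end is exclusive
    """
    counted = [
        ok
        for when, ok in samples
        if not any(start <= when < end for start, end in interventions)
    ]
    if not counted:
        return None  # nothing was monitored in this period
    return 100.0 * sum(counted) / len(counted)


t0 = datetime(2008, 5, 20, 10, 0)
step = timedelta(minutes=5)
samples = [(t0, True), (t0 + step, False), (t0 + 2 * step, True), (t0 + 3 * step, True)]

print(availability(samples))
print(availability(samples, interventions=[(t0 + step, t0 + 2 * step)]))
```

With the intervention logged, the failed check during the maintenance window no longer counts against the cluster.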

Page 18: Monitoring best practices & tools for running highly available databases

RACMon Architecture

[Diagram labels: CERN Web Services, RAC databases, PDB-Backup, Monitoring cluster]
• Create, ship and run monitoring script
• Retrieve and parse output
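
The two steps named in the diagram (create, ship and run a monitoring script, then retrieve and parse its output) can be sketched in Python as below. This is not RACMon's code. The slides do not say how scripts are shipped, so `run_locally` simply runs the generated script in a local subprocess as a stand-in for remote execution, and the `name=value` output format is an assumption.

```python
import subprocess
import sys


def create_script(checks):
    """Build a tiny script that prints one 'name=value' line per check."""
    lines = [f"print({name!r} + '=' + str({expr}))" for name, expr in checks.items()]
    return "\n".join(lines)


def run_locally(script):
    """Stand-in for shipping the script to a database node and running it there."""
    done = subprocess.run(
        [sys.executable, "-c", script],
        capture_output=True, text=True, check=True,
    )
    return done.stdout


def parse_output(output):
    """Turn 'name=value' lines into a dict, skipping anything else."""
    result = {}
    for line in output.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            result[name.strip()] = value.strip()
    return result


def collect(checks, runner=run_locally):
    return parse_output(runner(create_script(checks)))


print(collect({"answer": "6 * 7"}))
```

Keeping the runner pluggable matches the centralized design on the requirements slide: nothing is installed on the monitored side, only the transport changes.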

Page 19: Monitoring best practices & tools for running highly available databases

RACMon - examples

Page 20: Monitoring best practices & tools for running highly available databases

RACMon - examples

Page 21: Monitoring best practices & tools for running highly available databases

RACMon

• Pros/Features:
  – Customized for our environment
  – Gives an overview of all our HW and RACs
  – Configurable alerts (via email and SMS) and alert levels (production or non-production systems)
  – Drill down details available via multiple links to other types of monitoring software (OEM, Lemon, StreamMon)

• Cons:
  – Requires manpower for development

Page 22: Monitoring best practices & tools for running highly available databases

Oracle Streams

• “Oracle Streams enables the propagation and management of data, transactions and events in a data stream either within a database, or from one database to another.”

Page 23: Monitoring best practices & tools for running highly available databases

StreamMon

Page 24: Monitoring best practices & tools for running highly available databases

StreamMon

Page 25: Monitoring best practices & tools for running highly available databases

StreamMon

• Streams availability and usage monitoring
• Built-in alerting in case of any error in the streams stack

• Pros:
  – Monitoring of all T1 sites in one place (streams monitoring not available in any other tool, including OEM)
  – Convenient and easy to use web interface
  – Advanced plotting utilities

• Cons:
  – Required manpower for development (currently in maintenance only)
  – Uses non-standard libraries, requires customized server

Page 26: Monitoring best practices & tools for running highly available databases

Oracle Enterprise Manager

• Architecture:
  – Agent running on each server uploads information to central repository; if the repository is not available, it caches data
  – Management Service provides insight into any monitored target details
  – Management Service sends e-mails (SMSes) based on set-up metrics and policies
  – Proactive monitoring possible (actions based on problem diagnostics)
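
The agent behaviour described in the first bullet (upload to a central repository, cache while it is unreachable) follows a common store-and-forward pattern. The Python sketch below shows that pattern only; it is not Oracle's agent, and `CachingAgent`, `FakeRepository` and the use of `ConnectionError` to signal an unavailable repository are assumptions made for the illustration.

```python
from collections import deque


class CachingAgent:
    """Uploads metrics in order, keeping them locally while the upload fails."""

    def __init__(self, upload):
        self.upload = upload  # callable that raises ConnectionError when down
        self.cache = deque()

    def report(self, metric):
        self.cache.append(metric)
        return self.flush()

    def flush(self):
        # Send the oldest cached metric first so ordering is preserved.
        while self.cache:
            try:
                self.upload(self.cache[0])
            except ConnectionError:
                return False  # keep everything cached for the next attempt
            self.cache.popleft()
        return True


class FakeRepository:
    """In-memory stand-in for the central repository."""

    def __init__(self):
        self.available = True
        self.rows = []

    def store(self, metric):
        if not self.available:
            raise ConnectionError("repository unavailable")
        self.rows.append(metric)


repo = FakeRepository()
agent = CachingAgent(repo.store)
agent.report("cpu=10")
repo.available = False
agent.report("cpu=20")      # cached, not lost
repo.available = True
agent.report("cpu=30")      # flushes the cached metric first
print(repo.rows)
```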

Page 27: Monitoring best practices & tools for running highly available databases

Oracle Enterprise Manager

• Oracle Enterprise Manager Grid Control features

Page 28: Monitoring best practices & tools for running highly available databases

Oracle Enterprise Manager

• Pros:
  – Highly configurable alerts, metrics and notification policies
  – Advanced and easy to use web interface
  – Easy drill down
  – External product – fully supported

• Cons:
  – Universal – requires more navigation
  – No global overview (per target oriented)
  – Customization for many targets requires much work
  – Bugs may be intrusive (e.g. affecting streams, excessive memory/CPU consumption, storage, DB instances)
  – Manpower required for maintenance and configuration
  – Not reliable enough for 24/7 monitoring

Page 29: Monitoring best practices & tools for running highly available databases

Weekly reports

• Targeted to experiment DBAs and Coordinators

• Information about
  – Bookkeeping: Application names, contacts
  – Resource usage: Sessions, CPU, Logical and Physical I/O
  – Security: Connection errors, expiring passwords, unused schemas
  – Space: consumed, fragmentation, recycle bin
  – Bad usage: short connections, queries missing bind variables

Page 30: Monitoring best practices & tools for running highly available databases

Weekly reports

• PHP scripts
• Generate report over last 7 days
• Specific to one RAC cluster
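
The core of such a report is an aggregation over a 7-day window. The reports themselves are PHP scripts; the sketch below uses Python only to illustrate the idea, and the record layout, the schema names `APP_A` to `APP_C` and the function `weekly_top_users` are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def weekly_top_users(records, today, top=3):
    """Sum CPU seconds per schema over the 7 days before `today`."""
    start = today - timedelta(days=7)
    totals = Counter()
    for day, schema, cpu_seconds in records:
        if start <= day < today:
            totals[schema] += cpu_seconds
    # Largest consumers first.
    return totals.most_common(top)


# Hypothetical usage records for one cluster: (day, schema, CPU seconds).
records = [
    (date(2008, 5, 12), "APP_A", 120.0),
    (date(2008, 5, 14), "APP_B", 300.0),
    (date(2008, 5, 18), "APP_A", 60.0),
    (date(2008, 5, 1), "APP_C", 900.0),   # outside the 7-day window
]

print(weekly_top_users(records, today=date(2008, 5, 19)))
```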

Page 31: Monitoring best practices & tools for running highly available databases

Weekly reports

Page 32: Monitoring best practices & tools for running highly available databases

Weekly reports

• Current functionality
  – Simple way to visualize whole DB usage
  – Concentrates on main users (dynamic)
  – Easy to spot problems (color coded)
  – Very good feedback from our users

• Now working on user configurable reports

Page 33: Monitoring best practices & tools for running highly available databases

DB availability and performance page

• PHP, aggregation of other tools
• Requested by experiments
• Dashboard of “current” DB activity

• Almost real time monitoring (up to last hour)
• Application resource usage
• No extra load
  – uses SLS, RACMon, StreamMon, weekly reports

• Possibility to drill down

Page 34: Monitoring best practices & tools for running highly available databases

DB availability and performance page

Page 35: Monitoring best practices & tools for running highly available databases

Summary

• Many monitoring components developed for our environment
  – Out of the box tools not sufficient
  – Open frameworks – new features easily added
  – Feedback given to Oracle Enterprise Manager development (openlab)

• Very good feedback from T1s and experiments
  – Components included in experiment dashboards, WLCG ServiceMaps, SLS