vcenter operations

vCenter Operations Technical Discussion, May 2011

Iwan ‘e1’ Rahabok, Senior Systems [email protected] | virtual-red-dot.blogspot.com | 9119-9226

VCAP-DCD

Introduction

Including physical

management

1. Private Cloud Self-Service Solution Bundle: IaaS

2. Infrastructure & Operations Performance, Capacity, Configuration

3. Security & Compliance vShield + VCM: Operational and Regulatory Compliance

4. IT Service Management Problem, incident, change, config

5. Application Management App Release + Performance

Automation >< Orchestration

More Engineering More Management

Performance problems often occur with no real warning
– Many times end users are the first to notice problems
– Root cause determination is difficult and time-consuming
– Solving problems requires all-hands-on-deck bridge calls

Real-time understanding of performance is lacking
– No reliable understanding of the health of IT infrastructure makes IT too reactive
– Siloed monitoring tools do not allow a common “truth”
– No correlation across IT silos

Optimizing IT infrastructure is difficult if not impossible
– Understanding the abnormal metric behaviors that lead to degradation of Key Performance Indicators is not possible with current tools
– Understanding the abnormal behaviors that define your worst performing devices is not possible with current tools
– Heavy reliance on “Tribal Knowledge” of a few application experts

Management Challenges

What If You Could…

• Automate
  • Eliminate time-consuming problem resolution processes

• Correlate and Accelerate
  • “One Click” to root cause of emerging performance problems to reduce MTTI/MTTR

• Get Proactive
  • Avert end user and business impact of building performance problems

• Collaborate
  • Aggregate and correlate data from monitoring landscape to create a single “truth”

• Optimize
  • Tune components to deliver optimal performance for application transactions

vCenter Operations

vCenter Operations Advanced

vCenter Operations Enterprise
+ Configuration & Compliance Management (vCenter Configuration Manager)
+ Other VMware & 3rd Party Integrations (View, management, servers, storage)

Non-VMware (incl. physical) environments

VMware Cloud / vCenter

vCenter

vCenter Operations Standard Capacity

Management

Performance Management

(up to 1500 VM)

Purpose Built Capacity Planning & Analysis
• Integrated capacity analysis and forecasting
• Decision support & automation via views, alerts, reports
• VM right sizing and capacity reclamation

Automated Configuration & Compliance
• Automated Patching and Provisioning
• Comprehensive change tracking to isolate root cause
• Single-click rollback to remediate and return to normal

Patented Performance Analytics
• Self-learning of “normal” performance conditions
• Service health baseline and trending
• Smart alerts of impending performance degradation

Comparing the Editions

Scope

• Data Sources
  • Standard: vCenter x 1
  • Enterprise: Any 3rd party monitoring tools’ time series data; Change events; Multiple vCenter Servers

• Objects
  • Standard: vCenter Objects (i.e.) Data Centers, Clusters, ESX Hosts, Datastores, VMs x 1500
  • Enterprise: Unlimited Scope (i.e.) Applications, Network Infrastructure, Storage, Hosts (ESX, Win, Linux, etc), VMs

• Users
  • Standard: Infrastructure (e.g. VI Admins)
  • Enterprise: Operations, Infrastructure, Application Teams, Business Owners, CxOs

Function

• Dynamic Thresholds: Standard Yes, Enterprise Yes
• Performance Root Cause: Standard Yes, Enterprise Yes
• Proactive Alerting: Standard No, Enterprise Yes
• Customizable Dashboards: Standard No, Enterprise Yes
• Notifications: Standard No, Enterprise Yes

vCenter Operations – Standard Edition

Demo

• Familiarisation of UI
• Infrastructure and Analysis

• Concepts
  • Workload
  • Health
  • Capacity

vCenter Environment - Workload

• Workload Measures
  • Demand for resources vs. Resources currently used
  • Result is a percentage of Workload
    • Low number is Good – Object has the resources it needs
    • Can go above 100% – Object is “Starving”

• Workload summarized across critical resources
  • CPU
  • Storage
  • Network
  • Memory

• Workload Details View
  • View the state of the Peer and Parent Objects and troubleshoot
    • Am I a victim or a villain?
    • Is this a population problem?
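The Workload percentage described on this slide can be sketched as follows. The slide does not give the exact formula, so this is only an illustration under two assumptions: that Workload is demand divided by the amount of the resource the object is able to use, and that the summary value is the most constrained of the four resources. The function names and the sample VM figures are hypothetical.

```python
from typing import Dict, Tuple


def workload_percent(demand: float, available: float) -> float:
    """Demand as a percentage of what the object can use (assumed formula)."""
    if available <= 0:
        raise ValueError("available must be positive")
    return 100.0 * demand / available


def summarize_workload(usage: Dict[str, Tuple[float, float]]) -> Tuple[str, float]:
    """Return the most constrained resource and its workload percentage.

    `usage` maps a resource name to a (demand, available) pair.
    Taking the maximum across resources is an assumption, not the product's rule.
    """
    scores = {name: workload_percent(d, a) for name, (d, a) in usage.items()}
    worst = max(scores, key=scores.get)
    return worst, scores[worst]


# Hypothetical VM: memory demand exceeds what it can get, so it is "starving".
vm = {
    "cpu": (1200.0, 4000.0),     # MHz demanded, MHz available
    "memory": (9.0, 8.0),        # GB demanded, GB available
    "storage": (300.0, 1000.0),  # IOPS demanded, IOPS available
    "network": (50.0, 1000.0),   # Mbps demanded, Mbps available
}
resource, score = summarize_workload(vm)
print(f"{resource}: {score:.1f}%")
```

A value above 100% for any one resource is enough to mark the object as starving in this sketch, which matches the "in one or more areas" wording used for the red Workload badge.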

vCenter Environment - Health

• Health Measures
  • How normal is this object behaving: 0-100 (Higher is Healthier or Normal)
  • Learns dynamic ranges of “Normal” for each metric
  • Learns patterns of behavior and identifies metric abnormalities
  • Healthy = no abnormalities

• Health and Workload together
  • Health High and Workload High – Normal Behavior for this timeframe
  • Health High and Workload Low – Normal Behavior for this timeframe
  • Health Low and Workload High – Something is amiss! Performance spike
  • Health Low and Workload Low – Something is amiss. Demand drops

Important Note: Low Health does not imply a problem. It tells you that the object is acting differently than normal.
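The four Health and Workload combinations above can be written as a small lookup. This is a sketch only: the cutoffs that separate "high" from "low" (50 for Health, 85 for Workload) are assumptions chosen for illustration, and the function name is hypothetical.

```python
def interpret(health: int, workload: int,
              health_cutoff: int = 50, workload_cutoff: int = 85) -> str:
    """Map a Health/Workload pair to the reading given on the slide.

    The cutoff values are illustrative assumptions, not product settings.
    """
    healthy = health > health_cutoff
    busy = workload >= workload_cutoff
    if healthy:
        # High Health means the object behaves as usual, whether busy or idle.
        return "normal behavior for this timeframe"
    if busy:
        return "abnormal: possible performance spike"
    return "abnormal: demand may have dropped"


print(interpret(health=90, workload=95))
print(interpret(health=20, workload=110))
print(interpret(health=20, workload=10))
```

Note that the two "abnormal" results say the object is acting differently than usual, not that something is broken.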

Learn Normal Behavior and Identify Abnormalities

• Doesn’t assume IT data has a normal bell-shaped distribution

• Sophisticated Analytics – 8 different algorithms

• Learns your dynamic ranges of “Normal” without templates

• Learns patterns of behavior and identifies Abnormalities

BLUE LINE: Metric’s Current Value

GRAY BAR: Upper and Lower band of Dynamic Threshold – “Normal”

RED BAR: Breached Dynamic Threshold – “Abnormal”

vCenter Environment - Capacity

• Capacity
  • How much time before Capacity runs out?
  • 0-100: Higher number, longer time.
  • Thresholds User Configurable
    • 30 Days Left = RED
    • 60 Days Left = Orange
    • Etc.

• Unlike Workload, Capacity is long-term.

• Capacity measured for critical resources
  • CPU, RAM, Storage, Network

• Capacity Details View
  • Shows the chart and trend for each of the above resources
  • Denotes current state
  • Projected breach point and days left
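The "projected breach point and days left" idea can be illustrated with a straight-line trend. This is a sketch under a stated assumption: the slides do not say how the product projects capacity, so an ordinary least-squares line over daily usage samples is used here as a stand-in. The function name and sample figures are hypothetical.

```python
from typing import Optional, Sequence


def days_until_full(daily_usage: Sequence[float], capacity: float) -> Optional[float]:
    """Fit a straight line to daily usage and project when it reaches capacity.

    Returns the number of days after the last sample, or None when usage is
    flat or shrinking (no projected breach).
    """
    n = len(daily_usage)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = range(n)
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_usage))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    breach_day = (capacity - intercept) / slope
    # Count from the last observed day; already-breached cases clamp to zero.
    return max(0.0, breach_day - (n - 1))


# Hypothetical datastore: 1000 GB capacity, growing 10 GB per day.
usage_gb = [500, 510, 520, 530, 540]
print(days_until_full(usage_gb, capacity=1000))
```

With configurable thresholds such as the 30-day and 60-day marks above, the returned value is what would be compared against them.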

Health (Deviation)

• Green square: 76–100.
  • The health of the object is normal. No attention required.

• Yellow square: 51–75.
  • The object is experiencing some level of issues. You must check and take appropriate action.

• Orange square: 26–50.
  • The object might have serious issues. You must check and take appropriate action as soon as possible.

• Red square: 0–25.
  • The object is either not functioning properly or will stop functioning soon. You must take an action immediately.

• Blue square:
  • No data is available for any of the metrics for the time period.

• Gray square:
  • The object is offline.

Workload

• Green circle: 0–84.
  • There is no excessive workload on the object. No attention required.

• Yellow circle: 85–94.
  • The object is experiencing some high resource workloads.

• Orange circle: 95–99.
  • Workload on the object is approaching its capacity in at least one area.

• Red circle: 100 or more.
  • Workload on the object is at or over its capacity in one or more areas.

The numbers 85 and 95 are shown as Green and Yellow lines in the Events chart.

Capacity

• Green cube: 26–100.
  • The object is not expected to reach its capacity limits within the next 120 days.

• Yellow cube: 16–25.
  • In 60–120 days.

• Orange cube: 6–15.
  • In 30–60 days.

• Red cube: 0–5.
  • In < 30 days.

The numbers 5, 15 and 25 are shown as colored lines in the Events chart.
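The badge color ranges listed for Health, Workload and Capacity can be captured in one lookup. The ranges come from the slides; the helper and constant names are hypothetical, and the blue (no data) and gray (offline) Health states are not modeled in this sketch.

```python
from typing import Sequence, Tuple

Bands = Sequence[Tuple[int, str]]

# Each entry is (lowest score in the band, color), checked from highest to lowest.
HEALTH_BANDS: Bands = ((76, "green"), (51, "yellow"), (26, "orange"), (0, "red"))
WORKLOAD_BANDS: Bands = ((100, "red"), (95, "orange"), (85, "yellow"), (0, "green"))
CAPACITY_BANDS: Bands = ((26, "green"), (16, "yellow"), (6, "orange"), (0, "red"))


def badge_color(score: int, bands: Bands) -> str:
    """Return the color of the first band whose lower bound the score reaches."""
    for lower, color in bands:
        if score >= lower:
            return color
    raise ValueError("score must not be negative")


print(badge_color(80, HEALTH_BANDS))     # Health 80
print(badge_color(97, WORKLOAD_BANDS))   # Workload 97
print(badge_color(10, CAPACITY_BANDS))   # Capacity 10
```

Health and Capacity read "higher is better", while Workload reads "lower is better", which is why the Workload table runs in the opposite color order.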

Performance Visibility Across the Virtualized Datacenter

Full visibility up and down the datacenter stack

Aggregates 100s of metrics into 1 intelligent score

Drill into ESX server for further details

Intuitive, Web RIA-based user-friendly interface

Context sensitive object hierarchy

Breadcrumbs to track object hierarchy

Search and filter

Continuous, automatic learning of normal behavior for key metrics

Workload issue correlated to net I/O constraints

Quickly show Reservation vs Demand vs Usage

Drilldown to track changes

Diagnostics relative to parent, peer and child objects

Detailed display of events and health score changes

Visibility into Disk and Network IO performance

Disk subsystem performance details by datastores and LUNs

Network statistics for every NIC

Quiz: what’s the difference between Total & Host?

Quickly identify “suspect” performance metric

KPI history with timestamp to indicate root cause

Capacity

• Estimating the number of days left
  • Score is 0-100. Non-linear. 10 does not mean 10 days left.

• CapacityIQ value add:
  • What-If analysis
  • Discovery of over-allocated and under-allocated VMs
  • Reporting
  • A Capacity-centric dashboard

Capacity: Guest OS level info

Relative scores to prioritize any remediation efforts

Health tree with topology mapping

Top-down visibility into health changes

Time-series charts for individual metric

Individual performance metric details

Single view that correlates multiple metrics

Detailed list of all metrics indicating smart alerts

Visualisation quickly pinpoints hotspots

Single click drill down for further details

Storage

• Since all the datastores are on the same array, how do we quickly tell the relative workload generated by every one of them?

• For each of these datastores, how do we know the relative workload generated by the VM?

• For every VM, how do we know the latency is within a reasonable range?

• How do we show all the above data in “one chart”, without the need to show a lot of numbers?

Heatmap customisation

vCenter Operations Standard Architecture

Four Main Services: Collector, Analytics, Web, ActiveMQ

Bundled DB: PostgreSQL DB, plus File-based DB (FSDB) for raw metric storage

Single Collector for vCenter. Embedded in appliance

vCenter Operations Standard Processing

1a: vCenter Collector collects metrics, topology & change events from vCenter (ongoing)

1b: Data stored in FSDB

2a: Analytics runs daily to determine hour-by-hour Dynamic Thresholds for next 24 hours

2b: Full FSDB is scanned by the analytic algorithms to determine, per metric, the best match for the next 24 hour period

2c: Store metric Dynamic Thresholds data in PostgreSQL DB

3: Incoming data points are tested against Dynamic Threshold bands and used to calculate Health, Workload and Capacity

4: Results provided to UI: Update “Badges”, provide Root Cause for Health scores, etc.
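The daily learning step (2a to 2c) and the per-sample test (3) can be sketched as below. The product's eight competing algorithms are not described in these slides, so this example substitutes a simple per-hour percentile band, which at least shares the property of not assuming a bell-shaped distribution. All names and the 5%/95% band edges are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import quantiles
from typing import Dict, Iterable, List, Tuple

Band = Tuple[float, float]


def learn_hourly_bands(history: Iterable[Tuple[int, float]]) -> Dict[int, Band]:
    """Learn a (lower, upper) band per hour of day from (hour, value) samples.

    Stand-in for the daily analytics run: a percentile band replaces the
    product's undisclosed algorithms.
    """
    by_hour: Dict[int, List[float]] = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    bands: Dict[int, Band] = {}
    for hour, values in by_hour.items():
        if len(values) < 2:
            continue  # not enough data to learn "normal" for this hour
        cuts = quantiles(values, n=20)  # 19 cut points in 5% steps
        bands[hour] = (cuts[0], cuts[-1])
    return bands


def is_abnormal(bands: Dict[int, Band], hour: int, value: float) -> bool:
    """Test an incoming data point against the learned band for its hour."""
    if hour not in bands:
        return False  # nothing learned yet, so nothing to breach
    lower, upper = bands[hour]
    return value < lower or value > upper


# Hypothetical CPU history: busy around 09:00, quiet around 03:00.
history = [(9, float(v)) for v in range(40, 61)] + [(3, float(v)) for v in range(5, 16)]
bands = learn_hourly_bands(history)
print(is_abnormal(bands, hour=9, value=50.0))
print(is_abnormal(bands, hour=3, value=50.0))
```

The same value of 50 is inside the band at 09:00 and outside it at 03:00, which is the point of learning thresholds hour by hour rather than setting one static limit.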

vCenter Operations – Ent Edition


Data Agnostic Approach to Data Collection

Accepts any time series data (examples)
• Server OS
• Server App layer (eg, IIS, Oracle, WebSphere, etc)
• Network
• Storage
• User Experience
• Transactional
• Business Data
• Change Events

Minimal Required Fields (4)
• Object Name, Metric Name, Value, Timestamp

Data Extraction - *not* an analytic question
• No rules/templates to Write and Maintain
• vCenter Operations Analytics do all of the “Work”
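The four required fields can be modeled as a single record type. The sketch below parses them from CSV text; the CSV layout, column names and sample rows are assumptions made for illustration and do not represent the product's actual import format.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime
from typing import Iterator


@dataclass(frozen=True)
class Sample:
    """One time-series data point with the four minimal fields."""
    object_name: str
    metric_name: str
    value: float
    timestamp: datetime


def parse_samples(text: str) -> Iterator[Sample]:
    """Yield samples from CSV text with object, metric, value, timestamp columns."""
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        yield Sample(
            object_name=row["object"],
            metric_name=row["metric"],
            value=float(row["value"]),
            timestamp=datetime.fromisoformat(row["timestamp"]),
        )


# Hypothetical feed mixing an infrastructure metric and a business metric.
FEED = """object,metric,value,timestamp
web01,cpu_usage_pct,42.5,2011-05-01T10:00:00
orders-db,txn_per_sec,118,2011-05-01T10:00:00
"""
samples = list(parse_samples(FEED))
for s in samples:
    print(s.object_name, s.metric_name, s.value, s.timestamp.isoformat())
```

Because nothing in the record says what kind of metric it is, server, network and business data all take the same shape, which is what makes the collection approach data agnostic.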

vCenter Operations


Learn Normal Behavior and Identify Abnormalities

• Doesn’t assume IT data has a normal bell-shaped distribution

• Sophisticated Analytics – 8 different algorithms

• Learns your dynamic ranges of “Normal” without templates

• Learns patterns of behavior and identifies Abnormalities

BLUE LINE: Metric’s Measured Value

GRAY BAR: Learned Upper and Lower band of Dynamic Threshold – “Normal”

RED ZONE: Breached Dynamic Threshold – “Abnormal”


Dynamic Threshold Algorithms

Understand the normal behavior of any time-series metric

Eight (8) distinct algorithms each determine an upper and lower ‘band’ – results of each algorithm compete to ‘win’ to represent the ‘best choice’

Dynamic Thresholds are the Cornerstone to all other forms of vC Ops Ent - Stand Alone Analytics

vC Ops Ent - Stand Alone detects metric-level abnormalities for use in:
• Generation of Smart Alerts
• Visualizing real-time ‘Health’
• Revealing hidden relationships
• etc.

* Figure shows a performance metric (blue line), its normal behavior (gray zone), and when it’s behaving abnormally (red area)


Proactive Alerting – Smart Alerts

User Experience (eg, RUM, etc.)

Database Silo (eg, Quest, etc.)

App Data (eg, Wily, etc.)

Network Data (e.g., Ionix IPPM, etc.)

Smart Alert Generation (“When”)

Business Data (eg, Finance)

! SMART ALERT

Business Application


Smart Alert Trigger

vC Ops Ent - Stand Alone tracks aggregate amount of abnormality and alerts when “explosion” is detected, or when a ‘high water mark’ is detected

Intrinsically observed that performance problems are first seen at the metric level when metrics begin to behave abnormally

• Blue shaded region represents the number of metrics for an application (represented by a set of servers/devices) that are at any given time measured abnormally

• The Red line represents an Analytically determined ever-changing level at which vC Ops Ent - Stand Alone determines a performance warning is warranted – a Smart Alert is triggered
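The trigger described above, an abnormal-metric count crossing an ever-changing line, can be sketched as follows. How the product computes that line is not stated, so this example assumes a "noise line" of twice the trailing median with a fixed floor. The function name, window size, factor and floor are all illustrative assumptions.

```python
from statistics import median
from typing import List, Sequence


def smart_alert_points(abnormal_counts: Sequence[int], window: int = 6,
                       factor: float = 2.0, floor: float = 5.0) -> List[int]:
    """Return indexes where the abnormal-metric count first rises above the noise line.

    The noise line is recomputed at every step from the previous `window`
    counts, so it adapts to the application's usual level of abnormality.
    """
    alerts: List[int] = []
    above = False
    for i in range(window, len(abnormal_counts)):
        recent = abnormal_counts[i - window:i]
        noise_line = max(floor, factor * median(recent))
        crossed = abnormal_counts[i] > noise_line
        if crossed and not above:
            alerts.append(i)  # alert once per crossing, not on every sample
        above = crossed
    return alerts


# Hypothetical counts of abnormal metrics for one application, per interval.
counts = [3, 4, 2, 5, 3, 4, 4, 3, 18, 25, 30, 6, 4]
print(smart_alert_points(counts))
```

A steady background of a few abnormal metrics never triggers an alert in this sketch; only the sudden "explosion" at the ninth interval does.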


Smart Alert Summary (“What”)

Root cause technology tier is the DB

Metric-level root cause symptoms - START HERE

Impact analysis shows the health of the application as well as the health of the tiers that comprise the application

Root-Cause ranks the tiers in order of priority and within those tiers shows the most affected metrics and resources


Drill down to the Root Cause


Early Warning SMART ALERT

Noise Line Crossed




Impact to application health

Impact to health of each technology tier

No major impact to application Key Performance Indicators (KPIs)…yet.



See the effect of change and other external events on application health with this “mash up” view



Learning behaviour analytically

Determine performance Health

Alert only when applications need attention

Tracking disparate “Resources” from various technology silos


Impact to health of each technology tier

Proactive Alert

DB is Root Cause tier START HERE!

Symptoms

Application Health

Performance Visibility Across the Virtualized Datacenter

KPIs are outside their normal level but have not breached SLAs


Performance Visibility Across the Entire Datacenter

Application Owner View - Application health view with active alerts and tier health


Dynamic Performance Dashboards – Application Owner Views

Application health view with active alerts and tier health

Health and Alerts broken down by Tier and Objects

Heat Maps allow you to see the Health of hundreds of objects at once.


Real-Time Performance Insight

Performance health of each individual app

Performance of KPIs with dynamic thresholds

CIO View: performance of all apps


Dynamic Performance Dashboards – CIO Views

Performance health of each individual app

Performance of KPIs with dynamic thresholds

Performance of all apps

Launched in Context


Dynamic Performance Dashboards – Historic Weather Maps

Response time “weather map” that can be played back over a selectable time period to show problem patterns


Dynamic Performance Dashboards – Customizable

Simply drag and drop visualization “widgets” to create new, role-based dashboards

Set widget interactions to create powerful in-context dashboards


Health Score

Automatically understand performance ‘Health’ for
• A single Server, Device, Resource
• Entire Tier or Silo
• Entire Application or Service
• Entire Datacenter
• Any Arbitrary Group of Resources

• 100 (Green) = Perfect Performance (i.e., entirely normal)
• 0 (Red) = Terrible Performance (extremely abnormal behavior)

Objective measure of performance based on underlying level of abnormal behavior. Adjusted based on:
• # of Abnormally behaving KPI
• # of Abnormally behaving metrics
• Consideration of lowest / highest volume of abnormalities
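To make the inputs listed above concrete, here is an illustrative scoring function. The actual Health formula is not given in these slides; this sketch assumes a simple linear blend of the fraction of abnormal KPIs and the fraction of abnormal metrics, with KPIs weighted more heavily per item. The function name and the `kpi_weight` parameter are hypothetical.

```python
def health_score(abnormal_metrics: int, total_metrics: int,
                 abnormal_kpis: int, total_kpis: int,
                 kpi_weight: float = 0.5) -> int:
    """Return a 0-100 score where 100 means no abnormal behavior (assumed formula)."""
    if total_metrics <= 0:
        raise ValueError("total_metrics must be positive")
    metric_fraction = abnormal_metrics / total_metrics
    if total_kpis > 0:
        kpi_fraction = abnormal_kpis / total_kpis
        weight = kpi_weight
    else:
        # No KPIs defined: score on metrics alone.
        kpi_fraction = 0.0
        weight = 0.0
    abnormality = weight * kpi_fraction + (1 - weight) * metric_fraction
    return round(100 * (1 - abnormality))


# Hypothetical application tier: 200 metrics, 10 of which are KPIs.
print(health_score(abnormal_metrics=0, total_metrics=200, abnormal_kpis=0, total_kpis=10))
print(health_score(abnormal_metrics=20, total_metrics=200, abnormal_kpis=5, total_kpis=10))
```

The same function can be applied to a single server or to any group of resources by summing the counts first, which mirrors the idea of one score for any arbitrary grouping.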


One Source of Truth Across the Enterprise

Health - Objective measure of performance based on underlying level of abnormal behavior

Analytics provide a Health score for any resource or grouping

• A single Server, Device, Resource

• Entire Tier or Silo

• Entire Application or Service

• Entire Datacenter

• Any Arbitrary Group of Resources

Dynamic Performance Dashboards – Health Scores

“How is our world doing?”


vC Ops Ent - Stand Alone Architecture

Four installed ‘Services’: Collector, Analytics, Web, ActiveMQ

Architecture includes MS SQL or Oracle DB, plus File-based DB (FSDB) for raw metric storage

Collectors can be distributed for scalability, or to span DCs & firewalls

Analytics runs daily to determine hour-by-hour DTs for next 24 hours

Incoming data points are tested against DT bands, metric-level anomalies are tracked for Alerting and Dashboarding

“Northbound” integration with products like Ionix SMARTS SAM


Under the Hood of vC Ops Ent

Thank You
