vcenter operations

vCenter Operations Technical Discussion, May 2011 Iwan ‘e1’ Rahabok Senior Systems Consultant [email protected] | virtual-red-dot.blogspot.com | 9119-9226 VCAP-DCD

Upload: alden

Post on 24-Feb-2016

TRANSCRIPT

Page 1: vCenter Operations

vCenter Operations
Technical Discussion, May 2011

Iwan ‘e1’ Rahabok, Senior Systems Consultant [email protected] | virtual-red-dot.blogspot.com | 9119-9226

VCAP-DCD

Page 2: vCenter Operations

Introduction

Page 3: vCenter Operations

Including physical

management

• Private Cloud: Self-Service Solution Bundle: IaaS
• Infrastructure & Operations: Performance, Capacity, Configuration
• Security & Compliance: vShield + VCM: Operational and Regulatory Compliance
• IT Service Management: Problem, incident, change, config
• Application Management: App Release + Performance

Page 4: vCenter Operations

Automation >< Orchestration

More Engineering More Management

Page 5: vCenter Operations

Management Challenges

Performance problems often occur with no real warning
– Many times end users are the first to notice problems
– Root cause determination is difficult and time-consuming
– Solving problems requires all-hands-on-deck bridge calls

Real-time understanding of performance is lacking
– No reliable understanding of the health of IT infrastructure makes IT too reactive
– Siloed monitoring tools do not allow a common “truth”
– No correlation across IT silos

Optimizing IT infrastructure is difficult if not impossible
– Understanding the abnormal metric behaviors that lead to degradation of Key Performance Indicators is not possible with current tools
– Understanding the abnormal behaviors that define your worst performing devices is not possible with current tools
– Heavy reliance on “Tribal Knowledge” of a few application experts

Page 6: vCenter Operations

What If You Could…

• Automate
  • Eliminate time-consuming problem resolution processes
• Correlate and Accelerate
  • “One Click” to root cause of emerging performance problems to reduce MTTI/MTTR
• Get Proactive
  • Avert end user and business impact of building performance problems
• Collaborate
  • Aggregate and correlate data from monitoring landscape to create a single “truth”
• Optimize
  • Tune components to deliver optimal performance for application transactions

Page 7: vCenter Operations

vCenter Operations

Page 8: vCenter Operations

vCenter Operations Advanced

vCenter Operations Enterprise
+ Configuration & Compliance Management (vCenter Configuration Manager)
+ Other VMware & 3rd Party Integrations (View, management, servers, storage)

Non-VMware (incl. physical) environments

VMware Cloud / vCenter

vCenter

vCenter Operations Standard

Capacity Management

Performance Management (up to 1500 VM)

Page 9: vCenter Operations

Purpose Built Capacity Planning & Analysis
• Integrated capacity analysis and forecasting
• Decision support & automation via views, alerts, reports
• VM right sizing and capacity reclamation

Automated Configuration & Compliance
• Automated Patching and Provisioning
• Comprehensive change tracking to isolate root cause
• Single-click rollback to remediate and return to normal

Patented Performance Analytics
• Self-learning of “normal” performance conditions
• Service health baseline and trending
• Smart alerts of impending performance degradation

Page 10: vCenter Operations

Comparing the Editions

Scope

Data Sources
  Standard: vCenter x 1
  Enterprise:
  • Any 3rd party monitoring tools’ time series data
  • Change events
  • Multiple vCenter Servers

Objects
  Standard: vCenter Objects (i.e.)
  • Data Centers
  • Clusters
  • ESX Hosts
  • Datastores
  • VMs x 1500
  Enterprise: Unlimited Scope (i.e.)
  • Applications
  • Network Infrastructure
  • Storage
  • Hosts (ESX, Win, Linux, etc)
  • VMs

Users
  Standard: Infrastructure (e.g. VI Admins)
  Enterprise: Operations, Infrastructure, Application Teams, Business Owners, CxOs

Function

Dynamic Thresholds: Standard Yes, Enterprise Yes
Performance Root Cause: Standard Yes, Enterprise Yes
Proactive Alerting: Standard No, Enterprise Yes
Customizable Dashboards: Standard No, Enterprise Yes
Notifications: Standard No, Enterprise Yes

Page 11: vCenter Operations

vCenter Operations – Standard Edition

Page 12: vCenter Operations

Demo

• Familiarisation of UI
• Infrastructure and Analysis
• Concepts
  • Workload
  • Health
  • Capacity

Page 13: vCenter Operations

vCenter Environment - Workload

• Workload Measures
  • Demand for resources vs. Resources currently used
  • Result is a percentage of Workload
    • Low number is Good – Object has the resources it needs
    • Can go above 100% - Object is “Starving”
• Workload summarized across critical resources
  • CPU
  • Storage
  • Network
  • Memory
• Workload Details View
  • View the state of the Peer and Parent Objects and troubleshoot
    • Am I a victim or a villain?
    • Is this a population problem?

Page 14: vCenter Operations

vCenter Environment - Health

• Health Measures
  • How normally is this object behaving?
  • 0-100 (Higher is Healthier or Normal)
  • Learns dynamic ranges of “Normal” for each metric
  • Learns patterns of behavior and identifies metric abnormalities
  • Healthy = no abnormalities
• Health and Workload together
  • Health High and Workload High – Normal Behavior for this timeframe
  • Health High and Workload Low – Normal Behavior for this timeframe
  • Health Low and Workload High – Something is amiss! Performance spike
  • Health Low and Workload Low – Something is amiss. Demand drops

Important Note: Low Health does not imply a problem. It tells you that the object is acting differently than normal.
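The four Health and Workload combinations on this slide can be read as a small decision table. The sketch below is illustrative only: the slide does not say where "High" starts, so the cut-offs (Health 76 and above, Workload 85 and above) are assumptions borrowed from the colour bands shown on later slides, and the function name `interpret` is hypothetical.

```python
def interpret(health: float, workload: float,
              healthy_from: float = 76, busy_from: float = 85) -> str:
    """Read Health and Workload together, following the slide's four cases.

    The default cut-offs are assumptions, not documented product values.
    """
    health_high = health >= healthy_from
    workload_high = workload >= busy_from
    if health_high:
        # High Health means the object behaves as learned, whatever the Workload.
        return "normal behavior for this timeframe"
    if workload_high:
        return "something is amiss: performance spike"
    return "something is amiss: demand drop"


print(interpret(health=90, workload=95))
print(interpret(health=30, workload=10))
```

As the slide notes, a low Health reading only says the object is behaving differently than normal, so the "amiss" results are prompts to investigate rather than confirmed faults.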

Page 15: vCenter Operations

Learn Normal Behavior and Identify Abnormalities

• Doesn’t assume IT data has a normal bell-shaped distribution

• Sophisticated Analytics – 8 different algorithms

• Learns your dynamic ranges of “Normal” without templates

• Learns patterns of behavior and identifies Abnormalities

BLUE LINE: Metric’s Current Value

GRAY BAR: Upper and Lower band of Dynamic Threshold - “Normal”

RED BAR: Breached Dynamic Threshold – “Abnormal”
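To make the idea of a learned "Normal" band concrete, here is a minimal sketch of one distribution-free approach: per-hour percentile bands. It is not one of the product's eight algorithms, which the slides do not describe; the percentile choice (5th and 95th), the hour-of-day grouping, and the names `learn_bands` and `is_abnormal` are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import quantiles


def learn_bands(samples, lower_pct=5, upper_pct=95):
    """Learn a (low, high) band per hour of day from (hour, value) samples.

    Percentiles make no assumption that the data is bell-shaped.
    Each hour needs at least two samples.
    """
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    bands = {}
    for hour, values in by_hour.items():
        cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
        bands[hour] = (cuts[lower_pct - 1], cuts[upper_pct - 1])
    return bands


def is_abnormal(bands, hour, value):
    """True when the value falls outside the learned band (the red bar)."""
    low, high = bands[hour]
    return value < low or value > high


history = [(9, v) for v in range(10, 30)]  # hypothetical 09:00 readings
bands = learn_bands(history)
print(is_abnormal(bands, 9, 20), is_abnormal(bands, 9, 50))
```

Because the band is learned per hour, a value that is normal at 09:00 can still be flagged at 03:00, which is the behaviour the gray bar in the chart illustrates.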

Page 16: vCenter Operations

vCenter Environment - Capacity

• Capacity
  • How much time before Capacity runs out?
  • 0-100: Higher number, longer time.
  • Thresholds User Configurable
    • 30 Days Left = RED
    • 60 Days Left = Orange
    • Etc.
• Unlike Workload, Capacity is long-term.
• Capacity measured for critical resources
  • CPU, RAM, Storage, Network
• Capacity Details View
  • Shows the chart and trend for each of the above resources
  • Denotes current state
  • Projected breach point and days left

Page 17: vCenter Operations

Health (Deviation)

• Green square: 76–100.
  • The health of the object is normal. No attention required.
• Yellow square: 51–75.
  • The object is experiencing some level of issues. You must check and take appropriate action.
• Orange square: 26–50.
  • The object might have serious issues. You must check and take appropriate action as soon as possible.
• Red square: 0–25.
  • The object is either not functioning properly or will stop functioning soon. You must take an action immediately.
• Blue square:
  • No data is available for any of the metrics for the time period.
• Gray square:
  • The object is offline.

Page 18: vCenter Operations

Workload

• Green circle: 0–84.
  • There is no excessive workload on the object. No attention required.
• Yellow circle: 85–94.
  • The object is experiencing some high resource workloads.
• Orange circle: 95–99.
  • Workload on the object is approaching its capacity in at least one area.
• Red circle: 100 or more.
  • Workload on the object is at or over its capacity in one or more areas.

The numbers 85 and 95 are shown as Green and Yellow lines in the Events chart.
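The Workload bands above translate directly into a lookup. The sketch below also shows one plausible way to arrive at the percentage and to summarise across resources. Two points are assumptions rather than documented behaviour: that Workload is demand divided by the capacity available to the object, and that the object-level figure is driven by its most constrained resource (suggested by "in at least one area"). All function names are hypothetical.

```python
def workload_pct(demand: float, capacity: float) -> float:
    """Workload as a percentage; values above 100 mean the object is starving."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return 100.0 * demand / capacity


def workload_band(score: float) -> str:
    """Map a Workload score to the badge colour listed on the slide."""
    if score >= 100:
        return "red"
    if score >= 95:
        return "orange"
    if score >= 85:
        return "yellow"
    return "green"


def object_workload(resources: dict[str, tuple[float, float]]) -> tuple[str, float]:
    """Return the most constrained resource and its Workload percentage.

    resources maps a resource name to (demand, capacity).
    """
    name = max(resources, key=lambda r: workload_pct(*resources[r]))
    return name, workload_pct(*resources[name])


name, score = object_workload({"cpu": (2.0, 4.0), "memory": (9.0, 8.0)})
print(name, score, workload_band(score))
```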

Page 19: vCenter Operations

Capacity

• Green cube: 26–100.
  • The object is not expected to reach its capacity limits within the next 120 days.
• Yellow cube: 16–25.
  • In 60 - 120 days.
• Orange cube: 6–15.
  • In 30 - 60 days.
• Red cube: 0–5.
  • In < 30 days.

The numbers 5, 15 and 25 are shown as colored lines in the Events chart.
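The Capacity score is a banded indicator rather than a count of days, so it helps to see the mapping written out. This is a small sketch using the default bands listed above; the table layout and the name `capacity_band` are illustrative, and since the deck says these thresholds are user configurable, the values should be treated as defaults only.

```python
# (lowest score in band, badge colour, time until capacity is reached)
CAPACITY_BANDS = [
    (26, "green", "more than 120 days"),
    (16, "yellow", "60 to 120 days"),
    (6, "orange", "30 to 60 days"),
    (0, "red", "fewer than 30 days"),
]


def capacity_band(score: int) -> tuple[str, str]:
    """Return (colour, time horizon) for a Capacity score between 0 and 100."""
    if not 0 <= score <= 100:
        raise ValueError("Capacity score must be between 0 and 100")
    for floor, colour, horizon in CAPACITY_BANDS:
        if score >= floor:
            return colour, horizon
    raise AssertionError("unreachable: the last band starts at 0")


print(capacity_band(10))
```

A score of 10 lands in the orange band (30 to 60 days), which shows why the score cannot be read as a number of days.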

Page 20: vCenter Operations

Performance Visibility Across the Virtualized Datacenter

Full visibility up and down the datacenter stack

Aggregates 100s of metrics into 1 intelligent score

Drill into ESX server for further details

Page 21: vCenter Operations

Intuitive, Web RIA-based user-friendly interface

Context sensitive object hierarchy

Breadcrumbs to track object hierarchy

Search and filter

Page 22: vCenter Operations

Continuous, automatic learning of normal behavior for key metrics

Workload issue correlated to net I/O constraints

Quickly show Reservation vs Demand vs Usage

Page 23: vCenter Operations

Drilldown to track changes

Diagnostics relative to parent, peer and child objects

Detailed display of events and health score changes

Page 24: vCenter Operations

Visibility into Disk and Network IO performance

Disk subsystem performance details by datastores and LUNs

Network statistics for every NIC

Quiz: what’s the difference between Total & Host?

Page 25: vCenter Operations

Quickly identify “suspect” performance metric

KPI history with timestamp to indicate root cause

Page 26: vCenter Operations

Capacity

• Estimating the number of days left
  • Score is 0-100. Non linear. 10 does not mean 10 days left.
• CapacityIQ value add:
  • What-If analysis
  • Discovery of over-allocated and under-allocated VMs
  • Reporting
  • A Capacity-centric dashboard

Page 27: vCenter Operations

Capacity: Guest OS level info

Page 28: vCenter Operations

Relative scores to prioritize any remediation efforts

Page 29: vCenter Operations

Health tree with topology mapping

Top-down visibility into health changes

Time-series charts for individual metric

Page 30: vCenter Operations

Individual performance metric details

Single view that correlates multiple metrics

Detailed list of all metrics indicating smart alerts

Page 31: vCenter Operations

Visualisation quickly pinpoints hotspots

Single click drill down for further details

Page 32: vCenter Operations

Storage

• Since all the datastores are on the same array, how do we quickly tell the relative workload generated by every one of them?

• For each of these datastores, how do we know the relative workload generated by each VM?

• For every VM, how do we know the latency is within a reasonable range?

• How do we show all the above data in “one chart”, without the need to show a lot of numbers?

Page 33: vCenter Operations
Page 34: vCenter Operations

Heatmap customisation

Page 35: vCenter Operations

vCenter Operations Standard Architecture

Four Main Services: Collector, Analytics, Web, ActiveMQ

Bundled DB: PostgreSQL DB

File-based DB (FSDB) for raw metric storage

Single Collector for vCenter. Embedded in appliance

Page 36: vCenter Operations

vCenter Operations Standard Processing

1a: vCenter Collector collects metrics, topology & change events from vCenter - Ongoing -

1b: Data stored in FSDB

2a: Analytics runs daily to determine hour-by-hour Dynamic Thresholds for next 24 hours

2b: Full FSDB is scanned by the analytic algorithms to determine, per metric, the best match for the next 24 hour period

2c: Store metric Dynamic Thresholds data in PostgreSQL DB

3: Incoming data points are tested against Dynamic Threshold bands and used to calculate Health, Workload and Capacity

4: Results provided to UI: Update “Badges”, provide Root Cause for Health scores, etc.
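The processing steps on this slide form a simple loop: collect, store, learn thresholds once a day, then test each new data point. The sketch below mirrors that data flow in memory. It is a toy model under stated assumptions: the class and method names are hypothetical, plain dictionaries stand in for the FSDB and the PostgreSQL store, and the band is just the minimum and maximum seen per hour, a placeholder for the real analytics.

```python
from collections import defaultdict


class Pipeline:
    """In-memory stand-in for the collect / learn / test flow on the slide."""

    def __init__(self):
        self.fsdb = defaultdict(list)  # step 1b: raw (hour, value) per metric
        self.thresholds = {}           # step 2c: (metric, hour) -> (low, high)

    def collect(self, metric, hour, value):
        """Steps 1a and 1b: accept a sample and keep the raw value."""
        self.fsdb[metric].append((hour, value))

    def run_daily_analytics(self):
        """Steps 2a to 2c: scan all stored data and derive hour-by-hour bands."""
        for metric, samples in self.fsdb.items():
            per_hour = defaultdict(list)
            for hour, value in samples:
                per_hour[hour].append(value)
            for hour, values in per_hour.items():
                # Placeholder band: min and max observed for that hour.
                self.thresholds[(metric, hour)] = (min(values), max(values))

    def test_point(self, metric, hour, value):
        """Step 3: True if abnormal, False if normal, None if no band yet."""
        band = self.thresholds.get((metric, hour))
        if band is None:
            return None
        low, high = band
        return not (low <= value <= high)


p = Pipeline()
for v in (40, 45, 50):
    p.collect("cpu.usage", 9, v)
p.run_daily_analytics()
print(p.test_point("cpu.usage", 9, 47), p.test_point("cpu.usage", 9, 90))
```

Keeping the learning pass separate from the per-point test reflects the slide's split between the daily analytics run and the ongoing evaluation of incoming data.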

Page 37: vCenter Operations

vCenter Operations – Enterprise Edition

Page 38: vCenter Operations

Data Agnostic Approach to Data Collection

Accepts any time series data (examples)
• Server OS
• Server App layer (e.g., IIS, Oracle, WebSphere, etc.)
• Network
• Storage
• User Experience
• Transactional
• Business Data
• Change Events

Minimal Required Fields (4)
• Object Name, Metric Name, Value, Timestamp

Data Extraction - *not* an analytic question
• No rules/templates to Write and Maintain
• vCenter Operations Analytics do all of the “Work”
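The four minimal fields named on this slide are enough to describe any time series sample. As an illustration, the sketch below models one sample and parses it from a comma-separated line. The CSV layout, the ISO 8601 timestamp format, and the names `Sample` and `parse_sample` are assumptions made for the example; the slides do not specify the ingestion format.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Sample:
    """One time series data point with the four minimal fields."""
    object_name: str
    metric_name: str
    value: float
    timestamp: datetime


def parse_sample(line: str) -> Sample:
    """Parse 'object,metric,value,ISO-8601 timestamp' (hypothetical layout)."""
    obj, metric, value, ts = (field.strip() for field in line.split(","))
    return Sample(obj, metric, float(value), datetime.fromisoformat(ts))


s = parse_sample("db01, disk.latency_ms, 12.5, 2011-05-01T09:00:00")
print(s.object_name, s.metric_name, s.value, s.timestamp.hour)
```

Nothing in the record says what kind of object or metric it is, which is what lets the same structure carry server, network, business, or change-event data.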

Page 39: vCenter Operations

Learn Normal Behavior and Identify Abnormalities

• Doesn’t assume IT data has a normal bell-shaped distribution
• Sophisticated Analytics – 8 different algorithms
• Learns your dynamic ranges of “Normal” without templates
• Learns patterns of behavior and identifies Abnormalities

BLUE LINE: Metric’s Measured Value

GRAY BAR: Learned Upper and Lower band of Dynamic Threshold - “Normal”

RED Zone: Breached Dynamic Threshold – “Abnormal”

Page 40: vCenter Operations

Dynamic Threshold Algorithms

Understand the normal behavior of any time-series metric

Eight (8) distinct algorithms each determine an upper and lower ‘band’ – results of each algorithm compete to ‘win’ to represent the ‘best choice’

vC Ops Ent - Stand Alone detects metric-level abnormalities for use in:
• Generation of Smart Alerts
• Visualizing real-time ‘Health’
• Revealing hidden relationships
• etc.

Dynamic Thresholds are the Cornerstone to all other forms of vC Ops Ent - Stand Alone Analytics

* Figure shows a performance metric (blue line), its normal behavior (gray zone), and when it’s behaving abnormally (red area)

Page 41: vCenter Operations

Proactive Alerting – Smart Alerts

• User Experience (e.g., RUM, etc.)
• Database Silo (e.g., Quest, etc.)
• App Data (e.g., Wily, etc.)
• Network Data (e.g., Ionix IPPM, etc.)
• Business Data (e.g., Finance)

Smart Alert Generation (“When”)

! SMART ALERT

Business Application

Page 42: vCenter Operations

Smart Alert Trigger

vC Ops Ent - Stand Alone tracks aggregate amount of abnormality and alerts when “explosion” is detected, or when a ‘high water mark’ is detected

Intrinsically observed that performance problems are first seen at the metric level when metrics begin to behave abnormally

• Blue shaded region represents the number of metrics for an application (represented by a set of servers/devices) that are at any given time measured abnormally

• The Red line represents an Analytically determined ever-changing level at which vC Ops Ent - Stand Alone determines a performance warning is warranted – a Smart Alert is triggered
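The trigger described here compares the number of abnormal metrics at each interval (the blue region) against a moving "noise line" (the red line). The slides do not say how that line is computed, so the sketch below substitutes a simple stand-in: the trailing mean plus `k` standard deviations of recent abnormality counts. The window size, the value of `k`, and the function name `smart_alerts` are all assumptions.

```python
from statistics import mean, pstdev


def smart_alerts(abnormal_counts, window=6, k=3.0):
    """Return the indexes where the abnormal-metric count crosses the noise line.

    abnormal_counts: number of abnormal metrics per interval for one application.
    The noise line is recomputed at every interval from the trailing window.
    """
    alerts = []
    for i in range(window, len(abnormal_counts)):
        history = abnormal_counts[i - window:i]
        noise_line = mean(history) + k * pstdev(history)
        if abnormal_counts[i] > noise_line:
            alerts.append(i)
    return alerts


counts = [2, 3, 2, 3, 2, 3, 2, 3, 20, 3]  # hypothetical per-interval counts
print(smart_alerts(counts))
```

Because the line adapts to recent history, a steady background of two or three abnormal metrics raises no alert, while a sudden jump does. One caveat of this stand-in: a perfectly flat history gives a zero standard deviation, so any increase would trigger.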

Page 43: vCenter Operations

Smart Alert Summary (“What”)

Root cause technology tier is the DB

Metric-level root cause symptoms - START HERE

Impact analysis shows the health of the application as well as the health of the tiers that comprise the application

Root-Cause ranks the tiers in order of priority and within those tiers shows the most affected metrics and resources

Page 44: vCenter Operations

Drill down to the Root Cause

Smart Alert Summary (“What”)

Early Warning SMART ALERT

Noise Line Crossed

Page 45: vCenter Operations

Drill down to the Root Cause

Smart Alert Summary (“What”)

Impact to application health

Impact to health of each technology tier

No major impact to application Key Performance Indicators (KPIs)…yet.

Page 46: vCenter Operations

Drill down to the Root Cause

See the effect of change and other external events on application health with this “mash up” view

Smart Alert Summary (“What”)

Page 47: vCenter Operations

Learning behaviour analytically

Determine performance Health

Alert only when applications need attention

Tracking disparate “Resources” from various technology silos

Page 48: vCenter Operations

Performance Visibility Across the Virtualized Datacenter

Application Health

Proactive Alert

Impact to health of each technology tier

DB is Root Cause tier: START HERE!

Symptoms

KPIs are outside normal levels but have not breached SLAs

Page 49: vCenter Operations

Performance Visibility Across the Entire Datacenter

Application Owner View - Application health view with active alerts and tier health

Page 50: vCenter Operations

Dynamic Performance Dashboards – Application Owner Views

Application health view with active alerts and tier health

Health and Alerts broken down by Tier and Objects

Heat Maps allow you to see the Health of hundreds of objects at once.

Page 51: vCenter Operations

Real-Time Performance Insight

Performance health of each individual app

Performance of KPIs with dynamic thresholds

CIO View: performance of all apps

Page 52: vCenter Operations

Dynamic Performance Dashboards – CIO Views

Performance health of each individual app

Performance of KPIs with dynamic thresholds

Performance of all apps

Launched in Context

Page 53: vCenter Operations

Dynamic Performance Dashboards – Historic Weather Maps

Response time “weather map” that can be played back over a selectable time period to show problem patterns

Page 54: vCenter Operations

Dynamic Performance Dashboards – Customizable

Simply drag and drop visualization “widgets” to create new, role-based dashboards

Set widget interactions to create powerful in-context dashboards

Page 55: vCenter Operations

Health Score

Automatically understand performance ‘Health’ for
• A single Server, Device, Resource
• Entire Tier or Silo
• Entire Application or Service
• Entire Datacenter
• Any Arbitrary Group of Resources

• 100 (Green) = Perfect Performance (i.e., entirely normal)
• 0 (Red) = Terrible Performance (extremely abnormal behavior)

Objective measure of performance based on underlying level of abnormal behavior. Adjusted based on:
• # of Abnormally behaving KPI
• # of Abnormally behaving metrics
• Consideration of lowest / highest volume of abnormalities

Page 56: vCenter Operations

One Source of Truth Across the Enterprise

Health - Objective measure of performance based on underlying level of abnormal behavior

Analytics provide a Health score for any resource or grouping

• A single Server, Device, Resource

• Entire Tier or Silo

• Entire Application or Service

• Entire Datacenter

• Any Arbitrary Group of Resources

Dynamic Performance Dashboards – Health Scores

“How is our world doing?”

Page 57: vCenter Operations

vC Ops Ent - Stand Alone Architecture

Four installed ‘Services’: Collector, Analytics, Web, ActiveMQ

Architecture includes MS SQL or Oracle DB, plus File-based DB (FSDB) for raw metric storage

Collectors can be distributed for scalability, or to span DCs & firewalls

Analytics runs daily to determine hour-by-hour DTs for next 24 hours

Incoming data points are tested against DT bands, metric-level anomalies are tracked for Alerting and Dashboarding

“Northbound” integration with products like Ionix SMARTS SAM

Page 58: vCenter Operations

Under the Hood of vC Ops Ent

Page 59: vCenter Operations

Thank You