performance forensics

Performance Forensics Uncovering the Mysteries of Performance and Scalability Incidents through Forensic Engineering Stephen Feldman Senior Director Performance Engineering and Architecture [email protected]


Post on 01-Jan-2016



TRANSCRIPT

Page 1: Performance Forensics

Performance Forensics
Uncovering the Mysteries of Performance and Scalability Incidents through Forensic Engineering

Stephen Feldman, Senior Director, Performance Engineering and Architecture

[email protected]

Page 2: Performance Forensics

Welcome to BbWorld’08

• Finishing my 5th year at Blackboard.
• Brought in to build a Performance Engineering Practice.
• Team of 15 including myself
– Half of the team are Performance Test Engineers
– Half of the team are Software Developers
• Responsible for the performance and scalability of the BbLearn architecture.

Page 3: Performance Forensics

Session Housekeeping

• Three hours of fun and excitement.
• Feel free to fire up your laptops.
• We will take 1 break at the half-way point
– Take a break whenever you need to
• Questions are welcome at any time.

Page 4: Performance Forensics

Our Session Schedule

• Part One: Introduction to Performance Forensics
– 1:00 to 2:25pm
• Break
– 2:25 to 2:35pm
• Part Two: Advanced Performance Forensics
– 2:35 to 4:00pm

Page 5: Performance Forensics

Session Goals

The goals of today’s session are…
• Introduce you to the science of performance forensics.
• Present a methodology for performing forensics.
• Discuss techniques for arriving at root cause analysis.
• Familiarize the audience with tools that can be used to assist the forensics process.

Page 6: Performance Forensics

Session Learning Objectives

At the end of the session you should be able to…
• Write your own problem statements.
• Perform the process of evidence collection and interviewing.
• Apply techniques for using data and analysis to avoid diagnosis bias and value attribution.
• Perform root cause analysis as part of the performance forensics process.
• Begin using different tools for capturing key performance data.

Page 7: Performance Forensics

Part One: Introduction to Performance Forensics

What is forensic engineering?

Page 8: Performance Forensics

A Practical Definition

• The term forensics means “The science and practice of collection, analysis, and presentation of information relating to a crime in a manner suitable for use in a court of law.”
– This definition is in the context of a crime.
• Forensic engineering is the application of accepted engineering practices and principles for discussion, debate, argumentation, or legal purposes.

Page 9: Performance Forensics

Introduction to Performance Forensics

Page 10: Performance Forensics

Definition of Performance Forensics

• The practice of collecting evidence, performing interviews and modeling for the purpose of root cause analysis of a performance or scalability problem.
• Performance problems can be classified in two main categories:
– Response Time Latency
– Queuing Latency

Page 11: Performance Forensics

Cognition of Response Times

Page 12: Performance Forensics

Queuing Model: Visual of a Bottleneck

Page 13: Performance Forensics

Performance Forensics Methodology

Page 14: Performance Forensics

Performance Forensics Methodology

[Diagram: methodology flow]
• Steps: Identify the Problem; Collecting Evidence; Interviewing; Sampling and Simulating; Modeling and Visualizing; Method-R; Data Analysis; Root Cause
• Vertical label: Identify the Most Important Operations that Affect Your Business
• Turn the Problem Statement into a Diagnosis to Get to Root Cause: Develop a Problem Statement; Formulate a Hypothesis; Establish a Diagnosis

Page 15: Performance Forensics

Identify the Problem

Page 16: Performance Forensics

Identifying the Problem

• Problems are not always easily identifiable.
• When the problem is apparent, declare a simple problem statement so that the investigation can commence.
– Call out symptoms; do not diagnose.
• When the problem is not clear, the appropriate course of action is to narrow down the possibilities of what the problem could be.
• Be willing to leave the problem statement open ended until a more fully formulated problem statement can be attained.

Page 17: Performance Forensics

Problem Statements

• Example Weak Problem Statement:
– Sally Simpleton is experiencing response time latency in the Grade Center.
• Why is the statement weak?
– Who is Sally Simpleton?
– What defines response time latency?
– What is she doing in the Grade Center?
– When does it happen?
• Can it be reproduced?

Page 18: Performance Forensics

Strengthen the Problem Statement

• Sand College is reporting response time latency of 90 to 120 seconds when course administrators edit Grade Center cells.
– The problem is reproducible when using Sally Simpleton’s login credentials and accessing her course section (Introduction to Software Performance Engineering).
– The problem has been reproduced at all times of day across different course sections and on different systems.
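
The difference between the weak and the strengthened statement can be sketched as a checklist. This is only an illustration: the `ProblemStatement` class and its field names are hypothetical, chosen to mirror the questions asked of the weak statement (who, what, when, reproducibility).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProblemStatement:
    """Hypothetical checklist for a problem statement; field names are illustrative."""
    who: str = ""           # who reports or experiences the problem
    symptom: str = ""       # the measured symptom, not a diagnosis
    action: str = ""        # what the user is doing when it occurs
    when: str = ""          # when it happens
    reproduction: str = ""  # how the problem can be reproduced

    def missing(self) -> List[str]:
        """Return the names of the fields that are still empty."""
        return [name for name, value in vars(self).items() if not value.strip()]


weak = ProblemStatement(who="Sally Simpleton", symptom="response time latency")
strong = ProblemStatement(
    who="Sand College course administrators",
    symptom="response time latency of 90 to 120 seconds",
    action="editing Grade Center cells",
    when="all times of day, across course sections and systems",
    reproduction="log in as Sally Simpleton and open her course section",
)
print(weak.missing())    # fields the weak statement leaves unanswered
print(strong.missing())  # empty list: every question has an answer
```

An empty result does not make a statement correct; it only shows that each question has been answered with something that can be checked.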

Page 19: Performance Forensics

Evidence

Page 20: Performance Forensics

Evidence

• Multiple types of gathered evidence are used to solve performance problems.
– Log artifacts
– Monitoring/Measurement tools
– Instrumentation/Sensors
• Interactive evidence gathering through interviews.
• Evidentiary support through discrete simulation.
• Improving future evidentiary capabilities by improving the Performance Maturity Model.

Page 21: Performance Forensics

Log Artifacts

• Understand what logs are in place and where they can be found.
• Know what they are used for and whether they provide the right information.
• Keep them slim and usable.
• Learn how to associate and correlate
– Associate multiple log artifacts
– Correlate events to the problem statement
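
Associating two log artifacts usually comes down to lining them up by time. The sketch below is one minimal way to do that, assuming both logs begin each line with a `YYYY-MM-DD HH:MM:SS` timestamp. The log lines, the 30-second window, and the function names are invented for illustration.

```python
from datetime import datetime, timedelta


def parse(line):
    """Split 'YYYY-MM-DD HH:MM:SS message' into (timestamp, message)."""
    stamp, message = line[:19], line[20:]
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), message


def events_near(lines, moment, window_seconds=30):
    """Return the messages logged within window_seconds of the given moment."""
    window = timedelta(seconds=window_seconds)
    return [msg for ts, msg in map(parse, lines) if abs(ts - moment) <= window]


# Hypothetical log artifacts; formats and contents are assumptions
app_log = [
    "2008-07-15 13:04:10 GET /gradebook 200 1.2s",
    "2008-07-15 13:05:02 GET /gradebook 200 94.7s",
]
db_log = [
    "2008-07-15 13:01:00 checkpoint complete",
    "2008-07-15 13:04:55 lock wait on table GRADES",
]

slow_time, slow_message = parse(app_log[1])  # the event named in the problem statement
related = events_near(db_log, slow_time)
print(related)
```

Entries that fall inside the window are candidates for the hypothesis, not proof of cause.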

Page 22: Performance Forensics

Example Log Visualization

Page 23: Performance Forensics

Example Log Visualization

Page 24: Performance Forensics

Putting Collectors/Sensors in Place

• When should this happen?
– When a problem statement cannot be developed from the data you do have (evidence or interviews) and more data needs to be collected.
• How should you go about this?
– Want to minimize disruption to the production environment.
– Adaptive collection: Less Intensive to More Intensive over time.

[Diagram: Basic Sampling; Continuous Collection; Profiling]

Page 25: Performance Forensics

Monitoring and Measurement

• Third-party components, whether commercial or open source, deployed to measure responsiveness and resource utilization.
• Excellent tools for trending and correlation.
• Specialization of tools to solve different types of problems.
• Used in forensics for correlating resource utilization with event occurrences.

Page 26: Performance Forensics

Ex 1: Thin-Slicing Monitoring Visualizations

Page 27: Performance Forensics

Ex 2: Thin-Slicing Monitoring Visualizations

Page 28: Performance Forensics

Ex 3: Thin-Slicing Monitoring Visualizations

Page 29: Performance Forensics

Ex 4: Thin-Slicing Monitoring Visualizations

Page 30: Performance Forensics

Interviewing

• Techniques
– Lassie Question
– Time Association
– User experienced
– Locality
– Component/Feature Specific
• Gathering non-discrete clues
• Making use of Method-R
• Avoiding diagnosis bias
• Eliminating value attribution
• Can a pattern be identified?

Page 31: Performance Forensics

Diagnosis Bias

• It is human nature to label people, ideas or things based on our initial opinions of them.

• Not necessarily scientific, but rather a combination of gut feelings, irrational judgment or failure to process enough conclusive data.

• We often diagnose before we can get to root cause analysis based on a hunch or perception.

Page 32: Performance Forensics

Value Attribution

• Humans have a tendency to imbue someone or something with certain qualities based on its perceived value rather than objective data.

• Example 1: The problem can’t be my SAN, I spent $250,000 on it.

• Example 2: It can’t be the network, my engineers are the best in the field. They won’t allow a network problem to happen.

Page 33: Performance Forensics

Discrete Simulation as Evidentiary Support

• Performance testing is another technique for gathering evidence.

• Provides the opportunity to increase logging and watch for events or occurrences not seen originally.

• Also provides the opportunity to reproduce conditions that cause the performance issue.

Page 34: Performance Forensics

Modeling and Visualizing

Page 35: Performance Forensics

Modeling and Visualizing

Page 36: Performance Forensics

An Abstract Example

• Role of temperature in O-ring failures was difficult to determine by focusing on cases. Attention was focused on two key cases with O-ring failures:
– SRM15 (cold launch)
– SR22 (warm launch)

Page 37: Performance Forensics

Missed Opportunities for Visualizing Data

Page 38: Performance Forensics

Missed Opportunities for Visualizing Data

Page 39: Performance Forensics

Reshaping the Same Data

Page 40: Performance Forensics

Hypothesis versus Diagnosis

• Hypothesis: A prediction or educated guess about a problem prior to proving scientifically or mathematically.

• Diagnosis: A scientific, empirical or measured conclusion about a problem.
– Not necessarily the correct answer, but enough data has been gathered to propose a diagnosis.
• A problem statement needs to be in place for both to exist.
• Both need supporting data.

Page 41: Performance Forensics

Quick Comments About Method-R

• Method-R is a preferred methodology for problem statement development and problem diagnosis.

• While it was created for Oracle performance analysis, it can be applied to all aspects of software performance forensics.

• It focuses on identifying the user actions that matter most to the business and improving the performance of those actions.

Page 42: Performance Forensics

Correlation

Page 43: Performance Forensics

What is Correlation?

• Correlation is a measure of the statistical relationship between two comparable sets of data points.
– Time associations are typically made.
– Correlate to resource demand
– Correlate to event or occurrence
• Correlation is primarily a part of hypothesis and diagnosis.
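
As a worked illustration of correlating to resource demand, the sketch below computes the Pearson correlation coefficient between two time-aligned series. The choice of Pearson's coefficient, the `pearson` helper, and the sample numbers are assumptions for this example, not something prescribed by the session.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    covariance = sum(a * b for a, b in zip(dx, dy))
    spread = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if spread == 0:
        raise ValueError("a constant series has no defined correlation")
    return covariance / spread


# Hypothetical samples taken at the same one-minute intervals
requests_per_min = [120, 180, 260, 310, 400, 520]
cpu_percent = [22, 30, 41, 47, 61, 78]

r = pearson(requests_per_min, cpu_percent)
print(f"correlation between request rate and CPU: {r:.3f}")
```

A coefficient near 1 or -1 supports a hypothesis about a relationship; it does not by itself establish a diagnosis.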

Page 44: Performance Forensics

Examples of Correlation

Page 45: Performance Forensics

Examples of Correlation

Page 46: Performance Forensics

Examples of Correlation

Page 47: Performance Forensics

Examples of Correlation

Page 48: Performance Forensics

Getting to Root Cause Analysis

Page 49: Performance Forensics

Performance Forensics Methodology

[Diagram: methodology flow]
• Steps: Identify the Problem; Collecting Evidence; Interviewing; Sampling and Simulating; Modeling and Visualizing; Method-R; Data Analysis; Root Cause
• Vertical label: Identify the Most Important Operations that Affect Your Business
• Turn the Problem Statement into a Diagnosis to Get to Root Cause: Develop a Problem Statement; Formulate a Hypothesis; Establish a Diagnosis

Page 50: Performance Forensics

Getting to Root Cause Analysis

• Devising a strong problem statement
– Foundation steps of Method-R
• Knowing where to collect evidence
• Formulating a data-driven hypothesis
• Appropriate use of correlation, modeling and visualizing
• Proving the hypothesis out (test-driven approach)
• Establishing a diagnosis
– Avoid diagnosis bias and value attribution
• Treating the symptoms
– A diagnosis is not always black and white

Page 51: Performance Forensics

A Case for a Performance Maturity Model

[Diagram: Performance Maturity Model]
• Stages: Reactive and Exploratory; Monitor and Instrument; Performance Optimizing; Business Optimizing; Process Optimized
• Levels: Level 1 through Level 5
• Emphasis labels: Hardware; Application; Eco-System; Process; People

Page 52: Performance Forensics

Part Two: Advanced Performance Forensics

Applying Performance Forensics at Home

Page 53: Performance Forensics

Resources vs. Interfaces

• One of the most critical data points to collect
• Interfaces are critical for understanding throughput and queuing models.
– Queuing is another cause of latency
– Also a cause of time-outs
• Resources are critical for understanding the cost of performing a transaction.
– Core Resources: CPU, Memory and I/O
• Response Time = Service Time + Queue Time
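
To see how queue time can come to dominate response time, the identity above can be worked through with a simple queuing model. The sketch assumes an M/M/1 queue (a single server with random arrivals and exponential service times), which is a textbook simplification and not a model taken from the session. The function name and the numbers are illustrative.

```python
def mm1_times(service_time, arrival_rate):
    """Return (queue_time, response_time) in seconds for an M/M/1 queue.

    Assumption: Poisson arrivals and exponential service at one server.
    Real systems only approximate this.
    """
    utilization = arrival_rate * service_time
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    queue_time = service_time * utilization / (1 - utilization)
    response_time = service_time + queue_time  # Response = Service + Queue
    return queue_time, response_time


service_time = 0.1  # hypothetical: 100 ms of work per request
for rate in (2, 5, 8, 9):  # requests per second
    queue_time, response_time = mm1_times(service_time, rate)
    print(f"{rate} req/s: queue {queue_time:.3f}s, response {response_time:.3f}s")
```

Under these assumptions the service time never changes, yet response time grows sharply as utilization approaches 100%, which is why interface and queue data are worth collecting alongside resource data.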

Page 54: Performance Forensics

Performance Forensics Tools

Page 55: Performance Forensics

Categories of Tools

• HTTP and User Experience
• System Collectors: Not Going to Cover (Implied)
• JVM Instrumentation
• Java Profilers
• Database Instrumentation
– Session and Wait Event
– Profilers

Page 56: Performance Forensics

HTTP and User Experience

Page 57: Performance Forensics

Fiddler2

• Fiddler2 measures end-to-end client responsiveness of a web request.
• Captures requests in order to present HTTP codes, size of objects, sequence of loading, time to process request, and performance by bandwidth speed.
– Rough estimation of User Experience based on locality.
• Inspects every detail of the HTTP request
– Detailed session inspection
– Breakdown of HTTP transformation
• Other Tools in Category: YSlow/Firebug, Charles Proxy, LiveHTTPHeaders and IEInspector

Page 58: Performance Forensics
Page 59: Performance Forensics

Coradiant Truesight

• Commercial tool used for passive user experience monitoring.
• Captures page, object and session level data.
• Capable of defining Service Level Thresholds and Automatic Incident Management.
• Used to trace back session as if you were watching over the user’s shoulder.
• Exceptional tool for trend analysis.
• Primarily used in forensics as evidence for analysis.
• Other Tools in the Category: Quest User Experience and Citrix EdgeSight

Page 60: Performance Forensics

Coradiant Truesight

Page 61: Performance Forensics

Coradiant Truesight

Page 62: Performance Forensics

Log Analyzers

• Both commercial and open source tools are available to parse and analyze HTTP access logs.
• They provide trend data, client statistical data, and HTTP summary information.
• Recommend using this data to study request and bandwidth trends for correlation with resource utilization graphs.
– The volume of data is large.
– Recommend working within small time slices.
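
Working within small time slices can be as simple as bucketing access-log entries by minute before comparing them with resource graphs. The sketch assumes the log uses the Common Log Format; the regular expression, the `slice_traffic` helper, and the sample lines are illustrative assumptions.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches the timestamp, request, status and size fields of a Common Log Format line
LINE = re.compile(
    r'\[(?P<ts>[^\]]+)\] "[^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def slice_traffic(lines, slice_seconds=60):
    """Return {slice_start_epoch: (request_count, bytes_sent)} per time slice."""
    buckets = defaultdict(lambda: [0, 0])
    for line in lines:
        match = LINE.search(line)
        if match is None:
            continue  # skip lines that do not look like access-log entries
        stamp = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        start = int(stamp.timestamp()) // slice_seconds * slice_seconds
        size = 0 if match.group("size") == "-" else int(match.group("size"))
        buckets[start][0] += 1
        buckets[start][1] += size
    return {start: tuple(totals) for start, totals in buckets.items()}


# Hypothetical access-log lines
sample_lines = [
    '10.0.0.1 - - [15/Jul/2008:13:04:10 -0400] "GET /a HTTP/1.1" 200 5120',
    '10.0.0.2 - - [15/Jul/2008:13:04:50 -0400] "GET /b HTTP/1.1" 200 1024',
    '10.0.0.1 - - [15/Jul/2008:13:05:05 -0400] "GET /c HTTP/1.1" 304 -',
]
traffic = slice_traffic(sample_lines)
for start in sorted(traffic):
    print(start, traffic[start])
```

The per-slice request and byte counts can then be lined up against CPU, memory or I/O graphs covering the same interval.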

Page 63: Performance Forensics

JVM Instrumentation

Page 64: Performance Forensics

-verbose:gc and -Xloggc
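
Garbage collection output of this kind is easier to reason about once it is summarized. The sketch below assumes the classic one-line format printed by older Sun JVMs (for example `[GC 325407K->83000K(776768K), 0.2300771 secs]`); other JVM versions and flags produce different layouts, so the pattern would need adjusting. The `summarize_gc` helper is hypothetical.

```python
import re

# Assumed classic format: [GC before->after(heap), pause secs]
GC_LINE = re.compile(
    r"\[(?P<kind>Full GC|GC) (?P<before>\d+)K->(?P<after>\d+)K"
    r"\((?P<heap>\d+)K\), (?P<secs>\d+\.\d+) secs\]"
)


def summarize_gc(lines):
    """Return per-collection-type counts, total pause time and memory freed."""
    summary = {}
    for line in lines:
        match = GC_LINE.search(line)
        if match is None:
            continue  # ignore lines in any other format
        stats = summary.setdefault(
            match.group("kind"), {"count": 0, "pause_secs": 0.0, "freed_kb": 0}
        )
        stats["count"] += 1
        stats["pause_secs"] += float(match.group("secs"))
        stats["freed_kb"] += int(match.group("before")) - int(match.group("after"))
    return summary


sample_log = [
    "[GC 325407K->83000K(776768K), 0.2300771 secs]",
    "[GC 325816K->83372K(776768K), 0.2454258 secs]",
    "[Full GC 267628K->83769K(776768K), 1.8479984 secs]",
]
gc_summary = summarize_gc(sample_log)
for kind, stats in gc_summary.items():
    print(kind, stats)
```

Pause totals per collection type give a first indication of whether time spent in collection lines up with the latency in the problem statement.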

Page 65: Performance Forensics

JSTAT

Page 66: Performance Forensics

JHAT, JMAP and SAP Memory Analyzer

• jhat: The Java Heap Analysis Tool takes a heap dump and parses the data into useful and human-digestible information about what's in the JVM's memory.
• jmap: Java Memory Map is a JVM tool that provides information about what is in the heap at a given time.
– Provides text and OQL views into jhat data
• SAP Memory Analyzer will visualize the jhat output

Page 67: Performance Forensics
Page 68: Performance Forensics
Page 69: Performance Forensics

IBM Pattern Modeling Tool for Java GC

Page 70: Performance Forensics

Database Wait Event Tools

Page 71: Performance Forensics

The Importance of Wait Events

• Rise of Session Level Forensics
– The underlying theme with all of these tools is that “Session” is more important than “System”
• Wait event tuning is used to account for latency
– Exists in SQL Server (Waits and Queues) and Oracle (10046)
• Waits are statistical explanations of latency
• Each individual wait event might be deceiving, but looking at both aggregates and outliers can explain why a performance problem exists.
• When sampling directly, you usually have only about 1 hour to act on the data.
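
Looking at aggregates and outliers together can be sketched in a few lines. The example assumes wait samples have already been exported as `(event_name, wait_ms)` pairs; the `summarize_waits` helper, the sample values, and the rule of flagging waits above ten times the median are all illustrative assumptions.

```python
from collections import defaultdict
from statistics import median


def summarize_waits(samples, outlier_factor=10):
    """Aggregate (event, wait_ms) samples and flag waits far above the median."""
    by_event = defaultdict(list)
    for event, wait_ms in samples:
        by_event[event].append(wait_ms)
    summary = {}
    for event, waits in by_event.items():
        typical = median(waits)
        summary[event] = {
            "count": len(waits),
            "total_ms": sum(waits),
            "outliers": [w for w in waits if typical > 0 and w > outlier_factor * typical],
        }
    return summary


# Hypothetical samples for one session
samples = [
    ("db file sequential read", 4), ("db file sequential read", 5),
    ("db file sequential read", 6), ("db file sequential read", 5),
    ("db file sequential read", 900),
    ("log file sync", 2), ("log file sync", 3), ("log file sync", 2),
]
wait_summary = summarize_waits(samples)
for event, stats in wait_summary.items():
    print(event, stats)
```

In this made-up data the aggregate for the first event is dominated by a single wait, which is exactly the case where an average alone would mislead.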

Page 72: Performance Forensics

The Importance of Wait Events

Page 73: Performance Forensics

ASH

• ASH: Active Session History
– Samples session activity in the system every second.
– 1 hour of history in memory for immediate access at your fingertips
• ASH in Memory
– Collects active session data only
– History = v$session_wait + v$session + extras
• Circular Buffer: 1M to 128M (~2% of SGA)
• Flushed every hour to disk or when buffer 2/3 full (it protects itself so you can relax)
• Tools to Consider: SessPack and SessSnaper
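
The idea of sampling only active sessions into a fixed-size circular buffer can be shown with a toy model. This is not how Oracle implements ASH; the `ActiveSessionBuffer` class, the session dictionaries, and the tiny capacity are assumptions made to keep the example readable.

```python
from collections import deque


class ActiveSessionBuffer:
    """Toy circular buffer of active-session samples (not Oracle's implementation)."""

    def __init__(self, capacity):
        # A deque with maxlen discards the oldest entry once it is full
        self.samples = deque(maxlen=capacity)

    def sample(self, second, sessions):
        """Record one row per active session; inactive sessions are skipped."""
        for session in sessions:
            if session["state"] == "ACTIVE":
                self.samples.append((second, session["sid"], session["event"]))


history = ActiveSessionBuffer(capacity=3)
for second in range(5):  # one sampling pass per second
    history.sample(second, [
        {"sid": 1, "state": "ACTIVE", "event": "db file sequential read"},
        {"sid": 2, "state": "INACTIVE", "event": "idle"},
    ])
print(list(history.samples))
```

Because the buffer overwrites its oldest rows, history that has not been flushed or examined is eventually lost, which is why the window for acting on sampled data is limited.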

Page 74: Performance Forensics

SQL Server Performance Dashboard

• Feature of SQL Server 2005 SP2
• Template reports that take advantage of DMVs
• Provides views into wait events
– Doesn’t link events to SQL IDs in the report
– Provides aggregate views of wait events
• Complementary Tools: SQL Server Health and History Tool and Quest Spotlight for SQL Server

Page 75: Performance Forensics
Page 76: Performance Forensics

Database Profilers and Utilities

Page 77: Performance Forensics

RML and Profiler

• The RML utilities process SQL Server trace files and produce reports showing how SQL Server is performing.
– Which application, database or login is using the most resources, and which queries are responsible for that.
– Whether there were any plan changes for a batch during the time when the trace was captured and how each of those plans performed.
– Which queries are running slower in today's data compared to a previous set of data.
• Profiler captures statements, query counts/statistics, and wait events
– Can capture and correlate profile data to Perfmon data
• Heavy overhead with both
• Other Tools to Consider: Quest Performance Analysis for SQL Server

Page 78: Performance Forensics

Oracle OEM and 10046

• Oracle finally delivered with OEM and its web-based interface.
– Performance dashboard provides a great historical and present overview
– Access to ADDM and ASH simplifies the job of the DBA
– SQL History
• Problems
– Licensing is somewhat cost prohibitive
– Still doesn’t provide wait events
• For 10046, you still need to consider profiling on your own and using a profile reader like Hotsos P4.
– Difficult to trace and capture sessions

Page 79: Performance Forensics
Page 80: Performance Forensics

References and Additional Places to Go

Page 81: Performance Forensics

Want More?

• To view my resources and references for this presentation, visit www.scholar.com
• Simply click “Advanced Search” and search by sfeldman and tag: ‘bbworld08’ or ‘forensics’

Page 82: Performance Forensics

Final Questions?