Continuous Performance Monitoring of a Distributed Application [CON4730]
JavaOne 2013. Copyright © 2013, Oracle and/or its affiliates. All rights reserved.
Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
Continuous Performance Monitoring of a Distributed Application
Ashish Srivastava, Principal Member of Technical Staff
Diana Yuryeva, Senior Member of Technical Staff
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
Session Goal
Arrive at a solution for extreme performance monitoring
§ Components:
– Design patterns
– Tools
§ Qualities:
– Continuous
– Light-weight
– Recordable
Session Agenda
§ Use Case
§ Software Patterns
§ Tools
§ Pitfalls and Advice
§ Q&A
About Us
§ Oracle Billing and Revenue Management Elastic Charging Engine
– 100% real-time charging application
– Java
– Distributed grid
  – Oracle Coherence
  – Oracle NoSQL
– Focus on extreme performance
Operating Conditions
§ Low latency expectations
§ Heavy system load
§ Distributed environment
§ Multi-level software stack
Monitoring Requirements
Functional
§ Detailed insight about performance
– Latency
– Throughput
§ View over time
§ Reporting
§ Bottleneck detection
§ View of the system as a cohesive unit
Monitoring Requirements
Non-Functional
§ Minimal impact on processing
§ Ease of use
§ Separation of concerns
Approach
How do I address these requirements?
§ Off-the-shelf software is not sufficient
§ Custom development is needed
– Incorporate monitoring into the system
– Collect, analyze and present metrics
Collecting Metrics
Problem overview
§ Goal
– Incorporate metrics collection into general processing
§ Approach
– Enhance domain model with monitoring-related data structures
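Collecting Metrics
Model of sample system
[Figure: Network Mediation and the ECE Client exchange requests and responses with processing nodes Node1, Node2 and Node3; hops are labeled A, B, C with return hops A', B', C'. Server-side processing steps: debatch the requests, data lookups, apply tariff, save session, prepare response, ...]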
Collecting Metrics
Solution
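[Figure: an Envelope travels from the Client Node to the Processing Node and back. It carries a Routing Context, the Payload and a Tracking Context; the Tracking Context holds a Chronicler that records TimePoints, which a Stat Reporter harvests.]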
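The envelope idea can be illustrated with a small Java sketch. It is an illustration under assumptions, not the ECE implementation: the names Envelope, Chronicler and TimePoint come from the diagram, while the TrackingContext shape, the String routing key standing in for the routing context, the mark/elapsedNanos methods and the Serializable marker (to show that the envelope crosses JVM boundaries) are hypothetical. Time points use System.nanoTime(), which is only comparable within one JVM, so end-to-end latency is taken between two points marked on the same node, for example when the client sends the request and when it receives the response.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// A labeled instant on the local monotonic clock (hypothetical shape).
final class TimePoint implements Serializable {
    final String label;
    final long nanos;

    TimePoint(String label, long nanos) {
        this.label = label;
        this.nanos = nanos;
    }
}

// Travels inside the envelope and accumulates time points along the request path.
final class Chronicler implements Serializable {
    private final List<TimePoint> points = new ArrayList<>();

    void mark(String label) {
        points.add(new TimePoint(label, System.nanoTime()));
    }

    // Elapsed time between the first points with the given labels.
    // System.nanoTime() values are only comparable within one JVM,
    // so both points must have been marked on the same node.
    long elapsedNanos(String from, String to) {
        return find(to).nanos - find(from).nanos;
    }

    private TimePoint find(String label) {
        for (TimePoint p : points) {
            if (p.label.equals(label)) {
                return p;
            }
        }
        throw new IllegalArgumentException("no time point labeled " + label);
    }
}

// Monitoring data kept separate from routing and business data.
final class TrackingContext implements Serializable {
    final Chronicler chronicler = new Chronicler();
}

// Envelope = routing context + payload + tracking context.
final class Envelope<P extends Serializable> implements Serializable {
    final String routingKey;  // hypothetical stand-in for the routing context
    final P payload;
    final TrackingContext tracking = new TrackingContext();

    Envelope(String routingKey, P payload) {
        this.routingKey = routingKey;
        this.payload = payload;
    }
}
```

For example, the client could call env.tracking.chronicler.mark("REQUEST_SENT") before dispatching and mark("RESPONSE_RECEIVED") when the reply arrives, then pass elapsedNanos(...) on for reporting; the label strings are illustrative. Keeping the tracking data in its own context is what gives the separation of concerns listed in the non-functional requirements.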
Collecting Metrics
Result
##### Elapsed time = 3600 seconds
##### Avg throughput = 20000 ops/sec
##### Avg latency = 50 ms
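A Stat Reporter that produces summary lines like these could be as simple as the following sketch. The class name follows the diagram; the harvest/report methods and the use of LongAdder counters (cheap under concurrent updates from many processing threads) are assumptions. Both averages use integer division, so they are truncated.

```java
import java.util.concurrent.atomic.LongAdder;

// Aggregates harvested end-to-end latencies into summary lines.
// The name follows the slide's "Stat Reporter"; the API is hypothetical.
final class StatReporter {
    private final LongAdder count = new LongAdder();
    private final LongAdder totalLatencyNanos = new LongAdder();

    // Called once per completed request; LongAdder keeps contention low
    // when many threads report at the same time.
    void harvest(long latencyNanos) {
        count.increment();
        totalLatencyNanos.add(latencyNanos);
    }

    // Formats the summary for a run of the given length (integer division truncates).
    String report(long elapsedSeconds) {
        long ops = count.sum();
        long throughput = elapsedSeconds == 0 ? 0 : ops / elapsedSeconds;
        long avgLatencyMs = ops == 0 ? 0 : totalLatencyNanos.sum() / ops / 1_000_000;
        return "##### Elapsed time = " + elapsedSeconds + " seconds\n"
             + "##### Avg throughput = " + throughput + " ops/sec\n"
             + "##### Avg latency = " + avgLatencyMs + " ms";
    }
}
```

A single run-wide average like this is exactly what the later slides refine, since it cannot show how throughput and latency vary over time.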
Granular Reporting
Problem overview
§ Goal
– I need more granular reporting of performance over time
§ Approach
– Enhance reporting of collected metrics
Granular Reporting
Solution – data structure
[Figure: a moving reporting window, with Chroniclers added at one end and removed at the other.]
§ A moving reporting window
§ 100% reporting
§ Sampled reporting
§ Stats exposed over JMX
§ Fixed data set for a window
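One way to get a moving window with a fixed data set is a ring buffer of per-second slots, as in this minimal Java sketch. The one-second slot size, the microsecond latency unit, the class names and the MXBean-style interface are assumptions. With 100% reporting every request calls record; sampled reporting would call it for only a fraction of requests. To expose the window over JMX as the slide describes, the interface would have to be public (in its own source file) and the instance registered with ManagementFactory.getPlatformMBeanServer().registerMBean(window, new ObjectName(...)); that step is left out here.

```java
import java.util.Arrays;

// Hypothetical MXBean-style management interface: once public and registered,
// each getter would appear as a read-only JMX attribute.
interface ReportingWindowMXBean {
    long getCount();
    long getMinLatencyMicros();
    long getMaxLatencyMicros();
    double getAvgLatencyMicros();
    double getThroughputPerSecond();
}

// Moving reporting window backed by a ring buffer of per-second slots.
// The data set is fixed: a new second reuses the slot of an expired one.
// Assumes non-negative epoch seconds.
final class ReportingWindow implements ReportingWindowMXBean {
    private final int size;
    private final long[] second, count, sum, min, max;
    private long newest = -1;  // most recent second recorded so far

    ReportingWindow(int seconds) {
        size = seconds;
        second = new long[size];
        count = new long[size];
        sum = new long[size];
        min = new long[size];
        max = new long[size];
        Arrays.fill(second, -1);  // -1 marks a slot that was never used
    }

    synchronized void record(long epochSecond, long latencyMicros) {
        if (epochSecond <= newest - size) {
            return;  // older than the window: drop it
        }
        newest = Math.max(newest, epochSecond);
        int i = (int) Math.floorMod(epochSecond, (long) size);
        if (second[i] != epochSecond) {  // slot holds an expired second: reuse it
            second[i] = epochSecond;
            count[i] = 0;
            sum[i] = 0;
            min[i] = Long.MAX_VALUE;
            max[i] = Long.MIN_VALUE;
        }
        count[i]++;
        sum[i] += latencyMicros;
        min[i] = Math.min(min[i], latencyMicros);
        max[i] = Math.max(max[i], latencyMicros);
    }

    // A slot counts only if it was used and its second is still inside the window.
    private boolean live(int i) {
        return second[i] >= 0 && second[i] > newest - size;
    }

    private long total(long[] values) {
        long t = 0;
        for (int i = 0; i < size; i++) {
            if (live(i)) t += values[i];
        }
        return t;
    }

    @Override public synchronized long getCount() { return total(count); }

    @Override public synchronized double getAvgLatencyMicros() {
        long c = total(count);
        return c == 0 ? 0.0 : (double) total(sum) / c;
    }

    @Override public synchronized long getMinLatencyMicros() {
        long m = Long.MAX_VALUE;
        for (int i = 0; i < size; i++) if (live(i)) m = Math.min(m, min[i]);
        return m == Long.MAX_VALUE ? 0 : m;
    }

    @Override public synchronized long getMaxLatencyMicros() {
        long m = Long.MIN_VALUE;
        for (int i = 0; i < size; i++) if (live(i)) m = Math.max(m, max[i]);
        return m == Long.MIN_VALUE ? 0 : m;
    }

    // Averaged over the full window length, so it reads low until the window fills.
    @Override public synchronized double getThroughputPerSecond() {
        return (double) total(count) / size;
    }
}
```

Because an expired slot is reset lazily when its second comes round again, and reads skip slots older than the window, memory stays constant regardless of load.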
Granular Reporting
Solution – class diagram
Granular Reporting
Result
§ I can see min/max/avg latency and throughput over time
§ My throughput reporting is quite good: I can see whether I had stable or erratic throughput
Latency Percentile Report
Problem overview
§ Goal
– Latencies are still not detailed enough. I need to know more than the average/min/max latencies
– Need to guarantee that 99.999% of the requests take less than 55ms
§ Approach
– Introduce range bucketing to count latencies
Latency Percentile Report
Solution
§ Pre-defined latency buckets, from which percentiles are computed
§ Data set does not grow: each bucket's counter is updated in place
§ Multiple percentile breakdowns
– End-to-end
– Server-side processing
– Per-batch reporting
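The range-bucketing idea can be sketched as a fixed array of counters, one per latency bucket. This sketch assumes 1 ms buckets, an overflow bucket at a configurable maximum, and that "Total Count" in the report means the cumulative number of requests at or below the reported bucket; the slides do not state the bucket layout. The report method prints lines in the same shape as the printed report that follows.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Fixed-size latency histogram: one counter per 1 ms bucket plus an overflow
// bucket, so memory stays constant however many requests are recorded.
// The 1 ms width and the maximum are assumptions, not values from the slides.
final class LatencyHistogram {
    private final AtomicLongArray buckets;  // buckets[i] counts latencies in [i, i+1) ms

    LatencyHistogram(int maxMillis) {
        buckets = new AtomicLongArray(maxMillis + 1);  // last bucket: >= maxMillis
    }

    void record(long latencyMillis) {
        int i = (int) Math.min(Math.max(latencyMillis, 0L), buckets.length() - 1);
        buckets.incrementAndGet(i);
    }

    // Returns {bucketMillis, cumulativeCount} for the lowest bucket whose
    // cumulative count reaches p percent of all recorded requests.
    // Reads are not an atomic snapshot; under concurrent updates the result is approximate.
    long[] percentile(double p) {
        long total = 0;
        for (int i = 0; i < buckets.length(); i++) {
            total += buckets.get(i);
        }
        long target = Math.max(1L, (long) Math.ceil(p * total / 100.0));
        long cumulative = 0;
        for (int i = 0; i < buckets.length(); i++) {
            cumulative += buckets.get(i);
            if (cumulative >= target) {
                return new long[] {i, cumulative};
            }
        }
        return new long[] {buckets.length() - 1, cumulative};  // empty histogram
    }

    // Formats one line per requested percentile, in the printed-report shape.
    String report(double... percentiles) {
        StringBuilder sb = new StringBuilder();
        for (double p : percentiles) {
            long[] r = percentile(p);
            sb.append("Percentile: ").append(p)
              .append(", Latency: ").append(r[0]).append("ms")
              .append(", Total Count: ").append(r[1]).append('\n');
        }
        return sb.toString();
    }
}
```

Because percentiles are derived from bucket counts, their resolution is one bucket wide, which is why two different percentiles can map to the same latency (as the 1.0 and 10.0 lines do in the report). Separate instances would give the end-to-end, server-side and per-batch breakdowns.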
Latency Percentile Report
Result – printed report
2013-03-21 08:44:29.112 PDT INFO ##### Latency statistics based on percentiles:
Percentile: 0.1, Latency: 1ms, Total Count: 1173148
Percentile: 1.0, Latency: 2ms, Total Count: 100909763
Percentile: 10.0, Latency: 2ms, Total Count: 100909763
Percentile: 95.0, Latency: 26ms, Total Count: 685176664
Percentile: 99.0, Latency: 50ms, Total Count: 713029967
Percentile: 99.5, Latency: 58ms, Total Count: 716355711
Percentile: 99.9, Latency: 78ms, Total Count: 719217619
Percentile: 99.99, Latency: 104ms, Total Count: 719836971
Percentile: 99.999, Latency: 128ms, Total Count: 719897850
Percentile: 100.0, Latency: 169ms, Total Count: 719904814
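Checking this report against the 55ms goal from the problem overview is simple arithmetic, assuming "Total Count" is the cumulative number of requests at or below the listed latency bucket (the counts fit that reading: 713029967 / 719904814 is about 99.05% for the 99.0 line).

```latex
\begin{aligned}
N &= 719\,904\,814 &&\text{requests in total (the 100.0 line)}\\
10^{-5}N &\approx 7\,199 &&\text{requests allowed above 55 ms for a 99.999\% target}\\
N - 716\,355\,711 &= 3\,549\,103 &&\text{requests above the 58 ms bucket, hence above 55 ms}\\
N - 719\,897\,850 &= 6\,964 \approx 0.00097\%\,N &&\text{requests above the 128 ms bucket}
\end{aligned}
```

So in this run roughly 3.5 million requests exceeded 55ms against a budget of about 7,199, and the goal is not met. The last line is a sanity check: fewer than 0.001% of requests lie above the 99.999th-percentile bucket, as the definition of that percentile requires.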
Latency Percentile Report
Result – heat map
Method Breakdown
Problem overview
§ Goal
– I want to measure the impact of a new method under varying load
– End-to-end latency always ON
– Minimum performance impact
§ Approach
– Method annotations
– Aspect
Method Breakdown
Solution
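// Labels identifying each tracked processing step
public enum LabelEnum {
    APPLY_TARIFF,
    ...
    DEBATCH
}

// Each tracked method names the label its elapsed time is recorded under
public class ClassToBeTracked {
    @Track(pointLabel = LabelEnum.APPLY_TARIFF)
    private <ReturnObject> method(<Parameters>) {
        ...
    }
}

<!-- AspectJ aop.xml: the "scope" pointcut limits weaving to the tracked class -->
<pointcut name="scope" expression="within(ClassName)"/>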
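The slide relies on an AspectJ aspect woven around @Track methods. The sketch below swaps in a different, JDK-only technique, a java.lang.reflect.Proxy, so that it runs without AspectJ. Unlike an aspect, a dynamic proxy can only intercept methods declared on an interface, so it cannot time a private method as in the example above. The @Track declaration (with RUNTIME retention), the Breakdown and Tracker classes and the TariffStep interface are hypothetical; the labels match the ones in the breakdown report on the next slide.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.EnumMap;
import java.util.Map;

// Labels taken from the breakdown report.
enum LabelEnum { DEBATCH, LOOKUP_DATA, APPLY_TARIFF, SAVE_SESSION, PREPARE_RESPONSE }

// Assumed declaration: RUNTIME retention so the annotation can be read reflectively.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Track {
    LabelEnum pointLabel();
}

// Accumulated nanoseconds per label (hypothetical stand-in for the Chronicler breakdown).
final class Breakdown {
    private final Map<LabelEnum, Long> nanos = new EnumMap<>(LabelEnum.class);

    synchronized void add(LabelEnum label, long elapsed) { nanos.merge(label, elapsed, Long::sum); }
    synchronized long get(LabelEnum label) { return nanos.getOrDefault(label, 0L); }
    synchronized boolean contains(LabelEnum label) { return nanos.containsKey(label); }
}

final class Tracker {
    // Wraps target in a JDK dynamic proxy that times every @Track-annotated
    // interface method and records the elapsed time under its label.
    @SuppressWarnings("unchecked")
    static <T> T track(T target, Class<T> iface, Breakdown breakdown) {
        InvocationHandler handler = (proxy, method, args) -> {
            Track track = method.getAnnotation(Track.class);
            long start = System.nanoTime();
            try {
                return method.invoke(target, args);
            } catch (InvocationTargetException e) {
                throw e.getCause();  // rethrow the method's own exception
            } finally {
                if (track != null) {
                    breakdown.add(track.pointLabel(), System.nanoTime() - start);
                }
            }
        };
        return (T) Proxy.newProxyInstance(iface.getClassLoader(), new Class<?>[] {iface}, handler);
    }
}

// Hypothetical processing step used to demonstrate the proxy.
interface TariffStep {
    @Track(pointLabel = LabelEnum.APPLY_TARIFF)
    long applyTariff(long units);
}
```

Usage: TariffStep tracked = Tracker.track(realStep, TariffStep.class, breakdown); calls through tracked are timed under APPLY_TARIFF, while untracked code paths pay nothing. Weaving with an aspect, as on the slide, avoids both the interface requirement and the reflective call.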
Method Breakdown
Result – detailed breakdown report
2013-07-15 16:29:24.953 PDT Chronicler Breakdown:
DEBATCH -> 64149 nanoseconds
LOOKUP_DATA -> 1056748 nanoseconds
APPLY_TARIFF -> 99994 nanoseconds
SAVE_SESSION -> 12989 nanoseconds
PREPARE_RESPONSE -> 15998 nanoseconds
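Summing the listed steps shows where the time goes in this breakdown. The total covers only the labeled steps as printed, not necessarily the full end-to-end latency.

```latex
\begin{aligned}
T &= 64\,149 + 1\,056\,748 + 99\,994 + 12\,989 + 15\,998 = 1\,249\,878\ \text{ns} \approx 1.25\ \text{ms}\\
\frac{T_{\text{LOOKUP\_DATA}}}{T} &= \frac{1\,056\,748}{1\,249\,878} \approx 0.845
\end{aligned}
```

LOOKUP_DATA accounts for roughly 85% of the instrumented time, which is the kind of bottleneck this per-method breakdown is meant to expose.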
Storing and Presenting Metrics
Problem overview
§ Goal
– I collect detailed performance metrics, but I need to report them too
– I need a tool that stores these metrics and presents them in a unified view
§ Approach
– Create a monitoring dashboard
– Technologies: JRDS, RRD (round-robin database) and in-house development
Storing and Presenting Metrics
Result – monitoring dashboard
Configuration:
§ Topology: 24 servers
§ Throughput: 20000 ops/sec
§ Duration: 10 hrs
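For scale, this configuration implies the following totals. The per-server figure assumes the load is spread evenly over all 24 servers, which the slide does not state.

```latex
\begin{aligned}
N_{\text{ops}} &= 20\,000\ \text{ops/s} \times 10\ \text{h} \times 3\,600\ \text{s/h} = 7.2 \times 10^{8}\ \text{operations}\\
\frac{20\,000\ \text{ops/s}}{24\ \text{servers}} &\approx 833\ \text{ops/s per server}
\end{aligned}
```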
Storing and Presenting Metrics
Solution qualities
§ Graphical
§ Supports various metrics
– Application-specific
– Machine-specific
– JVM-specific
§ Consolidated view
– All graphs on one page
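Machine- and JVM-specific metrics are available in-process from the standard platform MXBeans. The sketch below gathers a few of them into a map that a dashboard collector could store; the JvmSnapshot class and the metric names are hypothetical, and a collector could instead read the same MXBeans remotely over JMX.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.OperatingSystemMXBean;
import java.util.LinkedHashMap;
import java.util.Map;

// Collects a few machine- and JVM-level metrics from the platform MXBeans.
// Class and metric names are hypothetical.
final class JvmSnapshot {
    static Map<String, Number> take() {
        Map<String, Number> m = new LinkedHashMap<>();

        // Machine-specific
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        m.put("machine.availableProcessors", os.getAvailableProcessors());
        m.put("machine.systemLoadAverage", os.getSystemLoadAverage());  // negative if unavailable

        // JVM-specific
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        m.put("jvm.heapUsedBytes", heap.getUsed());
        m.put("jvm.heapCommittedBytes", heap.getCommitted());
        m.put("jvm.threadCount", ManagementFactory.getThreadMXBean().getThreadCount());
        m.put("jvm.uptimeMillis", ManagementFactory.getRuntimeMXBean().getUptime());
        return m;
    }
}
```

Taking such a snapshot on every node at the same interval as the application metrics is what lets them share a time axis in the consolidated view.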
Storing and Presenting Metrics
Solution qualities
§ Easy to use
– Collects and saves data automatically
§ Easy to share
– Includes configuration for future reference
– Send links to web pages
– Print page as PDF
Storing and Presenting Metrics
Solution qualities
§ Stores data without losing precision
§ Supports drilling down
§ Light-weight
§ Customizable
Pitfalls and Advice
Take into consideration
§ Distributed system monitoring != Single JVM monitoring
– Consolidated view is critical
§ Consistency of tools across the team is important
– Same language across development, QE and Performance teams saves hours
§ Solution should enable you to be agile
– Run monitoring on a laptop AND on a realistic setup
Pitfalls and Advice
Some things to pay attention to
§ These hide problems
– Averaging
– Sampling
§ GC has a big impact, so include it in your metrics
§ Watch out for processes sharing the same host
§ Always run long-duration tests
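To include GC in the metrics, a node can sample the platform GarbageCollectorMXBeans and report the change per interval next to latency and throughput. This is a hedged sketch: the GcSampler class and its output layout are hypothetical, and getCollectionTime() returns the JVM's approximate accumulated collection time, which for concurrent collectors is not the same as application pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Reports GC activity between consecutive samples, summed over all collectors,
// so it can be graphed next to latency and throughput (hypothetical class).
final class GcSampler {
    private long lastCount;
    private long lastTimeMillis;
    private long lastSampleNanos;

    GcSampler() {
        lastCount = totalCount();
        lastTimeMillis = totalTimeMillis();
        lastSampleNanos = System.nanoTime();
    }

    // Returns {collections, gcMillis, gcPercentOfInterval} since the previous sample.
    synchronized double[] sample() {
        long count = totalCount();
        long time = totalTimeMillis();
        long now = System.nanoTime();
        double intervalMillis = (now - lastSampleNanos) / 1_000_000.0;
        double[] result = {
            count - lastCount,
            time - lastTimeMillis,
            intervalMillis > 0 ? 100.0 * (time - lastTimeMillis) / intervalMillis : 0.0
        };
        lastCount = count;
        lastTimeMillis = time;
        lastSampleNanos = now;
        return result;
    }

    private static long totalCount() {
        long sum = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sum += Math.max(0L, gc.getCollectionCount());  // -1 means "not available"
        }
        return sum;
    }

    // Approximate accumulated collection time in milliseconds.
    private static long totalTimeMillis() {
        long sum = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sum += Math.max(0L, gc.getCollectionTime());
        }
        return sum;
    }
}
```

Reporting per-interval deltas rather than run-wide totals keeps GC spikes visible, in line with the warning that averaging hides problems.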
Session Summary
Let's see how we addressed the original requirements
§ Detailed insight about performance
– Latency
– Throughput
§ View over time
§ Reporting
§ Bottleneck detection
§ View of the system as a cohesive unit
Links
§ ECE
– http://www.oracle.com/us/products/applications/communications/elastic-charging-engine
§ JRDS
– http://www.jrds.fr