understanding application hiccups · jhiccup a tool for capturing and displaying platform hiccups...

29
©2011 Azul Systems, Inc. Understanding Application Hiccups and what you can do about them An introduction to the Open Source jHiccup tool Gil Tene, CTO & co-Founder, Azul Systems

Upload: others

Post on 29-Aug-2020

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Understanding Application Hiccups · jHiccup A tool for capturing and displaying platform hiccups Records any observed non-continuity of the underlying platform plots results in simple,

©2011 Azul Systems, Inc.

Understanding Application Hiccups

and what you can do about them

An introduction to the Open Source jHiccup tool

Gil Tene, CTO & co-Founder, Azul Systems

Page 2: Understanding Application Hiccups · jHiccup A tool for capturing and displaying platform hiccups Records any observed non-continuity of the underlying platform plots results in simple,

©2011 Azul Systems, Inc.

About me: Gil Tene

co-founder, CTO @Azul Systems

Have been working on “think different” GC approaches since 2002

Created Pauseless & C4 core GC algorithms (Tene, Wolf)

A Long history building Virtual & Physical Machines, Operating Systems, Enterprise apps, etc... * working on real-world trash compaction issues, circa 2004

Page 3: Understanding Application Hiccups · jHiccup A tool for capturing and displaying platform hiccups Records any observed non-continuity of the underlying platform plots results in simple,

©2011 Azul Systems, Inc.

About AzulWe make scalable Virtual Machines

Have built “whatever it takes to get job done” since 2002

3 generations of custom SMP Multi-core HW (Vega)

Now Pure software for commodity x86 (Zing)

“Industry firsts” in Garbage collection, elastic memory, Java virtualization, memory scale

Vega

C4

Page 4: Understanding Application Hiccups · jHiccup A tool for capturing and displaying platform hiccups Records any observed non-continuity of the underlying platform plots results in simple,

©2011 Azul Systems, Inc.

A classic look at response time behavior

Key Assumption: Response time is a function of load

source: IBM CICS server docuementation, “understanding response times”

Average?

Max?

Median?

90%?

99.9%

Page 5: Understanding Application Hiccups · jHiccup A tool for capturing and displaying platform hiccups Records any observed non-continuity of the underlying platform plots results in simple,

©2011 Azul Systems, Inc.

Common fallacies

Computers run application code continuously

CPUs stop processing application code for all sorts of reasons

e.g: Interrupts. Scheduling of other work, swapping, etc.

Modern system architectures add more: Power management, Virtualization (cross-image context switching, physical VM motion), Garbage Collection, etc.

Response time can be measured as work units/time.

Response time exhibits a normal distribution

Leading to attempts to represent with average + std. deviation

Page 6: Understanding Application Hiccups · jHiccup A tool for capturing and displaying platform hiccups Records any observed non-continuity of the underlying platform plots results in simple,

©2011 Azul Systems, Inc.

Response time over time

When we measure behavior over time, we often see:

source: ZOHO QEngine White Paper: performance testing report analysis

“Hiccups”

Page 7: Understanding Application Hiccups · jHiccup A tool for capturing and displaying platform hiccups Records any observed non-continuity of the underlying platform plots results in simple,

©2011 Azul Systems, Inc.

Application Hiccups

Where do they come from?

Usually a factor of outside of the individual transaction work

E.g: Queueing, accumulated work, platform inconsistency

Do they matter?

That depends. What are your end-user’s expectations?

Hiccups often dominate response time behavior

How can/should we measure them?

Average? Max? 99.9%? Mean with Std. deviation?

Hiccup magnitude is often not a function of load

Page 8: Understanding Application Hiccups · jHiccup A tool for capturing and displaying platform hiccups Records any observed non-continuity of the underlying platform plots results in simple,

©2011 Azul Systems, Inc.

0"

1000"

2000"

3000"

4000"

5000"

6000"

7000"

8000"

9000"

0" 20" 40" 60" 80" 100" 120" 140" 160" 180"

Hiccup&Dura*on&(msec)&

&Elapsed&Time&(sec)&

Hiccups&by&Time&Interval&

Max"per"Interval" 99%" 99.90%" 99.99%" Max"

What happened here?

Source: Gil running an idle program and suspending it five times in the middle

“Hiccups”

Page 9: Understanding Application Hiccups · jHiccup A tool for capturing and displaying platform hiccups Records any observed non-continuity of the underlying platform plots results in simple,

©2011 Azul Systems, Inc.

Pitfall: Calulating %’iles “naïvely”

Common Example:

build/buy simple load tester to measure throughput

issue requests one by one at a certain rate

measure and log response time for each request

results log used to produce histograms, percentiles, etc.

So what’s wrong with that?

works well only when all responses fit within in rate interval

technique includes “automatic backoff” and coordination

But requirements interested in random, uncoordinated requests

Bad how bad can this get, really?

Page 10: Understanding Application Hiccups · jHiccup A tool for capturing and displaying platform hiccups Records any observed non-continuity of the underlying platform plots results in simple,

Example of naïve %’ile

1 msec

System Stalledfor 100 Sec

Elapsed Time

System easily handles 100 requests/sec

Responds to eachin 1msec

How would you characterize this system?Naïve results:

10,000 @ 1msec 1 @ 100 second

Naïve characterization: 99.99% below 1 sec !!!

Page 11: Understanding Application Hiccups · jHiccup A tool for capturing and displaying platform hiccups Records any observed non-continuity of the underlying platform plots results in simple,

Proper measurement

System Stalledfor 100 Sec

Elapsed Time

System easily handles 100 requests/sec

Responds to eachin 1msec

10,000 resultsVarying linearlyfrom 100 secto 10 msec

10,000 results@ 1 msec each

Proper characterization: 50% below 1 second

Page 12: Understanding Application Hiccups · jHiccup A tool for capturing and displaying platform hiccups Records any observed non-continuity of the underlying platform plots results in simple,

©2011 Azul Systems, Inc.

jHiccup

A tool for capturing and displaying platform hiccups

Records any observed non-continuity of the underlying platform

plots results in simple, consistent format

Simple, non-intrusive

As simple as adding the word “jHiccup” to your java launch line

% jHiccup java myflags myApp

Adds a background thread that samples time @ 1000/sec

Open Source

released to the public domain, creative commons CC0

Page 13: Understanding Application Hiccups · jHiccup A tool for capturing and displaying platform hiccups Records any observed non-continuity of the underlying platform plots results in simple,

©2011 Azul Systems, Inc.

what jHiccup measuresjHiccup measures the platform, not the application

Experiences whatever delays application threads would see

Highlights inconsistency in platform execution of “nothing”

No attempt to identify cause (e.g. scheduling, GC, swapping, etc.)

An application cannot behave better than it’s platform

If the platform stalls, the application stalled as well.

jHiccup provides a lower bound for “best application behavior”

Useful control measurements

Comparing jHiccup results for an idle application running on the same platform, at the same time, narrows down behavior to process-specific artifacts (-c option)

Page 14: Understanding Application Hiccups · jHiccup A tool for capturing and displaying platform hiccups Records any observed non-continuity of the underlying platform plots results in simple,

©2011 Azul Systems, Inc.

The anatomy of Hiccup Charts

Page 15: Understanding Application Hiccups · jHiccup A tool for capturing and displaying platform hiccups Records any observed non-continuity of the underlying platform plots results in simple,

©2011 Azul Systems, Inc.

Telco App Example

0"

20"

40"

60"

80"

100"

120"

140"

0" 500" 1000" 1500" 2000" 2500"

Hiccup&Dura*on&(msec)&

&Elapsed&Time&(sec)&

Hiccups&by&Time&Interval&

Max"per"Interval" 99%" 99.90%" 99.99%" Max"

0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%"

0"

20"

40"

60"

80"

100"

120"

140"

Hiccup&Dura*on&(msec)&

&

&

Percen*le&

Hiccups&by&Percen*le&Distribu*on&

Hiccups"by"Percen?le" SLA"

Optional SLA

plotting

Max Time per

interval

Hiccup

duration at

percentile

levels

Page 16: Understanding Application Hiccups · jHiccup A tool for capturing and displaying platform hiccups Records any observed non-continuity of the underlying platform plots results in simple,

©2011 Azul Systems, Inc.

0"

200"

400"

600"

800"

1000"

1200"

0" 200" 400" 600" 800" 1000" 1200" 1400"

Hiccup&Dura*on&(msec)&

&Elapsed&Time&(sec)&

Hiccups&by&Time&Interval&

Max"per"Interval" 99%" 99.90%" 99.99%" Max"

0%" 90%" 99%" 99.9%" 99.99%" 99.999%"

Max=967.68&

0"

200"

400"

600"

800"

1000"

1200"

Hiccup&Dura*on&(msec)&

&

&

Percen*le&

Hiccups&by&Percen*le&Distribu*on&

Page 17: Understanding Application Hiccups · jHiccup A tool for capturing and displaying platform hiccups Records any observed non-continuity of the underlying platform plots results in simple,

©2011 Azul Systems, Inc.

How jHiccup works

Background thread, measures time to do “nothing”

Sleeps for 1 milliseconds, allocates 1 object, records time gap

Collects detailed “floating point” histogram of observed times

Constant, small memory footprint, virtually no cpu load

No effect on application threads

Outputs two files

A time-based log with 1 line per interval (default 5 sec interval).

A percentile distribution log of accumulated histogram data

Spreadsheet imports files and plots data

A standard, simple chart format

Page 18: Understanding Application Hiccups · jHiccup A tool for capturing and displaying platform hiccups Records any observed non-continuity of the underlying platform plots results in simple,

Some fun with jHiccup

Page 19: Understanding Application Hiccups · jHiccup A tool for capturing and displaying platform hiccups Records any observed non-continuity of the underlying platform plots results in simple,

©2011 Azul Systems, Inc.

Examples

Page 20: Understanding Application Hiccups · jHiccup A tool for capturing and displaying platform hiccups Records any observed non-continuity of the underlying platform plots results in simple,

Idle App on Busy System

0"

10"

20"

30"

40"

50"

60"

0" 100" 200" 300" 400" 500" 600" 700" 800" 900"

Hiccup&Dura*on&(msec)&

&Elapsed&Time&(sec)&

Hiccups&by&Time&Interval&

Max"per"Interval" 99%" 99.90%" 99.99%" Max"

0%" 90%" 99%" 99.9%" 99.99%" 99.999%"

Max=49.728&

0"

10"

20"

30"

40"

50"

60"

Hiccup&Dura*on&(msec)&

&

&

Percen*le&

Hiccups&by&Percen*le&Distribu*on&

Idle App on Quiet System

0"

5"

10"

15"

20"

25"

0" 100" 200" 300" 400" 500" 600" 700" 800" 900"

Hiccup&Dura*on&(msec)&

&Elapsed&Time&(sec)&

Hiccups&by&Time&Interval&

Max"per"Interval" 99%" 99.90%" 99.99%" Max"

0%" 90%" 99%" 99.9%" 99.99%" 99.999%"

Max=22.336&

0"

5"

10"

15"

20"

25"

Hiccup&Dura*on&(msec)&

&

&

Percen*le&

Hiccups&by&Percen*le&Distribu*on&

Page 21: Understanding Application Hiccups · jHiccup A tool for capturing and displaying platform hiccups Records any observed non-continuity of the underlying platform plots results in simple,

Idle App on Quiet System

0"

5"

10"

15"

20"

25"

0" 100" 200" 300" 400" 500" 600" 700" 800" 900"

Hiccup&Dura*on&(msec)&

&Elapsed&Time&(sec)&

Hiccups&by&Time&Interval&

Max"per"Interval" 99%" 99.90%" 99.99%" Max"

0%" 90%" 99%" 99.9%" 99.99%" 99.999%"

Max=22.336&

0"

5"

10"

15"

20"

25"

Hiccup&Dura*on&(msec)&

&

&

Percen*le&

Hiccups&by&Percen*le&Distribu*on&

Idle App on Dedicated System

0"

0.05"

0.1"

0.15"

0.2"

0.25"

0.3"

0.35"

0.4"

0.45"

0" 100" 200" 300" 400" 500" 600" 700" 800" 900"

Hiccup&Dura*on&(msec)&

&Elapsed&Time&(sec)&

Hiccups&by&Time&Interval&

Max"per"Interval" 99%" 99.90%" 99.99%" Max"

0%" 90%" 99%" 99.9%" 99.99%" 99.999%"

Max=0.411&

0"

0.05"

0.1"

0.15"

0.2"

0.25"

0.3"

0.35"

0.4"

0.45"

Hiccup&Dura*on&(msec)&

&

&

Percen*le&

Hiccups&by&Percen*le&Distribu*on&

Page 22: Understanding Application Hiccups · jHiccup A tool for capturing and displaying platform hiccups Records any observed non-continuity of the underlying platform plots results in simple,

A 1GB Java Cache under load

0"

500"

1000"

1500"

2000"

2500"

3000"

3500"

4000"

0" 500" 1000" 1500" 2000"

Hiccup&Dura*on&(msec)&

&Elapsed&Time&(sec)&

Hiccups&by&Time&Interval&

Max"per"Interval" 99%" 99.90%" 99.99%" Max"

0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%"

Max=3448.832&

0"

500"

1000"

1500"

2000"

2500"

3000"

3500"

4000"

Hiccup&Dura*on&(msec)&

&

&

Percen*le&

Hiccups&by&Percen*le&Distribu*on&

Page 23: Understanding Application Hiccups · jHiccup A tool for capturing and displaying platform hiccups Records any observed non-continuity of the underlying platform plots results in simple,

Telco App

0"

200"

400"

600"

800"

1000"

1200"

1400"

1600"

1800"

0" 200" 400" 600" 800" 1000" 1200" 1400" 1600" 1800"

Hiccup&Dura*on&(msec)&

&Elapsed&Time&(sec)&

Hiccups&by&Time&Interval&

Max"per"Interval" 99%" 99.90%" 99.99%" Max"

0%" 90%" 99%" 99.9%" 99.99%" 99.999%"

Max=1665.024&

0"

200"

400"

600"

800"

1000"

1200"

1400"

1600"

1800"

Hiccup&Dura*on&(msec)&

&

&

Percen*le&

Hiccups&by&Percen*le&Distribu*on&

Page 24: Understanding Application Hiccups · jHiccup A tool for capturing and displaying platform hiccups Records any observed non-continuity of the underlying platform plots results in simple,

Fun with jHiccup

Page 25: Understanding Application Hiccups · jHiccup A tool for capturing and displaying platform hiccups Records any observed non-continuity of the underlying platform plots results in simple,

©2011 Azul Systems, Inc.

Obviously, we can use it forcomparison purposes

Page 26: Understanding Application Hiccups · jHiccup A tool for capturing and displaying platform hiccups Records any observed non-continuity of the underlying platform plots results in simple,

Oracle HotSpot CMS, 1GB in an 8GB heap

0"

2000"

4000"

6000"

8000"

10000"

12000"

14000"

0" 500" 1000" 1500" 2000" 2500" 3000" 3500"

Hiccup&Dura*on&(msec)&

&Elapsed&Time&(sec)&

Hiccups&by&Time&Interval&

Max"per"Interval" 99%" 99.90%" 99.99%" Max"

0%" 90%" 99%" 99.9%" 99.99%" 99.999%"

Max=13156.352&

0"

2000"

4000"

6000"

8000"

10000"

12000"

14000"

Hiccup&Dura*on&(msec)&

&

&

Percen*le&

Hiccups&by&Percen*le&Distribu*on&

Zing 5, 1GB in an 8GB heap

0"

5"

10"

15"

20"

25"

0" 500" 1000" 1500" 2000" 2500" 3000" 3500"

Hiccup&Dura*on&(msec)&

&Elapsed&Time&(sec)&

Hiccups&by&Time&Interval&

Max"per"Interval" 99%" 99.90%" 99.99%" Max"

0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%"

Max=20.384&

0"

5"

10"

15"

20"

25"

Hiccup&Dura*on&(msec)&

&

&

Percen*le&

Hiccups&by&Percen*le&Distribu*on&

Page 27: Understanding Application Hiccups · jHiccup A tool for capturing and displaying platform hiccups Records any observed non-continuity of the underlying platform plots results in simple,

Oracle HotSpot CMS, 1GB in an 8GB heap

0"

2000"

4000"

6000"

8000"

10000"

12000"

14000"

0" 500" 1000" 1500" 2000" 2500" 3000" 3500"

Hiccup&Dura*on&(msec)&

&Elapsed&Time&(sec)&

Hiccups&by&Time&Interval&

Max"per"Interval" 99%" 99.90%" 99.99%" Max"

0%" 90%" 99%" 99.9%" 99.99%" 99.999%"

Max=13156.352&

0"

2000"

4000"

6000"

8000"

10000"

12000"

14000"

Hiccup&Dura*on&(msec)&

&

&

Percen*le&

Hiccups&by&Percen*le&Distribu*on&

Zing 5, 1GB in an 8GB heap

0"

2000"

4000"

6000"

8000"

10000"

12000"

14000"

0" 500" 1000" 1500" 2000" 2500" 3000" 3500"

Hiccup&Dura*on&(msec)&

&Elapsed&Time&(sec)&

Hiccups&by&Time&Interval&

Max"per"Interval" 99%" 99.90%" 99.99%" Max"

0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%"Max=20.384&

0"

2000"

4000"

6000"

8000"

10000"

12000"

14000"

Hiccup&Dura*on&(msec)&

&

&

Percen*le&

Hiccups&by&Percen*le&Distribu*on&

Page 28: Understanding Application Hiccups · jHiccup A tool for capturing and displaying platform hiccups Records any observed non-continuity of the underlying platform plots results in simple,

Oracle HotSpot CMS, 4GB in a 18GB heap

0"

5000"

10000"

15000"

20000"

25000"

30000"

35000"

40000"

45000"

0" 500" 1000" 1500" 2000" 2500" 3000" 3500"

Hiccup&Dura*on&(msec)&

&Elapsed&Time&(sec)&

Hiccups&by&Time&Interval&

Max"per"Interval" 99%" 99.90%" 99.99%" Max"

0%" 90%" 99%" 99.9%" 99.99%" 99.999%"

Max=40173.568&

0"

5000"

10000"

15000"

20000"

25000"

30000"

35000"

40000"

45000"

Hiccup&Dura*on&(msec)&

&

&

Percen*le&

Hiccups&by&Percen*le&Distribu*on&

Oracle HotSpot CMS, 1GB in an 8GB heap

0"

2000"

4000"

6000"

8000"

10000"

12000"

14000"

0" 500" 1000" 1500" 2000" 2500" 3000" 3500"

Hiccup&Dura*on&(msec)&

&Elapsed&Time&(sec)&

Hiccups&by&Time&Interval&

Max"per"Interval" 99%" 99.90%" 99.99%" Max"

0%" 90%" 99%" 99.9%" 99.99%" 99.999%"

Max=13156.352&

0"

2000"

4000"

6000"

8000"

10000"

12000"

14000"

Hiccup&Dura*on&(msec)&

&

&

Percen*le&

Hiccups&by&Percen*le&Distribu*on&