An Automated Approach for Recommending When to Stop Performance Tests
Hammam AlGhamdi, Weiyi Shang, Mark D. Syer, Ahmed E. Hassan


Page 1: An Automated Approach for Recommending When to Stop Performance Tests

Hammam AlGhamdi, Weiyi Shang, Mark D. Syer, Ahmed E. Hassan

Page 2: An Automated Approach for Recommending When to Stop Performance Tests

Failures in ultra-large-scale systems are often due to performance issues rather than functional issues.

Page 3: An Automated Approach for Recommending When to Stop Performance Tests

A 25-minute service outage in 2013 cost Amazon approximately $1.7M.

Page 4: An Automated Approach for Recommending When to Stop Performance Tests

Performance testing is essential to prevent these failures.

[Diagram: a pre-defined workload sends requests to the system under test in the performance testing environment, while performance counters, e.g., CPU, memory, I/O and response time, are collected.]

Page 5: An Automated Approach for Recommending When to Stop Performance Tests

Determining the length of a performance test is challenging.

[Timeline: as the test runs, increasingly repetitive data is generated; an optimal stopping time exists.]

Page 6: An Automated Approach for Recommending When to Stop Performance Tests

Determining the length of a performance test is challenging: stopping too early misses performance issues; stopping too late delays the release and wastes testing resources. The optimal stopping time lies between the two.

Page 7: An Automated Approach for Recommending When to Stop Performance Tests

Our approach for recommending when to stop a performance test:

1) Collect the already-generated data
2) Measure the likelihood of repetitiveness
3) Extrapolate the likelihood of repetitiveness
4) Determine whether to stop the test (Yes: STOP; No: return to step 1)

[Flow: collected data → likelihood of repetitiveness → first derivatives → whether to stop the test]
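The four-step loop above can be sketched as a driver like the following. This is a minimal sketch, not the authors' implementation: the step functions are hypothetical stand-ins (random counter samples and a saturating placeholder likelihood), and the stopping threshold is an illustrative assumption.

```python
import random

# Hypothetical stand-ins for the four steps; a real harness would read
# live performance counters and apply the Wilcoxon-based likelihood
# measurement described on the later slides.
def collect_generated_data(state):
    state["data"].append(random.gauss(0.0, 1.0))  # step 1: new counter sample

def measure_likelihood(state):
    # step 2 placeholder: likelihood grows and saturates as data accumulates
    state["likelihoods"].append(1.0 - 1.0 / len(state["data"]))

def derivative_close_to_zero(likelihoods, eps=1e-3):
    # steps 3-4: finite-difference first derivative near zero => stop
    return len(likelihoods) >= 2 and abs(likelihoods[-1] - likelihoods[-2]) < eps

def run_test(max_intervals=1000):
    """Skeleton of the four-step stopping loop from this slide."""
    state = {"data": [], "likelihoods": []}
    for _ in range(max_intervals):
        collect_generated_data(state)
        measure_likelihood(state)
        if derivative_close_to_zero(state["likelihoods"]):
            return len(state["data"])  # recommended stopping point
    return max_intervals
```

With this placeholder likelihood curve the loop stops as soon as consecutive measurements differ by less than `eps`, mirroring the "stop when the first derivative is close to 0" rule.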

Page 11: An Automated Approach for Recommending When to Stop Performance Tests

Step 1: Collect the data that the test has generated so far: performance counters, e.g., CPU, memory, I/O and response time.

Page 12: An Automated Approach for Recommending When to Stop Performance Tests

Step 2: Measure the likelihood of repetitiveness. Select a random time period A (e.g., 30 min) from the collected data.

Page 13: An Automated Approach for Recommending When to Stop Performance Tests

Step 2 (cont.): Search for another, non-overlapping time period B that is NOT statistically significantly different from A.

Page 14: An Automated Approach for Recommending When to Stop Performance Tests

Step 2 (cont.): Apply the Wilcoxon test between the distributions of every performance counter across both periods.

Page 15: An Automated Approach for Recommending When to Stop Performance Tests

Step 2 (cont.): Example Wilcoxon test results between periods A and B:

Counter:  Response time | CPU   | Memory | I/O
p-value:  0.0258        | 0.313 | 0.687  | 0.645

The periods are statistically significantly different in response time!

Page 16: An Automated Approach for Recommending When to Stop Performance Tests

Step 2 (cont.): Results for another candidate period:

Counter:  Response time | CPU   | Memory | I/O
p-value:  0.67          | 0.313 | 0.687  | 0.645

The goal is to find a time period that is NOT statistically significantly different in ALL performance counters!

Page 17: An Automated Approach for Recommending When to Stop Performance Tests

Step 2 (cont.): Did we find a period that is NOT statistically significantly different? Yes: the data is repetitive. No: the data is not repetitive.

Page 18: An Automated Approach for Recommending When to Stop Performance Tests

Step 2 (cont.): Repeat this process a large number of times (e.g., 1,000) to calculate the likelihood of repetitiveness.
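The repeated search just described can be sketched in pure Python as follows. This is a minimal sketch, not the authors' implementation: `rank_sum_p` approximates the two-sided Wilcoxon rank-sum p-value with a normal approximation (no tie correction), and the counter names, period length, and 0.05 significance level are illustrative assumptions.

```python
import math
import random

def rank_sum_p(a, b):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation)."""
    combined = sorted((v, i) for i, v in enumerate(list(a) + list(b)))
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):  # assign average ranks over ties
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[combined[k][1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    n1, n2 = len(a), len(b)
    r1 = sum(ranks[:n1])                       # rank sum of sample a
    mu = n1 * (n1 + n2 + 1) / 2.0              # expected rank sum under H0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return math.erfc(abs(r1 - mu) / sigma / math.sqrt(2))

def likelihood_of_repetitiveness(counters, period_len, trials=1000, alpha=0.05):
    """For each trial, pick a random period A and search for a
    non-overlapping period B that is NOT significantly different in
    ALL counters; return the fraction of trials with such a match."""
    n = min(len(v) for v in counters.values())
    hits = 0
    for _ in range(trials):
        a = random.randrange(n - period_len + 1)
        for b in range(0, n - period_len + 1, period_len):
            if abs(b - a) < period_len:
                continue  # skip overlapping candidate periods
            if all(rank_sum_p(c[a:a + period_len], c[b:b + period_len]) > alpha
                   for c in counters.values()):
                hits += 1
                break
    return hits / trials
```

For a test whose counters have become stationary, almost every random period finds a statistically similar counterpart, so the likelihood approaches 100%; early in a still-changing test it stays low.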

Page 19: An Automated Approach for Recommending When to Stop Performance Tests

Step 2 (cont.): A new likelihood of repetitiveness is measured periodically (e.g., every 10 min: at 30 min, 40 min, ..., 1 h 10 min, ...) in order to get more frequent feedback on the repetitiveness.

Page 20: An Automated Approach for Recommending When to Stop Performance Tests

Step 2 (cont.): The likelihood of repetitiveness eventually starts stabilizing, i.e., the test generates little new information.

[Plot: likelihood of repetitiveness (1% to 100%) over a 24-hour test (00:00 to 24:00), rising and then flattening.]

Page 21: An Automated Approach for Recommending When to Stop Performance Tests

Step 3: Extrapolate the likelihood of repetitiveness. To know when the repetitiveness stabilizes, we calculate the first derivative of the likelihood curve.

Page 22: An Automated Approach for Recommending When to Stop Performance Tests

Step 4: Determine whether to stop the test. Stop the test if the first derivative is close to 0.
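The stopping rule can be sketched as: given the sequence of periodically measured likelihoods, recommend stopping once the first derivative stays close to zero. This is a simplified sketch using finite differences rather than the extrapolated curve from the deck, and the `eps` threshold and `window` size are illustrative assumptions.

```python
def should_stop(likelihoods, eps=0.001, window=3):
    """Recommend stopping once the first derivative of the
    likelihood-of-repetitiveness curve is close to zero for the
    last `window` measurements (finite-difference approximation)."""
    if len(likelihoods) < window + 1:
        return False  # not enough measurements yet
    recent = likelihoods[-(window + 1):]
    derivs = [recent[i + 1] - recent[i] for i in range(window)]
    return all(abs(d) < eps for d in derivs)
```

For a curve that rises and then plateaus, `should_stop` flips to True only once the plateau is reached, matching the "first derivative close to 0" criterion.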


Page 24: An Automated Approach for Recommending When to Stop Performance Tests

We conduct 24-hour performance tests on three systems: PetClinic, Dell DVD Store, and CloudStore.

Page 25: An Automated Approach for Recommending When to Stop Performance Tests

We evaluate whether our approach: (1) stops the test too early, or (2) stops the test too late, relative to the optimal stopping time.

Page 26: An Automated Approach for Recommending When to Stop Performance Tests

RQ1: Does our approach stop the test too early?

1) Select a random time period from the post-stopping data.
2) Check whether the random time period has a repetitive counterpart in the pre-stopping data.
Repeat 1,000 times.

Finding: the test is likely to generate little new data after the stopping times (preserving more than 91.9% of the information).
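This evaluation can be sketched as follows. A faithful run would reuse the Wilcoxon-based comparison from Step 2 on every counter; here a simple mean-based match and the `close` threshold are illustrative stand-ins.

```python
import random

def information_preserved(pre, post, period_len, trials=1000, close=0.1):
    """RQ1 sketch: fraction of random post-stopping periods that have
    a 'repetitive' counterpart in the pre-stopping data. A period
    matches if its mean is within `close` of some pre-stopping
    period's mean (a stand-in for the per-counter Wilcoxon test)."""
    def mean(xs):
        return sum(xs) / len(xs)
    hits = 0
    for _ in range(trials):
        s = random.randrange(len(post) - period_len + 1)
        m = mean(post[s:s + period_len])
        if any(abs(mean(pre[b:b + period_len]) - m) <= close
               for b in range(0, len(pre) - period_len + 1, period_len)):
            hits += 1
    return hits / trials
```

A value near 1.0 means the post-stopping data is almost entirely repetitive with respect to the pre-stopping data, i.e., little information is lost by stopping.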

Page 27: An Automated Approach for Recommending When to Stop Performance Tests

RQ2: Does our approach stop the test too late? We apply our evaluation approach from RQ1 at the end of every hour during the test (1 h, 2 h, ..., 24 h) to find the most cost-effective stopping time.

Page 28: An Automated Approach for Recommending When to Stop Performance Tests

RQ2 (cont.): The most cost-effective stopping time has:
1. A big difference from the previous hour
2. A small difference from the next hour

[Plot: likelihood of repetitiveness (1% to 100%) over time (00:00 to 06:00), illustrating the most cost-effective stopping time.]

Page 29: An Automated Approach for Recommending When to Stop Performance Tests

RQ2 (cont.): There is a short delay between the recommended stopping times and the most cost-effective stopping times (the majority are under a 4-hour delay).

Pages 30-37: Summary. Determining the length of a performance test is challenging: stopping too early misses performance issues, while stopping too late delays the release and wastes testing resources. Our four-step approach recommends when to stop; the test generates little new data after the recommended stopping times (preserving more than 91.9% of the information), and the majority of recommendations are under a 4-hour delay from the most cost-effective stopping times.