technoweb split test in the context of validated learning

siemens.com/answers Unrestricted © Siemens AG 2013. All rights reserved Validated Learning at TechnoWeb A. Oertl, M. Heiss, B. Laenger & B. Kavsek | Corporate Technology | Sep. 2013

Upload: michael-heiss

Post on 29-Oct-2014


DESCRIPTION

This talk was given at the i-know 2013 and at the IEEE TMC Chapter CE Meeting in November 2013. The authors are Andreas Oertl (first author), Michael Heiss, Bettina Laenger, and Barbara Kavsek. It is a more detailed presentation of a split test for the Urgent Request notification within Siemens TechnoWeb, including its statistical significance analysis.

TRANSCRIPT

Page 1: TechnoWeb Split Test in the context of validated learning

siemens.com/answers

Unrestricted © Siemens AG 2013. All rights reserved

Validated Learning at TechnoWeb A. Oertl, M. Heiss, B. Laenger & B. Kavsek | Corporate Technology | Sep. 2013

Page 2: TechnoWeb Split Test in the context of validated learning

Page 2 September 2013 Siemens CT TIM CEE Unrestricted © Siemens AG 2013. All rights reserved

Have you read this book?

WARNING:

Each day you delay reading this book you risk wasting money.

Source: http://www.amazon.de/The-Lean-Startup-Entrepreneurs-Continuous/dp/0307887898

Page 3: TechnoWeb Split Test in the context of validated learning


Reduce Cycle Time

Avoid the worst impact on productivity: do not build something nobody wants

Source: http://www.betterthanpants.com/baby-mop.html

Solution:

• Get immediate feedback from customers

Page 4: TechnoWeb Split Test in the context of validated learning


Meaningful progress: Validated Learning

Defining success:

• Success is improved customer behavior

• Success is measured either by generally applicable metrics, or metrics tailored to a specific situation.

Metric examples:

• Value hypothesis

  • Retention rate (generic): How many customers return within a set time period?

  • UR-conversion rate (custom): How many per mille of the notified users respond to the question?

• Growth hypothesis

  • Cohort based (generic): Separate behavior analysis of independent user groups (e.g. monthly new users).

  • Invitation rate (generic): The willingness of users to invite their personal contacts to the same service.

• The results are used to decide if the change in the feature has positive, negative or no effects on consumer behavior.

• This way, learning immediately delivers business relevant insights.
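As a minimal sketch of how two of the metrics above could be computed, assuming simple event counts are available (the function names and all numbers below are hypothetical and not taken from the talk):

```python
def per_mille(responders: int, notified: int) -> float:
    """UR-conversion rate: responders per 1,000 notified users."""
    if notified <= 0:
        raise ValueError("notified must be positive")
    return 1000.0 * responders / notified


def retention_rate(returning: int, cohort_size: int) -> float:
    """Share of a cohort that returns within the chosen time period."""
    if cohort_size <= 0:
        raise ValueError("cohort_size must be positive")
    return returning / cohort_size


# Hypothetical numbers, for illustration only.
ur_conversion = per_mille(responders=7, notified=5000)
retention = retention_rate(returning=120, cohort_size=400)
print(f"UR-conversion rate: {ur_conversion:.1f} per mille")
print(f"Retention rate: {retention:.0%}")
```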

Page 5: TechnoWeb Split Test in the context of validated learning


Agenda

Page 6: TechnoWeb Split Test in the context of validated learning


Corporate Problem Solving via TechnoWeb: Ask an Urgent Request and get answers from peers

Urgent Requests are distributed per email to the relevant target group (target messaging)

Business Impact (estimated by sender)

90% get help

Headline of the Urgent Request

Many replies: on average 7 replies, the first within 35 min.

Name and optional photo of the sender

Page 7: TechnoWeb Split Test in the context of validated learning


New is not always better

Requirement: Urgent Request notifications had to be changed to fit corporate design guidelines

Which solution is better?

Page 8: TechnoWeb Split Test in the context of validated learning


A meaningful conclusion can only be drawn after this question is answered:

Does the change positively influence customer behavior?

• Urgent Requests are the most important functionality of TechnoWeb

• The e-mail notification invites users to give answers

• Therefore, the effectiveness of the notification is mission critical for the success of TechnoWeb.

It is imperative to measure customer response to the new template.

Releasing new features without validated learning is like being in the dark

Validated Learning

• The automatic conclusion: the new feature is “obviously better” than the old one, and the time and money for the improvement were well spent.

Common approach:

Decisions are often made using one’s own best judgment, ignoring customer needs.

Page 9: TechnoWeb Split Test in the context of validated learning


Agenda

Page 10: TechnoWeb Split Test in the context of validated learning


Statistical Evaluation by Engineers without Specialized Statistical Knowledge

Page 11: TechnoWeb Split Test in the context of validated learning


Split-Test: Preparing to prove assumptions

Urgent Request number i

SPLIT

• Approx. 50% of the users receive the old template: E_i,old, V_i,old, C_i,old

• Approx. 50% of the users receive the new template: E_i,new, V_i,new, C_i,new

Hypothesis: The new template outperforms the old template in:

• Click-through rate

• Conversion rate

The introduced metrics are:

• Click-through rate ratio

• Conversion rate ratio

• E_i … number of sent notifications

• V_i … number of views

• C_i … number of comments

• i … Urgent Request number

• old … old template

• new … new template

Split Test: Define a hypothesis with metric and expected value
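The transcript names the two metrics but does not preserve their formulas. One plausible reading, assuming that both rates are taken relative to the number of notifications sent (the conversion rate could alternatively be defined per view; the per-notification form is an assumption), is:

```latex
\mathrm{ctr}_{i,g} = \frac{V_{i,g}}{E_{i,g}}, \qquad
\mathrm{conv}_{i,g} = \frac{C_{i,g}}{E_{i,g}}, \qquad
g \in \{\mathrm{old}, \mathrm{new}\}

\text{click-through rate ratio}_i = \frac{\mathrm{ctr}_{i,\mathrm{new}}}{\mathrm{ctr}_{i,\mathrm{old}}}, \qquad
\text{conversion rate ratio}_i = \frac{\mathrm{conv}_{i,\mathrm{new}}}{\mathrm{conv}_{i,\mathrm{old}}}
```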

Page 12: TechnoWeb Split Test in the context of validated learning


Results of 323,560 Urgent Request notifications

• The click-through and conversion rate ratios compare the relative success (relative to the number of notifications sent) of the old and new templates. A value <1 means that the performance of the new template is inferior to the old template. A value of 1 signifies no change, whereas a value >1 indicates a better performing new template.

• For the click-through rate ratio, all 61 Urgent Requests are considered. For the conversion rate ratio, only 32 Urgent Requests have sufficient data to be used.
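The ratio and its interpretation can be sketched as follows. This is an illustration under the assumption that a rate is "events per notification sent" and that a request without a usable baseline is skipped; the function names and the counts are hypothetical.

```python
from typing import Optional


def rate(events: int, sent: int) -> float:
    """Events (views or comments) per notification sent."""
    return events / sent


def rate_ratio(new_events: int, new_sent: int,
               old_events: int, old_sent: int) -> Optional[float]:
    """Return rate_new / rate_old, or None when the ratio is undefined."""
    if new_sent == 0 or old_sent == 0 or old_events == 0:
        return None  # no usable baseline for this Urgent Request
    return rate(new_events, new_sent) / rate(old_events, old_sent)


def interpret(ratio: float) -> str:
    """Map a ratio to the reading given on the slide."""
    if ratio < 1:
        return "new template inferior"
    if ratio > 1:
        return "new template better"
    return "no change"


# Hypothetical counts for one Urgent Request (not taken from the study).
ctr_ratio = rate_ratio(new_events=30, new_sent=1000,
                       old_events=50, old_sent=1000)
print(ctr_ratio, interpret(ctr_ratio))
```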

[Figure: Histogram of the click-through rate ratio ctr_new/ctr_old. x-axis: click-through rate ratio (bins 0.1 to 2); y-axis: count of Urgent Requests with corresponding ratio.]

[Figure: Histogram of the conversion rate ratio conv_new/conv_old. x-axis: conversion rate ratio (bins 0.1 to 2 and >2); y-axis: count of Urgent Requests with corresponding ratio.]

Page 13: TechnoWeb Split Test in the context of validated learning


Decreasing click-through rate ratio with increasing Business Impact

• The Business Impact Level assigns a monetary value to the problem statement of the Urgent Request

• A value <1 means that the performance of the new template is inferior to the old template

• The monetary value is displayed less prominently in the new template. Instead of assuming an impact, we measure it:

Visibility of Business Impact Level: Impact on performance

[Figure: Average click-through rate ratio by Business Impact. x-axis: Business Impact (€1,000, €10,000, €50,000, €250,000, €1,000,000); y-axis: average click-through rate ratio (0.0 to 1.2).]

[Figure: Average conversion rate ratio by Business Impact. x-axis: Business Impact (€1,000, €10,000, €50,000, €250,000, €1,000,000); y-axis: average conversion rate ratio (0.0 to 1.2).]

Page 14: TechnoWeb Split Test in the context of validated learning


Split-Tests in enterprises face lower statistical significance

• Problem

Some Urgent Requests, with 25,000 email notifications, yield statistically significant data. Others are sent to only a couple of hundred recipients.

• Solution

Discard all data sets where no significant data (views or comments) has been recorded from either the old or the new template.

• Problem

Even though 323,560 notifications were evaluated, it is in the nature of the application that the absolute number of comments is low. In some cases +/- 1 comment can significantly influence the result.

• Solution

• Disregard multiple comments by the same user (e.g. follow-up comments).

• Disregard all activity by the author of the Urgent Request.

• Discard all data sets that do not have comments from both the old and the new template.

This comes at the cost of less data to work with, but the remaining data is much more trustworthy.
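The comment-cleaning rules above could be sketched like this. The record layout (`Comment`, `authors`) is hypothetical, and the last rule is read as "keep a request only if both templates produced at least one comment"; the talk does not show the actual processing code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Comment:
    request_id: int   # Urgent Request the comment belongs to
    user_id: str      # commenting user
    template: str     # "old" or "new": template the user received


def count_unique_commenters(comments, authors):
    """Count commenters per request and template, once per user, without the author."""
    seen = set()
    counts = {}
    for c in comments:
        if authors.get(c.request_id) == c.user_id:
            continue  # disregard all activity by the author
        key = (c.request_id, c.user_id)
        if key in seen:
            continue  # disregard follow-up comments by the same user
        seen.add(key)
        slot = counts.setdefault(c.request_id, {"old": 0, "new": 0})
        slot[c.template] += 1
    return counts


def keep_usable(counts):
    """Keep only requests with at least one comment from both templates."""
    return {r: c for r, c in counts.items() if c["old"] > 0 and c["new"] > 0}


# Hypothetical records, for illustration only.
authors = {1: "alice", 2: "dave"}
comments = [
    Comment(1, "alice", "old"),   # author of request 1: ignored
    Comment(1, "bob", "old"),
    Comment(1, "bob", "old"),     # follow-up comment: ignored
    Comment(1, "carol", "new"),
    Comment(2, "erin", "new"),    # request 2 has no "old" comment
]
counts = count_unique_commenters(comments, authors)
usable = keep_usable(counts)
print(usable)
```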

Statistical significance

Low comment count

Decisions had to be made concerning raw data processing:

Page 15: TechnoWeb Split Test in the context of validated learning


Statistical Evaluation by Statisticians

Page 16: TechnoWeb Split Test in the context of validated learning


TechnoWeb Split Analysis: Old versus new Template for Urgent Requests

Approach for sample selection:

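[Diagram: share of recipients (0% to 100%) receiving the OLD and the NEW template for successive Urgent Requests]

• Urgent Request 1: OLD | NEW

• Urgent Request t1: OLD | NEW, where the NEW share consists of users who received the new template before (UR 1 .. t1) and first-time receivers.

• Urgent Request t2: OLD | NEW, where the NEW share again consists of "before" (UR 1 .. t2) and first-time receivers. If >50% have already received the new template, there are more "NEW" than "OLD".

• Receivers of the NEW template will always receive NEW further on.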

Page 17: TechnoWeb Split Test in the context of validated learning


Statistical Questions

• Statistical questions to be answered by the analysis:

  • Is there a difference in the number of responses (views, comments) of the old versus new template?

  • Do first-time users of the new template behave differently from users that received the new template before?

• Requested for future analyses:

  • Is there one representative number for the extent of this difference, considering all urgent requests?

Page 18: TechnoWeb Split Test in the context of validated learning


Sample Characteristics: Dependency within 1 observation?

• Are we considering paired or unpaired samples?

- Paired sample means that 2 characteristics of one observation are dependent.

- We want to compare responses (views, comments) to the same urgent request for old versus new template.

- Thus, we have to consider pairs of responses and investigate the difference between response ratios for each urgent request.

- Example:

| | click-through ratio old | click-through ratio new |
|---|---|---|
| Urgent request 1 | 0.01 | 0.03 |
| Urgent request 2 | 0.03 | 0.05 |
| Urgent request 3 | 0.07 | 0.01 |
| Urgent request 4 | 0.05 | 0.07 |

Assuming independent samples ⇒ assuming equal mean in old and new template (both columns average 0.04). BUT in reality: ctr_old < ctr_new in ¾ of requests!

⇒ We assume dependent samples ⇒ paired test
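The point of the four-request example can be checked with a few lines of Python. This is only a sketch using the example values from the slide: the unpaired view sees identical column means, while the paired view shows that the new template is ahead in three of four requests.

```python
from statistics import mean

# Example values from the slide (click-through ratios per urgent request).
ctr_old = [0.01, 0.03, 0.07, 0.05]
ctr_new = [0.03, 0.05, 0.01, 0.07]

# Unpaired view: compare the two column means.
mean_old = mean(ctr_old)
mean_new = mean(ctr_new)

# Paired view: look at the difference within each urgent request.
diffs = [round(new - old, 10) for old, new in zip(ctr_old, ctr_new)]
share_new_higher = sum(d > 0 for d in diffs) / len(diffs)

print(f"mean old = {mean_old:.2f}, mean new = {mean_new:.2f}")
print(f"paired differences = {diffs}")
print(f"share of requests with ctr_new > ctr_old = {share_new_higher:.2f}")
```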

Page 19: TechnoWeb Split Test in the context of validated learning


Sample Characteristics: Independence between observations?

The problem is that for most statistical tests, values between observations of the sample (i.e. different urgent requests) have to be independent.

We know that the same person gets several urgent requests. However, it is assumed that the response behavior (clicking on the notification link) is independent for different topics.

Thus we can assume independence of the different urgent requests.

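| | click-through ratio old | click-through ratio new |
|---|---|---|
| Urgent request 1 | 0.01 | 0.03 |
| Urgent request 2 | 0.03 | 0.05 |
| Urgent request 3 | 0.07 | 0.01 |
| Urgent request 4 | 0.05 | 0.07 |

Values within a row (old and new for the same urgent request): dependent

Rows (different urgent requests): independent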

Page 20: TechnoWeb Split Test in the context of validated learning


Selection of Test Method

Comparison of means:

Is the mean response significantly different in the new template compared to the old template?

• t-Test for paired samples

Premises:

- 2 paired samples $(x_i, y_i)$ with expectation values $\mu_1$ and $\mu_2$

- Differences $d_i = x_i - y_i$ normally distributed with expectation value $\mu_d$

Hypothesis: $H_0: \mu_d = 0$

• Wilcoxon test for paired samples

- 2 paired samples $(x_i, y_i)$ with expectation values $\mu_1$ and $\mu_2$

- Differences $d_i = x_i - y_i$ symmetrically distributed; this is fulfilled if $x_i$ and $y_i$ have the same distribution shape.

Hypothesis: $H_0: \mu_1 = \mu_2$
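For reference, the standard textbook test statistics behind these two options can be written as follows. The notation is not taken from the slides: $\bar{d}$ and $s_d$ denote the sample mean and standard deviation of the differences, and in the Wilcoxon case the ranks are assigned to the absolute values of the non-zero differences.

```latex
t = \frac{\bar{d}}{s_d / \sqrt{n}}, \qquad
\bar{d} = \frac{1}{n} \sum_{i=1}^{n} d_i, \qquad
s_d^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left(d_i - \bar{d}\right)^2

R^{+} = \sum_{d_i > 0} \operatorname{rank}\left(|d_i|\right), \qquad
R^{-} = \sum_{d_i < 0} \operatorname{rank}\left(|d_i|\right), \qquad
R = \min\left(R^{+}, R^{-}\right)
```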

Page 21: TechnoWeb Split Test in the context of validated learning


Check of premises

Before applying a hypothesis test, the differences (v1-v0 and c1-c0) have to be tested for normal distribution.

Using the Kolmogoroff-Smirnoff test, we obtain the following result:

$H_0$: Variable has a normal distribution.

| variable | p-value |
|---|---|
| v1-v0 | 0.04558 |
| c1-c0 | 0.002431 |

$\alpha = 5\%$ ⇒ no normal distribution in both cases (views, comments), since both p-values are below $\alpha$.

Therefore we have to use a test which does not require normal distribution ⇒ Wilcoxon signed-rank test.

v1: click-through ratio new

v0: click-through ratio old

c1: conversion rate new

c0: conversion rate old
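A rough sketch of such a normality check, using only the Python standard library, is shown below. It computes the Kolmogoroff-Smirnoff distance D between the empirical distribution and a normal distribution fitted to the sample, and compares it with the asymptotic 5% critical value 1.36/√n. This is an approximation rather than the procedure used in the talk: it produces no p-value, and because mean and standard deviation are estimated from the same sample, the comparison is conservative (a Lilliefors-corrected threshold would be smaller). Both samples are hypothetical.

```python
import math
from statistics import mean, stdev


def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """Cumulative distribution function of N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ks_statistic_normal(sample):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    xs = sorted(sample)
    n = len(xs)
    mu, sigma = mean(xs), stdev(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = normal_cdf(x, mu, sigma)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


def rejects_normality(sample, coeff: float = 1.36) -> bool:
    """Compare D with the asymptotic 5% critical value coeff / sqrt(n)."""
    return ks_statistic_normal(sample) > coeff / math.sqrt(len(sample))


# Hypothetical samples: one heavily skewed, one roughly symmetric.
skewed = [0.0] * 28 + [10.0, 10.0]
symmetric = [-2, -1, -1, 0, 0, 0, 1, 1, 2]
print(rejects_normality(skewed), rejects_normality(symmetric))
```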

Page 22: TechnoWeb Split Test in the context of validated learning


Check of premises

Symmetrical Distribution of differences v0-v1 and c0-c1:

Page 23: TechnoWeb Split Test in the context of validated learning


Hypothesis Test: Principle of the Wilcoxon signed-rank test

Wilcoxon signed-rank test (the paired-sample counterpart of the U-test):

Example for n=8

| UR | v0 | v1 | dv = v1-v0 | rank for dv>0 | rank for dv<0 |
|---|---|---|---|---|---|
| 1 | 0.02 | 0.02 | 0 | - | - |
| 2 | 0.01 | 0 | -0.01 | | 1.5 |
| 3 | 0.01 | 0.10 | 0.09 | 7 | |
| 4 | 0.06 | 0.13 | 0.07 | 6 | |
| 5 | 0.03 | 0.04 | 0.01 | 1.5 | |
| 6 | 0.11 | 0.15 | 0.04 | 5 | |
| 7 | 0.06 | 0.08 | 0.02 | 3 | |
| 8 | 0.03 | 0.06 | 0.03 | 4 | |
| Sum | | | | R+ = 26.5 | R- = 1.5 |

R = min(R+, R-) = 1.5

Critical value for n=7 (UR1 excluded), $\alpha = 5\%$: R_critical = 2

R < R_critical ⇒ $H_0: \mu_1 = \mu_2$ is rejected
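The n=8 example can be reproduced with a short, self-contained sketch. The helper name `signed_rank_sums` is hypothetical; the data and the critical value are the ones quoted on the slide. Differences are rounded before ranking so that floating-point noise does not break the tie between the two differences of magnitude 0.01.

```python
def signed_rank_sums(old, new, ndigits=9):
    """Return (R_plus, R_minus) for paired samples, dropping zero differences."""
    diffs = [round(n - o, ndigits) for o, n in zip(old, new)]
    diffs = [d for d in diffs if d != 0]           # zero differences are excluded
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        avg_rank = (i + 1 + j + 1) / 2             # tied values share the mean rank
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    r_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    r_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return r_plus, r_minus


# Values of the worked example (click-through ratios old and new).
v0 = [0.02, 0.01, 0.01, 0.06, 0.03, 0.11, 0.06, 0.03]
v1 = [0.02, 0.00, 0.10, 0.13, 0.04, 0.15, 0.08, 0.06]

r_plus, r_minus = signed_rank_sums(v0, v1)
r = min(r_plus, r_minus)
R_CRITICAL_N7 = 2   # table value quoted on the slide for n = 7, alpha = 5%
print(r_plus, r_minus, r < R_CRITICAL_N7)   # prints: 26.5 1.5 True
```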

Page 24: TechnoWeb Split Test in the context of validated learning


Test Results

Results of Wilcoxon signed-rank test:

Possible explanations why there are more views of the old template:

- Link to urgent request better visible.

- Users used to old template.

- Already enough information in e-mail ⇒ no need to view details.

- Subjective impression of full information in new template.

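| | H0 | p-value |
|---|---|---|
| Views | $\mu_{v0} = \mu_{v1}$ | 1.815e-06 * |
| Views | $\mu_{v0} > \mu_{v1}$ | 1 |
| Views | $\mu_{v0} < \mu_{v1}$ | 9.076e-07 * |
| Comments | $\mu_{c0} = \mu_{c1}$ | 0.4616 |
| Comments | $\mu_{c0} > \mu_{c1}$ | 0.7718 |
| Comments | $\mu_{c0} < \mu_{c1}$ | 0.2308 |

* p<0.05: significant, i.e. H0 is rejected (shown in red on the slide).

Test result:

- Views: $\mu_{v0} > \mu_{v1}$. More views using old template.

- Comments: $\mu_{c0} = \mu_{c1}$. No significant change in number of comments.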

Page 25: TechnoWeb Split Test in the context of validated learning


Plots: response for old versus new template

[Plot: Comments in old (black) versus new (red) template]

[Plot: Views in old (black) versus new (red) template]

Page 26: TechnoWeb Split Test in the context of validated learning


Variable for comparison of old and new template

Click-through ratio and conversion ratio: Problem of exclusion of zero values.

[Figure: Histogram of the click-through rate ratio ctr_new/ctr_old. x-axis: click-through rate ratio (bins 0.1 to 2); y-axis: count of Urgent Requests with corresponding ratio.]

[Figure: Histogram of the conversion rate ratio conv_new/conv_old. x-axis: conversion rate ratio (bins 0.1 to 2 and >2); y-axis: count of Urgent Requests with corresponding ratio.]

Page 27: TechnoWeb Split Test in the context of validated learning


Variable for comparison of old and new template

Using differences v1-v0 and c1-c0 instead of quotients v1/v0 and c1/c0 ⇒ zero values do not have to be excluded.
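A small illustration of why differences are more robust than quotients when a rate is zero. The pairs are hypothetical, and treating "old rate is zero" as the undefined case is an assumption about which zero values had to be excluded.

```python
def ratio_or_none(new_rate: float, old_rate: float):
    """Quotient new/old; undefined (None) when the old rate is zero."""
    return new_rate / old_rate if old_rate != 0 else None


def difference(new_rate: float, old_rate: float) -> float:
    """Difference new - old; defined for every pair, including zeros."""
    return new_rate - old_rate


# Hypothetical (old, new) rate pairs; two of them contain a zero.
pairs = [(0.00, 0.02), (0.03, 0.00), (0.04, 0.05)]

ratios = [ratio_or_none(new, old) for old, new in pairs]
diffs = [difference(new, old) for old, new in pairs]

print("ratios:     ", ratios)   # the first pair has no defined ratio
print("differences:", diffs)    # all three pairs remain usable
```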

Page 28: TechnoWeb Split Test in the context of validated learning


New First-Timers

• Considering only subgroup receiving the new template:

Is there a correlation between the number of “new first-timers” (NFT) and the number of

(a) views (V1)?

(b) comments (C1)?

(a) H0: r(V1,NFT) = 0

(b) H0: r(C1,NFT) = 0

The Kolmogoroff-Smirnoff test yields that the number of new first-timers NFT is not normally distributed (p = 7.936e-10) ⇒ use Spearman's or Kendall's correlation coefficient.
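Spearman's coefficient is the Pearson correlation of the ranks, which is why it does not require normally distributed data. A minimal standard-library sketch follows; the data are hypothetical, the helper names are illustrative, and no p-value is computed here.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical counts per urgent request: new first-timers and views.
nft = [5, 40, 12, 300, 7, 60]
views = [3, 20, 9, 45, 8, 15]
rho = spearman(nft, views)
print(f"Spearman rho = {rho:.4f}")
```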

Page 29: TechnoWeb Split Test in the context of validated learning


New First-Timers

• Considering only subgroup receiving the new template:

Is there a correlation between the number of “new first-timers” (NFT) and the number of

(a) views (V1)?

(b) comments (C1)?

Test results:

| case | variables | method | r | p-value |
|---|---|---|---|---|
| (a) | V1, NFT | Spearman | 0.3939 | 0.0017 |
| (a) | V1, NFT | Kendall | 0.2919 | 0.0015 |
| (b) | C1, NFT | Spearman | 0.3976 | 0.0015 |
| (b) | C1, NFT | Kendall | 0.3012 | 0.0019 |

H0 is rejected in every case (p<0.05) ⇒ significant positive correlation

Number of new first-timers related to number of views and comments: The more new first-timers, the more views and comments.

Page 30: TechnoWeb Split Test in the context of validated learning


Sample Characteristics: Improvement suggestion for novel split test

Proposition of sample selection for next split test:

• Existing TechnoWeb users are randomly split into two equally sized groups A and B.

• Every new TechnoWeb user is assigned group A or group B randomly with a probability of 50% for each group.

• Group A always receives the old, group B always receives the new template.

• With this approach, first-time views do not have to be investigated separately, because the two groups are clearly distinguished from the beginning.
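The proposed sample selection could look roughly like the following sketch. The function names, the seed, and the user identifiers are hypothetical; the only properties taken from the proposal are the equal-sized random split of existing users, the 50% coin flip for new users, and the fixed mapping of group A to the old and group B to the new template.

```python
import random


def assign_existing_users(user_ids, seed=2013):
    """Shuffle existing users and split them into two equally sized groups."""
    ids = sorted(user_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return {uid: ("A" if pos < half else "B") for pos, uid in enumerate(ids)}


def assign_new_user(assignment, user_id, rng):
    """Give a new user group A or B with probability 50% each; never reassign."""
    if user_id not in assignment:
        assignment[user_id] = "A" if rng.random() < 0.5 else "B"
    return assignment[user_id]


def template_for(assignment, user_id):
    """Group A always receives the old template, group B the new one."""
    return "old" if assignment[user_id] == "A" else "new"


# Hypothetical user base, for illustration only.
assignment = assign_existing_users([f"user{i}" for i in range(10)])
rng = random.Random(7)
first = assign_new_user(assignment, "newcomer", rng)
second = assign_new_user(assignment, "newcomer", rng)  # unchanged on repeat
print(first == second, template_for(assignment, "newcomer"))
```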

Page 31: TechnoWeb Split Test in the context of validated learning


Results and Recommendations

The following results were obtained:

• No significant change in number of comments in new versus old template.

• More views in old than in new template.

• The more users receiving the new template for the first time, the more views and comments.

• Statistically relevant number for comparison of old and new template: R=min(R+, R- ). Critical R varies according to sample size.

Suggestions:

• Use v1-v0 and c1-c0, respectively, instead of v1/v0 and c1/c0, in order not to exclude zero-answers.

• Sample selection: randomly choose 50% that always receive old template, 50% that always receive new template and stick to that selection.

Page 32: TechnoWeb Split Test in the context of validated learning


Agenda

Page 33: TechnoWeb Split Test in the context of validated learning


Learning: The new template has three usability problems

• The overall performance of the new template is inferior to the old template

• Identified cause:

• An important eye-catcher, the assignment of a monetary value to the problem, is less visible in the new template, resulting in a decreased click-through rate.

• Suspected causes: (to be validated in the next build-measure-learn cycle)

• The prominently placed call-to-action in the new template might be less inviting for users – most do not want to comment right away.

• The link “Show Urgent Request” is much less visible in the new template.

• Part of the reduced click-through rate in the new template could be due to the content being presented in an easily-readable way.

[Figure: screenshots of the new template and the old template]

Page 34: TechnoWeb Split Test in the context of validated learning


Moving key elements to the side decreased their effectiveness

Page 35: TechnoWeb Split Test in the context of validated learning


Conclusion

• Features that do not positively influence customer behavior should not be implemented.

• Initial negative results should not kill a project. Instead, iterative improvement will lead to a product that consumers will appreciate.

• Split-testing a new feature is worth the time and effort.

• The initial time investment in the first split is offset by knowledge gained on how to efficiently set up a split test.

• Even though the problem initially looked simple, regular statistical textbook knowledge was not sufficient for the statistical significance analysis.

• Consulting a professional statistician from the planning phase of the split test would have saved much time and effort, and would have allowed us to measure in a more focused way.

General

Specific