[@indeedeng] managing experiments and behavior dynamically with proctor

Proctor: Managing A/B Tests and More

Tom Bergman, Product Manager, Aggregation

Matt Schemmel, Software Engineer, Resume

We help people get jobs.

What's best for the job seeker?

Test & Measure EVERYTHING

A/B Testing: Definition

A/B testing is an experimental methodology comparing at least two variants, a control group A and test group B, in a controlled experiment

A/B Testing Key Points

Test and Control Groups should be:

1. Unbiased
2. Independent
3. Representative

103 tests, 315 variations

2^147 combinations

[Figure: a control group and six test groups at 10% each. Results relative to control: +2.9%, +2.3%, +2.0%, +5.2%, +12.8%, +9.6%. Highlighted result: +614M emails.]

Why A/B Testing?

Before and After

Before and After is bad science.

[Chart: Weekly Traffic, visitors by weekday (Mon, Tues, Wed, Thur, Fri)]

[Chart: Yearly Traffic, visitors over the year]

Mid Year Test: A < B

End of Year Test: A > B

Obligatory XKCD Comic

History of A/B Testing @Indeed

Next we tried ...

● Multiple Code Versions

● Separate Configuration

● "Sampling by Load Balancer"

Load Balancer: Multiple Versions

[Diagram: a load balancer splits traffic between CONTROL (Old Version Code) and TEST (New Version Code)]

It worked, but ...

1. Tedious
2. Expensive
3. Inflexible

Finally ...

Built libraries, hand-wrote code per test to:

1. Arbitrarily Group Users
2. Select Test Groups
3. Implement Variations

Custom Coded Tests

Allowed us:

1. Sophisticated Tests
2. Scientifically Valid Methods
3. Low Operational Overhead

Custom Coded Tests: Stats

Goals:

1. Increase Engineering Velocity
2. Standardize Representation
3. Work Seamlessly Across Products

Proctor: Indeed's Open Source Java Framework for Managing A/B Tests and More

https://github.com/indeedeng/proctor


Using Proctor

1. Background and Design

2. Running A/B Tests with Proctor

3. Beyond the Basics

Background and Design

Running a Test

1. Define the Experiment
2. Select Groups
3. Implement the Behavior
4. Log the Results

Existing Behavior

Save Alert

Test Behavior

Define the Experiment

Key characteristics:

1. Buckets
2. Sample Sizes

Control: Gray (50%), Test: Blue (50%)


Division of Responsibilities

● Test Definition (global)
● Apply the Experiment (each product): Proctor Library, Test Specification

Buckets Enumerate the Test Variations

● ID, for code
● Short Name, for people
● Long Description, for people

ID 0, Short Name: Gray, Long Description: "Control Group"
ID 1, Short Name: Blue, Long Description: "Test Group"

Sizing the Buckets

1. Buckets

2. Sample Sizes

Selecting a Test Bucket

Good science requires good sampling:

● Independent
● Unbiased

Good user experience does, too:

● Fast
● Consistent

Round robin assignment

Assign each subsequent visitor to the next bucket.

Independent ✓ | Unbiased ✓ | Consistent ~ | Fast ✘

● Requires global state for "next bucket"

● Requires state for assigned buckets

At small scale, you might need round-robin to ensure equal sample sizes.

At large scale, randomized assignment is uniform enough.
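To make the trade-off between the two strategies concrete, here is a minimal sketch that counts bucket sizes under each one. The class and method names are illustrative and are not part of Proctor.

```java
import java.util.Arrays;
import java.util.Random;

/** Compares round-robin and randomized assignment. All names here are illustrative. */
final class AssignmentDemo {

    /** Round robin: a shared counter decides which bucket the next visitor gets. */
    static int[] roundRobin(int visitors, int buckets) {
        int[] counts = new int[buckets];
        int next = 0; // the global "next bucket" state that round robin requires
        for (int i = 0; i < visitors; i++) {
            counts[next]++;
            next = (next + 1) % buckets;
        }
        return counts;
    }

    /** Randomized: no shared counter, but bucket sizes are only approximately equal. */
    static int[] randomized(int visitors, int buckets, long seed) {
        int[] counts = new int[buckets];
        Random random = new Random(seed);
        for (int i = 0; i < visitors; i++) {
            counts[random.nextInt(buckets)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        for (int visitors : new int[] {11, 100_000}) {
            System.out.println(visitors + " visitors, round robin: "
                    + Arrays.toString(roundRobin(visitors, 2)));
            System.out.println(visitors + " visitors, randomized:  "
                    + Arrays.toString(randomized(visitors, 2, 42L)));
        }
    }
}
```

Round robin keeps the buckets within one visitor of each other at any scale, at the cost of shared state. The randomized counts drift from an even split, but the relative drift shrinks as the number of visitors grows.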

Randomized Assignment

Independent ✓ | Unbiased ✓ | Consistent ? | Fast ?

Roll the dice as needed

Select a bucket at random at the point of execution.

Independent ✓ | Unbiased ✓ | Consistent ✘ | Fast ✓

Roll Once and Cache in a Cookie

● Single-domain, Single-device

● N cookies: Hard to evolve

● One cookie: Fragile to edit

● Size scales with # experiments


Independent ✓ | Unbiased ✓ | Consistent ~ | Fast ~

Roll Once and Cache in Session

● Consistent only to length of session

● Tied to one server / data-center

● Many apps don’t use sessions


Independent ✓ | Unbiased ✓ | Consistent ✘ | Fast ~

Roll Dice and Cache in DB


Independent ✓ | Unbiased ✓ | Consistent ~ | Fast ✘

● DB hit on every request

● More infrastructure

We can do better

Flaws stem from the need to record selected buckets.

What if we didn't?

Don’t Record. Recalculate.

1. Assign each user a unique ID
2. Map that ID to a bucket
3. Store the ID, not the assignments

Independent ? | Unbiased ? | Consistent ? | Fast ?

Simple Mapping: Mod N

id mod N => bucket

Doesn’t work:

● Should provide uniform distribution; mod N assumes it.
● Limited bucket distributions

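Range Mapping

id / MAX_ID => bucket

[Diagram: the range 0 to 1 split at 0.5 between test and control]

Buckets can be any size

[Diagram: the range 0 to 1 split into test, control, and (inactive) segments]

Sequential IDs No Longer Uniform

[Diagram: sequential IDs plotted against the test and control ranges, with MAX_ID / 2 marked]

Unbiased ✘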
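A small sketch of why unhashed range mapping is biased for sequential IDs. The value chosen for MAX_ID and the 50/50 layout with control first are assumptions made for illustration; the slides do not specify either.

```java
/** Range mapping without hashing: id / MAX_ID picks a position in [0, 1]. */
final class RangeMappingDemo {

    /** Hypothetical upper bound for IDs; the slides only call it MAX_ID. */
    static final long MAX_ID = Integer.MAX_VALUE;

    /** Assumed layout: control owns [0, 0.5) and test owns the rest of the range. */
    static String bucket(long id) {
        double position = (double) id / (double) MAX_ID;
        return position < 0.5 ? "control" : "test";
    }

    /** Counts how many of the sequential IDs 1..n land in the test bucket. */
    static long countInTest(long n) {
        long inTest = 0;
        for (long id = 1; id <= n; id++) {
            if (bucket(id).equals("test")) {
                inTest++;
            }
        }
        return inTest;
    }

    public static void main(String[] args) {
        // The first million sequential IDs all sit far below MAX_ID / 2.
        System.out.println("IDs in test: " + countInTest(1_000_000) + " of 1000000");
    }
}
```

Until the ID sequence passes MAX_ID / 2, every user falls into the same bucket, which is the bias the slide marks with ✘.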

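Hashed Range Mapping

hash ( id ) => bucket

Kept:

● Arbitrary bucket allocations ok

[Diagram: the hashed range from MIN_INT to MAX_INT, split at 0 between test and control]

Unbiased Distribution for Any ID

[Charts: bucket distributions for 50 / 50 and 33 / 33 / 33 allocations]

Unbiased ✓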

But is it independent?

[Figure: two tests on one button, "Sign Up" vs "Activate" and "Sign Up" vs "Sign Up"]

Should look like this

[Figure: four button variants, two labeled "Sign Up" and two labeled "Activate", at 25% each]

But our inputs are consistent

hash ( id ) => bucket

[Diagram: the hashed range from MIN_INT to MAX_INT, split at 0 between test and control]

So our buckets are identical

[Figure: per-user buckets (S or A) for the Color test and the Text test]

And we look like this

[Figure: the same four button variants at 50%, 0%, 0%, 50%]

Independent ✘

Add Salt to Test

hash ( id + test.salt ) => bucket

Kept:

● Arbitrary bucket allocations

● Uniform distribution
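The salted mapping can be sketched as follows. This is an illustration under stated assumptions rather than Proctor's implementation: MD5 as the hash, the first four digest bytes as the position, and the salt "buttonTextTst" for a second test are all choices made here for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Sketch of hash(id + test.salt) => bucket. The hash choice is an assumption. */
final class SaltedHashDemo {

    /** Hashes id + salt and maps the result onto [0, 1). */
    static double position(String id, String salt) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest((id + salt).getBytes(StandardCharsets.UTF_8));
            long value = 0; // first four bytes as an unsigned 32-bit integer
            for (int i = 0; i < 4; i++) {
                value = (value << 8) | (digest[i] & 0xFF);
            }
            return value / 4294967296.0; // divide by 2^32
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 should be available on every JVM", e);
        }
    }

    /** Returns the index of the range containing the position; lengths should total 1.0. */
    static int bucket(String id, String salt, double[] lengths) {
        double position = position(id, salt);
        double upper = 0.0;
        for (int i = 0; i < lengths.length; i++) {
            upper += lengths[i];
            if (position < upper) {
                return i;
            }
        }
        return lengths.length - 1; // guards against floating-point rounding
    }

    /** Counts users who get the same bucket index in two 50/50 tests. */
    static int agreement(String saltA, String saltB, int users) {
        double[] half = {0.5, 0.5};
        int same = 0;
        for (int i = 0; i < users; i++) {
            String id = "UID#" + i;
            if (bucket(id, saltA, half) == bucket(id, saltB, half)) {
                same++;
            }
        }
        return same;
    }

    public static void main(String[] args) {
        System.out.println("no salt:         " + agreement("", "", 1000) + " of 1000 agree");
        System.out.println("different salts: "
                + agreement("buttonBgColorTst", "buttonTextTst", 1000) + " of 1000 agree");
    }
}
```

Without a salt, both tests see the same hash input, so every user agrees. With distinct salts, agreement should fall to roughly half, which is what two independent 50/50 tests would produce.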

Uncorrelated Distribution

[Figure: per-user buckets (A or S) for the Color test and the Text test]

Independent ✓

But is it fast?

[Chart: Resume Editor and Resume Search, y-axis from 0.60 to 0.90 (units not shown)]

But is it consistent?

Consistency bounded only by ID

We Usually Use Tracking Cookies

● Easy

● Ubiquitous on the web

● Require no server-side storage

● Best we can do with no user action

Consistent ~

Best we’ve seen so far…

Independent ✓ | Unbiased ✓ | Consistent ~ | Fast ✓

Definitions Map Buckets to ID Range

Bucket Range

gray 0.50

blue 0.50

Each bucket maps to a % of the hashed range

Sometimes, Though, Cookies Won't Do

● Cross-Domain
● Cross-Device
● Some People Block Cookies
● Cookies are Web-Only

Many Ways to ID a User

Account: #12345

Tracking cookie: UID#1

Email: [email protected]

Access Token: 4/rymOMYE…

Session ID: 557206C363F…

Proctor Uses Any Set of IDs

ID Type... Tracked By...

USER Tracking Cookie

ACCOUNT Account ID

EMAIL Email Address

… …

We use…

Account ID

● Authenticated

● Consistent across domains

● Consistent across devices

● Consistent across visits

Email Address

● Sometimes available without account

● Identified, though not authenticated

Each Test Applies to One ID Type

● Test groups split by that identifier

● Visitors without that identifier are ignored

Running A/B Tests

Test Definitions Encoded in JSON

● Compact

● Simple and Flexible

● Editable by Humans

● Editable by Machines

Basic Data in the Test Definition

"description": "Button colors",
"salt": "buttonBgColorTst",
"type": "USER"

Buckets in the Test Definition

"buckets": [{
    "id": 0,
    "name": "gray",
    "description": "Control group"
}, {
    "id": 1,
    "name": "blue",
    "description": "Test group"
}]

Mapping Buckets to Ranges

"ranges": [{
    "bucketValue": 0,
    "length": 0.5
}, {
    "bucketValue": 1,
    "length": 0.5
}]

Complete Test Definition

{
    "description": "Button colors",
    "type": "USER",
    "salt": "buttonBgColorTst",
    "buckets": […],
    "allocations": [{
        "ranges": […]
    }]
}


proctor data

Division of Responsibilities

Test Definition


(each product)

Proctor Library

Test Specification

Proctor includes several modules

Proctor

Common

Ant Builder

Codegen

Maven Builder

Product Test Specification lists active tests

References into the global pool:

"tests": [{
    "buttonBgColorTst": {
        "buckets": {
            "gray": 0, "blue": 1
        }
    }
}]


On every request…

1. Select Groups
2. Render the Response
3. Log the Action

On every request…

1. Collect identifiers

2. Select buckets for opted-in tests

Determining Buckets in Code

Collect identifiers for all ID Types

// Product code
String cookie = getTrackingCookie(request);
String accountId = getAccountIdOrNull(request);

// Proctor preparation
Identifiers identifiers = Identifiers.of(
    TestType.USER, cookie,
    TestType.ACCOUNT, accountId
);

Select Buckets for Opted-In Tests

// Proctor assignments
ProctorResult assignments =
    proctor.determineBuckets(identifiers);


On every request…

1. Select Groups

2. Render the Response
3. Log the Action

Choose behavior for selected bucket

int bgColorBucket;
/* … */

// Choose a background color for templates
if (bgColorBucket == 1) {
    // Test
    model.put("buttonBgColor", "#00f");
} else {
    // Control group
    model.put("buttonBgColor", "#ccc");
}

ProctorResult exposes buckets … verbosely

ProctorResult assignments =
    proctor.determineBuckets(identifiers);

// Get selected bucket for this user
int bgColorBucket = assignments
    // Map<String, TestBucket>: All tests
    .getBuckets()
    // TestBucket: This assignment
    .get("buttonBgColorTst")
    // int: Enumerated ID
    .getValue();

"Redundant" names in test spec…

"buttonBgColorTst": {
    "buckets": {
        "gray": 0, "blue": 1
    }
}

… are used to generate helper methods

// Wrap the assignments in the generated helper
ResumeSearchGroups groups =
    new ResumeSearchGroups(assignments);

// Boolean accessors for each test & bucket
groups.isButtonBgColorTstGray();
groups.isButtonBgColorTstBlue();

// Enumerated value by test name
groups.getButtonBgColorTstValue();

Helper designed for use in UI layer

This immutable bean is trivial to:

● Read from JSP/JSF
● Read from Templates

○ Freemarker, Velocity, Closure, etc

● Serialize as JSON


On every request…

1. Select Groups

2. Render the Response

3. Log the Action

Logging Bucket Assignments

Proctor just selects the buckets.

When and how you log are up to you:

● On related events only
● On every event

Publication

Test Definitions in Source Control

● No new infrastructure
● Lots of desirable features for free: History, Diff, Access Control

[Diagram: Test Definitions → Publish → Artifact (Proctor Data) → Periodic Refresh → App on the App Servers]

Publication is also via Source Control

Individual test changes pushed to a named branch:

/trunk → /branches/production

Overwriting Tests on a Named Branch

Not required to use proctor, but beneficial:

● Same features for free: History, Diff, ACL
● No merging
● Easy roll-back, roll-forward

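[Diagram: on the Build Servers, the Project and its Test Specifications compile into a Deliverable that is deployed as the App on the App Servers. Separately, Test Definitions are published as a Proctor Data artifact that the App picks up by periodic refresh.]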

Beyond the Basics

Test Segmentation

Segmentation

Tests often apply to only certain users:

● Specific markets

● Specific languages

● Specific devices

Segmentation through Test Rules

● Test definition allows one optional rule

● A rule is simply a boolean expression

● If the rule passes, the user is assigned to a test bucket

Rules are written in Unified EL

Simple Things are Simple

{
    "description": "Button colors",
    "rule": "country == 'CA'",
    "buckets": […]
}

● No deployment needed

● Changes live within minutes

Primitive and rich data types

"userAgent.phone || userAgent.tablet"

"userAgent.supports.html5"

"userAgent.supports.geolocation"

"userAgent.supports.fileUpload"

Commons EL is Easily Extended

JSTL Standard Functions

"rule": "fn:endsWith(account.email, '@indeed.com')"

Custom code

"rule": "proctor:contains(['US', 'CA'], country)"

Arbitrary Complexity

Sometimes rules are unavoidably complex:

"Android v2.1+":
userAgent.android && (
    userAgent.OS.majorVersion gt 2
    || (
        userAgent.OS.majorVersion == 2
        && userAgent.OS.minorVersion ge 1
    )
)
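For readers less familiar with EL, the same "Android v2.1+" condition can be written as a plain Java predicate. The UserAgent class below is a hypothetical stand-in for the context object, with flattened field names; it is not the class the rule runs against.

```java
/** The "Android v2.1+" rule expressed as plain Java for comparison. */
final class AndroidRuleDemo {

    /** Hypothetical stand-in for the userAgent context variable. */
    static final class UserAgent {
        final boolean android;
        final int osMajorVersion;
        final int osMinorVersion;

        UserAgent(boolean android, int osMajorVersion, int osMinorVersion) {
            this.android = android;
            this.osMajorVersion = osMajorVersion;
            this.osMinorVersion = osMinorVersion;
        }
    }

    /** True for Android devices on version 2.1 or later. */
    static boolean isAndroid21OrLater(UserAgent userAgent) {
        return userAgent.android
                && (userAgent.osMajorVersion > 2
                    || (userAgent.osMajorVersion == 2 && userAgent.osMinorVersion >= 1));
    }

    public static void main(String[] args) {
        System.out.println(isAndroid21OrLater(new UserAgent(true, 2, 1)));  // true
        System.out.println(isAndroid21OrLater(new UserAgent(true, 2, 0)));  // false
        System.out.println(isAndroid21OrLater(new UserAgent(false, 7, 0))); // false
    }
}
```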

What context is available?

So far we've seen:

● country
● language
● userAgent
● account

What's the full list of available context variables?

Context Defined in Test Specification

● Test spec declares available context variables
● This is a contract to provide values at runtime

{
    "tests": […],
    "providedContext": {
        "country": "String",
        "language": "String",
        "userAgent": "com.indeed.web.UserAgent"
    }
}


Provided While Determining Buckets

Also generated from test specification:

private ResumeSearchProctor proctor;

proctor.determineBuckets(identifiers, country, language, userAgent);

Payloads

Even Tiny Changes Need Deploys

// Choose a background color for templates
if (bgColorBucket == 1) {
    // Test
    model.put("btnBgcolor", "#00f");
} else {
    // Control group
    model.put("btnBgcolor", "#ccc");
}

Some Tests Just Vary Data

Many tests have no behavioral change:

● CSS Colors
● Display Text
● Algorithm Weights

Payloads

● Values added for each bucket in a test

● Proctor verifies payloads are "all or none"


Example payloads: "#ccc" (gray), "#00f" (blue)

Part of Test Definition

"buckets": [{
    "id": 0,
    "name": "gray",
    "description": "Control group",
    "payload": {
        "stringValue": "#ccc"
    }
}, …]

● No deployment needed

● Changes live within minutes

Declared in Project Test Specification

● Type definition only
● Must match test definition

"buttonBgColorTst": {
    "buckets": […],
    "payload": {
        "type": "stringValue"
    }
}

Cleaner Code, Only Data Deploy

// Choose a background color
model.put("btnBgcolor",
    groups.getButtonBgColorTstPayload()
);

Cross-Product Tests

Cross-Product Tests

Many flavors of cross-product test, including

● Peer webapps

● Client / Service

● Mobile Native / Web

Cross-Product Tests

Even more ways to coordinate tests

● Tracking parameters on links, requests
● Service response metadata
● Different service calls

Proctor offers an interesting alternative

Two products can share test groups

As long as both products

● Share the test’s identifier
● Provide the context variables it uses

Deterministic selection guarantees identical bucket assignment.

Evolving Tests

[Diagram: the hashed range split into test, control, and (inactive) segments, before and after a 10% change to control; part of the range is marked "OOPS!"]

Changed allocations, not ID mapping

● Inconsistent experience
● Polluted results

Evolving Tests Smoothly

[ 10%, 10%]


[ 10%, 10%, 80% ]


[ 10%, 10%, 80% ]

[ 10%, 10%, 40%, 40%]


test, control, test, control


[ 10%, 80%, 10% ]

[ 50%, 50% ]


control test
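Because each ID always hashes to the same position, whether a change is smooth depends only on how the ranges are laid out. The sketch below checks two hypothetical changes to a 10% / 10% / 80% layout. The bucket names and the order of the ranges are assumptions, since the slide text does not label them.

```java
/** Shows why ranges should grow in place rather than being laid out afresh. */
final class EvolvingRangesDemo {

    /** Returns the name of the range containing position, a value in [0, 1). */
    static String bucketAt(double position, String[] names, double[] lengths) {
        double upper = 0.0;
        for (int i = 0; i < lengths.length; i++) {
            upper += lengths[i];
            if (position < upper) {
                return names[i];
            }
        }
        return names[names.length - 1]; // guards against floating-point rounding
    }

    /** True if every position already in a live bucket keeps that bucket. */
    static boolean keepsAssignments(String[] oldNames, double[] oldLengths,
                                    String[] newNames, double[] newLengths) {
        int samples = 10_000;
        for (int i = 0; i < samples; i++) {
            double position = (i + 0.5) / samples; // midpoints avoid range boundaries
            String before = bucketAt(position, oldNames, oldLengths);
            String after = bucketAt(position, newNames, newLengths);
            if (!before.equals("inactive") && !before.equals(after)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] oldNames = {"control", "test", "inactive"};
        double[] oldLengths = {0.1, 0.1, 0.8};

        // Smooth: only the inactive range is converted into new ranges.
        boolean smooth = keepsAssignments(oldNames, oldLengths,
                new String[] {"control", "test", "test", "control"},
                new double[] {0.1, 0.1, 0.4, 0.4});

        // Turbulent: a fresh 50/50 layout moves the old test range into control.
        boolean turbulent = keepsAssignments(oldNames, oldLengths,
                new String[] {"control", "test"},
                new double[] {0.5, 0.5});

        System.out.println("append ranges: assignments kept = " + smooth);    // true
        System.out.println("fresh layout:  assignments kept = " + turbulent); // false
    }
}
```

Both new layouts end at 50% control and 50% test. Only the first one leaves users who were already in the experiment where they were.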

Evolving Tests… Turbulently

hash ( uid + test.salt ) => bucket

[Diagram: for any ID, the test and control ranges, and where the IDs in the test range land after a re-salt]

Contextual Sampling

Contextual Allocation

[Diagram: test, control, and (inactive) ranges sized per context: 10% (US), 20% (CA), 50% (Rest of World)]

Allocations

Each test definition
● has one or more allocations

Each allocation
● has a rule and ranges totaling 1.0
● except the last, which has no rule.

Allocation Rules

● Use Unified EL, same as test rules.

● Use the same context variables as test rules.

● Choose the first matching allocation.

Allocations in the Test Definition

{
    "description": "Button colors",
    "type": "USER",
    "salt": "buttonBgColorTst",
    "buckets": [ … ],
    "allocations": [{
        "rule": "country == 'US'",
        "ranges": [ … ]
    }, {
        "ranges": [ … ]
    }]
}
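The "first matching allocation" behavior can be sketched in a few lines. This is not Proctor's rule engine: rules are modeled as Java predicates over a single country value, and the per-country test sizes (10% US, 20% CA, 50% elsewhere) are read loosely from the Contextual Allocation slide.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Picks the first allocation whose rule matches; the last one has no rule. */
final class AllocationDemo {

    /** Hypothetical allocation: an optional rule plus the size of the test range. */
    static final class Allocation {
        final Predicate<String> rule; // null stands for "no rule"
        final double testSize;

        Allocation(Predicate<String> rule, double testSize) {
            this.rule = rule;
            this.testSize = testSize;
        }
    }

    /** Illustrative allocations, checked in order. */
    static final List<Allocation> ALLOCATIONS = Arrays.asList(
            new Allocation("US"::equals, 0.10),
            new Allocation("CA"::equals, 0.20),
            new Allocation(null, 0.50)); // rest of world

    /** Returns the first allocation whose rule is absent or matches the country. */
    static Allocation select(List<Allocation> allocations, String country) {
        for (Allocation allocation : allocations) {
            if (allocation.rule == null || allocation.rule.test(country)) {
                return allocation;
            }
        }
        throw new IllegalStateException("the last allocation must have no rule");
    }

    public static void main(String[] args) {
        for (String country : new String[] {"US", "CA", "DE"}) {
            System.out.println(country + " -> test size "
                    + select(ALLOCATIONS, country).testSize);
        }
    }
}
```

Because the final allocation has no rule, every visitor matches something. Order matters when rules overlap, since the first match wins.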

Pre-Production

Environments: Local → (commit) → Integration → (push) → QA → (push) → Production

Show test matrix

/private/showTestMatrix

Show test bucket assignments

/private/showGroups

Privileged users can force assignments

?prforceGroups=buttonColorTst1

Beyond A/B Testing: Proctor Patterns for Managing Behavior

Kill Switch

When
● New Feature

How
● 'Active' bucket @ 100%

Phased Rollout

When
● Experimental Feature

How
● 'Active' bucket @ 0%
● 'Active' → 1% → 5% → 100%

Throttle

When
● Downsampling
    ○ trace logging
    ○ survey

How
● 'Active' bucket @ 0%
● 'Active' → 1% → 10% → 5% → ??

Feature Toggles

When
● Localized Behavior
● Device-Specific Behavior
● Logged-in, w/ Resume, etc.

How
● Multiple Allocations
● Targeted Rules

Dark Deploys

When
● Partial Implementations
● Additional QA is needed

How
● 'Active' bucket @ 0%
● 'Active' → 100%

Cross-Product Coordination

When
● Dependencies between products
    ○ Resume Wizard feature

How
● 'Active' bucket at 0%
● Resume Wizard allocation: → 100%
● Home page promo allocation: → 100%

[Chart: Pre-Proctor Tests vs Post-Proctor Tests; values shown: 42, 103]

[Chart: Post-Proctor Tests + Toggles; values shown: 42, 10, 103, 65]

Proctor Webapp: A/B Test Change Management

(Coming Soon to github)

Proctor Webapp

Building On Proctor (Not Open Source)

Description:
Group 0: control - Job alert label: Save Alert (control)
Group 1: labelSubscribe - Job alert label: Subscribe
Group 2: labelSignUp - Job alert label: Sign up
Group 3: labelGetJobs - Job alert label: Get jobs
Group 4: labelSendMeNewJobs - Job alert label: Send me new jobs
Group 5: labelActivate - Job alert label: Activate
Group 6: labelSave - Job alert label: Save

History:
jack @ 2013-03-12 (r203267): Promoting jasxjabtnlbltst (trunk r203089) to production JASX-11365: jasxjabtnlbltst disabled
ketan @ 2012-12-11 (r190675): merged r190418: JASX-10663: Stop jasxjabtnlbltst in all languages except nl
will @ 2012-11-29 (r188801): merged r187452: JASX-10457: exclude US from jasxjabtnlbltst
ketan @ 2012-10-25 (r182881): merged r182688: JASX-10234 - Adding new languages to job alert button label test
ketan @ 2012-10-25 (r182876): merged r181938: JASX-10234 - Adding test definition and allocations for job alert button label test

DEMO: Get out your Phones and Tablets

http://go.indeed.com/demo

Simple: test different background colors

[Chart labels: 25%, 25%, 25%, 25%, 50%]

Let’s increase our bucket size...

[Chart labels: 50%, 50%]

We have a winner!

[Chart labels: 50%, 100%]

Let’s do something wacky!

Android >= 4, iOS >= 7

[Chart labels: iOS, Android]


Also a reference implementation

Running on heroku -- feel free to clone!

http://indeedeng-hello-proctor.herokuapp.com

Source: https://github.com/indeedeng/proctor-demo

Q&A

Source: https://github.com/indeedeng/proctor

Docs: http://indeedeng.github.io/proctor

Next @IndeedEng Talk

Boxcar: Self-balancing distributed services

Wednesday, October 30

R.B. Boyer
