[@indeedeng] managing experiments and behavior dynamically with proctor
DESCRIPTION
Video available at: http://youtu.be/Q1T5J0KXUwY
At this very moment, Indeed is running more than one hundred A/B experiments. In previous @IndeedEng talks, we have discussed how we use A/B testing to develop better products. In this tech talk, software engineer Matt Schemmel and product manager Tom Bergman describe Proctor, the system we developed to define and manage all of these experiments. They explain how we use Proctor to target users using data-driven rules, adjust experiments on-the-fly, and ensure clean results for multi-variate tests. Over time, Proctor has evolved from a system designed for managing experiments to one that manages overall system behavior through dynamic "feature toggle" functionality. Matt and Tom also share lessons we have learned from years of experimenting at web scale. Matt Schemmel is a Senior Software Engineer working primarily on our Resume products. Tom Bergman is a Product Manager currently working on our Aggregation systems. He previously helped evolve many of Indeed's data analysis tools, and also helped us launch and grow our sites in Japan, Korea, and China.
TRANSCRIPT
Proctor: Managing A/B Tests and More
Tom Bergman, Product Manager, Aggregation
Matt Schemmel, Software Engineer, Resume
We help people get jobs.
What's best for the job seeker?
Test & Measure EVERYTHING
A/B Testing: Definition
A/B testing is an experimental methodology comparing at least two variants, a control group A and test group B, in a controlled experiment
A/B Testing Key Points
Test and Control Groups should be:
1. Unbiased
2. Independent
3. Representative
103 tests, 315 variations
2^147 combinations
[Chart: Control and six 10% test groups. Test group results: +2.9%, +2.3%, +2.0%, +5.2%, +12.8%, +9.6%. Callout: +614M emails]
Why A/B Testing?
Before and After
Before and After is bad science.
[Chart: Weekly Traffic. Visitors by day, Mon through Fri]
[Chart: Yearly Traffic. Visitors for A and B over the year: a Mid Year Test shows A < B, an End of Year Test shows A > B]
Obligatory XKCD Comic
History of A/B Testing @Indeed
Next we tried ...
● Multiple Code Versions
● Separate Configuration
● "Sampling by Load Balancer"
Load Balancer: Multiple Versions
[Diagram: a Load Balancer splits traffic between CONTROL (Old Version Code) and TEST (New Version Code)]
It worked, but ...
1. Tedious
2. Expensive
3. Inflexible
Finally ...
Built libraries, hand-wrote code per test to:
1. Arbitrarily Group Users
2. Select Test Groups
3. Implement Variations
Custom Coded Tests
Allowed us:
1. Sophisticated Tests
2. Scientifically Valid Methods
3. Low Operational Overhead
Custom Coded Tests: Stats
Goals:
1. Increase Engineering Velocity
2. Standardize Representation
3. Work Seamlessly Across Products
Proctor: Indeed's Open Source Java Framework for Managing A/B Tests and More
github.com/indeedeng/proctor
Using Proctor
1. Background and Design
2. Running A/B Tests with Proctor
3. Beyond the Basics
Background and Design
Running a Test
1. Define the Experiment
2. Select Groups
3. Implement the Behavior
4. Log the Results
Existing Behavior
Save Alert
Test Behavior
Define the Experiment
Key characteristics:
1. Buckets
2. Sample Sizes
Control: Gray (50%)
Test: Blue (50%)
Division of Responsibilities
Define the Experiment (global): Test Definition
Apply the Experiment (each product): Proctor Library, Test Specification
Buckets Enumerate the Test Variations
● ID, for code
● Long Description, for people
● Short Name, for people
ID  Long Description  Short Name
0   "Control Group"   Gray
1   "Test Group"      Blue
Sizing the Buckets
1. Buckets
2. Sample Sizes
Selecting a Test Bucket
Good science requires good sampling:
● Independent
● Unbiased
Good user experience does, too:
● Fast
● Consistent
Round robin assignment
Assign each subsequent visitor to the next bucket.
Fast, Unbiased, Consistent, Independent
✓ ✓ ~ ✘
● Requires global state for "next bucket"
● Requires state for assigned buckets
At small scale, you might need round-robin to ensure equal sample sizes.
At large scale, randomized assignment is uniform enough.
Randomized Assignment
Fast, Unbiased, Consistent, Independent
✓ ✓ ? ?
Roll the dice as needed
Select a bucket at random at the point of execution.
Fast, Unbiased, Consistent, Independent
✓ ✓ ✘ ✓
Roll Once and Cache in a Cookie
● Single-domain, Single-device
● N cookies: Hard to evolve
● One cookie: Fragile to edit
● Size scales with # experiments
Fast, Unbiased, Consistent, Independent
✓ ✓ ~ ~
Roll Once and Cache in Session
● Consistent only to length of session
● Tied to one server / data-center
● Many apps don’t use sessions
Fast, Unbiased, Consistent, Independent
✓ ✓ ✘ ~
Roll Dice and Cache in DB
Fast, Unbiased, Consistent, Independent
✓ ✓ ~ ✘
● DB hit on every request
● More infrastructure
We can do better
Flaws stem from the need to record selected buckets.
What if we didn't?
1. Assign each user a unique ID
2. Map that ID to a bucket
3. Store the ID, not the assignments
Don’t Record. Recalculate.
Fast, Unbiased, Consistent, Independent
? ? ? ?
Simple Mapping: Mod N
id mod N => bucket
Doesn't work:
● Should provide uniform distribution; mod N assumes it.
● Limited bucket distributions
Range Mapping
id / MAX_ID => bucket
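[Diagram: the range from 0 to 1, split at 0.5 into test and control]
Buckets can be any size
[Diagram: test, control, and (inactive) ranges from 0 to 1]
Sequential IDs No Longer Uniform
[Diagram: test and control ranges with a marker at MAX_ID / 2]
Unbiased ✘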
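A minimal Java sketch of why plain range mapping is biased when IDs are handed out sequentially. The value of MAX_ID, the 50/50 split, and all class and method names are assumptions made for illustration; none of this is Proctor code.

```java
/** Hypothetical sketch of range mapping (id / MAX_ID => bucket); not Proctor code. */
final class RangeMappingSketch {
    // Assumed size of the ID space, chosen only for illustration.
    static final long MAX_ID = 1_000_000L;

    /** Returns 0 for the first half of the ID space and 1 for the second half. */
    static int bucket(long id) {
        double position = (double) id / MAX_ID; // in [0, 1) for 0 <= id < MAX_ID
        return position < 0.5 ? 0 : 1;
    }

    /** Counts bucket assignments for the first `issued` sequential IDs. */
    static int[] counts(long issued) {
        int[] counts = new int[2];
        for (long id = 0; id < issued; id++) {
            counts[bucket(id)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // With only 40% of the ID space issued so far, the second bucket stays empty.
        int[] c = counts(400_000L);
        System.out.println("bucket 0: " + c[0] + ", bucket 1: " + c[1]);
    }
}
```

Until the issued IDs cover the whole ID space, the buckets at the high end of the range receive no users, which is the bias the slide marks with ✘.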
Hashed Range Mapping
hash ( id ) => bucket
Kept:
● Arbitrary bucket allocations ok
[Diagram: test and control ranges from MIN_INT through 0 to MAX_INT]
Unbiased Distribution for Any ID
[Charts: bucket distributions for 50 / 50 and 33 / 33 / 33 allocations]
Unbiased ✓
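Hashed range mapping can be sketched in Java roughly as follows. This is an illustration under stated assumptions, not Proctor's implementation: the talk does not name the hash function, so MD5 is used here only because every Java platform ships it, and the hash is scaled to [0, 1) rather than to the MIN_INT..MAX_INT range shown on the slide. All class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical sketch of hash(id) => bucket; not Proctor code. */
final class HashedRangeSketch {
    /** Hashes the identifier and scales the result to a position in [0, 1). */
    static double position(String id) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(id.getBytes(StandardCharsets.UTF_8));
            // Treat the first four digest bytes as an unsigned 32-bit integer.
            long v = ((d[0] & 0xFFL) << 24) | ((d[1] & 0xFFL) << 16)
                    | ((d[2] & 0xFFL) << 8) | (d[3] & 0xFFL);
            return v / 4294967296.0; // divide by 2^32
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns the index of the range containing the ID's position, or -1 if none does. */
    static int bucketFor(String id, double[] lengths) {
        double p = position(id);
        double upper = 0.0;
        for (int i = 0; i < lengths.length; i++) {
            upper += lengths[i];
            if (p < upper) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        double[] lengths = {0.5, 0.5};
        int[] counts = new int[2];
        // Sequential IDs, which broke plain range mapping, now spread across both ranges.
        for (int i = 0; i < 100_000; i++) {
            counts[bucketFor("user" + i, lengths)]++;
        }
        System.out.println("range 0: " + counts[0] + ", range 1: " + counts[1]);
    }
}
```

Because the position depends only on the ID, the same ID always maps to the same range, and the range lengths can be any sizes that sum to 1.0.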
But is it independent?
Sign Up vs. Activate
Sign Up vs. Sign Up
Should look like this
[2x2 grid, rows and columns both labeled Sign Up and Activate: 25% in every cell]
But our inputs are consistent
hash ( id ) => bucket
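[Diagram: test and control ranges from MIN_INT through 0 to MAX_INT]
So our buckets are identical
S A S S A S A A A S A S A
Color
S S S S S S A A A A A A A
Text
And we look like this
[2x2 grid, rows and columns both labeled Sign Up and Activate: 50% and 0% in the top row, 0% and 50% in the bottom row]
Independent ✘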
Add Salt to Test
hash ( id + test.salt ) => bucket
Kept:
● Arbitrary bucket allocations
● Uniform distribution
Uncorrelated Distribution
A S A S A S S A A A S S A
Color
A S A S A S S A A A S S A
Text
Independent ✓
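The effect of the salt can be sketched in Java by cross-tabulating two 50/50 tests over the same users. This is a hedged illustration, not Proctor code: MD5, the [0, 1) scaling, the salt strings, and all names are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

/** Hypothetical sketch of hash(id + test.salt) => bucket; not Proctor code. */
final class SaltedHashSketch {
    /** Hashes id + salt and scales the result to a position in [0, 1). */
    static double position(String id, String salt) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest((id + salt).getBytes(StandardCharsets.UTF_8));
            long v = ((d[0] & 0xFFL) << 24) | ((d[1] & 0xFFL) << 16)
                    | ((d[2] & 0xFFL) << 8) | (d[3] & 0xFFL);
            return v / 4294967296.0; // divide by 2^32
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** 50/50 split: bucket 0 for the lower half of the range, bucket 1 for the upper half. */
    static int bucket(String id, String salt) {
        return position(id, salt) < 0.5 ? 0 : 1;
    }

    /** Cross-tabulates bucket assignments for two 50/50 tests over the same users. */
    static int[][] crossTab(String saltA, String saltB, int users) {
        int[][] table = new int[2][2];
        for (int i = 0; i < users; i++) {
            String id = "user" + i;
            table[bucket(id, saltA)][bucket(id, saltB)]++;
        }
        return table;
    }

    public static void main(String[] args) {
        // Same salt for both tests: every user lands on the diagonal of the table.
        System.out.println(Arrays.deepToString(crossTab("shared", "shared", 40_000)));
        // Different salts: users spread across all four cells.
        System.out.println(Arrays.deepToString(crossTab("colorTst", "textTst", 40_000)));
    }
}
```

With a shared salt the two tests are perfectly correlated, so neither effect can be separated from the other; with distinct salts each cell holds roughly a quarter of the users.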
But is it fast?
[Chart: results for Resume Editor and Resume Search; axis from 0.60 to 0.90]
But is it consistent?
Consistency bounded only by ID
We Usually Use Tracking Cookies
● Easy
● Ubiquitous on the web
● Require no server-side storage
● Best we can do with no user action
Consistent ~
Fast, Unbiased, Consistent, Independent
✓ ✓ ~ ✓
Best we’ve seen so far…
Definitions Map Buckets to ID Range
Bucket Range
gray 0.50
blue 0.50
Each bucket maps to a % of the hashed range
Sometimes, Though, Cookies Won't Do
● Cross-Domain
● Some People Block Cookies
● Cross-Device
● Cookies are Web-Only
Many Ways to ID a User
Account #12345
Tracking cookie: UID#1
Email [email protected]
Access Token: 4/rymOMYE…
Session ID: 557206C363F…
Proctor Uses Any Set of IDs
ID Type... Tracked By...
USER Tracking Cookie
ACCOUNT Account ID
EMAIL Email Address
… …
We use…
Account ID
● Authenticated
● Consistent across domains
● Consistent across devices
● Consistent across visits
Email Address
● Sometimes available without account
● Identified, though not authenticated
Each Test Applies to One ID Type
● Test groups split by that identifier
● Visitors without that identifier are ignored
Running A/B Tests
Test Definitions Encoded in JSON
● Compact
● Simple and Flexible
● Editable by Humans
● Editable by Machines
Basic Data in the Test Definition
"description": "Button colors",
"salt": "buttonBgColorTst",
"type": "USER"
Buckets in the Test Definition
"buckets": [{
  "id": 0,
  "name": "gray",
  "description": "Control group"
}, {
  "id": 1,
  "name": "blue",
  "description": "Test group"
}]
Mapping Buckets to Ranges
"ranges": [{
  "bucketValue": 0,
  "length": 0.5
}, {
  "bucketValue": 1,
  "length": 0.5
}]
Complete Test Definition
{
  "description": "Button colors",
  "type": "USER",
  "salt": "buttonBgColorTst",
  "buckets": […],
  "allocations": [{
    "ranges": […]
  }]
}
Division of Responsibilities
Define the Experiment (proctor data): Test Definition
Apply the Experiment (each product): Proctor Library, Test Specification
Proctor includes several modules
Proctor
Common
Ant Builder
Codegen
Maven Builder
Product Test Specification lists active tests
References into the global pool:
"tests": [{
  "buttonBgColorTst": {
    "buckets": {"gray": 0, "blue": 1}
  }
}]
Apply the Experiment
On every request…
1. Select Groups
2. Render the Response
3. Log the Action
On every request…
1. Collect identifiers
2. Select buckets for opted-in tests
Determining Buckets in Code
Collect identifiers for all ID Types
// Product code
String cookie = getTrackingCookie(request);
String accountId = getAccountIdOrNull(request);

// Proctor preparation
Identifiers identifiers = Identifiers.of(
    TestType.USER, cookie,
    TestType.ACCOUNT, accountId
);
Select Buckets for Opted-In Tests
// Proctor assignments
ProctorResult assignments =
    proctor.determineBuckets(identifiers);
Apply the Experiment
On every request…
1. Select Groups
2. Render the Response
3. Log the Action
Choose behavior for selected bucket
int bgColorBucket;
/* … */
// Choose a background color for templates
if (bgColorBucket == 1) {
    // Test
    model.put("buttonBgColor", "#00f");
} else {
    // Control group
    model.put("buttonBgColor", "#ccc");
}
ProctorResult exposes buckets
// Proctor assignments
ProctorResult assignments =
    proctor.determineBuckets(identifiers);

// Get selected bucket for this user
int bgColorBucket = assignments
    // Map<String, TestBucket>: All tests
    .getBuckets()
    // TestBucket: This assignment
    .get("buttonBgColorTst")
    // int: Enumerated ID
    .getValue();
… verbosely
"Redundant" names in test spec…
"buttonBgColorTst": {
  "buckets": {"gray": 0, "blue": 1}
}
… are used to generate helper methods
// Choose a background color for templates
ResumeSearchGroups groups =
    new ResumeSearchGroups(assignments);

// Boolean accessors for each test & bucket
groups.isButtonBgColorTstGray();
groups.isButtonBgColorTstBlue();

// Enumerated value by test name
groups.getButtonBgColorTstValue();
Helper designed for use in UI layer
This immutable bean is trivial to:
● Read from JSP/JSF
● Read from Templates
○ Freemarker, Velocity, Closure, etc.
● Serialize as JSON
Apply the Experiment
On every request…
1. Select Groups
2. Render the Response
3. Log the Action
Logging Bucket Assignments
Proctor just selects the buckets.
When and how you log are up to you:
● On related events only
● On every event
Publication
Test Definitions in Source Control
● No new infrastructure
● Lots of desirable features for free: History, Diff, Access Control
[Diagram: Test Definitions are published as a Proctor Data artifact; each App on the App Servers picks it up by periodic refresh]
Publication is also via Source Control
Individual test changes pushed to a named branch:
/trunk
/branches/production
Overwriting Tests on a Named Branch
Not required to use Proctor, but beneficial:
● Same features for free: History, Diff, ACL
● No merging
● Easy roll-back, roll-forward
[Diagram: Build Servers compile the Project and its Test Specifications into a Deliverable, which is deployed to the App Servers. Test Definitions are published as a Proctor Data artifact, which each App picks up by periodic refresh]
Beyond the Basics
Test Segmentation
Segmentation
Tests often apply to only certain users:
● Specific markets
● Specific languages
● Specific devices
Segmentation through Test Rules
● Test definition allows one optional rule
● A rule is simply a boolean expression
● If the rule passes, the user is assigned to a test bucket
Rules are written in Unified EL
Simple Things are Simple
{
  "description": "Button colors",
  "rule": "country == 'CA'",
  "buckets": […]
}
● No deployment needed
● Changes live within minutes
Primitive and rich data types
"userAgent.phone || userAgent.tablet"
"userAgent.supports.html5"
"userAgent.supports.geolocation"
"userAgent.supports.fileUpload"
Commons EL is Easily Extended
JSTL Standard Functions
"rule": "fn:endsWith(account.email, '@indeed.com')"
Custom code
"rule": "proctor:contains(['US', 'CA'], country)"
Arbitrary Complexity
Sometimes rules are unavoidably complex:
"Android v2.1+":
userAgent.android && (
    userAgent.OS.majorVersion gt 2 || (
        userAgent.OS.majorVersion == 2
        && userAgent.OS.minorVersion ge 1
    )
)
What context is available?
So far we've seen:
● country
● language
● userAgent
● account
What's the full list of available context variables?
Context Defined in Test Specification
● Test spec declares available context variables
● This is a contract to provide values at runtime
{
  "tests": […],
  "providedContext": {
    "country": "String",
    "language": "String",
    "userAgent": "com.indeed.web.UserAgent"
  }
}
Provided While Determining Buckets
Also generated from test specification:
private ResumeSearchProctor proctor;

// Proctor assignments
ProctorResult assignments =
    proctor.determineBuckets(identifiers, country, language, userAgent);
Payloads
Even Tiny Changes Need Deploys
// Choose a background color for templates
if (bgColorBucket == 1) {
    // Test
    model.put("btnBgcolor", "#00f");
} else {
    // Control group
    model.put("btnBgcolor", "#ccc");
}
Some Tests Just Vary Data
Many tests have no behavioral change:
● CSS Colors
● Display Text
● Algorithm Weights
Payloads
● Values added for each bucket in a test
● Proctor verifies payloads are "all or none"
Control: Gray Test: Blue
"#ccc" "#00f"
Part of Test Definition
"buckets": [{
  "id": 0,
  "name": "gray",
  "description": "Control group",
  "payload": {
    "stringValue": "#ccc"
  }
}, …]
● No deployment needed
● Changes live within minutes
Declared in Project Test Specification
● Type definition only
● Must match test definition
"buttonBgColorTst": {
  "buckets": […],
  "payload": {
    "type": "stringValue"
  }
}
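The "all or none" payload rule can be illustrated with a small Java sketch. It is only an assumption about what such a check might look like, not Proctor's verification code; the class and method names are hypothetical, and a null entry stands in for a bucket that has no payload.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

/** Hypothetical sketch of an "all or none" payload check; not Proctor code. */
final class PayloadCheckSketch {
    /** True when every bucket has a payload or no bucket has one; null marks "no payload". */
    static boolean allOrNone(List<String> payloadsByBucket) {
        long withPayload = payloadsByBucket.stream().filter(Objects::nonNull).count();
        return withPayload == 0 || withPayload == payloadsByBucket.size();
    }

    public static void main(String[] args) {
        // Both buckets carry a color: valid.
        System.out.println(allOrNone(Arrays.asList("#ccc", "#00f")));
        // Only the control bucket carries a color: invalid.
        System.out.println(allOrNone(Arrays.asList("#ccc", null)));
    }
}
```

A test whose buckets mix payloads and missing payloads would leave some users without a value to render, which is why a definition like that is worth rejecting before it reaches production.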
Cleaner Code, Only Data Deploy
// Choose a background color
model.put("btnBgcolor", groups.getButtonBgColorTstPayload());
Cross-Product Tests
Cross-Product Tests
Many flavors of cross-product test, including
● Peer webapps
● Client / Service
● Mobile Native / Web
Proctor offers an interesting alternative
Cross-Product Tests
Even more ways to coordinate tests
● Tracking parameters on links, requests
● Service response metadata
● Different service calls
Two products can share test groups
As long as both products
● Share the test’s identifier
● Provide the context variables it uses
Deterministic selection guarantees identical bucket assignment.
Evolving Tests
Evolving Tests
[Diagram: test and control ranges]
[Diagram: test, control, and (inactive) ranges; 10% control]
Changed allocations, not ID mapping
[Diagram: resized test range; OOPS!]
● Inconsistent experience
● Polluted results
Evolving Tests Smoothly
[ 10%, 10% ] and [ 10%, 10%, 80% ]: test, control, (inactive)
[ 10%, 10%, 80% ] → [ 10%, 10%, 40%, 40% ]: test, control, (inactive) → test, control, test, control
[ 10%, 80%, 10% ] → [ 50%, 50% ]: test and control with an (inactive) range → test and control
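The difference between a smooth and an abrupt change of allocations can be sketched in Java. This is an illustration under assumptions, not Proctor code: positions are taken as already-hashed values in [0, 1), and the bucket values (including -1 for inactive) and all names are hypothetical.

```java
/** Hypothetical sketch of evolving a test's ranges; not Proctor code. */
final class EvolvingRangesSketch {
    static final int INACTIVE = -1; // assumed marker for "not in the test"
    static final int CONTROL = 0;
    static final int TEST = 1;

    /** Returns the bucket whose range contains position p, where p is in [0, 1). */
    static int bucketAt(double p, int[] buckets, double[] lengths) {
        double upper = 0.0;
        for (int i = 0; i < lengths.length; i++) {
            upper += lengths[i];
            if (p < upper) {
                return buckets[i];
            }
        }
        return INACTIVE; // reached only when no range covers p
    }

    public static void main(String[] args) {
        int[] beforeBuckets = {TEST, CONTROL, INACTIVE};
        double[] beforeLengths = {0.1, 0.1, 0.8};

        // Smooth: existing ranges stay in place; the inactive range is split instead.
        int[] smoothBuckets = {TEST, CONTROL, TEST, CONTROL};
        double[] smoothLengths = {0.1, 0.1, 0.4, 0.4};

        // Abrupt: the ranges are resized in place.
        int[] abruptBuckets = {TEST, CONTROL};
        double[] abruptLengths = {0.5, 0.5};

        double p = 0.15; // a user who was already in the control group
        System.out.println("before: " + bucketAt(p, beforeBuckets, beforeLengths));
        System.out.println("smooth: " + bucketAt(p, smoothBuckets, smoothLengths));
        System.out.println("abrupt: " + bucketAt(p, abruptBuckets, abruptLengths));
    }
}
```

Under the smooth change, a user already in the test keeps the same bucket and only previously inactive users are added. Under the abrupt change, the user at 0.15 moves from control to test, which is the inconsistent experience and polluted result the "OOPS!" slide warns about.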
Evolving Tests… Turbulently
hash ( uid + test.salt ) => bucket
[Diagram: where any given ID falls in the test and control ranges, before and after re-salt]
Contextual Sampling
Contextual Allocation
10% (US): test, control, (inactive)
20% (CA): test, control, (inactive)
50% (Rest of World): test, control
Allocations
Each test definition
● has one or more allocations
Each allocation
● has a rule and ranges totaling 1.0
● except the last, which has no rule.
Allocation Rules
● Use Unified EL, same as test rules.
● Use the same context variables as test rules.
● Choose the first matching allocation.
Allocations in the Test Definition
{
  "description": "Button colors",
  "type": "USER",
  "salt": "buttonBgColorTst",
  "buckets": [ … ],
  "allocations": [{
    "rule": "country == 'US'",
    "ranges": [ … ]
  }, {
    "ranges": [ … ]
  }]
}
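Choosing the first matching allocation can be sketched in Java as below. This is a simplified illustration, not Proctor code: rules are modeled as plain Java predicates over a context map instead of Unified EL expressions, and the class names, the context key, and the range sizes (taken from the 10% US, 20% CA, 50% Rest of World slide) are assumptions.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch of "choose the first matching allocation"; not Proctor code. */
final class AllocationSketch {
    /** Stand-in for an allocation: an optional rule plus range lengths totaling 1.0. */
    static final class Allocation {
        final Predicate<Map<String, String>> rule; // null means "no rule": always matches
        final double[] lengths;                    // test, control, then any inactive remainder

        Allocation(Predicate<Map<String, String>> rule, double[] lengths) {
            this.rule = rule;
            this.lengths = lengths;
        }
    }

    /** Returns the first allocation whose rule is absent or passes for this context. */
    static Allocation firstMatching(List<Allocation> allocations, Map<String, String> context) {
        for (Allocation allocation : allocations) {
            if (allocation.rule == null || allocation.rule.test(context)) {
                return allocation;
            }
        }
        throw new IllegalStateException("the last allocation must have no rule");
    }

    /** Example allocations: 10% buckets in the US, 20% in CA, 50% everywhere else. */
    static List<Allocation> example() {
        return Arrays.asList(
                new Allocation(ctx -> "US".equals(ctx.get("country")), new double[] {0.1, 0.1, 0.8}),
                new Allocation(ctx -> "CA".equals(ctx.get("country")), new double[] {0.2, 0.2, 0.6}),
                new Allocation(null, new double[] {0.5, 0.5}));
    }

    public static void main(String[] args) {
        Map<String, String> context = Collections.singletonMap("country", "CA");
        System.out.println(Arrays.toString(firstMatching(example(), context).lengths));
    }
}
```

Because the last allocation has no rule, every user matches some allocation, and the order of the list decides which rule wins when more than one would pass.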
Pre-Production
Environments
Local
Integration
QA
Production
commit
push
push
Show test matrix
/private/showTestMatrix
Show test bucket assignments
/private/showGroups
Privileged users can force assignments
?prforceGroups=buttonColorTst1
Beyond A/B Testing: Proctor Patterns for Managing Behavior
Kill Switch
When
● New Feature
How
● 'Active' bucket @ 100%
Phased Rollout
When
● Experimental Feature
How
● 'Active' bucket @ 0%
● 'Active' → 1% → 5% → 100%
Throttle
When
● Downsampling
○ trace logging
○ survey
How
● 'Active' bucket @ 0%
● 'Active' → 1% → 10% → 5% → ??
Feature Toggles
When
● Localized Behavior
● Device-Specific Behavior
● Logged-in, w/ Resume, etc.
How
● Multiple Allocations
● Targeted Rules
Dark Deploys
When
● Partial Implementations
● Additional QA is needed
How
● 'Active' bucket @ 0%
● 'Active' → 100%
Cross-Product Coordination
When
● Dependencies between products
○ Resume Wizard feature
How
● 'Active' bucket at 0%
● Resume Wizard allocation: → 100%
● Home page promo allocation: → 100%
[Chart: Pre-Proctor Tests vs. Post-Proctor Tests: 42, 103]
[Chart: Post-Proctor Tests + Toggles: 42, 10, 103, 65]
Proctor Webapp: A/B Test Change Management
(Coming Soon to github)
Building On Proctor (Not Open Source)
Description:
Group 0: control - Job alert label: Save Alert (control)
Group 1: labelSubscribe - Job alert label: Subscribe
Group 2: labelSignUp - Job alert label: Sign up
Group 3: labelGetJobs - Job alert label: Get jobs
Group 4: labelSendMeNewJobs - Job alert label: Send me new jobs
Group 5: labelActivate - Job alert label: Activate
Group 6: labelSave - Job alert label: Save
History:
jack @ 2013-03-12 (r203267): Promoting jasxjabtnlbltst (trunk r203089) to production JASX-11365: jasxjabtnlbltst disabled
ketan @ 2012-12-11 (r190675): merged r190418: JASX-10663: Stop jasxjabtnlbltst in all languages except nl
will @ 2012-11-29 (r188801): merged r187452: JASX-10457: exclude US from jasxjabtnlbltst
ketan @ 2012-10-25 (r182881): merged r182688: JASX-10234 - Adding new langauges to job alert button label test
ketan @ 2012-10-25 (r182876): merged r181938: JASX-10234 - Adding test definition and allocations for job alert button label test
DEMO: Get out your Phones and Tablets
http://go.indeed.com/demo
Simple: test different background colors
25% / 25% / 25% / 25%
Let’s increase our bucket size...
50% / 50%
We have a winner!
100%
Let’s do something wacky!
Android >= 4, iOS >= 7
[Chart: iOS, Android]
Also a reference implementation
Running on heroku -- feel free to clone!
http://indeedeng-hello-proctor.herokuapp.com
Source: github.com/indeedeng/proctor-demo
Q&A
Source: github.com/indeedeng/proctor
Docs: indeedeng.github.io/proctor
Next @IndeedEng Talk
Boxcar: Self-balancing distributed services
Wednesday, October 30
R.B. Boyer