Remaining Agile with Billions of Documents: Appboy's Creative MongoDB Schemas
TRANSCRIPT
Jon Hyman, Co-Founder & CIO, Appboy
MongoDB World 2015
@appboy @jon_hyman
• Prior to 2013, scaled vertically
• Sharded in Q2 2013
• Added write buffering with Redis (transactional)
• In 2014, started splitting out collections to more clusters
• By MongoDB World 2014, Appboy handled over 4 billion data points per month
Appboy’s growth on MongoDB
MongoDB World 2014 Recap
• Approximately 22 billion events per month
• Handling spikes of 2B+ events per day
• We anticipate tracking over 1B unique users in Q3
• 11 clusters, over 160 shards
Appboy’s growth on MongoDB
Appboy’s Growth in 2015
• Statistical analysis in read queries
• Random rate limiting and A/B testing
• Flexible schemas, tokenizing field names
• Schemas for data intensive algorithms at Appboy
Agenda
Today at MongoDB World 2015!
Appboy shows you segment membership in real-time as you add/edit/remove filters.
How do we do it quickly? We estimate the population sizes of segments when using our web UI.
Counting Quickly
Goal: Quickly get the count() of an arbitrary query
Problem: MongoDB counts are slow, especially unindexed ones
Counting Quickly
10 million documents that represent people:

{ favorite_color: "blue", age: 29, gender: "M", favorite_food: "pizza", city: "NYC", shoe_size: 11, attractiveness: 10, ... }

• How many people like blue?
• How many live in NYC and love pizza?
• How many men have a shoe size less than 10?
Counting Quickly
Big Question: How do you estimate counts?
Answer: The same way news networks do it.
With confidence.
Add a random number in a known range to each document. Say, between 0 and 9999.

{ random: 4583, favorite_color: "blue", age: 29, gender: "M", favorite_food: "pizza", city: "NYC", shoe_size: 11, attractiveness: 10, ... }

Add an index on the random number:

db.users.ensureIndex({random: 1})
Counting Quickly
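The bucketing step above can be sketched in plain JavaScript (an illustrative Node.js snippet, not Appboy's actual code; `NUM_BUCKETS` and `newUserDoc` are made-up names):

```javascript
// Stamp each new user document with a random "bucket" in [0, NUM_BUCKETS).
// The talk uses 10,000 buckets, i.e. values 0..9999.
const NUM_BUCKETS = 10000;

function newUserDoc(fields) {
  return Object.assign(
    { random: Math.floor(Math.random() * NUM_BUCKETS) },
    fields
  );
}

const doc = newUserDoc({ favorite_color: "blue", age: 29 });
// doc.random is always in [0, 9999]
```

Because the value is assigned once at write time and indexed, any contiguous range of buckets later behaves as a uniform random sample of users.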
Step 1: Get a random sample
I have 10 million documents. Of my 10,000 random “buckets”, I should expect each “bucket” to hold about 1,000 users.
E.g.,
db.users.find({random: 123}).count() == ~1000
db.users.find({random: 9043}).count() == ~1000
db.users.find({random: 4982}).count() == ~1000
Counting Quickly
Step 1: Get a random sample
Let’s take a random 100,000 users. Grab a random range that “holds” those users. These all work:
Tip: Limit $maxScan to 100,000 just to be safe
db.users.find({random: {$gt: 0, $lt: 101}})
db.users.find({random: {$gt: 503, $lt: 604}})
db.users.find({random: {$gt: 8938, $lt: 9039}})
db.users.find({$or: [{random: {$gt: 9955}}, {random: {$lt: 56}}]})
Counting Quickly
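One way to generate such a range, including the wraparound case from the last query above, is sketched below (illustrative Node.js; `randomBucketRange` is a hypothetical helper, not from the talk):

```javascript
// Pick a random contiguous range of `width` buckets out of NUM_BUCKETS,
// wrapping past 9999 back to 0 when needed.
// Returns a MongoDB-style query filter object.
const NUM_BUCKETS = 10000;

function randomBucketRange(width) {
  const start = Math.floor(Math.random() * NUM_BUCKETS);
  const end = start + width;
  if (end <= NUM_BUCKETS) {
    // e.g. { random: { $gte: 503, $lt: 603 } }
    return { random: { $gte: start, $lt: end } };
  }
  // Wraparound, e.g. { $or: [ { random: { $gte: 9955 } }, { random: { $lt: 55 } } ] }
  return {
    $or: [
      { random: { $gte: start } },
      { random: { $lt: end - NUM_BUCKETS } },
    ],
  };
}
```

Whichever range comes back, it always covers exactly `width` buckets, so the expected sample size is the same every time.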
Step 2: Learn about that random sample
db.users.find(
  { random: {$gt: 0, $lt: 101},
    gender: "M",
    favorite_color: "blue",
    shoe_size: {$gt: 10} }
)._addSpecial("$maxScan", 100000).explain()

Explain result:

{ nscannedObjects: 100000, n: 11302, ... }
Counting Quickly
Step 3: Do the math
Population: 10,000,000
Sample size: 100,000
Num matches: 11,302
Percentage of users who matched: 11.3%
Estimated total count: 1,130,000 +/- 0.2% with 95% confidence
Counting Quickly
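The Step 3 arithmetic can be written out directly (illustrative Node.js; `estimateCount` is a made-up helper, and the interval uses the standard normal approximation to the binomial):

```javascript
// Scale the sample proportion up to the population and attach a 95%
// confidence interval (normal approximation: margin = 1.96 * standard error).
function estimateCount(population, sampleSize, numMatches) {
  const p = numMatches / sampleSize;              // 11302/100000 = 0.11302
  const se = Math.sqrt(p * (1 - p) / sampleSize); // standard error of p
  const margin = 1.96 * se;                       // 95% confidence half-width
  return {
    estimate: Math.round(p * population),
    marginPct: +(margin * 100).toFixed(2),        // as a % of the population
  };
}

console.log(estimateCount(10000000, 100000, 11302));
// -> { estimate: 1130200, marginPct: 0.2 }
```

This reproduces the slide's numbers: about 1.13M matching users, ±0.2% at 95% confidence.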
Step 4: Optimize
• Limit $maxScan to (100,000 / numShards) per shard to be even faster
• Cache the random range for a few hours (keep the sample set warm)
• Add more RAM (or shards)
• Cache results so repeats of the same query don't hit the database
• Don't use explain() when you need more than one count: run the aggregation framework over the population's random sample instead
Counting Quickly
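The last optimization is that one pass over the sample can answer many count questions at once, which is what an aggregation pipeline does server-side. A minimal in-memory analogue (illustrative Node.js with made-up sample data, not the actual pipeline):

```javascript
// One scan of the random sample produces several counts at once, instead of
// one explain() round-trip per filter. Sample documents here are invented.
const sample = [
  { gender: "M", favorite_color: "blue", shoe_size: 11 },
  { gender: "F", favorite_color: "blue", shoe_size: 8 },
  { gender: "M", favorite_color: "red", shoe_size: 12 },
];

const counters = { likesBlue: 0, menBigShoes: 0 };
for (const u of sample) {
  if (u.favorite_color === "blue") counters.likesBlue++;
  if (u.gender === "M" && u.shoe_size > 10) counters.menBigShoes++;
}
console.log(counters); // -> { likesBlue: 2, menBigShoes: 2 }
```

Each extra question costs only another conditional per scanned document, not another scan.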
Goal: handle scale by doing things that work for a user base of any size.
Random sampling is a good way to do this.
• Want to send different messages to users in a cohort and measure against a control (a set of users in the cohort who do not receive any message)
• Who receives the message should be random
• If you have 1M users and want to send a test to 50k, want to select a random 50k (and another random 50k for control)
• If you target the same 1M user cohort with 50k test sizes, different users should be in each test
• Generically, this is the same as “random rate limiting”
• If you wanted to limit delivery to 50k, who receives it should be random
A/B Testing
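Under the bucket scheme, picking a random test group and a disjoint control group just means carving out two adjacent bucket ranges (illustrative Node.js sketch; `pickTestAndControl` is a hypothetical helper, and wraparound is omitted for brevity):

```javascript
// Carve two adjacent, non-overlapping bucket ranges: one for the test
// group, one for control. A fresh random start each send means different
// users land in each test against the same cohort.
const NUM_BUCKETS = 10000;

function pickTestAndControl(widthEach) {
  const start = Math.floor(Math.random() * (NUM_BUCKETS - 2 * widthEach));
  return {
    test:    { $gte: start,             $lt: start + widthEach },
    control: { $gte: start + widthEach, $lt: start + 2 * widthEach },
  };
}
```

For a 1M-user cohort with 10,000 buckets, `widthEach = 500` covers roughly 50k users per group, and the two ranges can never overlap.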
Randomly scan and select users based on the "random" value:
• Parallel processes handle users across different "random" ranges
• Be sure to handle all "random" values (for apps with fewer than 10,000 users)
• Keep track of global rate-limited state to know when to stop processing
• Users randomly receive variations based on send probability (more on this later), and are also randomly chosen to be in the control group
• Use statistical analysis to look at random user samples based on “random” value
• A/B tests send on random users based on “random” value
• If you overload one "random" value for both, you bias yourself when retargeting: keep a separate "random" value for each use case
Statistical Sampling and A/B Testing
Appboy creates a rich user profile on every user who opens one of our customers’ apps
Extensible User Profiles
{
  first_name: "Sherika",
  email: "[email protected]",
  dob: 1994-10-24,
  gender: "F",
  country: "DE",
  ...
}
Let’s talk schema
Extensible User Profiles
{
  first_name: "Sherika",
  email: "[email protected]",
  dob: 1994-10-24,
  gender: "F",
  custom: {
    brands_purchased: "Puma and Asics",
    credit_card_holder: true,
    shoe_size: 11,
    ...
  },
  ...
}
Custom attributes can go alongside other fields!
db.users.update({…}, {$set: {"custom.loyalty_program": true}})
Extensible User Profiles
Pros
• Easily extensible to add any number of fields
• Don't need to worry about type (bool, string, integer, float, etc.): MongoDB handles it all
• Can do atomic operations like $inc easily
• Easily queryable, no need to do complicated joins against the right value column

Cons
• Can take up a lot of space:
  "this_is_my_really_long_custom_attribute_name_weeeeeee"
• Can end up with mismatched types across documents:
  { visited_website: true }
  { visited_website: "yes" }

Extensible User Profiles
Extensible User Profiles - How to Improve the Cons
Space concern: tokenize field names and use a field map:

{
  first_name: "Sherika",
  email: "[email protected]",
  dob: 1994-10-24,
  gender: "F",
  custom: {
    0: true,
    1: 11,
    2: "Alex & Ani",
    ...
  },
  ...
}

Field map: { loyalty_program: 0, shoe_size: 1, favorite_brand: 2 }

You should also limit the length of values.
Extensible User Profiles - How to Improve the Cons
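A minimal sketch of the tokenization step (illustrative Node.js; in practice the field map lives in its own collection, and `tokenize` is a made-up helper name):

```javascript
// Long attribute names become small integer keys before writing the
// user document, cutting per-document field-name overhead.
const fieldMap = { loyalty_program: 0, shoe_size: 1, favorite_brand: 2 };

function tokenize(customAttrs) {
  const out = {};
  for (const [name, value] of Object.entries(customAttrs)) {
    out[fieldMap[name]] = value;
  }
  return out;
}

console.log(tokenize({ loyalty_program: true, shoe_size: 11 }));
// -> { '0': true, '1': 11 }
```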
Type constraints: handle them in the client; store expected types in a map and coerce/reject bad values
{ loyalty_program: Boolean, shoe_size: Integer, favorite_brand: String }
(also need a map for display names of fields…)
Extensible User Profiles - How to Improve the Cons
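A client-side coercion sketch (illustrative Node.js; the specific coercion rules are assumptions, not Appboy's actual policy):

```javascript
// Coerce incoming values toward the expected type where safe; reject
// anything that can't be coerced, so documents never mix types.
const typeMap = {
  loyalty_program: "boolean",
  shoe_size: "number",
  favorite_brand: "string",
};

function coerce(name, value) {
  const expected = typeMap[name];
  if (typeof value === expected) return value;
  if (expected === "boolean" && (value === "true" || value === "yes")) return true;
  if (expected === "boolean" && (value === "false" || value === "no")) return false;
  if (expected === "number" && !isNaN(Number(value))) return Number(value);
  if (expected === "string") return String(value);
  throw new Error("rejecting " + name + ": expected " + expected);
}
```

This is how `{ visited_website: "yes" }` and `{ visited_website: true }` would be collapsed to one type before they ever reach the database.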
• Use arrays to store items in the map; an item's index in the array is its "token"
• One or more documents per customer hold the array field list
• Atomically push a new custom attribute onto the end of the array, get its index ("token"), and cache the value for fast retrieval later

Field Map

["Loyalty Program", "Shoe Size", "Favorite Color"]
(tokens: 0, 1, 2)
• Avoid documents growing unbounded
• We cap how many array elements we store before generating a new document (say, 100)
• Each document has a field least_value that represents the token value of index 0 in its "list"
• $push if list.99 does not exist; otherwise use findAndModify to create a new document atomically and retry the $push

Field Map

["Loyalty Program", "Shoe Size", "Favorite Color"]
(tokens: 100, 101, 102)
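The allocation logic can be sketched in memory (illustrative Node.js; the real version uses a `$push` guarded by `{"list.99": {$exists: false}}` plus `findAndModify`, and `addAttribute` is a made-up helper):

```javascript
// In-memory analogue of the capped field-map documents: each doc holds at
// most CAP attribute names, and least_value is the token of its index 0.
const CAP = 100;
const docs = []; // each: { least_value, list }

function addAttribute(name) {
  let doc = docs[docs.length - 1];
  if (!doc || doc.list.length >= CAP) {
    // analogue of findAndModify creating the next document atomically
    doc = { least_value: docs.length * CAP, list: [] };
    docs.push(doc);
  }
  doc.list.push(name);
  return doc.least_value + doc.list.length - 1; // the attribute's token
}
```

The 101st attribute spills into a second document whose `least_value` is 100, so tokens stay globally unique without any document growing unbounded.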
• Adds indirection and complexity, but worth it
• Small field name size in each document
• Compression in WiredTiger makes this a non-issue from a storage perspective, but tokens still have benefits for field names
• Easy identifiers to pass around in code for custom attributes
Field Map Summary
• Appboy customers run multivariate tests of message campaigns for a long duration
• Goal: in the shortest period of time, find the variation which we are statistically certain provides the highest conversion
• Customers check in on results and make a determination
Multivariate Testing
Think of it like you are at a row of slot machines, each with a random reward drawn from a distribution that is not known in advance. You need to maximize your reward.
Multi-arm Bandit Multivariate Testing
[Image: "Las Vegas slot machines". Licensed under CC BY-SA 3.0 via Wikipedia: http://en.wikipedia.org/wiki/File:Las_Vegas_slot_machines.jpg]
“[The bandit problem] was formulated during the [second world] war, and efforts to solve it so sapped the energies and minds of Allied analysts that the suggestion was made that the problem be dropped over Germany, as the ultimate instrument of intellectual sabotage.”
Multi-arm Bandit Multivariate Testing
- Peter Whittle, 1979
Appboy was inspired by a paper from U. Chicago Booth:
“Multi-armed bandit experiments in the online service economy”
Steven L. Scott, Harvard Ph.D., Senior Economic Analyst at Google
http://faculty.chicagobooth.edu/workshops/marketing/pdf/pdf/ExperimentsInTheServiceEconomy.pdf
Multi-arm Bandit Multivariate Testing
• Twice per day, Appboy automatically optimizes send distributions for each variation using the algorithm
• Requires a lot of observed data
• For each variation:
  • Unique recipients who received it
  • Conversion rate
  • Timeseries of this data
Multi-arm Bandit Multivariate Testing
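One standard way to turn those observed counts into send distributions is Thompson sampling over Beta posteriors, the approach described in the Scott paper the talk cites. The sketch below (illustrative Node.js; the sampler, priors, and function names are assumptions, not Appboy's code) allocates each variation a share proportional to how often it wins random posterior draws:

```javascript
// Standard normal via Box-Muller.
function randn() {
  let u = 0, v = 0;
  while (u === 0) u = Math.random();
  while (v === 0) v = Math.random();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// Gamma(shape, 1) via the Marsaglia-Tsang method.
function randGamma(shape) {
  if (shape < 1) return randGamma(shape + 1) * Math.pow(Math.random(), 1 / shape);
  const d = shape - 1 / 3, c = 1 / Math.sqrt(9 * d);
  for (;;) {
    const x = randn(), v = Math.pow(1 + c * x, 3);
    if (v <= 0) continue;
    const u = Math.random();
    if (Math.log(u) < 0.5 * x * x + d - d * v + d * Math.log(v)) return d * v;
  }
}

// Beta(a, b) as a ratio of two Gamma draws.
function randBeta(a, b) {
  const x = randGamma(a);
  return x / (x + randGamma(b));
}

// arms: [{recipients, conversions}] -> fraction of posterior draws each
// arm wins, used as its share of the next send.
function sendProbabilities(arms, draws = 2000) {
  const wins = arms.map(() => 0);
  for (let i = 0; i < draws; i++) {
    const samples = arms.map(a =>
      randBeta(1 + a.conversions, 1 + a.recipients - a.conversions)
    );
    wins[samples.indexOf(Math.max(...samples))]++;
  }
  return wins.map(w => w / draws);
}
```

A variation that clearly converts better quickly wins almost all draws, so it absorbs most of the traffic while weaker variations keep a small exploratory share.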
{
  company_id: BSON::ObjectId,
  campaign_id: BSON::ObjectId,
  date: 2015-05-31,
  message_variation_1: {
    unique_recipient_count: 100000,
    total_conversion_count: 5000,
    total_open_rate: 8000,
    hourly_breakdown: {
      0: {
        unique_recipient_count: 1000,
        total_conversion_count: 40,
        total_open_rate: 125,
        ...
      },
      ...
    },
    ...
  },
  message_variation_2: { ... }
}
Multi-arm Bandit Multivariate Testing
• Pre-aggregated stats let us pull back the entirety of an experiment extremely quickly
• Shard on company ID so we can pull back all of a company's campaigns at once and optimize them together
• Pre-aggregated stats need special care to build to avoid write overload
Multi-arm Bandit Multivariate Testing
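One way such a pre-aggregated document could be built is with a single atomic `$inc` update per event, touching both the daily total and the hourly bucket (illustrative Node.js; `conversionInc` is a hypothetical helper, and the field paths follow the schema above):

```javascript
// Build the $inc spec for one conversion event: bump the variation's
// daily total and its hourly_breakdown bucket in a single update.
function conversionInc(variation, hour) {
  return {
    $inc: {
      [variation + ".total_conversion_count"]: 1,
      [variation + ".hourly_breakdown." + hour + ".total_conversion_count"]: 1,
    },
  };
}

// Would be used roughly as:
// db.campaign_stats.update(
//   { campaign_id: ..., date: today },
//   conversionInc("message_variation_1", 13),
//   { upsert: true }
// )
```

Because every event is one small in-place update, reads stay cheap, but this is also exactly where the "special care to avoid write overload" comes in: every event for a campaign-day lands on the same document.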
• Appboy analyzes the optimal time to send a message to a user
• If Alice is more likely to engage at night and Bob in the morning, they'll get notifications in those windows
“Comparing overall open rates before and after using it, we've seen over 100% improvement in performance. Our one week retention campaigns targeted at male Urban On members improved 138%. Additionally, engaging a particularly difficult segment, users who have been inactive for three months, has improved 94%.”
- Jim Davis, Director of CRM and Interactive Marketing at Urban Outfitters
Intelligent Delivery
• The algorithm is data-intensive on a per-user basis
• Appboy Intelligent Delivery sends tens to hundreds of millions of messages each day; we need to compute the optimal time on a per-user basis quickly
Intelligent Delivery
{
  _id: BSON::ObjectId of user,
  dimension_1: [DateTime, DateTime, …],
  dimension_2: [DateTime, DateTime, …],
  dimension_3: [DateTime, DateTime, …],
  dimension_4: [Float, Float, …],
  dimension_5: […],
}
• When dimensional data for a user comes in, record a copy of it in a document
• Shard on {_id: "hashed"} for optimal distribution across shards and best write throughput
• When needing to Intelligently Deliver to a user, query back one document to get all the data to input into the algorithm. This is super fast.
• MongoDB's flexible schemas make adding new dimensions trivial
Intelligent Delivery
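To make the "one document, one fast lookup" point concrete: a deliberately trivial stand-in for the real algorithm could pick the modal engagement hour from one dimension of the per-user document (illustrative Node.js; `optimalHour`, the sample data, and the use of a single dimension are all assumptions, as the actual algorithm combines several dimensions):

```javascript
// Given the per-user document, tally engagement timestamps by UTC hour
// and return the busiest hour. Everything needed fits in one document.
function optimalHour(userDoc) {
  const counts = new Array(24).fill(0);
  for (const ts of userDoc.dimension_1) {
    counts[new Date(ts).getUTCHours()]++;
  }
  return counts.indexOf(Math.max(...counts));
}

const user = {
  _id: "u1",
  dimension_1: ["2015-05-30T21:10:00Z", "2015-05-30T21:40:00Z", "2015-05-31T08:00:00Z"],
};
console.log(optimalHour(user)); // -> 21
```

Whatever the real scoring function is, the schema guarantees it runs against data fetched in a single query, which is what keeps per-user computation fast at tens of millions of sends per day.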
• Consolidating data for fast retrieval is a huge win
• MongoDB's flexible schemas make this possible
• Choose the right shard key for the document access pattern
• (Not a catch-all; be sure to still store data in non-pre-aggregated form)
Data Intensive Algorithms Summary