Remaining Agile with Billions of Documents: Appboy's Creative MongoDB Schemas
TRANSCRIPT
Jon Hyman, Co-Founder & CIO, Appboy
MongoDB World 2015
@appboy @jon_hyman
• Prior to 2013, scaled vertically
• Sharded in Q2 2013
• Added write buffering with Redis (transactional)
• In 2014, started splitting out collections to more clusters
• By MongoDB World 2014, Appboy handled over 4 billion data points per month
Appboy’s growth on MongoDB
MongoDB World 2014 Recap
• Approximately 22 billion events per month
• Handling spikes of 2B+ events per day
• We anticipate tracking over 1B unique users in Q3
• 11 clusters, over 160 shards
Appboy’s growth on MongoDB
Appboy’s Growth in 2015
• Statistical analysis in read queries
• Random rate limiting and A/B testing
• Flexible schemas, tokenizing field names
• Schemas for data intensive algorithms at Appboy
Agenda
Today at MongoDB World 2015!
Appboy shows you segment membership in real-time as you add/edit/remove filters.
How do we do it quickly? We estimate the population sizes of segments when using our web UI.
Counting Quickly
Goal: Quickly get the count() of an arbitrary query
Problem: MongoDB counts are slow, especially unindexed ones
Counting Quickly
10 million documents that represent people:

{ favorite_color: "blue", age: 29, gender: "M", favorite_food: "pizza", city: "NYC", shoe_size: 11, attractiveness: 10, ... }

• How many people like blue?
• How many live in NYC and love pizza?
• How many men have a shoe size less than 10?
Counting Quickly
Big Question: How do you estimate counts?
Answer: The same way news networks do it.
With confidence.
Add a random number in a known range to each document. Say, between 0 and 9999.

{ random: 4583, favorite_color: "blue", age: 29, gender: "M", favorite_food: "pizza", city: "NYC", shoe_size: 11, attractiveness: 10, ... }

Add an index on the random number:

db.users.ensureIndex({random: 1})
Counting Quickly
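The bucketing step above can be sketched in plain JavaScript (an illustrative Node.js snippet, not Appboy's actual code; `NUM_BUCKETS` and `newUserDoc` are made-up names):

```javascript
// Stamp each new user document with a random "bucket" in [0, NUM_BUCKETS).
// The talk uses 10,000 buckets, i.e. values 0..9999.
const NUM_BUCKETS = 10000;

function newUserDoc(fields) {
  return Object.assign(
    { random: Math.floor(Math.random() * NUM_BUCKETS) },
    fields
  );
}

const doc = newUserDoc({ favorite_color: "blue", age: 29 });
// doc.random is always in [0, 9999]
```

Because the value is assigned once at write time and indexed, any contiguous range of buckets later behaves as a uniform random sample of users.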
Step 1: Get a random sample
I have 10 million documents. Of my 10,000 random “buckets”, I should expect each “bucket” to hold about 1,000 users.
E.g.,
db.users.find({random: 123}).count() == ~1000
db.users.find({random: 9043}).count() == ~1000
db.users.find({random: 4982}).count() == ~1000
Counting Quickly
Step 1: Get a random sample
Let’s take a random 100,000 users. Grab a random range that “holds” those users. These all work:
Tip: Limit $maxScan to 100,000 just to be safe
db.users.find({random: {$gt: 0, $lt: 101}})
db.users.find({random: {$gt: 503, $lt: 604}})
db.users.find({random: {$gt: 8938, $lt: 9039}})
db.users.find({$or: [{random: {$gt: 9955}}, {random: {$lt: 56}}]})
Counting Quickly
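One way to generate such a range, including the wraparound case from the last query above, is sketched below (illustrative Node.js; `randomBucketRange` is a hypothetical helper, not from the talk):

```javascript
// Pick a random contiguous range of `width` buckets out of NUM_BUCKETS,
// wrapping past 9999 back to 0 when needed.
// Returns a MongoDB-style query filter object.
const NUM_BUCKETS = 10000;

function randomBucketRange(width) {
  const start = Math.floor(Math.random() * NUM_BUCKETS);
  const end = start + width;
  if (end <= NUM_BUCKETS) {
    // e.g. { random: { $gte: 503, $lt: 603 } }
    return { random: { $gte: start, $lt: end } };
  }
  // Wraparound, e.g. { $or: [ { random: { $gte: 9955 } }, { random: { $lt: 55 } } ] }
  return {
    $or: [
      { random: { $gte: start } },
      { random: { $lt: end - NUM_BUCKETS } },
    ],
  };
}
```

Whichever range comes back, it always covers exactly `width` buckets, so the expected sample size is the same every time.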
Step 2: Learn about that random sample
db.users.find(
  { random: {$gt: 0, $lt: 101},
    gender: "M",
    favorite_color: "blue",
    shoe_size: {$gt: 10} }
)._addSpecial("$maxScan", 100000).explain()

Explain result:

{ nscannedObjects: 100000, n: 11302, ... }
Counting Quickly
Step 3: Do the math
Population: 10,000,000
Sample size: 100,000
Num matches: 11,302
Percentage of users who matched: 11.3%
Estimated total count: 1,130,000 +/- 0.2% with 95% confidence
Counting Quickly
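The Step 3 arithmetic can be written out directly (illustrative Node.js; `estimateCount` is a made-up helper, and the interval uses the standard normal approximation to the binomial):

```javascript
// Scale the sample proportion up to the population and attach a 95%
// confidence interval (normal approximation: margin = 1.96 * standard error).
function estimateCount(population, sampleSize, numMatches) {
  const p = numMatches / sampleSize;              // 11302/100000 = 0.11302
  const se = Math.sqrt(p * (1 - p) / sampleSize); // standard error of p
  const margin = 1.96 * se;                       // 95% confidence half-width
  return {
    estimate: Math.round(p * population),
    marginPct: +(margin * 100).toFixed(2),        // as a % of the population
  };
}

console.log(estimateCount(10000000, 100000, 11302));
// -> { estimate: 1130200, marginPct: 0.2 }
```

This reproduces the slide's numbers: about 1.13M matching users, ±0.2% at 95% confidence.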
Step 4: Optimize
• Limit $maxScan to (100,000 / numShards) per shard to be even faster
• Cache the random range for a few hours (keep the sample set warm)
• Add more RAM (or shards)
• Cache results so repeats of the same query don't hit the database
• Don't use explain() when you need more than one count: run the aggregation framework over the population's random sample instead
Counting Quickly
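The last optimization is that one pass over the sample can answer many count questions at once, which is what an aggregation pipeline does server-side. A minimal in-memory analogue (illustrative Node.js with made-up sample data, not the actual pipeline):

```javascript
// One scan of the random sample produces several counts at once, instead of
// one explain() round-trip per filter. Sample documents here are invented.
const sample = [
  { gender: "M", favorite_color: "blue", shoe_size: 11 },
  { gender: "F", favorite_color: "blue", shoe_size: 8 },
  { gender: "M", favorite_color: "red", shoe_size: 12 },
];

const counters = { likesBlue: 0, menBigShoes: 0 };
for (const u of sample) {
  if (u.favorite_color === "blue") counters.likesBlue++;
  if (u.gender === "M" && u.shoe_size > 10) counters.menBigShoes++;
}
console.log(counters); // -> { likesBlue: 2, menBigShoes: 2 }
```

Each extra question costs only another conditional per scanned document, not another scan.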
Goal: handle scale by doing things that work for a user base of any size.
Random sampling is a good way to do this.
• Want to send different messages to users in a cohort and measure against a control (a set of users in the cohort who do not receive any message)
• Who receives the message should be random
• If you have 1M users and want to send a test to 50k, want to select a random 50k (and another random 50k for control)
• If you target the same 1M user cohort with 50k test sizes, different users should be in each test
• Generically, this is the same as “random rate limiting”
• If you wanted to limit delivery to 50k, who receives it should be random
A/B Testing
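Under the bucket scheme, picking a random test group and a disjoint control group just means carving out two adjacent bucket ranges (illustrative Node.js sketch; `pickTestAndControl` is a hypothetical helper, and wraparound is omitted for brevity):

```javascript
// Carve two adjacent, non-overlapping bucket ranges: one for the test
// group, one for control. A fresh random start each send means different
// users land in each test against the same cohort.
const NUM_BUCKETS = 10000;

function pickTestAndControl(widthEach) {
  const start = Math.floor(Math.random() * (NUM_BUCKETS - 2 * widthEach));
  return {
    test:    { $gte: start,             $lt: start + widthEach },
    control: { $gte: start + widthEach, $lt: start + 2 * widthEach },
  };
}
```

For a 1M-user cohort with 10,000 buckets, `widthEach = 500` covers roughly 50k users per group, and the two ranges can never overlap.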
Randomly scan and select users based on the "random" value:
• Parallel processes handle users across different "random" ranges
• Be sure to handle all "random" values (for apps with fewer than 10,000 users)
• Keep track of global rate-limited state to know when to stop processing
• Users randomly receive variations based on send probability (more on this later), and are also randomly chosen to be in the control group
• Use statistical analysis to look at random user samples based on “random” value
• A/B tests send on random users based on “random” value
• If you overload one "random" value for both, you bias yourself when retargeting: keep a separate "random" value for each use case
Statistical Sampling and A/B Testing
Appboy creates a rich user profile on every user who opens one of our customers’ apps
Extensible User Profiles
{
  first_name: "Sherika",
  email: "[email protected]",
  dob: 1994-10-24,
  gender: "F",
  country: "DE",
  ...
}
Let’s talk schema
Extensible User Profiles
{
  first_name: "Sherika",
  email: "[email protected]",
  dob: 1994-10-24,
  gender: "F",
  custom: {
    brands_purchased: "Puma and Asics",
    credit_card_holder: true,
    shoe_size: 11,
    ...
  },
  ...
}
Custom attributes can go alongside other fields!
db.users.update({…}, {$set: {"custom.loyalty_program": true}})
Extensible User Profiles
Pros
• Easily extensible to add any number of fields
• Don't need to worry about type (bool, string, integer, float, etc.): MongoDB handles it all
• Can do atomic operations like $inc easily
• Easily queryable, no need to do complicated joins against the right value column

Cons
• Can take up a lot of space:
  "this_is_my_really_long_custom_attribute_name_weeeeeee"
• Can end up with mismatched types across documents:
  { visited_website: true }
  { visited_website: "yes" }

Extensible User Profiles
Extensible User Profiles - How to Improve the Cons
Space concern: tokenize field names and use a field map:

{
  first_name: "Sherika",
  email: "[email protected]",
  dob: 1994-10-24,
  gender: "F",
  custom: {
    0: true,
    1: 11,
    2: "Alex & Ani",
    ...
  },
  ...
}

Field map: { loyalty_program: 0, shoe_size: 1, favorite_brand: 2 }

You should also limit the length of values.
Extensible User Profiles - How to Improve the Cons
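A minimal sketch of the tokenization step (illustrative Node.js; in practice the field map lives in its own collection, and `tokenize` is a made-up helper name):

```javascript
// Long attribute names become small integer keys before writing the
// user document, cutting per-document field-name overhead.
const fieldMap = { loyalty_program: 0, shoe_size: 1, favorite_brand: 2 };

function tokenize(customAttrs) {
  const out = {};
  for (const [name, value] of Object.entries(customAttrs)) {
    out[fieldMap[name]] = value;
  }
  return out;
}

console.log(tokenize({ loyalty_program: true, shoe_size: 11 }));
// -> { '0': true, '1': 11 }
```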
Type constraints: handle them in the client; store expected types in a map and coerce/reject bad values
{ loyalty_program: Boolean, shoe_size: Integer, favorite_brand: String }
(also need a map for display names of fields…)
Extensible User Profiles - How to Improve the Cons
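A client-side coercion sketch (illustrative Node.js; the specific coercion rules are assumptions, not Appboy's actual policy):

```javascript
// Coerce incoming values toward the expected type where safe; reject
// anything that can't be coerced, so documents never mix types.
const typeMap = {
  loyalty_program: "boolean",
  shoe_size: "number",
  favorite_brand: "string",
};

function coerce(name, value) {
  const expected = typeMap[name];
  if (typeof value === expected) return value;
  if (expected === "boolean" && (value === "true" || value === "yes")) return true;
  if (expected === "boolean" && (value === "false" || value === "no")) return false;
  if (expected === "number" && !isNaN(Number(value))) return Number(value);
  if (expected === "string") return String(value);
  throw new Error("rejecting " + name + ": expected " + expected);
}
```

This is how `{ visited_website: "yes" }` and `{ visited_website: true }` would be collapsed to one type before they ever reach the database.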
• Use arrays to store items in the map; an item's index in the array is its "token"
• One or more documents per customer hold the array field list
• Atomically push a new custom attribute onto the end of the array, get its index ("token"), and cache the value for fast retrieval later

Field Map

["Loyalty Program", "Shoe Size", "Favorite Color"]
(tokens: 0, 1, 2)
• Avoid documents growing unbounded
• We cap how many array elements we store before generating a new document (say, 100)
• Each document has a field least_value that represents the token value of index 0 in its "list"
• $push if list.99 does not exist; otherwise use findAndModify to create a new document atomically and retry the $push

Field Map

["Loyalty Program", "Shoe Size", "Favorite Color"]
(tokens: 100, 101, 102)
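The allocation logic can be sketched in memory (illustrative Node.js; the real version uses a `$push` guarded by `{"list.99": {$exists: false}}` plus `findAndModify`, and `addAttribute` is a made-up helper):

```javascript
// In-memory analogue of the capped field-map documents: each doc holds at
// most CAP attribute names, and least_value is the token of its index 0.
const CAP = 100;
const docs = []; // each: { least_value, list }

function addAttribute(name) {
  let doc = docs[docs.length - 1];
  if (!doc || doc.list.length >= CAP) {
    // analogue of findAndModify creating the next document atomically
    doc = { least_value: docs.length * CAP, list: [] };
    docs.push(doc);
  }
  doc.list.push(name);
  return doc.least_value + doc.list.length - 1; // the attribute's token
}
```

The 101st attribute spills into a second document whose `least_value` is 100, so tokens stay globally unique without any document growing unbounded.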
• Adds indirection and complexity, but worth it
• Small field name size in each document
• Compression in WiredTiger makes this a non-issue from a storage perspective, but tokens still have benefits for field names
• Easy identifiers to pass around in code for custom attributes
Field Map Summary
• Appboy customers run multivariate tests of message campaigns for a long duration
• Goal: in the shortest period of time, find the variation which we are statistically certain provides the highest conversion
• Customers check in on results and make a determination
Multivariate Testing
Think of it like you are at a row of slot machines, each with a random reward drawn from a distribution that is not known in advance. You need to maximize your reward.
Multi-arm Bandit Multivariate Testing
[Image: "Las Vegas slot machines". Licensed under CC BY-SA 3.0 via Wikipedia: http://en.wikipedia.org/wiki/File:Las_Vegas_slot_machines.jpg]
“[The bandit problem] was formulated during the [second world] war, and efforts to solve it so sapped the energies and minds of Allied analysts that the suggestion was made that the problem be dropped over Germany, as the ultimate instrument of intellectual sabotage.”
Multi-arm Bandit Multivariate Testing
- Peter Whittle, 1979
Appboy was inspired by a paper from U. Chicago Booth:
“Multi-armed bandit experiments in the online service economy”
Steven L. Scott, Harvard Ph.D., Senior Economic Analyst at Google
http://faculty.chicagobooth.edu/workshops/marketing/pdf/pdf/ExperimentsInTheServiceEconomy.pdf
Multi-arm Bandit Multivariate Testing
• Twice per day, Appboy automatically optimizes send distributions for each variation using the algorithm
• Requires a lot of observed data
• For each variation:
  • Unique recipients who received it
  • Conversion rate
  • Timeseries of this data
Multi-arm Bandit Multivariate Testing
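One standard way to turn those observed counts into send distributions is Thompson sampling over Beta posteriors, the approach described in the Scott paper the talk cites. The sketch below (illustrative Node.js; the sampler, priors, and function names are assumptions, not Appboy's code) allocates each variation a share proportional to how often it wins random posterior draws:

```javascript
// Standard normal via Box-Muller.
function randn() {
  let u = 0, v = 0;
  while (u === 0) u = Math.random();
  while (v === 0) v = Math.random();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// Gamma(shape, 1) via the Marsaglia-Tsang method.
function randGamma(shape) {
  if (shape < 1) return randGamma(shape + 1) * Math.pow(Math.random(), 1 / shape);
  const d = shape - 1 / 3, c = 1 / Math.sqrt(9 * d);
  for (;;) {
    const x = randn(), v = Math.pow(1 + c * x, 3);
    if (v <= 0) continue;
    const u = Math.random();
    if (Math.log(u) < 0.5 * x * x + d - d * v + d * Math.log(v)) return d * v;
  }
}

// Beta(a, b) as a ratio of two Gamma draws.
function randBeta(a, b) {
  const x = randGamma(a);
  return x / (x + randGamma(b));
}

// arms: [{recipients, conversions}] -> fraction of posterior draws each
// arm wins, used as its share of the next send.
function sendProbabilities(arms, draws = 2000) {
  const wins = arms.map(() => 0);
  for (let i = 0; i < draws; i++) {
    const samples = arms.map(a =>
      randBeta(1 + a.conversions, 1 + a.recipients - a.conversions)
    );
    wins[samples.indexOf(Math.max(...samples))]++;
  }
  return wins.map(w => w / draws);
}
```

A variation that clearly converts better quickly wins almost all draws, so it absorbs most of the traffic while weaker variations keep a small exploratory share.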
{
  company_id: BSON::ObjectId,
  campaign_id: BSON::ObjectId,
  date: 2015-05-31,
  message_variation_1: {
    unique_recipient_count: 100000,
    total_conversion_count: 5000,
    total_open_rate: 8000,
    hourly_breakdown: {
      0: {
        unique_recipient_count: 1000,
        total_conversion_count: 40,
        total_open_rate: 125,
        ...
      },
      ...
    },
    ...
  },
  message_variation_2: { ... }
}
Multi-arm Bandit Multivariate Testing
• Pre-aggregated stats let us pull back the entirety of an experiment extremely quickly
• Shard on company ID so we can pull back all of a company's campaigns at once and optimize them together
• Pre-aggregated stats need special care to build to avoid write overload
Multi-arm Bandit Multivariate Testing
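One way such a pre-aggregated document could be built is with a single atomic `$inc` update per event, touching both the daily total and the hourly bucket (illustrative Node.js; `conversionInc` is a hypothetical helper, and the field paths follow the schema above):

```javascript
// Build the $inc spec for one conversion event: bump the variation's
// daily total and its hourly_breakdown bucket in a single update.
function conversionInc(variation, hour) {
  return {
    $inc: {
      [variation + ".total_conversion_count"]: 1,
      [variation + ".hourly_breakdown." + hour + ".total_conversion_count"]: 1,
    },
  };
}

// Would be used roughly as:
// db.campaign_stats.update(
//   { campaign_id: ..., date: today },
//   conversionInc("message_variation_1", 13),
//   { upsert: true }
// )
```

Because every event is one small in-place update, reads stay cheap, but this is also exactly where the "special care to avoid write overload" comes in: every event for a campaign-day lands on the same document.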
• Appboy analyzes the optimal time to send a message to a user
• If Alice is more likely to engage at night and Bob in the morning, they'll get notifications in those windows
“Comparing overall open rates before and after using it, we've seen over 100% improvement in performance. Our one week retention campaigns targeted at male Urban On members improved 138%. Additionally, engaging a particularly difficult segment, users who have been inactive for three months, has improved 94%.”
- Jim Davis, Director of CRM and Interactive Marketing at Urban Outfitters
Intelligent Delivery
• The algorithm is data-intensive on a per-user basis
• Appboy Intelligent Delivery sends tens to hundreds of millions of messages each day; we need to compute the optimal time on a per-user basis quickly
Intelligent Delivery
{
  _id: BSON::ObjectId of user,
  dimension_1: [DateTime, DateTime, …],
  dimension_2: [DateTime, DateTime, …],
  dimension_3: [DateTime, DateTime, …],
  dimension_4: [Float, Float, …],
  dimension_5: […],
}
• When dimensional data for a user comes in, record a copy of it in a document
• Shard on {_id: "hashed"} for optimal distribution across shards and best write throughput
• When needing to Intelligently Deliver to a user, query back one document to get all the data to input into the algorithm. This is super fast.
• MongoDB's flexible schemas make adding new dimensions trivial
Intelligent Delivery
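To make the "one document, one fast lookup" point concrete: a deliberately trivial stand-in for the real algorithm could pick the modal engagement hour from one dimension of the per-user document (illustrative Node.js; `optimalHour`, the sample data, and the use of a single dimension are all assumptions, as the actual algorithm combines several dimensions):

```javascript
// Given the per-user document, tally engagement timestamps by UTC hour
// and return the busiest hour. Everything needed fits in one document.
function optimalHour(userDoc) {
  const counts = new Array(24).fill(0);
  for (const ts of userDoc.dimension_1) {
    counts[new Date(ts).getUTCHours()]++;
  }
  return counts.indexOf(Math.max(...counts));
}

const user = {
  _id: "u1",
  dimension_1: ["2015-05-30T21:10:00Z", "2015-05-30T21:40:00Z", "2015-05-31T08:00:00Z"],
};
console.log(optimalHour(user)); // -> 21
```

Whatever the real scoring function is, the schema guarantees it runs against data fetched in a single query, which is what keeps per-user computation fast at tens of millions of sends per day.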
• Consolidating data for fast retrieval is a huge win
• MongoDB's flexible schemas make this possible
• Choose the right shard key for the document access pattern
• (Not a catch-all; be sure to still store data in non-pre-aggregated form)
Data Intensive Algorithms Summary