retail reference architecture part 3: scalable insight component providing user history,...
DESCRIPTION
During this session we will cover the best practices for implementing the insight component with MongoDB. This includes efficiently ingesting and managing a large volume of user activity logs, such as clickstreams, views, likes and sales. We'll dive into how you can derive user statistics, product maps and trends using different analytics tools like the aggregation framework, map/reduce or the Hadoop connector. We will also cover operational considerations, including low-latency data ingestion and seamless aggregation queries.
TRANSCRIPT
Retail Reference Architecture with MongoDB
Antoine Girbal
Principal Solutions Engineer, MongoDB Inc.
@antoinegirbal
Introduction
MongoDB Overview
MongoDB Strategic Advantages
Horizontally Scalable - Sharding
Agile
Flexible
High Performance & Strong Consistency
Application
Highly Available - Replica Sets
{ customer: "roger", date: new Date(), comment: "Spirited Away", tags: ["Tezuka", "Manga"] }
Documents let you build your data to fit your application
MongoDB
{ customer_id: 1,
  name: "Mark Smith",
  city: "San Francisco",
  orders: [
    { order_number: 13,
      store_id: 10,
      date: "2014-01-03",
      products: [
        { SKU: 24578234, Qty: 3, Unit_price: 350 },
        { SKU: 98762345, Qty: 1, Unit_price: 110 }
      ]
    },
    { <...> }
  ]
}
Relational
CustomerID | First Name | Last Name | City
0 | John | Doe | New York
1 | Mark | Smith | San Francisco
2 | Jay | Black | Newark
3 | Meagan | White | London
4 | Edward | Danields | Boston
Order Number | Store ID | Product | Customer ID
10 | 100 | Tablet | 0
11 | 101 | Smartphone | 0
12 | 101 | Dishwasher | 0
13 | 200 | Sofa | 1
14 | 200 | Coffee table | 1
15 | 201 | Suit | 2
Notions
RDBMS MongoDB
Database Database
Table Collection
Row Document
Column Field
Architecture Overview
Information Management
Merchandising
Content
Inventory
Customer
Channel
Sales & Fulfillment
Insight
Social
Architecture Overview
Customer
Channels: Amazon, Ebay, …
Stores: POS, Kiosk, …
Mobile: Smartphone, Tablet
Website
Contact Center
API
Data and Service Integration
Social: Facebook, Twitter, …
Data Warehouse
Analytics
Supply Chain Management System
Suppliers
3rd Party
In Network
Web Servers
Application Servers
Commerce Functional Components
Information Layer
Look & Feel
Navigation
Customization
Personalization
Branding
Promotions
Chat
Ads
Customer's Perspective
Research: Browse, Search
Select: Shopping Cart
Purchase: Checkout
Receive: Track
Use: Feedback, Maintain
Dialog: Assist
Market / Offer
Guide
Offer
Semantic Search
Recommend
Rule-based Decisions
Pricing
Coupons
Sell / Fulfill
Orders
Payments
Fraud Detection
Fulfillment
Business Rules
Insight: Session Capture, Activity Monitoring
Customer Enterprise
Information Management
Merchandising
Content
Inventory
Customer
Channel
Sales & Fulfillment
Insight
Social
Merchandising
Merchandising
Merchandising
MongoDB
Product Variation
Product Hierarchy
Pricing
Promotions
Ratings & Reviews
Calendar
Semantic Search
Product Definition
Localization
• Single view of a product: Single scalable catalog service used by all services and channels
• Read volume is high and sustained
• Write volume spikes up during catalog update, but also allows real-time updating of a product
• Advanced indexing and querying is a requirement: find product by SKU, category, color, etc
• Geographical distribution and low latency achieved through replication
• Scaling achieved through sharding
Merchandising - principles
Merchandising - requirements
Requirement | Example Challenge | MongoDB
Single view of product | Blended description and hierarchy of product to ensure availability on all channels | Flexible document-oriented storage
High sustained read volume with low latency | Constant querying from online users and sales associates, requiring immediate response | Fast indexed querying, replication allows local copy of catalog, sharding for scaling
Spiky and real-time write volume | Bulk update of full catalog without impacting production, real-time touch update | Fast in-place updating, real-time indexing, sharding for scaling
Advanced querying | Find product based on color, size, description | Ad-hoc querying on any field, advanced secondary and compound indexing
Merchandising - Product Page
Product images
General Information
List of Variations
External Information
Localized Description
> db.definitions.findOne()
{ productId: "301671", // main product id
department: "Shoes",
category: "Shoes/Women/Pumps",
brand: "Guess",
thumbnail: "http://cdn…/pump.jpg",
image: "http://cdn…/pump1.jpg", // larger version of thumbnail
title: "Evening Platform Pumps",
description: "Those evening platform pumps put the perfect finishing touches on your most glamourous night-on-the-town outfit",
shortDescription: "Evening Platform Pumps",
style: "Designer",
type: "Platform",
rating: 4.5, // user rating
lastUpdated: Date("2014/04/01"), // last update time
… }
Merchandising - Product Definition
• Get item from Product Id
db.definitions.findOne( { productId: "301671" } )
• Get items from Product Ids
db.definitions.find( { productId: { $in: ["301671", "301672"] } } )
• Get items by department
db.definitions.find( { department: "Shoes" } )
• Get items by category prefix
db.definitions.find( { category: /^Shoes\/Women/ } )
• Indices
productId, department, category, lastUpdated
Merchandising - Product Definition
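The category prefix query relies on the category being stored as a path string, so an anchored regular expression selects a whole subtree. Below is a minimal sketch in plain JavaScript that runs without a database; the sample documents and the helper name findByCategoryPrefix are hypothetical and only mimic the shape of the product definitions.

```javascript
// Hypothetical sample documents, shaped like the product definitions above.
const definitions = [
  { productId: "301671", department: "Shoes", category: "Shoes/Women/Pumps" },
  { productId: "301672", department: "Shoes", category: "Shoes/Women/Flats" },
  { productId: "301700", department: "Shoes", category: "Shoes/Men/Boots" },
];

// Same anchored, case-sensitive regex as in the category prefix query.
const womenShoes = /^Shoes\/Women/;

// In-memory stand-in for db.definitions.find({ category: prefix }).
function findByCategoryPrefix(docs, prefix) {
  return docs.filter((doc) => prefix.test(doc.category));
}

console.log(findByCategoryPrefix(definitions, womenShoes).map((d) => d.productId));
```

Because the pattern is anchored with ^ and case sensitive, MongoDB can typically serve it from the index on category rather than scanning every document.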
> db.variations.findOne()
{
_id: "730223104376", // the sku
productId: "301671", // references product id
thumbnail: "http://cdn…/pump-red.jpg",
image: "http://cdn…/pump-red.jpg", // larger version of thumbnail
size: 6.0,
color: "Red",
width: "B",
heelHeight: 5.0,
lastUpdated: Date("2014/04/01"), // last update time
…
}
Merchandising - Product Variation
• Get Variation from SKU
db.variations.find( { _id: "730223104376" } )
• Get all variations for a product, sorted by SKU
db.variations.find( { productId: "301671" } ).sort( { _id: 1 } )
• Indices
productId, lastUpdated
Merchandising - Product Variation
Price: {
_id: "sku730223104376_store123",
currency: "USD",
price: 89.95,
lastUpdated: Date("2014/04/01"), // last update time
…
}
_id: concatenation of item and store.
Store: can be a store group or store id.
Item: can be an item id or sku
Indices: lastUpdated
Merchandising – Pricing
• Get all prices for a given item
db.prices.find( { _id: /^p301671_/ } )
• Get all prices for a given sku (price could be at item level)
db.prices.find( { _id: { $in: [ /^sku730223104376_/, /^p301671_/ ] } } )
• Get minimum and maximum prices for a sku
db.prices.aggregate( [ { $match: { _id: /^sku730223104376_/ } },
{ $group: { _id: 1, min: { $min: "$price" }, max: { $max: "$price" } } } ] )
• Get price for a sku and store id (returns up to 4 prices)
db.prices.find( { _id: { $in: [ "sku730223104376_store1234",
"sku730223104376_sgroup0",
"p301671_store1234",
"p301671_sgroup0" ] } }, { price: 1 } )
Merchandising - Pricing
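The four-id lookup above returns up to four price documents, and the application then has to pick one. Here is a minimal sketch of that step in plain JavaScript. The function names are hypothetical, and the precedence order (sku before item, store before store group) is an assumption of this sketch, not something the slides state.

```javascript
// Build the four candidate _id values, most specific first.
// ASSUMPTION: this precedence order is illustrative only.
function candidatePriceIds(sku, productId, storeId, storeGroupId) {
  return [
    `sku${sku}_store${storeId}`,
    `sku${sku}_sgroup${storeGroupId}`,
    `p${productId}_store${storeId}`,
    `p${productId}_sgroup${storeGroupId}`,
  ];
}

// Pick the most specific price among the documents the $in query returned.
function resolvePrice(candidateIds, priceDocs) {
  const byId = new Map(priceDocs.map((doc) => [doc._id, doc]));
  for (const id of candidateIds) {
    if (byId.has(id)) return byId.get(id);
  }
  return null; // no price defined at any level
}

const ids = candidatePriceIds("730223104376", "301671", "1234", "0");

// Hypothetical result set: only two of the four levels have a price.
const found = [
  { _id: "p301671_store1234", price: 92.5 },
  { _id: "sku730223104376_sgroup0", price: 89.95 },
];

console.log(resolvePrice(ids, found));
```

Keeping the precedence in the application means the database query stays a single indexed lookup on _id.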
• The hierarchy of items typically follows:
  • Company
    – Division
      • Department: Women's shoe store
        – Class: Pumps
          » Item: Guess classic pump
            • Variation: size 6 black
Merchandising – Product Hierarchy
Merchandising – Browse and Search products
Browse by category
Special Lists
Filter by attributes
Lists hundreds of item summaries
Ideally a single query is issued to the database to obtain all items and metadata to display
The previous page presents many challenges:
• Response is needed within milliseconds for hundreds of items
• Faceted search on many attributes of an item: department, brand, category, etc
• Attributes to match may be at the variation level: color, size, etc, in which case the variation should be shown
• One item may have thousands of variations. Only one item should be displayed even if many variations match
• Efficient sorting on several attributes: price, popularity
• Pagination feature which requires deterministic ordering
Merchandising – Browse and Search products
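The pagination requirement in the list above can be illustrated with a small sketch: sorting on price alone is not deterministic when prices tie, so a unique field is added as a tie-breaker. The code below is plain JavaScript over hypothetical in-memory data; the function names are made up for illustration.

```javascript
// Sort by price, then by _id as a tie-breaker so that the ordering is total.
function compareByPriceThenId(a, b) {
  if (a.price !== b.price) return a.price - b.price;
  return a._id < b._id ? -1 : a._id > b._id ? 1 : 0;
}

// Return one page of results; pageNumber starts at 0.
function getPage(items, pageNumber, pageSize) {
  const sorted = [...items].sort(compareByPriceThenId);
  return sorted.slice(pageNumber * pageSize, (pageNumber + 1) * pageSize);
}

// Hypothetical summaries; p2 and p3 share the same price.
const pageItems = [
  { _id: "p3", price: 99.0 },
  { _id: "p1", price: 45.5 },
  { _id: "p2", price: 99.0 },
  { _id: "p4", price: 120.0 },
];

console.log(getPage(pageItems, 0, 2).map((s) => s._id));
console.log(getPage(pageItems, 1, 2).map((s) => s._id));
```

In the shell, the same idea would look like a sort on { price: 1, _id: 1 } combined with skip and limit.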
Merchandising – Browse and Search products
Hundreds of sizes
One Item
Dozens of colors
A single item may have thousands of variations
Merchandising – Browse and Search products
Images of the matching variations are displayed
Hierarchy
Sort parameter
Faceted Search
Merchandising – Traditional Architecture
Relational DB
System of Records
Full Text Search Engine
Indexing
#1 obtain search results IDs
Application
Cache
#2 obtain objects by ID
Pre-joined into objects
The traditional architecture presents issues:
• 3 different systems to maintain: RDBMS, Search engine, Caching layer
• A search returns a list of IDs which then are looked up in the cache as a batch or one by one. It significantly increases latency of response
• RDBMS schema is complex and static
• The search index needs to be refreshed at intervals
• Setup does not allow efficient pagination
Merchandising – Traditional Architecture
MongoDB Data Store
Merchandising - Architecture
Product Summaries
Product Definitions
Pricing
Promotions
Product Variations
Ratings & Reviews
#1 Obtain results
The product index relies on the following parameters:
• The department (required): the main component of category, e.g. "Shoes"
• An indexed attribute (optional)
– Category path, e.g. "Shoes/Women/Pumps"
– Price range (based on online prices)
– List of Item Attributes, e.g. Brand = Guess
– List of Variation Attributes, e.g. Color = red
• A non-indexed attribute (optional)
– List of Item Secondary Attributes, e.g. Style = Designer
– List of Variation Secondary Attributes, e.g. heel height = 5.0
• As well as Sorting, e.g. Price Low to High
Merchandising – Product Summaries
> db.summaries.findOne()
{ "_id": "p39",
"title": "Evening Platform Pumps 39",
"department": "Shoes", "category": "Shoes/Women/Pumps",
"thumbnail": "http://cdn…/pump-small-39.jpg", "image": "http://cdn…/pump-39.jpg",
"price": 145.99,
"rating": 0.95,
"attrs": [ { "brand" : "Guess"}, … ],
"sattrs": [ { "style" : "Designer"} , { "type" : "Platform"}, …],
"vars": [
{ "sku": "sku2441",
"thumbnail": "http://cdn…/pump-small-39.jpg.Blue",
"image": "http://cdn…/pump-39.jpg.Blue",
"attrs": [ { "size": 6.0 }, { "color": "Blue" }, …],
"sattrs": [ { "width" : "B"} , { "heelHeight" : 5.0 }, …],
}, … Many more skus …
] }
Indices: vars.sku, department + attrs + category, department + vars.attrs + category,
department + category, department + price, department + rating
Merchandising – Product Summaries
• Get summary from item id
db.summaries.find( { _id: "p301671" } )
• Get summary's specific variation from SKU
db.summaries.find( { "vars.sku": "730223104376" }, { "vars.$": 1 } )
• Get summary by department, sorted by rating
db.summaries.find( { department: "Shoes" } ).sort( { rating: 1 } )
• Get summary with mix of parameters
db.summaries.find( { department: "Shoes", "vars.attrs": { "color": "Gray" },
"category": /^Shoes\/Women/, "price": { "$gte": 65.99, "$lte": 180.99 } } )
Merchandising - Product Summaries
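The "mix of parameters" query is usually assembled from whatever facets the shopper selected. The following is a minimal sketch of such a builder in plain JavaScript. The function name buildSummaryQuery and the facet option names are hypothetical; only the resulting field names (department, category, price, vars.attrs) come from the summary document above.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a query object for the summaries collection from selected facets.
function buildSummaryQuery(facets) {
  const query = { department: facets.department }; // department is required
  if (facets.categoryPrefix) {
    query.category = new RegExp("^" + escapeRegex(facets.categoryPrefix));
  }
  if (facets.minPrice !== undefined || facets.maxPrice !== undefined) {
    query.price = {};
    if (facets.minPrice !== undefined) query.price.$gte = facets.minPrice;
    if (facets.maxPrice !== undefined) query.price.$lte = facets.maxPrice;
  }
  if (facets.variationAttr) {
    query["vars.attrs"] = facets.variationAttr; // e.g. { color: "Gray" }
  }
  return query;
}

const facetQuery = buildSummaryQuery({
  department: "Shoes",
  categoryPrefix: "Shoes/Women",
  minPrice: 65.99,
  maxPrice: 180.99,
  variationAttr: { color: "Gray" },
});
console.log(facetQuery);
```

The resulting object has the same shape as the last query above and could be passed to find() on the summaries collection.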
Merchandising – Query stats
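Department | Category | Price | Primary attribute | Average time (ms) | 90th (ms) | 95th (ms)
1 | 0 | 0 | 0 | 2 | 3 | 3
1 | 1 | 0 | 0 | 1 | 2 | 2
1 | 0 | 1 | 0 | 1 | 2 | 3
1 | 1 | 1 | 0 | 1 | 2 | 2
1 | 0 | 0 | 1 | 0 | 1 | 2
1 | 1 | 0 | 1 | 0 | 1 | 1
1 | 0 | 1 | 1 | 1 | 2 | 2
1 | 1 | 1 | 1 | 0 | 1 | 1
1 | 0 | 0 | 2 | 1 | 3 | 3
1 | 1 | 0 | 2 | 0 | 2 | 2
1 | 0 | 1 | 2 | 10 | 20 | 35
1 | 1 | 1 | 2 | 0 | 1 | 1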
Content
Content
Content
MongoDB
Metadata
Asset Repository
Digital Right Mgt
Access Control
Processing / Encoding
Inventory
Inventory
Inventory
MongoDB
External Inventory
Internal Inventory
Regional Inventory
Purchase Orders
Fulfillment
Promotions
Demonstration Document Model
Definitions
• id: p0
Variations
• id: sku0
• pId: p0
Summary
• id: p0
• vars: [sku0, sku1, …]
Stores
• id: s1
• Loc: [22, 33]
Inventory
• store: s1
• pId: p0
• vars: [{sku: sku0, q: 3}, {sku: sku2, q: 2}]
Product
db.stores.findOne()
{ "_id" : ObjectId("53549fd3e4b0aaf5d6d07f35"),
"className" : "catalog.Store",
"storeId" : "store0",
"name" : "Bessemer store",
"address" : {
"addr1" : "1st Main St",
"city" : "Bessemer",
"state" : "AL",
"zip" : "12345",
"country" : "US"
},
"location" : [
-86.95444,
33.40178
]
… }
Inventory - Stores
• Get a store by storeId
db.stores.find( { storeId: "store0" } )
• Get nearby stores sorted by distance
db.runCommand( { "geoNear": "stores", "near": [ -82.800672, 40.090844 ], "maxDistance": 10.0, "spherical": true } )
Inventory - Stores
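To make "nearby stores sorted by distance" concrete, here is a minimal in-memory sketch using the haversine formula. It illustrates what a nearest-first result represents, not how MongoDB computes it internally. The store0 coordinates come from the sample document above; store1, store2 and the function names are hypothetical.

```javascript
const EARTH_RADIUS_KM = 6371;

// Great-circle distance between two [longitude, latitude] pairs, in kilometers.
function haversineKm([lon1, lat1], [lon2, lat2]) {
  const toRad = (deg) => (deg * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
}

// Keep stores within maxKm of the point, nearest first.
function nearestStores(stores, point, maxKm) {
  return stores
    .map((store) => ({ storeId: store.storeId, km: haversineKm(point, store.location) }))
    .filter((result) => result.km <= maxKm)
    .sort((x, y) => x.km - y.km);
}

const sampleStores = [
  { storeId: "store0", location: [-86.95444, 33.40178] },
  { storeId: "store1", location: [-82.9, 40.1] }, // hypothetical
  { storeId: "store2", location: [-83.5, 40.0] }, // hypothetical
];

console.log(nearestStores(sampleStores, [-82.800672, 40.090844], 100));
```

Note the [longitude, latitude] order, which matches the location arrays stored in the documents.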
> db.inventory.findOne()
{ "_id": "5354869f300487d20b2b011d",
"storeId": "store0",
"location": [
-86.95444,
33.40178
],
"productId": "p0",
"vars": [
{ "sku": "sku1", "q": 14 },
{ "sku": "sku3", "q": 7 },
{ "sku": "sku7", "q": 32 },
{ "sku": "sku14", "q": 65 },
...
]
}
Inventory - Quantities
• Get all items in a store
db.inventory.find( { storeId: "store100" } )
• Get quantity for an item at a store
db.inventory.find( { storeId: "store100", productId: "p200" } )
• Get quantity for a sku at a store
db.inventory.find( { storeId: "store100", productId: "p200", "vars.sku": "sku11736" }, { "vars.$": 1 } )
• Increment / decrement inventory for an item at a store
db.inventory.update( { storeId: "store100", productId: "p200", "vars.sku": "sku11736" }, { $inc: { "vars.$.q": 20 } } )
• Indices: productId, storeId + productId, location (geo) + productId
Inventory - Stores
• Aggregate total quantity for an item
db.inventory.aggregate( [ { $match: { productId: "p200" } }, { $unwind: "$vars" }, { $group: { _id: "result", count: { $sum: 1 } } } ] )
{ "_id" : "result", "count" : 101752 }
• Aggregate total quantity for a store
db.inventory.aggregate( [ { $match: { storeId: "store100" } }, { $unwind: "$vars" }, { $group: { _id: "result", count: { $sum: 1 } } } ] )
{ "_id" : "result", "count" : 29347 }
Note: as written, { $sum: 1 } counts the unwound variation entries; use { $sum: "$vars.q" } to total the quantities.
Inventory - Stores
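A minimal in-memory sketch of the $match, $unwind, $group pipeline above, in plain JavaScript. It reports both the number of unwound entries and the summed q values, so the difference between the two accumulators is visible. The sample documents and the function name summarize are hypothetical.

```javascript
// Hypothetical inventory documents for one product across two stores.
const inventory = [
  { storeId: "store100", productId: "p200", vars: [{ sku: "sku1", q: 14 }, { sku: "sku3", q: 7 }] },
  { storeId: "store101", productId: "p200", vars: [{ sku: "sku1", q: 32 }] },
  { storeId: "store100", productId: "p300", vars: [{ sku: "sku9", q: 5 }] },
];

function summarize(docs, productId) {
  const unwound = docs
    .filter((doc) => doc.productId === productId) // $match
    .flatMap((doc) => doc.vars); // $unwind: "$vars"
  return {
    entries: unwound.length, // what { $sum: 1 } yields
    quantity: unwound.reduce((sum, v) => sum + v.q, 0), // what { $sum: "$vars.q" } yields
  };
}

console.log(summarize(inventory, "p200"));
```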
• Get inventory for an item near a point
db.runCommand( { "geoNear": "inventory", "near": [ -82.800672, 40.090844 ], "maxDistance": 10.0, "spherical": true, limit: 10, query: { productId: "p200", "vars.sku": "sku11736" } } )
• Get closest store with available sku
db.runCommand( { "geoNear": "inventory", "near": [ -82.800672, 40.090844 ], "maxDistance": 10.0, "spherical": true, limit: 10, query: { productId: "p200", vars: { $elemMatch: { "sku": "sku11736", q: { $gt: 0 } } } } } )
Inventory - Stores
Customer
Customer
Customer
MongoDB
Profile
Market Segment
Demographics
Wish List
Preference
Inbox
Sales / Support Chat
Content Subscription
Channels
Channels
Channels
MongoDB
Location
Store
Assortment
Point of Sale
Channel Definition
Planogram
Sales & Fulfillment
Sales & Fulfillment
Sales & Fulfillment
MongoDB
Sales Transaction
Shipping
Tracking
Return & Exchange
Business Rule
Audit
Shopping Cart
Insight
Insight
Insight
MongoDB
Advertising metrics
Clickstream
Recommendations
Session Capture
Activity Logging
Geo Tracking
Product Analytics
Customer Insight
Application Logs
• Many user activities can be of interest:
– Search
– Product view, like or wish
– Shopping cart add / remove
– Sharing on social network
– Ad impression, Clickstream
• Those will be used to compute:
– Product Map (relationships, etc)
– User Preferences
– Recommendations
– Trends
Activity Logging – Data of interest
Activity logging - Architecture
MongoDB
HVDF API
Activity Logging
User History
External Analytics: Hadoop, Spark, Storm, …
User Preferences
Recommendations
Trends
Product Map
Apps
Internal Analytics: Aggregation, MR
All user activity is recorded
MongoDB – Hadoop Connector
Personalization
Activity Logging
• You need to store and manage an incoming stream of data samples (views, impressions, orders, …)
– High arrival rate of data from many sources
– Variable schema of arriving data
– You need to control retention period of data
• You need to compute derivative data sets based on these samples
– Aggregations and statistics based on data
– Roll-up data into pre-computed reports and summaries
• You need low latency access to up-to-date data (user history)
– Flexible indexing of raw and derived data sets
– Rich querying based on time + meta-data fields in samples
Activity Logging – Problem statement
Activity logging - Requirements
Requirement | MongoDB
Ingestion of 100ks of writes / sec | Fast C++ process, multi-threads, multi-locks. Horizontal scaling via sharding. Sequential IO via time partitioning.
Flexible schema | Dynamic schema, each document is independent. Data is stored in the same format and size as it is inserted.
Fast querying on varied fields, sorting | Secondary B-tree indexes can look up and sort the data in milliseconds.
Easy clean up of old data | Deletes are typically as expensive as inserts. Time partitioning makes deletes essentially free.
Activity Logging using HVDF
HVDF (High Volume Data Feed):
• Open source reference implementation of high volume writing with MongoDB
• REST API server written in Java with most popular libraries
• Public project, issues can be logged
• Can be run as-is, or customized as needed
Feed
High volume data feed architecture
Channel
Sample Sample Sample Sample
Source
Source
Processor
Inline Processing
Batch Processing
Stream Processing
The Channel is the sequence of data samples that a sensor sends into the platform.
Sources send samples into the Channel
Processors generate derivative Channels from other Channel data
HVDF -- High Volume Data Feed engine
HVDF – Reference implementation
REST Service API
Processor Plugins
Inline
Batch
Stream
Channel Data Storage
Raw Channel
Data
Aggregated Rollup T1
Aggregated Rollup T2
Query Processor Streaming spout
Custom Stream Processing Logic
Incoming Sample Stream
POST /feed/channel/data
GET /feed/channel/data?time=XXX&range=YYY
Real-time Queries
{ _id: ObjectId(),
geoCode: 1, // used to localize write operations
sessionId: "2373BB…",
device: { id: "1234",
type: "mobile/iphone",
userAgent: "Chrome/34.0.1847.131"
},
type: "VIEW|CART_ADD|CART_REMOVE|ORDER|…", // type of activity
itemId: "301671",
sku: "730223104376",
order: { id: "12520185",
… },
location: [ -86.95444, 33.40178 ],
tags: [ "smartphone", "iphone", … ], // associated tags
timeStamp: Date("2014/04/01 …")
}
User Activity - Model
Dynamic schema for sample data
Sample 1 { deviceId: XXXX, time: Date(…), type: "VIEW", … }
Channel
Sample 2 { deviceId: XXXX, time: Date(…), type: "CART_ADD", cartId: 123, … }
Sample 3 { deviceId: XXXX, time: Date(…), type: "FB_LIKE" }
Each sample can have variable fields
Channels are sharded
Shard
Shard
Shard
Shard
Shard
Shard Key: customer_id
Sample { customer_id: XXXX, time: Date(…), type: "VIEW" }
Channel
You choose how to partition samples
Samples can have dynamic schema
Scale horizontally by adding shards
Each shard is highly available
Channels are time partitioned
Channel
Sample Sample Sample Sample Sample Sample Sample Sample
- 2 days - 1 Day Today
Partitioning keeps indexes manageable
This is where all of the writes happen
Older partitions are read only for best possible concurrency
Queries are routed only to needed partitions
Partition 1 Partition 2 Partition N
Each partition is a separate collection
Efficient, space-reclaiming purging of old data
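The time-partitioning idea can be sketched in a few lines of plain JavaScript: a timestamp maps to a per-day collection name, a time-range query only touches the partitions it overlaps, and retention is enforced by dropping whole collections. The "<channel>_<YYYYMMDD>" naming scheme and all function names below are hypothetical and are not taken from HVDF.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Name of the collection holding samples for the UTC day of the given date.
// ASSUMPTION: the "<channel>_<YYYYMMDD>" scheme is made up for this sketch.
function partitionName(channel, date) {
  const day = date.toISOString().slice(0, 10).replace(/-/g, "");
  return `${channel}_${day}`;
}

// Partitions that a query over [from, to] needs to touch.
function partitionsForRange(channel, from, to) {
  const names = [];
  const start = Math.floor(from.getTime() / DAY_MS) * DAY_MS; // UTC midnight
  for (let t = start; t <= to.getTime(); t += DAY_MS) {
    names.push(partitionName(channel, new Date(t)));
  }
  return names;
}

// Partitions older than the retention window, candidates for dropping.
function expiredPartitions(existing, channel, now, retentionDays) {
  const cutoff = partitionName(channel, new Date(now.getTime() - retentionDays * DAY_MS));
  return existing.filter((name) => name < cutoff);
}

const now = new Date("2014-04-01T10:00:00Z");
console.log(partitionName("activity", now));
console.log(partitionsForRange("activity", new Date("2014-03-30T22:00:00Z"), now));
console.log(expiredPartitions(["activity_20140329", "activity_20140330", "activity_20140401"], "activity", now, 2));
```

Dropping an expired partition is then a single collection drop rather than many per-document deletes, which is where the cheap clean-up comes from.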
Dynamic queries on Channels
Channel
Sample Sample Sample Sample
App App App
Indexes
Queries Pipelines Map-Reduce
Create custom indexes on Channels
Use full MongoDB query language to access samples
Use MongoDB aggregation pipelines to access samples
Use MongoDB inline map-reduce to access samples
Full access to field, text, and geo indexing
North America - West
North America - East
Europe
Geographically distributed system
Channel
Sample Sample Sample Sample
Source
Source
Source
Source
Source
Source
Sample
Sample
Sample
Sample
Geo shards per location
Clients write to local nodes
Single view of channel available globally
Insight
Insight – Useful Data
• Useful data for better shopping:
– User history (e.g. recently seen products)
– User statistics (e.g. total purchases, visits)
– User interests (e.g. likes videogames and SciFi)
– User social network
– Cross-selling: people who bought this item had a tendency to buy those other items (e.g. bought an iPhone, then an iPhone case)
– Up-selling: people who looked at this item eventually bought those items (alternative product that may be better)
Example of real-time aggregation with Agg Framework
User Activity – Computing User Stats
Let's simplify each activity recorded as the following:
{ userId: 123, type: order, itemId: 2, time }
{ userId: 123, type: order, itemId: 3, time }
{ userId: 234, type: order, itemId: 7, time }
To calculate items bought by a user for a period of time, let's use MongoDB's Map Reduce:
- Match activities of type "order" for the past 2 weeks
- map: emit the document by userId
- reduce: push all itemId in a list
- Output looks like { _id: userId, items: [2, 3, 8] }
User Activity – Items frequently bought together
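The first job described above can be sketched in plain JavaScript, emulating the match, map and reduce steps in memory. This is an illustration of the logic only, not the mapReduce call itself; the sample records, dates and the function name itemsBoughtPerUser are hypothetical.

```javascript
// Hypothetical simplified activity records.
const activities = [
  { userId: 123, type: "order", itemId: 2, time: new Date("2014-04-01") },
  { userId: 123, type: "order", itemId: 3, time: new Date("2014-04-02") },
  { userId: 234, type: "order", itemId: 7, time: new Date("2014-04-03") },
  { userId: 234, type: "view", itemId: 9, time: new Date("2014-04-03") }, // not an order
  { userId: 123, type: "order", itemId: 8, time: new Date("2014-01-01") }, // too old
];

function itemsBoughtPerUser(docs, since) {
  const byUser = new Map();
  for (const doc of docs) {
    if (doc.type !== "order" || doc.time < since) continue; // match step
    if (!byUser.has(doc.userId)) byUser.set(doc.userId, []); // map: key by userId
    byUser.get(doc.userId).push(doc.itemId); // reduce: collect itemIds in a list
  }
  return [...byUser].map(([userId, items]) => ({ _id: userId, items }));
}

const today = new Date("2014-04-10");
const twoWeeksAgo = new Date(today.getTime() - 14 * 24 * 60 * 60 * 1000);
console.log(JSON.stringify(itemsBoughtPerUser(activities, twoWeeksAgo)));
```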
Then run a 2nd mapreduce job that for each of the previous results:
- map: emits every combination of 2 items, starting with lowest itemId
- reduce: sum up the total.
- output looks like { _id: { a: 2, b: 3 } , count: 36 }
User Activity – Items frequently bought together
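The second job can be sketched the same way: for each user's item list, emit every pair with the lowest itemId first, then sum the occurrences of each pair. The input data and the function name countPairs are hypothetical, and deduplicating a user's items before pairing is an assumption of this sketch.

```javascript
// Hypothetical output of the first job.
const baskets = [
  { _id: 123, items: [2, 3, 8] },
  { _id: 234, items: [3, 2] },
  { _id: 345, items: [7] },
];

function countPairs(userItems) {
  const counts = new Map();
  for (const { items } of userItems) {
    // ASSUMPTION: each item counts once per user, so duplicates are removed.
    const unique = [...new Set(items)].sort((x, y) => x - y);
    for (let i = 0; i < unique.length; i++) {
      for (let j = i + 1; j < unique.length; j++) {
        const key = `${unique[i]}:${unique[j]}`; // map: emit pair, lowest id first
        counts.set(key, (counts.get(key) || 0) + 1); // reduce: sum up the total
      }
    }
  }
  return [...counts].map(([key, count]) => {
    const [a, b] = key.split(":").map(Number);
    return { _id: { a, b }, count };
  });
}

console.log(JSON.stringify(countPairs(baskets)));
```

Ordering each pair by itemId ensures that (2, 3) and (3, 2) land on the same key.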
The output collection can then be queried per item id, sorted by count, and cut off at a threshold.
This needs indexes on { "_id.a", count } and { "_id.b", count }.
You then obtain an affiliation collection with docs like:
{ itemId: 2, affil: [ { id: 3, weight: 36}, { id: 8, weight: 23} ] }
User Activity – Items frequently bought together
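Turning the pair counts into per-item affiliation documents can be sketched as follows, again in plain JavaScript over in-memory data. The counts reuse the figures from the slides where available; the third pair, the threshold value and the function name buildAffiliations are hypothetical.

```javascript
// Pair counts shaped like the output of the second job.
const pairCounts = [
  { _id: { a: 2, b: 3 }, count: 36 },
  { _id: { a: 2, b: 8 }, count: 23 },
  { _id: { a: 3, b: 8 }, count: 4 }, // hypothetical, below the threshold
];

function buildAffiliations(pairs, minCount) {
  const byItem = new Map();
  const add = (itemId, otherId, weight) => {
    if (!byItem.has(itemId)) byItem.set(itemId, []);
    byItem.get(itemId).push({ id: otherId, weight });
  };
  for (const { _id, count } of pairs) {
    if (count < minCount) continue; // cut off at the threshold
    add(_id.a, _id.b, count); // a pair is relevant to both of its items
    add(_id.b, _id.a, count);
  }
  return [...byItem].map(([itemId, affil]) => ({
    itemId,
    affil: affil.sort((x, y) => y.weight - x.weight), // strongest first
  }));
}

console.log(JSON.stringify(buildAffiliations(pairCounts, 10)));
```

Each pair contributes to both items, which is why the slides call for an index on both "_id.a" and "_id.b".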
Example of Hadoop integration
User Activity – Hadoop integration
Social
Social
Social
MongoDB
Social Channels
User Network
Activity
Chat
Social Profiles
Community Mgt
Rewards / Gamification
Conclusion
Appendix
Single View of Product Cluster Topology
[Diagram: West DC, Center DC, East DC; shards "West", "Center" and "East", each with a Primary]
• Primary node replicates data to all secondaries in the shard as fast as possible
• Center Shard contains all the data for stores in Center region
• Local writes enable very high throughput of updates
• Each region is able to see the data of all stores from its "local" DC
• Two nodes in each DC for painless maintenance with zero downtime
• Even if a DC goes out, the database remains fully available thanks to automated failover
• Data set can grow, shards can add up, without any rewrite of the application code
Thank You!
Antoine Girbal
Senior Solutions Engineer, MongoDB Inc.
@antoinegirbal