retail reference architecture part 3: scalable insight component providing user history,...

Retail Reference Architecture with MongoDB

Antoine Girbal, Principal Solutions Engineer, MongoDB Inc. @antoinegirbal

Introduction

MongoDB Overview

4

MongoDB Strategic Advantages

Horizontally Scalable - Sharding

Agile, Flexible

High Performance & Strong Consistency

Application

Highly Available - Replica Sets

{ customer: "roger", date: new Date(), comment: "Spirited Away", tags: ["Tezuka", "Manga"] }

5

Documents let you build your data to fit your application

Relational

CustomerID  First Name  Last Name  City
0           John        Doe        New York
1           Mark        Smith      San Francisco
2           Jay         Black      Newark
3           Meagan      White      London
4           Edward      Danields   Boston

Order Number  Store ID  Product       Customer ID
10            100       Tablet        0
11            101       Smartphone    0
12            101       Dishwasher    0
13            200       Sofa          1
14            200       Coffee table  1
15            201       Suit          2

MongoDB

{ customer_id: 1,
  name: "Mark Smith",
  city: "San Francisco",
  orders: [
    { order_number: 13,
      store_id: 10,
      date: "2014-01-03",
      products: [
        { SKU: 24578234, Qty: 3, Unit_price: 350 },
        { SKU: 98762345, Qty: 1, Unit_price: 110 }
      ]
    },
    { <...> }
  ]
}

6

Notions

RDBMS     MongoDB
Database  Database
Table     Collection
Row       Document
Column    Field

Architecture Overview

8

Information Management

Merchandising

Content

Inventory

Customer

Channel

Sales & Fulfillment

Insight

Social

Architecture Overview

Customer

• Channels: Amazon, eBay, …
• Stores: POS, Kiosk, …
• Mobile: Smartphone, Tablet
• Website
• Contact Center
• API

Data and Service Integration

• Social: Facebook, Twitter, …
• Data Warehouse
• Analytics
• Supply Chain Management System
• Suppliers: 3rd Party, In Network

Web Servers

Application Servers

9

Commerce Functional Components

Information Layer: Look & Feel, Navigation, Customization, Personalization, Branding, Promotions, Chat, Ads

Customer's Perspective:

• Research: Browse, Search
• Select: Shopping Cart
• Purchase: Checkout
• Receive: Track
• Use: Feedback, Maintain
• Dialog: Assist

Market / Offer: Guide, Offer, Semantic Search, Recommend, Rule-based Decisions, Pricing, Coupons

Sell / Fulfill: Orders, Payments, Fraud Detection, Fulfillment, Business Rules

Insight: Session Capture, Activity Monitoring

Customer, Enterprise

Information Management

Merchandising

Content

Inventory

Customer

Channel

Sales & Fulfillment

Insight

Social

Merchandising

11

Merchandising

Merchandising

MongoDB

Product Variation

Product Hierarchy

Pricing

Promotions

Ratings & Reviews

Calendar

Semantic Search

Product Definition

Localization

12

• Single view of a product: Single scalable catalog service used by all services and channels

• Read volume is high and sustained

• Write volume spikes up during catalog update, but also allows real-time updating of a product

• Advanced indexing and querying is a requirement: find product by SKU, category, color, etc

• Geographical distribution and low latency achieved through replication

• Scaling achieved through sharding

Merchandising - principles

13

Merchandising - requirements

Requirement | Example Challenge | MongoDB

Single view of product | Blended description and hierarchy of product to ensure availability on all channels | Flexible document-oriented storage

High sustained read volume with low latency | Constant querying from online users and sales associates, requiring immediate response | Fast indexed querying, replication allows local copy of catalog, sharding for scaling

Spiky and real-time write volume | Bulk update of full catalog without impacting production, real-time touch update | Fast in-place updating, real-time indexing, sharding for scaling

Advanced querying | Find product based on color, size, description | Ad-hoc querying on any field, advanced secondary and compound indexing

14

Merchandising - Product Page

Product images

General Information

List of Variations

External Information

Localized Description

15

> db.definitions.findOne()

{ productId: "301671", // main product id

department: "Shoes",

category: "Shoes/Women/Pumps",

brand: "Guess",

thumbnail: "http://cdn…/pump.jpg",

image: "http://cdn…/pump1.jpg", // larger version of thumbnail

title: "Evening Platform Pumps",

description: "These evening platform pumps put the perfect finishing touches on your most glamorous night-on-the-town outfit",

shortDescription: "Evening Platform Pumps",

style: "Designer",

type: "Platform",

rating: 4.5, // user rating

lastUpdated: new Date("2014/04/01"), // last update time

… }

Merchandising - Product Definition

16

• Get item from Product Id

db.definitions.findOne( { productId: "301671" } )

• Get items from Product Ids

db.definitions.find( { productId: { $in: ["301671", "301672"] } } )

• Get items by department

db.definitions.find( { department: "Shoes" } )

• Get items by category prefix

db.definitions.find( { category: /^Shoes\/Women/ } )

• Indices

productId, department, category, lastUpdated

Merchandising - Product Definition

17

> db.variations.findOne()

{

_id: "730223104376", // the sku

productId: "301671", // references product id

thumbnail: "http://cdn…/pump-red.jpg",

image: "http://cdn…/pump-red.jpg", // larger version of thumbnail

size: 6.0,

color: "Red",

width: "B",

heelHeight: 5.0,


…

}

Merchandising - Product Variation

18

• Get Variation from SKU

db.variations.find( { _id: "730223104376" } )

• Get all variations for a product, sorted by SKU

db.variations.find( { productId: "301671" } ).sort( { _id: 1 } )

• Indices

productId, lastUpdated

Merchandising - Product Variation

20

Price: {

_id: "sku730223104376_store123",

currency: "USD",

price: 89.95,


…

}

_id: concatenation of item and store.

Store: can be a store group or store id.

Item: can be an item id or sku

Indices: lastUpdated

Merchandising – Pricing

21

• Get all prices for a given item

db.prices.find( { _id: /^p301671_/ } )

• Get all prices for a given sku (price could be at item level)

db.prices.find( { _id: { $in: [ /^sku730223104376_/, /^p301671_/ ] } } )

• Get minimum and maximum prices for a sku

db.prices.aggregate( [
    { $match: { _id: { $in: [ /^sku730223104376_/, /^p301671_/ ] } } },
    { $group: { _id: 1, min: { $min: "$price" }, max: { $max: "$price" } } } ] )

• Get price for a sku and store id (returns up to 4 prices)

db.prices.find( { _id: { $in: [ "sku730223104376_store1234",
                                "sku730223104376_sgroup0",
                                "p301671_store1234",
                                "p301671_sgroup0" ] } }, { price: 1 } )

Merchandising - Pricing
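
The last lookup above can return up to four price documents, so the application has to choose one. The following is a minimal sketch in plain JavaScript of one way to do that. The helper names (`priceIdCandidates`, `pickPrice`), the sample prices, and the precedence order (SKU before item, store before store group) are assumptions made for illustration, not something the slides specify.

```javascript
// Hypothetical helper: builds the candidate price _ids for one SKU at one store.
// The array order encodes an ASSUMED precedence, most specific first.
function priceIdCandidates(sku, productId, storeId, storeGroupId) {
  return [
    `sku${sku}_${storeId}`,
    `sku${sku}_${storeGroupId}`,
    `p${productId}_${storeId}`,
    `p${productId}_${storeGroupId}`,
  ];
}

// Returns the first candidate that has a matching price document, or null.
function pickPrice(candidates, priceDocs) {
  const byId = new Map(priceDocs.map((doc) => [doc._id, doc]));
  for (const id of candidates) {
    if (byId.has(id)) return byId.get(id);
  }
  return null;
}

const candidates = priceIdCandidates("730223104376", "301671", "store1234", "sgroup0");

// In the shell, the lookup would be:
//   db.prices.find({ _id: { $in: candidates } }, { price: 1 })
// Here the result is simulated with two made-up documents.
const found = [
  { _id: "p301671_sgroup0", price: 99.95 },
  { _id: "sku730223104376_sgroup0", price: 89.95 },
];

const best = pickPrice(candidates, found);
console.log(best._id, best.price);
```

Because every candidate is an exact `_id`, the database lookup stays a simple primary-key match and the precedence rule lives entirely in application code.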

22

• The hierarchy of items typically follows:

• Company
  – Division
    • Department: Women's shoe store
      – Class: Pumps
        » Item: Guess classic pump
          • Variation: size 6 black

Merchandising – Product Hierarchy

24

Merchandising – Browse and Search products

Browse by category

Special Lists

Filter by attributes

Lists hundreds of item summaries

Ideally a single query is issued to the database to obtain all items and metadata to display

25

The previous page presents many challenges:

• Response is needed within milliseconds for hundreds of items

• Faceted search on many attributes of an item: department, brand, category, etc

• Attributes to match may be at the variation level: color, size, etc, in which case the variation should be shown

• One item may have thousands of variations. Only one item should be displayed even if many variations match

• Efficient sorting on several attributes: price, popularity

• Pagination feature which requires deterministic ordering
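
To illustrate the last point, here is a small sketch in plain JavaScript of why pagination needs a deterministic order: when many items share the same sort value (for example, the same price), a tie-breaker on a unique field keeps pages from overlapping or skipping items. The function names and sample items are hypothetical. In the shell, the equivalent idea would be a sort such as `.sort({ price: 1, _id: 1 })` combined with `skip` and `limit`.

```javascript
// Sorts by the requested field and breaks ties on _id,
// so two items never compare as equal and the order is fully deterministic.
function sortDeterministically(items, field) {
  return [...items].sort((a, b) => {
    if (a[field] !== b[field]) return a[field] < b[field] ? -1 : 1;
    if (a._id === b._id) return 0;
    return a._id < b._id ? -1 : 1;
  });
}

// Returns one page (zero-based page number) of the sorted list.
function getPage(items, field, pageNumber, pageSize) {
  const start = pageNumber * pageSize;
  return sortDeterministically(items, field).slice(start, start + pageSize);
}

const items = [
  { _id: "p3", price: 50 },
  { _id: "p1", price: 50 },
  { _id: "p2", price: 20 },
  { _id: "p4", price: 50 },
];

const firstPage = getPage(items, "price", 0, 2).map((item) => item._id);
const secondPage = getPage(items, "price", 1, 2).map((item) => item._id);
console.log(firstPage, secondPage);
```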


26


Hundreds of sizes

One Item

Dozens of colors

A single item may have thousands of variations

27


Images of the matching variations are displayed

Hierarchy

Sort parameter

Faceted Search

28

Merchandising – Traditional Architecture

Relational DB: System of Records

Full Text Search Engine

Indexing

#1 obtain search results IDs

Application Cache

#2 obtain objects by ID

Pre-joined into objects

29

The traditional architecture presents issues:

• 3 different systems to maintain: RDBMS, Search engine, Caching layer

• A search returns a list of IDs, which are then looked up in the cache as a batch or one by one. This significantly increases response latency

• RDBMS schema is complex and static

• The search index needs to be refreshed at intervals

• Setup does not allow efficient pagination

Merchandising – Traditional Architecture

30

MongoDB Data Store

Merchandising - Architecture

Product Summaries

Product Definitions

Pricing

Promotions

Product Variations

Ratings & Reviews

#1 Obtain results

31

The product index relies on the following parameters:

• The department (required): the main component of category, e.g. "Shoes"

• An indexed attribute (optional)

– Category path, e.g. "Shoes/Women/Pumps"

– Price range (based on online prices)

– List of Item Attributes, e.g. Brand = Guess

– List of Variation Attributes, e.g. Color = red

• A non-indexed attribute (optional)

– List of Item Secondary Attributes, e.g. Style = Designer

– List of Variation Secondary Attributes, e.g. heel height = 5.0

• As well as Sorting, e.g. Price Low to High

Merchandising – Product Summaries

32

> db.summaries.findOne()

{ "_id": "p39",

"title": "Evening Platform Pumps 39",

"department": "Shoes", "category": "Shoes/Women/Pumps",

"thumbnail": "http://cdn…/pump-small-39.jpg", "image": "http://cdn…/pump-39.jpg",

"price": 145.99,

"rating": 0.95,

"attrs": [ { "brand" : "Guess"}, … ],

"sattrs": [ { "style" : "Designer"} , { "type" : "Platform"}, …],

"vars": [

{ "sku": "sku2441",

"thumbnail": "http://cdn…/pump-small-39.jpg.Blue",

"image": "http://cdn…/pump-39.jpg.Blue",

"attrs": [ { "size": 6.0 }, { "color": "Blue" }, …],

"sattrs": [ { "width" : "B"} , { "heelHeight" : 5.0 }, …],

}, … Many more skus …

] }

Indices: vars.sku, department + attrs + category, department + vars.attrs + category, department + category, department + price, department + rating

Merchandising – Product Summaries

33

• Get summary from item id
db.summaries.find( { _id: "p301671" } )

• Get summary's specific variation from SKU
db.summaries.find( { "vars.sku": "730223104376" }, { "vars.$": 1 } )

• Get summary by department, sorted by rating
db.summaries.find( { department: "Shoes" } ).sort( { rating: 1 } )

• Get summary with mix of parameters
db.summaries.find( { department: "Shoes",
                     "vars.attrs": { color: "Gray" },
                     category: /^Shoes\/Women/,
                     price: { $gte: 65.99, $lte: 180.99 } } )

Merchandising - Product Summaries
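
The "mix of parameters" query is typically assembled from whatever facets the shopper has selected. The sketch below shows one possible way to build such a filter document in plain JavaScript. The function names (`escapeRegExp`, `buildSummaryFilter`) and the parameter names are hypothetical, and the sketch only handles a single variation attribute. It returns the filter object that would be passed to `find` on the summaries collection.

```javascript
// Escapes regex metacharacters so the category path is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Builds a filter document from optional facets. The department is required,
// mirroring the product index parameters listed earlier.
function buildSummaryFilter({ department, categoryPrefix, variationAttr, minPrice, maxPrice }) {
  if (!department) throw new Error("department is required");
  const filter = { department };
  if (categoryPrefix) {
    // Anchored prefix match, e.g. /^Shoes\/Women/
    filter.category = new RegExp("^" + escapeRegExp(categoryPrefix));
  }
  if (variationAttr) {
    // Exact match on one { name: value } element of vars.attrs
    filter["vars.attrs"] = variationAttr;
  }
  if (minPrice !== undefined || maxPrice !== undefined) {
    filter.price = {};
    if (minPrice !== undefined) filter.price.$gte = minPrice;
    if (maxPrice !== undefined) filter.price.$lte = maxPrice;
  }
  return filter;
}

const filter = buildSummaryFilter({
  department: "Shoes",
  categoryPrefix: "Shoes/Women",
  variationAttr: { color: "Gray" },
  minPrice: 65.99,
  maxPrice: 180.99,
});
console.log(filter);
```

Keeping the category match anchored at the start of the string matters because an anchored prefix expression is the form that can make use of an index on `category`.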

34

Merchandising – Query stats

Department  Category  Price  Primary attribute  Average time (ms)  90th (ms)  95th (ms)
1           0         0      0                  2                  3          3
1           1         0      0                  1                  2          2
1           0         1      0                  1                  2          3
1           1         1      0                  1                  2          2
1           0         0      1                  0                  1          2
1           1         0      1                  0                  1          1
1           0         1      1                  1                  2          2
1           1         1      1                  0                  1          1
1           0         0      2                  1                  3          3
1           1         0      2                  0                  2          2
1           0         1      2                  10                 20         35
1           1         1      2                  0                  1          1

Content

36

Content

Content

MongoDB

Metadata

Asset Repository

Digital Rights Mgt

Access Control

Processing / Encoding

Inventory

38

Inventory

Inventory

MongoDB

External Inventory

Internal Inventory

Regional Inventory

Purchase Orders

Fulfillment

Promotions

39

Demonstration Document Model

Product:

• Definitions: id: p0
• Variations: id: sku0, pId: p0
• Summary: id: p0, vars: [sku0, sku1, …]

Stores: id: s1, Loc: [22, 33]

Inventory: store: s1, pId: p0, vars: [{sku: sku0, q: 3}, {sku: sku2, q: 2}]

40

db.stores.findOne()

{ "_id" : ObjectId("53549fd3e4b0aaf5d6d07f35"),

"className" : "catalog.Store",

"storeId" : "store0",

"name" : "Bessemer store",

"address" : {

"addr1" : "1st Main St",

"city" : "Bessemer",

"state" : "AL",

"zip" : "12345",

"country" : "US"

},

"location" : [

-86.95444,

33.40178

]

… }

Inventory - Stores

41

• Get a store by storeId

db.stores.find( { storeId: "store0" } )

• Get nearby stores sorted by distance

db.runCommand( { "geoNear" : "stores", "near" : [ -82.800672, 40.090844 ], "maxDistance" : 10.0, "spherical" : true } )

Inventory - Stores

42

> db.inventory.findOne()

{ "_id": "5354869f300487d20b2b011d",

"storeId": "store0",

"location": [

-86.95444,

33.40178

],

"productId": "p0",

"vars": [

{ "sku": "sku1", "q": 14 },

{ "sku": "sku3", "q": 7 },

{ "sku": "sku7", "q": 32 },

{ "sku": "sku14", "q": 65 },

...

]

}

Inventory - Quantities

43

• Get all items in a store
db.inventory.find( { storeId: "store100" } )

• Get quantity for an item at a store
db.inventory.find( { storeId: "store100", productId: "p200" } )

• Get quantity for a sku at a store
db.inventory.find(
    { storeId: "store100", productId: "p200", "vars.sku": "sku11736" },
    { "vars.$": 1 } )

• Increment / decrement inventory for an item at a store
db.inventory.update(
    { storeId: "store100", productId: "p200", "vars.sku": "sku11736" },
    { $inc: { "vars.$.q": 20 } } )

• Indices: productId, storeId + productId, location (geo) + productId

Inventory - Stores
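
When decrementing stock, an application usually wants the update to apply only if enough units are available. The sketch below builds a filter and update pair for that case in plain JavaScript. The `buildDecrement` helper is hypothetical, and the stock guard (`q: { $gte: quantity }` inside `$elemMatch`) is an assumption added for illustration rather than part of the slides. The positional `vars.$.q` then targets the array element matched by `$elemMatch`.

```javascript
// Builds the filter and update documents for removing `quantity` units of one
// SKU at one store. The filter only matches when at least `quantity` units of
// that SKU are in stock, so the counter cannot go negative.
function buildDecrement(storeId, productId, sku, quantity) {
  if (!Number.isInteger(quantity) || quantity <= 0) {
    throw new RangeError("quantity must be a positive integer");
  }
  return {
    filter: {
      storeId,
      productId,
      vars: { $elemMatch: { sku, q: { $gte: quantity } } },
    },
    update: { $inc: { "vars.$.q": -quantity } },
  };
}

const op = buildDecrement("store100", "p200", "sku11736", 2);

// In the shell: db.inventory.update(op.filter, op.update)
// If no document is modified, there was not enough stock.
console.log(JSON.stringify(op, null, 2));
```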

44

• Aggregate total quantity for an item
db.inventory.aggregate( [
    { $match: { productId: "p200" } },
    { $unwind: "$vars" },
    { $group: { _id: "result", count: { $sum: "$vars.q" } } } ] )

{ "_id" : "result", "count" : 101752 }

• Aggregate total quantity for a store
db.inventory.aggregate( [
    { $match: { storeId: "store100" } },
    { $unwind: "$vars" },
    { $group: { _id: "result", count: { $sum: "$vars.q" } } } ] )

{ "_id" : "result", "count" : 29347 }

Inventory - Stores
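
To make the aggregation steps concrete, here is a plain-JavaScript sketch of what a match, unwind and group-with-sum pipeline computes when the goal is a total quantity: filter the inventory documents, flatten their `vars` arrays, then add up `q`. The `totalQuantity` function and the sample documents are made up for illustration.

```javascript
// Plain-JavaScript equivalent of:
//   $match on exact field values, $unwind "$vars", $group summing "$vars.q"
function totalQuantity(inventoryDocs, match) {
  return inventoryDocs
    .filter((doc) => Object.entries(match).every(([key, value]) => doc[key] === value))
    .flatMap((doc) => doc.vars)
    .reduce((sum, variation) => sum + variation.q, 0);
}

// Made-up inventory documents following the model shown earlier.
const inventoryDocs = [
  { storeId: "store0", productId: "p0", vars: [{ sku: "sku1", q: 14 }, { sku: "sku3", q: 7 }] },
  { storeId: "store1", productId: "p0", vars: [{ sku: "sku1", q: 5 }] },
  { storeId: "store0", productId: "p1", vars: [{ sku: "sku9", q: 2 }] },
];

const totalForItem = totalQuantity(inventoryDocs, { productId: "p0" });
const totalForStore = totalQuantity(inventoryDocs, { storeId: "store0" });
console.log(totalForItem, totalForStore); // 26 23
```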

45

• Get inventory for an item near a point
db.runCommand( {
    "geoNear" : "inventory", "near" : [ -82.800672, 40.090844 ],
    "maxDistance" : 10.0, "spherical" : true, limit: 10,
    query: { productId: "p200", "vars.sku": "sku11736" } } )

• Get closest store with available sku
db.runCommand( {
    "geoNear" : "inventory", "near" : [ -82.800672, 40.090844 ],
    "maxDistance" : 10.0, "spherical" : true, limit: 10,
    query: { productId: "p200", vars: { $elemMatch: { "sku": "sku11736", q: { $gt: 0 } } } } } )

Inventory - Stores

Customer

47

Customer

Customer

MongoDB

Profile

Market Segment

Demographics

Wish List

Preference

Inbox

Sales / Support Chat

Content Subscription

Channels

49

Channels

Channels

MongoDB

Location

Store

Assortment

Point of Sale

Channel Definition

Planogram

Sales & Fulfillment

51

Sales & Fulfillment

Sales & Fulfillment

MongoDB

Sales Transaction

Shipping

Tracking

Return & Exchange

Business Rule

Audit

Shopping Cart

Insight

53

Insight

Insight

MongoDB

Advertising metrics

Clickstream

Recommendations

Session Capture

Activity Logging

Geo Tracking

Product Analytics

Customer Insight

Application Logs

54

• Many user activities can be of interest:
– Search
– Product view, like or wish
– Shopping cart add / remove
– Sharing on social network
– Ad impression, Clickstream

• Those will be used to compute:
– Product Map (relationships, etc)
– User Preferences
– Recommendations
– Trends

Activity Logging – Data of interest

55

Activity logging - Architecture

Diagram components:

• Apps
• HVDF API: all user activity is recorded
• MongoDB: Activity Logging, User History
• Internal Analytics: Aggregation, MR
• External Analytics (MongoDB – Hadoop Connector): Hadoop, Spark, Storm, …
• User Preferences, Recommendations, Trends, Product Map
• Personalization

56

Activity Logging

57

• You need to store and manage an incoming stream of data samples (views, impressions, orders, …)
– High arrival rate of data from many sources
– Variable schema of arriving data
– You need to control retention period of data

• You need to compute derivative data sets based on these samples
– Aggregations and statistics based on data
– Roll-up data into pre-computed reports and summaries

• You need low latency access to up-to-date data (user history)
– Flexible indexing of raw and derived data sets
– Rich querying based on time + meta-data fields in samples

Activity Logging – Problem statement
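
As an illustration of the roll-up requirement, the sketch below groups raw samples into fixed time buckets and counts them per activity type, in plain JavaScript. The `rollUp` function, the bucket key format and the sample data are assumptions for illustration. A real deployment would more likely compute this with the aggregation framework or a processor plugin.

```javascript
// Rolls raw samples up into counts keyed by "<bucket start ISO time>|<type>".
// bucketMs is the bucket width in milliseconds (aligned to the Unix epoch, UTC).
function rollUp(samples, bucketMs) {
  const counts = new Map();
  for (const sample of samples) {
    const bucketStart = Math.floor(sample.time.getTime() / bucketMs) * bucketMs;
    const key = `${new Date(bucketStart).toISOString()}|${sample.type}`;
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

const HOUR_MS = 60 * 60 * 1000;

// Made-up activity samples.
const samples = [
  { type: "VIEW", time: new Date("2014-04-01T10:05:00Z") },
  { type: "VIEW", time: new Date("2014-04-01T10:55:00Z") },
  { type: "ORDER", time: new Date("2014-04-01T10:30:00Z") },
  { type: "VIEW", time: new Date("2014-04-01T11:01:00Z") },
];

const hourly = rollUp(samples, HOUR_MS);
for (const [key, count] of hourly) console.log(key, count);
```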

58

Activity logging - Requirements

Requirement | MongoDB

Ingestion of 100ks of writes / sec | Fast C++ process, multi-threads, multi-locks. Horizontal scaling via sharding. Sequential IO via time partitioning.

Flexible schema | Dynamic schema, each document is independent. Data is stored in the same format and size as it is inserted.

Fast querying on varied fields, sorting | Secondary Btree indexes can look up and sort the data in milliseconds.

Easy clean up of old data | Deletes are typically as expensive as inserts. Getting free deletes via time partitioning.

59

Activity Logging using HVDF

HVDF (High Volume Data Feed):

• Open source reference implementation of high volume writing with MongoDB

• REST API server written in Java with most popular libraries

• Public project, issues can be logged

• Can be run as-is, or customized as needed

60

High volume data feed architecture

Diagram components: Feed, Channel (a sequence of Samples), Sources, Processor (Inline Processing, Batch Processing, Stream Processing)

• The Channel is the sequence of data samples that a sensor sends into the platform.
• Sources send samples into the Channel.
• Processors generate derivative Channels from other Channel data.

61

HVDF -- High Volume Data Feed engine

HVDF – Reference implementation

REST Service API

Processor Plugins

Inline

Batch

Stream

Channel Data Storage

Raw Channel Data

Aggregated Rollup T1

Aggregated Rollup T2

Query Processor Streaming spout

Custom Stream Processing Logic

Incoming Sample Stream

POST /feed/channel/data

GET /feed/channel/data?time=XXX&range=YYY

Real-time Queries

62

{ _id: ObjectId(),

geoCode: 1, // used to localize write operations

sessionId: "2373BB…",

device: { id: "1234",

type: "mobile/iphone",

userAgent: "Chrome/34.0.1847.131"

},

type: "VIEW|CART_ADD|CART_REMOVE|ORDER|…", // type of activity

itemId: "301671",

sku: "730223104376",

order: { id: "12520185",

… },

location: [ -86.95444, 33.40178 ],

tags: [ "smartphone", "iphone", … ], // associated tags

timeStamp: new Date("2014/04/01 …")

}

User Activity - Model

63

Dynamic schema for sample data

Channel

Sample 1: { deviceId: XXXX, time: Date(…), type: "VIEW", … }

Sample 2: { deviceId: XXXX, time: Date(…), type: "CART_ADD", cartId: 123, … }

Sample 3: { deviceId: XXXX, time: Date(…), type: "FB_LIKE" }

Each sample can have variable fields

64

Channels are sharded

Shard

Shard

Shard

Shard

Shard

Shard Key: customer_id

Sample: { customer_id: XXXX, time: Date(…), type: "VIEW" }

Channel

You choose how to partition samples

Samples can have dynamic schema

Scale horizontally by adding shards

Each shard is highly available

65

Channels are time partitioned

Channel

Sample Sample Sample Sample Sample Sample Sample Sample

- 2 days, - 1 day, Today

Partition 1, Partition 2, …, Partition N

• Partitioning keeps indexes manageable
• Today: this is where all of the writes happen
• Older partitions are read only for best possible concurrency
• Queries are routed only to needed partitions
• Each partition is a separate collection
• Efficient and space reclaiming purging of old data
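
The following sketch shows, in plain JavaScript, how time partitioning could map onto collections: a name per day, the list of partitions a time-range query needs, and the partitions that fall outside a retention window and could be dropped as whole collections. The naming scheme (`<channel>_<YYYYMMDD>`, one partition per UTC day) and all function names are assumptions for illustration. HVDF's actual partitioning scheme may differ.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical naming scheme: one collection per channel per UTC day.
function partitionName(channel, date) {
  return `${channel}_${date.toISOString().slice(0, 10).replace(/-/g, "")}`;
}

// Lists the partitions a query over the closed range [from, to] has to touch.
function partitionsForRange(channel, from, to) {
  const names = [];
  const firstDay = Math.floor(from.getTime() / DAY_MS) * DAY_MS;
  for (let t = firstDay; t <= to.getTime(); t += DAY_MS) {
    names.push(partitionName(channel, new Date(t)));
  }
  return names;
}

// Lists partitions older than the retention window. Names for one channel
// sort chronologically, so a string comparison against the cutoff is enough.
function expiredPartitions(existing, channel, now, retentionDays) {
  const cutoff = partitionName(channel, new Date(now.getTime() - retentionDays * DAY_MS));
  return existing.filter((name) => name < cutoff);
}

const needed = partitionsForRange(
  "activity",
  new Date("2014-03-30T23:00:00Z"),
  new Date("2014-04-01T01:00:00Z")
);
const expired = expiredPartitions(
  ["activity_20140329", "activity_20140330", "activity_20140331", "activity_20140401"],
  "activity",
  new Date("2014-04-01T12:00:00Z"),
  2
);
console.log(needed, expired);
```

Dropping an expired partition is a single collection drop, which is what makes purging cheap compared with deleting documents one by one.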

66

Dynamic queries on Channels

Channel


Apps access the Channel through Indexes, Queries, Pipelines and Map-Reduce:

• Create custom indexes on Channels
• Use full MongoDB query language to access samples
• Use MongoDB aggregation pipelines to access samples
• Use MongoDB inline map-reduce to access samples
• Full access to field, text, and geo indexing

67

North America - West

North America - East

Europe

Geographically distributed system

Channel


Source

Source

Source

Source

Source

Source

Sample

Sample

Sample

Sample

Geo shards per location

Clients write to local nodes

Single view of channel available globally

68

Insight

69

Insight – Useful Data

• Useful data for better shopping:
– User history (e.g. recently seen products)
– User statistics (e.g. total purchases, visits)
– User interests (e.g. likes videogames and SciFi)
– User social network
– Cross-selling: people who bought this item had tendency to buy those other items (e.g. iPhone, then bought iPhone case)
– Up-selling: people who looked at this item eventually bought those items (alternative product that may be better)

70

Example of real-time aggregation with Agg Framework

User Activity – Computing User Stats

71


72

Let's simplify each activity recorded as the following:

{ userId: 123, type: "order", itemId: 2, time: … }



To calculate items bought by a user for a period of time, let's use MongoDB's Map Reduce:

- Match activities of type "order" for the past 2 weeks

- map: emit the document by userId

- reduce: push all itemId in a list

- Output looks like { _id: userId, items: [2, 3, 8] }

User Activity – Items frequently bought together

73

Then run a 2nd mapreduce job that for each of the previous results:

- map: emits every combination of 2 items, starting with lowest itemId

- reduce: sum up the total.

- output looks like { _id: { a: 2, b: 3 } , count: 36 }


74

The output collection can then be queried per item id, sorted by count, and cut off at a threshold.

This requires indexes on { _id.a, count } and { _id.b, count }

You then obtain an affiliation collection with docs like:

{ itemId: 2, affil: [ { id: 3, weight: 36}, { id: 8, weight: 23} ] }
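
The two map-reduce passes and the final affiliation step can be sketched in plain JavaScript as follows. This is an in-memory illustration of the logic only, not the MongoDB map-reduce jobs themselves. The function names and sample activities are made up, the two-week time filter is left out, and repeated purchases of the same item by one user are assumed to count once.

```javascript
// Pass 1: keep "order" activities and collect the distinct itemIds per user.
function itemsByUser(activities) {
  const byUser = new Map();
  for (const activity of activities) {
    if (activity.type !== "order") continue;
    if (!byUser.has(activity.userId)) byUser.set(activity.userId, new Set());
    byUser.get(activity.userId).add(activity.itemId);
  }
  return byUser;
}

// Pass 2: emit every pair of items per user, lowest itemId first, and count pairs.
function pairCounts(byUser) {
  const counts = new Map();
  for (const itemSet of byUser.values()) {
    const items = [...itemSet].sort((x, y) => x - y);
    for (let i = 0; i < items.length; i++) {
      for (let j = i + 1; j < items.length; j++) {
        const key = `${items[i]}:${items[j]}`;
        counts.set(key, (counts.get(key) || 0) + 1);
      }
    }
  }
  return counts;
}

// Final step: per item, list affiliated items sorted by weight,
// dropping pairs whose count is below the threshold.
function affiliations(counts, threshold) {
  const result = new Map();
  const add = (itemId, id, weight) => {
    if (!result.has(itemId)) result.set(itemId, []);
    result.get(itemId).push({ id, weight });
  };
  for (const [key, weight] of counts) {
    if (weight < threshold) continue;
    const [a, b] = key.split(":").map(Number);
    add(a, b, weight);
    add(b, a, weight);
  }
  for (const list of result.values()) list.sort((x, y) => y.weight - x.weight);
  return result;
}

// Made-up activities: three users, one non-order activity that must be ignored.
const activities = [
  { userId: 1, type: "order", itemId: 2 },
  { userId: 1, type: "order", itemId: 3 },
  { userId: 1, type: "order", itemId: 8 },
  { userId: 2, type: "order", itemId: 3 },
  { userId: 2, type: "order", itemId: 2 },
  { userId: 3, type: "order", itemId: 2 },
  { userId: 3, type: "order", itemId: 8 },
  { userId: 3, type: "view", itemId: 3 },
];

const counts = pairCounts(itemsByUser(activities));
const affil = affiliations(counts, 2);
console.log(JSON.stringify({ itemId: 2, affil: affil.get(2) }));
```

Emitting each pair with the lowest itemId first keeps `{ a: 2, b: 3 }` and `{ a: 3, b: 2 }` from being counted as two different pairs.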


75

Example of Hadoop integration

User Activity – Hadoop integration

Social

77

Social

Social

MongoDB

Social Channels

User Network

Activity

Chat

Social Profiles

Community Mgt

Rewards / Gamification

Conclusion

Appendix

83

Single View of Product Cluster Topology

Diagram: three data centers (West DC, Center DC, East DC) and three shards ("West", "Center", "East"), with three Primary nodes shown.

84

Primary node replicates data to all secondaries in the shard as fast as possible


85

Center Shard contains all the data for stores in Center region


86

Center Shard contains all the data for stores in Center region

Local writes enable very high throughput of updates


87

Each region is able to see the data of all stores from its "local" DC.


88

Two nodes in each DC for painless maintenance with zero downtime


89

Even if a DC goes out, the database remains fully available thanks to automated failover


90

Data set can grow, shards can add up, without any rewrite of the application code


Thank You!

Antoine Girbal, Senior Solutions Engineer, MongoDB Inc. @antoinegirbal
