predict repeat shoppers with H2O and Spark

Predicting Repeat Shoppers of Online Stores using Gradient Boosting Machines in H2O + Spark

Sergey Fogelson, Ph.D.

Director of Data Science, airis.DATA

We are hiring!

[email protected]

airis.DATA

• airis.DATA is a start-up system integration company based in Princeton, NJ, focused on machine learning and data engineering solutions.

• We are hiring:

• Data Scientists

• Data Engineers

• Intern Data Scientists

Contact Me!

• Sergey Fogelson, Director of Data Science, airis.DATA

• [email protected]

• http://www.slideshare.net/airisdata

• http://twitter.com/airisdata

• http://airisdata.com/


Overview

• H2O Architecture

• Gradient Boosting Machines

• Repeat Shopper Problem: Motivations

• Repeat Shopper Problem: Raw data (shopper logs)

• Repeat Shopper: Feature generation

• Walkthrough: Using H2O Flow to generate a GBM model from features

H2O Architecture

• in-memory analytics on clusters:

• in-memory (non-persistent) K/V store, with *exact* (not lazy) consistency semantics and transactions

• memory model is a distributed version of the Java Memory Model

• distributed parallelized state-of-the-art Machine Learning algorithms

H2O Architecture

• Has built-in support for the following algorithms:

• GLM (Generalized Linear Models - Logistic, Linear Regression, etc.)

• GBM (what we are talking about today)

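• Random Forests (a simpler tree ensemble: trees are trained independently on bootstrapped samples and averaged, rather than fit sequentially as in GBM)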

• K-Means (common clustering method)

• PCA (principal components analysis for dimensionality reduction and feature exploration)

• Deep Learning (can construct many-layered neural networks with arbitrary kinds of units at every neural net layer)

• Has APIs in:

• Java

• Scala

• Python

• R

• Flow (browser GUI that we will use today)

Gradient Boosted Machines

• Gradient: Computes the gradient of the loss function at each iteration

• Boosted: Uses boosting (learners are trained sequentially, each one fit to correct the errors of the ensemble built so far)

• Commonly uses trees, but can use any base learner (We will be using trees)

• Intuition:

• Train a single base learner

• Compute the general “direction” in which it errs the most (the negative gradient of the loss with respect to the current predictions)

• Focus your next learner on learning in that “direction”

• Rinse, repeat until training error is less than some epsilon

Gradient Boosted Machines

• The overall gradient boosted machine model prediction is the sum of the base learners' predictions, weighted by each learner's contribution to decreasing the overall loss function

• So, each successive trained learner contributes a smaller amount to the overall prediction

• An additional regularization penalty is added to GBM to limit model complexity (based on depth of each tree and number of leaves per tree)
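The intuition on the two slides above can be sketched in a few lines of Python. This is a minimal illustration and not H2O's implementation: it assumes squared-error loss (so the negative gradient is simply the residual), uses one-feature decision stumps as base learners, needs at least two distinct feature values, and stops after a fixed number of rounds rather than at an error threshold. The names `fit_stump`, `fit_gbm`, `n_rounds` and `learn_rate` are made up for this sketch.

```python
def fit_stump(xs, residuals):
    """Pick the threshold split on xs that best fits the residuals (least squares)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate thresholds; both sides stay non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean


def fit_gbm(xs, ys, n_rounds=50, learn_rate=0.1):
    """Sequentially fit stumps to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)          # initial prediction: the mean label
    stumps = []
    preds = [base] * len(ys)
    for _ in range(n_rounds):
        # For squared-error loss the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Each learner adds a small, shrunken correction to the running prediction.
        preds = [p + learn_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learn_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy data: one feature, a noisy 0/1 label.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0]
few = fit_gbm(xs, ys, n_rounds=1)
many = fit_gbm(xs, ys, n_rounds=100)
```

Comparing `mse(few, xs, ys)` with `mse(many, xs, ys)` shows the training error shrinking as more learners are added, which is the "rinse, repeat" step in the intuition.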

Repeat Shopper: Problem Motivation

• Many, many people do a substantial amount of their total shopping online

• A large fraction of those people are one-time shoppers at most online stores (very large stores like Amazon excepted), looking for one-time deals

• However, some shoppers of small online stores will become repeat purchasers

• Repeat purchasers are much more lucrative to target with deals and promotions, since offering promotions to every visitor of a given store can significantly cut into revenue

• Being able to predict who is likely to become a repeat purchaser from a given online store given a limited view of their browsing behavior and one-time purchase behavior is critical to maximizing store revenue

Repeat Shopper: Raw Data

• We have at our disposal:

• User activity logs: user id / store id / category id / brand id / timestamp (month & day) / behavior (click, view, add to favorite, purchase)

• Limited user profile: user id / age range / gender

• Training set of labels confirmed by stores: user id / store id / repeat purchaser (yes/no)

Repeat Shopper: Raw Data

• Data caveats:

• Many user ids and store ids do not appear in the labeled set

• Data is incredibly sparse (we will need to compute aggregates and do some intense feature space reduction)

• Data is very unbalanced (non-repeat shoppers >>> repeat shoppers)

Repeat Shopper: Feature Generation

• User-based features:

• Need a way to measure overall user activity:

• compute monthly per-user number of distinct sellers purchased from - this quantifies how broadly the given user purchases goods on a monthly basis

• filter where action type = purchase

• groupby on (user,month)

• count distinct sellers

• compute monthly aggregate actions per user per category - this quantifies overall user shopping activity and gives a view of the user’s default shopping behaviors as a function of the categories of objects that exist across the entire marketplace. More clicks, likes, and favorites imply a higher chance of making a purchase within the given category.

• Groupby user, month, action, category

• count

• this yields a very large, very sparse matrix (rows: users, in the 1000s; columns: category × month × action). Reduce the matrix via PCA to the top 200 dimensions.
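The two user-based aggregations above can be sketched in plain Python. This is only an illustration of the filter / group-by / count logic on a handful of made-up log rows; the row layout and the names `logs`, `distinct_sellers`, `action_counts` and `matrix` are assumptions for this sketch, and the PCA step is not shown.

```python
from collections import Counter, defaultdict

# Hypothetical log rows: (user_id, seller_id, category_id, month, action)
logs = [
    ("u1", "s1", "c1", 5, "purchase"),
    ("u1", "s2", "c1", 5, "purchase"),
    ("u1", "s2", "c2", 5, "click"),
    ("u1", "s1", "c1", 6, "purchase"),
    ("u2", "s1", "c1", 5, "click"),
    ("u2", "s1", "c1", 5, "click"),
    ("u2", "s3", "c2", 6, "purchase"),
]

# Feature 1: filter to purchases, group by (user, month), count distinct sellers.
sellers_by_user_month = defaultdict(set)
for user, seller, _category, month, action in logs:
    if action == "purchase":
        sellers_by_user_month[(user, month)].add(seller)
distinct_sellers = {key: len(s) for key, s in sellers_by_user_month.items()}

# Feature 2: group by (user, month, action, category) and count.
action_counts = Counter(
    (user, month, action, category)
    for user, _seller, category, month, action in logs
)

# Pivot into a user x (month, action, category) matrix; most cells are zero.
columns = sorted({key[1:] for key in action_counts})
users = sorted({key[0] for key in action_counts})
matrix = [[action_counts[(u,) + col] for col in columns] for u in users]
```

Even on this toy input most cells of `matrix` are zero, which is the sparsity that motivates the PCA reduction.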


• Seller-based features:

• Need a way to measure repeat buyer frequency as a function of the seller:

• compute monthly per-seller repeat buyers

• filter where action type = purchase

• groupby on (merchant, user, month)

• filter again where count >= 2 (indicates repeat purchaser for the given month)

• disaggregate and distinct on (merchant, user, month)

• groupby(merchant,month)

• count

• compute monthly aggregate actions per merchant per category - this quantifies overall merchant activity and gives a view of the merchant’s visibility in the marketplace. More clicks, likes, and favorites imply higher visibility for the merchant, and a higher overall default probability that the merchant will have repeat buyers.

• Groupby merchant, month, action, category

• count

• this yields a very large, very sparse matrix (rows: merchants, in the 1000s; columns: category × month × action). Reduce the matrix via PCA to the top 200 dimensions.
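The monthly per-seller repeat-buyer count described above can be sketched as follows. The rows are made up and the names `logs`, `purchase_counts`, `repeat_triples` and `repeat_buyers` are illustrative assumptions; the sketch only shows the order of the filter, group-by and count steps.

```python
from collections import Counter

# Hypothetical log rows: (merchant_id, user_id, month, action)
logs = [
    ("m1", "u1", 5, "purchase"),
    ("m1", "u1", 5, "purchase"),
    ("m1", "u2", 5, "purchase"),
    ("m1", "u2", 5, "purchase"),
    ("m1", "u2", 5, "purchase"),
    ("m1", "u3", 5, "purchase"),
    ("m1", "u3", 5, "click"),
    ("m2", "u1", 5, "purchase"),
    ("m2", "u1", 6, "purchase"),
    ("m2", "u3", 6, "purchase"),
    ("m2", "u3", 6, "purchase"),
]

# Filter to purchases, then group by (merchant, user, month) and count.
purchase_counts = Counter(
    (merchant, user, month)
    for merchant, user, month, action in logs
    if action == "purchase"
)

# Keep the distinct (merchant, user, month) triples with two or more purchases.
repeat_triples = {key for key, n in purchase_counts.items() if n >= 2}

# Group by (merchant, month) and count the repeat buyers.
repeat_buyers = Counter((merchant, month) for merchant, _user, month in repeat_triples)
```

Note that user `u1` buys from `m2` in two different months but only once per month, so under this per-month definition `u1` is not counted as a repeat buyer of `m2`.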


• Seller-based features:

• Seller similarity - allows us to quantify how related two sellers are:

• merchant similarity is defined as # of customers who bought from both merchants.

• this again yields a very large sparse matrix (5K rows X 5K columns)

• reduce Merchant Similarity matrix to first 10 PCA components (conserves ~85% of the overall variance)
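The merchant similarity definition above (number of customers who bought from both merchants) can be sketched on toy data. The `buyers` mapping and the variable names are assumptions for this sketch; the diagonal is left at zero because the slides do not say how self-similarity is treated, and the PCA reduction is not shown.

```python
from itertools import combinations

# Hypothetical input: the set of customers who purchased from each merchant.
buyers = {
    "m1": {"u1", "u2", "u3"},
    "m2": {"u2", "u3"},
    "m3": {"u4"},
}

merchants = sorted(buyers)
index = {m: i for i, m in enumerate(merchants)}

# Symmetric merchant x merchant matrix of shared-customer counts.
sim = [[0] * len(merchants) for _ in merchants]
for a, b in combinations(merchants, 2):
    shared = len(buyers[a] & buyers[b])  # customers who bought from both
    sim[index[a]][index[b]] = shared
    sim[index[b]][index[a]] = shared
```

With thousands of merchants most pairs share no customers, so the resulting matrix is mostly zeros, as the slide notes.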

Predict Repeat Shopper: Overall Matrix

• 40K rows, 423 columns

• ~15K repeat purchasers, ~25K non-repeat purchasers

• Let's use H2O Flow to build a GBM model on the data!
