20180523 Seminar
1. Factorization Machines - 2010
2. Field-aware Factorization Machines - 2016
3. Practical lessons from predicting clicks on ads at facebook - 2014
Presented by Hochul Kim
20180523
Factorization Machines
Factorization Machines (FM)
Other factorization models : (ex : MF, SVD++, PITF, FPMC, …)
Drawback
not applicable for general prediction tasks
work only with special input data
equations and optimization algorithms are derived individually
Feature of FM
model all interactions between variables (using factorized parameters)
Advantages of FM
general predictor working with any real-valued feature vector
able to estimate interactions even in problems with huge sparsity
can be calculated in linear time
Prediction Under Sparsity
terms in this paper
prediction function: $\hat{y}: \mathbb{R}^n \rightarrow T$
feature vector: $x \in \mathbb{R}^n$; target domain $T = \mathbb{R}$ for regression, $T = \{+, -\}$ for binary classification
training dataset: $D = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots\}$
describe sparsity of $x$:
number of non-zero elements in $x$: $m(x)$; mean of $m(x)$ over all $x \in D$: $\overline{m}_D$; condition of huge sparsity: $\overline{m}_D \ll n$
indices
user index: $u \in U$ (e.g. Alice (A), Bob (B), Charlie (C), ...); item index: $i \in I$ (e.g. Titanic (TI), Notting Hill (NH), Star Wars (SW), Star Trek (ST), ...); time index: $t \in \mathbb{R}$; rating (target value): $r \in \{1, 2, 3, 4, 5\}$; example of data: Netflix-style movie ratings
Input Data for example
Fig. 1: Example of sparse real-valued feature vectors $x$ created from the transactions of Example 1
FM Model Equation - 1 : Model
$\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j$
where the model parameters are $w_0 \in \mathbb{R}$, $\mathbf{w} \in \mathbb{R}^n$, $V \in \mathbb{R}^{n \times k}$
and the dot product is $\langle v_i, v_j \rangle = \sum_{f=1}^{k} v_{i,f} \cdot v_{j,f}$
where $v_i$ is the $i$-th row of $V$, i.e. a latent vector with $k$ factors ($k \in \mathbb{N}_0^+$ is the dimensionality of the factorization)
FM Model Equation - 2
2-way FMs (degree $d = 2$)
capture all single and pairwise interactions between variables
$w_0$: the global bias
$w_i$: models the strength of the $i$-th variable
$\hat{w}_{i,j} = \langle v_i, v_j \rangle$: models the interaction between the $i$-th and the $j$-th variable
Instead of using an own model parameter $w_{i,j} \in \mathbb{R}$ for each interaction, the FM models the interaction by factorizing it
for any positive definite matrix $W$ there exists a matrix $V$ such that $W = V \cdot V^{t}$, provided $k$ is sufficiently large
typically a small $k$ should be chosen, because
there is not enough data to estimate complex interactions
i.e. restricting $k$ (and hence the expressiveness of the FM) leads to better generalization
Example
Parameter Estimation Under Sparsity: want to estimate the interaction between Alice (A) and Star Trek (ST)
there is no case in the training data where both variables $x_A$ and $x_{ST}$ are non-zero, so a direct weight $w_{A,ST}$ cannot be learned
But with the factorized interaction parameters $\langle v_A, v_{ST} \rangle$ we can estimate it:
from $\langle v_B, v_{SW} \rangle$ and $\langle v_C, v_{SW} \rangle$: Bob and Charlie get similar factor vectors, since both interact similarly with Star Wars
from $v_A$ and $v_C$: Alice gets a factor vector different from Charlie's, since her interactions with Titanic and Star Wars differ
from $\langle v_B, v_{SW} \rangle$ and $\langle v_B, v_{ST} \rangle$: Star Trek gets a factor vector similar to Star Wars', since Bob interacts similarly with both
so $\langle v_A, v_{ST} \rangle$ will be similar to $\langle v_A, v_{SW} \rangle$
FM Model Equation - 3
Complexity
Complexity of straightforward model evaluation: $O(k n^2)$
because all pairwise interactions have to be computed
deduction (reformulation of the pairwise interactions):
$\sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j = \frac{1}{2}\sum_{f=1}^{k}\left(\left(\sum_{i=1}^{n} v_{i,f} x_i\right)^2 - \sum_{i=1}^{n} v_{i,f}^2 x_i^2\right)$
so the model can be computed in $O(k n)$, or $O(k\,\overline{m}_D)$ under sparsity (see the numeric check below)
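A minimal numeric sketch of this deduction (data, shapes, and names are illustrative, not from the paper): it checks that the naive $O(kn^2)$ pairwise sum and the $O(kn)$ reformulation give the same value.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 3                      # n features, k latent factors (illustrative sizes)
x = rng.normal(size=n)           # feature vector
V = rng.normal(size=(n, k))      # latent factor matrix

# naive O(k*n^2): sum over all pairs i < j of <v_i, v_j> * x_i * x_j
naive = sum(V[i].dot(V[j]) * x[i] * x[j]
            for i in range(n) for j in range(i + 1, n))

# linear time O(k*n): 0.5 * sum_f ((sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2)
linear = 0.5 * (((V * x[:, None]).sum(axis=0) ** 2).sum()
                - ((V ** 2) * (x ** 2)[:, None]).sum())

assert np.isclose(naive, linear)  # both formulas give the same interaction term
```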
Learning FM
FMs can be learned by SGD with logistic loss (or hinge loss)
gradient of FM:
$\frac{\partial \hat{y}(x)}{\partial \theta} = \begin{cases} 1 & \text{if } \theta = w_0 \\ x_i & \text{if } \theta = w_i \\ x_i \sum_{j=1}^{n} v_{j,f} x_j - v_{i,f} x_i^2 & \text{if } \theta = v_{i,f} \end{cases}$
the sum $\sum_{j=1}^{n} v_{j,f} x_j$ is independent of $i$ and can be precomputed; thus each gradient is computed in $O(1)$ and a full update in $O(k n)$, or $O(k\,\overline{m}_D)$ under sparsity
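A minimal sketch of one SGD step implementing the gradient above with logistic loss; variable names, the learning rate, and the omission of regularization are my own simplifications.

```python
import numpy as np

def fm_score(x, w0, w, V):
    """Linear-time FM score; also returns s_f = sum_i v_{i,f} x_i, reused by the gradient."""
    s = V.T @ x                                             # shape (k,)
    interaction = 0.5 * ((s ** 2).sum() - ((V ** 2).T @ (x ** 2)).sum())
    return w0 + w @ x + interaction, s

def fm_sgd_step(x, y, w0, w, V, lr=0.01):
    """One SGD step with logistic loss, label y in {-1, +1}; regularization omitted."""
    score, s = fm_score(x, w0, w, V)
    dloss = -y / (1.0 + np.exp(y * score))                  # d loss / d score
    w0 = w0 - lr * dloss                                    # d score / d w0  = 1
    w = w - lr * dloss * x                                  # d score / d w_i = x_i
    # d score / d v_{i,f} = x_i * s_f - v_{i,f} * x_i^2  (s precomputed once per example)
    V = V - lr * dloss * (np.outer(x, s) - V * (x ** 2)[:, None])
    return w0, w, V
```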
Field-aware Factorization Machines for CTR Prediction
Introduction
Existing Models - 1
Poly2
Paper: Training and testing low-degree polynomial data mappings via linear SVM
uses a degree-2 polynomial mapping to capture the information of feature conjunctions
applies a linear model on the explicit form of the degree-2 mapping
training time and test time are much faster than kernel methods
model
$\phi_{\text{Poly2}}(w, x) = \sum_{j_1=1}^{n}\sum_{j_2=j_1+1}^{n} w_{h(j_1, j_2)}\, x_{j_1} x_{j_2}$, where $h(j_1, j_2)$ is a hashing function encoding the pair $(j_1, j_2)$ into an index in $\{1, \ldots, B\}$ (the model size $B$ is a user-specified parameter)
Complexity: $O(\bar{n}^2)$
where $\bar{n}$ is the average number of non-zero elements per instance
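A minimal sketch of the hashed degree-2 mapping; the hash function below is illustrative, the actual hashing used in the paper / Vowpal Wabbit may differ.

```python
def poly2_score(x, w, B):
    """x: dict {feature_index: value}; w: list of B hashed pair weights."""
    idx = sorted(x)
    score = 0.0
    for a, j1 in enumerate(idx):
        for j2 in idx[a + 1:]:
            h = (j1 * 2654435761 + j2) % B    # illustrative hash of the pair (j1, j2)
            score += w[h] * x[j1] * x[j2]
    return score

# usage: 3 non-zero features, B = 10 weight buckets
print(poly2_score({0: 1.0, 4: 0.5, 7: 2.0}, [0.1] * 10, 10))
```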
Existing Models - 2
FMs
Paper: Factorization Machines
model: $\phi_{\text{FM}}(w, x) = \sum_{j_1=1}^{n}\sum_{j_2=j_1+1}^{n} \langle w_{j_1}, w_{j_2} \rangle\, x_{j_1} x_{j_2}$
number of model variables: $n k$; with the deduction (reformulation) from the FM paper
the complexity is reduced from $O(\bar{n}^2 k)$ to $O(\bar{n} k)$
Why can FMs be better than Poly2 when the data set is sparse?
Example
for the (ESPN, Adidas) pair in the example, there is only one (negative) training instance
For Poly2
a very negative weight $w_{\text{ESPN,Adidas}}$ might be learned for this single pair
For FMs
the prediction for (ESPN, Adidas) is determined by $\langle w_{\text{ESPN}}, w_{\text{Adidas}} \rangle$; these latent vectors are also learned from other pairs (e.g. (ESPN, Nike), (NBC, Adidas)), so the prediction may be more accurate
FFMs-1
Idea
Pairwise Interaction Tensor Factorization
use a separate FM-style factorization for each pair of fields (e.g. (User, Item), (User, Tag), (Item, Tag))
Example
Clicked Publisher(P) Advertiser (A) Gender (G)
Y ESPN Nike Male
data for (Yes, ESPN, Nike, Male)
FM: factorize each feature (e.g. Nike, ESPN, ...) into a single latent vector
FFM: factorize each feature (e.g. Nike, ESPN, ...) per field (e.g. Advertiser, Publisher, Gender), i.e. one latent vector per (feature, field) pair
FFMs-2
Model equation
$\phi_{\text{FFM}}(w, x) = \sum_{j_1=1}^{n}\sum_{j_2=j_1+1}^{n} \langle w_{j_1, f_2}, w_{j_2, f_1} \rangle\, x_{j_1} x_{j_2}$
where $f_1$ and $f_2$ are respectively the fields of $j_1$ and $j_2$
Complexity
let the number of fields be $f$
the number of model variables is $n f k$
$n$ = number of features (= length of the feature vector $x$); $f$ = number of fields the features are categorized into; $k$ = length of a latent vector
complexity: $O(\bar{n}^2 k)$
but usually $k_{\text{FFM}} \ll k_{\text{FM}}$, because each FFM latent vector only needs to learn the effect with one specific field
model / #variables / complexity
LM: $n$, $O(\bar{n})$
Poly2: $B$, $O(\bar{n}^2)$
FM: $n k$, $O(\bar{n} k)$
FFM: $n f k$, $O(\bar{n}^2 k)$
comparison of the algorithms (an FFM scoring sketch follows below)
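A minimal sketch of evaluating $\phi_{\text{FFM}}$ on one instance given as (field, feature, value) triples; the dict-of-arrays layout is illustrative, not libffm's internal representation.

```python
import numpy as np

def ffm_score(instance, W):
    """instance: list of (field, feature, value) triples;
    W[feature][field] is the length-k latent vector of that feature for that field."""
    score = 0.0
    for a in range(len(instance)):
        f1, j1, v1 = instance[a]
        for b in range(a + 1, len(instance)):
            f2, j2, v2 = instance[b]
            score += W[j1][f2].dot(W[j2][f1]) * v1 * v2   # <w_{j1,f2}, w_{j2,f1}> x_{j1} x_{j2}
    return score

# usage with the (Publisher, Advertiser, Gender) example, k = 2
k = 2
W = {j: {f: np.full(k, 0.1) for f in ("P", "A", "G")} for j in ("ESPN", "Nike", "Male")}
x = [("P", "ESPN", 1.0), ("A", "Nike", 1.0), ("G", "Male", 1.0)]
print(ffm_score(x, W))
```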
FFMs-3
Solving Optimization Problem - Algorithm
Let $G \in \mathbb{R}^{n \times f \times k}$ be a tensor of all ones (accumulated squared gradients for the AdaGrad-style updates)
Run the following loop for $t$ epochs
for $i \in \{1, \ldots, m\}$ do
sample a data point $(y, x)$
calculate $\kappa$
for non-zero terms $j_1$ in $x$ do
for non-zero terms $j_2$ in $x$ ($j_2 > j_1$) do
calculate the sub-gradients, see (A)
for $d = 1, \ldots, k$ do
update the sum of squared gradients, see (B); update the model, see (C)
initial values
$\eta$ is a user-specified learning rate; $\lambda$ is the regularization coefficient (both set per experiment in the paper); the entries of $w$ are randomly sampled from a uniform distribution on $[0, 1/\sqrt{k}]$
the entries of $G$ are set to 1 in order to prevent a large value of $(G_{j_1,f_2})^{-1/2}$ in the first updates
FFMs-4
Solving Optimization Problem - formula
(A) sub-gradients: $g_{j_1, f_2} = \lambda \cdot w_{j_1, f_2} + \kappa \cdot w_{j_2, f_1}\, x_{j_1} x_{j_2}$, $\quad g_{j_2, f_1} = \lambda \cdot w_{j_2, f_1} + \kappa \cdot w_{j_1, f_2}\, x_{j_1} x_{j_2}$
where $\kappa = \frac{\partial \log(1 + \exp(-y\,\phi_{\text{FFM}}(w,x)))}{\partial \phi_{\text{FFM}}(w,x)} = \frac{-y}{1 + \exp(y\,\phi_{\text{FFM}}(w,x))}$
(B) for each coordinate $d = 1, \ldots, k$, the sum of squared gradients is accumulated: $(G_{j_1,f_2})_d \leftarrow (G_{j_1,f_2})_d + (g_{j_1,f_2})_d^2$ (analogously for $(G_{j_2,f_1})_d$)
(C) update $w_{j_1,f_2}$ and $w_{j_2,f_1}$: $(w_{j_1,f_2})_d \leftarrow (w_{j_1,f_2})_d - \frac{\eta}{\sqrt{(G_{j_1,f_2})_d}}\,(g_{j_1,f_2})_d$ (analogously for $w_{j_2,f_1}$); a sketch of one such step follows below
where $w_{j,f}$, $g_{j,f}$, $G_{j,f}$ all have size $k$
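A minimal sketch of one training step implementing (A)-(C), reusing ffm_score from the sketch above; the default values for $\lambda$ and $\eta$ are illustrative.

```python
import numpy as np

def ffm_sgd_step(instance, y, W, G, lam=2e-5, eta=0.2):
    """One AdaGrad-style update; instance: list of (field, feature, value), y in {-1, +1}.
    W and G map feature -> field -> np.ndarray(k); lam/eta defaults are illustrative."""
    phi = ffm_score(instance, W)                  # from the sketch above
    kappa = -y / (1.0 + np.exp(y * phi))          # d log(1 + exp(-y*phi)) / d phi
    for a in range(len(instance)):
        f1, j1, v1 = instance[a]
        for b in range(a + 1, len(instance)):
            f2, j2, v2 = instance[b]
            # (A) sub-gradients
            g1 = lam * W[j1][f2] + kappa * W[j2][f1] * v1 * v2
            g2 = lam * W[j2][f1] + kappa * W[j1][f2] * v1 * v2
            # (B) accumulate squared gradients per coordinate
            G[j1][f2] += g1 ** 2
            G[j2][f1] += g2 ** 2
            # (C) AdaGrad update
            W[j1][f2] -= eta / np.sqrt(G[j1][f2]) * g1
            W[j2][f1] -= eta / np.sqrt(G[j2][f1]) * g2
```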
empirical experience in paper
we find that normalizing each instance to have the unit length makes the test accuracy slightly better and insensitive to parameters.
Parallelization on Shared-memory Systems
apply HOGWILD! (lock-free SGD) for parallelization
the outermost for loop (over sampled data points) is parallelized; see the Experiment section for more detail
Adding Field Information
data format for the packages
LIBSVM data format: label feat1:val1 feat2:val2 ...
for FFM, consider: label field1:feat1:val1 field2:feat2:val2 ...
terms
label $y$: $+1$ = true (clicked), $-1$ = false (not clicked); fields: P = Publisher, A = Advertiser, G = Gender
Categorical Feature
from: Yes P:ESPN A:Nike G:Male
convert each category to a boolean feature: Yes P:P-ESPN:1 A:A-Nike:1 G:G-Male:1
Numerical Features
Accepted AR (accept rate) Hidx (h-index) Cite (citations)
Yes 45.73 2 3
No 1.04 100 50000
from
convert the data with one of two strategies
naive way (dummy fields): each feature's field is merely a duplicate of the feature itself, e.g. Yes AR:AR:45.73 Hidx:Hidx:2 Cite:Cite:3
discretize each numerical feature to a categorical one (e.g. with a rounding strategy), then use the same data format as for categorical features
i.e. apply the strategy to the original value; the result of the discretization becomes the 'feature' part and the value is converted to a boolean 1 (see the sketch below)
drawbacks
not easy to determine the best setting (discretization strategy)
some information is lost after discretization
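A minimal sketch of these conversions: one record with categorical and numerical features is turned into the FFM text format, using either the dummy-field or the discretization strategy. The field/feature naming scheme is illustrative; real tools map names to integer indices.

```python
def to_ffm(label, categorical, numerical, discretize=None):
    """categorical: dict field -> category; numerical: dict field -> float.
    Returns one line in the 'label field:feature:value' text format."""
    tokens = [str(label)]
    for field, cat in categorical.items():
        tokens.append(f"{field}:{field}-{cat}:1")               # one boolean feature per category
    for field, val in numerical.items():
        if discretize is None:
            tokens.append(f"{field}:{field}:{val}")              # dummy field keeps the raw value
        else:
            tokens.append(f"{field}:{int(discretize(val))}:1")   # discretized value -> boolean feature
    return " ".join(tokens)

print(to_ffm(1, {"P": "ESPN", "A": "Nike", "G": "Male"}, {"AR": 45.73}, discretize=round))
# -> 1 P:P-ESPN:1 A:A-Nike:1 G:G-Male:1 AR:46:1
```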
Experiment-1
evaluation
use logistic loss: $\text{logloss} = \frac{1}{m}\sum_{i=1}^{m} \log\left(1 + \exp(-y_i\,\phi(w, x_i))\right)$
where $m$ is the number of test instances
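A minimal sketch of this metric, with labels $y \in \{-1,+1\}$ and raw model scores $\phi$:

```python
import numpy as np

def logloss(y, phi):
    """y: labels in {-1, +1}; phi: raw model scores for the m test instances."""
    return np.mean(np.log1p(np.exp(-y * phi)))
```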
impact of parameter $k$ (number of latent factors)
impact of parameter $\lambda$ (regularization)
impact of parameter $\eta$ (learning rate)
Early Stopping
various methods were tried, such as lazy updates [5] and ALS-based optimization, but early stopping worked best
1. Split the data set into a training set and a validation set.
2. At the end of each epoch, use the validation set to calculate the loss.
3. If the loss goes up, record the number of epochs. Stop or go to step 4.
4. If needed, use the full data set to re-train a model with the number of epochs obtained in step 3.
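A minimal sketch of this early-stopping loop; train_one_epoch and validation_loss are placeholders for the actual FFM training and evaluation code.

```python
def train_with_early_stopping(model, train_set, valid_set, max_epochs=50):
    """Returns the number of epochs to use when re-training on the full data set."""
    best_loss, best_epoch = float("inf"), 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(model, train_set)           # placeholder: one SGD pass over the training split
        loss = validation_loss(model, valid_set)    # placeholder: logloss on the validation split
        if loss > best_loss:                        # loss went up -> stop
            break
        best_loss, best_epoch = loss, epoch
    return best_epoch
```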
Speedup
definition of speedup: (running time with one thread) / (running time with multiple threads)
Epoch-loss
the loss converges after about 13 epochs
thread-speedup
the speedup saturates when using more than 8 threads (memory lock / contention)
Comparison on More Datasets
Practical Lessons from Predicting Clicks on Ads at Facebook
Abstract
Volume: over 750 million daily active users (DAU), over 1 million active advertisers
Combined decision trees with logistic regression
outperforming either of these methods on its own by over 3%
Explore how a number of fundamental parameters impact the final prediction performance of the system
Most important thing is to have the right features
historical information about the user or ad dominates other types of features (e.g. contextual features)
Right Model & Features -> other factors play small roles
Introduction
Billing : bid and pay per click auctions
The efficiency of an ads auction depends on the accuracy and calibration of click prediction.
Needs..
Robust and adaptive
Capable of learning from massive volumes of data
Feature of Facebook Ads
Ads are not associated with a query
but specify demographic and interest targeting
Traditional sponsored search advertising
user query is used to retrieve candidate ads
Experimental Setup
Evaluation Metrics
Normalized Entropy (NE): the predictive log loss per impression, normalized by the entropy of the background CTR:
$\text{NE} = \frac{-\frac{1}{N}\sum_{i=1}^{N}\left(\frac{1+y_i}{2}\log p_i + \frac{1-y_i}{2}\log(1-p_i)\right)}{-\left(p \log p + (1-p)\log(1-p)\right)}$
where $p$ = empirical (average) CTR of the training data set, $y_i \in \{-1,+1\}$ the labels, and $p_i$ the estimated probability of a click
Calibration: ratio of the average estimated CTR and empirical CTR
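A minimal sketch of both metrics, with labels $y \in \{-1,+1\}$ and predicted click probabilities $p$:

```python
import numpy as np

def normalized_entropy(y, p):
    """y: labels in {-1, +1}; p: predicted click probabilities."""
    y01 = (y + 1) / 2.0                                   # map {-1, +1} -> {0, 1}
    ce = -np.mean(y01 * np.log(p) + (1 - y01) * np.log(1 - p))
    ctr = y01.mean()                                      # background (empirical) CTR
    background = -(ctr * np.log(ctr) + (1 - ctr) * np.log(1 - ctr))
    return ce / background

def calibration(y, p):
    """Ratio of average estimated CTR to empirical CTR."""
    return p.mean() / ((y + 1) / 2.0).mean()
```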
Prediction Model Structure-1
System Architecture
Figure 1: Boosted decision trees + Probabilistic sparse linear classifier (for online learning)
Prediction Model Structure-2
Model
online learning schemes : based on the Stochastic Gradient Descent (SGD) algorithm
after the feature transform, the input $x$ is a sparse binary vector: each boosted tree is treated as a categorical feature whose value is the index of the leaf the instance falls into, so the number of binary indicators equals the total number of leaves of the boosted decision trees
$e_i$: the $i$-th unit vector; $i_1, \ldots, i_k$: the values of the $k$ categorical input features
for a labeled impression $(x, y)$ with $y \in \{+1, -1\}$
Bayesian online learning scheme for probit regression (BOPR)
GLM (with probit link function)
likelihood: $p(y \mid x, w) = \Phi\!\left(\frac{y \cdot w^{T} x}{\beta}\right)$, where $\Phi$ is the standard normal CDF and $\beta$ scales the steepness
Prior: a factorizing Gaussian over the weights, $p(w) = \prod_i N(w_i;\, \mu_i,\, \sigma_i^2)$
Posterior: approximated online by a factorizing Gaussian (the closest one in KL divergence)
The resulting model consists of the mean and the variance of the approximate posteriordistribution of weight vector
Prediction Model Structure-3
update: closed-form per-example updates of the mean $\mu_i$ and variance $\sigma_i^2$ of each active weight (the exact formulas are given in the paper)
more discussion
This inference can be viewed as an SGD scheme
the update of the means (A) can be seen as a per-coordinate gradient descent like (B), where the posterior variance $\sigma_i^2$ plays the role of a per-coordinate learning rate
Prediction Model Structure-4
Decision tree feature transformers
Boosted decision tree
follows the Gradient Boosting Machine (GBM); the L2-TreeBoost algorithm is used
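A minimal sketch of the tree-feature-transform idea using scikit-learn rather than Facebook's production stack; the dataset, model sizes, and hyperparameters are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# each boosted tree becomes a categorical feature whose value is the leaf index
gbdt = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)
gbdt.fit(X_tr, y_tr)
leaves_tr = gbdt.apply(X_tr)[:, :, 0]     # shape (n_samples, n_trees): leaf index per tree
leaves_te = gbdt.apply(X_te)[:, :, 0]

# one-hot encode the leaf indices and feed them to a (sparse) logistic regression
enc = OneHotEncoder(handle_unknown="ignore").fit(leaves_tr)
lr = LogisticRegression(max_iter=1000).fit(enc.transform(leaves_tr), y_tr)
print("test accuracy of LR on tree features:", lr.score(enc.transform(leaves_te), y_te))
```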
experiment
compare logistic regression with tree-transformed features against logistic regression on the plain features (and a trees-only baseline)
result
Model Structure NE (relative to Trees only)
LR + Trees 96.58%
LR only (non-transformed) 99.43%
Trees only 100% (reference)
Table 1
Logistic Regression (LR) and boosted decision trees (Trees) make a powerful combination. We evaluate them by their Normalized Entropy (NE) relative to that of the Trees only model.
Prediction Model Structure -2
Data freshness
Click prediction systems are often deployed in dynamic environments where the data distribution changes over time.
Experiment
train a model on one particular day of data
evaluate it on each of the six consecutive days
result
it is worth retraining daily
But: retraining the full model daily is too expensive
the boosted decision trees can be retrained daily (with some restrictions)
the linear classifier can be trained in near real-time (online learning)
Prediction Model Structure -3
Online Linear Classifier - Compare BOPR and SGD -1
name / learning rate schema / parameters
Per-coordinate: $\eta_{t,i} = \alpha / (\beta + \sqrt{\sum_{j=1}^{t} \nabla_{j,i}^2})$, α = 0.1, β = 1.0
Per-weight square root: $\eta_{t,i} = \alpha / \sqrt{n_{t,i}}$, α = 0.01
Per-weight: $\eta_{t,i} = \alpha / n_{t,i}$, α = 0.01
Global: $\eta_{t,i} = \alpha / \sqrt{t}$, α = 0.01
Constant: $\eta_{t,i} = \alpha$, α = 0.0005
SGD learning rate schemes and parameters ($n_{t,i}$ = number of training instances seen so far with a non-zero feature $i$); a per-coordinate sketch follows below
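A minimal sketch of the per-coordinate scheme (the best-performing one in the paper) applied to logistic regression; the data handling and default values are illustrative.

```python
import numpy as np

def train_per_coordinate_lr(X, y, alpha=0.1, beta=1.0, epochs=5):
    """X: (m, n) matrix; y: labels in {-1, +1}. Returns the weight vector."""
    m, n = X.shape
    w = np.zeros(n)
    grad_sq_sum = np.zeros(n)                 # per-coordinate sum of squared gradients
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for i in rng.permutation(m):
            x_i, y_i = X[i], y[i]
            g = -y_i * x_i / (1.0 + np.exp(y_i * x_i.dot(w)))   # gradient of log(1+exp(-y w.x))
            grad_sq_sum += g ** 2
            w -= alpha / (beta + np.sqrt(grad_sq_sum)) * g       # eta_{t,i} = alpha / (beta + sqrt(sum))
    return w
```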
Result
Model Type NE (relative to LR)
LR 100% (reference)
BOPR 99.82%
Prediction Model Structure-4
Online Linear Classifier - Compare BOPR and SGD-2
Per-coordinate online LR vs BOPR
advantages of LR over BOPR
model size is half (only a weight per feature, instead of a mean and a variance)
fast (depending on the implementation); low computational cost
advantage of BOPR over LR
being a Bayesian formulation -> provides a full predictive distribution
can compute percentiles of the click probability
can be used for explore/exploit learning schemes
Online Data Joiner
forms a tight closed loop
train classifier layer online
positive and negative
Positive = user clicked the ad
Negative = user did not click the ad after a fixed and sufficiently long period of time after seeing the ad
architecture for serving ad and join data
1. The initial data stream is generated when a user visits Facebook
2. a request is made to the ranker for candidate ads
3. The ads are passed back to the user’s device and in parallel
each ad and the associated features used in ranking that impression are added to the impression stream
4. only after the full join window has expired is the labelled impression emitted to the training stream
Other
protection mechanisms against anomalies that could corrupt the online learning system
Containing Memory and Latency-1
Number of boosting trees
Boosting Feature importance
Containing Memory and Latency-2
Historical feature
contextual feature
depend on current information regarding the context, e.g.: the device used by the user, the current page that the user is on, ...
historical feature
depend on previous interactions of the ad or user
ex : CTR of ad in last week, average CTR of user, ..
Percentage of historical feature
Type of features NE (relative to Contextual)
All 95.65%
Historical only 96.32%
Contextual only 100%
but contextual features are very important to handle the cold start problem
data freshness
historical features describe long-time accumulated user behavior
much more stable than contextual features
Massive Training Data
Massive Data -> Need Sampling
Uniform subsampling
experiment with sampling rate 0.01, 0.1, 0.5, 1
Negative Down sampling
Discussion
Practical Lessons
Data freshness matters: it significantly increases the prediction accuracy
Best online learning method: BOPR is (slightly) better than LR-SGD
Results
trade-off between the number of boosted decision trees and accuracy
feature selection effect by boosted decision trees