Kaggle and Click-Through Rate Prediction


Computer Science Faculty Publications

2-27-2019

Kaggle and Click-Through Rate Prediction

Todd W. Neller Gettysburg College

Follow this and additional works at: https://cupola.gettysburg.edu/csfac

Part of the Databases and Information Systems Commons


Recommended Citation: Neller, Todd W. "Kaggle and Click-Through Rate Prediction." Presented at the Gettysburg College Computer Science Colloquium, Gettysburg, PA, February 2019.

This is the author's version of the work. This publication appears in Gettysburg College's institutional repository by permission of the copyright owner for personal use, not for redistribution. Cupola permanent link: https://cupola.gettysburg.edu/csfac/56

This open access presentation is brought to you by The Cupola: Scholarship at Gettysburg College. It has been accepted for inclusion by an authorized administrator of The Cupola. For more information, please contact cupola@gettysburg.edu.

Kaggle and Click-Through Rate Prediction

Abstract: Neller presented a look at Kaggle.com, an online Data Science and Machine Learning community, as a place to seek rapid, experiential peer education on nearly any Data Science topic. Using the specific challenge of Click-Through Rate Prediction (CTRP), he focused on lessons learned from relevant Kaggle competitions on how to perform CTRP.

Keywords: Data science, machine learning, Kaggle, click-through rate prediction

Disciplines: Computer Sciences | Databases and Information Systems

Comments: This presentation was given at the Gettysburg College Computer Science Colloquium on February 27th, 2019.

This presentation is available at The Cupola: Scholarship at Gettysburg College: https://cupola.gettysburg.edu/csfac/56

Dr. Todd W. Neller Professor of Computer Science

Click-Through Rate (CTR) Prediction

• Number of impressions = number of times an advertisement was served/offered

• Given: much data on past link offerings and whether or not users clicked on those links

• Predict: the probability that a current user will click on a given link

Example Data on Past Link Offerings

• User data: – User ID from site login, cookie

– User IP address, IP address location

• Link context data: – Site ID, page ID, prior page(s)

– Time, date

• Link data: – Link ID, keywords

– Position offered on page

Example: Facebook Information

Better CTR Prediction → Better Ad Selection → Greater Click-Through Rate → Greater Ad Revenue

Why is CTR Prediction Important?

• Advertising Industry View:

– Much of online advertising is billed using a pay-per-click model.

New Idea?

https://www.slideshare.net/savvakos/how-you-can-earn-attention-in-the-digital-world-80695468

Benefits Beyond Advertising

• Herbert Simon, 1971: “In an information-rich world, the wealth of information means a dearth of something else: a scarcity of whatever it is that information consumes. What information consumes is rather obvious: the attention of its recipients.”

• Better CTR prediction → more relevance → better use of scarce time

Outline

• Click-Through Rate Prediction (CTRP) Introduction
• Kaggle
  – Learning community offerings and incentives
  – CTRP Competitions
• Feature Engineering
  – Numbers, Categories, and Missing Values
• Favored regression techniques for CTRP
  – Logistic Regression
  – Gradient Boosted Decision Trees (e.g. XGBoost)
  – Field-aware Factorization Machines (FFMs)

• Future Recommendations

What is Kaggle.com?

• Data Science and Machine Learning community featuring:
  – Competitions: $$$, peer learning, experience, portfolio

– Datasets

– Kernels

– Discussions

– Tutorials (“Courses”)

– Etc.

• Status incentives

Discussions

Tutorials

Status Incentives

Kaggle CTRP Competitions

Criteo Display Advertising Challenge

• Criteo Display Advertising Challenge Data:

– Features (inputs):

• 13 numeric: unknown meanings, mostly counts, power laws evident

• 26 categorical: unknown meanings, hashed (encoding without decoding), few dominant, many unique

– Target (output): 0 / 1 (didn’t / did click through)

Hashing

• A hash function takes some data and maps it to a number.

• Example: URL (web address)
  – Representation: string of characters
  – Character representation: a number (Unicode value)
  – Start with value 0.
  – Repeat for each character:
    • Multiply value by 31
    • Add the next character's Unicode value
  – Don’t worry about overflow – it’s just a consistent “mathematical blender”.
• “Hi”: H = 72, i = 105 → 31 * 72 + 105 = 2337
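A minimal Python sketch of the 31-based string hash described above (an assumption on my part: this mirrors a Java-style String.hashCode; Python integers don't overflow, so there is no 32-bit wrap-around here):

```python
def string_hash(s: str) -> int:
    """Polynomial string hash: value = value * 31 + Unicode value of each character."""
    value = 0
    for ch in s:
        value = value * 31 + ord(ch)  # multiply by 31, add the character's Unicode value
    return value

print(string_hash("Hi"))  # 31 * 72 + 105 = 2337
```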

Hash Function Characteristics

• Mapping: same input → same output

• Uniform: outputs have similar probabilities
  – Collision: two different inputs → same output
  – Collisions are allowable (inevitable if #data > #hash values) but not desirable.

• Non-invertible: can’t get back input from output (e.g. cryptographic hashing, anonymization)

Missing Data

• The first 10 lines of the training data:

• Missing numeric and categorical features:

Missing Data: Imputation

• One approach to dealing with missing data is to impute values, i.e. replace with reasonable values inferred from surrounding data.

• In other words, create predictors for each value based on other known/unknown values.

• Cons: – Difficult to validate.

– In Criteo data, missing values are correlated.

– So … we’re writing predictors to impute data we’re learning predictors from?

General Introduction to Handling Missing Data

Missing Data: Embrace the “Unknown”

• Retain “unknown” as data that contains valuable information.

• Does the lack of CTR context data caused by incognito browsing mode provide information on what a person is more likely to click?

• Categorical data: For each category C# with missing data, create a new category value “C#:unknown”.

Missing Data: Embrace the “Unknown”

• Numeric data:

– Create an additional feature that indicates whether the value for a feature is (un)known.

• Additionally could impute mean, median, etc., for unknown value.

– Convert to categorical and add “C#:unknown” category…
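A hedged pandas sketch of both ideas above; the column names I1 and C1 are hypothetical, not taken from the Criteo schema:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"I1": [3.0, np.nan, 7.0],        # hypothetical numeric feature
                   "C1": ["ad46", None, "be12"]})   # hypothetical categorical feature

# Numeric: add an is-unknown indicator, then (optionally) impute the median.
df["I1_unknown"] = df["I1"].isna().astype(int)
df["I1"] = df["I1"].fillna(df["I1"].median())

# Categorical: keep "unknown" as an explicit category value.
df["C1"] = df["C1"].fillna("C1:unknown")
print(df)
```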

Numeric to Categorical: Binning

• Histogram-based
  – Uniform ranges: (+) simple; (−) uneven distribution, poor for non-uniform data
  – Uniform ranges on a transformation (e.g. log): (+) somewhat simple; (−) transformation requires understanding of the data distribution
• Quantiles
  – E.g. quartiles = 4-quantiles, quintiles = 5-quantiles
  – (+) simple, even distribution by definition; (−) preponderance of few values → duplicate bins (eliminate)
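A hedged sketch of the three binning options with pandas; the skewed sample data is illustrative only:

```python
import numpy as np
import pandas as pd

values = pd.Series(np.random.lognormal(mean=2.0, sigma=1.0, size=1000))  # skewed, power-law-like

uniform   = pd.cut(values, bins=4)                    # uniform ranges: uneven counts on skewed data
log_bins  = pd.cut(np.log1p(values), bins=4)          # uniform ranges on a log transform
quartiles = pd.qcut(values, q=4, duplicates="drop")   # 4-quantiles; drop duplicate bin edges

print(quartiles.value_counts())                       # roughly 250 values per bin by construction
```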

Categorical to Numeric: One-Hot Encoding

• For each categorical input variable:
  – For each possible category value, create a new numeric input variable that can be assigned value 1 (“belongs to this category”) or 0 (“does not belong to this category”).
  – For each input, replace the categorical variable with these new numeric inputs.

https://www.kaggle.com/dansbecker/using-categorical-data-with-one-hot-encoding
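A minimal pandas sketch of one-hot encoding; the “OS” field and its values are the hypothetical example used later in these slides:

```python
import pandas as pd

df = pd.DataFrame({"OS": ["Windows", "MacOS", "Android", "Windows"]})

# One 0/1 numeric column per category value, replacing the original categorical column.
one_hot = pd.get_dummies(df, columns=["OS"], prefix="OS")
print(one_hot)
```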

Categorical to Numeric: Hashing

• When there are a large number of categories, one-hot encoding isn’t practical.

– E.g. Criteo data category C3 in its small sample of CTR data had 10,131,226 distinct categorical values.

– One approach (e.g. for power law data): one-hot encode few dominant values plus “rare” category.

– Hashing trick:

• Append category name and unusual character before category value and hash to an integer.

• Create a one-hot-like category for each integer.

Hashing Trick Example

• From https://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf:

• Fundamental tradeoff: a greater/lesser number of hashed features results in …
  – … more/less expensive computation
  – … less/more frequent hash collisions (i.e. unlike categories treated as alike)
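A hedged sketch of the hashing trick itself; md5 is used only to get a deterministic hash, and the Criteo-style category values are made up:

```python
import hashlib

B = 2 ** 20  # number of hashed features: larger -> fewer collisions, more computation/memory

def hashed_index(field_name: str, value: str, buckets: int = B) -> int:
    # Prefix the value with its field name so identical values in different fields hash apart.
    key = f"{field_name}={value}".encode("utf-8")
    return int(hashlib.md5(key).hexdigest(), 16) % buckets

row = {"C3": "68fd1e64", "C4": "80e26c9b"}  # hypothetical hashed category values
indices = sorted(hashed_index(name, val) for name, val in row.items())
print(indices)  # positions of the 1s in a sparse, one-hot-like feature vector
```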

Example: Passing vs. Studying

Unknown Logistic Model

Misapplication of Linear Regression

Logistic Regression Recovering Model

Logistic Regression with Stochastic Gradient Descent

• Output: p(x) = 1 / (1 + e^−(β0 + β1·x))
• Initially: β0 = β1 = 0
• Repeat:
  – For each input x:
    • Adjust intercept β0 by learning rate * error * p′(x)
    • Adjust coefficient β1 by learning rate * error * p′(x) * x
• Note:
  – Error = y − p(x)
  – p′(x) = p(x) * (1 − p(x)) (the slope of p at x)
  – This is neural network learning with a single logistic neuron with a bias input of 1.
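A short Python sketch of the update rule above; the pass/fail data is made up for illustration, not the dataset from the slides:

```python
import math

def train_logistic_sgd(data, lr=0.1, epochs=2000):
    """data: list of (x, y) pairs with y in {0, 1}."""
    b0 = b1 = 0.0
    for _ in range(epochs):
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))  # logistic output p(x)
            error = y - p
            slope = p * (1.0 - p)                       # p'(x)
            b0 += lr * error * slope                    # adjust intercept
            b1 += lr * error * slope * x                # adjust coefficient
    return b0, b1

# Hypothetical hours-studied vs. pass/fail data.
data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0), (2.5, 1), (3.0, 1), (3.5, 1), (4.0, 1)]
print(train_logistic_sgd(data))
```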

Logistic Regression Takeaways

• The previous algorithm doesn’t require complex software. (12 lines raw Python code)

• Easy and effective for CTR prediction.

• Key to good performance: skillful feature engineering of numeric features

• Foreshadowing: Since logistic regression is a simple special case of neural network learning, I would expect deep learning tools to make future inroads here.

Maximizing Info with Decisions

• Number Guessing Game example:

– “I’m thinking of a number from 1 to 100.”

– Number guess → “Higher.” / “Lower.” / “Correct.”

– What is the best strategy and why?

• Good play maximizes information according to some measure (e.g. entropy).
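For concreteness, a small bisection sketch of that strategy (each guess halves the remaining interval, which maximizes expected information gained per guess):

```python
def guess_number(answer, low=1, high=100):
    guesses = 0
    while low <= high:
        guess = (low + high) // 2   # split the remaining range in half
        guesses += 1
        if guess == answer:
            return guesses          # "Correct."
        elif guess < answer:
            low = guess + 1         # "Higher."
        else:
            high = guess - 1        # "Lower."

print(max(guess_number(n) for n in range(1, 101)))  # 7 guesses suffice, since 2^7 = 128 >= 100
```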

Decision Trees for Regression (Regression Trees)

• Numeric features (missing values permitted)

• At each node in the tree, a branch is chosen according to a feature's value (or the lack thereof)

A regression tree estimating the probability of kyphosis (hunchback) after surgery, given the age of the patient and the vertebra at which surgery was started. Source: https://en.wikipedia.org/wiki/Decision_tree_learning
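A hedged scikit-learn sketch of a shallow regression tree used as a probability estimator; the data is synthetic and only in the spirit of the kyphosis example, not the real dataset:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic (age in months, starting vertebra) -> 0/1 outcome pairs.
X = np.array([[12, 3], [60, 9], [100, 14], [150, 5], [27, 12], [80, 8], [130, 10], [18, 6]])
y = np.array([0, 1, 0, 1, 0, 1, 1, 0])

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(tree.predict([[90, 10]]))  # leaf mean of the 0/1 targets, read as an estimated probability
```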

The Power of Weak Classifiers

• Caveats:
  – Too deep: single-instance leaves → overfitting; similar to nearest neighbor (n = 1)
  – Too shallow: large hyperrectangular sets → underfitting; poor, blocky generalization

• Many weak classifiers working together can achieve good fit and generalization.
  – “Plans fail for lack of counsel, but with many advisers they succeed.” – Proverbs 15:22

• Ensemble methods: boosting, bagging, stacking

Gradient Boosting of Regression Trees

• Basic boosting idea:
  – Initially, make a 0 or constant prediction.
  – Repeat:
    • Compute prediction errors from the weighted sum of our weak-learner predictions.
    • Fit a new weak learner to predict these errors and add its weighted error-prediction to our model.

• Alex Rogozhnikov’s beautiful demonstration: https://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html
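A compact sketch of this loop with squared error and shallow scikit-learn trees; all settings and the synthetic data are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=50, lr=0.1, depth=2):
    prediction = np.full(len(y), y.mean())       # start with a constant prediction
    trees = []
    for _ in range(n_rounds):
        residuals = y - prediction               # errors of the current ensemble
        tree = DecisionTreeRegressor(max_depth=depth).fit(X, residuals)
        prediction += lr * tree.predict(X)       # add the weighted error-predictor
        trees.append(tree)
    return trees, prediction

rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = 3 * X[:, 0] + np.sin(6 * X[:, 1]) + 0.1 * rng.standard_normal(200)
_, fit = gradient_boost(X, y)
print(np.mean((y - fit) ** 2))                   # training error shrinks as rounds increase
```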

XGBoost

• “Among the 29 challenge winning solutions published at Kaggle’s blog during 2015, 17 solutions [~59%] used XGBoost. Among these solutions, eight [~28%] solely used XGBoost to train the model, while most others combined XGBoost with neural nets in ensembles.” – Tianqi Chen and Carlos Guestrin, “XGBoost: A Scalable Tree Boosting System”

XGBoost Features

• XGBoost is a specific implementation of gradient boosted decision trees that:
  – Supports a command-line interface, C++, Python (scikit-learn), R (caret), and Java/JVM languages + the Hadoop platform
  – Runs in a range of computing environments with parallelization, distributed computing, etc.

– Handles sparse, missing data

– Is fast and high-performance across diverse problem domains

– https://xgboost.readthedocs.io
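A minimal usage sketch of the Python (scikit-learn style) interface; the random data and hyperparameters are placeholders, not tuned CTR settings:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.random((1000, 20))                      # hypothetical engineered features
y = (rng.random(1000) < 0.2).astype(int)        # hypothetical 0/1 click labels

model = xgb.XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1,
                          objective="binary:logistic", eval_metric="logloss")
model.fit(X, y)
print(model.predict_proba(X[:5])[:, 1])         # predicted click-through probabilities
```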

Field-aware Factorization Machines (FFMs)

• Top-performing technique in 3 of 4 Kaggle CTR prediction competitions plus RecSys 2015:
  – Criteo: https://www.kaggle.com/c/criteo-display-ad-challenge

– Avazu: https://www.kaggle.com/c/avazu-ctr-prediction

– Outbrain: https://www.kaggle.com/c/outbrain-click-prediction

– RecSys 2015: http://dl.acm.org/citation.cfm?id=2813511&dl=ACM&coll=DL&CFID=941880276&CFTOKEN=60022934

What’s Different? Field-Aware Latent Factors

• Latent factor

– learned weight; tuned variable

– How much an input contributes to an output

• Many techniques learn “latent factors”:

– Linear regression: one per feature + 1

– Logistic regression: one per feature + 1

What’s Different? Field-Aware Latent Factors (cont.)

• Many techniques learn “latent factors”:

– Degree-2 polynomial regression: one per pair of features

– Factorization machine (FM):

• k per feature

• “latent factor vector”, a.k.a. “latent vector”

What’s Different? Field-Aware Latent Factors (cont.)

• Many techniques learn “latent factors”:

– Field-aware Factorization machine (FFM):

• k per feature and field pair

• Field: – Features are often one-hot encoded

– A contiguous block of binary features often represents different values of the same underlying “field”

– E.g. Field: “OS”, features: “Windows”, “MacOS”, “Android”

– libffm: FFM library (https://github.com/guestwalk/libffm)
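A hedged sketch of writing one training line in libffm's text format (label followed by field:feature:value triples); the field names, hashing scheme, and bucket size are illustrative assumptions, not libffm requirements:

```python
import hashlib

def feat_index(field_name: str, value: str, buckets: int = 2 ** 20) -> int:
    # Hash "field=value" to a feature index, as in the hashing trick earlier.
    key = f"{field_name}={value}".encode("utf-8")
    return int(hashlib.md5(key).hexdigest(), 16) % buckets

label = 1
fields = {"OS": "Android", "SiteID": "ad46", "LinkID": "be12"}  # hypothetical fields and values
line = str(label) + " " + " ".join(
    f"{field_id}:{feat_index(name, value)}:1"
    for field_id, (name, value) in enumerate(fields.items()))
print(line)  # e.g. "1 0:873211:1 1:40127:1 2:991554:1"
```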

Is the Extra Engineering Worth it?

• Kaggle Criteo leaderboard based on logarithmic loss (a.k.a. logloss)

– 0.69315 → 50% correct in binary classification (random-guessing baseline)

• Simple logistic regression with hashing trick:

– 0.46881 (private leaderboard) → ~62.6% correct

• FFM with feature engineering using GBDT:

– 0.44463 (private leaderboard) → ~64.1% correct
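The “% correct” figures above appear to correspond to exp(−logloss), i.e. the geometric-mean probability assigned to the true class; a quick check:

```python
import math

for logloss in (0.69315, 0.46881, 0.44463):
    print(f"{logloss:.5f} -> {math.exp(-logloss):.3f}")  # 0.500, 0.626, 0.641
```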

Computational Cost

• ~1.5% increase in correct prediction, but greater computational complexity:

– Logistic regression: n factors to learn and relearn in dynamic context

– FFM: k·n² factors to learn and relearn

Published Research from the Trenches

• Initial efforts focused on logistic regression

• Most big production systems reportedly kept it simple in the final stage of prediction:
  – Google (2013): probabilistic feature inclusion + Bloom filters → logistic regression
  – Facebook (2014): boosted decision trees → logistic regression
  – Yahoo (2014): hashing trick → logistic regression

• However…

Towards Neural Network Prediction

• More recently, Microsoft (2017) research

– reports “factorization machines (FMs), gradient boosting decision trees (GBDTs) and deep neural networks (DNNs) have also been evaluated and gradually adopted in industry.”

– recommends boosting neural networks with gradient boosting decision trees

Perspective

• The last sigmoid layer of a neural network (deep or otherwise) for binary classification is logistic regression.

• Previous layers of a deep neural network learn an internal representation of inputs, i.e. perform automatic feature engineering.

• Thus, most efforts to engineer successful, modern CTR prediction systems focus on layered feature engineering using:
  – Hashing tricks
  – Features engineered with GBDTs, FFMs, and deep neural networks (DNNs), or a layered/ensembled combination thereof

• Future: Additional automated feature representation learning with deep neural networks

CTRP Conclusions

• To get prediction performance quickly and easily, hash data to binary features and apply logistic regression.

• For a few additional percentage points of accuracy, dig into Kaggle forums and the latest industry papers for a variety of means to engineer features most helpful to CTR prediction. We’ve surveyed a number here.

• Knowledge is power. (data → predictions)

• Priority of effort: data > feature engineering > learning/regression algorithms.

Next Steps

• Interested in learning more about Data Science and Machine Learning?

– Create a Kaggle Account

– Enter a learning competition, e.g. “Titanic: Machine Learning from Disaster”

– Take related tutorials, learn from kernels and discussions, steadily work to improve your skills, and share it forward
