Dataiku at SF Data Mining Meetup - Kaggle Yandex Challenge
DESCRIPTION
This is a presentation made on 13 August 2014 at the SF Data Mining Meetup at Trulia. It's about Dataiku and the Kaggle Personalized Web Search Ranking challenge sponsored by Yandex.
TRANSCRIPT
write your own data story!
short story
Founded January 2013
January 2014: A Data Science Studio powered team wins a Challenge
Data Science Studio GA: February 2014
15 People Now
July 2014: Data Science Studio available for free with a Community Edition!
BI
Developer
Data Preparation
Build Algorithm
Build Application
Run Application
Business Analyst
Data Scientist
"I don't want to be a data cleaner anymore"
"Finding Leaks in my Data Pipelines"
"Waiting for the (gradient boosted) trees to grow"
MPP Databases
Statistical Software Machine Learning
No-SQL Hadoop
Demo Time
Challenge
Using historical logs of a search engine (QUERIES, RESULTS, CLICKS) and a set of new QUERIES and RESULTS, rerank the RESULTS in order to optimize relevance.
Personalized Web Search (Yandex)
Fri 11 Oct 2013 – Fri 10 Jan 2014
194 Teams
$9,000 cash prize
No researcher. No experience in reranking.
Not much experience in ML for most of us. Not exactly our job. No expectations.
Kenji Lefevre 37
Algebraic Geometry, Learning Python
Christophe Bourguignat 37
Signal Processing Eng. Learning Scikit
Mathieu Scordia 24
Data Scientist
Paul Masurel 33
Soft. Engineer
The Team
A-Team?
"HOBBITS"
YANDEX SUPPLIED 27 DAYS OF ANONYMIZED LOGS
Challenge Data
34,573,630 Sessions with user id 21,073,569 Queries 64,693,054 Clicks
~ 15GB
Example
Relevance?
A METRIC FOR RELEVANCE RIGHT FROM THE LOG? ASSUMING WE SEARCH FOR "FRENCH NEWSPAPER", WE TAKE A LOOK AT THE LOGS.
WE COMPUTE THE SO-CALLED DWELL TIME OF A CLICK, I.E. THE TIME ELAPSED BEFORE THE NEXT ACTION
DWELL TIME
DWELL TIME HAS BEEN SHOWN TO BE CORRELATED WITH THE RELEVANCE
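To make the definition concrete, here is a minimal sketch (not the team's actual code) of how dwell time can be derived from a session log; the tuple layout and field names are illustrative assumptions.

```python
def dwell_times(actions):
    """Map each clicked URL to its dwell time, i.e. the time until the next action.

    `actions` is assumed to be a list of (timestamp, action_type, url) tuples,
    already sorted by timestamp within a single session.
    """
    dwell = {}
    for (ts, kind, url), (next_ts, _, _) in zip(actions, actions[1:]):
        if kind == "click":
            dwell[url] = next_ts - ts
    return dwell

# Toy session: the click on url_b is followed 40 time units later by a new query.
session = [(0, "query", None), (5, "click", "url_a"), (12, "click", "url_b"), (52, "query", None)]
print(dwell_times(session))  # {'url_a': 7, 'url_b': 40}
```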
GOOD, WE HAVE A MEASURE OF RELEVANCE! CAN WE GET AN OVERALL SCORE FOR OUR SEARCH ENGINE NOW?
Emphasis on relevant documents
Discount per ranking
Discounted Cumulative Gain
Normalized Discounted Cumulative Gain
Just Normalize Between 0 and 1
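A hedged sketch of the DCG / NDCG computation, with the usual "2^rel - 1" gain and logarithmic rank discount; the exact gain and discount constants used by the contest metric are assumptions here.

```python
import math

def dcg(relevances):
    """Discounted Cumulative Gain: emphasize relevant documents, discount by rank."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """Normalize DCG by the DCG of the ideal (sorted) ordering, so scores lie in [0, 1]."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 1.0

print(ndcg([2, 1, 0]))  # 1.0, the ideal ordering
print(ndcg([0, 1, 2]))  # ~0.59, a relevant document demoted to rank 3
```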
PERSONALIZED RERANKING IS ABOUT REORDERING THE N-BEST RESULTS BASED ON THE USER'S PAST SEARCH HISTORY
Results obtained in the contest:
Original NDCG: 0.79056
Re-ranked NDCG: 0.80714
Equivalent to:
~ Raising the rank of a relevant (relevance = 2) result from rank #6 to rank #5 on each query
~ Raising the rank of a relevant (relevance = 2) result from rank #6 to rank #2 in 20% of the queries
How they did it
Simple, point-wise approach
[Diagram: Session 1, Session 2, … with per-URL relevance labels 0, 1, 2]
For each (URL, Session) predict relevance (0,1 or 2)
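As an illustration of this point-wise formulation, here is a minimal scikit-learn sketch (synthetic data and an arbitrary feature count; not the team's actual pipeline): one row per (URL, session) pair, and a classifier that outputs per-class probabilities.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X_train = rng.rand(1000, 10)            # one row per (URL, session), 10 toy features
y_train = rng.randint(0, 3, size=1000)  # relevance grades 0 / 1 / 2
X_test = rng.rand(10, 10)               # the 10 results of one test session

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)       # columns: P(rel=0), P(rel=1), P(rel=2)
```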
Supervised Learning on History
We split the 27 days of the train dataset into 24 days (history) + 3 days (annotated).
We stop randomly in the last 3 days at a "test" session (like Yandex does).
Train Set (24 days of history) | Train Set (annotation) | Test Set
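A rough sketch of that split, under the assumption that each session carries a day index and a user id (field names are made up for the example): days 1-24 feed the history features, and a randomly chosen session per user in days 25-27 plays the role of the hidden test session.

```python
import numpy as np

rng = np.random.RandomState(42)
sessions = [{"id": i, "user": i % 5, "day": rng.randint(1, 28)} for i in range(100)]

history   = [s for s in sessions if s["day"] <= 24]   # used to build features
last_days = [s for s in sessions if s["day"] >= 25]   # labelled sessions

test_ids = set()
for user in {s["user"] for s in last_days}:
    user_sessions = [s["id"] for s in last_days if s["user"] == user]
    test_ids.add(int(rng.choice(user_sessions)))      # one held-out session per user

annotation = [s for s in last_days if s["id"] not in test_ids]  # training labels
test_set   = [s for s in last_days if s["id"] in test_ids]      # "test" sessions
```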
Working collaboratively with an ML workflow
Feature construction: team members work independently
Learning: team members work independently
Split Train & Validation
Features on 30 days
Labelled 30 days data
REGRESSION or CLASSIFICATION?
Regression: we keep the hierarchy between the classes, but optimizing NDCG is cookery.
Classification: we lose the hierarchy, but we can optimize the NDCG (more on that later).
According to P. Li, C. J. C. Burges, and Q. Wu, "McRank: Learning to rank using multiple classification and gradient boosting" (NIPS 2007), classification outperforms regression.
Compute the probabilities P(relevance = X).
Build a sorted list: order by decreasing P(relevance = 1) + 3 * P(relevance = 2).
P. Li, C. J. C. Burges, and Q. Wu (McRank, NIPS 2007) get slightly better results with linear weighting.
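A sketch of the re-ranking step itself, using the linear weighting quoted above (the probability values and URL names are illustrative):

```python
import numpy as np

proba = np.array([[0.7, 0.2, 0.1],     # P(rel=0), P(rel=1), P(rel=2) per candidate URL
                  [0.1, 0.3, 0.6],
                  [0.4, 0.5, 0.1]])
urls = ["url_a", "url_b", "url_c"]

scores = proba[:, 1] + 3 * proba[:, 2]             # P(rel=1) + 3 * P(rel=2)
reranked = [u for _, u in sorted(zip(scores, urls), reverse=True)]
print(reranked)  # ['url_b', 'url_c', 'url_a']
```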
Features
RANK AS A FEATURE
First of all, the rank. In this contest, the rank is both:
• THE DISPLAY RANK: the rank that has been displayed to the user
• THE NON-PERSONALIZED RANK: the rank that is computed by Yandex using PageRank, non-personalized log analysis(?), TF-IDF, machine learning, etc.
Digression
THE PROBLEM WITH RERANKING
53% OF THE COMPETITORS COULD NOT IMPROVE THE BASELINE
Worse 53%
Better 47%
IDEAL:
1. Compute non-personalized rank
2. Select the 10 best hits and serve them in order
3. Re-rank using log analysis
4. Put the new ranking algorithm in prod (yeah right!)
5. Compute NDCG on the new logs
6. …
7. Profit!!
REAL:
1. Compute non-personalized rank
2. Select the 10 best hits
3. Serve the 10 best hits ranked in random order
4. Re-rank using log analysis, including the non-personalized rank as a feature
5. Compute the score against the log with the former rank
PROBLEM
Users tend to click on the first few URLs. The user satisfaction metric is influenced by the display rank. Our score is not aligned with our goal.
We cannot discriminate the effect of the signal of the non-personalized rank from the effect of the display rank.
THIS PROMOTES AN OVER-CONSERVATIVE RE-RANKING POLICY
Even if we know for sure that the URL at rank 9 would be clicked by the user if it were presented at rank 1, it would probably be a bad idea to rerank it to rank 1 in this contest.
Average per session of the max position jump
end digression
FEATURES
• Revisits (Query-(User)-URL) features and variants
• Query features
• Cumulative features
• User click habits
• Collaborative filtering
• Seasonality
REVISITS
In the past, when the user was displayed this URL with the exact same query, what is the probability that:
• satisfaction=2 • satisfaction=1 • satisfaction=0 • miss (not-clicked) • skipped (after the last click)
5 Conditional Probability Features
Plus:
• 1 overall counter of displays
• 4 mean reciprocal ranks (kind of the harmonic mean of the rank)
• 1 snippet quality score (twisted formula used to compute snippet quality)
= 11 Base Features
MANY VARIATIONS (2 x 3 x 2)
• (in the past | within the same session)
• (with this very query | whatever query | a subquery | a super query)
• and was offered (this URL | this domain)
= 12 variants
With the same user
Without being the same user (URL-query features):
• Same Domain
• Same URL
• Same Query and Same URL
= 3 variants
15 variants x 11 base features = 165 features
ADDITIVE SMOOTHING
http://fumicoton.com/posts/bayesian_rating
• Book A: 1 rating of 5. Average rating of 5.
• Book B: 50 ratings. Average rating of 4.5.
In our case, to evaluate the probability that a (URL | query) should have a label l under predicate P:
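The exact smoothing formula is not reproduced in this transcript, so the sketch below assumes the standard additive-smoothing form described in the linked blog post: the raw ratio is shrunk toward a prior by a pseudo-count alpha.

```python
def smoothed_probability(k, n, prior, alpha=10.0):
    """Estimate P(label = l | predicate P) from k positive cases out of n observations,
    pulled toward `prior` with a pseudo-count of `alpha` observations."""
    return (k + alpha * prior) / (n + alpha)

# A URL displayed once and clicked once does not get probability 1.0:
print(smoothed_probability(k=1, n=1, prior=0.3))     # ~0.36
# With many observations the data dominates the prior:
print(smoothed_probability(k=80, n=100, prior=0.3))  # ~0.75
```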
CUMULATIVE FEATURES
Aggregate the features of the URLs above in the ranking list.
Rationale: if a URL above is likely to be clicked, those below are likely to be missed.
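A small illustration of the idea, using a made-up per-rank click-probability feature and two possible aggregations (sum and max) over the URLs ranked above:

```python
import numpy as np

p_click = np.array([0.6, 0.1, 0.3, 0.05])  # illustrative feature for ranks 1..4

cum_sum_above = np.concatenate(([0.0], np.cumsum(p_click)[:-1]))
cum_max_above = np.concatenate(([0.0], np.maximum.accumulate(p_click)[:-1]))

print(cum_sum_above)  # [0.   0.6  0.7  1.  ]
print(cum_max_above)  # [0.   0.6  0.6  0.6 ]
```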
QUERY FEATURES
How complex and ambiguous is a query?
• Click entropy
• Number of times it has been queried
• Number of terms
• Average position within a session
• Average number of occurrences in a session
• MRR of its clicks
USER FEATURES
What are the user's habits?
• Click entropy
• User click rank counters: rank {1, 2} clicks, rank {3, 4, 5} clicks, rank {6, 7, 8, 9, 10} clicks
• Average number of terms
• Average number of different terms in a session
• Total number of queries issued by the user
SEASONALITY
What day is Monday?
COLLABORATIVE FILTERING (ATTEMPT)
User / Domain interaction matrix. FunkSVD algorithm.
Simon Funk: http://sifter.org/~simon/journal/20061211.html
Cython implementation: https://github.com/commonsense/divisi/blob/master/svdlib/_svdlib.pyx
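For reference, a toy SGD version of FunkSVD-style factorization of the user x domain interaction matrix (illustrative only, not the Cython implementation linked above; dimensions, values and learning rates are arbitrary):

```python
import numpy as np

def funk_svd(interactions, n_users, n_domains, k=5, lr=0.01, reg=0.02, epochs=50):
    """Factorize observed (user, domain, value) cells into U (users x k) and V (domains x k)."""
    rng = np.random.RandomState(0)
    U = rng.normal(scale=0.1, size=(n_users, k))
    V = rng.normal(scale=0.1, size=(n_domains, k))
    for _ in range(epochs):
        for u, d, value in interactions:
            err = value - U[u].dot(V[d])
            u_old = U[u].copy()
            U[u] += lr * (err * V[d] - reg * U[u])
            V[d] += lr * (err * u_old - reg * V[d])
    return U, V

interactions = [(0, 0, 3.0), (0, 1, 1.0), (1, 0, 2.0), (2, 2, 4.0)]
U, V = funk_svd(interactions, n_users=3, n_domains=3)
print(U[0].dot(V[1]))  # predicted interaction strength for user 0 / domain 1
```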
Marginal increase of 5·10^-5 in the NDCG!
Why ?
learning
Short Story
Point-wise, Random Forest, 30 features, 4th place (*)
List-wise, LambdaMART, 90 features, 1st place (*)
(*) A Yandex "pacemaker" team was also displaying results on the leaderboard and was in first place during the whole competition, even though it was not officially a contestant.
Trained in 2 days, 1135 Trees
Optimize & Train in ~ 1 hour (12 cores), 24 trees
Lambda Mart
From RankNet to LambdaRank to LambdaMART: An Overview
Christopher J.C. Burges
Microsoft Research Technical Report MSR-TR-2010-82
LambdaMART = LambdaRank + MART
LambdaRank
[Figure from the Burges overview: original ranking (13 pairwise errors) vs re-ranked (11 errors), contrasting high-quality and low-quality hits and the RankNet gradient vs the LambdaRank "gradient".]
From RankNet to LambdaRank to LambdaMART: An Overview
Christopher J.C. Burges - Microsoft Research Technical Report MSR-TR-2010-82
Grid Search
We are not doing typical classification here. It is extremely important to perform grid search directly against the final NDCG score.
NDCG "conservatism" ends up with a large "min samples per leaf" (between 40 and 80).
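A hedged sketch of what "grid search against NDCG" can look like, with synthetic data, a made-up session structure, and the linear re-ranking score from earlier; this is not the actual contest code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def session_ndcg(relevances):
    dcg = lambda r: sum((2 ** v - 1) / np.log2(i + 2) for i, v in enumerate(r))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 1.0

rng = np.random.RandomState(0)
X_train, y_train = rng.rand(500, 5), rng.randint(0, 3, 500)
X_valid, y_valid = rng.rand(100, 5), rng.randint(0, 3, 100)  # 10 sessions of 10 results

best = (-1.0, None)
for leaf in [20, 40, 60, 80]:                                # the range quoted above
    clf = RandomForestClassifier(n_estimators=50, min_samples_leaf=leaf,
                                 random_state=0, n_jobs=-1).fit(X_train, y_train)
    proba = clf.predict_proba(X_valid)
    scores = proba[:, 1] + 3 * proba[:, 2]                   # re-ranking score
    ndcgs = []
    for s in range(10):                                      # score each validation session
        idx = np.arange(s * 10, (s + 1) * 10)
        order = idx[np.argsort(-scores[idx])]
        ndcgs.append(session_ndcg(y_valid[order]))
    if np.mean(ndcgs) > best[0]:
        best = (float(np.mean(ndcgs)), leaf)
print(best)  # best mean NDCG and the corresponding min_samples_leaf
```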
Feature Selection
Top-down approach: starting from a high number of features, iteratively remove subsets of features. This approach led to the subset of 90 features of the winning LambdaMART solution.
(A similar strategy is now implemented by sklearn.feature_selection.RFECV.)
Bottom-up approach: starting from a low number of features, add the features that produce the best marginal improvement. This gave the 30 features that led to the best solution with the point-wise approach.
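As noted above, the top-down strategy resembles what scikit-learn now ships as RFECV; a minimal usage sketch, with synthetic data standing in for the 165 contest features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Recursively drop the least important features; keep the subset that cross-validates best.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
selector = RFECV(RandomForestClassifier(n_estimators=50, random_state=0), step=2, cv=3)
selector.fit(X, y)

print(selector.n_features_)  # size of the retained feature subset
print(selector.support_)     # boolean mask of the kept features
```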
Top Features
References
RankLib (implementation of LambdaMART): http://sourceforge.net/p/lemur/wiki/RankLib/
These slides: http://www.slideshare.net/Dataiku
P. Li, C. J. C. Burges, and Q. Wu. McRank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2007.
Christopher J.C. Burges. From RankNet to LambdaRank to LambdaMART: An Overview. Microsoft Research Technical Report MSR-TR-2010-82.
Blog post about additive smoothing: http://fumicoton.com/posts/bayesian_rating
Blog posts about the solution:
http://blog.kaggle.com/2014/02/06/winning-personalized-web-search-team-dataiku/
http://www.dataiku.com/blog/2014/01/14/winning-kaggle.html
Paper with detailed description:
http://research.microsoft.com/en-us/um/people/nickcr/wscd2014/papers/wscdchallenge2014dataiku.pdf
Contest URL:
https://www.kaggle.com/c/yandex-personalized-web-search-challenge
Research Papers
References
Random Thoughts
Dependency analysis and comparing rank with predicted "relevance" could help determine general cases where the existing engine is not relevant enough. How does it compare to a pure statistical approach?
Applying the personalisation technique this way might not be practical because of the amount of live information about users (each query, each click) that would have to be maintained in real time to perform actionable predictions. How could a machine learning challenge enforce this kind of constraint?
Is data science a science, a sport or a hobby? Newcomers can discover a field, improve existing results, and seemingly obtain incrementally more effective results, with little plateau effect. Are we just at the very beginning, the non-industrial era, of this discipline?