Florian Douetteau @ Dataiku
[Slide 1]
write your own data story!
[Slide 2]
Using historical logs of a search engine (QUERIES, RESULTS, CLICKS) and a set of new QUERIES and RESULTS, rerank the RESULTS in order to optimize relevance.

Personalized Web Search, Fri 11 Oct 2013 – Fri 10 Jan 2014
194 teams, $9,000 cash prize
34,573,630 sessions with user id | 21,073,569 queries | 64,693,054 clicks | ~15 GB
[Slide 3]
A METRIC FOR RELEVANCE RIGHT FROM THE LOG? ASSUMING WE SEARCH FOR "FRENCH NEWSPAPER", WE TAKE A LOOK AT THE LOGS.
[Slide 4]
WE COMPUTE THE SO-CALLED DWELL TIME OF A CLICK, I.E. THE TIME ELAPSED BEFORE THE NEXT ACTION
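A minimal sketch of deriving dwell time from a session log, assuming each action carries a timestamp (field names are illustrative). The relevance thresholds (50 and 400 time units, last click of the session counted as satisfied) are the rules commonly cited for this contest; treat them as an assumption.

```python
def dwell_times(actions):
    """Dwell time of each click = time elapsed before the next action.
    `actions` is a list of (timestamp, kind, url) tuples sorted by
    timestamp; kind is 'Q' (query) or 'C' (click)."""
    out = []
    for i, (t, kind, url) in enumerate(actions):
        if kind != 'C':
            continue
        is_last = (i == len(actions) - 1)
        dwell = None if is_last else actions[i + 1][0] - t
        out.append((url, dwell, is_last))
    return out

def relevance(dwell, is_last):
    # Assumed contest-style labels: 2 for a long dwell or the session's
    # last click, 1 for a medium dwell, 0 otherwise.
    if is_last or (dwell is not None and dwell >= 400):
        return 2
    if dwell is not None and dwell >= 50:
        return 1
    return 0
```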
[Slide 5]
DWELL TIME HAS BEEN SHOWN TO BE CORRELATED WITH RELEVANCE
[Slide 6]
GOOD, WE HAVE A MEASURE OF RELEVANCE! CAN WE GET AN OVERALL SCORE FOR OUR SEARCH ENGINE NOW?
[Slide 7]
Emphasis on relevant documents
Discount per rank
Discounted Cumulative Gain (DCG)
Just normalize between 0 and 1
Normalized Discounted Cumulative Gain (NDCG)
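The two formulas above fit in a few lines; this sketch assumes the standard 2^rel − 1 gain and log2(rank + 1) discount.

```python
import math

def dcg(relevances):
    # Discounted Cumulative Gain: emphasize relevant documents,
    # discount each one by its display rank.
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances))

def ndcg(relevances):
    # Normalize by the DCG of the ideal (sorted) ordering,
    # so the score lies between 0 and 1.
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

For example, `ndcg([2, 1, 0])` is 1.0 (already ideally ordered), while `ndcg([0, 1, 2])` is strictly below 1.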
[Slide 8]
PERSONALIZED RERANKING IS ABOUT REORDERING THE N-BEST RESULTS BASED ON THE USER'S PAST SEARCH HISTORY
Results obtained in the contest:
Original NDCG: 0.79056
Reranked NDCG: 0.80714

Equivalent to:
~ raising a relevant (relevance = 2) result from rank #6 to rank #5 on every query, or
~ raising a relevant (relevance = 2) result from rank #6 to rank #2 in 20% of queries
[Slide 9]
The Team
No researcher. No experience in reranking. Not much experience in ML for most of us. Not exactly our job. No expectations.

- Kenji Lefevre (37): algebraic geometry, learning Python
- Christophe Bourguignat (37): signal processing engineer, learning scikit-learn
- Mathieu Scordia (24): data scientist
- Paul Masurel (33): software engineer
[Slide 10]
A-Team?
[Slide 11]
Data Hobbits
[Slide 12]
Understanding The Problem
[Slide 13]
53% OF THE COMPETITORS COULD NOT IMPROVE ON THE BASELINE (worse: 53%, better: 47%)
[Slide 14]
IDEAL SETUP
1. Compute non-personalized rank
2. Select the 10 best hits and serve them in order
3. Re-rank using log analysis
4. Put the new ranking algorithm in prod (yeah right!)
5. Compute NDCG on the new logs
6. …
7. Profit!!
[Slide 15]
REAL SETUP
1. Compute non-personalized rank
2. Select the 10 best hits
3. Serve the 10 best hits, ranked in random order
4. Re-rank using log analysis, including the non-personalized rank as a feature
5. Compute the score against the log with the former rank

(IDEAL setup repeated alongside, for comparison)
[Slide 16]
PROBLEM
Users tend to click on the first few URLs, so the user-satisfaction metric is influenced by the display rank. Our score is not aligned with our goal: we cannot separate the signal of the non-personalized rank from the effect of the display rank.
[Slide 17]
THIS PROMOTES AN OVER-CONSERVATIVE RE-RANKING POLICY
Even if we knew for sure that the URL at rank 9 would be clicked by the user if it were presented at rank 1, it would probably be a bad idea to re-rank it to rank 1 in this contest.
(Chart: average per session of the max position jump)
[Slide 18]
Simple, pointwise approach
For each (URL, session) pair, predict a relevance label (0, 1, or 2).
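Once each (URL, session) pair has a predicted relevance, reranking is just a sort; breaking ties with the original non-personalized rank keeps the policy conservative. A hypothetical sketch (names are illustrative):

```python
def rerank(urls, predicted, base_rank):
    """Order urls by predicted relevance (descending); break ties
    with the original non-personalized rank (ascending)."""
    return sorted(urls, key=lambda u: (-predicted[u], base_rank[u]))
```

For example, a URL with predicted relevance 2 moves ahead of equally scored URLs only because of its label, never past a better base rank among ties.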
[Slide 19]
Supervised Learning on History
We split the 27 days of the train dataset into 24 days (history) + 3 days (annotated). Within the last 3 days, we stop at a random "test" session (as Yandex does).
(Diagram: Train Set (24 days of history) | Train Set (annotation) | Test Set)
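A sketch of that split, assuming each session carries a day index 0–26 and a list of queries (field names are illustrative): days 0–23 feed feature construction, days 24–26 are annotated, and within each annotated session we stop at a random query to mimic Yandex's hidden test sessions.

```python
import random

def split_sessions(sessions, history_days=24):
    """sessions: list of dicts with 'day' and 'queries' keys."""
    history, annotated = [], []
    for s in sessions:
        if s['day'] < history_days:
            history.append(s)
        else:
            # Stop at a random query inside the session, as Yandex
            # does for its withheld test sessions.
            cut = random.randrange(len(s['queries']))
            annotated.append({**s, 'test_query': s['queries'][cut]})
    return history, annotated
```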
[Slide 20]
How They Did It
[Slide 21]
Feature construction: team members work independently
Split Train & Validation
[Slide 22]
FEATURES
• The existing rank (base rank)
• Revisit (Query-(User)-URL) features and variants
• Query features
• Cumulative features
• User click-habit features
• Collaborative filtering features
• Seasonality features
[Slide 23]
REVISITS
In the past, when the user was shown this URL with the exact same query, what is the probability that:
• satisfaction = 2
• satisfaction = 1
• satisfaction = 0
• miss (not clicked)
• skipped (after the last click)
→ 5 conditional-probability features

Plus:
1 overall display counter
4 mean reciprocal ranks (a kind of harmonic mean of the rank)
1 snippet quality score (a twisted formula used to compute snippet quality)
= 11 base features
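The five conditional-probability features can be read off simple counters; additive smoothing (see the blog post in the references) keeps low-count (user, query, URL) triples from saturating at 0 or 1. A sketch with illustrative names and an assumed uniform prior:

```python
from collections import Counter

OUTCOMES = ['sat2', 'sat1', 'sat0', 'miss', 'skip']

def revisit_features(history, user, query, url, alpha=1.0):
    """history: list of (user, query, url, outcome) tuples, outcome
    in OUTCOMES. Returns additively smoothed P(outcome | user, query, url)."""
    counts = Counter(o for (u, q, r, o) in history
                     if (u, q, r) == (user, query, url))
    total = sum(counts.values()) + alpha * len(OUTCOMES)
    return {o: (counts[o] + alpha) / total for o in OUTCOMES}
```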
[Slide 24]
MANY VARIATIONS
• (in the past | within the same session)
• (with this very query | whatever query | a subquery | a super-query)
• and was offered (this URL | this domain)
→ 12 variants with the same user

Without being the same user (URL-query features):
• same domain
• same URL
• same query and same URL
→ 3 variants

15 variants × 11 base features = 165 features
[Slide 25]
Feature construction: team members work independently
Learning: team members work independently
Split train & validation
> 200 potential features over 30 days
Labelled 30-day dataset
[Slide 26]
Short Story
Pointwise, Random Forest, 30 features: 4th place (*). Optimized & trained in ~1 hour (12 cores), 24 trees.
Listwise, LambdaMART, 90 features: 1st place (*). Trained in 2 days, 1,135 trees.
(*) A Yandex "PaceMaker" team also displayed results on the leaderboard and held first place throughout the competition, even though it was not an official contestant.
[Slide 27]
(Illustration: original ranking, 13 errors; re-ranked, 11 errors; high-quality vs low-quality hits)
LambdaMART: gradient boosted trees with a special gradient, the "LambdaRank" gradient (in place of the RankNet gradient).
See: From RankNet to LambdaRank to LambdaMART: An Overview. Christopher J.C. Burges, Microsoft Research Technical Report MSR-TR-2010-82.
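For reference, the "lambda" that stands in for a true gradient is, following the Burges report cited above: for a document pair (i, j) where i is more relevant than j, with model scores s_i, s_j and shape parameter σ,

```latex
\lambda_{ij} = \frac{-\sigma}{1 + e^{\sigma (s_i - s_j)}}\,\bigl|\Delta \mathrm{NDCG}_{ij}\bigr|,
\qquad
\lambda_i = \sum_{j:(i,j)\in I} \lambda_{ij} \;-\; \sum_{j:(j,i)\in I} \lambda_{ij}
```

where I is the set of ordered pairs with i more relevant than j, and |ΔNDCG_ij| is the change in NDCG from swapping i and j. LambdaMART fits boosted trees to these λ_i values.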
[Slide 28]
Grid Search
We are not doing typical classification here: it is extremely important to perform the grid search directly against the final NDCG score.
NDCG "conservatism" ends up favoring a large "min samples per leaf" (between 40 and 80).
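A generic sketch of grid search directly against a custom score; `score_fn` is a stand-in (an assumption, not the team's code) for training the model with the given parameters and returning its mean per-session NDCG on a validation set.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every parameter combination and keep the one with the best
    score (e.g. mean per-session NDCG on held-out sessions)."""
    best_params, best_score = None, float('-inf')
    keys = sorted(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

For example, `grid_search({'min_samples_leaf': [20, 40, 80]}, score_fn)` returns the leaf size whose validation NDCG is highest.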
[Slide 29]
Feature Selection
Top-down approach: starting from a high number of features, iteratively remove subsets of features. This led to the subset of 90 features of the winning LambdaMART solution. (A similar strategy is now implemented by sklearn.feature_selection.RFECV.)
Bottom-up approach: starting from a low number of features, add the features that produce the best marginal improvement. This gave the 30 features that led to the best pointwise solution.
[Slide 30]
Take Away
• Set up a valid and solid cross-validation scheme
• Prototype with fast ML methods, optimize with boosting
• Be systematic about feature selection
• Set up reproducible workflows early on
• Split tasks when running as a team
[Slide 31]
Special Offer
We offer a free server (with DSS) to teams competing in Kaggle competitions.
Conditions:
- Be at least 3 people
- Up to three teams max sponsored per competition
FLORIAN DOUETTEAU, [email protected]
[Slide 32]
References

Contest URL:
https://www.kaggle.com/c/yandex-personalized-web-search-challenge

These slides:
http://www.slideshare.net/Dataiku

Blog posts about the solution:
http://blog.kaggle.com/2014/02/06/winning-personalized-web-search-team-dataiku/
http://www.dataiku.com/blog/2014/01/14/winning-kaggle.html

Paper with a detailed description:
http://research.microsoft.com/en-us/um/people/nickcr/wscd2014/papers/wscdchallenge2014dataiku.pdf

Research papers:
P. Li, C. J. C. Burges, and Q. Wu. McRank: Learning to Rank Using Multiple Classification and Gradient Boosting. In NIPS, 2007.
Christopher J.C. Burges. From RankNet to LambdaRank to LambdaMART: An Overview. Microsoft Research Technical Report MSR-TR-2010-82.

Tools and posts:
RankLib (implementation of LambdaMART): http://sourceforge.net/p/lemur/wiki/RankLib/
Blog post about additive smoothing: http://fumicoton.com/posts/bayesian_rating