![Page 1: What does it take to win the Kaggle/Yandex competition](https://reader037.vdocument.in/reader037/viewer/2022103111/54c6e84d4a79590e788b45df/html5/thumbnails/1.jpg)
WHAT DOES IT TAKE TO WIN THE KAGGLE/YANDEX COMPETITION
Christophe Bourguignat
Kenji Lefèvre-Hasegawa
Paul Masurel @Dataiku
Matthieu Scordia @Dataiku
![Page 2: What does it take to win the Kaggle/Yandex competition](https://reader037.vdocument.in/reader037/viewer/2022103111/54c6e84d4a79590e788b45df/html5/thumbnails/2.jpg)
OUTLINE OF THE TALK
• Review of the Kaggle/Yandex Challenge
• How we worked (team work & tools)
• The winning model
![Page 3: What does it take to win the Kaggle/Yandex competition](https://reader037.vdocument.in/reader037/viewer/2022103111/54c6e84d4a79590e788b45df/html5/thumbnails/3.jpg)
GOAL
Re-rank the URLs returned by Yandex according to each user's personal preferences
[Figure: the original result list (url1, url2, url3, url4) re-ranked for the user as (url3, url2, url1, url4)]
ML CHALLENGE
Predict how relevant each URL is to the user and re-rank the result set accordingly
The Kaggle/Yandex challenge
![Page 4: What does it take to win the Kaggle/Yandex competition](https://reader037.vdocument.in/reader037/viewer/2022103111/54c6e84d4a79590e788b45df/html5/thumbnails/4.jpg)
GIVEN
• 30 days of logs (test: 3 days, train: 27 days)
• Users' historical queries, clicks & dwell-times
• Test sessions' prior activity: queries, clicks & dwell-times
SIZE
• 15 GB
Test session: Q Q T ?
Q Q Q Q
![Page 5: What does it take to win the Kaggle/Yandex competition](https://reader037.vdocument.in/reader037/viewer/2022103111/54c6e84d4a79590e788b45df/html5/thumbnails/5.jpg)
QUALITY METRIC
• One test query per user, taken from the last 3 days
• The NDCG metric penalizes relevance errors most heavily on top-ranked URLs
• No A/B test
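[Figure: three orderings of the same results: (url1, url2, url3, url4), (url3, url2, url1, url4) and (url1, url2, url4, url3). Labels: Kaggle, Prediction, Another ranking; one ranking is marked BAD, the other OK]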
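To make the metric concrete, here is a minimal sketch of NDCG in plain Python. It uses the common exponential-gain form, DCG = Σ (2^rel − 1) / log2(position + 1), with a cutoff of 10; the slides do not state the exact formula, cutoff or relevance scale, so those choices and the example grades are assumptions.

```python
import math

def dcg(relevances, k=10):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum(
        (2 ** rel - 1) / math.log2(position + 2)  # position is 0-based
        for position, rel in enumerate(relevances[:k])
    )

def ndcg(relevances, k=10):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical grades (2 = highly relevant, 1 = relevant, 0 = irrelevant),
# listed in the order the URLs are shown to the user.
ideal_order = [2, 2, 1, 0]
error_at_bottom = [2, 2, 0, 1]  # the two last URLs are swapped
error_at_top = [0, 2, 1, 2]     # an irrelevant URL is ranked first

scores = {
    "ideal": ndcg(ideal_order),
    "error_at_bottom": ndcg(error_at_bottom),
    "error_at_top": ndcg(error_at_top),
}
```

With these grades the mistake at the top of the list costs far more than the one at the bottom, which is the behaviour the slide describes.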
![Page 6: What does it take to win the Kaggle/Yandex competition](https://reader037.vdocument.in/reader037/viewer/2022103111/54c6e84d4a79590e788b45df/html5/thumbnails/6.jpg)
TEAM DATAIKU SCIENCE STUDIO / KAGGLE
• Christophe Bourguignat Engineer, data enthusiast
• Kenji Lefèvre-Hasegawa Ph.D. math, new to ML
• Paul Masurel Software Engineer @dataiku
• Matthieu Scordia Data Scientist @dataiku
First meeting: October 16th, 2013
How we worked (Team work & tools)
![Page 7: What does it take to win the Kaggle/Yandex competition](https://reader037.vdocument.in/reader037/viewer/2022103111/54c6e84d4a79590e788b45df/html5/thumbnails/7.jpg)
WE'VE USED
• Related papers (mainly Microsoft's)
• 12 cores, 64 GB
• Python scikit-learn
• Dataiku Science Studio
• Java RankLib
![Page 8: What does it take to win the Kaggle/Yandex competition](https://reader037.vdocument.in/reader037/viewer/2022103111/54c6e84d4a79590e788b45df/html5/thumbnails/8.jpg)
DATAIKU SCIENCE STUDIO
LEARNING
Team members work independently
[Workflow diagram with nodes: Original train, Split train & validation, Labels, Features, Features & labels]
FEATURES CONSTRUCTION
Team members work independently
DATA DRIVEN COMPUTATION
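The "Split train & validation" step in the workflow can be sketched as a time-based hold-out. The slides do not say how the split was made; holding out the last 3 days of the 27-day training log (to mirror the 3-day test period) is an assumption, as are the `day` and `session_id` field names.

```python
def split_by_day(sessions, validation_days=3):
    """Hold out the sessions of the last `validation_days` days.

    `sessions` is a list of dicts with a hypothetical integer "day" field.
    """
    last_day = max(session["day"] for session in sessions)
    cutoff = last_day - validation_days
    train = [s for s in sessions if s["day"] <= cutoff]
    validation = [s for s in sessions if s["day"] > cutoff]
    return train, validation

# Toy log: one session per day over the 27 training days.
sessions = [{"session_id": i, "day": day} for i, day in enumerate(range(1, 28))]
train, validation = split_by_day(sessions)
```

Splitting on time rather than at random keeps the validation set as close as possible to the way the test set was built.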
![Page 9: What does it take to win the Kaggle/Yandex competition](https://reader037.vdocument.in/reader037/viewer/2022103111/54c6e84d4a79590e788b45df/html5/thumbnails/9.jpg)
HOW MUCH WORK?
• 960+ emails
• 360+ features
• 50+ ideas grid tuned (300+ models fitted)
• Server heavily loaded the last 3 weeks
• 56 Kaggle submissions
• 196 teams, 264 players, 3570 submissions
[Timeline: leaderboard position over four successive periods (1/2 month, 1 week, 1 week, 1 week): Top 25, Top 10, 5th, 1st, 3rd, 1st. Marker at 2014-01-01: future top 2 & 3 enter the race]
![Page 10: What does it take to win the Kaggle/Yandex competition](https://reader037.vdocument.in/reader037/viewer/2022103111/54c6e84d4a79590e788b45df/html5/thumbnails/10.jpg)
PROBLEM ANALYSIS
Query
Result Set
• Rank
• URL snippet quality
• URL is skipped, clicked or missed
Reading URL
• URL & domain relevance with dwell-time
CLICK
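A minimal sketch of how each displayed URL could be labelled as clicked, skipped or missed. The slides do not define these terms; the reading used here (skipped = shown above the last click but not clicked, missed = shown below the last click and presumably not examined) is an assumption, and `label_results` is a hypothetical helper name.

```python
def label_results(displayed_urls, clicked_urls):
    """Label each displayed URL as 'clicked', 'skipped' or 'missed'."""
    clicked = set(clicked_urls)
    clicked_positions = [
        position for position, url in enumerate(displayed_urls) if url in clicked
    ]
    # -1 when nothing was clicked, so every URL ends up 'missed'.
    last_click = max(clicked_positions, default=-1)

    labels = {}
    for position, url in enumerate(displayed_urls):
        if url in clicked:
            labels[url] = "clicked"
        elif position < last_click:
            labels[url] = "skipped"  # seen on the way down to a click
        else:
            labels[url] = "missed"   # below the last click
    return labels

labels = label_results(["url1", "url2", "url3", "url4"], ["url3"])
```

Here url1 and url2 are labelled skipped, url3 clicked and url4 missed.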
The winning model
![Page 11: What does it take to win the Kaggle/Yandex competition](https://reader037.vdocument.in/reader037/viewer/2022103111/54c6e84d4a79590e788b45df/html5/thumbnails/11.jpg)
FEATURES
• Rank
• User habits, query specificity (entropy, frequency, …)
• Snippet relevance
• Missed, Skipped, Clicked
• URL & domain relevance
Variants of Missed, Skipped & Clicked
• Probability, stimuli frequency, Mean Reciprocal Rank (MRR)
• For each user: history, previous activity in the test session, and their aggregate
• For all users
• Computed over all queries and over the same query
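Two of the feature families named above can be sketched in a few lines. Both definitions are assumptions, since the slides only name the features: query entropy is taken as the Shannon entropy of the clicks a query received across URLs, and MRR as the mean of 1/rank over the occasions on which an outcome (for example a click) was observed.

```python
import math
from collections import Counter

def click_entropy(clicked_urls):
    """Shannon entropy (in bits) of the distribution of clicks over URLs.

    Low entropy: users nearly always click the same URL for this query.
    """
    counts = Counter(clicked_urls)
    total = sum(counts.values())
    return -sum(
        (count / total) * math.log2(count / total) for count in counts.values()
    )

def mean_reciprocal_rank(ranks):
    """Mean of 1/rank over the 1-based ranks at which an outcome occurred."""
    return sum(1.0 / rank for rank in ranks) / len(ranks) if ranks else 0.0

# Hypothetical click logs for two queries.
navigational = click_entropy(["url1", "url1", "url1", "url1"])
ambiguous = click_entropy(["url1", "url2", "url1", "url2"])

# Hypothetical ranks at which one user clicked one URL in the past.
mrr = mean_reciprocal_rank([1, 2, 4])
```

The same functions can be applied per user, over all users, over all queries or restricted to the same query, which is one way to obtain the variants listed on the slide.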
![Page 12: What does it take to win the Kaggle/Yandex competition](https://reader037.vdocument.in/reader037/viewer/2022103111/54c6e84d4a79590e788b45df/html5/thumbnails/12.jpg)
MODELS
• Random Forest (predict proba) + maximize E(NDCG)
• LambdaMART (gradient boosted trees optimized for NDCG) WINS!
Kaggle/Yandex Top 1 then 3rd
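The "predict proba + maximize E(NDCG)" idea can be illustrated as follows. Given per-URL class probabilities (such as those a classifier's `predict_proba` would return), sorting URLs by decreasing expected gain maximizes the expected DCG, because the position discount decreases with rank; expected DCG is used here as a proxy for E(NDCG). The three-grade relevance scale, the gain 2^grade − 1 and the probabilities are assumptions, and this is not necessarily the exact procedure the team used.

```python
def expected_gain(class_probabilities):
    """Expected DCG gain of a URL, given P(grade = 0), P(grade = 1), ..."""
    return sum(
        probability * (2 ** grade - 1)
        for grade, probability in enumerate(class_probabilities)
    )

def rerank(urls, probabilities):
    """Order URLs by decreasing expected gain (ties keep the original order)."""
    scored = sorted(
        zip(urls, probabilities),
        key=lambda pair: expected_gain(pair[1]),
        reverse=True,
    )
    return [url for url, _ in scored]

# Hypothetical probabilities for grades 0, 1 and 2, in the original ranking order.
urls = ["url1", "url2", "url3", "url4"]
probabilities = [
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
    [0.9, 0.1, 0.0],
]
new_order = rerank(urls, probabilities)
```

LambdaMART, by contrast, is trained directly against a ranking objective, so its scores can be sorted as they are without this intermediate step.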
![Page 13: What does it take to win the Kaggle/Yandex competition](https://reader037.vdocument.in/reader037/viewer/2022103111/54c6e84d4a79590e788b45df/html5/thumbnails/13.jpg)
QUESTIONS?
Thank you!