stories behind kaggle competitions h2o meetup

stories behind kaggle competitions wendy kan, data scientist [email protected] @wendykan 5/19/2015 @

Upload: wendy-chih-wen-kan

Post on 11-Apr-2017



kaggle runs public machine learning competitions

we work with clients/hosts on many types of problems, with data of different sizes

my job as a data scientist at kaggle

“data science is not just kaggle competitions”

whyyyy???

machine learning processes

● Business Problem
● Collect Data
● Transform Data
● Dataset Splitting
● Evaluation Metric
● Feature Extraction
● Feature Selection
● Model Training
● Model Ensembling
● Methodology Selection
● Production System
● Ongoing Optimization

not every problem can be turned into a kaggle competition

size matters! bigger is better (most of the time)

data cleaning/formatting:

● easy to make a quick submission
● boosts participation
● (too) clean data kills creativity

data privacy/anonymization

metric: how do you measure success?

● Classification - AUC / Logarithmic Loss / Accuracy

● Regression - RMSE/MAE

● Ranking - MAP/NDCG

● Other / Custom

https://www.kaggle.com/wiki/Metrics
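To make the metric names above concrete, here is a minimal sketch of four of them in plain Python. It is an illustration only, not Kaggle's scoring code: the function names, the clipping constant `eps`, and the tie handling in `auc` are assumptions chosen for this example.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood for binary labels (0 or 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log() never sees exactly 0 or 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large misses more than small ones."""
    squared = [(a - b) ** 2 for a, b in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def mae(y_true, y_pred):
    """Mean absolute error: every unit of error counts the same."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    """Chance that a random positive is scored above a random negative.

    Ties count as half a win (an assumption of this sketch). The pairwise
    loop is O(n^2), which is fine for illustration but slow on large data.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    scores = [0.1, 0.4, 0.35, 0.8]
    print("AUC:", auc(labels, scores))
    print("log loss:", log_loss(labels, scores))
    print("RMSE:", rmse([1, 2, 3], [1, 2, 5]))
    print("MAE:", mae([1, 2, 3], [1, 2, 5]))
```

Note how the same predictions can look good under one metric and poor under another: AUC only cares about ordering, while log loss also punishes overconfident probabilities.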

the design of a competition shapes how people are going to solve a problem

Splitting dataset

● training/test
● public/private
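The two splits above can be sketched as follows. This is a hedged illustration, not Kaggle's actual procedure: the function name `split_competition_data` and the 30% fractions are assumptions picked for the example. The idea is that labeled data is first divided into training and test rows, and the test rows are then divided again into a public part (scored on the live leaderboard) and a private part (used for final standings).

```python
import random

def split_competition_data(rows, test_fraction=0.3, public_fraction=0.3, seed=0):
    """Return (train, public_test, private_test) from a list of rows.

    test_fraction and public_fraction are illustrative defaults,
    not values taken from the talk.
    """
    rng = random.Random(seed)      # fixed seed keeps the split reproducible
    shuffled = list(rows)          # copy so the caller's list is untouched
    rng.shuffle(shuffled)

    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]

    n_public = int(len(test) * public_fraction)
    public_test, private_test = test[:n_public], test[n_public:]
    return train, public_test, private_test

if __name__ == "__main__":
    train, public_test, private_test = split_competition_data(list(range(100)))
    print(len(train), len(public_test), len(private_test))
```

Keeping the private part hidden until the end limits how much participants can tune their models to leaderboard feedback.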

Time series data
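The slide gives only the heading here. One common reading, assumed for this sketch, is that time series data should be split by time rather than at random, so that the model never trains on rows that come after the ones it is tested on. The function name `time_based_split` and the `(timestamp, value)` record layout are hypothetical.

```python
from datetime import date

def time_based_split(records, cutoff):
    """Split (timestamp, value) records at a cutoff date.

    Rows strictly before the cutoff go to training; rows at or after it
    go to test, so the test period is always in the "future" of training.
    """
    ordered = sorted(records, key=lambda r: r[0])
    train = [r for r in ordered if r[0] < cutoff]
    test = [r for r in ordered if r[0] >= cutoff]
    return train, test

if __name__ == "__main__":
    records = [
        (date(2015, 1, 1), "a"),
        (date(2015, 3, 1), "c"),
        (date(2015, 2, 1), "b"),
    ]
    train, test = time_based_split(records, cutoff=date(2015, 2, 15))
    print("train:", train)
    print("test:", test)
```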

data leakage

“Deemed ‘one of the top ten data mining mistakes’, leakage is essentially the introduction of information about the data mining target, which should not be legitimately available to mine from”

“the concept of identifying and harnessing leakage has been openly addressed as one of three key aspects for winning data mining competitions”

“Leakage in Data Mining: Formulation, Detection, and Avoidance”, S. Kaufman et al.
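As a rough illustration of what looking for leakage can mean in practice, the sketch below flags any single column that separates the target almost perfectly on its own, for example a row ID that happens to be sorted by label. This is one simple screen assumed for the example, not the method from the paper or from the talk; the function names and the 0.95 threshold are hypothetical.

```python
def single_feature_auc(y_true, values):
    """AUC obtained by using one raw column as the prediction score."""
    pos = [v for y, v in zip(y_true, values) if y == 1]
    neg = [v for y, v in zip(y_true, values) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def flag_suspicious_features(columns, y_true, threshold=0.95):
    """Return {column name: strength} for columns that look too predictive.

    Strength is max(AUC, 1 - AUC), so a column that ranks the target
    perfectly in reverse is flagged too. A flag is a prompt to inspect
    the column by hand, not proof of leakage.
    """
    flagged = {}
    for name, values in columns.items():
        a = single_feature_auc(y_true, values)
        strength = max(a, 1 - a)
        if strength >= threshold:
            flagged[name] = strength
    return flagged

if __name__ == "__main__":
    target = [0, 0, 0, 1, 1, 1]
    columns = {
        "row_id": [1, 2, 3, 4, 5, 6],   # sorted by label: leaks the target
        "noise": [3, 1, 2, 2, 3, 1],    # unrelated to the label
    }
    print(flag_suspicious_features(columns, target))
```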

do you have thousands of people reviewing your performance at work 24/7?

I do.

1. people make mistakes. honesty is the best policy.

2. crowdsourcing is powerful. anything that can go wrong will go wrong.