The Hitchhiker’s Guide to Kaggle


Upload: krishna-sankar

DESCRIPTION

For the OSCON Data 2011 workshop "The Hitchhiker’s Guide to A Kaggle Competition" http://www.oscon.com/oscon2011/public/schedule/detail/20011

TRANSCRIPT

Page 1: The Hitchhiker’s Guide to Kaggle

The Hitchhiker’s Guide to Kaggle

July 27, 2011
[email protected] [doubleclix.wordpress.com]

[email protected]

Page 2: The Hitchhiker’s Guide to Kaggle
Page 3: The Hitchhiker’s Guide to Kaggle

[Overview diagram] Analytics Competitions: Algorithms, Tools, DataSets
The Amateur Data Scientist: CART, randomForest
Old competitions & a competition in-flight: Titanic, Churn, Ford, HHP

Page 4: The Hitchhiker’s Guide to Kaggle

Encounters

1st: This workshop
2nd: Do the hands-on walkthrough (I will post the walkthrough scripts in ~10 days)
3rd: Participate in HHP & other competitions

Page 5: The Hitchhiker’s Guide to Kaggle

Goals Of This Workshop

1. Introduction to analytics competitions from the data, algorithms & tools perspective
2. End-to-end flow of a Kaggle competition – Ford
3. Introduction to the Heritage Health Prize competition
4. Materials for you to explore further: lots more slides, plus a walkthrough I will post in 10 days

Page 6: The Hitchhiker’s Guide to Kaggle

Agenda

- Algorithms for the Amateur Data Scientist [25 min]: algorithms, tools & frameworks in perspective
- The Art of Analytics Competitions [10 min]: the Kaggle challenges
- How the RTA & Ford competitions were won – anatomy of a competition [15 min]: predicting Ford alertness using trees; submit an entry
- Competition in flight – the Heritage Health Prize [30 min]: walkthrough (introduction, dataset organization, analytics walkthrough); submit our entry
- Conclusion [5 min]

Page 7: The Hitchhiker’s Guide to Kaggle

ALGORITHMS FOR THE AMATEUR DATA SCIENTIST

“A towel is about the most massively useful thing an interstellar hitchhiker can have … any man who can hitch the length and breadth of the Galaxy, rough it … win through, and still know where his towel is, is clearly a man to be reckoned with.”

- From The Hitchhiker's Guide to the Galaxy, by Douglas Adams. Published by Harmony Books in 1979.

Algorithms: the most massively useful thing an Amateur Data Scientist can have …

Page 8: The Hitchhiker’s Guide to Kaggle

The Amateur Data Scientist

- I am not a quant or an ML expert
- School of Amazon, Springer & YouTube
- For the rest of us
- References I used (refs also on the respective slides):
  - The Elements of Statistical Learning (a.k.a. ESL II), by Hastie, Tibshirani & Friedman
  - Statistical Learning from a Regression Perspective, by Richard Berk
- As Jeremy says, you can dig into these as needed; you need not be an expert in the R toolbox

Page 9: The Hitchhiker’s Guide to Kaggle

Jeremy's Axioms

- Iteratively explore data
- Tools: Excel format, Perl, Perl book
- Get your head around the data: pivot tables
- Don't over-complicate
- If people give you data, don't assume that you need to use all of it
- Look at pictures!
- Keep a tab on the history of your submissions
- Don't be afraid to submit simple solutions (we will do this during this workshop)

Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/

Page 10: The Hitchhiker’s Guide to Kaggle

Big Data to Smart Data – Summary

1. Don't throw away any data!
2. Be ready for different ways of organizing the data

Page 11: The Hitchhiker’s Guide to Kaggle

Ref: Anthony's Kaggle Presentation

Users apply different techniques:

- Support Vector Machines
- AdaBoost
- Bayesian Networks
- Decision Trees
- Ensemble Methods
- Random Forests
- Logistic Regression
- Genetic Algorithms
- Monte Carlo Methods
- Principal Component Analysis
- Kalman Filters
- Evolutionary Fuzzy Modelling
- Neural Networks

Quora: http://www.quora.com/What-are-the-top-10-data-mining-or-machine-learning-algorithms

Page 12: The Hitchhiker’s Guide to Kaggle

Let us take a 15-minute overview of the algorithms, relevant in the context of this workshop and from the perspective of the datasets we plan to use. More qualitative than mathematical: the aim is to get a feel for the how & the why.

Page 13: The Hitchhiker’s Guide to Kaggle

[Concept-map slide of the algorithm landscape: Classifiers; Linear Regression; Continuous vs. Categorical Variables; Decision Trees; k-NN (Nearest Neighbors); Bias-Variance; Model Complexity & Over-fitting; Boosting; Bagging; CART]

Page 14: The Hitchhiker’s Guide to Kaggle

Titanic Passenger Metadata
- Small
- 3 predictors: Class, Sex, Age
- Outcome: Survived?

http://www.ohgizmo.com/2007/03/21/romain-jerome-titanic
http://www.homestoriesatoz.com/2011/06/blogger-to-wordpress-a-fish-out-of-water.html

Customer Churn
- 17 predictors

Kaggle Competition – Stay Alert Ford Challenge
- Simple dataset
- Competition class

Heritage Health Prize Data
- Complex
- Competition in flight

Page 15: The Hitchhiker’s Guide to Kaggle

Titanic Dataset

- Taken from the passenger manifest
- Good candidate for a Decision Tree
- CART [Classification & Regression Tree]: greedy, top-down, binary, recursive partitioning that divides the feature space into sets of disjoint rectangular regions
- CART in R

Ref: http://www.dabi.temple.edu/~hbling/8590.002/Montillo_RandomForests_4-2-2009.pdf

Page 16: The Hitchhiker’s Guide to Kaggle

Titanic Dataset – R walkthrough: load libraries, load data, model with CART, model with rattle(), plot the tree, discussion (sketch below)
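A minimal R sketch of that walkthrough; the file name titanic.csv and the column names class, sex, age and survived are assumptions for illustration, not the exact workshop files:

    # Load libraries
    library(rpart)     # CART
    library(rattle)    # optional: the rattle() GUI used in the workshop

    # Load data (assumed file and column names)
    titanic <- read.csv("titanic.csv")
    titanic$survived <- factor(titanic$survived)

    # Model CART: greedy, top-down, binary recursive partitioning
    fit <- rpart(survived ~ class + sex + age, data = titanic, method = "class")

    # Tree & discussion
    print(fit)                                   # text form of the splits
    printcp(fit)                                 # complexity table / cross-validated error
    plot(fit, uniform = TRUE); text(fit, use.n = TRUE)
    # rattle()                                   # or explore the same model in the GUI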

[Decision tree diagram: root split on Male?, then Adult? and 3rd class?, with Yes/No leaves for survival]

Page 17: The Hitchhiker’s Guide to Kaggle

CART

[The same decision tree, with the Female and Child branches annotated]

Page 18: The Hitchhiker’s Guide to Kaggle

[Pruned CART tree: splits on Male? and 3rd class?, with the Female branch annotated]

1. Do not over-fit
2. All predictors are not needed
3. All data rows are not needed
4. Tuning the algorithms will give different results

Page 19: The Hitchhiker’s Guide to Kaggle

Churn Data

- Predict customer churn
- Based on predictors such as service calls, v-mail plan and so forth
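A hedged sketch of fitting CART to a churn table in R; churn.csv and its column names are assumptions (the workshop's churn data has 17 predictors), not the exact file used on the slides:

    library(rpart)

    churn <- read.csv("churn.csv")          # assumed: one row per customer, 'churn' outcome
    churn$churn <- factor(churn$churn)

    # Classification tree over all predictors; cp/minsplit keep the tree from over-fitting
    churn_fit <- rpart(churn ~ ., data = churn, method = "class",
                       control = rpart.control(cp = 0.01, minsplit = 20))

    printcp(churn_fit)                      # cross-validated error by tree size
    plot(churn_fit, uniform = TRUE); text(churn_fit, pretty = 0)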

Page 20: The Hitchhiker’s Guide to Kaggle

CART Tree

Page 21: The Hitchhiker’s Guide to Kaggle

Challenges

- Model Complexity: a complex model increases the fit to the training data, but then over-fits and doesn't perform as well on real data
- Bias vs. Variance: the classical diagram, from ESL II by Hastie, Tibshirani & Friedman

[Figure: prediction error and training error as a function of model complexity]

Page 22: The Hitchhiker’s Guide to Kaggle

Solution #1 – Goal: Model Complexity (-), Variance (-), Prediction Accuracy (+)

Partition the data!
- Training (60%), Validation (20%) & "Vault" Test (20%) data sets

k-fold Cross-Validation (sketched below)
- Split the data into k equal parts
- Fit the model to k-1 parts & calculate the prediction error on the k-th part
- Non-overlapping datasets

But the fundamental problem still exists!
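A small base-R sketch of k-fold cross-validation, continuing with the churn data frame and rpart tree from the earlier sketch; k = 5 and the misclassification metric are illustrative choices:

    library(rpart)
    set.seed(42)
    k     <- 5
    folds <- sample(rep(1:k, length.out = nrow(churn)))   # non-overlapping parts

    cv_err <- sapply(1:k, function(i) {
      fit  <- rpart(churn ~ ., data = churn[folds != i, ], method = "class")  # fit on k-1 parts
      pred <- predict(fit, churn[folds == i, ], type = "class")               # predict on part k
      mean(pred != churn$churn[folds == i])                                   # misclassification
    })
    mean(cv_err)   # cross-validated estimate of the prediction error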

Page 23: The Hitchhiker’s Guide to Kaggle

Solution #2 – Goal: Model Complexity (-), Variance (-), Prediction Accuracy (+)

Bootstrap
- Draw datasets (with replacement) and fit the model for each dataset
- Remember: data partitioning (#1) & cross-validation (#2) are without replacement

Bagging (Bootstrap aggregation) – sketch below
- Average the prediction over a collection of bootstrapped samples, thus reducing variance
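Bagging by hand, to make the idea concrete: bootstrap samples drawn with replacement, one tree per sample, predictions combined by majority vote. A sketch only (the ipred and randomForest packages do this properly), reusing the assumed churn table from above:

    set.seed(42)
    B <- 25
    boot_preds <- sapply(1:B, function(b) {
      idx <- sample(nrow(churn), replace = TRUE)                 # bootstrap sample
      fit <- rpart(churn ~ ., data = churn[idx, ], method = "class")
      as.character(predict(fit, churn, type = "class"))          # predict on all rows
    })

    # Aggregate: majority vote across the B bootstrapped trees
    bagged <- apply(boot_preds, 1, function(v) names(which.max(table(v))))
    mean(bagged != as.character(churn$churn))                    # apparent error, for illustration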

Page 24: The Hitchhiker’s Guide to Kaggle

Solution #3 – Goal: Model Complexity (-), Variance (-), Prediction Accuracy (+)

Boosting (sketch below)
- "Output of weak classifiers into a powerful committee"
- Final prediction = weighted majority vote
- Later classifiers get the misclassified points with higher weight, so they are forced to concentrate on them
- AdaBoost (Adaptive Boosting)
- Boosting vs. Bagging: bagging builds independent trees, boosting successively weighted ones
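One way to try boosting in R is the gbm package, whose distribution = "adaboost" option uses the AdaBoost exponential loss; the package choice and every parameter below are illustrative assumptions, not settings from the slides:

    library(gbm)

    # gbm wants a 0/1 numeric outcome for the adaboost loss
    churn_gbm       <- churn
    churn_gbm$churn <- as.numeric(churn$churn == levels(churn$churn)[2])

    boost_fit <- gbm(churn ~ ., data = churn_gbm, distribution = "adaboost",
                     n.trees = 500, interaction.depth = 2, shrinkage = 0.05,
                     cv.folds = 5)

    best_iter <- gbm.perf(boost_fit, method = "cv")   # pick the number of trees by CV
    head(predict(boost_fit, churn_gbm, n.trees = best_iter, type = "response"))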

Page 25: The Hitchhiker’s Guide to Kaggle

Solution #4 – Goal: Model Complexity (-), Variance (-), Prediction Accuracy (+)

Random Forests+
- Builds a large collection of de-correlated trees & averages them
- Improves bagging by selecting i.i.d.* random variables for splitting
- Simpler to train & tune
- "Do remarkably well, with very little tuning required" – ESL II
- Less susceptible to over-fitting (than boosting)
- Many RF implementations: the original version in Fortran 77 by Breiman/Cutler; also R, Mahout, Weka, Milk (an ML toolkit for Python), Matlab

* i.i.d. – independent, identically distributed
+ http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

Page 26: The Hitchhiker’s Guide to Kaggle

Solution – General – Goal: Model Complexity (-), Variance (-), Prediction Accuracy (+)

Ensemble methods
- Two steps: develop a set of learners, then combine the results to develop a composite predictor
- Ensemble methods can take the form of: using different algorithms, using the same algorithm with different settings, or assigning different parts of the dataset to different classifiers
- Bagging & Random Forests are examples of ensemble methods

Ref: Machine Learning In Action

Page 27: The Hitchhiker’s Guide to Kaggle

Random Forests

- While boosting splits based on the best among all variables, RF splits based on the best among randomly chosen variables
- Simpler because it requires only two tuning parameters: the number of predictors tried at each split (typically √k) & the number of trees (500 for a large dataset, 150 for a smaller one)
- Error prediction:
  - For each iteration, predict for the data that is not in the sample (the OOB data)
  - Aggregate the OOB predictions
  - Calculate the prediction error for the aggregate, which is basically the OOB estimate of the error rate
  - Can use this to search for the optimal number of predictors (see the sketch below)
  - We will see how close this is to the actual error in the Heritage Health Prize
- Assumes equal cost for mis-prediction; a cost function can be added
- Proximity matrix & applications such as filling in missing data and dropping outliers

Ref: R News Vol 2/3, Dec 2002
Statistical Learning from a Regression Perspective: Berk
A Brief Overview of RF by Dan Steinberg
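The OOB-based search for the optimal number of predictors per split (mtry) is built into randomForest::tuneRF; a sketch on the assumed churn table from the earlier slides, with illustrative settings (tuneRF needs complete cases):

    library(randomForest)

    cc <- na.omit(churn)                       # tuneRF does not accept missing values
    x  <- cc[, setdiff(names(cc), "churn")]
    y  <- cc$churn

    # Walks mtry up and down from sqrt(k), keeping the value with the lowest OOB error
    tuned <- tuneRF(x, y, ntreeTry = 150, stepFactor = 1.5,
                    improve = 0.01, trace = TRUE, plot = FALSE)
    tuned                                      # matrix of mtry vs. OOB error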

Page 28: The Hitchhiker’s Guide to Kaggle

Lots more to explore (Homework!)

- Loss matrix: e.g. telecom churn – better to spend incentives on false positives (customers who are not actually leaving) than to miss false negatives (customers who are leaving)
- Missing values
- Additive Models
- Bayesian Models
- Gradient Boosting

Ref: http://www.louisaslett.com/Courses/Data_Mining_09-10/ST4003-Lab4-New_Tree_Data_Set_and_Loss_Matrices.pdf

Page 29: The Hitchhiker’s Guide to Kaggle

Churn Data w/ randomForest
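A minimal randomForest fit on the same (assumed) churn table, mirroring this slide's screenshot; ntree = 500 and the importance plot are common defaults rather than settings taken from the deck:

    library(randomForest)

    set.seed(42)
    rf_fit <- randomForest(churn ~ ., data = churn,
                           ntree = 500, importance = TRUE, na.action = na.omit)

    print(rf_fit)        # OOB estimate of the error rate + confusion matrix
    varImpPlot(rf_fit)   # which of the 17 predictors matter most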

Page 30: The Hitchhiker’s Guide to Kaggle

KAGGLE COMPETITIONS

Page 31: The Hitchhiker’s Guide to Kaggle

“I keep saying the sexy job in the next ten years will be statisticians.”

Hal Varian, Google Chief Economist, 2009

Page 32: The Hitchhiker’s Guide to Kaggle
Page 33: The Hitchhiker’s Guide to Kaggle

Mismatch between those with data and those with the skills to analyse it

Crowdsourcing

Page 34: The Hitchhiker’s Guide to Kaggle

Tourism Forecasting Competition

[Chart: forecast error (MASE) over time – the existing model's error at Aug 9, then lower entries 2 weeks later, 1 month later, and at the competition end]

Page 35: The Hitchhiker’s Guide to Kaggle

Chess Ratings Competition

[Chart: error rate (RMSE) over time – the existing model (Elo) at Aug 4, then lower entries 1 month later, 2 months later, and today]

Page 36: The Hitchhiker’s Guide to Kaggle

12,500 “Amateur” Data Scientists with different backgrounds

Page 37: The Hitchhiker’s Guide to Kaggle

[Charts of tool usage – R, Matlab, SAS, WEKA, SPSS, Python, Excel, Mathematica, Stata – shown three ways: on Kaggle, among academics, and among Americans]

Ref: Anthony's Kaggle Presentation

Page 38: The Hitchhiker’s Guide to Kaggle

Successful grant applications: ~25%

NASA tried, now it's our turn.

Mapping Dark Matter is an image-analysis competition whose aim is to encourage the development of new algorithms that can be applied to the challenge of measuring the tiny distortions in galaxy images caused by dark matter.

Page 39: The Hitchhiker’s Guide to Kaggle

“In less than a week, Martin O’Leary, a PhD student in glaciology, outperformed the state-of-the-art algorithms”

“The world’s brightest physicists have been working for decades on solving one of the great unifying problems of our universe”

Page 40: The Hitchhiker’s Guide to Kaggle


Who to hire?

Page 41: The Hitchhiker’s Guide to Kaggle

Why Participants Compete

- Clean, real-world data
- Professional reputation & experience
- Interactions with experts in related fields
- Prizes

Page 42: The Hitchhiker’s Guide to Kaggle

Use the wizard to post a competition

Page 43: The Hitchhiker’s Guide to Kaggle

Participants make their entries

Page 44: The Hitchhiker’s Guide to Kaggle

Competitions are judged based on predictive accuracy

Page 45: The Hitchhiker’s Guide to Kaggle

Competition Mechanics

Competitions are judged on objective criteria

Page 46: The Hitchhiker’s Guide to Kaggle

THE FORD COMPETITION

The Anatomy of a KAGGLE COMPETITION

Page 47: The Hitchhiker’s Guide to Kaggle

Ford Challenge – DataSet

- Goal: predict driver alertness
- Predictors: Psychology – P1..P8; Environment – E1..E11; Vehicle – V1..V11; outcome: IsAlert?
- Data statistics are meaningless outside the IsAlert context

Page 48: The Hitchhiker’s Guide to Kaggle

Ford Challenge – DataSet Files

Three files:
- ford_train: 510 trials, ~1,200 observations each spaced by 0.1 sec -> 604,330 rows
- ford_test: 100 trials, ~1,200 observations/trial, 120,841 rows
- example_submission.csv

Page 49: The Hitchhiker’s Guide to Kaggle

A Plan

Page 50: The Hitchhiker’s Guide to Kaggle

glm
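The glm step, sketched in R as a logistic regression of IsAlert on all predictors. The file name ford_train.csv follows the dataset slide above; TrialID and ObsNum are assumed to be identifier columns that should be dropped from the predictors:

    ford_train <- read.csv("ford_train.csv")

    # Logistic regression over the P1..P8, E1..E11, V1..V11 predictors
    glm_fit <- glm(IsAlert ~ . - TrialID - ObsNum,
                   data = ford_train, family = binomial)

    summary(glm_fit)     # coefficients and significance of each predictor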

Page 51: The Hitchhiker’s Guide to Kaggle

Submission & Results

- Raw data, all variables, rpart
- Raw data, selected variables, rpart
- All variables, glm
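To turn such a model into an entry, one hedged sketch is to score ford_test and overwrite the prediction column of example_submission.csv; the column name Prediction and the row-for-row alignment with the test set are assumptions about the file format:

    ford_test  <- read.csv("ford_test.csv")
    submission <- read.csv("example_submission.csv")

    # Predicted probability of IsAlert == 1 for each test observation
    p <- predict(glm_fit, newdata = ford_test, type = "response")

    submission$Prediction <- p                 # assumed column name from the example file
    write.csv(submission, "glm_submission.csv", row.names = FALSE)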

Page 52: The Hitchhiker’s Guide to Kaggle

How the Ford Competition was won

"How I Did It" blogs:
- http://blog.kaggle.com/2011/03/25/inference-on-winning-the-ford-stay-alert-competition/
- http://blog.kaggle.com/2011/04/20/mick-wagner-on-finishing-second-in-the-ford-challenge/
- http://blog.kaggle.com/2011/03/16/junpei-komiyama-on-finishing-4th-in-the-ford-competition/

Page 53: The Hitchhiker’s Guide to Kaggle

How the Ford Competition was won

Junpei Komiyama (#4)
- "To solve this problem, I constructed a Support Vector Machine (SVM), which is one of the best tools for classification and regression analysis, using the libSVM package."
- "This approach took more than 3 hours to complete."
- "I found some data (P3-P6) were characterized by strong noise... Also, many environmental and vehicular data showed discrete values continuously increased and decreased. These suggested the necessity of pre-processing the observation data before SVM analysis for better performance."

Page 54: The Hitchhiker’s Guide to Kaggle

How the Ford Competition was won

Junpei Komiyama (#4)
- Averaging improved both score and processing time: averaging 7 data points reduced processing by 86% & increased the score by 0.01
- Tools: Python processing of the CSVs; libSVM

Page 55: The Hitchhiker’s Guide to Kaggle

How the Ford Competition was won

Mick Wagner (#2)
- Tools: Excel, SQL Server
- "I spent the majority of my time analyzing the data. I inputted the data into Excel and started examining the data taking note of discrete and continuous values, category based parameters, and simple statistics (mean, median, variance, coefficient of variance). I also looked for extreme outliers."
- "I made the first 150 trials (~30%) be my test data and the remainder be my training dataset (~70%). This single factor had the largest impact on the accuracy of my final model."
- "I was concerned that using the entire data set would create too much noise and lead to inaccuracies in the model" … so he focused on data with state changes

Page 56: The Hitchhiker’s Guide to Kaggle

How the Ford Competition was won

Mick Wagner (#2)
- "After testing the Decision Tree and Neural Network algorithms against each other and submitting models to Kaggle, I found the Neural Network model to be more accurate."
- Only used E4, E5, E6, E7, E8, E9, E10, P6, V4, V6, V10, and V11

Page 57: The Hitchhiker’s Guide to Kaggle

How the Ford Competition was won

Inference (#1)
- Very interesting
- "Our first observation is that trials are not homogeneous – so calculated mean, sd et al"
- "Training set & test set are not from the same population" – a good fit on the training set will result in a low score
- Lucky model (regression): -410.6073·sd(E5) + 0.1494·V11 + 4.4185·E9
- (Remember – the data had P1-P8, E1-E11, V1-V11)

Page 58: The Hitchhiker’s Guide to Kaggle

HOW THE RTA WAS WON

“This competition requires participants to predict travel time on Sydney's M4 freeway from past travel time observations.”

Page 59: The Hitchhiker’s Guide to Kaggle

Thanks to François GUILLEM & Andrzej Janusz
- They both used R
- They shared their code & algorithms

Page 60: The Hitchhiker’s Guide to Kaggle

How the RTA was won

"I effectively used R for the RTA competition. For my best submission, I just used simple techniques (OLS and means) but in a clever way."
- François GUILLEM (#14)

"I used a simple k-NN approach, but the idea was to process the data first & to compute some summaries of the time series in consecutive timestamps using some standard indicators from technical analysis."
- Andrzej Janusz (#17)

Page 61: The Hitchhiker’s Guide to Kaggle

How the RTA was won

#1 used Random Forests, with Time, Date & Week as predictors
- José P. González-Brenes and Matías Cortés

Regression models for data segments (total ~600!)
- Tools: Java/Weka; 4 processors, 12 GB RAM; 48 hours of computation
- Marcin Pionnier (#5)

Ref: http://blog.kaggle.com/2011/02/17/marcin-pionnier-on-finishing-5th-in-the-rta-competition/
Ref: http://blog.kaggle.com/2011/03/25/jose-p-gonzalez-brenes-and-matias-cortes-on-winning-the-rta-challenge/

Page 62: The Hitchhiker’s Guide to Kaggle

THE HHP

TimeCheck : Should be ~2:40!!

Page 63: The Hitchhiker’s Guide to Kaggle

Lessons from Kaggle Winners

1. Don't over-fit
2. All predictors are not needed
3. All data rows are not needed, either
4. Tuning the algorithms will give different results
5. Reduce the dataset (average, select transition data, …)
6. Test set & training set can differ
7. Iteratively explore & get your head around the data
8. Don't be afraid to submit simple solutions
9. Keep a tab on, and a history of, your submissions

Page 64: The Hitchhiker’s Guide to Kaggle

The Competition

“The goal of the prize is to develop a predictive algorithm that can identify patients who will be admitted to the hospital within the next year, using historical claims data”

Page 65: The Hitchhiker’s Guide to Kaggle

TimeLine

Page 66: The Hitchhiker’s Guide to Kaggle

Data Organization

Members: MemberID, Age at 1st Claim, Sex
- 113,000 entries; missing values

Claims: MemberID, Prov ID, Vendor, PCP, Year, Specialty, PlaceOfSvc, PayDelay, LengthOfStay, DaysSinceFirstClaimThatYear, PrimaryConditionGroup, CharlsonIndex, ProcedureGroup, SupLOS
- 2,668,990 entries; missing values; different coding; PayDelay capped at 162+
- SupLOS: length of stay is suppressed during the de-identification process for some entries

LabCount: MemberID, Year, DSFS, LabCount
- 361,485 entries; fairly consistent coding (10+)

DrugCount: MemberID, Year, DSFS, DrugCount
- 818,242 entries; fairly consistent coding (10+)

Days In Hospital Y2, Y3, Y4 (target): MemberID, ClaimsTruncated, DaysInHospital
- 76,039 entries (Y2); 71,436 entries (Y3); 70,943 entries (Y4); lots of zeros

Page 67: The Hitchhiker’s Guide to Kaggle
Page 68: The Hitchhiker’s Guide to Kaggle

Calculation & Prizes

- Judged on the prediction error rate
- Deadlines: Aug 31, 2011; Feb 13, 2012; Sep 04, 2012; Apr 04, 2013 (06:59:59 UTC)

Page 69: The Hitchhiker’s Guide to Kaggle

HHP ANALYTICS Now it is our turn …

Page 70: The Hitchhiker’s Guide to Kaggle

POA

- Load the data into SQLite
- Use SQL to de-normalize & pick out datasets
- Load them into R for analytics (see the sketch below)
- Total/distinct counts:
  - Claims = 2,668,991 / 113,001
  - Members = 113,001
  - Drug = 818,242 / 75,999 <- unique = 141,532 / 75,999 (test)
  - Lab = 361,485 / 86,640 <- unique = 154,935 / 86,640 (test)
  - dih_y2 = 76,039 distinct / 11,770 with dih > 0
  - dih_y3 = 71,436 distinct / 10,730 with dih > 0
  - dih_y4 = 70,943 distinct
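A sketch of the SQLite-to-R plumbing in that plan, using DBI/RSQLite; the database file name hhp.db and the table names are assumptions based on the counts above:

    library(DBI)
    library(RSQLite)

    con <- dbConnect(SQLite(), dbname = "hhp.db")

    # De-normalize / pick out datasets with SQL, then pull them into R
    members <- dbGetQuery(con, "select * from members")
    dih_y2  <- dbGetQuery(con, "select * from dih_y2")

    # Sanity-check the total/distinct counts listed above
    dbGetQuery(con, "select count(*), count(distinct member_id) from claims")

    dbDisconnect(con)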

Page 71: The Hitchhiker’s Guide to Kaggle

Idea #1

- dih_Y2 = β0 + β1·dih_Y1 + β2·DC + β3·LC
- dih_Y3 = β0 + β1·dih_Y2 + β2·DC + β3·LC
- dih_Y4 = β0 + β1·dih_Y3 + β2·DC + β3·LC
- select count(*) from dih_y2 join dih_y3 on dih_y2.member_id = dih_y3.member_id;
- Y2-Y3 overlap = 51,967 (8,339 with dih_y2 > 0); Y3-Y4 overlap = 49,683 (7,699 with dih_y3 > 0)
- The data does not give this directly:
  - Summarize drug and lab counts by member and year
  - Split by year to get DC & LC per year
  - Add them to the dih_Yx tables
  - Linear regression (see the sketch below)
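Once DC and LC have been summarized per member-year and joined onto the dih tables, Idea #1 is a plain lm(). A sketch assuming data frames y2y3 and y3y4 built from the joins above, with hypothetical column names dih_y2, dih_y3, dc_y2, lc_y2 and so on:

    # dih_Y3 = b0 + b1*dih_Y2 + b2*DC + b3*LC   (one of the three regressions)
    fit_y3 <- lm(dih_y3 ~ dih_y2 + dc_y2 + lc_y2, data = y2y3)
    summary(fit_y3)

    # Score Y4 from each member's Y3 row, reusing the fitted coefficients
    pred_y4 <- predict(fit_y3, newdata = data.frame(dih_y2 = y3y4$dih_y3,
                                                    dc_y2  = y3y4$dc_y3,
                                                    lc_y2  = y3y4$lc_y3))
    pred_y4 <- pmax(pred_y4, 0)   # days in hospital cannot be negative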

Page 72: The Hitchhiker’s Guide to Kaggle

Some SQL for Idea #1

    create table drug_tot as
      select member_id, year, total(drug_count) as drug_tot
      from drug_count group by member_id, year order by member_id, year;
      -- total drug counts per year for each member
    -- same for lab_tot
    create table drug_tot_y1 as select * from drug_tot where year = 'Y1';
    -- ... likewise for Y2, Y3, and for lab_tot Y1, Y2, Y3
    -- ... then join with the dih_yx tables

Page 73: The Hitchhiker’s Guide to Kaggle

Idea #2

- Add claims at year x to the Idea #1 equations:
  dih_Yn = β0 + β1·dih_Y(n-1) + β2·DC(n-1) + β3·LC(n-1) + β4·Claim(n-1)
- Then we will have to define the criteria for Claim(n-1) from the claim predictors, viz. PrimaryConditionGroup, CharlsonIndex and ProcedureGroup

Page 74: The Hitchhiker’s Guide to Kaggle

The Beginning As the End

- We started with a set of goals
- Homework:
  - For me: finish the hands-on walkthrough & post it in ~10 days
  - For you: go through the slides, do the walkthrough, submit entries to Kaggle

Page 75: The Hitchhiker’s Guide to Kaggle

IDE <- RStudio
R_Packages <- c(plyr, rattle, rpart, randomForest)
R_Search <- http://www.rseek.org/ (powered by Google)

Questions?

I enjoyed preparing the materials a lot … hope you enjoyed attending even more …