The Hitchhiker’s Guide to Kaggle

DESCRIPTION
For the OSCON Data 2011 workshop "The Hitchhiker’s Guide to A Kaggle Competition"
http://www.oscon.com/oscon2011/public/schedule/detail/20011

TRANSCRIPT
The Hitchhiker’s Guide to Kaggle
July 27, 2011 | [email protected] [doubleclix.wordpress.com]
Analytics Competitions: Algorithms, Tools, DataSets
The Amateur Data Scientist
• CART, randomForest
• Old competition, competition in-flight
• Titanic, Churn, HHP, Ford
Encounters
• 1st: This workshop
• 2nd: Do the hands-on walkthrough (I will post the walkthrough scripts in ~10 days)
• 3rd: Participate in HHP & other competitions
Goals of This Workshop
1. Introduction to analytics competitions from the data, algorithms & tools perspective
2. End-to-end flow of a Kaggle competition – Ford
3. Introduction to the Heritage Health Prize competition
4. Materials for you to explore further
   ◦ Lots more slides
   ◦ Walkthrough – will post in ~10 days
Agenda
• Algorithms for the Amateur Data Scientist [25 min]
  ◦ Algorithms, tools & frameworks in perspective
• The Art of Analytics Competitions [10 min]
  ◦ The Kaggle challenges
• How the RTA & Ford were won – anatomy of a competition [15 min]
  ◦ Predicting Ford using trees
  ◦ Submit an entry
• Competition in flight – the Heritage Health Prize [30 min]
  ◦ Walkthrough: introduction, dataset organization, analytics
  ◦ Submit our entry
• Conclusion [5 min]
ALGORITHMS FOR THE AMATEUR DATA SCIENTIST
“A towel is about the most massively useful thing an interstellar hitchhiker can have … any man who can hitch the length and breadth of the Galaxy, rough it … win through, and still know where his towel is, is clearly a man to be reckoned with.”
– From The Hitchhiker's Guide to the Galaxy, by Douglas Adams. Published by Harmony Books in 1979
Algorithms! The most massively useful thing an Amateur Data Scientist can have …
The Amateur Data Scientist
• I am not a quant or an ML expert
• School of Amazon, Springer & YouTube
• For the rest of us
• References I used (refs are also on the respective slides):
  ◦ The Elements of Statistical Learning (a.k.a. ESLII), by Hastie, Tibshirani & Friedman
  ◦ Statistical Learning From a Regression Perspective, by Richard Berk
• As Jeremy says, you can dig into them as needed – you don't need to be an expert in the R toolbox
Jeremy's Axioms
• Iteratively explore data
• Tools: Excel format, Perl, Perl Book
• Get your head around the data: pivot tables
• Don't over-complicate
• If people give you data, don't assume you need to use all of it
• Look at pictures!
• Keep a tab on the history of your submissions
• Don't be afraid to submit simple solutions
  ◦ We will do this during this workshop
Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/
Big Data to Smart Data – Summary
1. Don't throw away any data!
2. Be ready for different ways of organizing the data
Ref: Anthony's Kaggle Presentation
Users apply different techniques
• Support Vector Machines
• AdaBoost
• Bayesian Networks
• Decision Trees
• Ensemble Methods
• Random Forests
• Logistic Regression
• Genetic Algorithms
• Monte Carlo Methods
• Principal Component Analysis
• Kalman Filters
• Evolutionary Fuzzy Modelling
• Neural Networks
Quora: http://www.quora.com/What-are-the-top-10-data-mining-or-machine-learning-algorithms
• Let us take a 15-minute overview of the algorithms
  ◦ Relevant in the context of this workshop
  ◦ From the perspective of the datasets we plan to use
• More qualitative than mathematical – to get a feel for the how & the why
[Diagram: the algorithm landscape – Linear Regression for continuous variables; Classifiers, Decision Trees (CART) & k-NN (Nearest Neighbors) for categorical variables; recurring themes: Bias vs. Variance, Model Complexity & Over-fitting, Boosting & Bagging]
• Titanic Passenger Metadata – small; 3 predictors (Class, Sex, Age) plus the target: Survived?
• Customer Churn – 17 predictors
• Kaggle "Stay Alert!" Ford Challenge – simple dataset; competition class
• Heritage Health Prize data – complex; competition in flight
Image credits: http://www.ohgizmo.com/2007/03/21/romain-jerome-titanic, http://www.homestoriesatoz.com/2011/06/blogger-to-wordpress-a-fish-out-of-water.html
Titanic Dataset
• Taken from the passenger manifest
• A good candidate for a decision tree
• CART [Classification & Regression Trees]
  ◦ Greedy, top-down, binary, recursive partitioning that divides the feature space into sets of disjoint rectangular regions
• CART in R: the rpart package
Ref: http://www.dabi.temple.edu/~hbling/8590.002/Montillo_RandomForests_4-2-2009.pdf
Titanic Dataset – R Walkthrough
• Load libraries
• Load data
• Model with CART
• Model with rattle()
• Tree & discussion
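A minimal sketch of those steps, assuming a titanic.csv with Class, Sex, Age and Survived columns (the file and column names are assumptions, not the workshop's exact materials):

  # Load libraries: rpart for CART, rattle for nicer tree plots
  library(rpart)
  library(rattle)

  # Load data (assumed file & column names)
  titanic <- read.csv("titanic.csv")

  # Model with CART: a classification tree for survival
  fit <- rpart(Survived ~ Class + Sex + Age, data = titanic, method = "class")

  # Tree: inspect the splits, then plot
  print(fit)
  fancyRpartPlot(fit)   # rattle's rendering of the rpart tree; rattle() opens the GUI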
[Diagram: CART trees for the Titanic data – the root split is "Male?", with further "Adult?"/"Child" and "3rd class?" splits leading to survive (Y) / not-survive (N) leaves]
1. Do not over-fit
2. All predictors are not needed
3. All data rows are not needed
4. Tuning the algorithms will give different results
CART – Churn Data
• Predict churn
• Based on service calls, v-mail and so forth
• CART tree
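As a sketch, assuming the churn data is in churn.csv with a churn column plus the 17 predictors (service calls, v-mail plan, etc.; all names are assumptions):

  library(rpart)

  churn <- read.csv("churn.csv")

  # CART over all predictors; the printed tree typically shows that only a
  # handful of the 17 (service calls, v-mail, ...) actually drive the splits
  churn_fit <- rpart(churn ~ ., data = churn, method = "class")
  print(churn_fit)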
Challenges
• Model complexity
  ◦ A complex model increases the fit to the training data
  ◦ But it then over-fits, and doesn't perform as well on real data
• Bias vs. Variance
  ◦ Classical diagram, from ESLII by Hastie, Tibshirani & Friedman
[Figure: Prediction Error and Training Error vs. Model Complexity]
Solution #1 • Goal: Model Complexity (-), Variance (-), Prediction Accuracy (+)
• Partition the data!
  ◦ Training (60%), Validation (20%) & "Vault" Test (20%) data sets
• k-fold Cross-Validation
  ◦ Split the data into k equal parts
  ◦ Fit the model to k-1 parts & calculate the prediction error on the kth part
  ◦ Non-overlapping datasets
• But the fundamental problem still exists!
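A hand-rolled sketch of k-fold cross-validation (df and the 0/1 target y are assumed; packages such as caret automate this):

  set.seed(42)
  k <- 5
  folds <- sample(rep(1:k, length.out = nrow(df)))     # k non-overlapping parts

  cv_err <- sapply(1:k, function(i) {
    train <- df[folds != i, ]                          # fit on k-1 parts
    test  <- df[folds == i, ]                          # hold out the kth part
    fit   <- glm(y ~ ., data = train, family = binomial)
    pred  <- predict(fit, test, type = "response") > 0.5
    mean(pred != test$y)                               # prediction error on the kth part
  })
  mean(cv_err)                                         # the CV estimate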
Solution #2 • Goal: Model Complexity (-), Variance (-), Prediction Accuracy (+)
• Bootstrap
  ◦ Draw datasets (with replacement) and fit the model for each dataset
  ◦ Remember: Data Partitioning (#1) & Cross-Validation (#2) are without replacement
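In code, the only change from cross-validation is replace = TRUE (df and y assumed as before):

  set.seed(42)
  B <- 100
  boot_fits <- lapply(1:B, function(b) {
    idx <- sample(nrow(df), replace = TRUE)            # draw a bootstrap dataset
    glm(y ~ ., data = df[idx, ], family = binomial)    # fit the model to each one
  })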
Solution #3 • Goal: Model Complexity (-), Variance (-), Prediction Accuracy (+)
• Bagging (Bootstrap aggregation)
  ◦ Average predictions over a collection of bootstrapped samples, thus reducing variance
• Boosting
  ◦ "Output of weak classifiers into a powerful committee"
  ◦ Final prediction = weighted majority vote
  ◦ Later classifiers give misclassified points higher weight, so they are forced to concentrate on them
  ◦ AdaBoost (Adaptive Boosting)
  ◦ Boosting vs. Bagging: Bagging grows independent trees; Boosting weights trees successively
Solution #4 • Goal: Model Complexity (-), Variance (-), Prediction Accuracy (+)
• Random Forests+
  ◦ Builds a large collection of de-correlated trees & averages them
  ◦ Improves on Bagging by selecting i.i.d.* random variables for splitting
  ◦ Simpler to train & tune
  ◦ "Do remarkably well, with very little tuning required" – ESLII
  ◦ Less susceptible to over-fitting (than boosting)
  ◦ Many RF implementations: the original version in Fortran-77 by Breiman/Cutler; R, Mahout, Weka, Milk (an ML toolkit for Python), MATLAB
* i.i.d. – independent identically distributed
+ http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
Solution – General • Goal: Model Complexity (-), Variance (-), Prediction Accuracy (+)
• Ensemble methods
  ◦ Two steps: develop a set of learners, then combine their results into a composite predictor
  ◦ Ensemble methods can take the form of: using different algorithms, using the same algorithm with different settings, or assigning different parts of the dataset to different classifiers
  ◦ Bagging & Random Forests are examples of ensemble methods
Ref: Machine Learning In Action
Random Forests
• While Boosting splits based on the best among all variables, RF splits based on the best among randomly chosen variables
• Simpler because it requires only two parameters: the number of predictors tried at each split (typically √k) & the number of trees (500 for a large dataset, 150 for a smaller one)
• Error prediction
  ◦ For each iteration, predict for the data that is not in the sample (the OOB data)
  ◦ Aggregate the OOB predictions
  ◦ Calculate the prediction error for the aggregate, which is basically the OOB estimate of the error rate
  ◦ Can use this to search for the optimal # of predictors
  ◦ We will see how close this is to the actual error in the Heritage Health Prize
• Assumes equal cost for mis-prediction; a cost function can be added
• Proximity matrix & applications like imputing missing data and dropping outliers
Ref: R News Vol 2/3, Dec 2002; Statistical Learning from a Regression Perspective – Berk; A Brief Overview of RF by Dan Steinberg
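A sketch of reading the OOB error off a forest and searching for the optimal number of predictors per split (a data frame df with a factor target y is assumed; tuneRF ships with the randomForest package):

  library(randomForest)

  rf <- randomForest(y ~ ., data = df, ntree = 500)
  rf$err.rate[rf$ntree, "OOB"]    # OOB estimate of the error rate, no holdout needed

  # Vary mtry (predictors tried per split, default ~ sqrt(k)) and keep
  # the value with the lowest OOB error
  tuneRF(df[, setdiff(names(df), "y")], df$y,
         ntreeTry = 150, stepFactor = 2, improve = 0.01)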
Lots more to explore (homework!)
• Loss matrix
  ◦ E.g., telecom churn: better to give incentives to false positives (customers who are not actually leaving) than to miss incentives for false negatives (customers who are leaving)
• Missing values
• Additive models
• Bayesian models
• Gradient boosting
Ref: http://www.louisaslett.com/Courses/Data_Mining_09-10/ST4003-Lab4-New_Tree_Data_Set_and_Loss_Matrices.pdf
Churn Data w/ randomForest
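A minimal randomForest counterpart to the earlier CART model (same assumed churn.csv):

  library(randomForest)

  churn <- read.csv("churn.csv")
  churn$churn <- as.factor(churn$churn)   # classification needs a factor target

  set.seed(42)
  churn_rf <- randomForest(churn ~ ., data = churn, ntree = 500, importance = TRUE)

  print(churn_rf)         # confusion matrix & OOB error estimate
  varImpPlot(churn_rf)    # variable importance: 'all predictors are not needed'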
KAGGLE COMPETITIONS
“I keep saying the sexy job in the next ten years will be statisticians.”
– Hal Varian, Google Chief Economist, 2009
Mismatch between those with data and those with the skills to analyse it
Crowdsourcing
[Chart: Tourism Forecasting Competition – Forecast Error (MASE) over time (Aug 9, 2 weeks later, 1 month later, Competition End) vs. the existing model]
[Chart: Chess Ratings Competition – Error Rate (RMSE) over time (Aug 4, 1 month later, 2 months later, Today) vs. the existing model (ELO)]
12,500 “Amateur” Data Scientists with different backgrounds
[Charts: popularity of analytics tools (R, Matlab, SAS, WEKA, SPSS, Python, Excel, Mathematica, Stata) – R on Kaggle, among academics, and among Americans; successful grant applications ~25%]
Ref: Anthony's Kaggle Presentation
NASA tried, now it’s our turn
Mapping Dark Matter is an image analysis competition whose aim is to encourage the development of new algorithms that can be applied to the challenge of measuring the tiny distortions in galaxy images caused by dark matter.
“In less than a week, Martin O’Leary, a PhD student in glaciology, outperformed the state-of-the-art algorithms”
“The world’s brightest physicists have been working for decades on solving one of the great unifying problems of our universe”
Who to hire?

Why Participants Compete
• Clean, real-world data
• Professional reputation & experience
• Interactions with experts in related fields
• Prizes
Competition Mechanics
• Use the wizard to post a competition
• Participants make their entries
• Competitions are judged on objective criteria, based on predictive accuracy
THE FORD COMPETITION
The Anatomy of a KAGGLE COMPETITION
Ford Challenge – DataSet
• Goal: predict driver alertness
• Predictors:
  ◦ Physiological – P1 .. P8
  ◦ Environmental – E1 .. E11
  ◦ Vehicular – V1 .. V11
• Target: IsAlert?
• Data statistics are meaningless outside the IsAlert context
Ford Challenge – DataSet Files
• Three files:
  ◦ ford_train – 510 trials, ~1,200 observations each spaced 0.1 sec apart → 604,330 rows
  ◦ ford_test – 100 trials, ~1,200 observations/trial, 120,841 rows
  ◦ example_submission.csv
A Plan – Submissions & Results
• Raw data, all variables – rpart
• Raw data, selected variables – rpart
• All variables – glm
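A sketch of that plan end to end, assuming the files described above and the TrialID / ObsNum / IsAlert column names from the Ford data (check example_submission.csv for the exact submission format):

  library(rpart)

  train <- read.csv("ford_train.csv")
  test  <- read.csv("ford_test.csv")

  # Raw data, all variables: CART
  fit_tree <- rpart(IsAlert ~ . - TrialID - ObsNum, data = train, method = "class")

  # All variables: logistic regression
  fit_glm <- glm(IsAlert ~ . - TrialID - ObsNum, data = train, family = binomial)

  # Score the test set and write an entry
  pred <- predict(fit_glm, test, type = "response")
  write.csv(data.frame(TrialID = test$TrialID, ObsNum = test$ObsNum,
                       Prediction = pred),
            "submission.csv", row.names = FALSE)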
How the Ford Competition Was Won
• "How I Did It" blog posts:
  ◦ http://blog.kaggle.com/2011/03/25/inference-on-winning-the-ford-stay-alert-competition/
  ◦ http://blog.kaggle.com/2011/04/20/mick-wagner-on-finishing-second-in-the-ford-challenge/
  ◦ http://blog.kaggle.com/2011/03/16/junpei-komiyama-on-finishing-4th-in-the-ford-competition/
How the Ford Competition Was Won
• Junpei Komiyama (#4)
  ◦ "To solve this problem, I constructed a Support Vector Machine (SVM), which is one of the best tools for classification and regression analysis, using the libSVM package."
  ◦ This approach took more than 3 hours to complete
  ◦ "I found some data (P3-P6) were characterized by strong noise... Also, many environmental and vehicular data showed discrete values continuously increased and decreased. These suggested the necessity of pre-processing the observation data before SVM analysis for better performance."
How the Ford Competition Was Won
• Junpei Komiyama (#4), continued
  ◦ Averaging improved both the score and the processing time
  ◦ Averaging 7 data points reduced processing by 86% & increased the score by 0.01
  ◦ Tools: Python for CSV processing, libSVM
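A sketch of that kind of averaging: collapsing every 7 consecutive observations of a trial into one averaged row, which is roughly the 86% reduction he reports (trial is assumed to hold one trial's numeric columns, in time order):

  block  <- (seq_len(nrow(trial)) - 1) %/% 7    # 0,0,...,0,1,1,... in groups of 7
  trial7 <- aggregate(trial, by = list(block = block), FUN = mean)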
How the Ford Competition Was Won
• Mick Wagner (#2)
  ◦ Tools: Excel, SQL Server
  ◦ "I spent the majority of my time analyzing the data. I inputted the data into Excel and started examining the data, taking note of discrete and continuous values, category-based parameters, and simple statistics (mean, median, variance, coefficient of variance). I also looked for extreme outliers."
  ◦ "I made the first 150 trials (~30%) be my test data and the remainder be my training dataset (~70%). This single factor had the largest impact on the accuracy of my final model."
  ◦ He was concerned that using the entire dataset would create too much noise and lead to inaccuracies in the model, so he focused on data with state changes
How the Ford Competition Was Won
• Mick Wagner (#2), continued
  ◦ "After testing the Decision Tree and Neural Network algorithms against each other and submitting models to Kaggle, I found the Neural Network model to be more accurate."
  ◦ Only used E4, E5, E6, E7, E8, E9, E10, P6, V4, V6, V10, and V11
How the Ford Competition Was Won
• Inference (#1)
  ◦ Very interesting
  ◦ "Our first observation is that trials are not homogeneous – so we calculated mean, sd et al."
  ◦ "The training set & test set are not from the same population" – a good fit on training will result in a low score
  ◦ The lucky model (regression): -410.6073·sd(E5) + 0.1494·V11 + 4.4185·E9
  ◦ (Remember – the data had P1-P8, E1-E11, V1-V11)
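A sketch of scoring that model: sd(E5) is a per-trial statistic, so it has to be broadcast back onto every row of the trial (TrialID / E5 / V11 / E9 column names as in the Ford data):

  # Per-trial standard deviation of E5, repeated for each observation in the trial
  sdE5 <- ave(test$E5, test$TrialID, FUN = sd)

  # The published 'lucky' regression
  pred <- -410.6073 * sdE5 + 0.1494 * test$V11 + 4.4185 * test$E9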
HOW THE RTA WAS WON
“This competition requires participants to predict travel time on Sydney's M4 freeway from past travel time observations.”
• Thanks to François GUILLEM & Andrzej Janusz
• They both used R
• They shared their code & algorithms
How the RTA Was Won
• "I effectively used R for the RTA competition. For my best submission, I just used simple techniques (OLS and means), but in a clever way." – François GUILLEM (#14)
• "I used a simple k-NN approach, but the idea was to process the data first & to compute some summaries of time series in consecutive timestamps, using some standard indicators from technical analysis." – Andrzej Janusz (#17)
How the RTA Was Won
• #1 used Random Forests, with Time, Date & Week as predictors – José P. González-Brenes and Matías Cortés
• Marcin Pionnier (#5): regression models for data segments (~600 in total!)
  ◦ Tools: Java/Weka; 4 processors, 12 GB RAM; 48 hours of computation
Ref: http://blog.kaggle.com/2011/02/17/marcin-pionnier-on-finishing-5th-in-the-rta-competition/
Ref: http://blog.kaggle.com/2011/03/25/jose-p-gonzalez-brenes-and-matias-cortes-on-winning-the-rta-challenge/
THE HHP
Time check: should be ~2:40!
Lessons from Kaggle Winners
1. Don't over-fit
2. All predictors are not needed
3. All data rows are not needed, either
4. Tuning the algorithms will give different results
5. Reduce the dataset (average, select transition data, …)
6. The test set & training set can differ
7. Iteratively explore & get your head around the data
8. Don't be afraid to submit simple solutions
9. Keep a tab on & a history of your submissions
The Competition
“The goal of the prize is to develop a predictive algorithm that can identify patients who will be admitted to the hospital within the next year, using historical claims data”
TimeLine
Data Organization
• Members – MemberID, AgeAtFirstClaim, Sex
  ◦ 113,000 entries; missing values
• Claims – MemberID, ProviderID, Vendor, PCP, Year, Specialty, PlaceOfSvc, PayDelay, LengthOfStay, DaysSinceFirstClaimThatYear, PrimaryConditionGroup, CharlsonIndex, ProcedureGroup, SupLOS
  ◦ 2,668,990 entries; missing values; different coding
  ◦ PayDelay is capped at 162+; SupLOS – LengthOfStay is suppressed during the de-identification process for some entries
• LabCount – MemberID, Year, DSFS, LabCount
  ◦ 361,485 entries; fairly consistent coding (10+)
• DrugCount – MemberID, Year, DSFS, DrugCount
  ◦ 818,242 entries; fairly consistent coding (10+)
• DaysInHospital Y2, DaysInHospital Y3, DaysInHospital Y4 (target) – MemberID, ClaimsTruncated, DaysInHospital
  ◦ 76,039 entries (Y2), 71,436 entries (Y3), 70,943 entries (Y4); lots of zeros
Calculation & Prizes
• Judged on prediction error rate
• Milestone deadlines: Aug 31, 2011; Feb 13, 2012; Sep 04, 2012
• Final deadline: Apr 04, 2013, 06:59:59 UTC
HHP ANALYTICS
Now it is our turn …
POA
• Load the data into SQLite
• Use SQL to de-normalize & pick out datasets
• Load them into R for analytics
• Total/distinct counts:
  ◦ Claims = 2,668,991 / 113,001
  ◦ Members = 113,001
  ◦ Drug = 818,242 / 75,999 ← unique = 141,532 / 75,999 (test)
  ◦ Lab = 361,485 / 86,640 ← unique = 154,935 / 86,640 (test)
  ◦ dih_y2 = 76,039 distinct / 11,770 with dih > 0
  ◦ dih_y3 = 71,436 distinct / 10,730 with dih > 0
  ◦ dih_y4 = 70,943 distinct
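One way to wire that up, assuming the HHP CSVs have already been imported into an SQLite file hhp.db with tables named claims, members, etc. (a sketch with the RSQLite package; table & column names are assumptions):

  library(RSQLite)

  con <- dbConnect(SQLite(), "hhp.db")

  # De-normalize / summarize in SQL, then pull the result into an R data frame
  claims_per_member <- dbGetQuery(con,
    "select member_id, year, count(*) as n_claims
       from claims group by member_id, year")

  dbDisconnect(con)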
Idea #1
• dih_Y2 = β0 + β1·dih_Y1 + β2·DC + β3·LC
• dih_Y3 = β0 + β1·dih_Y2 + β2·DC + β3·LC
• dih_Y4 = β0 + β1·dih_Y3 + β2·DC + β3·LC
• select count(*) from dih_y2 join dih_y3 on dih_y2.member_id = dih_y3.member_id;
  ◦ Y2-Y3 overlap = 51,967 members (8,339 with dih_y2 > 0); Y3-Y4 overlap = 49,683 (7,699 with dih_y3 > 0)
• The data is not straightforward to get into this shape:
  ◦ Summarize drug and lab counts by member & year
  ◦ Split by year to get DC & LC per year
  ◦ Add them to the dih_Yx tables
  ◦ Run the linear regression
Some SQL for Idea #1
• create table drug_tot as select member_id, year, total(drug_count) from drug_count group by member_id, year order by member_id, year; ← total drug count per year for each member
• Same for lab_tot
• create table drug_tot_y1 as select * from drug_tot where year = 'Y1'
• … likewise for Y2 & Y3, and Y1, Y2, Y3 for lab_tot
• … then join with the dih_yx tables
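Once those totals are joined onto the dih tables, Idea #1 is a one-line regression in R; a sketch assuming a joined data frame y2y3 with columns dih_y3, dih_y2, dc_y2 and lc_y2 (all names are assumptions):

  # dih_Y3 = b0 + b1*dih_Y2 + b2*DC + b3*LC
  fit <- lm(dih_y3 ~ dih_y2 + dc_y2 + lc_y2, data = y2y3)
  summary(fit)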
Idea #2
• Add claims at Yx to the Idea #1 equations:
  dih_Yn = β0 + β1·dih_Y(n-1) + β2·DC(n-1) + β3·LC(n-1) + β4·Claim(n-1)
• Then we have to define the criteria for Claim(n-1) from the claim predictors, viz. PrimaryConditionGroup, CharlsonIndex and ProcedureGroup
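Purely as an illustration of one possible Claim(n-1) criterion: recode CharlsonIndex (stored as ranges like "1-2") to numbers and take the worst value per member & year. The recoding map and column names are assumptions:

  # Hypothetical numeric recoding of the CharlsonIndex ranges
  ch_map <- c("0" = 0, "1-2" = 1.5, "3-4" = 3.5, "5+" = 5)
  claims$ch_num <- ch_map[as.character(claims$CharlsonIndex)]

  # Worst CharlsonIndex per member & year as a candidate Claim term
  claim_sev <- aggregate(ch_num ~ member_id + year, data = claims, FUN = max)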
The Beginning As the End
• We started with a set of goals
• Homework
  ◦ For me: finish the hands-on walkthrough & post it in ~10 days
  ◦ For you: go through the slides, do the walkthrough, and submit entries to Kaggle
IDE <- RStudio
R_Packages <- c(plyr, rattle, rpart, randomForest)
R_Search <- http://www.rseek.org/ (powered by Google)
Questions?!
I enjoyed preparing the materials a lot … hope you enjoyed attending even more …