The Hitchhiker’s Guide to Kaggle

DESCRIPTION
For the OSCON Data 2011 workshop "The Hitchhiker’s Guide to A Kaggle Competition"
http://www.oscon.com/oscon2011/public/schedule/detail/20011

TRANSCRIPT
The Hitchhiker’s Guide to Kaggle
July 27, 2011 | [email protected] [doubleclix.wordpress.com]
Analytics Competitions: Algorithms, Tools, DataSets
The Amateur Data Scientist
• CART, randomForest
• Old competition, competition in-flight
• Titanic, Churn, HHP, Ford
Encounters
• 1st: This workshop
• 2nd: Do the hands-on walkthrough (I will post the walkthrough scripts in ~10 days)
• 3rd: Participate in HHP & other competitions
Goals of This Workshop
1. Introduction to analytics competitions from the data, algorithms & tools perspective
2. End-to-end flow of a Kaggle competition – Ford
3. Introduction to the Heritage Health Prize competition
4. Materials for you to explore further
   ◦ Lots more slides
   ◦ Walkthrough – will post in ~10 days
Agenda
• Algorithms for the Amateur Data Scientist [25 min]
  ◦ Algorithms, tools & frameworks in perspective
• The Art of Analytics Competitions [10 min]
  ◦ The Kaggle challenges
• How the RTA & Ford were won – anatomy of a competition [15 min]
  ◦ Predicting Ford using trees
  ◦ Submit an entry
• Competition in flight – the Heritage Health Prize [30 min]
  ◦ Walkthrough: introduction, dataset organization, analytics
  ◦ Submit our entry
• Conclusion [5 min]
ALGORITHMS FOR THE AMATEUR DATA SCIENTIST
“A towel is about the most massively useful thing an interstellar hitchhiker can have … any man who can hitch the length and breadth of the Galaxy, rough it … win through, and still know where his towel is, is clearly a man to be reckoned with.”
– From The Hitchhiker's Guide to the Galaxy, by Douglas Adams. Published by Harmony Books in 1979
Algorithms! The most massively useful thing an Amateur Data Scientist can have …
The Amateur Data Scientist
• I am not a quant or an ML expert
• School of Amazon, Springer & YouTube
• For the rest of us
• References I used (refs are also on the respective slides):
  ◦ The Elements of Statistical Learning (a.k.a. ESLII), by Hastie, Tibshirani & Friedman
  ◦ Statistical Learning From a Regression Perspective, by Richard Berk
• As Jeremy says, you can dig into them as needed – you don't need to be an expert in the R toolbox
Jeremy's Axioms
• Iteratively explore data
• Tools: Excel format, Perl, Perl Book
• Get your head around the data: pivot tables
• Don't over-complicate
• If people give you data, don't assume you need to use all of it
• Look at pictures!
• Keep a tab on the history of your submissions
• Don't be afraid to submit simple solutions
  ◦ We will do this during this workshop
Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/
Big Data to Smart Data – Summary
1. Don't throw away any data!
2. Be ready for different ways of organizing the data
Ref: Anthony's Kaggle Presentation
Users apply different techniques
• Support Vector Machines
• AdaBoost
• Bayesian Networks
• Decision Trees
• Ensemble Methods
• Random Forests
• Logistic Regression
• Genetic Algorithms
• Monte Carlo Methods
• Principal Component Analysis
• Kalman Filters
• Evolutionary Fuzzy Modelling
• Neural Networks
Quora: http://www.quora.com/What-are-the-top-10-data-mining-or-machine-learning-algorithms
• Let us take a 15-minute overview of the algorithms
  ◦ Relevant in the context of this workshop
  ◦ From the perspective of the datasets we plan to use
• More qualitative than mathematical – to get a feel for the how & the why
[Diagram: the algorithm landscape – Linear Regression for continuous variables; Classifiers, Decision Trees (CART) & k-NN (Nearest Neighbors) for categorical variables; recurring themes: Bias vs. Variance, Model Complexity & Over-fitting, Boosting & Bagging]
• Titanic Passenger Metadata – small; 3 predictors (Class, Sex, Age) plus the target: Survived?
• Customer Churn – 17 predictors
• Kaggle "Stay Alert!" Ford Challenge – simple dataset; competition class
• Heritage Health Prize data – complex; competition in flight
Image credits: http://www.ohgizmo.com/2007/03/21/romain-jerome-titanic, http://www.homestoriesatoz.com/2011/06/blogger-to-wordpress-a-fish-out-of-water.html
Titanic Dataset
• Taken from the passenger manifest
• A good candidate for a decision tree
• CART [Classification & Regression Trees]
  ◦ Greedy, top-down, binary, recursive partitioning that divides the feature space into sets of disjoint rectangular regions
• CART in R: the rpart package
Ref: http://www.dabi.temple.edu/~hbling/8590.002/Montillo_RandomForests_4-2-2009.pdf
Titanic Dataset – R Walkthrough
• Load libraries
• Load data
• Model with CART
• Model with rattle()
• Tree & discussion
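A minimal sketch of those steps, assuming a titanic.csv with Class, Sex, Age and Survived columns (the file and column names are assumptions, not the workshop's exact materials):

  # Load libraries: rpart for CART, rattle for nicer tree plots
  library(rpart)
  library(rattle)

  # Load data (assumed file & column names)
  titanic <- read.csv("titanic.csv")

  # Model with CART: a classification tree for survival
  fit <- rpart(Survived ~ Class + Sex + Age, data = titanic, method = "class")

  # Tree: inspect the splits, then plot
  print(fit)
  fancyRpartPlot(fit)   # rattle's rendering of the rpart tree; rattle() opens the GUI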
[Diagram: CART trees for the Titanic data – the root split is "Male?", with further "Adult?"/"Child" and "3rd class?" splits leading to survive (Y) / not-survive (N) leaves]
1. Do not over-fit
2. All predictors are not needed
3. All data rows are not needed
4. Tuning the algorithms will give different results
CART – Churn Data
• Predict churn
• Based on service calls, v-mail and so forth
• CART tree
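As a sketch, assuming the churn data is in churn.csv with a churn column plus the 17 predictors (service calls, v-mail plan, etc.; all names are assumptions):

  library(rpart)

  churn <- read.csv("churn.csv")

  # CART over all predictors; the printed tree typically shows that only a
  # handful of the 17 (service calls, v-mail, ...) actually drive the splits
  churn_fit <- rpart(churn ~ ., data = churn, method = "class")
  print(churn_fit)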
Challenges
• Model complexity
  ◦ A complex model increases the fit to the training data
  ◦ But it then over-fits, and doesn't perform as well on real data
• Bias vs. Variance
  ◦ Classical diagram, from ESLII by Hastie, Tibshirani & Friedman
[Figure: Prediction Error and Training Error vs. Model Complexity]
Solution #1 • Goal: Model Complexity (-), Variance (-), Prediction Accuracy (+)
• Partition the data!
  ◦ Training (60%), Validation (20%) & "Vault" Test (20%) data sets
• k-fold Cross-Validation
  ◦ Split the data into k equal parts
  ◦ Fit the model to k-1 parts & calculate the prediction error on the kth part
  ◦ Non-overlapping datasets
• But the fundamental problem still exists!
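A hand-rolled sketch of k-fold cross-validation (df and the 0/1 target y are assumed; packages such as caret automate this):

  set.seed(42)
  k <- 5
  folds <- sample(rep(1:k, length.out = nrow(df)))     # k non-overlapping parts

  cv_err <- sapply(1:k, function(i) {
    train <- df[folds != i, ]                          # fit on k-1 parts
    test  <- df[folds == i, ]                          # hold out the kth part
    fit   <- glm(y ~ ., data = train, family = binomial)
    pred  <- predict(fit, test, type = "response") > 0.5
    mean(pred != test$y)                               # prediction error on the kth part
  })
  mean(cv_err)                                         # the CV estimate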
Solution #2 • Goal: Model Complexity (-), Variance (-), Prediction Accuracy (+)
• Bootstrap
  ◦ Draw datasets (with replacement) and fit the model for each dataset
  ◦ Remember: Data Partitioning (#1) & Cross-Validation (#2) are without replacement
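In code, the only change from cross-validation is replace = TRUE (df and y assumed as before):

  set.seed(42)
  B <- 100
  boot_fits <- lapply(1:B, function(b) {
    idx <- sample(nrow(df), replace = TRUE)            # draw a bootstrap dataset
    glm(y ~ ., data = df[idx, ], family = binomial)    # fit the model to each one
  })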
Solution #3 • Goal: Model Complexity (-), Variance (-), Prediction Accuracy (+)
• Bagging (Bootstrap aggregation)
  ◦ Average predictions over a collection of bootstrapped samples, thus reducing variance
• Boosting
  ◦ "Output of weak classifiers into a powerful committee"
  ◦ Final prediction = weighted majority vote
  ◦ Later classifiers give misclassified points higher weight, so they are forced to concentrate on them
  ◦ AdaBoost (Adaptive Boosting)
  ◦ Boosting vs. Bagging: Bagging grows independent trees; Boosting weights trees successively
Solution #4 • Goal: Model Complexity (-), Variance (-), Prediction Accuracy (+)
• Random Forests+
  ◦ Builds a large collection of de-correlated trees & averages them
  ◦ Improves on Bagging by selecting i.i.d.* random variables for splitting
  ◦ Simpler to train & tune
  ◦ "Do remarkably well, with very little tuning required" – ESLII
  ◦ Less susceptible to over-fitting (than boosting)
  ◦ Many RF implementations: the original version in Fortran-77 by Breiman/Cutler; R, Mahout, Weka, Milk (an ML toolkit for Python), MATLAB
* i.i.d. – independent identically distributed
+ http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
Solution – General • Goal: Model Complexity (-), Variance (-), Prediction Accuracy (+)
• Ensemble methods
  ◦ Two steps: develop a set of learners, then combine their results into a composite predictor
  ◦ Ensemble methods can take the form of: using different algorithms, using the same algorithm with different settings, or assigning different parts of the dataset to different classifiers
  ◦ Bagging & Random Forests are examples of ensemble methods
Ref: Machine Learning In Action
Random Forests
• While Boosting splits based on the best among all variables, RF splits based on the best among randomly chosen variables
• Simpler because it requires only two parameters: the number of predictors tried at each split (typically √k) & the number of trees (500 for a large dataset, 150 for a smaller one)
• Error prediction
  ◦ For each iteration, predict for the data that is not in the sample (the OOB data)
  ◦ Aggregate the OOB predictions
  ◦ Calculate the prediction error for the aggregate, which is basically the OOB estimate of the error rate
  ◦ Can use this to search for the optimal # of predictors
  ◦ We will see how close this is to the actual error in the Heritage Health Prize
• Assumes equal cost for mis-prediction; a cost function can be added
• Proximity matrix & applications like imputing missing data and dropping outliers
Ref: R News Vol 2/3, Dec 2002; Statistical Learning from a Regression Perspective – Berk; A Brief Overview of RF by Dan Steinberg
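A sketch of reading the OOB error off a forest and searching for the optimal number of predictors per split (a data frame df with a factor target y is assumed; tuneRF ships with the randomForest package):

  library(randomForest)

  rf <- randomForest(y ~ ., data = df, ntree = 500)
  rf$err.rate[rf$ntree, "OOB"]    # OOB estimate of the error rate, no holdout needed

  # Vary mtry (predictors tried per split, default ~ sqrt(k)) and keep
  # the value with the lowest OOB error
  tuneRF(df[, setdiff(names(df), "y")], df$y,
         ntreeTry = 150, stepFactor = 2, improve = 0.01)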
Lots more to explore (homework!)
• Loss matrix
  ◦ E.g., telecom churn: better to give incentives to false positives (customers who are not actually leaving) than to miss incentives for false negatives (customers who are leaving)
• Missing values
• Additive models
• Bayesian models
• Gradient boosting
Ref: http://www.louisaslett.com/Courses/Data_Mining_09-10/ST4003-Lab4-New_Tree_Data_Set_and_Loss_Matrices.pdf
Churn Data w/ randomForest
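A minimal randomForest counterpart to the earlier CART model (same assumed churn.csv):

  library(randomForest)

  churn <- read.csv("churn.csv")
  churn$churn <- as.factor(churn$churn)   # classification needs a factor target

  set.seed(42)
  churn_rf <- randomForest(churn ~ ., data = churn, ntree = 500, importance = TRUE)

  print(churn_rf)         # confusion matrix & OOB error estimate
  varImpPlot(churn_rf)    # variable importance: 'all predictors are not needed'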
KAGGLE COMPETITIONS
“I keep saying the sexy job in the next ten years will be statisticians.”
– Hal Varian, Google Chief Economist, 2009
Mismatch between those with data and those with the skills to analyse it
Crowdsourcing
[Chart: Tourism Forecasting Competition – Forecast Error (MASE) over time (Aug 9, 2 weeks later, 1 month later, Competition End) vs. the existing model]
[Chart: Chess Ratings Competition – Error Rate (RMSE) over time (Aug 4, 1 month later, 2 months later, Today) vs. the existing model (ELO)]
12,500 “Amateur” Data Scientists with different backgrounds
[Charts: popularity of analytics tools (R, Matlab, SAS, WEKA, SPSS, Python, Excel, Mathematica, Stata) – R on Kaggle, among academics, and among Americans; successful grant applications ~25%]
Ref: Anthony's Kaggle Presentation
NASA tried, now it’s our turn
Mapping Dark Matter is an image analysis competition whose aim is to encourage the development of new algorithms that can be applied to the challenge of measuring the tiny distortions in galaxy images caused by dark matter.
“In less than a week, Martin O’Leary, a PhD student in glaciology, outperformed the state-of-the-art algorithms”
“The world’s brightest physicists have been working for decades on solving one of the great unifying problems of our universe”
Who to hire?

Why Participants Compete
• Clean, real-world data
• Professional reputation & experience
• Interactions with experts in related fields
• Prizes
Competition Mechanics
• Use the wizard to post a competition
• Participants make their entries
• Competitions are judged on objective criteria, based on predictive accuracy
THE FORD COMPETITION
The Anatomy of a KAGGLE COMPETITION
Ford Challenge – DataSet
• Goal: predict driver alertness
• Predictors:
  ◦ Physiological – P1 .. P8
  ◦ Environmental – E1 .. E11
  ◦ Vehicular – V1 .. V11
• Target: IsAlert?
• Data statistics are meaningless outside the IsAlert context
Ford Challenge – DataSet Files
• Three files:
  ◦ ford_train – 510 trials, ~1,200 observations each spaced 0.1 sec apart → 604,330 rows
  ◦ ford_test – 100 trials, ~1,200 observations/trial, 120,841 rows
  ◦ example_submission.csv
A Plan – Submissions & Results
• Raw data, all variables – rpart
• Raw data, selected variables – rpart
• All variables – glm
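A sketch of that plan end to end, assuming the files described above and the TrialID / ObsNum / IsAlert column names from the Ford data (check example_submission.csv for the exact submission format):

  library(rpart)

  train <- read.csv("ford_train.csv")
  test  <- read.csv("ford_test.csv")

  # Raw data, all variables: CART
  fit_tree <- rpart(IsAlert ~ . - TrialID - ObsNum, data = train, method = "class")

  # All variables: logistic regression
  fit_glm <- glm(IsAlert ~ . - TrialID - ObsNum, data = train, family = binomial)

  # Score the test set and write an entry
  pred <- predict(fit_glm, test, type = "response")
  write.csv(data.frame(TrialID = test$TrialID, ObsNum = test$ObsNum,
                       Prediction = pred),
            "submission.csv", row.names = FALSE)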
How the Ford Competition Was Won
• "How I Did It" blog posts:
  ◦ http://blog.kaggle.com/2011/03/25/inference-on-winning-the-ford-stay-alert-competition/
  ◦ http://blog.kaggle.com/2011/04/20/mick-wagner-on-finishing-second-in-the-ford-challenge/
  ◦ http://blog.kaggle.com/2011/03/16/junpei-komiyama-on-finishing-4th-in-the-ford-competition/
How the Ford Competition Was Won
• Junpei Komiyama (#4)
  ◦ "To solve this problem, I constructed a Support Vector Machine (SVM), which is one of the best tools for classification and regression analysis, using the libSVM package."
  ◦ This approach took more than 3 hours to complete
  ◦ "I found some data (P3-P6) were characterized by strong noise... Also, many environmental and vehicular data showed discrete values continuously increased and decreased. These suggested the necessity of pre-processing the observation data before SVM analysis for better performance."
How the Ford Competition Was Won
• Junpei Komiyama (#4), continued
  ◦ Averaging improved both the score and the processing time
  ◦ Averaging 7 data points reduced processing by 86% & increased the score by 0.01
  ◦ Tools: Python for CSV processing, libSVM
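A sketch of that kind of averaging: collapsing every 7 consecutive observations of a trial into one averaged row, which is roughly the 86% reduction he reports (trial is assumed to hold one trial's numeric columns, in time order):

  block  <- (seq_len(nrow(trial)) - 1) %/% 7    # 0,0,...,0,1,1,... in groups of 7
  trial7 <- aggregate(trial, by = list(block = block), FUN = mean)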
How the Ford Competition Was Won
• Mick Wagner (#2)
  ◦ Tools: Excel, SQL Server
  ◦ "I spent the majority of my time analyzing the data. I inputted the data into Excel and started examining the data, taking note of discrete and continuous values, category-based parameters, and simple statistics (mean, median, variance, coefficient of variance). I also looked for extreme outliers."
  ◦ "I made the first 150 trials (~30%) be my test data and the remainder be my training dataset (~70%). This single factor had the largest impact on the accuracy of my final model."
  ◦ He was concerned that using the entire dataset would create too much noise and lead to inaccuracies in the model, so he focused on data with state changes
How the Ford Competition Was Won
• Mick Wagner (#2), continued
  ◦ "After testing the Decision Tree and Neural Network algorithms against each other and submitting models to Kaggle, I found the Neural Network model to be more accurate."
  ◦ Only used E4, E5, E6, E7, E8, E9, E10, P6, V4, V6, V10, and V11
How the Ford Competition Was Won
• Inference (#1)
  ◦ Very interesting
  ◦ "Our first observation is that trials are not homogeneous – so we calculated mean, sd et al."
  ◦ "The training set & test set are not from the same population" – a good fit on training will result in a low score
  ◦ The lucky model (regression): -410.6073·sd(E5) + 0.1494·V11 + 4.4185·E9
  ◦ (Remember – the data had P1-P8, E1-E11, V1-V11)
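A sketch of scoring that model: sd(E5) is a per-trial statistic, so it has to be broadcast back onto every row of the trial (TrialID / E5 / V11 / E9 column names as in the Ford data):

  # Per-trial standard deviation of E5, repeated for each observation in the trial
  sdE5 <- ave(test$E5, test$TrialID, FUN = sd)

  # The published 'lucky' regression
  pred <- -410.6073 * sdE5 + 0.1494 * test$V11 + 4.4185 * test$E9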
HOW THE RTA WAS WON
“This competition requires participants to predict travel time on Sydney's M4 freeway from past travel time observations.”
• Thanks to François GUILLEM & Andrzej Janusz
• They both used R
• They shared their code & algorithms
How the RTA Was Won
• "I effectively used R for the RTA competition. For my best submission, I just used simple techniques (OLS and means), but in a clever way." – François GUILLEM (#14)
• "I used a simple k-NN approach, but the idea was to process the data first & to compute some summaries of time series in consecutive timestamps, using some standard indicators from technical analysis." – Andrzej Janusz (#17)
How the RTA Was Won
• #1 used Random Forests, with Time, Date & Week as predictors – José P. González-Brenes and Matías Cortés
• Marcin Pionnier (#5): regression models for data segments (~600 in total!)
  ◦ Tools: Java/Weka; 4 processors, 12 GB RAM; 48 hours of computation
Ref: http://blog.kaggle.com/2011/02/17/marcin-pionnier-on-finishing-5th-in-the-rta-competition/
Ref: http://blog.kaggle.com/2011/03/25/jose-p-gonzalez-brenes-and-matias-cortes-on-winning-the-rta-challenge/
THE HHP
Time check: should be ~2:40!
Lessons from Kaggle Winners
1. Don't over-fit
2. All predictors are not needed
3. All data rows are not needed, either
4. Tuning the algorithms will give different results
5. Reduce the dataset (average, select transition data, …)
6. The test set & training set can differ
7. Iteratively explore & get your head around the data
8. Don't be afraid to submit simple solutions
9. Keep a tab on & a history of your submissions
The Competition
“The goal of the prize is to develop a predictive algorithm that can identify patients who will be admitted to the hospital within the next year, using historical claims data”
TimeLine
Data Organization
• Members – MemberID, AgeAtFirstClaim, Sex
  ◦ 113,000 entries; missing values
• Claims – MemberID, ProviderID, Vendor, PCP, Year, Specialty, PlaceOfSvc, PayDelay, LengthOfStay, DaysSinceFirstClaimThatYear, PrimaryConditionGroup, CharlsonIndex, ProcedureGroup, SupLOS
  ◦ 2,668,990 entries; missing values; different coding
  ◦ PayDelay is capped at 162+; SupLOS – LengthOfStay is suppressed during the de-identification process for some entries
• LabCount – MemberID, Year, DSFS, LabCount
  ◦ 361,485 entries; fairly consistent coding (10+)
• DrugCount – MemberID, Year, DSFS, DrugCount
  ◦ 818,242 entries; fairly consistent coding (10+)
• DaysInHospital Y2, DaysInHospital Y3, DaysInHospital Y4 (target) – MemberID, ClaimsTruncated, DaysInHospital
  ◦ 76,039 entries (Y2), 71,436 entries (Y3), 70,943 entries (Y4); lots of zeros
Calculation & Prizes
• Judged on prediction error rate
• Milestone deadlines: Aug 31, 2011; Feb 13, 2012; Sep 04, 2012
• Final deadline: Apr 04, 2013, 06:59:59 UTC
HHP ANALYTICS
Now it is our turn …
POA
• Load the data into SQLite
• Use SQL to de-normalize & pick out datasets
• Load them into R for analytics
• Total/distinct counts:
  ◦ Claims = 2,668,991 / 113,001
  ◦ Members = 113,001
  ◦ Drug = 818,242 / 75,999 ← unique = 141,532 / 75,999 (test)
  ◦ Lab = 361,485 / 86,640 ← unique = 154,935 / 86,640 (test)
  ◦ dih_y2 = 76,039 distinct / 11,770 with dih > 0
  ◦ dih_y3 = 71,436 distinct / 10,730 with dih > 0
  ◦ dih_y4 = 70,943 distinct
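One way to wire that up, assuming the HHP CSVs have already been imported into an SQLite file hhp.db with tables named claims, members, etc. (a sketch with the RSQLite package; table & column names are assumptions):

  library(RSQLite)

  con <- dbConnect(SQLite(), "hhp.db")

  # De-normalize / summarize in SQL, then pull the result into an R data frame
  claims_per_member <- dbGetQuery(con,
    "select member_id, year, count(*) as n_claims
       from claims group by member_id, year")

  dbDisconnect(con)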
Idea #1
• dih_Y2 = β0 + β1·dih_Y1 + β2·DC + β3·LC
• dih_Y3 = β0 + β1·dih_Y2 + β2·DC + β3·LC
• dih_Y4 = β0 + β1·dih_Y3 + β2·DC + β3·LC
• select count(*) from dih_y2 join dih_y3 on dih_y2.member_id = dih_y3.member_id;
  ◦ Y2-Y3 overlap = 51,967 members (8,339 with dih_y2 > 0); Y3-Y4 overlap = 49,683 (7,699 with dih_y3 > 0)
• The data is not straightforward to get into this shape:
  ◦ Summarize drug and lab counts by member & year
  ◦ Split by year to get DC & LC per year
  ◦ Add them to the dih_Yx tables
  ◦ Run the linear regression
Some SQL for Idea #1
• create table drug_tot as select member_id, year, total(drug_count) from drug_count group by member_id, year order by member_id, year; ← total drug count per year for each member
• Same for lab_tot
• create table drug_tot_y1 as select * from drug_tot where year = 'Y1'
• … likewise for Y2 & Y3, and Y1, Y2, Y3 for lab_tot
• … then join with the dih_yx tables
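Once those totals are joined onto the dih tables, Idea #1 is a one-line regression in R; a sketch assuming a joined data frame y2y3 with columns dih_y3, dih_y2, dc_y2 and lc_y2 (all names are assumptions):

  # dih_Y3 = b0 + b1*dih_Y2 + b2*DC + b3*LC
  fit <- lm(dih_y3 ~ dih_y2 + dc_y2 + lc_y2, data = y2y3)
  summary(fit)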
Idea #2
• Add claims at Yx to the Idea #1 equations:
  dih_Yn = β0 + β1·dih_Y(n-1) + β2·DC(n-1) + β3·LC(n-1) + β4·Claim(n-1)
• Then we have to define the criteria for Claim(n-1) from the claim predictors, viz. PrimaryConditionGroup, CharlsonIndex and ProcedureGroup
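Purely as an illustration of one possible Claim(n-1) criterion: recode CharlsonIndex (stored as ranges like "1-2") to numbers and take the worst value per member & year. The recoding map and column names are assumptions:

  # Hypothetical numeric recoding of the CharlsonIndex ranges
  ch_map <- c("0" = 0, "1-2" = 1.5, "3-4" = 3.5, "5+" = 5)
  claims$ch_num <- ch_map[as.character(claims$CharlsonIndex)]

  # Worst CharlsonIndex per member & year as a candidate Claim term
  claim_sev <- aggregate(ch_num ~ member_id + year, data = claims, FUN = max)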
The Beginning As the End
• We started with a set of goals
• Homework
  ◦ For me: finish the hands-on walkthrough & post it in ~10 days
  ◦ For you: go through the slides, do the walkthrough, and submit entries to Kaggle
IDE <- RStudio
R_Packages <- c(plyr, rattle, rpart, randomForest)
R_Search <- http://www.rseek.org/ (powered by Google)
Questions?!
I enjoyed preparing the materials a lot … hope you enjoyed attending even more …