R, Data Wrangling & Kaggle Data Science Competitions


DESCRIPTION

Presentation for my tutorial at Big Data Tech Con (http://goo.gl/ZRoFHi). This is the R version of my PyCon tutorial, plus a few updates. It is a work in progress; I will update it with daily snapshots until done.

TRANSCRIPT

Page 1: R, Data Wrangling & Kaggle Data Science Competitions

R, Data Wrangling & Data Science Competitions

October 28, 2014

@ksankar // doubleclix.wordpress.com

http://www.bigdatatechcon.com/tutorials.html#RDataWranglingandDataScienceCompetitions

“I want to die on Mars but not on impact” – Elon Musk, interview with Chris Anderson

“The shrewd guess, the fertile hypothesis, the courageous leap to a tentative conclusion – these are the most valuable coin of the thinker at work” – Jerome Seymour Bruner

“There are no facts, only interpretations.” – Friedrich Nietzsche

Page 2: R, Data Wrangling & Kaggle Data Science Competitions

Étude

http://en.wikipedia.org/wiki/%C3%89tude, http://www.etudesdemarche.net/articles/etudes-sectorielles.htm, http://upload.wikimedia.org/wikipedia/commons/2/26/La_Cour_du_Palais_des_%C3%A9tudes_de_l%E2%80%99%C3%89cole_des_beaux-arts.jpg

We will focus on “short”, “acquiring skill” & “having fun” !

Page 3: R, Data Wrangling & Kaggle Data Science Competitions

Goals & non-goals

Goals
¤ Improve ML skills by participating in Kaggle competitions
¤ Catalyst for you to widen the ML experience
¤ Give you a focused time to work thru submissions
  § Work with me. I will wait if you want to catch up
  § Submit a few solutions
¤ Less theory, more usage – let us see if this works
¤ As straightforward as possible
  § The programs can be optimized

Non-goals
¡ Go deep into the algorithms
  • We don’t have sufficient time. The topic could easily be a 5-day tutorial !
¡ Dive into R internals
  • That is for another day
¡ A passive talk
  • Nope. Interactive & hands-on

Page 4: R, Data Wrangling & Kaggle Data Science Competitions

Activities & Results

o Activities:
  • Get familiar with the mechanics of Data Science Competitions
  • Explore the intersection of Algorithms, Data, Intelligence, Inference & Results
  • Discuss Data Science Horse Sense ;o)

o Results:
  • Submitted entries for 3 competitions
  • Applied algorithms on Kaggle data
    • CART, RF
    • Linear Regression
    • SVM
  • Knowledge of model evaluation
    • Cross Validation, ROC Curves

Page 5: R, Data Wrangling & Kaggle Data Science Competitions

About Me

o Chief Data Scientist at BlackArrow.tv
o Have been speaking at OSCON, PyCon, PyData et al
o Reviewing the Packt book “Machine Learning with Spark”
o Picked up co-authorship of the second edition of “Fast Data Processing with Spark”
o Have done lots of things:
  • Big Data (Retail, Bioinformatics, Financial, AdTech)
  • Written books (Web 2.0, Wireless, Java, …)
  • Standards, some work in AI
  • Guest Lecturer at Naval PG School, …
  • Planning MS-CFinance or Statistics
  • Volunteer as Robotics Judge at FIRST Lego League World Competitions
o @ksankar, doubleclix.wordpress.com

The Nuthead band !

Page 6: R, Data Wrangling & Kaggle Data Science Competitions

Setup & Data

R & IDE
o Install R
o Install RStudio

Tutorial Materials
o Github : https://github.com/xsankar/cloaked-ironman
o Clone or download the zip

Kaggle
o Set up an account on Kaggle (www.kaggle.com)

Data
We will be using the data from 3 Kaggle competitions:
① Titanic: Machine Learning from Disaster
   • Download data from http://www.kaggle.com/c/titanic-gettingStarted
   • Directory ~/cloaked-ironman/titanic-r
② Predicting Bike Sharing @ Washington DC
   • Download data from http://www.kaggle.com/c/bike-sharing-demand/data
   • Directory ~/cloaked-ironman/bike
③ Walmart Store Forecasting
   • Download data from http://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/data
   • Directory ~/cloaked-ironman/walmart

Page 7: R, Data Wrangling & Kaggle Data Science Competitions

Agenda [1 of 2]

o Oct 28 : 8:00-11:30, 3 hrs
o Intro, Goals, Logistics, Setup [10 min] [8:00-8:10)
o Anatomy of a Kaggle Competition [40 min] [8:10-8:50)
  • Competition mechanics
  • Register, download data, create sub-directories
  • Trial run : submit Titanic
o Algorithms for the Amateur Data Scientist [20 min] [8:50-9:10)
  • Algorithms, tools & frameworks in perspective
  • “Folk Wisdom”
o Model Evaluation & Interpretation [20 min] [9:10-9:30)
  • Confusion Matrix, ROC Graph

Page 8: R, Data Wrangling & Kaggle Data Science Competitions

Agenda [2 of 2]

o Break (30 min) [9:30-10:00)
o Session 1 : The Art of a Competition – Bike Sharing [45 min] [10:00-10:45)
  • Dataset organization
  • Analytics walkthrough
  • Algorithms – ctree, SVM
  • Feature extraction
  • Hands-on analytics programming of the challenge
  • Submit entry
o Session 2 : The Art of a Competition – Walmart [30 min] [10:45-11:15)
  • Dataset organization
  • Analytics walkthrough, transformations
o Questions [11:15-11:30)

Page 9: R, Data Wrangling & Kaggle Data Science Competitions

Overload Warning … There is enough material here for a week’s training … which is good & bad ! Read through at your own pace; refer, ponder & internalize.

Page 10: R, Data Wrangling & Kaggle Data Science Competitions

Close Encounters

◦ 1st : This tutorial
◦ 2nd : Do more hands-on walkthroughs
◦ 3rd : Listen to lectures; more competitions …

Page 11: R, Data Wrangling & Kaggle Data Science Competitions

Anatomy Of a Kaggle Competition

Page 12: R, Data Wrangling & Kaggle Data Science Competitions

Kaggle Data Science Competitions

o Kaggle hosts Data Science Competitions
o Competition attributes:
  • Dataset
    • Train
    • Test (Submission)
    • Final evaluation data set (we don’t see this)
  • Rules
  • Time boxed
  • Leaderboard
  • Evaluation function
  • Discussion forum
  • Private or public

Page 13: R, Data Wrangling & Kaggle Data Science Competitions

Titanic Passenger Metadata
• Small
• 3 predictors
  • Class
  • Sex
  • Age
• Survived?

http://www.ohgizmo.com/2007/03/21/romain-jerome-titanic
http://flyhigh-by-learnonline.blogspot.com/2009/12/at-movies-sherlock-holmes-2009.html

City Bike Sharing Prediction (Washington DC)

Walmart Store Forecasting

Page 14: R, Data Wrangling & Kaggle Data Science Competitions

Train.csv : taken from the Titanic passenger manifest

Variable : Description
Survived : 0 = No, 1 = Yes
Pclass : Passenger class (1st, 2nd, 3rd)
Sibsp : Number of siblings/spouses aboard
Parch : Number of parents/children aboard
Embarked : Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Titanic Passenger Metadata
• Small
• 3 predictors
  • Class
  • Sex
  • Age
• Survived?

Page 15: R, Data Wrangling & Kaggle Data Science Competitions

Test.csv  

Submission

o 418 lines; 1st column should have 0 or 1 in each line
o Evaluation : % correctly predicted

Page 16: R, Data Wrangling & Kaggle Data Science Competitions

Approach

o This is a classification problem (0 or 1)
o Comb the forums !
o Opportunity for us to try different algorithms & compare them
  • Simple model
  • CART [Classification & Regression Tree]
    • Greedy, top-down, binary, recursive partitioning that divides the feature space into sets of disjoint rectangular regions
  • RandomForest
    • Different parameters
  • SVM
    • Multiple kernels
  • Table the results
o Use cross-validation to predict our model performance & correlate with what Kaggle says

http://www.dabi.temple.edu/~hbling/8590.002/Montillo_RandomForests_4-2-2009.pdf

Page 17: R, Data Wrangling & Kaggle Data Science Competitions

Simple Model – Our First Submission

o #1 : Simple Model (M=survived)

o #2 : Simple Model (F=survived)

https://www.kaggle.com/c/titanic-gettingStarted/details/getting-started-with-python

Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/cloaked-ironman/
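A minimal sketch of what these two baselines could look like in R (the tutorial’s actual code lives in 1-Intro_to_Kaggle.R; file and column names below follow the standard Kaggle Titanic data):

  train <- read.csv("train.csv")     # Titanic training data, from ~/cloaked-ironman/titanic-r
  test  <- read.csv("test.csv")

  # #1 : assume no one survives; #2 : assume all women survive
  test$Survived <- 0
  test$Survived[test$Sex == "female"] <- 1

  # Kaggle wants PassengerId plus the predicted Survived column
  submission <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
  write.csv(submission, "gender_model_submission.csv", row.names = FALSE)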

Page 18: R, Data Wrangling & Kaggle Data Science Competitions

#3 : Simple CART Model

o CART (Classification & Regression Tree)

http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Classification/Decision_Trees

May be better, because we have improved on the survival of men !

Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/cloaked-ironman/
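A hedged sketch of such a CART model with rpart (the predictor list is illustrative; the tutorial’s own version is in 1-Intro_to_Kaggle.R):

  library(rpart)

  # Fit a classification tree; rpart handles missing Age values via surrogate splits
  fit <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
               data = train, method = "class")
  plot(fit); text(fit)                                  # inspect the splits
  pred <- predict(fit, newdata = test, type = "class")  # 0/1 predictions for the test set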

Page 19: R, Data Wrangling & Kaggle Data Science Competitions

#4 : Random Forest Model

o https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience
  • Chris Clark : http://blog.kaggle.com/2012/07/02/up-and-running-with-python-my-first-kaggle-entry/

o  https://www.kaggle.com/c/titanic-gettingStarted/details/getting-started-with-random-forests

o  https://github.com/RahulShivkumar/Titanic-Kaggle/blob/master/titanic.py

Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/cloaked-ironman/
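A possible randomForest version; this is only a sketch of what 1-Intro_to_Kaggle.R does, and the median imputation is an assumption made here so the code runs end to end:

  library(randomForest)

  # randomForest() cannot handle NAs, so impute Age (and the one missing test Fare) first
  train$Age[is.na(train$Age)]   <- median(train$Age, na.rm = TRUE)
  test$Age[is.na(test$Age)]     <- median(train$Age, na.rm = TRUE)
  test$Fare[is.na(test$Fare)]   <- median(train$Fare, na.rm = TRUE)

  set.seed(42)
  fit <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare,
                      data = train, ntree = 500, importance = TRUE)
  varImpPlot(fit)                       # which predictors carry the signal ?
  pred <- predict(fit, newdata = test)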

Page 20: R, Data Wrangling & Kaggle Data Science Competitions

#5 : SVM

o Multiple kernels
  • kernel = ‘radial’   # Radial Basis Function
  • kernel = ‘sigmoid’

o  agconti's blog - Ultimate Titanic !

o  http://fastly.kaggle.net/c/titanic-gettingStarted/forums/t/5105/ipython-notebook-tutorial-for-titanic-machine-learning-from-disaster/29713

Refer to 1-Intro_to_Kaggle.R at https://github.com/xsankar/cloaked-ironman/
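An illustrative e1071 sketch trying the two kernels mentioned above (not necessarily the exact call used in 1-Intro_to_Kaggle.R; it assumes Age/Fare have been imputed as in the random forest step):

  library(e1071)

  # Radial basis function kernel
  fit_rbf <- svm(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare,
                 data = train, kernel = "radial")

  # Sigmoid kernel, for comparison
  fit_sig <- svm(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare,
                 data = train, kernel = "sigmoid")

  pred <- predict(fit_rbf, newdata = test)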

Page 21: R, Data Wrangling & Kaggle Data Science Competitions

Feature Engineering - Homework

o Add attribute : Age
  • In train 714/891 have age; in test 332/418 have age
  • Missing values can simply be the mean age of all passengers (see the sketch below)
  • We could be more precise and calculate the mean age based on title (Ms, Mrs, Master et al)
  • Box plot age
o Add attributes : Mother, Family size et al
o Feature engineering ideas
  • http://www.kaggle.com/c/titanic-gettingStarted/forums/t/6699/sharing-experiences-about-data-munging-and-classification-steps-with-python
o More ideas at http://statsguys.wordpress.com/2014/01/11/data-analytics-for-beginners-pt-2/
o And https://github.com/wehrley/wehrley.github.io/blob/master/SOUPTONUTS.md
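One possible way to do the age imputation described above; the title-extraction regex and the choice of mean (rather than median) are assumptions for illustration:

  # Option 1 : fill missing ages with the overall mean
  age_filled <- ifelse(is.na(train$Age), mean(train$Age, na.rm = TRUE), train$Age)

  # Option 2 : be more precise - mean age per title (Mr, Mrs, Miss, Master, ...)
  train$Title <- gsub("(.*, )|(\\..*)", "", train$Name)   # pull the title out of Name
  mean_by_title <- tapply(train$Age, train$Title, mean, na.rm = TRUE)
  idx <- is.na(train$Age)
  train$Age[idx] <- mean_by_title[train$Title[idx]]

  # Box plot age, e.g. against passenger class
  boxplot(Age ~ Pclass, data = train)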

Page 22: R, Data Wrangling & Kaggle Data Science Competitions

What does it mean ? Let us ponder ….

o We have a training data set representing a domain
  • We reason over the dataset & develop a model to predict outcomes
o How good is our prediction when it comes to real-life scenarios ?
o The assumption is that the dataset is taken at random
  • Or is it ? Is there a sampling bias ?
  • i.i.d. ? Independent ? Identically distributed ?
  • What about homoscedasticity ? Do they have the same finite variance ?
o Can we assure that another dataset (from the same domain) will give us the same result ?
o Will our model & its parameters remain the same if we get another dataset ?
o How can we evaluate our model ?
o How can we select the right parameters for a selected model ?

Page 23: R, Data Wrangling & Kaggle Data Science Competitions

Algorithms for the Amateur Data Scientist

“A towel is about the most massively useful thing an interstellar hitchhiker can have … any man who can hitch the length and breadth of the Galaxy, rough it … win through, and still know where his towel is, is clearly a man to be reckoned with.”

- From The Hitchhiker's Guide to the Galaxy, by Douglas Adams.

Algorithms ! The Most Massively useful thing an Amateur Data Scientist can have …

8:50

Page 24: R, Data Wrangling & Kaggle Data Science Competitions

Ref: Anthony’s Kaggle Presentation

Data Scientists apply different techniques:

• Support Vector Machines
• AdaBoost
• Bayesian Networks
• Decision Trees
• Ensemble Methods
• Random Forest
• Logistic Regression
• Genetic Algorithms
• Monte Carlo Methods
• Principal Component Analysis
• Kalman Filter
• Evolutionary Fuzzy Modelling
• Neural Networks

Quora
• http://www.quora.com/What-are-the-top-10-data-mining-or-machine-learning-algorithms

Page 25: R, Data Wrangling & Kaggle Data Science Competitions

Algorithm spectrum

o Regression
o Logit
o CART
o Ensemble : Random Forest
o Clustering
o KNN
o Genetic Algorithms
o Simulated Annealing
o Collaborative Filtering
o SVM
o Kernels
o SVD
o NNet
o Boltzmann Machine
o Feature Learning

[Spectrum] Machine Learning – Cute Math – Artificial Intelligence

Page 26: R, Data Wrangling & Kaggle Data Science Competitions

Classifying Classifiers

Statistical
  • Regression, Logistic Regression¹, Naïve Bayes, Bayesian Networks

Structural
  • Rule-based : Production Rules, Decision Trees
  • Distance-based
    • Functional : Linear, Spectral, Wavelet
    • Nearest Neighbor : kNN, Learning Vector Quantization
  • Neural Networks : Multi-layer Perceptron

Ensemble
  • Random Forests, SVM, Boosting

¹ Max Entropy Classifier

Ref: Algorithms of the Intelligent Web, Marmanis & Babenko

Page 27: R, Data Wrangling & Kaggle Data Science Competitions

[Diagram] Classifiers: Regression (continuous variables) vs. Decision Trees / CART, k-NN (Nearest Neighbors), Boosting & Bagging (categorical variables), positioned along axes of bias vs. variance and model complexity / over-fitting.

Page 28: R, Data Wrangling & Kaggle Data Science Competitions

Data Science “folk knowledge”

Page 29: R, Data Wrangling & Kaggle Data Science Competitions

Data Science “folk knowledge” (1 of A)

o "If you torture the data long enough, it will confess to anything." – Hal Varian, Computer Mediated Transactions
o Learning = Representation + Evaluation + Optimization
o It’s generalization that counts
  • The fundamental goal of machine learning is to generalize beyond the examples in the training set
o Data alone is not enough
  • Induction, not deduction – every learner should embody some knowledge or assumptions beyond the data it is given in order to generalize beyond it
o Machine learning is not magic – one cannot get something from nothing
  • In order to infer, one needs the knobs & the dials
  • One also needs a rich, expressive dataset

A few useful things to know about machine learning – by Pedro Domingos, http://dl.acm.org/citation.cfm?id=2347755

Page 30: R, Data Wrangling & Kaggle Data Science Competitions

Data Science “folk knowledge” (2 of A)

o Overfitting has many faces
  • Bias – model not strong enough, so the learner has the tendency to learn the same wrong things
  • Variance – learning too much from one dataset; the model will fall apart (i.e. be much less accurate) on a different dataset
  • Sampling bias
o Intuition fails in high dimensions – Bellman
  • Blessing of non-uniformity & lower effective dimension; many applications have examples not spread uniformly but concentrated near a lower-dimensional manifold, e.g. the space of digits is much smaller than the space of images
o Theoretical guarantees are not what they seem
  • One of the major developments of recent decades has been the realization that we can have guarantees on the results of induction, particularly if we are willing to settle for probabilistic guarantees
o Feature engineering is the key

A few useful things to know about machine learning – by Pedro Domingos, http://dl.acm.org/citation.cfm?id=2347755

Page 31: R, Data Wrangling & Kaggle Data Science Competitions

Data Science “folk knowledge” (3 of A)

o More data beats a cleverer algorithm
  • Or, conversely, select algorithms that improve with data
  • Don’t optimize prematurely without getting more data
o Learn many models, not just one
  • Ensembles ! – change the hypothesis space
  • Netflix prize
  • E.g. Bagging, Boosting, Stacking
o Simplicity does not necessarily imply accuracy
o Representable does not imply learnable
  • Just because a function can be represented does not mean it can be learned
o Correlation does not imply causation

o http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o A few useful things to know about machine learning – by Pedro Domingos
  § http://dl.acm.org/citation.cfm?id=2347755

Page 32: R, Data Wrangling & Kaggle Data Science Competitions

Data Science “folk knowledge” (4 of A)

o The simplest hypothesis that fits the data is also the most plausible
  • Occam’s Razor
  • Don’t go for a 4-layer neural network unless you have data that complex
  • But that doesn’t mean one should always choose the simplest hypothesis
  • Match the impedance of the domain, the data & the algorithms
o Think of overfitting as memorizing, as opposed to learning
o Data leakage has many forms
o Sometimes the absence of something is everything
o [Corollary] Absence of evidence is not evidence of absence

New to Machine Learning? Avoid these three mistakes, James Faghmous, https://medium.com/about-data/73258b3848a4

Learning curves (Ref: Andrew Ng/Stanford, Yaser S./CalTech):
§ Simple model – high error line that cannot be compensated with more data, but it gets to a lower error rate with fewer data points
§ Complex model – lower error line, but it needs more data points to reach a decent error

Page 33: R, Data Wrangling & Kaggle Data Science Competitions

Importance of feature selection & weak models

o “Good features allow a simple model to beat a complex model” – Ben Lorica¹

o “… using many weak predictors will always be more accurate than using a few strong ones …” – Vladimir Vapnik²

o “A good decision rule is not a simple one, it cannot be described by a very few parameters” ²

o “Machine learning science is not only about computers, but about humans, and the unity of logic, emotion, and culture.” ²

o “Visualization can surprise you, but it doesn’t scale well. Modeling scales well, but it can’t surprise you” – Hadley Wickham³

¹ http://radar.oreilly.com/2014/06/streamlining-feature-engineering.html
² http://nautil.us/issue/6/secret-codes/teaching-me-softly
³ http://www.johndcook.com/blog/2013/02/07/visualization-modeling-and-surprises/

Updated Slide

Page 34: R, Data Wrangling & Kaggle Data Science Competitions

Check your assumptions

o The decisions a model makes are directly related to its assumptions about the statistical distribution of the underlying data
o For example, for regression one should check that:
  ① Variables are normally distributed
    • Test for normality via visual inspection, skew & kurtosis, outlier inspection via plots, z-scores et al
  ② There is a linear relationship between the dependent & independent variables
    • Inspect residual plots, try quadratic relationships, try log plots et al
  ③ Variables are measured without error
  ④ Assumption of homoscedasticity
    § Homoscedasticity assumes constant or near-constant error variance
    § Check the standard residual plots and look for heteroscedasticity
    § For example, in the referenced figure the left box has the errors scattered randomly around zero, while the right two diagrams have the errors unevenly distributed

A short R sketch of these checks follows below.

Jason W. Osborne and Elaine Waters, Four assumptions of multiple regression that researchers should always test, http://pareonline.net/getvn.asp?v=8&n=2
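A small R sketch of these checks, using the built-in mtcars data as a stand-in for your own regression problem:

  # Fit a linear model on the stand-in data
  fit <- lm(mpg ~ wt + hp, data = mtcars)

  par(mfrow = c(2, 2))
  plot(fit)                      # residuals vs fitted, normal Q-Q, scale-location, leverage

  # Normality of residuals : visual inspection plus a formal test
  hist(residuals(fit))
  shapiro.test(residuals(fit))

  # Heteroscedasticity : look for a funnel shape in the residuals-vs-fitted plot;
  # a Breusch-Pagan test (lmtest::bptest(fit)) is another option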

Page 35: R, Data Wrangling & Kaggle Data Science Competitions

Data Science “folk knowledge” (5 of A)

Donald Rumsfeld is an armchair Data Scientist !
http://smartorg.com/2013/07/valuepoint19/

The 2x2 grid – The World (Knowns / Unknowns) vs. You (Known / Unknown):
o You know & the world knows : what we do
o You don’t know but others know : facts others have that you don’t
o Known Unknowns : potential facts & outcomes we are aware of, but not with certainty; stochastic processes, probabilities
o Unknown Unknowns : facts, outcomes or scenarios we have not encountered, nor considered; “black swans”, outliers, long tails of probability distributions; lack of experience, imagination

o Known Knowns – there are things we know that we know
o Known Unknowns – that is to say, there are things that we now know we don’t know
o But there are also Unknown Unknowns – there are things we do not know we don’t know

Page 36: R, Data Wrangling & Kaggle Data Science Competitions

Data Science “folk knowledge” (6 of A) - Pipeline

Pipeline : Collect – Store – Transform – Reason – Model – Deploy
(Data Management covers Collect / Store / Transform; Data Science covers Reason / Model / Deploy)

o Collect : volume, velocity, streaming data
o Store : canonical form, data catalog, data fabric across the organization, access to multiple sources of data; think hybrid – Big Data apps, appliances & infrastructure
o Transform : metadata, monitor counters & metrics, structured vs. multi-structured
o Reason : flexible & selectable data subsets & attribute sets; dynamic data sets, 2-way key-value tagging of datasets, extended attribute sets, advanced analytics
o Model : refine the model with extended data subsets & engineered attribute sets; validation run across a larger data set
o Deploy : scalable model deployment, Big Data automation & purpose-built appliances (soft/hard), manage SLAs & response times

Explore / Visualize / Recommend / Predict :
o Performance, scalability, refresh latency, in-memory analytics
o Advanced visualization, interactive dashboards, map overlay, infographics

¤ Bytes to Business a.k.a. build the full stack
¤ Find relevant data for the business
¤ Connect the dots

Page 37: R, Data Wrangling & Kaggle Data Science Competitions

Data Science “folk knowledge” (7 of A)

Volume, Velocity, Variety : “data of unusual size” that can’t be brute-forced

Context, Connectedness, Intelligence, Interface, Inference

o The Three Amigos : Interface, Intelligence, Inference
o Interface = Cognition
o Intelligence = Compute (CPU) & Computational (GPU)
o Infer significance & causality

Page 38: R, Data Wrangling & Kaggle Data Science Competitions

Data Science “folk knowledge” (8 of A) – Jeremy’s Axioms

o Iteratively explore data
o Tools : Excel format, Perl, Perl Book
o Get your head around the data : Pivot Table
o Don’t over-complicate
o If people give you data, don’t assume that you need to use all of it
o Look at pictures !
o History of your submissions – keep a tab
o Don’t be afraid to submit simple solutions
  • We will do this during this workshop

Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/

Page 39: R, Data Wrangling & Kaggle Data Science Competitions

Data Science “folk knowledge” (9 of A)

①  Common sense (some features make more sense than others)
②  Carefully read the forums to get a peek at other people’s mindsets
③  Visualizations
④  Train a classifier (e.g. logistic regression) and look at the feature weights
⑤  Train a decision tree and visualize it
⑥  Cluster the data and look at what clusters you get out
⑦  Just look at the raw data
⑧  Train a simple classifier, see what mistakes it makes
⑨  Write a classifier using handwritten rules
⑩  Pick a fancy method that you want to apply (Deep Learning / NNet)

-- Maarten Bosma -- http://www.kaggle.com/c/stumbleupon/forums/t/5761/methods-for-getting-a-first-overview-over-the-data

Page 40: R, Data Wrangling & Kaggle Data Science Competitions

Data Science “folk knowledge” (A of A) – Lessons from Kaggle Winners

①  Don’t over-fit
②  All predictors are not needed
  • All data rows are not needed, either
③  Tuning the algorithms will give different results
④  Reduce the dataset (average, select transition data, …)
⑤  Test set & training set can differ
⑥  Iteratively explore & get your head around the data
⑦  Don’t be afraid to submit simple solutions
⑧  Keep a tab & history of your submissions

Page 41: R, Data Wrangling & Kaggle Data Science Competitions

The curious case of the Data Scientist

o A Data Scientist is multi-faceted & contextual
o A Data Scientist should be building data products
o A Data Scientist should tell a story

“Data Scientist (noun): Person who is better at statistics than any software engineer & better at software engineering than any statistician” – Josh Wills (Cloudera)

“Data Scientist (noun): Person who is worse at statistics than any statistician & worse at software engineering than any software engineer” – Will Cukierski (Kaggle)

http://doubleclix.wordpress.com/2014/01/25/the-curious-case-of-the-data-scientist-profession/

“Large is hard; Infinite is much easier !” – Titus Brown

Page 42: R, Data Wrangling & Kaggle Data Science Competitions

Essential Reading List

o A few useful things to know about machine learning – by Pedro Domingos
  • http://dl.acm.org/citation.cfm?id=2347755
o The Lack of A Priori Distinctions Between Learning Algorithms – by David H. Wolpert
  • http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/lack_of_a_priori_distinctions_wolpert.pdf
o http://www.no-free-lunch.org/
o Controlling the false discovery rate: a practical and powerful approach to multiple testing – Benjamini, Y. and Hochberg, Y.
  • http://www.stat.purdue.edu/~doerge/BIOINFORM.D/FALL06/Benjamini%20and%20Y%20FDR.pdf
o A Glimpse of Google, NASA, Peter Norvig + The Restaurant at the End of the Universe
  • http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o Avoid these three mistakes – James Faghmous
  • https://medium.com/about-data/73258b3848a4
o Leakage in Data Mining: Formulation, Detection, and Avoidance
  • http://www.cs.umb.edu/~ding/history/470_670_fall_2011/papers/cs670_Tran_PreferredPaper_LeakingInDataMining.pdf

Page 43: R, Data Wrangling & Kaggle Data Science Competitions

For your reading & viewing pleasure … An ordered List

①  An Introduction to Statistical Learning
  • http://www-bcf.usc.edu/~gareth/ISL/
②  ISL class – Stanford / Hastie / Tibshirani at their best – Statistical Learning
  • http://online.stanford.edu/course/statistical-learning-winter-2014
③  Prof. Pedro Domingos
  • https://class.coursera.org/machlearning-001/lecture/preview
④  Prof. Andrew Ng
  • https://class.coursera.org/ml-003/lecture/preview
⑤  Prof. Yaser Abu-Mostafa, CaltechX: CS1156x: Learning From Data
  • https://www.edx.org/course/caltechx/caltechx-cs1156x-learning-data-1120
⑥  mathematicalmonk @ YouTube
  • https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA
⑦  The Elements of Statistical Learning
  • http://statweb.stanford.edu/~tibs/ElemStatLearn/

http://www.quora.com/Machine-Learning/Whats-the-easiest-way-to-learn-machine-learning/

Page 44: R, Data Wrangling & Kaggle Data Science Competitions

Of Models, Performance, Evaluation & Interpretation

9:10

Page 45: R, Data Wrangling & Kaggle Data Science Competitions

Bias/Variance (1 of 2)

o Model complexity
  • A complex model increases the training-data fit
  • But then it overfits & doesn’t perform as well on real data
o Bias vs. Variance
  o Classical diagram from ESLII, by Hastie, Tibshirani & Friedman
  o Bias – the model learns the wrong things; not complex enough; small error gap; more data by itself won’t help
  o Variance – a different dataset will give a different error rate; over-fitted model; larger error gap; more data could help

[Learning-curve figure: Prediction Error vs. Training Error over model complexity]
Ref: Andrew Ng/Stanford, Yaser S./CalTech

Page 46: R, Data Wrangling & Kaggle Data Science Competitions

Bias/Variance (2 of 2)

o High bias
  • Due to underfitting
  • Add more features
  • Use a more sophisticated model (quadratic terms, complex equations, …)
  • Decrease regularization
o High variance
  • Due to overfitting
  • Use fewer features
  • Use more training samples
  • Increase regularization

[Learning-curve figure: Prediction Error vs. Training Error]
  • Need more features or a more complex model to improve (high bias)
  • Need more data to improve (high variance)

Ref: Strata 2013 Tutorial by Olivier Grisel

“Bias is a learner’s tendency to consistently learn the same wrong thing.” – Pedro Domingos

Page 47: R, Data Wrangling & Kaggle Data Science Competitions

Partition data !
  • Training (60%)
  • Validation (20%)
  • “Vault” Test (20%) data sets

k-fold Cross-Validation
  • Split data into k equal parts
  • Fit the model to k-1 parts & calculate the prediction error on the k-th part
  • Non-overlapping datasets

Goal :
  ◦ Model complexity (-)
  ◦ Variance (-)
  ◦ Prediction accuracy (+)

[Figure: Train / Validate / Test split, and 5-fold CV where each of folds #1-#5 takes a turn as the validation part while the other four are used for training]

A minimal k-fold CV sketch in R follows.
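A minimal hand-rolled k-fold cross-validation sketch (k = 5, built-in mtcars data as a stand-in); packages such as caret wrap this up, but the mechanics are just this:

  set.seed(42)
  k <- 5
  folds <- sample(rep(1:k, length.out = nrow(mtcars)))   # assign each row to a fold

  cv_error <- sapply(1:k, function(i) {
    train_part <- mtcars[folds != i, ]        # fit on k-1 parts
    test_part  <- mtcars[folds == i, ]        # hold out the k-th part
    fit  <- lm(mpg ~ wt + hp, data = train_part)
    pred <- predict(fit, newdata = test_part)
    mean((test_part$mpg - pred)^2)            # prediction error on the held-out fold
  })
  mean(cv_error)                              # cross-validated estimate of error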

Page 48: R, Data Wrangling & Kaggle Data Science Competitions

Bootstrap
  • Draw datasets (with replacement) and fit the model for each dataset
  • Remember : data partitioning (#1) & cross-validation (#2) are without replacement

Bagging (Bootstrap aggregation)
  ◦ Average the prediction over a collection of bootstrapped samples, thus reducing variance

Goal :
  ◦ Model complexity (-)
  ◦ Variance (-)
  ◦ Prediction accuracy (+)
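A tiny bootstrap sketch (mtcars again as a stand-in): resample rows with replacement, refit, and look at the spread of the refits; bagging then averages the predictions of such refits:

  set.seed(42)
  boot_coefs <- replicate(200, {
    idx <- sample(nrow(mtcars), replace = TRUE)       # one bootstrap sample of the rows
    coef(lm(mpg ~ wt, data = mtcars[idx, ]))["wt"]    # refit and keep the slope
  })
  quantile(boot_coefs, c(0.025, 0.975))               # bootstrap interval for the slope

  # Bagging = average predictions over many such bootstrap fits (reduces variance)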

Page 49: R, Data Wrangling & Kaggle Data Science Competitions

Boosting
  ◦ “Output of weak classifiers into a powerful committee”
  ◦ Final prediction = weighted majority vote
  ◦ Later classifiers get misclassified points with higher weight, so they are forced to concentrate on them
  ◦ AdaBoost (Adaptive Boosting)
  ◦ Boosting vs. Bagging
    § Bagging – independent trees
    § Boosting – successively weighted

Goal :
  ◦ Model complexity (-)
  ◦ Variance (-)
  ◦ Prediction accuracy (+)

Page 50: R, Data Wrangling & Kaggle Data Science Competitions

Random Forests+
  ◦ Builds a large collection of de-correlated trees & averages them
  ◦ Improves on bagging by selecting i.i.d.* random variables for splitting
  ◦ Simpler to train & tune
  ◦ “Do remarkably well, with very little tuning required” – ESLII
  ◦ Less susceptible to overfitting (than boosting)
  ◦ Many RF implementations
    § Original version – Fortran-77 ! By Breiman/Cutler
    § Python, R, Mahout, Weka, Milk (ML toolkit for py), MATLAB

* i.i.d – independent identically distributed
+ http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

Goal :
  ◦ Model complexity (-)
  ◦ Variance (-)
  ◦ Prediction accuracy (+)

Page 51: R, Data Wrangling & Kaggle Data Science Competitions

Ensemble Methods
  ◦ Two steps
    § Develop a set of learners
    § Combine the results to develop a composite predictor
  ◦ Ensemble methods can take the form of:
    § Using different algorithms
    § Using the same algorithm with different settings
    § Assigning different parts of the dataset to different classifiers
  ◦ Bagging & Random Forests are examples of ensemble methods

Goal :
  ◦ Model complexity (-)
  ◦ Variance (-)
  ◦ Prediction accuracy (+)

Ref: Machine Learning In Action

Page 52: R, Data Wrangling & Kaggle Data Science Competitions

Random Forests

o While boosting splits based on the best among all variables, RF splits based on the best among randomly chosen variables
o Simpler because it requires only two tuning parameters – the number of predictors tried at each split (typically √k) & the number of trees (500 for a large dataset, 150 for a smaller one)
o Error prediction
  • For each iteration, predict for the data that is not in the sample (OOB data)
  • Aggregate the OOB predictions
  • Calculate the prediction error for the aggregate, which is basically the OOB estimate of the error rate
  • Can use this to search for the optimal number of predictors
  • We will see how close this is to the actual error in the Heritage Health Prize
o Assumes equal cost for mis-prediction; can add a cost function
o Proximity matrix & applications like adding missing data, dropping outliers

Ref: R News Vol 2/3, Dec 2002; Statistical Learning from a Regression Perspective – Berk; A Brief Overview of RF by Dan Steinberg

Page 53: R, Data Wrangling & Kaggle Data Science Competitions

[Diagram, repeated] Classifiers: Regression (continuous variables) vs. Decision Trees / CART, k-NN (Nearest Neighbors), Boosting & Bagging (categorical variables), positioned along axes of bias vs. variance and model complexity / over-fitting.

Page 54: R, Data Wrangling & Kaggle Data Science Competitions

Model Evaluation & Interpretation Relevant Digression

9:20

Page 55: R, Data Wrangling & Kaggle Data Science Competitions

Cross Validation

o References:
  • https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience
  • Chris Clark’s blog : http://blog.kaggle.com/2012/07/02/up-and-running-with-python-my-first-kaggle-entry/
  • Predictive Modelling in Python with scikit-learn, Olivier Grisel, Strata 2013
  • Titanic notebook from PyCon 2014 / parallelmaster / An Introduction to Predictive Modeling in Python

Page 56: R, Data Wrangling & Kaggle Data Science Competitions

Model Evaluation - Accuracy

o Accuracy = (tp + tn) / (tp + fp + fn + tn)
o For cases where tn is large compared to tp, a degenerate return(false) will be very accurate !
o Hence the F-measure is a better reflection of the model strength

Confusion matrix:
              Predicted=1             Predicted=0
Actual=1      True+ (tp)              False- (fn) – Type II
Actual=0      False+ (fp) – Type I    True- (tn)

Page 57: R, Data Wrangling & Kaggle Data Science Competitions

Model Evaluation – Precision & Recall

o Precision = how many of the items we identified are relevant
o Recall = how many of the relevant items we identified
o Inverse relationship – the tradeoff depends on the situation
  • Legal – coverage is more important than correctness
  • Search – accuracy is more important
  • Fraud – support cost (high fp) vs. wrath of the credit card company (high fn)

o Precision = tp / (tp + fp)            • a.k.a. accuracy, relevancy
o Recall = tp / (tp + fn)               • a.k.a. true positive rate, coverage, sensitivity, hit rate
o False positive rate = fp / (fp + tn)  • a.k.a. Type 1 error rate, false alarm rate
o Specificity = 1 – fp rate
o Type 1 error = fp ; Type 2 error = fn

Confusion matrix:
              Predicted=1             Predicted=0
Actual=1      True+ (tp)              False- (fn) – Type II
Actual=0      False+ (fp) – Type I    True- (tn)

http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html

Page 58: R, Data Wrangling & Kaggle Data Science Competitions

Confusion Matrix

                     Predicted
                C1   C2   C3   C4
Actual   C1     10    5    9    3
         C2      4   20    3    7
         C3      6    4   13    3
         C4      2    1    4   15

The diagonal entries (cii) are the correct ones.
Precision (for class i) = cii / Σj cji   (sum over the column)
Recall (for class i)    = cii / Σj cij   (sum over the row)

Page 59: R, Data Wrangling & Kaggle Data Science Competitions

Model Evaluation : F-Measure

Precision = tp / (tp+fp) ; Recall = tp / (tp+fn)

F-Measure : balanced, combined, weighted harmonic mean of precision & recall; measures effectiveness

1/F = α·(1/P) + (1−α)·(1/R), which with β² = (1−α)/α gives F = (β² + 1)·P·R / (β²·P + R)

Common form (balanced F1) : β = 1 (α = ½) ; F1 = 2·P·R / (P + R)

Confusion matrix:
              Predicted=1             Predicted=0
Actual=1      True+ (tp)              False- (fn) – Type II
Actual=0      False+ (fp) – Type I    True- (tn)
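Putting the formulas above into a few lines of R (the cell counts here are made up purely for illustration):

  tp <- 85; fn <- 15          # actual = 1
  fp <- 10; tn <- 90          # actual = 0   (illustrative numbers, not from any dataset)

  accuracy  <- (tp + tn) / (tp + fp + fn + tn)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  f1        <- 2 * precision * recall / (precision + recall)
  c(accuracy = accuracy, precision = precision, recall = recall, F1 = f1)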

Page 60: R, Data Wrangling & Kaggle Data Science Competitions

Hands-on Walkthru - Model Evaluation

Train : 712 (80%) ; Test : 179 ; Total : 891

http://cran.r-project.org/web/packages/e1071/vignettes/svmdoc.pdf – model evaluation; the Kappa measure is interesting

Refer to 2-Model_Evaluation.R at https://github.com/xsankar/cloaked-ironman/
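A sketch of how such an 80/20 split can be drawn (712/179 out of 891 rows); 2-Model_Evaluation.R may do it differently:

  set.seed(42)
  n   <- nrow(train)                         # 891 rows in the Titanic training set
  idx <- sample(n, size = floor(0.8 * n))    # 712 rows for training
  train_part <- train[idx, ]                 # fit the model here
  test_part  <- train[-idx, ]                # 179 held-out rows for evaluation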

Page 61: R, Data Wrangling & Kaggle Data Science Competitions

ROC Analysis

o “How good is my model ?”
o Good reference : http://people.inf.elte.hu/kiss/13dwhdm/roc.pdf
o “A receiver operating characteristics (ROC) graph is a technique for visualizing, organizing and selecting classifiers based on their performance”
o Much better than evaluating a model based on simple classification accuracy
o Plots tp rate vs. fp rate

Page 62: R, Data Wrangling & Kaggle Data Science Competitions

ROC Graph – Discussion

o E = Conservative, everything NO
o H = Liberal, everything YES (am not making any political statement !)
o F = Ideal
o G = Worst
o The diagonal is the chance line
o The north-west corner is good
o The south-east is bad
  • For example E
  • Believe it or not – I have actually seen a graph with the curve in this region !

[Figure: ROC graph with points E, F, G, H]

              Predicted=1     Predicted=0
Actual=1      True+ (tp)      False- (fn)
Actual=0      False+ (fp)     True- (tn)

Page 63: R, Data Wrangling & Kaggle Data Science Competitions

ROC Graph – Clinical Example

IFCC : Measures of diagnostic accuracy: basic definitions

Page 64: R, Data Wrangling & Kaggle Data Science Competitions

ROC Graph Walk thru

Refer to 2-Model_Evaluation.R at https://github.com/xsankar/cloaked-ironman/
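A hedged sketch of an ROC curve in R using the ROCR package on the held-out partition from the earlier 80/20 split; the model and predictors are illustrative, not necessarily what 2-Model_Evaluation.R uses:

  library(ROCR)
  library(rpart)

  # Score the held-out partition with class probabilities rather than hard labels
  fit   <- rpart(as.factor(Survived) ~ Pclass + Sex + Age, data = train_part, method = "class")
  probs <- predict(fit, newdata = test_part)[, 2]       # P(Survived = 1)

  pred <- prediction(probs, test_part$Survived)
  perf <- performance(pred, "tpr", "fpr")               # tp rate vs. fp rate
  plot(perf); abline(0, 1, lty = 2)                     # dashed diagonal = chance
  performance(pred, "auc")@y.values[[1]]                # area under the ROC curve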

Page 65: R, Data Wrangling & Kaggle Data Science Competitions

Break

10:00

9:30

Page 66: R, Data Wrangling & Kaggle Data Science Competitions

The Art of a Competition – Session I : Bike Sharing at Washington DC

10:00

Page 67: R, Data Wrangling & Kaggle Data Science Competitions

Few interesting links – comb the forums

o Quick first prediction : http://www.kaggle.com/c/bike-sharing-demand/forums/t/10510/a-simple-model-for-kaggle-bike-sharing
  • Solution by Brandon Harris
o Random forest : http://www.kaggle.com/c/bike-sharing-demand/forums/t/10093/solution-based-on-random-forests-in-r-language
o http://www.kaggle.com/c/bike-sharing-demand/forums/t/9368/what-are-the-machine-learning-algorithms-applied-for-this-prediction
o GBM : http://www.kaggle.com/c/bike-sharing-demand/forums/t/9349/gbm
o Research paper : http://www.kaggle.com/c/bike-sharing-demand/forums/t/9457/research-paper-weather-and-dc-bikeshare
o ggplot : http://www.kaggle.com/c/bike-sharing-demand/forums/t/9352/visualization-using-ggplot-in-r
o Feature importances : http://www.kaggle.com/c/bike-sharing-demand/forums/t/9474/feature-importances
o Converting datetime to hour : http://www.kaggle.com/c/bike-sharing-demand/forums/t/10064/tip-converting-date-time-to-hour
o Casual & registered users : http://www.kaggle.com/c/bike-sharing-demand/forums/t/10432/predict-casual-registered-separately-or-just-count
o RMSLE : https://www.kaggle.com/c/bike-sharing-demand/forums/t/9941/my-approach-a-better-way-to-benchmark-please
o Predicting new counts in R : http://www.kaggle.com/c/bike-sharing-demand/forums/t/9938/r-how-predict-new-counts-in-r
o Weather data : http://www.kaggle.com/c/bike-sharing-demand/forums/t/10285/weather-data
o Date error : https://www.kaggle.com/c/bike-sharing-demand/forums/t/8343/i-am-getting-an-error/47402#post47402
o Using dates in R : http://www.noamross.net/blog/2014/2/10/using-times-and-dates-in-r---presentation-code.html

Page 68: R, Data Wrangling & Kaggle Data Science Competitions

Data Organization – train, test & submission

• datetime – hourly date + timestamp
• season – 1 = spring, 2 = summer, 3 = fall, 4 = winter
• holiday – whether the day is considered a holiday
• workingday – whether the day is neither a weekend nor a holiday
• weather
  • 1 : Clear, Few clouds, Partly cloudy
  • 2 : Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
  • 3 : Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
  • 4 : Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
• temp – temperature in Celsius
• atemp – "feels like" temperature in Celsius
• humidity – relative humidity
• windspeed – wind speed
• casual – number of non-registered user rentals initiated
• registered – number of registered user rentals initiated
• count – number of total rentals

Page 69: R, Data Wrangling & Kaggle Data Science Competitions

Approach

o Convert categorical variables to factors
o Engineer new features from the datetime (hour, day of week, month, year – see the sketch below)
o Explore other synthetic features
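A minimal sketch of this wrangling (column names are those listed on the data-organization slide; the derived features are assumptions of this sketch, not necessarily what 3-Session-I-Bikes.R builds):

  bike <- read.csv("train.csv")                       # from ~/cloaked-ironman/bike

  # Categorical codes become factors
  bike$season     <- as.factor(bike$season)
  bike$holiday    <- as.factor(bike$holiday)
  bike$workingday <- as.factor(bike$workingday)
  bike$weather    <- as.factor(bike$weather)

  # Pull hour, day of week, month and year out of the datetime column
  ts <- strptime(as.character(bike$datetime), format = "%Y-%m-%d %H:%M:%S")
  bike$hour  <- as.factor(ts$hour)                    # hour of day drives demand
  bike$wday  <- as.factor(ts$wday)                    # day of week
  bike$month <- as.factor(ts$mon + 1)
  bike$year  <- as.factor(ts$year + 1900)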

Page 70: R, Data Wrangling & Kaggle Data Science Competitions

#1 : ctree

Refer to 3-Session-I-Bikes.R at https://github.com/xsankar/cloaked-ironman/
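An illustrative ctree call on the engineered bike-sharing features (the formula is an assumption of this sketch, not necessarily what 3-Session-I-Bikes.R fits):

  library(party)

  # Conditional inference tree on the hourly counts
  fit  <- ctree(count ~ season + holiday + workingday + weather +
                  temp + atemp + humidity + windspeed + hour,
                data = bike)
  pred <- predict(fit, newdata = bike)               # predictions are hourly counts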

Page 71: R, Data Wrangling & Kaggle Data Science Competitions

#2 : Add Month + year

Page 72: R, Data Wrangling & Kaggle Data Science Competitions

#3 : RandomForest
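And a possible randomForest regression on the same engineered features, again only a sketch:

  library(randomForest)

  set.seed(42)
  fit <- randomForest(count ~ season + holiday + workingday + weather +
                        temp + atemp + humidity + windspeed + hour + month + year,
                      data = bike, ntree = 250, importance = TRUE)
  varImpPlot(fit)                                    # which features drive demand ?
  pred <- pmax(0, predict(fit, newdata = bike))      # counts cannot be negative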

Page 73: R, Data Wrangling & Kaggle Data Science Competitions

The Art of a Competition - Session II : Walmart Store Prediction

10:45

Page 74: R, Data Wrangling & Kaggle Data Science Competitions

Data Organization & Approach

o TBD

Page 75: R, Data Wrangling & Kaggle Data Science Competitions

Comb the forums

o http://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/forums/t/8125/first-place-entry/

o http://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/forums/t/8023/thank-you-and-2-rank-model

o http://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/forums/t/8095/r-codes-for-arima-naive-exponential-smoothing-forecasting/

o http://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/forums/t/8055/6-bad-models-make-1-good-model-power-of-ensemble-learning/

Page 76: R, Data Wrangling & Kaggle Data Science Competitions

The Beginning As The End 11:15

Page 77: R, Data Wrangling & Kaggle Data Science Competitions

The Beginning As the End

◦ We started with a set of goals

◦ Homework – for you:
  § Go through the slides
  § Do the walkthroughs
  § Work on more competitions
  § Submit entries to Kaggle

Page 78: R, Data Wrangling & Kaggle Data Science Competitions

References:

o An Introduction to scikit-learn, PyCon 2013, Jake Vanderplas
  • http://pyvideo.org/video/1655/an-introduction-to-scikit-learn-machine-learning
o Advanced Machine Learning with scikit-learn, PyCon 2013 / Strata 2014, Olivier Grisel
  • http://pyvideo.org/video/1719/advanced-machine-learning-with-scikit-learn
o Just The Basics, Strata 2013, William Cukierski & Ben Hamner
  • http://strataconf.com/strata2013/public/schedule/detail/27291
o The Problem of Multiple Testing
  • http://download.journals.elsevierhealth.com/pdfs/journals/1934-1482/PIIS1934148209014609.pdf

Page 79: R, Data Wrangling & Kaggle Data Science Competitions