R Tutorial Slides

Upload: osiccor

Post on 03-Apr-2018


  • 7/28/2019 R Tutorial Slides

    1/13

    R Tutorial

    Capital One Data Mining Cup

    UW Statistics Club

    Saturday, March 23, 2013


    Who Will Benefit From This?

    Aimed at students who:

    Have the statistical background but lack the (R) modelling expertise

    Have never taken a linear regression course (or have simply forgotten the one they took!)


    What We Will Be Doing Today

    Walkthrough of a statistical prediction problem using Kaggle test data (the Titanic problem)

    The goal is to predict who will survive, given factors such as:

    Age

    Ticket Fare

    Sex

    Cabin

    Number of family members aboard

    http://www.kaggle.com/

    R Basics

    Opening R (RStudio)

    Navigating to the working directory

    Running commands

    Installing packages

    Loading packages
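
    The steps on this slide can be sketched as a few console commands. This is a minimal illustration, not the workshop's own script: the folder path is a placeholder, and the install call is commented out because it needs an internet connection.

```r
# Show the current working directory.
getwd()

# Move to a project folder. "path/to/project" is a placeholder,
# so the call is left commented out.
# setwd("path/to/project")

# Install packages once per machine (needs an internet connection).
# install.packages(c("ggplot2", "plyr"))

# Load a package at the start of each session. 'stats' ships with R,
# so this line works on a fresh installation.
library(stats)

# Run a command: assign a vector and compute its mean.
x <- c(1, 2, 3)
mean(x)
```

    In RStudio, Ctrl+Enter (Cmd+Enter on a Mac) sends the current line of a script to the console.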


    Basic Guideline to Data Analysis

    1. Define the question

    2. Define the ideal data set

    3. Determine what data you can access

    4. Obtain the data

    5. Clean the data

    6. Exploratory data analysis

    7. Statistical prediction/modelling

    8. Interpret results

    9. Challenge results

    10. Synthesize/write up results

    11. Create reproducible code


    Cleaning the Data (skipped)

    Fix variable names

    Merge data sets

    Fix missing content

    Fix inconsistent data
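
    Although the session skips this step, the four cleaning tasks can be sketched in base R. The two small data frames here are made up purely for illustration (they are not the Kaggle files), and median imputation is only one assumed way of handling missing values.

```r
# Hypothetical toy data, invented for this sketch.
passengers <- data.frame(
  `Passenger Id` = 1:4,
  SEX = c("male", "Female", "F", "male"),
  Age = c(22, NA, 35, 40),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
fares <- data.frame(PassengerId = 1:4, Fare = c(7.25, 71.28, 8.05, 53.10))

# 1. Fix variable names: no spaces, consistent capitalisation.
names(passengers) <- c("PassengerId", "Sex", "Age")

# 2. Merge data sets on the shared key.
merged <- merge(passengers, fares, by = "PassengerId")

# 3. Fix missing content: replace missing ages with the median age.
merged$Age[is.na(merged$Age)] <- median(merged$Age, na.rm = TRUE)

# 4. Fix inconsistent data: map every spelling of Sex to one coding.
merged$Sex <- tolower(merged$Sex)
merged$Sex[merged$Sex == "f"] <- "female"

merged
```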


    Exploratory Data Analysis

    Make use of

    Aggregation Tables

    Charts

    We use two different R packages here: ggplot2 and plyr
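
    A minimal sketch of an aggregation table and a chart follows. To stay runnable without extra packages it uses base R instead of plyr and ggplot2, and it uses the `Titanic` table that ships with R, which is an assumption: it is not the Kaggle file and only records class, sex, age group and survival.

```r
# Built-in contingency table, reshaped to one row per group with a count.
titanic <- as.data.frame(Titanic)

# Aggregation table: passenger counts by sex and survival.
counts <- xtabs(Freq ~ Sex + Survived, data = titanic)
counts

# Convert counts to within-sex proportions (each row sums to 1).
rates <- prop.table(counts, margin = 1)
rates

# Chart: proportion of each sex that survived.
barplot(rates[, "Yes"], ylab = "Proportion survived", main = "Survival by sex")
```

    With plyr, `ddply()` would produce a comparable grouped summary, and ggplot2 would draw the chart with `ggplot()` plus a bar geom.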


    Testing Your Model

    Before we build our model, we need a methodology for testing it.

    A naive analyst would use the entire data set to build the model and then test it on the same data set. This hides overfitting, because the model is scored on data it has already seen!

    Instead: partition the training data set into a real training set and a validation set. To create the validation set, use:

    Random sub-sampling

    K-fold

    Leave-one-out
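
    The partitioning schemes above can be sketched in base R. The built-in `mtcars` data and the formula `mpg ~ wt + hp` are stand-ins chosen for illustration, and the 30% hold-out and 5 folds are assumed values, not recommendations from the slides.

```r
set.seed(42)  # make the random splits reproducible
n <- nrow(mtcars)

# Random sub-sampling: hold out 30% of the rows for validation.
valid_idx <- sample(n, size = round(0.3 * n))
train <- mtcars[-valid_idx, ]
valid <- mtcars[valid_idx, ]

fit <- lm(mpg ~ wt + hp, data = train)
holdout_rmse <- sqrt(mean((valid$mpg - predict(fit, newdata = valid))^2))

# K-fold: assign every row to one of k folds, then hold out each fold in turn.
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))
cv_rmse <- sapply(seq_len(k), function(i) {
  fit_i <- lm(mpg ~ wt + hp, data = mtcars[fold != i, ])
  held_out <- mtcars[fold == i, ]
  sqrt(mean((held_out$mpg - predict(fit_i, newdata = held_out))^2))
})
mean(cv_rmse)

# Leave-one-out is the special case k = n (one row per fold).
```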

    What measurement do we use to compare?

    Adjusted R², AIC, BIC
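
    All three measures are available for a fitted `lm` object. The sketch below compares two candidate models on the built-in `mtcars` data; both formulas are arbitrary examples. Higher adjusted R² is better, while lower AIC and BIC are better.

```r
# Two candidate models of different size.
small <- lm(mpg ~ wt, data = mtcars)
large <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Collect the three comparison measures side by side.
comparison <- data.frame(
  model  = c("small", "large"),
  adj_r2 = c(summary(small)$adj.r.squared, summary(large)$adj.r.squared),
  AIC    = c(AIC(small), AIC(large)),
  BIC    = c(BIC(small), BIC(large))
)
comparison
```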


    Building Our First Model - Simple Linear Regression

    Why is this a good starting point?

    Easy to implement in R

    Works out of the box (i.e. no tuning parameters)

    Easy to interpret/explain

    Disadvantage: performs poorly in non-linear settings
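
    To show how little code a first model needs, here is a minimal sketch on the built-in `mtcars` data. The response and predictors are assumptions picked for illustration, not the workshop's Titanic model.

```r
# Fit a linear regression of fuel economy on weight and horsepower.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Coefficient table, standard errors, R-squared and more.
summary(fit)

# Each coefficient is the expected change in mpg per unit change
# in that predictor, holding the other predictor fixed.
coef(fit)

# Predict for a new observation (values chosen arbitrarily).
predict(fit, newdata = data.frame(wt = 3, hp = 110))
```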


    Building Our First Model - Simple Linear Regression

    After we have run our first model we want to:

    Examine the residuals plot

    Examine the Q-Q plot

    Use the Model Testing process to pick a proper model

    Using the step function in R
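
    The diagnostics and the `step` function can be sketched as follows, again on the built-in `mtcars` data with an arbitrary starting formula. Backward selection by AIC is an assumed setting here; `step` also supports forward and both-direction searches.

```r
# Start from a deliberately large model.
full <- lm(mpg ~ wt + hp + qsec + disp, data = mtcars)

# Residuals plot: look for curvature or a funnel shape.
plot(fitted(full), resid(full),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Q-Q plot: points near the line suggest roughly normal residuals.
qqnorm(resid(full))
qqline(resid(full))

# Stepwise selection: drop terms while doing so lowers the AIC.
reduced <- step(full, direction = "backward", trace = 0)
formula(reduced)
```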


    Understanding Interaction (optional)


    Checking for Multicollinearity (optional)

    Occurs when multiple predictor variables are highly correlated

    Can be caused by:

    Creating a new predictor variable from existing ones

    Having multiple predictors that explain the same thing

    Consequence: the standard errors of the coefficient estimates blow up

    Use R to compute the correlation between all predictors. If there exist sets of predictors with correlation above 0.90 to 0.95, then either:

    Remove all but one

    Combine into a new composite variable
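
    The correlation check can be sketched with `cor`. The predictor columns come from the built-in `mtcars` data and are assumptions for illustration. The 0.90 cut-off is the lower end of the range quoted above.

```r
# Candidate predictors (numeric columns only).
predictors <- mtcars[, c("wt", "hp", "disp", "cyl", "qsec")]

# Pairwise correlation matrix.
cors <- cor(predictors)
round(cors, 2)

# Keep each pair once (upper triangle) and flag strong correlations.
high <- which(abs(cors) > 0.90 & upper.tri(cors), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(cors)[high[, "row"]],
  var2 = colnames(cors)[high[, "col"]],
  r    = cors[high]
)
pairs
```

    Each row of `pairs` is a candidate for removing one variable or combining the two into a composite.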


    What Next?

    Taking our Simple Linear Regression to the next level

    Higher order terms

    Interaction terms

    Data Transformations

    Check for multicollinearity
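
    The first three extensions map directly onto R's formula syntax. The sketch below uses the built-in `mtcars` data, and the specific terms are arbitrary examples.

```r
# Higher order term: I() protects ^ so it means "square" inside a formula.
poly_fit <- lm(mpg ~ wt + I(wt^2), data = mtcars)

# Interaction term: wt * hp expands to wt + hp + wt:hp.
inter_fit <- lm(mpg ~ wt * hp, data = mtcars)

# Data transformation: model the log of the response.
log_fit <- lm(log(mpg) ~ wt + hp, data = mtcars)

names(coef(poly_fit))
names(coef(inter_fit))
```

    An interaction lets the effect of one predictor depend on the level of another: here the slope on `wt` is allowed to change with `hp`.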

    Different Types of Models (not covered here but check the R Code!)

    Generalized Linear Models

    Trees

    Random Forest

    Ensemble Methods
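
    Because survival is a yes/no outcome, a generalized linear model with a binomial family (logistic regression) is a natural next step. This is a minimal sketch and not the workshop's R code. It uses the `Titanic` table that ships with R, which is an assumption: it is not the Kaggle file, and its only predictors are class, sex and age group.

```r
# One row per group; Freq is the number of passengers in that group.
titanic <- as.data.frame(Titanic)

# Logistic regression, weighting each group by its passenger count.
logit_fit <- glm(Survived ~ Class + Sex + Age,
                 family = binomial, weights = Freq, data = titanic)
summary(logit_fit)

# Predicted survival probability for one hypothetical profile.
profile <- data.frame(
  Class = factor("1st", levels = levels(titanic$Class)),
  Sex   = factor("Female", levels = levels(titanic$Sex)),
  Age   = factor("Adult", levels = levels(titanic$Age))
)
p <- predict(logit_fit, newdata = profile, type = "response")
p
```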