7/28/2019 R Tutorial Slides
R Tutorial
Capital One Data Mining Cup
UW Statistics Club
Saturday, March 23, 2013
-
Who Will Benefit From This?
Aimed at students who:
Have the statistical background but lack the (R) modelling expertise
Have never taken a linear regression course (or have simply forgotten the one they did!)
-
What We Will Be Doing Today
Walkthrough example of a statistical prediction problem using Kaggle test data (Titanic problem)
The goal is to predict who will survive given different factors such as
Age
Ticket Fare
Sex
Cabin
Number of family members aboard
http://www.kaggle.com/
-
R Basics
Opening R (RStudio)
Navigating to the working directory
Running commands
Installing packages
Loading packages
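A minimal sketch of these basics, typed at the R console. The package names are the two used later in these slides; the directory path in the comment is a placeholder, not a real location.

```r
# Show the current working directory.
wd <- getwd()
print(wd)

# Change it with setwd(); the path below is a placeholder, so it stays commented out.
# setwd("path/to/your/project")

# Packages used later in this tutorial.
pkgs <- c("ggplot2", "plyr")

# Check which of them are already installed.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
to_install <- pkgs[!installed]
print(to_install)

# Install once per machine (needs an internet connection), then load once per session.
# install.packages(to_install)
# library(ggplot2)
# library(plyr)
```

In RStudio, a line in the script editor can be sent to the console with Ctrl+Enter (Cmd+Enter on a Mac).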
-
Basic Guideline to Data Analysis
1. Define the question
2. Define the ideal data set
3. Determine what data you can access
4. Obtain the data
5. Clean the data
6. Exploratory data analysis
7. Statistical prediction/modelling
8. Interpret results
9. Challenge results
10. Synthesize/write up results
11. Create reproducible code
-
Cleaning the Data (skipped)
Fix variable names
Merge data sets
Fix missing content
Fix inconsistent data
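Although this step is skipped in the talk, the four tasks above might look like the following in base R. This is a sketch on a made-up toy data set: the column names, values, and the median fill are illustrative assumptions, not the Kaggle data or the presenters' method.

```r
# Toy data standing in for two raw files; every value here is made up.
passengers <- data.frame(
  `Passenger ID` = 1:4,
  SEX = c("male", "Female", "F", "male"),
  age = c(22, NA, 35, 54),
  check.names = FALSE
)
fares <- data.frame(id = 1:4, fare = c(7.25, 71.28, 8.05, 51.86))

# Fix variable names: lower case, no spaces.
names(passengers) <- c("id", "sex", "age")

# Merge data sets on the shared key.
clean <- merge(passengers, fares, by = "id")

# Fix missing content: fill missing ages with the median (one simple choice among many).
clean$age[is.na(clean$age)] <- median(clean$age, na.rm = TRUE)

# Fix inconsistent data: map every spelling of sex onto two labels.
clean$sex <- tolower(clean$sex)
clean$sex[clean$sex == "f"] <- "female"
clean$sex <- factor(clean$sex, levels = c("female", "male"))

print(clean)
```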
-
Exploratory Data Analysis
Make use of
Aggregation Tables
Charts
We use two different R packages here: ggplot2, plyr
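A sketch of one aggregation table and one chart. It assumes R's built-in `Titanic` table as a stand-in for the Kaggle file (it has only Class, Sex, Age group, and Survived), and it runs the plyr and ggplot2 parts only if those packages are installed.

```r
# Expand R's built-in Titanic table (not the Kaggle file) to one row per person.
tab <- as.data.frame(Titanic)
people <- tab[rep(seq_len(nrow(tab)), tab$Freq), c("Class", "Sex", "Age", "Survived")]
people$survived <- as.numeric(people$Survived == "Yes")

# Aggregation table in base R: survival rate by sex.
rates <- aggregate(survived ~ Sex, data = people, FUN = mean)
print(rates)

# The same table with plyr, if it is installed.
if (requireNamespace("plyr", quietly = TRUE)) {
  print(plyr::ddply(people, "Sex", plyr::summarise, rate = mean(survived)))
}

# A stacked bar chart of survival counts by sex with ggplot2, if it is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(people, ggplot2::aes(x = Sex, fill = Survived)) +
    ggplot2::geom_bar()
  print(p)
}
```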
-
Testing Your Model
Before we build our model, we need a methodology for testing it.
A naive analyst would use the entire data set to build the model and then test it on the same data set. This hides overfitting: the model is scored on data it has already seen, so the error estimate is too optimistic.
Instead: partition the training data set into a real training set and a validation set. To create the validation set use:
Random sub-sampling
K-fold
Leave-one-out
What measurement do we use to compare?
Adjusted R², AIC, BIC
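The partitioning schemes and comparison measures above can be sketched in base R. The data are simulated, and the 70/30 split, K = 5, the seed, and the use of RMSE as the validation error are all illustrative assumptions.

```r
set.seed(2013)  # arbitrary seed so the split is repeatable

# Simulated data (made up): y depends linearly on x plus noise.
n <- 200
d <- data.frame(x = rnorm(n))
d$y <- 1 + 2 * d$x + rnorm(n)

# Random sub-sampling: hold out 30% of the rows as a validation set.
train_idx <- sample(n, size = round(0.7 * n))
train <- d[train_idx, ]
valid <- d[-train_idx, ]
fit <- lm(y ~ x, data = train)
holdout_rmse <- sqrt(mean((valid$y - predict(fit, newdata = valid))^2))

# K-fold cross-validation with K = 5: each row is held out exactly once.
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))
fold_rmse <- sapply(seq_len(k), function(i) {
  m <- lm(y ~ x, data = d[fold != i, ])
  held <- d[fold == i, ]
  sqrt(mean((held$y - predict(m, newdata = held))^2))
})
cv_rmse <- mean(fold_rmse)
# Leave-one-out is the special case k = n.

# In-sample criteria for comparing candidate models fitted to the same data.
criteria <- c(adj_r2 = summary(fit)$adj.r.squared, AIC = AIC(fit), BIC = BIC(fit))

print(c(holdout = holdout_rmse, cv = cv_rmse))
print(criteria)
```

Higher adjusted R² is better; lower AIC and BIC are better. AIC and BIC are only comparable between models fitted to the same response on the same rows.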
-
Building Our First Model - Simple Linear Regression
Why is this a good starting point?
Easy to implement in R
Works out of the box (i.e. no tuning parameters)
Easy to interpret/explain
Disadvantage: performs poorly in non-linear settings
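A sketch of a first fit with `lm()`. It assumes R's built-in `Titanic` table in place of the Kaggle file, so the predictors are limited to Class, Sex, and Age group, and the 0/1 outcome column `survived` is a name introduced here for illustration.

```r
# Built-in Titanic table expanded to one row per person (not the Kaggle file).
tab <- as.data.frame(Titanic)
people <- tab[rep(seq_len(nrow(tab)), tab$Freq), c("Class", "Sex", "Age", "Survived")]
people$survived <- as.numeric(people$Survived == "Yes")

# Fit a linear regression of the 0/1 outcome on three categorical predictors.
fit <- lm(survived ~ Class + Sex + Age, data = people)

# Each coefficient is a difference in survival proportion versus the baseline level,
# holding the other predictors fixed.
print(summary(fit))

# Predicted survival score for one hypothetical passenger.
new_person <- data.frame(
  Class = factor("1st", levels = levels(people$Class)),
  Sex   = factor("Female", levels = levels(people$Sex)),
  Age   = factor("Adult", levels = levels(people$Age))
)
score <- predict(fit, newdata = new_person)
print(score)
```

Because the outcome is 0/1, fitted values from a linear model are not guaranteed to stay between 0 and 1.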
-
Building Our First Model - Simple Linear Regression
After we have run our first model we want to:
Examine the residuals plot
Examine the Q-Q plot
Use the Model Testing process to pick a proper model
Using the step function in R
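A sketch of these checks on simulated data. The variable names, the seed, and the choice of backward selection are assumptions for illustration; `step()` compares models by AIC by default.

```r
set.seed(7)  # arbitrary seed

# Simulated data (made up): x1 and x2 drive y, x3 is unrelated noise.
n <- 150
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 - 1.5 * d$x2 + rnorm(n)

full <- lm(y ~ x1 + x2 + x3, data = d)

# Residuals vs fitted (which = 1) and normal Q-Q plot (which = 2), side by side.
par(mfrow = c(1, 2))
plot(full, which = 1:2)

# Stepwise selection by AIC, starting from the full model and dropping terms.
chosen <- step(full, direction = "backward", trace = 0)
print(formula(chosen))
```

A residuals plot with no visible pattern and Q-Q points close to the reference line suggest the linear model assumptions are reasonable.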
-
Understanding Interaction (optional)
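As one illustration of an interaction, the sketch below compares a model in which the effect of Sex is the same in every Class with one in which it may differ by Class. It assumes R's built-in `Titanic` table rather than the Kaggle file.

```r
# Built-in Titanic table expanded to one row per person (not the Kaggle file).
tab <- as.data.frame(Titanic)
people <- tab[rep(seq_len(nrow(tab)), tab$Freq), c("Class", "Sex", "Age", "Survived")]
people$survived <- as.numeric(people$Survived == "Yes")

# Additive model: the effect of Sex is the same in every Class.
additive <- lm(survived ~ Sex + Class, data = people)

# Interaction model: Sex * Class expands to Sex + Class + Sex:Class,
# so the effect of Sex may differ from class to class.
interact <- lm(survived ~ Sex * Class, data = people)

# F-test comparing the two nested models.
print(anova(additive, interact))
print(coef(interact))
```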
-
Checking for Multicollinearity (optional)
Occurs when multiple predictor variables are highly correlated
Can be caused by:
Creating a new predictor variable from existing ones
Having multiple predictors that explain the same thing
Consequence: the standard errors of the coefficient estimates blow up
Use R to compute the correlation between all predictors. If any set of predictors has pairwise correlation above 0.90 to 0.95, then either:
Remove all but one
Combine into a new composite variable
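The correlation check described above can be sketched with `cor()`. The predictors are simulated, with `x2` deliberately built as a near copy of `x1`; the names, the seed, and the 0.90 cutoff are illustrative.

```r
set.seed(42)  # arbitrary seed

# Simulated predictors (made up): x2 is x1 plus a little noise, x3 is independent.
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
X <- data.frame(x1, x2, x3)

# Correlation between all pairs of predictors.
cm <- cor(X)
print(round(cm, 2))

# List each pair once (upper triangle only) whose absolute correlation exceeds 0.90.
high <- which(abs(cm) > 0.90 & upper.tri(cm), arr.ind = TRUE)
pairs <- data.frame(
  a = rownames(cm)[high[, "row"]],
  b = colnames(cm)[high[, "col"]]
)
print(pairs)
```

For any pair that shows up, keep one of the two predictors or replace both with a composite such as their average.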
-
What Next?
Taking our Simple Linear Regression to the next level
Higher order terms
Interaction terms
Data Transformations
Check for multicollinearity
Different Types of Models (not covered here but check the R Code!)
Generalized Linear Models
Trees
Random Forest
Ensemble Methods
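As a sketch of the higher-order and transformation ideas listed above, the formulas below add a squared term and a log-transformed predictor to a straight-line fit. The data are simulated with a truly quadratic relationship, so this only illustrates the syntax and the AIC comparison.

```r
set.seed(99)  # arbitrary seed

# Simulated data (made up): y is quadratic in x.
n <- 200
d <- data.frame(x = runif(n, min = 1, max = 10))
d$y <- 3 + 0.5 * d$x^2 + rnorm(n)

# Straight-line fit.
linear <- lm(y ~ x, data = d)

# Higher-order term: I() protects ^ so it means "x squared" inside a formula.
quadratic <- lm(y ~ x + I(x^2), data = d)

# Data transformation: regress on log(x) instead of x.
log_x <- lm(y ~ log(x), data = d)

# All three models share the same response, so their AIC values are comparable.
print(AIC(linear, quadratic, log_x))
```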