microsoft nerd talk - r and tableau - 2-4-2013
DESCRIPTION
This presentation is from a talk I gave at Microsoft NERD for the Boston Predictive Analytics Meetup group.
TRANSCRIPT
TABLEAU AND R
Beauty and the Beast
Tanya Cashorali
@tanyacash21
R – THE WORKHORSE
TABLEAU – MAKES BEAUTIFUL THINGS HAPPEN
BUT SO CAN R
TOGETHER THEY ARE UNSTOPPABLE
SERIOUSLY THOUGH, WHAT IS R?
Open source statistical programming environment
4,211 community-contributed packages on CRAN as of 1/31/2013 - http://cran.r-project.org/
Interpreted - Terminal or GUI (RStudio)
WHAT IS TABLEAU?
Data visualization software for interactive business intelligence
Spun out of Stanford University in 2003, current CTO was a founder of Pixar Animation Studios
Drag and drop interface
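R AND TABLEAU
[Diagram: R handles the data munging and data modeling, then hands results to Tableau either by writing to .csv or by inserting into a database using the RODBC package. Tableau reads the database over a live connection through various database drivers and produces the dashboards.]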
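The database route can be sketched roughly as below. This is a minimal sketch, assuming the RODBC package is installed and an ODBC data source is already configured; the helper name push_to_db, the DSN "nfl_dsn", and the table name "pbp2012" are all placeholders, not anything from the talk.

```r
# Hypothetical helper: write a data frame to a database table over ODBC
# so that Tableau Desktop can pick it up through a live connection.
# Assumes the RODBC package is installed and `dsn` names a configured ODBC source.
push_to_db <- function(df, dsn, tablename) {
  channel <- RODBC::odbcConnect(dsn)
  # Close the connection even if the insert fails
  on.exit(RODBC::odbcClose(channel))
  RODBC::sqlSave(channel, df, tablename = tablename,
                 rownames = FALSE, append = FALSE)
}

# Example call (placeholder names, not run here):
# push_to_db(pbp2012, dsn = "nfl_dsn", tablename = "pbp2012")
```

With append = FALSE, sqlSave tries to create the table; use append = TRUE to add rows to an existing one.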
START WITH THE R WORKHORSE
Read data into R
pbp2012 <- read.csv(file="2012_nfl_pbp_data_reg_season.csv", header=TRUE)
View the data
str(pbp2012)
START WITH THE R WORKHORSE (CONT’D)
Conduct pre-processing or “data munging”
is.na(pbp2012$down); as.numeric(pbp2012$ydline)
Slice and dice
subset(pbp2012, qtr == 1)
Write to CSV for consumption by Tableau Public
write.csv(pbp2012, file="pbp2012.csv", row.names=FALSE)
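The read, munge, subset, and write steps can be tried end to end on a toy data frame. This is a sketch only: the rows are made up, and only the column names (qtr, down, ydline) come from the slides. It also assigns the results of as.numeric() and subset(), since those functions return new values rather than changing the data in place.

```r
# Made-up stand-in for the play-by-play file (values are illustrative only)
toy <- data.frame(
  qtr    = c(1, 1, 2, 4),
  down   = c(1, NA, 3, 2),
  ydline = c(80, 75, 42, 9)
)
in_file <- tempfile(fileext = ".csv")
write.csv(toy, file = in_file, row.names = FALSE)

# Read data into R and view its structure
pbp <- read.csv(file = in_file, header = TRUE, stringsAsFactors = FALSE)
str(pbp)

# Pre-processing: count missing downs, store ydline as numeric
n_missing_down <- sum(is.na(pbp$down))
pbp$ydline <- as.numeric(pbp$ydline)

# Slice and dice: first-quarter plays only
q1 <- subset(pbp, qtr == 1)

# Write to CSV for Tableau; row.names = FALSE avoids an extra index column
out_file <- tempfile(fileext = ".csv")
write.csv(q1, file = out_file, row.names = FALSE)
```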
R NO HUDDLE EXAMPLE
## read in the data
seasons <- c(2002:2011)
pbp <- read.csv("2012_nfl_pbp_data_reg_season.csv", header=TRUE, stringsAsFactors=FALSE)
n1 <- read.csv("2002_nfl_pbp_data.csv", header=TRUE, stringsAsFactors=FALSE)
## keep only the 2012 columns that also appear in the older files
pbp <- pbp[, colnames(pbp) %in% colnames(n1)]
for(season in seasons){
  n1 <- read.csv(paste(season, "_nfl_pbp_data.csv", sep=""), header=TRUE, stringsAsFactors=FALSE)
  pbp <- rbind(pbp, n1)
}
## grab the no huddle plays
nh <- pbp[grep("Huddle", pbp$description),]
## count the no-huddle plays each team ran
nh.by.team <- table(nh$off)
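To turn the per-team counts into the percentage of each team's plays that were no huddle, one option is to flag the plays and average the flag by team. The sketch below uses two made-up "seasons"; the column names off and description follow the slides, while the rows and the extra column are invented for illustration.

```r
# Two made-up seasons whose columns differ slightly
s1 <- data.frame(
  off = c("NE", "NE", "BAL", "DEN"),
  description = c("(No Huddle) pass short left", "run right",
                  "(No Huddle, Shotgun) pass deep", "punt"),
  stringsAsFactors = FALSE
)
s2 <- data.frame(
  off = c("NE", "DEN", "DEN"),
  description = c("run left", "(No Huddle) run middle", "kneel"),
  extra = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Keep only the columns both files share, then stack them
common <- intersect(colnames(s1), colnames(s2))
pbp_toy <- rbind(s1[, common], s2[, common])

# Flag no-huddle plays; the mean of a logical flag is a proportion
is_nh <- grepl("Huddle", pbp_toy$description)
nh_pct <- 100 * tapply(is_nh, pbp_toy$off, mean)
nh_pct
```

Unlike a table of the no-huddle rows alone, this keeps teams that never ran a no-huddle play, at 0%.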
R NO HUDDLE EXAMPLE (CONT’D)
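library(ggplot2)
## ggplot needs a data frame; as.data.frame() turns the table into columns Var1 and Freq
ggplot(as.data.frame(nh.by.team), aes(x=reorder(Var1, -Freq), y=Freq)) +
  geom_bar(stat="identity") +
  labs(x="Team", y="Number of Plays", title="Number of No Huddle Plays Ran by Team 2002-2012") +
  theme(axis.text.x = element_text(angle = 50, hjust = 1))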
R NO HUDDLE EXAMPLE (CONT’D)
## table by offensive team and quarter
nh.df <- data.frame(table(nh$off, nh$qtr))[-1,]
colnames(nh.df) <- c("Team", "Quarter", "Number")
## plot number of no huddle plays by team by quarter
ggplot(nh.df, aes(x=reorder(Team, Number), y=Number, fill=Quarter)) +
  geom_bar(stat="identity") +
  labs(x="Team", y="Number", title="Number of No Huddle Plays in the NFL by Team by Quarter") +
  theme(axis.text.x = element_text(angle = 50, hjust = 1))
TABLEAU-IFIED
http://sportsdataviz.com/percentage-no-huddle-plays-by-nfl-team-by-season-2002-2012/
## write file for Tableau
write.table(nh.by.team, file="noHuddles.txt", sep="\t", row.names=FALSE)
IS THE RAVENS OFFENSE PREDICTABLE?
http://sportsdataviz.com/superbowl-xlvii-2013-baltimore-ravens-offense-predictability/
## Read in the data generated by play_parser.py
plays <- read.csv("plays.csv", header=TRUE, stringsAsFactors=FALSE)
## extract Baltimore offensive plays
plays <- plays[grep("BAL", plays$gameid),]
plays <- subset(plays, def != "BAL")
## 1,625 offensive BAL plays in the 2012 regular season
nrow(plays)
## classify the other play types that are not passes or runs
plays$type <- as.character(plays$type)
plays$type[grepl("PENALTY", plays$desc)] <- "Penalty"
plays$type[grepl("kick", plays$desc)] <- "Kick"
plays$type[grepl("punt", plays$desc)] <- "Punt"
plays$type[grepl("field goal", plays$desc)] <- "FG"
## create a binned variable yardsToGo
plays$yardsToGo <- "0"
plays$yardsToGo[plays$ydline >= 80] <- ">= 80"
plays$yardsToGo[plays$ydline >= 50 & plays$ydline < 80] <- "50 <= yardsToGo < 80"
plays$yardsToGo[plays$ydline >= 30 & plays$ydline < 50] <- "30 <= yardsToGo < 50"
plays$yardsToGo[plays$ydline >= 10 & plays$ydline < 30] <- "10 <= yardsToGo < 30"
plays$yardsToGo[plays$ydline < 10] <- "< 10"
## write out file for Tableau (comma-separated to match the .csv extension)
write.table(plays, file="BALplays2012regSeason.csv", sep=",", row.names=FALSE)
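The yardsToGo binning can also be written as a single call to base R's cut(). This is an alternative sketch rather than the code from the talk; the ydline values are made up, and the labels copy the ones used above.

```r
# Made-up yard-line values that hit every bin boundary
ydline <- c(5, 10, 29, 30, 50, 79, 80, 99)

# right = FALSE makes each interval closed on the left: [10, 30), [30, 50), ...
yardsToGo <- cut(
  ydline,
  breaks = c(-Inf, 10, 30, 50, 80, Inf),
  labels = c("< 10", "10 <= yardsToGo < 30", "30 <= yardsToGo < 50",
             "50 <= yardsToGo < 80", ">= 80"),
  right = FALSE
)

# cut() returns a factor; convert to character if plain strings are preferred
as.character(yardsToGo)
```

Because the result is an ordered set of factor levels, the bins keep their natural order when tabulated, and a missing ydline becomes NA.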
IS THE RAVENS OFFENSE PREDICTABLE? (CONT’D)
http://sportsdataviz.com/superbowl-xlvii-2013-baltimore-ravens-offense-predictability/
Set the scenario for each play during the Super Bowl, then predicted either run or pass based on the percentage of each play type in that scenario.
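The same percentage-based prediction can be sketched in R. This is a minimal illustration, not the method behind the dashboard: it assumes a scenario is defined by down and the yardsToGo bin, and every row below is made up.

```r
# Made-up regular-season plays (illustrative only)
season <- data.frame(
  down      = c(1, 1, 1, 3, 3, 3, 3),
  yardsToGo = c("< 10", "< 10", "< 10", ">= 80", ">= 80", ">= 80", ">= 80"),
  type      = c("RUN", "RUN", "PASS", "PASS", "PASS", "PASS", "RUN"),
  stringsAsFactors = FALSE
)

# Share of each play type within each scenario (rows sum to 1)
scenario <- paste(season$down, season$yardsToGo, sep = " | ")
pct <- prop.table(table(scenario, season$type), margin = 1)

# Hypothetical helper: predict the play type with the highest share
predict_play <- function(down, yardsToGo) {
  key <- paste(down, yardsToGo, sep = " | ")
  colnames(pct)[which.max(pct[key, ])]
}

# Made-up game plays, scored against what actually happened
game <- data.frame(
  down      = c(1, 3, 1),
  yardsToGo = c("< 10", ">= 80", "< 10"),
  actual    = c("RUN", "PASS", "PASS"),
  stringsAsFactors = FALSE
)
game$predicted <- mapply(predict_play, game$down, game$yardsToGo)
accuracy <- mean(game$predicted == game$actual)
```

The accuracy here is just the share of toy plays predicted correctly; a scenario that never occurred in the season data would need its own handling.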
RESULTS AND CONSIDERATIONS
Predicted plays correctly 60.3% of the time
Missing variables (defensive and offensive formations, crowd noise, weather, injured players, power outage, etc.)
Change in Ravens’ offensive coordinator in week 15
Lack of data
SUMMARY
Initial analysis in R
Explore the data
Pre-process
Write to file for consumption by Tableau Public or to database for Tableau Desktop
Create interactive dashboards in Tableau in minutes that can be shared via a web interface (free = publicly available, paid = private internally hosted Tableau Server)
REFERENCES
NFL Play by Play Data (2002 – 2012) - http://www.advancednflstats.com/2010/04/play-by-play-data.html
Python parser for NFL PBP Data - http://www.10flow.com/
Tableau Public - http://www.tableausoftware.com/public/
R - http://cran.r-project.org/
SportsDataViz - http://www.sportsdataviz.com/
APPENDIX
TABLEAU DESKTOP FEATURE COMPARISON
Feature                               Public Edition       Personal Edition     Professional Edition
Operating System                      Windows application  Windows application  Windows application
Saves to the Tableau Public Website?  Only                 Option               Option
Opens Data in Files?                  Yes                  Yes                  Yes
Opens Data in Databases?              No                   No                   Yes
Save Work Locally?                    No                   Yes                  Yes
Export Results Locally?               No                   Yes                  Yes
Data Limitation?                      100,000 rows         Unlimited            Unlimited
Publish to Tableau Server?            No                   No                   Yes
Cost                                  Free                 $999                 $1,999