

Analyzing NFL Drive Success

Austin Grosel, Ovais Siddiqui

Table of Contents

1. ABSTRACT
2. INTRODUCTION
3. METHODOLOGY
4. ANALYSIS
5. FUTURE WORK
6. APPENDIX
7. SAS CODE
8. R CODE
9. REFERENCES


ABSTRACT

Sports have seen an analytical revolution in recent years, and many franchises are turning to data analysis for decision making. The goal of this project is to examine American Football data from the NFL and determine the characteristics of an efficient team offense. The research data contained every team's drive statistics for the past sixteen years, with total points per drive (PPD) as the dependent variable. Using exploratory data analysis along with multiple linear regression, two models were developed to predict PPD, one with interaction terms and one without. The model without interaction terms made more sense from a business perspective, and the significant predictors of PPD turned out to be related to passing plays on offense. Given the important factors in this PPD model, NFL front offices and coaches may want to invest more capital in developing a premier passing offense.

INTRODUCTION

For as long as sports have existed, franchises have tried to find ways to gain a competitive advantage. Around the start of the 2000s, a few professional baseball teams began using mathematics, statistics, and data analysis alongside traditional scouting techniques. This inspired Michael Lewis's best-selling book Moneyball, which follows the Oakland Athletics' 2002 season as they transitioned to using numbers over film. Because of the success Oakland and other MLB teams had in adopting statistics, other teams followed, and soon other sports did as well. Of the top three American sports to adopt analytics, American Football, specifically the NFL, has seen the slowest transition. American Football differs from other sports: there are 22 total players on the field, coaches have a much stronger influence on the game than in baseball or basketball, and there are so many styles of play that it is extremely hard to quantify how important some "box score" statistics actually are. Recently, however, statisticians, fans, and writers have tried to make sense of this unique game.

The advanced analytics site Football Outsiders has put together NFL drive statistics (Football Outsiders), which examine each team's drives throughout the season. A drive in the NFL is a series of plays by an offense with the objective of scoring points. Each drive has a starting field position. A drive can end in one of three ways: the offense scores points (six for a touchdown, three for a field goal), the ball is turned over to the other team, or the game clock expires.

There are two types of plays an offense can run on its drives: a rushing play, which usually happens when the quarterback hands the ball to another player or runs it himself, and a passing play, which happens when a player (usually the quarterback) throws the ball to another player. Conventional wisdom holds that rushing plays gain fewer yards on average but are a more conservative call than passing plays. Teams have generally passed a little more than 50% of the time (NFL Team Rankings), though this may be influenced by the fact that the clock stops when a pass falls incomplete (that is, when a pass is thrown but not caught), so passing plays are favored when a team needs to move down the field in a short amount of time. Data has also shown that passing the ball well matters more to winning games than rushing the ball well (Burke).

The data used in this project contains every team's drive statistics for the past sixteen years. Our goal is to develop a multiple linear regression model to predict how many points per drive (PPD) a team should score, based on other drive statistics such as passing stats, rushing stats, starting field position, time of drive, and plays per drive. Our hypothesis, based on our research, is that passing statistics will be much more important to the model than rushing statistics, that starting field position will be a significant factor (the further down the field a drive starts, the better the chance of scoring), and that more time on a drive will result in more points. We believe our results can help teams invest their resources effectively, whether in the passing game or the rushing game.

METHODOLOGY

We obtained the data by subscribing to the Armchair Analysis NFL Database (Armchair). The database is distributed as a zip file containing CSV files for everything NFL related: team and player data, schedules, history, etc. We worked with the Drive.csv file, which contained data on every single drive for the past sixteen years. The columns of this dataset were the starting field position, the total time spent on the drive in seconds, the number of passing plays and yards, the number of rushing plays and yards, the number of rushing and passing first downs on the drive, and the number of plays on the drive. It also recorded the result of each drive.

For the pre-processing step, we first created a new column labeled "Points". If the result was TD (touchdown), we assigned 6 points; if it was FG (field goal), 3 points. Drives whose result was ENDQ (end of quarter) were removed from the dataset, because end-of-quarter drives usually occur when a team is running out the clock, so the objective was never to score in the first place. All other results were assigned 0 because no points were scored on those drives. We then created new fields for passing efficiency and rushing efficiency, calculated by dividing passing yards by passing attempts and rushing yards by rushing attempts.


To further reduce bias, we also removed all fourth-quarter drives. A team trailing in the fourth quarter will pass much more than run because it needs to score quickly, so fourth-quarter numbers would likely skew toward more passing attempts than a normal drive at any given time; those drives were therefore removed entirely.
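The pre-processing steps above can be sketched in a few lines. The snippet below uses pandas with hypothetical column names (result, qtr, pass_yds, pass_att, rush_yds, rush_att); the actual Armchair Analysis column names may differ.

```python
import pandas as pd

# Toy per-drive records standing in for Drive.csv (hypothetical values).
drives = pd.DataFrame({
    "result":   ["TD", "FG", "PUNT", "ENDQ", "INT"],
    "qtr":      [1, 2, 3, 2, 4],
    "pass_yds": [45, 30, 8, 0, 12],
    "pass_att": [5, 4, 2, 0, 3],
    "rush_yds": [20, 15, 6, 1, 4],
    "rush_att": [4, 3, 3, 1, 1],
})

# Map drive results to points; every other result scores 0.
drives["points"] = drives["result"].map({"TD": 6, "FG": 3}).fillna(0)

# Drop end-of-quarter drives and all fourth-quarter drives.
drives = drives[(drives["result"] != "ENDQ") & (drives["qtr"] != 4)].copy()

# Per-drive efficiency: yards per attempt (guard against zero attempts).
drives["pass_eff"] = drives["pass_yds"] / drives["pass_att"].replace(0, pd.NA)
drives["rush_eff"] = drives["rush_yds"] / drives["rush_att"].replace(0, pd.NA)
```

The real dataset applies the same mapping and filters before any aggregation.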

The next step was to aggregate these drives by team and season. These operations can be found in the SetUp.R code, written in R. This left us with 510 team-season drive summaries covering the past sixteen years. Our dependent variable was points per drive, or PPD, which gives good insight into how efficient a particular team's offense was in a particular season.
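The team-season aggregation amounts to a group-by mean. As a sketch (with made-up values and the report's tname/seas naming), in pandas:

```python
import pandas as pd

# Hypothetical per-drive records for two teams in one season.
drives = pd.DataFrame({
    "tname":  ["NE", "NE", "NE", "PIT", "PIT"],
    "seas":   [2015, 2015, 2015, 2015, 2015],
    "points": [6, 0, 3, 0, 6],
    "time":   [180, 95, 210, 60, 240],
})

# One row per team-season: the mean of every drive statistic,
# so the mean of "points" becomes points per drive (PPD).
season = (drives.groupby(["tname", "seas"], as_index=False)
                .mean()
                .rename(columns={"points": "ppd"}))
```

Applied to the full dataset, this yields the 510 team-season rows used for modeling.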

Our approach started with an exploratory analysis, which included plotting descriptive statistics and examining correlation values. After finding no serious multicollinearity, we divided the data into training and test sets. We then used the training set to build our first model via the adjusted R-squared selection method, performed a residual analysis on that model, and checked whether interaction terms would increase the adjusted R-squared. Afterwards, we used the test data to see which model predicted better. Finally, after weighing all considerations (model accuracy, simplicity from a business perspective, etc.), we settled on the model we felt was best.

ANALYSIS

After the pre-processing phase, our dataset consists of nine predictors:

time = total time of the drive (in seconds)
start_pos = starting field position
pass_eff = passing efficiency (pass yards / pass attempts)
rush_eff = rushing efficiency (rush yards / rush attempts)
pass_att = passing attempts per drive
rush_att = rushing attempts per drive
pass_fd = first downs from passing plays per drive
rush_fd = first downs from rushing plays per drive
total_plays = plays per drive

To check the distribution of our dependent variable, we created a histogram with a normal density curve plotted over it (see Appendix A). The graph shows a normal distribution with most values concentrated around the mean of µ = 1.668, with a standard deviation of σ = 0.4035. By the empirical rule, roughly 68% of teams average between about 1.26 and 2.07 points per drive, and only a small fraction of teams average more than 2.5 points per drive (about two standard deviations above the mean).
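These interval and tail figures can be checked directly from the fitted normal parameters; a quick sketch using only the standard library:

```python
import math

mu, sigma = 1.668, 0.4035  # fitted mean and standard deviation of PPD

# Empirical rule: ~68% of team-seasons fall within one standard
# deviation of the mean.
lo, hi = mu - sigma, mu + sigma          # about (1.26, 2.07)

# Fraction of teams above 2.5 PPD under the fitted normal,
# via the standard normal CDF expressed with erf.
z = (2.5 - mu) / sigma                   # about 2.06 standard deviations
tail = 0.5 * (1 - math.erf(z / math.sqrt(2)))
```

Under the fitted normal, the upper tail beyond 2.5 PPD is on the order of 2%.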

We then considered whether any transformation of the dataset was needed. We used the Pearson correlation matrix and the scatterplot matrix to look for linearity between the response and the independent variables; both suggest a linear relationship between the Y and X variables (see Appendix B). Passing efficiency (pass_eff) and passing first downs (pass_fd) had the strongest correlations with the response, at 0.80 and 0.79 respectively, followed by total plays (total_plays) at 0.67. We also noted a potential multicollinearity issue: the pair of passing efficiency (pass_eff) and passing first downs (pass_fd) had a high correlation of 0.87. We deferred this issue until after the final predictors were selected.
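The same two-part check (predictor-response correlations plus predictor-predictor collinearity) is a one-liner in most environments; for example, with pandas on toy numbers (not the project's data):

```python
import pandas as pd

# Toy team-season data (hypothetical values) to illustrate the check.
df = pd.DataFrame({
    "ppd":      [1.2, 1.5, 1.8, 2.1, 2.4],
    "pass_eff": [5.8, 6.2, 6.9, 7.4, 8.1],
    "pass_fd":  [0.8, 0.9, 1.1, 1.3, 1.5],
    "rush_eff": [3.9, 4.4, 4.1, 4.6, 4.2],
})

# Pearson correlation matrix: scan the ppd row for strong predictors,
# and the off-diagonal predictor entries for multicollinearity.
corr = df.corr(method="pearson")
```

In the toy data, pass_eff and pass_fd are both strong predictors of ppd and are strongly correlated with each other, mirroring the pattern found in the report.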

One important note is that we tried several interaction terms to see whether the model could be improved. As discussed during the presentation, even though the interaction model has the same number of predictors as our selected model, and its adjusted R-squared was almost the same, we chose the simpler model because it supports our hypothesis more directly. The interaction model combined all nine original predictors, whereas the simpler model needs only six single predictors to give essentially the same result (see Appendix H). Hence we chose the model without interaction terms. One interaction worth mentioning is starting position (start_pos) with passing attempts (pass_att). In the main-effects model, starting position is positively associated with points per drive (PPD): the closer you are to the opposition's end zone, the better your chances of scoring. The negative association of this interaction term with PPD, however, suggests that as the offense gets closer to the opposition's end zone, it should pass less and rush more to score. Beyond this, we found nothing else of interest related to our hypothesis, so we carried on with the simpler model without interaction terms.

At this point, we could either divide the data into training and test sets for model validation or continue the analysis to reach a final model. We chose the latter, deferring the train/test split until the final predictors were chosen. We then applied three model selection methods: stepwise, adjusted R-squared, and Mallows' Cp (see Appendix C). All three methods shared six predictors: time, starting position (start_pos), passing efficiency (pass_eff), rushing efficiency (rush_eff), passing first downs (pass_fd), and rushing first downs (rush_fd). Stepwise selection suggested also including total_plays. However, we went with the adjusted R-squared method, in which the six-predictor model had an adjusted R-squared of

Page 6: TechnicalReport_NFLProject_Austin&Ovais

Austin, Siddiqui

0.873, whereas the complete model (nine predictors) had an adjusted R-squared of 0.8749. Hence adding predictors beyond six complicates the model without a meaningful gain.
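The adjusted R-squared selection method amounts to an exhaustive search over predictor subsets, scoring each by adjusted R-squared. The sketch below is a generic illustration on synthetic data, not a reproduction of the SAS ADJRSQ output:

```python
import itertools
import numpy as np

def adj_r2(X, y):
    """Adjusted R-squared for an OLS fit with intercept."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Exhaustive search over predictor subsets, keeping the best adjusted R².
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))                            # 4 candidate predictors
y = 2 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=60)   # only two matter

best = max(
    (c for r in range(1, 5) for c in itertools.combinations(range(4), r)),
    key=lambda cols: adj_r2(X[:, cols], y),
)
```

Unlike plain R-squared, the adjusted version penalizes extra predictors, so pure-noise variables tend to be excluded, which is the behavior the report relies on when stopping at six predictors.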

Next, we performed a residual analysis on the model. Our selected model satisfied all of the assumptions. The residuals-vs-predicted plot (see Appendix D) showed a random pattern with values centered around the mean, indicating constant variance and independence, with only a couple of points as outliers. Similarly, the normal probability plot followed a 45-degree line, satisfying the normality assumption.

Since we saw a couple of outliers during the residual analysis, we further checked for outliers and influential points that might affect the model, using studentized residuals and Cook's distance. Our cutoffs were |studentized residual| >= 3 and Cook's D >= 4/n. One point in particular, observation 449, exceeded both cutoffs (see Appendix E), so we checked whether it had a significant effect on the model by removing it and refitting the complete model. The result did not change: the same predictors remained significant, and removing the observation only increased the adjusted R-squared from 0.873 to 0.876 for the six significant predictors. We therefore agreed the increase was not worth discarding a data point. As for multicollinearity, we used the VIF statistic with a cutoff of 10; all values were well below it.
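The two cutoffs can be applied with a short diagnostic routine. The sketch below computes internally studentized residuals and Cook's D from scratch (via the hat matrix) on synthetic data with one planted gross outlier; it is an illustration, not the report's SAS output:

```python
import numpy as np

def outlier_diagnostics(X, y):
    """Internally studentized residuals and Cook's D for an OLS fit."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    H = A @ np.linalg.inv(A.T @ A) @ A.T      # hat matrix
    h = np.diag(H)                             # leverages
    resid = y - H @ y                          # raw residuals
    p = k + 1                                  # parameters incl. intercept
    mse = resid @ resid / (n - p)
    student = resid / np.sqrt(mse * (1 - h))   # studentized residuals
    cooks = student**2 * h / ((1 - h) * p)     # Cook's distance
    return student, cooks

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(size=50)
y[10] += 8.0                                   # plant one gross outlier

student, cooks = outlier_diagnostics(X, y)
flagged = np.where((np.abs(student) >= 3) & (cooks >= 4 / len(y)))[0]
```

The planted point is flagged by both criteria, matching how observation 449 stood out in the report's data.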

Satisfied with the model and its predictors, we calculated standardized estimates to determine which predictor has the greatest influence on points per drive (see Appendix F). Passing first downs (pass_fd) had the strongest influence on PPD, at 0.606, followed by the other passing statistic, passing efficiency (pass_eff). The rushing statistics rush_eff and rush_fd had values of 0.14 and 0.27, respectively. This outcome also supports our hypothesis that passing statistics are much more valuable than rushing statistics when it comes to scoring points.

Finally, we evaluated the predictive performance of the model by dividing the data into training and test sets with a 75/25 split using the PROC SURVEYSELECT procedure (the comparison appears in Appendix G). The RMSE values of the training and test sets were very close: 0.143 and 0.144. The training R-squared was 0.87, and the correlation between observed and predicted values in the test set was 0.92. To further check the reliability of the model, we used the cross-validated R-squared method, computing |model R2 - R2 CV|; we took a value of at most 0.3 to indicate a good model.


R2 train (per model output) = 0.8786
R2 test = 0.9297^2 = 0.8643
|model R2 - R2 CV| = 0.8786 - 0.8643 = 0.0143

The value of 0.0143 substantiates the model's strong predictive performance on unseen data. Our final model equation is:

Points per Drive = -1.4568 - 0.00337 * time + 0.03602 * start_pos
                 + 0.11615 * pass_eff + 0.08610 * rush_eff
                 + 1.087 * pass_fd + 0.877 * rush_fd + e
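The fitted equation can be wrapped in a small helper for what-if questions. The input values in the example are illustrative, not league averages drawn from the data:

```python
def predict_ppd(time, start_pos, pass_eff, rush_eff, pass_fd, rush_fd):
    """Predicted points per drive from the report's final fitted model."""
    return (-1.4568
            - 0.00337 * time        # seconds on the drive
            + 0.03602 * start_pos   # starting field position
            + 0.11615 * pass_eff    # pass yards per attempt
            + 0.08610 * rush_eff    # rush yards per attempt
            + 1.087 * pass_fd       # passing first downs per drive
            + 0.877 * rush_fd)      # rushing first downs per drive

# Example: a plausible drive profile (illustrative inputs).
ppd = predict_ppd(time=160, start_pos=28, pass_eff=6.5,
                  rush_eff=4.2, pass_fd=0.9, rush_fd=0.5)
```

Note the large pass_fd coefficient relative to rush_fd, consistent with the standardized estimates discussed above.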

FUTURE WORK

We were very satisfied with the findings of our regression analysis; however, there are some things we would look at given more time. One task would be analyzing what predicts first downs. It stands to reason that an offense with more first downs will score more points, because it is hitting its "checkpoints" regularly and therefore moving down the field. It would be interesting to build separate models predicting rushing first downs and passing first downs.

Another idea is comparing the different seasons in this data. Sports are always evolving, and data from the year 2000 may not reflect how American Football is played in 2015. Studies show the importance of the running back position in the NFL (the player who tends to get the most rushing attempts) has decreased dramatically over the years. Our hypothesis would be that recent data gives even more weight to the passing game when looking at PPD.


APPENDIX

APPENDIX A: NORMAL DENSITY CURVE PLOTTED ON TOP OF HISTOGRAM FOR

POINTS PER DRIVE

APPENDIX B: CORRELATION TABLE AND SCATTERPLOT MATRIX


APPENDIX C: MODEL SELECTION

STEPWISE OUTPUT


ADJUSTED R-SQUARE OUTPUT

CP OUTPUT:


APPENDIX D: RESIDUAL ANALYSIS

PREDICTED VS STUDENTIZED

NORMAL PROBABILITY PLOT


APPENDIX E: INFLUENTIAL POINTS AND OUTLIERS

APPENDIX F: STANDARDIZED ESTIMATES


APPENDIX G: DIVIDING THE DATA INTO TRAINING AND TESTING SET TO CHECK

THE PREDICTIVE PERFORMANCE


APPENDIX H: OUTPUT OF MODEL WITH INTERACTION TERMS


SAS CODE

*----- GET DATA FROM EXTERNAL FILE USING "INFILE" METHOD -----;
DATA NFLproject;
    INFILE "NFLproject.csv" DELIMITER = ',' FIRSTOBS = 2 MISSOVER;
    INPUT id $ tname $ seas ppd time start_pos pass_eff rush_eff pass_att
          rush_att pass_fd rush_fd total_plays;
RUN;

PROC PRINT; RUN;

TITLE "Descriptive Statistics";
proc means mean std stderr clm p25 p50 p75;
    var ppd time start_pos pass_eff rush_eff pass_att rush_att pass_fd
        rush_fd total_plays;
run;

/* creates histogram with normal density plotted on top */
TITLE "Histogram - PPD";
proc univariate normal;
    var ppd;
    histogram / normal(mu=est sigma=est);
run;

/* Proc correlation */
TITLE "RELATIONSHIP BETWEEN VARIABLES";
proc corr;
    var ppd time start_pos pass_eff rush_eff pass_att rush_att pass_fd
        rush_fd total_plays;
run;

* Creating scatter plot matrix;
TITLE "RELATIONSHIP BETWEEN VARIABLES";
PROC SGSCATTER;
    matrix ppd time start_pos pass_eff rush_eff pass_att rush_att pass_fd
           rush_fd total_plays;
run;

* Selecting a model using the STEPWISE selection method;
TITLE "RUNNING A SELECTION METHOD";
PROC REG data = NFLproject;
    model ppd = time start_pos pass_eff rush_eff pass_att rush_att pass_fd
                rush_fd total_plays / selection = stepwise;
run;

* Selecting a model using the ADJRSQ selection method;
PROC REG data = NFLproject;
    model ppd = time start_pos pass_eff rush_eff pass_att rush_att pass_fd
                rush_fd total_plays / selection = ADJRSQ;
run;


* Selecting a model using the CP selection method;
PROC REG data = NFLproject;
    model ppd = time start_pos pass_eff rush_eff pass_att rush_att pass_fd
                rush_fd total_plays / selection = CP;
run;

* Checking for model assumptions;
TITLE "RESIDUAL ANALYSIS";
PROC REG data = NFLproject;
    model ppd = time start_pos pass_eff rush_eff pass_fd rush_fd;
    * studentized residuals vs predicted values;
    plot student.*predicted.;
    * studentized residuals against every x-variable;
    plot student.*(time start_pos pass_eff rush_eff pass_fd rush_fd);
    * normal probability plot of studentized residuals;
    plot npp.*student.;
run;

* Testing for outliers and influential points;
TITLE "Checking for Outliers/Influential Points";
PROC REG;
    model ppd = time start_pos pass_eff rush_eff pass_fd rush_fd / influence r vif;
run;

* To remove observation 449;
TITLE "Model after removing the influential point";
data NFLproject_new;          * write to a different dataset;
    set NFLproject;
    if _n_ = 449 then delete; * remove the 449th observation;
run;

* Checking the complete model on the new dataset without the influential point;
* Using the ADJRSQ selection method;
PROC REG data = NFLproject_new;
    model ppd = time start_pos pass_eff rush_eff pass_att rush_att pass_fd
                rush_fd total_plays / selection = ADJRSQ;
run;

* The ADJRSQ only increased by 0.003 with six predictors, which is not worth
  discarding an observation from the dataset;

* Running proc reg on the original dataset with the six significant predictors;
TITLE "Checking for the Most Influential Predictor";
proc reg data = NFLproject;
    model ppd = time start_pos pass_eff rush_eff pass_fd rush_fd / stb;
run;

* Get training and testing data;
title "Test and Train Sets for PPD";
proc surveyselect data=NFLproject out=val_NFLproject seed=123456
    samprate=0.75 outall;
    * outall - keep all rows, flagging selected (1) vs not selected (0);


run;

title "Train Set for PPD";
data train_NFLproject (where=(Selected = 1));
    set val_NFLproject;
run;
proc print data=train_NFLproject; run;

title "Test Set for PPD";
data test_NFLproject (where=(Selected = 0));
    set val_NFLproject;
run;
proc print data=test_NFLproject; run;

TITLE "Creating new variable";
data val_NFLproject;
    set val_NFLproject;
    if selected then new_y = ppd;
run;
proc print data=val_NFLproject; run;

* Building a model with the training data;
title "Validation - Train Set";
proc reg data = train_NFLproject;
    model ppd = time start_pos pass_eff rush_eff pass_fd rush_fd;
run;

* Compare the two dataset outputs and predictions: fit on new_y (populated only
  for training rows) and keep the predictions for the held-out test rows;
TITLE "COMPARING TRAINING AND TESTING DATASET RESULTS";
proc reg data = val_NFLproject;
    model new_y = time start_pos pass_eff rush_eff pass_fd rush_fd;
    output out=outm1(where=(new_y=.)) p=yhat;
run;

proc print data=outm1; run;

title "Difference between Observed and Predicted Test Set";
data outm1_sum;
    set outm1;
    diff = ppd - yhat;
    abs_diff = abs(diff);
run;
proc summary data = outm1_sum;
    var diff abs_diff;
    output out=outm1_stats std(diff)=rmse mean(abs_diff)=mae;
run;
proc print data=outm1_stats;
    title "Validation Stats for Model 1";
run;
proc corr data=outm1;
    var ppd yhat;
run;


R CODE

File - SetUp.R

drive = read.csv("~/Downloads/Drive.csv")

df = aggregate(points ~ tname + seas, data = drive, FUN = "mean")
df$id = paste0(tolower(df$tname), df$seas)

df_yfog = aggregate(yfog ~ tname + seas, data = drive, FUN = "mean")
df_yfog$id = paste0(tolower(df_yfog$tname), df_yfog$seas)

df_time = aggregate(time ~ tname + seas, data = drive, FUN = "mean")
df_time$id = paste0(tolower(df_time$tname), df_time$seas)

df_pfd = aggregate(pfd ~ tname + seas, data = drive, FUN = "mean")
df_pfd$id = paste0(tolower(df_pfd$tname), df_pfd$seas)

df_rfd = aggregate(rfd ~ tname + seas, data = drive, FUN = "mean")
df_rfd$id = paste0(tolower(df_rfd$tname), df_rfd$seas)

df_pa = aggregate(pa ~ tname + seas, data = drive, FUN = "mean")
df_pa$id = paste0(tolower(df_pa$tname), df_pa$seas)

df_ra = aggregate(ra ~ tname + seas, data = drive, FUN = "mean")
df_ra$id = paste0(tolower(df_ra$tname), df_ra$seas)

df_passeff = aggregate(passeff ~ tname + seas, data = drive, FUN = "mean")
df_passeff$id = paste0(tolower(df_passeff$tname), df_passeff$seas)

df_rusheff = aggregate(rusheff ~ tname + seas, data = drive, FUN = "mean")
df_rusheff$id = paste0(tolower(df_rusheff$tname), df_rusheff$seas)

df_plays = aggregate(plays ~ tname + seas, data = drive, FUN = "mean")
df_plays$id = paste0(tolower(df_plays$tname), df_plays$seas)


REFERENCES

"Football Outsiders." Football Outsiders Everything. N.p., n.d. Web. 23 Nov. 2016.

"NFL Team Passing Play Percentage." NFL Football Stats - NFL Team Passing Play Percentage on

TeamRankings.com. N.p., n.d. Web. 23 Nov. 2016.

Burke, Brian. "Why Passing Is More Important Than Running in the N.F.L." The New York Times.

The New York Times, 31 Aug. 2010. Web. 23 Nov. 2016.

"Armchair Analysis.com." NFL Play Data. 697,180 Plays. Daily Updates. Armchair Analysis.com.

N.p., n.d. Web. 23 Nov. 2016.