Data manipulation · emeyers.scripts.mit.edu/.../cs149_slides/class11.pdf


Post on 25-May-2020


Data manipulation

Outline for today

•  Better know a sport: Roberto Clemente
•  Review of multiple linear regression
•  Manipulating data with dplyr
•  Worksheet 5

Better know a player: Roberto Clemente

Announcements

Extra office hours this Friday. There is a sign-up link on Moodle, or email me.

Start thinking about your final project for the class. The guidelines for the final project are on Moodle.

A project proposal is due on Wednesday, March 29th.

Review

Regression

Regression is a method of using one variable to predict the value of a second variable. In linear regression we fit a line to the data, called the regression line.

ŷ = a + b·x
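As a minimal sketch of fitting the line ŷ = a + b·x in R, the following uses a tiny made-up data set (the numbers are illustrative only and do not come from the course data). The names x, y, a, b and y.hat.new are chosen here for illustration.

```r
# Toy data, invented for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Fit the regression line y-hat = a + b * x
fit <- lm(y ~ x)
a <- unname(coef(fit)[1])  # intercept
b <- unname(coef(fit)[2])  # slope

# Use the fitted line to predict y at a new x value
y.hat.new <- a + b * 6
```

For this toy data the fitted line is ŷ = 2.2 + 0.6·x, so the prediction at x = 6 is 5.8.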

Measuring goodness of fit

Residual = Observed – Predicted = y – ŷ

We can measure how well the line fits the data using the mean squared error (MSE), the average of the squared residuals:

MSE = (1/n) · Σ (y – ŷ)²

The least squares regression line minimizes the MSE.
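To make the residual and MSE definitions concrete, here is a small sketch on made-up data (not the course data). It computes the MSE of the least squares line and compares it with the MSE of one arbitrary alternative line; the alternative line ŷ = 2 + 0.5·x is just an example chosen for this illustration.

```r
# Toy data, invented for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Helper: MSE of the line y-hat = a + b * x on the toy data
mse.of.line <- function(a, b) {
  y.hat <- a + b * x
  residual <- y - y.hat          # observed minus predicted
  mean(residual^2)
}

# Least squares line found by lm()
fit <- lm(y ~ x)
mse.least.squares <- mse.of.line(coef(fit)[1], coef(fit)[2])

# Some other line, chosen arbitrarily for comparison
mse.other <- mse.of.line(2, 0.5)
```

Here mse.least.squares is 0.48 and mse.other is 0.75; any other choice of a and b would also give an MSE of at least 0.48 on this data.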

Multiple regression

We have observed that some statistics combine multiple types of measurements

Batting average: BA = [(1)·1B + (1)·2B + (1)·3B + (1)·HR] / AB

Slugging percentage: Slug = [(1)·1B + (2)·2B + (3)·3B + (4)·HR] / AB

Generic linear statistic: stat = w1·BB + w2·HBP + w3·1B + w4·2B + w5·3B + w6·HR + w0
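The idea that batting average and slugging percentage are the same weighted sum with different weights can be sketched in R. The season totals below are made up for a hypothetical player, and weighted.stat() is a helper name introduced only for this illustration.

```r
# Hypothetical season totals for one player (made-up numbers)
X1B <- 100  # singles
X2B <- 30   # doubles
X3B <- 5    # triples
HR  <- 20   # home runs
AB  <- 500  # at bats

# Weighted sum of hit types divided by at bats;
# w holds the weights for singles, doubles, triples, home runs
weighted.stat <- function(w) {
  (w[1] * X1B + w[2] * X2B + w[3] * X3B + w[4] * HR) / AB
}

BA   <- weighted.stat(c(1, 1, 1, 1))  # every hit counts once
Slug <- weighted.stat(c(1, 2, 3, 4))  # hits weighted by bases
```

With these totals BA is 155/500 = 0.31 and Slug is 255/500 = 0.51.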

Multiple regression

Multiple regression makes predictions (ŷ) using multiple variables

We can find the optimal weights (wi's) for a combination of basic statistics by minimizing the MSE on ŷ = w1·BB + w2·HBP + w3·1B + w4·2B + w5·3B + w6·HR + w0

fit <- lm(R ~ BB + HBP + H + X2B + X3B + HR, data = team.batting.162)

The R model fit contains the weights wi's for making predictions ŷ

coef(fit)
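Since team.batting.162 is a course data set, the sketch below uses simulated team totals instead so that it runs anywhere. The column names mirror the slide, but the data, the "true" weights (0.3, 0.5, 1.4, and intercept -400) and the noise level are all assumptions made up for this example.

```r
set.seed(149)
n <- 100

# Simulated team season totals (made-up ranges, not real data)
toy <- data.frame(
  BB = sample(400:650, n, replace = TRUE),
  H  = sample(1300:1600, n, replace = TRUE),
  HR = sample(100:250, n, replace = TRUE)
)

# Runs generated from known weights plus random noise
toy$R <- -400 + 0.3 * toy$BB + 0.5 * toy$H + 1.4 * toy$HR + rnorm(n, sd = 5)

# Multiple regression: find the weights that minimize the MSE
fit <- lm(R ~ BB + H + HR, data = toy)
w <- coef(fit)  # w["(Intercept)"] plays the role of w0
```

Because the simulated runs were built from known weights, coef(fit) should land close to those weights, which is a quick sanity check that lm() is doing what the slide describes.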

What is the best statistic we can create?

Suppose we fit:

fit <- lm(R ~ BB + H + X2B + X3B + HR + BRA + OBP + AB, data = team.batting.162)

RMSE:

sqrt(mean(fit$residuals^2))   # 23.49

How accurate would this RMSE be if we applied it to new data?

Overfitting

Fitting a model too precisely to the data at hand, in such a way that it does not generalize to new data, is called overfitting

If we used the same data to fit our model (find the wi's) as we did to evaluate whether it was a good fit, our estimate of RMSE might be too optimistic

•  i.e., our estimate of the RMSE might be too small
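One way to see why the training RMSE is optimistic: adding predictors can never increase the RMSE measured on the data used for fitting, even when the extra predictors are pure noise. The sketch below uses simulated data; the variable names, the weight 0.5 and the noise columns are assumptions invented for this illustration.

```r
set.seed(149)
n <- 40

# Simulated data: runs depend on hits plus random noise
toy <- data.frame(H = rnorm(n, mean = 1400, sd = 60))
toy$R <- 0.5 * toy$H + rnorm(n, sd = 25)

# Add ten predictors that are pure noise, unrelated to R
for (j in 1:10) {
  toy[[paste0("noise", j)]] <- rnorm(n)
}

# RMSE computed on the same data used to fit the model
rmse <- function(fit) sqrt(mean(residuals(fit)^2))

fit.small <- lm(R ~ H, data = toy)  # one meaningful predictor
fit.big   <- lm(R ~ ., data = toy)  # H plus the ten noise columns

rmse.small <- rmse(fit.small)
rmse.big   <- rmse(fit.big)
```

rmse.big is no larger than rmse.small even though the noise columns carry no information about runs, so a low training RMSE by itself says little about how the model will do on new data.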

Overfitting

One should always use different data when fitting and evaluating a model!

Cross-validation

Cross-validation is a method for assessing the goodness of a model in a way that can avoid overfitting

What we do is build the model on one set of data, called the training set

•  i.e., find the coefficients on one set of data

Then we evaluate whether the model fits well on a second set of data, called the test set

•  i.e., predict the ŷ's based on x's from a new data set

If the model is truly good, we should get good predictions on the test set

•  i.e., a small RMSE on the test set

Matthew's Batting Statistic (MBS)

Randomly split the data:

•  ½ of the data is in the training set
•  ½ of the data is in the test set

Fit the model using the training data for the MBS:

fit <- lm(R ~ BB + H + X2B + …, data = training.data)

Make predictions on the test data

predicted.yhats <- predict(fit, newdata = test.data)
cross.validated.RMSE <- sqrt(mean((predicted.yhats - test.data$R)^2))

MSE for predictions made using the same training data x's and y's: 23.18

MSE for predictions made on the test data x's and y's: 24.63
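The slide does not show how the random split itself is made. One possible way, sketched here on simulated data, is to draw half of the row indices with sample(). The data frame toy, its made-up weights and the name train.rows are assumptions for this example; training.data, test.data, predicted.yhats and cross.validated.RMSE follow the names used on the slide.

```r
set.seed(149)
n <- 200

# Simulated team totals (made-up, stands in for the course data)
toy <- data.frame(
  BB = sample(400:650, n, replace = TRUE),
  H  = sample(1300:1600, n, replace = TRUE),
  HR = sample(100:250, n, replace = TRUE)
)
toy$R <- -400 + 0.3 * toy$BB + 0.5 * toy$H + 1.4 * toy$HR + rnorm(n, sd = 25)

# Randomly pick half of the rows for training; the rest are for testing
train.rows    <- sample(n, size = n / 2)
training.data <- toy[train.rows, ]
test.data     <- toy[-train.rows, ]

# Fit on the training set only
fit <- lm(R ~ BB + H + HR, data = training.data)
training.RMSE <- sqrt(mean(residuals(fit)^2))

# Evaluate on the held-out test set
predicted.yhats <- predict(fit, newdata = test.data)
cross.validated.RMSE <- sqrt(mean((predicted.yhats - test.data$R)^2))
```

Comparing training.RMSE with cross.validated.RMSE gives the same kind of comparison as the two numbers reported on the slide; calling set.seed() first makes the random split reproducible.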

Manipulating data with dplyr

R packages add additional functions to R

library('package.name')

dplyr is a very useful package for manipulating data frames

library('dplyr')

There are several very useful functions in the dplyr package including:

•  filter()
•  select()
•  mutate()
•  group_by()
•  summarize()

All these functions take a data frame as input and return a data frame as output

filter()

The filter() function returns a subset of rows in a data frame

Example:

all.data <- get.Lahman.batting.data()
red.sox.data <- filter(all.data, teamID == "BOS")

Question: How could we get all players who have fewer than 300 PA?

max.300PA <- filter(all.data, PA < 300)

The pipe operator %>%

The pipe operator %>% allows us to chain commands together

all.data <- get.Lahman.batting.data()
red.sox.2015 <- all.data %>%
  filter(teamID == "BOS") %>%
  filter(yearID == 2015)
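To show what the pipe is doing, the sketch below runs the same two filters three ways on a small made-up data frame: as nested function calls, as a piped chain, and as a single filter() call with two conditions. The data frame toy and its values are invented for this illustration; the column names teamID and yearID follow the slide.

```r
library(dplyr)

# Small made-up data frame standing in for the Lahman batting data
toy <- data.frame(
  playerID = c("a", "b", "c", "d"),
  teamID   = c("BOS", "BOS", "NYA", "BOS"),
  yearID   = c(2015, 2014, 2015, 2015),
  H        = c(150, 120, 90, 30)
)

# Nested calls: read from the inside out
nested <- filter(filter(toy, teamID == "BOS"), yearID == 2015)

# Piped: x %>% f(y) is the same as f(x, y), read top to bottom
piped <- toy %>%
  filter(teamID == "BOS") %>%
  filter(yearID == 2015)

# filter() also accepts several conditions at once
combined <- toy %>% filter(teamID == "BOS", yearID == 2015)
```

All three results contain the same two rows (players "a" and "d"); the piped version is usually the easiest to read once a chain has more than one or two steps.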

select()

The select() function returns a subset of the variables

•  i.e., a subset of the columns of a data frame

Example:

all.data <- get.Lahman.batting.data()
data.hits.and.walks <- select(all.data, H, BB)

Question: How could we get only home runs and doubles?

data.homeruns.and.doubles <- select(all.data, HR, X2B)

mutate()

The mutate() function adds new variables to a data frame from variables that are already in the data frame

•  i.e., creates new columns from old columns

Example:

data.with.1B <- mutate(all.data, X1B = H - X2B - X3B - HR)

Question:

•  How can we add BRA (which is OBP * SlugPct) to our data frame?

data.with.BRA <- mutate(all.data, BRA = OBP * SlugPct)
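A runnable sketch of mutate() on a small made-up data frame follows, computing both new columns from the slide in one call and then using select() to keep only the new columns. The values in toy are invented; the column names match those on the slide.

```r
library(dplyr)

# Two made-up player seasons
toy <- data.frame(
  H       = c(150, 100),
  X2B     = c(30, 20),
  X3B     = c(5, 0),
  HR      = c(15, 10),
  OBP     = c(0.400, 0.300),
  SlugPct = c(0.500, 0.400)
)

# mutate() can create several new columns at once;
# select() then keeps just the columns we name
toy.new <- toy %>%
  mutate(X1B = H - X2B - X3B - HR,
         BRA = OBP * SlugPct) %>%
  select(X1B, BRA)
```

For these two rows X1B is 100 and 70, and BRA is 0.20 and 0.12.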

group_by()

The group_by() function groups the rows of a data frame by the values of a categorical variable

•  by itself it does nothing, but it is useful in conjunction with the summarize() function, as described on the next slide

Example:

data.team.grouped <- group_by(all.data, teamID)

Question:

•  How can we group data by year?

data.year.grouped <- group_by(all.data, yearID)

summarize()

The summarize() function reduces the data based on the grouping assigned by the group_by() function

•  i.e., it takes many cases and creates summary statistics from these cases separately for each grouping.

Example:

data.team.grouped <- all.data %>%
  group_by(teamID) %>%
  summarize(sum(H, na.rm = TRUE))

Question: How can we get the total number of hits as a function of the year?

data.year.grouped <- all.data %>%
  group_by(yearID) %>%
  summarize(sum(H, na.rm = TRUE))
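Here is a small runnable sketch of group_by() followed by summarize(), using a made-up data frame with one missing value. Unlike the slide's code, it gives the summary column an explicit name (total.H, a name chosen for this example), which makes the result easier to refer to afterwards.

```r
library(dplyr)

# Made-up hit totals; one value is missing on purpose
toy <- data.frame(
  teamID = c("BOS", "BOS", "NYA", "NYA"),
  yearID = c(2014, 2015, 2014, 2015),
  H      = c(100, NA, 80, 90)
)

# One output row per team; na.rm = TRUE skips the missing value
hits.by.team <- toy %>%
  group_by(teamID) %>%
  summarize(total.H = sum(H, na.rm = TRUE))

# Same idea, grouped by year instead
hits.by.year <- toy %>%
  group_by(yearID) %>%
  summarize(total.H = sum(H, na.rm = TRUE))
```

hits.by.team has total.H of 100 for BOS and 170 for NYA; hits.by.year has 180 for 2014 and 90 for 2015. Without na.rm = TRUE, any group containing an NA would sum to NA.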

Summary of some of what we have learned about descriptive statistics

Descriptive statistics: median, mean, standard deviation, percentiles, five number summary, range, interquartile range, z-scores, correlation

Plots: bar plots, pie charts, histograms, boxplots, scatterplots

Regression: linear regression, multiple regression, residuals, RMSE, overfitting

A lot about baseball and analyzing data in R

Worksheet 5

> get.worksheet(5)

Please get started on this worksheet early; some of the questions on the worksheet might be challenging!

After the break: probability and inferential statistics