Source: emeyers.scripts.mit.edu/.../cs149_slides/class11.pdf (posted 25-May-2020)
Data manipulation
Outline for today
• Better know a player: Roberto Clemente
• Review of multiple linear regression
• Manipulating data with dplyr
• Worksheet 5
Better know a player: Roberto Clemente
Announcements
Extra office hours this Friday. There is a signup link on Moodle, or email me.
Start thinking about your final project for the class. The guidelines for the final project are on Moodle.
A project proposal is due on Wednesday March 29th.
Review
Regression
Regression is a method of using one variable to predict the value of a second variable.
In linear regression we fit a line to the data, called the regression line:
ŷ = a + b·x
Measuring goodness of fit
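The residual for a data point is the difference between the observed and the predicted value:
Residual = Observed - Predicted = y - ŷ
We can measure how well the line fits the data using the mean squared error (MSE), the average of the squared residuals:
MSE = (1/n) · Σ (yᵢ - ŷᵢ)²
The least squares regression line minimizes the MSE.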
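As a minimal sketch of the ideas on this slide, the code below fits a regression line to a small made-up data set (the x and y values are hypothetical, not from the class data), computes the residuals and the MSE by hand, and checks that shifting the line makes the MSE larger.

```r
# Hypothetical toy data: x is the predictor, y is the response
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Fit the least squares regression line y.hat = a + b * x
fit <- lm(y ~ x)
a <- coef(fit)[["(Intercept)"]]
b <- coef(fit)[["x"]]

# Residual = Observed - Predicted
y.hat <- a + b * x
residuals.by.hand <- y - y.hat

# MSE is the mean of the squared residuals
mse <- mean(residuals.by.hand^2)

# A different line (same slope, intercept shifted by 0.5) has a larger MSE
mse.other <- mean((y - (a + 0.5 + b * x))^2)

print(c(mse = mse, mse.other = mse.other))
```

The residuals computed by hand should agree with residuals(fit), which is what lm() stores for us.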
Multiple regression
We have observed that some statistics combine multiple types of measurements
Batting average: BA = [(1)·1B + (1)·2B + (1)·3B + (1)·HR] / AB
Slugging percentage: Slug = [(1)·1B + (2)·2B + (3)·3B + (4)·HR] / AB
Generic linear statistic: stat = w1·BB + w2·HBP + w3·1B + w4·2B + w5·3B + w6·HR + w0
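To make the "weighted sum" idea concrete, here is a small sketch that computes batting average and slugging percentage from the same counts using two different weight vectors. The season totals are invented for illustration, and weighted.stat() is a hypothetical helper, not a function from the class code.

```r
# Hypothetical season totals for one player (not real data)
X1B <- 100   # singles
X2B <- 30    # doubles
X3B <- 5     # triples
HR  <- 20    # home runs
AB  <- 500   # at bats

counts <- c(X1B, X2B, X3B, HR)

# Hypothetical helper: weighted sum of the hit counts, divided by at bats
weighted.stat <- function(w) sum(w * counts) / AB

BA   <- weighted.stat(c(1, 1, 1, 1))   # every hit counts once
Slug <- weighted.stat(c(1, 2, 3, 4))   # hits weighted by number of bases

print(c(BA = BA, Slug = Slug))
```

The only difference between the two statistics is the weight vector, which is what motivates letting regression choose the weights.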
Multiple regression
Multiple regression makes predictions (ŷ) using multiple variables
We can find the optimal weights (wi's) for a combination of basic statistics by minimizing the MSE on
ŷ = w1·BB + w2·HBP + w3·1B + w4·2B + w5·3B + w6·HR + w0
fit <- lm(R ~ BB + HBP + H + X2B + X3B + HR, data = team.batting.162)
The R model object fit contains the weights (wi's) for making predictions ŷ:
coef(fit)
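The following sketch shows how the weights returned by coef() turn into predictions. Because team.batting.162 is not available here, it uses simulated team totals (team.sim and the weights used to generate it are assumptions made up for this example) with only three predictors.

```r
set.seed(1)
n <- 100

# Simulated (not real) team totals; column names mimic the slide
team.sim <- data.frame(
  BB = rnorm(n, mean = 500, sd = 50),
  H  = rnorm(n, mean = 1400, sd = 80),
  HR = rnorm(n, mean = 150, sd = 30)
)

# Runs generated from made-up weights plus random noise
team.sim$R <- 0.3 * team.sim$BB + 0.5 * team.sim$H + 1.4 * team.sim$HR -
  400 + rnorm(n, mean = 0, sd = 20)

# Multiple regression: find the weights that minimize the MSE
fit.sim <- lm(R ~ BB + H + HR, data = team.sim)
w <- coef(fit.sim)
print(w)

# Predictions are the weighted sum of the predictors plus the intercept w0
y.hat <- w[["(Intercept)"]] +
  w[["BB"]] * team.sim$BB +
  w[["H"]]  * team.sim$H +
  w[["HR"]] * team.sim$HR
```

The hand-computed y.hat matches fitted(fit.sim), so coef() really does hold everything needed to make predictions.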
What is the best statistic we can create?
Suppose we fit:
fit <- lm(R ~ BB + H + X2B + X3B + HR + BRA + OBP + AB, data = team.batting.162)
RMSE:
sqrt(mean(fit$residuals^2))
23.49
How accurate would this RMSE be if we applied the model to new data?
Overfitting
Fitting a model too precisely to the data at hand, in such a way that it does not generalize to new data, is called overfitting
If we use the same data to fit our model (find the wi's) as we do to evaluate whether it is a good fit, our estimate of the RMSE might be too optimistic
• i.e., our estimate of the RMSE might be too small
Overfitting
One should always use different data when fitting and evaluating a model!
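A small simulation can illustrate why the RMSE computed on the fitting data is too optimistic. In this sketch (all data are simulated, and the z1 to z10 columns are pure noise by construction), adding predictors that have nothing to do with y still lowers the RMSE on the data used for fitting.

```r
set.seed(2)
n <- 30

x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)

# Ten predictors that are unrelated to y by construction
noise <- matrix(rnorm(n * 10), nrow = n)
colnames(noise) <- paste0("z", 1:10)

d <- data.frame(y = y, x = x, noise)

# RMSE computed on the same data that was used to fit the model
rmse <- function(fit) sqrt(mean(residuals(fit)^2))

fit.small <- lm(y ~ x, data = d)   # the "true" predictor only
fit.big   <- lm(y ~ ., data = d)   # x plus the ten noise predictors

print(c(small = rmse(fit.small), big = rmse(fit.big)))
```

The bigger model looks better on the fitting data even though the extra predictors carry no information, which is the symptom of overfitting described above.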
Cross-validation
Cross-validation is a method for assessing the goodness of a model in a way that can avoid overfitting
What we do is build the model on one set of data, called the training set
• i.e., find the coefficients on one set of data
Then we evaluate whether the model fits well on a second set of data, called the test set
• i.e., predict the ŷ's based on x's from a new data set
If the model is truly good, we should get good predictions on the test set
• i.e., a small RMSE on the test set
Matthew's Batting Statistic (MBS)
Randomly split the data:
• ½ of the data is in the training set
• ½ of the data is in the test set
Fit the model using the training data for the MBS:
fit <- lm(R ~ BB + H + X2B + …, data = training.data)
Make predictions on the test data:
predicted.yhats <- predict(fit, newdata = test.data)
cross.validated.RMSE <- sqrt(mean((predicted.yhats - test.data$R)^2))
RMSE for predictions made using the same training data x's and y's: 23.18
RMSE for predictions made on the test data x's and y's: 24.63
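The slide does not show how the random split is made. One possible way, sketched here on simulated data (the data frame sim, its columns, and the use of sample() are assumptions for illustration, not necessarily what was done in class), is to draw half of the row indices at random for the training set and use the rest as the test set.

```r
set.seed(3)
n <- 200

# Simulated (not real) data with two predictors and a response R
sim <- data.frame(
  BB = rnorm(n, mean = 500, sd = 50),
  H  = rnorm(n, mean = 1400, sd = 80)
)
sim$R <- 0.3 * sim$BB + 0.5 * sim$H - 200 + rnorm(n, mean = 0, sd = 25)

# Randomly split the rows: half for training, half for testing
train.idx     <- sample(n, size = n / 2)
training.data <- sim[train.idx, ]
test.data     <- sim[-train.idx, ]

# Fit on the training data only
fit <- lm(R ~ BB + H, data = training.data)

# RMSE on the data used for fitting
training.RMSE <- sqrt(mean(residuals(fit)^2))

# RMSE on data the model has never seen
predicted.yhats      <- predict(fit, newdata = test.data)
cross.validated.RMSE <- sqrt(mean((predicted.yhats - test.data$R)^2))

print(c(training = training.RMSE, test = cross.validated.RMSE))
```

The test RMSE is the more honest estimate of how the model would do on new data; it is often, though not always, somewhat larger than the training RMSE.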
Manipulating data with dplyr
R packages add additional functions to R
library('package.name')
dplyr is a very useful package for manipulating data frames
library('dplyr')
There are several very useful functions in the dplyr package including:
• filter()
• select()
• mutate()
• group_by()
• summarize()
All these functions take a data frame as input and return a data frame as output
filter()
The filter() function returns a subset of the rows in a data frame
Example:
all.data <- get.Lahman.batting.data()
red.sox.data <- filter(all.data, teamID == "BOS")
Question: How could we get all players who have fewer than 300 PA?
max.300PA <- filter(all.data, PA < 300)
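Since get.Lahman.batting.data() is a class-specific function, the sketch below uses a tiny made-up data frame (toy.data and its values are hypothetical) so that the filter() calls can be run anywhere dplyr is installed. It also shows that several conditions can be given in one call.

```r
library(dplyr)

# Hypothetical stand-in for the batting data (values are made up)
toy.data <- data.frame(
  playerID = c("a01", "b02", "c03", "d04"),
  teamID   = c("BOS", "NYA", "BOS", "BOS"),
  yearID   = c(2014, 2015, 2015, 2015),
  PA       = c(620, 150, 410, 90),
  stringsAsFactors = FALSE
)

# Keep only the rows where the condition is TRUE
red.sox.toy <- filter(toy.data, teamID == "BOS")
under.300PA <- filter(toy.data, PA < 300)

# Conditions separated by commas must all hold
red.sox.under.300 <- filter(toy.data, teamID == "BOS", PA < 300)

print(red.sox.under.300)
```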
The pipe operator %>%
The pipe operator %>% allows us to chain commands together
all.data <- get.Lahman.batting.data()
red.sox.2015 <- all.data %>%
  filter(teamID == "BOS") %>%
  filter(yearID == 2015)
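One way to read the pipe is that x %>% f(y) means f(x, y). The sketch below, again on a made-up data frame (toy.data is hypothetical), writes the same two filters as nested calls and as a pipeline and checks that the results agree.

```r
library(dplyr)

# Hypothetical stand-in for the batting data (values are made up)
toy.data <- data.frame(
  teamID = c("BOS", "NYA", "BOS", "BOS"),
  yearID = c(2014, 2015, 2015, 2015),
  H      = c(150, 40, 110, 20),
  stringsAsFactors = FALSE
)

# Nested calls: read from the inside out
nested <- filter(filter(toy.data, teamID == "BOS"), yearID == 2015)

# Piped calls: read from top to bottom
piped <- toy.data %>%
  filter(teamID == "BOS") %>%
  filter(yearID == 2015)

print(piped)
```

Both versions do the same work; the piped version is usually easier to read once more than two steps are chained.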
select()
The select() function returns a subset of the variables
• i.e., a subset of the columns of a data frame
Example:
all.data <- get.Lahman.batting.data()
data.hits.and.walks <- select(all.data, H, BB)
Question: How could we select only home runs and doubles?
data.homeruns.and.doubles <- select(all.data, HR, X2B)
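A runnable sketch of select() on a made-up data frame (toy.data is hypothetical) is shown below. It also shows that a minus sign drops a column instead of keeping it.

```r
library(dplyr)

# Hypothetical stand-in for the batting data (values are made up)
toy.data <- data.frame(
  playerID = c("a01", "b02", "c03"),
  H        = c(150, 40, 110),
  BB       = c(60, 10, 35),
  HR       = c(20, 2, 12),
  X2B      = c(30, 8, 25),
  stringsAsFactors = FALSE
)

# Keep only the named columns; all rows are kept
hits.and.walks <- select(toy.data, H, BB)

# A minus sign drops the named column and keeps the rest
no.id <- select(toy.data, -playerID)

print(names(hits.and.walks))
print(names(no.id))
```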
mutate()
The mutate() function adds new variables to a data frame from variables that are already in the data frame
• i.e., creates new columns from old columns
Example:
data.with.1B <- mutate(all.data, X1B = H - X2B - X3B - HR)
Question: How can we add BRA (which is OBP * SlugPct) to our data frame?
data.with.BRA <- mutate(all.data, BRA = OBP * SlugPct)
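The sketch below runs mutate() on a made-up data frame (toy.data and its values are hypothetical). It also illustrates that a column created earlier in the same mutate() call, here X1B, can be used by a later expression in that call.

```r
library(dplyr)

# Hypothetical stand-in for the batting data (values are made up)
toy.data <- data.frame(
  H   = c(150, 100),
  X2B = c(30, 20),
  X3B = c(5, 2),
  HR  = c(20, 10),
  AB  = c(500, 400)
)

with.new.columns <- mutate(
  toy.data,
  X1B  = H - X2B - X3B - HR,                       # singles
  BA   = H / AB,                                   # batting average
  Slug = (X1B + 2 * X2B + 3 * X3B + 4 * HR) / AB   # uses the new X1B column
)

print(with.new.columns)
```

The original columns are kept and the new ones are appended on the right.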
group_by()
The group_by() function groups the rows of a data frame by the values of a categorical variable
• by itself it does nothing visible, but it is useful in conjunction with the summarize() function as described on the next slide
Example:
data.team.grouped <- group_by(all.data, teamID)
Question: How can we group the data by year?
data.year.grouped <- group_by(all.data, yearID)
summarize()
The summarize() function reduces the data based on the grouping assigned by the group_by() function
• i.e., it takes many cases and creates summary statistics from these cases separately for each grouping.
Example:
data.team.grouped <- all.data %>%
  group_by(teamID) %>%
  summarize(sum(H, na.rm = TRUE))
Question: How can we get the total number of hits as a function of the year?
data.year.grouped <- all.data %>%
  group_by(yearID) %>%
  summarize(sum(H, na.rm = TRUE))
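Here is a runnable sketch of group_by() followed by summarize() on a made-up data frame (toy.data is hypothetical). Naming the summary columns, as in total.H and n.players below, is optional but makes the result easier to use; n() is the dplyr function that counts the rows in each group.

```r
library(dplyr)

# Hypothetical stand-in for the batting data; one H value is missing
toy.data <- data.frame(
  teamID = c("BOS", "NYA", "BOS", "NYA", "BOS"),
  yearID = c(2014, 2014, 2015, 2015, 2015),
  H      = c(150, NA, 110, 20, 40),
  stringsAsFactors = FALSE
)

hits.by.year <- toy.data %>%
  group_by(yearID) %>%
  summarize(
    total.H   = sum(H, na.rm = TRUE),   # na.rm = TRUE skips the missing value
    n.players = n()                     # number of rows in each group
  )

print(hits.by.year)
```

The result has one row per year rather than one row per player.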
Summary of some of what we have learned about descriptive statistics
Descriptive statistics: median, mean, standard deviation, percentiles, five number summary, range, interquartile range, z-scores, correlation
Plots: bar plots, pie charts, histograms, boxplots, scatterplots
Regression: linear regression, multiple regression, residuals, RMSE, overfitting
A lot about baseball and analyzing data in R
Worksheet 5
> get.worksheet(5)
Please get started on this worksheet early; some of the questions on the worksheet might be challenging!
After the break: probability and inferential statistics