dplyr & functions - companyname.com - super …hofroe.net/stat480/14-dplyr.pdfdplyr routines...

22
dplyr & Functions stat 480 Heike Hofmann

Upload: others

Post on 30-Jun-2020

11 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: dplyr & Functions - CompanyName.com - Super …hofroe.net/stat480/14-dplyr.pdfdplyr routines first, then visualize. Functions in R •Have been using functions a lot, now we want

dplyr & Functions stat 480

Heike Hofmann

Page 2: dplyr & Functions - CompanyName.com - Super …hofroe.net/stat480/14-dplyr.pdfdplyr routines first, then visualize. Functions in R •Have been using functions a lot, now we want

Outline

• dplyr functions and package

• Functions

Page 3: dplyr & Functions - CompanyName.com - Super …hofroe.net/stat480/14-dplyr.pdfdplyr routines first, then visualize. Functions in R •Have been using functions a lot, now we want

library(dplyr)data(baseball, package=”plyr”))

Page 4: dplyr & Functions - CompanyName.com - Super …hofroe.net/stat480/14-dplyr.pdfdplyr routines first, then visualize. Functions in R •Have been using functions a lot, now we want

Your Turn• Use data(baseball, package="plyr") to make

the baseball dataset active in R.

• Subset the data on your favorite player (you will need the Lahmann ID, e.g. Sammy Sosa sosasa01, Barry Bonds bondsba01, Babe Ruth ruthba01) Compute your player’s batting averages for each season (batting average = #Hits/#at bats). Define a new variable experience in the data set as year - min(year)Plot averages by #years of experience. Compute an all-time batting average for your player.

Page 5: dplyr & Functions - CompanyName.com - Super …hofroe.net/stat480/14-dplyr.pdfdplyr routines first, then visualize. Functions in R •Have been using functions a lot, now we want

summarise

• What does the summarise function do? Read up on it on its help pages:

•help(summarise)

Page 6: dplyr & Functions - CompanyName.com - Super …hofroe.net/stat480/14-dplyr.pdfdplyr routines first, then visualize. Functions in R •Have been using functions a lot, now we want

summarise# overall batting average

summarise(baseball,

! mba = sum(h)/sum(ab)

)

summarise(baseball,

! first = min(year),! ! ! ! ! # first year of baseball records

! duration = max(year) - min(year),! # duration of record taking in years

! nteams = length(unique(team)),! ! # different number of teams

! nplayers = length(unique(id))! ! # number of baseball players # in the dataset

)

Page 7: dplyr & Functions - CompanyName.com - Super …hofroe.net/stat480/14-dplyr.pdfdplyr routines first, then visualize. Functions in R •Have been using functions a lot, now we want

dplyr package

• introduces workflow that makes working with large datasets (relatively) easy

• main functionality:group_by, summarise, mutate, filter

• http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

Page 8: dplyr & Functions - CompanyName.com - Super …hofroe.net/stat480/14-dplyr.pdfdplyr routines first, then visualize. Functions in R •Have been using functions a lot, now we want

group_by

• group_by(data, var1, ...) is a function that takes a dataset and introduces a group for each (combination of) level(s) of the grouping variable(s)

• Power combination: group_by and summarisefor a grouped dataframe, the summary statistics will be calculated for every group

Page 9: dplyr & Functions - CompanyName.com - Super …hofroe.net/stat480/14-dplyr.pdfdplyr routines first, then visualize. Functions in R •Have been using functions a lot, now we want

library(dplyr)

data(baseball, package="plyr")

summarise(baseball, seasons = max(year)-min(year)+1, atbats = sum(ab), avg = sum(h, na.rm=T)/sum(ab, na.rm=T) )

summarise(group_by(baseball, id), seasons = max(year)-min(year)+1, atbats = sum(ab), avg = sum(h, na.rm=T)/sum(ab, na.rm=T))

seasons atbats avg1 137 4891061 0.2739821

id seasons atbats avg1 perezne01 12 5127 0.26721282 walketo04 12 4554 0.28897673 sweenma01 13 1738 0.26006904 schmija01 13 591 0.10490695 loaizes01 13 265 0.16603776 hollato01 12 3191 0.27295527 suppaje01 13 312 0.18269238 valdeis01 12 399 0.13032589 stinnke01 14 2033 0.234136710 parkch01 14 406 0.1822660.. ... ... ... ...

Page 10: dplyr & Functions - CompanyName.com - Super …hofroe.net/stat480/14-dplyr.pdfdplyr routines first, then visualize. Functions in R •Have been using functions a lot, now we want

Chaining operator %.%

• x %.% f(y) is equivalent to f(x, y)

• baseball %.% group_by(id) is equivalent togroup_by(baseball, id)

• Read %.% as ‘then’ i.e. “take data, then group it by player’s id, then summarise it to …”

Page 11: dplyr & Functions - CompanyName.com - Super …hofroe.net/stat480/14-dplyr.pdfdplyr routines first, then visualize. Functions in R •Have been using functions a lot, now we want

Chained version of example

baseball %.% group_by(id) %.% summarise( seasons = max(year)-min(year)+1, atbats = sum(ab), avg = sum(h, na.rm=T)/sum(ab, na.rm=T) )

Page 12: dplyr & Functions - CompanyName.com - Super …hofroe.net/stat480/14-dplyr.pdfdplyr routines first, then visualize. Functions in R •Have been using functions a lot, now we want

Your Turn

• Use dplyr statements to get (a) the life time batting average for each player (mba)(b) the life time number of times each player was at bats. (nab)

• Plot nab versus mba.

Page 13: dplyr & Functions - CompanyName.com - Super …hofroe.net/stat480/14-dplyr.pdfdplyr routines first, then visualize. Functions in R •Have been using functions a lot, now we want

filter• filter(data, expr1, ...)

is a function that takes a dataset and subsets it according to a set of expressions

• filter() works similarly to subset() except that you can give it any number of filtering conditions which are joined together with the logical ‘AND’ &. You can use other boolean operators explicitly

Page 14: dplyr & Functions - CompanyName.com - Super …hofroe.net/stat480/14-dplyr.pdfdplyr routines first, then visualize. Functions in R •Have been using functions a lot, now we want

Your Turn

• Use dplyr statements to get the number of team members on a team for each season (think of unique)

• Has the number of homeruns per season changed over time? Summarize the data with dplyr routines first, then visualize.

Page 15: dplyr & Functions - CompanyName.com - Super …hofroe.net/stat480/14-dplyr.pdfdplyr routines first, then visualize. Functions in R •Have been using functions a lot, now we want

Functions in R

• Have been using functions a lot, now we want to write them ourselves!

• Idea: avoid repetitive coding (errors will creep in)

• Instead: extract common core, wrap it in a function, make it reusable

Page 16: dplyr & Functions - CompanyName.com - Super …hofroe.net/stat480/14-dplyr.pdfdplyr routines first, then visualize. Functions in R •Have been using functions a lot, now we want

Basic Structure

• Name

• Input arguments

• names,

• default values

• Body

• Output values

Page 17: dplyr & Functions - CompanyName.com - Super …hofroe.net/stat480/14-dplyr.pdfdplyr routines first, then visualize. Functions in R •Have been using functions a lot, now we want

A first functionmean <- function(x) { return(sum(x)/length(x))}

mean(1:15)mean(c(1:15, NA))

mean <- function(x, na.rm=F) { if (na.rm) x <- na.omit(x) return(sum(x)/length(x))}

mean(1:15)mean(c(1:15, NA), na.rm=T)

Page 18: dplyr & Functions - CompanyName.com - Super …hofroe.net/stat480/14-dplyr.pdfdplyr routines first, then visualize. Functions in R •Have been using functions a lot, now we want

Function mean

• Name: mean

• Input arguments x, na.rm=T

• names,

• default values

• Body if(na.rm) x <- na.omit(x)

• Output values return(sum(x)/length(x))

Page 19: dplyr & Functions - CompanyName.com - Super …hofroe.net/stat480/14-dplyr.pdfdplyr routines first, then visualize. Functions in R •Have been using functions a lot, now we want

Function Writing

• Start simple, then extend

• Test out each step of the way

• Don’t try too much at once

•help(browser)

Page 20: dplyr & Functions - CompanyName.com - Super …hofroe.net/stat480/14-dplyr.pdfdplyr routines first, then visualize. Functions in R •Have been using functions a lot, now we want

Practice

•Write a function called mba input: playerIDoutput: life-time batting average for playerID

• what does mba(“bondsba01”)do?

• write a function called pstatsinput: playerIDoutput: life-time batting average for playerID number of overall at bats

Page 21: dplyr & Functions - CompanyName.com - Super …hofroe.net/stat480/14-dplyr.pdfdplyr routines first, then visualize. Functions in R •Have been using functions a lot, now we want

Checkpoint

• Submit all of your code for the last Your Turn at

http://heike.wufoo.com/forms/check-point/

Page 22: dplyr & Functions - CompanyName.com - Super …hofroe.net/stat480/14-dplyr.pdfdplyr routines first, then visualize. Functions in R •Have been using functions a lot, now we want

Testing

• Always test the functions you’ve written

• Even better: let somebody else test them for you

• Switch seats with your neighbor, test their functions!