dplyr & Functions stat 480

Heike Hofmann

Outline

• dplyr functions and package

• Functions

library(dplyr)
data(baseball, package="plyr")

Your Turn

• Use data(baseball, package="plyr") to make the baseball dataset active in R.

• Subset the data on your favorite player. You will need the Lahman ID, e.g. Sammy Sosa sosasa01, Barry Bonds bondsba01, Babe Ruth ruthba01.

• Compute your player's batting average for each season (batting average = #hits / #at bats).

• Define a new variable experience in the data set as year - min(year).

• Plot averages by years of experience.

• Compute an all-time batting average for your player.

summarise

• What does the summarise function do? Read up on it on its help pages:

• help(summarise)

summarise

# overall batting average
summarise(baseball,
  mba = sum(h)/sum(ab)
)

summarise(baseball,
  first = min(year),                 # first year of baseball records
  duration = max(year) - min(year),  # duration of record taking in years
  nteams = length(unique(team)),     # number of different teams
  nplayers = length(unique(id))      # number of baseball players in the dataset
)

dplyr package

• introduces workflow that makes working with large datasets (relatively) easy

• main functionality: group_by, summarise, mutate, filter

• http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
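Of the four main functions listed above, mutate is the only one the slides do not demonstrate. The following is a minimal sketch of how it might be used. The data frame toy is made up for illustration (it only mimics the column names of the baseball dataset), and the new column names avg and experience are assumptions.

```r
library(dplyr)

# Toy data, invented for illustration; columns mimic the baseball dataset.
toy <- data.frame(
  id   = c("a01", "a01", "b02", "b02"),
  year = c(2000, 2001, 2000, 2002),
  ab   = c(100, 200, 50, 150),
  h    = c(30, 50, 10, 45)
)

# mutate() adds new columns computed from existing ones and keeps every row.
toy_avg <- mutate(toy, avg = h / ab)

# On a grouped data frame, min(year) is evaluated separately for each player.
toy_exp <- mutate(group_by(toy, id), experience = year - min(year))
```

Unlike summarise, which collapses each group to a single row, mutate returns as many rows as it was given.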

group_by

• group_by(data, var1, ...) is a function that takes a dataset and introduces a group for each (combination of) level(s) of the grouping variable(s)

• Power combination: group_by and summarise. For a grouped data frame, the summary statistics are calculated for every group.

library(dplyr)

data(baseball, package="plyr")

summarise(baseball,
  seasons = max(year) - min(year) + 1,
  atbats = sum(ab),
  avg = sum(h, na.rm=TRUE)/sum(ab, na.rm=TRUE)
)

  seasons  atbats       avg
1     137 4891061 0.2739821

summarise(group_by(baseball, id),
  seasons = max(year) - min(year) + 1,
  atbats = sum(ab),
  avg = sum(h, na.rm=TRUE)/sum(ab, na.rm=TRUE)
)

          id seasons atbats       avg
1  perezne01      12   5127 0.2672128
2  walketo04      12   4554 0.2889767
3  sweenma01      13   1738 0.2600690
4  schmija01      13    591 0.1049069
5  loaizes01      13    265 0.1660377
6  hollato01      12   3191 0.2729552
7  suppaje01      13    312 0.1826923
8  valdeis01      12    399 0.1303258
9  stinnke01      14   2033 0.2341367
10  parkch01      14    406 0.1822660
..       ...     ...    ...       ...

Chaining operator %>%

• x %>% f(y) is equivalent to f(x, y)

• baseball %>% group_by(id) is equivalent to group_by(baseball, id)

• Read %>% as 'then', i.e. "take data, then group it by player's id, then summarise it to …"

• Early versions of dplyr spelled this operator %.%; it has since been replaced by %>%

Chained version of example

baseball %>%
  group_by(id) %>%
  summarise(
    seasons = max(year) - min(year) + 1,
    atbats = sum(ab),
    avg = sum(h, na.rm=TRUE)/sum(ab, na.rm=TRUE)
  )

Your Turn

• Use dplyr statements to get
(a) the lifetime batting average for each player (mba)
(b) the lifetime number of at bats for each player (nab)

• Plot nab versus mba.

filter

• filter(data, expr1, ...) is a function that takes a dataset and subsets it according to a set of expressions

• filter() works similarly to subset(), except that you can give it any number of filtering conditions, which are joined together with the logical AND (&). Other Boolean operators, such as | (OR) and ! (NOT), have to be written explicitly.
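The difference between comma-separated conditions and explicit Boolean operators can be sketched as follows. The data frame toy is made up for illustration and only mimics the column names of the baseball dataset; the result names are likewise assumptions.

```r
library(dplyr)

# Toy data, invented for illustration; columns mimic the baseball dataset.
toy <- data.frame(
  id   = c("a01", "a01", "b02", "b02"),
  year = c(2000, 2001, 2000, 2002),
  team = c("X", "X", "Y", "X")
)

# Conditions separated by commas are combined with AND.
recent_x <- filter(toy, year >= 2001, team == "X")

# The same subset, with the AND written out explicitly.
recent_x2 <- filter(toy, year >= 2001 & team == "X")

# OR is never implied; it has to be written with |.
recent_or_y <- filter(toy, year >= 2001 | team == "Y")
```

Here recent_x and recent_x2 contain the same two rows, while recent_or_y keeps every row that satisfies at least one of the two conditions.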

Your Turn

• Use dplyr statements to get the number of team members on a team for each season (think of unique)

• Has the number of home runs per season changed over time? Summarize the data with dplyr routines first, then visualize.

Functions in R

• Have been using functions a lot, now we want to write them ourselves!

• Idea: avoid repetitive coding (errors will creep in)

• Instead: extract common core, wrap it in a function, make it reusable

Basic Structure

• Name

• Input arguments

  • names,

  • default values

• Body

• Output values

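A first function

# note: defining mean here masks base::mean in the workspace
mean <- function(x) {
  return(sum(x)/length(x))
}

mean(1:15)
mean(c(1:15, NA))   # returns NA: the missing value propagates through sum()

mean <- function(x, na.rm=F) {
  if (na.rm) x <- na.omit(x)   # drop missing values on request
  return(sum(x)/length(x))
}

mean(1:15)
mean(c(1:15, NA), na.rm=T)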

Function mean

• Name: mean

• Input arguments: x, na.rm=F

  • names: x, na.rm

  • default values: na.rm=F (x has no default)

• Body: if (na.rm) x <- na.omit(x)

• Output values: return(sum(x)/length(x))

Function Writing

• Start simple, then extend

• Test out each step of the way

• Don’t try too much at once

• help(browser)

Practice

• Write a function called mba
input: playerID
output: lifetime batting average for playerID

• What does mba("bondsba01") do?

• Write a function called pstats
input: playerID
output: lifetime batting average for playerID, number of overall at bats

Checkpoint

• Submit all of your code for the last Your Turn at

http://heike.wufoo.com/forms/check-point/

Testing

• Always test the functions you’ve written

• Even better: let somebody else test them for you

• Switch seats with your neighbor, test their functions!
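One lightweight way to test a function is to write down a few cases with known answers and check them with stopifnot(). The sketch below reuses the averaging function from the earlier slide under the hypothetical name my_mean, so that base R's mean() stays available for comparison.

```r
# my_mean is a hypothetical name; the body follows the function from the slides.
my_mean <- function(x, na.rm = FALSE) {
  if (na.rm) x <- na.omit(x)
  sum(x) / length(x)
}

# A case with a known answer: 1 + 2 + ... + 15 = 120, and 120 / 15 = 8.
stopifnot(my_mean(1:15) == 8)

# A missing value propagates unless na.rm = TRUE is requested.
stopifnot(is.na(my_mean(c(1:15, NA))))
stopifnot(my_mean(c(1:15, NA), na.rm = TRUE) == 8)

# Compare against a trusted reference implementation.
stopifnot(isTRUE(all.equal(my_mean(c(2.5, 3.5)), mean(c(2.5, 3.5)))))
```

stopifnot() is silent when every check passes and stops with an error at the first one that fails, which makes a block like this easy to rerun after each change to the function.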
