
Post on 24-Aug-2020


Page 1: Data Wrangling...Make data suitable to use with a 1 particular piece of software 2 Reveal information Data Wrangling: Two Goals--Grolemund& Wickham, R for Data Science, O'Reilly 2016

Data Wrangling

Country 2011 2012 2013

FR 7000 6900 7000

DE 5800 6000 6200

US 15000 14000 13000

Data Wrangling: Two Goals

1. Make data suitable to use with a particular piece of software
2. Reveal information

-- Grolemund & Wickham, R for Data Science, O'Reilly 2016

Wrangling, Munging, Janitor Work, Manipulation, Transformation

50-80% of your time?

Slides modified from RStudio Data Wrangling Workshop https://www.rstudio.com/resources/webinars/data-wrangling-with-r-and-rstudio/

Two packages to help you work with the structure of data

tidyr

dplyr

http://www.rstudio.com/resources/cheatsheets/

Also in Chinese…

Data sets come in many formats

…but R (often) prefers just one

EDAWR

An R package with all of the data sets that are shown in this lecture.

# install.packages("devtools")
# devtools::install_github("rstudio/EDAWR")
library(EDAWR)
?storms
?cases
?pollution
?tb

devtools::install_github("rstudio/EDAWR")
library(EDAWR)

storms

storm    wind  pressure  date
Alberto   110      1007  2000-08-12
Alex       45      1009  1998-07-30
Allison    65      1005  1995-06-04
Ana        40      1013  1997-07-01
Arlene     50      1010  1999-06-13
Arthur     45      1010  1996-06-21

• Storm name
• Wind speed (mph)
• Air pressure
• Date

cases

Country   2011   2012   2013
FR        7000   6900   7000
DE        5800   6000   6200
US       15000  14000  13000

• Country
• Year
• Count

pollution

city      particle size  amount (µg/m³)
New York  large                      23
New York  small                      14
London    large                      22
London    small                      16
Beijing   large                     121
Beijing   small                      56

• City
• Amount of large particles
• Amount of small particles

storms

storms$storm      # storm name
storms$wind       # wind speed
storms$pressure   # air pressure
storms$date       # date

cases

cases$country             # country
names(cases)[-1]          # year (stored in the column names)
unlist(cases[1:3, 2:4])   # count (spread over three columns)

pollution

pollution$city[c(1, 3, 5)]     # city
pollution$amount[c(1, 3, 5)]   # amount of large particles
pollution$amount[c(2, 4, 6)]   # amount of small particles

Adding/modifying columns

$$\mathit{ratio} = \frac{\mathit{pressure}}{\mathit{wind}}$$

storms$pressure / storms$wind

The division is carried out row by row on the storms table:

pressure     wind   ratio
    1007  /   110    9.15
    1009  /    45   22.42
    1005  /    65   15.46
    1013  /    40   25.32
    1010  /    50   20.20
    1010  /    45   22.44
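As a minimal sketch of the division above, the snippet below stores the ratio as a new column using base R only. The storms table is rebuilt by hand so that the code runs without EDAWR installed; that hand-built copy (with date kept as plain text) is an assumption of this example, not part of the slides.

```r
# Hand-built copy of the storms table, so the sketch runs without EDAWR.
storms <- data.frame(
  storm    = c("Alberto", "Alex", "Allison", "Ana", "Arlene", "Arthur"),
  wind     = c(110, 45, 65, 40, 50, 45),
  pressure = c(1007, 1009, 1005, 1013, 1010, 1010),
  date     = c("2000-08-12", "1998-07-30", "1995-06-04",
               "1997-07-01", "1999-06-13", "1996-06-21")
)

# Division is element-wise: row i of pressure is divided by row i of wind,
# so every observation keeps its own result.
storms$ratio <- storms$pressure / storms$wind

print(round(storms$ratio, 2))
```

Because both columns live in the same tidy table, the result lines up with the rows it was computed from.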

Tidy data

1. Each variable is saved in its own column.
2. Each observation is saved in its own row.
3. Each "type" of observation is stored in a single table (here, storms).

Recap: Tidy data

1. Variables in columns, observations in rows, each type in a table
2. Easy to access variables
3. Automatically preserves observations

tidyr

tidyr: A package that reshapes the layout of tables.

Two main functions: gather() and spread()

# install.packages("tidyr")
library(tidyr)
?gather
?spread

Your Turn

Imagine how this data would look if it were tidy with three variables: country, year, n

cases

Country   2011   2012   2013
FR        7000   6900   7000
DE        5800   6000   6200
US       15000  14000  13000

cases

Country   2011   2012   2013
FR        7000   6900   7000
DE        5800   6000   6200
US       15000  14000  13000

gather()

Country  Year      n
FR       2011   7000
DE       2011   5800
US       2011  15000
FR       2012   6900
DE       2012   6000
US       2012  14000
FR       2013   7000
DE       2013   6200
US       2013  13000

key (former column names): Year
value (former cells): n

gather()

Collapses multiple columns into two columns:
1. a key column that contains the former column names
2. a value column that contains the former column cells

gather(cases, "year", "n", 2:4)

• cases: data frame to reshape
• "year": name of the new key column (a character string)
• "n": name of the new value column (a character string)
• 2:4: names or numeric indexes of columns to collapse

cases

##   country  2011  2012  2013
## 1      FR  7000  6900  7000
## 2      DE  5800  6000  6200
## 3      US 15000 14000 13000

gather(cases, "year", "n", 2:4)

##   country year     n
## 1      FR 2011  7000
## 2      DE 2011  5800
## 3      US 2011 15000
## 4      FR 2012  6900
## 5      DE 2012  6000
## 6      US 2012 14000
## 7      FR 2013  7000
## 8      DE 2013  6200
## 9      US 2013 13000
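The call above can be tried without EDAWR by rebuilding cases by hand, as in this sketch (the hand-built data frame is an assumption of the example). It also shows pivot_longer(), which newer tidyr releases (1.0.0 and later) offer for the same reshape; the slides themselves only use gather(). The variable names tidy_cases and tidy_cases_longer are illustrative.

```r
library(tidyr)

# Hand-built copy of cases; check.names = FALSE keeps the year column names.
cases <- data.frame(
  country = c("FR", "DE", "US"),
  `2011`  = c(7000, 5800, 15000),
  `2012`  = c(6900, 6000, 14000),
  `2013`  = c(7000, 6200, 13000),
  check.names = FALSE
)

# Collapse columns 2 to 4 into a key column (year) and a value column (n).
tidy_cases <- gather(cases, "year", "n", 2:4)
print(tidy_cases)

# Same reshape with the newer interface (assumes tidyr >= 1.0.0).
# The rows come back in a different order, but the content is the same.
tidy_cases_longer <- pivot_longer(cases, cols = 2:4,
                                  names_to = "year", values_to = "n")
```

Note that the year values come out as character strings, because they started life as column names.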

Your Turn

Imagine how the pollution data set would look tidy with three variables: city, large, small

pollution

city      size   amount
New York  large      23
New York  small      14
London    large      22
London    small      16
Beijing   large     121
Beijing   small      56

pollution

city      size   amount
New York  large      23
New York  small      14
London    large      22
London    small      16
Beijing   large     121
Beijing   small      56

spread()

city      large  small
New York     23     14
London       22     16
Beijing     121     56

key (new column names): size
value (new cells): amount

spread()

Generates multiple columns from two columns:
1. each unique value in the key column becomes a column name
2. each value in the value column becomes a cell in the new columns

spread(pollution, size, amount)

• pollution: data frame to reshape
• size: column to use for keys (new column names)
• amount: column to use for values (new column cells)

spread() and gather()

spread(pollution, size, amount) turns the long pollution table (city, size, amount) into the wide one (city, large, small); gather() turns the wide table back into the long one.

Separate all variables implied by law, formula or goal
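A short sketch of that round trip, assuming a hand-built copy of pollution with the columns city, size and amount (so it runs without EDAWR). The names wide and long_again are illustrative only.

```r
library(tidyr)

# Hand-built copy of the pollution table.
pollution <- data.frame(
  city   = c("New York", "New York", "London", "London", "Beijing", "Beijing"),
  size   = c("large", "small", "large", "small", "large", "small"),
  amount = c(23, 14, 22, 16, 121, 56)
)

# size supplies the new column names, amount supplies the new cells.
wide <- spread(pollution, size, amount)
print(wide)

# gather() reverses the reshape: large and small collapse back into
# a key column (size) and a value column (amount).
long_again <- gather(wide, "size", "amount", large, small)
```

The row order of the result may differ from the slide, so it is safer to look rows up by city than by position.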

unite() and separate()

There are three more variables hidden in storms:

• Year
• Month
• Day

separate()

separate() splits a column by a character string separator.

separate(storms, date, c("year", "month", "day"), sep = "-")

storms2 (the result)

storm    wind  pressure  year  month  day
Alberto   110      1007  2000  08     12
Alex       45      1009  1998  07     30
Allison    65      1005  1995  06     04
Ana        40      1013  1997  07     01
Arlene     50      1010  1999  06     13
Arthur     45      1010  1996  06     21

unite()

unite() combines several columns into a single column.

unite(storms2, "date", year, month, day, sep = "-")

Applied to storms2, this rebuilds the single date column of storms (for example, 2000-08-12) from the year, month and day columns.
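The two calls can be checked against each other with a hand-built storms table, as sketched below. Storing date as plain text is an assumption of this example (the EDAWR version may use a different column type), and storms_again is an illustrative name.

```r
library(tidyr)

# Hand-built copy of storms with date stored as character strings.
storms <- data.frame(
  storm    = c("Alberto", "Alex", "Allison", "Ana", "Arlene", "Arthur"),
  wind     = c(110, 45, 65, 40, 50, 45),
  pressure = c(1007, 1009, 1005, 1013, 1010, 1010),
  date     = c("2000-08-12", "1998-07-30", "1995-06-04",
               "1997-07-01", "1999-06-13", "1996-06-21"),
  stringsAsFactors = FALSE
)

# Split date on "-" into three character columns.
storms2 <- separate(storms, date, c("year", "month", "day"), sep = "-")

# Paste the three pieces back together with the same separator.
storms_again <- unite(storms2, "date", year, month, day, sep = "-")
print(storms_again)
```

Since the pieces stay character strings by default, leading zeros such as "01" survive the round trip.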

Recap: tidyr

A package that reshapes the layout of data sets.

• Make observations from variables with gather()
• Make variables from observations with spread()
• Split and merge columns with separate() and unite()

Also: the reshape2 package

Data sets contain more information than they display

dplyr

dplyr: A package that helps transform tabular data.

# install.packages("dplyr")
library(dplyr)
?select
?filter
?arrange
?mutate
?summarise
?group_by

Ways to access information

1. Extract existing variables: select()
2. Extract existing observations: filter()
3. Derive new variables (from existing variables): mutate()
4. Change the unit of analysis: summarise()

select()

select(storms, storm, pressure)

storm    pressure
Alberto      1007
Alex         1009
Allison      1005
Ana          1013
Arlene       1010
Arthur       1010

select(storms, -storm)
# see ?select for more

wind  pressure  date
 110      1007  2000-08-12
  45      1009  1998-07-30
  65      1005  1995-06-04
  40      1013  1997-07-01
  50      1010  1999-06-13
  45      1010  1996-06-21

select(storms, wind:date)
# see ?select for more

This returns the same three columns (wind, pressure, date) as select(storms, -storm).

Useful select functions

-              Select everything but
:              Select range
contains()     Select columns whose name contains a character string
ends_with()    Select columns whose name ends with a string
everything()   Select every column
matches()      Select columns whose name matches a regular expression
num_range()    Select columns named x1, x2, x3, x4, x5
one_of()       Select columns whose names are in a group of names
starts_with()  Select columns whose name starts with a character string

* The named helper functions, contains() through starts_with(), come with dplyr.
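A few of these selections, sketched on a hand-built copy of storms (an assumption made so the code runs without EDAWR). The result names such as by_name and by_suffix are illustrative.

```r
library(dplyr)

# Hand-built copy of the storms table.
storms <- data.frame(
  storm    = c("Alberto", "Alex", "Allison", "Ana", "Arlene", "Arthur"),
  wind     = c(110, 45, 65, 40, 50, 45),
  pressure = c(1007, 1009, 1005, 1013, 1010, 1010),
  date     = c("2000-08-12", "1998-07-30", "1995-06-04",
               "1997-07-01", "1999-06-13", "1996-06-21")
)

by_name   <- select(storms, storm, pressure)    # keep two named columns
by_minus  <- select(storms, -storm)             # everything but storm
by_range  <- select(storms, wind:date)          # wind through date
by_prefix <- select(storms, starts_with("p"))   # names starting with "p"
by_suffix <- select(storms, ends_with("e"))     # names ending with "e"

print(names(by_suffix))
```

All of these return a data frame with every row intact; only the set of columns changes.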

filter()

filter(storms, wind >= 50)

storm    wind  pressure  date
Alberto   110      1007  2000-08-12
Allison    65      1005  1995-06-04
Arlene     50      1010  1999-06-13

filter(storms, wind >= 50, storm %in% c("Alberto", "Alex", "Allison"))

storm    wind  pressure  date
Alberto   110      1007  2000-08-12
Allison    65      1005  1995-06-04

Logical tests in R

?Comparison

<       Less than
>       Greater than
==      Equal to
<=      Less than or equal to
>=      Greater than or equal to
!=      Not equal to
%in%    Group membership
is.na   Is NA
!is.na  Is not NA

?base::Logic

&    boolean and
|    boolean or
xor  exactly or
!    not
any  any true
all  all true
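The tests above can be combined inside filter(), as in this sketch on a hand-built copy of storms (an assumption so that the code runs without EDAWR). The third filter is an extra illustration that is not on the slides.

```r
library(dplyr)

# Hand-built copy of the storms table.
storms <- data.frame(
  storm    = c("Alberto", "Alex", "Allison", "Ana", "Arlene", "Arthur"),
  wind     = c(110, 45, 65, 40, 50, 45),
  pressure = c(1007, 1009, 1005, 1013, 1010, 1010),
  date     = c("2000-08-12", "1998-07-30", "1995-06-04",
               "1997-07-01", "1999-06-13", "1996-06-21")
)

# One condition: keep rows whose wind is at least 50.
windy <- filter(storms, wind >= 50)

# Conditions separated by commas must all hold (they are combined with "and").
windy_named <- filter(storms, wind >= 50,
                      storm %in% c("Alberto", "Alex", "Allison"))

# "|" keeps rows that satisfy either condition.
extreme <- filter(storms, wind >= 100 | pressure > 1010)

print(extreme)
```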

mutate()

mutate(storms, ratio = pressure / wind)

storm    wind  pressure  date        ratio
Alberto   110      1007  2000-08-12   9.15
Alex       45      1009  1998-07-30  22.42
Allison    65      1005  1995-06-04  15.46
Ana        40      1013  1997-07-01  25.32
Arlene     50      1010  1999-06-13  20.20
Arthur     45      1010  1996-06-21  22.44

mutate(storms, ratio = pressure / wind, inverse = ratio^-1)

storm    wind  pressure  date        ratio  inverse
Alberto   110      1007  2000-08-12   9.15     0.11
Alex       45      1009  1998-07-30  22.42     0.04
Allison    65      1005  1995-06-04  15.46     0.06
Ana        40      1013  1997-07-01  25.32     0.04
Arlene     50      1010  1999-06-13  20.20     0.05
Arthur     45      1010  1996-06-21  22.44     0.04
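A runnable version of the second call, sketched on a hand-built copy of storms (an assumption so that EDAWR is not needed). It shows that a column created earlier in the same mutate() call, here ratio, can be used by a later one.

```r
library(dplyr)

# Hand-built copy of the storms table.
storms <- data.frame(
  storm    = c("Alberto", "Alex", "Allison", "Ana", "Arlene", "Arthur"),
  wind     = c(110, 45, 65, 40, 50, 45),
  pressure = c(1007, 1009, 1005, 1013, 1010, 1010),
  date     = c("2000-08-12", "1998-07-30", "1995-06-04",
               "1997-07-01", "1999-06-13", "1996-06-21")
)

# ratio is computed first, then inverse is derived from it.
storms_ratio <- mutate(storms, ratio = pressure / wind, inverse = ratio^-1)

print(storms_ratio)
```

mutate() returns a new data frame; storms itself is unchanged unless the result is assigned back to it.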

Useful mutate functions

pmin(), pmax()        Element-wise min and max
cummin(), cummax()    Cumulative min and max
cumsum(), cumprod()   Cumulative sum and product
between()             Are values between a and b?
cume_dist()           Cumulative distribution of values
cumall(), cumany()    Cumulative all and any
cummean()             Cumulative mean
lead(), lag()         Copy with values shifted by one position
ntile()               Bin vector into n buckets
dense_rank(), min_rank(), percent_rank(), row_number()   Various ranking methods

* All take a vector of values and return a vector of values
** between() and the functions listed after it come with dplyr; pmin() through cumprod() are base R.
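A few of the helpers listed above, sketched on small hand-made vectors (the vectors and variable names are illustrative, not from the slides). Each call returns a vector of the same length as its input, which is what makes these functions usable inside mutate().

```r
library(dplyr)

x <- c(1, 2, 3, 4, 5, 6)

running_total <- cumsum(x)           # 1 3 6 10 15 21
running_max   <- cummax(c(3, 1, 4))  # 3 3 4
previous      <- dplyr::lag(x)       # NA 1 2 3 4 5
following     <- dplyr::lead(x)      # 2 3 4 5 6 NA
in_range      <- between(x, 2, 4)    # FALSE TRUE TRUE TRUE FALSE FALSE
ranks         <- min_rank(c(10, 30, 20, 30))  # ties share the lowest rank

print(running_total)
```

lag() is written as dplyr::lag() here because base R also ships a stats::lag() with a different meaning.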

"Window" functions

cumsum()

input:   1  2  3   4   5   6
output:  1  3  6  10  15  21

summarise()

pollution %>% summarise(median = median(amount), variance = var(amount))

median  variance
  22.5    1731.6

pollution %>% summarise(mean = mean(amount), sum = sum(amount), n = n())

mean  sum  n
  42  252  6
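Both summaries can be reproduced on a hand-built copy of pollution, as sketched below (the hand-built table is an assumption so that EDAWR is not needed). The last step adds group_by(), which the slides list under dplyr but do not demonstrate here, as one hedged illustration of changing the unit of analysis from measurements to cities.

```r
library(dplyr)

# Hand-built copy of the pollution table.
pollution <- data.frame(
  city   = c("New York", "New York", "London", "London", "Beijing", "Beijing"),
  size   = c("large", "small", "large", "small", "large", "small"),
  amount = c(23, 14, 22, 16, 121, 56)
)

# Six rows collapse into a single row of summary values.
spread_stats <- pollution %>%
  summarise(median = median(amount), variance = var(amount))

totals <- pollution %>%
  summarise(mean = mean(amount), sum = sum(amount), n = n())

# Grouping first yields one summary row per city instead of one overall.
by_city <- pollution %>%
  group_by(city) %>%
  summarise(mean = mean(amount), n = n())

print(by_city)
```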

Useful summary functions

min(), max()   Minimum and maximum values
mean()         Mean value
median()       Median value
sum()          Sum of values
var(), sd()    Variance and standard deviation of a vector
first()        First value in a vector
last()         Last value in a vector
nth()          Nth value in a vector
n()            The number of values in a vector
n_distinct()   The number of distinct values in a vector

* All take a vector of values and return a single value
** first(), last(), nth(), n() and n_distinct() come with dplyr; the others are available in base R.
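Some of the functions listed above, applied to small hand-made vectors (illustrative values, not from the slides). Each call reduces a whole vector to one value. n() is left out because it only works inside dplyr verbs such as summarise().

```r
library(dplyr)

x <- c(1, 2, 3, 4, 5, 6)

total       <- sum(x)                          # 21
first_value <- first(x)                        # 1
last_value  <- last(x)                         # 6
second      <- nth(x, 2)                       # 2
n_unique    <- n_distinct(c("a", "b", "a"))    # 2

print(c(total, first_value, last_value, second, n_unique))
```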

"Summary" functions

sum()

input:   1  2  3  4  5  6
output:  21

arrange()

storms

storm    wind  pressure  date
Alberto  110   1007      2000-08-12
Alex     45    1009      1998-07-30
Allison  65    1005      1995-06-04
Ana      40    1013      1997-07-01
Arlene   50    1010      1999-06-13
Arthur   45    1010      1996-06-21

arrange(storms, wind)

storm    wind  pressure  date
Ana      40    1013      1997-07-01
Alex     45    1009      1998-07-30
Arthur   45    1010      1996-06-21
Arlene   50    1010      1999-06-13
Allison  65    1005      1995-06-04
Alberto  110   1007      2000-08-12

arrange()

arrange(storms, desc(wind))

storm    wind  pressure  date
Alberto  110   1007      2000-08-12
Allison  65    1005      1995-06-04
Arlene   50    1010      1999-06-13
Alex     45    1009      1998-07-30
Arthur   45    1010      1996-06-21
Ana      40    1013      1997-07-01


arrange()

arrange(storms, wind, date)

storm    wind  pressure  date
Ana      40    1013      1997-07-01
Arthur   45    1010      1996-06-21
Alex     45    1009      1998-07-30
Arlene   50    1010      1999-06-13
Allison  65    1005      1995-06-04
Alberto  110   1007      2000-08-12
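The three arrange() calls can be reproduced on a hand-built copy of the storms table. This is a sketch under the assumption that storms is an ordinary data frame with a Date column. Note that dplyr also ships its own, different storms dataset, which this definition masks.

```r
library(dplyr)

# Hand-built copy of the six-row storms table from the slides.
storms <- data.frame(
  storm    = c("Alberto", "Alex", "Allison", "Ana", "Arlene", "Arthur"),
  wind     = c(110, 45, 65, 40, 50, 45),
  pressure = c(1007, 1009, 1005, 1013, 1010, 1010),
  date     = as.Date(c("2000-08-12", "1998-07-30", "1995-06-04",
                       "1997-07-01", "1999-06-13", "1996-06-21")),
  stringsAsFactors = FALSE
)

by_wind      <- arrange(storms, wind)        # ascending wind
by_wind_desc <- arrange(storms, desc(wind))  # descending wind
by_wind_date <- arrange(storms, wind, date)  # ties in wind broken by date
```

Extra columns passed to arrange() only matter for rows that tie on the earlier ones, which is why Arthur (1996) moves ahead of Alex (1998) in the last call.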

The pipe operator %>%

library(dplyr)

select(tb, child:elderly)
tb %>% select(child:elderly)

These do the same thing: %>% passes tb in as the first argument of select().
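A quick way to check that the two forms agree. The tb data frame below is a hypothetical stand-in with made-up values; only the column order (child, adult, elderly sitting next to each other) matters for the child:elderly range.

```r
library(dplyr)

# Hypothetical stand-in for tb; the values are invented for illustration.
tb <- data.frame(
  country = c("Afghanistan", "Brazil", "China"),
  year    = c(1999, 1999, 1999),
  child   = c(1, 2, 3),
  adult   = c(10, 20, 30),
  elderly = c(4, 5, 6),
  stringsAsFactors = FALSE
)

direct <- select(tb, child:elderly)     # ordinary function call
piped  <- tb %>% select(child:elderly)  # same call written with the pipe

identical(direct, piped)
```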

Little Bunny Foo Foo (a nursery rhyme)

Little bunny Foo Foo
Went hopping through the forest
Scooping up the field mice
And bopping them on the head

Using temporary objects:

T1 <- hop_through(foo_foo, forest)
T2 <- scoop_up(T1, field_mice)
T3 <- bop_on(T2, head)

Using nested function calls:

bop_on(
  scoop_up(
    hop_through(foo_foo, forest),
    field_mice
  ),
  head
)

Using dplyr pipes:

foo_foo %>%
  hop_through(forest) %>%
  scoop_up(field_mice) %>%
  bop_on(head)

Using pipes usually leads to more transparent code…
• No temporary objects to remember / mess up
• Reads chronologically
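The three styles can be run side by side once the rhyme's verbs have toy definitions. In this sketch hop_through(), scoop_up() and bop_on() are hypothetical stand-ins that just append a line of text, and forest, field mice and head are passed as strings rather than objects.

```r
library(dplyr)

# Hypothetical toy verbs: each appends one line describing the step.
hop_through <- function(who, where) c(who, paste("hopped through the", where))
scoop_up    <- function(who, what)  c(who, paste("scooped up the", what))
bop_on      <- function(who, where) c(who, paste("bopped them on the", where))

foo_foo <- "Little bunny Foo Foo"

# 1. Temporary objects
t1 <- hop_through(foo_foo, "forest")
t2 <- scoop_up(t1, "field mice")
t3 <- bop_on(t2, "head")

# 2. Nested calls (read from the inside out)
nested <- bop_on(scoop_up(hop_through(foo_foo, "forest"), "field mice"), "head")

# 3. Pipes (read from top to bottom)
piped <- foo_foo %>%
  hop_through("forest") %>%
  scoop_up("field mice") %>%
  bop_on("head")
```

All three produce the same four-line result; only the readability differs.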

select()

storms

storm    wind  pressure  date
Alberto  110   1007      2000-08-12
Alex     45    1009      1998-07-30
Allison  65    1005      1995-06-04
Ana      40    1013      1997-07-01
Arlene   50    1010      1999-06-13
Arthur   45    1010      1996-06-21

select(storms, storm, pressure)
storms %>% select(storm, pressure)

storm    pressure
Alberto  1007
Alex     1009
Allison  1005
Ana      1013
Arlene   1010
Arthur   1010

filter()

filter(storms, wind >= 50)
storms %>% filter(wind >= 50)

storm    wind  pressure  date
Alberto  110   1007      2000-08-12
Allison  65    1005      1995-06-04
Arlene   50    1010      1999-06-13

storms %>%
  filter(wind >= 50) %>%
  select(storm, pressure)

storm    pressure
Alberto  1007
Allison  1005
Arlene   1010
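A runnable sketch of select(), filter() and the combined pipeline, again on a hand-built copy of the storms table (assumed to be a plain data frame; it masks the unrelated storms dataset that ships with dplyr).

```r
library(dplyr)

storms <- data.frame(
  storm    = c("Alberto", "Alex", "Allison", "Ana", "Arlene", "Arthur"),
  wind     = c(110, 45, 65, 40, 50, 45),
  pressure = c(1007, 1009, 1005, 1013, 1010, 1010),
  date     = as.Date(c("2000-08-12", "1998-07-30", "1995-06-04",
                       "1997-07-01", "1999-06-13", "1996-06-21")),
  stringsAsFactors = FALSE
)

# select() keeps columns, filter() keeps rows.
two_cols <- storms %>% select(storm, pressure)
windy    <- storms %>% filter(wind >= 50)

# Chained: first drop the weak storms, then keep two columns.
windy_pressure <- storms %>%
  filter(wind >= 50) %>%
  select(storm, pressure)
```

The order of the two steps matters here: select(storm, pressure) drops wind, so filtering on wind has to come first.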

mutate()

storms %>%
  mutate(ratio = pressure / wind) %>%
  select(storm, ratio)

storm    ratio
Alberto  9.15
Alex     22.42
Allison  15.46
Ana      25.32
Arlene   20.20
Arthur   22.44
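The mutate() pipeline can be checked on a reduced copy of the storms table. This is a sketch: only the three columns the pipeline touches are retyped, and the ratios are left unrounded, so they print with more digits than the slide shows.

```r
library(dplyr)

# Reduced hand-built copy of storms: only the columns used below.
storms <- data.frame(
  storm    = c("Alberto", "Alex", "Allison", "Ana", "Arlene", "Arthur"),
  wind     = c(110, 45, 65, 40, 50, 45),
  pressure = c(1007, 1009, 1005, 1013, 1010, 1010),
  stringsAsFactors = FALSE
)

# mutate() adds a column computed row by row from existing columns.
ratios <- storms %>%
  mutate(ratio = pressure / wind) %>%
  select(storm, ratio)

print(ratios)
```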

Shortcut to type %>%

Cmd + Shift + M (Mac)
Ctrl + Shift + M (Windows)


Unit of analysis

summarize()

city      particle size  amount (µg/m3)
New York  large          23
New York  small          14
London    large          22
London    small          16
Beijing   large          121
Beijing   small          56

mean  sum  n
42    252  6

group_by() + summarise()

city      particle size  amount (µg/m3)
New York  large          23
New York  small          14

London    large          22
London    small          16

Beijing   large          121
Beijing   small          56

          mean  sum  n
New York  18.5  37   2
London    19.0  38   2
Beijing   88.5  177  2

group_by()

pollution %>% group_by(city)

city      particle size  amount (µg/m3)
New York  large          23
New York  small          14

London    large          22
London    small          16

Beijing   large          121
Beijing   small          56

pollution %>% group_by(city)
## Source: local data frame [6 x 3]
## Groups: city
##
##       city  size amount
## 1 New York large     23
## 2 New York small     14
## 3   London large     22
## 4   London small     16
## 5  Beijing large    121
## 6  Beijing small     56

group_by() + summarise()

pollution %>%
  group_by(city) %>%
  summarise(mean = mean(amount), sum = sum(amount), n = n())

city      mean  sum  n
New York  18.5  37   2
London    19.0  38   2
Beijing   88.5  177  2

city      size   amount
New York  large  23
New York  small  14
London    large  22
London    small  16
Beijing   large  121
Beijing   small  56

pollution %>% group_by(city) %>% summarise(mean = mean(amount))

city      mean
New York  18.5
London    19.0
Beijing   88.5

pollution %>% group_by(size) %>% summarise(mean = mean(amount))

size   mean
large  55.3
small  28.7
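Grouping by a different variable changes the unit of analysis, and the two groupings above can be compared directly. A sketch on a hand-built copy of the pollution table follows. When city is stored as plain text, dplyr returns the groups in alphabetical order, so the rows may come out in a different order than on the slides.

```r
library(dplyr)

pollution <- data.frame(
  city   = c("New York", "New York", "London", "London", "Beijing", "Beijing"),
  size   = c("large", "small", "large", "small", "large", "small"),
  amount = c(23, 14, 22, 16, 121, 56),
  stringsAsFactors = FALSE
)

# One row per city.
by_city <- pollution %>%
  group_by(city) %>%
  summarise(mean = mean(amount), sum = sum(amount), n = n())

# One row per particle size.
by_size <- pollution %>%
  group_by(size) %>%
  summarise(mean = mean(amount))
```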

ungroup()

pollution %>% ungroup()

city      particle size  amount (µg/m3)
New York  large          23
New York  small          14
London    large          22
London    small          16
Beijing   large          121
Beijing   small          56
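ungroup() leaves the rows untouched and only removes the grouping information. A small sketch that makes this visible, using dplyr's is_grouped_df() and group_vars() on a hand-built copy of the pollution table:

```r
library(dplyr)

pollution <- data.frame(
  city   = c("New York", "New York", "London", "London", "Beijing", "Beijing"),
  size   = c("large", "small", "large", "small", "large", "small"),
  amount = c(23, 14, 22, 16, 121, 56),
  stringsAsFactors = FALSE
)

grouped   <- pollution %>% group_by(city)
ungrouped <- grouped %>% ungroup()

group_vars(grouped)       # grouping columns before ungroup()
is_grouped_df(ungrouped)  # FALSE once the grouping is removed
```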

Hierarchy of information

Larger units of analysis

tb

country      year  sex     cases
Afghanistan  1999  female  1
Afghanistan  1999  male    1
Afghanistan  2000  female  1
Afghanistan  2000  male    1
Brazil       1999  female  2
Brazil       1999  male    2
Brazil       2000  female  2
Brazil       2000  male    2
China        1999  female  3
China        1999  male    3
China        2000  female  3
China        2000  male    3

tb %>%
  group_by(country, year) %>%
  summarise(cases = sum(cases)) %>%
  summarise(cases = sum(cases))

After the first summarise() (one row per country and year):

country      year  cases
Afghanistan  1999  2
Afghanistan  2000  2
Brazil       1999  4
Brazil       2000  4
China        1999  6
China        2000  6

After the second summarise() (one row per country):

country      cases
Afghanistan  4
Brazil       8
China        12
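Each summarise() removes the last grouping variable, which is why calling it twice climbs from country-year to country. A sketch on a hand-built copy of the tb table from this slide; recent dplyr versions print a message about the remaining grouping, which is expected here.

```r
library(dplyr)

# Hand-built copy of the 12-row tb table shown on the slide.
tb <- data.frame(
  country = rep(c("Afghanistan", "Brazil", "China"), each = 4),
  year    = rep(c(1999, 1999, 2000, 2000), times = 3),
  sex     = rep(c("female", "male"), times = 6),
  cases   = rep(c(1, 2, 3), each = 4),
  stringsAsFactors = FALSE
)

# First summarise(): one row per country and year, still grouped by country.
by_country_year <- tb %>%
  group_by(country, year) %>%
  summarise(cases = sum(cases))

# Second summarise(): one row per country.
by_country <- by_country_year %>%
  summarise(cases = sum(cases))
```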

Recap: Information

Extract variables and observations with select() and filter().

Arrange observations with arrange().

Make new variables with mutate().

Group observations with group_by() and summarise().


Joining data

dplyr::bind_cols()

y

x1  x2
A   1
B   2
C   3

z

x1  x2
B   2
C   3
D   4

bind_cols(y, z)

x1  x2  x1  x2
A   1   B   2
B   2   C   3
C   3   D   4

dplyr::bind_rows()

bind_rows(y, z)

x1  x2
A   1
B   2
C   3
B   2
C   3
D   4
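Both binding functions can be tried on hand-built copies of y and z. One caveat for this sketch: current dplyr versions repair duplicated column names in bind_cols(), so the result columns are renamed (with a message) instead of appearing as x1 x2 x1 x2.

```r
library(dplyr)

y <- data.frame(x1 = c("A", "B", "C"), x2 = c(1, 2, 3), stringsAsFactors = FALSE)
z <- data.frame(x1 = c("B", "C", "D"), x2 = c(2, 3, 4), stringsAsFactors = FALSE)

# Paste z to the right of y; rows are matched purely by position.
side_by_side <- bind_cols(y, z)

# Stack z underneath y; columns are matched by name.
stacked <- bind_rows(y, z)
```

Because bind_cols() matches rows by position only, it is safe only when both tables are already in the same row order.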

dplyr::union()

union(y, z)

x1  x2
A   1
B   2
C   3
D   4

dplyr::intersect()

intersect(y, z)

x1  x2
B   2
C   3

dplyr::setdiff()

setdiff(y, z)

x1  x2
A   1

setdiff(y, z) returns the rows of y that do not appear in z, so D 4 (which is only in z) is not part of the result.
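The three set operations treat each row as one element. A sketch on hand-built copies of y and z; with dplyr loaded, union(), intersect() and setdiff() accept data frames.

```r
library(dplyr)

y <- data.frame(x1 = c("A", "B", "C"), x2 = c(1, 2, 3), stringsAsFactors = FALSE)
z <- data.frame(x1 = c("B", "C", "D"), x2 = c(2, 3, 4), stringsAsFactors = FALSE)

in_either <- union(y, z)      # rows in y or z, duplicates removed
in_both   <- intersect(y, z)  # rows in both y and z
only_in_y <- setdiff(y, z)    # rows in y but not in z
only_in_z <- setdiff(z, y)    # argument order matters
```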

dplyr::left_join()

songs

song                 name
Across the Universe  John
Come Together        John
Hello, Goodbye       Paul
Peggy Sue            Buddy

artists

name    plays
George  sitar
John    guitar
Paul    bass
Ringo   drums

left_join(songs, artists, by = "name")

song                 name   plays
Across the Universe  John   guitar
Come Together        John   guitar
Hello, Goodbye       Paul   bass
Peggy Sue            Buddy  <NA>

left_join(x, y): Return all rows from x, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned. This is a mutating join.
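A runnable sketch of this join on hand-built copies of the songs and artists tables:

```r
library(dplyr)

songs <- data.frame(
  song = c("Across the Universe", "Come Together", "Hello, Goodbye", "Peggy Sue"),
  name = c("John", "John", "Paul", "Buddy"),
  stringsAsFactors = FALSE
)
artists <- data.frame(
  name  = c("George", "John", "Paul", "Ringo"),
  plays = c("sitar", "guitar", "bass", "drums"),
  stringsAsFactors = FALSE
)

# Keep every song; add plays where the name is found in artists.
joined <- left_join(songs, artists, by = "name")

print(joined)
```

Buddy has no row in artists, so his plays value is NA, and George and Ringo do not appear because they have no songs on the left-hand side.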

dplyr::left_join()

songs2

song                 first  last
Across the Universe  John   Lennon
Come Together        John   Lennon
Hello, Goodbye       Paul   McCartney
Peggy Sue            Buddy  Holly

artists2

first   last       plays
George  Harrison   sitar
John    Lennon     guitar
Paul    McCartney  bass
Ringo   Starr      drums
Paul    Simon      guitar
John    Coltrane   sax

left_join(songs2, artists2, by = c("first", "last"))

song                 first  last       plays
Across the Universe  John   Lennon     guitar
Come Together        John   Lennon     guitar
Hello, Goodbye       Paul   McCartney  bass
Peggy Sue            Buddy  Holly      <NA>
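The second key matters because first names repeat in artists2. The sketch below compares the two-key join with a join on first alone; the single-key version is added here only for contrast and is wrapped in suppressWarnings() because recent dplyr versions warn about many-to-many matches.

```r
library(dplyr)

songs2 <- data.frame(
  song  = c("Across the Universe", "Come Together", "Hello, Goodbye", "Peggy Sue"),
  first = c("John", "John", "Paul", "Buddy"),
  last  = c("Lennon", "Lennon", "McCartney", "Holly"),
  stringsAsFactors = FALSE
)
artists2 <- data.frame(
  first = c("George", "John", "Paul", "Ringo", "Paul", "John"),
  last  = c("Harrison", "Lennon", "McCartney", "Starr", "Simon", "Coltrane"),
  plays = c("sitar", "guitar", "bass", "drums", "guitar", "sax"),
  stringsAsFactors = FALSE
)

# Match on both name columns: each song finds at most one artist.
by_full_name <- left_join(songs2, artists2, by = c("first", "last"))

# For contrast: matching on first name only pairs Lennon songs with Coltrane too.
by_first_only <- suppressWarnings(left_join(songs2, artists2, by = "first"))
```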


inner_join()

inner_join(songs, artists, by = "name")

song                 name  plays
Across the Universe  John  guitar
Come Together        John  guitar
Hello, Goodbye       Paul  bass

inner_join(x, y): Return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned. This is a mutating join.
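A sketch of the inner join next to the left join, on hand-built copies of songs and artists, to show which row disappears:

```r
library(dplyr)

songs <- data.frame(
  song = c("Across the Universe", "Come Together", "Hello, Goodbye", "Peggy Sue"),
  name = c("John", "John", "Paul", "Buddy"),
  stringsAsFactors = FALSE
)
artists <- data.frame(
  name  = c("George", "John", "Paul", "Ringo"),
  plays = c("sitar", "guitar", "bass", "drums"),
  stringsAsFactors = FALSE
)

left  <- left_join(songs, artists, by = "name")   # keeps Peggy Sue with NA
inner <- inner_join(songs, artists, by = "name")  # drops unmatched songs
```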

semi_join()

semi_join(songs, artists, by = "name")

song                 name
Across the Universe  John
Come Together        John
Hello, Goodbye       Paul

semi_join(x, y): Return all rows from x where there are matching values in y, keeping just columns from x. A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, whereas a semi join will never duplicate rows of x. This is a filtering join.

anti_join()

anti_join(songs, artists, by = "name")

song       name
Peggy Sue  Buddy

anti_join(x, y): Return all rows from x where there are not matching values in y, keeping just columns from x. This is a filtering join.
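The two filtering joins split songs into matched and unmatched rows without adding any columns. A sketch on hand-built copies of songs and artists:

```r
library(dplyr)

songs <- data.frame(
  song = c("Across the Universe", "Come Together", "Hello, Goodbye", "Peggy Sue"),
  name = c("John", "John", "Paul", "Buddy"),
  stringsAsFactors = FALSE
)
artists <- data.frame(
  name  = c("George", "John", "Paul", "Ringo"),
  plays = c("sitar", "guitar", "bass", "drums"),
  stringsAsFactors = FALSE
)

matched   <- semi_join(songs, artists, by = "name")  # songs with a known artist
unmatched <- anti_join(songs, artists, by = "name")  # songs without one
```

An anti join like this is a convenient check before a mutating join, since it lists exactly the rows that would end up with NA.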

Great join cheat sheet: http://stat545.com/bit001_dplyr-cheatsheet.html

Recap: Best format for analysis

Variables in columns

Observations in rows

Separate all variables implied by law, formula or goal

Unit of analysis matches the unit of analysis implied by law, formula or goal

Single table

Interactive Exercises