Enter the Tidyverse BIO5312 FALL2017 STEPHANIE J. SPIELMAN, PHD


Page 1: Enter the Tidyverse

BIO5312, Fall 2017

Stephanie J. Spielman, PhD

Page 2: What is the “tidyverse”?

A collection of R packages largely developed by Hadley Wickham and others at RStudio.

These packages have emerged as staples of modern-day data science in the past 5 to 10 years.

We will focus on:
• Visualization/plotting with ggplot2
• Data management and “wrangling” with dplyr and tidyr
• Document presentation with R Markdown

Page 3: Focus is on tidy data frames

• Each variable forms a column.
• Each observation forms a row.
• Each type of observational unit forms a table.

Tidy data provides a consistent approach to data management that greatly facilitates downstream analysis and visualization.

Page 4: Working with tidy data

The package dplyr can manipulate and manage tidy data.

The package tidyr can rearrange data to convert to/from tidy data.

The package ggplot2 is used for visualization/plotting.
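As a rough sketch of what "converting to tidy data" can look like, the example below reshapes a small wide table (one column per year) into tidy form. The data frame `wide` and its column names are invented for illustration. It assumes tidyr 1.0.0 or later for pivot_longer(); older material from around the time of this course would typically have used gather() for the same job.

```r
library(tidyr)

# Hypothetical "wide" data: one row per plant, one column per year of measurement
wide <- data.frame(
  plant  = c("A", "B"),
  `2016` = c(10.2, 8.7),
  `2017` = c(11.5, 9.1),
  check.names = FALSE
)

# Tidy form: each row is a single observation (plant, year, height)
tidy <- pivot_longer(wide, cols = c("2016", "2017"),
                     names_to = "year", values_to = "height")

tidy
```

After reshaping, year is an ordinary variable with its own column, which is the shape dplyr and ggplot2 expect.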

Page 5: The fundamental verbs of dplyr

filter()      select rows
select()      select columns
mutate()      create new columns
group_by()    establish a data grouping
tally()       count observations in a grouping
summarize()   calculate summary statistics
arrange()     arrange rows

There are more functions, but these ones are key!
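A minimal sketch of the verbs above, applied to the built-in iris data. The object names (long_sepals, ratios, counts, means) and the cutoff of 7 are arbitrary choices for illustration, not part of the course demo.

```r
library(dplyr)

# filter(): keep only the rows where the condition is TRUE
long_sepals <- iris %>% filter(Sepal.Length > 7)

# select() then mutate(): keep three columns, then add a derived one
ratios <- iris %>%
  select(Species, Sepal.Length, Sepal.Width) %>%
  mutate(sepal.ratio = Sepal.Length / Sepal.Width)

# group_by() + tally(): count observations per species
counts <- iris %>%
  group_by(Species) %>%
  tally()

# group_by() + summarize() + arrange(): per-species mean, sorted high to low
means <- iris %>%
  group_by(Species) %>%
  summarize(mean.sepal = mean(Sepal.Length)) %>%
  arrange(desc(mean.sepal))
```

Each verb takes a data frame and returns a data frame, which is what lets them chain together with %>%.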

Page 6: The pipe operator %>%

“Pipes” output from one function/operation as input to the next.

## Start simple: display data
head(iris)

## Using %>%
iris %>% head()

## Find the mean of iris sepal lengths
mean.sepal <- mean(iris$Sepal.Length)

## Using %>%
mean.sepal <- iris$Sepal.Length %>% mean()

## The “forward assignment” operator -> follows the logical flow of piping
iris$Sepal.Length %>% mean() -> mean.sepal

## Careful: piping the whole data frame does not work here. mean() receives
## iris itself as its first argument and returns NA with a warning.
iris %>% mean(Sepal.Length) -> mean.sepal
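The sketch below shows two ways to start a pipe from the data frame itself and still get the mean of a single column. Both pull() and summarize() are dplyr functions. The variable names (a, b, from_pull, from_summarize) are made up for this example.

```r
library(dplyr)  # loading dplyr also makes %>% available

# x %>% f() is the same as f(x)
a <- mean(iris$Sepal.Length)
b <- iris$Sepal.Length %>% mean()

# Option 1: extract the column as a vector first, then pipe it into mean()
from_pull <- iris %>% pull(Sepal.Length) %>% mean()

# Option 2: let summarize() look the column up inside the data frame;
# the result is a one-row data frame rather than a bare number
from_summarize <- iris %>% summarize(mean.sepal = mean(Sepal.Length))
```

The summarize() form is usually the more convenient one once group_by() enters the picture, because it returns one row per group.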

Page 7: dplyr demo

Commands in the demo are on sjspielman.org/bio5312_fall2017/day2_tidyverse1

Page 8: Visualizing with ggplot2

The package ggplot2 is a graphics package that implements a grammar of graphics.
◦ Operates on data frames, not vectors like base R
◦ Explicitly differentiates between the data and the representation of the data

Page 9: The ggplot2 grammar

Grammar element*    What it is
Data                The data frame being plotted
Geometrics          The geometric shape that will represent the data
                    • Point, boxplot, histogram, violin, bar, etc.
Aesthetics          The aesthetics of the geometric object
                    • Color, size, shape, etc.

*This table is a tiny subset of what ggplot2 has to offer
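One way to see the grammar elements as separate pieces is to build the data and aesthetics once and then attach different geoms to them. This is only an illustrative sketch; the object names (base, as_boxplot, as_violin, as_points) are invented here.

```r
library(ggplot2)

# Data + aesthetics: which data frame, and which columns go on which axis.
# No geom yet, so this object has nothing to draw.
base <- ggplot(iris, aes(x = Species, y = Sepal.Length))

# Geometries: the same data and mapping, represented by different shapes
as_boxplot <- base + geom_boxplot()
as_violin  <- base + geom_violin()
as_points  <- base + geom_point()

# Use print(as_boxplot), print(as_violin), etc. to draw each version
```

Swapping the geom changes the representation only; the data and the mapping stay exactly as they were defined in base.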

Page 10: Example: scatterplot

> ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) + geom_point()

[Figure: scatterplot with Sepal.Length on the x axis and Petal.Length on the y axis]

Page 11: Example: scatterplot

> ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) + geom_point()

[Figure: scatterplot with Sepal.Length on the x axis and Petal.Length on the y axis]

• iris: pass in the data frame as your first argument
• aes(...): aesthetics map the data onto plot characteristics, here the x and y axes
• geom_point(): display the data geometrically as points

Page 12: Example: scatterplot with color

> ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) + geom_point(color = "red")

[Figure: the same scatterplot of Petal.Length against Sepal.Length, with the points drawn in red]

Page 13: Example: scatterplot with aes color

> ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) + geom_point()

[Figure: scatterplot of Petal.Length against Sepal.Length; legend "Species" with entries setosa, versicolor, virginica]

• Placing color inside the aesthetic maps it to the data.

Page 14: Example: scatterplot with aes color, shape

> ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species, shape = Species)) + geom_point()

[Figure: scatterplot of Petal.Length against Sepal.Length; legend "Species" (setosa, versicolor, virginica), with each species shown in its own color and point shape]

Page 15: Aesthetics may be placed inside the relevant geom

> ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) + geom_point(aes(color = Species, shape = Species))

[Figure: scatterplot of Petal.Length against Sepal.Length; legend "Species" (setosa, versicolor, virginica), with each species shown in its own color and point shape]

> ## Remember dplyr!
> iris %>%
    ggplot(aes(x = Sepal.Length, y = Petal.Length)) +
    geom_point(aes(color = Species, shape = Species))

Page 16: Aesthetics are for mapping only

> ### Color all points blue?
> ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = "blue")) + geom_point()

[Figure: scatterplot of Petal.Length against Sepal.Length; legend "colour" with a single entry, "blue"]

Inside aes(), "blue" is treated as a data value rather than a color name, so ggplot2 assigns its own default color and adds a legend for it.

Page 17: Aesthetics are for mapping only

> ### Color all points blue?
> ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = "blue")) + geom_point()

> ### Correctly color all points blue
> ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) + geom_point(color = "blue")

[Figure: scatterplot of Petal.Length against Sepal.Length; legend "colour" with a single entry, "blue"]
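To check the difference between mapping and setting without looking at the picture, one option is to inspect what each layer will actually draw. The sketch below uses ggplot_build(), which returns the computed data for each layer. The object names (mapped, fixed, mapped_colors, fixed_colors) are invented for this example.

```r
library(ggplot2)

# "blue" inside aes(): treated as a one-level data variable
mapped <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = "blue")) +
  geom_point()

# "blue" outside aes(): a fixed setting for the geom
fixed <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point(color = "blue")

# The colour column holds the color each point will be drawn with
mapped_colors <- unique(ggplot_build(mapped)$data[[1]]$colour)
fixed_colors  <- unique(ggplot_build(fixed)$data[[1]]$colour)

mapped_colors  # a default palette color chosen by ggplot2, not "blue"
fixed_colors   # "blue"
```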

Page 18: Example: multiple geoms

> ### Use some fake data:
> fake.data <- data.frame(t = 1:10, y = runif(10, 1, 100))

> ggplot(fake.data, aes(x = t, y = y)) + geom_point() + geom_line()

[Figure: points connected by a line, with t on the x axis and y on the y axis]

Page 19: Make sure aesthetic mappings are properly applied

> ggplot(fake.data, aes(x = t, y = y, size = y)) + geom_point() + geom_line()

[Figure: points and line with t on the x axis and y on the y axis; size legend "y" with entries 25, 50, 75]

Here size = y sits in the plot-level aes(), so it is passed on to both geom_point() and geom_line().

Page 20: Make sure aesthetic mappings are properly applied

> ggplot(fake.data, aes(x = t, y = y, size = y)) + geom_point() + geom_line()

> ggplot(fake.data, aes(x = t, y = y)) + geom_point(aes(size = y)) + geom_line()

[Figure: points and line with t on the x axis and y on the y axis; size legend "y" with entries 25, 50, 75]

In the second command the size mapping lives inside geom_point(), so only the points are sized by y.
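A small sketch of how to confirm where a mapping applies, by inspecting the plot objects rather than the picture. The names global and local are invented here, and set.seed(1) is added only so the fake data are reproducible. How a size mapping is rendered on lines has also changed across ggplot2 versions, so this example looks only at the mappings themselves.

```r
library(ggplot2)

set.seed(1)  # assumption: any seed works, this just fixes the random values
fake.data <- data.frame(t = 1:10, y = runif(10, 1, 100))

# size mapped at the plot level: every geom inherits it
global <- ggplot(fake.data, aes(x = t, y = y, size = y)) +
  geom_point() + geom_line()

# size mapped inside geom_point() only
local <- ggplot(fake.data, aes(x = t, y = y)) +
  geom_point(aes(size = y)) + geom_line()

names(global$mapping)             # plot-level mapping includes "size"
names(local$mapping)              # plot-level mapping has only x and y
names(local$layers[[1]]$mapping)  # geom_point() carries its own "size"
names(local$layers[[2]]$mapping)  # geom_line() has no mapping of its own
```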

Page 21: Histograms

> ggplot(iris, aes(x = Sepal.Length)) + geom_histogram()

[Figure: histogram with Sepal.Length on the x axis and count on the y axis]

Page 22: Histograms

> ggplot(iris, aes(x = Sepal.Length)) + geom_histogram(fill = "orange")

[Figure: the same histogram with orange bars; Sepal.Length on the x axis and count on the y axis]

Page 23: Histograms

> ggplot(iris, aes(x = Sepal.Length)) + geom_histogram(fill = "orange", color = "brown")

[Figure: the same histogram with orange bars outlined in brown; Sepal.Length on the x axis and count on the y axis]

Page 24: Histograms

> ggplot(iris, aes(x = Sepal.Length)) +
    geom_histogram(fill = "orange", color = "brown") +
    xlab("Sepal Length") +
    ylab("Count") +
    ggtitle("Histogram of iris sepal lengths")

[Figure: the same histogram, titled "Histogram of iris sepal lengths", with "Sepal Length" on the x axis and "Count" on the y axis]
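Called with no arguments, geom_histogram() falls back to 30 bins and prints a message suggesting you pick a binwidth. The sketch below sets one explicitly and uses labs() as a single-call alternative to xlab(), ylab() and ggtitle(). The binwidth of 0.25 is an arbitrary choice for illustration, and hist_plot and bins are made-up names.

```r
library(ggplot2)

hist_plot <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(binwidth = 0.25, fill = "orange", color = "brown") +
  labs(x = "Sepal Length", y = "Count",
       title = "Histogram of iris sepal lengths")

# Each observation lands in exactly one bin, so the bin counts
# should add up to the number of rows in iris
bins <- ggplot_build(hist_plot)$data[[1]]
sum(bins$count)
```

Trying a few binwidth values is worthwhile, since the apparent shape of the distribution can change with the bin size.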

Page 25: Boxplots

> ggplot(iris, aes(x = "", y = Sepal.Length)) + geom_boxplot()

[Figure: a single boxplot of Sepal.Length; the x axis is labeled "x"]

Page 26: Boxplots

> ggplot(iris, aes(x = "", y = Sepal.Length)) + geom_boxplot(fill = "green")

[Figure: a single green boxplot of Sepal.Length; the x axis is labeled "x"]

Page 27: Boxplots

> ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot(fill = "green")

[Figure: green boxplots of Sepal.Length for setosa, versicolor and virginica, with Species on the x axis]

Page 28: Boxplots

> ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) + geom_boxplot()

[Figure: boxplots of Sepal.Length by Species, each species with its own fill color; legend "Species" with entries setosa, versicolor, virginica]

Page 29: Boxplots: Customizing the fill mappings

> ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
    geom_boxplot() +
    scale_fill_manual(values = c("red", "blue", "purple"))

[Figure: boxplots of Sepal.Length by Species with the custom fill colors; legend "Species" with entries setosa, versicolor, virginica]

Page 30: scale_fill_manual() also tweaks legend

> ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
    geom_boxplot() +
    scale_fill_manual(values = c("red", "blue", "purple"),
                      name = "Species name",
                      labels = c("SETOSA", "VERSICOLOR", "VIRGINICA"))

[Figure: the same boxplots; legend titled "Species name" with upper-case entries]

Labels are applied in factor-level order, so list them in the same order as the levels (setosa, versicolor, virginica).
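Because positional colors and labels depend on the factor-level order, one more defensive pattern is to name the color vector and to generate the labels from the levels themselves. The sketch below assumes the default iris level order. The names species_colors, p and fills are invented for illustration. Passing a function such as toupper to labels is a supported option in ggplot2 scales.

```r
library(ggplot2)

# Named values are matched to factor levels by name, not by position
species_colors <- c(virginica = "purple", setosa = "red", versicolor = "blue")

p <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot() +
  scale_fill_manual(values = species_colors,
                    name = "Species name",
                    labels = toupper)  # legend labels derived from the levels

# One row per box, in x-axis order: setosa, versicolor, virginica
fills <- ggplot_build(p)$data[[1]]$fill
fills
```

With this setup, reordering the factor levels later changes the plotting order but cannot silently attach a color or label to the wrong species.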

Page 31: Changing the order

> ### Ordering depends on factor levels
> levels(iris$Species)
[1] "setosa"     "versicolor" "virginica"

> ### Change order of levels
> iris$Species <- factor(iris$Species, levels = c("virginica", "setosa", "versicolor"))
> levels(iris$Species)
[1] "virginica"  "setosa"     "versicolor"

> ### Replot
> ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
    geom_boxplot() +
    scale_fill_manual(values = c("red", "blue", "purple"))

[Figure: boxplots of Sepal.Length by Species, now ordered virginica, setosa, versicolor on both the x axis and the "Species" legend]
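Overwriting iris$Species changes the level order for everything that runs afterwards in the same session. A cautious variant, sketched here with a made-up name (iris_reordered), is to reorder the levels on a copy and leave the built-in data as it was.

```r
# Work on a copy so the built-in iris keeps its original level order
iris_reordered <- iris
iris_reordered$Species <- factor(iris_reordered$Species,
                                 levels = c("virginica", "setosa", "versicolor"))

levels(iris$Species)            # original order
levels(iris_reordered$Species)  # new order

# Only the level order changes; every row keeps its species
same_values <- all(as.character(iris$Species) ==
                   as.character(iris_reordered$Species))
same_values
```

Passing iris_reordered to ggplot() then gives the reordered plot, while plots built from iris keep the default order.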

Page 32: Grouped boxplots

This will apply to violin plots as well.

> ## Create another categorical variable for grouping purposes
> iris %>%
    group_by(Species) %>%
    mutate(size = ifelse(Sepal.Width > median(Sepal.Width), "big", "small")) -> iris2

The three arguments to ifelse() are the condition, the value if TRUE, and the value if FALSE.

> head(iris2)
Source: local data frame [150 x 6]
Groups: Species [3]

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species  size
         <dbl>       <dbl>        <dbl>       <dbl>  <fctr> <chr>
1          5.1         3.5          1.4         0.2  setosa   big
2          4.9         3.0          1.4         0.2  setosa small
3          4.7         3.2          1.3         0.2  setosa small
4          4.6         3.1          1.5         0.2  setosa small
5          5.0         3.6          1.4         0.2  setosa   big
6          5.4         3.9          1.7         0.4  setosa   big
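Because of group_by(Species), median(Sepal.Width) is computed within each species, not over the whole data set. The sketch below rebuilds iris2 and counts the resulting categories. The trailing ungroup() is an addition made here (not in the slide) so that later steps are not silently grouped, and size_counts is a made-up name.

```r
library(dplyr)

iris2 <- iris %>%
  group_by(Species) %>%
  # condition, value if TRUE, value if FALSE; the median is per species
  mutate(size = ifelse(Sepal.Width > median(Sepal.Width), "big", "small")) %>%
  ungroup()

# How many flowers fall into each species/size combination
size_counts <- iris2 %>%
  group_by(Species, size) %>%
  tally()

size_counts
```

Flowers exactly at their species median are labeled "small", since the comparison is strictly greater than.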

Page 33: Grouped boxplots

> ggplot(iris2, aes(x = Species, fill = size, y = Sepal.Width)) + geom_boxplot()

[Figure: boxplots of Sepal.Width with Species (setosa, versicolor, virginica) on the x axis, split by fill; legend "size" with entries big, small]

Page 34: Grouped boxplots

> ggplot(iris2, aes(x = size, fill = Species, y = Sepal.Width)) + geom_boxplot()

[Figure: boxplots of Sepal.Width with size (big, small) on the x axis, split by fill; legend "Species" with entries setosa, versicolor, virginica]

Page 35: Detour: scale_color_manual() customizes color

> ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
    geom_point(aes(color = Species)) +
    scale_color_manual(values = c("cornflowerblue", "deepskyblue4", "lightcyan4"))

[Figure: scatterplot of Petal.Length against Sepal.Length in the custom colors; legend "Species" with entries virginica, setosa, versicolor]

Page 36: Detour round 2: scale_<fill/color>_??

There are many scales to use besides default and custom.
◦ scale_<fill/color>_brewer() uses pre-made color schemes from colorbrewer.org
◦ scale_color_gradient() can take a low and a high color to fill along a spectrum

See here: http://ggplot2.tidyverse.org/reference/#scales
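A brief sketch of the two scale families mentioned above. The palette name "Set2" and the yellow-to-red gradient are arbitrary picks for illustration. Mapping color to Petal.Width is likewise just a convenient continuous variable, and brewer_plot and gradient_plot are made-up names.

```r
library(ggplot2)

# Discrete variable: a ready-made ColorBrewer palette instead of hand-picked fills
brewer_plot <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Set2")

# Continuous variable: a gradient running from a low color to a high color
gradient_plot <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length,
                                  color = Petal.Width)) +
  geom_point() +
  scale_color_gradient(low = "yellow", high = "red")

# Use print(brewer_plot) or print(gradient_plot) to draw them
```

Brewer scales suit categorical variables such as Species, while gradient scales need a numeric variable to spread along the spectrum.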

Page 37: Violin plot

> ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) + geom_violin()

[Figure: violin plots of Sepal.Length by Species, ordered virginica, setosa, versicolor; legend "Species"]

Page 38: Barplot

> ggplot(iris, aes(x = Species, fill = Species)) + geom_bar()

[Figure: bar plot of count by Species, ordered virginica, setosa, versicolor, each bar reaching 50; legend "Species"]

Page 39: Stacked/grouped barplot

> head(iris2)
Source: local data frame [150 x 6]
Groups: Species [3]

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species  size
         <dbl>       <dbl>        <dbl>       <dbl>  <fctr> <chr>
1          5.1         3.5          1.4         0.2  setosa   big
2          4.9         3.0          1.4         0.2  setosa small
3          4.7         3.2          1.3         0.2  setosa small
4          4.6         3.1          1.5         0.2  setosa small
5          5.0         3.6          1.4         0.2  setosa   big
6          5.4         3.9          1.7         0.4  setosa   big

Page 40: Stacked/grouped barplot

> ggplot(iris2, aes(x = Species, fill = size)) + geom_bar()

[Figure: stacked bar plot of count by Species (setosa, versicolor, virginica); legend "size" with entries big, small]

Page 41: Stacked/grouped barplot

> ggplot(iris2, aes(x = Species, fill = size)) + geom_bar(position = "dodge")

[Figure: grouped (side-by-side) bar plot of count by Species (setosa, versicolor, virginica); legend "size" with entries big, small]

Page 42: Density plot

> ggplot(iris, aes(x = Sepal.Length, fill = Species)) + geom_density()

What does the tail of the setosa distribution look like?

[Figure: overlapping filled density curves of Sepal.Length; y axis "density"; legend "Species" with entries setosa, versicolor, virginica]

Page 43: Density plot

> ggplot(iris, aes(x = Sepal.Length, fill = Species)) + geom_density(alpha = 0.5)

[Figure: the same density curves of Sepal.Length drawn semi-transparently, so overlapping regions remain visible; legend "Species" with entries setosa, versicolor, virginica]

Page 44: Themes

Gray background and grid not working for you? Me neither.

◦ Other built-in themes: http://ggplot2.tidyverse.org/reference/ggtheme.html
◦ Customize your theme: http://ggplot2.tidyverse.org/reference/theme.html
◦ Use somebody else's themes:
  ◦ https://cran.r-project.org/web/packages/ggthemes/vignettes/ggthemes.html
  ◦ https://cran.r-project.org/web/packages/cowplot/vignettes/introduction.html
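A minimal sketch of the first two options, using only themes and theme elements that ship with ggplot2. The particular choices (theme_bw(), theme_classic(), moving the legend, hiding minor grid lines) are illustrative, and the object names are invented.

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point()

# Swap the default gray theme for a built-in alternative
p_bw      <- p + theme_bw()
p_classic <- p + theme_classic()

# Fine-tune individual elements on top of a complete theme
p_custom <- p +
  theme_bw() +
  theme(legend.position = "bottom",
        panel.grid.minor = element_blank())

# Use print(p_bw), print(p_classic) or print(p_custom) to draw them
```

Add theme() after the complete theme (theme_bw() here); a complete theme added later would overwrite the individual tweaks.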