enter the tidyverse -...

44
Enter the Tidyverse BIO5312 FALL2017 STEPHANIE J. SPIELMAN, PHD

Upload: volien

Post on 29-Aug-2018

223 views

Category:

Documents


0 download

TRANSCRIPT

EntertheTidyverseBIO5312FALL2017

STEPHANIE J. SPIELMAN,PHD

Whatisthe“tidyverse”?AcollectionofRpackageslargelydevelopedbyHadleyWickhamandothersatRstudio

Haveemergedasstaplesofmodern-daydatascienceinthepast5—10years

Wewillfocuson:• Visualization/plottingwithggplot2• Datamanagementand”wrangling”withdplyr andtidyr•DocumentpresentationwithRMarkdown

FocusisontidydataframesEachvariableformsacolumn.Eachobservationformsarow.Eachtypeofobservationalunitformsatable.

Tidydataprovidesaconsistentapproachtodatamanagementthatgreatlyfacilitatesdownstreamanalysisandviz

WorkingwithtidydataThepackagedplyr canmanipulateandmanagetidydata

Thepackagetidyr canrearrangedatatoconvertto/fromtidydata

Thepackageggplot2 isusedforvisualization/plotting

Thefundamentalverbsofdplyrfilter() selectrowsselect() selectcolumnsmutate() createnewcolumnsgroup_by() establish adatagroupingtally() count observationsinagroupingsummarize() calculate summarystatisticarrange() arrangerows

Therearemorefunctionsbuttheseonesarekey!

Thepipeoperator%>%“Pipes”outputfromonefunction/operationasinputtothenext

## Find the mean of iris sepal lengthsmean.sepal <- mean(iris$Sepal.Length)

## Using %>%mean.sepal <- iris$Sepal.Length %>% mean()

iris$Sepal.Length %>% mean() -> mean.sepal

iris %>% mean(Sepal.Length) -> mean.sepal

“forwardassignment”operatorfollowsthelogicalflowofpiping

## Start simple: display datahead(iris)

## Using %>%iris %>% head()

dplyr demoCommandsindemoareonsjspielman.org/bio5312_fall2017/day2_tidyverse1

Visualizingwithggplot2Thepackageggplot2 isagraphicspackagethatimplementsagrammarofgraphics◦ Operatesondataframes,notvectorslikeBaseR◦ Explicitlydifferentiatesbetweenthedataandtherepresentationofthedata

Theggplot2 grammar

Grammar element* What isit

Data Thedataframebeingplotted

Geometrics Thegeometricshapethatwillrepresentthedata• Point,boxplot,histogram, violin,bar,etc.

Aesthetics Theaesthetics ofthegeometricobject• Color,size,shape,etc.

*Tableistinysubsetofwhatggplot2hastooffer

Example:scatterplot> ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) + geom_point()

●●●

●●

●●

●● ●

●●

●●

●●●●

● ●●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

● ●

●●

● ●●●

●●

●●

●●

●●

2

4

6

5 6 7 8Sepal.Length

Petal.Length

Example:scatterplot> ggplot( iris, aes(x = Sepal.Length, y = Petal.Length) ) + geom_point()

●●●

●●

●●

●● ●

●●

●●

●●●●

● ●●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

● ●

●●

● ●●●

●●

●●

●●

●●

2

4

6

5 6 7 8Sepal.Length

Petal.Length

• Passinthedataframeasyourfirstargument

• Aestheticsmapthedataontoplotcharacteristics,herexandyaxes

• Displaythedatageometricallyaspoints

Example:scatterplotwithcolor> ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) + geom_point(color = "red" )

●●●

●●

●●

●● ●

●●

●●

●●●●

● ●●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

● ●

●●

● ●●●

●●

●●

●●

●●

2

4

6

5 6 7 8Sepal.Length

Petal.Length

Example:scatterplotwithaes color> ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) + geom_point()

●●●

●●

●●

●● ●

●●

●●

●●●●

● ●●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

● ●

●●

● ●●●

●●

●●

●●

●●

2

4

6

5 6 7 8Sepal.Length

Petal.Length Species

setosa

versicolor

virginica

• Placingcolorinsideaesethetic mapsittothedata.

Example:scatterplotwithaes color,shape

> ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species, shape = Species)) + geom_point()

●●●

●●

●●

●● ●

●●

●●

●●●●

● ●●●

●●

●●

●●

●●●

●●

2

4

6

5 6 7 8Sepal.Length

Petal.Length Species

● setosa

versicolor

virginica

Aestheticsmaybeplacedinsidetherelevantgeom

> ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) + geom_point(aes(color = Species, shape = Species))

●●●

●●

●●

●● ●

●●

●●

●●●●

● ●●●

●●

●●

●●

●●●

●●

2

4

6

5 6 7 8Sepal.Length

Petal.Length Species

● setosa

versicolor

virginica

> ## Remember dplyr!> iris %>% ggplot(aes(x = Sepal.Length, y =

Petal.Length)) + geom_point(aes(color = Species, shape = Species))

Aestheticsareformappingonly> ### Color all points blue?> ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = "blue")) + geom_point()

●●●

●●

●●

●● ●

●●

●●

●●

●●

●●

●●

●●●●

● ●●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●●

●●

●●

● ●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

2

4

6

5 6 7 8Sepal.Length

Petal.Length

colour● blue

Aestheticsareformappingonly> ### Color all points blue?> ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = "blue")) + geom_point()

> ### Correctly color all points blue> ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) + geom_point(color = "blue")

●●●

●●

●●

●● ●

●●

●●

●●

●●

●●

●●

●●●●

● ●●●

●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●●

●●

●●

● ●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

2

4

6

5 6 7 8Sepal.Length

Petal.Length

colour● blue

Example:multiplegeoms> ### Use some fake data:> fake.data <- data.frame(t = 1:10, y = runif(10, 1, 100))

> ggplot(fake.data, aes(x = t, y = y)) + geom_point() + geom_line()

0

25

50

75

100

2.5 5.0 7.5 10.0t

y

Makesureaestheticmappingsareproperlyapplied

> ggplot(fake.data, aes(x = t, y = y, size = y)) + geom_point() + geom_line()

0

25

50

75

100

2.5 5.0 7.5 10.0t

y

y●

25

50

75

Makesureaestheticmappingsareproperlyapplied

> ggplot(fake.data, aes(x = t, y = y, size = y)) + geom_point() + geom_line()

> ggplot(fake.data, aes(x = t, y = y)) + geom_point( aes(size=y) ) + geom_line()

0

25

50

75

100

2.5 5.0 7.5 10.0t

y

y●

25

50

75

Histograms> ggplot(iris, aes(x = Sepal.Length)) + geom_histogram()

0.0

2.5

5.0

7.5

10.0

12.5

5 6 7 8Sepal.Length

count

Histograms> ggplot(iris, aes(x = Sepal.Length)) + geom_histogram( fill = "orange" )

0.0

2.5

5.0

7.5

10.0

12.5

5 6 7 8Sepal.Length

count

Histograms> ggplot(iris, aes(x = Sepal.Length)) + geom_histogram( fill = "orange", color = "brown" )

0.0

2.5

5.0

7.5

10.0

12.5

5 6 7 8Sepal.Length

count

Histograms> ggplot(iris, aes(x = Sepal.Length)) + geom_histogram( fill = "orange", color = "brown" )

+ xlab("Sepal Length") + ylab("Count") + ggtitle("Histogram of iris sepal lengths")

0.0

2.5

5.0

7.5

10.0

12.5

5 6 7 8Sepal Length

Cou

nt

Histogram of iris sepal lengths

Boxplots> ggplot(iris, aes(x = "", y = Sepal.Length)) + geom_boxplot()

5

6

7

8

x

Sepal.Length

Boxplots> ggplot(iris, aes(x = "", y = Sepal.Length)) + geom_boxplot(fill = "green")

5

6

7

8

x

Sepal.Length

Boxplots> ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot(fill = "green")

●5

6

7

8

setosa versicolor virginicaSpecies

Sepal.Length

Boxplots> ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) + geom_boxplot()

●5

6

7

8

setosa versicolor virginicaSpecies

Sepal.Length Species

setosa

versicolor

virginica

Boxplots:Customizingthefillmappings

> ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) + geom_boxplot() + scale_fill_manual(values=c("red", "blue", "purple"))

●5

6

7

8

setosa versicolor virginicaSpecies

Sepal.Length Species

setosa

versicolor

virginica

scale_fill_manual()alsotweakslegend

> ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) + geom_boxplot() + scale_fill_manual(values=c("red", "blue", "purple"), name = "Species name", labels=c("SETOSA", "VIRGINICA", "VERSICOLOR"))

●5

6

7

8

setosa versicolor virginicaSpecies

Sepa

l.Len

gth Species name

SETOSA

VIRGINICA

VERSICOLOR

Changingtheorder> ### Ordering depends on factor levels> levels(iris$Species)

[1] "setosa" "versicolor" "virginica"

> ### Change order of levels> iris$Species <- factor(iris$Species, levels=c("virginica", "setosa", "versicolor"))

[1] "virginica" "setosa" "versicolor"

> ### Replot> ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +

geom_boxplot() + scale_fill_manual(values=c("red", "blue", "purple"))

●5

6

7

8

virginica setosa versicolorSpecies

Sepal.Length Species

virginica

setosa

versicolor

GroupedboxplotsThiswillapplytoviolinplotsaswell.> ## Create another categorical variable for grouping purpopses> iris %>%

group_by(Species) %>%mutate(size = ifelse( Sepal.Width > median(Sepal.Width) , "big" , "small" )) -> iris2

> head(iris2) Source: local data frame [150 x 6]Groups: Species [3]

Sepal.Length Sepal.Width Petal.Length Petal.Width Species size<dbl> <dbl> <dbl> <dbl> <fctr> <chr>

1 5.1 3.5 1.4 0.2 setosa big2 4.9 3.0 1.4 0.2 setosa small3 4.7 3.2 1.3 0.2 setosa small4 4.6 3.1 1.5 0.2 setosa small5 5.0 3.6 1.4 0.2 setosa big6 5.4 3.9 1.7 0.4 setosa big

Condition ValueifTRUE

ValueifFALSE

Groupedboxplots> ggplot(iris2, aes( x = Species, fill=size, y=Sepal.Width)) + geom_boxplot()

●●

2.0

2.5

3.0

3.5

4.0

4.5

setosa versicolor virginicaSpecies

Sepal.W

idth size

big

small

Groupedboxplots> ggplot(iris2, aes( x = size, fill = Species, y=Sepal.Width)) + geom_boxplot()

●●

2.0

2.5

3.0

3.5

4.0

4.5

big smallsize

Sepal.W

idth Species

setosa

versicolor

virginica

Detour:scale_color_manual()customizescolor

> ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) + geom_point(aes(color = Species)) + scale_color_manual(values=c("cornflowerblue", "deepskyblue4", "lightcyan4"))

●●●

●●

●●

●● ●

●●

●●

●●●●

●●●●●

●●

●●

●●●

●●

●●

●●

●●

●●

● ●

●●

●●●●

●●

●●

●●

●●

2

4

6

5 6 7 8Sepal.Length

Petal.Length Species

virginica

setosa

versicolor

Detourround2:scale_<fill/color>_??Therearemany scalestousebesidesdefaultandcustom.◦ scale_<fil/color>_brewer()usespre-madecolorschemesfromcolorbrewer.org

◦ scale_color_gradient()cantakealowandhightofillalongaspectrum

Seehere:http://ggplot2.tidyverse.org/reference/#scales

Violinplot> ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) + geom_violin()

5

6

7

8

virginica setosa versicolorSpecies

Sepal.Length Species

virginica

setosa

versicolor

Barplot> ggplot(iris, aes(x = Species, fill = Species)) + geom_bar()

0

10

20

30

40

50

virginica setosa versicolorSpecies

count

Speciesvirginica

setosa

versicolor

Stacked/groupedbarplot> head(iris2)

Source: local data frame [150 x 6]Groups: Species [3]

Sepal.Length Sepal.Width Petal.Length Petal.Width Species size<dbl> <dbl> <dbl> <dbl> <fctr> <chr>

1 5.1 3.5 1.4 0.2 setosa big2 4.9 3.0 1.4 0.2 setosa small3 4.7 3.2 1.3 0.2 setosa small4 4.6 3.1 1.5 0.2 setosa small5 5.0 3.6 1.4 0.2 setosa big6 5.4 3.9 1.7 0.4 setosa big

Stacked/groupedbarplot> ggplot(iris, aes(x = Species, fill = size)) + geom_bar()

0

10

20

30

40

50

setosa versicolor virginicaSpecies

count size

big

small

Stacked/groupedbarplot> ggplot(iris, aes(x = Species, fill = size)) + geom_bar( position = "dodge" )

0

10

20

30

setosa versicolor virginicaSpecies

count size

big

small

Densityplot> ggplot(iris, aes(x = Sepal.Length, fill = Species)) + geom_density()

Whatdoesthetailofthesetosa distributionlooklike?

0.0

0.4

0.8

1.2

5 6 7 8Sepal.Length

density

Speciessetosa

versicolor

virginica

Densityplot> ggplot(iris, aes(x = Sepal.Length, fill = Species)) + geom_density( alpha = 0.5 )

0.0

0.4

0.8

1.2

5 6 7 8Sepal.Length

density

Speciessetosa

versicolor

virginica

ThemesGraybackgroundandgridnotworkingforyou?Meneither.

◦ Built-inotherthemes:http://ggplot2.tidyverse.org/reference/ggtheme.html

◦ Customizeyourtheme:http://ggplot2.tidyverse.org/reference/theme.html

◦ Usesomebodyelse'sthemes:◦ https://cran.r-project.org/web/packages/ggthemes/vignettes/ggthemes.html◦ https://cran.r-project.org/web/packages/cowplot/vignettes/introduction.html