Lab5A - Intro to GGPLOT2
Z. Sang
Sept 24, 2018
In this lab you will learn to visualize raw data by plotting exploratory graphics with the ggplot2 package. Unlike final graphs for a publication or thesis, exploratory graphics are usually made very quickly while checking for errors, outliers, distributions, and correlations among variables. The goal is usually to develop a personal understanding of the data and to prioritize tasks for follow-up analysis.
Grammar of ggplot2
ggplot2, another important package of the tidyverse, is designed for visualizing data frames. The ‘gg’ in the name stands for ‘grammar of graphics’, and ggplot2 is recognized as one of the three main graphics systems of R.
The most important thing to get used to with ggplot2 is the logical structure of plots. The code you write specifies the connections between the variables in your data and the x and y locations, colors, sizes, shapes, etc. that you see on the screen. In ggplot2, these logical connections between your data and the plot elements are called aesthetic mappings, or just aesthetics.
You begin every plot by telling the ggplot() function what your data are, and then how the variables in the data logically map onto the plot’s aesthetics. Then you take the result and say what general sort of plot you want, such as a scatterplot, a boxplot, or a bar chart. In ggplot2, the overall type of plot is called a geom. Each geom has a function that creates it, and the function’s name follows the pattern “geom_...()”. For example, geom_point() makes scatterplots, geom_bar() makes bar plots, geom_boxplot() makes boxplots, and so on. You combine these two pieces, the ggplot(data, mapping) object and the geom_...() call, by literally adding them together in an expression with the + symbol.
Data, mapping (or aesthetics), and geometry (geom) are the three mandatory components of a ggplot2 plot. As with other functions, the output of ggplot() can be assigned to an object for further editing. Other optional ggplot2 grammar components will be introduced in Lab5 for figure customization.
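As a minimal sketch of these three components, the code below builds a scatterplot step by step. It assumes ggplot2 is installed; the small data frame dat is retyped from the first ten rows of the weather station table printed in the Data preparation section, so it is only a stand-in for the full dataset, and the object names are illustrative.

```r
library(ggplot2)

# Stand-in data: ten rows retyped from the head(dat1, 10) output in this lab
dat <- data.frame(
  MAP = c(443, 415, 429, 405, 431, 480, 511, 550, 488, 431),
  MSP = c(287, 257, 258, 254, 291, 292, 330, 319, 309, 260)
)

# Components 1 and 2: the data and the aesthetic mapping (no geom yet)
base <- ggplot(dat, aes(x = MAP, y = MSP))

# Component 3: add a geom with +, and store the result in an object
p <- base + geom_point()

# The stored object can be edited further before it is drawn
p_titled <- p + ggtitle("MSP versus MAP")

print(p_titled)
```

Printing base on its own draws empty axes, because no geom has been added to it yet.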
A little too complex? Don’t worry; you will get familiar with the grammar system very soon. In this lab, we will use this ggplot2 syntax to plot the following exploratory graphics: histogram (density plot), boxplot, scatterplot, and scatterplot matrix.
Data preparation
• For this exercise, use the weather station dataset “AB_Stations.csv”, which you can download from the course website. The first three columns give the weather station ID and the ecosystem and biome of Alberta in which the station is located. These are followed by a number of climate variables that you can use for exploration (MAT = mean annual temperature, MWMT = mean warmest month temperature, MCMT = mean coldest month temperature, MAP = mean annual precipitation, MSP = mean summer precipitation, DRYNESS = an index).
• Load the required packages.
# install.packages('tidyverse') # run this first if the tidyverse package is not installed
library(tidyverse)
• Import the dataset with the code below, and use the head(), tail(), str(), or View() functions to check the imported data table.
dat1 <- read.csv("E:\\lab3\\AB_Stations.csv")
head(dat1, 10)
##    STATION ECOSYS     BIOME  MAT MWMT  MCMT MAP MSP DRYNESS
##  1  300114   G-NF Grassland  2.4 17.5 -24.9 443 287      17
##  2  301449  G-DMG Grassland  4.5 18.8 -23.1 415 257       3
##  3  302343   G-MG Grassland  4.9 18.4 -23.3 429 258      12
##  4  302369  G-DMG Grassland  5.1 18.3 -24.3 405 254       1
##  5  302789   G-NF Grassland  2.8 17.4 -22.2 431 291      14
##  6  304155   B-AP    Boreal -0.9 17.4 -30.4 480 292      45
##  7  304642  B-UBH    Boreal -0.5 16.3 -29.0 511 330      52
##  8  305076    M-M   Montane  2.8 15.1 -21.3 550 319      42
##  9  305773   B-CP    Boreal  3.3 17.3 -24.5 488 309      22
## 10  306733   B-KU    Boreal -3.5 16.5 -32.3 431 260      60
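If the CSV file is not at hand, the same checks can be tried on a stand-in table. The sketch below retypes the ten rows printed above and passes them to read.csv() through its text argument. The object name dat is only illustrative, and the full dataset will give different summaries.

```r
# Stand-in for AB_Stations.csv: the ten rows printed by head(dat1, 10) above
dat <- read.csv(text = "STATION,ECOSYS,BIOME,MAT,MWMT,MCMT,MAP,MSP,DRYNESS
300114,G-NF,Grassland,2.4,17.5,-24.9,443,287,17
301449,G-DMG,Grassland,4.5,18.8,-23.1,415,257,3
302343,G-MG,Grassland,4.9,18.4,-23.3,429,258,12
302369,G-DMG,Grassland,5.1,18.3,-24.3,405,254,1
302789,G-NF,Grassland,2.8,17.4,-22.2,431,291,14
304155,B-AP,Boreal,-0.9,17.4,-30.4,480,292,45
304642,B-UBH,Boreal,-0.5,16.3,-29.0,511,330,52
305076,M-M,Montane,2.8,15.1,-21.3,550,319,42
305773,B-CP,Boreal,3.3,17.3,-24.5,488,309,22
306733,B-KU,Boreal,-3.5,16.5,-32.3,431,260,60")

str(dat)                 # column types: are the climate variables numeric?
summary(dat$DRYNESS)     # range and quartiles of one variable
colSums(is.na(dat))      # number of missing values in each column
table(dat$BIOME)         # number of stations per biome
```

Checks like these are quick ways to catch import problems, such as a numeric column read as text, before any plotting.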
5.1. Histograms
One useful plot type for exploring raw data is the histogram, which is commonly used to visually check the distribution of a continuous variable. The geom for histograms is geom_histogram(). By default, the y axis shows the number of observations in each bin (the count), but y can also be set to density.
• According to the ggplot2 syntax, we can execute the following command to get a histogram for a variable, in this case the variable “DRYNESS”:
hist_a <- ggplot(dat1, aes(x = DRYNESS)) + geom_histogram(color = 'gray90')
hist_a
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
[Figure: histogram of DRYNESS (x axis) versus count (y axis), using the default 30 bins]
ggplot2 chooses a default number of bins (30) when generating histograms, but that default is unlikely to be the most appropriate choice for any given histogram. It is therefore critical to vary the bins and verify that the resulting histogram reflects the data accurately. Too many bins make the histogram overly peaky and obscure the overall shape of the distribution, while too few bins hide its details. There are two ways to change the bins:
1. First method: set the number of bins you want for the histogram;
ggplot(dat1, aes(x = DRYNESS)) + geom_histogram(bins = 5, color = 'gray90')  # 5 bins
ggplot(dat1, aes(x = DRYNESS)) + geom_histogram(bins = 20, color = 'gray90') # 20 bins
ggplot(dat1, aes(x = DRYNESS)) + geom_histogram(bins = 50, color = 'gray90') # 50 bins
[Figures: histograms of DRYNESS (x axis) versus count (y axis) with 5, 20, and 50 bins]
2. Second method: set the width of bins:
ggplot(dat1, aes(x = DRYNESS)) + geom_histogram(binwidth = 1, color = 'gray90')
ggplot(dat1, aes(x = DRYNESS)) + geom_histogram(binwidth = 5, color = 'gray90')
ggplot(dat1, aes(x = DRYNESS)) + geom_histogram(binwidth = 10, color = 'gray90')
[Figures: histograms of DRYNESS (x axis) versus count (y axis) with bin widths of 1, 5, and 10]
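To see what the two methods actually do to the bins, it can help to look at the numbers behind a histogram rather than only at the picture. The sketch below is one way to do that with layer_data(), which returns the values ggplot2 computed for a layer. It assumes ggplot2 is installed and uses only the ten DRYNESS values printed by head(dat1, 10) as a stand-in, so the counts will not match the figures in this lab.

```r
library(ggplot2)

# Stand-in data: DRYNESS values from the ten rows printed by head(dat1, 10)
dat <- data.frame(DRYNESS = c(17, 3, 12, 1, 14, 45, 52, 42, 22, 60))

# Method 1: fix the number of bins; method 2: fix the width of each bin
p_bins  <- ggplot(dat, aes(x = DRYNESS)) + geom_histogram(bins = 5)
p_width <- ggplot(dat, aes(x = DRYNESS)) + geom_histogram(binwidth = 10)

# Extract the bin edges and counts computed for each histogram
bins_tab  <- layer_data(p_bins)[, c("xmin", "xmax", "count")]
width_tab <- layer_data(p_width)[, c("xmin", "xmax", "count")]

print(bins_tab)   # five bins whose width follows from the data range
print(width_tab)  # as many bins of width 10 as are needed to cover the data
```

With bins the width is derived from the range of the data, while with binwidth the number of bins is derived instead, which is why the two arguments are alternatives rather than companions.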
• Histograms are also a great way to visually check the effectiveness of data transformations. In this case, the square-root transformation achieves an approximately normal distribution.
hist_b <- ggplot(dat1, aes(x = sqrt(DRYNESS))) + geom_histogram(color = 'gray90')
hist_b
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
[Figure: histogram of sqrt(DRYNESS) (x axis) versus count (y axis)]
• You can also fill the bins with colors by group/class. In many scenarios we have multiple distributions that we would like to visualize simultaneously. For example, do the biomes have similar dryness? One commonly used strategy is to stack the bars on top of each other and fill the histogram with a different color for each group:
hist_c <- ggplot(dat1, aes(x = DRYNESS, fill = BIOME)) +
  geom_histogram()
• Although counts are used for the y axis by default, you can change the y axis to density. When sample sizes differ among groups/classes, density histograms may show a different pattern from the frequency (count) histograms.
hist_d <- ggplot(dat1, aes(x = DRYNESS, fill = BIOME)) +
  geom_histogram(aes(y = ..density..)) # specify y as density
# note: in recent ggplot2 versions, after_stat(density) replaces ..density..
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
[Figure, panel hist_c: stacked histogram of DRYNESS (x axis) versus count (y axis), filled by BIOME (Boreal, Grassland, Montane)]
[Figure, panel hist_d: stacked histogram of DRYNESS (x axis) versus density (y axis), filled by BIOME (Boreal, Grassland, Montane)]
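The difference between the count and density versions can be checked numerically. The sketch below assumes ggplot2 version 3.3.0 or later (for after_stat()) and again uses the ten rows printed by head(dat1, 10) as a stand-in, with an arbitrary bin width of 10. It shows that counts add up to the group sizes, whereas the density bars of each group cover a total area of 1 regardless of how many stations the group contains.

```r
library(ggplot2)

# Stand-in data: BIOME and DRYNESS from the ten rows printed by head(dat1, 10)
dat <- data.frame(
  BIOME = c("Grassland", "Grassland", "Grassland", "Grassland", "Grassland",
            "Boreal", "Boreal", "Montane", "Boreal", "Boreal"),
  DRYNESS = c(17, 3, 12, 1, 14, 45, 52, 42, 22, 60)
)

p_count <- ggplot(dat, aes(x = DRYNESS, fill = BIOME)) +
  geom_histogram(binwidth = 10)
p_density <- ggplot(dat, aes(x = DRYNESS, fill = BIOME)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 10)

cnt <- layer_data(p_count)
den <- layer_data(p_density)

# Counts per group add up to the group sizes (groups are in alphabetical order)
group_n <- tapply(cnt$count, cnt$group, sum)

# Density times bin width, summed within a group, gives an area of 1
group_area <- tapply(den$density * (den$xmax - den$xmin), den$group, sum)

print(group_n)
print(group_area)
```

Because every group is rescaled to the same total area, a small group such as Montane can look as prominent as a large one on the density scale, which is the inconsistency with the count version described above.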
• The biggest disadvantage of the stacked histogram is that it is hard to read off the values for each group. For example, how many Boreal samples have DRYNESS values around 40? About 15, or 8? It is also hard to compare distributions among groups. One solution is to change the position of the bars. A common choice is dodging, which preserves the vertical position of a geom while adjusting its horizontal position.
hist_e <- ggplot(dat1, aes(x = DRYNESS, fill = BIOME)) +
  geom_histogram(position = 'dodge') # change bin positions
hist_e
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
[Figure: dodged histogram of DRYNESS (x axis) versus count (y axis), filled by BIOME (Boreal, Grassland, Montane)]
• Histograms have been a popular visualization option since at least the 18th century, in part because they are easily generated by hand. More recently, as extensive computing power has become available in everyday devices such as laptops and cell phones, we see them increasingly being replaced by density plots. In a density plot, we attempt to visualize the underlying probability distribution of the data by drawing an appropriate continuous curve:
hist_f <- ggplot(dat1, aes(x = DRYNESS, fill = BIOME)) +
  geom_density(alpha = 0.4) # introduce transparency
hist_f
[Figure: density curves of DRYNESS (x axis) versus density (y axis), filled by BIOME (Boreal, Grassland, Montane)]
As before, we fill each density curve with a different color. The alpha argument controls the transparency of the fill and ranges from 0 (fully transparent) to 1 (fully opaque). You can also try adding multiple geoms:
hist_f + geom_histogram(aes(y = ..density..), alpha = 0.6, position = 'dodge')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
[Figure: density curves overlaid with dodged density histograms of DRYNESS (x axis) versus density (y axis), filled by BIOME (Boreal, Grassland, Montane)]
5.2. Scatter plots
With scatter plots you can visually check the relationships among variables. Are they linear or curvilinear? Outliers are also easily visible.
• Now, try using a scatter plot to visually check the relationship between two variables and to identify outliers. To check the relationship between Mean Annual Precipitation (MAP) and Mean Summer Precipitation (MSP), we can use them as x and y in the plot, respectively (normally the y axis is for the dependent variable and the x axis for the independent variable, but in this case it is fine to exchange the axes). The geom function for scatter plots is geom_point().
ggplot(dat1, aes(x = MAP, y = MSP)) + geom_point()
[Figure: scatter plot of MSP (y axis) versus MAP (x axis)]
• Cool! There appears to be a positive relationship between these two variables. However, several overlapping points make the plot harder to interpret. One easy remedy is to make the points semi-transparent.
plt <- ggplot(dat1, aes(x = MAP, y = MSP)) + geom_point(alpha = .3)
plt
[Figure: scatter plot of MSP (y axis) versus MAP (x axis) with semi-transparent points]
• Besides changing the transparency, shifting the point positions by adding a small amount of random noise (jittering), counterintuitive as it sounds, can help to reveal each point.
plt_jittered <- ggplot(dat1, aes(x = MAP, y = MSP)) +
  geom_point(position = "jitter")
plt_jittered
[Figure: scatter plot of MSP (y axis) versus MAP (x axis) with jittered points]
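With position = "jitter" the amount of noise is chosen automatically and changes on every run. If you want to control it, position_jitter() accepts width, height, and seed arguments. The sketch below assumes ggplot2 version 3.0.0 or later (for the seed argument) and uses the ten rows printed by head(dat1, 10) as a stand-in; the jitter amount of 5 mm and the seed value are arbitrary choices for illustration.

```r
library(ggplot2)

# Stand-in data: MAP and MSP from the ten rows printed by head(dat1, 10)
dat <- data.frame(
  MAP = c(443, 415, 429, 405, 431, 480, 511, 550, 488, 431),
  MSP = c(287, 257, 258, 254, 291, 292, 330, 319, 309, 260)
)

# Shift each point by at most 5 units in x and y; the seed makes it repeatable
jit <- position_jitter(width = 5, height = 5, seed = 42)

p <- ggplot(dat, aes(x = MAP, y = MSP)) +
  geom_point(position = jit, alpha = 0.5)

# layer_data() shows the jittered coordinates that will be drawn
jittered <- layer_data(p)
print(jittered[, c("x", "y")])
print(p)
```

Keep the jitter small relative to the spread of the data: the plotted positions are no longer the measured values, so large amounts of noise can create or hide patterns.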
• You can also add labels to your plot with the geom function geom_text(). In this case, we want to label each point with its STATION ID. hjust and vjust control the placement of the labels.
plt_label <- ggplot(dat1, aes(x = MAP, y = MSP, label = STATION)) +
  geom_point() + geom_text(hjust = 0, vjust = 0, size = 2.2, color = 'gray40')
plt_label
[Figure: scatter plot of MSP (y axis) versus MAP (x axis) with each point labeled by its STATION ID]
Could you tell the STATION ID of the two outliers around the lower right corner of the plot?
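Labeling every station makes the plot crowded. One option is to give geom_text() its own, smaller data frame so that only selected points are labeled. The sketch below assumes ggplot2 is installed and uses the ten rows printed by head(dat1, 10) as a stand-in. The rule used to flag points (MSP less than 60% of MAP) is a hypothetical cutoff chosen only so that something is flagged in these ten rows; it is not a recommended outlier test.

```r
library(ggplot2)

# Stand-in data: three columns from the ten rows printed by head(dat1, 10)
dat <- data.frame(
  STATION = c(300114, 301449, 302343, 302369, 302789,
              304155, 304642, 305076, 305773, 306733),
  MAP = c(443, 415, 429, 405, 431, 480, 511, 550, 488, 431),
  MSP = c(287, 257, 258, 254, 291, 292, 330, 319, 309, 260)
)

# Hypothetical rule: flag stations with little summer rain for their annual total
flagged <- subset(dat, MSP / MAP < 0.6)

# All points are drawn, but only the flagged ones receive a label
p <- ggplot(dat, aes(x = MAP, y = MSP)) +
  geom_point() +
  geom_text(data = flagged, aes(label = STATION),
            hjust = 0, vjust = 0, size = 3, color = "gray40")

print(flagged)
print(p)
```

The same pattern works with any logical condition, for example a cutoff you read off the unlabeled scatter plot.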
• However, do all BIOME types follow the same relationship between MAP and MSP? To find out, we need to add a visual aid that separates the types (e.g., color or shape).
plt_biome <- ggplot(dat1, aes(x = MAP, y = MSP, color = BIOME,
                              shape = BIOME)) + geom_point()
plt_biome
[Figure: scatter plot of MSP (y axis) versus MAP (x axis), with color and shape by BIOME (Boreal, Grassland, Montane)]
How about a density plot in two dimensions? Try:
plt + geom_density2d()
[Figure: scatter plot of MSP (y axis) versus MAP (x axis) with 2-D density contours]
5.3. Box plots
• Just like scatter plots, boxplots are a good way to visually check the relationship between two variables. If one variable is continuous (as y) and the other is categorical (as x), then a boxplot is a good option, for instance to understand the general distribution of mean annual temperature (MAT) within each biome type (BIOME). The geom for boxplots is geom_boxplot().
ggplot(dat1, aes(x = BIOME, y = MAT)) + geom_boxplot()
[Figure: boxplots of MAT (y axis) by BIOME (x axis: Boreal, Grassland, Montane)]
If you still have time, add the following arguments within the parentheses of geom_boxplot() and see what they do: varwidth = TRUE, notch = TRUE
ggplot(dat1, aes(x = BIOME, y = MAT)) + geom_boxplot(varwidth = TRUE) # box width reflects the sample size
ggplot(dat1, aes(x = BIOME, y = MAT)) + geom_boxplot(notch = TRUE) # add notches to the boxes
## notch went outside hinges. Try setting notch=FALSE.
[Figures: boxplots of MAT (y axis) by BIOME (x axis: Boreal, Grassland, Montane), first with varwidth = TRUE, then with notch = TRUE]
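The message "notch went outside hinges" is easier to understand once you know how a notch is computed: it spans the median plus or minus 1.58 times the interquartile range divided by the square root of the sample size. The sketch below works through this with base R for the four Boreal MAT values among the ten rows printed by head(dat1, 10), used here only as a small stand-in sample. Note that boxplot.stats() takes the hinges from fivenum(), whereas geom_boxplot() uses quantiles, so the exact numbers in a ggplot2 figure can differ slightly.

```r
# Stand-in sample: MAT of the four Boreal stations printed by head(dat1, 10)
mat_boreal <- c(-0.9, -0.5, 3.3, -3.5)

five <- fivenum(mat_boreal)        # min, lower hinge, median, upper hinge, max
iqr  <- five[4] - five[2]          # distance between the hinges
n    <- length(mat_boreal)

# Notch limits: median +/- 1.58 * IQR / sqrt(n)
notch <- five[3] + c(-1, 1) * 1.58 * iqr / sqrt(n)

# boxplot.stats() reports the same limits in its conf element
conf <- boxplot.stats(mat_boreal)$conf

print(five)
print(notch)

# With so few observations the notch is wider than the box itself
outside <- notch[1] < five[2] || notch[2] > five[4]
print(outside)
```

Small groups therefore trigger the message easily; it is a hint that the notch, a rough interval around the median, is not very informative for that group.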
Similarly, we can make boxplots of mean annual temperature (MAT) for the ecosystems (ECOSYS).
ggplot(dat1, aes(x = ECOSYS, y = MAT)) + geom_boxplot()
[Figure: boxplots of MAT (y axis) by ECOSYS (x axis), with overlapping ecosystem labels]
Since the ecosystem names take up space and easily overlap, we prefer to put ECOSYS on the y axis and draw horizontal boxplots:
ggplot(dat1, aes(x = ECOSYS, y = MAT)) +
  geom_boxplot() + coord_flip() # horizontal: flip the x and y axes
# Great! Now you can color the boxplots by their BIOME types:
ggplot(dat1, aes(x = ECOSYS, y = MAT, fill = BIOME)) +
  geom_boxplot(varwidth = TRUE) + coord_flip() # colored by BIOME groups
[Figure, panel a: horizontal boxplots of MAT (x axis) by ECOSYS (y axis)]
[Figure, panel b: horizontal boxplots of MAT (x axis) by ECOSYS (y axis), filled by BIOME (Boreal, Grassland, Montane)]
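As an alternative to coord_flip(), newer versions of ggplot2 can draw horizontal boxplots directly when the categorical variable is mapped to y. The sketch below assumes ggplot2 version 3.3.0 or later and uses BIOME and MAT from the ten rows printed by head(dat1, 10) as a stand-in for the full table.

```r
library(ggplot2)

# Stand-in data: BIOME and MAT from the ten rows printed by head(dat1, 10)
dat <- data.frame(
  BIOME = c("Grassland", "Grassland", "Grassland", "Grassland", "Grassland",
            "Boreal", "Boreal", "Montane", "Boreal", "Boreal"),
  MAT = c(2.4, 4.5, 4.9, 5.1, 2.8, -0.9, -0.5, 2.8, 3.3, -3.5)
)

# Approach used in this lab: vertical boxplots, then flip the coordinates
p_flip <- ggplot(dat, aes(x = BIOME, y = MAT)) +
  geom_boxplot() + coord_flip()

# Alternative: map the categorical variable to y; the boxes come out horizontal
p_horizontal <- ggplot(dat, aes(x = MAT, y = BIOME)) +
  geom_boxplot()

# ggplot2 records that it flipped the geom in the computed layer data
box <- layer_data(p_horizontal)

print(p_flip)
print(p_horizontal)
```

Both versions show the same boxes. The second keeps the axis names in aes() consistent with what appears on screen, which can be less confusing when more layers are added.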
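• Boxplots are generally useful, but they focus on only five numbers from the sample (the minimum, the maximum, and the 25th, 50th, and 75th percentiles; in geom_boxplot() the whiskers stop at the most extreme points within 1.5 times the interquartile range of the box, and points beyond them are drawn individually). To show more detail about the distribution, we can add jittered points or use a violin plot as an alternative.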
ggplot(dat1, aes(x = BIOME, y = MAT)) +
  geom_violin() + geom_boxplot(width = .1)
ggplot(dat1, aes(x = BIOME, y = MAT)) +
  geom_boxplot() + geom_point(position = 'jitter', alpha = .2, size = 2)
[Figures: MAT (y axis) by BIOME (x axis: Boreal, Grassland, Montane), first as violin plots with narrow boxplots inside, then as boxplots with jittered points]
The first command narrows the boxplots and overlays them on the violin plots; the second adds jittered points on top of the boxplots.
• Well done! So far we have analyzed only one continuous variable at a time. Can we visualize multiple variables in one plot?
Hopefully you still remember that in lab2B we applied the gather() function to transform a data frame from wide to long. In ggplot2, x or y must each be determined by a single variable. Therefore, we first need to gather the multiple variables of interest into one, and then use the new data table for plotting. For example, let us make a boxplot to check the distribution of mean annual temperature (MAT), mean warmest month temperature (MWMT), and mean coldest month temperature (MCMT) for the three BIOME types.
dat2 <- gather(dat1, key = 'temp', value = 'value', MAT, MCMT, MWMT)
head(dat2) # quick check of the new data table
##   STATION ECOSYS     BIOME MAP MSP DRYNESS temp value
## 1  300114   G-NF Grassland 443 287      17  MAT   2.4
## 2  301449  G-DMG Grassland 415 257       3  MAT   4.5
## 3  302343   G-MG Grassland 429 258      12  MAT   4.9
## 4  302369  G-DMG Grassland 405 254       1  MAT   5.1
## 5  302789   G-NF Grassland 431 291      14  MAT   2.8
## 6  304155   B-AP    Boreal 480 292      45  MAT  -0.9
ggplot(dat2, aes(x = temp, y = value, fill = BIOME)) +
  geom_boxplot() # using different colors for BIOME types
[Figure: boxplots of value (y axis) by temp (x axis: MAT, MCMT, MWMT), filled by BIOME (Boreal, Grassland, Montane)]
Looks nice! If the temperature variables were treatments, then differing responses among and within groups would be a strong clue that an interaction is present.
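The wide-to-long step used for this boxplot can also be written with pivot_longer(), which superseded gather() in tidyr 1.0.0. The sketch below assumes tidyr version 1.0.0 or later and uses a few columns of the ten rows printed by head(dat1, 10) as a stand-in; the column names temp and value simply mirror the gather() call above.

```r
library(tidyr)

# Stand-in data: selected columns of the ten rows printed by head(dat1, 10)
dat <- data.frame(
  STATION = c(300114, 301449, 302343, 302369, 302789,
              304155, 304642, 305076, 305773, 306733),
  BIOME = c("Grassland", "Grassland", "Grassland", "Grassland", "Grassland",
            "Boreal", "Boreal", "Montane", "Boreal", "Boreal"),
  MAT  = c(2.4, 4.5, 4.9, 5.1, 2.8, -0.9, -0.5, 2.8, 3.3, -3.5),
  MWMT = c(17.5, 18.8, 18.4, 18.3, 17.4, 17.4, 16.3, 15.1, 17.3, 16.5),
  MCMT = c(-24.9, -23.1, -23.3, -24.3, -22.2, -30.4, -29.0, -21.3, -24.5, -32.3)
)

# Stack the three temperature columns into a name column and a value column
long <- pivot_longer(dat,
                     cols = c(MAT, MCMT, MWMT),
                     names_to = "temp",
                     values_to = "value")

print(head(long))
```

The result holds the same information as the gather() output, but the rows are ordered station by station rather than variable by variable, which makes no difference to the boxplot.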
5.4. Multi-panel scatter plots in R
So far in this lab we have learned one-dimensional (histogram, density plot) and two-dimensional (scatter plot, boxplot) exploratory graphics, which can normally analyze only one variable or one pair of variables at a time. If you have a data table with 10 potential independent variables, plotting them one by one is not efficient. To get a general idea of the relationships among variables in a very short time:
# requires the ggpairs() function of the GGally package
# install.packages('GGally')
library(GGally)
ggpairs(dat1[, c('MAT', 'MAP', 'MSP', 'DRYNESS', 'BIOME')],
        aes(color = BIOME))
[Figure: plot matrix of MAT, MAP, MSP, DRYNESS, and BIOME, colored by BIOME (Boreal, Grassland, Montane); the panels above the diagonal report the overall correlation (Cor) and the correlation within each biome for every pair of climate variables]
Voilà. Now you can see the plot matrix for MAT, MAP, MSP, and DRYNESS together with BIOME, with different colors distinguishing the BIOME types.
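The correlations shown in the plot matrix can also be obtained as plain numbers with the base R function cor(). The sketch below uses the four climate variables from the ten rows printed by head(dat1, 10) as a stand-in, so the values it prints will differ from those computed on the full dataset.

```r
# Stand-in data: four climate variables from the ten rows printed by head(dat1, 10)
dat <- data.frame(
  MAT = c(2.4, 4.5, 4.9, 5.1, 2.8, -0.9, -0.5, 2.8, 3.3, -3.5),
  MAP = c(443, 415, 429, 405, 431, 480, 511, 550, 488, 431),
  MSP = c(287, 257, 258, 254, 291, 292, 330, 319, 309, 260),
  DRYNESS = c(17, 3, 12, 1, 14, 45, 52, 42, 22, 60)
)

# Pairwise Pearson correlations among all four variables, rounded for reading
cor_all <- round(cor(dat), 2)
print(cor_all)

# A single pair can be picked out by row and column name
print(cor_all["MAT", "DRYNESS"])
```

A correlation matrix is a compact companion to the plot matrix, but it only summarizes linear association, so it is still worth looking at the scatter plots for curvature and outliers.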