
Lab5A - Intro to GGPLOT2

Z. Sang

Sept 24, 2018

In this lab you will learn to visualize raw data by plotting exploratory graphics with the ggplot2 package. Unlike final graphs for a publication or thesis, exploratory graphics are usually made very quickly in the process of checking for errors, outliers, distributions, and correlations of variables. The goal of making these graphs is usually to develop a personal understanding of the data and to prioritize tasks for follow-up analysis.

Grammar of ggplot2

ggplot2, another important package of the tidyverse, is designed for visualizing data frames. The ‘gg’ in the name stands for ‘grammar of graphics’, and ggplot2 is recognized as one of the three main graphics systems of R.

The most important thing to get used to with ggplot2 is the logical structure of plots. The code you write specifies the connections between the variables in your data and the x and y locations, colors, sizes, shapes, etc. that you can see on the screen. In ggplot2, these logical connections between your data and the plot elements are called aesthetic mappings or just aesthetics.

You begin every plot by telling the ggplot() function what your data is, and then how the variables in this data logically map onto the plot’s aesthetics. Then you take the result and say what general sort of plot you want, such as a scatterplot, a boxplot, or a bar chart. In ggplot2, the overall type of plot is called a geom. Each geom has a function that creates it, and the function’s name follows the pattern “geom_...()”. For example, geom_point() makes scatterplots, geom_bar() makes bar plots, geom_boxplot() makes boxplots, and so on. You combine these two pieces, the ggplot(data, mapping) object and the geom_...() call, by literally adding them together in an expression, using the + symbol.

Data, mapping (or aesthetics), and geometry (geom) are the three mandatory components of a ggplot2 plot. As with other functions, the output of ggplot2 can be assigned to an object for further editing. Other optional ggplot2 grammar components will be introduced in Lab5 for figure customization.
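As a minimal sketch of the three mandatory components, the snippet below builds a scatterplot from R's built-in iris data frame. iris is used here only so that the example runs without the course dataset, and the object names p_base, p_points, and p_colored are illustrative.

```r
library(ggplot2)

# 1. data + 2. aesthetic mapping: no geom yet, so this draws only an empty panel
p_base <- ggplot(data = iris, mapping = aes(x = Sepal.Length, y = Petal.Length))

# 3. geom: adding a layer with `+` turns the empty panel into a scatterplot
p_points <- p_base + geom_point()

# The result is an ordinary R object: it can be stored, extended, and printed later
p_colored <- p_points + aes(color = Species)
print(p_colored)
```

Because each step returns a plot object, you can keep an intermediate version such as p_points and try out different additions without retyping the data and mapping.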


A little too complex? Don’t worry; you will get familiar with the grammar system very soon. In this lab, we will use this ggplot2 syntax to plot the following exploratory graphics: histogram (density plot), boxplot, scatterplot, and scatterplot matrix.

Data preparation

• For this exercise, use the weather station dataset “AB_Stations.csv” that you can download from the course website. The first three columns specify the weather station ID, as well as the ecosystem and the biome of Alberta in which the weather station is located. These are followed by a number of climate variables that you can use for exploration (MAT = mean annual temperature, MWMT = mean warmest month temperature, MCMT = mean coldest month temperature, MAP = mean annual precipitation, MSP = mean summer precipitation, DRYNESS = an index).

• Load required packages.

#install.packages('tidyverse') # run once if the tidyverse package is not installed
library(tidyverse)

• Import the dataset with the code below, and use the head(), tail(), str(), or View() functions to check the imported data table.

dat1 <- read.csv("E:\\lab3\\AB_Stations.csv")
head(dat1, 10)
##    STATION ECOSYS     BIOME  MAT MWMT  MCMT MAP MSP DRYNESS
## 1   300114   G-NF Grassland  2.4 17.5 -24.9 443 287      17
## 2   301449  G-DMG Grassland  4.5 18.8 -23.1 415 257       3
## 3   302343   G-MG Grassland  4.9 18.4 -23.3 429 258      12
## 4   302369  G-DMG Grassland  5.1 18.3 -24.3 405 254       1
## 5   302789   G-NF Grassland  2.8 17.4 -22.2 431 291      14
## 6   304155   B-AP    Boreal -0.9 17.4 -30.4 480 292      45
## 7   304642  B-UBH    Boreal -0.5 16.3 -29.0 511 330      52
## 8   305076    M-M   Montane  2.8 15.1 -21.3 550 319      42
## 9   305773   B-CP    Boreal  3.3 17.3 -24.5 488 309      22
## 10  306733   B-KU    Boreal -3.5 16.5 -32.3 431 260      60
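If the course file is not at hand, the inspection functions can still be tried on a stand-in. The sketch below rebuilds the first five rows printed above as a data frame, writes it to a temporary CSV file, reads it back, and inspects the result. The names toy_stations, csv_path, and toy_in are hypothetical, and the temporary path is not the course path.

```r
# Stand-in data: the first five rows of the head() output above
toy_stations <- data.frame(
  STATION = c(300114, 301449, 302343, 302369, 302789),
  ECOSYS  = c("G-NF", "G-DMG", "G-MG", "G-DMG", "G-NF"),
  BIOME   = "Grassland",
  MAT     = c(2.4, 4.5, 4.9, 5.1, 2.8),
  MWMT    = c(17.5, 18.8, 18.4, 18.3, 17.4),
  MCMT    = c(-24.9, -23.1, -23.3, -24.3, -22.2),
  MAP     = c(443, 415, 429, 405, 431),
  MSP     = c(287, 257, 258, 254, 291),
  DRYNESS = c(17, 3, 12, 1, 14)
)

# Round trip through a CSV file, as you would with the real dataset
csv_path <- tempfile(fileext = ".csv")
write.csv(toy_stations, csv_path, row.names = FALSE)
toy_in <- read.csv(csv_path)

str(toy_in)             # column types: were numbers read as numbers?
summary(toy_in$MAT)     # value range: any implausible temperatures?
colSums(is.na(toy_in))  # missing values per column
```

Checks like these catch the most common import problems (wrong separators, numbers read as text, unexpected missing values) before any plotting starts.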

5.1. Histograms

One useful plot type for exploring raw data is the histogram. Histograms are commonly used to visually check the distribution of continuous variables. The geom for a histogram is geom_histogram(). By default, the y axis of a ggplot2 histogram counts the number of observations in each bin, but y can also be set to density.

• Following the ggplot2 syntax, we can execute the command below to get a histogram for a variable, in this case the variable “DRYNESS”:

hist_a <- ggplot(dat1, aes(x = DRYNESS)) + geom_histogram(color = 'gray90')
hist_a
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

[Figure: histogram of DRYNESS with the default 30 bins; x axis: DRYNESS, y axis: count]

ggplot2 chooses the bin width by default when generating histograms, but chances are that this default is not the most appropriate one for the histogram you want to make. It is therefore critical to vary the bins and verify that the resulting histogram reflects the data accurately. Too many bins make a histogram overly peaky and obscure the overall shape of the distribution, while too few bins hide its details. There are two ways to change the bins:

1. First method: set the number of bins you want for the histogram:

ggplot(dat1, aes(x = DRYNESS)) + geom_histogram(bins = 5, color = 'gray90')  # 5 bins
ggplot(dat1, aes(x = DRYNESS)) + geom_histogram(bins = 20, color = 'gray90') # 20 bins
ggplot(dat1, aes(x = DRYNESS)) + geom_histogram(bins = 50, color = 'gray90') # 50 bins

[Figure: three histograms of DRYNESS with 5, 20, and 50 bins; x axis: DRYNESS, y axis: count]

2. Second method: set the width of the bins:

ggplot(dat1, aes(x = DRYNESS)) + geom_histogram(binwidth = 1, color = 'gray90')
ggplot(dat1, aes(x = DRYNESS)) + geom_histogram(binwidth = 5, color = 'gray90')
ggplot(dat1, aes(x = DRYNESS)) + geom_histogram(binwidth = 10, color = 'gray90')

[Figure: three histograms of DRYNESS with bin widths of 1, 5, and 10; x axis: DRYNESS, y axis: count]
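The two ways of controlling bins can also be compared numerically. The sketch below uses R's built-in faithful dataset as a stand-in for the course data and layer_data() from ggplot2 to look at the bins that were actually computed. The object names are illustrative.

```r
library(ggplot2)

# faithful$waiting: waiting time between eruptions, a built-in numeric variable
p_bins  <- ggplot(faithful, aes(x = waiting)) + geom_histogram(bins = 10)
p_width <- ggplot(faithful, aes(x = waiting)) + geom_histogram(binwidth = 5)

# layer_data() returns the values ggplot2 computed for a layer, one row per bin
bins_tbl  <- layer_data(p_bins)
width_tbl <- layer_data(p_width)

# With `bins`, the bin width is derived from the data range;
# with `binwidth`, the number of bins is derived instead
unique(round(bins_tbl$xmax - bins_tbl$xmin, 3))  # derived bin width
nrow(width_tbl)                                  # derived number of bins

# Either way, every observation lands in exactly one bin
sum(bins_tbl$count) == nrow(faithful)
sum(width_tbl$count) == nrow(faithful)
```

Setting binwidth is often easier to interpret because the width is in the units of the variable, while bins is convenient when you only care about the level of detail.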

• Histograms are also great for visually checking the effectiveness of data transformations. In this case, the square-root transformation achieves an approximately normal distribution.

hist_b <- ggplot(dat1, aes(x = sqrt(DRYNESS))) + geom_histogram(color = 'gray90')
hist_b
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

[Figure: histogram of sqrt(DRYNESS); x axis: sqrt(DRYNESS), y axis: count]
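A visual check of a transformation can be backed up with a quick number. The sketch below uses simulated right-skewed values rather than the DRYNESS variable, and a hand-written helper named skewness() (not a base R function) that computes a simple moment-based sample skewness.

```r
set.seed(42)                  # make the simulation reproducible
x <- rexp(1000, rate = 0.05)  # right-skewed values, all positive

# Simple moment-based sample skewness (close to 0 for a symmetric distribution)
skewness <- function(v) {
  z <- (v - mean(v)) / sd(v)
  mean(z^3)
}

skewness(x)        # clearly positive: long right tail
skewness(sqrt(x))  # closer to 0 after the square-root transformation
```

A skewness closer to zero after transforming agrees with what the histogram shows, but it does not replace looking at the plot, since very different shapes can share the same skewness.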

• You can also fill the bins with colors by group/class. In many scenarios we have multiple distributions that we would like to visualize simultaneously. For example, do the biomes have similar dryness conditions? One commonly employed visualization strategy is to stack the bars on top of each other and fill the histogram with a different color for each group:

hist_c <- ggplot(dat1, aes(x = DRYNESS, fill = BIOME)) +
  geom_histogram()

• Although counts are used for the y axis by default, you can change the y axis to density. When the sample sizes of the groups/classes are uneven, density histograms may show patterns that differ from the frequency histograms.

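hist_d <- ggplot(dat1, aes(x = DRYNESS, fill = BIOME)) +
  geom_histogram(aes(y = ..density..)) # specify y as density
# Note: ggplot2 3.4.0 and later deprecate the ..density.. notation in favor of after_stat(density)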

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

[Figure: two stacked histograms of DRYNESS filled by BIOME (Boreal, Grassland, Montane). Panel hist_c: y axis is count. Panel hist_d: y axis is density.]
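To see what the density scale means, the sketch below checks that the bar areas of a density histogram sum to 1 within each group. It uses the built-in iris data as a stand-in and the after_stat() notation, which assumes ggplot2 3.3.0 or later. The object names are illustrative.

```r
library(ggplot2)

p_dens <- ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_histogram(aes(y = after_stat(density)), bins = 15)

# One row per bin and group, including the computed density
d <- layer_data(p_dens)

# Bar area = density * bin width; summed within each group it should equal 1
area_by_group <- tapply(d$density * (d$xmax - d$xmin), d$group, sum)
area_by_group
```

Because each group is rescaled to a total area of 1, a small group can look as prominent as a large one on the density scale, which is why the density and count versions of a histogram can tell different stories.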

• The biggest disadvantage of the stacked histogram is that it is hard to quantify each group. For example, how many Boreal samples have DRYNESS values around 40? About 15, or about 8? It is also hard to compare the distributions among groups. One way to solve this is to change the position of the bars. A common choice is dodging, which preserves the vertical position of a geom while adjusting its horizontal position.

hist_e <- ggplot(dat1, aes(x = DRYNESS, fill = BIOME)) +
  geom_histogram(position = 'dodge') # change bin positions
hist_e
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

[Figure: dodged histogram of DRYNESS filled by BIOME (Boreal, Grassland, Montane); x axis: DRYNESS, y axis: count]

• Histograms have been a popular visualization option since at least the 18th century, in part because they are easily generated by hand. More recently, as extensive computing power has become available in everyday devices such as laptops and cell phones, we see them increasingly being replaced by density plots. In a density plot, we attempt to visualize the underlying probability distribution of the data by drawing an appropriate continuous curve:

hist_f <- ggplot(dat1, aes(x = DRYNESS, fill = BIOME)) +
  geom_density(alpha = 0.4) # introduce transparency
hist_f

[Figure: overlapping density curves of DRYNESS filled by BIOME (Boreal, Grassland, Montane); x axis: DRYNESS, y axis: density]

As before, we fill the density curves with a different color for each group. The alpha argument sets the transparency of the color, with values ranging from 0 (fully transparent) to 1 (fully opaque). Also, try adding multiple geoms:

hist_f + geom_histogram(aes(y = ..density..), alpha = 0.6, position = 'dodge')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

[Figure: density curves of DRYNESS overlaid with dodged density histograms, filled by BIOME (Boreal, Grassland, Montane); x axis: DRYNESS, y axis: density]

5.2. Scatter plots

With scatter plots you can visually check the relationships among variables. Are they linear or curvilinear? Outliers are also easily visible.

• Now, try using a scatter plot to visually check the relationships among variables and to identify outliers. To check the relationship between Mean Annual Precipitation (MAP) and Mean Summer Precipitation (MSP), we can put MAP on the x axis and MSP on the y axis (normally the y axis is for the dependent variable and the x axis for the independent variable, but in this case it is fine to swap the axes). The geom function for a scatter plot is geom_point().

ggplot(dat1, aes(x = MAP, y = MSP)) + geom_point()

[Figure: scatter plot of MSP against MAP; x axis: MAP, y axis: MSP]

• Cool! There seems to be a positive relationship between these two variables. However, overlapping points can make the plot harder to interpret. One easy remedy is to make the points partly transparent.

plt <- ggplot(dat1, aes(x = MAP, y = MSP)) + geom_point(alpha = .3)
plt

[Figure: scatter plot of MSP against MAP with semi-transparent points; x axis: MAP, y axis: MSP]

• Besides changing the transparency, shifting the point positions by adding a little random noise (counterintuitive as that may sound) can help you see each point.

plt_jittered <- ggplot(dat1, aes(x = MAP, y = MSP)) +
  geom_point(position = "jitter")
plt_jittered

[Figure: jittered scatter plot of MSP against MAP; x axis: MAP, y axis: MSP]
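Jittering with position = "jitter" uses a default amount of noise. As a sketch of how to control it, the example below uses position_jitter() with an explicit width, height, and seed on the built-in mpg dataset from ggplot2, which has many overlapping points. The amounts chosen here are arbitrary and the object names are illustrative.

```r
library(ggplot2)

# position_jitter() lets you choose the amount of noise and fix the random seed
jit <- position_jitter(width = 0.3, height = 0.3, seed = 1)

p_jit <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(position = jit, alpha = 0.4)

# The plotted coordinates differ from the raw data, but only by a small shift
shift <- layer_data(p_jit)$x - mpg$cty
range(shift)
```

Fixing the seed makes the plot reproducible, and keeping the jitter small relative to the spacing of the data avoids suggesting a precision or pattern that is not in the measurements.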

• You can also add labels to your plot with the geom function geom_text(). In this case, we want to label the points with their STATION ID. hjust and vjust are used to control the placement of the labels.

plt_label <- ggplot(dat1, aes(x = MAP, y = MSP, label = STATION)) +
  geom_point() + geom_text(hjust = 0, vjust = 0, size = 2.2, color = 'gray40')
plt_label

[Figure: scatter plot of MSP against MAP with each point labelled by its STATION ID; x axis: MAP, y axis: MSP]

Can you identify the STATION IDs of the two outliers near the lower right corner of the plot?
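When every point is labelled, the labels quickly overlap. One option is to label only the points of interest by giving geom_text() its own, smaller data frame. The sketch below shows the idea on the built-in mtcars data. The column name model, the threshold of 30, and the object names are choices made for this illustration only.

```r
library(ggplot2)

cars <- mtcars
cars$model <- rownames(mtcars)  # move row names into a column to use as labels

# Flag only the points of interest (here: unusually high fuel efficiency)
flagged <- subset(cars, mpg > 30)

p_lab <- ggplot(cars, aes(x = wt, y = mpg)) +
  geom_point() +                                # all points
  geom_text(data = flagged, aes(label = model), # labels for flagged points only
            hjust = 0, vjust = 0, size = 3, color = "gray40")
print(p_lab)
```

The same pattern applies to any condition you can express with subset(), for example a range of x and y values that encloses suspected outliers.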

• However, do all BIOME types follow the same relationship between MAP and MSP? To find out, we need to add a visual aid that separates these types (e.g., color or shape).

plt_biome <- ggplot(dat1, aes(x = MAP, y = MSP, color = BIOME,
                              shape = BIOME)) + geom_point()
plt_biome

[Figure: scatter plot of MSP against MAP with point color and shape by BIOME (Boreal, Grassland, Montane); x axis: MAP, y axis: MSP]

What about a density plot in two dimensions? Try:

plt + geom_density2d()

[Figure: scatter plot of MSP against MAP with 2-D density contours; x axis: MAP, y axis: MSP]

5.3. Box plots

• Just like scatter plots, boxplots are a good way to visually check the relationship between two variables. If one variable is continuous (as y) and the other is categorical (as x), then a boxplot is a good option, for instance to understand the general distribution of mean annual temperature (MAT) in each biome type (BIOME). The geom for a boxplot is geom_boxplot().

ggplot(dat1, aes(x = BIOME, y = MAT)) + geom_boxplot()

[Figure: boxplots of MAT for each BIOME (Boreal, Grassland, Montane); x axis: BIOME, y axis: MAT]

If you still have time, you can add the following arguments within the parentheses of geom_boxplot() and see what they do: varwidth = TRUE, notch = TRUE

ggplot(dat1, aes(x = BIOME, y = MAT)) + geom_boxplot(varwidth = TRUE) # box width reflects the sample size
ggplot(dat1, aes(x = BIOME, y = MAT)) + geom_boxplot(notch = TRUE) # add notches to the boxes
## notch went outside hinges. Try setting notch=FALSE.

[Figure: two boxplots of MAT by BIOME, one with variable box widths and one with notches; x axis: BIOME, y axis: MAT]

Similarly, we can make a boxplot of mean annual temperature (MAT) for each ecosystem (ECOSYS).

ggplot(dat1, aes(x = ECOSYS, y = MAT)) + geom_boxplot()

[Figure: boxplots of MAT for each ECOSYS, with overlapping ecosystem labels on the x axis; x axis: ECOSYS, y axis: MAT]

Since the ecosystem names take up space and can easily overlap, we prefer to put ECOSYS on the y axis and draw horizontal boxplots:

ggplot(dat1, aes(x = ECOSYS, y = MAT)) +
  geom_boxplot() + coord_flip() # horizontal: flip the x and y axes

# Great! Now you can color the boxplots based on their BIOME types:
ggplot(dat1, aes(x = ECOSYS, y = MAT, fill = BIOME)) +
  geom_boxplot(varwidth = TRUE) + coord_flip() # colored by BIOME groups

[Figure: horizontal boxplots of MAT for each ECOSYS; x axis: MAT, y axis: ECOSYS. Panel a: uncolored boxes. Panel b: boxes with variable widths, filled by BIOME (Boreal, Grassland, Montane).]

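• Boxplots are generally useful, but they focus on only five summary numbers of the sample (minimum, maximum, and the 25th, 50th, and 75th percentiles). Note that in geom_boxplot() the whiskers extend at most 1.5 times the interquartile range beyond the box, and more extreme points are drawn individually. To show more detail about the distribution, we can add jittered points or use a violin plot as an alternative.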

ggplot(dat1, aes(x = BIOME, y = MAT)) +
  geom_violin() + geom_boxplot(width = .1)

ggplot(dat1, aes(x = BIOME, y = MAT)) +
  geom_boxplot() + geom_point(position = 'jitter', alpha = .2, size = 2)

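[Figure: two plots of MAT by BIOME, one with violin plots containing narrow boxplots and one with boxplots overlaid with jittered points; x axis: BIOME, y axis: MAT]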

The first command narrows the boxplots and draws them inside the violin plots, and the second one overlays the raw data points on the boxplots.
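The summary numbers behind a boxplot can be read back from the plot object. The sketch below, using the built-in iris data as a stand-in, extracts them with layer_data() and compares the middle line with the group medians computed directly. The object names are illustrative.

```r
library(ggplot2)

p_box <- ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot()

# Columns ymin, lower, middle, upper, ymax hold the whisker ends, box edges, and median
box_stats <- layer_data(p_box)[, c("ymin", "lower", "middle", "upper", "ymax")]
box_stats

# The middle line of each box is the group median
medians <- as.numeric(tapply(iris$Sepal.Length, iris$Species, median))
all.equal(box_stats$middle, medians)
```

Looking at these numbers next to the plot makes it clear how much of the raw data a boxplot leaves out, which is the motivation for adding points or violins.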

• Well done! So far we have analyzed just one continuous variable at a time. Can we visualize multiple variables in one plot?

Hopefully you still remember that in lab2B we applied the gather() function to transform a data frame from wide to long. In ggplot2, x and y can each be mapped to only a single variable. Therefore, we first need to gather the variables of interest into one column, and then use the new data table for plotting. For example, let's make a boxplot to check the distributions of mean annual temperature (MAT), mean warmest month temperature (MWMT), and mean coldest month temperature (MCMT) for the three BIOME types.

dat2 <- gather(dat1, key = 'temp', value = 'value', MAT, MCMT, MWMT)

head(dat2) # quick check of the new data table
##   STATION ECOSYS     BIOME MAP MSP DRYNESS temp value
## 1  300114   G-NF Grassland 443 287      17  MAT   2.4
## 2  301449  G-DMG Grassland 415 257       3  MAT   4.5
## 3  302343   G-MG Grassland 429 258      12  MAT   4.9
## 4  302369  G-DMG Grassland 405 254       1  MAT   5.1
## 5  302789   G-NF Grassland 431 291      14  MAT   2.8
## 6  304155   B-AP    Boreal 480 292      45  MAT  -0.9

ggplot(dat2, aes(x = temp, y = value, fill = BIOME)) +
  geom_boxplot() # using different colors for BIOME types

[Figure: boxplots of the temperature variables MAT, MCMT, and MWMT, filled by BIOME (Boreal, Grassland, Montane); x axis: temp, y axis: value]

Looks nice! If the temperature variables were treatments instead, differing responses among and within groups would be a strong hint of an interaction.
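The wide-to-long step used above can also be written with pivot_longer(), which tidyr (version 1.0.0 and later) recommends over gather() for new code. The sketch below applies it to a three-row stand-in built from values in the head() output shown earlier. The names toy and toy_long are hypothetical.

```r
library(tidyr)

# Stand-in for dat1: three rows taken from the head() output shown earlier
toy <- data.frame(
  STATION = c(300114, 304155, 305076),
  BIOME   = c("Grassland", "Boreal", "Montane"),
  MAT     = c(2.4, -0.9, 2.8),
  MWMT    = c(17.5, 17.4, 15.1),
  MCMT    = c(-24.9, -30.4, -21.3)
)

# Stack the three temperature columns into a name column and a value column
toy_long <- pivot_longer(toy, cols = c(MAT, MCMT, MWMT),
                         names_to = "temp", values_to = "value")
toy_long
```

The result holds the same information as the gather() output, although the rows come out ordered station by station rather than variable by variable, which makes no difference for plotting.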

5.4. Multi-panel scatter plots in R

So far in this lab we have learned 1-dimensional (histogram, density plot) and 2-dimensional (scatter plot, boxplot) exploratory graphics, which can normally analyze only one variable or one pair of variables at a time. If you have a data table with 10 potential independent variables, plotting them one by one is not efficient. To get a general idea of the relationships among variables in a very short time:

# requires the ggpairs() function of the GGally package
#install.packages('GGally')
library(GGally)
ggpairs(dat1[, c('MAT', 'MAP', 'MSP', 'DRYNESS', 'BIOME')],
        aes(color = BIOME))

[Figure: ggpairs plot matrix of MAT, MAP, MSP, DRYNESS, and BIOME, colored by BIOME (Boreal, Grassland, Montane); the upper panels report the overall correlation (Cor) and the per-biome correlations for each pair of continuous variables]

Voilà. Now you can see the plot matrix of MAT, MAP, MSP, and DRYNESS together with BIOME, with different colors distinguishing the BIOME types.
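The correlations shown in a plot matrix can be cross-checked with base R. The sketch below computes an overall correlation matrix and one matrix per group for the built-in iris data, used here as a stand-in for the climate variables and BIOME. The object names are illustrative.

```r
# Numeric columns only: cor() does not accept the grouping factor
num_vars <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]

# Overall correlation matrix
cor_all <- cor(num_vars)
round(cor_all, 2)

# The same matrix computed separately within each group
by_group <- lapply(split(num_vars, iris$Species), function(g) round(cor(g), 2))
by_group$setosa
```

Comparing the overall matrix with the per-group matrices shows the same thing the colored plot matrix does: a relationship across all data can be weaker, stronger, or even reversed within individual groups.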
