data visualization in r - rice universityjn13/slides/pdfs/day1amslides... · data visualization in...

40
Data Visualization in R May 15, 2017 Data Visualization in R May 15, 2017 1 / 40

Upload: others

Post on 24-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

Data Visualization in R

May 15, 2017

Data Visualization in R May 15, 2017 1 / 40

Page 2: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

Jumping In

Let’s get started right away by the loading ggplot2 package and readingin our dataset.

### Install packages if you don't have them yet

### Typical install:

# install.packages('gpplot2')

# install.packages('dplyr')

### Install personal copy (no admin rights)

# install.packages('gpplot2',lib="/path/to/myfolder")

# install.packages('dplyr',lib="/path/to/myfolder")

### Load packages

library(ggplot2)

library(dplyr)

# Load personal copy

# library(ggplot2,lib.loc="/path/to/myfolder")

# library(dplyr,lib.loc="/path/to/myfolder")

### Read In data

auto.data <- read.csv("AutoData.csv",

header = TRUE)

# tbl_df() isn't necessary here

# It helps to display the data more clearly

auto.data <- tbl_df(auto.data)

Data Visualization in R May 15, 2017 2 / 40

Page 3: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

Glimpse at the data

Run the following to get a quick glimpse of the data

# Find the dimensions

dim(auto.data)

# Look at the structure

str(auto.data)

# Examine the top

head(auto.data)

# Find out about a function

?str

Data Visualization in R May 15, 2017 3 / 40

Page 4: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

Data Exploration

When looking at a new data set, exploration is key.

What types of variables do we have?

What types of relationships do you expect to see between variables?

Does your intuition check out? If not, why not?

Do we observe anomalous behavior?

Data Visualization in R May 15, 2017 4 / 40

Page 5: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

Scatter Plots

One of the simpler plots we can make is a scatter plot between tocontinuous variables.

Try it out:

# qplot is convenient front end for the more powerful,

# but slightly more complicated ggplot() function.

qplot(curb.weight,price,data=auto.data)

Data Visualization in R May 15, 2017 5 / 40

Page 6: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

10000

20000

30000

40000

1500 2000 2500 3000 3500 4000

curb.weight

pric

e

Data Visualization in R May 15, 2017 6 / 40

Page 7: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

qplot is an easy to use front end to the main ggplot function.

It has several reasonable defaults for plotting data

As we just just saw, two continuous inputs gets us back a scatter plot.

Data Visualization in R May 15, 2017 7 / 40

Page 8: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

The true power of ggplot comes from its ability to easily visualizerelationships between many variables.

The main ingredients we’ll be using are:

1 aesthetics

2 facets

3 geoms

Data Visualization in R May 15, 2017 8 / 40

Page 9: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

Aesthetics

Aesthetics control many of the plot’s visual properties

Importantly these visual properties may be mapped directly to variables

Data Visualization in R May 15, 2017 9 / 40

Page 10: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

Aesthetics Example

Try the following

# map color to factor/categorical variable

qplot(curb.weight,

price,

data=auto.data,

color=num.of.cylinders)

# map color to continuous variable

qplot(curb.weight,

price,

data=auto.data,

color=bore)

Data Visualization in R May 15, 2017 10 / 40

Page 11: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

10000

20000

30000

40000

1500 2000 2500 3000 3500 4000

curb.weight

pric

e

num.of.cylinders

eight

five

four

six

three

twelve

10000

20000

30000

40000

1500 2000 2500 3000 3500 4000

curb.weight

pric

e

3.0

3.5

bore

Data Visualization in R May 15, 2017 11 / 40

Page 12: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

There are many other aesthetics besides color. Some we’ll encounter are:

1 color

2 size

3 shape

4 fill

Not all aesthetics work with both categorical and continuous variables (likecolor did)

Also only a certain subset of aesthetics will be available for each plot type(geom)

Data Visualization in R May 15, 2017 12 / 40

Page 13: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

Try It Out

See how the following aesthetics behave with the scatter plot. Feel free tochange the variables in the scatter plot

qplot(curb.weight,

price,

data=auto.data,

size=horsepower)

qplot(curb.weight,

price,

data=auto.data,

shape=drive.wheels)

Data Visualization in R May 15, 2017 13 / 40

Page 14: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

Facets

Facets represent another way of visualizing the effect of factor/categoricalvariables

Facets enable us to get a separate plot for each level/category

Data Visualization in R May 15, 2017 14 / 40

Page 15: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

Facet Example

Let’s try out a faceting example

qplot(curb.weight,

price,

data=auto.data) + facet_wrap(~aspiration)

Data Visualization in R May 15, 2017 15 / 40

Page 16: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

std turbo

1500 2000 2500 3000 3500 4000 1500 2000 2500 3000 3500 4000

10000

20000

30000

40000

curb.weight

pric

e

Data Visualization in R May 15, 2017 16 / 40

Page 17: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

Note facet_wrap gives a separate plot for each category

Also note how we incorporated the behavior of facet_wrap: via the +

operator

This is one of the main strengths of ggplot: plots are built up in intuitivelayers

Data Visualization in R May 15, 2017 17 / 40

Page 18: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

Also available is facet_grid for examining the interaction between twocategorical variables:

qplot(curb.weight,

price,

data=auto.data) +

facet_grid(drive.wheels~num.of.doors)

Data Visualization in R May 15, 2017 18 / 40

Page 19: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

four two

4wd

fwd

rwd

1500 2000 2500 3000 3500 4000 1500 2000 2500 3000 3500 4000

10000

20000

30000

40000

10000

20000

30000

40000

10000

20000

30000

40000

curb.weight

pric

e

Data Visualization in R May 15, 2017 19 / 40

Page 20: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

Try out the following:

qplot(curb.weight,

price,

data=auto.data) + facet_grid(.~drive.wheels)

qplot(curb.weight,

price,

data=auto.data) + facet_grid(drive.wheels~.)

qplot(curb.weight,

price,

data=auto.data,

color=num.of.doors) + facet_grid(drive.wheels~.)

Data Visualization in R May 15, 2017 20 / 40

Page 21: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

geoms

The final way we’ll look at to control ggplots is via geoms

The geom controls the type of plot which is displayed.

We’ve already looked at one: geom_point.

We could rewrite our scatter plot code more explicitly as:

qplot(curb.weight,price,data=auto.data,geom='point')

Data Visualization in R May 15, 2017 21 / 40

Page 22: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

Let’s check out another geom: geom_histogram

# geom_histogram operates with a single continuous variable.

# Let's look at price

qplot(price,

data=auto.data,

geom='histogram')

# or via qplot's defaults

qplot(price,data=auto.data)

Data Visualization in R May 15, 2017 22 / 40

Page 23: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

0

10

20

30

10000 20000 30000 40000

price

coun

t

Data Visualization in R May 15, 2017 23 / 40

Page 24: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

Note the warning concerning binwidth

The binwidth chosen can dramatically impact how we visually interpret thedistribution

It’s best to experiment with values to get a feel for the data

We can alter the binwidth by passing the option to qplot

qplot(price,

data=auto.data,

geom='histogram',

binwidth=20000)

Data Visualization in R May 15, 2017 24 / 40

Page 25: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

0

25

50

75

0 20000 40000

price

coun

t

This tells a very different story than the original!

Data Visualization in R May 15, 2017 25 / 40

Page 26: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

Note our price distribution is a bit skewed

Perhaps we are not interested in higher priced (≥ 20, 000 say) cars

We can limit our plot cars with lower price by setting limits

qplot(price,

data=auto.data,

geom='histogram',

binwidth=450) +

xlim(4000,20000)

Data Visualization in R May 15, 2017 26 / 40

Page 27: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

0

5

10

5000 10000 15000 20000

price

coun

t

Data Visualization in R May 15, 2017 27 / 40

Page 28: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

Just like our point geom, histogram too has aesthetics. Try thefollowing

qplot(price,

data=auto.data,

color=drive.wheels)

qplot(price,

data=auto.data,

fill=drive.wheels)

Data Visualization in R May 15, 2017 28 / 40

Page 29: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

0

10

20

30

10000 20000 30000 40000

price

coun

t

drive.wheels

4wd

fwd

rwd

0

10

20

30

10000 20000 30000 40000

priceco

unt

drive.wheels

4wd

fwd

rwd

Which one do like the best? Do you like either? How might we make itbetter?

Data Visualization in R May 15, 2017 29 / 40

Page 30: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

The colors help but the figure is a bit busy.

We can try faceting instead:

qplot(price,

data=auto.data) +

facet_wrap(~drive.wheels)

Data Visualization in R May 15, 2017 30 / 40

Page 31: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

4wd fwd rwd

10000 20000 30000 40000 10000 20000 30000 40000 10000 20000 30000 40000

0

10

20

30

price

coun

t

Data Visualization in R May 15, 2017 31 / 40

Page 32: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

This helps us separate out the categorical variables much easier.

Note the counts vary quite a bit among the different classes, but yet thecount axis is the same for all. We can change this by modifying thefacet_wrap call:

qplot(price,

data=auto.data) +

facet_wrap(~drive.wheels,

scales = 'free_y')

Data Visualization in R May 15, 2017 32 / 40

Page 33: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

4wd fwd rwd

10000 20000 30000 40000 10000 20000 30000 40000 10000 20000 30000 40000

0

5

10

15

0

10

20

30

0

1

2

3

price

coun

t

Data Visualization in R May 15, 2017 33 / 40

Page 34: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

rwd

fwd

4wd

10000 20000 30000 40000

0

1

2

3

0

10

20

30

0

5

10

15

price

coun

t

See ?facet_wrap for more useful options. For example nrow=3 in theabove.

Data Visualization in R May 15, 2017 34 / 40

Page 35: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

Try it out

Take some time to explore the data using various geoms, aesthetics, andfacets.

How are the other variable related to price?

Are some of the relationships stronger than others?

Data Visualization in R May 15, 2017 35 / 40

Page 36: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

More geoms

There are many other geoms besides point and histogram. Try ??geom

to see a list.

Different geoms operate with different (combinations of) data types (i.e.categorical or continuous).

As is characteristic of ggplot, geoms can be layered to create plots ofincreasing detail/complexity.

Data Visualization in R May 15, 2017 36 / 40

Page 37: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

Try out the following:

qplot(price,data=auto.data,

geom='density')

qplot(price,

..density.., # don't use counts

data=auto.data,

geom='histogram') +

geom_density()

qplot(height,price,

data=auto.data,

geom='density2d')

qplot(height,price,

data=auto.data)+

geom_density2d()

Data Visualization in R May 15, 2017 37 / 40

Page 38: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

Can you guess the geom for creating a boxplot?

Create a boxplot displaying price for each of the drive.wheels

categories

Data Visualization in R May 15, 2017 38 / 40

Page 39: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

qplot(drive.wheels,

price,

data=auto.data,

geom='boxplot')

Data Visualization in R May 15, 2017 39 / 40

Page 40: Data Visualization in R - Rice Universityjn13/slides/pdfs/Day1AMSlides... · Data Visualization in R May 15, 2017 24 / 40. 0 25 50 75 0 20000 40000 price count This tells a very di

References and Additional Info

ggplot2 documentation: http://docs.ggplot2.org/current/

Hadley’s ggplot2 book: http://ggplot2.org/book/

RStudio ggplot cheatsheet: http://www.rstudio.com/wp-content/

uploads/2015/03/ggplot2-cheatsheet.png

Data Visualization in R May 15, 2017 40 / 40