visualization and exploration - carleton...

47
Visualization and Exploration January 25, 2016

Upload: others

Post on 20-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Visualization and Exploration

January 25, 2016

Page 2: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Visualization

Reasons to use visualization:I to find problems with the dataI to explore dependencies and featuresI to present results

General guidelines:I Display as much information as possible with least amount

of effort required from the viewer to get it.I Clarity is paramount - make the data stand out

I avoid overusing colours, shapes, patternsI avoid distracting elements that don’t add value, e.g. grid

lines, background coloursI use the right aspect ratio

I Visualization is an iterative process

Page 3: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Visualization in R

Standard graphing capabilities in R are the graphicspackage.

Package lattice improves by adding easy display ofmultivariate and conditional relationships. Implementation ofthe trellis project:http://ect.bell-labs.com/sl/project/trellis/

See chapters 3 and 4 in "Using R for Data Analysis andGraphics" for introduction and examples:http://cran.r-project.org/doc/contrib/usingR.pdf

Also try:demo(graphics)demo(lattice)

Page 4: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Visualization in R

Package ggplot2 is the new kid on the block. Implements theThe Grammar of Graphics by Leland Wilkinson:https://www.springer.com/statistics/computational+

statistics/book/978-0-387-24544-7

Documentation at http://docs.ggplot2.org/current/

I In ggplot2 graphs are defined on data frames.I Graphs are produced by adding layers and

transformations.I Data are displayed using aesthetics, such as position,

colour, size, shape

Page 5: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Package ggplot2

Some graph elements in ggplot2:I geom: geometric objects define the type of plotI stat: statistical transformationsI facet: displays subsets of the data in different panels

allowing for visualization of conditional relationships.

Use ggplot function to create graph object and add layerswith the + operator.

Use qplot function for a simplified interface to ggplot2.

Page 6: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Example

qplot(mpg, disp, data=mtcars, colour=factor(cyl),main="Engine displacement vs MPG", xlab="MPG",ylab="Engine displacement (cb.in)")

●●

●●

●●

● ●●

●●

●● ●

●●

100

200

300

400

10 15 20 25 30 35MPG

Eng

ine

disp

lace

men

t (cb

.in)

factor(cyl)

4

6

8

Engine displacement vs MPG

Page 7: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Example

qplot(mpg, disp, data=mtcars, colour=factor(cyl),main="Engine displacement vs MPG", xlab="MPG",ylab="Engine displacement (cb.in)") +

theme_bw() + labs(colour="Cylinders")

●●

●●

●●

● ●●

●●

●● ●

●●

100

200

300

400

10 15 20 25 30 35MPG

Eng

ine

disp

lace

men

t (cb

.in)

Cylinders

4

6

8

Engine displacement vs MPG

Page 8: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Dataset for examples

A copy of the file is available on the course webpage.custdata <- read.table("custdata.tsv", header=T, sep="\t")

The business objective is to predict whether your customer hashealth insurance. This synthetic dataset contains customersinformation for ones whose health insurance status is known.

Page 9: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Spot problems

Missing values

> dim(custdata)[1] 1000 11> mv <- colSums(is.na(custdata))> cbind(mv) % cbind to display as column

mvcustid 0sex 0is.employed 328income 0marital.stat 0health.ins 0housing.type 56recent.move 56num.vehicles 56age 0state.of.res 0

Page 10: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Spot problems

Values out of range

> summary(custdata$income)Min. 1st Qu. Median Mean 3rd Qu. Max.

-8700 14600 35000 53500 67000 615000>> summary(custdata$age)

Min. 1st Qu. Median Mean 3rd Qu. Max.0.0 38.0 50.0 51.7 64.0 146.7

Page 11: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Values out of range

> qplot(age, data=custdata)stat_bin: binwidth defaulted to range/30. Use’binwidth = x’ to adjust this.

0

30

60

90

0 50 100 150age

coun

t

Page 12: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Values out of range

qplot(age, data=custdata, binwidth=5)

0

25

50

75

100

0 50 100 150age

coun

t

Page 13: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Values out of range

qplot(income, data=custdata, binwidth=10000)

0

50

100

150

0e+00 2e+05 4e+05 6e+05income

coun

t

Page 14: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Values out of range

library(scales)qplot(income, data=custdata, binwidth=10000) +

scale_x_continuous(labels=dollar)

0

50

100

150

$0 $200,000 $400,000 $600,000income

coun

t

Page 15: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Values out of range

qplot(income, data=custdata, binwidth=5000) +scale_x_continuous(labels=dollar)

0

40

80

120

$0 $200,000 $400,000 $600,000income

coun

t

Page 16: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

I qplot selects automatically the type of graph from thenumber and type of arguments

I for a single numerical variable the default is histogram

I the same plot can be done using the following commands

ggplot(custdata) +geom_histogram(aes(x=income), binwidth=5000) +scale_x_continuous(labels=dollar)

Page 17: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Logarithmic scale

I Use logarithmic scale for variables where percent changeis more important than change in value.

I Use logarithmic scale when data spans a wide range, e.g.multiple orders of magnitude

Page 18: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Logarithmic histogram

custdata2 <- subset(custdata, income > 0)qplot(income, data=custdata2, binwidth=5000) +

scale_x_log10(breaks=10^(1:6), labels=dollar)

0

250

500

750

$10$100$1,000$10,000$100,000$1,000,000income

coun

t

Page 19: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Logarithmic histogram

binwidth should be in percent change, not dollar amountqplot(income, data=custdata2, binwidth=0.05) +

scale_x_log10(breaks=10^(1:6), labels=dollar)

0

20

40

$100 $1,000 $10,000 $100,000 $1,000,000income

coun

t

Page 20: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Density plots

qplot(income, data=custdata2, geom="density") +scale_x_log10(breaks=10^(1:6), labels=dollar)

0.00

0.25

0.50

0.75

1.00

$100 $1,000 $10,000 $100,000 $1,000,000income

dens

ity

Page 21: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Histogram vs density

I Both apply to continuous variables.

I Both give an idea of the underlying probability distribution.

I Two histograms of the same data may look very differentwith different bin widths and choosing the best bin width isimportant.

I A density plot is a "continuous histogram". It plots anestimated probability distribution function.

Page 22: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Bar chartsA bar chart is a histogram for categorical variable. It is thedefault geometry in qplot for factor and logical variables

qplot(marital.stat, data=custdata)

0

100

200

300

400

500

Divorced/Separated Married Never Married Widowedmarital.stat

coun

t

Page 23: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Bar charts

qplot(state.of.res, data=custdata)

0

25

50

75

100

AlabamaAlaskaArizonaArkansasCaliforniaColoradoConnecticutDelawareFloridaGeorgiaHawaiiIdahoIllinoisIndianaIowaKansasKentuckyLouisianaMaineMarylandMassachusettsMichiganMinnesotaMississippiMissouriMontanaNebraskaNevadaNew HampshireNew JerseyNew MexicoNew YorkNorth CarolinaNorth DakotaOhioOklahomaOregonPennsylvaniaRhode IslandSouth CarolinaSouth DakotaTennesseeTexasUtahVermontVirginiaWashingtonWest VirginiaWisconsinWyomingstate.of.res

coun

t

What a mess!

Page 24: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Bar charts

qplot(state.of.res, data=custdata) + coord_flip()

AlabamaAlaskaArizonaArkansasCaliforniaColoradoConnecticutDelawareFloridaGeorgiaHawaiiIdahoIllinoisIndianaIowaKansasKentuckyLouisianaMaineMarylandMassachusettsMichiganMinnesotaMississippiMissouriMontanaNebraskaNevadaNew HampshireNew JerseyNew MexicoNew YorkNorth CarolinaNorth DakotaOhioOklahomaOregonPennsylvaniaRhode IslandSouth CarolinaSouth DakotaTennesseeTexasUtahVermontVirginiaWashingtonWest VirginiaWisconsinWyoming

0 25 50 75 100count

stat

e.of

.res

Better! When you have more than a few categories, usehorizontal bars!

Page 25: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Bar charts

qplot(state.of.res, data=custdata) + coord_flip() +theme(axis.text.y=element_text(size=rel(0.6)))

AlabamaAlaska

ArizonaArkansasCaliforniaColorado

ConnecticutDelaware

FloridaGeorgia

HawaiiIdahoIllinois

IndianaIowa

KansasKentuckyLouisiana

MaineMaryland

MassachusettsMichigan

MinnesotaMississippi

MissouriMontana

NebraskaNevada

New HampshireNew JerseyNew Mexico

New YorkNorth Carolina

North DakotaOhio

OklahomaOregon

PennsylvaniaRhode Island

South CarolinaSouth Dakota

TennesseeTexasUtah

VermontVirginia

WashingtonWest Virginia

WisconsinWyoming

0 25 50 75 100count

stat

e.of

.res

Better yet! The labels are small, but at least they don’t overlap.

Page 26: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Sorted bar chart

custdata <- transform(custdata, state.of.res.ord=reorder(state.of.res, state.of.res, length))

qplot(state.of.res.ord, data=custdata) + coord_flip() +theme(axis.text.y=element_text(size=rel(0.6)))

DelawareNorth Dakota

WyomingRhode Island

AlaskaMontanaVermont

IdahoKansasNevadaHawaii

New HampshireSouth Dakota

UtahMaine

New MexicoArkansas

MississippiOregon

NebraskaArizona

IowaAlabamaColorado

OklahomaWest Virginia

ConnecticutLouisiana

South CarolinaKentucky

North CarolinaMaryland

WashingtonMinnesota

MissouriTennessee

MassachusettsGeorgia

WisconsinIndianaVirginia

New JerseyFloridaIllinois

OhioMichigan

TexasPennsylvania

New YorkCalifornia

0 25 50 75 100count

stat

e.of

.res

.ord

Page 27: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Aside: reorder a factor

Let’s reorder states by average number of vehicles percustomer.state.by.num.vehicles <- reorder(custdata$state.of.res,

custdata$num.vehicles, mean, na.rm=TRUE)

What is the average number of vehicles per customer in eachstate?

In Alabama:with(custdata, mean(

num.vehicles[state.of.res=="Alabama"], na.rm=TRUE))

Repeat for each of the 50 states. There has to be a better way!

Page 28: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Aside: reorder a factor

Let’s reorder states by average number of vehicles percustomer.state.by.num.vehicles <- reorder(custdata$state.of.res,

custdata$num.vehicles, mean, na.rm=TRUE)

What is the average number of vehicles per customer in eachstate?

In Alabama:with(custdata, mean(

num.vehicles[state.of.res=="Alabama"], na.rm=TRUE))

Repeat for each of the 50 states. There has to be a better way!

Page 29: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Aside

Using base R# splitpieces <- split(custdata, custdata$state.of.res)

# applyresult <- lapply(pieces, function(p) data.frame(

state.of.res=p$state.of.res[[1]],state.avg.vehicles=mean(p$num.vehicles, na.rm=TRUE)

))

# combineresult <- do.call("rbind", result)

Page 30: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Aside

Package plyr implements split-apply-combine framework veryneatly in a single function call.

library(plyr)result <- ddply(

custdata, # dataframe"state.of.res", # split-by variablessummarize, # function to apply to each piece

# function argumentsstate.avg.vehicles=mean(num.vehicles, na.rm=TRUE)

)

Page 31: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Single variable

To summarize visualization of single variable

I For a numerical variable use a histogram or density plot tolook for outliers, or incorrect values.

I Also get a feel for the distribution – is it symmetric, normal,lognormal.

I For categorical variables use a bar chart to comparefrequencies of categories.

Page 32: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Scatter plot

custdata2 <- with(custdata,subset(custdata, age>0 & age < 100 & income > 0))

qplot(age, income, data=custdata2) +scale_y_continuous(labels=dollar)

●●● ●

●●

● ●

●●

●●

●●

●●●

●●

●●●

●●

● ●

●●

●● ●

●●

● ●

●●

● ●●

●●

●●

●●

●●●

●●

● ●

●●

● ●

●● ●

●●

●●

●●●

●●

●●

●●

●● ●

● ●

● ●

●●

●● ●

● ●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

● ●

●●

●●●

● ●

●●●

●●

●●

● ●

●●

●●

●● ●●

●●●

●● ●

●●

●●

●●

●●

●●●

● ●●

●●

●●●

● ●

● ●●

●● ●

●●

●●

● ●

● ●

●● ●

● ●●

●●●●

●●●

●● ●

●●

● ●●●

●● ●

●●●

●●

● ●

●●

●●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●● ●

●●

●●

●●●

●●

● ●

●●

●●

●●●

●●

●●

● ●●

●●

●●

●●

●●

●●

● ●

●●

●●

● ●

● ●

●●

●●

●● ●●

●●●

● ●●

●●

●●

● ●●

● ●

●●

●●● ●●

●●

●●

●●

●●

● ●

●●

●●

● ●●●

●●

●●

●● ●

●●

●●

● ●

●●

●●

●●

●●

●●

●● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

● ●

●●

●●

●●

● ●

●●

●●

●●●

●●

●●

●●

●● ●

●●

●●

●●

●●●●

●●

●●

●●

●● ●

●●

● ●

● ●

●●

●●

●●● ●

●●

●●

●●●

●●

● ●

$0

$200,000

$400,000

$600,000

20 40 60 80age

inco

me

Page 33: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Scatter plot

qplot(age, income, data=custdata2, colour=health.ins) +scale_y_continuous(labels=dollar)

●●● ●

●●

● ●

●●

●●

●●

●●●

●●

●●●

●●

● ●

●●

●● ●

●●

● ●

●●

● ●●

●●

●●

●●

●●●

●●

● ●

●●

● ●

●● ●

●●

●●

●●●

●●

●●

●●

●● ●

● ●

● ●

●●

●● ●

● ●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

● ●

●●

●●●

● ●

●●●

●●

●●

● ●

●●

●●

●● ●●

●●●

●● ●

●●

●●

●●

●●

●●●

● ●●

●●

●●●

● ●

● ●●

●● ●

●●

●●

● ●

● ●

●● ●

● ●●

●●●●

●●●

●● ●

●●

● ●●●

●● ●

●●●

●●

● ●

●●

●●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●● ●

●●

●●

●●●

●●

● ●

●●

●●

●●●

●●

●●

● ●●

●●

●●

●●

●●

●●

● ●

●●

●●

● ●

● ●

●●

●●

●● ●●

●●●

● ●●

●●

●●

● ●●

● ●

●●

●●● ●●

●●

●●

●●

●●

● ●

●●

●●

● ●●●

●●

●●

●● ●

●●

●●

● ●

●●

●●

●●

●●

●●

●● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

● ●

●●

●●

●●

● ●

●●

●●

●●●

●●

●●

●●

●● ●

●●

●●

●●

●●●●

●●

●●

●●

●● ●

●●

● ●

● ●

●●

●●

●●● ●

●●

●●

●●●

●●

● ●

$0

$200,000

$400,000

$600,000

20 40 60 80age

inco

me health.ins

FALSE

TRUE

Page 34: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

2D histogram

qplot(age, income, data=custdata2, geom="bin2d") +scale_y_continuous(labels=dollar)

$0

$200,000

$400,000

$600,000

20 40 60 80age

inco

me

4

8

12

16

count

Page 35: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

2D histogram

library(hexbin)qplot(age, income, data=custdata2, geom="hex") +

scale_y_continuous(labels=dollar)

$0

$200,000

$400,000

$600,000

25 50 75age

inco

me

5

10

15

count

Page 36: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

2D histogram

library(hexbin)ggplot(custdata2, aes(x=age, y=income)) +

geom_hex(binwidth=c(5, 10000)) +geom_smooth(color="white", se=F) +ylim(0,200000)

0

50000

100000

150000

200000

25 50 75age

income

4

8

12

16count

Page 37: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Scatter plot

Also works for continuous vs. categorical.

qplot(age, health.ins, data=custdata2)

●●

● ●● ●● ●●● ● ●● ● ●● ● ●●

●● ●● ●

● ●● ●●● ●● ● ●●●●● ●● ● ●●●●●●

● ● ●●

●●

● ●●

●●● ●● ● ●●● ● ●● ● ●● ●● ●● ●●

● ● ●●● ●●● ● ●

●●

● ●

● ●● ●● ● ● ●●

● ●● ●● ● ●● ●●● ●● ●●

● ●

●● ● ● ●● ●● ●

●● ●

●●●● ● ●●● ●●● ●●● ●●●● ● ●● ●●●

● ●● ●● ●● ●●

● ●● ● ●●● ●● ●●● ●●●

● ● ●● ●● ●●

●● ● ●●● ● ●●● ●●● ●●● ● ● ●

● ●

●●● ●● ●● ● ● ●●● ●

● ●● ●● ●

● ●

●● ●

●●

● ●● ●●● ●● ●●● ● ●● ● ●●● ●● ●● ●●● ●●●

●●

● ● ●● ●●● ●● ● ●●

● ● ● ●

●●

● ● ●● ●●● ●● ●● ●

●●

● ● ●

●● ●● ●●●●● ●●●● ●●

● ●● ●●● ● ● ●● ● ●● ●●● ●●

● ●● ●● ● ●● ●●● ● ●● ●●● ●● ●

● ●●● ● ● ●● ●●

●●

●●●

● ● ●●●●

●●

●● ●●●

● ●●● ● ●● ●●

● ●●● ● ●● ● ●

● ●●

●● ● ●● ●●● ●

● ●●

● ● ●●●●

●●●●●● ● ●

● ●●

●● ●● ●● ●●● ●● ●● ●● ●● ● ● ● ●● ●

●●● ● ●

●● ● ●● ●

● ●● ● ●●●● ●● ●●

●●

●● ●● ● ●●

● ●● ● ●

●●●

●●● ●

● ●●● ● ●

●●●

●●● ●● ●

●●

●● ●

●●●● ● ● ●

●● ● ●● ●● ● ●

●●

● ●●

●● ●●●● ● ●

●● ● ●● ● ●

● ●●● ●●

● ●●● ●● ●●● ●

● ● ● ● ●

●●●●●

● ●●●● ● ●●●

●●●●

●● ●● ●● ●

●●● ●● ●● ●●●● ● ●● ●●●

●●●●

●● ● ●●●

●●● ●

● ●● ●● ●● ●

● ●

● ● ● ●●● ●● ●● ●● ● ●● ●●●● ● ●● ●● ●●●● ●●●● ●●●● ●

●● ● ● ●● ●

● ●

●●●● ● ● ●

●● ●●

● ●●● ●●●● ●●●● ●

●● ●

●● ●●● ● ●

●● ● ●●● ● ●●

●●

● ●● ● ● ●●

●●● ● ●●●●●

●● ●●● ●● ●●●● ●●

●●●● ●

● ●● ●●●

●● ●

●●●● ●● ●● ● ●●● ●● ● ●●

● ●● ● ● ●● ●● ●

●FALSE

TRUE

20 40 60 80age

heal

th.in

s

Page 38: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Scatter plotThis is better – it gives a better feel for the density at each level.

qplot(age, health.ins, data=custdata2,position=position_jitter(height=0.2))

●●

●●

●●

● ● ●

●● ●

●●●

●●

●●

● ●

● ●

●●

●●

●●

●●

●●

●●

●● ●

●●

●●

●●

● ●●

●●

●●●●

●●

● ●

●●

●●

●●

● ●●

●●

● ●

●●

● ●●

●●

●●

●●

●●

●●

● ●

●●●●●

● ●●

●●●

●●●

● ●

●●

●●

●● ●●●

●● ●

●●

● ●

●●

●●●

●●

● ●

●●

● ●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

● ●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●●

●●

●●

● ●

● ●●●

●●

●●

●●

● ●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●●

●●

●●●

●●

● ●

●●

●●●

● ●

●●

●●●

●●

●●

●● ●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●● ●

●●

● ●

●●

●●

●●

●● ●

●●

● ●

●●

●●

●●

●●●●

●●

●●

●●

● ●

FALSE

TRUE

20 40 60 80age

heal

th.in

s

Page 39: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Scatter plot

qplot(age, health.ins, data=custdata2, color=log10(income),position=position_jitter(height=0.2))

●●

● ●● ●

● ●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

● ●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●● ●●

●●

●●

● ●●

●●

● ●●

●●●●

●●

●●●●

●●

●●

●●

●●

●●

● ●●

●●

●●

●●●

●●

●●

●●

●●

●●●

●●

●●

● ●

●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●●

●●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

● ●

●●

●●

● ●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

● ●

●●

● ●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●● ●

●●

●● ●

●●

●●●

●●

●● ●

●●

●●

● ●

●● ●

●●●

● ● ●

●●

● ●

● ●

● ●●●

●●●

●●

● ●

● ●

●●

●●

● ●

●●

● ● ●

● ●●

●●

●●

●●

●●

●●

●●

●●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●FALSE

TRUE

20 40 60 80age

heal

th.in

s

2

3

4

5

log10(income)

Page 40: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Bar chart for two variables

Use the fill aesthetic as the second variable

qplot(marital.stat, data=custdata2, fill=health.ins)

0

100

200

300

400

Divorced/Separated Married Never Married Widowedmarital.stat

coun

t health.ins

FALSE

TRUE

Page 41: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Bar chart for two variablesSome prefer side-by-side

ggplot(custdata2) +geom_bar(aes(marital.stat, fill=health.ins),

position="dodge")

0

100

200

300

400

Divorced/Separated Married Never Married Widowedmarital.stat

coun

t health.ins

FALSE

TRUE

Page 42: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Bar chart for two variablesFilled bar chart shows the proportion of insured within eachlevel of marital status.

ggplot(custdata2) +geom_bar(aes(marital.stat, fill=health.ins),

position="fill")

0.00

0.25

0.50

0.75

1.00

Divorced/Separated Married Never Married Widowedmarital.stat

coun

t health.ins

FALSE

TRUE

Page 43: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Bar chart for two variables

Add a cloud of points to convey the size of each level.

last_plot() + geom_point(aes(x=marital.stat, y=-0.05),position=position_jitter(h=0.02), size=0.75, alpha=0.75)

0.00

0.25

0.50

0.75

1.00

Divorced/Separated Married Never Married Widowedmarital.stat

coun

t health.ins

FALSE

TRUE

Page 44: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Bar chart for two variablesMore than two levels

ggplot(custdata2) +geom_bar(aes(housing.type, fill=marital.stat),

position="dodge")

0

100

200

Homeowner free and clearHomeowner with mortgage/loanOccupied with no rentRented NAhousing.type

coun

t

marital.stat

Divorced/Separated

Married

Never Married

Widowed

Page 45: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Bar chart for two variablesRemove NA from housing.type and fix labels

ggplot(subset(custdata2, !is.na(housing.type))) +geom_bar(aes(housing.type, fill=marital.stat),

position="dodge") +theme(axis.text.x=element_text(angle=15))

0

100

200

Homeowner free and clear

Homeowner with mortgage/loan

Occupied with no rent

Rented

housing.type

coun

t

marital.stat

Divorced/Separated

Married

Never Married

Widowed

Page 46: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Bar chart for two variablesUse facetting instead of fill to get a better picture.

ggplot(subset(custdata2, !is.na(housing.type))) +geom_bar(aes(marital.stat)) +facet_wrap(~housing.type, scales="free_y") +theme(axis.text.x=element_text(size=rel(0.8)))

Homeowner free and clear Homeowner with mortgage/loan

Occupied with no rent Rented

0

20

40

60

80

0

100

200

0

1

2

3

4

5

0

40

80

120

Divorced/SeparatedMarried Never Married Widowed Divorced/SeparatedMarried Never Married Widowed

marital.stat

coun

t

Page 47: Visualization and Exploration - Carleton Universitypeople.scs.carleton.ca/~boyanbejanov/data5000/lecture3a.pdf · 2016. 2. 7. · Visualization Reasons to use visualization: I to

Visualization with R

Further readings

I A short course on ggplot2 by Hadley Wickhamhttp://courses.had.co.nz/11-rice/