Advanced Data Analytics: Basic Graphics in R Jeffrey Stanton School of Information Studies Syracuse University

Upload: syracuse-university

Post on 28-Nov-2014


Page 1: Basic Graphics with R

Advanced Data Analytics: Basic Graphics in R

Jeffrey Stanton

School of Information Studies

Syracuse University

Page 2: Basic Graphics with R

Movie Data Set from JSE

• From McClaren and DePaolo’s article in the Journal of Statistics Education
• Daily per-theater box office receipts in dollars for 49 movies
• A variable number of entries for each movie, depending upon how long it ran
• About 2500 observations altogether
• DAILY_PER_THEATER: Amount in dollars
• DATE: mm/dd/yyyy of when the observation was made
• DAY_NUM: Which day in the run, numbered from 1 up
• MOVIE: The title of the movie
• NUMBER: Index number of the movie


Page 3: Basic Graphics with R

Movie Dataset from JSE

http://www.amstat.org/publications/jse/datasets/moviedaily.dat

http://www.amstat.org/publications/jse/datasets/moviedaily.txt

> moviedaily <- read.delim("Z:/DataScience/AdvancedAnalytics/moviedaily.dat")
> View(moviedaily) # Display data in R-Studio in a separate pane (note the capital V)
> attach(moviedaily) # Attach the data frame so its variables can be used by name
> class(moviedaily) # Make sure the dataset is a dataframe
[1] "data.frame"
> ls(moviedaily) # Show the variable names in the dataframe
[1] "DAILY_PER_THEATER" "DATE"              "DAY_NUM"
[4] "MOVIE"             "NUMBER"
> hist(DAY_NUM)
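One way to check an import, sketched on a tiny made-up file rather than the real moviedaily.dat. The rows and the name demo are illustrative assumptions. Since R 4.0.0, read.delim() leaves text columns as character vectors unless stringsAsFactors = TRUE is given, so the sketch sets it explicitly to reproduce the factor behavior these slides describe.

```r
# Write a tiny tab-delimited file (made-up rows, not the real data)
tmp <- tempfile(fileext = ".dat")
writeLines(c(
  "DAILY_PER_THEATER\tDAY_NUM\tMOVIE",
  "1200\t1\tTitanic",
  "No daily data\t2\tTitanic",
  "850\t3\tTitanic"
), tmp)

# stringsAsFactors = TRUE mimics the pre-4.0 default assumed in the slides
demo <- read.delim(tmp, stringsAsFactors = TRUE)

str(demo)                           # compact overview: one line per column
col_classes <- sapply(demo, class)  # class of every column
print(col_classes)                  # the text entry forces DAILY_PER_THEATER to be a factor
```

If a column that should be numeric shows up as a factor (or as character in newer versions of R), some entries in the file are probably not numbers.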


Page 4: Basic Graphics with R

Histogram of DAY_NUM

[Figure: histogram of DAY_NUM. x-axis: DAY_NUM, 0 to 200; y-axis: Frequency, 0 to 600.]

Page 5: Basic Graphics with R

About Histograms

• A basic type of diagnostic display that shows how frequently each value occurs in the data set

• In R, works on numeric data only; getting counts on other modes of data requires another approach

• Works fine with continuous data (e.g., 3.1, 3.2, 3.25, etc.) because it can cluster together nearby values and count them in a single frequency category (representing a range)

• Try hist(NUMBER) and hist(DAILY_PER_THEATER)
• Even though these look like numeric variables, in the data importing process, R has made them into “factors”: factors are stored as integers with “category labels” and are used in various procedures to divide the data into groups
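A minimal sketch of the binning idea above, using made-up values (the vector x and the one-unit bin width are assumptions chosen for illustration). It also shows that hist() refuses a factor, which is the problem the next slide addresses.

```r
x <- c(3.1, 3.2, 3.25, 4.8, 5.0, 5.1, 7.9)   # made-up continuous values

# plot = FALSE returns the bin counts without drawing; drop it to see the chart
h <- hist(x, breaks = seq(3, 8, by = 1), plot = FALSE)
print(h$breaks)   # bin edges
print(h$counts)   # how many values fell into each bin

# A factor is rejected even when its labels look like numbers
msg <- tryCatch(hist(factor(x), plot = FALSE),
                error = function(e) conditionMessage(e))
print(msg)
```

Nearby values such as 3.1, 3.2, and 3.25 land in the same bin, which is why a histogram works well with continuous data.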


Page 6: Basic Graphics with R

Convert a factor into numbers

• Recall that a factor is stored as integers with character labels. It is the labels that we want to convert into numbers (the integers are codes that R assigns to the distinct labels, so they identify categories but do not hold the original values)

• Try as.character(DAILY_PER_THEATER) – See how we get lots of numbers in quotes, plus some occasional other stuff that is not numbers

• Then try: as.numeric(as.character(DAILY_PER_THEATER))
• Note the warning message: “NAs introduced by coercion”. This is exactly what we want: “NA” is R’s way of coding missing data; all of the unusable string values (like "No daily data") have been turned into NAs because they are missing values

> detach(moviedaily)
> # Add a new numeric variable converted from the factor
> moviedaily$dailyper <- as.numeric(as.character(moviedaily$DAILY_PER_THEATER))
> attach(moviedaily)
> class(dailyper) # Confirm that the new variable is numeric
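The difference between a factor's integer codes and its labels can be seen on a small made-up factor (the values below are assumptions, not rows from the movie data). This is only a sketch of why the as.character() step is needed.

```r
raw <- factor(c("1200", "850", "No daily data", "85"))  # made-up values

codes <- as.numeric(raw)        # internal integer codes, NOT the dollar amounts
print(codes)

# Convert the labels instead; the text entry becomes NA (normally with a warning)
converted <- suppressWarnings(as.numeric(as.character(raw)))
print(converted)

sum(is.na(converted))           # how many entries could not be converted
mean(converted, na.rm = TRUE)   # summaries must skip the NAs explicitly
```

Calling as.numeric() directly on the factor silently returns the codes, so the mistake is easy to miss without a check like this.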


Page 7: Basic Graphics with R

On Most Days, Movies Make a Few Hundred Dollars per Theater

[Figure: histogram of dailyper. x-axis: dailyper, 0 to 20000; y-axis: Frequency, 0 to 2000.]

Page 8: Basic Graphics with R

Which Movie Made the Most $$$

• First, we need to aggregate the data by summing the daily takes for each movie:
aggdata <- aggregate(dailyper, by=list(MOVIE), FUN=sum, na.rm=TRUE)
# Aggregates by MOVIE, which is a factor with the movie names
# Uses the sum function on the variable dailyper
• Next, let's organize the data in descending order:
sortdata <- aggdata[order(-aggdata$x),]
# The minus sign means decreasing order
• Remove the items that had no data (the sums ended up as zero):
sortdata <- sortdata[sortdata$x>1,]
# Takes the subset of rows where the aggregated dollar value > 1
• Finally, create a barplot showing the totals for each movie:
barplot(sortdata$x, names.arg=as.character(sortdata$Group.1), las=2)
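The four steps above can be tried end to end on a toy data frame. The data frame toy and its values are made up for illustration; the calls mirror the ones on this slide.

```r
toy <- data.frame(
  MOVIE    = c("A", "A", "B", "B", "C"),
  dailyper = c(100, 250, 400, NA, NA)   # movie C has no usable data
)

# Sum per movie; na.rm = TRUE is passed through to sum()
aggdata <- aggregate(toy$dailyper, by = list(toy$MOVIE), FUN = sum, na.rm = TRUE)
print(aggdata)   # columns are named Group.1 and x

# Sort descending, then drop movies whose total ended up as zero
sortdata <- aggdata[order(-aggdata$x), ]
sortdata <- sortdata[sortdata$x > 1, ]
print(sortdata)

# One bar per movie, height = total, labels rotated with las = 2
barplot(sortdata$x, names.arg = as.character(sortdata$Group.1), las = 2)
```

Movie C sums to zero because sum() over nothing but removed NAs returns 0, which is why the subsetting step is needed.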


Page 9: Basic Graphics with R

Barplot of Movie Total Daily Take


Page 10: Basic Graphics with R

Let’s Do the Same Thing With Rcmdr

• The input data file has some anomalies that we had to clear up: the Rcmdr data loader is not as forgiving as R-Studio

[3] ERROR: line 423 did not have 5 elements
[4] ERROR: line 1990 did not have 5 elements
moviedaily <- read.table("Z:/DataScience/AdvancedAnalytics/moviedaily.dat", header=TRUE, sep="\t", na.strings="NA", dec=".", strip.white=TRUE)
[5] NOTE: The dataset moviedaily has 2378 rows and 5 columns.
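Errors like "did not have 5 elements" can be tracked down with count.fields(), which reports the number of fields on each line. The sketch below uses a small made-up file, not the real data. Note that read.delim() defaults to fill = TRUE while read.table() defaults to fill = FALSE, which may be why the earlier read.delim() call did not complain about the same file.

```r
# Made-up tab-delimited file in which the third line is one field short
tmp <- tempfile(fileext = ".dat")
writeLines(c("A\tB\tC", "1\t2\t3", "4\t5", "6\t7\t8"), tmp)

# Number of fields on every line, including the header line
n_fields  <- count.fields(tmp, sep = "\t", quote = "")
bad_lines <- which(n_fields != n_fields[1])
print(bad_lines)   # line numbers in the file that need attention

# fill = TRUE pads short rows with NA instead of stopping with an error
padded <- read.table(tmp, header = TRUE, sep = "\t", fill = TRUE)
print(padded)
```

Padding with fill = TRUE hides the problem rather than fixing it, so it is worth inspecting the reported lines in the file itself.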


Page 11: Basic Graphics with R

Obviously We Need to Tweak It

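[Figure: frequency plot of DAILY_PER_THEATER. x-axis: DAILY_PER_THEATER, with values shown as category labels (121, 16, 212, 29, 379, 509, 65, 81, 99); y-axis: Frequency, 0 to 10.]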

Page 12: Basic Graphics with R

We Still Need to Coerce


as.numeric(as.character(DAILY_PER_THEATER))

Page 13: Basic Graphics with R

Aggregate is Under the Menu: Data -> Active Data Set


Page 14: Basic Graphics with R

Remove Cases with Missing Data
Subset the Data for Nonzero Values
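At the command line, these two steps can be sketched with na.omit() and subset(). The data frame below is made up, and the assumption is that the Rcmdr menu choices do something roughly equivalent; the exact code Rcmdr generates may differ.

```r
aggdata <- data.frame(
  MOVIE    = c("A", "B", "C", "D"),
  dailyper = c(350, 0, NA, 400)
)

complete <- na.omit(aggdata)                 # drops the row with the missing total (C)
nonzero  <- subset(complete, dailyper > 0)   # drops the row with a zero total (B)
print(nonzero)
```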


Page 15: Basic Graphics with R

Rcmdr has no Sort Function… And the Barplot is Troubled

• We can use the sorting capability we learned before:
aggdata <- aggdata[order(-aggdata$dailyper),]
• The Barplot menu choice in Rcmdr produces this code:
barplot(table(aggdata$MOVIE), xlab="MOVIE", ylab="Frequency")
– This creates a frequency table based on MOVIE, which is not really what we want
– The resulting chart shows frequencies, as a histogram does, rather than a barchart with heights based on dailyper
• We can run our own barchart command using the Rcmdr data:
barplot(aggdata$dailyper, names.arg=as.character(aggdata$MOVIE), las=2)
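The difference between the two barplot() calls is easy to see on a toy aggregated data frame (the values are made up). After aggregation each movie appears exactly once, so a frequency table gives every bar a height of 1.

```r
aggdata <- data.frame(
  MOVIE    = c("A", "B", "C"),
  dailyper = c(400, 350, 120)
)

# table() counts rows per movie: every count is 1 after aggregation
counts <- table(aggdata$MOVIE)
print(counts)
barplot(counts, xlab = "MOVIE", ylab = "Frequency")

# Passing the values themselves gives bars whose heights are the totals
barplot(aggdata$dailyper, names.arg = as.character(aggdata$MOVIE), las = 2)
```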


Page 16: Basic Graphics with R

But Some Things are Still Messed Up!

• It has not discarded the zeroes as we asked
• There are too many entries (there should be 49 or fewer); it looks like the aggregation did not work correctly
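A quick diagnostic for "too many entries" is to count the distinct grouping labels and look for labels that do not look like titles. The sketch below uses a made-up vector in which one label has other fields run into it; the date-shaped pattern is only an assumed heuristic for spotting such labels.

```r
movie <- c("Titanic", "Titanic", "Chicago",
           "Chicago 1200 1/1/2003 5 Chicago")   # last label is deliberately malformed

n_groups <- length(unique(movie))
print(n_groups)   # compare against the 49 movies expected in the real data

# Labels that contain something shaped like a date are probably damaged rows
has_date   <- grepl("[0-9]+/[0-9]+/[0-9]{4}", movie)
suspicious <- unique(movie[has_date])
print(suspicious)
```

Each damaged label forms its own group, which inflates the number of bars in the plot.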

[Figure: barplot of aggregated dailyper by MOVIE from the Rcmdr data. Bar labels include movie titles (Titanic, Star Wars: Phantom Menace, Chicago, Batman, A Beautiful Mind, Spider-Man, and others) followed by many malformed labels in which a title appears to be run together with a dollar amount, a date, and a day number, mostly for Pirates 2: Dead Mans Chest, Pirates 3: At Worlds End, and Harry Potter 1: Sorcerers Stone. y-axis: 0 to 200000.]

Page 17: Basic Graphics with R

Demonstrating Mastery

• Locate a data set in a CSV or Tab-Delimited file and read it into R

• Check the data to ensure that the process of reading in the data worked properly

• Run a histogram on any numeric variable
• Aggregate the data based on any grouping variable; use a sum function, a mean function, or some other function as appropriate

• Display the aggregated data in a barchart or another type of graph as appropriate

• Describe the difference between a histogram and a barchart
