R for Data Visualization and Graphics

Upload: robert-kabacoff

Post on 16-Apr-2017

Page 1: R for data visualization and graphics

R for Data Visualization and Graphics

Rob Kabacoff, Ph.D.
Vice President of Research

Source code for presentation: http://tinyurl.com/Kabacoff-CS20

Page 2: R for data visualization and graphics

R is a Statistical and Graphical Platform

• Free
• Open source
• State-of-the-art data analysis
• Platform for programming new methods
• Runs on Windows, Linux, Mac OS X
• Enormous user base
• Reproducible research

R Homepage - http://www.r-project.org/
CRAN Mirrors - http://cran.r-project.org/

Page 3: R for data visualization and graphics

Data Input

[Diagram: sources from which R can read data]

• Keyboard
• Text Files: ASCII, XML, Webscraping
• Statistical Packages: SAS, SPSS, Stata
• Database Management Systems: SQL, MySQL, Oracle, Access
• Other: Excel, netCDF, HDF5
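
As a minimal sketch of the text-file route, base R's read.csv() can parse delimited text. The inline data and the column names below are made up for illustration and do not come from the presentation; in practice the text argument would be replaced by a file path.

```r
# Hypothetical inline CSV standing in for a text file on disk
csv_text <- "name,years,salary
A,5,80000
B,12,95000
C,20,120000"

# read.csv() returns a data frame; text = lets it parse a string instead of a file
faculty <- read.csv(text = csv_text, stringsAsFactors = FALSE)

str(faculty)           # inspect the column types that were inferred
mean(faculty$salary)   # simple summary of a numeric column
```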

Page 4: R for data visualization and graphics

Statistical Methods

• Descriptive Statistics
• Experimental Design
• Linear, Generalized, Nonlinear, and Hierarchical Models
• Analysis of Categorical Data
• Nonparametric Analysis
• Survival Analysis
• Latent Variable Models
• Bayesian Models
• Missing Values Analysis
• Cluster Analysis
• Decision Trees
• Data Mining
• Classical Test Theory
• Item Response Theory
• Correspondence Analysis
• Multidimensional Scaling
• Meta Analysis
• Structural Equation Modeling
• Complex Survey Design
• Time Series Analysis
• Longitudinal Analysis
• Social Network Analysis
• Study of Mediation and Moderation
• Power Analysis
• Clinical Trials

and …

Page 5: R for data visualization and graphics

Graphs!

[Figure gallery: "University Salaries by Discipline" (Salary vs. Years Since Ph.D., by discipline: Theoretical, Applied); "A Topographic Map of Maunga Whau" (Meters North vs. Meters West, 10 Meter Contour Spacing); "Sinc( r )" surface over X and Y; map panels of lat vs. long, "Given : depth"; "Survival on the Titanic" (Age, Sex, Survived) shaded by Pearson residuals, p-value < 2e-16]

Page 6: R for data visualization and graphics

A High Level Tour

• General Systems
  – base
  – lattice
  – ggplot2
• Interactive
  – iplots
  – rggobi
  – googleVis
  – Shiny
• Specialized
  – vcd (categorical data)
  – VIM (missing data)
  – likert (likert data)
  – scatterplot3d (3-D scatterplot)
  – car (regression)
  – corrplot (correlations)
  – (decision trees)
  – (dendrograms)
  – effects (glm/ANOVA)

Page 7: R for data visualization and graphics

3 complete graphics systems

[Figure: the same histogram of Salary (dollars) vs. Frequency drawn three ways, labeled Base Graphics, Lattice Graphics, and ggplot2 Graphics]
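
A rough sketch of one histogram drawn in each of the three systems follows. The built-in mtcars data is an assumption standing in for the salary data on the slide, and the lattice and ggplot2 calls run only if those packages are installed.

```r
# mtcars ships with R; it stands in for the salary data used in the slides
x <- mtcars$mpg

# 1. Base graphics: hist() draws immediately and returns the bin counts
h <- hist(x, main = "Base Graphics", xlab = "Miles per gallon")

# 2. lattice: histogram() builds a trellis object that is drawn when printed
if (requireNamespace("lattice", quietly = TRUE)) {
  print(lattice::histogram(~ mpg, data = mtcars, type = "count",
                           main = "Lattice Graphics"))
}

# 3. ggplot2: a plot object assembled from layers, drawn when printed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  print(ggplot(mtcars, aes(x = mpg)) +
          geom_histogram(bins = 10) +
          ggtitle("ggplot2 Graphics"))
}
```

Base functions draw as a side effect, while lattice and ggplot2 return objects that can be stored, modified, and printed later.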

Page 8: R for data visualization and graphics

BASE GRAPHICS

Page 9: R for data visualization and graphics

histograms

[Figure: three base-graphics histograms of Salary (dollars): "Histogram with Rug Plot" (density scale), "Histogram with Kernel Density Curve" (density scale), and "Histogram with Normal Curve" (frequency scale)]
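
The three histogram variants can be approximated in base graphics roughly as follows. This is a sketch on the built-in mtcars data, which is an assumption; the slide used salary data, and its exact code is in the source linked on the title slide.

```r
x <- mtcars$mpg   # built-in stand-in for the salary variable

# Histogram on the density scale with a rug of the raw observations
hist(x, freq = FALSE, main = "Histogram with Rug Plot", xlab = "Miles per gallon")
rug(x)

# Histogram with a kernel density estimate overlaid
hist(x, freq = FALSE, main = "Histogram with Kernel Density Curve",
     xlab = "Miles per gallon")
d <- density(x)
lines(d, lwd = 2)

# Frequency histogram with a normal curve rescaled to the count axis:
# density * bin width * sample size gives expected counts per bin
h <- hist(x, main = "Histogram with Normal Curve", xlab = "Miles per gallon")
xfit <- seq(min(x), max(x), length.out = 100)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x)) * diff(h$mids[1:2]) * length(x)
lines(xfit, yfit, lwd = 2)
```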

Page 10: R for data visualization and graphics

bar charts

Page 11: R for data visualization and graphics

box plots

[Figure: horizontal box plots, "Singer Height by Voice Part"; x-axis Heights in Inches (60 to 75); voice parts Bass 2, Bass 1, Tenor 2, Tenor 1, Alto 2, Alto 1, Soprano 2, Soprano 1]
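
A plot of this kind can be sketched with base boxplot(). The example assumes the figure was drawn from the singer data set that ships with the lattice package, which matches the voice-part labels but is not confirmed by the slide.

```r
# The singer data (heights of choral singers) ships with the lattice package
data("singer", package = "lattice")

# Formula interface: one box per voice part, drawn horizontally
b <- boxplot(height ~ voice.part, data = singer, horizontal = TRUE,
             las = 1, xlab = "Heights in Inches", ylab = "",
             main = "Singer Height by Voice Part")
```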

Page 12: R for data visualization and graphics

line charts

[Figure: "Monthly Airline Passengers" (Passengers (K) vs. Time, 1950 to 1960), shown in two versions; "UK Lung Cancer Deaths" by year, 1974 to 1980, with lines for Total, Male, and Female]
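
Both charts can be approximated with built-in time series. This sketch assumes the slide used AirPassengers and the ldeaths, mdeaths, and fdeaths series from the datasets package, which match the axis ranges shown.

```r
# AirPassengers is a monthly ts object (thousands of passengers)
plot(AirPassengers, ylab = "Passengers (K)", main = "Monthly Airline Passengers")

# ldeaths, mdeaths, fdeaths: monthly UK lung disease deaths (total, male, female)
ts.plot(ldeaths, mdeaths, fdeaths,
        gpars = list(xlab = "year", ylab = "deaths", lty = 1:3,
                     main = "UK Lung Cancer Deaths"))
legend("topright", legend = c("Total", "Male", "Female"), lty = 1:3)
```

Because these are ts objects, plot() and ts.plot() place the dates on the time axis without any extra work.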

Page 13: R for data visualization and graphics

time series

[Figure: "Season Decomposition of a Time Series" for Monthly Air Passengers, 1950 to 1960, with stacked panels for data, seasonal, trend, and remainder]
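
A decomposition with these four panels is what stl() in the stats package produces. The sketch below assumes a periodic seasonal window on the raw AirPassengers series; the slide does not state which settings were used.

```r
# s.window = "periodic" forces the seasonal component to repeat identically each year
fit <- stl(AirPassengers, s.window = "periodic")

# One panel each for data, seasonal, trend, and remainder
plot(fit, main = "Season Decomposition of a Time Series")

# The components are stored as columns of a multivariate time series
head(fit$time.series)
```

The three components add back up to the original series, so the remainder panel shows whatever the seasonal and trend components do not explain.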

Page 14: R for data visualization and graphics

scatterplots

[Figure: "Iris Data" scatterplot of Petal Length (cm) vs. Sepal Length (cm); "High Density Scatterplot (n=10,000)" of Y vs. X]
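
The two cases can be sketched as follows. The simulated data for the high-density plot is an assumption, since the slide does not say how its 10,000 points were generated, and smoothScatter() is one possible technique rather than necessarily the one used.

```r
# Basic scatterplot from the built-in iris data
plot(iris$Sepal.Length, iris$Petal.Length,
     xlab = "Sepal Length (cm)", ylab = "Petal Length (cm)", main = "Iris Data")

# Simulated data: 10,000 points overplot badly when drawn one by one
set.seed(1234)
n <- 10000
x <- c(rnorm(n / 2, mean = 0, sd = 1.5), rnorm(n / 2, mean = 4, sd = 2))
y <- c(rnorm(n / 2, mean = 0, sd = 2.5), rnorm(n / 2, mean = 5, sd = 3))

# smoothScatter() shades by estimated point density instead of drawing each point
smoothScatter(x, y, main = "High Density Scatterplot (n=10,000)")
```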

Page 15: R for data visualization and graphics

scatterplot matrix

[Figure: "Anderson's Iris Data -- 3 species", a scatterplot matrix of Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width]
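
A scatterplot matrix like this one can be drawn with base pairs(). The color choices below are illustrative assumptions.

```r
# One fill color per observation, chosen by species
species_cols <- c("red", "green3", "blue")[unclass(iris$Species)]

# pairs() plots every pair of the four measurement columns
pairs(iris[1:4], main = "Anderson's Iris Data -- 3 species",
      pch = 21, bg = species_cols)
```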

Page 16: R for data visualization and graphics

dot plot

[Figure: "MPG by Automobile", a dot plot of miles per gallon (10 to 30) for 32 car models, ordered from Cadillac Fleetwood to Toyota Corolla]
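
A sorted dot plot can be sketched with base dotchart(). The example assumes the built-in mtcars data, whose row names match the car models in the figure.

```r
# Sort so the dot plot reads from lowest to highest mileage
cars_sorted <- mtcars[order(mtcars$mpg), ]

dotchart(cars_sorted$mpg, labels = rownames(cars_sorted), cex = 0.7,
         xlab = "Miles per gallon", main = "MPG by Automobile")
```

Sorting before plotting is what makes the chart readable; in the default row order the dots form no visible pattern.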

Page 17: R for data visualization and graphics

contour plots

[Figure: "A Topographic Map of Maunga Whau", contour lines labeled 100 to 190, 10 Meter Contour Spacing; axes Meters North and Meters West]
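
The map can be approximated from the built-in volcano matrix, which holds elevations for Maunga Whau on a 10 m grid. The contour levels below are an assumption chosen to match the 10 meter spacing.

```r
# volcano is a built-in 87 x 61 matrix of elevations on a 10 m grid
x <- 10 * seq_len(nrow(volcano))
y <- 10 * seq_len(ncol(volcano))

# One contour line every 10 m of elevation
contour(x, y, volcano, levels = seq(90, 200, by = 10),
        xlab = "Meters North", ylab = "Meters West")
title("A Topographic Map of Maunga Whau", sub = "10 Meter Contour Spacing")
```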

Page 18: R for data visualization and graphics

LATTICE GRAPHICS

Page 19: R for data visualization and graphics

lattice graphs

• expands base graphics to include trellis plots
• seeks to improve on the graph defaults (symbols, axes, labels) of base graphics
• grouping
  – color, fill, line type can be mapped to variable values
• facets
  – subgroups can be plotted in an array based on the levels of (usually) one or two variables
• customizable panel functions allow fine-grained control of what is plotted in each facet
• comments
  – clean and fast
  – high degree of customization possible
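
Grouping, faceting, and a custom panel function can be combined as in the following sketch. The mtcars data, the variable choices, and the name my_panel are illustrative assumptions, not the presentation's own example.

```r
library(lattice)

# cyl as a factor gives one facet per cylinder count; am is the grouping variable
cars <- transform(mtcars,
                  cyl = factor(cyl, labels = c("4 cyl", "6 cyl", "8 cyl")),
                  am  = factor(am, labels = c("automatic", "manual")))

# Custom panel function: grouped points, a rug on the x axis,
# and a least-squares line fitted to all points in the facet
my_panel <- function(x, y, ...) {
  panel.xyplot(x, y, ...)
  panel.rug(x = x, col = "grey50")
  panel.lmline(x, y, lty = 2)
}

p <- xyplot(mpg ~ wt | cyl, data = cars, groups = am,
            panel = my_panel, auto.key = TRUE, layout = c(3, 1),
            xlab = "Weight (lb/1000)", ylab = "Miles per gallon")
print(p)
```

The panel function is called once per facet with only that facet's data, which is what gives per-facet control over the drawing.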

Page 20: R for data visualization and graphics

3D graphs with faceting

Page 21: R for data visualization and graphics

lattice graph with faceting and a customized panel function

Page 22: R for data visualization and graphics

GGPLOT2 GRAPHICS

Page 23: R for data visualization and graphics

ggplot2

• Grammar of Graphics
• graphs built up in layers by plotting "geoms"
• grouping
  – color, fill, shape, size can be mapped to variable values
• facets
  – subgroups can be plotted in an array based on the levels of (usually) one or two variables
• comments
  – allows you to create novel plots
  – can be slow for large problems
  – no 3D graphs
  – HOT!
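
The layered approach can be sketched as follows, assuming the ggplot2 package is installed. The mtcars data and the variable mappings are illustrative assumptions and are not taken from the presentation.

```r
library(ggplot2)

cars <- transform(mtcars,
                  cyl = factor(cyl),
                  am  = factor(am, labels = c("automatic", "manual")))

# Start from data plus aesthetic mappings, then add one layer at a time
p <- ggplot(cars, aes(x = wt, y = mpg, color = cyl)) +
  geom_point() +                              # layer 1: points geom
  geom_smooth(method = "lm", se = FALSE) +    # layer 2: fitted line per color group
  facet_wrap(~ am) +                          # one panel per transmission type
  labs(x = "Weight (lb/1000)", y = "Miles per gallon")
print(p)
```

Mapping color to cyl both colors the points and defines the groups for the fitted lines, so grouping needs no separate step.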

Page 24: R for data visualization and graphics

kernel density plots with grouping

Page 25: R for data visualization and graphics

histogram with faceting

[Figure: histograms of salary (count vs. salary), faceted by discipline (Theoretical, Applied) in columns and by rank (AsstProf, AssocProf, Prof) in rows]

Page 26: R for data visualization and graphics

boxplots

[Figure: box plots of salary by rank (AsstProf, AssocProf, Prof), faceted by discipline (Theoretical, Applied) and grouped by sex (Female, Male)]

Page 27: R for data visualization and graphics

jittered plots

Page 28: R for data visualization and graphics

scatter plot with smooth line

Page 29: R for data visualization and graphics

scatterplot with fit lines, grouping, and faceting

Page 30: R for data visualization and graphics

SPECIALIZED GRAPHS

Page 31: R for data visualization and graphics

visualizing missing data

[Figure: VIM package plot with two panels, "Number of missings" per variable and "Combinations" of missing values, for the variables BodyWgt, BrainWgt, NonD, Dream, Sleep, Span, Gest, Pred, Exp, and Danger; combination counts 42, 9, 3, 2, 2, 2, 1, 1]

Page 32: R for data visualization and graphics

scatterplot matrices

[Figure: car package scatterplot matrix of yrs.since.phd, yrs.service, and salary, with a histogram (Frequency) of each variable on the diagonal]

Page 33: R for data visualization and graphics

visualizing correlations

[Figure: corrplot package correlation matrix of mpg, cyl, disp, wt, hp, carb, qsec, gear, am, drat, and vs, with each correlation printed as a percentage and a color scale from -1 to 1]

variables reordered to find clusters

non-significant (.05) correlations indicated with an X

Page 34: R for data visualization and graphics

Heatmap

[Figure: stats package heatmap of Car Models (rows, Maserati Bora through Toyota Corona) by Specification Variables (columns: cyl, am, vs, carb, wt, drat, gear, qsec, mpg, hp, disp)]
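
A clustered heatmap of this kind can be sketched with heatmap() from the stats package. Standardizing the columns first is an assumption about how the figure was prepared; the slide does not show its code.

```r
# Scale each column so variables on large scales (e.g. disp) do not dominate
m <- scale(as.matrix(mtcars))

# heatmap() clusters rows and columns and reorders both by their dendrograms
hm <- heatmap(m, scale = "none", xlab = "Specification Variables",
              ylab = "Car Models", main = "Heatmap")

# The row and column orderings chosen by the clustering are returned invisibly
hm$colInd
```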

Page 35: R for data visualization and graphics

visualizing categorical data

[Figure: vcd package plots of counts by Sex (Male, Female), Survived (No, Yes), and Class (1st, 2nd, 3rd)]

Page 36: R for data visualization and graphics

visualizing effects (linear models)

2 x 3 ANCOVA

Page 37: R for data visualization and graphics

rank*sex effect plot

[Figure: effects package plot of salary (70000 to 130000) by rank (AsstProf, AssocProf, Prof), in separate panels for sex Female and sex Male]

rank by sex interaction (means), adjusting for other variables

Page 38: R for data visualization and graphics

visualizing effects (generalized linear models)

Logistic regression with 8 predictors

Page 39: R for data visualization and graphics

rating effects (prob) by gender, adjusting for other variables

effects package

Page 40: R for data visualization and graphics

3D Scatterplot

[Figure: scatterplot3d package plot, "Automobile Data", of Miles/(US) Gallon against Displacement (cu. in.) and Weight (lb/1000), with each point labeled by car model]

Page 41: R for data visualization and graphics

INTERACTIVE GRAPHICS

Page 42: R for data visualization and graphics

iplots

hold [Ctrl] and mouse over graph for info

Page 43: R for data visualization and graphics

rggobi

• GGobi is an open source visualization program for exploring high-dimensional data
• rggobi provides an R command line interface to GGobi

Installation
1. install GGobi: download from www.ggobi.org
2. in R: install.packages("rggobi")

see: http://www.ggobi.org/rggobi/introduction.pdf

Page 44: R for data visualization and graphics

[Figure: annotated screenshot of the GGobi interface]

• Display to open new windows
• Interaction to select, identify, or brush
• View to change type of xy plot
• right mouse to select

Page 45: R for data visualization and graphics

googleVis

• Provides access to Google Chart Tools
  – motion charts
  – annotated time lines
  – maps
  – other (e.g. line, bar, bubble, column, area, scatter, candlestick, pie, org charts)
  – https://developers.google.com/chart/
• output is html code containing data and references to JavaScript functions hosted by Google
• an internet connection is required to view the graphs

demo(WorldBank)

Hans Rosling in his TED talks

Page 46: R for data visualization and graphics

Page 47: R for data visualization and graphics

Shiny

• Package for building interactive web applications with R
  – homepage - http://www.rstudio.com/shiny/
  – examples - http://www.rstudio.com/shiny/showcase/
• Distribution
  – self hosted (requires free Shiny Server on Linux server)
  – RStudio hosted
  – distribute as a package

# install Shiny and the packages listed with it, then run a bundled example app
pkgs <- c("Rcpp", "httpuv", "shiny")
install.packages(pkgs)
library(shiny)
runExample("06_tabsets")

Page 48: R for data visualization and graphics

shiny example

Page 49: R for data visualization and graphics

RESOURCES

Page 50: R for data visualization and graphics

www.statmethods.net

Page 51: R for data visualization and graphics

Books

• R in Action, Robert I. Kabacoff
• R Graphics Cookbook, Winston Chang
• Lattice, Deepayan Sarkar
• ggplot2, Hadley Wickham

Page 52: R for data visualization and graphics

additional websites

• Cookbook for R: http://www.cookbook-r.com/
• ggplot2 documentation: http://docs.ggplot2.org/current/
• R-Bloggers: http://www.r-bloggers.com/