R for Data Visualization and Graphics
Rob Kabacoff, Ph.D.
Vice President of Research
Source code for presentation: http://tinyurl.com/Kabacoff-CS20
R is a Statistical and Graphical Platform
• Free
• Open source
• State-of-the-art data analysis
• Platform for programming new methods
• Runs on Windows, Linux, Mac OS X
• Enormous user base
• Reproducible research
R Homepage - http://www.r-project.org/
CRAN Mirrors - http://cran.r-project.org/
Data Input
R can import data from:
• Keyboard
• Text files: ASCII, XML, web scraping
• Statistical packages: SAS, SPSS, Stata
• Database management systems: SQL, MySQL, Oracle, Access
• Other: Excel, netCDF, HDF5
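As a minimal sketch of the text-file route, the snippet below reads comma-separated text with base R's read.csv(). The inline data and column names are invented for illustration; in practice the text = argument would be replaced by a file path or URL.

```r
# Hypothetical inline data standing in for an ASCII/CSV file on disk
csv_text <- "name,rank,salary
Ann,Prof,139750
Bob,AsstProf,79750
Cy,AssocProf,93000"

# read.csv() parses delimited text into a data frame
staff <- read.csv(text = csv_text, stringsAsFactors = FALSE)

str(staff)          # inspect the column types that were inferred
mean(staff$salary)  # numeric columns are immediately usable
```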
Statistical Methods
Classical Test Theory
Item Response Theory
Correspondence Analysis
Multidimensional Scaling
Meta Analysis
Structural Equation Modeling
Complex Survey Design
Time Series Analysis
Longitudinal Analysis
Social Network Analysis
Study of Mediation and Moderation
Power Analysis
Clinical Trials
Descriptive Statistics
Experimental Design
Linear, Generalized, Nonlinear, and Hierarchical Models
Analysis of Categorical Data
Nonparametric Analysis
Survival Analysis
Latent Variable Models
Bayesian Models
Missing Values Analysis
Cluster Analysis
Decision Trees
Data Mining
and …
Graphs!
[Figure: "University Salaries by Discipline": Salary vs. Years Since Ph.D., grouped by discipline (Theoretical, Applied)]
[Figure: "A Topographic Map of Maunga Whau": Meters West vs. Meters North, 10 Meter Contour Spacing]
[Figure: "Sinc( r )": 3D surface over X and Y]
[Figure: lat vs. long, in panels conditioned on depth ("Given : depth")]
[Figure: "Survival on the Titanic": mosaic plot of Age (Child, Adult), Sex (Female, Male), and Survived (Yes, No), shaded by Pearson residuals, p-value < 2e-16]
A High Level Tour
• General Systems
– base
– lattice
– ggplot2
• Interactive
– iplots
– rggobi
– googleVis
– Shiny
• Specialized
– vcd (categorical data)
– VIM (missing data)
– likert (Likert data)
– scatterplot3d (3-D scatterplot)
– car (regression)
– corrplot (correlations)
– (decision trees)
– (dendrograms)
– effects (glm/ANOVA)
3 complete graphics systems
[Figure: the same histogram of Salary (dollars) by Frequency drawn three ways: Base Graphics, Lattice Graphics, ggplot2 Graphics]
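To make the comparison concrete, here is a rough sketch of one histogram drawn in each of the three systems. The salary values are simulated for illustration and are not the data behind the slides; ggplot2 is assumed to be installed, while lattice ships with standard R distributions.

```r
library(lattice)   # included with standard R distributions
library(ggplot2)   # assumed installed: install.packages("ggplot2")

# Simulated salaries (illustrative only, not the data from the slides)
set.seed(42)
df <- data.frame(salary = rnorm(400, mean = 110000, sd = 30000))

# Base graphics: draws immediately on the current device
hist(df$salary, xlab = "Salary (dollars)", main = "Base Graphics")

# lattice: a formula interface that returns a trellis object; print() draws it
p_lattice <- histogram(~ salary, data = df, type = "count",
                       xlab = "Salary (dollars)", main = "Lattice Graphics")
print(p_lattice)

# ggplot2: a plot object assembled from layers; print() draws it
p_ggplot <- ggplot(df, aes(x = salary)) +
  geom_histogram(bins = 20) +
  labs(x = "Salary (dollars)", y = "Frequency", title = "ggplot2 Graphics")
print(p_ggplot)
```

The practical difference is that base functions draw as a side effect, whereas lattice and ggplot2 return objects that can be stored, modified, and printed later.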
BASE GRAPHICS
histograms
[Figure: three histograms of Salary (dollars): "Histogram with Rug Plot" (Density scale), "Histogram with Kernel Density Curve" (Density scale), and "Histogram with Normal Curve" (Frequency scale)]
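The three histogram variants can be sketched in base graphics roughly as follows. The data are simulated (an assumption, since the slides do not include the salary data), and all three overlays are combined on one density-scaled histogram for brevity.

```r
# Simulated salaries (illustrative only)
set.seed(1234)
salary <- rnorm(400, mean = 110000, sd = 30000)

# Density-scaled histogram so that curves can be overlaid on the same axis
h <- hist(salary, breaks = 12, freq = FALSE,
          xlab = "Salary (dollars)",
          main = "Histogram with Rug, Density, and Normal Curves")

# Rug plot: one tick per observation along the x axis
rug(jitter(salary))

# Kernel density estimate
lines(density(salary), col = "blue", lwd = 2)

# Normal curve with the sample mean and standard deviation
xfit <- seq(min(salary), max(salary), length.out = 200)
lines(xfit, dnorm(xfit, mean = mean(salary), sd = sd(salary)),
      col = "red", lty = 2)
```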
bar charts
box plots
[Figure: "Singer Height by Voice Part": horizontal box plots of Heights in Inches for Bass 2, Bass 1, Tenor 2, Tenor 1, Alto 2, Alto 1, Soprano 2, Soprano 1]
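A plot along these lines can be sketched with base boxplot(). This assumes the figure uses the singer data set that ships with the lattice package; the slides do not name the source.

```r
library(lattice)  # provides the singer data set (height, voice.part)

# Widen the left margin so the voice-part labels fit
op <- par(mar = c(5, 7, 4, 2))

# Horizontal box plots of height for each voice part
bp <- boxplot(height ~ voice.part, data = singer, horizontal = TRUE,
              las = 1, xlab = "Heights in Inches", ylab = "",
              main = "Singer Height by Voice Part")

par(op)  # restore the previous margins
```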
line charts
[Figure: "Monthly Airline Passengers" (two panels): Passengers (K) vs. Time, 1950 to 1960]
[Figure: "UK Lung Cancer Deaths": series for Total, Male, and Female by year, 1974 to 1980]
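Both kinds of line chart can be approximated with built-in time series. The choice of AirPassengers and of ldeaths, mdeaths, and fdeaths (which R documents as monthly deaths from lung diseases in the UK) is an assumption based on the figure titles.

```r
# Single series: monthly airline passenger totals, in thousands
plot(AirPassengers, type = "l",
     xlab = "Time", ylab = "Passengers (K)",
     main = "Monthly Airline Passengers")

# Several series on one set of axes
ts.plot(ldeaths, mdeaths, fdeaths,
        lty = 1:3, col = c("black", "blue", "red"),
        xlab = "year", main = "UK Lung Cancer Deaths")
legend("topright", legend = c("Total", "Male", "Female"),
       lty = 1:3, col = c("black", "blue", "red"))
```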
time series
[Figure: "Season Decomposition of a Time Series": Monthly Air Passengers shown in four panels (data, seasonal, trend, remainder) against time, 1950 to 1960]
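One way to produce a four-panel decomposition like this is stl() from the stats package. The slides do not say which function was used, so treat this as a sketch under that assumption.

```r
# Decompose the series into seasonal, trend, and remainder components
fit <- stl(AirPassengers, s.window = "periodic")

# Four stacked panels: data, seasonal, trend, remainder
plot(fit, main = "Season Decomposition of a Time Series")

# The components are stored as columns of a multivariate time series
head(fit$time.series)
```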
scatterplots
[Figure: "Iris Data": Petal Length (cm) vs. Sepal Length (cm)]
[Figure: "High Density Scatterplot (n=10,000)": Y vs. X]
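A basic scatterplot and one approach to overplotting can be sketched as below. The 10,000 points are simulated, and semi-transparent points are just one option for high-density data; the slides do not state the technique used.

```r
# Basic scatterplot from the built-in iris data
plot(iris$Sepal.Length, iris$Petal.Length, pch = 19,
     xlab = "Sepal Length (cm)", ylab = "Petal Length (cm)",
     main = "Iris Data")

# Simulated high-density data (illustrative only)
set.seed(1)
n <- 10000
x <- rnorm(n, sd = 3)
y <- x + rnorm(n, sd = 3)

# Semi-transparent points make dense regions appear darker
plot(x, y, pch = 16, col = rgb(0, 0, 1, alpha = 0.1),
     xlab = "X", ylab = "Y",
     main = "High Density Scatterplot (n=10,000)")
```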
scatterplot matrix
[Figure: "Anderson's Iris Data -- 3 species": scatterplot matrix of Sepal.Length, Sepal.Width, Petal.Length, Petal.Width]
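A matrix like this can be drawn with base pairs(). The colors below are an arbitrary choice, one per species.

```r
# One fill color per species, indexed by the factor's integer codes
species_cols <- c("red", "green3", "blue")[unclass(iris$Species)]

# Scatterplot matrix of the four measurement columns
pairs(iris[, 1:4], pch = 21, bg = species_cols,
      main = "Anderson's Iris Data -- 3 species")
```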
dot plot
[Figure: "MPG by Automobile": dot plot of miles per gallon (10 to 30) for 32 car models, ordered from Cadillac Fleetwood to Toyota Corolla]
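A sorted dot plot of this kind can be sketched with base dotchart(), assuming the figure is based on the built-in mtcars data.

```r
# Sort the cars by fuel economy so the dots form a readable profile
mt_sorted <- mtcars[order(mtcars$mpg), ]

# One labeled row per car; the first row is drawn at the bottom
dotchart(mt_sorted$mpg, labels = rownames(mt_sorted), cex = 0.7,
         xlab = "Miles per gallon", main = "MPG by Automobile")
```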
contour plots
[Figure: "A Topographic Map of Maunga Whau": contour plot, Meters West vs. Meters North, 10 Meter Contour Spacing]
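The contour map can be approximated from the built-in volcano matrix, a grid of heights for Maunga Whau sampled every 10 meters. The contour levels below are chosen to match the 10 meter spacing in the caption.

```r
# Grid coordinates in meters (the matrix is sampled on a 10 m grid)
x <- 10 * seq_len(nrow(volcano))
y <- 10 * seq_len(ncol(volcano))

# Contour lines every 10 meters of elevation
contour(x, y, volcano, levels = seq(90, 200, by = 10),
        xlab = "Meters North", ylab = "Meters West",
        main = "A Topographic Map of Maunga Whau")
mtext("10 Meter Contour Spacing", side = 3, line = 0.3, cex = 0.8)
```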
LATTICE GRAPHICS
lattice graphs
• expands base graphics to include trellis plots
• seeks to improve on the graph defaults (symbols, axes, labels) of base graphics
• grouping
– color, fill, line type can be mapped to variable values
• facets
– subgroups can be plotted in an array based on the levels of (usually) one or two variables
• customizable panel functions allow fine-grained control of what is plotted in each facet
• comments
– clean and fast
– high degree of customization possible
3D graphs with faceting
lattice graph with faceting and a customized panel function
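As a small sketch of faceting with a custom panel function in lattice, the example below conditions on the number of cylinders in mtcars. The data set and the panel contents are illustrative choices, not the ones shown on the slide.

```r
library(lattice)

# The panel function is called once per facet with that facet's x and y
my_panel <- function(x, y, ...) {
  panel.xyplot(x, y, ...)        # the points themselves
  panel.lmline(x, y, lty = 2)    # least-squares line within the facet
  panel.rug(x, y)                # marginal rugs on both axes
}

# One facet per level of cyl, laid out in a single row
p <- xyplot(mpg ~ wt | factor(cyl), data = mtcars,
            layout = c(3, 1), panel = my_panel,
            xlab = "Weight (lb/1000)", ylab = "Miles per gallon")
print(p)
```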
GGPLOT2 GRAPHICS
ggplot2
• Grammar of Graphics
• graphs built up in layers by plotting "geoms"
• grouping
– color, fill, shape, size can be mapped to variable values
• facets
– subgroups can be plotted in an array based on the levels of (usually) one or two variables
• comments
– allows you to create novel plots
– can be slow for large problems
– no 3D graphs
– HOT!
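The layering, grouping, and faceting ideas can be sketched in a few lines. This example uses mtcars rather than the salary data from the slides and assumes ggplot2 is installed.

```r
library(ggplot2)  # assumed installed: install.packages("ggplot2")

p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(am))) +
  geom_point() +                                            # layer 1: points
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) + # layer 2: fit lines per group
  facet_wrap(~ cyl) +                                       # one panel per cylinder count
  labs(x = "Weight (lb/1000)", y = "Miles per gallon",
       color = "Transmission (am)")
print(p)
```

Mapping color to factor(am) inside aes() is what creates the groups; each geom then inherits that grouping unless it is overridden.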
kernel density plots with grouping
histogram with faceting
[Figure: histograms of salary (count), faceted by rank (AsstProf, AssocProf, Prof) and discipline (Theoretical, Applied)]
boxplots
[Figure: box plots of salary by rank (AsstProf, AssocProf, Prof), grouped by sex (Female, Male) and faceted by discipline (Theoretical, Applied)]
jittered plots
scatter plot with smooth line
scatterplot with fit lines, grouping, and faceting
SPECIALIZED GRAPHS
visualizing missing data
[Figure: VIM package: number of missings per variable and combinations of missing values for BodyWgt, BrainWgt, NonD, Dream, Sleep, Span, Gest, Pred, Exp, Danger; combination counts 42, 9, 3, 2, 2, 2, 1, 1]
scatterplot matrices
[Figure: car package: scatterplot matrix of yrs.since.phd, yrs.service, and salary, with Frequency histograms on the diagonal]
visualizing correlations
[Figure: corrplot package: correlation matrix of mpg, cyl, disp, wt, hp, carb, qsec, gear, am, drat, vs, on a scale from -1 to 1]
variables reordered to find clusters
non-significant (.05) correlations indicated with an X
Heatmap (stats package)
[Figure: heatmap of Car Models (rows) by Specification Variables (columns cyl, am, vs, carb, wt, drat, gear, qsec, mpg, hp, disp)]
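A heatmap of this kind can be sketched with heatmap() from the stats package, assuming the figure is based on mtcars. Column scaling is used so that variables measured in different units are comparable.

```r
# Rows are car models, columns are specification variables
m <- as.matrix(mtcars)

# scale = "column" standardizes each variable before coloring;
# rows and columns are reordered by hierarchical clustering
hm <- heatmap(m, scale = "column", margins = c(5, 8),
              xlab = "Specification Variables", ylab = "Car Models")
```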
visualizing categorical data
[Figure: vcd package: bar charts of counts by Sex (Male, Female), Survived (No, Yes), and Class (1st, 2nd, 3rd)]
visualizing effects (linear models)
2 x 3 ANCOVA
[Figure: effects package: "rank*sex effect plot": salary by rank (AsstProf, AssocProf, Prof), in panels for sex Female and sex Male]
rank by sex interaction (means) adjusting for other variables
visualizing effects (generalized linear models)
Logistic regression with 8 predictors
rating effects (prob) by gender adjusting for other variables (effects package)
3D Scatterplot: Automobile Data
[Figure: scatterplot3d package: Miles/(US) Gallon vs. Displacement (cu. in.) and Weight (lb/1000), with each point labeled by car model]
INTERACTIVE GRAPHICS
iplots
hold [Ctrl] and mouse over graph for info
rggobi
• GGobi is an open source visualization program for exploring high-dimensional data
• rggobi provides an R command line interface to GGobi
Installation
1. install GGobi: download from www.ggobi.org
2. in R: install.packages("rggobi")
see: http://www.ggobi.org/rggobi/introduction.pdf
Display to open new windows
Interaction to select, identify, or brush
View to change type of xy plot
right mouse to select
googleVis
• Provides access to Google Chart Tools
– motion charts
– annotated time lines
– maps
– other (e.g. line, bar, bubble, column, area, scatter, candlestick, pie, org charts)
– https://developers.google.com/chart/
• output is html code containing data and references to JavaScript functions hosted by Google
• an internet connection is required to view the graphs
demo(WorldBank)
Hans Rosling in his TED talks
Shiny
• Package for building interactive web applications with R
– homepage - http://www.rstudio.com/shiny/
– examples - http://www.rstudio.com/shiny/showcase/
• Distribution
– self hosted (requires free Shiny Server on Linux server)
– RStudio hosted
– distribute as a package
pkgs <- c("Rcpp", "httpuv", "shiny")
install.packages(pkgs)
library(shiny)
runExample("06_tabsets")
shiny example
RESOURCES
www.statmethods.net
Books
R in Action, Robert I. Kabacoff
R Graphics Cookbook, Winston Chang
Lattice, Deepayan Sarkar
ggplot2, Hadley Wickham
additional websites
• Cookbook for R http://www.cookbook-r.com/
• ggplot2 documentation http://docs.ggplot2.org/current/
• R-Bloggers http://www.r-bloggers.com/