multivariate data visualization. multivariate data graphical display data graphics visually display...
TRANSCRIPT
Multivariate data visualization
multivariate dataGraphical display
Data graphics visually display measured quantities by means of the combined use of points, lines, a coordinate system, numbers, symbols, words, shading and color.
“Humans are good at discerning subtle patterns that are really there, but equally so at imagining them when they are altogether absent.” Carl Sagan
“There is no statistical tool that is as powerful as a well-chosen graph.” Chambers, Cleveland, Kleiner, and Tukey
multivariate dataGraphical display
Advantages of graphical methods:
-In comparison with other types of presentation, well-design charts are more effective in creating interest and in appealing to the attention of the reader.
-Visual relationships as portrayed by charts and graphs are more easily grasped and more easily remembered.
-The use of charts and graphs saves time since the essential meaning of large measures of statistical data can be visualized at a glance.
-Charts and graphs provide a comprehensive picture of a problem that makes for a more complete and better balanced understanding that could be derived from tabular or textual forms of presentation.
-Charts and graphs can bring out hidden facts and relationships and can stimulate, as well as aid, analytical thinking and investigation.
multivariate dataGraphical display
Goals for graphical displays of data:
• To provide an overview
•To tell a story
•To suggest hypotheses
•To criticize a model
multivariate dataAir pollution data
Air pollution in 41 cities in the USA.R data “USairpollution”
Variables:
SO2: SO2 content of air in micrograms per cubic metertemp: average annual temperature in degrees Fahrenheitmanu: number of manufacturing enterprises employing 20 or more workerspopul: population size (1970 census) in thousandswind: average annual wind speed in miles per hourprecip: average annual precipitation in inchespredays: average number of days with precipitation per year
multivariate dataScatterplot
The relational graphic - the greatest of all graphical designs.It links at list two variables encouraging the viewer to assess the possible causal relationship between the plotted variables.
The standard for representing continuous bivariate data but it can be enhanced in a variety of ways to accommodate information about other variables.
Build a basic scatterplot of the two variables “manu” and “popul”in USairpollution data.
multivariate dataScatterplot
# set up the labels for the two variables:
> mlab="Manufacturing enterprises with 20 or more workers"> plab="Population size (1970 census) in thousands“
#plot manu vs. popul :
> plot(popul~manu, data=USairpollution,xlab=mlab,ylab=plab)
# read the date Usairpollution:
>USairpollution=read.csv("E:/Multivariate_analysis/Data/USairpollution.csv",header=T)
multivariate dataScatterplot
0 500 1000 1500 2000 2500 3000
05
00
15
00
25
00
35
00
Manufacturing enterprises with 20 or more workers
Po
pu
latio
n s
ize
(1
97
0 c
en
sus)
in t
ho
usa
nd
s
multivariate dataScatterplot
We construct a scatter plot that shows the marginal distribution of manu and popul in two ways:
1. Marginal distribution as rug plots on each axis:
>plot(popul~manu,data=USairpollution,xlab=mlab,ylab=plab)
>rug(USairpollution$manu,side=1)
> rug(USairpollution$popul,side=2)
multivariate dataScatterplot
Rugs
0 500 1000 1500 2000 2500 3000
05
00
15
00
25
00
35
00
Manufacturing enterprises with 20 or more workers
Po
pu
latio
n s
ize
(1
97
0 c
en
sus)
in t
ho
usa
nd
s
multivariate dataScatterplot
# set the plot margins>par(mar=c(2,2,2,2))
# set the limit of the x axis> xlim=with(USairpollution,range(manu))*1.1
# set the display of the plots > layout(matrix(c(1,3,2,0),nrow=2,byrow=TRUE), widths=c(2,1),heights=c(2,1),respect=TRUE)
2. Marginal distribution of manu as a histogram and that of popul as boxplot:
# plot popul vs manu>plot(popul~manu,data=USairpollution,cex.lab=0.7, xlab=mlab,ylab=plab,xlim=xlim)
# add labels for the points on the scatterplot > with(USairpollution, text(manu,popul,cex=0.6, labels=abbreviate(City)))
# add histogram for manu along the x axis> with(USairpollution,hist(manu,xlim=xlim))
# add a boxplot for popul along the y axis> with(USairpollution,boxplot(popul))
multivariate dataScatterplot
multivariate dataBivariate boxplot
Based on calculating “robust” measures of location, scale, and correlation.Consists of a pair of concentric ellipse: - Hinge (include 50% of the data )- Fence (delimitates potentially troublesome outliers)
Resistant regression lines of both y on x and x on y are shown, with the intersection showing the bivariate location estimator.
The acute angle between the regression lines will be small for a large absolute value of correlations and large for a small one.
multivariate dataBivariate boxplot
# select row number for marginal cities in USairpollution>lab=c(“Houston”,"Chicago","Detroit","Cleveland","Philadelphia")>outcity=match(lab, USairpollution[,1])
# subset data for manu and popul variables > x=USairpollution[,c("manu","popul")]# draw the bivariate boxplot # need packages: “MVA” and HSAUR2” for bvbox
> bvbox(x,main="USairpollution",xlab=mlab,ylab=plab)
# add text for the outcity > text(x$manu[outcity],x$popul[outcity],labels=lab,cex=0.7, pos=c(2,2,4,2,2))
0 1000 2000 3000
01
00
02
00
03
00
0
USairpollution
Manufacturing enterprises with 20 or more workers
Po
pu
latio
n s
ize
(1
97
0 c
en
sus)
in t
ho
usa
nd
s
Chicago
Detroit
Cleveland
Philadelphia
Houston
Hindge
Fence
Regression lines
multivariate dataBivariate boxplot
Scatter plots should always be consulted when calculating correlation coefficients because the presence of outliers can on occasion considerably distort the value of a correlation coefficient.
The observations identified as outliers may then be excluded from the calculation of the correlation coefficients.
The bivariate boxplot identifies Chicago, Philadelphia, Detroit, and Cleveland as outliers in the scatterplot of manu and popul.
multivariate dataR output# correlation between manu and popul variables
> with(USairpollution,cor(manu,popul)) [1] 0.9552693 # correlation coefficient
# remove the outliers
>outcity=match(c("Chicago","Detroit","Cleveland","Philadelphia"),USairpollution[,1])
# correlation between manu and popul with the outliers removed
>with(USairpollution,cor(manu[-outcity],popul[-outcity])) [1] 0.7955549 # correlation coefficient
multivariate dataThe convex hull of bivariate data
An alternative approach to using the scatterplot combined with the bivariate boxplot.
Gives a robust estimation of correlation.
The convex hull of a set of bivariate observations consists of the vertices of the smallest convex polyhedron in variable space within which or on which all data points lie.
Removal of the points lying on the convex hull can eliminate isolated outliers without disturbing the general shape of the bivariate distribution.
multivariate dataR output
# scatterplot of manu and popul in USairpollution data >plot(popul~manu,data=USairpollution,cex.lab=0.7, xlab=mlab,ylab=plab,xlim=xlim)
# extract the hull from manu and popul data>hull=with(USairpollution,chull(manu,popul)) [1] 9 15 41 6 2 18 16 14 7
# plot the convex hull >with(USairpollution,polygon(manu[hull],popul[hull],density=15,angle=30))
# correlation of the manu and popul with the hull removed> with(USairpollution,cor(manu[-hull],popul[-hull])) [1] 0.9225267
multivariate dataThe chi-plot
Suggested by Fisher and Switzer (1985,2001). The chi-plot is designed to address the problem of independence. It transforms the measurements (x11 ,..., xn1) and (x12,..., xn2) into values ( ,…, ) and ( ,…, ), which, plotted in a scatterplot, can be used to detect deviations form independence.
Under independence the values should show a non-systematic random fluctuation around zero; the values measure the distance of unit i from the “center” of the bivariate distribution.
1 n1 n
ii
multivariate dataThe chi-plot
# scatter plot for manu and popul>with(USairpollution,plot(manu,popul, xlab=mlab,ylab=plab,cex.lab=0.9))
# need package “asbio”> with(USairpollution,chi.plot(manu,popul))
multivariate dataThe chi-plot
Departure from independence is indicated in the second pot by a lack of points in the horizontal band indicated on the plot.
multivariate dataThe bubble and other glyph plots
Bubble plot – three variables are displayed: two are used to form the scatter plot itself and the value of the third variable are represented by circles with radii proportional to these values and centered on the appropriate plot in the scatterplot.
Make a bubble plot for temp, wind, and SO2 in “USairpollution” data.
The plot seems to suggest that cities with moderate annual temperature and moderate annual wind speeds tend to suffer the greatest air pollution, but does not tell us anything about the rest of variables.
Bubble plot
# set ylim for wind >ylim=with(USairpollution,range(wind)*c(0.95,1))
# scatterplot of wind and temp>plot(wind~temp,data=USairpollution,xlab="Average annual temperature(Fahrenheit)",ylab="Average annual wind speed (m.p.h.)",pch=10,ylim=ylim)
# plot SO2 as circles>with(USairpollution,symbols(temp,wind,circles=SO2, inches=0.5,add=TRUE))
The bubble and other glyph plots
multivariate data
Bubble plot
The bubble and other glyph plots
multivariate data
Star plot# star plot for all variables in USairpollution>stars(USairpollution,labels=as.character(USairpollution$City),cex=0.45)
The bubble and other glyph plots
Make a star plot for all variables in “USairpollution” as a seven-sided “star” with the lengths of each side representing each of the seven variables. The plot does not communicate much useful information about the data but it shows which units have similar shapes.
multivariate data
Star plot
The bubble and other glyph plots
multivariate dataThe scatterplot matrix
A scatterplot matrix is a square, symmetric grid of bivariate scatterplots.
The grid has q rows and columns, each one corresponding to a different variable.
Each of the grid’s cells shows a scatterplot of two variables. Variable j is plotted against variable i in the ij cell, and the same variables appear in the cell ji, with the x- and y- axes of the scatterplots interchanged.
multivariate data
# plot the grid pairs(USairpollution[,-1],cex.labels=NULL,font.labels=0.1,gap=0.1,pch=".",cex=0.00001)
# plot the grid and the regression lines >pairs(USairpollution[,-1], pch=".",cex=0.1,gap=0,font.labels=0.001, panel=function(x,y,...){ points(x,y,...) abline(lm(y~x),col="grey") })
The scatterplot matrix
multivariate data
SO2
45 65 0 2000 10 40
20
4575
temp
manu
0
035
00
popul
wind
612
1060
precip
20 80 0 2000 6 9 40 120
40
predays
The scatterplot matrix
multivariate data
The scatterplot matrix
SO2
45 60 75 0 2000 10 40
20
80
45
65
temp
manu
02500
02500
popul
wind
69
10
50
precip
20 80 0 2000 6 9 12 40 120
40
140
predays
# three dimensional plot for hips, chest, waist in Measure data > library(scatterplot3d)>with(Measure,scatterplot3d(chest,waist,hips, pch=(1:2)[gender],type="h“, angle=55))
multivariate dataThree-dimensional plots
32 34 36 38 40 42 4432
34
36
38
40
42
2224
2628
3032
34
chest
wa
ist
hip
s
The three dimensional plot of chest, waist, and hips in “Measure” data suggests the presence of two separate groups of point corresponding to the males and females in the data.
# three dimensional plot for temp, wind, and SO2> library(scatterplot3d)>with(USairpollution,scatterplot3d(temp,wind,SO2, type="h“, angle=55))
40 45 50 55 60 65 70 75 80 0 2
0 40 6
0 801
00120
6 7
8 910
1112
13
temp
win
d
SO
2
multivariate dataThree-dimensional plots
The plot of temp, wind, and SO2 from the “USairpollution” datashows that two observations with high SO2 levels stand out.
multivariate dataTrellis graphics
Trellis graphics is an approach to examining high-dimensional structure in data by means of one-, tow-, and three-dimensional graphs. The problem addressed is how observations of one or more variables depend on the observations of the other variables.
Multiple conditioning that allows some type of plot to be displayed for different values of a given variable (or variables).
multivariate data
# scatterplot of SO2 and temp for light and high winds> library(lattice)> plot(xyplot(SO2~temp|cut(wind,2),data=USairpollution))
temp
SO
2
20406080
100
50 60 70
(5.99,9.35]
50 60 70
(9.35,12.7]
Trellis graphics
Scatterplots of SO2 and temp conditioned on values of wind divided in two equalparts.
multivariate data
# three-dimensional plots of temp, wind, and precip conditioned on four levels of SO2. >pollution=with(USairpollution,equal.count(SO2,4)) >plot(cloud(precip~temp*wind|pollution,panel.aspect=0.9,data=USairpollution))
Trellis graphics
Not very useful for small data sets.