multivariate data visualization. multivariate data graphical display data graphics visually display...

Multivariate data visualization

multivariate dataGraphical display

Data graphics visually display measured quantities by means of the combined use of points, lines, a coordinate system, numbers, symbols, words, shading and color.

“Humans are good at discerning subtle patterns that are really there, but equally so at imagining them when they are altogether absent.” Carl Sagan

“There is no statistical tool that is as powerful as a well-chosen graph.” Chambers, Cleveland, Kleiner, and Tukey


Advantages of graphical methods:

-In comparison with other types of presentation, well-design charts are more effective in creating interest and in appealing to the attention of the reader.

-Visual relationships as portrayed by charts and graphs are more easily grasped and more easily remembered.

-The use of charts and graphs saves time since the essential meaning of large measures of statistical data can be visualized at a glance.

-Charts and graphs provide a comprehensive picture of a problem that makes for a more complete and better balanced understanding that could be derived from tabular or textual forms of presentation.

-Charts and graphs can bring out hidden facts and relationships and can stimulate, as well as aid, analytical thinking and investigation.


Goals for graphical displays of data:

• To provide an overview

•To tell a story

•To suggest hypotheses

•To criticize a model

multivariate dataAir pollution data

Air pollution in 41 cities in the USA.R data “USairpollution”

Variables:

SO2: SO2 content of air in micrograms per cubic metertemp: average annual temperature in degrees Fahrenheitmanu: number of manufacturing enterprises employing 20 or more workerspopul: population size (1970 census) in thousandswind: average annual wind speed in miles per hourprecip: average annual precipitation in inchespredays: average number of days with precipitation per year

multivariate dataScatterplot

The relational graphic - the greatest of all graphical designs.It links at list two variables encouraging the viewer to assess the possible causal relationship between the plotted variables.

The standard for representing continuous bivariate data but it can be enhanced in a variety of ways to accommodate information about other variables.

Build a basic scatterplot of the two variables “manu” and “popul”in USairpollution data.


# set up the labels for the two variables:

> mlab="Manufacturing enterprises with 20 or more workers"> plab="Population size (1970 census) in thousands“

#plot manu vs. popul :

> plot(popul~manu, data=USairpollution,xlab=mlab,ylab=plab)

# read the date Usairpollution:

>USairpollution=read.csv("E:/Multivariate_analysis/Data/USairpollution.csv",header=T)


0 500 1000 1500 2000 2500 3000

05

00

15

00

25

00

35

00

Manufacturing enterprises with 20 or more workers

Po

pu

latio

n s

ize

(1

97

0 c

en

sus)

in t

ho

usa

nd

s


We construct a scatter plot that shows the marginal distribution of manu and popul in two ways:

1. Marginal distribution as rug plots on each axis:

>plot(popul~manu,data=USairpollution,xlab=mlab,ylab=plab)

>rug(USairpollution$manu,side=1)

> rug(USairpollution$popul,side=2)


Rugs

0 500 1000 1500 2000 2500 3000

05

00

15

00

25

00

35

00


Po

pu

latio

n s

ize

(1

97

0 c

en

sus)

in t

ho

usa

nd

s


# set the plot margins>par(mar=c(2,2,2,2))

# set the limit of the x axis> xlim=with(USairpollution,range(manu))*1.1

# set the display of the plots > layout(matrix(c(1,3,2,0),nrow=2,byrow=TRUE), widths=c(2,1),heights=c(2,1),respect=TRUE)

2. Marginal distribution of manu as a histogram and that of popul as boxplot:

# plot popul vs manu>plot(popul~manu,data=USairpollution,cex.lab=0.7, xlab=mlab,ylab=plab,xlim=xlim)

# add labels for the points on the scatterplot > with(USairpollution, text(manu,popul,cex=0.6, labels=abbreviate(City)))

# add histogram for manu along the x axis> with(USairpollution,hist(manu,xlim=xlim))

# add a boxplot for popul along the y axis> with(USairpollution,boxplot(popul))


multivariate dataBivariate boxplot

Based on calculating “robust” measures of location, scale, and correlation.Consists of a pair of concentric ellipse: - Hinge (include 50% of the data )- Fence (delimitates potentially troublesome outliers)

Resistant regression lines of both y on x and x on y are shown, with the intersection showing the bivariate location estimator.

The acute angle between the regression lines will be small for a large absolute value of correlations and large for a small one.


# select row number for marginal cities in USairpollution>lab=c(“Houston”,"Chicago","Detroit","Cleveland","Philadelphia")>outcity=match(lab, USairpollution[,1])

# subset data for manu and popul variables > x=USairpollution[,c("manu","popul")]# draw the bivariate boxplot # need packages: “MVA” and HSAUR2” for bvbox

> bvbox(x,main="USairpollution",xlab=mlab,ylab=plab)

# add text for the outcity > text(x$manu[outcity],x$popul[outcity],labels=lab,cex=0.7, pos=c(2,2,4,2,2))

0 1000 2000 3000

01

00

02

00

03

00

0

USairpollution


Po

pu

latio

n s

ize

(1

97

0 c

en

sus)

in t

ho

usa

nd

s

Chicago

Detroit

Cleveland

Philadelphia

Houston

Hindge

Fence

Regression lines


Scatter plots should always be consulted when calculating correlation coefficients because the presence of outliers can on occasion considerably distort the value of a correlation coefficient.

The observations identified as outliers may then be excluded from the calculation of the correlation coefficients.

The bivariate boxplot identifies Chicago, Philadelphia, Detroit, and Cleveland as outliers in the scatterplot of manu and popul.

multivariate dataR output# correlation between manu and popul variables

> with(USairpollution,cor(manu,popul)) [1] 0.9552693 # correlation coefficient

# remove the outliers

>outcity=match(c("Chicago","Detroit","Cleveland","Philadelphia"),USairpollution[,1])

# correlation between manu and popul with the outliers removed

>with(USairpollution,cor(manu[-outcity],popul[-outcity])) [1] 0.7955549 # correlation coefficient

multivariate dataThe convex hull of bivariate data

An alternative approach to using the scatterplot combined with the bivariate boxplot.

Gives a robust estimation of correlation.

The convex hull of a set of bivariate observations consists of the vertices of the smallest convex polyhedron in variable space within which or on which all data points lie.

Removal of the points lying on the convex hull can eliminate isolated outliers without disturbing the general shape of the bivariate distribution.

multivariate dataR output

# scatterplot of manu and popul in USairpollution data >plot(popul~manu,data=USairpollution,cex.lab=0.7, xlab=mlab,ylab=plab,xlim=xlim)

# extract the hull from manu and popul data>hull=with(USairpollution,chull(manu,popul)) [1] 9 15 41 6 2 18 16 14 7

# plot the convex hull >with(USairpollution,polygon(manu[hull],popul[hull],density=15,angle=30))

# correlation of the manu and popul with the hull removed> with(USairpollution,cor(manu[-hull],popul[-hull])) [1] 0.9225267

multivariate dataThe chi-plot

Suggested by Fisher and Switzer (1985,2001). The chi-plot is designed to address the problem of independence. It transforms the measurements (x11 ,..., xn1) and (x12,..., xn2) into values ( ,…, ) and ( ,…, ), which, plotted in a scatterplot, can be used to detect deviations form independence.

Under independence the values should show a non-systematic random fluctuation around zero; the values measure the distance of unit i from the “center” of the bivariate distribution.

1 n1 n

ii


# scatter plot for manu and popul>with(USairpollution,plot(manu,popul, xlab=mlab,ylab=plab,cex.lab=0.9))

# need package “asbio”> with(USairpollution,chi.plot(manu,popul))


Departure from independence is indicated in the second pot by a lack of points in the horizontal band indicated on the plot.

multivariate dataThe bubble and other glyph plots

Bubble plot – three variables are displayed: two are used to form the scatter plot itself and the value of the third variable are represented by circles with radii proportional to these values and centered on the appropriate plot in the scatterplot.

Make a bubble plot for temp, wind, and SO2 in “USairpollution” data.

The plot seems to suggest that cities with moderate annual temperature and moderate annual wind speeds tend to suffer the greatest air pollution, but does not tell us anything about the rest of variables.

Bubble plot

# set ylim for wind >ylim=with(USairpollution,range(wind)*c(0.95,1))

# scatterplot of wind and temp>plot(wind~temp,data=USairpollution,xlab="Average annual temperature(Fahrenheit)",ylab="Average annual wind speed (m.p.h.)",pch=10,ylim=ylim)

# plot SO2 as circles>with(USairpollution,symbols(temp,wind,circles=SO2, inches=0.5,add=TRUE))

The bubble and other glyph plots

multivariate data

Bubble plot


multivariate data

Star plot# star plot for all variables in USairpollution>stars(USairpollution,labels=as.character(USairpollution$City),cex=0.45)


Make a star plot for all variables in “USairpollution” as a seven-sided “star” with the lengths of each side representing each of the seven variables. The plot does not communicate much useful information about the data but it shows which units have similar shapes.

multivariate data

Star plot


multivariate dataThe scatterplot matrix

A scatterplot matrix is a square, symmetric grid of bivariate scatterplots.

The grid has q rows and columns, each one corresponding to a different variable.

Each of the grid’s cells shows a scatterplot of two variables. Variable j is plotted against variable i in the ij cell, and the same variables appear in the cell ji, with the x- and y- axes of the scatterplots interchanged.

multivariate data

# plot the grid pairs(USairpollution[,-1],cex.labels=NULL,font.labels=0.1,gap=0.1,pch=".",cex=0.00001)

# plot the grid and the regression lines >pairs(USairpollution[,-1], pch=".",cex=0.1,gap=0,font.labels=0.001, panel=function(x,y,...){ points(x,y,...) abline(lm(y~x),col="grey") })

The scatterplot matrix

multivariate data

SO2

45 65 0 2000 10 40

20

4575

temp

manu

0

035

00

popul

wind

612

1060

precip

20 80 0 2000 6 9 40 120

40

predays


multivariate data


SO2

45 60 75 0 2000 10 40

20

80

45

65

temp

manu

02500

02500

popul

wind

69

10

50

precip

20 80 0 2000 6 9 12 40 120

40

140

predays

# three dimensional plot for hips, chest, waist in Measure data > library(scatterplot3d)>with(Measure,scatterplot3d(chest,waist,hips, pch=(1:2)[gender],type="h“, angle=55))

multivariate dataThree-dimensional plots

32 34 36 38 40 42 4432

34

36

38

40

42

2224

2628

3032

34

chest

wa

ist

hip

s

The three dimensional plot of chest, waist, and hips in “Measure” data suggests the presence of two separate groups of point corresponding to the males and females in the data.

# three dimensional plot for temp, wind, and SO2> library(scatterplot3d)>with(USairpollution,scatterplot3d(temp,wind,SO2, type="h“, angle=55))

40 45 50 55 60 65 70 75 80 0 2

0 40 6

0 801

00120

6 7

8 910

1112

13

temp

win

d

SO

2

multivariate dataThree-dimensional plots

The plot of temp, wind, and SO2 from the “USairpollution” datashows that two observations with high SO2 levels stand out.

multivariate dataTrellis graphics

Trellis graphics is an approach to examining high-dimensional structure in data by means of one-, tow-, and three-dimensional graphs. The problem addressed is how observations of one or more variables depend on the observations of the other variables.

Multiple conditioning that allows some type of plot to be displayed for different values of a given variable (or variables).

multivariate data

# scatterplot of SO2 and temp for light and high winds> library(lattice)> plot(xyplot(SO2~temp|cut(wind,2),data=USairpollution))

temp

SO

2

20406080

100

50 60 70

(5.99,9.35]

50 60 70

(9.35,12.7]

Trellis graphics

Scatterplots of SO2 and temp conditioned on values of wind divided in two equalparts.

multivariate data

# three-dimensional plots of temp, wind, and precip conditioned on four levels of SO2. >pollution=with(USairpollution,equal.count(SO2,4)) >plot(cloud(precip~temp*wind|pollution,panel.aspect=0.9,data=USairpollution))

Trellis graphics

Not very useful for small data sets.

multivariate data visualization. multivariate data graphical display data graphics visually display...

Documents

populin usairpollution

graphical displays of

population plab

variables manu

continuous bivariate

workers plab

date usairpollution

population size