Visualization - Statistical Methods Sarah Filippi, University of Oxford 20 October 2015 Michaelmas Term 2015

First step

ALL good statistical data analysis begins with graphical plots and summary statistics of the data.

ALWAYS, ALWAYS, ALWAYS, PLOT YOUR DATA!!!

Graphics reveal data and communicate complex ideas and dependencies with clarity, precision and efficiency.

Graphical excellence

Excellent graphics:

- show the data
- induce the viewer to think about the substance
- avoid bias
- make large complex data sets coherent
- encourage data exploration and debate

Categorical Data

Let’s start by not using one common graph type:

"There is no data that can be displayed in a pie chart that cannot be displayed BETTER in some other type of chart." (J. W. Tukey)

And let’s not even think about 3D and exploded pie charts.

What’s the matter with pie charts:

- people are not good at interpreting areas
- small and large slices are relatively distorted
- zero is often a very meaningful number but gets lost
- it is very hard to compare two pie charts

Barplots are usually a much better choice: barplot(height)

Suppose we have a few ordinal or categorical variables: the interesting question is then how they vary together. Here is a cross-tabulation (a contingency table) of the caffeine consumption (in mg/day) of women in a maternity ward by marital status.

                  0   1-150   151-300   300+
Married         652    1537       598    242
Prev. married    36      46        38     21
Single          218     327       106     67

The next two slides show two graphical representations: a set of barplots (aka bar charts) and a mosaic plot. Different versions of these plots, and other plots for categorical data, can be found in package vcd.
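As a rough sketch, both displays can be produced in base R from the table above. The object names caff and caff_prop are assumptions introduced here for illustration; barplot() and mosaicplot() are base R functions, whereas the slides point to package vcd for richer versions.

```r
# Caffeine consumption (mg/day) by marital status, entered from the table above.
# The names `caff` and `caff_prop` are hypothetical; they are not in the slides.
caff <- as.table(matrix(c(652, 1537, 598, 242,
                           36,   46,  38,  21,
                          218,  327, 106,  67),
                        nrow = 3, byrow = TRUE,
                        dimnames = list(status   = c("Married", "Prev. married", "Single"),
                                        caffeine = c("0", "1-150", "151-300", "300+"))))

# Row proportions: the distribution of caffeine consumption within each status.
caff_prop <- prop.table(caff, margin = 1)

# barplot() draws one group of bars per column, so transpose to group by status.
barplot(t(caff), beside = TRUE, legend.text = TRUE, ylab = "Count")
barplot(t(caff_prop), beside = TRUE, legend.text = TRUE, ylab = "Proportion")

# Mosaic plot: tile areas are proportional to the cell counts.
mosaicplot(caff, main = "Caffeine consumption by marital status")
```

Plotting proportions within each status removes the effect of the very different group sizes, which dominates the barplot of raw counts.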

[Figure: barplots of caffeine consumption (legend: 0, 0-150, 150-300, 300+) by marital status (Married, Prev. married, Single), one on a count scale (0 to 1400) and one on a proportion scale (0.0 to 1.0).]
A special case of a mosaic plot is sometimes called a spineplot.

[Figure: spineplot of caffeine consumption by marital status, with a vertical axis from 0.0 to 1.0.]

Barplots

Barplots should not (!) be used to compare distributions of data across groups.

Box-and-whisker Plots (boxplots)

[Figure: annotated boxplot (axis from 2000 to 4000) marking the median, 1st quartile, 3rd quartile, lower whisker, upper whisker and outliers.]

There are about as many variations as software designers.
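To see which variation R uses, the components of a boxplot can be inspected with boxplot.stats(). This is a minimal sketch on simulated data (the data set behind the slide's figure is not given, so the values below are an assumption).

```r
# Simulated data (an assumption), with two extreme values appended
# so that some points fall beyond the whiskers.
set.seed(1)
x <- c(rnorm(100, mean = 3000, sd = 300), 1500, 4800)

# boxplot.stats() returns what boxplot() draws:
#   $stats: lower whisker, lower hinge, median, upper hinge, upper whisker
#   $out:   points beyond the whiskers, drawn individually as outliers
bs <- boxplot.stats(x)
print(bs$stats)
print(bs$out)

# By default the whiskers extend to the most extreme data point that lies
# within 1.5 times the box length from the box; the `range` argument changes this.
boxplot(x, ylab = "x")
```

R draws the box at Tukey's hinges, which are close to the quartiles but not always identical to them; other software may use different conventions.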

Parallel box plots are often useful to show the differences between subgroups of the data.

[Figure: parallel boxplots of infant mortality (0 to 150) by GDP group: (0,100], (100,1000], (1000,1e+04], (1e+04,1e+05].]
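A plot of this kind might be built as follows. The data here are simulated stand-ins for the UN data (an assumption, as are the names gdp, infant.mortality and gdp_group); only the grouping with cut() and the formula interface of boxplot() are the point of the sketch.

```r
# Simulated stand-in for the UN data (an assumption: the real data set is not used).
set.seed(2)
gdp <- exp(runif(200, log(40), log(40000)))
infant.mortality <- pmax(2, 400 / sqrt(gdp) + rnorm(200, sd = 8))

# Group GDP into the four intervals shown on the figure's axis.
gdp_group <- cut(gdp, breaks = c(0, 100, 1000, 1e4, 1e5))

# One box per GDP group, drawn side by side on a common scale.
boxplot(infant.mortality ~ gdp_group, xlab = "GDP", ylab = "Infant mortality")
```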

Violin plots

Violin plots replace the representation in a boxplot by a variable-width box determined by a density estimate.

See e.g. the help for function vioplot() in the package of that name or the help for function panel.violin() in package lattice.

[Figure: violin plots on a 0 to 150 scale for the GDP groups (0,100], (100,1000], (1000,1e+04], (1e+04,1e+05].]

[Figure: infant.mortality (0 to 150) on the horizontal axis for each GDP group, from (0,100] to (1e+04,1e+05], with outlying points marked.]

Histograms

- Very convenient for studying the shape of the distribution of the data.
- We can choose a set of breakpoints covering the data, and count how many points fall into each interval.
- Warning: some software plots the counts, some the proportions or percentages.

A true histogram has the area of each bar proportional to the count, and total area one. This matters if the breaks are not equally spaced. See function truehist() in package MASS.

How do we choose the number and position of the breaks?

hist(data, prob = FALSE, breaks=breaks)

truehist(data, x0)
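The difference between counts and densities can be checked numerically. The sketch below uses the built-in faithful data and a set of unequally spaced breaks chosen purely for illustration (both are assumptions, not taken from the slides).

```r
# Old Faithful eruption durations from the built-in `faithful` data
# (an assumption: the slides may use a different version of these data).
x <- faithful$eruptions

# Unequally spaced breaks covering the data (chosen here only for illustration).
breaks <- c(1.5, 2, 2.5, 3.5, 4, 4.5, 5.5)

# plot = FALSE returns the histogram object without drawing it.
h <- hist(x, breaks = breaks, plot = FALSE)

# Density = count / (n * bin width), so the bar areas sum to one.
area <- sum(h$density * diff(h$breaks))
print(area)

# With unequal breaks, hist() plots densities by default; freq = TRUE would
# plot raw counts, which exaggerates the wide bins (R warns about this).
plot(h, freq = FALSE, xlab = "duration", main = "Density scale")
```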

[Figure: four panels titled "Histogram of infant.mortality" (infant.mortality from 0 to 150): two on a Frequency scale (up to 80 and up to 35) and two on a Density scale (up to 0.020 and up to 0.030).]

[Figure: Duration of Old Faithful eruptions; two panels with duration (1 to 5) on the horizontal axis and vertical axes up to 0.8 and 0.5.]

Density Plots

- Histograms are density plots: the tops of the bars form a piecewise-constant estimator of the underlying pdf.
- We can use smooth estimates of the density. Examples:
  - Kernel density estimates

    \[ \hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - x_i) \]

    where $K_h$ is a kernel and $h$ is the bandwidth.

    density() – check arguments bw, from, to...

  - Splines or logsplines: a spline is a piecewise polynomial function that is smooth at the places where the polynomial pieces connect.

    logspline() in package polspline
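The kernel density formula can be written out directly and compared with density(). This is a sketch assuming a Gaussian kernel, a hand-picked bandwidth and the built-in faithful data; the helper name kde is hypothetical.

```r
x <- faithful$eruptions   # built-in data, used here as an assumption
h <- 0.3                  # bandwidth, chosen by hand for illustration

# Direct implementation of the estimator with a Gaussian kernel:
# K_h(u) = K(u / h) / h, which is dnorm(u, sd = h) when K is the standard normal pdf.
# `kde` is a hypothetical helper, not a function from any package.
kde <- function(x0, data, h) {
  sapply(x0, function(z) mean(dnorm(z - data, sd = h)))
}

# density() computes the same estimate on a grid, using a fast approximation.
d <- density(x, bw = h, kernel = "gaussian")
max_diff <- max(abs(kde(d$x, x, h) - d$y))
print(max_diff)

plot(d, main = "Kernel density estimate")
lines(d$x, kde(d$x, x, h), lty = 2)
```

For the Gaussian kernel, the bw argument of density() is the standard deviation of the kernel, so it plays the role of h in the formula.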

[Figure: kernel density and logspline density estimates of infant.mortality (0 to 150), with Density axes from 0.000 to 0.035.]

Rugs

We can add a rug below the x axis to show the positions of the data points. See the help for rug() and also jitter().

Semi-transparent grey rugs are possible by specifying the color: col=rgb(0,0,0,0.25).
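A minimal sketch combining these pieces, assuming the built-in faithful data as a stand-in for the durations plotted on the slides:

```r
x <- faithful$eruptions   # built-in data, used here as an assumption

plot(density(x), main = "kernel density", xlab = "duration")

# Tied values would be drawn on top of each other, so jitter() adds a little
# noise to separate the ticks; the semi-transparent grey makes overlaps darker.
set.seed(3)
xj <- jitter(x)
rug(xj, col = rgb(0, 0, 0, 0.25))
```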

[Figure: two panels, kernel density and logspline, showing density estimates for duration (1 to 5) with vertical axes from 0.0 to 1.5.]

Scatterplots

The canonical plot of two continuous variables is a scatterplot.

Example from UN dataset:

plot(infant.mortality ~ gdp, UN, cex = 0.5)

plot(infant.mortality ~ gdp, UN, log = "xy",...)

[Figure: two scatterplots of infant.mortality against gdp, one on the original scale and one with logarithmic axes.]

Using scatterplot() from package car.

[Figure: scatterplot of infant.mortality against gdp on logarithmic axes, with labelled points: Tonga, Iraq, Afghanistan, Bosnia, Sao.Tome, Sudan, Gabon, Liberia, Korea.Dem.Peoples.Rep, French.Guiana.]

Smoother

- A fitted regression line and a smooth line have been automatically added.
- Such smooth curves often help to highlight trends in a scatterplot, but they can also be deceptive. Smoothers are things we will return to, but see the functions loess.smooth() and smooth.spline().
- With the scatterplot() function, outliers are automatically labelled. It is often best to do this manually with the identify() function.
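The same ingredients can be added by hand in base graphics. The sketch below uses the built-in cars data in place of the UN data (an assumption) and default smoothing settings throughout.

```r
# Built-in `cars` data (speed and stopping distance), used as an assumption
# in place of the UN data on the slides.
plot(dist ~ speed, data = cars)

# Least-squares regression line.
abline(lm(dist ~ speed, data = cars), lty = 2)

# Local regression: loess.smooth() returns the x and y coordinates of the curve.
lo <- loess.smooth(cars$speed, cars$dist)
lines(lo, col = "red")

# Smoothing spline, with the amount of smoothing chosen automatically.
ss <- smooth.spline(cars$speed, cars$dist)
lines(ss, col = "blue")
```

Changing the span argument of loess.smooth() or the spar argument of smooth.spline() shows how strongly the apparent trend depends on the amount of smoothing.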

It is often useful to visually convey confidence in your plots.

[Figure: scatterplot of hp (0 to 300) against mpg (10 to 35).]
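One way to convey confidence is to shade a pointwise confidence band around a fitted line. The sketch below assumes the figure is based on the built-in mtcars data (which has columns hp and mpg) and uses a straight-line fit; the smoother used on the slide is not stated.

```r
# Assumption: the figure uses the built-in `mtcars` data (columns hp and mpg).
fit <- lm(hp ~ mpg, data = mtcars)

# Pointwise 95% confidence interval for the fitted mean on a grid of mpg values.
grid <- data.frame(mpg = seq(min(mtcars$mpg), max(mtcars$mpg), length.out = 100))
ci <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)

plot(hp ~ mpg, data = mtcars)
# Shade the band between the lower and upper limits, then draw the fitted line.
polygon(c(grid$mpg, rev(grid$mpg)), c(ci[, "lwr"], rev(ci[, "upr"])),
        col = rgb(0, 0, 0, 0.2), border = NA)
lines(grid$mpg, ci[, "fit"])
```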

Scatterplot with color by types

[Figure: scatterplot of prestige (20 to 80) against income (5000 to 25000), with points marked by type (bc, prof, wc).]

scatterplot(prestige ~ income | type, data = Prestige,
            smoother = FALSE, reg.line = FALSE)
# car >= 3.0 renamed these arguments to smooth and regLine

Scatterplot Matrices (or pairs plots)

[Figure: scatterplot matrix of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width, titled "Anderson's Iris Data -- 3 species".]

pairs(iris[1:4], main = "Anderson's Iris Data -- 3 species",
      pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])

Image or contours

The functions image() and contour() are useful to explore three-dimensional data or to illustrate distributions in two dimensions.

[Figure: a contour plot (levels 0.02 to 0.14) and an image plot, both over -4 to 4 on each axis.]
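Plots of this kind can be sketched as follows. The density shown on the slide is not specified, so a standard bivariate normal with independent components is assumed here.

```r
# Standard bivariate normal density with independent components, evaluated
# on a grid (an assumption: the density on the slide is not specified).
x <- seq(-4, 4, length.out = 101)
y <- seq(-4, 4, length.out = 101)
z <- outer(x, y, function(a, b) dnorm(a) * dnorm(b))

# contour() draws labelled level curves; image() shows z as a color map.
contour(x, y, z, xlab = "x", ylab = "y")
image(x, y, z, xlab = "x", ylab = "y")
# Level curves can be overlaid on the image with add = TRUE.
contour(x, y, z, add = TRUE)
```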

Aspect ratio of plot

The aspect ratio of a plot is very important: Cleveland/McGill recommended an average slope of about 45° as the eye is most sensitive to departures from 45°.

[Figure: a Normal Q-Q plot (Sample Quantiles against Theoretical Quantiles) shown in two panels with different aspect ratios.]

Arranging Several Plots on a Single Page

par(mfrow=c(2,3))

for(i in 1:6) { plot(1:10) }

[Figure: six copies of plot(1:10), with axes Index and 1:10, arranged in 2 rows and 3 columns.]

The layout() function divides the plotting device into a variable number of rows and columns, with the column widths and row heights specified in the respective arguments.

nf <- layout(matrix(c(1,2,3,3), 2, 2, byrow=TRUE),
             c(3,7), c(5,5), respect=TRUE)

for(i in 1:3) { plot(1:10) }

[Figure: three copies of plot(1:10): a narrow and a wide panel in the top row and one panel spanning the bottom row.]

Make your graphical display easy to understand

- Add labels to your axes (with appropriate font and font size)
- Control the scale of the axes using the arguments xlim and ylim. Also check the axes argument and the axis() function.
- Add clear captions
- Use appropriate colors
- ...
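As an illustration of these points, the sketch below sets labels, a title, axis ranges and hand-drawn axes for the built-in cars data (the data set, label text and colors are all assumptions made for the example).

```r
plot(dist ~ speed, data = cars,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)",  # axis labels
     main = "Stopping distance against speed",              # title
     xlim = c(0, 30), ylim = c(0, 130),                      # axis ranges
     cex.lab = 1.2,                                          # label size
     pch = 19, col = "steelblue",
     axes = FALSE)                                           # axes drawn by hand below
axis(1, at = seq(0, 30, by = 5))
axis(2, at = seq(0, 120, by = 40), las = 1)  # las = 1: horizontal tick labels
box()

# par("usr") reports the plotting region actually used, which by default
# extends the requested limits by 4% at each end.
usr <- par("usr")
print(usr)
```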

Save a figure in a pdf file

I recommend always saving your figures as PDF, as this makes them easier to include in LaTeX. This can be done in R using the following commands:

pdf("filename.pdf")

...

dev.off()

You can also specify the size of the figure with the options width and height – the measures are in inches.
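A complete example, as a sketch: the file name and the plotted data are assumptions, and the file is written to the session's temporary directory so that nothing is overwritten.

```r
# Hypothetical output file in the session's temporary directory.
out <- file.path(tempdir(), "eruptions.pdf")

pdf(out, width = 6, height = 4)   # figure size in inches
hist(faithful$eruptions, xlab = "duration", main = "Old Faithful eruptions")
dev.off()                         # close the device so the file is written

print(file.exists(out))
```

The file is only complete once dev.off() has been called; forgetting it is a common reason for empty or unreadable PDF files.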

To watch at home...

TED talk on The beauty of data visualization:

David McCandless turns complex data sets (like worldwide military spending, media buzz, Facebook status updates) into beautiful, simple diagrams that tease out unseen patterns and connections. Good design, he suggests, is the best way to navigate information glut – and it may just change the way we see the world.

Link: http://www.ted.com/talks/david_mccandless_the_beauty_of_data_visualization