
Statistics 202: Data Mining, Week 3

Based in part on slides from the textbook and slides of Susan Holmes

© Jonathan Taylor

October 10, 2012


Part I

Graphical exploratory data analysis


Graphical summaries of data

Exploratory data analysis

A preliminary exploration of the data to better understand its characteristics.

Motivations:

Helping to select the right tool for preprocessing or analysis.
Making use of humans' abilities to recognize patterns.

Pioneered by John Tukey, one of the giants of 20th century statistics.

Our focus

Visual summary statistics
Quantitative summary statistics
Extraction of data slices


Graphical summaries of data

Visualization

Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.

Visualization of data is one of the most powerful and appealing techniques for data exploration.

Humans have a well-developed ability to analyze large amounts of information that is presented visually.

Goals: to detect general patterns and trends and/or to detect outliers and unusual patterns.


Stem and leaf plot

The decimal point is at the |

1 | 01223333333444444444444

1 | 555555555555556666666777799

2 |

2 |

3 | 033

3 | 55678999

4 | 000001112222334444

4 | 5555555566677777888899999

5 | 000011111111223344

5 | 55566666677788899

6 | 0011134

6 | 6779
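The display above is the text output of R's stem() function. A minimal sketch of how it is produced; the variable is not named on the slide, but the values are consistent with iris$Petal.Length (an assumption):

data(iris)
stem(iris$Petal.Length)   # prints a stem-and-leaf display like the one above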


Graphical summaries of data

Histogram

Usually shows the distribution of values of a single variable.

Divide the values into bins and show a bar plot of the number of objects in each bin.

The height of each bar indicates the number of objects if all bins are of the same width.

If bins are of different widths, it is often the area of the bar that indicates the number of objects in that bin.

The shape of the histogram depends on the number of bins.
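A minimal R sketch of the histograms on the next two slides; using iris$Petal.Length is an assumption, and freq = FALSE switches to the scale on which areas, not heights, represent frequency:

x <- iris$Petal.Length
hist(x, breaks = 10)                # bar heights = counts per bin
hist(x, breaks = 30)                # more bins: a different shape
hist(x, breaks = 10, freq = FALSE)  # density scale: bar areas sum to 1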


A histogram with 10 breaks


A histogram with 30 breaks


Graphical summaries of data

Density

A smoothed / infinitesimal version of the histogram.

The height of the curve, f(x), is proportional to the probability that a value falls in the interval [x, x + dx].

A density is non-negative, f(x) ≥ 0, and integrates to 1:

∫_{−∞}^{∞} f(x) dx = 1 = 100%.

The integral over any interval [a, b] is the frequency / probability that a value falls in that interval.
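A quick numerical check of these properties in R, using the standard normal density dnorm; the second line is the same calculation as the 51.61% area quoted on the next slide:

integrate(dnorm, -Inf, Inf)$value   # 1: the density integrates to 100%
pnorm(0.7) - pnorm(-0.7)            # about 0.5161: area between z = -0.7 and z = 0.7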


The standard normal density

(Axes: units on the x-axis; % per unit on the y-axis.)

The area between z = −0.7 and z = 0.7 is 51.61%.


Estimated density of petal.length within each iris type


Graphical summaries of data

Density estimate

The simplest way to estimate a density from a sample {x_1, . . . , x_n} is to use a kernel density estimator.

Definition

f̂(x) = (1/n) ∑_{i=1}^{n} (1/h) φ((x − x_i)/h),

where h is the width of each bump attached to each sample point.

R chooses h automatically.
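A hedged R sketch of this estimator: density() computes exactly this kind of kernel estimate (Gaussian kernel by default), with bw playing the role of h; the bw values match the following slides, and the variable choice is an assumption:

x <- iris$Petal.Length
plot(density(x))              # R chooses the bandwidth h automatically
lines(density(x, bw = 1))     # wide bumps: very smooth estimate
lines(density(x, bw = 0.05))  # narrow bumps: very wiggly estimate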


Estimated density bw=1


Estimated density bw=0.05


Graphical summaries of data

Distribution function

Associated to a density f is its distribution function

F(x) = ∫_{−∞}^{x} f(u) du,

with 0 ≤ F(x) ≤ 1.

Given a sample {x_1, . . . , x_n}, we define its ECDF (Empirical Cumulative Distribution Function) as

F̂(x) = (1/n) ∑_{i=1}^{n} 1_{(−∞, x]}(x_i),

where

1_{(−∞, x]}(x_i) = 1 if −∞ < x_i ≤ x, and 0 if x_i > x.
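In R, ecdf() returns the step function F̂ defined above; a minimal sketch (the variable is again an assumption):

x <- iris$Petal.Length
Fhat <- ecdf(x)
Fhat(4)       # proportion of observations <= 4
plot(Fhat)    # step plot of the empirical CDF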


ECDF of petal.length for each iris type


Graphical summaries of data

Quantile function

The quantile function Q is the inverse of the distribution function:

F(Q(p)) = p, for 0 ≤ p ≤ 1.

For a sample {x_1, . . . , x_n}, an estimated quantile Q̂(p) is chosen so that approximately np of the data points are less than Q̂(p).
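A minimal R sketch of estimated quantiles; quantile() returns Q̂(p), and the last line is one simple way to draw a quantile plot like those on the next slide:

x <- iris$Petal.Length                    # variable choice is an assumption
quantile(x, probs = c(0.25, 0.50, 0.75))  # Q-hat at p = 0.25, 0.5, 0.75
plot(ppoints(length(x)), sort(x), xlab = "p", ylab = "quantile")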


Quantile plots of petal.length for each iris type


Boxplot


Boxplot of all variables across iris types


Boxplot of petal.length for each iris type


Contour of 2D density estimate


Heatmap of 2D density estimate


Correlation (case × case) matrix of iris data


Part II

Multidimensional scaling


Multidimensional scaling

A visual tool

Recall that the PCA scores were

X̃ V = U∆,

where X̃ = H X S^{−1/2} = U ∆ V^T (H is the centering matrix and S the diagonal matrix of variances).

Above, U contains the eigenvectors of

X̃ X̃^T = H X S^{−1} X^T H.

Also, X S^{−1} X^T is a measure of similarity between cases.
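A hedged numerical sketch of this notation in R, using the four iris measurements as an assumed example: scale() plays the role of H and S^{−1/2}, and prcomp() returns the scores U∆:

X <- as.matrix(iris[, 1:4])
Xtilde <- scale(X)            # center (H) and scale by S^(-1/2)
scores <- prcomp(Xtilde)$x    # the columns of U %*% Delta: the PCA scores
head(scores[, 1:2])           # first two PCA scores per case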


Multidimensional scaling

Classical multidimensional scaling

Given a similarity matrix A, recall the relationship between a similarity and a “distance”:

D_ij = (A_ii − 2A_ij + A_jj)^{1/2}.

Now, consider the matrix B with entries

B_ij = −(1/2) D_ij^2.

Finally, consider the matrix

C = H B H.


Multidimensional scaling

Classical multidimensional scaling

If A = X S^{−1} X^T is a similarity matrix for a (scaled) data matrix X, then

B = A − (1/2)(ν 1^T + 1 ν^T)
  = X S^{−1} X^T − (1/2)(ν 1^T + 1 ν^T),

where ν = diag(A).

Therefore, since H1 = 0 (so the ν terms vanish under centering),

C = H B H = H A H = X̃ X̃^T = U ∆^2 U^T.

In short, the eigenvectors of C, scaled to U∆, are the PCA scores.
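A hedged numerical check of this equivalence in R, again using the scaled iris measurements as an assumed example; cmdscale() performs classical MDS:

X <- scale(as.matrix(iris[, 1:4]))                        # X-tilde
A <- X %*% t(X)                                           # similarity between cases
D <- sqrt(pmax(outer(diag(A), diag(A), "+") - 2 * A, 0))  # D_ij = (A_ii - 2A_ij + A_jj)^(1/2)
mds <- cmdscale(D, k = 2)                                 # classical MDS coordinates
pca <- prcomp(X)$x[, 1:2]                                 # PCA scores
# The two embeddings agree, up to a possible sign flip of each column.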


Olympic PCA scores


Olympic MDS scores


MDS vs. PCA


MDS vs. PCA


Multidimensional scaling

Classical multidimensional scaling

We can form B and C for any dissimilarity matrix D.

The matrix C will be symmetric, so it will be diagonalizable as C = W Λ W^T, with W an n × k matrix and rank(C) = k. In the general case the diagonal entries Λ_ii are not necessarily non-negative.

This leads to a Euclidean representation

(W Λ^{1/2})_{n×k},

with each row of W Λ^{1/2} being a point in Euclidean space.

Taking the first two columns of W Λ^{1/2} gives a two-dimensional representation.


Multidimensional scaling

Classical multidimensional scaling

The points

(W Λ^{1/2})[, 1:2]

are an optimal Euclidean embedding in the sense that their interpoint distances are chosen to be close to the dissimilarities in D.

If the dissimilarity is Euclidean and the points lie in a 2-dimensional plane in R^n, then the interpoint distances will be identical . . .
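The slides after the U.S. capitals example show MDS of the iris data under several norms. A hedged sketch of how such embeddings might be computed with dist() and cmdscale() (the exact settings behind the figures are an assumption):

X <- scale(as.matrix(iris[, 1:4]))
d2   <- dist(X)                                  # l2 (Euclidean)
d1   <- dist(X, method = "manhattan")            # l1
dinf <- dist(X, method = "maximum")              # l-infinity / sup norm
d20  <- dist(X, method = "minkowski", p = 20)    # l_20 (use p = 0.2 for l_0.2)
plot(cmdscale(d1, k = 2), col = iris$Species)    # 2-D embedding, coloured by iris type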


Distances between US capitals


Distances between US capitals


Map of U.S. capitals


MDS of Iris data, ℓ2


Iris PCA scores


MDS of Iris data, ℓ1


MDS of Iris data, ℓ∞ / sup norm


MDS of Iris data, ℓ20


MDS of Iris data, ℓ0.2


Other applications

Manifold learning

The goal of manifold learning is to find “low-dimensional” representations of X that are not necessarily linear.

Examples of techniques:

ISOMAP
Laplacian eigenmaps / diffusion geometry
Local linear embedding


Manifold learning


Manifold learning

Graph distance

In the “bottleneck” picture, we already have a metric (Euclidean distance), so we can use MDS.

But we might want to emphasize the fact that the groups are “barely connected”...

How can we emphasize this bottleneck in terms of a distance?

ISOMAP and the diffusion geometry methods do this by creating a graph and defining a new metric based on that graph.


Manifold learning

Graph distances

In our “bottleneck” picture, let’s connect k mutual nearest neighbours to form a graph G_k.

That is, insert an edge (i, j) if either
1. i is within j’s k nearest neighbours; or
2. j is within i’s k nearest neighbours.

Let

D_{G_k}(i, j) = length of the shortest path between vertices i and j in G_k.

This is a metric on G_k.
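A hedged sketch of D_{G_k} in R; the igraph package (an assumption, it is not mentioned on the slides) supplies the shortest-path computation, and the nearest-neighbour rule follows the definition above:

library(igraph)
X <- scale(as.matrix(iris[, 1:4]))    # stand-in data; the "bottleneck" data is not given
k <- 5
D <- as.matrix(dist(X))
n <- nrow(D)
adj <- matrix(0, n, n)
for (i in 1:n) {
  nbrs <- order(D[i, ])[2:(k + 1)]    # i's k nearest neighbours (excluding itself)
  adj[i, nbrs] <- 1
}
adj <- pmax(adj, t(adj))              # edge (i, j) if i is in j's kNN or j is in i's kNN
g  <- graph_from_adjacency_matrix(adj, mode = "undirected")
DG <- distances(g)                    # shortest-path lengths: the metric D_Gk on the graph
# (Edges could instead be weighted by Euclidean length, as ISOMAP usually does.)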


Manifold learning

ISOMAP

Shove D_{G_k} through classical MDS.

ISOMAP can “flatten” the data if there is a low-dimensional Euclidean configuration whose Euclidean distances approximate the graph distances D_{G_k} well.

This means that the data has to be “flat” if ISOMAP is to recover it.
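Continuing the sketch above, shoving D_{G_k} through classical MDS is a single call (assuming the graph is connected, so that no graph distances are infinite):

iso <- cmdscale(DG, k = 2)     # ISOMAP-style embedding of the graph distances
plot(iso, col = iris$Species)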


Manifold learning


Part III

Multidimensional arrays (data cubes)


Multidimensional arrays

Data cubes

It is sometimes useful to summarize data into a multidimensional array, with one axis per “category.”

In the unemployment data, we might summarize the results by tuples (state, period, variable).

In the Iris data, we might summarize by (petal.width, petal.length, iris.type).

In order to summarize by petal.width and petal.length, we might form categories: low, medium, high.
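A hedged R sketch of forming such a cube with cut() and table(); three equal-width categories are an assumption:

pw <- cut(iris$Petal.Width,  breaks = 3, labels = c("low", "medium", "high"))
pl <- cut(iris$Petal.Length, breaks = 3, labels = c("low", "medium", "high"))
cube <- table(petal.width = pw, petal.length = pl, iris.type = iris$Species)
dim(cube)             # a 3 x 3 x 3 array of counts
cube[, , "setosa"]    # one slice of the cube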


Iris grouped by petal.width, petal.length, iris.type


Iris grouped by petal.width, petal.length, iris.type


Slices of the cube.


Drilling down

Resolution of a data cube

In the process of forming a data cube for unemployment across states, we had already aggregated over county.

We could make another cube with county-level resolution. This would be drilling down.

The operation of going from county level to state level is rolling up.

Such operations are often best handled by database tools rather than R, but R does support multidimensional arrays.
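A hedged sketch of rolling up the cube from the earlier iris example: summing a dimension out of the array aggregates to the coarser resolution.

apply(cube, c(1, 3), sum)    # roll up over petal.length: counts by petal.width x iris.type
margin.table(cube, c(1, 3))  # an equivalent built-in for tables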
