Statistics 202: Data Mining

Week 3

© Jonathan Taylor

Based in part on slides from textbook, slides of Susan Holmes

October 10, 2012

Source: statweb.stanford.edu/~jtaylo/courses/stats202/...

Part I

Graphical exploratory data analysis

Graphical summaries of data

Exploratory data analysis

A preliminary exploration of the data to better understand its characteristics.

Motivations:

Helping to select the right tool for preprocessing or analysis.

Making use of humans’ abilities to recognize patterns.

Pioneered by John Tukey, one of the giants of 20th-century statistics.

Our focus

Visual summary statistics

Quantitative summary statistics

Extraction of data slices

Graphical summaries of data

Visualization

Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.

Visualization of data is one of the most powerful and appealing techniques for data exploration.

Humans have a well-developed ability to analyze large amounts of information that is presented visually.

Goals: to detect general patterns and trends and/or to detect outliers and unusual patterns.

Stem and leaf plot

The decimal point is at the |

  1 | 01223333333444444444444
  1 | 555555555555556666666777799
  2 |
  2 |
  3 | 033
  3 | 55678999
  4 | 000001112222334444
  4 | 5555555566677777888899999
  5 | 000011111111223344
  5 | 55566666677788899
  6 | 0011134
  6 | 6779

Graphical summaries of data

Histogram

Usually shows the distribution of values of a single variable.

Divide the values into bins and show a bar plot of the number of objects in each bin.

The height of each bar indicates the number of objects if all bins are of the same width.

If bins are of different width, then often it is the *area* of the bar that indicates the number of objects in that bin.

Shape of histogram depends on the number of bins.
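The dependence on the number of bins can be seen in a minimal sketch. The function name `histogram_counts` and the sample values are made up for illustration; equal-width bins are assumed, with the maximum placed in the last bin.

```python
def histogram_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins.

    Assumes the values are not all identical (otherwise the bin width is 0).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value goes into the last bin instead of opening a new one.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return counts


# Illustrative values only (not the data behind the slides).
sample = [1.4, 1.5, 1.3, 4.7, 4.5, 4.9, 5.1, 5.9, 6.0, 6.9]
coarse = histogram_counts(sample, 2)  # two wide bins
fine = histogram_counts(sample, 5)    # five narrower bins reveal the empty gap
print(coarse, fine)
```

With two bins the gap between the two groups of values is invisible; with five bins an empty bin appears.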

A histogram with 10 breaks

A histogram with 30 breaks

Graphical summaries of data

Density

A smoothed / infinitesimal version of a histogram.

The height of the curve, $f(x)$, is proportional to the probability that a value falls in the interval $[x, x + dx]$.

A density is non-negative, $f(x) \geq 0$, and integrates to 1:

$$\int_{-\infty}^{\infty} f(x)\,dx = 1 = 100\%$$

The integral over any interval $[a, b]$ is the frequency / probability that a value falls in this interval.

The standard normal density

[Figure: the standard normal density. Horizontal axis: units, from −4 to 4. Vertical axis: % per unit, from 0 to 40.]

The area between z = −0.7 and z = 0.7 is 51.61%.
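The quoted area can be checked numerically. This sketch integrates the standard normal density with the trapezoidal rule and compares the result with the closed form $\mathrm{erf}(z/\sqrt{2})$; the helper names are illustrative.

```python
import math


def std_normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def area_between(a, b, steps=10000):
    """Trapezoidal approximation of the integral of the density over [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (std_normal_pdf(a) + std_normal_pdf(b))
    for k in range(1, steps):
        total += std_normal_pdf(a + k * h)
    return total * h


numeric = area_between(-0.7, 0.7)
# For a standard normal Z, P(|Z| <= z) = erf(z / sqrt(2)).
exact = math.erf(0.7 / math.sqrt(2.0))
print(f"{100 * numeric:.2f}% vs {100 * exact:.2f}%")
```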

Estimated density of petal.length within each iris type

Graphical summaries of data

Density estimate

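The simplest way to estimate the density of a sample $\{x_1, \dots, x_n\}$ is to use a kernel density estimator.

Definition

$$\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h} \cdot \phi\left(\frac{x - x_i}{h}\right)$$

where $\phi$ is the kernel (typically the standard normal density) and $h$ is the width of each bump attached to each sample point.

R chooses $h$ automatically by default.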
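A minimal sketch of the estimator, assuming $\phi$ is the standard normal density and that $h$ is supplied by hand rather than chosen automatically. The sample values and the function names are illustrative.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel phi (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x, sample, h):
    """Kernel density estimate at x: average of bumps of width h at each x_i."""
    n = len(sample)
    return sum(gaussian_kernel((x - xi) / h) for xi in sample) / (n * h)


sample = [1.4, 1.5, 4.7, 5.1, 6.0]  # illustrative values
h = 0.5                              # hand-picked bandwidth

# Riemann-sum check that the estimate integrates to (about) 1.
step = 0.01
grid = [-5.0 + step * k for k in range(1701)]
heights = [kde(x, sample, h) for x in grid]
total_area = sum(heights) * step
print(round(total_area, 4))
```

A smaller `h` gives narrower bumps and a wigglier curve; a larger `h` smooths the bumps together.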

Estimated density bw=1

Estimated density bw=0.05

Graphical summaries of data

Distribution function

Associated to a density $f$ is its distribution function

$$F(x) = \int_{-\infty}^{x} f(u)\,du$$

with $0 \leq F(x) \leq 1$.

Given a sample $\{x_1, \dots, x_n\}$ we define its ECDF (Empirical Cumulative Distribution Function) as

$$\hat{F}(x) = \frac{1}{n} \sum_{i=1}^{n} 1_{(-\infty,x]}(x_i)$$

where

$$1_{(-\infty,x]}(x_i) = \begin{cases} 1 & -\infty < x_i \leq x \\ 0 & x_i > x. \end{cases}$$
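The ECDF definition translates almost directly into code. In this sketch the function name `ecdf` and the sample are illustrative.

```python
def ecdf(sample, x):
    """Fraction of sample points x_i with x_i <= x."""
    return sum(1 for xi in sample if xi <= x) / len(sample)


sample = [1, 2, 2, 5]  # illustrative values
below = ecdf(sample, 0)   # no points are <= 0
middle = ecdf(sample, 2)  # three of four points are <= 2
above = ecdf(sample, 5)   # all points are <= 5
print(below, middle, above)
```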

ECDF of petal.length for each iris type

Graphical summaries of data

Quantile function

The inverse of a distribution function

$$F(Q(p)) = p$$

with $0 \leq p \leq 1$.

For a sample $\{x_1, \dots, x_n\}$ an estimated quantile $\hat{Q}(p)$ is chosen so that approximately $np$ of the data points are less than $\hat{Q}(p)$.
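One possible convention for the estimated quantile is sketched below, assuming we take the order statistic at position $\lceil np \rceil$ (the inverse of the ECDF). Other conventions interpolate between neighbouring points; the function name is illustrative.

```python
import math


def empirical_quantile(sample, p):
    """Order statistic at position ceil(n * p), clamped to the sample range."""
    ordered = sorted(sample)
    n = len(ordered)
    idx = max(math.ceil(n * p) - 1, 0)
    return ordered[min(idx, n - 1)]


sample = [5, 1, 3, 2, 4]  # illustrative values
median = empirical_quantile(sample, 0.5)
lowest = empirical_quantile(sample, 0.0)
highest = empirical_quantile(sample, 1.0)
print(lowest, median, highest)
```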

Quantile plots of petal.length for each iris type

Boxplot

Boxplot of all variables across iris types

Boxplot of petal.length for each iris type

Contour of 2D density estimate

Heatmap of 2D density estimate

Correlation (case × case) matrix of iris data

Part II

Multidimensional scaling

Multidimensional scaling

A visual tool

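Recall the PCA scores were

$$\tilde{\mathbf{X}} V = U\Delta$$

where $\tilde{\mathbf{X}} = H\mathbf{X}S^{-1/2} = U\Delta V^T$.

Above, $U$ are eigenvectors of

$$\tilde{\mathbf{X}}\tilde{\mathbf{X}}^T = H\mathbf{X}S^{-1}\mathbf{X}^T H$$

Also, $\mathbf{X}S^{-1}\mathbf{X}^T$ is a measure of similarity between cases.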

Multidimensional scaling

Classical multidimensional scaling

Given a similarity matrix $A$, recall the relationship between a similarity and “distance”

$$D_{ij} = (A_{ii} - 2A_{ij} + A_{jj})^{1/2}$$

Now, consider the matrix $B$ with entries

$$B_{ij} = -\frac{1}{2} D_{ij}^2.$$

Finally, consider the matrix

$$C = HBH$$
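The three steps can be sketched in plain Python. This assumes $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T$ is the centering matrix and, for the example only, takes $S = I$ so that $A$ is the Gram matrix of three made-up points; the function names are illustrative.

```python
def distances_from_similarity(A):
    """D_ij = (A_ii - 2 A_ij + A_jj)^(1/2)."""
    n = len(A)
    return [[max(A[i][i] - 2 * A[i][j] + A[j][j], 0.0) ** 0.5
             for j in range(n)] for i in range(n)]


def double_center(B):
    """C = H B H with H = I - (1/n) 1 1^T, written out entry by entry."""
    n = len(B)
    row = [sum(r) / n for r in B]
    col = [sum(B[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row) / n
    return [[B[i][j] - row[i] - col[j] + grand for j in range(n)]
            for i in range(n)]


# Made-up points (0,0), (3,0), (0,4); A is their Gram matrix (assumes S = I).
points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
A = [[sum(a * b for a, b in zip(p, q)) for q in points] for p in points]

D = distances_from_similarity(A)
B = [[-0.5 * d * d for d in row] for row in D]
C = double_center(B)
print(D[1][2], C[0][0])
```

Here `C` coincides with the Gram matrix of the centered points, which is what the next slide shows in matrix form.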

Multidimensional scaling

Classical multidimensional scaling

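If $A = \mathbf{X}S^{-1}\mathbf{X}^T$ is a similarity matrix for a (scaled) data matrix $\mathbf{X}$, then

$$B = A - \frac{1}{2}(\nu\mathbf{1}^T + \mathbf{1}\nu^T) = \mathbf{X}S^{-1}\mathbf{X}^T - \frac{1}{2}(\nu\mathbf{1}^T + \mathbf{1}\nu^T)$$

Above, $\nu = \mathrm{diag}(A)$.

Therefore, since $H\mathbf{1} = 0$,

$$C = HBH = HAH = \tilde{\mathbf{X}}\tilde{\mathbf{X}}^T = U\Delta^2 U^T.$$

In short, the eigenvectors of $C$, scaled by $\Delta$, are the PCA scores $U\Delta$.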

Olympic PCA scores

Olympic MDS scores

MDS vs. PCA

MDS vs. PCA

Multidimensional scaling

Classical multidimensional scaling

We can form B, C for any dissimilarity matrix D.

The matrix $C$ will be symmetric, so it will be diagonalizable as $C = W\Lambda W^T$ with $W_{n \times k}$ and $\mathrm{rank}(C) = k$. In the general case the diagonal entries $\Lambda_{ii}$ are not necessarily non-negative.

When the $\Lambda_{ii}$ are non-negative, this leads to a Euclidean representation

$$(W\Lambda^{1/2})_{n \times k}$$

with each row of $W\Lambda^{1/2}$ being a point in Euclidean space.

Taking the first two columns of $W\Lambda^{1/2}$ gives a two-dimensional representation.
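A minimal end-to-end sketch of the two-dimensional representation. It uses power iteration with deflation as a stand-in for a proper eigensolver, which is only reasonable when the leading eigenvalues of $C$ are positive (as for Euclidean dissimilarities); the function names and test points are illustrative.

```python
import math
import random


def double_centered(D):
    """C = H B H with B_ij = -D_ij^2 / 2 (B is symmetric)."""
    n = len(D)
    B = [[-0.5 * D[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in B]
    grand = sum(row) / n
    return [[B[i][j] - row[i] - row[j] + grand for j in range(n)]
            for i in range(n)]


def top_eigenpairs(C, k, iters=500, seed=0):
    """Leading k eigenpairs by power iteration with deflation (a sketch)."""
    rng = random.Random(seed)
    n = len(C)
    M = [r[:] for r in C]
    pairs = []
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(iters):
            w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(M[i][j] * v[j] for j in range(n)) for i in range(n))
        pairs.append((lam, v))
        for i in range(n):  # remove the component just found
            for j in range(n):
                M[i][j] -= lam * v[i] * v[j]
    return pairs


def classical_mds(D, k=2):
    """Rows of the first k columns of W Lambda^(1/2)."""
    pairs = top_eigenpairs(double_centered(D), k)
    return [[math.sqrt(max(lam, 0.0)) * v[i] for lam, v in pairs]
            for i in range(len(D))]


# Made-up planar points; their Euclidean distances serve as dissimilarities.
points = [(0, 0), (3, 0), (0, 4), (3, 4), (1, 1)]
D = [[math.dist(p, q) for q in points] for p in points]
Y = classical_mds(D, k=2)
max_error = max(abs(math.dist(Y[i], Y[j]) - D[i][j])
                for i in range(len(D)) for j in range(len(D)))
print(max_error)
```

Because the made-up points already lie in a plane, the embedded interpoint distances reproduce `D` up to numerical error; the coordinates themselves are only determined up to rotation and reflection.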

Multidimensional scaling

Classical multidimensional scaling

The points $(W\Lambda^{1/2})[, 1:2]$ are an optimal Euclidean embedding in the sense that their interpoint distances are chosen to be close to the dissimilarities in $D$.

If the dissimilarity is Euclidean and the points lie in a 2-dimensional plane in $\mathbb{R}^n$, then the interpoint distances will be identical . . .

Distances between US capitals

Distances between US capitals

Map of U.S. capitals

MDS of Iris data, $\ell_2$

Iris PCA scores

MDS of Iris data, $\ell_1$

MDS of Iris data, $\ell_\infty$ / sup norm

MDS of Iris data, $\ell_{20}$

MDS of Iris data, $\ell_{0.2}$
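The dissimilarities behind these plots can be sketched as a single function, assuming they are the usual $\ell_p$ (Minkowski) distances between cases. The function name is illustrative. Note that for $p < 1$ the result is not a metric, since the triangle inequality can fail.

```python
def lp_dissimilarity(x, y, p):
    """(sum_k |x_k - y_k|^p)^(1/p); p = float('inf') gives the sup norm."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == float("inf"):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


x, y = (0.0, 0.0), (3.0, 4.0)
d1 = lp_dissimilarity(x, y, 1)
d2 = lp_dissimilarity(x, y, 2)
dinf = lp_dissimilarity(x, y, float("inf"))

# With p = 0.2 the direct route can be "longer" than a detour via a third point.
a, b, c = (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)
direct = lp_dissimilarity(a, c, 0.2)
detour = lp_dissimilarity(a, b, 0.2) + lp_dissimilarity(b, c, 0.2)
print(d1, d2, dinf, direct > detour)
```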

Other applications

Manifold learning

The goal of manifold learning is to find “low-dimensional” representations of $\mathbf{X}$ that are not necessarily linear.

Examples of techniques:

ISOMAP

Laplacian eigenmaps / diffusion geometry

Local linear embedding

Manifold learning

Manifold learning

Graph distance

In the “bottleneck”, we already have a metric (Euclidean distance), so we can use MDS.

But we might want to emphasize the fact that the groups are “barely connected”...

How can we emphasize this bottleneck in terms of a distance?

The ISOMAP and diffusion geometry methods do this by creating a graph and a new metric based on the graph.

Manifold learning

Graph distances

In our “bottleneck” picture, let’s connect k mutual nearest neighbours to form a graph $G_k$.

That is, insert an edge $(i, j)$ if either

1. $i$ is within $j$’s $k$ nearest neighbours;

2. $j$ is within $i$’s $k$ nearest neighbours.

Let

$$D_{G_k}(i, j) = \text{length of shortest path between vertices } i \text{ and } j \text{ in } G_k$$

This is a metric on $G_k$.
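A minimal sketch of the graph distance, assuming each edge is weighted by the Euclidean distance between its endpoints (the slide does not say whether path length counts edges or sums weights) and using Floyd-Warshall for the shortest paths. The function name and the horseshoe-shaped points are made up.

```python
import math


def knn_graph_distances(points, k):
    """Shortest-path distances in the k-nearest-neighbour graph G_k."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    inf = float("inf")
    g = [[0.0 if i == j else inf for j in range(n)] for i in range(n)]
    for i in range(n):
        nearest = sorted((j for j in range(n) if j != i),
                         key=lambda j: d[i][j])[:k]
        for j in nearest:
            # "Either" rule: the edge is added in both directions.
            g[i][j] = g[j][i] = d[i][j]
    # Floyd-Warshall: allow each vertex m in turn as an intermediate stop.
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if g[i][m] + g[m][j] < g[i][j]:
                    g[i][j] = g[i][m] + g[m][j]
    return g


# Made-up horseshoe: the two ends are close in the plane but far along the curve.
horseshoe = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3), (1, 3), (0, 3)]
G = knn_graph_distances(horseshoe, k=2)
euclidean_ends = math.dist(horseshoe[0], horseshoe[-1])
graph_ends = G[0][-1]
print(euclidean_ends, graph_ends)
```

Pairs of vertices in different connected components keep an infinite distance, so in practice `k` has to be large enough for the graph to be connected.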

Manifold learning

ISOMAP

Shove $D_{G_k}$ through classical MDS.

ISOMAP can “flatten” data if there is a low-dimensional Euclidean configuration where the Euclidean distances well approximate the graph distances $D_{G_k}$.

This means that the data has to be “flat” if ISOMAP is to recover it.

Manifold learning

Part III

Multidimensional arrays (data cubes)

Multidimensional arrays

Data cubes

It is sometimes useful to summarize data into a multidimensional array, with one axis per “category.”

In the unemployment data, we might summarize the results by tuples (state, period, variable).

In the Iris data, we might summarize by (petal.width, petal.length, iris.type).

In order to summarize by petal.width, petal.length we might form categories: low, medium, high.
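A minimal sketch of such a cube as a dictionary of counts keyed by (petal.width category, petal.length category, iris.type). The rows and the cut points for low / medium / high are made up for illustration and are not the ones used for the figures.

```python
from collections import Counter


def categorize(value, low_cut, high_cut):
    """Map a number to 'low', 'medium' or 'high' using two cut points."""
    if value < low_cut:
        return "low"
    if value < high_cut:
        return "medium"
    return "high"


# (petal.length, petal.width, iris.type): illustrative rows only.
rows = [
    (1.4, 0.2, "setosa"),
    (1.5, 0.3, "setosa"),
    (4.5, 1.5, "versicolor"),
    (4.7, 1.4, "versicolor"),
    (6.0, 2.5, "virginica"),
    (5.1, 1.8, "virginica"),
]

cube = Counter()
for length, width, species in rows:
    # Hypothetical cut points: 0.75 / 1.75 for width, 2.5 / 5.0 for length.
    key = (categorize(width, 0.75, 1.75), categorize(length, 2.5, 5.0), species)
    cube[key] += 1

# A slice of the cube: fix iris.type and keep the other two axes.
setosa_slice = {key[:2]: n for key, n in cube.items() if key[2] == "setosa"}
print(dict(cube), setosa_slice)
```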

Iris grouped by petal.width, petal.length, iris.type

Iris grouped by petal.width, petal.length, iris.type

Slices of the cube.

Drilling down

Resolution of a data cube

In the process of forming a data cube for unemployment across states, we had already aggregated over county.

We could make another cube with county-level resolution. This would be drilling down.

The operation of going from county-level to state level is rolling up.

Such operations are often best handled by database tools rather than *R*, but *R* does support multidimensional arrays.
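Rolling up amounts to summing over one axis of the cube. A minimal sketch, assuming the county-level cube is a dictionary keyed by (state, county, period); the keys and counts are made up and the function name is illustrative.

```python
from collections import defaultdict


def roll_up(cube, drop_axis):
    """Aggregate a cube by summing over the axis at position drop_axis."""
    out = defaultdict(int)
    for key, value in cube.items():
        reduced = key[:drop_axis] + key[drop_axis + 1:]
        out[reduced] += value
    return dict(out)


# (state, county, period) -> count; entirely made-up numbers.
county_cube = {
    ("CA", "county_a", "2012-01"): 50,
    ("CA", "county_b", "2012-01"): 70,
    ("CA", "county_a", "2012-02"): 45,
    ("NY", "county_c", "2012-01"): 90,
}

# Rolling up from county-level to state-level resolution drops axis 1.
state_cube = roll_up(county_cube, drop_axis=1)
print(state_cube)
```

Drilling down is the reverse direction and needs the finer-grained data: the county-level cube cannot be recovered from `state_cube`.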
