data mining taylor statistics 202: data miningstatweb.stanford.edu/~jtaylo/courses/stats202/... ·...

Statistics 202: Data Mining

Week 3

Based in part on slides from textbook, slides of Susan Holmes

© Jonathan Taylor

October 10, 2012



Part I

Graphical exploratory data analysis


Graphical summaries of data

Exploratory data analysis

A preliminary exploration of the data to better understand its characteristics.

Motivations:

Helping to select the right tool for preprocessing or analysis.
Making use of humans’ abilities to recognize patterns.

Pioneered by John Tukey, one of the giants of 20th century statistics.

Our focus

Visual summary statistics
Quantitative summary statistics
Extraction of data slices



Visualization

Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.

Visualization of data is one of the most powerful and appealing techniques for data exploration.

Humans have a well developed ability to analyze large amounts of information that is presented visually.

Goals: to detect general patterns and trends and/or to detect outliers and unusual patterns.


Stem and leaf plot

The decimal point is at the |

  1 | 01223333333444444444444
  1 | 555555555555556666666777799
  2 |
  2 |
  3 | 033
  3 | 55678999
  4 | 000001112222334444
  4 | 5555555566677777888899999
  5 | 000011111111223344
  5 | 55566666677788899
  6 | 0011134
  6 | 6779



Histogram

Usually shows the distribution of values of a single variable.

Divide the values into bins and show a bar plot of the number of objects in each bin.

The height of each bar indicates the number of objects if all bins are of same width.

If bins are of different width, then often it is the *area* of the bar that indicates the number of objects in that bin.

Shape of histogram depends on the number of bins.
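The dependence on the number of bins can be seen in a minimal sketch. The helper name `histogram_counts` and the simulated bimodal sample are assumptions made for illustration; they are not the data behind the histograms on the following slides.

```python
import random


def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value goes into the last bin instead of opening a new one.
        k = min(int((v - lo) / width), n_bins - 1)
        counts[k] += 1
    return counts


random.seed(0)
# Hypothetical bimodal sample standing in for a real variable.
sample = [random.gauss(1.5, 0.2) for _ in range(50)] + [
    random.gauss(4.9, 0.8) for _ in range(100)
]

for n_bins in (10, 30):
    print(n_bins, "bins:", histogram_counts(sample, n_bins))
```

With equal-width bins the counts are the bar heights. The same sample yields a coarser or a more ragged picture depending only on `n_bins`.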


A histogram with 10 breaks


A histogram with 30 breaks



Density

A smoothed / infinitesimal version of a histogram.

The height of the curve, $f(x)$, is proportional to the probability that a value falls in the interval $[x, x + dx]$.

A density is non-negative, $f(x) \geq 0$, and integrates to 1:

$$\int_{-\infty}^{\infty} f(x) \, dx = 1 = 100\%$$

The integral over any interval $[a, b]$ is the frequency / probability that a value falls in this interval.


The standard normal density

[Figure: the standard normal density curve. Horizontal axis: units, from −4 to 4. Vertical axis: % per unit, from 0 to 40.]

The area between z = −0.7 and z = 0.7 is 51.61%.
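The quoted area can be checked numerically. The sketch below computes it in two ways: from the normal distribution function (written with `math.erf`) and by a midpoint-rule integral of the density. The function names are chosen here for illustration.

```python
import math


def normal_density(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal distribution function, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def midpoint_integral(f, a, b, steps=10000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


area_exact = normal_cdf(0.7) - normal_cdf(-0.7)
area_numeric = midpoint_integral(normal_density, -0.7, 0.7)

print(f"{100 * area_exact:.2f}%")   # prints 51.61%
print(f"{100 * area_numeric:.2f}%")
```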



Estimated density of petal.length within each iris type



Density estimate

The simplest way to estimate the density of a sample $\{x_1, \dots, x_n\}$ is to use a kernel density estimator.

Definition

$$\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h} \cdot \phi\left(\frac{x - x_i}{h}\right)$$

where $h$ is the width of each bump attached to each sample point.

R chooses h automatically.
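The estimator can be written out directly. This is a minimal sketch assuming that $\phi$ is the standard normal density (the slide does not define it), with a made-up sample and the two bandwidths shown on the next slides; it does not reproduce how R picks $h$.

```python
import math


def phi(u):
    """Kernel: assumed here to be the standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x, sample, h):
    """Kernel density estimate at x: the average of bumps of width h."""
    return sum(phi((x - xi) / h) / h for xi in sample) / len(sample)


def midpoint_integral(f, a, b, steps=20000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


# Hypothetical sample, for illustration only.
sample = [1.3, 1.4, 1.5, 4.5, 4.7, 5.1, 6.0]

for h in (1.0, 0.05):
    total = midpoint_integral(lambda x: kde(x, sample, h), -5.0, 12.0)
    print(f"h={h}: f_hat(4.6)={kde(4.6, sample, h):.3f}, integral={total:.4f}")
```

A large $h$ smooths the bumps into a few broad modes, while a small $h$ leaves one narrow spike per data point. In both cases the estimate still integrates to 1.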


Estimated density bw=1


Estimated density bw=0.05



Distribution function

Associated to a density $f$ is its distribution function

$$F(x) = \int_{-\infty}^{x} f(u) \, du$$

with $0 \leq F(x) \leq 1$.

Given a sample $\{x_1, \dots, x_n\}$ we define its ECDF (Empirical Cumulative Distribution Function) as

$$\hat{F}(x) = \frac{1}{n} \sum_{i=1}^{n} 1_{(-\infty, x]}(x_i)$$

where

$$1_{(-\infty, x]}(x_i) = \begin{cases} 1 & -\infty < x_i \leq x \\ 0 & x_i > x. \end{cases}$$
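In code, the ECDF simply counts the fraction of sample points at or below $x$. A minimal sketch follows; the function name `ecdf` and the toy sample are illustrative choices.

```python
import bisect


def ecdf(sample):
    """Return the empirical distribution function of a sample."""
    xs = sorted(sample)
    n = len(xs)

    def F_hat(x):
        # bisect_right gives the number of sample points x_i with x_i <= x.
        return bisect.bisect_right(xs, x) / n

    return F_hat


F_hat = ecdf([3, 1, 2, 2])
for x in (0, 1, 2, 2.5, 3):
    print(x, F_hat(x))
```

The result is a step function that jumps by $1/n$ at each data point (by more when values are tied).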



ECDF of petal.length for each iris type



Quantile function

The inverse of a distribution function

$$F(Q(p)) = p$$

with $0 \leq p \leq 1$.

For a sample $\{x_1, \dots, x_n\}$ an estimated quantile $\hat{Q}(p)$ is chosen so that approximately $np$ of the data points are less than $\hat{Q}(p)$.
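One simple way to make "approximately $np$" concrete is to invert the ECDF: take the smallest data value with at least $np$ points at or below it. This is only one of several conventions in common use, and the function name `sample_quantile` is an assumption made for this sketch.

```python
import math


def sample_quantile(sample, p):
    """Smallest data value with at least n*p sample points at or below it."""
    if not 0 < p <= 1:
        raise ValueError("p must lie in (0, 1]")
    xs = sorted(sample)
    n = len(xs)
    # A small tolerance guards against floating-point noise in n * p.
    k = max(1, math.ceil(n * p - 1e-9))
    return xs[k - 1]


data = [7, 1, 5, 3, 9]
print(sample_quantile(data, 0.5))   # median under this convention
print(sample_quantile(data, 0.2))
print(sample_quantile(data, 1.0))
```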


Quantile plots of petal.length for each iris type


Boxplot


Boxplot of all variables across iris types


Boxplot of petal.length for each iris type


Contour of 2D density estimate


Heatmap of 2D density estimate


Correlation (case × case) matrix of iris data


Part II

Multidimensional scaling



A visual tool

Recall the PCA scores were

$$\tilde{X} V = U \Delta$$

where $\tilde{X} = H X S^{-1/2} = U \Delta V^T$.

Above, $U$ are eigenvectors of

$$\tilde{X} \tilde{X}^T = H X S^{-1} X^T H$$

Also, $X S^{-1} X^T$ is a measure of similarity between cases.
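These identities can be checked numerically with NumPy. The sketch assumes that $H = I - \frac{1}{n} 1 1^T$ is the centering matrix and that $S$ is the diagonal matrix of column variances; neither is restated on this slide, and the random data matrix is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 3
# Hypothetical data matrix with columns on very different scales.
X = rng.normal(size=(n, p)) * np.array([1.0, 5.0, 0.2])

H = np.eye(n) - np.ones((n, n)) / n          # centering matrix (assumed form of H)
variances = X.var(axis=0, ddof=1)            # diagonal of S (assumed form of S)
X_tilde = H @ X @ np.diag(1.0 / np.sqrt(variances))

U, delta, Vt = np.linalg.svd(X_tilde, full_matrices=False)
scores = X_tilde @ Vt.T                      # PCA scores

gram = H @ X @ np.diag(1.0 / variances) @ X.T @ H

print(np.allclose(scores, U * delta))             # scores equal U Delta
print(np.allclose(gram, X_tilde @ X_tilde.T))     # the two forms of the Gram matrix agree
print(np.allclose(gram @ U, U * delta**2))        # columns of U are eigenvectors
```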



Classical multidimensional scaling

Given a similarity matrix $A$, recall the relationship between a similarity and “distance”

$$D_{ij} = (A_{ii} - 2A_{ij} + A_{jj})^{1/2}$$

Now, consider the matrix $B$ with entries

$$B_{ij} = -\frac{1}{2} D_{ij}^2.$$

Finally, consider the matrix

C = HBH
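The three steps (similarity to distance, squaring and scaling, double centering) can be sketched in plain Python. The sketch assumes $H$ is the centering matrix $I - \frac{1}{n} 1 1^T$, so that $HBH$ subtracts row and column means and adds back the grand mean. The helper names and the toy similarity matrix (inner products of the points 0, 1, 3 on a line) are illustrative.

```python
import math


def similarity_to_distance(A):
    """D_ij = (A_ii - 2 A_ij + A_jj)^(1/2)."""
    n = len(A)
    return [
        [math.sqrt(max(A[i][i] - 2 * A[i][j] + A[j][j], 0.0)) for j in range(n)]
        for i in range(n)
    ]


def double_center(B):
    """Compute H B H by removing row and column means and adding the grand mean."""
    n = len(B)
    row_mean = [sum(B[i]) / n for i in range(n)]
    col_mean = [sum(B[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    return [
        [B[i][j] - row_mean[i] - col_mean[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


x = [0.0, 1.0, 3.0]                            # toy one-dimensional cases
A = [[xi * xj for xj in x] for xi in x]        # similarity: inner products
D = similarity_to_distance(A)                  # recovers |x_i - x_j|
B = [[-0.5 * d * d for d in row] for row in D]
C = double_center(B)

print(D)
print(C)
```

For this toy input, $C$ is the matrix of inner products of the centered points, so every row of $C$ sums to zero.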




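Suppose $A = X S^{-1} X^T$ is a similarity matrix for a (scaled) data matrix $X$.

Then,

$$B = A - \frac{1}{2}(\nu \mathbf{1}^T + \mathbf{1} \nu^T) = X S^{-1} X^T - \frac{1}{2}(\nu \mathbf{1}^T + \mathbf{1} \nu^T)$$

Above, $\nu = \mathrm{diag}(A)$.

Therefore, since $H \mathbf{1} = 0$,

$$C = HBH = HAH = \tilde{X} \tilde{X}^T = U \Delta^2 U^T.$$

In short, the eigenvectors of $C$, scaled by $\Delta$, are the PCA scores $U \Delta$.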
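A numerical check of this chain of equalities, as a sketch under the same assumptions as before ($H$ the centering matrix, $S$ the diagonal matrix of column variances) and with random illustrative data. Eigenvectors are only determined up to sign, so the final comparison uses absolute values.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 2
X = rng.normal(size=(n, p))                    # hypothetical data matrix

H = np.eye(n) - np.ones((n, n)) / n            # centering matrix (assumed form of H)
S_inv = np.diag(1.0 / X.var(axis=0, ddof=1))   # inverse of S (assumed diagonal)
A = X @ S_inv @ X.T                            # similarity between cases
nu = np.diag(A)
one = np.ones(n)

D2 = nu[:, None] - 2 * A + nu[None, :]         # squared "distances" D_ij^2
B = -0.5 * D2
B_formula = A - 0.5 * (np.outer(nu, one) + np.outer(one, nu))
C = H @ B @ H

X_tilde = H @ X @ np.sqrt(S_inv)
U, delta, Vt = np.linalg.svd(X_tilde, full_matrices=False)
pca_scores = U * delta

# Leading eigenvectors of C, scaled by the square roots of their eigenvalues.
evals, evecs = np.linalg.eigh(C)               # eigenvalues in ascending order
top = np.argsort(evals)[::-1][:p]
mds_scores = evecs[:, top] * np.sqrt(evals[top])

print(np.allclose(B, B_formula))
print(np.allclose(C, H @ A @ H), np.allclose(C, X_tilde @ X_tilde.T))
print(np.allclose(np.abs(mds_scores), np.abs(pca_scores)))
```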


Olympic PCA scores


Olympic MDS scores


MDS vs. PCA


MDS vs. PCA




We can form B, C for any dissimilarity matrix D.

The matrix $C$ will be symmetric, so it will be diagonalizable as $C = W \Lambda W^T$ with $W_{n \times k}$ and $\mathrm{rank}(C) = k$. In the general case the diagonal entries $\Lambda_{ii}$ are not necessarily non-negative.

This leads to a Euclidean representation

$$(W \Lambda^{1/2})_{n \times k}$$

with each row of $W \Lambda^{1/2}$ being a point in Euclidean space.

Taking the first two columns of $W \Lambda^{1/2}$ gives a two-dimensional representation.




The points

$$(W \Lambda^{1/2})[, 1:2]$$

are an optimal Euclidean embedding in the sense that their interpoint distances are chosen to be close to the dissimilarities in $D$.

If the dissimilarity is Euclidean and the points lie in a 2-dimensional plane in $\mathbb{R}^n$, then the interpoint distances will be identical . . .
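The planar case can be checked with a short NumPy sketch of classical MDS. The function names and the random points are illustrative. Clipping negative eigenvalues to zero is a choice made in this sketch so that the square root stays real; the slides do not prescribe how to treat them.

```python
import numpy as np


def pairwise_distances(P):
    """Euclidean distance between every pair of rows of P."""
    diff = P[:, None, :] - P[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def classical_mds(D, k=2):
    """First k columns of W Lambda^(1/2) for a dissimilarity matrix D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    C = H @ (-0.5 * D ** 2) @ H
    evals, evecs = np.linalg.eigh(C)           # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:k]          # indices of the k largest
    lam = np.clip(evals[top], 0.0, None)       # sketch-only choice: drop negatives
    return evecs[:, top] * np.sqrt(lam)


rng = np.random.default_rng(2)
P = rng.uniform(size=(7, 2))                   # hypothetical points in a plane
D = pairwise_distances(P)
Y = classical_mds(D, k=2)

# The embedding differs from P by a rotation, reflection and shift,
# but the interpoint distances are reproduced.
print(np.allclose(pairwise_distances(Y), D))
```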


Distances between US capitals


Distances between US capitals


Map of U.S. capitals


MDS of Iris data, $\ell_2$


Other applications

Manifold learning

The goal of manifold learning is to find “low-dimensional” representations of $X$ that are not necessarily linear.

Examples of techniques:

ISOMAP
Laplacian eigenmaps / diffusion geometry
Local linear embedding


Manifold learning

Graph distance

In the “bottleneck”, we already have a metric (Euclidean distance), so we can use MDS.

But we might want to emphasize the fact that the groups are “barely connected”...

How can we emphasize this bottleneck in terms of a distance?

The ISOMAP and diffusion geometry methods do this by creating a graph and defining a new metric based on the graph.


Manifold learning

Graph distances

In our “bottleneck” picture, let’s connect $k$ mutual nearest neighbours to form a graph $G_k$.

That is, insert an edge $(i, j)$ if either
1. $i$ is within $j$’s $k$ nearest neighbours;
2. $j$ is within $i$’s $k$ nearest neighbours.

Let

$$D_{G_k}(i, j) = \text{length of shortest path between vertices } i \text{ and } j \text{ in } G_k$$

This is a metric on Gk .
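A minimal pure-Python sketch of this construction follows. It assumes each edge is weighted by the Euclidean distance between its endpoints (the slide only says "length of shortest path") and uses the Floyd-Warshall algorithm for the shortest paths. The points on a three-quarter circle are a hypothetical stand-in for the bottleneck picture.

```python
import math


def knn_graph_distances(points, k):
    """Shortest-path distances in the k-nearest-neighbour graph G_k."""
    n = len(points)
    euclid = [[math.dist(p, q) for q in points] for p in points]
    dist = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: euclid[i][j])
        for j in others[:k]:
            # Edge (i, j) is inserted if j is among i's k nearest neighbours;
            # storing it symmetrically implements the "either" rule.
            dist[i][j] = dist[j][i] = euclid[i][j]
    # Floyd-Warshall: allow paths through intermediate vertex m.
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][m] + dist[m][j] < dist[i][j]:
                    dist[i][j] = dist[i][m] + dist[m][j]
    return dist


# Hypothetical configuration: 13 points along three quarters of a unit circle.
angles = [i * 1.5 * math.pi / 12 for i in range(13)]
points = [(math.cos(t), math.sin(t)) for t in angles]
graph_dist = knn_graph_distances(points, k=2)

print("Euclidean distance between the ends:", math.dist(points[0], points[-1]))
print("Graph distance between the ends:    ", graph_dist[0][-1])
```

The two ends of the arc are close in the plane but far apart in the graph, which is the effect the graph metric is meant to emphasize.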


Manifold learning

ISOMAP

Shove $D_{G_k}$ through classical MDS.

ISOMAP can “flatten” data if there is a low-dimensional Euclidean configuration where the Euclidean distances approximate the graph distances $D_{G_k}$ well.

This means that the data has to be “flat” if ISOMAP is to recover it.
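Put together, the procedure is short. The following NumPy sketch is an illustration under stated assumptions rather than a reference implementation: edges are weighted by Euclidean length, negative eigenvalues are clipped to zero, and the data are hypothetical points on three quarters of a circle, a curve that is "flat" in the sense that it can be unrolled onto a line.

```python
import numpy as np


def euclidean_distances(P):
    diff = P[:, None, :] - P[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def graph_distances(P, k):
    """Shortest-path distances in the k-nearest-neighbour graph."""
    E = euclidean_distances(P)
    n = len(P)
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        nbrs = np.argsort(E[i])[1:k + 1]       # position 0 is the point itself
        G[i, nbrs] = E[i, nbrs]
        G[nbrs, i] = E[i, nbrs]
    for m in range(n):                         # Floyd-Warshall, one vertex at a time
        G = np.minimum(G, G[:, [m]] + G[[m], :])
    return G


def classical_mds(D, dim):
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    C = H @ (-0.5 * D ** 2) @ H
    evals, evecs = np.linalg.eigh(C)
    top = np.argsort(evals)[::-1][:dim]
    return evecs[:, top] * np.sqrt(np.clip(evals[top], 0.0, None))


def is_monotone(v):
    d = np.diff(v)
    return bool(np.all(d > 0) or np.all(d < 0))


angles = np.linspace(0.0, 1.5 * np.pi, 13)
P = np.column_stack([np.cos(angles), np.sin(angles)])

isomap_coord = classical_mds(graph_distances(P, k=2), dim=1)[:, 0]
plain_coord = classical_mds(euclidean_distances(P), dim=1)[:, 0]

print("ISOMAP coordinate ordered along the curve:   ", is_monotone(isomap_coord))
print("Plain MDS coordinate ordered along the curve:", is_monotone(plain_coord))
```

The one-dimensional ISOMAP coordinate keeps the points in their order along the curve, whereas a one-dimensional classical MDS on the Euclidean distances folds the curve back onto itself.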


Multidimensional arrays

Data cubes

It is sometimes useful to summarize data into a multidimensional array, with one axis per “category.”

In the unemployment data, we might summarize the results by tuples (state, period, variable).

In the Iris data, we might summarize by (petal.width, petal.length, iris.type).

In order to summarize by petal.width, petal.length we might form categories: low, medium, high.
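A minimal sketch of such a cube as a table of counts indexed by (petal.width category, petal.length category, iris.type). The six records and the cut points are made up for illustration; they are not taken from the Iris data or from the course.

```python
from collections import Counter


def categorize(value, cuts):
    """Map a number to low / medium / high using two cut points."""
    low_cut, high_cut = cuts
    if value < low_cut:
        return "low"
    if value < high_cut:
        return "medium"
    return "high"


# Hypothetical records: (petal.width, petal.length, iris.type).
records = [
    (0.2, 1.4, "setosa"),
    (0.3, 1.5, "setosa"),
    (1.3, 4.5, "versicolor"),
    (1.5, 4.7, "versicolor"),
    (2.1, 5.8, "virginica"),
    (1.8, 4.9, "virginica"),
]
WIDTH_CUTS = (0.8, 1.7)     # hypothetical cut points
LENGTH_CUTS = (2.5, 5.0)    # hypothetical cut points

# The cube: one count per (width category, length category, type) cell.
cube = Counter(
    (categorize(w, WIDTH_CUTS), categorize(l, LENGTH_CUTS), t)
    for w, l, t in records
)

# A slice of the cube: fix iris.type and keep the other two axes.
virginica_slice = {
    (w, l): count for (w, l, t), count in cube.items() if t == "virginica"
}

print(cube)
print(virginica_slice)
```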


Drilling down

Resolution of a data cube

In the process of forming a data cube for unemployment across states, we had already aggregated over county.

We could make another cube with county-level resolution. This would be drilling down.

The operation of going from county-level to state-level is rolling up.

Such operations are often best handled by database tools rather than *R*, but *R* does support multidimensional arrays.
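Rolling up amounts to summing over the axis being dropped. The sketch below uses a dictionary keyed by (state, county, period). The names and counts are invented placeholders, and summing is appropriate here only because the cells are taken to hold counts rather than rates.

```python
from collections import defaultdict

# Hypothetical county-level cube: (state, county, period) -> number unemployed.
county_cube = {
    ("State A", "County 1", "2012-Q1"): 120,
    ("State A", "County 2", "2012-Q1"): 80,
    ("State A", "County 1", "2012-Q2"): 110,
    ("State A", "County 2", "2012-Q2"): 95,
    ("State B", "County 3", "2012-Q1"): 200,
    ("State B", "County 3", "2012-Q2"): 210,
}


def roll_up(cube, keep):
    """Aggregate a cube by summing over every axis not listed in keep."""
    out = defaultdict(int)
    for key, value in cube.items():
        out[tuple(key[axis] for axis in keep)] += value
    return dict(out)


# Keep the state axis (0) and the period axis (2); sum over county (axis 1).
state_cube = roll_up(county_cube, keep=(0, 2))
print(state_cube)
```

Drilling down goes the other way and cannot be computed from `state_cube` alone. It requires going back to the finer-grained data.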
