
Statistics 202: Data Mining

Week 2

Based in part on slides from the textbook and slides of Susan Holmes.

© Jonathan Taylor

October 3, 2012

Part I

Other datatypes, preprocessing

Document data

You might start with a collection of n documents (e.g. all of Shakespeare's works in some digital format; all of Wikileaks' U.S. Embassy Cables).

This is not a data matrix . . .

Given p terms of interest ({Al Qaeda, Iran, Iraq, etc.}), one can form a term-document matrix filled with counts.

Document data

Document ID   Al Qaeda   Iran   Iraq   ...
1             0          0      3      ...
...           ...        ...    ...    ...
250000        2          1      1      ...
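To make the construction concrete, here is a minimal Python sketch (not from the slides) that builds a term-document count matrix from a toy corpus; the documents and term list are invented for the example.

```python
# A toy term-document matrix: one row per document, one column per term.
docs = [
    "iraq and iran discussed trade with iraq",
    "al qaeda activity reported near the iran border",
]
terms = ["al qaeda", "iran", "iraq"]  # the p terms of interest

def term_counts(doc, terms):
    """Count occurrences of each (possibly multi-word) term in a document."""
    return [doc.count(t) for t in terms]

X = [term_counts(d, terms) for d in docs]
print(X)  # [[0, 1, 2], [1, 1, 0]]
```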

Time series

Imagine recording the minute-by-minute prices of all stocks in the S&P 500 for the last 200 days of trading.

The data can be represented by a 78000 × 500 matrix (200 days × 390 trading minutes per day gives 78000 rows).

BUT, there is definite structure across the rows of this matrix.

They are not unrelated "cases" like they might be in other applications.

Transaction data

[figure]

Social media data

[figure]

Graph data

[figure]

Graph data

Nodes on the graph might be facebook users, or public pages.

Weights on the edges could be the number of messages sent in a prespecified period.

If you take weekly intervals, this leads to a sequence of graphs

$$G_i = \text{communication over the } i\text{-th week}.$$

How this graph changes is of obvious interest . . .

Even the structure of just one graph is of interest; we'll come back to this when we talk about spectral methods . . .
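One natural in-memory representation (a sketch with made-up node labels, not from the slides) stores each week's communication graph as a weighted adjacency matrix, so the sequence G_i becomes a list of matrices:

```python
import numpy as np

users = ["ann", "bob", "carol"]   # hypothetical node labels
n = len(users)
weeks = 2

rng = np.random.default_rng(0)
# G[i][j, k] = number of messages from user j to user k during week i.
G = [rng.poisson(2.0, size=(n, n)) for _ in range(weeks)]
for A in G:
    np.fill_diagonal(A, 0)        # no self-messages

# Week-over-week change in communication volume:
print(np.abs(G[1] - G[0]).sum())
```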

Data quality

Some issues to keep in mind:

Is the data experimental or observational?

If observational, what do we know about the data generating mechanism? For example, although the S&P 500 example can be represented as a data matrix, there is clearly structure across rows.

General quality issues: How much of the data is missing? Is missingness informative? Is it very noisy? Are there outliers? Are there a large number of duplicates?

Preprocessing

General procedures:

Aggregation: combining features into a new feature. Example: pooling county-level data to state-level data.

Discretization: breaking up a continuous variable (or a set of continuous variables) into an ordinal (or nominal) discrete variable.

Transformation: a simple transformation of a feature (log or exp), or a mapping to a new space (Fourier transform / power spectrum, wavelet transform).
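As a sketch of the first two discretization schemes illustrated on the next slides (fixed-width bins and quantile-based bins), on simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)                  # a continuous feature

# Fixed width: 4 equal-width bins spanning the observed range.
width_edges = np.linspace(x.min(), x.max(), 5)
width_bins = np.digitize(x, width_edges[1:-1])

# Fixed quantiles: 4 bins with (roughly) equal counts.
quant_edges = np.quantile(x, [0.25, 0.5, 0.75])
quant_bins = np.digitize(x, quant_edges)

print(np.bincount(width_bins))             # very unequal counts
print(np.bincount(quant_bins))             # roughly 250 per bin

# Discretization by clustering would instead place bin boundaries with,
# e.g., k-means (scipy.cluster.vq.kmeans2) and assign each point to the
# nearest cluster center.
```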

A continuous variable that could be discretized

[Figure: x-axis roughly −4 to 4, y-axis 0.0 to 1.0]

Discretization by fixed width

[Figure: x-axis roughly −4 to 4, y-axis 0.0 to 1.0]

Discretization by fixed quartile

[Figure: x-axis roughly −4 to 4, y-axis 0.0 to 1.0]

Discretization by clustering

[Figure: x-axis roughly −4 to 4, y-axis 0.0 to 1.0]

Variable transformation: bacterial decay

[figure]

Variable transformation: bacterial decay

[figure]

BMW daily returns (fEcofin package)

[figure]

ACF of BMW daily returns

[figure]

Discrete wavelet transform of BMW daily returns

[figure]

Part II

Dimension reduction, PCA & eigenanalysis

Combinations of features

Given a data matrix $X_{n \times p}$ with $p$ fairly large, it can be difficult to visualize structure.

It is often useful to look at linear combinations of the features.

Each $\beta \in \mathbb{R}^p$ determines a linear rule

$$f_\beta(x) = x^T \beta.$$

Evaluating this on each row $X_i$ of $X$ yields a vector

$$(f_\beta(X_i))_{1 \le i \le n} = X\beta.$$

Dimension reduction

By choosing $\beta$ appropriately, we may find "interesting" new features.

Suppose we take $k$, much smaller than $p$, vectors $\beta$, which we collect as a matrix

$$B_{p \times k}.$$

The new data matrix $XB$ has fewer columns than $X$.

This is dimension reduction . . .

Principal Components

This method of dimension reduction seeks features that maximize sample variance . . .

Specifically, for $\beta \in \mathbb{R}^p$, define the sample variance of $V_\beta = X\beta \in \mathbb{R}^n$:

$$\mathrm{Var}(V_\beta) = \frac{1}{n-1} \sum_{i=1}^n (V_{\beta,i} - \bar V_\beta)^2, \qquad \bar V_\beta = \frac{1}{n} \sum_{i=1}^n V_{\beta,i}.$$

Note that $\mathrm{Var}(V_{C\beta}) = C^2 \cdot \mathrm{Var}(V_\beta)$, so we usually standardize

$$\|\beta\|_2 = \sqrt{\sum_{j=1}^p \beta_j^2} = 1.$$

Principal Components

Define the $n \times n$ matrix

$$H = I_{n \times n} - \frac{1}{n} \mathbf{1}\mathbf{1}^T.$$

This matrix removes means:

$$(Hv)_i = v_i - \bar v.$$

It is also a projection matrix:

$$H^T = H, \qquad H^2 = H.$$
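A quick numeric check of these properties in numpy (a sketch; any vector $v$ will do):

```python
import numpy as np

n = 5
H = np.eye(n) - np.ones((n, n)) / n       # H = I - (1/n) 11^T

v = np.arange(n, dtype=float)
print(np.allclose(H @ v, v - v.mean()))   # True: H removes the mean
print(np.allclose(H.T, H))                # True: symmetric
print(np.allclose(H @ H, H))              # True: idempotent (a projection)
```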

Principal Components with Matrices

With this matrix,

$$\mathrm{Var}(V_\beta) = \frac{1}{n-1} \beta^T X^T H X \beta.$$

So, maximizing sample variance subject to $\|\beta\|_2 = 1$ is

$$\underset{\|\beta\|_2 = 1}{\text{maximize}} \;\; \beta^T \left( X^T H X \right) \beta.$$

This boils down to an eigenvalue problem . . .
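In numpy this is a symmetric eigenvalue problem; a minimal sketch on simulated data (np.linalg.eigh returns eigenvalues in ascending order, so the last column holds the top eigenvector):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 4
X = rng.normal(size=(n, p))

H = np.eye(n) - np.ones((n, n)) / n
M = X.T @ H @ X                     # p x p, symmetric

d, V = np.linalg.eigh(M)            # ascending eigenvalues
beta1 = V[:, -1]                    # top eigenvector, ||beta1||_2 = 1

scores = H @ X @ beta1              # 1st principal component scores
# The sample variance of X @ beta1 is the top eigenvalue over (n - 1):
print(np.allclose(scores.var(ddof=1), d[-1] / (n - 1)))
```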

Eigenanalysis

The matrix $X^T H X$ is symmetric, so it can be written as

$$X^T H X = V D V^T,$$

where $D_{k \times k} = \mathrm{diag}(d_1, \ldots, d_k)$ with $k = \mathrm{rank}(X^T H X)$, and $V_{p \times k}$ has orthonormal columns, i.e. $V^T V = I_{k \times k}$.

We always have $d_i \ge 0$, and we order the eigenvalues so that $d_1 \ge d_2 \ge \cdots \ge d_k$.

Eigenanalysis & PCA

Suppose now that $\beta = \sum_{j=1}^k a_j v_j$ with $v_j$ the columns of $V$. Then,

$$\|\beta\|_2 = \sqrt{\sum_{j=1}^k a_j^2}, \qquad \beta^T \left( X^T H X \right) \beta = \sum_{j=1}^k a_j^2 d_j.$$

Choosing $a_1 = 1$ and $a_j = 0$ for $j \ge 2$ maximizes this quantity.

Eigenanalysis & PCA

Therefore $\beta_1 = v_1$, the first column of $V$, solves

$$\underset{\|\beta\|_2 = 1}{\text{maximize}} \;\; \beta^T \left( X^T H X \right) \beta.$$

This yields the scores $HX\beta_1 \in \mathbb{R}^n$.

These are the 1st principal component scores.

Higher order components

Having found the direction of "maximal sample variance", we might look for the "second most variable" direction by solving

$$\underset{\|\beta\|_2 = 1,\; \beta^T v_1 = 0}{\text{maximize}} \;\; \beta^T \left( X^T H X \right) \beta.$$

Note that we restricted our search so we would not just recover $v_1$ again.

It is not hard to see that if $\beta = \sum_{j=1}^k a_j v_j$, this is solved by taking $a_2 = 1$ and $a_j = 0$ for $j \ne 2$.

Higher order components

In matrix terms, all the principal component scores are

$$(HXV)_{n \times k},$$

and the loadings are the columns of $V$.

This information can be summarized in a biplot.

The loadings describe how each feature contributes to each principal component score.

Olympic data

[figure]

Olympic data: screeplot

[figure]

The importance of scale

The PCA scores are not invariant to scaling of the features: for $Q_{p \times p}$ diagonal, the PCA scores of $XQ$ are not the same as those of $X$.

It is common to convert all variables to the same scale before applying PCA.

Define the scalings to be the sample standard deviations of the features. In matrix form, let

$$S^2 = \frac{1}{n-1} \mathrm{diag}\left( X^T H X \right).$$

Define $\tilde X = H X S^{-1}$. The normalized PCA loadings are given by an eigenanalysis of $\tilde X^T \tilde X$.
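A sketch of this standardization in numpy (simulated data; ddof=1 gives the sample standard deviation used above):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 4
X = rng.normal(size=(n, p)) * np.array([1.0, 10.0, 0.1, 5.0])  # mixed scales

H = np.eye(n) - np.ones((n, n)) / n
s = X.std(axis=0, ddof=1)           # the diagonal of S
Xt = (H @ X) / s                    # X~ = H X S^{-1}

# Normalized PCA loadings: eigenanalysis of X~^T X~.
d, V = np.linalg.eigh(Xt.T @ Xt)
loadings = V[:, ::-1]               # columns ordered by decreasing eigenvalue
print(d[::-1])
```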

Olympic data

[figure]

Olympic data: screeplot

[figure]

PCA and the SVD

The singular value decomposition of a matrix tells us we can write

$$\tilde X = U \Delta V^T$$

with $\Delta_{k \times k} = \mathrm{diag}(\delta_1, \ldots, \delta_k)$, $\delta_j \ge 0$, $k = \mathrm{rank}(\tilde X)$, and $U^T U = V^T V = I_{k \times k}$.

Recall that the scores were

$$\tilde X V = (U \Delta V^T) V = U \Delta.$$

Also,

$$\tilde X^T \tilde X = V \Delta^2 V^T,$$

so $D = \Delta^2$.
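These identities are easy to check numerically (a sketch; note that numpy's svd returns $V^T$, with singular values in descending order):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 4
X = rng.normal(size=(n, p))
H = np.eye(n) - np.ones((n, n)) / n
Xt = (H @ X) / X.std(axis=0, ddof=1)

U, delta, Vt = np.linalg.svd(Xt, full_matrices=False)

print(np.allclose(Xt @ Vt.T, U * delta))            # scores = U Delta
d, _ = np.linalg.eigh(Xt.T @ Xt)
print(np.allclose(np.sort(delta**2), np.sort(d)))   # D = Delta^2
```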

PCA and the SVD

Recall $\tilde X = H X S^{-1}$.

We saw that the normalized PCA loadings are given by an eigenanalysis of $\tilde X^T \tilde X$.

It turns out that, via the SVD, the normalized PCA scores are given by an eigenanalysis of $\tilde X \tilde X^T$, because

$$(\tilde X V)_{n \times k} = U \Delta,$$

where the columns of $U$ are eigenvectors of

$$\tilde X \tilde X^T = U \Delta^2 U^T.$$

Another characterization of the SVD

Given a data matrix $X$ (or its scaled, centered version $\tilde X$), we might try solving

$$\underset{Z :\, \mathrm{rank}(Z) = k}{\text{minimize}} \;\; \|X - Z\|_F^2,$$

where $F$ stands for the Frobenius norm on matrices:

$$\|A\|_F^2 = \sum_{i=1}^n \sum_{j=1}^p A_{ij}^2.$$

Another characterization of the SVD

It can be proven that the solution is

$$Z = S_{n \times k} (V^T)_{k \times p},$$

where $S_{n \times k}$ is the matrix of the first $k$ PCA scores and the columns of $V$ are the first $k$ PCA loadings.

This approximation is related to the screeplot: the height of each bar describes the additional drop in Frobenius norm as the rank of the approximation increases.
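A sketch of the rank-k approximation via the truncated SVD, checking (on simulated data) that the squared Frobenius error equals the sum of the discarded squared singular values:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 6))
U, delta, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
Z = (U[:, :k] * delta[:k]) @ Vt[:k]                # scores times loadings^T

err2 = np.linalg.norm(X - Z, "fro") ** 2
print(np.allclose(err2, (delta[k:] ** 2).sum()))   # True
```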

Olympic data: screeplot

[figure]

Other types of dimension reduction

Instead of maximizing sample variance, we might try maximizing some other quantity . . .

Independent Component Analysis tries to maximize the "non-Gaussianity" of $V_\beta$. In practice, it uses skewness, kurtosis or other moments to quantify "non-Gaussianity."

These are both unsupervised approaches.

Often, they are combined with supervised approaches into an algorithm like:

Feature creation: build some set of features using PCA.

Validate: use the derived features to see if they are helpful in the supervised problem.

Olympic data: 1st component vs. total score

[figure]

Part III

Distances and similarities

Similarities

Start with $X$, which we assume is centered and standardized.

The PCA loadings were given by eigenvectors of the correlation matrix, which is a measure of similarity.

The first 2 (or any 2) PCA scores yield an $n \times 2$ matrix that can be visualized as a scatter plot.

Similarly, the first 2 (or any 2) PCA loadings yield a $p \times 2$ matrix that can be visualized as a scatter plot.

Olympic data

[figure]

Similarities

The PCA loadings were given by eigenvectors of the correlation matrix, which is a measure of similarity.

The visualization of the cases based on PCA scores was determined by the eigenvectors of $XX^T$, an $n \times n$ matrix.

To see this, remember that

$$X = U \Delta V^T,$$

so

$$XX^T = U \Delta^2 U^T, \qquad X^T X = V \Delta^2 V^T.$$

Similarities

The matrix $XX^T$ is a measure of similarity between cases.

The matrix $X^T X$ is a measure of similarity between features.

Structure in the two similarity matrices yields insight into the set of cases, or the set of features . . .

Distances

Distances are inversely related to similarities.

If $A$ and $B$ are similar, then $d(A, B)$ should be small, i.e. they should be near each other.

If $A$ and $B$ are distant, then they should not be similar.

For a data matrix, there is a natural distance between cases:

$$d(X_i, X_k)^2 = \|X_i - X_k\|_2^2 = \sum_{j=1}^p (X_{ij} - X_{kj})^2 = (XX^T)_{ii} - 2(XX^T)_{ik} + (XX^T)_{kk}.$$

Distances

This suggests a natural transformation from a similarity matrix $S$ to a distance matrix $D$:

$$D_{ik} = (S_{ii} - 2 S_{ik} + S_{kk})^{1/2}.$$

The reverse transformation is not so obvious. Some suggestions from your book:

$$S_{ik} = -D_{ik}, \qquad S_{ik} = e^{-D_{ik}}, \qquad S_{ik} = \frac{1}{1 + D_{ik}}.$$
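These transformations vectorize directly; a sketch using the Gram matrix $XX^T$ as the similarity (the clip at zero only guards against tiny negative values from floating-point roundoff):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
S = X @ X.T                                  # similarity between cases

diag = np.diag(S)
D = np.sqrt(np.maximum(diag[:, None] - 2 * S + diag[None, :], 0.0))

# Reverse transformations suggested in the book:
S1, S2, S3 = -D, np.exp(-D), 1.0 / (1.0 + D)

# Sanity check: D matches the pairwise Euclidean distances.
pairwise = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
print(np.allclose(D, pairwise))              # True
```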

Distances

A distance (or a metric) on a set $S$ is a function $d : S \times S \to [0, +\infty)$ that satisfies

$$d(x, x) = 0; \qquad d(x, y) = 0 \iff x = y;$$
$$d(x, y) = d(y, x);$$
$$d(x, y) \le d(x, z) + d(z, y).$$

If $d(x, y) = 0$ for some $x \ne y$, then $d$ is a pseudo-metric.

Similarities

A similarity on a set $S$ is a function $s : S \times S \to \mathbb{R}$ that should satisfy

$$s(x, x) \ge s(x, y) \text{ for all } x \ne y; \qquad s(x, y) = s(y, x).$$

By adding a constant, we can often assume that $s(x, y) \ge 0$.

Examples: nominal data

The simplest example for nominal data is just the discrete metric

$$d(x, y) = \begin{cases} 0 & x = y \\ 1 & \text{otherwise.} \end{cases}$$

The corresponding similarity would be

$$s(x, y) = \begin{cases} 1 & x = y \\ 0 & \text{otherwise} \end{cases} \;=\; 1 - d(x, y).$$

Examples: ordinal data

If $S$ is ordered, we can think of $S$ as (or identify $S$ with) a subset of the non-negative integers.

If $|S| = m$, then a natural distance is

$$d(x, y) = \frac{|x - y|}{m - 1} \le 1.$$

The corresponding similarity would be

$$s(x, y) = 1 - d(x, y).$$

Examples: vectors of continuous data

If $S = \mathbb{R}^k$, there are lots of distances determined by norms.

The Minkowski $p$ or $\ell_p$ norm, for $p \ge 1$:

$$d(x, y) = \|x - y\|_p = \left( \sum_{i=1}^k |x_i - y_i|^p \right)^{1/p}.$$

Examples:

$p = 2$: the usual Euclidean distance, $\ell_2$.

$p = 1$: $d(x, y) = \sum_{i=1}^k |x_i - y_i|$, the "taxicab distance", $\ell_1$.

$p = \infty$: $d(x, y) = \max_{1 \le i \le k} |x_i - y_i|$, the sup norm, $\ell_\infty$.
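A small numpy helper covering all three cases (a sketch; np.inf selects the sup norm):

```python
import numpy as np

def minkowski(x, y, p):
    """l_p distance between vectors x and y; p = np.inf gives the sup norm."""
    diff = np.abs(np.asarray(x) - np.asarray(y))
    if np.isinf(p):
        return diff.max()
    return (diff ** p).sum() ** (1.0 / p)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.0])
print(minkowski(x, y, 1))        # 3.0   (taxicab)
print(minkowski(x, y, 2))        # ~2.24 (Euclidean)
print(minkowski(x, y, np.inf))   # 2.0   (sup norm)
```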

Examples: vectors of continuous data

If $0 < p < 1$, this no longer defines a norm, but $\sum_{i=1}^k |x_i - y_i|^p$ (without the $1/p$ power) still defines a metric.

The preceding transformations can be used to construct similarities.

Examples: binary vectors

If $S = \{0, 1\}^k \subset \mathbb{R}^k$, the vectors can be thought of as vectors of bits.

The $\ell_1$ norm counts the number of mismatched bits.

This is known as Hamming distance.

Example: Mahalanobis distance

Given $\Sigma_{k \times k}$ that is positive definite, we define the Mahalanobis distance on $\mathbb{R}^k$ by

$$d_\Sigma(x, y) = \left( (x - y)^T \Sigma^{-1} (x - y) \right)^{1/2}.$$

This is the usual Euclidean distance, with a change of basis given by a rotation and a stretching of the axes.

If $\Sigma$ is only non-negative definite, then we can replace $\Sigma^{-1}$ with $\Sigma^\dagger$, its pseudo-inverse. This yields a pseudo-metric, because it fails the test

$$d(x, y) = 0 \iff x = y.$$
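A sketch in numpy; solving a linear system avoids forming $\Sigma^{-1}$ explicitly, and the $\Sigma$ here is just an arbitrary positive-definite example:

```python
import numpy as np

def mahalanobis(x, y, Sigma):
    """Mahalanobis distance for positive definite Sigma."""
    diff = x - y
    return np.sqrt(diff @ np.linalg.solve(Sigma, diff))

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])    # positive definite
x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])
print(mahalanobis(x, y, Sigma))

# For a singular (non-negative definite) Sigma, np.linalg.pinv(Sigma)
# plays the role of the pseudo-inverse and yields a pseudo-metric.
```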

Example: similarities for binary vectors

Given binary vectors $x, y \in \{0, 1\}^k$, we can summarize their agreement in a $2 \times 2$ table:

            x = 0   x = 1   total
    y = 0   f00     f01     f0.
    y = 1   f10     f11     f1.
    total   f.0     f.1     k

Example: similarities for binary vectors

We define the simple matching coefficient (SMC) similarity by

$$\mathrm{SMC}(x, y) = \frac{f_{00} + f_{11}}{f_{00} + f_{01} + f_{10} + f_{11}} = \frac{f_{00} + f_{11}}{k} = 1 - \frac{\|x - y\|_1}{k}.$$

The Jaccard coefficient ignores entries where $x_i = y_i = 0$:

$$J(x, y) = \frac{f_{11}}{f_{01} + f_{10} + f_{11}}.$$
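Both coefficients in a few lines of numpy (a sketch on made-up bit vectors):

```python
import numpy as np

x = np.array([1, 0, 0, 1, 1, 0])
y = np.array([1, 0, 1, 1, 0, 0])
k = len(x)

f11 = np.sum((x == 1) & (y == 1))
f00 = np.sum((x == 0) & (y == 0))
mismatches = np.sum(x != y)          # f01 + f10, the Hamming distance

smc = (f00 + f11) / k                # = 1 - ||x - y||_1 / k
jaccard = f11 / (f11 + mismatches)

print(smc, jaccard)                  # 0.666..., 0.5
```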

Example: cosine similarity & correlation

Given vectors $x, y \in \mathbb{R}^k$, the cosine similarity is defined as

$$-1 \le \cos(x, y) = \frac{\langle x, y \rangle}{\|x\|_2 \|y\|_2} \le 1.$$

The correlation between two vectors is defined as

$$-1 \le \mathrm{cor}(x, y) = \frac{\langle x - \bar x \cdot \mathbf{1},\, y - \bar y \cdot \mathbf{1} \rangle}{\|x - \bar x \cdot \mathbf{1}\|_2 \, \|y - \bar y \cdot \mathbf{1}\|_2} \le 1.$$
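Both quantities in numpy (a sketch; note that correlation is just the cosine similarity of the centered vectors):

```python
import numpy as np

def cosine(x, y):
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def cor(x, y):
    return cosine(x - x.mean(), y - y.mean())

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.5, 1.9, 3.2, 4.4])
print(cosine(x, y))
print(cor(x, y), np.corrcoef(x, y)[0, 1])   # these two agree
```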

Example: correlation

An alternative, perhaps more familiar definition:

$$\mathrm{cor}(x, y) = \frac{S_{xy}}{S_x S_y},$$

where

$$S_{xy} = \frac{1}{k-1} \sum_{i=1}^k (x_i - \bar x)(y_i - \bar y), \qquad S_x^2 = \frac{1}{k-1} \sum_{i=1}^k (x_i - \bar x)^2, \qquad S_y^2 = \frac{1}{k-1} \sum_{i=1}^k (y_i - \bar y)^2.$$

Correlation & PCA

The matrix $\frac{1}{n-1} \tilde X^T \tilde X$ was actually the matrix of pair-wise correlations of the features. Why? How?

$$\frac{1}{n-1} \left( \tilde X^T \tilde X \right)_{ij} = \frac{1}{n-1} \left( D^{-1/2} X^T H X D^{-1/2} \right)_{ij},$$

where here $D$ is the diagonal matrix whose entries are the sample variances of the features (i.e. $D = S^2$).

The inner matrix multiplication computes the pair-wise dot products of the columns of $HX$:

$$\left( X^T H X \right)_{ij} = \sum_{k=1}^n (X_{ki} - \bar X_i)(X_{kj} - \bar X_j).$$
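A numeric check of the claim (a sketch on simulated data), comparing $\frac{1}{n-1}\tilde X^T \tilde X$ against numpy's built-in correlation matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))

H = np.eye(n) - np.ones((n, n)) / n
Xt = (H @ X) / X.std(axis=0, ddof=1)     # centered and standardized

C = Xt.T @ Xt / (n - 1)
print(np.allclose(C, np.corrcoef(X, rowvar=False)))   # True
```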

High positive correlation

[figure]

High negative correlation

[figure]

No correlation

[figure]

Small positive

[figure]

Small negative

[figure]

Combining similarities

In a given data set, each case may have many attributes or features.

Example: see the health data set for HW 1.

To compute similarities of cases, we must pool similarities across features.

In a data set with $M$ different features, we write $x_i = (x_{i1}, \ldots, x_{iM})$, with each $x_{im} \in S_m$.

Combining similarities

Given similarities $s_m$ on each $S_m$, we can define an overall similarity between cases $x_i$ and $x_j$ by

$$s(x_i, x_j) = \sum_{m=1}^M w_m s_m(x_{im}, x_{jm}),$$

with optional weights $w_m$ for each feature.

Your book modifies this to deal with "asymmetric attributes", i.e. attributes for which the Jaccard similarity might be used.
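A sketch of the weighted combination for a case with mixed feature types; the features, the per-feature similarities, and the weights are all made up for the example:

```python
import numpy as np

# Two hypothetical cases with (nominal, ordinal, continuous) features.
xi = ("smoker", 2, 120.0)
xj = ("non-smoker", 3, 110.0)

def s_nominal(a, b):
    return 1.0 if a == b else 0.0

def s_ordinal(a, b, m=5):              # m ordered levels
    return 1.0 - abs(a - b) / (m - 1)

def s_continuous(a, b, scale=50.0):    # similarity built from a distance
    return 1.0 / (1.0 + abs(a - b) / scale)

sims = np.array([s_nominal(xi[0], xj[0]),
                 s_ordinal(xi[1], xj[1]),
                 s_continuous(xi[2], xj[2])])
w = np.array([0.2, 0.3, 0.5])          # optional feature weights

print(float(w @ sims))                 # overall similarity s(x_i, x_j)
```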