Statistics 202: Data Mining
Distances and similarities
© Jonathan Taylor. Based in part on slides from the textbook and slides of Susan Holmes.
October 3, 2012


Page 1: Title — Statistics 202: Data Mining, Distances and similarities (Jonathan Taylor, October 3, 2012)

Page 2: Distances and similarities — Similarities

Start with X, which we assume is centered and standardized.

The PCA loadings were given by eigenvectors of the correlation matrix, which is a measure of similarity.

The first 2 (or any 2) PCA scores yield an n × 2 matrix that can be visualized as a scatter plot.

Similarly, the first 2 (or any 2) PCA loadings yield a p × 2 matrix that can be visualized as a scatter plot.

Page 3: Olympic data (figure)

Page 4: Distances and similarities — Similarities

The PCA loadings were given by eigenvectors of the correlation matrix, which is a measure of similarity.

The visualization of the cases based on PCA scores was determined by the eigenvectors of XXᵀ, an n × n matrix.

To see this, remember that

X = UΔVᵀ

so

XXᵀ = UΔ²Uᵀ
XᵀX = VΔ²Vᵀ
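The SVD identities above can be checked numerically. A minimal sketch, assuming numpy and a made-up centered data matrix (not the course data):

```python
import numpy as np

# Toy centered data matrix X (n = 5 cases, p = 3 features); values are made up.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
X = X - X.mean(axis=0)                 # center the columns

# Thin SVD: X = U @ diag(d) @ Vt
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# X Xᵀ = U Δ² Uᵀ  (n × n)  and  Xᵀ X = V Δ² Vᵀ  (p × p)
ok_cases = np.allclose(X @ X.T, U @ np.diag(d**2) @ U.T)
ok_feats = np.allclose(X.T @ X, Vt.T @ np.diag(d**2) @ Vt)
```

Both checks reduce to VᵀV = I and UᵀU = I, which hold for the thin SVD factors.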

Page 5: Distances and similarities — Similarities

The matrix XXᵀ is a measure of similarity between cases.

The matrix XᵀX is a measure of similarity between features.

Structure in the two similarity matrices yields insight into the set of cases or the set of features.

Page 6: Distances and similarities — Distances

Distances are inversely related to similarities.

If A and B are similar, then d(A, B) should be small, i.e. they should be near.

If A and B are distant, then they should not be similar.

For a data matrix, there is a natural distance between cases:

d(X_i, X_k)² = ‖X_i − X_k‖₂²
             = ∑_{j=1}^p (X_{ij} − X_{kj})²
             = (XXᵀ)_{ii} − 2(XXᵀ)_{ik} + (XXᵀ)_{kk}
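The expansion of the squared distance in terms of the Gram matrix lets all pairwise distances be computed at once. A sketch with numpy and made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 4))        # n = 6 cases, p = 4 features (made-up data)

G = X @ X.T                            # case-similarity (Gram) matrix X Xᵀ
g = np.diag(G)

# d(X_i, X_k)² = G_ii − 2 G_ik + G_kk, for all pairs at once via broadcasting
D2 = g[:, None] - 2 * G + g[None, :]

# Compare against the direct definition ‖X_i − X_k‖²
D2_direct = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
identity_holds = np.allclose(D2, D2_direct)
```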

Page 7: Distances and similarities — Distances

This suggests a natural transformation between a similarity matrix S and a distance matrix D:

D_{ik} = (S_{ii} − 2·S_{ik} + S_{kk})^{1/2}

The reverse transformation is not so obvious. Some suggestions from your book:

S_{ik} = −D_{ik},  or  S_{ik} = e^{−D_{ik}},  or  S_{ik} = 1 / (1 + D_{ik})
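The forward transformation and the book's three reverse suggestions can be written out directly. A minimal sketch with a made-up 2 × 2 similarity matrix:

```python
import math

def dist_from_sim(S, i, k):
    """D_ik = (S_ii − 2·S_ik + S_kk)^(1/2), for a similarity matrix S (list of lists)."""
    return (S[i][i] - 2 * S[i][k] + S[k][k]) ** 0.5

# The three reverse transformations suggested in the book.
def sim_neg(d):
    return -d

def sim_exp(d):
    return math.exp(-d)

def sim_recip(d):
    return 1.0 / (1.0 + d)

S = [[2.0, 1.0], [1.0, 3.0]]
d01 = dist_from_sim(S, 0, 1)    # sqrt(2 − 2·1 + 3) = sqrt(3)
```

Note that all three reverse maps are decreasing in distance, as a similarity should be.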

Page 8: Distances and similarities — Distances

A distance (or a metric) on a set S is a function d : S × S → [0, +∞) that satisfies

d(x, x) = 0; d(x, y) = 0 ⟺ x = y
d(x, y) = d(y, x)
d(x, y) ≤ d(x, z) + d(z, y)

If d(x, y) = 0 for some x ≠ y, then d is a pseudo-metric.

Page 9: Distances and similarities — Similarities

A similarity on a set S is a function s : S × S → ℝ that should satisfy

s(x, x) ≥ s(x, y) for all x ≠ y
s(x, y) = s(y, x)

By adding a constant, we can often assume that s(x, y) ≥ 0.

Page 10: Distances and similarities — Examples: nominal data

The simplest example for nominal data is just the discrete metric

d(x, y) = 0 if x = y, 1 otherwise.

The corresponding similarity would be

s(x, y) = 1 if x = y, 0 otherwise,

i.e. s(x, y) = 1 − d(x, y).

Page 11: Distances and similarities — Examples: ordinal data

If S is ordered, we can think of S as (or identify S with) a subset of the non-negative integers.

If |S| = m, then a natural distance is

d(x, y) = |x − y| / (m − 1) ≤ 1

The corresponding similarity would be

s(x, y) = 1 − d(x, y)
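The nominal and ordinal examples above are short enough to sketch together; the function names here are illustrative, not from the course:

```python
def d_nominal(x, y):
    """Discrete metric: 0 if equal, 1 otherwise."""
    return 0 if x == y else 1

def d_ordinal(x, y, m):
    """d(x, y) = |x − y| / (m − 1) for ordinal values coded 0, …, m − 1."""
    return abs(x - y) / (m - 1)

def s_from_d(d):
    """Corresponding similarity s = 1 − d (both distances are bounded by 1)."""
    return 1 - d

# Example: a five-level ordinal scale (m = 5); the extremes are at maximal distance.
d_max = d_ordinal(0, 4, m=5)    # 4/4 = 1.0
```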

Page 12: Distances and similarities — Examples: vectors of continuous data

If S = ℝᵏ, there are lots of distances determined by norms.

The Minkowski p or ℓ_p norm, for p ≥ 1:

d(x, y) = ‖x − y‖_p = (∑_{i=1}^k |x_i − y_i|^p)^{1/p}

Examples:

p = 2: the usual Euclidean distance, ℓ₂
p = 1: d(x, y) = ∑_{i=1}^k |x_i − y_i|, the "taxicab distance", ℓ₁
p = ∞: d(x, y) = max_{1≤i≤k} |x_i − y_i|, the sup norm, ℓ∞
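All three special cases fall out of one formula. A minimal sketch, assuming numpy:

```python
import numpy as np

def minkowski(x, y, p):
    """ℓ_p distance between vectors; p = np.inf gives the sup norm max|x_i − y_i|."""
    z = np.abs(np.asarray(x, float) - np.asarray(y, float))
    if np.isinf(p):
        return float(z.max())
    return float((z ** p).sum() ** (1.0 / p))

x, y = [0.0, 0.0], [3.0, 4.0]
d2 = minkowski(x, y, 2)          # Euclidean: sqrt(9 + 16) = 5.0
d1 = minkowski(x, y, 1)          # taxicab: 3 + 4 = 7.0
dinf = minkowski(x, y, np.inf)   # sup norm: max(3, 4) = 4.0
```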

Page 13: Distances and similarities — Examples: vectors of continuous data

If 0 < p < 1, this is not a norm, but it still defines a metric.

The preceding transformations can be used to construct similarities.

Examples: binary vectors

If S = {0, 1}ᵏ ⊂ ℝᵏ, the vectors can be thought of as vectors of bits.

The ℓ₁ norm counts the number of mismatched bits.

This is known as Hamming distance.
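Hamming distance needs only a mismatch count; a minimal sketch in plain Python:

```python
def hamming(x, y):
    """ℓ₁ distance between bit vectors = number of mismatched bits."""
    assert len(x) == len(y)
    return sum(xi != yi for xi, yi in zip(x, y))

# Two bits differ (positions 0 and 3), so the distance is 2.
h = hamming([1, 0, 1, 1, 0], [0, 0, 1, 0, 0])
```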

Page 14: Distances and similarities — Example: Mahalanobis distance

Given a k × k matrix Σ that is positive definite, we define the Mahalanobis distance on ℝᵏ by

d_Σ(x, y) = ((x − y)ᵀ Σ⁻¹ (x − y))^{1/2}

This is the usual Euclidean distance, with a change of basis given by a rotation and stretching of the axes.

If Σ is only non-negative definite, then we can replace Σ⁻¹ with Σ†, its pseudo-inverse. This yields a pseudo-metric because it fails the test

d(x, y) = 0 ⟺ x = y.
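A minimal sketch of the positive-definite case, assuming numpy; with Σ = I it reduces to the Euclidean distance, which gives an easy sanity check:

```python
import numpy as np

def mahalanobis(x, y, Sigma):
    """d_Σ(x, y) = ((x − y)ᵀ Σ⁻¹ (x − y))^(1/2) for positive-definite Σ."""
    diff = np.asarray(x, float) - np.asarray(y, float)
    # Solve Σ z = diff rather than forming Σ⁻¹ explicitly (better conditioned).
    return float(np.sqrt(diff @ np.linalg.solve(Sigma, diff)))

# Sanity check: Σ = I recovers the Euclidean distance 5.0 for a 3-4-5 triangle.
d_id = mahalanobis([1.0, 2.0], [4.0, 6.0], np.eye(2))
```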

Page 15: Distances and similarities — Example: similarities for binary vectors

Given binary vectors x, y ∈ {0, 1}ᵏ, we can summarize their agreement in a 2 × 2 table:

        x = 0   x = 1   total
y = 0   f00     f01     f0·
y = 1   f10     f11     f1·
total   f·0     f·1     k

Page 16: Distances and similarities — Example: similarities for binary vectors

We define the simple matching coefficient (SMC) similarity by

SMC(x, y) = (f00 + f11) / (f00 + f01 + f10 + f11)
          = (f00 + f11) / k
          = 1 − ‖x − y‖₁ / k

The Jaccard coefficient ignores entries where x_i = y_i = 0:

J(x, y) = f11 / (f01 + f10 + f11).
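Both coefficients can be computed straight from the bit vectors, without building the 2 × 2 table. A minimal sketch in plain Python:

```python
def smc(x, y):
    """Simple matching coefficient: (f00 + f11) / k = 1 − Hamming(x, y)/k."""
    matches = sum(xi == yi for xi, yi in zip(x, y))
    return matches / len(x)

def jaccard(x, y):
    """Jaccard: f11 / (f01 + f10 + f11) — ignores positions where both bits are 0."""
    f11 = sum(xi == 1 and yi == 1 for xi, yi in zip(x, y))
    mismatches = sum(xi != yi for xi, yi in zip(x, y))   # f01 + f10
    return f11 / (f11 + mismatches)

x = [1, 0, 0, 0, 1]
y = [1, 1, 0, 0, 0]
s_smc = smc(x, y)       # 3 matching positions out of 5 → 0.6
s_jac = jaccard(x, y)   # f11 = 1, mismatches = 2 → 1/3
```

On sparse binary data the two can disagree sharply, since SMC is inflated by the shared zeros that Jaccard discards.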

Page 17: Distances and similarities — Example: cosine similarity & correlation

Given vectors x, y ∈ ℝᵏ, the cosine similarity is defined as

−1 ≤ cos(x, y) = ⟨x, y⟩ / (‖x‖₂ ‖y‖₂) ≤ 1.

The correlation between two vectors is defined as

−1 ≤ cor(x, y) = ⟨x − x̄·𝟙, y − ȳ·𝟙⟩ / (‖x − x̄·𝟙‖₂ ‖y − ȳ·𝟙‖₂) ≤ 1.

Page 18: Distances and similarities — Example: correlation

An alternative, perhaps more familiar definition:

cor(x, y) = S_xy / (S_x S_y)

where

S_xy = (1/(k − 1)) ∑_{i=1}^k (x_i − x̄)(y_i − ȳ)
S_x² = (1/(k − 1)) ∑_{i=1}^k (x_i − x̄)²
S_y² = (1/(k − 1)) ∑_{i=1}^k (y_i − ȳ)²
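The two definitions agree: correlation is just cosine similarity applied to the centered vectors (the (k − 1) factors cancel). A sketch checking this against numpy's own `corrcoef`, with made-up vectors:

```python
import numpy as np

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def cor(x, y):
    """Correlation = cosine similarity of the mean-centered vectors."""
    return cosine(x - x.mean(), y - y.mean())

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])   # y = 2x, so the correlation is exactly 1

r = cor(x, y)
agrees = np.isclose(r, np.corrcoef(x, y)[0, 1])
```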

Page 19: Distances and similarities — Correlation & PCA

The matrix (1/(n − 1)) XᵀX was actually the matrix of pair-wise correlations of the features. Why? How?

(1/(n − 1)) (XᵀX)_{ij} = (1/(n − 1)) (D^{−1/2} XᵀHX D^{−1/2})_{ij}

where the right-hand X is the raw data and H is the centering matrix. The diagonal entries of D are the sample variances of each feature.

The inner matrix multiplication computes the pair-wise dot products of the columns of HX:

(XᵀHX)_{ij} = ∑_{k=1}^n (X_{ki} − X̄_i)(X_{kj} − X̄_j)
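The identity can be verified numerically. A sketch, assuming numpy and a made-up raw data matrix with unequal feature variances:

```python
import numpy as np

rng = np.random.default_rng(2)
Xraw = rng.standard_normal((8, 3)) * [1.0, 2.0, 5.0]   # raw data, unequal variances

n = Xraw.shape[0]
H = np.eye(n) - np.ones((n, n)) / n          # centering matrix: HXraw has mean-zero columns
D = np.diag(np.var(Xraw, axis=0, ddof=1))    # sample variances on the diagonal

# Standardize, then form (1/(n−1)) XᵀX — the feature correlation matrix.
X = (H @ Xraw) / np.sqrt(np.diag(D))
C = (X.T @ X) / (n - 1)

# Same thing written as (1/(n−1)) D^(−1/2) Xrawᵀ H Xraw D^(−1/2).
Dm = np.diag(1.0 / np.sqrt(np.diag(D)))
C2 = Dm @ (Xraw.T @ H @ Xraw) @ Dm / (n - 1)

same = np.allclose(C, C2) and np.allclose(C, np.corrcoef(Xraw, rowvar=False))
```

The last check ties both forms back to numpy's own correlation matrix of the features.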

Page 20: High positive correlation (figure)

Page 21: High negative correlation (figure)

Page 22: No correlation (figure)

Page 23: Small positive (figure)

Page 24: Small negative (figure)

Page 25: Distances and similarities — Combining similarities

In a given data set, each case may have many attributes or features.

Example: see the health data set for HW 1.

To compute similarities of cases, we must pool similarities across features.

In a data set with M different features, we write x_i = (x_{i1}, …, x_{iM}), with each x_{im} ∈ S_m.

Page 26: Distances and similarities — Combining similarities

Given similarities s_m on each S_m, we can define an overall similarity between cases x_i and x_j by

s(x_i, x_j) = ∑_{m=1}^M w_m s_m(x_{im}, x_{jm})

with optional weights w_m for each feature.

Your book modifies this to deal with "asymmetric attributes", i.e. attributes for which Jaccard similarity might be used.
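The weighted sum above can be sketched directly. The feature similarities and the equal-weight default below are illustrative assumptions, not from the course:

```python
def combined_similarity(xi, xj, sims, weights=None):
    """s(x_i, x_j) = Σ_m w_m · s_m(x_im, x_jm), one similarity s_m per feature."""
    M = len(sims)
    if weights is None:
        weights = [1.0 / M] * M        # equal weights by default (an assumption)
    return sum(w * s(a, b) for w, s, a, b in zip(weights, sims, xi, xj))

# Hypothetical case with one nominal feature and one ordinal feature (m = 5 levels).
s_nominal = lambda a, b: 1.0 if a == b else 0.0
s_ordinal = lambda a, b: 1.0 - abs(a - b) / 4

s = combined_similarity(("red", 3), ("red", 1), [s_nominal, s_ordinal])
# 0.5·1.0 + 0.5·(1 − 2/4) = 0.75
```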
