
Page 1: Hubness in the Context of Feature Selection and Generation

Hubness in the Context of Feature Selection and Generation

Miloš Radovanović1 Alexandros Nanopoulos2

Mirjana Ivanović1

1Department of Mathematics and InformaticsFaculty of Science, University of Novi Sad, Serbia

2Institute of Computer ScienceUniversity of Hildesheim, Germany

Page 2: Hubness in the Context of Feature Selection and Generation

k-occurrences (Nk)

Nk(x), the number of k-occurrences of point x, is the number of times x occurs among the k nearest neighbors of all other points in a data set

Nk(x) is the in-degree of node x in the k-NN digraph

It was observed that the distribution of Nk can become skewed, resulting in the emergence of hubs – points with high Nk:
Music retrieval [Aucouturier 2007]
Speech recognition [Doddington 1998]
Fingerprint identification [Hicklin 2005]
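The definition above can be sketched directly in code; a minimal brute-force illustration (not from the slides):

```python
import numpy as np

def k_occurrences(X, k):
    """N_k(x): the number of times each point occurs among the k nearest
    neighbors of all other points, i.e. the in-degree in the k-NN digraph."""
    n = len(X)
    # brute-force Euclidean distance matrix
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)  # a point is not its own neighbor
    nk = np.zeros(n, dtype=int)
    for i in range(n):
        for j in np.argsort(D[i])[:k]:  # k nearest neighbors of point i
            nk[j] += 1
    return nk
```

Each point contributes k arcs, so the counts always sum to n·k; hubs are points whose N_k is far above the mean value of k.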

FGSIR'10, July 23, 2010

Page 3: Hubness in the Context of Feature Selection and Generation

Skewness of Nk

What causes the skewness of Nk?
Artefact of data?
Are some songs more similar to others?
Do some people have fingerprints or voices that are harder to distinguish from other people’s?
Specifics of modeling algorithms?
Inadequate choice of features?
Something more general?


Page 4: Hubness in the Context of Feature Selection and Generation


[Figure: empirical distributions p(N5) for iid uniform data with d = 3, d = 20, and d = 100 (log-log axes for d = 100), under l2, l0.5, and cosine distances. The distribution of N5 becomes increasingly skewed as d grows.]
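The behaviour in the figure can be reproduced with a small simulation; a sketch (sample size, dimensionalities, and seed are arbitrary choices, not from the slides):

```python
import numpy as np

def k_occurrences(X, k):
    """In-degrees N_k in the k-NN digraph (brute force, Euclidean)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    nk = np.zeros(len(X), dtype=int)
    for row in np.argsort(D, axis=1)[:, :k]:
        for j in row:
            nk[j] += 1
    return nk

def skew(a):
    """Standardized third moment of a sample."""
    a = np.asarray(a, float)
    return ((a - a.mean()) ** 3).mean() / a.std() ** 3

rng = np.random.default_rng(0)
# iid uniform data: the skewness of N_5 rises sharply with dimensionality
s_low = skew(k_occurrences(rng.random((200, 3)), 5))
s_high = skew(k_occurrences(rng.random((200, 50)), 5))
```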

Page 5: Hubness in the Context of Feature Selection and Generation

Contributions - Outline

Demonstrate the phenomenon
Skewness in the distribution of k-occurrences
Explain its main reasons
Not an artifact of the data
Not a specific of models (inadequate features, etc.)
A new aspect of the “curse of dimensionality”
Impact on Feature Selection and Generation


Page 6: Hubness in the Context of Feature Selection and Generation

Outline

Demonstrate the phenomenon
Explain its main reasons
Impact on FSG
Conclusions


Page 7: Hubness in the Context of Feature Selection and Generation

Collection of 23 real text data sets

SNk is the standardized 3rd moment of Nk

If SNk = 0 there is no skew; positive (negative) values signify right (left) skew

High skewness indicates hubness
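In code, S_Nk is just the standardized third moment of the N_k counts; a minimal sketch:

```python
import numpy as np

def s_nk(nk):
    """Skewness S_Nk: standardized 3rd moment of the N_k counts.
    0 means no skew; positive values mean right skew, i.e. hubness."""
    nk = np.asarray(nk, dtype=float)
    return ((nk - nk.mean()) ** 3).mean() / nk.std() ** 3
```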


Page 8: Hubness in the Context of Feature Selection and Generation

Collection of 14 real UCI data sets+ microarray data


Page 9: Hubness in the Context of Feature Selection and Generation

Outline

Demonstrate the phenomenon
Explain its main reasons
Impact on FSG
Conclusions


Page 10: Hubness in the Context of Feature Selection and Generation

Where are the hubs located?

Spearman correlation C_dm(N10) between N10 and distance from the data set mean

Hubs are closer to the data center

[Figure: scatter plots of N5 vs. distance from the data set mean for iid uniform data: d = 3 (C_dm(N5) = -0.018), d = 20 (C_dm(N5) = -0.803), d = 100 (C_dm(N5) = -0.865).]
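The correlation C_dm(N_k) reported above can be computed as follows; a sketch with a pure-NumPy Spearman correlation (no tie handling, illustration only; sample size and seed are arbitrary):

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation (ignores ties; fine for illustration)."""
    ra = np.argsort(np.argsort(a)).astype(float)  # ranks of a
    rb = np.argsort(np.argsort(b)).astype(float)  # ranks of b
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra * rb).sum() / np.sqrt((ra ** 2).sum() * (rb ** 2).sum()))

def k_occurrences(X, k):
    """In-degrees N_k in the k-NN digraph (brute force, Euclidean)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    nk = np.zeros(len(X), dtype=int)
    for row in np.argsort(D, axis=1)[:, :k]:
        for j in row:
            nk[j] += 1
    return nk

rng = np.random.default_rng(1)
X = rng.random((300, 50))
# correlate N_10 with each point's distance from the data set mean
c_dm = spearman(k_occurrences(X, 10),
                np.linalg.norm(X - X.mean(axis=0), axis=1))
```

For high-dimensional iid data, c_dm comes out clearly negative: hubs sit near the data center.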

Page 11: Hubness in the Context of Feature Selection and Generation

Centrality and its amplification

Hubs appear due to centrality:
vectors closer to the center tend to be closer to all other vectors
thus they become more frequent k-NN

Centrality is amplified by dimensionality:
if point A is closer to the center than point B, the difference Σ_x sim(A,x) − Σ_x sim(B,x) is positive and grows with dimensionality

Page 12: Hubness in the Context of Feature Selection and Generation

Concentration of similarity

Concentration: as dimensionality grows to infinity, the ratio between the standard deviation of pairwise similarities (distances) and their expectation shrinks to zero

Minkowski distances [François 2007, Beyer 1999, Aggarwal 2001]
Meaningfulness of nearest neighbors?
Analytical proof for cosine similarity [Radovanović 2010]
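The concentration effect is easy to observe empirically; a sketch (sample sizes and seed are arbitrary choices):

```python
import numpy as np

def concentration_ratio(X):
    """std / mean of pairwise Euclidean distances; shrinks as d grows."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    pd = D[np.triu_indices(len(X), k=1)]  # upper triangle: each pair once
    return pd.std() / pd.mean()

rng = np.random.default_rng(0)
low = concentration_ratio(rng.random((100, 3)))     # low-dimensional
high = concentration_ratio(rng.random((100, 300)))  # high-dimensional
```

At d = 300 the relative spread of distances is already tiny compared to d = 3, which is exactly the concentration described above.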

Page 13: Hubness in the Context of Feature Selection and Generation

The hyper-sphere view

Hyper-sphere view:
Most vectors are about equidistant from the center and from each other, and lie on the surface of a hyper-sphere
Few vectors lie in the inner part of the hyper-sphere, closer to its center, and are thus closer to all others

This is expected for large but finite dimensionality, since the ratio √V / E (standard deviation over expectation of distances) is still non-negligible

Page 14: Hubness in the Context of Feature Selection and Generation

What happens with real data?

Real text data are usually clustered (mixture of distributions)

Cluster with k-means (#clusters = 3*Cls)

Compare C_dm(N10) with C_cm(N10): the Spearman correlation between N10 and the distance from the data / cluster center

Generalization of the hyper-sphere view with clusters

Page 15: Hubness in the Context of Feature Selection and Generation

UCI data


Page 16: Hubness in the Context of Feature Selection and Generation

Can dim reduction help?

Intrinsic dimensionality is reached

Page 17: Hubness in the Context of Feature Selection and Generation

UCI data


[Figure: SN10 vs. percentage of features retained by PCA, for musk1, mfeat-factors, spectrometer, and iid uniform data (d = 15, no PCA).]
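The PCA sweep behind these curves can be sketched via SVD; a minimal helper (an illustration, not the authors' code):

```python
import numpy as np

def pca_project(X, m):
    """Project centered data onto its top-m principal components."""
    Xc = X - X.mean(axis=0)
    # rows of Vt are principal directions, ordered by singular value
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:m].T

# sweep m over a range of retained-feature percentages and
# recompute S_N10 on pca_project(X, m) at each step
```

Keeping all components is an orthogonal change of basis, so pairwise distances, k-NN relations, and hence N_k are unchanged; the skewness only reacts once components are actually dropped.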

Page 18: Hubness in the Context of Feature Selection and Generation

Outline

Demonstrate the phenomenon
Explain its main reasons
Impact on FSG
Conclusions


Page 19: Hubness in the Context of Feature Selection and Generation

“Bad” hubs as obstinate results

Based on information about classes, k-occurrences can be distinguished into:
“Bad” k-occurrences, BNk(x)
“Good” k-occurrences, GNk(x)
Nk(x) = BNk(x) + GNk(x)
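The decomposition can be sketched directly; a minimal brute-force illustration where y holds the class labels (not the authors' code):

```python
import numpy as np

def bad_good_occurrences(X, y, k):
    """Split N_k(x) into BN_k(x) (occurrences where the labels of x and the
    query point disagree) and GN_k(x) (labels agree): N_k = BN_k + GN_k."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    bnk = np.zeros(n, dtype=int)
    gnk = np.zeros(n, dtype=int)
    for i in range(n):
        for j in np.argsort(D[i])[:k]:  # j is a k-NN of query point i
            if y[i] == y[j]:
                gnk[j] += 1
            else:
                bnk[j] += 1
    return bnk, gnk
```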

Page 20: Hubness in the Context of Feature Selection and Generation


How do “bad” hubs originate?

The mixture (cluster structure) is also important:
High dimensionality and skewness of Nk do not automatically induce “badness”
“Bad” hubs originate from a combination of high dimensionality and violation of the CA

Cluster Assumption (CA): most pairs of vectors in a cluster should be of the same class [Chapelle 2006]

Page 21: Hubness in the Context of Feature Selection and Generation

Skewness of Nk vs. #features


Skewness stays relatively constant
It abruptly drops when intrinsic dimensionality is reached
Further feature selection may incur loss of information

Page 22: Hubness in the Context of Feature Selection and Generation

Badness vs. #features


Similar observations
When reaching intrinsic dimensionality, the BNk ratio increases
The representation ceases to reflect the information provided by the labels very well

Page 23: Hubness in the Context of Feature Selection and Generation

Feature generation

When adding features to bring new information to the data:
The representation will ultimately increase SNk and, thus, produce hubs
The reduction of the BNk ratio “flattens out” fairly quickly, limiting the usefulness of adding new features in the sense of being able to express the “ground truth”

If instead of the BNk ratio we use classifier error rate, the results are similar


Page 24: Hubness in the Context of Feature Selection and Generation


Conclusion

Research in feature selection/generation has paid little attention to the fact that in intrinsically high-dimensional data, hubs will:
Result in an uneven distribution of cluster assumption violation (hubs emerge that attract more label mismatches with neighboring points)
Result in an uneven distribution of responsibility for classification or retrieval error among data points

Investigating further the interaction between hubness and different notions of CA violation promises important new insights into feature selection/generation

Page 25: Hubness in the Context of Feature Selection and Generation


Thank You!

Alexandros [email protected]
