Hubness in the Context of Feature Selection and Generation
Miloš Radovanović1 Alexandros Nanopoulos2
Mirjana Ivanović1
1Department of Mathematics and InformaticsFaculty of Science, University of Novi Sad, Serbia
2Institute of Computer ScienceUniversity of Hildesheim, Germany
k-occurrences (Nk)
Nk(x), the number of k-occurrences of point x, is the number of times x occurs among the k nearest neighbors of all other points in a data set
Nk(x) is the in-degree of node x in the k-NN digraph
It was observed that the distribution of Nk can become skewed, resulting in the emergence of hubs – points with high Nk:
Music retrieval [Aucouturier 2007]
Speech recognition [Doddington 1998]
Fingerprint identification [Hicklin 2005]
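As a concrete illustration, here is a minimal Python sketch (assuming NumPy and scikit-learn; the data and parameters are placeholders) that computes Nk as the in-degree of the k-NN digraph:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_occurrences(X, k=5):
    # Query k+1 neighbors, since each point is returned as its own 1-NN
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    # Drop the self-neighbor, then count how often each point appears
    # in other points' k-NN lists (in-degree of the k-NN digraph)
    return np.bincount(idx[:, 1:].ravel(), minlength=X.shape[0])

X = np.random.rand(2000, 100)  # e.g., iid uniform data, d = 100
N5 = k_occurrences(X, k=5)     # N5[i] == N_5(x_i)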
Skewness of Nk
What causes the skewness of Nk?
Artefact of data? Are some songs more similar to others? Do some people have fingerprints or voices that are harder to distinguish from other people’s?
Specifics of modeling algorithms? Inadequate choice of features?
Something more general?
[Figure: empirical distribution of N5 for iid uniform data with d = 3, 20, and 100 (log10 axes for d = 100), under l2, l0.5, and cosine distances.]
Contributions – Outline
Demonstrate the phenomenon: skewness in the distribution of k-occurrences
Explain its main reasons: no artifact of data, no specifics of models (inadequate features, etc.), but a new aspect of the “curse of dimensionality”
Impact on Feature Selection and Generation
Outline
Demonstrate the phenomenon
Explain its main reasons
Impact on FSG
Conclusions
Collection of 23 real text data sets
SNk is the standardized 3rd moment of Nk: SNk = E[(Nk − μNk)^3] / σNk^3
If SNk = 0 there is no skew; positive (negative) values signify right (left) skew
High skewness indicates hubness
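As a sketch, SNk can be estimated from the Nk values computed above (scipy’s biased skewness estimator matches the moment definition):

from scipy.stats import skew

S_N5 = skew(N5)  # S_N5 > 0 means right skew, i.e., hubs are present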
Collection of 14 real UCI data sets + microarray data
Outline
Demonstrate the phenomenon
Explain its main reasons
Impact on FSG
Conclusions
Where are the hubs located?
Spearman correlation C_dm^N10 between N10 and the distance from the data set mean
Hubs are closer to the data center
[Figure: N5 vs. distance from the data set mean for iid uniform data; C_dm^N5 = −0.018 (d = 3), −0.803 (d = 20), −0.865 (d = 100).]
Centrality and its amplification
Hubs appear due to centrality: vectors closer to the center tend to be closer to all other vectors, and are thus more frequently among k nearest neighbors
Centrality is amplified by dimensionality
If point A is closer to the center than point B, then Σx sim(A, x) − Σx sim(B, x) > 0, a difference that grows with dimensionality
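A small simulation of this effect (reusing k_occurrences from above; sample sizes are arbitrary):

from scipy.stats import spearmanr

for d in (3, 20, 100):
    X = np.random.rand(2000, d)                        # iid uniform
    dist = np.linalg.norm(X - X.mean(axis=0), axis=1)  # distance from data mean
    rho, _ = spearmanr(k_occurrences(X, k=5), dist)
    print(d, rho)  # the correlation turns strongly negative as d grows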
Concentration of similarity
Concentration: as dimensionality grows to infinity, the ratio between the standard deviation of pairwise similarities (distances) and their expectation shrinks to zero
Minkowski distances [François 2007, Beyer 1999, Aggarwal 2001]: meaningfulness of nearest neighbors?
Analytical proof for cosine similarity [Radovanović 2010]
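A quick numerical illustration of concentration for Euclidean distances (sample size chosen arbitrarily):

from scipy.spatial.distance import pdist

for d in (3, 20, 100, 1000):
    X = np.random.rand(500, d)    # iid uniform
    D = pdist(X)                  # all pairwise l2 distances
    print(d, D.std() / D.mean())  # relative spread shrinks toward 0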
The hyper-sphere view
Most vectors are about equidistant from the center and from each other, and lie on the surface of a hyper-sphere
Few vectors lie at the inner part of hyper-sphere, closer to its center, thus closer to all others
This is expected for large but finite dimensionality, since the standard deviation of distances, √V, remains non-negligible relative to their expectation E
What happens with real data?
Real text data are usually clustered (mixture of distributions)
Cluster with k-Means (#clusters = 3*Cls)
Compare C_dm^N10 with C_cm^N10
Generalization of the hyper-sphere view with clusters
C_dm^N10, C_cm^N10: Spearman correlations between N10 and the distance from the data set mean / nearest cluster center, respectively
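A sketch of the clustered variant (the data matrix X and class count Cls are assumed given; scikit-learn’s KMeans stands in for the slide’s k-means step):

from sklearn.cluster import KMeans

km = KMeans(n_clusters=3 * Cls, n_init=10).fit(X)
# Distance of each point to its own cluster’s centroid
d_cm = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
C_cm, _ = spearmanr(k_occurrences(X, k=10), d_cm)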
UCI data
Can dim reduction help?
Skewness persists until the intrinsic dimensionality is reached
UCI data
[Figure: skewness S_N10 vs. percentage of retained PCA features for musk1, mfeat-factors, and spectrometer; reference: iid uniform, d = 15, no PCA.]
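A sketch of the experiment behind this plot (X is an assumed data matrix; k_occurrences and skew as defined earlier):

from sklearn.decomposition import PCA

for pct in (10, 25, 50, 75, 100):
    n_comp = max(1, X.shape[1] * pct // 100)
    X_red = PCA(n_components=n_comp).fit_transform(X)
    print(pct, skew(k_occurrences(X_red, k=10)))  # S_N10 vs. retained dims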
Outline
Demonstrate the phenomenon
Explain its main reasons
Impact on FSG
Conclusions
“Bad” hubs as obstinate results
Based on information about classes, k-occurrences can be distinguished into:
“Bad” k-occurrences, BNk(x): occurrences among neighbors of a different class
“Good” k-occurrences, GNk(x): occurrences among neighbors of the same class
Nk(x) = BNk(x) + GNk(x)
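A minimal sketch of this split (class labels y assumed available; builds on k_occurrences above):

def bad_good_k_occurrences(X, y, k=5):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    BN = np.zeros(len(X), dtype=int)
    GN = np.zeros(len(X), dtype=int)
    for i in range(len(X)):
        for j in idx[i, 1:]:  # x_j occurs among the k-NN of x_i
            if y[i] == y[j]:
                GN[j] += 1    # "good" k-occurrence of x_j
            else:
                BN[j] += 1    # "bad" k-occurrence of x_j
    return BN, GN             # Nk(x) = BNk(x) + GNk(x)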
How do “bad” hubs originate?
The mixture of distributions is important here as well:
High dimensionality and skewness of Nk do not automatically induce “badness”
“Bad” hubs originate from a combination of high dimensionality and violation of the CA
Cluster Assumption (CA): Most pairs of vectors in a cluster should be of the same class [Chapelle 2006]
Skewness of Nk vs. #features
Skewness stays relatively constant as features are removed
It drops abruptly when the intrinsic dimensionality is reached
Further feature selection may incur loss of information
Badness vs. #features
Similar observations
When the intrinsic dimensionality is reached, the BNk ratio increases
The representation ceases to reflect the information provided by the labels well
Feature generation
When adding features to bring new information to the data:
The representation will ultimately increase SNk and thus produce hubs
The reduction of the BNk ratio “flattens out” fairly quickly, limiting the usefulness of adding new features for expressing the “ground truth”
If classifier error rate is used instead of the BNk ratio, the results are similar
Conclusion
Research in feature selection/generation has paid little attention to the fact that, in intrinsically high-dimensional data, hubs will:
Result in an uneven distribution of cluster-assumption violation (hubs emerge that attract more label mismatches with neighboring points)
Result in an uneven distribution of responsibility for classification or retrieval error among data points
Investigating further the interaction between hubness and different notions of CA violation promises important new insights into feature selection/generation