ds2014: feature selection in hierarchical feature spaces
DESCRIPTION
Feature selection is an important preprocessing step in data mining that affects both the runtime and the result quality of the subsequent processing steps. While hierarchical relations between features exist in many cases, most existing feature selection approaches cannot exploit those relations. In this paper, we introduce a method for feature selection in hierarchical feature spaces. The method first eliminates redundant features along paths in the hierarchy, and then prunes the resulting feature set based on the features' relevance. We show that our method yields a good trade-off between feature space compression and classification accuracy, and that it outperforms both standard approaches and other approaches that also exploit hierarchies.
TRANSCRIPT
Motivation: Linked Open Data as Background Knowledge
10/12/2014 Petar Ristoski, Heiko Paulheim
• Linked Open Data is a method for publishing interlinked
datasets using machine-interpretable semantics
• Started in 2007
• A collection of ~1,000 datasets
– Various domains, e.g. general knowledge, government data, …
– Using semantic web standards (HTTP, RDF, SPARQL)
• Free of charge
• Machine processable
• Sophisticated tool stacks
Example: the Auto MPG Dataset
• A well-known UCI dataset
– Goal: predict fuel consumption of cars
• Hypothesis: background knowledge → more accurate predictions
• Used background knowledge:
– Entity types and categories from DBpedia (derived from Wikipedia)
• Results: M5Rules down to almost half the prediction error
– i.e. on average, we are wrong by 1.6 instead of 2.9 MPG
Attribute set                         Linear Regression    M5Rules
                                      RMSE    RE           RMSE    RE
original                              3.359   0.118        2.859   0.088
original + direct types               3.334   0.117        2.835   0.091
original + categories                 4.474   0.144        2.926   0.090
original + direct types + categories  2.551   0.088        1.574   0.042
Drawbacks
• The generated feature sets are rather large
– e.g., for a dataset of 300 instances, up to 5,000 features may be
generated from a single source
• Increased complexity and runtime
• Overfitting on overly specific features
Linked Open Data is Backed by Ontologies
[Figure: LOD graph excerpt and corresponding ontology excerpt]
Problem Statement
• Each instance is an n-dimensional binary feature vector (v1, v2, …, vn),
where vi ∈ {0,1} for all 1 ≤ i ≤ n
• Feature space: V={v1,v2,…, vn}
• A hierarchical relation between two features vi and vj is denoted as
vi < vj, where vi is more specific than vj
• For all hierarchical features, the following implication holds:
vi < vj→ (vi = 1 → vj = 1)
• Transitivity holds between hierarchical features:
vi < vj ∧ vj < vk → vi < vk
• The problem of feature selection can be defined as finding a
projection of V to V′, where V′ ⊆ V and p(V′) ≥ p(V), where p is a
performance function p: P(V) → [0,1]
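The implication and transitivity rules above can be sketched as a small check on binary feature vectors; the data and helper names below are illustrative, not from the paper:

```python
# Minimal sketch: verifying the hierarchical implication
# v_i < v_j  =>  (v_i = 1 -> v_j = 1) on binary feature vectors.

def satisfies_hierarchy(vector, edges):
    """vector: dict feature -> 0/1; edges: (specific, general) pairs."""
    return all(vector[gen] == 1
               for spec, gen in edges
               if vector[spec] == 1)

# Toy hierarchy: Baseball_Player < Athlete, Basketball_Player < Athlete
edges = [("Baseball_Player", "Athlete"), ("Basketball_Player", "Athlete")]

ok  = {"Baseball_Player": 1, "Basketball_Player": 0, "Athlete": 1}
bad = {"Baseball_Player": 1, "Basketball_Player": 0, "Athlete": 0}

print(satisfies_hierarchy(ok, edges))   # True
print(satisfies_hierarchy(bad, edges))  # False
```

A consistent feature generator always produces vectors like `ok`: whenever a specific type is set, all of its ancestors are set as well.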
Hierarchical Feature Space: Example
Josh Donaldson is the best 3rd baseman in the American League.
LeBron James NOT ranked #1 after newly released list of Top NBA players
“Two things are infinite: the universe and human stupidity; and I'm not sure about the universe.”―Albert Einstein
In his weekly address, President Barack Obama discusses expanding
opportunity for hard-working Americans: http://ofa.bo/ccH
Nineteen-year-old figure skater Yuzuru Hanyu, who won a gold medal in the
Sochi Olympics, is among the 684 peo... http://bit.ly/1kb6W5y
Barack Obama cracks jokes at Vladimir Putin's expense http://dlvr.it/5Z7JCR
I spotted the Lance Armstrong case in 2006 when everyone thought he was
God, and now this case catches my attention.
Josh Donaldson is the best 3rd baseman in the American League.
LeBron James NOT ranked #1 after newly released list of Top NBA players
dbpedia:Josh_Donaldson, dbpedia:LeBron_James
dbpedia-owl:Basketball_Player
dbpedia-owl:Baseball_Player
dbpedia-owl:Athlete
Hierarchical Feature Space
• Linked Open Data
– DBpedia, YAGO, Biperpedia, Google Knowledge Graph
• Lexical Databases
– WordNet, DANTE
• Domain specific ontologies, taxonomies and vocabularies
– Bioinformatics: Gene Ontology (GO), Entrez
– Drugs: the Drug Ontology
– E-commerce: GoodRelations
Standard Feature Selection
• Wrapper methods
– Computationally expensive
• Filter methods
– Several techniques for scoring the relevance of the features
• Information Gain
• χ2
• Information Gain Ratio
• Gini Index
– Often similar results
TSEL Feature Selection
• Tree-based feature selection (Jeong et al.)
– Select the most representative and most effective feature from each
branch of the hierarchy
• lift(f) = P(f | C) / P(C)
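The lift score can be estimated from simple co-occurrence counts, as in this minimal sketch (function and variable names are illustrative):

```python
# Sketch of the lift score lift(f) = P(f | C) / P(C),
# estimated from counts over a labeled dataset.

def lift(n_f_and_c, n_c, n_total):
    """n_f_and_c: instances of class C with feature f present;
    n_c: instances of class C; n_total: all instances."""
    p_f_given_c = n_f_and_c / n_c   # P(f | C)
    p_c = n_c / n_total             # P(C)
    return p_f_given_c / p_c

# Feature present in 30 of 40 positive instances, 40 positives of 100 total:
print(lift(30, 40, 100))  # 1.875
```

A lift above 1 means the feature occurs more often within class C than its base rate would suggest.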
Bottom-Up Hill-Climbing Feature Selection
• Bottom-up hill climbing search algorithm to find an optimal subset of
concepts for document representation (Wang et al.)
– Fitness function: f = (1 + α − n) / (α ∙ β ∙ Σi∈D |Dci|),
where Dci ⊆ DKNNi and β > 0
Greedy Top-Down Feature Selection
• Greedy based top-down search strategy for feature selection (Lu et al.)
– Select the most effective nodes from different levels of the hierarchy
Hierarchical Feature Selection Approach (SHSEL)
• Exploit the hierarchical structure of the feature space
• Hierarchical relation: vi < vj → (vi = 1 → vj = 1)
• Relevance similarity:
– Relevance (Blum et al.): A feature vi is relevant to a target class C if
there exists a pair of examples A and B in the instance space such that
A and B differ only in their assignment to vi and C(A) ≠ C(B)
• Two features vi and vj have similar relevance if:
1 − |R(vi) − R(vj)| ≥ t, with t ∈ [0,1]
• Goal: Identify features with similar relevance, and select the most
valuable abstract features, without losing predictive power
Hierarchical Feature Selection Approach (SHSEL)
• Initial Selection
– Identify and filter out ranges of nodes with similar relevance in each
branch of the hierarchy
• Pruning
– Select only the most relevant features from the previously reduced set
Initial SHSEL Feature Selection
1. Identify range of nodes with similar relevance in each branch:
– Information gain: s(vi, vj) = 1 − |IG(vi) − IG(vj)|
– Correlation: s(vi, vj) = Correlation(vi, vj)
2. If the similarity is greater than a user-specified threshold, remove
the more specific feature, based on the hierarchical relation
– Example: s(vi, vj) = 1 − |0.45 − 0.5| = 0.95; with t = 0.9, s > t,
so the more specific feature is removed
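The initial selection step can be sketched as follows, assuming per-feature information gain scores are already available; names and scores are illustrative, and the worked example above is reused:

```python
# Sketch of the initial SHSEL step: along each hierarchy edge, compute
# the relevance similarity s = 1 - |IG(specific) - IG(general)| and
# remove the more specific feature when s exceeds the threshold t.

def initial_shsel(edges, ig, t):
    """edges: (specific, general) pairs; ig: feature -> IG score."""
    removed = set()
    for specific, general in edges:
        s = 1 - abs(ig[specific] - ig[general])
        if s > t:
            removed.add(specific)
    return {f for f in ig if f not in removed}

ig = {"Athlete": 0.5, "Baseball_Player": 0.45, "Basketball_Player": 0.1}
edges = [("Baseball_Player", "Athlete"), ("Basketball_Player", "Athlete")]
print(initial_shsel(edges, ig, t=0.9))
# drops Baseball_Player (s = 1 - |0.45 - 0.5| = 0.95 > 0.9);
# keeps Basketball_Player (s = 0.6) and Athlete
```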
Post SHSEL Feature Selection
• Select the features with the highest relevance on each path
– user-specified threshold, or
– select features with relevance above the average relevance of their path
– Example: IG(vi) = 0.2 < AVG(Sp) = 0.25, so vi is removed
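A minimal sketch of the path-average variant of this pruning step; names and scores are illustrative, and features at or above the path average are kept here:

```python
# Sketch of SHSEL pruning: on each root-to-leaf path, keep only the
# features whose relevance (e.g. information gain) reaches the path average.

def prune_path(path, ig):
    """path: ordered list of features on one hierarchy path."""
    avg = sum(ig[f] for f in path) / len(path)
    return [f for f in path if ig[f] >= avg]

ig = {"Person": 0.25, "Athlete": 0.3, "Baseball_Player": 0.2}
print(prune_path(["Person", "Athlete", "Baseball_Player"], ig))
# the path average is 0.25; Baseball_Player (IG = 0.2) falls below it
# and is dropped, matching the slide's IG = 0.2 < AVG = 0.25 example
```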
Evaluation
• We use 5 real-world datasets and 6 synthetically generated datasets
• Classification methods:
– Naïve Bayes
– k-Nearest Neighbors (k=3)
– Support Vector Machine (polynomial kernel function)
• No parameter optimization
Evaluation: Real World Datasets
Name               Features              #Instances  Class Labels                        #Features
Sports Tweets T    DBpedia Direct Types  1,179       positive (523); negative (656)     4,082
Sports Tweets C    DBpedia Categories    1,179       positive (523); negative (656)     10,883
Cities             DBpedia Direct Types  212         high (67); medium (106); low (39)  727
NY Daily Headings  DBpedia Direct Types  1,016       positive (580); negative (436)     5,145
StumbleUpon        DMOZ Categories       3,020       positive (1,370); negative (1,650) 3,976
• Hierarchical features are generated from DBpedia (structured version of Wikipedia)
– The text is annotated with concepts using DBpedia Spotlight
• The feature generation is independent of the class labels, and it is unbiased towards any of the feature selection approaches
Evaluation: Synthetic Datasets
• Generate the middle layer using a polynomial function
• Generate the hierarchy upwards and downwards following the
hierarchical feature implication and transitivity rule
• The depth and branching factor are controlled with parameters D
and B
Name #Instances Class Labels #Features
S-D2-B2 1,000 positive(500); negative(500) 1,201
S-D2-B5 1,000 positive(500); negative(500) 1,021
S-D2-B10 1,000 positive(500); negative(500) 961
S-D4-B2 1,000 positive(500); negative(500) 2,101
S-D4-B4 1,000 positive(500); negative(500) 1,741
S-D4-B10 1,000 positive(500); negative(500) 1,621
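The upward construction described above can be sketched as follows. This is an illustrative simplification, not the authors' exact generator: the polynomial middle layer is replaced by uniform random bits, and only one parent level is built.

```python
import random

# Sketch of the synthetic generation idea: draw a binary middle layer,
# then build parents upwards so that the hierarchical implication
# (child = 1 -> parent = 1) holds by construction.

def generate_instance(n_middle, branching):
    middle = [random.randint(0, 1) for _ in range(n_middle)]
    # each consecutive group of `branching` middle features shares one
    # parent; the parent is 1 iff any of its children is 1
    parents = [int(any(middle[i:i + branching]))
               for i in range(0, n_middle, branching)]
    return middle, parents

random.seed(42)
middle, parents = generate_instance(n_middle=8, branching=2)
# every active child implies an active parent:
assert all(parents[i // 2] == 1 for i, v in enumerate(middle) if v == 1)
```

Repeating the parent construction level by level yields hierarchies of arbitrary depth D, while the group size controls the branching factor B.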
Evaluation: Synthetic Datasets
• Example with Depth = 1 & Branching = 2
[Figure: example instance with binary feature values propagated from the middle layer up through the hierarchy]
Evaluation: Approach
• Testing all approaches using three classification methods
– Naïve Bayes, k-NN and SVM
• Metrics for performance evaluation
– Accuracy: Acc(V′) = (correctly classified instances using V′) / (total number of instances)
– Feature Space Compression: c(V′) = 1 − |V′| / |V|
– Harmonic Mean: H = (2 ∙ Acc(V′) ∙ c(V′)) / (Acc(V′) + c(V′))
• Results calculated using stratified 10-fold cross validation
– Feature selection is performed inside each fold
• Parameter optimization for each feature selection strategy
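The three metrics above can be computed as in this short sketch; the counts used are illustrative:

```python
# Sketch of the three evaluation metrics.

def accuracy(correct, total):
    """Acc(V') = correctly classified instances / total instances."""
    return correct / total

def compression(n_selected, n_original):
    """c(V') = 1 - |V'| / |V|."""
    return 1 - n_selected / n_original

def harmonic_mean(acc, c):
    """H = 2 * Acc * c / (Acc + c)."""
    return 2 * acc * c / (acc + c)

acc = accuracy(850, 1000)   # 0.85
c = compression(400, 4000)  # 0.9
print(round(harmonic_mean(acc, c), 3))  # 0.874
```

As with an F-score, the harmonic mean rewards selection strategies that are strong on both accuracy and compression rather than extreme on one.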
Evaluation: SHSEL IG
• Classification accuracy when using different relevance similarity thresholds on the Cities dataset
[Figure: accuracy, compression, and harmonic mean (0–100%) plotted against the relevance similarity threshold]
Evaluation: Classification Accuracy (NB)
[Figure: classification accuracy (0–100%) with Naïve Bayes on the real-world datasets (Sports Tweets T, Sports Tweets C, StumbleUpon, Cities, NY Daily Headings) and the synthetic datasets (S_D2_B2, S_D2_B5, S_D2_B10, S_D4_B2, S_D4_B5, S_D4_B10), comparing original, initialSHSEL IG, initialSHSEL C, pruneSHSEL IG, pruneSHSEL C, SIG, SC, TSEL Lift, TSEL IG, HillClimbing, and GreedyTopDown]
Evaluation: Feature Space Compression (NB)
[Figure: feature space compression (0–100%) with Naïve Bayes on the real-world datasets (Sports Tweets T, Sports Tweets C, StumbleUpon, Cities, NY Daily Headings) and the synthetic datasets (S_D2_B2 through S_D4_B10), comparing initialSHSEL IG, initialSHSEL C, pruneSHSEL IG, pruneSHSEL C, SIG, SC, TSEL Lift, TSEL IG, HillClimbing, and GreedyTopDown]
Evaluation: Harmonic Mean (NB)
[Figure: harmonic mean of accuracy and compression (0–100%) with Naïve Bayes on the real-world datasets (Sports Tweets T, Sports Tweets C, StumbleUpon, Cities, NY Daily Headings) and the synthetic datasets (S_D2_B2 through S_D4_B10), comparing initialSHSEL IG, initialSHSEL C, pruneSHSEL IG, pruneSHSEL C, SIG, SC, TSEL Lift, TSEL IG, HillClimbing, and GreedyTopDown]
Conclusion & Outlook
• Contribution
– An approach that exploits hierarchies for feature selection in
combination with standard metrics
– The evaluation shows that the approach outperforms standard feature
selection techniques, and other approaches using hierarchies
• Future Work
– Conduct further experiments
• E.g. text mining, bioinformatics
– Feature selection in unsupervised learning
• E.g. clustering, outlier detection
• Laplacian Score