
Page 1:

Semi-Supervised Feature Selection for Graph Classification

Xiangnan Kong, Philip S. Yu
Department of Computer Science, University of Illinois at Chicago

KDD 2010

Page 2:

Graph Classification - why should we care?

[Figure: chemical compounds, program flows, XML documents]

Conventional data mining and machine learning approaches assume data are represented as feature vectors, e.g., (x1, x2, …, xd) → y.

In real applications, however, data are often not directly represented as feature vectors, but as graphs with complex structures, e.g., G(V, E, l) → y.
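To make the G(V, E, l) → y notation concrete, here is a minimal sketch (in Python; the class and field names are illustrative, not from the paper) of one graph-classification instance:

```python
from dataclasses import dataclass

@dataclass
class LabeledGraph:
    """A graph G(V, E, l) paired with a class label y."""
    vertices: list[int]               # V: vertex ids
    edges: list[tuple[int, int]]      # E: undirected edges as vertex-id pairs
    labels: dict[int, str]            # l: vertex id -> node label (e.g., atom symbol)
    y: int | None = None              # class label (+1 / -1); None if unlabeled

# A toy molecule fragment O-C-O, labeled active (+1)
g = LabeledGraph(
    vertices=[0, 1, 2],
    edges=[(0, 1), (1, 2)],
    labels={0: "O", 1: "C", 2: "O"},
    y=+1,
)
```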

Page 3:

Example: Graph Classification

Drug activity prediction problem:

Given a set of chemical compounds labeled with activities, predict the activities of testing molecules.

[Figure: training graphs labeled + and -, and a testing graph marked ?]

The same setting applies to program flows and XML documents.

Page 4:

Subgraph-based Graph Classification

[Figure: two molecular graphs G1 and G2 are matched against subgraph patterns g1, g2, g3 and mapped to binary feature vectors, e.g., G1 → (1, 0, 1) and G2 → (0, 1, 1), which are then fed to a classifier]

Pipeline: Graph Objects → Feature Vectors → Classifiers

Key question: how to mine a set of subgraph patterns in order to effectively perform graph classification?
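A hedged sketch of this feature construction using networkx's subgraph-isomorphism matcher (the helper names and toy molecules are mine; the paper itself mines the patterns rather than taking them as given):

```python
import networkx as nx
from networkx.algorithms import isomorphism

def make_graph(edges, labels):
    """Build a node-labeled undirected graph."""
    g = nx.Graph()
    for node, lab in labels.items():
        g.add_node(node, label=lab)
    g.add_edges_from(edges)
    return g

def contains_subgraph(graph, pattern):
    """True if `pattern` occurs in `graph` as a label-preserving (induced) subgraph."""
    node_match = isomorphism.categorical_node_match("label", None)
    matcher = isomorphism.GraphMatcher(graph, pattern, node_match=node_match)
    return matcher.subgraph_is_isomorphic()

def to_feature_vector(graph, patterns):
    """Map a graph to a binary vector: x_k = 1 iff pattern g_k occurs in it."""
    return [int(contains_subgraph(graph, p)) for p in patterns]

# Toy example: one graph, two subgraph patterns
G1 = make_graph([(0, 1), (1, 2)], {0: "O", 1: "C", 2: "O"})
g1 = make_graph([(0, 1)], {0: "C", 1: "O"})   # C-O bond
g2 = make_graph([(0, 1)], {0: "N", 1: "H"})   # N-H bond
print(to_feature_vector(G1, [g1, g2]))        # [1, 0]
```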

Page 5:

Conventional Methods - Two Components

1. Evaluation (effective): is a subgraph feature relevant to graph classification?

2. Search space pruning (efficient): how to avoid enumerating all subgraph features?

[Figure: labeled graphs (+/-) are mined for discriminative subgraphs]

Page 6:

One Problem

Supervised settings require a large set of labeled training graphs.

However, labeling a graph is hard!

[Figure: labeled graphs (+/-) mined for discriminative subgraphs]

Page 7:

Lack of Labels -> Problems

Supervised methods:

1. Evaluation: still effective? It requires a large amount of label information.

2. Search space pruning: still efficient? Pruning performance relies on a large amount of label information.

[Figure: with few labeled graphs (+/-), mining discriminative subgraphs degrades]

Page 8:

Semi-Supervised Feature Selection for Graph Classification

Goal: mine useful subgraph patterns using both labeled and unlabeled graphs.

[Figure: labeled graphs (+/-) and unlabeled graphs (?) feed an evaluation criterion F(p) that scores subgraph patterns for classification]

Page 9:

Two Key Questions to Address

Evaluation: how to evaluate a set of subgraph features with both labeled and unlabeled graphs? (effective)

Search space pruning: how to prune the subgraph search space using both labeled and unlabeled graphs? (efficient)

Page 10:

What is a good feature?

Cannot-Link: graphs in different classes should be far away.

Must-Link: graphs in the same class should be close.

Separability: unlabeled graphs should be separable from each other.

[Figure: labeled (+/-) and unlabeled graphs illustrating the three constraints]

Page 11:

Optimization

Turn the three properties into a single objective:

Cannot-Link: graphs in different classes should be far away.
Must-Link: graphs in the same class should be close.
Separability: unlabeled graphs should be separable from each other.

Evaluation Function (with $x_i$ the feature vector of graph $G_i$ under the selected subgraph features $T$; $C$ the cannot-link pairs, $M$ the must-link pairs, $U$ the unlabeled graphs; the weights $\alpha$ and $\beta$ reappear in the parameter study on Page 20):

$$\max_{T}\;\; \alpha \sum_{(G_i, G_j) \in C} \|x_i - x_j\|^2 \;-\; \beta \sum_{(G_i, G_j) \in M} \|x_i - x_j\|^2 \;+\; \sum_{G_i, G_j \in U} \|x_i - x_j\|^2$$
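The matrix form on the next slide follows because this pairwise objective decomposes feature by feature; filling in that step (with $w_{ij} = \alpha$ on cannot-link pairs, $-\beta$ on must-link pairs, $1$ on unlabeled pairs, and $f_{ki}$ the $k$-th feature of graph $G_i$):

$$\frac{1}{2}\sum_{i,j} w_{ij}\,\|x_i - x_j\|^2 \;=\; \sum_{k}\,\frac{1}{2}\sum_{i,j} w_{ij}\,(f_{ki} - f_{kj})^2 \;=\; \sum_{k} \mathbf{f}_k^{\top} (D - W)\,\mathbf{f}_k, \qquad D_{ii} = \sum_{j} w_{ij},$$

so all constraint information can be folded into a single matrix $Q = D - W$.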

Page 12:

Evaluation: gSemi Criterion

gSemi score of the k-th subgraph feature, where $\mathbf{f}_k$ is its binary indicator vector over all graphs (labeled and unlabeled):

$$q(g_k) = \mathbf{f}_k^{\top} Q\, \mathbf{f}_k$$

In matrix form (the sum over all selected features $T$):

$$q(T) = \sum_{g_k \in T} \mathbf{f}_k^{\top} Q\, \mathbf{f}_k$$

[Figure: a subgraph pattern with a high gSemi score (good) vs. one with a low score (bad)]
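Following the decomposition above, a hedged sketch of the score computation in NumPy (the weight choices in build_Q are my reading of the slides' objective, not code from the paper; the α=1, β=0.1 defaults echo the best parameters reported on Page 20):

```python
import numpy as np

def build_Q(n, cannot_link, must_link, unlabeled, alpha=1.0, beta=0.1):
    """Fold the three constraint types into one matrix Q = D - W, so that
    f^T Q f = (1/2) * sum_ij W_ij * (f_i - f_j)^2 for a binary indicator f."""
    W = np.zeros((n, n))
    for i, j in cannot_link:        # different classes: reward separating them
        W[i, j] = W[j, i] = alpha
    for i, j in must_link:          # same class: penalize separating them
        W[i, j] = W[j, i] = -beta
    for i in unlabeled:             # unlabeled pairs: reward separability
        for j in unlabeled:
            if i != j:
                W[i, j] = 1.0
    return np.diag(W.sum(axis=1)) - W

def gsemi_score(f, Q):
    """gSemi score q(g_k) = f_k^T Q f_k of one subgraph feature."""
    return float(f @ Q @ f)

# Toy run: 4 graphs (0 and 1 labeled +, 2 labeled -, 3 unlabeled)
Q = build_Q(4, cannot_link=[(0, 2), (1, 2)], must_link=[(0, 1)], unlabeled=[3])
f = np.array([1.0, 1.0, 0.0, 0.0])   # feature present only in the + graphs
print(gsemi_score(f, Q))             # positive: separates the two classes
```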

Page 13:

Experiment Results

[Figure: accuracy curves on the MCF-7, NCI-H23, OVCAR-8, PTC-MM, and PTC-FM datasets, for #labeled graphs = 30, 50, and 70]

Page 14:

Experiment Results

Compared methods: gSemi (semi-supervised), Information Gain (supervised), Frequent (unsupervised).

Our approach performed best on the NCI and PTC datasets.

[Figure: accuracy (higher is better) vs. # selected features; #labeled graphs = 30, MCF-7 dataset]

Page 15:

Two Key Questions to Address

Evaluation: how to evaluate a set of subgraph features with both labeled and unlabeled graphs? (effective)

Search space pruning: how to prune the subgraph search space using both labeled and unlabeled graphs? (efficient)

Page 16:

Finding a Needle in a Haystack

gSpan [Yan et al., ICDM'02]: an efficient algorithm to enumerate all frequent subgraph patterns (frequency ≥ min_support). Candidates are organized in a pattern search tree, growing one edge at a time (0 edges, 1 edge, 2 edges, ...); branches rooted at infrequent patterns are pruned.

But there are too many frequent subgraph patterns; we only want the most useful one(s).

How can we find the best node(s) in this tree without searching all the nodes? (Branch and bound to prune the search space; see Page 17. A toy sketch of the tree enumeration itself follows below.)
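A toy sketch of the pattern-search-tree idea with anti-monotone frequency pruning; for simplicity, patterns here are flat sets of labeled edges rather than the canonical DFS codes real gSpan uses:

```python
# Toy database: each "graph" is just a set of labeled edges.
GRAPHS = [
    {"C-C", "C-O", "C-N"},
    {"C-C", "C-O"},
    {"C-C", "C-N"},
]

def support(pattern):
    """Number of database graphs that contain every edge of the pattern."""
    return sum(pattern <= g for g in GRAPHS)

def enumerate_frequent(pattern, candidates, min_support, results):
    """Pattern-tree DFS with anti-monotone pruning: if a pattern is
    infrequent, every super-pattern is too, so the subtree is skipped."""
    if support(pattern) < min_support:
        return
    if pattern:
        results.append(pattern)
    for k, edge in enumerate(candidates):           # grow by one edge
        enumerate_frequent(pattern | {edge}, candidates[k + 1:],
                           min_support, results)

results = []
enumerate_frequent(frozenset(), sorted(set().union(*GRAPHS)),
                   min_support=2, results=results)
print(results)   # e.g. {C-C}, {C-N}, {C-O}, {C-C, C-N}, {C-C, C-O}
```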

Page 17:

Pruning Principle

[Figure: pattern search tree; the best subgraph so far gives the best gSemi score so far, and for the current node we can compute both its current gSemi score and an upper bound on the scores of all nodes in its sub-tree]

If

best score so far ≥ upper bound (for the current node's sub-tree)

we can prune the entire sub-tree.
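A minimal branch-and-bound sketch over the same kind of toy pattern tree, reusing build_Q from the Page 12 sketch. The upper bound here (summing only the positive entries of Q over the pattern's current support) is one valid bound for a quadratic score under shrinking supports, inferred from the slides rather than quoted from the paper: growing a pattern can only shrink its support, so any descendant's score draws on a subset of the positive terms counted here.

```python
import numpy as np

# Toy database: 4 graphs as flat edge sets (graphs 0-1 labeled +, graph 2
# labeled -, graph 3 unlabeled), matching the Page 12 toy constraints.
GRAPHS = [{"C-C", "C-O"}, {"C-O"}, {"C-C", "C-N"}, {"C-N", "C-O"}]
EDGES = sorted(set().union(*GRAPHS))

def indicator(pattern):
    """Binary vector over the database: which graphs contain the pattern."""
    return np.array([float(pattern <= g) for g in GRAPHS])

def upper_bound(f, Q):
    """Bound the score of every super-pattern: its support is a subset of
    supp(f), so its score is at most the sum of positive Q entries there."""
    idx = np.flatnonzero(f)
    return float(np.clip(Q[np.ix_(idx, idx)], 0.0, None).sum())

def branch_and_bound(pattern, candidates, Q, best):
    """DFS over the pattern tree; prune a sub-tree whenever the best gSemi
    score found so far already meets its upper bound."""
    f = indicator(pattern)
    score = float(f @ Q @ f)                    # current gSemi score
    if pattern and score > best[0]:
        best[0], best[1] = score, pattern       # new best subgraph so far
    if best[0] >= upper_bound(f, Q):
        return                                  # prune the entire sub-tree
    for k, edge in enumerate(candidates):       # grow the pattern by one edge
        branch_and_bound(pattern | {edge}, candidates[k + 1:], Q, best)

Q = build_Q(4, cannot_link=[(0, 2), (1, 2)], must_link=[(0, 1)], unlabeled=[3])
best = [float("-inf"), None]
branch_and_bound(frozenset(), EDGES, Q, best)
print(best)   # best gSemi score found and the pattern achieving it
```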

Page 18:

Pruning Results

[Figure: running time in seconds (lower is better) with vs. without gSemi pruning; MCF-7 dataset]

Page 19:

Pruning Results

[Figure: # subgraphs explored (lower is better) with vs. without gSemi pruning; MCF-7 dataset]

Page 20:

Parameters

[Figure: accuracy (higher is better) as α (the weight on cannot-link constraints) and β (the weight on must-link constraints) vary; MCF-7 dataset, #labeled = 50, #features = 20]

Best performance at (α=1, β=0.1). At extreme parameter settings the semi-supervised criterion behaves close to the supervised criterion or close to the unsupervised one.

Page 21:

Conclusions

Semi-Supervised Feature Selection for Graph Classification:

Evaluating subgraph features using both labeled and unlabeled graphs (effective).

Branch-and-bound pruning of the search space using both labeled and unlabeled graphs (efficient).

Thank you!