
Page 1:

Semi-Supervised Feature Selection for Graph Classification

Xiangnan Kong, Philip S. Yu
Department of Computer Science, University of Illinois at Chicago

KDD 2010

Page 2:

Graph Classification - why should we care?

[Figure: chemical compounds, program flows, XML documents]

Conventional data mining and machine learning approaches assume data are represented as feature vectors, e.g., (x1, x2, …, xd) → y.

In real applications, however, data are often not directly represented as feature vectors, but as graphs with complex structures, e.g., G(V, E, l) → y.
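To make the G(V, E, l) → y notation concrete, here is a minimal sketch (in Python; the class and field names are illustrative, not from the paper) of one graph-classification instance:

```python
from dataclasses import dataclass

@dataclass
class LabeledGraph:
    """A graph G(V, E, l) paired with a class label y."""
    vertices: list[int]               # V: vertex ids
    edges: list[tuple[int, int]]      # E: undirected edges as vertex-id pairs
    labels: dict[int, str]            # l: vertex id -> node label (e.g., atom symbol)
    y: int | None = None              # class label (+1 / -1); None if unlabeled

# A toy molecule fragment O-C-O, labeled active (+1)
g = LabeledGraph(
    vertices=[0, 1, 2],
    edges=[(0, 1), (1, 2)],
    labels={0: "O", 1: "C", 2: "O"},
    y=+1,
)
```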

Page 3:

Example: Graph Classification

Drug activity prediction problem:

Given a set of chemical compounds labeled with activities, predict the activities of testing molecules.

[Figure: training graphs labeled + and -, and a testing graph marked ?]

The same setting applies to program flows and XML documents.

Page 4:

Subgraph-based Graph Classification

[Figure: two molecular graphs G1 and G2 are matched against subgraph patterns g1, g2, g3 and mapped to binary feature vectors, e.g., G1 → (1, 0, 1) and G2 → (0, 1, 1), which are then fed to a classifier]

Pipeline: Graph Objects → Feature Vectors → Classifiers

Key question: how to mine a set of subgraph patterns in order to effectively perform graph classification?
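A hedged sketch of this feature construction using networkx's subgraph-isomorphism matcher (the helper names and toy molecules are mine; the paper itself mines the patterns rather than taking them as given):

```python
import networkx as nx
from networkx.algorithms import isomorphism

def make_graph(edges, labels):
    """Build a node-labeled undirected graph."""
    g = nx.Graph()
    for node, lab in labels.items():
        g.add_node(node, label=lab)
    g.add_edges_from(edges)
    return g

def contains_subgraph(graph, pattern):
    """True if `pattern` occurs in `graph` as a label-preserving (induced) subgraph."""
    node_match = isomorphism.categorical_node_match("label", None)
    matcher = isomorphism.GraphMatcher(graph, pattern, node_match=node_match)
    return matcher.subgraph_is_isomorphic()

def to_feature_vector(graph, patterns):
    """Map a graph to a binary vector: x_k = 1 iff pattern g_k occurs in it."""
    return [int(contains_subgraph(graph, p)) for p in patterns]

# Toy example: one graph, two subgraph patterns
G1 = make_graph([(0, 1), (1, 2)], {0: "O", 1: "C", 2: "O"})
g1 = make_graph([(0, 1)], {0: "C", 1: "O"})   # C-O bond
g2 = make_graph([(0, 1)], {0: "N", 1: "H"})   # N-H bond
print(to_feature_vector(G1, [g1, g2]))        # [1, 0]
```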

Page 5:

Conventional Methods - Two Components

1. Evaluation (effective): is a subgraph feature relevant to graph classification?

2. Search space pruning (efficient): how to avoid enumerating all subgraph features?

[Figure: labeled graphs (+/-) are mined for discriminative subgraphs]

Page 6:

One Problem

Supervised settings require a large set of labeled training graphs.

However, labeling a graph is hard!

[Figure: labeled graphs (+/-) mined for discriminative subgraphs]

Page 7:

Lack of Labels -> Problems

Supervised methods:

1. Evaluation: still effective? It requires a large amount of label information.

2. Search space pruning: still efficient? Pruning performance relies on a large amount of label information.

[Figure: with few labeled graphs (+/-), mining discriminative subgraphs degrades]

Page 8:

Semi-Supervised Feature Selection for Graph Classification

Goal: mine useful subgraph patterns using both labeled and unlabeled graphs.

[Figure: labeled graphs (+/-) and unlabeled graphs (?) feed an evaluation criterion F(p) that scores subgraph patterns for classification]

Page 9:

Two Key Questions to Address

Evaluation: how to evaluate a set of subgraph features with both labeled and unlabeled graphs? (effective)

Search space pruning: how to prune the subgraph search space using both labeled and unlabeled graphs? (efficient)

Page 10:

What is a good feature?

Cannot-Link: graphs in different classes should be far away.

Must-Link: graphs in the same class should be close.

Separability: unlabeled graphs should be separable from each other.

[Figure: labeled (+/-) and unlabeled graphs illustrating the three constraints]

Page 11:

Optimization

Turn the three properties into a single objective:

Cannot-Link: graphs in different classes should be far away.
Must-Link: graphs in the same class should be close.
Separability: unlabeled graphs should be separable from each other.

Evaluation Function (with $x_i$ the feature vector of graph $G_i$ under the selected subgraph features $T$; $C$ the cannot-link pairs, $M$ the must-link pairs, $U$ the unlabeled graphs; the weights $\alpha$ and $\beta$ reappear in the parameter study on Page 20):

$$\max_{T}\;\; \alpha \sum_{(G_i, G_j) \in C} \|x_i - x_j\|^2 \;-\; \beta \sum_{(G_i, G_j) \in M} \|x_i - x_j\|^2 \;+\; \sum_{G_i, G_j \in U} \|x_i - x_j\|^2$$
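The matrix form on the next slide follows because this pairwise objective decomposes feature by feature; filling in that step (with $w_{ij} = \alpha$ on cannot-link pairs, $-\beta$ on must-link pairs, $1$ on unlabeled pairs, and $f_{ki}$ the $k$-th feature of graph $G_i$):

$$\frac{1}{2}\sum_{i,j} w_{ij}\,\|x_i - x_j\|^2 \;=\; \sum_{k}\,\frac{1}{2}\sum_{i,j} w_{ij}\,(f_{ki} - f_{kj})^2 \;=\; \sum_{k} \mathbf{f}_k^{\top} (D - W)\,\mathbf{f}_k, \qquad D_{ii} = \sum_{j} w_{ij},$$

so all constraint information can be folded into a single matrix $Q = D - W$.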

Page 12:

Evaluation: gSemi Criterion

gSemi score of the k-th subgraph feature, where $\mathbf{f}_k$ is its binary indicator vector over all graphs (labeled and unlabeled):

$$q(g_k) = \mathbf{f}_k^{\top} Q\, \mathbf{f}_k$$

In matrix form (the sum over all selected features $T$):

$$q(T) = \sum_{g_k \in T} \mathbf{f}_k^{\top} Q\, \mathbf{f}_k$$

[Figure: a subgraph pattern with a high gSemi score (good) vs. one with a low score (bad)]
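Following the decomposition above, a hedged sketch of the score computation in NumPy (the weight choices in build_Q are my reading of the slides' objective, not code from the paper; the α=1, β=0.1 defaults echo the best parameters reported on Page 20):

```python
import numpy as np

def build_Q(n, cannot_link, must_link, unlabeled, alpha=1.0, beta=0.1):
    """Fold the three constraint types into one matrix Q = D - W, so that
    f^T Q f = (1/2) * sum_ij W_ij * (f_i - f_j)^2 for a binary indicator f."""
    W = np.zeros((n, n))
    for i, j in cannot_link:        # different classes: reward separating them
        W[i, j] = W[j, i] = alpha
    for i, j in must_link:          # same class: penalize separating them
        W[i, j] = W[j, i] = -beta
    for i in unlabeled:             # unlabeled pairs: reward separability
        for j in unlabeled:
            if i != j:
                W[i, j] = 1.0
    return np.diag(W.sum(axis=1)) - W

def gsemi_score(f, Q):
    """gSemi score q(g_k) = f_k^T Q f_k of one subgraph feature."""
    return float(f @ Q @ f)

# Toy run: 4 graphs (0 and 1 labeled +, 2 labeled -, 3 unlabeled)
Q = build_Q(4, cannot_link=[(0, 2), (1, 2)], must_link=[(0, 1)], unlabeled=[3])
f = np.array([1.0, 1.0, 0.0, 0.0])   # feature present only in the + graphs
print(gsemi_score(f, Q))             # positive: separates the two classes
```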

Page 13:

Experiment Results

[Figure: accuracy curves on the MCF-7, NCI-H23, OVCAR-8, PTC-MM, and PTC-FM datasets, for #labeled graphs = 30, 50, and 70]

Page 14:

Experiment Results

Compared methods: gSemi (semi-supervised), Information Gain (supervised), Frequent (unsupervised).

Our approach performed best on the NCI and PTC datasets.

[Figure: accuracy (higher is better) vs. # selected features; #labeled graphs = 30, MCF-7 dataset]

Page 15:

Two Key Questions to Address

Evaluation: how to evaluate a set of subgraph features with both labeled and unlabeled graphs? (effective)

Search space pruning: how to prune the subgraph search space using both labeled and unlabeled graphs? (efficient)

Page 16:

Finding a Needle in a Haystack

gSpan [Yan et al., ICDM'02]: an efficient algorithm to enumerate all frequent subgraph patterns (frequency ≥ min_support). Candidates are organized in a pattern search tree, growing one edge at a time (0 edges, 1 edge, 2 edges, ...); branches rooted at infrequent patterns are pruned.

But there are too many frequent subgraph patterns; we only want the most useful one(s).

How can we find the best node(s) in this tree without searching all the nodes? (Branch and bound to prune the search space; see Page 17. A toy sketch of the tree enumeration itself follows below.)
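A toy sketch of the pattern-search-tree idea with anti-monotone frequency pruning; for simplicity, patterns here are flat sets of labeled edges rather than the canonical DFS codes real gSpan uses:

```python
# Toy database: each "graph" is just a set of labeled edges.
GRAPHS = [
    {"C-C", "C-O", "C-N"},
    {"C-C", "C-O"},
    {"C-C", "C-N"},
]

def support(pattern):
    """Number of database graphs that contain every edge of the pattern."""
    return sum(pattern <= g for g in GRAPHS)

def enumerate_frequent(pattern, candidates, min_support, results):
    """Pattern-tree DFS with anti-monotone pruning: if a pattern is
    infrequent, every super-pattern is too, so the subtree is skipped."""
    if support(pattern) < min_support:
        return
    if pattern:
        results.append(pattern)
    for k, edge in enumerate(candidates):           # grow by one edge
        enumerate_frequent(pattern | {edge}, candidates[k + 1:],
                           min_support, results)

results = []
enumerate_frequent(frozenset(), sorted(set().union(*GRAPHS)),
                   min_support=2, results=results)
print(results)   # e.g. {C-C}, {C-N}, {C-O}, {C-C, C-N}, {C-C, C-O}
```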

Page 17:

Pruning Principle

[Figure: pattern search tree; the best subgraph so far gives the best gSemi score so far, and for the current node we can compute both its current gSemi score and an upper bound on the scores of all nodes in its sub-tree]

If

best score so far ≥ upper bound (for the current node's sub-tree)

we can prune the entire sub-tree.
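A minimal branch-and-bound sketch over the same kind of toy pattern tree, reusing build_Q from the Page 12 sketch. The upper bound here (summing only the positive entries of Q over the pattern's current support) is one valid bound for a quadratic score under shrinking supports, inferred from the slides rather than quoted from the paper: growing a pattern can only shrink its support, so any descendant's score draws on a subset of the positive terms counted here.

```python
import numpy as np

# Toy database: 4 graphs as flat edge sets (graphs 0-1 labeled +, graph 2
# labeled -, graph 3 unlabeled), matching the Page 12 toy constraints.
GRAPHS = [{"C-C", "C-O"}, {"C-O"}, {"C-C", "C-N"}, {"C-N", "C-O"}]
EDGES = sorted(set().union(*GRAPHS))

def indicator(pattern):
    """Binary vector over the database: which graphs contain the pattern."""
    return np.array([float(pattern <= g) for g in GRAPHS])

def upper_bound(f, Q):
    """Bound the score of every super-pattern: its support is a subset of
    supp(f), so its score is at most the sum of positive Q entries there."""
    idx = np.flatnonzero(f)
    return float(np.clip(Q[np.ix_(idx, idx)], 0.0, None).sum())

def branch_and_bound(pattern, candidates, Q, best):
    """DFS over the pattern tree; prune a sub-tree whenever the best gSemi
    score found so far already meets its upper bound."""
    f = indicator(pattern)
    score = float(f @ Q @ f)                    # current gSemi score
    if pattern and score > best[0]:
        best[0], best[1] = score, pattern       # new best subgraph so far
    if best[0] >= upper_bound(f, Q):
        return                                  # prune the entire sub-tree
    for k, edge in enumerate(candidates):       # grow the pattern by one edge
        branch_and_bound(pattern | {edge}, candidates[k + 1:], Q, best)

Q = build_Q(4, cannot_link=[(0, 2), (1, 2)], must_link=[(0, 1)], unlabeled=[3])
best = [float("-inf"), None]
branch_and_bound(frozenset(), EDGES, Q, best)
print(best)   # best gSemi score found and the pattern achieving it
```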

Page 18:

Pruning Results

[Figure: running time in seconds (lower is better) with vs. without gSemi pruning; MCF-7 dataset]

Page 19:

Pruning Results

[Figure: # subgraphs explored (lower is better) with vs. without gSemi pruning; MCF-7 dataset]

Page 20:

Parameters

[Figure: accuracy (higher is better) as α (the weight on cannot-link constraints) and β (the weight on must-link constraints) vary; MCF-7 dataset, #labeled = 50, #features = 20]

Best performance at (α=1, β=0.1). At extreme parameter settings the semi-supervised criterion behaves close to the supervised criterion or close to the unsupervised one.

Page 21:

Conclusions

Semi-Supervised Feature Selection for Graph Classification:

Evaluating subgraph features using both labeled and unlabeled graphs (effective).

Branch-and-bound pruning of the search space using both labeled and unlabeled graphs (efficient).

Thank you!