

Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree

Wei Fan, Kun Zhang, Hong Cheng, Jing Gao, Xifeng Yan, Jiawei Han, Philip S. Yu, Olivier Verscheure

Motivation:

• Data are not in pre-defined feature vectors that can be used to construct predictive models.

Applications:

• Transactional databases

• Sequence databases

• Graph databases

A frequent pattern is a good candidate for discriminative features, especially for data with complicated structure.

Why Frequent Patterns?

• A non-linear conjunctive combination of single features

• Increases the expressive and discriminative power of the feature space

Examples:

• Exclusive OR problem & Solution

X Y C

0 0 0

0 1 1

1 0 1

1 1 0

(Figure: the four XOR points, labeled 0 and 1, plotted in the (x, y) plane; neither candidate line L1 nor L2 separates the classes.)

Data is non-linearly separable in (x, y)

X Y XY C

0 0 0 0

0 1 0 1

1 0 0 1

1 1 1 0

mine & transform

Data is linearly separable in (x, y, xy)
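The mine & transform step is exactly why XOR becomes easy: in (x, y, xy) the class is the linear function x + y - 2*xy. A minimal check in plain Python (the weights are chosen by hand purely for illustration):

```python
# XOR truth table: (x, y) -> class
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def augment(x, y):
    """Map (x, y) to the higher-dimensional space (x, y, x*y)."""
    return (x, y, x * y)

# In the augmented space a single linear function reproduces XOR:
# f(x, y, xy) = x + y - 2*xy
weights = (1, 1, -2)

for (x, y), label in data:
    score = sum(w * f for w, f in zip(weights, augment(x, y)))
    assert score == label  # linearly computable in (x, y, xy)
```

No linear function of (x, y) alone can do this, which is the point of mining the conjunctive pattern xy as a feature.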

3D Projection Using XY

(Figure: the data mapped to the higher-dimensional space with axes x, y, xy; classes 0 and 1 become linearly separable.)

Conventional Frequent Pattern-Based Classification: Two-Step Batch Method

1. Mine frequent patterns;

2. Select most discriminative patterns;

3. Represent data in the feature space using such patterns;

4. Build classification models.

      F1 F2 F4
Data1  1  1  0
Data2  1  0  1
Data3  1  1  0
Data4  0  0  1
………

(Diagram: DataSet --mine--> Frequent Patterns (1, 2, 3, 4, 5, 6, 7) --select--> Mined Discriminative Patterns (1, 2, 4) --represent--> binary feature table above)
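The four batch steps can be sketched end to end in a toy itemset setting. This is an illustrative sketch, not the authors' implementation: the helper names (mine_frequent, select_discriminative, represent) and the tiny dataset are made up, mining is capped at size-2 itemsets for brevity, and "discriminative" is scored by the absolute support gap between classes.

```python
from itertools import combinations

def mine_frequent(transactions, min_support):
    """Step 1: enumerate itemsets (up to size 2 here) meeting min_support."""
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items] + \
                 [frozenset(p) for p in combinations(items, 2)]
    n = len(transactions)
    return [c for c in candidates
            if sum(c <= set(t) for t in transactions) / n >= min_support]

def select_discriminative(patterns, transactions, labels, k=2):
    """Step 2: rank patterns by |support(pos) - support(neg)|, keep top k."""
    pos = [set(t) for t, y in zip(transactions, labels) if y == 1]
    neg = [set(t) for t, y in zip(transactions, labels) if y == 0]
    def gap(p):
        return abs(sum(p <= t for t in pos) / max(len(pos), 1)
                   - sum(p <= t for t in neg) / max(len(neg), 1))
    return sorted(patterns, key=gap, reverse=True)[:k]

def represent(transactions, patterns):
    """Step 3: binary feature vectors (pattern present or not)."""
    return [[int(p <= set(t)) for p in patterns] for t in transactions]

# Step 4 would feed these vectors to any classifier (ANN, DT, SVM, LR, ...).
transactions = [["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
labels = [1, 1, 0, 0]
patterns = select_discriminative(mine_frequent(transactions, 0.5),
                                 transactions, labels)
X = represent(transactions, patterns)
```

Note that steps 1 and 2 are entirely decoupled here: every frequent itemset is materialized before any selection happens, which is precisely the scalability problem the batch method runs into.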

(Figure: decision tree on the iris data: if Petal.Length < 2.45 then setosa; else if Petal.Width < 1.75 then versicolor, otherwise virginica.)

Any classifiers you can name: ANN, DT, SVM, LR

Basic Flows: Problems of Separated Mine & Select in Batch Method

1. Mine step: issues of scalability and combinatorial explosion

• Dilemma of setting the minimum support threshold

• Promising discriminative candidate patterns?

• Tremendous number of candidate patterns?

2. Select step: issue of discriminative power
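The min-support dilemma shows up even at toy scale: loosening the threshold makes the candidate count blow up combinatorially, while tightening it discards all but a few patterns, including potentially discriminative rare ones. A hypothetical illustration (the dataset and thresholds are invented for the example):

```python
from itertools import combinations

def count_frequent(transactions, min_support):
    """Exhaustively count itemsets of every size meeting min_support."""
    items = sorted({i for t in transactions for i in t})
    sets = [set(t) for t in transactions]
    n = len(sets)
    total = 0
    for size in range(1, len(items) + 1):
        for cand in combinations(items, size):
            if sum(set(cand) <= s for s in sets) / n >= min_support:
                total += 1
    return total

# Toy data: 8 transactions containing all 10 items, plus 2 tiny ones.
transactions = [list(range(10))] * 8 + [[0], [1]]

high = count_frequent(transactions, 0.9)  # strict: only 2 patterns survive
low = count_frequent(transactions, 0.5)   # loose: all 2**10 - 1 = 1023 subsets
```

With ten items the loose threshold already admits every nonempty subset; with the dozens or hundreds of items in the UCI datasets below, exhaustive enumeration at low support is hopeless, which motivates mining and selecting together.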

• 5 Datasets: UCI Machine Learning Repository

• Scalability Study:

(Charts: Log(DT #Pat) vs Log(MbT #Pat), and Log(DT AbsSupport) vs Log(MbT AbsSupport), on Adult, Chess, Hypo, Sick, Sonar.)

Dataset   #Pat using MbT sup   Ratio (MbT #Pat / #Pat using MbT sup)
Adult     252809               0.41%
Chess     +∞                   ~0%
Hypo      423439               0.0035%
Sick      4818391              0.00032%
Sonar     95507                0.00775%

Itemset Mining

• Accuracy of mined itemsets

(Chart: DT Accuracy vs MbT Accuracy on Adult, Chess, Hypo, Sick, Sonar; accuracies range from 70% to 100%.)

Graph Mining

• 11 datasets:

• 9 NCI anti-cancer screen datasets (PubChem Project; positive class: 1% to 8.3%)

• 2 AIDS anti-viral screen datasets (URL: http://dtp.nci.nih.gov; H1: 3.5%, H2: 1%)

• Scalability Study

(Charts: DT #Pat vs MbT #Pat, ranging 0 to 1800, and Log(DT Abs Support) vs Log(MbT Abs Support), on NCI1, NCI33, NCI41, NCI47, NCI81, NCI83, NCI109, NCI123, NCI145, H1, H2.)

• Predictive Quality of Mined Frequent Subgraphs

(Charts: accuracy of DT vs MbT, ranging 0.5 to 0.8, and AUC of DT vs MbT, ranging 0.88 to 1.0, on NCI1, NCI33, NCI41, NCI47, NCI81, NCI83, NCI109, NCI123, NCI145, H1, H2.)

• MbT vs Benchmarks

• Case Study

(Outline: Motivation, Problems, Proposed Algorithm, Experiments)

(Diagram: the dataset is recursively partitioned into a search tree: node 1 at the root, children 2 and 5, leaves 3, 4, 6, 7; branches left with few data terminate.)

Divide-and-Conquer Based Frequent Pattern Mining

Each node performs its own "mine & select" on its partition of the data.

Mined Discriminative Patterns: 1, 2, 3, 4, 5, 6, 7

1. Mine and Select most discriminative patterns;

2. Represent data in the feature space using such patterns;

3. Build classification models.

      F1 F2 F4
Data1  1  1  0
Data2  1  0  1
Data3  1  1  0
Data4  0  0  1
………

represent

(Figure: the same iris decision tree as above.)

Any classifiers you can name: ANN, DT, SVM, LR

Direct Mining & Selection via Model-based Search Tree

The procedure serves as a feature miner, or can itself be used as a classifier.
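The divide-and-conquer idea can be sketched recursively: each node mines and selects on its own (shrinking) partition, keeps the single most discriminative pattern, splits the data by pattern presence, and recurses until a node is pure or too small. This is a simplified sketch of the idea, not the authors' code: pattern mining is reduced to single items, and the selection criterion is a plain support gap rather than information gain.

```python
def best_pattern(transactions, labels):
    """Stand-in for 'mine & select' at one node: pick the single item
    whose presence best separates the classes by support gap."""
    items = sorted({i for t in transactions for i in t})
    pos = [set(t) for t, y in zip(transactions, labels) if y == 1]
    neg = [set(t) for t, y in zip(transactions, labels) if y == 0]
    def gap(i):
        return abs(sum(i in t for t in pos) / max(len(pos), 1)
                   - sum(i in t for t in neg) / max(len(neg), 1))
    return max(items, key=gap)

def build_mbt(transactions, labels, min_size=2, depth=0, out=None):
    """Model-based search tree sketch: mine & select locally, split,
    recurse. Returns (depth, pattern) pairs selected along the way."""
    if out is None:
        out = []
    if len(transactions) < min_size or len(set(labels)) < 2:
        return out  # node too small or already pure: stop
    p = best_pattern(transactions, labels)
    out.append((depth, p))
    has = [(t, y) for t, y in zip(transactions, labels) if p in t]
    not_has = [(t, y) for t, y in zip(transactions, labels) if p not in t]
    for branch in (has, not_has):
        if branch and len(branch) < len(transactions):  # skip no-op splits
            ts, ys = zip(*branch)
            build_mbt(list(ts), list(ys), min_size, depth + 1, out)
    return out

patterns = build_mbt([["a", "b"], ["a", "c"], ["b", "c"], ["c"]],
                     [1, 1, 0, 0])
```

Because each recursive call mines only the data reaching that node, candidate enumeration never happens over the full dataset at once, which is the source of the "scale down" ratio analyzed below.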

Analyses:

1. Scalability of pattern enumeration

• Upper bound

• "Scale down" ratio

2. Bound on number of returned features

3. Subspace pattern selection

4. Non-overfitting

5. Optimality under exhaustive search

Take Home Message:

1. Highly compact and discriminative frequent patterns can be mined directly through the Model-based Search Tree without worrying about combinatorial explosion.

2. Software and datasets are available by contacting the authors.