

Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree

Wei Fan, Kun Zhang, Hong Cheng, Jing Gao, Xifeng Yan, Jiawei Han, Philip S. Yu, Olivier Verscheure

Motivation:

• Data are not in pre-defined feature vectors that can be used to construct predictive models.

Applications:

• Transactional databases

• Sequence databases

• Graph databases

A frequent pattern is a good candidate for discriminative features, especially for data with complicated structure.

Why Frequent Patterns?

• A non-linear conjunctive combination of single features

• Increases the expressive and discriminative power of the feature space

Examples:

• Exclusive OR problem & Solution

X Y C

0 0 0

0 1 1

1 0 1

1 1 0

(Figure: the four XOR points, labeled 0 and 1, plotted in the (x, y) plane; neither candidate line L1 nor L2 separates the classes.)

Data is non-linearly separable in (x, y)

X Y XY C

0 0 0 0

0 1 0 1

1 0 0 1

1 1 1 0

mine & transform

Data is linearly separable in (x, y, xy)
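The mine & transform step is exactly why XOR becomes easy: in (x, y, xy) the class is the linear function x + y - 2*xy. A minimal check in plain Python (the weights are chosen by hand purely for illustration):

```python
# XOR truth table: (x, y) -> class
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def augment(x, y):
    """Map (x, y) to the higher-dimensional space (x, y, x*y)."""
    return (x, y, x * y)

# In the augmented space a single linear function reproduces XOR:
# f(x, y, xy) = x + y - 2*xy
weights = (1, 1, -2)

for (x, y), label in data:
    score = sum(w * f for w, f in zip(weights, augment(x, y)))
    assert score == label  # linearly computable in (x, y, xy)
```

No linear function of (x, y) alone can do this, which is the point of mining the conjunctive pattern xy as a feature.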

3D Projection Using XY

(Figure: the data mapped to the higher-dimensional space with axes x, y, xy; classes 0 and 1 become linearly separable.)

Conventional Frequent Pattern-Based Classification: Two-Step Batch Method

1. Mine frequent patterns;

2. Select most discriminative patterns;

3. Represent data in the feature space using such patterns;

4. Build classification models.

      F1 F2 F4
Data1  1  1  0
Data2  1  0  1
Data3  1  1  0
Data4  0  0  1
………

(Diagram: DataSet --mine--> Frequent Patterns (1, 2, 3, 4, 5, 6, 7) --select--> Mined Discriminative Patterns (1, 2, 4) --represent--> binary feature table above)
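The four batch steps can be sketched end to end in a toy itemset setting. This is an illustrative sketch, not the authors' implementation: the helper names (mine_frequent, select_discriminative, represent) and the tiny dataset are made up, mining is capped at size-2 itemsets for brevity, and "discriminative" is scored by the absolute support gap between classes.

```python
from itertools import combinations

def mine_frequent(transactions, min_support):
    """Step 1: enumerate itemsets (up to size 2 here) meeting min_support."""
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items] + \
                 [frozenset(p) for p in combinations(items, 2)]
    n = len(transactions)
    return [c for c in candidates
            if sum(c <= set(t) for t in transactions) / n >= min_support]

def select_discriminative(patterns, transactions, labels, k=2):
    """Step 2: rank patterns by |support(pos) - support(neg)|, keep top k."""
    pos = [set(t) for t, y in zip(transactions, labels) if y == 1]
    neg = [set(t) for t, y in zip(transactions, labels) if y == 0]
    def gap(p):
        return abs(sum(p <= t for t in pos) / max(len(pos), 1)
                   - sum(p <= t for t in neg) / max(len(neg), 1))
    return sorted(patterns, key=gap, reverse=True)[:k]

def represent(transactions, patterns):
    """Step 3: binary feature vectors (pattern present or not)."""
    return [[int(p <= set(t)) for p in patterns] for t in transactions]

# Step 4 would feed these vectors to any classifier (ANN, DT, SVM, LR, ...).
transactions = [["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
labels = [1, 1, 0, 0]
patterns = select_discriminative(mine_frequent(transactions, 0.5),
                                 transactions, labels)
X = represent(transactions, patterns)
```

Note that steps 1 and 2 are entirely decoupled here: every frequent itemset is materialized before any selection happens, which is precisely the scalability problem the batch method runs into.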

(Figure: decision tree on the iris data: if Petal.Length < 2.45 then setosa; else if Petal.Width < 1.75 then versicolor, otherwise virginica.)

Any classifiers you can name: ANN, DT, SVM, LR

Basic Flows: Problems of Separated Mine & Select in Batch Method

1. Mine step: issues of scalability and combinatorial explosion

• Dilemma of setting the minimum support threshold

• Promising discriminative candidate patterns?

• Tremendous number of candidate patterns?

2. Select step: issue of discriminative power
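The min-support dilemma shows up even at toy scale: loosening the threshold makes the candidate count blow up combinatorially, while tightening it discards all but a few patterns, including potentially discriminative rare ones. A hypothetical illustration (the dataset and thresholds are invented for the example):

```python
from itertools import combinations

def count_frequent(transactions, min_support):
    """Exhaustively count itemsets of every size meeting min_support."""
    items = sorted({i for t in transactions for i in t})
    sets = [set(t) for t in transactions]
    n = len(sets)
    total = 0
    for size in range(1, len(items) + 1):
        for cand in combinations(items, size):
            if sum(set(cand) <= s for s in sets) / n >= min_support:
                total += 1
    return total

# Toy data: 8 transactions containing all 10 items, plus 2 tiny ones.
transactions = [list(range(10))] * 8 + [[0], [1]]

high = count_frequent(transactions, 0.9)  # strict: only 2 patterns survive
low = count_frequent(transactions, 0.5)   # loose: all 2**10 - 1 = 1023 subsets
```

With ten items the loose threshold already admits every nonempty subset; with the dozens or hundreds of items in the UCI datasets below, exhaustive enumeration at low support is hopeless, which motivates mining and selecting together.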

• 5 Datasets: UCI Machine Learning Repository

• Scalability Study:

(Charts: Log(DT #Pat) vs Log(MbT #Pat), and Log(DT AbsSupport) vs Log(MbT AbsSupport), on Adult, Chess, Hypo, Sick, Sonar.)

Dataset   #Pat using MbT sup   Ratio (MbT #Pat / #Pat using MbT sup)
Adult     252809               0.41%
Chess     +∞                   ~0%
Hypo      423439               0.0035%
Sick      4818391              0.00032%
Sonar     95507                0.00775%

Itemset Mining

• Accuracy of mined itemsets

(Chart: DT Accuracy vs MbT Accuracy on Adult, Chess, Hypo, Sick, Sonar; accuracies range from 70% to 100%.)

Graph Mining

• 11 datasets:

• 9 NCI anti-cancer screen datasets (PubChem Project; positive class: 1% to 8.3%)

• 2 AIDS anti-viral screen datasets (URL: http://dtp.nci.nih.gov; H1: 3.5%, H2: 1%)

• Scalability Study

(Charts: DT #Pat vs MbT #Pat, ranging 0 to 1800, and Log(DT Abs Support) vs Log(MbT Abs Support), on NCI1, NCI33, NCI41, NCI47, NCI81, NCI83, NCI109, NCI123, NCI145, H1, H2.)

• Predictive Quality of Mined Frequent Subgraphs

(Charts: accuracy of DT vs MbT, ranging 0.5 to 0.8, and AUC of DT vs MbT, ranging 0.88 to 1.0, on NCI1, NCI33, NCI41, NCI47, NCI81, NCI83, NCI109, NCI123, NCI145, H1, H2.)

• MbT vs Benchmarks

• Case Study

(Outline: Motivation, Problems, Proposed Algorithm, Experiments)

(Diagram: the dataset is recursively partitioned into a search tree: node 1 at the root, children 2 and 5, leaves 3, 4, 6, 7; branches left with few data terminate.)

Divide-and-Conquer Based Frequent Pattern Mining

Each node performs its own "mine & select" on its partition of the data.

Mined Discriminative Patterns: 1, 2, 3, 4, 5, 6, 7

1. Mine and Select most discriminative patterns;

2. Represent data in the feature space using such patterns;

3. Build classification models.

      F1 F2 F4
Data1  1  1  0
Data2  1  0  1
Data3  1  1  0
Data4  0  0  1
………

represent

(Figure: the same iris decision tree as above.)

Any classifiers you can name: ANN, DT, SVM, LR

Direct Mining & Selection via Model-based Search Tree

The procedure serves as a feature miner, or can itself be used as a classifier.
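The divide-and-conquer idea can be sketched recursively: each node mines and selects on its own (shrinking) partition, keeps the single most discriminative pattern, splits the data by pattern presence, and recurses until a node is pure or too small. This is a simplified sketch of the idea, not the authors' code: pattern mining is reduced to single items, and the selection criterion is a plain support gap rather than information gain.

```python
def best_pattern(transactions, labels):
    """Stand-in for 'mine & select' at one node: pick the single item
    whose presence best separates the classes by support gap."""
    items = sorted({i for t in transactions for i in t})
    pos = [set(t) for t, y in zip(transactions, labels) if y == 1]
    neg = [set(t) for t, y in zip(transactions, labels) if y == 0]
    def gap(i):
        return abs(sum(i in t for t in pos) / max(len(pos), 1)
                   - sum(i in t for t in neg) / max(len(neg), 1))
    return max(items, key=gap)

def build_mbt(transactions, labels, min_size=2, depth=0, out=None):
    """Model-based search tree sketch: mine & select locally, split,
    recurse. Returns (depth, pattern) pairs selected along the way."""
    if out is None:
        out = []
    if len(transactions) < min_size or len(set(labels)) < 2:
        return out  # node too small or already pure: stop
    p = best_pattern(transactions, labels)
    out.append((depth, p))
    has = [(t, y) for t, y in zip(transactions, labels) if p in t]
    not_has = [(t, y) for t, y in zip(transactions, labels) if p not in t]
    for branch in (has, not_has):
        if branch and len(branch) < len(transactions):  # skip no-op splits
            ts, ys = zip(*branch)
            build_mbt(list(ts), list(ys), min_size, depth + 1, out)
    return out

patterns = build_mbt([["a", "b"], ["a", "c"], ["b", "c"], ["c"]],
                     [1, 1, 0, 0])
```

Because each recursive call mines only the data reaching that node, candidate enumeration never happens over the full dataset at once, which is the source of the "scale down" ratio analyzed below.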

Analyses:

1. Scalability of pattern enumeration

• Upper bound

• "Scale down" ratio

2. Bound on number of returned features

3. Subspace pattern selection

4. Non-overfitting

5. Optimality under exhaustive search

Take Home Message:

1. Highly compact and discriminative frequent patterns can be mined directly through the Model-based Search Tree without worrying about combinatorial explosion.

2. Software and datasets are available by contacting the authors.