data not in the pre-defined feature vectors that can be used to construct predictive models
Post on 31-Dec-2015
• Data not in the pre-defined feature vectors that can be used to construct predictive models.
Applications:
• Transactional database
• Sequence database
• Graph database
Frequent pattern is a good candidate for discriminative features, especially for data of complicated structures.
Motivation:
Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree
Wei Fan, Kun Zhang, Hong Cheng, Jing Gao, Xifeng Yan, Jiawei Han, Philip S. Yu, Olivier Verscheure
Why Frequent Patterns?
• A non-linear conjunctive combination of single features
• Increase the expressive and discriminative power of the feature space
Examples:
• Exclusive OR problem & Solution
X Y C
0 0 0
0 1 1
1 0 1
1 1 0
[Figure: the four XOR points, labeled by class (0 or 1), plotted in the (x, y) plane with two lines, L1 and L2]
Data is not linearly separable in (x, y)
X Y XY C
0 0 0 0
0 1 0 1
1 0 0 1
1 1 1 0
mine & transform
Data is linearly separable in (x, y, xy)
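The XOR tables above can be checked with a minimal sketch. The weight grid `GRID`, the helper `separates`, and the hand-picked rule `x + y - 2xy - 0.5 > 0` are illustrative assumptions, not part of the slides. The grid search only shows that no rule on this coarse grid fits XOR in (x, y), while one linear rule is enough once the conjunctive feature xy is added.

```python
from itertools import product

# XOR truth table from the slide: (x, y) -> class C
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]


def separates(weights, bias, rows):
    """True if the rule (w . features + bias > 0) reproduces every label."""
    return all(
        (sum(w * f for w, f in zip(weights, feats)) + bias > 0) == bool(label)
        for feats, label in rows
    )


# Coarse grid of candidate weights and biases (an illustrative choice).
GRID = [i / 2 for i in range(-6, 7)]  # -3.0, -2.5, ..., 3.0

# 1) Original space (x, y): try every (w1, w2, b) on the grid.
found_2d = any(
    separates((w1, w2), b, XOR) for w1, w2, b in product(GRID, repeat=3)
)

# 2) Add the conjunctive feature xy, i.e. the pattern "x AND y".
XOR_3D = [((x, y, x * y), c) for (x, y), c in XOR]
found_3d = separates((1, 1, -2), -0.5, XOR_3D)  # x + y - 2xy - 0.5 > 0

print("linear rule found in (x, y):    ", found_2d)
print("linear rule found in (x, y, xy):", found_3d)
```

The product feature xy plays the role of a frequent pattern: it is a conjunction of single features that a linear model cannot express on its own.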
[Figure: 3D Projection Using XY. Axes x, y, and xy, each from 0.0 to 1.0; points labeled by class (0 or 1). Annotation: map data to higher space]
Conventional Frequent Pattern-Based Classification: Two-Step Batch Method
1. Mine frequent patterns;
2. Select most discriminative patterns;
3. Represent data in the feature space using such patterns;
4. Build classification models.
        F1  F2  F4
Data1   1   1   0
Data2   1   0   1
Data3   1   1   0
Data4   0   0   1
………
[Figure: batch pipeline. DataSet -> mine -> Frequent Patterns 1, 2, 3, 4, 5, 6, 7 -> select -> Mined Discriminative Patterns 1, 2, 4 -> represent -> feature table above -> classifier. Example classifier: a decision tree splitting on Petal.Length < 2.45 and Petal.Width < 1.75, with leaves setosa, versicolor, virginica]
Any classifiers you can name: ANN, DT, SVM, LR
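The batch steps listed above can be sketched in a few lines. This is a toy illustration under stated assumptions: the transactions in `DATA` are made up, patterns are itemsets enumerated by brute force, and information gain stands in for "most discriminative" (the slides do not name the selection measure). Step 4 is left out because any classifier can be trained on the resulting feature table.

```python
from itertools import combinations
from math import log2

# Hypothetical transactions (item sets) with binary class labels.
DATA = [
    ({"a", "b", "c"}, 1),
    ({"a", "b"}, 1),
    ({"a", "c"}, 0),
    ({"b", "c"}, 0),
    ({"a", "b", "d"}, 1),
    ({"c", "d"}, 0),
]


def entropy(labels):
    """Binary entropy of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)


def info_gain(pattern, data):
    """Entropy reduction from splitting data on 'contains pattern'."""
    inside = [y for t, y in data if pattern <= t]
    outside = [y for t, y in data if not pattern <= t]
    n = len(data)
    split = len(inside) / n * entropy(inside) + len(outside) / n * entropy(outside)
    return entropy([y for _, y in data]) - split


def mine_frequent(data, min_support, max_len=3):
    """Step 1: enumerate every itemset whose support meets min_support."""
    items = sorted(set().union(*(t for t, _ in data)))
    frequent = []
    for k in range(1, max_len + 1):
        for combo in combinations(items, k):
            pattern = frozenset(combo)
            support = sum(pattern <= t for t, _ in data) / len(data)
            if support >= min_support:
                frequent.append(pattern)
    return frequent


def select_top(patterns, data, k):
    """Step 2: keep the k patterns with the highest information gain."""
    return sorted(patterns, key=lambda p: info_gain(p, data), reverse=True)[:k]


def represent(data, patterns):
    """Step 3: one binary feature per selected pattern."""
    return [[int(p <= t) for p in patterns] for t, _ in data]


frequent = mine_frequent(DATA, min_support=0.3)
selected = select_top(frequent, DATA, k=2)
features = represent(DATA, selected)
print(len(frequent), "frequent patterns;", "selected:", [sorted(p) for p in selected])
```

Note that the mine step has to enumerate every frequent itemset before the select step sees any of them, which is the separation the following slides criticize.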
Basic Flaws: Problems of Separated Mine & Select in Batch Method
1. Mine step: Issues of scalability and combinatorial explosion
• Dilemma of setting minsupport
• Promising discriminative candidate patterns?
• Tremendous number of candidate patterns?
2. Select step: Issue of discriminative power
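The minsupport dilemma can be seen on synthetic data. The sketch below is an assumption-laden toy: 200 random transactions over 12 hypothetical items, brute-force enumeration up to length 4, and an arbitrary list of thresholds. It only illustrates the trend that lowering minsupport admits many more candidate patterns.

```python
from itertools import combinations
import random

random.seed(0)
ITEMS = list(range(12))
# Hypothetical transactions: each keeps every item with probability 0.5.
TRANSACTIONS = [
    frozenset(i for i in ITEMS if random.random() < 0.5) for _ in range(200)
]


def count_frequent(transactions, min_support, max_len=4):
    """Count itemsets (up to max_len items) whose support >= min_support."""
    n = len(transactions)
    count = 0
    for k in range(1, max_len + 1):
        for combo in combinations(ITEMS, k):
            pattern = frozenset(combo)
            support = sum(pattern <= t for t in transactions) / n
            if support >= min_support:
                count += 1
    return count


counts = {s: count_frequent(TRANSACTIONS, s) for s in (0.4, 0.2, 0.1, 0.05)}
for s, c in counts.items():
    print(f"min_support={s:<5} frequent patterns={c}")
print("all non-empty itemsets over 12 items:", 2 ** len(ITEMS) - 1)
```

A high threshold keeps the candidate set small but may miss discriminative patterns with low support; a low threshold finds them only at the cost of a much larger candidate set.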
• 5 Datasets: UCI Machine Learning Repository
• Scalability Study:
[Figure: Log(DT #Pat) vs. Log(MbT #Pat) for Adult, Chess, Hypo, Sick, Sonar; axis from 0 to 4]
[Figure: Log(DT Abs Support) vs. Log(MbT Abs Support) for Adult, Chess, Hypo, Sick, Sonar; axis from 0 to 4]
Datasets   #Pat using MbT sup   Ratio (MbT #Pat / #Pat using MbT sup)
Adult      252809               0.41%
Chess      +∞                   ~0%
Hypo       423439               0.0035%
Sick       4818391              0.00032%
Sonar      95507                0.00775%
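The ratio column can be read as a rough count of patterns returned by MbT: multiply the number of patterns found at the MbT support level by the ratio. The snippet below does only that arithmetic on the table values; the helper name `implied_mbt_patterns` is hypothetical, and because the percentages are rounded the results are estimates. Chess is skipped because its pattern count is listed as +∞.

```python
# Values copied from the table: (#Pat using MbT sup, ratio in percent).
TABLE = {
    "Adult": (252809, 0.41),
    "Hypo": (423439, 0.0035),
    "Sick": (4818391, 0.00032),
    "Sonar": (95507, 0.00775),
}


def implied_mbt_patterns(n_patterns, ratio_percent):
    """Approximate MbT #Pat = (#Pat using MbT sup) * ratio."""
    return n_patterns * ratio_percent / 100.0


implied = {name: implied_mbt_patterns(n, r) for name, (n, r) in TABLE.items()}
for name, value in implied.items():
    print(f"{name:<6} approx. MbT #Pat = {value:.1f}")
```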
Itemset Mining
• Accuracy of Mined Itemsets
[Figure: DT Accuracy vs. MbT Accuracy for Adult, Chess, Hypo, Sick, Sonar; axis from 70% to 100%]
Graph Mining
• 11 Datasets:
• 9 NCI anti-cancer screen datasets
• PubChem Project.
• Positive class: 1% - 8.3%
• 2 AIDS anti-viral screen datasets
• URL: http://dtp.nci.nih.gov.
• H1: 3.5%, H2: 1%
• Scalability Study
[Figure: DT #Pat vs. MbT #Pat for NCI1, NCI33, NCI41, NCI47, NCI81, NCI83, NCI109, NCI123, NCI145, H1, H2; axis from 0 to 1800]
[Figure: Log(DT Abs Support) vs. Log(MbT Abs Support) for the same 11 datasets; axis from 0 to 4]
• Predictive Quality of Mined Frequent Subgraphs
[Figure: Accuracy of DT and MbT for NCI1, NCI33, NCI41, NCI47, NCI81, NCI83, NCI109, NCI123, NCI145, H1, H2; axis from 0.5 to 0.8]
[Figure: AUC of MbT, DT for the same 11 datasets; axis from 0.88 to 1]
[Figure: MbT VS Benchmarks]
• Case Study
Motivation
Problems
Proposed Algorithm
Experiments
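Divide-and-Conquer Based Frequent Pattern Mining
[Figure: search tree over the dataset with nodes numbered 1 to 7 on three levels (1; 2, 5; 3, 4, 6, 7). "mine & select" is marked along the tree, and branches end at nodes labeled "Few Data". The patterns chosen at the nodes are collected as Mined Discriminative Patterns 1, 2, 3, 4, 5, 6, 7]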
1. Mine and Select most discriminative patterns;
2. Represent data in the feature space using such patterns;
3. Build classification models.
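        F1  F2  F4
Data1   1   1   0
Data2   1   0   1
Data3   1   1   0
Data4   0   0   1
………
[Figure: the selected patterns are used to represent the data as the feature table above, which feeds a classifier. Example classifier: a decision tree splitting on Petal.Length < 2.45 and Petal.Width < 1.75, with leaves setosa, versicolor, virginica]
Any classifiers you can name: ANN, DT, SVM, LR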
Direct Mining & Selection via Model-based Search Tree
The procedure can serve as a feature miner or act as a classifier itself
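A minimal sketch of the divide-and-conquer idea, used here as a feature miner. It rests on several assumptions that the slides do not spell out: patterns are itemsets enumerated by brute force, information gain picks the pattern at each node, support is measured on the node's own data, and expansion stops when a node is pure or holds fewer than `min_node_size` examples. The data and all function names are hypothetical.

```python
from itertools import combinations
from math import log2

# Hypothetical transactions (item sets) with binary class labels.
DATA = [
    ({"a", "b"}, 1), ({"a", "b"}, 1), ({"a", "d"}, 1), ({"a", "b", "d"}, 1),
    ({"a", "c"}, 0), ({"b"}, 0), ({"d"}, 0), ({"b", "d"}, 0),
    ({"e"}, 1), ({"b", "c"}, 0),
]


def entropy(labels):
    """Binary entropy of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)


def info_gain(pattern, data):
    """Entropy reduction from splitting data on 'contains pattern'."""
    inside = [y for t, y in data if pattern <= t]
    outside = [y for t, y in data if not pattern <= t]
    n = len(data)
    split = len(inside) / n * entropy(inside) + len(outside) / n * entropy(outside)
    return entropy([y for _, y in data]) - split


def best_pattern(data, min_support, max_len=2):
    """Mine itemsets frequent in THIS node's data; return the best one, or None."""
    items = sorted(set().union(*(t for t, _ in data)))
    best, best_gain = None, 0.0
    for k in range(1, max_len + 1):
        for combo in combinations(items, k):
            pattern = frozenset(combo)
            if sum(pattern <= t for t, _ in data) / len(data) < min_support:
                continue
            gain = info_gain(pattern, data)
            if gain > best_gain + 1e-12:  # keep the first pattern on ties
                best, best_gain = pattern, gain
    return best


def build_tree(data, min_support=0.2, min_node_size=2, selected=None):
    """Mine & select one pattern per node, split on it, and recurse."""
    if selected is None:
        selected = []
    if len(data) < min_node_size or len({y for _, y in data}) < 2:
        return selected  # few data or pure node: stop expanding
    pattern = best_pattern(data, min_support)
    if pattern is None:
        return selected
    selected.append(pattern)
    build_tree([d for d in data if pattern <= d[0]], min_support, min_node_size, selected)
    build_tree([d for d in data if not pattern <= d[0]], min_support, min_node_size, selected)
    return selected


selected = build_tree(DATA)
for p in selected:
    support = sum(p <= t for t, _ in DATA) / len(DATA)
    print(sorted(p), f"global support = {support:.0%}")
```

Because the support threshold is applied to each node's shrinking subset, patterns chosen deep in the tree can have low support over the whole dataset without the full low-support pattern set ever being enumerated. Keeping the splits as a tree instead of only collecting `selected` would let the same procedure predict labels directly.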
Analyses:
1. Scalability of pattern enumeration
• Upper bound
• “Scale down” ratio
2. Bound on number of returned features
3. Subspace pattern selection
4. Non-overfitting
5. Optimality under exhaustive search
Take Home Message:
1. Highly compact and discriminative frequent patterns can be mined directly through a Model-based Search Tree without worrying about combinatorial explosion.
2. Software and datasets are available by contacting the authors.