Boosted Augmented Naive Bayes: Efficient discriminative learning of Bayesian network classifiers

DESCRIPTION

Boosted Augmented Naive Bayes: Efficient discriminative learning of Bayesian network classifiers. Yushi Jing, GVU, College of Computing, Georgia Institute of Technology; Vladimir Pavlović, Department of Computer Science, Rutgers University; James M. Rehg, GVU, College of Computing, Georgia Institute of Technology. PowerPoint PPT presentation.

TRANSCRIPT
Slide 1

Boosted Augmented Naive Bayes: Efficient discriminative learning of Bayesian network classifiers

Yushi Jing, GVU, College of Computing, Georgia Institute of Technology
Vladimir Pavlović, Department of Computer Science, Rutgers University
James M. Rehg, GVU, College of Computing, Georgia Institute of Technology
Slide 2

Contributions

1. Boosting approach to Bayesian network classification
o Additive combination of simple models (e.g. Naïve Bayes)
o Weighted maximum likelihood learning
o Generalizes Boosted Naïve Bayes (Elkan 1997)
o Comprehensive experimental evaluation of BNB

2. Boosted Augmented Naïve Bayes (BAN)
o Efficient training algorithm
o Competitive classification accuracy vs. Naïve Bayes, TAN, BNC (2004), ELR (2001)
Slide 3

Bayesian networks: a modular and intuitive graphical representation with explicit probabilistic semantics.

Bayesian network classifiers: model the joint distribution over the features and the class label, then classify via the conditional distribution of the class label.

Question: how can we efficiently train a Bayesian network discriminatively to improve its classification accuracy?
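The joint-to-conditional step can be made concrete with a minimal sketch (assuming binary features and a two-class Naïve Bayes model; the function name and toy parameters are illustrative, not from the paper):

```python
import numpy as np

def naive_bayes_posterior(prior, cond, x):
    """P(y | x) for a naive Bayes model over binary features.

    prior: shape (2,), class prior P(y)
    cond:  shape (2, n_features), cond[y, j] = P(x_j = 1 | y)
    x:     binary feature vector of length n_features
    """
    x = np.asarray(x)
    # joint P(y, x) = P(y) * prod_j P(x_j | y)
    lik = np.prod(np.where(x == 1, cond, 1.0 - cond), axis=1)
    joint = prior * lik
    return joint / joint.sum()    # Bayes rule: normalize over y

# toy model: class y=1 makes both features likely
prior = np.array([0.5, 0.5])
cond = np.array([[0.2, 0.3],
                 [0.9, 0.8]])
print(naive_bayes_posterior(prior, cond, [1, 1]))
```

The model is generative (it defines the joint P(y, x)), but classification only ever uses the normalized conditional, which is why discriminative training targets that quantity directly.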
Slide 4

Parameter Learning

Maximum likelihood (ML) parameter learning is efficient and maximizes the LL_G score, which decomposes into a conditional term and a marginal term:

LL_G = \sum_{i=1}^{M} \log P(y^i \mid x^i) + \sum_{i=1}^{M} \log P(x^i)

The first sum is the conditional log-likelihood CLL_G, the quantity that matters for classification. There is no analytic solution for the parameters that maximize CLL_G.
Slide 5

Model selection

A: ML does not optimize CLL_A; ELR optimizes CLL_A directly (Greiner and Zhou, 2002).

B: ML optimizes CLL_B when the structure B is optimal; the BNC algorithm searches for such an optimal structure (Grossman and Domingos, 2004).

ELR and BNC achieve excellent classification accuracy but are computationally expensive to train.

C: An ensemble of sparse models as an alternative to B, using ML to train each sparse model.
Slide 6

Talk outline
o Minimization function for Boosted Bayesian networks
o Empirical evaluation of Boosted Naïve Bayes
o Boosted Augmented Naïve Bayes (BAN)
o Empirical evaluation of BAN

Our goal:
o Combine parameter and structure optimization
o Avoid over-fitting
o Retain training efficiency
Slide 7

Exponential Loss Function (ELF)

A boosted Bayesian network classifier minimizes the ELF. With labels y ∈ {-1, +1}, the ensemble F defines

P_F(y \mid x) = \frac{1}{1 + \exp\{-2 y F(x)\}}

and the loss is

ELF_F = \sum_{i=1}^{M} \exp\left\{ \frac{1}{2} \log \frac{1 - P_F(y^i \mid x^i)}{P_F(y^i \mid x^i)} \right\} = \sum_{i=1}^{M} \exp\{-y^i F(x^i)\}

where F(x) = \sum_{k=1}^{K} \lambda_k f_k(x). ELF_F is an upper bound of -CLL_F.
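The upper-bound claim follows from a one-line calculus check (a sketch using the definitions above, with u_i = e^{-y^i F(x^i)}):

```latex
-\mathrm{CLL}_F
  = -\sum_{i=1}^{M} \log P_F(y^i \mid x^i)
  = \sum_{i=1}^{M} \log\!\left(1 + e^{-2 y^i F(x^i)}\right)
  \le \sum_{i=1}^{M} e^{-y^i F(x^i)} = \mathrm{ELF}_F,
```

since g(u) = u - \log(1 + u^2) satisfies g(0) = 0 and g'(u) = \frac{(1-u)^2}{1+u^2} \ge 0, so \log(1 + u_i^2) \le u_i for every u_i \ge 0.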
Slide 8

Minimizing ELF via an ensemble method

AdaBoost (population version) constructs F(x) = \sum_{k=1}^{K} \lambda_k f_k(x) additively from base models f_1(x), f_2(x), f_3(x), ..., approximately minimizing ELF_F:
o Discriminatively updates the data weights
o Uses tractable ML learning to train the parameters of each base model
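A minimal runnable sketch of this loop (assuming binary features and two classes; `train_weighted_nb`, the Laplace smoothing, and the discrete-AdaBoost reweighting are illustrative choices, not the paper's exact procedure):

```python
import numpy as np

def train_weighted_nb(X, y, w, alpha=1.0):
    """Weighted ML (Laplace-smoothed) estimates for a binary-feature,
    two-class naive Bayes model: class prior and P(x_j = 1 | y)."""
    prior = np.array([w[y == c].sum() for c in (0, 1)])
    prior = prior / prior.sum()
    cond = np.vstack([
        (w[y == c] @ X[y == c] + alpha) / (w[y == c].sum() + 2 * alpha)
        for c in (0, 1)
    ])
    return prior, cond

def nb_predict(prior, cond, X):
    """Class posteriors under naive Bayes, vectorized over rows of X."""
    logp = np.log(prior) + X @ np.log(cond).T + (1 - X) @ np.log(1 - cond).T
    logp = logp - logp.max(axis=1, keepdims=True)
    p = np.exp(logp)
    return p / p.sum(axis=1, keepdims=True)

def boosted_nb(X, y, T=10):
    """Discrete AdaBoost with weighted-ML naive Bayes base classifiers."""
    M = len(y)
    w = np.full(M, 1.0 / M)
    ensemble = []
    for _ in range(T):
        model = train_weighted_nb(X, y, w)
        pred = nb_predict(*model, X).argmax(axis=1)
        err = w[pred != y].sum()
        if err == 0:                    # perfect base model
            ensemble.append((1.0, model))
            break
        if err >= 0.5:                  # no better than chance: stop
            break
        lam = 0.5 * np.log((1 - err) / err)
        ensemble.append((lam, model))
        # discriminative reweighting: up-weight the mistakes
        w = w * np.exp(lam * np.where(pred != y, 1.0, -1.0))
        w = w / w.sum()
    return ensemble

def ensemble_predict(ensemble, X):
    """Weighted vote F(x) = sum_k lam_k f_k(x) with f_k(x) in {-1, +1}."""
    F = sum(lam * (2 * nb_predict(*m, X).argmax(axis=1) - 1)
            for lam, m in ensemble)
    return (F > 0).astype(int)
```

Each round solves only a tractable weighted ML problem; the discriminative signal enters entirely through the data weights, which is what keeps BNB training close to the cost of plain NB training.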
Slide 9

Results: 25 UCI datasets (BNB)

BNB vs. NB: average test error 0.151 vs. 0.173 (significant wins: BNB 10, NB 2; not significant: 13)
Slide 10

Results: 25 UCI datasets (BNB)

BNB vs. NB:     0.151 vs. 0.173 (BNB 10, NB 2, not significant 13)
BNB vs. TAN:    0.151 vs. 0.184 (BNB 9, TAN 2, not significant 14)
BNB vs. ELR-NB: 0.151 vs. 0.161 (BNB 5*, ELR-NB 4*, not significant 16)
BNB vs. BNC-2P: 0.151 vs. 0.164 (BNB 7, BNC-2P 3, not significant 15)
Slide 11

Evaluation of BNB

Computationally efficient: O(MNT) training with T = 5-20, compared with O(MN) for Naïve Bayes.

Good classification accuracy: outperforms NB and TAN; competitive with ELR and BNC. Sparse structure + boosting = competitive accuracy.

Potential drawback: strongly correlated features (e.g. the Corral dataset).
Slide 12

Structure Learning

Challenges:
o Efficiency: structure learning is NP-hard; K2 and hill-climbing search still examine a polynomial number of structures
o Resisting overfitting: structure controls classifier capacity

Our proposed solution:
o Combine sparse models to form an ensemble
o Constrain edge selection
Slide 13

Creating G_tree, Step 1 (Friedman et al. 1999):
1. Build the pairwise conditional mutual information table
2. Create a maximum spanning tree using conditional mutual information as the edge weights
3. Convert the undirected tree into a directed graph G_tree

[Figure: example G_tree over features 1-4]
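The three steps above can be sketched as follows (a minimal illustration assuming discrete features; the function names are ours, and Prim's construction is used so that growing the tree outward from feature 0 directs each edge parent-to-child automatically):

```python
import numpy as np
from itertools import combinations

def cond_mutual_info(xi, xj, y):
    """Empirical conditional mutual information I(Xi; Xj | Y)
    for discrete-valued arrays xi, xj, y."""
    n = len(y)
    mi = 0.0
    for c in set(y):
        nc = np.sum(y == c)
        for a in set(xi):
            na = np.sum((xi == a) & (y == c))
            for b in set(xj):
                nb = np.sum((xj == b) & (y == c))
                nab = np.sum((xi == a) & (xj == b) & (y == c))
                if nab and na and nb:
                    # p(a,b,c) * log[ p(a,b|c) / (p(a|c) p(b|c)) ]
                    mi += (nab / n) * np.log(nab * nc / (na * nb))
    return mi

def chow_liu_tree(X, y):
    """Maximum spanning tree over the features with CMI edge weights
    (Prim's algorithm), directed away from feature 0."""
    n = X.shape[1]
    w = {}
    for i, j in combinations(range(n), 2):   # step 1: pairwise CMI table
        w[i, j] = w[j, i] = cond_mutual_info(X[:, i], X[:, j], y)
    in_tree, edges = {0}, []
    while len(in_tree) < n:                  # step 2: maximum spanning tree
        i, j = max(((i, j) for i in in_tree
                    for j in range(n) if j not in in_tree),
                   key=lambda e: w[e])
        edges.append((i, j))                 # step 3: edge is directed i -> j
        in_tree.add(j)
    return edges
```

Because Prim's algorithm always connects a new feature to the partially built tree, each selected edge already points from a tree node to a fresh node, which is one simple way to realize the undirected-to-directed conversion.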
Slide 14

Initial structure

1. Select Naïve Bayes as the initial structure G_BAN
2. Create BNB via AdaBoost
3. Evaluate BNB

[Figure: G_BAN alongside the candidate edges of G_tree over features 1-4]
Slide 15

Iteratively adding edges

For each candidate edge from G_tree, retrain the boosted ensemble on the augmented structure G_BAN and compute its ensemble CLL:
o Ensemble CLL = -0.65
o Ensemble CLL = -0.75
o Ensemble CLL = -0.50
o Ensemble CLL = -0.55?

Keep the candidate structure with the best ensemble CLL (here -0.50).
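The selection loop can be sketched generically; here `score` is a hypothetical stand-in for "train the boosted ensemble on this candidate structure and return its ensemble CLL":

```python
def greedy_edge_search(candidate_edges, score):
    """Greedy forward selection over candidate tree edges: repeatedly add
    the edge that most improves the ensemble CLL score, and stop as soon
    as no remaining edge improves it."""
    chosen = []
    best = score(chosen)       # CLL of the initial (Naive Bayes) ensemble
    remaining = list(candidate_edges)
    while remaining:
        scored = [(score(chosen + [e]), e) for e in remaining]
        s, e = max(scored)
        if s <= best:
            break              # no candidate improves the ensemble CLL: stop
        best = s
        chosen.append(e)
        remaining.remove(e)
    return chosen, best
```

Stopping at the first non-improving round is what keeps the number of augmented edges small (0-5 in the experiments) and acts as the capacity control discussed on the structure-learning slide.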
Slide 16

Final BAN structure

The BAN classifier is the ensemble produced from the final structure G_BAN.
Slide 17

Analysis of BAN
o The base structure is sparser than the BNC model
o BAN uses an ensemble of sparser models to approximate a densely connected structure

[Figure: example BAN model vs. example BNC-2P model, with class node Y and feature nodes X]
Slide 18

Computational complexity of BAN

Training complexity: O(MN^2 + MNTS)
o O(MN^2): building G_tree
o O(MNTS): structure search, where T is the number of boosting iterations per structure and S is the number of structures examined (S < N)

Empirical training time: T = 5-25, S = 0-5; approximately 25-100 times the training time of NB.
Slide 19

Results (simulated datasets):

25 distributions, differing in their CPT tables and number of features; 4000 samples, 5-fold cross-validation.

[Figure: true structure vs. Naïve Bayes structure, class node Y with feature nodes X]
Slide 20

Results (simulated datasets): BAN vs. NB

Significant wins: BAN 19, NB 0; not significant: 6.
Slide 21

Results (simulated datasets): BAN vs. BNB

Significant wins: BAN 3, BNB 0; not significant: 22.
o BNB achieved the optimal error on 22 of the 25 datasets
o BAN outperforms BNB on the remaining 3
o The edges added under BAN match the true structure
Slide 22

Results: 25 UCI datasets (BAN)

Standard datasets for Bayesian network classifiers (Friedman et al. 1999; Greiner and Zhou 2002; Grossman and Domingos 2004).

5-fold cross-validation. We implemented NB, TAN, BAN, BNB, and BNC-2P, and obtained published results for ELR-NB and ELR-TAN.
Slide 23

Results: BAN vs. standard methods

BAN vs. NB:  0.141 vs. 0.173 (BAN 10, NB 2, not significant 13)
BAN vs. TAN: 0.141 vs. 0.184 (BAN 10, TAN 2, not significant 13)
Slide 24

Results: BAN vs. structure learning

BAN vs. BNC-2P: 0.141 vs. 0.164 (significant wins: BAN 7, BNC 1)

BAN contains 0-5 augmented edges; BNC-2P contains 4-16 augmented edges.
Slide 25

Results: BAN vs. ELR

BAN vs. ELR-TAN: 0.141 vs. 0.155 (BAN 8*, ELR-TAN 4*, not significant 13)
BAN vs. ELR-NB:  0.141 vs. 0.161 (BAN 6*, ELR-NB 5*, not significant 14)

* Error statistics taken directly from published results.
BAN is more efficient to train.
Slide 26

Evaluation of BAN vs. BNB

BAN vs. BNB: average testing error 0.141 vs. 0.151 (BAN better on 16 datasets, BNB better on 6).

Under significance testing:
o BAN outperforms BNB (7), e.g. Corral, by 2%-5%
o BNB outperforms BAN (2), by 0.5%-2%
o Not significant (13): on these, BAN chooses BNB as its base structure (e.g. IRIS, MOFN)
Slide 27

Conclusion

An ensemble of sparse models as an alternative to joint structure and parameter optimization:
o Simple to implement
o Very efficient in training
o Competitive classification accuracy vs. NB, TAN, HGC, BNC, and ELR
Slide 28

Future Work
o Extend BAN to handle sequential data
o Analyze the class of Bayesian network classifiers that can be approximated with an ensemble of sparse structures
o Can the BAN model parameters be obtained through parameter learning given the final model structure?
o Can we use the BAN approach to learn generative models?