When Classifier Selection meets Information Theory: A Unifying View
DESCRIPTION
Classifier selection aims to reduce the size of an ensemble of classifiers in order to improve its efficiency and classification accuracy. Recently, an information-theoretic view was presented for feature selection. It derives a space of possible selection criteria and shows that several feature selection criteria in the literature are points within this continuous space. The contribution of this paper is to export this information-theoretic view to solve an open issue in ensemble learning: classifier selection. We investigate several information-theoretic selection criteria that are used to rank classifiers.
TRANSCRIPT
Ensemble Learning Information Theory Ensemble Pruning meets Information Theory Experimental Evaluation Conclusion and Future Work
When Classifier Selection meets Information Theory: A Unifying View
Mohamed Abdel Hady, Friedhelm Schwenker, Günther Palm
Institute of Neural Information Processing, University of Ulm, Germany
December 8, 2010
Outline
1 Ensemble Learning
2 Ensemble Pruning
3 Information Theory
4 Ensemble Pruning meets Information Theory
5 Experimental Results
6 Conclusion and Future Work
Ensemble Learning
An ensemble is a set of accurate and diverse classifiers. The objective is that the ensemble outperforms its member classifiers.
[Figure: ensemble architecture. An input x is fed to the classifier layer h1, ..., hi, ..., hN; the member outputs h1(x), ..., hN(x) are fused by the combination layer g to produce g(x).]
Ensemble learning has become a hot topic in recent years.
Ensemble methods consist of two phases: the construction of multiple individual classifiers and their combination.
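The two phases can be illustrated with a minimal sketch (an illustration, not from the paper; the threshold classifiers are hypothetical stand-ins for trained members): individual classifiers are constructed, then their predictions are combined, here by plurality vote.

```python
from collections import Counter

def majority_vote(classifiers, x):
    """Combine the member predictions h_i(x) by plurality vote, a common choice for g."""
    votes = [h(x) for h in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical members: simple threshold classifiers on a scalar input.
ensemble = [lambda x: int(x > 0.3), lambda x: int(x > 0.5), lambda x: int(x > 0.7)]
print(majority_vote(ensemble, 0.6))  # 1 (two of the three members vote 1)
```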
Ensemble Learning
How to construct individual classifiers?
Ensemble Pruning
Recent work has considered an additional intermediate phase that deals with the reduction of the ensemble size before combination.
This phase has several names in the literature, such as ensemble pruning, selective ensemble, ensemble thinning and classifier selection.
Classifier selection is important for two reasons: classification accuracy and efficiency.
An ensemble may consist not only of accurate classifiers, but also of classifiers with lower accuracy. The main factor for an effective ensemble is to remove the poor-performing classifiers while maintaining good diversity among the ensemble members.
The second reason, efficiency, is equally important. Having a very large number of classifiers in an ensemble adds considerable computational overhead. For instance, decision trees may have large memory requirements, and lazy learning methods incur a considerable computational cost during the classification phase.
Information Theory
Entropy

H(X) = -\sum_{x \in X} p(X = x) \log_2 p(X = x)   (1)

Conditional Entropy

H(X|Y) = -\sum_{y \in Y} p(Y = y) \sum_{x \in X} p(X = x | Y = y) \log_2 p(X = x | Y = y)   (2)
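Eqs. (1) and (2) can be estimated from samples with a short plug-in sketch (an illustration, not the authors' code):

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy H(X) in bits, estimated from a sample (Eq. 1)."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def conditional_entropy(xs, ys):
    """Conditional entropy H(X|Y) in bits, estimated from paired samples (Eq. 2)."""
    n = len(ys)
    h = 0.0
    for y, cy in Counter(ys).items():
        sub = [x for x, yy in zip(xs, ys) if yy == y]
        h += (cy / n) * entropy(sub)  # p(Y=y) times the entropy of X within class y
    return h

# A balanced binary sample has one bit of entropy; observing Y = X removes it all.
print(entropy([0, 1, 0, 1]))                            # 1.0
print(conditional_entropy([0, 1, 0, 1], [0, 1, 0, 1]))  # 0.0
```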
Information Theory
Shannon Mutual Information

I(X; Y) = H(X) - H(X|Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 \frac{p(x, y)}{p(x) p(y)}   (3)

Shannon Conditional Mutual Information

I(X_1; X_2 | Y) = H(X_1|Y) - H(X_1|X_2, Y) = \sum_{y \in Y} p(y) \sum_{x_1 \in X_1} \sum_{x_2 \in X_2} p(x_1, x_2 | y) \log_2 \frac{p(x_1, x_2 | y)}{p(x_1|y) p(x_2|y)}   (4)
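Both quantities can be estimated from samples via joint entropies, as in this sketch (an illustration, not the authors' code):

```python
import math
from collections import Counter

def entropy(xs):
    """Plug-in Shannon entropy in bits."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), equivalent to Eq. (3)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def conditional_mutual_information(x1, x2, ys):
    """I(X1;X2|Y), Eq. (4): the p(y)-weighted average of I(X1;X2) within each class y."""
    n = len(ys)
    cmi = 0.0
    for y, cy in Counter(ys).items():
        idx = [i for i, yy in enumerate(ys) if yy == y]
        cmi += (cy / n) * mutual_information([x1[i] for i in idx],
                                             [x2[i] for i in idx])
    return cmi

# Identical variables: I(X;Y) = H(X) = 1 bit on a balanced binary sample.
print(mutual_information([0, 1, 0, 1], [0, 1, 0, 1]))  # 1.0
```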
Ensemble Pruning meets Information Theory
Information theory can provide a bound on the error probability, p(X_{1:N} \ne Y), for any combiner g. The error of predicting the target variable Y from the input X_{1:N} is bounded by two inequalities as follows,

\frac{H(Y) - I(X_{1:N}; Y) - 1}{\log(|Y|)} \le p(X_{1:N} \ne Y) \le \frac{1}{2} H(Y | X_{1:N})   (5)

I(X_{1:N}; Y) involves the high-dimensional probability distribution p(x_1, x_2, \ldots, x_N, y), which is hard to estimate. However, it can be decomposed into simpler terms.
Interaction Information
Shannon's Mutual Information I(X_1; X_2) is a function of two variables. It is not able to measure properties of multiple (N) variables.
McGill presented what is called Interaction Information as a multivariate generalization of Shannon's Mutual Information.
For instance, the Interaction Information between three random variables is

I(\{X_1, X_2, X_3\}) = I(X_1; X_2 | X_3) - I(X_1; X_2)   (6)

The general form for a set S of arbitrary size is defined recursively.

I(S \cup \{X\}) = I(S | X) - I(S)   (7)
W. McGill, "Multivariate information transmission," IEEE Trans. on Information Theory, vol. 4, no. 4, pp. 93–111, 1954.
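For intuition, the three-variable case of Eq. (6) can be computed from its joint-entropy expansion, as in this sketch (an illustration, not from the paper); XOR is the classic example of a purely synergistic interaction:

```python
import math
from collections import Counter

def H(*vars_):
    """Plug-in joint Shannon entropy (bits) of one or more aligned sample sequences."""
    sample = list(zip(*vars_))
    n = len(sample)
    return -sum((c / n) * math.log2(c / n) for c in Counter(sample).values())

def interaction_information(x1, x2, x3):
    """Three-way Interaction Information I(X1;X2|X3) - I(X1;X2) (Eq. 6),
    expanded into joint entropies:
    -[H(X1)+H(X2)+H(X3)] + [H(X1,X2)+H(X1,X3)+H(X2,X3)] - H(X1,X2,X3)."""
    return (-(H(x1) + H(x2) + H(x3))
            + H(x1, x2) + H(x1, x3) + H(x2, x3)
            - H(x1, x2, x3))

# XOR: x3 = x1 ^ x2. Pairwise independent, yet jointly fully determined:
# the interaction is +1 bit of synergy.
x1, x2 = [0, 0, 1, 1], [0, 1, 0, 1]
x3 = [a ^ b for a, b in zip(x1, x2)]
print(interaction_information(x1, x2, x3))  # 1.0
```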
Mutual Information Decomposition
Theorem
Given a set of classifiers S = \{X_1, \ldots, X_N\} and a target class label Y, the Shannon mutual information between X_{1:N} and Y can be decomposed into a sum of Interaction Information terms,

I(X_{1:N}; Y) = \sum_{T \subseteq S, |T| \ge 1} I(T \cup \{Y\})   (8)

For a set of classifiers S = \{X_1, X_2, X_3\}, the mutual information between the joint variable X_{1:3} and a target Y can be decomposed as

I(X_{1:3}; Y) = I(X_1; Y) + I(X_2; Y) + I(X_3; Y)
+ I(\{X_1, X_2, Y\}) + I(\{X_1, X_3, Y\}) + I(\{X_2, X_3, Y\})
+ I(\{X_1, X_2, X_3, Y\})

Each term can then be decomposed into a class-unconditional part I(T) and a class-conditional part I(T|Y) according to Eq. (6):

I(X_{1:3}; Y) = \sum_{i=1}^{3} I(X_i; Y) - \sum_{T \subseteq S, |T| \in \{2,3\}} I(T) + \sum_{T \subseteq S, |T| \in \{2,3\}} I(T|Y)
Mutual Information Decomposition (cont’d)
For an ensemble S of size N and according to Eq. (7),

I(X_{1:N}; Y) = \sum_{i=1}^{N} I(X_i; Y) - \sum_{T \subseteq S, 2 \le |T| \le N} I(T) + \sum_{T \subseteq S, 2 \le |T| \le N} I(T|Y)   (9)

We assume that there exist only pairwise unconditional and conditional interactions and omit higher-order terms.

I(X_{1:N}; Y) \approx \sum_{i=1}^{N} I(X_i; Y) - \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} I(X_i; X_j) + \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} I(X_i; X_j | Y)   (10)

G. Brown, "A new perspective for information theoretic feature selection," in Proc. of the 12th Int. Conf. on Artificial Intelligence and Statistics (AI-STATS 2009), 2009.
Classifier Selection Criterion
The objective of an information-theoretic classifier selection method is to select a subset S of K classifiers from a pool Ω of N classifiers, constructed by any ensemble learning algorithm, that carries as much information as possible about the target class, using a predefined selection criterion,

J(X_{u(j)}) = I(X_{1:k+1}; Y) - I(X_{1:k}; Y)
= I(X_{u(j)}; Y) - \sum_{i=1}^{k} I(X_{u(j)}; X_{v(i)}) + \sum_{i=1}^{k} I(X_{u(j)}; X_{v(i)} | Y)   (11)

That is, the difference in information after and before the addition of X_{u(j)} into S. This tells us that the best classifier is a trade-off between three components: the relevance of the classifier, the unconditional correlations, and the class-conditional correlations. In order to trade off these components, Eq. (11) [Brown, AI-STATS 2009] can be parameterized to define the root criterion,

J(X_{u(j)}) = I(X_{u(j)}; Y) - \beta \sum_{i=1}^{k} I(X_{u(j)}; X_{v(i)}) + \gamma \sum_{i=1}^{k} I(X_{u(j)}; X_{v(i)} | Y)   (12)
Classifier Selection Algorithm
1: Select the most relevant classifier, v(1) = arg max_{1 \le j \le N} I(X_j; Y)
2: S = \{X_{v(1)}\}
3: for k = 1 : K - 1 do
4:   for j = 1 : |Ω \ S| do
5:     Calculate J(X_{u(j)}) as defined in Eq. (12)
6:   end for
7:   v(k+1) = arg max_{1 \le j \le |Ω \ S|} J(X_{u(j)})
8:   S = S \cup \{X_{v(k+1)}\}
9: end for
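The algorithm above can be sketched in Python (an illustration under the pairwise decomposition of Eq. (10), not the authors' implementation; the plug-in entropy estimates assume discrete classifier outputs on a validation set):

```python
import math
from collections import Counter

def H(*vars_):
    """Plug-in joint entropy (bits) of one or more aligned sample sequences."""
    sample = list(zip(*vars_))
    n = len(sample)
    return -sum((c / n) * math.log2(c / n) for c in Counter(sample).values())

def I(x, y):
    """Mutual information I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return H(x) + H(y) - H(x, y)

def I_cond(x1, x2, y):
    """Conditional mutual information I(X1;X2|Y) from joint entropies."""
    return H(x1, y) + H(x2, y) - H(y) - H(x1, x2, y)

def select(outputs, y, K, beta=1.0, gamma=1.0):
    """Greedy forward selection of K classifiers with the root criterion, Eq. (12):
    J = relevance - beta * redundancy + gamma * conditional redundancy.
    `outputs[j]` is the prediction vector of classifier j on validation data."""
    pool = list(range(len(outputs)))
    first = max(pool, key=lambda j: I(outputs[j], y))  # step 1: most relevant
    selected = [first]
    pool.remove(first)
    while pool and len(selected) < K:
        def J(j):
            return (I(outputs[j], y)
                    - beta * sum(I(outputs[j], outputs[i]) for i in selected)
                    + gamma * sum(I_cond(outputs[j], outputs[i], y) for i in selected))
        best = max(pool, key=J)
        selected.append(best)
        pool.remove(best)
    return selected
```

With beta = gamma = 1 (the CIFE point of Eq. 12), a classifier identical to one already selected scores J = 0: its relevance I(X;Y) = H(X) - H(X|Y) is exactly cancelled by its redundancy terms I(X;X) = H(X) and I(X;X|Y) = H(X|Y), so a diverse alternative wins the tie-break.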
Classifier Selection Heuristics
Maximal Relevance (MR)

J(X_{u(j)}) = I(X_{u(j)}; Y)   (13)

Mutual Information Feature Selection (MIFS) [Battiti, 1994]

J(X_{u(j)}) = I(X_{u(j)}; Y) - \sum_{i=1}^{k} I(X_{u(j)}; X_{v(i)})   (14)

Minimal Redundancy Maximal Relevance (mRMR) [Peng et al., 2005]

J(X_{u(j)}) = I(X_{u(j)}; Y) - \frac{1}{|S|} \sum_{i=1}^{k} I(X_{u(j)}; X_{v(i)})   (15)
Classifier Selection Heuristics (cont’d)
Joint Mutual Information (JMI) [Yang and Moody, 1999]

J(X_{u(j)}) = \sum_{i=1}^{k} I(X_{u(j)} X_{v(i)}; Y)   (16)

Conditional Infomax Feature Extraction (CIFE) [Lin and Tang, 2006]

J(X_{u(j)}) = I(X_{u(j)}; Y) - \sum_{i=1}^{k} \left[ I(X_{u(j)}; X_{v(i)}) - I(X_{u(j)}; X_{v(i)} | Y) \right]   (17)

Conditional Mutual Information Maximization (CMIM) [Fleuret, 2004]

J(X_{u(j)}) = \min_{1 \le i \le k} I(X_{u(j)}; Y | X_{v(i)})
= I(X_{u(j)}; Y) - \max_{1 \le i \le k} \left[ I(X_{u(j)}; X_{v(i)}) - I(X_{u(j)}; X_{v(i)} | Y) \right]   (18)
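The heuristics above are points in the (beta, gamma) space of the root criterion, Eq. (12). A small lookup sketch (an illustration; CMIM is omitted because its max-form is not a single linear point in this space, and k denotes the number of classifiers selected so far):

```python
def criterion_coefficients(name, k):
    """Return the (beta, gamma) point of Eq. (12) that recovers the named heuristic."""
    points = {
        "MR":   (0.0, 0.0),          # relevance only, Eq. (13)
        "MIFS": (1.0, 0.0),          # Eq. (14)
        "mRMR": (1.0 / k, 0.0),      # Eq. (15), averaged redundancy
        "JMI":  (1.0 / k, 1.0 / k),  # Eq. (16), after expanding the joint terms
        "CIFE": (1.0, 1.0),          # Eq. (17)
    }
    return points[name]

print(criterion_coefficients("mRMR", 4))  # (0.25, 0.0)
```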
Experimental Results
Bagging and Random Forest are used to construct a pool of 50 decision trees (N = 50).
Each selection criterion is evaluated with K = 40 (20% pruned), 30 (40%), 20 (60%) and 10 (80%).
11 data sets from the UCI machine learning repository
Average of 5 runs of 10-fold cross-validation
normalized_test_acc = (pruned_ens_acc - single_tree_acc) / (unpruned_ens_acc - single_tree_acc)
id   name                   Classes  Examples  Discrete features  Continuous features
d1   anneal                 6        898       32                 6
d2   autos                  7        205       10                 16
d3   wisconsin-breast       2        699       0                  9
d4   bupa liver disorders   2        345       0                  6
d5   german-credit          2        1000      13                 7
d6   pima-diabetes          2        768       0                  8
d7   glass                  7        214       0                  9
d8   cleveland-heart        2        303       7                  6
d9   hepatitis              2        155       13                 6
d10  ionosphere             2        351       0                  34
d11  vehicle                4        846       0                  18
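The normalized test accuracy above rescales the pruned ensemble's accuracy so that the single tree maps to 0 and the unpruned ensemble maps to 1; a one-line sketch (the accuracy values in the example are hypothetical):

```python
def normalized_test_accuracy(pruned, unpruned, single):
    """Fraction of the unpruned ensemble's gain over a single tree
    that the pruned ensemble retains."""
    return (pruned - single) / (unpruned - single)

# Hypothetical accuracies: the pruned ensemble keeps most of the gain.
print(normalized_test_accuracy(0.88, 0.90, 0.80))  # ~0.8
```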
Results
Figure: Comparison of the normalized test accuracy of the ensemble of C4.5 decision trees constructed by Bagging
Results (cont’d)
Figure: Comparison of the normalized test accuracy of the ensemble of random trees constructed by Random Forest
Conclusion
This paper examined the issue of classifier selection from an information-theoretic viewpoint. The main advantage of information-theoretic criteria is that they capture higher-order statistics of the data.
The ensemble mutual information is decomposed into accuracy and diversity components.
Although diversity is represented by both low- and high-order terms, we keep only the first-order terms in this paper. In a further study, we will examine the influence of including the higher-order terms on pruning performance.
In this paper, we selected some points within the continuous space of possible selection criteria that represent well-known feature selection criteria, such as mRMR, CIFE, JMI and CMIM, and used them for classifier selection. In future work, we will explore other points in this space that may lead to more effective pruning.
We plan to extend the algorithm to prune ensembles of regression estimators.
Thanks for your attention
Questions?