
Study on Ensemble Learning

By Feng Zhou

Content

• Introduction
• A Statistical View of M3 Network
• Future Works

Introduction

• Ensemble learning:
  – Combines a group of classifiers rather than designing a new one.
  – The decisions of multiple hypotheses are combined to produce more accurate results.

• Problems in traditional learning algorithms
  – Statistical problem
  – Computational problem
  – Representation problem

• Related works
  – Resampling techniques: Bagging, Boosting
  – Approaches for extending to multi-class problems: One-vs-One, One-vs-All (see the sketch below)
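As a rough illustration of the two decompositions (not from the slides; the class labels are made up), the sketch below enumerates the binary sub-problems each strategy creates:

```python
from itertools import combinations

classes = ["c1", "c2", "c3", "c4"]  # hypothetical class labels

# One-vs-One: one binary problem per unordered pair of classes
ovo_problems = [(a, b) for a, b in combinations(classes, 2)]

# One-vs-All: one binary problem per class, against the union of the rest
ova_problems = [(c, [o for o in classes if o != c]) for c in classes]

print(len(ovo_problems))  # 6 pairwise problems for 4 classes
print(len(ova_problems))  # 4 one-vs-rest problems
```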

Min-Max-Modular (M3) Network (Lu, IEEE TNN 1999)

• Steps
  – Dividing training sets (Chen, IJCNN 2006; Wen, ICONIP 2005)
  – Training pair-wise classifiers
  – Integrating the outcomes (Zhao, IJCNN 2005)
    • Min process
    • Max process

[Figure: Min-Max combination of module outputs. Each row of the grid

  0.1 0.5 0.7 0.2   → Min = 0.1
  0.4 0.3 0.5 0.6   → Min = 0.3
  0.8 0.5 0.4 0.2   → Min = 0.2
  0.5 0.9 0.7 0.3   → Min = 0.3

is reduced by a Min unit, and the Max process over the row minima gives 0.3.]
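A minimal sketch of the Min-Max integration step on the grid above, in plain Python (the values are taken from the figure; the orientation, with each row of modules feeding one Min unit, follows the figure):

```python
# Outputs of the pairwise modules for one test sample, as in the figure above.
outputs = [
    [0.1, 0.5, 0.7, 0.2],
    [0.4, 0.3, 0.5, 0.6],
    [0.8, 0.5, 0.4, 0.2],
    [0.5, 0.9, 0.7, 0.3],
]

# Min process: each row of modules is reduced to its minimum.
row_minima = [min(row) for row in outputs]   # [0.1, 0.3, 0.2, 0.3]

# Max process: the row minima are reduced to their maximum.
final_score = max(row_minima)                # 0.3

print(row_minima, final_score)
```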

A Statistical View

• Assumption
  – The pair-wise classifier outputs a probabilistic value, fitted with a sigmoid function (J.C. Platt, ALMC 1999):

    $P(\omega \mid x) = \dfrac{1}{1 + e^{Ax + B}}$

• Bayesian decision theory (a small numeric sketch follows below)

    $\hat{\omega} = \underset{\omega \in \{\omega^+,\, \omega^-\}}{\arg\max}\; P(\omega \mid x)$

  where

    $P(\omega^+ \mid x) = \dfrac{P(x \mid \omega^+)\, P(\omega^+)}{P(x \mid \omega^+)\, P(\omega^+) + P(x \mid \omega^-)\, P(\omega^-)}$
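A small sketch of the two formulas above; the sigmoid parameters A, B, the priors, and the example numbers are made up for illustration, not values from the slides:

```python
import math

def platt_sigmoid(score, A=-2.0, B=0.0):
    """Map a raw classifier score to a probability: P(w|x) = 1 / (1 + exp(A*score + B))."""
    return 1.0 / (1.0 + math.exp(A * score + B))

def posterior_positive(lik_pos, lik_neg, prior_pos=0.5):
    """Bayes rule for the two-class posterior P(w+ | x)."""
    prior_neg = 1.0 - prior_pos
    num = lik_pos * prior_pos
    return num / (num + lik_neg * prior_neg)

print(platt_sigmoid(1.2))            # probability from a raw score
print(posterior_positive(0.5, 0.4))  # P(x|w+)=0.5, P(x|w-)=0.4, equal priors -> 5/9
```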

A Simple Discrete Example

P(w|x):

        w+     w−
  x1    1/2
  x2    1/2    2/5
  x3           2/5
  x4           1/5

A Simple Discrete Example (II)

• Classifier 0 (w+ : w−):   Pc0(w+ | x = x2) = 1/3
• Classifier 1 (w+ : w1−):  Pc1(w+ | x = x2) = 1/2
• Classifier 2 (w+ : w2−):  Pc2(w+ | x = x2) = 1/2

• Pc0 < min(Pc1, Pc2)

A More Complicated Example

• When one more classifier is taken into account, the evidence that x belongs to w+ keeps shrinking.

• Pglobal(w+) < min(Ppartial(w+))

• The classifier reporting the minimum value contains the most information about w− (minimization principle).

• If Ppartial(w+) = 1, no information about w− is contained. (A short derivation of the inequality follows below.)

[Figure: a growing chain of classifiers, Classifier 1 (w+ : w1−), Classifier 2 (w+ : w2−), …; the information about w− increases along the chain.]
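A one-line justification of the inequality, under the slides' assumption that the modules output exact probabilities; the shorthand $a = P(x, \omega^+)$ and $b_j = P(x, \omega_j^-)$ is mine:

$$
P_{\mathrm{global}}(\omega^+ \mid x)
  = \frac{a}{a + \sum_{j} b_j}
  \;\le\; \frac{a}{a + b_j}
  = P_{\mathrm{partial},j}(\omega^+ \mid x)
  \quad \text{for every } j,
\qquad\text{hence}\qquad
P_{\mathrm{global}}(\omega^+ \mid x) \;\le\; \min_j P_{\mathrm{partial},j}(\omega^+ \mid x).
$$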

Analysis

• For each classifier $c_{ij}$ (trained on $\omega_i^+$ vs. $\omega_j^-$), write $M_{ij} = \dfrac{P(x, \omega_i^+)}{P(x, \omega_j^-)}$; its output is

  $q_{ij} = P(\omega_i^+ \mid x, \omega_i^+ \cup \omega_j^-) = \dfrac{P(x, \omega_i^+)}{P(x, \omega_i^+) + P(x, \omega_j^-)} = \dfrac{M_{ij}}{1 + M_{ij}}$

• For each sub-positive class $\omega_i^+$, combining over the $n^-$ sub-negative classes,

  $q_i = P(\omega_i^+ \mid x, \omega_i^+ \cup \omega^-) = \dfrac{1}{1 + \sum_j M_{ij}^{-1}} = \dfrac{1}{\sum_j q_{ij}^{-1} - (n^- - 1)}$

• For the positive class $\omega^+$, combining over the $n^+$ sub-positive classes (implemented in the sketch below),

  $P(\omega^+ \mid x) = \dfrac{1}{1 + \left( \sum_{i=1}^{n^+} \dfrac{q_i}{1 - q_i} \right)^{-1}}$
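A sketch of this restoration, starting from toy joint probabilities so the closed-form combination can be checked against a direct computation; the numbers are made up, not from the slides:

```python
# Toy joint probabilities P(x, w_i+) and P(x, w_j-) for a single sample x (made-up values).
a = [0.10, 0.05]          # sub-positive classes w_1+, w_2+
b = [0.04, 0.08, 0.02]    # sub-negative classes w_1-, w_2-, w_3-

# Pairwise module outputs q_ij = P(w_i+ | x, w_i+ or w_j-).
q = [[ai / (ai + bj) for bj in b] for ai in a]

# Combine over the sub-negative classes:
# P(w_i+ | x, w_i+ or w-) = 1 / (sum_j 1/q_ij - (n_minus - 1)).
n_minus = len(b)
q_i = [1.0 / (sum(1.0 / qij for qij in row) - (n_minus - 1)) for row in q]

# Combine over the sub-positive classes:
# P(w+ | x) = 1 / (1 + 1 / sum_i q_i / (1 - q_i)).
p_pos = 1.0 / (1.0 + 1.0 / sum(qi / (1.0 - qi) for qi in q_i))

# Direct computation from the joint probabilities, for comparison.
p_pos_direct = sum(a) / (sum(a) + sum(b))

print(q_i)                    # restored per-sub-class posteriors
print(p_pos, p_pos_direct)    # the two values agree
```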

Analysis (II)

• Decomposition of a complex problem
• Restoration to the original resolution

Composition of Training Sets

[Figure: grid of candidate training-set pairings among the sub-classes w1+, …, wn+ of w+ and w1−, …, wn− of w−. The (wi+, wj−) pairs have been used; pairs drawn from within the same class form trivial, useless sets; the remaining pairings have not been used yet.]

Another Way of Combination

[Figure: the same grid of sub-class pairings, combined in the reverse order: for each sub-negative class wk−, the outputs are first combined over the sub-positive classes wi+, giving analogous closed-form quantities $M'_{ki}$ in terms of the module outputs $q_{ki}$.]

Training and testing time: $O(n^+ \cdot n^-)$ and $O(n^+ + n^-)$

Experiments – Synthetic Data

Experiments – Text Categorization (20 Newsgroups corpus)

Experiments Setup

• Removing words: stemming, stop words, words with frequency < 30

• Using Naïve Bayes as the elementary classifier (see the sketch below)

• Estimating the probability with a sigmoid function
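A minimal sketch of one such pairwise module, assuming scikit-learn and its built-in 20 Newsgroups loader; the two category names and all parameters are illustrative, stemming is omitted, and the sigmoid fit is done with scikit-learn's CalibratedClassifierCV as a stand-in for the sigmoid estimation mentioned on the slide:

```python
# One pairwise module: Naive Bayes on two newsgroups, with a Platt-style sigmoid
# fitted on its outputs via cross-validated calibration.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.calibration import CalibratedClassifierCV

# Two illustrative categories standing in for one (w_i+, w_j-) pair.
cats = ["sci.space", "rec.autos"]
train = fetch_20newsgroups(subset="train", categories=cats)
test = fetch_20newsgroups(subset="test", categories=cats)

# Bag-of-words features; English stop words removed, rare words dropped via min_df.
vec = CountVectorizer(stop_words="english", min_df=30)
Xtr = vec.fit_transform(train.data)
Xte = vec.transform(test.data)

# Naive Bayes as the elementary classifier, calibrated with a sigmoid.
module = CalibratedClassifierCV(MultinomialNB(), method="sigmoid", cv=3)
module.fit(Xtr, train.target)

# Probabilistic outputs of this module, to be fed into the Min/Max processes.
probs = module.predict_proba(Xte)[:, 1]
print(probs[:5])
```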

Future Work

• Situations where noise is taken into account
  – The essence of the problem: to access the underlying distribution
  – Independent parameters of the model:
  – Constraints we get:
  – To obtain the best estimate

Kullback-Leibler Distance (T. Hastie, Ann Statist 1998)

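For reference, the weighted Kullback-Leibler criterion used in the cited paper (Hastie & Tibshirani, 1998) has the form below, with $r_{ij}$ the observed pairwise probabilities, $\mu_{ij}$ the model's, and $n_{ij}$ the number of training examples in pair $(i, j)$; this is a restatement for orientation, not a formula recovered from the slides:

$$
\ell(p) \;=\; \sum_{i<j} n_{ij}
  \left[ r_{ij} \log \frac{r_{ij}}{\mu_{ij}}
       + (1 - r_{ij}) \log \frac{1 - r_{ij}}{1 - \mu_{ij}} \right]
$$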

References

[1] T. Hastie and R. Tibshirani, Classification by pairwise coupling, Ann. Statist., 1998.
[2] J. C. Platt, Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods, ALMC, 1999.
[3] B. Lu and M. Ito, Task decomposition and module combination based on class relations: a modular neural network for pattern classification, IEEE Trans. Neural Networks, 1999.
[4] Y. M. Wen and B. Lu, Equal clustering makes min-max modular support vector machines more efficient, ICONIP 2005.
[5] H. Zhao and B. Lu, On efficient selection of binary classifiers for min-max modular classifier, IJCNN 2005.
[6] K. Chen and B. Lu, Efficient classification of multi-label and imbalanced data using min-max modular classifiers, IJCNN 2006.
