
Page 1

Random Forests and Ferns

David Capel

Page 2

The Multi-class Classification Problem

[Figure: sample images from standard object category datasets (Fergus, Zisserman and Perona), e.g. the Caltech and Oxford VGG databases, the UIUC Cars (Side) set and Spotted Cats from the Corel Image Library]

Training: Labelled exemplars representing multiple classes

Classifying: to which class does this new example belong?


Planes, Faces, Cars, Cats, Handwritten digits

Page 3

The Multi-class Classification Problem

• A classifier is a mapping H from feature vectors f to discrete class labels C

• Numerous choices of feature space F are possible, e.g.

Raw pixel values, Color histograms, Oriented filter banks, Texton histograms

f = (f1, f2, ..., fN )C ! {c1, c2, ..., cK}

H : f " C

P (C = ci|f1, f2, ..., fN )H(f) = arg max

iP (C = ci|f1, f2, ..., fN )

Dm = (fm, Cm) for m = 1...M

Gij =12

!1n

"

k

(d2ik + d2

kj)# d2ij #

1n2

"

kl

d2kl

#

err(y) ="

ij

$Gij # y!i yj

%2

y!i =&

!!v!i for " =1,2,...,d

Cij =1n

"

k

xikxjk

Gij = xi $ xj

arg minW

!(W ) ="

i

'''xi #"

j"Ni

Wijxj

'''2

"

j

Wij = 1

1

Page 4

The Multi-class Classification Problem

• We have a (large) database of labelled exemplars

• Problem: Given such training data, learn the mapping H

• Even better: learn the posterior distribution over class label conditioned on the features:

... and obtain the classifier H as the mode of the posterior:

f = (f1, f2, ..., fN )C ! {c1, c2, ..., cK}

H : f " C

P (C = ci|f1, f2, ..., fN )H(f) = arg max

iP (C = ci|f1, f2, ..., fN )

Dm = (fm, Cm) for m = 1...M

Gij =12

!1n

"

k

(d2ik + d2

kj)# d2ij #

1n2

"

kl

d2kl

#

err(y) ="

ij

$Gij # y!i yj

%2

y!i =&

!!v!i for " =1,2,...,d

Cij =1n

"

k

xikxjk

Gij = xi $ xj

arg minW

!(W ) ="

i

'''xi #"

j"Ni

Wijxj

'''2

"

j

Wij = 1

1

f = (f1, f2, ..., fN )C ! {c1, c2, ..., cK}

H : f " C

P (C = ci|f1, f2, ..., fN )H(f) = arg max

iP (C = ci|f1, f2, ..., fN )

Dm = (fm, Cm) for m = 1...M

Gij =12

!1n

"

k

(d2ik + d2

kj)# d2ij #

1n2

"

kl

d2kl

#

err(y) ="

ij

$Gij # y!i yj

%2

y!i =&

!!v!i for " =1,2,...,d

Cij =1n

"

k

xikxjk

Gij = xi $ xj

arg minW

!(W ) ="

i

'''xi #"

j"Ni

Wijxj

'''2

"

j

Wij = 1

1

f = (f1, f2, ..., fN )C ! {c1, c2, ..., cK}

H : f " C

P (C = ck|f1, f2, ..., fN )H(f) = argmax

kP (C = ck|f1, f2, ..., fN )

Dm = (fm, Cm) for m = 1...M

fn > T

w!f > T

IG =!

k

pleftk (1# pleft

k ) +!

k

prightk (1# pright

k )

IE = #!

k

#pleftk log pleft

k #!

k

#prightk log pright

k

argmaxk

P (Ck|f1, f2, ..., fN )

argmaxk

P (f1, f2, ..., fN |Ck)P (Ck)

P (f1, f2, ..., fN |Ck) =N"

n=1

P (fn|Ck)

Class(f) $ argmaxk

P (Ck)N"

n=1

P (fn|Ck)

Fl = {f l,1, f l,2, ..., f l,S}

P (f1, f2, ..., fN |Ck) =L"

l=1

P (Fl|Ck)

1

f = (f1, f2, ..., fN )C ! {c1, c2, ..., cK}

H : f " C

P (C = ck|f1, f2, ..., fN )H(f) = argmax

kP (C = ck|f1, f2, ..., fN )

Dm = (fm, Cm) for m = 1...M

fn > T

w!f > T

IG =!

k

pleftk (1# pleft

k ) +!

k

prightk (1# pright

k )

IE = #!

k

#pleftk log pleft

k #!

k

#prightk log pright

k

argmaxk

P (Ck|f1, f2, ..., fN )

argmaxk

P (f1, f2, ..., fN |Ck)P (Ck)

P (f1, f2, ..., fN |Ck) =N"

n=1

P (fn|Ck)

Class(f) $ argmaxk

P (Ck)N"

n=1

P (fn|Ck)

Fl = {f l,1, f l,2, ..., f l,S}

P (f1, f2, ..., fN |Ck) =L"

l=1

P (Fl|Ck)

1

Page 5

How do we represent and learn the mapping?

In Bishop’s book, we saw numerous ways to build multi-class classifiers, e.g.

• Direct, non-parametric learning of class posterior (histograms)

• K-Nearest Neighbours

• Fisher Linear Discriminants

• Relevance Vector Machines

• Multi-class SVMs

[Figure: “Direct, non-parametric” — a Bayes decision surface learned over (center pixel intensity, edge pixel intensity), with the ROC curve (sensitivity vs. 1 − specificity) used to choose the operating point]

[Figure: “k-Nearest Neighbours” — training examples from Class A and Class B in a 2-D feature space (f1, f2)]

Page 6

Binary Decision Trees

• Decision trees classify features by a series of Yes/No questions

• At each node, the feature space is split according to the outcome of some binary decision criterion

• The leaves are labelled with the class C assigned to features that reach them via that path through the tree

[Figure: a binary decision tree — root test B1(f)=True?, with Yes/No branches to tests B2(f)=True? and B3(f)=True?, whose Yes/No branches end in leaves labelled c1, c1, c1, c2]

Page 7

[Animation: the same tree, now with an input feature vector f arriving at the root test B1(f)=True?]

Page 8

[Animation: f answers Yes at B1(f)=True? and is routed to the test B3(f)=True?]

Page 9

[Animation: f answers No at B3(f)=True? and arrives at a leaf, which supplies its class label]

Page 10

Binary Decision Trees: Training

• To train, recursively partition the training data into subsets according to some Yes/No tests on the feature vectors

• Partitioning continues until each subset contains features of a single class

Pages 11–13

[Animation: successive frames of the recursive partitioning of the training data]

Page 14

Binary Decision Trees: Training

[Animation: final frame — the training data has been fully partitioned into single-class subsets, labelled A, B, C, C, D]
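To make the recursive partitioning concrete, here is a minimal Python sketch (not from the slides; names such as build_tree and Node are illustrative). It grows a tree of axis-aligned splits, scoring candidate thresholds by how many points would be misclassified under majority-vote children, and stops when each subset is single-class.

import numpy as np

class Node:
    """Either a leaf carrying a class label, or an internal node with an axis-aligned test."""
    def __init__(self, label=None, feature=None, threshold=None, left=None, right=None):
        self.label, self.feature, self.threshold = label, feature, threshold
        self.left, self.right = left, right

def build_tree(X, y):
    """Recursively partition (X, y) until each subset contains features of a single class."""
    if len(set(y)) == 1:
        return Node(label=int(y[0]))                    # pure subset -> leaf
    best = None
    for f in range(X.shape[1]):                         # candidate tests f_n > T
        for T in np.unique(X[:, f])[:-1]:
            go_right = X[:, f] > T
            # crude score: misclassifications if each child predicted its majority class
            score = sum(len(part) - np.bincount(part).max()
                        for part in (y[go_right], y[~go_right]))
            if best is None or score < best[0]:
                best = (score, f, T, go_right)
    if best is None:                                    # identical points with mixed labels
        return Node(label=int(np.bincount(y).argmax()))
    _, f, T, go_right = best
    return Node(feature=f, threshold=T,
                left=build_tree(X[~go_right], y[~go_right]),
                right=build_tree(X[go_right], y[go_right]))

def classify(node, x):
    """Follow the Yes/No answers down to a leaf and return its class label."""
    while node.label is None:
        node = node.right if x[node.feature] > node.threshold else node.left
    return node.label

X = np.array([[0.2, 1.0], [0.4, 0.8], [0.9, 0.1], [0.8, 0.3]])
y = np.array([0, 0, 1, 1])
print(classify(build_tree(X, y), np.array([0.85, 0.2])))   # -> 1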

Page 15

Common decision rules

• Axis-aligned splitting

Threshold on a single feature at each node

Very fast to evaluate

• General plane splitting

Threshold on a linear combination of features

More expensive but produces smaller trees

f = (f1, f2, ..., fN )C ! {c1, c2, ..., cK}

H : f " C

P (C = ci|f1, f2, ..., fN )H(f) = arg max

iP (C = ci|f1, f2, ..., fN )

Dm = (fm, Cm) for m = 1...M

fi > T

w!f > T

Gij =12

!1n

"

k

(d2ik + d2

kj)# d2ij #

1n2

"

kl

d2kl

#

err(y) ="

ij

$Gij # y!i yj

%2

y!i =&

!!v!i for " =1,2,...,d

Cij =1n

"

k

xikxjk

Gij = xi $ xj

arg minW

!(W ) ="

i

'''xi #"

j"Ni

Wijxj

'''2

1

f = (f1, f2, ..., fN )C ! {c1, c2, ..., cK}

H : f " C

P (C = ck|f1, f2, ..., fN )H(f) = argmax

kP (C = ck|f1, f2, ..., fN )

Dm = (fm, Cm) for m = 1...M

fn > T

w!f > T

IG =!

k

pleftk (1# pleft

k ) +!

k

prightk (1# pright

k )

IE = #!

k

#pleftk log pleft

k #!

k

#prightk log pright

k

argmaxk

P (Ck|f1, f2, ..., fN )

argmaxk

P (f1, f2, ..., fN |Ck)P (Ck)

P (f1, f2, ..., fN |Ck) =N"

n=1

P (fn|Ck)

Class(f) $ argmaxk

P (Ck)N"

n=1

P (fn|Ck)

Fl = {f l,1, f l,2, ..., f l,S}

P (f1, f2, ..., fN |Ck) =L"

l=1

P (Fl|Ck)

1
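As a small illustration (assumed function names, not from the slides), the two node tests above can be written directly as predicates on the feature vector:

import numpy as np

def axis_aligned_split(f, n, T):
    """Axis-aligned rule: threshold a single feature, f_n > T."""
    return f[n] > T

def plane_split(f, w, T):
    """General plane rule: threshold a linear combination of features, w.f > T."""
    return float(np.dot(w, f)) > T

f = np.array([0.3, 0.7, 0.1])
print(axis_aligned_split(f, n=1, T=0.5))                       # True
print(plane_split(f, w=np.array([1.0, -1.0, 2.0]), T=0.0))     # 0.3 - 0.7 + 0.2 = -0.2 -> False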

Page 16

Choosing the right split

• The split location may be chosen to optimize some measure of classification performance of the child subsets

• Encourage child subsets with lower entropy (confusion)

• Gini impurity

• Information gain (entropy reduction)

f = (f1, f2, ..., fN )C ! {c1, c2, ..., cK}

H : f " C

P (C = ck|f1, f2, ..., fN )H(f) = argmax

kP (C = ck|f1, f2, ..., fN )

Dm = (fm, Cm) for m = 1...M

fn > T

w!f > T

IG =!

k

pleftk (1# pleft

k ) +!

k

prightk (1# pright

k )

IE = #!

k

#pleftk log pleft

k #!

k

#prightk log pright

k

argmaxk

P (Ck|f1, f2, ..., fN )

argmaxk

P (f1, f2, ..., fN |Ck)P (Ck)

P (f1, f2, ..., fN |Ck) =N"

n=1

P (fn|Ck)

Class(f) $ argmaxk

P (Ck)N"

n=1

P (fn|Ck)

Fl = {f l,1, f l,2, ..., f l,S}

P (f1, f2, ..., fN |Ck) =L"

l=1

P (Fl|Ck)

1

f = (f1, f2, ..., fN )C ! {c1, c2, ..., cK}

H : f " C

P (C = ck|f1, f2, ..., fN )H(f) = argmax

kP (C = ck|f1, f2, ..., fN )

Dm = (fm, Cm) for m = 1...M

fn > T

w!f > T

IG =!

k

pleftk (1# pleft

k ) +!

k

prightk (1# pright

k )

IE = #!

k

pleftk log pleft

k #!

k

prightk log pright

k

argmaxk

P (Ck|f1, f2, ..., fN )

argmaxk

P (f1, f2, ..., fN |Ck)P (Ck)

P (f1, f2, ..., fN |Ck) =N"

n=1

P (fn|Ck)

Class(f) $ argmaxk

P (Ck)N"

n=1

P (fn|Ck)

Fl = {f l,1, f l,2, ..., f l,S}

P (f1, f2, ..., fN |Ck) =L"

l=1

P (Fl|Ck)

1

fn > T ?fraction of each class pk
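A minimal sketch of the two impurity measures above (illustrative code, not the talk’s): given the left and right child subsets produced by a candidate split f_n > T, it computes the Gini impurity I_G and the entropy I_E, so the split with the lowest score can be kept.

import numpy as np

def class_fractions(labels, num_classes):
    """p_k: fraction of each class k within a child subset."""
    counts = np.bincount(labels, minlength=num_classes)
    return counts / max(len(labels), 1)

def gini_score(y_left, y_right, num_classes):
    """I_G = sum_k p_k^left (1 - p_k^left) + sum_k p_k^right (1 - p_k^right)."""
    score = 0.0
    for part in (y_left, y_right):
        p = class_fractions(part, num_classes)
        score += np.sum(p * (1.0 - p))
    return score

def entropy_score(y_left, y_right, num_classes):
    """I_E = -sum_k p_k^left log p_k^left - sum_k p_k^right log p_k^right."""
    score = 0.0
    for part in (y_left, y_right):
        p = class_fractions(part, num_classes)
        nz = p[p > 0]                      # treat 0 log 0 as 0
        score -= np.sum(nz * np.log(nz))
    return score

# Score the candidate split f_n > T on a small labelled subset
X = np.array([[0.1], [0.4], [0.6], [0.9]])
y = np.array([0, 0, 1, 1])
go_right = X[:, 0] > 0.5
print(gini_score(y[~go_right], y[go_right], num_classes=2))     # 0.0 (both children pure)
print(entropy_score(y[~go_right], y[go_right], num_classes=2))  # 0.0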

Page 17

Pros and Cons of Decision Trees

Advantages

• Training can be fast and easy to implement

• Easily handles a large number of input variables

Drawbacks

• Clearly, it is always possible to construct a decision tree that scores 100% classification accuracy on the training set

• But they tend to over-fit and do not generalize very well

• Hence, performance on the testing set will be far less impressive

So, what can be done to overcome these problems?

Page 18

Random Forests

Basic idea

• Somehow introduce randomness into the tree-learning process

• Build multiple, independent trees based on the training set

• When classifying an input, each tree votes for a class label

• The forest output is the consensus of all the tree votes

• If the trees really are independent, the performance should improve with more trees

Ho, Tin (1995) “Random Decision Forests”, Int’l Conf. on Document Analysis and Recognition

Breiman, Leo (2001) “Random Forests”, Machine Learning

Page 19

How to introduce randomness?

• Bagging

Generate randomized training sets by sampling with replacement from the full training set (bootstrap sampling)

• Feature subset selection

Choose different random subsets of the full feature vector to generate each tree

Full training set:    D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12

Random “bag”:         D4 D9 D3 D4 D12 D10 D10 D7 D3 D1 D6 D1

Full feature vector:  f1 f2 f3 f4 f5 f6 f7 f8 f9 f10

Feature subset:       f4 f6 f7 f9 f10
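A quick sketch of both randomization schemes (names are illustrative, not from the slides): bootstrap sampling of the training indices and random selection of a feature subset.

import numpy as np

rng = np.random.default_rng(0)

def bootstrap_bag(num_examples):
    """Bagging: sample example indices with replacement (same size as the full set)."""
    return rng.integers(0, num_examples, size=num_examples)

def random_feature_subset(num_features, subset_size):
    """Feature subset selection: a random subset of feature indices, without replacement."""
    return rng.choice(num_features, size=subset_size, replace=False)

print(bootstrap_bag(12))                 # duplicates allowed, some examples left out
print(random_feature_subset(10, 5))      # e.g. 5 of the 10 feature indices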

Page 20

Random Forests

[Figure: the input feature vector f is passed through each of Tree 1, Tree 2, ..., Tree L, each trained on its own randomized data; every tree votes for a class label (c1 or c2), and the votes are tallied into a histogram over Class 1 and Class 2]
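A minimal voting sketch (illustrative, not the talk’s code): each trained tree is treated as a function from a feature vector to a class label, and the forest output is the majority vote.

import numpy as np

def forest_classify(trees, f):
    """Each tree votes for a class label; the forest output is the consensus (majority vote)."""
    votes = [tree(f) for tree in trees]          # each `tree` maps a feature vector to a label
    return int(np.bincount(votes).argmax()), votes

# Three toy "trees", each just a function of the input feature vector
trees = [lambda f: int(f[0] > 0.5),
         lambda f: int(f[1] > 0.5),
         lambda f: int(f[0] + f[1] > 1.0)]
label, votes = forest_classify(trees, np.array([0.8, 0.3]))
print(votes, "->", label)                        # [1, 0, 1] -> 1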

Page 21

Performance comparison

• Results from Ho 1995

• Handwritten digits (10 classes)

• 20x20 pixel binary images

• 60000 training, 10000 testing samples

• Feature set f1 : raw pixel values (400 features)

• Feature set f2 : f1 plus gradients (852 features in total)

• Uses full training set for every tree (no bagging)

• Compares different feature subset sizes (100 and 200, respectively)

Page 22

Performance comparison

[Plot: %Correct vs. Number of Trees in Forests (1–20), y-axis 60–100; curves labelled “cap (f1), 100d”, “cap (f1), 200d”, “cap (f2), 100d”, “cap (f2), 200d”]

[Plot: %Correct vs. Number of Trees in Forests (1–10), y-axis 60–100; curves labelled “per (f1), 100d”, “per (f1), 200d”, “per (f2), 100d”, “per (f2), 200d”]

Page 23

Revision: Naive Bayes Classifiers

• We would like to evaluate the posterior probability over class label

• Bayes’ rule tells us that this is equivalent to

• But learning the joint likelihood distribution over all features is most likely intractable!

• Naive Bayes makes the simplifying assumption that features are conditionally independent given the class label

f = (f1, f2, ..., fN )C ! {c1, c2, ..., cK}

H : f " C

P (C = ci|f1, f2, ..., fN )H(f) = argmax

iP (C = ci|f1, f2, ..., fN )

Dm = (fm, Cm) for m = 1...M

fi > T

w!f > T

IG =!

i

plefti (1# pleft

i ) +!

i

prighti (1# pright

i )

IE = #!

i

#plefti log pleft

i #!

i

#prighti log pright

i

argmaxk

P (Ck|f1, f2, ..., fN )

argmaxk

P (f1, f2, ..., fN |Ck)P (Ck)

P (f1, f2, ..., fN |Ck) =N"

i=1

P (fi|Ck)

Gij =12

#1n

!

k

(d2ik + d2

kj)# d2ij #

1n2

!

kl

d2kl

$

err(y) =!

ij

%Gij # y!i yj

&2

y!i ='

!!v!i for " =1,2,...,d

1

f = (f1, f2, ..., fN )C ! {c1, c2, ..., cK}

H : f " C

P (C = ci|f1, f2, ..., fN )H(f) = argmax

iP (C = ci|f1, f2, ..., fN )

Dm = (fm, Cm) for m = 1...M

fi > T

w!f > T

IG =!

i

plefti (1# pleft

i ) +!

i

prighti (1# pright

i )

IE = #!

i

#plefti log pleft

i #!

i

#prighti log pright

i

argmaxk

P (Ck|f1, f2, ..., fN )

argmaxk

P (f1, f2, ..., fN |Ck)P (Ck)

P (f1, f2, ..., fN |Ck) =N"

i=1

P (fi|Ck)

Gij =12

#1n

!

k

(d2ik + d2

kj)# d2ij #

1n2

!

kl

d2kl

$

err(y) =!

ij

%Gij # y!i yj

&2

y!i ='

!!v!i for " =1,2,...,d

1

f = (f1, f2, ..., fN )C ! {c1, c2, ..., cK}

H : f " C

P (C = ci|f1, f2, ..., fN )H(f) = argmax

iP (C = ci|f1, f2, ..., fN )

Dm = (fm, Cm) for m = 1...M

fi > T

w!f > T

IG =!

i

plefti (1# pleft

i ) +!

i

prighti (1# pright

i )

IE = #!

i

#plefti log pleft

i #!

i

#prighti log pright

i

argmaxk

P (Ck|f1, f2, ..., fN )

argmaxk

P (f1, f2, ..., fN |Ck)P (Ck)

P (f1, f2, ..., fN |Ck) =N"

i=1

P (fi|Ck)

Gij =12

#1n

!

k

(d2ik + d2

kj)# d2ij #

1n2

!

kl

d2kl

$

err(y) =!

ij

%Gij # y!i yj

&2

y!i ='

!!v!i for " =1,2,...,d

1

(likelihood x prior)

Page 24

Revision: Naive Bayes Classifiers

• This independence assumption is usually false!

• The resulting approximation tends to grossly underestimate the true posterior probabilities

However ...

• It is usually easy to learn the 1-D conditional densities P(fn|Ck)

• It often works rather well in practice!

• Can we do better without sacrificing simplicity?

f = (f1, f2, ..., fN )C ! {c1, c2, ..., cK}

H : f " C

P (C = ck|f1, f2, ..., fN )H(f) = argmax

kP (C = ck|f1, f2, ..., fN )

Dm = (fm, Cm) for m = 1...M

fn > T

w!f > T

IG =!

k

pleftk (1# pleft

k ) +!

k

prightk (1# pright

k )

IE = #!

k

pleftk log pleft

k #!

k

prightk log pright

k

argmaxk

P (Ck|f1, f2, ..., fN )

argmaxk

P (f1, f2, ..., fN |Ck)P (Ck)

P (f1, f2, ..., fN |Ck) =N"

n=1

P (fn|Ck)

Class(f) $ argmaxk

P (Ck)N"

n=1

P (fn|Ck)

Fl = {f l,1, f l,2, ..., f l,S}

P (f1, f2, ..., fN |Ck) =L"

l=1

P (Fl|Ck)

1
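As a rough illustration of the last two points, here is a minimal Naive Bayes sketch in Python (illustrative names; features are assumed to be quantized into a small number of bins): the 1-D conditional densities P(fn|Ck) are learned as per-class histograms, and classification takes the argmax of log P(Ck) + sum_n log P(fn|Ck).

import numpy as np

def train_naive_bayes(F, y, num_classes, num_bins):
    """Learn P(Ck) and 1-D histograms P(fn|Ck) from quantized features F (ints in [0, num_bins))."""
    M, N = F.shape
    prior = np.bincount(y, minlength=num_classes) / M
    cond = np.ones((num_classes, N, num_bins))            # add-one smoothing avoids zeros
    for f, c in zip(F, y):
        cond[c, np.arange(N), f] += 1
    cond /= cond.sum(axis=2, keepdims=True)               # normalize each 1-D histogram
    return prior, cond

def classify_naive_bayes(prior, cond, f):
    """Class(f) ~ argmax_k  log P(Ck) + sum_n log P(fn|Ck)  (logs for numerical stability)."""
    N = len(f)
    log_post = np.log(prior) + np.log(cond[:, np.arange(N), f]).sum(axis=1)
    return int(np.argmax(log_post))

F = np.array([[0, 1], [0, 1], [1, 0], [1, 0]])    # M=4 examples, N=2 features, 2 bins each
y = np.array([0, 0, 1, 1])
prior, cond = train_naive_bayes(F, y, num_classes=2, num_bins=2)
print(classify_naive_bayes(prior, cond, np.array([1, 0])))   # -> 1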

Page 25

Ferns : “Semi-Naive” Bayes

• Group the features into L small sets of size S (called Ferns),

where each feature fn is the outcome of a binary test on the input vector, so that fn ∈ {0, 1}

• Assume groups are conditionally independent, hence

• Learn the class-conditional distributions for each group and apply Bayes rule to obtain posterior,

M. Özuysal et al. “Fast Keypoint Recognition in 10 Lines of Code” CVPR 2007

f = (f1, f2, ..., fN )C ! {c1, c2, ..., cK}

H : f " C

P (C = ck|f1, f2, ..., fN )H(f) = argmax

kP (C = ck|f1, f2, ..., fN )

Dm = (fm, Cm) for m = 1...M

fn > T

w!f > T

IG =!

k

pleftk (1# pleft

k ) +!

k

prightk (1# pright

k )

IE = #!

k

pleftk log pleft

k #!

k

prightk log pright

k

argmaxk

P (Ck|f1, f2, ..., fN )

argmaxk

P (f1, f2, ..., fN |Ck)P (Ck)

P (f1, f2, ..., fN |Ck) =N"

n=1

P (fn|Ck)

Class(f) $ argmaxk

P (Ck)N"

n=1

P (fn|Ck)

Fl = {f l,1, f l,2, ..., f l,S}

P (f1, f2, ..., fN |Ck) =L"

l=1

P (Fl|Ck)

1

f = (f1, f2, ..., fN )C ! {c1, c2, ..., cK}

H : f " C

P (C = ck|f1, f2, ..., fN )H(f) = argmax

kP (C = ck|f1, f2, ..., fN )

Dm = (fm, Cm) for m = 1...M

fn > T

w!f > T

IG =!

k

pleftk (1# pleft

k ) +!

k

prightk (1# pright

k )

IE = #!

k

pleftk log pleft

k #!

k

prightk log pright

k

argmaxk

P (Ck|f1, f2, ..., fN )

argmaxk

P (f1, f2, ..., fN |Ck)P (Ck)

P (f1, f2, ..., fN |Ck) =N"

n=1

P (fn|Ck)

Class(f) $ argmaxk

P (Ck)N"

n=1

P (fn|Ck)

Fl = {f l,1, f l,2, ..., f l,S}

P (f1, f2, ..., fN |Ck) =L"

l=1

P (Fl|Ck)

Class(f) $ argmaxk

P (Ck)L"

l=1

P (Fl|Ck)

1

f = (f1, f2, ..., fN )C ! {c1, c2, ..., cK}

H : f " C

P (C = ck|f1, f2, ..., fN )H(f) = argmax

kP (C = ck|f1, f2, ..., fN )

Dm = (fm, Cm) for m = 1...M

fn > T

w!f > T

IG =!

k

pleftk (1# pleft

k ) +!

k

prightk (1# pright

k )

IE = #!

k

pleftk log pleft

k #!

k

prightk log pright

k

argmaxk

P (Ck|f1, f2, ..., fN )

argmaxk

P (f1, f2, ..., fN |Ck)P (Ck)

P (f1, f2, ..., fN |Ck) =N"

n=1

P (fn|Ck)

Class(f) $ argmaxk

P (Ck)N"

n=1

P (fn|Ck)

Fl = {fl,1, fl,2, ..., fl,S}

P (f1, f2, ..., fN |Ck) =L"

l=1

P (Fl|Ck)

Class(f) $ argmaxk

P (Ck)L"

l=1

P (Fl|Ck)

1

Page 26

Naive vs Semi-Naive Bayes

• Full joint class-conditional distribution:

Intractable to estimate

• Naive approximation:

Too simplistic

Poor approximation to true posterior

• Semi-Naive:

Balance of complexity and tractability

Trade off complexity against performance via the choice of fern size (S) and number of ferns (L)

f = (f1, f2, ..., fN )C ! {c1, c2, ..., cK}

H : f " C

P (C = ck|f1, f2, ..., fN )H(f) = argmax

kP (C = ck|f1, f2, ..., fN )

Dm = (fm, Cm) for m = 1...M

fn > T

w!f > T

IG =!

k

pleftk (1# pleft

k ) +!

k

prightk (1# pright

k )

IE = #!

k

pleftk log pleft

k #!

k

prightk log pright

k

argmaxk

P (Ck|f1, f2, ..., fN )

argmaxk

P (f1, f2, ..., fN |Ck)P (Ck)

P (f1, f2, ..., fN |Ck) =N"

n=1

P (fn|Ck)

Class(f) $ argmaxk

P (Ck)N"

n=1

P (fn|Ck)

Fl = {f l,1, f l,2, ..., f l,S}

P (f1, f2, ..., fN |Ck) =L"

l=1

P (Fl|Ck)

1

f = (f1, f2, ..., fN )C ! {c1, c2, ..., cK}

H : f " C

P (C = ci|f1, f2, ..., fN )H(f) = argmax

iP (C = ci|f1, f2, ..., fN )

Dm = (fm, Cm) for m = 1...M

fi > T

w!f > T

IG =!

i

plefti (1# pleft

i ) +!

i

prighti (1# pright

i )

IE = #!

i

#plefti log pleft

i #!

i

#prighti log pright

i

argmaxk

P (Ck|f1, f2, ..., fN )

argmaxk

P (f1, f2, ..., fN |Ck)P (Ck)

P (f1, f2, ..., fN |Ck) =N"

i=1

P (fi|Ck)

Gij =12

#1n

!

k

(d2ik + d2

kj)# d2ij #

1n2

!

kl

d2kl

$

err(y) =!

ij

%Gij # y!i yj

&2

y!i ='

!!v!i for " =1,2,...,d

1

f = (f1, f2, ..., fN )C ! {c1, c2, ..., cK}

H : f " C

P (C = ck|f1, f2, ..., fN )H(f) = argmax

kP (C = ck|f1, f2, ..., fN )

Dm = (fm, Cm) for m = 1...M

fn > T

w!f > T

IG =!

k

pleftk (1# pleft

k ) +!

k

prightk (1# pright

k )

IE = #!

k

pleftk log pleft

k #!

k

prightk log pright

k

argmaxk

P (Ck|f1, f2, ..., fN )

argmaxk

P (f1, f2, ..., fN |Ck)P (Ck)

P (f1, f2, ..., fN |Ck)

P (f1, f2, ..., fN |Ck) =N"

n=1

P (fn|Ck)

Class(f) $ argmaxk

P (Ck)N"

n=1

P (fn|Ck)

1

Page 27

So how does a single Fern work?

• A Fern applies a series of S binary tests to the input vector I

• This gives an S-digit binary code for the feature, which may be interpreted as an integer in the range [0 ... 2^S − 1]

• It’s really just a simple hashing scheme that drops input vectors into one of 2^S “buckets”

e.g. relative intensities of a pair of pixels:

f1(I) = I(xa,ya) > I(xb,yb) -> true

f2(I) = I(xc,yc) > I(xd,yd) -> false

[Figure: the S test outcomes form a binary code, e.g. 0 1 0 1 1 1, which is read as a decimal integer]
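A minimal sketch of this hashing step (illustrative names; the pixel pairs here are chosen at random, as in the keypoint application later): S pixel-pair comparisons are packed into a single integer bucket index.

import numpy as np

def make_fern(patch_shape, S, rng):
    """Randomly choose S pixel pairs ((xa, ya), (xb, yb)) inside the patch."""
    h, w = patch_shape
    return [((rng.integers(h), rng.integers(w)), (rng.integers(h), rng.integers(w)))
            for _ in range(S)]

def fern_index(patch, fern):
    """Apply the S binary tests I(xa,ya) > I(xb,yb) and read the outcomes as an integer."""
    idx = 0
    for (a, b) in fern:
        bit = int(patch[a] > patch[b])
        idx = (idx << 1) | bit
    return idx                      # an integer in [0, 2**S - 1]

rng = np.random.default_rng(1)
fern = make_fern((32, 32), S=10, rng=rng)
patch = rng.random((32, 32))
print(fern_index(patch, fern))      # some bucket in [0, 1023]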

Page 28

So how does a single Fern work?

• The output of a fern when applied to a large number of input vectors of the same class is a multinomial distribution

[Figure: histogram p(F|C) over the fern outputs F = 0, 1, 2, 3, ..., 2^S − 1]

Page 29

Training a Fern

• Apply the fern to each labelled training example Dm=(Im,cm) and compute its output F(Dm)

• Learn multinomial densities p(F|Ck) as histograms of fern output for each class

[Figure: one histogram per class — p(F|C0), p(F|C1), p(F|C2), ..., p(F|CK) — each over fern outputs 0 ... 2^S − 1]
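A minimal training sketch (illustrative names, not the paper’s code): each class accumulates a histogram of fern outputs, here already with the add-one smoothing that a later slide (“One small subtlety...”) motivates, and the rows are normalized into multinomials p(F|Ck).

import numpy as np

def train_fern(examples, labels, fern_output, S, num_classes):
    """Learn p(F|Ck) as per-class histograms of the fern output, with add-one smoothing."""
    hist = np.ones((num_classes, 2 ** S))            # +1 in every bucket avoids zero probabilities
    for I, c in zip(examples, labels):
        hist[c, fern_output(I)] += 1
    return hist / hist.sum(axis=1, keepdims=True)    # row k is the multinomial p(F = z | Ck)

# Toy example: "inputs" are integers and the fern output is just their low S bits
S, num_classes = 4, 3
examples = [3, 3, 7, 12, 12, 12]
labels   = [0, 0, 0, 1,  1,  2]
p_F_given_C = train_fern(examples, labels, fern_output=lambda I: I % (2 ** S),
                         S=S, num_classes=num_classes)
print(p_F_given_C.shape)            # (3, 16): one multinomial per class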

Page 30

Classifying using a single Fern

• Given a test input, simply apply the fern and “look up” the posterior distribution over class label

[Figure: the fern is applied to the test input I, e.g. F(I) = [0 0 0 1 1] = 3; the value p(F=3|Ck) is read from each class histogram p(F|C0), ..., p(F|CK), and the resulting distribution is normalized to give the class posterior]

f = (f1, f2, ..., fN )C ! {c1, c2, ..., cK}

H : f " C

P (C = ck|f1, f2, ..., fN )H(f) = argmax

kP (C = ck|f1, f2, ..., fN )

Dm = (fm, Cm) for m = 1...M

fn > T

w!f > T

IG =!

k

pleftk (1# pleft

k ) +!

k

prightk (1# pright

k )

IE = #!

k

pleftk log pleft

k #!

k

prightk log pright

k

p(F = z|Ck) =N(F = z|Ck) + 1

"2S"1z=0 (N(F = z|Ck) + 1)

p(Ck|F ) =p(F |Ck)"k p(F |Ck)

argmaxk

P (Ck|f1, f2, ..., fN )

argmaxk

P (f1, f2, ..., fN |Ck)P (Ck)

P (f1, f2, ..., fN |Ck)

P (f1, f2, ..., fN |Ck) =N#

n=1

P (fn|Ck)

Class(f) $ argmaxk

P (Ck)N#

n=1

P (fn|Ck)

1

Page 31

Adding randomness: an ensemble of ferns

• A single fern does not give great classification performance

• But ... we can build an ensemble of “independent” ferns by randomly choosing different subsets of features

e.g. F1 = {f2, f7, f22, f5, f9}

F2 = {f4, f1, f11, f8, f3}

F3 = {f6, f31, f28, f11, f2}

• Finally, combine their outputs using Semi-Naive Bayes:


f = (f1, f2, ..., fN )C ! {c1, c2, ..., cK}

H : f " C

P (C = ck|f1, f2, ..., fN )H(f) = argmax

kP (C = ck|f1, f2, ..., fN )

Dm = (fm, Cm) for m = 1...M

fn > T

w!f > T

IG =!

k

pleftk (1# pleft

k ) +!

k

prightk (1# pright

k )

IE = #!

k

pleftk log pleft

k #!

k

prightk log pright

k

argmaxk

P (Ck|f1, f2, ..., fN )

argmaxk

P (f1, f2, ..., fN |Ck)P (Ck)

P (f1, f2, ..., fN |Ck) =N"

n=1

P (fn|Ck)

Class(f) $ argmaxk

P (Ck)N"

n=1

P (fn|Ck)

Fl = {f l,1, f l,2, ..., f l,S}

P (f1, f2, ..., fN |Ck) =L"

l=1

P (Fl|Ck)

Class(f) $ argmaxk

P (Ck)L"

l=1

P (Fl|Ck)

1
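Putting the pieces together, here is a small end-to-end sketch of an ensemble of ferns on synthetic binary features (all names and data are illustrative): each fern is a random subset of S feature indices, each learns smoothed histograms p(F_l|Ck), and classification combines them with the Semi-Naive Bayes rule above, using sums of logs rather than products for numerical stability.

import numpy as np

rng = np.random.default_rng(0)

def fern_output(f, subset):
    """Pack the binary features selected by `subset` into an integer (the fern's bucket)."""
    idx = 0
    for n in subset:
        idx = (idx << 1) | int(f[n])
    return idx

def train_ferns(F, y, ferns, num_classes):
    """One smoothed histogram p(F_l | Ck) per fern, plus the class prior P(Ck)."""
    S = len(ferns[0])
    hists = np.ones((len(ferns), num_classes, 2 ** S))        # add-one smoothing
    for f, c in zip(F, y):
        for l, subset in enumerate(ferns):
            hists[l, c, fern_output(f, subset)] += 1
    hists /= hists.sum(axis=2, keepdims=True)
    prior = np.bincount(y, minlength=num_classes) / len(y)
    return prior, hists

def classify(f, ferns, prior, hists):
    """Class(f) ~ argmax_k  log P(Ck) + sum_l log P(F_l | Ck)  (Semi-Naive Bayes)."""
    log_post = np.log(prior).copy()
    for l, subset in enumerate(ferns):
        log_post += np.log(hists[l, :, fern_output(f, subset)])
    return int(np.argmax(log_post))

# Synthetic binary features: class 0 tends to have zeros, class 1 tends to have ones
N, L, S = 32, 3, 5
F = (rng.random((200, N)) < np.repeat([0.2, 0.8], 100)[:, None]).astype(int)
y = np.repeat([0, 1], 100)
ferns = [rng.choice(N, size=S, replace=False) for _ in range(L)]   # random feature subsets
prior, hists = train_ferns(F, y, ferns, num_classes=2)
print(classify(F[0], ferns, prior, hists), classify(F[150], ferns, prior, hists))  # likely 0 1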

Page 32

Classifying using Random Ferns

• Having randomly selected and trained a collection of ferns, classifying new inputs involves only simple look-up operations

[Figure: the input is hashed by Fern 1, Fern 2, ..., Fern L; the looked-up distributions are combined into a posterior over class label]

Page 33

One small subtlety...

• Even for moderate size ferns, the output range can be large

e.g. fern size S=10, output range = [0 ... 2^10 − 1] = [0 ... 1023]

• Even with large training set, many “buckets” may see no samples

• So ... assume a Dirichlet prior on the distributions P(F|Ck)

• Assigns a baseline low probability to all possible fern outputs:

where N(F=z|Ck) is the number of times we observe fern output equal to z in the training set for class Ck

[Figure: histogram p(F|Ck) over fern outputs 0, 1, 2, 3, ..., 2^S − 1, with many empty buckets]

These zero-probabilities will be problematic for our Semi-Naive Bayes!

f = (f1, f2, ..., fN )C ! {c1, c2, ..., cK}

H : f " C

P (C = ck|f1, f2, ..., fN )H(f) = argmax

kP (C = ck|f1, f2, ..., fN )

Dm = (fm, Cm) for m = 1...M

fn > T

w!f > T

IG =!

k

pleftk (1# pleft

k ) +!

k

prightk (1# pright

k )

IE = #!

k

pleftk log pleft

k #!

k

prightk log pright

k

p(F = z|Ck) =N(F = z|Ck) + 1

"2S"1z=0 (N(F = z|Ck) + 1)

argmaxk

P (Ck|f1, f2, ..., fN )

argmaxk

P (f1, f2, ..., fN |Ck)P (Ck)

P (f1, f2, ..., fN |Ck)

P (f1, f2, ..., fN |Ck) =N#

n=1

P (fn|Ck)

Class(f) $ argmaxk

P (Ck)N#

n=1

P (fn|Ck)

1

Page 34

Comparing Forests and Ferns

Forests (tree classifier)

• Decision trees directly learn the posterior P(Ck|F)

• Apply a different sequence of tests in each child node

• Training time grows exponentially with tree depth

• Combine tree hypotheses by averaging

Ferns (fern classifier)

• Learn the class-conditional distributions P(F|Ck)

• Apply the same sequence of tests to every input vector

• Training time grows linearly with fern size S

• Combine hypotheses using Bayes rule (multiplication)

Page 35

Application: Fast Keypoint Matching

• Ozuysal et al. use Random Ferns for keypoint recognition

• Similar to typical application of SIFT descriptors

• Very robust to affine viewing deformations

• Very fast to evaluate (13.5 microsec per keypoint)

Page 36

Training

• Each keypoint to be recognized is a separate class

• Training sets are generated by synthesizing random affine deformations of the image patch (10000 samples)

• Features are pairwise intensity comparisons: fn(I) = I(xa,ya) > I(xb,yb)

• Fern size S=10 (randomly selected pixel pairs)

• Ensembles of 5 to 50 ferns are tested

[Figure 6 from Ozuysal et al.: “Warped patches from the images of Figure 5 show the range of affine deformations we considered. In each line, the left-most patch is the original one and the others are deformed versions of it. (a) Sample patches from the City image. (b) Sample patches from the Flowers image. (c) Sample patches from the Museum image.”]

Affine image deformations are represented as 2 × 2 matrices of the form R_θ R_φ^{-1} diag(λ1, λ2) R_φ, where diag(λ1, λ2) is a diagonal 2 × 2 matrix and R_γ is a rotation by angle γ. Both the training and test images are warped with such deformations, with θ and φ chosen randomly in the [0 : 2π] range and λ1, λ2 in the [0.6 : 1.5] range; 30 random affine deformations per degree of rotation are used to produce 10800 warped samples.

Synthesized viewpoint variation
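A small sketch (assumed helper names) of generating one such random deformation with the parameter ranges quoted above; actually warping the image patches (e.g. with an image-processing library) is omitted.

import numpy as np

rng = np.random.default_rng(0)

def rotation(angle):
    """2x2 rotation matrix R_angle."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

def random_affine():
    """One 2x2 deformation A = R_theta R_phi^{-1} diag(l1, l2) R_phi."""
    theta, phi = rng.uniform(0, 2 * np.pi, size=2)
    l1, l2 = rng.uniform(0.6, 1.5, size=2)
    R_phi = rotation(phi)
    return rotation(theta) @ np.linalg.inv(R_phi) @ np.diag([l1, l2]) @ R_phi

print(random_affine())      # apply to patch coordinates to synthesize one training view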

Page 37

Classification Rate vs Number of Ferns

[Plot: Average Classification Rate (%) vs. # of Structures (1–50, depth/size 10), comparing Ferns and Random Trees. From Ozuysal et al., Figure 4: “The percentage of correctly classified image patches is shown against different classifier parameters for the fern and tree based methods. The independence assumption between ferns allows the joint utilization of the features resulting in a higher classification rate. The error bars show the 95% confidence interval assuming a Gaussian distribution.”]

• 300 classes were used

• Also compares to Random Trees of equivalent size

Page 38

Classification Rate vs Number of Classes (Keypoints)

[Plot: Average Classification Rate (%) vs. # of Classes (0–2000), comparing Ferns (20 or 30 ferns of size 10) with Random Trees (20 or 30 trees of depth 10). From Ozuysal et al., Figure 5: “Classification using ferns can handle many more classes than Random Tree based methods. For both figures ferns with size 10 and trees with depth 10 are used.”]

• 30 random ferns of size 10 were used

Page 39

Drawback: Memory requirements

• Fern classifiers can be very memory hungry, e.g.

Fern size = 11

Number of ferns = 50

Number of classes = 1000

RAM = 2^(fern size) × sizeof(float) × NumFerns × NumClasses

    = 2048 × 4 × 50 × 1000

    ≈ 400 MBytes!
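The same back-of-the-envelope calculation as a one-line helper (illustrative, assuming one 4-byte float per histogram entry):

def fern_memory_bytes(fern_size, num_ferns, num_classes, bytes_per_entry=4):
    """Storage for the p(F|Ck) tables: one float per (fern, class, bucket)."""
    return (2 ** fern_size) * bytes_per_entry * num_ferns * num_classes

print(fern_memory_bytes(11, 50, 1000) / 1e6, "MB")   # 409.6 MB, i.e. roughly 400 MBytes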

Page 40

Conclusions?

• Random Ferns are easy to understand and easy to train

• Very fast to perform classification once trained

• Provide a probabilistic output (how accurate though?)

• Appear to outperform Random Forests

• Can be very memory hungry!