Classification Based on Hybridization of Parametric and Nonparametric Classifiers

Probal Chaudhuri, Anil K. Ghosh, and Hannu Oja

Abstract—Parametric methods of classification assume specific parametric models for the competing population densities (e.g., Gaussian population densities lead to linear and quadratic discriminant analysis), and they work well when these model assumptions are valid. Violation of one or more of these parametric model assumptions often leads to a poor classifier. On the other hand, nonparametric classifiers (e.g., nearest-neighbor and kernel-based classifiers) are more flexible and free from parametric model assumptions, but the statistical instability of these classifiers may lead to poor performance when the training sample is small. Moreover, nonparametric methods do not use any parametric structure of the population densities, so even when one has additional information about the population densities, that important information cannot be used to modify the nonparametric classification rule. This paper attempts to overcome these limitations of the parametric and nonparametric approaches and combines their strengths to develop hybrid classification methods. We use simulated examples and benchmark data sets to examine the performance of these hybrid discriminant analysis tools. Asymptotic results on their misclassification rates are derived under appropriate regularity conditions.

Index Terms—Bayes risk, bandwidth, kernel density estimation, LDA, misclassification rate, multiscale smoothing, nearest neighbor, QDA.


1 INTRODUCTION

A popular approach in discriminant analysis is to estimate the population densities using the training data and then plug those estimates into the Bayes rule [36] to construct a classifier. If $\hat{f}_j$ is the estimated density function for the $j$th population ($j = 1, 2, \ldots, J$) and $\pi_j$ is its prior probability, the classification rule $d: \mathbb{R}^d \to \{1, 2, \ldots, J\}$ can be expressed as $d(x) = \arg\max_j \pi_j \hat{f}_j(x)$. When the $\pi_j$s are not known, one usually estimates them using the proportions of observations from different classes in the training sample. The density estimate $\hat{f}_j$ ($j = 1, 2, \ldots, J$) can be computed either parametrically or nonparametrically. In parametric approaches [36], [23], the $f_j$s are assumed to be known except for a few parameters, which are estimated using the training data. For instance, in Fisher's [10] linear and quadratic discriminant analysis (LDA and QDA), the $f_j$s are assumed to be normal with equal and unequal scatter matrices (variance-covariance matrices), respectively. When the true $f_j$s are close to the assumed parametric models, parametric classifiers are expected to perform very well. However, traditional parametric classifiers often perform poorly when one or more model assumptions are violated. In practice, it is difficult to verify the validity of model assumptions, and improper models may lead to poor performance by the resulting classifier. Nonparametric classifiers [9], [23], on the other hand, are more flexible and free from such parametric model assumptions, but the statistical instability of these methods is a major problem, especially in the presence of small training samples.

Kernel discriminant analysis [20], [43] and nearest neighbor classification [11], [8] are two well-known examples of nonparametric classifiers; they use the kernel method [43], [40], [45] and the nearest neighbor method [33], [14] for density estimation, respectively. However, when one has some insight or additional information suggesting that the population distributions are close to some parametric model, one major limitation of these traditional nonparametric methods is their inability to utilize that additional information. This paper aims to overcome these limitations of traditional parametric and nonparametric classifiers and combines their strengths to develop hybrid classification techniques.

Here, we assume a parametric model $f_j(x) = f_j^0(x; \theta_j)$ ($= f_j^0(x)$, say) for the $j$th class ($j = 1, 2, \ldots, J$) to start with and estimate the unknown parameter $\theta_j$ (which may be a real parameter or a finite-dimensional vector-valued parameter) to get the initial parametric density estimate $\hat{f}_j^0(x) = f_j^0(x; \hat{\theta}_j)$. Note that if the true population density $f_j$ is far from the assumed parametric model, the use of $\hat{f}_j^0$ is likely to lead to poor classification. Therefore, one needs to make some adjustment to these parametric density estimates. Here, we use a nonparametric adjustment factor ($\lambda_j: \mathbb{R}^d \to \mathbb{R}^+$) for this purpose, and the final density estimate is obtained as $\hat{f}_j^*(x) = \lambda_j(x)\hat{f}_j^0(x)$. When the $f_j^0$s ($j = 1, 2, \ldots, J$) are close to the true densities $f_j$, the $\lambda_j$s are expected to be very close to 1 over the entire measurement space (different formulae for the $\lambda_j$s are proposed


in Section 2). As a consequence, the resulting hybrid classifier is expected to perform like the associated parametric classifier and much better than fully nonparametric classifiers. Again, when the true densities are far from the assumed parametric models, the nonparametric adjustment factors provide a safeguard against deviations from parametric model assumptions by making significant adjustments to the initial parametric density estimates. As a result, the performance of the hybrid classifier improves considerably and becomes comparable to that of nonparametric classifiers in such cases.

Hjort and Glad [24] proposed one such hybrid method for density estimation using a nonparametric adjustment factor based on kernels. A similar method was also considered in [30]. Hjort and Jones [25] proposed another hybrid method for density estimation, where they computed the adjustment factor based on a local likelihood criterion. Different hybrid density estimation methods blending parametric and nonparametric techniques have also been discussed in [37], [3], [28], and [4]. Glad [18] proposed a hybrid method for regression problems as well, but no such hybrid technique has yet been adopted for classification. In this paper, we investigate different methods for finding the adjustment factor and the hybrid classifier; they are described in Section 2.1. Although most of the existing hybrid methods are based on kernels, we borrow these ideas to develop nearest neighbor versions for hybrid classification in Section 2.2. Like other nonparametric methods, the adjustment factors and, hence, the misclassification rate of the resulting classifier depend on the values of the associated smoothing parameters. In classification problems, a good choice of smoothing parameters depends not only on the entire training sample but also on the specific observation to be classified. Therefore, instead of using a single scale of smoothing, it is useful to study the classification results for multiple levels of smoothing and then aggregate them judiciously. This type of multiscale analysis was carried out in [5], [6], [19] in the context of function estimation and in [26], [27], [15], [16] for classification. This aggregation method is discussed in Section 2.3. Results related to the asymptotic convergence of the error rates of hybrid classifiers to the Bayes risk are given in Section 2.4. Section 2.5 contains a detailed description of the aggregation method. In Sections 3 and 4, we use simulated and benchmark data sets, respectively, to compare the empirical performance of the hybrid classifiers with parametric and nonparametric methods. Finally, Section 5 contains a brief summary of the work and concluding remarks.

2 DESCRIPTION OF HYBRID CLASSIFIERS

As mentioned before, we start with a parametric model that assumes $f_j(x) = f_j^0(x; \theta_j)$ ($j = 1, 2, \ldots, J$) and estimate the parameter $\theta_j$ using the training data to get the initial parametric density estimate $\hat{f}_j^0(x) = f_j^0(x; \hat{\theta}_j)$. Then, a nonparametric adjustment factor $\lambda_j$ is used to modify $\hat{f}_j^0(x)$ and to get the final density estimate $\hat{f}_j^*(x) = \lambda_j(x)\hat{f}_j^0(x)$.

2.1 Hybridization of Kernel and Parametric Classifiers

Here, we propose three different types of adjustment factors for the construction of $\hat{f}_j^*(x)$ and the classification rule $d^*(x) = \arg\max_j \pi_j \hat{f}_j^*(x)$.

Method-1. Along with the parametric density estimate $\hat{f}_j^0(x)$ ($j = 1, 2, \ldots, J$), we consider the nonparametric kernel density estimate [43], [40], [45] $\hat{f}_{jh_j}^1(x)$, which is given by

$$\hat{f}_{jh_j}^1(x) = n_j^{-1} h_j^{-d} \sum_{i=1}^{n_j} K\{h_j^{-1}(x - x_{ji})\},$$

where $x_{j1}, x_{j2}, \ldots, x_{jn_j}$ are observations from the $j$th population, $K$ is a $d$-dimensional density function symmetric about $0$, and $h_j > 0$ is the associated smoothing parameter, known as the bandwidth. Several choices for the kernel function $K$ are available in the literature [43]. Throughout this paper, we shall use the Gaussian kernel $K(t) = (2\pi)^{-d/2} e^{-t't/2}$. To compute the adjustment factor $\lambda_j(x)$, we consider a closed ball $B(x, r_j) = \{y: \|x - y\| \le r_j\}$ of radius $r_j$ around $x$ and calculate the probability measure of that ball, $P\{X \in B(x, r_j)\}$, under both the estimated parametric and the nonparametric models. Note that if the assumed parametric model is correct, the ratio of these two probabilities is expected to be very close to unity; otherwise, it will vary depending on the amount of deviation from the parametric model. Therefore, if we denote these two probabilities by $P^0(x, r_j)$ and $P^1(x, r_j)$, respectively, the adjustment factor $\lambda_j$ can be given by

$$\lambda_j(x, r_j) = P^1(x, r_j)/P^0(x, r_j).$$

Therefore, the final density estimate at $x$ is obtained as

$$\hat{f}_{jh_j}^{(1)}(x) = \hat{f}_j^0(x)\, P^1(x, r_j)/P^0(x, r_j).$$

The radius $r_j$ of the ball behaves like a smoothing parameter here. As $r_j$ increases, $\lambda_j$ tends to be smoother. Note that if the $r_j$s ($j = 1, 2, \ldots, J$) are small, then for all $j = 1, 2, \ldots, J$, $P^1(x, r_j)/P^0(x, r_j)$ will be close to $\hat{f}_{jh_j}^1(x)/\hat{f}_j^0(x)$ and, in that case, the resulting classifier will behave like the associated nonparametric classifier. On the other hand, for large values of $r_j$, $P^1$, $P^0$, and $P^1/P^0$ are all expected to be close to one; in that case, the hybrid classifier will behave like the associated parametric classifier.

In addition to the $r_j$s ($j = 1, 2, \ldots, J$), there is another set of smoothing parameters, the $h_j$s, involved in this method. Instead of dealing with these two sets of smoothing parameters simultaneously, for computational simplicity, we take $r_j = 3h_j$ for all $j = 1, 2, \ldots, J$. Note that if $K$ is Gaussian, observations $y$ with $\|y - x\|/h_j > 3$ have a negligible effect on $\hat{f}_{jh_j}^1(x)$, which motivated us to take $r_j = 3h_j$. The use of the same bandwidth $h_j$ in all directions requires some preliminary transformation of the data (i.e., sphering of the data). Here, we used the usual moment-based estimate of the scatter matrix (dispersion matrix) for this purpose.
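To make the computation concrete, here is a minimal Python/NumPy sketch of how the Method-1 hybrid estimate could be evaluated at a single point for one class, assuming a Gaussian initial model, a Gaussian kernel, and Monte Carlo estimation of the two ball probabilities. The function name, the Monte Carlo sample size, and the use of scipy are illustrative choices on our part, not part of the paper; Section 3 describes the Monte Carlo estimation and the safeguard used in our experiments.

```python
import numpy as np
from scipy.stats import multivariate_normal

def hybrid_density_method1(x, X_j, h, n_mc=5000, rng=None):
    """Sketch of the Method-1 hybrid density estimate at a point x for one class.

    X_j : (n_j, d) array of (already sphered) training observations from class j.
    h   : Gaussian-kernel bandwidth; the ball radius is taken as r = 3 * h.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_j, d = X_j.shape
    r = 3.0 * h  # ball radius tied to the bandwidth, as described above

    # Initial parametric model: Gaussian with moment-based estimates.
    mu, Sigma = X_j.mean(axis=0), np.cov(X_j, rowvar=False)

    # Gaussian kernel density estimate f1(x).
    diffs = (x - X_j) / h
    f1_x = np.exp(-0.5 * np.sum(diffs**2, axis=1)).mean() / ((2 * np.pi)**(d / 2) * h**d)

    # Monte Carlo estimate of P0 = P{X in B(x, r)} under the fitted Gaussian.
    samp0 = rng.multivariate_normal(mu, Sigma, size=n_mc)
    P0 = np.mean(np.linalg.norm(samp0 - x, axis=1) <= r)

    # Monte Carlo estimate of P1 under the kernel density estimate:
    # pick a training point at random and add h * N(0, I) noise.
    idx = rng.integers(0, n_j, size=n_mc)
    samp1 = X_j[idx] + h * rng.standard_normal((n_mc, d))
    P1 = np.mean(np.linalg.norm(samp1 - x, axis=1) <= r)

    # Safeguard used in the experiments of Section 3: fall back to the
    # kernel estimate when the estimated P0 is very small.
    if P0 <= 0.001:
        return f1_x
    return multivariate_normal.pdf(x, mean=mu, cov=Sigma) * P1 / P0
```

Evaluating this function for every class at the point to be classified and plugging the results into $d^*(x) = \arg\max_j \pi_j \hat{f}_j^*(x)$ gives the Method-1 hybrid classifier for one value of $h$.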

To demonstrate the behavior of the adjustment factor and that of the hybrid density estimate for varying choices of the smoothing parameter, we carried out a simulation study. We generated 50 observations from an equal mixture of $N(1, 0.25)$ and $N(-1, 0.25)$ distributions and estimated the density function using a parametric method (assuming normality) and a nonparametric method (kernel density estimation). As expected, the nonparametric density estimate (light gray line in Fig. 1a) was quite close to the true density function (black line), but the parametric density estimate (dark gray line) was very poor. The hybrid method could overcome this limitation of the parametric approach. In Figs. 1b and 1c, we plotted the adjustment factors and the corresponding hybrid density estimates, respectively, for seven equidistant values of $h$, from $h = 0.1$ (indicated by the light gray line) to $h = 0.7$ (dark gray line). As one would expect, for higher values of $h$, the adjustment function was quite flat, eventually leading to the parametric density estimate. But for a small $h$, the adjustment was quite significant, and we could achieve a good estimate of the density function.

Fig. 1. Parametric, nonparametric, and hybrid density estimates.

We considered another example with two bivariate populations, each being an equal mixture of two bivariate normal distributions having the same scatter matrix $0.25I_2$. The first population is a mixture of two normals with location parameters $(1, 1)$ and $(-1, -1)$, and the second population is a mixture of two normals with location parameters $(-1, 1)$ and $(1, -1)$. Note that the optimal classifier for these bivariate populations depends on the sign of the product of the two measurement variables. When we assumed the population distributions to be normal with the same scatter matrix, we ended up with the linear classifier (see Fig. 2a) that misclassified almost half of the observations. But, after adjusting these parametric density estimates, we could arrive at a good hybrid classifier (see Fig. 2b).

Fig. 2. Parametric and hybrid classifiers.

Method-2. Another hybrid density estimate, constructed by using a nonparametric adjustment of a parametric density estimate, was proposed in [24]. It is given by

$$\hat{f}_{jh_j}^{(2)}(x) = \hat{f}_j^0(x)\cdot\frac{1}{n_j}\sum_{i=1}^{n_j} h_j^{-d} K\{h_j^{-1}(x - x_{ji})\}\frac{1}{\hat{f}_j^0(x_{ji})} = \frac{1}{n_j}\sum_{i=1}^{n_j} h_j^{-d} K\{h_j^{-1}(x - x_{ji})\}\frac{\hat{f}_j^0(x)}{\hat{f}_j^0(x_{ji})}.$$

A similar kind of nonparametric adjustment factor was also proposed in [30], where, instead of starting with a parametric model, the authors considered a general class of initial density estimates. One should note that the term $\hat{f}_j^0(x)/\hat{f}_j^0(x_{ji})$ can be very influential when $\hat{f}_j^0(x_{ji})$ is close to zero but $\hat{f}_j^0(x)$ is away from it. To guard against such situations, following [24], we truncated the values of these ratios so that $0.1 \le \hat{f}_j^0(x)/\hat{f}_j^0(x_{ji}) \le 10$.
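As an illustration, a minimal NumPy sketch of this kernel-adjusted estimate for one class, including the $[0.1, 10]$ truncation, might look as follows; the Gaussian initial model, the function name, and the parameter names are our own illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def hybrid_density_method2(x, X_j, h):
    """Sketch of the Method-2 (kernel-adjusted) hybrid density estimate at x.

    X_j : (n_j, d) array of training observations from class j.
    h   : bandwidth of the Gaussian kernel.
    """
    n_j, d = X_j.shape

    # Initial parametric (Gaussian) model fitted by moment estimates.
    mu, Sigma = X_j.mean(axis=0), np.cov(X_j, rowvar=False)
    f0_x = multivariate_normal.pdf(x, mean=mu, cov=Sigma)
    f0_xi = multivariate_normal.pdf(X_j, mean=mu, cov=Sigma)

    # Ratios f0(x) / f0(x_ji), truncated to [0.1, 10] as described above.
    ratios = np.clip(f0_x / f0_xi, 0.1, 10.0)

    # Gaussian kernel weights h^{-d} K{(x - x_ji)/h}.
    diffs = (x - X_j) / h
    kern = np.exp(-0.5 * np.sum(diffs**2, axis=1)) / ((2 * np.pi)**(d / 2) * h**d)

    # f^(2)(x) = (1/n_j) * sum_i kern_i * f0(x)/f0(x_ji)
    return np.mean(kern * ratios)
```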

Method-3. The adjustment factor $\lambda_j(x)$ can also be obtained by maximizing the local log-likelihood score [25]

$$L\{\lambda_j(x)\} = \hat{f}_{jh_j}^1(x)\log\lambda_j(x) - \lambda_j(x)\, h_j^{-d}\int K\{h_j^{-1}(x - t)\}\hat{f}_j^0(t)\,dt.$$


The resulting density estimates are given by

$$\hat{f}_{jh_j}^{(3)}(x) = \hat{f}_{jh_j}^1(x)\hat{f}_j^0(x)\big/\big(K_{h_j} * \hat{f}_j^0\big)(x), \quad j = 1, 2, \ldots, J,$$

where $K_h(t) = h^{-d}K(t/h)$ and "$*$" denotes convolution. If the assumed parametric model $f_j^0$ and the kernel $K$ are both Gaussian, so is the convolution, and in that case $K_{h_j} * \hat{f}_j^0$ has a nice closed-form expression.

2.2 Hybridization of Nearest Neighbor and Parametric Classifiers

We adopt similar ideas to find the adjustment factors and, hence, the hybrid density estimates based on the nearest neighbor method.

Method-1. As in the kernel method, here we use the adjustment factor $P^1/P^0$, but with different choices of neighborhoods. Instead of considering different balls (neighborhoods) for different classes, as in usual nearest neighbor classification [11], [7], we use the same ball $B_{n,k}(x) = \{y: \|x - y\| \le \|x - x_{(k,n)}\|\}$ for all populations, where $x_{(k,n)}$ is the $k$th nearest neighbor of $x$ in the training sample of size $n$. The resulting hybrid density estimates are given by

$$\hat{f}_{j,k}^{(1)}(x) = \hat{f}_j^0(x)\,\frac{k_j/n_j}{P_{\hat{f}_j^0}\{X \in B_{n,k}(x)\}}, \quad j = 1, 2, \ldots, J,$$

where $n_j$ ($\sum n_j = n$) is the number of training sample observations from the $j$th class and $k_j$ ($\sum k_j = k$) of them lie in $B_{n,k}(x)$. Since the measurement variables are not always of comparable units and scales, for our data analysis, we standardized the observations using the usual moment-based estimate of the pooled scatter matrix before using the euclidean metric. This is equivalent to using the Mahalanobis distance [35]. However, one may use other flexible or adaptive metrics [12], [22] as well. Unlike the kernel-based hybrid methods, here the hybrid density estimates and, hence, the resulting classifier depend on a single smoothing parameter $k$. Although one can use balls of different radii (depending on $k_j$) for different populations, we do not consider those in this paper.
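A minimal sketch of this nearest neighbor version for one class is given below, with the ball probability under the fitted Gaussian approximated by Monte Carlo sampling; the function name, the Monte Carlo step, and the fallback to the plain nearest neighbor estimate when that probability is tiny are our own illustrative choices.

```python
import numpy as np
from math import gamma, pi
from scipy.stats import multivariate_normal

def hybrid_density_nn_method1(x, X, y, j, k, n_mc=5000, rng=None):
    """Sketch of the NN Method-1 hybrid density estimate of class j at x.

    X : (n, d) pooled standardized training sample, y : (n,) class labels,
    k : number of nearest neighbors defining the common ball B_{n,k}(x).
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape

    # Radius of the common ball: distance to the kth nearest neighbor of x.
    dist = np.linalg.norm(X - x, axis=1)
    radius = np.sort(dist)[k - 1]

    # k_j and n_j for class j.
    n_j = np.sum(y == j)
    k_j = np.sum((dist <= radius) & (y == j))

    # Initial parametric (Gaussian) model for class j, fitted by moments.
    X_j = X[y == j]
    mu, Sigma = X_j.mean(axis=0), np.cov(X_j, rowvar=False)

    # P0: probability of the ball under the fitted Gaussian (Monte Carlo).
    samp = rng.multivariate_normal(mu, Sigma, size=n_mc)
    P0 = np.mean(np.linalg.norm(samp - x, axis=1) <= radius)

    if P0 <= 0.001:
        # Fall back to the plain NN density estimate k_j / (n_j * volume).
        vol = pi**(d / 2) * radius**d / gamma(d / 2 + 1)
        return k_j / (n_j * vol)

    # f^(1)_{j,k}(x) = f0(x) * (k_j / n_j) / P0
    return multivariate_normal.pdf(x, mean=mu, cov=Sigma) * (k_j / n_j) / P0
```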

Method-2. Although no hybrid density estimate based on the nearest neighbor method was proposed in [24], such an extension is possible. Once again, we use the same neighborhood $B_{n,k}(x)$ for all populations, and the resulting density estimates can be expressed as

$$\hat{f}_{j,k}^{(2)}(x) = \frac{\hat{f}_j^0(x)}{n_j\,Vol\{B_{n,k}(x)\}}\sum_{x_{ji} \in B_{n,k}(x)} 1/\hat{f}_j^0(x_{ji}), \quad j = 1, 2, \ldots, J,$$

where $Vol(A)$ stands for the volume of $A$. To avoid high sensitivity of $\hat{f}_j^0(x)/\hat{f}_j^0(x_{ji})$ when $\hat{f}_j^0(x_{ji}) \approx 0$, we truncated these ratios so that they take values in the interval $[0.1, 10]$.
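A corresponding sketch of this nearest neighbor extension (Gaussian initial model, names again illustrative):

```python
import numpy as np
from math import gamma, pi
from scipy.stats import multivariate_normal

def hybrid_density_nn_method2(x, X, y, j, k):
    """Sketch of the NN Method-2 hybrid density estimate of class j at x.
    X : (n, d) standardized pooled training sample, y : (n,) class labels."""
    n, d = X.shape
    dist = np.linalg.norm(X - x, axis=1)
    radius = np.sort(dist)[k - 1]                        # kth nearest neighbor of x
    vol = pi**(d / 2) * radius**d / gamma(d / 2 + 1)     # Vol{B_{n,k}(x)}

    X_j = X[y == j]
    n_j = X_j.shape[0]
    mu, Sigma = X_j.mean(axis=0), np.cov(X_j, rowvar=False)
    f0_x = multivariate_normal.pdf(x, mean=mu, cov=Sigma)

    # Class-j observations falling inside the common ball.
    neighbors = X_j[np.linalg.norm(X_j - x, axis=1) <= radius]
    if neighbors.shape[0] == 0:
        return 0.0
    f0_xi = multivariate_normal.pdf(neighbors, mean=mu, cov=Sigma)

    # Truncate the ratios f0(x)/f0(x_ji) to [0.1, 10], then apply the formula.
    ratios = np.clip(f0_x / f0_xi, 0.1, 10.0)
    return np.sum(ratios) / (n_j * vol)
```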

Method-3. A straightforward conversion of the local log-likelihood method into its nearest neighbor version is not feasible because there is no meaningful analog of the convolution in the denominator of $\hat{f}_{jh_j}^{(3)}(x)$. However, one may view the convolution as the expectation of the kernel density estimate under $\hat{f}_j^0$. Therefore, it can be replaced by the expectation of the nearest neighbor density estimate to get an analogous version:

$$\hat{f}_{j,k}^{(3)}(x) = \hat{f}_j^0(x)\,\frac{k_j/n_jV_{n,k}(x)}{E_{\hat{f}^0}\{k_j/n_jV_{n,k}(x)\}} = \hat{f}_j^0(x)\,\frac{k_j/V_{n,k}(x)}{E_{\hat{f}^0}\{k_j/V_{n,k}(x)\}},$$

where $V_{n,k}(x) = Vol\{B_{n,k}(x)\}$ and $E_{\hat{f}^0}$ denotes the expectation over the whole training sample, where $x_{ji} \sim \hat{f}_j^0$ for all $j = 1, 2, \ldots, J$ and $i = 1, 2, \ldots, n_j$. Unlike the kernel method, here the denominator of $\hat{f}_{j,k}^{(3)}$ may not have a closed-form expression. One can approximate it by an empirical average computed using repeated generation of observations from the initial parametric distributions. Note that, like usual nearest neighbor density estimates, the hybrid density estimates discussed in Sections 2.1 and 2.2 may not integrate to one, but this is expected to have very little impact on the performance of hybrid classification, where one is interested in the relative ordering of the density estimates rather than their absolute accuracy.

2.3 Multiscale Approach and Aggregation of Results

For each of the hybrid methods described in Sections 2.1 and 2.2, the adjustment factor $\lambda_j(x)$ ($j = 1, 2, \ldots, J$) depends on the associated smoothing parameter. In classification problems, a good choice of the smoothing parameter for a class depends not only on that class itself but also on the competing class densities. Therefore, instead of looking at the accuracy of the hybrid density estimates separately, it is more meaningful to consider all density estimates corresponding to different classes simultaneously. But, when there are several competing classes, finding a good set of smoothing parameters is computationally infeasible, as there would be different smoothing parameters associated with the density estimates of different classes. To reduce this computational cost, in the case of kernel-based methods, we have used the same bandwidth $h$ for all populations after standardizing the observations using the usual moment-based estimate of the pooled scatter matrix. Note that in the case of nearest neighbor-based methods, our construction of hybrid classifiers (as discussed in Section 2.2) allows the method to depend on only one smoothing parameter $k$. Here also, we have standardized the observations in the same way.

One can use a cross-validation [32] method to find the optimal value of the smoothing parameter (such as the bandwidth of a kernel or the number of nearest neighbors) and use it for the classification of all observations. However, one should note that, in addition to depending on the training sample, a good choice of the smoothing parameter depends on the specific observation to be classified. Therefore, instead of using a fixed scale of smoothing over the entire measurement space, it may be more useful to simultaneously study the classification results for multiple scales of smoothing in some appropriate range. The usefulness of this multiscale smoothing has been discussed in the literature by many authors, both in the context of function estimation [5], [6], [19] and of classification [26], [27], [15], [16]. Results obtained with different levels of smoothing are aggregated in a judicious way to arrive at the final classifier. Following the idea in [15] and [16], we used the weighted average of posterior probabilities to aggregate these results.


The aggregated final classifier can be expressed as

$$d_A(x) = \arg\max_j \sum_{s} W(s)\left\{\frac{\pi_j \hat{f}_{js}^*(x)}{\sum_{t=1}^{J}\pi_t \hat{f}_{ts}^*(x)}\right\},$$

where $s$ is the smoothing parameter, $W(s)$ is the weight function ($\sum W(s) = 1$), and $\hat{f}_{js}^*$ is the hybrid density estimate for the $j$th class ($j = 1, 2, \ldots, J$). Note that popular ensemble methods like bagging [2] and boosting [13] also adopt similar ideas for aggregation of classification results. Since the bandwidth $h$ of a kernel is a continuous smoothing parameter, it is not possible to use all values of $h$ lying in an interval. Instead, for our data analysis, we use some discrete values in that range.
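A small sketch of this aggregation step, assuming the hybrid density estimates at the point to be classified have already been computed on a grid of smoothing parameters (the array layout and names are our own):

```python
import numpy as np

def aggregate_classify(f_hat, priors, weights):
    """Sketch of the multiscale aggregation rule d_A(x).

    f_hat   : (S, J) array; f_hat[s, j] is the hybrid density estimate of
              class j at x for the sth value of the smoothing parameter.
    priors  : (J,) prior probabilities pi_j.
    weights : (S,) weights W(s), assumed to sum to one.
    """
    post = priors * f_hat                              # unnormalized posteriors, (S, J)
    post = post / post.sum(axis=1, keepdims=True)      # posterior probabilities per scale
    scores = weights @ post                            # weighted average over scales, (J,)
    return int(np.argmax(scores))                      # index of the predicted class
```

A single-scale classifier corresponds to a weight vector that puts all of its mass on one value of the smoothing parameter.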

2.4 Large Sample Properties

From the results on the consistency of kernel density estimates [43], [40], [45] and of nearest neighbor density estimates [33], we know that, under certain regularity conditions (different sets of conditions are required for the kernel and nearest neighbor methods), both of these nonparametric density estimates converge to the true density function. The consistency of the hybrid density estimates under similar regularity conditions (see Propositions 1 and 2 in the Appendix) follows from these results, which in turn implies the asymptotic optimality (convergence to the Bayes risk) of the error rates of our aggregated hybrid classifiers. This result is formally presented in the following theorem, whose proof is given in the Appendix.

Theorem 1. Suppose that, for all $j = 1, 2, \ldots, J$, $f_j^0(x; \theta_j)$ is continuous in $x$ and $\theta_j$. Define $\hat{\theta}_j$ and $\theta_j^0$ as in Propositions 1 and 2 in the Appendix, and assume that $\hat{\theta}_j$ is a consistent estimate of $\theta_j^0$ for all $j = 1, 2, \ldots, J$. Also, define $L_n$ and $U_n$ as the lower and upper bounds for the smoothing parameter when $n$ is the training sample size.

1. Assume that $L_n$ and $U_n$ both converge to zero and that $nL_n^d$ and $nU_n^d$ both tend to $\infty$ as $n \to \infty$. Then, the misclassification rate of the kernel-based aggregated hybrid classifier converges to the optimal Bayes risk for both Method-2 and Method-3. If the kernel function $K$ has bounded variation and the $f_j$s are uniformly continuous, the above result also holds for Method-1, provided that $nL_n^d/\log(n)$ and $nU_n^d/\log(n)$ both tend to $\infty$ as $n \to \infty$.

2. Suppose that $L_n \to \infty$, $U_n \to \infty$, $L_n/n \to 0$, and $U_n/n \to 0$ as $n \to \infty$. Then, the misclassification rates of the aggregated hybrid classifiers based on nearest neighbor techniques (Method-1, Method-2, and Method-3) all converge to the optimal Bayes risk.

2.5 Choice of the Weight Function

It is transparent from Theorem 1 that, for any weight function $W(\cdot)$ with $W(\cdot) > 0$ only in the range $[L_n, U_n]$ and $\sum W(\cdot) = 1$, the error rates of the multiscale hybrid classifiers converge to the Bayes risk. However, in practice, a suitable weight function has to be chosen for aggregation. One would naturally rely more on the classifiers having lower misclassification rates. Therefore, the weight function should be a decreasing function of the error rate $\Delta(s)$, as in other popular aggregation methods. For instance, boosting [39], [13] uses the weight function $W = \log\{(1 - \Delta)/\Delta\}$. However, it is our empirical experience that the log function used in boosting decreases with $\Delta$ at a very slow rate, and it fails to appropriately weigh down the poor classifiers resulting from poor choices of the smoothing parameter. Therefore, for our data analysis, we have used a function that decreases at an exponential rate, as proposed in [15], [16]:

$$W(s) = C \exp\left[-\frac{1}{2}\left\{\frac{\Delta(s) - \Delta^*}{\sqrt{\Delta^*(1 - \Delta^*)/n}}\right\}^2\right],$$

where $\Delta^* = \min_s \Delta(s)$, $C$ is a normalizing constant, and all misclassification rates are estimated by the leave-one-out cross-validation technique [32]. Note that $\Delta^*$ and $\Delta^*(1 - \Delta^*)/n$ can be viewed as estimates of the mean and the variance of the empirical misclassification rate of the hybrid classifier, with the best possible choice of $s$, when it is used to classify $n$ independent observations. This choice of $W(s)$ worked reasonably well in our examples, as we will see in Sections 3 and 4. Our empirical experience also suggests that the final result is not very sensitive to the choice of the weight function as long as it decreases at an appropriate rate.
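A direct sketch of this weight function, taking the cross-validated error rates $\Delta(s)$ as input (function and variable names are ours; the small constant guarding against $\Delta^* = 0$ is a numerical convenience, not part of the paper):

```python
import numpy as np

def multiscale_weights(cv_errors, n):
    """Sketch of the exponential weight function W(s).

    cv_errors : (S,) leave-one-out cross-validated error rates Delta(s).
    n         : training sample size.
    """
    cv_errors = np.asarray(cv_errors, dtype=float)
    d_star = cv_errors.min()                                   # Delta* = min_s Delta(s)
    scale = np.sqrt(max(d_star * (1.0 - d_star) / n, 1e-12))   # guard against Delta* = 0
    w = np.exp(-0.5 * ((cv_errors - d_star) / scale) ** 2)
    return w / w.sum()                                         # normalization supplies C
```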

In practice, the values of $U_n$ and $L_n$ have to be specified as well. For the kernel-based methods, we followed the idea in [16] to set these limits. After standardizing the observations, we computed all pairwise distances between the standardized observations of a class and then calculated their lower and upper fifth percentiles ($\delta_{0.05}(j)$ and $\delta_{0.95}(j)$, $j = 1, 2, \ldots, J$). For our multiscale analysis, we took a conservative approach and set $\min_j\{\delta_{0.05}(j)/3\}$ and $\max_j\{\delta_{0.95}(j)\}$ as the lower and upper limits, respectively. The choice of the factor "1/3" is motivated by the use of the Gaussian kernel function (see [16] for a detailed discussion on the choice of the upper and lower limits). For the NN-based methods, we took the simplified approach of considering $k = 1, 2, \ldots, n-1$.
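The bandwidth range described above could be computed as follows (a sketch; the use of scipy's pdist and NumPy percentiles is our own choice):

```python
import numpy as np
from scipy.spatial.distance import pdist

def bandwidth_range(X_std, y):
    """Sketch of the [L_n, U_n] bandwidth range for the kernel-based methods.
    X_std : standardized observations, y : class labels."""
    lower, upper = [], []
    for label in np.unique(y):
        d = pdist(X_std[y == label])            # pairwise distances within the class
        lower.append(np.percentile(d, 5) / 3)   # delta_0.05(j) / 3
        upper.append(np.percentile(d, 95))      # delta_0.95(j)
    return min(lower), max(upper)
```

A grid of discrete bandwidth values chosen in this interval then feeds the multiscale aggregation of Section 2.3.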

3 RESULTS ON SIMULATED EXAMPLES

In this section, we use some simulated data sets to illustrate the performance of the hybrid classifiers. For the sake of simplicity, here we restrict ourselves to two-class problems involving bivariate data. In Section 4, we will present some classification results for high-dimensional benchmark data sets involving more than two classes.

For each of these simulated examples, taking an equal number of observations from the two classes, we generated 250 different training and test sets, of sizes 100 and 200, respectively. Average test set error rates of the hybrid classifiers (both kernel-based and nearest neighbor-based) over those 250 trials are reported in Tables 1 and 2, along with their corresponding standard errors. For these hybrid methods, we report the error rates of both the multiscale classifiers (i.e., the aggregated classifiers, denoted in the tables as Method-i(MS) for i = 1, 2, 3) and the single-scale classifiers (denoted as Method-i(SS) for i = 1, 2, 3), where the smoothing parameter is chosen by the leave-one-out cross-validation technique. For Method-1, it is often difficult to get a closed-form expression for $P^1$ and $P^0$.


In this paper, we used 5,000 Monte Carlo samples to estimate these probabilities. However, like the adjustment factor in Method-2, this estimated probability ratio can be very influential when $P^0$ is close to zero (e.g., when the observation is an outlier with respect to the assumed parametric model or the neighborhood size is very small). To avoid this, we used the ratio for density adjustment only when the estimated $P^0$ was greater than 0.001; otherwise, the nonparametric density estimate $\hat{f}_j^1$ was used as the final density estimate $\hat{f}_j^*$. Error rates are also reported for LDA, QDA, kernel, and nearest neighbor classifiers. Note that kernel discriminant analysis (KDA) and nearest neighbor (NN) classification require the optimal value of the smoothing parameter to be estimated, and we used the leave-one-out cross-validation method [32] for this purpose. To facilitate the comparison, Bayes error rates are reported as well. Throughout this section, prior probabilities of the two classes are taken to be equal.

We begin with an example (Example 1) involving two Gaussian populations, $N_2(1, 1, 1, 1, 0)$ and $N_2(0, 0, 0.25, 0.25, 0)$, where $N_2(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$ denotes a bivariate normal distribution with mean vector $(\mu_1, \mu_2)$, variances $\sigma_1^2$ and $\sigma_2^2$, and correlation coefficient $\rho$. Since this example is an ideal setup for QDA, QDA had an average error rate close to the optimal Bayes risk (see Table 1), while LDA had a much higher error rate. Error rates of the nonparametric classifiers (KDA and NN) were also significantly higher than that of QDA, but, starting with the right parametric models (i.e., Gaussian distributions with unequal scatter matrices), the hybrid classifiers could achieve much better performance. All hybrid methods had significantly lower misclassification rates than the nonparametric classifiers. Overall, the performance of the multiscale methods was somewhat better than that of the corresponding single-scale classifiers. Among the hybrid methods, Method-1 and Method-3 performed better than Method-2, and the error rates of these two classifiers were not significantly different from that of QDA.

TABLE 1. Average Test Set Misclassification Rates (in Percent) of Different Classifiers and Their Standard Errors. (For hybrid methods, figures in the first and second rows represent the error rates of the single-scale and the multiscale classifier, respectively.)

TABLE 2. Average Test Set Misclassification Rates (in Percent) of Different Classifiers and Their Corresponding Standard Errors (on Simulated Data Sets).

Next, we consider two examples (Examples 2 and 3), where the population distributions are asymmetric in nature. In Example 2, the measurement variables in each class are independent and identically distributed as a lognormal variate with parameters (0, 1) for the first class and (2, 1) for the second class. Recall that a random variable $X$ follows a lognormal distribution with parameters $(\mu, \sigma^2)$ if $\log X \sim N(\mu, \sigma^2)$. In Example 3, the density functions of the two populations are given by $f_1(x_1, x_2) = \exp\{-(x_1 + x_2)\}$, $x_1, x_2 \ge 0$, and $f_2(x_1, x_2) = \exp\{-(x_1 + x_2 - 2)\}$, $x_1, x_2 \ge 1$, respectively. Since the population distributions were very different from normal in these two examples, LDA and QDA did not perform well (see Table 1). Compared to LDA and QDA, the nonparametric methods (KDA and NN) had significantly better performance. However, when we chose the right parametric model to start with, all hybrid methods significantly outperformed LDA, QDA, and the nonparametric classifiers. In these examples, the overall performance of Method-3 was somewhat better than that of the other two hybrid classifiers for both the kernel-based and the nearest neighbor-based hybridization. Only in Example 3 did the multiscale version of Method-1 have higher error rates than the corresponding single-scale classifier. In all other cases, aggregation helped to improve the performance of the hybrid classifiers. From Table 1, it is quite transparent that, when one has some insight about the population densities that can be modeled parametrically, it is always advantageous to use hybrid classifiers.

One should also note that, while the parametric methods are sensitive to model misspecification, the nonparametric adjustment factor used in hybrid classification provides an automatic safeguard against it. To illustrate this point, once again we consider Examples 2 and 3, where the population distributions are far from normal. To study the robustness of the hybrid methods against parametric model misspecification, here, instead of starting with the true parametric models, we started with Gaussian density functions with equal or unequal scatter matrices. Note that LDA had poor performance in both of these examples, and QDA also had a significantly higher error rate in Example 3. But, in spite of starting with wrong parametric models, almost all hybrid classifiers significantly improved on the performance of LDA and QDA, and their error rates were comparable to those of the corresponding nonparametric methods (see Table 2). Only the kernel version of Method-1 had comparatively higher error rates in Example 2 when we used the same scatter matrix for the initial parametric models of the different populations. But, even in that case, its performance was significantly better than that of LDA. Unlike the parametric methods, the performance of the hybrid classifiers was much less affected by a wrong choice of the initial parametric models. To make this more transparent, we chose two other examples (Examples 4 and 5), where each of the two populations is an equal mixture of two Gaussian distributions. In Example 4, population 1 is a mixture of $N_2(1, 1, 0.25, 0.25, 0)$ and $N_2(-1, -1, 0.25, 0.25, 0)$ distributions and population 2 is a mixture of $N_2(1, -1, 0.25, 0.25, 0)$ and $N_2(-1, 1, 0.25, 0.25, 0)$ distributions. In Example 5, population 1 is a mixture of $N_2(1, 1, 0.25, 0.25, 0)$ and $N_2(3, 3, 0.25, 0.25, 0)$, whereas population 2 is a mixture of $N_2(2, 2, 0.25, 0.25, 0)$ and $N_2(4, 4, 0.25, 0.25, 0)$. Once again, LDA had poor performance in both of these examples. The performance of QDA was also not satisfactory in Example 5. But, despite starting with wrong parametric models, the hybrid methods led to substantial reductions in misclassification rates. In some cases, their performance was even better than that of their nonparametric counterparts.

4 RESULTS FROM THE ANALYSIS OF BENCHMARK DATA SETS

In this section, we analyze some benchmark data sets for further illustration of the hybrid classifiers. Four of these data sets, namely, the synthetic data, the satellite image (satimage) data, the vowel data, and the letter recognition data, have specific training and test sets. For these data sets, we report the test set error rates of the different classifiers (see Table 4). For the other data sets, which do not have specific training and test sets, we formed these sets by randomly partitioning the data into two parts. Brief descriptions of these data sets and the sizes of the training and test sets in each partition are reported in Table 3. This random partitioning was carried out 250 times to generate 250 different training and test samples. Average test set error rates of the different classifiers and their corresponding standard errors over these 250 trials are reported in Table 4.

Among these data sets, the salmon data was taken from [29]. The remaining data sets and their descriptions are available either at the UCI Machine Learning Repository (http://www.ics.uci.edu/~mlearn) or at the CMU Data Archive (http://lib.stat.cmu.edu). In the satimage data set, although there are 36 measurement variables, we considered only the four central pixel values for classification. In the case of the biomedical data, we did not consider 15 observations which have missing values and carried out our analysis with the remaining 194 observations. Unless mentioned otherwise, we used Gaussian distributions as initial parametric models, and training sample proportions of the different classes were used as their prior probabilities.

TABLE 3. Brief Description of Benchmark Data Sets.

In the case of the salmon data, the population distributions are nearly Gaussian. Therefore, in this case, both LDA and QDA performed well, and their performance was much better than that of the nonparametric classifiers (see Table 4). Hybrid classifiers also had the advantage of starting with Gaussian models. All hybrid methods, especially the multiscale classifiers, performed significantly better than their nonparametric counterparts, and their error rates were comparable to those of the parametric methods. In the case of the biomedical data, once again, the parametric methods, especially QDA, performed significantly better than the nonparametric classifiers. This gives an indication that Gaussian distributions with different scatter matrices would possibly fit this data set well. Hybrid classifiers had much better performance when we started with these parametric models. Even the use of a Gaussian model with a common scatter matrix for the different classes led to satisfactory performance by the hybrid classifiers.

TABLE 4. Misclassification Rates (in Percent) of Different Classifiers and Their Standard Errors.

In the case of the Iris data, all classifiers had almost similar error rates. Like the salmon data, here also the population distributions are nearly Gaussian, and LDA is known to perform well on this data set. In our experiment, most of the hybrid methods had error rates similar to that of LDA, especially when we started with Gaussian distributions with the same scatter matrix for the different populations. On the diabetes data, the performance of QDA was significantly better than that of LDA, KDA, and NN, which indicates that Gaussian density functions with different scatter matrices for the different populations could be a good choice. Starting with this parametric setup, we got significantly lower error rates for Method-1 and Method-3. This superiority was maintained even when the same scatter matrix was used for all populations. Method-2 showed somewhat different behavior: it performed better when we used the same scatter matrix instead of different scatter matrices for the different populations.

In the case of the crab data, there were very few observations from the competing classes in the training samples compared to the dimensionality of the problem. Since LDA requires fewer parameters to be estimated, it had an advantage in this case, and it led to the best error rate among the parametric and nonparametric classifiers considered here. In this example, the error rates of Method-1 and Method-3 were comparable to those of LDA or QDA, depending on the initial parametric setup, whereas Method-2 performed even better.

For the other four data sets, we report the error rates of the different methods on the specific test sets. On the synthetic data, the nonparametric methods had slightly higher error rates than the parametric and most of the hybrid classifiers. On the satimage data, all classifiers except LDA had similar error rates. Although LDA led to a higher misclassification rate in this example, most of the hybrid methods, especially the kernel-based methods, could improve the classification performance significantly despite starting with the same parametric model as LDA. On the vowel recognition data, the nonparametric methods performed somewhat better than the parametric classifiers. The performance of the hybrid methods was comparable to that of the nonparametric classifiers in most of the cases. On the letter recognition data, LDA and QDA had significantly higher error rates than the nonparametric methods, which indicates that Gaussian models are inadequate for this data set. But despite starting with these incorrect parametric models, the hybrid methods yielded error rates close to those of the nonparametric classifiers.

Note that, in this section, the hybrid classifiers have always been constructed using Gaussian parametric models, which may be inappropriate in some cases. For instance, in the case of the synthetic data, we know that each class is a mixture of two Gaussian distributions. Therefore, when we started with mixture Gaussian models instead of single Gaussian models, the hybrid classifiers performed much better. For the single-scale versions of the three hybrid methods, the test set error rates were found to be 9.5, 9.2, and 8.9 and 9.1, 9.2, and 9.0, respectively, for the kernel-based and the nearest neighbor-based hybridization. In the case of multiscale classification, these error rates were 9.4, 9.3, and 8.9 and 9.3, 9.3, and 9.0, respectively.

To compare the overall performance of the different single-scale and multiscale hybrid classifiers, we used the notion of robustness introduced in [12]. For a data set, if $T$ different classifiers have error rates $\Delta_1, \Delta_2, \ldots, \Delta_T$, the robustness $r_t$ of the $t$th ($t = 1, 2, \ldots, T$) classifier is defined as $r_t = \Delta_t/\Delta^*$, where $\Delta^* = \min_{1 \le t \le T} \Delta_t$. Clearly, in any example, the best classifier has $r_t = 1$, while a higher value of $r_t$ indicates a lack of robustness of the $t$th classifier. In each of these benchmark examples, we computed this ratio for all hybrid classifiers, and they are graphically represented by box plots in Fig. 3, which clearly show the utility of the multiscale method. In the case of single-scale classification, all hybrid classifiers had competitive performance. The multiscale method improved the overall performance of all these classifiers, especially Method-2. The overall performance of the multiscale version of Method-2 was better than that of the other two methods.

Fig. 3. Robustness of hybrid classifiers: 1) Method-1(SS), 2) Method-1(MS), 3) Method-2(SS), 4) Method-2(MS), 5) Method-3(SS), and 6) Method-3(MS). (a) Kernel and normal density estimates with the same $\Sigma$. (b) Kernel and normal density estimates with different $\Sigma$. (c) Nearest neighbor and normal density estimates with the same $\Sigma$. (d) Nearest neighbor and normal density estimates with different $\Sigma$.
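The robustness ratios summarized in Fig. 3 can be computed from a table of error rates with a few lines of code (a sketch; the array layout is our assumption):

```python
import numpy as np

def robustness_ratios(error_rates):
    """error_rates : (n_datasets, T) array of error rates of T classifiers.
    Returns r_t = Delta_t / Delta*, computed separately for each data set."""
    error_rates = np.asarray(error_rates, dtype=float)
    best = error_rates.min(axis=1, keepdims=True)    # Delta* for each data set
    return error_rates / best                        # one row of ratios per data set
```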

In classification, one is more interested in estimates of the class boundaries than in estimates of the population density functions. Classification trees [1], neural nets [38], and support vector machines [44] are some of the nonparametric classifiers that directly estimate the class boundaries without using density estimates. For the synthetic data, the satimage data, and the vowel recognition data discussed earlier, the error rates of these three and other nonparametric classifiers are given in [21], [38], [17]. The error rates of the hybrid classifiers are quite comparable to those reported error rates.

5 CONCLUDING REMARKS

Both traditional parametric and nonparametric classifiers have their own strengths and limitations. Hybrid classifiers are developed and studied in this paper to overcome those limitations and combine their strengths. When the true class densities are close to the assumed parametric models, hybrid methods perform much better than their nonparametric counterparts. But, unlike parametric classifiers, hybrid classifiers are less sensitive to the choice of the initial parametric models. Therefore, when one has some doubt about the validity of the model assumptions, it is always safe to use hybrid classifiers, which provide a safeguard against possible deviations from the parametric model assumptions. When the true population distributions are far from the assumed parametric models, hybrid classifiers can substantially improve on the performance of parametric methods and yield misclassification rates comparable to those of nonparametric classifiers. Using several simulated and real data sets, we have amply demonstrated these important features of hybrid classifiers in this article.

Among the different hybrid classifiers considered in this article, Method-2 is computationally much simpler than Method-1 since it does not need to estimate the probability of a ball under the parametric and the nonparametric density estimates. However, for the NN version of Method-1, this probability under the nonparametric density estimate can be computed easily, and that helps to cut down the computing cost substantially. In Method-3, one needs to compute the expected value of the nonparametric density estimate under the parametric model. If the parametric model and the kernel function are assumed to be Gaussian, this can be computed fairly easily. But in the case of other parametric models, unless one chooses an appropriate kernel function to preserve the convolution property, and also in the case of nearest neighbor-based hybridization, this computational cost may be substantial. In such cases, Method-2 will be computationally more efficient than Method-3.

The multiscale method discussed in Section 2.3 provides the flexibility of considering the results for different scales of smoothing. Skalak [41] suggested combining classifiers when they have diversity among themselves, and hybrid classifiers with different scales of smoothing are expected to have reasonable diversity among themselves. While classifiers with large smoothing parameters behave like parametric classifiers and look at the global features, classifiers with small smoothing parameters behave like nonparametric classifiers and concentrate more on local patterns. Therefore, instead of using a fixed level of smoothing, it is advantageous to use the multiscale versions of the hybrid classifiers.

APPENDIX

Here, we present some results on the consistency of hybrid density estimates and the proof of Theorem 1. For this asymptotic analysis, we assume that all population density functions are smooth and have derivatives up to a sufficient order and that $K$ is symmetric about $0$ with $\int \|t\|^2 K(t)\,dt < \infty$ and $\int K^2(t)\,dt < \infty$. Note that the Gaussian and most other popular kernel functions satisfy these properties. We also assume that, for all $j = 1, 2, \ldots, J$, the initial parametric model $f_j^0$ has support over the entire measurement space and that $n_j/n$ converges to the prior probability $\pi_j$ ($0 < \pi_j < 1$) as $n \to \infty$. In other words, each of the $n_j$s is of the same asymptotic order, and $n \to \infty$ implies $\min\{n_1, n_2, \ldots, n_J\} \to \infty$.

Proposition 1. Suppose that, for all $j = 1, 2, \ldots, J$, $f_j^0(x; \theta_j)$ is continuous in $x$ and $\theta_j$, and that, as $n \to \infty$ (which implies $n_j \to \infty$ for all $j = 1, 2, \ldots, J$), $\hat{\theta}_j \to^{P} \theta_j^0$ for some $\theta_j^0$ in the parameter space, irrespective of whether the parametric model is correct or not. $f_j^0(x; \theta_j^0)$ may be viewed as the best parametric approximation of $f_j$ in that class. Also, assume that, for all $j = 1, 2, \ldots, J$, $h_j \to 0$ and $n_j h_j^d \to \infty$ as $n \to \infty$. Then, $\hat{f}_{jh_j}^{(2)}(x)$ and $\hat{f}_{jh_j}^{(3)}(x)$ both converge to $f_j(x)$ (in probability) as $n \to \infty$. Further, if $K$ has bounded variation and the $f_j$s are uniformly continuous, the above convergence result also holds for $\hat{f}_{jh_j}^{(1)}$ under a slightly stronger condition, namely, $nh_j^d/\log(n) \to \infty$ as $n \to \infty$ for $j = 1, 2, \ldots, J$.

Proof. We first consider the case of $\hat{f}_{jh_j}^{(1)}(x)$ and express it in the following form:

$$\hat{f}_{jh_j}^{(1)}(x) = \frac{\{V_{3h_j}(x)\}^{-1}\int_{y \in B(x, 3h_j)} \hat{f}_{jh_j}^1(y)\,dy}{\{V_{3h_j}(x)\}^{-1}\int_{y \in B(x, 3h_j)} \{f_j^0(y; \hat{\theta}_j)/f_j^0(x; \hat{\theta}_j)\}\,dy},$$

where $V_{3h_j}(x)$ is the volume of $B(x, 3h_j)$. Since $f_j^0(x; \hat{\theta}_j)$ is continuous in $\hat{\theta}_j$ and $x$, and $\hat{\theta}_j \to^{P} \theta_j^0$, it is easy to show that the denominator converges to 1 in probability. Now, under the assumed conditions, due to the uniform convergence of kernel density estimates [42], for every $\epsilon > 0$ we can find a positive integer $m_1$ such that, for all $n \ge m_1$, $\sup_y |\hat{f}_{jh_j}^1(y) - f_j(y)| < \epsilon/2$. Again, because of the continuity of $f_j$, one can choose another positive integer $m_2$ such that, for all $n \ge m_2$, $\|x - y\| < 3h_j \Rightarrow |f_j(x) - f_j(y)| < \epsilon/2$. This implies that

$$\{V_{3h_j}(x)\}^{-1}\int_{y \in B(x, 3h_j)} |\hat{f}_{jh_j}^1(y) - f_j(x)|\,dy < \epsilon \quad \text{for all } n > \max\{m_1, m_2\}.$$

Therefore, the numerator of $\hat{f}_{jh_j}^{(1)}(x)$ (and, hence, $\hat{f}_{jh_j}^{(1)}(x)$ itself) converges to $f_j(x)$ in probability.

Also, note that under the assumed conditions, the consistency of $\hat{f}_{jh_j}^{(2)}(x)$ follows from [24].

Finally, we consider the case of $\hat{f}_{jh_j}^{(3)}(x)$, which can be expressed in the following form:

1162 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 31, NO. 7, JULY 2009

[Fig. 3. Robustness of hybrid classifiers: 1) Method-1(SS), 2) Method-1(MS), 3) Method-2(SS), 4) Method-2(MS), 5) Method-3(SS), and 6) Method-3(MS). (a) Kernel and normal density estimates with the same $\Sigma$. (b) Kernel and normal density estimates with different $\Sigma$. (c) Nearest neighbor and normal density estimates with the same $\Sigma$. (d) Nearest neighbor and normal density estimates with different $\Sigma$.]


This estimate can be expressed in the following form:

$$\hat{f}^{(3)}_{jh_j}(x) = \frac{f_j^0(x; \hat{\theta}_j)\, \hat{f}_{jh_j}(x)}{E_{\hat{f}_j^0}\{\hat{f}_{jh_j}(x)\}}, \qquad \text{where } \hat{f}_j^0(x) = f_j^0(x; \hat{\theta}_j).$$

Now, using the result on the expectation of a kernel density estimate [43], we get $E_{\hat{f}_j^0}\{\hat{f}_{jh_j}(x)\} = f_j^0(x; \hat{\theta}_j) + O(h_j^2)$. Since $\hat{\theta}_j$ converges to $\theta_j^0$ and $h_j$ converges to zero as $n_j$ tends to $\infty$, it is easy to show that $f_j^0(x; \hat{\theta}_j)$ and $E_{\hat{f}_j^0}\{\hat{f}_{jh_j}(x)\}$ both tend to $f_j^0(x; \theta_j^0)$ as $n_j \to \infty$ and their ratio converges to one. Now, under the assumed conditions, the convergence of $\hat{f}^{(3)}_{jh_j}(x)$ follows from the convergence of the kernel density estimate $\hat{f}_{jh_j}(x)$. $\Box$
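As an illustration of the construction just analyzed, the Python sketch below evaluates a density estimate of the form of $\hat{f}^{(3)}_{jh_j}(x)$ above for a univariate sample: the fitted parametric density times the kernel density estimate, divided by the expectation of the kernel estimate under the fitted parametric model, computed here by numerical integration. The Gaussian parametric start, the Gaussian kernel, and the quadrature scheme are choices made only for this sketch; it is not the authors' implementation.

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def f3_hybrid_estimate(x, sample, h):
    """Hybrid density estimate of the form
        f3(x) = f0(x; theta_hat) * f_kde(x) / E_{f0_hat}[ f_kde(x) ],
    with a Gaussian parametric start fitted by its sample mean and
    standard deviation (an assumption made for this sketch) and a
    Gaussian kernel with bandwidth h.
    """
    sample = np.asarray(sample, dtype=float)
    mu_hat, sigma_hat = sample.mean(), sample.std(ddof=1)    # fitted parametric start

    f0 = lambda t: norm.pdf(t, loc=mu_hat, scale=sigma_hat)  # f0(.; theta_hat)
    kernel = lambda u: norm.pdf(u)                           # Gaussian kernel K

    # Ordinary kernel density estimate at x.
    f_kde = np.mean(kernel((x - sample) / h)) / h

    # E_{f0_hat}[ f_kde(x) ] = integral of (1/h) K((x - y)/h) f0(y) dy.
    expected_kde, _ = quad(lambda y: kernel((x - y) / h) / h * f0(y),
                           -np.inf, np.inf)

    return f0(x) * f_kde / expected_kde

# Example: with data drawn from the assumed model, the adjustment factor
# f_kde(x) / E[f_kde(x)] stays close to one and f3 stays close to f0.
rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, scale=2.0, size=200)
print(f3_hybrid_estimate(0.5, data, h=0.6))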

Lemma 1. Suppose that $x_{j1}, x_{j2}, \ldots, x_{jn_j}$ $(\sum n_j = n)$ are independent observations from a continuous density function $f_j$ $(j = 1, 2, \ldots, J)$. Consider a data point $x$ and define $x_{(k,n)}$, $B_{n,k}(x)$, and $V_{n,k}(x)$ as in Section 2.2. Now, assume that $k \to \infty$ and $k/n \to 0$, and also recall that $n_j/n \to \pi_j$ $(0 < \pi_j < 1,\ \sum \pi_j = 1)$ as $n \to \infty$. Then, for all $j = 1, 2, \ldots, J$, $k_j/\{n_j V_{n,k}(x)\} \stackrel{P}{\to} f_j(x)$ as $n \to \infty$.
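The quantity $k_j/\{n_j V_{n,k}(x)\}$ in Lemma 1 is the class-$j$ nearest neighbor density estimate built from the pooled training sample. The Python sketch below, a minimal illustration rather than the paper's code, computes it for Euclidean balls: it finds the $k$ nearest pooled observations to $x$, counts how many of them ($k_j$) come from class $j$, and divides by $n_j$ times the volume of the ball whose radius is the distance to the $k$th neighbor.

import numpy as np
from math import gamma, pi

def knn_class_density(x, X, y, j, k):
    """Class-wise nearest neighbor density estimate k_j / (n_j * V_{n,k}(x)).

    X : (n, d) array of pooled training observations,
    y : length-n array of class labels,
    j : the class of interest,
    k : number of pooled nearest neighbors defining the ball B_{n,k}(x).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    x = np.asarray(x, dtype=float)
    n, d = X.shape

    dists = np.linalg.norm(X - x, axis=1)        # Euclidean distances to x
    nn_idx = np.argsort(dists)[:k]               # indices of the k nearest points
    radius = dists[nn_idx[-1]]                   # distance to the kth neighbor

    k_j = int(np.sum(y[nn_idx] == j))            # neighbors coming from class j
    n_j = int(np.sum(y == j))                    # class-j training sample size

    # Volume of a d-dimensional Euclidean ball of the given radius.
    volume = (pi ** (d / 2) / gamma(d / 2 + 1)) * radius ** d
    return k_j / (n_j * volume)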

Proof. Write $k_j/\{n_j V_{n,k}(x)\}$ as $\{k_j/k\} \cdot \{n/n_j\} \cdot \{k/(n V_{n,k}(x))\}$. Now, under the assumed conditions, $k_j/k \stackrel{P}{\to} \pi_j f_j(x)/\sum \pi_t f_t(x)$, the posterior probability [7], and $n/n_j \to 1/\pi_j$. The consistency of nearest neighbor density estimates [33] also implies that $k/(n V_{n,k}(x)) \stackrel{P}{\to} \sum \pi_t f_t(x)$. Hence, $k_j/\{n_j V_{n,k}(x)\} \stackrel{P}{\to} f_j(x)$ for all $j = 1, 2, \ldots, J$ as the sample size $n \to \infty$. $\Box$

Proposition 2. Suppose that, for all $j = 1, 2, \ldots, J$, $f_j^0(x; \theta_j)$ is continuous in $x$ and $\theta_j$. Define $\theta_j^0$ and $\hat{\theta}_j$ as in Proposition 1. Also, assume that, as the training sample size $n$ tends to infinity, $k \to \infty$, $k/n \to 0$, and $\hat{\theta}_j \stackrel{P}{\to} \theta_j^0$ for all $j = 1, 2, \ldots, J$. Then, each of the three hybrid density estimates $\hat{f}^{(1)}_{j,k}(x)$, $\hat{f}^{(2)}_{j,k}(x)$, and $\hat{f}^{(3)}_{j,k}(x)$ converges in probability to $f_j(x)$.

Proof. We first consider the case of $\hat{f}^{(1)}_{j,k}(x)$, which can be expressed as

$$\hat{f}^{(1)}_{j,k}(x) = \frac{k_j/\{n_j V_{n,k}(x)\}}{\{V_{n,k}(x)\}^{-1} \int_{y \in B_{n,k}(x)} \{f_j^0(y; \hat{\theta}_j)/f_j^0(x; \hat{\theta}_j)\}\, dy}.$$

Now, note that, under the assumed conditions, as $n \to \infty$, the radius of $B_{n,k}(x)$ shrinks to zero. Therefore, using the continuity of $f_j^0$ and the convergence of $\hat{\theta}_j$, it is easy to show that the denominator converges to one. Now, Lemma 1 implies that the numerator of $\hat{f}^{(1)}_{j,k}(x)$ and, hence, $\hat{f}^{(1)}_{j,k}(x)$ itself converge to $f_j(x)$ in probability.

Next, we consider the case of $\hat{f}^{(2)}_{j,k}(x)$. Note that this can be expressed as

$$\hat{f}^{(2)}_{j,k}(x) = \frac{k_j}{n_j V_{n,k}(x)} \left[ \frac{1}{k_j} \sum_{x_{ji} \in B_{n,k}(x)} \{f_j^0(x; \hat{\theta}_j)/f_j^0(x_{ji}; \hat{\theta}_j)\} \right].$$

Since the radius of $B_{n,k}(x)$ shrinks to zero as $n \to \infty$, because of the continuity of $f_j^0$ and the convergence of $\hat{\theta}_j$, for every $\epsilon > 0$ one can choose $n_0$ such that, for all $n > n_0$ and all $x_{ji} \in B_{n,k}(x)$, $1 - \epsilon < f_j^0(x; \hat{\theta}_j)/f_j^0(x_{ji}; \hat{\theta}_j) < 1 + \epsilon$. Therefore, the second term of $\hat{f}^{(2)}_{j,k}(x)$ converges to one. Now, the convergence of the first term and, hence, that of $\hat{f}^{(2)}_{j,k}(x)$ follow from Lemma 1.

Finally, we consider the case of $\hat{f}^{(3)}_{j,k}(x)$. Since $k_j/\{n_j V_{n,k}(x)\}$ is a consistent estimate of the population density function $f_j(x)$ (follows from Lemma 1), $E_{\hat{f}_j^0}\{k_j/(n_j V_{n,k}(x))\}$ has the expression of the form $\hat{f}_j^0(x) + r_n$, where $r_n \to 0$, and $\hat{f}_j^0(x) = f_j^0(x; \hat{\theta}_j) \stackrel{P}{\to} f_j^0(x; \theta_j^0)$ as $n \to \infty$ (using the continuity of $f_j^0$). In particular, $E_{\hat{f}_j^0}\{k_j/(n_j V_{n,k}(x))\} = \hat{f}_j^0(x) + O((k/n)^2)$ (see [31], [34] for the asymptotic mean of the nearest neighbor density estimate). Therefore, $\hat{f}_j^0(x)/E_{\hat{f}_j^0}\{k_j/(n_j V_{n,k}(x))\}$ converges to one and the result follows from Lemma 1. $\Box$
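To make the nearest neighbor hybrid construction concrete, the sketch below implements an estimate of the form of $\hat{f}^{(2)}_{j,k}(x)$ displayed in the proof above: the class-wise nearest neighbor estimate $k_j/\{n_j V_{n,k}(x)\}$ multiplied by the average of the parametric density ratios $f_j^0(x; \hat{\theta}_j)/f_j^0(x_{ji}; \hat{\theta}_j)$ over the class-$j$ points falling in $B_{n,k}(x)$. It parallels the sketch given after Lemma 1 and assumes, purely for illustration, a multivariate Gaussian parametric start fitted by the class-$j$ sample mean and covariance; it is not the authors' implementation.

import numpy as np
from math import gamma, pi
from scipy.stats import multivariate_normal

def f2_nn_hybrid(x, X, y, j, k):
    """Nearest neighbor hybrid estimate of the form
        [k_j / (n_j V_{n,k}(x))] * (1/k_j) * sum over x_ji in B_{n,k}(x)
            of f0_j(x; theta_hat) / f0_j(x_ji; theta_hat),
    with a Gaussian parametric start fitted to the class-j data
    (a choice made only for this sketch).
    """
    X, y, x = np.asarray(X, float), np.asarray(y), np.asarray(x, float)
    n, d = X.shape

    # Parametric start f0_j: Gaussian fitted on the class-j observations.
    Xj = X[y == j]
    f0 = multivariate_normal(mean=Xj.mean(axis=0), cov=np.cov(Xj, rowvar=False))

    # Ball B_{n,k}(x): the k nearest pooled observations to x.
    dists = np.linalg.norm(X - x, axis=1)
    nn_idx = np.argsort(dists)[:k]
    radius = dists[nn_idx[-1]]
    in_ball_j = nn_idx[y[nn_idx] == j]           # class-j points inside the ball
    k_j, n_j = len(in_ball_j), int(np.sum(y == j))
    if k_j == 0:
        return 0.0                               # no class-j neighbor in the ball

    # Nonparametric part: class-wise nearest neighbor density estimate.
    volume = (pi ** (d / 2) / gamma(d / 2 + 1)) * radius ** d
    nn_density = k_j / (n_j * volume)

    # Parametric adjustment: average density ratio over the class-j neighbors.
    ratios = f0.pdf(x) / f0.pdf(X[in_ball_j])
    return nn_density * np.mean(ratios)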

Proof of Theorem 1. Let us define $T_{nj}(x) = \sum_{s \in [L_n, U_n]} W(s)\, \hat{p}_{snj}(x)$, where $\hat{p}_{snj}(x) = \pi_j \hat{f}_{js}(x)\big/\sum_t \pi_t \hat{f}_{ts}(x)$ and $\hat{f}_{js}$ denotes the hybrid density estimate of $f_j$ with smoothing parameter $s$. It is quite transparent from Propositions 1 and 2 that, under the assumed conditions on $L_n$ and $U_n$, for any sequence of smoothing parameters in the interval $[L_n, U_n]$, $\hat{p}_{snj}(x)$ converges (in probability) to the posterior probability $p(j \mid x) = \pi_j f_j(x)/\sum_t \pi_t f_t(x)$ as $n \to \infty$. If we can show similar convergence for $T_{nj}$, the proof can be completed using the Dominated Convergence Theorem.

If possible, suppose that $T_{nj}(x)$ does not converge to $p(j \mid x)$ as $n \to \infty$. Then, there exist an $\epsilon > 0$ and a subsequence $\{T_{n'j},\ n' \geq 1\}$ such that $|T_{n'j}(x) - p(j \mid x)| > \epsilon$ for all $n' \geq 1$. Since $T_{n'j}$ is a weighted average of the $\hat{p}_{sn'j}(x)$s, one can always find some $s = s(n')$ $(L_{n'} \leq s(n') \leq U_{n'})$ such that $|\hat{p}_{s(n')n'j}(x) - p(j \mid x)| > \epsilon$. Therefore, we can construct a sequence of smoothing parameters $\{s(n'),\ n' \geq 1\}$ such that $|\hat{p}_{s(n')n'j}(x) - p(j \mid x)| > \epsilon$ for all $n' \geq 1$. But this sequence satisfies the conditions required for the consistency of the hybrid density estimates and, hence, $\hat{p}_{s(n')n'j}(x)$ should converge to $p(j \mid x)$ in probability. This is a contradiction. $\Box$

ACKNOWLEDGMENTS

The authors would like to thank the reviewers for their careful reading of the earlier version of the paper and for providing them with several helpful comments. The research of Probal Chaudhuri was partially supported by the grants of the Council of Scientific and Industrial Research and the Department of Biotechnology, Government of India. The research of Hannu Oja was partially supported by the grants of the Academy of Finland.

REFERENCES

[1] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, Classification and Regression Trees. Wadsworth and Brooks Press, 1984.
[2] L. Breiman, "Bagging Predictors," Machine Learning, vol. 24, pp. 123-140, 1996.
[3] C. Bolance, M. Guillen, and J.P. Nielsen, "Kernel Density Estimation of Actuarial Loss Functions," Insurance: Math. and Economics, vol. 32, pp. 19-36, 2003.


[4] T. Buch-Larsen, J.P. Nielsen, M. Guillen, and C. Bolance, "Kernel Density Estimation for Heavy-Tailed Distributions Using the Champernowne Transformation," Statistics, vol. 39, pp. 503-518, 2005.
[5] P. Chaudhuri and J.S. Marron, "SiZer for Exploration of Structures in Curves," J. Am. Statistical Assoc., vol. 94, pp. 807-823, 1999.
[6] P. Chaudhuri and J.S. Marron, "Scale Space View of Curve Estimation," Annals of Statistics, vol. 28, pp. 408-428, 2000.
[7] T.M. Cover and P.E. Hart, "Nearest Neighbor Pattern Classification," IEEE Trans. Information Theory, vol. 13, pp. 21-27, 1967.
[8] B.V. Dasarathy, Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. IEEE CS, 1991.
[9] R. Duda, P. Hart, and D.G. Stork, Pattern Classification. John Wiley & Sons, 2000.
[10] R.A. Fisher, "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics, vol. 7, pp. 179-188, 1936.
[11] E. Fix and J.L. Hodges Jr., "Discriminatory Analysis, Nonparametric Discrimination, Consistency Properties," Report No. 4, Project 21-49-004, 1951.
[12] J.H. Friedman, "Flexible Metric Nearest Neighbor Classification," technical report, Dept. of Statistics, Stanford Univ., 1994.
[13] J.H. Friedman, T. Hastie, and R. Tibshirani, "Additive Logistic Regression: A Statistical View of Boosting (with Discussion)," Annals of Statistics, vol. 28, pp. 337-374, 2000.
[14] K. Fukunaga and L.D. Hostetler, "Optimization of k-Nearest Neighbor Density Estimates," IEEE Trans. Information Theory, vol. 19, pp. 320-326, 1973.
[15] A.K. Ghosh, P. Chaudhuri, and C.A. Murthy, "On Visualization and Aggregation of Nearest Neighbor Classifiers," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 10, pp. 1592-1602, Oct. 2005.
[16] A.K. Ghosh, P. Chaudhuri, and D. Sengupta, "Classification Using Kernel Density Estimates: Multi-Scale Analysis and Visualization," Technometrics, vol. 48, pp. 120-132, 2006.
[17] A.K. Ghosh and S. Bose, "Feature Extraction for Classification Using Statistical Networks," Int'l J. Pattern Recognition and Artificial Intelligence, vol. 21, pp. 1103-1126, 2007.
[18] I. Glad, "Parametrically Guided Nonparametric Regression," Scandinavian J. Statistics, vol. 25, pp. 649-668, 1998.
[19] F. Godtliebsen, J.S. Marron, and P. Chaudhuri, "Significance in Scale Space for Bivariate Density Estimation," J. Computational and Graphical Statistics, vol. 11, pp. 1-22, 2002.
[20] D.J. Hand, Kernel Discriminant Analysis. John Wiley & Sons, 1982.
[21] T. Hastie, R. Tibshirani, and A. Buja, "Flexible Discriminant Analysis," J. Am. Statistical Assoc., vol. 89, pp. 1255-1270, 1994.
[22] T. Hastie and R. Tibshirani, "Discriminant Adaptive Nearest Neighbor Classification," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 18, no. 6, pp. 607-616, June 1996.
[23] T. Hastie, R. Tibshirani, and J.H. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, 2001.
[24] N.L. Hjort and I. Glad, "Nonparametric Density Estimation with a Parametric Start," Annals of Statistics, vol. 23, pp. 882-904, 1995.
[25] N.L. Hjort and M.C. Jones, "Locally Parametric Nonparametric Density Estimation," Annals of Statistics, vol. 24, pp. 1619-1647, 1996.
[26] C.C. Holmes and N.M. Adams, "A Probabilistic Nearest Neighbor Method for Statistical Pattern Recognition," J. Royal Statistical Soc., Series B, vol. 64, pp. 295-306, 2002.
[27] C.C. Holmes and N.M. Adams, "Likelihood Inference in Nearest-Neighbor Classification Methods," Biometrika, vol. 90, pp. 99-112, 2003.
[28] F. Hoti and L. Holmstrom, "A Semiparametric Density Estimation Approach to Pattern Classification," Pattern Recognition, vol. 37, pp. 409-419, 2004.
[29] R.A. Johnson and D.W. Wichern, Applied Multivariate Statistical Analysis. Prentice Hall, 1992.
[30] M.C. Jones, O. Linton, and J.P. Nielsen, "A Simple and Effective Bias Reduction Method for Density and Regression Estimation," Biometrika, vol. 82, pp. 327-338, 1995.
[31] P.A. Lachenbruch and M.R. Mickey, "Estimation of Error Rates in Discriminant Analysis," Technometrics, vol. 10, pp. 1-11, 1968.
[32] S.L. Lai, "Large Sample Properties of k-Nearest Neighbor Procedures," PhD dissertation, Dept. of Math., Univ. of California, Los Angeles, 1977.
[33] D.O. Loftsgaarden and C.P. Quesenberry, "A Nonparametric Estimate of a Multivariate Density Function," Annals of Math. Statistics, vol. 36, pp. 1049-1051, 1965.
[34] Y.P. Mack, "Local Properties of k-NN Regression Estimates," SIAM J. Algebraic and Discrete Methods, vol. 2, pp. 311-323, 1981.
[35] P.C. Mahalanobis, "On the Generalized Distance in Statistics," Proc. Nat'l Inst. of Science, vol. 12, pp. 49-55, 1936.
[36] G.J. McLachlan, Discriminant Analysis and Statistical Pattern Recognition. John Wiley & Sons, 1992.
[37] I. Olkin and C.H. Spiegelman, "A Semiparametric Approach to Density Estimation," J. Am. Statistical Assoc., vol. 82, pp. 858-865, 1987.
[38] B.D. Ripley, Pattern Recognition and Neural Networks. Cambridge Univ. Press, 1996.
[39] R.E. Schapire, Y. Freund, P. Bartlett, and W. Lee, "Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods," Annals of Statistics, vol. 26, pp. 1651-1686, 1998.
[40] D.W. Scott, Multivariate Density Estimation: Theory, Practice and Visualization. John Wiley & Sons, 1992.
[41] D.B. Shalak, "Prototype Selections for Composite Nearest Neighbor Classifiers," PhD dissertation, Dept. of Computer Science, Univ. of Massachusetts, 1996.
[42] B.W. Silverman, "Weak and Strong Uniform Consistency of the Kernel Estimate of a Density Function and Its Derivatives," Annals of Statistics, vol. 6, pp. 177-184, 1978.
[43] B.W. Silverman, Density Estimation for Statistics and Data Analysis. Chapman and Hall, 1986.
[44] V.N. Vapnik, Statistical Learning Theory. John Wiley & Sons, 1998.
[45] M. Wand and M.C. Jones, Kernel Smoothing. Chapman and Hall, 1995.

Probal Chaudhuri received the BStat (Hons) and MStat degrees in statistics from the Indian Statistical Institute, Kolkata, in 1983 and 1985, respectively, and the PhD degree in statistics from the University of California, Berkeley, in 1988. He worked as a faculty member at the University of Wisconsin, Madison, for some time before joining the faculty of the Indian Statistical Institute, Kolkata, in 1990. He is currently a professor in the Theoretical Statistics and Mathematics Unit, Indian Statistical Institute, Kolkata. He is an elected fellow of all three science academies in India and a recipient of the Shanti Swarup Bhatnagar Prize awarded by the Government of India. His research interests include nonparametric and robust statistics, statistical analysis of molecular data, pattern recognition, and image processing.

Anil K. Ghosh received the BSc (Hons) degree in statistics from the University of Calcutta in 1996 and the MStat and PhD degrees from the Indian Statistical Institute, Kolkata, in 1998 and 2004, respectively. From 2004-2006, he was a postdoctoral research fellow at the Institute of Statistical Science, Academia Sinica, Taiwan, and at the Mathematical Sciences Institute, Australian National University, Canberra. Currently, he is an assistant professor in the Theoretical Statistics and Mathematics Unit, Indian Statistical Institute (ISI), Kolkata. Before joining ISI, he was an assistant professor in the Department of Mathematics and Statistics, Indian Institute of Technology, Kanpur. His research interests include pattern recognition, robust statistics, nonparametric smoothing, and machine learning.

Hannu Oja received the BSc degree in statistics from the University of Tampere, Finland, in 1973 and the PhD degree in statistics from the University of Oulu, Finland, in 1981. He visited Pennsylvania State University for six months in 1991. He is a professor of biometry in the Tampere School of Public Health, University of Tampere, Finland, and an academy professor in statistics since 2008. His main fields of research interest include nonparametric and robust multivariate methods with biological and medical applications.
