
KERNEL METHODS FOR STATISTICAL LEARNING IN COMPUTER VISION

AND PATTERN RECOGNITION APPLICATIONS

By

Refaat Mokhtar Mohamed

M.Sc., EE, Assiut University, Egypt, 2001

B.Sc., EE, Assiut University, Egypt, 1995

A Dissertation

Submitted to the Faculty of the

Graduate School of the University of Louisville

in Partial Fulfillment of the Requirements

for the Degree of

Doctor of Philosophy

Department of Electrical and Computer Engineering

University of Louisville

Louisville, Kentucky

December 2005


KERNEL METHODS FOR STATISTICAL LEARNING IN COMPUTER VISION AND PATTERN

RECOGNITION APPLICATIONS

By

Refaat Mokhtar Mohamed

A Dissertation Approved on

by the Following Reading and Examination Committee:

Aly Farag, Ph.D., Dissertation Director

Jon Atli Benediktsson, Ph.D.

Georgy Gimel’farb, Ph.D.

Greg Rempala, Ph.D.

Hichem Frigui, Ph.D.

Xiangqian Liu, Ph.D.

Tamer Inanc, Ph.D.

Ryan Gill, Ph.D.


DEDICATION

To:

The memory of my mother who died on August 29, 1985

My lovely wife Yosra


ACKNOWLEDGMENTS

All deepest and sincere thanks are due to Almighty ALLAH, the merciful, the compassionate, for the uncountable gifts given to me.

I would like to extend my deepest appreciation to Dr. Aly A. Farag for giving me the opportunity to join the CVIP Lab and for his direction and assistance in developing this dissertation. I would also like to thank Dr. Moumen Ahmed for helping me join the Lab. Many thanks to Dr. Greg Rempala for his continuous discussion and support. I would also like to thank Dr. Jon Atli Benediktsson for joining my Ph.D. committee, and Dr. Hichem Frigui, Dr. Xiangqian Liu, Dr. Tamer Inanc, and Dr. Ryan Gill for serving on my committee. I would also like to thank my colleagues in the CVIP Lab for their continuous support and friendship during the past years, with special thanks to Ayman El-Baz for his collaboration. I would like to thank all my friends in Louisville for turning my stay here into a pleasant life.

Finally, I would like to thank my family for their unwavering encouragement and support, without which this thesis and research would not have been possible.


ABSTRACT

KERNEL METHODS FOR STATISTICAL LEARNING IN COMPUTER VISION AND

PATTERN RECOGNITION APPLICATIONS

Refaat Mohamed

December 1, 2005

Statistical learning-based kernel methods are rapidly replacing other empirical learning methods (e.g., neural networks) as a preferred tool for machine learning, due to many attractive features: a strong basis in statistical learning theory; no computational penalty in moving from linear to nonlinear models; and a convex optimization problem, guaranteeing a unique global solution and consequently producing systems with excellent generalization performance. This research introduces statistical learning for solving different problems in computer vision and pattern recognition applications.

Probability density function (pdf) estimation is one of the major ingredients in Bayesian pattern recognition and machine learning. Many algorithms have been introduced for solving the density estimation problem in either a parametric or a nonparametric setup. In the parametric approach, a reasonable functional form for the probability density function is assumed, so the problem reduces to estimating the parameters of that functional form. For estimating general density functions, nonparametric setups are used, in which no form is assumed for the density function.

The curse of dimensionality is a major difficulty in density function estimation over high-dimensional data spaces, and developing algorithms that cope with the dimensionality problem is an active area of research in the pattern analysis community. The purpose of this thesis is to present a kernel-based method for solving the density estimation problem, one of the fundamental problems in machine learning. The proposed method is largely insensitive to the dimensionality problem.

The contribution of this thesis is threefold: creating a reliable and efficient learning-based density estimation algorithm which is minimally dependent on the input space dimensionality, investigating efficient learning algorithms for the proposed approach, and investigating the performance of the proposed algorithm in different computer vision and pattern recognition applications.


TABLE OF CONTENTS

DEDICATION
ACKNOWLEDGMENTS
ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
Nomenclature

CHAPTER

I. INTRODUCTION
   A. Research Domain of the Thesis
      1. Phase I: Implementation and Analysis of the MF-Based SVM Density Estimation Framework
      2. Phase II: Automation and Enhancements for the Learning Algorithm
      3. Phase III: Applications of the New MF-Based SVM Density Estimation Framework in Real World Pattern Recognition Problems
   B. Thesis Outline

II. STATISTICAL LEARNING BASED SUPPORT VECTOR MACHINES
   A. Support Vector Machines (SVM) Regression
   B. Mean Field Theory for Learning of SVM Regression
   C. Summary of the Statistical Learning MF-Based SVM Regression Algorithm
   D. Remarks on the MF-Based SVM Regression Algorithm
   E. Summary

III. DENSITY ESTIMATION USING MEAN FIELD BASED SUPPORT VECTOR MACHINES
   A. Density Estimation Problem Formulation
   B. Obtaining the Probability Density Function Estimate and Choosing the Kernel Function
   C. Summary of the Proposed SVM Density Estimation Algorithm
   D. Consistency of the Proposed Algorithm
      1. Equivalent Kernel in Gaussian Processes Prediction
      2. Consistency Argument of the Proposed Algorithm
   E. Convergence of the Proposed Algorithm
   F. Estimation of the Learning Parameters
      1. Kernel Optimization using the EM Algorithm
      2. Cross-Validation for Parameters Estimation
   G. Experiments for Evaluating the Proposed Density Estimation
      1. Density estimation for a 1-D Gaussian distribution
      2. Density estimation for a 1-D mixture of Gaussian distributions
      3. Density estimation for a 1-D Rayleigh distribution
      4. Comparison with state-of-the-art methods
      5. Density estimation for 2-D cases
      6. Experiments on the automatic selection of the kernel width using the EM algorithm
      7. Experiments on the automatic selection of the learning parameters using cross-validation
      8. Experiments on the algorithm convergence
   H. Conclusion

IV. STATISTICAL LEARNING IN COMPUTER VISION
   A. Camera Calibration
      1. An Overview
      2. Basic Regression Relations
      3. Simultaneous Optimization of the Regression Relations
      4. The Overall Calibration Algorithm
   B. Discussion of Some Calibration Methods
      1. Linear Direct Transform Method (LDT)
      2. Nonlinear Two Stages Method (NL)
      3. Neural Networks Method (NN)
      4. Heikkile Method (Heikki)
   C. Experimental Results and Discussions
      1. Simulation with Synthetic Data
      2. Experiments with Real Images
   D. Conclusion

V. APPLICATIONS OF THE PROPOSED DENSITY ESTIMATION APPROACH
   A. Test-of-Agreement (ToA) for the Response of Two Classifiers
   B. Experiments for Density Estimation Using Real Remote Sensing Multispectral Data
      1. Experiments for density estimation using a multispectral agricultural area
      2. Experiments for density estimation using a multispectral urban area
   C. Experiments for Density Estimation Using Real Remote Sensing Hyperspectral Data
      1. Experiments for density estimation using a hyperspectral 34-band data set
      2. Experiments for density estimation using a hyperspectral 58-band data set
   D. Applications in the Class Prior Probability Estimation
      1. MRF Model
      2. MRF Parameters Estimation Using SVM
      3. Image Segmentation Algorithm
      4. Experiment on MRF Model Parameters Estimation
      5. Experiments Using Remote Sensing Data
   E. Conclusion

VI. STATISTICAL LEARNING FOR CHANGE DETECTION
   A. Problem Statement
   B. Literature Review
   C. Proposed Change Detection Approach
   D. Statistical Shape Modeling
   E. Change Detection Algorithm
   F. Discussion of Some Change Detection Methods
      1. Change Detection using Automatic Analysis of the Difference Image and EM Algorithm (DIEM)
      2. Change Detection using MRF Modeling (DIMRF)
   G. Experimental Work
      1. Experiments on the proposed shape modeling approach
      2. Experiments on Statistical Shape Modeling Using the MF-Based SVM
      3. Experiments on the Segmentation Algorithm
         a. Image Segmentation Algorithm
         b. Results
      4. Experiments on Different Resolutions Data Sets
      5. Experiments on the Change Detection Algorithm
         a. Cairo Data Set
         b. Louisville Data
   H. Conclusion

VII. CONCLUSION
   A. Review and Applications
   B. Limitations
   C. Recommendations

REFERENCES
CURRICULUM VITA

LIST OF TABLES

1. Parameters of the 1-D mixture of Gaussians density function
2. Results for the mixture of a Gaussian and an Exponential density functions
3. Ground truth camera parameters versus estimated parameters
4. Error in the 3-D reconstructed data
5. Classification confusion matrix for the multispectral agricultural area using the MF-based SVM estimator
6. Classification accuracy using different density estimators for the multispectral agricultural area
7. Classification confusion matrix for the multispectral urban area using the MF-based SVM estimator
8. Classification accuracy using different density estimators for the multispectral urban area
9. Classification confusion matrix for the hyperspectral 34-band data using the MF-based SVM estimator
10. Classification accuracy using different density estimators for the hyperspectral 34-band data
11. Classification confusion matrix for the hyperspectral 58-band urban area using the MF-based SVM estimator
12. Classification accuracy using different density estimators for the hyperspectral 58-band urban area
13. The estimated means for 2nd order MRF cliques
14. Estimated parameters for the mixture of Gaussians distribution
15. Classification confusion matrix for the hyperspectral urban area after applying the MRF modeling
16. Classification accuracy after applying MRF modeling for the 58-band hyperspectral data set
17. Classification confusion matrix for the multispectral data set without using shape modeling
18. Classification confusion matrix for the multispectral data set using shape modeling
19. Comparison of classification accuracies for the 15-meter resolution data set using different algorithms
20. Comparison of classification accuracies for the 60-meter resolution data set using different algorithms
21. Detection rates of the different change detection approaches

LIST OF FIGURES

1. Splitting data in cross-validation setups
2. Estimation of the 1-D Gaussian density function with the SVM density estimation formulated with (a) the proposed formulation, (b) the traditional formulation
3. Estimation of a 1-D mixture of Gaussian density functions
4. Estimation of a Rayleigh density function
5. Estimation of a 1-D mixture of a Gaussian and an Exponential density function: (a) SDC method (quoted from [1]), (b) MF-based method
6. Estimation of a 2-D Gaussian density function: (a) the reference density function and its contour, (b) the estimated density using the traditional formulation-based SVM and its contour, (c) the estimated density using the MF-based SVM and its contour
7. Comparison between the estimation results of a 2-D mixture of an isotropic Gaussian and two Gaussians with both positive and negative correlation structure: (a) the contour of the estimated density using the Parzen window method, (b) the contour of the estimated density using the Reduced Set method, (c) the contour of the estimated density using the MF-based SVM method, and (d) the estimated density using the MF-based SVM
8. Estimation of the mixture of Gaussians in Fig. 3: (a) with the proposed algorithm for automatic kernel parameters estimation, (b) CDF of the estimated density without automatic kernel optimization, and (c) CDF of the estimated density with the proposed kernel optimization algorithm
9. Effect of the regularization constant C on the proposed algorithm performance
10. Convergence of the estimation error with the optimization iterations for the Gaussian density estimation example
11. Representation of camera calibration as a mapping problem
12. The RMSE for t_x as a function of noise σ, computed for the five approaches: linear, nonlinear using the simplex method, neuro-calibration, Heikki, and MF SVM
13. The RMSE for R_y
14. The RMSE for u_o
15. The RMSE for the skewness angle θ. Note: the Heikki method assumes an ideal camera model in the sense of skewness (i.e., θ = π/2), so there is no error indicated
16. Calibration setup: a stereo pair of images of a checkerboard calibration pattern
17. The 2-D projection of the calibration pattern corners (dots), and the corners detected from the image (left one of Fig. 16) of the calibration pattern (circles)
18. A multispectral agricultural area: (a) land cover, and classification results using (b) SVM and (c) MF-SVM as a density estimator
19. A multispectral urban area: (a) RGB snapshot, and color-coded classification results using (b) SVM and (c) MF-SVM as a density estimator
20. A hyperspectral 34-band urban area: (a) RGB snapshot, and color-coded classification results using (b) SVM and (c) MF-SVM as a density estimator
21. A hyperspectral 58-band urban area: (a) RGB snapshot, and color-coded classification results using (b) SVM and (c) MF-SVM as a density estimator
22. Numbering and order coding of the neighborhood structure
23. Clique shapes of the second order MRF model
24. A texture image for the MRF model parameters estimation experiment: (a) original image generated by the Metropolis algorithm, (b) histogram of the MRF model clique shapes of the original image, (c) regenerated image using the parameters estimated with the MF-based SVM algorithm, and (d) the estimated mixture of Gaussians fitted to the cliques histogram
25. Evolution of the log-likelihood in the hyperspectral 34-band example
26. The final segmented image with the proposed segmentation setup for the hyperspectral 34-band area
27. Evolution of the log-likelihood
28. The final segmented image for the 58-band data set
29. The RGB and the reference classified image of the Cairo data set
30. Samples of shape modeling for two classes of the Cairo data set: class points, signed distance map, and shape model density function for (a) the Water class and (b) the Transportation class
31. Classification results of the Cairo data set
32. Results for the 15-meter resolution data set: (a) original, (b) registration results, (c) classification results, and (d) classification results after inverse transformation
33. Results for the 60-meter resolution data set: (a) original, (b) registration results, (c) classification results, and (d) classification results after inverse transformation
34. Cairo data set for the change detection evaluation: (a) reference with changes, (b) reference changes-map
35. Results for change detection algorithms: (a) difference image using CVA, (b) detected changes-map using pixelwise analysis of the difference map, (c) detected changes-map using MRF modeling, and (d) detected changes-map using the proposed algorithm
36. Results for the change detection algorithm: (a) reference with changes, (b) reference change-map, (c) ordinary classification, and (d) detected changes-map using the proposed algorithm

Nomenclature

D_r  Data sample {(y_i, t_i)}
ε  An error tolerance
K(y, y′)  Covariance function
τ_r  Data sample of targets {t_i}
D  A data sample (input vectors and their associated targets)
τ  Targets vector
b  A bias in a linear kernel
C  A regularization constant
CS  Changes set
F(y)  Cumulative distribution function value at y
F_n(y)  Empirical cumulative distribution function value at y
I  An image
I_(−∞, y](u)  Indicator function
I_r  Reference image
M  Point in 3-D space
m  Point in 2-D space
P  (3 × 4) projection matrix
t  A target value
w_i  The weight corresponding to the training instant t_i in SVM regression
y  An instant of a raw data sample
K_n  Covariance matrix
L(t, g(y))  Loss function
σ_i²  Variance of a Gaussian distribution
g(y)  Estimated output corresponding to the input vector y
CVA  Change Vector Analysis
SD-Map  Signed distance map
ToA  Test of Agreement

CHAPTER I

INTRODUCTION

Learning, in its broadest sense, is defined as building a model from incomplete information in order to predict, as accurately as possible, some underlying structure of the unknown reality. In practice, for this definition to make sense, the information about the structure, the experience, and the model have to be expressed numerically. Thus, from the statistical point of view, the problem of learning becomes one of function estimation. As such, an estimator of functions is referred to as a learning machine.

The problem of statistical learning is rather broad. Given a statistical framework, one can consider any learning problem in the real world as a problem of estimating some unknown function. Some types of learning problems involve classes of functions with special properties, and it is convenient to group such learning problems into different domains. The special properties of the functions in these domains can admit simpler solutions than the general case. Indeed, by considering solutions to problems in certain domains, one can construct learning machines with properties that enable them to solve tasks which would (at present) be hard to solve in the general case.

In the statistical learning framework, learning means estimating a function from a given training data set. In this work, learning is used to solve many computer vision and pattern recognition problems: estimating the probability density function (pdf) which underlies the distribution of a given random sample, regression of functions, and further analysis of classical pattern recognition approaches.

The principal part of the current research uses statistical learning for various types of density estimation. The problem of probability density function estimation is a classical one in statistics. For the pattern recognition community, density estimation is the bottleneck in designing a large class of classifiers that depend on the maximum a posteriori probability principle. In this class, the classifier is designed under the assumption that the class conditional densities are known. Typically, only some vague, general knowledge about these densities is available, together with a number of design samples or training data. The problem, then, is to find some way to use this information to estimate the unknown densities.
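As a minimal sketch of the maximum a posteriori principle just described (in Python with NumPy; the class-conditional Gaussians, helper names, and priors are hypothetical illustrations, not from the dissertation), the classifier simply picks the class maximising prior times class-conditional density:

```python
import numpy as np

def map_classify(y, class_pdfs, priors):
    """Maximum a posteriori decision: choose the class k maximising
    P(k) * p(y | k). The class-conditional densities p(y | k) are
    exactly what the density estimation problem must supply."""
    posteriors = [prior * pdf(y) for pdf, prior in zip(class_pdfs, priors)]
    return int(np.argmax(posteriors))

def gauss(mu):
    # Unit-variance Gaussian density centred at mu (hypothetical class model).
    return lambda y: np.exp(-0.5 * (y - mu) ** 2) / np.sqrt(2.0 * np.pi)

# Two hypothetical classes with conditionals N(0, 1) and N(4, 1).
pdfs = [gauss(0.0), gauss(4.0)]
print(map_classify(0.5, pdfs, priors=[0.5, 0.5]))  # class 0 wins near 0
print(map_classify(3.5, pdfs, priors=[0.5, 0.5]))  # class 1 wins near 4
```

In practice the conditionals are unknown, which is precisely why a reliable density estimator is the bottleneck for this family of classifiers.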

Depending on the available prior knowledge about the structure of the density function, there are mainly two directions for density estimation: parametric and nonparametric [2]. The parametric approach assumes a functional model controlled by a given set of parameters which have to be fitted to the data. The goal of the learning process is to use the training sample to estimate this set of parameters. Examples of such learning paradigms are Maximum Likelihood Estimation (MLE), Bayesian Estimation, and the Expectation Maximization (EM) algorithm. Problems associated with parametric density estimation are:

1. The forms of the distribution functions are assumed to be known, whereas the density form for most practical applications is not known.

2. Most known forms of distributions are unimodal (single peak), while in most practical applications distributions are multimodal.

3. In most cases individual features are assumed to be independent, but approximating a multivariate distribution as a product of univariate distributions does not work well in practice.
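The first two limitations can be made concrete with a short sketch (Python with NumPy; the sample sizes and mixture parameters are illustrative assumptions): fitting a single Gaussian by MLE to bimodal data produces a unimodal estimate whose peak lies in the valley between the two true modes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Bimodal data: a mixture of N(-3, 1) and N(+3, 1), as often arises in practice.
sample = np.concatenate([rng.normal(-3.0, 1.0, 500),
                         rng.normal(+3.0, 1.0, 500)])

# MLE under a single-Gaussian assumption has a closed form:
# the sample mean and the (biased) sample variance.
mu_hat = sample.mean()
sigma2_hat = sample.var()

def gaussian_pdf(y, mu, sigma2):
    """Density of N(mu, sigma2) at y."""
    return np.exp(-(y - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

# The fitted density is unimodal, peaking near 0 rather than at the
# true modes -3 and +3; mu_hat is near 0 and sigma2_hat near 10.
print(mu_hat, sigma2_hat)
```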

Due to the above problems associated with parametric density estimation methods, an extensive research effort has been directed toward nonparametric methods. The main difference with respect to the parametric approach comes from the influence of the data points y_i on the estimate at location y. All the points have the same importance for parametric estimators, whereas nonparametric estimators are asymptotically local, i.e., the influence of a point vanishes as its distance from the location where the density is computed increases. Examples of nonparametric methods are the K-Nearest Neighbor (KNN) and Parzen-window methods [3].
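As a minimal one-dimensional sketch of the Parzen-window method (Python with NumPy; the Gaussian kernel, the function name, and the bandwidth are illustrative choices, not from the text), each training point contributes a local kernel bump whose influence vanishes with distance:

```python
import numpy as np

def parzen_window_estimate(y, sample, h):
    """Parzen-window density estimate at y with a Gaussian kernel of width h:
    f_hat(y) = (1 / (n*h)) * sum_i K((y - y_i) / h)."""
    sample = np.asarray(sample, dtype=float)
    u = (y - sample) / h
    kernel_values = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return kernel_values.sum() / (len(sample) * h)

rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, 2000)

# Near the mode of N(0, 1) the estimate approaches 1/sqrt(2*pi) ~ 0.399,
# while far from all samples it is essentially zero (asymptotic locality).
print(parzen_window_estimate(0.0, data, h=0.3))
print(parzen_window_estimate(10.0, data, h=0.3))
```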

The term "curse of dimensionality" describes the problem that occurs when searching in, or estimating a pdf on, high-dimensional spaces. The complexity grows exponentially with the dimension, rapidly outstripping the computational and memory storage capabilities of computers. The problem of estimating a density function on a high-dimensional space may be seen as determining the density at each cell of a multidimensional grid. Given a fixed number K of grid lines per dimension, the number of independent cells grows as K^p, where p is the dimension. Furthermore, if the density function is to be estimated from a set of high-dimensional samples, the number of samples required for accurate density function estimation also grows as K^p. As stated in [4], with a fixed number of training samples, the dimension for which accurate estimation is possible is severely limited to a small number, usually about 5, depending on the specific problem.
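The K^p growth is easy to see numerically; a tiny calculation (K = 10 grid lines per dimension is an arbitrary illustrative choice):

```python
# Number of independent grid cells, K**p, for K = 10 grid lines per
# dimension: the cell count (and the matching sample budget for a
# grid-based estimate) grows exponentially with the dimension p.
K = 10
for p in (1, 2, 3, 5, 10):
    print(p, K ** p)  # 10, 100, 1000, 100000, 10000000000
```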

This study aims to develop a learning algorithm for nonparametric density estimation that is independent of the input-space dimensionality. The algorithm uses the Support Vector Machines (SVM) algorithm as the main building block of the density estimation procedure.

A. Research Domain of the Thesis

This study addresses the density estimation problem in spaces of various dimensionalities using the SVM algorithm. Existing algorithms suffer from the following shortcomings:

1. The curse of dimensionality problem.


2. Poor learning procedures for the SVM algorithm, in terms of both accuracy and time requirements.

3. The performance of the SVM density estimation algorithm in real-world applications is not well investigated.

The research proposed in this thesis addresses these shortcomings and presents a novel framework for probability density estimation using the SVM. A novel procedure which incorporates Mean Field (MF) theory into the learning of the SVM density estimator is proposed. For this reason, the approach proposed in this thesis is called the Mean Field-based Support Vector Machines (or simply MF-based SVM) density estimation approach. The work in this thesis is divided into three phases:

1. Phase I: Implementation and Analysis of the MF-Based SVM Density Estimation

Framework

In this phase, the theoretical aspects of the MF-based SVM density estimation framework are developed. This requires studying statistical learning principles and formulating the statistical representation of the proposed framework. Analysis and performance assessment of the framework are also carried out, covering its accuracy on different data types (synthetic and real) and in spaces of different dimensionality, as well as its running-time behavior.

2. Phase II: Automation and Enhancements for the Learning Algorithm

In this phase, methods are developed for enhancing the optimization algorithm in terms of accuracy, speed, and automation. The EM algorithm is proposed to be incorporated into the learning process for estimating one set of the parameters. Also, the Cross-Validation principle


is proposed to be used for estimating the remaining learning parameters.

3. Phase III: Applications of the New MF-Based SVM Density Estimation Framework in

Real World Pattern Recognition Problems

In this phase, the proposed framework is used in different applications of the pattern recognition problem. In the following, brief descriptions of four such applications are presented.

1. Explicit Camera Calibration

Camera calibration is an extensively studied topic in the machine intelligence communities. Explicit camera calibration methods develop solutions by analyzing a physical model of camera imaging, so that calibration amounts to identifying a set of model parameters with physical meaning [5], whereas implicit calibration methods resort to realizing a nonlinear mapping function that describes the input-output relation well [6]. Explicit calibration methods can provide the camera's physical parameters, which are important in some applications, such as computer graphics, virtual reality, and 3-D reconstruction.

This research presents an explicit approach for solving the camera calibration problem. In principle, the approach treats the problem as a mapping from the 3D world coordinate system to the 2D image coordinate system, where the projection matrix is the mapping function, and a statistically based regression algorithm is used to approximate this mapping.
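The mapping just described can be sketched concretely; the $3 \times 4$ projection matrix below is hypothetical (identity rotation, zero translation, focal length 500, principal point (320, 240)), chosen only for illustration:

```python
# Sketch of the mapping a calibration method must recover: a 3x4 projection
# matrix P maps homogeneous 3D world coordinates to homogeneous 2D image
# coordinates. The matrix P below is a hypothetical example, not a
# calibration result from this work.

def project(P, X):
    """Apply x ~ P X and dehomogenize to pixel coordinates (u, v)."""
    Xh = X + [1.0]                                   # homogeneous world point
    x = [sum(P[r][c] * Xh[c] for c in range(4)) for r in range(3)]
    return (x[0] / x[2], x[1] / x[2])                # divide by third coordinate

P = [[500.0, 0.0, 320.0, 0.0],
     [0.0, 500.0, 240.0, 0.0],
     [0.0,   0.0,   1.0, 0.0]]

u, v = project(P, [0.2, -0.1, 2.0])
print(u, v)  # -> 370.0 215.0
```

A regression-based calibration method, in effect, learns this world-to-image mapping from point correspondences instead of writing it down analytically.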

2. Remote Sensing

Segmentation of satellite imagery is one of the important research topics in remote sensing data analysis, and it is one of the funded projects in the CVIP Lab. An important aspect of remote sensing data is the high dimensionality of the data space. The proposed framework is therefore applicable and provides a promising direction for estimating the density function in remote sensing spaces, which can then be used for image segmentation in further processing.

3. Parameter Estimation for Markov Random Field Models

Markov Random Fields (MRF) provide a practical implementation of region modeling in image segmentation. The problem of parameter estimation for MRF models is known to be a challenging task in the pattern recognition literature. This study presents a formulation of MRF modeling such that the MF-based SVM algorithm can be applied to estimate the parameters of MRF models.

4. Change Detection in Images

Change detection in images finds many applications in city planning, monitoring, and security assessment. This research introduces statistical learning as a tool for detecting changes in the land cover of scenes using space-borne imagery. It proposes a new approach and evaluates it on remote sensing data sets.

B. Thesis Outline

This thesis consists of seven chapters. The following remarks summarize the scope

of each chapter.

Chapter 1 introduces the thesis topic and discusses the motivation for the thesis

work. In addition, it discusses the research domain of the thesis and outlines the thesis

manuscript.

Chapter 2 discusses the theoretical formulation of the proposed MF-based SVM regression framework. The statistical aspects behind the presented formulation, and the differences between it and previous formulations, are highlighted. The introduction of Mean Field theory into the learning of the SVM algorithm is discussed.

Chapter 3 applies the proposed MF-based SVM regression framework to a fundamental problem of machine learning: probability density estimation. The theoretical aspects of this MF-based SVM density estimation approach, including its consistency and convergence, are discussed in depth. Statistical performance measures are used to illustrate the results of the presented density estimation approach. Also, several estimation approaches for automating the learning algorithm are presented and illustrated: the EM algorithm for estimating the kernel parameters, and Cross-Validation for estimating the remaining parameters.

Chapter 4 presents the application of the proposed regression approach to a well-known computer vision problem: camera calibration. The motivations behind using statistical learning approaches for camera calibration are discussed in this chapter. The formulation of the camera calibration problem in a regression setup is outlined, and the link between this formulation and the learning of the MF-based SVM regression algorithm is established. A mixed learning algorithm combining gradient descent and MF-based SVM regression is formulated and applied to synthetic as well as real data sets.

Chapter 5 presents the application of the proposed density estimation approach to the segmentation of remote sensing data sets. Application of the framework to the segmentation of real-world multispectral and hyperspectral imagery is presented and evaluated against other algorithms. Estimation of MRF parameters in image modeling using the proposed MF-based SVM framework is also presented.

Chapter 6 presents the application of the proposed statistical learning approaches to the change detection problem. The problem considered in this work is detecting changes in the land cover of a scene using remote sensing imagery. The approach depends on using statistical learning to model the shapes of the classes defined in the reference image. These shape models are used together with the statistical sensor models (probability densities) to detect changes in a scene.

Chapter 7 presents the conclusions and the future directions for extending the current work.


CHAPTER II

STATISTICAL LEARNING BASED SUPPORT VECTOR MACHINES

Support Vector Machines (SVM) were invented by Vladimir Vapnik and his co-workers (first introduced in [7]). They are a specific class of algorithms, characterized by the use of kernels, the absence of local minima, the sparseness of the solution, and capacity control obtained by acting on the margin or on the number of support vectors. However, all of these features had been present in machine learning since the 1960s; it was not until 1992 that they were put together to form the maximal margin classifier, the basic Support Vector Machine, and not until 1995 that the soft margin version was introduced [8]. SVM can be applied not only to classification problems but also to regression [9]. The regression formulation still contains all the main features that characterize the maximum margin algorithm: a non-linear function is learned by a linear learning machine in a high-dimensional kernel-induced feature space.

SVM are gaining popularity due to many attractive features and promising empirical performance. For instance, the formulation of SVM density estimation employs the Structural Risk Minimization (SRM) principle, which has been shown to be superior to the traditional Empirical Risk Minimization (ERM) principle employed by conventional learning algorithms (e.g., neural networks) [10]. SRM minimizes an upper bound on the generalization error, as opposed to ERM, which minimizes the error on the training data. It is this difference which makes SVM more attractive for statistical learning applications.

The traditional formulation of the SVM density estimation problem raises a quadratic optimization problem of the same size as the training data set. This computationally demanding optimization problem has prevented the SVM from becoming the default choice of the pattern recognition community [11].

Several approaches have been introduced for circumventing the above shortcomings of SVM learning. These include simpler optimization criteria for SVM design (e.g., the kernel ADATRON [12]), specialized QP algorithms like the conjugate gradient method, decomposition techniques (which break the large QP problem into a series of smaller QP sub-problems), the sequential minimal optimization (SMO) algorithm and its various extensions [13], Nystrom approximations [14], greedy Bayesian methods [15], and the Chunking algorithm [16]. Recently, active learning has become a popular paradigm for reducing the sample complexity of large-scale learning tasks (e.g., [17-19]). In active learning, instead of learning from "random samples," the learner has the ability to select its own training data. This is done iteratively, and the output of one step is used to select the examples for the next step.

In this chapter, an algorithm which uses Mean Field (MF) theory is presented for the learning of the SVM estimator. MF methods provide efficient approximations which are able to cope with the complexity of probabilistic data models [20]. They replace the intractable task of computing high-dimensional sums and integrals with the much easier problem of solving a system of linear equations. The density estimation problem is formulated so that the MF method can be used to approximate the learning procedure in a way that avoids quadratic programming optimization. The proposed approach is suitable for high-dimensional density estimation problems, and it has been successfully applied to various remote sensing data sets.

This chapter outlines the density estimation problem. A supervised density estimation algorithm based on the SVM approach is presented, and a practical learning procedure is discussed, together with the practical aspects of selecting the learning parameters for proper density function estimation.


A. Support Vector Machines (SVM) Regression

The above discussion shows how the supervised density estimation problem is reduced to a regression problem. In this section, the SVM is presented as a supervised regression tool; later on, it will be shown how it can be used as a density estimator for the CCP. In the following discussion, SVM regression is considered as maximum a posteriori prediction with a Gaussian prior, under the Bayesian framework (Bayes' theorem is used to relate the prior and posterior distributions). The idea is that, instead of defining prior distributions over the parameters of the learning machine, a Gaussian prior distribution is assumed over the space of functions the machine computes. In general, the supervised regression learning problem can be stated as follows:

Given a training data set $D = \{(\mathbf{y}_i, t_i)\;|\;i = 1, 2, \ldots, n\}$ of input vectors $\mathbf{y}_i$ and associated targets $t_i$, the goal is to infer the output $t$ for a new input data point $\mathbf{y}$. Generally, a loss function relating the estimated target $g(\mathbf{y})$ and the true target $t$ is defined to characterize the regression problem. In this work, Vapnik's $\varepsilon$-insensitive loss function is used, which is defined as:

$$L(t, g(\mathbf{y})) = \begin{cases} 0 & \text{if } |t - g(\mathbf{y})| \le \varepsilon \\ |t - g(\mathbf{y})| - \varepsilon & \text{otherwise} \end{cases} \tag{1}$$

where $\varepsilon > 0$ is a predefined constant which controls the noise tolerance. To construct a Bayesian framework under the loss function assumed in (1), an exponential model is employed. In this model, the likelihood of the true output $t$ at a given point $\mathbf{y}$, given that the machine output is $g(\mathbf{y})$, is assumed to follow:

$$p(t \mid g(\mathbf{y})) = \frac{C}{2(\varepsilon C + 1)} \exp\bigl(-C\, L(t, g(\mathbf{y}))\bigr) \tag{2}$$
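As a quick illustration, the loss in (1) and the likelihood model in (2) can be written directly; the values of $\varepsilon$ and $C$ below are arbitrary choices:

```python
import math

def eps_loss(t, g, eps):
    """Vapnik's epsilon-insensitive loss, Eq. (1): zero inside the eps-tube."""
    return max(abs(t - g) - eps, 0.0)

def likelihood(t, g, eps, C):
    """Exponential likelihood model of Eq. (2), built on the eps-insensitive loss."""
    return C / (2.0 * (eps * C + 1.0)) * math.exp(-C * eps_loss(t, g, eps))

print(eps_loss(1.0, 1.05, 0.1))   # inside the tube -> 0.0
print(eps_loss(1.0, 1.30, 0.1))   # outside the tube -> approximately 0.2
```

Residuals inside the $\varepsilon$-tube incur no loss, so the likelihood is flat there and decays exponentially outside it.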

Since the elements of the training sample are assumed to be statistically independent random vectors, the probabilistic interpretation of SVM regression can be considered to have the following likelihood (see [20]):

$$p(\tau \mid g(D)) = \left(\frac{C}{2(\varepsilon C + 1)}\right)^{\!n} \exp\left(-C \sum_{i=1}^{n} L(t_i, g(\mathbf{y}_i))\right) \tag{3}$$

where $\tau = [t_1, t_2, \ldots, t_n]$ and $g(D) = [g(\mathbf{y}_1), g(\mathbf{y}_2), \ldots, g(\mathbf{y}_n)]$. Since the SVM is considered as a maximum a posteriori estimator with a Gaussian prior, the prior probability distribution of the prediction $g(\mathbf{y})$ is assumed to be a Gaussian Process (GP). Generally, a GP is a stochastic process which is completely specified by its mean vector and covariance matrix [21]. Thus, the prior probability for a sample $D$ can be expressed as a GP with zero mean (for simplicity) and a covariance function $K(\mathbf{y}, \mathbf{y}')$ as:

$$p(g(D)) = \frac{1}{\sqrt{(2\pi)^n \det(K_n)}} \exp\left(-\frac{1}{2}\, g(D)\, K_n^{-1}\, g(D)^T\right) \tag{4}$$

where $K_n = [K(\mathbf{y}_i, \mathbf{y}_j)]$ is the covariance matrix at the points of $D$ ($K(\cdot,\cdot)$ is a kernel function). The posterior can be obtained via Bayes' theorem:

$$p(g(D) \mid D) = \frac{p(D \mid g(D))\, p(g(D))}{p(D)} = \frac{M \exp\left(-C \sum_{i=1}^{n} L(t_i, g(\mathbf{y}_i)) - \frac{1}{2}\, g(D)\, K_n^{-1}\, g(D)^T\right)}{\sqrt{(2\pi)^n \det(K_n)}\;\, p(D)} \tag{5}$$

where $M = \left(\frac{C}{2(\varepsilon C + 1)}\right)^n$. The estimate of the posterior prediction distribution is the one that maximizes the numerator of (5). Equivalently, the MAP estimate is obtained from:

$$\min_{g(D)}\; C \sum_{i=1}^{n} L(t_i, g(\mathbf{y}_i)) + \frac{1}{2}\, g(D)\, K_n^{-1}\, g(D)^T \tag{6}$$

A direct solution of (6) can be obtained by quadratic programming optimization (e.g., [22]). The size of the optimization problem is the same as the size of the training sample. Since Quadratic Programming (QP) routines have high complexity and require large amounts of memory and computational time for large data applications, solving the QP, especially with a dense $n \times n$ matrix, limits the use of the SVM algorithm for large data sets (e.g., [11]). One way to avoid raising such a QP problem is to consider an approximate formulation of SVM regression [23]. The rest of this section and the following section present one such method.
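As a concrete illustration of the zero-mean GP prior in (4), the sketch below evaluates $p(g(D))$ for two toy inputs under an RBF covariance; all values are arbitrary choices for the example:

```python
import math

def rbf(a, b):
    """Unit-width Gaussian RBF covariance (an arbitrary choice for this sketch)."""
    return math.exp(-0.5 * (a - b) ** 2)

# Zero-mean GP prior of Eq. (4) for n = 2 training inputs (toy values).
y1, y2 = 0.0, 1.0
K = [[rbf(y1, y1), rbf(y1, y2)],
     [rbf(y2, y1), rbf(y2, y2)]]
det = K[0][0] * K[1][1] - K[0][1] * K[1][0]
Kinv = [[K[1][1] / det, -K[0][1] / det],
        [-K[1][0] / det, K[0][0] / det]]

def prior_density(g):
    """p(g(D)) = N(g; 0, K_n) evaluated at the function values g = [g1, g2]."""
    quad = sum(g[r] * Kinv[r][c] * g[c] for r in range(2) for c in range(2))
    return math.exp(-0.5 * quad) / math.sqrt((2 * math.pi) ** 2 * det)

print(prior_density([0.0, 0.0]))   # mode of the prior
print(prior_density([1.0, -1.0]))  # anti-correlated values are much less likely
```

Because nearby inputs are strongly correlated under the RBF covariance, function values that swing in opposite directions at $y_1$ and $y_2$ receive much lower prior density than smooth ones.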

Using the posterior prediction distribution $p(g(D) \mid D)$ defined in (5), the prediction (expectation) at a new test point $\mathbf{y}$ is given by:

$$\langle g(\mathbf{y})\rangle = \int g(\mathbf{y})\, p(g(\mathbf{y}) \mid D)\, dg(\mathbf{y}) = \int g(\mathbf{y})\, p(g(\mathbf{y}), g(D) \mid D)\, dg(\mathbf{y})\, dg(D) \tag{7}$$

Substituting from (5) into (7), and after some mathematical reduction:

$$\langle g(\mathbf{y})\rangle = \frac{M}{\sqrt{(2\pi)^n \det(K_n)}} \int g(\mathbf{y})\, A\;\, dg(\mathbf{y})\, dg(D) \tag{8}$$

where:

$$A = \frac{\exp\left(-C \sum_{i=1}^{n} L(t_i, g(\mathbf{y}_i)) - \frac{1}{2}\, g(D,\mathbf{y})\, K_{n+1}^{-1}\, g(D,\mathbf{y})^T\right)}{p(D)},$$

$$K_{n+1} = \begin{bmatrix} K_n & K_n(\mathbf{y})^T \\ K_n(\mathbf{y}) & K(\mathbf{y},\mathbf{y}) \end{bmatrix}, \quad\text{and}\quad K_n(\mathbf{y}) = [K(\mathbf{y}_1,\mathbf{y}), K(\mathbf{y}_2,\mathbf{y}), \ldots, K(\mathbf{y}_n,\mathbf{y})].$$

But:

$$g(D)\, p(g(D)) = K_n K_n^{-1}\, g(D)\, p(g(D)) = -K_n\, \frac{\partial}{\partial g(D)}\, p(g(D)).$$

Then, by extending the prior to include the new point (the test point $\mathbf{y}$), we get:

$$g(\mathbf{y}) \exp\left(-\frac{1}{2}\, g(D,\mathbf{y})\, K_{n+1}^{-1}\, g(D,\mathbf{y})^T\right) = \sum_{i=1}^{n+1} K(\mathbf{y}, \mathbf{y}_i)\, \frac{\partial}{\partial g(\mathbf{y}_i)} \exp\left(-\frac{1}{2}\, g(D,\mathbf{y})\, K_{n+1}^{-1}\, g(D,\mathbf{y})^T\right) \tag{9}$$

13

Page 30:  · KERNEL METHODS FOR STATISTICAL LEARNING IN COMPUTER VISION AND PATTERN RECOGNITION APPLICATIONS By Refaat Mokhtar Mohamed M.Sc., EE, Assiut University, Egypt, 2001 B.Sc., EE,

Substituting from (9) into (8) and applying integration by parts to shift the differentiation from the prior to the likelihood:

$$\langle g(\mathbf{y})\rangle = \frac{M}{p(D)} \sum_{i=1}^{n} K(\mathbf{y}, \mathbf{y}_i) \int \mathcal{N}(g(D) \mid 0, K_n)\, \frac{\partial}{\partial g(\mathbf{y}_i)} \exp\left(-C \sum_{j=1}^{n} L(t_j, g(\mathbf{y}_j))\right) dg(D) = \sum_{i=1}^{n} w_i\, K(\mathbf{y}, \mathbf{y}_i) \tag{10}$$

where $w_i$ is a constant defined as:

$$w_i = \frac{M}{p(D)} \int \mathcal{N}(g(D) \mid 0, K_n)\, \frac{\partial}{\partial g(\mathbf{y}_i)} \exp\left(-C \sum_{j=1}^{n} L(t_j, g(\mathbf{y}_j))\right) dg(D) \tag{11}$$
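Once the weights $w_i$ are known, prediction with the expansion in (10) is just a weighted sum of kernel evaluations. A minimal sketch (the RBF kernel, its width, and the toy weights below are arbitrary choices, not values from this work):

```python
import math

def rbf_kernel(y, yi, gamma=1.0):
    """Isotropic Gaussian RBF kernel between two feature vectors."""
    sq = sum((a - b) ** 2 for a, b in zip(y, yi))
    return math.exp(-gamma * sq)

def predict(y, train_points, weights, gamma=1.0):
    """Kernel expansion <g(y)> = sum_i w_i K(y, y_i), as in Eq. (10)."""
    return sum(w * rbf_kernel(y, yi, gamma)
               for w, yi in zip(weights, train_points))

train = [[0.0, 0.0], [1.0, 1.0]]
w = [0.5, -0.25]          # toy weights; normally these are learned from data
print(predict([0.0, 0.0], train, w))  # 0.5*K(y,y_1) - 0.25*K(y,y_2)
```

The entire learning problem therefore reduces to estimating the $n$ scalar weights $w_i$, which is what the mean field approximation in the next section provides.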

B. Mean Field Theory for Learning of SVM Regression

The learning process requires that the weights $w_i$ in (11) be estimated from the training sample. But, as can be seen, (11) is highly complicated and computationally expensive, since it contains many integrations which would need to be evaluated numerically. In this work, Mean Field theory is used to obtain an approximate, easily evaluated expression for the $w_i$ [24, 25]. The basic idea of mean field theory is to approximate the statistics of a random variable which is correlated with other random variables by assuming that the influence of the other variables can be compressed into a single effective mean "field" with a rather simple distribution [20]. While MF theory arose primarily in statistical mechanics [26], it has more recently been applied elsewhere, for example to inference in graphical models in artificial intelligence [27, 28].

In this work, this approach is used to approximate the so-called cavity distribution. The cavity distribution is defined as $p(g(\mathbf{y}_i) \mid D^{\setminus i})$, where $g(\mathbf{y}_i)$ is the regressed SVM output corresponding to an instance $\mathbf{y}_i$ which is left out of the training sample, and $D^{\setminus i}$ is the training sample without the instance $\mathbf{y}_i$.

For the cavity derivation, it is useful to introduce a new predictive posterior for the output corresponding to the instance $\mathbf{y}_i$:

$$p(g(\mathbf{y}_i) \mid D^{\setminus i}) = \frac{\int p(g(D))\, p(\tau^{\setminus i} \mid g(D))\;\, dg(D^{\setminus i})}{\int p(g(D))\, p(\tau^{\setminus i} \mid g(D))\;\, dg(D)} \tag{12}$$

where $\tau^{\setminus i}$ is the target vector $\tau$ excluding $t_i$.

For the predictive posterior in (12), an average (expected value) can be defined as:

$$\langle V \rangle_i = \int V\; p(g(\mathbf{y}_i) \mid D^{\setminus i})\;\, dg(\mathbf{y}_i) \tag{13}$$

where $\langle V \rangle_i$ denotes the expected value of $V$ given only the data sample $D^{\setminus i}$. Then the expression for the weight $w_i$ in (11) can be rewritten as:

$$w_i = \frac{\left\langle M\, \frac{\partial}{\partial g(\mathbf{y}_i)} \exp\bigl(-C\, L(t_i, g(\mathbf{y}_i))\bigr) \right\rangle_i}{\left\langle M \exp\bigl(-C\, L(t_i, g(\mathbf{y}_i))\bigr) \right\rangle_i} \tag{14}$$

To enable the weights' calculation from (14), a closed form for the cavity distribution in (12) is required, and that is where Mean Field theory comes into play. The MF approach makes it possible to calculate averages over $p(g(\mathbf{y}_i) \mid D^{\setminus i})$ because it is a predictive posterior of the field at an input $\mathbf{y}_i$. The MF approximates $p(g(\mathbf{y}_i) \mid D^{\setminus i})$ with a Gaussian distribution of the form:

$$p(g(\mathbf{y}_i) \mid D^{\setminus i}) \approx \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(-\frac{\bigl(g(\mathbf{y}_i) - \langle g(\mathbf{y}_i)\rangle_i\bigr)^2}{2\sigma_i^2}\right) \tag{15}$$

where the variance is defined as $\sigma_i^2 = \langle g(\mathbf{y}_i)^2 \rangle_i - \langle g(\mathbf{y}_i)\rangle_i^2$. Inserting (15) into (13) and evaluating (14), the weight coefficients can be obtained as:

$$w_i \approx \frac{F\bigl(\langle g(\mathbf{y}_i)\rangle_i, \sigma_i^2\bigr)}{G\bigl(\langle g(\mathbf{y}_i)\rangle_i, \sigma_i^2\bigr)} = \frac{F_i}{G_i} \tag{16}$$

15

Page 32:  · KERNEL METHODS FOR STATISTICAL LEARNING IN COMPUTER VISION AND PATTERN RECOGNITION APPLICATIONS By Refaat Mokhtar Mohamed M.Sc., EE, Assiut University, Egypt, 2001 B.Sc., EE,

where:

$$\begin{aligned}
F_i = {}& \frac{C}{2} \exp\left(\frac{C}{2}\bigl(2\langle g(\mathbf{y}_i)\rangle_i - 2t_i + 2\varepsilon + C\sigma_i^2\bigr)\right) \times \left(1 - \operatorname{erf}\left(\frac{\langle g(\mathbf{y}_i)\rangle_i - t_i + \varepsilon + C\sigma_i^2}{\sqrt{2\sigma_i^2}}\right)\right) \\
& - \frac{C}{2} \exp\left(\frac{C}{2}\bigl(-2\langle g(\mathbf{y}_i)\rangle_i + 2t_i + 2\varepsilon + C\sigma_i^2\bigr)\right) \times \left(1 - \operatorname{erf}\left(\frac{-\langle g(\mathbf{y}_i)\rangle_i + t_i + \varepsilon + C\sigma_i^2}{\sqrt{2\sigma_i^2}}\right)\right)
\end{aligned}$$

and

$$\begin{aligned}
G_i = {}& \frac{1}{2} \operatorname{erf}\left(\frac{t_i - \langle g(\mathbf{y}_i)\rangle_i + \varepsilon}{\sqrt{2\sigma_i^2}}\right) - \frac{1}{2} \operatorname{erf}\left(\frac{t_i - \langle g(\mathbf{y}_i)\rangle_i - \varepsilon}{\sqrt{2\sigma_i^2}}\right) \\
& + \frac{C}{2} \exp\left(\frac{C}{2}\bigl(2\langle g(\mathbf{y}_i)\rangle_i - 2t_i + 2\varepsilon + C\sigma_i^2\bigr)\right) \times \left(1 - \operatorname{erf}\left(\frac{\langle g(\mathbf{y}_i)\rangle_i - t_i + \varepsilon + C\sigma_i^2}{\sqrt{2\sigma_i^2}}\right)\right) \\
& + \frac{C}{2} \exp\left(\frac{C}{2}\bigl(-2\langle g(\mathbf{y}_i)\rangle_i + 2t_i + 2\varepsilon + C\sigma_i^2\bigr)\right) \times \left(1 - \operatorname{erf}\left(\frac{-\langle g(\mathbf{y}_i)\rangle_i + t_i + \varepsilon + C\sigma_i^2}{\sqrt{2\sigma_i^2}}\right)\right)
\end{aligned} \tag{17}$$

Equations (16) and (17) are called the Mean Field equations corresponding to the weight coefficient $w_i$. To evaluate the weight coefficients in (16), both the mean (average) $\langle g(\mathbf{y}_i)\rangle_i$ and the variance $\sigma_i^2$ of the assumed Gaussian model for the local predictive distribution $p(g(\mathbf{y}_i) \mid D^{\setminus i})$ are required. The detailed derivations of both $\langle g(\mathbf{y}_i)\rangle_i$ and $\sigma_i^2$ based on mean field theory can be found in [20]; only the final results are summarized here. The posterior average at $\mathbf{y}_i$ is given by:

$$\langle g(\mathbf{y}_i)\rangle = \sum_{j=1}^{n} w_j\, K(\mathbf{y}_i, \mathbf{y}_j) \tag{18}$$


From [20], the following results are obtained:

$$\langle g(\mathbf{y}_i)\rangle_i \approx \langle g(\mathbf{y}_i)\rangle - \sigma_i^2\, w_i \tag{19}$$

and

$$\sigma_i^2 \approx \frac{1}{\left[(\Sigma + K)^{-1}\right]_{ii}} - \Sigma_i \tag{20}$$

where:

$$\Sigma = \operatorname{diag}(\Sigma_1, \Sigma_2, \ldots, \Sigma_n), \quad\text{and}\quad \Sigma_i = -\sigma_i^2 - \left(\frac{\partial w_i}{\partial \langle g(\mathbf{y}_i)\rangle_i}\right)^{-1}$$

The expression for $\frac{\partial w_i}{\partial \langle g(\mathbf{y}_i)\rangle_i}$ can be obtained from Equations (16) and (17) as:

$$\frac{\partial w_i}{\partial \langle g(\mathbf{y}_i)\rangle_i} \approx \frac{C^2 - w_i - w_i\langle g(\mathbf{y}_i)\rangle_i + \sigma_i^2 C^2 \displaystyle\int_{t_i-\varepsilon}^{t_i+\varepsilon} p(g(\mathbf{y}_i) \mid D^{\setminus i})\, dg(\mathbf{y}_i)}{\sigma_i^2\, G(\langle g(\mathbf{y}_i)\rangle_i, \sigma_i^2)} \approx \frac{C^2 - w_i - w_i\langle g(\mathbf{y}_i)\rangle_i + \sigma_i^2 C^2\, I_{G_i}}{\sigma_i^2\, G(\langle g(\mathbf{y}_i)\rangle_i, \sigma_i^2)} \tag{21}$$

where the integral over the $\varepsilon$-tube is evaluated under the Gaussian approximation (15) as:

$$I_{G_i} = \frac{1}{2} \operatorname{erf}\left(\frac{t_i - \langle g(\mathbf{y}_i)\rangle_i + \varepsilon}{\sqrt{2\sigma_i^2}}\right) - \frac{1}{2} \operatorname{erf}\left(\frac{t_i - \langle g(\mathbf{y}_i)\rangle_i - \varepsilon}{\sqrt{2\sigma_i^2}}\right)$$

C. Summary of the Statistical Learning MF-Based SVM Regression Algorithm

The implementation steps of the MF-based SVM regression algorithm, with mean field theory applied to the learning process, are presented below:

Step 1. Prepare the training data set $D$.

Step 2. Set a learning rate $\eta$ and randomly initialize the $w_i$.

Step 3. Choose a kernel $K(\mathbf{y}, \mathbf{y}')$ and, accordingly, calculate the covariance matrix $K_n$; let $\sigma_i^2 = [K_n]_{ii}$.

Step 4. Iterate Steps 5 and 6 until convergence of the $w_i$.

Step 5. "Inner loop": for $i = 1, 2, \ldots, n$ do

5.a calculate $\langle g(\mathbf{y}_i)\rangle$ from (18);

5.b calculate $\langle g(\mathbf{y}_i)\rangle_i$ from (19);

5.c calculate $F_i$ and $G_i$ from (17);

5.d update $w_i$ by:

$$w_i = w_i + \eta\left(\frac{F_i}{G_i} - w_i\right)$$

Step 6. "Outer loop": every $M$ iterations of the $w_i$ updates, update $\sigma_i^2$ from (20).
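The steps above can be sketched in code. The following is an illustrative, simplified implementation rather than this work's own: the outer-loop update of $\sigma_i^2$ from (20) is skipped (the variances stay frozen at $[K_n]_{ii}$), the weights are initialized to zero instead of randomly for determinism, and the kernel, $C$, $\varepsilon$, and $\eta$ are arbitrary choices:

```python
import math

def rbf(a, b):
    """Isotropic unit-width Gaussian RBF kernel (arbitrary choice)."""
    return math.exp(-0.5 * sum((x - z) ** 2 for x, z in zip(a, b)))

def F_G(m, t, var, C, eps):
    """F_i and G_i of Eq. (17) for cavity mean m, target t, variance var."""
    sd = math.sqrt(2.0 * var)
    up = (C / 2) * math.exp((C / 2) * (2 * m - 2 * t + 2 * eps + C * var)) \
        * math.erfc((m - t + eps + C * var) / sd)
    dn = (C / 2) * math.exp((C / 2) * (-2 * m + 2 * t + 2 * eps + C * var)) \
        * math.erfc((-m + t + eps + C * var) / sd)
    mid = 0.5 * math.erf((t - m + eps) / sd) - 0.5 * math.erf((t - m - eps) / sd)
    return up - dn, mid + up + dn

def mf_svm_fit(Y, T, C=2.0, eps=0.05, eta=0.1, sweeps=200):
    """Inner-loop sweeps of the MF-based SVM learning (Steps 1-5 only)."""
    n = len(Y)
    Kn = [[rbf(a, b) for b in Y] for a in Y]           # Step 3
    var = [Kn[i][i] for i in range(n)]                 # sigma_i^2 frozen (sketch)
    w = [0.0] * n                                      # Step 2 (zeros, not random)
    for _ in range(sweeps):
        for i in range(n):                             # Step 5
            g_i = sum(w[j] * Kn[i][j] for j in range(n))   # Eq. (18)
            m_i = g_i - var[i] * w[i]                      # Eq. (19)
            F, G = F_G(m_i, T[i], var[i], C, eps)          # Eq. (17)
            w[i] += eta * (F / G - w[i])                   # Step 5.d
    return w, Kn

Y = [[0.0], [0.5], [1.0]]
T = [0.0, 0.5, 1.0]
w, Kn = mf_svm_fit(Y, T)
fit = [sum(wj * Kn[i][j] for j, wj in enumerate(w)) for i in range(len(Y))]
```

On this toy one-dimensional set, the damped relaxation drives each $w_i$ toward its fixed point $F_i/G_i$, and the fitted values track the increasing targets.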

D. Remarks on the MF-Based SVM Regression Algorithm

1. The most computationally expensive step in the above algorithm is the inversion of the matrix $K_n + \Sigma$ in Step 6. It is therefore recommended that Step 6, the "outer loop," iterate less frequently than Step 5, the "inner loop." For example, after $M = 10$ updates of the $w_i$, there is one update of $\sigma_i^2$.

2. The optimization needed to obtain the weights is carried out in the feature space, i.e., after applying the kernel function to the input samples.

3. Since the optimization is done in the feature space, it does not depend on the input-space dimensionality, and hence neither does the density estimation procedure.


4. The following chapters introduce methods for obtaining the best values of the learning parameters (e.g., $C$).

E. Summary

This chapter presented the foundation of the statistically based formulation of SVM regression. This formulation allows a fast and efficient learning procedure for the SVM regression algorithm. Mean Field theory is used to approximate the hard-to-evaluate integrations with a much simpler system of equations, which are solved iteratively to estimate the weights in the weighted-sum-of-kernels representation. The speed and efficiency of the algorithm open up its applicability to many pattern recognition and computer vision problems.


CHAPTER III

DENSITY ESTIMATION USING MEAN FIELD BASED SUPPORT VECTOR MACHINES

Density estimation is a problem of fundamental importance in all aspects of machine learning and pattern recognition [29, 30]. The probability density function (PDF) of a continuous distribution is estimated from a representative sample drawn from the underlying density. The estimation can be carried out either in a parametric or a nonparametric way. When it is reasonable to assume, a priori, a particular functional form for the PDF, the problem reduces to estimating the required functional parameters; this is the parametric approach. For estimating arbitrary density functions, finite mixture models [31, 32] are gaining much attention as powerful approaches, and they are routinely employed in many practical applications. One can consider a finite mixture model as providing a condensed representation of the data sample in terms of the sufficient statistics of each of the mixture components and their respective mixing weights.

The kernel density estimator, also commonly referred to as the Parzen window estimator [33], can be viewed as the limiting form of a mixture model in which the number of mixture components equals the number of points in the data sample. Unlike parametric or finite-mixture approaches to density estimation, where only sufficient statistics and mixing weights are required, Parzen density estimates employ the full data sample in defining the density estimate for subsequent observations. So, while large sample sizes ensure reliable density estimates, they bring with them a computational cost for testing which scales directly with the sample size. Herein lies the main practical difficulty of employing kernel-based Parzen window density estimators.


In this dissertation, a method is proposed for estimating the density function using the principle of the Parzen estimator. However, it uses SVM principles to choose a subset of the training data (the Support Vectors) which is then used in the computation of the density estimate. Usually, the size of the Support Vector subset is much smaller than the size of the training data set, which reduces the computational cost of the estimation process. The MF-based SVM regression approach introduced in the previous chapter is used in the proposed density estimator to make the approach faster and more accurate. Also, methods for automating the estimation process and enhancing the results are provided.

A. Density Estimation Problem Formulation

Given a random vector $\mathbf{Y}$, the relation:

$$F(\mathbf{y}) = P(\mathbf{Y} < \mathbf{y}) \tag{22}$$

defines the cumulative probability distribution function (CDF) of the random vector $\mathbf{Y}$. The probability density function (PDF) $p(\mathbf{y})$ of the random vector $\mathbf{Y}$ at a specific point $\mathbf{y}$ is a nonnegative quantity, and it is related to the CDF by:

$$F(\mathbf{y}) = \int_{-\infty}^{\mathbf{y}} p(\mathbf{y}')\, d\mathbf{y}' \tag{23}$$

Hence, in order to estimate the probability density function, it is required to solve the inverse of the integral equation:

$$\int_{-\infty}^{\mathbf{y}} p(\mathbf{y}', \alpha)\, d\mathbf{y}' = F(\mathbf{y}) \tag{24}$$

on a given set of densities $p(\mathbf{y}, \alpha)$, where the integration is a vector integration and $\alpha$ is the parameter set which characterizes the density function.

From another point of view, the estimation problem in (24) can be regarded as solving the linear operator equation:

$$A[p(\mathbf{y})] = F(\mathbf{y}) \tag{25}$$


where the operator $A[\cdot]$ is a one-to-one mapping from the elements of the Hilbert space $E_1$, where $p(\mathbf{y})$ is defined, into elements of the Hilbert space $E_2$, where $F(\mathbf{y})$ is defined. But neither $p(\mathbf{y})$ nor $F(\mathbf{y})$ in (25) is known. However, from the principles of probability theory [34], given a random sample $D = \{\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n\}$ from an unknown distribution, a practical estimate of $F(\mathbf{y})$ can be obtained by:

$$F_n(\mathbf{y}) = \frac{1}{n} \sum_{k=1}^{n} I_{(-\infty,\, \mathbf{y}]}(\mathbf{y}_k) \tag{26}$$

where $n$ is the size of the sample and $I_{(-\infty,\, y]}(u)$ is the indicator function, defined as:

$$I_{(-\infty,\, y]}(u) = \begin{cases} 1 & \text{if } u \le y \\ 0 & \text{otherwise} \end{cases} \tag{27}$$

if both $y$ and $u$ are scalars (1-dimensional data). If $\mathbf{y}$ and $\mathbf{u}$ are vectors of length $d$, then:

$$I_{(-\infty,\, \mathbf{y}]}(\mathbf{u}) = \prod_{i=1}^{d} I_{(-\infty,\, y_i]}(u_i) \tag{28}$$

This function $F_n(\mathbf{y})$, which is called the empirical distribution function, converges with probability 1 to the original distribution function $F(\mathbf{y})$ [35]. Therefore, the pairs $(\mathbf{y}_1, F_n(\mathbf{y}_1)), (\mathbf{y}_2, F_n(\mathbf{y}_2)), \ldots, (\mathbf{y}_n, F_n(\mathbf{y}_n))$ are constructed from the sample $D$ to generate the training data set:

$$D = \{(\mathbf{y}_i, t_i)\; |\; t_i = F_n(\mathbf{y}_i);\; i = 1, 2, \ldots, n\} \tag{29}$$
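The empirical distribution function in (26)-(28), and the training set of (29) built from it, can be computed directly; the sample points below are arbitrary:

```python
def indicator(u, y):
    """Multivariate indicator I_(-inf, y](u) of Eq. (28): 1 iff u_i <= y_i in every dimension."""
    return 1.0 if all(ui <= yi for ui, yi in zip(u, y)) else 0.0

def empirical_cdf(sample, y):
    """Empirical distribution function F_n(y) of Eq. (26)."""
    return sum(indicator(u, y) for u in sample) / len(sample)

# Build the regression training set {(y_i, F_n(y_i))} of Eq. (29).
sample = [[0.1, 0.4], [0.3, 0.2], [0.7, 0.9], [0.5, 0.5]]
train = [(y, empirical_cdf(sample, y)) for y in sample]
print(train[2])  # -> ([0.7, 0.9], 1.0): all four points are dominated
```

These $(\mathbf{y}_i, F_n(\mathbf{y}_i))$ pairs are exactly the targets that the regression algorithm fits in the next step.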

Now, a regression algorithm uses this training data set to solve the density estimation problem (25) in the image space (right-hand side of (25)), obtaining a continuous approximation of the distribution function $F(\mathbf{y})$. This approximation can then be used to express the solution in the pre-image space (left-hand side of (25)), giving an estimate of the density function via the known operator $A$. In this work, the SVM is used as the regression algorithm to obtain a continuous approximation of the distribution function $F(\mathbf{y})$. The motivation behind using the SVM as a regression tool is that it provides a dense, continuous approximation of $F(\mathbf{y})$ that is safely differentiable, so that the density function $p(\mathbf{y})$ can be obtained.

B. Obtaining the Probability Density Function Estimate and Choosing the Kernel Function

The above discussion shows that the MF-based SVM regression algorithm can be used for approximating the distribution function F(y) from the training sample D. The algorithm proposed in the previous chapter is used to get an approximation of F(y), which takes the form of a weighted sum of kernel functions evaluated at the instances of the training sample (see (10)):

F(y) = \sum_{i=1}^{n} w_i K(y, y_i)   (30)

Consequently, the estimate of the density function is simply of the form:

p(y) = \sum_{i=1}^{n} w_i K'(y, y_i)   (31)

where K'(y, y_i) is the derivative of K(y, y_i).

There are some conditions on the kernel function K(y, y_i) so that a valid density function estimate can be obtained from (31), see [22]. These conditions are:

i. K_\gamma = a(\gamma) \, K\left(\frac{y - y_i}{\gamma}\right)

ii. a(\gamma) \int K\left(\frac{y - y_i}{\gamma}\right) dy = 1

iii. K(0) = 1

In the presented algorithm, a Gaussian Radial Basis Function (GRBF) kernel is used, which satisfies the above conditions (see [22]) and has the form:

K(y, y_i) = \exp\left( -\frac{1}{2} (y - y_i) \Lambda^{-1} (y - y_i)^T \right)   (32)


where \Lambda is a parameter which is assumed to be predefined.
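For a 1-D GRBF kernel, the differentiation step in (31) can be sketched as follows; the centers, weights, and width below are hypothetical stand-ins for quantities the MF-based SVM would actually learn:

```python
import numpy as np

# Hypothetical already-learned quantities: centers y_i and weights w_i of the
# expansion (30); sigma plays the role of the predefined width in (32).
centers = np.array([-1.0, 0.0, 1.5])
weights = np.array([0.2, 0.5, 0.3])
sigma = 0.8

def K(y, yi):        # 1-D GRBF kernel of (32)
    return np.exp(-0.5 * (y - yi) ** 2 / sigma ** 2)

def K_prime(y, yi):  # its analytic derivative, used in (31)
    return -(y - yi) / sigma ** 2 * K(y, yi)

def F_hat(y):        # distribution-function estimate, eq. (30)
    return float(np.sum(weights * K(y, centers)))

def p_hat(y):        # density estimate, eq. (31)
    return float(np.sum(weights * K_prime(y, centers)))

# Sanity check: the analytic derivative matches a central difference of F_hat
y0, h = 0.3, 1e-5
fd = (F_hat(y0 + h) - F_hat(y0 - h)) / (2 * h)
```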

C. Summary of the Proposed SVM Density Estimation Algorithm

The implementation steps of the proposed approach for density estimation using SVM, with the mean field theory applied to the learning process, are presented below:

Step 1. Generate the training data set D defined in (29).

Step 2. Apply the MF-based SVM regression algorithm (Algorithm-II.C) to get an approximation of F(y).

Step 3. Calculate p(y) from (31).

The main goal of Steps 1 and 2 is to get the weights of the SVM regression expansion (30). Only those vectors whose corresponding weights are greater than some threshold (the support vectors) are used in calculating the density estimate at a test point. This reduces the computational time of the density estimation compared with traditional Parzen window estimators.
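The support-vector pruning described above can be sketched as follows, with hypothetical weights standing in for the learned SVM coefficients:

```python
import numpy as np

def sparse_density_eval(y, centers, weights, kernel_deriv, threshold=1e-3):
    # Keep only the support vectors: training points whose learned weight
    # exceeds the threshold in magnitude.
    keep = np.abs(weights) > threshold
    return float(np.sum(weights[keep] * kernel_deriv(y, centers[keep])))

# Hypothetical learned weights: after SVM training most are (near) zero.
centers = np.linspace(-2.0, 2.0, 9)
weights = np.where(np.abs(centers) < 1.0, 0.3, 1e-8)
bump = lambda y, c: np.exp(-0.5 * (y - c) ** 2)   # stand-in for K'(y, y_i)

dense = float(np.sum(weights * bump(0.0, centers)))
sparse = sparse_density_eval(0.0, centers, weights, bump)
n_sv = int(np.sum(np.abs(weights) > 1e-3))
```

Only the three non-negligible weights are touched at test time, while the evaluation stays numerically indistinguishable from the full sum.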

D. Consistency of the Proposed Algorithm

The core component of the proposed density estimation approach is the regression algorithm using SVM. The proposed SVM regression approach is formulated using Gaussian Processes and mean field theory. It was shown that SVM regression boils down to a Gaussian Process prediction scheme. Thus, to test the consistency of the density estimation approach, it suffices to show the consistency of Gaussian Process prediction approaches. The following discussion examines the consistency of regression using Gaussian Processes. The argument uses the concept of the Equivalent Kernel (EK), which is discussed next.


1. Equivalent Kernel in Gaussian Processes Prediction

As shown in (10), the predicted value for a test point y using Gaussian Process regression is a weighted sum of the kernel function acting on the input training points. This can be written as:

g(y) = \sum_{i=1}^{n} w_i K(y, y_i) = k(y) w^T   (33)

where k(y) is a vector whose i-th element is the value of the kernel function between the test point y and the training point y_i, K(y, y_i). From the optimization point of view, the problem is considered in the weight space, where the objective is to estimate the weight vector w. To make the derivation feasible, a quadratic loss function is used instead of the \varepsilon-insensitive loss function. Assuming the observations have noise variance \sigma_\nu^2, the objective function can be written in the form (see (6)):

E = \frac{1}{2\sigma_\nu^2} \sum_{i=1}^{n} (t_i - g(y_i))^2 + \frac{1}{2} g(D) K_n^{-1} g(D)^T   (34)

Using g as a shortcut for g(D), the following results can be obtained:

g = [k(y_1) w^T \;\; k(y_2) w^T \;\; \cdots \;\; k(y_n) w^T] = w K_n   (35)

\sum_{i=1}^{n} (t_i - g(y_i))^2 = (\tau - g)(\tau - g)^T = (\tau - w K_n)(\tau - w K_n)^T   (36)

Thus, the objective function is reduced to:

E = \frac{1}{2\sigma_\nu^2} (\tau - g)(\tau - g)^T + \frac{1}{2} g K_n^{-1} g^T
  = \frac{1}{2\sigma_\nu^2} \tau \tau^T - \frac{1}{\sigma_\nu^2} g \tau^T + \frac{1}{2\sigma_\nu^2} g g^T + \frac{1}{2} g K_n^{-1} g^T
  = \frac{1}{2\sigma_\nu^2} g (\sigma_\nu^2 K_n^{-1} + I) g^T - \frac{1}{\sigma_\nu^2} g \tau^T + \frac{1}{2\sigma_\nu^2} \tau \tau^T   (37)


The posterior mean value of the machine output vector, g_{PM}, is the one which minimizes E, i.e., the solution of (using vector differentiation with respect to w):

(\sigma_\nu^2 K_n^{-1} + I) g_{PM}^T = \tau^T   (38)

Substituting from (35) into (38), the posterior mean value of the weight vector, w_{PM}, can be found from (\sigma_\nu^2 K_n^{-1} + I) K_n^T w_{PM}^T = \tau^T, or (\sigma_\nu^2 K_n^{-1} + I) K_n w_{PM}^T = \tau^T, since K_n is symmetric. Thus:

(\sigma_\nu^2 I + K_n) w_{PM}^T = \tau^T   (39)

Assume \Sigma_{eq} = (K_n + \sigma_\nu^2 I); then:

w_{PM}^T = \Sigma_{eq}^{-1} \tau^T   (40)

The mean prediction for a new input y is:

\mu(y) = k(y) w_{PM}^T = k(y) \Sigma_{eq}^{-1} \tau^T   (41)

From the last results, the predictive mean at a test point can be written in the form:

g(y) = h(y) \tau^T   (42)

where:

h(y) = k(y) \Sigma_{eq}^{-1} = k(y) (K_n + \sigma_\nu^2 I)^{-1}   (43)

is known as the weight function or the Equivalent Kernel.
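A small numerical sketch of (42)–(43), using a toy Gaussian kernel and hypothetical training data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set; the Gaussian kernel and noise level are assumptions
Y = rng.uniform(-3.0, 3.0, size=12)
tau = np.sin(Y)                       # targets
sigma_nu2 = 0.1                       # noise variance sigma_nu^2

def kern(a, b):
    return np.exp(-0.5 * (a - b) ** 2)

Kn = kern(Y[:, None], Y[None, :])     # n x n kernel matrix K_n

def equivalent_kernel(y):
    """Weight function h(y) = k(y) (K_n + sigma_nu^2 I)^(-1), eq. (43)."""
    return np.linalg.solve(Kn + sigma_nu2 * np.eye(len(Y)), kern(y, Y))

def posterior_mean(y):
    """Predictive mean g(y) = h(y) tau^T, eq. (42)."""
    return float(equivalent_kernel(y) @ tau)
```

Using a linear solve rather than an explicit inverse is the usual numerically preferable route; both give the same predictive mean.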


2. Consistency Argument of the Proposed Algorithm

Suppose that the regression approach has a loss function L. Under a given Borel probability measure u(y, t), the risk is defined as:

R_L(g) = \int L(t, g(y)) \, du(y, t)   (44)

where the optimization is done over the functional space on which the machine is computing; i.e., the objective is to find the functional \eta(y) that minimizes the risk R_L(g).

Definition 1:
A procedure that returns g_D is consistent if:

R_L(g_D) \rightarrow R_L(\eta) \quad \text{as } n \rightarrow \infty   (45)

As shown in (42), the GP posterior mean is expressed in terms of the equivalent kernel (EK) h(y). But it is hard to analyze the consistency of the EK directly, since it depends on the inverse of the matrix K_n + \sigma_\nu^2 I, and K_n depends on the locations of the training inputs, see [36]. To smooth out the issue of random locations of the input training points over the input space, it will be assumed that the observations are distributed ideally over the input space. This means that the observations are “smeared out” across the input space with \rho data points per unit (length, area, or volume, depending on the input space dimensionality). In this assumed ideal case, the definition of consistency becomes:

Definition 2:
A procedure, under the ideal assumption of smeared-out observations with \rho data points per unit of input space volume, is consistent if:

R_L(g_D) \rightarrow R_L(\eta) \quad \text{as } \rho \rightarrow \infty   (46)

The objective function of the GP regression (see (34)) with a quadratic loss function can be written in the form:

J[g] = \frac{1}{2\sigma_\nu^2} \sum_{i=1}^{n} (t_i - g(y_i))^2 + \frac{1}{2} \|g\|_H^2   (47)


where \|g\|_H is the Reproducing Kernel Hilbert Space (RKHS) norm corresponding to the kernel K. Under the idealized smearing out of the observations, a smoothed version of the objective function can be obtained as:

J_\rho[g] = \frac{\rho}{2\sigma_\nu^2} \int (\eta(y) - g(y))^2 \, dy + \frac{1}{2} \|g\|_H^2   (48)

Williams [37] uses a Fourier-analysis-based approach to argue the consistency of Gaussian Process prediction. In the following, a brief outline of this approach is presented; the details can be found in the mentioned reference.

The basic relation between the function g(y) and its Fourier transform g(s) is:

g(y) = \int g(s) \, e^{2\pi i \, s \cdot y} \, ds   (49)

and similarly for \eta(y). Under a stationary kernel, i.e., K(y, y') = K(y - y'), the RKHS norm in (48) can be represented as (see [38] for details):

\|g\|_H^2 = \int \frac{|g(s)|^2}{S_K(s)} \, ds   (50)

where S_K(s) is the power spectrum of the kernel K. Thus,

J_\rho[g] = \frac{1}{2} \int \left( \frac{\rho}{\sigma_\nu^2} |\eta(s) - g(s)|^2 + \frac{|g(s)|^2}{S_K(s)} \right) ds   (51)

The minimization of J_\rho[g] can be done using the calculus of variations [39], which results in:

g(s) = \frac{S_K(s) \, \eta(s)}{\sigma_\nu^2 / \rho + S_K(s)}   (52)

In the inverse Fourier transform domain, (52) can be recognized as the convolution relation:

g(y) = \int h(y - y') \, \eta(y') \, dy'   (53)

from which the Fourier transform of the EK is:

h(s) = \frac{S_K(s)}{S_K(s) + \sigma_\nu^2 / \rho} = \frac{1}{1 + \sigma_\nu^2 / (\rho \, S_K(s))}   (54)


It can be easily noted from (54) that:

h(s) \rightarrow 1 \quad \text{as } \rho \rightarrow \infty   (55)

Thus, g(s) \rightarrow \eta(s) as \rho \rightarrow \infty, which means that:

g(y) \rightarrow \eta(y) \quad \text{as } \rho \rightarrow \infty   (56)

which proves the consistency of the Gaussian Process based regression.
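The limit in (55) is easy to check numerically; the Gaussian power spectrum and noise level below are hypothetical:

```python
import numpy as np

# Hypothetical Gaussian power spectrum S_K(s) and noise variance
s = np.linspace(-2.0, 2.0, 101)
S_K = np.exp(-0.5 * s ** 2)           # strictly positive everywhere
sigma_nu2 = 0.25

def ek_transform(rho):
    """Fourier transform of the equivalent kernel, eq. (54)."""
    return 1.0 / (1.0 + sigma_nu2 / (rho * S_K))

# As the data density rho grows, h(s) -> 1 at every frequency, eq. (55)
gap = [float(np.max(np.abs(1.0 - ek_transform(r)))) for r in (1, 10, 100, 1000)]
```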

E. Convergence of the Proposed Algorithm

The proposed density estimation approach has an iterative nature. Suppose e(m) denotes the error vector at iteration m. The convergence of such an approach can be defined such that:

e(m) \rightarrow 0 \quad \text{as } m \rightarrow \infty   (57)

or, in the simplest cases, e(m) = 0 for some m = m_0. In this work, the convergence is shown empirically.

F. Estimation of the Learning Parameters

The proposed procedure for the MF-based SVM framework contains some learning parameters (e.g., the regularization constant C, the learning rate \eta, and the kernel's shape and parameters). These parameters should be carefully selected for proper performance of the approach. This section proposes some methods for the automatic selection of these parameters.

1. Kernel Optimization using the EM algorithm

One of the commonly used kernels in SVM learning is the Gaussian Radial Basis Function (GRBF) [40–42], which has the form in (32). The following discussion explains an approach for automatic selection of the covariance of the RBF kernel. This approach incorporates the EM algorithm [43–45] into the learning procedure so that the covariance of the kernel is optimized while the SVM weight coefficients are estimated. The EM algorithm is used to automatically select the covariance matrices of the kernels centered at the training instances. This automatic optimization of the covariance matrix makes the SVM learning faster in adaptation and more accurate, which is reflected in the good performance of the algorithm.

The EM algorithm can be used to estimate the parameters of a mixture of Gaussian distributions [42] based on the maximization of the following likelihood function:

L(w, \Theta) = \sum_{y \in Y} f(y) \log p_{w,\Theta}(y)   (58)

where f(y) is the empirical density function.

The maximization of (58) can be found using an iterative block relaxation algorithm. The relative contribution of each data item y \in Y to each Gaussian component at step m is specified by the following conditional weights:

\pi^{[m]}(r|y) = \frac{w_r^{[m]} \, \varphi(y|\theta_r^{[m]})}{p_{w,\Theta}^{[m]}(y)}   (59)

where r = 1, 2, \ldots, n.

The block relaxation, converging to a local maximum of the likelihood function in (58), iteratively repeats the following two steps:

1. E-step[m + 1]: find the conditional weights of (59) under the fixed parameters of step m (in our case, the covariances), and

2. M-step[m + 1]: find the covariance of each Gaussian component by maximizing L(w, \Theta) under these fixed conditional weights,

until the changes of the log-likelihood and all the model parameters become small.


The covariance of each Gaussian is obtained by the unconditional maximization:

(\sigma_r^{[m+1]})^2 = \frac{1}{w_r^{[m+1]}} \sum_{y \in Y} (y - \mu_r^{[m+1]})(y - \mu_r^{[m+1]})' \, f(y) \, \pi^{[m]}(r|y)   (60)

This step is repeated in each step of the optimization of the SVM weight coefficients in (30).

Initialization of the parameters for the EM algorithm

As stated before, the centers (means) of the Gaussian kernels are chosen to be the input instances themselves. So, the proposed approach uses the EM algorithm only for estimating the variances (covariances in multidimensional spaces) of the kernel function. To start the EM algorithm, all these parameters are initialized to the same value, which is the empirical variance (covariance) of the input training instances. In 1-D spaces this initialization becomes:

\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_n^2 = \sigma_{empirical}^2   (61)

where

\sigma_{empirical}^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - m)^2 \quad \text{and} \quad m = \frac{1}{n} \sum_{i=1}^{n} y_i

In multidimensional spaces:

\Sigma_1 = \Sigma_2 = \cdots = \Sigma_n = \Sigma_{empirical}   (62)

where

\Sigma_{empirical} = \frac{1}{n} \sum_{i=1}^{n} (y_i - m)(y_i - m)^T \quad \text{and} \quad m = \frac{1}{n} \sum_{i=1}^{n} y_i
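A minimal 1-D sketch of this EM-based kernel-width selection, combining the initialization (61) with a few sweeps of the conditional weights (59) and a variance update in the spirit of (60); the equal starting mixing weights and the synthetic sample are assumptions of the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(0.0, 2.0, size=50)     # 1-D training instances (synthetic)
n = len(y)

# Initialization (61): every kernel starts from the empirical variance
m = y.mean()
sigma2 = np.full(n, float(((y - m) ** 2).mean()))
w = np.full(n, 1.0 / n)               # equal mixing weights (an assumption)

def gauss(x, mu, s2):
    return np.exp(-0.5 * (x - mu) ** 2 / s2) / np.sqrt(2.0 * np.pi * s2)

for _ in range(3):                    # a few block-relaxation sweeps
    # Conditional weights pi(r|y) of (59); phi[r, k] = phi(y_k | y_r)
    phi = gauss(y[None, :], y[:, None], sigma2[:, None])
    num = w[:, None] * phi
    pi = num / num.sum(axis=0, keepdims=True)
    # Mixing weights and per-kernel variances as in (60), with the
    # empirical density f(y) = 1/n at each sample point
    w = pi.mean(axis=1)
    sigma2 = (pi * (y[None, :] - y[:, None]) ** 2).sum(axis=1) / (n * w)
```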

2. Cross-Validation for Parameter Estimation

Cross-Validation (CV) is probably the simplest and most widely used method for

estimating the prediction error [46]. In CV methods, the training sample is split into two


parts: one for model fitting and the other for model evaluation. The idea behind CV is to recycle the data by switching the roles of the training and test samples.

FIGURE 1 – Splitting data in Cross-Validation setups.

More specifically, suppose there is a J-fold CV problem. The data is split into J roughly equal-sized parts (see Fig. (1) for J = 5). For the j-th part, the model is fitted using the other J − 1 parts of the training data, and the evaluation is done on the j-th part of the data.

The application of the CV principle to parameter estimation goes as follows: suppose the estimation algorithm has a parameter set \lambda; the steps to get the optimum value of \lambda are:

1. Split the whole training data set D into J disjoint subsamples D_1, D_2, \ldots, D_J.

2. For j = 1, 2, \ldots, J, fit a model to the training sample D \setminus D_j, and compute the discrepancy, e_j(\lambda), using the test sample D_j.

3. Find the optimal \lambda^* as the minimizer of the overall discrepancy e(\lambda) = \sum_j e_j(\lambda).

For illustration purposes, the general linear regression model is considered here, assuming that there is an input vector y which has the corresponding target vector t. In its basic form, CV uses the leave-one-out principle, which means that J = n. The ordinary CV (OCV) estimate of the prediction error is:

OCV(\lambda) = \frac{1}{n} \sum_{i=1}^{n} (t_i - g_\lambda^{(-i)}(y_i))^2   (63)

where g_\lambda^{(-i)} denotes the model fitted with the i-th observation left out. A CV estimate of \lambda is the minimizer of (63).
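The J-fold procedure above can be sketched as follows; a simple ridge-regression learner stands in for the MF-based SVM, and the data are synthetic:

```python
import numpy as np

def cv_score(Y, t, fit, predict, lam, J=5):
    """Overall discrepancy e(lambda) = sum_j e_j(lambda) over J folds."""
    n = len(Y)
    folds = np.array_split(np.arange(n), J)            # D_1, ..., D_J
    e = 0.0
    for j in range(J):
        test = folds[j]
        train = np.setdiff1d(np.arange(n), test)       # D minus D_j
        model = fit(Y[train], t[train], lam)
        e += float(np.mean((t[test] - predict(model, Y[test])) ** 2))
    return e

# Stand-in learner: ridge regression on cubic-polynomial features,
# with lam playing the role of the parameter being selected
def fit(Y, t, lam):
    X = np.vander(Y, 4)
    return np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ t)

def predict(coef, Y):
    return np.vander(Y, 4) @ coef

rng = np.random.default_rng(2)
Y = np.sort(rng.uniform(-1.0, 1.0, size=40))
t = Y ** 3 + 0.05 * rng.normal(size=40)
scores = {lam: cv_score(Y, t, fit, predict, lam) for lam in (1e-4, 1e-1, 10.0)}
best = min(scores, key=scores.get)
```

The minimizer of the summed fold discrepancies is then taken as the parameter estimate, as in step 3 above.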

In order to illustrate the concept, CV is used to estimate the parameters C (the regularization constant) and \Lambda (the kernel covariance), i.e., \lambda = \{C, \Lambda\}, in the proposed density estimation approach. The search method used in this section is the “grid-search” [47] on C and \Lambda. In grid-search, pairs of (C, \Lambda) are tried and the one with the best cross-validation accuracy is picked. The grid-search is straightforward but may not seem a sophisticated choice. In fact, there are several advanced methods which can save computational cost by, for example, approximating the cross-validation rate. However, there are two reasons why the simple grid-search approach is preferred here. One is that we may not feel safe using methods which avoid an exhaustive parameter search through approximations or heuristics, especially in illustrative situations like the present work. The other is that the computational time to find good parameters by grid-search is not much more than that of the advanced methods, since there are only two parameters (C, \Lambda).
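The grid-search itself reduces to a minimization over a finite set of (C, Λ) pairs. In the sketch below, a hypothetical quadratic discrepancy surface stands in for the actual CV error:

```python
import itertools

# Hypothetical discrepancy surface standing in for the CV error e(C, Lambda);
# in the real setup each value would come from the J-fold CV described above.
def e(C, lam):
    return (C - 1.5) ** 2 + 2.0 * (lam - 0.8) ** 2

C_grid = [0.1, 0.5, 1.0, 1.5, 2.1]
lam_grid = [0.2, 0.4, 0.8, 1.6]

# Try every (C, Lambda) pair and keep the one with the smallest discrepancy
best_pair = min(itertools.product(C_grid, lam_grid), key=lambda p: e(*p))
```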

G. Experiments for Evaluating the Proposed Density Estimation

In this section, several data examples are used to illustrate the performance of the proposed algorithm for density estimation. The data sets are generated with standard random generators in 1-D and 2-D spaces. The performance of the proposed algorithm is evaluated visually and using the Kullback-Leibler Distance (KLD) [48], which is perhaps the most frequently used information-theoretic distance measure between two probability densities. The KLD is one example of the Ali-Silvey class of information-theoretic distance measures [49], which are defined to be:

d(p_0, p_1) = f(\varepsilon_0[c(\psi(x))])   (64)

where p_0 and p_1 are two probability densities, \psi(\cdot) represents the likelihood ratio p_1/p_0, c(\cdot) is convex, \varepsilon_0[\cdot] is the expected value with respect to the distribution p_0, and f(\cdot) is a non-decreasing function. Suppose that c(x) = x \log x and f(x) = x; then the KLD is defined to be:

KLD(p_1 \| p_0) = \int p_1(x) \log \frac{p_1(x)}{p_0(x)} \, dx   (65)
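The two directed distances in (65) can be computed numerically on a grid; the sketch below checks the implementation on two unit-variance Gaussians, for which the analytic KLD is (Δμ)²/2 = 0.125:

```python
import numpy as np

def trapezoid(fx, x):
    return float(np.sum((fx[1:] + fx[:-1]) * np.diff(x)) / 2.0)

def kld(p1, p0, x):
    """Numerical KLD(p1 || p0) of (65) on a grid x."""
    return trapezoid(p1 * np.log(p1 / p0), x)

x = np.linspace(-8.0, 8.0, 4001)
gauss = lambda z, mu, s: np.exp(-0.5 * ((z - mu) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))
p0 = gauss(x, 0.0, 1.0)
p1 = gauss(x, 0.5, 1.0)
d10 = kld(p1, p0, x)   # analytic value for these densities: 0.125
d01 = kld(p0, p1, x)
```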


In the practical application of the KLD for evaluating a density approximation p_1 of a reference density p_0, both KLD(p_1 \| p_0) and KLD(p_0 \| p_1) should be calculated. If their values are close in magnitude with opposite signs, the two densities are close to each other, which means that the density estimator works well.

1. Density estimation for a 1-D Gaussian distribution

This is a simple and standard experiment, but it is illustrative and is used here for comparison purposes. In this experiment, a data sample of 100 instances from a 1-D standard normal distribution is used to illustrate the performance of the proposed MF-based SVM density estimation algorithm. The results are compared to those obtained in a previous work using the traditional formulation of the SVM-based density estimation approach. The comparison is based on visual evaluation, the convergence speed, and the KLD measure.

As shown in Fig. (2), the MF-based SVM closely approximates the reference density function, while there is an apparent error in the approximation produced by the traditionally formulated SVM. In the case of the proposed MF-based SVM estimator, the KLD values are KLD(p_1 \| p_0) = 0.12 and KLD(p_0 \| p_1) = −0.094. The two KLDs are close enough to each other to show that the estimation is a good one. On the other hand, for the traditional-formulation-based SVM approximation in Fig. (2-b), the distances are KLD(p_1 \| p_0) = −0.85 and KLD(p_0 \| p_1) = 1.86, which are not as close to each other as in the proposed algorithm. The computational cost of the proposed algorithm is of order O(N^2), while the traditional SVM learning algorithm is of order O(N^3), see [50]. The current experiment takes 0.015 seconds for the optimization process with the proposed MF-based SVM, while it takes 0.22 seconds with the traditional SVM, which emphasizes the faster response of the proposed algorithm.


FIGURE 2: Estimation of the 1-D Gaussian density function with the SVM density estimation formulated with (a) the proposed formulation, (b) the traditional formulation.

2. Density estimation for a 1-D mixture of Gaussian distributions

In this slightly more challenging experiment, a data set of 100 instances is generated from a 1-D mixture of Gaussians. The mixture consists of two components and has the form:

p(x) = \alpha_1 N(\mu_1, \sigma_1^2) + \alpha_2 N(\mu_2, \sigma_2^2)   (66)

with the parameters shown in Table 1.

TABLE 1
PARAMETERS OF THE 1-D MIXTURE OF GAUSSIANS DENSITY FUNCTION

Parameter   \mu_1   \mu_2   \sigma_1^2   \sigma_2^2   \alpha_1   \alpha_2
Value        -1       7        9            4           0.4        0.6
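Sampling from (66) with the Table 1 parameters can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(3)

# Parameters from Table 1
mu = np.array([-1.0, 7.0])
var = np.array([9.0, 4.0])
alpha = np.array([0.4, 0.6])

def sample_mixture(n):
    comp = rng.choice(2, size=n, p=alpha)             # pick a component
    return rng.normal(mu[comp], np.sqrt(var[comp]))   # then draw from it

def mixture_pdf(x):
    """Density (66) with the Table 1 parameters (x scalar)."""
    c = alpha * np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
    return float(c.sum())

data = sample_mixture(100)
```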

The results in Fig. (3) show that the proposed algorithm approximates the density function in (66) well. There are small errors at the tails of the density function components. These tail errors add up at the intersection of the two components, which produces a noticeable error. The distance measure values in this case are KLD(p_1 \| p_0) = 0.26 and KLD(p_0 \| p_1) = −0.09, which are affected by the error discussed above. This


FIGURE 3 – Estimation of a 1-D mixture of Gaussian density functions.

experiment takes 0.313 seconds for the optimization process to converge, which means that the algorithm still maintains considerably fast convergence.

3. Density Estimation for a 1-D Rayleigh Distribution

In this experiment, a data sample of 100 instances is drawn from a Rayleigh distribution, which has the form:

p(x) = \frac{x}{s^2} \, e^{-x^2/(2s^2)}   (67)

where the parameter s is set to 1 in the experiment. The Rayleigh distribution is chosen because there is special interest in this distribution in medical imaging applications, and also because it represents a good non-symmetric variation other than the Gaussian. The results shown in Fig. (4) illustrate that the proposed algorithm approximates the density function very well. The KLD values in this case are KLD(p_1 \| p_0) = 0.3 and KLD(p_0 \| p_1) = −0.2. The apparent difference between the two distances may be due to the small error in the left tail. It is interesting to note here that the peak values


FIGURE 4 – Estimation of a Rayleigh density function.

of the densities (reference and estimated) occur at the same value of x, which is a positive argument for the proposed estimation algorithm. This experiment takes 0.016 seconds for the optimization process to converge.
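For reference, the Rayleigh density (67) with s = 1 and a standard inverse-CDF sampler can be sketched as:

```python
import numpy as np

def rayleigh_pdf(x, s=1.0):
    """Rayleigh density (67); zero for negative x."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0.0, x * np.exp(-x ** 2 / (2.0 * s ** 2)) / s ** 2, 0.0)

rng = np.random.default_rng(4)
# Inverse-CDF sampling: F(x) = 1 - exp(-x^2/(2 s^2)), so x = s sqrt(-2 ln(1-u))
u = rng.uniform(size=100)
sample = np.sqrt(-2.0 * np.log1p(-u))      # s = 1, as in the experiment
```

The mode of this density sits at x = s = 1, consistent with the peak location noted above.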

4. Comparison with state-of-the-art methods

In this section, a comparison between the proposed density estimation approach and two state-of-the-art methods is discussed. These methods are the Sparse Density Construction (SDC) method [1] and the Reduced Set Density Estimator (RSDE) method [50].

The SDC method presents an efficient construction algorithm for obtaining sparse kernel density estimates based on a regression approach that directly optimizes model generalization capability. It uses orthogonal forward regression to ensure computational efficiency of the density construction. The algorithm incrementally minimizes the leave-one-out test score. This method is shown to perform comparably with other algorithms.

The RSDE method is optimal in the L_2 sense, in that the integrated squared error between the unknown true density and the RSDE is minimized in devising the estimator. The required optimization turns out to be a straightforward quadratic optimization with simple positivity and equality constraints; thus suitable forms of Multiplicative Updating [51] or Sequential Minimal Optimization, as introduced in [52], can be employed, which ensures at most quadratic scaling in the original sample size.

The RSDE approach is fundamentally different from classical Parzen window and SVM density estimators in that the Integrated Squared Error (ISE) between the true (unknown) density and the reduced-set estimator is minimized. The sparsity of representation (data condensation) emerges naturally from the direct minimization of the ISE, due to the required constraints on the functional form of p(x), without resorting to additional sparsity-inducing regularization terms or employing L_1 or \varepsilon-insensitive losses.

For the comparison with the two methods, an example presented in the above-mentioned references is chosen, and the proposed density estimation approach is applied to it. The function used in this example is a 1-D mixture of a Gaussian and an exponential density of the form:

p(x) = 0.5 \, \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(x-2)^2}{2} \right) + 0.5 \, \frac{0.7}{2} \exp(-0.7 |x + 2|)   (68)

The performance measure used is the L_1 test error, which has the form:

L_1 = \frac{1}{N_{test}} \sum_{k=1}^{N_{test}} |p(x_k) - \hat{p}(x_k)|   (69)

where \hat{p}(x_k) is the estimated density at the point x_k, and N_{test} is the number of test points.
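The test density (68) and the L1 error (69) can be sketched as follows; the deliberately crude single-Gaussian estimate is a hypothetical stand-in for a fitted density:

```python
import numpy as np

def true_pdf(x):
    """Gaussian + Laplacian mixture of (68)."""
    g = np.exp(-0.5 * (x - 2.0) ** 2) / np.sqrt(2.0 * np.pi)
    e = (0.7 / 2.0) * np.exp(-0.7 * np.abs(x + 2.0))
    return 0.5 * g + 0.5 * e

def l1_error(p_true, p_hat, x_test):
    """L1 test error of (69) between true and estimated densities."""
    return float(np.mean(np.abs(p_true(x_test) - p_hat(x_test))))

x_test = np.linspace(-10.0, 6.0, 1001)
# A deliberately crude estimate (one wide Gaussian) stands in for the fit
crude = lambda x: np.exp(-0.5 * x ** 2 / 9.0) / np.sqrt(2.0 * np.pi * 9.0)
err = l1_error(true_pdf, crude, x_test)
```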

According to the above reference, the SDC method is compared with the Parzen window method and the classical SVM method [53]. Table 2 shows the results presented in that reference, together with those of the proposed MF-based SVM approach. The results show that the MF-based SVM approach outperforms the other approaches in terms of accuracy, with a slightly higher number of kernels used. There is no quantitative evaluation of the computational time of the SDC or RSDE method, but both use the leave-one-out test score, which is known to be time consuming, although an iterative approach is used to decrease the computational complexity. The visual results illustrated in Fig. 5 show the better performance of the proposed MF-based SVM density estimation approach.

TABLE 2
RESULTS FOR THE MIXTURE OF A GAUSSIAN AND EXPONENTIAL DENSITY FUNCTIONS

Method              Parzen Window   Classical SVM   SDC     RSDE   MF-SVM
L_1 \times 10^{-2}  2.063           2.165           2.177   1.8    0.5
kernel number       100             5               5       5      10

FIGURE 5: Estimation of a 1-D mixture of a Gaussian and an exponential density function, (a) SDC method (quoted from [1]), (b) MF-based method.

5. Density estimation for 2-D cases

The first experiment is carried out to assess the performance of the proposed algorithm in higher-dimensional spaces. A data set of 100 instances from a 2-D Gaussian distribution is used. Again, this experiment is used to compare the proposed algorithm with the traditionally formulated SVM algorithm. Figure (6) shows both the density function and its contour for the reference density function, the estimated density function using


the traditionally formulated SVM estimator, and the estimated density function using the proposed MF-based SVM estimator.

As can be noted from the figure, there is a significant improvement in the estimation using the MF-based SVM over the traditionally formulated SVM. In the contour plot of the density estimated using the traditional SVM, there is a slight deformation in the contour and a shift in the mean vector. The distance measures in the case of the traditionally formulated SVM estimator are KLD(p_1 \| p_0) = 0.39 and KLD(p_0 \| p_1) = −8.4, which shows that there is a large difference due to the shift mentioned above. The significant improvement can be noted from the contour of the density estimated using the MF-based SVM estimator. The distance measures are KLD(p_1 \| p_0) = 4.029 and KLD(p_0 \| p_1) = −3.6, showing a close fit. This experiment takes 0.172 seconds for the optimization process to converge with the proposed MF-based SVM learning, while it takes 0.578 seconds with the traditional SVM.

Another experiment employs a sample (200 points) of 2-D data generated with equal probability from an isotropic Gaussian and two Gaussians with positive and negative correlation structure, respectively. The probability density is estimated with three methods: a Parzen window employing a Gaussian kernel, with leave-one-out cross-validation used to select the kernel bandwidth; the RSDE, employing a Gaussian kernel whose bandwidth is selected by minimizing the cross-entropy between the Parzen window estimate and the RSDE; and the proposed MF-based SVM. The probability density iso-contours of the resulting estimates are shown in Fig. 7. The results illustrate that the performance of the proposed MF-based density estimation approach is highly comparable to the RSDE method, with the advantage of avoiding the use of Quadratic Programming tools.


6. Experiments on the automatic selection of the kernel width using the EM algorithm

To evaluate the proposed algorithm for automatic kernel parameter selection, the same data set for the mixture of Gaussian density functions is used.

The results in Fig. (8) show that the proposed mean-field-based SVM density estimation with the automatic kernel optimization approach approximates the density function in (66) well. Comparing the results with Fig. (3) shows that the proposed algorithm for automatic selection of the kernel width enhances the estimation results. For a quantitative evaluation, the Kullback-Leibler distance (KLD) [49] and the Levy distance [34] measures are used. For the proposed MF-based SVM approach with kernel optimization, the KLD is 0.02, which is small enough to show that the proposed approach is a good density estimator. For comparison purposes, the KLD for the MF-based SVM approach without kernel optimization is 0.09, which is further evidence that the proposed approach outperforms the other algorithms.

The Levy distance is used to compare two distribution functions in order to reflect the similarity of their density functions. In this experiment, the Levy distance is used to compare the empirical distribution function of the input random sample and the distribution function estimated by the density estimator. The CDFs of the MF-based SVM without kernel optimization and of the proposed MF-based SVM with kernel optimization are shown in Fig. (8). The Levy distance is 0.049 for the proposed approach, while it is 0.079 for the MF-based SVM without kernel optimization, which again illustrates the outstanding performance of the proposed approach.
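The Levy distance itself can be approximated on a grid as the smallest ε with F(x − ε) − ε ≤ G(x) ≤ F(x + ε) + ε for all x; the standard-normal CDFs below are hypothetical test inputs:

```python
import numpy as np
from math import erf

def levy_distance(F, G, x, eps_grid):
    """Smallest eps on the grid with F(x-eps)-eps <= G(x) <= F(x+eps)+eps."""
    Gx = G(x)
    for eps in eps_grid:
        if np.all(F(x - eps) - eps <= Gx) and np.all(Gx <= F(x + eps) + eps):
            return float(eps)
    return float(eps_grid[-1])

Phi = np.vectorize(lambda z: 0.5 * (1.0 + erf(z / np.sqrt(2.0))))

x = np.linspace(-6.0, 6.0, 1201)
eps_grid = np.linspace(0.0, 1.0, 1001)
d_same = levy_distance(Phi, Phi, x, eps_grid)                  # identical CDFs
d_shift = levy_distance(Phi, lambda z: Phi(z - 0.2), x, eps_grid)
```

Identical CDFs give a distance of zero, while a small horizontal shift yields a small but strictly positive distance, matching the intended use of the measure above.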


7. Experiments on the automatic selection of the learning parameters using Cross-Validation

One step toward the automation of the proposed approach is to automatically estimate the regularization constant C and the kernel width \sigma. The following results discuss the application of Cross-Validation to estimating these two parameters. Figure (9-a) shows that an improper choice of the regularization constant C (C = 2.1 in this case, while it was 0.1 in the previous case) results in a bad performance of the density estimation algorithm.

The evolution of the estimation error with the value of the kernel width at a constant value of C is shown in Fig. (9-b). The curve is convex and shows that there is an optimal value of \sigma which provides the minimum error. The error surface over the two parameters C and \sigma is shown in Fig. (9-c), where its minimum occurs at C = 1.5 and \sigma = 0.8.

8. Experiments on the algorithm convergence

The convergence of the estimation approach is evaluated by the error convergence during the learning process. The estimation error is calculated at each learning step for the 1-D Gaussian density function example discussed before. As shown in Fig. (10), the error decreases with each learning step, almost linearly, which reflects the convergence of the overall estimation algorithm.

H. Conclusion

This chapter presented the foundations of the density estimation approach based on statistical learning principles. The proposed approach uses the MF-based SVM algorithm presented in Chapter II. The chapter started from the basic principles of statistical learning theory to represent the density estimation problem as a regression setup, where the MF-based SVM regression approach is used. The consistency of the proposed approach was discussed in terms of the equivalent kernel formulation. Different approaches for estimating the learning parameters were presented, e.g., the EM algorithm for the kernel width and Cross Validation for the regularization constant and the kernel width. Several experiments were presented to illustrate the performance of the approach and its convergence.



FIGURE 6: Estimation of a 2-D Gaussian density function, (a) the reference density func-

tion and its contour, (b) the estimated density using the traditional formulation-based SVM

and its contour, (c) the estimated density using MF-based SVM and its contour



FIGURE 7: Comparison between the estimation results of a 2-D mixture of an isotropic Gaussian and two Gaussians with positive and negative correlation structure: (a) the contour of the estimated density using the Parzen window method, (b) the contour of the estimated density using the Reduced Set method, (c) the contour of the estimated density using the MF-based SVM method, and (d) the estimated density using MF-based SVM



FIGURE 8: Estimation of the mixture of Gaussians in Fig 3: (a) with the proposed algorithm for automatic kernel parameters estimation, (b) CDF of the estimated density without automatic kernel optimization, and (c) CDF of the estimated density with the proposed kernel optimization algorithm



FIGURE 9: Cross Validation for the learning parameters: (a) effect of an improper regularization constant C on the estimated density, (b) the RMSE versus the kernel width σ at C = 1.5, and (c) the RMSE surface over C and σ



FIGURE 10: Convergence of the estimation error with the optimization iterations for the

Gaussian density estimation example


CHAPTER IV

STATISTICAL LEARNING IN COMPUTER VISION

Statistical learning-based kernel methods are rapidly replacing other empirical learning methods (e.g., neural networks) as a preferred tool for machine learning due to many attractive features: a strong basis in statistical learning theory; no computational penalty in moving from linear to nonlinear models; and a convex optimization problem, guaranteeing a unique global solution and consequently producing systems with excellent generalization performance [11]. This chapter presents a statistical learning-based approach for solving the camera calibration problem. The approach uses the proposed MF-based SVM algorithm to estimate the elements of the perspective projection matrix.

Camera calibration is an extensively studied topic in different machine intelligence communities. Its purpose is to establish a mapping between the camera's 2-D image plane and a 3-D world coordinate system, so that the position of a 3-D point can be inferred from its projections in the cameras' image frames. The existing techniques to solve this problem can be broadly classified into three main categories: linear, nonlinear (or iterative), and two-step methods. A complete review of these approaches can be found in [54, 55].

Explicit camera calibration methods develop solutions by analyzing a physical model of camera imaging, so that calibration amounts to identifying a set of model parameters with physical meaning [5], whereas implicit calibration methods resort to realizing a nonlinear mapping function that can describe the input-output relation well [6]. Explicit calibration methods can provide the camera's physical parameters, which are important in some applications, such as computer graphics, virtual reality, 3-D reconstruction, etc.


This chapter presents an explicit approach for solving the camera calibration problem. In principle, the approach considers the problem as a mapping from the 3-D world coordinate system to the 2-D image coordinate system, where the projection matrix is the mapping function, and the MF-based SVM algorithm is used to simulate this mapping.

An important issue in the SVM algorithm is the choice of the kernel function [56]. The shape of the kernel controls the capacity and performance of the algorithm. To enable explicit estimation of the projection matrix from the SVM regression setup, a linear kernel is used. Although an SVM with a linear kernel is limited in the types of mappings it can simulate [57], a first-order linear kernel is experimentally shown to be sufficient for the current application.

A. Camera Calibration

This section provides a brief introduction to the camera calibration problem from the regression point of view, as treated in this work. The derivation of the proposed approach as well as the implementation steps are discussed.

1. An Overview

The camera model considered here is the perspective projection based on the pinhole model [58]. If a point M has world coordinates (X, Y, Z) and is projected onto a point m that has image coordinates (x, y), this projection can be described, in homogeneous coordinates, by the equation:

\[
s\,m = P\,M
\quad\text{or}\quad
s \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}
= P \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
\tag{70}
\]

where s is a scaling factor and P (3 × 4) is the projection matrix, which can be decomposed into two matrices, P = A D, where

\[
D = \begin{bmatrix} R & t \\ \mathbf{0}_3^T & 1 \end{bmatrix},
\qquad
A = \begin{bmatrix}
\alpha_x & -\alpha_x \cot\theta & x_0 & 0 \\
0 & \alpha_y / \sin\theta & y_0 & 0 \\
0 & 0 & 1 & 0
\end{bmatrix}
\tag{71}
\]

The 4 × 4 matrix D represents the mapping from world coordinates to camera coordinates and accounts for the six extrinsic parameters of the camera: three for the rotation R, which is normally specified by three rotation (Euler) angles Rx, Ry and Rz, and three for the translation t = (tx, ty, tz)^T. The symbol 0_3 represents the null vector (0, 0, 0)^T. The 3 × 4 matrix A represents the intrinsic parameters of the camera: the scale factors αx and αy, the coordinates x0 and y0 of the principal point, and the angle θ between the image axes.
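For concreteness, the decomposition P = A D of (71) can be assembled numerically. The sketch below is illustrative: the Euler-angle composition order (Rz Ry Rx) is an assumption, and the sample parameter values are borrowed from the ground truth row of Table 3:

```python
import numpy as np

def rotation(rx, ry, rz):
    """Rotation from Euler angles; the order Rz @ Ry @ Rx is an assumption."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def projection_matrix(ax, ay, x0, y0, theta, rx, ry, rz, t):
    """Assemble P = A D per (71): A holds the intrinsics, D the extrinsics."""
    A = np.array([[ax, -ax / np.tan(theta), x0, 0.0],
                  [0.0, ay / np.sin(theta), y0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]])
    D = np.eye(4)
    D[:3, :3] = rotation(rx, ry, rz)
    D[:3, 3] = t
    return A @ D          # the 3x4 projection matrix

P = projection_matrix(556.0, 549.0, 172.0, 121.0, np.pi / 2,
                      0.09, 0.80, -0.03, [-27.0, -28.0, 701.0])
h = P @ np.array([100.0, 50.0, 200.0, 1.0])   # homogeneous world point
x, y = h[0] / h[2], h[1] / h[2]               # image coordinates per (70)
```

Note that at θ = π/2 the skew term −αx cot θ vanishes, which matches the ideal-camera assumption mentioned later for the Heikki method.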

The projection matrix can be represented in a simplified form as:

\[
P = \begin{bmatrix} P_1 \\ P_2 \\ P_3 \end{bmatrix}
\tag{72}
\]

Using (72), the relation in (70) can be represented pictorially as in Fig. 11. This figure shows that there is a coupling between the outputs of the three branches (the scaling term is repeated in each output). Thus, it is not possible to optimize a branch independently of the others.

2. Basic Regression Relations

As stated before, the proposed approach considers each branch of Fig. 11 as a regression problem, and solves it using the MF-based SVM algorithm. The general SVM regression rule in (10) is used to formulate the output from each branch. As also stated before, a linear kernel is used in the SVM formulation to enable an explicit estimation of the projection matrix. This kernel has the form:

\[
K(M_i, M) = M_i \cdot M + b = M_i^T M + b
\tag{73}
\]

where b is a constant. The general regression relation becomes:

\[
f(M) = \sum_{i=1}^{n} w_i K(M_i, M) = \sum_{i=1}^{n} w_i \left( M_i^T M + b \right)
= \left(\sum_{i=1}^{n} w_i X_i\right) X + \left(\sum_{i=1}^{n} w_i Y_i\right) Y + \left(\sum_{i=1}^{n} w_i Z_i\right) Z + (b+1)\left(\sum_{i=1}^{n} w_i\right)
\tag{74}
\]

where each M_i is a point from the training sample.

52

Page 69:  · KERNEL METHODS FOR STATISTICAL LEARNING IN COMPUTER VISION AND PATTERN RECOGNITION APPLICATIONS By Refaat Mokhtar Mohamed M.Sc., EE, Assiut University, Egypt, 2001 B.Sc., EE,

The specific outputs are obtained from (74) as:

\[
f^x(M) = \left(\sum_{i=1}^{n} w_i^x X_i\right) X + \left(\sum_{i=1}^{n} w_i^x Y_i\right) Y + \left(\sum_{i=1}^{n} w_i^x Z_i\right) Z + (b+1)\left(\sum_{i=1}^{n} w_i^x\right)
\]
\[
f^y(M) = \left(\sum_{i=1}^{n} w_i^y X_i\right) X + \left(\sum_{i=1}^{n} w_i^y Y_i\right) Y + \left(\sum_{i=1}^{n} w_i^y Z_i\right) Z + (b+1)\left(\sum_{i=1}^{n} w_i^y\right)
\]
\[
f^s(M) = \left(\sum_{i=1}^{n} w_i^s X_i\right) X + \left(\sum_{i=1}^{n} w_i^s Y_i\right) Y + \left(\sum_{i=1}^{n} w_i^s Z_i\right) Z + (b+1)\left(\sum_{i=1}^{n} w_i^s\right)
\tag{75}
\]

where w_i^x denotes the i-th weight in the regression machine which computes f^x(M).

3. Simultaneous Optimization of the Regression Relations

As stated before, the relations in (75) are coupled, so the corresponding regression machines cannot be individually optimized. To overcome this problem, a gradient descent step is used to simultaneously optimize the values of the scaling factors while optimizing the Support Vector regression machines. This step minimizes the overall error:

\[
E = \sum_{i=1}^{n} \left\| P M_i - s_i m_i \right\|^2
= \sum_{i=1}^{n} \left( f^x(M_i) - s_i x_i \right)^2 + \left( f^y(M_i) - s_i y_i \right)^2 + \left( f^s(M_i) - s_i \right)^2
\tag{76}
\]

The proposed algorithm minimizes the error in (76) with respect to the scaling factor s according to the gradient descent rule:

\[
\Delta s_i = x_i \left( s_i x_i - f^x(M_i) \right) + y_i \left( s_i y_i - f^y(M_i) \right) + \left( s_i - f^s(M_i) \right)
\tag{77}
\]

The update of s follows:

\[
s_{\text{new}} = s_{\text{old}} - \eta\, \Delta s
\tag{78}
\]

where η is a learning parameter.


4. The Overall Calibration Algorithm

In the following, the implementation steps of the proposed approach are summarized.

Algorithm 1 Statistical Learning based Camera Calibration Algorithm

1. Prepare the training data sets where the inputs are the 3-D world coordinates, and the outputs are the corresponding 2-D coordinates. Preferably, normalize both the inputs and outputs, and prepare the augmented data set.

2. Initialize the values of the scaling factors (to all ones in our implementation).

3. Optimize the regression branches in Fig. 11 using the MF-based SVM algorithm (see [59]).

4. Update s from (78).

5. Iterate from step 3 to minimize the overall error in (76).
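A minimal sketch of this loop is given below. Ordinary least squares stands in for the MF-based SVM regression machines of step 3; this is an illustrative simplification under synthetic, normalized data, not the dissertation's actual regression engine:

```python
import numpy as np

def calibrate(M, m, iters=300, eta=0.05):
    """Sketch of Algorithm 1 with least squares replacing MF-based SVM.
    M: (n, 3) world points, m: (n, 2) image points (assumed normalized)."""
    n = len(M)
    Ma = np.hstack([M, np.ones((n, 1))])          # step 1: augmented inputs
    s = np.ones(n)                                # step 2: scaling factors
    errs = []
    for _ in range(iters):
        # Step 3: fit the three branches to the targets (s*x, s*y, s).
        T = np.column_stack([s * m[:, 0], s * m[:, 1], s])
        sol, *_ = np.linalg.lstsq(Ma, T, rcond=None)
        P = sol.T                                 # rows P1, P2, P3 of (72)
        f = Ma @ sol                              # branch outputs f^x, f^y, f^s
        errs.append(np.sum((f - T) ** 2))         # overall error E of (76)
        # Step 4: gradient step on the scaling factors, per (77)-(78).
        ds = (m[:, 0] * (s * m[:, 0] - f[:, 0])
              + m[:, 1] * (s * m[:, 1] - f[:, 1])
              + (s - f[:, 2]))
        s = s - eta * ds                          # step 5: iterate
    return P, errs

# Synthetic camera and exact correspondences for demonstration.
rng = np.random.default_rng(2)
Mw = rng.uniform(-1, 1, (40, 3))
P_true = rng.normal(size=(3, 4))
P_true[2, :3] *= 0.3
P_true[2, 3] = 4.0                                # keep scale factors away from 0
h = np.hstack([Mw, np.ones((40, 1))]) @ P_true.T
mi = h[:, :2] / h[:, 2:3]
P_est, errs = calibrate(Mw, mi)
```

Alternating an exact fit of the branches with a small gradient step on the scaling factors makes the overall error of (76) non-increasing, which mirrors the coupling argument above.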

After the completion of the optimization process, the estimated calibration matrix will have the form:

\[
P = \begin{bmatrix}
\sum_{i=1}^{n} w_i^x X_i & \sum_{i=1}^{n} w_i^x Y_i & \sum_{i=1}^{n} w_i^x Z_i & (b+1)\sum_{i=1}^{n} w_i^x \\
\sum_{i=1}^{n} w_i^y X_i & \sum_{i=1}^{n} w_i^y Y_i & \sum_{i=1}^{n} w_i^y Z_i & (b+1)\sum_{i=1}^{n} w_i^y \\
\sum_{i=1}^{n} w_i^s X_i & \sum_{i=1}^{n} w_i^s Y_i & \sum_{i=1}^{n} w_i^s Z_i & (b+1)\sum_{i=1}^{n} w_i^s
\end{bmatrix}
\tag{79}
\]


B. Discussion of Some Calibration Methods

For comparison, the experimental work uses two classical calibration methods, linear [58] and nonlinear (NL) using the simplex method [60], and two state-of-the-art methods, neural networks (NN) [61] and the Heikki method [62]. The following discussion briefly explains these methods.

1. Linear Direct Transform Method (LDT)

The direct application of the camera calibration relation in (70) results in:

\[
x_i = \frac{P_{11} X_i + P_{12} Y_i + P_{13} Z_i + P_{14}}{P_{31} X_i + P_{32} Y_i + P_{33} Z_i + P_{34}}
\quad\text{and}\quad
y_i = \frac{P_{21} X_i + P_{22} Y_i + P_{23} Z_i + P_{24}}{P_{31} X_i + P_{32} Y_i + P_{33} Z_i + P_{34}}
\tag{80}
\]

Given N correspondence points, LDT rearranges the formulas in (80) into 2N linear equations in the entries of P, in the form:

\[
C p = 0
\tag{81}
\]

where:

\[
C = \begin{bmatrix}
X_1 & Y_1 & Z_1 & 1 & 0 & 0 & 0 & 0 & -x_1 X_1 & -x_1 Y_1 & -x_1 Z_1 & -x_1 \\
0 & 0 & 0 & 0 & X_1 & Y_1 & Z_1 & 1 & -y_1 X_1 & -y_1 Y_1 & -y_1 Z_1 & -y_1 \\
X_2 & Y_2 & Z_2 & 1 & 0 & 0 & 0 & 0 & -x_2 X_2 & -x_2 Y_2 & -x_2 Z_2 & -x_2 \\
0 & 0 & 0 & 0 & X_2 & Y_2 & Z_2 & 1 & -y_2 X_2 & -y_2 Y_2 & -y_2 Z_2 & -y_2 \\
\vdots & & & & & & & & & & & \vdots \\
X_N & Y_N & Z_N & 1 & 0 & 0 & 0 & 0 & -x_N X_N & -x_N Y_N & -x_N Z_N & -x_N \\
0 & 0 & 0 & 0 & X_N & Y_N & Z_N & 1 & -y_N X_N & -y_N Y_N & -y_N Z_N & -y_N
\end{bmatrix}
\]

and

\[
p = [P_{11}\ P_{12}\ P_{13}\ \cdots\ P_{33}\ P_{34}]^T
\tag{82}
\]


The system of linear equations in (81) can be solved using the Singular Value Decomposition (SVD) approach to get the unknown p. The SVD decomposes C as:

\[
C = U S V^T
\tag{83}
\]

The solution is the column of V (the right singular vector) corresponding to the smallest singular value on the main diagonal of S.
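The LDT procedure can be sketched in a few lines: build C per (81) from the correspondences and extract the singular vector. The camera and points below are synthetic placeholders:

```python
import numpy as np

def ldt(M, m):
    """Linear Direct Transform: assemble C of (81) and solve Cp = 0 via SVD;
    p is the right singular vector of the smallest singular value."""
    rows = []
    for (X, Y, Z), (x, y) in zip(M, m):
        rows.append([X, Y, Z, 1, 0, 0, 0, 0, -x*X, -x*Y, -x*Z, -x])
        rows.append([0, 0, 0, 0, X, Y, Z, 1, -y*X, -y*Y, -y*Z, -y])
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    return Vt[-1].reshape(3, 4)       # p rearranged into the 3x4 matrix P

# Recover a known projection matrix (up to scale) from exact correspondences.
rng = np.random.default_rng(3)
P_true = rng.normal(size=(3, 4))
P_true[2, 3] = 5.0                    # keep the projective denominators positive
M = rng.uniform(-1, 1, (12, 3))
h = np.hstack([M, np.ones((12, 1))]) @ P_true.T
m = h[:, :2] / h[:, 2:3]
P_est = ldt(M, m)
```

Since p is only determined up to scale, the recovered matrix should be compared to the true one after normalization.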

2. Nonlinear Two Stages Method (NL)

In this approach, an iterative minimization algorithm is employed to solve for the camera parameters. The first stage of the approach finds initial values of the camera parameters. This can be done by many methods (see [63]), for instance by the LDT method discussed above. The estimated camera parameters are used to reproject the control 3-D points into the 2-D space. Then, the second stage of the algorithm uses a nonlinear optimization algorithm to minimize the following error criterion in the image space:

\[
E = \sum_{i=1}^{N} \left\| P M_i - s_i m_i \right\|^2
\tag{84}
\]

Details of (84) can be found in (76). For the implementation of this method in the current work, the well-known Simplex method is used to minimize (84).

3. Neural Networks Method (NN)

There are many neural network-based methods for camera calibration. The one used here employs a neural network not only to learn the mapping from 3-D points to 2-D pixel points which minimizes the error in (76), but also to extract the projection matrix and camera parameters. Therefore, the network structure is laid out accordingly. The net is a two-layer feedforward neural network. The input layer has three neurons plus one augmented neuron fixed at 1. These three correspond to the three coordinates X, Y, Z of a 3-D point. The number of output units is three, and the hidden layer consists of four neurons (three plus one dummy). The hidden and output neurons have identity (linear) activation functions. The weight matrix of the hidden layer is denoted by V, and it is assumed to correspond to the extrinsic parameters matrix D. The weight matrix of the output layer is denoted by W and corresponds to the intrinsic parameters matrix A.

4. Heikkilä Method (Heikki)

The calibration procedure suggested in this method utilizes circular control points and performs the mapping from world coordinates into image coordinates, and backward from image coordinates to lines of sight or 3-D plane coordinates. It introduces a bias correction for circular control points and a non-recursive method for reversing the distortion model. The motivation for using circular control points is that lines in the object space are mapped as lines on the image plane, but in general perspective projection is not a shape-preserving transformation. Two- and three-dimensional shapes are deformed if they are not coplanar with the image plane. This is also true for circular landmarks, which are commonly used as control points in calibration. Consequently, a bias between the observations and the model is induced if the centers of their projections in the image are treated as projections of the circle centers. This approach is mainly intended to be used with circular landmarks. However, it is also suitable for small points without any specific geometry; in that case, the radius is set to zero.

This approach is iterative and consists of two stages, like the NL approach. The first stage initializes the camera parameters with a simple method (e.g., LDT). In the second stage, the parameters of the forward camera model are estimated by minimizing the weighted sum of squared differences between the observations and the model. Assume that there are n circular control points and K images, indexed by i = 1, ..., n and k = 1, ..., K. A vector containing the observed image coordinates of the center of the ellipse i in the frame k is denoted by e_o(i, k), and the corresponding vector produced by the forward camera model is denoted by e_m(i, k). The objective function used can then be expressed as:

\[
J(\theta) = e^T(\theta)\,\Sigma^{-1} e(\theta)
\tag{85}
\]

where e^T(θ) = [(e_o(1,1) − e_m(1,1))^T (e_o(2,1) − e_m(2,1))^T ··· (e_o(n,K) − e_m(n,K))^T]. The matrix Σ is the covariance matrix of the observation error. The parameters of the forward camera model are obtained by minimizing J(θ):

\[
\hat{\theta} = \arg\min_{\theta} J(\theta)
\tag{86}
\]

Again, this method uses the Simplex optimization method to solve for the optimum θ.
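The objective in (85) is a weighted least-squares criterion; a small sketch of its evaluation is given below, with hypothetical residuals and an assumed i.i.d. observation-noise covariance:

```python
import numpy as np

def heikki_objective(e_obs, e_mod, Sigma):
    """Evaluate J(theta) = e^T Sigma^{-1} e of (85), where e stacks the
    residuals between observed and modeled ellipse centers."""
    e = (e_obs - e_mod).ravel()
    return float(e @ np.linalg.solve(Sigma, e))

# Hypothetical 2-D center residuals for four control points in one frame.
e_obs = np.array([[100.2, 50.1], [80.0, 40.3], [60.1, 30.0], [40.0, 20.2]])
e_mod = np.array([[100.0, 50.0], [80.1, 40.0], [60.0, 30.1], [40.1, 20.0]])
Sigma = 0.25 * np.eye(8)          # assumed i.i.d. noise with variance 0.25
J = heikki_objective(e_obs, e_mod, Sigma)   # J ≈ 0.88 for these residuals
```

With an isotropic Σ the criterion reduces to an ordinary sum of squares scaled by the noise variance, which is the quantity the Simplex search then minimizes over θ.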

C. Experimental Results and Discussions

The performance of the proposed statistical learning MFSVM approach for camera calibration is evaluated in two different ways. The first part uses synthetic data to evaluate the proposed MFSVM camera calibration approach. The synthetic data allows a quantitative evaluation of the approach's performance with respect to ground truth data. This synthetic data is generated from a virtual camera with specific internal and external parameters, and the performance of the proposed approach is investigated using this data. The robustness of the algorithm is evaluated by adding different levels of noise to the training data. The proposed approach shows nearly steady values for the estimated parameters and outperforms other classical algorithms.

The second part uses a real checker-board object with black/white squares. The known 3-D coordinates of the black squares' corners (inputs) with their corresponding image coordinates (desired outputs) are used to train the proposed algorithm. The proposed approach is applied in the 3-D reconstruction of a real scene, and the approach's performance is reflected in the accuracy of the reconstruction process. The performance of the proposed technique is compared against some known algorithms: classical linear, nonlinear, and neural network approaches. The proposed approach shows an outstanding performance.

1. Simulation with Synthetic Data

In this experiment, a virtual CCD camera is modeled (a pinhole camera model is assumed) such that its intrinsic and extrinsic parameters are used as ground truth data. This allows a fair comparison between the parameters estimated using the proposed approach and other camera calibration approaches. In addition, this allows us to characterize the performance of the proposed approach using noisy data at different noise levels. The setup of this experiment can be described as follows:

• Given the ground truth values of the camera parameters (shown in Table 3), construct the projection matrix (see (71)).

• Given a set of 3-D reference points (representing points of interest of a calibration pattern) and the projection matrix, get the projected image points.

• In real setups, the 2-D image points are detected by applying image processing algorithms on the captured images of the calibration pattern (a noisy environment, sensor insensitivity, and inaccuracy of the feature extractors could be different sources of errors). To simulate these conditions, Gaussian noise with zero mean and different standard deviations, σ, is added to the 2-D image points.

• Use these noisy 2-D points and their corresponding 3-D points to produce a noisy version of the projection matrix.

• Use this projection matrix in a backward computation of both camera intrinsic and extrinsic parameters.

• Compare the estimated parameters with their corresponding ground truth values.
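The noisy-data generation in the steps above can be sketched as follows; the camera matrix and point cloud here are synthetic placeholders, not the parameters of Table 3:

```python
import numpy as np

def noisy_projections(P, M, sigma, rng):
    """Project 3-D points through P (per (70)) and corrupt the 2-D image
    points with zero-mean Gaussian noise of standard deviation sigma."""
    h = np.hstack([M, np.ones((len(M), 1))]) @ P.T
    m = h[:, :2] / h[:, 2:3]
    return m + rng.normal(0.0, sigma, m.shape)

rng = np.random.default_rng(4)
P = rng.normal(size=(3, 4))
P[2, 3] = 10.0                                 # keep scale factors positive
M = rng.uniform(-1, 1, (50, 3))                # 3-D reference points
noise_levels = np.arange(0.0, 4.4, 0.4)        # the 11 levels, step 0.4
samples = {s: noisy_projections(P, M, s, rng) for s in noise_levels}
```

Repeating each noise level many times (50 repetitions in the experiments above) and averaging the estimation errors gives the RMSE curves of Figures 12 through 15.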


Motivated by the basic and realistic assumption that most training data sets are contaminated by various errors from data acquisition and/or pre-processing steps, statistical learning is a reasonable approach for robust camera calibration. To show the robustness of the proposed approach, a series of 11 experiments with different noise levels is carried out. The noise standard deviation, σ, is selected in the range from 0 to 4 pixels with a step of 0.4. Each experiment is repeated 50 times to get average results. The camera parameters are then estimated by the four other camera calibration approaches: linear [58], nonlinear (NL) using the simplex method [60], neural networks (NN) [61], and the Heikki method [62].

To explicitly estimate the camera parameters, the proposed approach has to use a linear kernel. Employing this kernel introduces a trade-off between slightly lower accuracy at low noise levels and high robustness at high noise levels. In the ideal case, where the training data set is free of errors/noise, Table 3 shows the estimated values of the camera parameters in comparison with the other approaches against the ground truth values. Although these results give minor credit to the competing approaches over the MFSVM approach at low noise levels, the proposed approach gains significant credit at higher noise levels, as shown in Table 3.

The root mean square errors (RMSE) over the 50 trials between the ground truth parameters and the estimated parameters are plotted in Figures (12, 13, 14, 15) as a function of σ for a set of four camera parameters. These figures show that the performance of most of the traditional calibration methods degrades considerably as the noise level increases, which is reflected in high RMSE values. This degradation is rapid for the linear approach, and occurs at lower rates for both the nonlinear (NL) and neuro-calibration (NN) approaches. However, the proposed approach shows outstanding robustness against noise. The values of the RMSE are almost the same at all noise levels for the tx parameter, and increase only slightly for the other displayed parameters. Robustness against noise is one


TABLE 3: Ground truth camera parameters versus estimated parameters

Parameter  tx(mm)  ty(mm)  tz(mm)  Rx(rad)  Ry(rad)  Rz(rad)  αu(pix)  αv(pix)  uo(pix)  vo(pix)  θ(rad)
True       -27.0   -28.0   701.0   0.090    0.800    -0.030   556.0    549.0    172.0    121.0    1.57

Estimated parameters using noise-free data (σ = 0 pixels)
Linear     -27.3   -28.1   700.7   0.090    0.799    -0.030   555.7    548.8    171.8    120.9    1.57
NL         -27.2   -28.0   700.8   0.090    0.800    -0.030   555.9    548.8    171.9    121.0    1.57
NN         -27.5   -28.2   699.5   0.089    0.800    -0.030   555.0    548.1    171.7    120.8    1.57
Heikki     -27.0   -27.9   701.0   0.090    0.800    -0.030   556.1    550.0    172.0    120.9    1.57
SVM        -29.7   -29.3   697.8   0.088    0.796    -0.031   553.4    545.4    169.8    120.1    1.57

Estimated parameters using noisy data (σ = 4 pixels)
Linear     -80.4   -59.2   654.3   0.044    0.768    -0.068   531.5    519.0    141.3    100.7    1.56
NL         -49.3   -44.6   676.8   0.059    0.781    -0.054   539.8    532.8    157.9    108.9    1.56
NN         -49.8   -46.0   671.0   0.062    0.782    -0.053   536.4    528.7    156.8    108.3    1.56
Heikki     -43.7   -48.9   884.8   0.172    0.762    -0.093   707.4    707.5    155.3    104.8    1.57
SVM        -34.3   -41.6   682.6   0.069    0.791    -0.045   543.6    536.9    166.3    110.9    1.57

of the main strengths of statistical learning approaches in general, and of SVM specifically. This robustness has led to increasing interest in applying such approaches in machine learning applications. As the results demonstrate, robustness is the feature that motivates the use of the MFSVM approach in solving the camera calibration problem, especially when the noise level cannot be ignored.



FIGURE 12: The RMSE for tx as a function of noise σ, computed for the five approaches: linear, nonlinear using the simplex method, neuro-calibration, Heikki, and MFSVM.

2. Experiments with Real Images

In the case of real images, there is no ground truth data (camera parameters) available for camera calibration. The only available data is the 3-D coordinates of some control points on a given calibration pattern. Therefore, the accuracy of a camera calibration approach is measured in terms of the accuracy in reconstructing these 3-D points through triangulation [54, 55]. To carry out this accuracy measure, calibration is performed for two CCD cameras working as a stereo pair. Two images of a calibration pattern (see Fig. 16) are



FIGURE 13: The RMSE for Ry

captured using these two cameras. The 3-D points used for the calibration are the vertices

of the checker-board squares of the calibration pattern.

Knowing the 3-D coordinates (Xi, Yi, Zi) of these points, the corresponding image-point locations are detected accurately using edge-detection and fitting techniques (for examples of such techniques, see [54, 55]). Given these two sets of points, the two cameras are calibrated using the four calibration approaches stated before. The accuracy of the calibration process is measured using the root mean square error (RMSE), defined as:

\[
\mathrm{RMSE} = \left[ \frac{1}{n} \sum_{i=1}^{n} (X_i - \hat{X}_i)^2 + (Y_i - \hat{Y}_i)^2 + (Z_i - \hat{Z}_i)^2 \right]^{\frac{1}{2}}
\tag{87}
\]

where (\hat{X}_i, \hat{Y}_i, \hat{Z}_i) are the estimated 3-D coordinates of the i-th point and n is the number of points.

FIGURE 14: The RMSE for uo
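The RMSE of (87) is straightforward to compute; a minimal sketch with made-up points follows:

```python
import numpy as np

def reconstruction_rmse(true_pts, est_pts):
    """Root mean square 3-D reconstruction error per (87):
    average the squared point-to-point distances, then take the square root."""
    d2 = np.sum((true_pts - est_pts) ** 2, axis=1)  # squared distances per point
    return float(np.sqrt(np.mean(d2)))

true_pts = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
est_pts = np.array([[0.1, 0.0, 0.0], [1.0, 0.9, 1.0]])
rmse = reconstruction_rmse(true_pts, est_pts)   # sqrt((0.01 + 0.01) / 2) = 0.1
```

In the experiments, `true_pts` are the known control points on the calibration pattern and `est_pts` are their triangulated reconstructions.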

Figure 17 shows the 2-D projection of the calibration pattern corners (dots) and the corners detected from processing the pattern image (circles). To obtain the 2-D projection, the cameras are calibrated and then the 2-D projections are computed using (70). It is clear from Fig. 17 that the projected 2-D corners almost coincide with the detected ones. Knowing that the corners are detected accurately from the images, the figure illustrates the outstanding performance of the proposed MFSVM approach.

For quantitative evaluation, the RMSE values of the four camera calibration approaches are given in Table 4.



FIGURE 15: The RMSE for the skewness angle θ. Note: the Heikki method assumes an ideal camera model in the sense of skewness (i.e., θ = π/2), so there is no error indicated.

TABLE 4: Error in the 3-D reconstructed data

            linear   nonlinear   neuro-calibration   SVM
RMSE (mm)   0.4280   0.2991      0.3044              0.2794

It is clear from the table that the proposed approach outperforms the other methods and gives more accurate results in reconstructing the 3-D coordinates. It is worth noting that, although the difference between the RMSE of the MFSVM approach and the nonlinear approach is small, the MFSVM approach relaxes the requirement of a good initial guess. This requirement is one of the main drawbacks of nonlinear methods and


FIGURE 16: Calibration setup: A stereo pair of images for a checker-board calibration

pattern

FIGURE 17: The 2-D projection of the calibration pattern corners (dots), and the detected

corners from the image (left one of Fig. 16) of the calibration pattern (circles)


without it, the solution can diverge from the correct one. These results on real images emphasize the results obtained for the synthetic data and verify the validity of the proposed method in real situations.

D. Conclusion

In this chapter, a robust method for camera calibration using Mean field theory-based Support Vector Machines (MFSVM), as a statistical learning approach, is presented. The projection matrix is obtained explicitly by using a dot-product kernel in the formulation of the SVM algorithm.

The explicit estimation of the camera parameters is evaluated using synthetic data, while a 3-D scene reconstruction problem is used to evaluate the performance in real-world setups. Different noise levels are used to show the robustness of the approach, which is illustrated by the nearly steady values of the estimated parameters as the noise standard deviation increases. In addition, the approach is compared with other known camera calibration techniques, namely: linear, nonlinear using the simplex method, neuro-calibration, and the Heikki method. The experimental results showed an outstanding performance of the proposed approach in terms of accuracy and robustness against noise compared to the competing camera calibration approaches. The RMSE drops from 0.428 mm with the linear calibration to 0.2794 mm with the proposed approach.


CHAPTER V

APPLICATIONS OF THE PROPOSED DENSITY ESTIMATION APPROACH

This chapter presents the extensive experimental work carried out to evaluate applications of the proposed density estimation approach. The applications include density estimation on real remote sensing data in a Bayes classification setup, and parameter estimation of MRF models in image applications. The density estimation accuracy is reflected in the classification accuracy of the data sets using only class-conditional probability modeling with pre-specified priors. Since the reference density function is not known for these real data sets, the earlier evaluation methods (visual inspection and the KLD measure) for the performance of the proposed density estimation algorithm cannot be used. Instead, the classification accuracy is used as a practical measure of the performance of the density estimation algorithm. When carrying out the classification experiments to compare the different density estimation algorithms, the operating conditions (in a Bayes classification setup) are the same except for the density estimation algorithm. Thus, the argument that the classification accuracy is an indication of the density estimation performance holds.
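The plug-in Bayes rule described above (class-conditional density times a pre-specified prior, largest product wins) can be sketched as follows. This is a generic illustration, not the dissertation's code: the unit-covariance Gaussian densities below stand in for the MF-based SVM estimator and its competitors.

```python
import numpy as np

def bayes_plugin_classifier(class_densities, priors):
    """Return a rule y -> argmax_c p(y | c) * P(c).

    class_densities: one callable per class, mapping an (n, d) array of
                     points to their estimated class-conditional densities.
    priors:          pre-specified class priors P(c), e.g. class shares.
    """
    priors = np.asarray(priors, dtype=float)

    def classify(Y):
        Y = np.atleast_2d(Y)
        # Posterior score of each class is proportional to p(y|c) * P(c).
        scores = np.stack([f(Y) * p for f, p in zip(class_densities, priors)],
                          axis=1)
        return np.argmax(scores, axis=1)

    return classify

def unit_gaussian_pdf(mean):
    """Illustrative stand-in density: N(mean, I)."""
    mean = np.asarray(mean, dtype=float)
    def pdf(Y):
        diff = np.atleast_2d(Y) - mean
        return np.exp(-0.5 * (diff ** 2).sum(axis=1)) / (2 * np.pi) ** (mean.size / 2)
    return pdf

dens = [unit_gaussian_pdf([0, 0]), unit_gaussian_pdf([3, 3])]
clf = bayes_plugin_classifier(dens, priors=[0.5, 0.5])
print(clf(np.array([[0.1, -0.2], [2.9, 3.1]])))  # -> [0 1]
```

Swapping in a different density estimator changes only the `class_densities` callables, which is exactly why the classification accuracy isolates the density estimator's performance.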

A. Test-of-agreement (ToA) for the response of two classifiers

To compare the performance of two classifiers against each other, the following rule is proposed. Suppose there are two classifiers with decision rules M1(·) and M2(·), respectively, which are applied to the test data set. Define the statistic:

S_n = \sum_{i=1}^{n} z_i    (88)


where z_i = 1 if M1(y_i) = M2(y_i) and 0 otherwise. The statistic S_n measures the agreement between the two classifiers' outputs, reflecting how often the two classifiers give the same response to the same data point. From the central limit theorem (CLT) [34], with p the agreement probability:

\frac{S_n - np}{\sqrt{np(1-p)}} \sim \mathcal{N}(0, 1)    (89)

The hypothesis:

H_0 : the two classifiers agree.    (90)

has the 95% confidence interval [\hat{S}_n/n - 2A, \hat{S}_n/n + 2A], where A = \sqrt{\hat{p}(1-\hat{p})/n} and \hat{p} = \hat{S}_n/n.

If this confidence interval contains the point 1, then the probability that the responses of the two classifiers agree is 1. If the confidence interval does not contain 1, then there is a chance that the two classifiers disagree, and thus the hypothesis is rejected. With this argument in mind, the rule to compare the performance of two classifiers is as follows:

1. If there is a difference in the performance between the two classifiers and they disagree, this difference is significant.

2. If there is a difference in the performance between the two classifiers and they agree, this difference is not significant.

Throughout the following experiments, the 95% confidence interval is calculated between the Bayes classifier that uses the MF-based SVM density estimator and the one that uses the MLE, Parzen-window, KNN, or traditionally formulated SVM density estimator. If an interval contains the point 1, then the apparent difference in performance between the two classifiers is not significant.
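The agreement statistic of (88) and the interval test above can be sketched directly; this is a minimal illustration of the rule, not the dissertation's code:

```python
import numpy as np

def toa_interval(labels1, labels2):
    """Test-of-agreement: S_n from (88) and the 95% CI for the agreement
    proportion p_hat = S_n / n, with half-width 2*sqrt(p_hat*(1-p_hat)/n)."""
    z = (np.asarray(labels1) == np.asarray(labels2)).astype(int)
    n = z.size
    s_n = z.sum()                      # S_n = sum of the agreement indicators z_i
    p_hat = s_n / n
    a = np.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - 2 * a, p_hat + 2 * a

def difference_is_significant(labels1, labels2):
    """The difference is significant iff the interval does not contain 1."""
    lo, hi = toa_interval(labels1, labels2)
    return not (lo <= 1.0 <= hi)

# Two classifiers agreeing on 90 of 100 points: the interval sits well
# below 1, so the apparent difference is significant under this rule.
y1 = np.zeros(100)
y2 = np.zeros(100)
y2[:10] = 1
print(difference_is_significant(y1, y2))  # -> True
```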


B. Experiments for Density Estimation Using Real Remote Sensing Multispectral Data

The following experiments illustrate the performance of the proposed density estimation algorithm on real data sets of relatively high dimensional spaces. In the current experiments, two 7-band multispectral data sets with 30-meter resolution are used. Each point in the data sets is represented by a vector of length 7, giving a seven-dimensional estimation problem.

1. Experiments for density estimation using a multispectral agricultural area

This data set represents an agricultural area in the state of Kentucky, USA. A 169x169 scene is cropped from a multispectral Landsat 7-band data set collected on Wednesday, December 5, 2001. The resolution of this data set is 28.5 m x 28.5 m per pixel. Nine classes are defined in this data set: Background, Corn, Soybean, Wheat, Oats, Alfalfa, Clover, Hay/Grass, and Unknown. The ground truth labels are available for the whole data set. For evaluation purposes, a subset from each class is used for training the density estimator and the rest of the data is used for testing. Figure (18) shows the reference land cover of the area and the classification results based on the SVM density estimators.

The confusion matrix for the classification based on the density estimation using MF-based SVM is shown in Table 5. The average true classification accuracy for the classes is 78.5%. The largest source of error is the misclassification between the Background and the other classes: 46% of the Background reference pixels are classified as other classes, and 9.6% of the other classes' pixels are classified as Background. A specific noticeable example is the confusion between Soybean and Background: 18.6% of the Background reference points are classified as Soybean and 5% of the Soybean points are classified as Background. Other large errors can be noted in the Alfalfa and Hay/Grass classes.

FIGURE 18: A multispectral agricultural area: (a) land cover, and classification results using: (b) SVM, and (c) MF-SVM as a density estimator.

However, the reason for the latter error is the prior probability assumption, where each prior is taken as the share of the class reference points in the data set. Since the Alfalfa and Hay/Grass classes are less represented in the data set, their priors are small and thus a noticeable error is generated. This realization calls for another estimation method

for the prior probabilities. This dissertation presents a method for this modeling using the MRF, which will be discussed later in this chapter. Another observation from the classified image in Fig. (18-b) is that most of the regions contain random misclassification points (which appear like salt-and-pepper noise in the image), although these regions should be smooth and clean. This is mainly because the Bayes classification setup treats the points in the data set as realizations of independent random variables, regardless of the contextual interactions. The MRF modeling also overcomes this problem, as will be illustrated later.

TABLE 5: Classification confusion matrix for the multispectral agricultural area using the MF-based SVM estimator.

Class        Total   Back-   Corn   Soy-   Wheat  Oats   Alf-   Clo-   Hay/   Unk-   % True
             Points  ground         bean                 alfa   ver    Grass  nown
Background   6790    3661    770    1262   555    358    4      144    15     21     53.92
Corn         9371    475     8787   93     1      7      3      1      3      1      93.77
Soybean      8455    1090    101    6985   74     85     0      92     6      22     82.61
Wheat        1923    199     0      37     1581   74     0      31     1      0      82.22
Oats         800     121     2      22     47     598    0      10     0      0      74.75
Alfalfa      65      19      22     6      0      0      13     0      5      0      20
Clover       619     120     7      103    19     54     0      316    0      0      51.05
Hay/Grass    142     54      17     27     1      6      1      5      29     2      20.42
Unknown      396     8       0      3      6      0      0      0      1      378    95.45
% +ve true           63.7    90.53  95.51  69.22  50.59  61.9   52.75  48.33  89.15
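The "% True" (per reference class, row-wise) and "% +ve true" (per predicted class, column-wise) figures reported in the confusion matrices can be recovered from the raw counts. A minimal sketch with made-up counts (not the tables' data):

```python
import numpy as np

def confusion_rates(cm):
    """cm[i, j] = count of class-i reference points classified as class j.
    Returns (% True per reference class, % +ve true per predicted class)."""
    cm = np.asarray(cm, dtype=float)
    pct_true = 100.0 * np.diag(cm) / cm.sum(axis=1)      # row-wise recall
    pct_pos_true = 100.0 * np.diag(cm) / cm.sum(axis=0)  # column-wise precision
    return pct_true, pct_pos_true

# Illustrative 2-class counts only.
cm = [[90, 10],
      [20, 80]]
recall, precision = confusion_rates(cm)
print(recall)     # -> [90. 80.]
print(precision)  # 90/110 and 80/90, as percentages
```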

To evaluate the proposed MF-based SVM density estimator, other classical and newer density estimation algorithms are applied to the same data set. Table 6 summarizes the classification accuracies obtained with the different density estimators. It can be noted from the table that the MF-based SVM density estimation algorithm outperforms the other algorithms (this is reflected in the classification accuracy, as discussed before). With the classical algorithms (MLE with Gaussian assumptions, Parzen-window estimation, and K-Nearest Neighbors "KNN"), some classes disappear completely (e.g., Alfalfa and Hay/Grass), whereas both SVM-based algorithms manage to recover part of these classes. The proposed MF-based SVM outperforms the traditional SVM in the overall classification accuracy.

An important note here is that MLE fails to recognize some classes because of the unimodality assumption for the class conditional probabilities (CCP). One way to boost the performance of the Gaussian-based ML approach is to use the enhanced statistics approach proposed in [64]; another is to use a multimodal form for the CCP. Under the assumption of a Gaussian kernel in the MF-based SVM regression, the density estimator is equivalent to a mixture of Gaussians; thus the MF-based SVM density estimator assumes a multimodal form for the CCP. The distinct feature here is that the optimum number of components in the mixture is obtained automatically. This automatic selection of the number of components makes the performance of the proposed approach superior to the traditional mixture-of-Gaussians density estimator using the EM algorithm [32, 65]; the results are illustrated in Table 6.
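Since a Gaussian kernel makes the SVM-based density estimate a weighted sum of Gaussians centred on the support vectors, evaluating it amounts to evaluating a Gaussian mixture. A sketch under that assumption; the centres, weights, and bandwidth below are placeholders, not values produced by the MF-based SVM:

```python
import numpy as np

def gaussian_mixture_density(y, centers, weights, sigma):
    """Evaluate sum_k w_k * N(y; c_k, sigma^2 I) at each row of y."""
    y = np.atleast_2d(y)
    d = y.shape[1]
    norm = (2.0 * np.pi * sigma ** 2) ** (d / 2.0)
    # Squared distances from every query point to every kernel centre.
    sq = ((y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return (weights * np.exp(-0.5 * sq / sigma ** 2)).sum(axis=1) / norm

centers = np.array([[0.0], [5.0]])   # placeholder "support vectors"
weights = np.array([0.5, 0.5])       # placeholder kernel weights (sum to 1)
p = gaussian_mixture_density(np.array([[0.0], [5.0], [2.5]]),
                             centers, weights, sigma=1.0)
print(p)  # highest density at the two centres, lowest midway between them
```

In the traditional EM-fitted mixture the number of centres must be chosen in advance; in the SVM formulation the support vectors, and hence the number of effective components, fall out of the optimization.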

To justify this argument regarding the performance of the classifier that uses the MF-based SVM density estimation, the ToA rule stated above is used to analyze the results in Table 6. The 95% confidence intervals (see section V.A) are: [0.7462 0.7567], [0.7906 0.8005], [0.8033 0.8130], and [0.883 0.891]. None of these intervals contains the point 1, which indicates that the apparent difference in performance between the classifier that uses the MF-based SVM density estimator and the others is significant.

The Bayes classifier based on the MF-based SVM density estimator has an apparently better performance than the others, reflecting the better performance of the density estimator.

TABLE 6: Classification accuracy using different density estimators for the multispectral agricultural area.

Class       MLE  Parzen   KNN     Mixture of  Traditional  MF-based
                 Window   (k=15)  Gaussians   SVM          SVM
Background  52   37       46      48          50.4         53.9
Corn        94   97       96      93          91.5         93.77
Soybean     78   92       82      82          77.9         82.61
Wheat       44   31       40      76          84.2         82.22
Oats        7    9        4       31          72.5         74.75
Alfalfa     0    0        0       41          76.9         20
Clover      5    4        4       44          69.8         51.05
Hay/Grass   0    0        1       9           66.2         20.42
Unknown     94   94       94      93          95.7         95.45
Average     71   72       71      74.4        76.1         78.5

2. Experiments for density estimation using a multispectral urban area

This data set represents an urban area around the Golden Gate Bay in the city of San Francisco, California, USA, and is shown in Fig. (19). A 700x700 scene is cropped from a 4632x4511 multispectral Landsat data set collected on Tuesday, September 28, 1998, with a resolution of 5 m x 5 m per pixel. Five classes are defined in this data set: Trees, Streets, Water, Buildings, and Earth. The available ground truth set contains 5076


data points. For evaluation purposes, a subset from each class is used for training and the rest of the data is used for testing.


FIGURE 19: A multispectral urban area: (a) RGB snapshot, and color-coded classification results using: (b) SVM, and (c) MF-SVM as a density estimator.

The confusion matrix for the classification using the MF-based SVM density estimator is shown in Table 7, which indicates that this experiment is easier than the previous one. The overall average true classification accuracy for the classes is 96.7%. There is slight confusion between the Streets and Earth classes: 3.8% of the Streets points are classified as Earth and 1% of the Earth points are classified as Streets. Another noticeable confusion is between the Buildings and Streets classes: 12.67% of the Buildings points are classified as Streets, while 0.8% of the Streets points are classified as Buildings. The misclassification between the Streets, Earth, and Buildings classes is reasonable given the real-world similarity between these classes.

TABLE 7: Classification confusion matrix for the multispectral urban area using the MF-based SVM estimator.

Class       Total Points  Trees  Streets  Water  Buildings  Earth  % True per Class
Trees       212           212    0        0      0          0      100
Streets     521           2      495      0      4          20     95
Water       595           0      0        595    0          0      100
Buildings   292           1      37       0      254        0      87
Earth       410           0      4        0      0          406    99
% +ve true                98.6   92.35    100    98.45      95.31

The proposed MF-based SVM density estimator is evaluated against other density estimation algorithms by noting the classification rate of the Bayes classification setup with each algorithm. Table 8 summarizes the obtained results. It can be noted from the table that the SVM density estimators outperform the classical algorithms, though there is only a small improvement of the MF-based SVM estimator over the traditionally formulated SVM estimator.

The 95% confidence intervals for the results in Table 8 are: [0.93 0.94], [0.93 0.94], [0.95 0.96], and [0.997 0.999]. None of these intervals contains the point 1, which emphasizes the better performance of the proposed density estimator.


TABLE 8: Classification accuracy using different density estimators for the multispectral urban area.

Class      MLE  Parzen  KNN    Traditional  MF-based
                Window  (k=3)  SVM          SVM
Trees      99   85      89     97.5         100
Streets    91   97      94     95.6         95
Water      97   100     100    100          100
Buildings  90   68      89     91.8         87
Earth      80   82      84     91.3         99
Average    92   89      92.7   95.7         96.7

C. Experiments for Density Estimation Using Real Remote Sensing Hyperspectral Data

This section illustrates the performance of the proposed density estimation algorithm in real high dimensional spaces. In the current experiments, two hyperspectral data sets are used, one with 34 bands and the other with 58 bands. These data sets pose density estimation problems in 34- and 58-dimensional spaces, respectively.

1. Experiments for density estimation using a hyperspectral 34-band data set

This experiment uses a hyperspectral data set of size 200x200 for an urban area in the state of Indiana, USA. This scene is cropped from a 618x1013 data set of type "AISA Classic Reflectance" collected using the AISA hyperspectral sensor with 34 channels in 1983, with a 3 m x 3 m resolution. Seven classes are defined in it: Agricultural, Coniferous, Herbaceous, Other Impervious, Roads, Soil/Disturbed, and Water. The


ground truth labels are available for the whole data set, and only a subset from each class is used for training the density estimators. This data set is illustrated in Fig. (20).


FIGURE 20: A hyperspectral 34-band urban area: (a) RGB snapshot, and color-coded classification results using: (b) SVM, and (c) MF-SVM as a density estimator.

The confusion matrix for the classification based on the density estimation using MF-based SVM is shown in Table 9. The average true classification accuracy for the classes is 80.1%. The largest source of error is the misclassification among the different types of vegetation. The "Other Impervious" class has the lowest classification rate because it shares its characteristics with the other vegetation classes.

TABLE 9: Classification confusion matrix for the hyperspectral 34-band data using the MF-based SVM estimator.

Class         Total   Agricu-  Conife-  Herba-  O. Imp-   Roads  Soil   Water  % True
              Points  ltural   rous     ceous   ervious
Agricultural  5138    4122     170      497     221       11     102    15     80.23
Coniferous    15182   3        12878    2228    66        2      1      4      84.82
Herbaceous    7481    27       947      5914    291       197    75     30     79.05
Other Imp.    925     4        56       175     595       62     30     3      64.32
Roads         627     0        17       145     44        418    3      0      66.67
Soil          6362    229      4        409     609       124    4922   64     77.37
Water         4285    5        323      282     328       2      133    3212   74.96
% +ve true            93.9     89.5     61.3    27.6      51.2   93.5   96.5

Table 10 summarizes the classification accuracies obtained with different density estimators. From that table it can be noted that the MF-based SVM density estimation algorithm outperforms the other algorithms.

The 95% confidence intervals for the results in Table 10 are: [0.76 0.77], [0.73 0.74], [0.74 0.75], and [0.856 0.863]. None of these intervals contains the point 1, which reflects the better performance of the MF-based SVM density estimator in high dimensional spaces.

TABLE 10: Classification accuracy using different density estimators for the hyperspectral 34-band data.

Class             MLE    Parzen  KNN    Traditional  MF-based
                         Window  (k=3)  SVM          SVM
Agricultural      86.65  85.95   85.66  80.87        80.23
Coniferous        71.63  91.4    62.4   88.33        84.82
Herbaceous        77.8   44.53   70.85  67.73        79.05
Other Impervious  66.7   46.38   60.43  69.73        64.32
Roads             83.41  36.36   76.24  67.3         66.67
Soil              90.77  80.27   82.27  70.7         77.37
Water             80.98  68.59   72.7   75.08        74.96
Average           78.8   75.8    71.4   78.53        80.1

2. Experiments for density estimation using a hyperspectral 58-band data set

Figure (21) shows a hyperspectral data set of an urban area in the state of New Mexico, USA. This scene is of size 300x600 and was cropped from a 1093x2176 data set with a 1 m x 1 m resolution. The data set was collected using the AISA Eagle hyperspectral sensor in 1983 with 58 channels. There are nine classes defined in it: Unclassified/Shadow, Water, Trees, Buildings, Asphalt Roads, Scrub Shrub/Herbaceous, Sand/Soil/Gravel, Riverine Wetland, and Riverine Substrate. The ground truth labels are available for the whole data set.

The confusion matrix for the classification based on the density estimation using MF-based SVM is shown in Table 11. The average true classification accuracy for the classes is 73%. The largest source of error is the misclassification between the Trees class and the other classes: 18.8% of the Trees reference pixels are classified as other classes, and 19.6% of the other classes' pixels are classified as Trees. Due to the inherent similarity in material structure between the Scrub Shrub/Herbaceous (low vegetation) and Trees classes, there is significant misclassification between them: 11.4% of the Trees points are classified as Scrub Shrub/Herbaceous and 40.78% of the Scrub Shrub/Herbaceous points are classified as Trees. A similar error can be seen between the Sand/Soil/Gravel and Riverine Substrate classes. The MF-based SVM density estimation algorithm outperforms the other algorithms, as is clear from Table 12.

TABLE 11: Classification confusion matrix for the hyperspectral 58-band urban area using the MF-based SVM estimator.

Class      Total   Sha-  Wa-    Trees  Buil-  Asp-   Scrub  Sand   Wet-   Sub-    % True
           Points  dow   ter           dings  halt                 land   strate
Shadow     1189    657   0      355    8      0      160    9      0      0       55.26
Water      7382    6     6930   73     6      13     2      194    0      158     93.88
Trees      91883   190   86     74607  1125   436    10477  4670   30     262     81.20
Buildings  11655   30    17     1133   5615   905    1236   2573   0      146     48.18
Asphalt    9349    32    2      245    91     7369   142    1465   0      3       78.82
Scrub      30542   1     0      12458  725    98     16569  684    1      6       54.25
Sand       25741   9     71     2599   2431   2156   413    17965  1      96      69.79
Wetland    437     0     0      51     0      0      27     7      350    2       80.1
Substrate  1822    0     84     72     4      3      0      106    1      1552    85.18
% +ve true         71    96.38  82.12  56.12  67.11  57.03  64.89  51.47  69.56

The 95% confidence intervals in this experiment are: [0.659 0.663], [0.725 0.729], [0.696 0.7], and [0.807 0.811], which emphasizes the better performance of the proposed density estimator in hyperspectral spaces.


TABLE 12: Classification accuracy using different density estimators for the hyperspectral 58-band urban area.

Class      MLE    Parzen  KNN    Traditional  MF-based
                  Window  (k=3)  SVM          SVM
Shadow     87     69.13   97.9   80.82        55.26
Water      96     91.5    90.4   91           93.88
Trees      69.6   86.12   63.8   65.47        81.2
Buildings  51.3   39.76   46.7   48           48.18
Asphalt    52.9   70.8    84.9   81.42        78.82
Scrub      61     34.88   69.6   61.52        54.25
Sand       60.6   65.34   68.55  65.51        69.79
Wetland    82.38  2.29    90.85  89.47        80.1
Substrate  71.1   52.14   80.57  93.8         85.18
Average    66.2   70.2    67.02  65.99        73

D. Applications in the Class Prior Probability Estimation

The Bayesian classification setup requires estimating the class prior probability of each class defined in the image [2]. The Markov Random Field (MRF) is a natural choice for implementing the hidden model for the segmented regions, since the MRF provides a principled way to incorporate spatial correlations into a segmentation process. Refining the segmented image using an MRF model for the regions can be considered a refinement of the Class Prior Probability (CPP) [66]. The images (raw data) and segmented regions are specified with a joint Markov model that combines an unconditional model of interdependent region labels and a conditional model of independent image signals in each region [67-69]. The initial segmented image is then iteratively refined by using the MRF


model. In principle, the present work follows this conventional scheme, but in contrast to previous solutions [70], it focuses on the most accurate identification of this region model. The intra- and inter-region label co-occurrences are specified by an MRF model over the nearest neighbors of each pixel. Under the assumed symmetric relationships between neighboring labels, the model resembles the conventional auto-binomial ones [71].

The present work assumes a Gaussian-shaped kernel as the potential function for each clique in the MRF model, which leads to an energy function formulated as a weighted sum of Gaussian kernels. The MF-based SVM, in a regression perspective, is then used to estimate the parameters of this energy function, rather than using (empirically) pre-defined values for these parameters [3]. The motivation behind this formulation is to design a complete classification framework in which the developed MF-based SVM algorithm is the main building block.

1. MRF Model

Definition 1: A clique C is a subset of S for which every pair of sites is a neighbor. Single pixels are also considered cliques. The set of all cliques on a grid is called C.

Definition 2: A random field X is an MRF with respect to the neighborhood system η = {η_s : s ∈ S} if and only if

• p(X = x) > 0 for all x ∈ Ω, where Ω is the set of all possible configurations on the given grid;

• p(X_s = x_s | X_{s|r} = x_{s|r}) = p(X_s = x_s | X_{∂s} = x_{∂s}), where s|r refers to all N² sites excluding site r, and ∂s refers to the neighborhood of site s.

Definition 3: X is a Gibbs random field (GRF) with respect to the neighborhood system η = {η_s : s ∈ S} if and only if

p(x) = \frac{1}{Z} e^{-E(x)}    (91)

83

Page 100:  · KERNEL METHODS FOR STATISTICAL LEARNING IN COMPUTER VISION AND PATTERN RECOGNITION APPLICATIONS By Refaat Mokhtar Mohamed M.Sc., EE, Assiut University, Egypt, 2001 B.Sc., EE,

where Z is a normalizing constant called the partition function and E(x) is the energy function of the form:

E(x) = \sum_{c \in C} V_c(x)    (92)

where V_c is called the potential; it is a function of the cliques around the site under consideration. Only cliques of size 2 are involved in a pairwise interaction model. The energy function for a pairwise interaction model can be written in the form [72]:

E(x) = \sum_{t=1}^{N^2} G(x_t) + \sum_{t=1}^{N^2} \sum_{r=1}^{m} H(x_t, x_{t:+r})    (93)

where G is the potential function for single-pixel cliques and H is the potential function for all cliques of size 2. The parameter m depends on the size of the neighborhood around each site. For example, m is 2, 4, 6, 10, and 12 for neighborhoods of orders 1, 2, 3, 4, and 5, respectively. Numbering and order coding of the neighborhood up to order five is shown in Fig. 22, and Fig. 22(a) shows the location of site x_{t:+r} in the neighborhood system.

In this work, the following models are proposed for the G(·) and H(·) potential functions:

G(x_t) = \frac{w_0}{\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{\mu_{w_0} - I(x_t)}{\sigma}\right)^2}    (94)

H(x_t, x_{t:+r}) = \frac{w_r}{\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{\mu_{w_r} - I(x_t, x_{t:+r})}{\sigma}\right)^2}    (95)

where I(a, b) is the indicator function: I(a, b) = 1 if a = b and 0 otherwise; I(a) is always equal to 1.
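A numerical sketch of the potentials (94)-(95) and the pairwise energy (93) on a small label grid. The neighborhood is restricted to first order (right and down offsets, so m = 2), and the weights, means, and σ are illustrative values, not estimated parameters:

```python
import numpy as np

def G(x_t, w0, mu_w0, sigma):
    # Single-pixel potential (94); the indicator I(x_t) is always 1.
    return w0 / np.sqrt(2 * np.pi) * np.exp(-0.5 * ((mu_w0 - 1.0) / sigma) ** 2)

def H(x_t, x_r, w_r, mu_wr, sigma):
    # Pairwise potential (95); I(a, b) = 1 if a == b, else 0.
    ind = 1.0 if x_t == x_r else 0.0
    return w_r / np.sqrt(2 * np.pi) * np.exp(-0.5 * ((mu_wr - ind) / sigma) ** 2)

def pairwise_energy(labels, w, mu, sigma):
    """Energy (93) over a 2-D label grid with a first-order neighborhood:
    offset r = 1 is the right neighbor, r = 2 the one below (m = 2)."""
    offsets = [(0, 1), (1, 0)]
    e = sum(G(labels[i, j], w[0], mu[0], sigma)
            for i in range(labels.shape[0]) for j in range(labels.shape[1]))
    for r, (di, dj) in enumerate(offsets, start=1):
        for i in range(labels.shape[0] - di):
            for j in range(labels.shape[1] - dj):
                e += H(labels[i, j], labels[i + di, j + dj], w[r], mu[r], sigma)
    return e

labels = np.array([[0, 0], [0, 1]])
w, mu, sigma = [1.0, 1.0, 1.0], [1 / 21, 3 / 21, 5 / 21], 0.5  # illustrative
print(pairwise_energy(labels, w, mu, sigma))
```

With these small means, disagreeing neighbors contribute a larger potential than agreeing ones, so under (91) smoother label maps are more probable.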

The estimated mean values (µ_{w_r}) of the clique shapes of the second-order MRF (shown in Fig. 23) are shown in Table 13.


TABLE 13: The estimated means for 2nd order MRF cliques.

Parameter  µ_w0  µ_w1  µ_w2  µ_w3  µ_w4  µ_w5   µ_w6   µ_w7   µ_w8   µ_w9
Value      1/21  3/21  5/21  7/21  9/21  11/21  13/21  15/21  17/21  21/21

2. MRF Parameters Estimation Using SVM

Comparing the form of the potential function of the MRF model in (93), after substituting the assumed models in (94) and (95), with that of the SVM regression output in (18) shows that the SVM can be used for estimating the MRF parameters, provided that the SVM regression algorithm uses a Gaussian radial-basis kernel. In order to estimate the weights in the SVM regression representation, which correspond to the strengths of the cliques in the MRF representation, the joint histogram for all clique shapes in the given image is calculated. The MF-based SVM regression algorithm is then used to approximate (fit a regression to) the joint histogram by estimating the weights in (18). The experimental section includes an example showing how this estimation is done.
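The fitting step can be sketched with an ordinary least-squares fit of a weighted sum of Gaussian kernels to a histogram; this is a simplified stand-in for the MF-based SVM regression (which produces the same functional form but solves a different optimization), and the 1-D "histogram" below is synthetic:

```python
import numpy as np

# Synthetic stand-in for the joint histogram of clique shapes, normalized
# to [0, 1]: two Gaussian bumps, mimicking an empirical clique histogram.
x = np.linspace(0.0, 1.0, 200)
hist = 0.6 * np.exp(-0.5 * ((x - 0.3) / 0.08) ** 2) \
     + 0.4 * np.exp(-0.5 * ((x - 0.7) / 0.08) ** 2)

# Design matrix of Gaussian kernels on a fixed grid of centres. Solving
# the least-squares problem yields weights for a weighted sum of Gaussian
# kernels, the same functional form as the energy built from (94)-(95).
centers = np.linspace(0.0, 1.0, 21)
sigma = 0.08
Phi = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / sigma) ** 2)
weights, *_ = np.linalg.lstsq(Phi, hist, rcond=None)

fit = Phi @ weights
print("max fit error:", np.abs(fit - hist).max())
```

The fitted weights play the role of the clique strengths; in the dissertation's formulation the kernel centres are selected by the SVM machinery rather than fixed on a grid.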

3. Image Segmentation Algorithm

Typically in image segmentation, the segmented image obtained by initial pixel-wise classification is further refined by optimal statistical estimation of the MRF model of the segmented regions. The likelihood of the MAP image segmentation algorithm has the form (see [3]):

\Gamma(X, Y) = \frac{1}{|S|}\left(\log p(Y \mid X) + \log p(X)\right)    (96)

where S is the lattice representing the image, X is the segmented image (region map), and Y is the observed image. The first term in (96) is the likelihood for the conditional


distribution of an observed image (the low level process) given its segmented image (the high level process), i.e., the class conditional probability. The second term is the unconditional distribution of the segmented image, which can be considered a variation of the class prior probability. As stated before, the SVM density estimation algorithm is used to model the low level process. The high-level unconditional region map is modeled using the simple MRF model.

To make the search for a local maximum of the log-likelihood of (96) computationally feasible, a conventional iterative process of estimating and re-estimating the conditional image model is used (i.e., given a current region map, update the map model given

the image). The process terminates when the current and previously estimated model parameters coincide to within a given accuracy range [67, 68, 70]. The whole iterative segmentation process is summarized in the following algorithm.

Algorithm 2: Image Segmentation Algorithm Outline.

• Initialization: Find an initial map by classical pixel-wise Bayesian classification of the given image, after an initial estimation of the low level process Y using the MF-based SVM density estimator.

• Iterative refinement: Refine the initial map by iterating the following steps:

1. Estimate the MRF parameters using the MF-based SVM algorithm.

2. Refine the segmented image using the ICM algorithm [66].

3. Calculate the log-likelihood from (96) and terminate if there is no significant change in its value.

Because at each step the approximate log-likelihood is greater than or equal to its previous value, the proposed algorithm converges to a locally optimal solution. The experimental

section presents some experimental evolution of the log-likelihood values in (96) with the


iterations of the proposed segmentation algorithm.
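The iterative refinement can be sketched as a generic control loop. Every component here (initial Bayes classification, MRF parameter fit, ICM update, the likelihood (96)) is passed in as a placeholder callable, so this shows only the structure of Algorithm 2, not its actual components:

```python
def iterative_map_segmentation(image, init_classify, fit_mrf, icm_refine,
                               log_likelihood, tol=1e-4, max_iter=20):
    """Outline of Algorithm 2: initial Bayes map, then alternate
    MRF-parameter estimation and ICM refinement until (96) stabilizes."""
    region_map = init_classify(image)          # initial pixel-wise Bayes map
    prev_ll = log_likelihood(image, region_map)
    for _ in range(max_iter):
        mrf_params = fit_mrf(region_map)       # MF-based SVM fit in the thesis
        region_map = icm_refine(image, region_map, mrf_params)
        ll = log_likelihood(image, region_map)
        if abs(ll - prev_ll) < tol:            # no significant change: stop
            break
        prev_ll = ll
    return region_map
```

Since each iteration cannot decrease the approximate log-likelihood, the loop's stopping test on consecutive values is what makes the convergence to a local optimum observable in practice.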

4. Experiment on MRF Model Parameters Estimation

To evaluate the proposed algorithm for estimating the parameters of the MRF model using the MF-based SVM algorithm, a synthetic texture image is generated by the Metropolis algorithm [71], as shown in Fig. (24-a). Figure (24-b) shows the joint histogram for the ten clique shapes (shown in Fig. (23)) of the second order neighborhood system. Figure (24-d) shows the mixture-of-Gaussians distribution estimated using the SVM, which shows that the SVM manages to estimate optimal values for the clique strengths. Table 14 shows the estimated parameters for each component. Figure (24-c) shows the image regenerated using the estimated parameters shown in Table 14.

TABLE 14: ESTIMATED PARAMETERS FOR THE MIXTURE OF GAUSSIANS DISTRIBUTION.

Component Mean Weight Variance

1 1/21 0.1098 0.1592

2 3/21 0.1102 0.1592

3 5/21 0.1102 0.1592

4 7/21 0.1107 0.1592

5 9/21 0.1777 0.1592

6 11/21 0.0559 0.1592

7 13/21 0.0894 0.1592

8 15/21 0.0559 0.1592

9 17/21 0.0894 0.1592

10 21/21 0.0906 0.1592
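Evaluating the fitted density of Table 14 is a direct weighted sum of Gaussian components; a minimal sketch using the tabulated means (k/21), weights, and common variance (note that a valid mixture requires the weights to sum to approximately 1, which they do here):

```python
import math

# Mixture of Gaussians fitted to the cliques histogram (Table 14).
means   = [1/21, 3/21, 5/21, 7/21, 9/21, 11/21, 13/21, 15/21, 17/21, 21/21]
weights = [0.1098, 0.1102, 0.1102, 0.1107, 0.1777,
           0.0559, 0.0894, 0.0559, 0.0894, 0.0906]
var = 0.1592  # common variance of all components

def mixture_pdf(x):
    """p(x) = sum_k w_k * N(x; mu_k, var), a weighted sum of Gaussians."""
    return sum(w * math.exp(-(x - m) ** 2 / (2 * var))
               / math.sqrt(2 * math.pi * var)
               for w, m in zip(weights, means))

print(round(sum(weights), 3))  # → 1.0
```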


5. Experiments Using Remote Sensing Data

The following experiments illustrate the effectiveness of the MRF modeling and the proposed iterative segmentation setup in improving the segmentation results. Two hyperspectral data sets are used to illustrate the performance.

In the proposed segmentation setup, as shown in section V.D.3, an initial guess for the

segmented image is obtained by the classical Bayes classifier with the MF-based SVM density estimator. Then, the MRF parameters are calculated from the segmented image using

the MF-based SVM algorithm, the MAP segmentation is applied, and the log-likelihood from (96) is calculated. If there is a significant difference between consecutive values of the log-likelihood, the MRF model parameters are recalculated and the segmentation procedure is repeated. Otherwise, i.e., if there is no significant change in the log-likelihood values, the segmentation process is terminated.

For the 34-band data set, Fig. (25) shows the evolution of the log-likelihood. It

can be noted that the log-likelihood converges and starts to saturate without major changes

after 8 iterations in this experiment. The final segmented image in Fig. (26) and the confusion matrix in Table 15 illustrate the improvement effect of the CPP modeling on the segmentation results. The average class accuracy rate increases to 83.75% (while it is 80% without contextual modeling, see Table 10), and both the individual class accuracies and the confidence in the points assigned to each class increase for most of the classes.

For the 58-band data set, Fig. (27) shows the evolution of the log-likelihood. It is clear that

the log-likelihood converges and starts to saturate without major changes after 6 iterations,

which means that a maximum estimate for the segmented image is obtained. The final seg-

mented image in Fig. (28) and classification results in Table (16) illustrate the improvement

effect of MRF modeling on segmentation results. The average class accuracy rate increases

to 83.38% (while it is 73% without contextual modeling, see Table 12) and the individual

class accuracies for most classes increase too.


TABLE 15: Classification confusion matrix for the hyperspectral urban area after applying the MRF modeling.

Class             Total Points  Agricultural  Coniferous  Herbaceous  Other Impervious  Roads  Soil   Water  % True
Agricultural       5138          4177           209          508         142              10     90      2    81.3
Coniferous        15182             0         13270         1883          21               0      0      8    87.4
Herbaceous         7481             8           939         6124         196             155     23     36    82.86
Other Impervious    925             1            64          152         663              41      3      1    71.68
Roads               627             0            12           51          33             529      2      0    84.37
Soil               6362            92             4          448         309               1   5450     58    85.66
Water              4285             3           419          265         204               0    107   3287    76.7
% +ve true                      97.57         88.95        64.93       42.28           71.88  96.03   96.9

E. Conclusion

This chapter presented several applications of the proposed statistical learning based approaches for regression and density estimation. Remote sensing data sets in multispectral and hyperspectral spaces are used in these applications. The experiments on density estimation use the classification accuracy in Bayes setups as an indication of the performance of the density estimation approach. The class prior probability in Bayes classification is modeled using MRF models, where the MF-based SVM algorithm is used to estimate the model parameters.


TABLE 16: Classification accuracy after applying MRF modeling for the 58-band hyperspectral data set.

Class % Accuracy

Shadow 55.26

Water 96.95

Trees 91

Buildings 55.73

Asphalt 82.18

Scrub 52

Sand 70.42

Wetland 88.33

Substrate 87.82

Average 83.38



FIGURE 21: A hyperspectral 58-band urban area: (a) RGB snapshot, and color-coded classification results using: (b) SVM, and (c) MF-SVM as a density estimator.


FIGURE 22: Numbering and order coding of the neighborhood structure.

FIGURE 23: Clique shapes (γ0, γ1, . . . , γ9) of the second-order MRF model.


FIGURE 24: A texture image for the MRF model parameters estimation experiment: (a) original image generated by the Metropolis algorithm, (b) histogram of the MRF model clique shapes of the original image (average number of occurrences of each clique shape), (c) regenerated image using the parameters estimated by the MF-based SVM algorithm, and (d) the estimated mixture of Gaussians fitted to the cliques histogram.


FIGURE 25: Evolution of the log-likelihood versus iteration in the hyperspectral 34-band example.

FIGURE 26: The final segmented image with the proposed segmentation setup for the

hyperspectral 34-band area.


FIGURE 27: Evolution of the log-likelihood versus iteration in the hyperspectral 58-band example.

FIGURE 28: The final segmented image for the 58-band data set.


CHAPTER VI

STATISTICAL LEARNING FOR CHANGE DETECTION

Change detection in images finds many applications in city planning, monitoring,

and security assessments. This chapter introduces statistical learning as a tool for detecting

changes in images. It proposes a new approach and evaluates this approach with remote

sensing data sets.

A. Problem Statement

The problem of change detection in images can be stated formally as follows. Given an image I, find the changes that happened in that image with respect to a reference image Ir. The current research focuses on the labeling or class-assignment study; thus, "changes" means non-matching labels. In turn, the change detection problem can be stated as finding the pixel set (the changes set CS) from I where the pixel labels of that set are different from the labels of their counterparts in Ir:

CS = {(i, j) ∈ I : L(i, j) ≠ L(m, n)}, where (m, n) ∈ Ir corresponds to (i, j), (97)

where L(i, j) denotes the label of the pixel (i, j).
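For perfectly co-registered images, where pixel (i, j) in I corresponds to the same (i, j) in Ir (a simplifying assumption made here for illustration), Eq. (97) reduces to a direct comparison of label maps; the label arrays below are made up:

```python
def change_set(labels, ref_labels):
    """Return CS: coordinates whose label differs from the reference map."""
    return {(i, j)
            for i, row in enumerate(labels)
            for j, lab in enumerate(row)
            if lab != ref_labels[i][j]}

I_labels  = [[0, 0, 1],
             [2, 1, 1]]
Ir_labels = [[0, 1, 1],
             [2, 1, 0]]
print(sorted(change_set(I_labels, Ir_labels)))  # → [(0, 1), (1, 2)]
```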

B. Literature Review

Automatic change detection is an active area of research in a number of important

applications, ranging from automatic video surveillance [73, 74] to video coding [75, 76],

tracking of moving objects [77, 78], and motion estimation [79, 80]. The increasing interest in environmental protection and homeland security has led to the recognition of the fundamental role played by change-detection techniques in monitoring the Earth's surface [73, 81–85]. Applications include, among others, damage assessment from natural hazards (floods, forest fires, hurricanes) or large-scale accidents (e.g., oil spills), as well as keeping watch on unusual and suspicious activities in or around strategic infrastructural elements (dams, waterways, reservoirs, power stations, nuclear and chemical enterprises, camping in unusual places, improvised landing strips, etc.). Most of the well-known supervised and unsupervised methods for detecting changes in remotely sensed images [81, 83–87] sequentially perform image preprocessing, image comparison to get a "difference"

image, and analysis of the difference image.

Preprocessing: In the unsupervised change detection case, the two images are made comparable in both the spatial and spectral domains.

The most critical step is to co-register the images with a sub-pixel accuracy so that corre-

sponding pixels within the images relate to the same ground area. Inaccurate co-registration

makes change detection unreliable [82], so special techniques are used to reduce the registration errors [81, 88–92]. Regarding the spectral domain, potential error sources, such as different illumination and atmospheric conditions at the two acquisition times, should be accounted for to obtain accurate results [93–95]. Depending on the application and

the available data, the problem is typically solved by absolute or relative radiometric image

calibration. One of the radiometric calibration algorithms proposed in [96–98] transforms

gray values in each image into the ground reflectance values, while the other algorithm

modifies histograms of the gray values to make the same gray values in the two images rep-

resent the same, but unknown reflectance [84, 93, 98]. Generally, illumination conditions in

remote sensing applications vary smoothly over each image. Therefore, in many cases an

original scene can be readily divided into Areas of Interest (AOIs) with assumed constant illumination conditions. Separate analysis of each AOI suppresses main effects of varying illumination on the change detection process [79].

Image comparison: The co-registered and

radiometrically corrected images (or linear/nonlinear combinations of their spectral signatures [84]) are compared, pixel by pixel, in order to generate a "difference image" such that

the land-cover changes differ considerably in gray levels from the unchanged areas [84].

For example, the univariate image differencing (UID) [83, 84] performs pixel-wise sub-

traction of a single spectral band from the two images. The choice of the band depends

on the specific type of changes to be detected. The widely used Change Vector Analy-

sis (CVA) [94] forms differences for several spectral bands in each image (i.e., spectral

change vectors), and the difference image contains the magnitudes of these vectors.

Analysis of the difference image: Land-cover changes are usually detected by thresholding the signal histogram of the difference image. The threshold selection is of major importance for

the accurate change detection. Although some automatic choices of thresholds have been

proposed [99], remote sensing applications generally use non-automatic heuristic trial-and-

error strategies [81, 84, 100]. The classical choice of the threshold is based on a reasonable,

but not always verified assumption that only few changes have occurred between the two

observation dates. The changes are then represented by outliers of the marginal probability

distribution for the difference signals that mainly describe the unchanged pixels. Under this

assumption, a single-hypothesis testing based decision strategy [101] labels the signals that

are significantly different from the mean value as changes. The decision threshold is fixed at tσ from the mean difference value, where σ is the standard deviation of the signal in the difference image and t is set by a trial-and-error procedure. The effect of the value of t on the

change detection accuracy is experimentally evaluated in [102]. Two Bayesian techniques

for automatic selection of the decision threshold in [103] minimize the total detection error

assuming the spatially independent pixels or a Markov Random Field (MRF) of the pixels

in the difference image, respectively. The MRF model uses the pixel spatial dependency

to improve the change detection. This approach has been extended in [104] using a semi-

parametric reduced Parzen model of probability distributions associated with changed and

unchanged pixels. In [105], the observed multi-temporal images are modeled by MRFs in


order to search for optimal changes under the maximum a posteriori (MAP) decision crite-

rion using the simulated annealing based energy minimization. Bernstein [106] studied the

change detection in relation to homeland security applications using the archived Landsat-5

images of the Portsmouth, Ohio Gaseous Diffusion Plant (OGDP) to determine capabili-

ties and limitations of long wavelength IR imagery in monitoring large nuclear enrichment

plants. This type of imagery was helpful in detecting large-scale changes in the OGDP's

operational status (e.g. the shut-down of a single process building could be detected by

comparing the rooftop temperatures of the neighboring operational process buildings).

C. Proposed Change Detection Approach

The proposed approach for change detection starts with creating probabilistic shape models from the classes defined in the reference image Ir. The shape modeling is done with a new algorithm which uses distance-based shape descriptors and probability density estimators based on the proposed MF-based SVM approach. A Bayesian statistical analysis approach uses these shape models in a MAP classification setup to detect pixels in I whose labels differ from their counterparts in Ir. The proposed approach differs from many other familiar ones in that the changes are derived from classification maps for the reference image, rather than from the image I itself. This allows for using not only the pixel-wise signatures but also prior knowledge of the shapes of the objects being monitored. Also, the proposed approach differs from other algorithms in that it detects changes within the classification step itself rather than carrying out two consecutive steps: classification, then change detection. The main components of the proposed approach are statistical shape modeling and statistical analysis. The details of these components are presented in the following sections.


D. Statistical Shape Modeling

Shape representation is the main task in the analysis of shapes. The selection of such a representation is very important in several computer vision and medical applications such as registration and segmentation. Several approaches to shape representation are described in [107, 108]. Although some of these approaches are powerful enough to capture local deformations, they require a large number of parameters to deal with important shape details, and problems arise when the topology of the shapes changes. In order to obtain a shape model that realistically describes an object, a statistical shape representation approach is proposed in this work, outlined in the following algorithm. The approach assumes that there are multiple data sets (e.g., images) which describe the same shape. Like most shape modeling approaches, the proposed approach starts with aligning the different data sets together using a registration algorithm (one such algorithm can be found in [109]). The edges of each of the shape regions, Vi, in the different data sets are determined, and the average 2D edge, Vm, for each region is calculated (see [108]).

The contribution of the proposed approach is to introduce the signed distance concept [110] in constructing a probabilistic map, the signed distance map (SD-Map), for a data set that contains the object shape. The SD-Map is a representation of the relative positions of the different points in the data set with respect to the shape points: the points that belong to the shape boundary in the reference data set (the data set that contains the reference shape).

In the case of images, the SD-Map is an image where the absolute value at a certain pixel is the shortest Euclidean distance between the spatial position of that pixel and the average 2D edges, Vm, of the object shape. By convention, the sign of a pixel is determined by whether that pixel lies inside (positive) or outside (negative) the boundary of one of the shape regions. This representation of the signed distance map enables capturing the object shape with two interesting features: (1) the sign at a pixel determines whether


Algorithm 3 Outlines of the Statistical Shape Modeling Approach.

• Align the collected data sets together using a registration approach.

• Calculate the 2D edge, Vi, that describes the boundary of a region from the object shape in data set i; i = 1 · · · N, of the N data sets in the aligned database for that shape.

• Calculate the average 2D edge Vm for each region, i.e., Vm = (1/N) Σ_{i=1}^{N} Vi.

• Given a 2D shape boundary V (which is a collection of the average shape regions Vm, i.e., V = ∪_{m=1}^{M} Vm, where the shape is constructed from M objects), the function S(i, j), which describes the distribution of the signed distance map inside and outside the shape V, is defined as follows:

S(i, j) = 0 if (i, j) ∈ V; d((i, j), V) if (i, j) ∈ R_V; −d((i, j), V) otherwise, (98)

where R_V is the space of the points which lie inside the region and d((i, j), V) is the minimum Euclidean distance between the data set location (i, j) and the curve V.

Note: The proposed approach assumes that the shape is represented in a 2D Euclidean space.

that pixel lies inside or outside the object shape, and (2) the absolute value at a pixel, which varies with the relative position of the pixel to the shape boundary, provides a probabilistic representation of the object shape (see Fig. (30) for a quick illustration of SD-Maps).
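Equation (98) can be sketched for a small binary mask with a brute-force distance computation; a real implementation would use a fast distance transform, and the mask below is purely illustrative:

```python
import math

def signed_distance_map(mask):
    """SD-Map per Eq. (98): zero on the boundary V, positive inside the
    region R_V, negative outside the shape."""
    h, w = len(mask), len(mask[0])
    def neighbors(i, j):
        return [(i + di, j + dj) for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1))]
    # Boundary V: inside pixels with at least one outside (or off-image) neighbor.
    V = [(i, j) for i in range(h) for j in range(w) if mask[i][j]
         and any(not (0 <= a < h and 0 <= b < w and mask[a][b])
                 for a, b in neighbors(i, j))]
    sd = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            d = min(math.hypot(i - a, j - b) for a, b in V)
            if (i, j) in V:
                sd[i][j] = 0.0
            elif mask[i][j]:       # strictly inside the region R_V
                sd[i][j] = d
            else:                  # outside the shape
                sd[i][j] = -d
    return sd

mask = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
sd = signed_distance_map(mask)
```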

Either the shape internal points (points which lie inside the shape) or the shape

external points (points which lie outside the shape) are enough to construct a model for


that shape. In the proposed approach, the shape internal points (the positive points in the SD-Map) are used to model the shape. As stated above, these points have a probabilistic distribution that calls for probabilistic shape modeling, which is another contribution of the proposed approach. The MF-based SVM density estimator has proven itself to be an accurate density estimation algorithm, and thus it is used to model the shape of each class in an image.

One of the most powerful features of the proposed approach is that it allows modeling objects (classes) composed of multiple regions in the image, i.e., the object shape can have multiple disjoint regions. Also, the shape representation in this approach is invariant to translation and rotation. Further, to make the representation invariant to scaling, a registration approach is used.

E. Change Detection Algorithm

The proposed algorithm relies on Bayes theory for the analysis of the shape and sensor-reading densities, which are estimated using the MF-based SVM density estimator. Incorporating the shape information with the sensor data provides strong evidence for a change condition. If the combined shape and sensor information does not provide evidence for a change, the approach suspects the shape information and falls back on the sensor information alone. This is because the shape information can sometimes become strong enough to mask the sensor information (because of the relative closeness of the pixel location to the class shape boundaries). In such a case, the algorithm favors the stability of the sensor data over time. The steps of the algorithm are as follows:

F. Discussion of Some Change Detection Methods

This section presents a brief discussion of two state-of-the-art methods that are used in


Algorithm 4 Outlines of the Change Detection Algorithm.

• Generating shape information: from the reference image Ir, generate the signed distance map for each class.

• Statistical modeling of the shape information: use the proposed MF-based SVM algorithm to model the shape information of each class.

• Pixel classification using both shape information and sensor data: for a pixel p:

1. Use the signed distance d between the pixel location and each class average shape to get ps(d | m) for m = 1 · · · M, where M is the number of defined classes.

2. Use the sensor reading y to get the class conditional probability p(y | m).

3. Get a primary labeling of the pixel as: m∗(p) = arg max_m ps(d | m) p(y | m).

• Change detection at the pixel p: report a change at p if:

1. m∗(p) ≠ mr(p), where mr(p) is the class of pixel p in the reference image.

2. If m∗(p) = mr(p), there is still a chance of a change, according to the following steps:

– Get the primary class of the pixel using only the sensor reading: m∗(p) = arg max_m p(y | m).

– There is a change if m∗(p) ≠ mr(p) and |p(y | m) − pr(y | m)| > T, where T is a threshold.

the literature for change detection. These methods are used in the experimental work for

comparison.
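The decision logic of Algorithm 4 can be sketched as follows; the density values are illustrative stand-ins for the MF-based SVM estimates, and T is the threshold from the algorithm's last step:

```python
def detect_change(p_shape, p_sensor, p_sensor_ref, ref_label, T=0.2):
    """Report a change at a pixel per Algorithm 4. The dictionaries map
    class labels to (hypothetical) density values ps(d|m), p(y|m), pr(y|m)."""
    classes = list(p_sensor)
    # Primary label from the combined shape and sensor evidence.
    m_star = max(classes, key=lambda m: p_shape[m] * p_sensor[m])
    if m_star != ref_label:
        return True
    # The shape prior may mask the sensor evidence: retry with the sensor alone.
    m_sensor = max(classes, key=lambda m: p_sensor[m])
    return (m_sensor != ref_label
            and abs(p_sensor[m_sensor] - p_sensor_ref[m_sensor]) > T)

# A strong shape prior keeps the combined label at 'water', but the sensor
# evidence alone now favors 'urban' by a wide margin: a change is reported.
p_shape      = {"water": 0.9, "urban": 0.1}
p_sensor     = {"water": 0.3, "urban": 0.6}
p_sensor_ref = {"water": 0.8, "urban": 0.1}
print(detect_change(p_shape, p_sensor, p_sensor_ref, "water"))  # → True
```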


1. Change Detection using Automatic Analysis of the Difference Image and the EM Algorithm (DIEM)

This approach [103, 104] is based on the formulation of the unsupervised change-detection problem in terms of Bayesian decision theory. In this context, an adaptive technique for the estimation of the statistical terms associated with the gray levels

of changed/unchanged pixels in a difference image is considered. This approach deals with

the widely used type of unsupervised techniques that perform change detection through a

direct comparison of the original raw images acquired in the same area at two different

times. The change-detection process performed by such unsupervised techniques is usu-

ally divided into three main sequential steps: 1) pre-processing, 2) image comparison and

3) analysis of the difference image. These steps are briefly detailed in the following.

• Preprocessing: Unsupervised change-detection algorithms usually take two digi-

tized images as input and return the locations where differences between the two

images can be identified. To accomplish such a task, a preprocessing step is necessary, aimed at rendering the two images comparable in both the spatial and spectral

domains. Concerning the spatial domain, the two images should be co-registered so

that pixels with the same coordinates in the images may be associated with the same

area on the ground. This is a very critical step, which, if inaccurately performed, may

render change-detection results unreliable [82].

With regard to the spectral domain, changes in illumination and atmospheric con-

ditions between the two acquisition times may be a potential source of errors and

should be taken into account in order to obtain accurate results [93, 94].

• Image Comparison: The two registered and corrected images (or a linear or non-

linear combination of the spectral bands of such images [84]) are compared, pixel

by pixel, in order to generate a further image (“difference image”). The difference


image is computed in such a way that pixels associated with land-cover changes

present gray level values significantly different from those of pixels associated with

unchanged areas. For example, the widely used Change Vector Analysis (CVA) tech-

nique is used to generate the difference image in remote sensing images. In this case,

several spectral channels are considered at each date (i.e., each pixel of the image

considered is represented by a vector whose components are the gray level values

associated with that pixel in the different spectral channels selected). Then, for each

pair of corresponding pixels, the so-called “spectral change vector” is computed as

the difference in the feature vectors at the two times. At this point, the pixels in the

difference image are associated with the magnitudes of the spectral change vectors;

it follows that unchanged pixels present small gray-level values, whereas changed

pixels present rather large values.

• Analysis of the Difference Image: Land-cover changes can be detected by applying

a decision threshold to the histogram of the difference image. For instance, when the

CVA technique is used (i.e., each pixel in the difference image is associated with the

magnitude of the difference between the corresponding feature vectors in the original

images), changed pixels can be identified on the right side of the histogram as they

are associated with large gray-level values. The selection of the decision threshold

is of major importance as the accuracy of the final change-detection map strongly

depends on this choice.

The approach in [104] is based on the assumption that the histogram of the dif-

ference image can be modeled as a mixture density composed of the distributions of two

classes associated with changed and unchanged pixels, respectively. In this context, the

considered approach uses the EM algorithm for the estimation of the conditional density functions of these classes. The parameters estimated by EM are then used for pixel-wise classification of the difference image into change/no-change classes.
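The core of the DIEM approach, EM estimation of a two-component Gaussian mixture (unchanged vs. changed pixels) over the difference-image values, can be sketched as follows; the synthetic data stands in for real CVA magnitudes:

```python
import math, random

def em_two_gaussians(x, iters=100):
    """Fit a 2-component Gaussian mixture to 1-D data with plain EM."""
    n = len(x)
    mu = [min(x), max(x)]          # crude but effective initialization
    var = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each sample.
        r = []
        for xi in x:
            p = [w[k] * math.exp(-(xi - mu[k]) ** 2 / (2 * var[k]))
                 / math.sqrt(2 * math.pi * var[k]) for k in range(2)]
            s = sum(p)
            r.append([pk / s for pk in p])
        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = sum(ri[k] for ri in r)
            w[k] = nk / n
            mu[k] = sum(ri[k] * xi for ri, xi in zip(r, x)) / nk
            var[k] = sum(ri[k] * (xi - mu[k]) ** 2
                         for ri, xi in zip(r, x)) / nk + 1e-6
    return w, mu, var

random.seed(0)
# Synthetic difference magnitudes: mostly unchanged (near 0.5), few changed (near 4).
data = [random.gauss(0.5, 0.2) for _ in range(300)] + \
       [random.gauss(4.0, 0.3) for _ in range(30)]
w, mu, var = em_two_gaussians(data)
```

A pixel is then assigned to the change class whenever the changed component's weighted density exceeds the unchanged one at that pixel's value.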


2. Change Detection using MRF Modeling (DIMRF)

The above-described technique for change detection only considers information

contained within a pixel, even though intensity levels of neighboring pixels of images are

known to have significant correlation. Also, changes are more likely to occur in connected

regions rather than at disjoint points. By using these facts, a more reliable change detection

algorithm can be developed. To accomplish this, a Markov random field (MRF) model for

images is employed in the presented approach [103–105] so that statistical correlation of

intensity levels among neighboring pixels can be exploited.

For estimating the MRF parameters, the method in [104] uses a heuristic procedure, which is not reliable, while the method in [105] uses simulated annealing, which is known to be slow. This dissertation proposes a numerical approach for estimating the MRF parameters (see Chapter 3).

G. Experimental Work

This section presents experimental work on the proposed approach for change detection. The experiments are performed on remote sensing data. They first illustrate the performance of the proposed statistical shape modeling approach and present an application of it to remote sensing imagery segmentation. The change detection approach is evaluated next.

1. Experiments on the proposed shape modeling approach

Figure (30-a) shows an RGB snapshot of the remote sensing LANDSAT 7 data set which is used in the experiments. This data set was collected during November 2002 over Cairo, Egypt. It contains 15-meter panchromatic data, 30-meter 6-band multispectral data, and 60-meter 2-band thermal data. The size of the used scene is 250x250 pixels in the 6-band multispectral case, cropped from the 7771x8664 data set. There are 7 classes defined in this data set: Water, Open, Deciduous, Low Density Residential, High Density Residential, Urban, and Transportation. The 30-meter ground cover is available (see Fig. (30-b)), so the multispectral data is used as the reference set. The classification of this data set is challenging because of the irregularity and the scattered nature of most of the classes in the scene (see, for example, the Deciduous, Low Density Residential, and Urban classes in the reference classified image in Fig. (30-b)).

2. Experiments on Statistical Shape Modeling Using the MF-based SVM

The MF-based SVM density estimator is used in this section for statistical modeling of the probabilistic distribution of the class shapes in the classified reference image as prior information. The approach outlined in Algorithm-3 is used to collect the data points of interest for modeling the shape of each class defined in the image. A subset of this training sample (50 points in the current implementation) is used to train the MF-based SVM density estimator algorithm.

Figure (30) shows samples of the estimated pdf's of the class shapes and illustrates how accurately the MF-based SVM algorithm captures a shape pdf. The figure shows the empirical density (histogram) of the class shape and the estimated density using the MF-based SVM algorithm.

3. Experiments on the Segmentation Algorithm

To illustrate the performance of the proposed statistical shape modeling in a real application, a classification algorithm [111] is used for multispectral data segmentation.

a. Image Segmentation Algorithm

This section summarizes the steps used to incorporate the proposed shape modeling approach in image segmentation. The procedure used here is basically a Bayes classification rule based on shape priors. Let xi represent a class label in the image, where i ∈ [1, K] and K is the total number of classes (objects) in the image. The goal of a pixel-wise image segmentation algorithm is to determine the class to which a feature vector y belongs, i.e., the class index of the pixel which has the feature vector y. Bayes rule formulates this goal as (see [3]):

y ∈ x_i if p(y | x_i) p(x_i) > p(y | x_j) p(x_j), ∀ j ≠ i; i, j ∈ [1, K]   (99)

The term p(y | x_i) is the conditional probability of the pixel value y given that its

class label is x_i (the class conditional probability). The second term, p(x_i), represents the class

prior probability. Since the class prior probability represents the prior belief that the pixel

belongs to class x_i, the proposed statistical shape modeling is used to substitute for this

belief. Thus the modified Bayes rule becomes:

y ∈ x_i if p(y | x_i) p(S | x_i) > p(y | x_j) p(S | x_j), ∀ j ≠ i; i, j ∈ [1, K].   (100)

where S is the signed distance at the pixel which has the feature vector y (see Algorithm-

3). The MF-based SVM density estimator is used to implement the class conditional

probability of that rule, since this estimator has proven to be of special interest in high-

dimensional density estimation problems [112]. It is also used to implement the statistical

shape density function term. The whole segmentation process is summarized in Algorithm-5.
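As a minimal numeric illustration of rule (100), with placeholder density values rather than outputs of the MF-based SVM estimator:

```python
import numpy as np

# Hypothetical densities for a single pixel over K = 3 classes:
# p(y | x_i): class conditional density of the feature vector y,
# p(S | x_i): shape density of the signed distance S under class x_i.
p_y_given_x = np.array([0.20, 0.55, 0.10])
p_S_given_x = np.array([0.40, 0.05, 0.30])

# Rule (100): pick the class maximizing p(y | x_i) * p(S | x_i).
scores = p_y_given_x * p_S_given_x   # [0.08, 0.0275, 0.03]
label = int(np.argmax(scores))       # class 0 wins, although p(y | x_1) alone is largest
```

Note how the shape prior can overrule purely spectral evidence, which is exactly the effect the shape constraint is meant to have.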

b. Results

Since the ground truth of the classified image is available, a subset from each class is used

to train the MF-based SVM density estimation algorithm, and the rest is used for evaluation.

The RGB images in Fig. (31) show the segmentation results.


FIGURE 29 – The RGB and the reference classified image of Cairo data set.


[Plots of the empirical (histogram) and estimated pdf's against the signed distance for each class.]

(a) Water class (b) Transportation class

FIGURE 30 – Samples of shape modeling for two classes of the Cairo data set: class points, signed distance map, and shape model density function for (a) the Water class, and (b) the Transportation class


Algorithm 5 Outline of the Image Segmentation Algorithm.

• Pre-Segmentation:

  Register the input and the reference images.

• Training:

  For each defined class in the image:

  – train the MF-based SVM density estimator on the prior information about that class to get its statistical shape model.

  – train the MF-based SVM density estimator on the class data to get its class conditional probability model.

• Segmentation:

  For each pixel in the image:

  – calculate the class conditional probabilities and the corresponding shape probabilities.

  – classify the pixel using the modified Bayes rule (100).
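The training and segmentation steps of Algorithm-5 can be sketched compactly. This is a hedged sketch: a simple Parzen-window estimator stands in for the MF-based SVM density estimator, the features are toy 1-D values, and all data are synthetic:

```python
import numpy as np

def parzen_pdf(train, width=0.5):
    """Return a 1-D Parzen-window density estimate built on `train`;
    a placeholder for the MF-based SVM density estimator of Chapter 3."""
    def pdf(x):
        d = (np.asarray(x, dtype=float) - train[:, None]) / width
        return np.exp(-0.5 * d ** 2).mean(axis=0) / (width * np.sqrt(2 * np.pi))
    return pdf

rng = np.random.default_rng(1)
K = 2  # two classes for the sketch

# Training step: per class, one class conditional model and one shape model.
cond = [parzen_pdf(rng.normal(m, 1.0, 50)) for m in (0.0, 4.0)]
shape = [parzen_pdf(rng.normal(m, 1.0, 50)) for m in (1.0, -1.0)]

def segment_pixel(y, S):
    """Segmentation step: classify one pixel by the modified Bayes rule (100)."""
    scores = [cond[i]([y])[0] * shape[i]([S])[0] for i in range(K)]
    return int(np.argmax(scores))

label = segment_pixel(y=3.8, S=-0.9)  # spectral and shape evidence both favor class 1
```

On a real image, `segment_pixel` would be applied to every pixel's feature vector together with its signed distance from the registered reference.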

Table 17 shows the classification confusion matrix of the data set using a classical

Bayes classifier with the MF-based SVM density estimator. The class priors are taken

as the shares of the class points in the data set. The results illustrate how challenging the

data set is. The average class classification accuracy is 54.26%. The Water class has the highest

class classification accuracy (76.52%), while the Transportation class has the lowest class

classification accuracy (14.08%). The trust in the points assigned to a specific class (class

reliability) is also small: the Low Density Residential class has the lowest reliability rate,

9.48%, while the Water class has the highest reliability, 84%.


(a) Classified without shape constraints (b) Classified with shape constraints

FIGURE 31 – Classification results of Cairo data set.

TABLE 17
CLASSIFICATION CONFUSION MATRIX FOR THE MULTISPECTRAL DATA SET WITHOUT USING SHAPE MODELING

Class        Total Pts   Water    Open  Decid.  L Den.  H Den.  Urban  Transp.  % True
Water             5613    4295     113     927      48      53    158       19   76.52
Open             25313     342   15225    1473    3807    1275   2736      455   60.15
Deciduous         2252     288     677     684     235     120    192       56   30.37
L. Density        2141      12     554      99     614     367    335      160   28.68
H. Density       18940     140     472     232    1725    9726   5313     1332   51.35
Urban             5571      29      77     144     319    1590   2992      420   53.71
Transport.        2670       4     275      76     341     790    808      376   14.08
% +ve Rate                  84    87.5   23.17    9.48   72.49  69.86    13.34   54.26
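The figures quoted in the text follow directly from the rows and columns of Table 17. A sketch of the computation: per-class accuracy from the rows, reliability from the columns, and the 54.26% figure as the overall, point-weighted accuracy:

```python
import numpy as np

# Counts from Table 17: row = true class, column = assigned class.
cm = np.array([
    [4295,   113,  927,   48,    53,  158,   19],   # Water
    [ 342, 15225, 1473, 3807,  1275, 2736,  455],   # Open
    [ 288,   677,  684,  235,   120,  192,   56],   # Deciduous
    [  12,   554,   99,  614,   367,  335,  160],   # L. Density
    [ 140,   472,  232, 1725,  9726, 5313, 1332],   # H. Density
    [  29,    77,  144,  319,  1590, 2992,  420],   # Urban
    [   4,   275,   76,  341,   790,  808,  376],   # Transport.
])

accuracy = 100 * cm.diagonal() / cm.sum(axis=1)     # the "% True" column
reliability = 100 * cm.diagonal() / cm.sum(axis=0)  # the "% +ve Rate" row
overall = 100 * cm.diagonal().sum() / cm.sum()      # 54.26% quoted above
```

For example, `accuracy[0]` recovers the 76.52% Water accuracy (4295/5613) and `reliability[0]` the 84% Water reliability.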

Table 18 shows the classification confusion matrix using the Bayes classifier with

the statistical shape modeling applied. The results illustrate how much improvement

can be achieved using shape modeling. The average class classification accuracy


TABLE 18
CLASSIFICATION CONFUSION MATRIX FOR THE MULTISPECTRAL DATA SET USING SHAPE MODELING

Class        Total Pts   Water    Open  Decid.  L Den.  H Den.  Urban  Transp.  % True
Water             5613    4986     142     330      27      54     68        6   88.83
Open             25313     239   22345     739    1059     592    181      158   88.27
Deciduous         2252      59      83    1990      37      38     36        9   88.37
L. Density        2141       4      59       7    2009      21     29       12   93.83
H. Density       18940     115     400      79     123   17418    359      446   91.96
Urban             5571      22      46      30      42     219   5145       67   92.35
Transport.        2670       5      89       9      56       3     57     2451   91.80
% +ve Rate               91.82   96.46    62.5   59.91   94.94  87.57    77.83   90.15

increases to 90.15%. The Low Density Residential class has the highest class classification accuracy (93.83%)

while the Open class has the lowest class classification accuracy (88.27%). The trust in the

points assigned to a specific class (class reliability) increases too: the Open class has the

highest reliability rate, 96.46%, while the Transportation class has the lowest reliability,

77.83%. Figure (31) illustrates the improvement from applying the shape constraint on the

segmented image: the result in Fig. (31-b) is very close to the reference classified image in Fig. (29-b).

4. Experiments on Data Sets with Different Resolutions

To assess the overall segmentation setup and illustrate the effect of the registration

step, the panchromatic 15-meter resolution and the thermal 60-meter resolution data

sets are used. The images are first co-registered to the multispectral (6-band) data set using

the proposed MI algorithm; the registered data is then used for classification. Sample

results are shown in Fig. (32) and Fig. (33), while Tables 19 and 20 show the results of applying


(a) (b) (c) (d)

FIGURE 32: Results for the 15-meter resolution data set: (a) Original, (b) Registration

results, (c) Classification results, and (d) Classification results after inverse transformation.

TABLE 19: Comparison of classification accuracies for the 15-meter resolution data set

using different algorithms.

Class            % Accuracy
                 MLE    KNN    MF-SVM   Shape-based
Water            86.2   94.8    72.6      84.22
Open             80.2   16.8    18.4      99.4
Deciduous         0      0      48.6      59.77
L. Density        0      0      17.1      54.32
H. Density       34.6    0      10.8      91.71
Urban             0      0      18.4      96.07
Transportation    0      0      17.5      49.25
% Average        51     15.4    22        90.29

the proposed algorithm in comparison with other algorithms. The results illustrate that the

traditional algorithms fail on these data sets, while the proposed algorithm performs very

well. The average classification accuracy is about 90% for the panchromatic data, while it

is 93% for the thermal data.
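The mutual information criterion driving the co-registration step can be sketched from the joint histogram of two images. This is a generic MI estimate, not the specific MI registration algorithm proposed earlier; image sizes and bin counts are illustrative:

```python
import numpy as np

def mutual_information(a, b, bins=16):
    """Estimate I(A; B) between two equal-size images from their joint histogram."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()                     # joint probabilities
    px = pxy.sum(axis=1, keepdims=True)           # marginal of a
    py = pxy.sum(axis=0, keepdims=True)           # marginal of b
    nz = pxy > 0                                  # skip empty cells (avoid log 0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(2)
img = rng.random((64, 64))
mi_self = mutual_information(img, img)                   # perfectly aligned: maximal MI
mi_rand = mutual_information(img, rng.random((64, 64)))  # independent: near zero
```

Registration maximizes this quantity over the transformation parameters, so a well-aligned band pair scores like `mi_self` while a misaligned one drifts toward `mi_rand`.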


(a) (b) (c) (d)

FIGURE 33: Results for the 60-meter resolution data set: (a) Original, (b) Registration

results, (c) Classification results, and (d) Classification results after inverse transformation.

TABLE 20: Comparison of classification accuracies for the 60-meter resolution data set

using different algorithms.

Class            % Accuracy
                 MLE    KNN    MF-SVM   Shape-based
Water            84.2   94.8    29.2      96.04
Open             58.1    4.62   29.4      97.73
Deciduous         0      5.64   23.8      62.61
L. Density        0     17.8    25.9      57.08
H. Density       54.2   18.5    17.8      99.91
Urban             0      4.13   19.2      87.45
Transport.        0      3.87   20.8      59.40
% Average        48     17.3    24.5      93.03

5. Experiments on the Change Detection Algorithm

The following experiments assess the performance of the proposed statistical-learning-

based change detection algorithm. Two data sets are used in the experiments: the first


is the multispectral Cairo data set presented above. The second is a multispectral

data set for a dam in downtown Louisville, Kentucky, USA. The algorithm's

performance is compared with the approaches presented above: analysis of the difference

image with the EM algorithm, and with MRF modeling.

a. Cairo Data Set

The description of this data set is given above. Since samples of this data set at

different instants of time are not available to us, we simulate some changes in the

data set. This simulation is done by assuming that an Urban area has grown in the Open

area and that a new Transportation facility has been established. The growth of

the new areas is simulated by randomly sampling pixels from the reference points of the Urban

and Transportation classes in the areas where the changes occur. Fig. (34-a) shows the reference

image with the superimposed changes. A reference changes-map is shown in

Fig. (34-b), where 0 represents a change.

Figure (35) shows the results of applying the different change detection approaches.

The difference-map used by the approach based on analysis of the difference-

map is shown in Fig. (35-a). It is easily noted that even the unchanged pixels have some small

values in the difference-map. The changes detected by analyzing the difference-

map with the EM algorithm are shown in Fig. (35-b). It can be noted that some

pixels are marked as changes when they are not, because of the pixelwise nature of the

algorithm. The effect of applying MRF modeling to the difference-map is illustrated in

Fig. (35-c), where some enhancements can be seen, especially for the isolated pixels, but

some deformations of the changed areas are induced. The results of applying the proposed

shape-based change detection algorithm are shown in Fig. (35-d), which illustrates that the

algorithm successfully detects the changes in the data set. The detection rates of

the different approaches are shown in Table 21, which illustrates that the proposed approach

outperforms the other algorithms with its detection rate of 85%.


(a) (b)

FIGURE 34: Cairo data set for the change detection evaluation: (a) Reference with

changes, (b) Reference changes-map

TABLE 21: Detection rates of the different change detection approaches.

Algorithm        DI-EM   DI-MRF   Shape-based
Detection Rate   80%     82%      85%
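The EM-based analysis of the difference image used as a baseline above can be sketched as a two-component Gaussian mixture fit to the difference values, followed by a per-pixel maximum-responsibility decision. This is a generic sketch on synthetic data (the MRF-regularized variant is not shown):

```python
import numpy as np

def em_two_gaussians(d, iters=50):
    """EM for a 1-D, 2-component Gaussian mixture on difference values d;
    component 0 models 'no change', component 1 models 'change'."""
    mu = np.array([d.min(), d.max()])             # spread the initial means
    var = np.full(2, d.var() + 1e-6)
    w = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for each value.
        lik = w * np.exp(-0.5 * (d[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = lik / lik.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, variances.
        nk = r.sum(axis=0)
        w = nk / len(d)
        mu = (r * d[:, None]).sum(axis=0) / nk
        var = (r * (d[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return mu, var, w

rng = np.random.default_rng(3)
# Synthetic difference map: 90% small (unchanged) values, 10% large (changed).
d = np.concatenate([rng.normal(0.1, 0.05, 900), rng.normal(0.8, 0.05, 100)])
mu, var, w = em_two_gaussians(d)

lik = w * np.exp(-0.5 * (d[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
change_mask = lik.argmax(axis=1) == 1             # pixelwise change decision
```

The purely pixelwise decision is what produces the isolated false changes visible in Fig. (35-b).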

b. Louisville Data

These are two multispectral Landsat data sets of downtown Louisville, Kentucky,

USA. One data set was collected in the summer of 1992 and the other in 2001. A

scene of size 200×164 pixels is used in the experiments. The reference land covers for both

instances are available from the website (http://gisdata.usgs.net/website/kentucky/viewer.php).

Since the number of classes defined in this scene is large and the data collection spans a

decade, there is a large amount of change in this scene. So, for simplicity of the illustration,

we consider only the changes in a few classes which surround the McAlpine

dam.

Figure (36) illustrates the different images related to the scene. The RGB images

are shown in (a) and (b). The reference classifications of the scene considered in this

experiment are shown in (c) and (d), and the changes-map is shown in (e). As the legend

illustrates, only changes in the Deciduous and Wetland classes are considered for detection

in this experiment. The same steps used for the Cairo data set (Algorithm-4) are applied to this

data set: shape modeling of the classes, sensor readings modeling, and the statistical Bayes

analysis for change detection. The change detection rate, 84.5%, is almost the same as for the

Cairo data set.

H. Conclusion

In this chapter, statistical learning is used to develop a new method for change

detection. Land cover changes in remote sensing data are used as an application. The

method starts by learning the shapes of the classes defined in the classified image of

the reference data set. The shape learning also uses a statistical learning algorithm, which

is discussed in the chapter. The previously presented density function estimation approach is

used to model the sensor readings of the data sets. The change detection approach uses the

models of the shapes and the sensor readings to detect the changes in the scene.

Experiments are carried out using two data sets: one for Cairo, Egypt, and the other

for Louisville, KY, USA. The average change detection rate is 85%, which outperforms

previous approaches.

The proposed change detection approach differs from other approaches in several

distinguishing features. First, it does not use the raw (sensor) data directly, which means

that it can be used for detecting changes from data of different sources (sensors), e.g.

multispectral and hyperspectral data. Second, it does not classify the data sets and then

compare the classification results, which means that there is no accumulation of error.

Third, it does not apply specific types of filters to the images, which adds simplicity and

speed to the approach.


(a) (b)

(c) (d)

FIGURE 35: Results for change detection algorithms: (a) difference-image using CVA, (b)

detected changes-map using pixelwise analysis of the difference map, (c) detected changes-

map using MRF modeling, and (d) detected changes-map using the proposed algorithm


(a) (b)

(c) (d)

(e) (f)

FIGURE 36: Results for the change detection algorithm: (a) Reference with changes, (b)

Reference change-map, (c) Ordinary classification, and (d) Detected changes-map using

the proposed algorithm


CHAPTER VII

Conclusion

The attractive features of statistical learning methods are rapidly leading them to

replace classical learning methods. Their good generalization capabilities make statistical

learning methods applicable to a wide range of practical problems. The introduction

of mathematically grounded methods for the analysis and learning of these methods opens the

door for better understanding and for further contributions to boosting their performance. The

Mean Field-based Support Vector Machines (MF-based SVM) regression approach is the

nucleus of the statistical learning methods introduced in this dissertation. The approach

utilizes probability and statistical theories to establish a reliable, efficient, and fast

regression approach that can cope with a variety of problems in the Computer Vision and

Pattern Recognition world. An approach which uses mean field theory is established

for the learning of the regression approach. A basic tool in machine learning problems is the

estimation of the probability density function. A method which uses the MF-based regression

algorithm is presented and illustrated using a variety of both synthetic and real data sets.

The statistical properties of the density estimation approach are discussed and illustrated.

Camera calibration, a fundamental issue in computer vision applications, is

formulated in a way that enables the use of the MF-based SVM regression approach in solving

this fundamental problem. The density estimation approach is used in a variety of pattern

recognition problems including classification, shape modeling, and change detection, with

applications in remote sensing imagery processing.

This chapter presents a review of the dissertation, outlines the applications of the

proposed approaches, and finally describes some natural extensions.


A. Review and Applications

Building statistical learning based frameworks for solving computer vision and pattern

recognition problems is the main research focus of this thesis. It starts with building a

regression approach which is then used in a variety of frameworks to solve the probability

density function estimation problem, the camera calibration problem, image segmentation,

and change detection.

Chapter 2 outlined the theoretical principles of the proposed MF-based SVM

regression approach. The inclusion of mean field theory in learning the SVM algorithm is

presented and investigated.

Chapter 3 discussed the use of the proposed MF-based SVM regression framework

in a deep problem of machine learning: estimation of the probability density function.

The theoretical aspects of this MF-based SVM density estimation approach are discussed in

detail. The statistical properties of this estimation approach, namely its consistency and convergence,

are discussed. Statistical performance measures are used to illustrate the

results of the presented density estimation approach. Also, several estimation approaches

are presented and illustrated for automating the learning algorithm: the EM algorithm for

estimating the parameters of the kernel, and cross validation for estimating the rearranging

parameters.

Chapter 4 presented the camera calibration problem in a statistical learning based

formulation. The motivations behind using statistical learning approaches for camera calibration

are discussed in this chapter. The formulation of the camera calibration problem in a

regression setup is outlined, and the link between this formulation and the learning of the MF-based

SVM regression algorithm is established. A mixed learning algorithm combining gradient

descent and MF-based SVM regression is formulated and applied to synthetic as well as

real data sets.

Chapter 5 presented the application of the proposed density estimation approach to the

segmentation of remote sensing data sets. Application of the framework to the segmentation

of real-world multispectral and hyperspectral imagery is presented and evaluated

against other algorithms. Estimation of the MRF parameters in image modeling using the

proposed MF-based SVM framework is also presented.

Chapter 6 presented the application of the proposed statistical learning approaches

to the change detection problem. The problem considered in this work is detecting

changes in the land cover of a scene using remote sensing imagery. The approach depends

on using statistical learning to model the shapes of the classes defined in the reference

image. These shape models are used together with the statistical sensor models (probability

densities) to detect the changes in a scene.

B. Limitations

While the proposed MF-based SVM regression algorithm is promising in terms of

accuracy and speed, it has some limitations which should be further addressed. The major

limitation is that it contains some learning parameters which have to be carefully selected.

The dissertation provides some automation approaches, but the integration of these approaches

into a fully automated approach should be considered.

C. Recommendations

This section discusses a number of recommendations suggested as extensions

to the dissertation work. These recommendations fall broadly into two

directions: performance improvement of the principal building block (statistical learning

based SVM regression) and real-time applications. The principal building block is affected

by a number of learning parameters that control its performance. Choosing the values

of these parameters is not an easy task. The performance of the approach would be enhanced


greatly by automation procedures for estimating these parameters. The thesis presented

a few algorithms to estimate the values of some of these parameters, but more work is

needed to integrate these procedures and to establish new ones for the other parameters.

This thesis presented many useful real-world applications. These applications would

be more valuable if they could run faster, preferably in real time. For example, the camera

calibration approach can be further improved by applying it to active vision applications and

real-time 3D reconstruction applications.

Also, further investigation of the presented new applications needs to be done. The

change detection approach should be applied to different kinds of data sets, meaning

different sensors, different resolutions, and different view angles. This requires a

sophisticated registration approach that can be applied to different kinds of imagery.


REFERENCES

[1] S. Chen, X. Hong, and C. Harris. Sparse Kernel Density Construction Using Orthogonal Forward Regression with Leave-One-Out Test Score and Local Regularization. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 34:1708–1717, August 2004.

[2] R. Duda, P. Hart, and D. Stork. Pattern Classification. John Wiley and Sons, 2nd edition, 2001.

[3] Aly Farag, Refaat Mohamed, and Hani Mahdi. Experiments in Image Classification and Data Fusion. In Proceedings of the Fifth International Conference on Information Fusion, IF02, pages I 299–308, Annapolis, MD, July 11-17, 2002.

[4] M. Koeppen. The Curse of Dimensionality. In Proceedings of the 5th Online World Conference on Soft Computing in Industrial Applications (WSC5), held on the internet, September 4-18, 2000.

[5] Z. Zhang. A Flexible New Technique for Camera Calibration. IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(11):1330–1334, 2000.

[6] J. Su, J. Wang, and Y. Xi. Incremental Learning with Balanced Update on Receptive Fields for Multi-Sensor Data Fusion. IEEE Trans. on Systems, Man and Cybernetics, Part B, 34(1):659–665, February 2004.

[7] B. Boser, I. Guyon, and V. Vapnik. A Training Algorithm for Optimal Margin Classifiers. In Proceedings of the Computational Learning Theory (COLT), pages 144–152, Berlin Heidelberg, New York, 1992.


[8] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1st edition, 1995.

[9] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, United Kingdom, 1st edition, 2004.

[10] V. Vapnik, S. Golowich, and A. Smola. Support Vector Method for Multivariate Density Estimation. Advances in Neural Information Processing Systems, 12:659–665, April 1999.

[11] B. Scholkopf, C. Burges, and A. Smola. Advances in Kernel Methods – Support Vector Learning. MIT Press, Cambridge, MA, 1999.

[12] T. Friess, N. Cristianini, and C. Campbell. The Kernel ADATRON Algorithm: A Fast and Simple Learning Procedure for Support Vector Machines. In Proceedings of the 15th International Conference on Machine Learning, pages 188–196, Madison, Wisconsin, USA, July 24-27, 1998.

[13] J. Platt. Fast Training of Support Vector Machines Using Sequential Minimal Optimization. In Advances in Kernel Methods. MIT Press, Cambridge, MA, 1999.

[14] C. Williams and M. Seeger. Using the Nystrom Method to Speed Up Kernel Machines. Advances in Neural Information Processing Systems, 14, 2001.

[15] M. Tipping and A. Faul. Fast Marginal Likelihood Maximization for Sparse Bayesian Models. In Proceedings of the International Workshop on AI and Statistics, Key West, FL, January 3-6, 2003.

[16] C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2):1–47, 1998.


[17] P. Mitra, C. Murthy, and S. Pal. A Probabilistic Active Support Vector Learning Algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(3):413–418, March 2004.

[18] D. Cohn, Z. Ghahramani, and M. Jordan. Active Learning with Statistical Models. Journal of AI Research, 4:129–145, 1996.

[19] D. MacKay. Information Based Objective Function for Active Data Selection. Neural Computation, 4(4):590–604, 1992.

[20] M. Opper and O. Winther. Gaussian Processes for Classification: Mean Field Algorithms. Neural Computation, 12:2655–2684, 2000.

[21] A. Papoulis and S. Pillai. Probability, Random Variables and Stochastic Processes. McGraw-Hill, New York, 4th edition, 2001.

[22] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 2nd edition, 2001.

[23] J. Gao, S. Gunn, and C. Harris. Mean Field Method for the Support Vector Machine Regression. Neurocomputing, 50:391–405, November 2003.

[24] M. Opper and D. Saad. Advanced Mean Field Methods: Theory and Practice. MIT Press, Cambridge, MA, 2001.

[25] D. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, Cambridge, 2003.

[26] V. Yakhot. Mean-Field Approximation and a Small Parameter in Turbulence Theory. Physical Review E, 63:026307, 2001.

[27] L. Saul, T. Jaakkola, and M. Jordan. Mean Field Theory for Sigmoid Belief Networks. Journal of Artificial Intelligence Research, 4:61–76, 1996.


[28] H. Kappen and W. Wiegerinck. Mean Field Theory for Graphical Models. In M. Opper and D. Saad, editors, Advanced Mean Field Theory, pages 37–49. MIT Press, Cambridge, MA, 2001.

[29] B. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, FL, USA, 1986.

[30] C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

[31] Ayman El-Baz and Aly Farag. Parameter Estimation in Gibbs-Markov Image Models. In Proceedings of the 6th International Conference on Information Fusion, pages 934–942, Queensland, Australia, July 8-11, 2003.

[32] Refaat Mohamed and Aly Farag. A New Unsupervised Approach for the Classification of Multispectral Data. In The Sixth International Conference on Information Fusion, pages 951–958, Queensland, Australia, July 8-11, 2003.

[33] E. Parzen. On Estimation of a Probability Density Function and Mode. Annals of Mathematical Statistics, 33:1065–1076, 1962.

[34] J. Lamperti. Probability: A Survey of the Mathematical Theory. Wiley Series in Probability and Statistics, Wiley, New York, 1996.

[35] J. Shao. Mathematical Statistics. Springer-Verlag, New York, 1999.

[36] C. Williams. Prediction with Gaussian Processes: Basic Ideas and Theoretical Perspectives. In Proc. Workshop on Notions of Complexity: Information-theoretic, Computational and Statistical Approaches, Eindhoven, The Netherlands, October 7-9, 2004.

[37] P. Sollich and C. Williams. Understanding Gaussian Process Regression Using the Equivalent Kernel. In L. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1313–1320. MIT Press, Cambridge, MA, 2005.

[38] A. Oppenheim, R. Schafer, and J. Buck. Discrete-Time Signal Processing (2nd ed.). Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1999.

[39] B. Dacorogna. Direct Methods in the Calculus of Variations. Springer-Verlag New York, Inc., New York, NY, USA, 1989.

[40] Z. Ghahramani and M. Jordan. Function Approximation via Density Estimation Using the EM Approach. In J. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems 6, pages 120–127. Morgan Kaufmann, San Mateo, CA, 1994.

[41] G. McLachlan and D. Peel. Finite Mixture Models. Wiley, New York, 2000.

[42] Aly Farag, Ayman El-Baz, and G. Gimelfarb. Density Estimation Using Modified Expectation Maximization for a Linear Combination of Gaussians. In Proceedings of the IEEE International Conference on Image Processing (ICIP-2004), volume I, pages 194–197, Singapore, October 24-27, 2004.

[43] A. Dempster, N. Laird, and D. Rubin. Maximum-Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Ser. B, (39), 1977.

[44] R. Redner and H. Walker. Mixture Densities, Maximum Likelihood and the EM Algorithm. SIAM Review, 26(2), 1984.

[45] M. Jordan and R. Jacobs. Hierarchical Mixtures of Experts and the EM Algorithm. Neural Computation, 6:181–214, 1994.

[46] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, New York, 2001.


[47] R. Hocking. Developments in Linear Regression Methodology.Technometrics,

25:219–249, 1983.

[48] S. Kullback.Information Theory and Statistics. Wiley, New York, 1959.

[49] S. M. Ali and S. D. Silvey. A General Class of Coefficients of Divergence of One

Distribution from Another.Journal of Royal Statistics Society, B28:131–142, 1966.

[50] M. Girolami and C. He. Probability Density Estimation from Optimally Condensed

Data Samples. IEEE Transactions on Pattern Analysis and Machine Intelligence,

25(10):1253–1264, 2003.

[51] F. Sha, L. Saul, and D. Lee. Multiplicative Updates for Non-Negative Quadratic Program-

ming in Support Vector Machines. MS-CIS 02-19, University of Pennsylvania, 2002.

[52] B. Schölkopf, J. Platt, J. Shawe-Taylor, J. Smola, and R. Williamson. Estimating the

Support of a High-Dimensional Distribution. Neural Computation, 13:1443–1471,

2001.

[53] S. Mukherjee and V. Vapnik. Support Vector Method for Multivariate Density Esti-

mation. A. I. Memo 1738, MIT AI Lab., 1999.

[54] R. Tsai. A Versatile Camera Calibration Technique for High-Accuracy 3D Machine

Vision Metrology Using Off-the-Shelf TV Cameras and Lenses. IEEE Journal of Robotics

and Automation, 3(4):323–344, August 1987.

[55] J. Weng, P. Cohen, and M. Herniou. Camera Calibration with Distortion Models and

Accuracy Evaluation. IEEE Trans. on Pattern Analysis and Machine Intelligence,

14(10):965–980, 1992.

[56] J. Gao, C. Harris, and S. Gunn. On a Class of Support Vector Kernels Based on

Frames in Function Hilbert Spaces. Neural Computation, 13:1975–1994, 2001.


[57] S. Gunn. Support Vector Machines for Classification and Regression. ISIS 1-98,

Department of Electronics and Computer Science, University of Southampton, 1998.

[58] O. Faugeras, editor. Three-Dimensional Computer Vision: A Geometric Viewpoint.

MIT Press, 1993.

[59] Aly Farag, Refaat Mohamed, and Ayman El-Baz. A Unified Framework for MAP

Estimation in Remote Sensing Image Segmentation. IEEE Trans. on Geoscience

and Remote Sensing, 43(7):1617–1634, 2005.

[60] E. Trucco and A. Verri. Introductory Techniques for 3-D Computer Vision. Prentice

Hall, NJ, USA, 1998.

[61] Moumen Ahmed and Aly Farag. A Neural Network Approach for Solving the Prob-

lem of Camera Calibration. Image and Vision Computing, 20(9-10):619–630, 2002.

[62] J. Heikkila. Geometric Camera Calibration Using Circular Control Points. IEEE

Trans. on Pattern Analysis and Machine Intelligence, 22(10):1066–1077, October

2000.

[63] R. Hartley and A. Zisserman.Multiple View Geometry in Computer Vision. Cam-

bridge University Press, ISBN: 0521540518, second edition, 2004.

[64] B. Shahshahani and D. Landgrebe. The Effect of Unlabeled Samples in Reducing the

Small Sample Size Problem and Mitigating the Hughes Phenomenon.IEEE Trans.

on Geoscience and Remote Sensing, 32(5):1087–1095, 1994.

[65] S. Tadjudin and D. Landgrebe. Robust Parameter Estimation for Mixture Model.

IEEE Trans. on Geoscience and Remote Sensing, 38(1):439–445, 2000.

[66] J. Besag. On the Statistical Analysis of Dirty Pictures. Journal of the Royal Statistical

Society, Series B, 48(3):259–302, 1986.


[67] C. Bouman and M. Shapiro. A Multiscale Random Field Model for Bayesian Image

Segmentation. IEEE Transactions on Image Processing, 3(2):162–177, 1994.

[68] Aly Farag and E. Delp. Image Segmentation Based on Composite Random Field

Models. Journal of Optical Engineering, 12:2594–2607, December 1992.

[69] G. Gimel’farb. Image Textures and Gibbs Random Fields. Kluwer Academic, Dor-

drecht, The Netherlands, 1999.

[70] Ayman El-Baz and Aly Farag. Image Segmentation Using GMRF Models: Pa-

rameters Estimation and Applications. In Proceedings of the IEEE International

Conference on Image Processing, ICIP 2003, volume II, pages 173–176, Barcelona, Spain,

September 14-17 2003.

[71] A. Jain and R. Dubes. Random Field Models in Image Analysis. Journal of Applied

Statistics, 16(2):131–164, 1989.

[72] J. Besag. Spatial Interaction and the Statistical Analysis of Lattice Systems. Journal

of the Royal Statistical Society, Series B, 36(2):192–225, 1974.

[73] M. Carlotto. Detection and Analysis of Change in Remotely Sensed Imagery with

Application to Wide Area Surveillance. IEEE Transactions on Image Processing,

6:189–202, 1997.

[74] X. Yang and R. Yang. Change Detection Based on Remote Sensing Information

Model and its Application on Coastal Line of Yellow River Delta. In Asian Confer-

ence on Remote Sensing, Hong Kong, China, November 22-25 1999.

[75] D. Le Gall. MPEG: A Video Compression Standard for Multimedia Applications.

Communications of the ACM, 34:47–58, 1991.


[76] S. Mallat. A Theory for Multiresolution Signal Decomposition: The Wavelet Repre-

sentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11:674–

692, 1989.

[77] L. Chen and S. Chang. A Video Tracking System with Adaptive Predictors. Pattern

Recognition, 25:1171–1180, 1992.

[78] W. Kan, J. Krogmeier, and P. Doerschuk. Model-Based Vehicle Tracking from Im-

age Sequences with an Application to Road Surveillance.Optical Engineering,

35:1723–1729, 1996.

[79] S. Liu, C. Fu, and S. Chang. Statistical Change Detection with Moments Under

Time-Varying Illumination. IEEE Transactions on Image Processing, 7:1258–1268,

September 1998.

[80] C. Fu and S. Chang. A Motion Estimation Algorithm Under Time Varying Illumi-

nation Case. Pattern Recognition Letters, 10:195–199, 1989.

[81] L. Bruzzone and S. Serpico. An Iterative Technique for the Detection of Land-

Cover Transitions in Multitemporal Remote-Sensing Images. IEEE Transactions on

Geoscience and Remote Sensing, 35:858–867, 1997.

[82] J. Townshend, C. Justice, and C. Gurney. The Impact of Misregistration on Change

Detection. IEEE Transactions on Geoscience and Remote Sensing, 30:1054–1060,

1992.

[83] T. Fung. An Assessment of TM Imagery for Land-Cover Change Detection. IEEE

Transactions on Geoscience and Remote Sensing, 28:681–684, 1990.

[84] A. Singh. Digital Change Detection Techniques Using Remotely Sensed Data. In-

ternational Journal of Remote Sensing, 10:989–1003, 1989.


[85] J. Townshend and C. Justice. Spatial Variability of Images and the Monitoring of

Changes in the Normalized Difference Vegetation Index. International Journal of

Remote Sensing, 16:2187–2195, 1995.

[86] D. Wiemker. An Iterative Spectral-Spatial Bayesian Labeling Approach for Unsu-

pervised Robust Change Detection on Remotely Sensed Multispectral Imagery. In

Proc. 7th International Conference on Computer Analysis of Images and Patterns,

pages 263–270, Kiel, Germany, September 22-25 1997.

[87] A. Nielsen, K. Conradsen, and J. Simpson. Multivariate Alteration Detection (MAD)

and MAF Processing in Multispectral, Bitemporal Image Data: New Approaches to

Change Detection Studies. Remote Sensing of Environment, 64:1–19, 1998.

[88] J. Flusser and T. Suk. A Moment-Based Approach to Registration of Images with

Affine Geometric Distortion. IEEE Transactions on Geoscience and Remote Sensing,

32:382–387, 1994.

[89] D. Barnea and H. Silverman. A Class of Algorithms for Fast Digital Image Regis-

tration. IEEE Transactions on Computers, C-21:179–186, 1972.

[90] J. Ton and A. Jain. Registering Landsat Images by Point Matching. IEEE Transac-

tions on Geoscience and Remote Sensing, 27:642–650, September 1989.

[91] T. Knoll and E. Delp. Adaptive Gray Scale Mapping to Reduce Registration Noise in

Difference Images. Computer Vision, Graphics, and Image Processing, 33:129–137,

1986.

[92] P. Gong, E. Ledrew, and J. Miller. Registration-Noise Reduction in Difference Im-

ages for Change Detection. International Journal of Remote Sensing, 13:773–779,

1992.


[93] P. Chavez. Radiometric Calibration of Landsat Thematic Mapper Multispectral Im-

ages. Photogrammetric Engineering and Remote Sensing, 55:1285–1294, 1989.

[94] P. Chavez and D. MacKinnon. Automatic Detection of Vegetation Changes in the

Southwestern United States Using Remotely Sensed Images. Photogrammetric En-

gineering and Remote Sensing, 60:571–583, 1994.

[95] J. Richards. Remote Sensing Digital Image Analysis. Springer, NY, USA, 2nd edi-

tion, 1993.

[96] P. Slater. Reflectance and Radiance Based Methods for the In-Flight Absolute Cali-

bration of Multispectral Sensors. Remote Sensing of Environment, 22:11–37, 1987.

[97] P. Teillet et al. Three Methods for the Absolute Calibration of the NOAA

AVHRR Sensors in Flight. Remote Sensing of Environment, 31:105–120, 1990.

[98] H. Olsson. Reflectance Calibration of Thematic Mapper for Forest Change Detec-

tion. International Journal of Remote Sensing, 16:81–96, 1995.

[99] P. Rosin. Thresholding for Change Detection. Computer Vision and Image Under-

standing, 86:79–95, 2002.

[100] T. Fung and E. LeDrew. The Determination of Optimal Threshold Levels for Change

Detection Using Various Accuracy Indices. Photogrammetric Engineering and Re-

mote Sensing, 54:1449–1454, 1988.

[101] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic, NY, USA,

2nd edition, 1990.

[102] R. Nelson. Detecting Forest Canopy Change Due to Insect Activity Using Landsat

MSS. Photogrammetric Engineering and Remote Sensing, 49:1303–1314, 1983.


[103] L. Bruzzone and D. Prieto. Automatic Analysis of the Difference Image for Unsu-

pervised Change Detection. IEEE Transactions on Geoscience and Remote Sensing,

38:1171–1182, 2000.

[104] L. Bruzzone and D. Prieto. An Adaptive Semiparametric and Context-Based

Approach to Unsupervised Change-Detection in Multitemporal Remote Sensing Im-

ages. IEEE Transactions on Image Processing, 11:452–466, 2002.

[105] T. Kasetkasem and P. Varshney. An Image Change-Detection Algorithm Based on

Markov Random Field Models. IEEE Transactions on Geoscience and Remote Sens-

ing, 40:1815–1823, 2002.

[106] A. Bernstein. Monitoring Large Enrichment Plants Using Thermal Imagery from

Commercial Satellites: A Case Study.Science and Global Security, 9:143–163,

2001.

[107] T. Sebastian et al. Recognition of Shapes by Editing Shock Graphs. In Interna-

tional Conference on Computer Vision, pages 755–762, Vancouver, Canada, July 9-12

2001.

[108] B. van Ginneken et al. Active Shape Model Segmentation with Optimal Features.

IEEE Transactions on Medical Imaging, 21:755–762, August 2002.

[109] A. Eldeib, S. Yamany, and Aly Farag. Volume Registration by Surface Point Sig-

nature and Mutual Information Maximization with Applications in Intra-Operative

MRI Surgeries. In International Conference on Image Processing, pages 200–203,

Vancouver, Canada, October 2000.

[110] X. Huang, D. Metaxas, and T. Chen. MetaMorphs: Deformable Shape and Texture

Models. In International Conference on Computer Vision and Pattern Recognition,

pages 496–503, Washington, D.C., USA, June 27-July 2 2004.


[111] Ayman El-Baz, Refaat Mohamed, and Aly Farag. Shape Constraints for Accurate

Image Segmentation with Applications in Remote Sensing Data. In The Eighth In-

ternational Conference on Information Fusion, Philadelphia, PA, USA, July 25-29

2005.

[112] Refaat Mohamed and Aly Farag. Mean Field Theory for Density Estimation Using

Support Vector Machines. In The Seventh International Conference on Information

Fusion, pages 856–861, Stockholm, Sweden, June 28-July 1 2004.


CURRICULUM VITAE

NAME: Refaat M. Mohamed

ADDRESS: Department of Electrical and Computer Engineering,

University of Louisville,

Louisville, KY 40292.

EDUCATION: * M.Sc. Electrical Engineering,

University of Assiut, Assiut, Egypt, 2001.

M.Sc. THESIS TITLE:

“An Intelligent Trajectory Tracking Controller for Robotics.”

* B.Sc. Electrical Engineering,

Very Good with Honors, ranked first in class,

University of Assiut, Assiut, Egypt, 1995.

PREVIOUS

RESEARCH: Learning Systems, Robotic Control, Electronic Controllers Design.

TEACHING: Pattern Analysis and Machine Intelligence – GTA.

HONORS and AWARDS:


Dean’s Citation, University of Louisville Commencement,

Fall 2005.

Who’s Who Among Students in American Universities, 2005.

Outstanding Graduate Student, ECE Dept., University of

Louisville, 2004.

Second place in the ECE Department, University of Louisville,

Engineer’s Days Exhibit 2002.

Student Member, IEEE, since 2002.

Member, Eta Kappa Nu (HKN), since 2004.

PUBLICATIONS:

JOURNALS

1. Refaat M. Mohamed, Ayman S. El-Baz, and Aly A. Farag, “A Bayes Analysis Ap-

proach for Change Detection in Remote Sensing Images,” Under preparation for the

IEEE Transactions on Geoscience and Remote Sensing.

2. Aly A. Farag, Refaat M. Mohamed, and Ayman S. El-Baz, “A Unified Framework for

MAP Estimation in Remote Sensing Image Segmentation,” IEEE Transactions on

Geoscience and Remote Sensing, Vol. 43, No. 7, July 2005, pp. 1617-1634.

3. Ayman S. El-Baz, Refaat M. Mohamed, and Aly A. Farag, “Advanced Support Vector

Machines for Image Modeling Using Gibbs-Markov Random Field,” International

Journal of Information Technology Vol. 1, No. 4, pp. 297-300, 2004.

4. Refaat M. Mohamed, Ayman S. El-Baz, and Aly A. Farag, “Probability Density Esti-

mation Using Advanced Support Vector Machines and the Expectation Maximization


Algorithm,” International Journal Of Signal Processing Vol. 1, No. 4, pp. 260-264,

2004.

5. Aly A. Farag, Ayman S. El-Baz, and Refaat M. Mohamed, “Density Estimation Using

Generalized Linear Model and a Linear Combination of Gaussians,” International

Journal Of Signal Processing Vol. 1, No. 4, pp. 265-268, 2004.

CONFERENCES

6. Refaat M. Mohamed, Abdel-Rehim Ahmed, Ahmed Eid, and Aly Farag, “Statistical

Learning for Camera Calibration,” Submitted to the ECCV 2006.

7. Ayman S. El-Baz, Refaat M. Mohamed, Aly A. Farag, and Georgy Gimel’farb, “Un-

supervised Segmentation of Multi-Modal Images by a Precise Approximation of

Individual Modes with Linear Combinations of Discrete Gaussians,” International

Conference on Computer Vision and Pattern Recognition, CVPR-05, Workshop on

Learning in Computer Vision and Pattern Recognition, San Diego, California, June

19-25, 2005.

8. Refaat M. Mohamed, Ayman El-Baz, and Aly A. Farag, “Advanced Algorithms for

Bayesian Classification in High-Dimensional Spaces with Applications in Hyper-

spectral Image Segmentation,” Accepted, The International Conference on Image

Processing, ICIP 2005, Sept. 11-14, Genoa, Italy.

9. Refaat Mohamed, Ayman El-Baz, and Aly Farag, “Remote Sensing Image Segmen-

tation Using SVM with Automatic Selection for the Kernel Parameters,” Accepted,

The Eighth International Conference on Information Fusion, Philadelphia, PA, USA,

July 25-29, 2005.


10. Ayman S. El-Baz, Refaat M. Mohamed, and Aly A. Farag, “Shape Constraints for Ac-

curate Image Segmentation with Applications in Remote Sensing Data,” Accepted,

The Eighth International Conference on Information Fusion, Philadelphia, PA, USA,

July 25-29, 2005.

11. Refaat M. Mohamed and Aly A. Farag, “Mean Field Theory for Density Estimation

Using Support Vector Machines,” Seventh International Conference on Information

Fusion, Stockholm, July, 2004, pp. 495-501.

12. Hashem M. Mohamed, Khaled M. Shaaban, and Refaat M. Mohamed, “A Robust

Framework for Detection of Human Faces in Clutter Color Images,” Seventh Interna-

tional Conference on Humans and Computers, University of Aizu, Japan, September

1-4, 2004.

13. Refaat M. Mohamed and Aly A. Farag, “Parameter Estimation for Bayesian Clas-

sification of Multispectral Data,” Seventh International Conference on Knowledge-

Based Intelligent Information and Engineering Systems, University of Oxford, United

Kingdom, September 4-5, 2003, pp. 346-355.

14. Refaat M. Mohamed and Aly A. Farag, “A New Unsupervised Approach for the

Classification of Multispectral Data,” Sixth International Conference on Information

Fusion, Fusion-03, Queensland, Australia, July 8-11, 2003, pp. 951-958.

15. Refaat M. Mohamed and Aly A. Farag, “Two Sequential Stages Classifier for Mul-

tispectral Data,” International Conference on Computer Vision and Pattern Recogni-

tion, CVPR-03, Workshop on Intelligent Learning, Madison, Wisconsin, June 16-22,

2003, pp. 110-116.

16. Refaat M. Mohamed and Aly A. Farag, “Classification of Multispectral Data Using

Support Vector Machines Approach for Density Estimation,” International Confer-


ence on Intelligent Engineering System, Assiut, Egypt, March 4-6, 2003, pp. 102-

109.

17. Aly A. Farag, Refaat M. Mohamed, and Hani Mahdi, “Experiments in Image Classifi-

cation and Data Fusion,” Proceedings of 5th International Conference on Information

Fusion, Annapolis, MD, Vol. 1, pp. 299-308, July 2002.

18. Khaled M. Shaaban and Refaat M. Mohamed, “Autonomous Learning Cerebellum

Model Articulation Controller,” 2002 World Congress on Computational Intelligence,

WCCI 2002, Hilton Hawaiian Village Hotel Honolulu, Hawaii, May 12-17, 2002.
