
Motion Estimation Using Statistical Learning Theory

Harry Wechsler, Fellow, IEEE, Zoran Duric, Senior Member, IEEE,

Fayin Li, and Vladimir Cherkassky, Senior Member, IEEE

Abstract—This paper describes a novel application of Statistical Learning Theory (SLT) to single motion estimation and tracking. The problem of motion estimation can be related to statistical model selection, where the goal is to select one (correct) motion model from several possible motion models, given finite noisy samples. SLT, also known as Vapnik-Chervonenkis (VC) theory, provides analytic generalization bounds for model selection, which have been used successfully for practical model selection. This paper describes a successful application of an SLT-based model selection approach to the challenging problem of estimating optimal motion models from small data sets of image measurements (flow). We present results of experiments on both synthetic and real image sequences for motion interpolation and extrapolation; these results demonstrate the feasibility and strength of our approach. Our experimental results show that, for motion estimation applications, SLT-based model selection compares favorably against alternative model selection methods, such as Akaike's fpe, Schwartz's criterion (sc), Generalized Cross-Validation (gcv), and Shibata's Model Selector (sms). The paper also shows how to address the aperture problem using SLT-based model selection for the penalized linear (ridge regression) formulation.

Index Terms—Aperture problem, complexity control, condition number, image flow, model selection, motion estimation, robust learning, statistical learning theory, tracking, visual motion.

1 INTRODUCTION

LEARNING plays a fundamental role in facilitating "the balance between internal representations and external regularities." As "versatility <generalization> and scalability are desirable attributes in most vision systems," the "only solution is to incorporate learning capabilities within the vision system" [22]. Many challenging problems in computer vision could thus benefit from using predictive learning, where the goal is to come up with "good" models based on available (training) data under fairly general (flexible) assumptions. A good model is expected to provide accurate predictions for future (test) data. Toward that end, this paper describes a novel application of Statistical Learning Theory (SLT) to motion estimation. SLT provides the mathematical and conceptual framework needed for estimating dependencies from finite training data, it enables a better understanding of the issues responsible for generalization, and it facilitates the development of better (more rigorous) learning algorithms. SLT provides analytical generalization bounds for model selection, which relate the unknown prediction risk (generalization performance) to known quantities such as the number of training samples, the empirical error, and a measure of model complexity called the Vapnik-Chervonenkis (VC) dimension.

In the predictive learning framework [29], [30], [9], obtaining a good model with finite training data requires the specification of admissible models (or approximating functions), e.g., the regression estimator, an inductive principle for combining admissible models with available data, and an optimization ("learning") procedure for estimating the parameters of the admissible models. The inductive principle is responsible for model selection and it directly affects the generalization ability in terms of prediction risk, i.e., the performance on unseen/future data. Conversely, a learning method is a constructive implementation of an inductive principle, i.e., an optimization or parameter estimation procedure, for a given set of approximating functions. Model selection corresponds to seeking an optimal model for a well-defined predictive learning problem formulation, i.e., for a given set of admissible models and a loss function that provides a measure of generalization performance. In this setting, model selection amounts to model complexity control [9], [10].

In computer vision applications, model selection using predictive learning expands on the classical statistical framework and defines a novel framework, that of robust learning. Robustness is related to both accuracy and functionality, which have been succinctly defined in the context of computer vision as "how close the captured motion corresponds to the actual motion performed by the subject" and "the fewer assumptions a system imposes on its operational conditions, the more robust it is considered to be. Many systems are based on knowing the initial state of their systems and/or a well-defined model fitted (offline) to the current subject. In a real life scenario, we may expect a system to be capable of autonomy and run on its own, i.e., adapt to the current situation. Related to this is the problem of how to recover from failure. A number of systems are based on incremental updates or searching around a predicted value. Many of these fail due to bad predictions and are not able to recover" (see Moeslund and Granum [21]).


. H. Wechsler, Z. Duric, and F. Li are with the Department of Computer Science, George Mason University, Fairfax, VA 22030-4444. E-mail: {wechsler, zduric, fli}@cs.gmu.edu.

. V. Cherkassky is with the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455. E-mail: [email protected].

Manuscript received 14 May 2002; revised 20 Feb. 2003; accepted 6 Oct. 2003. Recommended for acceptance by L. Vincent. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 116542.



Poggio and Shelton [23] further address robust learning in computer vision when they make a strong case that "the problem of learning is arguably at the very core of the problem of intelligence, both biological and artificial. Since seeing is intelligence, learning is also becoming a key to the study of artificial and biological vision. Vision systems that learn and adapt represent one of the most important directions in computer vision research. It may be the only way to develop vision systems that are robust and easy to use in many different tasks."

This paper describes an application of SLT-based model selection to the challenging problem of estimating optimal motion models from small sets of image measurements (flow). We present results of experiments on both synthetic and real image sequences for both motion interpolation and extrapolation; these results demonstrate the feasibility and strengths of our approach. In addition, our experimental results show that, for motion estimation applications, SLT-based model selection compares favorably against alternative model selection methods, regarding the confidence they offer for motion estimation; the alternatives considered are Akaike's Final Prediction Error (fpe), Schwartz's criterion (sc), Generalized Cross-Validation (gcv), and Shibata's Model Selector (sms). The paper also shows how to address the aperture problem using SLT-based model selection for the penalized linear (ridge regression) formulation.

2 MODEL SELECTION FOR REGRESSION

This section briefly reviews model selection in predictive learning, covering both classical and VC-based analytic methods. A learning method is an algorithm that estimates an unknown mapping (dependency) between a system's inputs and outputs from the available data, i.e., known (input, output) samples. Once such a dependency has been estimated, it can be used to predict system outputs from the input values. The usual goal of learning is prediction accuracy, also known as generalization. A generic learning system [13], [9], [30] consists of:

. a Generator of random input vectors $\mathbf{x}$, drawn from a fixed (unknown) probability distribution $P(\mathbf{x})$;

. a System (or teacher), which returns an output value $y$ for every input vector $\mathbf{x}$ according to the fixed conditional distribution $P(y|\mathbf{x})$, which is also unknown;

. a Learning Machine, which is capable of implementing a set of approximating functions $f(\mathbf{x}, \omega)$, $\omega \in \Omega$, where $\Omega$ is a set of parameters of an arbitrary nature.

The goal of learning is to select the function (from this set) that best approximates the System's response. Many problems in computer vision can be formally stated as the problem of real-valued function estimation from noisy samples. This problem is known in statistics as (nonlinear) regression. In the regression formulation, the goal of learning is to estimate an unknown (target) function $g(\mathbf{x})$ in the relationship $y = g(\mathbf{x}) + \varepsilon$, where the random error $\varepsilon$ (noise) is zero mean, $\mathbf{x}$ is a $d$-dimensional vector, and $y$ is a scalar output. A learning method (or estimation procedure) selects the "best" model $f(\mathbf{x}, \omega_0)$ from a set of (parameterized) approximating functions (or possible models) $f(\mathbf{x}, \omega)$ specified a priori, where the quality of an approximation is measured by the loss or discrepancy measure $L(y, f(\mathbf{x}, \omega))$. A common loss function for regression is the squared error. Thus, learning is the problem of finding the function $f(\mathbf{x}, \omega_0)$ (regressor) that minimizes the prediction risk functional

$$R(\omega) = \int (y - f(\mathbf{x}, \omega))^2\, p(\mathbf{x}, y)\, d\mathbf{x}\, dy$$

using only the training data $(\mathbf{x}_i, y_i)$, $i = 1, \ldots, n$, generated according to some (unknown) joint probability density function (pdf) $p(\mathbf{x}, y) = p(\mathbf{x})\, p(y|\mathbf{x})$. The prediction risk functional measures the accuracy of the learning method's predictions of the unknown target function $g(\mathbf{x})$.

The standard formulation of the learning problem (as defined above) amounts to function estimation, i.e., selecting the "best" function from a set of admissible functions $f(\mathbf{x}, \omega)$. Here, the "best" function (model) is the one minimizing the prediction risk. The problem is ill-posed since the prediction risk functional is unknown (by definition). Most learning methods implement the idea known as "empirical risk minimization" (ERM), which is choosing the model minimizing the empirical risk, or the average loss for the training data:

$$R_{emp}(\omega) = \frac{1}{n} \sum_{k=1}^{n} (y_k - f(\mathbf{x}_k, \omega))^2. \qquad (1)$$
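For a linear parameterization $f(\mathbf{x}, \omega) = \mathbf{x} \cdot \omega$, minimizing (1) is ordinary least squares. The short sketch below is ours, not the paper's; the data arrays X and y are hypothetical:

```python
import numpy as np

def fit_erm_linear(X, y):
    """Minimize the empirical risk (1) for a linear model f(x, w) = X @ w
    via ordinary least squares; returns the parameters and R_emp."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    r_emp = np.mean((y - X @ w) ** 2)  # average squared loss on training data
    return w, r_emp
```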

The ERM approach is only appropriate under parametric settings, i.e., when the parametric form of the unknown dependency is known. Under such a (parametric) approach, the unknown dependency is assumed to belong to a narrow class of functions (specified by a given parametric form). In most practical applications, parametric assumptions do not hold true, and the unknown dependency is estimated in a wide class of possible models of varying complexity. Since the goal of learning is to obtain a model providing minimal prediction risk, this is achieved by choosing a model of optimal complexity, corresponding to the smallest prediction (generalization) error for future data. Existing provisions for model complexity control include [24], [9]: penalization (regularization), weight decay (in neural networks), parameter (weight) initialization (in neural network training), and various greedy procedures (also known as constructive, growing, or pruning methods).

Classical methods for model selection are based on asymptotic results for linear models. Recent approaches [3], [14], [18] based on approximation theory extend classical rate-of-convergence results to nonlinear models (such as multilayer perceptrons); they are, however, still based on asymptotic assumptions. Nonasymptotic (guaranteed) bounds on the prediction risk for finite-sample settings have been proposed in VC-theory [29]. We also point out that all approximation theory results are aimed at deriving accurate estimates of risk, since the goal of (prediction) risk estimation is equivalent to complexity control when the number of samples is large (i.e., the asymptotic case). However, there is a subtle but important difference between the goal of accurate estimation of prediction risk and using those estimates for model complexity control with finite samples. That is, a model selection criterion can provide poor estimates of prediction risk, yet the differences between its risk estimates (for models of different complexity) may yield accurate model selection [10].

There are two general approaches for estimating prediction risk for regression problems with finite data: analytical and data-driven.



Analytical methods use analytic estimates of the prediction risk as a function of the empirical risk (training error) penalized (adjusted) by some measure of model complexity. Once an accurate estimate of the prediction risk is found, it can be used for model selection by choosing the model complexity that minimizes the estimated prediction risk. In the statistical literature, various analytic prediction risk estimates have been proposed for model selection (for linear regression). These estimates take the form

$$R_{est} = r\!\left(\frac{d}{n}\right) \cdot \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad (2)$$

where $r$ is a monotonically increasing function of the ratio of model complexity $d$ (the number of degrees of freedom) to the training sample size $n$. The function $r$ is often called a penalization factor because it inflates the average residual sum of squares for increasingly complex models.

SLT provides analytic upper bounds on the prediction risk that can be used for model selection [29], [30]. To make practical use of such bounds for model selection, practical values for the theoretical constants involved have to be chosen [9], [10]; this results in the penalization factor $r(p, n)$, called the Vapnik measure (vm):

$$r(p, n) = \left(1 - \sqrt{p - p \ln p + \frac{\ln n}{2n}}\right)_{+}^{-1}, \qquad (3)$$

where $p = h/n$, $h$ denotes the VC-dimension of a model, and $(x)_+ = 0$ for $x < 0$. In SLT, $R_{est}(\omega)$ is obtained by substituting $r(p, n)$ for $r(d/n)$ in (2). For linear estimators with $m$ degrees of freedom, the VC-dimension is $h = m + 1$. For a given training data set, selection of the "best" model from several parametric models corresponds to choosing the model providing the minimum bound on prediction risk (2), with the penalization factor given by (3).
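The bound is easy to evaluate in practice. Below is a minimal sketch, with our own helper names (not from the paper), of the penalization factor (3) and the resulting risk bound (2):

```python
import numpy as np

def vapnik_penalty(h, n):
    """Penalization factor r(p, n) of (3), with p = h / n."""
    p = h / n
    inner = 1.0 - np.sqrt(p - p * np.log(p) + np.log(n) / (2.0 * n))
    # (x)_+ = 0 for x < 0, which makes the bound uninformative (infinite).
    return np.inf if inner <= 0 else 1.0 / inner

def vc_risk_bound(residuals, h):
    """Bound on prediction risk (2): empirical risk inflated by r(p, n)."""
    residuals = np.asarray(residuals)
    n = len(residuals)
    return vapnik_penalty(h, n) * np.mean(residuals ** 2)
```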

3 MOTION ESTIMATION

The estimation of motion from image sequences is "a difficult problem that involves pooling noisy measurements to make reliable estimates;" furthermore, motion estimation "assumes some model of the image variation within a region" [6]. Early attempts at robust (optical flow) motion estimation involved least-squares regression followed by outlier detection and rejection, and then reestimation of the motion for the remaining pixels [4], using robust statistical techniques (M-estimators), known for their low breakdown point, to compute a dominant motion while downweighting the outliers.

Much of computer vision, including motion, registration, segmentation, and stereo, calls for optimal estimation using linear and nonlinear (penalized) regression. In motion analysis, regression models are often used in two different contexts, i.e., interpolation or extrapolation, as explained next. The regression problem involved in motion estimation has a "training" set of examples, i.e., input vectors $\mathbf{x}_i$ along with corresponding targets $y_i$, which put in correspondence similar image locations drawn from two consecutive frames, sampled at time $t$ and $t+1$, respectively, or from several consecutive frames of a video sequence. Using the training set, one seeks to learn how to model the dependency of the targets ("dependent variables") on the inputs ("independent variables"). The objective is to make accurate predictions on image points available but not included in the training set, or on image points not yet seen from future frames; this corresponds to interpolation and extrapolation, respectively.

We consider a short "real image sequence" and a long "synthetic image sequence," both involving approximately constant motion, for the purpose of motion estimation. Ground truth is available only for the synthetic image sequences. The real image sequence is inherently noisy and no ground truth is available. In the case of small image displacements, we generate data $\{x_i, y_i\}$ for regression using normal flow computation (see Section 3.1); for large displacements, i.e., the synthetic image sequence, the image correspondences and ground truth are given. Model fitting for each of a set of admissible models is done using LS estimation (see Section 3.2). Section 3.3 details how to choose the optimal model using VC-based complexity control.

Note that the goal here is not to suggest novel motion analysis models, but rather to choose, based on finite and noisy data, an optimal model that would yield minimum error for future inputs (i.e., minimum prediction risk). In the motion analysis setting, we use model selection criteria in order to select the "correct" motion model from several possible motion models, where the parametric motion models are known to contain the true motion model. This setting is much simpler than the general problem of model selection [9], where the set of possible models may not contain the true model.

3.1 Normal Flow

Normal flow computation [2], [12] provides the training data needed for model selection and model fitting, i.e., parameter estimation. Let $\vec{i}$ and $\vec{j}$ be the unit vectors in the $x$ and $y$ directions, respectively; $\delta\vec{r} = \vec{i}\,\delta x + \vec{j}\,\delta y$ is the projected displacement field at the point $\vec{r} = x\vec{i} + y\vec{j}$. If we choose a unit direction vector $\vec{n}_r = n_x\vec{i} + n_y\vec{j}$ at the image point $\vec{r}$ and call it the normal direction, then the normal displacement field at $\vec{r}$ is $\delta\vec{r}_n = (\delta\vec{r} \cdot \vec{n}_r)\,\vec{n}_r = (n_x\,\delta x + n_y\,\delta y)\,\vec{n}_r$. The normal direction $\vec{n}_r$ can be chosen in various ways; the usual choice is the direction of the image intensity gradient, $\vec{n}_r = \nabla I / \|\nabla I\|$. Note that the normal displacement field along an edge is orthogonal to the edge direction. Thus, if at time $t$ we observe an edge element at position $\vec{r}$, the apparent position of that edge element at time $t + \Delta t$ will be $\vec{r} + \Delta t\,\delta\vec{r}_n$. This is a consequence of the well-known aperture problem (see Section 6). We base our method of estimating the normal displacement field on this observation.

For an image frame (say, collected at time $t$), we find edges using an implementation of the Canny edge detector. For each edge element, say at $\vec{r}$, we resample the image locally to obtain a small window with its rows parallel to the image gradient direction $\vec{n}_r = \nabla I / \|\nabla I\|$. For the next image frame (collected at time $t + \Delta t$), we create a larger window, typically twice as large as the maximum expected value of the magnitude of the normal displacement field. We then slide the first (smaller) window along the second (larger) window and compute the difference between the image intensities. The zero of the resulting function is at distance $u_n$ from the origin of the second window; note that the image gradient in the second window at the positions close to $u_n$ must be positive. Our estimate of the normal displacement field is then $-u_n$, and we call it the normal flow. A real color image sequence used in our experiments is shown in Fig. 1. The corresponding normal flow is shown in Fig. 2.
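The window-based search can be approximated compactly in code. The sketch below is ours, not the authors' implementation: it samples 1-D intensity profiles along the gradient direction with bilinear interpolation and returns the sub-pixel shift that minimizes the intensity difference in a least-squares sense, rather than locating the zero crossing described above. The window sizes and function names are hypothetical.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def normal_flow_at(I0, I1, r, n_r, half=3, search=6.0, step=0.25):
    """Approximate the normal displacement at an edge point r of frame I0.

    I0, I1 : consecutive gray-level frames (2-D arrays)
    r      : (x, y) edge location from an edge detector
    n_r    : unit image-gradient direction (n_x, n_y) at r
    """
    x, y = r
    nx, ny = n_r
    s = np.arange(-half, half + 1)
    # 1-D profile of frame t along the gradient direction through r.
    prof0 = map_coordinates(I0, [y + s * ny, x + s * nx], order=1)
    best_u, best_cost = 0.0, np.inf
    # Slide the profile over frame t + dt along the same direction.
    for u in np.arange(-search, search + step, step):
        prof1 = map_coordinates(I1, [y + (s + u) * ny, x + (s + u) * nx], order=1)
        cost = np.sum((prof1 - prof0) ** 2)
        if cost < best_cost:
            best_cost, best_u = cost, u
    return best_u  # estimate of the normal displacement u_n at r
```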



3.2 Parameter Estimation

A hierarchy of parametric flow models has been developed in the past, starting from pure translation, image plane rotation, 2D affine, and 2D homography (8-parameter flow, also known as quadratic flow). We will consider all those models here. Eight-parameter flow corresponds to the instantaneous projected image motion field generated by a moving plane. The other models used here can be obtained by setting some of the eight parameters to zero. In the 8-parameter model, the coordinates of a point $(x, y)$ in the first frame will move to $(x', y')$ in the next frame:

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} w_1 \\ w_4 \end{pmatrix} + \begin{pmatrix} w_2 & w_3 \\ w_5 & w_6 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} x^2 & xy \\ xy & y^2 \end{pmatrix} \begin{pmatrix} w_7 \\ w_8 \end{pmatrix}. \qquad (4)$$

Equation (4) relates corresponding points in successive image frames. To obtain the displacement $\vec{u}(x, y) = (\delta x \ \ \delta y)^T$ of $(x, y)$, we subtract $(x \ \ y)^T$ from both sides of (4). The left-hand side of (4) is replaced by $(\delta x \ \ \delta y)^T$ and, on the right-hand side, $w_2$ and $w_6$ are replaced by $w_2^1 = w_2 - 1$ and $w_6^1 = w_6 - 1$. We obtain

$$\begin{pmatrix} \delta x \\ \delta y \end{pmatrix} = \begin{pmatrix} w_1 \\ w_4 \end{pmatrix} + \begin{pmatrix} w_2^1 & w_3 \\ w_5 & w_6^1 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} x^2 & xy \\ xy & y^2 \end{pmatrix} \begin{pmatrix} w_7 \\ w_8 \end{pmatrix}. \qquad (5)$$

The normal displacement field at $(x, y)$ is given by

$$\begin{aligned} u_n(x, y) = \delta\vec{r}_n \cdot \vec{n}_r = n_x\,\delta x + n_y\,\delta y &= w_1 n_x + w_2^1 x n_x + w_3 y n_x + w_7 x^2 n_x + w_8 x y n_x \\ &\quad + w_4 n_y + w_5 x n_y + w_6^1 y n_y + w_7 x y n_y + w_8 y^2 n_y = \mathbf{w} \cdot \mathbf{p}, \end{aligned}$$

where $\vec{n}_r = n_x\vec{i} + n_y\vec{j}$ is the gradient direction,

$$\mathbf{p} = (n_x \ \ x n_x \ \ y n_x \ \ n_y \ \ x n_y \ \ y n_y \ \ x^2 n_x + x y n_y \ \ x y n_x + y^2 n_y)^T,$$

and $\mathbf{w} = (w_1 \ \ w_2^1 \ \ w_3 \ \ w_4 \ \ w_5 \ \ w_6^1 \ \ w_7 \ \ w_8)^T$ is the vector of parameters.

We use the method described in Section 3.1 to compute the normal flow. For each edge point $\vec{r}_i$, we have one normal flow value $u_{n,i}$, which we use as an estimate of the normal displacement at the point, a vector $\mathbf{p}_i$ computed from $(x_i, y_i)$ and $\vec{n}_{r,i} = n_{x,i}\vec{i} + n_{y,i}\vec{j}$, and an approximate equation $\mathbf{w} \cdot \mathbf{p}_i \approx u_{n,i}$. Let the number of edge points be $N \geq 8$. We need to find a solution of $P\mathbf{w} - \mathbf{b} = \mathbf{e}$, where $\mathbf{b}$ is an $N$-element vector with elements $u_{n,i}$, $P$ is an $N \times 8$ parameter matrix with rows $\mathbf{p}_i$, and $\mathbf{e}$ is an $N$-element error vector. We seek the model $\mathbf{w}$ that minimizes $\|\mathbf{e}\| = \|\mathbf{b} - P\mathbf{w}\|$; the solution satisfies the system $P^T P \mathbf{w} = P^T \mathbf{b}$ and corresponds to the linear least squares (LS) solution. Model fitting for large image displacements sampled from the synthetic image sequence is also done using LS as described above.
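The LS fit above can be written compactly as follows. This is a minimal sketch with our own (hypothetical) helper names; build_row constructs the regressor $\mathbf{p}$ defined after (5) for one edge point.

```python
import numpy as np

def build_row(x, y, nx, ny):
    """Regressor p for one edge point, as defined after (5)."""
    return np.array([nx, x * nx, y * nx, ny, x * ny, y * ny,
                     x * x * nx + x * y * ny, x * y * nx + y * y * ny])

def fit_flow_model(points, normals, u_n):
    """Least-squares estimate of w in P w ~ b from edge points (x_i, y_i),
    gradient directions (n_x, n_y), and measured normal flow values u_n."""
    P = np.array([build_row(x, y, nx, ny)
                  for (x, y), (nx, ny) in zip(points, normals)])
    b = np.asarray(u_n, dtype=float)
    w, *_ = np.linalg.lstsq(P, b, rcond=None)  # solves P^T P w = P^T b
    return w, P, b
```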

3.3 Model Selection

Training data consists of normal flow or point displacements. Affine and quadratic models are responsible for data generation. The motion estimation (learning) problem corresponds to choosing the best motion model from a given set of possible motions using the observed (training) data. The goal is to choose a model that will yield the lowest error at the image points not used in training. In this section, we combine the SLT regression framework (see Section 2) and the motion estimation techniques described in the preceding sections to choose a flow model that has the best predictive performance.

We use the square loss function, i.e., the squared difference

$$R_{emp}(\mathbf{w}^m) = \frac{1}{N} \sum_{i=1}^{N} (u_{n,i} - u_{n,i}^m)^2$$

between the computed normal flow $u_{n,i}$ and the predicted normal flow $u_{n,i}^m = \mathbf{w}^m \cdot \mathbf{p}_i$, where $\mathbf{w}^m$ corresponds to the estimated model.


Fig. 1. Frames 2, 4, 6, 8, 10, and 12 from a 13-frame sequence of a moving arm.

Fig. 2. Normal flow computed from pairs of frames 2-3, 4-5, 7-8, and 10-11 of the moving arm sequence.


The task of model selection corresponds to choosing the best predictive model from a given set of linear parametric models, using a small set of noisy training data. We use the VC-generalization bounds (see (2) and (3)). The VC-dimension $h$ of a linear model is given by the number of degrees of freedom (DoF) of the model plus one. We choose from the following five models, obtained by setting various elements of $\mathbf{w}$ in (5) to zero (a model-selection sketch follows the list):

. [M1:] pure translation, $w_2^1 = w_3 = w_5 = w_6^1 = w_7 = w_8 = 0$, 2 DoF, $h = 3$.

. [M2:] translation, shear, and rotation, $w_2^1 = w_6^1 = w_7 = w_8 = 0$, 4 DoF, $h = 5$.

. [M3:] translation and scaling, $w_3 = w_5 = w_7 = w_8 = 0$, 4 DoF, $h = 5$.

. [M4:] 6-parameter affine, $w_7 = w_8 = 0$, 6 DoF, $h = 7$.

. [M5:] full affine, quadratic flow, 8 DoF, $h = 9$.
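A minimal sketch of this selection step (ours, not the paper's code): each candidate model keeps a subset of the columns of the matrix $P$ built in Section 3.2, is fit by LS, and is scored with the bound (2)-(3). The column-index sets follow the parameter ordering $(w_1, w_2^1, w_3, w_4, w_5, w_6^1, w_7, w_8)$.

```python
import numpy as np

# Columns of P retained by each model, with its VC-dimension h = DoF + 1.
MODELS = {
    "M1": ([0, 3], 3),              # pure translation
    "M2": ([0, 2, 3, 4], 5),        # translation, shear, and rotation
    "M3": ([0, 1, 3, 5], 5),        # translation and scaling
    "M4": ([0, 1, 2, 3, 4, 5], 7),  # 6-parameter affine
    "M5": (list(range(8)), 9),      # full quadratic flow
}

def vm_bound(r_emp, h, n):
    """Bound on prediction risk, (2) with penalization factor (3)."""
    p = h / n
    inner = 1.0 - np.sqrt(p - p * np.log(p) + np.log(n) / (2.0 * n))
    return np.inf if inner <= 0 else r_emp / inner

def select_motion_model(P, b):
    """Return (name, bound, w) of the model with the smallest VC bound."""
    n = len(b)
    best = None
    for name, (cols, h) in MODELS.items():
        w, *_ = np.linalg.lstsq(P[:, cols], b, rcond=None)
        r_emp = np.mean((b - P[:, cols] @ w) ** 2)
        bound = vm_bound(r_emp, h, n)
        if best is None or bound < best[1]:
            best = (name, bound, w)
    return best
```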

The next two sections present the results of experiments using both synthetic and real image sequences for motion interpolation and extrapolation. These results demonstrate the feasibility and the strengths of our approach. The experiments assume single (spatially and temporally coherent) motions for all data sets. The same parameter values apply to all frames of each data set.

4 EXPERIMENTAL RESULTS FOR SYNTHETIC IMAGE SEQUENCES

We generated a synthetic "first moving square" image sequence consisting of 11 frames (see Figs. 3a and 3c); the square consists of 128 pixels. A corresponding "noisy" sequence was created by adding Gaussian noise with mean 0 and variance 0.5 (see Figs. 3b and 3d). The ground truth corresponds to the affine transformation model M4 (see (4) and Section 3.3):

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} 2.699308 \\ -2.4648887 \end{pmatrix} + \begin{pmatrix} 0.991562 & 0.129631 \\ -0.129631 & 0.991562 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}.$$

The other motion models considered are: M1, pure translation, $w_2 = w_6 = 1$, $w_3 = w_5 = w_7 = w_8 = 0$; M2, translation and scaling, $w_3 = w_5 = w_7 = w_8 = 0$; M3, translation, shear, and rotation, $w_2 = w_6 = 1$, $w_7 = w_8 = 0$; and M5, the quadratic model.

The parameters $w_i$ have qualitative interpretations in terms of image motion. For example, $w_1$ and $w_4$ represent horizontal and vertical translation, respectively. Additionally, it is possible to express divergence (isotropic expansion), curl (rotation about the viewing direction), and deformation (squashing or stretching) as combinations of the $w_i$s, while the parameters $w_7$ and $w_8$ roughly represent the yaw and pitch deformations in the image plane. (A similar qualitative interpretation of the affine optical flow model is possible in terms of rotation, translation, scale, and skew.) Black et al. [5] point out that, for small regions of human body images such as eyes and fingers, the quadratic model may not be necessary and the motion of these regions can be approximated by the simpler affine model defined earlier, in which the terms $w_7$ and $w_8$ are zero.
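Generating such a sequence is straightforward. The sketch below is ours (variable names are hypothetical); it applies the ground-truth M4 affine map to the square's pixel coordinates and, assuming the noise perturbs the point correspondences, adds zero-mean Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth M4 affine map of the "first moving square" sequence.
t = np.array([2.699308, -2.4648887])
A = np.array([[0.991562, 0.129631],
              [-0.129631, 0.991562]])

def advance(points, noise_var=0.0):
    """Map square pixels (N x 2 array of (x, y)) to the next frame,
    optionally adding Gaussian noise of the given variance."""
    moved = points @ A.T + t
    if noise_var > 0:
        moved = moved + rng.normal(0.0, np.sqrt(noise_var), moved.shape)
    return moved

# Example: a 128-pixel square advanced over 10 frame transitions.
# frames = [square]; frames.append(advance(frames[-1], noise_var=0.5)); ...
```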

For interpolation experiments, we randomly subsample $n = 32$ or 64 pixel correspondences (out of 128) from 10 successive pairs of frames $(\langle i, i+1 \rangle,\ i = 1, \ldots, 10)$ and estimate the parameters for each of the motion models using LS (see Section 3.2). Note that estimation is done independently for the $x$ and $y$ coordinates. As a consequence, the VC-dimension $h$ for M1, M2, M3, M4, and M5 is now 2, 3, 3, 4, and 5, respectively. The bound on risk (see (2) and (3)) for each model is derived using the LS error and the penalty vm (3). The corresponding (to each model) total interpolation error (for all frame pairs) is calculated using all 128 points and the ground truth information. Note that the total error is an estimate of the (unknown) prediction risk in the VC-theoretical formulation. The experiment is repeated 100 times with different random realizations of training data, for both nonnoisy and noisy image sequences. The average interpolated sequences are shown in Fig. 3. The box plots (see Fig. 4) summarizing the empirical risk, the bound on prediction risk, and the total error across the whole sequence, for $n = 32$, show that the bound on prediction risk can be used for model selection for both nonnoisy and noisy sequences. Note that the bound on prediction risk is a better predictor than the empirical risk. The empirical risk, the bound on prediction risk, and the (total) interpolation error for M3 are too large to be displayed. Ground truth M4 is consistently found as the optimal motion model; its interpolation error is minimum. The quadratic model M5 is a very close runner-up to M4. Similar results were obtained for experiments with $n = 64$.

For extrapolation ("tracking") experiments, we randomly subsample $n = 32$ or 64 pixel correspondences (out of a stack of $5 \times 128$ correspondences) from the first five pairs of frames $(\langle i, i+1 \rangle,\ i = 1, \ldots, 5)$ and estimate the parameters for each of the models using LS (see Section 3.2). Note that the data available for extrapolation is much less than the data available for the interpolation experiment. Here again, estimation is done independently for the $x$ and $y$ coordinates. As a consequence, the VC-dimension $h$ for M1, M2, M3, M4, and M5 is now 2, 3, 3, 4, and 5, respectively. The bound on risk for each model is derived using the LS error and the penalty vm (3).


Fig. 3. Motion interpolation results for a noise-free (a) and a noisy sequence (b). Motion extrapolation results for a noise-free (c) and a noisy sequence (d). The pixels corresponding to the ground truth (M4) and the average estimated positions are shown.


The corresponding extrapolation error for the remaining frames is calculated using the ground truth data, starting from image frame 6. The experiment is repeated 300 times with different random realizations of training data for nonnoisy and noisy image sequences. The average extrapolated sequences are shown in Fig. 3. Fig. 5 shows the bound on prediction risk for each motion model, while Tables 1 and 2 show the median extrapolation error for each of the remaining five image frames for all models. It can be seen that the bound on prediction risk yields good model selection. The bound on prediction risk for M3 is too large to be displayed. Ground truth M4 is found as the optimal model for motion tracking and its extrapolation error is minimum. The quadratic model M5 is again a very close runner-up to the ground truth. Similar results were obtained for experiments with $n = 64$. It can be seen from Tables 1 and 2 that the second ranked model, M5, can keep track with the optimal model M4 only for the first two extrapolated frames (6 and 7).

In the next experiment, we generate a "first moving square" synthetic image sequence consisting of 11 frames. To create the corresponding noisy sequence, we add Gaussian noise (mean 0 and variance 0.5). The motion is a quadratic transformation, M5, with the center of the square as the reference point, given by

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} 4.386585 \\ 4.983113 \end{pmatrix} + \begin{pmatrix} 0.965926 & 0.258819 \\ -0.258819 & 0.965926 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} x^2 & xy \\ xy & y^2 \end{pmatrix} \begin{pmatrix} 0.002140 \\ 0.006435 \end{pmatrix}.$$

For interpolation experiments, we randomly subsample $n = 32$ or 64 pixel correspondences (out of 128) from 10 successive pairs of frames $(\langle i, i+1 \rangle,\ i = 1, \ldots, 10)$ and estimate the parameters of the motion models using LS (see Section 3.2). The bound on prediction risk for each model is derived using the LS error and the penalty vm (3). The corresponding (to each model) interpolation error (for all frame pairs) is calculated using all 128 points and the ground truth information. The experiment is repeated 100 times with different random realizations of training data.


Fig. 4. Summary of interpolation results for the "first moving square" sequence, n = 32. The ground truth motion model is M4. Left to right: empirical risk, bound on prediction risk, and interpolation error. Top row: nonnoisy sequence. Bottom row: noisy sequence.

Fig. 5. Bounds on prediction risk for the "first moving square" sequence, n = 32. The ground truth motion model is M4. (a) Nonnoisy sequence. (b) Noisy sequence.


The box plots summarizing the empirical risk, bound on prediction risk, and total error for the sequence (see Fig. 6) show that the bound on prediction risk can be used for model selection in motion estimation for both nonnoisy and noisy image sequences. The empirical risk, bound on prediction risk, and interpolation error for M3 are too large to be displayed. The ground truth model, M5, is consistently found as the optimal motion model; its interpolation error is minimum. Similar results to those shown in Fig. 6 were obtained for $n = 64$.

For extrapolation ("tracking") purposes, we randomly subsample $n = 32$ or 64 pixel correspondences (out of a stack of $5 \times 128$ correspondences) from $(\langle i, i+1 \rangle,\ i = 1, \ldots, 5)$ and estimate the parameters for each of the models using LS (see Section 3.2). Note that the data available for extrapolation is much less than the data available for the interpolation experiment. The corresponding extrapolation error for the remaining frames is calculated using the ground truth data starting from frame 6. The experiment is repeated 300 times with different random realizations of training data. Fig. 7 shows the bound on prediction risk for each motion model. Tables 3 and 4 show the median extrapolation error for each of the remaining five image frames for all models. It can be seen that the bound on prediction risk yields good model selection. The bound on prediction risk for M3 is too large to be displayed. The ground truth M5 is found as the optimal model for motion tracking and its extrapolation error is minimum. Similar results were obtained for experiments with $n = 64$.

As could be expected, for the synthetic image sequences, both the bound on prediction risk and the interpolation and extrapolation errors were consistently decreasing as we increased the number of sample points from 16 to 128 (see Fig. 8). Note that the bound on prediction risk for the ground truth model is consistently lower than the bound on prediction risk for the next best model.


TABLE 1. Median Extrapolation Error for the Nonnoisy Sequence, n = 32, the Ground Truth is M4

TABLE 2. Median Extrapolation Error for the Noisy Sequence, n = 32, the Ground Truth is M4

Fig. 6. Summary of interpolation results for the "first moving square" sequence, n = 32. The ground truth motion model is M5. Left to right: empirical risk, bound on prediction risk, and interpolation error. Top row: nonnoisy sequence. Bottom row: noisy sequence.


Similar results for both interpolation and extrapolation were obtained if the samples were selected randomly across all the training frames, which we label as the "whole" interpolation or extrapolation, rather than the "standard" interpolation using random sampling from successive pairs of frames and then pooling the data together.

4.1 Comparative Analysis of Model Selection Criteria

Various analytic prediction risk estimates have been proposed in the statistical literature for model selection (2), using the observed empirical risk and a penalization factor $r$. We performed a comparative analysis of model selection criteria for different classical forms of the penalization factor $r(p) = r(d/n)$ in (2), which have been proposed in the statistical literature. These penalization factors are listed below and compared to the Vapnik measure vm (3).

. Final prediction error (fpe) [1]: $r(p) = (1 + p)(1 - p)^{-1}$.

. Schwartz's criterion (sc) [27]: $r(p, n) = 1 + \frac{\ln n}{2}\, p\,(1 - p)^{-1}$.

. Generalized cross-validation (gcv) [11]: $r(p) = (1 - p)^{-2}$.

. Shibata's model selector (sms) [26]: $r(p) = 1 + 2p$.
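For reference, these factors are easy to tabulate side by side. The sketch below is our own helper (with the gcv form $(1-p)^{-2}$ as given above); it evaluates all five penalization factors for a given model complexity and sample size:

```python
import numpy as np

def penalization_factors(d, n):
    """Classical penalization factors r(p) with p = d/n, plus the Vapnik
    measure vm of (3) evaluated with p = h/n, h = d + 1."""
    p = d / n
    ph = (d + 1) / n
    inner = 1.0 - np.sqrt(ph - ph * np.log(ph) + np.log(n) / (2.0 * n))
    return {
        "fpe": (1 + p) / (1 - p),
        "sc":  1 + (np.log(n) / 2) * p / (1 - p),
        "gcv": 1 / (1 - p) ** 2,
        "sms": 1 + 2 * p,
        "vm":  np.inf if inner <= 0 else 1.0 / inner,
    }

# Example: d = 4 degrees of freedom, n = 32 training samples.
print(penalization_factors(4, 32))
```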

Note that, in this experiment, all model selection criteria require two regressions. All these classical approaches are motivated by asymptotic arguments for linear models and, therefore, apply well for large training sets. In fact, for large $n$, the prediction estimates provided by fpe, gcv, and sms are asymptotically equivalent. The penalization factor $r$ inflates the average residual sum of squares for increasingly complex models. Note in particular that the fpe risk follows from the general Akaike Information Criterion (AIC) when the noise variance is estimated via the empirical risk for each chosen model complexity [9]. AIC is derived under a very restrictive setting such as asymptotic linear models and a known noise model. AIC does not indicate how to estimate the noise model, which is required for model selection and is assumed to be known.

Our experimental results show that the Vapnik measure compares favorably against alternative model selection methods regarding the confidence they offer in model selection for motion estimation (see Table 5). The results shown in the table are for standard interpolation, whole interpolation, and standard extrapolation. The entries show the percentage of time that each model selection criterion is accurate in its predictions. Please note that Table 5 is about choosing the right model rather than measuring the prediction error. Recall that the goal for VC-theory is to select the model providing the best prediction accuracy rather than selecting the correct model. Hence, it is possible that the VC-based model selection approach may select the "wrong" model providing the best generalization (prediction) accuracy [9], [10]. For example, Cherkassky and Mulier [9, pp. 28-29] provide experimental evidence from an experiment where a simpler linear decision rule, which does not match the underlying class distributions, often performs better than the ground truth quadratic decision rule.


TABLE 3. Median Extrapolation Error for the Nonnoisy Sequence, n = 32, the Ground Truth is M5

TABLE 4. Median Extrapolation Error for the Noisy Sequence, n = 32, the Ground Truth is M5

Fig. 7. Bounds on prediction risk for the "first moving square" sequence, n = 32. The ground truth motion model is M5. (a) Nonnoisy sequence. (b) Noisy sequence.


5 EXPERIMENTAL RESULTS FOR REAL IMAGE SEQUENCES

Training data comes from eleven frames drawn from a real image sequence of a moving arm and the corresponding normal flow (see Figs. 1 and 2). We report the results obtained for both interpolation and extrapolation (using (5) and Section 3.3). The ground truth is not known and the images are inherently noisy. In the interpolation experiment, we randomly subsampled 25 percent of the image flow values (out of approximately 400 points); the experiment is repeated 100 times with different random realizations of training data. Training data is drawn from two frame pairs, $(i, i+1)$ and $(i+2, i+3)$; interpolation is performed for frames $(i+1, i+2)$. Fig. 9 summarizes the prediction risk and interpolation error for the whole experiment. The prediction risk ranks the motion models so that its optimal choice, M2, yields the minimum total interpolation error.

In the extrapolation experiment, we randomly subsampled 10 percent of the image flow values (out of approximately 400 points); the experiment is repeated 100 times with different random realizations of training data. Training data was drawn from the first five pairs of frames, $(i, i+1)$ starting with $i = 1$. Extrapolation is performed for the remaining five frames of the sequence. Fig. 10 summarizes the prediction risk and extrapolation error for the whole experiment.


Fig. 8. Bounds on prediction risk as a function of the number of samples for a nonnoisy (a) and a noisy sequence (b). The ground truth models are M4 and M5.

TABLE 5. Comparison of Model Selection Criteria on the "First Moving Square" Synthetic Image Sequence


The prediction risk ranks the motion models so that its optimal choice, M2, yields the minimum total extrapolation error.

To further illustrate the workings of the extrapolation experiment, Fig. 11 shows the predicted position of hand contours for each of the motion models. The starting hand contour, the predicted hand contour, and the actual hand contour after two frames are displayed using pluses, triangles, and dots, respectively. It can be seen that the optimal motion model, M2, predicts the hand contour closest to the actual hand contour position. Figs. 9 and 10 show that a simple model, M2, explains the data as well as more complex models. Since the hand moves in a plane, M5 is probably the true model.

6 CONDITION NUMBER AND THE APERTURE PROBLEM

In Section 3.2, we showed how to estimate the flow parameters $\mathbf{w}$ by solving the LS problem $\min \|P\mathbf{w} - \mathbf{b}\|$. The condition number $\kappa_2(P)$ is computed as the ratio of the largest and the smallest singular values of $P$:

$$\kappa_2(P) = \frac{\sigma_{\max}(P)}{\sigma_{\min}(P)}. \qquad (6)$$

The sensitivity in estimating $\mathbf{w}$ is roughly proportional to

$$\varepsilon\,(\kappa_2(P) + \rho_{LS}\,\kappa_2^2(P)), \qquad (7)$$

where $\rho_{LS}$ is the magnitude of the residual of the LS solution and $\varepsilon = \|\delta \mathbf{b}\| / \|\mathbf{b}\|$ is the relative error in $\mathbf{b}$. The normal flow is computed with subpixel accuracy [12] and $\varepsilon = O(0.01)$. Since, in the examples presented here, $\rho_{LS} = O(1)$, it can be seen that condition numbers greater than 10 are undesirable. A large condition number typically corresponds either to inappropriate scaling of the columns of $P$ or to the aperture problem.

Scaling of the columns of $P$ is handled as follows. When the columns of $P$ have different scales, the problem can be fixed by scaling the columns before solving the LS problem. The original problem is replaced by $\min \{\|(PG)\mathbf{y} - \mathbf{b}\|\}$. $G$ is chosen to be a diagonal matrix whose elements are $\|P(:, i)\|^{-1}$, where $P(:, i)$ is the $i$th column of $P$. If the matrix $PG$ is well conditioned, $\mathbf{y}$ is estimated using the LS method and $\mathbf{w} = G\mathbf{y}$ is computed. In the experiments with a moving forearm (see Figs. 1 and 2 and Section 5), typical values of $\kappa_2(P)$ are in the range of 1 to 2 for M1, in the range of 30 to 40 for M2 and M3, in the range of 40 to 50 for M4, and in the range of 3,000 to 4,000 for M5. Scaling brings all these condition numbers below 5, i.e., $\kappa_2(PG) < 5$. This scaling procedure has been implemented and used for the experiments using real image sequences (see Section 5).
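A minimal sketch of this column scaling and of the condition number computation (our own function names, not the authors' code):

```python
import numpy as np

def scale_columns(P):
    """Return PG and G, where G is diagonal with entries 1 / ||P(:, i)||."""
    g = 1.0 / np.linalg.norm(P, axis=0)
    G = np.diag(g)
    return P @ G, G

def condition_number(M):
    """kappa_2(M): ratio of the largest to the smallest singular value."""
    s = np.linalg.svd(M, compute_uv=False)
    return s[0] / s[-1]

# Typical use: check conditioning before and after scaling; if PG is well
# conditioned, solve (PG) y ~ b by LS and recover w = G @ y.
# PG, G = scale_columns(P)
# print(condition_number(P), condition_number(PG))
```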

When the condition number after scaling, $\kappa_2(PG)$, is still too large, it can be said that the data is not appropriate for parameter estimation due to the aperture problem [28]. Note that the aperture problem refers to the fact that the flow cannot be estimated from the given normal flow due to an inappropriate distribution of feature points. This distribution is reflected in the data matrix $P$. In this case, the solution is not provided by standard LS. The problem has to be solved by minimizing the penalized risk functional

$$R_{pen}(\mathbf{y}) = \frac{1}{n}\left(\|(PG)\mathbf{y} - \mathbf{b}\|^2 + \mathbf{y}^T \Lambda \mathbf{y}\right), \qquad (8)$$

where $\Lambda$ is a symmetric and nonnegative definite penalty matrix [9].


Fig. 9. Interpolation results for the moving arm sequence. (a) Bounds on prediction risk. (b) Average square error.

Fig. 10. Extrapolation results for the moving arm sequence. (a) Prediction risk. (b) Square error.


A reasonable choice of the penalty term is the ridge regression penalty $\Lambda = \lambda I$, where $I$ is an identity matrix [15]. Solving the following modified least squares problem minimizes $R_{pen}(\mathbf{y})$ in (8):

. Create the modified data matrices

$$U = \begin{pmatrix} PG \\ \sqrt{\lambda}\, I \end{pmatrix}, \qquad \mathbf{v} = \begin{pmatrix} \mathbf{b} \\ \mathbf{0} \end{pmatrix},$$

where $\mathbf{0}$ is a column vector of zeroes.

. Minimize the empirical risk functional $R_{emp} = \frac{1}{n}\|U\mathbf{y} - \mathbf{v}\|$. The minimization is done by solving for $\mathbf{y}$ by the LS method. Finally, compute $\mathbf{w} = G\mathbf{y}$.

. Compute the effective DoF for the penalized problem as $\mathrm{DoF} = \sum_{i=1}^{m} \frac{\sigma_i^2}{\sigma_i^2 + \lambda}$, where $\sigma_i$ are the singular values of $PG$.
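These three steps can be collected into a short routine; the sketch below is ours (hypothetical names), with the column scaling $G$ recomputed inline:

```python
import numpy as np

def ridge_flow_fit(P, b, lam):
    """Penalized LS of (8) with ridge penalty lam * I, solved through the
    augmented system U y ~ v; returns w = G y and the effective DoF."""
    G = np.diag(1.0 / np.linalg.norm(P, axis=0))   # column scaling
    PG = P @ G
    m = PG.shape[1]
    U = np.vstack([PG, np.sqrt(lam) * np.eye(m)])  # modified data matrix
    v = np.concatenate([b, np.zeros(m)])           # b padded with zeros
    y, *_ = np.linalg.lstsq(U, v, rcond=None)
    w = G @ y
    sigma = np.linalg.svd(PG, compute_uv=False)
    dof_eff = np.sum(sigma ** 2 / (sigma ** 2 + lam))  # effective DoF
    return w, dof_eff
```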

The parameter $\lambda$ is chosen to make the condition number $\kappa_2(U)$ small. For illustration purposes, experiments with $\kappa_2(U) = 10$ and $\kappa_2(U) = 100$ were performed for the example shown in Fig. 12. The data comes from a small region along the arm shown in Figs. 1 and 2. In this example, the models M1, M2, and M3 have small values of the condition numbers of the $PG$ matrices. The condition numbers of the data matrices $PG$ for both M4 and M5, however, are very large ($> 10^{16}$) and their effective ranks are 5 and 7, respectively.


Fig. 11. Tracking results for the real image sequence. Left to right and top to bottom are shown the extrapolation results for M1, M2, M3, M4, and M5. Each figure shows the starting hand contour, the predicted hand contour after two frames, and the actual hand contour after two frames.


The estimated empirical risks for models M1-M5 are (0.552, 0.015, 0.3, 0.0139617, 0.0131344) and the corresponding prediction risks are (1.364, 0.049, 0.978, 0.059, 0.072). Note that, based on (2) and (3), M2 would be chosen. However, if there were more feature points, either M4 or M5 could have been chosen, since they have smaller empirical risk values than the other models; since they are poorly conditioned, however, penalized solutions have to be used. (Note that the penalization factor is heavy for small numbers of feature points.) The relevant results are as follows. In the case of the affine model M4, the penalization coefficients $\lambda = (0.017, 0.17)$ result in condition numbers $(100, 10)$, empirical risk values $(0.0139622, 0.0190552)$, effective DoF values $(4.9969, 4.7258)$, and prediction risk values $(0.052, 0.068)$. The effective DoF is a crude estimate of the VC-dimension. It can be seen that the larger the penalization term, the more bias is introduced. In the case of the quadratic model (M5), the penalization coefficients $\lambda = (0.020, 0.203)$ result in condition numbers $(100, 10)$, empirical risk values $(0.0131595, 0.022909)$, effective DoF values $(6.1726, 5.2896)$, and prediction risk values $(0.057, 0.088)$. Note that, for small penalization factors, the empirical risk goes up slightly, but the prediction risk goes down due to the lowered effective DoF.

Note that the bound on the prediction risk will be very high for higher order models. The model choices are biased toward lower order models rather than zero velocity motion. Zero velocity motion results only from LS parameter estimation errors. Lack of data, which is characteristic of the aperture problem, biases the choice toward selecting lower order models.

7 CONCLUSIONS

This paper describes a novel application of Statistical Learning Theory to optimal model selection, with applications to single motion estimation and tracking from small data sets of image measurements (flow). This is accomplished without using restrictive assumptions such as asymptotic settings and/or Gaussian noise. The experimental results, using both synthetic and real image sequences, demonstrate the feasibility and strengths of our approach for motion model selection using SLT. Our experimental results also show that our approach compares favorably against alternative model selection methods regarding the confidence they offer on motion estimation. The paper also shows how to address the aperture problem using SLT-based model selection under a penalized regression formulation.

In practical computer vision applications, one is likely to encounter two modifications of the basic formulation for model selection and motion estimation used in this paper. Namely, the type of motion can change (at some unknown time moments); this is known as the temporal partitioning problem. Also, different portions of an image may undergo different types of motion; this is known as the spatial partitioning problem. Here, the nature of the learning problem changes, since the primary goal of learning/estimation becomes to partition the data into two subsets: inlier samples that will be used for estimating motion parameters and outlier samples that will be ignored or downplayed. We are presently working on those problems using methodological aspects of SLT, in general, and robust Support Vector Machine (SVM) regression, in particular.

Another aspect relevant to many CV applications, i.e., motion analysis, and presently under investigation is the need to identify/estimate several motions from a given data set. In this case, the goal of robust learning is to estimate different models for each appropriately chosen subset of the original data set. The resulting problem of robust multiple model estimation is intrinsically more difficult than the standard problem of single model estimation, since the former involves simultaneous partitioning of the original data into several subsets and estimating a model (structure) for each subset.

ACKNOWLEDGMENTS

The work of V. Cherkassky was supported, in part, by the US National Science Foundation grant ECS-0099906.

REFERENCES

[1] H. Akaike, "Statistical Predictor Information," Ann. Inst. of Statistical Math., vol. 22, pp. 203-217, 1970.
[2] Y. Aloimonos and Z. Duric, "Estimating Heading Direction Using Normal Flow," Int'l J. Computer Vision, vol. 13, pp. 33-56, 1994.
[3] A. Barron, "Universal Approximation Bounds for Superposition of a Sigmoid Function," IEEE Trans. Information Theory, vol. 39, pp. 930-945, 1993.
[4] M.J. Black and P. Anandan, "The Robust Estimation of Multiple Motions: Parametric and Piecewise-Smooth Flow Fields," Computer Vision and Image Understanding, vol. 63, pp. 75-104, 1996.
[5] M.J. Black, Y. Yacoob, and S.X. Ju, "Recognizing Human Motion Using Parametrized Models of Optical Flow," Motion-Based Recognition, S. Mubarak and R. Jain, eds., pp. 245-269, Kluwer, 1997.
[6] M.J. Black, D.J. Fleet, and Y. Yacoob, "Robustly Estimating Changes in Image Appearance," Computer Vision and Image Understanding, vol. 78, pp. 8-31, 2000.
[7] K. Bubna and C.V. Stewart, "Model Selection and Surface Merging in Reconstruction Algorithms," Proc. Conf. Computer Vision and Pattern Recognition, pp. 895-902, 1998.
[8] K. Bubna and C.V. Stewart, "Model Selection Techniques and Merging Rules for Range Data Segmentation Algorithms," Computer Vision and Image Understanding, vol. 80, pp. 215-245, 2000.
[9] V. Cherkassky and F. Mulier, Learning from Data. Wiley, 1998.
[10] V. Cherkassky, X. Shao, F. Mulier, and V. Vapnik, "Model Selection for Regression Using VC-Generalization Bounds," IEEE Trans. Neural Networks, vol. 10, pp. 1075-1089, 1999.
[11] P. Craven and G. Wahba, "Smoothing Noisy Data with Spline Functions," Numerische Math., vol. 31, pp. 377-403, 1979.
[12] Z. Duric, F. Li, Y. Sun, and H. Wechsler, "Using Normal Flow for Detection and Tracking of Limbs in Color Images," Proc. Int'l Conf. Pattern Recognition, 2002.
[13] J.H. Friedman, "An Overview of Predictive Learning and Function Approximation," From Statistics to Neural Networks: Theory and Pattern Recognition Applications, V. Cherkassky, J.H. Friedman, and H. Wechsler, eds., NATO ASI Series F, vol. 136, Springer, 1994.
[14] F. Girosi, "Regularization Theory, Radial Basis Functions and Networks," From Statistics to Neural Networks: Theory and Pattern Recognition Applications, V. Cherkassky, J.H. Friedman, and H. Wechsler, eds., NATO ASI Series F, vol. 136, Springer, 1994.
[15] G.H. Golub and C.F. Van Loan, Matrix Computations, third ed. Johns Hopkins Univ. Press, 1996.


Fig. 12. Normal flow vectors from a small region along the arm in Fig. 2.



Harry Wechsler received the PhD degree in computer science from the University of California, Irvine, in 1975. He is presently a professor of computer science and the director for the Center of Distributed and Intelligent Computation at George Mason University (GMU), http://cs.gmu.edu/~wechsler/DIC_Center/. His research, in the field of intelligent systems, has been in the areas of perception: automatic target recognition, computer vision, signal and image processing; machine intelligence: data mining, neural networks, pattern recognition, and statistical learning theory; evolutionary computation: animats and swarm intelligence, and genetic algorithms; biometrics: face recognition, performance evaluation, and surveillance; and human-computer intelligent interaction: hand gesture recognition, smart interfaces, and video tracking and interpretation of human activities. He was the director for the NATO Advanced Study Institutes (ASI) on "Active Perception and Robot Vision" (Maratea, Italy, 1989), "From Statistics to Neural Networks" (Les Arcs, France, 1993), and "Face Recognition: From Theory to Applications" (Stirling, UK, 1997). He served as cochair for the International Conference on Pattern Recognition held in Vienna, Austria, in 1996. He has authored more than 200 scientific papers and the book Computational Vision (Academic Press, 1990). He was the editor for Neural Networks for Perception, vols. 1 and 2 (Academic Press, 1990), and the principal coeditor for Face Recognition: From Theory to Applications (Springer-Verlag, 1998). He was elected as an IEEE fellow in 1992 and as an IAPR (International Association for Pattern Recognition) fellow in 1998.

Zoran Duric received the PhD degree in computer science from the University of Maryland at College Park in 1995. He is an associate professor of computer science at George Mason University. From 1982 to 1989, he was a member of the research staff in the Division for Vision and Robotics, Energoinvest Institute for Control and Computer Science, Sarajevo. He was also affiliated with the Electrical Engineering Department of the University of Sarajevo. From 1995 to 1997, he was an assistant research scientist at the Machine Learning and Inference Laboratory at George Mason University and at the Center for Automation Research at the University of Maryland. From 1996 to 1997, he was also a visiting assistant professor at the Computer Science Department of George Mason University. He joined the faculty of George Mason University in the Fall of 1997 as an assistant professor of computer science. His research interests include computer vision, human-computer interaction, motion analysis, machine learning in vision, and information hiding in images. He is a senior member of the IEEE and a member of the IEEE Computer Society.

Fayin Li received the BS degree in electrical engineering from Huazhong University of Science and Technology, China, in 1996, and the MS degree in computer science from the Institute of Automation, Chinese Academy of Sciences, China, in 1999. He is currently working toward the PhD degree at George Mason University, Fairfax, VA. His research interests include automatic gesture recognition, face recognition, video tracking and surveillance, image processing, pattern recognition, and model selection.

Vladimir Cherkassky received the PhD degree in electrical engineering from the University of Texas at Austin in 1985. He is a professor of electrical and computer engineering at the University of Minnesota. His current research is on methods for predictive learning from data, and he has coauthored a monograph, Learning from Data, published by Wiley in 1998. He has served on editorial boards of several journals, including IEEE Transactions on Neural Networks, Neural Networks, and Neural Processing Letters. He served on the program committee of major international conferences on Artificial Neural Networks. He was Director of the NATO Advanced Study Institute (ASI) From Statistics to Neural Networks: Theory and Pattern Recognition Applications held in France in 1993. He presented numerous tutorials and invited lectures on neural network and statistical methods for learning from data. He is a senior member of the IEEE and a member of the IEEE Computer Society.

