Ball Detection via Machine Learning
RAFAEL OSORIO
Master of Science Thesis, Stockholm, Sweden 2009
Master’s Thesis in Computer Science (30 ECTS credits)
at the School of Computer Science and Engineering
Royal Institute of Technology, year 2009
Supervisor at CSC was Örjan Ekeberg
Examiner was Anders Lansner
TRITA-CSC-E 2009:004
ISRN-KTH/CSC/E--09/004--SE
ISSN-1653-5715
Royal Institute of Technology
School of Computer Science and Communication
KTH CSC
SE-100 44 Stockholm, Sweden
URL: www.csc.kth.se
Abstract
This thesis evaluates a method for real-time detection of footballs in low
resolution images. The company Tracab uses a system of 8 camera pairs that
cover the whole pitch during a football match. By using stereo vision it is
possible to track the players and the ball, to be able to extract statistical data. In
this report a method proposed by Viola and Jones is evaluated to see if it can be
used to detect footballs in the images extracted by the cameras. The method is
based on a boosting algorithm called Adaboost and has mainly been used for
face detection. A cascade of boosted classifiers is trained from positive and
negative example images of footballs. The objects in this report are much
smaller than the typical objects that the method has been developed for, and a
question that this thesis tries to answer is whether the method is applicable
to objects of such small sizes.
The Support Vector Machine (SVM) method has also been tested to see if the
performance of the classifier can be improved. Since the SVM method is a
time-consuming method it has been tested as a last step in the classifier
cascade, using features selected by the boosting process as input.
In addition to this, a database of images of footballs from 6 different matches
consisting of 10317 images used for training and 2221 images used for testing
has been produced. Results show that detection can be made with improved
performance compared to Tracab’s existing software.
Sammanfattning (translated from Swedish)
Ball Detection via Machine Learning
This report examines a method for real-time detection of footballs in
low-resolution images. The company Tracab uses 8 camera pairs that together
cover a whole football pitch during a match. With the help of stereo vision it
is possible to follow the players and the ball, and then to offer statistics
to fans. This report evaluates a method developed by Viola and Jones to see
whether it can be used to detect footballs in the images from the 16 cameras.
The method is based on a boosting algorithm called Adaboost that has mainly
been used for face detection. A cascade of boosted classifiers is trained from
positive and negative example images of footballs. The images used in this
report show small balls that are smaller than the usual objects the method was
created for. A question this report tries to answer is whether the method is
applicable to such small objects.
Support Vector Machines (SVM) have also been tested to see if the performance
of the classifier can be improved. Since SVM is a slow method it has been
integrated as a last step in the trained cascade. Features from the Viola and
Jones method have been used as input to the SVM.
A database consisting of a training set and a test set has been created from 6
matches. The training set consists of 10317 images and the test set of 2221
images. The results show that detection can be done with higher precision
compared to Tracab’s current software.
Contents
Introduction
1.1 Background
1.2 Objective of the thesis
1.3 Hit rate vs. false positive rate
1.4 Related Work
1.5 Thesis Outline
Image Database
2.1 Ball tool
2.2 Images
2.2.1 Training set
2.2.2 Negatives
2.2.3 Test set
2.2.4 Five-a-side
2.2.5 Correctness
Theoretical background
3.1 Overview
3.2 Features
3.2.1 Haar features
3.2.2 Integral Image
3.5 AdaBoost
3.5.1 Analysis
3.5.2 Weak classifiers
3.5.3 Boosting
3.6 Cascade
3.6.1 Bootstrapping
3.7 Support Vector Machine
3.7.1 Overfitting
3.7.2 Non-linearly separable data
3.7.3 Features extracted with Adaboost
3.8 Tying it all together
Method
4.1 Training
4.2 Step size and scaling
4.3 Masking out the audience
4.4 Number of stages
4.5 Brightness threshold
4.6 SVM
4.7 OpenCV
Results
5.1 ROC-curves
5.2 Training results
5.3 Using different images for training
5.3.1 Image size
5.3.2 Image sets
5.3.3 Negative images
5.4 Step Size
5.5 Real and Gentle Adaboost
5.6 Minimum hit rate and max false alarm rate
5.7 Brightness threshold
5.8 Number of stages
5.9 Support Vector Machine
5.10 Compared to existing detection
5.11 Five-a-side
5.12 Discussion
Conclusions and future work
6.1 Conclusions
6.2 Future work
Bibliography
Appendix 1
Training set:
Test set 1:
Chapter 1
Introduction
In this chapter the circumstances of the problem are presented as well as the
goal of the thesis. Related work is described and an outline of the thesis is
given.
1.1 Background
This Master’s thesis was performed at Svenska Tracab AB. Tracab has
developed real-time camera-based technology for locating the positions of
football players and the ball during football matches. Eight pairs of cameras are
installed around the pitch, controlled by a cluster of computers. Fig 1 shows
how pairs of cameras give stereo vision and how this makes it possible to
calculate the X and Y coordinates of an object on the pitch.
Fig 1 - Eight camera pairs cover the pitch giving stereo vision.
With this information it is possible to extract statistics such as total distance
covered by a player, a heat map where warmer color means that the player has
spent more time in that area of the pitch, completed passes, speed and
acceleration of the ball and of the players and a lot more. The whole process is
carried out in real-time (25 times per second). The system is semi-automatic
and is staffed with operators during the game. All moving objects that are
player-like are shown as targets by the system. The operators need to assign the
players to a target, since no face recognition or shirt number identification
is done to identify the players. They must also remove targets that are not
subjects of tracking, e.g. medics and ball boys.
One big advantage of the system is that it does not interfere with the game in
any way. No transmitters or any other kind of device on the players or the ball
are used.
1.2 Objective of the thesis
The objective of this Master’s thesis is to improve the ball detection using
machine learning techniques. Today the existing ball tracking method primarily
uses the movement of an object to recognize the ball, rather than its appearance.
In this report we will see if it is possible to shift the focus from using the
movement to doing object detection in every frame. A key attribute of the
method used is that it has to be fast enough for real-time usage.
Tracab’s technology is already good at detecting moving balls against a static
background, so an aim for this project is to produce reasonable ball hypotheses
in more difficult situations such as:
- The ball is partially occluded by players.
- The lighting conditions are uneven, especially when the sun lights up only
  part of the pitch.
- Other objects, like the head or the socks of a player, look like the ball.
- The ball is still, e.g. at a free kick.
A classifier is to be trained to detect footballs, based on a labeled data set of
ball / non-ball image regions from images captured by Tracab’s cameras. In
this report, an image region denotes a smaller sub-window that is part of the
whole image (left of figure 2), while an image denotes the whole image (right
of figure 2).
Fig 2 Example of an image region and an image captured by Tracab’s cameras.
The classifier needs to be somewhat robust to changes in ball size and
preferably also ball color since they differ in different situations. One big
difference between this project and previous studies of object detection such as
the paper of Viola and Jones is the size of the object [32]. Here it is very small,
only a few pixels wide. A key question is whether the method presented in this
report can be applied to objects of this size.
Even with reasonably good detection of the ball it is difficult to tell apart the
ball from other objects using only techniques based on the analysis of still
images. One way of solving this is by examining the trajectory of the object in
a sequence of images to discard objects that do not move like a ball. Also, if the
classifier detects the ball most of the time, only missing out a few frames at a
time, it is possible to do post processing to calculate the most likely ball path
between two detections. These two steps are already used today at Tracab and
are not a part of this thesis. Results from this thesis can hopefully be used
to provide more accurate detections, improving the data input to these steps
and reducing the amount of calculation that needs to be done in them.
The aim for this thesis is to evaluate if a machine learning approach can be
used to detect a small object such as a ball. More specifically, different
algorithms based on the work of Viola and Jones, which uses Adaboost, will be
evaluated [32]. In addition to this an extension to their work using SVMs at the
last stage has been tested and evaluated, inspired by the study by Le and Satoh
[13]. An overview of how the methods are combined can be seen in fig 3.
Fig 3 Overview. Image regions of an image are extracted and a cascade of classifiers trained
with Adaboost is run for each image region. A bright pixel is searched for before running an
SVM classifier on the regions that have not been rejected in earlier stages.
(Pipeline: input of all image regions → cascade of classifiers → brightness
threshold → SVM classifier, with regions rejected as non-ball at each stage.)
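The rejection pipeline of fig 3 can be sketched in a few lines of Python. This is an illustrative skeleton only; the stage functions (`cascade_stages`, `has_bright_pixel`, `svm_predict`) are hypothetical placeholders, not Tracab's or OpenCV's actual API:

```python
def detect(regions, cascade_stages, has_bright_pixel, svm_predict):
    """Run each region through the three rejection steps of fig 3.

    cascade_stages: per-stage classifiers returning True (pass) / False (reject).
    has_bright_pixel / svm_predict: the later, more expensive checks.
    All names are illustrative placeholders.
    """
    balls = []
    for region in regions:
        # Step 1: boosted cascade; any failing stage rejects the region early.
        if not all(stage(region) for stage in cascade_stages):
            continue
        # Step 2: cheap brightness check before the expensive SVM.
        if not has_bright_pixel(region):
            continue
        # Step 3: SVM on the few surviving candidates.
        if svm_predict(region):
            balls.append(region)
    return balls
```

The point of the ordering is that each step is cheaper than the next, so most regions are rejected before the SVM ever runs.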
1.3 Hit rate vs. false positive rate
While it is possible to get a hit rate close to 100%, this will most likely
also lead to a very high false positive rate. A hit rate of 80% with 0.1%
false positives could be excellent for some applications and unacceptable for
others. In a medical environment, for example, it may not be good enough,
because one really wants to be certain before administering a risky medicine.
In this application there is no fixed limit that has to be reached. Instead it
is the ratio between the hit rate and the false alarm rate that is interesting.
Results will therefore be presented using Receiver Operating Characteristic-
curves (ROC) with hit rate on the y-axis and the false positive rate on the x-
axis. The different results are obtained by varying a sensitivity parameter. In
many applications a threshold is varied to get different detection rates. In this
thesis the number of stages used for detection and the step size used during
detection are varied to get the different rates. ROC-curves are commonly used
to present this kind of result, which makes it easier to compare these results
with others. A more detailed description of the ROC-curves is given in Section 5.1
along with the results.
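A ROC point is just a (false positive rate, hit rate) pair at one setting of the sensitivity parameter. A minimal sketch of how such points are computed from labeled detections follows; it varies a score threshold, whereas this thesis varies the number of stages and the step size instead:

```python
def roc_points(scores, labels, thresholds):
    """(false positive rate, hit rate) for each threshold.

    scores: detector confidence per sample (higher = more ball-like);
    labels: True for actual ball regions, False for non-balls.
    Assumes at least one positive and one negative label.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if y and s >= t)
        fp = sum(1 for s, y in zip(scores, labels) if not y and s >= t)
        points.append((fp / negatives, tp / positives))
    return points
```

Lowering the threshold moves a point up and to the right: more balls found, but more false alarms accepted.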
Somewhat promising results have been achieved, as seen in fig 4. Compared to
Tracab’s existing method we can see small improvements. More results can be
seen in Chapter 5.
Fig 4 Results compared to an existing method used by Tracab.
The open source project OpenCV has been used in this project for most of the
algorithms [38]. Preprocessing of the image data has been done using Matlab.
The SVM part has been done using libSVM [6] which has been integrated with
OpenCV.
1.4 Related Work
Much of the research in object detection focuses on face detection and texture
classification. Little work has been done on ball detection. Research based on
face detection and texture classification is therefore also presented. More work
has been done on ball tracking in multiple image sequences, but this is not of
interest for the present work and will not be considered. The organization of this
chapter is as follows: First, specific research on ball detection will be
presented. Then, the most interesting methods found for face detection and
texture recognition will be presented.
Much of the research on color independent ball detection has been done using
some kind of edge detection. A Circle Hough Transform (CHT) is used by
D’Orazio and Guaragnella for circle detection [9]. Edges are first obtained by
applying the canny edge detector. Each edge point contributes a circle of radius
R to an output accumulator space. When the radius is not previously known,
the algorithm needs to be run for all possible radii. This requires a large
number of computations to be performed. Another disadvantage is that the
method handles noisy images poorly. The sub-windows found by the CHT
step are then used to train a Neural Network with different wavelet filters as
features. The Haar wavelet was found to be the best among Daubechies 2,
Daubechies 3 and Daubechies 4 with different decomposition levels. A faster
version of the Circle Hough Transform is used for ball detection by
Scaramuzza et al [27]. Coath and Musumeci use an edge based arc detection to
detect partially occluded balls [7].
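The voting scheme behind the Circle Hough Transform can be sketched as follows. This is a simplified NumPy illustration for a single, known radius, without the Canny edge step or the gradient-direction optimizations used in practice:

```python
import numpy as np

def circle_hough(edge_points, radius, shape):
    """Each edge point votes for every centre that would place it on a
    circle of the given radius; the accumulator peak is the most likely
    circle centre. Simplified: one fixed radius only."""
    acc = np.zeros(shape, dtype=int)
    thetas = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
    for (x, y) in edge_points:
        # Candidate centres lie on a circle of the same radius around (x, y).
        cx = np.round(x - radius * np.cos(thetas)).astype(int)
        cy = np.round(y - radius * np.sin(thetas)).astype(int)
        ok = (cx >= 0) & (cx < shape[0]) & (cy >= 0) & (cy < shape[1])
        np.add.at(acc, (cx[ok], cy[ok]), 1)  # accumulate repeated votes
    return np.unravel_index(acc.argmax(), acc.shape)
```

When the radius is unknown, this whole procedure must be repeated per candidate radius, which is exactly the computational burden noted above.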
Ville Lumikero thresholds a grayscale image into binary form [19]. He applies
morphological operations such as dilation and erosion to “clean” the images
from noise and to fill up holes in objects. The ball candidates are then
thresholded by size and color. The remaining ball candidates are further
processed by a tracking algorithm not described here.
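The dilation and erosion operations mentioned above can be sketched in plain NumPy. This is a toy 3x3-neighbourhood version; note that `np.roll` wraps around at the image borders, which real code would avoid by padding first:

```python
import numpy as np

def dilate(img):
    """3x3 binary dilation: a pixel becomes 1 if any neighbour is 1.
    Caution: np.roll wraps at the borders (toy simplification)."""
    out = np.zeros_like(img)
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            out |= np.roll(np.roll(img, dx, 0), dy, 1)
    return out

def erode(img):
    """3x3 binary erosion: a pixel stays 1 only if all neighbours are 1."""
    out = np.ones_like(img)
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            out &= np.roll(np.roll(img, dx, 0), dy, 1)
    return out

def clean(binary):
    """Opening (erode then dilate) removes isolated noise pixels;
    closing (dilate then erode) fills small holes in objects."""
    opened = dilate(erode(binary))
    return erode(dilate(opened))
```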
Ancona et al [1, 3] implement an example based method for ball detection by
training a Support Vector Machine (SVM, read Burges tutorial for more
information [5]) with positive and negative example images of a ball. 20x20
pixels large images of footballs are used as input to the SVM. The images are
preprocessed with histogram equalization to reduce variations in image
brightness and contrast. The SVM is a robust algorithm that searches for the
hyperplane that maximizes the minimum distance from the hyperplane to the
closest training point. SVMs are described in more detail in Section 3.7.
To do face detection, a bootstrapping technique is used by Osuna et al to
improve the performance of an SVM [23]. Misclassified images that do not
contain faces are stored and used as negative examples in later training phases.
The importance of this step was first shown by Sung and Poggio [30]. To
reduce computation time of the SVM a reduced set of vectors is introduced by
Romdhani et al [26]. By applying the reduced vectors one after the other it may
not be necessary to evaluate all the reduced vectors. If an image is considered
very unlikely to be a face early in the chain it can be discarded fast. Regular
detectors based on a single classifier such as the SVM without using the
reduced vector set are slow because they spend an equally large amount of time
evaluating negative responses as positive responses in an image. This approach
of a cascaded detector that is coarse-to-fine has been used widely in the
literature.
One of these cascade approaches is the algorithm proposed by Viola and Jones
[31, 32]. Their algorithm uses a cascade of boosted classifiers to detect faces
and mimics a decision tree behavior. In the first stages of the cascade only a
few features are evaluated. Later on down the cascade the stages become more
complex, thus a lot of non-faces can be thrown away without much effort while
face candidates are more thoroughly examined. Viola and Jones have shown
that their final classifier is very efficient and well suited for real-time
applications (15 frames/s at 384x288 on a 700 MHz Pentium III). Much of their
work is based on Papageorgiou et al who did not work directly on pixel values
[24]. Instead they use a set of Haar filters as features. These features are
selected by evaluating them in a statistical way and then used as input to a
SVM, used to classify new images. The features in Viola and Jones report are
selected with the help of a boosting technique called Adaboost. Adaboost is a
boosting algorithm that enhances the performance of a simple, so called
“weak” classifier. Haar rectangles are used as features, so the task for the weak
learner is to select the rectangle with the lowest classification error of the
examples. After each round of learning, the examples are re-weighted so that
the examples that were misclassified are given greater importance in the next
step. Interestingly, it has been shown in a study by Freund and Schapire
that this approach reduces the training error exponentially in the number of
rounds [10]. Viola and Jones also use a fast way of computing the Haar
features by pre-calculating the Integral Image. The Integral Image entry I(x, y)
is the rectangle sum to the left and up of the point (x, y). This makes it possible
to calculate any rectangular sum in four array references. The resulting detector
is easily shifted in scale and location thanks to this. A simple tradeoff between
processing time and detection rate can be done by experimenting with the step
size. This report is largely based on the work by Viola and Jones, but also
regards extensions of their work. This is mainly due to the efficiency of the
method shown in their study [32] as well as the promising proofs about the
error that are explained in Section 3.5.1.
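The integral image trick described above is compact enough to sketch directly. This is a minimal NumPy illustration of the idea, not Tracab's or OpenCV's implementation:

```python
import numpy as np

def integral_image(img):
    """I(x, y) = sum of all pixels above and to the left of (x, y), inclusive."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom+1, left:right+1] using at most four array
    references into the precomputed integral image."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total
```

A two-rectangle Haar feature is then simply the difference of two such sums over adjacent rectangles, which is why shifting the detector in scale and location is so cheap.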
Many extensions to the work of Viola & Jones have been done, for example in
the study by Mitri et al where it is used for ball detection [21]. First images are
preprocessed by extracting edges using a Sobel filter. These images are then
used as input to train the classifier. A variant of Adaboost is used called Gentle
Adaboost, as proposed by Lienhart et al [15]. Here it is shown that for the face
detection problem, Gentle Adaboost outperforms Discrete Adaboost (the
original Adaboost is nowadays often referred to as the discrete version) and the
Real Adaboost, which was introduced by Schapire and Singer as an extension to
the original version [28].
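The core Adaboost loop described above (pick the weak classifier with the lowest weighted error, weight it by its accuracy, then re-weight the examples so that mistakes count more next round) can be sketched as follows. This is a minimal Discrete Adaboost over threshold "stumps", one per feature column; it is an illustration of the principle, not the OpenCV training code used in this thesis:

```python
import math

def adaboost(features, labels, rounds):
    """features: list of samples, each a list of feature responses;
    labels: +1 (ball) / -1 (non-ball).
    Returns one (feature, threshold, polarity, alpha) tuple per round."""
    n = len(features)
    w = [1.0 / n] * n
    strong = []
    for _ in range(rounds):
        best = None  # (weighted error, feature, threshold, polarity, preds)
        for f in range(len(features[0])):
            for t in sorted({x[f] for x in features}):
                for p in (1, -1):
                    pred = [p if x[f] >= t else -p for x in features]
                    err = sum(wi for wi, pr, y in zip(w, pred, labels) if pr != y)
                    if best is None or err < best[0]:
                        best = (err, f, t, p, pred)
        err, f, t, p, pred = best
        alpha = 0.5 * math.log((1 - err) / max(err, 1e-10))
        strong.append((f, t, p, alpha))
        # Re-weight: misclassified examples gain importance next round.
        w = [wi * math.exp(-alpha * y * pr) for wi, y, pr in zip(w, labels, pred)]
        s = sum(w)
        w = [wi / s for wi in w]
    return strong

def classify(strong, x):
    score = sum(a * (p if x[f] >= t else -p) for f, t, p, a in strong)
    return 1 if score >= 0 else -1
```

In the Viola and Jones setting, each feature column would be the response of one Haar rectangle, so each round effectively selects one rectangle for the stage.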
Lienhart et al also extend the work done by Viola and Jones by introducing the
Rotated Integral Image [14]. With the help of this, an extended set of Haar
features is used which show improvements of the detection rate of the classifier
while the computation time required does not increase to the same extent.
Some improvements to the weak classifiers are done by Rasolzadeh [25]. The
feature response of the Haar wavelets for both the positive and the negative
examples used in Viola and Jones are modeled using a normal distribution. By
doing this it is possible to improve the discriminative power of the weak
classifiers by using multiple thresholds (i.e. two new thresholds are introduced
where the two distributions intersect). In Viola and Jones’ algorithm a single
threshold is used by the weak classifiers to separate the two classes. They
further show that this multithresholding procedure is a specific implementation
of response binning. By calculating two histograms of the feature response for
the positive and the negative examples for a certain number of bins it is
possible to determine multiple thresholds without modeling the response as
normal distributions. The weak classifier hypothesis consists of comparing the
two histograms for the feature response. This change can be implemented by
replacing the old weak classifiers directly, without any other major changes
to the algorithm. Their results suggest some improvement of the detection rate
while keeping the low computing time.
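The response-binning idea can be sketched as follows. This is an illustrative toy version of the histogram-comparison weak classifier, not Rasolzadeh's implementation; ties between empty bins default to the positive class here:

```python
def binned_weak_classifier(pos_responses, neg_responses, bins, lo, hi):
    """Build per-class histograms of a feature response and classify a new
    response by whichever class dominates its bin. Bin boundaries act as
    multiple implicit thresholds, with no normal-distribution modeling."""
    width = (hi - lo) / bins

    def bin_of(r):
        return min(bins - 1, max(0, int((r - lo) / width)))

    pos_h = [0] * bins
    neg_h = [0] * bins
    for r in pos_responses:
        pos_h[bin_of(r)] += 1
    for r in neg_responses:
        neg_h[bin_of(r)] += 1
    # Positive wins ties (arbitrary toy choice).
    return lambda r: pos_h[bin_of(r)] >= neg_h[bin_of(r)]
```

Note how the classifier below accepts both low and high responses while rejecting the middle band, something a single threshold cannot express.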
Another extension of the Viola and Jones algorithm that combines their work
with a Support Vector Machine at the last stage of the cascade is presented in a
study by Le and Satoh [13]. Two new stages are added to the cascade of
classifiers. Firstly a stage to reject non-face regions more quickly has been
added. It does so by using a larger window size and a larger step size. As the
last stage of the new cascade a SVM is used. The Haar features that were
selected by Adaboost in the previous stage are used to train the SVM. As they
are already calculated no recalculation is needed. In contrast to the extension of
the Haar features proposed by Lienhart et al [14] a reduction of these features is
implemented here. This is mainly done to reduce training time, and their
experiments show that for rejection purposes no efficiency is lost. Wu et al [33]
have been able to make an algorithm for training the cascade that is roughly
100 times faster than the one of Viola and Jones. The difference is that Wu et al
only train the weak classifier once per node, instead of once for each feature in
the cascade.
Liu et al map the feature response into two histograms (one for the positive set
and one for the negative set) and searches for the feature that maximizes the
divergence of the two classes [17]. This is done by using the Kullback-Leibler
(KL) divergence as the margin of the data. Results are promising compared to
Adaboost but the final speed of the classifier is slower (2.5 frames/s at
320x240 on a P4 1.8 GHz).
Lin and Liu train 8 different cascades to handle 8 different types of occluded
faces [16]. If a sample is detected as a non-face only the trained classifiers that
contain features that do not intersect with the occluded part are evaluated. A
majority of these classifiers should give positive responses when the sample
was indeed a face. The sample is then evaluated with one of the new cascades.
These additional stages yield a three times longer computing time
(18 frames/s at 320x240 on a P4 3.06 GHz). To avoid overfitting, the mechanism
to select weak learners during boosting is reconsidered. Influenced by the
Kullback-Leibler Boost they use the Bhattacharyya distance as a measure of
the separability of the two classes. They claim that the Bhattacharyya
coefficient is much easier to compute than the Kullback-Leibler distance while
maintaining the same performance.
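The Bhattacharyya coefficient for two normalized histograms is indeed cheap to compute, which supports the claim above:

```python
import math

def bhattacharyya_coefficient(p, q):
    """BC = sum over bins of sqrt(p_i * q_i) for two normalized histograms.
    1 means identical distributions, 0 means no overlap; the lower the
    overlap, the more separable the two classes are."""
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
```

The associated Bhattacharyya distance is -ln(BC), so it grows as the class histograms overlap less.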
A totally different approach to feature detection has been developed by Lowe
called SIFT [18]. It is a highly appreciated method that has been used widely
and proven effective [20]. It has a high matching percentage and is robust to
lighting variations compared to other local feature methods. The main
interesting part of this work is how the features are described. Points of interest
are found by looking for peaks in the gradient image. Descriptors representing
the local image gradients are extracted from the area around these points of
interest. A 4x4 grid is constructed and for each of these “bins” a histogram of 8
gradients is computed. This representation has the advantage that it is good at
capturing the small distinctive spatial patterns of an object. Rotation invariance
is achieved by combining all the gradient orientations into a reference
orientation. When a new image is to be classified, these descriptors are
matched to the descriptors in the database trained with example images using a
Nearest Neighbor method. For further reading, a study by Mikolajczyk
and Schmid evaluates an improvement to SIFT and compares different local
descriptors [20].
A similar approach to SIFT is a feature descriptor called SURF [4]. The main
difference here is that instead of gradients, Haar features are calculated around
a point of interest and represented as vectors. By using the Integral Image,
calculations can be made faster.
Local Binary Patterns is another approach that has been used for texture
classification by Ojala et al [22]. For each pixel, an occurrence histogram of
features is calculated. A feature is represented by the signs of the difference in
value between the center pixel and its neighbors (positive or negative
responses). The neighbor pixels are chosen as equally spaced pixels on a circle
of radius R. In the report it has been tested with different values of R and a
different number of neighbors. With this representation the output can take as
many as 2^P different values, P being the number of neighbor points. Certain
local binary patterns are overrepresented in the textures and these patterns share
the property of having few spatial transitions. These are called uniform and
are defined as containing at most two 0/1 transitions in the circular pattern. By
making the representation rotationally invariant and by only considering
uniform patterns this amount is reduced significantly. Noticeable is that the
occurrence histogram does not save the spatial layout in the image; it only
stores information about the frequency of the local features.
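The LBP code and the uniformity test can be sketched as follows. This is a toy version using the 8 immediate neighbours (R = 1, P = 8) rather than the interpolated circle samples of Ojala et al:

```python
def lbp_8(img, x, y):
    """Basic 8-neighbour LBP code for pixel (x, y): one bit per neighbour,
    set when the neighbour is >= the centre pixel (the sign of the
    difference), giving 2**8 = 256 possible patterns."""
    c = img[x][y]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]  # circular order
    code = 0
    for bit, (dx, dy) in enumerate(offsets):
        if img[x + dx][y + dy] >= c:
            code |= 1 << bit
    return code

def is_uniform(code):
    """Uniform patterns have at most two 0/1 transitions when the 8 bits
    are read circularly; texture histograms keep these and pool the rest."""
    bits = [(code >> i) & 1 for i in range(8)]
    transitions = sum(bits[i] != bits[(i + 1) % 8] for i in range(8))
    return transitions <= 2
```

A texture descriptor is then simply the occurrence histogram of these codes over all pixels, which, as noted above, keeps frequencies but discards spatial layout.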
1.5 Thesis Outline
The rest of the thesis is organized into five different chapters. In Chapter 2, the
database of images needed for training and testing is described. Chapter 3
explains the theory behind the methods used, both the method proposed by
Viola and Jones based on a cascade of classifiers trained with Adaboost and the
method of using a Support Vector Machine as a classifier. It also describes how
these two methods can be combined in different ways. Chapter 4 describes how
the methods were used in this specific problem setting. Experimental results are
shown and discussed in Chapter 5 and conclusions are drawn in Chapter 6,
along with proposed future work.
Chapter 2
Image Database
The image database was produced semi-automatically from example images
taken from the Tracab system cameras. This section describes how this was
done and how the image sets have been created.
2.1 Ball tool
The images are extracted using a tool that Eric Hayman at Tracab developed.
The procedure was to look at an image, click on the ball and try to center
the position as exactly as possible. At the same time the images were labeled
with the degree of freeness of the ball and with the contrast in the image.
A free ball is not in contact with anything and is completely visible. The
higher the degree of freeness, the closer the ball is to other objects. In the
same way, the higher the contrast number, the harder it is to distinguish the
ball from the background. Examples can be seen in fig 5.
The resulting information is a text file holding the path to the image, the x and
y position of the ball in the image, the degree of freeness and the degree of
contrast of the ball. Later this information is used to extract image regions that
consist of the ball in the center along with some of the background. This way it
is easy to do tests using image regions of different sizes for training, as
described in Section 5.3.1.
Example (row created by the ball tool):
Image path x y free contrast
“C:\HEIF-GAIS\ball.000001.04.jpg” 116.53 25.10 2 3
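A row in this format can be parsed with a few lines of Python. This is a hypothetical sketch based on the field description above, not the actual ball tool code:

```python
import shlex

def parse_ball_row(row):
    # Split on whitespace while keeping the quoted image path together.
    path, x, y, free, contrast = shlex.split(row)
    return {"path": path, "x": float(x), "y": float(y),
            "free": int(free), "contrast": int(contrast)}

rec = parse_ball_row('"C:\\HEIF-GAIS\\ball.000001.04.jpg" 116.53 25.10 2 3')
print(rec["x"], rec["free"])  # 116.53 2
```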
Description of the two scales:
Freeness
1. Free
2. Close to a player/line but not in contact
3. Contact with the player/line but not over the same
4. Over the player/line
5. Partially occluded by player
Contrast
The contrast scale ranges from 1 to 4 where 1 represents a very high contrast
where the ball is easily distinguishable from the background and 4 represents a
very low contrast where it is difficult to separate the football from the
background.
2.2 Images
The images taken by the 16 cameras and saved by the ball tool all have a
resolution of 352x288 pixels. The whole images are saved by the ball tool as
RGB images but are converted to grayscale for training and detection. The
conversion from RGB to grayscale is done in the same way as calculating the
luminance Y of the Y'UV color space [35]: Y = 0.299R + 0.587G + 0.114B.
Each channel has color values ranging between [0..255] so Y ranges between
[0..255] as well.
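As a one-pixel sketch of this conversion:

```python
def rgb_to_luma(r, g, b):
    # BT.601 luma weights, i.e. the Y channel of the Y'UV color space.
    return 0.299 * r + 0.587 * g + 0.114 * b

# The weights sum to 1, so an input in [0..255] stays in [0..255].
print(round(rgb_to_luma(255, 255, 255)))  # 255 (white stays white)
print(round(rgb_to_luma(255, 0, 0)))      # 76  (pure red becomes dark)
```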
The images have been split into two groups: A training set and a test set. The
training set along with the negative samples is used to train the cascade while
the test set is used to measure performance.
2.2.1 Training set
10317 positive example images were extracted for training from 6 different
matches in Allsvenskan, Champions League and an international match from
2007, all with different lighting conditions.
Table 1 shows the number of images that have a value equal to or lower than
the contrast/freeness measures indicated on the left and top of the table.
Table 1 The different types of images in the training set.
Contrast/Free 1 2 3 4 5
1 75 83 83 84 84
2 3258 4988 5898 5982 6031
3 4651 7447 9194 9446 9582
4 4840 7845 9777 10120 10317
2.2.2 Negatives
To construct negatives the same images are used as for the positive samples.
The ball is removed from the image by setting the pixels in the area of the ball
to black. The whole 352x288 image is then saved and labeled as a negative
image. During training, image regions that are detected as positives are then
extracted from these images and used as negatives, as we know that there is no
football in the images. This procedure is called bootstrapping and is described
further in Section 3.6.1.
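A minimal sketch of the blacking-out step, assuming a grayscale image stored as a list of rows and a hypothetical radius for the labeled ball:

```python
def erase_ball(img, cx, cy, radius):
    # Set all pixels within `radius` of the labeled ball center to black
    # so the frame can be reused as a negative example.
    for y in range(len(img)):
        for x in range(len(img[0])):
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                img[y][x] = 0
    return img

frame = [[255] * 5 for _ in range(5)]
erase_ball(frame, cx=2, cy=2, radius=1)
print(sum(row.count(0) for row in frame))  # 5 blacked-out pixels
```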
2.2.3 Test set
2221 images were extracted from other sequences in the same matches. Testing
is done on these. The image set is expected to have the same ratio of different
kinds of images as there is in a match generally. Tests are done on images from
the same matches as used in training, to avoid the problem of not having
enough variety in the training data. This set is called test set 1 and the
distribution of different images can be seen in table 2. The ratio of the different
images is similar to the ratio in the training set. The ratio in percentage of the
different images of the two sets can be seen in Appendix 1.
Table 2 The different types of images in test set 1.
Contrast/Free 1 2 3 4 5
1 6 12 12 12 12
2 659 1071 1277 1284 1287
3 951 1612 2097 2131 2139
4 972 1643 2149 2202 2221
Tests have also been done on a match not used in training, though with only 6
different matches to train on we should not expect good results in general on
matches that were not part of the training data. From this
match there are 884 images. This set is called test set 2.
Fig 5 Examples of image regions with different properties. (a) 2 on the free scale and 3 on the
contrast scale. (b) 1 on the free scale and 2 on the contrast scale.
2.2.4 Five-a-side
New images have been collected from a five-a-side match where the cameras
are positioned closer to the pitch. This is possible since the pitch in a
five-a-side match is about 16 times smaller than normal. This setup gives images of
the footballs of higher resolution since they come closer to the cameras. The
footballs are now between 2 and 8 pixels in radius in the images which is
significantly larger than before and the texture of the ball now becomes visible.
The training set contains a total of 5937 images and the test set contains 2068
images extracted in the same way as the training set. For this set no analysis of
the quality (freeness and contrast) of the images has been done. Also, to save
time, the process of extracting footballs in the images has been made faster by
mostly including easy targets.
2.2.5 Correctness
The data set contains football images of variable size and with a wide range of
lighting conditions. Balls that were close to the cameras are larger than those
far away from the cameras; they can vary up to a couple of pixels in diameter.
It is questionable if the variation in lighting conditions in the extracted images
is enough to capture the variance there is in reality between all different
matches. Ideally, images from a much wider range of matches would also be
available, to allow complete generalization. This has not been done due to the
large amount of time it takes to extract footballs from images manually. The
same can be said about the problem of having to deal with different kinds of
footballs. They are not always white. Some are black and white checkered and
others are even red. This could be solved by training several cascades.
It is also uncertain if the test set represents a general set of images. To be able
to detect the football as often as possible it is optimal to have a training set that
represents all the different image types that are present during a game.
Hopefully this is achieved automatically when taking a wide range of images
without any special selection process.
Another thing that could negatively affect the results is the labeling of the
data in table 1 and table 2. As always when humans are involved, this labeling
is the result of subjective judgment. Moreover, according to research by
gestalt psychologists, the eye is easily fooled [37].
Chapter 3
Theoretical background
This chapter gives an overview of the general approach used in the project and
describes the theory needed to understand the method. It includes these areas:
Training a boosted classifier using Adaboost, constructing the trained
classifiers into a coarse-to-fine cascade and training a Support Vector
Machine to be used as the last stage of the classifier.
3.1 Overview
The algorithm used in this report is largely based on the work of Viola and
Jones from 2001 [31]. It is a popular method that has been widely used (see
Related Work). Proofs exist concerning its generalization ability and the bound
on its error, which makes the algorithm very interesting. More
about this can be found in the analysis of the method in Section 3.5.1. The
method has mainly been developed and evaluated for face detection rather than
ball detection.
The algorithm works in the following way. A classifier is trained using positive
and negative image regions of an object, all of the same size. The classifier consists of
several so called weak classifiers (3.5.2), consisting of haar-like features
(3.2.1), which are trained using a boosting technique called Adaboost (3.5).
The boosted weak classifiers are combined into a cascade of coarse-to-fine
classifiers (3.6). The idea is to reject a lot of non-objects in the early stages
where the computation is light, reducing process time, while positives get
processed further. When classifying an image region, the classifier outputs if
the object is detected or not, like a binary classifier. The classifier is easily
scaled and shifted to be able to detect objects at different locations and of
different sizes in an image.
This algorithm is combined with a Support Vector Machine (SVM) at the end
as described in Section 3.7. SVMs have been reported to be good classifiers
[26]. The SVM only needs to evaluate the image regions selected by the
Adaboost classifier in the previous stage, which keeps it fast. Otherwise, a big
disadvantage of the SVM method is that it may be too slow for real time
[23]. An overview of how the method works can be seen in fig 6.
Fig 6 System overview. A cascade of classifiers trained with Adaboost is combined with a
brightness threshold and a SVM classifier as the last stage. Image regions that make it through
the system without being rejected are classified as footballs. Notice: same figure as fig 3.
3.2 Features
A feature is the characteristic that is used to distinguish objects from non-
objects. The two main reasons why features are used instead of using the pixel
values directly is that it improves speed (explained in more detail in Chapter 2
about Integral Images) and that they can capture different kinds of properties in
an image. Any feature can be used such as the total sum of an area, variance,
gradients etc.
3.2.1 Haar features
In this thesis, the difference in pixel value between adjacent rectangles is used
as features. The features can be seen in fig 7. The value of the pixels in the
white rectangle is subtracted from the value of the pixels in the black rectangle.
The resulting difference that comes from calculating the feature is called the
feature response.
[Fig 6 flow: input (all sub-images) → cascade of classifiers → brightness
threshold → SVM classifier. At each stage, regions classified as non-ball are
rejected; regions that pass every stage are classified as footballs.]
As shown in fig 7, 14 different features are used. With a base resolution of the
detector of 12x12, the total number of possible features in my setup is 8893.
This is a large number of features, but as we will see only a portion of the
entire set of features will be needed.
The features are called Haar-like features because they mimic the behavior of
the Haar wavelet basis functions. Much like gradients, they capture changes in
pixel value rather than the pixel values themselves. They are insensitive to
differences in mean intensity and in scale.
Fig 7 The extended set of features as Lienhart et al [14] suggested.
1a-b, 2a-d and 3a in figure 7 were in the original set of features used by Viola
and Jones. The rest of the features were introduced by Lienhart et al [14]. The
new set of features consists of 7 additional rectangles that have been rotated by
degrees. Having a larger set of features makes it possible to capture the
properties of the object more accurately, but it also increases training time,
since there are more features to evaluate. However, a larger feature pool does
not automatically mean that more features will be used by the final classifier,
which is what would affect its speed. To speed up the
calculation of the features an integral image is used.
3.2.2 Integral Image
An Integral Image is a matrix made to simplify the calculation of the value of
an upright rectangular area in an image. It is a pre-calculation step made to
speed up other calculations. The value of the Integral Image (II) at II(x,y) is the
sum of all the pixels of the original image (OI) above and to the left of, and
including, OI(x,y).
An example can be seen in fig 8.
Original image:     Integral Image:
155 201 226         155 356 582
 98  78  48         253 532 806
 14 111  44         267 657 975

Fig 8 The original image (left) and the corresponding Integral Image (right)
This makes calculations of the value of any rectangle in the image faster. The
formula is:
II(x,y) = II(x-1,y) + II(x,y-1) – II(x-1,y-1) + OI(x,y) (1)
where OI(x,y) is the original image and II(x,y) is the Integral Image.
Once the Integral Image is calculated, the rectangle D in figure 9 can be
computed by:

1 - 2 - 3 + 4 = A - (A+B) - (A+C) + (A+B+C+D)
             = 2A - 2A + B - B + C - C + D
             = D    (2)

Fig 9 Any rectangle D can be computed from the Integral Image by 1-2-3+4, where 1 = A,
2 = A+B, 3 = A+C and 4 = A+B+C+D are the Integral Image values at the four corner points of D.
With the help of the Integral Image, any rectangle can be calculated with only 4
array references. The difference of two adjacent rectangles (the edge features) can
be calculated with six array references, while three adjacent rectangles (the line
features) require eight array references.
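The recurrence (1) and the four-reference rectangle sum (2) can be sketched in Python; the numbers below reproduce the example in fig 8:

```python
def integral_image(img):
    # II(x,y) = II(x-1,y) + II(x,y-1) - II(x-1,y-1) + OI(x,y),
    # treating II as 0 outside the image (formula 1).
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ii[y][x] = (img[y][x]
                        + (ii[y][x - 1] if x > 0 else 0)
                        + (ii[y - 1][x] if y > 0 else 0)
                        - (ii[y - 1][x - 1] if x > 0 and y > 0 else 0))
    return ii

def rect_sum(ii, x0, y0, x1, y1):
    # Sum over the rectangle with corners (x0,y0)..(x1,y1) inclusive,
    # using only four array references (formula 2).
    a = ii[y0 - 1][x0 - 1] if x0 > 0 and y0 > 0 else 0
    b = ii[y0 - 1][x1] if y0 > 0 else 0
    c = ii[y1][x0 - 1] if x0 > 0 else 0
    return ii[y1][x1] - b - c + a

img = [[155, 201, 226], [98, 78, 48], [14, 111, 44]]
ii = integral_image(img)
print(ii[2])                     # [267, 657, 975], as in fig 8
print(rect_sum(ii, 1, 1, 2, 2))  # 281 = 78 + 48 + 111 + 44
```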
Since the rectangles are small (the training samples are 8-15 pixels in both
width and height), and therefore so are the features, the Integral Image does not
help in all cases. When the rectangles are small enough it is faster to do the
calculations directly on the pixels; this is true when the rectangles are smaller
than 3 pixels across. But the loss in time from using the Integral Image in those
cases is very small, and the calculations with the Integral Image are done in O(1).
For the rotated rectangles a different Integral Image is needed, called a Rotated
Integral Image. The idea is the same as before and it is calculated with the
formula:

RII(x,y) = RII(x-1,y-1) + RII(x+1,y-1) - RII(x,y-2) + OI(x,y) + OI(x,y-1)    (3)

where RII is the Rotated Integral Image.
3.5 AdaBoost
The principal idea of boosting is to combine many weak classifiers to produce
a more powerful one. This is motivated by the idea that it is difficult to find one
single highly accurate classifier. The T weak classifiers h_t are combined
into a strong classifier by:

H(x) = 1 if Σ_t α_t h_t(x) ≥ ½ Σ_t α_t, and 0 otherwise    (4)

where the weights α_t are found during boosting.
The weak classifiers are used to select which features among the large number
of features best separate the two classes: objects and non-objects. This kind of
feature selection was first done in a statistical manner by Papageorgiou et al
[24]. By using Adaboost this selection process can be optimized when the best
feature is not obvious.
The boosting step of the algorithm is done by reweighting the examples,
putting more weight on the difficult ones. A new round of feature selection is
then done with the new distribution. The selected weak classifiers add up to
strong classifiers, which are in turn combined to construct a cascade.
3.5.1 Analysis
It has been proven by Schapire and Freund that the error of the final classifier
drops exponentially fast if it is possible to find weak classifiers that classify
more than 50% of the examples correctly [29].
The final error is at most:

∏_t 2√(ε_t(1 - ε_t))    (5)

where ε_t is the error of the t-th weak hypothesis.
The bound of the error on the final classifier improves when any of the weak
classifiers is improved.
In the same article it was also shown that the generalization error of the final
classifier with high probability is bounded by the training error. This means
that the final classifier is most likely to generalize well on samples it has not
seen before. They say that with high probability, the generalization error is less
than

Pr[H(x) ≠ y] + Õ(√(Td/m))    (6)
where Pr[] is the empirical probability on the training sample, T is the number
of rounds of boosting, m is the size of the sample and d is the VC-dimension
(Vapnik-Chervonenkis dimension) of the base classifier space. The VC-
dimension of hypothesis space H defined over instance space X, is the size of
the largest finite subset of X shattered by H. Further explanation of the VC-
dimension can for example be found in the tutorial by Sewell [34].
Schapire and Freund's analysis implies that overfitting may be a problem if
training is run for too many rounds. However, their tests showed that
boosting does not overfit even after thousands of rounds. They also found
that the generalization error keeps decreasing even after the training error has
reached zero. These are promising results that motivate the use of this method.
3.5.2 Weak classifiers
A weak classifier is a simple classifier that only has two prerequisites: that it is
better than chance, i.e. that it classifies more than 50% of the samples correctly
and that it can handle a set of weights over the training examples. The weights
are needed in the boosting step.
In this case the weak classifier consists of one feature f_j along with a
threshold θ_j:

h_j(x) = 1 if p_j f_j(x) < p_j θ_j, and 0 otherwise    (7)

The feature response f_j(x) is compared to the threshold θ_j. The variable
p_j ∈ {+1, -1} is used to indicate the direction of the inequality sign.
In order to find the best weak classifier at each round of training, the
feature responses from all samples are calculated, and by applying a threshold
it is possible to separate the samples into two classes. During training the
optimal threshold is determined for each feature, where optimal means the
threshold that minimizes the classification error of that feature.
In each step of boosting the feature and its according threshold with the lowest
classification error is selected along with a weight α inversely proportional to
the classification error of that feature. This weight can be seen as a measure of
the importance of that particular weak classifier.
The error is calculated with respect to the weights of the examples (i.e. the
error is the sum of the weights of the misclassified examples):

ε_j = Σ_i w_i |h_j(x_i) - y_i|    (8)

where y_i is the true label of example x_i.
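A brute-force sketch of this threshold search for a single feature, on toy data rather than the thesis implementation:

```python
def best_stump(responses, labels, weights):
    # Try a threshold between every pair of sorted feature responses and
    # both parities p in {+1, -1}.  The weak classifier is
    # h(x) = 1 if p*f(x) < p*theta else 0, and the weighted error is the
    # sum of the weights of the misclassified samples (formula 8).
    rs = sorted(responses)
    candidates = ([rs[0] - 1]
                  + [(rs[i] + rs[i + 1]) / 2 for i in range(len(rs) - 1)]
                  + [rs[-1] + 1])
    best = (float("inf"), 0.0, 1)  # (error, theta, parity)
    for theta in candidates:
        for p in (1, -1):
            err = sum(w for f, y, w in zip(responses, labels, weights)
                      if (1 if p * f < p * theta else 0) != y)
            if err < best[0]:
                best = (err, theta, p)
    return best

err, theta, p = best_stump([0.2, 0.4, 0.6, 0.9], [1, 1, 0, 0], [0.25] * 4)
print(err, theta, p)  # 0 0.5 1
```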
3.5.3 Boosting
In order to train and combine several weak classifiers instead of using one
more complex classifier, Adaboost repeats the training step with a modified
distribution of the training set of examples. For each round more emphasis is
put on the more difficult examples. Those examples which were wrongly
classified by the previous weak classifier are given higher weights than the
correctly classified examples. The weights are then normalized. This is done at
each round of training until the total number of rounds is reached.
Different variants of Adaboost have been evaluated for face detection and two
of them have been used and compared for ball detection in this report [15]. The
Discrete Adaboost is the original version proposed by Schapire and Freund
[29]. According to Lienhart et al [15] the Gentle Adaboost is the most
successful algorithm for face detection. The Real Adaboost uses class
probability estimates to construct real-valued contributions. They are all similar
in computational complexity during classification, but differ somewhat during
learning in the way they update the weights at each round of boosting.
The main idea is still the same in all three cases:
General pseudo-code for Adaboost

Initialize weights w = 1/m (normalized)
For t = 1..T:
    Train the weak learners using distribution w: fit each
    weak classifier to the data and calculate its error with
    respect to the weights.
    Choose the weak classifier with the lowest error and
    update the weights, increasing the weights of the
    misclassified examples, then normalize.
Output the final hypothesis as a weighted combination
(weighted according to the errors) of the weak classifiers.
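The loop above can be sketched as a self-contained Discrete Adaboost on 1-D toy data, with simple threshold stumps standing in for the Haar-feature weak classifiers (labels here are ±1 rather than 0/1):

```python
import math

def adaboost(samples, labels, T):
    # Discrete Adaboost on 1-D samples with threshold stumps as the
    # weak learners (a toy stand-in for Haar-feature classifiers).
    m = len(samples)
    w = [1.0 / m] * m
    ensemble = []  # (theta, parity, alpha)
    for _ in range(T):
        # Train the weak learners and pick the lowest weighted error.
        best = None
        for theta in sorted(set(samples)):
            for p in (1, -1):
                preds = [1 if p * x < p * theta else -1 for x in samples]
                err = sum(wi for wi, h, y in zip(w, preds, labels) if h != y)
                if best is None or err < best[0]:
                    best = (err, theta, p, preds)
        err, theta, p, preds = best
        alpha = 0.5 * math.log((1 - err) / max(err, 1e-10))
        ensemble.append((theta, p, alpha))
        # Reweight: misclassified examples get larger weights; normalize.
        w = [wi * math.exp(-alpha * h * y)
             for wi, h, y in zip(w, preds, labels)]
        total = sum(w)
        w = [wi / total for wi in w]

    def strong(x):
        # Final hypothesis: weighted combination of the weak classifiers.
        score = sum(a * (1 if p * x < p * th else -1)
                    for th, p, a in ensemble)
        return 1 if score >= 0 else -1
    return strong

H = adaboost([0.1, 0.2, 0.3, 0.8, 0.9], [1, 1, 1, -1, -1], T=3)
print([H(x) for x in (0.15, 0.85)])  # [1, -1]
```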
For more information on the different variants of Adaboost see the comparative
study made by Friedman et al [11].
3.6 Cascade
By forming several strong classifiers into a cascade, ordered from simple to
complex, it is possible to reduce computation time. In the first stages of the
cascade simpler and faster classifiers are used. Since the majority of the image
regions going into the first stage are non-objects, many of them are easily
rejected by the early stages while the majority of the positives are let through.
Once an image region has been rejected at some stage it is discarded for the
rest of the cascade as a non-object and thus not evaluated further. A positive
image region goes through the whole cascade and is evaluated at every stage,
requiring further processing, but in total this is a rare event. When going through the
cascade the classifiers at deeper stages get more and more complex, requiring
more computation time. Also, with increasing stage number the number of
weak classifiers which are needed to achieve the desired false alarm rate at the
given hit rate, increases.
The cascade of classifiers is trained by introducing goals in terms of positive
detections and number of false positives. For example, to achieve a final
classifier with a hit rate of 90% and a false positive rate of 0.1%, each stage in
a 10-stage classifier needs a hit rate of 99% (0.99^10 ≈ 0.90) but
only a maximum false positive rate of 50% (0.5^10 ≈ 0.001). Each stage
reduces both values, but since the per-stage hit rate is close to one, the product
stays close to one, while the product of the much smaller per-stage false
positive rates rapidly decreases towards zero. This is all done
under the assumption that the different stages in the classifier are independent
of each other.
The cascade is formed by setting a minimum goal for every stage. Each stage
is trained, and features are added, until the desired hit rate and the desired
false positive rate have been reached. By specifying these goals the classifier
can be shaped as desired.
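The arithmetic behind these stage goals is just a product of per-stage rates; a quick check of the numbers above:

```python
stages = 10
hit_per_stage = 0.99
fp_per_stage = 0.5

# Assuming independent stages, the overall rates are the products
# of the per-stage rates.
overall_hit = hit_per_stage ** stages
overall_fp = fp_per_stage ** stages
print(round(overall_hit, 3))  # 0.904 -> about a 90% hit rate
print(overall_fp)             # 0.0009765625 -> about 0.1% false positives
```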
3.6.1 Bootstrapping
A new negative set is constructed for each stage by selecting those image
regions that were falsely detected by the classifier using all previous stages. A
false detection as the one in figure 10 would be added to the negative set. This
method is called bootstrapping. Intuitively this makes sense as we expect the
new examples to help us get away from the current mistakes.
Since at each stage the classifier becomes more and more accurate, it becomes
more and more difficult to find false positives. Also the false positives get more
and more similar to the true detections making the separation task harder. As a
result, deeper stages are more likely to have a high rate of false positives.
Fig 10 A typical hit along with a false detection. The image region of the players shoe is used as
a negative sample in the training of the next stage.
3.7 Support Vector Machine
Support vector machines (SVM) are used for data classification. The basics of
SVMs needed to understand the method are presented here.
As in the Adaboost case, a training set and a test set are needed to train and
evaluate the SVM. Given a set of labeled data points (the training set)
belonging to one of two classes, the SVM finds the hyperplane that best
separates the two classes. It does this by maximizing the margin between the
two classes. The left image of fig 11 shows a hyperplane that separates the two
classes with a small margin; in the right image the hyperplane that maximizes
the margin has been found by the SVM. The points that constrain the margin
are called support vectors.
Fig 11 SVM finds the plane that maximizes the margin. The image to the right is considered to
have greater generalization capabilities. Image taken from DTREG’s homepage [36].
3.7.1 Overfitting
Fig 12 shows how a classifier that is fitted well to the training set may not
generalize well. In image (a) the classifier has learnt to classify all training
examples correctly. As seen in image (b) this results in some wrongly classified
examples on the test set. In image (c), however, we see a classifier that,
although it classifies one example from the training set wrongly, classifies all
the examples in the test set correctly (as seen in image (d)). The latter
classifier generalizes better because it allowed a wrongly classified example
during training. This tradeoff is handled by introducing the penalty parameter C,
which weights the samples according to how they were classified.
Misclassifying a sample now costs C, and by increasing C the cost of
misclassifying an example increases, making the model more closely fitted to
the training data.
(a) Training data and an overfitting classifier    (b) Applied on test data
(c) Training data and a better classifier    (d) Applied on test data
Fig 12 An overfitting classifier and a more general classifier. Images from libSVM Guide[6].
3.7.2 Non-linearly separable data
The examples in fig 11 show two linearly separable classes. When having more
complicated data, a line may not be enough to separate the two classes. To cope
with this problem the data is mapped into a higher (maybe infinite) dimensional
space by a function Φ [6]. The function Φ can take many forms. In this new
space it may be possible to find a plane that separates the data. The problem
with going into a higher dimensional space is that calculations get more
expensive, which makes the method slow. Therefore the kernel trick, first
introduced by Aizerman et al, is used to solve this [1]. Since all SVM
calculations can be done using the dot product <x,y> between the training
samples, the operations in the high dimensional space do not have to be
performed explicitly. Instead we can try to find a function
K(x,y) = <Φ(x),Φ(y)>. This function is called the
Kernel function. Examples of popular kernels are: the Polynomial kernel, the
Radial Basis Function (RBF), the linear kernel and the sigmoid kernel. As
proposed by LibSVM, the RBF is a good choice to start with:

K(x,y) = exp(-γ ||x - y||²)    (9)

The linear and the sigmoid kernels are special cases of the RBF kernel for
certain values of the parameters (C, γ) [6]. The polynomial kernel is more complex in
terms of number of parameters to select.
When using the RBF kernel there are two parameters to select: C and γ. Since
it may not be useful to achieve high training accuracy, these parameters have
been evaluated by doing cross validation on the training set. This is done by
dividing the training set into two parts, one for training and the other for
testing. This is done repeatedly with different partitions to get a more accurate
result. The parameter values have been tested by increasing them
logarithmically and doing cross-validation at each setting to measure the
performance. The cross-validation helps us get around the problem of overfitting.
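A small sketch of the RBF kernel (9) together with a logarithmically spaced parameter grid; the exact exponent ranges below are an assumption, loosely following the libSVM guide:

```python
import math

def rbf_kernel(x, y, gamma):
    # K(x, y) = exp(-gamma * ||x - y||^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Candidate values for the cross-validation grid search,
# increased logarithmically (powers of two).
C_grid = [2.0 ** e for e in range(-5, 16, 2)]
gamma_grid = [2.0 ** e for e in range(-15, 4, 2)]

print(round(rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5), 3))  # 1.0
print(round(rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5), 3))  # 0.368
```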
3.7.3 Features extracted with Adaboost
Having a good classifier does not make sense unless the data points represent
something meaningful. The idea is to use features extracted from some stage of
the cascade constructed with Adaboost. Fig 12 could then be interpreted as
having the feature response of one feature on the x-axis and the feature
response of another feature on the y-axis. But unlike the examples in fig 11 and
fig 12, more than two features need to be used. This does not change any of
the theory except that we move into a higher dimensional space.
The samples used for training were gathered by letting an Adaboost trained
classifier classify the image regions in the training set as in Chapter 2. The
feature response from the samples classified as positives were chosen as the
positive training set and the feature responses from some of the false positives
from each image were selected to be part of the negative set. A detailed
explanation on how this was done is given in Section 4.6.
3.8 Tying it all together
To be able to add SVM as the last stage of the classifier we need to decide from
which stage to take the features and how many features to use. Results from a
face detection study by Le and Satoh suggest that the switch from the Adaboost
classifier to the SVM classifier can be done at any stage [13]. The same study
also shows a big increase in performance when going from 25 to 75 features,
while the difference between using 75 and 200 features is not significantly
large. Since the objects to detect differ between this report and the study by
Le and Satoh, there is no guarantee that the optimal number of features is the
same. The speed of the classifier depends strongly on the number of features
used, so it is important to find an optimal tradeoff here.
Chapter 4
Method
This chapter describes how the boosted classifiers of Chapter 3 have been
trained and how they are used for the specific task of detecting footballs. It
describes how the classifiers are shifted in both location and scale across the
image during detection. To reduce the number of false positives a brightness
threshold is introduced, and a mask is used to restrict detection to the only
area where it is interesting to search for the ball: the pitch. A description of
how the features for the SVM have been collected and tested is also given.
4.1 Training
Several different cascades are trained as described in Chapter 3 and the
performances of these classifiers can be seen in Chapter 5.
Image regions of sizes between 8x8 and 15x15 pixels have been used to train
four different classifiers. Bigger image regions result in training samples that
include more of the background. If no background was included in the
image regions used for training, the classifier would only learn the texture of
the ball. Since the resolution is low, it is very difficult to distinguish any texture
on the footballs. The idea here is therefore to include some of the background
to give the classifier more information to work with. By including the
background the classifier has the possibility of finding the difference between
the dark background and the bright ball. How much of the background that
should be included in the samples is not clear. If there is too little background
maybe the classifier will not be able to capture the property that the ball is
white and round compared to the darker background. On the other hand, if too
much of the background is used the classifier will probably do detections only
based on the background instead.
The difference in using different parts of the training set has been evaluated.
One classifier has been trained with easier images and another with harder
images. So called easier images are the images labeled with contrast 1 and 2
and labeled with freeness 1, 2 and 3. The harder images have an additional
1747 images that have been labeled with contrast 3. Using harder images,
where the ball is occluded and the contrast is bad, during training should result
in better detection when the ball is close to a player or occluded in some other
way, but it also makes it harder to distinguish between a ball and a non-ball.
The rejection process will be more forgiving, letting more examples
through the cascade since the training images have a wider diversity. One can
expect that more false detections will be made, requiring a higher number of
stages to reach the same level of false detections.
Two classifiers have been trained to evaluate the importance of using a high
number of negative samples. 2000 and 5000 false positives have been extracted
to use as negative samples in the bootstrapping step.
Discrete Adaboost and Gentle Adaboost, two different boosting algorithms,
are evaluated, and the minimum hit rate and maximum false alarm
rate are varied to train three new classifiers.
An overview of how the classifier is used can be seen in fig 6.
4.2 Step size and scaling
As mentioned in Chapter 3, detection is done by sweeping a window of
different sizes over the image, running the classifier at each image region.
Since the footballs are not perfectly aligned and have a small variation in
position and size, the trained detector is somewhat invariant to small shifts.
An object can therefore be detected even though it is not perfectly centered.
However, if not going through all possible image regions some objects are
likely to be missed. The step size also affects the detector speed. With a step
size of 1 pixel and 10 different window sizes there are around one million
image regions that the classifier needs to be run on. By increasing the step size
to 2 (skipping every other pixel in each direction) the number of image regions
is reduced to roughly a quarter, reducing the total classification time
correspondingly. The step size is therefore a tradeoff between detection rate and
time. When having such small objects to detect as is the case in this report, it is
very likely that a small step size is required. The shift in location and window
size has been tested with different step sizes and results are shown in Section
5.3.
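The arithmetic behind this tradeoff can be sketched as follows; the 720x576 frame size and the particular window sizes are illustrative assumptions, not values taken from the report:

```python
def count_regions(img_w, img_h, win_sizes, step):
    """Number of window positions a sliding-window detector must evaluate."""
    total = 0
    for w in win_sizes:
        if w > img_w or w > img_h:
            continue
        nx = (img_w - w) // step + 1   # horizontal positions
        ny = (img_h - w) // step + 1   # vertical positions
        total += nx * ny
    return total

# 10 square window sizes; the 720x576 frame is an illustrative assumption
sizes = range(8, 18)
full = count_regions(720, 576, sizes, step=1)
half = count_regions(720, 576, sizes, step=2)
```

Since the classifier cost is roughly proportional to the number of regions, the ratio `full / half` translates directly into the speedup obtained by the larger step.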
Since the balls we want to detect range between 3 and 7 pixels in diameter,
there is no reason to search for objects of other sizes. The detection window
is therefore scaled until it is just larger than the biggest possible object,
but no further. Scaling can be done either by scaling the image region itself
or by scaling the features. Here the features are scaled, since this comes at
no cost (see the section on the integral image, which shows that the size of a
rectangle does not affect calculation time), whereas scaling the image region
is time-consuming.
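As a sketch of why scaling the features is free, a summed-area table (integral image) makes any rectangle sum a four-lookup operation, regardless of the rectangle's size:

```python
def integral_image(img):
    """Summed-area table with one extra row/column of zeros:
    ii[y][x] holds the sum of img[0..y-1][0..x-1]."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum over a w-by-h rectangle at (x, y): four lookups, any size."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]
```

A scaled Haar-like feature is just the same expression evaluated over larger rectangles, so its cost does not change.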
4.3 Masking out the audience
Since the ball only needs to be detected when it is in play, there is no need
to perform detection outside the pitch. The camera system has a model of the
pitch, so its limits are easy to obtain. For each match 16 different mask
images are constructed, one for each camera (to the right in fig 13). When
stepping through the x and y coordinates of an image, the mask is first
consulted to see whether detection should be performed at that position. In
this way a more accurate result can be achieved. The result can be seen to the
left in figure 13: no detections are made outside the boundaries of the mask.
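A minimal sketch of this gating, with a hypothetical mask layout (a 2-D array of 0/1 values) and classifier interface:

```python
def masked_scan(mask, classify, step=1):
    """Run the classifier only at positions where the pitch mask is set."""
    hits = []
    for y in range(0, len(mask), step):
        for x in range(0, len(mask[0]), step):
            if not mask[y][x]:
                continue          # outside the pitch: no detection attempted
            if classify(x, y):
                hits.append((x, y))
    return hits
```

Besides removing false alarms in the stands, skipping masked positions also saves the classifier evaluations that would have been spent there.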
Fig 13 Detections (left) and the corresponding mask (right).
4.4 Number of stages
What has been noticed when running the trained classifiers on the test data is
that images from different matches respond very differently to the cascade.
Images from a bright match may give a lot of false detections when running the
cascade with many stages, while images from another match give hardly any
false detections. But when increasing the number of stages used, positive
detections are lost from the latter while only reducing the false positives from
the former. This means that the optimal number of stages used for classification
differs from match to match. So another way of getting the ROC-curves would
be to set a limit on how many false detections the classifier is allowed to find in
an image, and run the classifier until this limit has been reached. This
increases the performance of the classifier when testing on a range of
matches, and a test made this way is presented in Chapter 5. However, during
the testing of other parameters this method is not practicable, since it would
then be impossible to obtain comparison data that depends only on the
parameter under test.
4.5 Brightness threshold
A brightness threshold has been used to eliminate some of the false detections.
Since the features used do not capture the pixel value but only how the pixel
values are related to each other, some of the detections have been found to be
located on grass areas which are totally green. These can easily be rejected by
looking at the brightness of the pixels in the detected area. If no single pixel is
bright enough the detection can be ruled out.
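The rejection rule can be sketched as follows; the detection format is a hypothetical stand-in, while the threshold value 125 is the one reported in Section 5.7:

```python
def reject_dark_detections(detections, image, threshold=125):
    """Drop detections whose area contains no sufficiently bright pixel.
    Each detection is (x, y, w, h); image is a 2-D list of gray values."""
    kept = []
    for (x, y, w, h) in detections:
        region = [row[x:x + w] for row in image[y:y + h]]
        if any(p > threshold for row in region for p in row):
            kept.append((x, y, w, h))   # at least one white-ish pixel: keep
    return kept
```

Because the test is a cheap scan over a few pixels, it can be applied to every surviving detection without affecting the overall speed.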
Fig 14 Left: Image of false alarms on grass. Right: Mask that indicates the two different
thresholds when an umbra is present.
To the left in fig 14 we can see two false detections that have the same feature
responses as a ball. It can be seen that these two detections consist of a brighter
circular area in the center and darker pixels around. It also seems clear that the
areas do not contain any pixels that are as white as a ball would be.
To find the optimal threshold value for each individual match, a histogram of
the intensity of the pixels in the whole image has been used. Usually, a peak in
this histogram indicates the color of the grass. When part of the pitch lies
in shadow (an umbra) because of the sun, it should be possible to find two
such peaks, one for each grass region. It is possible to extract a mask of
where the umbra is, as seen to the right
in fig 14 and by looking at the brightness histogram it is possible to find a
corresponding threshold that is optimal for the two different regions. On the
dark sections of the pitch the threshold is close to one peak, on the bright
sections of the field the threshold is close to the other peak. To the right in fig
14 the threshold has been chosen to be .
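A sketch of the peak-finding idea; the local-maximum test and the margin added above the grass peak are assumptions for illustration, not the exact procedure used in the thesis:

```python
def grass_peaks(pixels, bins=256):
    """Return the two most populated local maxima of the intensity
    histogram; with an umbra these correspond to sunlit and shadowed grass."""
    hist = [0] * bins
    for p in pixels:
        hist[p] += 1
    candidates = [i for i in range(1, bins - 1)
                  if hist[i] >= hist[i - 1] and hist[i] > hist[i + 1]]
    candidates.sort(key=lambda i: hist[i], reverse=True)
    return sorted(candidates[:2])

def region_threshold(grass_peak, margin=40):
    """Per-region threshold: some margin above the local grass peak
    (the margin is an arbitrary illustrative value)."""
    return grass_peak + margin
```

With the umbra mask, each pixel position is then checked against the threshold derived from the peak of its own region.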
4.6 SVM
Two different types of features have been used to train and evaluate the SVM
method: the pixel values of the image regions directly, and the feature
responses from a classifier trained with Adaboost. False positives have been
extracted in the same way as before. Both methods use images from the training
set and the test set 1 described in Chapter 2.
The training and testing procedure for the SVM using feature responses is as
follows:
Training - Run a cascade with some low number of stages on the training set
discussed in Chapter 2. The feature responses of around 200 features, starting
from some stage, are calculated for both positives and negatives (one false
positive is taken from each image to get an equal number of positives and
negatives in the resulting training data). How many features should be used is
not known, but the study by Le and Satoh suggests that 200 is a good number
[13]; some different values are tested. The Support Vector Machine is then
trained on these feature responses. By using feature responses from later
stages it should be possible to make the SVM better at classifying the more
difficult samples. At the same time it will classify the easier samples worse,
but the cascade of boosted classifiers run before the SVM step is meant to
take care of those.
Testing - Let the same cascade classify the test set 1 discussed in Chapter 2.
Compute the same roughly 200 feature responses starting from the same stage,
label the windows classified as detections as positives and the rest as
negatives, then run the SVM on the extracted feature responses and evaluate
the performance. By adding the SVM after different numbers of stages it is
possible to construct a ROC-curve. For comparison, run the cascade alone up to
stages that give a similar rejection or detection rate as the SVM and evaluate
its performance. The two methods can then be compared on the same positive and
negative sets.
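The data-assembly part of this training step might look as follows; the window format and the feature evaluators are hypothetical stand-ins, while the default of 200 features follows Le and Satoh [13]:

```python
def build_svm_dataset(windows, cascade_features, start_index, n_features=200):
    """Collect the responses of ~n_features features from windows that
    survived the early cascade stages; labels are +1 (ball) / -1 (non-ball)."""
    feats = cascade_features[start_index:start_index + n_features]
    data, labels = [], []
    for window, is_ball in windows:
        data.append([f(window) for f in feats])
        labels.append(1 if is_ball else -1)
    return data, labels
```

The resulting `data`/`labels` pair is exactly what an SVM package such as LibSVM expects as training input, after scaling.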
As mentioned in the LibSVM guide, scaling of the data is very important to
achieve good results [6]. Without scaling the data it was not possible to get
any acceptable results. The main reason scaling matters is to prevent features
with large numeric ranges from dominating those with small ranges. Scaling
also simplifies the calculations, improving the efficiency of the classifier.
All data, both the training and the testing data, has been scaled in the same
way to the range .
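A minimal per-feature min-max scaler in the spirit of the LibSVM guide; the target range [-1, 1] is the guide's common choice and an assumption here, since the exact range is not shown above:

```python
def fit_scaler(rows, lo=-1.0, hi=1.0):
    """Per-feature min-max scaling as recommended in the LibSVM guide.
    Returns a function that must be applied unchanged to the test data."""
    mins = [min(col) for col in zip(*rows)]
    maxs = [max(col) for col in zip(*rows)]

    def scale(row):
        return [lo + (hi - lo) * (v - mn) / (mx - mn) if mx > mn else lo
                for v, mn, mx in zip(row, mins, maxs)]
    return scale
```

Fitting the scaler on the training data and reusing the same `scale` function for the test data is what "scaled in the same way" amounts to.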
4.7 OpenCV
The detection part of OpenCV was developed with face detection in mind. As a
result it has been somewhat optimized for larger objects than the footballs in
this thesis. This means that modifications to the code have been made to search
for smaller objects.
Chapter 5
Results
In this section results from the different trained classifiers are presented.
Comparisons are made using ROC-curves. The performance is only shown for
the values of interest since the training of later stages requires a lot of time.
5.1 ROC-curves
To see the ratio between hit rate and false alarm rate, ROC-curves are used to
present the results. Hit rates are shown in percentage while the false alarm rate
shows the actual count of detected false positives per image.
A detection is counted as a positive hit if:
- the distance between the center of the detection and the center of the
  actual football is less than 30% of the width of the actual football, and
- the width of the detection window is within ±50% of the actual football
  width.
Other detections are regarded as false alarms.
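The two criteria above can be written directly as a matching function (coordinates and widths in pixels):

```python
def is_hit(det_cx, det_cy, det_w, ball_cx, ball_cy, ball_w):
    """True if a detection counts as a positive hit under the two criteria;
    everything that fails either test is a false alarm."""
    dist = ((det_cx - ball_cx) ** 2 + (det_cy - ball_cy) ** 2) ** 0.5
    center_ok = dist < 0.30 * ball_w       # within 30% of the ball width
    width_ok = 0.5 * ball_w <= det_w <= 1.5 * ball_w   # within +/-50%
    return center_ok and width_ok
```

Counting hits and misses with this function over a test set gives the hit rate, and the remaining detections give the false alarms per image.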
A test with a perfect result has a ROC-curve that passes through the upper
left corner, i.e. a 100% hit rate with 0 false alarms. The closer the curve is
to this point, the better the accuracy of the classifier. Points that fall
below the dotted line in fig 15 are the results of a classifier that is worse
than chance.
Fig 15 A ROC-curve. The closer the curve is to the top left corner the better.
To get the curves, the number of stages used for detection is varied. More
stages mean a more selective classifier, while fewer stages let more of the
image regions through the cascade as positives.
The choice of the number of stages influences performance differently on
different matches. A specific number of stages may produce many detections in
one match and very few in another. Since the test set contains images from
different matches, this is a problem. A test has therefore been made in which
the number of stages used during detection is adjusted: depending on the
number of detections made in the previous image, more or fewer stages are used
for the following images. This improves performance, as seen in Section 5.8.
5.2 Training results
Training and testing have been done on a 2.16 GHz computer with 2 GB RAM. In
general it has taken on the order of days to train a classifier up to stage
35. Some training sessions have had to be stopped earlier because training was
too time consuming; it has, however, always been possible to obtain a
ROC-curve that can be compared with the others. The variables that affect
training time are the number of negatives used in the bootstrapping step, the
size of the image regions, the minimum hit rate and the maximum false alarm
rate. The first can be very time consuming, so it is important to have a good
set of negative images from which the algorithm can find false positives. The
last three increase the number of features needed to reach the goal at each
stage.
Some examples of features chosen by the algorithm can be seen in figure 16.
Fig 16 Example of features selected by Adaboost in early stages
5.3 Using different images for training
Adaboost has been reported to be sensitive to noisy data [8]. Here some tests
with different images used during training are reported.
5.3.1 Image size
Image regions of sizes between 8x8 and 15x15 pixels have been used to train 4
different classifiers. Examples are given in figure 17.
The results show only small differences between the classifier trained on
image regions of 10x10 pixels and the one trained on regions of 12x12 pixels.
Regions of 8x8 pixels give worse performance, and regions bigger than 12x12 do
not improve performance, as seen in fig 18.
Size 8x8 Size 10x10 Size 12x12 Size 15x15
Fig 17 Image regions of different sizes
Fig 18 Using image regions of size 12x12 gives the best performance, although
there is not a big difference between the classifiers.
5.3.2 Image sets
The next test shows the importance of having a good image database. Two
classifiers trained on different images are compared: one trained on harder
images, in which the ball is more often occluded and the contrast is worse,
and one trained on clearly visible footballs. The results can be seen in fig
19. Around 4 more stages were required for the harder classifier to reach the
same false alarm rate. At the same time, the classifier trained on the harder
images shows better overall performance.
Fig 19 A classifier trained with images with less contrast and where the ball is partially
occluded shows better performance than a classifier trained only on clearly visible footballs.
5.3.3 Negative images
Another modification to the set of images used in training is to use more
negatives in the bootstrapping step (see Section 3.6.1). The bootstrapping
step selects a number of false positives, as classified by the currently
trained cascade, to serve as negative examples. Until now the algorithm has
used only 2000 negative samples at each stage; by letting the training
procedure extract 5000 negatives at each stage it should be possible to
improve performance. One problem with using a high number of negatives in this
step is that it becomes more and more difficult to find a large number of
false positives as training reaches later stages. Increasing this number also
immediately increases the training time. As a reference, it took 705 s to find
2000 false positives at stage 39, and 2105 s to find 5000, while in the first
stages both took under a second.
Fig 20 Using a higher number of negative samples at each stage of training
increases performance by a couple of percentage points.
The comparison between using 5000 and 2000 negatives at each stage can be seen
in fig 20. The two curves follow each other, with the curve for the classifier
trained with 5000 negatives a couple of percentage points higher.
One question that arises is how a higher number of negatives affects
generalization. Is the effect purely positive, or could the new classifier
become too specialized to the training data? If the negatives are very similar
to the positives, the trained classifier is likely to have a decision boundary
that lies very close to the positives. Running the classifiers on the test set
2 described in Chapter 2 gives some indication of how well the two classifiers
generalize. The difference in performance between them is very small, too
small to allow any conclusions about generalization ability. As seen in fig
21, the detection rates are very low.
Fig 21 The classifier trained with 5000 negatives performs better in the lower regions of the
ROC-curve when testing on a game that has not been used in training
5.4 Step size
During detection, a detection window is swept across the image at different
locations and scales. By shifting the window a few pixels at a time the whole
image can be scanned. A step factor is used to increase both the window size
and the step size: if the current scale is s, the window is shifted by s
pixels, so a large detection window is shifted more than one pixel at a time.
Since the image regions used in training are not perfectly centered, a small
amount of translational variability is trained into the classifier. While it
speeds up the classifier substantially, shifting more than one pixel at a time
has resulted in a decrease in performance.
With a scale factor of 1.2 the classifier processed around 24 images per
second (between 87 and 96 seconds for all 2221 images). With a scale factor of
1.1 it took about twice as long, i.e. around 13 images per second. Due to the
higher performance of a small step factor, a fixed step size of 1 pixel,
without applying the scale factor to the step size, has also been tested. The
window size is still increased by the step factor until the window is large
enough. This way the classifier only processes about 9 images per second.
Fig 22 A smaller step factor increases performance but also increases processing time.
These results, seen in fig 22, clearly show a big increase in hit rate when
the step size is decreased. In the following results a fixed step size of 1
pixel is used together with a step factor of 1.1 for the window size. A
pre-calculation step, which used a step size of 2 pixels as a first pass, has
also been removed. Removing this step improves the detection rate, as shown in
figure 23, while decreasing the number of images processed per second to 8.5.
Fig 23 Old step size as in fig 22 compared with having removed a pre-calculation step.
5.5 Real and Gentle Adaboost
Two variants of the Adaboost algorithm have been evaluated. Training with
Discrete Adaboost did not finish and is therefore left out of the comparison:
the boosting step was unable to improve performance by updating the weights,
so the process got stuck. Lienhart reported similar convergence problems with
LogitBoost for face detection and was not able to evaluate that method [14].
Lienhart also showed that Gentle Adaboost was the best of the Real, Discrete
and Gentle variants, at least for face detection, and Le and Satoh state that
Discrete Adaboost is too weak a booster for a dataset that is hard to separate
[13].
Fig 24 The performance of Real Adaboost and Gentle Adaboost is essentially the same.
Figure 24 shows that the difference in performance between the two Adaboost
variants, Real and Gentle, is minimal.
5.6 Minimum hit rate and maximum false alarm rate
As described in Section 3.6 the min hit rate and max false alarm rate are used
to set up the properties of the cascade. They describe the values each stage
needs to reach in order to move on to the next stage.
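Under the standard cascade accounting, and assuming the stages act independently, the per-stage targets compound multiplicatively over the cascade. With the minimum hit rate of 0.995 used elsewhere in this report and, for illustration, a per-stage false alarm rate of 0.5 over 20 stages:

```python
def cascade_targets(min_hit_rate, max_false_alarm, n_stages):
    """Overall hit rate and false alarm rate implied by per-stage targets."""
    return min_hit_rate ** n_stages, max_false_alarm ** n_stages

# 0.995 is used elsewhere in this report; 0.5 and 20 stages are illustrative
hit, fa = cascade_targets(0.995, 0.5, 20)
```

Here the overall hit rate is about 0.90 and the overall false alarm rate about 1e-6, which is why even a small change in the per-stage targets has a large effect on the whole cascade.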
We see that lowering the maximum false alarm rate improves performance
significantly. Even better performance is achieved by using a higher minimum
hit rate during training. It is worth noting that a higher minimum hit rate
alone performs as well as adjusting both the minimum hit rate and the maximum
false alarm rate. To make it easier to refer back to it later in the report,
the classifier with the best performance in fig 25 is called classifier 1.
Fig 25 Comparison between using different values of the minimum hit rate and the false alarm
rate during training. Better performance is achieved by using a higher min hit rate during
training.
A question that arises is whether this procedure deteriorates the
generalization performance of the classifier. Perhaps these results reflect a
classifier that has been adjusted too well to the training data? To test this,
the performance of this classifier is compared on test set 2 with that of the
classifier trained with a minimum hit rate of 0.995. Test set 2 contains
images from a game not used for training. The results in fig 26 give a slight
indication that the classifier has lost some generalization ability: the
classifier trained with an increased minimum hit rate and a decreased maximum
false alarm rate still performs better than the classifier trained with the
default values, but the difference is much smaller.
The downside of tightening these limits is the effect on training. It can take
longer for each stage to reach the limits, increasing the total training time,
and some training sessions never finished because the limits could not be
reached. In addition, more features are usually needed to reach stricter
limits, which means that the final classifier will be slower.
Fig 26 The classifier trained with a higher min hit rate and a lower false alarm rate still
performs better when testing on test set 2, but the difference between the two classifiers is
minimal.
5.7 Brightness threshold
Looking at the false positives found by the classifiers, false detections are
sometimes made on the green grass. These regions clearly contain no white
pixels but are classified as footballs anyway; see Section 4.5 for more
information. A simple way of removing such detections is to reject a detection
if no pixel in the detected area is brighter than a brightness threshold. The
performance for different values of this threshold can be seen in fig 27.
Fig 27 Performance when using different threshold values to remove detections
that are not bright enough.
The images are saved by the ball tool in RGB but later converted to
single-channel grayscale, using the luma Y of the Y'UV color space as the
grayscale value. This is described in more detail in Section 2.2.
The brightness threshold is added as the last step of detection.
The results show that by adding this threshold, many of the false detections can
be ruled out without decreasing the detection rate of the classifier. These tests
were made on the test set 1, and the classifier used was again classifier 1. The
best result achieved without losing any hits was using a threshold of 125. Using
20 stages of the cascade no positive detections were lost, while decreasing the
number of false detections from 5.4 to 4.3 per image.
Further tests show that the threshold can be optimized even more. When the
threshold value is increased, only images from one match suffer a weaker
detection rate, while that of the other matches remains unchanged. Fittingly,
this match had difficult lighting conditions and the football was often darker
than normal. These results suggest that the threshold value can be optimized
for each individual match.
By using the threshold mask described in Section 4.5 the threshold can
automatically be optimized for the current lighting conditions. The comparison
is made in fig 27 between the cascaded classifier 1 with and without threshold.
As expected the performance is increased even further using this new
threshold.
5.8 Number of stages
Since the images in the test set are ordered, the classifier can be adjusted
to each game by increasing or decreasing the number of stages used for
detection in an image, depending on how many detections were made in the
previous image.
An upper and a lower limit for when to lower and raise the number of stages
are needed. By varying these two limits it is possible to get a ROC-curve as in
fig 28.
Fig 28 By adapting the number of stages according to the number of detections made in the
previous image better results are achieved.
Results show that the classifier benefits from being adjusted to each game to
maximize performance.
5.9 Support Vector Machine
As mentioned in Section 4.6, there are two parameters that need to be selected
when using the RBF kernel for SVM classification: C and γ. Unfortunately
there is no way to generalize the selection of these parameters. For every data
set there is a different choice of parameter values that are optimal. These values
have been found using cross-validation (see Section 3.7.2). Only the
performances of the best classifiers are shown in this section.
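A sketch of this parameter search; `cv_accuracy` is a hypothetical stand-in for an actual k-fold cross-validation run of the SVM, and the exponential grids follow the LibSVM guide's suggestion:

```python
from itertools import product

def grid_search(cv_accuracy, Cs, gammas):
    """Pick the (C, gamma) pair with the best cross-validation accuracy.
    cv_accuracy(C, gamma) stands in for a k-fold CV run of the RBF-SVM."""
    return max(product(Cs, gammas),
               key=lambda cg: cv_accuracy(cg[0], cg[1]))

# exponentially spaced grids, as suggested by the LibSVM guide
Cs = [2.0 ** e for e in range(-5, 16, 2)]
gammas = [2.0 ** e for e in range(-15, 4, 2)]
```

A coarse grid like this is usually followed by a finer search around the best coarse point; either way, the chosen pair only applies to the data set it was tuned on.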
In these tests the SVM has been integrated as the last step after the cascaded
classifier. The ROC-curve has been constructed by varying the stage when the
SVM takes over the classification task.
The first test uses an SVM model trained with 283 feature responses from
stages 16 and 17 of the best classifier from Section 5.6 (classifier 1). As
seen in fig 29, the results of the SVM model stay below the ROC-curve of the
cascaded classifier. The results of a similar classifier trained directly on
pixel values are even worse and are therefore left out of the report.
Fig 29 Adding the SVM as the last step after different numbers of stages. The
SVM, trained on 283 features, does not show better results than the cascaded
classifier.
Possible explanations for the SVM performing worse than the boosted classifier
alone are that an unsuitable number of features was used, or that the features
were taken from stages too late or too early in the cascade. The next tests
examine these possibilities.
The next two tests using the SVM were done on classifiers trained with the
same number of feature responses (283) but taken from earlier and later stages
of the cascade, starting from stage 2 and stage 26 respectively. The same
cascaded classifier has been used so that more and less discriminative
features can be compared. Using features from later stages should give a
classifier that is better at separating harder samples. In fig 30 the results
show that the overall performance of the two new classifiers is worse than
that of the previous classifier.
A higher number of features does not help the SVM classifier, as seen in fig
31. On the other hand, performance increases when the number of features is
lowered. The results in this figure come from a classifier trained on features
from stage 15 onwards, until the desired number of features was reached. A
zero-mean normalization of each image region has also been tried, without any
sign of improvement.
Fig 30 Training the SVM on feature responses from earlier and later stages results in poor
performance.
Fig 31 Using additional features does not improve the performance of the SVM
classifier. Good performance is shown by the classifiers trained with very few
features.
5.10 Comparison with existing detection
Tracab's main method for finding the ball today is based on tracking its
movement. Since there has not been enough time to integrate the method
proposed in this report with Tracab's existing software, a comparison of the
two methods' ability to produce a final ball hypothesis has not been possible.
Instead, a specific detection step in Tracab's system is compared with the
method proposed in this report.
The comparison has been made by running both methods on test set 1. A first
step in Tracab's algorithm extracts possible ball candidates by looking for
image regions that are brighter than the background; these regions are then
stereo-matched. The resulting set of image regions is saved and used as the
database in the comparison test. The next step narrows down the candidates
with a correlation method that extracts ball-like candidates. This yields a
detection rate and a false detection rate that are compared with those of the
cascaded classifier, as seen in fig 32. The cascaded classifier has been run
on the same database with different numbers of stages to obtain a ROC-curve.
Fig 32 The method proposed in this report compared with a method in the existing system
As seen in fig 32 the performance of the cascaded classifier is slightly higher.
5.11 Five-a-side
Five-a-side is a variant of football with only five players per team, played
on a pitch around 16 times smaller than a regular one. This test can be seen
as having the same setup as before but with better, higher-resolution cameras.
Since technology evolves rapidly and prices fall, there is reason to believe
that better cameras will be used in the near future.
The cascade has been trained as described in Chapter 3, but on the images
described in Section 2.2.4. Due to the long training time the training
parameters have been set to a maximum false alarm rate of 0.4, a minimum hit
rate of 0.995 and 2000 negatives per stage. The SVM has been trained with 197
feature responses from stage 16 and is added as the last step of the cascade.
Fig 34 shows the performance of the classifier run with and without SVM on
the test set described in Section 2.2.4. The hit rate is over 95% even at low
false alarm rate. As before, using SVM does not increase performance. The two
curves in fig 34 follow each other closely.
Fig 33 The system set up at a five-a-side pitch. The cameras are closer to the
ball, giving footballs of higher resolution; the texture of the ball is now
distinguishable.
Fig 34 Very good performance is shown by the classifier trained using images of footballs of
higher resolution. No improvement can be seen when using SVM as the last stage.
In the same way as in Section 5.10, a comparison has also been made between
the classifier proposed in this section and a current detection method used by
Tracab. This can be seen in fig 35.
Fig 35 The method proposed in this report compared with a method in the existing system at
Tracab.
As seen, the cascaded classifier outperforms the current method. Again, it is
important to remember that this is not the only method used in Tracab's
system. In addition to the higher detection rate, the most positive result in
this test is the low false alarm rate shown by both methods.
5.12 Discussion
Overall, the results show that the detection task set up for this thesis can
be solved with pleasing results. Judging by the features it selects, the
boosting procedure seems capable of extracting the information available in
the sample images. In early stages the features are understandable: they
capture the property of the football being bright in the middle and darker
towards the sides. What the features in later stages represent is less
obvious. This may be a sign that the process overfits to the data, but studies
have shown the method to be robust to overfitting [29].
As expected, the results in Section 5.3 indicate that the image database is
crucial for good performance. It is a little surprising that the best
performance is achieved when including as much background as in the
12x12-pixel image regions. The increased performance when using harder images
is probably because the training set then relates better to test set 1, which
was used in both tests. The results confirm the expectation that more stages
are required with harder images in order to reject as many samples as with
only the easier ones.
Also as expected, results show a big difference in performance when reducing
the step size. This is the case because the objects dealt with in this report are
small. Unfortunately this is directly related to the time needed to classify an
image. Luckily it is very easy to do a tradeoff between processing time and
performance.
Tests of using a brightness threshold show how the boosting only trains the
classifier to identify relative features, not exact pixel values. This is why the
brightness threshold can be successful. The tests of brightness threshold along
with varying the number of stages show the importance of adjusting the
classifier to each game and lighting condition.
The most disappointing results in this report come from the Support Vector
Machine method. The results in this thesis contradict those of the related
study by Le and Satoh, which suggest that a higher number of features
extracted by a boosted classifier makes it easier for the SVM to separate the
two classes [13]. This may be due to overfitting (Section 3.7.1). These
results are discouraging, since better results were expected from the Support
Vector Machine method.
One of the goals with this thesis was to evaluate if it was possible to improve
the football detection of today. In the comparison, one has to bear in mind that
additional techniques are used by Tracab to find the final ball hypothesis. Both
classifiers show good performance on the comparison made in Section 5.10.
Another comparison would have been to include harder images such as
footballs that are partially occluded by a player and therefore only visible in
one camera. This has not been possible. One advantage of the cascaded
classifier, visible in fig 32, may be that it makes it easy to rate detections
by confidence: a detection that makes it through a high number of stages is
more likely to be the actual ball than one that is thrown away at an earlier
stage.
Even higher detection rates are shown in the test on the five-a-side match.
With the cameras closer to the pitch, the texture of the ball becomes visible
and the classifier has more to work with, which is part of the explanation for
the good results in fig 34. When comparing the performance, however, one
should bear in mind that this test set differs from the earlier one, so the
results cannot be compared directly. The labeling of the images with respect
to contrast and how unoccluded the footballs are has not been done for this
set, which makes it even more difficult to compare with the first test set.
Also, to save time, the extraction of footballs from the images was sped up by
mostly including easy targets and by removing detections in areas around
tracked players. This, of course, helps explain the high hit rates of the
classifiers.
The SVM method shows results here that are similarly disappointing to those
in the earlier SVM tests. Although the performance of the SVM seems to have
increased, it still does not outperform the cascaded classifier. This can
also be taken as an indication that the cascaded classifier is performing well.
Chapter 6
Conclusions and future work
This chapter gives an overview of the results and the conclusions that can be
drawn from them. Some thoughts on what should be improved in the future
are also presented.
6.1 Conclusions
In this report a method for object detection has been used to detect small
footballs in real time. Finding these footballs is a hard task, mainly because
they are very small; the method has not previously been used on objects of
this size. Because of the size of the balls, a smaller spatial step size has
been needed to achieve a desirable hit rate than what has been reported in
previous work, which results in a much slower detector. The best classifier
achieves a speed of 8.5 images per second, although no optimization for speed
has been done. By introducing the brightness threshold before the classifier,
the processing time could be reduced, and it is also easy to trade processing
time against performance. Tests made on images of footballs in higher
resolution show increased performance; on the other hand, so does the method
available at Tracab today.
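The brightness-threshold idea mentioned above amounts to a cheap pre-filter: candidate windows containing no sufficiently bright pixel are rejected before the expensive cascade runs at all. A minimal sketch, where the threshold value and the flat-list window representation are illustrative and not taken from the actual system:

```python
# Cheap pre-filter: a football is bright against the pitch, so a window
# with no pixel above the threshold cannot contain the ball and is
# discarded before the cascade is evaluated. Threshold is illustrative.
def passes_brightness(window, threshold=180):
    return max(window) >= threshold

windows = [
    [30, 40, 35, 32],    # dark grass -> rejected
    [60, 220, 80, 70],   # contains a bright pixel -> kept
    [90, 95, 100, 99],   # uniformly mid-gray -> rejected
]
candidates = [w for w in windows if passes_brightness(w)]
print(len(candidates))  # 1
```

Raising the threshold discards more windows (faster, but risks losing dim balls); lowering it passes more windows to the cascade, which is the processing-time/performance tradeoff the text refers to.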
The overall performance shown by the classifier in tests is promising, but
since the method has not been implemented to produce a single final hypothesis
of the best ball candidate, it has been difficult to make fair comparisons
with the method available at Tracab today. It is therefore difficult to say
whether this method, implemented as a final ball detector, would be better
or worse than the existing one.
The idea of using a classifier such as an SVM as the last stage has been shown
not to work perfectly. The decreased performance when a higher number of
features is used during training of the SVM contradicts results in previous
studies [13], and may be due to overfitting.
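One way to guard against this kind of overfitting would be to select the number of features on a held-out split rather than fixing it in advance. The sketch below illustrates that selection loop with a toy nearest-mean classifier standing in for the SVM; all data, names, and splits are invented for the example:

```python
def train_nearest_mean(xs, ys):
    """Toy stand-in for SVM training: store per-class feature means."""
    model = {}
    for c in sorted(set(ys)):
        rows = [x for x, y in zip(xs, ys) if y == c]
        model[c] = [sum(col) / len(col) for col in zip(*rows)]
    return model

def predict(model, x):
    # Assign the class whose mean vector is closest (squared distance).
    return min(model, key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(x, model[c])))

def accuracy(model, xs, ys):
    return sum(predict(model, x) == y for x, y in zip(xs, ys)) / len(ys)

def pick_feature_count(tr_x, tr_y, va_x, va_y, counts):
    """Keep the feature count with the best held-out accuracy."""
    best, best_acc = None, -1.0
    for k in counts:
        model = train_nearest_mean([x[:k] for x in tr_x], tr_y)
        acc = accuracy(model, [x[:k] for x in va_x], va_y)
        if acc > best_acc:
            best, best_acc = k, acc
    return best

# Feature 0 is informative; feature 1 fits the training split by
# chance but reverses on the held-out split (i.e. it overfits).
tr_x, tr_y = [[0, 0], [1, 1], [9, 9], [8, 8]], [0, 0, 1, 1]
va_x, va_y = [[1, 9], [8, 0]], [0, 1]

print(pick_feature_count(tr_x, tr_y, va_x, va_y, [1, 2]))  # 1
```

Here the held-out split correctly prefers the single informative feature over the larger, overfit feature set, which is the behavior the SVM experiments in this thesis would have benefited from.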
6.2 Future work
The natural next step would be to integrate the method into the existing
system, to see if it can improve the performance of finding a final ball
hypothesis. This is the only way of getting a true comparison with the
existing methods at Tracab.
Big differences in performance can be seen when different image sets are used
(Section 5.3.2). The image set can therefore probably be improved and should
definitely be revised. The current classifiers have been trained on 6 different
matches; this small set does not provide enough variety to cover all possible
conditions of illumination and color. To get a classifier that generalizes
well on any kind of new data, a wider range of matches is necessary, and with
such a data set the classifier must be tested on data from matches not used
during training. Another approach would be to train a cascade that is
optimized for one setup, for example by training only on images with certain
lighting conditions or images of a specific football. During classification
one would then, for each match or perhaps even for each image region, start
by examining which of the several trained classifiers to use.
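The per-setup selection step could be as simple as storing, alongside each trained cascade, a descriptor of the conditions it was trained under and picking the cascade whose descriptor best matches the live footage. A hypothetical sketch, where the descriptor is just a mean-brightness value and all names and numbers are invented:

```python
# Each trained cascade is keyed by a descriptor of its training
# conditions -- here a single mean-brightness value (illustrative).
trained = {
    "daylight": 160,
    "floodlit": 90,
    "overcast": 120,
}

def select_classifier(image_pixels, trained_descriptors):
    """Pick the cascade whose training conditions match the footage."""
    mean = sum(image_pixels) / len(image_pixels)
    return min(trained_descriptors,
               key=lambda name: abs(trained_descriptors[name] - mean))

print(select_classifier([85, 95, 90, 88], trained))  # floodlit
```

A real system would likely use a richer descriptor (color histogram, per-region statistics), but the selection logic would follow the same nearest-match pattern.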
The results from the SVM method suggest that the feature selection can be
done in a better way. Are all the features relevant, or is one feature worth
more than another and should therefore be weighted up? These are some of the
questions that need to be answered, and a good starting point is a survey
addressing the problem of feature selection [12].
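A minimal, assumption-laden sketch of the variable-ranking idea from that survey [12]: score each feature by how well its values separate the two classes relative to the within-class spread, then keep or up-weight the top-scoring features. The Fisher-style score and the toy data below are illustrative, not taken from the thesis experiments:

```python
def fisher_score(pos, neg):
    """Separation of one feature's values between the two classes:
    squared mean difference over the summed within-class variances."""
    def mean(v):
        return sum(v) / len(v)
    def var(v):
        m = mean(v)
        return sum((x - m) ** 2 for x in v) / len(v)
    denom = (var(pos) + var(neg)) or 1e-9  # guard against zero spread
    return (mean(pos) - mean(neg)) ** 2 / denom

# Feature 0 separates the classes well, feature 1 barely at all.
pos_feats = [[9.0, 5.0], [8.0, 5.1]]
neg_feats = [[1.0, 5.0], [2.0, 4.9]]

scores = [fisher_score([p[i] for p in pos_feats],
                       [n[i] for n in neg_feats])
          for i in range(2)]
ranking = sorted(range(2), key=lambda i: scores[i], reverse=True)
print(ranking)  # [0, 1] -- feature 0 ranked first
```

Such a ranking could answer the weighting question above directly: features with low scores are candidates for removal, and the scores themselves could serve as weights.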
In the near future, cameras of higher resolution will probably be used, so it
is natural to continue the research in this direction. The five-a-side test
was a first step towards testing this.
Bibliography
1. M. Aizerman, E. Braverman, L. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control 25, pp 821-837, 1964.
2. N. Ancona, A. Branca. Example based object detection in cluttered background with Support Vector Machines. Instituto Elaborazione Segnali ed Immagini. Bari, Italy 2000.
3. N. Ancona, G. Cicirelli, E. Stella, A. Distante. Ball Detection in Static Images with SVM for Classification. Image and Vision Computing 21, pp 675-692, 2003.
4. H. Bay, T. Tuytelaars, L. Van Gool. SURF: Speeded Up Robust Features. Proceedings of the ninth European Conference on Computer Vision, 2006.
5. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2, 121–167, 1998.
6. C. Chang, C. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
7. G. Coath, P. Musumeci. Adaptive Arc Fitting for Ball Detection in RoboCup. APRS Workshop on Digital Image Analysing, 2003
8. T. Dietterich. An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting and Randomization. Machine Learning, 1-22, 1999.
9. T. D’Orazio, C. Guaragnella, M. Leo, A. Distante. A new algorithm for ball recognition using circle Hough transform and neural classifier. Pattern Recognition 37, pp 393-408, 2003.
10. Y. Freund, R. Schapire. A Decision-Theoretic Generalization of on-Line Learning and an Application to Boosting. European Conference on Computational Learning Theory, 1995.
11. J. Friedman, T. Hastie, R. Tibshirani. Additive Logistic Regression: a Statistical View of Boosting. Annals of Statistics, vol. 28, no. 2, pp. 337-407, 2000.
12. I. Guyon, A. Elisseeff. An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 3, 2003.
13. D. Le, S. Satoh. A Multi-Stage Approach to Fast Face Detection. IEICE TRANS. INF. & SYST., Vol.E89–D, NO.7, 2006.
14. R. Lienhart, J. Maydt. An Extended Set of Haar-like Features for Rapid Object Detection. IEEE ICIP 2002, Vol. 1, pp. 900-903, 2002.
15. R. Lienhart, A. Kuranov, V. Pisarevsky. Empirical Analysis of Detection Cascades of Boosted Classifiers for Rapid Object Detection. MRL Technical Report, 2002.
16. Y. Lin, T. Liu. Fast Object Detection with Occlusions. The 8th European Conference on Computer Vision (ECCV-2004), Prague, 2004.
17. C. Liu, H. Shum. Kullback-Leibler Boosting. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 587-594, 2003.
18. D. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 60(2), pp 91-110, 2004.
19. Ville Lumikero. Football Tracking in Wide-Screen Video Sequences. Master Thesis in Computer Science, School of Electrical Engineering Royal Institute of Technology. Stockholm, 2004.
20. K. Mikolajczyk, C. Schmid. A Performance Evaluation of Local Descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, VOL. 27, NO. 10, 2005.
21. S. Mitri, K. Pervölz, H. Surmann, A. Nüchter. Fast Color-Independent Ball Detection for Mobile Robots. Fraunhofer Institute for Autonomous Intelligent Systems (AIS), Sankt Augustin, Germany, 2004.
22. T. Ojala, M. Pietikäinen, T. Mäenpää. Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol 24 NO. 7, 2001.
23. E. Osuna, R. Freund, F. Girosi. Training Support Vector Machines: An Application to Face Detection. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Puerto Rico 1997.
24. C. Papageorgiou, M. Oren, T. Poggio. A General Framework for Object Detection. International Conference on Computer Vision, 1998.
25. B. Rasolzadeh. Response Binning: Improved Weak Classifiers for Boosting. Intelligent Vehicles Symposium, pp 344 – 349, 2006.
26. S. Romdhani, P. Torr, B. Schölkopf, A. Blake. Computationally Efficient Face Detection. Proceeding of the 8th International Conference on Computer Vision, 2001.
27. D. Scaramuzza, S. Pagnottelli, P. Valigi. Ball Detection and Predictive Ball Following Based on a Stereoscopic Vision System. Proceedings of the 2005 IEEE International Conference on Robotics and Automation. Barcelona, Spain, 2005.
28. R. Schapire, Y. Singer. Improved Boosting Algorithms Using Confidence-rated Predictions. Machine Learning, 37(3):297-336, 1999.
29. R. Schapire, Y. Freund. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences 55, pp 119-139, 1997.
30. K. Sung, T. Poggio. Example-based Learning for View-based Human Face Detection. A.I. Memo 1521, MIT A.I. Lab, 1994.
31. P. Viola, M. Jones. Robust Real-Time Object Detection. IEEE ICCV Workshop on Statistical and Computational Theories of Vision, 2001.
32. P. Viola, M. Jones. Rapid object detection using a boosted cascade of simple features. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp 511–518, 2001.
33. J. Wu, J. M. Rehg, M. D. Mullin. Learning a Rare Event Detection Cascade by Direct Feature Selection. Advances in Neural Information Processing Systems (NIPS), 2003.
34. M. Sewell. http://www.svms.org/vc-dimension/ Accessed 2008-05-15.
35. C. Poynton. Frequently Asked Questions about Color. http://www.poynton.com/PDFs/ColorFAQ.pdf Accessed 2008-05-15.
36. http://www.dtreg.com/svm.htm Accessed 2008-05-28.
37. http://www.gestalttheory.net/ Accessed 2008-05-28.
38. Open CV library. http://opencvlibrary.sourceforge.net/ Accessed 2008-06-27.
Appendix 1
Percentage of images that have a value less than or equal to the value given
at the side and at the top of the table:
Training set:
Contrast/Free 1 2 3 4 5
1 0.7 0.8 0.8 0.8 0.8
2 31.0 48.3 57.1 58.0 58.5
3 45.0 72.2 89.1 91.6 92.9
4 46.9 76.0 94.7 98.0 100
Test set 1:
Contrast/Free 1 2 3 4 5
1 0.2 0.5 0.5 0.5 0.5
2 29.7 48.2 57.5 57.8 57.9
3 42.9 72.6 94.4 95.9 96.3
4 43.8 74.0 96.8 99.1 100
TRITA-CSC-E 2009:004 ISRN-KTH/CSC/E--09/004--SE
ISSN-1653-5715
www.kth.se