

Maritime Targets Detection from Ground Cameras Exploiting Semi-supervised Machine Learning

Eftychios Protopapadakis1, Konstantinos Makantasis1 and Nikolaos Doulamis2

1 Technical University of Crete, Chania, Greece
2 National Technical University of Athens, Athens, Greece

Keywords: Vision-based System, Maritime Surveillance, Semi-supervised Learning, Visual Attention Maps, Vehicle Tracking.

Abstract: This paper presents a vision-based system for maritime surveillance using moving PTZ cameras. The proposed methodology fuses a visual attention method, which exploits low-level image features appropriately selected for the maritime environment, with an appropriate tracker. Such features require no assumptions about environmental or visual conditions. The offline initialization is based on a large-graph semi-supervised technique in order to minimize the user's effort. The system's performance was evaluated with videos from cameras placed at Limassol port and the Venetian port of Chania. Results suggest high detection ability, despite dynamically changing visual conditions and different kinds of vessels, all in real time.

1 INTRODUCTION

Management of emergency situations in the maritime domain can be supported by advanced surveillance systems suitable for complex environments. Such systems range from radar-based to video-based. The former, however, has two major drawbacks (Zemmari et al., 2013): it is quite expensive and its performance is affected by various factors (e.g. echoes from targets of no interest). The latter consists of various techniques, each one with specific advantages and drawbacks. The majority of such systems are controlled by humans, who are responsible for monitoring and evaluating numerous video feeds simultaneously.

Advanced surveillance systems should process and present collected sensor data in an intelligent and meaningful way, to give sufficient information support to human decision makers (Fischer and Bauer, 2010). The detection and tracking of vessels is inherently dependent on dynamically varying visual conditions (e.g. varying lighting and sea reflections). So, to successfully design a vision-based surveillance system, we have to carefully define both its operational requirements and the vessels' characteristics.

On the one hand, there are minimum standards concerning operational requirements (Szpak and Tapamo, 2011). First, the system must determine possible targets within a scene containing a complex, moving background. Additionally, it must not produce false negatives and must keep the number of false positives as low as possible. Since it is a surveillance system, it must be fast and highly efficient, operating at a reasonable frame rate and over long time periods, using a minimal number of scene-related assumptions.

On the other hand, regardless of the variation in vessel types, there are four major descriptive categories. First comes size, which ranges from jet-skis to large cruise ships. Second is moving speed. Third, vessels move in any direction relative to the camera position, and thus their angle varies from 0° to 360°. Finally, there is the vessels' visibility: some vessels have good contrast against the sea water, while others are intentionally camouflaged. A robust maritime surveillance system must be able to detect vessels having any of the above properties.

1.1 Related Work

This paper focuses on the detection and tracking of targets within the camera's range, rather than the investigation of their trajectory patterns (Lei, 2013; Vandecasteele et al., 2013) or their classification into categories of interest (Maresca et al., 2010). The system's main purpose is to support the end-user in monitoring coastlines, regardless of the existing conditions.

Object detection is a common approach with

Protopapadakis E., Makantasis K. and Doulamis N.
Maritime Targets Detection from Ground Cameras Exploiting Semi-supervised Machine Learning.
DOI: 10.5220/0005456205830594
In Proceedings of the 10th International Conference on Computer Vision Theory and Applications (MMS-ER3D-2015), pages 583-594
ISBN: 978-989-758-090-1
Copyright © 2015 SCITEPRESS (Science and Technology Publications, Lda.)


many variations; e.g. anisotropic diffusion (Voles, 1999), which has a high computational cost and performs well only for horizontal and vertical edges, or the fusion of foreground object detection with image color segmentation (Socek et al., 2005). In (Albrecht et al., 2011a; Albrecht et al., 2010) a maritime surveillance system is proposed that mainly focuses on finding image regions where there is a high likelihood of a vessel being present. This system was extended with a sea/sky classification approach using HOG (Albrecht et al., 2011b). Detection of vessel classes, using a trained set of MACH filters, was proposed by (Rodriguez Sullivan and Shah, 2008).

All of the above approaches adopt offline learning methods that are sensitive to accumulated errors and difficult to generalize to various operational conditions. (Wijnhoven et al., 2010) utilized an online-trained classifier based on HOG; however, retraining takes place only when a human user manually annotates a new training set. In (Szpak and Tapamo, 2011) an adaptive background subtraction technique is proposed for vessel extraction. Unfortunately, when a target is almost homogeneous, it is difficult for the background model to learn such environmental changes without misclassifying the target.

More recent approaches using monocular video data are the works of (Makantasis et al., 2013) and (Kaimakis and Tsapatsoulis, 2013). The former utilizes a fusion of a Visual Attention Map (VAM) and a background subtraction algorithm, based on a Mixture Of Gaussians (MOG), to produce a refined VAM. These features are fed to a neural network tracker, which is capable of online adaptation. The latter utilizes statistical modelling of the scene's non-stationary background to detect targets implicitly.

The work of (Auslander et al., 2011) emphasizes algorithms that automatically learn anomaly detection models for maritime vessels, where the tracks are derived from ground-based optical video and no domain-specific knowledge is employed. Some models can be created manually, by eliciting anomaly models in the form of rules from experts (Nilsson et al., 2008), but this may be impractical if experts are not available, cannot easily provide these models, or the elicitation cost is high.

1.2 Our Contribution

A careful examination of the proposed methodologies suggests that specific points have to be addressed. Firstly, a system needs to combine both supervised and unsupervised tracking techniques, in order to exploit all their possible advantages. Secondly, since we deal with a vast amount of available data, we need to reduce, as much as possible, the effort required for the initialization of the system.

The innovation of this paper lies in the creation of a visual detection system able to overcome the aforementioned difficulties by combining various well-tested techniques while, at the same time, minimizing effort during the offline initialization using a Semi-Supervised Learning (SSL) technique appropriate for large data sets.

In contrast to the approach of (Makantasis et al., 2013), the user has to roughly segment only a few images, i.e. with minimal effort, in order to create an initial training set. This procedure is easily implemented using the areas suggested by the unsupervised techniques' results. The collaboration of visual attention maps, which represent the probability of a vessel being present in the scene, and background subtraction algorithms provides the user with initially segmented parts, over which the user further acts.

Then, SVMs are used as the additional supervised technique, in order to handle new video frames. The significant amount of labelled data for the training process originates from the previously generated roughly segmented data sets. In order to facilitate the creation of such a training set and to further refine it (i.e. correct some user errors), graph-based SSL algorithms need to be involved.

Unfortunately, SSL techniques scale badly as the amount of available data rises. To make matters worse, (Nadler et al., 2009) have shown that graph Laplacian methods (and more specifically the regularization approach (Zhu, 2003) and the spectral approach (Belkin and Niyogi, 2002)) are not well posed in spaces R^d, d ≥ 2, and as the number of unlabelled points increases the solution degenerates to a non-informative function. Consequently, a semi-supervised procedure suitable for large data sets is exploited for the offline initialization, significantly reducing the effort required.

The rest of the paper is organized as follows: Section 2 presents the system's structure, suitable for the maritime surveillance problem. Section 3 describes the procedure followed for the construction of feature vectors capable of characterizing the pixels of a frame. Section 4 explains how target detection is performed using a pixel-wise binary classification technique. Finally, in Section 5, an extensive study of the system's results is presented.

VISAPP 2015 - International Conference on Computer Vision Theory and Applications


2 THE PROPOSED SYSTEM

2.1 System Architecture

The goal of the presented system is the real-time detection and tracking of maritime targets. Towards this direction, an appearance-based approach is adopted to create visual attention maps that represent the probability of a target being present in the scene. High probability implies high confidence of a maritime target's presence.

Visual attention map creation is based exclusively on each frame's visual content, in relation to the surrounding regions or the entire image. Due to this limitation, high probability is frequently assigned to image regions that depict non-maritime targets (e.g. stationary land parts). In order to overcome this drawback, our system exploits the temporal relationship between subsequent frames. Concretely, video blocks, containing a predefined number, h, of frames and covering a time span, T, are used to model the pixels' intensities.

Thus, the temporal evolution of pixel intensities is utilized to estimate a pixel-wise background model, capable of denoting each pixel of the scene as background or foreground. By using a background modelling algorithm, the system can efficiently discriminate moving from stationary objects in the scene. In order to model the pixels' intensities, we use the background modelling algorithm presented in (Zivkovic, 2004). This choice is justified by the fact that this algorithm can automatically and fully adapt to dynamically changing visual conditions and a cluttered background.
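The mixture-model details are given in (Zivkovic, 2004); as a rough, library-free illustration of the idea of modelling each pixel's temporal intensity, the following NumPy sketch substitutes a simple exponential running average for the Gaussian mixture (the function names, learning rate `alpha` and threshold are our assumptions, not the paper's implementation):

```python
import numpy as np

def update_background(bg, frame, alpha=0.05):
    """Exponential running average of pixel intensities (a much simpler
    stand-in for the Gaussian-mixture model used in the paper)."""
    return (1.0 - alpha) * bg + alpha * frame

def foreground_mask(bg, frame, thresh=25.0):
    """Label a pixel as foreground when it deviates from the model."""
    return (np.abs(frame - bg) > thresh).astype(np.uint8)

# Toy usage: a static dark scene in which a bright object appears.
bg = np.zeros((8, 8))
frames = [np.zeros((8, 8)) for _ in range(10)]
frames.append(np.zeros((8, 8)))
frames[-1][2:4, 2:4] = 255.0            # "object" enters in the last frame
for fr in frames:
    mask = foreground_mask(bg, fr)      # classify before updating the model
    bg = update_background(bg, fr)
print(mask.sum())  # the 4 object pixels are flagged as foreground
```

A real deployment would use the full adaptive mixture model, which also handles gradual illumination changes and multi-modal backgrounds such as waves.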

Let us denote as p^(i)_xy the pixel of frame i at location (x, y) on the image plane. Having constructed the visual attention maps and applied the background modeling algorithm, the pixel p^(i)_xy is described by a feature vector f^(i)_xy = [f^(i)_1,xy ... f^(i)_k,xy]^T, where f^(i)_1,xy, ..., f^(i)_k-1,xy stand for scalar features that correspond to the probabilities assigned to the pixel p^(i)_xy by the different visual attention maps, while f^(i)_k,xy is the binary output of the background modeling algorithm, associated with the same pixel. In order to detect maritime targets, these features are fed to a binary classifier which classifies pixels into two disjoint classes, C_T and C_B.
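The per-pixel feature vector can be assembled as in the following NumPy sketch (array and function names are ours; the toy dimensions are assumptions, but the count of fifteen attention maps plus the background output, i.e. k = 16, matches the descriptors introduced later in the paper):

```python
import numpy as np

def build_feature_vectors(attention_maps, bg_mask):
    """Stack k-1 visual attention maps (probabilities in [0, 1]) and the
    binary background-model output into one k-dimensional feature vector
    per pixel, i.e. an (h, w, k) array."""
    # attention_maps: list of (h, w) float arrays, one per attention map
    # bg_mask: (h, w) binary array from the background modelling algorithm
    return np.stack(attention_maps + [bg_mask.astype(float)], axis=-1)

# Toy usage: 15 attention maps plus the background mask -> k = 16
h, w = 4, 6
maps = [np.random.rand(h, w) for _ in range(15)]
bg = np.random.randint(0, 2, (h, w))
f = build_feature_vectors(maps, bg)
print(f.shape)  # (4, 6, 16)
```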

If we denote as Z^(i) = C^(i)_T ∪ C^(i)_B the set that contains all pixels of frame i, then the first class, C^(i)_T, contains all pixels that depict a part of a maritime target, while the second class, C^(i)_B, equals Z^(i) \ C^(i)_T. We used SVMs to carry out the classification task for the proposed maritime surveillance system. The selection of the SVM, over other supervised classification methods, is justified by its robustness when handling unbalanced classes.

Figure 1: System's architecture illustration. The image in (i) corresponds to the original captured frame. In (ii), the output of the visual attention maps is presented; high probability is represented with red, while low probability is represented with deep blue. The output of the background modeling algorithm is shown in (iii). The column in (iv) represents the feature vector of a specific pixel, which is fed to a binary classifier (v). The output of the classifier at pixel level is presented in (vi) and at frame level in (vii).

The overall architecture of the proposed maritime surveillance system is presented in Fig. 1. Initially, the original captured frame, Fig. 1(i), is processed to extract pixel-wise features using visual attention maps, Fig. 1(ii), and background modelling, Fig. 1(iii). Then the feature vector of each pixel, Fig. 1(iv), is processed by a binary classifier, Fig. 1(v), which decides whether the pixel corresponds to a part of a maritime target, Fig. 1(vi). The classifier's output at frame level is shown in Fig. 1(vii).

2.2 Problem Formulation

Maritime target detection can be seen as an image classification problem. Thus, we classify each of the frame's pixels into one of two classes, C_T and C_B. If we denote as l^(i)_xy the label of pixel p^(i)_xy, then, for a frame i, the classification task can be formulated as:

    l^(i)_xy = {  1  if p^(i)_xy ∈ C_T
               { -1  if p^(i)_xy ∈ C_B        (1)

where x = 1, ..., w and y = 1, ..., h, and w, h stand for the frame's width and height.

The SVM classifier will be formed through a training process, which requires the formation of a robust training set composed of pixels along with their associated labels. Such a set can be formed by the user through a rough segmentation of a frame t into two regions that contain positive and negative samples, i.e. the C^(t)_T class labelled with 1 and the C^(t)_B class labelled with -1. The union of C^(t)_T and C^(t)_B constitutes the initial training set S.


At this point, each pixel in the training set S is described only by its intensity, which does not provide sufficient information for separating pixels into two disjoint classes. Taking into consideration the application domain, which indicates that the largest part of a frame will depict sea and sky, we exploit low-level features that emphasize man-made structures in the scene.

Then, visual attention maps are created, which indicate the probability that a pixel depicts a part of a maritime target. In addition, based on the observation that a vessel must appear as a moving object, we implicitly capture the presence of motion by exploiting a background modeling algorithm. Using the output of the visual attention maps and the background modeling algorithm, each pixel is described by the feature vector of Sec. 2.1 and the training set S can be transformed to: S = {(f^(t)_xy, l^(t)_xy)} for x = 1, ..., w and y = 1, ..., h.

Although the elements of S are labelled by a human user, the labelling procedure may contain inconsistencies. This is mainly caused by the fact that human-centric labelling, especially of image data, is an arduous and inconsistent task, due to the complexity of the visual content and the huge manual effort required.

In order to overcome this drawback, we refine the initial training set by i) selecting the most representative samples from each class and ii) labelling the rest of the samples using a semi-supervised algorithm. The selection of the most representative samples takes place by applying simplex volume expansion on the samples of each class separately. The representative samples are then used by the semi-supervised algorithm as landmarks, in order to label the rest of the samples. Using the refined training set, the binary classifier can be successfully trained to classify the pixels of subsequent frames, addressing in this way the initial classification problem of Eq. 1.

3 PIXEL-WISE VISUAL DESCRIPTION

In this section we describe the procedure for constructing feature vectors capable of characterizing the pixels of a frame. The whole process is tuned for maritime imagery and is guided by the operational requirements that an accurate and robust maritime surveillance system must fulfil. Feature vectors are created for each pixel.

3.1 Scale Invariance

Potential targets in the maritime environment vary in size, either due to their physical size or due to the distance between them and the camera. Despite that, most feature detectors operate as kernel-based methods and thus favour objects of a certain size. As presented in (Alexe et al., 2010) and (Liu et al., 2011), images must be represented at different scales in order to overcome this limitation. In our approach, a Gaussian image pyramid is exploited in order to provide scale invariance and to take into consideration the relationship between adjacent pixels.

The Gaussian image pyramid is created by successively low-pass filtering and sub-sampling an image. During the low-pass filtering stage, the Gaussian function can be approximated by a discretized convolution kernel as follows:

    G_d = (1/256) ·
        [ 1   4   6   4   1 ]
        [ 4  16  24  16   4 ]
        [ 6  24  36  24   6 ]
        [ 4  16  24  16   4 ]
        [ 1   4   6   4   1 ]        (2)

During sub-sampling, every even-numbered row and column is removed. If we denote as I_o the original captured image and as I_f the image at pyramid level f, then the image at pyramid level f+1 is computed as: I_{f+1}(x, y) = [G_d * I_f](2x, 2y).

One must combine the various scales into a single unified and scale-independent feature map, to provide scale-independent feature analysis. To do so, the image at level f+1 is first upsized to twice its size in each dimension, with the new even rows and columns filled with zeros. Secondly, a convolution is performed with the kernel G_u to approximate the values of the "missing" pixels. Because each new pixel has four adjacent pixels that are not newly created, G_u is defined as G_u = 4 · G_d.

Then, a pixel-wise weighted summation is performed on adjacent images in the pyramid, so that the unified image at level f, J_f, is defined as:

    J_f = (1/2) · [I_f + [G_u * U(I_{f+1})]]        (3)

where U stands for the upsizing operation. The final unified image is computed by repeating the above operation from coarser to finer pyramid image levels.
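Eqs. (2)-(3) can be sketched in NumPy as follows (a minimal illustration; the helper names are ours, and zero padding at the borders is an assumption not stated in the paper):

```python
import numpy as np

# 5x5 discretized Gaussian kernel of Eq. (2)
G_D = np.array([[1,  4,  6,  4, 1],
                [4, 16, 24, 16, 4],
                [6, 24, 36, 24, 6],
                [4, 16, 24, 16, 4],
                [1,  4,  6,  4, 1]], dtype=float) / 256.0
G_U = 4.0 * G_D  # upsampling kernel, G_u = 4 * G_d

def conv2(img, k):
    """Plain 'same' 2-D convolution with zero padding (NumPy only)."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * k[::-1, ::-1])
    return out

def pyr_down(img):
    """Low-pass filter with G_D, then drop every other row/column."""
    return conv2(img, G_D)[::2, ::2]

def pyr_up(img):
    """Upsize by zero insertion, then interpolate with G_U."""
    up = np.zeros((2 * img.shape[0], 2 * img.shape[1]))
    up[::2, ::2] = img
    return conv2(up, G_U)

def unify(I_l, I_lp1):
    """Eq. (3): J_f = (1/2) * (I_f + G_u * U(I_{f+1}))."""
    return 0.5 * (I_l + pyr_up(I_lp1))

# Toy usage on a 16x16 image
I0 = np.random.rand(16, 16)
I1 = pyr_down(I0)
J0 = unify(I0, I1)
print(I1.shape, J0.shape)  # (8, 8) (16, 16)
```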

3.2 Low-level Features Analysis

As described in (Albrecht et al., 2011b), different low-level image features respond to different attributes of potential maritime targets. A combination of features should be exploited in order to reveal the targets' presence. The selected features do not require a specific


format for the input image. These are edges, horizontal and vertical lines, frequency, color and entropy.

Each of these features is calculated for all of the image's pyramid levels independently. Then, the pyramid levels are combined to form a single unified feature map by using Eq. (3). In Fig. 2 the original captured frame along with the feature responses is presented. All of the features emphasize the stationary land part and the white boat, which is the actual maritime target.

3.2.1 Edges

Edges in the form of horizontal and vertical lines are able to denote man-made structures, making the system able to suppress large image regions depicting sea and sky. The Canny operator (McIlhagga, 2011) is a very accurate image edge detector, which outputs zeros and ones for the absence and presence of image edges respectively. The Sobel operator (Yasri et al., 2008), although less accurate, measures the strength of detected edges by approximating the derivatives of intensity changes along image rows and columns.

So, by multiplying the outputs of the two operators pixel-wise, the system is able to detect edges in a very accurate way, while at the same time preserving their magnitude. If we denote as C_I and S_I the Canny and Sobel operators for image I, then the edges E_I are defined as E_I = C_I · S_I. The matrix E_I has the same dimensions as image I; its element E_I(x, y) corresponds to the magnitude of an image edge at location (x, y) on the image plane.
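The fusion E_I = C_I · S_I can be sketched as follows. As an assumption for brevity, the Canny mask is replaced here by a simple threshold on the Sobel magnitude; a real system would use a proper Canny implementation (hysteresis, non-maximum suppression):

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)

def filt2(img, k):
    """Plain 'same' 2-D filtering with zero padding."""
    ph, pw = k.shape[0] // 2, k.shape[1] // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)))
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return out

def edge_feature(img, canny_thresh=1.0):
    """E = C * S: binary edge mask times Sobel magnitude, pixel-wise."""
    gx = filt2(img, SOBEL_X)
    gy = filt2(img, SOBEL_X.T)
    s = np.hypot(gx, gy)                    # Sobel edge strength
    c = (s > canny_thresh).astype(float)    # stand-in for the Canny mask
    return c * s

# Toy usage: a vertical step edge
img = np.zeros((6, 6)); img[:, 3:] = 1.0
E = edge_feature(img)
print(E.shape)
```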

3.2.2 Frequency

The high-frequency components of the input frame, I, are computed as F_I = ∇²I. The matrix F_I has the same dimensions as image I and its element F_I(x, y) corresponds to the frequency's magnitude at location (x, y) on the image plane.

While the exploitation of frequency features may emphasize (highly) wavy sea regions, it will suppress image regions that depict sky parts, since such image parts are dominated by low frequencies. Furthermore, image frequencies are complementary to image edges, emphasizing highly structured regions within an object and thus improving detection accuracy.

3.2.3 Horizontal and Vertical Lines

The detection of horizontal and vertical lines in an image requires an appropriate kernel K. The kernel is tuned to strengthen a pixel's response if it is part of a horizontal or vertical line and to suppress the response in all other cases; it is convolved with each of the image's pyramid levels. In order to emphasize such lines, the kernel K is designed as:

    K = (1/16) ·
        [ 0  0  1  0  0 ]
        [ 0  0  2  0  0 ]
        [ 1  2  4  2  1 ]
        [ 0  0  2  0  0 ]
        [ 0  0  1  0  0 ]        (4)

Vertical and horizontal lines in a frame I can then be computed as L_I = K * I. Again, the matrix L_I has the same dimensions as image I, and its element L_I(x, y) indicates the magnitude of a horizontal and/or vertical line at location (x, y) on the image plane.

The line detector works like an edge detector. In coastal regions, captured frames are likely to contain land parts that respond to the edge detector, affecting the detection accuracy of actual maritime targets. Since vertical and horizontal lines are more dominant in man-made structures, the line detector supports the detection of actual targets by suppressing image regions that depict natural land parts, such as rocks.
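The line response L_I = K * I of Eq. (4) can be sketched as below (zero padding at the borders is our assumption):

```python
import numpy as np

# Line-detection kernel of Eq. (4)
K = np.array([[0, 0, 1, 0, 0],
              [0, 0, 2, 0, 0],
              [1, 2, 4, 2, 1],
              [0, 0, 2, 0, 0],
              [0, 0, 1, 0, 0]], dtype=float) / 16.0

def line_response(img):
    """L = K * I: 'same' filtering of the frame with the line kernel
    (K is symmetric, so convolution and correlation coincide)."""
    padded = np.pad(img, 2)
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + 5, j:j + 5] * K)
    return out

# Toy usage: a one-pixel-wide vertical line; the kernel's cross shape
# accumulates the five weights along the line, giving 10/16 = 0.625.
img = np.zeros((9, 9)); img[:, 4] = 1.0
L = line_response(img)
print(L[4, 4])  # 0.625
```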

3.2.4 Color

In the maritime environment, no assumptions about the color of vessels can be made. For this reason, differences in color are more likely to indicate the presence of a potential target. Furthermore, maritime scenes usually contain large regions with similar colors (sea and sky). As described in (Achanta et al., 2009) and (Achanta and Susstrunk, 2010), this observation can be exploited to increase the performance of the visual attention map in identifying potential targets.

In order to compute the color difference, the captured frame's colorspace is converted to CIELab. The computation of color differences takes place by calculating the Euclidean distances between individual pixels' color vectors and the mean color vector of the whole frame. For a frame I this procedure results in a matrix C_I of the same dimensions, whose element C_I(x, y), at location (x, y) on the image plane, indicates the difference in color between this pixel and the mean color of the rest of the frame's pixels.
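The color-uniqueness computation can be sketched as follows (the sketch assumes the conversion to CIELab has already been performed; names and toy values are ours):

```python
import numpy as np

def color_uniqueness(lab_img):
    """C(x, y): Euclidean distance between each pixel's color vector and
    the mean color vector of the whole frame (input assumed in CIELab)."""
    mean = lab_img.reshape(-1, 3).mean(axis=0)
    return np.linalg.norm(lab_img - mean, axis=-1)

# Toy usage: a uniform "sea" with one differently colored block; the
# block's pixels end up far from the frame's mean color.
img = np.tile(np.array([60.0, -10.0, 20.0]), (8, 8, 1))
img[3:5, 3:5] = [80.0, 30.0, -20.0]
C = color_uniqueness(img)
print(C.shape)  # (8, 8)
```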

3.2.5 Entropy

Images that depict large homogeneous regions, such as sky or sea regions, present low entropy, while highly textured images present high entropy. Image entropy can be interpreted as a statistical measure of randomness, which can be used to characterize the texture of the input image. Thus, entropy can be utilized to suppress the homogeneous regions of sea and sky


and highlight potential maritime targets. The entropy, H_r, of a region r of an image is defined as:

    H_r = - Σ_{j=1}^{k} P^(r)_j · log P^(r)_j        (5)

where P^(r)_j is the frequency of intensity j in image region r. For a grayscale image, the variable k is equal to 256.

In order to compute the entropy for a pixel located at (x, y) on the image plane, we apply the relation of Eq. 5 on a square window centered at (x, y). In our case, the size of the window is 5×5 pixels. The application of Eq. 5 at (x, y) of a frame I, for x = 1, ..., w and y = 1, ..., h, where w and h correspond to the frame's width and height, results in a matrix H_I that has the same dimensions as the frame I. The matrix H_I can be interpreted as a pixel-wise entropy indicator of I.
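The pixel-wise entropy of Eq. (5) over a 5×5 window can be sketched as below (edge padding at the borders and the base-2 logarithm are our assumptions; the paper does not specify either):

```python
import numpy as np

def window_entropy(gray, win=5):
    """Pixel-wise entropy (Eq. 5) over a win x win window, for an 8-bit
    grayscale image (k = 256 intensity levels)."""
    h, w = gray.shape
    r = win // 2
    H = np.zeros((h, w))
    padded = np.pad(gray, r, mode='edge')   # replicate borders
    for y in range(h):
        for x in range(w):
            patch = padded[y:y + win, x:x + win]
            counts = np.bincount(patch.ravel(), minlength=256)
            p = counts / counts.sum()       # intensity frequencies P_j
            p = p[p > 0]                    # avoid log(0)
            H[y, x] = -np.sum(p * np.log2(p))
    return H

# Toy usage: a textured pattern gives positive entropy everywhere,
# while a homogeneous image gives zero.
img = np.zeros((8, 8), dtype=np.uint8)
img[::2, ::2] = 255
H = window_entropy(img)
print(H.shape, H[0, 0] > 0.0)
```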

Figure 2: Original captured frame (a) and feature responses (b)-(f): (b) edges, (c) frequencies, (d) vertical and horizontal lines, (e) color and (f) entropy. All features responded to the land part and the boat (the maritime target).

3.3 Visual Descriptors

Visual descriptors are computed to encode the visual information of captured images, using the extracted low-level features described in subsection 3.2. These descriptors are utilized for constructing the visual attention maps. Their computation takes place block-wise, instead of pixel-wise, in order to reduce the effect of noisy pixels during low-level feature extraction. In this paper, three different descriptors are computed:

a) Local Descriptors, which consider each image pixel separately and indicate the magnitude of the local features for each pixel.

b) Global Descriptors, which emphasize pixels with high uniqueness compared to the rest of the image; they indicate how different the local features of a specific pixel are, in relation to the same features of all other image pixels.

c) Window Descriptors, which compare the local features of a pixel with the same features of its neighboring pixels.

3.3.1 Local Descriptor

One local descriptor is computed for each of the extracted low-level features. Let us denote as F the feature in question, which can correspond to image edges, frequency, horizontal and vertical lines, color or entropy. For the feature F, the local descriptor is derived from the feature's response image. First, the response image is divided into B blocks of size 8×8 pixels. Then, the local descriptor for a specific block j is defined as the average magnitude of the feature F within that block. More formally, for a block j of height b_h and width b_w, the local descriptor of feature F is computed as:

    l_{F,j} = (1 / (b_h · b_w)) · Σ_{(x,y) ∈ j} F(x, y)        (6)

where F(x, y) is the response of the feature in question at pixel (x, y). This kind of descriptor highlights image blocks with high feature responses.
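For block sizes that divide the image dimensions, Eq. (6) reduces to a block-wise mean, which a minimal NumPy sketch (names are ours) makes concrete:

```python
import numpy as np

def local_descriptors(F, block=8):
    """Eq. (6): average feature response per (block x block) pixel block.
    F is a feature-response image whose sides are multiples of `block`."""
    h, w = F.shape
    return F.reshape(h // block, block, w // block, block).mean(axis=(1, 3))

# Toy usage: a 16x16 response image -> a 2x2 grid of local descriptors;
# only the top-left block contains a response.
F = np.zeros((16, 16)); F[:8, :8] = 4.0
l = local_descriptors(F)
print(l)  # [[4. 0.] [0. 0.]]
```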

3.3.2 Global Descriptor

The local descriptors handle each image block separately and are thus insufficient to provide useful information when the features' responses are quite similar across all image blocks (e.g. a wavy sea). The proposed system overcomes this problem by using global descriptors. The uniqueness of a block j can be evaluated by the absolute difference of the feature response between this block and the rest of the image's blocks. The global descriptor for a feature F and image block j is defined as:

    g_{F,j} = (1/B) · Σ_{i=1}^{B} |l_{F,j} - l_{F,i}|        (7)

As mentioned before, a global descriptor emphasizes blocks presenting high uniqueness, in terms of feature responses, compared to the rest of the image's blocks.

3.3.3 Window Descriptor

Unfortunately, if potential targets are present in more than one block, the local and global descriptors will emphasize the most dominant target and suppress the others. In order to overcome this problem,


Figure 3: Visual attention maps for each of the local, global and window descriptors. Using five low-level features and three descriptors, each of the frame's pixels is described by a 15-dimensional vector. The presented visual attention maps correspond to the original frame of Fig. 2.

In order to overcome this problem, the system exploits a window descriptor that compares each image block with its neighbouring blocks.

The window descriptor for an image with N × M blocks uses an image window W, which is spanned by the maximum symmetric distances d_h and d_v along the horizontal and vertical axes respectively. The symmetric distances are defined as d_h = min(l, k_h, N − k_h) and d_v = min(l, k_v, M − k_v), where l is the default symmetric distance (3 blocks in our case) and k_h and k_v stand for the block coordinates on the image plane along the horizontal and vertical axes respectively. The window descriptor for a feature F and an image block j with coordinates (j_1, j_2) is defined as:

$$ w_j^F = \frac{1}{2 d_h \cdot 2 d_v} \sum_{k=-d_h}^{d_h} \sum_{l=-d_v}^{d_v} \left| l_j^F - l_{j_1+k,\, j_2+l}^F \right| \qquad (8) $$

By using three descriptors and five low-level image features, each image block is described by a 1×15 feature vector. Each element of this vector corresponds to a different visual attention map. For blocks of size 8×8 pixels, the visual attention maps are sixty-four times smaller than the original captured frame. In order to create a pixel-wise feature vector, the visual attention maps must have the same dimensions as the original captured frame. For this reason they are up-sampled using Eq. 3.
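A minimal NumPy sketch of Eq. 8, assuming 0-based block coordinates (border blocks whose symmetric distance vanishes are left at zero):

```python
import numpy as np

def window_descriptor(local_desc, l=3):
    """Eq. 8 sketch: compare each block's local descriptor with those in a
    symmetric window whose half-widths d_h, d_v shrink near image borders.
    Assumes 0-based block coordinates; degenerate border windows yield 0."""
    N, M = local_desc.shape
    out = np.zeros_like(local_desc, dtype=float)
    for j1 in range(N):
        for j2 in range(M):
            dh = min(l, j1, N - 1 - j1)
            dv = min(l, j2, M - 1 - j2)
            if dh == 0 or dv == 0:
                continue
            win = local_desc[j1 - dh:j1 + dh + 1, j2 - dv:j2 + dv + 1]
            # 1/(2 d_h * 2 d_v) * sum of |l^F_j - l^F_{j1+k, j2+l}|
            out[j1, j2] = np.abs(win - local_desc[j1, j2]).sum() / (4 * dh * dv)
    return out

# A lone "target" block in the centre of a 5x5 block grid stands out.
ld = np.zeros((5, 5))
ld[2, 2] = 1.0
wd = window_descriptor(ld)   # wd[2, 2] == 1.5; border blocks stay 0
```

Unlike the global descriptor, a block is only compared against its local neighbourhood, so two targets far apart in the frame no longer suppress each other.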

Visual attention maps that correspond to the original frame of Fig. 2, for each of the descriptors and each of the low-level features, are presented in Fig. 3. All visual attention maps emphasize the stationary land part and the boat, while at the same time they suppress the background.

3.4 Background Subtraction

In maritime surveillance, most state-of-the-art background modeling algorithms, like (Doulamis and Doulamis, 2012) and (Makantasis et al., 2012), fail either due to their high computational cost or due to the continuously moving background and moving cameras. However, if the background modeling algorithm output is fused into a unified feature vector with the previously constructed visual attention maps, our system is able to emphasize potential threats and at the same time suppress land parts that may appear in the scene, by implicitly capturing the presence of motion.

The proposed system uses the Mixture of Gaussians (MOG) background modeling technique presented in (Zivkovic, 2004). This choice is justified by the fact that MOG is a fast, easy-to-parameterize algorithm, robust to small periodic movements of the background. By fusing together the outputs of the visual attention maps and the output of a background modeling algorithm, camera motion temporarily increases false positive detections, but false negatives, whose minimization is the most important requirement of a maritime surveillance system, are not affected.
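The paper uses Zivkovic's adaptive MOG (available, for instance, as OpenCV's `BackgroundSubtractorMOG2`). As a dependency-free illustration of the underlying idea only, the sketch below maintains a single running Gaussian per pixel and flags pixels far from the model as foreground; it is a simplified stand-in, not the actual MOG algorithm:

```python
import numpy as np

class RunningGaussianBG:
    """Per-pixel running Gaussian background model, a simplified stand-in
    for the Mixture-of-Gaussians model of (Zivkovic, 2004)."""

    def __init__(self, shape, alpha=0.05, k=2.5):
        self.mean = np.zeros(shape)   # background mean per pixel
        self.var = np.full(shape, 1.0)  # background variance per pixel
        self.alpha, self.k = alpha, k

    def apply(self, frame):
        d = frame - self.mean
        fg = d ** 2 > (self.k ** 2) * self.var        # foreground mask
        # Update mean/variance only where the pixel matched the background.
        self.mean += np.where(fg, 0.0, self.alpha * d)
        self.var += np.where(fg, 0.0, self.alpha * (d ** 2 - self.var))
        return fg.astype(np.uint8)

bg = RunningGaussianBG((4, 4))
for _ in range(50):                     # learn a static background of 0.5
    bg.apply(np.full((4, 4), 0.5))
frame = np.full((4, 4), 0.5)
frame[1, 2] = 5.0                        # a "vessel" pixel appears
mask = bg.apply(frame)                   # only that pixel is flagged
```

The real MOG keeps several Gaussians per pixel and so tolerates periodic background motion (waves) far better than this single-mode sketch.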

Figure 4: Original frame (a) and the output of the background modeling algorithm (b).

4 TARGET DETECTION

The maritime target detection can be seen as an image segmentation problem. In our case, target detection is further reduced to a binary classification problem. For any pixel at location (x, y) of a frame i, the feature extraction process (see Sec. 3) constructs a 1×16 feature vector f_xy^(i). Given f_xy^(i) as input, the classifier decides whether the corresponding pixel depicts some part of a maritime target or not.

4.1 Initial Training Set Formation

In order to exploit a binary classifier, a training process must first take place. The training process requires the formation of a robust training set which contains pixels along with their associated labels. Let us denote as Z^(t) the set that contains all the pixels of frame t, as C_T^(t) the set that contains the pixels that depict some part of a maritime target, and as C_B^(t) the set Z^(t) − C_T^(t).

The creation of a training set S requires the user to roughly segment the frame t into two regions, which contain positive and negative samples (i.e. pixels that belong to the classes C_T^(t) and C_B^(t) respectively).


This segmentation results in a set S = {(p_xy^(t), l_xy^(t))}, where the labels are defined as:

$$ l_{xy} = \begin{cases} 1 & \text{if } p_{xy} \in C_T \\ -1 & \text{if } p_{xy} \in C_B \end{cases} \qquad (9) $$

where p_xy is the pixel at location (x, y). By utilizing the feature vector f_xy^(t), the set S takes the form described in Sec. 2.2.

However, human-centric labeling, especially of image data, is an arduous and inconsistent task, mainly due to the complexity of the visual content and the huge manual effort required. To overcome this drawback, we refine the initial training set through a semi-supervised approach.

4.2 Training Set Refinement

In order to refine the initial user-defined training set, we partition the set S into two disjoint classes, R and U. The class R contains the most representative samples of S, i.e. the samples that best describe the classes C_T and C_B, while the class U is equal to S − R. Samples of class R are considered as labeled, while samples belonging to U are considered as unlabeled. Then, via a semi-supervised approach, the samples of R are used for label propagation through the ambiguously labeled data of U. In the following we describe this process in detail.

For selecting the most representative samples of each of the classes C_T and C_B, we consider each sample as a point in a µ-dimensional space. Then, simplex volume expansion is utilized. In our case µ equals 16, because the dimension of the space is equal to the dimension of the feature vectors that describe the pixels. The representatives selection process is conducted twice, once for class C_T and once for class C_B.

4.3 Graph-based Label Propagation

The aforementioned procedure results in two sets of representative samples, C_{T,R} and C_{B,R}, one for each class. The samples of C_{T,R} and C_{B,R} are considered as labeled, while the rest of the samples of the classes C_T and C_B are considered as ambiguously labeled. More formally, we have: i) R = C_{T,R} ∪ C_{B,R} and ii) U = S − C_{T,R} − C_{B,R}. At this point, we need to refine the initial training set S, using a suitable approach for label propagation through the ambiguously labeled data.

Thus, we need to estimate a labeling prediction function g : R^µ → R defined on the samples of S, by using the labeled data R. Let us denote as r_i the samples of the set R, such that R = {r_i}_{i=1}^m, where m is the cardinality of the set R. Then, according to (Liu et al., 2010), the label prediction function can be expressed as a convex combination of the labels of a subset of representative samples:

$$ g(f_i) = \sum_{k=1}^{m} Z_{ik}\, g(r_k) \qquad (10) $$

where Z_ik denotes sample-adaptive weights, which must satisfy the constraints Σ_{k=1}^m Z_ik = 1 and Z_ik ≥ 0 (convex combination constraints). By defining the vectors g = [g(f_1), ..., g(f_n)]^T and a = [g(r_1), ..., g(r_m)]^T, Eq. 10 can be rewritten as g = Z a, where Z ∈ R^{n×m}.

The design of the matrix Z, which measures the underlying relationship between the samples of U and the representative samples of R, is based on weights optimization; in effect, non-parametric regression is performed by means of data reconstruction with representative samples. Thus, the reconstruction of any data point f_i, i = 1, ..., n, is a convex combination of its closest representative samples. In order to optimize these coefficients, the following quadratic programming problem needs to be solved:

$$ \min_{z_i \in \mathbb{R}^s} \; h(z_i) = \frac{1}{2}\, \| f_i - R_s z_i \|^2 \quad \text{s.t.} \quad \mathbf{1}^T z_i = 1,\; z_i \geq 0 \qquad (11) $$

where R_s ∈ R^{µ×s} is a matrix whose columns are a subset of R = {r_1, ..., r_m} composed of the s < m nearest representative samples of f_i, and z_i stands for the i-th row of the matrix Z.

Nevertheless, the creation of the matrix Z is not sufficient for labeling the entire data set, as it does not assure a smooth function g. As mentioned before, a large portion of the data is considered as ambiguously labeled. Despite the small labeled set, there is always the possibility of inconsistencies in the segmentation; in specific frames the user may miss some pixels that depict targets. In order to deal with such cases, the following SSL framework is employed:

$$ \min_{A=[a_1, \dots, a_c]} \; Q(A) = \frac{1}{2}\, \| Z A - Y \|_F^2 + \frac{\gamma}{2}\, \mathrm{trace}(A^T \hat{L} A) \qquad (12) $$

where L̂ = Z^T L Z is a memory-wise and computationally tractable alternative of the Laplacian matrix L. The matrix A = [a_1, ..., a_c] ∈ R^{m×c} is the soft label matrix for the representative samples, in which each column vector accounts for a class. The matrix Y = [y_1, ..., y_c] ∈ R^{n×c} is a class indicator matrix on the ambiguously labeled samples, with Y_ij = 1 if the label l_i of sample i is equal to j and Y_ij = 0 otherwise.

In order to calculate the Laplacian matrix L, the adjacency matrix W needs to be calculated, since L = D − W, where D ∈ R^{n×n} is a diagonal degree


matrix such that D_ii = Σ_{j=1}^n W_ij. In this case W is approximated as W = Z Λ^{-1} Z^T, where Λ ∈ R^{m×m} is a diagonal matrix defined as Λ_kk = Σ_{i=1}^n Z_ik. The solution of Eq. 12 has the form A* = (Z^T Z + γ L̂)^{-1} Z^T Y. Each sample's label is then given by:

$$ \hat{l}_i = \arg\max_{j \in \{1, \dots, c\}} \frac{Z_{i\cdot}\, a_j}{\lambda_j} \qquad (13) $$

where Z_{i·} ∈ R^{1×m} denotes the i-th row of Z, and the normalization factor λ_j = 1^T Z a_j balances skewed class distributions.
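A compact sketch of the label-propagation step (Eqs. 12 and 13). For brevity, the convex-combination weights Z of Eq. 11 are replaced here by simple normalized Gaussian affinities to each sample's s nearest representatives; that substitution is an assumption of this sketch, not the QP solved in the paper:

```python
import numpy as np

def build_Z(X, reps, s=2, sigma=1.0):
    """n x m weight matrix: Gaussian affinities to the s nearest
    representatives, normalized to sum to 1 per row (a simplification
    of the QP in Eq. 11)."""
    d2 = ((X[:, None, :] - reps[None, :, :]) ** 2).sum(-1)   # n x m sq. dists
    Z = np.zeros_like(d2)
    for i, row in enumerate(d2):
        nn = np.argsort(row)[:s]
        w = np.exp(-row[nn] / (2 * sigma ** 2))
        Z[i, nn] = w / w.sum()
    return Z

def propagate(Z, Y, gamma=0.1):
    """Closed-form solution of Eq. 12 and hard labels of Eq. 13."""
    Lam = np.diag(Z.sum(axis=0))                  # Lambda_kk = sum_i Z_ik
    W = Z @ np.linalg.inv(Lam) @ Z.T              # approximated adjacency
    L = np.diag(W.sum(axis=1)) - W                # graph Laplacian L = D - W
    L_hat = Z.T @ L @ Z                           # reduced Laplacian
    A = np.linalg.solve(Z.T @ Z + gamma * L_hat, Z.T @ Y)   # A* (m x c)
    scores = (Z @ A) / (np.ones(Z.shape[0]) @ Z @ A)        # lambda_j norm.
    return scores.argmax(axis=1)

# Two 2-D clusters; one representative per class, two ambiguous points.
X = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0], [3.1, 3.0]])
reps = np.array([[0.0, 0.0], [3.0, 3.0]])
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], float)
labels = propagate(build_Z(X, reps), Y)   # -> [0, 0, 1, 1]
```

The key point carried over from (Liu et al., 2010) is that all matrices inverted or solved are m×m (number of representatives), never n×n, which is what makes the scheme tractable for large pixel sets.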

4.4 Maritime Target Detection

Having constructed a training set S = {(f_i, l_i)}_{i=1}^n, a binary classifier capable of discriminating pixels that depict some part of a maritime target from pixels that depict the background can be trained. In this paper we choose a Support Vector Machine (SVM) to carry out the classification task. The optimization problem is described as (Cortes and Vapnik, 1995):

$$ \min_{w, b, \xi} \; \frac{1}{2}\, \|w\|^2 + c \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad l_i (w \cdot f_i - b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \dots, n \qquad (14) $$

where the slack variables ξ_i ≥ 0 allow a sample to lie inside the margin or to be misclassified, and c is a constant that weights these errors.

In the framework of maritime detection, the SVM must be able to handle unbalanced classification problems, because maritime targets usually occupy a minority of a captured frame's pixels, let alone being totally absent from the scene for long time periods. To address this problem, the misclassification error for each class is weighted separately. This means that the total misclassification error of Eq. 14 is replaced with two terms:

$$ c \sum_{i=1}^{n} \xi_i \;\rightarrow\; c_p \sum_{\{i \,|\, l_i = 1\}} \xi_i + c_n \sum_{\{i \,|\, l_i = -1\}} \xi_i \qquad (15) $$

where c_p and c_n are constants that weight separately the misclassification errors for positive and negative examples. Solving Eq. 14 with the classification error of Eq. 15 results in a trained SVM, which is capable of classifying the pixels of newly captured frames.
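The class-weighted hinge loss of Eq. 15 can be sketched with plain subgradient descent on the primal. This is a minimal illustration only; in practice a standard SVM solver achieves the same effect via per-class C values (e.g. scikit-learn's `class_weight` parameter):

```python
import numpy as np

def weighted_linear_svm(F, y, c_p=1.0, c_n=1.0, lr=0.1, epochs=200):
    """Subgradient descent on 0.5*||w||^2 + c_p * sum_{y=+1} xi_i
    + c_n * sum_{y=-1} xi_i, with xi_i = max(0, 1 - y_i (w.f_i - b))."""
    n, d = F.shape
    w, b = np.zeros(d), 0.0
    cost = np.where(y == 1, c_p, c_n)           # per-sample error weight
    for _ in range(epochs):
        margins = y * (F @ w - b)
        viol = margins < 1                      # samples violating the margin
        grad_w = w - (cost[viol, None] * y[viol, None] * F[viol]).sum(0)
        grad_b = (cost[viol] * y[viol]).sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Unbalanced toy data: one positive "target" pixel vs. several negatives.
F = np.array([[2.0, 2.0], [-1.0, -1.2], [-0.8, -1.0], [-1.1, -0.9]])
y = np.array([1, -1, -1, -1])
w, b = weighted_linear_svm(F, y, c_p=3.0, c_n=1.0)   # penalize FNs more
pred = np.sign(F @ w - b)
```

Setting c_p > c_n shifts the decision boundary toward the negative class, which is exactly the false-negative-averse behaviour the paper argues for.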

5 EXPERIMENTAL RESULTS

Python code concerning the construction of the visual attention maps is available for download¹. The performance of each system component has been evaluated separately: the extracted features in terms of discriminative ability and importance, the semi-supervised labeling for its prediction outcome and, finally, the binary classifier for its classification performance.

¹ https://github.com/konstmakantasis/Poseidon

5.1 Data Set Description

The data consist of recorded videos from cameras mounted at the Limassol port, Cyprus, and the old port of Chania, Crete, Greece. The data sets describe real-life scenarios under various weather conditions. As long as the camera is able to capture a vessel (i.e. the vessel spans an area of more than 40 pixels in the frame), the system will likely detect it, regardless of the weather conditions (e.g. rain, fog, waves etc.).

Unfortunately, for the vast majority of the video frames, maritime targets are absent from the scene. In order to deal with such cases, we manually edited the videos and kept only the tracks that depict the intrusion of one or more targets into the scene. Then, we manually labeled the pixels of key video frames (keyframes) to create a ground truth dataset for evaluating our system.

Keyframes originate from raw video frames that correspond to time instances t, 2t, 3t, ... The time span t is selected to be 6 seconds, which means that one frame out of 150 is denoted as a keyframe. We followed this approach for practical reasons. Firstly, it would be impossible to manually label all video frames at a framerate of 25 fps. Also, the time interval of 6 seconds is small enough to allow the detection of the intrusion of a maritime target into the scene. At this point it has to be clarified that the feature extraction task, as well as the binary classification, is performed for all frames of a video track; keyframes are used only for the evaluation of the system's performance.

5.2 Evaluation of Extracted Features

In this section, we examine whether the extracted features fulfil specific requirements, i.e. whether they are informative and separable, in order to assure good classification accuracy and a smooth training set refinement through the graph-based SSL technique.

To evaluate the features' information, we utilized the keyframes' ground truth data. The feature extraction task results in a 16-dimensional feature vector for each pixel in a frame. The quality of the features' information is evaluated through dimensionality reduction and sample plotting, in order to visually examine their distribution in space, see Fig. 5. The two classes, as shown in Fig. 5, are linearly separable, which suggests high quality features. The small amount of positive samples that lie inside the region of the negative class correspond to maritime targets' contours and probably occurred due to segmentation errors during manual labeling.

Figure 5: Positive and negative samples plotted in 3-dimensional space. Randomized PCA was used to extract the three dominant components of the dataset. The class containing positive samples can be linearly separated from the class containing negative samples. The small amount of positive samples that lie inside the region of the negative class correspond to maritime targets' contours.

The importance of each of the extracted features is examined separately, in order to determine how much each feature affects the classification task. The importance of the features is specified via a Forest of Randomized Trees. The relative rank (i.e. depth) of a feature used as a decision node in a tree can be used to assess the relative importance of that feature with respect to the predictability of the target variable. Features used at the top of a tree contribute to the final prediction decision of a larger fraction of the input samples. The expected fraction of the samples they contribute to can thus be used as an estimate of the relative importance of the features.

In Fig. 6 the relative importance of each of the extracted features is presented (the feature labeling follows the notation of Section 3). The dominant feature is the one that corresponds to the output of the background modeling algorithm, which, in practice, captures the presence of motion in the scene. The rest of the features contribute almost equally, except for the feature that corresponds to the local descriptor of image entropy.

5.3 Evaluation of Semi-supervised Labeling

In order to evaluate the semi-supervised labeling, we assume that the manual labeling of keyframes contains no segmentation errors. The ratio of the representative samples in relation to the ambiguously labeled samples is the only factor that affects the performance of the labeling algorithm.

Figure 6: Feature importances (x-axis: the features lE, gE, wE, lF, gF, wF, lL, gL, wL, lC, gC, wC, lH, gH, wH and BG, following the notation of Section 3). The feature that corresponds to the output of the background modeling algorithm, which implicitly captures the presence of motion in the scene, is presented to be the most important. The rest of the features contribute almost equally to the classification task, except for the feature that corresponds to the local descriptor of image entropy, which presents the lowest importance.

Figure 7: Semi-supervised labeling performance. When the ratio of representative samples is over 40%, the labeling error is lower than 2%. When the ratio of representatives is lower than 40%, the error increases linearly up to the value of 5.7% for 10% of representatives.

As shown in Fig. 7, the labeling error is lower than 2% when the ratio of the representative samples in relation to the ambiguously labeled samples is over 40%. When the ratio is smaller than 40%, the labeling error increases linearly and reaches the value of 5.7% when the ratio of representative samples is 10%.

The choice of an appropriate value for the ratio of representatives is inherently dependent on the quality of the human-based labeling. If the labeling is the result of a rough image segmentation, many of the labeled pixels will carry the wrong label. In such cases the aforementioned ratio must be set to a small value. The most representative samples of each class are assumed to carry the right label, while the labels of the rest of the samples must be reconsidered.

In our case, we ask the user to segment the frame very carefully, which implies that the vast majority of the pixels will carry the right label. For this reason we set the ratio value to 40%. The semi-supervised labeling algorithm with 40% of representatives is expected to re-label 1.7% of the samples.


5.4 Binary Classifier Evaluation

The performance of the binary classifier depends on the values of the parameters c_p and c_n of Eq. 15. Let us denote as n_p and n_n the number of samples in the positive and negative class respectively. To examine the influence of the parameters c_p and c_n on the classification accuracy, we define the parameter k as:

$$ k = \frac{c_n \cdot n_n}{c_p \cdot n_p} \qquad (16) $$

In practice, the parameter k assigns different weights to the misclassification errors that correspond to positive and negative examples. When the value of k is equal to one, the weights that penalize a misclassified sample of each class are inversely proportional to the cardinalities of the classes. When k < 1 a bigger penalty is assigned to false negatives, while for k > 1 false positives are considered more important. False negatives correspond to pixels that actually depict some part of a maritime target, but are denoted as background by the classifier.

However, a maritime surveillance system must emphasize minimizing the false negative rate. In other words, it is more important for the system to detect all potential maritime targets, even if it raises a small number of false alarms, than to minimize false positives at the cost of missing target intrusions.

Fig. 8 presents the performance of the classifier for different values of the parameter k. The green line represents the classification accuracy, while the blue line represents the recall of the system. If we denote as p_c the set of pixels denoted by the classifier as positive samples and as p_t the set of pixels that actually belong to the positive class, then the recall r is defined as:

$$ r = \frac{|p_c \cap p_t|}{|p_t|} \qquad (17) $$

When r is equal to one, all true positive samples have been correctly classified by the binary classifier. Accuracy is the proportion of correctly classified samples over the whole dataset. As shown by the green line in Fig. 8, the accuracy of the classifier reaches its maximum value when k is equal to one. On the other hand, the recall of the system is monotonically decreasing as the value of k increases. In our case we set k = 0.7 to balance between maximizing the classification accuracy and minimizing the false negative rate. For k = 0.7 the accuracy of the classifier is equal to 96.4%, while the recall is equal to 97.1%.
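These metrics can be computed directly from binary pixel masks; a small illustrative helper, with |·| denoting set cardinality as in Eq. 17:

```python
import numpy as np

def recall_and_accuracy(pred_mask, true_mask):
    """Eq. 17 recall |p_c intersect p_t| / |p_t|, plus overall pixel
    accuracy, for boolean masks of predicted / ground-truth target pixels."""
    tp = np.logical_and(pred_mask, true_mask).sum()   # |p_c intersect p_t|
    recall = tp / true_mask.sum()                     # / |p_t|
    accuracy = (pred_mask == true_mask).mean()        # correct / all pixels
    return recall, accuracy

pred = np.array([[1, 1, 0], [0, 0, 0]], bool)   # classifier output
true = np.array([[1, 0, 0], [0, 0, 0]], bool)   # ground-truth keyframe
r, acc = recall_and_accuracy(pred, true)         # r = 1.0, acc = 5/6
```

Note how the single false positive lowers accuracy but leaves recall at 1.0, mirroring the paper's preference for false alarms over missed targets.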

6 CONCLUSIONS

Figure 8: Classifier performance. The recall of the system is monotonically decreasing as the value of k increases (blue line). The accuracy presents its maximum value when k = 1, which means that the penalties for misclassifying positive and negative samples are inversely proportional to the cardinalities of the positive and negative classes.

A vision-based system, using monocular camera data, is presented in this paper. The system provides robust results by combining supervised and unsupervised methods appropriate for maritime surveillance, utilizing an innovative initialization procedure. The system's offline initialization is achieved through a graph-based SSL algorithm, suitable for large data sets, supporting users during the segmentation process.

Extensive performance analysis suggests that the proposed system performs well, in real time, for long periods, without any special hardware requirements and without any assumptions related to the scene, environment and/or visual conditions. Such a system is expected to be easily extended to other surveillance cases with minor, case-dependent modifications.

ACKNOWLEDGEMENTS

The research leading to these results has been supported by European Union funds and National funds (GSRT) from Greece and the EU under the project JASON: Joint synergistic and integrated use of eArth obServation, navigatiOn and commuNication technologies for enhanced border security, funded under the cooperation framework. The work has been partially supported by the IKY Fellowships of Excellence for postgraduate studies in Greece - Siemens program.

REFERENCES

Achanta, R., Hemami, S., Estrada, F., and Susstrunk, S. (2009). Frequency-tuned salient region detection. In IEEE Conf. on Comp. Vis. and Pat. Rec. (CVPR 2009), pages 1597–1604.

Achanta, R. and Susstrunk, S. (2010). Saliency detection using maximum symmetric surround. In 2010 17th IEEE Int. Conf. on Image Processing (ICIP), pages 2653–2656.

Albrecht, T., Tan, T., West, G., Ly, T., and Moncrieff, S. (2011a). Vision-based attention in maritime environments. In 2011 8th Int. Conf. on Information, Communications and Signal Processing (ICICS), pages 1–5.

Albrecht, T., West, G., Tan, T., and Ly, T. (2010). Multiple views tracking of maritime targets. In 2010 Int. Conf. on Digital Image Computing: Techniques and Applications (DICTA), pages 302–307.

Albrecht, T., West, G., Tan, T., and Ly, T. (2011b). Visual maritime attention using multiple low-level features and naïve Bayes classification. In 2011 Int. Conf. on Digital Image Computing: Techniques and Applications (DICTA), pages 243–249.

Alexe, B., Deselaers, T., and Ferrari, V. (2010). What is an object? In 2010 IEEE Conf. on Comp. Vis. and Pat. Rec. (CVPR), pages 73–80.

Auslander, B., Gupta, K. M., and Aha, D. W. (2011). A comparative evaluation of anomaly detection algorithms for maritime video surveillance. Volume 8019, pages 801907-1–801907-14.

Belkin, M. and Niyogi, P. (2002). Using manifold structure for partially labelled classification. Page 929.

Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3):273–297.

Doulamis, N. and Doulamis, A. (2012). Fast and adaptive deep fusion learning for detecting visual objects. In Fusiello, A., Murino, V., and Cucchiara, R., editors, Comp. Vis. ECCV 2012. Workshops and Demonstrations, number 7585 in Lecture Notes in Computer Science, pages 345–354. Springer Berlin Heidelberg.

Fischer, Y. and Bauer, A. (2010). Object-oriented sensor data fusion for wide maritime surveillance. In Waterside Security Conf. (WSS), 2010 Int., pages 1–6.

Kaimakis, P. and Tsapatsoulis, N. (2013). Background modeling methods for visual detection of maritime targets. In Proceedings of the 4th ACM/IEEE Int. Workshop on Anal. and Retrieval of Tracked Events and Motion in Imagery Stream, ARTEMIS '13, pages 67–76, New York, NY, USA. ACM.

Lei, P.-R. (2013). Exploring trajectory behavior model for anomaly detection in maritime moving objects. In 2013 IEEE Int. Conf. on Intelligence and Security Informatics (ISI), pages 271–271.

Liu, T., Yuan, Z., Sun, J., Wang, J., Zheng, N., Tang, X., and Shum, H.-Y. (2011). Learning to detect a salient object. IEEE Trans. on Pat. Anal. and Machine Intelligence, 33(2):353–367.

Liu, W., He, J., and Chang, S.-F. (2010). Large graph construction for scalable semi-supervised learning. In Proceedings of the 27th Int. Conf. on Machine Learning (ICML-10), pages 679–686.

Makantasis, K., Doulamis, A., and Doulamis, N. (2013). Vision-based maritime surveillance system using fused visual attention maps and online adaptable tracker. In 2013 14th Int. Workshop on Image Anal. for Multimedia Interactive Services (WIAMIS), pages 1–4.

Makantasis, K., Doulamis, A., and Matsatsinis, N. (2012). Student-t background modeling for persons' fall detection through visual cues. In 2012 13th Int. Workshop on Image Anal. for Multimedia Interactive Services (WIAMIS), pages 1–4.

Maresca, S., Greco, M., Gini, F., Grasso, R., Coraluppi, S., and Horstmann, J. (2010). Vessel detection and classification: An integrated maritime surveillance system in the Tyrrhenian Sea. In 2010 2nd Int. Workshop on Cognitive Information Processing (CIP), pages 40–45.

McIlhagga, W. (2011). The Canny edge detector revisited. Int. Journal of Comp. Vis., 91(3):251–261.

Nadler, B., Srebro, N., and Zhou, X. (2009). Statistical analysis of semi-supervised learning: The limit of infinite unlabelled data. In Bengio, Y., Schuurmans, D., Lafferty, J. D., Williams, C. K. I., and Culotta, A., editors, Advances in Neural Information Processing Systems 22, pages 1330–1338. Curran Associates, Inc.

Nilsson, M., van Laere, J., Ziemke, T., and Edlund, J. (2008). Extracting rules from expert operators to support situation awareness in maritime surveillance. In 2008 11th Int. Conf. on Information Fusion, pages 1–8.

Rodriguez Sullivan, M. D. and Shah, M. (2008). Visual surveillance in maritime port facilities. Volume 6978, pages 697811-1–697811-8.

Socek, D., Culibrk, D., Marques, O., Kalva, H., and Furht, B. (2005). A hybrid color-based foreground object detection method for automated marine surveillance. In Blanc-Talon, J., Philips, W., Popescu, D., and Scheunders, P., editors, Advanced Concepts for Intelligent Vision Systems, number 3708 in Lecture Notes in Computer Science, pages 340–347. Springer Berlin Heidelberg.

Szpak, Z. L. and Tapamo, J. R. (2011). Maritime surveillance: Tracking ships inside a dynamic background using a fast level-set. Expert Systems with Applications, 38(6):6669–6680.

Vandecasteele, A., Devillers, R., and Napoli, A. (2013). A semi-supervised learning framework based on spatio-temporal semantic events for maritime anomaly detection and behavior analysis. In Proceedings CoastGIS 2013 Conf.: Monitoring and Adapting to Change on the Coast.

Voles, P. (1999). Target identification in a complex maritime scene. Volume 1999, pages 15–15. IEE.

Wijnhoven, R., van Rens, K., Jaspers, E., and de With, P. (2010). Online learning for ship detection in maritime surveillance. Pages 73–80.

Yasri, I., Hamid, N., and Yap, V. (2008). Performance analysis of FPGA based Sobel edge detection operator. In Int. Conf. on Electronic Design (ICED 2008), pages 1–4.

Zemmari, R., Daun, M., Feldmann, M., and Nickel, U. (2013). Maritime surveillance with GSM passive radar: Detection and tracking of small agile targets. In Radar Symposium (IRS), 2013 14th Int., volume 1, pages 245–251.

Zhu, X. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th Int. Conf. on Machine Learning (ICML-2003), volume 20, page 912.

Zivkovic, Z. (2004). Improved adaptive Gaussian mixture model for background subtraction. In Proceedings of the 17th Int. Conf. on Pat. Rec. (ICPR 2004), volume 2, pages 28–31.
