learning attribute representations for remote sensing ship …sahbi/jstars2017.pdf · 2017. 9....

Page 1: Learning Attribute Representations for Remote Sensing Ship …sahbi/jstars2017.pdf · 2017. 9. 21. · 3) Deep learning-based representations: image classification algorithms based

Learning Attribute Representations for Remote Sensing Ship Category Classification

Quentin Oliveau and Hichem Sahbi

Abstract

Object category classification in remote sensing applications usually relies on exemplar-based training. The latter

is achieved by modeling the intricate relationships between visual features and their corresponding object categories.

However, these models might fail when applied to fine-grained object classification problems especially when training

examples are scarce and when objects exhibit complex visual appearances and strong variability.

In this paper, we introduce a framework dedicated to object category classification in the context of scarce

datasets. Our method builds discriminative mid-level image representations (also referred to as attributes) by learning

a nonlinear mapping between the input image features and the attribute space. Moreover, we also enforce these

learned attributes to be highly discriminative and easy to predict. We compare our proposed framework to existing

attribute and related dictionary-based methods and apply it to two challenging tasks with scarce datasets: binary ship

classification on Synthetic Aperture Radar images and multi-class ship category recognition on optical images. These

experiments show that our proposed framework is indeed highly effective and generalizes well despite the scarcity

of training data.

Index Terms

Representation Design, Image Classification, Maritime Surveillance, Ship Category Recognition.

I. INTRODUCTION

With the increasing number of satellite programs (Pléiades, QuickBird, TerraSAR-X, etc.) and Unmanned Aerial

Vehicles (UAV), huge remote sensing image collections are nowadays available. These data require new algorithmic

solutions able to automatically analyze their visual content for different real-world applications such as image

segmentation, object detection and recognition, etc. Object category detection and recognition are applications

which are both particularly interesting and challenging. Their general principle consists in designing automatic

solutions able to assign visual contents to well defined object categories. In the particular context of maritime

environment, traffic monitoring and fishery control require detecting and classifying ships from continuous flows

of images. Whereas ship detection on Synthetic Aperture Radar (SAR) [46], [35], [38], multi-spectral [26], [25]

or optical images [47], [52], [32], [6] has been well studied, ship recognition has received much less attention in

the literature. Existing algorithms dedicated to ship classification from SAR data are often based on the objects'

geometric properties (such as their area, length/width ratio, perimeter or solidity) [33], [55], [29], textures (mainly

through Gray-Level Co-occurrence Matrices and Gray Level Run-Length Matrices) [29] or Radar Cross Section

characteristics [50], [30], [31]; these algorithms have shown promising results on problems with a limited number

of classes. On the other hand, ship classification on optical images is also promising as it captures more

object details compared to SAR images [11]. However, it remains understudied and still very challenging especially

when training data are scarce and when object categories are highly variable.

Existing object (and ship) category detection and recognition methods in remote sensing can be assigned to two

major families: semantic segmentation and object classification. Methods belonging to the first family are usually

bottom-up and proceed by mapping local primitives (pixels, grid cells, etc.) into semantic blobs using probabilistic

models [4], [23] while methods in the second family are holistic and proceed by first extracting/pooling features

(either handcrafted or not) [34], [51] and then assigning them to categories using a variety of machine learning and

classification techniques such as SVMs and deep networks [47].

In this paper, we will focus on ship classification problems and propose a novel attribute learning algorithm that

handles fine-grained category recognition, such as ship categories, even with few training data. Originally introduced

by Lampert et al. [21], attributes are defined as semantic characteristics shared among different object categories.

They have the appealing property of providing both image description and classification criteria that can be reused

across categories. Attributes have been used in several applications ranging from image classification [21], [9],

[2], [8] to zero-shot learning [21] through image description and retrieval [37], [12]. Early algorithms dedicated

to attribute learning were supervised [21]; they require a preliminary (and burdensome) annotation step in order

to collect training data and build semantic attribute prediction criteria. In contrast, unsupervised methods consider

attributes as mid-level characteristics generally deprived of any semantic meaning [2], [54], [12].

Whereas the design principle of supervised methods relies on learning interpretable semantic attributes, these

methods may produce weakly discriminative attributes and usually fail to achieve good classification accuracy. On

the other hand, “semantic-free” attributes are usually better tuned and more discriminative but less interpretable

compared to semantic attributes. Hence, various hybrid approaches have been proposed in order to overcome the

limitations of these two principles while benefiting from their advantages; these methods include automatic semantic

attribute discovery [1], semantic attribute discriminative power improvement based on human interaction [8],

semantic and discriminative feature combination [56], [20] and attribute expression ranking [37]. All these methods,

which mostly proceed by discovering explicit (and binary) class-attribute¹ relationships, have shown competitive

results on different image classification tasks.

On the other hand, supervised dictionary learning techniques are closely related to attribute-based methods as they

transform image features into discriminative features; the latter, also seen as attributes, are obtained by applying

linear transformations between the input space (related to the original image features) and the new learned attribute

space. Some of these supervised dictionary learning techniques are based on multi-dictionary or category-specific

¹ Here, the terms “class” and “category” refer to the same concept.

dictionary learning [27], [58] while other methods learn a single dictionary for all categories [28], [57], [18]. In

addition, learning a discriminative dictionary is often achieved in two steps; first a dictionary is learned and then

a classifier is built on top of the underlying attributes [27], [58]. Other methods consider, instead, a one step

optimization process involving a global objective function that combines both reconstruction error and classification

loss [28], [57], [18].

All the aforementioned methods proceed first by extracting features (either handcrafted or learned), and then

assign those features to classes using a variety of machine learning and classification techniques including support

vector machines (SVMs) and deep networks. Even though relatively successful, these classification methods highly

depend on the abundance of training data, especially for fine-grained object categories exhibiting strong variability.

When training data are scarce, pre-trained deep networks become suitable alternatives; indeed, one may take

advantage of the shared attributes in intermediate and deep levels of pre-trained deep networks, in order to adapt their

parameters and hence tackle different classification tasks even if training images are scarce. However, and besides

the computational issues, there is no guarantee about the generalization ability of the adapted networks, particularly

when the few training images, used in the new classification tasks, are taken from very different distributions

compared to those used to pre-train the initial deep networks.

A. Related work and Motivations

This paper focuses on attribute learning; this is closely related to the previous work on attribute based classification,

supervised dictionary learning and deep learning, which all aim to learn new image representations shared among

different categories that also provide better discrimination power compared to the original image features. In this

section, we present the main classes of representation learning methods (mainly attributes, dictionary-based and

deep learning) while highlighting their strengths and weaknesses.

1) Attribute-based classification: Lampert et al. [21] originally proposed to describe categories of objects

using semantic attributes. This consists in describing images using class-attribute matrices that provide the binary

membership of each attribute to each category. These matrices are usually collected using – expert or volunteer

– annotators and provide a way to understand the relationships between the learned attributes and categories.

However, this kind of description scheme has at least two major drawbacks. First, binary class-attribute matrices may

convey very similar attribute memberships for two different classes, resulting in weak separability between the

underlying attribute vectors of images belonging to these classes. Second, attributes might be difficult to predict,

especially if relationships between attributes and image features are hard to model² or if some of them are hard to

disentangle when they co-occur in many classes. In order to overcome these limitations, Farhadi et al. [9] propose

to learn random attributes with low prediction error. Their method consists in randomly choosing few positive and

negative classes and learning linear SVMs separating these classes. The SVMs with the lowest prediction errors are

² For instance, abstract attributes such as “fast”.

then selected in order to construct image attribute representations. Although their approach may generate attribute

predictors with low estimation errors, there is no guarantee about the discrimination power of these learned attributes

w.r.t. the task at hand: indeed, images belonging to two different classes could have very close attribute descriptions,

and this may affect classification performance.

As alternatives to random attributes, other methods have been proposed. For instance, Yu et al. [54] constrain the

learned attributes to be discriminative by explicitly maximizing the distance between attributes belonging to different

classes. In addition, a low attribute prediction error criterion is added as well as constraints that make attribute

specifications less redundant across classes, while being common for classes sharing some visual appearances.

On the other hand, Rastegari et al. [39] propose to learn Discriminative Binary Codes by learning hyperplanes

separating classes in the feature space. Discrimination power is obtained by maximizing the ratio between inter

and intra-class distances while the attribute prediction error is minimized by learning simultaneously the binary

attributes and the hyperplanes. Another alternative method introduced by Guo et al. [12] consists in learning binary

class-attribute memberships while taking into account the discrimination power of the attributes, their prediction

error as well as their intra-class variability. This is achieved by maximizing the margin of the hyperplanes predicting

attributes while minimizing a multi-class classification loss on top of these attributes.

2) Dictionary-based classification: dictionary-based methods [28], [27], [58], [57], [18] aim to transform original

features into a new (possibly discriminative) representation. Among the algorithms dedicated to supervised dictionary

learning, the Label Consistent KSVD³ (LC-KSVD) [18] outperforms other dictionary-based approaches especially

on classification tasks with scarce datasets. This algorithm consists in learning simultaneously a dictionary and a

classifier by minimizing (i) a reconstruction criterion (between original image features and the new dictionary-based

representations) as well as (ii) a classification error while (iii) enforcing both class-smoothness and sparsity of the

learned representations; the dimension of this representation is controlled by the size of the dictionary.

All dictionary-based methods mentioned earlier rely on the assumption that the relationship between the original

image features and the learned attribute spaces is linear; moreover, predicting the attributes of new test images

usually requires solving a costly optimization problem.

3) Deep learning-based representations: image classification algorithms based on handcrafted features are usually

suitable for large (and high-resolution) images while for small image snapshots (such as small ships in remote sensing

applications) these techniques are often inappropriate. In contrast, deep learning classification methods [10] are

more suitable for small (and low-resolution) images and have proven to be very effective when handling a large

number of classes. However, they often generalize poorly when training data are scarce and this is mainly due

to the large number of parameters to learn on these networks. In order to overcome these limitations, many new

³ In the following, we only consider the LC-KSVD2 variant as it is explicitly designed to produce discriminative attributes and provides a

linear classifier to classify new examples.

regularization techniques and architectures have recently been proposed, including regularization methods such

as dropout [44] and batch normalization [17]. Moreover, new architectures, including Highway Networks [45] and

ResNet [15], have also demonstrated promising performance on small-scale problems. These approaches have

been used in remote sensing for hyper-spectral image segmentation [16] and also ship detection on SAR images [42].

When training data are scarce for new targeted tasks, pre-trained networks are also suitable alternatives. Their

general principle consists in using deep networks pre-trained offline on different tasks with large datasets (such as

ImageNet [7] or CIFAR-10 [19]) in order to obtain new representations for the new targeted task. These pre-trained

networks can be used in two different ways. The simplest one consists in feeding the pre-trained network with

images of the new targeted task and using the outputs as abstract representations⁴. Another possibility consists in

fine-tuning the parameters of the pre-trained networks [36], [53] on the new targeted task. This requires removing

the deepest layers, which are usually specific to the tasks for which the deep network has been pre-trained, and

then adding new layers which are trained for the considered task using back-propagation. Note that layers kept

from the pre-trained networks may have their weights fixed or fine-tuned, depending on computational resources

and cardinality of training data. Fine-tuning usually provides more discriminative representations compared to the

straightforward use of pre-trained networks, as the weights are specifically updated for the task at hand. However, when

training data are scarce, this may still lead to poor generalization especially when images of the new targeted tasks

are drawn from very different distributions compared to the ones used to pre-train the deep networks.

B. Contribution

Considering all the issues discussed earlier, we propose in this paper an attribute design framework which is

particularly suitable for classification tasks with scarce training data. The design principle of our method is based

on learning highly discriminative, reusable and predictable features by taking advantage of shared attributes across

different object categories. This makes it possible to benefit from larger training sets (at the attribute level) and

thereby overcome the scarcity of training data at the category level. Existing feature combination, extraction and

selection methods (e.g. [13], [14]) usually require defining supersets of initial handcrafted features (such as color,

texture and shape) and then selecting or combining those features while trying to maximize performance. In spite of

being relatively successful (e.g. [29], [22], [49]), these techniques might lead to highly combinatorial formulations

and large search space heuristics. Instead, our attribute design framework neither requires basic handcrafted features

nor large search space heuristics; indeed, our method learns joint attributes “from scratch” in a way that makes

them discriminative, without reordering, selecting or combining supersets of initial features. In contrast to semantic

attribute learning such as [21], our method favors discrimination power of attributes at the expense of their

interpretability, and hence avoids the tedious preliminary step of annotation – at the attribute level – resulting in

a more tractable and also scalable method w.r.t. the number of attributes as well as the number of categories. In

⁴ The deeper the layer, the more abstract and eventually discriminative the obtained representation is.

addition, in contrast to other non-semantic attribute learning algorithms such as [9], [54], [39], [12], we propose

to learn nonlinear relationships between the image feature space and the attributes, in order to represent data of

different classes in a richer and more discriminative space.

More precisely, our method consists in learning a nonlinear mapping between low-level image features and

real-valued attributes. The latter are obtained by optimizing different criteria: the first one minimizes an attribute

prediction error using regression, while the second criterion seeks to find classifiers and attributes that correctly

separate training images while maximizing their margins. We also consider a reconstruction term by learning a

dictionary which enforces the learned attributes to fit training data. Note that, in contrast to binary attributes, our

learned real-valued attributes provide a more effective way to automatically disentangle the intricate relationships

between categories and images; for that purpose, our approach neither requires the explicit specification of semantic

attributes nor the tedious manual annotation of the underlying training images. Finally, we apply our attribute learning

method to two remote sensing applications using data extracted from two different sensors: binary ship/no-ship

classification on SAR images and multi-class fine-grained ship classification on optical images.

II. PROPOSED FRAMEWORK

In this section, we present the objectives of our proposed framework, detail it and discuss how to fully benefit

from the available (even scarce) training data by learning image representations shared across different categories.

The main goal of our framework is to learn discriminative visual characteristics as well as functions able to predict

those characteristics on new images. Following previous works described in Section I-A1, these visual characteristics,

referred to as “attributes”, should be highly discriminative and reusable (shared) across different classes.

Given a set $\{I_i\}_{i=1}^{\ell+u}$ as the union of $\ell$ labeled (training) images (belonging to $C$ different categories) and $u$

unlabeled (test) images, we define for each image $I_i$ the variable $x_i \in \mathbb{R}^M$ as its original visual features and $y_i$ its

category label in $\mathcal{C} = \{1, \dots, C\}$. Our goal is to learn $K$ nonlinear functions (denoted $\{f_k\}_{k=1}^{K}$) that map a given

$x_i$ to an attribute vector $\alpha_i \in \mathbb{R}^K$ with $\alpha_i = [\alpha_{i1} \dots \alpha_{iK}]^T$ and $\alpha_{ik} = f_k(x_i)$, while ensuring that (i) original image

features can be partially reconstructed from their attributes and (ii) these attributes are discriminative enough in

order to separate images belonging to different classes.

Thus, our attribute design principle, introduced subsequently, aims to guarantee the following properties: (i)

fidelity of attributes to training data: this term aims to improve the generalization of the learned attributes by

enforcing that image features can be partially reconstructed from the attributes; (ii) discrimination power: simple

classifiers built on top of attributes should correctly classify images; (iii) high predictability: attributes predicted

from the low-level features must be accurate. We implement these properties using the following criteria, where

$X \in \mathbb{R}^{M \times \ell}$ denotes the matrix whose columns correspond to the low-level training features $\{x_i\}_{i=1}^{\ell}$ and $\alpha \in \mathbb{R}^{K \times \ell}$

the matrix of the underlying attributes.

A. Fidelity to training data

We propose to enforce a “fitness” between training data $X \in \mathbb{R}^{M \times \ell}$ and learned attributes $\alpha \in \mathbb{R}^{K \times \ell}$ by minimizing

a reconstruction error. This reconstruction involves attributes which are also predicted using a nonlinear mapping

and it aims to improve the generalization ability of the learned attributes on new images. More specifically, we

learn $\alpha$ and a dictionary $D$ of size $M \times K$ by minimizing the following criterion:

$$\arg\min_{\alpha, D} \; \frac{1}{2} \| X - D\alpha \|_F^2 \tag{1}$$

where $\|\cdot\|_F$ denotes the matrix Frobenius norm.

B. Discrimination power

Our main goal consists in learning on top of the attributes a classifier which is able (i) to separate training images

belonging to different classes with a small empirical error, while (ii) being able to generalize well on unseen test

data.

We implement this property by optimizing an objective function mixing (via a positive constant $Q_1$) an empirical

error and a regularizer. In practice, we consider a “one-versus-all” soft-margin linear SVM [3] for each class $c \in \mathcal{C}$.

For a given $c \in \mathcal{C}$, we learn a hyperplane $W_c$ separating attribute vectors $\{\alpha_i\}_i$, which belong to class $c$ (i.e.,

$y_i = c$), from those belonging to $\mathcal{C} \setminus c$. Considering for each image $I_i$ and category $c$ a variable $y_{ci}$ set to $1$ if $y_i = c$

and $-1$ otherwise, the SVM learning problem can be written as:

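$$
\begin{aligned}
\min_{W_c, b_c, \xi_c, \alpha} \quad & \frac{1}{2} W_c^T W_c + Q_1 \sum_{i=1}^{\ell} \xi_{ci} \\
\text{s.t.} \quad & y_{ci} \left( W_c^T \alpha_i + b_c \right) \geq 1 - \xi_{ci}, \quad i = 1, \dots, \ell \\
& \xi_{ci} \geq 0, \quad i = 1, \dots, \ell
\end{aligned}
\tag{2}
$$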
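
To make the role of the slack variables in equation (2) concrete, the following Python sketch evaluates the objective for one class and a fixed hyperplane. It is only an illustration under stated assumptions, not the authors' implementation: the function names `one_vs_all_labels` and `svm_objective`, the toy values and the use of NumPy are all introduced here. It relies on the fact that, for fixed $(W_c, b_c, \alpha)$, the smallest feasible slack is the hinge loss $\xi_{ci} = \max(0, 1 - y_{ci}(W_c^T \alpha_i + b_c))$.

```python
import numpy as np

def one_vs_all_labels(y, num_classes):
    """Return a (C, l) matrix with +1 where y_i == c and -1 otherwise (classes are 1..C)."""
    classes = np.arange(1, num_classes + 1)[:, None]    # shape (C, 1)
    return np.where(y[None, :] == classes, 1.0, -1.0)   # shape (C, l)

def svm_objective(W_c, b_c, alpha, y_c, Q1):
    """Objective of equation (2) for class c, with slacks set to their smallest feasible values."""
    margins = y_c * (W_c @ alpha + b_c)                 # y_ci * (W_c^T alpha_i + b_c), shape (l,)
    slacks = np.maximum(0.0, 1.0 - margins)             # hinge loss = optimal xi_ci
    return 0.5 * W_c @ W_c + Q1 * slacks.sum(), slacks

# Toy data (illustrative values): K = 2 attributes, l = 4 images, C = 2 classes.
alpha = np.array([[2.0, 1.5, -1.0, 0.2],
                  [0.0, 0.5,  0.0, 0.0]])               # columns are attribute vectors alpha_i
y = np.array([1, 1, 2, 2])
Y = one_vs_all_labels(y, num_classes=2)
value, slacks = svm_objective(np.array([1.0, 0.0]), 0.0, alpha, Y[0], Q1=10.0)
# Only the last image violates the margin (slack 1.2), so value = 0.5 + 10 * 1.2 = 12.5.
```

In this toy case, a larger $Q_1$ would penalize the single margin violation more heavily relative to the regularizer $\frac{1}{2} W_c^T W_c$.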

C. Inductive attribute prediction

In order to capture complex relationships between training data and their corresponding attributes we propose to

predict attributes from low-level features using nonlinear regressions. More specifically, we use the support vector

regression (SVR) formulation [43] whose primal form is given in equation (3). Following this formulation, the

SVR aims to optimize a constrained quadratic objective function that balances regularization and empirical error;

the latter is controlled by the constant $Q_2$. In this formulation, the parameter $\varepsilon$ corresponds to the width of the

tube surrounding the SVR function, i.e., how tolerant the SVR is to small prediction errors. Following the

KKT conditions, the SVR formulation can be rewritten in its dual form as shown in equation (4); here $\kappa$ is a

positive semi-definite kernel, and $\{\beta_k, \beta_k^\star\}$ are the learning parameters associated with each attribute predictor $f_k$

($k \in \{1, \dots, K\}$). Attribute prediction is inductive and the value of the $k$-th attribute of a new sample $x_j$ can then

be computed using equation (5), according to the representer theorem [41].

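$$
\begin{aligned}
\min_{w_k, d_k, \xi^+, \xi^-, \alpha} \quad & \frac{1}{2} \| w_k \|^2 + Q_2 \sum_{i=1}^{\ell} \left( \xi_i^+ + \xi_i^- \right) \\
\text{s.t.} \quad & \alpha_{ik} - f_k(x_i) \leq \varepsilon + \xi_i^+, \quad \forall i \in \{1, \dots, \ell\} \\
& f_k(x_i) - \alpha_{ik} \leq \varepsilon + \xi_i^-, \quad \forall i \in \{1, \dots, \ell\} \\
& \xi_i^+, \xi_i^- \geq 0, \quad \forall i \in \{1, \dots, \ell\},
\end{aligned}
\tag{3}
$$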

Here, $(w_k, d_k)$ are the normal and the shift of the hyperplane associated with the $k$-th regression function $f_k$. The dual

form of the above constrained minimization problem is

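$$
\begin{aligned}
\min_{\beta_k, \beta_k^\star, \alpha} \quad & \sum_{i=1}^{\ell} \sum_{j=1}^{\ell} \left( \beta_{ki} - \beta_{ki}^\star \right) \left( \beta_{kj} - \beta_{kj}^\star \right) \kappa(x_i, x_j) + \varepsilon \sum_{i=1}^{\ell} \left( \beta_{ki} + \beta_{ki}^\star \right) - \sum_{i=1}^{\ell} \alpha_{ik} \left( \beta_{ki} - \beta_{ki}^\star \right) \\
\text{s.t.} \quad & \sum_{i=1}^{\ell} \left( \beta_{ki} - \beta_{ki}^\star \right) = 0 \\
& 0 \leq \beta_{ki} \leq Q_2, \quad \forall i \in \{1, \dots, \ell\} \\
& 0 \leq \beta_{ki}^\star \leq Q_2, \quad \forall i \in \{1, \dots, \ell\}
\end{aligned}
\tag{4}
$$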

$$f_k(x_j) = \sum_{i=1}^{\ell} \left( \beta_{ki} - \beta_{ki}^\star \right) \kappa(x_j, x_i) + d_k \tag{5}$$
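
As an illustration of the inductive prediction rule in equation (5), the sketch below evaluates all $K$ attribute predictors on new samples at once. Several choices here are assumptions made for the example rather than details taken from the text: the Gaussian (RBF) kernel and its parameter `gamma`, the helper names `rbf_kernel` and `predict_attributes`, and the convention that `coef[k, i]` stores the difference $\beta_{ki} - \beta_{ki}^\star$.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian kernel matrix between the columns of A (M, n) and the columns of B (M, m)."""
    sq = (A * A).sum(axis=0)[:, None] + (B * B).sum(axis=0)[None, :] - 2.0 * A.T @ B
    return np.exp(-gamma * np.maximum(sq, 0.0))   # clip tiny negative distances from rounding

def predict_attributes(X_new, X_train, coef, d, gamma):
    """Equation (5) for every attribute k and every new sample x_j.

    coef[k, i] stands for (beta_ki - beta*_ki) and d[k] for the offset d_k (assumed layout).
    Returns a (K, n_new) matrix whose columns are the predicted attribute vectors.
    """
    return coef @ rbf_kernel(X_train, X_new, gamma) + d[:, None]

# Toy usage: two training points in R^2, one attribute predictor, one new sample.
X_train = np.array([[0.0, 1.0],
                    [0.0, 0.0]])                  # columns are x_1 = (0, 0) and x_2 = (1, 0)
X_new = np.array([[0.0], [0.0]])
pred = predict_attributes(X_new, X_train, np.array([[1.0, -2.0]]), np.array([0.5]), gamma=1.0)
```

Predicting a new sample only costs one kernel evaluation per training point with a nonzero coefficient, which is the practical meaning of "inductive" here.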

From the above formulation, training the regression functions does not require any labeled data, and this allows

us to benefit from relatively larger available training sets (across different categories) and hence improve the

generalization ability of the learned representations. Notice that, as these regressions are kernel-based, the whole

framework allows learning nonlinear boundaries between classes despite the fact that its final component is a linear

SVM. Moreover, this attribute prediction model is inductive and allows us to easily predict attribute values of new

data (see equation (5)), in contrast to purely dictionary-based methods such as LC-KSVD, which require solving

costly optimization problems in order to obtain the representations of new unseen data.

D. Global model

Let $\{J_c\}_c$, $\{L_k\}_k$ denote the objective functions associated with equations (2) and (4) respectively, and let

$W = \{W_c\}_c$, $\xi = \{\xi_c\}_c$, $b = \{b_c\}_c$ and $\beta = \{\beta_k, \beta_k^\star\}_k$. Using equation (1), the global optimization

problem can be rewritten as:

$$
\begin{aligned}
\min_{\alpha, D, \beta, \xi, W, b} \quad & \frac{1}{2} \| X - D\alpha \|_F^2 + \sum_{c=1}^{C} J_c(W, \xi, b, \alpha) + \sum_{k=1}^{K} L_k(\beta, \alpha) \\
\text{s.t.} \quad & y_{ci} \left( W_c^T \alpha_i + b_c \right) \geq 1 - \xi_{ci}, \quad \forall c, \forall i \\
& \xi_{ci} \geq 0, \quad \forall i \in \{1, \dots, \ell\}, \forall c \\
& \sum_{i=1}^{\ell} \left( \beta_{ki} - \beta_{ki}^\star \right) = 0, \quad \forall k \\
& 0 \leq \beta_{ki} \leq Q_2, \quad \forall i, \forall k \\
& 0 \leq \beta_{ki}^\star \leq Q_2, \quad \forall i, \forall k
\end{aligned}
\tag{6}
$$

III. OPTIMIZATION

It is clear that the minimization problem given in equation (6) is not convex jointly w.r.t. α, D, β, W and b.

Nevertheless, each of the three sub-problems given in equations (1), (2) and (4) can be expressed as a quadratic

problem with linear constraints and can be solved efficiently. Thus we propose to solve the global problem using an

EM-like optimization procedure which consists in solving alternately and iteratively each of the three sub-problems:

we first learn the dictionary D by minimizing the left-hand side term of (6), then we minimize the inverse of the

SVM margins and the empirical losses in the second term of (6) w.r.t. W, b and finally, we update the attributes α

by minimizing all three terms in (6) w.r.t. α. This process, detailed in Algorithm 1, is repeated until convergence;

i.e. all the unknowns remain unchanged from one iteration to another. The superscript (t) is added to all the variables

in order to show the evolution of their values through different iterations of the learning process.

Note that attribute vectors α are the only parameters requiring an initialization. We propose to initialize them by

reducing the dimension of the low-level features from M to K by PCA before normalizing each attribute to [0, 1],

leading to uncorrelated and scaled attribute initializations.
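
The initialization described above can be sketched in a few lines of NumPy. This is a minimal sketch under assumptions that the text does not spell out: PCA is computed through an SVD of the centered feature matrix, the normalization to $[0, 1]$ is a per-attribute min-max rescaling, and the function name `init_attributes_pca` is hypothetical.

```python
import numpy as np

def init_attributes_pca(X, K):
    """Project the columns of X (M, l) onto the K leading principal directions,
    then rescale each attribute (row) to the interval [0, 1]."""
    Xc = X - X.mean(axis=1, keepdims=True)             # center every feature dimension
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)   # columns of U: principal directions
    Z = U[:, :K].T @ Xc                                # (K, l) uncorrelated projections
    lo = Z.min(axis=1, keepdims=True)
    span = Z.max(axis=1, keepdims=True) - lo
    span[span == 0] = 1.0                              # guard against constant attributes
    return (Z - lo) / span

# Toy usage with random features (M = 6, l = 40, K = 3); the sizes are illustrative.
rng = np.random.default_rng(0)
alpha0 = init_attributes_pca(rng.normal(size=(6, 40)), K=3)
```

Because min-max rescaling is an affine map applied to each row separately, the rescaled attributes remain uncorrelated on the training set.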

A. Learning dictionary D

Assuming fixed $\alpha^{(t)}$ (denoted simply as $\alpha$) and enforcing the gradient of (1) to vanish w.r.t. $D$, we obtain

$$D^{(t+1)} = X \alpha^T \left( \alpha \alpha^T \right)^{-1} \tag{7}$$
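
The closed form of equation (7) follows from the gradient of (1) with respect to $D$, which is $(D\alpha - X)\alpha^T$; setting it to zero gives $D \alpha \alpha^T = X \alpha^T$. The sketch below computes this update, assuming $\alpha \alpha^T$ is invertible (which requires at least $K \leq \ell$). The function name `update_dictionary` is hypothetical, and solving the linear system instead of forming the inverse explicitly is an implementation choice made here, not something stated in the text.

```python
import numpy as np

def update_dictionary(X, alpha):
    """Minimizer of 0.5 * ||X - D alpha||_F^2 over D for fixed alpha (equation (7)).

    X: (M, l) features, alpha: (K, l) attributes. Returns D of shape (M, K).
    Solves (alpha alpha^T) D^T = alpha X^T, which is equivalent to
    D = X alpha^T (alpha alpha^T)^{-1} because alpha alpha^T is symmetric.
    """
    gram = alpha @ alpha.T                      # (K, K), assumed invertible
    return np.linalg.solve(gram, alpha @ X.T).T

# Toy usage with random data (M = 5, K = 3, l = 20); the sizes are illustrative.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(5, 20))
alpha_toy = rng.normal(size=(3, 20))
D_toy = update_dictionary(X_toy, alpha_toy)
gradient = (D_toy @ alpha_toy - X_toy) @ alpha_toy.T   # vanishes at the optimum
```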

B. Learning W and b

Parameters $W^{(t+1)}$, $b^{(t+1)}$ representing the hyperplanes separating examples belonging to different classes are

obtained by solving the following problem:

$$\left( W^{(t+1)}, b^{(t+1)} \right) \leftarrow \arg\min_{W, b} \sum_{c=1}^{C} J_c(W, \xi, b, \alpha) \tag{8}$$

Algorithm 1: Iterative optimization of equation (6)

Input: Training samples {(x_i, y_i)}, i = 1, …, ℓ

begin
    t ← 0
    α(0) ← Attribute initialization by PCA
    repeat
        Compute D(t+1), W(t+1) and b(t+1) using equations (7) and (8)
        Learn β(t+1) using equation (4)
        Update attributes α(t+1) following equation (10)
        t ← t + 1
    until Convergence or t ≥ Tmax
end
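
The control flow of Algorithm 1 can be sketched as a generic alternating loop. The code below is only a skeleton under stated assumptions: the four update callables are hypothetical placeholders for the solvers of the dictionary, classifier, regression and attribute sub-problems, and convergence is monitored on α alone, which simplifies the stopping rule described in the text. The toy usage drops the classification and regression terms entirely, so it only exercises the loop.

```python
import numpy as np

def alternate(alpha0, update_D, update_Wb, update_beta, update_alpha,
              t_max=50, tol=1e-6):
    """Alternating optimization skeleton; all update_* callables are placeholders."""
    alpha, t = alpha0, 0
    while True:
        D = update_D(alpha)                        # dictionary step
        W, b = update_Wb(alpha)                    # classifier step
        beta = update_beta(alpha)                  # attribute regression step
        new_alpha = update_alpha(D, W, b, beta)    # attribute step
        t += 1
        change = np.linalg.norm(new_alpha - alpha)
        alpha = new_alpha
        if change <= tol or t >= t_max:            # convergence or iteration cap
            return alpha, D, W, b, beta, t

# Toy usage: reconstruction term only (no SVM or SVR terms).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2)) @ rng.uniform(size=(2, 15))
alpha_hat, D_hat, _, _, _, n_iter = alternate(
    rng.uniform(size=(2, 15)),
    update_D=lambda a: X @ a.T @ np.linalg.inv(a @ a.T),
    update_Wb=lambda a: (None, None),
    update_beta=lambda a: None,
    update_alpha=lambda D, W, b, beta: np.linalg.lstsq(D, X, rcond=None)[0],
)
```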

C. Learning β

Following equation (4), parameters β(t+1) related to attribute prediction from low-level features are learned as

$$
\beta^{(t+1)} \leftarrow \arg\min_{\beta} \sum_{k=1}^{K} L_k(\beta, \alpha) \tag{9}
$$

and this is efficiently solved using the LIBSVM library [5].

D. Learning attributes α

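Considering fixed β(t+1), D(t+1), W(t+1), b(t+1) (denoted simply as β, D, W, b in this section), we set α(t+1) ← {α*_i}_i, with {α*_i}_i being the optimum of the following convex QP problems (for i = 1, …, ℓ)

$$
\begin{aligned}
\min_{\alpha_i, \nu^{+}, \nu^{-}} \quad & \tfrac{1}{2} \left\| x_i - D \alpha_i \right\|_2^2 + Q_3 \sum_{k=1}^{K} \left( \nu_k^{+} + \nu_k^{-} \right) \\
\text{s.t.} \quad & y_i^c \left( W_c^{\top} \alpha_i + b_c \right) \ge 1, \quad c = 1, \dots, C \\
& \alpha_{ik} - f_k(x_i) \le \epsilon + \nu_k^{+}, \\
& f_k(x_i) - \alpha_{ik} \le \epsilon + \nu_k^{-}, \\
& \nu_k^{+} \ge 0, \ \nu_k^{-} \ge 0, \quad k \in \{1, \dots, K\}.
\end{aligned} \tag{10}
$$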

Here, Q3 is a constant that controls the trade-off between the approximation errors of the SVR predictor and the

matrix decomposition. It is easy to see that these separate QP problems (w.r.t. i) are computationally very tractable

as the number of parameters is proportional to K and C which are relatively small in practice.
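
To make the structure of (10) concrete, the sketch below evaluates its objective at a candidate attribute vector. It relies on the fact that, for a fixed α_i, the smallest feasible slacks are the ε-insensitive deviations between α_i and the regression outputs. This is not a QP solver and not the authors' code; the function name `qp10_objective`, the (K, C) layout of `W` and the toy numbers are assumptions.

```python
import numpy as np

def qp10_objective(alpha_i, x_i, D, f_xi, W, b, y_i, Q3, eps):
    """Objective of (10) at a candidate alpha_i, with slacks set to their
    smallest feasible values. Returns (objective, margins_ok)."""
    nu_plus = np.maximum(0.0, alpha_i - f_xi - eps)     # alpha_ik - f_k(x_i) <= eps + nu+_k
    nu_minus = np.maximum(0.0, f_xi - alpha_i - eps)    # f_k(x_i) - alpha_ik <= eps + nu-_k
    recon = 0.5 * np.sum((x_i - D @ alpha_i) ** 2)      # 0.5 * ||x_i - D alpha_i||^2
    margins_ok = bool(np.all(y_i * (W.T @ alpha_i + b) >= 1.0))
    return recon + Q3 * np.sum(nu_plus + nu_minus), margins_ok

# Toy instance with K = 2 attributes and C = 1 hyperplane (made-up values).
D = np.eye(2)
x_i = np.array([1.0, 0.0])
alpha_i = np.array([1.0, 0.5])
f_xi = np.array([1.0, 0.0])       # hypothetical SVR predictions f_k(x_i)
W = np.array([[2.0], [0.0]])      # column c holds W_c
b = np.array([-0.5])
y_i = np.array([1.0])
obj, ok = qp10_objective(alpha_i, x_i, D, f_xi, W, b, y_i, Q3=1.0, eps=0.1)
# Here recon = 0.125 and the slacks sum to 0.4, so obj = 0.525 and ok is True.
```

Each problem involves K attribute values, 2K slacks and C margin constraints, which is why the per-sample QPs remain small.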

IV. EXPERIMENTS

In this section, we apply our method to two remote sensing applications, analyze the impact of different

parameters of our method on its accuracy, and discuss its computational efficiency. The first remote sensing

application is binary ship/no ship classification on SAR images, while the second task is multi-class ship recognition

on optical satellite images. In what follows, we detail the datasets and tasks, then we present a comprehensive

study of the different building blocks of our model. We also compare our method to representative related work

including discriminative dictionary learning and deep networks.

A. Ship/no ship classification on SAR images

The goal of this task is to classify snapshots belonging to a SAR image into one of the two classes (“ship”,

“no-ship”); no-ship refers to other objects as well as artifacts. We consider SAR images for this task as they

are commonly used in ship detection. We based our work on Sentinel-1 data (IW GRD mode, HH polarization)

taken over one spot of the English Channel at one given date (thus ensuring that each ship on the SAR image is

unique and preventing overlaps between training and test sets) and extracted from the Sentinel Scientific Data Hub

website5. We calibrated the raw SAR images and used a CFAR algorithm in order to extract snapshots of putative

ships while also detecting irrelevant objects or artifacts. Afterwards, we use the normalized Radar Cross Section

(RCS) of these snapshots and we assign them manually to two classes: (i) actual ships and (ii) irrelevant objects or

artifacts. Examples of images from this set (referred to as SARShip database) are shown in figure (1); we clearly

observe that the two classes share some common characteristics such as common parts or shapes. Note that the

number of training examples available to build the binary classifier is very low, so our attribute-based model is

suitable for this scenario and it provides us with discriminative representations as shown subsequently.

Fig. 1. The left-hand side shows six positive examples (ships) from the SARShip database, while the right-hand side shows six negative

examples (corresponding to irrelevant objects or artifacts).

In order to evaluate the performance of our attribute learning method, we use image snapshots from the SARShip

dataset which includes 60 images randomly6 split into two subsets of 30 images for training and 30 for testing with

each subset including 15 positive and 15 negative examples7. This dataset, in spite of being relatively small, is very

challenging as it allows us to test our attribute learning method in the extreme conditions of scarcity of training data.
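
The evaluation protocol (repeated random splits with balanced classes, followed by the mean and standard deviation of the accuracy) could be sketched as follows. This is a hedged illustration: the helper names and the `evaluate` callback, which is expected to train on the first index set and return the accuracy on the second, are hypothetical.

```python
import numpy as np

def stratified_half_split(labels, rng):
    """Randomly split indices so that each half holds the same count per class."""
    train, test = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        half = len(idx) // 2
        train.extend(idx[:half])
        test.extend(idx[half:])
    return np.array(train), np.array(test)

def mean_std_over_splits(labels, evaluate, n_splits=100, seed=0):
    """Mean and standard deviation of evaluate(train_idx, test_idx) over splits."""
    rng = np.random.default_rng(seed)
    scores = [evaluate(*stratified_half_split(labels, rng)) for _ in range(n_splits)]
    return float(np.mean(scores)), float(np.std(scores))

labels = np.array([1] * 30 + [-1] * 30)    # 30 ships and 30 non-ships
train_idx, test_idx = stratified_half_split(labels, np.random.default_rng(0))
```

With 60 snapshots this yields 15 positive and 15 negative examples in each subset, matching the split sizes stated above.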

5 https://scihub.copernicus.eu

6 We consider 100 random splits in order to measure mean accuracy as well as standard deviation.

7 Except for the OFaug features, where each subset contains 60 positive and 60 negative examples; see section IV-A1.

1) Input features: we described image snapshots of the SARShip dataset using three types of features: original

features, augmented original features and deep features (referred to as OF, OFaug and DF respectively). In order to

obtain the OF, each snapshot (of size 15× 15 pixels) is reshaped as a 225-dimensional vector. Augmented original

features OFaug are generated by applying vertical and horizontal symmetries to the original images, multiplying by

four the size of the dataset. The DF are obtained by submitting the snapshots to the deep Network-in-Network [24]

which is pre-trained on the CIFAR-10 dataset [19]. We chose this deep network as it is pre-trained on images of

small sizes (32 × 32 pixels), and thus has characteristics similar to those of our problem. Hence, we resize our

image snapshots (to 32 × 32 pixels) in order to fit the input of this pre-trained network. In order to select the most

appropriate layer in this network, we train different SVMs on top of these layers and we select the one that provides

the highest accuracy.
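
The construction of the OF and OFaug features can be illustrated with a few lines of NumPy. This is a sketch under assumptions: we take the four variants to be the original snapshot, its vertical mirror, its horizontal mirror and the combination of both, which is one reading of "multiplying by four the size of the dataset"; the function names are hypothetical and samples are stored as rows here.

```python
import numpy as np

def original_features(snapshots):
    """snapshots: (n, H, W) array -> (n, H*W) matrix, one row per snapshot."""
    return snapshots.reshape(len(snapshots), -1)

def augment_with_flips(snapshots):
    """Return the originals followed by three mirrored copies (4x the data)."""
    return np.concatenate([
        snapshots,
        snapshots[:, ::-1, :],       # vertical symmetry (rows reversed)
        snapshots[:, :, ::-1],       # horizontal symmetry (columns reversed)
        snapshots[:, ::-1, ::-1],    # both symmetries combined (assumed fourth variant)
    ])

snaps = np.arange(30 * 15 * 15, dtype=float).reshape(30, 15, 15)
OF = original_features(snaps)                          # shape (30, 225)
OFaug = original_features(augment_with_flips(snaps))   # shape (120, 225)
```

The resulting shapes agree with the 225-dimensional inputs and with the 30 versus 120 examples per subset reported for OF and OFaug.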

2) Comparison: we compare the accuracy of our attribute learning method against related frameworks including

linear and Gaussian SVMs as well as the LC-KSVD algorithm described in section I-A; all these algorithms are

trained either on top of original, augmented original or deep features. Note that the linear SVM is a suitable choice

for the baselines as the targeted classification problems involve small training sets in high dimensional spaces.

Therefore, there is little benefit in mapping these training sets into high dimensional spaces (via nonlinear kernels) in

order to make them linearly separable8. Furthermore, this choice is motivated by the fairness of comparison w.r.t.

LC-KSVD which also relies on linear SVM. For further comparison, we also fine-tuned the pre-trained deep

network on our binary ship classification task using different settings of network architecture (depth of the ablation,

number and size of newly trained layers) and regularization methods (dropout, batch normalization). Note that

random attributes are not explored for comparison as they are not suitable for our binary problem. Table I shows

the accuracy of all these methods. When using original features, we observe that both our attribute learning method

and the LC-KSVD outperform the linear and Gaussian SVM baselines. This clearly corroborates the fact that

image representations learned by these two algorithms are more discriminative than the original features while

being low dimensional. We also observe that LC-KSVD – in spite of being suitable for scarce datasets – is slightly

worse compared to our method. When comparing the results of OF and OFaug, we observe that increasing the

number of training and test samples significantly improves the accuracy of our method relative to LC-KSVD. We also

obtain significant gains w.r.t. LC-KSVD and other baselines when using deep features9, which also provide more

discriminative image representations compared to the original features.

In order to study the impact of the parameter K (number of attributes learned on top of original or deep features),

we evaluate the accuracy of our binary classification problem w.r.t. different values of K. Figure 2 shows that

learning only one attribute is already enough in order to decently separate the two classes (see also visualization

8 Nevertheless, we added a Gaussian SVM to check whether a nonlinear classifier learned on top of the original or deep features improves the accuracy.

9 These features (which have the highest degree of abstraction) are extracted from the layer before last, as this layer provides the highest accuracy. The dimensionality of this layer is 640.

TABLE I

Mean accuracy and standard deviation obtained with 100 runs (corresponding to 100 splits of the SARShip dataset). Din (resp. Dout) stands for the dimension of the input (resp. output learned) features, OF, OFaug and DF stand for the type of input features (original, augmented original and deep features respectively), and #train/test stands for the number of training and test examples. We observe that our attribute learning framework allows us to learn more discriminative image representations compared to original, augmented original and deep features while outperforming LC-KSVD, the linear SVM and the best Gaussian SVM (whose parameter γ was tested in the range {10^{-9}, 10^{-8}, …, 10^{8}, 10^{9}}).

Method                           Din   Dout   #train/test   Accuracy (%)
OF + linear SVM                  225   -      30            73.90 ± 6.98
OF + LC-KSVD                     225   15     30            80.47 ± 7.67
OF + attributes + linear SVM     225   25     30            81.60 ± 7.08
OF + Gaussian SVM                225   -      30            75.05 ± 8.71
OFaug + linear SVM               225   -      120           69.17 ± 5.92
OFaug + LC-KSVD                  225   50     120           82.02 ± 5.92
OFaug + attributes + lin. SVM    225   25     120           89.20 ± 4.64
OFaug + Gaussian SVM             225   -      120           84.46 ± 5.29
DF + SVM linear kernel           640   -      30            80.97 ± 6.22
DF + LC-KSVD                     640   12     30            81.01 ± 6.68
DF + attributes + linear SVM     640   25     30            84.87 ± 5.65
DF + Gaussian SVM                640   -      30            81.20 ± 5.97

[Figure 2: plot of accuracy (%), y-axis from 60 to 90, for the settings OF, OF K=1, OF K=5, OF K=10, OF K=20, OF K=25, DF, DF K=1, DF K=10, DF K=15, DF K=25 and DF K=30.]

Fig. 2. This figure shows the evolution of the mean accuracy and standard deviation w.r.t. the number of attributes K, fixed for each type of

features at the beginning of experiments (on the SARShip database). Attributes learned on top of the original features (OF) are shown in blue

whereas attributes learned on top of deep features (DF) are shown in green. Baselines shown in light blue (resp. light green) correspond to

linear SVMs learned on top of original (resp. deep) features. We observe that learning attributes allows us to improve the accuracy compared

to the baselines, and the improvement is more noticeable with original features. In addition, these results illustrate the fact that a single attribute

is already enough to decently separate the two classes, even if learning multiple attributes can further improve the accuracy.

of classification in Figure 3); larger values of K enhance further the accuracy. Generally speaking, the number of

attributes used for a C-class problem should follow K ≥ C − 1. Here, on the SARShip dataset, we can observe that

K = C − 1 is already an appropriate value.

Finally, we fine-tuned the pre-trained Network-in-Network (see [24] for the exact model implementation) to our

Fig. 3. This figure shows the attributes corresponding to the test data of SARShip (learned on deep features with K = 1). Note that they

correspond to only one instance among the 100 splits used to produce the mean accuracy shown in table (I). The hyperplane separating the two

classes, learned on the training samples, is shown in black. We observe that only one attribute is already enough in order to separate almost all

the test data in these two classes.

binary classification task. For that purpose, we tried different tuning strategies: we varied the number of used pre-

trained layers as well as the number and the size of the new adaptation layers. We also tested different regularization

methods including dropout and batch normalization. According to these experiments, these strategies produce very

small empirical errors on the training sets; however, the generalization power on the test sets is far from the values

obtained in the tables above. This clearly illustrates the difficulty in tuning the huge number of parameters of these

networks when training data are very scarce. In contrast, our method has far fewer parameters to tune and learns

common representations (across categories); hence, it makes better use of the few available training data.

B. Multi-class Ship Recognition

We also evaluate our attribute learning method on multi-class ship classification on the Ship12 dataset. The latter

includes 240 (optical) satellite image snapshots belonging to 12 categories (with 20 image snapshots per category)

including dhow, catamaran, cruise ship, military vessel, fishing boat, barge, tanker, container ship, tugboat, yacht,

sail boat and bulk carrier. These images, which exhibit strong variations in shape and resolution, were hand-picked from

various Google Maps images taken over large harbors such as Rotterdam, Miami and Marseilles, and then labeled. We

resize each image in Ship12 to 32×32 pixels and concatenate all the pixels – in the 3 (RGB) channels – in order to

form a 3072-dimensional original feature vector (OF). We obtain the augmented original features OFaug by applying

horizontal and vertical symmetries to the original features following the protocol described in section IV-A1. Finally

we use these resized images to extract deep features as also described earlier in section IV-A1.

Following the same protocol as for the SAR image classification problem, we randomly10 split the Ship12 dataset

into two parts: a training set of 120 images (with 10 per class) and a test set of 120 images, except for OFaug where

each subset contains 480 images.

1) Comparison: similarly to the binary ship classification problem, we compare our framework w.r.t. different

related algorithms detailed earlier in section I-A including the LC-KSVD [18], random attributes [9], linear and

Gaussian SVMs and fine-tuned deep networks.

10 We consider 10 random splits in order to measure mean accuracy and standard deviation.

Fig. 4. This figure shows a sample of images from the Ship12 dataset. We can see a high diversity in resolution, textures and colors among

these examples belonging to different classes.

TABLE II

Mean accuracy and standard deviation obtained with 10 runs (corresponding to 10 splits of the Ship12 dataset). Din (resp. Dout) stands for the dimension of the input (resp. output learned) features, OF, OFaug and DF stand for the type of input features (original, augmented original and deep features respectively), and #train/test stands for the number of training and test examples. Note that random attributes have now been incorporated as they can be applied to multi-class problems such as the Ship12 classification. We observe that our attribute learning framework allows us to learn more discriminative image representations compared to the original, augmented original and deep features while outperforming LC-KSVD and random attributes, and matching or outperforming the best Gaussian SVM, whose parameter γ was tested in the range {10^{-9}, 10^{-8}, …, 10^{8}, 10^{9}} (on OFaug the Gaussian SVM is marginally ahead, 73.92 vs. 73.81). The results also illustrate the fact that deep features are more discriminative than the original ones.

Method                        Din     Dout   #train/test   Accuracy (%)
OF + SVM linear kernel        3072    -      120           68.00 ± 2.58
OF + LC-KSVD                  3072    80     120           69.55 ± 4.54
OF + random attributes        3072    50     120           56.42 ± 4.14
OF + attributes + lin. SVM    3072    55     120           73.92 ± 2.94
OF + SVM Gaussian kernel      3072    -      120           72.41 ± 3.30
OFaug + SVM linear kernel     3072    -      480           63.45 ± 2.94
OFaug + LC-KSVD               3072    125    480           61.35 ± 1.03
OFaug + random attributes     3072    100    480           54.45 ± 3.18
OFaug + att. + lin. SVM       3072    120    480           73.81 ± 2.65
OFaug + Gaussian SVM          3072    -      480           73.92 ± 2.84
DF + SVM linear kernel        24576   -      120           74.50 ± 3.02
DF + LC-KSVD                  24576   85     120           78.08 ± 3.22
DF + random attributes        24576   120    120           63.58 ± 5.20
DF + attributes + lin. SVM    24576   100    120           81.17 ± 2.72
DF + SVM Gaussian kernel      24576   -      120           79.40 ± 1.87

Results given in table (II) show that our attribute learning framework outperforms LC-KSVD, random attributes

and linear SVM when applied to all types of features, demonstrating its relevance. We also observe that methods

handling nonlinearities in data representation – such as our framework and Gaussian SVM – outperform linear

classifiers on the Ship12 task. Figure 6 shows that the accuracy increases as the number of attributes K increases

until reaching a plateau. However, only a few attributes are actually necessary to achieve a high accuracy; this clearly

shows that the learned attribute-based representation should be low dimensional.

In these experiments, and in order to extract the deep features, we use the first pooling layer – referred to as pool1

– from the Network-in-Network as this layer provides the highest accuracy when combined with a linear SVM;

note that this choice leads to very sparse feature vectors (only 11.75% of non-zeros on average). Figure 5 compares

accuracies obtained by different layers and shows that very deep layers are not discriminative enough for our task

as they are too specific to the original task, whereas layers that are not sufficiently deep are not discriminative

enough. As expected, we also observe that these deep features are more discriminative than the original input

features. Moreover, learning attributes on top of these deep features still allows us to increase the final accuracy.

Projections of deep features and attributes (learned on top of these deep features) are shown in figure (7), using the

t-SNE algorithm [48]; here, the same color refers to the same class. We observe from these projections that the learned

attribute-based representations are packed into separate classes and this clearly illustrates their discrimination power

against the original deep features. Thus, this task-oriented attribute learning framework allows us to obtain more

suitable image representations in spite of having scarce training data.

Fig. 5. This figure shows the classification accuracy on Ship12 for a linear SVM built on top of deep features extracted from various layers of

Network-in-Network (pre-trained on CIFAR-10). Depth of the layers increases from left to right; cccp, conv and pool stand for Cascaded Cross

Channel Parametric Pooling, Convolution and Pooling respectively.

Fig. 6. This figure shows the evolution of the mean accuracy and standard deviation w.r.t. the number of attributes K, fixed for each type of

features at the beginning of experiments (on the Ship12 database). Attributes learned on top of the original features (OF) are shown in blue

whereas attributes learned on top of deep features (DF) are shown in green. Baselines shown in light blue (resp. light green) correspond to

linear SVMs learned on top of original (resp. deep) features. We observe that learning attributes allows us to improve the accuracy compared

to the baselines, and improvement is noticeable for both original and deep features.

Fig. 7. This figure shows the t-SNE projections of deep features (left) and attributes learned on top of them (right) taken from one random

split of the Ship12 dataset. Each color corresponds to a class; training and test examples are represented by triangles and circles respectively.

We observe that attributes are more discriminative compared to the initial features as they allow us to gather examples belonging to the same

classes.

C. Computational efficiency

We also compare the computational efficiency of our framework (during training and testing) against LC-KSVD

on SARShip and Ship12 datasets. We consider the Matlab implementation of LC-KSVD provided by the authors11 and we compare it against the Matlab implementation of our method. The speedup factor shown in Table III is

defined as (time spent by LC-KSVD) / (time spent by our framework) and it is obtained on a standard Intel® Core i5-4690K 3.50 GHz CPU. These results

correspond to the average gain obtained with 100 (resp. 10) runs on SARShip (resp. Ship12), using deep features

and the experimental parameters shown in sections IV-A2 and IV-B1.
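
For readers who wish to reproduce this kind of measurement, a speedup factor can be computed from averaged wall-clock times as sketched below. The helper names `timed` and `speedup_factor` are hypothetical, and the snippet does not reproduce the Matlab timings reported in this paper.

```python
import time

def timed(fn, *args, repeats=5):
    """Average wall-clock time (in seconds) of fn(*args) over a few repeats."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats

def speedup_factor(t_baseline, t_ours):
    """Baseline time divided by our time; a value above 1 means we are faster."""
    return t_baseline / t_ours
```

A factor below 1, such as the training value on SARShip, therefore indicates that the baseline is slightly faster in that setting.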

TABLE III

Training and test speedup factors on SARShip and Ship12. We observe that our framework is as efficient as or faster than LC-KSVD during training and clearly faster than LC-KSVD during testing (speedup factor ≥ 1).

Phase      Dataset   Speedup factor
Training   SARShip   0.91
Training   Ship12    2.65
Test       SARShip   8.36
Test       Ship12    4.28

On small scale training problems, we observe that LC-KSVD and our framework have comparable computational

efficiency during training while on relatively larger sets (Ship12) our framework is more efficient. During testing,

our framework clearly outperforms LC-KSVD on the two datasets. Note that the relative slowness of LC-KSVD

comes essentially from a coding step that requires solving a costly sparse dictionary decomposition problem for

each training and test sample.

11 www.umiacs.umd.edu/~zhuolin/projectlcksvd.html

D. Model analysis and settings

In this section, we study and discuss the behavior of our model w.r.t. its parameters, using samples from SARShip

and Ship12 (respectively described by original and deep features).

We have shown in figures (2) and (6) that the accuracy saturates as the number of attributes to learn (i.e., K) grows,

so K can safely be overestimated. As previously mentioned in section I-B, these K attributes are designed to

be discriminative without using any selection or combination heuristics.

Now we study the impact of the parameter Q1 (involved in equation (2)) that controls the discrimination power

of the learned attributes. Indeed, the accuracy of the trained framework is sensitive to the setting of this parameter;

figure (8) shows the accuracy on SARShip and Ship12 w.r.t. Q1. Choosing the best Q1 value seems to be more

critical for multi-class tasks (i.e. on Ship12) than binary classification; finding a common Q1 that jointly optimizes

the accuracy of C hyperplanes (i.e., through all the C categories) is clearly more challenging than finding Q1 that

optimizes the accuracy of a single hyperplane (i.e. that separates two categories).

Fig. 8. This figure shows the impact of the parameter Q1 on the classification accuracy on the SARShip (with original features) and the Ship12

(with deep features) datasets. We observe that both problems are sensitive to the value of Q1. The setting of the latter has a particular influence

on the accuracy of the multi-class problem in Ship12.

Finally, we study the impact of the parameters that control the nonlinear attribute regressions. The latter depend

on two parameters: the relaxation weight Q2 and the size ε of the “tube” around the regression functions. We also

study the behavior of the accuracy w.r.t. the kernel used for nonlinear regression; in practice, we use the triangular

kernel κ(x, x′) = −‖x − x′‖^γ, 0 < γ < 2, which has the appealing property of producing scale invariant regression

functions [40], when Q2 is large enough12. Considering this kernel, we also study the impact of the

kernel parameter γ on the accuracy.
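
The triangular kernel is straightforward to compute as a Gram matrix, which can then be supplied to any SVR solver that accepts precomputed kernels. The NumPy sketch below is illustrative; the function name `triangular_kernel` and the row-per-sample layout are assumptions. The last lines check numerically that rescaling the inputs by a factor s multiplies every kernel value by s^γ, which is the homogeneity behind the scale invariance discussed in [40].

```python
import numpy as np

def triangular_kernel(A, B, gamma=1.0):
    """Gram matrix with entries -||a - b||^gamma for rows a of A and b of B."""
    if not 0.0 < gamma < 2.0:
        raise ValueError("gamma must lie in (0, 2)")
    sq = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)                       # squared Euclidean distances
    dist = np.sqrt(np.maximum(sq, 0.0))          # clip tiny negatives from round-off
    return -(dist ** gamma)

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 4))
K1 = triangular_kernel(A, A, gamma=1.5)
K2 = triangular_kernel(3.0 * A, 3.0 * A, gamma=1.5)   # inputs rescaled by s = 3
```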

In our experiments, ε should be set to very small values in order to learn regression functions that fit accurately

the learned attributes and this makes the setting of ε relatively easy; in practice, we set ε to 10^{-4} for both ship

classification tasks. Setting Q2 is also easy as a consequence of the scale invariance property of our learned regression

functions. Indeed, the value of Q2 should be overestimated in order to guarantee scale invariance (see again [40]);

12 This scale invariance property makes the setting of Q2 easy and common for all the K attributes (by overestimating its value; see [40]).

moreover, we observe that all these overestimated values provide very similar behaviors of the accuracy for both

classification tasks. In practice, we set Q2 to 10^{-1}.

The classification accuracy of our model strongly depends on the kernel parameter γ. Figure 9 shows the behavior

of the accuracy on the SARShip and Ship12 sets using original and deep features respectively (while all the other

parameters, namely K, Q1, Q2, Q3 and ε are fixed to 100, 10^{-2}, 10^{-1}, 10^{-4} and 10^{-4} respectively). On Ship12, we

observe an increase of the accuracy as γ increases until reaching a plateau while on SARShip the accuracy behaves

as a flat bell.

Fig. 9. This figure shows the classification accuracy on the SARShip and Ship12 datasets for different values of γ.

Finally, we study the impact of the parameter Q3 that controls the relaxation in equation (10), and the deviation

of the learned attributes w.r.t. the regression functions. Small values of this parameter result in a gap between the

learned attributes and their underlying regression functions, and thereby a drop in the discrimination power of these

attributes. Conversely, large values result in overfitting. We found that the best setting of Q3 belongs to a wide range

of values and in practice it is set to 10^{-4}.

In sum, all these discussed behaviors show that only two parameters, namely Q1 and γ, need to be carefully

tuned. Parameters Q2 and Q3 can be easily set whereas ε does not require any particular tuning. Thus, despite

having many parameters, our framework is still easy to tune.

V. CONCLUSION

We introduced in this paper a novel method for category recognition based on attribute learning. This method

consists in learning more accurate classifiers on top of mid-level characteristics (a.k.a. attributes) shared among

different categories and allows us to overcome the difficulties induced by scarce training data. We have shown that

our attribute learning framework outperforms attribute and dictionary based methods such as random attributes and

LC-KSVD on two remote-sensing image classification tasks, and can be applied to various types of image features.

As future work, we are currently investigating the extension of our method to semi-supervised (particularly

transductive) learning; indeed, one may exploit the structure of unlabeled data in order to further enhance the

accuracy of our attributes using large unlabeled datasets.

REFERENCES

[1] Tamara L. Berg, Alexander C. Berg, and Jonathan Shih. Automatic attribute discovery and characterization from noisy web data. In

Proceedings of the European Conference on Computer Vision, 2010.

[2] Alessandro Bergamo and Lorenzo Torresani. Meta-class features for large-scale object categorization on a budget. In Proceedings of the

IEEE Conference on Computer Vision and Pattern Recognition, 2012.

[3] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of

the Fifth Annual Workshop on Computational Learning Theory, New York, NY, USA, 1992. ACM.

[4] Mario Caetano. ESA training course on land remote sensing – image classification, 2009.

[5] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and

Technology, 2, 2011.

[6] Paula Craciun and Josiane Zerubia. Towards efficient simulation of marked point process models for boat extraction from high resolution

optical remotely sensed images. In Proceedings of the IEEE Geoscience and Remote Sensing Symposium, 2014.

[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of

the IEEE Conference on Computer Vision and Pattern Recognition, 2009.

[8] Kun Duan, Devi Parikh, David J. Crandall, and Kristen Grauman. Discovering localized attributes for fine-grained recognition. In

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3474–3481, 2012.

[9] Ali Farhadi, Ian Endres, Derek Hoiem, and David A. Forsyth. Describing objects by their attributes. In Proceedings of the IEEE Conference

on Computer Vision and Pattern Recognition, pages 1778–1785, 2009.

[10] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. http://www.deeplearningbook.org, 2016.

[11] Harm Greidanus and Naouma Kourti. Findings of the DECLIMS project – detection and classification of marine traffic from space. In

SEASAR 2006: Advances in SAR Oceanography from ENVISAT and ERS Missions, 2006.

[12] Yuchen Guo, Guiguang Ding, Xiaoming Jin, and Jianmin Wang. Learning predictable and discriminative attributes for visual recognition.

In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3783–3789, 2015.

[13] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar):1157–1182, 2003.

[14] Isabelle Guyon and André Elisseeff. An introduction to feature extraction. In Feature extraction, pages 1–25. Springer, 2006.

[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. CoRR, abs/1603.05027, 2016.

[16] Hyungtae Lee and Heesung Kwon. Contextual deep CNN based hyperspectral classification. In Proceedings of the IEEE International

Geoscience and Remote Sensing Symposium, 2016.

[17] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR,

abs/1502.03167, 2015.

[18] Zhuolin Jiang, Zhe Lin, and Larry S. Davis. Learning a discriminative dictionary for sparse coding via label consistent K-SVD. In

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1697–1704, 2011.

[19] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

[20] Neeraj Kumar, Alexander C. Berg, Peter N. Belhumeur, and Shree K. Nayar. Attribute and simile classifiers for face verification. In

Proceedings of the International Conference on Computer Vision, 2009.

[21] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In Proceedings

of the IEEE Conference on Computer Vision and Pattern Recognition, pages 951–958, June 2009.

[22] H. Lang, J. Zhang, X. Zhang, and J. Meng. Ship classification in SAR image by joint feature and classifier selection. IEEE Geoscience and

Remote Sensing Letters, 13(2):212–216, 2016.

[23] Thomas Martin Lillesand, Ralph W. Kiefer, and Jonathan Chipman. Remote Sensing and Image Interpretation, 7th Edition. John Wiley

& Sons, 2015.

[24] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. CoRR, abs/1312.4400, 2013.

[25] Y. Liu, L. Yao, W. Xiong, and Z. Zhou. Fusion detection of ship targets in low resolution multi-spectral images. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, pages 6545–6548, July 2016.

[26] M. Gianinetto, M. Aiello, A. Marchesi, F. Topputo, M. Massari, R. Lombardi, F. Banda, and S. Tebaldini. OBIA ship detection with multispectral and SAR images: A simulation for Copernicus security applications. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, 2016.

[27] Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro, and Andrew Zisserman. Discriminative learned dictionaries for local image analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2008.

[28] Julien Mairal, Jean Ponce, Guillermo Sapiro, Andrew Zisserman, and Francis R. Bach. Supervised dictionary learning. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1033–1040. Curran Associates, Inc., 2009.

[29] A. Makedonas, C. Theoharatos, V. Tsagaris, V. Anastasopoulos, and S. Costicoglou. Vessel classification in COSMO-SkyMed SAR data using hierarchical feature selection. ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, pages 975–982, April 2015.

[30] G. Margarit, J. J. Mallorqui, and X. Fabregas. Single-pass polarimetric SAR interferometry for vessel classification. IEEE Transactions on Geoscience and Remote Sensing, 45(11):3494–3502, 2007.

[31] G. Margarit and A. Tabasco. Ship classification in single-pol SAR images based on fuzzy logic. IEEE Transactions on Geoscience and Remote Sensing, 49(8):3129–3138, 2011.

[32] G. Máttyus. Near real-time automatic marine vessel detection on optical satellite images. In ISPRS Hannover Workshop, 2013.

[33] R. G. V. Meyer, W. Kleynhans, and C. P. Schwegmann. Small ships don’t shine: Classification of ocean vessels from low resolution, large swath area SAR acquisitions. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, pages 975–978, 2016.

[34] T. Moranduzzo and F. Melgani. A SIFT-SVM method for detecting cars in UAV images. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, 2012.

[35] Tonje Nanette Arnesen and Richard B. Olsen. Literature review on vessel detection. Technical report, Forsvarets Forskningsinstitutt, 2004.

[36] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1717–1724, June 2014.

[37] Devi Parikh and Kristen Grauman. Relative attributes. In IEEE International Conference on Computer Vision, pages 503–510, 2011.

[38] R. Pelich, N. Longépé, G. Mercier, G. Hajduch, and R. Garello. Performance evaluation of Sentinel-1 data in SAR ship detection. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, pages 2103–2106, July 2015.

[39] Mohammad Rastegari, Ali Farhadi, and David Forsyth. Attribute discovery via predictable discriminative binary codes. In Proceedings of the European Conference on Computer Vision, 2012.

[40] Hichem Sahbi and François Fleuret. Kernel methods and scale invariance using the triangular kernel. Research Report RR-5143, INRIA, 2004.

[41] Bernhard Schölkopf, Ralf Herbrich, and Alex J. Smola. A generalized representer theorem. In Proceedings of the Conference on Computational Learning Theory, London, UK, 2001. Springer-Verlag.

[42] C. P. Schwegmann, W. Kleynhans, B. P. Salmon, L. W. Mdakane, and R. G. V. Meyer. Very deep learning for ship discrimination in synthetic aperture radar imagery. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, 2016.

[43] Alex J. Smola and Bernhard Schölkopf. A tutorial on support vector regression. Statistics and Computing, 14(3), 2004.

[44] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.

[45] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. CoRR, abs/1505.00387, 2015.

[46] Mattia Stasolla, Carlos Santamaria, Jordi J. Mallorqui, Gerard Margarit, and Nick Walker. Automatic ship detection in SAR satellite images: Performance assessment. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, 2015.

[47] Jiexiong Tang, Chenwei Deng, Guang-Bin Huang, and Baojun Zhao. Compressed-domain ship detection on spaceborne optical image using deep neural network and extreme learning machine. IEEE Transactions on Geoscience and Remote Sensing, 53(3):1174–1185, 2015.

[48] Laurens van der Maaten and Geoffrey E. Hinton. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.

[49] S. Wang, M. Wang, S. Yang, and L. Jiao. New hierarchical saliency filtering for fast ship detection in high-resolution SAR images. IEEE Transactions on Geoscience and Remote Sensing, 55(1):351–362, 2017.

[50] X. Xing, K. Ji, H. Zou, W. Chen, and J. Sun. Ship classification in TerraSAR-X images with feature space based sparse representation. IEEE Geoscience and Remote Sensing Letters, 10(6):1562–1566, 2013.

[51] Feng Yang, Qizhi Xu, Feng Gao, and Lei Hu. Ship detection from optical satellite images based on visual search mechanism. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, 2015.

[52] Guang Yang, Bo Li, Shufan Ji, Feng Gao, and Qizhi Xu. Ship detection from optical satellite images based on sea surface analysis. IEEE Geoscience and Remote Sensing Letters, 11(3):641–645, 2014.

[53] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3320–3328. Curran Associates, Inc., 2014.

[54] Felix Yu, Liangliang Cao, Rogerio Feris, John Smith, and Shih-Fu Chang. Designing category-level attributes for discriminative visual recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013.

[55] H. Zhang, X. Tian, C. Wang, F. Wu, and B. Zhang. Merchant vessel classification based on scattering component analysis for COSMO-SkyMed SAR images. IEEE Geoscience and Remote Sensing Letters, 10(6):1275–1279, 2013.

[56] Hanwang Zhang, Zheng-Jun Zha, Yang Yang, Shuicheng Yan, Yue Gao, and Tat-Seng Chua. Attribute-augmented semantic hierarchy: Towards bridging semantic gap and intention gap in image retrieval. In Proceedings of the ACM International Conference on Multimedia, 2013.

[57] Qiang Zhang and Baoxin Li. Discriminative K-SVD for dictionary learning in face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2691–2698, 2010.

[58] Wei Zhang, Akshat Surve, Xiaoli Fern, and Thomas Dietterich. Learning non-redundant codebooks for classifying complex objects. In Proceedings of the International Conference on Machine Learning, ICML ’09, pages 1241–1248, 2009.