
Nazr-CNN: Object Detection and Fine-Grained Classification in Crowdsourced UAV Images

N. Attari, F. Ofli, M. Awad, J. Lucas, and S. Chawla
Qatar Computing Research Institute

Hamad bin Khalifa University
{nattari, fofli, mawad, jlucas, schawla}@qf.org.qa

ABSTRACT

We propose Nazr-CNN¹, a deep learning pipeline for object detection and fine-grained classification in images acquired from Unmanned Aerial Vehicles (UAVs). The UAVs were deployed in Vanuatu to assess damage in the aftermath of Cyclone Pam in 2015. The images were labeled by a crowdsourcing effort, and the labeling categories consisted of fine-grained levels of damage to built structures.

Nazr-CNN consists of two components. The function of the first component is to localize objects (e.g., houses) in an image by carrying out a pixel-level classification. In the second component, a hidden layer of a Convolutional Neural Network (CNN) is used to encode Fisher Vectors (FV) of the segments generated by the first component in order to help discriminate between different levels of damage. Since our data set is relatively small, a pre-trained network is used for pixel-level classification and FV encoding. Nazr-CNN attains promising results both for object detection and for damage assessment, suggesting that the integrated pipeline is robust in the face of small data sets and labeling errors by annotators. While the focus of Nazr-CNN is on the assessment of UAV images in a post-disaster scenario, our solution is general and can be applied in many diverse settings.

1. INTRODUCTION

Unmanned Aerial Vehicles (UAVs) are now being increasingly used in humanitarian efforts. UAVs provide humanitarians with a “bird’s eye” view of the disaster-affected areas which need immediate help. The images can be used to assess the overall level of damage inflicted on the distressed areas and to guide the allocation of limited relief resources by humanitarian organizations.

Both the United States Federal Emergency Management Agency (FEMA) and the European Commission’s Joint Research Center (JRC) have noted that aerial imagery will play an important role in disaster response and will present a big data challenge [34]. The World Bank, for example, took the initiative to cooperate with the Humanitarian UAV Network (UAViators)² in the wake of Cyclone Pam, a category five cyclone that caused extensive damage in Vanuatu in March 2015. The annotation and labeling of the images was crowdsourced to a group of volunteers who used MicroMappers to create masks around areas of built structure and to assess the severity of the damage (mild, medium, and severe).

¹ Nazr means “sight” in Arabic.
² http://uaviators.org/

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

WWW’17, April 3-7, 2017, Perth, Australia. © 2016 ACM. ISBN 978-1-4503-2138-9. DOI: 10.1145/1235

Figure 1: Disaster images at different angles and elevations. Left: images from UAV. Right: crowd labeling. Notice the difficulty of correctly annotating the images and thus generating accurate ground truth.

Analyzing the large volumes of high-resolution aerial images generated after a major disaster remains a challenging task, in contrast to the ease of acquiring them at low operational cost. A popular approach is a hybrid strategy in which a crowdsourcing effort first creates a labeled training set, which is then used to infer a machine learning model [20]. The ML model then automatically classifies incoming images. Until now, however, the image classification task (unlike text classification) has suffered from low accuracy and a lack of robustness in uncontrolled environments.

Before we describe our solution, we enumerate some of the challenges that must be overcome for object detection in images acquired from UAVs:

1. As a UAV flies at different heights and orientations, the images are captured at varying scales. Similarly, because of the orientation of the UAV, the background is often highly heterogeneous: ocean, sky, forest, grassland. To the best of our knowledge, standard deep learning pipelines have not been tested on such a diverse but related set of images. In Figure 1, we show a few example images from UAVs in Vanuatu which confirm not only the difficulty of the object detection task but also the challenges in annotation. In the top image, the shack is almost indistinguishable from the land both in terms of texture and color. The annotations are also potentially highly subjective. For example, all the houses in the bottom image are labeled as mild damage except one labeled as medium damage, for which the amount of damage is extremely hard to infer due to the high altitude of the UAV.

2. The amount of labeled imagery data is limited and the labels are often noisy, as there is disagreement between the volunteers on the severity of damage. In particular, there is a significant overlap between houses whose damage levels are rated medium or severe.

3. A key success of deep learning is that “feature engineering” is part of the learning process. However, we have observed that, at least in our data set, distinguishing images based on texture is an important aspect of the problem, and a straightforward application of CNNs is unlikely to result in fine-grained object classification. Furthermore, one of the strengths of CNNs, namely that they tend to be spatially invariant, is precisely the weakness that needs to be overcome, since object localization (for example, where a house is located, not just whether an image contains a house) has to be an important part of the solution.

Solution in a Nutshell: Our solution (Nazr-CNN) combines two deep learning pipelines. The first pipeline carries out a pixel-level segmentation (often known as semantic segmentation) to localize objects (segments) in images. The localized objects are then passed through a pre-trained CNN, and a Fisher Vector encoding is extracted from the last convolutional layer to generate a new representation of the objects. The Fisher Vectors are then classified using a standard SVM. This results in a highly accurate detection and fine-grained discrimination of objects based on levels of damage.

The rest of the paper is structured as follows. In Section 2 we precisely define and state the problem of object detection and classification. In Section 3, we introduce the building blocks of Nazr-CNN with a particular focus on the use of Fisher Vector encoding. In Section 4, we describe the experimental set-up and results. We survey the related literature in Section 5, and we summarize the paper and set directions for future work in Section 6.

2. PROBLEM STATEMENT

We now define the problem of object detection and classification in UAV images for damage assessment.

Given: A set of images obtained from UAVs over a disaster assessment area and labeled using a crowdsourcing platform. Built structures (e.g., houses) are labeled as mildly damaged (M), medium damage (Md), or severely damaged (S).

Design: A machine learning classifier which takes as input an unlabeled image and classifies regions in the image as background (B) or as containing structures which can be classified as M, Md, or S.

Constraints: (i) Images are taken from different heights and varying orientations of the UAV; (ii) the size of the data set is relatively small; and (iii) there is disagreement amongst the annotators on the class labels.

3. DEEP LEARNING FRAMEWORK

Since our labeled data set is relatively small (3,096 images), we have built our deep learning pipeline (Nazr-CNN) using existing pre-trained networks as building blocks. Our proposed pipeline integrates pixel-level classification (often known as semantic image segmentation) and texture discrimination. For semantic segmentation we use DeepLab, a CNN followed by a fully-connected Conditional Random Field (CRF) for smoothing [5]. For texture discrimination we use FV-CNN [6], which extracts Fisher Vectors from a hidden layer of a pre-trained CNN. Intuitively, the aim of semantic segmentation is to localize objects (built structures) in an image, and the aim of texture discrimination is to distinguish between different types of damage and background. For a comprehensive background on CNNs, we refer the reader to the upcoming book by the pioneers of the field [15].

3.1 Pre-Trained Networks

In practice, CNNs are rarely trained from scratch for new data sets because it is relatively rare to have access to large data sets, which is indeed the case here. A popular choice of pre-trained network is VGG-16, which is trained on the ImageNet data set containing over 1.2 million images and 1,000 categories (labels). The VGG-16 network consists of 16 layers and over 130 million weight parameters [40]. There are two ways that a pre-trained network can be used on a new data set. The first is to pass each data point from the new data set through the network and use the layers as feature extractors, so that each data point is mapped into a new representation. Lower layers of VGG-16 can be considered low-level feature extractors which should be applicable across domains, while higher layers tend to be more domain-specific. The second approach is to use the existing weights of the pre-trained network as an initialization and fine-tune them on the new data set. This approach can often prevent overfitting but is computationally expensive.
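The following is a minimal sketch of these two options in PyTorch/torchvision. It is purely illustrative: the paper's own experiments used VGG-16 and VGG-M models through the Caffe and MatConvNet frameworks, and the four-class head and learning rate below are placeholder choices rather than the paper's settings.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Option 1: fixed feature extractor -- freeze the pre-trained weights and read
# activations from the convolutional layers.
vgg = models.vgg16(pretrained=True)
for p in vgg.parameters():
    p.requires_grad = False            # no gradient updates; layers act as fixed feature maps

features = vgg.features(torch.randn(1, 3, 224, 224))   # (1, 512, 7, 7) feature map

# Option 2: fine-tuning -- use the pre-trained weights as initialization, replace
# the classifier head with one matching the new label set (here 4 placeholder
# classes: background, mild, medium, severe), and train with a small learning rate.
vgg_ft = models.vgg16(pretrained=True)
vgg_ft.classifier[6] = nn.Linear(4096, 4)
optimizer = torch.optim.SGD(vgg_ft.parameters(), lr=1e-4, momentum=0.9)
```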

3.2 Semantic Segmentation

We use the DeepLab system to carry out pixel-level segmentation of images (often known as semantic segmentation) [5]. This promotes the localization of objects in images. DeepLab combines the VGG-16 network with a fully-connected Conditional Random Field (CRF) model on the output of the final layer of the CNN. The CRF overcomes the poor localization property of CNNs and results in better segmentation. The CRF optimizes the following energy function:

E(x) = \sum_{i} g_i(x_i) + \sum_{i,j} h_{ij}(x_i, x_j)    (1)

where x is the pixel-level label assignment and g_i(x_i) = -\log P(x_i), with P(x_i) the label-assignment probability at pixel i computed by the CNN. The pairwise potential is given by

h_{ij}(x_i, x_j) = \mu(x_i, x_j) \sum_{m=1}^{K} w_m k_m(f_i, f_j)

Here, \mu is the binary Potts model function, k_m is a Gaussian kernel that depends on the features f extracted at the pixel level, and w_m is its weight. The CRF is fully connected, i.e., there is one pairwise term for each pair of pixels irrespective of whether they are neighbors or not. The reason DeepLab uses a fully connected CRF is that the objective is to extract local structure (shape) from the pixels and not just carry out a local smoothing (which might smooth out the local shapes). Pixel-level classification is an integral part of Nazr-CNN. Besides identifying the built structures in UAV images and their shapes, pixel-level classification, as we will see in the Experiments section, serves as an automatic data cleaning step: it is robust against annotation errors of both kinds, i.e., it can identify segments which were missed by the annotators as well as fix labeling errors.
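To make the energy in Equation (1) concrete, the following is a small NumPy sketch (our own, not DeepLab's implementation) that evaluates the unary and fully-connected pairwise terms for a handful of pixels, using a single Gaussian kernel and the Potts compatibility. Real systems such as DeepLab minimize this energy with an efficient mean-field approximation rather than evaluating all pairs explicitly.

```python
import numpy as np

def crf_energy(labels, unary_probs, feats, w=1.0, theta=1.0):
    """labels:      (N,) integer label per pixel
       unary_probs: (N, L) per-pixel label probabilities P(x_i) from the CNN
       feats:       (N, D) per-pixel features f_i (e.g., position and color)"""
    n = labels.shape[0]
    # unary term: sum_i g_i(x_i) = sum_i -log P(x_i)
    unary = -np.log(unary_probs[np.arange(n), labels]).sum()

    # pairwise term over *all* pixel pairs (fully connected):
    # h_ij = mu(x_i, x_j) * w * exp(-||f_i - f_j||^2 / (2 * theta^2))
    pairwise = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] != labels[j]:              # Potts model: penalize only disagreements
                d2 = np.sum((feats[i] - feats[j]) ** 2)
                pairwise += w * np.exp(-d2 / (2 * theta ** 2))
    return unary + pairwise

# Toy usage: 4 pixels, 2 labels; nearby pixels with different labels raise the energy.
probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.2, 0.8]])
feats = np.array([[0.0], [0.1], [1.0], [1.1]])
print(crf_energy(np.array([0, 0, 1, 1]), probs, feats))
```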

3.3 Fisher Vector Encoding

We use FV-CNN to extract features which help in distinguishing between different levels of damage [6]. In particular, we explain why the use of Fisher Vectors as a representation of the input can help distinguish between different levels of damage. Since our data set is of modest size (3,096 labeled images), we follow the usual practice and use a Fisher-Vector CNN (FV-CNN) with the VGG-M architecture pre-trained on the ImageNet ILSVRC 2012 data set.

Fisher Vectors (FV) are a generalization of the popular Bag-of-Visual-Words (BoV) representation of images and are known to result in a substantial increase in accuracy for image classification tasks [37]. As we will show, FVs extracted from high-level CNN features are particularly useful for distinguishing different types of building damage.

We assume that an appropriate layer of the CNN generates a set of features X = \{x_i, i = 1, \ldots, N\}, where each x_i \in R^D. Further, we assume that the set X is generated from a Gaussian Mixture Model (GMM). Thus, for each x \in X,

P(x \mid \lambda) = \sum_{i=1}^{K} w_i \, N(x; \mu_i, \Sigma_i)    (2)

and

\forall i: \; w_i \geq 0, \qquad \sum_{i=1}^{K} w_i = 1    (3)

Here N(x; \mu, \Sigma) is a multi-dimensional Normal (Gaussian) distribution. In order to avoid enforcing the constraints on the weights w, a re-parametrization is carried out such that

w_i = \frac{\exp(\alpha_i)}{\sum_{j=1}^{K} \exp(\alpha_j)}    (4)

Now the FV encoding of an x \in X consists of the gradients of \log P(x \mid \lambda) with respect to \lambda = \{\alpha_j, \mu_j, \Sigma_j, \; j = 1, \ldots, K\}. The covariance matrix \Sigma_j is assumed to be diagonal. Thus

\nabla_{\alpha_j} \log P(x) = \gamma(j) - w_j    (5)

\nabla_{\mu_j} \log P(x) = \gamma(j) \, \frac{x - \mu_j}{\sigma_j^2}    (6)

\nabla_{\sigma_j} \log P(x) = \gamma(j) \left[ \frac{(x - \mu_j)^2}{\sigma_j^3} - \frac{1}{\sigma_j} \right]    (7)

where the responsibility \gamma(j) (the posterior probability that x belongs to N(\mu_j, \Sigma_j)) is given by

\gamma(j) = \frac{w_j \, N(x; \mu_j, \Sigma_j)}{\sum_{i=1}^{K} w_i \, N(x; \mu_i, \Sigma_i)}    (8)

Often, further normalization is applied to Equations 5, 6, and 7 to arrive at the final Fisher Vectors. Further details are provided in [37].

Example: We now give a simple example to show why FV encoding results in more discriminative features than the BoV model. Consider the example shown in Figure 2. Assume we are given two-dimensional descriptors of two images (red and blue in the figure). In the BoV model, the descriptors are clustered using the k-means algorithm and each centroid is used as the representation of a visual word. In the example there are two visual words. Each image is then represented as a histogram of counts associated with the visual words. For example, the histogram of the blue image is (6, 6), as six descriptors of the blue image are associated with the first visual word and six with the second visual word.

Figure 2: Bag of Visual Words (BoV) vs. Fisher Vector (FV) representation. Top: two-dimensional descriptors of two images and the k-means centroids (visual words). Bottom: the Fisher Vectors of the two images, plotted per descriptor dimension.

In contrast, the Fisher Vector of an image has a dimensionality equal to the number of parameters of the GMM. For example, the bottom plot in Figure 2 shows a parallel plot of eight of the ten features of each image. The Fisher Vector is more sensitive to local variations, as the value at each descriptor represents the deviation from the GMM model. This is particularly suitable for texture discrimination, where variation within a region, rather than the shape of the entity, is the important cue.
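The sketch below (our own NumPy/scikit-learn illustration, not the MatConvNet implementation used in the paper) encodes a set of local descriptors as a Fisher Vector by evaluating the gradients in Equations 5-7 under a diagonal-covariance GMM; the final step is a simple L2 normalization rather than the full recipe of Sanchez et al. [37].

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(X, gmm):
    """X: (N, D) local descriptors; gmm: fitted GaussianMixture with covariance_type='diag'."""
    N, _ = X.shape
    w, mu, var = gmm.weights_, gmm.means_, gmm.covariances_       # (K,), (K, D), (K, D)
    sigma = np.sqrt(var)
    gamma = gmm.predict_proba(X)                                   # (N, K) responsibilities, Eq. (8)

    # Gradients w.r.t. mixture weights, means and standard deviations (Eqs. 5-7),
    # averaged over the N descriptors of the image/segment.
    d_alpha = (gamma - w).mean(axis=0)                                               # (K,)
    d_mu = np.einsum('nk,nkd->kd', gamma, (X[:, None, :] - mu) / var) / N            # (K, D)
    d_sigma = np.einsum('nk,nkd->kd', gamma,
                        (X[:, None, :] - mu) ** 2 / sigma ** 3 - 1.0 / sigma) / N    # (K, D)

    fv = np.concatenate([d_alpha, d_mu.ravel(), d_sigma.ravel()])
    return fv / (np.linalg.norm(fv) + 1e-12)                       # simple L2 normalization

# Toy usage: 200 random 8-D descriptors, 4 Gaussians -> FV of length 4 + 2*4*8 = 68
X = np.random.randn(200, 8)
gmm = GaussianMixture(n_components=4, covariance_type='diag', random_state=0).fit(X)
print(fisher_vector(X, gmm).shape)
```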


Figure 3: The Nazr-CNN pipeline: a UAV image is passed through pixel-level segmentation (CNN + CRF), each resulting segment is Fisher-Vector encoded on the last CNN layer, and a one-vs.-all SVM classifier assigns it one of the labels {B, M, Md, S}. Nazr-CNN thus combines pixel-level classification with FV-CNN; the Fisher Vectors are classified with a multi-class SVM.

3.4 The Nazr-CNN Pipeline

Figure 3 shows the Nazr-CNN pipeline, which works as follows.

Training:

1. For pixel-level classification, the DeepLab system requires each image along with the masks created by the annotators.

2. The DeepLab system generates segments. Each segment is assigned a label based on the annotated mask which overlaps it the most.

3. The segment output from DeepLab is then fed into FV-CNN. For each segment, Fisher Vectors are generated from an intermediate hidden layer. Each Fisher Vector, together with the label of its segment, forms an element of the training set of the multi-class SVM, and an SVM model is induced from these pairs.

Testing:

1. An image (without annotation) is passed to DeepLab, which creates segments.

2. Each segment is fed into FV-CNN, which generates Fisher Vectors for the segment.

3. The Fisher Vector of each segment is classified by the SVM model and a label (B, M, Md, S) is generated.
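The following is a control-flow sketch of the test-time steps above. It is our own illustration: the callables stand in for the DeepLab, FV-CNN, and SVM components, which in the actual system live in separate frameworks (Caffe and MatConvNet), and all names are hypothetical.

```python
LABELS = ["B", "M", "Md", "S"]   # background, mild, medium, severe

def classify_image(image, segment_fn, fisher_fn, svm_model):
    """segment_fn: image -> list of segments           (DeepLab, step 1)
       fisher_fn:  (image, segment) -> Fisher Vector   (FV-CNN, step 2)
       svm_model:  trained one-vs.-rest SVM            (step 3)"""
    results = []
    for segment in segment_fn(image):
        fv = fisher_fn(image, segment)
        label = LABELS[int(svm_model.predict([fv])[0])]
        results.append((segment, label))
    return results
```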

4. EXPERIMENTS AND RESULTS

In this section, we report on the extensive set of experiments that we have carried out to assess the accuracy of Nazr-CNN. We begin by describing the data acquisition process used in these experiments. Then, we report the following three results: (i) the accuracy of using a CNN on the binary classification problem of detecting whether an image contains a built structure or not (we use this step to automatically select images which contain built structures); (ii) the accuracy of FV-CNN on pre-labeled segments in discriminating between severities of damage; and finally, (iii) the accuracy of Nazr-CNN, the combined pipeline of pixel-level semantic segmentation and Fisher Vector encoding using FV-CNN.

4.1 Data Description and Workflow

The UAV image data was acquired as part of an initiative by the World Bank in collaboration with the Humanitarian UAV Network during Cyclone Pam in Vanuatu in 2015. The workflow, shown in Figure 4, is analogous to that of AIDR (Artificial Intelligence for Disaster Response) [20] for text data: images were acquired through UAVs, and a group of digital volunteers used MicroMappers to annotate built structures found in the images. MicroMappers is a crowdsourcing platform built by QCRI in partnership with the United Nations and the Standby Task Force, specifically for crisis management. Statistics derived from the labeled images can then be used to allocate relief resources in the disaster-affected areas.

The entire image dataset for Vanuatu contains 3,096 images, with approximately 60% of the set containing no built structures. Each remaining image contains structures with one or more levels of damage: Mild (little to no damage), Medium, or Severe; everything else is considered Background. Each image is labeled by two or more annotators, and a global annotation is produced using a union with a majority voting scheme. This is an important step, as the global annotations serve as ground truth for evaluating the performance of the subsequent models and of the system as a whole. The ground truth is also quite noisy: instead of crisp boundary regions, what we get from crowd labeling are polygons. For this reason, the overall performance of the system is hampered to some extent. We now report on the result of the first experiment.
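As an illustration of the consensus step, the sketch below performs a per-pixel majority vote over several annotator masks. It is our own simplification under the assumption that each mask is a per-pixel label map; the exact union-plus-majority rule used with the MicroMappers annotations may differ.

```python
import numpy as np

def majority_vote(masks, n_labels=4):
    """masks: list of (H, W) integer label maps, one per annotator.
       Returns a single (H, W) consensus label map (ties go to the lower label id)."""
    stack = np.stack(masks)                                                   # (A, H, W)
    votes = np.stack([(stack == l).sum(axis=0) for l in range(n_labels)])     # (n_labels, H, W)
    return votes.argmax(axis=0)

# Toy usage: three 2x2 annotations with labels in {0: background, 1: mild, 2: medium, 3: severe}
a = np.array([[0, 1], [2, 3]])
b = np.array([[0, 1], [1, 3]])
c = np.array([[0, 2], [2, 3]])
print(majority_vote([a, b, c]))        # [[0 1] [2 3]]
```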

Figure 4: UAV image acquisition workflow for humanitarian relief. The workflow follows the approach for text classification used by AIDR [20]. However, annotation of images is substantially more difficult, and until now there did not exist an accurate and robust object detection and classification model.


4.2 Binary Classification

We first used a binary classification model to determine whether an image contains built structures or only background. The model is implemented in the Torch framework, and a model is trained from scratch using a two-stage JarretNet [21]. We compare the deep learning approach with a conventional SIFT-SVM [27] model using ten-fold cross-validation. The per-class average accuracy, shown in Table 1, indicates that the CNN outperforms the conventional approach. Since the model was built from scratch (and not initialized from a pre-trained network), we used several data augmentation techniques, including rotation, flipping, and taking multiple crops [24]. Finally, we obtained a set of around 1,200 images that contained built structures, and we verified the results manually.
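As an illustration, the snippet below shows the kind of augmentation pipeline described above, written with torchvision.transforms. The original implementation was in Torch and its exact parameters are not reported, so the angles and crop sizes here are placeholders.

```python
from torchvision import transforms

# Rotation, flipping, and random resized crops, applied on the fly during training.
train_augmentation = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(degrees=30),
    transforms.RandomResizedCrop(224, scale=(0.6, 1.0)),
    transforms.ToTensor(),
])
```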

Method     | Building (Train) | Building (Test) | Background (Train) | Background (Test)
CNN        | 99               | 94              | 99                 | 82
SIFT-SVM   | 88               | 90              | 85                 | 81

Table 1: Binary image classification using deep learning and SIFT-SVM (per-class average accuracy, %). We obtained around 1,200 images which contained built structures.

4.3 Evaluation of FV-CNN

In this section we evaluate the performance of FV-CNN [7] for carrying out fine-grained damage assessment, assuming that the ground-truth region is known, and leveraging texture features through Fisher encoding. FV-CNN uses a pre-trained VGG-M model. The hidden layer used yields 512-dimensional local feature vectors, which are pooled into an FV representation with a Gaussian Mixture Model (GMM) of size 64. The total resulting dimensionality is around 65K (64 + 2 × 64 × 512). We assume each component of the GMM has a diagonal covariance matrix. Finally, the region descriptors are classified using one-vs.-rest support vector machines with regularization constant C = 1.
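A short sketch of this classification head with scikit-learn, as our own stand-in for the MatConvNet/SVM setup used in the paper; train_fvs and train_labels are hypothetical arrays of segment Fisher Vectors (from the earlier encoding sketch) and their labels.

```python
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

K, D = 64, 512
fv_dim = K + 2 * K * D            # 65,600 dimensions, the "around 65K" quoted above
print(fv_dim)

# One-vs.-rest linear SVMs with C = 1 over the segment Fisher Vectors.
svm = OneVsRestClassifier(LinearSVC(C=1.0))
# svm.fit(train_fvs, train_labels)      # train_fvs: (n_segments, fv_dim), train_labels: (n_segments,)
# predictions = svm.predict(test_fvs)
```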

Images that lacked a labeling consensus were not taken into consideration. We thus worked with a set of 1,084 images, further divided into a training set of 976 images and a test set of 108 images. Each image has a varying number of segments, each belonging to one of the three categories of damage (mild, medium, severe) or to the background. We thus obtained a total of 2,979 segments, with background present in all images, structures with mild damage covering almost 66% of the data, and the rest distributed between the medium and severely damaged categories. Despite the severe imbalance of damage-type segments, FV-encoding-based classification gave accurate results across classes.

The confusion matrices in Tables 2 and 3 show that regions with background and mild damage are classified with high accuracy, while the medium and severely damaged classes have lower accuracy. It is important to note that there was substantial disagreement between the annotators about the level of damage suffered by built structures, and the accuracy was further reduced by the class-imbalance problem (see Figure 6). The overall average accuracy per segment was 80.2% for three classes of damage and 88.02% for two classes.

Some example images with their ground-truth annotations and FV-CNN predictions are shown in Figure 5. FV encoding is quite powerful in discriminating objects based on texture. However, the difficulty in resolving medium and severe forms of damage is clearly revealed. FV-CNN was implemented using the MatConvNet toolbox on a Tesla K20Xm GPU with 6GB of memory.

Figure 5: Region Classification with FV-CNN

GT \ Pred   | Mild | Medium | Severe | Background
Mild        | 84   | 13     | 3      | 0
Medium      | 17   | 69     | 12     | 2
Severe      | 8    | 23     | 67     | 2
Background  | 0    | 0      | 0      | 100

Table 2: Confusion matrix for three-level damage classification per segment (rows: ground truth; columns: predictions; %).

GT \ Pred   | Mild | Damage | Background
Mild        | 83   | 17     | 0
Damage      | 18   | 80     | 2
Background  | 0    | 0      | 100

Table 3: Confusion matrix for two-level damage classification per segment (rows: ground truth; columns: predictions; %).

4.4 Pixel-level Segmentation

While the performance of FV-CNN for discrimination based on texture is high, our aim was not only to recognize the type of damage but also to identify where the damage had occurred. For this purpose, we used a deep convolutional neural network (DCNN) with a CRF [5] to segment the images so that regions of potential damage can be recognized. The DeepLab system obtained high accuracy on the PASCAL VOC-2012 semantic image segmentation task, reaching 71.6% IOU accuracy, and thus became a natural choice for pixel-level segmentation. We used the DeepLab-LargeFOV model to perform per-pixel classification of the given images, generating segments belonging to one of the damage categories (or background). CRFs were used as a post-processing step to discern local shapes and overcome the noisy labeling of the annotators. We performed experiments with three classes and with two classes of damage. Additionally, to overcome the significant class imbalance, we modified the cross-entropy loss function using class weighting [2]. Table 4 shows the mean accuracy for our data for the different cases. Although the DeepLab paper suggests that mean IU is a better evaluation metric, given the quality of the annotations for our images (neither weak nor strong, but of intermediate quality), we use mean accuracy as the baseline for evaluating Nazr-CNN.
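The following is a small PyTorch sketch of class-weighted cross-entropy (the same idea as the weighting in [2]); the pixel counts and resulting weights are placeholders, since the actual weights used in our Caffe/DeepLab setup are not reported here.

```python
import torch
import torch.nn as nn

# Placeholder pixel counts per class: background, mild, medium, severe.
counts = torch.tensor([8e6, 2e6, 6e5, 3e5])
weights = counts.sum() / (len(counts) * counts)     # inverse-frequency class weights

criterion = nn.CrossEntropyLoss(weight=weights)     # rare classes contribute more to the loss

# logits: (batch, 4, H, W) network output; target: (batch, H, W) ground-truth label map
logits = torch.randn(2, 4, 32, 32, requires_grad=True)
target = torch.randint(0, 4, (2, 32, 32))
loss = criterion(logits, target)
loss.backward()
```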

Figure 6: The class distribution for the UAV Vanuatu data set.

For completeness, we note that the DeepLab system was implemented using the Caffe framework. As in the previous experiment, the same train-to-test split ratio was used, with a batch size of 4, on an NVIDIA K3100M GPU with 4GB of memory.

Damage Level | Mean Accuracy (%)
3-class      | 45.6
3-class*     | 49.3
2-class      | 48.5
2-class*     | 53.28

* indicates training with class weighting

Table 4: Semantic segmentation performance evaluation.

In Figure 7 we provide a few real examples which highlight that CNNs are good at learning shapes but poor at texture discrimination. Additionally, the poor performance is partly due to the noisy ground-truth annotations. An interesting observation is that the system was able to identify segment regions which were missed by the annotators. In one of the examples we also observed that roofs of different colors were assigned different damage severities even though neither was damaged. We have also observed that class weighting enhances the localization and discriminative capability of the DeepLab system.

In the next subsection, we describe Nazr-CNN, which combines the two approaches discussed above and leverages texture features.

4.5 Nazr-CNN

Nazr-CNN combines pixel-level segmentation and Fisher encoding. The accuracy of the combined system is shown in Table 5. The average accuracy of the DeepLab system (X) is shown in the first column. It is clear that class weighting improves the overall accuracy. Furthermore, when the medium and severe labels are combined, the resulting two-class damage classification problem shows an improvement (with the two damage classes there are three classes in total, including background). The second column shows the standalone accuracy of FV-CNN (Y). Here again, reduction to two damage classes improves the average accuracy from 83.6% to 90.1%. Thus, if we combine the two pipelines, the hypothetical best accuracy is the product of X and Y, shown in the third column. However, the combined Nazr-CNN pipeline does slightly better than the simple product of the two. This suggests that in Nazr-CNN the FV-CNN component is able to fix some of the errors incurred by DeepLab. We now give concrete examples which illustrate the error-fixing ability of FV-CNN.

Figure 7: Semantic Segmentation.
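As a concrete check of the hypothetical-best column (our own arithmetic on the values in Table 5), consider the 3-class setting:

X \times Y = 0.456 \times 0.836 \approx 0.381 \;\; (38.1\%), \qquad \text{while the integrated Nazr-CNN attains } 39.6\%.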

Figure 8 shows the detection and classification results on three distinct images. The first column is the actual image, the second column is the consensus crowdsourced annotation, the third column is the output of DeepLab, which is then piped into FV-CNN, whose output is shown in the fourth column. For the first image (from the top), DeepLab (the semantic segmentation step) correctly predicts one segment (the big one) but misses three segments. Since these segments are missed, FV-CNN can only operate on the segment discovered by DeepLab. In the second and third images, however, we notice an interesting phenomenon: while DeepLab is able to pick up a segment (due to class weighting), it gives it a wrong label (severe damage, green, is labeled as mild damage, red). However, since FV-CNN works with the raw intensities of the segment, it is able to correct the error and relabel it as severe. A similar observation holds for the third image. This explains why the accuracy of Nazr-CNN tends to be better than the standalone product of DeepLab and FV-CNN.

Figure 8: Semantic Segmentation with FV-CNN.

Damage Levels | DeepLab (X) | FV-CNN (Y) | Hypothesis (X×Y) | Nazr-CNN
3-classes     | 45.6        | 83.6       | 38.1             | 39.6
3-classes*    | 49.2        | 83.6       | 41.2             | 44.2
2-classes     | 48.5        | 90.1       | 43.7             | 47.05
2-classes*    | 53.2        | 90.1       | 48.0             | 52.1

* indicates training with class weighting

Table 5: Comparison of all the models.

5. RELATED WORK

Deep learning techniques now underpin many computer vision tasks. In particular, the best algorithms for image classification are now likely to be based on Convolutional Neural Networks (CNNs) [26, 25, 38, 41, 12, 42, 18]. CNNs (especially with the pooling layer) are designed to be translation invariant. However, one of the key strengths of CNNs, that they are designed to be invariant to spatial transformations, is precisely their weakness when it comes to object localization as required for damage assessment from UAVs. To overcome this weakness, Chen et al. [5] introduced the DeepLab system, which combines CNNs with Conditional Random Fields (CRFs). CRFs are particularly useful for capturing local interactions between neighboring pixels. In particular, DeepLab performs pixel-level classification, a task sometimes known as semantic image segmentation (SS) in the computer vision community. Early attempts at semantic image segmentation used a set of bounding boxes and masked regions as input to the CNN architecture to incorporate shape information into the classification process and to perform object localization and semantic segmentation [12, 16, 32, 13, 36]. Taking a slightly different approach, some studies employed segmentation algorithms independently on top of deep CNNs that were trained for dense image labeling [17, 10]. More direct approaches, on the other hand, aim to predict a class label for each pixel by applying deep CNNs to the whole image in a fully convolutional fashion [39, 9]. Similarly, [2] and [33] train end-to-end deep encoder-decoder architectures for multiclass pixelwise segmentation.

One of the key strengths of systems based on deep learning is that they automatically infer a representation of the data suitable for the defined task. For example, the lower layers of a deep network correspond to a representation suitable for low-level vision tasks while the higher layers are more domain specific [15], obviating the need for pre-defined feature engineering like SIFT [27]. Fisher Vectors (FV) are an important representation used in computer vision for object discrimination [37]. FVs generalize the Bag of Visual Words (BoV) model and are now often built on top of CNN hidden layers. In particular, FV-CNN [6] is a recent attempt to combine CNNs and Fisher Vectors for texture recognition and segmentation.

Aerial image analysis for detecting objects, classifying regions, and analyzing human behavior is an active research area. A recent overview is presented in Mather and Koch [28], which also mentions the use of damage assessment data sets (e.g., [1]) as benchmarks. Works which use texture for aerial imagery include [19].

Examples of recent work include Blanchart et al. [3], who use SVM-based active learning to analyze aerial images in a coarse-to-fine setting. Bruzzone and Prieto [4] is an example of a change-detection-based analysis technique. Zhang et al. [43] develop coding schemes for classifying aerial images by land use. Similarly, Hung et al. [19] tackle weed classification based on deep auto-encoders, while Feng et al. [11] address urban vegetation mapping using random forests with texture-based features. Oreifej et al. [35] recognize people in aerial images. Gleason et al. [14] and Moranduzzo and Melgani [31] detect cars in aerial images using kernel methods and support vector machines.

There are also studies that aim to produce a complete semantic segmentation of an aerial image into object classes such as building, road, tree, and water [8, 22, 23]. Some recent attempts apply deep CNNs to perform binary classification of the aerial image for a single object class [8, 30, 29]. These recent attempts to apply deep learning techniques to high-resolution aerial imagery have resulted in highly accurate object detectors and image classifiers, suggesting that automated aerial imagery analysis systems may be within reach.

Most of the aforementioned aerial image analysis methods, however, assume that the images are captured at a nadir angle by satellites with known ground resolution, and hence with a fixed viewpoint and scale for the objects in the scene. UAVs, in contrast, usually fly at variable altitudes and angles, and therefore capture oblique images with varying object sizes and appearances. In contrast to traditional aerial image analysis and computer vision paradigms, a new set of computer vision and machine learning approaches must therefore be developed for UAV imagery to account for such differences in the acquired image characteristics.

6. SUMMARY AND FUTURE WORK

In this paper we have proposed an integrated deep learning pipeline (Nazr-CNN) for identifying built structures (e.g., houses) in UAV images, followed by a fine-grained damage classification. The images were collected in the aftermath of Cyclone Pam, which struck Vanuatu in 2015, with the aim of assessing the level of damage. The labeling of the images was crowdsourced to a group of digital volunteers organized by the World Bank and UAViators.

Nazr-CNN has two distinct components. The first component carries out a pixel-level classification of images (a task often known as semantic segmentation) with the aim of identifying built structures in an image. The aim of the second component is to carry out a fine-grained classification of the identified structures to assess the severity of damage, using a Fisher Vector representation of the image segments. Nazr-CNN is particularly robust against noisy labels and appears to be height invariant, a necessary property for UAV images.

To the best of our knowledge this is the first deep learning pipeline for object detection and classification of UAV images collected from disaster-struck regions. At this point, our work already handles a complex problem, namely image segmentation in noisy aerial images. We plan to further investigate techniques to better handle the noisy labels we get from the digital volunteers.

7. APPENDIX

We derive Equation 5; the other two gradients follow similar logic. First note that

\frac{\partial w_a}{\partial \alpha_b} =
\begin{cases}
w_a - w_a^2 & \text{if } a = b \\
-w_a w_b & \text{if } a \neq b
\end{cases}    (9)

Now, \log P(x) = \log \sum_{j=1}^{K} w_j N_j(x). Therefore,

\frac{\partial \log P(x)}{\partial \alpha_i}
= \frac{1}{\sum_{j=1}^{K} w_j N_j(x)} \sum_{j=1}^{K} \frac{\partial}{\partial \alpha_i} \big( w_j N_j(x) \big)

= \frac{1}{\sum_{j=1}^{K} w_j N_j(x)} \left[ (w_i - w_i^2) N_i(x) - \sum_{j=1, j \neq i}^{K} w_j w_i N_j(x) \right]

= \frac{1}{\sum_{j=1}^{K} w_j N_j(x)} \left[ w_i N_i(x) - \sum_{j=1}^{K} w_i w_j N_j(x) \right]

= \gamma(i) - w_i
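As a sanity check of this derivation (our own, not part of the paper), the short script below compares the analytic gradient γ(i) − w_i from Equation 5 with a central finite-difference approximation of ∂ log P(x)/∂α_i on a small random GMM.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_p(x, alpha, mus, sigmas):
    """log P(x | lambda) for a diagonal-covariance GMM with softmax weights (Eq. 4)."""
    w = np.exp(alpha) / np.exp(alpha).sum()
    return np.log(sum(w[j] * multivariate_normal.pdf(x, mus[j], np.diag(sigmas[j] ** 2))
                      for j in range(len(w))))

rng = np.random.default_rng(0)
K, D = 3, 2
alpha = rng.normal(size=K)
mus = rng.normal(size=(K, D))
sigmas = np.ones((K, D))
x = rng.normal(size=D)

# Analytic gradient: gamma(i) - w_i  (Equations 5 and 8)
w = np.exp(alpha) / np.exp(alpha).sum()
dens = np.array([w[j] * multivariate_normal.pdf(x, mus[j], np.diag(sigmas[j] ** 2))
                 for j in range(K)])
gamma = dens / dens.sum()
analytic = gamma - w

# Central finite differences on log P(x) with respect to each alpha_i
eps = 1e-6
numeric = np.array([(log_p(x, alpha + eps * np.eye(K)[i], mus, sigmas) -
                     log_p(x, alpha - eps * np.eye(K)[i], mus, sigmas)) / (2 * eps)
                    for i in range(K)])
print(np.allclose(analytic, numeric, atol=1e-5))    # expected: True
```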

8. REFERENCES

[1] Remote sensing damage assessment. Technical report, January 2010.
[2] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561, 2015.
[3] P. Blanchart, M. Ferecatu, and M. Datcu. Cascaded active learning for object retrieval using multiscale coarse to fine analysis. In 2011 18th IEEE International Conference on Image Processing, pages 2793-2796, Sept 2011.
[4] L. Bruzzone and D. F. Prieto. An adaptive semiparametric and context-based approach to unsupervised change detection in multitemporal remote-sensing images. IEEE Transactions on Image Processing, 11(4):452-466, Apr 2002.
[5] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. CoRR, abs/1412.7062, 2014.
[6] M. Cimpoi, S. Maji, and A. Vedaldi. Deep filter banks for texture recognition and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3828-3836, 2015.
[7] M. Cimpoi, S. Maji, and A. Vedaldi. Deep filter banks for texture recognition and segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 3828-3836, 2015.
[8] P. Dollar, Z. Tu, and S. Belongie. Supervised learning of edges and object boundaries. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 2, pages 1964-1971, 2006.
[9] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 2650-2658, Dec 2015.
[10] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1915-1929, Aug 2013.
[11] Q. Feng, J. Liu, and J. Gong. UAV remote sensing for urban vegetation mapping using random forest and texture analysis. Remote Sensing, 7(1):1074, 2015.
[12] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[13] R. B. Girshick. Fast R-CNN. In International Conference on Computer Vision, pages 1440-1448, 2015.
[14] J. Gleason, A. Nefian, X. Bouyssounousse, T. Fong, and G. Bebis. Vehicle detection from aerial imagery. In Robotics and Automation (ICRA), 2011 IEEE International Conference on, pages 2065-2070, May 2011.
[15] I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. Book in preparation for MIT Press, 2016.
[16] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Simultaneous detection and segmentation, pages 297-312. Springer International Publishing, Cham, 2014.
[17] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, June 2016.
[19] C. Hung, Z. Xu, and S. Sukkarieh. Feature learning based approach for weed classification using high resolution aerial images from a digital camera mounted on a UAV. Remote Sensing, 6(12):12037, 2014.
[20] M. Imran, C. Castillo, J. Lucas, P. Meier, and S. Vieweg. AIDR: Artificial intelligence for disaster response. In Proceedings of the 23rd International Conference on World Wide Web, WWW '14 Companion, pages 159-162, 2014.
[21] K. Jarrett, K. Kavukcuoglu, and Y. LeCun. What is the best multi-stage architecture for object recognition?
[22] S. Kluckner and H. Bischof. Semantic classification by covariance descriptors within a randomized forest. In Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on, pages 665-672, Sept 2009.
[23] S. Kluckner, T. Mauthner, P. M. Roth, and H. Bischof. Semantic classification in aerial imagery by integrating appearance and height information. In Computer Vision - ACCV 2009: 9th Asian Conference on Computer Vision, Xi'an, September 23-27, 2009, Revised Selected Papers, Part II, pages 477-488. Springer Berlin Heidelberg, 2010.
[24] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097-1105. Curran Associates, Inc., 2012.
[25] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097-1105. Curran Associates, Inc., 2012.
[26] Y. LeCun and Y. Bengio. Convolutional networks for images, speech, and time-series. 1995.
[27] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, Nov 2004.
[28] P. M. Mather and M. Koch. Computer Processing of Remotely-Sensed Images: An Introduction. John Wiley & Sons, 2011.
[29] V. Mnih and G. Hinton. Learning to label aerial images from noisy data. In Proceedings of the 29th Annual International Conference on Machine Learning (ICML 2012), June 2012.
[30] V. Mnih and G. E. Hinton. Learning to detect roads in high-resolution aerial images. In Computer Vision - ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part VI, pages 210-223. Springer Berlin Heidelberg, 2010.
[31] T. Moranduzzo and F. Melgani. Automatic car counting method for unmanned aerial vehicle images. Geoscience and Remote Sensing, IEEE Transactions on, 52(3):1635-1647, March 2014.
[32] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich. Feedforward semantic segmentation with zoom-out features. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3376-3385, June 2015.
[33] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
[34] F. Ofli, P. Meier, M. Imran, C. Castillo, D. Tuia, N. Rey, J. Briant, P. Millet, F. Reinhard, M. Parkan, and S. Joost. Combining human computing and machine learning to make sense of big (aerial) data for disaster response. Big Data, 4(1):47-59, Mar 2016.
[35] O. Oreifej, R. Mehran, and M. Shah. Human identity recognition in aerial images. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 709-716, June 2010.
[36] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91-99, 2015.
[37] J. Sanchez, F. Perronnin, T. Mensink, and J. Verbeek. Image classification with the Fisher vector: Theory and practice. International Journal of Computer Vision, 105(3):222-245, 2013.
[38] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In International Conference on Learning Representations (ICLR 2014). CBLS, April 2014.
[39] E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP(99):1-1, 2016.
[40] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[41] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[42] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 1-9, June 2015.
[43] H. Zhang, J. Zhang, and F. Xu. Land use and land cover classification based on image saliency map cooperated coding. In Image Processing (ICIP), 2015 IEEE International Conference on, pages 2616-2620, Sept 2015.