
Learning to Match Aerial Images with Deep Attentive Architectures

Hani Altwaijry1,2, Eduard Trulls3, James Hays4, Pascal Fua3, Serge Belongie1,2

1 Department of Computer Science, Cornell University   2 Cornell Tech   3 Computer Vision Laboratory, École Polytechnique Fédérale de Lausanne (EPFL)

4 School of Interactive Computing, College of Computing, Georgia Institute of Technology

Abstract

Image matching is a fundamental problem in Computer Vision. In the context of feature-based matching, SIFT and its variants have long excelled in a wide array of applications. However, for ultra-wide baselines, as in the case of aerial images captured under large camera rotations, the appearance variation goes beyond the reach of SIFT and RANSAC. In this paper we propose a data-driven, deep learning-based approach that sidesteps local correspondence by framing the problem as a classification task. Furthermore, we demonstrate that local correspondences can still be useful. To do so we incorporate an attention mechanism to produce a set of probable matches, which allows us to further increase performance. We train our models on a dataset of urban aerial imagery consisting of ‘same’ and ‘different’ pairs, collected for this purpose, and characterize the problem via a human study with annotations from Amazon Mechanical Turk. We demonstrate that our models outperform the state-of-the-art on ultra-wide baseline matching and approach human accuracy.

1. Introduction

Finding the relationship between two images depicting a 3D scene is one of the fundamental problems of Computer Vision. This relationship can be examined at different granularities. At a coarse level, we can ask whether two images show the same scene. At the other extreme, we would like to know the dense pixel-to-pixel correspondence, or lack thereof, between the two images. These granularities are directly related to broader topics in Computer Vision; in particular, one can look at the coarse-grained problem as a recognition/classification task, whereas the pixel-wise problem can be viewed as one of segmentation. Traditional geometry-based approaches live in a middle ground, relying on a multi-stage process that typically involves keypoint matching and outlier rejection, where image-level correspondence is derived from local correspondence.

Figure 1. Matching ultra-wide baseline aerial images. Left: the input pair of images. Middle: local correspondence matching (SIFT) fails to handle this baseline and rotation. Right: the CNN matches the pair and proposes possible region matches.

In this paper we focus on pairs of oblique aerial images acquired by distant cameras from very different angles, as shown in Fig. 1. These images are challenging for geometry-based approaches for a number of reasons, chief among them dramatic appearance distortions due to viewpoint changes and ambiguities due to repetitive structures. This renders methods based on local correspondence insufficient for ultra-wide baseline matching.

In contrast, we follow a data-driven approach. Specifically, we treat the problem from a recognition standpoint, without appealing specifically to hand-crafted, feature-based approaches or their underlying geometry. Our aim is to learn a discriminative representation from a large number of instances of same and different pairs, which separates the genuine matches from the impostors.

We propose two architectures based on Convolutional Neural Networks (CNNs). The first architecture is only concerned with learning to discriminate image pairs as same or different. The second one extends it by incorporating a Spatial Transformer module [16] to propose possible matching regions, in addition to the classification task. We learn both networks given only same and different pairs, i.e., we learn the spatial transformations in a semi-supervised manner.

Figure 2. Sample pairs from one of our datasets, collected from Google Maps [13] ‘Birds-Eye’ view. Pairs show an area or building from two widely separated viewpoints.

To train and validate our models, we use a dataset with 49k ultra-wide baseline pairs of aerial images compiled from Google Maps specifically for this problem; example pairs are shown in Fig. 2. We benchmark our models against multiple baselines, including human annotations, and demonstrate state-of-the-art performance, close to that of the human annotators.

Our main contributions are as follows. First, we demonstrate that deep CNNs offer a solution for ultra-wide baseline matching. Inspired by recent efforts in patch matching [14, 43, 31], we build a siamese/classification hybrid model using two AlexNet networks [19], cut off at the last pooling layer. The networks share weights and are followed by a number of fully-connected layers embodying a binary classifier. Second, we show how to extend this model with a Spatial Transformer (ST) module, which embodies an attention mechanism that allows our model to propose possible patch matches (see Fig. 1), which in turn increases performance. These patches are described and compared with MatchNet [14]. As with the first model, we train this network end-to-end, using only the same/different training signal, i.e., the ST module is trained in a semi-supervised manner. In Sections 3.2 and 4.6 we discuss the difficulties in training this network and offer insights in this direction. Third, we conduct a human study to help us characterize the problem and benchmark our algorithms against human performance. This experiment was conducted on Amazon Mechanical Turk, where participants were shown pairs of images from our dataset. The results confirm that humans perform exceptionally well while responding relatively quickly. Our top-performing model falls within 1% of human accuracy.

2. Related Work

2.1. Correspondence Matching

Correspondence matching has long been dominated by feature-based methods, led by SIFT [23]. Numerous descriptors have been developed within the community, such as SURF [5], BRIEF [8], and DAISY [36]. These descriptors generally provide excellent performance at narrow baselines, but are unable to handle the large distortions present in ultra-wide baseline matching [25].

Sparse matching techniques typically begin by extracting keypoints, e.g., Harris corners [15]; followed by a description step, e.g., computing SIFT descriptors; and then a keypoint matching step, which yields a pool of probable keypoint matches. These are then fed into a model-estimation technique, e.g., RANSAC [11] with a homography model. This pipeline has inherent limitations and demands assumptions to be made. Relying on keypoints can be limiting: dense techniques have been successful in wide-baseline stereo with calibration data [36, 38, 40], scene alignment [21, 40], and large displacement motion [38, 40].
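For concreteness, this classical pipeline can be sketched with off-the-shelf OpenCV components. This is an illustrative baseline only, not the A-SIFT setup evaluated in Section 4.5, and the ratio and inlier thresholds are arbitrary choices:

```python
import cv2
import numpy as np

def sparse_match(img1, img2, ratio=0.8, min_inliers=15):
    """Classical pipeline: keypoints -> descriptors -> matching -> robust estimation."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)   # keypoint detection + description
    kp2, des2 = sift.detectAndCompute(img2, None)
    if des1 is None or des2 is None:
        return False

    # Nearest-neighbor matching with Lowe's ratio test.
    knn = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)
    good = [m for m, n in (p for p in knn if len(p) == 2)
            if m.distance < ratio * n.distance]
    if len(good) < 8:
        return False

    # Robust model estimation (fundamental matrix + RANSAC) rejects outliers.
    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
    F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC)
    return F is not None and mask is not None and int(mask.sum()) >= min_inliers
```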

The descriptor embodies assumptions about the topology of the scene, e.g., SIFT is not robust against affine distortions, a problem addressed by Affine-SIFT [42]. Further assumptions are made in the matching step: do we consider only unique keypoint matches? What about repetitive structures? Finally, the robust model estimation step is expected to tease out a correct geometric model. We believe that these assumptions play a major role in why feature-based approaches are currently incapable of matching images across very wide baselines.

2.2. Ultra-wide Baseline Feature-Based Matching

Ultra-wide baseline matching generally falls under the umbrella of correspondence matching problems. There have been several works on wide-baseline matching [35, 24]. For urban scenery, Bansal et al. [4] presented the Scale-Selective Self-Similarity (S4) descriptor, which they used to identify and match building facades for image geo-localization purposes. Altwaijry and Belongie [1] matched urban imagery under ultra-wide baseline conditions with an approach involving affine invariance and a controlled matching step. Chung et al. [9] calculate sketch-like representations of buildings used for recognition and matching. In general, these approaches suffer from poor performance due to the difficulty of the problem.

2.3. Convolutional Neural Networks

Neural networks have a long history in the field of Artificial Intelligence, starting with [30]. Recently, deep Convolutional Neural Networks have achieved state-of-the-art results and become the dominant paradigm on multiple fronts of Computer Vision research [19, 33, 34, 12].

Several works have investigated aspects of correspondence matching with CNNs. In [22], Long et al. shed some light on feature localization within a CNN, and determine that features in later stages of the CNN correspond to features finer than the receptive fields they cover. Toshev and Szegedy [37] determine the pose of human bodies using CNNs in a regression framework; in their setting, the neural network is trained to regress the locations of body joints in a multi-stage process. Lin et al. [20] use a siamese CNN architecture to put aerial and ground images in a common embedding for ground image geo-localization.

The literature has seen a number of approaches to learning descriptors prior to neural networks. In [7], Brown et al. introduce three sets of matching patches obtained from structure-from-motion reconstructions, and learn descriptor representations to match them better. Simonyan et al. [32] learn the placement of pooling regions in image space and dimensionality reduction for descriptors. With the rise of CNNs, however, several lines of work investigated learning descriptors with deep networks. They generally rely on a two-branch structure inspired by the siamese network of [6], where two networks are given pairs of matching and non-matching patches. This is the approach followed by Han et al. with MatchNet [14], which relies on a fully-connected network after the siamese structure to learn the comparison metric. DeepCompare [43] uses a similar architecture and focuses on the center of the patch to increase performance. In contrast, Simo-Serra et al. [31] learn descriptors that can be compared with the L2 distance, discarding the siamese network after training. These three methods rely on data from [7] to learn their representations. They assume that salient regions are already determined, and deliver a better approach to feature description for feature-based correspondence matching techniques. The question of obtaining CNN-borne correspondences between two input images, however, remains unexplored.

Lastly, attention models [26, 3] have been developed to recognize objects with an attention mechanism that examines sub-regions of the input image sequentially. In essence, the attention mechanism embodies a saliency detector. In [16], the Spatial Transformer (ST) network was introduced as an attention mechanism capable of warping the inputs to increase recognition accuracy. In Section 3.2 we discuss how we employ an ST module to let the network produce guesses for probable region matches.

3. Deep-Learning Architectures

3.1. Hybrid Network

We introduce an architecture which, given a pair of images, estimates the likelihood that they belong to the same scene. Inspired by the recent success of patch-matching approaches based on CNNs [43, 14, 31], we use a hybrid siamese/classification network. The network comprises two parts: two feature extraction arms that share weights (the siamese component) and process each input image separately, and a classifier component that produces the matching probability. For the siamese component we use the convolutional part of AlexNet [19], i.e., cutting off the fully-connected layers. For the classifier we use a set of fully-connected layers that takes as input the concatenation of the siamese features and ends with a binary classifier, for which we minimize the binary cross-entropy loss. Fig. 3 illustrates the structure of the ‘Hybrid’ network.

Figure 3. The siamese/classification Hybrid network. Weights are shared between the convolutional arms. ReLU and LRN (Local Response Normalization) layers are not shown for brevity.

The main motivation behind this design is that it allows features with local information from both images to be considered jointly. This happens at the layer where the two convolutional feature maps are concatenated: there, the features from both images still correspond to specific regions within the input images.
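A minimal sketch of this hybrid design, written here in PyTorch for illustration (the paper's models are implemented in Torch7 with CaffeNet weights; the classifier widths below are assumptions, and the 256x6x6 'pool5' shape corresponds to 224x224 inputs):

```python
import torch
import torch.nn as nn
from torchvision import models

class HybridNet(nn.Module):
    """Siamese AlexNet arms with shared weights, followed by a fully-connected classifier."""
    def __init__(self):
        super().__init__()
        # Convolutional part of AlexNet up to 'pool5'; both arms reuse the same module.
        self.arm = models.alexnet(weights="IMAGENET1K_V1").features
        self.classifier = nn.Sequential(              # trained from scratch
            nn.Linear(2 * 256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1),                       # single match/no-match logit
        )

    def forward(self, img_a, img_b):
        fa = self.arm(img_a).flatten(1)               # features still map to image regions
        fb = self.arm(img_b).flatten(1)
        return self.classifier(torch.cat([fa, fb], dim=1))

# Minimized with the binary cross-entropy loss on the same/different label.
model = HybridNet()
logit = model(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))
loss = nn.BCEWithLogitsLoss()(logit.squeeze(1), torch.tensor([1.0, 0.0]))
```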

3.2. Hybrid++

Unlike traditional geometry-based approaches, the hybrid network proposed in the previous section does not model local similarity explicitly, making it difficult to draw conclusions about corresponding image regions. We would like to determine whether modeling local similarities more explicitly can produce more discriminative models.

We therefore sought to expand our hybrid architecture to allow for predictions of probable region matches, in addition to the classification task. To accomplish this, we leverage the Spatial Transformer (ST) network described in [16]. Spatial transformers consist of a localization network, which takes the image as input and produces the parameters of a pre-determined transformation model (e.g., translation, affine, etc.), which is used in turn to transform the image. It relies on a grid generator and a differentiable sampling kernel to propagate gradients back to the localization network. The model can be trained with standard back-propagation, unlike the attention mechanisms of [3, 26], which relied on reinforcement learning techniques. The localization network is typically a standard CNN followed by a set of fully-connected layers with the required number of outputs, i.e., the number of transformation parameters, e.g., two for translation, six for affine.

The spatial transformer allows for any transformation as long as it is differentiable. However, in this work we only consider extracting patches at a fixed scale, i.e., translations, which are used to generate patch proposals over both images. Richer models, such as perspective transformations, can potentially be more descriptive, but are also more difficult to train.

Figure 4. Overview of a Spatial Transformer module operating on a single image. The module uses the regressed parameters Θ to generate and sample a grid of pixels in the original image.

We build the spatial transformer with the same convolutional network used for the ‘arms’ of the siamese component of our hybrid network, plus a set of fully-connected layers that regress the transformation parameters Θ = {Θ1, Θ2}, which are used to transform the input images, effectively sampling patches. Note that the patch locations for each individual image are a function of both images. The number of extracted patches is determined by the number of regressed parameters. Fig. 4 illustrates how the spatial transformer module operates.
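A sketch of translation-only, fixed-scale patch sampling with a differentiable grid sampler (PyTorch; the tensor names and the (x, y) ordering are assumptions, and in the paper the centers are regressed by a localization network over both images jointly):

```python
import torch
import torch.nn.functional as F

def extract_patches(image, centers, patch_size=64):
    """Differentiably crop fixed-scale patches at regressed (x, y) centers in [-1, 1].

    image:   (B, C, H, W) tensor
    centers: (B, M, 2) tensor of patch centers in normalized coordinates
    """
    B, C, H, W = image.shape
    M = centers.shape[1]
    scale_x = patch_size / W            # fixed scale: only translation is learned
    scale_y = patch_size / H
    theta = image.new_zeros(B * M, 2, 3)
    theta[:, 0, 0] = scale_x
    theta[:, 1, 1] = scale_y
    theta[:, :, 2] = centers.reshape(B * M, 2)          # (tx, ty) translation

    grid = F.affine_grid(theta, (B * M, C, patch_size, patch_size), align_corners=False)
    tiled = image.repeat_interleave(M, dim=0)           # one image copy per patch
    patches = F.grid_sample(tiled, grid, align_corners=False)
    return patches.reshape(B, M, C, patch_size, patch_size)
```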

The spatial transformer modules allow us to explicitly model regions within each input image, permitting the network to propose similar regions when the architecture demands it. The overall structure of this model, which we call ‘Hybrid++’, is shown in Fig. 5.

3.2.1 Describing Patches

In our model, we pair an ST module, which produces a pre-determined number of fixed-scale patch proposals, with our hybrid network. The extracted patches are given to a MatchNet [14] network, which was trained with interest points from Structure-from-Motion data [7] and thus already has a measure of invariance against perspective changes built in.

MatchNet has two components: a feature extractor modeled as a series of convolutional layers, and a classifier network that takes the outputs of two feature extractors and produces a similarity score. We pass each extracted patch, after converting it to grayscale, through the MatchNet feature extraction network (MatchNet-Feat) to arrive at a 4096-dimensional descriptor vector.
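A sketch of this step, with a hypothetical `matchnet_feat` module standing in for the pre-trained MatchNet feature tower:

```python
import torch

def describe_patches(patches, matchnet_feat):
    """Grayscale 64x64 patches -> 4096-d descriptors from a frozen feature network.

    patches: (B, M, 3, 64, 64) RGB patches sampled by the spatial transformers.
    """
    B, M = patches.shape[:2]
    # Luma-weighted conversion to grayscale, keeping a channel dimension.
    w = patches.new_tensor([0.299, 0.587, 0.114]).view(1, 1, 3, 1, 1)
    gray = (patches * w).sum(dim=2, keepdim=True)

    # Freeze the descriptor weights; gradients still flow through the patch pixels,
    # and therefore back to the localization network that placed the patches.
    for p in matchnet_feat.parameters():
        p.requires_grad_(False)
    desc = matchnet_feat(gray.reshape(B * M, 1, 64, 64))
    return desc.reshape(B, M, -1)        # (B, M, 4096)
```

Freezing the weights, rather than wrapping the call in a no-grad context, is what lets the training signal reach the attention mechanism while leaving the descriptors untouched (see Section 3.3).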

These descriptors are then used for three different objectives. The first objective is to supplement the global feature description extracted by the original hybrid architecture; in this manner, the extracted descriptors provide the classifier with information extracted at a dedicated, higher resolution. The second objective is to match patches in the other image; this objective encourages the network to use the spatial transformer to focus on similar patches in both images simultaneously. The third objective is for each patch not to match other patches extracted from the same image, which we mainly use to discourage the network from collapsing onto a single patch. For the last two tasks, we use the MatchNet classification network (MatchNet-Classify).

Figure 5. The ‘Hybrid++’ network. Spatial Transformer modules are incorporated into the ‘Hybrid’ model to predict probable patch matches.

3.2.2 Optimization

Combining the image-wise classification objective with the regional descriptor objectives yields an objective function with four components:

L = \frac{1}{N} \sum_{i=1}^{N} \left( L_{\text{class}} + \alpha L_{\text{patch}} + \beta L_{\text{pairwise}} + \gamma L_{\text{bounds}} \right)    (1)

where N is the size of the training batch and α, β, γ are used to adjust the weights. The first component of the loss function encodes the image classification objective:

L_{\text{class}} = y_i \log p_i + (1 - y_i) \log(1 - p_i)    (2)

where p_i is the probability of the images matching and y_i ∈ {0, 1} is the label. The second component encodes the match of each pair of patches across both images:

L_{\text{patch}} = \frac{1}{M} \sum_{m=1}^{M} \left[ y_i \log q_m + (1 - y_i) \log(1 - q_m) \right]    (3)

where M is the number of patches and q_m is the probability of patch x^1_m on image 1 matching patch x^2_m on image 2. The third component is a pairwise penalty that discourages good matches among the patches within the same image, to prevent the network from collapsing the transformations on top of each other:

L_{\text{pairwise}} = \frac{4}{M(M-1)} \sum_{t=1}^{2} \sum_{m=1}^{M} \sum_{k=m+1}^{M} \log(1 - u^t_{m,k})    (4)

where u^t_{m,k} is the probability of patch x^t_m matching patch x^t_k on image t ∈ {1, 2}. The last component is a penalty that discourages spatial transformations that fall out of bounds:

L_{\text{bounds}} = \frac{2}{M} \sum_{t=1}^{2} \sum_{m=1}^{M} f(x^t_m)    (5)

where f(x^t_m) computes the ratio of pixels sampled out of bounds for patch x^t_m. The out-of-bounds term discourages the model from stepping outside the image, which could otherwise minimize the patch-matching loss given an appropriate weight; with this penalty function we gain more control over the optimization process.
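A sketch of how these four terms could be combined in code, assuming the patch and same-image match probabilities come from the frozen MatchNet classifier and that `oob` holds the per-patch out-of-bounds pixel ratios; the signs are arranged for minimization (the standard cross-entropy form of Eqs. (2)-(4)), and constant normalization factors are folded into α, β, γ:

```python
import torch

def hybrid_pp_loss(p, y, q, u_same, oob, alpha=1.0, beta=1.0, gamma=1.0, eps=1e-7):
    """Combine the four objectives of Eq. (1).

    p:      (B,)      image-level match probability
    y:      (B,)      image-level label in {0, 1}
    q:      (B, M)    cross-image patch match probabilities
    u_same: (B, 2, P) same-image patch match probabilities, P = M(M-1)/2 pairs
    oob:    (B, 2, M) ratio of out-of-bounds pixels per patch
    """
    y = y.float()
    l_class = -(y * torch.log(p + eps) + (1 - y) * torch.log(1 - p + eps))
    l_patch = -(y[:, None] * torch.log(q + eps)
                + (1 - y[:, None]) * torch.log(1 - q + eps)).mean(dim=1)
    # Penalize patches that match other patches from the same image.
    l_pairwise = -torch.log(1 - u_same + eps).mean(dim=(1, 2))
    # Penalize sampling outside the image bounds.
    l_bounds = oob.mean(dim=(1, 2))
    return (l_class + alpha * l_patch + beta * l_pairwise + gamma * l_bounds).mean()
```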

3.3. Training Procedure

To train the hybrid network, we follow a standard training procedure, fine-tuning the model after loading pre-trained AlexNet weights into the convolutional arms only. Training the Hybrid++ network, however, is more subtle, as the network needs to get started on the right foot. We initially train the non-ST and ST sides separately with the global yes/no matching signal; afterwards, we train the networks jointly. We found this necessary to prevent the network from shutting off one side while minimizing the objective. As in the Hybrid case, we use pre-trained weights for the convolutional arms.

We use MatchNet as a pure feature descriptor with frozen weights, i.e., no learning. This is primarily done to prevent the network from minimizing the loss by changing the descriptors themselves without moving the attention mechanism. Our training procedure does not have pixel-to-pixel correspondence labels, and hence we do not know whether the network is examining similar patches; we rely on MatchNet to determine patch similarity. The global matching label in turn becomes a semi-supervised cue. Therefore, the network can only minimize the patch-matching loss component by moving the attention mechanism to examine patches that appear to be similar, as judged by MatchNet.

The reliance on MatchNet is a double-edged sword, as it is our only means of moving the attention mechanism without explicit knowledge of labeled patch correspondences. This means that if MatchNet cannot find a correspondence between two patches that do match, then the attention mechanism cannot learn to look for those two patches.

4. Experiments

4.1. Dataset

We compiled 49,271 matching pairs (98,542 images) of oblique aerial imagery through Google Maps [13]. The images were collected using an automated process that looks for planar surfaces whose normal vector lies within 40° to 75° of one cardinal direction. This guarantees the visibility of the surface from two different viewpoints. The pairs were collected non-uniformly from San Francisco, Boston, and Milan; these locations were chosen with the goal of diversifying the scenery.

We split the dataset into roughly ∼39K/∼10K training/testing positive pairs. For training, we generate samples in an online manner by sampling from the reservoir of positive matching pairs. The sampling procedure is set to produce samples with a 1:1 positive:negative ratio; therefore, a random classifier would score 50% on the test set. We call this the ‘aerial’ dataset.
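A sketch of the kind of online sampling this implies; the pair lists are hypothetical, and forming negatives by crossing images from two different positive pairs is an assumption about the exact procedure:

```python
import random

def sample_batch(positive_pairs, batch_size=32):
    """Draw a batch with a 1:1 positive:negative ratio from a reservoir of positive pairs."""
    batch = []
    for _ in range(batch_size // 2):
        a, b = random.choice(positive_pairs)
        batch.append((a, b, 1))                  # genuine match
        (c, _), (_, d) = random.sample(positive_pairs, 2)
        batch.append((c, d, 0))                  # impostor: images from different pairs
    random.shuffle(batch)
    return batch
```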

4.2. Human Performance

We ask ourselves: how well do humans perform when matching such images? To this end, we conducted a small experiment with human participants on Amazon Mechanical Turk [2]. We picked a subset of 1,000 pairs from our test set and presented them to the human subjects. Each participant was shown 10 pairs of different images and was asked to determine, as a binary question, whether each pair showed the same area or building. We show a screenshot of the interface presented to the participants in Fig. 6. Each pair of images was presented at least 5 times to different participants, giving us a total of 5,000 labels, 5 per pair.

Our interface was prone to adversarial participants, i.e., those answering randomly or giving the same answer every time. To mitigate the effect of such workers, we took the majority vote of the 5 labels per pair. Human accuracy was then calculated to be 93.3%, with a precision of 98% and a recall of 89.4%.
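The aggregation itself is simple arithmetic; a sketch with hypothetical per-pair vote lists:

```python
def aggregate_votes(votes_per_pair, truths):
    """Majority vote over the 5 binary labels per pair, then accuracy/precision/recall."""
    preds = [int(sum(v) > len(v) / 2) for v in votes_per_pair]
    tp = sum(p and t for p, t in zip(preds, truths))
    fp = sum(p and not t for p, t in zip(preds, truths))
    fn = sum(t and not p for p, t in zip(preds, truths))
    accuracy = sum(p == t for p, t in zip(preds, truths)) / len(truths)
    return accuracy, tp / (tp + fp), tp / (tp + fn)   # accuracy, precision, recall
```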

Figure 6. The user interface presented to our human subjects through Amazon Mechanical Turk.

We observed that the average response time for humans was less than 4.5 seconds per pair, with a minimum response time of half a second. This quick average response prompted us to examine mislabeled pairs: we show examples of False Positives in Fig. 7 and False Negatives in Fig. 8. Most of the False-Positive pairs have a similar general structure, a cue that humans relied on hastily. Notice that these examples require deliberate correspondence matching; this is a non-trivial, time-consuming task, which explains why the human subjects, who operate in an environment that favors low response times, labeled them as False. This is also corroborated by the high precision and lower recall of the human labelers, another indication that humans are performing high-level image comparisons. All in all, we believe this indicates that the human participants were relying mostly on global appearance cues, which underlines the need for local correspondence matching.

4.3. Training Framework

We train our networks with Torch7 [10]. We transplant weights into our models from the pre-trained reference model CaffeNet, available from Caffe [18]. For the convolutional feature arms, we keep the AlexNet layers up to ‘pool5’ and discard the rest. The fully-connected layers of our classifier component are trained from scratch. For the patch descriptor network, i.e., MatchNet [14], we transplant the ‘feature’ network and the ‘classification’ network as-is and freeze the learning for both.

We use Rectified Linear Units (ReLU) for all our non-linearities and train the networks with Stochastic Gradient Descent. The spatial transformer modules are specifically trained without momentum.
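In a modern framework this per-module configuration could be expressed with parameter groups; a sketch using a toy stand-in model, with assumed submodule names and illustrative learning rates:

```python
import torch
import torch.nn as nn

# Toy stand-in with the same submodule layout as Hybrid++ (names are assumptions).
model = nn.ModuleDict({
    "arm": nn.Conv2d(3, 64, 3),
    "localization": nn.Linear(21632, 24),
    "classifier": nn.Linear(128, 1),
})

optimizer = torch.optim.SGD(
    [
        # Spatial transformer (localization) parameters: plain SGD, no momentum.
        {"params": model["localization"].parameters(), "momentum": 0.0, "lr": 1e-4},
        # Remaining parameters: standard SGD with momentum.
        {"params": model["arm"].parameters(), "momentum": 0.9},
        {"params": model["classifier"].parameters(), "momentum": 0.9},
    ],
    lr=1e-3,
)
```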

4.4. Spatial Transformer Details

The spatial transformer regresses |Θ| = 4n parameters, where n is the number of patches per image. Each pair of parameters is taken as an x-y location in the image plane, in the range [−1, 1]. We specify a fixed-scale interpretation, where extracted patches are always 64 × 64, the resolution required by MatchNet.

Figure 7. False-Positive pairs from the human experiment.

Figure 8. False-Negative pairs from the human experiment.

In the Hybrid++ network, we remove the ‘pool5’ and ‘conv5’ layers provided by AlexNet from the convolutional arms, and learn a new 1 × 1 convolutional layer with an output size of 64 × 13 × 13, performing dimensionality reduction from the 384-channel output of ‘conv4’. The localization network takes a 2 × 64 × 13 × 13 input from the two convolutional arms and follows it with 3 fully-connected layers: 21632 → 1024 → 256 → 4n. The initialization of the last fully-connected layer is not random; as recommended in [16], we initialize it with a zero weight matrix and a bias specifying initial locations for the patches. In our experiments, we predict M = 6 patches per image, initialized to non-overlapping grid locations.
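A sketch of this localization head and its initialization (PyTorch; the exact grid of initial centers is an assumption):

```python
import torch
import torch.nn as nn

M = 6                                   # patches per image
loc_head = nn.Sequential(               # input: concatenated 2 x 64 x 13 x 13 features
    nn.Flatten(),
    nn.Linear(2 * 64 * 13 * 13, 1024), nn.ReLU(inplace=True),
    nn.Linear(1024, 256), nn.ReLU(inplace=True),
    nn.Linear(256, 4 * M),              # (x, y) per patch, per image, in [-1, 1]
)

# Initialize the last layer to a fixed, non-overlapping grid rather than randomly:
# zero weights so the initial prediction is exactly the bias.
last = loc_head[-1]
nn.init.zeros_(last.weight)
grid = torch.tensor([[-0.5, -0.5], [0.0, -0.5], [0.5, -0.5],
                     [-0.5,  0.5], [0.0,  0.5], [0.5,  0.5]])   # assumed initial centers
with torch.no_grad():
    last.bias.copy_(grid.repeat(2, 1).flatten())   # same grid for both images
```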

4.5. Matching Results

We compare our CNN models with a variety of baselines on the ‘aerial’ dataset. Our first baseline is a feature-based correspondence-matching method. We chose A-SIFT [42] as it offers all the capabilities of SIFT with the addition of affine invariance. In aerial images we mainly observe affine distortion effects, which makes A-SIFT’s invariance properties particularly relevant. We use the implementation offered by the authors, which computes the matches and performs outlier rejection to estimate the fundamental matrix between the views, providing a yes/no answer given a threshold. The accuracy of A-SIFT is better than random by 11%, but it suffers from low accuracy on the positive samples (i.e., low recall), as it is unable to find enough correspondences to estimate the fundamental matrix for a large number of positive pairs. This illustrates the difficulty of solving this problem with local correspondence matching.

Our second set of baselines measures the performance of holistic representation methods used in the image classification and retrieval literature. We chose to compare GIST [27], Fisher Vectors [28], and VLAD [17]. The GIST-based classifier predicted most image pairs to be non-matching. Fisher Vectors surpassed A-SIFT by showing a better ability to recognize positive matches, but performed worse than A-SIFT at distinguishing negative pairs. VLAD performed the best of these three holistic approaches, with an average accuracy of 78.6%. For GIST we use the authors’ implementation, and for Fisher Vectors and VLAD we use VLFeat [39].


Figure 9. Precision/Recall curves for the ‘aerial’ dataset. The number in parentheses denotes the average precision (%): Hybrid (94.2), Hybrid w/o pool5 (96.3), Hybrid++ (97.5), Siamese-AlexNet (84.0), Siamese-PlacesCNN (76.2), VLAD (86.3), Fisher Vectors (72.2), GIST (55.3), A-SIFT (69.4). Human performance is also shown.

The third set of baselines comprises vanilla CNN models used in a siamese fashion (without fine-tuning). We compare against AlexNet [19], trained on ImageNet, and PlacesCNN [44], an instance of the AlexNet architecture trained on the Places205 dataset [44]. We extract the ‘fc7’ layer outputs as descriptor vectors for the input images and use the L2 distance as a similarity metric. This group of baselines explores the applicability of pre-trained networks as generic feature descriptors, for which there is mounting evidence [29]. Both CNNs performed well, considering the lack of fine-tuning. We note that while VLAD surpassed these two CNN approaches, both VLAD and Fisher Vectors require training on our dataset; this shows the power of CNNs to generalize to other domains.
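A sketch of this baseline with an off-the-shelf torchvision AlexNet; the decision threshold on the distance is a free parameter swept to produce the curves:

```python
import torch
import torch.nn as nn
from torchvision import models

alexnet = models.alexnet(weights="IMAGENET1K_V1").eval()
# 'fc7' corresponds to the penultimate fully-connected layer: drop the final classifier.
fc7_extractor = nn.Sequential(
    alexnet.features, alexnet.avgpool, nn.Flatten(),
    *list(alexnet.classifier.children())[:-1],      # up to and including fc7 + ReLU
)

@torch.no_grad()
def descriptor_distance(img_a, img_b):
    """L2 distance between fc7 descriptors; smaller means more likely the same scene."""
    da = fc7_extractor(img_a.unsqueeze(0))
    db = fc7_extractor(img_b.unsqueeze(0))
    return torch.dist(da, db).item()
```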

Finally, we measure the classification accuracy of our proposed architectures. Our Hybrid CNN outperforms all the baselines. A variant of the Hybrid CNN was trained without the ‘conv5’ and ‘pool5’ layers, with a 1 × 1 convolution layer after ‘conv4’ to reduce the dimensionality of its output; this variant outperforms the base Hybrid CNN by a small margin. Our Hybrid++ model with Spatial Transformers gives us a further boost and performs nearly as well as the human participants in our study.

Table 1 summarizes the accuracy for every method, and Fig. 9 shows precision/recall curves, along with the average precision, expressed as a percentage.

Method                     Acc.   Acc. pos   Acc. neg   AP
Human*                     .933   .894       .972       —
A-SIFT [42]                .613   .353       .874       .694
GIST [27]                  .549   .242       .821       .553
Fisher Vectors [28]        .659   .605       .713       .722
VLAD [17]                  .786   .769       .803       .863
Siamese PlacesCNN [44]     .690   .626       .754       .762
Siamese AlexNet [19]       .754   .697       .811       .840
Hybrid CNN                 .881   .901       .861       .942
Hybrid w/o pool5           .909   .928       .891       .963
Hybrid++                   .926   .927       .925       .975

Table 1. Classification performance on the ‘aerial’ dataset. AP denotes Average Precision. (*Human performance was measured on a subset of the samples.)

4.6. Insights and Discussion

One of the main difficulties in the application of CNNs to real-world problems lies in designing and training the networks. This is particularly true for complex architectures with multiple components, such as our Hybrid++ network. In this section we discuss our experience and attempt to offer insights that may not be immediately obvious.

We obtained a small improvement by removing the ‘pool5’ layer from the AlexNet model and replacing ‘conv5’ with a 1 × 1 dimensionality-reduction convolution. We believe this is mainly due to the increased resolution of 13 × 13 presented to the classifier, which allows more local detail to be considered jointly. This detail appears to be crucial for training the Hybrid++ model, as it provides the Spatial Transformer module with more resolution to work with. In Fig. 10 we show a sample of matched images with probable patch matches highlighted. Even with the increase in resolution, the receptive field of each neuron is still quite large in the original image space, which suggests that higher-resolution features would be needed for finer localization of similar patches. This aspect is reflected in the regions of interest the network learns for each of its attention mechanisms.

We attempted to use transformations with more degrees of freedom in the Spatial Transformer module, such as affine transforms, but we found the task increasingly difficult without higher levels of supervision and additional constraints. This was the origin of our ‘out-of-bounds’ penalty term: for example, the network would learn to stretch parts of each image into seemingly similar-looking patches, effectively minimizing the pairwise patch-similarity loss term.

To train the pairwise patch-similarity portion of the network, we only have the image-level match label, with no information regarding pixel-wise correspondence. It might seem unclear what target labels should be presented to the pairwise similarity loss. However, by studying the loss function we can see that the attention mechanism cannot find matching patches unless we actively look for correspondences; hence it is sensible to use the image-level label for patch correspondence. Given that the MatchNet modules are frozen, the network will not incur a high loss for non-corresponding patches on negative samples, but only for non-corresponding patches on positive samples.


Figure 10. Image pairs from ‘aerial’, matched with Hybrid++. The overlaid boxes indicate patch proposals. Red boxes denote patches that do not match, according to MatchNet. Boxes with colors other than red indicate matches, with the color encoding the correspondence.

Figure 11. Image pairs from ‘Lausanne’, matched with Hybrid++. Color coding follows the same conventions as the figure above.

4.7. Investigating the Spatial Transformers

The patch proposal locations of Fig. 10 are meaningful from pair to pair, and across the images of a given pair. However, while the baseline between the two images in a pair is very large, it does not change much from pair to pair, an inevitable artifact of the dataset collection process. This results in patch proposals with similar configurations and raises questions about the Spatial Transformers.

We thus set up a second experiment to study the effect of varying viewpoint changes explicitly. To this end we used several high-resolution aerial images from the city of Lausanne, Switzerland, to build a Structure-from-Motion dataset [41] and extract corresponding patches, with 8.7k training pairs and 3.6k test pairs. Patches were extracted around SIFT locations and are thus significantly easier to match than those in the ‘aerial’ dataset. However, the viewpoint changes from pair to pair are much more pronounced.

We followed the same methodology as before to train our models on this new dataset. In Fig. 11 we show different pairs from the new dataset, along with the probable patch matches suggested by the model. The model learns to predict patch locations that are consistent with the change in perspective, while also differing from pair to pair. MatchNet results on the proposals corroborate the findings when the contents of those patches do match (non-red boxes) and when they do not (red boxes). Numerical results are provided in Table 2. As this data is significantly easier, the baselines (notably A-SIFT) perform much better, but our method achieves the highest accuracy, at 96%. The performance gain from Hybrid to Hybrid++ is, however, negligible.

Method                     Acc.   Acc. pos   Acc. neg   AP
A-SIFT [42]                .947   .896       .998       .968
GIST [27]                  .856   .798       .914       .937
Fisher Vectors [28]        .769   .723       .816       .867
VLAD [17]                  .898   .867       .930       .965
Siamese PlacesCNN [44]     .690   .626       .754       .958
Siamese AlexNet [19]       .754   .697       .811       .968
Hybrid CNN                 .959   .960       .957       .992
Hybrid++                   .959   .962       .956       .992

Table 2. Classification performance on the ‘Lausanne’ dataset.

5. Conclusions and Future Work

We present two neural network architectures to address the problem of ultra-wide baseline image matching. First, we fine-tune a pre-trained AlexNet model over aerial data, with a siamese architecture for feature extraction and a binary classifier. This network proves capable of discerning image-level correspondence, but is agnostic to local correspondence. We then show how to integrate Spatial Transformer modules to predict probable patch matches in addition to the classification task, which further boosts performance. Our models achieve state-of-the-art accuracy in ultra-wide baseline matching and close the gap with human performance. We also demonstrate the adaptability of our approach on a new dataset with varied viewpoint changes, which the ST modules can adapt to.

This work is a step towards bridging the gap between neural networks and traditional image-matching techniques based on local correspondence, in a framework that is trainable end-to-end. We intend to build on it in the following directions. First, we plan to explore ways to increase the resolution of the localization network to obtain finer-grained patch proposals. Second, we plan to replace MatchNet with ‘descriptor’ networks trained for this specific purpose. Third, we are interested in richer transformations for the ST modules, e.g., affine, and in exploring the constraints needed to do so. Finally, we want to study the use of stronger supervision for a better feature-localization step, bringing neural networks closer to local correspondence techniques.

Acknowledgments

We would like to thank Kevin Matzen and Tsung-Yi Lin for their valuable input. This work was supported by the KACST Graduate Studies Scholarship and EU FP7 project MAGELLAN under grant number ICT-FP7-611526.


References

[1] H. Altwaijry and S. Belongie. Ultra-wide baseline aerial imagery matching in urban environments. In BMVC, 2013.
[2] Amazon.com. Amazon Mechanical Turk.
[3] J. Ba, V. Mnih, and K. Kavukcuoglu. Multiple object recognition with visual attention. In ICLR, 2015.
[4] M. Bansal, K. Daniilidis, and H. Sawhney. Ultra-wide baseline facade matching for geo-localization. In ECCV, 2012.
[5] H. Bay, T. Tuytelaars, and L. V. Gool. SURF: Speeded Up Robust Features. In ECCV, 2006.
[6] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah. Signature verification using a “siamese” time delay neural network. In NIPS, 1994.
[7] M. Brown, G. Hua, and S. Winder. Discriminative learning of local image descriptors. PAMI, 2011.
[8] M. Calonder, V. Lepetit, M. Ozuysal, T. Trzcinski, C. Strecha, and P. Fua. BRIEF: Computing a local binary descriptor very fast. PAMI, 2012.
[9] Y.-C. Chung, T. Han, and Z. He. Building recognition using sketch-based representations and spectral graph matching. In ICCV, 2009.
[10] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A MATLAB-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.
[11] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 1981.
[12] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[13] Google Inc. Google Maps.
[14] X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg. MatchNet: Unifying feature and metric learning for patch-based matching. In CVPR, 2015.
[15] C. Harris and M. Stephens. A combined corner and edge detector. In Fourth Alvey Vision Conference, 1988.
[16] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NIPS, 2015.
[17] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In CVPR, 2010.
[18] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[20] T.-Y. Lin, Y. Cui, S. Belongie, and J. Hays. Learning deep representations for ground-to-aerial geolocalization. In CVPR, 2015.
[21] C. Liu, J. Yuen, A. Torralba, J. Sivic, and W. T. Freeman. SIFT Flow: Dense correspondence across different scenes. In ECCV, 2008.
[22] J. L. Long, N. Zhang, and T. Darrell. Do convnets learn correspondence? In NIPS, 2014.
[23] D. G. Lowe. Object recognition from local scale-invariant features. In ICCV, 1999.
[24] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In BMVC, 2002.
[25] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. PAMI, 2005.
[26] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. In NIPS, 2014.
[27] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 2001.
[28] F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In CVPR, 2007.
[29] A. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In CVPR Workshops, 2014.
[30] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 1958.
[31] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer. Discriminative learning of deep convolutional feature point descriptors. In ICCV, 2015.
[32] K. Simonyan, A. Vedaldi, and A. Zisserman. Learning local feature descriptors using convex optimisation. PAMI, 2014.
[33] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[34] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In CVPR, 2014.
[35] D. Tell and S. Carlsson. Combining appearance and topology for wide baseline matching. In ECCV, 2002.
[36] E. Tola, V. Lepetit, and P. Fua. DAISY: An efficient dense descriptor applied to wide-baseline stereo. PAMI, 2010.
[37] A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In CVPR, 2014.
[38] E. Trulls, I. Kokkinos, A. Sanfeliu, and F. Moreno-Noguer. Dense segmentation-aware descriptors. In CVPR, 2013.
[39] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms. In ACM International Conference on Multimedia, 2010.
[40] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid. DeepFlow: Large displacement optical flow with deep matching. In ICCV, 2013.
[41] C. Wu. Towards linear-time incremental structure from motion. In 3DV, 2013.
[42] G. Yu and J.-M. Morel. ASIFT: An algorithm for fully affine invariant comparison. Image Processing On Line, 2011.
[43] S. Zagoruyko and N. Komodakis. Learning to compare image patches via convolutional neural networks. In CVPR, 2015.
[44] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using Places database. In NIPS, 2014.