
Hierarchical Metric Learning and Matching for 2D and 3D Geometric Correspondences*†

Mohammed E. Fathy^1, Quoc-Huy Tran^2, M. Zeeshan Zia^3, Paul Vernaza^2, and Manmohan Chandraker^2,4

^1 Google Cloud AI   ^2 NEC Laboratories America, Inc.   ^3 Microsoft   ^4 University of California, San Diego

Problems & Contributions

[Teaser figure] Semantic features: localization robust to appearance variations. Geometric features: localization sensitive to local structure. Hierarchical Metric Learning and Matching (HiLM) combines the two.

Our hierarchical metric learning and matching (HiLM) retains the best properties of various levels of abstraction in CNN feature representations. Corresponding pixels are overlaid on the input images with the same colors.

Problems: Given two RGB images capturing a common scene from different viewpoints, our task is to estimate 2D geometric correspondences between the pixels in the images. In addition, we propose an extension to computing 3D geometric correspondences from two point clouds.

Contributions:
• We derive useful insights for designing CNN-based correspondence methods, where both deep and shallow features are exploited.
• Hierarchical metric losses at multiple CNN layers for correspondence learning.
• Hierarchical matching within CNN activation maps for refining correspondences.
• Experimental results show significant improvements over previous single-layer and feature fusion approaches.
• The improvements further translate to different data modalities (2D images, 3D point clouds) and generalize across various datasets (Sintel, HPatches, KITTI).

2D CNN Architecture for 2D Correspondence Estimation

[2D architecture diagram] Siamese towers: Conv-1: 96 7x7/S2 → ReLU/LRN/Max-Pool: 3x3/S2 → Conv-2: 256 5x5/S1 → ReLU/LRN/Max-Pool: 5x5/S1 → Conv-3: 512 3x3/D4 → Conv-4: 512 3x3/D4 → Conv-5: 512 3x3/D4. Two feature-extraction heads per tower, one at a shallow layer (FE-Conv: 128 3x3) and one at a deep layer (FE-Conv: 128 1x1), each followed by L2-Normalize, on-the-fly hard-negative mining, and a CCL loss during training. At test time, coarse matching on the deep features is refined by constrained matching on the shallow features to produce precise matches.
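The test-time hierarchical matching can be illustrated with a minimal sketch: coarse nearest-neighbor matching on the deeper (coarser) features proposes a candidate location, and constrained matching then refines it within a local window of the shallower (finer) features. The NumPy code below is an assumed illustration, not the released implementation; the resolution ratio `scale`, the window `radius`, and the assumption that the fine map is indexed at query-pixel resolution are all hypothetical.

```python
import numpy as np

def hierarchical_match(deep_a, deep_b, fine_a, fine_b, y, x, scale=4, radius=8):
    """Coarse-to-fine matching sketch for one query pixel (y, x) in image A.

    deep_a, deep_b: [Hd, Wd, C] L2-normalized deep (coarse) feature maps.
    fine_a, fine_b: [Hf, Wf, C] L2-normalized shallow (fine) feature maps.
    scale:          assumed resolution ratio between the fine and deep maps.
    radius:         half-size of the constrained search window (fine level).
    """
    # 1. Coarse matching: nearest neighbor of the deep descriptor over all of image B.
    q = deep_a[y // scale, x // scale]
    sim = np.tensordot(deep_b, q, axes=([2], [0]))        # cosine similarity (unit-norm features)
    cy, cx = np.unravel_index(np.argmax(sim), sim.shape)  # coarse match location

    # 2. Constrained matching: refine with the shallow features inside a window
    #    around the coarse match.
    fy, fx = cy * scale, cx * scale
    y0, y1 = max(0, fy - radius), min(fine_b.shape[0], fy + radius + 1)
    x0, x1 = max(0, fx - radius), min(fine_b.shape[1], fx + radius + 1)
    qf = fine_a[y, x]
    sim_f = np.tensordot(fine_b[y0:y1, x0:x1], qf, axes=([2], [0]))
    wy, wx = np.unravel_index(np.argmax(sim_f), sim_f.shape)
    return y0 + wy, x0 + wx                               # precise match in image B
```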

Deep supervision improves generalization by encouraging the CNN early on to learn task-relevant features. Also, both deep and shallow layers can be supervised simultaneously within one CNN.

Hard Negative Mining happens "on-the-fly" to speed up training and leverage the latest instance of network weights. Hard negative mining is employed independently for each of the feature levels being supervised.
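As a rough, assumed illustration of what per-level "on-the-fly" hard negative mining could look like (not the authors' exact procedure): for each annotated positive pair, the descriptor in the other image that is most similar under the current network weights, yet sufficiently far from the true match, is taken as the negative. The names and the `min_pixel_dist` heuristic below are hypothetical.

```python
import torch

def mine_hard_negatives(feat_a, feat_b, coords_b, min_pixel_dist=16.0):
    """On-the-fly hard negative mining sketch for one feature level (PyTorch assumed).

    feat_a, feat_b:  [N, C] L2-normalized descriptors of N corresponding pairs
                     (row i of feat_a matches row i of feat_b).
    coords_b:        [N, 2] pixel coordinates of the descriptors in image B.
    min_pixel_dist:  negatives must lie at least this far from the true match.
    Returns, for each row of feat_a, the index of its hardest negative in feat_b.
    """
    sim = feat_a @ feat_b.t()                                        # [N, N] cosine similarities
    pix_dist = torch.cdist(coords_b.float(), coords_b.float())       # [N, N] pixel distances
    sim = sim.masked_fill(pix_dist < min_pixel_dist, float('-inf'))  # exclude near-matches (incl. the positive)
    return sim.argmax(dim=1)                                         # most similar remaining descriptor
```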

Correspondence Contrastive Loss (CCL):

$$
\mathcal{L} := \sum_{l=1}^{L} \; \sum_{(x,\, x',\, y) \in \mathcal{D}} y \cdot d_l^2(x, x') \;+\; (1 - y) \cdot \big( \max(0,\; m - d_l(x, x')) \big)^2
$$
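A minimal PyTorch sketch of this hierarchical CCL, assuming the per-level descriptors have already been sampled at the annotated pixel pairs (names and shapes are illustrative, not the authors' implementation):

```python
import torch

def hierarchical_ccl(feats_a, feats_b, y, m=1.0):
    """Correspondence Contrastive Loss summed over the supervised levels l = 1..L.

    feats_a, feats_b: lists of [N, C_l] L2-normalized descriptors, one tensor per
                      supervised level (e.g. the shallow and deep FE-Conv outputs).
    y:                [N] float tensor, 1 for matching pairs, 0 for non-matching pairs.
    m:                contrastive margin.
    """
    loss = feats_a[0].new_zeros(())
    for fa, fb in zip(feats_a, feats_b):                  # sum over levels l
        d = (fa - fb).norm(dim=1)                         # d_l(x, x')
        loss = loss + (y * d.pow(2)                       # pull matching pairs together
                       + (1.0 - y) * torch.clamp(m - d, min=0).pow(2)).sum()  # push others apart
    return loss
```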

*Part of this work was done during M. E. Fathy's internship at NEC Labs America in Cupertino, CA. The authors thank C. B. Choy and A. Zeng for their help with the code of UCN and 3DMatch, respectively.

Baseline Architectures Inspired by Previous Works

[Baseline architecture diagrams]
(a) convi-net (i = 3) [1]: Conv-1: 96 7x7/S2 → ReLU/LRN/Max-Pool: 3x3/S2 → Conv-2: 256 5x5/S1 → ReLU/LRN/Max-Pool: 5x5/S1 → Conv-3: 512 3x3/D4 → L2-Normalize → CCL Loss.
(b) hypercolumn-fusion [2]: the full Conv-1 through Conv-5 stack; feature maps from several layers are pooled (Max-Pool: 1x1/S2) to a common resolution and concatenated → FE-Conv: 128 1x1 → L2-Normalize → CCL Loss.
(c) topdown-fusion [3]: a Conv-1 through Conv-5 stack (with D2/S2 variants of Conv-3 to Conv-5); deep features are refined top-down (Conv: 512 1x1/ReLU, Conv: 512 3x3, Conv: 256 3x3), upsampled x2, and concatenated with shallower features → FE-Conv: 128 1x1 → L2-Normalize → CCL Loss.

2D Correspondence Results

[PCK plots, thresholds 1-10 px and 10-100 px] Methods compared: conv1-net through conv5-net, hypercolumn-fusion, topdown-fusion, HiLM (conv2+conv3), HiLM (conv2+conv4), HiLM (conv2+conv5), and HiLM (conv2+conv5) trained on Sintel. Comparison of different CNN-based methods for 2D correspondence estimation on KITTI Flow 2015.
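PCK here denotes the percentage of predicted correspondences falling within a given pixel threshold of the ground truth; a small sketch of how such a curve might be computed (assumed, not the authors' evaluation code):

```python
import numpy as np

def pck_curve(pred, gt, thresholds=range(1, 11)):
    """PCK: percentage of predicted matches within each pixel threshold.

    pred, gt:   [N, 2] arrays of predicted and ground-truth pixel coordinates.
    thresholds: pixel thresholds at which to evaluate (e.g. 1..10 as in the plots).
    """
    err = np.linalg.norm(pred - gt, axis=1)               # per-match Euclidean error
    return [100.0 * float((err <= t).mean()) for t in thresholds]
```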

[PCK plots, thresholds 1-10 px and 10-100 px] Methods compared: SIFT, DAISY, KAZE, UCN, HiLM (VGG-M), and HiLM (GoogLeNet). Comparison of CNN-based and hand-crafted methods for 2D correspondence estimation on KITTI Flow 2015.

[PCK plots, thresholds 1-10 px and 10-100 px] Methods compared: hypercolumn-fusion, topdown-fusion, and HiLM (conv2+conv3/conv4/conv5) trained on KITTI and on Sintel. Generalization results when training on MPI Sintel and evaluating on KITTI Flow 2015.

†Code and models will be made available at http://www.nec-labs.com/~mas/HiLM/.

Optical Flow Results

Method          Fl-bg    Fl-fg    Fl-all
FlowNet2        10.75%    8.75%   10.41%
SDF              8.61%   26.69%   11.62%
SOF             14.63%   27.73%   16.81%
CNN-HPM         18.33%   24.96%   19.44%
HiLM (Ours)     23.73%   21.79%   23.41%
SPM-BP          24.06%   24.97%   24.21%
FullFlow        23.09%   30.11%   24.26%
AutoScaler [4]  21.85%   31.62%   25.64%
EpicFlow        25.81%   33.56%   27.10%
DeepFlow2       27.96%   35.28%   29.18%
PatchCollider   30.60%   33.09%   31.01%

Quantitative results on KITTI Flow 2015. AutoScaler [4] is an image-pyramid/multi-scale 2D correspondence estimation method. Bold represents the best result, underlined the second best.

Qualitative results on KITTI Flow 2015. 1st row: input images. 2nd row: DeepFlow2. 3rd row: EpicFlow. 4th row: SPM-BP. 5th row: HiLM (Ours). Red depicts high error, blue represents low error.

3D CNN Architecture for 3D Correspondence Estimation

[3D architecture diagram] Siamese 3D towers over a 30^3 TDF voxel grid of a local patch: Conv-1: 64 3^3/S1 → Pool-1: 64 2^3/S2 → Conv-2: 64 3^3/S1 → Conv-3: 128 3^3/S1 → Conv-4: 128 3^3/S1 → Conv-5: 256 3^3/S1 → Conv-6: 256 3^3/S1 → Conv-7: 512 3^3/S1 → Conv-8: 512 3^3/S1, with an auxiliary shallow branch Conv-2a: 128 3^3/S2 → Conv-2b: 256 3^3/S2. Feature-extraction heads FE: 512 5^3/S1 and FE: 512 1^3/S1 each feed a CCL loss during training. At test time, coarse matching is followed by constrained matching to produce precise matches.
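The TDF input follows 3DMatch [5]: each voxel of the 30^3 grid stores a truncated distance to the nearest surface point, flipped so that voxels on the surface are close to 1 and empty space decays to 0. The sketch below is an assumed illustration of that preprocessing (voxel size and truncation are hypothetical), not the authors' code:

```python
import numpy as np
from scipy.spatial import cKDTree

def tdf_voxel_grid(points, center, voxel_size=0.01, dim=30, trunc_voxels=5):
    """Truncated Distance Function grid for a local patch around `center`.

    points:       [M, 3] point cloud of the scene (meters).
    center:       [3] center of the local patch (a keypoint).
    voxel_size:   voxel edge length (assumed value).
    dim:          grid resolution (30^3 as in the architecture above).
    trunc_voxels: truncation margin, in voxels.
    """
    # Voxel-center coordinates of the dim^3 grid centered on the patch.
    offsets = (np.arange(dim) - dim / 2.0 + 0.5) * voxel_size
    gx, gy, gz = np.meshgrid(offsets, offsets, offsets, indexing="ij")
    centers = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3) + center

    # Distance from each voxel center to the nearest point, truncated and
    # flipped: 1 at the surface, decaying to 0 in free space.
    dist, _ = cKDTree(points).query(centers)
    tdf = 1.0 - np.minimum(dist / (trunc_voxels * voxel_size), 1.0)
    return tdf.reshape(dim, dim, dim)
```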

For a fair comparison with 3DMatch [5], a CNN-based approach to 3D correspondence estimation, we disable hard negative mining in our 3D CNN architecture.

3D Correspondence Results

[PCK plots, thresholds 1-10 cm and 10-25 cm] Methods compared: 3DMatch, HiL (conv2), HiL (conv8), and HiLM (conv2+conv8). Comparison of different CNN-based methods for 3D correspondence estimation.

References
[1] Choy et al.: Universal Correspondence Network. In: NIPS (2016)
[2] Hariharan et al.: Hypercolumns for Object Segmentation and Fine-Grained Localization. In: CVPR (2015)
[3] Pinheiro et al.: Learning to Refine Object Segments. In: ECCV (2016)
[4] Wang et al.: AutoScaler: Scale-Attention Networks for Visual Correspondence. In: BMVC (2017)
[5] Zeng et al.: 3DMatch: Learning Local Geometric Descriptors from RGB-D Reconstructions. In: CVPR (2017)