
Autonomous Vision Group

Max Planck Institute for Intelligent Systems

OctNetFusion: Learning Depth Fusion from Data
Gernot Riegler¹, Ali Osman Ulusoy²,³, Horst Bischof¹, Andreas Geiger²,⁴
¹Graz University of Technology  ²MPI for Intelligent Systems Tübingen  ³Microsoft  ⁴ETH Zürich

Motivation

Figure: scene and conventional TSDF fusion results with 32 views (no noise), 2 views (no noise), and 32 views (with noise).
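The baseline shown in these panels is classical volumetric TSDF fusion in the spirit of Curless and Levoy [2]: each depth map is projected into a voxel grid and every voxel keeps a running weighted average of truncated signed distances. The sketch below illustrates that averaging rule with NumPy; the grid layout, intrinsics handling, and parameter names (grid_shape, origin, voxel_size, trunc) are illustrative assumptions, not the authors' implementation.

import numpy as np

def fuse_tsdf(depth_maps, poses, K, grid_shape, origin, voxel_size, trunc):
    """Running weighted average of truncated signed distances per voxel
    (classical TSDF fusion in the spirit of [2]); dense and unoptimized."""
    tsdf = np.zeros(grid_shape, dtype=np.float32)
    weight = np.zeros(grid_shape, dtype=np.float32)

    # world coordinates of all voxel centers
    idx = np.stack(np.meshgrid(*[np.arange(s) for s in grid_shape], indexing="ij"), axis=-1)
    pts = origin + (idx + 0.5) * voxel_size                      # (D, H, W, 3)

    for depth, T in zip(depth_maps, poses):                      # T: 4x4 world -> camera
        cam = pts @ T[:3, :3].T + T[:3, 3]                       # camera coordinates
        uv = cam @ K.T                                           # pinhole projection
        z_div = np.where(np.abs(uv[..., 2]) > 1e-9, uv[..., 2], 1e-9)
        u = np.round(uv[..., 0] / z_div).astype(int)
        v = np.round(uv[..., 1] / z_div).astype(int)
        z = cam[..., 2]
        inside = (z > 0) & (u >= 0) & (u < depth.shape[1]) & (v >= 0) & (v < depth.shape[0])
        d_obs = np.where(inside,
                         depth[np.clip(v, 0, depth.shape[0] - 1),
                               np.clip(u, 0, depth.shape[1] - 1)],
                         0.0)
        sdf = d_obs - z                                          # signed distance along the ray
        keep = inside & (d_obs > 0) & (sdf > -trunc)             # ignore voxels far behind the surface
        obs = np.clip(sdf / trunc, -1.0, 1.0)
        w_new = weight + keep                                    # weight 1 per observation
        tsdf = np.where(keep, (tsdf * weight + obs) / np.maximum(w_new, 1e-6), tsdf)
        weight = w_new
    return tsdf, weight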

OctNet

Use grid of shallow octrees

Figure: a shallow octree encoded as bit strings (one bit per cell, 1 = split further, 0 = leaf), e.g. root bit 1 followed by 01010001, 01000101, 00010000, 01010000.
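A shallow octree of depth up to 3 can be stored as a short bit string, one bit per cell indicating whether it is subdivided, as the bit strings in the figure suggest. The helper below is a minimal sketch of how such a bit string can be queried; the 1 + 8 + 64 bit layout and the child-ordering convention are illustrative assumptions, not the official OctNet memory layout.

def leaf_of_voxel(bits, i, j, k):
    """Return (level, cell_index) of the leaf cell containing voxel (i, j, k)
    inside an 8x8x8 block. bits[0] is the root bit, bits[1:9] the depth-1
    bits, bits[9:73] the depth-2 bits; bit = 1 means the cell is split."""
    if not bits[0]:
        return 0, 0                                   # root is a leaf covering the whole block
    # which of the 8 depth-1 children (each 4x4x4) contains the voxel
    c1 = (i // 4) * 4 + (j // 4) * 2 + (k // 4)
    if not bits[1 + c1]:
        return 1, c1
    # which of that child's 8 depth-2 children (each 2x2x2)
    c2 = ((i % 4) // 2) * 4 + ((j % 4) // 2) * 2 + ((k % 4) // 2)
    if not bits[1 + 8 + c1 * 8 + c2]:
        return 2, c1 * 8 + c2
    # otherwise the voxel itself is a depth-3 leaf (1x1x1)
    c3 = (i % 2) * 4 + (j % 2) * 2 + (k % 2)
    return 3, (c1 * 8 + c2) * 8 + c3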

Convolution

$O_\mathrm{out}[i,j,k] \;=\; \operatorname*{pool\_voxels}_{(\bar i,\bar j,\bar k)\,\in\,\Omega[i,j,k]} \Big( \sum_{l=0}^{L-1} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} W_{l,m,n} \cdot O_\mathrm{in}[\hat i, \hat j, \hat k] \Big)$

Figure: example of convolution (filter weights 0.125 / 0.250 / 0.125 / 0.000) followed by pooling on the octree grid.
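In words, the formula convolves at the finest voxel resolution and then pools the responses over all voxels Ω[i, j, k] belonging to the same octree cell, which is what the small filter example illustrates. Below is a dense reference sketch of that definition (not the efficient octree implementation of [4]); the function name, the cell_ids array, and the use of averaging as the pooling operator are assumptions for illustration.

import numpy as np
from scipy.ndimage import convolve

def octree_conv_dense_reference(dense_in, weights, cell_ids):
    """Dense reference for the octree convolution above: convolve at the
    finest resolution, then pool (average here) the response over all voxels
    that belong to the same octree cell. weights is an L x M x N filter and
    cell_ids assigns each voxel the id of its octree cell."""
    response = convolve(dense_in, weights, mode="constant")   # dense 3D convolution
    out = np.zeros_like(response)
    for cid in np.unique(cell_ids):
        mask = cell_ids == cid
        out[mask] = response[mask].mean()                     # pool_voxels over Ω[i, j, k]
    return out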

Pooling

$O_\mathrm{out}[i,j,k] \;=\; \begin{cases} O_\mathrm{in}[2i,2j,2k] & \text{if } \operatorname{cell\_width}(2i,2j,2k) > 1 \\ P & \text{else} \end{cases}$

$P \;=\; \max_{l,m,n\,\in\,\{0,1\}} O_\mathrm{in}[2i+l,\,2j+m,\,2k+n]$
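A dense stand-in for this pooling rule is sketched below; cell_width is assumed to be available as a dense array holding each voxel's octree cell size, which is a simplification of the actual octree data structure.

import numpy as np

def octree_pool(o_in, cell_width):
    """Cells that are already coarser than one voxel (cell_width > 1) are
    copied; all other 2x2x2 voxel groups are max-pooled."""
    D, H, W = o_in.shape
    out = np.empty((D // 2, H // 2, W // 2), dtype=o_in.dtype)
    for i in range(D // 2):
        for j in range(H // 2):
            for k in range(W // 2):
                if cell_width[2 * i, 2 * j, 2 * k] > 1:
                    out[i, j, k] = o_in[2 * i, 2 * j, 2 * k]      # keep the coarse cell value
                else:
                    out[i, j, k] = o_in[2*i:2*i+2, 2*j:2*j+2, 2*k:2*k+2].max()
    return out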


OctNet Unpooling

Figure: naïve vs. guided unpooling of the octree structure.
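Naïve unpooling subdivides every cell when copying values to the next finer resolution, whereas the guided variant restricts the finer structure to where it is actually needed. The sketch below is only one plausible dense reading of that distinction; the guide signal and the exact rule are assumptions, not spelled out on the poster.

import numpy as np

def unpool_naive(o_in):
    """Every cell is split: each value is copied into its 2x2x2 children."""
    return np.repeat(np.repeat(np.repeat(o_in, 2, axis=0), 2, axis=1), 2, axis=2)

def unpool_guided(o_in, guide_cell_width):
    """Assumed guided variant: values are copied as above, but cells are only
    marked as subdivided where the guiding octree structure is fine
    (guide_cell_width == 1); elsewhere a single coarse cell suffices."""
    values = unpool_naive(o_in)
    keep_fine = guide_cell_width == 1        # structure to adopt, derived from the guide
    return values, keep_fine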

OctNetFusion

Architecture

Figure: coarse-to-fine architecture; one encoder-decoder module each at 64³, 128³, and 256³ resolution, with a loss L attached to the reconstruction at every resolution.
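Read as a pipeline, the diagram stacks one encoder-decoder module per resolution and supervises every intermediate reconstruction. The sketch below shows that coarse-to-fine wiring with dense PyTorch stand-ins; the way the previous reconstruction is passed on (nearest-neighbor upsampling plus channel concatenation) and the helper make_stage are assumptions, not the authors' exact wiring.

import torch
import torch.nn as nn

class CoarseToFineFusion(nn.Module):
    """Coarse-to-fine layout suggested by the diagram: each resolution gets
    its input features plus the upsampled reconstruction from the previous,
    coarser stage, and a loss L is attached to every reconstruction."""
    def __init__(self, make_stage):
        super().__init__()
        # make_stage(resolution) -> an encoder-decoder module (see sketch below)
        self.stages = nn.ModuleList([make_stage(r) for r in (64, 128, 256)])

    def forward(self, feats_64, feats_128, feats_256):
        recons = []
        prev = None
        for stage, feats in zip(self.stages, (feats_64, feats_128, feats_256)):
            x = feats if prev is None else torch.cat(
                [feats, nn.functional.interpolate(prev, scale_factor=2)], dim=1)
            prev = stage(x)          # reconstruction at this resolution
            recons.append(prev)      # a loss L is applied to each entry
        return recons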

Encoder-Decoder Module

(layers listed as: type, input channels, output channels, kernel size, stride)

Input Features
Concat
Conv 1, 16, 3×3×3, 1
Conv 16, 32, 3×3×3, 1
Pool 32, 32, 2×2×2, 2
Conv 32, 32, 3×3×3, 1
Conv 32, 64, 3×3×3, 1
Pool 64, 64, 2×2×2, 2
Conv 64, 64, 3×3×3, 1
Conv 64, 64, 3×3×3, 1
Conv 64, 64, 3×3×3, 1
Unpool 64, 64, 2×2×2, 2
Concat (skip connection from the encoder)
Conv 128, 32, 3×3×3, 1
Conv 32, 32, 3×3×3, 1
Unpool 32, 32, 2×2×2, 2
Concat (skip connection from the encoder)
Conv 64, 16, 3×3×3, 1
Conv 16, 16, 3×3×3, 1
→ Structure Features | Reconstruction
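The layer list above is a 3D U-Net-style encoder-decoder with two pooling levels and skip connections (the Concat steps). A dense PyTorch stand-in is sketched below; the octree-specific convolution, pooling, and unpooling operators of [4] are replaced by regular dense layers, and the ReLU activations are an assumption since the poster does not list them.

import torch
import torch.nn as nn

def conv(cin, cout):
    return nn.Sequential(nn.Conv3d(cin, cout, 3, padding=1), nn.ReLU(inplace=True))

class EncoderDecoderModule(nn.Module):
    """Dense stand-in for the encoder-decoder module listed above."""
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(conv(1, 16), conv(16, 32))        # Conv 1,16 / Conv 16,32
        self.pool1 = nn.MaxPool3d(2, 2)                             # Pool 32,32
        self.enc2 = nn.Sequential(conv(32, 32), conv(32, 64))       # Conv 32,32 / Conv 32,64
        self.pool2 = nn.MaxPool3d(2, 2)                             # Pool 64,64
        self.bottom = nn.Sequential(conv(64, 64), conv(64, 64), conv(64, 64))
        self.up2 = nn.Upsample(scale_factor=2)                      # Unpool 64,64
        self.dec2 = nn.Sequential(conv(64 + 64, 32), conv(32, 32))  # Concat -> Conv 128,32 / 32,32
        self.up1 = nn.Upsample(scale_factor=2)                      # Unpool 32,32
        self.dec1 = nn.Sequential(conv(32 + 32, 16), conv(16, 16))  # Concat -> Conv 64,16 / 16,16

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool1(e1))
        b = self.bottom(self.pool2(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return d1   # 16 feature channels, fed to the structure and reconstruction heads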

Structure Module

The features O (n channels, D×H×W) are processed along two paths: Unpool n, n, 2×2×2, 2 produces P (n channels, 2D×2H×2W), which is split into Q (n channels, 2D×2H×2W) according to a split mask; in parallel, Conv n, 1, 3×3×3, 1 produces the intermediate reconstruction R (1 channel, D×H×W), to which the loss L is applied. The split mask derived from the intermediate reconstruction defines the resulting octree structure at the finer resolution.
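One way to read the structure module: a reconstruction head predicts R from the features O, and a split mask derived from R decides which cells of the unpooled feature map are refined. The sketch below follows that reading with dense stand-ins; the thresholding rule |R| < tau for deriving the split mask is an assumption, not stated on the poster.

import torch
import torch.nn as nn

class StructureModule(nn.Module):
    """Dense sketch of the structure module flow above."""
    def __init__(self, n):
        super().__init__()
        self.unpool = nn.Upsample(scale_factor=2)        # Unpool n, n, 2x2x2, 2 (dense stand-in)
        self.recon = nn.Conv3d(n, 1, 3, padding=1)       # Conv n, 1, 3x3x3, 1

    def forward(self, o, tau=1.0):
        r = self.recon(o)                                # intermediate reconstruction R (loss L)
        p = self.unpool(o)                               # P at 2D x 2H x 2W
        # assumed split rule: refine cells close to the surface (|R| < tau)
        split = (r.abs() < tau).float()
        split = nn.functional.interpolate(split, scale_factor=2)
        q = p * split                                    # Q: features kept only where cells split
        return q, r, split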

Volumetric Shape Completion

Voxlets Dataset [3]

Method           IoU    Precision  Recall
Zheng et al.*    0.528  0.773      0.630
Firman et al.*   0.585  0.793      0.658
Firman et al.    0.550  0.734      0.705
Ours             0.650  0.834      0.756
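For reference, the metrics in this table are the standard voxel-wise scores over occupancy grids; a minimal sketch follows (the exact Voxlets evaluation protocol, e.g. which region is evaluated, is not given on the poster).

import numpy as np

def completion_metrics(pred_occ, gt_occ):
    """IoU, precision, and recall between boolean occupancy grids."""
    tp = np.logical_and(pred_occ, gt_occ).sum()
    fp = np.logical_and(pred_occ, ~gt_occ).sum()
    fn = np.logical_and(~pred_occ, gt_occ).sum()
    iou = tp / float(tp + fp + fn)
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    return iou, precision, recall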

Figure: qualitative shape completion results (Zheng et al., Firman et al., Ours, Ground-Truth).

Volumetric Depth Fusion

ModelNet: Input Encoding

MAD (mm)  VolFus  TV-L1   Occ    TDF+Occ  TSDF   TSDF Hist
64³       4.136   3.899   2.095  1.987    1.710  1.715
128³      2.058   1.690   0.955  0.961    0.838  0.736
256³      1.020   0.778   0.410  0.408    0.383  0.337
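MAD is the mean absolute deviation of the fused volume from the ground truth, reported in millimetres; the one-liner below sketches that metric under the assumption that it is computed between signed-distance volumes, which the poster does not spell out.

import numpy as np

def mad_mm(pred_tsdf, gt_tsdf, voxel_size_mm):
    """Mean absolute deviation between predicted and ground-truth signed
    distances, converted to millimetres (evaluation protocol assumed)."""
    return np.abs(pred_tsdf - gt_tsdf).mean() * voxel_size_mm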

Figure: qualitative fusion results (VolFus [2], TV-L1 [5], Ours, Ground-Truth).

ModelNet: Number of Input Views

MAD (mm)  views=1                  views=2                  views=4                  views=6
          VolFus   TV-L1   Ours    VolFus   TV-L1   Ours    VolFus  TV-L1  Ours      VolFus  TV-L1  Ours
64³       59.295   48.345  7.855   15.626   13.267  2.755   4.136   3.899  1.715     3.171   2.905  1.484
128³      29.795   26.525  3.853   7.850    6.999   1.333   2.058   1.690  0.736     1.648   1.445  0.661
256³      14.919   14.529  1.927   3.929    3.537   0.616   1.020   0.778  0.337     0.842   0.644  0.360

ModelNet: Varying Input Noise

MAD (mm)  σ=0.00                  σ=0.01                  σ=0.02                  σ=0.03
          VolFus  TV-L1  Ours     VolFus  TV-L1  Ours     VolFus  TV-L1  Ours     VolFus  TV-L1  Ours
64³       3.020   3.272  1.647    3.439   3.454  1.487    4.136   3.899  1.715    4.852   4.413  1.938
128³      1.330   1.396  0.744    1.647   1.543  0.676    2.058   1.690  0.736    2.420   1.850  0.804
256³      0.621   0.637  0.319    0.819   0.697  0.321    1.020   0.778  0.429    1.188   0.858  0.402

Figure: qualitative fusion results at σ=0.0 and σ=0.03 (VolFus [2], TV-L1 [5], Ours, Ground-Truth).

Kinect Object Scans [1]

MAD (mm)  views=10                  views=20
          VolFus   TV-L1   Ours     VolFus  TV-L1   Ours
64³       103.855  25.976  22.540   72.631  22.081  18.422
128³      58.802   12.839  11.827   41.631  11.924  9.637
256³      31.707   5.372   4.806    22.555  5.195   4.110

References
[1] S. Choi et al. "A Large Dataset of Object Scans". In: arXiv.org 1602.02481 (2016).
[2] B. Curless and M. Levoy. "A Volumetric Method for Building Complex Models from Range Images". In: SIGGRAPH. 1996.
[3] M. Firman et al. "Structured Prediction of Unobserved Voxels From a Single Depth Image". In: CVPR. 2016.
[4] G. Riegler et al. "OctNet: Learning Deep 3D Representations at High Resolutions". In: CVPR. 2017.
[5] C. Zach et al. "A Globally Optimal Algorithm for Robust TV-L1 Range Image Integration". In: ICCV. 2007.
[6] B. Zheng et al. "Beyond Point Clouds: Scene Understanding by Reasoning Geometry and Physics". In: CVPR. 2013.