Pattern Recognition 74 (2018) 122–133
Contents lists available at ScienceDirect
Pattern Recognition
journal homepage: www.elsevier.com/locate/patcog
Learning to refine depth for robust stereo estimation
Feiyang Cheng a,b, Xuming He b,1, Hong Zhang a,∗
a Image Research Center, Beihang University, Xueyuan Rd., Haidian Dist., Beijing, 100191, China
b Computer Vision Group, National ICT Australia, Locked Bag 8001, Canberra, 2601, Australia
Article info
Article history:
Received 11 March 2017
Revised 18 July 2017
Accepted 26 July 2017
Available online 31 August 2017
Keywords:
Stereo matching
Confidence measure
Convolutional neural network
Abstract
Traditional depth estimation from stereo images is usually formulated as a patch-matching problem,
which requires post-processing stages to impose smoothness and handle depth discontinuities and oc-
clusions. While recent deep network approaches directly learn a regressor for the entire disparity map,
they still suffer from large errors near the depth discontinuities. In this paper, we propose a novel method
to refine the disparity maps generated by deep regression networks. Instead of relying on ad hoc post-
processing, we learn a unified deep network model that predicts a confidence map and the disparity
gradients from the learned feature representation in regression networks. We integrate the initial dispar-
ity estimation, the confidence map and the disparity gradients into a continuous Markov Random Field
(MRF) for depth refinement, which is capable of representing rich surface structures. Our disparity MRF
model can be solved via efficient global optimization in closed form. We evaluate our approach on
both synthetic and real-world datasets, and the results show that it achieves state-of-the-art performance
and produces more structure-preserving disparity maps with smaller errors in the neighborhood of depth
boundaries.
© 2017 Elsevier Ltd. All rights reserved.
1. Introduction
Inferring depth from images is a fundamental problem in com-
puter vision [1] , vital for a large number of real-world applica-
tions such as 3D scene reconstruction, robotics and autonomous
driving. Despite the progress in predicting depth from a single im-
age [2,3] or using active sensors [4,5] , stereo image matching still
remains one of the most effective strategies for depth estimation
due to its efficiency and broad range of applicable settings [6–
9] . The traditional stereo matching paradigm typically comprises four
steps, including 1) computing matching cost, 2) cost aggrega-
tion, 3) global optimization and 4) disparity refinement [10,11] .
In particular, a variety of ad hoc post-processing methods have
been studied to handle occlusion and uncertainty in the match-
ing stage [9,10,12,13] . Moreover, most existing methods use a dis-
crete global optimization to enforce smoothness in the estimated
disparity map [7,14,15] and refine it into subpixel accuracy after-
wards [9,16] . However, such a stage-wise pipeline is prone to errors
in each step and lacks an overall objective to optimize.
∗ Corresponding author at: 37 Xueyuan Road, Haidian District, Beijing, China.
E-mail addresses: [email protected] (F. Cheng), [email protected] (X. He), [email protected], [email protected] (H. Zhang).
1 X. He was also with Australian National University and is currently at ShanghaiTech University.
http://dx.doi.org/10.1016/j.patcog.2017.07.027
0031-3203/© 2017 Elsevier Ltd. All rights reserved.
Recent progress in learning-based deep neural networks has provided an alternative strategy that aims at building an end-to-end system for depth estimation. Most prior work focuses on single image-based depth prediction [2,17], or on learning neural networks to compute matching cost [8,18,19]. By contrast, Mayer et al. proposed end-to-end trainable deep regression networks for stereo estimation [20]. Despite the achieved competitive performance, the networks suffer from the well-known foreground-fattening problem, which appears as a halo-effect near object boundaries, as shown in Fig. 1b.

In this paper, we propose a deep network approach to refine the disparity maps generated by deep regression networks such as [20]. Our main idea is based on a detect-and-correct strategy [13,21], in which we find the regions with low confidence in initial predictions and exploit the disparity gradients to reconstruct more accurate and structure-preserving disparity maps. To this end, we design a novel two-branch fully convolutional network that takes the regression network features and images as input, and predicts a dense confidence map for the regressed disparities and a disparity gradient map.

Given the confidence estimation and predicted disparity gradients, we develop a continuous Markov Random Field (MRF) to refine the disparities generated by the regression network. Our MRF takes the initial disparity map as its observation and imposes a structure-preserving prior based on the estimated confidence scores and disparity gradients. Specifically, we enforce the
Fig. 1. An illustration of refining estimated disparity map based on deep networks. (a) Reference image and two cropped patches of ground truth disparity. (b) Initial result
of the DispNetCorr1D which has blurred object boundaries and noisy structures on object surfaces [20] . (c) Halo-free and structure-preserving result of our method.
Fig. 2. An overview of our method. The input consists of the reference image and the matching feature extracted from a pre-trained CNN for disparity estimation [20] .
We learn a confidence network (ConfNet) and a gradient network (GradNet) to predict the correctness of initial estimated disparities and the disparity gradients. A global
optimization method is proposed to combine the useful information to reconstruct high-quality disparity maps.
refined disparity map to be consistent with the predicted disparity gradients, especially in the regions with low confidence. We solve our continuous disparity MRF via efficient global optimization, thanks to its quadratic form and convexity. An example of our results is shown in Fig. 1(c): large errors are removed near disparity discontinuities and surface structures are improved. We refer the readers to Fig. 2 for an overview of our framework.

We extensively evaluate our refinement method on four challenging synthetic datasets and the real-world KITTI2015 stereo benchmark. The results show that our approach outperforms the state-of-the-art and several global optimization baselines. In particular, our method not only produces accurate disparity maps but also results in better surface structures. Moreover, confidence learning via a deep network is more robust than learning confidence with hand-crafted features.

2. Related work

Depth estimation from stereo images has a long history in computer vision [22], and it is beyond the scope of this paper to present a complete review. We refer the readers to recent surveys of the literature [10,11,22]. Here, we mainly focus on recent learning-based work, including deep networks for depth prediction, learning confidence of stereo matching, and modeling depth priors with MRFs.

Deep networks for depth prediction: Deep learning approaches have achieved significant progress in depth estimation recently. A large number of prior works focus on single image-based depth prediction, e.g., [2,17], which typically build an end-to-end trainable deep network to directly predict a dense depth map. However, such data-driven methods require a large training dataset with ground truth depth, which is difficult to obtain for outdoor scenes. As a consequence, until recently deep learning methods on generic stereo estimation mostly considered the task of learning local matching cost. In particular, Zbontar and LeCun learn a convolutional neural network for estimating the affinity of a pair of patches, which outperforms matching cost functions based on low-level features [8,9]. However, computing matching cost for each patch pair is time-consuming during inference. An efficient architecture is proposed in [18] by learning a probability distribution over all disparities at once with a dot-product layer. Using a dot-product layer is also analyzed in [9,19] to speed up the matching cost computation. The main drawback of those patch matching networks is that they have to rely on ad hoc post-processing to obtain the final disparity estimation. Beyond the stereo matching literature, Zagoruyko and Komodakis propose various architectures of deep neural networks for comparing image patches [23].

Only recent work by Mayer et al. [20] learns deep regression networks to estimate continuous disparities, thanks to the availability of large synthetic stereo datasets. Their networks adopt an encoder-decoder structure, which enables end-to-end training and efficient inference at test time. Our work is built on top of one of their regression networks and aims at removing the aforementioned halo-effect near disparity discontinuities and reconstructing structure-preserving disparity maps. We note that unsupervised deep learning methods for stereo have also attracted much attention recently [24,25]. Nevertheless, their performance is still inferior to networks trained with strong supervision.

Learning confidence in stereo estimation: Traditional stereo vision often uses confidence measures of disparity estimation to remove potentially large errors (i.e., bad pixels) within disparity maps [26]. For example, the left-right consistency (LRC) check and simple interpolation are commonly used to tackle occlusions and mismatches [10,12,13]. Learning confidence measures with various
Fig. 3. The architecture of our confidence and gradient networks.
Fig. 4. Sparsification curves of the first frames of the FlyingThings3D, Driving, Monkaa and Sintel datasets. Optimal method means that the ground truth is used to remove
bad pixels.
hand-crafted confidence features has been investigated in [12]. Spyropoulos et al. propose to learn a confidence map using similar hand-crafted features and to incorporate the predicted confidences into a discrete MRF model to obtain dense high-quality disparity maps [21]. Park and Yoon use a two-stage learning method to choose effective confidence measures first and then use them as features to refine the confidence estimation [13]. The predicted confidence map is used to modulate the matching cost, which can then be incorporated into the semi-global optimization framework. A training data generation method is proposed in [27] to improve the accuracy of confidence prediction. With disparity patches as the input, convolutional neural networks have recently proved to be effective for learning confidence measures [28,29]. By contrast, we propose to learn the confidence map using a deep neural network based on learned matching and image features, which can be integrated into a unified deep network model.

Modeling depth smoothness: Modeling depth structures has been extensively studied in the stereo matching literature [10]. In the discrete setting, MRF-based global optimization methods often make use of smoothness priors such as first-order piece-wise smoothness or second-order smoothness to tackle challenging scenarios such as occlusions or textureless regions [7,15]. The continu-
Fig. 5. AUC values on the FlyingThings3D, Driving, Monkaa and Sintel datasets. The AUC values are sorted in ascending order of the optimal AUC values.
ous counterparts typically formulate the problem in variational optimization frameworks in which the smoothness priors are based on total variation functionals [30,31]. The fast bilateral solver proposed in [32] achieves competitive performance on various vision tasks. However, these smoothness priors are globally defined based on relatively simple assumptions about surface structures. Recently, surface normal classifiers have been adopted to impose image-dependent constraints for depth estimation and multi-view reconstruction [33,34], which has shown effective improvements. In contrast, we propose a deep learning based approach that estimates the disparity gradients for every image and incorporates them into our MRF model to obtain structure-preserving disparity maps.

3. Two-branch fully convolutional networks

Our goal is to refine an initial disparity estimation generated by the deep regression network [20], given a pair of rectified images. To achieve this, we follow a detect-and-correct strategy, in which we first identify noisy regions in the disparity estimation based on a predicted confidence map, and then estimate the disparity gradients to capture surface structures in the scene. Finally, the confidence map and disparity gradients are integrated into a continuous Markov Random Field (MRF), based on which we compute a structure-preserving disparity map via global optimization. In this section, we introduce a two-branch deep network for computing the confidence map and disparity gradients. The MRF-based refinement process is described in Section 4.

An overview of our deep network is shown in Fig. 2. In addition to the original regression layers, we add two sub-networks to the deep regression network, for disparity confidence and gradient prediction respectively. Both sub-networks share a similar network architecture except for their input features and output layer. Each sub-network consists of 5 convolutional layers. For the first four layers, we use 3 × 3 convolution kernels and rectified linear units (ReLU) as their activation functions. The fifth layer performs 1 × 1 convolution with its full connection as in the fully convolutional network (FCN) [35], followed by a dropout layer. The architecture is shown in Fig. 3. We now describe the specific details of these two sub-networks.
3.1. Confidence network
The confidence network aims at detecting regions in the predicted disparity map that have large uncertainty or errors. We formulate it as a binary labeling task and design an FCN as described above to predict a confidence map, in which each pixel indicates the probability of being correct.

The input to our confidence network consists of three types of features: the reference image, the last concatenated matching features of the DispNetCorr1D in [20], and the left-right consistency (LRC) [12,13,21]. The LRC map C_lr is defined as follows:

C_lr(i, j) = | d^l_{i,j} − d^r_{i, j − d^l_{i,j}} |  if (j − d^l_{i,j}) > 0;  −1 otherwise    (1)

where i, j are the pixel coordinates in the reference image, and d^l and d^r denote the left and right estimated disparity maps respectively. This feature combines the disparity information from the left and the right views to identify potential mismatches. We stack these three inputs into a multi-dimensional image before passing them into the confidence network.
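As a minimal illustrative sketch (not the authors' released code), the LRC feature of Eq. (1) can be computed as follows, assuming the two disparity maps are NumPy arrays and the warp column is rounded to the nearest integer:

```python
import numpy as np

def lrc_map(d_left, d_right):
    """Left-right consistency feature (Eq. 1): warp each reference pixel
    (i, j) into the right view using the left disparity and compare the two
    disparity estimates; -1 marks pixels whose match falls outside the image."""
    h, w = d_left.shape
    c = np.full((h, w), -1.0)
    for i in range(h):
        for j in range(w):
            # column of the corresponding pixel in the right view
            jr = j - int(round(d_left[i, j]))
            if 0 < jr < w:  # Eq. (1) requires (j - d^l) > 0; jr < w added for safety
                c[i, j] = abs(d_left[i, j] - d_right[i, jr])
    return c
```

Consistent left and right estimates give values near zero, while occlusions and mismatches produce large values or the out-of-range marker −1.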
The output layer of the confidence network for each pixel is a softmax function with two outputs. We train the network us-
Fig. 6. Comparisons of the disparity maps on the synthetic datasets. From top to bottom: reference image, initial disparity map, initial error map, predicted confidences, our disparity map and our error map. In the error maps, bad pixels are marked red and blue regions indicate small errors. The predicted confidence maps locate the bad pixels well, and the refined disparity maps show that our method focuses on correcting these regions without degrading other regions. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
ing the cross-entropy loss and stochastic gradient descent based on Adam [36]. On the training data, we generate the ground truth binary label map L by thresholding the disparity error at each pixel. Specifically, for the pixel at location (i, j),

L(i, j) = 1(| d_{i,j} − d*_{i,j} | ≤ t)    (2)

where t is a predefined threshold, 1(·) is the indicator function, and d and d* denote the estimated disparity and the ground truth respectively.
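The label generation of Eq. (2) amounts to a single thresholding operation; a sketch, assuming NumPy arrays and the paper's 1-pixel threshold:

```python
import numpy as np

def confidence_labels(d_est, d_gt, t=1.0):
    """Ground-truth binary labels for the confidence network (Eq. 2):
    a pixel is labelled 1 (correct) when its absolute disparity error is
    at most the threshold t, and 0 (bad pixel) otherwise."""
    return (np.abs(d_est - d_gt) <= t).astype(np.float32)
```

The resulting map serves as the target for the per-pixel two-way softmax trained with cross-entropy.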
3.2. Gradient network
To exploit 3D surface structure to refine the initial disparity estimation, we focus on the smoothness property of surfaces, which is commonly used in prior work on stereo estimation. However, instead of relying on a generic smoothness prior, we design an FCN sub-network to predict the disparity gradients for the reference image and use them as an image-specific prior for disparity refinement.

Our gradient network takes the reference image and the last concatenated matching features of the DispNetCorr1D as input and regresses the disparity gradients at each pixel. Unlike the confidence network, we use the leaky ReLU activation with a negative slope of 0.1 in all the layers. The output layer of the gradient network has two neurons for each pixel, predicting its disparity gradients along the horizontal and vertical directions.

We use forward-differences to approximately compute the ground truth gradients for training our deep gradient network. The
Fig. 7. AUC value comparison of our ConfNet and RDF12. The gaps between our ConfNet and the optimal AUC values are relatively narrow in most cases.
2 The bad sequence ‘A-149’ which contains totally occluded views is not used.
gradients at disparity discontinuities are ill-defined, and thus we mask out pixels whose forward-difference values are larger than a threshold during training. We set the threshold empirically to 1 pixel in this work. The L1 loss function is used for training, and the weights are learned with Adam [36].
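The ground-truth gradient targets and the discontinuity mask described above can be sketched as follows (an illustration under the stated 1-pixel threshold, not the authors' implementation; boundary gradients are set to zero here as an assumption):

```python
import numpy as np

def gradient_targets(d_gt, t=1.0):
    """Forward-difference ground-truth disparity gradients for training the
    gradient network, plus a validity mask that drops the ill-defined
    gradients at disparity discontinuities (|forward difference| > t)."""
    gx = np.zeros_like(d_gt)
    gy = np.zeros_like(d_gt)
    gx[:, :-1] = d_gt[:, 1:] - d_gt[:, :-1]   # horizontal forward difference
    gy[:-1, :] = d_gt[1:, :] - d_gt[:-1, :]   # vertical forward difference
    mask = (np.abs(gx) <= t) & (np.abs(gy) <= t)  # pixels kept in the L1 loss
    return gx, gy, mask
```

Only pixels where the mask is true contribute to the L1 training loss, so depth jumps do not pollute the gradient targets.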
4. Continuous MRF for depth refinement

Given the disparity gradient map G and the confidence map τ, we now build a continuous MRF to refine the initial disparity map D. The MRF model encodes an image-specific smoothness prior by integrating the predicted confidence scores and disparity gradients.

Formally, denoting the reference image as I and the refined disparity map as S, we define the energy function of the MRF as follows:

E(S) = E_d(S | D, τ) + α E_g(S | G) + β E_smooth(S | I)    (3)

where E_d is the data term, E_g is the gradient consistency term, and E_smooth is the disparity smoothness term. α and β are hyper-parameters that balance the three terms.

Data term: The data term E_d(S | D, τ) enforces that the refined disparity stays close to the initial disparity in confident regions. Specifically,

E_d(S | D, τ) = Σ_p τ_p (S_p − D_p)²    (4)

where D_p and S_p are the initial and refined disparities at pixel p, and τ_p is the confidence score of the initial disparity at p.

Gradient consistency term: The gradient term E_g(S | G) enforces consistency between the gradients of the refined disparity and the predicted gradients:

E_g(S | G) = Σ_p [ φ^x_p (∇_x S_p − G^x_p)² + φ^y_p (∇_y S_p − G^y_p)² ]    (5)

where ∇_x and ∇_y are the horizontal and vertical gradient operators, and φ^x_p and φ^y_p are image-based weights that approximately estimate the likelihood of smooth surfaces, defined as:

φ^x_p = exp(−‖∇_x I_p‖² / (2σ_r²)),  φ^y_p = exp(−‖∇_y I_p‖² / (2σ_r²))    (6)

where ∇_x I_p and ∇_y I_p are the forward-differences of the color of pixel p in the horizontal and vertical directions respectively, and σ_r is the bandwidth of the Gaussian kernel.

Disparity smoothness term: The disparity smoothness term enforces that the refined disparity at each potential disparity discontinuity stays close to the disparities of neighboring pixels with similar appearance:

E_smooth(S | I) = Σ_p (1 − φ_p) (S_p − Σ_{q ∈ N_p} w_q S_q)²    (7)

where N_p is the neighborhood of pixel p defined by a local window; a 3 × 3 patch centered at p is used in our work. w_q is a weight based on the intensity difference of the two pixels, w_q = exp(−‖I_p − I_q‖² / (2σ_p²)), where σ_p is the variance of the intensities in a window around p. The same regularization has been exploited in image colorization and segmentation algorithms [37]. This term and the gradient consistency term together represent an image-specific disparity prior for surfaces and disparity discontinuities.

Global optimization for disparity refinement: We compute the refined disparity map by minimizing the energy function E(S) of the continuous MRF. As the energy function is a sum of quadratic functions with positive weights, it is convex and can be minimized in closed form.

Specifically, setting ∂E(S)/∂S = 0, we find the optimal disparity efficiently by solving a sparse linear system AS = b, where A and b are defined as

A = τ − α(C̃^x φ^x C^x + C̃^y φ^y C^y) + β(2𝕀 − φ^x − φ^y)(𝕀 − W)
b = τ D − α(C̃^x φ^x G^x + C̃^y φ^y G^y)    (8)

Let N denote the number of pixels and 𝕀 the N × N identity matrix. τ is an N × N diagonal matrix with the τ_p as its diagonal elements. C̃ and C are Toeplitz matrices that perform backward-difference and forward-difference computations respectively. φ is also an N × N diagonal matrix with the φ_p as its diagonal elements. W is a matrix that stores the weights w_q, q ∈ N_p, of each pixel p in its p-th row. The hyper-parameters of the MRF are validated on a held-out training set and fixed throughout our experiments.
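To make the closed-form solve concrete, here is a deliberately simplified 1-D sketch (our own illustration, not the paper's full 2-D system) that keeps only the data term and a horizontal gradient consistency term; the normal equations it solves are the 1-D analogue of A S = b in Eq. (8):

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import spsolve

def refine_1d(d, g, tau, phi, alpha=1.0):
    """Minimize sum_p tau_p (S_p - d_p)^2 + alpha * phi_p ((C S)_p - g_p)^2
    for a 1-D signal, where C is the forward-difference (Toeplitz) operator.
    Setting the gradient of this convex quadratic to zero yields the sparse
    system (T + alpha C^T Phi C) S = T d + alpha C^T Phi g."""
    n = len(d)
    # forward-difference operator; the last row is zero (boundary gradient undefined)
    C = diags([-np.r_[np.ones(n - 1), 0.0], np.ones(n - 1)], [0, 1],
              shape=(n, n)).tocsr()
    T = diags(tau)    # confidence weights on the data term
    Phi = diags(phi)  # image-based weights on the gradient term
    A = T + alpha * (C.T @ Phi @ C)
    b = T @ d + alpha * (C.T @ Phi @ g)
    return spsolve(A.tocsc(), b)
```

With a confident flat signal and zero target gradients, the refinement reproduces the input exactly; an unconfident outlier (τ_p = 0) is filled in from its neighbors through the gradient term, mirroring how the full MRF corrects low-confidence regions.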
5. Experiments

The proposed method is evaluated on four recently published large synthetic stereo datasets and on the real-world KITTI2015 stereo benchmark. The following subsections describe the experimental details and the comparisons to state-of-the-art approaches on the synthetic and real-world datasets in turn.

5.1. Synthetic datasets

We first demonstrate the effectiveness of our disparity refinement method on four synthetic datasets [20]: FlyingThings3D, Driving, Monkaa, and MPI Sintel.

For the FlyingThings3D dataset, we follow the setup of [20] and use 22,390 image pairs for training the neural networks. We split the test scenes into 780 image pairs for validation and 3580 image pairs for testing 2 . For the validation set, the first 700 image pairs are used to validate the training parameters of our neural networks, and the remaining 80 image pairs are used to validate the hyper-parameters of both our global optimization method and the baselines.

The Driving dataset has 4400 image pairs of driving scenes, and the Monkaa dataset, made from the animated short film Monkaa, contains 8640 image pairs [20]. The 1064 image pairs of the MPI Sintel dataset simulate realistic scenes, including natural image degradations such as fog and motion blur [38]. All three datasets are used as test sets, as in [20], for evaluating our method and the baselines.

Implementation details: We fix the regression network and train the sub-networks independently. For training the gradient
Fig. 8. Refining the disparity discontinuities on the KITTI2015 stereo dataset. Our ConfNet detects the bad pixels (non-confident results marked blue) accurately, and a visible improvement can be seen in our refined disparity maps. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
network, we start from a learning rate of 0.0001, halve it every 20K iterations, and stop after 200K iterations; the weight decay is set to 0.0004. For training the confidence network, we start from a learning rate of 0.001, halve it every 40K iterations, and stop after 320K iterations; the weight decay is set to 0.0005. Both models are trained on the train set of the FlyingThings3D dataset only and evaluated on the other datasets, as in the prior work [20]. The hyper-parameters of our continuous MRF are fixed for all the synthetic datasets as follows: α = 100, β = 0.1, σ_r = 2.55.
Baselines: We employ two groups of baselines in our comparisons: deep regression networks and MRF-based methods. For deep regression networks, the first baseline is the pre-trained neural network [20]. We further fine-tune it with an additional loss term on disparity gradients as our second baseline. Let l_p = (d_p − d*_p) denote the disparity residual at a pixel p. We define the loss function for the fine-tuning as follows:

L = (1/N) Σ_p |l_p| + (λ/N) Σ_p (|∇_x l_p| + |∇_y l_p|)    (9)

where λ = 5 balances the two terms of the loss function. The loss encourages the network to find a local optimum that balances the disparity values and the gradient structure of the surfaces. A similar loss function has been used in [3] for single image-based depth prediction. Moreover, we include a weighted median filter (WMF) based smoothing technique for the pre-trained regression network, as it is often used to remove noise from disparity maps without blurring the disparity discontinuities [39,40]. For the MRF-based methods, we compare our method with three state-of-the-art continuous MRFs [4,31,32,37] based on our ConfNet.
Evaluation metrics: The most commonly used metric for evaluating stereo matching algorithms is the percentage of bad pixels among all pixels with valid ground truth disparity values. A bad pixel is one whose absolute disparity error is larger than a threshold, set to 1 pixel in this paper. To evaluate the quality of continuous disparities, we also compute the end-point error (1/N) Σ_p |d_p − d*_p| to compare different methods, as in [20]. Here, d_p and d*_p denote the estimated and ground truth disparity of a pixel p respectively. Additionally, we compute the sum of the end-point errors of the horizontal and vertical disparity gradients to compare the structure-preserving performance of our method and the baselines.
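The two per-pixel metrics can be sketched directly from their definitions, assuming NumPy arrays and a boolean validity mask (this is our own restatement, not the benchmark's evaluation code):

```python
import numpy as np

def bad_pixel_rate(d_est, d_gt, valid, t=1.0):
    """Percentage of bad pixels (absolute error > t, with t = 1 px here)
    among the pixels with valid ground truth disparity."""
    err = np.abs(d_est - d_gt)[valid]
    return 100.0 * (err > t).mean()

def end_point_error(d_est, d_gt, valid):
    """End-point error (EPE): mean absolute disparity error over valid pixels."""
    return np.abs(d_est - d_gt)[valid].mean()
```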
Detailed analysis: A sparsification curve shows the change in the percentage of bad pixels as the least confident disparities are removed from the disparity map [12,13]. Examples of sparsification curves are shown in Fig. 4. To evaluate the performance of our ConfNet, we plot the area under the sparsification curves (AUC) of the test sets, as in [12,13], in Fig. 5. For the FlyingThings3D dataset, the gaps between our AUC and the optimal AUC values are small, which shows that our ConfNet is able to effectively detect bad pixels. On the Monkaa and the Sintel datasets, we observe a similar trend, except for some challenging scenes corresponding to the large gaps in the plots. For the Driving dataset, we note that the optimal AUC values are large, which means that the initially estimated disparity maps have poor quality. Therefore, locating bad pixels (i.e., selecting reliable disparities) is also challenging on this dataset. For large-scale synthetic datasets, learning a random forest classifier with hand-crafted features to predict confidence is not applicable due to memory limits. Hence, the comparison of ConfNet with the previous work [13] is discussed on the real-world KITTI2015 dataset later.
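The sparsification analysis above can be sketched as follows (an approximation with a fixed number of removal steps, assumed here to be 100; the "optimal" curve is obtained by ranking pixels by their true error):

```python
import numpy as np

def sparsification_auc(err, conf, t=1.0, steps=100):
    """Approximate area under the sparsification curve: repeatedly drop the
    least confident pixels and track the bad-pixel rate of the remainder.
    A lower AUC means the confidence ranks errors better; passing the
    negated true error as confidence yields the optimal curve."""
    err = np.asarray(err, dtype=float).ravel()
    conf = np.asarray(conf, dtype=float).ravel()
    order = np.argsort(-conf, kind="stable")  # most confident pixels first
    err = err[order]
    n = err.size
    rates = []
    for k in range(steps):
        m = n - int(n * k / steps)  # keep the m most confident pixels
        rates.append(float((err[:m] > t).mean()))
    return float(np.mean(rates))
```

A confidence map that mimics the error ranking drives the curve toward the optimal one, which is exactly the gap visualized in Figs. 4 and 5.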
We then compare the predicted disparity gradients with the gradients of the initially estimated disparity maps of DispNetCorr1D in Table 1. The predicted gradients are much better, which demonstrates that they can be used as effective smoothness constraints to improve the noisy disparity maps.
Comparing with CNN-based methods: We first compare our method with the baselines on the FlyingThings3D dataset, since our models are trained on the training data of this dataset. For the occlusions and for the whole image, we list the percentage of bad pixels in Table 2. Fine-tuning the deep regression network with an additional loss term on disparity gradients proves useful for improving disparity estimation according to this metric. Both WMF and our method work well at removing bad pixels, and our results outperform them in all cases. Table 2 also shows the end-point errors of the different refinement methods based on deep networks and filtering. Fine-tuning the network with an additional gradient-related loss term results in only slight accuracy changes, and our method and the WMF algorithm achieve competitive performance.
For the other three synthetic datasets, we list the percentages of bad pixels and the end-point errors in Table 3. Our method outperforms the baselines in most cases, which demonstrates the generality of our trained models. We note that our method removes only a few bad pixels on the Driving dataset. The probable reason is that the confidence network cannot effectively detect the bad pixels, since the FlyingThings3D train set contains no driving scenes similar to the Driving dataset. Hence, we also compute a performance upper bound of our method, as shown in Table 3. Specifically, instead of using predicted values, we put the ground truth confidences into our MRF model to generate the upper bound performance. The upper bound shows that our method can remove a large number of bad pixels given the ground truth confidence and the predicted disparity gradients. An interesting result is that the fine-tuned network has competitive performance on the Driving and the Sintel datasets according to the percentage of bad pixels but worse performance according to the end-point errors. This shows that using a single metric to compare the quality of disparity maps is not enough: a lower average absolute error may indicate over-smoothed disparity maps, and fewer bad pixels may still allow very large errors at a certain number of pixels.
Additionally, we compare the quality of the final disparity gradients in Table 4. Our method has the best performance on all the datasets, which shows that we can effectively reconstruct the surface structures. Note that the fine-tuned network improves the disparity gradients only slightly, and using a large λ results in bad disparity maps according to our experiments. In conclusion, our method outperforms the baselines in most cases considering all three metrics.
Comparing with MRF-based methods: Using our confidence-weighted data term, we perform a separate set of experiments to evaluate the importance of the different MRF regularization terms. The color-based term makes a simple piece-wise smoothness assumption and is commonly used for filling missing depth values and for image colorization [4,37]. The TGVL2 model, the state-of-the-art method for noisy disparity map upsampling, employs both a second-order smoothness prior and an anisotropic diffusion tensor to enforce smoothness within surfaces and preserve discontinuities at potential boundaries [31]. The fast bilateral solver is proposed in [32] to perform confidence-based edge-aware filtering for depth refinement and upsampling. Our method is built on a similar strategy; the difference is that the smoothness prior (i.e., the disparity gradients) is learned from training data. According to Table 5, our method has better performance at estimating accurate disparity values, while TGVL2-based regularization can remove slightly more bad pixels. However, the TGVL2 model needs to solve its global optimization problem iteratively, which can be slow in practice. In contrast, our global optimization problem is solved efficiently using standard least-squares solvers and is about 10× faster in computational time. The general fast bilateral solver (FBS)
130 F. Cheng et al. / Pattern Recognition 74 (2018) 122–133
Table 1
Comparisons of the disparity gradients. The output gradients of our GradNet are much better on all the
datasets. The metric is the sum of the end-point errors of both the horizontal and the vertical disparity gra-
dients.
Method FlyingThings3D(test) FlyingThings3D(val) Driving Monkaa Sintel(train)
DispNetCorr1D [20] 0.55 0.55 0.71 0.45 0.45
GradNet 0.05 0.06 0.32 0.09 0.10
Table 2
Quantitative comparisons of the disparities for both occluded regions and all pixels. The metrics are the percentage of bad pixels ( > 1 px) and the end-point error (EPE) respectively. The suffix F means that the network is fine-tuned with an additional gradient-related loss term. Our method outperforms all the baselines on both metrics.
Method FlyingThings3D(val) FlyingThings3D(test)
Occ All Occ All
> 1 px EPE > 1 px EPE > 1 px EPE > 1 px EPE
DispNetCorr1D [20] 54.29 4.70 24.99 2.01 53.08 4.57 23.48 2.16
DispNetCorr1D + WMF [40] 48.91 4.15 23.11 1.84 47.66 4.02 21.57 1.98
DispNetCorr1D_F 51.23 4.79 23.24 2.07 49.78 4.73 21.67 2.10
Ours 47.74 4.08 22.73 1.81 46.53 3.98 21.17 1.98
Table 3
Quantitative comparisons of the whole disparity maps. The metrics are the percentage of bad pixels ( > 1 px) and the end-point error (EPE) respectively. The suffix F means that the network is fine-tuned with an additional gradient-related loss term. Our method outperforms all the baselines on end-point error and removes bad pixels effectively except on the Driving dataset; the likely reason is that the training data contains no similar driving scenes. The upper bound shows that bad-pixel removal can benefit from estimating better confidence measures.
Method Driving Monkaa Sintel(train)
> 1 px EPE > 1 px EPE > 1 px EPE
DispNetCorr1D [20] 70.50 12.56 36.95 10.92 46.96 5.74
DispNetCorr1D + WMF [40] 70.18 12.54 36.48 10.86 46.21 5.65
DispNetCorr1D_F 68.92 13.57 37.00 10.84 44.41 7.31
Ours 70.49 12.35 35.73 10.81 46.33 5.49
Ours + CONF_GT 63.36 12.21 25.50 10.97 35.47 4.99
Table 4
Comparisons of the quality of the final disparity gradients.
Method FlyingThings3D(val) FlyingThings3D(test) Driving Monkaa Sintel(train)
DispNetCorr1D [20] 0.55 0.55 0.71 0.45 0.45
DispNetCorr1D + WMF [40] 0.47 0.47 0.75 0.44 0.40
DispNetCorr1D_F 0.47 0.47 0.67 0.42 0.47
Ours 0.38 0.38 0.60 0.32 0.32
Table 5
Comparisons of different MRF regularization terms. Grad means that only the gradient-consistency regularization term is used in our model. Our method uses both the color-based smoothness and gradient-consistency terms to handle smooth surfaces and boundaries respectively.
Method FlyingThings3D(val) Sintel(train) Time(s)
disp bad grad disp bad grad –
Color [37] 2.04 22.54 0.43 5.65 46.70 0.36 13.6
TGVL2 [31] 1.82 22.37 0.39 5.59 46.03 0.36 148.5
FBS [32] 1.83 23.31 0.53 5.78 47.84 0.47 0.7
Grad 1.99 22.92 0.42 5.51 46.42 0.38 2.5
Ours 1.81 22.73 0.38 5.49 46.33 0.32 14.3
is efficient but fails to work competitively.³ Note that using a single regularization term of our model cannot achieve state-of-the-art performance. This shows that a simple piece-wise smoothness assumption is not robust on smooth surfaces, and that modeling the disparity boundaries is important for our approach.
Visual results are shown in Fig. 6 to demonstrate that our method removes bad pixels and smooths the disparities within surfaces.
³ We use the publicly available code to evaluate this baseline. Note that the authors propose a variant of their algorithm (the RBS solver) to refine stereo matching results using iteratively reweighted least squares. The effectiveness and efficiency of this iterative algorithm is unclear since this part of the code is not available.
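The closed-form refinement step can be sketched as a sparse linear least-squares problem (a minimal illustrative sketch under our own simplifying assumptions, not the authors' implementation: the function name `refine_disparity`, the 4-connected grid, and the single uniform λ are all choices made for this example). The confidence-weighted data term and the gradient-consistency term are both quadratic in the disparities, so the minimizer solves one sparse linear system:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def refine_disparity(d_init, conf, grad_x, grad_y, lam=1.0):
    """Toy continuous-MRF refinement: minimize
       sum_i conf_i (d_i - d_init_i)^2
     + lam * sum_{i~j} (d_j - d_i - g_ij)^2
    over a 4-connected grid, in closed form via the normal equations."""
    h, w = d_init.shape
    n = h * w
    idx = np.arange(n).reshape(h, w)

    # Data term: confidence-weighted diagonal.
    C = sp.diags(conf.ravel())

    # Horizontal pairs: (d[i, j+1] - d[i, j] - grad_x[i, j])^2
    rows_h = np.repeat(np.arange(h * (w - 1)), 2)
    cols_h = np.stack([idx[:, 1:].ravel(), idx[:, :-1].ravel()], 1).ravel()
    vals_h = np.tile([1.0, -1.0], h * (w - 1))
    Gx = sp.csr_matrix((vals_h, (rows_h, cols_h)), shape=(h * (w - 1), n))
    bx = grad_x[:, :-1].ravel()

    # Vertical pairs: (d[i+1, j] - d[i, j] - grad_y[i, j])^2
    rows_v = np.repeat(np.arange((h - 1) * w), 2)
    cols_v = np.stack([idx[1:, :].ravel(), idx[:-1, :].ravel()], 1).ravel()
    vals_v = np.tile([1.0, -1.0], (h - 1) * w)
    Gy = sp.csr_matrix((vals_v, (rows_v, cols_v)), shape=((h - 1) * w, n))
    by = grad_y[:-1, :].ravel()

    # Normal equations: (C + lam*(Gx'Gx + Gy'Gy)) d = C d_init + lam*(Gx'bx + Gy'by)
    A = C + lam * (Gx.T @ Gx + Gy.T @ Gy)
    rhs = conf.ravel() * d_init.ravel() + lam * (Gx.T @ bx + Gy.T @ by)
    return spla.spsolve(A.tocsc(), rhs).reshape(h, w)
```

Because the energy is quadratic, a single direct sparse solve suffices, which is the reason such a formulation avoids the iterative optimization that makes TGVL2-style models slow.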
5.2. Real world dataset
We now turn to the KITTI2015 dataset, which consists of 200 image pairs for training and 200 image pairs for testing [41]. The images are captured in real-world driving environments with sparse ground truth disparity maps.
Detailed analysis: We use the first 100 images of the KITTI2015 training set to train our ConfNet for predicting the confidence of DispNetCorr1D on KITTI before fine-tuning. We compare our approach to a baseline method RDF12 [13]. For the baseline, all the matching-cost-based features are unavailable in this setting, so a 12-dimension feature vector is used to train the Random Forest classifier.⁴
Table 6
Comparisons of state-of-the-art methods on the
KITTI2015 stereo benchmark. The metric is the
percentage of bad pixels, bg and fg denote back-
ground and foreground regions respectively.
Methods bg fg all
Displets [42] 3.00 5.56 3.43
MC-CNN-acrt [9] 2.89 8.88 3.89
DispNetCorr1D-K [20] 4.32 4.41 4.34
Content-CNN [18] 3.73 8.58 4.54
SPS-St [14] 3.84 12.67 5.31
Ours 4.39 4.59 4.43
We refer the readers to [12,13] for the details of these features. The results of our ConfNet and RDF12 are shown in Fig. 7, and our ConfNet outperforms RDF12 in most cases. This shows that an added sub-network can easily learn the error mode of its parent network based on the pre-learned representations.
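The idea of bolting a confidence head onto frozen parent features can be sketched as per-pixel logistic regression (an illustrative toy, not the ConfNet architecture; `conf_head`, the 1x1-conv-style weighting, and the 1 px correctness threshold are assumptions made for this example):

```python
import numpy as np

def conf_labels(disp, gt, thresh=1.0):
    # Per-pixel binary labels: 1 where the parent network's disparity is correct.
    return (np.abs(disp - gt) <= thresh).astype(float)

def conf_head(features, w, b):
    # 1x1-conv style head on shared (C, H, W) features: per-pixel logistic regression.
    logits = np.tensordot(features, w, axes=([0], [0])) + b  # -> (H, W)
    return 1.0 / (1.0 + np.exp(-logits))

def bce(pred, label, eps=1e-7):
    # Binary cross-entropy loss for training the head against the correctness labels.
    p = np.clip(pred, eps, 1 - eps)
    return -(label * np.log(p) + (1 - label) * np.log(1 - p)).mean()
```

The point of the sketch is that the supervision signal (where the parent network errs) comes for free from the parent's output and the ground truth, so the head only has to map already-learned representations to an error probability.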
To fine-tune our GradNet, we need dense disparity maps, which are unavailable in the KITTI dataset. We instead use the results of [9] for its strong performance in background regions and replace the object regions with ground truth to create high-quality dense disparity maps for fine-tuning our GradNet.
Overall results: We first show our qualitative results in Fig. 8. We can see the visual improvement in these examples. The gradients learned from high-quality disparity maps refine the disparities near depth discontinuities. Also, the predicted confidence locates the errors well, and our method focuses on removing the bad pixels near depth discontinuities. We note that a number of bad pixels still cannot be removed, since we learn the gradients of [9], which also suffers from foreground-fattening effects in some cases. On the public KITTI2015 stereo benchmark, we obtain performance close to the DispNetCorr1D-K model, as shown in Table 6. This is probably because there are only a few ground truth points near object boundaries in the KITTI2015 dataset, and the synthetic datasets are insufficient to train a network to predict disparity gradients for real-world data.⁵
6. Conclusions
We propose a deep network method to refine continuous disparity maps which are estimated by a deep regression network. Specifically, we learn a unified two-branch fully convolutional network to predict the confidence map of the disparity and the surface smoothness priors respectively. We then integrate them into a continuous MRF to improve the initial disparity maps, and solve the MRF efficiently with a closed-form solution. Performance evaluated on both synthetic and real-world datasets demonstrates that our method effectively reduces the errors of a deep regression network without requiring multiple ad hoc post-processing steps.
Acknowledgment
The work of F. Cheng was supported by the China Scholarship Council. This work was done during Cheng's visit to NICTA, and was also supported by the National Natural Science Foundation of China (grant number 61571026). The work of X. He was supported in part by the Australian Government through the Department of Communications and in part by the Australian Research Council through the ICT Center of Excellence Program.
⁴ Most of the hand-crafted confidence features are based on the matching cost, which is absent in our case since the regression network outputs continuous disparity maps only. Hence, we use the code of [13] to compute all 12 image- and disparity-map-based confidence features to learn RDF12 as the baseline.
⁵ We have tried to train our sub-networks on the Driving dataset and test on the KITTI data, but without success. The Driving dataset has rather different image and disparity statistics compared to realistic driving scenes.
Supplementary material
Supplementary material associated with this article can be found, in the online version, at 10.1016/j.patcog.2017.07.027.
References
[1] R.I. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed., Cambridge University Press, 2004. ISBN: 0521540518.
[2] D. Eigen, C. Puhrsch, R. Fergus, Depth map prediction from a single image using a multi-scale deep network, in: NIPS, 2014, pp. 2366–2374.
[3] D. Eigen, R. Fergus, Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture, in: ICCV, 2015, pp. 2650–2658.
[4] N. Silberman, D. Hoiem, P. Kohli, R. Fergus, Indoor segmentation and support inference from RGBD images, in: ECCV, 2012, pp. 746–760.
[5] A. Geiger, P. Lenz, R. Urtasun, Are we ready for autonomous driving? The KITTI vision benchmark suite, in: CVPR, 2012, pp. 3354–3361.
[6] H. Hirschmuller, Accurate and efficient stereo processing by semi-global matching and mutual information, in: CVPR, 2005, pp. 807–814.
[7] O. Woodford, P. Torr, I. Reid, A. Fitzgibbon, Global stereo reconstruction under second-order smoothness priors, PAMI 31 (12) (2009) 2115–2128.
[8] J. Zbontar, Y. LeCun, Computing the stereo matching cost with a convolutional neural network, in: CVPR, 2015, pp. 1592–1599.
[9] J. Zbontar, Y. LeCun, Stereo matching by training a convolutional neural network to compare image patches, JMLR 17 (2016) 1–32.
[10] D. Scharstein, R. Szeliski, A taxonomy and evaluation of dense two-frame stereo correspondence algorithms, IJCV 47 (1–3) (2002) 7–42.
[11] H. Hirschmuller, D. Scharstein, Evaluation of stereo matching costs on images with radiometric differences, PAMI 31 (9) (2009) 1582–1599.
[12] R. Haeusler, R. Nair, D. Kondermann, Ensemble learning for confidence measures in stereo vision, in: CVPR, 2013, pp. 305–312.
[13] M.-G. Park, K.-J. Yoon, Leveraging stereo matching with learning-based confidence measures, in: CVPR, 2015, pp. 101–109.
[14] K. Yamaguchi, D. McAllester, R. Urtasun, Efficient joint segmentation, occlusion labeling, stereo and flow estimation, in: ECCV, 2014, pp. 756–771.
[15] M.G. Mozerov, J. van de Weijer, Accurate stereo matching by two-step energy minimization, TIP 24 (3) (2015) 1153–1163.
[16] R. Szeliski, D. Scharstein, Sampling the disparity space image, PAMI 26 (3) (2004) 419–425.
[17] F. Liu, C. Shen, G. Lin, Deep convolutional neural fields for depth estimation from a single image, in: CVPR, 2015, pp. 5162–5170.
[18] W. Luo, A.G. Schwing, R. Urtasun, Efficient deep learning for stereo matching, in: CVPR, 2016, pp. 5695–5703.
[19] Z. Chen, X. Sun, L. Wang, Y. Yu, C. Huang, A deep visual correspondence embedding model for stereo matching costs, in: ICCV, 2015, pp. 972–980.
[20] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, T. Brox, A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation, in: CVPR, 2016, pp. 4040–4048.
[21] A. Spyropoulos, N. Komodakis, P. Mordohai, Learning to detect ground control points for improving the accuracy of stereo matching, in: CVPR, 2014, pp. 1621–1628.
[22] R. Szeliski, Computer Vision: Algorithms and Applications, 1st ed., Springer-Verlag New York, Inc., New York, NY, USA, 2010.
[23] S. Zagoruyko, N. Komodakis, Learning to compare image patches via convolutional neural networks, in: CVPR, 2015, pp. 4353–4361.
[24] R. Garg, V. Kumar B.G., G. Carneiro, I. Reid, Unsupervised CNN for single view depth estimation: geometry to the rescue, in: ECCV, 2016, pp. 740–756.
[25] C. Godard, O. Mac Aodha, G.J. Brostow, Unsupervised monocular depth estimation with left-right consistency, arXiv:1609.03677v2, 2016.
[26] X. Hu, P. Mordohai, A quantitative evaluation of confidence measures for stereo vision, PAMI 34 (11) (2012) 2121–2133.
[27] C. Mostegel, M. Rumpler, F. Fraundorfer, H. Bischof, Using self-contradiction to learn confidence measures in stereo vision, in: CVPR, 2016, pp. 4067–4076.
[28] A. Seki, M. Pollefeys, Patch based confidence prediction for dense disparity map, in: BMVC, 2016.
[29] M. Poggi, S. Mattoccia, Learning from scratch a confidence measure, in: BMVC, 2016.
[30] R. Ranftl, S. Gehrig, T. Pock, H. Bischof, Pushing the limits of stereo using variational stereo estimation, in: Intelligent Vehicles Symposium, 2012, pp. 401–407.
[31] D. Ferstl, C. Reinbacher, R. Ranftl, M. Ruther, H. Bischof, Image guided depth upsampling using anisotropic total generalized variation, in: ICCV, 2013, pp. 993–1000.
[32] J.T. Barron, B. Poole, The fast bilateral solver, in: ECCV, 2016, pp. 617–632.
[33] C. Hane, L. Ladicky, M. Pollefeys, Direction matters: depth estimation with a surface normal classifier, in: CVPR, 2015, pp. 381–389.
[34] S. Galliani, K. Schindler, Just look at the image: viewpoint-specific surface normal prediction for improved multi-view reconstruction, in: CVPR, 2016, pp. 5479–5487.
[35] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: CVPR, 2015, pp. 3431–3440.
[36] D. Kingma, J. Ba, Adam: a method for stochastic optimization, arXiv:1412.6980, 2014.
[37] A. Levin, D. Lischinski, Y. Weiss, Colorization using optimization, TOG 23 (3) (2004) 689–694.
[38] D.J. Butler, J. Wulff, G.B. Stanley, M.J. Black, A naturalistic open source movie for optical flow evaluation, in: ECCV, 2012, pp. 611–625.
[39] Z. Ma, K. He, Y. Wei, J. Sun, E. Wu, Constant time weighted median filtering for stereo matching and beyond, in: ICCV, 2013, pp. 49–56.
[40] Q. Zhang, L. Xu, J. Jia, 100+ times faster weighted median filter (WMF), in: CVPR, 2014, pp. 2830–2837.
[41] M. Menze, A. Geiger, Object scene flow for autonomous vehicles, in: CVPR, 2015, pp. 3061–3070.
[42] F. Guney, A. Geiger, Displets: resolving stereo ambiguities using object knowledge, in: CVPR, 2015, pp. 4165–4175.
Feiyang Cheng received B.S. degree in bio-medical engineering from Tianjin Medical University in 2010. He is currently a Ph.D. candidate in Image Research Center, School of Astronautics at Beihang University, China. He was also a visiting Ph.D. student at National ICT Australia (NICTA) from 2014 to 2016. His research interests include stereo vision and semantic segmentation.
Xuming He is currently an Associate Professor in the School of Information Science and Technology at ShanghaiTech University. He received Ph.D. degree in computer science from the University of Toronto (Canada) in 2008. He held a postdoctoral position at the University of California at Los Angeles (USA) from 2008 to 2010. After that, he joined National ICT Australia (NICTA) and was a Senior Researcher from 2013 to 2016. He was also an adjunct Research Fellow at the Australian National University from 2010 to 2016. His research interests include semantic image and video segmentation, 3D scene understanding, visual motion analysis, and efficient inference and learning in structured models.
Hong Zhang received Ph.D. degree from Beijing Institute of Technology, China in 2002. She is currently a Professor of Beihang University. She was at the University of Pittsburgh as a visiting scholar from 2007 to 2008. Her research interests include activity recognition, image indexing, object detection and stereo vision.