A Novel Monocular Disparity Estimation Network with Domain Transformation and Ambiguity Learning

Juan Luis Gonzalez Bello and Munchurl Kim
Korea Advanced Institute of Science and Technology (KAIST)

ABSTRACT

Convolutional neural networks (CNNs) have shown state-of-the-art results for low-level computer vision problems such as stereo and monocular disparity estimation, but there is still much room to improve them in terms of accuracy, number of parameters, and so on. Recent works have uncovered the advantages of using an unsupervised scheme to train CNNs to estimate monocular disparity, where only the relatively easy-to-obtain stereo images are needed for training. We propose a novel encoder-decoder architecture that outperforms previous unsupervised monocular depth estimation networks by (i) taking ambiguities into account, (ii) efficiently fusing encoder and decoder features with rectangular convolutions, and (iii) applying domain transformations between encoder and decoder. Our architecture outperforms the Monodepth baseline in all metrics, even with a considerable reduction in parameters. Furthermore, our architecture is capable of estimating a full disparity map in a single forward pass, whereas the baseline needs two passes. We perform extensive experiments to verify the effectiveness of our method on the KITTI dataset.

Index Terms: Monocular disparity estimation, Deep Convolutional Neural Networks (DCNN), unsupervised learning.

1. INTRODUCTION

For a given object displayed in a rectified stereo pair of images, the disparity is defined as the horizontal pixel distance between the object's positions in the left and right images. Disparity estimation is an ill-posed problem due to occlusion, as not all the objects displayed in the left image are visible in the right image and vice versa. Classic approaches for depth/disparity estimation rely on stereo matching, multiple-view stereo, or single or multiple defocus maps [1].
These methods rely on multiple images to perform feature matching, triangulation, or texture measurement, and their performance is limited. In contrast, even early supervised learning approaches have demonstrated superior results for binocular [2] and monocular [3] inputs.

1.1. Supervised disparity estimation

Deep learning approaches have been extensively studied for supervised disparity estimation. For stereo inputs, the early work of Zbontar and LeCun [2] focused on learning a similarity measure between two patches from the left and right images. For the monocular case, Liu et al. [3] devised a deep convolutional neural field (DCNF) model for supervised depth estimation by exploring conditional random fields and super-pixel pooling. None of these methods provided an end-to-end learning architecture. Instead of comparing local patches, Mayer et al. [4] adapted the fully convolutional FlowNet [5] architecture for dense supervised stereo matching, calling the result DispNet. This kind of auto-encoder architecture would become common for later disparity estimation networks.

Fig. 1. Using the left view as input and the right view only for supervision during training, our model estimates bidirectional disparity and ambiguity masks. (Panels: training inputs, disparity maps, ambiguity masks.)

Jie et al. [6] proposed a recurrent architecture that learns potentially erroneous areas, which guide the model to focus on these regions for subsequent refinement. Their model is fully supervised, takes stereo inputs, and needs 5 iterations at test time, making it too slow for real-time use (4 seconds). Atapour-Abarghouei and Breckon [7] leverage fully annotated synthetic data to train a monocular depth estimation network. They apply style transfer to convert natural images into the synthetic domain at test time.
While their approach unlocks the use of large-scale synthetic data for real applications, according to their paper, the method suffers from false objects and depth holes arising from shadows after style transfer. Luo et al. [8] combine monocular and stereo input approaches, using a Deep3D model to generate a synthetic right view, followed by 1D correlation and a full-resolution DispNet architecture to perform stereo matching. Deep3D [9] produces a probabilistic disparity map to blend multiple versions of the left view with different left and right shifts.

Recent works on disparity estimation tend to combine it with other low-level and high-level tasks such as motion estimation, scene flow, and object segmentation [10, 11]. Chen et al. [12] formulate the style transfer problem for the binocular input case. Their intermediate disparity estimation is trained for simultaneous bidirectional disparity and occlusion mask estimation in a fully supervised fashion.

1.2. Unsupervised disparity estimation

The previously mentioned methods require dense ground-truth disparity, and obtaining such annotated data is challenging. Unsupervised techniques instead rely on additional views to estimate depth from a scene, and capturing monocular video or stereo images is a relatively simple task. Zhou et al. [13] exploited the relative pose information in a monocular video to train a disparity network and a pose estimation network in an unsupervised manner, by performing view synthesis of the center frame (source image) to the different reference frames (target images). Zou et al. [14] proposed a cross-task consistency loss to jointly train for optical flow estimation and disparity estimation from a video in an unsupervised fashion. Even though these kinds of approaches achieve good results for the additional tasks (camera motion and optical flow estimation), their performance for disparity estimation is very limited.

Fig. 2. Our full model with ambiguity mask estimation, rectangular convolutions and domain transformation blocks.

Given only the stereo pair, unsupervised disparity estimation can benefit from exploiting the geometrical relationships between the left and right images. Garg et al. [15] trained an encoder-decoder architecture for unsupervised monocular depth estimation by synthesizing a backward-warped image using the estimated left disparity map and the right image. The warped images are used to calculate the reconstruction error. We use the state-of-the-art work of Godard et al. [16] as the baseline for our work. Their Monodepth network estimates the left- and right-view disparity maps and incorporates a photometric loss, a disparity smoothness loss, and a consistency loss for unsupervised training. Even though their consistency loss term greatly improves the performance of the network, their approach falls short on the following three points, which we address in our work, the rrdispnet dtm (residual rectangular masked disparity network with left-to-right domain transformations):

(i) Incorporation of full-resolution features from the encoder into the decoder. We add a convolution layer with full-resolution features to the auto-encoder architecture.

(ii) Fusion of Left-domain encoder features into Left-Right-domain decoder features. We fuse the skip connections from the encoder into the decoder using domain transformation blocks and rectangular convolutions with 3x5 kernels instead of 3x3 kernels. Rectangular convolutions facilitate the conversion from the Left domain to the Left-Right domain.

(iii) Accounting for "ambiguities" in the loss function. By letting the network learn an ambiguity mask for each view, we effectively re-weight the loss functions, which results in a selectively decreased learning rate for occluded and complex, cluttered disparity areas and improves accuracy and robustness.
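To make item (ii) concrete, the following is a minimal NumPy sketch, not the authors' implementation, of how a 3x5 kernel widens the horizontal reach of a layer relative to a 3x3 kernel while leaving the vertical reach unchanged. Since disparity in a rectified pair is a purely horizontal shift, this is one plausible reading of why rectangular kernels help the Left to Left-Right conversion. The helper `conv2d_same`, the single-channel setting, the zero padding, and the toy sizes are all assumptions made for illustration.

```python
import numpy as np

def conv2d_same(image, kernel):
    """Naive single-channel 2D cross-correlation with zero 'same' padding."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode="constant")
    out = np.zeros(image.shape, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            # Window of the padded image centered on output pixel (i, j).
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

# A single active pixel shows which output positions each kernel can reach.
impulse = np.zeros((7, 9))
impulse[3, 4] = 1.0

reach_3x3 = conv2d_same(impulse, np.ones((3, 3))) > 0
reach_3x5 = conv2d_same(impulse, np.ones((3, 5))) > 0

# Number of columns and rows touched by each kernel shape.
cols_3x3 = int(reach_3x3.any(axis=0).sum())
cols_3x5 = int(reach_3x5.any(axis=0).sum())
rows_3x5 = int(reach_3x5.any(axis=1).sum())

# Weights per input/output channel pair: 9 for 3x3, 15 for 3x5.
weights_3x3, weights_3x5 = 3 * 3, 3 * 5
```

In this toy setting the 3x5 kernel spreads information over five columns per layer instead of three, at the cost of 15 rather than 9 weights per channel pair.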
In addition, accounting for ambiguities allows our network to estimate full disparity maps in a single pass. These contributions greatly improve the quality of the disparity maps. Moreover, in combination with residual blocks [17] in the encoder, we reduce the number of parameters from 31 to 14 million. The parameter reduction is achieved by using fewer channels in the intermediate stages of our auto-encoder architecture: we trade quantity for quality of features.

2. METHOD

We propose a novel learning pipeline that accounts for occlusions and complex or cluttered areas, which we call "ambiguities". Given a single image, our model outputs bidirectional disparity and ambiguity masks, as depicted in Figure 1. The works of Godard et al. [16] and Garg et al. [15] model depth estimation as image reconstruction by $\tilde{I}_L = g(I_R, D_L)$, where the backward warping operation $g(I, D)$ is a fully differentiable bilinear sampling function for the right image $I_R$ and the left disparity $D_L$. However, these approaches do not take ambiguities into account. Our pipeline handles ambiguities by including them in the reconstruction model, as given in Eqs. 1 and 2 for left- and right-view reconstruction, respectively:

$$\tilde{I}_L = \tilde{a}^{mask}_L \odot g(I_R, D_L) + (1 - \tilde{a}^{mask}_L) \odot I_L \tag{1}$$

$$\tilde{I}_R = \tilde{a}^{mask}_R \odot g(I_L, D_R) + (1 - \tilde{a}^{mask}_R) \odot I_R \tag{2}$$

where $\tilde{a}^{mask}_L$ and $\tilde{a}^{mask}_R$ contain the information about the dis-occluded left and right border areas that cannot be reconstructed by the warping operation. Wherever these masks are 0, the reconstruction falls back to the input view itself, so those pixels contribute no reconstruction error. To model these areas, the ambiguity masks are multiplied element-wise (denoted $\odot$) by the dis-occlusion masks, yielding $\tilde{a}^{mask}_L = a^{mask}_L \odot dis\_occ_L$ and $\tilde{a}^{mask}_R = a^{mask}_R \odot dis\_occ_R$. The latter are defined as

$$dis\_occ_{L,ij} = \begin{cases} 0 & \text{if } j < 0.15\,W \\ 1 & \text{otherwise} \end{cases} \tag{3}$$

$$dis\_occ_{R,ij} = \begin{cases} 0 & \text{if } j > 0.85\,W \\ 1 & \text{otherwise} \end{cases} \tag{4}$$

where $W$ is the image width.

Fig. 3. Residual (left) and domain transformation (right) blocks.

2.1. Network architecture

Our network architecture is depicted in Figure 2. We adopt a U-Net-like architecture with several additions. After each strided convolution layer, a residual block (Figure 3) further refines the features and increases the number of non-linearities in the network, increasing its representation power. The bottleneck features are upscaled using nearest-neighbor upsampling and concatenated with encoder features in order to include both global and local information. The encoder features reach the decoder side via direct skip connections and domain transformations. The number of channels is then reduced by a fusion layer. This process repeats until the feature maps reach the input resolution. Since the encoder processes spatial information from the left input image and the decoder outputs left and right disparities and masks, a mechanism is needed to convert the "Left domain" encoder features into "Left-Right domain" decoder features. The domain transformation blocks (Figure 3) and the 3x5 convolution fusion layers facilitate this conversion. Similar to [16], we take multi-scale outputs for training at the four finest scales. The intermediate outputs are upscaled using nearest-neighbor upsampling and concatenated to the next decoder stage. Output disparities and ambiguity masks pass through a sigmoid activation. Disparities are limited to 30% of the image width.

2.2. Loss Functions

We adopt a multi-scale loss function at the four finest scales, as described by Eq. 5. For each scale, the loss is a weighted sum of five terms: reconstruction loss, edge-preserving smoothness loss, perceptual loss, ambiguity loss, and left-right consistency loss. The loss at each scale is given in Eq. 6 (each term has a left and a right component, $l^L_x$ and $l^R_x$).

$$l = \sum_{s=0}^{4} l_s \tag{5}$$

$$l_s = a_{rec}\, l_{rec} + a_{ds}\, \frac{0.1}{2^{s-1}}\, l_{ds} + a_p\, l_p + a_a\, l_a + a_{lr}\, l_{lr} \tag{6}$$

The selection of the coefficients for the ambiguity and perceptual loss terms in Eq. 6 was critical for the correct training of the networks. The weights during training are set to $a_{rec} = 1$, $a_{ds} = 0.1$, $a_p = 0.1$, $a_a = 0.2$, and $a_{lr} = 1$. In contrast with [16], all our loss terms are directly or indirectly modified by the ambiguity masks.

Reconstruction Loss. The reconstruction loss enforces the reconstructed image $\tilde{I}_L$ to be similar to the input image $I_L$ and is defined as the weighted sum of the $L_1$ and SSIM losses (with a weight of $\alpha = 0.85$):

$$l^L_{rec} = \alpha\, \| I_L - \tilde{I}_L \|_1 + (1 - \alpha)\, SSIM(I_L, \tilde{I}_L) \tag{7}$$

Disparity Smoothness Loss. Similar to [16], the smoothness loss penalizes disparity gradients less where the image itself has strong gradients, and is given by

$$l^L_{ds} = \| \partial_x D_L \odot e^{-|\partial_x I_L|} \|_1 + \| \partial_y D_L \odot e^{-|\partial_y I_L|} \|_1 \tag{8}$$

Perceptual Loss. Occluded areas normally appear as highly deformed regions in the reconstructed images. The perceptual loss [18] is well suited to detecting this deformation and penalizing it more heavily. Using the perceptual loss allows the ambiguity masks to be learned properly. Three layers (relu1_2, relu2_2, relu3_4) of the VGG19 network [19] pre-trained on ImageNet are used to define our perceptual loss as

$$l^L_p = \sum_{l=1}^{3} \| \phi_l(I_L) - \phi_l(\tilde{I}_L) \|_1 \tag{9}$$

Ambiguity Loss. A cross-entropy loss is used to encourage the ambiguity mask elements to be close to 1. Without this term, the ambiguity masks would collapse to zero. The ambiguity loss is

$$l_a = \| \log a^{mask}_L \|_1 + \| \log a^{mask}_R \|_1 \tag{10}$$

Fig. 4. From left to right: input, incomplete disparity map (Monodepth), and our complete result (rdispnet m).

Left-Right (LR) Consistency Loss. Similar to [16], an LR consistency term is used. The consistency loss encourages coherence between the left and right disparities.
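As a concrete illustration of the terms defined so far, the sketch below implements the masked left-view reconstruction (Eqs. 1 and 3), the L1 part of the reconstruction loss (Eq. 7), the smoothness loss (Eq. 8), and one view's ambiguity loss (Eq. 10) in NumPy. It is a simplified sketch under stated assumptions, not the authors' code: images are single-channel, the warp interpolates only along x and samples the right image at x minus the disparity (an assumed sign convention), norms are replaced by means, and the SSIM and perceptual terms are omitted. All function names are hypothetical.

```python
import numpy as np

def warp_horizontal(src, disp):
    """Backward-warp src (H, W) by a per-pixel horizontal disparity in pixels.
    Assumption: the target pixel at column x samples src at column x - disp."""
    h, w = src.shape
    xs = np.clip(np.arange(w, dtype=float)[None, :] - disp, 0.0, w - 1.0)
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    frac = xs - x0
    rows = np.arange(h)[:, None]
    return (1.0 - frac) * src[rows, x0] + frac * src[rows, x1]

def dis_occlusion_mask_left(h, w):
    """Eq. 3: 0 where the column index j < 0.15 W, 1 elsewhere."""
    cols = np.arange(w)
    return np.tile((cols >= 0.15 * w).astype(float), (h, 1))

def reconstruct_left(img_l, img_r, disp_l, a_mask_l):
    """Eq. 1, with the masked ambiguity a~ = a_mask * dis_occ."""
    a_tilde = a_mask_l * dis_occlusion_mask_left(*img_l.shape)
    return a_tilde * warp_horizontal(img_r, disp_l) + (1.0 - a_tilde) * img_l

def l1_reconstruction(img_l, rec_l):
    """L1 part of Eq. 7, averaged over pixels (SSIM term omitted here)."""
    return np.mean(np.abs(img_l - rec_l))

def smoothness(disp, img):
    """Eq. 8 with forward differences, averaged instead of summed."""
    dx_d, dx_i = np.diff(disp, axis=1), np.diff(img, axis=1)
    dy_d, dy_i = np.diff(disp, axis=0), np.diff(img, axis=0)
    return (np.mean(np.abs(dx_d) * np.exp(-np.abs(dx_i)))
            + np.mean(np.abs(dy_d) * np.exp(-np.abs(dy_i))))

def ambiguity_loss(a_mask, eps=1e-7):
    """One view's term of Eq. 10: |log a| is 0 at a = 1 and grows as a -> 0."""
    return np.mean(np.abs(np.log(np.clip(a_mask, eps, 1.0))))

rng = np.random.default_rng(0)
h, w = 4, 10
img_l, img_r = rng.random((h, w)), rng.random((h, w))
disp_l = np.full((h, w), 2.0)

# With an all-zero mask the reconstruction is just the input view, so the
# photometric error vanishes; the ambiguity loss is what discourages this.
rec_zero = reconstruct_left(img_l, img_r, disp_l, np.zeros((h, w)))
rec_one = reconstruct_left(img_l, img_r, disp_l, np.ones((h, w)))
```

The all-zero mask case shows why the ambiguity loss is needed: without a penalty pulling the mask toward 1, copying the input is a trivial minimizer of the reconstruction term.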
Another interpretation is that the consistency loss allows the network to use each view's disparity map as weak ground truth for the other view. In contrast with [16], our LR consistency loss is re-weighted by the $\tilde{a}^{mask}$ ambiguity masks. The ambiguity masks allow the LR consistency loss to penalize less those areas where the disparities are not good enough to be used as ground truth for the other view's disparity. The LR consistency loss is defined as

$$l^L_{lr} = \| \tilde{a}^{mask}_L \odot (D_L - g(D_R, D_L)) \|_1 \tag{11}$$

3. RESULTS

We perform extensive experiments to verify the effectiveness of each of our contributions and compare against the state-of-the-art unsupervised monocular depth estimation method, Monodepth [16], on the Kitti2015 [20] dataset, which contains 200 stereo frames and sparse disparity ground truth obtained from Velodyne laser scanners and CAD models. We train and test our network with and without domain transformations and rectangular convolutions, and name the variants accordingly, as shown in Table 1. Performance is measured in terms of the Kitti metrics from [21]. Additionally, we test Monodepth pp and our best model on our own dataset, captured with a cellphone camera.

3.1. Implementation details

For a fair comparison, we adopted the training settings from [16]. The Adam optimizer [22] is used with default betas, and the models are trained for 50 epochs with an initial learning rate of 0.0001. The learning rate is halved at epochs 30 and 40. The same data augmentations are performed: random crop (256x512), random horizontal flips, and random gamma, brightness, and color shifts [16]. All models are trained only on the KITTI Split [16], which consists of a selection of 29,000 stereo pairs from the Kitti2012 dataset [23], comprising a total of 33 scenes.

3.2. Kitti2015

Our simplest network, rdispnet m, includes neither domain transformations nor 3x5 convolutions, but still outperforms the Monodepth baselines in all metrics.
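For readers unfamiliar with the metrics referred to here, the sketch below gives the definitions commonly used for the abs rel, sq rel, rmse, log rmse, and a1 to a3 columns reported in this line of work. It is a minimal sketch assuming dense arrays of strictly positive ground-truth and predicted values; the function name `depth_metrics` is hypothetical, and the evaluation protocol details (conversion from disparity, cropping, value caps, handling of sparse ground truth) are not stated in the text and are left out.

```python
import numpy as np

def depth_metrics(gt, pred):
    """Error (lower is better) and accuracy (higher is better) measures."""
    gt = np.asarray(gt, dtype=float)
    pred = np.asarray(pred, dtype=float)
    # Symmetric ratio used by the threshold accuracies a1, a2, a3.
    ratio = np.maximum(gt / pred, pred / gt)
    return {
        "abs_rel": np.mean(np.abs(gt - pred) / gt),
        "sq_rel": np.mean((gt - pred) ** 2 / gt),
        "rmse": np.sqrt(np.mean((gt - pred) ** 2)),
        "rmse_log": np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2)),
        "a1": np.mean(ratio < 1.25),
        "a2": np.mean(ratio < 1.25 ** 2),
        "a3": np.mean(ratio < 1.25 ** 3),
    }

# Toy check: a perfect prediction and one that is uniformly twice too large.
perfect = depth_metrics([2.0, 4.0, 8.0], [2.0, 4.0, 8.0])
doubled = depth_metrics([2.0, 4.0, 8.0], [4.0, 8.0, 16.0])
```

A uniform factor-of-two error fails all three thresholds, because the ratio of 2 exceeds 1.25 cubed (about 1.95), which illustrates how strict the a1 to a3 accuracies are.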
Monodepth post-processing (pp) runs a second forward pass with a flipped input and combines the two outputs into a full disparity map. In contrast, our network generates a complete disparity map in a single forward pass, as depicted in Figure 4, and with higher overall quality, as confirmed by the results in Table 1. We compare our more complex networks, with rectangular convolutions and domain transformations, against Monodepth pp in Figure 5. Again, our networks outperform Monodepth pp in all metrics in Table 1.

Model           D  R  F  abs rel  sq rel  rmse   log rmse  a1     a2     a3     Warp rmse  Time   Param
Monodepth       -  -  -  0.149    2.565   6.645  0.245     0.849  0.936  0.969  17.565     0.015  31.6
Monodepth pp    -  -  x  0.114    1.138   5.452  0.204     0.859  0.946  0.977  17.565     0.032  31.6
rdispnet m      -  -  x  0.111    1.031   5.416  0.199     0.860  0.948  0.978  17.244     0.014  12.8
rrdispnet m     -  x  x  0.113    1.114   5.364  0.195     0.866  0.951  0.981  17.062     0.018  14.2
rrdispnet dtm   x  x  x  0.112    1.038   5.304  0.198     0.863  0.950  0.979  16.791     0.024  16.0
rrdispnet m pp  -  x  x  0.105    0.949   5.174  0.190     0.866  0.952  0.981  17.062     0.036  14.2

Table 1. Network performance metrics. D: network uses domain transformation blocks; R: network uses rectangular convolutions; F: network estimates a full disparity map (x: yes, -: no). Inference time is in seconds (s), tested on a Titan Xp GPU. Number of parameters is in millions.

Fig. 5. Qualitative comparison on the Kitti2015 dataset. From left to right: input, Monodepth pp, rrdispnet m, and rrdispnet dtm. Our networks succeed at detecting thin structures and suffer less from lateral side artifacts.

The additional complexity of 3x5 kernels and domain transformations makes our rrdispnet dtm slower than our rdispnet m but still faster than Monodepth pp. Among our single-pass networks, the rrdispnet dtm yields the lowest rmse and has intermediate values for the other Kitti metrics, which we regard as the best overall performance among our networks. As can be observed in Figure 5, our rrdispnet dtm generates sharper and more accurate disparity maps.
It is also more robust on thin structures and does not produce the lateral side artifacts that Monodepth pp does. In addition, the rrdispnet dtm has the lowest warp rmse, which measures how similar the generated right-from-left view is to the ground-truth right view. The warp rmse is a rough indicator of the quality of the estimated right disparity. The estimated right disparities and the generated right views are depicted in Figure 6. Interestingly, our network does a good job of estimating the full right disparity map even though the right view is unknown at test time.

Fig. 6. Left column: input left view, Monodepth and rrdispnet dtm output right disparities. Right column: ground-truth right view, warped left view using the disparities in the left column.

3.3. Own dataset

To test how well our models generalize, we evaluate the rrdispnet dtm on our own dataset and compare it against the Monodepth pp baseline (Figure 7). Since ground-truth disparity is not available for our dataset, we evaluate the quality of the disparity maps by the plausibility of the generated outputs. Our network generates more plausible disparity maps, in which details are well preserved, thin structures are better detected, and the disparities of far-away objects are better estimated.

Fig. 7. From left to right: input, Monodepth pp, and rrdispnet dtm.

4. CONCLUSIONS

In this paper, by using residual blocks, full-resolution features, rectangular convolution fusion, and domain transformations between the Left-domain and Left-Right-domain features, combined with the estimation of ambiguity masks, we achieved superior qualitative and quantitative results on the Kitti2015 benchmark over the Monodepth baseline. Furthermore, using our own dataset, we showed that our method generalizes better than the baseline. The design of our novel loss function allows for end-to-end unsupervised learning of binocular disparity and ambiguity masks.
We presented three networks that generate full disparity maps in a single pass at higher speeds than the Monodepth pp baseline. While other recent works focus on fully supervised disparity networks, we demonstrated a significant improvement in the unsupervised class of algorithms by modeling depth estimation as an ambiguous image reconstruction problem.

5. REFERENCES

[1] Aamir Saeed Malik and Tae-Sun Choi, "A novel algorithm for estimation of depth map using image focus for 3d shape recovery in the presence of noise," Pattern Recognition, vol. 41, pp. 2200–2225, 2008.
[2] Jure Zbontar and Yann LeCun, "Stereo matching by training a convolutional neural network to compare image patches," CoRR, vol. abs/1510.05970, 2015.
[3] Fayao Liu, Chunhua Shen, Guosheng Lin, and Ian D. Reid, "Learning depth from single monocular images using deep convolutional neural fields," CoRR, vol. abs/1502.07411, 2015.
[4] Nikolaus Mayer, Eddy Ilg, Philip Häusser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox, "A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation," CoRR, vol. abs/1512.02134, 2015.
[5] Philipp Fischer, Alexey Dosovitskiy, Eddy Ilg, Philip Häusser, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox, "Flownet: Learning optical flow with convolutional networks," CoRR, vol. abs/1504.06852, 2015.
[6] Zequn Jie, Pengfei Wang, Yonggen Ling, Bo Zhao, Yunchao Wei, Jiashi Feng, and Wei Liu, "Left-right comparative recurrent model for stereo matching," CoRR, vol. abs/1804.00796, 2018.
[7] A. Atapour-Abarghouei and T.P. Breckon, "Real-time monocular depth estimation using synthetic data with domain adaptation via image style transfer," in 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018, pp. 1–8, IEEE.
[8] Yue Luo, Jimmy S. J. Ren, Mude Lin, Jiahao Pang, Wenxiu Sun, Hongsheng Li, and Liang Lin, "Single view stereo matching," CoRR, vol. abs/1803.02612, 2018.
[9] Junyuan Xie, Ross B. Girshick, and Ali Farhadi, "Deep3d: Fully automatic 2d-to-3d video conversion with deep convolutional neural networks," CoRR, vol. abs/1604.03650, 2016.
[10] Eddy Ilg, Tonmoy Saikia, Margret Keuper, and Thomas Brox, "Occlusions, motion and depth boundaries with a generic network for disparity, optical flow or scene flow estimation," CoRR, vol. abs/1808.01838, 2018.
[11] Guorun Yang, Hengshuang Zhao, Jianping Shi, Zhidong Deng, and Jiaya Jia, "Segstereo: Exploiting semantic information for disparity estimation," CoRR, vol. abs/1807.11699, 2018.
[12] Dongdong Chen, Lu Yuan, Jing Liao, Nenghai Yu, and Gang Hua, "Stereoscopic neural style transfer," CoRR, vol. abs/1802.10591, 2018.
[13] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G. Lowe, "Unsupervised learning of depth and ego-motion from video," CoRR, vol. abs/1704.07813, 2017.
[14] Yuliang Zou, Zelun Luo, and Jia-Bin Huang, "Df-net: Unsupervised joint learning of depth and flow using cross-task consistency," CoRR, vol. abs/1809.01649, 2018.
[15] Ravi Garg, Vijay Kumar B. G, and Ian D. Reid, "Unsupervised CNN for single view depth estimation: Geometry to the rescue," CoRR, vol. abs/1603.04992, 2016.
[16] Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow, "Unsupervised monocular depth estimation with left-right consistency," CoRR, vol. abs/1609.03677, 2016.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," CoRR, vol. abs/1512.03385, 2015.
[18] Justin Johnson, Alexandre Alahi, and Fei-Fei Li, "Perceptual losses for real-time style transfer and super-resolution," CoRR, vol. abs/1603.08155, 2016.
[19] Karen Simonyan and Andrew Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, 2014.
[20] Moritz Menze and Andreas Geiger, "Object scene flow for autonomous vehicles," in Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[21] David Eigen, Christian Puhrsch, and Rob Fergus, "Depth map prediction from a single image using a multi-scale deep network," CoRR, vol. abs/1406.2283, 2014.
[22] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," CoRR, vol. abs/1412.6980, 2014.
[23] Andreas Geiger, Philip Lenz, and Raquel Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
