Enhanced Encoder-Decoder Architecture for Accurate Monocular Depth Estimation
Authors: Dabbrata Das, Argho Deb Das, Farhan Sadaf
Department of Computer Science and Engineering, Khulna University of Engineering & Technology, Khulna - 9203, Bangladesh.

ARTICLE INFO
Keywords: Monocular Depth Estimation, Depth Map, Encoder-Decoder Architecture, Inception-ResNet-v2 (IRv2), NYU Depth V2, KITTI, Cityscapes, Computer Vision, Deep Learning

ABSTRACT
Estimating depth from a single 2D image is a challenging task due to the lack of stereo or multi-view data, which are typically required for depth perception. In state-of-the-art architectures, the main challenge is to efficiently capture complex objects and fine-grained details, which are often difficult to predict. This paper introduces a novel deep learning-based approach using an enhanced encoder-decoder architecture, where the Inception-ResNet-v2 model serves as the encoder. This is the first instance of utilizing Inception-ResNet-v2 as an encoder for monocular depth estimation, demonstrating improved performance over previous models. It incorporates multi-scale feature extraction to enhance depth prediction accuracy across various object sizes and distances. We propose a composite loss function comprising depth loss, gradient edge loss, and Structural Similarity Index Measure (SSIM) loss, with fine-tuned weights to optimize the weighted sum, ensuring a balance across different aspects of depth estimation. Experimental results on the KITTI dataset show that our model achieves a significantly faster inference time of 0.019 seconds, outperforming vision transformers in efficiency while maintaining good accuracy. On the NYU Depth V2 dataset, the model establishes state-of-the-art performance, with an Absolute Relative Error (ARE) of 0.064, a Root Mean Square Error (RMSE) of 0.228, and an accuracy of 89.3% for 𝛿 < 1.25.
These metrics demonstrate that our model can accurately and efficiently predict depth even in challenging scenarios, providing a practical solution for real-time applications.

1. Introduction
Scene depth estimation is a crucial task in computer vision that significantly enhances machine perception and comprehension of three-dimensional (3D) environments. Depth estimation is crucial for various applications, including autonomous driving, robotic navigation, virtual reality (VR), and augmented reality (AR). In these fields, precise depth information helps machines interact efficiently with the actual world, ensuring safe navigation, accurate object detection, and realistic interaction with virtual environments.
Traditional depth estimation techniques, such as stereo vision and active depth sensing using LiDAR [1] or structured light, have been widely used to generate depth maps. These methods are effective but often require specialized hardware setups, including multiple cameras or expensive depth sensors, which significantly increase the cost and complexity of the system. For example, in autonomous driving, LiDAR systems provide highly accurate depth maps, but their high cost and resource requirements limit their broad deployment. This has increased demand for more scalable and cost-effective solutions, particularly in scenarios where only a single camera is available.
Monocular depth estimation, which derives depth from a single 2D image, has emerged as an appealing alternative due to its simplicity and minimal technology prerequisites. It removes the need for stereo vision systems or depth sensors, making it compatible with various devices, including mobile

∗ Corresponding author
dasdabbrata@gmail.com (D.D.); atdeb727@gmail.com (A.D.D.); farhansadaf@cse.kuet.ac.bd (F.S.)
1 Data: https://www.kaggle.com/datasets/soumikrakshit/nyu-depth-v2
2 Code: https://github.com/dabbrata/Depth-Estimation-Enc-Dec

phones and drones. This challenge is tricky because, without stereo information, accurately estimating depth from a single image becomes more difficult. Problems like telling apart objects of different sizes or determining whether one object occludes another make monocular depth estimation especially difficult.
Early studies on monocular depth estimation predominantly utilized hand-crafted features and geometric indicators such as vanishing points, shadows, and defocus. These methods worked well in simple environments but struggled in complex real-world scenes. With machine learning, models using techniques like Scale-Invariant Feature Transform (SIFT) [2] and Conditional Random Fields (CRF) [3] improved depth map predictions by learning from data. However, these models still struggled to generalize across different types of scenes because they relied too heavily on hand-made features.
The recent growth of deep learning has significantly improved monocular depth estimation. Convolutional Neural Networks (CNNs), which can learn complex patterns directly from raw image data, are replacing traditional handcrafted features. Models proposed by Kim et al. [4] and Laina et al. [5] demonstrate that deep neural networks can predict dense, high-resolution depth maps with significant accuracy, even in challenging environments. Moreover, the integration of Generative Adversarial Networks (GANs) [6] and attention mechanisms has advanced the field even further. Qiao et al. [7] introduce an innovative multi-stage depth super-resolution network that utilizes explicit high-frequency data from a transformer and implicit signals from the frequency domain to improve depth map reconstruction. Besides, Zhang et al.
[55] leveraged deep learning architectures for decomposing, extracting, and refining features from data or images to separate individual fingerprint features.

Das et al.: Preprint submitted to The Visual Computer. Page 1 of 19.

Nomenclature
ARE: Absolute Relative Error
CNN: Convolutional Neural Network
DNN: Deep Neural Network
FLOPs: Floating Point Operations
GANVO: Generative Adversarial Network for Visual Odometry
GAN: Generative Adversarial Network
IoU: Intersection over Union
IR-A: Inception-ResNet A
IR-B: Inception-ResNet B
IR-C: Inception-ResNet C
IRv2: Inception-ResNet v2
LOG10: Logarithmic Base 10
LR: Linear Regression
LSTM: Long Short-Term Memory
MARE: Mean Absolute Relative Error
MSE: Mean Square Error
R-A: Reduction A
R-B: Reduction B
R²: Coefficient of Determination
ReLU: Rectified Linear Unit
ResNet: Residual Network
RMSE: Root Mean Square Error
RNN: Recurrent Neural Network
SGANVO: Stacked Generative Adversarial Network for Visual Odometry
SSIM: Structural Similarity Index Measure
ViT: Vision Transformer

Furthermore, Yadav et al. [8] presented an inception-based self-attentive generative adversarial network designed for high-quality facial image synthesis. The parallel self-attention module improves image quality by preserving spatial characteristics and speeds up convergence.
This paper focuses on advancing monocular depth estimation using deep learning, proposing a novel architecture based on Inception-ResNet-v2 [9], which excels at capturing multi-scale features and refining depth predictions. Our contributions include a customized loss function which includes depth loss, gradient edge loss, and SSIM loss to optimize both the accuracy and structural consistency of depth maps.
Here SSIM, edge, and depth losses are combined, functioning like feature fusion [57], to create a robust model for depth prediction under varying conditions. We demonstrate the efficacy of our methodology through experiments on the NYU Depth V2 [10], KITTI [13] and Cityscapes [19] datasets, where our model surpasses leading encoder-decoder techniques in accuracy on NYU Depth V2 and outperforms vision transformers in efficiency on KITTI.
The remaining parts of this work are structured as follows: Section II examines pertinent literature in monocular depth estimation, emphasizing significant progress in deep learning. Section III presents the approach and model's architecture, while Section IV covers the dataset, training procedure, and specifics of implementation. Section V includes the results and analysis, while Section VI concludes with future prospects and open problems in the field.

2. Related Works
Monocular depth estimation has made tremendous advances, transitioning from classic manual methods to advanced deep learning-based systems. This section examines major advances, focusing on early approaches, CNNs, and more recent innovations such as transformer-based systems, emphasizing their merits and limits.

2.1. Traditional Methods: Early Foundations in Depth Estimation
The first attempts at depth estimation were mostly based on classic computer vision techniques like stereo vision and structure-from-motion. These monocular depth estimation approaches used handmade features to estimate depth from two-dimensional images. Models such as the Scale-Invariant Feature Transform (SIFT) [2] and Conditional Random Fields (CRF) [3] used predetermined features to estimate depth.
While these algorithms performed well in controlled environments, they were unable to generalize effectively to complex, real-world scenes due to their heavy dependence on static features and inability to adjust to changes in lighting, texture, or object shadowing. For example, regular stereo vision systems require specialized hardware setups using many cameras or depth sensors such as LiDAR, making them expensive and difficult to implement on a large scale. Furthermore, these algorithms usually fail to adapt in dynamic or crowded situations, limiting their use in real-world applications like autonomous driving or augmented reality.

2.2. The Deep Learning Era: Transition to data-driven approaches
With the rise of deep learning, the field of monocular depth estimation made great progress. Convolutional Neural Networks (CNNs) became popular as powerful tools to learn complex patterns from image data. Laina et al. [5] created a fully convolutional residual network for monocular depth estimation, incorporating an innovative upsampling method to enhance output resolution. Their model attained an Absolute Relative Error (ARE) of 0.127 on the NYU Depth V2 dataset [10] and 0.176 on the Make3D dataset [32], respectively. Xu et al. [27] formulated a structured attention-guided conditional neural field model for the estimation of monocular depth. The integration of multi-scale characteristics and attention mechanisms resulted in superior performance compared to previous CRF-based models, achieving an ARE of 0.125 on the NYU Depth V2 dataset [10] and 0.122 on the KITTI dataset [13]. Zoran et al. [28] established a deep learning framework for mid-level vision tasks that acquires ordinal correlations among picture points.
Utilizing the NYU Depth V2 [10] dataset, their model achieved an RMSE(log) of 0.42 in depth estimation. Kim et al. [4] introduced a deep variational model for monocular depth estimation, integrating global and local predictions from two CNNs. Their model achieved an RMSE(log) of 0.172 on the NYU Depth V2 [10] dataset. Lee and Kim [26] created a convolutional neural network using an encoder-decoder architecture for depth estimation utilizing the NYUv2 [10] dataset. Their approach achieved an RMSE(log) of 0.180, improving accuracy by effectively combining depth maps at multiple scales. Zhang et al. [51] developed a multitask learning framework that estimates depth, camera pose, and semantic segmentation from monocular videos. Using geometric reasoning, their method achieved state-of-the-art performance with an RMSE of 6.317 on the KITTI dataset [13] and an average IoU improvement of 3.1% on SYNTHIA [52]. This highlights the effectiveness of combining different tasks into a single framework for increasing overall depth estimation performance.
Generative Adversarial Networks (GANs) also gained popularity during this period. Jung et al. [15] introduced GANs for monocular depth estimation, employing a GlobalNet to extract global features and a RefinementNet to ascertain local structures from a single image, utilizing the NYU Depth V2 [10] dataset. Their method achieved significant improvement, with an ARE of 0.134. Similarly, Lore et al. [16] introduced a depth map estimation technique utilizing Conditional Generative Adversarial Networks (cGANs). The methodology was assessed using the NYU Depth V2 dataset [10], resulting in a root mean square error (RMSE) of 0.875. Their model outperformed traditional non-parametric sampling methods. Feng and Gu [17] devised an unsupervised methodology for depth and ego-motion estimation with Stacked GANs (SGANVO).
The model surpassed current techniques in depth estimation on the KITTI dataset [13], with an average RMSE(log) of 0.1623 across a number of scenarios.
Recurrent Neural Networks (RNNs) were also studied for their ability to represent temporal dependencies in video footage. Kumar et al. [11] introduced an innovative convolutional LSTM-based [12] recurrent neural network architecture for monocular depth estimation from video sequences. Their methodology was intended to take advantage of temporal dependencies between video frames, and they evaluated it using the KITTI dataset [13]. The best-performing model achieved an absolute relative error of 0.137. Mancini et al. [14] improved scene depth prediction by adding LSTM [12] units after the encoder network's convolutional layers. Their approach, as evaluated on the KITTI dataset [13], significantly improved generalization, achieving an RMSE(log) of 0.366 and an Absolute Relative Difference of 0.312.
Despite these advances, significant challenges were still present. Many deep learning algorithms require large labeled datasets for supervised training, which can be time-consuming and expensive to collect. Furthermore, the generalization of models across different and unknown environments remained a major challenge, especially for applications needing stability in highly dynamic environments such as outdoor and underwater scenes. Another obstacle was computational resource needs, which limited the use of such models in resource-constrained systems such as mobile and embedded devices.

2.3. Recent Innovations: Tackling Modern Challenges
Recent innovations have addressed many of the shortcomings of previous techniques by adding unique architectures, self-supervised learning concepts, and task-specific optimizations. These developments have enhanced the accuracy and scalability of monocular depth estimation systems.
For instance, Li et al. [47] proposed a new approach for determining depth from a single image using classification and regression algorithms. It uses a Transformer-based technique to adaptively build depth bins and enhance predictions across many scales. It achieved state-of-the-art results, with RMSE(log) improvements of up to 6.1% on KITTI [13] and significant generalization on different datasets.
Another recent contribution by Lu and Chen [49] designed a framework for evaluating both depth and optical flow in dynamic conditions. Their method improves depth estimation accuracy by combining motion segmentation and self-supervised learning techniques. Tested on the KITTI dataset [13], it obtained remarkable accuracy with metrics such as RMSE(log) of 0.1680, outperforming previous approaches.
In the domain of aquatic environments, Lu and Chen [48] proposed a self-supervised approach for estimating monocular depth in water scenarios with specular reflection priors. The model segments water surfaces and uses reflections as intra-frame supervision to calculate depth. When tested on the WRS dataset, it produced state-of-the-art results, including an Absolute Relative Error (AbsRel) of 0.121 and an RMSE of 4.43, surpassing previous approaches by up to 28%.
Zhang et al. [50] proposed a three-step method for registering 3D point clouds in outdoor environments. It combines preprocessing, yaw angle estimation, and coarse registration with frequency histograms. In several outdoor circumstances, the approach outperformed state-of-the-art methods, reducing average angle errors by 62.8% and translation errors by 46.5%.
Zhou et al. [53] introduced a Perception-Oriented U-Shaped Transformer Network (U-Former) for assessing 360-degree image quality without using a reference image. The model extracts perceptual features using cube map projection, saliency-based self-attention, and a U-Former encoder.
Table 1: Summary of Impactful Works in Deep Learning-Based Depth Estimation

Reference | Analysis Type | Model | Technique | Dataset | Performance
Kumar et al. [11] | RNN-based | Convolutional LSTM | Utilized temporal dependencies between video frames | KITTI | Absolute Relative Error: 0.137
Jung et al. [15] | GAN-based | GlobalNet + RefinementNet | Extracted global features and estimated local structures | NYU Depth V2 | ARE: 0.134
Lore et al. [16] | GAN-based | Conditional GANs | Improved depth map estimation with cGANs | NYU Depth V2 | RMSE: 0.875
Li et al. [25] | CNN-based | VGG-16-based System | Fused depth and depth gradients | NYU Depth V2 | RMSE: 0.611
Lee and Kim [26] | CNN-based | Encoder-Decoder CNN | Combined depth maps at multiple scales | NYUv2 and KITTI | RMSE(log): 0.180 on NYUv2
Xu et al. [27] | CNN-based | Structured Attention Model | Integrated multi-scale characteristics and attention mechanisms | NYU Depth V2 and KITTI | ARE: 0.125 on NYUv2
Li et al. [47] | Transformer-based | Adaptive Depth Prediction | Built depth bins and enhanced predictions across many scales | NYU, KITTI, and SUN RGB-D | RMSE(log): improved up to 6.1% on KITTI
Lu and Chen [49] | Hybrid | Self-supervised Depth + Optical Flow | Combined motion segmentation and self-supervised learning techniques | KITTI | RMSE(log): 0.1680

Zhou et al. [58] proposed a blind image quality assessment model that combines self-attention and recurrent neural networks. It collects local and global image features using windowed self-attention and GRUs, achieving cutting-edge performance on benchmark datasets while resolving drawbacks in prior CNN- and transformer-based BIQA approaches. These approaches offer ways to assess image quality while utilizing advanced transformer-based methods. Furthermore, Xi et al.
[33] proposed LapUNet, a deep-learning framework for monocular depth estimation. Their model introduced a Dynamic Laplacian Residual U-shape (DLRU) module and incorporated an ASPP module to enhance multi-scale contextual feature extraction. Using the NYU Depth V2 [10] and KITTI datasets [13], the model achieved a significant improvement in depth accuracy, with an RMSE of 0.406 on NYU Depth V2 [10] and 2.247 on KITTI [13]. Song et al. [34] integrated CNN and Vision Transformer components, utilizing an improved HRFormer as the encoder. Using the KITTI [13], Cityscapes, and Make3D datasets, MFFENet outperformed state-of-the-art methods, achieving an RMSE of 4.356 on KITTI [13]. Choudhary et al. [36] proposed a dual-channel convolutional neural network (MEStereo-Du2CNN) designed for robust depth estimation using multi-exposure stereo images. The model introduces a mono-to-stereo transfer learning approach and eliminates the traditional cost volume construction in stereo matching. On the Middlebury dataset [37], the model achieved an RMSE of 0.079.
These newest developments address key issues such as improving depth prediction in complicated situations, maintaining fine-grained spatial information, and enabling improved generalization across various datasets. However, restrictions such as computing needs for real-time applications and dealing with extreme edge cases remain, providing room for further study. Table 1 provides a summary of influential works in monocular depth estimation.
Building upon insights from previous studies and their associated challenges, it is evident that enhancing the encoder-decoder architecture can address many of these issues effectively. Our proposed model is designed to tackle these challenges.
The key contributions of our work are summarized as follows:
• We utilized the Inception-ResNet-v2 (IRv2) architecture as the encoder, leveraging its simultaneous multi-scale feature extraction capability. This proved advantageous in achieving better accuracy, particularly for smaller and more complex objects.
• A composite loss function was employed, incorporating three distinct weights. After extensive experimentation with various combinations, the optimal values were identified, contributing to improved model performance.
• When compared to Vision Transformers (ViT), our IRv2-based model demonstrated significantly lower inference times. This efficiency enhancement makes it more suitable for real-time applications, offering a considerable advantage over ViT in practical scenarios.
• We experimented with our model in different indoor and outdoor scenarios, providing a detailed statistical analysis of the results, which revealed that our model outperforms several state-of-the-art architectures across various metrics.

3. Methodology
This section provides a comprehensive explanation of data preprocessing and model architecture with a detailed workflow and loss functions.

3.1. Preprocessing
Preprocessing the images is a crucial step in building the model, as it ensures better results before passing the data into it. We first applied the min-max normalization equation (1), a linear transformation that scales the image intensity values between 0 and 1.

x′ = (x − x_min) / (x_max − x_min)   (1)

Here, x′ is the normalized value and x is the original value. Data augmentation, including horizontal flipping of images, is also applied during the preprocessing steps.
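The normalization and flip-augmentation steps above can be sketched in a few lines of NumPy (a minimal illustration; the function names and array shapes are ours, not the paper's):

```python
import numpy as np

def min_max_normalize(x):
    """Eq. (1): linearly rescale intensities to the range [0, 1]."""
    x = x.astype(np.float64)
    return (x - x.min()) / (x.max() - x.min())

def augment_horizontal_flip(image, depth):
    """Horizontal-flip augmentation: flip the RGB image and its depth map together."""
    return np.fliplr(image), np.fliplr(depth)

# Deterministic toy image at the paper's 240 x 320 x 3 input resolution.
rgb = np.arange(240 * 320 * 3, dtype=np.float64).reshape(240, 320, 3)
norm = min_max_normalize(rgb)                     # values now lie in [0, 1]
flipped_rgb, flipped_depth = augment_horizontal_flip(norm, norm[..., 0])
```

Flipping the image and its depth map with the same transform keeps the pixel-to-depth correspondence intact, which is why geometric augmentations must be applied to both tensors at once.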
3.2. Network Architecture
Our model makes use of an encoder-decoder architecture based on deep convolutional neural networks, implemented using TensorFlow [38]. The decoder network employs deconvolution layers to produce the depth map by pixel-by-pixel estimation that matches the input size, while the encoder network utilizes convolution and pooling layers to acquire depth data. In the encoding and decoding stages, downsampling and upsampling are performed, respectively.

3.2.1. Encoder
In the encoder section, we used the Inception-ResNet-v2 (IRv2) [9] architecture as a pre-trained model. Downsampling is performed through the truncated layers of this pre-trained model. By passing the images through these layers, we extract more accurate features from the images. In the Inception-ResNet-v2 architecture, there are three categories of inception blocks (Inception-ResNet Block A, Inception-ResNet Block B, Inception-ResNet Block C), two types of reduction blocks (Reduction Block A and B), as well as a Stem Block and an Average Pooling layer. The inception blocks are repeated multiple times to form the full-layer architecture.

IR-A. The Inception-ResNet-A block is a crucial component of the Inception-ResNet-v2 architecture. It is designed to capture multi-scale features and reduce the vanishing gradient problem. This block typically comprises multiple parallel branches, each utilizing different convolutional kernel sizes, including Conv(1x1), Conv(3x3), and Conv(5x5). To optimize parameter efficiency, a 1x1 convolutional filter precedes the 3x3 filter, and the 5x5 filter is factorized into two 3x3 filters. Residual connections within the block facilitate improved gradient flow during both forward and backward passes, addressing the vanishing gradient issue. The activation functions, typically ReLUs, are applied after each convolution to stabilize and activate the features.
The structure of the Inception-ResNet-A block is represented in Figure 3 (I).

R-A. The Reduction-A block reduces spatial dimensions and parameters using convolutional layers with strides greater than 1 or pooling operations. It is crucial for compressing feature maps while retaining essential information, improving computational efficiency, and generating more compact feature representations. A simplified visualization of the block is outlined in Figure 3 (II).

IR-B. Inception-ResNet-B is another convolution block, represented in Figure 3 (III), consisting of a 7x7 convolution filter factorized into 1x7 and 7x1, two asymmetric convolutions. This factorization decreases the quantity of parameters, while the 1x1 convolution applied before the 7x7 convolution further limits the parameter count and enhances computational efficiency.

R-B. Figure 3 (IV) represents a simplified diagram of the Reduction-B block, where the block reduces spatial dimensions while increasing channel depth, helping the Inception-ResNet-v2 model capture high-level features efficiently during downsampling.

IR-C. Inception-ResNet-C is another inception block in the Inception-ResNet-v2 architecture that uses a 3x3 convolution filter, preceded by a 1x1 convolution to reduce the number of parameters. Additionally, the 3x3 convolution is factorized into two asymmetric convolutions. The convolutions occur in parallel, and finally, all the outputs are concatenated. This allows the network to focus on capturing more abstract and higher-level data from the image. Figure 3 (V) illustrates a simplified representation of the block.

The main advantage of using Inception-ResNet-v2 [9] is that multi-scale feature extraction occurs through the use of Inception blocks, which are specifically designed to extract features at multiple scales simultaneously.
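The parameter savings from the asymmetric factorizations described above are easy to verify with a quick weight count (the channel width below is illustrative, not the paper's exact layer size):

```python
def conv_params(kh, kw, c_in, c_out):
    """Weight count of a single kh x kw convolution (bias ignored for clarity)."""
    return kh * kw * c_in * c_out

c = 192  # illustrative channel width
full_7x7 = conv_params(7, 7, c, c)                             # one 7x7 filter bank
factored = conv_params(1, 7, c, c) + conv_params(7, 1, c, c)   # 1x7 followed by 7x1
ratio = full_7x7 / factored                                    # 49 / 14 = 3.5x fewer weights
```

The same accounting applies to the IR-A block, where two stacked 3x3 filters replace a 5x5 filter (18 vs. 25 weights per input-output channel pair).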
Figure 4 provides a clear illustration of how multi-scale feature extraction operates in Inception-ResNet-v2. Each Inception module has parallel branches, where each branch processes the input features with different operations, such as 1x1, 3x3, and 7x7 convolutions. Next, Inception-ResNet modules introduce residual connections, where the output of the parallel branches is combined with the input, mitigating vanishing gradient problems. The architecture often uses factorized convolutions to reduce the computational cost of processing large filters. For example, a 3x3 convolution is factorized into 1D convolutions. Finally, the outputs of all branches in the Inception module are concatenated. As the network progresses deeper, subsequent Inception modules process the multi-scale features extracted earlier. This hierarchical processing allows the network to capture complex features at various scales.

Figure 1: An outline of our network (encoder-decoder) architecture. The encoder uses a pre-trained Inception-ResNet-v2 (IRv2) [9] network, consisting of several Inception-ResNet blocks (A, B, and C) and reduction layers. The decoder consists of convolutional layers that process the upsampled output from the previous layer, combined with the corresponding feature maps from the encoder.

Figure 2: The process of generating a color map by mapping depth information from a grayscale image through an encoder-decoder network.

Figure 3: Schematic representation of the Inception-ResNet A (IR-A), Reduction A (R-A), Inception-ResNet B (IR-B), Reduction B (R-B), and Inception-ResNet C (IR-C) blocks.

Figure 4: Illustration of multi-scale feature extraction in Inception-ResNet-v2, showcasing parallel branches with varying filter sizes and combining features across different scales.

3.2.2. Decoder
After downsampling the image in the encoder section, the resolution becomes too low, losing some important features. As the number of features of the images is lower than the original size, it is necessary to reconstruct the image to increase the image's resolution before concatenation.
The process of increasing the image size from a lower number of parameters to a higher number of parameters is called upsampling. This process starts from the bottleneck of the architecture, where the bottleneck is the simplest form of the image, retaining only certain important features. We use the LeakyReLU [59] activation function before starting upsampling from the bottleneck. Here, LeakyReLU [59] is used to avoid neuron inactivity; it ensures gradient flow, preserves information, and leads to faster, more stable convergence in this stage. Then upsampling of the image is necessary to concatenate the upsampled image with its corresponding downsampled image through a skip connection (Figure 1). A skip connection is a mechanism in neural networks that allows the input from one layer to be concatenated directly to a subsequent layer, skipping one or more intermediate layers in the process, helping to preserve important information and reduce vanishing gradient problems. This concatenation is necessary because it helps recover the features that were lost from the image during the earlier stages of processing.

Algorithm 1: Inception-ResNet-v2 (IRv2) Encoder with Repeated Blocks
Input: Feature map x
Output: Encoded feature map IRv2_Encoder(x)
1  Repeat IR_A block 10 times:
2    i ← 1
3    while i ≤ 10 do
4      x ← IR_A(x)   [Figure 3 (I)]
5      i ← i + 1
6    end
7  Apply R_A block:
8    x ← R_A(x)   [Figure 3 (II)]
9  Repeat IR_B block 5 times:
10   i ← 1
11   while i ≤ 5 do
12     x ← IR_B(x)   [Figure 3 (III)]
13     i ← i + 1
14   end
15 Apply R_B block:
16   x ← R_B(x)   [Figure 3 (IV)]
17 Repeat IR_C block 10 times:
18   i ← 1
19   while i ≤ 10 do
20     x ← IR_C(x)   [Figure 3 (V)]
21     i ← i + 1
22   end
23 return x
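Algorithm 1 transcribes almost directly into Python. The block functions below are hypothetical placeholders standing in for the Figure 3 layers, and the repeat counts follow the algorithm as printed:

```python
def irv2_encoder(x, ir_a, r_a, ir_b, r_b, ir_c):
    """Sketch of Algorithm 1: chained Inception-ResNet blocks with reductions.

    ir_a, r_a, ir_b, r_b, ir_c are callables implementing the blocks of
    Figure 3 (placeholders here, not the paper's exact layers).
    """
    for _ in range(10):   # Repeat IR-A block 10 times
        x = ir_a(x)
    x = r_a(x)            # Reduction-A
    for _ in range(5):    # Repeat IR-B block 5 times
        x = ir_b(x)
    x = r_b(x)            # Reduction-B
    for _ in range(10):   # Repeat IR-C block 10 times
        x = ir_c(x)
    return x

# Toy check with identity stand-ins that count how often each block runs.
calls = {"A": 0, "B": 0, "C": 0, "RA": 0, "RB": 0}
def make(name):
    def block(x):
        calls[name] += 1
        return x
    return block

out = irv2_encoder("features", make("A"), make("RA"), make("B"), make("RB"), make("C"))
```

In practice the callables would be Keras layer stacks from the pre-trained IRv2 model; the counting stand-ins here only verify the control flow of the algorithm.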
Here, skip connections enable the decoder to retrieve high-resolution feature maps from the encoder, guaranteeing the preservation of these features in the final depth map. In the final stage of the model, a sigmoid activation function is applied to the depth map output so that the predicted values are scaled between 0 and 1, providing a normalized representation of the depth information.

3.3. Output
The trained model generates a depth map from an input RGB image, representing object distances in the scene. In a grayscale map, closer objects appear darker while farther ones are lighter, or vice versa. In a color depth map, the Inferno_r colormap (from Matplotlib [60]) is used, transitioning from bright yellow for closer objects to dark purple for farther ones.

Figure 5: Layer-by-layer feature map representation within the encoder-decoder network architecture, with Inception-ResNet-v2 (IRv2) as the encoder, designed for depth map generation. [Encoder path: input 240 × 320 × 3 downsampled through 27 × 37 × 128, 13 × 18 × 1088, and 6 × 8 × 448 to a 6 × 8 × 1536 bottleneck; the decoder upsamples through the mirrored resolutions back to 240 × 320 × 3, with skip connections concatenating matching encoder and decoder maps, and the output compared against the GT depth map.]

In the case of the color map representation, two steps always happen. First, the model generates a grayscale image as a depth map, so each pixel's color is reduced to a single intensity value. After that, the grayscale values are mapped to colors in the Inferno_r colormap.
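The two-step mapping just described (depth → normalized grayscale → color) can be sketched as below. A two-point linear gradient from bright yellow to dark purple stands in for Matplotlib's full inferno_r lookup table; the anchor colors approximate inferno's endpoints and are an assumption for illustration.

```python
import numpy as np

def depth_to_gray(depth):
    """Step 1: normalize a raw depth map to grayscale intensities in [0, 1]."""
    d = depth.astype(float)
    return (d - d.min()) / (d.max() - d.min())

def gray_to_color(gray):
    """Step 2: map grayscale to RGB. Low values (near objects) become
    bright yellow and high values (far objects) dark purple, mimicking
    the inferno_r direction with a simple two-colour gradient."""
    yellow = np.array([252, 255, 164]) / 255.0  # approx. inferno's bright end
    purple = np.array([0, 0, 4]) / 255.0        # approx. inferno's dark end
    g = gray[..., None]                         # (H, W, 1) for broadcasting
    return (1.0 - g) * yellow + g * purple      # (H, W, 3)

depth = np.array([[0.5, 2.0], [4.0, 8.0]])      # toy metric depths
rgb = gray_to_color(depth_to_gray(depth))
print(rgb.shape)  # (2, 2, 3)
```

In practice the same effect is obtained with `matplotlib.cm.get_cmap("inferno_r")` applied to the normalized depth map.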
As a result, low grayscale values (close to black) are mapped to bright yellow, and high grayscale values (close to white) are mapped to dark purple.

3.4. Loss Function
A typical loss function for depth regression problems accounts for the discrepancy between y, the ground truth depth map, and ŷ, the predicted depth map. The choice of loss function can greatly affect both the training pace and the overall depth estimation performance, and a wide range of loss function modifications are used to optimize neural networks for depth estimation. In our method, a composite loss function, equation (2), is used, which helps increase the model's accuracy by tuning the weights. For training the model, the loss L between y and ŷ is defined as the weighted sum of three loss functions:

L(y, ŷ) = w₁ L_depth(y, ŷ) + w₂ L_grad(y, ŷ) + w₃ L_SSIM(y, ŷ)   (2)

Here, w₁, w₂, and w₃ are the weights assigned to the different losses, and L_depth, L_grad, and L_SSIM are the depth, gradient, and SSIM loss terms, respectively.

3.4.1. Depth Loss
The loss term L_depth refers to the point-wise loss, a typical loss function for all deep-learning-based techniques. The pixel-by-pixel discrepancy between the predicted depth map and the ground truth depth map is estimated by averaging the absolute differences over every pixel in the image:

L_depth(y, ŷ) = (1/n) Σ_p |y_p − ŷ_p|   (3)

3.4.2. Gradient Edge Loss
The gradient edge loss L_grad is measured by calculating the mean absolute disparity between the vertical and horizontal gradients of the real depth and the predicted depth:

L_grad(y, ŷ) = (1/n) Σ_p ( |g_x(y_p, ŷ_p)| + |g_y(y_p, ŷ_p)| )   (4)

Here, g_x and g_y represent horizontal edges and vertical edges, respectively.

3.4.3.
Structural Similarity (SSIM) Loss
Finally, the loss term L_SSIM is used to determine how well structural features are retained when comparing the predicted depth map to the ground truth depth map. The first step in calculating the structural loss is to determine the SSIM index:

L_SSIM(y, ŷ) = (1 − SSIM(y, ŷ)) / 2   (5)

SSIM(y, ŷ) = ((2 μ_y μ_ŷ + C₁)(2 σ_yŷ + C₂)) / ((μ_y² + μ_ŷ² + C₁)(σ_y² + σ_ŷ² + C₂))   (6)

Here, μ_y and μ_ŷ denote the means of y and ŷ, σ_y² and σ_ŷ² their variances, and σ_yŷ the covariance between y and ŷ. C₁ and C₂ are constants used to stabilize the division.

In the composite loss function with three weighted components w₁, w₂, and w₃ (each ranging between 0 and 1), we initially set all weights to 1. We then iteratively adjusted the weights one at a time to optimize model performance, reducing w₁ while keeping w₂ and w₃ constant. If performance degraded compared to the initial setup, we restored w₁ to its previous value and moved on to adjust w₂, and then w₃ in the same way. Using this iterative cross-checking process, we fine-tuned the values of w₁, w₂, and w₃ for optimal model performance. Algorithm 2 explains the weight-selection process in detail.
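Equations (2)-(6) can be sketched in NumPy as below. The SSIM here uses global image statistics rather than the usual windowed computation, an assumption that keeps the sketch short; C₁ and C₂ follow the common (0.01·L)² and (0.03·L)² choice with dynamic range L = 1 for depth maps normalized to [0, 1].

```python
import numpy as np

def depth_loss(y, y_hat):
    """Eq. (3): mean absolute pixel-wise difference."""
    return np.abs(y - y_hat).mean()

def gradient_edge_loss(y, y_hat):
    """Eq. (4): mean absolute difference of horizontal (axis 1)
    and vertical (axis 0) finite-difference gradients."""
    gx = np.abs(np.diff(y, axis=1) - np.diff(y_hat, axis=1)).mean()
    gy = np.abs(np.diff(y, axis=0) - np.diff(y_hat, axis=0)).mean()
    return gx + gy

def ssim_loss(y, y_hat, c1=0.01 ** 2, c2=0.03 ** 2):
    """Eqs. (5)-(6), computed with global statistics instead of local windows."""
    mu_y, mu_h = y.mean(), y_hat.mean()
    var_y, var_h = y.var(), y_hat.var()
    cov = ((y - mu_y) * (y_hat - mu_h)).mean()
    ssim = ((2 * mu_y * mu_h + c1) * (2 * cov + c2)) / \
           ((mu_y ** 2 + mu_h ** 2 + c1) * (var_y + var_h + c2))
    return (1.0 - ssim) / 2.0

def composite_loss(y, y_hat, w1=1.0, w2=1.0, w3=1.0):
    """Eq. (2): weighted sum of the three loss terms."""
    return (w1 * depth_loss(y, y_hat)
            + w2 * gradient_edge_loss(y, y_hat)
            + w3 * ssim_loss(y, y_hat))

y = np.random.rand(8, 8)
print(composite_loss(y, y))  # ≈ 0.0 for identical maps
```

In the actual training pipeline a framework implementation such as `tf.image.ssim` would replace the global-statistics SSIM, but each term above is term-by-term consistent with equations (2)-(6).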
Algorithm 2: Optimizing Weights w₁, w₂, w₃
Input: Initial weights w₁ = w₂ = w₃ = 1, validation set
Output: Optimized weights w₁, w₂, w₃
1   for w ∈ {w₁, w₂, w₃} do
2     Fix the other weights, decrement w;
3     while performance improves do
4       Validate performance;
5       if performance deteriorates then
6         Revert w;
7       end
8     end
9   end
10  Cross-check and refine the values of w₁, w₂, w₃ to ensure optimal adjustments;
11  Evaluate on the validation set;
12  return optimized weights w₁, w₂, w₃;

4. Implementation
4.1. Dataset
We used the NYU Depth V2 [10] dataset to assess our proposed model. It contains RGB-D images captured with a Microsoft Kinect sensor: almost 1,400 densely annotated indoor scenes from 464 different locations, each with depth information and matching RGB images. The dataset covers a variety of room types, such as living rooms, kitchens, and offices, with detailed per-pixel object tagging. From the 120,000 available samples in the dataset, a random subset of 65,000 was designated for training, while 654 samples were reserved for testing. We applied data augmentation to all images in the training set by horizontally flipping them while maintaining a constant height of 240 pixels to improve model training. The model generates predictions at a resolution of 240 × 320 × 3, half the input size (480 × 640 × 3). For training, the input images are down-scaled to 240 × 320 × 3, while the ground truth depth maps remain at their original resolution. Down-sampling reduces the computational load and speeds up training by decreasing the number of pixels the model must process, making it more efficient.
During testing, the predicted depth map for the entire test image is computed and then up-scaled by a factor of two to match the ground truth resolution.

Algorithm 3: Training the Depth Prediction Model on NYU Depth V2
Input: Training data D_train from NYU Depth V2, learning rate η, number of iterations T, weights w₁, w₂, w₃ for the loss terms
Output: Trained model θ
1   Initialize the model and optimizer:
2     θ ← Random Initialization;
3     Optimizer ← Adam(θ, η);
4     t ← 0;
5   while t < T do
6     Load a batch of training data: (x, y) ← LoadBatch(D_train);
7     Forward pass (IRv2 encoder-decoder):
8       z_encoder ← IRv2_Encoder(x);
9       ŷ ← Decoder(z_encoder);
10    Compute the total loss L_total:
11      L_depth ← MeanAbsoluteError(y, ŷ);
12      L_grad ← GradientEdgeLoss(y, ŷ);
13      L_SSIM ← SSIM_Loss(y, ŷ);
14      L_total ← w₁ L_depth + w₂ L_grad + w₃ L_SSIM;
15    Backpropagation and update:
16      Optimizer.zero_gradients();
17      L_total.backward();
18      Optimizer.step();
19      t ← t + 1;
20  end
21  return θ;

We used two additional datasets with outdoor scenes for our experiments. The first is the KITTI [13] depth estimation dataset, a subset of the KITTI dataset specifically designed for depth estimation tasks using monocular or stereo images. For this study, 7,281 image pairs were used for training and 200 image pairs were allocated for testing and evaluation; each pair consists of a single RGB image and its corresponding ground truth depth map. The second is the Cityscapes [19] dataset, originally designed for semantic segmentation, which we use to evaluate the model's performance on that task. From this dataset, a subset of 1,572 image pairs was selected for training and 500 image pairs were used for testing and evaluation. During preprocessing, all images were resized to 240 × 320 × 3 to ensure data consistency for both training and testing.

4.2.
Environmental Setup
The proposed model was implemented using TensorFlow [38]. Training was conducted on an NVIDIA T4 GPU setup (2 GPUs), each with 16 GB of memory. The Adam [39] optimizer was used with AMSGrad enabled, a variant of gradient descent that adapts the learning rate dynamically. The initial learning rate was set to 0.0001, and training was performed over 15 epochs. The default momentum parameters for Adam were maintained, with β₁ = 0.9 and β₂ = 0.999. The complete training process took approximately 5 hours with a batch size of 16. To mitigate overfitting, data augmentation techniques were applied, including horizontal flipping and cropping the height of the images to a constant size of 320 pixels. We did the same for the other pre-trained models, including VGG19 [40], ResNet50 [41], ResNet152 [41], DenseNet169 [42], and DenseNet201 [42]. These models vary in their number of parameters, and their performance varied accordingly, with different error values and accuracy scores.

4.3. Evaluation Metrics
The error metrics used to evaluate the model's performance are defined by the following equations, which provide a quantitative assessment.

• Root Mean Squared Error (RMSE):

RMSE = sqrt( (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)² )   (7)

• Log10 Error:

Log10 = (1/N) Σ_{i=1}^{N} |log₁₀(y_i) − log₁₀(ŷ_i)|   (8)

• Absolute Relative Error (ARE):

ARE = (1/N) Σ_{i=1}^{N} |y_i − ŷ_i| / y_i   (9)

• Accuracy: Three different threshold values are used to evaluate the depth map's accuracy. A threshold in depth map prediction typically refers to a set of pre-defined limits within which the estimated depth values are considered accurate.
The thresholds provide a straightforward way to evaluate the accuracy of depth predictions at different levels of precision.

δ_i = (1/N) Σ_{i=1}^{N} [ max(y_i / ŷ_i, ŷ_i / y_i) < th_i ],  where th_i ∈ {1.25, 1.25², 1.25³}   (10)

where ŷ_i is a pixel in the predicted depth image ŷ and y_i is a pixel in the ground truth depth image y. N is the total number of pixels in each depth image, and δ represents the accuracy corresponding to the respective threshold value.

5. Results
5.1. Comparison with State-of-the-Art Architectures across Various Datasets
From Table 3, the proposed encoder-decoder method with IRv2 achieves a δ < 1.25 accuracy of 0.893, significantly higher than most competing methods. This high accuracy at a strict threshold (δ < 1.25) indicates that the method can reliably estimate depth with high precision, even when objects in the image are complex and vary in distance.

Table 2
Efficiency analysis of various encoder-decoder methods for monocular depth estimation on the NYU Depth V2 [10] dataset.

Method                  | Training Time (s) | Testing Time (s/sample)
Enc-Dec-DenseNet201     | 18400             | 0.019
Enc-Dec-ResNet152       | 15800             | 0.015
Enc-Dec-VGG19           | 20115             | 0.019
Enc-Dec-IRv2 (Proposed) | 17550             | 0.018

Table 3 also presents the computational complexity of the models measured in floating point operations (FLOPs). The proposed model has a computational complexity of 1.1 × 10¹¹ FLOPs, which is higher than DenseNet201 [42] and ResNet152 [41]; however, VGG19 [40] has the highest computational cost, with the most FLOPs among all the compared models. In Figure 6, the outputs of two different models are presented. The marked areas for Alhashim et al. [43] and our model highlight the differences in depth prediction accuracy when compared to the ground truth depth map. From the marked areas, it is evident that complex objects, or the complex portions of objects, are distinguished more precisely by our model.
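The four metrics of equations (7)-(10) can be sketched in NumPy; the toy depth values below are illustrative, not taken from the evaluated datasets.

```python
import numpy as np

def rmse(y, y_hat):
    """Eq. (7): root mean squared error."""
    return np.sqrt(np.mean((y - y_hat) ** 2))

def log10_error(y, y_hat):
    """Eq. (8): mean absolute error in log10 space."""
    return np.mean(np.abs(np.log10(y) - np.log10(y_hat)))

def abs_rel_error(y, y_hat):
    """Eq. (9): absolute relative error, normalized by ground truth."""
    return np.mean(np.abs(y - y_hat) / y)

def delta_accuracy(y, y_hat, th=1.25):
    """Eq. (10): fraction of pixels whose depth ratio stays below th."""
    ratio = np.maximum(y / y_hat, y_hat / y)
    return np.mean(ratio < th)

y = np.array([1.0, 2.0, 4.0, 8.0])      # toy ground-truth depths
y_hat = np.array([1.1, 2.0, 3.0, 8.0])  # toy predictions
print(delta_accuracy(y, y_hat))          # 0.75: 3 of 4 ratios are < 1.25
```

Because the δ metric uses the max of both ratios, it penalizes over- and under-prediction symmetrically, which is why it is reported at the three increasingly lenient thresholds 1.25, 1.25², and 1.25³.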
Besides, our model achieves the lowest ARE of 0.064, showing that it is more consistent in estimating depth values, with the lowest RMSE of 0.228, indicating that the predicted depth values align more closely with the true values. The model also excels in the Log10 error, scoring 0.032, showing superior performance in minimizing the magnitude of error on the logarithmic scale. Some studies demonstrate improved accuracy under specialized scenarios; for example, the model of Li et al. [46] outperforms ours by leveraging a combination of multiscale boundary features and reflectance to enhance overall prediction accuracy.

During training, the model's loss gradually decreases until it converges with the validation loss. At that point, training is terminated, yielding an optimal loss value (0.1523) for our model with the chosen number of epochs. Figure 8 illustrates the variation of training and validation loss after each epoch.

The histograms in the top-left and bottom-left corners of Figure 11 compare the R² scores of our proposed model (depicted in blue) against the encoder-decoder model using VGG19 and the architecture proposed by Alhashim et al. [43] (shown in orange). Additionally, the scatter plots on the right side of Figure 11 display the R² scores for individual test images. Our model significantly outperforms these alternatives, achieving an R² score of 0.8682.

The proposed model was then evaluated on the outdoor KITTI dataset [13]. A comparative analysis was conducted between our proposed model, IRv2 [9], and the Vision Transformer-based model DINOv2 [56]. As illustrated in Figure 7, our model demonstrates results that closely align with DINOv2-S, the most efficient and smallest variant among the DINOv2 models.
The results were obtained using high-resolution input images (1242 × 375).

Figure 6: Monocular depth estimation on the NYU Depth V2 dataset using various encoder-decoder architectures, with output comparison between the method of Alhashim et al. [43] and the proposed model. [Columns: RGB, GT, output depth map (Alhashim et al.), output depth map (proposed); color scale from near to distant.]

Figure 7: Monocular depth estimation on the KITTI [13] dataset using two different architectures: DINOv2-S [56] and Inception-ResNet-v2 [9]. [Columns: RGB, GT, IRv2 (proposed), DINOv2-S.]

Table 3
Performance comparison of various encoder-decoder methods for depth estimation on the NYU Depth V2 dataset, where each method uses a different pre-trained encoder. All methods were implemented by the authors, with the proposed IRv2 model demonstrating superior performance across both error metrics and accuracy thresholds (δ < 1.25, δ < 1.25², δ < 1.25³).

Method                  | Params  | FLOPs  | ARE   | RMSE  | Log10 | δ<1.25 | δ<1.25² | δ<1.25³
Enc-Dec-DenseNet201     | 17.23 M | 1.6e10 | 0.101 | 0.320 | 0.051 | 0.831  | 0.956   | 0.979
Enc-Dec-ResNet152       | 15.54 M | 7.6e09 | 0.117 | 0.364 | 0.058 | 0.796  | 0.942   | 0.970
Enc-Dec-VGG19           | 22.37 M | 1.3e12 | 0.110 | 0.324 | 0.054 | 0.820  | 0.950   | 0.976
Enc-Dec-IRv2 (Proposed) | 31.15 M | 1.1e11 | 0.064 | 0.228 | 0.032 | 0.893  | 0.967   | 0.985

Table 4
Performance comparison of various existing methods for depth estimation using error metrics and accuracy at different thresholds (δ < 1.25, δ < 1.25², δ < 1.25³) on the NYU Depth V2 dataset, where the proposed model outperforms the others.

Method                    | ARE   | RMSE  | Log10 | δ<1.25 | δ<1.25² | δ<1.25³
Rudolph et al. [44]       | 0.138 | 0.501 | 0.058 | 0.823  | 0.961   | 0.990
Basak et al. [45]         | 0.103 | 0.388 | –     | 0.892  | 0.978   | 0.995
Alhashim et al. [43]      | 0.123 | 0.465 | 0.053 | 0.846  | 0.974   | 0.994
Lee et al. [26]           | 0.131 | 0.538 | –     | 0.837  | 0.971   | 0.994
Xu et al. [27]            | 0.125 | 0.593 | 0.057 | 0.806  | 0.952   | 0.986
Jung et al. [15]          | 0.134 | 0.527 | –     | 0.822  | 0.971   | 0.993
Li et al. (VGG16) [25]    | 0.152 | 0.611 | 0.064 | 0.789  | 0.955   | 0.988
Li et al. (VGG19) [25]    | 0.146 | 0.617 | 0.063 | 0.795  | 0.958   | 0.991
Li et al. (ResNet50) [25] | 0.143 | 0.635 | 0.063 | 0.788  | 0.958   | 0.991
Enc-Dec-IRv2 (Proposed)   | 0.064 | 0.228 | 0.032 | 0.893  | 0.967   | 0.985

Table 5
Performance comparison of the proposed method against various Vision Transformer (ViT) models for depth estimation, based on inference time at different input dimensions on the KITTI [13] dataset using 2x NVIDIA T4 GPUs.

Method                  | Inference Time (s/sample) ↓
                        | 224 × 224 | 320 × 240 | 1242 × 375
DINOv2 (ViT-G) [56]     | 3.651     | 5.652     | 76.19
DINOv2 (ViT-L) [56]     | 1.296     | 2.177     | 31.32
DINOv2 (ViT-B) [56]     | 0.667     | 1.129     | 14.14
DINOv2 (ViT-S) [56]     | 0.440     | 0.781     | 8.959
Enc-Dec-IRv2 (Proposed) | –         | 0.019     | –

Table 6
Performance comparison of the proposed method against various Vision Transformer (ViT) models for depth estimation, based on error metrics and accuracy at different thresholds (δ < 1.25, δ < 1.25², δ < 1.25³) on the KITTI [13] dataset.

Method                  | Params    | ARE   | RMSE  | Log10 | δ<1.25 | δ<1.25² | δ<1.25³
DINOv2 (ViT-G) [56]     | 1196.05 M | 0.065 | 2.112 | 0.038 | 0.968  | 0.997   | 0.999
DINOv2 (ViT-L) [56]     | 337.10 M  | 0.082 | 2.788 | 0.062 | 0.952  | 0.993   | 0.998
DINOv2 (ViT-B) [56]     | 109.70 M  | 0.089 | 2.965 | 0.077 | 0.937  | 0.990   | 0.995
DINOv2 (ViT-S) [56]     | 35.48 M   | 0.101 | 3.208 | 0.094 | 0.923  | 0.986   | 0.993
Enc-Dec-IRv2 (Proposed) | 31.15 M   | 0.124 | 3.720 | 0.110 | 0.899  | 0.972   | 0.989

Table 7
Performance comparison of various encoder-decoder methods for segmentation tasks on the Cityscapes [19] dataset. The table highlights training and testing times along with segmentation accuracy metrics such as Mean IoU, Precision, Recall, F1-Score, and Accuracy.

Method                  | Training Time (s) | Testing Time (s/sample) | Mean IoU | Precision | Recall | F1-Score | Accuracy (%)
Enc-Dec-DenseNet201     | 3314              | 0.018                   | 0.842    | 0.838     | 0.763  | 0.781    | 90.2
Enc-Dec-ResNet152       | 3020              | 0.014                   | 0.785    | 0.893     | 0.602  | 0.671    | 84.2
Enc-Dec-VGG19           | 3240              | 0.019                   | 0.818    | 0.899     | 0.645  | 0.724    | 89.3
Enc-Dec-IRv2 (Proposed) | 3263              | 0.017                   | 0.854    | 0.848     | 0.812  | 0.818    | 91.5

Figure 8: Train and validation loss over epochs. This graph shows the decrease in both training loss (blue, solid line) and validation loss (red, dashed line) across 15 epochs on the NYU Depth V2 dataset using IRv2 [9].

Table 5 presents a comparison of the inference times of the proposed model with different variants of DINOv2 [56]. Inference time is the amount of time a model takes to process an image and produce a result. To evaluate inference time, RGB images of three different dimensions were used: 1242 × 375 (the raw image size from the dataset), 224 × 224 (the dimension used by the ViT models), and 320 × 240 (the dimension used by the proposed model). It was observed that resizing the input images significantly reduced inference time.
For instance, the smallest variant of DINOv2 achieved the lowest inference time of 0.440 seconds with 224 × 224 resizing. Similarly, for the 320 × 240 image size, the smallest DINOv2 variant had an inference time of 0.781 seconds, faster than the other DINOv2 variants. However, compared with our proposed model, IRv2 significantly outperformed DINOv2-S in efficiency, with inference times nearly 41 times faster than DINOv2-S and 23 times faster than the best performance of DINOv2 [56]. Although the accuracy of our proposed model is slightly (2.6%) lower than that of the small variant of the Transformer-based models, it demonstrates remarkable efficiency in terms of inference time, surpassing DINOv2-S and the other variants, as shown in Tables 5 and 6. This highlights the potential of our model for real-time applications where computational efficiency is critical.

The model was further trained on the Cityscapes [19] dataset, which is primarily designed for the semantic segmentation task. Statistical analyses were conducted to compare the performance of the proposed model with other state-of-the-art CNN-based encoder-decoder architectures. As shown in Table 7, the proposed model achieved the highest performance across multiple metrics, including mean Intersection over Union (mIoU), Recall, F1-Score, and Accuracy. The Precision-Recall curves in Figure 10 illustrate the performance of all evaluated models, highlighting that the proposed model outperforms the others. Additionally, Figure 9 provides a side-by-side comparison of the segmentation outputs, illustrating that the proposed model produces better-segmented results than the other models.

5.2. Ablation Study
5.2.1. Layer-wise Feature Extraction with IRv2
Our proposed architecture uses IRv2 [9] layers as the encoder for feature extraction.
This combines the efficient multi-branch architecture of the Inception network with the residual connections of ResNet. By leveraging these layers, the encoder efficiently extracts rich, high-level representations from the input RGB images, and this feature extraction mechanism enhances the model's ability to handle complex visual information. Figure 12 provides a visual representation of each layer along with its extracted features.

5.2.2. Composite Loss Function
The composite loss function consists of three components, Depth Loss, Edge Loss, and Structural Similarity (SSIM) Loss, combined as a weighted sum in which the corresponding weights w₁, w₂, and w₃ with the selected values (Section 3.4, Loss Function) are applied to the Depth, Edge, and SSIM losses, respectively. To evaluate the effect of each component, different combinations are considered by setting specific weights to zero. When w₁ = 0, Depth Loss is excluded and the loss function uses only the Edge and SSIM losses. Similarly, w₂ = 0 excludes Edge Loss, and w₃ = 0 excludes SSIM Loss, retaining the remaining two losses. Additionally, we analyze cases with a single loss: w₁ = 0, w₂ = 0 considers only SSIM Loss; w₁ = 0, w₃ = 0 considers only Edge Loss; and w₂ = 0, w₃ = 0 considers only Depth Loss. Finally, results are presented for the case in which all three types of loss are combined. Table 8 presents the performance of the model using different combinations of the three losses in the composite loss function, and Figure 13 illustrates the individual outputs corresponding to each combination.

Figure 9: Monocular depth estimation on the Cityscapes [19] dataset using various encoder-decoder architectures. From left to right: ResNet152 [41], VGG19 [40], DenseNet201 [42], and Inception-ResNet-v2 [9].

Figure 10: Precision-Recall curves for different encoder-decoder architectures on the Cityscapes [19] dataset. The plots display precision (y-axis) versus recall (x-axis) for the various architectures: ResNet152 (a), VGG19 (b), DenseNet201 (c), and Inception-ResNet-v2 (d).

5.2.3. Depth Prediction under Degradations
We applied noise, blur, and occlusion to the same RGB image and then generated the predicted depth maps using our proposed trained model, for both indoor and outdoor images. Figure 14 shows the results for each case, demonstrating the model's performance under varying conditions.

6. Discussion
The proposed model, which uses the Inception-ResNet-v2 architecture in an encoder-decoder framework, makes considerable improvements in monocular depth estimation. The combination of multi-scale feature extraction with a composite loss function has proven useful in addressing a wide range of depth estimation difficulties, attaining high accuracy and efficiency on benchmark datasets including KITTI and NYU Depth V2. These findings support the model's ability to generalize between indoor and outdoor environments, exceeding several cutting-edge architectures on a variety of criteria. However, some limitations and practical constraints for real-world applications persist.

Our approach achieves lower accuracy than advanced depth prediction techniques such as Vision Transformers (ViT).

Figure 11: R² scores for monocular depth estimation across test images for two models on the NYU Depth V2 dataset.
The left plot shows the distribution of R² values (x-axis) with frequency (y-axis), while the right plot displays individual R² scores (y-axis) against image index (x-axis).

Table 8
Performance comparison using error metrics and accuracy at different thresholds (δ < 1.25, δ < 1.25², δ < 1.25³) on the NYU Depth V2 dataset using IRv2 across various weight combinations in the composite loss function.

Weight Combination     | ARE   | RMSE  | Log10 | δ<1.25 | δ<1.25² | δ<1.25³
w₁ = 0                 | 0.067 | 0.230 | 0.036 | 0.889  | 0.964   | 0.982
w₂ = 0                 | 0.066 | 0.236 | 0.034 | 0.885  | 0.960   | 0.979
w₃ = 0                 | 0.069 | 0.229 | 0.038 | 0.876  | 0.962   | 0.980
w₁ = 0, w₂ = 0         | 0.091 | 0.332 | 0.061 | 0.861  | 0.947   | 0.968
w₂ = 0, w₃ = 0         | 0.088 | 0.339 | 0.055 | 0.864  | 0.949   | 0.970
w₁ = 0, w₃ = 0         | 0.097 | 0.401 | 0.077 | 0.845  | 0.940   | 0.966
w₁ ≠ 0, w₂ ≠ 0, w₃ ≠ 0 | 0.064 | 0.228 | 0.032 | 0.893  | 0.967   | 0.985

Our model obtains an accuracy of 89.9%, whereas the smallest variant of ViT (DINOv2-S) achieves 92.3% for δ < 1.25. However, it outperforms ViT in terms of efficiency. Table 5 shows that our model achieves an inference time of 0.019 seconds for an input size of 240 × 320, while ViT-S requires 0.781 seconds on the same 2x NVIDIA T4 GPUs. Despite ViT-S's 1.02x higher accuracy, our model has a 41x faster inference time. Although increasing the input resolution improves output accuracy, it also increases the inference time of Vision Transformers. In real-world applications, efficiency might sometimes take precedence over reaching the utmost precision, especially when the accuracy is already at an acceptable standard.

Figure 12: Visual representation of some layers and the corresponding extracted features on the NYU Depth V2 dataset using IRv2 [9]. [Columns: RGB, GT, Layer 1, Layer 2, Layer 3, Layer 4 (zoomed).]
While the inference time of 0.019 seconds is sufficient for many applications, the model's relatively large parameter count may pose difficulties for deployment in resource-constrained settings, such as mobile devices or embedded systems. These issues can be addressed via optimization approaches such as model reduction, quantization, and efficient parallel computation. Furthermore, real-world settings frequently include variations in lighting, resolution, and noise levels, necessitating robust preprocessing pipelines and careful evaluation on a variety of datasets to ensure consistent performance.

Despite these challenges, the proposed model's scalability and modularity make it adaptable to future hardware and algorithmic developments. These characteristics make it well suited to applications such as robotics, augmented reality, and 3D reconstruction. Future research could concentrate on reducing computational complexity and increasing resilience to challenging environments, ensuring practicality and efficacy in deployment.

7. Conclusion
This research presents a novel method for depth map production using monocular depth estimation through an encoder-decoder architecture based on the Inception-ResNet-v2 model. Our methodology employs multi-scale feature extraction and a composite loss function that integrates depth loss, gradient edge loss, and SSIM loss, resulting in a notable improvement in depth prediction accuracy, evidenced by an Absolute Relative Error (ARE) of 0.064, a Root Mean Square Error (RMSE) of 0.228, and a Log10 error of 0.032. These values demonstrate the model's precision in predicting accurate depth values for indoor scenes. Our model also achieved 89.3% accuracy for δ < 1.25, outperforming other approaches in complex scenarios with varying object sizes and distances.
Moreover, our model achieves significantly lower inference time than state-of-the-art Vision Transformer models (DINOv2 [56]) on the KITTI [13] dataset, while maintaining an acceptable level of accuracy. These results demonstrate the potential of our model for real-time applications, where computational efficiency is crucial and good accuracy is essential.

Integration with other sensors and systems, such as GPS and LiDAR, can improve overall performance and reliability when deployed in real-time systems such as autonomous cars or drones. The model's energy consumption should also be tuned for battery-powered devices to ensure operational feasibility during long-duration tasks. Furthermore, compliance with ethical and regulatory standards, especially in applications such as surveillance or healthcare, is critical for public acceptance and legal use.

One drawback of our architecture is its higher computational cost (110 GFLOPs), which makes the model more resource-intensive and slower during both training and inference. In real-time applications, such as autonomous driving or robotic navigation, this could cause latency, limiting the model's applicability in resource-constrained environments like mobile devices. However, it still significantly outperforms state-of-the-art transformer-based methods in terms of efficiency.

Despite these challenges, the model's improvements in error metrics and high accuracy make it a promising solution for real-world applications. Offering more reliable depth predictions in complex environments, it performs well in generating accurate depth maps even in intricate and demanding situations. Compared to Vision Transformers, our model demonstrates superior efficiency, making it particularly well suited to critical tasks, such as autonomous systems and other real-time applications, where reliability and performance are crucial.
Declaration of Funding
This paper was not funded.

Author Contributions
Dabbrata Das: Conceptualization, Methodology, Visualization, Writing – original draft. Argho Deb Das: Conceptualization, Formal analysis, Writing – original draft & review. Farhan Sadaf: Formal analysis, Investigation, Writing – review & editing, Project administration.

Ethical Approval
Not required.

Declaration of Competing Interest
The authors declare no competing interests.

Acknowledgements
This work is supported in part by the Khulna University of Engineering & Technology (KUET).

Figure 13: Visualization of the performance comparison on the NYU Depth V2 dataset using IRv2 across various weight combinations in the composite loss function. [Columns: RGB, GT, w₁ = 0, w₂ = 0, w₃ = 0, w₁ = w₂ = 0, w₂ = w₃ = 0, w₁ = w₃ = 0.]

Figure 14: Predicted depth maps generated by the proposed model for RGB images with added noise, blur, and occlusion under indoor (a) and outdoor (b) scenarios.

References
[1] M. Carranza-García, F. Galan-Sales, J. M. Luna-Romera, J. Riquelme, Object detection using depth completion and camera-lidar fusion for autonomous driving, Integrated Computer-Aided Engineering 29 (2022) 1–18. doi:10.3233/ICA-220681.
[2] W. Burger, M. J. Burge, Scale-invariant feature transform (SIFT), in: Digital Image Processing: An Algorithmic Introduction, Springer, 2022, pp. 709–763.
[3] A. Quattoni, M. Collins, T. Darrell, Conditional random fields for object recognition, 2004.
[4] Y. Kim, H. Jung, D. Min, K. Sohn, Deep monocular depth estimation via integration of global and local predictions, IEEE Transactions on Image Processing 27 (8) (2018) 4131–4144. doi:10.1109/TIP.2018.2836318.
[5] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, N.
Navab, Deeper depth prediction with fully convolutional residual networks, in: 2016 Fourth International Conference on 3D Vision (3DV), 2016, pp. 239–248. doi:10.1109/3DV.2016.32.
[6] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial networks (2014).
[7] X. Qiao, C. Ge, Y. Zhang, Y. Zhou, F. Tosi, M. Poggi, S. Mattoccia, Depth super-resolution from explicit and implicit high-frequency features, Computer Vision and Image Understanding 237 (2023) 103841. doi:10.1016/j.cviu.2023.103841.
[8] N. K. Yadav, S. K. Singh, S. R. Dubey, ISA-GAN: inception-based self-attentive encoder–decoder network for face synthesis using delineated facial images, The Visual Computer (2024) 1–21.
[9] C. Szegedy, S. Ioffe, V. Vanhoucke, A. Alemi, Inception-v4, Inception-ResNet and the impact of residual connections on learning (2016).
[10] N. Silberman, D. Hoiem, P. Kohli, R. Fergus, Indoor segmentation and support inference from RGBD images, in: Computer Vision – ECCV 2012, Lecture Notes in Computer Science, 2012, pp. 746–760. doi:10.1007/978-3-642-33715-4_54.
[11] A. C. Kumar, S. M. Bhandarkar, M. Prasad, DepthNet: A recurrent neural network architecture for monocular depth prediction, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018, pp. 396–3968. doi:10.1109/CVPRW.2018.00066.
[12] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-k. Wong, W.-c. Woo, Convolutional LSTM network: A machine learning approach for precipitation nowcasting, in: C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, R.
Garnett (Eds.), Advances in Neural Information Processing Systems, Vol. 28, Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper_files/paper/2015/file/07563a3fe3bbe7e3ba84431ad9d055af-Paper.pdf
[13] A. Geiger, P. Lenz, C. Stiller, R. Urtasun, Vision meets robotics: the KITTI dataset, The International Journal of Robotics Research 32 (2013) 1231–1237. doi:10.1177/0278364913491297.
[14] M. Mancini, G. Costante, P. Valigi, T. A. Ciarfuglia, J. Delmerico, D. Scaramuzza, Toward domain independence for learning-based monocular depth estimation, IEEE Robotics and Automation Letters 2 (3) (2017) 1778–1785.
[15] H. Jung, Y. Kim, D. Min, C. Oh, K. Sohn, Depth prediction from a single image with conditional adversarial networks, 2017, pp. 1717–1721. doi:10.1109/ICIP.2017.8296575.
[16] K. G. Lore, K. Reddy, M. Giering, E. A. Bernal, Generative adversarial networks for depth map estimation from RGB video, in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018, pp. 1258–12588. doi:10.1109/CVPRW.2018.00163.
[17] T. Feng, D. Gu, SGANVO: Unsupervised deep visual odometry and depth estimation with stacked generative adversarial networks, IEEE Robotics and Automation Letters 4 (4) (2019) 4431–4437. doi:10.1109/LRA.2019.2925555.
[18] F. Aleotti, F. Tosi, M. Poggi, S. Mattoccia, Generative adversarial networks for unsupervised monocular depth prediction, Munich, Germany, September 8–14, 2018, Proceedings, Part I, 2019, pp. 337–354. doi:10.1007/978-3-030-11009-3_20.
[19] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, B. Schiele, The Cityscapes dataset for semantic urban scene understanding (2016).
[20] Y. Li, K. Qian, T.
Huang, J. Zhou, Depth estimation from monocular image and coarse depth points based on conditional GAN, MATEC Web of Conferences 175 (2018) 03055. doi:10.1051/matecconf/201817503055.
[21] Y. Almalioglu, M. R. U. Saputra, P. P. B. d. Gusmão, A. Markham, N. Trigoni, GANVO: Unsupervised deep monocular visual odometry and depth estimation with generative adversarial networks, in: 2019 International Conference on Robotics and Automation (ICRA), 2019, pp. 5474–5480. doi:10.1109/ICRA.2019.8793512.
[22] A. Resindra, Y. Monno, M. Okutomi, S. Suzuki, T. Gotoda, K. Miki, Self-supervised monocular depth estimation in gastroendoscopy using GAN-augmented images, 2021, p. 35. doi:10.1117/12.2579317.
[23] A. Rau, P. Edwards, O. Ahmad, P. Riordan, M. Janatka, L. Lovat, D. Stoyanov, Implicit domain adaptation with conditional generative adversarial networks for depth prediction in endoscopy, International Journal of Computer Assisted Radiology and Surgery 14 (2019). doi:10.1007/s11548-019-01962-w.
[24] W. Kim, C.-H. Jeong, S. Kim, Improvements in deep learning-based precipitation nowcasting using major atmospheric factors with radar rain rate, Computers & Geosciences 184 (2024) 105529. doi:10.1016/j.cageo.2024.105529. URL https://www.sciencedirect.com/science/article/pii/S0098300424000128
[25] J. Li, R. Klein, A. Yao, A two-streamed network for estimating fine-scaled depth maps from single RGB images (2017).
[26] J. Lee, C.-S. Kim, Monocular depth estimation using relative depth maps, 2019. doi:10.1109/CVPR.2019.00996.
[27] D. Xu, W. Wang, H. Tang, H. Liu, N. Sebe, E. Ricci, Structured attention guided convolutional neural fields for monocular depth estimation (2018).
[28] D. Zoran, P. Isola, D. Krishnan, W. T. Freeman, Learning ordinal relationships for mid-level vision, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
[29] K. Simonyan, A.
Zisserman, Very deep convolutional networks for large-scale image recognition (2015).
[30] W. Chen, Z. Fu, D. Yang, J. Deng, Single-image depth perception in the wild (2017).
[31] Z. Zhang, C. Xu, J. Yang, J. Gao, Z. Cui, Progressive hard-mining network for monocular depth estimation, IEEE Transactions on Image Processing 27 (8) (2018) 3691–3702. doi:10.1109/TIP.2018.2821979.
[32] A. Saxena, M. Sun, A. Ng, Make3D: Learning 3D scene structure from a single still image, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2009) 824–840. doi:10.1109/TPAMI.2008.132.
[33] Y. Xi, S. Li, Z. Xu, F. Zhou, J. Tian, LapUNet: a novel approach to monocular depth estimation using dynamic Laplacian residual U-shape networks, Scientific Reports 14 (1) (2024) 23544.
[34] C. Song, Q. Chen, F. W. Li, Z. Jiang, D. Zheng, Y. Shen, B. Yang, Multi-feature fusion enhanced monocular depth estimation with boundary awareness, The Visual Computer 40 (7) (2024) 4955–4967.
[35] J. Liu, Y. Zhang, High quality monocular depth estimation with parallel decoder, Scientific Reports 12 (1) (2022) 16616.
[36] R. Choudhary, M. Sharma, T. Uma, R. Anil, MEStereo-Du2CNN: a dual-channel CNN for learning robust depth estimates from multi-exposure stereo images for HDR 3D applications, The Visual Computer 40 (3) (2024) 2219–2233.
[37] Y. Wang, L. Wang, J. Yang, W. An, Y. Guo, Flickr1024: A large-scale dataset for stereo image super-resolution, in: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019, pp. 0–0.
[38] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, X. Zheng, TensorFlow: A system for large-scale machine learning (2016).
[39] S. J. Reddi, S. Kale, S.
Kumar, On the convergence of Adam and beyond (2019).
[40] V. Sudha, D. Ganeshbabu, A convolutional neural network classifier VGG-19 architecture for lesion detection and grading in diabetic retinopathy based on deep learning, Computers, Materials & Continua 66 (2020) 827–842. doi:10.32604/cmc.2020.012008.
[41] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition (2015).
[42] G. Huang, Z. Liu, L. van der Maaten, K. Q. Weinberger, Densely connected convolutional networks (2018).
[43] I. Alhashim, P. Wonka, High quality monocular depth estimation via transfer learning (2019).
[44] M. Rudolph, Y. Dawoud, R. Güldenring, L. Nalpantidis, V. Belagiannis, Lightweight monocular depth estimation through guided decoding (2022).
[45] H. Basak, S. Ghosal, M. Sarkar, M. Das, S. Chattopadhyay, Monocular depth estimation using encoder-decoder architecture and transfer learning from single RGB image, 2020, pp. 1–6. doi:10.1109/UPCON50219.2020.9376365.
[46] C. Li, R. Yi, S. G. Ali, L. Ma, E. Wu, J. Wang, L. Mao, B. Sheng, RADepthNet: Reflectance-aware monocular depth estimation, Virtual Reality & Intelligent Hardware 4 (5) (2022) 418–431. doi:10.1016/j.vrih.2022.08.005. URL https://www.sciencedirect.com/science/article/pii/S2096579622000808
[47] Z. Li, X. Wang, X. Liu, J. Jiang, BinsFormer: Revisiting adaptive bins for monocular depth estimation, IEEE Transactions on Image Processing 33 (2024) 3964–3976. doi:10.1109/TIP.2024.3416065.
[48] Z. Lu, Y. Chen, Self-supervised monocular depth estimation on water scenes via specular reflection prior, Digital Signal Processing 149 (2024) 104496. doi:10.1016/j.dsp.2024.104496.
[49] Z. Lu, Y.
Chen, Joint self-supervised depth and optical flow estimation towards dynamic objects, Neural Processing Letters 55 (2023) 10235–10249. doi:10.1007/s11063-023-11325-x.
[50] J. Zhang, S. Huang, J. Liu, X. Zhu, F. Xu, PYRF-PCR: A robust three-stage 3D point cloud registration for outdoor scene, IEEE Transactions on Intelligent Vehicles 9 (2024) 1270–1281. doi:10.1109/TIV.2023.3327098.
[51] J. Zhang, Q. Su, B. Tang, C. Wang, Y. Li, DPSNet: Multitask learning using geometry reasoning for scene depth and semantics, IEEE Transactions on Neural Networks and Learning Systems 34 (2023) 2710–2721. doi:10.1109/TNNLS.2021.3107362.
[52] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, A. M. Lopez, The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[53] M. Zhou, L. Chen, X. Wei, X. Liao, Q. Mao, H. Wang, H. Pu, J. Luo, T. Xiang, B. Fang, Perception-oriented U-shaped transformer network for 360-degree no-reference image quality assessment, IEEE Transactions on Broadcasting 69 (2023) 396–405. doi:10.1109/TBC.2022.3231101.
[54] M. Zhou, X. Lan, X. Wei, X. Liao, Q. Mao, Y. Li, C. Wu, T. Xiang, B. Fang, An end-to-end blind image quality assessment method using a recurrent network and self-attention, IEEE Transactions on Broadcasting 69 (2023) 369–377. doi:10.1109/TBC.2022.3215249.
[55] J. Zhang, Y. Liu, G. Ding, B. Tang, Y. Chen, Adaptive decomposition and extraction network of individual fingerprint features for specific emitter identification, IEEE Transactions on Information Forensics and Security 19 (2024) 8515–8528. doi:10.1109/TIFS.2024.3427361.
[56] M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, R. Howes, P.-Y. Huang, H. Xu, V. Sharma, S.-W. Li, W. Galuba, M. Rabbat, M.
Assran, N. Ballas, G. Synnaeve, I. Misra, H. Jegou, J. Mairal, P. Labatut, A. Joulin, P. Bojanowski, DINOv2: Learning robust visual features without supervision (2023).
[57] M. Zhou, X. Zhao, F. Luo, J. Luo, H. Pu, T. Xiang, Robust RGB-T tracking via adaptive modality weight correlation filters and cross-modality learning, ACM Transactions on Multimedia Computing, Communications, and Applications 20 (4) (2023) 1–20. doi:10.1145/3630100.
[58] M. Zhou, X. Lan, X. Wei, X. Liao, Q. Mao, Y. Li, C. Wu, T. Xiang, B. Fang, An end-to-end blind image quality assessment method using a recurrent network and self-attention, IEEE Transactions on Broadcasting 69 (2023) 369–377. doi:10.1109/TBC.2022.3215249.
[59] J. Xu, Z. Li, B. Du, M. Zhang, J. Liu, Reluplex made more practical: Leaky ReLU, in: Proceedings of the 2020 IEEE Symposium on Computers and Communications (ISCC), 2020, pp. 1–7. doi:10.1109/ISCC50000.2020.9219587.
[60] J. D. Hunter, Matplotlib: A 2D graphics environment, Computing in Science & Engineering 9 (3) (2007) 90–95. doi:10.1109/MCSE.2007.55.