Monocular depth estimation is an essential task in computer vision, as it recovers depth information from single 2D images and benefits applications such as autonomous driving and robot navigation. Monocular depth estimation has improved significantly in recent years, and deep learning-based methods have surpassed traditional and machine learning-based approaches. Deep learning-based methods have been further enhanced through transformer and hybrid architectures. This paper first discusses the sensors used for depth estimation and their limitations, then briefly reviews the evolution of depth estimation. We then examine deep learning methods, including transformer and CNN-transformer hybrid methods, and their limitations, followed by several methods that address challenging weather conditions. Finally, we discuss the current trends, challenges, and future directions of transformer and hybrid methods.
© 2025 Lakindu Kumara, Nipuna Senanayake, Guhanathan Poravi, published by Riga Technical University
This work is licensed under the Creative Commons Attribution 4.0 License.