
Monocular Depth Estimation: A Review on Hybrid Architectures, Transformers and Addressing Adverse Weather Conditions

Open Access | Jan. 2025

References

  1. P. Vyas, C. Saxena, A. Badapanda, and A. Goswami, “Outdoor monocular depth estimation: A research review,” arXiv preprint arXiv:2205.01399, May 2022. https://doi.org/10.48550/arXiv.2205.01399
  2. Q. Li et al., “Deep learning based monocular depth prediction: Datasets, methods and applications,” arXiv preprint arXiv:2011.04123, 2020. https://doi.org/10.48550/arXiv.2011.04123
  3. A. Masoumian, H. A. Rashwan, J. Cristiano, M. S. Asif, and D. Puig, “Monocular depth estimation using deep learning: A review,” Sensors, vol. 22, no. 14, Art. no. 5353, July 2022. https://doi.org/10.3390/s22145353
  4. Y. Ming, X. Meng, C. Fan, and H. Yu, “Deep learning for monocular depth estimation: A review,” Neurocomputing, vol. 438, pp. 14–33, May 2021. https://doi.org/10.1016/j.neucom.2020.12.089
  5. Foresight, “An overview of autonomous sensors – LIDAR, RADAR, and cameras,” 2023. [Online]. Available: https://www.foresightauto.com/an-overview-of-autonomous-sensors-lidar-radar-and-cameras/
  6. Y. Li and J. Ibanez-Guzman, “Lidar for autonomous driving: The principles, challenges, and trends for automotive lidar and perception systems,” IEEE Signal Processing Magazine, vol. 37, no. 4, pp. 50–61, July 2020. https://doi.org/10.1109/MSP.2020.2973615
  7. J. Hasch, “Driving towards 2020: Automotive radar technology trends,” in 2015 IEEE MTT-S International Conference on Microwaves for Intelligent Mobility (ICMIM), Heidelberg, Germany, Apr. 2015, pp. 1–4. https://doi.org/10.1109/ICMIM.2015.7117956
  8. C. Zhao, Q. Sun, C. Zhang, Y. Tang, and F. Qian, “Monocular depth estimation based on deep learning: An overview,” Science China Technological Sciences, vol. 63, no. 9, pp. 1612–1627, June 2020. https://doi.org/10.1007/s11431-020-1582-8
  9. A. Saxena, J. Schulte, and A. Y. Ng, “Depth estimation using monocular and stereo cues,” in IJCAI-07, 2007, pp. 2197–2203. [Online]. Available: https://www.ijcai.org/Proceedings/07/Papers/354.pdf
  10. H. Caesar et al., “nuScenes: A multimodal dataset for autonomous driving,” arXiv preprint arXiv:1903.11027, Mar. 2019. https://doi.org/10.48550/arXiv.1903.11027
  11. A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? The KITTI vision benchmark suite,” in Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, June 2012, pp. 3354–3361. https://doi.org/10.1109/CVPR.2012.6248074
  12. G. Yang, X. Song, C. Huang, Z. Deng, J. Shi, and B. Zhou, “DrivingStereo: A large-scale dataset for stereo matching in autonomous driving scenarios,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, June 2019, pp. 899–908. https://doi.org/10.1109/CVPR.2019.00099
  13. D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” Advances in Neural Information Processing Systems, vol. 27, 2014. https://doi.org/10.48550/arXiv.1406.2283
  14. I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,” in 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, Oct. 2016, pp. 239–248. https://doi.org/10.1109/3DV.2016.32
  15. B. Li, C. Shen, Y. Dai, A. van den Hengel, and M. He, “Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFS,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, June 2015, pp. 1119–1127. https://doi.org/10.1109/CVPR.2015.7298715
  16. I. Alhashim and P. Wonka, “High quality monocular depth estimation via transfer learning,” arXiv preprint arXiv:1812.11941, Dec. 2018. https://doi.org/10.48550/arXiv.1812.11941
  17. G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, July 2017, pp. 4700–4708. https://doi.org/10.1109/CVPR.2017.243
  18. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, June 2016, pp. 770–778. https://doi.org/10.1109/CVPR.2016.90
  19. C.-H. Yeh, Y.-P. Huang, C.-Y. Lin, and C.-Y. Chang, “Transfer2Depth: Dual attention network with transfer learning for monocular depth estimation,” IEEE Access, vol. 8, pp. 86081–86090, May 2020. https://doi.org/10.1109/ACCESS.2020.2992815
  20. C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised monocular depth estimation with left-right consistency,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, July 2017, pp. 270–279. https://doi.org/10.1109/CVPR.2017.699
  21. R. Garg, V. Kumar B.G., G. Carneiro, and I. Reid, “Unsupervised CNN for single view depth estimation: Geometry to the rescue,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, Part VIII 14, Oct. 2016, pp. 740–756. https://doi.org/10.1007/978-3-319-46484-8_45
  22. M. Poggi, F. Aleotti, F. Tosi, and S. Mattoccia, “Towards real-time unsupervised monocular depth estimation on CPU,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, Oct. 2018, pp. 5848–5854. https://doi.org/10.1109/IROS.2018.8593814
  23. J. Liu, Q. Li, R. Cao, W. Tang, and G. Qiu, “MiniNet: An extremely lightweight convolutional neural network for real-time unsupervised monocular depth estimation,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 166, pp. 255–267, Aug. 2020. https://doi.org/10.1016/j.isprsjprs.2020.06.004
  24. C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow, “Digging into self-supervised monocular depth estimation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea (South), Oct. 2019, pp. 3828–3838. https://doi.org/10.1109/ICCV.2019.00393
  25. J. Jin, B. Tao, X. Qian, J. Hu, and G. Li, “Lightweight monocular absolute depth estimation based on attention mechanism,” Journal of Electronic Imaging, vol. 33, no. 2, Mar. 2024, Art. no. 023010. https://doi.org/10.1117/1.JEI.33.2.023010
  26. N. Zhang, F. Nex, G. Vosselman, and N. Kerle, “Lite-Mono: A lightweight CNN and transformer architecture for self-supervised monocular depth estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, June 2023, pp. 18537–18546. https://doi.org/10.1109/CVPR52729.2023.01778
  27. J. Wang et al., “WeatherDepth: Curriculum contrastive learning for self-supervised depth estimation under adverse weather conditions,” arXiv preprint arXiv:2310.05556, Oct. 2023. https://doi.org/10.48550/arXiv.2310.05556
  28. C. Zhao et al., “MonoViT: Self-supervised monocular depth estimation with a vision transformer,” in 2022 International Conference on 3D Vision (3DV), Prague, Czech Republic, Sep. 2022, pp. 668–678. https://doi.org/10.1109/3DV57658.2022.00077
  29. M. A. Rahman and S. A. Fattah, “DwinFormer: Dual window transformers for end-to-end monocular depth estimation,” IEEE Sensors Journal, vol. 23, no. 18, Aug. 2023. https://doi.org/10.1109/JSEN.2023.3299782
  30. G. Manimaran and J. Swaminathan, “Focal-WNet: An architecture unifying convolution and attention for depth estimation,” in 2022 IEEE 7th International Conference for Convergence in Technology (I2CT), Mumbai, India, Apr. 2022, pp. 1–7. https://doi.org/10.1109/I2CT54291.2022.9824488
  31. Z. Li, Z. Chen, X. Liu, and J. Jiang, “DepthFormer: Exploiting long-range correlation and local information for accurate monocular depth estimation,” Machine Intelligence Research, vol. 20, no. 6, pp. 837–854, Dec. 2023. https://doi.org/10.1007/s11633-023-1458-0
  32. A. Dosovitskiy et al., “An image is worth 16×16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, Oct. 2020. https://doi.org/10.48550/arXiv.2010.11929
  33. C. Xia et al., “PCTDepth: Exploiting parallel CNNs and transformer via dual attention for monocular depth estimation,” Neural Processing Letters, vol. 56, no. 2, Feb. 2024, Art. no. 73. https://doi.org/10.1007/s11063-024-11524-0
  34. D. Shim and H. J. Kim, “SwinDepth: Unsupervised depth estimation using monocular sequences via Swin transformer and densely cascaded network,” in 2023 IEEE International Conference on Robotics and Automation (ICRA), London, United Kingdom, May 2023, pp. 4983–4990. https://doi.org/10.1109/ICRA48891.2023.10160657
  35. C. Ning and H. Gan, “Trap attention: Monocular depth estimation with manual traps,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, June 2023, pp. 5033–5043. https://doi.org/10.1109/CVPR52729.2023.00487
  36. A. Astudillo, A. Barrera, C. Guindel, A. Al-Kaff, and F. García, “DAttNet: monocular depth estimation network based on attention mechanisms,” Neural Computing and Applications, vol. 36, no. 7, pp. 3347–3356, Dec. 2023. https://doi.org/10.1007/s00521-023-09210-8
  37. A. Agarwal and C. Arora, “Attention attention everywhere: Monocular depth prediction with skip attention,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, Jan. 2023, pp. 5861–5870. https://doi.org/10.1109/WACV56688.2023.00581
  38. W. Zhao, Y. Song, and T. Wang, “SAU-Net: Monocular depth estimation combining multi-scale features and attention mechanisms,” IEEE Access, vol. 11, pp. 137734–137746, Dec. 2023. https://doi.org/10.1109/ACCESS.2023.3339152
  39. Z. Liu et al., “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, Oct. 2021, pp. 10012–10022. https://doi.org/10.1109/ICCV48922.2021.00986
  40. O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015: 18th International Conference, Part III 18, Munich, Germany, Oct. 2015, pp. 234–241. https://doi.org/10.1007/978-3-319-24574-4_28
  41. D. Xing, J. Shen, C. Ho, and A. Tzes, “ROIFormer: semantic-aware region of interest transformer for efficient self-supervised monocular depth estimation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 3, 2023, pp. 2983–2991. https://doi.org/10.1609/aaai.v37i3.25401
  42. L. Yan, F. Yu, and C. Dong, “EMTNet: efficient mobile transformer network for real-time monocular depth estimation,” Pattern Analysis and Applications, vol. 26, no. 4, pp. 1833–1846, Oct. 2023. https://doi.org/10.1007/s10044-023-01205-4
  43. K. Han, Y. Wang, Q. Tian, J. Guo, C. Xu, and C. Xu, “GhostNet: More features from cheap operations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, June 2020, pp. 1580–1589. https://doi.org/10.1109/CVPR42600.2020.00165
  44. L. Song et al., “Spatial-aware dynamic lightweight self-supervised monocular depth estimation,” IEEE Robotics and Automation Letters, vol. 9, no. 1, pp. 883–890, Nov. 2023. https://doi.org/10.1109/LRA.2023.3337991
  45. L. Papa, P. Russo, and I. Amerini, “METER: a mobile vision transformer architecture for monocular depth estimation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 10, pp. 5882–5893, Mar. 2023. https://doi.org/10.1109/TCSVT.2023.3260310
  46. Q. Liu and S. Zhou, “LightDepthNet: Lightweight CNN architecture for monocular depth estimation on edge devices,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 71, no. 4, pp. 2389–2393, Nov. 2023. https://doi.org/10.1109/TCSII.2023.3337369
  47. M. Tang, Z. Zhao, and J. Qiu, “A foggy weather simulation algorithm for traffic image synthesis based on monocular depth estimation,” Sensors, vol. 24, no. 6, Mar. 2024, Art. no. 1966. https://doi.org/10.3390/s24061966
  48. K. Saunders, G. Vogiatzis, and L. J. Manso, “Self-supervised monocular depth estimation: Let’s talk about the weather,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, Oct. 2023, pp. 8907–8917. https://doi.org/10.1109/ICCV51070.2023.00818
  49. M. Tremblay, S. S. Halder, R. de Charette, and J. F. Lalonde, “Rain rendering for evaluating and improving robustness to bad weather,” International Journal of Computer Vision, vol. 129, no. 2, pp. 341–360, Feb. 2021. https://doi.org/10.1007/s11263-020-01366-3
  50. F. Pizzati and R. de Charette, “CoMoGAN: Continuous model-guided image-to-image translation,” GitHub. [Online]. Available: https://github.com/cvrits/CoMoGAN. Accessed: Jul. 04, 2024.
  51. U. Saxena and R. Giriraj, “Automold--Road-Augmentation-Library,” GitHub, Feb. 12, 2023. [Online]. Available: https://github.com/UjjwalSaxena/Automold--Road-Augmentation-Library
  52. X. Huang, P. Wang, X. Cheng, D. Zhou, Q. Geng, and R. Yang, “The ApolloScape open dataset for autonomous driving and its application,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 10, pp. 2702–2719, Oct. 2020. https://doi.org/10.1109/TPAMI.2019.2926463
  53. N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from RGBD images,” in Computer Vision – ECCV 2012: 12th European Conference on Computer Vision, Part V 12, Florence, Italy, Oct. 2012, pp. 746–760. https://doi.org/10.1007/978-3-642-33715-4_54
  54. M. Cordts et al., “The cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, June 2016, pp. 3213–3223. https://doi.org/10.1109/CVPR.2016.350
  55. A. Saxena, S. Chung, and A. Ng, “Learning depth from single monocular images,” Neural Information Processing Systems (NIPS), vol. 18, pp. 1–8, Dec. 2005.
DOI: https://doi.org/10.2478/acss-2025-0003 | Journal eISSN: 2255-8691 | Journal ISSN: 2255-8683
Language: English
Page range: 21–33
Submitted on: Aug 22, 2024
Accepted on: Dec 18, 2024
Published on: Jan 24, 2025
Published by: Riga Technical University
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2025 Lakindu Kumara, Nipuna Senanayake, Guhanathan Poravi, published by Riga Technical University
This work is licensed under the Creative Commons Attribution 4.0 License.