A Novel Hybrid Deep Learning Framework for Image Captioning Using Combined Semantic and Object-Level Features

Open Access | December 2025

References

  1. Bai, S., S. An. A Survey on Automatic Image Caption Generation. – Neurocomputing, Vol. 311, 2018, pp. 291-304. DOI: 10.1016/j.neucom.2018.05.080.
  2. Makav, B., V. Kılıç. A New Image Captioning Approach for Visually Impaired People. – In: Proc. of 11th International Conference on Electrical and Electronics Engineering (ELECO’19), Bursa, Turkey, 2019, pp. 945-949. DOI: 10.23919/ELECO47770.2019.8990630.
  3. Hossain, M. Z., F. Sohel, M. F. Shiratuddin, H. Laga. A Comprehensive Survey of Deep Learning for Image Captioning. – ACM Computing Surveys, Vol. 51, 2019, No 6, pp. 1-36. DOI: 10.1145/3295748.
  4. Wang, H., Y. Zhang, X. Yu. An Overview of Image Caption Generation Methods. – Computational Intelligence and Neuroscience, 2020, pp. 1-13. DOI: 10.1155/2020/3062706.
  5. Redmon, J., S. Divvala, R. Girshick, A. Farhadi. You Only Look Once: Unified, Real-Time Object Detection. – In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779-788.
  6. Ren, S., K. He, R. Girshick, J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. – Advances in Neural Information Processing Systems, Vol. 28, 2015, pp. 91-99.
  7. Eluri, Y., N. Vinutha, G. S. Abhiram. Image Captioning Using Visual Attention and Detection Transformer Model. – In: Proc. of IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT’24), IEEE, July 2024, pp. 1-4.
  8. Poddar, A. K., R. Rani. Hybrid Architecture Using CNN and LSTM for Image Captioning in the Hindi Language. – Procedia Computer Science, Vol. 218, 2023, pp. 686-696. DOI: 10.1016/j.procs.2023.01.049.
  9. Xiao, X., L. Wang, K. Ding, S. Xiang, C. Pan. Deep Hierarchical Encoder-Decoder Network for Image Captioning. – IEEE Transactions on Multimedia, Vol. 21, 2019, No 11, pp. 2942-2956. DOI: 10.1109/TMM.2019.2915033.
  10. Sasibhooshan, R., S. Kumaraswamy, S. Sasidharan. Image Caption Generation Using Visual Attention Prediction and Contextual Spatial Relation Extraction. – Journal of Big Data, Vol. 10, 2023, No 18. DOI: 10.1186/s40537-023-00693-9.
  11. Al-Malla, M. A., A. Jafar, N. Ghneim. Image Captioning Model Using Attention and Object Features to Mimic Human Image Understanding. – Journal of Big Data, Vol. 9, 2022, No 20. DOI: 10.1186/s40537-022-00571-w.
  12. Wang, E. K., X. Zhang, F. Wang, T. Y. Wu, C. M. Chen. Multilayer Dense Attention Model for Image Caption. – IEEE Access, Vol. 7, 2019, pp. 66358-66368. DOI: 10.1109/ACCESS.2019.2917771.
  13. Dhir, R., S. K. Mishra, S. Saha, P. Bhattacharyya. A Deep Attention-Based Framework for Image Caption Generation in the Hindi Language. – Computación y Sistemas, Vol. 23, 2019, No 3. DOI: 10.13053/cys-23-3-3269.
  14. Mishra, S. K., S. Sinha, S. Saha, P. Bhattacharyya. Dynamic Convolution-Based Encoder-Decoder Framework for Image Captioning in Hindi. – ACM Transactions on Asian and Low-Resource Language Information Processing, 2022. DOI: 10.1145/3573891.
  15. Xu, K., et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. – In: Proc. of 32nd International Conference on Machine Learning (ICML’15), Lille, France, 2015, pp. 2048-2057.
  16. Vinyals, O., A. Toshev, S. Bengio, D. Erhan. Show and Tell: A Neural Image Caption Generator. – In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3156-3164.
  17. Long, D. T. Efficient DenseNet Model with Fusion of Channel and Spatial Attention for Facial Expression Recognition. – Cybernetics and Information Technologies, Vol. 24, 2024, No 1, pp. 171-189.
  18. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. – In: Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1251-1258.
  19. Ultralytics. YOLOv8 Anchor-Free Bounding Box Prediction, GitHub Issue 189, 2023. https://github.com/ultralytics/ultralytics/issues/189
  20. Bahdanau, D., K. Cho, Y. Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. – arXiv preprint arXiv:1409.0473, 2014.
  21. Hodosh, M., P. Young, J. Hockenmaier. Framing Image Description as a Ranking Task: Data, Models, and Evaluation Metrics. – Journal of Artificial Intelligence Research, Vol. 47, 2013, pp. 853-899. DOI: 10.1613/jair.3994.
  22. Papineni, K., S. Roukos, T. Ward, W.-J. Zhu. BLEU: A Method for Automatic Evaluation of Machine Translation. – In: Proc. of 40th Annual Meeting of the Association for Computational Linguistics (ACL’02), 2002, pp. 311-318. DOI: 10.3115/1073083.1073135.
  23. Banerjee, S., A. Lavie. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. – In: Proc. of ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, pp. 65-72.
  24. Kiros, R., R. Salakhutdinov, R. Zemel. Multimodal Neural Language Models. – In: Proc. of International Conference on Machine Learning (ICML’14), PMLR, June 2014, pp. 595-603.
  25. Kamangar, Z. U., G. M. Shaikh, S. Hassan, N. Mughal, U. A. Kamangar. Image Caption Generation Related to Object Detection and Colour Recognition Using a Transformer-Decoder. – In: Proc. of 4th International Conference on Computing, Mathematics and Engineering Technologies (iCoMET’23), Sukkur, Pakistan, 2023, pp. 1-5. DOI: 10.1109/iCoMET57998.2023.10099161.
  26. Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin. Attention Is All You Need. – arXiv preprint arXiv:1706.03762, 2017. DOI: 10.48550/arXiv.1706.03762.
  27. Mishra, S. K., Harshit, S. Saha, P. Bhattacharyya. An Object Localization-Based Dense Image Captioning Framework in Hindi. – ACM Transactions on Asian and Low-Resource Language Information Processing, Vol. 22, 2022, No 2, pp. 1-15. DOI: 10.1145/3558391.
  28. Kaur, M., H. Kaur. An Efficient Deep Learning Based Hybrid Model for Image Caption Generation. – International Journal of Advanced Computer Science and Applications, Vol. 14, 2023, No 3. DOI: 10.14569/IJACSA.2023.0140326.
DOI: https://doi.org/10.2478/cait-2025-0036 | Journal eISSN: 1314-4081 | Journal ISSN: 1311-9702
Language: English
Page range: 116-128
Submitted on: Jun 20, 2025
Accepted on: Sep 25, 2025
Published on: Dec 11, 2025
Published by: Bulgarian Academy of Sciences, Institute of Information and Communication Technologies
In partnership with: Paradigm Publishing Services
Publication frequency: 4 issues per year

© 2025 Harshil Narendrabhai Chauhan, Chintan Bhupeshbhai Thacker, published by Bulgarian Academy of Sciences, Institute of Information and Communication Technologies
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.