Abstract
Automatically generating descriptive captions for images bridges the gap between visual data and human language, with direct relevance to accessibility, multimedia search, and human-computer interaction. In this work, we propose a hybrid deep learning model that fuses high-level scene context with localized object information to produce higher-quality captions. Global image features are extracted with an Xception network, while You Only Look Once, version 8 (YOLOv8) provides fine-grained, object-specific details. The two feature streams are fused and passed to a Bahdanau attention mechanism, which guides an LSTM decoder in generating context-aware captions. The proposed method was evaluated on the Flickr8k dataset using BLEU and METEOR metrics and showed promising improvements over traditional single-stream approaches, demonstrating the model's ability to deliver improved interpretability and accuracy in image captioning.
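The sketch below illustrates one plausible realization of the fusion-plus-attention decoder described above, using tf.keras. It is only a minimal sketch under stated assumptions, not the authors' implementation: the feature dimensions, vocabulary size, and sequence length are illustrative, the YOLOv8 object features are assumed to be pre-extracted and pooled into a single vector per image, padding masks are omitted for brevity, and the additive (Bahdanau-style) attention is applied over the fused visual features with LSTM states as queries, which approximates but may not match the paper's exact decoding loop.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Illustrative dimensions (assumptions, not values from the paper).
GLOBAL_DIM = 2048   # size of a pooled Xception feature vector
OBJECT_DIM = 256    # assumed size of a pooled YOLOv8 object-feature vector
VOCAB_SIZE = 8000
EMBED_DIM = 256
UNITS = 512
MAX_LEN = 30

# Inputs: global and object-level visual features (assumed pre-extracted
# offline) plus the teacher-forced caption token sequence.
global_feat = layers.Input(shape=(GLOBAL_DIM,), name="xception_features")
object_feat = layers.Input(shape=(OBJECT_DIM,), name="yolov8_features")
caption_in = layers.Input(shape=(MAX_LEN,), dtype="int32", name="caption_tokens")

# Project both visual streams to a common width and stack them so the
# attention layer can weight scene-level vs. object-level information.
g = layers.Dense(UNITS, activation="relu")(global_feat)
o = layers.Dense(UNITS, activation="relu")(object_feat)
visual_seq = layers.Concatenate(axis=1)(
    [layers.Reshape((1, UNITS))(g), layers.Reshape((1, UNITS))(o)]
)  # shape: (batch, 2, UNITS)

# Embed the caption tokens and decode them with an LSTM.
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(caption_in)
decoder_out = layers.LSTM(UNITS, return_sequences=True)(x)

# Additive (Bahdanau-style) attention: decoder states query the fused
# visual features, yielding a per-timestep visual context vector.
context = layers.AdditiveAttention()([decoder_out, visual_seq])
merged = layers.Concatenate()([decoder_out, context])
logits = layers.Dense(VOCAB_SIZE)(merged)

model = Model([global_feat, object_feat, caption_in], logits)
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
model.summary()
```

In this sketch, training would use teacher forcing (the caption shifted by one position as the target), and caption generation at inference time would feed predicted tokens back into the decoder step by step; both of those loops are standard and omitted here.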
