
A Novel Hybrid Deep Learning Framework for Image Captioning Using Combined Semantic and Object-Level Features

Open Access | Dec 2025

Abstract

Automatically generating descriptive captions for images has become a key way of bridging the gap between visual data and human language, with applications in accessibility, multimedia search, and human-computer interaction. In this work, we propose a hybrid deep learning model that fuses high-level scene context with localized object information to produce higher-quality captions. Global image features are extracted with an Xception network, while You Only Look Once, version 8 (YOLOv8) supplies fine-grained, object-specific details. The fused visual features are passed through a Bahdanau attention mechanism that conditions an LSTM decoder, which generates context-aware captions. Evaluated on the Flickr8k dataset using BLEU and METEOR metrics, the proposed method shows promising improvements over traditional single-stream approaches, indicating better interpretability and accuracy in image captioning.
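To illustrate the core mechanism the abstract describes, the sketch below shows how Bahdanau (additive) attention can weight fused visual features before each decoding step. This is a minimal NumPy illustration, not the authors' implementation: all dimensions (feature size, hidden size, number of regions) and the assumption that fused Xception/YOLOv8 features form one row per region are hypothetical choices for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): fused global/object
# features projected to 256-d, LSTM hidden state of 512-d.
num_regions, feat_dim, hidden_dim, attn_dim = 5, 256, 512, 128

# Fused visual features: one row per image region / detected object
# (assumption for illustration only).
features = rng.standard_normal((num_regions, feat_dim))
decoder_state = rng.standard_normal(hidden_dim)  # previous LSTM hidden state

# Bahdanau (additive) attention parameters.
W1 = rng.standard_normal((feat_dim, attn_dim)) * 0.1
W2 = rng.standard_normal((hidden_dim, attn_dim)) * 0.1
v = rng.standard_normal(attn_dim) * 0.1

def bahdanau_context(features, state):
    """Return the attention-weighted context vector and its weights."""
    # Additive scoring: v^T tanh(W1 * features + W2 * state)
    scores = np.tanh(features @ W1 + state @ W2) @ v   # shape: (num_regions,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                           # softmax over regions
    context = weights @ features                       # shape: (feat_dim,)
    return context, weights

context, weights = bahdanau_context(features, decoder_state)
```

At each decoding step, the resulting `context` vector would be concatenated with the word embedding and fed to the LSTM, letting the decoder attend to different regions as the caption unfolds.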

DOI: https://doi.org/10.2478/cait-2025-0036 | Journal eISSN: 1314-4081 | Journal ISSN: 1311-9702
Language: English
Page range: 116 - 128
Submitted on: Jun 20, 2025
Accepted on: Sep 25, 2025
Published on: Dec 11, 2025
Published by: Bulgarian Academy of Sciences, Institute of Information and Communication Technologies
In partnership with: Paradigm Publishing Services
Publication frequency: 4 issues per year

© 2025 Harshil Narendrabhai Chauhan, Chintan Bhupeshbhai Thacker, published by Bulgarian Academy of Sciences, Institute of Information and Communication Technologies
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.