ChestXGen: Dynamic Memory-Augmented Vision-Language Transformer with Context-Aware Gating for Radiology Report Generation

Open Access
Sep 2025

Abstract

Chest X-ray analysis is vital for clinical screening, diagnosis, and treatment planning. The increasing workload on radiologists calls for robust automated solutions to generate accurate and standardized reports. Conventional report generation models often struggle to detect rare and anomalous diseases, particularly when faced with imbalanced datasets, which can compromise diagnostic accuracy. To address these limitations, we propose ChestXGen, a novel multimodal framework for automated radiology report generation. Our model is based on a fully Transformer-based encoder-decoder architecture that integrates Memory-Augmented Transformer (MAT) blocks with a Context-Aware Bi-Gate (CABG) mechanism. These components enable the model to capture long-range dependencies, effectively fuse visual and textual features, and better handle underrepresented conditions. Visual features are extracted using a ResNet-101-V2 backbone and refined through a shared memory module that continuously reinforces cross-modal associations. This integrated approach facilitates the generation of comprehensive, accurate, and contextually coherent reports. Extensive evaluation on the large-scale MIMIC-CXR dataset, comprising 377,110 images and corresponding free-text reports, demonstrates that ChestXGen outperforms previous models on BLEU-1, BLEU-2, BLEU-3, and METEOR metrics. The results demonstrate the efficacy of Transformer-based models in substantially reducing radiologists' reporting burden while concurrently enhancing the precision and reliability of diagnostic interpretations.
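To illustrate the gated fusion idea the abstract describes, the following is a minimal NumPy sketch of a bi-gate mechanism that weights visual and textual feature vectors before fusing them. This is a hypothetical illustration only: the function `bi_gate_fuse`, the weight matrices `Wv` and `Wt`, and the feature dimensions are assumptions for exposition, not the authors' actual CABG implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bi_gate_fuse(visual, textual, Wv, Wt):
    """Gate each modality on the joint context, then fuse.

    A sigmoid gate per modality decides how much of that
    modality's signal passes into the fused representation.
    (Sketch only; not the paper's exact CABG formulation.)
    """
    ctx = np.concatenate([visual, textual])  # joint context vector
    g_v = sigmoid(Wv @ ctx)                  # visual gate in (0, 1)
    g_t = sigmoid(Wt @ ctx)                  # textual gate in (0, 1)
    return g_v * visual + g_t * textual      # element-wise gated fusion

# Toy example with random features and weights (d = 4)
rng = np.random.default_rng(0)
d = 4
visual = rng.standard_normal(d)
textual = rng.standard_normal(d)
Wv = rng.standard_normal((d, 2 * d))
Wt = rng.standard_normal((d, 2 * d))
fused = bi_gate_fuse(visual, textual, Wv, Wt)
print(fused.shape)
```

Because both gates are conditioned on the concatenated context, each modality's contribution can adapt per example, which is the intuition behind context-aware gating for cross-modal fusion.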

Language: English
Page range: 55 - 72
Submitted on: May 20, 2025
Accepted on: Aug 22, 2025
Published on: Sep 26, 2025
Published by: SAN University
In partnership with: Paradigm Publishing Services
Publication frequency: 4 issues per year

© 2025 Sharofiddin Allaberdiev, Asif Khan, Sardor Mamarasulov, Xiaojun Chen, published by SAN University
This work is licensed under the Creative Commons Attribution 4.0 License.