
Chestxgen: Dynamic Memory-Augmented Vision-Language Transformer with Context-Aware Gating for Radiology Report Generation

Open Access | Sep 2025

References

  1. P. Roy, A. Bhunia, A. Das, P. Dhar, U. Pal, Keyword spotting in doctor’s handwriting on medical prescriptions, Expert Systems with Applications 76 (2017) 113–128. https://doi.org/10.1016/j.eswa.2017.01.042
  2. Y. Han, G. Holste, Y. Ding, Radiomics-guided global-local transformer for weakly supervised pathology localization in chest x-rays, IEEE Transactions on Medical Imaging 42 (3) (2022) 750–761. https://doi.org/10.1109/TMI.2022.3217292
  3. C. Bluethgen, P. Chambon, J.-B. Delbrouck, R. van der Sluijs, M. Połacin, J. M. Zambrano Chaves, T. M. Abraham, S. Purohit, C. P. Langlotz, A. S. Chaudhari, A vision–language foundation model for the generation of realistic chest x-ray images, Nature Biomedical Engineering (2024) 1–13. https://doi.org/10.1038/s41551-024-01152-3
  4. G. Reale-Nosei, E. Amador-Domínguez, E. Serrano, From vision to text: A comprehensive review of natural image captioning in medical diagnosis and radiology report generation, Medical Image Analysis (2024) 103264.
  5. M. Stefanini, M. Cornia, L. Baraldi, S. Cascianelli, G. Fiameni, R. Cucchiara, From show to tell: A survey on deep learning-based image captioning, IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (1) (2022) 539–559. https://doi.org/10.1109/TPAMI.2022.3148210
  6. Z. Tian, A. Liu, G. Zhu, X. Chen, A paralleled cnn and transformer network for ppg-based cuff-less blood pressure estimation, Biomedical Signal Processing and Control 99 (2025) 106741. https://doi.org/10.1016/j.bspc.2023.106741
  7. F. Zeiser, C. Costa, G. de O. Ramos, A. Maier, R. da Rosa Righi, Chexreport: A transformer-based architecture to generate chest x-ray reports suggestions, Expert Systems with Applications 255 (2024) 124644. https://doi.org/10.1016/j.eswa.2023.124644
  8. N. Linna, C. E. Kahn Jr, Applications of natural language processing in radiology: A systematic review, International Journal of Medical Informatics 163 (2022) 104779.
  9. H. Sharma, D. Padha, A comprehensive survey on image captioning: From handcrafted to deep learning-based techniques, a taxonomy and open research issues, Artificial Intelligence Review 56 (11) (2023) 13619–13661. https://doi.org/10.1007/s10462-023-10489-5
  10. Z. Wang, L. Wang, X. Li, L. Zhou, Diagnostic captioning by cooperative task interactions and sample-graph consistency, IEEE Transactions on Pattern Analysis and Machine Intelligence (2025).
  11. D. Singh, M. Kaur, J. M. Alanazi, A. A. AlZubi, H. N. Lee, Efficient evolving deep ensemble medical image captioning network, IEEE Journal of Biomedical and Health Informatics 27 (2) (2022) 1016–1025. https://doi.org/10.1109/JBHI.2022.3149312
  12. Z. Chen, Y. Shen, Y. Song, X. Wan, Cross-modal memory networks for radiology report generation, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 5904–5914.
  13. Z. Chen, Y. Song, T.-H. Chang, X. Wan, Generating radiology reports via memory-driven transformer, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 1439–1449.
  14. Y. Wang, J. Xu, Y. Sun, End-to-end transformer based model for image captioning, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36, 2022, pp. 2585–2594.
  15. W. M. da Silva, S. C. Cazella, R. S. Rech, Deep learning algorithms to assist in imaging diagnosis in individuals with disc herniation or spondylolisthesis: A scoping review, International Journal of Medical Informatics (2025) 105933.
  16. M. Cornia, M. Stefanini, L. Baraldi, R. Cucchiara, Meshed-memory transformer for image captioning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2020, pp. 10578–10587.
  17. Y. Tang, H. Yang, L. Zhang, Y. Yuan, Work like a doctor: Unifying scan localizer and dynamic generator for automated computed tomography report generation, Expert Systems with Applications 237 (2024) 121442. https://doi.org/10.1016/j.eswa.2023.121442
  18. M. Lin, T. Li, Z. Sun, G. Holste, Y. Ding, F. Wang, Y. Peng, Improving fairness of automated chest radiograph diagnosis by contrastive learning, Radiology: Artificial Intelligence 6 (5) (2024) e230342. https://doi.org/10.1148/ryai.230342
  19. V. D. Rao, B. N. Shashank, S. Nagesh Bhattu, Improved image captioning using gan and vit, in: International Conference on Computer Vision and Image Processing, Springer Nature Switzerland, 2023, pp. 375–385. https://doi.org/10.1007/978-3-031-38366-5_30
  20. Y. Li, B. Yang, X. Cheng, Z. Zhu, H. Li, Y. Zou, Unify, align and refine: Multi-level semantic alignment for radiology report generation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2863–2874.
  21. J. H. Moon, H. Lee, W. Shin, Y. H. Kim, E. Choi, Multi-modal understanding and generation for medical images and text via vision-language pre-training, IEEE Journal of Biomedical and Health Informatics 26 (12) (2022) 6070–6080. https://doi.org/10.1109/JBHI.2022.3208800
  22. J. Duan, J. Xiong, Y. Li, W. Ding, Deep learning based multimodal biomedical data fusion: An overview and comparative review, Information Fusion (2024) 102536.
  23. S. Li, P. Qiao, L. Wang, M. Ning, L. Yuan, Y. Zheng, J. Chen, An organ-aware diagnosis framework for radiology report generation, IEEE Transactions on Medical Imaging (2024). https://doi.org/10.1109/TMI.2024.3361536
  24. H. Li, H. Wang, X. Sun, H. He, J. Feng, Context-enhanced framework for medical image report generation using multimodal contexts, Knowledge-Based Systems (2025) 112913.
  25. M. Trzciński, S. Łukasik, A. H. Gandomi, Optimizing the structures of transformer neural networks using parallel simulated annealing, Journal of Artificial Intelligence and Soft Computing Research 14 (3) (2024) 267–282. https://doi.org/10.2478/jaiscr-2024-0015
  26. S. Muksimova, S. Umirzakova, K. Shoraimov, J. Baltayev, Y. I. Cho, Novelty classification model use in reinforcement learning for cervical cancer, Cancers 16 (22) (2024) 3782. https://doi.org/10.3390/cancers16223782
  27. X. Huang, Y. Zhang, J. P. Cohen, L. Zhang, E. P. Xing, Gloria: A multimodal global-local representation learning framework for medical vision-language tasks, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 3942–3951.
  28. B. Boecking, N. Usuyama, N. Bannur, C. Yu, S. Pournejat, A. Sethi, Z. Zhan, K. Lakhotia, A. Kumar, P. He, et al., Making the most of text semantics to improve biomedical vision–language processing, arXiv preprint arXiv:2204.09817 (2022). http://arxiv.org/abs/2204.09817
  29. K. Singhal, S. Azizi, T. Tu, S. Mahdavi, Y. Berne, J. Wei, H. W. Chung, N. Scales, et al., Large language models encode clinical knowledge, Nature 620 (7972) (2023) 172–180. https://doi.org/10.1038/s41586-023-06291-2
  30. A. E. Johnson, T. J. Pollard, N. R. Greenbaum, M. P. Lungren, C. Y. Deng, Y. Peng, S. Horng, Mimic-cxr-jpg, a large publicly available database of labeled chest radiographs, arXiv preprint arXiv:1901.07042 (2019). http://arxiv.org/abs/1901.07042
  31. B. Jing, Z. Wang, E. Xing, Show, describe and conclude: On exploiting the structure information of chest x-ray reports, arXiv preprint arXiv:2004.12274 (2020). http://arxiv.org/abs/2004.12274
  32. W. Ansar, S. Goswami, A. Chakrabarti, A survey on transformers in nlp with focus on efficiency, arXiv preprint arXiv:2406.16893 (2024). http://arxiv.org/abs/2406.16893
  33. J. Wang, W. Jiang, L. Ma, W. Liu, Y. Xu, Bidirectional attentive fusion with context gating for dense video captioning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7190–7198.
  34. M. Li, B. Lin, Z. Chen, H. Lin, X. Liang, X. Chang, Dynamic graph enhanced contrastive learning for chest x-ray report generation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 3334–3343.
  35. D.-P. Kuo, Y.-C. Chen, S.-J. Cheng, K. L.-C. Hsieh, Y.-T. Li, P.-C. Kuo, Y.-C. Chang, C.-Y. Chen, A vision transformer-convolutional neural network framework for decision-transparent dual-energy x-ray absorptiometry recommendations using chest low-dose ct, International Journal of Medical Informatics 199 (2025) 105901.
  36. Z. Hu, A. Iscen, C. Sun, Z. Wang, K. W. Chang, Y. Sun, C. Schmid, D. A. Ross, A. Fathi, Reveal: Retrieval-augmented visual-language pre-training with multi-source multimodal knowledge memory, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 23369–23379.
  37. M. Z. Hossain, F. Sohel, M. F. Shiratuddin, H. Laga, A comprehensive survey of deep learning for image captioning, ACM Computing Surveys (CSUR) 51 (6) (2019) 1–36. https://doi.org/10.1145/3241036
  38. P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, L. Zhang, Bottom-up and top-down attention for image captioning and visual question answering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6077–6086.
  39. D. Jiang, M. Ye, Cross-modal implicit relation reasoning and aligning for text-to-image person retrieval, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2023, pp. 2787–2797.
  40. K. Papineni, S. Roukos, T. Ward, W. J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
  41. G. Datta, N. Joshi, K. Gupta, Analysis of automatic evaluation metric on low-resourced language: Bertscore vs bleu score, in: International Conference on Speech and Computer, 2022, pp. 155–162.
  42. S. Banerjee, A. Lavie, Meteor: An automatic metric for mt evaluation with improved correlation with human judgments, in: Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 2005, pp. 65–72.
  43. C. Y. Lin, Rouge: A package for automatic evaluation of summaries, in: Text summarization branches out, 2004, pp. 74–81.
  44. D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014). http://arxiv.org/abs/1412.6980
  45. F. Liu, S. Ge, Y. Zou, X. Wu, Competence-based multimodal curriculum learning for medical report generation, arXiv preprint arXiv:2206.14579 (2022). http://arxiv.org/abs/2206.14579
  46. F. Liu, C. Yin, X. Wu, S. Ge, Y. Zou, P. Zhang, X. Sun, Contrastive attention for automatic chest x-ray report generation, arXiv preprint arXiv:2106.06965 (2021). http://arxiv.org/abs/2106.06965
  47. B. Yan, M. Pei, M. Zhao, C. Shan, Z. Tian, Prior guided transformer for accurate radiology reports generation, IEEE Journal of Biomedical and Health Informatics 26 (11) (2022) 5631–5640. https://doi.org/10.1109/JBHI.2022.3181343
Language: English
Page range: 55–72
Submitted on: May 20, 2025
Accepted on: Aug 22, 2025
Published on: Sep 26, 2025
Published by: SAN University
In partnership with: Paradigm Publishing Services
Publication frequency: 4 issues per year

© 2025 Sharofiddin Allaberdiev, Asif Khan, Sardor Mamarasulov, Xiaojun Chen, published by SAN University
This work is licensed under the Creative Commons Attribution 4.0 License.