Y. Han, G. Holste, Y. Ding, Radiomics-guided global-local transformer for weakly supervised pathology localization in chest x-rays, IEEE Transactions on Medical Imaging 42 (3) (2022) 750–761. doi:10.1109/TMI.2022.3217292.
C. Bluethgen, P. Chambon, J.-B. Delbrouck, R. van der Sluijs, M. Połacin, J. M. Zambrano Chaves, T. M. Abraham, S. Purohit, C. P. Langlotz, A. S. Chaudhari, A vision–language foundation model for the generation of realistic chest x-ray images, Nature Biomedical Engineering (2024) 1–13. doi:10.1038/s41551-024-01152-3.
G. Reale-Nosei, E. Amador-Domínguez, E. Serrano, From vision to text: A comprehensive review of natural image captioning in medical diagnosis and radiology report generation, Medical Image Analysis (2024) 103264.
M. Stefanini, M. Cornia, L. Baraldi, S. Cascianelli, G. Fiameni, R. Cucchiara, From show to tell: A survey on deep learning-based image captioning, IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (1) (2022) 539–559. doi:10.1109/TPAMI.2022.3148210.
F. Zeiser, C. Costa, G. de O. Ramos, A. Maier, R. da Rosa Righi, Chexreport: A transformer-based architecture to generate chest x-ray reports suggestions, Expert Systems with Applications 255 (2024) 124644. doi:10.1016/j.eswa.2023.124644.
N. Linna, C. E. Kahn Jr, Applications of natural language processing in radiology: A systematic review, International Journal of Medical Informatics 163 (2022) 104779.
H. Sharma, D. Padha, A comprehensive survey on image captioning: From handcrafted to deep learning-based techniques, a taxonomy and open research issues, Artificial Intelligence Review 56 (11) (2023) 13619–13661. doi:10.1007/s10462-023-10489-5.
Z. Wang, L. Wang, X. Li, L. Zhou, Diagnostic captioning by cooperative task interactions and sample-graph consistency, IEEE Transactions on Pattern Analysis and Machine Intelligence (2025).
D. Singh, M. Kaur, J. M. Alanazi, A. A. AlZubi, H. N. Lee, Efficient evolving deep ensemble medical image captioning network, IEEE Journal of Biomedical and Health Informatics 27 (2) (2022) 1016–1025. doi:10.1109/JBHI.2022.3149312.
Z. Chen, Y. Shen, Y. Song, X. Wan, Cross-modal memory networks for radiology report generation, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 5904–5914.
Z. Chen, Y. Song, T.-H. Chang, X. Wan, Generating radiology reports via memory-driven transformer, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 1439–1449.
Y. Wang, J. Xu, Y. Sun, End-to-end transformer based model for image captioning, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36, 2022, pp. 2585–2594.
W. M. da Silva, S. C. Cazella, R. S. Rech, Deep learning algorithms to assist in imaging diagnosis in individuals with disc herniation or spondylolisthesis: A scoping review, International Journal of Medical Informatics (2025) 105933.
M. Cornia, M. Stefanini, L. Baraldi, R. Cucchiara, Meshed-memory transformer for image captioning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2020, pp. 10578–10587.
Y. Tang, H. Yang, L. Zhang, Y. Yuan, Work like a doctor: Unifying scan localizer and dynamic generator for automated computed tomography report generation, Expert Systems with Applications 237 (2024) 121442. doi:10.1016/j.eswa.2023.121442.
M. Lin, T. Li, Z. Sun, G. Holste, Y. Ding, F. Wang, Y. Peng, Improving fairness of automated chest radiograph diagnosis by contrastive learning, Radiology: Artificial Intelligence 6 (5) (2024) e230342. doi:10.1148/ryai.230342.
V. D. Rao, B. N. Shashank, S. Nagesh Bhattu, Improved image captioning using gan and vit, in: International Conference on Computer Vision and Image Processing, Springer Nature Switzerland, 2023, pp. 375–385. doi:10.1007/978-3-031-38366-5_30.
Y. Li, B. Yang, X. Cheng, Z. Zhu, H. Li, Y. Zou, Unify, align and refine: Multi-level semantic alignment for radiology report generation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2863–2874.
J. H. Moon, H. Lee, W. Shin, Y. H. Kim, E. Choi, Multi-modal understanding and generation for medical images and text via vision-language pre-training, IEEE Journal of Biomedical and Health Informatics 26 (12) (2022) 6070–6080. doi:10.1109/JBHI.2022.3208800.
J. Duan, J. Xiong, Y. Li, W. Ding, Deep learning based multimodal biomedical data fusion: An overview and comparative review, Information Fusion (2024) 102536.
H. Li, H. Wang, X. Sun, H. He, J. Feng, Context-enhanced framework for medical image report generation using multimodal contexts, Knowledge-Based Systems (2025) 112913.
M. Trzciński, S. Łukasik, A. H. Gandomi, Optimizing the structures of transformer neural networks using parallel simulated annealing, Journal of Artificial Intelligence and Soft Computing Research 14 (3) (2024) 267–282. doi:10.2478/jaiscr-2024-0015.
X. Huang, Y. Zhang, J. P. Cohen, L. Zhang, E. P. Xing, Gloria: A multimodal global-local representation learning framework for medical vision-language tasks, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 3942–3951.
B. Boecking, N. Usuyama, N. Bannur, C. Yu, S. Pournejat, A. Sethi, Z. Zhan, K. Lakhotia, A. Kumar, P. He, et al., Making the most of text semantics to improve biomedical vision–language processing, arXiv preprint arXiv:2204.09817 (2022).
A. E. Johnson, T. J. Pollard, N. R. Greenbaum, M. P. Lungren, C. Y. Deng, Y. Peng, S. Horng, Mimic-cxr-jpg, a large publicly available database of labeled chest radiographs, arXiv preprint arXiv:1901.07042 (2019).
B. Jing, Z. Wang, E. Xing, Show, describe and conclude: On exploiting the structure information of chest x-ray reports, arXiv preprint arXiv:2004.12274 (2020).
W. Ansar, S. Goswami, A. Chakrabarti, A survey on transformers in nlp with focus on efficiency, arXiv preprint arXiv:2406.16893 (2024).
J. Wang, W. Jiang, L. Ma, W. Liu, Y. Xu, Bidirectional attentive fusion with context gating for dense video captioning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7190–7198.
M. Li, B. Lin, Z. Chen, H. Lin, X. Liang, X. Chang, Dynamic graph enhanced contrastive learning for chest x-ray report generation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 3334–3343.
Z. Hu, A. Iscen, C. Sun, Z. Wang, K. W. Chang, Y. Sun, C. Schmid, D. A. Ross, A. Fathi, Reveal: Retrieval-augmented visual-language pre-training with multi-source multimodal knowledge memory, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 23369–23379.
M. Z. Hossain, F. Sohel, M. F. Shiratuddin, H. Laga, A comprehensive survey of deep learning for image captioning, ACM Computing Surveys (CSUR) 51 (6) (2019) 1–36. doi:10.1145/3241036.
P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, L. Zhang, Bottom-up and top-down attention for image captioning and visual question answering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6077–6086.
D. Jiang, M. Ye, Cross-modal implicit relation reasoning and aligning for text-to-image person retrieval, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2023, pp. 2787–2797.
K. Papineni, S. Roukos, T. Ward, W. J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
G. Datta, N. Joshi, K. Gupta, Analysis of automatic evaluation metric on low-resourced language: Bertscore vs bleu score, in: International Conference on Speech and Computer, 2022, pp. 155–162.
S. Banerjee, A. Lavie, Meteor: An automatic metric for mt evaluation with improved correlation with human judgments, in: Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 2005, pp. 65–72.
F. Liu, S. Ge, Y. Zou, X. Wu, Competence-based multimodal curriculum learning for medical report generation, arXiv preprint arXiv:2206.14579 (2022).
F. Liu, C. Yin, X. Wu, S. Ge, Y. Zou, P. Zhang, X. Sun, Contrastive attention for automatic chest x-ray report generation, arXiv preprint arXiv:2106.06965 (2021).