Research Article

Report Generation from X-ray Images: An Evaluation with Transformer Architectures

Year 2025, Volume: 5 Issue: 2, 1 - 10, 01.10.2025

Abstract

The automatic generation of medical reports from chest X-ray images has attracted increasing attention due to its potential to enhance diagnostic accuracy and reduce workload in clinical decision support. The latest advancements in medical report generation, particularly with encoder-decoder models, emphasize their ability to integrate visual information with textual reports. However, these models suffer from challenges such as generating generic statements, failing to capture detailed pathological findings, and producing inconsistent reports. In this study, the effectiveness of Vision Transformer and Convolutional Vision Transformer encoders combined with GPT-2-based (Generative Pre-trained Transformer) decoders is investigated for the task of chest X-ray report generation. Their ability to capture radiological findings and generate clinically meaningful reports is evaluated through comparative analyses conducted under diverse experimental configurations on the IU X-ray dataset.
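The pipeline the abstract describes pairs a Vision Transformer (or Convolutional Vision Transformer) image encoder with a GPT-2 text decoder that cross-attends to the encoder's visual tokens. As a rough illustration of the encoder side only, the sketch below computes how many visual tokens a standard ViT-Base-style configuration hands to the decoder; the 224x224 input and 16x16 patch sizes are common defaults assumed here, not values taken from the paper.

```python
# Minimal sketch of how a Vision Transformer encoder tokenizes a chest
# X-ray before a GPT-2 decoder attends to the result. Sizes are assumed
# ViT-Base defaults, not configurations reported in the article.

def vit_patch_grid(image_size: int = 224, patch_size: int = 16) -> int:
    """Number of patch tokens a ViT produces for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be divisible by patch size")
    per_side = image_size // patch_size
    return per_side * per_side

def encoder_sequence_length(image_size: int = 224, patch_size: int = 16) -> int:
    """Patch tokens plus the prepended [CLS] token."""
    return vit_patch_grid(image_size, patch_size) + 1

# A 224x224 image with 16x16 patches yields a 14x14 grid of 196 patch
# tokens; with the [CLS] token the encoder output has length 197, which
# the text decoder cross-attends to while generating the report.
print(vit_patch_grid())           # 196
print(encoder_sequence_length())  # 197
```

In practice such an encoder-decoder pair can be assembled from pretrained checkpoints (e.g. a ViT encoder and GPT-2 decoder joined via cross-attention); the token count above is the length of the visual memory the decoder conditions on at every generation step.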

Supporting Institution

The Scientific Research Projects Coordination Unit of Izmir Katip Celebi University

Project Number

2025-TYL-FEBE-0006


Details

Primary Language English
Subjects Image Processing, Deep Learning, Natural Language Processing
Journal Section Research Articles
Authors

Bengü Fetiler 0000-0002-2761-7751

Ömer Atılım Koca 0009-0007-7286-6785

Volkan Kılıç 0000-0002-3164-1981

Project Number 2025-TYL-FEBE-0006
Publication Date October 1, 2025
Submission Date August 4, 2025
Acceptance Date September 24, 2025
Published in Issue Year 2025 Volume: 5 Issue: 2

Cite

APA Fetiler, B., Koca, Ö. A., & Kılıç, V. (2025). Report Generation from X-ray Images: An Evaluation with Transformer Architectures. Artificial Intelligence Theory and Applications, 5(2), 1-10.