The performance of machine learning models is often evaluated using accuracy metrics alone, which give only a partial view of their capabilities, particularly in medical imaging, where explainability is essential. In this study, we propose using MedViT, a hybrid CNN-Transformer model, to classify retinal OCT images. The model achieved 95.95% accuracy on the modified public Retinal OCT C8 dataset and 90% on a private, independent external test dataset, correctly classifying 9 out of 10 cases. Explainability analysis with Grad-CAM showed that the model consistently focused on clinically relevant macular regions. These results indicate that the proposed approach can accurately detect pathological changes and focus on diagnostically important regions even in an independent dataset outside the training data, making it a reliable and helpful tool for ophthalmologists in clinical practice.
© 2025 Samuel Gibala, Veronika Kurilova, Milos Oravec, Jarmila Pavlovicova, Jana Stefanickova, published by Slovak University of Technology in Bratislava
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
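
For readers unfamiliar with the Grad-CAM analysis mentioned in the abstract, the following is a minimal sketch of the general Grad-CAM computation, not the authors' implementation. It assumes the feature maps of one convolutional layer and the gradients of the class score with respect to them are already available. The function name `grad_cam` and the toy values are hypothetical, and plain Python lists are used only to keep the example dependency-free; in practice these arrays would come from a deep learning framework.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heatmap for one convolutional layer (illustrative sketch).

    activations, gradients: nested lists shaped [channels][height][width];
    gradients are d(class score) / d(activations).
    """
    n_ch = len(activations)
    h = len(activations[0])
    w = len(activations[0][0])

    # Channel weights: average of the gradients over all spatial positions.
    weights = [sum(sum(row) for row in gradients[k]) / (h * w)
               for k in range(n_ch)]

    # Weighted sum of the activation maps, followed by ReLU.
    cam = [[max(0.0, sum(weights[k] * activations[k][i][j]
                         for k in range(n_ch)))
            for j in range(w)]
           for i in range(h)]

    # Scale to [0, 1] for display; an all-zero map is left unchanged.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


# Toy input (hypothetical values): 2 channels, 2x2 feature maps.
acts = [[[1.0, 0.0], [0.0, 2.0]],
        [[0.0, 3.0], [1.0, 0.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]],       # channel weight  1.0
         [[-1.0, -1.0], [-1.0, -1.0]]]   # channel weight -1.0
heatmap = grad_cam(acts, grads)
```

Positions where the positively weighted channel is active receive high values in `heatmap`, while positions dominated by the negatively weighted channel are clipped to zero by the ReLU. Upsampled to the input resolution and overlaid on an OCT scan, such a map indicates which regions contributed most to the predicted class.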