
ARMDiaRD: A robust multi-class diabetic retinopathy detection using hybrid swin transformers with hierarchical fusion

Open Access | Feb 2026

Figures & Tables

Figure 1:

Proposed ARMDiaRD module-wise system flowchart adopted during experimentation. IMPA, improved marine predator algorithm.

Figure 2:

Data agglomeration pipeline with sample images, showing the three stages of data preprocessing, quality-aware down-sampling, and augmentation, ending with the split into training and testing sets.
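The three-stage agglomeration pipeline of Figure 2 can be sketched minimally as below. The `preprocess` and `augment` helpers are illustrative placeholders (the paper's actual preprocessing and augmentation operations are not detailed in this section); the final split mirrors the 80/20 train/test proportion of the agglomerated dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

def preprocess(img):
    """Placeholder preprocessing: scale pixel values to [0, 1]."""
    return img.astype(np.float32) / 255.0

def augment(img):
    """Placeholder augmentation: random horizontal flip."""
    return img[:, ::-1] if rng.random() < 0.5 else img

# Toy dataset: 10 fake 8x8 grayscale fundus "images"
images = [rng.integers(0, 256, size=(8, 8)) for _ in range(10)]

# Stage 1-3: preprocess, (down-sampling omitted here), augment
processed = [augment(preprocess(im)) for im in images]

# Final stage: 80/20 train/test split
split = int(0.8 * len(processed))
train, test = processed[:split], processed[split:]
print(len(train), len(test))  # 8 2
```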

Figure 3:

Histogram of Q-scores calculated for each grade of DR before and after quality-aware down-sampling, which reflects the improvement in quality of images selected for the oversampled classes. DR, diabetic retinopathy; ICDR, international clinical diabetic retinopathy.
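Quality-aware down-sampling, as reflected in the Q-score histograms of Figure 3, amounts to ranking each oversampled class by an image-quality score and keeping only the best images. The variance-of-Laplacian sharpness proxy below is an assumption for illustration, not the paper's actual Q-score definition.

```python
import numpy as np

def q_score(img):
    """Hypothetical quality proxy: variance of a discrete Laplacian
    (sharper, higher-contrast images score higher). The paper's
    actual Q-score may be defined differently."""
    lap = (np.roll(img, 1, 0) + np.roll(img, -1, 0)
           + np.roll(img, 1, 1) + np.roll(img, -1, 1) - 4 * img)
    return float(lap.var())

def downsample_by_quality(images, keep):
    """Keep the `keep` highest-quality images of an oversampled class."""
    ranked = sorted(images, key=q_score, reverse=True)
    return ranked[:keep]

rng = np.random.default_rng(1)
class_images = [rng.random((16, 16)) for _ in range(20)]
kept = downsample_by_quality(class_images, keep=5)
print(len(kept))  # 5
```

Selecting by rank rather than at random is what shifts the post-down-sampling Q-score histogram toward higher values.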

Figure 4:

Proposed robust hybrid deep learning model architecture: features are first captured by EfficientNet and a GC block, then refined by hybrid Swin Transformer blocks with MSFF and hierarchical aggregation. GC, global context; MSFF, multi-scale feature fusion.

Figure 5:

Scatter plot of the HPs lr, dr, and embed dim for the three optimization algorithms IMPO, PSO, and DE. DE, differential evolution; dr, dropout rate; HP, hyperparameter; lr, learning rate; PSO, particle swarm optimizer.

Figure 6:

Accuracy box plot obtained from the three optimizer algorithms, DE, IMPO, and PSO. DE, differential evolution; PSO, particle swarm optimizer.

Figure 7:

Results from stratified five-fold validation performed with average metrics of accuracy, precision, recall, and F1-score. CBAM, convolutional block attention module; GC, global context.
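Stratified five-fold validation (used throughout Figures 7–13) partitions the data so that each fold preserves the per-class proportions of the whole dataset. A minimal NumPy sketch follows; the experiments themselves presumably use a standard implementation such as scikit-learn's `StratifiedKFold`.

```python
import numpy as np

def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs in which every fold keeps
    the per-class proportions of `labels` (stratified k-fold)."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        # Deal each class's indices round-robin across the k folds
        for i, j in enumerate(idx):
            folds[i % k].append(j)
    for i in range(k):
        val = np.array(folds[i])
        train = np.concatenate([np.array(folds[j]) for j in range(k) if j != i])
        yield train, val

labels = np.array([0] * 50 + [1] * 50 + [2] * 50)
for tr, va in stratified_kfold(labels, k=5):
    # each validation fold holds exactly 10 samples of every class
    print(len(tr), len(va))  # 120 30 (five times)
```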

Figure 8:

Results from stratified five-fold validation performed during the training phase using ResNet-50.

Figure 9:

Results from stratified five-fold validation performed during the training phase using EfficientNet-B0.

Figure 10:

Results from stratified five-fold validation performed during the training phase using EfficientNet-B0 with GC. GC, global context.

Figure 11:

Results from stratified five-fold validation performed during the training phase using EfficientNet-B0 + GC + SE. GC, global context; SE, squeeze-and-excitation.

Figure 12:

Results from stratified five-fold validation performed during the training phase using EfficientNet-B0 + GC + CBAM. CBAM, convolutional block attention module; GC, global context.

Figure 13:

Results from stratified five-fold validation performed during the training phase using the proposed model ARMDiaRD.

Figure 14:

Results of the ablation analysis performed with metrics such as accuracy, precision, recall, QWK, Spearman correlation, and MAE on the training dataset subjected to the proposed agglomeration process, showing the overall improvement of the proposed model. MAE, mean absolute error; QWK, quadratic weighted kappa.

Figure 15:

Results obtained during the testing process with metrics such as accuracy, precision, recall, QWK, Spearman correlation, and MAE. MAE, mean absolute error; QWK, quadratic weighted kappa.

Summary of major fundus image datasets relevant to the ICDR grading system

| Dataset | Year | Images | Grades | Resolution | Country | Strengths | Limitations |
|---|---|---|---|---|---|---|---|
| EyePACS [19] | 2015 | 88,702 | 5 | Varying dimensions | USA | Large-scale dataset | Class imbalance |
| IDRiD [21] | 2018 | 516 | 5 | 4,288 × 2,848 | India | Classification and segmentation | Small size, limited masks |
| APTOS [20] | 2019 | 3,662 | — | 512 × 512 | Asia-Pacific | Clean, uniform images | Small dataset, class imbalance |
| FGADR [22] | 2020 | 1,842 | 5 | 2,048 × 3,072 | China | Balanced dataset | Moderate size, population-specific |
| MESSIDOR [23] | 2008 | 1,200 | 4 | Varying dimensions | France | Good for benchmarking | Uses only 4 grades |
| EOphtha [24] | 2014 | 611 | Not graded | 2,544 × 1,696 | France | Early DR detection | No severity grading |
| DDR | 2020 | 13,673 | 5 | High-resolution | China | Large-scale, diverse images | Variable image quality |

Literature review of various deep learning techniques for DR detection

| Year | Model used | Dataset | Accuracy (%) | Key contributions | Study finding |
|---|---|---|---|---|---|
| 2023 [31] | ResNet-50 | APTOS | 83.90 | Compared against VGG16, Xception, and AlexNet; ResNet-50 outperformed them | Performance can be improved |
| 2021 [32] | Multi-scale attention network (MSA-Net) | EyePACS and APTOS | 84.40 | MSA-Net helps to retrieve features at multiple scales | Performance can be improved |
| 2022 [33] | Inception-V3 | 89,947 images from various datasets | 99 | Robust binary classification for fundus image quality | ICDR grading can be implemented |
| 2023 [34] | ResNet-50 | APTOS | 83.90 | Compared against VGG16, Xception, and AlexNet; ResNet-50 outperformed them | Performance can be improved |
| 2023 [35] | Inception-V3 | APTOS | 98.7 | Image enhancement using CLAHE | Better performance with preprocessing |
| 2023 [36] | DenseNet121 | APTOS | 97.30 | Combines VGG16, XGBoost, and DenseNet121; highlights overfitting risk | Overfitting due to class 0 dominance |
| 2023 [37] | SqueezeNet, Darknet-53, EfficientNet-B0 | ODIR | 95, 99.4, 90 | Multi-classification: normal, glaucoma, cataract | ICDR DR grading can be extended |
| 2024 [38] | DeepDR Plus (ResNet-50 + self-attention) | 83,500 images from various datasets | ∼84.6 | Predicts DR progression time; supports personalized screening | Time prediction solved; classification remains open |
| 2025 [39] | Inception ResNet V2, MobileNet, Residual Net | 645 clinical images | 93 | Binary classification of eye diseases | DR grading can be implemented |
| 2025 [40] | Inception-ResNet-v2 + GRU | APTOS | 98.00 | FFO fine-tunes GRU | Optimization improves accuracy |

HP achieved using IMPO optimizer for training the proposed ARMDiaRD

| Sl. No. | Hyperparameter | Specification |
|---|---|---|
| 01 | Learning rate | 0.00039 |
| 02 | Number of epochs | 40 |
| 03 | Batch size | 32 |
| 04 | Dropout | 0.428 |
| 05 | Embed dim | 80 |
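For reference, the IMPO-tuned hyperparameters above can be collected into a training configuration. The steps-per-epoch figure derived below is an illustration, using the 28,000-image balanced training set reported in the class-wise summary table.

```python
# Hyperparameters reported for the IMPO-tuned ARMDiaRD model,
# gathered into a config dict a training loop could consume.
config = {
    "learning_rate": 0.00039,
    "epochs": 40,
    "batch_size": 32,
    "dropout": 0.428,
    "embed_dim": 80,
}

# With 28,000 training images and batch size 32, each epoch
# runs ceil(28000 / 32) optimisation steps.
steps_per_epoch = -(-28000 // config["batch_size"])
print(steps_per_epoch)  # 875
```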

Class-wise image summary after dataset agglomeration process used for ARMDiaDR system experimentation

| Dataset | No DR | Mild NPDR | Moderate NPDR | Severe NPDR | PDR | Total images | Dimensions |
|---|---|---|---|---|---|---|---|
| APTOS [20] | 1,805 | 370 | 999 | 193 | 295 | 3,662 | Mixed |
| IDRiD [21] | 168 | 25 | 168 | 93 | 62 | 516 | 4,288 × 2,848 |
| EyePACS [19] | 25,802 | 2,438 | 5,288 | 872 | 708 | 35,108 | Mixed |
| MESSIDOR-V1 [23] | 546 | 153 | 247 | 254 | — | 1,200 | 2,240 × 1,488 |
| MESSIDOR-V2 [46] | 1,017 | 270 | 347 | 75 | 35 | 1,744 | 2,240 × 1,488 |
| Training dataset | 5,600 | 5,600 | 5,600 | 5,600 | 5,600 | 28,000 | 224 × 224 |
| Testing dataset | 1,400 | 1,400 | 1,400 | 1,400 | 1,400 | 7,000 | 224 × 224 |

Evaluation performance metrics used

| Performance metric | Mathematical expression | Description |
|---|---|---|
| Accuracy | $\mathrm{Accuracy} = \dfrac{\mathrm{TP} + \mathrm{TN}}{\text{Total instances}}$ | Provides a high-level overview of model performance. |
| Recall | $\mathrm{Recall} = \dfrac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$ | Measures how many actual positive instances the model correctly identified. |
| Precision | $\mathrm{Precision} = \dfrac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$ | Measures the accuracy of positive predictions. |
| F1-score | $\mathrm{F1\text{-}Score} = \dfrac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ | Harmonic mean of precision and recall; balances false positives and false negatives. |
| QWK | $K = \dfrac{\text{Observed} - \text{Expected}}{1 - \text{Expected}}$ | Measures agreement between two raters, penalizing larger disagreements quadratically. |
| MAE | $\mathrm{MAE} = \dfrac{1}{n}\sum_{i=1}^{n} \lvert Y_i - \hat{Y}_i \rvert$ | Measures the average magnitude of the errors between predicted and actual values. |
| MSE | $\mathrm{MSE} = \dfrac{1}{n}\sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2$ | Similar to MAE but penalizes larger errors more by squaring them. |
| Spearman correlation | $\rho = 1 - \dfrac{6\sum d_i^2}{n\left( n^2 - 1 \right)}$ | Measures strength and direction of a monotonic relationship in ranked or ordinal data. |
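The ordinal-aware metrics in the table can be computed directly. The sketch below implements quadratic weighted kappa from its standard confusion-matrix form (weight matrix, observed and chance-expected counts), which is algebraically consistent with the $K$ expression above, alongside MAE; the toy labels are hypothetical.

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """QWK: agreement between ordinal labels, penalising larger
    disagreements quadratically via weights (i - j)^2 / (N - 1)^2."""
    # Observed confusion matrix
    O = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        O[t, p] += 1
    # Quadratic disagreement weights
    w = np.array([[(i - j) ** 2 for j in range(n_classes)]
                  for i in range(n_classes)]) / (n_classes - 1) ** 2
    # Chance-expected matrix from the marginal label histograms
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / len(y_true)
    return 1.0 - (w * O).sum() / (w * E).sum()

# Hypothetical 5-grade predictions: perfect except one off-by-one error
y_true = np.array([0, 1, 2, 3, 4, 2, 1, 0])
y_pred = np.array([0, 1, 2, 3, 4, 2, 1, 1])
mae = np.abs(y_true - y_pred).mean()
print(round(mae, 3), round(quadratic_weighted_kappa(y_true, y_pred), 3))
# 0.125 0.961
```

Note how a single one-grade error leaves QWK high, whereas a hypothetical grade-0-vs-grade-4 error would be penalised sixteen times as heavily.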
Language: English
Submitted on: Aug 22, 2025 | Published on: Feb 20, 2026
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 J. Dhiviya Rose, Ved Prakash Bhardwaj, published by Professor Subhas Chandra Mukhopadhyay
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.