
ARMDiaRD: A robust multi-class diabetic retinopathy detection using hybrid swin transformers with hierarchical fusion

Open Access | Feb 2026

Figures & Tables

Figure 1:

Proposed ARMDiaRD module-wise system flowchart adopted during experimentation. IMPA, improved marine predator algorithm.

Figure 2:

Data agglomeration pipeline with sample images, showing the three stages of data preprocessing, quality-aware down-sampling, and augmentation, ending with the split into training and testing sets.
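The three-stage agglomeration pipeline of Figure 2 can be sketched minimally as below. The `preprocess` and `augment` helpers are illustrative placeholders (the paper's actual preprocessing and augmentation operations are not detailed in this section); the final split mirrors the 80/20 train/test proportion of the agglomerated dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

def preprocess(img):
    """Placeholder preprocessing: scale pixel values to [0, 1]."""
    return img.astype(np.float32) / 255.0

def augment(img):
    """Placeholder augmentation: random horizontal flip."""
    return img[:, ::-1] if rng.random() < 0.5 else img

# Toy dataset: 10 fake 8x8 grayscale fundus "images"
images = [rng.integers(0, 256, size=(8, 8)) for _ in range(10)]

# Stage 1-3: preprocess, (down-sampling omitted here), augment
processed = [augment(preprocess(im)) for im in images]

# Final stage: 80/20 train/test split
split = int(0.8 * len(processed))
train, test = processed[:split], processed[split:]
print(len(train), len(test))  # 8 2
```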

Figure 3:

Histogram of Q-scores calculated for each grade of DR before and after quality-aware down-sampling, which reflects the improvement in quality of images selected for the oversampled classes. DR, diabetic retinopathy; ICDR, international clinical diabetic retinopathy.
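Quality-aware down-sampling, as reflected in the Q-score histograms of Figure 3, amounts to ranking each oversampled class by an image-quality score and keeping only the best images. The variance-of-Laplacian sharpness proxy below is an assumption for illustration, not the paper's actual Q-score definition.

```python
import numpy as np

def q_score(img):
    """Hypothetical quality proxy: variance of a discrete Laplacian
    (sharper, higher-contrast images score higher). The paper's
    actual Q-score may be defined differently."""
    lap = (np.roll(img, 1, 0) + np.roll(img, -1, 0)
           + np.roll(img, 1, 1) + np.roll(img, -1, 1) - 4 * img)
    return float(lap.var())

def downsample_by_quality(images, keep):
    """Keep the `keep` highest-quality images of an oversampled class."""
    ranked = sorted(images, key=q_score, reverse=True)
    return ranked[:keep]

rng = np.random.default_rng(1)
class_images = [rng.random((16, 16)) for _ in range(20)]
kept = downsample_by_quality(class_images, keep=5)
print(len(kept))  # 5
```

Selecting by rank rather than at random is what shifts the post-down-sampling Q-score histogram toward higher values.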

Figure 4:

Proposed robust hybrid deep learning model architecture: features are first captured by EfficientNet and a GC block, then refined by hybrid Swin Transformer blocks with MSFF and hierarchical aggregation. GC, global context; MSFF, multi-scale feature fusion.

Figure 5:

Scatter plot of the HPs lr, dr, and embed dim for the three optimization algorithms IMPO, PSO, and DE. DE, differential evolution; dr, dropout rate; HP, hyperparameter; lr, learning rate; PSO, particle swarm optimizer.

Figure 6:

Accuracy box plot obtained from the three optimizer algorithms, DE, IMPO, and PSO. DE, differential evolution; PSO, particle swarm optimizer.

Figure 7:

Results from stratified five-fold validation performed with average metrics of accuracy, precision, recall, and F1-score. CBAM, convolutional block attention module; GC, global context.
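Stratified five-fold validation (used throughout Figures 7–13) partitions the data so that each fold preserves the per-class proportions of the whole dataset. A minimal NumPy sketch follows; the experiments themselves presumably use a standard implementation such as scikit-learn's `StratifiedKFold`.

```python
import numpy as np

def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs in which every fold keeps
    the per-class proportions of `labels` (stratified k-fold)."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        # Deal each class's indices round-robin across the k folds
        for i, j in enumerate(idx):
            folds[i % k].append(j)
    for i in range(k):
        val = np.array(folds[i])
        train = np.concatenate([np.array(folds[j]) for j in range(k) if j != i])
        yield train, val

labels = np.array([0] * 50 + [1] * 50 + [2] * 50)
for tr, va in stratified_kfold(labels, k=5):
    # each validation fold holds exactly 10 samples of every class
    print(len(tr), len(va))  # 120 30 (five times)
```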

Figure 8:

Results from stratified five-fold validation performed during the training phase using ResNet-50.

Figure 9:

Results from stratified five-fold validation performed during the training phase using EfficientNet-B0.

Figure 10:

Results from stratified five-fold validation performed during the training phase using EfficientNet-B0 with GC. GC, global context.

Figure 11:

Results from stratified five-fold validation performed during the training phase using EfficientNet-B0 + GC + SE. GC, global context; SE, squeeze-and-excitation.

Figure 12:

Results from stratified five-fold validation performed during the training phase using EfficientNet-B0 + GC + CBAM. CBAM, convolutional block attention module; GC, global context.

Figure 13:

Results from stratified five-fold validation performed during the training phase using the proposed model ARMDiaRD.

Figure 14:

Results of the ablation analysis performed with metrics such as accuracy, precision, recall, QWK, Spearman correlation, and MAE on the training dataset subjected to the proposed agglomeration process, showing the overall improvement of the proposed model. MAE, mean absolute error; QWK, quadratic weighted kappa.

Figure 15:

Results obtained during the testing process with metrics such as accuracy, precision, recall, QWK, Spearman correlation, and MAE. MAE, mean absolute error; QWK, quadratic weighted kappa.

Summary of major fundus image datasets relevant to the ICDR grading system

| Dataset | Year | Images | Grades | Resolution | Country | Strengths | Limitations |
|---|---|---|---|---|---|---|---|
| EyePACS [19] | 2015 | 88,702 | 5 | Varying dimensions | USA | Large-scale dataset | Class imbalance |
| IDRiD [21] | 2018 | 516 | 5 | 4,288 × 2,848 | India | Classification and segmentation | Small size, limited masks |
| APTOS [20] | 2019 | 3,662 | — | 512 × 512 | Asia-Pacific | Clean, uniform images | Small dataset, class imbalance |
| FGADR [22] | 2020 | 1,842 | 5 | 2,048 × 3,072 | China | Balanced dataset | Moderate size, population-specific |
| MESSIDOR [23] | 2008 | 1,200 | 4 | Varying dimensions | France | Good for benchmarking | Uses only 4 grades |
| EOphtha [24] | 2014 | 611 | Not graded | 2,544 × 1,696 | France | Early DR detection | No severity grading |
| DDR | 2020 | 13,673 | 5 | High-resolution | China | Large-scale, diverse images | Variable image quality |

Literature review of various deep learning techniques for DR detection

| Year | Model used | Dataset | Accuracy (%) | Key contributions | Study finding |
|---|---|---|---|---|---|
| 2023 [31] | ResNet-50 | APTOS | 83.90 | Compared against VGG16, Xception, and AlexNet; ResNet-50 outperformed them | Performance can be improved |
| 2021 [32] | Multi-scale attention network (MSA-Net) | EyePACS and APTOS | 84.40 | MSA-Net helps to retrieve features at multiple scales | Performance can be improved |
| 2022 [33] | Inception-V3 | 89,947 images from various datasets | 99 | Robust binary classification for fundus image quality | ICDR grading can be implemented |
| 2023 [34] | ResNet-50 | APTOS | 83.90 | Compared against VGG16, Xception, and AlexNet; ResNet-50 outperformed them | Performance can be improved |
| 2023 [35] | Inception-V3 | APTOS | 98.7 | Image enhancement using CLAHE | Better performance with preprocessing |
| 2023 [36] | DenseNet121 | APTOS | 97.30 | Combines VGG16, XGBoost, and DenseNet121; highlights overfitting risk | Overfitting due to class 0 dominance |
| 2023 [37] | SqueezeNet, Darknet-53, EfficientNet-B0 | ODIR | 95, 99.4, 90 | Multi-classification: normal, glaucoma, cataract | ICDR DR grading can be extended |
| 2024 [38] | DeepDR Plus (ResNet-50 + self-attention) | 83,500 images from various datasets | ∼84.6 | Predicts DR progression time; supports personalized screening | Time prediction solved; classification remains open |
| 2025 [39] | Inception ResNet V2, MobileNet, Residual Net | 645 clinical images | 93 | Binary classification of eye diseases | DR grading can be implemented |
| 2025 [40] | Inception-ResNet-v2 + GRU | APTOS | 98.00 | FFO fine-tunes GRU | Optimization improves accuracy |

HP achieved using IMPO optimizer for training the proposed ARMDiaRD

| Sl. No. | Hyperparameter | Specification |
|---|---|---|
| 01 | Learning rate | 0.00039 |
| 02 | Number of epochs | 40 |
| 03 | Batch size | 32 |
| 04 | Dropout | 0.428 |
| 05 | Embed dim | 80 |
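For reference, the IMPO-tuned hyperparameters above can be collected into a training configuration. The steps-per-epoch figure derived below is an illustration, using the 28,000-image balanced training set reported in the class-wise summary table.

```python
# Hyperparameters reported for the IMPO-tuned ARMDiaRD model,
# gathered into a config dict a training loop could consume.
config = {
    "learning_rate": 0.00039,
    "epochs": 40,
    "batch_size": 32,
    "dropout": 0.428,
    "embed_dim": 80,
}

# With 28,000 training images and batch size 32, each epoch
# runs ceil(28000 / 32) optimisation steps.
steps_per_epoch = -(-28000 // config["batch_size"])
print(steps_per_epoch)  # 875
```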

Class-wise image summary after dataset agglomeration process used for ARMDiaDR system experimentation

| Dataset | No DR | Mild NPDR | Moderate NPDR | Severe NPDR | PDR | Total images | Dimensions |
|---|---|---|---|---|---|---|---|
| APTOS [20] | 1,805 | 370 | 999 | 193 | 295 | 3,662 | Mixed |
| IDRiD [21] | 168 | 25 | 168 | 93 | 62 | 516 | 4,288 × 2,848 |
| EyePACS [19] | 25,802 | 2,438 | 5,288 | 872 | 708 | 35,108 | Mixed |
| MESSIDOR-V1 [23] | 546 | 153 | 247 | 254 | — | 1,200 | 2,240 × 1,488 |
| MESSIDOR-V2 [46] | 1,017 | 270 | 347 | 75 | 35 | 1,744 | 2,240 × 1,488 |
| Training dataset | 5,600 | 5,600 | 5,600 | 5,600 | 5,600 | 28,000 | 224 × 224 |
| Testing dataset | 1,400 | 1,400 | 1,400 | 1,400 | 1,400 | 7,000 | 224 × 224 |

Evaluation performance metrics used

| Performance metric | Mathematical expression | Description |
|---|---|---|
| Accuracy | $\mathrm{Accuracy} = \dfrac{\mathrm{TP} + \mathrm{TN}}{\text{Total instances}}$ | Provides a high-level overview of model performance. |
| Recall | $\mathrm{Recall} = \dfrac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$ | Measures how many actual positive instances the model correctly identified. |
| Precision | $\mathrm{Precision} = \dfrac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$ | Measures the accuracy of positive predictions. |
| F1-score | $\mathrm{F1\text{-}Score} = \dfrac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ | Harmonic mean of precision and recall; balances false positives and false negatives. |
| QWK | $K = \dfrac{\text{Observed} - \text{Expected}}{1 - \text{Expected}}$ | Measures agreement between two raters, penalizing larger disagreements quadratically. |
| MAE | $\mathrm{MAE} = \dfrac{1}{n}\sum_{i=1}^{n} \lvert Y_i - \hat{Y}_i \rvert$ | Measures the average magnitude of the errors between predicted and actual values. |
| MSE | $\mathrm{MSE} = \dfrac{1}{n}\sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2$ | Similar to MAE but penalizes larger errors more by squaring them. |
| Spearman correlation | $\rho = 1 - \dfrac{6\sum d_i^2}{n\left( n^2 - 1 \right)}$ | Measures strength and direction of a monotonic relationship in ranked or ordinal data. |
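The ordinal-aware metrics in the table can be computed directly. The sketch below implements quadratic weighted kappa from its standard confusion-matrix form (weight matrix, observed and chance-expected counts), which is algebraically consistent with the $K$ expression above, alongside MAE; the toy labels are hypothetical.

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """QWK: agreement between ordinal labels, penalising larger
    disagreements quadratically via weights (i - j)^2 / (N - 1)^2."""
    # Observed confusion matrix
    O = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        O[t, p] += 1
    # Quadratic disagreement weights
    w = np.array([[(i - j) ** 2 for j in range(n_classes)]
                  for i in range(n_classes)]) / (n_classes - 1) ** 2
    # Chance-expected matrix from the marginal label histograms
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / len(y_true)
    return 1.0 - (w * O).sum() / (w * E).sum()

# Hypothetical 5-grade predictions: perfect except one off-by-one error
y_true = np.array([0, 1, 2, 3, 4, 2, 1, 0])
y_pred = np.array([0, 1, 2, 3, 4, 2, 1, 1])
mae = np.abs(y_true - y_pred).mean()
print(round(mae, 3), round(quadratic_weighted_kappa(y_true, y_pred), 3))
# 0.125 0.961
```

Note how a single one-grade error leaves QWK high, whereas a hypothetical grade-0-vs-grade-4 error would be penalised sixteen times as heavily.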
Language: English
Submitted on: Aug 22, 2025 | Published on: Feb 20, 2026
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 J. Dhiviya Rose, Ved Prakash Bhardwaj, published by Professor Subhas Chandra Mukhopadhyay
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.