Table 1
Data distribution of the PCID dataset. Durations are given in hours (h) and minutes (m); each instrument is split roughly 80/10/10 across train, test, and validation.
| Instrument | Train | Test | Validation |
|---|---|---|---|
| Daf | 52 m | 6.5 m | 6.5 m |
| Divan | 59 m | 7 m | 7 m |
| Dutar | 50.5 m | 6 m | 6 m |
| Gheychak | 50 m | 6 m | 6 m |
| Kamancheh | 2 h 14 m | 16.5 m | 16.5 m |
| Ney Anban | 1 h 6 m | 8 m | 8 m |
| Ney | 2 h 15 m | 17 m | 17 m |
| Oud | 2 h 32 m | 19 m | 19 m |
| Qanun | 1 h 1 m | 7.5 m | 7.5 m |
| Rubab | 50 m | 6 m | 6 m |
| Santur | 2 h 11 m | 16 m | 16 m |
| Setar | 3 h 22 m | 25 m | 25 m |
| Tanbour | 1 h 18 m | 9.5 m | 9.5 m |
| Tar | 2 h 7 m | 16 m | 16 m |
| Tonbak | 1 h 9 m | 8.5 m | 8.5 m |

Figure 1
Flowchart of the structure of our proposed model.

Figure 2
Our proposed contrastive (base) model architecture.
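
The base model in Figure 2 is trained with a supervised contrastive objective (cf. the methodology column of Table 2). As background, here is a minimal PyTorch sketch of the standard supervised contrastive (SupCon) loss; the function name, temperature value, and tensor shapes are illustrative assumptions, not details taken from this paper.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.07):
    """Standard SupCon loss over a batch of embeddings (illustrative sketch).

    features: (N, D) projection-head outputs for N audio segments.
    labels:   (N,) integer instrument labels.
    """
    features = F.normalize(features, dim=1)          # work in cosine-similarity space
    sim = features @ features.T / temperature        # (N, N) scaled similarities
    n = sim.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(self_mask, float("-inf"))  # exclude self-pairs
    # Positives are the other segments with the same instrument label.
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Mean log-likelihood of positives per anchor, averaged over anchors.
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(dim=1) / pos_count
    return loss.mean()
```

Under this objective, segments of the same instrument are pulled together in embedding space while segments of different instruments are pushed apart, which is consistent with the per-class clusters shown later in Figure 10.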

Figure 3
Accuracy vs. input length tested on the Nava and PCID datasets (trained on the five-instrument subset of PCID).

Figure 4
Accuracy vs. input length tested on the Nava and PCID datasets (trained on the full PCID dataset).

Figure 5
Accuracy vs. input length tested on the Nava and PCID datasets (trained on the original Nava dataset).

Figure 6
Comparison of test accuracy among the proposed model, Baba Ali et al. (2019), and Baba Ali (2024).

Figure 7
Comparison of accuracy for Dastgah detection across Baba Ali et al. (2019), Baba Ali (2024), and the proposed method.

Figure 8
Architecture of the best-performing classifier for the one-second, 15-class classification task.

Figure 9
Architecture of the best-performing meta-classifier for the 20-second, 15-class classification task.

Figure 10
t-SNE projection of penultimate-layer features for 10,000 one-second test segments from the PCID dataset.
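
A projection like the one in Figure 10 can be generated with scikit-learn. The sketch below assumes the penultimate-layer features and instrument labels have already been extracted and saved; the file names and t-SNE hyperparameters are illustrative, not taken from the paper.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical inputs: saved penultimate-layer features and instrument
# labels for the 10,000 one-second test segments (file names are assumed).
features = np.load("penultimate_features.npy")  # shape (10000, D)
labels = np.load("segment_labels.npy")          # shape (10000,)

# Reduce the D-dimensional features to 2-D for visualization.
projection = TSNE(n_components=2, perplexity=30.0, init="pca",
                  random_state=0).fit_transform(features)

# One scatter group per instrument class.
for cls in np.unique(labels):
    pts = projection[labels == cls]
    plt.scatter(pts[:, 0], pts[:, 1], s=2, label=str(cls))
plt.legend(markerscale=4, fontsize=6)
plt.title("t-SNE of penultimate-layer features")
plt.show()
```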

Figure 11
Normalized confusion matrix (one‑second input, PCID test set).
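
The row-normalized matrix of Figure 11 can be computed directly with scikit-learn; the prediction arrays and file names below are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Hypothetical model outputs on the PCID test set (file names are assumed).
y_true = np.load("test_labels.npy")
y_pred = np.load("test_predictions.npy")

# normalize="true" divides each row by that class's total count,
# so the diagonal entries are per-class recall.
cm = confusion_matrix(y_true, y_pred, normalize="true")
ConfusionMatrixDisplay(cm).plot(values_format=".2f")
plt.show()
```
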
Table 2
Comparison of instrument classification performance across different studies.
| Study | Dataset | # of Classes | Methodology | Accuracy (%) | F1-Score (%) |
|---|---|---|---|---|---|
| Our Study | Extended Dataset (15 instruments) | 15 | Supervised contrastive learning with SSA | 97.48 | 98 |
| Our Study | Subset of Extended Dataset (5 instruments) | 5 | Supervised contrastive learning with SSA | 99.78 | 100 |
| Our Study | Nava Dataset (Modified) | 5 | Supervised contrastive learning with SSA | 99.88 | 100 |
| Agostini et al. (2003) | Orchestral Instruments Dataset | 27 | Spectral features with KNN and neural networks | 70–80 | N/A |
| Essid et al. (2006) | Solo Recordings and Mixtures of Western Instruments | 7 | MFCCs, timbral descriptors with SVM | 65–75 | N/A |
| Han et al. (2016) | Subset of MIREX Dataset (Various Genres and Instruments) | 11 | Deep CNNs for predominant instrument recognition | 75 | 80 |
| Solanki and Pandey (2022) | IRMAS Dataset (6705 recordings) | 11 | Eight-layer deep CNN with mel spectrogram input | 92.61 | N/A |
| Prabavathy et al. (2020) | RWC Database, MusicBrainz.org, IRMAS, NSynth | 16 | SVM and KNN with MFCC and sonogram features | 99.29 | 95.15 |
| Gong et al. (2021) | ChMusic Dataset (Traditional Chinese Instruments) | 11 | MFCCs with KNN and majority voting | 94.15 | N/A |
| Humphrey et al. (2018) | OpenMIC‑2018 Dataset | 20 | Deep learning with CNN and multi‑instance learning | N/A | 78 (AUC‑PR) |
| Reghunath and Rajan (2022) | Polyphonic Music Dataset | 11 | Transformer‑based ensemble method | 85 | 79 |
| Mousavi et al. (2019) | PCMIR Dataset (Persian Classical Music) | 6 | MFCCs, spectral features with neural network | 80 | N/A |
| Baba Ali et al. (2019) | Nava Dataset (Original) | 5 | MFCC and i‑vector with SVM | 84.75 | 84 |
| Baba Ali (2024) | Nava Dataset (Original) | 5 | Self-supervised pre-trained models | 99.64 | 99.64 |
