Table 1
Data distribution of the PCID dataset. Durations are given in hours (h) and minutes (m); each instrument is split roughly 80/10/10 across train, test, and validation.
| Instrument | Train | Test | Validation |
|---|---|---|---|
| Daf | 52 m | 6.5 m | 6.5 m |
| Divan | 59 m | 7 m | 7 m |
| Dutar | 50.5 m | 6 m | 6 m |
| Gheychak | 50 m | 6 m | 6 m |
| Kamancheh | 2 h 14 m | 16.5 m | 16.5 m |
| Ney Anban | 1 h 6 m | 8 m | 8 m |
| Ney | 2 h 15 m | 17 m | 17 m |
| Oud | 2 h 32 m | 19 m | 19 m |
| Qanun | 1 h 1 m | 7.5 m | 7.5 m |
| Rubab | 50 m | 6 m | 6 m |
| Santur | 2 h 11 m | 16 m | 16 m |
| Setar | 3 h 22 m | 25 m | 25 m |
| Tanbour | 1 h 18 m | 9.5 m | 9.5 m |
| Tar | 2 h 7 m | 16 m | 16 m |
| Tonbak | 1 h 9 m | 8.5 m | 8.5 m |

Figure 1
Flowchart of the structure of our proposed model.

Figure 2
Our proposed contrastive (base) model architecture.
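
The base model in Figure 2 is trained with a supervised contrastive objective (cf. the methodology column of Table 2). As background, here is a minimal PyTorch sketch of the standard supervised contrastive (SupCon) loss; the function name, temperature value, and tensor shapes are illustrative assumptions, not details taken from this paper.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.07):
    """Standard SupCon loss over a batch of embeddings (illustrative sketch).

    features: (N, D) projection-head outputs for N audio segments.
    labels:   (N,) integer instrument labels.
    """
    features = F.normalize(features, dim=1)          # work in cosine-similarity space
    sim = features @ features.T / temperature        # (N, N) scaled similarities
    n = sim.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(self_mask, float("-inf"))  # exclude self-pairs
    # Positives are the other segments with the same instrument label.
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Mean log-likelihood of positives per anchor, averaged over anchors.
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(dim=1) / pos_count
    return loss.mean()
```

Under this objective, segments of the same instrument are pulled together in embedding space while segments of different instruments are pushed apart, which is consistent with the per-class clusters shown later in Figure 10.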

Figure 3
Accuracy vs. input length tested on the Nava and PCID datasets (trained on the five-instrument subset of PCID).

Figure 4
Accuracy vs. input length tested on the Nava and PCID datasets (trained on the full PCID dataset).

Figure 5
Accuracy vs. input length tested on the Nava and PCID datasets (trained on the original Nava dataset).

Figure 6
Comparison of test accuracy among the proposed model, Baba Ali et al. (2019), and Baba Ali (2024).

Figure 7
Comparison of accuracy for Dastgah detection across Baba Ali et al. (2019), Baba Ali (2024), and the proposed method.

Figure 8
Architecture of the best-performing classifier for the one-second, 15-class classification task.

Figure 9
Architecture of the best-performing meta-classifier for the 20-second, 15-class classification task.

Figure 10
t-SNE projection of penultimate-layer features for 10,000 one-second test segments from the PCID dataset.
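
A projection like the one in Figure 10 can be generated with scikit-learn. The sketch below assumes the penultimate-layer features and instrument labels have already been extracted and saved; the file names and t-SNE hyperparameters are illustrative, not taken from the paper.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical inputs: saved penultimate-layer features and instrument
# labels for the 10,000 one-second test segments (file names are assumed).
features = np.load("penultimate_features.npy")  # shape (10000, D)
labels = np.load("segment_labels.npy")          # shape (10000,)

# Reduce the D-dimensional features to 2-D for visualization.
projection = TSNE(n_components=2, perplexity=30.0, init="pca",
                  random_state=0).fit_transform(features)

# One scatter group per instrument class.
for cls in np.unique(labels):
    pts = projection[labels == cls]
    plt.scatter(pts[:, 0], pts[:, 1], s=2, label=str(cls))
plt.legend(markerscale=4, fontsize=6)
plt.title("t-SNE of penultimate-layer features")
plt.show()
```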

Figure 11
Normalized confusion matrix (one‑second input, PCID test set).
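
The row-normalized matrix of Figure 11 can be computed directly with scikit-learn; the prediction arrays and file names below are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Hypothetical model outputs on the PCID test set (file names are assumed).
y_true = np.load("test_labels.npy")
y_pred = np.load("test_predictions.npy")

# normalize="true" divides each row by that class's total count,
# so the diagonal entries are per-class recall.
cm = confusion_matrix(y_true, y_pred, normalize="true")
ConfusionMatrixDisplay(cm).plot(values_format=".2f")
plt.show()
```
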
Table 2
Comparison of instrument classification performance across different studies.
| Study | Dataset | # of Classes | Methodology | Accuracy (%) | F1-Score (%) |
|---|---|---|---|---|---|
| Our Study | Extended Dataset (15 instruments) | 15 | Supervised contrastive learning with SSA | 97.48 | 98 |
| Our Study | Subset of Extended Dataset (5 instruments) | 5 | Supervised contrastive learning with SSA | 99.78 | 100 |
| Our Study | Nava Dataset (Modified) | 5 | Supervised contrastive learning with SSA | 99.88 | 100 |
| Agostini et al. (2003) | Orchestral Instruments Dataset | 27 | Spectral features with KNN and neural networks | 70–80 | N/A |
| Essid et al. (2006) | Solo Recordings and Mixtures of Western Instruments | 7 | MFCCs, timbral descriptors with SVM | 65–75 | N/A |
| Han et al. (2016) | Subset of MIREX Dataset (Various Genres and Instruments) | 11 | Deep CNNs for predominant instrument recognition | 75 | 80 |
| Solanki and Pandey (2022) | IRMAS Dataset (6705 recordings) | 11 | Eight-layer deep CNN with mel spectrogram input | 92.61 | N/A |
| Prabavathy et al. (2020) | RWC Database, MusicBrainz.org, IRMAS, NSynth | 16 | SVM and KNN with MFCC and sonogram features | 99.29 | 95.15 |
| Gong et al. (2021) | ChMusic Dataset (Traditional Chinese Instruments) | 11 | MFCCs with KNN and majority voting | 94.15 | N/A |
| Humphrey et al. (2018) | OpenMIC‑2018 Dataset | 20 | Deep learning with CNN and multi‑instance learning | N/A | 78 (AUC‑PR) |
| Reghunath and Rajan (2022) | Polyphonic Music Dataset | 11 | Transformer‑based ensemble method | 85 | 79 |
| Mousavi et al. (2019) | PCMIR Dataset (Persian Classical Music) | 6 | MFCCs, spectral features with neural network | 80 | N/A |
| Baba Ali et al. (2019) | Nava Dataset (Original) | 5 | MFCC and i‑vector with SVM | 84.75 | 84 |
| Baba Ali (2024) | Nava Dataset (Original) | 5 | Self-supervised pre-trained models | 99.64 | 99.64 |
