
Supervised Contrastive Models for Music Information Retrieval in Classical Persian Music

Open Access | Jan 2026

1 Introduction

Musical instrument classification is an important aspect of music information retrieval (MIR), enabling applications such as automatic transcription, music recommendation, and digital archiving (Herrera et al., 2010; Humphrey et al., 2018). The task is especially challenging for traditional Persian music, which features complex melodies, microtonal scales, and rich timbral textures (Abbası Layegh et al., 2014). Nuanced subtleties of shading and embellishment characterize music played on the Ney, Setar, Tar, Santur, and Kamancheh, traditional Persian instruments that are far removed from their Western counterparts. These delicate features often elude conventional classification methods, which are designed for Western musical instruments and are therefore not directly effective in this domain (Baba Ali et al., 2019).

One of the most important barriers to advancing research in this field is the incompleteness and non‑uniformity of datasets related to traditional Persian music. The Nava dataset, introduced by Baba Ali et al. (2019), marks a key contribution by providing a benchmark corpus for both instrument and Dastgah classification. In Persian classical music, a Dastgah is a modal framework—roughly analogous to an Indian rāga or an Arabic maqām—that specifies a characteristic scale, a set of melodic motifs, and rules for ornamentation and modulation (Farhat, 1990). Nava contains recordings of 40 artists performing five common Persian instruments in seven Dastgahs. Although invaluable for baseline evaluations, its limited instrument diversity and occasional label inconsistencies can hinder the development of robust, generalizable classification systems.

In addition to the Nava dataset, the work of Mousavi et al. (2019) further enriched domain resources with the so‑called Persian Classical Music Instrument Recognition (PCMIR) dataset. This data set records six traditional Iranian instruments and expands the data spectrum, but the diversity of the corpus remained insufficient for generalizable training and testing. To address these gaps, we compiled the Persian Classical Instrument Dataset (PCID)—a curated collection of publicly available solo‑instrument recordings representing 15 classical Persian instruments. PCID extends the instrument repertoire and provides a more comprehensive foundation for MIR research in non‑Western contexts. Because the recordings originate from heterogeneous online sources, detailed performer metadata are often unavailable; this limitation is explicitly acknowledged in the dataset documentation.

This calls for methodologies that cope with the scarcity and imperfection of available data while still capturing the fine acoustic details representative of Persian musical instruments. Contrastive learning has emerged as a powerful framework that learns robust, invariant representations by pulling together variants of similar data points and pushing apart dissimilar ones (Chen et al., 2020). In particular, supervised contrastive learning, a training objective that brings examples of the same instrument closer together in the model's embedding space while pushing different instruments apart, relies on label information to improve the discriminability of features, making it well suited to classification tasks where differences between classes are subtle (Khosla et al., 2020). In the music domain, contrastive learning has been shown to capture subtle audio cues, including temporal and spectral characteristics (Spijkervet and Burgoyne, 2021).

In addition to contrastive learning, we present a reproducible baseline framework that combines supervised contrastive learning with stacked slice‑level aggregation (SSA), a late‑fusion method that integrates predictions across one‑second audio segments. The framework is general‑purpose but captures short‑term timbral patterns characteristic of isolated instrument sounds. Shorter segments also make the otherwise simple monophonic task more challenging, promoting model robustness and better feature learning.

Although contrastive learning approaches have advanced, little is known about how they could be used in the classification of traditional Persian musical instruments. This gap offers a chance to use these successful methods to tackle the particular difficulties presented by Persian musical instruments and their modal systems.

To address the difficulties of classifying instruments in classical Persian music, this research proposes a framework that combines SSA techniques with supervised contrastive learning. These are the main elements of our strategy:

  1. Data compilation and preprocessing: We compile an extended dataset of 15 classical Persian instruments, which considerably enhances the diversity and richness of the available data. We apply augmentation techniques to address data imbalance and scarcity. These augmentations increase the size of the PCID and improve the generalization of the models to the real‑world variations present in audio recordings.

  2. Base model training with supervised contrastive learning: We train base models on one‑second audio clips using a supervised contrastive loss. This training strategy leverages label information to learn discriminative features that capture the subtle nuances and ornaments characteristic of Persian instruments. Focusing on short audio segments helps the models learn the fine‑grained temporal features that are critical for accurate classification.

  3. SSA: The trained base models are extended to handle longer audio sequences using an SSA approach. Each recording is divided into one‑second slices, which are independently processed by the base model to generate segment‑level predictions. These predictions are then aggregated through a meta‑classifier that learns to combine the slice‑level outputs into a final track‑level decision. This late‑fusion strategy enables efficient handling of longer inputs without increasing computational complexity and provides a more stable overall prediction over time.

  4. State‑of‑the‑art results: Recognizing the impact of data quality on model performance, we identify and correct mislabeled samples in the Nava dataset. We evaluated our models on both the original and corrected versions of the Nava dataset and on the PCID dataset. This dual evaluation demonstrates the robustness of our approach and highlights the importance of accurate labeling in the training of effective models.

Our experiments show that the proposed method significantly outperforms existing state‑of‑the‑art systems, achieving near‑perfect accuracy and F1‑scores on longer audio inputs. Specifically, our models achieve an accuracy greater than 99% in the five‑instrument classification task and an accuracy greater than 98% in the more challenging 15‑instrument classification task when evaluated on 30‑second audio inputs. Moreover, the approach scales effectively to different input lengths, making it versatile across a broad variety of application scenarios, including real‑time instrument recognition.

The key contributions of this paper are as follows:

  1. Combining supervised contrastive learning with SSA: This work presents a reproducible baseline framework for classifying musical instruments in classical Persian music, combining supervised contrastive learning with an SSA strategy. The approach addresses challenges posed by limited data availability and the complex acoustic characteristics of Persian instruments by capturing fine‑grained timbral information at the slice level and aggregating it efficiently across longer audio inputs.

  2. Building the PCID: We collect a large dataset of 15 classical Persian instruments and then extend this through data augmentation and balancing for future research in this area.

  3. Evaluation and analysis of data quality impact: Extensive experiments on several datasets, including corrected versions of existing ones, are conducted to evaluate our models. These experiments show that data quality greatly impacts model performance.

  4. Insights for future research directions: Our findings indicate how this work can be extended toward multi‑label classification, singer voice detection, and emotion recognition for classical Persian music, broadening the scope of MIR systems for this genre.

The remainder of this paper is organized as follows. Section 2 covers related work in musical instrument classification, Dastgah detection, and classical Persian MIR. Section 3 introduces the PCID, including data sources, preprocessing, metadata, and ethical considerations. Section 4 discusses contrastive learning and ensemble learning for audio processing and presents the proposed model, covering data preprocessing, the base model architecture, training procedures, and the aggregation strategy. Section 5 demonstrates the effectiveness of the proposed model and offers an in‑depth analysis of the impact of input length and data quality on performance. Section 6 generalizes the results, indicates possible applications, and considers the limitations of the present study. A final section outlines future research, examining how the model can be extended to additional tasks such as multi‑label classification, singer voice detection, and emotion classification.

2 Related Works

In this section, we review prior work on MIR and the classification of musical instruments, with particular attention to the challenges associated with classical Persian music.

The classification of musical instruments is one of the most fundamental tasks in MIR, in which all the instruments present in the recording are identified. It has found its application in digital archiving, recommendation engines, automatic music transcription, and music education (Reghunath and Rajan, 2022). To achieve accurate instrument classification, various computational approaches have been developed. The traditional method of classifying musical instruments is based on feature‑extraction algorithms that capture the spectral, temporal, and timbral aspects of the audio signal (Agostini et al., 2003).

In traditional Chinese music, a multi‑modal knowledge graph convolutional network optimized with ensemble techniques was developed to classify genres such as Dance, Metal, Rural, Classical, and Folk, achieving substantial improvements over traditional deep learning approaches through the integration of audio features with textual metadata (Niu, 2024). Work on Chinese instrumental music classification has further applied deep learning frameworks with multi‑functional fusion and attention mechanisms, showing the effectiveness of combining multiple acoustic representations for traditional instrument recognition (Yang et al., 2025).

In the Indian classical music domain, contrastive learning has proven promising for raga classification, with novel class discovery methods leveraging contrastive frameworks to identify previously unseen ragas by combining supervised feature extraction with self‑supervised contrastive training to address the open‑set nature of the task (Singh et al., 2024). Ensemble methods have also been explored for tabla stroke recognition (Madhusudhan and Chowdhary, 2024) and multi‑task raga analysis (Sharma et al., 2014), where traditional statistical measures were combined with modern machine learning approaches to improve accuracy.

Beyond specific cultural contexts, cross‑modal contrastive learning in music representation has demonstrated how multiple information sources can be aligned through contrastive objectives to produce enriched musical representations (Ferraro et al., 2021).

Classifying instruments in classical Persian music is substantially more challenging than doing so in other traditions due to its complex structures and the varying degrees of aural richness (Mürer, 2021). Instruments such as the Ney, Setar, Tar, Santur, and Kamancheh—structurally distinct from their Western counterparts—exhibit a high degree of decoration and ornamentation (Azar et al., 2018; Farhat, 1990).

A study introduced a novel approach to classifying Persian classical instruments using mel‑frequency cepstral coefficients (MFCCs), spectral roll‑off, and spectral centroid, achieving over 80% accuracy with a multi‑layer neural network (Mousavi et al., 2019). The Nava database has also contributed extensive resources for both Dastgah and instrument recognition, using MFCCs and i‑vector features with support vector machine (SVM) classifiers, reaching 98% accuracy in instrument‑recognition tasks (Baba Ali et al., 2019). Together, these works highlight the challenges in classifying traditional Persian instrumental sounds and demonstrate the effectiveness of advanced machine learning algorithms.

Research on spectrogram‑based classification methods has shown that Manhattan distance can yield the best performance in Dastgah classification for Santur recordings (Heydarian and Bainbridge, 2019). Another study employed a multi‑layer perceptron neural network to classify the seven Dastgahs from a dataset containing Ney, violin, and vocal recordings, using characteristic peak spectral features, with recognition accuracies of 65%, 72%, and 56%, respectively (Beigzadeh and Belali Koochesfahani, 2016). These findings emphasize the importance of tailoring feature extraction and model design to the unique acoustic properties of Persian instruments.

Deep neural networks incorporating bidirectional long short‑term memory and gated recurrent unit layers have also been applied to Dastgah recognition, with MFCCs proving to be effective features in both architectures (Ebrat et al., 2022). In the domain of genre classification, another investigation categorized Persian music as Pop, Rap, and Traditional using a dataset of 500 tracks, demonstrating that genre classification can be generalized to a broader scope within Persian music (Farajzadeh et al., 2023).

The use of deep learning systems, such as convolutional neural networks and recurrent neural networks, has enabled automatic learning of hierarchical features from raw audio or spectrograms, thus improving performance in classification tasks (Bhalke et al., 2016). However, progress in this area is hindered by a major obstacle—the limited availability of extensive and diverse datasets for non‑Western music traditions, including Persian music, which restricts the affordable training of such models (Baba Ali et al., 2019).

The application of self‑supervised learning models to various tasks in traditional Persian music, including instrument classification, Dastgah recognition, and artist identification, has recently been demonstrated using the Nava dataset (Baba Ali, 2024). In these authors’ work, the performance of three pre‑trained models—Music2vec, MusicHuBERT, and MERT—was evaluated, with MERT achieving state‑of‑the‑art accuracy scores of 99.64%, 24.70%, and 79.25% for the respective tasks. The study also examined the effects of the duration of the music segment, the representation layers, and the fusion of models, concluding that fine‑tuning and the fusion of multiple models can significantly enhance accuracy. This line of research addresses an underexplored domain and establishes a robust framework for Persian MIR, paving the way for future studies.

To address these issues, contrastive learning has been employed to develop richer feature representations (Spijkervet and Burgoyne, 2021), while ensemble learning techniques have been used to combine multiple models for greater accuracy and reliability (Reghunath and Rajan, 2022). In contrast, conventional approaches often rely on manually engineered features that capture frequency and timbre characteristics, which are then classified using algorithms such as SVMs or k‑nearest neighbors (KNNs) (Gourisaria et al., 2024).

Recent work confirms that contrastive objectives yield strong and transferable audio features. Instance discrimination schemes, such as MoCo, have been shown to improve the classification of music and environmental sounds (Lin, 2022). The COLA model, a self‑supervised approach trained on AudioSet, surpasses previous self‑supervised learning (SSL) baselines on nine audio tasks (Saeed et al., 2021). This line of work has been extended with SemiSupCon, which combines limited labels with contrastive loss to incorporate musical priors and further boost downstream accuracy (Guinot et al., 2024).

CLMR (Contrastive Learning of Musical Representations) operates directly on raw waveforms (Spijkervet and Burgoyne, 2021). On a music‑tagging benchmark, CLMR achieved performance comparable to that of fully supervised systems. Design choices in the framework appear important for robust performance across diverse applications. Results on polytimbral music are encouraging for instrument classification, where subtle representational differences matter.

Contrastive SSL reduces the need for labeled data and is gaining rapid traction, with strong results in natural language processing (NLP) and computer vision (Kumar et al., 2022). The approach also transfers well to music‑tagging tasks such as instrument recognition. A related line of work, SimCLR, improves the efficiency of contrastive learning through extensive data augmentation and large batch sizes; techniques such as random cropping and color distortion strengthen the learned representations, yielding results competitive with supervised baselines (Chen et al., 2020). These observations underscore the importance of augmentation strategy and loss design for overall performance in contrastive learning.

Taken together, the literature shows successful use of advanced machine‑learning methods for information retrieval in Persian classical music. Even so, the intricacy of Persian musical patterns and the limited availability of high‑quality datasets call for continued advances in model design and dataset curation to improve both efficiency and reliability.

3 The PCID

The PCID is a curated collection of solo‑instrument recordings representing 15 classical Persian instruments. PCID was created to address the limitations of existing resources, which typically include a small number of instruments, contain labeling inconsistencies, or lack documentation for reproducibility. This section describes the data sources, selection criteria, instrumentation, preprocessing, metadata structure, and ethical considerations.

3.1 Source material and data collection

All audio recordings in PCID were collected from publicly accessible online platforms that host educational demonstrations, solo performances, and instrument‑specific tutorials. These recordings are publicly available for non‑commercial and research purposes.

Recordings were selected based on the following criteria:

  • Solo‑instrument performance: Only monophonic, isolated recordings were included.

  • Sufficient audio quality: Clear timbral characteristics with minimal background noise or accompaniment.

  • Representative timbre: Inclusion of common performance styles and techniques (e.g., plucked, bowed, or blown articulations).

Because public recordings vary in production quality, PCID includes material from various acoustic environments, performers, and recording conditions. This variability supports the robustness of the model and provides a solid foundation for reproducible baseline evaluation.

3.2 Instrument classes and selection rationale

PCID includes music created with the following 15 instruments: Ney, Setar, Tar, Santur, Kamancheh, Tonbak, Daf, Oud/Barbat, Qanun, Gheychak, Divan, Robab, Tanbur, Dotar, and Ney Anban.

Instrument inclusion was guided by four principles: (1) documented historical usage in Persian classical or regional traditions, (2) representation across all three organological families (strings, winds, percussion), (3) continued use in contemporary Persian performance, and (4) sufficient availability of high‑quality recordings for balanced training.

Several instruments broaden the cultural and acoustic range of the dataset. Ney Anban, a droneless bagpipe from the Persian Gulf coast, contributes a sustained, continuous timbre distinct from that of the end‑blown Ney. Gheychak introduces a bowed timbre distinct from that of Kamancheh, while the Daf serves as the principal frame drum of Sufi rhythmic performance. Dotar and Tanbur, two‑string and long‑neck lutes from northeastern and western Iran, respectively, expand the modal and timbral diversity of plucked strings. Together, these instruments provide both cultural breadth and acoustically separable classes, strengthening the dataset's scholarly and computational value.

3.3 Dataset composition

PCID combines new web‑sourced recordings with material adapted from the existing PCMIR dataset, which originally contained six instruments (Kamancheh, Ney, Santur, Setar, Tar, and Oud). The combined collection expands coverage to 15 instruments and several hundred recordings, producing thousands of one‑second segments after preprocessing. Table 1 presents the distribution of 15 instruments in the PCID. For evaluation purposes, three main subsets were used:

  • Five‑instrument subset: contains recordings of Kamancheh, Ney, Santur, Setar, and Tar, corresponding to the Nava dataset’s instrument set, used for comparative experiments.

  • Full PCID dataset: the complete 15‑instrument corpus, used to train and evaluate models on a wider acoustic range.

  • Nava dataset: used as an external benchmark for cross‑dataset generalization and comparison with existing state‑of‑the‑art methods.

Table 1

Data distribution of the PCID dataset.

Instrument | Train | Test | Val
Daf | 52 m | 6.5 m | 6.5 m
Divan | 59 m | 7 m | 7 m
Dutar | 50.5 m | 6 m | 6 m
Gheychak | 50 m | 6 m | 6 m
Kamancheh | 2 h 14 m | 16.5 m | 16.5 m
Ney Anban | 1 h 6 m | 8 m | 8 m
Ney | 2 h 15 m | 17 m | 17 m
Oud | 2 h 32 m | 19 m | 19 m
Qanun | 1 h 1 m | 7.5 m | 7.5 m
Rubab | 50 m | 6 m | 6 m
Santur | 2 h 11 m | 16 m | 16 m
Setar | 3 h 22 m | 25 m | 25 m
Tanbour | 1 h 18 m | 9.5 m | 9.5 m
Tar | 2 h 7 m | 16 m | 16 m
Tonbak | 1 h 9 m | 8.5 m | 8.5 m

During preparation, several mislabeled samples were identified in the validation and test sets of the Nava dataset. These were corrected to produce a cleaned version of Nava, which was used alongside the original for fair comparison. This correction notably contributed to improved accuracy, highlighting the importance of data integrity in MIR evaluation.

3.4 Audio format and preprocessing

All recordings were standardized to ensure consistency:

  • Format: MP3

  • Sampling rate: 44.1 kHz

  • Channels: mono

  • Bit depth: 16‑bit

  • Length filtering: short or noisy recordings removed

Longer recordings were segmented into fixed‑length one‑second excerpts for supervised contrastive learning and evaluation.
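As a concrete illustration of this step, the snippet below shows one way to decode a recording and cut it into one‑second excerpts. It is a minimal sketch, assuming librosa is used for decoding; the file path is hypothetical, and only the 44.1 kHz mono, one‑second settings come from the text above.

```python
# Minimal sketch of the standardization and slicing step; assumes librosa.
import librosa

SR = 44100          # target sampling rate (44.1 kHz, mono)
SLICE_SECONDS = 1   # fixed one-second slice length used throughout

def slice_track(path, sr=SR, slice_seconds=SLICE_SECONDS):
    """Load one recording as mono audio and cut it into one-second excerpts."""
    audio, _ = librosa.load(path, sr=sr, mono=True)
    samples_per_slice = sr * slice_seconds
    n_slices = len(audio) // samples_per_slice   # drop the trailing remainder
    return [
        audio[i * samples_per_slice:(i + 1) * samples_per_slice]
        for i in range(n_slices)
    ]

# Example (hypothetical path): slices = slice_track("pcid/setar/track_001.mp3")
```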

3.5 Metadata and limitations

Because the recordings were sourced from public online platforms, detailed metadata such as performer name, recording date, or recording conditions could not be consistently obtained. Each file includes an instrument label, audio duration, and standardized file format. These fields represent the most reliable information available from public sources. Metadata limitations are discussed in Section 6.

3.6 Ethical and licensing considerations

The dataset consists entirely of publicly available, non‑commercial recordings accessible for research and education. Distribution of PCID includes organized audio files and a statement restricting the dataset to research‑only use. This follows established MIR practice for datasets derived from online media and complies with the platforms’ terms of use. No private, restricted, or commercial materials were included.

3.7 Dataset-splitting strategy

To avoid data leakage, PCID was split at the track level so that no recording appears in more than one partition. The data are divided as follows:

  • 80% training,

  • 10% validation, and

  • 10% testing.

Segmentation into one‑second slices was performed after the track‑level split to maintain separation between sets.
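A minimal sketch of such a track‑level split is shown below. It assumes tracks are identified by an ID list; the 80/10/10 ratios mirror the values above, while the function name and seed are illustrative. In practice the split would also be stratified per instrument class, as described in Section 4.2.

```python
# Sketch of a track-level 80/10/10 split that keeps every slice of a
# recording inside a single partition (grouping by track ID is assumed).
import random

def split_tracks(track_ids, train=0.8, val=0.1, seed=42):
    """Shuffle whole tracks, then assign them to train/val/test partitions."""
    rng = random.Random(seed)
    tracks = list(track_ids)
    rng.shuffle(tracks)
    n = len(tracks)
    n_train, n_val = int(n * train), int(n * val)
    return {
        "train": tracks[:n_train],
        "val": tracks[n_train:n_train + n_val],
        "test": tracks[n_train + n_val:],
    }

# One-second slices inherit the partition of their parent track, so no
# recording contributes segments to more than one split.
```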

The dataset substantially broadens the scope of available resources for Persian MIR, supporting reproducible benchmarking and generalizable model evaluation in non‑Western music research.

4 Methodology

In this section, we primarily present the overall structure of our model and the essentials of supervised contrastive learning and SSA, which form the foundation of our proposed approach.

4.1 Overall structure of the model

A broad overview of the proposed model is presented in Figure 1. The procedure begins with contrastive learning on one‑second audio clips to initialize the base models; slice‑level aggregation is then applied to handle longer inputs.

Figure 1

Flowchart of the proposed model structure.

The training pipeline begins by cutting the data into one‑second segments, after which the models are trained using supervised contrastive learning. Each base model is built to discriminate between the given instruments. Longer audio inputs, such as 5‑ or 10‑second excerpts, are subsequently cut into one‑second pieces; we then apply our slice‑level aggregation method to these segments and combine the outputs. This technique makes it easy to process different input sizes while maintaining efficiency and effectiveness. Additional information on the SSA method is presented in Section 4.5. The entire training and evaluation process is repeated for different input lengths to assess the scalability of the models. The key steps of our methodology are as follows:

  1. Data preparation: Preprocess audio datasets into one‑second segments.

  2. Contrastive model training: Train base models using supervised contrastive learning on one‑second audio inputs.

  3. SSA: a late‑fusion mechanism that aggregates slice‑level predictions to obtain track‑level classifications.

  4. Evaluation: Evaluate the model’s performance on existing benchmarks.

We trained all base encoders on one‑second excerpts for three pragmatic reasons: (i) The entire pipeline was developed on a single 6‑GB consumer‑grade GPU; with a one‑second window, each mini‑batch fits comfortably in memory yet still contains dozens of positive/negative pairs, allowing supervised contrastive learning to converge in a few hours. (ii) One of our intended use cases is real‑time instrument recognition on low‑resource hardware (e.g., classroom laptops or stage‑side Raspberry Pi devices). Processing a one‑second slice, including mel extraction and network inference, keeps end‑to‑end latency below the threshold required for interactive feedback. (iii) Short segments multiply the number of training examples that can be harvested from each raw track, enriching class balance and stylistic variation without extra recording effort; this additional diversity demonstrably reduces overfitting. To verify that the short window does not artificially inflate performance, we froze the one‑ and five‑second encoders after contrastive pre‑training and trained a shallow classifier on top of each representation for an independent downstream task, Dastgah recognition.

4.2 Preprocessing

We preprocessed the raw audio recordings by slicing them into smaller segments. The preprocessing steps that have been performed are as follows:

  1. Audio slicing: Slice the original audio into uniform segments, each lasting 1 second. In this way, the model could take inputs of variable sizes ranging from 1 to 30 seconds. Save the sliced segments in the corresponding directories so that they remain organized by class.

  2. Audio augmentation: Various audio‑augmentation techniques were applied to address class imbalance and to improve the diversity of the training data (a minimal code sketch follows this list). These techniques included:

    • Pitch shifting: changing the pitch of the audio slightly to simulate different playing styles.

    • Adding noise: introducing background noise to make the model more robust to real‑world variations.

    • Time shifting: altering the timing of the audio signals to create slight variations.

    • Volume adjustment: modifying the volume to simulate different recording conditions.

      This ensures that all other classes are brought up to the size of the largest class. These transformations introduce variation while preserving the defining characteristics of the original audio, allowing the model to generalize better.

  3. Dataset balancing: Balancing was crucial to ensure that no single class dominated the training process. Equalizing the number of examples in each class after augmentation prevented model bias toward the more frequent classes. A balanced dataset enables the model to learn more fairly, improving its generalization across classes.

  4. Data split: The preprocessed audio tracks were divided into training sets (80%), validation sets (10%), and test sets (10%) at the track level. This approach ensures that no single music track appears in more than one split, eliminating potential evaluation bias caused by data leakage. To maintain balance, tracks were distributed so as to preserve equal representation of each instrument class across all splits.
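The sketch below illustrates the augmentation operations listed above (pitch shifting, additive noise, time shifting, volume adjustment), assuming librosa and NumPy. The noise model is simplified (white rather than pink noise, with a fixed level instead of a target SNR), and the exact ranges are illustrative rather than the published settings of Section 4.6.

```python
# Illustrative augmentation of a one-second waveform slice.
import numpy as np
import librosa

def augment(slice_audio, sr=44100, rng=np.random.default_rng()):
    y = slice_audio.copy()
    # Pitch shifting: small shift in semitones to simulate playing-style variation.
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=rng.uniform(-2, 2))
    # Time shifting: roll the waveform by a small random offset.
    shift = int(rng.uniform(-0.1, 0.1) * sr)
    y = np.roll(y, shift)
    # Additive noise: low-level noise for robustness to recording conditions.
    y = y + rng.normal(0.0, 0.005, size=y.shape)
    # Volume adjustment: random gain to mimic different recording levels.
    return y * rng.uniform(0.7, 1.3)
```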

4.3 Training phase

The training process of our contrastive learning model involved two key stages: training the base contrastive models and training a meta‑classifier for longer sequences. As an initial step, base contrastive models were trained on one‑second audio segments using five‑fold stratified cross‑validation, ensuring balanced class representation in both the training and validation sets. Each segment passes through an encoder network composed of several convolutional layers that extract relevant characteristics. A projection head was added on top of the encoder, and a supervised contrastive loss was used during training so that similar samples are embedded close together while dissimilar ones are separated.

For longer audio sequences, as well as one‑second data, we used an ensemble of one‑second models. The longer inputs were divided into segments of one second each, which were then fed into the set of models. The outputs obtained from these models formed meta‑features that served as inputs to a meta‑classifier. This meta‑classifier was trained to make final predictions based on the aggregated features. It was itself a deep neural network with many fully connected layers, optimized using categorical cross‑entropy as the loss function.

4.4 Base supervised contrastive models

All base models are trained using supervised contrastive learning with one‑second audio inputs. The core idea of contrastive learning is to maximize agreement between different augmentations of the same sample while minimizing agreement between distinct samples. In our case, we apply a supervised contrastive loss that draws positive pairs of audio segments from the same instrument closer together in the feature space, while pushing negative pairs of segments from different instruments farther apart.

Model architecture

Our contrastive model comprises two main components, as shown in Figure 2: an encoder and a projection head.

tismir-9-1-271-g2.png
Figure 2

Our proposed contrastive (base) model architecture.

  1. Encoder Design:

    • Conv2D layers: multiple convolutional layers with varying filter sizes to capture diverse sound patterns from Mel spectrograms.

    • MaxPooling2D layers: Down‑sample the feature maps for computational efficiency.

    • Batch normalization: stabilizes and accelerates training after each convolutional layer.

    • Dropout layers: reduces overfitting by randomly disabling a subset of neurons during training.

    • Fully connected layer: produces a compact representation of the extracted features.

  2. Projection head: A projection head is added on top of the encoder to produce a low‑dimensional embedding of the input features, which is essential for contrastive learning. It consists of dense layers that map the encoder output into a space suitable for computing the contrastive loss. The encoder and projection head are trained with the Adam optimizer at a constant learning rate, and early stopping is employed to curb overfitting during training.

Finally, the classifier is built on top of the trained encoder. It consists of multiple dense layers with dropout for regularization, followed by a softmax output layer that predicts class probabilities. The whole model is trained with categorical cross‑entropy loss using the Adam optimizer. During training, hyperparameters such as the learning rate, dropout rate, and the number of hidden units in the dense layers are carefully tuned to obtain the best results.
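The Keras‑style sketch below separates the encoder, projection head, and classifier as described above. Layer counts, filter sizes, the assumed mel‑spectrogram input shape, and the embedding dimensions are illustrative assumptions, not the exact published architecture.

```python
# Illustrative encoder / projection head / classifier split in Keras.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_encoder(input_shape=(128, 87, 1)):   # mel bins x frames for a 1-s clip (assumed)
    x_in = layers.Input(shape=input_shape)
    x = x_in
    for filters in (32, 64, 128):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling2D(2)(x)
        x = layers.Dropout(0.3)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="relu")(x)     # compact representation
    return models.Model(x_in, x, name="encoder")

def add_projection_head(encoder, proj_dim=64):
    z = layers.Dense(128, activation="relu")(encoder.output)
    z = layers.Dense(proj_dim)(z)                   # embedding used by the contrastive loss
    return models.Model(encoder.input, z, name="contrastive_model")

def add_classifier(encoder, n_classes=15):
    h = layers.Dense(128, activation="relu")(encoder.output)
    h = layers.Dropout(0.5)(h)
    out = layers.Dense(n_classes, activation="softmax")(h)
    return models.Model(encoder.input, out, name="classifier")
```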

Supervised contrastive loss

We employ a supervised contrastive loss to learn discriminative embeddings. Given an anchor sample $i$, its set of positives $P(i)$ (samples of the same instrument), and the set of all other samples in the batch $A(i)$, the loss is defined as:

$$\mathcal{L}_{\text{sup}} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)} \tag{1}$$

where

  • $z_i$ and $z_p$ are the normalized features of the anchor and positive samples,

  • $I$ is the set of all samples in a batch,

  • $P(i)$ is the set of positives for the anchor $i$,

  • $A(i)$ is the set of all samples excluding the anchor, and

  • $\tau$ is a temperature hyperparameter controlling the concentration of the distribution.
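For concreteness, the following is a minimal sketch of this loss in TensorFlow (the framework used in our experiments). The function name, tensor shapes, and the small numerical‑stability constant are illustrative assumptions rather than the exact published implementation.

```python
# Supervised contrastive loss of Equation 1, sketched with TensorFlow ops.
import tensorflow as tf

def supervised_contrastive_loss(embeddings, labels, temperature=0.10):
    """embeddings: (batch, dim) projections; labels: (batch,) integer class IDs."""
    z = tf.math.l2_normalize(embeddings, axis=1)
    sim = tf.matmul(z, z, transpose_b=True) / temperature     # pairwise similarities
    batch = tf.shape(z)[0]

    # Exclude the anchor itself from A(i).
    logits_mask = 1.0 - tf.eye(batch)
    # Positive mask P(i): same-label pairs, excluding the anchor itself.
    labels = tf.reshape(labels, (-1, 1))
    pos_mask = tf.cast(tf.equal(labels, tf.transpose(labels)), tf.float32) * logits_mask

    # log( exp(sim_ip) / sum_{a != i} exp(sim_ia) ) for every pair (i, p).
    exp_sim = tf.exp(sim) * logits_mask
    log_prob = sim - tf.math.log(tf.reduce_sum(exp_sim, axis=1, keepdims=True) + 1e-12)

    # Average over positives, then negate and average over anchors.
    pos_count = tf.maximum(tf.reduce_sum(pos_mask, axis=1), 1.0)
    mean_log_prob_pos = tf.reduce_sum(pos_mask * log_prob, axis=1) / pos_count
    return -tf.reduce_mean(mean_log_prob_pos)
```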

Training procedure

Training progresses in three stages:

  1. Contrastive pre‑training: The base models are trained for several epochs on mini‑batches of the training data, guided by the supervised contrastive loss (Equation 1), which places embeddings of samples from the same class close together while keeping them well separated in the feature space from samples of other classes.

  2. Classifier integration: After the encoder is trained with the contrastive loss, a classifier is mounted on top for final instrument classification. It consists of several fully connected dense layers with dropout regularization to avoid overfitting, followed by a softmax output layer for multiclass classification.

  3. End‑to‑end fine‑tuning: The whole model (encoder and classifier) is fine‑tuned jointly using categorical cross‑entropy loss. This stage employs the Adam optimizer with a learning rate of 0.001, allowing efficient convergence.

The outputs of these one‑second base models constitute the building blocks of a meta‑classifier that leverages the strengths of all base models, yielding robust and accurate instrument classification.
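To make the three stages concrete, the sketch below strings together the illustrative functions from the two previous snippets (build_encoder, add_projection_head, add_classifier, supervised_contrastive_loss). Dataset objects, epoch counts, and the decision to freeze the encoder during stage 2 are assumptions, not the authors' exact script.

```python
import tensorflow as tf

# Reuses build_encoder, add_projection_head, add_classifier, and
# supervised_contrastive_loss from the sketches above.
encoder = build_encoder()
contrastive_model = add_projection_head(encoder)
optimizer = tf.keras.optimizers.Adam(1e-3)

# Stage 1: contrastive pre-training with the loss of Equation 1.
@tf.function
def contrastive_step(spectrograms, labels):
    with tf.GradientTape() as tape:
        z = contrastive_model(spectrograms, training=True)
        loss = supervised_contrastive_loss(z, labels, temperature=0.10)
    grads = tape.gradient(loss, contrastive_model.trainable_variables)
    optimizer.apply_gradients(zip(grads, contrastive_model.trainable_variables))
    return loss

# Stage 2: mount a softmax classifier on the pre-trained encoder
# (freezing the encoder at this stage is an assumption).
encoder.trainable = False
classifier = add_classifier(encoder, n_classes=15)
classifier.compile(optimizer="adam",
                   loss="categorical_crossentropy", metrics=["accuracy"])
# classifier.fit(train_ds, validation_data=val_ds, epochs=20)

# Stage 3: end-to-end fine-tuning of encoder and classifier together.
encoder.trainable = True
classifier.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                   loss="categorical_crossentropy", metrics=["accuracy"])
# classifier.fit(train_ds, validation_data=val_ds, epochs=100,
#                callbacks=[tf.keras.callbacks.EarlyStopping(patience=10)])
```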

4.5 Stacked SSA

In addition to the base supervised contrastive models, this study employs an SSA strategy to improve classification robustness across different input lengths. For all recordings, the audio is divided into one‑second segments, each independently processed by a set of pre‑trained classifiers. This design ensures that, even for short clips, multiple temporal slices contribute to the final decision.

Predictions from each slice are then combined in a stacking‑based fashion, where the slice‑level probabilities serve as meta‑features for a secondary meta‑classifier. This meta‑classifier learns to optimally fuse the predictions from all slices, producing a single track‑level output. The SSA framework thus provides a more sophisticated aggregation mechanism than simple averaging or majority voting, enabling the model to leverage complementary strengths of the base classifiers.

Unlike traditional ensemble methods that average encoder outputs or combine model logits, SSA explicitly learns the dependencies between slice‑level predictions. By training a lightweight meta‑classifier over these outputs, the framework captures temporal relationships and decision patterns that would otherwise be lost in standard aggregation.

Compared with conventional ensemble approaches such as bagging and boosting—which reduce variance or bias through probabilistic averaging—SSA directly models complex dependencies among classifier outputs. This results in more fine‑grained and stable predictions, particularly for variable‑length audio inputs.

Because SSA operates on short one‑second slices, it scales efficiently to longer audio without additional training cost or memory overhead. It achieves high accuracy while maintaining computational efficiency, demonstrating robustness to both short and long recordings. This property also makes the approach suitable for real‑time inference scenarios where processing latency must remain low.
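The sketch below shows one way the SSA late fusion can be organized: slice a longer clip into one‑second segments, collect class probabilities from each base model per slice, and train a small meta‑classifier on the concatenated slice‑level outputs. Function names, the `to_mel` helper, shapes, and layer sizes are illustrative assumptions.

```python
# Illustrative SSA late fusion over one-second slices.
import numpy as np
from tensorflow.keras import layers, models

def slice_level_features(base_models, clip, to_mel, sr=44100):
    """Return meta-features of shape (n_slices, n_models * n_classes)."""
    step = sr  # one-second hop at 44.1 kHz
    slices = [clip[i:i + step] for i in range(0, len(clip) - step + 1, step)]
    feats = []
    for s in slices:
        x = to_mel(s)[np.newaxis, ...]                       # (1, mel_bins, frames, 1)
        probs = [m.predict(x, verbose=0)[0] for m in base_models]
        feats.append(np.concatenate(probs))                  # slice-level prediction vector
    return np.stack(feats)

def build_meta_classifier(n_slices, slice_dim, n_classes=15):
    """Fully connected meta-classifier over the flattened slice-level outputs."""
    x_in = layers.Input(shape=(n_slices * slice_dim,))
    h = layers.Dense(256, activation="relu")(x_in)
    h = layers.Dropout(0.5)(h)
    out = layers.Dense(n_classes, activation="softmax")(h)
    model = models.Model(x_in, out)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Track-level input to the meta-classifier:
# slice_level_features(base_models, clip, to_mel).reshape(1, -1)
```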

4.6 Implementation details and reproducibility

Experiments ran on a single NVIDIA RTX 3060 (6 GB). A five‑fold stratified cross‑validation loop trained each supervised‑contrastive encoder in 1 h 30 min and its slice‑level meta‑classifier in 30 min, for a total wall‑time of 10 h. Peak GPU memory stayed below 5 GB.

Hyperparameters were fixed after validation, as follows: Adam optimizer with a learning rate of 0.001 and exponential decay (factor 0.95 every 10 epochs); supervised‑contrastive temperature τ = 0.10; dropout, 0.30 (encoder) / 0.50 (meta); and batch sizes, 16 (contrastive) and 64 (classifier). Training proceeded for a maximum of 100 epochs with early stopping (patience = 10). Data augmentation covered pitch shifts of up to 2 semitones, time shifts of up to 0.1 seconds, pink noise at a signal‑to‑noise ratio of 15–25 dB, and volume changes of up to 3 dB. Results are averaged over five seeds; all random number generators (RNGs) are synchronized, and CuDNN is forced to deterministic mode.
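The snippet below sketches the kind of seeding and determinism configuration implied here (synchronized RNGs, deterministic CuDNN). The exact calls depend on the TensorFlow version and are shown as one plausible setup, not the authors' configuration.

```python
# One plausible reproducibility setup for TensorFlow >= 2.9.
import os
import random
import numpy as np
import tensorflow as tf

def set_deterministic(seed=0):
    os.environ["TF_DETERMINISTIC_OPS"] = "1"     # request deterministic CuDNN kernels
    random.seed(seed)                            # Python RNG
    np.random.seed(seed)                         # NumPy RNG
    tf.random.set_seed(seed)                     # TensorFlow RNG
    tf.keras.utils.set_random_seed(seed)         # synchronizes all three (TF >= 2.7)
    tf.config.experimental.enable_op_determinism()   # TF >= 2.9
```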

The proposed methodology leverages supervised contrastive learning on short (one‑second) audio segments to efficiently extract robust, discriminative features from Persian musical instrument recordings. Using a lightweight encoder–projection head structure enables fast, memory‑efficient training even on modest hardware and supports real‑time, low‑latency inference for practical applications. The SSA strategy then aggregates predictions across segments, employing a meta‑classifier to handle inputs of arbitrary length with improved accuracy and scalability. All choices—from segment size and model architecture to preprocessing and augmentation—are made according to the specific resource constraints, data diversity requirements, and performance goals of the study.

5 Experiments and Results

The following sections describe the experimental protocols, as well as the datasets and outcomes of our experiments. First, we introduce the datasets, followed by a detailed account of the data‑processing methods and model training procedures. Finally, we present and analyze the results of our supervised contrastive SSA models, evaluated on various datasets and input lengths.

5.1 Datasets

Our work uses three primary datasets for model training and evaluation.

  • Nava dataset: This dataset includes five classical Persian instruments and is used to benchmark the latest models.

  • Subset of the PCID dataset: A subset of the PCID dataset, restricted to the same five instruments as the Nava dataset, for direct comparison.

  • The PCID dataset: Full dataset with our 15 unique classical Persian instruments.

The audio signals in every dataset were preprocessed by converting them into mel spectrograms. An additional modification was the repair of label inconsistencies in the Nava dataset, as described in Section 5.2.

The PCID dataset had a class imbalance, which required the use of data augmentation and balancing. To ensure a balanced representation of all musical instruments, we set the target count for each class to that of the largest.

Following this augmentation, all classes contained an equal number of examples. This not only reduced the likelihood of model bias toward the more frequently represented classes but also improved generalization by exposing the models to varied versions of the audio data.

5.2 Evaluation and metrics

The performance of our model was evaluated using a wide range of metrics, including accuracy, F1‑score, precision, and recall. For comparison, we evaluated on both the PCID and Nava datasets.

First, we used both the original and a modified version of the Nava dataset to account for possible label inconsistencies. In the modified version, mislabeled samples were corrected, and our results were compared with previous state‑of‑the‑art results on both versions. This dual evaluation increases the transparency of the results and allows a fairer assessment of the model.

The experiments were carried out using the Keras framework with a TensorFlow back‑end, exploiting GPU acceleration for fast computation. With a well‑structured encoder and a strong classifier trained via contrastive learning, the model achieved high accuracy in the instrument‑classification tasks.

The following are included in the evaluation workflow:

  • Validation phase: The trained models are first evaluated on the validation set. This phase is critical for optimizing hyperparameters and selecting the most effective model configuration.

  • Testing phase: The selected model is then evaluated on the test set. The results are discussed in the Results section (Section 5.3).

  • Dataset variants evaluation: By evaluating the model on the original and modified Nava datasets, we can assess the impact of label corrections on performance metrics. This dual evaluation helps in understanding the model’s ability to generalize, especially after addressing label inconsistencies.

5.3 Results

The results of our models trained on three different datasets are presented in this section: (1) a subset of the PCID dataset with five instruments, (2) the entire PCID dataset with 15 instruments, and (3) the Nava dataset. Each model was tested with input lengths ranging from 1 to 30 seconds. In addition to benchmarking them against state‑of‑the‑art approaches, we evaluated the models on both the original and modified Nava test sets.

During our experiments, we observed numerous mislabeled entries in the testing and validation sets of the Nava dataset. These errors could skew the evaluation of model performance. To obtain accurate and reliable results, we carefully reviewed these sets and corrected the mislabeling, thereby establishing modified versions of the Nava test and validation sets.

Thereafter, we evaluated our models on both the original and amended datasets. This dual evaluation illustrates the consistency of model performance across the different versions of the datasets. The subsequent sections further analyze how the corrections affected model performance.

Before discussing our results, we describe the measures employed to evaluate our models: accuracy, recall, precision, and F1‑score. The formulations of these evaluation metrics are shown in Equations 2–5, where TP, TN, FP, and FN indicate true positives (correct instrument classifications), true negatives (correct non‑instrument classifications), false positives (incorrect instrument predictions), and false negatives (missed instrument predictions), respectively.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{4}$$

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{5}$$
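As a quick sanity check, the same metrics can be computed with scikit‑learn; the label arrays below are placeholder examples, not data from our experiments.

```python
# Equations 2-5 via scikit-learn (macro-averaged for the multiclass case).
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 2, 2, 1]   # placeholder ground-truth class indices
y_pred = [0, 1, 2, 1, 1]   # placeholder predicted class indices

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, average="macro"))
print("recall   :", recall_score(y_true, y_pred, average="macro"))
print("f1       :", f1_score(y_true, y_pred, average="macro"))
```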

Our research used several supervised contrastive models for categorizing musical instruments on both the five‑class and 15‑class datasets. The SSA was tested on input lengths ranging from 1 to 30 seconds, enabling us to track both short‑ and long‑range temporal characteristics. To improve our model assessments, we also corrected the Nava test and development sets for label misclassification. Accuracy and F1‑score analyses across varied input sizes reveal how well the contrastive models distinguish the different instruments.

Figures 3, 4, and 5 offer a complete analysis of how model performance varies with input length, reporting accuracy and F1‑scores on the different datasets. The accuracy trends for Nava and PCID reflect both the 5‑ and 15‑instrument classification problems. As the input length increases from 1 to 30 seconds, the models show a significant increase in accuracy, reaching near‑perfect values at longer durations for both datasets. This indicates that longer temporal sequences give the models more information with which to distinguish instruments. The modified versions of the Nava datasets consistently yield better results than the original ones, confirming that the label corrections were valid.

Figure 3

Accuracy vs. input length tested on the Nava and PCID datasets (trained on the PCID 5 Instruments subset).

Figure 4

Accuracy vs. input length tested on the Nava and PCID datasets (trained on PCID).

Figure 5

Accuracy vs. input length tested on the Nava and PCID datasets (trained on the original Nava dataset).

Figure 6 presents a side‑by‑side comparison of test accuracy between the proposed model and those presented by Baba Ali et al. (2019) and Baba Ali (2024). Across all input lengths examined here, our models obtained higher test accuracies and better F1‑scores, especially for longer inputs. The higher scores on both the 5‑ and 15‑instrument tasks indicate that the models developed in this study using SSA and contrastive learning generalize better than state‑of‑the‑art models. The improvement is especially striking for the 15‑instrument task, which shows that our method is scalable and robust. In addition, as illustrated in Figure 7, the proposed method achieves higher Dastgah detection accuracy than both Baba Ali et al. (2019) and Baba Ali (2024).

Figure 6

Comparison of test accuracy between the proposed model, Baba Ali et al. (2019), and Baba Ali (2024).

Figure 7

Comparison of accuracy for Dastgah detection across Baba Ali et al. (2019), Baba Ali (2024), and the proposed method.

Figures 8 and 9 illustrate the architectures of the best‑performing models used in the 1‑ and 20‑second classification tasks. Figure 8 shows the classifier architecture used to process encoded features, highlighting the efficient mapping of these features to instrument classes. Figure 9 shows the meta‑classifier architecture for the 20‑second classification task, which combines predictions from the one‑second base models. The improvements in accuracy and F1‑score reported earlier stem from the diverse temporal representations contributed by the smaller models in the SSA approach. By combining the outputs of shorter models, this architecture can process long inputs without retraining on long sequences.

Figure 8

Architecture of the best model for the classifier of the one‑second, 15‑class classification task.

Figure 9

Architecture of the best model for the meta‑classifier of the 20‑second, 15‑class classification task.

5.3.1 Feature space visualization and class separability

To illuminate how effectively the proposed model learns discriminative features for Persian instrument classification, we visualize the penultimate‑layer embeddings using t‑distributed stochastic neighbor embedding (t‑SNE) and present the normalized confusion matrix of final predictions on the test set.

t‑SNE embedding of penultimate‑layer features

Figure 10 displays a two‑dimensional t‑SNE projection of 10,000 randomly sampled one‑second test segments from the PCID dataset, where each point’s color encodes the instrument class. The resulting plot reveals highly structured feature geometry: timbrally distinct classes (such as Ney, Tonbak, and Daf) form compact, well‑separated clusters with minimal overlap, underscoring the model’s strong ability to disentangle acoustic families. Structurally similar plucked strings—Setar, Divan, Dutar—show some degree of cluster proximity or overlap, reflecting their inherent timbral resemblance and the challenge they pose for both human and machine listeners.

Figure 10

t‑SNE projection of penultimate‑layer features for 10,000 one‑second test segments from the PCID.

Confusion matrix analysis

Figure 11 presents the normalized confusion matrix for the same test set. Overall, most instrument classes achieve near‑perfect correct classification (diagonal entries at or near 1.00), further validating the effectiveness of the supervised contrastive approach. Off‑diagonal entries are rare and concentrated primarily among string instruments with known acoustic similarity. For example, modest confusion is observed between Setar, Dutar, and Divan—which matches the partial cluster overlap seen in the t‑SNE plot—while other classes, such as Ney, Tonbak, and Daf, remain nearly error‑free.

Figure 11

Normalized confusion matrix (one‑second input, PCID test set).
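For readers who wish to reproduce analyses like Figures 10 and 11, the sketch below shows a typical t‑SNE projection and row‑normalized confusion matrix using scikit‑learn and matplotlib. The embeddings, labels, and predictions here are synthetic placeholders standing in for the real penultimate‑layer features and test outputs.

```python
# Illustrative t-SNE projection and normalized confusion matrix.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 256))      # stand-in for penultimate-layer features
labels = rng.integers(0, 15, size=500)        # stand-in for true class indices
preds = labels.copy()                         # stand-in for model predictions

proj = TSNE(n_components=2, perplexity=30, init="pca",
            random_state=0).fit_transform(embeddings)
plt.scatter(proj[:, 0], proj[:, 1], c=labels, s=2, cmap="tab20")
plt.title("t-SNE of penultimate-layer features")

cm = confusion_matrix(labels, preds, normalize="true")   # row-normalized, as in Figure 11
plt.matshow(cm, cmap="Blues")
plt.title("Normalized confusion matrix")
plt.show()
```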

Interpretation and implications

Combined, the t‑SNE visualization and confusion matrix demonstrate that the penultimate‑layer embeddings learned by our model yield a feature geometry that is both interpretable and highly effective for classification. Distinct clusters along with a diagonal‑dominant confusion matrix support strong generalization and robust class separation.

5.3.2 Comparative analysis with existing studies

To contextualize our results within the broader MIR literature, in Table 2, we provide a high‑level comparison of instrument‑classification performance reported in prior studies. Importantly, this table is not intended as a direct benchmark, since the referenced works differ substantially in dataset size, recording conditions, instrument sets, cultural traditions, and evaluation protocols. Instead, the purpose of this comparison is to illustrate the diversity of methodological approaches used across MIR and to situate the performance of our framework relative to typical accuracy ranges observed in both Western and non‑Western contexts.

Table 2

Comparison of instrument classification performance across different studies.

Study | Dataset | # of classes | Methodology | Accuracy (%) | F1‑Score (%)
Our Study | Extended Dataset (15 instruments) | 15 | Supervised contrastive learning with SSA | 97.48 | 98
Our Study | Subset of Extended Dataset (5 instruments) | 5 | Supervised contrastive learning with SSA | 99.78 | 100
Our Study | Nava Dataset (Modified) | 5 | Supervised contrastive learning with SSA | 99.88 | 100
Agostini et al. (2003) | Orchestral Instruments Dataset | 27 | Spectral features with KNN and neural networks | 70–80 | N/A
Essid et al. (2006) | Solo Recordings and Mixtures of Western Instruments | 7 | MFCCs, timbral descriptors with SVM | 65–75 | N/A
Han et al. (2016) | Subset of MIREX Dataset (Various Genres and Instruments) | 11 | Deep CNNs for predominant instrument recognition | 75 | 80
Solanki and Pandey (2022) | IMRAS Dataset (6705 recordings) | 11 | Eight‑layer deep CNN with mel spectrogram input | 92.61 | N/A
Prabavathy et al. (2020) | RWC Database, MusicBrainz.org, IRMAS, NSynth | 16 | SVM and KNN with MFCC and sonogram features | 99.29 | 95.15
Gong et al. (2021) | ChMusic Dataset (Traditional Chinese Instruments) | 11 | MFCCs with KNN and majority voting | 94.15 | N/A
Humphrey et al. (2018) | OpenMIC‑2018 Dataset | 20 | Deep learning with CNN and multi‑instance learning | N/A | 78 (AUC‑PR)
Reghunath and Rajan (2022) | Polyphonic Music Dataset | 11 | Transformer‑based ensemble method | 85 | 79
Mousavi et al. (2019) | PCMIR Dataset (Persian Classical Music) | 6 | MFCCs, spectral features with neural network | 80 | N/A
Baba Ali et al. (2019) | Nava Dataset (Original) | 5 | MFCC and i‑vector with SVM | 84.75 | 84
Baba Ali (2024) | Nava Dataset (Original) | 5 | Self‑supervised, pre‑trained models | 99.64 | 99.64

Notes:

  • The compared studies use different datasets, class definitions, cultural traditions, and evaluation settings; thus, the table provides contextual—not benchmark‑equivalent—comparisons.

  • Metric values from prior work are reported as published and may not be directly comparable across studies.

  • Our purpose is to situate PCID and the proposed framework within a wider MIR performance landscape.

Our models achieved higher accuracy and F1‑scores than existing studies, even those employing advanced deep learning techniques. Although some studies focused on Western instruments and others on non‑Western instruments such as Indian classical music, the performance of our models on the complex dataset of Persian classical instruments demonstrates the effectiveness of our approach.

5.3.3 Dastgah classification

We evaluated our models on the Dastgah detection task using the Nava dataset. Figure 7 presents a comparison of our method with two other studies that have addressed this task on the same dataset.

This comprehensive experiments section highlights the dataset construction, rigorous evaluation against multiple baselines and data variants, and strong performance gains achieved by supervised contrastive SSA models across instrument classification and Dastgah detection tasks.

5.3.4 Extended evaluation

We further validated our approach by evaluating alternative strategies and baselines. First, when comparing the use of a meta‑classifier to combine one‑second predictions against simple averaging, the meta‑classifier achieved slightly higher accuracy (94.65% vs. 94.14%), particularly improving classification for timbrally similar instruments such as Kamancheh and Setar. In exploring fine‑tuning strategies, shallow fine‑tuning, where the encoder is frozen, outperformed full fine‑tuning, achieving 91.91% accuracy compared to 87.28%. This suggests that preserving pretrained encoder features helps reduce overfitting on limited data. As a baseline, a simple CNN trained on one‑second spectrograms reached 90.16%, lagging behind all SSA contrastive models. Furthermore, evaluating on random five‑second audio chunks demonstrated the advantage of ensembling, with the SSA achieving 98.27% accuracy compared to 92.36% for a single model; the most significant improvements were observed on challenging instruments like the Setar. Overall, these results confirm that the meta‑classifier provides more consistent performance than averaging, shallow fine‑tuning is preferable when data are limited, and contrastive pretraining outperforms standard CNN baselines and off‑domain pretrained models while maintaining robustness to variable segment lengths.

6 Discussion

Our study used three datasets: the five‑class subset of PCID, the complete PCID, and the Nava dataset. The results on these datasets reaffirm the flexibility and robustness of the proposed SSA contrastive learning models.

The model trained on a five‑class subset of the PCID dataset performed superbly on all datasets, especially on the modified Nava dataset, where it obtained 100% accuracy for inputs of 30 seconds. This shows that this model can generalize exceptionally well to new data, even if other datasets have different structures or data‑collection methods. The increase in accuracy noticed in its revised version emphasizes the need to address mislabeling issues to obtain quality outputs.

However, the model trained on the comprehensive 15‑class dataset, which covers a wider variety of instruments, showed a slight decrease in performance when tested on the five‑class Nava dataset. For example, test accuracy was 99.29% for 30‑second inputs. This lower performance may be explained by the greater complexity of the dataset and the more difficult classification task. Despite the complexity of the 15‑class task, performance improved consistently with longer inputs, from 89.09% at one second to 97.08% at 30 seconds on the test set, and from 86.55% to 92.49%, respectively, on validation. This demonstrates strong, stable performance when training and testing data are consistent.

These results show that SSA contrastive learning models can adapt to varying datasets. However, dataset complexity impacts model performance, especially when transferring between datasets with different numbers of classes.

It is important to note that, as raised in recent MIR discussions, pitch invariance and the utility of off-the-shelf, globally pre-trained music embeddings have been widely debated. We do not explicitly incorporate pitch-invariant architectural changes; instead, we rely on the empirical power of our supervised contrastive approach to emphasize the timbral distinctions relevant to Persian instruments. This precise comparison was already conducted in prior work (Baba Ali, 2024): embeddings were extracted from three state-of-the-art, globally pre-trained audio models (Music2Vec, MusicHuBERT, and MERT), and shallow classifiers were trained on top for Persian instrument recognition. Although these models perform strongly on Western benchmarks, their shallow fine-tuned accuracy on the Nava dataset remained lower (97%) than that of our domain-trained supervised contrastive SSA approach (98–99%) under identical evaluation conditions. Thus, while global embeddings capture broad audio features, our results demonstrate that a domain-tailored supervised contrastive encoder trained on Persian data yields meaningfully better timbre discrimination.
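As a rough illustration of this shallow-classifier protocol (not the exact setup of Baba Ali, 2024), the sketch below trains a simple scikit-learn classifier on clip-level embeddings assumed to have been exported from one of the frozen pretrained models; the file names and array shapes are hypothetical placeholders.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Placeholder files: clip-level embeddings exported from a frozen pretrained
# audio model (e.g., time-averaged hidden states), one row per recording.
X_train = np.load("train_embeddings.npy")   # shape (n_train, embed_dim)
y_train = np.load("train_labels.npy")       # integer instrument labels
X_test = np.load("test_embeddings.npy")
y_test = np.load("test_labels.npy")

# Shallow classifier on top of frozen embeddings; the backbone is never updated.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))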

We also examined the influence of input length on model performance, evaluating inputs ranging from 1 to 30 seconds. For the five-class models, accuracy followed the same trend as the F1-scores, increasing with input duration. For example, our five-class model scored 97.30% on the modified Nava test set with one-second samples but reached 100% with 30-second samples. A similarly large improvement was seen for the 15-class model, which rose from 94.65% with one-second samples to 99.29% with 30-second samples.

This trend shows that longer inputs are preferable for capturing the subtle temporal characteristics of the audio signals. Shorter inputs carry less information and therefore yield poorer performance, whereas longer inputs better represent tonal changes and instrument-specific playing techniques, resulting in higher classification accuracy.

Another contribution of this work is the revision of the Nava dataset to correct incorrectly labeled samples in the original test and development sets. When evaluating our models on this revised dataset, we observed significant increases in both accuracy and F1-scores compared to the results obtained on the original dataset.

For example, Baba Ali et al. (2019) reported an accuracy of 97.17% on their test set with 60-second input samples, and Baba Ali (2024) reported 99.64% for their best model, while our model attained 99.89% accuracy with 20-second input samples and 100% accuracy with inputs of 30 seconds or longer on the modified test set. This is a substantial improvement and shows that our method is highly effective, particularly when evaluated on the corrected data.

In addition, the comparison of Dastgah detection against other state-of-the-art methods in Figure 7 illustrates a large performance gain for our proposed method. Our model achieved greater accuracy across different input lengths than the approaches of Baba Ali et al. (2019) and Baba Ali (2024); for example, our method reached an accuracy of 34.82%, compared with 20.13% and 20.54%, respectively. This shows that the SSA contrastive models can learn informative features even from short inputs and maintain their generalization across datasets.

A key part of our methodology for handling longer input sequences in contrastive learning is the SSA, which breaks each input down into manageable one-second segments. Training the model on these shorter segments and then pooling the per-segment predictions through aggregation dramatically reduced computational complexity and memory requirements. This allowed us to handle longer inputs efficiently, preventing system overload and simplifying both training and inference.
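The following is a minimal sketch of the segment-and-aggregate inference step, assuming a trained one-second base model (here called base_model) that maps a batch of one-second waveform segments to class logits, and a 22.05 kHz sampling rate; both are assumptions for illustration. Simple averaging is shown, although the meta-classifier discussed above can replace the mean.

import torch

SAMPLE_RATE = 22050          # assumed sampling rate; segments are one second long
SEGMENT_LEN = SAMPLE_RATE

def ssa_predict(waveform: torch.Tensor, base_model: torch.nn.Module) -> torch.Tensor:
    """Classify a long clip by splitting it into one-second segments,
    scoring each segment with the base model, and averaging the
    per-segment class probabilities (simple aggregation)."""
    # Drop any trailing partial segment and reshape into (n_segments, SEGMENT_LEN).
    n_segments = waveform.shape[-1] // SEGMENT_LEN
    segments = waveform[..., : n_segments * SEGMENT_LEN].reshape(n_segments, SEGMENT_LEN)

    with torch.no_grad():
        logits = base_model(segments)            # (n_segments, n_classes)
        probs = torch.softmax(logits, dim=-1)

    return probs.mean(dim=0)                     # aggregated class probabilities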

The SSA strategy provides two major advantages. First, it allows the model to handle longer sequences without being trained on them directly, improving performance without increasing training time. Second, it makes the approach better suited to real-time applications, since instrument classification can proceed with little delay by processing one-second segments as they arrive. This property is particularly valuable in live performances or interactive applications that require immediate feedback.

We acknowledge that a comprehensive cross-cultural evaluation, testing our models on datasets containing non-Persian or mixed musical traditions, remains an important future direction. Due to limitations in available datasets and computational resources, such analyses were beyond the scope of this work. Expanding the evaluation to other regional or global music collections would provide an even more rigorous assessment of generalization and practical applicability.

The primary limitations of this study arise from the construction and scope of the PCID dataset. Because the recordings were collected from publicly accessible platforms, detailed metadata—such as performer identity, recording conditions, microphone setup, or geographic origin—are not available consistently. In addition, although the resulting dataset is diverse in audio quality, it contains only isolated, monophonic instrument recordings. This simplifies the classification task and does not reflect the polyphonic textures, ensemble interactions, or overlapping timbres typical of real Persian musical performances. Furthermore, while the framework achieves high accuracy, it does not explicitly incorporate musicological characteristics of Persian music, such as microtonality, and thus should be interpreted as a general audio‑classification pipeline rather than a culturally informed MIR model.

Despite strong generalization to the Nava dataset and competitive performance in Dastgah detection, the system has not been evaluated on polyphonic settings; noisy real‑world conditions; or additional Persian MIR tasks such as playing‑technique recognition, artist identification, or emotion analysis. These limitations outline important avenues for future work and clarify the boundaries within which the reported results should be interpreted.

For future work, one of our objectives is to extend the model to multi-label classification of musical instruments, focusing on the concurrent detection of several instruments within Persian classical music. Real performances usually feature several instruments at once, which current single-label models cannot handle. Integrating multi-label classification would allow the models to identify overlapping sound patterns while distinguishing individual instruments, improving performance in mixed audio environments. Another promising direction is singer-voice detection, since solo vocalists appear in many Persian pieces; a model capable of detecting and classifying vocal parts alongside their accompaniment could enable the classification and segmentation of Persian songs into separate audio tracks.

Another interesting avenue is the detection of emotion in Persian music, whose melodies, rhythms, and tonalities often carry deep emotional undertones. Incorporating emotion recognition into our model would add an interpretive dimension, allowing pieces to be classified by their emotional content. Achieving this requires annotated datasets that link particular emotions to specific pieces or segments of a song.

7 Conclusions

This paper introduced an approach to instrument classification in classical Persian music based on supervised contrastive learning and SSA methods. By preprocessing the audio into one-second segments and applying data-augmentation techniques to balance the dataset, we trained base models that effectively capture discriminative features from short audio clips. The SSA strategy allowed us to handle longer input sequences efficiently by aggregating the outputs of these base models, improving scalability and robustness. Our methodology was evaluated on three datasets (the Nava dataset, the full PCID, and the five-instrument subset of PCID) and demonstrated significant improvements over existing state-of-the-art methods.

Our experiments demonstrated that, with input durations ranging from 1 to 30 seconds, the proposed models achieve nearly perfect accuracy and F1-scores. Correcting mislabeled samples in the Nava dataset highlighted the importance of high-quality data for model performance. Within our framework, the SSA approach underlines the value of intelligently combining multiple models to capture both short- and long-term temporal features from audio data. Future work will extend the task to multi-label classification, allowing the model to detect the presence of multiple instruments simultaneously, and will also consider singer-voice detection and emotion recognition, enabling a wider range of classical Persian music analyses.

Reproducibility

The complete implementation used in this study—including data preprocessing, model training, and evaluation scripts—is released and can be accessed at https://github.com/arianm01/Instrument-Detection. In addition, we have released the dataset on Zenodo (https://doi.org/10.5281/zenodo.16580241), enabling direct replication of our results and facilitating further research extensions.

Competing Interests

The authors have no competing interests to declare.

Authors’ Contributions

Ali Ahmadi Katamjani designed the study, developed the methodology, implemented the models, conducted the experiments, performed the analysis, and wrote the manuscript. Seyed Abolghasem Mirroshandel supervised the research, contributed to conceptual development, and provided critical feedback on the methodology and manuscript. Mahdi Aminian co‑supervised the research and provided feedback on the methodology, experiments, and manuscript revisions. All authors reviewed and approved the final version of the article.

DOI: https://doi.org/10.5334/tismir.271 | Journal eISSN: 2514-3298
Language: English
Submitted on: Apr 26, 2025 | Accepted on: Dec 6, 2025 | Published on: Jan 7, 2026
Published by: Ubiquity Press

© 2026 Ali Ahmadi Katamjani, Seyed Abolghasem Mirroshandel, Mahdi Aminian, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.
