A Lightweight Two‑Branch Architecture for Multi‑Instrument Transcription via Note‑Level Contrastive Clustering

Ruigang Li; Yongxu Zhu

doi:10.5334/tismir.300

Transactions of the International Society for Music Information Retrieval

Volume 9 (2026): Issue 1

Open Access

Apr 2026

Abstract

Existing multi‑timbre transcription models generalize poorly to instruments outside their training data, impose rigid constraints on the number of sources, and carry computational costs that hinder deployment on low‑resource devices. We address these limitations with a lightweight model that extends a timbre‑agnostic transcription backbone with a dedicated timbre encoder and performs deep clustering at the note level, enabling joint transcription and dynamic separation of arbitrary instruments given a specified number of instrument classes. Practical optimizations, including spectral normalization, dilated convolutions, and contrastive clustering, further improve efficiency and robustness. Despite its small size and fast inference, the model is competitive with heavier baselines in transcription accuracy and separation quality and shows promising generalization, making it well suited to deployment in resource‑constrained settings.
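
As a rough illustration of what clustering "at the note level" with a specified number of instrument classes can look like, the sketch below pools a timbre embedding over each transcribed note and groups the pooled vectors by cosine similarity. It is not the authors' implementation: the function names, the (time, pitch, dimension) embedding layout, the mean pooling, and the use of spherical k‑means with farthest‑point initialization are all assumptions made for this example, and the abstract does not specify any of them.

```python
import numpy as np


def note_embeddings(frame_emb, notes):
    """Pool a timbre embedding over each note (hypothetical helper).

    frame_emb: array of shape (T, P, D) = time frames, pitch bins, embedding dim.
    notes: list of (onset_frame, offset_frame, pitch_bin), offset exclusive.
    Returns an (N, D) array of L2-normalised note-level embeddings.
    """
    pooled = np.stack([frame_emb[on:off, p].mean(axis=0) for on, off, p in notes])
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.maximum(norms, 1e-8)


def spherical_kmeans(x, k, iters=20):
    """Cluster unit vectors by cosine similarity into k groups (assumed method)."""
    # Farthest-point initialisation: start from the first note, then repeatedly
    # add the note that is least similar to every centre chosen so far.
    centers = [x[0]]
    for _ in range(1, k):
        sim = np.max(x @ np.array(centers).T, axis=1)
        centers.append(x[np.argmin(sim)])
    centers = np.array(centers)

    for _ in range(iters):
        # Assign each note to its most similar centre.
        labels = np.argmax(x @ centers.T, axis=1)
        # Move each centre to the normalised sum of its members.
        for j in range(k):
            members = x[labels == j]
            if len(members):
                c = members.sum(axis=0)
                centers[j] = c / max(np.linalg.norm(c), 1e-8)
    return np.argmax(x @ centers.T, axis=1)
```

In this sketch, the note list would come from the transcription branch and `frame_emb` from the timbre branch; calling `spherical_kmeans(note_embeddings(frame_emb, notes), k)` then returns one instrument label per note, with `k` supplied by the user as the number of instrument classes.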
