
A Lightweight Two‑Branch Architecture for Multi‑Instrument Transcription via Note‑Level Contrastive Clustering

By: Ruigang Li and Yongxu Zhu
Open Access | Apr 2026

References

  1. Balhar, J., and Hájíč, J. (2019). Melody extraction using a harmonic convolutional neural network. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR 2018), Paris, France (Vol. 10).
  2. Benetos, E., Dixon, S., Duan, Z., and Ewert, S. (2019). Automatic music transcription: An overview. IEEE Signal Processing Magazine, 36(1), 20–30.
  3. Bittner, R. M., Bosch, J. J., Rubinstein, D., Meseguer‑Brocal, G., and Ewert, S. (2022). A lightweight instrument‑agnostic model for polyphonic note transcription and multipitch estimation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Virtual and Singapore, 23–27 May 2022 (pp. 781–785). IEEE.
  4. Bittner, R. M., McFee, B., Salamon, J., Li, P., and Bello, J. P. (2017). Deep salience representations for F0 estimation in polyphonic music. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), Suzhou, China. ISMIR.
  5. Cwitkowitz, F., Cheuk, K. W., Choi, W., Ramírez, M. A. M., Toyama, K., Liao, W., and Mitsufuji, Y. (2024). Timbretrap: A low‑resource framework for instrument‑agnostic music transcription. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2024, Seoul, Republic of Korea, April 14–19, 2024 (pp. 1291–1295). IEEE.
  6. Duan, Z., Pardo, B., and Zhang, C. (2010). Multiple fundamental frequency estimation by modeling spectral peaks and non‑peak regions. IEEE Transactions on Audio, Speech, and Language Processing, 18(8), 2121–2133.
  7. Duan, Z., Zhang, Y., Zhang, C., and Shi, Z. (2008). Unsupervised single‑channel music source separation by average harmonic structure modeling. IEEE Transactions on Audio, Speech, and Language Processing, 16(4), 766–778.
  8. Esaki, Y., Koide, S., and Kutsuna, T. (2024). One‑shot domain incremental learning. In Proceedings of the 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan (pp. 1–8). IEEE.
  9. Gardner, J., Simon, I., Manilow, E., Hawthorne, C., and Engel, J. H. (2022). MT3: Multi‑task multitrack music transcription. In Proceedings of the 10th International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25–29, 2022. OpenReview.net.
  10. Hawthorne, C., Elsen, E., Song, J., Roberts, A., Simon, I., Raffel, C., Engel, J., Oore, S., and Eck, D. (2018). Onsets and frames: Dual‑objective piano transcription. In Proceedings of the 19th International Society for Music Information Retrieval Conference, Paris, France (pp. 50–57).
  11. Hershey, J. R., Chen, Z., Roux, J. L., and Watanabe, S. (2016). Deep clustering: Discriminative embeddings for segmentation and separation. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016, Shanghai, China, March 20–25, 2016 (pp. 31–35). IEEE.
  12. Li, B., Liu, X., Dinesh, K., Duan, Z., and Sharma, G. (2019). Creating a multitrack classical music performance dataset for multimodal music analysis: Challenges, insights, and applications. IEEE Transactions on Multimedia, 21(2), 522–535.
  13. Lin, L., Kong, Q., Jiang, J., and Xia, G. (2021). A unified model for zero‑shot music source separation, transcription and synthesis. In Proceedings of the 22nd International Society for Music Information Retrieval Conference (ISMIR).
  14. Lin, T., Goyal, P., Girshick, R. B., He, K., and Dollár, P. (2017). Focal loss for dense object detection. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22–29, 2017 (pp. 2999–3007). IEEE Computer Society.
  15. Luo, Y., Chen, Z., and Mesgarani, N. (2018). Speaker‑independent speech separation with deep attractor network. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(4), 787–796.
  16. Luo, Y., and Mesgarani, N. (2018). TasNet: Time‑domain audio separation network for real‑time, single‑channel speech separation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018, Calgary, AB, Canada, April 15–20, 2018 (pp. 696–700). IEEE.
  17. Maman, B., and Bermano, A. H. (2022). Unaligned supervision for automatic music transcription in the wild. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvári, G. Niu, and S. Sabato (Eds.), International Conference on Machine Learning, ICML 2022, July 17–23, 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research (pp. 14918–14934). PMLR.
  18. Manilow, E., Seetharaman, P., and Pardo, B. (2020). Simultaneous separation and transcription of mixtures with multiple polyphonic and percussive instruments. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020, Barcelona, Spain, May 4–8, 2020 (pp. 771–775). IEEE.
  19. Manilow, E., Wichern, G., Seetharaman, P., and Roux, J. L. (2019). Cutting music source separation some Slakh: A dataset to study the impact of training data quality and quantity. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (pp. 45–49).
  20. Miron, M., Carabias‑Orti, J. J., Bosch, J., Gómez, E., and Janer, J. (2016). Phenicx‑Anechoic: Note annotations for Aalto anechoic orchestral database.
  21. Oord, A. v. d., Li, Y., and Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
  22. Raffel, C., McFee, B., Humphrey, E. J., Salamon, J., and Ellis, D. P. W. (2014). mir_eval: A transparent implementation of common MIR metrics. In Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR 2014), Taipei, Taiwan.
  23. Ravanelli, M., and Bengio, Y. (2018). Speaker recognition from raw waveform with SincNet. In 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece (pp. 1021–1028). IEEE.
  24. Riley, X., Edwards, D., and Dixon, S. (2024). High resolution guitar transcription via domain adaptation. In ICASSP 2024 – 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Korea (pp. 1051–1055).
  25. Rouard, S., Massa, F., and Défossez, A. (2023). Hybrid transformers for music source separation. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023, Rhodes Island, Greece, June 4–10, 2023 (pp. 1–5). IEEE.
  26. Schörkhuber, C., and Klapuri, A. (2010). Constant‑Q transform toolbox for music processing. In Proceedings of the 7th Sound and Music Computing Conference (SMC), Barcelona, Spain (pp. 3–64).
  27. Su, L., and Yang, Y.‑H. (2015). Combining spectral and temporal representations for multipitch estimation of polyphonic music. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(10), 1600–1612.
  28. Tamer, N. C., Özer, Y., Müller, M., and Serra, X. (2023a). High‑resolution violin transcription using weak labels. In ISMIR 2023 Hybrid Conference, Milano, Italy.
  29. Tamer, N. C., Özer, Y., Müller, M., and Serra, X. (2023b). TAPE: An end‑to‑end timbre‑aware pitch estimator. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023, Rhodes Island, Greece, June 4–10, 2023 (pp. 1–5). IEEE.
  30. Tanaka, K., Nakatsuka, T., Nishikimi, R., Yoshii, K., and Morishima, S. (2020). Multi‑instrument music transcription based on deep spherical clustering of spectrograms and pitchgrams. In ISMIR, Montreal, Canada (pp. 327–334).
  31. Thickstun, J., Harchaoui, Z., and Kakade, S. M. (2017). Learning features of music from scratch. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24–26, 2017, Conference Track Proceedings. OpenReview.net.
  32. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. In I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4–9, 2017, Long Beach, CA, USA (pp. 5998–6008).
  33. Wu, Y., Wei, W., Li, D., Li, M., Yu, Y., Gao, Y., and Li, W. (2024). Harmonic frequency‑separable transformer for instrument‑agnostic music transcription. In 2024 IEEE International Conference on Multimedia and Expo (ICME), Niagara Falls, Canada (pp. 1–6).
  34. Wu, Y.‑T., Chen, B., and Su, L. (2019). Polyphonic music transcription with semantic segmentation. In ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom (pp. 166–170). IEEE.
  35. Wu, Y.‑T., Chen, B., and Su, L. (2020). Multi‑instrument automatic music transcription with self‑attention‑based instance segmentation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28, 2796–2809.
  36. Zeghidour, N., Teboul, O., Quitry, F. D. C., and Tagliasacchi, M. (2021). LEAF: A learnable frontend for audio classification. arXiv preprint arXiv:2101.08596.
DOI: https://doi.org/10.5334/tismir.300 | Journal eISSN: 2514-3298
Language: English
Submitted on: Jun 28, 2025
Accepted on: Mar 25, 2026
Published on: Apr 15, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Ruigang Li, Yongxu Zhu, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.