A Lightweight Two‑Branch Architecture for Multi‑Instrument Transcription via Note‑Level Contrastive Clustering

By: Ruigang Li and Yongxu Zhu
Open Access | Apr 2026

Figures & Tables

Figure 1

Overview of the proposed method. Top: Overall pipeline, which takes a multi‑timbral mixture audio as input and outputs note events for each constituent timbre. Bottom left: AMT branch producing timbre‑agnostic transcription outputs, the frame activation posteriorgram Y_F and the onset activation posteriorgram Y_O. Bottom right: Timbre‑encoding branch yielding a D‑dimensional timbre embedding V for each time–frequency bin, where N=84 denotes the target pitch range.
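As a shape-level sketch rather than the authors' architecture, the two-branch interface implied by this caption could look as follows; the layer choices, module name, and D value are hypothetical, and only the output shapes follow the figure: frame and onset posteriorgrams over N = 84 pitches, plus a D-dimensional embedding V per time–frequency bin.

```python
import torch
import torch.nn as nn

N_PITCHES = 84  # target pitch range from the caption
D_EMBED = 32    # hypothetical embedding width; Table 3 ablates D = 16

class TwoBranchSketch(nn.Module):
    def __init__(self, d=D_EMBED):
        super().__init__()
        # AMT branch: timbre-agnostic frame/onset posteriorgrams
        # (placeholder layers, not the paper's residual blocks).
        self.amt = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 2, kernel_size=1),
        )
        # Timbre-encoding branch: a D-dim embedding per time-frequency bin.
        self.timbre = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, d, kernel_size=1),
        )

    def forward(self, x):                       # x: (batch, 1, time, N_PITCHES)
        y = torch.sigmoid(self.amt(x))          # (batch, 2, time, N_PITCHES)
        y_frame, y_onset = y[:, 0], y[:, 1]     # Y_F and Y_O
        v = self.timbre(x).permute(0, 2, 3, 1)  # V: (batch, time, N_PITCHES, D)
        return y_frame, y_onset, v
```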

Figure 2

Training outcomes using Focal Loss with varying positive class weights. Each column represents a training session from initialization to convergence.
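For reference, a minimal sketch of a binary focal loss with a tunable positive-class weight, the quantity varied across the runs in Figure 2. The alpha and gamma defaults below are illustrative assumptions, not the paper's settings.

```python
import torch

def focal_bce(pred, target, alpha=0.9, gamma=2.0, eps=1e-7):
    """Binary focal loss with positive-class weight `alpha` (assumed form,
    following Lin et al., 2017). `pred` and `target` share the
    posteriorgram shape, with values in [0, 1]."""
    pred = pred.clamp(eps, 1 - eps)
    # Positive (note-active) bins weighted by alpha, negatives by 1 - alpha;
    # the (1 - p)^gamma factor down-weights already well-classified bins.
    pos = -alpha * (1 - pred) ** gamma * target * torch.log(pred)
    neg = -(1 - alpha) * pred ** gamma * (1 - target) * torch.log(1 - pred)
    return (pos + neg).mean()
```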

Figure 3

Results of frame‑level and note‑level postprocessing for triple separation.
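A plausible reading of the note-level step, in the spirit of common Onsets&Frames-style decoding rather than the authors' exact rule: start a note where the onset posteriorgram exceeds a threshold and sustain it while the frame posteriorgram stays active. The thresholds below are assumptions.

```python
import numpy as np

def decode_notes(y_frame, y_onset, frame_thr=0.5, onset_thr=0.5):
    """Sketch of note-level postprocessing (assumed rule, thresholds
    illustrative). y_frame, y_onset: (time, pitch) arrays in [0, 1].
    Returns (pitch, start_frame, end_frame) note events."""
    notes = []
    n_frames, n_pitches = y_frame.shape
    for p in range(n_pitches):
        t = 0
        while t < n_frames:
            if y_onset[t, p] >= onset_thr and y_frame[t, p] >= frame_thr:
                start = t
                # Sustain the note while the frame activation stays high.
                while t < n_frames and y_frame[t, p] >= frame_thr:
                    t += 1
                notes.append((p, start, t))
            else:
                t += 1
    return notes
```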

Table 1

Datasets used in experiments. ‘Our:AMT’ and ‘Our:Sep’ are our synthetic datasets, created to examine whether human composition and real recordings are indispensable. The latter’s timbres are grouped into 10 classes; during training, mixing single‑timbre samples across classes yields multi‑timbre pieces. Datasets marked with * are used for training.

Dataset     Dur.    Songs   Instr.  K/Song
MusicNet*   34 h    330     11      1–8
BACH10      334 s   10      4       4
PHENICX     637 s   4       10      8–10
URMP        1.3 h   44      14      2–4
Our:AMT*    24 h    8316    33      1
Our:Sep*    836 s   120     34      1
Figure 4

Randomly generated piano roll (left) and the corresponding CQT spectrogram of the synthesized audio using a trumpet timbre (right).
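For orientation, a minimal sketch of computing such a CQT with librosa; 84 bins at 12 bins per octave matches the N = 84 pitch range, while the file name, fmin, and hop length are assumptions rather than the paper's settings.

```python
import librosa
import numpy as np

# Load the synthesized audio (hypothetical file name).
y, sr = librosa.load("synth_trumpet.wav", sr=22050)
# 84-bin CQT at 12 bins/octave: one bin per semitone over 7 octaves.
C = librosa.cqt(y, sr=sr, hop_length=512,
                fmin=librosa.note_to_hz("A0"),
                n_bins=84, bins_per_octave=12)
log_mag = np.log1p(np.abs(C))  # log-compressed magnitude for display
```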

Table 2

Comparison of timbre‑agnostic transcription. Unless stated otherwise, models are trained on our synthetic dataset. ‘noLog’: EnergyNorm without log; ‘BN’: BatchNorm replacing EnergyNorm (BasicPitch‑style); ‘Conv39’: BasicPitch’s 39‑tap conv replacing our dilated conv; ‘BPloss’: BasicPitch’s loss; ‘MN’: trained on MusicNet; ‘lCQT’: learnable CQT; ‘lCQTMN’: learnable CQT trained on MusicNet; ‘BP’: re‑implementation of BasicPitch (Bittner et al., 2022); ‘BPfl’: BP with focal loss; ‘BPMNfl’: BPfl trained on MusicNet; ‘OF’: re‑implementation of Onsets&Frames (Hawthorne et al., 2018) with its original BCE loss; ‘OFfl’: OF with focal loss; and ‘OFMNfl’: OFfl trained on MusicNet. In headers, ‘tutti’ = full‑mix polyphonic pieces; ‘stems’ = single‑instrument samples.

Model         BACH10 tutti   BACH10 stems   PHENICX        URMP tutti     URMP stems
Metric (%)    FF     FN      FF     FN      FF     FN      FF     FN      FF     FN
Ours          84.6   75.2    91.6   88.9    63.2   49.4    75.4   71.4    81.1   83.3
noLog         80.7   68.1    90.0   86.6    61.1   45.7    72.4   66.7    80.7   82.1
BN            85.9   76.4    91.9   88.6    58.5   46.0    72.6   67.8    78.5   80.3
Conv39        84.1   74.6    90.8   85.8    62.1   46.2    75.4   70.6    81.6   83.0
BPloss        84.3   63.6    91.2   77.0    44.6   46.2    74.5   61.7    81.1   71.2
MN            84.8   79.4    88.4   86.7    69.7   55.6    79.7   77.4    82.3   84.4
lCQT          79.1   63.5    88.4   83.5    63.2   47.9    74.8   68.1    81.4   81.9
lCQTMN        86.5   79.2    89.1   87.9    70.1   56.3    79.5   75.9    81.6   82.5
BP            84.8   53.7    92.0   26.9    58.0   44.9    70.7   60.4    80.9   76.8
BPfl          86.1   75.2    92.3   89.3    62.2   47.7    76.2   71.5    82.2   84.9
BPMNfl        85.7   80.5    89.3   87.8    69.8   54.0    79.9   77.9    82.8   84.9
OF            82.7   70.7    91.4   86.6    55.9   47.6    67.9   64.4    81.1   80.7
OFfl          82.1   72.3    91.8   90.0    55.5   47.0    70.1   66.7    81.2   83.7
OFMNfl        84.6   76.7    87.9   88.6    65.5   56.7    76.3   74.1    81.1   86.1
Table 3

Comparison of timbre‑separated transcription. Unless stated otherwise, models are trained on MusicNet with the InfoNCE loss (Eq. 8). ‘D16’: D=16; ‘MSE’: using L_affinity (Eq. 6); ‘Syn’: trained on our synthetic dataset; ‘Rescale’: forcibly scaling amplitude using the Frame prediction before InstanceNorm in the timbre‑encoding branch; ‘Share’: sharing the first residual block between the two branches; and ‘Tanaka’: baseline model (Tanaka et al., 2020) using InfoNCE. All experiments use identical pretrained AMT branch parameters except ‘Share’.

Model       BACH10 2 mix          BACH10 3 mix          BACH10 4 mix          URMP 2 mix            URMP 3 mix
Metric (%)  FF     SF     Sratio  FF     SF     Sratio  FF     SF     Sratio  FF     SF     Sratio  FF     SF     Sratio
Ours        83.4   84.6   99.0    77.8   80.1   96.7    66.2   72.7   89.9    68.9   66.8   82.6    58.5   60.0   77.1
D16         83.2   84.4   98.8    76.8   79.5   96.0    64.5   70.4   87.1    69.1   68.4   84.7    58.0   59.5   76.5
MSE         80.4   82.2   96.3    70.8   75.7   91.4    59.0   67.9   84.0    65.4   65.4   80.9    53.9   57.1   73.5
Syn         72.4   78.1   91.4    58.7   68.0   82.1    46.0   56.1   69.4    49.6   53.9   66.7    40.9   45.1   58.1
Rescale     83.4   84.6   99.1    76.8   79.6   96.2    63.6   70.9   87.7    68.4   66.8   82.7    58.6   59.8   76.9
Share       82.1   83.0   98.4    76.1   78.9   96.6    68.5   72.8   91.0    69.0   68.3   84.6    57.2   57.5   74.1
Tanaka      77.9   79.6   93.2    66.1   69.0   83.4    55.4   59.2   73.2    65.5   64.1   79.3    56.5   56.5   72.8
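The InfoNCE loss referenced above (Eq. 8 in the paper, not reproduced in this section) pulls together embeddings from the same timbre and pushes apart the rest. A generic sketch in its standard form follows, with the pairing scheme and temperature as assumptions:

```python
import torch
import torch.nn.functional as F

def info_nce(anchors, positives, temperature=0.1):
    """Standard InfoNCE over timbre embeddings (assumed form). Row i of
    `positives` shares a source/timbre with row i of `anchors`; all other
    rows serve as negatives. Shapes: (n, D)."""
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.T / temperature                 # (n, n) cosine similarities
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)         # match row i to column i
```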
Figure 5

T‑distributed stochastic neighbor embedding (t‑SNE) visualization of timbre embeddings. (a) Frame‑level embeddings for BACH10 Piece 2, (b) note‑level aggregates of (a), (c) frame‑level for URMP Piece 18, and (d) frame‑level for URMP Piece 18 using top‑k attention.
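A sketch of how such a projection is typically produced with scikit-learn; the input array and perplexity are placeholders, and note-level aggregates (panel b) would average the frame embeddings belonging to each note before projection.

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder embeddings: (num_active_bins, D) timbre vectors.
embeddings = np.random.rand(500, 32).astype(np.float32)
coords = TSNE(n_components=2, perplexity=30.0).fit_transform(embeddings)
# coords: (500, 2) points to scatter-plot, colored by instrument label.
```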

DOI: https://doi.org/10.5334/tismir.300 | Journal eISSN: 2514-3298
Language: English
Submitted on: Jun 28, 2025
Accepted on: Mar 25, 2026
Published on: Apr 15, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Ruigang Li, Yongxu Zhu, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.