A Lightweight Two‑Branch Architecture for Multi‑Instrument Transcription via Note‑Level Contrastive Clustering

By: Ruigang Li and Yongxu Zhu
Open Access | Apr 2026

Figures & Tables

Figure 1

Overview of the proposed method. Top: Overall pipeline, which takes a multi‑timbral mixture audio as input and outputs note events for each constituent timbre. Bottom left: AMT branch producing timbre‑agnostic transcription outputs, the frame activation posteriorgram Y_F and the onset activation posteriorgram Y_O. Bottom right: Timbre‑encoding branch yielding a D‑dimensional timbre embedding V for each time–frequency bin, where N=84 denotes the target pitch range.
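As a shape-level sketch rather than the authors' architecture, the two-branch interface implied by this caption could look as follows; the layer choices, module name, and D value are hypothetical, and only the output shapes follow the figure: frame and onset posteriorgrams over N = 84 pitches, plus a D-dimensional embedding V per time–frequency bin.

```python
import torch
import torch.nn as nn

N_PITCHES = 84  # target pitch range from the caption
D_EMBED = 32    # hypothetical embedding width; Table 3 ablates D = 16

class TwoBranchSketch(nn.Module):
    def __init__(self, d=D_EMBED):
        super().__init__()
        # AMT branch: timbre-agnostic frame/onset posteriorgrams
        # (placeholder layers, not the paper's residual blocks).
        self.amt = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 2, kernel_size=1),
        )
        # Timbre-encoding branch: a D-dim embedding per time-frequency bin.
        self.timbre = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, d, kernel_size=1),
        )

    def forward(self, x):                       # x: (batch, 1, time, N_PITCHES)
        y = torch.sigmoid(self.amt(x))          # (batch, 2, time, N_PITCHES)
        y_frame, y_onset = y[:, 0], y[:, 1]     # Y_F and Y_O
        v = self.timbre(x).permute(0, 2, 3, 1)  # V: (batch, time, N_PITCHES, D)
        return y_frame, y_onset, v
```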

Figure 2

Training outcomes using Focal Loss with varying positive class weights. Each column represents a training session from initialization to convergence.
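For reference, a minimal sketch of a binary focal loss with a tunable positive-class weight, the quantity varied across the runs in Figure 2. The alpha and gamma defaults below are illustrative assumptions, not the paper's settings.

```python
import torch

def focal_bce(pred, target, alpha=0.9, gamma=2.0, eps=1e-7):
    """Binary focal loss with positive-class weight `alpha` (assumed form,
    following Lin et al., 2017). `pred` and `target` share the
    posteriorgram shape, with values in [0, 1]."""
    pred = pred.clamp(eps, 1 - eps)
    # Positive (note-active) bins weighted by alpha, negatives by 1 - alpha;
    # the (1 - p)^gamma factor down-weights already well-classified bins.
    pos = -alpha * (1 - pred) ** gamma * target * torch.log(pred)
    neg = -(1 - alpha) * pred ** gamma * (1 - target) * torch.log(1 - pred)
    return (pos + neg).mean()
```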

Figure 3

Results of frame‑level and note‑level postprocessing for triple separation.
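A plausible reading of the note-level step, in the spirit of common Onsets&Frames-style decoding rather than the authors' exact rule: start a note where the onset posteriorgram exceeds a threshold and sustain it while the frame posteriorgram stays active. The thresholds below are assumptions.

```python
import numpy as np

def decode_notes(y_frame, y_onset, frame_thr=0.5, onset_thr=0.5):
    """Sketch of note-level postprocessing (assumed rule, thresholds
    illustrative). y_frame, y_onset: (time, pitch) arrays in [0, 1].
    Returns (pitch, start_frame, end_frame) note events."""
    notes = []
    n_frames, n_pitches = y_frame.shape
    for p in range(n_pitches):
        t = 0
        while t < n_frames:
            if y_onset[t, p] >= onset_thr and y_frame[t, p] >= frame_thr:
                start = t
                # Sustain the note while the frame activation stays high.
                while t < n_frames and y_frame[t, p] >= frame_thr:
                    t += 1
                notes.append((p, start, t))
            else:
                t += 1
    return notes
```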

Table 1

Datasets used in experiments. ‘Our:AMT’ and ‘Our:Sep’ are our synthetic datasets, created to examine whether human composition and real recordings are indispensable. The latter’s timbres are grouped into 10 classes; during training, mixing single‑timbre samples across classes yields multi‑timbre pieces. Datasets marked with * are used for training.

Dataset     Dur.    Songs   Instr.  K/Song
MusicNet*   34 h    330     11      1–8
BACH10      334 s   10      4       4
PHENICX     637 s   4       10      8–10
URMP        1.3 h   44      14      2–4
Our:AMT*    24 h    8316    33      1
Our:Sep*    836 s   120     34      1
Figure 4

Randomly generated piano roll (left) and the corresponding CQT spectrogram of the synthesized audio using a trumpet timbre (right).
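For orientation, a minimal sketch of computing such a CQT with librosa; 84 bins at 12 bins per octave matches the N = 84 pitch range, while the file name, fmin, and hop length are assumptions rather than the paper's settings.

```python
import librosa
import numpy as np

# Load the synthesized audio (hypothetical file name).
y, sr = librosa.load("synth_trumpet.wav", sr=22050)
# 84-bin CQT at 12 bins/octave: one bin per semitone over 7 octaves.
C = librosa.cqt(y, sr=sr, hop_length=512,
                fmin=librosa.note_to_hz("A0"),
                n_bins=84, bins_per_octave=12)
log_mag = np.log1p(np.abs(C))  # log-compressed magnitude for display
```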

Table 2

Comparison of timbre‑agnostic transcription. Unless stated otherwise, models are trained on our synthetic dataset. ‘noLog’: EnergyNorm without log; ‘BN’: BatchNorm replacing EnergyNorm (BasicPitch‑style); ‘Conv39’: BasicPitch’s 39‑tap conv replacing our dilated conv; ‘BPloss’: BasicPitch’s loss; ‘MN’: trained on MusicNet; ‘lCQT’: learnable CQT; ‘lCQTMN’: learnable CQT trained on MusicNet; ‘BP’: re‑implementation of BasicPitch (Bittner et al., 2022); ‘BPfl’: BP with focal loss; ‘BPMNfl’: BPfl trained on MusicNet; ‘OF’: re‑implementation of Onsets&Frames (Hawthorne et al., 2018) with its original BCE loss; ‘OFfl’: OF with focal loss; and ‘OFMNfl’: OFfl trained on MusicNet. In headers, ‘tutti’ = full‑mix polyphonic pieces; ‘stems’ = single‑instrument samples.

Model         BACH10 tutti   BACH10 stems   PHENICX        URMP tutti     URMP stems
Metric (%)    FF     FN      FF     FN      FF     FN      FF     FN      FF     FN
Ours          84.6   75.2    91.6   88.9    63.2   49.4    75.4   71.4    81.1   83.3
noLog         80.7   68.1    90.0   86.6    61.1   45.7    72.4   66.7    80.7   82.1
BN            85.9   76.4    91.9   88.6    58.5   46.0    72.6   67.8    78.5   80.3
Conv39        84.1   74.6    90.8   85.8    62.1   46.2    75.4   70.6    81.6   83.0
BPloss        84.3   63.6    91.2   77.0    44.6   46.2    74.5   61.7    81.1   71.2
MN            84.8   79.4    88.4   86.7    69.7   55.6    79.7   77.4    82.3   84.4
lCQT          79.1   63.5    88.4   83.5    63.2   47.9    74.8   68.1    81.4   81.9
lCQTMN        86.5   79.2    89.1   87.9    70.1   56.3    79.5   75.9    81.6   82.5
BP            84.8   53.7    92.0   26.9    58.0   44.9    70.7   60.4    80.9   76.8
BPfl          86.1   75.2    92.3   89.3    62.2   47.7    76.2   71.5    82.2   84.9
BPMNfl        85.7   80.5    89.3   87.8    69.8   54.0    79.9   77.9    82.8   84.9
OF            82.7   70.7    91.4   86.6    55.9   47.6    67.9   64.4    81.1   80.7
OFfl          82.1   72.3    91.8   90.0    55.5   47.0    70.1   66.7    81.2   83.7
OFMNfl        84.6   76.7    87.9   88.6    65.5   56.7    76.3   74.1    81.1   86.1
Table 3

Comparison of timbre‑separated transcription. Unless stated otherwise, models are trained on MusicNet with the InfoNCE loss (Eq. 8). ‘D16’: D=16; ‘MSE’: using L_affinity (Eq. 6); ‘Syn’: trained on our synthetic dataset; ‘Rescale’: forcibly scaling amplitude using the Frame prediction before InstanceNorm in the timbre‑encoding branch; ‘Share’: sharing the first residual block between the two branches; and ‘Tanaka’: baseline model (Tanaka et al., 2020) using InfoNCE. All experiments use identical pretrained AMT branch parameters except ‘Share’.

Model       BACH10 2 mix          BACH10 3 mix          BACH10 4 mix          URMP 2 mix            URMP 3 mix
Metric (%)  FF     SF     Sratio  FF     SF     Sratio  FF     SF     Sratio  FF     SF     Sratio  FF     SF     Sratio
Ours        83.4   84.6   99.0    77.8   80.1   96.7    66.2   72.7   89.9    68.9   66.8   82.6    58.5   60.0   77.1
D16         83.2   84.4   98.8    76.8   79.5   96.0    64.5   70.4   87.1    69.1   68.4   84.7    58.0   59.5   76.5
MSE         80.4   82.2   96.3    70.8   75.7   91.4    59.0   67.9   84.0    65.4   65.4   80.9    53.9   57.1   73.5
Syn         72.4   78.1   91.4    58.7   68.0   82.1    46.0   56.1   69.4    49.6   53.9   66.7    40.9   45.1   58.1
Rescale     83.4   84.6   99.1    76.8   79.6   96.2    63.6   70.9   87.7    68.4   66.8   82.7    58.6   59.8   76.9
Share       82.1   83.0   98.4    76.1   78.9   96.6    68.5   72.8   91.0    69.0   68.3   84.6    57.2   57.5   74.1
Tanaka      77.9   79.6   93.2    66.1   69.0   83.4    55.4   59.2   73.2    65.5   64.1   79.3    56.5   56.5   72.8
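The InfoNCE loss referenced above (Eq. 8 in the paper, not reproduced in this section) pulls together embeddings from the same timbre and pushes apart the rest. A generic sketch in its standard form follows, with the pairing scheme and temperature as assumptions:

```python
import torch
import torch.nn.functional as F

def info_nce(anchors, positives, temperature=0.1):
    """Standard InfoNCE over timbre embeddings (assumed form). Row i of
    `positives` shares a source/timbre with row i of `anchors`; all other
    rows serve as negatives. Shapes: (n, D)."""
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.T / temperature                 # (n, n) cosine similarities
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)         # match row i to column i
```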
Figure 5

T‑distributed stochastic neighbor embedding (t‑SNE) visualization of timbre embeddings. (a) Frame‑level embeddings for BACH10 Piece 2, (b) note‑level aggregates of (a), (c) frame‑level for URMP Piece 18, and (d) frame‑level for URMP Piece 18 using top‑k attention.
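A sketch of how such a projection is typically produced with scikit-learn; the input array and perplexity are placeholders, and note-level aggregates (panel b) would average the frame embeddings belonging to each note before projection.

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder embeddings: (num_active_bins, D) timbre vectors.
embeddings = np.random.rand(500, 32).astype(np.float32)
coords = TSNE(n_components=2, perplexity=30.0).fit_transform(embeddings)
# coords: (500, 2) points to scatter-plot, colored by instrument label.
```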

DOI: https://doi.org/10.5334/tismir.300 | Journal eISSN: 2514-3298
Language: English
Submitted on: Jun 28, 2025
Accepted on: Mar 25, 2026
Published on: Apr 15, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Ruigang Li, Yongxu Zhu, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.