
Figure 1
Overview of the proposed method. Top: overall pipeline, which takes a multi‑timbral audio mixture as input and outputs note events for each constituent timbre. Bottom left: AMT branch producing timbre‑agnostic transcription outputs, namely the frame‑activation posteriorgram and the onset‑activation posteriorgram. Bottom right: timbre‑encoding branch yielding a D‑dimensional timbre embedding for each time–frequency bin within the target pitch range.

Figure 2
Training outcomes using Focal Loss with varying positive class weights. Each column represents a training session from initialization to convergence.
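The positive‑class weighting examined in Figure 2 can be sketched as follows. This is a minimal NumPy illustration of focal binary cross‑entropy, not the paper's exact implementation; the weight `alpha` and focusing parameter `gamma` are assumed hyperparameters.

```python
import numpy as np

def focal_bce(p, y, alpha=0.9, gamma=2.0):
    """Focal binary cross-entropy with positive-class weight alpha (sketch).

    p: predicted posteriorgram values in (0, 1); y: binary targets.
    alpha weights the rare positive (note-active) bins and (1 - alpha)
    the abundant negatives; gamma down-weights easy, well-classified bins.
    """
    eps = 1e-7
    p = np.clip(p, eps, 1.0 - eps)
    pos = -alpha * (1.0 - p) ** gamma * y * np.log(p)
    neg = -(1.0 - alpha) * p ** gamma * (1.0 - y) * np.log(1.0 - p)
    return float((pos + neg).mean())
```

Raising `alpha` increases the penalty on missed positives, which is the axis varied across the columns of Figure 2.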

Figure 3
Results of frame‑level and note‑level postprocessing for triple separation.
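The frame‑ and note‑level postprocessing compared in Figure 3 can be illustrated with a toy onset‑anchored decoder. This is a hypothetical sketch, not the paper's exact procedure; the thresholds `on_th` and `fr_th` are assumed values.

```python
import numpy as np

def decode_notes(onset, frame, on_th=0.5, fr_th=0.5):
    """Toy note-level decoding from posteriorgrams (sketch).

    onset, frame: (T, P) activation posteriorgrams. A note starts where
    the onset activation crosses on_th and is sustained while the frame
    activation stays above fr_th. Returns (start, end, pitch) tuples in
    frame units, with end exclusive.
    """
    T, P = frame.shape
    notes = []
    for p in range(P):
        t = 0
        while t < T:
            if onset[t, p] >= on_th and frame[t, p] >= fr_th:
                start = t
                while t < T and frame[t, p] >= fr_th:
                    t += 1
                notes.append((start, t, p))
            else:
                t += 1
    return notes
```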
Table 1
Datasets used in experiments. ‘Our:AMT’ and ‘Our:Sep’ are our synthetic datasets, created to examine whether human composition and real recordings are indispensable. The latter’s timbres are grouped into 10 classes; during training, mixing samples across classes yields multi‑timbre pieces from the single‑timbre samples. Bold datasets are used for training.
| Dataset | Dur. | Songs | Instr. | K/Song |
|---|---|---|---|---|
| MusicNet | 34 h | 330 | 11 | 1–8 |
| BACH10 | 334 s | 10 | 4 | 4 |
| PHENICX | 637 s | 4 | 10 | 8–10 |
| URMP | 1.3 h | 44 | 14 | 2–4 |
| Our:AMT | 24 h | 8316 | 33 | 1 |
| Our:Sep | 836 s | 120 | 34 | 1 |

Figure 4
Randomly generated piano roll (left) and the corresponding CQT spectrogram of the audio synthesized with a trumpet timbre (right).
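A random piano roll like the one in Figure 4 can be generated in a few lines. This is a hypothetical sketch of the synthetic‑data generation; all sizes and note counts are assumed defaults, not the paper's actual settings.

```python
import numpy as np

def random_piano_roll(n_frames=256, n_pitches=88, n_notes=12, seed=0):
    """Stamp n_notes notes with random pitch, onset, and duration into
    a binary (n_pitches, n_frames) piano roll (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    roll = np.zeros((n_pitches, n_frames), dtype=np.int8)
    for _ in range(n_notes):
        pitch = rng.integers(0, n_pitches)
        onset = rng.integers(0, n_frames - 1)
        dur = rng.integers(4, 32)
        roll[pitch, onset:onset + dur] = 1  # slice clamps at the edge
    return roll
```

Rendering each active note with an instrument sample would then yield the synthesized audio whose CQT is shown on the right of Figure 4.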
Table 2
Comparison of timbre‑agnostic transcription. Unless stated otherwise, models are trained on our synthetic dataset. ‘noLog’: EnergyNorm without the log; ‘BN’: BatchNorm replacing EnergyNorm (BasicPitch‑style); ‘Conv39’: BasicPitch’s 39‑tap convolution replacing our dilated convolution; ‘BPloss’: BasicPitch’s loss; ‘MN’: trained on MusicNet; ‘lCQT’: learnable CQT; ‘lCQT‑MN’: learnable CQT trained on MusicNet; ‘BP’: re‑implementation of BasicPitch (Bittner et al., 2022); ‘BP‑FL’: BP with focal loss; ‘BP‑FL‑MN’: BP with focal loss trained on MusicNet; ‘OF’: re‑implementation of Onsets&Frames (Hawthorne et al., 2018) with its original BCE loss; ‘OF‑FL’: OF with focal loss; ‘OF‑FL‑MN’: OF with focal loss trained on MusicNet. In headers, ‘tutti’ = full‑mix polyphonic pieces; ‘stems’ = single‑instrument samples; ‘FF’ and ‘FN’ denote frame‑level and note‑level F‑measures.
| Model | BACH10 tutti FF | BACH10 tutti FN | BACH10 stems FF | BACH10 stems FN | PHENICX FF | PHENICX FN | URMP tutti FF | URMP tutti FN | URMP stems FF | URMP stems FN |
|---|---|---|---|---|---|---|---|---|---|---|
| Ours | 84.6 | 75.2 | 91.6 | 88.9 | 63.2 | 49.4 | 75.4 | 71.4 | 81.1 | 83.3 |
| noLog | 80.7 | 68.1 | 90.0 | 86.6 | 61.1 | 45.7 | 72.4 | 66.7 | 80.7 | 82.1 |
| BN | 85.9 | 76.4 | 91.9 | 88.6 | 58.5 | 46.0 | 72.6 | 67.8 | 78.5 | 80.3 |
| Conv39 | 84.1 | 74.6 | 90.8 | 85.8 | 62.1 | 46.2 | 75.4 | 70.6 | 81.6 | 83.0 |
| BPloss | 84.3 | 63.6 | 91.2 | 77.0 | 44.6 | 46.2 | 74.5 | 61.7 | 81.1 | 71.2 |
| MN | 84.8 | 79.4 | 88.4 | 86.7 | 69.7 | 55.6 | 79.7 | 77.4 | 82.3 | 84.4 |
| lCQT | 79.1 | 63.5 | 88.4 | 83.5 | 63.2 | 47.9 | 74.8 | 68.1 | 81.4 | 81.9 |
| lCQT‑MN | 86.5 | 79.2 | 89.1 | 87.9 | 70.1 | 56.3 | 79.5 | 75.9 | 81.6 | 82.5 |
| BP | 84.8 | 53.7 | 92.0 | 26.9 | 58.0 | 44.9 | 70.7 | 60.4 | 80.9 | 76.8 |
| BP‑FL | 86.1 | 75.2 | 92.3 | 89.3 | 62.2 | 47.7 | 76.2 | 71.5 | 82.2 | 84.9 |
| BP‑FL‑MN | 85.7 | 80.5 | 89.3 | 87.8 | 69.8 | 54.0 | 79.9 | 77.9 | 82.8 | 84.9 |
| OF | 82.7 | 70.7 | 91.4 | 86.6 | 55.9 | 47.6 | 67.9 | 64.4 | 81.1 | 80.7 |
| OF‑FL | 82.1 | 72.3 | 91.8 | 90.0 | 55.5 | 47.0 | 70.1 | 66.7 | 81.2 | 83.7 |
| OF‑FL‑MN | 84.6 | 76.7 | 87.9 | 88.6 | 65.5 | 56.7 | 76.3 | 74.1 | 81.1 | 86.1 |
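The frame‑level scores reported above follow the standard precision/recall/F‑measure definitions over binary piano rolls (the note‑level metric additionally requires matched onsets). A minimal sketch of the frame‑level computation, assuming the standard F1 definition rather than the paper's exact evaluation code:

```python
import numpy as np

def frame_f_measure(est, ref):
    """Frame-level F-measure between binary piano rolls (sketch).

    est, ref: boolean/0-1 arrays of identical shape (pitches x frames).
    Counts per-bin true/false positives and false negatives, then
    combines precision and recall into F1.
    """
    est = np.asarray(est, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    tp = np.logical_and(est, ref).sum()
    fp = np.logical_and(est, ~ref).sum()
    fn = np.logical_and(~est, ref).sum()
    if tp == 0:
        return 0.0
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)
```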
Table 3
Comparison of timbre‑separated transcription. Unless stated otherwise, models are trained on MusicNet with the InfoNCE loss (8). ‘D16’: timbre‑embedding dimension reduced to 16; ‘MSE’: MSE loss using (6) instead of InfoNCE; ‘Syn’: trained on our synthetic dataset; ‘Rescale’: forcibly rescaling the amplitude using the Frame prediction before InstanceNorm in the timbre‑encoding branch; ‘Share’: sharing the first residual block between the two branches; ‘Tanaka’: baseline model (Tanaka et al., 2020) using InfoNCE. All experiments use identical pretrained AMT‑branch parameters except ‘Share.’
| Model | BACH10 2‑mix FFS | BACH10 2‑mix FS | BACH10 2‑mix ratio | BACH10 3‑mix FFS | BACH10 3‑mix FS | BACH10 3‑mix ratio | BACH10 4‑mix FFS | BACH10 4‑mix FS | BACH10 4‑mix ratio | URMP 2‑mix FFS | URMP 2‑mix FS | URMP 2‑mix ratio | URMP 3‑mix FFS | URMP 3‑mix FS | URMP 3‑mix ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ours | 83.4 | 84.6 | 99.0 | 77.8 | 80.1 | 96.7 | 66.2 | 72.7 | 89.9 | 68.9 | 66.8 | 82.6 | 58.5 | 60.0 | 77.1 |
| D16 | 83.2 | 84.4 | 98.8 | 76.8 | 79.5 | 96.0 | 64.5 | 70.4 | 87.1 | 69.1 | 68.4 | 84.7 | 58.0 | 59.5 | 76.5 |
| MSE | 80.4 | 82.2 | 96.3 | 70.8 | 75.7 | 91.4 | 59.0 | 67.9 | 84.0 | 65.4 | 65.4 | 80.9 | 53.9 | 57.1 | 73.5 |
| Syn | 72.4 | 78.1 | 91.4 | 58.7 | 68.0 | 82.1 | 46.0 | 56.1 | 69.4 | 49.6 | 53.9 | 66.7 | 40.9 | 45.1 | 58.1 |
| Rescale | 83.4 | 84.6 | 99.1 | 76.8 | 79.6 | 96.2 | 63.6 | 70.9 | 87.7 | 68.4 | 66.8 | 82.7 | 58.6 | 59.8 | 76.9 |
| Share | 82.1 | 83.0 | 98.4 | 76.1 | 78.9 | 96.6 | 68.5 | 72.8 | 91.0 | 69.0 | 68.3 | 84.6 | 57.2 | 57.5 | 74.1 |
| Tanaka | 77.9 | 79.6 | 93.2 | 66.1 | 69.0 | 83.4 | 55.4 | 59.2 | 73.2 | 65.5 | 64.1 | 79.3 | 56.5 | 56.5 | 72.8 |
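The InfoNCE loss (8) used to train the timbre embeddings can be sketched for a single anchor. This is an illustrative NumPy version, not the paper's implementation; the temperature `tau` and cosine similarity are assumed choices. Embeddings of the same instrument (anchor/positive pairs) are pulled together relative to cross‑instrument negatives.

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss for one anchor timbre embedding (sketch).

    anchor, positive: (D,) embeddings from the same timbre;
    negatives: (N, D) embeddings from other timbres. The positive's
    similarity is contrasted against the negatives via a softmax
    with temperature tau.
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    logits = np.array([cos(anchor, positive)]
                      + [cos(anchor, n) for n in negatives]) / tau
    logits -= logits.max()  # numerical stability before exponentiation
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(-np.log(probs[0]))  # low when positive outranks negatives
```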

Figure 5
t‑distributed stochastic neighbor embedding (t‑SNE) visualization of timbre embeddings. (a) Frame‑level embeddings for BACH10 Piece 2; (b) note‑level aggregates of (a); (c) frame‑level embeddings for URMP Piece 18; (d) frame‑level embeddings for URMP Piece 18 using top‑k attention.
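The note‑level points in panel (b) are aggregates of the frame‑level embeddings in panel (a). A minimal sketch of such aggregation, assuming simple mean pooling over each note's frame span (the paper may use a different pooling):

```python
import numpy as np

def aggregate_note_embeddings(emb, notes):
    """Mean-pool frame-level timbre embeddings over each note (sketch).

    emb: (T, D) frame-level embeddings; notes: list of (start, end)
    frame spans, end exclusive. Returns one (D,) embedding per note,
    stacked into an (n_notes, D) array.
    """
    return np.stack([emb[s:e].mean(axis=0) for s, e in notes])
```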
