
Figures & Tables

Figure 1

Statistics of movies in CDXDB23.

Table 1

Final Leaderboard A (models trained only on DnR; top 5).

Rank | Participant | Global SDR (dB): Mean / Dialogue / Effects / Music | Submissions to Ldb A: 1st phase / 2nd phase | Code
Submissions
1. | aim-less | 4.345 / 7.981 / 1.217 / 3.837 | 36 / 32 | Code⁸
2. | mp3d | 4.237 / 8.484 / 1.622 / 2.607 | 4 / 2 | Code⁹
3. | subatomicseer | 4.144 / 7.178 / 2.820 / 2.433 | 65 / 22 | Code¹⁰
4. | thanatoz | 3.871 / 8.948 / 1.224 / 1.442 | 21 / 22 | –
5. | kuielab | 3.537 / 7.687 / 0.449 / 2.474 | 36 / 15 | –
Baseline
– | Scaled Identity, ŝ^j(n) = (1/3) x(n) | −0.019 / 1.562 / −1.236 / −0.383 | – | –
– | Cocktail-Fork (Petermann et al., 2022) | 2.491 / 7.321 / −1.049 / 1.200 | – | –
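
The Scaled Identity baseline above simply returns one third of the mixture for every stem, ŝ^j(n) = (1/3) x(n). As an illustration of how such a baseline is scored, the sketch below computes a full-clip, energy-ratio SDR per stem and averages it over the three stems; this is our reading of the challenge's global SDR, and the toy signals are synthetic stand-ins rather than CDXDB23 data.

```python
import numpy as np

def global_sdr(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-10) -> float:
    """Global SDR in dB over a whole clip (energy ratio, no windowing)."""
    num = np.sum(reference ** 2)
    den = np.sum((reference - estimate) ** 2)
    return 10.0 * np.log10((num + eps) / (den + eps))

# Toy example: three mono stems (dialogue, effects, music) and their mixture.
rng = np.random.default_rng(0)
stems = {name: rng.standard_normal(44100) for name in ("dialogue", "effects", "music")}
mixture = sum(stems.values())

# Scaled-identity baseline: every stem estimate is one third of the mixture.
estimates = {name: mixture / 3.0 for name in stems}

per_stem = {name: global_sdr(stems[name], estimates[name]) for name in stems}
print(per_stem, "mean:", np.mean(list(per_stem.values())))
```
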
Table 2

Final Leaderboard B (models trained on any data; top 5).

Rank | Participant | Global SDR (dB): Mean / Dialogue / Effects / Music | Submissions to Ldb A + B: 1st phase / 2nd phase | Code
Submissions
1. | JusperLee | 8.181 / 14.619 / 3.958 / 5.966 | 42 / 102 | –
2. | Audioshake | 8.077 / 14.963 / 4.034 / 5.234 | 1 / 97 | –
3. | ZFTurbo | 7.630 / 14.734 / 3.323 / 4.834 | 25 / 131 | Code¹¹
4. | aim-less | 4.345 / 7.981 / 1.217 / 3.837 | 36 / 153 | Code⁸
5. | mp3d | 4.237 / 8.484 / 1.622 / 2.607 | 14 / 8 | Code⁹
Figure 2

Performance of submissions on full CDXDB23 over time.

Figure 3

Analysis of overfitting of global SDR. The y-axis shows the difference between global SDR on the hidden test set and global SDR displayed to the participants (trajectories with negative slope indicate overfitting).
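
The overfitting analysis in Figure 3 boils down to tracking, per team, the gap between the hidden-test global SDR and the SDR shown on the leaderboard, and checking the sign of its trend. A minimal sketch with invented submission trajectories (not actual challenge data):

```python
import numpy as np

def overfitting_slope(hidden_sdr, displayed_sdr) -> float:
    """Slope of (hidden - displayed) SDR over submission index; negative => overfitting."""
    gap = np.asarray(hidden_sdr) - np.asarray(displayed_sdr)
    x = np.arange(len(gap))
    slope, _ = np.polyfit(x, gap, deg=1)
    return slope

# Hypothetical trajectories for one team: the displayed score keeps improving
# while the hidden-test score stalls, so the fitted slope of the gap is negative.
displayed = [3.8, 4.0, 4.2, 4.3, 4.4]
hidden    = [3.7, 3.8, 3.9, 3.9, 3.9]
print(overfitting_slope(hidden, displayed))
```
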

Figure 4

Comparison of the cocktail-fork baseline with winning submissions on both leaderboards for individual movies. For movie “000”, we only have one clip and, hence, the box plot collapses to a horizontal line. Circles represent outliers that lie outside the whiskers of the box plot.

Table 3

Comparison of 2-stem and 3-stem HT demucs models trained on DnR and evaluated on CDXDB23 (Team ZFTurbo).

Model | Global SDR (dB): Mean / Dialogue / Effects / Music
HT demucs trained on 2-stem mix | 7.560 / 14.532 / 3.355 / 4.794
HT demucs trained on 3-stem mix | 6.692 / 14.530 / 3.277 / 2.269
Ensemble of 2- and 3-stem HT demucs | 7.630 / 14.734 / 3.323 / 4.834
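
The exact ensembling scheme behind the last row of Table 3 is not spelled out here; a common minimal variant is a weighted average of the per-stem waveform estimates of the two models. The sketch below assumes that variant, with placeholder `separate_2stem`/`separate_3stem` functions and equal weights standing in for Team ZFTurbo's actual HT demucs models and weighting.

```python
import numpy as np

def ensemble_stems(mixture: np.ndarray, separators, weights) -> dict:
    """Weighted average of per-stem waveform estimates from several separators.

    `separators` are callables mapping a mixture to {"dialogue": ..., "effects": ..., "music": ...}.
    """
    outputs = [sep(mixture) for sep in separators]
    stems = outputs[0].keys()
    return {
        stem: sum(w * out[stem] for w, out in zip(weights, outputs))
        for stem in stems
    }

# Hypothetical separators standing in for the 2-stem and 3-stem HT demucs models.
def separate_2stem(mix):
    return {"dialogue": 0.5 * mix, "effects": 0.25 * mix, "music": 0.25 * mix}

def separate_3stem(mix):
    return {"dialogue": 0.4 * mix, "effects": 0.3 * mix, "music": 0.3 * mix}

mix = np.zeros(44100, dtype=np.float32)
est = ensemble_stems(mix, [separate_2stem, separate_3stem], weights=[0.5, 0.5])
```
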
Table 4

Comparison of single model HT demucs with final ensemble model (Team ZFTurbo).

Model | Global SDR on val1 (dB): Mean / Dialogue / Effects / Music | Global SDR on val2 (dB): Mean / Dialogue / Effects / Music | Global SDR on CDXDB23 (dB): Mean / Dialogue / Effects / Music
HT demucs (single) | 6.387 / 13.887 / 2.781 / 2.494 | 9.634 / 14.151 / 7.740 / 7.012 | 2.602 / 6.650 / 0.648 / 0.507
CDX23 best ensemble model | 8.922 / 14.927 / 3.780 / 8.060 | 7.585 / 9.949 / 6.377 / 6.429 | 7.630 / 14.734 / 3.323 / 4.834
Figure 5

SDR dependencies on the input volume in LUFS for music, dialogue, and effects. A solid line shows SDR values on RED; crosses mark SDR on CDXDB23. Horizontal dashed and dotted lines show SDR for models without converting the volume of the input signal. The MRX model is blue, MRX-C is orange, MRX-C with a Wiener filter is green, and MRX-C with post-processing scaling is red. For MRX-C scaling on CDXDB23, SDR values are available only for effects (Team mp3d).

Table 5

SDR values obtained during testing on RED for MRX, MRX-C, MRX-C with Wiener filter, and MRX-C with scaling. The SDR values in the table are the maximum values over all input volumes (Team mp3d).

Model | Global SDR (dB): Mean / Dialogue / Effects / Music
MRX | 4.38 / 8.38 / 1.72 / 3.02
MRX-C | 4.36 / 8.48 / 1.62 / 2.99
MRX-C Wiener | 4.57 / 8.75 / 1.90 / 3.07
MRX-C scaling | 4.24 / 7.92 / 1.95 / 2.85
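
The volume sweep behind Figure 5 and Table 5 can be reproduced in spirit by measuring the integrated loudness of each mixture and rescaling it to a series of target LUFS values before separation. The sketch below uses the third-party pyloudnorm package for the ITU-R BS.1770 measurement; the `separate` function is a placeholder for MRX/MRX-C, and the chosen target range is an assumption.

```python
import numpy as np
import pyloudnorm as pyln

def loudness_sweep(mixture: np.ndarray, sr: int, separate, targets_lufs):
    """Rescale the mixture to each target loudness and collect separation outputs."""
    meter = pyln.Meter(sr)                          # ITU-R BS.1770 meter
    measured = meter.integrated_loudness(mixture)   # input loudness in LUFS
    results = {}
    for target in targets_lufs:
        scaled = pyln.normalize.loudness(mixture, measured, target)
        results[target] = separate(scaled)
    return results

# Placeholder separator; in the experiment this would be MRX or MRX-C.
def separate(mix):
    return {"dialogue": mix / 3, "effects": mix / 3, "music": mix / 3}

sr = 44100
mixture = 0.1 * np.random.default_rng(0).standard_normal(10 * sr)
outputs = loudness_sweep(mixture, sr, separate, targets_lufs=range(-40, -15, 5))
```
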
Table 6

Loudness and Dynamic Range Compression (DRC) statistics for DnR and CDXDB23.

Metric | Divide and Remaster (DnR): Dialogue / Effects / Music | CDXDB23: Dialogue / Effects / Music
Loudness (LUFS) | −24.4 ± 1.3 / −29.7 ± 1.9 / −31.4 ± 1.8 | −28.4 ± 3.1 / −33.9 ± 8.0 / −33.6 ± 7.1
DRC (dB) | −10.7 ± 0.9 / −5.1 ± 2.4 / −12.6 ± 1.4 | −11.4 ± 1.3 / −10.6 ± 3.7 / −11.2 ± 2.3
Figure 6

Comparison of loudness between DnR and CDXDB23.

Figure 7

Comparison of average equalization between DnR and CDXDB23. Dashed curves give one standard deviation above/below average.
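
The average equalization curves of Figure 7 are long-term spectral statistics per dataset. A hedged sketch of such a comparison: time-average the magnitude STFT of every clip, convert to dB, and report the mean and standard deviation across clips; the STFT parameters and the dB-domain averaging are our assumptions, not necessarily the authors' exact procedure.

```python
import numpy as np
from scipy.signal import stft

def average_equalization(clips, sr: int, n_fft: int = 4096):
    """Per-dataset mean and std (in dB) of time-averaged magnitude spectra."""
    curves = []
    for clip in clips:
        _, _, spec = stft(clip, fs=sr, nperseg=n_fft)
        mag_db = 20 * np.log10(np.mean(np.abs(spec), axis=-1) + 1e-10)
        curves.append(mag_db)
    curves = np.stack(curves)
    return curves.mean(axis=0), curves.std(axis=0)

# Toy clips standing in for a dataset such as DnR or CDXDB23.
rng = np.random.default_rng(0)
clips = [0.1 * rng.standard_normal(10 * 44100) for _ in range(4)]
mean_db, std_db = average_equalization(clips, sr=44100)
```
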

Figure 8

Comparison of average amplitude panning between DnR and CDXDB23. The channel amplitude similarity Ψ(f) can take values 0 ≤ Ψ(f) ≤ 1, where Ψ(f) = 1 refers to panning frequency f to the center, whereas Ψ(f) < 1 denotes panning to either side. Dashed curves give one standard deviation above/below average. Please note that DnR is monaural and, hence, Ψ(f) collapses to a horizontal line at Ψ(f) = 1.

Figure 9

Comparison of average amplitude panning between DnR and CDXDB23. Δ(f) = sign(Ψ_L(f) − Ψ_R(f)) denotes the panning direction, where Δ(f) < 0 refers to panning to the left and Δ(f) > 0 to panning to the right. Dashed curves give one standard deviation above/below average. Please note that DnR is monaural and, hence, Δ(f) collapses to a horizontal line at Δ(f) = 0.
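
Figures 8 and 9 build on per-channel amplitude measures Ψ_L(f) and Ψ_R(f) that are defined in the main text, not here. Purely as an illustration, the sketch below takes them to be time-averaged per-channel magnitude spectra and forms a similarity Ψ(f) as their min/max ratio, which matches the stated properties (Ψ(f) in [0, 1], Ψ(f) = 1 for center-panned or monaural content, Ψ(f) < 1 for side panning); treat these definitions as assumptions.

```python
import numpy as np
from scipy.signal import stft

def channel_amplitude_similarity(stereo: np.ndarray, sr: int, n_fft: int = 4096):
    """Illustrative Psi(f) for a stereo clip of shape (2, samples)."""
    _, _, spec = stft(stereo, fs=sr, nperseg=n_fft)   # shape: (2, freqs, frames)
    psi_l, psi_r = np.mean(np.abs(spec), axis=-1)     # time-averaged magnitudes per channel
    eps = 1e-10
    psi = np.minimum(psi_l, psi_r) / (np.maximum(psi_l, psi_r) + eps)   # in [0, 1]
    # The panning direction of Figure 9, Delta(f) = sign(Psi_L(f) - Psi_R(f)),
    # follows once Psi_L/Psi_R are computed exactly as defined in the main text.
    return psi

# Toy stereo clip panned slightly to the left.
rng = np.random.default_rng(0)
mono = 0.1 * rng.standard_normal(10 * 44100)
stereo = np.stack([1.0 * mono, 0.7 * mono])
psi = channel_amplitude_similarity(stereo, sr=44100)
```
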

Table 7

Results on CDXDB23 for training the cocktail-fork model with adjusted DnR versions where we matched either the average loudness or the average equalization from CDXDB23. “Input norm” refers to the loudness normalization to −27 LUFS introduced with version 1.1 of the cocktail-fork model.

Training Dataset | Global SDR w/o input norm (dB): Mean / Dialogue / Effects / Music | Global SDR w/ input norm (dB): Mean / Dialogue / Effects / Music
DnR | −0.104 / 4.108 / −2.018 / −2.401 | 0.325 / 4.662 / −1.979 / −1.707
DnR w/ adapted loudness | 1.287 / 6.535 / −1.506 / −1.168 | 1.539 / 6.727 / −1.278 / −0.832
DnR w/ adapted equalization | 0.176 / 4.621 / −1.470 / −2.623 | 0.544 / 4.922 / −1.212 / −2.078
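
The “input norm” columns in Table 7 refer to the loudness normalization to −27 LUFS that the cocktail-fork baseline applies to the input mixture from version 1.1 on. A plausible minimal wrapper is sketched below: measure the mixture loudness, apply the gain that brings it to −27 LUFS, separate, and undo the gain on the estimates. The `separate` function is a placeholder, and rescaling the outputs back to the original level is our assumption rather than a documented detail.

```python
import numpy as np
import pyloudnorm as pyln

TARGET_LUFS = -27.0  # input normalization target used from v1.1 of the baseline

def separate_with_input_norm(mixture: np.ndarray, sr: int, separate):
    """Normalize mixture loudness to TARGET_LUFS, separate, then undo the gain."""
    meter = pyln.Meter(sr)
    measured = meter.integrated_loudness(mixture)
    gain = 10.0 ** ((TARGET_LUFS - measured) / 20.0)   # linear gain to reach the target
    estimates = separate(gain * mixture)
    return {stem: est / gain for stem, est in estimates.items()}  # restore original scale

# Placeholder separator standing in for the cocktail-fork (MRX) model.
def separate(mix):
    return {"dialogue": mix / 3, "effects": mix / 3, "music": mix / 3}

sr = 44100
mixture = 0.05 * np.random.default_rng(0).standard_normal(10 * sr)
stems = separate_with_input_norm(mixture, sr, separate)
```
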
DOI: https://doi.org/10.5334/tismir.172 | Journal eISSN: 2514-3298
Submitted on: Aug 22, 2023
Accepted on: Feb 13, 2024
Published on: Apr 17, 2024
Published by: Ubiquity Press

© 2024 Stefan Uhlich, Giorgio Fabbro, Masato Hirano, Shusuke Takahashi, Gordon Wichern, Jonathan Le Roux, Dipam Chakraborty, Sharada Mohanty, Kai Li, Yi Luo, Jianwei Yu, Rongzhi Gu, Roman Solovyev, Alexander Stempkovskiy, Tatiana Habruseva, Mikhail Sukhovei, Yuki Mitsufuji, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.