A Dataset of Larynx Microphone Recordings for Singing Voice Reconstruction

Simon Schwär; Michael Krause; Michael Fast; Sebastian Rosenzweig; Frank Scherbaum; Meinard Müller

doi:10.5334/tismir.166

Figures & Tables

Figure 1

Conceptual overview of singing voice reconstruction (SVR) from larynx microphone (LM) signals. Time-frequency representations of an exemplary LM signal and the corresponding reconstructed signal are depicted in red and grey, respectively.

Table 1

Overview of the songs and takes in LM-SSD. C1 and C0 represent the number of takes with and without crosstalk, respectively. Songs marked with * have German lyrics.

ID	SONG NAME	ORIGINAL ARTIST	SINGER ID	LM-A TAKES (C1)	LM-A TAKES (C0)	LM-B TAKES (C1)	LM-B TAKES (C0)	DURATION (MM:SS)
`AA`	All Alone	Michael Fast	`1M`	1	2	1	2	27:03
`TS`	The Scientist	Coldplay	`1M`	1	2	1	2	21:37
`YF`	Your Fires	All The Luck In The World	`1M`	1	2	1	2	24:21
`DL`	Dezemberluft*	Heisskalt	`2M`	1	2	1	2	14:47
`BB`	Books From Boxes	Maxïmo Park	`2M`	1	2	1	2	17:39
`NB`	Narben*	Alligatoah	`2M`	1	2	1	2	11:47
`SG`	Supergirl	Reamonn	`3F, 1M`	1	2	1	2	26:34
`OC`	One Call Away	Charlie Puth	`3F, 1M`	1	2	1	2	19:32
`PL`	Past Life	Trevor Daniel & Selena Gomez	`3F, 1M`	1	2	1	2	17:45
`CC`	Chasing Cars	Snow Patrol	`4F`	1	2	1	2	28:10
`BT`	Breakfast At Tiffany’s	Deep Blue Something	`4F`	1	2	1	2	22:16
`LL`	Little Lion Man	Mumford & Sons	`4F`	1	2	1	2	19:06
Total				12	24	12	24	250:37
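The totals in the last row of Table 1 can be rechecked with a few lines of Python. This is only a sketch: the durations are copied from the table, and the names `DURATIONS`, `to_seconds`, and `to_mmss` are illustrative and not part of the dataset.

```python
# Durations (MM:SS) copied from Table 1, keyed by song ID.
DURATIONS = {
    "AA": "27:03", "TS": "21:37", "YF": "24:21", "DL": "14:47",
    "BB": "17:39", "NB": "11:47", "SG": "26:34", "OC": "19:32",
    "PL": "17:45", "CC": "28:10", "BT": "22:16", "LL": "19:06",
}


def to_seconds(mmss: str) -> int:
    """Convert an 'MM:SS' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def to_mmss(total_seconds: int) -> str:
    """Convert seconds back to 'MM:SS' (minutes may exceed 59)."""
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d}"


total_duration = to_mmss(sum(to_seconds(d) for d in DURATIONS.values()))

# Per song and LM model: 1 take with crosstalk (C1) and 2 without (C0).
total_takes = len(DURATIONS) * (1 + 2) * 2
```

Both results agree with the Total row (12 + 24 + 12 + 24 takes) and with the `UID` range given in Table 2.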

Figure 2

Photograph of the recording setup (top) and detailed depiction of the LMs used (bottom). `LM-A`: Albrecht AE-38-S2a larynx microphone; `LM-B`: self-made larynx microphone with TE Connectivity CM-01B sensor; `CM`: close-up microphone (Neumann U87); `GP`: guitar pickup (AMG Electronics C-Ducer); `GL/GR`: guitar stereo left/right (AKG C414).

Figure 3

Relative transfer function (RTF) estimates with respect to `CM` for `LM-A` (top) and `LM-B` (bottom). RTF estimates for individual singers are shown in grey (`1M`: solid, `2M`: dashed, `3F`: dotted, `4F`: dash-dotted). The black line indicates the mean RTF across singers for each LM model.

Figure 4

Coherence estimates with respect to `CM` for `LM-A` (top) and `LM-B` (bottom). Coherence estimates for individual singers are shown in grey (`1M`: solid, `2M`: dashed, `3F`: dotted, `4F`: dash-dotted). The black line indicates the mean coherence across singers for each LM model.
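As a rough illustration of how RTF and coherence curves of this kind could be computed, the following sketch uses Welch-style spectral estimates from SciPy. The captions do not state which estimator was used, so the H1 estimator (cross-spectrum divided by the reference auto-spectrum), the segment length, and the synthetic test signals are all assumptions; `rtf_and_coherence` is a hypothetical helper name.

```python
import numpy as np
from scipy import signal


def rtf_and_coherence(x_cm, x_lm, fs, nperseg=2048):
    """Estimate the RTF of x_lm relative to x_cm and their coherence.

    Assumption: H1 estimator, i.e. cross-spectrum over the auto-spectrum
    of the reference (CM) signal. Not necessarily the paper's method.
    """
    freqs, p_cc = signal.welch(x_cm, fs=fs, nperseg=nperseg)
    _, p_cl = signal.csd(x_cm, x_lm, fs=fs, nperseg=nperseg)
    _, coh = signal.coherence(x_cm, x_lm, fs=fs, nperseg=nperseg)
    rtf = p_cl / np.maximum(p_cc, 1e-20)  # guard against division by zero
    return freqs, rtf, coh


# Synthetic stand-in signals (NOT data from LM-SSD): the "LM" signal is a
# low-pass filtered copy of the "CM" signal plus a little sensor noise.
fs = 16000
rng = np.random.default_rng(0)
x_cm = rng.standard_normal(4 * fs)
b, a = signal.butter(4, 1000, btype="low", fs=fs)
x_lm = signal.lfilter(b, a, x_cm) + 0.01 * rng.standard_normal(x_cm.size)

freqs, rtf, coh = rtf_and_coherence(x_cm, x_lm, fs)
rtf_db = 20 * np.log10(np.abs(rtf) + 1e-12)  # magnitude in dB for plotting
```

With real recordings, `x_cm` and `x_lm` would be time-aligned `CM` and LM signals of the same take, and averaging the per-singer curves would give a mean curve like the black line in the figures.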

Table 2

Dataset dimensions and naming scheme.

FIELD	DESCRIPTION	VALUES
`UID`	Unique numerical identifier for a take across songs	`001 – 072`
`SongID`	Two-letter abbreviation of the song	`cf. Table 1`
`Type`	Microphone type or mix setting	`LM-A, LM-B, CM, GP, GL, GR, MixA, MixB`
`Crosstalk`	Whether guitar crosstalk is present on `CM` (`C1`) or not (`C0`)	`C1, C0`
`Singer`	Singer identifier (with gender)	`1M, 2M, 3F, 4F`
`Take`	Take number for the given song (`T1-T3 use LM-A, T4-T6 use LM-B`)	`T1 – T6`
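The fields of Table 2 can be modelled as a small validated record. The sketch below is hypothetical: the class `TakeInfo`, its attribute names, and the consistency check between `Type` and `Take` are illustrative readings of the table and are not an API shipped with LM-SSD. How the fields are joined into file names is not specified here, so no file-name parsing is attempted.

```python
from dataclasses import dataclass

# Allowed values, copied from Tables 1 and 2.
SONG_IDS = {"AA", "TS", "YF", "DL", "BB", "NB",
            "SG", "OC", "PL", "CC", "BT", "LL"}
MIC_TYPES = {"LM-A", "LM-B", "CM", "GP", "GL", "GR", "MixA", "MixB"}
CROSSTALK = {"C1", "C0"}
SINGERS = {"1M", "2M", "3F", "4F"}


@dataclass(frozen=True)
class TakeInfo:
    """One recording described by the fields of Table 2 (hypothetical)."""
    uid: int        # 1..72
    song_id: str    # two-letter abbreviation, cf. Table 1
    mic_type: str   # the `Type` field
    crosstalk: str  # "C1" or "C0"
    singer: str     # e.g. "3F"
    take: int       # 1..6, i.e. T1..T6

    def __post_init__(self):
        if not 1 <= self.uid <= 72:
            raise ValueError(f"UID out of range: {self.uid}")
        if self.song_id not in SONG_IDS:
            raise ValueError(f"unknown SongID: {self.song_id}")
        if self.mic_type not in MIC_TYPES:
            raise ValueError(f"unknown Type: {self.mic_type}")
        if self.crosstalk not in CROSSTALK:
            raise ValueError(f"unknown Crosstalk: {self.crosstalk}")
        if self.singer not in SINGERS:
            raise ValueError(f"unknown Singer: {self.singer}")
        if not 1 <= self.take <= 6:
            raise ValueError(f"Take out of range: {self.take}")
        # Assumption based on Table 2: an LM track only exists for takes
        # recorded with that LM model.
        if self.mic_type in {"LM-A", "LM-B"} and self.mic_type != self.lm_model:
            raise ValueError("LM type does not match take number")

    @property
    def lm_model(self) -> str:
        """LM model used in this take: T1-T3 use LM-A, T4-T6 use LM-B."""
        return "LM-A" if self.take <= 3 else "LM-B"


example = TakeInfo(uid=5, song_id="AA", mic_type="LM-B",
                   crosstalk="C0", singer="1M", take=5)
```

The `uid` value in the example is arbitrary; the mapping from `UID` to song and take is not given in the table.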

Figure 5

Architecture of the DDSP-based baseline system. Blue denotes differentiable DSP building blocks, yellow denotes NN building blocks with learnable parameters, and white denotes fixed pre-processing steps. Dashed arrows indicate the flow of control parameters, while solid lines indicate the flow of audio signals. The spectrograms show the signal content at the indicated positions in the signal flow diagram. The example shown uses an excerpt from the `LM-B` signal of song `AA T5` as the input signal x_LM and a corresponding model trained with the OF scenario (see Section 6).

Table 3

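Word error rate (WER) of lyrics transcription with the Whisper (Radford et al., 2022) medium model for a selection of songs from LM-SSD. CM: close-up microphone; LM: larynx microphone; OF, DT, DS: Overfitting, Different Take, and Different Song training scenarios. Song `DL` uses the dedicated German Whisper model.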

		WER (%)
SONG	SINGER	CM	LM	OF	DT	DS
`AA`	`1M`	1.83	72.56	1.83	22.56	20.73
`TS`	`1M`	2.82	31.69	2.82	24.65	33.10
`DL`	`2M`	2.16	10.81	2.70	5.95	7.03
`BB`	`2M`	7.40	11.25	9.65	11.90	21.22
`SG`	`3F`	3.70	11.11	5.76	11.11	57.61
`OC`	`3F`	3.31	84.30	4.96	12.40	58.68
`CC`	`4F`	0.49	92.65	0.49	29.41	91.67
`LL`	`4F`	1.98	85.71	1.98	15.87	69.44
	Average	3.27	49.05	4.25	15.89	46.13
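For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between a transcription and the reference lyrics, divided by the number of reference words. The sketch below is a minimal implementation of that standard definition. The lowercasing and whitespace tokenization are assumptions; the text normalization used for Table 3 is not described here, and `word_error_rate` is an illustrative name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the WER in percent between a reference and a hypothesis.

    Assumption: words are compared after lowercasing and splitting on
    whitespace; punctuation handling is left to the caller.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only one previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substitution (b -> x) and one deletion (d) over four reference words.
wer_example = word_error_rate("a b c d", "a x c")
```

Because insertions count as errors, the WER can exceed 100% when a transcription contains many spurious words.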

Figure 6

Listening test results according to stimulus and singer ID. LM: Larynx Microphone; NA: Naive Approach (linear filtering); OF, DT, DS: Overfitting, Different Take, and Different Song training scenarios; HR: Hidden Reference (CM signal).


