GOLF: A Singing Voice Synthesiser with Glottal Flow Wavetables and LPC Filters

By: Chin-Yun Yu and György Fazekas
Open Access | Dec 2024

Figures & Tables

Figure 1

An example of the wavetables we use (before normalisation and peak alignment), corresponding to matrix with .

Figure 2

Overview of the GOLF synthesis process.

Table 1

Evaluation results on the test set. Lower is better. Standard deviations below 0.01 are omitted.

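| Singer | Model | MSSTFT | MAE-f0 (cent) | FAD |
| --- | --- | --- | --- | --- |
| f1 | DDSP | 3.09 | 74.47 ± 1.19 | 0.47 ± 0.03 |
| f1 | SawSing | 3.12 | 78.91 ± 1.18 | 0.32 ± 0.01 |
| f1 | GOLF | 3.21 | 77.06 ± 0.88 | 0.59 ± 0.04 |
| f1 | PULF | 3.27 | 76.90 ± 1.11 | 0.76 ± 0.03 |
| m1 | DDSP | 3.12 | 52.95 ± 1.03 | 0.56 ± 0.02 |
| m1 | SawSing | 3.13 | 56.46 ± 1.04 | 0.44 ± 0.01 |
| m1 | GOLF | 3.26 | 54.09 ± 0.30 | 0.74 ± 0.02 |
| m1 | PULF | 3.35 | 54.60 ± 0.73 | 1.26 ± 0.03 |

Table 2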

The amount of VRAM (GB) required for training with a batch size of 32, and the real-time factor (RTF) on GPU and CPU.

| Model | Memory (GB) | RTF (GPU) | RTF (CPU) |
| --- | --- | --- | --- |
| DDSP | 7.3 | 0.015 | 0.237 |
| SawSing | 7.3 | 0.015 | 0.240 |
| GOLF | 2.6 | 0.009 | 0.023 |
| PULF | 7.5 | 0.015 | 0.248 |

Figure 3

The MUSHRA results of the vocoders trained on different singers, with 95% confidence intervals.

Table 3

The L2 loss between the predicted waveforms and the ground truth.

| | DDSP | SawSing | GOLF | PULF |
| --- | --- | --- | --- | --- |
| Min | 71.83 | 75.72 | 21.98 | 44.08 |
| Max | 88.77 | 93.16 | 64.82 | 70.59 |

Figure 4

The predicted waveforms of a short segment from one of the m1 test samples (77 s to 84 s, m1_003).

Figure 5

The average spectra of the harmonics and noise components over the test set, further smoothed using cepstrum spectral analysis. The FFT size is 1024, and the hop size is 256.
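As a rough illustration of the kind of processing this caption describes, the sketch below averages magnitude spectra over frames and then smooths the log spectrum by keeping only low-quefrency cepstral coefficients. Only the FFT size (1024) and hop size (256) come from the caption. The Hann window, averaging magnitudes before taking the log, and the cutoff `n_keep` are assumptions made for illustration, not the authors' settings.

```python
import numpy as np

def average_log_spectrum(x, n_fft=1024, hop=256):
    """Log of the frame-averaged magnitude spectrum (assumes len(x) >= n_fft)."""
    window = np.hanning(n_fft)  # window choice is an assumption
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    mean_mag = np.abs(np.fft.rfft(frames, axis=-1)).mean(axis=0)
    return np.log(mean_mag + 1e-8)  # shape: (n_fft // 2 + 1,)

def cepstral_smooth(log_spec, n_keep=30):
    """Smooth a log spectrum by zeroing cepstral coefficients above n_keep.

    n_keep is a hypothetical cutoff; smaller values give a smoother envelope.
    """
    n_fft = 2 * (len(log_spec) - 1)
    cepstrum = np.fft.irfft(log_spec, n=n_fft)  # real, symmetric cepstrum
    lifter = np.zeros(n_fft)
    lifter[:n_keep] = 1.0                 # low quefrencies
    lifter[n_fft - n_keep + 1 :] = 1.0    # their mirrored counterparts
    return np.fft.rfft(cepstrum * lifter).real
```

Calling `cepstral_smooth(average_log_spectrum(signal))` on the harmonic and noise signals separately would give two smoothed envelopes that can be plotted against frequency.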

Table 4

The cosine distance between harmonics and noise magnitude spectrograms of each baseline on the test data (the distance is computed only on the frequency bins below 6 kHz to avoid the frequency cut issues discussed in Section 7.2).

| Singer | DDSP | SawSing | GOLF | PULF |
| --- | --- | --- | --- | --- |
| f1 | 0.945 | 0.933 | 0.948 | 0.954 |
| m1 | 0.911 | 0.876 | 0.937 | 0.941 |

Figure 6

The portion of time occurrence of real poles of each ‑order filter in the LPC filters. For , only the frames when are considered. We sort the filters on the basis of their portion of real poles.
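A statistic of this kind could be computed along the following lines. This is a minimal sketch that assumes each filter is a second-order all-pole section with denominator `1 + a1 z^-1 + a2 z^-2`, whose poles are real exactly when the discriminant `a1^2 - 4 a2` is non-negative. The function names and the per-frame `(a1, a2)` input layout are hypothetical.

```python
def has_real_poles(a1, a2):
    """Poles of 1 / (1 + a1 z^-1 + a2 z^-2) are the roots of z^2 + a1 z + a2."""
    return a1 * a1 - 4.0 * a2 >= 0.0

def real_pole_proportion(frames):
    """Fraction of frames whose section has real poles.

    frames: sequence of (a1, a2) coefficient pairs, one per time frame.
    """
    if not frames:
        return 0.0
    n_real = sum(1 for a1, a2 in frames if has_real_poles(a1, a2))
    return n_real / len(frames)
```

Restricting the statistic to a subset of frames, as the caption does for one of the models, would amount to filtering `frames` before the call.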

Figure 7

The mean and standard deviation of the resonant frequencies and resonant gains of each ‑order filter in the LPC filters. cmplx means filters contain real poles less than 10% of the time and cmplx + real means the opposite. The statistics are computed only on the frames when the poles are complex conjugate. For , only the frames when are considered.
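For a section with complex-conjugate poles, the resonant frequency follows from the pole angle. The sketch below again assumes a second-order all-pole section `1 / (1 + a1 z^-1 + a2 z^-2)`, and `sample_rate` is a hypothetical parameter. The caption does not define the resonant gain, so the sketch reports the pole radius instead, which controls how sharp the resonance is.

```python
import cmath
import math

def resonance_of_section(a1, a2, sample_rate):
    """Return (frequency in Hz, pole radius), or None if the poles are real."""
    disc = a1 * a1 - 4.0 * a2
    if disc >= 0.0:
        return None  # real poles: no resonance by this definition
    # Upper-half-plane pole of z^2 + a1 z + a2 = 0.
    pole = (-a1 + cmath.sqrt(disc)) / 2.0
    freq_hz = cmath.phase(pole) * sample_rate / (2.0 * math.pi)
    return freq_hz, abs(pole)
```

For example, a pole pair at radius 0.9 and angle π/4 with an 8 kHz sample rate resonates at 1 kHz. Averaging `freq_hz` over the frames where the result is not `None` would give per-filter statistics like those plotted.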

Figure 8

The trajectories of the poles of three filter sections in the LPC filters of GOLF on a slice of  0.1‑second (20 frames) test data of the f1 singer. Filter sections 2 and 3 correspond to formants and , respectively.

Figure 9

The glottal flows computed using the minimum, maximum and average of the values of the singers, plotted in both the time and frequency domains. The f0 for the frequency plots is set to 100 Hz, and the flows are multiplied by complex sinusoids so the harmonic peaks are separated in the plots.
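The plotting trick mentioned in this caption relies on the fact that multiplying a signal by a complex sinusoid shifts its whole spectrum. A minimal sketch with NumPy is shown below. The 25 Hz offset, the 8 kHz sample rate, and the function names are illustrative assumptions, and a plain cosine stands in for a glottal flow waveform.

```python
import numpy as np

def shift_spectrum(x, shift_hz, sample_rate):
    """Shift the spectrum of x upwards by shift_hz via complex modulation."""
    n = np.arange(len(x))
    return x * np.exp(2j * np.pi * shift_hz * n / sample_rate)

def peak_frequency(x, sample_rate):
    """Frequency (Hz) of the largest peak among the non-negative FFT bins."""
    half = len(x) // 2
    spectrum = np.abs(np.fft.fft(x))[:half]
    return np.argmax(spectrum) * sample_rate / len(x)

sample_rate = 8000
n = np.arange(sample_rate)                      # one second of samples
flow = np.cos(2 * np.pi * 100 * n / sample_rate)  # stand-in with f0 = 100 Hz
shifted = shift_spectrum(flow, 25.0, sample_rate)
```

Giving each curve a different offset would move its harmonic peaks to distinct frequencies, so several spectra sharing the same f0 can be overlaid without their peaks coinciding.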

DOI: https://doi.org/10.5334/tismir.210 | Journal eISSN: 2514-3298
Language: English
Submitted on: Jul 1, 2024
Accepted on: Nov 2, 2024
Published on: Dec 19, 2024
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2024 Chin-Yun Yu, György Fazekas, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.