1 Introduction
Singing voice synthesis (SVS) has attracted substantial research interest, with various techniques developed over the last decades. Early successful SVS systems were usually based on sample concatenation (Bonada et al., 2001; Bonada and Serra, 2007; Bonada et al., 2016; Macon et al., 1997), while parametric systems have become much more prevalent recently. In parametric systems, synthesis is carried out by a vocoder controlled by synthesis parameters generated from a separate acoustic model given musical context factors (e.g., note number, duration, phoneme).
Early systems of this kind used a linear source‑filter model as the vocoder (Hono et al., 2018; Saino et al., 2006). Later on, end‑to‑end models using deep neural networks (DNNs) took over and became the dominant approach for state‑of‑the‑art voice synthesis (Cho et al., 2021). The early successful neural vocoders were based on autoregressive modelling, such as WaveNet (Shen et al., 2018), WaveRNN (Kalchbrenner et al., 2018), and LPCNet (Valin and Skoglund, 2019). Nowadays, popular vocoders commonly use deep generative models such as generative adversarial networks (GANs; Kong et al., 2020; Kumar et al., 2019; Yoneyama et al., 2022), normalising flows (Prenger et al., 2019), diffusion models (Kong et al., 2021; Liu et al., 2022; Takahashi et al., 2023) or neural source‑filter models (Wang et al., 2020). These models are capable of parallel synthesis, running faster than the autoregressive methods. However, mel spectrograms are often chosen as input features for these vocoders, and they are less interpretable than traditional vocoder parameters (e.g., f0, aperiodicity ratios). Moreover, the black‑box nature of these models makes it difficult to manipulate voice characteristics that are not explicitly modelled or to analyse the synthesis process directly. A significant amount of data is also needed to cover various vocal expressions and achieve generalisation.
In contrast, differentiable digital signal processing (DDSP) models (Engel et al., 2020; Hayes et al., 2021; Shan et al., 2022) incorporate existing signal processing operators into neural networks as an inductive bias for sound synthesis and learn their control parameters end to end using differentiable optimisation. For applications where users need flexible control over the synthesis process, the increased interpretability and controllability of DDSP models can outweigh the slight loss in quality due to the limited expressiveness of digital signal processing (DSP) components. In addition, they usually require less data to train and have a lower computational cost for inference, and thus can be deployed in real‑time applications directly without sparsifying model weights (Kalchbrenner et al., 2018).
In the context of SVS, additive synthesis has been proposed by Alonso and Erkut (2021), where the model predicts gains to control multiple sinusoidal oscillators with frequencies corresponding to the fundamental frequency partials. This was later improved using subtractive synthesis, where one sawtooth oscillator generates energy‑decaying harmonics, and a time‑varying filter is used to shape the distribution of harmonics (Wu et al., 2022). Nercessian (2023) proposed a differentiable version of the WORLD vocoder (Morise et al., 2016) for end‑to‑end singing voice conversion. Yoshimura et al. (2023b) use Taylor expansion to approximate the exponential function of the mel‑log spectrum approximation filter (MLSA) and embed it into an SVS system. Other tasks regarding singing voice, such as source separation (Schulze‑Forster et al., 2023) and enhancement (Schwär et al., 2024), have also been explored using DDSP.
However, most DDSP‑based approaches place few restrictions on the type of target signal, only assuming it is a monophonic source, which can lead to synthesis parameters that violate properties of the human voice. In most designs, the harmonic sources are fixed to a specific shape (e.g., sawtooth, pulse train), and the filters are symmetric in the time domain (zero‑phase), except Yoshimura et al. (2023b), who use a minimum‑phase MLSA filter. Incorporating constraints specific to the human voice into the source and the filters could enhance the interpretability of the vocoder.
In this work, we introduce GlOttal‑flow LPC Filter (GOLF), an SVS module informed by the physical mechanism of the human voice. We build upon the harmonic‑plus‑noise architecture of DDSP (Engel et al., 2020) and the subtractive synthesis of SawSing (Wu et al., 2022) but replace the harmonic source with a glottal model and use LPC filters. We develop a differentiable LPC implementation in PyTorch (Li et al., 2020) for training efficiency. We use this module as a neural vocoder and compare its performance with other DDSP‑based vocoders. Specifically, a simple and lightweight neural encoder converts the mel spectrogram into synthesis parameters, and the synthesiser decodes the signal from them. We train the same encoder architecture for each synthesiser we compare.
Our contributions are threefold. First, GOLF only requires 14.5–21.0% of the synthesis parameters compared with state‑of‑the‑art DDSP‑based SVS vocoders, while still being competitive in terms of synthesis quality. Second, GOLF requires less than 40% of memory to train and runs 10 times faster than its alternatives for inference. Moreover, we indirectly show that GOLF could model the phase components of the human voice by aligning the synthesised waveforms to the ground truth and calculating the differences. The filter parametrisation we use also enables us to track the formants unsupervised. These characteristics have excellent potential for analysing singing voices in a differentiable manner. Decomposing the human voice into the glottal source and vocal tract could also enable us to adjust the singing style in different ways, such as altering the amount of vocal effort with varying shapes of the glottal pulse.
This paper extends our previous publication (Yu and Fazekas, 2023) in several ways. The derivation of the efficient backpropagation of an LPC filter is more clearly notated. The architecture of the neural encoder is illustrated to make the inference pipeline easier to understand. Extensive ablation studies and additional experiments are added. These provide further insights into how the four models we evaluated treat the harmonics/noise components, the individual contributions of the factorised LPC filters, statistics on the learnt glottal pulses and issues when using frame‑wise LPC synthesis.
2 Background
We first introduce the relevant notation. $X(z)$ denotes the response of a signal or filter $x_n$ in the z‑domain. Unless stated otherwise, we use $n$ as the time index and $m$ as the frame index. Angular frequencies and periods are normalised to the interval $[0, 1)$.
2.1 Glottal source‑filter model
In the source‑filter model, we have the following simplified voice production model:

$$S(z) = \big(G(z) + N(z)\big)\, V(z)\, R(z), \qquad (1)$$

where $G(z)$ represents the periodic vibration from the vocal folds, $N(z)$ represents the random components of the glottal source, $V(z)$ represents the vocal‑tract filter and $R(z)$ represents the radiation at the lips (Degottex, 2010). Since this formulation is linear, the radiation filter and the glottal pulse can be merged into a single source called the radiated glottal pulse. If we assume $R(z)$ is a first‑order differentiator (Lu and Smith III, 2000), then $G(z)R(z)$ is the derivative of the glottal pulse, which can be described by the LF model (Fant et al., 1985), a four‑parameter model of glottal flow. $V(z)$ is usually modelled with a linear prediction (LP) filter.
2.2 Linear predictive coding
Linear predictive coding (LPC) assumes that the current voice sample $x_n$ can be predicted from a finite number $M$ of previous samples by a linear combination, plus a residual $e_n$:

$$x_n = e_n - \sum_{i=1}^{M} a_i\, x_{n-i}, \qquad (2)$$

where $a_1, \dots, a_M$ are the linear prediction coefficients. In this work, we only evaluate $0 \leq n < T$ (so the audio length is $T$) and assume that $x_n$ is always zero when $n < 0$. In our case, this is the same as filtering the residuals, which are equivalent to the glottal source, with an $M$th‑order all‑pole filter. This filter has an infinite impulse response (IIR). We can use the LPC filter to represent the response of the vocal tract if the vocal tract is approximated by a series of cylindrical tubes with varying diameters (Markel and Gray, 1976), providing a physical interpretation.
Using LPC for neural audio synthesis is not new (Oh et al., 2020; Subramani et al., 2022; Valin and Skoglund, 2019), and works have been conducted to incorporate IIR filters and train them jointly with deep learning models (Bhattacharya et al., 2020; Colonel et al., 2022; Kim et al., 2022; Kuznetsov et al., 2020; Nercessian, 2020; Nercessian et al., 2021; Steinmetz et al., 2022). The difficulty of training an IIR filter in a deep learning framework (e.g., PyTorch) using Equation 2 is that its computation is recursive, i.e., the output at each step depends on the previous results, and to make the calculation differentiable, separate tensors are allocated at each step. This results in significant computational overhead from allocating fragmented memory, which can lead to performance issues, especially for long sequences.
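To make the overhead concrete, the following deliberately naive PyTorch sketch implements the recursion of Equation 2 step by step; the function name is ours, and this is not the implementation used in this work. Every sample produces a new node in the autograd graph, which is exactly the fragmentation problem described above.

```python
import torch

def allpole_naive(e: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
    """Differentiable but slow all-pole filtering: x[n] = e[n] - sum_i a[i] * x[n-i]."""
    M = a.shape[0]
    x = []  # one small tensor is appended per sample, fragmenting memory
    for n in range(e.shape[0]):
        acc = e[n]
        for i in range(1, M + 1):
            if n - i >= 0:          # x[n-i] is assumed to be zero for n-i < 0
                acc = acc - a[i - 1] * x[n - i]
        x.append(acc)
    return torch.stack(x)

e = torch.randn(1000)
a = torch.tensor([0.5, 0.25], requires_grad=True)
loss = allpole_naive(e, a).pow(2).sum()
loss.backward()                      # works, but the graph holds thousands of tiny nodes
```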
One way to mitigate this issue is to pre‑allocate a shared tensor before computation as a buffer and reuse it at each step. However, in‑place modification is mostly not differentiable in these frameworks.1 Some studies sidestep the recursion by approximating the IIR in the frequency domain using the discrete Fourier transform (DFT). This is usually called frequency sampling (FS; Colonel et al., 2022; Kim et al., 2022; Nercessian, 2020; Nercessian et al., 2021; Oh et al., 2020; Steinmetz et al., 2022; Wright and Valimaki, 2022). The resulting finite impulse response (FIR) is a folded version of the original IIR, introducing errors called time‑domain aliasing (Smith, 2024). Directly multiplying the DFT of the input with the transfer function of the filter (Oh et al., 2020) is not equivalent to the time‑domain convolution but to a circular convolution with the folded FIR. The aliasing effect cannot be avoided but only mitigated by increasing the DFT resolution, which also increases the filter's computational cost.
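The frequency‑sampling idea can be sketched in a few lines; the helper below is illustrative and not taken from any of the cited implementations. Dividing the DFT of the input by the sampled denominator is equivalent to a circular convolution with the time‑aliased (folded) impulse response, so increasing `n_fft` only reduces, never removes, the aliasing error.

```python
import torch

def allpole_freq_sampling(e: torch.Tensor, a: torch.Tensor, n_fft: int) -> torch.Tensor:
    """Approximate the all-pole filter 1 / (1 + sum_i a[i] z^-i) by frequency sampling.

    n_fft should be at least the length of e; larger values reduce time-domain aliasing.
    """
    A = torch.fft.rfft(torch.cat([torch.ones(1), a]), n=n_fft)   # A(e^{jw}) on the DFT grid
    E = torch.fft.rfft(e, n=n_fft)
    x = torch.fft.irfft(E / A, n=n_fft)                          # circular convolution result
    return x[: e.shape[0]]
```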
Moreover, the IIRs used in practice are usually low order, such as the classic biquad filters (Colonel et al., 2022; Nercessian, 2020; Nercessian et al., 2021) and one‑pole smoothing filters (Wright and Valimaki, 2022). In these cases, filtering the signal directly is faster than using the FS method, especially when the sequence is long.
3 Proposed Model
We aim to develop a neural synthesiser while sticking with the interpretable model (Equation 1) as much as possible so the learnt parameters are easier to analyse and control to create new sounds. The rest of this section explains our design choices to achieve this goal.
Usually, $N(z)$ in Equation 1 is treated as amplitude‑modulated Gaussian noise (Degottex, 2010; Lu and Smith III, 2000). We found this formulation challenging to optimise in our early experiments. As an alternative, we move the noise component outside the glottal source and filter it with a separate time‑varying filter $C(z)$, resulting in:

$$S(z) = G(z)\, V(z)\, R(z) + N(z)\, C(z).$$

This resembles the classic harmonic‑plus‑noise model (Serra and Smith, 1990), and it has been used in previous DDSP‑based SVS (Alonso and Erkut, 2021; Wu et al., 2022). Alonso and Erkut (2021) model the harmonic part $G(z)V(z)R(z)$ jointly using additive harmonic oscillators and adopt time‑varying finite impulse responses (FIRs) as $C(z)$; Wu et al. (2022) introduce a sawtooth oscillator as $G(z)R(z)$ and zero‑phase time‑varying FIRs as $V(z)$ and $C(z)$. In this work, we use a glottal flow model to synthesise the harmonic source and time‑varying IIRs as the filters.
3.1 Glottal flow wavetables
We adopt the transformed‑LF model (Fant, 1995) for generating glottal pulses. This model re‑parametrises the LF model (Fant et al., 1985) using just one parameter, $R_d$, which was found to correspond well to the perceived vocal effort and covers a wide range of glottal flow shapes.
To save time computing the pulses, we pre‑compute a set of glottal flows and synthesise the waveform using wavetable synthesis. The computation is thus limited to indexing memory plus simple interpolation. We sample $L$ values of $R_d$ with equal spacing inside the value range suggested by Degottex (2010). For each sampled $R_d$, we calculate the flow derivative function in continuous time and then sample $P$ points in one period to get its discrete version. The details for calculating the flow derivative are given in Gobl (2017). By stacking these sampled glottal flows, we build the wavetables $\mathbf{W} \in \mathbb{R}^{L \times P}$, with each row containing one period of a sampled glottal pulse (Figure 1). The rows are sorted by $R_d$.

Figure 1
An example of the wavetables we use (before normalisation and peak alignment), corresponding to the matrix $\mathbf{W}$.
The model generates glottal pulses by linearly interpolating along the two axes. The encoder network first predicts the instantaneous frequency $f_n$ and the fractional index $k_n$ into the wavetables. We then use the instantaneous phase $\phi_n$ to interpolate the waveform as:

$$g_n = \mathrm{bilinear}\big(\mathbf{W};\, k_n,\, \phi_n P\big),$$

where $\phi_n = \big(\sum_{j \leq n} f_j\big) \bmod 1$ is the accumulated phase. The bilinear interpolation maps $k_n$ and $\phi_n P$ as the row/column indexes into the matrix $\mathbf{W}$ and returns the interpolated value using the four nearest neighbours. The wavetables are fixed in our case, contrary to Shan et al. (2022), and we only pick one wavetable at a time, not a weighted sum. We normalise each row of $\mathbf{W}$ to have unit energy and align their negative peaks to the same position.
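The read operation can be sketched as below, assuming frequencies are expressed in cycles per sample and the wavetable matrix stores one period per row; the helper name is ours, and details such as energy normalisation and peak alignment are omitted.

```python
import torch

def wavetable_oscillator(freq: torch.Tensor, index: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """Generate glottal pulses by bilinear interpolation of the wavetables.

    freq:  (T,) instantaneous frequency in cycles per sample (f0 / sample rate).
    index: (T,) fractional row index in [0, L - 1], selecting the pulse shape (R_d).
    W:     (L, P) wavetables, one period per row.
    """
    L, P = W.shape
    phase = torch.cumsum(freq, dim=0) % 1.0        # wrapped instantaneous phase
    col = phase * P                                # fractional column index
    r0 = index.floor().long().clamp(0, L - 1)
    r1 = (r0 + 1).clamp(max=L - 1)
    c0 = col.floor().long() % P
    c1 = (c0 + 1) % P                              # wrap around the period boundary
    wr = index - r0                                # row interpolation weight
    wc = col - col.floor()                         # column interpolation weight
    top = (1 - wc) * W[r0, c0] + wc * W[r0, c1]
    bottom = (1 - wc) * W[r1, c0] + wc * W[r1, c1]
    return (1 - wr) * top + wr * bottom
```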
3.2 Frame‑Wise LPC synthesis
Time‑varying LPC synthesis is usually done by linearly interpolating the LPC coefficients to the audio resolution and filtering sample by sample. This is not parallelisable and slows down the training process. As an alternative, we approximate LPC synthesis by treating each frame independently and combining the frames with overlap‑add:

$$\hat{x}_n = \sum_{m} w^{\mathrm{s}}_{n - mH}\, x^{(m)}_{n - mH}, \qquad x^{(m)}_n = g_m\, w^{\mathrm{a}}_{n}\, e_{n + mH} - \sum_{i=1}^{M} a_{m,i}\, x^{(m)}_{n-i},$$

where $a_{m,i}$ are the filter coefficients at the $m$-th frame, $w^{\mathrm{a}}$ and $w^{\mathrm{s}}$ are the analysis and synthesis windowing functions, $g_m$ is the gain and $H$ is the hop size. $w^{\mathrm{a}}$ is fixed to the square window. In this way, the computation can be parallelised at the frame level.
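A sketch of this approximation is given below, using torchaudio's lfilter for the per‑frame filtering and a Hann synthesis window; the exact windowing and padding in GOLF may differ, so treat this as an illustration of the overlap‑add idea rather than the reference implementation.

```python
import torch
import torchaudio.functional as AF

def framewise_lpc(excitation: torch.Tensor, coeffs: torch.Tensor, gain: torch.Tensor,
                  frame_length: int, hop: int) -> torch.Tensor:
    """Approximate time-varying LPC synthesis frame by frame with overlap-add.

    excitation: (T,) source signal; coeffs: (K, M) frame-wise LPC coefficients;
    gain: (K,) frame-wise gains. Frames overlap by (frame_length - hop) samples.
    """
    K, M = coeffs.shape
    T = excitation.shape[0]
    total = (K - 1) * hop + frame_length
    e = torch.nn.functional.pad(excitation, (0, max(0, total - T)))
    window = torch.hann_window(frame_length)                       # synthesis window
    out = []
    for m in range(K):
        start = m * hop
        frame = e[start:start + frame_length]                      # square analysis window
        a = torch.cat([torch.ones(1), coeffs[m]])                  # denominator [1, a_1, ..., a_M]
        b = torch.cat([gain[m:m + 1], torch.zeros(M)])             # pure all-pole with a gain
        x_m = AF.lfilter(frame.unsqueeze(0), a, b, clamp=False).squeeze(0)
        out.append(torch.nn.functional.pad(x_m * window, (start, total - start - frame_length)))
    return torch.stack(out).sum(dim=0)[:T]                         # overlap-add
```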
3.3 LPC coefficients parametrisation
For the LPC filter to be stable, all its poles must lie inside the unit circle on the complex plane. Stability can be guaranteed using robust representations, such as reflection coefficients (Subramani et al., 2022). The representation we choose in this work is a cascade of second‑order IIR filters, and we solve the stability issue by ensuring that every second‑order section is stable. We use the coefficient representation from Nercessian et al. (2021) to parametrise each section's coefficients from the encoder's outputs and cascade $M/2$ sections to form an $M$th‑order LPC filter. In practice, we multiply the section polynomials first and perform one function call of the $M$th‑order LPC filter, as this runs faster than sequentially applying each filter section.
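Below is a sketch of one way to realise this parametrisation; the mapping follows our understanding of the coefficient representation in Nercessian et al. (2021), which keeps each (a1, a2) pair inside the stability triangle, and the helper names are ours. The cascade is obtained by multiplying the section polynomials.

```python
import torch

def biquad_coeffs(raw: torch.Tensor) -> torch.Tensor:
    """Map unconstrained encoder outputs (..., 2) to stable biquad denominator coefficients.

    Guarantees |a2| < 1 and |a1| < 1 + a2, i.e., both poles stay inside the unit circle.
    """
    a1 = 2.0 * torch.tanh(raw[..., 0])
    a2 = ((2.0 - a1.abs()) * torch.tanh(raw[..., 1]) + a1.abs()) / 2.0
    return torch.stack([a1, a2], dim=-1)

def cascade_to_lpc(sections: torch.Tensor) -> torch.Tensor:
    """Multiply J second-order polynomials [1, a1, a2] into one monic polynomial of order 2J."""
    poly = torch.ones(1)
    for a1, a2 in sections:
        section = torch.stack([torch.ones_like(a1), a1, a2])
        product = torch.zeros(poly.shape[0] + 2)
        for i in range(3):                       # naive polynomial multiplication
            product = product + torch.nn.functional.pad(poly * section[i], (i, 2 - i))
        poly = product
    return poly

# 11 stable sections -> one 22nd-order LPC denominator [1, a_1, ..., a_22]
a_poly = cascade_to_lpc(biquad_coeffs(torch.randn(11, 2)))
```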
3.4 Unvoiced gating
The encoder’s prediction of the instantaneous frequency is always non‑zero, keeping the oscillator working. Without constraint, the model would utilise these harmonics in the unvoiced region, creating buzzing artefacts (Wu et al., 2022). We propose to mitigate this problem by jointly training the model to predict the voiced/unvoiced probabilities as and feeding the gated frequency to the oscillator instead.
4 Optimisation
Training deep learning models is usually accomplished by backpropagating the gradients evaluated at a chosen loss function throughout the whole computational graph back to the parameters. Partially inspired by Bhattacharya et al. (2020), we derive the backpropagation through time formulas to utilise efficient IIR implementation to solve the problems we mentioned in Section 2.2 while keeping the filter differentiable.
4.1 Gradients to the coefficients
Taking the derivatives of Equation 2 with respect to $a_i$, we get:

$$\frac{\partial x_n}{\partial a_i} = -x_{n-i} - \sum_{j=1}^{M} a_j\, \frac{\partial x_{n-j}}{\partial a_i}.$$

$e_n$ does not depend on $a_i$, so the initial conditions $\frac{\partial x_n}{\partial a_i} = 0$ for $n < 0$. The loss gradient $\frac{\partial \mathcal{L}}{\partial \mathbf{a}}$ is then simply a matrix multiplication of $\frac{\partial \mathcal{L}}{\partial \mathbf{x}}$ with the Jacobian $\frac{\partial \mathbf{x}}{\partial \mathbf{a}}$. For the Jacobian, we use the fact that $\frac{\partial x_n}{\partial a_i} = \frac{\partial x_{n-i+1}}{\partial a_1}$, so we can get all of its entries using just $\frac{\partial x_n}{\partial a_1}$. The details of computing $\frac{\partial x_n}{\partial a_1}$ are given in the Appendix.
4.2 Gradients to the input
To get the derivatives for the input $e_n$, let us first rewrite Equation 2 in the following convolutional form:

$$x_n = \sum_{k=0}^{\infty} h_k\, e_{n-k}, \qquad (7)$$

where $h_n$ is the impulse response of the all‑pole filter $1 / \big(1 + \sum_{i=1}^{M} a_i z^{-i}\big)$. In this way, there is no output term on the right‑hand side, so we can derive the gradients more easily. We assume $e_n = 0$ for $n < 0$, so we only need to evaluate $k \in [0, n]$ in Equation 7. This assumption also implies that $x_n = 0$ for $n < 0$ always holds, an assumption we stated in Section 2.2. $h_n$ is not practically computable, so we only use it for demonstration. The derivative of the loss $\mathcal{L}$ with respect to $e_n$ depends on the future samples $x_n, x_{n+1}, \dots, x_{T-1}$, which is:

$$\frac{\partial \mathcal{L}}{\partial e_n} = \sum_{k=n}^{T-1} \frac{\partial \mathcal{L}}{\partial x_k} \frac{\partial x_k}{\partial e_n} = \sum_{k=n}^{T-1} \frac{\partial \mathcal{L}}{\partial x_k}\, h_{k-n}. \qquad (8)$$

By swapping the variables $n$ and $k$ and considering the equivalence of Equations 2 and 7, Equation 8 can be simplified to:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{e}} = \Big(\mathrm{flip} \circ \mathrm{LPC}_{\mathbf{a}} \circ \mathrm{flip}\Big)\Big(\frac{\partial \mathcal{L}}{\partial \mathbf{x}}\Big), \qquad (9)$$

where $\mathrm{flip}(\cdot)$ reverses a sequence in time, $\mathrm{LPC}_{\mathbf{a}}(\cdot)$ denotes filtering with Equation 2 and $\circ$ denotes function composition. Equation 9 shows that we can get the derivatives by just filtering $\frac{\partial \mathcal{L}}{\partial \mathbf{x}}$ with the same filter, but running in the reverse direction. The initial conditions for this backward filtering are zeros, as the loss does not depend on samples beyond $x_{T-1}$.
In conclusion, backpropagation through an IIR filter consists of two passes of the same filter and one matrix multiplication. The computation of the IIR does not need to be implemented in an automatic differentiation framework since we view an IIR as a single differentiable function, not a batch of differentiable multiplications and additions, avoiding the computational overhead issue we discussed in Section 2.2, and the filter can be highly optimised.
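The following sketch expresses the whole procedure as a custom autograd function. It uses SciPy's lfilter as the non‑differentiable core and is only meant to illustrate the maths of Section 4 (the class name is ours; the optimised C++/CUDA implementation described in Section 5.2 is separate). The coefficient gradient is written compactly as a correlation between the backward‑filtered gradient and the delayed outputs, which corresponds to the matrix multiplication referred to above.

```python
import numpy as np
import torch
from scipy.signal import lfilter

class AllPoleFilter(torch.autograd.Function):
    """All-pole filtering x[n] = e[n] - sum_i a[i] x[n-i], differentiable w.r.t. e and a."""

    @staticmethod
    def forward(ctx, e: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        a_full = np.concatenate(([1.0], a.detach().numpy()))
        x = torch.from_numpy(lfilter([1.0], a_full, e.detach().numpy())).to(e.dtype)
        ctx.save_for_backward(a, x)
        return x

    @staticmethod
    def backward(ctx, grad_x: torch.Tensor):
        a, x = ctx.saved_tensors
        a_full = np.concatenate(([1.0], a.detach().numpy()))
        # dL/de: the same all-pole filter run over the time-reversed output gradient (Eq. 9)
        g = grad_x.detach().numpy()
        grad_e = torch.from_numpy(lfilter([1.0], a_full, g[::-1])[::-1].copy()).to(grad_x.dtype)
        # dL/da_i = -sum_n dL/de[n] * x[n - i]  (one inner product per coefficient)
        M, T = a.shape[0], x.shape[0]
        grad_a = torch.stack([-(grad_e[i:] * x[: T - i]).sum() for i in range(1, M + 1)])
        return grad_e, grad_a

# the backward pass can be verified numerically, e.g. with torch.autograd.gradcheck
e = torch.randn(64, dtype=torch.double, requires_grad=True)
a = torch.tensor([0.5, 0.25], dtype=torch.double, requires_grad=True)
torch.autograd.gradcheck(AllPoleFilter.apply, (e, a))
```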
Interestingly, since the differentiable IIR calls itself for backpropagation, the backpropagation computational graph is itself differentiable, enabling higher‑order gradients. This goes beyond the scope of this paper but is worth exploring in the future.
5 Experimental Setup
5.1 Dataset
We test GOLF as a neural vocoder on the MPop600 dataset (Chu et al., 2020), a high‑quality Mandarin singing dataset featuring nearly 600 singing recordings with aligned lyrics sung by four singers. We use the audio recordings from the f1 (female) and m1 (male) singers. For each singer, we select the first three recordings as the test set, the following 27 as the validation set, and the rest as training data (around 3 hours). All the recordings are downsampled to 24 kHz, following the same procedure as Wu et al. (2022). The vocoder feature is the log mel spectrogram. We compute the feature with a window size of 1024 and 80 mel‑frequency bins and set the hop size to 120 (equal to 200 Hz frame rate). We normalise the feature to fall between zero and one and slice the training data into 2‑second excerpts with 1.5 seconds of overlap.
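For reference, the feature described above can be computed along the following lines with torchaudio; the mel filter‑bank options and the min‑max normalisation shown here are plausible choices rather than the exact settings of our pipeline.

```python
import torch
import torchaudio

# 24 kHz audio, 1024-sample window, hop 120 (200 Hz frame rate), 80 mel bins
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=24000, n_fft=1024, win_length=1024, hop_length=120, n_mels=80
)

def log_mel(waveform: torch.Tensor) -> torch.Tensor:
    """Return a log mel spectrogram normalised to [0, 1] per utterance."""
    m = torch.log(mel(waveform) + 1e-5)
    return (m - m.min()) / (m.max() - m.min())
```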
5.2 Model details
We adapt the encoder from SawSing with some modifications. The architecture details are shown in Figure 2. The number of parameters is 0.7 M. The synthesis parameters are predicted at the same rate as the mel spectrogram (200 Hz). The wavetable fractional index $k$ is predicted at 20 Hz to prevent modulation artefacts from rapid wavetable switching. We set $L$ and $P$ large enough for dense $R_d$ sampling and accurate interpolation. For a sample rate of 24 kHz, the wavetables faithfully represent the glottal flow down to 12 Hz, which is much lower than the minimum f0 of the dataset (greater than 80 Hz). All the parameters except the Gaussian noise's gain and filter coefficients are linearly upsampled to the audio rate. We use the Hanning window for the synthesis window $w^{\mathrm{s}}$ and set $M = 22$ for both LPC filters. We use 75% overlap for the frame‑wise LPC.

Figure 2
Overview of the GOLF synthesis process.
We implemented the IIR in C++ with multi‑threading and a CUDA kernel for batch processing. Differentiability is achieved by registering the backward computation above and binding the optimised code in PyTorch. We contributed the implementation to TorchAudio (Hwang et al., 2023) in 2021 as part of the function torchaudio.functional.lfilter.2 The initial implementation was developed independently and in parallel with dynoNet, a similar tool by Forgione and Piga (2021). Our implementation provides a naive CUDA kernel, which enables it to run on NVIDIA GPUs (the most common hardware for deep learning) for convenience. In comparison, dynoNet uses SciPy's lfilter function as the backend and does not support running on GPUs.
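For reference, a minimal usage example of the public function is shown below (default arguments may differ slightly between torchaudio versions); note that, unlike SciPy, torchaudio takes the denominator coefficients before the numerator.

```python
import torch
import torchaudio.functional as AF

e = torch.randn(1, 24000)                                   # excitation, shape (..., time)
a = torch.tensor([1.0, -0.9, 0.2], requires_grad=True)      # denominator [a_0, a_1, a_2]
b = torch.tensor([1.0, 0.0, 0.0])                           # numerator (pure all-pole here)

x = AF.lfilter(e, a, b, clamp=False)                        # differentiable IIR filtering
x.pow(2).mean().backward()                                  # gradients flow to the coefficients
print(a.grad)
```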
We compare GOLF with three DDSP‑based baselines using the same encoder (without the downsampler for the wavetable index) to predict their synthesis parameters. The size of the last linear layer is adjusted to match the number of parameters. The first two are the original DDSP (Engel et al., 2020) and SawSing (Wu et al., 2022). We set their noise filter length to 80 and SawSing's harmonic filter length to 256. The third model is PUlse‑train LPC Filter (PULF), which is similar to GOLF but replaces the glottal flow wavetables with a band‑limited pulse train (Nercessian, 2023) synthesised additively, while $M$ for the harmonic source is increased to 26 to accommodate the glottal pulse response. We set the number of oscillating sinusoids to more than 150 for all baselines. We do not compare GOLF with Nercessian (2023) and Yoshimura et al. (2023b) because their architectures are based on the original source‑filter model and use additional post‑nets to enhance the voice, which makes direct comparison difficult.
5.3 Training configurations
We train separate models for each singer, resulting in eight models. The loss function is the sum of the same multi‑resolution short‑time Fourier transform (STFT) loss (MSSTFT) and f0 loss used in SawSing, computed with several fast Fourier transform (FFT) sizes, plus a binary cross‑entropy loss on the voiced/unvoiced prediction. Our MSSTFT follows the original DDSP (Engel et al., 2020) implementation, where the L1 distance is calculated between the magnitudes of the ground truth and the predicted STFTs in both linear and log scale. Similar to Wu et al. (2022), we stop the gradients from the harmonic source to the f0s and voiced decisions to stabilise the training.
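A sketch of such a loss is given below; the FFT sizes are illustrative (the smallest size we use is 512, see Section 7.5) and the hop is set to a quarter of each FFT size.

```python
import torch

def msstft_loss(pred: torch.Tensor, target: torch.Tensor,
                fft_sizes=(512, 1024, 2048), eps: float = 1e-5) -> torch.Tensor:
    """L1 distance between linear and log STFT magnitudes at several resolutions."""
    loss = 0.0
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft, device=pred.device)
        P = torch.stft(pred, n_fft, hop_length=n_fft // 4, window=window, return_complex=True).abs()
        T = torch.stft(target, n_fft, hop_length=n_fft // 4, window=window, return_complex=True).abs()
        loss = loss + (P - T).abs().mean() + (torch.log(P + eps) - torch.log(T + eps)).abs().mean()
    return loss
```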
Although this paper uses reconstruction losses without adversarial training, which might result in an overly smooth spectral response, we empirically found this effect is not significant. The reason is most likely that, in a singer‑dependent setting, the compressed information of the mel spectrogram is enough to capture the singer’s voice characteristics. We verified this by training a speaker‑independent GOLF on a multi‑speaker dataset using almost the same configurations and found that the smoothing effect became more noticeable.
We used Adam (Kingma and Ba, 2015) for all optimisations. For DDSP and SawSing, the batch size and learning rate were set to 32 and 0.0005; for GOLF and PULF, they were set to 64 and 0.0001 to reduce training instability. We used the ground truth f0s [extracted by WORLD (Morise et al., 2016)] for the harmonic oscillator of PULF during training due to stability issues. We trained all models for 800,000 steps to reach sufficient convergence (based on several early‑stage test runs) and picked the checkpoint with the lowest validation loss as the final model.
6 Results
The models we discuss here all use the same encoder and are trained with the same loss function; therefore, differences in the results should mostly be due to different synthesis methods.
6.1 Objective evaluation
The objective metrics we choose are MSSTFT, mean absolute error (MAE) in f0 and Fréchet audio distance (FAD; Kilgour et al., 2019) on the predicted singing of the test set.3 Table 1 shows that DDSP has the lowest MSSTFT and f0 errors, while SawSing reaches the lowest FAD. GOLF and PULF show comparable results in f0 errors to other baselines. We report the memory usage measured on an Ubuntu 20.04 LTS machine with an i5‑4790k processor and an NVIDIA GeForce RTX 3070 GPU when training these models and their real‑time factor (RTF) in Table 2. GOLF requires approximately 35% of the memory needed by the other models for training, and demonstrates significantly faster performance, particularly on a central processing unit (CPU).
Table 1
Evaluation results on the test set; lower is better for all metrics. We omit standard deviations smaller than 0.01.
| Singers | Models | MSSTFT (↓) | MAE‑f0 (cent, ↓) | FAD (↓) |
|---|---|---|---|---|
| f1 | DDSP | 3.09 | 74.47 ± 1.19 | 0.47 ± 0.03 |
| | SawSing | 3.12 | 78.91 ± 1.18 | 0.32 ± 0.01 |
| | GOLF | 3.21 | 77.06 ± 0.88 | 0.59 ± 0.04 |
| | PULF | 3.27 | 76.90 ± 1.11 | 0.76 ± 0.03 |
| m1 | DDSP | 3.12 | 52.95 ± 1.03 | 0.56 ± 0.02 |
| | SawSing | 3.13 | 56.46 ± 1.04 | 0.44 ± 0.01 |
| | GOLF | 3.26 | 54.09 ± 0.30 | 0.74 ± 0.02 |
| | PULF | 3.35 | 54.60 ± 0.73 | 1.26 ± 0.03 |
Table 2
The VRAM (GB) required for training with a batch size of 32 and the real‑time factor (RTF) for inference.
| Models | Memory (GB, ↓) | RTF (↓, GPU) | RTF (↓, CPU) |
|---|---|---|---|
| DDSP | 7.3 | 0.015 | 0.237 |
| SawSing | 7.3 | 0.015 | 0.240 |
| GOLF | 2.6 | 0.009 | 0.023 |
| PULF | 7.5 | 0.015 | 0.248 |
Given the evaluation results and the fact that GOLF's number of synthesis parameters (49) is roughly a fifth of DDSP's (233) and a seventh of SawSing's (338), it is clear that GOLF's synthesis parameters are a more compact representation of the singing voice. Comparing the differences between GOLF and PULF in Table 2, we can see that wavetable synthesis is the factor yielding the significant speedup. The other baselines synthesise band‑limited harmonic sources with many sinusoids oscillating simultaneously, thus increasing the computational cost.
6.2 Subjective evaluation
We conducted an online multiple stimuli with hidden reference and anchor (MUSHRA; Series, 2014) listening test using Go Listen (Barry et al., 2021). We picked one short clip from each test set recording, resulting in six clips with durations ranging from 6 to 11 seconds. The test is thus divided into six examples, each consisting of one ground truth clip and four synthesised results from the different models. Their loudness was normalised to −16 dB LUFS using pyloudnorm (Steinmetz and Reiss, 2021). The order of the examples and the stimuli was randomised for each subject. Each subject was asked to rate the quality of the stimuli on a scale from 0 to 100. We collected responses from 33 anonymous participants and excluded one response, as the participant did not indicate the use of headphones. We normalised the scores to a range from one to five and report the mean scores in Figure 3. DDSP has the highest scores overall, and a Wilcoxon signed‑rank test shows that it is not significantly different from the f1 ground truth (p = 0.168). We apply the one‑sided Wilcoxon test on GOLF and PULF to compare them with SawSing. The results show that GOLF outperforms SawSing (p < 0.0001), and PULF performs better than SawSing on m1 (p < 0.022).

Figure 3
The MUSHRA results of the vocoders trained on different singers with 95% confidence interval.
PULF’s MUSHRA score is much worse on f1, with noticeable artefacts in the unvoiced region and random components of the voice. After investigation, we found the noise gains predicted by PULF fluctuating at high speed, producing a modulation effect. This behaviour is also found in PULF trained on m1 and GOLF, even on the harmonic gains . Still, the degree of fluctuation is small and barely noticeable in the test samples. This amplitude modulation effect cannot be observed in MSSTFT spectrograms; thus, our training loss does not capture it. We will discuss the possible reason for this phenomenon in Section 7.5.
In addition, SawSing received low scores for both singers because of buzzing artefacts in the unvoiced regions, a problem pointed out in the original paper (Wu et al., 2022). Although unvoiced gating (Section 3.4) reduces this problem to a large degree, human ears are highly sensitive to the remaining artefacts. This could be an inherent problem of using a sawtooth as the harmonic source.
7 Ablation Study
7.1 Waveform‑Matching experiment
GOLF and PULF are the only two models that introduce a non‑linear phase response, because of the IIR filtering. We are interested in whether their synthesised waveforms match the ground truth better than those of the other models. The filters in DDSP and SawSing are all zero‑phase, and the initial phases of their sinusoidal oscillators are fixed to zero. We hypothesise that, if a model represents the human voice mechanism more faithfully, its waveform should be more similar to the reference recordings, even if its training objective does not consider phase information. We emphasise that this analysis does not target human perception but the phase reconstruction ability of the models. Humans cannot perceive the absolute phase of frequency components, but accurate phase reconstruction could be important in sound matching and mixing use cases.
We perform the test on one of the m1 samples used in the subjective evaluation. We create a new parameter called the phase offset, sampled at 20 Hz. We linearly upsample the phase offset and add it to the instantaneous phase before feeding it to the oscillator, introducing a slowly varying phase shift. We optimise this parameter by minimising the L2 loss between the predicted waveform and the ground truth using Adam with a 0.001 learning rate for 1000 steps. During optimisation, we wrap the differences between consecutive points of the phase offset to [−0.5, 0.5]. We run this optimisation five times for each model, each time initialising the phase offset randomly, and report the minimum and maximum final losses from these trials. Table 3 shows that the lowest losses GOLF and PULF can reach are much smaller than those of the other models, with GOLF having the smallest of all.
Table 3
The L2 loss between the predicted waveforms and the ground truth.
| | DDSP | SawSing | GOLF | PULF |
|---|---|---|---|---|
| Min | 71.83 | 75.72 | 21.98 | 44.08 |
| Max | 88.77 | 93.16 | 64.82 | 70.59 |
The L2 loss shown in Table 3 demonstrates that GOLF matches phase‑related characteristics more accurately than other models. Figure 4 shows that GOLF produces the most similar waveform to the ground truth. Other baselines’ waveforms are similar because they use the same additive synthesiser. Optimising the oscillators’ initial phases can reduce their L2 loss, but this does not account for time‑varying wave shapes.

Figure 4
The predicted waveforms of a short segment from one of the m1 test samples (77 s to 84 s, m1_003).
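A sketch of the fitting procedure in Section 7.1 is shown below. The function `render` is a hypothetical stand‑in for running the frozen vocoder with the extra phase offset (upsampled to the audio rate internally); everything else mirrors the settings stated above.

```python
import torch

def wrap_offset(offset: torch.Tensor) -> torch.Tensor:
    """Wrap frame-to-frame differences of the phase offset into [-0.5, 0.5]."""
    d = (offset[1:] - offset[:-1] + 0.5) % 1.0 - 0.5
    return torch.cat([offset[:1], offset[:1] + torch.cumsum(d, dim=0)])

def fit_phase_offset(render, target: torch.Tensor, n_frames: int, steps: int = 1000):
    offset = torch.nn.Parameter(torch.rand(n_frames))     # random initialisation, 20 Hz rate
    opt = torch.optim.Adam([offset], lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        pred = render(wrap_offset(offset))                 # render() is the frozen vocoder (assumed)
        loss = (pred - target).pow(2).sum()                # L2 loss against the ground truth
        loss.backward()
        opt.step()
    return offset.detach(), float(loss)
```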
7.2 Spectral analysis
To further understand the synthesised components, we compute the harmonic and noise spectrograms of each vocoder on the test set and take the average over the frames. The results are shown in Figure 5.

Figure 5
The average spectra of the harmonic and noise components throughout the test set, further smoothed using cepstral analysis. The FFT size is 1024, and the hop size is 256.
The first difference we notice is that, for GOLF and PULF, there is a substantial cut in the frequencies from 6 kHz to 10 kHz (except PULF on m1). The possible reason is that the voice usually has less energy in these high frequencies. As we factorise GOLF's LPC filters into 11 second‑order sections (13 in PULF), the model might discard the low‑energy frequencies and focus on the more important ones to utilise the limited number of sections better. This claim is supported by the differences in harmonic spectral envelopes between GOLF and PULF on m1, as PULF has two more filter sections than GOLF and does not cut these frequencies. In addition, as male fundamental frequencies are lower than female ones, the intervals between the harmonics are narrower and less sparse, providing more energy for learning the filters in this range. This reasoning is based on a similar finding in our recent work (Yu and Fazekas, 2024), where we found that the filter learns better if the signal fed into the GOLF filter has a more even energy distribution across frequencies.
Another observation is that the harmonics from both PULF and GOLF show strong energies in the frequencies above 10 kHz. In contrast, their noise components in these frequencies are much weaker than those of the other two baselines. We are still unsure why the harmonic filter acts like the noise filter near the Nyquist frequency. A possible cause may be the energies of the high‑order partials, which vary more rapidly than the fundamental frequency. At very high frequencies, the contours of these partials are noise‑like, so the LPC filters tend to model them as noise. This is confirmed by a closer look at the harmonic spectrograms of GOLF and PULF, where we cannot see a clear harmonic structure in the filtered high frequencies.
We hypothesise that this effect is not observed in DDSP and SawSing because of the parametrisation we used in GOLF and PULF. For example, DDSP has to take the varying f0 into account when predicting the corresponding harmonic gains to boost a specific frequency range. On the contrary, GOLF only needs to predict fixed coefficients in one of the filter sections to achieve the same effect. The harmonic filter in SawSing is an FIR filter, so it can model both the peaks and valleys in the spectrum, while the LPC filters are all‑pole and can only model the peaks (resonances). This results in high peak gains in the high frequencies we are discussing. The windowing in our frame‑wise LPC could also encourage this effect, as it smears the spectral peaks and makes them flatter. We analyse the sections of the LPC filters in Section 7.4 to provide more insight.
7.3 Separability of components
The harmonics of SawSing are flatter and have more energy in the high frequencies than the other models, as shown in Figure 5. After inspecting their spectrograms, we also found that SawSing's harmonics tend to contribute more consonant noise, indicating more overlap between the harmonic and noise components. A clearer separation between the harmonic and noise components could help analyse and manipulate the voice signal more effectively. To examine this, we compute the cosine distance between the harmonics and noise spectrograms of each model on the test set as:

$$d(\mathbf{H}, \mathbf{N}) = 1 - \frac{\sum_{k, m} H_{k, m}\, N_{k, m}}{\sqrt{\sum_{k, m} H_{k, m}^2}\, \sqrt{\sum_{k, m} N_{k, m}^2}},$$

where $\mathbf{H}$ and $\mathbf{N}$ are the harmonics and noise magnitude spectrograms, respectively. We only consider the frequency bins below 6 kHz to exclude the effects discussed in Section 7.2.
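Assuming the distance is one minus the cosine similarity between the flattened magnitude spectrograms (restricted to bins below 6 kHz), it can be computed as follows; the helper name and default arguments are ours.

```python
import torch

def harmonic_noise_distance(H: torch.Tensor, N: torch.Tensor, sample_rate: int = 24000,
                            n_fft: int = 1024, cutoff_hz: float = 6000.0) -> torch.Tensor:
    """Cosine distance between harmonic and noise magnitude spectrograms of shape (freq, frames)."""
    k_max = int(cutoff_hz / sample_rate * n_fft)   # highest frequency bin to keep
    h, n = H[:k_max].flatten(), N[:k_max].flatten()
    return 1.0 - torch.dot(h, n) / (h.norm() * n.norm())
```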
The results are shown in Table 4. We can see that SawSing has the lowest cosine distance, indicating the least separability between the harmonics and noise. GOLF and PULF separate the components better than DDSP and SawSing, suggesting that the LPC filters are more suitable in the harmonic‑plus‑noise voice model.
Table 4
The cosine distance between harmonics and noise magnitude spectrograms of each baseline on the test data (the distance is computed only on the frequency bins below 6 kHz to avoid the frequency cut issues discussed in Section 7.2).
| | DDSP | SawSing | GOLF | PULF |
|---|---|---|---|---|
| f1 | 0.945 | 0.933 | 0.948 | 0.954 |
| m1 | 0.911 | 0.876 | 0.937 | 0.941 |
7.4 Filter sections statistics
As we factorise the LPC filter into cascaded second‑order sections (Section 3.3), we can decompose the spectra further and analyse the behaviour of each filter section. We factorise every second‑order section into its two poles. If the poles form a complex conjugate pair, the section acts as a peak filter. If the poles are real, the section is equivalent to cascading two first‑order high/low‑pass filters (Smith, 2024).
The number of occurrences of real‑pole sections can be found in Figure 6. We see that, most of the time, the poles are complex conjugate pairs (peak filters). The noise filters have more real poles than the harmonic filters. This is reasonable, as the harmonics contain more resonant structures such as formants, while the random components are more broadband. The only exception is PULF for the m1 singer, which has two sections with significantly more real poles. This may be attributed to the flat spectrum of pulse trains, which requires the model to use real poles (high/low‑pass filters) to adjust the spectral tilt.

Figure 6
The proportion of frames in which each second‑order section of the LPC filters has real poles. For the harmonic filter, only the frames predicted as voiced are considered. We sort the sections by their proportion of real poles.
Since the filter sections of GOLF and PULF most often have complex conjugate poles, the following analysis focuses only on the peak filters. We plot the statistics of (1) the resonant frequencies and (2) the resonant gains of the peak filters in Figure 7. The resonant frequency is computed from the angle of the pole. Although the actual peak frequency deviates slightly from this due to the presence of the conjugate pole, it is a good approximation (Smith, 2024). The resonant gain is the magnitude response of the section at the resonant frequency.

Figure 7
The mean and standard deviation of the resonant frequencies and resonant gains of each second‑order section in the LPC filters. cmplx means the section contains real poles less than 10% of the time, and cmplx + real means the opposite. The statistics are computed only on the frames where the poles form a complex conjugate pair. For the harmonic filter, only the frames predicted as voiced are considered.
From Figure 7, we see that around two to three peak filters with very high gains are used above 10 kHz for the harmonic filters, which aligns with the high‑energy frequencies discussed in Section 7.2. Both models learnt to assign distinct sections to formants, and their frequency statistics are aligned with each other, at least for $F_1$ (the one below 1 kHz) and $F_2$ (the one between 1 kHz and 2 kHz). For the male singer m1, there are more aligned sections, possibly belonging to $F_3$ (around 3 kHz and above 20 dB) and $F_4$ (around 4 kHz). This is likely because the intervals between the harmonic partials are narrower for the male voice, so the energy distribution is denser, making it easier to learn the formants, for the same reason discussed in Section 7.2. Due to the broadband nature of the noise, the noise filter statistics of the two models tend to be more similar than those of the harmonic filters.
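The per‑section analysis in Section 7.4 can be reproduced along the lines below; taking the pole angle as the resonant frequency and the section's magnitude response at that angle as the resonant gain reflects our reading of the computation, so treat the helper as an approximation.

```python
import numpy as np

def analyse_section(a1: float, a2: float, sample_rate: float = 24000.0):
    """Classify one section 1 / (1 + a1 z^-1 + a2 z^-2) and estimate its resonance."""
    poles = np.roots([1.0, a1, a2])
    if not np.iscomplex(poles).any():
        return {"conjugate": False}                    # two real poles: high/low-pass behaviour
    theta = abs(np.angle(poles[0]))                    # pole angle approximates the peak frequency
    w = np.exp(1j * theta)
    gain = 1.0 / abs(1.0 + a1 / w + a2 / w ** 2)       # magnitude response at the pole angle
    return {
        "conjugate": True,
        "freq_hz": theta / (2 * np.pi) * sample_rate,
        "gain_db": 20.0 * np.log10(gain),
    }
```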
7.5 Interleaving behaviour of resonant frequencies
Although the time‑varying behaviour of the resonant frequencies is not constrained, we found that the learnt formants generally correspond to the filters one‑to‑one. In other words, the models learnt to implicitly assign the same filter to the same formant across the frames, which could be used for formant tracking. We believe this is because the formants have a high signal‑to‑noise ratio and distinct structures, which make it easier for the model to learn.
Other filter sections do not always have this one‑to‑one mapping and sometimes have different resonant frequency trajectories interleaved inside one filter section. To demonstrate this, we pick three filter sections from the harmonic filters of GOLF on the f1 singer and plot the trajectories of their poles in the complex plane in Figure 8. Filter 1 has three separate groups of poles, and in this short 20‑frame slice, it jumps between them. This effect is not observable in the spectrogram filtered with only filter 1, as the overlap‑add of the frame‑wise LPC smooths out these abrupt changes. We suspect this is also the cause of the fluctuating noise gains discussed in Section 6.2. The fluctuation pattern we observe is the gain oscillating with a period of two frames, which can also be seen as jumping between two groups of gain values. The MSSTFT loss cannot capture this change, as the smallest FFT size in our loss is 512, which covers at least five synthesis frames (120 × 4 + 1 = 481 samples) and is thus large enough to smooth out this effect in the spectrogram. We tested this idea by using an FFT size of 128 to compute the spectrograms of PULF's noise component. The resulting spectrograms show a perceivable periodic pattern, verifying our hypothesis.

Figure 8
The trajectories of the poles of three filter sections in the LPC filters of GOLF on a 0.1‑second slice (20 frames) of test data from the f1 singer. Filter sections 2 and 3 each correspond to a formant.
7.6 Statistics of $R_d$
To analyse how GOLF uses $R_d$ to synthesise the glottal flow, we compute the minimum, maximum and average values of $R_d$ (inversely mapped from the predicted wavetable index) on the voiced frames and plot the glottal flows synthesised with these statistics in Figure 9.

Figure 9
The glottal flows computed using the minimum, maximum and average $R_d$ values of each singer, plotted in both the time and frequency domains. The f0 for the frequency plots is set to 100 Hz, and the flows are multiplied by complex sinusoids so the harmonic peaks are separated in the plots.
From the plots, we see no significant difference between the two singers. The models learn to use higher $R_d$ values (which result in a fast‑decaying spectrum) more often and barely use an $R_d$ below 1, which corresponds to a flatter spectrum. This is expected, as our training objective is a spectral loss, so GOLF likely learns a response that matches the shape of the spectrum on average (see the similar decaying structure of the harmonic components in Figure 5). If a time‑domain loss were used, the model might learn a very different distribution of $R_d$, as the target would become matching the waveform itself instead of its magnitude response.
8 Discussions
The main contribution of this work is the efficient backpropagation method for a differentiable LPC (all‑pole) filter. The same approach can be applied to other filters with a recursive structure, such as general IIR filters, all‑pass filters, dynamic range controllers, etc. This opens up many possibilities for tasks involving differentiable signal processing, such as audio effects modelling and system identification, and we believe it is a promising direction for future research.
The frame‑wise LPC filters are used in this work to handle the time‑varying nature of the voice signal. However, the ablation study reveals that this approximation can create artefacts such as fluctuating filter gains and interleaved resonant frequencies. A straightforward way to address the issue is to calculate the spectral loss with an FFT size small enough to capture the fluctuation/interleaving behaviour between frames. However, this does not remove the fundamental limitation of the frame‑wise overlap‑add approximation. Removing this approximation and performing LPC synthesis with time‑varying coefficients is a possible direction for improving the model, enforcing the continuity of the filter coefficients. The time‑varying LPC synthesis is formulated as:

$$x_n = e_n - \sum_{i=1}^{M} a_{n,i}\, x_{n-i}. \qquad (11)$$

The time‑varying coefficients $a_{n,i}$ can simply be a linear interpolation of the predicted frame‑wise coefficients, a similar approach to SPTK (Yoshimura et al., 2023a). The key is to develop a method to calculate the gradients efficiently through this time‑varying filter, which we believe can be done similarly to the method introduced in Section 4. Additionally, guaranteeing the stability of Equation 11 is also an important issue; this is more complicated than in the time‑invariant case (Laroche, 2007).
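For completeness, a plain (slow) realisation of Equation 11 is sketched below, with the frame‑wise coefficients linearly interpolated to the audio rate; an efficient differentiable version would need a backward pass analogous to Section 4.

```python
import numpy as np

def time_varying_allpole(e: np.ndarray, frame_coeffs: np.ndarray, hop: int) -> np.ndarray:
    """Sample-wise time-varying LPC synthesis: x[n] = e[n] - sum_i a[n, i] * x[n-i].

    frame_coeffs: (K, M) coefficients predicted at the frame rate, linearly interpolated
    to the audio rate before filtering (cf. Equation 11).
    """
    K, M = frame_coeffs.shape
    T = len(e)
    frame_times = np.arange(K) * hop
    a = np.stack([np.interp(np.arange(T), frame_times, frame_coeffs[:, i]) for i in range(M)], axis=1)
    x = np.zeros(T)
    for n in range(T):
        acc = e[n]
        for i in range(1, min(M, n) + 1):
            acc -= a[n, i - 1] * x[n - i]
        x[n] = acc
    return x
```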
The computational efficiency of GOLF is due to the use of wavetables to synthesise the harmonic source, in addition to the efficient backpropagation method presented here. However, the linear interpolation synthesis discussed in Section 3.1 can introduce frequency aliasing, which we do not address in this work. Oversampling the wavetables during synthesis can suppress this aliasing effect. The aliasing problem might not be a big issue for the source signal, though, as similar sources such as time‑domain pulse trains have been used before (Yoshimura et al., 2023b), and the aliasing effect can be observed in the high frequencies of their spectrogram figures.
We observe a correlation between the quality of the learnt filters and the type of the source signal. The filters learn better (e.g., smoother spectral envelopes and more aligned formants) when the source signal has more evenly distributed energy across the frequencies, e.g., the noise filter, and the harmonic filter of PULF on the male singer. This suggests the model might learn the filters better on speech signals, as their fundamental frequencies are lower and the harmonics are denser. Using the source‑filter formulation (Section 2.1) could also help to learn the vocal‑tract filter, as its input would be mixed with filtered broadband noise. Finding a better way to parametrise the source‑filter model than amplitude‑modulated Gaussian noise is crucial for this direction.
The low reconstruction error in the waveform is a positive effect of the deterministic phase responses embedded in GOLF. This opens up many possibilities, such as decomposing and analysing the voice in a differentiable manner and training the vocoder using the time‑domain loss function. GOLF’s waveform matching can be improved by using a more flexible glottal source model and adding FIR and all‑pass filters to account for the voice’s mixed‑phase components and the recording environment’s acoustic response.
Lastly, we note that cascaded IIR filters provide an orderless representation; i.e., the cascading order does not affect the output. This results in the responsibility problem (Zhang et al., 2019; Hayes et al., 2023) for the last layer of the encoder, which might be one of the reasons why GOLF and PULF are less stable to train than the other baselines. Developing architectures that handle orderless representations, such as a transformer without positional encoding, or switching to other robust representations, may address this.
9 Conclusions
In this paper, we develop a lightweight singing voice vocoder called GOLF. It uses wavetables to synthesise glottal flows as a time‑varying harmonic source and frame‑wise LPC filters to filter the harmonic and random components. We implement an efficient method to backpropagate gradients through the LPC filters, which allows us to train the model end to end. GOLF shows competitive voice quality in subjective and objective evaluations compared with other DDSP‑based vocoders while being more efficient in training and inference. We analysed and discussed the vocoders' learnt characteristics, including waveform matching, spectral analysis, separability of the components, filter section statistics and the interleaving behaviour of the resonant frequencies.
Our work is a step towards a more efficient, realistic and interpretable neural voice synthesis model. It has several desirable properties, including more realistic waveforms that could be used for audio editing in the post‑production stage, and the ability to track formants and other resonant structures in the voice signal. Our future work will focus on extending GOLF to handle more complex voice signals, such as singing with vibrato or a growl, and developing a more efficient and stable training method for the model.
10 Reproducibility
The trained checkpoints, source code, evaluation scripts and Jupyter notebooks for the ablation studies are available at https://doi.org/10.5281/zenodo.14160345. Listening test samples are available at https://iamycy.github.io/golf-demo/.
11 Ethics and Consent
The authors declare that they have obtained consent from the participants in the listening test. All the participants were over 18 years old. No personal data were recorded, and the gathered data are anonymised and will not be used for purposes other than this study.
Notes
[1] To understand why in‑place operations are not differentiable in PyTorch, see https://pytorch.org/docs/stable/notes/autograd.html#in-place-operations-with-autograd.
[3] In the latest version, we adopt fadtk for calculating FAD; thus, the results may vary slightly from those reported in Yu and Fazekas (2023).
Funding Information
The first author is a research student at the UK Research and Innovation (UKRI) Centre for Doctoral Training in Artificial Intelligence and Music, supported jointly by UK Research and Innovation (grant number EP/S022694/1) and Queen Mary University of London.
Competing Interests
The authors have no competing interests to declare.
Additional File
The additional file for this article can be found as follows:
