GOLF: A Singing Voice Synthesiser with Glottal Flow Wavetables and LPC Filters

By: Chin-Yun Yu and György Fazekas
Open Access | Dec 2024

References

  1. Alonso, J., and Erkut, C. (2021). Latent space explorations of singing voice synthesis using DDSP. arXiv preprint arXiv:2103.07197.
  2. Barry, D., Zhang, Q., Sun, P. W., and Hines, A. (2021). Go Listen: An end‑to‑end online listening test platform. Journal of Open Research Software, 9(1), 20.
  3. Bhattacharya, P., Nowak, P., and Zölzer, U. (2020). Optimization of cascaded parametric peak and shelving filters with backpropagation algorithm. In International Conference on Digital Audio Effects, Vienna, Austria (pp. 101–108).
  4. Bonada, J., Celma Herrada, O., Loscos, A., Ortolà, J., Serra, X., Yoshioka, Y., Kayama, H., Hisaminato, Y., and Kenmochi, H. (2001). Singing voice synthesis combining excitation plus resonance and sinusoidal plus residual models. In International Computer Music Conference (ICMC), Havana, Cuba.
  5. Bonada, J., and Serra, X. (2007). Synthesis of the singing voice by performance sampling and spectral models. IEEE Signal Processing Magazine, 24(2), 67–79.
  6. Bonada, J., Umbert, M., and Blaauw, M. (2016). Expressive singing synthesis based on unit selection for the Singing Synthesis Challenge 2016. In Interspeech 2016, San Francisco, USA (pp. 1230–1234).
  7. Cho, Y.‑P., Yang, F.‑R., Chang, Y.‑C., Cheng, C.‑T., Wang, X.‑H., and Liu, Y.‑W. (2021). A survey on recent deep learning‑driven singing voice synthesis systems. In 2021 IEEE International Conference on Artificial Intelligence and Virtual Reality (AIVR), Taichung, Taiwan (pp. 319–323). IEEE.
  8. Chu, C.‑C., Yang, F.‑R., Lee, Y.‑J., Liu, Y.‑W., and Wu, S.‑H. (2020). MPop600: A Mandarin popular song database with aligned audio, lyrics, and musical scores for singing voice synthesis. In Asia‑Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Auckland, New Zealand (pp. 1647–1652). IEEE.
  9. Colonel, J. T., Steinmetz, C. J., Michelen, M., and Reiss, J. D. (2022). Direct design of biquad filter cascades with deep learning by sampling random polynomials. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore (pp. 3104–3108). IEEE.
  10. Degottex, G. (2010). Glottal source and vocal‑tract separation. PhD thesis, Université Pierre et Marie Curie‑Paris VI.
  11. Engel, J., Hantrakul, L. H., Gu, C., and Roberts, A. (2020). DDSP: Differentiable digital signal processing. In International Conference on Learning Representations, Addis Ababa, Ethiopia.
  12. Fant, G. (1995). The LF‑model revisited. Transformations and frequency domain analysis. Speech Trans. Lab. Q. Rep., Royal Inst. of Tech. Stockholm, 2(3), 40.
  13. Fant, G., Liljencrants, J., and Lin, Q.‑G. (1985). A four‑parameter model of glottal flow. STL‑QPSR, 4(1985), 1–13.
  14. Forgione, M., and Piga, D. (2021). dynoNet: A neural network architecture for learning dynamical systems. International Journal of Adaptive Control and Signal Processing, 35(4), 612–626.
  15. Gobl, C. (2017). Reshaping the transformed LF model: Generating the glottal source from the waveshape parameter Rd. In Interspeech 2017, Stockholm, Sweden (pp. 3008–3012).
  16. Hayes, B., Saitis, C., and Fazekas, G. (2021). Neural waveshaping synthesis. In Proceedings of the 22nd International Society for Music Information Retrieval Conference, Online (pp. 254–261). ISMIR.
  17. Hayes, B., Saitis, C., and Fazekas, G. (2023). The responsibility problem in neural networks with unordered targets. arXiv preprint arXiv:2304.09499.
  18. Hono, Y., Murata, S., Nakamura, K., Hashimoto, K., Oura, K., Nankaku, Y., and Tokuda, K. (2018). Recent development of the DNN‑based singing voice synthesis system – Sinsy. In Asia‑Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Hawaii, USA (pp. 1003–1009). IEEE.
  19. Hwang, J., Hira, M., Chen, C., Zhang, X., Ni, Z., Sun, G., Ma, P., Huang, R., Pratap, V., Zhang, Y., Kumar, A., Yu, C.‑Y., Zhu, C., Liu, C., Kahn, J., Ravanelli, M., Sun, P., Watanabe, S., Shi, Y., and Tao, Y. (2023). TorchAudio 2.1: Advancing speech recognition, self‑supervised learning, and audio processing components for PyTorch. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Taipei, Taiwan (pp. 1–9).
  20. Kalchbrenner, N., Elsen, E., Simonyan, K., Noury, S., Casagrande, N., Lockhart, E., Stimberg, F., van den Oord, A., Dieleman, S., and Kavukcuoglu, K. (2018). Efficient neural audio synthesis. In International Conference on Machine Learning, Stockholm, Sweden (pp. 2410–2419). PMLR.
  21. Kilgour, K., Zuluaga, M., Roblek, D., and Sharifi, M. (2019). Fréchet audio distance: A metric for evaluating music enhancement algorithms. In Interspeech 2019, Graz, Austria (pp. 2350–2354).
  22. Kim, T., Yang, Y.‑H., and Nam, J. (2022). Joint estimation of fader and equalizer gains of DJ mixers using convex optimization. In International Conference on Digital Audio Effects, Vienna, Austria (pp. 312–319).
  23. Kingma, D. P., and Ba, J. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations, San Diego, USA.
  24. Kong, J., Kim, J., and Bae, J. (2020). HiFi‑GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), Advances in Neural Information Processing Systems (Vol. 33, pp. 17022–17033). Virtual. Curran Associates, Inc.
  25. Kong, Z., Ping, W., Huang, J., Zhao, K., and Catanzaro, B. (2021). DiffWave: A versatile diffusion model for audio synthesis. In International Conference on Learning Representations, Vienna, Austria.
  26. Kumar, K., Kumar, R., de Boissiere, T., Gestin, L., Teoh, W. Z., Sotelo, J., de Brébisson, A., Bengio, Y., and Courville, A. C. (2019). MelGAN: Generative adversarial networks for conditional waveform synthesis. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché‑Buc, E. Fox, and R. Garnett (Eds.), Advances in Neural Information Processing Systems (Vol. 32). Curran Associates, Inc.
  27. Kuznetsov, B., Parker, J. D., and Esqueda, F. (2020). Differentiable IIR filters for machine learning applications. In International Conference on Digital Audio Effects, Vienna, Austria (pp. 297–303).
  28. Laroche, J. (2007). On the stability of time‑varying recursive filters. Journal of the Audio Engineering Society, 55(6), 460–471.
  29. Li, S., Zhao, Y., Varma, R., Salpekar, O., Noordhuis, P., Li, T., Paszke, A., Smith, J., Vaughan, B., and Damania, P. (2020). PyTorch distributed: Experiences on accelerating data parallel training. Proceedings of the VLDB Endowment, 13(12), 3005–3018.
  30. Liu, J., Li, C., Ren, Y., Chen, F., and Zhao, Z. (2022). DiffSinger: Singing voice synthesis via shallow diffusion mechanism. In AAAI Conference on Artificial Intelligence (Vol. 36, pp. 11020–11028). Virtual.
  31. Lu, H.‑L., and Smith III, J. O. (2000). Glottal source modeling for singing voice synthesis. In International Computer Music Conference (ICMC), Berlin, Germany.
  32. Macon, M. W., Jensen‑Link, L., Oliverio, J., Clements, M. A., and George, E. B. (1997). A singing voice synthesis system based on sinusoidal modeling. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), Munich, Germany (Vol. 1, pp. 435–438). IEEE.
  33. Markel, J. D., and Gray, A. H. (1976). Linear Prediction of Speech, volume 12 of Communication and Cybernetics. Springer.
  34. Morise, M., Yokomori, F., and Ozawa, K. (2016). WORLD: A vocoder‑based high‑quality speech synthesis system for real‑time applications. IEICE Transactions on Information and Systems, 99(7), 1877–1884.
  35. Nercessian, S. (2020). Neural parametric equalizer matching using differentiable biquads. In International Conference on Digital Audio Effects, Vienna, Austria (pp. 265–272).
  36. Nercessian, S. (2023). Differentiable WORLD synthesizer‑based neural vocoder with application to end‑to‑end audio style transfer. In Audio Engineering Society Convention 154 (paper 10661). https://aes2.org/publications/elibrary-page/?id=22073
  37. Nercessian, S., Sarroff, A., and Werner, K. J. (2021). Lightweight and interpretable neural modeling of an audio distortion effect using hyperconditioned differentiable biquads. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, Canada (pp. 890–894). IEEE.
  38. Oh, S., Lim, H., Byun, K., Hwang, M.‑J., Song, E., and Kang, H.‑G. (2020). ExcitGlow: Improving a WaveGlow‑based neural vocoder with linear prediction analysis. In Asia‑Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Auckland, New Zealand (pp. 831–836). IEEE.
  39. Prenger, R., Valle, R., and Catanzaro, B. (2019). WaveGlow: A flow‑based generative network for speech synthesis. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK (pp. 3617–3621). IEEE.
  40. Saino, K., Zen, H., Nankaku, Y., Lee, A., and Tokuda, K. (2006). An HMM‑based singing voice synthesis system. In Interspeech 2006, Pittsburgh, USA (paper 2077–Thu1BuP.7).
  41. Schulze‑Forster, K., Richard, G., Kelley, L., Doire, C. S., and Badeau, R. (2023). Unsupervised music source separation using differentiable parametric source models. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31, 1276–1289.
  42. Schwär, S. J., Krause, M., Fast, M., Rosenzweig, S., Scherbaum, F., and Müller, M. (2024). A dataset of larynx microphone recordings for singing voice reconstruction. Transactions of the International Society for Music Information Retrieval, 7(1), 30–43.
  43. International Telecommunication Union. (2014). Recommendation ITU‑R BS.1534: Method for the subjective assessment of intermediate quality level of audio systems. International Telecommunication Union Radiocommunication Assembly. https://www.itu.int/rec/R-REC-BS.1534-3-201510-I/en
  44. Serra, X., and Smith, J. (1990). Spectral modeling synthesis: A sound analysis/synthesis system based on a deterministic plus stochastic decomposition. Computer Music Journal, 14(4), 12–24.
  45. Shan, S., Hantrakul, L., Chen, J., Avent, M., and Trevelyan, D. (2022). Differentiable wavetable synthesis. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore (pp. 4598–4602). IEEE.
  46. Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., and Skerry‑Ryan, R. (2018). Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada (pp. 4779–4783). IEEE.
  47. Smith, J. O. (2024, October 25). Mathematics of the Discrete Fourier Transform (DFT). Online book. http://ccrma.stanford.edu/~jos/mdft/
  48. Smith, J. O. (2024, June 30). Introduction to Digital Filters with Audio Applications. Online book. http://ccrma.stanford.edu/~jos/filters/
  49. Steinmetz, C. J., Bryan, N. J., and Reiss, J. D. (2022). Style transfer of audio effects with differentiable signal processing. Journal of the Audio Engineering Society, 70(9), 708–721.
  50. Steinmetz, C. J., and Reiss, J. D. (2021). pyloudnorm: A simple yet flexible loudness meter in Python. AES 150th Convention, Virtual. Audio Engineering Society.
  51. Subramani, K., Valin, J.‑M., Isik, U., Smaragdis, P., and Krishnaswamy, A. (2022). End‑to‑End LPCNet: A neural vocoder with fully‑differentiable LPC estimation. In Interspeech 2022, Incheon, Korea (pp. 818–822).
  52. Takahashi, N., Singh, M. K., and Mitsufuji, Y. (2023). Hierarchical diffusion models for singing voice neural vocoder. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece (pp. 1–5). IEEE.
  53. Valin, J.‑M., and Skoglund, J. (2019). LPCNet: Improving neural speech synthesis through linear prediction. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK (pp. 5891–5895). IEEE.
  54. Wang, X., Takaki, S., and Yamagishi, J. (2020). Neural source‑filter waveform models for statistical parametric speech synthesis. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28, 402–415.
  55. Wright, A., and Välimäki, V. (2022). Grey‑box modelling of dynamic range compression. In International Conference on Digital Audio Effects, Vienna, Austria (pp. 304–311).
  56. Wu, D.‑Y., Hsiao, W.‑Y., Yang, F.‑R., Friedman, O. D., Jackson, W., Bruzenak, S., Liu, Y.‑W., and Yang, Y.‑H. (2022). DDSP‑based singing vocoders: A new subtractive‑based synthesizer and a comprehensive evaluation. In Proceedings of the 23rd International Society for Music Information Retrieval Conference, Bengaluru, India (pp. 76–83). ISMIR.
  57. Yoneyama, R., Wu, Y.‑C., and Toda, T. (2022). Unified source‑filter GAN with harmonic‑plus‑noise source excitation generation. In Interspeech 2022, Incheon, Korea (pp. 848–852).
  58. Yoshimura, T., Fujimoto, T., Oura, K., and Tokuda, K. (2023a). SPTK4: An open‑source software toolkit for speech signal processing. In 12th ISCA Speech Synthesis Workshop (SSW 2023), Grenoble, France (pp. 211–217).
  59. Yoshimura, T., Takaki, S., Nakamura, K., Oura, K., Hono, Y., Hashimoto, K., Nankaku, Y., and Tokuda, K. (2023b). Embedding a differentiable mel‑cepstral synthesis filter to a neural speech synthesis system. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece (pp. 1–5). IEEE.
  60. Yu, C.‑Y., and Fazekas, G. (2023). Singing voice synthesis using differentiable LPC and glottal‑flow‑inspired wavetables. In Proceedings of the 24th International Society for Music Information Retrieval Conference, Milan, Italy (pp. 667–675). ISMIR.
  61. Yu, C.‑Y., and Fazekas, G. (2024). Differentiable time‑varying linear prediction in the context of end‑to‑end analysis‑by‑synthesis. In Interspeech 2024, Kos, Greece (pp. 1820–1824).
  62. Zhang, Y., Hare, J., and Prügel‑Bennett, A. (2019). Deep set prediction networks. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché‑Buc, E. Fox, and R. Garnett (Eds.), Advances in Neural Information Processing Systems (Vol. 32). Curran Associates, Inc.
DOI: https://doi.org/10.5334/tismir.210 | Journal eISSN: 2514-3298
Language: English
Submitted on: Jul 1, 2024
Accepted on: Nov 2, 2024
Published on: Dec 19, 2024
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2024 Chin-Yun Yu, György Fazekas, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.