References
- Alonso, J., and Erkut, C. (2021). Latent space explorations of singing voice synthesis using DDSP. In Proceedings of the Sound and Music Computing Conference (SMC), pages 183–190, Online.
- Askenfelt, A., Gauffin, J., Sundberg, J., and Kitzing, P. (1980). A comparison of contact microphone and electroglottograph for the measurement of vocal fundamental frequency. Journal of Speech, Language, and Hearing Research, 23(2):258–273. DOI: 10.1044/jshr.2302.258
- Benesty, J., Chen, J., and Huang, Y. (2008). Microphone Array Signal Processing, volume 1. Springer Verlag, 1st edition.
- Cano, E., FitzGerald, D., Liutkus, A., Plumbley, M. D., and Stöter, F.-R. (2019). Musical source separation: An introduction. IEEE Signal Processing Magazine, 36(1):31–40. DOI: 10.1109/MSP.2018.2874719
- Cho, Y.-P., Yang, F.-R., Chang, Y.-C., Cheng, C.-T., Wang, X.-H., and Liu, Y.-W. (2021). A survey on recent deep learning-driven singing voice synthesis systems. In 2021 IEEE International Conference on Artificial Intelligence and Virtual Reality (AIVR), pages 319–323. DOI: 10.1109/AIVR52153.2021.00067
- Choi, H., Lee, J., Kim, W., Lee, J., Heo, H., and Lee, K. (2021). Neural analysis and synthesis: Reconstructing speech from self-supervised representations. In Advances in Neural Information Processing Systems (NeurIPS), pages 16251–16265, Virtual.
- Choi, H.-S., Yang, J., Lee, J., and Kim, H. (2022). NANSY++: Unified voice synthesis with neural analysis and synthesis. Computing Research Repository (CoRR), abs/2211.09407.
- Dai, J., and Dixon, S. (2019). Intonation trajectories within tones in unaccompanied soprano, alto, tenor, bass quartet singing. The Journal of the Acoustical Society of America, 146(2):1005–1014. DOI: 10.1121/1.5120483
- Dekens, T., Patsis, Y., Verhelst, W., Beaugendre, F., and Capman, F. (2008). A multi-sensor speech database with applications towards robust speech processing in hostile environments. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), Marrakech, Morocco.
- Engel, J., Hantrakul, L., Gu, C., and Roberts, A. (2020). DDSP: Differentiable digital signal processing. In Proceedings of the International Conference on Learning Representations (ICLR).
- Gannot, S., Burshtein, D., and Weinstein, E. (2001). Signal enhancement using beamforming and nonstationarity with applications to speech. IEEE Transactions on Signal Processing, 49(8):1614–1626. DOI: 10.1109/78.934132
- Graciarena, M., Franco, H., Sonmez, K., and Bratt, H. (2003). Combining standard and throat microphones for robust speech recognition. IEEE Signal Processing Letters, 10(3):72–74. DOI: 10.1109/LSP.2003.808549
- He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, Las Vegas, NV, USA. DOI: 10.1109/CVPR.2016.90
- Henry, P., and Letowski, T. (2007). Bone conduction: Anatomy, physiology, and communication. Technical Report ARL-TR-4138, United States Army Research Laboratory.
- Herbst, C. (2020). Electroglottography – an update. Journal of Voice, 34(4):503–526. DOI: 10.1016/j.jvoice.2018.12.014
- Hershey, S., Chaudhuri, S., Ellis, D. P. W., Gemmeke, J. F., Jansen, A., Moore, R. C., Plakal, M., Platt, D., Saurous, R. A., Seybold, B., Slaney, M., Weiss, R. J., and Wilson, K. (2017). CNN architectures for large-scale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 131–135. DOI: 10.1109/ICASSP.2017.7952132
- Kates, J. M. (1992). On using coherence to measure distortion in hearing aids. The Journal of the Acoustical Society of America, 91(4):2236–2244. DOI: 10.1121/1.403657
- Kilgour, K., Zuluaga, M., Roblek, D., and Sharifi, M. (2019). Fréchet audio distance: A reference-free metric for evaluating music enhancement algorithms. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), Graz, Austria. DOI: 10.21437/Interspeech.2019-2219
- Li, X., Chebiyyam, V., and Kirchhoff, K. (2019). Speech audio super-resolution for speech recognition. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), pages 3416–3420, Graz, Austria. DOI: 10.21437/Interspeech.2019-3043
- McBride, M., Tran, P., Letowski, T., and Patrick, R. (2011). The effect of bone conduction microphone locations on speech intelligibility and sound quality. Applied Ergonomics, 42(3):495–502. DOI: 10.1016/j.apergo.2010.09.004
- Mitsufuji, Y., Fabbro, G., Uhlich, S., Stöter, F.-R., Défossez, A., Kim, M., Choi, W., Yu, C.-Y., and Cheuk, K.-W. (2022). Music Demixing Challenge 2021. Frontiers in Signal Processing, 1. DOI: 10.3389/frsip.2021.808395
- Nakajima, Y., Kashioka, H., Shikano, K., and Campbell, N. (2003). Non-audible murmur recognition input interface using stethoscopic microphone attached to the skin. In Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’03), volume 5, pages 708–711. DOI: 10.1109/ICASSP.2003.1200069
- Otani, M., Hirahara, T., and Adachi, S. (2006). Numerical simulation of sound originated from the vocal tract in soft neck tissues. The Journal of the Acoustical Society of America, 120(5):3352–3352. DOI: 10.1121/1.4781428
- Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. (2022). Robust speech recognition via large-scale weak supervision. Technical report, OpenAI.
- Renault, L., Mignot, R., and Roebel, A. (2022). Differentiable piano model for MIDI-to-audio performance synthesis. In Proceedings of the 25th International Conference on Digital Audio Effects (DAFx), Vienna, Austria.
- Rosenzweig, S., Cuesta, H., Weis, C., Scherbaum, F., Gómez, E., and Müller, M. (2020). Dagstuhl ChoirSet: A multitrack dataset for MIR research on choral singing. Transactions of the International Society for Music Information Retrieval (TISMIR), 3(1):98–110. DOI: 10.5334/tismir.48
- Rosenzweig, S., Scherbaum, F., and Müller, M. (2022). Computer-assisted analysis of field recordings: A case study of Georgian funeral songs. ACM Journal on Computing and Cultural Heritage (JOCCH), 16(1):1–16. DOI: 10.1145/3551645
- Scherbaum, F. (2016). On the benefit of larynx-microphone field recordings for the documentation and analysis of polyphonic vocal music. In Proceedings of the International Workshop on Folk Music Analysis (FMA), pages 80–87.
- Scherbaum, F., Mzhavanadze, N., Rosenzweig, S., and Müller, M. (2022). Tuning systems of traditional Georgian singing determined from a new corpus of field recordings. Musicologist, 6(2):142–168. DOI: 10.33906/musicologist.1068947
- Schmidt, K., and Edler, B. (2021). Blind bandwidth extension of speech based on LPCNet. In 2020 28th European Signal Processing Conference (EUSIPCO), pages 426–430. DOI: 10.23919/Eusipco47968.2020.9287465
- Schoeffler, M., Bartoschek, S., Stöter, F.-R., Roess, M., Westphal, S., Edler, B., and Herre, J. (2018). webMUSHRA — a comprehensive framework for web-based listening tests. Journal of Open Research Software, 6(8). DOI: 10.5334/jors.187
- Schulze-Forster, K., Doire, C. S. J., Richard, G., and Badeau, R. (2022). Unsupervised audio source separation using differentiable parametric source models. Computing Research Repository (CoRR), abs/2201.09592.
- Serrà, J., Pascual, S., Pons, J., Araz, R. O., and Scaini, D. (2022). Universal speech enhancement with score-based diffusion. Computing Research Repository (CoRR), abs/2206.03065.
- Serra, X., and Smith III, J. (1990). Spectral modeling synthesis: A sound analysis/synthesis system based on a deterministic plus stochastic decomposition. Computer Music Journal, 14(4):12–24. DOI: 10.2307/3680788
- Shimizu, S., Otani, M., and Hirahara, T. (2009). Frequency characteristics of several non-audible murmur (NAM) microphones. Acoustical Science and Technology, 30(2):139–142. DOI: 10.1250/ast.30.139
- Stupakov, A., Hanusa, E., Bilmes, J., and Fox, D. (2009). COSINE – a corpus of multi-party conversational speech in noisy environments. In Proceedings of the 2009 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’09), pages 4153–4156. DOI: 10.1109/ICASSP.2009.4960543
- Vincent, E., Gribonval, R., and Févotte, C. (2006). Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech, and Language Processing, 14(4):1462–1469. DOI: 10.1109/TSA.2005.858005
- Vincent, E., Virtanen, T., and Gannot, S., editors (2018). Audio Source Separation and Speech Enhancement. Wiley, 1st edition. DOI: 10.1002/9781119279860
- Werner, N., Balke, S., Stöter, F.-R., Müller, M., and Edler, B. (2017). trackswitch.js: A versatile web-based audio player for presenting scientific results. In Proceedings of the Web Audio Conference (WAC), London, UK.
- Wu, D.-Y., Hsiao, W.-Y., Yang, F.-R., Friedman, O., Jackson, W., Bruzenak, S., Liu, Y.-W., and Yang, Y.-H. (2022). DDSP-based singing vocoders: A new subtractive-based synthesizer and a comprehensive evaluation. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 76–83, Bengaluru, India.
- Zhuo, L., Yuan, R., Pan, J., Ma, Y., Li, Y., Zhang, G., Liu, S., Dannenberg, R., Fu, J., Lin, C., Benetos, E., Chen, W., Xue, W., and Guo, Y. (2023). LyricWhiz: Robust multilingual zero-shot lyrics transcription by Whispering to ChatGPT. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 343–351, Milano, Italy.
