
A Dataset of Larynx Microphone Recordings for Singing Voice Reconstruction

Open Access | Feb 2024

References

  1. Alonso, J., and Erkut, C. (2021). Latent space explorations of singing voice synthesis using DDSP. In Proceedings of the Sound and Music Computing Conference (SMC), pages 183–190, Online.
  2. Askenfelt, A., Gauffin, J., Sundberg, J., and Kitzing, P. (1980). A comparison of contact microphone and electroglottograph for the measurement of vocal fundamental frequency. Journal of Speech, Language, and Hearing Research, 23(2):258–273. DOI: 10.1044/jshr.2302.258
  3. Benesty, J., Chen, J., and Huang, Y. (2008). Microphone Array Signal Processing, volume 1. Springer Verlag, 1st edition.
  4. Cano, E., FitzGerald, D., Liutkus, A., Plumbley, M. D., and Stöter, F. (2019). Musical source separation: An introduction. IEEE Signal Processing Magazine, 36(1):31–40. DOI: 10.1109/MSP.2018.2874719
  5. Cho, Y.-P., Yang, F.-R., Chang, Y.-C., Cheng, C.-T., Wang, X.-H., and Liu, Y.-W. (2021). A survey on recent deep learning-driven singing voice synthesis systems. In 2021 IEEE International Conference on Artificial Intelligence and Virtual Reality (AIVR), pages 319–323. DOI: 10.1109/AIVR52153.2021.00067
  6. Choi, H., Lee, J., Kim, W., Lee, J., Heo, H., and Lee, K. (2021). Neural analysis and synthesis: Reconstructing speech from self-supervised representations. In Advances in Neural Information Processing Systems (NeurIPS), pages 16251–16265, Virtual.
  7. Choi, H.-S., Yang, J., Lee, J., and Kim, H. (2022). NANSY++: Unified voice synthesis with neural analysis and synthesis. Computing Research Repository (CoRR), abs/2211.09407.
  8. Dai, J., and Dixon, S. (2019). Intonation trajectories within tones in unaccompanied soprano, alto, tenor, bass quartet singing. The Journal of the Acoustical Society of America, 146(2):1005–1014. DOI: 10.1121/1.5120483
  9. Dekens, T., Patsis, Y., Verhelst, W., Beaugendre, F., and Capman, F. (2008). A multi-sensor speech database with applications towards robust speech processing in hostile environments. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), Marrakech, Morocco.
  10. Engel, J., Hantrakul, L., Gu, C., and Roberts, A. (2020). DDSP: Differentiable digital signal processing. In Proceedings of the International Conference on Learning Representations (ICLR).
  11. Gannot, S., Burshtein, D., and Weinstein, E. (2001). Signal enhancement using beamforming and nonstationarity with applications to speech. IEEE Transactions on Signal Processing, 49(8):1614–1626. DOI: 10.1109/78.934132
  12. Graciarena, M., Franco, H., Sonmez, K., and Bratt, H. (2003). Combining standard and throat microphones for robust speech recognition. IEEE Signal Processing Letters, 10(3):72–74. DOI: 10.1109/LSP.2003.808549
  13. He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, Las Vegas, NV, USA. DOI: 10.1109/CVPR.2016.90
  14. Henry, P., and Letowski, T. (2007). Bone conduction: Anatomy, physiology, and communication. Technical Report ARL-TR-4138, United States Army Research Laboratory.
  15. Herbst, C. (2020). Electroglottography – an update. Journal of Voice, 34(4):503–526. DOI: 10.1016/j.jvoice.2018.12.014
  16. Hershey, S., Chaudhuri, S., Ellis, D. P. W., Gemmeke, J. F., Jansen, A., Moore, R. C., Plakal, M., Platt, D., Saurous, R. A., Seybold, B., Slaney, M., Weiss, R. J., and Wilson, K. (2017). CNN architectures for large-scale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 131–135. DOI: 10.1109/ICASSP.2017.7952132
  17. Kates, J. M. (1992). On using coherence to measure distortion in hearing aids. The Journal of the Acoustical Society of America, 91(4):2236–2244. DOI: 10.1121/1.403657
  18. Kilgour, K., Zuluaga, M., Roblek, D., and Sharifi, M. (2019). Fréchet audio distance: A metric for evaluating music enhancement algorithms. Computing Research Repository (CoRR), abs/1812.08466. DOI: 10.21437/Interspeech.2019-2219
  19. Li, X., Chebiyyam, V., and Kirchhoff, K. (2019). Speech audio super-resolution for speech recognition. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), pages 3416–3420, Graz, Austria. DOI: 10.21437/Interspeech.2019-3043
  20. McBride, M., Tran, P., Letowski, T., and Patrick, R. (2011). The effect of bone conduction microphone locations on speech intelligibility and sound quality. Applied Ergonomics, 42(3):495–502. DOI: 10.1016/j.apergo.2010.09.004
  21. Mitsufuji, Y., Fabbro, G., Uhlich, S., Stöter, F.-R., Défossez, A., Kim, M., Choi, W., Yu, C.-Y., and Cheuk, K.-W. (2022). Music Demixing Challenge 2021. Frontiers in Signal Processing, 1. DOI: 10.3389/frsip.2021.808395
  22. Nakajima, Y., Kashioka, H., Shikano, K., and Campbell, N. (2003). Non-audible murmur recognition input interface using stethoscopic microphone attached to the skin. In Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’03), volume 5, pages 708–711. DOI: 10.1109/ICASSP.2003.1200069
  23. Otani, M., Hirahara, T., and Adachi, S. (2006). Numerical simulation of sound originated from the vocal tract in soft neck tissues. The Journal of the Acoustical Society of America, 120(5):3352–3352. DOI: 10.1121/1.4781428
  24. Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. (2022). Robust speech recognition via large-scale weak supervision. Technical report, OpenAI.
  25. Renault, L., Mignot, R., and Roebel, A. (2022). Differentiable piano model for MIDI-to-audio performance synthesis. In Proceedings of the 25th International Conference on Digital Audio Effects (DAFx), Vienna, Austria.
  26. Rosenzweig, S., Cuesta, H., Weis, C., Scherbaum, F., Gómez, E., and Müller, M. (2020). Dagstuhl ChoirSet: A multitrack dataset for MIR research on choral singing. Transactions of the International Society for Music Information Retrieval (TISMIR), 3(1):98–110. DOI: 10.5334/tismir.48
  27. Rosenzweig, S., Scherbaum, F., and Müller, M. (2022). Computer-assisted analysis of field recordings: A case study of Georgian funeral songs. ACM Journal on Computing and Cultural Heritage (JOCCH), 16(1):1–16. DOI: 10.1145/3551645
  28. Scherbaum, F. (2016). On the benefit of larynx-microphone field recordings for the documentation and analysis of polyphonic vocal music. In Proceedings of the International Workshop on Folk Music Analysis, pages 80–87.
  29. Scherbaum, F., Mzhavanadze, N., Rosenzweig, S., and Müller, M. (2022). Tuning systems of traditional Georgian singing determined from a new corpus of field recordings. Musicologist, 6(2):142–168. DOI: 10.33906/musicologist.1068947
  30. Schmidt, K., and Edler, B. (2021). Blind bandwidth extension of speech based on LPCNet. In 2020 28th European Signal Processing Conference (EUSIPCO), pages 426–430. DOI: 10.23919/Eusipco47968.2020.9287465
  31. Schoeffler, M., Bartoschek, S., Stöter, F.-R., Roess, M., Westphal, S., Edler, B., and Herre, J. (2018). webMUSHRA – a comprehensive framework for web-based listening tests. Journal of Open Research Software, 6(8). DOI: 10.5334/jors.187
  32. Schulze-Forster, K., Doire, C. S. J., Richard, G., and Badeau, R. (2022). Unsupervised audio source separation using differentiable parametric source models. Computing Research Repository (CoRR), abs/2201.09592.
  33. Serrà, J., Pascual, S., Pons, J., Araz, R. O., and Scaini, D. (2022). Universal speech enhancement with score-based diffusion. Computing Research Repository (CoRR), abs/2206.03065.
  34. Serra, X., and Smith III, J. (1990). Spectral modeling synthesis: A sound analysis/synthesis system based on a deterministic plus stochastic decomposition. Computer Music Journal, 14(4):12–24. DOI: 10.2307/3680788
  35. Shimizu, S., Otani, M., and Hirahara, T. (2009). Frequency characteristics of several non-audible murmur (NAM) microphones. Acoustical Science and Technology, 30(2):139–142. DOI: 10.1250/ast.30.139
  36. Stupakov, A., Hanusa, E., Bilmes, J., and Fox, D. (2009). COSINE – a corpus of multi-party conversational speech in noisy environments. In Proceedings of the 2009 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’09), pages 4153–4156. DOI: 10.1109/ICASSP.2009.4960543
  37. Vincent, E., Gribonval, R., and Févotte, C. (2006). Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech, and Language Processing, 14(4):1462–1469. DOI: 10.1109/TSA.2005.858005
  38. Vincent, E., Virtanen, T., and Gannot, S., editors (2018). Audio Source Separation and Speech Enhancement. Wiley, 1st edition. DOI: 10.1002/9781119279860
  39. Werner, N., Balke, S., Stöter, F.-R., Müller, M., and Edler, B. (2017). trackswitch.js: A versatile web-based audio player for presenting scientific results. In Proceedings of the Web Audio Conference (WAC), London, UK.
  40. Wu, D.-Y., Hsiao, W.-Y., Yang, F.-R., Friedman, O., Jackson, W., Bruzenak, S., Liu, Y.-W., and Yang, Y.-H. (2022). DDSP-based singing vocoders: A new subtractive-based synthesizer and a comprehensive evaluation. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 76–83, Bengaluru, India.
  41. Zhuo, L., Yuan, R., Pan, J., Ma, Y., Li, Y., Zhang, G., Liu, S., Dannenberg, R., Fu, J., Lin, C., Benetos, E., Chen, W., Xue, W., and Guo, Y. (2023). LyricWhiz: Robust multilingual zero-shot lyrics transcription by Whispering to ChatGPT. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 343–351, Milano, Italy.
DOI: https://doi.org/10.5334/tismir.166 | Journal eISSN: 2514-3298
Language: English
Submitted on: Mar 10, 2023
Accepted on: Jan 6, 2024
Published on: Feb 23, 2024
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2024 Simon Schwär, Michael Krause, Michael Fast, Sebastian Rosenzweig, Frank Scherbaum, Meinard Müller, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.