References
- 1. Avendano, C. (2003). Frequency-domain source identification and manipulation in stereo mixes for enhancement, suppression and re-panning applications. In 2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (IEEE Cat. No. 03TH8684), pages 55–58. IEEE. DOI: 10.1109/ASPAA.2003.1285818
- 2. Brossier, P., Tintamar, Muller, E., Philippsen, N., Seaver, T., Fritz, H., cyclopsian, Alexander, S., Williams, J., Cowgill, J., and Cruz, A. (2019). aubio/aubio: 0.4.9. DOI: 10.5281/zenodo.2578765
- 3. Chen, Z., Luo, Y., and Mesgarani, N. (2017). Deep attractor network for single-microphone speaker separation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 246–250. DOI: 10.1109/ICASSP.2017.7952155
- 4. Defferrard, M., Benzi, K., Vandergheynst, P., and Bresson, X. (2016). FMA: A dataset for music analysis. arXiv preprint arXiv:1612.01840.
- 5. Dubey, H., Gopal, V., Cutler, R., Aazami, A., Matusevych, S., Braun, S., Eskimez, S. E., Thakker, M., Yoshioka, T., Gamper, H., et al. (2022). ICASSP 2022 Deep Noise Suppression Challenge. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 9271–9275. DOI: 10.1109/ICASSP43922.2022.9747230
- 6. Fabbro, G., Uhlich, S., Lai, C.-H., Choi, W., Martínez-Ramírez, M., Liao, W., Gadelha, I., Ramos, G., Hsu, E., Rodrigues, H., Stöter, F.-R., Défossez, A., Luo, Y., Yu, J., Chakraborty, D., Mohanty, S., Solovyev, R., Stempkovskiy, A., Habruseva, T., Goswami, N., Harada, T., Kim, M., Lee, J. H., Dong, Y., Zhang, X., Liu, J., and Mitsufuji, Y. (2024). The Sound Demixing Challenge 2023 – Music Demixing Track. Transactions of the International Society for Music Information Retrieval, 7(1):63–84. DOI: 10.5334/tismir.171
- 7. Fonseca, E., Favory, X., Pons, J., Font, F., and Serra, X. (2021). FSD50K: An open dataset of human-labeled sound events. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:829–852. DOI: 10.1109/TASLP.2021.3133208
- 8. Geiger, J. T., Grosche, P., and Parodi, Y. L. (2015). Dialogue enhancement of stereo sound. In 23rd European Signal Processing Conference (EUSIPCO), pages 869–873. DOI: 10.1109/EUSIPCO.2015.7362507
- 9. Grais, E. M., Sen, M. U., and Erdogan, H. (2014). Deep neural networks for single channel source separation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3734–3738. DOI: 10.1109/ICASSP.2014.6854299
- 10. Hershey, J. R., Chen, Z., Le Roux, J., and Watanabe, S. (2016). Deep clustering: Discriminative embeddings for segmentation and separation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 31–35. DOI: 10.1109/ICASSP.2016.7471631
- 11. Huang, P.-S., Chen, S. D., Smaragdis, P., and Hasegawa-Johnson, M. (2012). Singing-voice separation from monaural recordings using robust principal component analysis. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 57–60. DOI: 10.1109/ICASSP.2012.6287816
- 12. International Telecommunication Union (2015). ITU-R BS.1770-4: Algorithms to measure audio programme loudness and true-peak audio level. https://www.itu.int/rec/R-REC-BS.1770
- 13. Kim, M., Choi, W., Chung, J., Lee, D., and Jung, S. (2021). KUIELab-MDX-Net: A two-stream neural network for music demixing. arXiv preprint arXiv:2111.12203.
- 14. Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- 15. Le Roux, J., Wisdom, S., Erdogan, H., and Hershey, J. R. (2019). SDR – Half-baked or well done? In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 626–630. DOI: 10.1109/ICASSP.2019.8683855
- 16. Luo, Y. and Yu, J. (2023). Music source separation with band-split RNN. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:1893–1901. DOI: 10.1109/TASLP.2023.3271145
- 17. Martínez-Ramírez, M. A., Liao, W.-H., Fabbro, G., Uhlich, S., Nagashima, C., and Mitsufuji, Y. (2022). Automatic music mixing with deep learning and out-of-domain data. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR).
- 18. Masri, P. (1996). Computer Modelling of Sound for Transformation and Synthesis of Musical Signals. PhD thesis, University of Bristol.
- 19. Mitsufuji, Y., Fabbro, G., Uhlich, S., Stöter, F.-R., Défossez, A., Kim, M., Choi, W., Yu, C.-Y., and Cheuk, K.-W. (2022). Music Demixing Challenge 2021. Frontiers in Signal Processing, 1:18. DOI: 10.3389/frsip.2021.808395
- 20. Panayotov, V., Chen, G., Povey, D., and Khudanpur, S. (2015). Librispeech: An ASR corpus based on public domain audio books. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210. DOI: 10.1109/ICASSP.2015.7178964
- 21. Paulus, J., Torcoli, M., Uhle, C., Herre, J., Disch, S., and Fuchs, H. (2019). Source separation for enabling dialogue enhancement in object-based broadcast with MPEG-H. Journal of the Audio Engineering Society, 67(7/8):510–521. DOI: 10.17743/jaes.2019.0032
- 22. Petermann, D., Wichern, G., Subramanian, A. S., Wang, Z.-Q., and Le Roux, J. (2023). Tackling the cocktail fork problem for separation and transcription of real-world soundtracks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31. DOI: 10.1109/TASLP.2023.3290428
- 23. Petermann, D., Wichern, G., Wang, Z.-Q., and Le Roux, J. (2022). The cocktail fork problem: Three-stem audio separation for real-world soundtracks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 526–530. DOI: 10.1109/ICASSP43922.2022.9746005
- 24. Rafii, Z., Liutkus, A., Stöter, F.-R., Mimilakis, S. I., and Bittner, R. (2019). MUSDB18-HQ – an uncompressed version of MUSDB18. DOI: 10.5281/zenodo.3338373
- 25. Rouard, S., Massa, F., and Défossez, A. (2023). Hybrid transformers for music source separation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). DOI: 10.1109/ICASSP49357.2023.10096956
- 26. Sawata, R., Takahashi, N., Uhlich, S., Takahashi, S., and Mitsufuji, Y. (2023). The whole is greater than the sum of its parts: Improving DNN-based music source separation. arXiv preprint arXiv:2305.07855.
- 27. Sawata, R., Uhlich, S., Takahashi, S., and Mitsufuji, Y. (2021). All for one and one for all: Improving music separation by bridging networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 51–55. DOI: 10.1109/ICASSP39728.2021.9414044
- 28. Solovyev, R., Stempkovskiy, A., and Habruseva, T. (2023). Benchmarks and leaderboards for sound demixing tasks. arXiv preprint arXiv:2305.07489.
- 29. Sound Effects Wiki (2024). Godzilla roar. https://soundeffects.fandom.com/wiki/Godzilla_Roar [Accessed: 2024-01-15].
- 30. Steinmetz, C. J. and Reiss, J. (2021). pyloudnorm: A simple yet flexible loudness meter in Python. In Audio Engineering Society Convention 150. Audio Engineering Society.
- 31. Stöter, F.-R., Liutkus, A., and Ito, N. (2018). The 2018 Signal Separation Evaluation Campaign. In Proceedings of the International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA), pages 293–305. Springer. DOI: 10.1007/978-3-319-93764-9_28
- 32. Stöter, F.-R., Uhlich, S., Liutkus, A., and Mitsufuji, Y. (2019). Open-Unmix – A reference implementation for music source separation. Journal of Open Source Software, 4(41):1667. DOI: 10.21105/joss.01667
- 33. Torcoli, M., Simon, C., Paulus, J., Straninger, D., Riedel, A., Koch, V., Wits, S., Rieger, D., Fuchs, H., Uhle, C., et al. (2021). Dialog+ in broadcasting: First field tests using deep-learning-based dialogue enhancement. arXiv preprint arXiv:2112.09494.
- 34. Tzanetakis, G., Jones, R., and McNally, K. (2007). Stereo panning features for classifying recording production style. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), pages 441–444.
- 35. Uhle, C., Hellmuth, O., and Weigel, J. (2008). Speech enhancement of movie sound. In Audio Engineering Society Convention 125. Audio Engineering Society.
- 36. Uhlich, S., Giron, F., and Mitsufuji, Y. (2015). Deep neural network based instrument extraction from music. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2135–2139. DOI: 10.1109/ICASSP.2015.7178348
- 37. Uhlich, S., Porcu, M., Giron, F., Enenkl, M., Kemp, T., Takahashi, N., and Mitsufuji, Y. (2017). Improving music source separation based on deep neural networks through data augmentation and network blending. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 261–265. DOI: 10.1109/ICASSP.2017.7952158
- 38. Vincent, E., Sawada, H., Bofill, P., Makino, S., and Rosca, J. P. (2007). First stereo audio source separation evaluation campaign: Data, algorithms and results. In Proceedings of the International Conference on Independent Component Analysis and Signal Separation, pages 552–559. Springer. DOI: 10.1007/978-3-540-74494-8_69
- 39. Watcharasupat, K. N., Wu, C.-W., Ding, Y., Orife, I., Hipple, A. J., Williams, P. A., Kramer, S., Lerch, A., and Wolcott, W. (2023). A generalized bandsplit neural network for cinematic audio source separation. IEEE Open Journal of Signal Processing. DOI: 10.1109/OJSP.2023.3339428
- 40. Wisdom, S., Hershey, J. R., Wilson, K., Thorpe, J., Chinen, M., Patton, B., and Saurous, R. A. (2019). Differentiable consistency constraints for improved deep speech enhancement. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 900–904. DOI: 10.1109/ICASSP.2019.8682783
- 41. Yu, D., Kolbæk, M., Tan, Z.-H., and Jensen, J. (2017). Permutation invariant training of deep models for speaker-independent multi-talker speech separation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 241–245. DOI: 10.1109/ICASSP.2017.7952154
- 42. Yu, J., Chen, H., Luo, Y., Gu, R., Li, W., and Weng, C. (2023). TSpeech-AI system description to the 5th Deep Noise Suppression (DNS) Challenge. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). DOI: 10.1109/ICASSP49357.2023.10097210
- 43. Yu, J., Luo, Y., Chen, H., Gu, R., and Weng, C. (2022). High fidelity speech enhancement with band-split RNN. arXiv preprint arXiv:2212.00406. DOI: 10.21437/Interspeech.2023-1433
