
Enhanced LSTM network with semi-supervised learning and data augmentation for low-resource ASR

Open Access | Mar 2025

References

  1. Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. <em>Advances in Neural Information Processing Systems</em>, 33:12449–12460, 2020.
  2. Anton Ragni, Kate M Knill, Shakti P Rath, and Mark JF Gales. Data augmentation for low resource languages. In <em>INTERSPEECH 2014: 15th annual conference of the international speech communication association</em>, pages 810–814. International Speech Communication Association (ISCA), 2014.
  3. Shiyu Zhou, Shuang Xu, and Bo Xu. Multilingual end-to-end speech recognition with a single transformer on low-resource languages. <em>arXiv preprint arXiv:1806.05059</em>, 2018.
  4. Cheng Yi, Jianzhong Wang, Ning Cheng, Shiyu Zhou, and Bo Xu. Applying wav2vec 2.0 to speech recognition in various low-resource languages. <em>arXiv preprint arXiv:2012.12121</em>, 2020.
  5. Satwinder Singh, Ruili Wang, and Feng Hou. Improved meta learning for low resource speech recognition. In <em>ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</em>, pages 4798–4802. IEEE, 2022.
  6. Ankit Kumar and Rajesh Kumar Aggarwal. A hybrid CNN-LiGRU acoustic modeling using raw waveform SincNet for Hindi ASR. <em>Computer Science</em>, 21(4), 2020.
  7. A Kumar, T Choudhary, M Dua, and M Sabharwal. Hybrid end-to-end architecture for Hindi speech recognition system. In <em>Proceedings of the International Conference on Paradigms of Communication, Computing and Data Sciences: PCCDS 2021</em>, pages 267–276. Springer, 2022.
  8. Ankit Kumar and Rajesh K Aggarwal. An investigation of multilingual TDNN-BLSTM acoustic modeling for Hindi speech recognition. <em>International Journal of Sensors Wireless Communications and Control</em>, 12(1):19–31, 2022.
  9. Ali Bou Nassif, Ismail Shahin, Imtinan Attili, Mohammad Azzeh, and Khaled Shaalan. Speech recognition using deep neural networks: A systematic review. <em>IEEE Access</em>, 7:19143–19165, 2019.
  10. Martijn Bartelds, Nay San, Bradley McDonnell, Dan Jurafsky, and Martijn Wieling. Making more of little data: Improving low-resource automatic speech recognition using data augmentation. <em>arXiv preprint arXiv:2305.10951</em>, 2023.
  11. Ankit Kumar and Rajesh Kumar Aggarwal. An exploration of semi-supervised and language-adversarial transfer learning using hybrid acoustic model for Hindi speech recognition. <em>Journal of Reliable Intelligent Environments</em>, 8(2):117–132, 2022.
  12. Jacob Kahn, Ann Lee, and Awni Hannun. Self-training for end-to-end speech recognition. In <em>ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</em>, pages 7084–7088. IEEE, 2020.
  13. Yosuke Higuchi, Niko Moritz, Jonathan Le Roux, and Takaaki Hori. Momentum pseudo-labeling for semi-supervised speech recognition. <em>arXiv preprint arXiv:2106.08922</em>, 2021.
  14. Daniel S Park, Yu Zhang, Ye Jia, Wei Han, Chung-Cheng Chiu, Bo Li, Yonghui Wu, and Quoc V Le. Improved noisy student training for automatic speech recognition. <em>arXiv preprint arXiv:2005.09629</em>, 2020.
  15. Han Zhu, Dongji Gao, Gaofeng Cheng, Daniel Povey, Pengyuan Zhang, and Yonghong Yan. Alternative pseudo-labeling for semi-supervised automatic speech recognition. <em>IEEE/ACM Transactions on Audio, Speech, and Language Processing</em>, 2023.
  16. Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. <em>Advances in Neural Information Processing Systems</em>, 33:12449–12460, 2020.
  17. Julia Mainzinger. Fine-tuning ASR models for very low-resource languages: A study on Mvskoke. Master's thesis, University of Washington, 2024.
  18. Robert Jimerson, Zoey Liu, and Emily Prud'hommeaux. An (unhelpful) guide to selecting the best ASR architecture for your under-resourced language. In <em>Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</em>, pages 1008–1016, 2023.
  19. Shiyue Zhang, Ben Frey, and Mohit Bansal. How can NLP help revitalize endangered languages? A case study and roadmap for the Cherokee language. <em>arXiv preprint arXiv:2204.11909</em>, 2022.
  20. Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, et al. Scaling speech technology to 1,000+ languages. <em>Journal of Machine Learning Research</em>, 25(97):1–52, 2024.
  21. Marieke Meelen, Alexander O'Neill, and Rolando Coto-Solano. End-to-end speech recognition for endangered languages of Nepal. In <em>Proceedings of the Seventh Workshop on the Use of Computational Methods in the Study of Endangered Languages</em>, pages 83–93, 2024.
  22. Panji Arisaputra, Alif Tri Handoyo, and Amalia Zahra. XLS-R deep learning model for multilingual ASR on low-resource languages: Indonesian, Javanese, and Sundanese. <em>arXiv preprint arXiv:2401.06832</em>, 2024.
  23. Siqing Qin, Longbiao Wang, Sheng Li, Jianwu Dang, and Lixin Pan. Improving low-resource Tibetan end-to-end ASR by multilingual and multilevel unit modeling. <em>EURASIP Journal on Audio, Speech, and Music Processing</em>, 2022(1):2, 2022.
  24. Kaushal Bhogale, Abhigyan Raman, Tahir Javed, Sumanth Doddapaneni, Anoop Kunchukuttan, Pratyush Kumar, and Mitesh M Khapra. Effectiveness of mining audio and text pairs from public data for improving ASR systems for low-resource languages. In <em>ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</em>, pages 1–5. IEEE, 2023.
  25. Zoey Liu, Justin Spence, and Emily Prud'hommeaux. Studying the impact of language model size for low-resource ASR. In <em>Proceedings of the Sixth Workshop on the Use of Computational Methods in the Study of Endangered Languages</em>, pages 77–83, 2023.
  26. Gueorgui Pironkov, Sean UN Wood, and Stéphane Dupont. Hybrid-task learning for robust automatic speech recognition. <em>Computer Speech &amp; Language</em>, 64:101103, 2020.
  27. Mohamed Tamazin, Ahmed Gouda, and Mohamed Khedr. Enhanced automatic speech recognition system based on enhancing power-normalized cepstral coefficients. <em>Applied Sciences</em>, 9(10):2166, 2019.
  28. Syed Shahnawazuddin, KT Deepak, Gayadhar Pradhan, and Rohit Sinha. Enhancing noise and pitch robustness of children's ASR. In <em>2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</em>, pages 5225–5229. IEEE, 2017.
  29. Jiri Malek, Jindrich Zdansky, and Petr Cerva. Robust automatic recognition of speech with background music. In <em>2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</em>, pages 5210–5214. IEEE, 2017.
  30. Sheng-Chieh Lee, Jhing-Fa Wang, and Miao-Hia Chen. Threshold-based noise detection and reduction for automatic speech recognition system in human-robot interactions. <em>Sensors</em>, 18(7):2068, 2018.
  31. Daniel S Park, Yu Zhang, Ye Jia, Wei Han, Chung-Cheng Chiu, Bo Li, Yonghui Wu, and Quoc V Le. Improved noisy student training for automatic speech recognition. <em>arXiv preprint arXiv:2005.09629</em>, 2020.
  32. Satyender Jaglan, Sanjeev Kumar Dhull, and Krishna Kant Singh. Tertiary wavelet model based automatic epilepsy classification system. <em>International Journal of Intelligent Unmanned Systems</em>, 11(1):166–181, 2023.
  33. Yuzong Liu and Katrin Kirchhoff. Graph-based semisupervised learning for acoustic modeling in automatic speech recognition. <em>IEEE/ACM Transactions on Audio, Speech, and Language Processing</em>, 24(11):1946–1956, 2016.
  34. Michael I Mandel and Jon Barker. Multichannel spatial clustering for robust far-field automatic speech recognition in mismatched conditions. In <em>INTERSPEECH</em>, pages 1991–1995, 2016.
  35. Naoki Hirayama, Koichiro Yoshino, Katsutoshi Itoyama, Shinsuke Mori, and Hiroshi G Okuno. Automatic speech recognition for mixed dialect utterances by mixing dialect language models. <em>IEEE/ACM Transactions on Audio, Speech, and Language Processing</em>, 23(2):373–382, 2015.
  36. Delu Zeng, Minyu Liao, Mohammad Tavakolian, Yulan Guo, Bolei Zhou, Dewen Hu, Matti Pietikäinen, and Li Liu. Deep learning for scene classification: A survey. <em>arXiv preprint arXiv:2101.10531</em>, 2021.
  37. Harveen Singh Chadha, Anirudh Gupta, Priyanshi Shah, Neeraj Chhimwal, Ankur Dhuriya, Rishabh Gaur, and Vivek Raghavan. Vakyansh: ASR toolkit for low resource Indic languages. <em>arXiv preprint arXiv:2203.16512</em>, 2022.
  38. Jaehyeon Kim, Sungwon Kim, Jungil Kong, and Sungroh Yoon. Glow-TTS: A generative flow for text-to-speech via monotonic alignment search. <em>Advances in Neural Information Processing Systems</em>, 33:8067–8077, 2020.
  39. Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. <em>Advances in Neural Information Processing Systems</em>, 33:17022–17033, 2020.
  40. Lori Lamel, Jean-Luc Gauvain, and Gilles Adda. Lightly supervised and unsupervised acoustic model training. <em>Computer Speech &amp; Language</em>, 16(1):115–129, 2002.
  41. Ho Yin Chan and Phil Woodland. Improving broadcast news transcription by lightly supervised discriminative training. In <em>2004 IEEE International Conference on Acoustics, Speech, and Signal Processing</em>, volume 1, pages I–737. IEEE, 2004.
  42. Vimal Manohar, Hossein Hadian, Daniel Povey, and Sanjeev Khudanpur. Semi-supervised training of acoustic models using lattice-free MMI. In <em>2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</em>, pages 4844–4848. IEEE, 2018.
  43. Thiago Fraga-Silva, Jean-Luc Gauvain, and Lori Lamel. Lattice-based unsupervised acoustic model training. In <em>2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</em>, pages 4656–4659. IEEE, 2011.
  44. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. <em>Advances in Neural Information Processing Systems</em>, 30, 2017.
Language: English
Submitted on: Nov 20, 2024
Published on: Mar 4, 2025
Published by: Professor Subhas Chandra Mukhopadhyay
In partnership with: Paradigm Publishing Services
Publication frequency: once per year

© 2025 Tripti Choudhary, Vishal Goyal, Atul Bansal, published by Professor Subhas Chandra Mukhopadhyay
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.