Enhanced LSTM network with semi-supervised learning and data augmentation for low-resource ASR

Tripti Choudhary; Vishal Goyal; Atul Bansal

doi:10.2478/ijssis-2025-0009

International Journal on Smart Sensing and Intelligent Systems

Volume 18 (2025): Issue 1 (January 2025)

By: Tripti Choudhary, Vishal Goyal and Atul Bansal

Open Access

Mar 2025

Abstract

Automatic speech recognition (ASR) is essential for developing intelligent systems capable of accurately processing human speech, particularly in low-resource languages. This study addresses the challenges faced by ASR systems in Indian languages, where data and resources are limited. The authors propose a novel three-step methodology that combines data augmentation and semi-supervised learning to enhance ASR performance. First, an enhanced long short-term memory (LSTM) network is used to train a baseline model with limited labeled data. Next, synthetic data is generated and combined with original recordings to refine the ASR model. Finally, semi-supervised training further boosts accuracy. Evaluations demonstrate significant improvements over existing models for Hindi, Marathi, and Odia languages.
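
The three steps described above can be sketched as a small, self-contained toy pipeline. This is only an illustration of the general pattern (baseline training, refinement on original plus synthetic data, then pseudo-label-based semi-supervised training), not the authors' implementation. Everything specific here is an assumption: a nearest-centroid classifier on scalar features stands in for the enhanced LSTM acoustic model, Gaussian jitter stands in for synthetic speech generation, and the confidence threshold used to filter pseudo-labels is a hypothetical choice that the abstract does not specify.

```python
import random
from statistics import mean


def train(examples):
    """Fit a toy nearest-centroid 'recognizer': label -> mean feature value.
    Assumption: this stands in for training the real acoustic model."""
    by_label = {}
    for feature, label in examples:
        by_label.setdefault(label, []).append(feature)
    return {label: mean(values) for label, values in by_label.items()}


def predict(model, feature):
    """Return (label, confidence). Confidence is a relative margin between
    the two closest centroids, ranging from 0.5 (tie) to 1.0 (exact match)."""
    ranked = sorted(model, key=lambda label: abs(feature - model[label]))
    best = ranked[0]
    if len(ranked) == 1:
        return best, 1.0
    d_best = abs(feature - model[best])
    d_next = abs(feature - model[ranked[1]])
    total = d_best + d_next
    confidence = 1.0 - d_best / total if total > 0 else 0.5
    return best, confidence


def augment(examples, rng, copies=2, noise=0.05):
    """Create synthetic variants by jittering features.
    Assumption: a placeholder for whatever synthetic data the paper uses."""
    return [(feature + rng.gauss(0.0, noise), label)
            for feature, label in examples
            for _ in range(copies)]


def three_stage_training(labeled, unlabeled, threshold=0.8, seed=0):
    rng = random.Random(seed)

    # Stage 1: baseline model from the small labeled set.
    baseline = train(labeled)

    # Stage 2: refine on original plus synthetic examples.
    combined = labeled + augment(labeled, rng)
    refined = train(combined)

    # Stage 3: pseudo-label the unlabeled pool, keep only confident
    # predictions, and retrain on everything.
    pseudo = []
    for feature in unlabeled:
        label, confidence = predict(refined, feature)
        if confidence >= threshold:
            pseudo.append((feature, label))
    final = train(combined + pseudo)
    return baseline, refined, final, pseudo


labeled = [(0.0, "a"), (0.2, "a"), (1.0, "b"), (1.2, "b")]
unlabeled = [0.05, 0.15, 0.6, 1.1, 1.3]
baseline, refined, final, pseudo = three_stage_training(labeled, unlabeled)
print("pseudo-labeled examples:", pseudo)
print("final centroids:", final)
```

In this toy setup the ambiguous unlabeled point halfway between the two classes receives a confidence near 0.5 and is filtered out, which is the usual motivation for thresholding pseudo-labels before retraining.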

DOI: https://doi.org/10.2478/ijssis-2025-0009 | Journal eISSN: 1178-5608

Language: English

Submitted on: Nov 20, 2024

Published on: Mar 4, 2025

Published by: Macquarie University, Australia

In partnership with: Paradigm Publishing Services

Publication frequency: 1 issue per year

Keywords: Automatic Speech Recognition, Data Augmentation, Semi-supervised learning, Low-resource ASR

Related subjects: Engineering; Introductions and overviews; Engineering, other

© 2025 Tripti Choudhary, Vishal Goyal, Atul Bansal, published by Macquarie University, Australia
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.