
Speech Processing Using Dynamic Micro-Block Optimization Based on Deep Learning

By: Jiajun Hao and Chaoyang Geng
Open Access
Dec 2025

Abstract

Driven by advances in deep learning, speech processing systems such as automatic speech recognition (ASR), source segregation, and noise suppression have achieved significant performance improvements. However, traditional training strategies, particularly static mini-batch selection, often overlook dynamic variations in data complexity and model convergence behavior, resulting in reduced training efficiency and limited model accuracy. To address this limitation, we introduce a novel training paradigm called Dynamic Micro-block Optimization (DMBO). DMBO employs a fine-grained sampling mechanism that partitions the training set into smaller units called "micro-blocks," which are dynamically updated during training based on real-time characteristics such as sample loss, gradient diversity, and utterance complexity. Four sampling strategies (loss-weighted, gradient-diversity, gender-based, and accent-based) are designed to self-adjust the composition of the training data. The DMBO framework is implemented with Connectionist Temporal Classification (CTC) and Long Short-Term Memory (LSTM) networks for end-to-end speech recognition. Experimental evaluations on the VCTK dataset demonstrate that the proposed method significantly accelerates convergence and improves model accuracy. Specifically, the gender-homogeneous strategy reduces the Label Error Rate (LER) by 9.0% compared to standard mini-batch training, while the accent-heterogeneous strategy achieves a 9.2% absolute LER reduction. These results confirm that dynamic optimization at the micro-block level enhances the efficacy of deep learning models in speech processing tasks and is consistent with theoretical expectations, validating the effectiveness of the proposed approach.
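
The abstract describes micro-blocks being reweighted during training according to their recent loss. The sketch below illustrates that general idea only; the block size, softmax-style weighting, temperature, and number of blocks drawn per step are assumptions made for illustration and are not taken from the paper.

```python
# Minimal sketch of loss-weighted micro-block sampling (illustrative assumptions only).
import numpy as np

rng = np.random.default_rng(0)

def partition_into_micro_blocks(indices, block_size):
    """Split the training-set indices into contiguous micro-blocks."""
    return [indices[i:i + block_size] for i in range(0, len(indices), block_size)]

def loss_weighted_sample(micro_blocks, per_sample_loss, n_blocks, temperature=1.0):
    """Pick micro-blocks with probability proportional to their mean recent loss.

    per_sample_loss holds the latest loss recorded for each training sample
    (assumed to be refreshed as training proceeds), so harder blocks are
    revisited more often while the model converges.
    """
    block_losses = np.array([per_sample_loss[b].mean() for b in micro_blocks])
    weights = np.exp(block_losses / temperature)
    probs = weights / weights.sum()
    chosen = rng.choice(len(micro_blocks), size=n_blocks, replace=False, p=probs)
    return [micro_blocks[i] for i in chosen]

# Toy usage: 1,000 samples, micro-blocks of 20, draw 4 blocks per training step.
all_indices = np.arange(1000)
losses = rng.random(1000)            # stand-in for per-sample CTC losses
blocks = partition_into_micro_blocks(all_indices, block_size=20)
batch_blocks = loss_weighted_sample(blocks, losses, n_blocks=4)
mini_batch = np.concatenate(batch_blocks)
print(mini_batch.shape)              # (80,) samples forming the dynamic mini-batch
```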

Language: English
Page range: 46 - 58
Published on: Dec 31, 2025
In partnership with: Paradigm Publishing Services
Publication frequency: 4 issues per year

© 2025 Jiajun Hao, Chaoyang Geng, published by Xi’an Technological University
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.