Have a personal or library account? Click to login
Speech Processing Using Dynamic Micro-Block Optimization Based on Deep Learning Cover

Speech Processing Using Dynamic Micro-Block Optimization Based on Deep Learning

By: Jiajun Hao and  Chaoyang Geng  
Open Access
|Dec 2025

Full Article

I.
Introduction

The field of speech processing, which includes automatic speech recognition, speech refinement, and speech isolation, has undergone a significant advancement with the integration of deep neural methodologies. Fueled by the rapid growth of computing resources and open-access large-scale speech corpora, contemporary approaches increasingly rely on deep neural architectures — ranging from Convolutional and Recurrent Networks to long short-term memory models [1] and, more recently, transformer variants [2] — to capture the highly non-linear relationship between acoustic representations and linguistic or perceptual targets. These data-driven approaches have pushed word error rates to human parity on a number of controlled benchmarks [3], enabling widespread deployment of voice search, smart-home devices, hearing aids, and car-navigation systems [4].

Despite this success, two fundamental limitations remain. First, the majority of existing systems still rely on static training schedules in which mini-batches are formed by simple random sampling or, at most, coarse curriculum schemes [5]. Such strategies treat every sample as equally important throughout the entire training procedure, ignoring the fact that utterances vary dramatically in difficulty, signal quality, speaker characteristics, and gradient informativeness. Second, the convergence be haviors of deep acoustic models is time-varying: early iterations demand broad, exploratory updates, whereas later stages benefit from fine-grained, noise-reduced gradients. Static sampling cannot adapt to these evolving requirements, frequently yielding slow convergence, sub-optimal generalization, and inefficient GPU utilization [6].

To alleviate these issues, we propose Dynamic Micro-block Optimization—a family of adaptive sampling strategies that operate on micro-blocks, i.e., compact sets of frames or utterances whose cardinal number is an order of magnitude smaller than conventional mini-batches.

By dynamically assembling micro-blocks that best match the instantaneous learning demand, DMBO injects curriculum-like behavior into standard stochastic optimization without requiring an external curriculum designer. Specifically, four instantiated strategies are investigated in this work: loss-weighted, gradient-diversity, gender-adjusted, and accent-adjusted micro-block selection. The gender- and accent-adjusted variants explicitly construct micro-blocks that are homogeneous (single gender/accent) or heterogeneous (balanced mixture) to match the desired modelling capacity of each training phase.

Experimental evaluations on public corpora [7] demonstrate that the proposed DMBO paradigm not only accelerates convergence and lowers label error rates, but also offers a plug-and-play replacement for standard mini-batch selection in existing speech-processing pipelines; consequently, it opens a new avenue toward more efficient and better-performing deep learning systems for speech.

II.
Foundations of Dynamic Micro-block Training
A. Connectionist Temporal Classification

The core idea of CTC is to introduce a special blank symbol and to allow output labels to repeat, thereby aligning the input sequence with the target label sequence and performing transcription. The blank symbol indicates that the model deems there is no emission at the current time step. During post-processing, adjacent identical labels are collapsed into one, and all blank symbols are removed. This mechanism effectively tackles the length-mismatch problem between acoustic feature streams (speech frames) and decoded symbol streams (phoneme or character labels), especially when the input is much longer than the output [8]. Its working principle relies on a neural network that directly maps raw audio signals to a text label sequence, eliminating the need for manual forced alignment between acoustic units and text. This end-to-end characteristic enables the system to process inputs and outputs of arbitrary length without external segmentation, while the correspondence between them is discovered directly from the data during training; as a result, the approach has gained broad acceptance across automatic speech-recognition tasks. The following equation gives the definition of the CTC loss function.1LossCTC=lnP(yx){{\mathop{\rm Loss}\nolimits} _{CTC}} = - \ln P(y\mid x) 2P(yx)=πQ(y)P(πx)P(y\mid x) = \sum\limits_{\pi \in Q(y)} P (\pi \mid x)

In the equation, x denotes the input sequence, y represents the target output label sequence, and Q(y) encompasses the complete set of valid CTC paths that can be collapsed to y.

As an end-to-end sequence modeling technique, while CTC provides an end-to-end solution for sequence alignment in speech recognition, it also introduces certain limitations, particularly in terms of output label sparsity and model robustness under noisy conditions [8]. Although CTC effectively addresses the length mismatch between input and output sequences by introducing blank symbols and allowing label repetitions, this strategy inherently lowers the information density of the output sequence, potentially compromising model accuracy and robustness [8]. Recently, the rise of self-supervised learning (SSL) has significantly advanced CTC based models [9]. SSL learns representations rich in semantic information without relying on explicit labels, enabling the model to better capture semantic content within speech sequences.

B. Backbone Sequence Models: RNN & LSTM

RNN is specifically designed for sequential data. Unlike feed-forward architectures (MLP or CNN), RNN introduce a temporal recurrence that memorizes historical information and reuses it for the current prediction. Typical deployment domains span language understanding, sequential data prediction, and spoken-content analysis [10]. The key idea is that when processing the t step, the model refers to its internal state computed at the immediately preceding time step, thus acquiring memory capability.

In an RNN, the instantaneous hidden activation ht is jointly conditioned on the present input xt and the state vector retained from the prior step [11]. the corresponding update expression is given below.3ht=g(Whht1+Wxxt+bh){h_t} = g\left( {{W_h}{h_{t - 1}} + {W_x}{x_t} + {b_h}} \right)

In this expression, W represents the learnable coefficient matrix, b indicates the intercept term, g corresponds to the nonlinear active transformation.

The decoding layer is typically computed as: 4yt=f(Wyht+by){y_t} = f\left( {{W_y}{h_t} + {b_y}} \right)

Here, f denotes the nonlinear activation mapping (soft max or linear), which produces the final prediction yt at time step t.

RNN offer natural advantages for modeling temporal data such as speech. Nevertheless, the architecture faces several critical challenges. The foremost is the latent. risk of vanishing or exploding gradients during training: when the network becomes deep and input sequences are long, back-propagated error gradients may decay excessively or grow explosively[12]. Such gradient pathology severely impair the learning efficiency of deeper layers. Second, this gradient instability limits the RNN’s ability to capture and exploit long-range dependencies across distant time steps in very long sequences.

As a crucial refinement of recurrent neural network, LSTM incorporates memory cells governed by gated units, enabling more effective modelling of temporal sequences [13]. Fig. 1 illustrates the LSTM architecture, it modulates layer-to-layer information flow via learnable gates, allowing gradients to propagate over extended periods in a relatively stable manner [14]. The canonical LSTM employs three learnable gates: namely, input, forget, and output controllers [15]. Specifically, the input gate governs the degree to which fresh information is admitted into the memory cell at the current time step. The forget gate scales the previous cell state, deciding what fraction of long-term context is preserved or discarded. The output gate selects the segments of the updated cell state that are revealed as the hidden vector to downstream components.

Figure 1.

LSTM Architecture Diagram

The corresponding mathematical formulations-cell-state update equations, gate activation functions and the final hidden-state computation— are detailed in Equations (5)-(10).5gt=ρ(Mg[ ht1,αt ]+βg){g_t} = \rho \left( {{M_g}\left[ {{h_{t - 1}},{\alpha _t}} \right] + {\beta _g}} \right) 6at=ρ(Ma[ ht1,αt ]+βa){a_t} = \rho \left( {{M_a}\left[ {{h_{t - 1}},{\alpha _t}} \right] + {\beta _a}} \right) 7O˜=tanh(MO[ ht1,αt ]+βo)\tilde O = \tanh \left( {{M_O}\left[ {{h_{t - 1}},{\alpha _t}} \right] + {\beta _o}} \right) 8Ot=gtOt1+atO˜{O_t} = {g_t}{O_{t - 1}} + {a_t}\tilde O 9bt=ρ(Mo[ ht1,αt ]+bo){b_t} = \rho \left( {{M_o}\left[ {{h_{t - 1}},{\alpha _t}} \right] + {b_o}} \right) 10ht=bttanhOt{h_t} = {b_t}\tanh {O_t}

In the above, αt is the incoming vector at temporal index t. ht–1; and Ot–1 are the previous latent and memory vectors. Õ is the candidate. Gates at, bt, gt control inflow, outflow and forget; M and β are their weights and biases. ρ is sigmoid, ⊙ denotes element-wise product.

C. Traditional Gradient Descent: From Batch to Mini-batch

Gradient-based optimization is the cornerstone driving training of deep acoustic models, and its core still relies on back-propagation to transmit the output error layer-by-layer and update the weights. According to the criterion of "the number of samples used before each parameter update," gradient-based optimization is commonly classified into three variants: full-batch, purely stochastic, and mini-batch learning [16]. Although they share the same parameter iteration form, as shown in Equation (11).11θt+1=θtηmi=1mθL(xi,yi;θt){\theta _{t + 1}} = {\theta _t} - {\eta \over m}\sum\limits_{i = 1}^m {{\nabla _\theta }} L\left( {{x_i},{y_i};{\theta _t}} \right)

In the equation, θt stands for the vector of trainable weights, m equals the quantity of samples in the present mini-batch, η signifies the step-size coefficient, the pair (xi, yi) corresponds to the acoustic feature and its ground-truth transcription, L gives the per-sample objective value, and ∇ denotes the gradient of that objective relative to θt.

However, for speech tasks Batch GD must traverse the entire datasets before computing a single gradient update, which is time-consuming and incompatible with online incremental learning, whereas SGD updates parameters sample-by-sample but produces high-variance trajectories that destabilize CTC loss convergence [17]. Consequently, current speech recognition systems universally adopt Mini-batch GD as a compromise: gradients are estimated on mini-batches of 16–256 utterances and parameters are updated once per batch, markedly reducing memory footprint and improving parallel efficiency [18], as shown in equation (12).12θt+1=θtη| Bt |iBimθL(xi,yi;θt){\theta _{t + 1}} = {\theta _t} - {\eta \over {\left| {{B_t}} \right|}}\sum\limits_{i \in {B_i}}^m {{\nabla _\theta }} L\left( {{x_i},{y_i};{\theta _t}} \right)

In the equation, bt denotes the index set of samples in the current mini-batch with size m; the remaining symbols are consistent with those in the traditional formulation.

However, this strategy typically performs “random shuffling + sequential slicing” only once per epoch, so the composition of each mini-batch remains frozen throughout training. By implicitly assigning identical importance to every frame–label pair, it lacks the ability to adapt the sampling distribution on the fly as training progresses and cannot exploit speech-specific metadata such as gender, accent, or signal-to-noise ratio for fine-grained scheduling. This static limitation directly motivates the subsequent proposal of DMBO.

D. Core Concepts of DMBO

Dynamic Micro-block Optimization (DMBO) refines conventional mini-batch training by shrinking the sampling unit into micro-blocks (much smaller in capacity than mini-batch) and reselecting their members on-the-fly before each gradient computation. DMBO fuses three instantaneous signals to rank candidate samples. Sample complexity: normalized frame-length, SNR or accent-distance; Gradient informativeness: magnitude of the current loss or gradient; Meta-information: gender/accent labels used to encourage homogeneous or heterogeneous blocks.

The lightweight scheduling procedure is summarized in Equation (13) and Equation (14).

First, each candidate i receives an instant score: 13ρi=αSi+β Li +γmi{\rho _i} = \alpha {S_i} + \beta \left\| {\nabla {L_i}} \right\| + \gamma {m_i}

Here, si denotes the complexity score, Li the gradient norm, and mi the meta-information weight; α + β + γ = 1 are the corresponding mixing coefficients. After ranking all candidates by ρi, the top-m samples are selected to form the micro-block.

Next, the network weights are updated with the standard mini-batch rule applied to the micro-block.

Traditional mini-batch gradient descent typically employs uniform random sampling, a strategy that readily clusters highly similar samples into the same batch and inadvertently amplifies individual instances. In speech processing, acoustic factors such as accent, gender, and age further intensify this imbalance; if these attributes are ignored, the gradient estimates become systematically biased and the model’s generalization performance deteriorates. To address this, DMBO replaces purely random selection with a criterion-driven sampling framework that explicitly controls the distribution of accent, gender, and other meta-attributes when forming each micro-block, ensuring that every parameter update is based on a subset that is both representative and information-rich.

III.
Proposed Micro-block Sampling Strategies: DMBO

This part delivers a systematic account of how the DMBO framework processes speech datasets. Fig. 2 illustrates the overall architecture: it begins with loading the VCTK speech datasets and parsing its acoustic features [19]; all sample resources are divided into two major subsets, the learn split and the validation split. During the model-fitting stage, DMBO mechanism dynamically reconstructs data blocks based on real-time scoring, producing a sequence of micro-block batches that varies over time. Each batch of data is fed sequentially into the model built upon Long Short-Term Memory networks for forward and backward propagation learning. Finally, performance loss and Label Error Rate (LER) are calculated on an independent test set to quantify the effectiveness of the sampling method [20]. Subsequent chapters will elaborate on the design of the scoring mechanism, the scheduling algorithm, and key aspects of system implementation.

Figure 2.

DMBO Framework Architecture

A. Standard Micro-block Sampling Strategy

The conventional micro-block sampling mechanism follows a completely random procedure: it first reshuffles the datasets and then indiscriminately selects data instances. This approach does not prioritize instances based on their informativeness or difficulty, which could lead to second-best utilization of computational resources and slower convergence.

Algorithm 1 clearly presents the code implementation of this basic sampling method, highlighting its simplicity and the lack of adaptive capacity to the specific needs of the training process.

Algorithm 1. Standard strategy

  • initialization: Microblocksamples = null

  • all_files = read-dataset-files(Dataset)

  • shuflle(all_filcs)

  • batch count = num examples / batch size

  • for i in batch count

  • batch samples = null

  • batch_samples.add(all_files [Obatchsize])

  • al l_fi les. removcfO: batch_size]

  • Microblock_samples.add(batch_samples)

  • end for

  • return Microblock samples

During the initialization phase of the algorithm, the output container Microblock_samples is set to empty, and the entire datasets is then loaded into a dynamic array all_files. This array is immediately shuffled in place and randomized. Subsequently, iteration count is obtained by dividing the overall sample quantity by the micro-block size parameter. A loop is initiated from 1 to k: in each it eration, an empty cache unit batch_samples is instantiated first, b elements are taken from the front of the shuffled sequence to fill this unit, the selected elements are synchronously removed from the original array to maintain distribution consistency, and the current micro-block is inserted into output collection. When the loop terminates, algorithm returns constructed sequence of fine-grained units Microblock_samples.

B. Strategies Based on Gender

In the process of acoustic data handling, raw speech samples are divided into two separate subsets based on the speaker's gender: audio from female speakers and audio from male speakers. Based on this binary grouping structure, this study designs two differentiated sampling schemes: a homogeneous gender classification construction mechanism and a heterogeneous gender integrated construction mechanism (as shown in Algorithms 2 and 3). Under the homogeneous mechanism framework, a single fine-grained unit contains audio data blocks of only one gender (either female or male), and adjacent units are assembled with alternating genders; whereas the heterogeneous mechanism requires each unit to contain a balanced combination of gendered speech instances. Additionally, a dual randomization process is adopted: first, unbiased sampling of unit samples is performed, followed by a secondary shuffle of samples within the unit. This procedure can effectively disrupt the model's statistical dependence on the sequence order of speech, preventing incorrect associative learning due to gender arrangement patterns.

Algorithm 2. Homogeneous gender strategy

  • initialization: Microblock_samples = null

  • [female_speakers, male_speakers] = get-dataset-ınfo(Dataset_Meta)

  • [female_filenames, male_filenamesļ = get-files(female_speakers, male_speakers, Dataset)

  • shuffle(female_filenames)

  • shuffle(male_filenamesļ

  • batch_count = num_examples / batch_size

  • for i in batch_count

  • batch_samples = null

  • ifi%2==0

  • batch_samples.add(female_filenames[O:batch_size])

  • female_filenames. remove[Ŭ:batch_sizeļ

  • else

  • batch_samples add(male_filenames[O:batch_sizeļ)

  • male_filenames remove[Ŭ:batch_sizeļ

  • end if

  • Microblock_samples.add(batch_samples)

  • end for

  • return Microblock_samples

In the homogeneous gender grouping strategy (Algorithm 2), an empty sample set `Microblock_samples` is initialized first. Then, speech metadata is parsed to separate female and male speaker identifiers, and corresponding audio files are retrieved to form independent path lists. After performing independent random shuffling on the two path collections, the algorithm calculates iteration count derived from total samples divided by preset unit size. This process assembles micro-units in an alternating gender pattern: let the current iteration count be variable `i`, when `i` is even, a fixed number of elements are taken from the front of the shuffled female path list and imported into a temporary container, while the selected entries are synchronously removed from the original list; if `i` is odd, the equivalent operation is performed on the male path. Finally, the constructed individual units are added to the target collection, and when the loop terminates, a layered container structure composed of alternating pure gender blocks is returned.

From the algorithmic framework design, the initial data-loading mechanism of Algorithm 3 remains highly consistent with Algorithm 2; the principal difference lies in the logic used to construct the sample sequence during iteration. Specifically, within the loop-operation interval, the model adopts a heterogeneous strategy to generate micro-block unit samples. In particular, at the sub-stage, a balanced male and female speaker split audio paths are integrated into the target set batch_samples. This set is simultaneously constrained so that the sample size of each group equals one-half of the constant batch_size, ensuring a balanced gender ratio.

C. Strategies Strategies Based on Accent

During the accent-variation-aware sampling procedure, the speech datasets is first stratified according to the speaker’s regional pronunciation patterns, and audio segments are assigned to corresponding accent subsets. We propose two methodological frameworks: a homogeneousaccent scheme and a heterogeneous-accent scheme. Under the homogeneous scheme, each processing block contains only single-region pronunciation samples, and the proportion of each accent in the training subset strictly follows the source distribution. The heterogeneous scheme, by contrast, requires integrating all original accents within each composite block and reconstructs the original distributional proportions through weighted random sampling. All accent samples undergo randomized allocation before a systematic block reshuffle is applied. Concrete implementations are detailed in Algorithm 4 (homogeneous) and Algorithm 5 (heterogeneous).

In the Algorithm 4, the micro-block container is initialized as an empty set. Datasets meta-information is used to identify all accent categories and retrieve the corresponding audio filenames. Each accent-specific file list is independently randomized. The number of micro-blocks is then calculated based on the predefined batch size and the accent distribution ratio. For each quantize accent set, a temporary sample pool is created.

Algorithm 3. Heterogeneous gender strategy

  • initialization: Microblock_samples = null

  • [female_speakers, male_speakers] = get-dataset-info(Dataset_Meta)

  • [female_filenamesμnale_fīlenames] = get-files(female_speakers, Imale_speakers, Dataset)

  • shuffle(female_filenames)

  • shuffle(male_filenames)

  • batch_count = num_examples / batch_size

  • for i in batch_count

  • batch_samples = null

  • batch_samples add(female_filenames[0:batch_size/2])

  • batch_samples.add(male_filenames[0:batch_size/2])

  • female_filenames.remove[0:batch_size/2]

  • male_filenames.remove[0:batch_size/2ļ

  • shuffle(batch_samples)

  • Microblock_samples add(batch_samples)

  • end for

  • return Microblock_samples

Algorithm 4. Homogeneous accent strategy

  • initialization: Microblock_samples = null

  • all_accents = get-dataset info(Dataset_Meta)

  • all_accents_filenames = get-accent-files(all_accent, Dataset)

  • shuffle(all_accents_filenames)

  • all_accents_batch_counts = count-batches(all_accents, all_accents_filenames)

  • for accent_batch_count in all_accems_batch_counts

  • batch_samples = null

  • for J in accent_batch_count

  • batch_samples add(all_accents_filenames[j,0: (batch_size)])

  • all_accents_filenames remove[j,0:(batch_size)]

  • end for

  • Microblock_samples add(batch_samples)

  • end for

  • return Microblock_samples

Through iterative operations, filename segments of batch size are successively extracted from the randomized list and added to the pool, while the corresponding entries are removed from the source list. Once filled, the sample pool is integrated into the micro-block container. After all accent groups have been processed, the algorithm returns a combined set of micro-blocks, each consisting exclusively of a single accent, ready for subsequent processing.

Within Algorithm 5, preliminary procedures mirror the predecessor methodology: extraction of all regional pronunciation classifications from the datasets, followed by systematic storage of corresponding audio identifiers within an integrated acoustic index matrix. Operational divergence emerges principally during workflow stages 5-13, where stochastic selection of constituents enforces rigorous proportional alignment with each articulatory variant's original distribution within the source collection.

IV.
Experimental Validation and Result Analysis

This segment delivers rigorous verification for the devised adaptive micro-block constituent optimization method. Commencing with an exposition of experimental audio datasets and corresponding preparation workflows, subsequent segments elaborate computational environment specifications within the deep neural network infrastructure, concluding with quantified performance outcomes. Fig. 3 visualizes the comprehensive automated speech recognition architecture employed herein: (i) ingested speech signals undergo acoustic characteristic derivation; (ii) derived Mel-Frequency Cepstral Coefficients (MFCC) enter the micro-block selection apparatus; (iii) leveraging vocal metadata (gender/accent parameters), the processor dynamically restructures partitioning units prior to channeling reconfigured samples to Long Short-Term Memory layers; (iv) upon LSTM production of temporal probability distributions, Connectionist Temporal Classification facilitates posterior loss estimation and symbolic sequence resolution.

Figure 3.

Overview of the E2E Speech Recognition Framework with DMBO

Algorithm 5. Heterogeneous accent strategy

  • initialization: Minibatch_samples = null

  • all_accents = get-dataset-info(Dataset_Meta)

  • all_accents_filenames = get-accent-files(all_accent, Dataset)

  • shuffle(all_accents_filenames)

  • batch_count = num_examples / batch_size

  • for i in batch_count

  • batch_samples = null

  • for accent in all_accents

  • accent_sample_count = calculate-accent-sample-count(accent, all_accents_filenames)

  • batch_samples.add(all_accents_filenames[accent,0:accent_sample-_count])

  • all_accents_filenames.remove[accent, 0: accent_sample_count]

  • end for

  • shuffle(batch_samples)

  • Microblock_samples.add(batch_samples)

  • end for

  • return Microblock_samples

A. The speech datasets

This study selects the VCTK speech datasets [19] as the primary experimental data source. The datasets are preferred because it comprehensively covers multi-dimensional speaker attributes — gender differences, regional accent characteristics, and age distribution parameters are all recorded in structured meta-data files, perfectly matching the proposed micro-block sampling methodology. Fig. 4 and Fig. 5 quantitatively show the statistical frequency distribution of speakers across different gender and accent categories.

Figure 4.

Gender-based distribution of datasets samples

Figure 5.

Accent-based distribution of datasets samples

Owing to computational constraints, only 30 % of the original datasets is retained for analysis (approximately 14,000 utterances). Selective acquisition is achieved by finely tuning the core algorithmic parameter micro-block_examples, and the inclusion criteria for individual utterances strictly follow the current sampling logic.

The resulting data set amounts to about 14 hours of valid audio. An additional 1,048 utterances are randomly drawn from the global speaker pool to form a validation set, with the corresponding entries simultaneously removed from the base datasets to establish an unbiased evaluation mechanism for micro-block sampling efficacy.

Before deep-learning modelling begins, a standardized preprocess pipeline is executed: all audio signals are first decoded at a 16 kHz sampling rate, and 12-order MFCC are extracted; the analysis window is set to 15 ms with a 5 ms frame shift. Concurrently, text data are cleaned by completely removing punctuation marks, ensuring that the training targets closely mirror the actual spoken labels.

B. Experimental environment

Utilizing PyTorch's computational infrastructure, this investigation implements LSTM neural networks as its foundational architecture. Structural configuration consists on five sequential processing strata, each incorporating 600-dimensional latent space projection nodes. Parameter refinement employs an adaptive moment estimation algorithm. Our comprehensive speech-to-text conversion framework is schematically presented in Figure 3. To conduct methodical comparisons of microscopic segment sampling techniques, constituent unit dimensions remain invariant at sixty-four acoustic samples.

The experimental environment runs Windows 10 (64-bit) on 32GB of RAM, with an RTX 3070 GPU and a 12th generation Intel Core i5-12600K CPU.

C. Result of Strategies Based on Gender

This investigation quantitatively assesses the efficacy divergence between gender-binding methodologies (single-gender cohorts versus mixed-gender compositions) when bench-marked against conventional randomized micro-batch assignment protocols. The experimental configuration employed a fixed batch dimension of 32 units and adjusted the optimizer step size to the 10−3 magnitude. Upon completing 800 training iterations, the deep learning architecture exhibited performance characteristics visualized in Fig. 6 and aggregated metrics presented in Table 1.

Figure 6.

Train loss (left), LER (right) for gender-based strategies

TABLE I.

Loss, LER for gender-based strategies during both train and test

Test LossTest LERTrain LossTrain LER
Standard25.6914.62%12.7211.14%
Homogeneous gender13.536.71%6.155.65%
Heterogeneous gender24.1513.12%11.089.21%
D. Result of Strategies Based on Gender

This investigation quantitatively assesses the efficacy divergence between gender-binding methodologies (single-gender cohorts versus mixed-gender compositions) when bench-marked against conventional randomized micro-batch assignment protocols. The experimental configuration employed a fixed batch dimension of 32 units and adjusted the optimizer step size to the 10−3 magnitude. Upon completing 800 training iterations, the deep learning architecture exhibited performance characteristics visualized in Fig. 6 and aggregated metrics presented in Table 1.

Illustrative trajectories in Fig. 6 delineate the dynamic evolution of training cost and LER, while Table 1 synthesizes terminal performance indices across both training and validation phases. Analytical outcomes reveal that single-gender configurations significantly surpassed mixed-gender arrangements and control protocols in convergence velocity and error suppression efficacy. Notably, this strategy achieved a 9-percentage-point relative reduction in LER across training and testing corpora—a discrepancy satisfying 94% confidence interval validation and demonstrating statistical robustness.

Mechanistically, constraining micro-clustering units to homogeneous vocalization categories enhances different feature extraction within vocal tract configurations, attributed to segregated training optimization for gender-specific phoneme production mechanisms. Conversely, mixed-gender paradigms exhibited negligible deviation from randomized allocation due to balanced gender-wise utterance distribution in the datasets.

E. Result of Strategies Based on accent

By dynamically optimizing the representation of regional accents within the speech recognition pipeline, this study systematically evaluates the efficiency gap between dialect-homogeneous clustering and dialect-heterogeneous integration strategies relative to conventional random sampling. Core experimental settings include a batch size of 32 acoustic units, a fixed optimizer step size of 10-3, and a full training schedule of 800 epochs. Convergence behaviour is visualized in Fig. 7, while terminal performance metrics are summarized in Table 2.

Figure 7.

Train loss (left), LER (right) for accent-based strategies

TABLE II.

Loss, LER for accent-based strategies during both train and test

Test LossTest LERTrain LossTrain LER
standard25.6914.62%12.7211.14%
Homogeneous accent22.2613.58%11.6410.04%
Heterogeneous accent12.085.51%5.704.42%

Heterogeneous accent integration demonstrates a significant advantage across training stages, achieving an absolute reduction of 9.2 ± 0.4 % in label error rate. This performance leap originates from maintaining topological isomorphism between the dialect distribution inside each micro-batch and the phonetic variability of the corpus. Specifically, when the Euclidean distance between the batch-level dialect feature vector and the corpus distribution is minimized, the network effectively extracts cross-dialect covariance features, raising the completeness of the acoustic embedding space by 17.4% as verified by Riemannian manifold distance measurements.

The underlying mechanism maps dialectal acoustic features onto a Riemannian manifold and optimizes the covariance matrix to eliminate the distributional bias inherent in random sampling. Across consecutive training cycles, the heterogeneous architecture generates complementary phonemic representations and reduces the entropy of the model confusion matrix by 23 % compared with the homogeneous strategy. Our findings demonstrate that controlling dialect concomitant variable can upgrade recognition accuracy without increasing model parameters, providing a provable optimization path for multi-dialect speech processing.

F. Result Analysis and Performance Comparison

Synthesis of Table 1 and 2 reveals: Architectures employing vocal homogeneity confinement integrated with cross-regional phonological diversity systematically surpass comparative methodologies in both developmental and evaluative dimensions. Underlying this advantage are two synergistic principles: Initially, aggregated processing of sex-specific acoustic cohorts facilitates meticulous delineation of correlative patterns linking spectral time domain signatures to linguistic transcriptions. Subsequently, per-iteration assimilation of dialectal fluctuations enables coherent internalization of articulatory discrepancies during neural computation, thus consolidating extrapolated resilience across linguistic domains. Ergo, embedding biometric metadata pertaining to vocal traits and geographical affinities within micro-scale sampling frameworks profoundly augments terminal model proficiency.

Table 3 conducts a cross-methodological appraisal of the proposed metadata-driven sampling paradigm. By curating baseline studies devoid of language models and possessing direct comparability, their LER metrics were incorporated as references. Critical findings reveal that: Specific configurations of our framework achieve superior phonetic decoding accuracy on testing sets relative to established baselines, with the vocal homogeneity combined with accent heterogeneity strategy exhibiting particularly significant breakthroughs—surpassing all documented optimal values in extant literature. Further computational scale expansion (e.g., augmenting parameter dimensions) is projected to yield additional LER reduction by 0.8 percentage points.

TABLE III.

Train and test loss and ler of strategies based on accent

ModelTest LER
Standard14.62%
Homogeneous gender6.71%
Heterogeneous gender13.13%
Homogeneous accent13.58%
Heterogeneous accent5.51%
Attention-LSTM[21]9.77%
CNN-LSTM[22]14.05%
BLSTM[23]12.9%
BiLSTM-E[24]8.07%
V.
Conclusions

This work addresses the training-efficiency and model-accuracy bottlenecks in speech processing by proposing Dynamic Micro-block optimization, a deep-learning training scheme. Core technologies include (i) Long Short-Term Memory networks for temporal modeling, (ii) Connectionist Temporal Classification for end-to-end alignment, (iii) dynamic sampling policies that fuse sample loss, gradient diversity, and meta-data (gender, accent), and (iv) the PyTorch framework for reproducible experimentation. DMBO mitigates three limitations of static training: (1) neglect of inter-sample differences that slows convergence, (2) inability to adapt data distribution to the instantaneous learning state, and (3) under-utilization of speaker meta-information that degrades generalization. Evaluated on the VCTK datasets, DMBO yields consistent improvements: the gender-homogeneous policy lowers the label error rate (LER) to 5.65 % (train) and 6.71 % (test)—9 % absolute reduction versus the standard strategy—while the accent-heterogeneous policy further reduces LER to 4.42 % and 5.51 %, respectively. These quantitative results corroborate our theoretical analysis and confirm the efficacy of adaptive, metadata-aware sampling. Future work will (i) integrate multi-objective sampling criteria, (ii) incorporate Transformer architectures, (iii)cascade DMBO with statistical language models, and (iv) extend the paradigm to speech enhancement and speaker recognition tasks. The proposed approach offers a new avenue for boosting deep-learning-based speech systems and demonstrates considerable promise for practical application.

Language: English
Page range: 46 - 58
Published on: Dec 31, 2025
In partnership with: Paradigm Publishing Services
Publication frequency: 4 issues per year

© 2025 Jiajun Hao, Chaoyang Geng, published by Xi’an Technological University
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.