Laryngeal pathologies are disorders that affect the larynx, which houses the vocal folds, leading to various voice problems [1], [2], [3]. Early detection of these pathologies is crucial to prevent permanent damage to the vocal folds and to significantly improve the effectiveness of treatment. The diagnosis of voice disorders usually requires invasive clinical examinations such as laryngoscopy and videostroboscopy. However, vocal signal analysis using signal processing techniques can be used to extract features that help distinguish between healthy and pathological voices. Therefore, there is a growing need to develop a non-invasive, automated approach based on deep learning to identify pathological voices. A substantial body of related work exists in this domain [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14]. Many studies focus on the extraction of signal processing features such as Mel-frequency cepstral coefficients (MFCCs) and the Wavelet Packet Transform (WPT) [15], [16], [17], [18], [19], [20], as well as the use of deep learning for voice pathology detection [21], [22], [23]. This paper focuses on the detection of laryngeal pathologies using MFCC spectrograms and scalograms — time–frequency representations — derived from the most relevant intrinsic mode functions (IMFs). These representations are used as inputs to the AlexNet convolutional neural network (AlexNet-CNN) for automatic classification of normal and pathological voices. This study investigates the use of scalogram representations with AlexNet-CNN for pathological voice detection, an approach that has received little to no attention in the existing literature. The adopted processing workflow is illustrated in the synoptic diagram (Fig. 1). Once the relevant IMFs are extracted from each vocal signal, the signal is segmented, and the MFCCs are calculated for each segment. In parallel, scalograms are generated using the continuous wavelet transform (CWT).
These MFCC images and scalograms are then used as input to the AlexNet-CNN for classification. This paper is organized as follows: Section 2 presents the materials and related methods, detailing the methodology and the detection process. Section 3 reports and discusses the results. Finally, the paper concludes with a summary of the findings and a comparison with recent studies on pathological voice classification.

Synoptic representation of the proposed method: voice signals first undergo signal processing, followed by EMD to extract IMFs. The most energetic IMF is selected to compute Mel-spectrograms and scalograms, which are then fed into a pretrained AlexNet-CNN model for voice pathology classification.
The vocal signals used in this study were obtained from the publicly accessible Saarbrücken Voice Database (SVD) [24]. The SVD contains a diverse collection of voice recordings from subjects with various laryngeal pathologies, including both functional and organic disorders. The database contains multiple recordings per speaker, featuring the sustained pronunciation of the vowels /a/, /i/, and /u/ with different intonations: normal, low, high, and low–high–low. When exploited, this diversity can improve model performance. For this study, only the sustained /a/ vowels pronounced at normal pitch were selected. This choice was motivated by the fact that the sustained vowel /a/ is a common phonation task found in many voice disorder datasets, and it provides a consistent basis for analysis. All voice recordings in the SVD are sampled at 50 kHz with 16-bit resolution. The subset used in this work consists of 259 healthy voice samples and 50 pathological samples from male subjects diagnosed with laryngitis, all corresponding to the neutral vowel /a/. To increase the amount of training data and better capture temporal variations, the most relevant IMFs from the healthy and pathological voice samples were segmented into overlapping frames, thus increasing the number of inputs to the AlexNet-CNN.
In this study, we used AlexNet-CNN, a pre-trained convolutional neural network (CNN), to detect laryngeal pathologies from voice signals. AlexNet-CNN consists of eight layers — five convolutional layers followed by three fully connected layers — and uses the ReLU activation function to improve non-linearity and accelerate training [25]. Scalogram images obtained by applying the CWT to the most relevant IMF were used as input to the network. These time–frequency representations capture rich, multiscale features that are highly relevant for vocal disorder characterization. The model, originally trained on ImageNet, was either fine-tuned for direct classification or used as a deep feature extractor, with the outputs of the penultimate fully connected layer fed into an external classifier, such as a support vector machine (SVM) or a Softmax layer. The dataset was split into training and validation subsets, with 80 % of the images used for training and the remaining 20 % for validation. The choice of AlexNet-CNN was motivated by its computational efficiency, fast convergence, and demonstrated effectiveness in biomedical imaging tasks, particularly in scenarios with limited datasets. The integration of AlexNet-CNN complements traditional acoustic features such as MFCCs and results in a hybrid, multimodal feature space that improves the robustness and accuracy of pathological voice classification.
The overall process for detecting laryngeal pathologies from vocal signals is summarized in the synoptic diagram (Fig. 1). It comprises three main phases: signal preprocessing, feature extraction, and classification. The first phase begins with the formation of a matrix containing voice signals from healthy and pathological male subjects (suffering from laryngitis), limited to the sustained neutral vowel /a/. To simulate real-world acoustic conditions, Gaussian noise with a signal-to-noise ratio of SNR = 0 dB and a standard deviation σ = 1 is added to each signal. Denoising is then applied using wavelet transform-based methods. Next, all signals are normalized and centered to create a zero-mean matrix. To ensure uniformity, the signals are equalized to the same length. Silence segments are removed to focus on the voiced regions, followed by low-pass filtering with a cut-off frequency of 3400 Hz to retain only the relevant spectral content. A Hamming window corresponding to the length of each signal segment is applied to reduce spectral leakage during subsequent analysis.
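The noise-contamination step above can be sketched as follows. Note that at SNR = 0 dB the noise power must equal the signal power, so in this sketch the noise standard deviation is derived from the signal rather than fixed a priori.

```python
import numpy as np

def add_noise_snr(x, snr_db=0.0, rng=None):
    """Add white Gaussian noise so the result has the requested SNR (dB).

    At SNR = 0 dB the noise power equals the signal power, matching the
    severe-noise condition used in the preprocessing stage.
    """
    rng = np.random.default_rng() if rng is None else rng
    p_signal = np.mean(x ** 2)
    p_noise = p_signal / (10 ** (snr_db / 10))   # target noise power
    noise = rng.normal(0.0, np.sqrt(p_noise), size=x.shape)
    return x + noise

t = np.linspace(0, 1, 8000, endpoint=False)
clean = np.sin(2 * np.pi * 100 * t)              # toy "voice" signal
noisy = add_noise_snr(clean, snr_db=0.0, rng=np.random.default_rng(0))
```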
In the second phase, each pre-processed signal undergoes empirical mode decomposition (EMD) to extract its IMFs. Among these, the IMF with the highest energy is selected as the most relevant component for further analysis. This IMF is then segmented using a sliding window of 23 ms with a 50 % overlap (i.e., half the window length). Feature extraction is performed on each segment to generate two types of representations: Mel-spectrograms and scalograms, which serve as time–frequency descriptors that capture both the spectral and temporal dynamics of the voice signal. Finally, the classification phase is performed using the AlexNet-CNN. Two separate classification paths are considered: one using MFCC images and the other using scalogram images as input. In both cases, the network outputs a binary decision indicating whether the voice is healthy or pathological.
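The energy-based IMF selection and the 23 ms / 50 % overlap segmentation can be sketched in a few lines (the toy IMFs below are placeholders for the output of an EMD stage):

```python
import numpy as np

def select_most_energetic_imf(imfs):
    """Return the IMF with the highest energy (sum of squared samples)."""
    energies = [np.sum(imf ** 2) for imf in imfs]
    return imfs[int(np.argmax(energies))]

def frame_signal(x, fs, frame_ms=23.0, overlap=0.5):
    """Slice a signal into overlapping frames (23 ms with 50 % overlap)."""
    frame_len = int(round(fs * frame_ms / 1000))
    hop = int(frame_len * (1 - overlap))
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

fs = 50000                                       # SVD sampling rate
rng = np.random.default_rng(0)
imfs = [rng.standard_normal(fs) * s for s in (0.1, 1.0, 0.3)]  # toy IMFs
best = select_most_energetic_imf(imfs)
frames = frame_signal(best, fs)                  # one row per 1150-sample frame
```

At 50 kHz, a 23 ms window corresponds to 1150 samples with a hop of 575 samples, which is where the increase in training samples mentioned earlier comes from.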
We applied a denoising technique based on the wavelet transform, which involves decomposing the vocal signal into wavelet coefficients at multiple frequency scales using the Daubechies wavelet of order 4 (db4). Stein’s unbiased risk estimate (SURE) method was used to determine the optimal threshold for noise suppression. A hard thresholding approach was applied: coefficients with magnitudes below the estimated threshold were completely discarded (set to zero), while those above the threshold were kept unchanged. This technique effectively removes the noise while preserving the significant components of the vocal signal. The denoised signal was then reconstructed using the inverse wavelet transform applied to the modified coefficients. To evaluate the effectiveness of the wavelet denoising approach, clean vocal signals were artificially contaminated with Gaussian noise (standard deviation σ = 1, signal-to-noise ratio SNR = 0 dB). The results showed that wavelet-based denoising significantly improved signal clarity and preserved diagnostically relevant acoustic features, even under severe noise conditions. To analyze the time–frequency characteristics of the relevant IMF, the CWT was also applied. The CWT of a signal x(t) is calculated by integrating the signal with a family of scaled and shifted wavelets, and is mathematically defined as:

W(a,b) = \frac{1}{\sqrt{|a|}} \int_{-\infty}^{+\infty} x(t)\, \psi^{*}\!\left(\frac{t-b}{a}\right) dt,

where a is the scale factor, b is the translation parameter, and ψ* denotes the complex conjugate of the mother wavelet.
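The hard-thresholding scheme can be illustrated with a minimal numpy sketch. For brevity this uses a single-level Haar transform and the universal threshold as stand-ins for the db4 decomposition and SURE-based threshold described above; the thresholding logic itself (zero below threshold, keep above) is the same.

```python
import numpy as np

def haar_dwt(x):
    """Single-level Haar DWT (a simplified stand-in for db4)."""
    x = x[:len(x) // 2 * 2]                      # force even length
    a = (x[0::2] + x[1::2]) / np.sqrt(2)         # approximation coefficients
    d = (x[0::2] - x[1::2]) / np.sqrt(2)         # detail coefficients
    return a, d

def haar_idwt(a, d):
    """Inverse of the single-level Haar DWT."""
    x = np.empty(2 * len(a))
    x[0::2] = (a + d) / np.sqrt(2)
    x[1::2] = (a - d) / np.sqrt(2)
    return x

def denoise_hard(x):
    """Hard-threshold the detail coefficients and reconstruct.

    The universal threshold sigma*sqrt(2*ln N) is used here for brevity;
    the paper estimates the threshold with SURE instead.
    """
    a, d = haar_dwt(x)
    sigma = np.median(np.abs(d)) / 0.6745        # robust noise estimate
    thr = sigma * np.sqrt(2 * np.log(len(x)))
    d_hard = np.where(np.abs(d) > thr, d, 0.0)   # hard thresholding
    return haar_idwt(a, d_hard)

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 4096, endpoint=False)
clean = np.sin(2 * np.pi * 50 * t)
noisy = clean + rng.normal(0, 0.5, t.shape)
denoised = denoise_hard(noisy)                   # lower MSE than noisy input
```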
To ensure the consistency of all samples and to enable uniform processing, each voice signal was subjected to normalization and length equalization. Normalization is a critical pre-processing step in which the data are transformed to a standard scale. In this study, we used a combination of two normalization techniques:
Z-score normalization, defined as

x_{\mathrm{in}} = \frac{x_i - \mu}{\sigma},

where µ is the mean and σ is the standard deviation of the signal. This method centers the signal around zero with unit variance, effectively removing the DC offset and scaling the amplitude distribution.
Peak amplitude normalization, defined as

X_{\mathrm{in}} = \frac{x_{\mathrm{in}}}{\max\left(\left|x_{\mathrm{in}}\right|\right)},

where each sample is divided by the maximum absolute amplitude value to ensure that all signals lie within the range [−1, 1].
After normalization, all signals were adjusted to a uniform length by truncating longer sequences or applying zero-padding to shorter ones. This equalization ensures compatibility with the subsequent stages, particularly during feature extraction and classification using convolutional neural networks (CNNs), which require fixed-size input dimensions.
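The combined normalization and length-equalization steps can be sketched as:

```python
import numpy as np

def zscore_peak_normalize(x):
    """Z-score normalization followed by peak normalization, as above."""
    x = (x - np.mean(x)) / np.std(x)         # zero mean, unit variance
    return x / np.max(np.abs(x))             # scale into [-1, 1]

def equalize_length(x, target_len):
    """Truncate longer signals or zero-pad shorter ones to a fixed length."""
    if len(x) >= target_len:
        return x[:target_len]
    return np.pad(x, (0, target_len - len(x)))

sig = zscore_peak_normalize(np.array([0.1, 0.5, -0.3, 0.9, 0.2]))
fixed = equalize_length(sig, 8)              # zero-padded to 8 samples
```

Dividing by the peak after z-scoring preserves the zero mean while guaranteeing the [−1, 1] range required for consistent downstream processing.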
To improve the signal quality and reduce computational complexity, silent regions were removed from the voice recordings prior to feature extraction. This pre-processing step is particularly important in pathological voice analysis, as non-phonated segments do not contain diagnostically relevant features related to vocal fold behavior. In this study, silence detection was performed using short-term energy analysis. A frame was considered silent when its energy fell below 2 % of the maximum signal energy. This threshold effectively identified low-activity regions while preserving the meaningful voiced segments. Detected silent frames were discarded, which improved the signal-to-noise ratio and ensured that only diagnostically useful components were retained for reliable feature extraction and classification.
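The short-term energy criterion (discard frames below 2 % of the maximum frame energy) can be sketched as follows; the frame length of 256 samples is an illustrative choice, not specified in the text:

```python
import numpy as np

def remove_silence(x, frame_len=256, energy_ratio=0.02):
    """Drop frames whose short-term energy is below 2 % of the maximum."""
    n = len(x) // frame_len
    frames = x[:n * frame_len].reshape(n, frame_len)
    energy = np.sum(frames ** 2, axis=1)          # short-term energy
    keep = energy >= energy_ratio * np.max(energy)
    return frames[keep].ravel()

sig = np.concatenate([np.zeros(512),                     # leading silence
                      np.sin(np.linspace(0, 50, 512)),   # voiced segment
                      np.zeros(256)])                    # trailing silence
voiced = remove_silence(sig)                             # voiced part only
```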
To eliminate high-frequency noise components that are not relevant for speech production, a low-pass filter was applied to all vocal signals. This filtering stage is crucial for preserving the frequency band that is most informative for voice analysis, particularly for pathology detection. We used a low-pass filter with a cut-off frequency of 3400 Hz. This value is typically used in speech processing applications, as the majority of voice energy is below this threshold. Frequencies above 3400 Hz typically contain ambient noise or artifacts irrelevant to the phonatory process. The filtering process contributes to improving the signal-to-noise ratio and increases the reliability of subsequent feature extraction steps, including Mel-spectrograms and scalograms.
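A possible realization of the 3400 Hz low-pass stage is shown below. The filter family and order (6th-order Butterworth, applied zero-phase with `filtfilt`) are assumptions for illustration; the text specifies only the cut-off frequency.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def lowpass_3400(x, fs):
    """Zero-phase low-pass filter with a 3400 Hz cut-off.

    A 6th-order Butterworth design is assumed here; the paper does not
    specify the filter family or order.
    """
    b, a = butter(6, 3400 / (fs / 2), btype="low")
    return filtfilt(b, a, x)                 # zero-phase filtering

fs = 50000
t = np.arange(fs) / fs
# 200 Hz voice-band tone plus a 10 kHz out-of-band component.
sig = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 10000 * t)
filtered = lowpass_3400(sig, fs)             # 10 kHz component suppressed
```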
A Hamming window was applied to each pre-processed voice signal to reduce spectral leakage by tapering the signal at its edges (Fig. 2).

Voice signals after pre-processing, including noise filtering, amplitude normalization, and segmentation into fixed-length frames. These enhanced signals serve as input for subsequent EMD-based decomposition and feature extraction (e.g., MFCCs and scalograms) to distinguish pathological from healthy voice patterns.
EMD, introduced by Huang et al. [26], is an adaptive and fully data-driven method designed for analyzing nonlinear and non-stationary signals. The EMD method can self-adaptively decompose a complicated multicomponent signal into a finite set of components known as IMFs without any a priori assumptions (Fig. 3). Due to its adaptive nature, EMD is particularly well-suited for the analysis of biomedical and voice signals. It enables the isolation of the most informative oscillatory modes and thus the extraction of relevant features — such as MFCCs or scalograms — from the most energetically significant IMF. We used Huang’s EMD for the signal analysis. The EMD algorithm is explained next [26], [27], [28]. Let x(t) be a real-valued, non-linear and non-stationary signal. EMD expresses x(t) as the sum of N IMFs and a final residual component r(t):

x(t) = \sum_{i=1}^{N} \mathrm{IMF}_i(t) + r(t)

The voice signal is decomposed into IMFs using EMD. Each IMF represents a distinct oscillatory mode, ordered from high to low frequency content, capturing features relevant to voice characteristics. The most energetic IMF (highlighted) is selected for further analysis, including MFCC-based and scalogram image generation, to support the classification between pathological and healthy voice signals.
Require: Signal x(t)
Ensure: A set of intrinsic mode functions (IMFs) {IMF1(t),IMF2(t),...,IMFN(t)} and a residual r (t)
1: r(t) ← x(t)
2: i ← 1
3: while r(t) has more than two extrema do
4: h(t) ← r(t)
5: repeat
6: Identify all local maxima and minima of h(t)
7: Interpolate maxima to obtain upper envelope eupper(t)
8: Interpolate minima to obtain lower envelope elower(t)
9: Compute mean envelope: m(t) ← (eupper(t) + elower(t)) / 2
10: Update proto-IMF: h(t) ← h(t) − m(t)
11: until h(t) satisfies IMF conditions:
– Number of extrema and zero crossings differ by at most one
– Mean envelope is approximately zero
12: IMFi(t) ← h(t)
13: r(t) ← r(t) − IMFi(t)
14: i ← i + 1
15: end while
16: return {IMF1(t),IMF2(t),...,IMFi−1(t)} and residual r(t)
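The sifting algorithm above can be sketched in numpy/scipy. This is a minimal illustration: boundary handling and the stopping rule are simplified compared with production implementations such as PyEMD.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def emd(x, max_imfs=10, max_sift=50, tol=0.05):
    """Minimal EMD following the sifting algorithm above."""
    t = np.arange(len(x), dtype=float)
    r = x.astype(float).copy()
    imfs = []
    for _ in range(max_imfs):
        h = r.copy()
        for _ in range(max_sift):
            # Local maxima and minima of the proto-IMF h(t)
            maxima = np.where((h[1:-1] > h[:-2]) & (h[1:-1] > h[2:]))[0] + 1
            minima = np.where((h[1:-1] < h[:-2]) & (h[1:-1] < h[2:]))[0] + 1
            if len(maxima) < 2 or len(minima) < 2:
                return imfs, r                    # residual: too few extrema
            # Cubic-spline envelopes through the extrema (endpoints included)
            idx_up = np.r_[0, maxima, len(h) - 1]
            idx_lo = np.r_[0, minima, len(h) - 1]
            e_upper = CubicSpline(t[idx_up], h[idx_up])(t)
            e_lower = CubicSpline(t[idx_lo], h[idx_lo])(t)
            m = (e_upper + e_lower) / 2           # mean envelope
            h = h - m                             # update proto-IMF
            if np.mean(np.abs(m)) < tol * np.mean(np.abs(h)):
                break                             # mean envelope ~ zero
        imfs.append(h)
        r = r - h                                 # peel off the IMF
    return imfs, r

t = np.linspace(0, 1, 2000)
sig = np.sin(2 * np.pi * 40 * t) + np.sin(2 * np.pi * 4 * t)
imfs, residual = emd(sig)      # IMFs + residual sum back to the signal
```

By construction, summing the extracted IMFs and the residual reconstructs the original signal, which is a useful sanity check on any EMD implementation.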
Once the relevant IMF was selected, it was segmented into overlapping frames for Mel-frequency cepstral coefficient (MFCC) extraction. Each IMF was divided into frames of 23 ms duration, with a 50 % overlap between consecutive frames to ensure smooth temporal continuity and to capture transitional acoustic features. For each windowed frame, MFCCs were calculated by transforming the power spectrum into the Mel scale using a bank of triangular filters spaced according to the Mel frequency warping function. This process resulted in a time–frequency representation of the relevant IMF in the perceptually motivated Mel scale. The MFCCs were then aggregated into Mel-spectrogram images, which were subsequently used as inputs for the classification stage (Fig. 4).
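The per-frame MFCC computation (power spectrum, triangular Mel filterbank, log, DCT) can be sketched as follows. The filter count, cepstral order, and the absence of pre-emphasis and liftering are illustrative assumptions; the upper frequency bound matches the 3400 Hz low-pass cut-off used earlier.

```python
import numpy as np

def hz_to_mel(f):
    return 2595 * np.log10(1 + f / 700)

def mel_to_hz(m):
    return 700 * (10 ** (m / 2595) - 1)

def mel_filterbank(n_filters, n_fft, fs, fmax=3400):
    """Triangular filters spaced uniformly on the Mel scale."""
    mels = np.linspace(0, hz_to_mel(fmax), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc_frame(frame, fs, n_filters=26, n_ceps=13):
    """MFCCs for one windowed frame: power spectrum -> Mel filterbank
    -> log -> DCT-II (computed explicitly with numpy)."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2 / n_fft
    mel_energy = mel_filterbank(n_filters, n_fft, fs) @ power
    log_mel = np.log(mel_energy + 1e-10)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps),
                                  (2 * n + 1) / (2 * n_filters)))
    return dct @ log_mel

fs = 50000
frame = np.hamming(1150) * np.sin(2 * np.pi * 150 * np.arange(1150) / fs)
ceps = mfcc_frame(frame, fs)     # 13 cepstral coefficients for this frame
```

Stacking the per-frame coefficient vectors over time yields the Mel-spectrogram image used as CNN input.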

Example of a Mel spectrogram generated from the most energetic IMF of a pre-processed voice signal. The representation emphasizes perceptually meaningful spectral patterns used to discriminate between healthy and pathological voices in classification tasks.
A scalogram is a two-dimensional time–frequency representation that visually shows how the spectral content of a signal evolves over time. It is particularly well suited for analyzing non-stationary signals such as the human voice, where spectral characteristics change dynamically during phonation. In addition to the MFCC extraction, each frame of the selected relevant IMF was processed using the Morlet CWT filter bank to generate scalograms. This technique enables the identification of localized spectral variations over time, which are crucial for the detection of subtle irregularities associated with vocal fold pathologies. For each frame, the CWT generates a matrix of wavelet coefficients representing the signal’s energy distribution over time and frequency. These matrices were then visualized as scalogram images, resized to 224 × 224 pixels, and converted to RGB format to comply with the input specifications of the AlexNet-CNN used in the classification stage (Fig. 5). This frame-level approach not only captures fine-grained, time-localized acoustic features relevant for pathology detection, but also significantly increases the number of training samples. As a result, the combination of scalogram-based representations and MFCCs enriches the feature space and improves the model’s robustness in discriminating between healthy and pathological voice signals.
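A frame-level Morlet scalogram can be sketched with an FFT-based CWT. This is a minimal illustration under a common analytic-Morlet convention; production code would typically rely on a library CWT (e.g. PyWavelets), and the scale grid below is an assumption.

```python
import numpy as np

def morlet_cwt(x, scales, w0=6.0):
    """CWT with an analytic Morlet wavelet via FFT-based convolution."""
    n = len(x)
    X = np.fft.fft(x)
    omega = 2 * np.pi * np.fft.fftfreq(n)            # rad/sample
    out = np.empty((len(scales), n), dtype=complex)
    for i, a in enumerate(scales):
        # Fourier transform of the analytic Morlet wavelet at scale a
        psi_hat = np.pi ** -0.25 * np.exp(-0.5 * (a * omega - w0) ** 2)
        psi_hat *= np.sqrt(2 * np.pi * a) * (omega > 0)
        out[i] = np.fft.ifft(X * np.conj(psi_hat))
    return out

fs = 50000
t = np.arange(1150) / fs                             # one 23 ms frame
frame = np.sin(2 * np.pi * 500 * t)                  # toy tonal frame
scales = np.geomspace(8, 256, 64)
coeffs = morlet_cwt(frame, scales)
scalogram = np.abs(coeffs)                           # magnitude -> image
```

The magnitude matrix would then be resized to 224 × 224 and replicated to three channels to match AlexNet's input format.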

Scalogram example derived from the most energetic IMF of a pre-processed voice signal using CWT. The representation highlights relevant time-frequency patterns for the subsequent classification of healthy and pathological voices.
The effectiveness of the proposed approach was evaluated by training and validating the AlexNet-CNN on two distinct types of input representations: MFCC-based images and CWT-derived scalograms. Each representation was generated from the most relevant IMF extracted from voice signal frames, as described in the previous sections. Classification performance was evaluated using accuracy as the primary metric on the validation set. The classifier achieved an accuracy of 85.66 % when using MFCC images, and a slightly higher accuracy of 86.4 % when using scalogram images (Fig. 6). These results suggest that both representations are effective in capturing discriminative features relevant to pathological voice detection. However, the superior performance of scalograms highlights their ability to encode rich time–frequency information that complements, and in some cases surpasses, conventional cepstral features. The improved performance of the scalograms can be attributed to their fine-grained temporal and spectral resolution, which allows the model to detect subtle irregularities associated with vocal fold dysfunctions. These results confirm the suitability of combining EMD with CWT-based scalograms for robust and accurate voice pathology classification.

Confusion matrix showing the performance of AlexNet-CNN on scalograms of the most energetic IMFs obtained with EMD and CWT.
In this section, we compare the performance of our proposed method with several recent studies on pathological voice classification using different datasets, features, and classification models. Table 1 summarizes the key characteristics and classification accuracies reported in the literature.
Comparison of our method with recent studies on pathological voice detection.
| Study | Dataset | Features and model | Accuracy [%] |
|---|---|---|---|
| [29] | SVD | Multipeak, Gaussian mixture model (GMM) | 91.83 |
| [30] | SVD + HUPA | MFCCs, SVM | 71.45–76.19 |
| [31] | MEEI voice disorders | MFCC (500 ms frames, 5 ms shift), SVM | 66.4–75.1 |
| [32] | SVD + HUPA | wav2vec, SVM | 68.55–83.11 |
| [33] | SVD + HUPA | Mel-spectrogram, SVM | 69.45–75 |
| [34] | VOICED | wav2vec 2.0, SVM / KNN | 98 |
| [35] | UA-speech + TORGO | MFCCs, SVM | 63.13–89.22 |
| This work | SVD | EMD-IMF, Mel-spectrogram + scalogram, AlexNet-CNN | 85.66 / 86.4 |
To evaluate the effectiveness of our proposed method, we compared its performance with several recent studies on pathological voice detection using different datasets, features, and machine learning models (see Table 1). Our approach, based on EMD-derived Mel-spectrograms and scalograms as inputs to the AlexNet-CNN, achieved an accuracy of 85.66 % and 86.4 %, respectively. In comparison, Eskidere et al. [29] achieved a slightly higher accuracy of 91.83 % using a GMM with multipeak features on the same SVD dataset. However, their method did not use deep learning or time–frequency representations, which could limit generalization. Similarly, Kadiri et al. [30] used MFCCs and SVM on a combined dataset (SVD and HUPA), reporting accuracies ranging from 71.45 % to 76.19 %. More recent studies have explored deep representations, such as wav2vec features in combination with SVM classifiers, and achieved accuracies ranging from 68.55 % to 83.11 % [32]. Other approaches, including Mel-spectrogram features with SVM [33], and classical MFCC-based systems [31], [35], have generally shown lower or more variable performance, especially when applied to small or heterogeneous datasets. The highest reported accuracy in the literature (98 %) was obtained by Cai et al. [34] using the VOICED database and wav2vec 2.0 features in combination with SVM and KNN classifiers. While promising, this result is based on a different dataset and may not be directly comparable due to variations in recording conditions, subject demographics, and pathology types.
Our method offers a competitive and robust alternative, particularly because it:
operates effectively on a publicly available and widely used dataset (SVD),
integrates the EMD to isolate the most informative IMF component,
extracts both MFCC images and scalograms, and
utilizes deep learning through AlexNet-CNN, a well-established architecture for small to medium-sized datasets.
Overall, these results suggest that our hybrid approach, which combines classical signal processing with deep learning, delivers performance that is not only competitive with state-of-the-art methods, but also interpretable and adaptable for non-invasive clinical screening of voice disorders.
This study proposed an effective framework for automatic detection of laryngeal pathologies by combining advanced signal processing and deep learning techniques. The voice recordings were pre-processed and decomposed using EMD, and the most relevant IMFs were selected based on temporal energy. From each frame of this IMF, MFCC images and scalograms based on CWT were extracted to capture both spectral and temporal information.
Using a deep CNN, namely AlexNet, we classified the extracted features and achieved promising results: 85.66 % accuracy with MFCC-based spectrograms and 86.4 % with scalograms. These results show the potential of the proposed method to extract and analyze diagnostically relevant information from voice signals for non-invasive and early-stage detection of laryngeal pathologies.
The proposed method provides a novel combination of signal decomposition and time–frequency feature representation that complements recent advances in deep learning-based voice pathology detection. Compared to existing approaches, our framework proves to be not only effective but also interpretable and well-suited for small-scale datasets. Therefore, the method is suitable for an initial screening of pathological voice conditions and can serve as a valuable diagnostic aid. However, further research and large-scale clinical validation are essential to improve its robustness and generalizability for real-world applications.