Cardiovascular diseases (CVDs) are the leading cause of death worldwide, and epicardial adipose tissue (EAT) is one of their most important risk factors [1]–[3]. EAT is the fat deposit of the heart located between the myocardial surface and the visceral layer of the pericardium. Its most important functions include protection against hypothermia, mechanical protection of the coronary circulation, and energy supply to the myocardium; it also secretes adiponectin from the epicardial adipocytes. The volume and thickness of EAT affect cardiovascular function and are closely linked to obesity. Recently, image-based assessment of EAT has gained attention due to its non-invasiveness [4]. It assists healthcare providers in analyzing the volume, density, and spatial distribution of EAT. The imaging modalities used for accurate quantification and visualization of EAT include computed tomography (CT), magnetic resonance imaging (MRI), and echocardiography.
This quantification supports risk stratification and disease prognosis. Recently, the use of machine learning (ML) and deep learning (DL) models in cardiovascular imaging has gained greater attention [5]. These models help clinicians analyze complex data and extract meaningful information [6]. Well-known ML algorithms such as AdaBoost, random forests, and decision trees are used for image classification, segmentation, and feature extraction. Among DL algorithms, convolutional neural networks (CNNs) have gained importance due to their ability to automatically learn features from images. Popular DL architectures such as U-Net, DenseNet, and ResNet are well suited for medical image analysis tasks such as tumor detection, organ segmentation, and disease classification. Their ability to learn complex patterns from large datasets contributes to improved patient care.
A medical image segmentation model using a feature compression pyramid network was developed [7]. An interactive segmentation framework utilizing DL principles was presented [8]. An ensemble of U-Net architectures was developed for kidney tumor segmentation [9]. A segmentation model based on a 3-D CNN, called HyperDenseNet, was proposed for medical image processing [10]. An attention mechanism-based model was introduced for medical image segmentation [11]. DSI-Net, which includes a classification branch, a coarse segmentation branch, and a fine segmentation branch, was introduced [12]. A smaller model called PocketNet was proposed for combined segmentation and classification [13].
Transformer networks use self-attention mechanisms and are able to capture global dependencies within images. A spatio-temporal transformer network was proposed for medical image processing [14]. Traditional pixel-wise classifiers in deep learning do not take into account the structure of the output or the inter-dependence among pixel labels; to address this, a new training approach was introduced [15]. A new deep learning model was developed for epicardial fat segmentation and classification in non-contrast CT images [16]. The dual U-Net concept was proposed for epicardial fat detection [17].
In light of these advancements, this work aims to contribute to the evolving field of cardiovascular research by proposing a novel approach for accurate EAT assessment and severity classification. The proposed model integrates DL models with architectural modifications and optimization frameworks for assessing EAT severity. The contributions of the proposed work are as follows:
Performing modified U-Net-based segmentation to obtain robust features;
Applying an optimization algorithm to select the best features;
Developing an XGBoost classification model for EAT severity classification;
Comparing the proposed approach with existing segmentation and classification models.
The rest of the work is organized as follows: Section 2 describes the proposed model for EAT segmentation and classification, Section 3 presents the results and discussion, and Section 4 concludes the work.
This section describes the proposed U-Net model for EAT segmentation and the XGBoost-based severity classification. First, the EAT regions of the image are segmented using the proposed modified squeeze-and-excitation gated recurrent unit U-Net (MSE-GRU-U-Net). This architecture includes MSE blocks and a multi-scale dense (MS-D) network for accurate EAT segmentation. Then an XGBoost model is applied to classify the severity based on features extracted from the segmented EAT regions. The entire workflow is shown in Fig. 1.

Workflow of the proposed model.
The proposed MSE-GRU-U-Net architecture comprises a symmetric structure with two core segments, the encoder and the decoder, as shown in Fig. 2. The encoder focuses on feature extraction, while the decoder is used for precise feature localization. The architecture includes residual blocks, pooling layers, MSE blocks, an MS-D network, and four up-sampling blocks for 512 × 512 × 1 input images.

The proposed model.
The residual learning strategy is applied by integrating shortcut connections into the conventional U-Net. These connections make the model suitable for deeper training and prevent degradation. Throughout the model, features undergo a series of convolution and pooling operations. The size of the final binary segmented image is 512 × 512 × 1.
To mitigate the vanishing gradient problem, batch normalization and rectified linear unit (ReLU) activation follow each convolution operation. In addition, the MSE block integrates bi-directional gated recurrent unit (BiGRU) layers to adaptively extract image features. Within the U-Net, the BiGRU layers provide channel attention for effective segmentation.
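For illustration, a minimal PyTorch sketch of one such residual block is given below; the two-convolution layout and equal channel counts are assumptions, as the exact block configuration is not specified in the text.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Conv -> BN -> ReLU -> Conv -> BN, plus a shortcut connection.
    Hypothetical layout: the exact block design is not given in the text."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),   # batch normalization after each convolution
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # The shortcut lets gradients bypass the block, enabling
        # deeper training without degradation.
        return self.act(self.body(x) + x)
```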
In addition, the MS-D network is used as a transition layer after the encoder and at the output of the decoder. Inserting the MS-D network into the U-Net improves the model's ability to learn multi-scale contextual information [18]. It also effectively mitigates the resolution reduction caused by repeated down-sampling operations, as the dilated convolutions allow the network to gather contextual information without excessive down-sampling of the feature maps.
In CNNs, the squeeze-and-excitation (SE) block is inserted to increase the representational power of the model. It focuses on relevant features and suppresses irrelevant ones [19]. The SE block learns channel-wise recalibration weights to adaptively emphasize essential features. This involves two fundamental operations: squeeze and excitation. The squeeze step aggregates channel-wise information to capture global statistics, and the excitation step re-weights the features based on the learned parameters. In the MSE block, additional layers are introduced to improve feature representation and the learning capability of the network: GRU layers are added to the conventional SE block to better capture spatial and temporal information in the images. The MSE block thus comprises global pooling, fully connected layers, ReLU activation, BiGRU layers, and a final sigmoid activation.
The input feature maps are globally pooled to summarize the spatial information into a single vector for each channel. The output of the pooling layer is expressed as follows:
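For an input feature map $x$ with $C$ channels of spatial size $H \times W$, this squeeze step takes the standard global average pooling form

$$ z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j), \qquad c = 1, \dots, C. $$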
The pooled features are passed through a fully connected layer, FC1, which transforms them into a higher-level feature representation.
Following the fully connected layer, a ReLU activation function introduces non-linearity into the network.
The ReLU-activated features are passed through the first BiGRU layer, BiGRU1, which allows the network to capture sequential information bidirectionally.
The output of the first BiGRU layer is further processed by a second BiGRU layer, BiGRU2, to capture more complex sequential patterns.
Another fully connected layer, FC2, performs additional feature transformations.
Finally, a sigmoid activation function is applied to generate the output probabilities by considering the important features and suppressing the less relevant ones.
First, the input feature maps are subjected to global pooling operations to summarize the spatial information, allowing the extraction of global statistics for each channel. The addition of multiple layers enables hierarchical feature learning, allowing the model to extract complex representations of the fat image. This hierarchical feature learning helps to capture both spatial and temporal patterns.
The excitation operation involves feature learning based on channel-wise weights. In addition, the network’s adaptability allows it to dynamically focus its attention on different aspects of the pixels. The MSE block improves the standard SE by integrating multi-scale feature aggregation and frequency-aware attention. It improves contextual understanding and reduces computational effort.
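To make this sequence concrete, a minimal PyTorch sketch of the MSE block is given below. The reduction ratio and GRU hidden size are illustrative assumptions, as neither value is specified in the text.

```python
import torch
import torch.nn as nn

class MSEBlock(nn.Module):
    """Modified SE block sketch: squeeze -> FC1 -> ReLU -> BiGRU1 -> BiGRU2
    -> FC2 -> sigmoid channel re-weighting."""
    def __init__(self, channels, reduction=16, hidden=8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global statistics per channel
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.relu = nn.ReLU(inplace=True)
        self.bigru1 = nn.GRU(1, hidden, batch_first=True, bidirectional=True)
        self.bigru2 = nn.GRU(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.fc2 = nn.Linear((channels // reduction) * 2 * hidden, channels)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        b, c, _, _ = x.shape
        z = self.pool(x).view(b, c)               # (B, C): one statistic per channel
        z = self.relu(self.fc1(z))                # FC1 + ReLU: higher-level features
        z, _ = self.bigru1(z.unsqueeze(-1))       # treat channels as a sequence
        z, _ = self.bigru2(z)                     # deeper sequential patterns
        w = self.sigmoid(self.fc2(z.flatten(1)))  # excitation weights in (0, 1)
        return x * w.view(b, c, 1, 1)             # channel-wise recalibration
```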
The MS-D network architecture is an innovative DL approach for image processing. It combines mixed-scale dilated convolutions with dense connections to improve feature extraction and the information flow between layers. The architecture comprises multiple layers of feature maps, each derived from a consistent set of operations. The sequential process used to generate the feature maps is as follows:
The operations comprise dilated convolutions, pixel-wise summation, and bias addition. Using 3 × 3 filters with channel-specific dilations, convolutions are applied to all previous feature maps, capturing the multi-scale information important for nuanced feature extraction. The resulting images are summed pixel by pixel, integrating spatial information across multiple scales. A constant bias is then added to each pixel, introducing learnable parameters for fine-tuning the feature representations. This sequence generates a series of feature maps that refine and extract hierarchical features from the input image.
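In the notation of the mixed-scale dense network of [18], each new feature map can be written as

$$ z_i = \sigma\Big( b_i + \sum_{j < i} D^{\,d_i}_{3 \times 3} * z_j \Big), $$

where $z_j$ are the previously computed feature maps, $D^{\,d_i}_{3 \times 3}$ denotes a 3 × 3 convolution with channel-specific dilation $d_i$, $b_i$ is the learnable bias, and $\sigma$ is an element-wise non-linearity.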
Feature selection is an important task for achieving higher accuracy and reducing computation time; classification performance depends heavily on it. In this work, the falcon optimization algorithm (FOA) is used to select the optimal features extracted by the modified U-Net architecture. The falcons represent potential solutions navigating through the solution space. This population-based algorithm uses a set of N falcons as search agents in the search space, and the best position found so far is referred to as gbest. Each falcon updates its position by moving toward gbest according to the FOA update rule.
The convergence rate varies linearly between 1 and 2 over the iterations, which increases the probability of reaching a global solution. Each falcon updates its position based on a single gbest, which promotes thorough exploitation of promising solutions.
The features selected by FOA are normalized and filtered based on the mean and standard deviation to determine the best subset of features:
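A common form of this step, assuming z-score normalization with a retention threshold $\tau$ (the exact filtering criterion is not stated), is

$$ \tilde{f}_i = \frac{f_i - \mu_i}{\sigma_i}, \qquad S = \{\, i : |\tilde{f}_i| \geq \tau \,\}, $$

where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of feature $i$ over the training set and $S$ is the selected subset.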
FOA Feature Selection and XGBoost Optimization
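The listing below is a hedged Python sketch of this procedure. The function names mirror those in the text, but the gbest-guided position update and the 3-fold cross-validated fitness are illustrative assumptions standing in for the exact FOA equations, which are not reproduced here.

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

def _foa_search(fitness, dim, n_falcons=20, n_iter=50, seed=0):
    """Generic FOA loop: initialize, evaluate, refine (assumed update rule)."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(0.0, 1.0, size=(n_falcons, dim))      # initialize population
    fit = np.array([fitness(p) for p in pos])               # evaluate fitness
    gbest = pos[fit.argmax()].copy()                        # best position so far
    for t in range(n_iter):
        cr = 1.0 + t / max(n_iter - 1, 1)                   # convergence rate in [1, 2]
        for i in range(n_falcons):
            step = rng.random(dim) * (gbest - cr * pos[i])  # dive toward gbest
            cand = np.clip(pos[i] + step, 0.0, 1.0)         # generate new location
            f = fitness(cand)
            if f > fit[i]:                                  # update falcon location
                pos[i], fit[i] = cand, f
        gbest = pos[fit.argmax()].copy()                    # update best position
    return gbest

def FOA_feature_selection(X, y):
    """Select the feature subset that maximizes cross-validated accuracy."""
    def fitness(p):
        mask = p > 0.5                                      # binary feature mask
        if not mask.any():
            return 0.0
        clf = XGBClassifier(n_estimators=50, verbosity=0)
        return cross_val_score(clf, X[:, mask], y, cv=3).mean()
    return _foa_search(fitness, dim=X.shape[1]) > 0.5

def FOA_optimize_xgb(X, y):
    """Tune trees, depth, learning rate, and regularization (assumed ranges)."""
    def fitness(p):
        clf = XGBClassifier(
            n_estimators=int(50 + 450 * p[0]),              # number of trees
            max_depth=int(2 + 8 * p[1]),                    # tree depth
            learning_rate=0.01 + 0.3 * p[2],                # learning rate
            reg_lambda=10.0 ** (2.0 * p[3] - 1.0),          # L2 regularization
            verbosity=0,
        )
        return cross_val_score(clf, X, y, cv=3).mean()
    return _foa_search(fitness, dim=4)
```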
XGBoost is an ML model based on gradient boosting. It sequentially constructs decision trees to minimize an objective function, updating the tree weights by gradient descent optimization. The result is an ensemble of trees whose predictions are combined by weighted averaging. The number of trees, the tree depth, the learning rate, and the regularization terms are the four most important hyperparameters of the XGBoost model.
In this work, FOA is used to determine the ideal hyperparameters of the XGBoost model by searching over a subset of these values. The pseudocode for the proposed feature selection and tuning of the XGBoost model is presented above.
The first function, FOA_feature_selection(), uses FOA to iteratively select the most relevant features from a given set. Through population initialization, fitness evaluation, and iterative refinement, which includes updating the best positions, generating new locations, and updating the falcon locations, this process identifies the set of features that are critical for EAT segmentation and severity analysis. The FOA_optimize_xgb() function then uses FOA to optimize the XGBoost parameters, iterating through the defined parameter space to tune the learning rate, maximum depth, and other relevant parameters and thereby improve the model's efficiency in classifying EAT severity levels. FOA increases the computational complexity due to fractional-order calculations that require memory-intensive recursive operations; it improves feature sensitivity but requires efficient GPU acceleration to maintain practical training times.
The experimental datasets used in this work are taken from http://visual.ic.uff.br/en/cardio/ctfat/index.php and consist of data from 200 patients. The validity and repeatability of the proposed method were assessed by conducting ten-fold cross-validation experiments on the collected dataset; in each run, the dataset was divided into 70 % for training and 30 % for validation. Segmentation of this dataset faces challenges such as low contrast and artifacts, which are addressed by adaptive weighting and post-processing. The performance of the proposed segmentation architecture is compared with other segmentation models in terms of mean intersection over union (Mean IOU), mean Dice score (MDS), and Pearson correlation coefficient (PCC).
Mean IOU evaluates the intersection between the predicted segmentation and the ground truth (GT) segmentation for multiple classes or instances. It is calculated as the intersection divided by the union of the predicted and GT masks as follows:
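For a predicted mask $P$ and a GT mask $G$, the standard per-class form is

$$ \mathrm{IoU} = \frac{|P \cap G|}{|P \cup G|} = \frac{TP}{TP + FP + FN}. $$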
The Mean IOU is calculated as follows:
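$$ \text{Mean IOU} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{IoU}_c, $$

where $C$ is the number of classes (or instances) and $\mathrm{IoU}_c$ is the score of the $c$-th one.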
The Dice score (also known as F1 score) is used to evaluate the similarity between two samples. In segmentation, it measures the overlap between the predicted and GT segmentation as follows:
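$$ DSC = \frac{2\,|P \cap G|}{|P| + |G|} = \frac{2\,TP}{2\,TP + FP + FN}. $$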
The MDS is calculated as follows:
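$$ MDS = \frac{1}{N} \sum_{n=1}^{N} DSC_n, $$

where $N$ is the number of evaluated images and $DSC_n$ is the Dice score of the $n$-th one.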
The PCC estimates the linear correlation between two sets of data. It is used to analyze the relationship between predicted and GT pixel values. It can be calculated as follows:
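$$ PCC = \frac{\sum_{i} (p_i - \bar{p})(g_i - \bar{g})}{\sqrt{\sum_{i} (p_i - \bar{p})^2}\, \sqrt{\sum_{i} (g_i - \bar{g})^2}}, $$

where $p_i$ and $g_i$ are the predicted and GT pixel values and $\bar{p}$, $\bar{g}$ are their means.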
The segmentation results are shown in Fig. 3. The segmentation results correspond well with the GT images. The red color represents the epicardial fat, the green color represents the mediastinal fat, and the blue color represents the gap between the epicardial and mediastinal fat. The training and validation loss of the proposed model over epochs is shown in Fig. 4.

Segmentation result (epicardial and mediastinal fats).

Training and validation loss of the proposed model over epochs.
The performance of the proposed and existing segmentation models is given in Table 1. The fully convolutional network (FCN) shows a Mean IOU of 68.92 % and an MDS of 73.5 %, indicating reasonable but comparatively lower segmentation accuracy. The U-Net improves on this with a Mean IOU of 71.5 % and an MDS of 79.7 %. Seg-Net further improves performance, achieving a Mean IOU of 76.7 % and an MDS of 85.42 %, a significant improvement in image region delineation. The Attention U-Net reaches a Mean IOU of 80.2 % and an MDS of 88.54 %, representing remarkable progress in segmentation accuracy. The SAR-U-Net achieves a Mean IOU of 86.78 % and an MDS of 92.6 %, indicating precise segmentation of image structures. The correlation coefficients of these models range from 0.789 to 0.954. The proposed MSE-GRU-U-Net shows the best performance, with a Mean IOU of 89.5 %, an MDS of 94.3 %, and a PCC of 0.973. The results are shown graphically in Fig. 5.

Performance analysis.
Performance analysis.
| Model | Mean IOU [%] | MDS [%] | PCC |
|---|---|---|---|
| FCN | 68.92 | 73.5 | 0.789 |
| U-Net | 71.5 | 79.7 | 0.8721 |
| Seg-Net | 76.7 | 85.42 | 0.8905 |
| Attention U-Net | 80.2 | 88.54 | 0.932 |
| SAR-U-Net | 86.78 | 92.6 | 0.954 |
| MSE-GRU-U-Net | 89.5 | 94.3 | 0.973 |
The features extracted from the segmentation model are used to train the XGBoost classifier with labels of low, medium, and high severity. For a fair comparison, the FOA-XGBoost model is compared with other FOA-optimized models based on ID3 and naive Bayes (NB). The measured accuracy, specificity, precision, and recall are shown in Table 2. The FOA-ID3 model achieves 92 % accuracy, 91 % specificity, 90 % precision, and 94 % recall. The FOA-NB model achieves 95 % accuracy, 95 % specificity, 94 % precision, and 94 % recall. The FOA-XGBoost model stands out with the highest values, achieving 97 % accuracy, 96 % specificity, 97 % precision, and 98 % recall, demonstrating exceptional accuracy in categorizing severity levels.
Performance analysis.
| Metric | FOA-ID3 [%] | FOA-NB [%] | FOA-XGBoost [%] |
|---|---|---|---|
| Accuracy | 92 | 95 | 97 |
| Specificity | 91 | 95 | 96 |
| Precision | 90 | 94 | 97 |
| Recall | 94 | 94 | 98 |
In this work, a new approach to EAT segmentation and severity classification is proposed. The new architecture integrates MSE blocks with an MS-D network for accurate segmentation of fat regions. The MSE-GRU-U-Net achieves a Mean IOU of 89.5 % and an MDS of 94.3 %, together with a strong PCC of 0.973, indicating a highly reliable relationship between the predicted and GT values. For classification, the FOA-XGBoost model achieves the highest accuracy of 97 %. The proposed model is fully automated and can serve as an accurate diagnostic tool in advanced clinical practice.