Cardiovascular diseases (CVDs) are the leading cause of death worldwide, and epicardial adipose tissue (EAT) is one of their most important risk factors [1]–[3]. EAT is the fat deposit of the heart located between the myocardial surface and the visceral layer of the pericardium. Its most important functions include protection against hypothermia, mechanical protection of the coronary circulation, and energy supply to the myocardium; it also secretes adiponectin from the epicardial adipocytes. The volume and thickness of EAT affect cardiovascular function and are closely linked to obesity. Recently, image-based assessment of EAT has gained attention due to its non-invasiveness [4]. It assists healthcare providers in analyzing the volume, density, and spatial distribution of EAT. The imaging modalities used for accurate quantification and visualization of EAT include computed tomography (CT), magnetic resonance imaging (MRI), and echocardiography.
This quantification supports risk stratification and disease prognosis. Recently, the use of machine learning (ML) and deep learning (DL) models in cardiovascular imaging has gained greater attention [5]. These models help clinicians analyze complex data and extract meaningful information [6]. Well-known ML algorithms such as AdaBoost, random forests, and decision trees are used for image classification, segmentation, and feature extraction. Among DL algorithms, convolutional neural networks (CNNs) have gained importance due to their ability to automatically learn features from images. Popular DL architectures such as U-Net, DenseNet, and ResNet are well suited for medical image analysis tasks such as tumor detection, organ segmentation, and disease classification. Their ability to learn complex patterns from large datasets contributes to improved patient care.
A medical image segmentation model using a feature compression pyramid network was developed [7]. An interactive segmentation framework utilizing DL principles was presented [8]. An ensemble of U-Net architectures was developed for kidney tumor segmentation [9]. A segmentation model based on a 3-D CNN, called HyperDenseNet, was proposed for medical image processing [10]. An attention mechanism-based model was introduced for medical image segmentation [11]. DSI-Net, which includes a classification branch, a coarse segmentation branch, and a fine segmentation branch, was introduced [12]. A smaller model called PocketNet was proposed for combined segmentation and classification [13].
Transformer networks use self-attention mechanisms and are able to capture global dependencies within images. A spatio-temporal transformer network was proposed for medical image processing [14]. Traditional pixel-wise classifiers in deep learning do not take into account the structure of the output or the inter-dependence among pixel labels; to address this, a new training approach was introduced [15]. A new deep learning model was developed for epicardial fat segmentation and classification in non-contrast CT images [16]. The dual U-Net concept was proposed for epicardial fat detection [17].
In light of these advancements, this work aims to contribute to the evolving field of cardiovascular research by proposing a novel approach for accurate EAT assessment and severity classification. The proposed model integrates DL models with architectural modifications and optimization frameworks for assessing EAT severity. The contributions of the proposed work are as follows:
Performing modified U-Net-based segmentation to obtain robust features;
Applying an optimization algorithm to select the best features;
Developing an XGBoost classification model for EAT severity classification;
Comparing the proposed approach with existing segmentation and classification models.
The rest of the work is organized as follows: Section 2 describes the proposed model for EAT segmentation and classification, Section 3 presents the results and discussion, and Section 4 concludes the work.
This section describes the proposed U-Net model for EAT segmentation and the XGBoost-based severity classification. First, the EAT regions of the image are segmented using the proposed modified squeeze-and-excitation gated recurrent unit U-Net (MSE-GRU-U-Net). This architecture includes MSE blocks and a multi-scale dense (MS-D) network for accurate EAT segmentation. Then an XGBoost model is applied to classify the severity based on features extracted from the segmented EAT regions. The entire workflow is shown in Fig. 1.

Workflow of the proposed model.
The proposed MSE-GRU-U-Net architecture comprises a symmetric structure with two core segments, the encoder and the decoder, as shown in Fig. 2. The encoder focuses on feature extraction, while the decoder is used for precise feature localization. The architecture includes residual blocks, pooling layers, MSE blocks, an MS-D network, and four up-sampling blocks for 512 × 512 × 1 input images.

The proposed model.
The residual learning strategy is applied by integrating shortcut connections into the conventional U-Net. These connections make the model suitable for deeper training and prevent degradation. Throughout the model, features undergo a series of convolution and pooling operations. The size of the final binary segmented image is 512 × 512 × 1.
To mitigate the vanishing gradient problem, batch normalization and rectified linear unit (ReLU) activation follow each convolution operation. In addition, the MSE block integrates bi-directional gated recurrent unit (BiGRU) layers to adaptively extract image features. Within the U-Net, the BiGRU layers provide channel attention for effective segmentation.
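For illustration, a minimal PyTorch sketch of one such residual block is given below; the two-convolution layout and equal channel counts are assumptions, as the exact block configuration is not specified in the text.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Conv -> BN -> ReLU -> Conv -> BN, plus a shortcut connection.
    Hypothetical layout: the exact block design is not given in the text."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),   # batch normalization after each convolution
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # The shortcut lets gradients bypass the block, enabling
        # deeper training without degradation.
        return self.act(self.body(x) + x)
```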
In addition, the MS-D network is used as a transition layer after the encoder and at the output of the decoder. Inserting the MS-D network into the U-Net improves the model's ability to learn multi-scale contextual information [18]. It also effectively mitigates the resolution reduction caused by repeated down-sampling operations, as the dilated convolutions allow the network to gather contextual information without excessive down-sampling of the feature maps.
In CNNs, the squeeze-and-excitation (SE) block is inserted to increase the representational power of the model. It focuses on relevant features and suppresses irrelevant ones [19]. The SE block learns channel-wise recalibration weights to adaptively emphasize essential features. This involves two fundamental operations: squeeze and excitation. The squeeze step aggregates channel-wise information to capture global statistics, and the excitation step re-weights the features based on the learned parameters. In the MSE block, additional layers are introduced to improve feature representation and the learning capability of the network: GRU layers are added to the conventional SE block to better capture spatial and temporal information in the images. The MSE block thus comprises global pooling, fully connected layers, ReLU activation, BiGRU layers, and a final sigmoid activation.
The input feature maps are globally pooled to summarize the spatial information into a single vector for each channel. The output of the pooling layer is expressed as follows:
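For an input feature map $x$ with $C$ channels of spatial size $H \times W$, this squeeze step takes the standard global average pooling form

$$ z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j), \qquad c = 1, \dots, C. $$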
The pooled features are passed through a fully connected layer, FC1, which transforms them into a higher-level feature representation.
Following the fully connected layer, a ReLU activation function introduces non-linearity into the network.
The ReLU-activated features are passed through the first BiGRU layer, BiGRU1, which allows the network to capture sequential information bidirectionally.
The output of the first BiGRU layer is further processed by a second BiGRU layer, BiGRU2, to capture more complex sequential patterns.
Another fully connected layer, FC2, performs additional feature transformations.
Finally, a sigmoid activation function is applied to generate the output probabilities by considering the important features and suppressing the less relevant ones.
First, the input feature maps are subjected to global pooling operations to summarize the spatial information, allowing the extraction of global statistics for each channel. The addition of multiple layers enables hierarchical feature learning, allowing the model to extract complex representations of the fat image. This hierarchical feature learning helps to capture both spatial and temporal patterns.
The excitation operation involves feature learning based on channel-wise weights. In addition, the network’s adaptability allows it to dynamically focus its attention on different aspects of the pixels. The MSE block improves the standard SE by integrating multi-scale feature aggregation and frequency-aware attention. It improves contextual understanding and reduces computational effort.
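To make this sequence concrete, a minimal PyTorch sketch of the MSE block is given below. The reduction ratio and GRU hidden size are illustrative assumptions, as neither value is specified in the text.

```python
import torch
import torch.nn as nn

class MSEBlock(nn.Module):
    """Modified SE block sketch: squeeze -> FC1 -> ReLU -> BiGRU1 -> BiGRU2
    -> FC2 -> sigmoid channel re-weighting."""
    def __init__(self, channels, reduction=16, hidden=8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global statistics per channel
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.relu = nn.ReLU(inplace=True)
        self.bigru1 = nn.GRU(1, hidden, batch_first=True, bidirectional=True)
        self.bigru2 = nn.GRU(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.fc2 = nn.Linear((channels // reduction) * 2 * hidden, channels)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        b, c, _, _ = x.shape
        z = self.pool(x).view(b, c)               # (B, C): one statistic per channel
        z = self.relu(self.fc1(z))                # FC1 + ReLU: higher-level features
        z, _ = self.bigru1(z.unsqueeze(-1))       # treat channels as a sequence
        z, _ = self.bigru2(z)                     # deeper sequential patterns
        w = self.sigmoid(self.fc2(z.flatten(1)))  # excitation weights in (0, 1)
        return x * w.view(b, c, 1, 1)             # channel-wise recalibration
```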
The MS-D network architecture is an innovative DL approach for image processing. It combines mixed-scale dilated convolutions with dense connections to improve feature extraction and the information flow between layers. The architecture comprises multiple layers of feature maps, each derived from a consistent set of operations. The sequential process used to generate the feature maps is as follows:
The operations comprise dilated convolutions, pixel-wise summation, and bias addition. Using 3 × 3 filters with channel-specific dilations, convolutions are applied to all previous feature maps, capturing the multi-scale information important for nuanced feature extraction. The resulting images are summed pixel by pixel, integrating spatial information across multiple scales. A constant bias is then added to each pixel, introducing learnable parameters for fine-tuning the feature representations. This sequence generates a series of feature maps that refine and extract hierarchical features from the input image.
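In the notation of the mixed-scale dense network of [18], each new feature map can be written as

$$ z_i = \sigma\Big( b_i + \sum_{j < i} D^{\,d_i}_{3 \times 3} * z_j \Big), $$

where $z_j$ are the previously computed feature maps, $D^{\,d_i}_{3 \times 3}$ denotes a 3 × 3 convolution with channel-specific dilation $d_i$, $b_i$ is the learnable bias, and $\sigma$ is an element-wise non-linearity.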
Feature selection is an important task for achieving higher accuracy and reducing computation time; classification performance depends heavily on it. In this work, the falcon optimization algorithm (FOA) is used to select the optimal features extracted by the modified U-Net architecture. The falcons represent potential solutions navigating through the solution space. This population-based algorithm uses a set of N falcons as search agents in the search space, and the best position found so far is referred to as gbest. Each falcon updates its position by moving toward gbest according to the FOA update rule.
The convergence rate varies linearly between 1 and 2 over the iterations, which increases the probability of reaching a global solution. Each falcon updates its position based on a single gbest, which promotes thorough exploitation of promising solutions.
The features selected by FOA are normalized and filtered based on the mean and standard deviation to determine the best subset of features:
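A common form of this step, assuming z-score normalization with a retention threshold $\tau$ (the exact filtering criterion is not stated), is

$$ \tilde{f}_i = \frac{f_i - \mu_i}{\sigma_i}, \qquad S = \{\, i : |\tilde{f}_i| \geq \tau \,\}, $$

where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of feature $i$ over the training set and $S$ is the selected subset.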
FOA Feature Selection and XGBoost Optimization
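The listing below is a hedged Python sketch of this procedure. The function names mirror those in the text, but the gbest-guided position update and the 3-fold cross-validated fitness are illustrative assumptions standing in for the exact FOA equations, which are not reproduced here.

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

def _foa_search(fitness, dim, n_falcons=20, n_iter=50, seed=0):
    """Generic FOA loop: initialize, evaluate, refine (assumed update rule)."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(0.0, 1.0, size=(n_falcons, dim))      # initialize population
    fit = np.array([fitness(p) for p in pos])               # evaluate fitness
    gbest = pos[fit.argmax()].copy()                        # best position so far
    for t in range(n_iter):
        cr = 1.0 + t / max(n_iter - 1, 1)                   # convergence rate in [1, 2]
        for i in range(n_falcons):
            step = rng.random(dim) * (gbest - cr * pos[i])  # dive toward gbest
            cand = np.clip(pos[i] + step, 0.0, 1.0)         # generate new location
            f = fitness(cand)
            if f > fit[i]:                                  # update falcon location
                pos[i], fit[i] = cand, f
        gbest = pos[fit.argmax()].copy()                    # update best position
    return gbest

def FOA_feature_selection(X, y):
    """Select the feature subset that maximizes cross-validated accuracy."""
    def fitness(p):
        mask = p > 0.5                                      # binary feature mask
        if not mask.any():
            return 0.0
        clf = XGBClassifier(n_estimators=50, verbosity=0)
        return cross_val_score(clf, X[:, mask], y, cv=3).mean()
    return _foa_search(fitness, dim=X.shape[1]) > 0.5

def FOA_optimize_xgb(X, y):
    """Tune trees, depth, learning rate, and regularization (assumed ranges)."""
    def fitness(p):
        clf = XGBClassifier(
            n_estimators=int(50 + 450 * p[0]),              # number of trees
            max_depth=int(2 + 8 * p[1]),                    # tree depth
            learning_rate=0.01 + 0.3 * p[2],                # learning rate
            reg_lambda=10.0 ** (2.0 * p[3] - 1.0),          # L2 regularization
            verbosity=0,
        )
        return cross_val_score(clf, X, y, cv=3).mean()
    return _foa_search(fitness, dim=4)
```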
XGBoost is an ML model based on gradient boosting. It sequentially constructs decision trees to minimize an objective function, updating the tree weights by gradient descent optimization. The result is an ensemble of trees whose predictions are combined by weighted averaging. The number of trees, the tree depth, the learning rate, and the regularization terms are the four most important hyperparameters of the XGBoost model.
In this work, FOA is used to determine the ideal hyperparameters of the XGBoost model by searching over a subset of these values. The pseudocode for the proposed feature selection and tuning of the XGBoost model is presented above.
The first function, FOA_feature_selection(), uses FOA to iteratively select the most relevant features from a given set. Through population initialization, fitness evaluation, and iterative refinement, which includes updating the best positions, generating new locations, and updating the falcon locations, this process identifies the set of features that are critical for EAT segmentation and severity analysis. The FOA_optimize_xgb() function then uses FOA to optimize the XGBoost parameters, iterating through the defined parameter space to tune the learning rate, maximum depth, and other relevant parameters and thereby improve the model's efficiency in classifying EAT severity levels. FOA increases the computational complexity due to fractional-order calculations that require memory-intensive recursive operations; it improves feature sensitivity but requires efficient GPU acceleration to maintain practical training times.
The experimental datasets used in this work are taken from http://visual.ic.uff.br/en/cardio/ctfat/index.php and consist of data from 200 patients. The validity and repeatability of the proposed method were assessed by conducting ten-fold cross-validation experiments on the collected dataset; in each run, the dataset was divided into 70 % for training and 30 % for validation. Segmentation of this dataset faces challenges such as low contrast and artifacts, which are addressed by adaptive weighting and post-processing. The performance of the proposed segmentation architecture is compared with other segmentation models in terms of mean intersection over union (Mean IOU), mean Dice score (MDS), and Pearson correlation coefficient (PCC).
Mean IOU evaluates the intersection between the predicted segmentation and the ground truth (GT) segmentation for multiple classes or instances. It is calculated as the intersection divided by the union of the predicted and GT masks as follows:
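For a predicted mask $P$ and a GT mask $G$, the standard per-class form is

$$ \mathrm{IoU} = \frac{|P \cap G|}{|P \cup G|} = \frac{TP}{TP + FP + FN}. $$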
The Mean IOU is calculated as follows:
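$$ \text{Mean IOU} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{IoU}_c, $$

where $C$ is the number of classes (or instances) and $\mathrm{IoU}_c$ is the score of the $c$-th one.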
The Dice score (also known as F1 score) is used to evaluate the similarity between two samples. In segmentation, it measures the overlap between the predicted and GT segmentation as follows:
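$$ DSC = \frac{2\,|P \cap G|}{|P| + |G|} = \frac{2\,TP}{2\,TP + FP + FN}. $$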
The MDS is calculated as follows:
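$$ MDS = \frac{1}{N} \sum_{n=1}^{N} DSC_n, $$

where $N$ is the number of evaluated images and $DSC_n$ is the Dice score of the $n$-th one.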
The PCC estimates the linear correlation between two sets of data. It is used to analyze the relationship between predicted and GT pixel values. It can be calculated as follows:
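$$ PCC = \frac{\sum_{i} (p_i - \bar{p})(g_i - \bar{g})}{\sqrt{\sum_{i} (p_i - \bar{p})^2}\, \sqrt{\sum_{i} (g_i - \bar{g})^2}}, $$

where $p_i$ and $g_i$ are the predicted and GT pixel values and $\bar{p}$, $\bar{g}$ are their means.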
The segmentation results are shown in Fig. 3. The segmentation results correspond well with the GT images. The red color represents the epicardial fat, the green color represents the mediastinal fat, and the blue color represents the gap between the epicardial and mediastinal fat. The training and validation loss of the proposed model over epochs is shown in Fig. 4.

Segmentation result (epicardial and mediastinal fats).

Training and validation loss of the proposed model over epochs.
The performance of the proposed and existing segmentation models is given in Table 1. The fully convolutional network (FCN) shows a Mean IOU of 68.92 % and an MDS of 73.5 %, indicating reasonable but comparatively lower segmentation accuracy. The U-Net improves on this with a Mean IOU of 71.5 % and an MDS of 79.7 %. Seg-Net further improves performance, achieving a Mean IOU of 76.7 % and an MDS of 85.42 %, a significant improvement in image region delineation. The Attention U-Net reaches a Mean IOU of 80.2 % and an MDS of 88.54 %, representing remarkable progress in segmentation accuracy. The SAR-U-Net achieves a Mean IOU of 86.78 % and an MDS of 92.6 %, indicating precise segmentation of image structures. The correlation coefficients of these models range from 0.789 to 0.954. The proposed MSE-GRU-U-Net shows the best performance, with a Mean IOU of 89.5 %, an MDS of 94.3 %, and a PCC of 0.973. The results are shown graphically in Fig. 5.

Performance analysis.
Performance analysis.
| Model | Mean IOU [%] | MDS [%] | PCC |
|---|---|---|---|
| FCN | 68.92 | 73.5 | 0.789 |
| U-Net | 71.5 | 79.7 | 0.8721 |
| Seg-Net | 76.7 | 85.42 | 0.8905 |
| Attention U-Net | 80.2 | 88.54 | 0.932 |
| SAR-U-Net | 86.78 | 92.6 | 0.954 |
| MSE-GRU-U-Net | 89.5 | 94.3 | 0.973 |
The features extracted from the segmentation model are used to train the XGBoost classifier with labels of low, medium, and high severity. For a fair comparison, the FOA-XGBoost model is compared with other FOA-optimized models based on ID3 and naive Bayes (NB). The measured accuracy, specificity, precision, and recall are shown in Table 2. The FOA-ID3 model achieves 92 % accuracy, 91 % specificity, 90 % precision, and 94 % recall. The FOA-NB model achieves 95 % accuracy, 95 % specificity, 94 % precision, and 94 % recall. The FOA-XGBoost model stands out with the highest values, achieving 97 % accuracy, 96 % specificity, 97 % precision, and 98 % recall, demonstrating exceptional accuracy in categorizing severity levels.
Performance analysis.
| Metric | FOA-ID3 [%] | FOA-NB [%] | FOA-XGBoost [%] |
|---|---|---|---|
| Accuracy | 92 | 95 | 97 |
| Specificity | 91 | 95 | 96 |
| Precision | 90 | 94 | 97 |
| Recall | 94 | 94 | 98 |
In this work, a new approach to EAT segmentation and severity classification is proposed. The new architecture integrates MSE blocks with an MS-D network for accurate segmentation of fat regions. The MSE-GRU-U-Net achieves a Mean IOU of 89.5 % and an MDS of 94.3 %, together with a strong PCC of 0.973, indicating a highly reliable relationship between the predicted and GT values. For classification, the FOA-XGBoost model achieves the highest accuracy of 97 %. The proposed model is fully automated and can serve as an accurate diagnostic tool in advanced clinical practice.