Skip to main content
Have a personal or library account? Click to login
Lightweight inception-UNet with attention mechanisms for semantic segmentation Cover

Lightweight inception-UNet with attention mechanisms for semantic segmentation

Open Access
|Apr 2026

Full Article

I.
Introduction

Semantic segmentation is a fundamental but challenging task in the field of computer vision. It dissects the image at the pixel level, providing categorical information that is beneficial in many real-world applications such as autonomous driving vehicles [1], automated medical imaging [2], computational photography [3], augmented reality [4], human pose detection [5], and web-search engines [6]. Due to its importance in various fields of computer vision, many researchers have implemented the task and tried to address the numerous challenges such as occlusions, unlabeled mask, illumination, and unseen annotation issues that arise during segmentation of the images accurately in a complex environment.

With the evolution of hardware and technology, deep learning allows neural networks to be used in a much larger range of semantic segmentation problems than ever before. Deep neural networks have proven to be extremely effective for semantic segmentation, which essentially means classifying what is in each and every pixel or region of an image. This is arguably the most important piece of functionality for any image-understanding application and is used in a wide range of applications in computer vision as well as in artificial intelligence.

The applications span diverse domains, including but not limited to autonomous vehicle driving [1], color imaging [7], data classification [8], data mining [9], cognitive and computational sciences with applications such as salient object detection and classification, agriculture sciences, speech and text recognition [10], and intelligent healthcare systems [11], particularly emphasizing on intelligent medical imaging. In contrast to earlier approaches like text-on-forest and random-forest (RF) based classifiers [12], deep learning techniques have enabled more precise and significantly faster semantic segmentation. The overall contribution of the proposed model is given below:

  • The proposed lightweight architecture enhances the network’s ability to discern crucial information from the input images, enabling improved accuracy and minimal computational resources.

  • A multiscale convolution is integrated in inception-based encoder to extract multiscale features that allow the network to capture both fine-grained details and high-level context.

  • The integration of attention unit in the decoder side leverages in effective identification of region of interest (ROI) from an image.

  • The proposed model can attain superior performance in terms of accuracy and sensitivity without implementing complicated heuristics.

The proposed model underwent rigorous validation against five state-of-the-art segmentation models, namely, UNeT [13], ENet [14], SegNet [15], UNeT with ResNet-18 encoder (UNet-ResNet18) [16], and UNeT with ResNet-34 encoder (UNet-ResNet34) [17] encoder. This comprehensive comparative analysis was conducted by employing five publicly accessible datasets, namely, Autorickshaw Detection Dataset [18], Indian Driving Dataset-Lite (IDD-Lite) [19], Computed Tomography Liver (CT-Liver) [20], martial arts, dancing and sports (MADS) [21], and Oxford-IIIT Pets (PETS) [22]. The chosen datasets exemplify diverse scenarios in complex surroundings by their inherent irregularity and unpredictability. The experimental process encompassed a thorough evaluation, combining both qualitative and quantitative assessments. Seven pivotal parameters, namely, intersection over union (IoU) score, error (Err), specificity (Sp), sensitivity (Ss), accuracy (Acc), F-score, and correlation coefficient (Cc) were employed for the evaluation.

The remaining of this paper is structured as follows: Section II discusses the literature review of the semantic segmentation. Section III offers a detailed explanation of the proposed lightweight deep learning-based segmentation model. Section IV provides a comprehensive overview of the experimental setup, followed by a presentation of the experimental analysis and a detailed discussion of the results in Section V. Finally, Section VI summarizes the work drawn from this study. The following section represents the overview of the existing study related to semantic segmentation.

II.
Literature Review

Generally, deep learning-based models can be categorized into four categories, namely, contextual information-based methods, feature enhancement-based methods, recurrent neural network (RNN)-based, and deconvolution-based methods. Context-based methods deal with the crucial information in a scene, which accelerates the segmentation process. For dense prediction, Yu and Koltun [23] proposed a dilated model known as DilatedNet, which aggregates the contextual information at multiple scales without losing resolution or analyzing the rescaled images. However, the model lacks end-to-end simplification and unification. To overcome this, Liu et al. [24] designed ParseNet based on FCN32s. The model is based on enlarging the receptive field to extract global features directly. The model outputs smooth segmentation and accurate results on the PASCAL VOC 2012 test dataset. Lin et al. [25] designed Piecewise based on VGG-16, extracting the patch-wise context by integrating convolutional neural network (CNN) with CRF. In EncNet, proposed by Zhang et al. [26], an additional fully connected layer is added, incorporating a sigmoid activation function. This additional layer is used to generate individual predictions for the presence of object categories. The authors introduced the concept of object context, which enhances object information by leveraging semantic relationships between pixels. In this approach, a dense relation matrix is employed as a substitute for the binary relation matrix. Dense relation network (DRN) and context-restricted loss (CRL), as proposed by Zhang et al. [27], combined both global and local context information, resulting in improved segmentation accuracy. Surpassing the global and local context information, Zhang et al. [28] proposed a model based on the detection of co-occurrent features, which harnesses co-occurrent features to provide finely detailed representations. Furthermore, Ding et al. [29] noticed that the unique shapes and intricate arrangement of objects in images can hinder the effectiveness and efficiency of context aggregation. To overcome the same, the author suggested a model that utilizes semantic masks that vary in scale and shape for individual pixels. These semantic masks are created using paired and shape-variant convolution techniques and generate the contextual region for each pixel accordingly.

In a traditional CNN pipeline, feature-enhancement-based deep layers extract features that are rich in semantics but lack spatial features due to the process of pooling and strided convolution. Conversely, shallow layers yield features with a heightened awareness of finer details like stronger edges, and so on. The effective combination of deep and shallow layers enhances the performance of semantic segmentation. Long et al. [30] proposed a fully convolutional network (FCN) that enhances the features through skip connections between prediction features and mid-layer features and significantly increases the final resolution resulting in improvement in mean-intersection over union accuracy. Ronneberger et al. [13] emphasized the importance of skip connection and proposed UNet gaining considerable attention in the field of medical image analysis [31, 32]. Furthermore, to address the spatial information limitations in 2D-based models along the z-dimension, Li et al. [33] extended UNet with 2D-DenseUNet and 3D-DenseUNet, and incorporated a fusion block to combine the intraslice and interslice features. UNet also found extensions in other applications, including natural image segmentation with stacked UNets [34], introducing residual blocks [35] and proposing the hyper-column representation [36], which concatenates features from various CNN pipeline layers for final inferences. RefineNet, introduced by Lin et al. [37], used the long-range residual connections aiming to yield the high-level resolution prediction. The model employs a multifold refinement process that progressively enhances spatial details in feature maps. However, it is worth noting that RefineNet is relatively computationally expensive [38]. Pohlen et al. [39] proposed a full-resolution residual network that is based on the amalgamation of pixelwise accuracy with a mutiscale-based context. The coupling is achieved by separating the network into two sub-streams: the pooling preprocessing stream for high-level semantics (low resolution) and the residual preprocessing stream for detailed preservation (high resolution), which collaborates with features obtained from the latter. The models mentioned above primarily involve designing networks to connect and fuse different feature types to achieve feature enhancement. Collectively, these methods predominantly revolve around the design of networks aimed at connecting and amalgamating different feature types to realize feature enhancement.

RNNs have demonstrated promising capabilities in processing sequential data, including text and speech [40, 41], and have found applications in semantic segmentation. Pinheiro and Collobert [42] introduced the recurrent convolutional neural network (RCNN) aimed to train the model end-to-end on raw pixel data, which enables the system to effectively capture intricate spatial relationships at minimal computational cost. Poudel et al. [43] proposed recurrent fully convolutional networks (RFCNs) for segmenting objects in MRI sequences, showcasing the synergy between FCN and RNN. Byeon et al. [44] presented a model foccusing on the challenge of pixel-level segmentation and classification of scene images, employing a purely learning-based methodology that harnesses long short-term memory (LSTM) RNN. Visin et al. [45] introduced an efficient, flexible, architecture ReSeg, built upon the ReNet model. Each layer comprises four RNNs that traverse the image both horizontally and vertically in both directions. Extending the concept of a two-dimensional-RNN, Shuai et al. [46] developed the DAG-RNN method, which exhibits denser connections compared with 2D-RNN, addressing the issue of fading dependencies along long-range paths. To mitigate the problem of fading dependencies and enhancement of long-range connections, Fan and Ling [47] proposed a dense recurrent neural network (DRNN). Dense RNNs, characterized by their interconnectedness with every pair of image elements, can encapsulate more intricate contextual dependencies for each image unit. This model employs an attention mechanism [48] to assign greater weight to beneficial dependencies while diminishing the significance of irrelevant ones.

The inception of deconvolution-based semantic segmentation marked a significant milestone in computer vision research, with its origin attributed to Noh et al. [49], under the moniker “DeconvNet.” This method is hinged upon the symmetrical encoder–decoder architectural paradigm. Within the encoder component, DeconvNet methodically extracts semantic features, albeit at a reduced resolution due to the application of max pooling. In the decoder module, an unpooling operator effectively employs the saved indices to upsample the low-resolution feature map, thereby transforming it into a high-resolution counterpart. For performing pixelwise semantic segmentation, Badrinarayan et al. [50] designed an efficient model known as SegNet. Furthermore, the decoder employs the pooling indices to execute the nonlinear upsampling resulting in dense feature maps. Fourure et al. [51] proposed “GridNet” which adopted a grid pattern, facilitating the operation of multiple interconnected streams operating at various spatial resolutions. Kendall et al. [52] expanded the SegNet framework into a probabilistic-based pixelwise segmentation model known as “Bayesian-SegNet.” The model can predict class labels for individual pixels while also quantifying the model’s uncertainty. Furthermore, to facilitate the smooth flow of information and gradient propagation across the network, Fu et al. [53] introduced the Stacked Deconvolution Network (SDN). The proposed model ensured the precise localization information recovery due to its sequential layer structure.

From the literature, it has been witnessed that feature-enhancement-based deep learning models can indeed yield benefits in terms of improving feature representations and ultimately boosting the performance of downstream tasks. However, it is crucial to remain mindful of the accompanying drawbacks and factors to consider. Therefore, a meticulous evaluation of these methods is imperative, considering the trade-offs and their appropriateness for specific datasets and problem contexts. In a similar vein, context-based deep learning methods have showcased impressive capabilities across a spectrum of applications, ranging from natural language understanding to computer vision. A mindful approach can be followed while employing context-based methods in real-world applications as they struggle to deal with sequential data and have high inference time. Furthermore, deconvolution-based methods output high-resolution masks but suffer from slow convergence rate and low interpretability where the segmentation of the object depends upon the global contextual information, hence, making them trivial to be used in the decision-making process.

To overcome the various limitations of deep learning-based models, this paper proposed a new attention-inception-based lightweight modified UNet model (LWA-MoDUNet), specifically developed for semantic segmentation of objects within diverse images of scenes characterized by congested and unstructured patterns.

III.
Proposed Model

In this paper, the proposed model draws inspiration from the family of UNet architectures. UNet is a fully CNN architecture meticulously designed for the specific purpose of semantic segmentation. The architecture consists of an encoding phase and a decoding phase that are symmetrically connected. The central part of UNet is the bottleneck, which serves as a bridge between the encoder and the decoder. UNet incorporates skip connections that concatenate feature maps from the contracting path to the corresponding layers in the expansive path. These connections help to preserve spatial information and enable precise segmentation. However, UNet struggles with extremely large images or cases where objects have complex shapes and interactions. Furthermore, UNet may encounter challenges in capturing intricate image features that could potentially enhance segmentation accuracy, particularly when dealing with images portraying road traffic scenarios marked by congestion and unstructured traffic patterns. As a result, this research introduces the inception and attention modules in the U-Net architecture and presents a new variant termed “lightweight attention-based modified-UNet” (LWA-ModUNet). Figure 1 represents the architecture of the proposed LWA-ModUNet model. The inception module at the encoder side helps in extracting multiscale features and captures fine-grained details with high-level context while the attention module captures the contextual information by considering relationships between pixels or regions within an image. The objective is to enhance the network’s efficacy in accurately segmenting images, particularly in scenarios involving complex patterns. Both the encoder and decoder parts of the proposed model are described in the following sections.

Figure 1:

Framework of the proposed LWA-MoDUNet.

a.
Encoding phase

The encoding phase extracts features and reduces spatial dimensions. For this, the proposed LWA-MoDUNet employs only two encoding blocks for the encoding path. Each encoding block performs a comprehensive set of operations by incorporating multiple convolution layers, and max-pooling layers, with an inception block, and a batch normalization. Figure 2 illustrates the structure of an encoder block.

Figure 2:

Architecture of the encoder block of the proposed LWA-ModUNet.

From Figure 2, it is seen that the input data undergoes processing through three paths. The first path includes data processing through a 3 × 3 convolution layer and then through the inception module. The inception block consists of multiple kernels that capture features at different receptive fields. This ensures that the model has a better context than the standard convolution layer. The complete block of inception block is depicted in Figure 3 The features extracted from the inception module are passed to the decoding phase as well as within the same encoder block which enables the network to efficiently differentiate between the multiple-object boundaries and context, prevents information loss and mitigates the noise and occlusions in an image. In the second path, the 1 × 1 convolution layer, batch normalization, and 2 × 2 max-pooling process the input data. The processed data are included with extracted features from the inception module. This encourages the model to ensure that the learned features are more enriched than the original input in varied conditions for better scene understanding. Finally, the third path employs 2 × 2 max-pooling to downsample the input data and concatenate with the features from the second path. This acts as an input reinforcement to the next block and also aids the further layers to always have a comparative sense between the learned and the original features. This architecture of the encoder block enhances the network’s ability to discern crucial information from input images and distinguish between different objects in better scene understanding.

Figure 3:

Architecture of the inception block of the proposed LWA-ModUNet [54].

b.
Decoding phase

The decoding phase upsamples the feature space to produce a high-resolution segmentation mask. In the proposed architecture, only two decoder blocks are incorporated in the decoding phase. Each decoder block employs an attention module, a deconvolution layer, and convolution layers to perform the feature upsampling. The illustrative structure of the decoder block is depicted in Figure 4. In the decoder block, the features from the encoding phase are passed to the attention module to excite the important features at the channel level. The skip connections used in UNet provides spatial information from the decoder to the corresponding encoder but they bring along the poor feature representation. Hence, the modified inception has been integrated to enhance UNet with attention module as depicted in Figure 4 to reduce the complexity during feature extraction in an image.

Figure 4:

Decoder architecture of the proposed LWA-MoDUnet.

The attention block [55], as depicted in Figure 5, focuses on attention coefficients βi ∈ [0, 1], which identifies the significant features and then assign the weights to them. The attention coefficient is element wise multiplied with generated input feature maps, that is, mil m_i^l , c=mil c = m_i^l , c=βil c\; \cdot = \beta _i^l , to produce the result from attention gates. In a default configuration, the value of single scalar attention is determined for each pixel vector milRNl m_i^l \in {R^{Nl}} in L layer for N number of feature maps. The gating vector vi ∈ RNv that contains the contextual information significantly reduces the influence of less relevant features. The gating coefficients are obtained by using additive attention, which has been equated in Eqs (1) and (2): (1) qattl=ΨTσ1WTmil+WTvi+bv+bψ q_{att}^l = {\Psi ^T}\;{\sigma _1}\left( {{W^T}\;m_i^l + {W^T}\;{v_i} + {b_v}} \right) + {b_\psi } (2) βl,i=σ2qattlmil,vi;Θatt {\beta ^l},i = {\sigma _2}\left( {q_{{\rm{att}}}^l\left( {m_i^l,{v_i};{\Theta _{{\rm{att}}}}} \right)} \right) where σ2(mi,c) denotes the sigmoid activation, Θatt depicts the parameters used to characterize attention gates that include Wm ∈ RNl × Nint and Wv ∈ RNl×Nint. For input tenors, a channel-wise convolution of 1 × 1 × 1 has been performed to calculate the linear transformations. bψ ∈ R and bv ∈ RNint denote the bias used.

Figure 5:

Diagram of additive attention gate [55].

The sequential use of softmax activation [56], used in attention block, is to normalize the attention coefficients. Due to the implementation of sigmoid activation, the proposed model has shown better convergence during training. Within the architecture, skip connections are utilized to emphasize important features. The updating rule for convolution parameters from the lower layers (l − 1) is shown in Eq. (3): (3) m^lβlfml1;ϕl1fml1;ϕl1βli=i=βl+imlϕl1ϕl1iϕl1ϕl1i \matrix{ {\partial \left( {{{\hat m}^l}} \right)} \hfill \cr {\partial \left( {{\beta ^l}f\left( {{m^{l - 1}};{\phi ^{l - 1}}} \right)} \right)} \hfill \cr {\partial \left( {f\left( {{m^{l - 1}};{\phi ^{l - 1}}} \right)} \right)} \hfill \cr {\partial \left( {{\beta ^l}} \right)} \hfill \cr {i = i = {\beta ^l} + i\;{m^l}} \hfill \cr {\partial \left( {{\phi ^{l - 1}}} \right)} \hfill \cr {\partial \left( {{\phi ^{l - 1}}} \right)} \hfill \cr {i\;\partial \left( {{\phi ^{l - 1}}} \right)} \hfill \cr {\partial {{\left( {{\phi ^{l - 1}}} \right)}^i}} \hfill \cr }

This focuses on capturing the relevant contextual information between the regions of an image. Furthermore, the attention module can effectively fuse features from multiple scales within an image. This benefits in capturing both local and global contexts, making it suitable for objects of varying sizes, which is particularly valuable for distinguishing objects from their surroundings and handling complex scenes. Consecutively, a couple of standard convolution layers are followed by a deconvolution layer in a decoder block. This advantage in suppressing the noise in the predicted segmentation mask and pixelwise prediction is for accurate segmentation of complex images. Finally, the decoder block has a 1 × 1 convolution layer that processes the features from the skip connections and performs concatenation with the upsampled features by the deconvolution layer. This helps the model to retain the coarse features from the encoder side. At the end of the decoding phase, a 1 × 1 convolution layer is applied to generate the output segmentation mask. The flowchart of the proposed model is shown in Figure 6, depicting the generation of the segmentation mask.

Figure 6:

Flowchart of the proposed LWA-MoDUNet depicting the encoding and decoding phases with inception and attention modules.

IV.
Experimental Setup
a.
Considered datasets

To assess the performance of the proposed semantic segmentation method, five publicly accessible datasets, namely, Autorickshaw Detection Dataset [18], IDD-Lite [19], CT-Liver [20], MADS [21], and PETS [22], are used. The used datasets contain objects within diverse images of scenes characterized by congested and unstructured patterns, which make them ideal for the assessment. The Autorickshaw Detection Dataset [18] consists of 1,000 images and has been publicly released by the Indian Institute of Information Technology, Hyderabad, India, in conjunction with the Autorickshaw Detection Challenge. The IDD-Lite [19] comprises 1,404, 204, and 408 training, validation, and testing images, respectively. This dataset is organized into seven distinct classes, namely, drivable regions, non-drivable areas, living entities, vehicles, roadside objects, distant objects, and the sky. The PETS [22] dataset consists of 37 sets of various breeds of cats and dogs in varied lighting, poses, and background conditions. Each set of breeds contains approximately 200 images, resulting in a total of 7,349 images that have been divided into training, testing, and validation datasets. From 7,349 images, 3,312 have been taken for training, 368 for validation, and 3,669 for testing purposes. The CT-Liver [20] dataset consists of 232 images in which 152 images have been considered for training, 35 for testing, and 45 for validation purposes. The MADS [21] dataset is a collection of human movement images of martial arts, dance, and sports movements in varied angles for pose detection. For training, validation, and testing of segmentation, 861, 152, and 179, respectively, images have been taken. Figure 7 depicts the representative images from each dataset along with the respective ground-truth (GT) images.

Figure 7:

Sample images of considered datasets (Row 1) along with their corresponding GT (Row 2). GT, ground truth; IDD-Lite, Indian driving dataset-lite; MADS, martial arts, dancing and sports.

The quality of the generated segmentation masks for the abovementioned datasets has been comprehensively evaluated against seven performance parameters, namely, IoU score, error (Err), specificity (Sp), sensitivity (Ss), accuracy (Acc), correlation coefficient (Cc), and F-score.

b.
Implementation details

During training, each image is first resized to p × q, where p = 256, q = 256, and randomly flipped horizontally, and croppped m × n, where m = 128 and n = 128. The input images were subjected to different augmentations to avoid distortions and over-fitting such as cropping, zoom-in/out, rotation, lightning, and affine (scaling and translation). For the data normalization, each image is subtracted from the mean and divided by standard deviation. Each input image is resized to 128 × 128 × 3 dimensions, where 3 represents the RGB channel. For the autorickshaw, PETS, MADS, and CT-Liver datasets, the loss function employed is the flattened binary cross-entropy loss [57], also known as flattened sigmoid cross-entropy loss. As the IDD-Lite is a multi-class dataset, cross-entropy loss [58] is utilized. In LWA-MoDUNet, since both the instances of feature and target networks are convolutional layers, inspired from the idea mentioned above to rectify the gradient vanishing issue, we have used Xavier initialization [59] for convolutional layers to speed up the training process. Surprisingly, no existing backbone is used as all of the model is trained from scratch by training over 200 epochs. The Adam optimizer [60] with early stopping and step learning rate scheduler with initial learning rate as 1E − 3 is used. All other hyperparameters are considered as default. At the output, bilinear interpolation is employed to resize the predicted saliency maps. The model is implemented in PyTorch 2.0.0 and trained on an Intel Xeon 2.2 GHz CPU (24GB RAM) and an NVIDIA A100-SXM4 (40 GB memory), which are a Hexa-core, 12-thread PC.

c.
Ablation study

To explore the robustness of the proposed model, an ablation study [61] is conducted through a quantitative analysis of the proposed model with and without inception and attention modules. The hyperparameter settings of the proposed model are detailed in Section “Implementation details”. Experimental evaluation is performed on the autorickshaw dataset by considering four metrics, that is, IoU score, Error, Accuracy, and F-score. The experimental results are depicted in Table 1.

Table 1:

Ablation study of LWA-MoDUNet on autorickshaw dataset

MeasureLWA-MoDUNetW/o inception moduleW/o attention
IoU score0.87680.67290.8173
Error0.06630.14520.0927
Accuracy93.369178.796686.4217
F-score0.93440.79130.8786

IoU, intersection over union.

It can be observed from Table 1 that the inclusion of the inception module has significantly enhanced the performance of the LWA-MoDUNet (proposed model) as compared with its absence. Furthermore, the model has attained superior performance with the use of the attention module as shown in Table 1. Therefore, this study supports the architectural advantages of the proposed model and is evident from the efficient results.

V.
Experimental Analysis

In this paper, a comprehensive evaluation of the proposed LWA-MoDUNet is conducted both qualitatively and quantitatively by performing a comparison with SOTA models, including U-Net, UNet-ResNet18, UNet-ResNet34, SegNet, and ENet.

For qualitative assessment, Figures 8–12 showcase the segmentation outcomes of these methods on images sampled from PETS, MADS, autorickshaw, IDD-Lite, and CT-Liver datasets, respectively. In the visual representations, the first two columns correspond to an unaltered image and the respective GT. For a thorough quantitative assessment, seven crucial parameters, namely, IoU score, sensitivity, accuracy, error, specificity, correlation coefficient, and F-score, have been considered, and generated values are outlined in Tables 2–6 corresponding to the autorickshaw, IDD Lite, PETS, MADS, and CT-Liver datasets, respectively. In Tables 2–6, the numerals highlighted in bold signify the most optimal results. The IoU score served as the primary metric for evaluating segmentation accuracy. Additionally, to visualize the variations in IoU scores during the 200 training and validation epochs for the models under consideration, line plots are presented in Figures 13–17 for the autorickshaw, IDD-Lite, PETS, MADS, CT-Liver datasets, respectively. Furthermore, this paper conducts a nonparametric Wilcoxon signed-rank test [62] to statistically validate the segmentation results of the LWA-MoDUNet with the compared SOTA models. Extending the analysis further, we include Figure 18 showing the trade-off between type 1 and type 2 errors for the proposed and compared state-of-the-art models. Moreover, to evaluate the complexity of LWA-ModUNet, its parameter count and inference time were analyzed in comparison to five state-of-the-art deep learning models.

Figure 8:

Generated output masks by UNet, E-Net, SegNet, UNet with ResNet-18 encoder, UNet with ResNet-34 encoder, and LWA-ModUNet models on exemplary images of autorickshaw dataset. GT, ground truth.

Figure 9:

Generated output masks by UNet, E-Net, SegNet, UNet with ResNet-18 encoder, UNet with ResNet-34 encoder, and LWA-ModUNet on exemplary images of IDD Lite dataset. GT, ground truth; IDD-Lite, Indian driving dataset-lite.

Figure 10:

Generated output masks by UNet, E-Net, SegNet, UNet with ResNet-18 encoder, UNet with ResNet-34 encoder, and LWA-ModUNet models on exemplary images of PETS dataset. GT, ground truth.

Figure 11:

Generated output masks by UNet, E-Net, SegNet, UNet with ResNet-18 encoder, UNet with ResNet-34 encoder, and LWA-ModUNet models on exemplary images of MADS dataset. GT, ground truth; MADS, martial arts, dancing and sports.

Figure 12:

Generated output masks by UNet, E-Net, SegNet, UNet with ResNet-18 encoder, UNet with ResNet-34 encoder, and LWA-ModUNet models on exemplary images of CT-Liver dataset. GT, ground truth.

Table 2:

Comparison of key performance indicators of LWA-MoDUNet and SOTA models on autorickshaw dataset

MeasuresUNetUNet-ResNet34UNet-ResNet18E-NetSegNetLWA-MoDUNet
IoU score0.76650.84590.83560.84800.75920.8768
Err0.13260.08490.09120.08570.14030.0663
Acc86.741291.506190.883791.434985.967793.3691
Sp0.86970.93030.92410.95140.87890.9429
Ss0.86510.90090.89460.88290.84230.9248
F-score0.86780.91650.91040.91770.86310.9344
Cc0.73480.83060.81820.83150.72030.8676

Acc, accuracy; Cc, correlation coefficient; Err, error; IoU, intersection over union; Sp, specificity; Ss, sensitivity.

Table 3:

Comparison of key performance indicators of LWA-MoDUNet and compared models on IDD-Lite dataset

MeasuresUNetUNet-ResNet34UNet-ResNet18E-NetSegNetLWA-MoDUNet
IoU score0.60310.61740.59810.5660.30760.9283
Acc0.92030.93980.93560.93210.89710.9616
Err0.07160.06010.06430.06790.10280.00435
Sp0.95000.94840.94690.93950.89750.9328
Ss0.85340.86170.84720.86690.88960.9436
F-score0.70560.76350.74850.72290.47050.8909
Cc0.67940.73710.71860.69790.49650.8153

Acc, accuracy; Cc, correlation coefficient; Err, error; IDD-Lite, Indian driving dataset-lite; IoU, intersection over union; Sp, specificity; Ss, sensitivity.

Table 4:

Comparison of key performance indicators of LWA-MoDUNet and compared models on PETS dataset

MeasuresUNetUNet-ResNet34UNet-ResNet18E-NetSegNetLWA-MoDUNet
IoU score0.76650.84590.83560.84800.75920.8768
Err0.13260.08490.09120.08570.14030.0663
Acc86.741291.506190.883791.434985.967793.3691
Sp0.86970.93030.92410.95140.87890.9429
Ss0.86510.90090.89460.88290.84230.9248
F-score0.86780.91650.91040.91770.86310.9344
Cc0.73480.83060.81820.83150.72030.8676

Acc, accuracy; Cc, correlation coefficient; Err, error; IoU, intersection over union; Sp, specificity; Ss, sensitivity.

Table 5:

Comparison of key performance indicators of LWA-MoDUNet and compared models on MADS dataset

MeasuresUNetUNet-ResNet34UNet-ResNet18E-NetSegNetLWA-MoDUNet
IoU score0.89680.94480.90840.96090.97210.9768
Err0.05450.02900.04960.02010.01420.0118
Acc94.552097.104295.042897.991998.578198.8225
Sp0.94640.99050.98180.98720.99160.9934
Ss0.94470.95310.92290.97280.98010.9832
F-score0.94560.97160.95200.98010.98590.9883
Cc0.89100.94280.90280.95990.97160.9765

Acc, accuracy; Cc, correlation coefficient; Err, error; IoU, intersection over union; MADS, martial arts, dancing and sports; Sp, specificity; Ss, sensitivity.

Table 6:

Comparison of key performance indicators of LWA-MoDUNet and compared models on CT-Liver dataset

MeasuresUNetUNet-ResNet34UNet-ResNet18E-NetSegNetLWA-MoDUNet
IoU score0.88390.94450.95650.93490.94950.9650
Err0.06160.02870.02230.03380.02620.0178
Acc93.840097.130897.768696.616797.380898.2162
Sp0.93840.97580.98110.97190.98450.9855
Ss0.93840.96690.97430.96050.96350.9788
F-score0.93840.97140.97780.96640.97410.9822
Cc0.87680.94270.95540.93240.94780.9643

Acc, accuracy; Cc, correlation coefficient; Err, error; IoU, intersection over union; Sp, specificity; Ss, sensitivity.

a.
Discussion of results

In this section, detailed analysis of the results obtained are presented. Examining Figure 8, it becomes apparent that the comparative SOTA, especially UNet with ResNet-18 encoder and E-Net, struggled in the segmentation of the input images, exhibiting noise, poor-boundary definitions, and gaps. While LWA-MoDUNet can generate similar segmentation masks that best match the GT. The resultant segmentation masks have sharp and smooth boundaries with minimal noise. In Figure 9, noticeable discrepancies were observed in the segmented results of E-Net and Seg-Net. Conversely, LWA-MoDUNet consistently generated outputs closely resembling their respective GT images for the examined figures. Furthermore, the output images of the LWA-MoDU-Net demonstrated comparatively lower levels of distortion. Similarly, from Figure 10, it can be observed that the results obtained from the LWA-MoDUNet are approximately similar to the GT of the respective images under consideration. The segmentation masks have accurate boundaries even in the cluttered backgrounds, while ResNet18 and ENet models generated segmentation masks that were partially matching with the GT of the original image. Conversely, UNet and SegNet underperformed by generating distorted output masks. On the contrary, considering Figure 11, it can be witnessed that LWA-ModUNet and UNet with ResNet34 have achieved competitive results. The segmented outputs of the earlier one are better than the later one. While UNet underperformed and outputs the distorted masks. Moreover, Figure 12 shows that the considered models have shown average results. The result suggests that the masks produced by LWA-ModUNet are considerably similar to their original GT images. The compared models have generated distorted images and hence cannot segment the image effectively.

Table 2 presents the experimental results on the autorickshaw dataset. UNet with ResNet18/ResNet34 encoder demonstrate similar performance across all the parameters. However, their performance falls behind that of the other models under consideration, as well as the proposed methods for the autorickshaw dataset. Notably, LWA-MoDUNet outperformed comparative models, achieving the highest IoU-score, that is, 0.8768. Furthermore, LWA-MoDUNet demonstrated superiority across all considered parameters, that is, error as 0.0663, accuracy as 0.9336, specificity as 0.9429, sensitivity as 0.9248, F-score as 0.9344, and correlation coefficient as 0.8676. Similarly, upon analyzing the results for the IDD Lite dataset in Table 3, it is evident that our proposed model exhibited exceptional performance, surpassing the state-of-the-art comparative models, achieving the highest IoU score of 0.9283. The second highest IoU of 0.6174 is achieved by UNet-ResNet34, which shows that the proposed model segments the complex images efficiently.

Similarly, the performance of the compared models and proposed model on the PETS dataset has been illustrated in Table 4. The proposed LWA-ModUNet performed better than the compared SOTA models, indicating the highest IoU score of 0.8768. On other parameters, LWA-ModUNet performed commendably achieving the highest values for accuracy, sensitivity, specificity, F-score, and correlation-coefficiency. According to the result, LWA-MoDUNet is most suitable for accurately segmenting objects in complex images, achieving and maintaining a better balance between precision and recall. Contrarily, SegNet consistently delivers lower performances on all parameters across the metrics, thus making the model a crucial choice for the semantic segmentation of complex images. By contrast, UNet with ResNet34 and UNet with ReSNet18 performed better but remained behind LWA-MoDUNet reporting an IoU score of 0.8459 and 0.8356, respectively.

The performance evaluation of the models on the MADS dataset is illustrated in Table 5. The results depict that the performance of SegNet is comparable to the LWA-MoDUNet in terms of all the considered parameters. In terms of IoU score, LWA-MoDUNet reported 0.9768, which is the highest while SegNet achieved 0.9721. UNet underperformed by achieving the lowest IoU score value of 0.8968. Table 6 illustrates the segmentation results of the compared models on the CT-Liver dataset. The table suggests that the proposed model marked the highest parametric values across the metrics, achieving an IoU score of 0.9651, accuracy of 0.9821, correlation coefficient of 0.9643, specificity of 0.9855, F-score of 0.9822, and sensitivity of 0.9788, respectively, which are the highest. UNet with ResNet34 and UNet with ResNet-18 fall behind LWA-MoDUNet with an IoU score of 0.9445 and 0.9565, respectively. On the contrary, SegNet, and Enet fall slightly behind the ResNet encoders resulting in IoU score of 0.9495 and 0.9349, respectively. Moreover, UNet underperformed in the segmentation of image reporting the lowest IoU score of 0.8839. Hence, based on the segmentation accuracy (IoU score), it can be confidently affirmed that our proposed model represents a substantial improvement, attributable to its enhanced capacity to extract deeper features.

Figures 13–17 demonstrate that LWA-MoDUNet attains the best IoU score, exhibiting a substantial margin of superiority over the compared models for all five compared datasets. Furthermore, for visual illustrations, the graphs for the models’ performance at the validation phase provide unequivocal evidence that the proposed method consistently outperforms throughout the entire span of epochs considered. It is worth noting that the proposed model maintains its supremacy in terms of IoU score and outperforms the compared state-of-the-art-methods.

Figure 13:

Comparative analysis of IoU score on training and validation images taken from autorickshaw dataset. IoU, intersection over union.

Figure 14:

Comparative analysis of IoU score on training and validation images taken from IDD-Lite dataset. IDD-Lite, Indian driving dataset-lite; IoU, intersection over union.

Figure 15:

Comparative analysis of IoU score on training and validation images taken from PETS dataset. IoU, intersection over union.

Figure 16:

Comparative analysis of IoU score on training and validation images taken from MADS dataset. IoU, intersection over union; MADS, martial arts, dancing and sports.

Figure 17:

Comparative analysis of IoU score on training and validation images taken from CT-Liver dataset. IoU, intersection over union.

To validate, the same Wilcoxon rank sum test has been conducted by using the IoU scores of the two models across varied sample sizes, that is, 30 and 50. If the p-value is lower than the significance level (α = 0.05), then the null hypothesis (i.e., both models are statistically similar) is rejected, and the alternative hypothesis (i.e., the proposed model is statistically superior to the compared model) is accepted. Table 7 depicts the p-values of the proposed model in comparison to other considered models. Notably, it is clearly evident from the Table 7 that the p-value of the compared models is considerably less than the α. Furthermore, it can be observed in Table 7 that the p-value is indirectly proportional to the sample size. Therefore, it is pertinent to state that the proposed model performs statistically better segmentation in comparison to the compared models.

Table 7:

Statistical performance analysis of LWA-MoDUNeT against other compared SOTA models

ModelsSamples size = 30 p-valueSamples size = 50 p-value
UNet and LWA-MoDUNet0.0029 × E−053.54 × E−09
ResNet34 and LWA-ModUNet8.49 × E−047.28 × E−06
ENet and LWA-MoDUNet3.04 × E−029.51 × E−03
SegNet and LWA-ModUNet4.85 × E−026.39 × E−04

LWA-MoDUNet, lightweight modified UNet model.

Additionally, the results in Figure 18 clearly show that LWA-ModUNet yields fewer Type 1 errors than the compared state-of-the-art models. On the contrary, the proposed model depicted comparable Type 2 errors on PETS and IDD datasets but lower errors on the remaining datasets as compared with the state-of-the-art models. As compared -with the UNet and ResNet-based models LWA-MoDUNet raises less false alarms.

Figure 18:

Model type 1 and type 2 errors comparison across datasets. IDD-Lite, Indian driving dataset-lite; MADS, martial arts, dancing and sports.

Furthermore, this study analyzed the complexity of these models in terms of parameters and inference time. As illustrated in Table 8, LWA-ModUNet has significantly fewer parameters among the considered models. On the contrary, DeconvNet exhibits the highest number of parameters, along with the training and inference times. The inference time for ResUNet is longer than that for LWA-ModUNet, indicating the drawbacks of deep residual learning. Furthermore, U-Net has nearly six times more parameters than LWA-ModUNet despite being based on the UNet architecture. When compared with FCN-8s and SegNet, the inference time of LWA-ModUNet is comparatively less than both. The simpler structure of LWA-ModUNet has contributed to reduced execution time. With improved accuracy and minimal computational resources, LWA-ModUNet demonstrates superior performance. Hence, LWA-ModUNet is a computationally lightweight model, making it desirable for real-time applications.

Table 8:

Comparative analysis of computational complexity

ModelMeasuresInference time (ms/image)
FCN-8s [28]134.2786.1
SegNet [29]29.4660.7
DeconvNet [31]251.84214.3
U-Net [30]31.0347.2
ResUNet [51]8.1055.8
LWA-ModUNet5.1736.4
VI.
Conclusion

In this paper, a lightweight attention-based modified UNet (LWA-MoDUNet) has been proposed for semantic segmentation. The semantic segmentation process strongly relies on the attention model, integrating it with the inception module and deconvolution. The attention module is a component through which we were able to dynamically highlight crucial regions in complex environments such as unstructured driving scenarios, medical images, various human poses, and pets, hence allowing the model to attend the useful features and suppress redundant or noisy information. The purpose of this selective attention is to allow the model to have more fine-grained details and complex patterns essential for the precise segmentation of the scene. The object or region to be segmented in the image is delineated optimally (with improvements) by the proposed model. To sum up, the inception-based encoder integrated with UNet in conjunction with attention mechanism at the decoder side is a lightweight and effective architecture for semantic segmentation in sophisticated diverse images. The experimental results presented here establish that the model can attain the state-of-the-art performance while making a substantial leap toward generalizing the object of interest in cluttered scenes in a more robust manner. In the future, it can be proposed to design a network with a larger and denser receptive field in LWA-MoDUNet. The model is implemented for semantic segmentation, though it can be further improved and modified for multi-class segmentation. In addition, the inclusion of some effective postprocessing techniques to analyze the performance of the model.

Language: English
Submitted on: Jul 11, 2025
Published on: Apr 10, 2026
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Twinkle Tiwari, Mukesh Saraswat, published by Macquarie University, Australia
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.