
Improvement of Remote Sensing Target Tracking Method Based on Deep Learning

By: Xuhao Wang and Long Ma
Open Access | Jun 2025

Figures & Tables

Figure 1.

Remote sensing images. The left image is the original, which contains a large amount of information. The right image is a cropped portion of the left image; even when magnified many times, it remains rich in detail.

Figure 2.

Network structure. The proposed network is organized into three main components: the Backbone, the ResSwinT module, and the regression head. The Backbone is built on ResNet-50, enhanced with C3Minus and CA modules to strengthen feature extraction. The network fuses the outputs of three distinct convolutional layers with the output of the CA module, and passes the fused features to the ResSwinT module, which produces hierarchical feature representations. Finally, the ResSwinT output is fed to the regression head, which localizes the target object in the image.

Figure 3.

C3Minus network structure. The network consists of three convolution layers and one BottleNeck layer. ConvBN denotes a convolution followed by Batch Normalization and an activation function, and Concat denotes a shortcut (concatenation) connection.

Figure 4.

CA network structure. The module performs average pooling along the horizontal and vertical directions, encodes the pooled spatial information with a transform, and finally fuses the spatial information back by weighting it onto the channels, strengthening the network's overall spatial awareness.
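The pooling-and-weighting scheme described in the caption can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Wh`/`Ww` matrices stand in for the module's learned transforms, and sigmoid gating is assumed for the channel weighting.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def coordinate_attention(x, Wh, Ww):
    """Simplified coordinate-attention weighting (sketch).

    x      : feature map of shape (C, H, W)
    Wh, Ww : (C, C) matrices standing in for the module's
             learned encoding transforms (hypothetical names).
    """
    C, H, W = x.shape
    # Average-pool along the horizontal axis -> one descriptor per row
    pool_h = x.mean(axis=2)           # (C, H)
    # Average-pool along the vertical axis -> one descriptor per column
    pool_w = x.mean(axis=1)           # (C, W)
    # Encode the pooled descriptors and squash to (0, 1) gates
    att_h = sigmoid(Wh @ pool_h)      # (C, H)
    att_w = sigmoid(Ww @ pool_w)      # (C, W)
    # Weight each channel by its row and column attention maps
    return x * att_h[:, :, None] * att_w[:, None, :]
```

Because the two attention maps broadcast along complementary axes, every spatial position is re-weighted by both its row context and its column context, which is what gives the module its directional spatial sensitivity.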

Figure 5.

ResSwinT network structure. The overall structure uses a ResNet module as its basis and adds a Swin Transformer layer to further extract and fuse image features.

Figure 6.

Depth-wise Cross Correlation
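Depth-wise cross correlation slides the template features over the search-region features channel by channel, so each channel produces its own response map. A minimal NumPy sketch of the operation (loop-based for clarity; trackers in practice implement this as a grouped convolution):

```python
import numpy as np

def depthwise_xcorr(search, kernel):
    """Depth-wise cross correlation: each channel of the template
    (kernel) is correlated only with the matching channel of the
    search-region features.

    search : (C, Hs, Ws) search-region feature map
    kernel : (C, Hk, Wk) template feature map, Hk <= Hs, Wk <= Ws
    returns: (C, Hs-Hk+1, Ws-Wk+1) per-channel response map
    """
    C, Hs, Ws = search.shape
    _, Hk, Wk = kernel.shape
    Ho, Wo = Hs - Hk + 1, Ws - Wk + 1
    out = np.empty((C, Ho, Wo))
    for c in range(C):
        for i in range(Ho):
            for j in range(Wo):
                # Inner product of the template channel with the
                # matching window of the search-region channel
                out[c, i, j] = np.sum(search[c, i:i+Hk, j:j+Wk] * kernel[c])
    return out
```

Keeping the channels separate (rather than summing over them, as ordinary cross correlation would) preserves per-channel similarity information for the downstream regression head.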

Figure 7.

Experimental results. The red boxes are the model's predictions and the green boxes are the ground-truth bounding boxes.

Experimental Environment

Experimental Environment | Version
CPU                      | Intel Xeon E5-2698
GPU                      | NVIDIA Tesla V100 32G
Language                 | Python 3.8
Framework                | PyTorch

The success rate and precision of this method are compared with SOTA methods. The red values in the table are the highest, and the green values are the second highest.

Models    | Years | Precision | Success
SiamRPN   | 2018  | 0.753     | 0.342
SiamRPN++ | 2018  | 0.435     | 0.261
SiamMask  | 2019  | 0.569     | 0.278
SiamBAN   | 2020  | 0.784     | 0.497
SiamCar   | 2022  | 0.769     | 0.502
Ours      | -     | 0.803     | 0.549
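Precision and success in the table follow the usual single-object-tracking definitions: precision is the fraction of frames whose predicted box centre falls within a pixel threshold of the ground-truth centre, and success is the fraction of frames whose predicted box overlaps the ground truth above an IoU threshold. A NumPy sketch of both metrics (the 20-pixel and 0.5 thresholds below are common benchmark defaults, assumed here rather than taken from the paper):

```python
import numpy as np

def center_precision(pred, gt, thresh=20.0):
    """Fraction of frames whose predicted box centre lies within
    `thresh` pixels of the ground-truth centre.
    Boxes are (x, y, w, h) arrays of shape (N, 4)."""
    pc = pred[:, :2] + pred[:, 2:] / 2.0   # predicted centres
    gc = gt[:, :2] + gt[:, 2:] / 2.0       # ground-truth centres
    dist = np.linalg.norm(pc - gc, axis=1)
    return float(np.mean(dist <= thresh))

def success_rate(pred, gt, thresh=0.5):
    """Fraction of frames whose predicted box overlaps the ground
    truth with IoU at or above `thresh`."""
    # Intersection rectangle of each predicted/ground-truth pair
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    iou = inter / union
    return float(np.mean(iou >= thresh))
```

Benchmarks typically sweep the threshold and report the area under the resulting curve; the fixed-threshold version above captures the core computation.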
Language: English
Page range: 1 - 10
Published on: Jun 13, 2025
In partnership with: Paradigm Publishing Services
Publication frequency: 4 issues per year

© 2025 Xuhao Wang, Long Ma, published by Xi’an Technological University
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.