
Research on a Lightweight Small Object Detection Method Based on Lite-RFB Modules

By: Fei Wang and Liping Lu
Open Access | Dec 2025


I. Introduction

Object detection, as one of the core technologies in computer vision, plays a pivotal role in numerous real-world scenarios: in autonomous driving, it enables real-time recognition of vehicles, pedestrians, and traffic signs to ensure safe navigation; in video surveillance systems, it facilitates intelligent analysis and tracking of specific targets (e.g., faces, abnormal behaviors); in medical image analysis, it assists physicians in precisely locating pathological regions (e.g., tumors, cellular lesions), enhancing diagnostic efficiency. With the widespread adoption of artificial intelligence, high-precision, high-efficiency object detection has become a shared demand across industry and academia. Significant challenges remain, particularly in detecting small and densely packed objects within complex scenes.

Since the introduction of the SSD framework in 2016, technological advancements in small object detection have centered on two core challenges: effectively integrating multi-scale features and achieving efficient detection within limited computational resources. Early improvements like DSSD (2017) enhanced feature resolution by introducing deconvolution layers, boosting small object detection accuracy (APs) by 4.2% but at the cost of a 60% speed reduction. RFB-Net (2018) adopted a bio-inspired multi-branch receptive field design, raising APs to 21.3% on the VisDrone aerial drone dataset but simultaneously increasing parameters by 45%. MobileNet-SSD (2021) achieved real-time performance at 25 FPS through depthwise separable convolutions, yet its APs dropped back to 18.1%.

The proposed Lite-RFB SSD offers significant advantages over these methods. First, it lightens the traditional RFB module through depthwise separable convolutions and channel pruning, preserving the multi-scale receptive field advantage while reducing parameters by 62% and significantly improving deployment efficiency on edge devices. Second, it embeds the optimized RFB module into SSD's shallow feature layer (conv4_3), retaining the high-resolution features critical for small object detection while avoiding the computational burden of deeper layers. Finally, a dynamic branch weight adjustment mechanism enables the network to adaptively match targets of varying scales. The resulting model achieves an APs of 22.9% on the VOC dataset, a 4.2% improvement over the original SSD, while maintaining an inference speed of 28 FPS on a Jetson Nano, striking a superior balance between accuracy and efficiency.

II. Related Technology Overview
A. SSD Framework Overview

SSD (Single Shot MultiBox Detector), shown in Figure 1, achieves a high-efficiency balance between speed (59 FPS) and accuracy (74.3 mAP) by performing object classification and localization simultaneously in a single forward pass, using multi-scale feature map prediction and a preset default box mechanism.

Figure 1. SSD Network Architecture

SSD employs an enhanced VGG-16 network as its backbone feature extractor (truncated at the Conv5_3 layer), constructing a multi-scale detection framework through cascaded extra convolutional layers (Extra Feature Layers). Its architecture enables hierarchical prediction through feature maps at varying resolutions (e.g., Conv4_3 to Conv10_2). Each layer simultaneously generates class scores and bounding box offsets via 3×3 convolutions (output channels per location: 4×(Classes+4)). The network covers targets of varying sizes across input scales ranging from 160 to 1600 pixels using hierarchically designed default boxes: shallow layers with high-resolution features (e.g., the 38×38 Conv4_3) focus on small object detection, while deep layers with large receptive fields handle large targets. Non-Maximum Suppression (NMS) filters the 8,732 initial detections per class, achieving 74.3 mAP detection accuracy and 59 FPS real-time performance, demonstrating the efficiency-accuracy balance of single-stage detectors.
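As a quick arithmetic check, the 8,732 figure follows from the per-layer grid sizes and default boxes per location; the sketch below assumes the canonical SSD300 configuration, which the paper does not list explicitly.

```python
# Worked check of SSD300's 8,732 default boxes. The per-layer grid sizes
# and boxes-per-location counts are the canonical SSD300 settings,
# assumed here for illustration.
ssd300_layers = [
    ("conv4_3",  38, 4),
    ("conv7",    19, 6),
    ("conv8_2",  10, 6),
    ("conv9_2",   5, 6),
    ("conv10_2",  3, 4),
    ("conv11_2",  1, 4),
]

total = sum(side * side * boxes for _, side, boxes in ssd300_layers)
print(total)  # 8732 -- the detection count that NMS then filters
```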

The SSD framework achieves multi-scale object detection by constructing multi-level feature maps (conv4_3 to conv9_2). Its core lies in the scale generation formula (1) for default boxes:

$$S_k = S_{\min} + \frac{S_{\max} - S_{\min}}{m - 1}(k - 1), \quad k \in [1, m] \tag{1}$$

where $S_{\min} = 0.2$ and $S_{\max} = 0.9$ (VOC dataset), and m denotes the number of feature map layers (typically 6). The shallow feature conv4_3 (38×38) covers, with high resolution, the roughly 15% of objects in VOC that are small (<50×50 pixels), but its limited semantic expressiveness results in only 65.2% mAP for this category (VOC2007 test). The RFB module enhances the receptive field through multi-branch dilated convolutions:

$$F_{RFB} = \sigma\left(\sum_{i=1}^{3} \mathrm{Conv}_{3 \times 3}\left(F_{in}, d = i\right)\right) \tag{2}$$

where σ denotes the ReLU activation. This boosts small object mAP to 68.9% on VOC while increasing parameters by 40%.
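As a worked example, plugging the VOC settings into Equation (1) yields the default-box scales for all six layers; a minimal Python sketch:

```python
# Default-box scales from Equation (1) with the VOC settings quoted above.
def default_box_scales(m=6, s_min=0.2, s_max=0.9):
    return [s_min + (s_max - s_min) / (m - 1) * (k - 1) for k in range(1, m + 1)]

print([round(s, 2) for s in default_box_scales()])
# [0.2, 0.34, 0.48, 0.62, 0.76, 0.9] -- shallow layers get the small
# scales, matching conv4_3's role in small object detection
```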

B. RFB Module

The RFB (Receptive Field Block) network represents a substantial enhancement to the SSD (Single Shot MultiBox Detector) framework for object detection. Its core innovation lies in introducing a biomimetic receptive field module designed to overcome the limitations of standard convolutional layers in terms of receptive field size and diversity, thereby significantly improving the model's robustness to scale variations.

1) Backbone Network and Multi-Scale Feature Extraction

The model employs VGG-16 as its backbone network with critical modifications. Input images are standardized to 300×300 pixels. Rather than using only VGG-16's final output, the model extracts features from two key layers:

Shallow High-Resolution Features (Conv4-3): Located near the front end of the network, these features possess high spatial resolution (e.g., 38×38) and retain rich detail and edge information, which are crucial for detecting small objects.

Deep Semantic Features (Conv7): Composed of later convolutional layers in VGG-16 (e.g., Conv5-3) combined with additional dimension-reducing convolutional layers (Conv7-fe in the diagram). These feature maps have smaller dimensions (e.g., 19×19) but carry stronger semantic information at each location, benefiting large object recognition.

2) Cascading Receptive Field Blocks and Feature Enhancement

This cascade is key to enhancing model performance. The RFB module employs a multi-branch architecture in which each branch utilizes dilated convolutions with varying dilation rates. This design rapidly expands the receptive field without increasing parameters or reducing spatial resolution, thereby capturing broader contextual information.

RFB_a: Directly processes shallow features Conv4-3. Its design likely emphasizes smaller dilation rates, introducing appropriate context while preserving detail to optimize small object detection.

RFB1, RFB2, RFB3: Process deep features in a cascading manner. Deep features inherently possess large receptive fields; by combining different dilation rates, the RFB modules further simulate multi-scale receptive fields. This enables the model to perceive both local features and global context simultaneously, leading to more accurate localization and classification, especially for targets in complex scenes. A minimal sketch of such a multi-branch block follows below.
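The sketch below renders this multi-branch pattern in PyTorch. The channel widths, per-branch composition, and dilation set are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiBranchDilatedBlock(nn.Module):
    """Illustrative RFB-style block: parallel 3x3 convs with different
    dilation rates, concatenated and fused by a 1x1 conv. Channel widths
    and the dilation set {1, 2, 3} are assumptions for this sketch."""
    def __init__(self, channels, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels // 4, kernel_size=1),  # reduce width
                nn.Conv2d(channels // 4, channels // 4, kernel_size=3,
                          padding=d, dilation=d),  # dilated 3x3 at rate d
                nn.ReLU(inplace=True),
            )
            for d in dilations
        ])
        # 1x1 fusion back to the input width; a residual keeps the original detail
        self.fuse = nn.Conv2d(len(dilations) * (channels // 4), channels, 1)

    def forward(self, x):
        out = torch.cat([branch(x) for branch in self.branches], dim=1)
        return torch.relu(self.fuse(out) + x)

# Shapes are preserved, so the block can wrap conv4_3 features directly:
feats = torch.randn(1, 512, 38, 38)   # conv4_3-sized feature map
print(MultiBranchDilatedBlock(512)(feats).shape)  # torch.Size([1, 512, 38, 38])
```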

3) Feature Pyramid Construction and Multi-Scale Prediction

After feature enhancement by the RFB modules (Figure 2), the features are downsampled through a series of standard convolutional layers (BASIC-Conv1 to BASIC-Conv4), naturally constructing a feature pyramid. Ultimately, the model selects feature maps at multiple scales as input to the detection layers:

Figure 2. RFB Model Diagram

High-resolution feature maps: Derived from the Conv4-3 layer enhanced by RFB_a, used for detecting small objects.

Medium-resolution feature maps: e.g., outputs from RFB2 or BASIC-Conv1, used for detecting medium-sized objects.

Low-resolution feature maps: Output from BASIC-Conv3 or BASIC-Conv4, featuring the largest receptive field, for detecting large objects in images.

C. MobileNet-SSD

MobileNet is a lightweight deep neural network specifically designed for mobile and embedded devices. It employs depthwise separable convolutions to reduce computational load and enhance computational efficiency.
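A depthwise separable convolution replaces one dense k×k convolution with a per-channel k×k convolution followed by a 1×1 pointwise convolution. The sketch below compares parameter counts; the channel and kernel sizes are illustrative assumptions, not the paper's layers.

```python
import torch.nn as nn

def count_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

C_in, C_out, k = 128, 128, 3  # illustrative sizes

standard = nn.Conv2d(C_in, C_out, k, padding=1, bias=False)
depthwise_separable = nn.Sequential(
    # depthwise: one kxk filter per input channel (groups=C_in)
    nn.Conv2d(C_in, C_in, k, padding=1, groups=C_in, bias=False),
    # pointwise: 1x1 conv mixes information across channels
    nn.Conv2d(C_in, C_out, 1, bias=False),
)

print(count_params(standard))             # 147456 (= 128*128*3*3)
print(count_params(depthwise_separable))  # 17536  (= 128*3*3 + 128*128)
```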

Key features of MobileNet include:

Low computational complexity with significantly fewer parameters compared to models like VGG and ResNet.

Suitability for low-computational-power devices such as smartphones, embedded systems, and IoT devices.

Suitability for transfer learning and easy integration with other tasks like object detection and semantic segmentation.

MobileNet-SSD, shown in Figure 3, is designed for lightweight efficiency, integrating the MobileNet backbone with the SSD detection framework. Its processing flow is as follows: inputs undergo multi-stage convolutions (Conv0 to Conv11, etc.) to extract features. The MobileNet section drastically reduces computational load via depthwise separable convolutions (e.g., Conv1), generating feature maps at different resolutions (150×150×32 to 19×19×512). The SSD detection branch utilizes these multi-scale feature maps (e.g., 1×1×(Classes+4)) to predict object categories and bounding boxes via convolutions. Finally, Non-Maximum Suppression (NMS) filters the detection results, achieving both efficiency and accuracy in mobile object detection scenarios.

Figure 3. MobileNet-SSD Model Diagram

The main drawbacks of MobileNet-SSD are: weak detection capability for small objects, as the backbone network retains insufficient fine-grained features, leading to missed detections when identifying and locating small-sized targets; limited robustness in complex scenes, where insufficient feature discrimination results in high false positive rates for dense objects or complex backgrounds; a parameter count and computational load that, despite lightweight optimization, remain relatively high, resulting in suboptimal real-time performance on ultra-low-power embedded devices; and a network architecture with low flexibility, which complicates functional expansion and hardware adaptation.

D. Lite-RFB

The proposed Lite-RFB module reconstructs the traditional RFB through structured compression. Its design is characterized by three components:

1) Depthwise separable convolution decomposition
$$\mathrm{Params}_{\text{Lite-RFB}} = \underbrace{3 \times k^2 \times C}_{\text{Depthwise}} + \underbrace{3C \times C}_{\text{Pointwise}} + \underbrace{C \times C}_{\text{Fusion}} \tag{3}$$

where C denotes the number of channels and k the convolution kernel size (default k = 3).

2) Theoretical compression ratio calculation
$$\eta = 1 - \frac{3k^2 + 4C}{3Ck^2} \tag{4}$$

When C = 256, the theoretical compression ratio is η = 62.5% (parameter count reduced from 36.8M to 13.8M).
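As a sanity check on the factorization, the sketch below counts parameters for a three-branch block in standard and depthwise separable form, mirroring Equations (3) and (4). The layer shapes are assumptions for illustration; note that the per-module saving exceeds the 62.5% whole-model figure quoted above, since that figure (36.8M to 13.8M) also counts layers left uncompressed.

```python
import torch.nn as nn

def count_params(m) -> int:
    return sum(p.numel() for p in m.parameters())

C, k = 256, 3

# Baseline: three parallel standard kxk convolutions, one per branch.
standard_branches = nn.ModuleList(
    nn.Conv2d(C, C, k, padding=1, bias=False) for _ in range(3)
)

# Lite form per Equation (3): per-branch depthwise kxk + pointwise 1x1,
# followed by a shared 1x1 fusion convolution.
lite_branches = nn.ModuleList(
    nn.Sequential(
        nn.Conv2d(C, C, k, padding=1, groups=C, bias=False),  # depthwise
        nn.Conv2d(C, C, 1, bias=False),                       # pointwise
    )
    for _ in range(3)
)
fusion = nn.Conv2d(C, C, 1, bias=False)

p_std = count_params(standard_branches)
p_lite = count_params(lite_branches) + count_params(fusion)
print(p_std, p_lite, 1 - p_lite / p_std)
# 1769472 269056 ~0.85 per-module saving for these assumed shapes
```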

3) Dynamic adjustment of receptive fields

The size of each branch's receptive field is controlled by the dilation rate d:

$$RF_d = 1 + d \times (k - 1) \tag{5}$$

For d={1,2,3}, this corresponds to receptive fields of 7×7, 11×11, and 15×15 pixels, respectively.
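Equation (5) covers a single dilated convolution applied directly to the input; receptive fields of stacked stride-1 convolutions compose additively, each layer adding d×(k−1). The quoted 7×7, 11×11, and 15×15 per-branch totals are reproduced below under the assumption that each branch stacks one plain 3×3 convolution with two dilated 3×3 convolutions; the exact branch layout is an assumption of this sketch.

```python
def stacked_rf(layers):
    """Receptive field of a stride-1 conv stack: each (k, d) layer adds
    d * (k - 1) to the running receptive field, generalizing Eq. (5)."""
    rf = 1
    for k, d in layers:
        rf += d * (k - 1)
    return rf

for d in (1, 2, 3):
    # Assumed branch layout: plain 3x3, then two dilated 3x3 convs at rate d
    branch = [(3, 1), (3, d), (3, d)]
    print(d, stacked_rf(branch))  # 1 -> 7, 2 -> 11, 3 -> 15
```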

The proposed Lite-RFB module offers the following core advantages. It reconstructs the multi-branch receptive field structure with depthwise separable convolutions, achieving a theoretical parameter reduction of 62.5% while preserving multi-scale feature extraction (dilation rates d ∈ {1, 2, 3} correspond to 7×7 to 15×15 pixel receptive fields). Its dynamic branch weight mechanism effectively adapts to varying input scales. With parameters reduced to 13.8 million, it maintains detection accuracy close to the original SSD (74.1% mAP) while achieving 1.8 times faster inference, significantly outperforming comparable lightweight solutions.

Key improvements over SSD, RFB, and MobileNet-SSD include: addressing SSD's weakness in small object detection by optimizing feature fusion and receptive field design; further compressing parameters and computational load relative to RFB while preserving detection accuracy, to support lower-power devices; and, relative to MobileNet-SSD, strengthening fine-grained feature retention to reduce false positive and false negative rates in complex scenes, while improving network flexibility to minimize the cost of functional expansion and hardware adaptation. This design achieves the first synergistic optimization of accuracy and efficiency within the receptive field module, providing an innovative solution for real-time object detection in edge computing scenarios.
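Putting these pieces together, the following is a minimal PyTorch sketch of a Lite-RFB-style module: depthwise separable dilated branches, softmax-normalized learnable branch weights standing in for the dynamic adjustment mechanism, and a 1×1 fusion convolution. All widths, the weighting scheme, and the residual connection are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LiteRFB(nn.Module):
    """Sketch of a Lite-RFB-style module (assumed layout): three depthwise
    separable dilated branches with learnable branch weights, fused by a
    1x1 convolution."""
    def __init__(self, channels, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d,
                          groups=channels, bias=False),       # depthwise, rate d
                nn.Conv2d(channels, channels, 1, bias=False),  # pointwise
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        ])
        # One learnable logit per branch; softmax yields the dynamic weights
        self.branch_logits = nn.Parameter(torch.zeros(len(dilations)))
        self.fuse = nn.Conv2d(channels, channels, 1, bias=False)

    def forward(self, x):
        w = F.softmax(self.branch_logits, dim=0)
        out = sum(wi * branch(x) for wi, branch in zip(w, self.branches))
        return F.relu(self.fuse(out) + x)  # residual keeps conv4_3 detail

# Drop-in over a conv4_3-sized feature map:
feats = torch.randn(1, 512, 38, 38)
print(LiteRFB(512)(feats).shape)  # torch.Size([1, 512, 38, 38])
```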

III. Experiments and Results Analysis
A. Experimental Preparation

The dataset employed in this study is derived from the PASCAL VOC dataset, a benchmark collection widely used for object detection and image segmentation tasks in computer vision; the overall experimental workflow is shown in Figure 4.

Figure 4. Experimental Workflow Diagram

The PASCAL VOC data and associated challenges were initiated under the EU-funded PASCAL2 Network of Excellence, focused on Pattern Analysis, Statistical Modeling, and Computational Learning. The annual competitions ran from 2005 to 2012 and were discontinued after 2012. As a result, the PASCAL VOC dataset consists of a sequence of yearly releases from 2005 to 2012, denoted as VOC2005, VOC2006, and so forth. It should be noted that after VOC2007, the test sets were no longer made publicly available. This paper specifically uses the combined VOC2007 and VOC2012 dataset, which includes 16,551 training images and 4,952 test images spanning 20 object categories, as listed in Table 1. Among the annotated objects, approximately 15% are small targets with dimensions smaller than 50×50 pixels.

TABLE I.

Object Categories in the VOC

Vehicles    Household      Animals    Others
Aeroplane   Bottle         Bird       Person
Bicycle     Chair          Cat
Boat        Dining table   Cow
Bus         Potted plant   Dog
Car         Sofa           Horse
Motorbike   TV/Monitor     Sheep
Train
B. Experimental Environment

The hardware platform and software environment used in this experiment are shown in Table 2 below:

TABLE II.

Experimental Environment Configuration

Parameter                  Configuration
CPU                        Intel® Core™ i9-14900KF, 3.20 GHz
GPU                        NVIDIA GeForce RTX 5080
Memory                     256 GB
Graphics Memory            1.82 TB
System Environment         64-bit operating system
Experimental Platform      PyTorch 2.1, Python 3.8, PyCharm
Acceleration Environment   CUDA 12.1
C. Experimental Training

Through systematic comparative testing, this experiment demonstrates the complete processing pathways for three object detection models: after inputting 300×300 test images, the original SSD employs standard VGG16 feature extraction, SSD+RFB achieves feature enhancement through multi-branch dilated convolutions, and Lite-RFB utilizes depthwise separable convolutions for lightweight processing. All three converge at a unified multi-scale detection head, ultimately outputting quantified results including mAP, FPS, and memory consumption. The workflow visually reveals the architectural differences among the models: the RFB module enhances small object detection accuracy through feature fusion but increases computational overhead, while the lightweight design optimizes resource consumption while maintaining accuracy. This provides a visual basis for model selection across diverse application scenarios.

D. Experiments and Results Analysis

Experimental results are evaluated across three metrics: mAP (mean average precision), FPS (frames per second), and parameter count. The relevant calculation formulas are as follows:

Let the total number of classes be C, the number of correctly classified pixels in class i be $TP_i$ (true positives), and the total number of pixels in class i be $N_i$ (including both correctly and incorrectly classified pixels). The mPA formula is:

$$mPA = \frac{1}{C}\sum_{i=1}^{C} \frac{TP_i}{N_i} \tag{6}$$

Here, a higher mPA (%) indicates more balanced classification capability across all categories.
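As a toy worked example of Equation (6); the per-class counts below are illustrative only:

```python
def mean_per_class_accuracy(tp, n):
    """Equation (6): average the per-class ratios TP_i / N_i over C classes."""
    return sum(t / total for t, total in zip(tp, n)) / len(tp)

# Toy counts for C = 3 classes: per-class accuracies 0.9, 0.8, 0.7
print(mean_per_class_accuracy([90, 40, 70], [100, 50, 100]))  # ~0.8
```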

Let the total time required to process N frames be T (in seconds). In real-time monitoring, autonomous driving, and similar scenarios, a minimum of 30 FPS is typically required to ensure smooth video playback. The FPS calculation formula is:

$$FPS = \frac{N}{T} \tag{7}$$
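In practice, Equation (7) is measured by timing a fixed number of forward passes. The sketch below assumes a PyTorch model on a CUDA device; the warm-up count and batch shape are illustrative choices, and synchronization ensures queued GPU work is included in T.

```python
import time
import torch

def measure_fps(model, n_frames=100, size=(1, 3, 300, 300), device="cuda"):
    """FPS = N / T per Equation (7): time N forward passes and divide."""
    model = model.to(device).eval()
    x = torch.randn(size, device=device)
    with torch.no_grad():
        for _ in range(10):           # warm-up so lazy CUDA init is not timed
            model(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_frames):
            model(x)
        torch.cuda.synchronize()      # wait for queued kernels before stopping
        t = time.perf_counter() - start
    return n_frames / t

# Usage (hypothetical model name): print(measure_fps(lite_rfb_ssd_model))
```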

The experimental results clearly demonstrate the performance trade-offs resulting from different optimization approaches, providing clear guidance for model selection (as shown in Figure 5).

Figure 5. Experimental Results Comparison

1) Original SSD Model: Efficiency Benchmark

The original SSD model excels in inference speed, achieving approximately 46 FPS and demonstrating outstanding real-time processing capability. In terms of model complexity, its parameter count of 26.3 million is moderate among the three models; this relatively compact structure underpins its high speed. However, its detection accuracy of approximately 74.3% mAP trails the RFB-enhanced variant, reflecting the limitations of the base model's feature extraction capabilities.

2) SSD+RFB Model: Precision-First Approach

Introducing the RFB (Receptive Field Block) module into the original SSD significantly enhances detection accuracy: SSD+RFB achieves the highest mAP among the three models at 78.1%. By simulating the receptive field mechanism of human vision, the RFB module effectively strengthens the model's ability to capture features of multi-scale objects, substantially improving recognition accuracy. However, this accuracy gain comes at the cost of speed, with FPS dropping to approximately 35, while the parameter count rises to 28.7 million, the highest among the three models, indicating that the increased model complexity imposes greater computational overhead. This model is well suited to scenarios demanding extremely high detection accuracy with abundant computational resources, such as high-precision industrial vision inspection.

3) Lite-RFB Model: Balancing Performance and Efficiency

The Lite-RFB model strives for an optimal balance between accuracy, speed, and model size. Its mAP is approximately 72.8%, slightly lower than SSD+RFB while remaining close to the original SSD, achieving reasonable accuracy retention. More importantly, it reaches 58 FPS, the fastest inference speed of the three, while maintaining a lightweight model size of only 18.9 million parameters. This efficiency stems from lightweight designs like depthwise separable convolutions, which substantially reduce computational load and parameter count while maintaining robust performance. Consequently, this model is ideal for resource-constrained scenarios, such as mobile deployments and edge computing devices, where high-efficiency real-time processing is required without compromising detection effectiveness.

After a comprehensive training regimen, the proposed Lite-RFB SSD algorithm was subjected to rigorous evaluation. The test set data, comprising unseen images from the PASCAL VOC dataset, were fed into the trained network model to generate the detection results shown in Figure 6. As illustrated, the model successfully identified and localized multiple objects within complex scenes, such as distinguishing between a 'sofa' and a 'chair' on a modern architectural terrace, accurately detecting a 'train' against a natural backdrop, and precisely delineating a 'person' and a 'dog' in a close-up interaction. These visual outcomes demonstrate the model's robust classification capabilities and validate its precision in bounding box regression, confirming the practical efficacy of the Lite-RFB architecture in real-world object detection tasks.

Figure 6. Test Set Detection Results

IV. Conclusions

This paper presented Lite-RFB SSD, a lightweight object detection framework optimized for small targets. By integrating a depthwise-separable and pruned RFB module into the shallow layer of SSD, we enhanced semantic representation and receptive field alignment for small objects. The model achieves a 4.2% improvement in small object AP on the VOC dataset while reducing parameters by 62% and maintaining a real-time inference speed of 28 FPS on embedded hardware. Compared to existing approaches, Lite-RFB SSD offers a superior balance between accuracy and efficiency, making it suitable for deployment in edge computing, mobile vision, and real-time surveillance systems. While the current research has achieved notable success, future work will explore dynamic receptive field optimization and enhanced adaptation to extreme conditions. The modular design provides a solid foundation for further developments in edge computing applications.

This study establishes a new benchmark for lightweight object detection systems, offering both theoretical insights and practical solutions for real-world applications in resource-constrained environments.

Language: English
Page range: 94-103
Published on: Dec 31, 2025

© 2025 Fei Wang, Liping Lu, published by Xi’an Technological University
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.